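Containers in K8S cannot resolve domain names

Symptom:

Containers created in K8S can sometimes ping a domain name and sometimes cannot.

Environment:

K8S was deployed automatically with kubeasz. Domain name resolution is handled by coredns, which runs two instances scheduled onto two different worker nodes.

Investigation steps:

1. Manually start a centos7 container

2. ping www.baidu.com: fails

3. Add a DNS server outside the K8S cluster to /etc/resolv.conf in the container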

nameserver 10.128.142.149 # DNS server outside the cluster
nameserver 172.20.0.2  # address of the coredns Service inside the cluster
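
The resolver tries the nameserver entries in the order they are listed, which is why the external server answers first in the next steps. As a small sketch (the function name list_nameservers is my own, not part of any tool), the effective order can be printed like this:

```shell
# Print the nameserver addresses from a resolv.conf-style file, in the
# order the resolver will try them. list_nameservers is a hypothetical
# helper; the file path is a parameter so it can be tried on a copy.
list_nameservers() {
    file="${1:-/etc/resolv.conf}"
    # Keep only lines whose first field is "nameserver"; print the address.
    awk '$1 == "nameserver" { print $2 }' "$file"
}
```

Running list_nameservers inside the container, with no argument, reads /etc/resolv.conf.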

4. ping www.baidu.com: succeeds

5. Install the dig command inside the container

#yum install bind-utils

#dig www.baidu.com shows that the answer was returned by the DNS server outside the cluster

[root@centos-bdf5cff5b-7jzt9 /]# dig www.baidu.com 

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-9.P2.el7 <<>> www.baidu.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 2694
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.baidu.com.                 IN      A

;; ANSWER SECTION:
www.baidu.com.          1102    IN      CNAME   www.a.shifen.com.
www.a.shifen.com.       207     IN      A       112.80.248.76
www.a.shifen.com.       207     IN      A       112.80.248.75

;; Query time: 0 msec
;; SERVER: 10.128.142.149#53(10.128.142.149)
;; WHEN: Thu Dec 26 02:16:04 UTC 2019
;; MSG SIZE  rcvd: 104

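6. Remove the external DNS server from /etc/resolv.conf and run dig again. To avoid cached results I queried a different domain name. The DNS inside the cluster returned the correct address, and I could also ping the qq domain.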

[root@centos-bdf5cff5b-7jzt9 /]# dig www.qq.com 

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-9.P2.el7 <<>> www.qq.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 59772
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.qq.com.                    IN      A

;; ANSWER SECTION:
www.qq.com.             30      IN      CNAME   public.sparta.mig.tencent-cloud.net.
public.sparta.mig.tencent-cloud.net. 30 IN A    157.255.192.44
public.sparta.mig.tencent-cloud.net. 30 IN A    61.241.44.148

;; Query time: 1 msec
;; SERVER: 172.20.0.2#53(172.20.0.2)
;; WHEN: Thu Dec 26 02:16:42 UTC 2019
;; MSG SIZE  rcvd: 200

[root@centos-bdf5cff5b-7jzt9 /]# ping www.qq.com
PING public.sparta.mig.tencent-cloud.net (157.255.192.44) 56(84) bytes of data.
64 bytes from 157.255.192.44 (157.255.192.44): icmp_seq=1 ttl=49 time=33.0 ms

7. dig yet another domain name, www.163.com. This time it fails.

[root@centos-bdf5cff5b-7jzt9 /]# dig www.163.com
;; reply from unexpected source: 172.200.3.93#53, expected 172.20.0.2#53
;; reply from unexpected source: 172.200.3.93#53, expected 172.20.0.2#53
;; reply from unexpected source: 172.200.3.93#53, expected 172.20.0.2#53

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-9.P2.el7 <<>> www.163.com
;; global options: +cmd
;; connection timed out; no servers could be reached
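
The key clue in this output is the address the reply actually came from. A minimal sketch for pulling that address out of saved dig output follows; the helper name unexpected_sources is my own, and the assumption that the address belongs to a coredns Pod still has to be checked against the Pod list (for example with kubectl get pods -n kube-system -o wide).

```shell
# Read dig output on stdin and print each distinct address that dig
# reported as an "unexpected source". unexpected_sources is a
# hypothetical helper name, not a dig feature.
unexpected_sources() {
    sed -n 's/^;; reply from unexpected source: \([0-9.]*\)#[0-9]*, expected .*/\1/p' | sort -u
}

# Example (inside the container):
#   dig www.163.com 2>&1 | unexpected_sources
```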

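8. I searched for this error online and found that someone had asked the same question on stackoverflow. The fix is as follows:

Ubuntu: on the worker node, check whether the br_netfilter module is loaded. If it is not, run modprobe br_netfilter

CentOS: check whether the value of /proc/sys/net/bridge/bridge-nf-call-iptables is 1. If it is not: echo '1' > /proc/sys/net/bridge/bridge-nf-call-iptables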
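
The two checks above can be combined into one small script to run on each worker node. This is only a sketch: the function names check_bridge_nf and fix_bridge_nf are my own, and treating a missing proc file as "module not loaded" is an assumption.

```shell
# Report the state of bridge-nf-call-iptables on this node.
# The proc path is a parameter so the check can be tried on a test file.
check_bridge_nf() {
    path="${1:-/proc/sys/net/bridge/bridge-nf-call-iptables}"
    if [ ! -r "$path" ]; then
        echo "missing"      # assumption: br_netfilter is not loaded
    elif [ "$(cat "$path")" = "1" ]; then
        echo "ok"
    else
        echo "disabled"
    fi
}

# Apply the fix from step 8 when needed (requires root on the worker node).
fix_bridge_nf() {
    case "$(check_bridge_nf)" in
        missing)  modprobe br_netfilter && sysctl -w net.bridge.bridge-nf-call-iptables=1 ;;
        disabled) sysctl -w net.bridge.bridge-nf-call-iptables=1 ;;
    esac
}
```

Note that a value written this way does not survive a reboot. On systemd-based distributions it can usually be made persistent by listing br_netfilter in a file under /etc/modules-load.d/ and setting net.bridge.bridge-nf-call-iptables = 1 in a file under /etc/sysctl.d/; I did not verify this as part of the original investigation.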

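9. In my case, one of the nodes running coredns had /proc/sys/net/bridge/bridge-nf-call-iptables set to a value other than 1. After changing it to 1, dig succeeded again and the problem was solved. This most likely also explains why the failure was intermittent: with the setting off, traffic crossing that node's bridge bypasses iptables, so a reply from the coredns instance there can arrive with its own address (presumably the Pod IP 172.200.3.93 seen in step 7) as the source instead of the Service address 172.20.0.2, while queries answered by the other instance work normally.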

[root@centos-bdf5cff5b-7jzt9 /]# dig www.163.com

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-9.P2.el7 <<>> www.163.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 28746
;; flags: qr rd ra; QUERY: 1, ANSWER: 6, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.163.com.                   IN      A

;; ANSWER SECTION:
www.163.com.            30      IN      CNAME   www.163.com.163jiasu.com.
www.163.com.163jiasu.com. 30    IN      CNAME   www.163.com.bsgslb.cn.
www.163.com.bsgslb.cn.  30      IN      CNAME   z163ipv6.v.bsgslb.cn.
z163ipv6.v.bsgslb.cn.   30      IN      A       58.16.59.134
z163ipv6.v.bsgslb.cn.   30      IN      A       58.16.59.131
z163ipv6.v.bsgslb.cn.   30      IN      A       58.16.59.137

;; Query time: 1 msec
;; SERVER: 172.20.0.2#53(172.20.0.2)
;; WHEN: Thu Dec 26 02:22:34 UTC 2019
;; MSG SIZE  rcvd: 311
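
Because queries to the Service address are spread across both coredns instances, a single successful dig does not prove the fix. A rough sketch for repeating the query and counting failures is shown below; count_failures and the DIG override variable are hypothetical names of my own, and the sketch relies on dig exiting with a non-zero status when no server replies.

```shell
# Query NAME against SERVER several times and print how many attempts failed.
# count_failures and the DIG variable are hypothetical; DIG only exists so
# the command can be swapped out when trying the function.
count_failures() {
    name="$1"
    server="$2"
    tries="${3:-10}"
    failures=0
    i=0
    while [ "$i" -lt "$tries" ]; do
        # +time=2 +tries=1 keeps each attempt short: one try, 2 second timeout.
        "${DIG:-dig}" +time=2 +tries=1 "@$server" "$name" >/dev/null 2>&1 \
            || failures=$((failures + 1))
        i=$((i + 1))
    done
    echo "$failures"
}

# Example (inside the container):
#   count_failures www.163.com 172.20.0.2 20
```

A result of 0 over a few dozen attempts suggests that both instances are now answering correctly.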

References:

https://github.com/kubernetes/kubernetes/issues/21613#issuecomment-343190401
