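Symptom:
Containers created in K8S can sometimes ping a domain name and sometimes cannot.
Environment:
K8S was deployed automatically with kubeasz. Name resolution is handled by coredns, which runs two instances spread across two different Worker nodes.
Investigation steps: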
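1. Manually start a centos7 container.
2. ping www.baidu.com: fails.
3. Add a DNS server outside the K8S cluster to /etc/resolv.conf in the container:
nameserver 10.128.142.149 # DNS server outside the cluster
nameserver 172.20.0.2 # address of the in-cluster coredns service
4. ping www.baidu.com: succeeds.
5. Install the dig command inside the container:
# yum install bind-utils
# dig www.baidu.com (the SERVER line shows that the answer came from the DNS server outside the cluster)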
[root@centos-bdf5cff5b-7jzt9 /]# dig www.baidu.com
; <<>> DiG 9.11.4-P2-RedHat-9.11.4-9.P2.el7 <<>> www.baidu.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 2694
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.baidu.com. IN A
;; ANSWER SECTION:
www.baidu.com. 1102 IN CNAME www.a.shifen.com.
www.a.shifen.com. 207 IN A 112.80.248.76
www.a.shifen.com. 207 IN A 112.80.248.75
;; Query time: 0 msec
;; SERVER: 10.128.142.149#53(10.128.142.149)
;; WHEN: Thu Dec 26 02:16:04 UTC 2019
;; MSG SIZE rcvd: 104
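6. Remove the external DNS server from /etc/resolv.conf and run dig again. To avoid cached results I used a different domain name. The in-cluster DNS returned the correct address, and I could also ping the qq domain.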
[root@centos-bdf5cff5b-7jzt9 /]# dig www.qq.com
; <<>> DiG 9.11.4-P2-RedHat-9.11.4-9.P2.el7 <<>> www.qq.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 59772
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.qq.com. IN A
;; ANSWER SECTION:
www.qq.com. 30 IN CNAME public.sparta.mig.tencent-cloud.net.
public.sparta.mig.tencent-cloud.net. 30 IN A 157.255.192.44
public.sparta.mig.tencent-cloud.net. 30 IN A 61.241.44.148
;; Query time: 1 msec
;; SERVER: 172.20.0.2#53(172.20.0.2)
;; WHEN: Thu Dec 26 02:16:42 UTC 2019
;; MSG SIZE rcvd: 200
[root@centos-bdf5cff5b-7jzt9 /]# ping www.qq.com
PING public.sparta.mig.tencent-cloud.net (157.255.192.44) 56(84) bytes of data.
64 bytes from 157.255.192.44 (157.255.192.44): icmp_seq=1 ttl=49 time=33.0 ms
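Since the service address 172.20.0.2 is backed by two coredns instances, one successful dig does not show that every instance answers correctly. A minimal sketch for repeating the query and counting failures follows. The function name probe_dns and the DIG override variable are my own additions for illustration, not part of the original investigation, and it assumes dig exits non-zero when no server replies.

```shell
#!/bin/sh
# probe_dns SERVER NAME [COUNT]
# Sends COUNT queries (default 10) for NAME to SERVER and prints "ok=N fail=M".
# DIG can be set to another command for dry runs; it defaults to the real dig.
probe_dns() {
    server=$1
    name=$2
    count=${3:-10}
    ok=0
    fail=0
    i=0
    while [ "$i" -lt "$count" ]; do
        # +time=1 +tries=1 keeps each failed attempt to about one second
        if ${DIG:-dig} +short +time=1 +tries=1 "@$server" "$name" >/dev/null 2>&1; then
            ok=$((ok + 1))
        else
            fail=$((fail + 1))
        fi
        i=$((i + 1))
    done
    echo "ok=$ok fail=$fail"
}
```

For example, probe_dns 172.20.0.2 www.qq.com 20 run inside the container gives a rough idea of how often queries through the service address fail.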
7. dig yet another domain name, www.163.com. This time it fails.
[root@centos-bdf5cff5b-7jzt9 /]# dig www.163.com
;; reply from unexpected source: 172.200.3.93#53, expected 172.20.0.2#53
;; reply from unexpected source: 172.200.3.93#53, expected 172.20.0.2#53
;; reply from unexpected source: 172.200.3.93#53, expected 172.20.0.2#53
; <<>> DiG 9.11.4-P2-RedHat-9.11.4-9.P2.el7 <<>> www.163.com
;; global options: +cmd
;; connection timed out; no servers could be reached
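The error means the reply arrived from 172.200.3.93 instead of the service address 172.20.0.2 that the query was sent to, so dig discarded it. A small sketch for pulling those address pairs out of saved dig output is below. The function name unexpected_sources is hypothetical, and the kubectl command in the comment assumes coredns runs in the kube-system namespace.

```shell
#!/bin/sh
# unexpected_sources
# Reads dig output on stdin and prints each distinct
# "actual-source expected-source" pair, one per line.
unexpected_sources() {
    sed -n 's/^;; reply from unexpected source: \([^#]*\)#[0-9]*, expected \([^#]*\)#[0-9]*$/\1 \2/p' | sort -u
}

# To see which pod owns the actual source address (assumed namespace):
#   kubectl get pods -n kube-system -o wide | grep 172.200.3.93
```

Usage: dig www.163.com 2>&1 | unexpected_sources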
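8. Searching for this error turned up the same question on stackoverflow, with the following fix:
Ubuntu: on the worker node, check whether the br_netfilter module is loaded; if not, run modprobe br_netfilter
CentOS: check whether /proc/sys/net/bridge/bridge-nf-call-iptables is 1; if not: echo '1' > /proc/sys/net/bridge/bridge-nf-call-iptables
9. In my case, one of the nodes running coredns had /proc/sys/net/bridge/bridge-nf-call-iptables set to a value other than 1. This is consistent with the intermittent symptom, since only queries that reached the coredns instance on that node failed. After changing the value, dig succeeded again and the problem was solved.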
[root@centos-bdf5cff5b-7jzt9 /]# dig www.163.com
; <<>> DiG 9.11.4-P2-RedHat-9.11.4-9.P2.el7 <<>> www.163.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 28746
;; flags: qr rd ra; QUERY: 1, ANSWER: 6, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.163.com. IN A
;; ANSWER SECTION:
www.163.com. 30 IN CNAME www.163.com.163jiasu.com.
www.163.com.163jiasu.com. 30 IN CNAME www.163.com.bsgslb.cn.
www.163.com.bsgslb.cn. 30 IN CNAME z163ipv6.v.bsgslb.cn.
z163ipv6.v.bsgslb.cn. 30 IN A 58.16.59.134
z163ipv6.v.bsgslb.cn. 30 IN A 58.16.59.131
z163ipv6.v.bsgslb.cn. 30 IN A 58.16.59.137
;; Query time: 1 msec
;; SERVER: 172.20.0.2#53(172.20.0.2)
;; WHEN: Thu Dec 26 02:22:34 UTC 2019
;; MSG SIZE rcvd: 311
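The check from steps 8 and 9 can be wrapped in a small function and run on every Worker node. This is only a sketch: the function name check_bridge_nf is my own, and the file names in the comments under /etc/modules-load.d and /etc/sysctl.d are illustrative choices for a systemd-based distribution. Note that a value written with echo into /proc does not survive a reboot, which is why the comments show one way to persist it.

```shell
#!/bin/sh
# check_bridge_nf [FILE]
# Prints "ok" if FILE contains 1, "disabled" for any other value,
# and "missing" if FILE cannot be read (br_netfilter is probably not loaded).
check_bridge_nf() {
    f=${1:-/proc/sys/net/bridge/bridge-nf-call-iptables}
    if [ ! -r "$f" ]; then
        echo "missing"
        return 2
    fi
    if [ "$(cat "$f")" = "1" ]; then
        echo "ok"
        return 0
    fi
    echo "disabled"
    return 1
}

# Possible fix on an affected node (run as root):
#   modprobe br_netfilter
#   sysctl -w net.bridge.bridge-nf-call-iptables=1
# Possible way to keep it across reboots (file names are assumptions):
#   echo br_netfilter > /etc/modules-load.d/br_netfilter.conf
#   echo 'net.bridge.bridge-nf-call-iptables = 1' > /etc/sysctl.d/99-bridge-nf.conf
#   sysctl --system
```

Calling check_bridge_nf with no argument on a node inspects the real /proc file; passing a path is mainly useful for trying the function out on a copy.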
References:
https://github.com/kubernetes/kubernetes/issues/21613#issuecomment-343190401