Troubleshooting partially unreachable cluster IPs in a production Kubernetes cluster

Problem: some cluster IPs are reachable and some are not
# kubectl get ep -A
monitoring             alertmanager-main                    10.244.107.233:8080,10.244.169.154:8080,10.244.169.157:8080 + 3 more...   6d7h
monitoring             alertmanager-operated                10.244.107.233:9094,10.244.169.154:9094,10.244.169.157:9094 + 6 more...   6d7h
monitoring             blackbox-exporter                    10.244.169.151:9115,10.244.169.151:19115                                  6d7h
monitoring             grafana                              10.244.122.99:3000                                                        6d7h
# curl 10.244.122.99:3000
curl: (7) Failed connect to 10.244.122.99:3000; Connection timed out


Investigation:
A native Kubernetes Service currently supports four ServiceType values:
-> ClusterIP: the k8s default. Publishes the service inside the cluster via a cluster-internal ClusterIP.
-> NodePort: commonly used to expose a Service outside the cluster. From outside, you can reach the Service's backend Endpoints via any node's IP:NodePort.
-> LoadBalancer: also exposes the service externally, but requires Cloud Provider support, e.g. AWS.
-> ExternalName: also publishes a service inside the cluster; it requires KubeDNS (version >= 1.7), which maps the service to the ExternalName and returns a CNAME record.
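
To confirm which type the affected service uses, query its spec directly (a quick sketch; the grafana service and monitoring namespace come from the endpoint listing above):
# kubectl get svc grafana -n monitoring -o jsonpath='{.spec.type} {.spec.clusterIP}{"\n"}'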


 kube-proxy has several modes. Query kube-proxy's own endpoint to find out which one is in use (kube-proxy is managed by systemd here):
 curl localhost:10249/proxyMode                  # normally returns iptables or ipvs; here it is iptables
 Check the kube-proxy configuration parameters to confirm the mode, then verify the forwarding rules are correct:
√ userspace :
    iptables-save | grep ${servicename}
    chains: KUBE-PORTALS-CONTAINER, KUBE-PORTALS-HOST
√ iptables :
    iptables-save | grep ${servicename}
    chains: KUBE-SVC-XXX, KUBE-SEP-XXX
√ ipvs:
    ipvsadm -ln
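
Had the mode been ipvs, the listing can also be narrowed to a single virtual service instead of dumping all rules (a sketch; 10.0.0.40:3000 is the grafana cluster IP and port seen in the rules below):
# ipvsadm -ln -t 10.0.0.40:3000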


Since the query above returned iptables, inspect the rules as follows; nothing looks wrong:
# iptables-save | grep grafana
-A KUBE-NODEPORTS -p tcp -m comment --comment "monitoring/grafana:http" -m tcp --dport 31100 -j KUBE-SVC-AWA2CQSXVI7X2GE5
-A KUBE-SEP-ZM6O6UW6O2KZW4VR -s 10.244.122.99/32 -m comment --comment "monitoring/grafana:http" -j KUBE-MARK-MASQ
-A KUBE-SEP-ZM6O6UW6O2KZW4VR -p tcp -m comment --comment "monitoring/grafana:http" -m tcp -j DNAT --to-destination 10.244.122.99:3000
-A KUBE-SERVICES -d 10.0.0.40/32 -p tcp -m comment --comment "monitoring/grafana:http cluster IP" -m tcp --dport 3000 -j KUBE-SVC-AWA2CQSXVI7X2GE5
-A KUBE-SVC-AWA2CQSXVI7X2GE5 ! -s 10.244.0.0/16 -d 10.0.0.40/32 -p tcp -m comment --comment "monitoring/grafana:http cluster IP" -m tcp --dport 3000 -j KUBE-MARK-MASQ
-A KUBE-SVC-AWA2CQSXVI7X2GE5 -p tcp -m comment --comment "monitoring/grafana:http" -m tcp --dport 31100 -j KUBE-MARK-MASQ
-A KUBE-SVC-AWA2CQSXVI7X2GE5 -m comment --comment "monitoring/grafana:http" -j KUBE-SEP-ZM6O6UW6O2KZW4VR
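
Reading the chains top-down: KUBE-SERVICES matches traffic to the cluster IP 10.0.0.40:3000 and jumps to the per-service chain KUBE-SVC-AWA2CQSXVI7X2GE5, which jumps to the per-endpoint chain KUBE-SEP-ZM6O6UW6O2KZW4VR, whose DNAT rewrites the destination to the pod 10.244.122.99:3000. As a cross-check that the DNAT target still matches the live endpoint (a sketch using the objects above):
# kubectl get ep grafana -n monitoring -o jsonpath='{.subsets[*].addresses[*].ip}{"\n"}'
Since the NAT rules are intact, the break is more likely below iptables, in the pod network (CNI) layer, which is why the next step looks at calico.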
   
   
Set up a few aliases first:
alias kgog='kubectl get pods -owide -A|grep'
alias kdsp='kubectl describe pod'
alias kgsg='kubectl get service -A|grep'


Running the following shows that calico on host 192.168.1.75 has a problem; READY 0/1 means the pod is likely unhealthy:
# kgog calico
kube-system            calico-kube-controllers-8db96c76-lpc7m      1/1     Running   2 (18d ago)     88d     10.244.135.201   k8s-master3              
kube-system            calico-node-mggw6                           0/1     Running   0               44s     192.168.1.75     k8s-node1                


Check the calico pod's events:
# kdsp -nkube-system  calico-node-mggw6
  Normal   Pulled     66s                kubelet            Container image "calico/node:v3.15.1" already present on machine
  Warning  Unhealthy  64s (x2 over 65s)  kubelet            Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused
  Warning  Unhealthy  60s                kubelet            Readiness probe failed: 2022-10-17 14:08:13.585 [INFO][198] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 192.168.1.71,192.168.1.72,192.168.1.73,192.168.1.76,192.168.1.77,192.168.1.78
  Warning  Unhealthy  50s  kubelet  Readiness probe failed: 2022-10-17 14:08:23.589 [INFO][225] confd/health.go 180: Number of node(s) with BGP peering established = 0
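
The events say BIRD, calico's BGP daemon, has no established BGP session with any of the other six nodes. If calicoctl is installed on 192.168.1.75, the node's own BGP view confirms the same picture (a sketch):
# calicoctl node status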


Remove the stray bridge interface on host 192.168.1.75, i.e. delete the br-xxx device, then restart that node's calico pod; any interfaces calico needs will be recreated automatically. The leftover bridge was most likely being picked up by calico's IP autodetection, leaving BIRD trying to peer from the wrong address:
# ip link delete br-8813b01cb9a3
# kubectl delete pod  calico-node-mggw6 -n kube-system
# kgog calico
kube-system            calico-kube-controllers-8db96c76-lpc7m      1/1     Running   2 (18d ago)     88d     10.244.135.201   k8s-master3              
kube-system            calico-node-4vj7f                           1/1     Running   1 (21d ago)     88d     192.168.1.73     k8s-master3              
kube-system            calico-node-55xnq                           1/1     Running   2               88d     192.168.1.78     k8s-node4                
kube-system            calico-node-drswp                           1/1     Running   0               2m36s   192.168.1.75     k8s-node1                
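
To keep this from recurring, calico's IP autodetection can be pinned so that stray bridges are ignored (a sketch, assuming the standard calico-node DaemonSet and eth-style NIC names; adjust the pattern to your environment):
# kubectl set env daemonset/calico-node -n kube-system IP_AUTODETECTION_METHOD='interface=eth.*'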


# kgsg grafana
monitoring             grafana                              NodePort    10.0.0.40            3000:31100/TCP                  6d19h


curl again, and now the cluster IP is reachable; the NodePort works as well:
# curl -s -w "%{http_code}\n" -o /dev/null 10.0.0.40:3000
302
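
The NodePort path can be verified the same way (a sketch; 192.168.1.75 is the node just fixed and 31100 is the NodePort from the service listing above):
# curl -s -w "%{http_code}\n" -o /dev/null 192.168.1.75:31100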

