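In Kubernetes, when a service application fails, we usually start by checking it from inside the cluster.
The procedure is as follows:
1. Look up the Service, Endpoints, and Pod information for the application: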
kubectl get svc,ep,po -o wide -n grafana
NAME              TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE     SELECTOR
service/grafana   ClusterIP   10.68.247.194   <none>        80/TCP    5h43m   app=grafana,release=grafana

NAME                ENDPOINTS           AGE
endpoints/grafana   172.20.1.217:3000   5h43m

NAME                           READY   STATUS    RESTARTS   AGE     IP             NODE         NOMINATED NODE   READINESS GATES
pod/grafana-5c4c844fc6-dpcs8   1/1     Running   0          5h43m   172.20.1.217   10.2.2.121   <none>           <none>
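To avoid reading the Pod IP and node off the table by eye, the lookup in step 1 can be sketched as a small filter. The function name pod_ip_and_node is hypothetical, and the sketch assumes the column layout shown above (single-token RESTARTS and AGE values).

```shell
# Print "<name> <pod-ip> <node>" for each pod row read from stdin.
# Works on `kubectl get po -o wide` output, or on the combined svc,ep,po output:
# tables whose header has no IP/NODE columns are skipped.
pod_ip_and_node() {
  awk '
    $1 == "NAME" {                      # header row: locate the IP and NODE columns
      ip = 0; node = 0
      for (i = 1; i <= NF; i++) {
        if ($i == "IP") ip = i
        if ($i == "NODE" && !node) node = i   # first NODE, not "NOMINATED NODE"
      }
      next
    }
    ip && node && NF >= node { print $1, $ip, $node }
  '
}
```

Usage: kubectl get po -o wide -n grafana | pod_ip_and_node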
2. Log in to a node other than the one running the Pod:
kubectl get node
NAME         STATUS                        ROLES                  AGE    VERSION
10.2.2.120   Ready,SchedulingDisabled      master                 260d   v1.15.5
10.2.2.121   Ready                         metallb-speaker,node   260d   v1.15.5
10.2.2.122   Ready                         metallb-speaker,node   260d   v1.15.5
10.2.2.123   NotReady,SchedulingDisabled   node                   127d   v1.15.5
As shown above, the grafana Pod runs on node 10.2.2.121, so we log in to a different node, here 10.2.2.122:
ssh <user>@10.2.2.122
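Choosing the test node in step 2 can also be scripted. The following is a minimal sketch; the function name other_ready_nodes is hypothetical, and it assumes the five-column `kubectl get node` layout shown above.

```shell
# Print every node that is Ready, schedulable, and not the node passed as $1.
# Reads `kubectl get node` output (NAME STATUS ROLES AGE VERSION) from stdin.
# A STATUS of "Ready,SchedulingDisabled" or "NotReady" does not equal "Ready",
# so cordoned and unhealthy nodes are filtered out.
other_ready_nodes() {
  awk -v skip="$1" 'NR > 1 && $2 == "Ready" && $1 != skip { print $1 }'
}
```

Usage: kubectl get node | other_ready_nodes 10.2.2.121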
3. Ping check:
3.1. The Pod IP should respond to ping:
# ping 172.20.1.217
PING 172.20.1.217 (172.20.1.217) 56(84) bytes of data.
64 bytes from 172.20.1.217: icmp_seq=1 ttl=63 time=0.893 ms
64 bytes from 172.20.1.217: icmp_seq=2 ttl=63 time=0.616 ms
^C
--- 172.20.1.217 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.616/0.754/0.893/0.141 ms
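3.2. The Service IP usually cannot be pinged (with kube-proxy in iptables mode, the ClusterIP is a virtual IP that only matches the Service's protocol and port), so there is no need to test it.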
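If the ping check in step 3 needs to run unattended, the loss figure can be pulled out of the summary line. This is a sketch only; the function name packet_loss is hypothetical, and it assumes the iputils-style summary line ("... 0% packet loss ...").

```shell
# Print the packet-loss percentage (without the % sign) from ping output on stdin.
# Prints nothing if no summary line is present.
packet_loss() {
  sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}
```

Usage: ping -c 2 172.20.1.217 | packet_loss
A result of 0 means every reply came back; 100 means the Pod IP is unreachable from this node.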
4. Telnet check:
Use the Pod IP to check that the port is reachable:
# telnet 172.20.1.217 3000
Trying 172.20.1.217...
Connected to 172.20.1.217.
Escape character is '^]'.
Use the Service IP to check that the port is reachable:
# telnet 10.68.247.194 80
Trying 10.68.247.194...
Connected to 10.68.247.194.
Escape character is '^]'.
When nothing is wrong, both checks succeed. If telnet cannot connect, there is a network problem that needs to be tracked down.
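On nodes where telnet is not installed, or when the check has to exit on its own, the same port test can be sketched with bash's built-in /dev/tcp redirection. The function name port_open and the 3-second default timeout are assumptions for illustration; the sketch needs bash and the coreutils timeout command.

```shell
# Return 0 if a TCP connection to host $1, port $2 succeeds
# within $3 seconds (default 3); return non-zero otherwise.
port_open() {
  timeout "${3:-3}" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}
```

Usage: port_open 172.20.1.217 3000 && echo "pod port ok"
       port_open 10.68.247.194 80 && echo "service port ok"
If the Pod IP answers but the Service IP does not, the problem is more likely in kube-proxy or iptables than in the Pod network.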
For the most common network problems, the things to check are iptables, kube-proxy, and flannel:
1. If you use the flanneld network plugin, check the status of flanneld:
systemctl status flanneld
2. Starting with version 1.13, Docker may set the default policy of the iptables FORWARD chain to DROP, which causes pings to Pod IPs on other nodes to fail. When this happens, set the policy to ACCEPT manually:
$ sudo iptables -P FORWARD ACCEPT
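Before changing the policy, it helps to confirm what it currently is. A minimal sketch, assuming the `iptables -S FORWARD` output format, whose policy line reads "-P FORWARD DROP" or "-P FORWARD ACCEPT"; the function name forward_policy is hypothetical.

```shell
# Print the default policy of the FORWARD chain (for example ACCEPT or DROP).
# Reads `iptables -S FORWARD` output from stdin.
forward_policy() {
  awk '$1 == "-P" && $2 == "FORWARD" { print $3; exit }'
}
```

Usage: sudo iptables -S FORWARD | forward_policy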
3. Containers inside a Pod cannot ping external IPs.
These network failures can be monitored with Prometheus. For details, see https://github.com/prometheus/blackbox_exporter
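As a rough illustration of what such monitoring looks like, blackbox_exporter exposes a /probe endpoint whose output contains a probe_success metric. The sketch below only interprets that output. The function name probe_succeeded is hypothetical, and the usage line assumes an exporter listening on localhost:9115 with a tcp_connect module configured, neither of which is described in this article.

```shell
# Return 0 if blackbox_exporter probe output on stdin reports "probe_success 1".
# "# HELP" and "# TYPE" comment lines are ignored because their first field is "#".
probe_succeeded() {
  awk '$1 == "probe_success" { ok = ($2 == 1) } END { exit (ok ? 0 : 1) }'
}
```

Usage: curl -s 'http://localhost:9115/probe?module=tcp_connect&target=10.68.247.194:80' | probe_succeeded && echo "service reachable"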