Troubleshooting slow kubectl get node

Problem description
On the first master node (node1) of the k8s cluster, kubectl get node takes roughly 8 s to return, while the same command on the other masters returns almost instantly. The kube-apiserver logs show it cannot reach cert-manager, metrics-server, and other in-cluster services, which causes the timeouts; in addition, the pod network cannot ping its gateway.
Root cause
The CNI in this cluster is kube-ovn, and the pod network runs over the bond-data interface. It originally held a single NIC; the server was later shut down and a second NIC was added, which changed the MAC address of bond-data and of the ovn-host device. The IP-to-MAC mapping in the OVN database was never updated to match, so neighbor (ARP) resolution on the first master (node1) failed and it lost the MAC addresses of its peers.

ovn-host is essentially a tap device that provides pod-to-host connectivity (the upstream community build names it ovn0; only the name differs).

Troubleshooting steps

  • Run kubectl get node on node1
[root@node1 ~]# time kubectl get node
NAME    STATUS   ROLES                         AGE   VERSION
node1   Ready    control-plane,master,worker   2d    v1.23.6
node2   Ready    control-plane,master,worker   2d    v1.23.6
node3   Ready    control-plane,master,worker   2d    v1.23.6

real	0m7.934s     #### took nearly 8 s
user	0m0.210s
sys	0m0.095s
  • Run kubectl get node on node2
[root@node2 ~]#  time kubectl get node
NAME    STATUS   ROLES                         AGE   VERSION
node1   Ready    control-plane,master,worker   2d    v1.23.6
node2   Ready    control-plane,master,worker   2d    v1.23.6
node3   Ready    control-plane,master,worker   2d    v1.23.6

real	0m0.116s   ### node2 responds quickly
user	0m0.155s
sys	0m0.086s
  • Check the kube-apiserver logs on node1
    kubectl logs kube-apiserver-node1 -n kube-system
Trace[1214208296]: [2.069305201s] [2.069305201s] END
W0330 07:20:33.165661       1 reflector.go:324] storage/cacher.go:/cert-manager.io/certificaterequests: failed to list cert-manager.io/v1alpha2, Kind=CertificateRequest: conversion webhook for cert-manager.io/v1, Kind=CertificateRequest failed: Post "https://cert-manager-webhook.cert-manager.svc:443/convert?timeout=30s": dial tcp 10.233.31.151:443: connect: no route to host
W0330 07:20:33.165667       1 reflector.go:324] storage/cacher.go:/cert-manager.io/issuers: failed to list cert-manager.io/v1alpha3, Kind=Issuer: conversion webhook for cert-manager.io/v1, Kind=Issuer failed: Post "https://cert-manager-webhook.cert-manager.svc:443/convert?timeout=30s": dial tcp 10.233.31.151:443: connect: no route to host
E0330 07:20:33.165684       1 cacher.go:424] cacher (*unstructured.Unstructured): unexpected ListAndWatch error: failed to list cert-manager.io/v1alpha3, Kind=Issuer: conversion webhook for cert-manager.io/v1, Kind=Issuer failed: Post "https://cert-manager-webhook.cert-manager.svc:443/convert?timeout=30s": dial tcp 10.233.31.151:443: connect: no route to host; reinitializing...
I0330 07:20:33.165689       1 trace.go:205] Trace[1452509263]: "List etcd3" key:/cert-manager.io/certificates,resourceVersion:,resourceVersionMatch:,limit:10000,continue: (30-Mar-2023 07:20:31.094) (total time: 2071ms):
Trace[1452509263]: [2.071338582s] [2.071338582s] END
E0330 07:20:33.165671       1 cacher.go:424] cacher (*unstructured.Unstructured): unexpected ListAndWatch error: failed to list cert-manager.io/v1alpha2, Kind=CertificateRequest: conversion webhook for cert-manager.io/v1, Kind=CertificateRequest failed: Post "https://cert-manager-webhook.cert-manager.svc:443/convert?timeout=30s": dial tcp 10.233.31.151:443: connect: no route to host; reinitializing...
W0330 07:20:33.165709       1 reflector.go:324] storage/cacher.go:/cert-manager.io/certificates: failed to list cert-manager.io/v1alpha2, Kind=Certificate: conversion webhook for cert-manager.io/v1, Kind=Certificate failed: Post "https://cert-manager-webhook.cert-manager.svc:443/convert?timeout=30s": dial tcp 10.233.31.151:443: connect: no route to host
E0330 07:20:33.165722       1 cacher.go:424] cacher (*unstructured.Unstructured): unexpected ListAndWatch error: failed to list cert-manager.io/v1alpha2, Kind=Certificate: conversion webhook for cert-manager.io/v1, Kind=Certificate failed: Post "https://cert-manager-webhook.cert-manager.svc:443/convert?timeout=30s": dial tcp 10.233.31.151:443: connect: no route to host; reinitializing...
E0330 07:34:15.506727       1 available_controller.go:524] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.233.31.198:443/apis/metrics.k8s.io/v1beta1: Get "https://10.233.31.198:443/apis/metrics.k8s.io/v1beta1": dial tcp 10.233.31.198:443: connect: no route to host

The logs make it clear that node1 cannot reach metrics-server, cert-manager, and other in-cluster services; every attempt fails with "dial tcp ...: connect: no route to host", an error the kernel also returns locally when neighbor (ARP) resolution fails.

The steps below use metrics-server at 10.233.31.198:443 as the example.
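The same reachability check can also be scripted without telnet by using bash's built-in /dev/tcp pseudo-device. A minimal sketch; 127.0.0.1:9 below is a placeholder, and on node1 you would substitute 10.233.31.198 and 443:

```shell
# Quick TCP reachability probe, roughly equivalent to the telnet test below.
# host/port are placeholders standing in for the real service address.
host=127.0.0.1
port=9
if timeout 2 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
  echo "tcp $host:$port reachable"
else
  echo "tcp $host:$port unreachable"
fi
```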

  • Test service connectivity from node1 with telnet
[root@node1 ~]# telnet 10.233.31.198 443
Trying 10.233.31.198...
telnet: connect to address 10.233.31.198: No route to host
[root@node1 ~]# 

### ping to the service IP succeeds
[root@node1 ~]# ping 10.233.31.198
PING 10.233.31.198 (10.233.31.198) 56(84) bytes of data.
64 bytes from 10.233.31.198: icmp_seq=1 ttl=64 time=0.085 ms
64 bytes from 10.233.31.198: icmp_seq=2 ttl=64 time=0.043 ms
^C
--- 10.233.31.198 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1028ms
rtt min/avg/max/mdev = 0.043/0.064/0.085/0.021 ms

### ping to the pod network gateway fails
[root@node1 ~]# ping 172.10.0.1
PING 172.10.0.1 (172.10.0.1) 56(84) bytes of data.


From 172.10.255.2 icmp_seq=1 Destination Host Unreachable
From 172.10.255.2 icmp_seq=2 Destination Host Unreachable
From 172.10.255.2 icmp_seq=3 Destination Host Unreachable
  • Test service connectivity from node2 with telnet
[root@node2 ~]# telnet 10.233.31.198 443
Trying 10.233.31.198...
Connected to 10.233.31.198.
Escape character is '^]'.
^CConnection closed by foreign host.
[root@node2 ~]# 

### ping to the pod network gateway succeeds
[root@node2 ~]# ping 172.10.0.1
PING 172.10.0.1 (172.10.0.1) 56(84) bytes of data.
64 bytes from 172.10.0.1: icmp_seq=1 ttl=254 time=1.70 ms
64 bytes from 172.10.0.1: icmp_seq=2 ttl=254 time=0.592 ms
^C
--- 172.10.0.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 0.592/1.145/1.699/0.554 ms
  • Compare the ARP tables on node1 and node2
[root@node1 ~]# ip neigh |grep  ovn-host
172.10.0.52 dev ovn-host  INCOMPLETE
172.10.0.84 dev ovn-host  INCOMPLETE
172.10.0.77 dev ovn-host  FAILED
172.10.0.57 dev ovn-host  INCOMPLETE
172.10.0.62 dev ovn-host  FAILED
172.10.0.38 dev ovn-host  FAILED
172.10.0.72 dev ovn-host  INCOMPLETE
172.10.0.118 dev ovn-host  INCOMPLETE
172.10.255.4 dev ovn-host  FAILED
172.10.0.124 dev ovn-host  INCOMPLETE
172.10.255.3 dev ovn-host  FAILED
172.10.0.76 dev ovn-host  INCOMPLETE
172.10.0.2 dev ovn-host  FAILED
172.10.0.60 dev ovn-host  INCOMPLETE
172.10.0.127 dev ovn-host  INCOMPLETE
172.10.0.24 dev ovn-host  INCOMPLETE
172.10.0.80 dev ovn-host  INCOMPLETE

####### ARP entries on node2
[root@node2 ~]# ip neigh |grep  ovn-host
172.10.0.84 dev ovn-host lladdr 00:00:00:eb:f4:ec REACHABLE
172.10.0.127 dev ovn-host lladdr 00:00:00:44:ca:b9 REACHABLE
172.10.0.83 dev ovn-host lladdr 00:00:00:3b:ee:a3 STALE
172.10.0.64 dev ovn-host lladdr 00:00:00:fb:1b:a0 REACHABLE
172.10.0.24 dev ovn-host lladdr 00:00:00:a8:de:5a REACHABLE
172.10.255.4 dev ovn-host lladdr 00:1b:21:be:0d:9f DELAY
172.10.0.76 dev ovn-host lladdr 00:00:00:29:ac:f4 REACHABLE
172.10.0.3 dev ovn-host lladdr 00:00:00:40:dd:ab REACHABLE
172.10.0.42 dev ovn-host lladdr 00:00:00:59:c1:16 REACHABLE
172.10.0.70 dev ovn-host lladdr 00:00:00:9a:b9:37 REACHABLE
172.10.0.119 dev ovn-host lladdr 00:00:00:6c:75:4c REACHABLE
172.10.0.30 dev ovn-host lladdr 00:00:00:23:15:18 REACHABLE
172.10.0.114 dev ovn-host lladdr 00:00:00:7a:e8:f4 REACHABLE
172.10.0.60 dev ovn-host lladdr 00:00:00:30:7a:a5 REACHABLE
172.10.0.59 dev ovn-host lladdr 00:00:00:b2:53:7e REACHABLE
172.10.0.68 dev ovn-host lladdr 00:00:00:6f:8a:4e REACHABLE
172.10.0.28 dev ovn-host lladdr 00:00:00:93:6a:56 REACHABLE
172.10.0.67 dev ovn-host lladdr 00:00:00:06:af:03 REACHABLE
172.10.0.34 dev ovn-host lladdr 00:00:00:06:27:06 REACHABLE
172.10.0.79 dev ovn-host lladdr 00:00:00:ed:3a:96 REACHABLE
172.10.0.22 dev ovn-host lladdr 00:00:00:5c:1e:62 REACHABLE
172.10.0.57 dev ovn-host lladdr 00:00:00:43:de:aa REACHABLE
172.10.255.2 dev ovn-host lladdr e8:61:1f:13:1b:01 DELAY
172.10.0.74 dev ovn-host lladdr 00:00:00:3b:d5:2d REACHABLE
172.10.0.52 dev ovn-host lladdr 00:00:00:2b:20:12 REACHABLE
172.10.0.80 dev ovn-host lladdr 00:00:00:f7:ca:5b REACHABLE
172.10.0.65 dev ovn-host lladdr 00:00:00:eb:3a:fd REACHABLE
172.10.0.32 dev ovn-host lladdr 00:00:00:f6:16:f0 REACHABLE
172.10.0.20 dev ovn-host lladdr 00:00:00:3d:0e:d7 STALE
172.10.0.44 dev ovn-host lladdr 00:00:00:5a:12:2d STALE
172.10.0.58 dev ovn-host lladdr 00:00:00:c4:6b:c8 DELAY
172.10.0.49 dev ovn-host lladdr 00:00:00:41:c0:da REACHABLE
172.10.0.66 dev ovn-host lladdr 00:00:00:5b:ac:69 REACHABLE
172.10.0.26 dev ovn-host lladdr 00:00:00:a0:4b:47 DELAY



The output above shows that node1 has lost all of its MAC entries: every neighbor on ovn-host is stuck in INCOMPLETE or FAILED, while node2 resolves them normally.
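A failure like this can be spotted at a glance by counting unresolved neighbor entries. A minimal sketch; the three-line sample below stands in for real output, and on a live node you would pipe ip neigh itself:

```shell
# Count neighbor entries on ovn-host that never resolved to a MAC address.
# The sample variable is an excerpt standing in for live `ip neigh` output.
sample='172.10.0.52 dev ovn-host  INCOMPLETE
172.10.0.84 dev ovn-host lladdr 00:00:00:eb:f4:ec REACHABLE
172.10.0.77 dev ovn-host  FAILED'
# On node1: ip neigh | grep ovn-host | grep -c -E 'INCOMPLETE|FAILED'
echo "$sample" | grep -c -E 'INCOMPLETE|FAILED'
```

A count near the total number of ovn-host neighbors, as on node1 here, points at a layer-2 problem rather than a routing one.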
  • Confirm the MAC address of the pod-network interface
[root@node1 ~]# ip a | grep bond-data -A  2
5: enp113s0f1:  mtu 9000 qdisc mq master bond-data state UP group default qlen 1000
    link/ether e8:61:1f:13:1b:01 brd ff:ff:ff:ff:ff:ff
6: enp115s0f0:  mtu 9000 qdisc mq master bond-data state UP group default qlen 1000
    link/ether e8:61:1f:13:1b:01 brd ff:ff:ff:ff:ff:ff
7: enp115s0f1:  mtu 1500 qdisc mq master bond-manage state UP group default qlen 1000
--
20: bond-data:  mtu 9000 qdisc noqueue master ovs-system state UP group default qlen 1000
    link/ether e8:61:1f:13:1b:01 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::ea61:1fff:fe13:1b01/64 scope link
    
    
 ### inspect ovn-host
[root@node1 ~]# ip a | grep ovn-host -A  2
40: ovn-host:  mtu 9000 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether e8:61:1f:13:1b:01 brd ff:ff:ff:ff:ff:ff
    inet 172.10.255.2/16 brd 172.10.255.255 scope global noprefixroute ovn-host
       valid_lft forever preferred_lft forever
    inet6 fe80::b439:cff:fe9b:a0ba/64 scope link 

A NIC was added to bond-data and the server was power-cycled, so the MAC address of bond-data may well differ from before. Next, check whether the IP-to-MAC mapping stored in the OVN database still matches reality.

  • Inspect the OVN northbound database
[root@node2 ~]# kubectl ko nbctl show | grep 255.2 -C 5
    port virt-operator-756d76994-hmrqf.kubevirt
        addresses: ["00:00:00:59:C1:16 172.10.0.42 fd00:10:16::2a"]
    port kube-ovn-webhook-5874f47bdd-hhjvt.kube-system
        addresses: ["00:00:00:93:6A:56 172.10.0.28 fd00:10:16::1c"]
    port ovn-default-node-node1
        addresses: ["e8:61:1f:13:48:df 172.10.255.2"]    ### sure enough: the MAC in the OVN DB does not match ovn-host's actual MAC; it was never updated automatically
    port metrics-server-64b7dfb7d5-hfvfb.kube-system
        addresses: ["00:00:00:2B:20:12 172.10.0.52 fd00:10:16::34"]
    port lmg-statefulset-1.upf
        addresses: ["00:00:00:D9:EF:A6 172.10.0.118 fd00:10:16::76"]
    port kube-sriov-cni-ds-amd64-b5l4k.kube-system
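The mismatch itself boils down to a string comparison between the MAC recorded in the OVN northbound DB and the kernel's view of ovn-host. A hedged sketch with both values hard-coded from this incident; on a live cluster you would read them from kubectl ko nbctl show and /sys/class/net/ovn-host/address:

```shell
# Compare the MAC stored for the node port in the OVN DB with the MAC the
# kernel reports for ovn-host. Both values are hard-coded from this write-up.
db_mac="e8:61:1f:13:48:df"    # from: kubectl ko nbctl show (port ovn-default-node-node1)
live_mac="e8:61:1f:13:1b:01"  # from: cat /sys/class/net/ovn-host/address
if [ "$db_mac" != "$live_mac" ]; then
  echo "MAC mismatch: OVN DB has $db_mac, ovn-host has $live_mac"
fi
```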

  • Update the MAC mapping in the OVN database
[root@node2 ~]# kubectl ko nbctl lsp-set-addresses ovn-default-node-node1 "e8:61:1f:13:1b:01 172.10.255.2"

### confirm the change took effect
[root@node2 ~]# kubectl ko nbctl show | grep 255.2 -C 5
    port virt-operator-756d76994-hmrqf.kubevirt
        addresses: ["00:00:00:59:C1:16 172.10.0.42 fd00:10:16::2a"]
    port kube-ovn-webhook-5874f47bdd-hhjvt.kube-system
        addresses: ["00:00:00:93:6A:56 172.10.0.28 fd00:10:16::1c"]
    port ovn-default-node-node1
        addresses: ["e8:61:1f:13:1b:01 172.10.255.2"]    ### the ovn-host IP now maps to its real MAC
    port metrics-server-64b7dfb7d5-hfvfb.kube-system
        addresses: ["00:00:00:2B:20:12 172.10.0.52 fd00:10:16::34"]
    port lmg-statefulset-1.upf
        addresses: ["00:00:00:D9:EF:A6 172.10.0.118 fd00:10:16::76"]
    port kube-sriov-cni-ds-amd64-b5l4k.kube-system
  • Verify on node1
### the service is reachable again
[root@node1 ~]# telnet 10.233.31.198 443   
Trying 10.233.31.198...
Connected to 10.233.31.198.
Escape character is '^]'.
^CConnection closed by foreign host.
[root@node1 ~]# 

### the gateway responds again
[root@node1 ~]# ping 172.10.0.1
PING 172.10.0.1 (172.10.0.1) 56(84) bytes of data.
64 bytes from 172.10.0.1: icmp_seq=1 ttl=254 time=0.326 ms
64 bytes from 172.10.0.1: icmp_seq=2 ttl=254 time=0.299 ms
^C
--- 172.10.0.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1062ms
rtt min/avg/max/mdev = 0.299/0.312/0.326/0.022 ms
[root@node1 ~]# 

#### kubectl response time is back to normal
[root@node1 ~]# time kubectl get po 
NAME                       READY   STATUS    RESTARTS   AGE
busybox-86b69dd785-4kl8z   1/1     Running   0          5h4m
busybox-86b69dd785-8t6x7   1/1     Running   0          90m
busybox-86b69dd785-h4xft   1/1     Running   0          5h4m

real	0m0.089s
user	0m0.141s
sys	0m0.057s
  • node1's ARP table is now populated
[root@node1 ~]# ip neigh |grep  ovn-host
172.10.0.52 dev ovn-host lladdr 00:00:00:2b:20:12 STALE
172.10.0.28 dev ovn-host lladdr 00:00:00:93:6a:56 STALE
172.10.0.84 dev ovn-host lladdr 00:00:00:eb:f4:ec REACHABLE
172.10.0.77 dev ovn-host lladdr 00:00:00:96:1b:52 DELAY
172.10.0.57 dev ovn-host lladdr 00:00:00:43:de:aa REACHABLE
172.10.0.83 dev ovn-host lladdr 00:00:00:3b:ee:a3 STALE
172.10.0.62 dev ovn-host lladdr 00:00:00:86:ab:38 REACHABLE
172.10.0.38 dev ovn-host  FAILED
172.10.0.72 dev ovn-host lladdr 00:00:00:44:e3:17 REACHABLE
172.10.0.118 dev ovn-host lladdr 00:00:00:d9:ef:a6 STALE
172.10.255.4 dev ovn-host lladdr 00:1b:21:be:0d:9f STALE
172.10.0.124 dev ovn-host lladdr 00:00:00:60:0a:a5 STALE
172.10.255.3 dev ovn-host lladdr 00:1b:21:be:0d:a1 STALE
172.10.0.76 dev ovn-host lladdr 00:00:00:29:ac:f4 REACHABLE
172.10.0.2 dev ovn-host lladdr 00:00:00:61:05:f3 STALE
172.10.0.54 dev ovn-host lladdr 00:00:00:df:9f:f6 STALE
172.10.0.1 dev ovn-host lladdr 00:00:00:00:69:8b STALE
172.10.0.60 dev ovn-host lladdr 00:00:00:30:7a:a5 REACHABLE
172.10.0.127 dev ovn-host lladdr 00:00:00:44:ca:b9 REACHABLE
172.10.0.53 dev ovn-host lladdr 00:00:00:a0:8b:05 STALE
172.10.0.29 dev ovn-host lladdr 00:00:00:19:af:22 STALE
172.10.0.24 dev ovn-host lladdr 00:00:00:a8:de:5a REACHABLE
172.10.255.1 dev ovn-host  FAILED
172.10.0.80 dev ovn-host lladdr 00:00:00:f7:ca:5b REACHABLE
