k8s无法正常启动使用,排查:etcd损坏

k8s无法正常启动使用,排查:etcd损坏

问题:

在跑项目的时候,机器意外断电了,重启后发现kubectl无法使用,报错如下:

[root@k8s-master01 wal]# kubectl get pod
The connection to the server 192.168.18.101:6443 was refused - did you specify the right host or port?

排查:

1、

[root@k8s-master01 ~]# journalctl -fu kubelet

5月 06 12:05:34 k8s-master01 kubelet[18467]: I0506 12:05:34.798387   18467 kubelet.go:461] "Kubelet nodes not sync"
5月 06 12:05:35 k8s-master01 kubelet[18467]: I0506 12:05:35.120046   18467 kubelet.go:461] "Kubelet nodes not sync"
5月 06 12:05:35 k8s-master01 kubelet[18467]: I0506 12:05:35.368482   18467 kubelet.go:461] "Kubelet nodes not sync"
5月 06 12:05:35 k8s-master01 kubelet[18467]: I0506 12:05:35.432439   18467 kubelet.go:461] "Kubelet nodes not sync"
5月 06 12:05:35 k8s-master01 kubelet[18467]: I0506 12:05:35.626419   18467 kubelet_node_status.go:71] "Attempting to register node" node="k8s-master01"
5月 06 12:05:35 k8s-master01 kubelet[18467]: E0506 12:05:35.626808   18467 kubelet_node_status.go:93] "Unable to register node with API server" err="Post \"https://192.168.18.101:6443/api/v1/nodes\": dial tcp 192.168.18.101:6443: connect: connection refused" node="k8s-master01"
5月 06 12:05:35 k8s-master01 kubelet[18467]: I0506 12:05:35.797123   18467 kubelet.go:461] "Kubelet nodes not sync"
5月 06 12:05:35 k8s-master01 kubelet[18467]: I0506 12:05:35.797226   18467 kubelet.go:461] "Kubelet nodes not sync"
5月 06 12:05:35 k8s-master01 kubelet[18467]: E0506 12:05:35.797249   18467 kubelet.go:2298] "Error getting node" err="nodes have not yet been read at least once, cannot construct node object"

Unable to register node with API server" err=“Post “https://192.168.18.101:6443/api/v1/nodes”: dial tcp 192.168.18.101:6443: connect: connection refused” node="k8s-master01

发现连接不到apiserver

2、然后我到docker容器里,发现etcd和api都处于exitd状态

在这里插入图片描述

查看了apiserver相关的容器log

报错如下:

[root@k8s-master01 ~]# docker logs k8s_kube-apiserver_kube-apiserver-k8s-master01_kube-system_2dd80c82feea0886cf3a8a74b4b80db1_26

I0506 04:08:32.598696       1 client.go:360] parsed scheme: "endpoint"
I0506 04:08:32.598776       1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://127.0.0.1:2379  <nil> 0 <nil>}]
W0506 04:08:32.599383       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
I0506 04:08:33.594958       1 client.go:360] parsed scheme: "endpoint"
I0506 04:08:33.595009       1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://127.0.0.1:2379  <nil> 0 <nil>}]
W0506 04:08:33.596996       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...

transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused

好你个拒绝连接,来看看2379端口是干嘛的

[root@k8s-master01 ~]# cat /etc/kubernetes/manifests/kube-apiserver.yaml

k8s无法正常启动使用,排查:etcd损坏_第1张图片

原来是etcd连接失败,怪不得容器状态也是exitd。

3、查看etcd容器的log,发现报错如下:

[root@k8s-master01 ~]# docker logs k8s_etcd_etcd-k8s-master01_kube-system_7a98aec14f1d812921d8b2f6c35fbbcc_24

[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2023-05-06 04:10:42.638389 I | etcdmain: etcd Version: 3.4.13
2023-05-06 04:10:42.638426 I | etcdmain: Git SHA: ae9734ed2
2023-05-06 04:10:42.638429 I | etcdmain: Go Version: go1.12.17
2023-05-06 04:10:42.638431 I | etcdmain: Go OS/Arch: linux/amd64
2023-05-06 04:10:42.638434 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4
2023-05-06 04:10:42.638485 N | etcdmain: the server is already initialized as member before, starting as etcd member...
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2023-05-06 04:10:42.638516 I | embed: peerTLS: cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true, crl-file = 
2023-05-06 04:10:42.639017 I | embed: name = k8s-master01
2023-05-06 04:10:42.639042 I | embed: data dir = /var/lib/etcd
2023-05-06 04:10:42.639047 I | embed: member dir = /var/lib/etcd/member
2023-05-06 04:10:42.639050 I | embed: heartbeat = 100ms
2023-05-06 04:10:42.639053 I | embed: election = 1000ms
2023-05-06 04:10:42.639056 I | embed: snapshot count = 10000
2023-05-06 04:10:42.639079 I | embed: advertise client URLs = https://192.168.18.101:2379
2023-05-06 04:10:42.639084 I | embed: initial advertise peer URLs = https://192.168.18.101:2380
2023-05-06 04:10:42.639087 I | embed: initial cluster = 
2023-05-06 04:10:42.641786 W | wal: ignored file 0000000000000000-0000000000000000.wal.broken in wal
2023-05-06 04:10:42.641832 W | wal: ignored file 0000000000000004-000000000005dd9e.wal.broken in wal
2023-05-06 04:10:43.208339 I | etcdserver: recovered store from snapshot at index 600062
2023-05-06 04:10:43.212163 C | etcdserver: recovering backend from snapshot error: failed to find database snapshot file (snap: snapshot file doesn't exist)
panic: recovering backend from snapshot error: failed to find database snapshot file (snap: snapshot file doesn't exist)
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0xc0587e]

goroutine 1 [running]:
go.etcd.io/etcd/etcdserver.NewServer.func1(0xc0002cae30, 0xc0002c8d80)
        /tmp/etcd-release-3.4.13/etcd/release/etcd/etcdserver/server.go:334 +0x3e
panic(0xee6840, 0xc0001120a0)
        /usr/local/go/src/runtime/panic.go:522 +0x1b5
github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc000203a60, 0x10c13b2, 0x2a, 0xc0002c8e50, 0x1, 0x1)
        /home/ANT.AMAZON.COM/leegyuho/go/pkg/mod/github.com/coreos/[email protected]/capnslog/pkg_logger.go:75 +0x135
go.etcd.io/etcd/etcdserver.NewServer(0x7ffead4fde41, 0xc, 0x0, 0x0, 0x0, 0x0, 0xc0001bb080, 0x1, 0x1, 0xc0001bb200, ...)
        /tmp/etcd-release-3.4.13/etcd/release/etcd/etcdserver/server.go:464 +0x433c
go.etcd.io/etcd/embed.StartEtcd(0xc000292000, 0xc000197080, 0x0, 0x0)
        /tmp/etcd-release-3.4.13/etcd/release/etcd/embed/etcd.go:214 +0x988
go.etcd.io/etcd/etcdmain.startEtcd(0xc000292000, 0x10963d6, 0x6, 0x1, 0xc000221140)
        /tmp/etcd-release-3.4.13/etcd/release/etcd/etcdmain/etcd.go:302 +0x40
go.etcd.io/etcd/etcdmain.startEtcdOrProxyV2()
        /tmp/etcd-release-3.4.13/etcd/release/etcd/etcdmain/etcd.go:144 +0x2ef9
go.etcd.io/etcd/etcdmain.Main()
        /tmp/etcd-release-3.4.13/etcd/release/etcd/etcdmain/main.go:46 +0x38
main.main()
        /tmp/etcd-release-3.4.13/etcd/release/etcd/main.go:28 +0x20

snap: snapshot file doesn’t exist
发现snap文件不存在,估计是意外断电导致的文件系统损坏或者遗失了

将备份的文件覆盖该目录文件即可,因为我没有备份,是在学习测试阶段,没啥重要东西,就直接删除了

[root@k8s-master01 ~]# rm -rf /var/lib/etcd/member/snap/.*

重启kubelet服务,发现还是报错

[root@k8s-master01 wal]# systemctl restart kubelet.service 
[root@k8s-master01 wal]# kubectl get pod
The connection to the server 192.168.18.101:6443 was refused - did you specify the right host or port?

4、继续查看etcd容器日志,发现又有报错:

[root@k8s-master01 ~]# docker logs k8s_etcd_etcd-k8s-master01_kube-system_7a98aec14f1d812921d8b2f6c35fbbcc_27

[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2023-05-06 04:20:51.236303 I | etcdmain: etcd Version: 3.4.13
2023-05-06 04:20:51.236348 I | etcdmain: Git SHA: ae9734ed2
2023-05-06 04:20:51.236351 I | etcdmain: Go Version: go1.12.17
2023-05-06 04:20:51.236354 I | etcdmain: Go OS/Arch: linux/amd64
2023-05-06 04:20:51.236358 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4
2023-05-06 04:20:51.236409 N | etcdmain: the server is already initialized as member before, starting as etcd member...
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2023-05-06 04:20:51.236456 I | embed: peerTLS: cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true, crl-file = 
2023-05-06 04:20:51.236926 I | embed: name = k8s-master01
2023-05-06 04:20:51.236947 I | embed: data dir = /var/lib/etcd
2023-05-06 04:20:51.236951 I | embed: member dir = /var/lib/etcd/member
2023-05-06 04:20:51.236953 I | embed: heartbeat = 100ms
2023-05-06 04:20:51.236954 I | embed: election = 1000ms
2023-05-06 04:20:51.236956 I | embed: snapshot count = 10000
2023-05-06 04:20:51.236962 I | embed: advertise client URLs = https://192.168.18.101:2379
2023-05-06 04:20:51.236964 I | embed: initial advertise peer URLs = https://192.168.18.101:2380
2023-05-06 04:20:51.236967 I | embed: initial cluster = 
2023-05-06 04:20:51.238833 W | wal: ignored file 0000000000000000-0000000000000000.wal.broken in wal
2023-05-06 04:20:51.238879 W | wal: ignored file 0000000000000004-000000000005dd9e.wal.broken in wal
2023-05-06 04:20:51.801126 W | wal: ignored file 0000000000000000-0000000000000000.wal.broken in wal
2023-05-06 04:20:51.801201 W | wal: ignored file 0000000000000004-000000000005dd9e.wal.broken in wal
2023-05-06 04:20:51.801254 C | etcdserver: open wal error: wal: file not found

etcdserver: open wal error: wal: file not found
发现找不到该文件walr

建议将备份的文件覆盖该目录文件,我就直接删除了

[root@k8s-master01 ~]# rm -rf /var/lib/etcd/member/wal/*

5、重启kubelet,过一会发现可以正常使用了,容器服务也正常启动了

在这里插入图片描述

你可能感兴趣的:(k8s,docker,linux,运维,容器,kubernetes)