After a server suddenly loses power and restarts, the apiserver may fail to come up, and the kubelet log keeps reporting node "master01" not found.
journalctl -u kubelet -f
Jun 16 21:36:35 master01 kubelet[53838]: E0616 21:36:35.000792 53838 kubelet.go:2240] node "master01" not found
If you would rather skip the troubleshooting and your cluster was installed with kubeadm, you can simply run kubeadm reset and then re-initialize with kubeadm init. Mine was deployed from binaries and I did not want to redeploy, so the rest of this post works through the actual problem; feel free to follow along, since everyone's failure is a bit different. One big warning, though: my root cause turned out to be etcd, and later on I deleted etcd's data, after which everything in it was gone, so please do not delete yours.
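For reference, a minimal sketch of that kubeadm route (only applies to kubeadm-managed clusters; the --control-plane-endpoint value below is illustrative and borrows the VIP:port that shows up later in my logs):
# kubeadm reset wipes this node's cluster state
kubeadm reset -f
# re-initialize; adjust the flags to your own environment
kubeadm init --control-plane-endpoint "10.0.0.10:8443" --upload-certs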
At first, using the commands below, I found that my VIP was gone:
journalctl -u kubelet -f
hostname -I    # the output did not include the VIP
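Another way to confirm the VIP is gone (a sketch; it assumes the VIP is the 10.0.0.10 address that appears in the kubelet logs):
ip -4 addr show | grep 10.0.0.10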
# Run the following command to start keepalived and haproxy
systemctl start keepalived haproxy
# Run hostname -I again and the VIP is back. If it is not, check whether keepalived or haproxy has a problem and troubleshoot from their logs, as shown below.
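A quick way to check those two services directly (a sketch; it assumes keepalived and haproxy run as the systemd units started above):
systemctl status keepalived haproxy
journalctl -u keepalived -n 50 --no-pager
journalctl -u haproxy -n 50 --no-pager
In my case the kubelet log also showed it could not even complete a TLS handshake with the apiserver behind the VIP (https://10.0.0.10:8443), which pointed at the control-plane services themselves: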
Jun 16 21:36:34 master01 kubelet[53838]: E0616 21:36:34.815419 53838 kubelet_node_status.go:93] Unable to register node "master01" with API server: Post "https://10.0.0.10:8443/api/v1/nodes": net/http: TLS handshake timeout
Jun 16 21:36:34 master01 kubelet[53838]: E0616 21:36:34.900365 53838 kubelet.go:2240] node "master01" not found
Jun 16 21:36:34 master01 kubelet[53838]: I0616 21:36:34.934691 53838 trace.go:205] Trace[1124748425]: "Reflector ListAndWatch" name:k8s.io/client-go/informers/factory.go:134 (16-Jun-2023 21:36:15.169) (total time: 19765ms):
Jun 16 21:36:34 master01 kubelet[53838]: Trace[1124748425]: [19.765397197s] [19.765397197s] END
Jun 16 21:36:34 master01 kubelet[53838]: E0616 21:36:34.934708 53838 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://10.0.0.10:8443/api/v1/services?limit=500&resourceVersion=0": net/http: TLS handshake timeout
Jun 16 21:36:35 master01 kubelet[53838]: E0616 21:36:35.000792 53838 kubelet.go:2240] node "master01" not found
# Start kube-apiserver, kube-proxy, kube-controller-manager, and kube-scheduler
systemctl start kube-apiserver.service kube-proxy.service kube-controller-manager.service kube-scheduler.service
systemctl status kube-apiserver.service kube-proxy.service kube-controller-manager.service kube-scheduler.service
The errors showed that the apiserver could not connect to etcd:
[root@master01 ~]# journalctl -u kube-apiserver -f
...
Jun 16 23:05:47 master01 kube-apiserver[100524]: W0616 23:05:47.772004  100524 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.87:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.87:2379: connect: connection refused". Reconnecting...
Jun 16 23:05:47 master01 kube-apiserver[100524]: W0616 23:05:47.773196  100524 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.97:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.97:2379: connect: connection refused". Reconnecting...
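connection refused on port 2379 usually just means nothing is listening there. A quick way to confirm that (sketch):
ss -tlnp | grep 2379    # no output means no process is bound to the etcd client port
So the next thing to check is the etcd service itself: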
[root@master01 kubernetes]# systemctl status etcd
● etcd.service - Etcd Service
Loaded: loaded (/usr/lib/systemd/system/etcd.service; enabled; vendor preset: disabled)
Active: activating (auto-restart) (Result: exit-code) since Fri 2023-06-16 23:09:02 CST; 3s ago
Docs: https://coreos.com/etcd/docs/latest/
Process: 101953 ExecStart=/usr/local/bin/etcd --config-file=/etc/etcd/etcd.config.yml (code=exited, status=1/FAILURE)
Main PID: 101953 (code=exited, status=1/FAILURE)
Jun 16 23:09:02 master01 systemd[1]: etcd.service: main process exited, code=exited, status=1/FAILURE
Jun 16 23:09:02 master01 systemd[1]: Failed to start Etcd Service.
Jun 16 23:09:02 master01 systemd[1]: Unit etcd.service entered failed state.
Jun 16 23:09:02 master01 systemd[1]: etcd.service failed.
[root@master01 kubernetes]# export ETCDCTL_API=3
# i.e. etcd is not actually running, as the check below confirms
[root@master01 kubernetes]# etcdctl --endpoints="10.0.0.87:2379,10.0.0.97:2379,10.0.0.107:2379" --cacert=/etc/kubernetes/pki/etcd/etcd-ca.pem --cert=/etc/kubernetes/pki/etcd/etcd.pem --key=/etc/kubernetes/pki/etcd/etcd-key.pem endpoint status --write-out=table
{"level":"warn","ts":"2023-06-16T23:07:57.367+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///10.0.0.87:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 10.0.0.87:2379: connect: connection refused\""}
Failed to get the status of endpoint 10.0.0.87:2379 (context deadline exceeded)
{"level":"warn","ts":"2023-06-16T23:08:02.368+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///10.0.0.97:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 10.0.0.97:2379: connect: connection refused\""}
Failed to get the status of endpoint 10.0.0.97:2379 (context deadline exceeded)
{"level":"warn","ts":"2023-06-16T23:08:07.369+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///10.0.0.107:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to get the status of endpoint 10.0.0.107:2379 (context deadline exceeded)
+----------+----+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------+----+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
+----------+----+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
## Error: crc mismatch
[root@master01 ~]# journalctl -u etcd -f
...
Jun 16 23:09:23 master01 etcd[102153]: ignored file 0000000000000031-0000000000239ef6.wal.broken in wal
Jun 16 23:09:24 master01 etcd[102153]: walpb: crc mismatch
Jun 16 23:09:24 master01 systemd[1]: etcd.service: main process exited, code=exited, status=1/FAILURE
Jun 16 23:09:24 master01 systemd[1]: Failed to start Etcd Service.
Jun 16 23:09:24 master01 systemd[1]: Unit etcd.service entered failed state.
Jun 16 23:09:24 master01 systemd[1]: etcd.service failed.
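Since step 3 below deletes etcd's WAL and snapshot files (and, as the warning in that step says, this cost me everything stored in etcd), it is worth copying the data directory aside before touching anything. A minimal sketch, assuming the same /var/lib/etcd data directory used in the rm commands below:
# keep a timestamped copy of the (possibly corrupted) data for later inspection or recovery
cp -a /var/lib/etcd /var/lib/etcd.bak-$(date +%F-%H%M%S)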
#1. Stop the kubelet on the node with the corrupted data
[root@master01 ~]# systemctl stop kubelet
[root@master01 ~]# systemctl status kubelet
#2. Stop the etcd container on the node with the corrupted data
[root@master01 ~]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
eec644681fe4 0cae8d5cc64c "kube-apiserver --ad…" 6 minutes ago Up 6 minutes k8s_kube-apiserver_kube-apiserver-k8s-m2_kube-system_a6969daa2e4e9a047c11e645ac639c8f_6543
5f75788ca082 303ce5db0e90 "etcd --advertise-cl…" 7 minutes ago Up 7 minutes k8s_etcd_etcd-k8s-m2_kube-system_8fc15ff127d417c1e3b2180d50ce85e3_1
[root@master01 ~]# docker stop <CONTAINER_ID>    # use the etcd container ID from docker ps
#3. Delete the etcd data on the node with the corrupted data
# For this step, just delete the corrupted data; after the node rejoins the cluster, the data is synced back automatically from the other members
# However, in my case the commands below wiped all the data in etcd: not a single pod was left in my cluster
[root@master01 ~]# rm -f /var/lib/etcd/member/wal/*
[root@master01 ~]# rm -f /var/lib/etcd/member/snap/*
#4. Start the kubelet (do not forget this step)
[root@master01 ~]# systemctl start kubelet
# After that everything came back up
[root@master01 ~]# etcdctl --endpoints="10.0.0.87:2379,10.0.0.97:2379,10.0.0.107:2379" --cacert=/etc/kubernetes/pki/etcd/etcd-ca.pem --cert=/etc/kubernetes/pki/etcd/etcd.pem --key=/etc/kubernetes/pki/etcd/etcd-key.pem endpoint status --write-out=table
+-----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 10.0.0.87:2379 | 949b9ccaa465bea8 | 3.4.13 | 3.3 MB | false | false | 4 | 7917 | 7917 | |
| 10.0.0.97:2379 | 795272eff6c8418e | 3.4.13 | 5.3 MB | false | false | 4 | 7917 | 7917 | |
| 10.0.0.107:2379 | 41172b80a9c89e7f | 3.4.13 | 5.3 MB | true | false | 4 | 7917 | 7917 | |
+-----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
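If you want an extra sanity check, the same endpoints and certificates can be reused for a health probe (sketch):
etcdctl --endpoints="10.0.0.87:2379,10.0.0.97:2379,10.0.0.107:2379" --cacert=/etc/kubernetes/pki/etcd/etcd-ca.pem --cert=/etc/kubernetes/pki/etcd/etcd.pem --key=/etc/kubernetes/pki/etcd/etcd-key.pem endpoint health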
# If master01 shows NotReady, check whether the kubelet from the step above was forgotten and is not running
[root@master01 ~]# kubectl get node
NAME STATUS ROLES AGE VERSION
master01 Ready <none> <invalid> v1.20.0
master02 Ready <none> <invalid> v1.20.0
master03 Ready <none> <invalid> v1.20.0
node01 Ready <none> <invalid> v1.20.0
node02 Ready <none> <invalid> v1.20.0
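Since etcd's data was wiped, it is also worth checking what is left of the workloads (sketch):
kubectl get pods -A
Anything that only existed inside the cluster (and not in manifests you still have) will have to be re-created.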