node “master” not found / node “node01” not found on the master and node after a blue screen and reboot

I. Problem Description

After a server loses power and reboots unexpectedly, the apiserver may fail to start, and the kubelet log keeps reporting node "master" not found:

#journalctl -u kubelet -f
Jun 16 21:36:35 master01 kubelet[53838]: E0616 21:36:35.000792   53838 kubelet.go:2240] node "master01" not found

II. Resolution

If you want to skip the hassle and the cluster was deployed with kubeadm, you can simply run kubeadm reset and re-initialize with kubeadm init. Mine is a binary deployment and I did not want to redeploy, so the walkthrough below may help if you don't mind the extra work; keep in mind that not every failure looks the same. Also note that my root cause turned out to be etcd: I later deleted the data inside etcd and everything in it was lost, so please do not delete it the way I did.

1. Load balancer error and fix

At first, the kubelet log pointed me at the load balancer:

journalctl -u kubelet -f

hostname -I showed that the VIP was gone.

# Start keepalived and haproxy:
systemctl start keepalived haproxy

# Run hostname -I again and the VIP is back. If it does not come back, keepalived or haproxy itself has a problem; check their logs to troubleshoot. For reference, the kubelet log from before the fix showed TLS handshake timeouts against the VIP (10.0.0.10:8443):
Jun 16 21:36:34 master01 kubelet[53838]: E0616 21:36:34.815419   53838 kubelet_node_status.go:93] Unable to register node "master01" with API server: Post "https://10.0.0.10:8443/api/v1/nodes": net/http: TLS handshake timeout
Jun 16 21:36:34 master01 kubelet[53838]: E0616 21:36:34.900365   53838 kubelet.go:2240] node "master01" not found
Jun 16 21:36:34 master01 kubelet[53838]: I0616 21:36:34.934691   53838 trace.go:205] Trace[1124748425]: "Reflector ListAndWatch" name:k8s.io/client-go/informers/factory.go:134 (16-Jun-2023 21:36:15.169) (total time: 19765ms):
Jun 16 21:36:34 master01 kubelet[53838]: Trace[1124748425]: [19.765397197s] [19.765397197s] END
Jun 16 21:36:34 master01 kubelet[53838]: E0616 21:36:34.934708   53838 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://10.0.0.10:8443/api/v1/services?limit=500&resourceVersion=0": net/http: TLS handshake timeout
Jun 16 21:36:35 master01 kubelet[53838]: E0616 21:36:35.000792   53838 kubelet.go:2240] node "master01" not found
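
Since this check is mechanical, it can be wrapped into a small shell snippet. This is only a sketch: the VIP 10.0.0.10 and the 8443 LB port are taken from the kubelet log above, so replace them with your own values.

VIP="10.0.0.10"                      # API server VIP seen in the kubelet log; replace with yours
if ! ip addr | grep -qw "$VIP"; then
    echo "VIP $VIP is missing, (re)starting keepalived and haproxy"
    systemctl restart keepalived haproxy
fi
hostname -I                          # the VIP should show up again
ss -tlnp | grep 8443                 # haproxy should be listening on the LB port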

2. apiserver error

# Start the apiserver, kube-proxy, controller-manager, and scheduler

systemctl start kube-apiserver.service kube-proxy.service kube-controller-manager.service  kube-scheduler.service
systemctl status kube-apiserver.service kube-proxy.service kube-controller-manager.service kube-scheduler.service

The logs show that the apiserver cannot connect to etcd:

[root@master01 ~]#  journalctl -u kube-apiserver -f
...
Jun 16 23:05:47 master01 kube-apiserver[100524]: W0616 23:05:47.772004  100524 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.87:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.87:2379: connect: connection refused". Reconnecting...
Jun 16 23:05:47 master01 kube-apiserver[100524]: W0616 23:05:47.773196  100524 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.97:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.97:2379: connect: connection refused". Reconnecting...
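
Before digging into etcd itself, a quick reachability check against the endpoints from the logs narrows things down. A sketch only; the VIP, the 8443 port, and the etcd member IPs are the ones appearing in the logs above:

curl -sk https://10.0.0.10:8443/healthz; echo    # apiserver behind the VIP (-k skips cert verification)
for ep in 10.0.0.87 10.0.0.97 10.0.0.107; do
    # bash's /dev/tcp avoids depending on nc being installed
    if timeout 2 bash -c "echo > /dev/tcp/$ep/2379" 2>/dev/null; then
        echo "$ep:2379 open"
    else
        echo "$ep:2379 refused or unreachable"
    fi
done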

3. etcd error

1. Check etcd's status

[root@master01 kubernetes]# systemctl status etcd
● etcd.service - Etcd Service
   Loaded: loaded (/usr/lib/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Fri 2023-06-16 23:09:02 CST; 3s ago
     Docs: https://coreos.com/etcd/docs/latest/
  Process: 101953 ExecStart=/usr/local/bin/etcd --config-file=/etc/etcd/etcd.config.yml (code=exited, status=1/FAILURE)
 Main PID: 101953 (code=exited, status=1/FAILURE)

Jun 16 23:09:02 master01 systemd[1]: etcd.service: main process exited, code=exited, status=1/FAILURE
Jun 16 23:09:02 master01 systemd[1]: Failed to start Etcd Service.
Jun 16 23:09:02 master01 systemd[1]: Unit etcd.service entered failed state.
Jun 16 23:09:02 master01 systemd[1]: etcd.service failed.
[root@master01 kubernetes]# export ETCDCTL_API=3
# i.e. etcd is not running at all
[root@master01 kubernetes]# etcdctl --endpoints="10.0.0.87:2379,10.0.0.97:2379,10.0.0.107:2379" --cacert=/etc/kubernetes/pki/etcd/etcd-ca.pem --cert=/etc/kubernetes/pki/etcd/etcd.pem --key=/etc/kubernetes/pki/etcd/etcd-key.pem  endpoint status --write-out=table
{"level":"warn","ts":"2023-06-16T23:07:57.367+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///10.0.0.87:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 10.0.0.87:2379: connect: connection refused\""}
Failed to get the status of endpoint 10.0.0.87:2379 (context deadline exceeded)
{"level":"warn","ts":"2023-06-16T23:08:02.368+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///10.0.0.97:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 10.0.0.97:2379: connect: connection refused\""}
Failed to get the status of endpoint 10.0.0.97:2379 (context deadline exceeded)
{"level":"warn","ts":"2023-06-16T23:08:07.369+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///10.0.0.107:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to get the status of endpoint 10.0.0.107:2379 (context deadline exceeded)
+----------+----+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------+----+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
+----------+----+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
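
endpoint health gives a quicker per-member pass/fail than the table above; the certificate flags are unchanged (a sketch of the same check):

etcdctl --endpoints="10.0.0.87:2379,10.0.0.97:2379,10.0.0.107:2379" \
  --cacert=/etc/kubernetes/pki/etcd/etcd-ca.pem \
  --cert=/etc/kubernetes/pki/etcd/etcd.pem \
  --key=/etc/kubernetes/pki/etcd/etcd-key.pem \
  endpoint health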

2. Look at the etcd error

## Error: crc mismatch
[root@master01 ~]#  journalctl -u etcd -f
...
Jun 16 23:09:23 master01 etcd[102153]: ignored file 0000000000000031-0000000000239ef6.wal.broken in wal
Jun 16 23:09:24 master01 etcd[102153]: walpb: crc mismatch
Jun 16 23:09:24 master01 systemd[1]: etcd.service: main process exited, code=exited, status=1/FAILURE
Jun 16 23:09:24 master01 systemd[1]: Failed to start Etcd Service.
Jun 16 23:09:24 master01 systemd[1]: Unit etcd.service entered failed state.
Jun 16 23:09:24 master01 systemd[1]: etcd.service failed.
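
walpb: crc mismatch means a record in etcd's write-ahead log failed its checksum, which is typical after a sudden power loss: etcd refuses to start rather than replay a corrupted WAL. The affected files live under the member directory (the same path the rm commands in the next step operate on):

ls -lh /var/lib/etcd/member/wal/     # the *.wal.broken file from the log shows up here
ls -lh /var/lib/etcd/member/snap/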

3. Fix

#1. Stop the kubelet on the node with the damaged data
[root@master01 ~]# systemctl stop kubelet
[root@master01 ~]# systemctl status kubelet

#2. Stop the etcd container on the node with the damaged data
[root@master01 ~]# docker ps
CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS              PORTS               NAMES
eec644681fe4        0cae8d5cc64c           "kube-apiserver --ad…"   6 minutes ago       Up 6 minutes                            k8s_kube-apiserver_kube-apiserver-k8s-m2_kube-system_a6969daa2e4e9a047c11e645ac639c8f_6543
5f75788ca082        303ce5db0e90           "etcd --advertise-cl…"   7 minutes ago       Up 7 minutes                            k8s_etcd_etcd-k8s-m2_kube-system_8fc15ff127d417c1e3b2180d50ce85e3_1

[root@master01 ~]# docker stop <CONTAINER ID>

#3. Delete the damaged etcd data on that node
# This step is only meant to remove the broken data; once the member rejoins the cluster, the data syncs back automatically from the other nodes
# But in my case the commands below wiped the data held in etcd entirely: not a single pod was left in the cluster afterwards, so do not delete it the way I did (a backup sketch follows this step)
[root@master01 ~]# rm -f /var/lib/etcd/member/wal/*
[root@master01 ~]# rm -f /var/lib/etcd/member/snap/*
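
In hindsight, copying the member directory aside before running the rm commands above would at least have kept the damaged data around for later inspection, and a snapshot from a still-healthy member gives you something to restore from. A minimal sketch, assuming etcd on this node is already stopped and there is enough free disk; the backup paths are just examples:

cp -a /var/lib/etcd/member /var/lib/etcd/member.bak.$(date +%Y%m%d-%H%M)
# snapshot taken from another (still-healthy) member, using the same certificates as above
etcdctl --endpoints="10.0.0.97:2379" \
  --cacert=/etc/kubernetes/pki/etcd/etcd-ca.pem \
  --cert=/etc/kubernetes/pki/etcd/etcd.pem \
  --key=/etc/kubernetes/pki/etcd/etcd-key.pem \
  snapshot save /root/etcd-backup.db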

#4. Start the kubelet again (don't forget this step)
[root@master01 ~]# systemctl start kubelet

4. Verify etcd again

# etcd is healthy again
[root@master01 ~]#  etcdctl --endpoints="10.0.0.87:2379,10.0.0.97:2379,10.0.0.107:2379" --cacert=/etc/kubernetes/pki/etcd/etcd-ca.pem --cert=/etc/kubernetes/pki/etcd/etcd.pem --key=/etc/kubernetes/pki/etcd/etcd-key.pem  endpoint status --write-out=table
+-----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT     |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|  10.0.0.87:2379 | 949b9ccaa465bea8 |  3.4.13 |  3.3 MB |     false |      false |         4 |       7917 |               7917 |        |
|  10.0.0.97:2379 | 795272eff6c8418e |  3.4.13 |  5.3 MB |     false |      false |         4 |       7917 |               7917 |        |
| 10.0.0.107:2379 | 41172b80a9c89e7f |  3.4.13 |  5.3 MB |      true |      false |         4 |       7917 |               7917 |        |
+-----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
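
Optionally, member list confirms that all three members are registered in the cluster again (same flags as above; a sketch):

etcdctl --endpoints="10.0.0.87:2379,10.0.0.97:2379,10.0.0.107:2379" \
  --cacert=/etc/kubernetes/pki/etcd/etcd-ca.pem \
  --cert=/etc/kubernetes/pki/etcd/etcd.pem \
  --key=/etc/kubernetes/pki/etcd/etcd-key.pem \
  member list --write-out=table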

III. Verification

# If master01 shows NotReady, check whether the kubelet from the step above was forgotten and never started
[root@master01 ~]# kubectl get node
NAME       STATUS   ROLES    AGE         VERSION
master01   Ready    <none>   <invalid>   v1.20.0
master02   Ready    <none>   <invalid>   v1.20.0
master03   Ready    <none>   <invalid>   v1.20.0
node01     Ready    <none>   <invalid>   v1.20.0
node02     Ready    <none>   <invalid>   v1.20.0
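
The <invalid> values in the AGE column are most likely just clock skew right after the power loss and should clear once NTP syncs. A couple of optional final checks on each master (a sketch; the unit names match the systemctl commands used above):

systemctl is-active kubelet kube-apiserver kube-controller-manager kube-scheduler kube-proxy etcd
kubectl get pods -A -o wide    # confirm workloads are back (mine were gone because the etcd data was wiped)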
