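Failure scenario
- In a disconnected (offline) OpenShift 4 cluster with multiple masters, one master node has failed (the machine is unavailable).
- The cluster remains usable in this situation.
- To bring the cluster back to a fully highly available state, remove the failed node and then add a new master node.
Current cluster state
- Check the node status
- The failed node is already in the NotReady state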
[root@kr8s-ocp-tools ~]# oc get nodes -l node-role.kubernetes.io/master
NAME STATUS ROLES AGE VERSION
master-0.ocp4-cluster1.guachen.ocp Ready master 21d v1.16.2
master-1.ocp4-cluster1.guachen.ocp Ready master 21d v1.16.2
master-2.ocp4-cluster1.guachen.ocp NotReady master 21d v1.16.2
[root@kr8s-ocp-tools ~]# oc get pod -A|grep -Ev "Running|Completed"
NAMESPACE NAME READY STATUS RESTARTS AGE
openshift-machine-config-operator etcd-quorum-guard-58696fdc97-422jn 1/1 Terminating 0 144m
openshift-machine-config-operator etcd-quorum-guard-58696fdc97-nsnnp 0/1 Pending 0 6m13s
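The node check above can also be scripted. The following is a minimal sketch, assuming the default column layout of `oc get nodes` shown here (STATUS in the second column); the function name `not_ready_nodes` is hypothetical and not part of any OpenShift tooling.

```shell
# Print the names of nodes whose STATUS column is anything other than "Ready".
# Reads `oc get nodes` table output on stdin; the first line is the header.
not_ready_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Usage against a live cluster (requires oc and a logged-in session):
#   oc get nodes -l node-role.kubernetes.io/master | not_ready_nodes
```

Note that a cordoned node reports a status such as Ready,SchedulingDisabled and would also be listed by this sketch.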
- Check etcd cluster-health
- Log in to one of the remaining healthy master nodes, for example master-0.ocp4-cluster1.guachen.ocp
- The etcd cluster is currently degraded; only two members are available
[root@kr8s-ocp-tools ~]# ssh core@master-0.ocp4-cluster1.guachen.ocp
[core@master-0 ~]$ id=$(sudo crictl ps --name etcd-member | awk 'FNR==2{ print $1}') && sudo crictl exec -it $id /bin/sh
sh-4.2# export ETCDCTL_API=2
sh-4.2# etcdctl -C https://master-0.ocp4-cluster1.guachen.ocp:2379 \
--ca-file=/etc/ssl/etcd/ca.crt \
--cert-file=$(find /etc/ssl/ -name '*peer*crt') \
--key-file=$(find /etc/ssl/ -name '*peer*key') cluster-health
member 57c7ac1766477035 is healthy: got healthy result from https://10.72.44.173:2379
failed to check the health of member d8cb362c01859289 on https://10.72.44.174:2379: Get https://10.72.44.174:2379/health: dial tcp 10.72.44.174:2379: connect: no route to host
member d8cb362c01859289 is unreachable: [https://10.72.44.174:2379] are all unreachable
member e18cfd0175af8004 is healthy: got healthy result from https://10.72.44.172:2379
cluster is degraded
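The ID of the failed member is needed later for `member remove`. As a small sketch, it can be extracted from the `cluster-health` output instead of copied by hand. This assumes the etcdctl v2 API output format shown above; the function name `unreachable_members` is hypothetical.

```shell
# Print the ID of every member that `etcdctl cluster-health` (v2 API)
# reports as unreachable. Reads the command output on stdin.
unreachable_members() {
  awk '/^member [0-9a-f]+ is unreachable/ { print $2 }'
}

# Usage inside the etcd-member container:
#   etcdctl ... cluster-health | unreachable_members
```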
Recovery procedure
1. Delete the failed node
[root@kr8s-ocp-tools ~]# oc delete node master-2.ocp4-cluster1.guachen.ocp
node "master-2.ocp4-cluster1.guachen.ocp" deleted
[root@kr8s-ocp-tools ~]# oc get nodes -l node-role.kubernetes.io/master
NAME STATUS ROLES AGE VERSION
master-0.ocp4-cluster1.guachen.ocp Ready master 22d v1.16.2
master-1.ocp4-cluster1.guachen.ocp Ready master 22d v1.16.2
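2. Remove the failed etcd member
- Log in to one of the remaining healthy master nodes, for example master-0.ocp4-cluster1.guachen.ocp
- The ID passed to member remove must be the ID that cluster-health reports as unreachable in your own cluster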
[root@kr8s-ocp-tools ~]# ssh core@master-0.ocp4-cluster1.guachen.ocp
[core@master-0 ~]$ id=$(sudo crictl ps --name etcd-member | awk 'FNR==2{ print $1}') && sudo crictl exec -it $id /bin/sh
sh-4.2# export ETCDCTL_API=2
sh-4.2# etcdctl -C https://master-0.ocp4-cluster1.guachen.ocp:2379 \
--ca-file=/etc/ssl/etcd/ca.crt \
--cert-file=$(find /etc/ssl/ -name '*peer*crt') \
--key-file=$(find /etc/ssl/ -name '*peer*key') member remove 3d95fa872c4a2282
Removed member 3d95fa872c4a2282 from cluster
sh-4.2# etcdctl -C https://master-0.ocp4-cluster1.guachen.ocp:2379 --ca-file=/etc/ssl/etcd/ca.crt --cert-file=$(find /etc/ssl/ -name '*peer*crt') --key-file=$(find /etc/ssl/ -name '*peer*key') cluster-health
member 57c7ac1766477035 is healthy: got healthy result from https://10.72.44.173:2379
member e18cfd0175af8004 is healthy: got healthy result from https://10.72.44.172:2379
cluster is healthy
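3. Add a new node as a master to restore full high availability
- In a disconnected cluster, a node is added the same way as during the initial deployment: boot a fresh RHCOS node with the master ign (Ignition) file.
- The ign file used for this node at deployment time can be reused if it still exists; otherwise regenerate it the same way as during cluster deployment.
- See the cluster deployment steps for details.
- Approve the CSRs generated by the newly added node; there are 4 of them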
[root@kr8s-ocp-tools ~]# oc get csr -o name | xargs oc adm certificate approve
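The command above approves every CSR in the cluster, including ones that are already approved. A more selective variant is sketched below. It assumes that the CONDITION value is the last column of the `oc get csr` table output; the helper name `pending_csrs` is hypothetical. Node CSRs typically arrive in stages (client certificate first, then serving certificate), so the approval may have to be repeated until nothing is left pending.

```shell
# Print the names of CSRs whose CONDITION (last column) is exactly "Pending".
# Reads `oc get csr` table output on stdin; the first line is the header.
pending_csrs() {
  awk 'NR > 1 && $NF == "Pending" { print $1 }'
}

# Usage against a live cluster (requires oc and cluster-admin privileges):
#   oc get csr | pending_csrs | xargs -r oc adm certificate approve
```

The `-r` flag of GNU xargs skips the approve call when there is nothing pending.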
4. Restore etcd membership to a full etcd cluster
a. Deploy the etcd-signer Pod
- Log in to one of the remaining healthy master nodes, for example master-0.ocp4-cluster1.guachen.ocp
i. Log in to the OpenShift cluster
[root@kr8s-ocp-tools ~]# ssh core@master-0.ocp4-cluster1.guachen.ocp
# A user with cluster-admin privileges is required
[core@master-0 ~]$ oc login https://localhost:6443
Authentication required for https://localhost:6443 (openshift)
Username: admin
Password:
Login successful.
ii. Get the pull specification of the kube-etcd-signer-server image
export KUBE_ETCD_SIGNER_SERVER=$(sudo oc adm release info --image-for kube-etcd-signer-server --registry-config=/var/lib/kubelet/config.json)
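The command above returns a quay.io reference. In a disconnected environment, pull the image from the local registry instead (your-local-registry and your-version below are placeholders to substitute, not commands to run)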
export KUBE_ETCD_SIGNER_SERVER=$(sudo crictl pull $(your-local-registry):5000/ocp4/openshift4:$(your-version)-kube-etcd-signer-server |awk '{print $7}')
### For example, in my environment
export KUBE_ETCD_SIGNER_SERVER=$(sudo crictl pull kr8s-ocp-tools:5000/ocp4/openshift4:4.3.8-kube-etcd-signer-server |awk '{print $7}')
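The same local image reference pattern recurs for every component in this procedure. The sketch below builds it from its parts. It assumes the naming scheme of the environment shown here (registry port 5000, repository ocp4/openshift4, tag of the form version-component); your mirror may be laid out differently, and the function name `local_release_image` is hypothetical.

```shell
# Build a local-registry image reference for one release component.
#   $1 = registry host, $2 = OpenShift version, $3 = component name
# The port and repository path follow this document's example environment.
local_release_image() {
  printf '%s:5000/ocp4/openshift4:%s-%s\n' "$1" "$2" "$3"
}

# Usage on a master node (requires crictl):
#   sudo crictl pull "$(local_release_image kr8s-ocp-tools 4.3.8 kube-etcd-signer-server)"
```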
iii. Generate the kube-etcd-cert-signer.yaml file
[core@master-0 ~]$ sudo -E /usr/local/bin/tokenize-signer.sh master-0.ocp4-cluster1.guachen.ocp
iv. Create the etcd-signer Pod
oc create -f assets/manifests/kube-etcd-cert-signer.yaml
b. Join the newly added master node to the etcd cluster
- Log in to the newly added master node, for example master-2.ocp4-cluster1.guachen.ocp
i. Log in to the OpenShift cluster
[root@kr8s-ocp-tools ~]# ssh core@master-2.ocp4-cluster1.guachen.ocp
[core@master-2 ~]$ oc login https://localhost:6443
Authentication required for https://localhost:6443 (openshift)
Username: admin
Password:
Login successful.
ii. Set the environment variables needed to restore the etcd cluster (required by the etcd-member-recover.sh script)
export SETUP_ETCD_ENVIRONMENT=$(sudo oc adm release info --image-for machine-config-operator --registry-config=/var/lib/kubelet/config.json)
export KUBE_CLIENT_AGENT=$(sudo oc adm release info --image-for kube-client-agent --registry-config=/var/lib/kubelet/config.json)
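The commands above resolve to quay.io references; in a disconnected environment, pull the images from the local registry instead
# Note $your-local-registry and $your-version: substitute your own values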
[core@master-2 ~]$ export SETUP_ETCD_ENVIRONMENT=$(sudo crictl pull kr8s-ocp-tools:5000/ocp4/openshift4:4.3.8-machine-config-operator |awk '{print $7}')
[core@master-2 ~]$ export KUBE_CLIENT_AGENT=$(sudo crictl pull kr8s-ocp-tools:5000/ocp4/openshift4:4.3.8-kube-client-agent |awk '{print $7}')
iii. Edit openshift-recovery-tools so that the etcd image it references points to the local image registry
# Note $your-local-registry and $your-version: substitute your own values
[core@master-2 ~]$ export ETCDIMG=$(sudo crictl pull kr8s-ocp-tools:5000/ocp4/openshift4:4.3.8-etcd |awk '{print $7}')
[core@master-2 ~]$ sudo -E sed -i "s?local etcdimg=.*?local etcdimg=\"$ETCDIMG\"?g" /usr/local/bin/openshift-recovery-tools
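Because `sed -i` rewrites the script in place, it can be worth previewing the result first. The sketch below applies the same substitution but prints to standard output and leaves the file untouched. The function name `preview_etcdimg_patch` is hypothetical, and the sketch assumes the image reference contains no `?`, `&`, or backslash characters, which would interfere with the sed expression.

```shell
# Print FILE with every `local etcdimg=...` assignment rewritten to IMAGE,
# without modifying FILE. Uses the same substitution as the sed -i command.
#   $1 = path to the script, $2 = replacement image reference
preview_etcdimg_patch() {
  file=$1
  image=$2
  sed "s?local etcdimg=.*?local etcdimg=\"$image\"?g" "$file"
}

# Usage on the new master node:
#   preview_etcdimg_patch /usr/local/bin/openshift-recovery-tools "$ETCDIMG" | grep etcdimg=
```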
iv. Run the etcd membership recovery script, etcd-member-recover.sh
sudo -E /usr/local/bin/etcd-member-recover.sh $IP etcd-member-$hostname
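- IP is the IP address of a master node that was healthy before the recovery, here 10.72.44.172 for master-0.ocp4-cluster1.guachen.ocp
- hostname is the hostname of the node whose etcd membership is being restored, for example master-2.ocp4-cluster1.guachen.ocp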
[core@master-2 ~]$ sudo -E /usr/local/bin/etcd-member-recover.sh 10.72.44.172 etcd-member-master-2.ocp4-cluster1.guachen.ocp
4320daf71e2d45927d66c6a74f46faa6a1bfe7cabb708d81344255fdc289b5bb
etcdctl version: 3.3.17
API version: 3.3
Backing up /etc/kubernetes/manifests/etcd-member.yaml to ./assets/backup/
Backing up /etc/etcd/etcd.conf to ./assets/backup/
Trying to backup etcd client certs..
etcd client certs found in /etc/kubernetes/static-pod-resources/kube-apiserver-pod-9 backing up to ./assets/backup/
Stopping etcd..
Waiting for etcd-member to stop
Waiting for etcd-member to stop
Waiting for etcd-member to stop
Waiting for etcd-member to stop
Local etcd snapshot file not found, backup skipped..
Backing up etcd certificates..
Removing etcd certs..
Populating template /usr/local/share/openshift-recovery/template/etcd-generate-certs.yaml.template
Populating template ./assets/tmp/etcd-generate-certs.stage1
Populating template ./assets/tmp/etcd-generate-certs.stage2
Starting etcd client cert recovery agent..
Waiting for certs to generate... (1/60)
Waiting for certs to generate... (2/60)
Waiting for certs to generate... (3/60)
Waiting for certs to generate... (4/60)
Stopping cert recover..
Waiting for generate-certs to stop
Patching etcd-member manifest..
Updating etcd membership..
Removing etcd data_dir /var/lib/etcd..
Member 3c6458d18aa43907 added to cluster a792367fd9b198cc
ETCD_NAME="etcd-member-master-2.ocp4-cluster1.guachen.ocp"
ETCD_INITIAL_CLUSTER="etcd-member-master-2.ocp4-cluster1.guachen.ocp=https://etcd-2.ocp4-cluster1.guachen.ocp:2380,etcd-member-master-1.ocp4-cluster1.guachen.ocp=https://etcd-1.ocp4-cluster1.guachen.ocp:2380,etcd-member-master-0.ocp4-cluster1.guachen.ocp=https://etcd-0.ocp4-cluster1.guachen.ocp:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://etcd-2.ocp4-cluster1.guachen.ocp:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
Starting etcd..
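The ETCD_INITIAL_CLUSTER value printed by the script is a comma-separated list of name=peer-URL pairs, and it is a quick way to confirm that all three members are present. A minimal sketch for splitting it into readable lines follows; the function name `initial_cluster_members` is hypothetical.

```shell
# Split an ETCD_INITIAL_CLUSTER value ("name=url,name=url,...") into one
# "name url" pair per line.
#   $1 = the ETCD_INITIAL_CLUSTER string
initial_cluster_members() {
  printf '%s\n' "$1" | tr ',' '\n' | awk -F= '{ print $1, $2 }'
}

# Usage: pass the value copied from the script output, then count the lines:
#   initial_cluster_members "$ETCD_INITIAL_CLUSTER" | wc -l
```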
Verifying the result
- Check the node and etcd pod status
[root@kr8s-ocp-tools ~]# oc get nodes -l node-role.kubernetes.io/master
NAME STATUS ROLES AGE VERSION
master-0.ocp4-cluster1.guachen.ocp Ready master 22d v1.16.2
master-1.ocp4-cluster1.guachen.ocp Ready master 22d v1.16.2
master-2.ocp4-cluster1.guachen.ocp Ready master 13m v1.16.2
[root@kr8s-ocp-tools ~]# oc -n openshift-etcd get pod -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
etcd-member-master-0.ocp4-cluster1.guachen.ocp 2/2 Running 2 22d 10.72.44.172 master-0.ocp4-cluster1.guachen.ocp
etcd-member-master-1.ocp4-cluster1.guachen.ocp 2/2 Running 2 22d 10.72.44.173 master-1.ocp4-cluster1.guachen.ocp
etcd-member-master-2.ocp4-cluster1.guachen.ocp 2/2 Running 0 68s 10.72.44.174 master-2.ocp4-cluster1.guachen.ocp
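The pod check above can be turned into a pass/fail test. The sketch below assumes the default `oc get pod` column layout (READY in the second column, STATUS in the third); the function name `all_pods_ready` is hypothetical.

```shell
# Succeed only if every pod row is Running with all containers ready (READY n/n).
# Reads `oc get pod` table output on stdin; the first line is the header.
# Fails if there are no pod rows at all.
all_pods_ready() {
  awk 'NR > 1 {
         split($2, r, "/")
         if ($3 != "Running" || r[1] != r[2]) bad = 1
         rows++
       }
       END { if (bad || rows == 0) exit 1 }'
}

# Usage against a live cluster (requires oc and a logged-in session):
#   oc -n openshift-etcd get pod | all_pods_ready && echo "etcd pods OK"
```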
- Check etcd cluster-health
- Log in to the newly added master node, for example master-2.ocp4-cluster1.guachen.ocp
[root@kr8s-ocp-tools ~]# ssh core@master-2.ocp4-cluster1.guachen.ocp
[core@master-2 ~]$ id=$(sudo crictl ps --name etcd-member | awk 'FNR==2{ print $1}') && sudo crictl exec -it $id /bin/sh
sh-4.2# export ETCDCTL_API=2
sh-4.2# etcdctl -C https://master-0.ocp4-cluster1.guachen.ocp:2379 \
--ca-file=/etc/ssl/etcd/ca.crt \
--cert-file=$(find /etc/ssl/ -name '*peer*crt') \
--key-file=$(find /etc/ssl/ -name '*peer*key') cluster-health
member 3c6458d18aa43907 is healthy: got healthy result from https://10.72.44.174:2379
member 57c7ac1766477035 is healthy: got healthy result from https://10.72.44.173:2379
member e18cfd0175af8004 is healthy: got healthy result from https://10.72.44.172:2379
cluster is healthy
The etcd cluster now has 3 members and the cluster state is healthy
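This final check can be expressed as a single pass/fail test. The sketch below assumes the etcdctl v2 `cluster-health` output format shown above; the function name `etcd_cluster_ok` is hypothetical.

```shell
# Succeed only if the cluster-health output ends in "cluster is healthy"
# and lists exactly the expected number of healthy members.
#   $1 = expected member count; the command output is read on stdin
etcd_cluster_ok() {
  awk -v want="$1" '
    /^member [0-9a-f]+ is healthy/ { n++ }
    /^cluster is healthy$/         { ok = 1 }
    END { if (!ok || n + 0 != want + 0) exit 1 }'
}

# Usage inside the etcd-member container:
#   etcdctl ... cluster-health | etcd_cluster_ok 3 && echo "etcd fully restored"
```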
- After the recovery is complete, delete the etcd-signer pod
[root@kr8s-ocp-tools ~]# oc delete pod -n openshift-config etcd-signer