We need to migrate an existing 3-node highly available RKE cluster; all three original nodes carry the controlplane, etcd and worker roles.
The cluster will be rebuilt on new hosts by restoring an etcd backup taken from the original RKE cluster.
The three new node IPs are: 192.168.0.56, 192.168.0.57, 192.168.0.58.
Keep the hardware specs of the new nodes as close as possible to the original three, and keep the rest of the base environment identical as well, e.g. Docker version, kernel parameters, firewall rules and so on.
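If no recent backup exists on the original cluster, a one-off snapshot can be taken there first with RKE's snapshot-save command (a minimal sketch; the snapshot name below is arbitrary, not part of this walkthrough):
# Run next to the original cluster's cluster.yml, using the same rke binary version
./rke_linux-amd64 etcd snapshot-save --config cluster.yml --name pre-migration-snapshot
# The snapshot is written to /opt/rke/etcd-snapshots/ on each etcd node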
On the three new nodes, set up passwordless SSH trust for the non-root user docker. All subsequent operations are performed as the docker user.
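A minimal sketch of that SSH setup, assuming the docker user already exists on all three nodes and can initially log in with a password:
# Run as the docker user on 192.168.0.56 (repeat from .57 and .58 if full mutual trust is wanted)
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
for host in 192.168.0.56 192.168.0.57 192.168.0.58; do
  ssh-copy-id docker@$host
done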
Copy the two configuration files from the original cluster (cluster.rkestate and cluster.yml), the rke_linux-amd64 binary, and the etcd backup file under /opt/rke/etcd-snapshots/ (named like 2021-02-18T15:28:28Z_etcd.zip) to the control node of the new cluster (192.168.0.56). All subsequent operations are executed on this control node with ordinary user privileges.
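A sketch of the copy step, run from one of the original nodes and assuming the files sit in the current directory there; the renamed targets match the layout described next:
scp cluster.rkestate docker@192.168.0.56:/home/docker/rancher-cluster-restore.rkestate
scp cluster.yml docker@192.168.0.56:/home/docker/rancher-cluster-restore.yml
scp rke_linux-amd64 docker@192.168.0.56:/home/docker/
scp /opt/rke/etcd-snapshots/2021-02-18T15:28:28Z_etcd.zip docker@192.168.0.56:/home/docker/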
The restore working directory on 192.168.0.56 is /home/docker, which contains three files:
rancher-cluster-restore.rkestate (the cluster.rkestate file from the original cluster; it holds the cluster state, including the CA certificate information)
rancher-cluster-restore.yml (the cluster.yml file from the original cluster; it holds the cluster configuration)
rke_linux-amd64 (the RKE binary, copied straight from the original cluster so the version stays the same)
# Create the etcd backup directory:
mkdir -pv /opt/rke/etcd-snapshots/
mv 2021-02-18T15:28:28Z_etcd.zip /opt/rke/etcd-snapshots/
cd /opt/rke/etcd-snapshots/
unzip 2021-02-18T15:28:28Z_etcd.zip
# Unzipping produces the backup file:
backup/2021-02-18T15:28:28Z_etcd
Make sure all four files are in the directories specified above and that the file names are set up as required. At first I dropped the etcd backup into a random directory, and the names of the two configuration files did not match either, which led to a lot of errors.
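A quick check that everything is where the restore expects it (paths as described above), plus making sure the binary is executable:
ls -l /home/docker/rancher-cluster-restore.rkestate \
      /home/docker/rancher-cluster-restore.yml \
      /home/docker/rke_linux-amd64 \
      /opt/rke/etcd-snapshots/backup/2021-02-18T15:28:28Z_etcd
# permissions may be lost during the copy
chmod +x /home/docker/rke_linux-amd64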
Before starting the restore, make sure the preparation work on the three new nodes is complete. If a host has been used before, it can be cleaned up with the script below.
Official reference: https://docs.rancher.cn/docs/rancher2/cluster-admin/cleaning-cluster-nodes/_index
sudo docker rm -f $(sudo docker ps -aq);
sudo docker volume rm $(sudo docker volume ls -q);
sudo rm -rf /etc/cni \
/etc/kubernetes \
/opt/cni \
/opt/rke \
/run/secrets/kubernetes.io \
/run/calico \
/run/flannel \
/var/lib/calico \
/var/lib/etcd \
/var/lib/cni \
/var/lib/kubelet \
/var/lib/rancher/rke/log \
/var/log/containers \
/var/log/pods \
/var/run/calico
for mount in $(mount | grep tmpfs | grep '/var/lib/kubelet' | awk '{ print $3 }') /var/lib/kubelet /var/lib/rancher; do sudo umount $mount; done
sudo rm -f /var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db
# Clean up iptables rules
## Note: if the node has custom iptables rules, run the following commands with caution
sudo iptables --flush
sudo iptables --flush --table nat
sudo iptables --flush --table filter
sudo iptables --table nat --delete-chain
sudo iptables --table filter --delete-chain
sudo systemctl restart containerd docker
Note: be very careful when flushing iptables rules; if something goes wrong the host may become unreachable and can only be recovered by rebooting it.
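If in doubt, save the current rules first so they can be restored, and the whole cleanup can be driven from the control node over SSH (a sketch; clean-node.sh is a hypothetical file containing the commands above, placed in the docker user's home directory on each node):
# Snapshot the current iptables rules before flushing them
sudo iptables-save > ~/iptables-$(hostname)-$(date +%F).bak
# Run the cleanup on all three new nodes from 192.168.0.56
for host in 192.168.0.56 192.168.0.57 192.168.0.58; do
  ssh -t docker@$host 'sudo bash ~/clean-node.sh'
done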
Switch to the docker user, cd to /home/docker, and edit the configuration file.
Official reference: the HA restore guide linked in the references at the end of this post.
Key point: for this first restore step, keep only the restore node (192.168.0.56) in the nodes section and comment out the other two nodes.
PS: I felt the official documentation was not detailed enough here, or at least not clear on the details, so I took quite a few detours.
The modified configuration file is as follows:
# If you intend to deploy Kubernetes in an air-gapped environment,
# please consult the documentation on how to configure custom RKE images.
nodes:
- address: 192.168.0.56
port: "22"
internal_address: ""
role:
- controlplane
- worker
- etcd
hostname_override: ""
user: docker
docker_socket: /var/run/docker.sock
ssh_key: ""
ssh_key_path: ~/.ssh/id_rsa
ssh_cert: ""
ssh_cert_path: ""
labels: {}
taints: []
#- address: 192.168.0.57
# port: "22"
# internal_address: ""
# role:
# - controlplane
# - worker
# - etcd
# hostname_override: ""
# user: docker
# docker_socket: /var/run/docker.sock
# ssh_key: ""
# ssh_key_path: ~/.ssh/id_rsa
# ssh_cert: ""
# ssh_cert_path: ""
# labels: {}
# taints: []
#- address: 192.168.0.58
# port: "22"
# internal_address: ""
# role:
# - controlplane
# - worker
# - etcd
# hostname_override: ""
# user: docker
# docker_socket: /var/run/docker.sock
# ssh_key: ""
# ssh_key_path: ~/.ssh/id_rsa
# ssh_cert: ""
# ssh_cert_path: ""
# labels: {}
# taints: []
services:
#etcd:
# image: ""
# extra_args:
# auto-compaction-retention: 3  # (hours)
# # Set the space quota to $((6*1024*1024*1024)) bytes; the default is 2G and the maximum is 8G
# quota-backend-bytes: '6442450944'
# extra_binds: []
# extra_env: []
# external_urls: []
# ca_cert: ""
# cert: ""
# key: ""
# path: ""
# uid: 0
# gid: 0
# snapshot: null
# retention: ""
# creation: ""
# backup_config:
# enabled: true
# interval_hours: 2  # take a backup every 2 hours
# retention: 36  # keep 36 backups
kube-api:
image: ""
extra_args: {}
extra_binds: []
extra_env: []
service_cluster_ip_range: 10.243.0.0/16
service_node_port_range: ""
pod_security_policy: false
always_pull_images: false
secrets_encryption_config: null
audit_log: null
admission_configuration: null
event_rate_limit: null
kube-controller:
image: ""
extra_args: {}
extra_binds: []
extra_env: []
cluster_cidr: 10.242.0.0/16
service_cluster_ip_range: 10.243.0.0/16
scheduler:
image: ""
extra_args: {}
extra_binds: []
extra_env: []
kubelet:
image: ""
extra_binds: []
extra_env: []
cluster_domain: cluster.local
infra_container_image: ""
cluster_dns_server: 10.243.0.10
fail_swap_on: false
generate_serving_certificate: false
extra_args:
      # Raise the maximum number of pods per node
max-pods: "150"
network:
plugin: canal
options: {}
mtu: 0
node_selector: {}
authentication:
strategy: x509
sans: []
webhook: null
#addons: ""
#addons_include: []
system_images:
etcd: rancher/coreos-etcd:v3.4.3-rancher1
alpine: rancher/rke-tools:v0.1.66
nginx_proxy: rancher/rke-tools:v0.1.66
cert_downloader: rancher/rke-tools:v0.1.66
kubernetes_services_sidecar: rancher/rke-tools:v0.1.66
kubedns: rancher/k8s-dns-kube-dns:1.15.0
dnsmasq: rancher/k8s-dns-dnsmasq-nanny:1.15.0
kubedns_sidecar: rancher/k8s-dns-sidecar:1.15.0
kubedns_autoscaler: rancher/cluster-proportional-autoscaler:1.7.1
coredns: rancher/coredns-coredns:1.6.5
coredns_autoscaler: rancher/cluster-proportional-autoscaler:1.7.1
nodelocal: rancher/k8s-dns-node-cache:1.15.7
kubernetes: rancher/hyperkube:v1.17.14-rancher1
flannel: rancher/coreos-flannel:v0.12.0
flannel_cni: rancher/flannel-cni:v0.3.0-rancher6
calico_node: rancher/calico-node:v3.13.4
calico_cni: rancher/calico-cni:v3.13.4
calico_controllers: rancher/calico-kube-controllers:v3.13.4
calico_ctl: rancher/calico-ctl:v3.13.4
calico_flexvol: rancher/calico-pod2daemon-flexvol:v3.13.4
canal_node: rancher/calico-node:v3.13.4
canal_cni: rancher/calico-cni:v3.13.4
canal_flannel: rancher/coreos-flannel:v0.12.0
canal_flexvol: rancher/calico-pod2daemon-flexvol:v3.13.4
weave_node: weaveworks/weave-kube:2.6.4
weave_cni: weaveworks/weave-npc:2.6.4
pod_infra_container: rancher/pause:3.1
ingress: rancher/nginx-ingress-controller:nginx-0.35.0-rancher2
ingress_backend: rancher/nginx-ingress-controller-defaultbackend:1.5-rancher1
metrics_server: rancher/metrics-server:v0.3.6
windows_pod_infra_container: rancher/kubelet-pause:v0.1.4
ssh_key_path: ~/.ssh/id_rsa
ssh_cert_path: ""
ssh_agent_auth: false
authorization:
mode: rbac
options: {}
ignore_docker_version: false
kubernetes_version: ""
private_registries: []
ingress:
provider: ""
options: {}
node_selector: {}
extra_args: {}
dns_policy: ""
extra_envs: []
extra_volumes: []
extra_volume_mounts: []
cluster_name: ""
cloud_provider:
name: ""
prefix_path: ""
addon_job_timeout: 0
bastion_host:
address: ""
port: ""
user: ""
ssh_key: ""
ssh_key_path: ""
ssh_cert: ""
ssh_cert_path: ""
monitoring:
provider: ""
options: {}
node_selector: {}
restore:
restore: false
snapshot_name: ""
dns: null
In practice, the only changes to this configuration file are replacing the old node IPs with the new ones, and then keeping only the single restore node active.
After the configuration file has been edited, run the following command:
./rke_linux-amd64 etcd snapshot-restore --name 2021-02-18T15:28:28Z_etcd --config ./rancher-cluster-restore.yml
Note: the --name option here takes the file name of the etcd backup, not a full path, and the etcd backup file must sit under /opt/rke/etcd-snapshots/backup/; any other location will fail (I tried...).
When the command finishes, a new kubeconfig file, kube_config_rancher-cluster-restore.yml, is generated; use it to check the state of the restored cluster.
# Check node status
$ kubectl --kubeconfig=kube_config_rancher-cluster-restore.yml get nodes
NAME           STATUS   ROLES                      AGE     VERSION
192.168.0.56   Ready    controlplane,etcd,worker   4m23s   v1.17.14
# Check the pods
$ kubectl --kubeconfig=kube_config_rancher-cluster-restore.yml get pods -A
Before modifying the configuration file again, it is best to make a backup of it first.
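For example (the .bak names are just a convention):
cp rancher-cluster-restore.yml rancher-cluster-restore.yml.bak
cp rancher-cluster-restore.rkestate rancher-cluster-restore.rkestate.bak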
The modified configuration file is as follows:
# If you intend to deploy Kubernetes in an air-gapped environment,
# please consult the documentation on how to configure custom RKE images.
nodes:
- address: 192.168.0.56
port: "22"
internal_address: ""
role:
- controlplane
- worker
- etcd
hostname_override: ""
user: docker
docker_socket: /var/run/docker.sock
ssh_key: ""
ssh_key_path: ~/.ssh/id_rsa
ssh_cert: ""
ssh_cert_path: ""
labels: {}
taints: []
- address: 192.168.0.57
port: "22"
internal_address: ""
role:
- controlplane
- worker
- etcd
hostname_override: ""
user: docker
docker_socket: /var/run/docker.sock
ssh_key: ""
ssh_key_path: ~/.ssh/id_rsa
ssh_cert: ""
ssh_cert_path: ""
labels: {}
taints: []
- address: 192.168.0.58
port: "22"
internal_address: ""
role:
- controlplane
- worker
- etcd
hostname_override: ""
user: docker
docker_socket: /var/run/docker.sock
ssh_key: ""
ssh_key_path: ~/.ssh/id_rsa
ssh_cert: ""
ssh_cert_path: ""
labels: {}
taints: []
services:
etcd:
image: ""
extra_args:
      auto-compaction-retention: 3  # (hours)
      # Set the space quota to $((6*1024*1024*1024)) bytes; the default is 2G and the maximum is 8G
quota-backend-bytes: '6442450944'
extra_binds: []
extra_env: []
external_urls: []
ca_cert: ""
cert: ""
key: ""
path: ""
uid: 0
gid: 0
snapshot: null
retention: ""
creation: ""
backup_config:
enabled: true
      interval_hours: 2  # take a backup every 2 hours
      retention: 36  # keep 36 backups
kube-api:
image: ""
extra_args: {}
extra_binds: []
extra_env: []
service_cluster_ip_range: 10.243.0.0/16
service_node_port_range: ""
pod_security_policy: false
always_pull_images: false
secrets_encryption_config: null
audit_log: null
admission_configuration: null
event_rate_limit: null
kube-controller:
image: ""
extra_args: {}
extra_binds: []
extra_env: []
cluster_cidr: 10.242.0.0/16
service_cluster_ip_range: 10.243.0.0/16
scheduler:
image: ""
extra_args: {}
extra_binds: []
extra_env: []
kubelet:
image: ""
extra_args: {}
extra_binds: []
extra_env: []
cluster_domain: cluster.local
infra_container_image: ""
cluster_dns_server: 10.243.0.10
fail_swap_on: false
generate_serving_certificate: false
extra_args: {}
network:
plugin: canal
options: {}
mtu: 0
node_selector: {}
authentication:
strategy: x509
sans: []
webhook: null
addons: ""
addons_include: []
system_images:
etcd: rancher/coreos-etcd:v3.4.3-rancher1
alpine: rancher/rke-tools:v0.1.66
nginx_proxy: rancher/rke-tools:v0.1.66
cert_downloader: rancher/rke-tools:v0.1.66
kubernetes_services_sidecar: rancher/rke-tools:v0.1.66
kubedns: rancher/k8s-dns-kube-dns:1.15.0
dnsmasq: rancher/k8s-dns-dnsmasq-nanny:1.15.0
kubedns_sidecar: rancher/k8s-dns-sidecar:1.15.0
kubedns_autoscaler: rancher/cluster-proportional-autoscaler:1.7.1
coredns: rancher/coredns-coredns:1.6.5
coredns_autoscaler: rancher/cluster-proportional-autoscaler:1.7.1
nodelocal: rancher/k8s-dns-node-cache:1.15.7
kubernetes: rancher/hyperkube:v1.17.14-rancher1
flannel: rancher/coreos-flannel:v0.12.0
flannel_cni: rancher/flannel-cni:v0.3.0-rancher6
calico_node: rancher/calico-node:v3.13.4
calico_cni: rancher/calico-cni:v3.13.4
calico_controllers: rancher/calico-kube-controllers:v3.13.4
calico_ctl: rancher/calico-ctl:v3.13.4
calico_flexvol: rancher/calico-pod2daemon-flexvol:v3.13.4
canal_node: rancher/calico-node:v3.13.4
canal_cni: rancher/calico-cni:v3.13.4
canal_flannel: rancher/coreos-flannel:v0.12.0
canal_flexvol: rancher/calico-pod2daemon-flexvol:v3.13.4
weave_node: weaveworks/weave-kube:2.6.4
weave_cni: weaveworks/weave-npc:2.6.4
pod_infra_container: rancher/pause:3.1
ingress: rancher/nginx-ingress-controller:nginx-0.35.0-rancher2
ingress_backend: rancher/nginx-ingress-controller-defaultbackend:1.5-rancher1
metrics_server: rancher/metrics-server:v0.3.6
windows_pod_infra_container: rancher/kubelet-pause:v0.1.4
ssh_key_path: ~/.ssh/id_rsa
ssh_cert_path: ""
ssh_agent_auth: false
authorization:
mode: rbac
options: {}
ignore_docker_version: false
kubernetes_version: ""
private_registries: []
ingress:
provider: ""
options: {}
node_selector: {}
extra_args: {}
dns_policy: ""
extra_envs: []
extra_volumes: []
extra_volume_mounts: []
cluster_name: ""
cloud_provider:
name: ""
prefix_path: ""
addon_job_timeout: 0
bastion_host:
address: ""
port: ""
user: ""
ssh_key: ""
ssh_key_path: ""
ssh_cert: ""
ssh_cert_path: ""
monitoring:
provider: ""
options: {}
node_selector: {}
restore:
restore: false
snapshot_name: ""
dns: null
Compared with the restore config above, the commented-out sections are uncommented and the other two new nodes are added back.
Run the following command to bring the cluster up:
./rke_linux-amd64 up --config ./rancher-cluster-restore.yml
When it finishes, check the node and pod status:
kubectl --kubeconfig=kube_config_rancher-cluster-restore.yml get nodes
kubectl --kubeconfig=kube_config_rancher-cluster-restore.yml get pods -A
You can see that the cluster has been restored. Some pods may not have come back yet, often for network-related reasons; track down each cause and fix them one by one.
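A few standard commands for digging into pods that have not come back, with <namespace> and <pod-name> taken from the get pods -A output above:
# Show scheduling and networking events for a problem pod
kubectl --kubeconfig=kube_config_rancher-cluster-restore.yml describe pod <pod-name> -n <namespace>
# Check its container logs
kubectl --kubeconfig=kube_config_rancher-cluster-restore.yml logs <pod-name> -n <namespace>
# Delete a stuck pod and let its controller recreate it
kubectl --kubeconfig=kube_config_rancher-cluster-restore.yml delete pod <pod-name> -n <namespace>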
References:
https://docs.rancher.cn/docs/rancher2/backups/2.0-2.4/restorations/ha-restoration/_index
https://www.pianshen.com/article/78151149004/
https://www.cnblogs.com/jatq/p/13344058.html
https://blog.csdn.net/hongxiaolu/article/details/113711538