Rancher v2.4.8: Restoring an RKE HA Cluster from an etcd Backup

Restoring an RKE v1.0.14 HA cluster from an etcd backup

  • 1. Background
  • 2. Preparation
    • 2.1 Preparing the new nodes
    • 2.2 Configuration files
  • 3. Performing the restore
    • 3.1 Node cleanup script
    • 3.2 Editing the configuration file
    • 3.3 Running the etcd restore
    • 3.4 Editing the configuration file and restoring the full cluster
  • 4. Other issues

1. Background

The existing 3-node HA RKE cluster needs to be migrated; all three original nodes hold the controlplane, etcd, and worker roles.
The cluster will be restored on new nodes from an etcd backup taken on the original RKE cluster.

2. Preparation

2.1 Preparing the new nodes

The three new node IPs are: 192.168.0.56, 192.168.0.57, 192.168.0.58.
Keep the hardware specs as close as possible to the original three nodes, and keep the rest of the base environment identical as well, e.g. Docker version, kernel parameters, and firewall rules.

On the three new nodes, set up passwordless SSH trust for the regular user docker. All subsequent operations are performed as the docker user.
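As a rough sketch (user name and IPs are the ones used in this walkthrough; at minimum the node that will run rke, 192.168.0.56, needs passwordless SSH to all three nodes):

# Run as the docker user on 192.168.0.56
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa        # skip if the key already exists
for ip in 192.168.0.56 192.168.0.57 192.168.0.58; do
  ssh-copy-id -i ~/.ssh/id_rsa.pub docker@$ip
done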

2.2 Configuration files

Copy the original cluster's two configuration files (cluster.rkestate and cluster.yml), the rke_linux-amd64 binary, and the etcd backup file from /opt/rke/etcd-snapshots/ (named like 2021-02-18T15:28:28Z_etcd.zip) to the control node of the new cluster (192.168.0.56). All subsequent operations are performed on this control node as a regular user.

The restore working directory on 192.168.0.56 is /home/docker,
which contains the following three files:

rancher-cluster-restore.rkestate (the original cluster's cluster.rkestate file, containing the cluster state and CA certificate information)
rancher-cluster-restore.yml (the original cluster's cluster.yml file, containing the cluster configuration)
rke_linux-amd64 (the rke binary, copied straight from the original cluster so the version stays the same)
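For reference, a minimal copy sketch run from the original control node (the source paths are illustrative; note that the .rkestate and .yml files must end up with the same base name):

# Run on the original control node; adjust source paths to where your files actually live
scp cluster.rkestate docker@192.168.0.56:/home/docker/rancher-cluster-restore.rkestate
scp cluster.yml      docker@192.168.0.56:/home/docker/rancher-cluster-restore.yml
scp rke_linux-amd64  docker@192.168.0.56:/home/docker/
scp /opt/rke/etcd-snapshots/2021-02-18T15:28:28Z_etcd.zip docker@192.168.0.56:/home/docker/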

# Create the etcd backup directory:
mkdir -pv /opt/rke/etcd-snapshots/
mv 2021-02-18T15:28:28Z_etcd.zip  /opt/rke/etcd-snapshots/
cd  /opt/rke/etcd-snapshots/
unzip  2021-02-18T15:28:28Z_etcd.zip
# Unzipping produces the backup file:
backup/2021-02-18T15:28:28Z_etcd

Make sure all four files are in the directories specified above and named as required.
At first I put the etcd backup file in an arbitrary directory, and the two configuration files did not share the same base name, which caused a lot of errors.
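A quick sanity check before moving on (paths and file names as used above):

ls -l /home/docker/rancher-cluster-restore.rkestate \
      /home/docker/rancher-cluster-restore.yml \
      /home/docker/rke_linux-amd64 \
      /opt/rke/etcd-snapshots/backup/2021-02-18T15:28:28Z_etcd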

3. Performing the restore

Before starting the restore, make sure the preparation work on the three new nodes is complete. If the hosts have been used before, clean them up with the script below.

3.1 Node cleanup script

Official reference: https://docs.rancher.cn/docs/rancher2/cluster-admin/cleaning-cluster-nodes/_index

sudo docker rm -f $(sudo docker ps -aq);
sudo docker volume rm $(sudo docker volume ls -q);

rm -rf /etc/cni \
       /etc/kubernetes \
       /opt/cni \
       /opt/rke \
       /run/secrets/kubernetes.io \
       /run/calico \
       /run/flannel \
       /var/lib/calico \
       /var/lib/etcd \
       /var/lib/cni \
       /var/lib/kubelet \
       /var/lib/rancher/rke/log \
       /var/log/containers \
       /var/log/pods \
       /var/run/calico

for mount in $(mount | grep tmpfs | grep '/var/lib/kubelet' | awk '{ print $3 }') /var/lib/kubelet /var/lib/rancher; do umount $mount; done

rm -f /var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db

# Flush the iptables tables
## Note: if the node has custom iptables rules, run the following commands with caution
sudo iptables --flush
sudo iptables --flush --table nat
sudo iptables --flush --table filter
sudo iptables --table nat --delete-chain
sudo iptables --table filter --delete-chain
 
sudo systemctl restart containerd docker

Note: be very careful when flushing iptables rules; if you get it wrong the host may become unreachable and a reboot is the only way back.
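To double-check that a node is actually clean before the restore, something like the following can be run on each host (a rough sketch, not an exhaustive check):

sudo docker ps -a              # should show no containers
sudo docker volume ls          # should show no volumes
ls -d /etc/kubernetes /opt/rke /var/lib/etcd 2>/dev/null   # should print nothing
sudo iptables -S | head        # only the default ACCEPT policies should remain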

3.2 Editing the configuration file

Switch to the docker user, cd into /home/docker, and edit the configuration file.

Official reference
The key points are:

  • Remove or comment out the entire addons: section. Rancher's deployment and supporting configuration are already in the etcd database.
  • Change the nodes section so it points to the restore node.
  • Comment out the nodes that are not your "target node". We want the cluster to start up on that node only.

PS: I found the official docs a bit thin here, or at least the details are not spelled out clearly, so I took quite a few detours at this step.

The modified configuration file looks like this:

# If you intend to deploy Kubernetes in an air-gapped environment,
# please consult the documentation on how to configure custom RKE images.
nodes:
- address: 192.168.0.56
  port: "22"
  internal_address: ""
  role:
  - controlplane
  - worker
  - etcd
  hostname_override: ""
  user: docker
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
#- address: 192.168.0.57
#  port: "22"
#  internal_address: ""
#  role:
#  - controlplane
#  - worker
#  - etcd
#  hostname_override: ""
#  user: docker
#  docker_socket: /var/run/docker.sock
#  ssh_key: ""
#  ssh_key_path: ~/.ssh/id_rsa
#  ssh_cert: ""
#  ssh_cert_path: ""
#  labels: {}
#  taints: []
#- address: 192.168.0.58
#  port: "22"
#  internal_address: ""
#  role:
#  - controlplane
#  - worker
#  - etcd
#  hostname_override: ""
#  user: docker
#  docker_socket: /var/run/docker.sock
#  ssh_key: ""
#  ssh_key_path: ~/.ssh/id_rsa
#  ssh_cert: ""
#  ssh_cert_path: ""
#  labels: {}
#  taints: []
services:
  #etcd:
  #  image: ""
  #  extra_args:
  #    auto-compaction-retention: 3 # (unit: hours)
  #    # Set the space quota to $((6*1024*1024*1024)) bytes (6 GiB); the default is 2 GiB, the maximum is 8 GiB
  #    quota-backend-bytes: '6442450944'
  #  extra_binds: []
  #  extra_env: []
  #  external_urls: []
  #  ca_cert: ""
  #  cert: ""
  #  key: ""
  #  path: ""
  #  uid: 0
  #  gid: 0
  #  snapshot: null
  #  retention: ""
  #  creation: ""
  #  backup_config:
  #    enabled: true
  #    interval_hours: 2  # take a snapshot every 2 hours
  #    retention: 36 # keep 36 snapshots
  kube-api:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    service_cluster_ip_range: 10.243.0.0/16
    service_node_port_range: ""
    pod_security_policy: false
    always_pull_images: false
    secrets_encryption_config: null
    audit_log: null
    admission_configuration: null
    event_rate_limit: null
  kube-controller:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    cluster_cidr: 10.242.0.0/16
    service_cluster_ip_range: 10.243.0.0/16
  scheduler:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
  kubelet:
    image: ""
    extra_args:
      # Raise the maximum number of pods per node
      max-pods: "150"
    extra_binds: []
    extra_env: []
    cluster_domain: cluster.local
    infra_container_image: ""
    cluster_dns_server: 10.243.0.10
    fail_swap_on: false
    generate_serving_certificate: false
network:
  plugin: canal
  options: {}
  mtu: 0
  node_selector: {}
authentication:
  strategy: x509
  sans: []
  webhook: null
#addons: ""
#addons_include: []
system_images:
  etcd: rancher/coreos-etcd:v3.4.3-rancher1
  alpine: rancher/rke-tools:v0.1.66
  nginx_proxy: rancher/rke-tools:v0.1.66
  cert_downloader: rancher/rke-tools:v0.1.66
  kubernetes_services_sidecar: rancher/rke-tools:v0.1.66
  kubedns: rancher/k8s-dns-kube-dns:1.15.0
  dnsmasq: rancher/k8s-dns-dnsmasq-nanny:1.15.0
  kubedns_sidecar: rancher/k8s-dns-sidecar:1.15.0
  kubedns_autoscaler: rancher/cluster-proportional-autoscaler:1.7.1
  coredns: rancher/coredns-coredns:1.6.5
  coredns_autoscaler: rancher/cluster-proportional-autoscaler:1.7.1
  nodelocal: rancher/k8s-dns-node-cache:1.15.7
  kubernetes: rancher/hyperkube:v1.17.14-rancher1
  flannel: rancher/coreos-flannel:v0.12.0
  flannel_cni: rancher/flannel-cni:v0.3.0-rancher6
  calico_node: rancher/calico-node:v3.13.4
  calico_cni: rancher/calico-cni:v3.13.4
  calico_controllers: rancher/calico-kube-controllers:v3.13.4
  calico_ctl: rancher/calico-ctl:v3.13.4
  calico_flexvol: rancher/calico-pod2daemon-flexvol:v3.13.4
  canal_node: rancher/calico-node:v3.13.4
  canal_cni: rancher/calico-cni:v3.13.4
  canal_flannel: rancher/coreos-flannel:v0.12.0
  canal_flexvol: rancher/calico-pod2daemon-flexvol:v3.13.4
  weave_node: weaveworks/weave-kube:2.6.4
  weave_cni: weaveworks/weave-npc:2.6.4
  pod_infra_container: rancher/pause:3.1
  ingress: rancher/nginx-ingress-controller:nginx-0.35.0-rancher2
  ingress_backend: rancher/nginx-ingress-controller-defaultbackend:1.5-rancher1
  metrics_server: rancher/metrics-server:v0.3.6
  windows_pod_infra_container: rancher/kubelet-pause:v0.1.4
ssh_key_path: ~/.ssh/id_rsa
ssh_cert_path: ""
ssh_agent_auth: false
authorization:
  mode: rbac
  options: {}
ignore_docker_version: false
kubernetes_version: ""
private_registries: []
ingress:
  provider: ""
  options: {}
  node_selector: {}
  extra_args: {}
  dns_policy: ""
  extra_envs: []
  extra_volumes: []
  extra_volume_mounts: []
cluster_name: ""
cloud_provider:
  name: ""
prefix_path: ""
addon_job_timeout: 0
bastion_host:
  address: ""
  port: ""
  user: ""
  ssh_key: ""
  ssh_key_path: ""
  ssh_cert: ""
  ssh_cert_path: ""
monitoring:
  provider: ""
  options: {}
  node_selector: {}
restore:
  restore: false
  snapshot_name: ""
dns: null

In practice, the only changes to the configuration file are replacing the old node IPs with the new ones, and then keeping just the single restore node.

3.3 Running the etcd restore

After editing the configuration file, run the following command:

./rke_linux-amd64 etcd snapshot-restore --name 2021-02-18T15:28:28Z_etcd --config ./rancher-cluster-restore.yml

Note: --name here is the file name of the etcd backup, not a full path, and the etcd backup file must sit under /opt/rke/etcd-snapshots/backup/; putting it anywhere else will definitely fail (I tried...).
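To avoid those errors, it is worth confirming before running the restore that rke will find the snapshot at the expected location and that the binary matches the original cluster (file name as used in this walkthrough):

ls -l /opt/rke/etcd-snapshots/backup/2021-02-18T15:28:28Z_etcd
./rke_linux-amd64 --version   # should print the same rke version the original cluster was built with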

When the command finishes, a new kubeconfig file kube_config_rancher-cluster-restore.yml is generated, which can be used to check the state of the restored cluster.

# Check node status
$ kubectl --kubeconfig=kube_config_rancher-cluster-restore.yml get nodes 
NAME           STATUS   ROLES                      AGE     VERSION
192.168.0.56   Ready    controlplane,etcd,worker   4m23s   v1.17.14
# Check pod status
$ kubectl --kubeconfig=kube_config_rancher-cluster-restore.yml get pods -A

3.4 Editing the configuration file and restoring the full cluster

Before modifying the configuration file again, it is best to back it up first.
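A quick backup can be as simple as copying the file aside (path as used in this walkthrough):

cp /home/docker/rancher-cluster-restore.yml /home/docker/rancher-cluster-restore.yml.single-node.bak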
The modified configuration file looks like this:

# If you intend to deploy Kubernetes in an air-gapped environment,
# please consult the documentation on how to configure custom RKE images.
nodes:
- address: 192.168.0.56
  port: "22"
  internal_address: ""
  role:
  - controlplane
  - worker
  - etcd
  hostname_override: ""
  user: docker
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
- address: 192.168.0.57
  port: "22"
  internal_address: ""
  role:
  - controlplane
  - worker
  - etcd
  hostname_override: ""
  user: docker
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
- address: 192.168.0.58
  port: "22"
  internal_address: ""
  role:
  - controlplane
  - worker
  - etcd
  hostname_override: ""
  user: docker
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
services:
  etcd:
    image: ""
    extra_args:
      auto-compaction-retention: 3 # (unit: hours)
      # Set the space quota to $((6*1024*1024*1024)) bytes (6 GiB); the default is 2 GiB, the maximum is 8 GiB
      quota-backend-bytes: '6442450944'
    extra_binds: []
    extra_env: []
    external_urls: []
    ca_cert: ""
    cert: ""
    key: ""
    path: ""
    uid: 0
    gid: 0
    snapshot: null
    retention: ""
    creation: ""
    backup_config:
      enabled: true
      interval_hours: 2  # take a snapshot every 2 hours
      retention: 36 # keep 36 snapshots
  kube-api:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    service_cluster_ip_range: 10.243.0.0/16
    service_node_port_range: ""
    pod_security_policy: false
    always_pull_images: false
    secrets_encryption_config: null
    audit_log: null
    admission_configuration: null
    event_rate_limit: null
  kube-controller:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    cluster_cidr: 10.242.0.0/16
    service_cluster_ip_range: 10.243.0.0/16
  scheduler:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
  kubelet:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    cluster_domain: cluster.local
    infra_container_image: ""
    cluster_dns_server: 10.243.0.10
    fail_swap_on: false
    generate_serving_certificate: false
network:
  plugin: canal
  options: {}
  mtu: 0
  node_selector: {}
authentication:
  strategy: x509
  sans: []
  webhook: null
addons: ""
addons_include: []
system_images:
  etcd: rancher/coreos-etcd:v3.4.3-rancher1
  alpine: rancher/rke-tools:v0.1.66
  nginx_proxy: rancher/rke-tools:v0.1.66
  cert_downloader: rancher/rke-tools:v0.1.66
  kubernetes_services_sidecar: rancher/rke-tools:v0.1.66
  kubedns: rancher/k8s-dns-kube-dns:1.15.0
  dnsmasq: rancher/k8s-dns-dnsmasq-nanny:1.15.0
  kubedns_sidecar: rancher/k8s-dns-sidecar:1.15.0
  kubedns_autoscaler: rancher/cluster-proportional-autoscaler:1.7.1
  coredns: rancher/coredns-coredns:1.6.5
  coredns_autoscaler: rancher/cluster-proportional-autoscaler:1.7.1
  nodelocal: rancher/k8s-dns-node-cache:1.15.7
  kubernetes: rancher/hyperkube:v1.17.14-rancher1
  flannel: rancher/coreos-flannel:v0.12.0
  flannel_cni: rancher/flannel-cni:v0.3.0-rancher6
  calico_node: rancher/calico-node:v3.13.4
  calico_cni: rancher/calico-cni:v3.13.4
  calico_controllers: rancher/calico-kube-controllers:v3.13.4
  calico_ctl: rancher/calico-ctl:v3.13.4
  calico_flexvol: rancher/calico-pod2daemon-flexvol:v3.13.4
  canal_node: rancher/calico-node:v3.13.4
  canal_cni: rancher/calico-cni:v3.13.4
  canal_flannel: rancher/coreos-flannel:v0.12.0
  canal_flexvol: rancher/calico-pod2daemon-flexvol:v3.13.4
  weave_node: weaveworks/weave-kube:2.6.4
  weave_cni: weaveworks/weave-npc:2.6.4
  pod_infra_container: rancher/pause:3.1
  ingress: rancher/nginx-ingress-controller:nginx-0.35.0-rancher2
  ingress_backend: rancher/nginx-ingress-controller-defaultbackend:1.5-rancher1
  metrics_server: rancher/metrics-server:v0.3.6
  windows_pod_infra_container: rancher/kubelet-pause:v0.1.4
ssh_key_path: ~/.ssh/id_rsa
ssh_cert_path: ""
ssh_agent_auth: false
authorization:
  mode: rbac
  options: {}
ignore_docker_version: false
kubernetes_version: ""
private_registries: []
ingress:
  provider: ""
  options: {}
  node_selector: {}
  extra_args: {}
  dns_policy: ""
  extra_envs: []
  extra_volumes: []
  extra_volume_mounts: []
cluster_name: ""
cloud_provider:
  name: ""
prefix_path: ""
addon_job_timeout: 0
bastion_host:
  address: ""
  port: ""
  user: ""
  ssh_key: ""
  ssh_key_path: ""
  ssh_cert: ""
  ssh_cert_path: ""
monitoring:
  provider: ""
  options: {}
  node_selector: {}
restore:
  restore: false
  snapshot_name: ""
dns: null

Uncomment the sections that were commented out earlier and add the other new nodes back in.

Run the following command to bring the full cluster back up:

./rke_linux-amd64 up --config ./rancher-cluster-restore.yml

When it finishes, check the node and pod status:

kubectl --kubeconfig=kube_config_rancher-cluster-restore.yml get nodes
kubectl --kubeconfig=kube_config_rancher-cluster-restore.yml get pods -A 

You should see that the cluster has been restored. Some pods may still be down, e.g. because of networking issues; track down the cause for each and fix them one by one.
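A few standard kubectl checks help track down pods that have not come back (the <namespace>/<pod> placeholders are to be filled in with real names):

# List pods that are not Running or Completed
kubectl --kubeconfig=kube_config_rancher-cluster-restore.yml get pods -A | grep -Ev 'Running|Completed'
# Inspect a problematic pod
kubectl --kubeconfig=kube_config_rancher-cluster-restore.yml describe pod <pod> -n <namespace>
kubectl --kubeconfig=kube_config_rancher-cluster-restore.yml logs <pod> -n <namespace>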

4. Other issues

  1. If a Rancher HA deployment runs inside the RKE cluster, some of its configuration needs to be adjusted before the Rancher console is reachable again.
  2. For services that rely on shared storage, the related files also need to be synced to the new shared storage in advance.
  3. The configuration of stateful services also needs to be updated after the cluster has been restored.

References:
https://docs.rancher.cn/docs/rancher2/backups/2.0-2.4/restorations/ha-restoration/_index
https://www.pianshen.com/article/78151149004/
https://www.cnblogs.com/jatq/p/13344058.html
https://blog.csdn.net/hongxiaolu/article/details/113711538
