etcd Cluster Failure Recovery Test

We work on different modules, so I am reposting a colleague's article here in the hope that it helps anyone who needs it.

Setting up the cluster

 ## Environment preparation
192.168.244.11 
192.168.244.12 
192.168.244.13
 
 ## Install etcd on each node
 yum -y install etcd
 
 ## Build the cluster
 ## Run on node 192.168.244.11:
 etcd --name etcd01 --initial-advertise-peer-urls http://192.168.244.11:2380 \
  --data-dir /var/lib/etcd/default.etcd \
  --listen-peer-urls http://192.168.244.11:2380 \
  --listen-client-urls http://192.168.244.11:2379,http://127.0.0.1:2379 \
  --advertise-client-urls http://192.168.244.11:2379 \
  --initial-cluster-token etcd-cluster \
  --initial-cluster etcd01=http://192.168.244.11:2380,etcd02=http://192.168.244.12:2380,etcd03=http://192.168.244.13:2380 \
  --initial-cluster-state new >> /tmp/etcd.log 2>&1 &
  
  ## Run on node 192.168.244.12:
 etcd --name etcd02 --initial-advertise-peer-urls http://192.168.244.12:2380 \
  --data-dir /var/lib/etcd/default.etcd \
  --listen-peer-urls http://192.168.244.12:2380 \
  --listen-client-urls http://192.168.244.12:2379,http://127.0.0.1:2379 \
  --advertise-client-urls http://192.168.244.12:2379 \
  --initial-cluster-token etcd-cluster \
  --initial-cluster etcd01=http://192.168.244.11:2380,etcd02=http://192.168.244.12:2380,etcd03=http://192.168.244.13:2380 \
  --initial-cluster-state new >> /tmp/etcd.log 2>&1 &
  
   ## Run on node 192.168.244.13:
  etcd --name etcd03 --initial-advertise-peer-urls http://192.168.244.13:2380 \
  --data-dir /var/lib/etcd/default.etcd \
  --listen-peer-urls http://192.168.244.13:2380 \
  --listen-client-urls http://192.168.244.13:2379,http://127.0.0.1:2379 \
  --advertise-client-urls http://192.168.244.13:2379 \
  --initial-cluster-token etcd-cluster \
  --initial-cluster etcd01=http://192.168.244.11:2380,etcd02=http://192.168.244.12:2380,etcd03=http://192.168.244.13:2380 \
  --initial-cluster-state new >> /tmp/etcd.log 2>&1 &

An etcd service started this way writes its data as the root account. To run it under the etcd account instead, start it as a systemctl service.
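The three start commands above differ only in the node name and IP, and all share the same --initial-cluster value. A minimal sketch of building that value from name=ip pairs follows. The `initial_cluster` helper is hypothetical (it is not part of etcd), and it assumes plain-HTTP peers on port 2380 as in the commands above.

```shell
#!/bin/sh
# Sketch: build the --initial-cluster value from name=ip pairs.
# `initial_cluster` is a hypothetical helper, not an etcd command.
# Assumption: every peer listens on http://<ip>:2380.
initial_cluster() {
  ic_out=""
  for ic_pair in "$@"; do
    ic_name=${ic_pair%%=*}   # text before the first '='
    ic_ip=${ic_pair#*=}      # text after the first '='
    # Prefix a comma only when the list already has an entry.
    ic_out="${ic_out:+$ic_out,}${ic_name}=http://${ic_ip}:2380"
  done
  printf '%s\n' "$ic_out"
}

# Usage idea:
#   CLUSTER=$(initial_cluster etcd01=192.168.244.11 etcd02=192.168.244.12 etcd03=192.168.244.13)
#   then pass --initial-cluster "$CLUSTER" to each node's etcd command.
```

Generating the string once and reusing it on every node avoids the copy-and-paste mistakes that are easy to make in these long command lines.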

Checking the cluster

# ETCDCTL_API=2  etcdctl  member list
3875d2e31fd10372: name=etcd03 peerURLs=http://192.168.244.13:2380 clientURLs=http://192.168.244.13:2379 isLeader=false
397b52ecac7810c7: name=etcd01 peerURLs=http://192.168.244.11:2380 clientURLs=http://192.168.244.11:2379 isLeader=true
60192088bf0f1cbc: name=etcd02 peerURLs=http://192.168.244.12:2380 clientURLs=http://192.168.244.12:2379 isLeader=false

# ETCDCTL_API=2 etcdctl cluster-health
member 3875d2e31fd10372 is healthy: got healthy result from http://192.168.244.13:2379
member 397b52ecac7810c7 is healthy: got healthy result from http://192.168.244.11:2379
member 60192088bf0f1cbc is healthy: got healthy result from http://192.168.244.12:2379
cluster is healthy

# ETCDCTL_API=3 etcdctl  --endpoints 192.168.244.11:2379,192.168.244.12:2379,192.168.244.13:2379 member list
3875d2e31fd10372, started, etcd03, http://192.168.244.13:2380, http://192.168.244.13:2379
397b52ecac7810c7, started, etcd01, http://192.168.244.11:2380, http://192.168.244.11:2379
60192088bf0f1cbc, started, etcd02, http://192.168.244.12:2380, http://192.168.244.12:2379


# ETCDCTL_API=3  etcdctl   --endpoints 192.168.244.11:2379,192.168.244.12:2379,192.168.244.13:2379 endpoint status  --write-out="table"
+---------------------+------------------+---------+---------+-----------+-----------+------------+
|      ENDPOINT       |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+---------------------+------------------+---------+---------+-----------+-----------+------------+
| 192.168.244.11:2379 | 397b52ecac7810c7 |  3.3.11 |   20 kB |      true |       120 |         13 |
| 192.168.244.12:2379 | 60192088bf0f1cbc |  3.3.11 |   16 kB |     false |       120 |         13 |
| 192.168.244.13:2379 | 3875d2e31fd10372 |  3.3.11 |   20 kB |     false |       120 |         13 |
+---------------------+------------------+---------+---------+-----------+-----------+------------+

# ETCDCTL_API=3  etcdctl   --endpoints 192.168.244.11:2379,192.168.244.12:2379,192.168.244.13:2379 endpoint health  --write-out="table"
192.168.244.13:2379 is healthy: successfully committed proposal: took = 7.764781ms
192.168.244.12:2379 is healthy: successfully committed proposal: took = 7.589569ms
192.168.244.11:2379 is healthy: successfully committed proposal: took = 8.199871ms
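When scripting these checks, it can help to pull the current leader out of the `endpoint status` table. The following is a rough sketch, assuming the table layout printed by the 3.3.11 etcdctl shown above, where IS LEADER is the fifth column. `leader_endpoint` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: print the endpoint whose IS LEADER column is "true".
# Reads `etcdctl endpoint status --write-out=table` output on stdin.
# Assumption: columns are ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | ...
leader_endpoint() {
  # Splitting on '|' makes ENDPOINT field 2 and IS LEADER field 6,
  # because field 1 is the empty text before the first '|'.
  awk -F'|' '$6 ~ /true/ { gsub(/ /, "", $2); print $2 }'
}

# Usage idea:
#   ETCDCTL_API=3 etcdctl --endpoints 192.168.244.11:2379,192.168.244.12:2379,192.168.244.13:2379 \
#     endpoint status --write-out=table | leader_endpoint
```

Border and header rows are skipped naturally because their sixth field never contains "true".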

Failure recovery

Fewer than half of the nodes are down

  1. Remove the failed node, then add it back as a new node (run this on a healthy node)
etcdctl member remove 397b52ecac7810c7
etcdctl member add etcd01 http://192.168.244.11:2380

The following is a concrete walkthrough:

  • Find the failed node's ID, remove that member, then add the node back with its peerURLs.

 - Kill the etcd01 node


# ETCDCTL_API=3  etcdctl   --endpoints 192.168.244.11:2379,192.168.244.12:2379,192.168.244.13:2379 endpoint status  --write-out="table"
Failed to get the status of endpoint 192.168.244.11:2379 (context deadline exceeded)
+---------------------+------------------+---------+---------+-----------+-----------+------------+
|      ENDPOINT       |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+---------------------+------------------+---------+---------+-----------+-----------+------------+
| 192.168.244.12:2379 | 60192088bf0f1cbc |  3.3.11 |   16 kB |      true |       121 |         64 |
| 192.168.244.13:2379 | 3875d2e31fd10372 |  3.3.11 |   20 kB |     false |       121 |         64 |
+---------------------+------------------+---------+---------+-----------+-----------+------------+
# ETCDCTL_API=2  etcdctl  member list
3875d2e31fd10372: name=etcd03 peerURLs=http://192.168.244.13:2380 clientURLs=http://192.168.244.13:2379 isLeader=false
397b52ecac7810c7: name=etcd01 peerURLs=http://192.168.244.11:2380 clientURLs=http://192.168.244.11:2379 isLeader=false
60192088bf0f1cbc: name=etcd02 peerURLs=http://192.168.244.12:2380 clientURLs=http://192.168.244.12:2379 isLeader=true
  • Remove the etcd01 node

    #etcdctl member remove 397b52ecac7810c7
    
    Removed member 397b52ecac7810c7 from cluster
    
  • Add the node

    #etcdctl member add etcd01 http://192.168.244.11:2380
    
    Added member named etcd01 with ID a25253c213bdca83 to cluster
    
    ETCD_NAME="etcd01"
    ETCD_INITIAL_CLUSTER="etcd03=http://192.168.244.13:2380,etcd02=http://192.168.244.12:2380,etcd01=http://192.168.244.11:2380"
    ETCD_INITIAL_CLUSTER_STATE="existing"
    
  • Start the etcd01 node, paying attention to the hints printed by the member add command above

    #etcd --name etcd01 --initial-advertise-peer-urls http://192.168.244.11:2380 --data-dir /var/lib/etcd/default.etcd --listen-peer-urls http://192.168.244.11:2380 --listen-client-urls http://192.168.244.11:2379,http://127.0.0.1:2379 --advertise-client-urls http://192.168.244.11:2379 --initial-cluster-token etcd-cluster --initial-cluster etcd01=http://192.168.244.11:2380,etcd02=http://192.168.244.12:2380,etcd03=http://192.168.244.13:2380 --initial-cluster-state existing >> /tmp/etcd.log 2>&1
    

    The node failed to start and reported the following errors:

    2019-05-10 11:30:07.141141 E | etcdserver: the member has been permanently removed from the cluster
    2019-05-10 11:30:07.141195 I | etcdserver: the data-dir used by this member must be removed.
    

    Delete the data directory and start the node again.

    This time it starts successfully:

    # ETCDCTL_API=2 etcdctl cluster-health
    member 3875d2e31fd10372 is healthy: got healthy result from http://192.168.244.13:2379
    member 60192088bf0f1cbc is healthy: got healthy result from http://192.168.244.12:2379
    member a25253c213bdca83 is healthy: got healthy result from http://192.168.244.11:2379
    cluster is healthy
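The "find the failed node's ID" step in the walkthrough above can be scripted instead of copied by hand. A minimal sketch, assuming the v2 `member list` output format shown in this article (`<id>: name=<name> peerURLs=...`). `member_id_by_name` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: look up a member ID by node name.
# Reads `ETCDCTL_API=2 etcdctl member list` output on stdin.
# Assumption: each line looks like "<id>: name=<name> peerURLs=... clientURLs=... isLeader=...".
member_id_by_name() {
  awk -v want="name=$1" '{
    id = $1
    sub(/:$/, "", id)                  # drop the trailing colon after the ID
    for (i = 2; i <= NF; i++)
      if ($i == want) print id         # exact match on the name=... field
  }'
}

# Usage idea (run on a healthy node):
#   ID=$(ETCDCTL_API=2 etcdctl member list | member_id_by_name etcd01)
#   ETCDCTL_API=2 etcdctl member remove "$ID"
```

Matching the whole `name=...` field avoids accidentally matching a name that is only a prefix of another one.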
    
    

More than half of the nodes are down

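Take a three-node cluster with two nodes down as an example.

Bring down etcd01 and etcd02 (by killing the processes directly). With more than half of the nodes down, the cluster is unavailable.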

  ## The cluster contents can no longer be queried
  # ETCDCTL_API=3  etcdctl   get foo
  Error: context deadline exceeded
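The reason the request times out is the majority rule: a cluster of N members needs floor(N/2) + 1 of them alive to serve requests. A small sketch of that arithmetic follows; `quorum` and `has_quorum` are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: majority arithmetic for an N-member cluster.

# quorum N: smallest number of members that forms a majority.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# has_quorum TOTAL ALIVE: succeeds when ALIVE members are enough for a majority.
has_quorum() {
  [ "$2" -ge "$(quorum "$1")" ]
}

# Usage idea:
#   has_quorum 3 1 || echo "no quorum: only a forced single-node restart or a backup will help"
```

For the three-node cluster here the quorum is 2, so losing one node is tolerated and losing two is not.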

Now start the etcd03 node as a single-node cluster:

#etcd --name etcd03 --initial-advertise-peer-urls http://192.168.244.13:2380 \
--data-dir /var/lib/etcd/default.etcd \
--listen-peer-urls http://192.168.244.13:2380 \
--listen-client-urls http://192.168.244.13:2379,http://127.0.0.1:2379 \
--advertise-client-urls http://192.168.244.13:2379 \
--initial-cluster-token etcd-cluster \
--initial-cluster etcd03=http://192.168.244.13:2380 \
--initial-cluster-state=new \
--force-new-cluster >> /tmp/etcd.log 2>&1 &

# ETCDCTL_API=2 etcdctl cluster-health
member 3875d2e31fd10372 is healthy: got healthy result from http://192.168.244.13:2379
cluster is healthy
# ETCDCTL_API=2  etcdctl  member list
3875d2e31fd10372: name=etcd03 peerURLs=http://192.168.244.13:2380 clientURLs=http://192.168.244.13:2379 isLeader=true
# ETCDCTL_API=3  etcdctl   get foo
foo
1
# ETCDCTL_API=3  etcdctl   get name
name
hello world
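  1. Note the --force-new-cluster flag used here: it resets the cluster ID and all cluster membership information.
  2. Once started as a single-node cluster, the node serves requests normally again.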

Add the other nodes from the etcd03 node:

# etcdctl member add etcd01 http://192.168.244.11:2380
Added member named etcd01 with ID 6dd47e3c8257f639 to cluster

ETCD_NAME="etcd01"
ETCD_INITIAL_CLUSTER="etcd03=http://192.168.244.13:2380,etcd01=http://192.168.244.11:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
# etcdctl member add etcd02 http://192.168.244.12:2380
client: etcd cluster is unavailable or misconfigured; error #0: client: etcd member http://192.168.244.13:2379 has no leader

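Add nodes one at a time, and wait until each added node has started successfully before adding the next. Otherwise the later add fails, as shown above: once etcd01 is registered but not yet running, only one of the two members is up, so the cluster has lost its quorum again, just as when more than half of the nodes were down.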
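The "wait until the added node is up" step can be sketched as a small retry loop. `wait_until` is a hypothetical helper; the usage comment assumes that `etcdctl endpoint health` exits with a non-zero status while the endpoint is unhealthy or unreachable.

```shell
#!/bin/sh
# Sketch: retry a command until it succeeds or the attempts run out.
# wait_until TRIES DELAY CMD [ARGS...]
wait_until() {
  wu_tries=$1
  wu_delay=$2
  shift 2                       # "$@" is now the command to run
  wu_i=0
  while [ "$wu_i" -lt "$wu_tries" ]; do
    if "$@"; then
      return 0                  # command succeeded
    fi
    wu_i=$((wu_i + 1))
    sleep "$wu_delay"
  done
  return 1                      # gave up after TRIES attempts
}

# Usage idea, after `etcdctl member add etcd01 ...` and starting etcd01:
#   wait_until 30 2 env ETCDCTL_API=3 etcdctl --endpoints 192.168.244.11:2379 endpoint health \
#     && etcdctl member add etcd02 http://192.168.244.12:2380
```

Gating the second `member add` on the first node's health check keeps the cluster from ever having more registered members than it can form a majority with.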

Start the newly added etcd01 node:

## First delete the old data
# cd /var/lib/etcd
# rm -rf default.etcd/

## Start the node
# etcd --name etcd01 --initial-advertise-peer-urls http://192.168.244.11:2380 --data-dir /var/lib/etcd/default.etcd --listen-peer-urls http://192.168.244.11:2380 --listen-client-urls http://192.168.244.11:2379,http://127.0.0.1:2379 --advertise-client-urls http://192.168.244.11:2379 --initial-cluster-token etcd-cluster --initial-cluster etcd01=http://192.168.244.11:2380,etcd03=http://192.168.244.13:2380 --initial-cluster-state existing

Then add etcd02 and start the etcd02 node in the same way.

Once that is done, the cluster is healthy again:

# ETCDCTL_API=2  etcdctl  member list
3875d2e31fd10372: name=etcd03 peerURLs=http://192.168.244.13:2380 clientURLs=http://192.168.244.13:2379 isLeader=true
6dd47e3c8257f639: name=etcd01 peerURLs=http://192.168.244.11:2380 clientURLs=http://192.168.244.11:2379 isLeader=false
994b3d27fdf1df2c: name=etcd02 peerURLs=http://192.168.244.12:2380 clientURLs=http://192.168.244.12:2379 isLeader=false
# ETCDCTL_API=2 etcdctl cluster-health
member 3875d2e31fd10372 is healthy: got healthy result from http://192.168.244.13:2379
member 6dd47e3c8257f639 is healthy: got healthy result from http://192.168.244.11:2379
member 994b3d27fdf1df2c is healthy: got healthy result from http://192.168.244.12:2379
cluster is healthy

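The entire cluster is down

When the entire cluster is down, it can only be recovered from a backup.

Since the data model and API that k8s uses to store data in etcd are v3, only v3 backup and restore are covered here.

v2 and v3 differ in API and in storage, and their data is isolated from each other. After upgrading from v2 to v3, v2 data must still be accessed through the v2 API, and v3 data can only be accessed through the v3 API.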

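## Back up etcd (if the whole cluster is down, use the file from the routine backups)
## --endpoints selects which machine's data is backed up, even though the nodes end up holding the same data; without --endpoints, the local node's data is backed up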
ETCDCTL_API=3 etcdctl snapshot --endpoints="192.168.244.11:2379" save /tmp/etcd_backup/etcdback.db

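## To guard against one node having a problem, several addresses can be listed in --endpoints (note that etcdctl 3.4 and later accept only a single endpoint for snapshot save):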
ETCDCTL_API=3 etcdctl snapshot --endpoints=http://192.168.244.11:2379,http://192.168.244.12:2379,http://192.168.244.13:2379 save /tmp/etcd_backup/etcdback.db
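Since a full-cluster recovery depends on routine backups, it is worth keeping several dated snapshots rather than overwriting one file. Below is a minimal rotation sketch. The `etcdback-<date>.db` naming scheme and the `prune_backups` helper are assumptions for illustration, not something the original setup uses.

```shell
#!/bin/sh
# Sketch: keep only the newest KEEP snapshot files in a backup directory.
# Assumption: files are named etcdback-<YYYYMMDD>.db, so a plain text sort
# is also a chronological sort, and the names contain no newlines.
# prune_backups DIR KEEP
prune_backups() {
  pb_dir=$1
  pb_keep=$2
  # Newest first, then skip the first KEEP entries and delete the rest.
  ls -1 "$pb_dir"/etcdback-*.db 2>/dev/null | sort -r | tail -n +"$((pb_keep + 1))" |
    while IFS= read -r pb_file; do
      rm -f -- "$pb_file"
    done
}

# Usage idea (for example from a daily cron job):
#   ETCDCTL_API=3 etcdctl --endpoints=192.168.244.11:2379 \
#     snapshot save "/tmp/etcd_backup/etcdback-$(date +%Y%m%d).db"
#   prune_backups /tmp/etcd_backup 7
```

If the directory holds fewer files than KEEP, or none at all, nothing is deleted.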

Prepare three fresh machines, install etcd on each with yum, and delete the contents of /var/lib/etcd/.

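## Restore the data by running the matching command on each of the three machines (the IPs and node names here are the same as before; adjust as needed):
## If the cluster information is omitted on restore, the node comes up as a single-node cluster and only a single member is restored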
ETCDCTL_API=3 etcdctl --name=etcd01 --endpoints="http://192.168.244.11:2379"  --initial-cluster-token=etcd-cluster --initial-advertise-peer-urls=http://192.168.244.11:2380 --initial-cluster=etcd01=http://192.168.244.11:2380,etcd02=http://192.168.244.12:2380,etcd03=http://192.168.244.13:2380 --data-dir=/var/lib/etcd/default.etcd snapshot restore  /tmp/etcd_backup/etcdback.db

ETCDCTL_API=3 etcdctl --name=etcd02 --endpoints="http://192.168.244.12:2379"  --initial-cluster-token=etcd-cluster --initial-advertise-peer-urls=http://192.168.244.12:2380 --initial-cluster=etcd01=http://192.168.244.11:2380,etcd02=http://192.168.244.12:2380,etcd03=http://192.168.244.13:2380 --data-dir=/var/lib/etcd/default.etcd snapshot restore  /tmp/etcd_backup/etcdback.db

ETCDCTL_API=3 etcdctl --name=etcd03 --endpoints="http://192.168.244.13:2379"  --initial-cluster-token=etcd-cluster --initial-advertise-peer-urls=http://192.168.244.13:2380 --initial-cluster=etcd01=http://192.168.244.11:2380,etcd02=http://192.168.244.12:2380,etcd03=http://192.168.244.13:2380 --data-dir=/var/lib/etcd/default.etcd snapshot restore  /tmp/etcd_backup/etcdback.db

Start the nodes, with --initial-cluster-state set to existing on all of them:

etcd --name etcd01 --initial-advertise-peer-urls http://192.168.244.11:2380 \
  --data-dir /var/lib/etcd/default.etcd \
  --listen-peer-urls http://192.168.244.11:2380 \
  --listen-client-urls http://192.168.244.11:2379,http://127.0.0.1:2379 \
  --advertise-client-urls http://192.168.244.11:2379 \
  --initial-cluster-token etcd-cluster \
  --initial-cluster etcd01=http://192.168.244.11:2380,etcd02=http://192.168.244.12:2380,etcd03=http://192.168.244.13:2380 \
  --initial-cluster-state existing >> /tmp/etcd.log 2>&1 &
  
etcd --name etcd02 --initial-advertise-peer-urls http://192.168.244.12:2380 \
  --data-dir /var/lib/etcd/default.etcd \
  --listen-peer-urls http://192.168.244.12:2380 \
  --listen-client-urls http://192.168.244.12:2379,http://127.0.0.1:2379 \
  --advertise-client-urls http://192.168.244.12:2379 \
  --initial-cluster-token etcd-cluster \
  --initial-cluster etcd01=http://192.168.244.11:2380,etcd02=http://192.168.244.12:2380,etcd03=http://192.168.244.13:2380 \
  --initial-cluster-state existing >> /tmp/etcd.log 2>&1 &
  
etcd --name etcd03 --initial-advertise-peer-urls http://192.168.244.13:2380 \
  --data-dir /var/lib/etcd/default.etcd \
  --listen-peer-urls http://192.168.244.13:2380 \
  --listen-client-urls http://192.168.244.13:2379,http://127.0.0.1:2379 \
  --advertise-client-urls http://192.168.244.13:2379 \
  --initial-cluster-token etcd-cluster \
  --initial-cluster etcd01=http://192.168.244.11:2380,etcd02=http://192.168.244.12:2380,etcd03=http://192.168.244.13:2380 \
  --initial-cluster-state existing >> /tmp/etcd.log 2>&1 &

At this point the cluster recovery is complete.

Common issues

  1. When a node that was shut down is started again manually, --initial-cluster-state stays new, because the node was never removed from the cluster and re-added.

  2. The following compares restoring a single node with restoring a cluster, using the etcdctl snapshot restore command:

    # ETCDCTL_API=3 etcdctl  --data-dir=/var/lib/etcd/default.etcd snapshot restore  /tmp/etcd_backup/etcdback.db
    
    2019-05-10 17:19:45.016910 I | etcdserver/membership: added member 8e9e05c52164694d [http://localhost:2380] to cluster cdf818194e3a8c32
    
    #   ETCDCTL_API=3 etcdctl --name=etcd03 --endpoints="http://192.168.244.13:2379"  --initial-cluster-token=etcd-cluster --initial-advertise-peer-urls=http://192.168.244.13:2380 --initial-cluster=etcd01=http://192.168.244.11:2380,etcd02=http://192.168.244.12:2380,etcd03=http://192.168.244.13:2380 --data-dir=/var/lib/etcd/default.etcd snapshot restore  /tmp/etcd_backup/etcdback.db
    2019-05-10 17:24:27.743383 I | etcdserver/membership: added member 3875d2e31fd10372 [http://192.168.244.13:2380] to cluster f3c226181d7ed864
    2019-05-10 17:24:27.743634 I | etcdserver/membership: added member 397b52ecac7810c7 [http://192.168.244.11:2380] to cluster f3c226181d7ed864
    2019-05-10 17:24:27.743686 I | etcdserver/membership: added member 60192088bf0f1cbc [http://192.168.244.12:2380] to cluster f3c226181d7ed864
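To compare the two restore modes at a glance, the restore log can be reduced to "cluster ID, number of members". A rough sketch, assuming the log line format shown above where the cluster ID is the last word of each "added member" line. `restore_summary` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: count restored members per cluster ID.
# Reads `etcdctl snapshot restore` log output on stdin.
# Assumption: lines look like "... added member <id> [<peer-url>] to cluster <cluster-id>".
restore_summary() {
  awk '/added member/ { n[$NF]++ } END { for (c in n) print c, n[c] }' | sort
}

# Usage idea:
#   ETCDCTL_API=3 etcdctl --data-dir=/var/lib/etcd/default.etcd \
#     snapshot restore /tmp/etcd_backup/etcdback.db 2>&1 | restore_summary
```

With the logs above, the single-node restore yields one member in cluster cdf818194e3a8c32, while the cluster restore yields three members in cluster f3c226181d7ed864.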
    
    
