Handling an etcd cluster failure

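Writing code while knowing what works but not why it works is not the goal.
Today one node of our etcd cluster went down, so here is a short write-up of how I handled it. I need to read the Raft algorithm carefully over the Spring Festival.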

Remove the faulty node from the cluster

//Run the following commands on a healthy node to get the ID of the faulty node (here hostB is the one that is down)
# /usr/local/etcd/etcdctl  member list
4121a0b16ac4c150: name=hostA peerURLs=http://10.33.1.10:2380 clientURLs=http://10.33.1.10:2379
6b7887955883c31a: name=hostB peerURLs=http://10.33.1.11:2380 clientURLs=http://10.33.1.11:2379
d8dd30b565cff95c: name=hostC peerURLs=http://10.33.1.12:2380 clientURLs=http://10.33.1.12:2379
# /usr/local/etcd/etcdctl  cluster-health 
cluster is healthy
member 4121a0b16ac4c150 is healthy
member 6b7887955883c31a is unhealthy
member d8dd30b565cff95c is healthy

//Remove the broken node
# /usr/local/etcd/etcdctl  member remove 6b7887955883c31a
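Reading the unhealthy member's ID off the `cluster-health` output by eye works for three members; for larger clusters a small filter helps. The sketch below assumes the v2 `etcdctl` output format shown above (`member <ID> is unhealthy`, possibly followed by extra detail), and the function name `unhealthy_ids` is hypothetical. It only prints IDs and deliberately leaves the `member remove` step to a human.

```shell
# Hypothetical helper (not part of etcdctl): reads `etcdctl cluster-health`
# output on stdin, in the v2 format shown above, and prints the ID of every
# member reported as unhealthy, one per line.
unhealthy_ids() {
  # Member lines look like: "member 6b7887955883c31a is unhealthy"
  # The prefix match on field 4 also tolerates a trailing ":" plus details.
  awk '$1 == "member" && $3 == "is" && $4 ~ /^unhealthy/ { print $2 }'
}

# Usage sketch: inspect the IDs before removing anything.
#   /usr/local/etcd/etcdctl cluster-health | unhealthy_ids
```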

Repair the broken node (by adding a replacement member)

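In the startup command, change -initial-cluster-state from "new" to "existing": the node is joining a cluster that is already running instead of bootstrapping a new one.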

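The new node must be added to the cluster with member add before it is started. If it is started first, it fails with:
"etcd: error validating peerURLs hostA=http://10.33.1.10:2380,hostC=http://10.33.1.12:2380: member count is unequal"
The running cluster still has two members (hostA and hostC), while the new node's -initial-cluster lists three, so the counts do not match.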
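The failing check compares the number of members in the new node's `-initial-cluster` with the membership the running cluster reports, so the list has to be exactly the surviving members plus the new one. Rather than typing it by hand, it can be derived from `member list`. This is a sketch under assumptions: the v2 `member list` format shown above, one peer URL per member, and that it runs after `member remove` but before `member add`. The name `build_initial_cluster` is hypothetical.

```shell
# Hypothetical helper: prints a value for -initial-cluster.
#   stdin: `etcdctl member list` output in the v2 format shown above, e.g.
#          4121a0b16ac4c150: name=hostA peerURLs=http://10.33.1.10:2380 clientURLs=...
#   $1:    name of the member being added
#   $2:    its peer URL
# Assumes one peer URL per member and that the new member is not listed yet.
build_initial_cluster() {
  local new_name="$1" new_peer="$2"
  awk -v nn="$new_name" -v np="$new_peer" '
    {
      name = ""; peer = ""
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^name=/)     name = substr($i, 6)   # strip "name="
        if ($i ~ /^peerURLs=/) peer = substr($i, 10)  # strip "peerURLs="
      }
      # Only lines carrying both fields count as member entries.
      if (name != "" && peer != "") {
        sep = (out == "") ? "" : ","
        out = out sep name "=" peer
      }
    }
    END {
      sep = (out == "") ? "" : ","
      print out sep nn "=" np
    }
  '
}

# Usage sketch, run after `member remove` and before `member add`:
#   /usr/local/etcd/etcdctl member list | build_initial_cluster hostD http://10.33.1.15:2380
```

With hostA and hostC remaining, this yields the same `hostA=...,hostC=...,hostD=...` value that the startup command for hostD uses.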

# /usr/local/etcd/etcdctl member add hostD http://10.33.1.15:2380
Added member named hostD with ID 2519d1a6a8cb83da to cluster  

Startup command for the new node:
/usr/local/etcd/etcd -name hostD \
  -initial-advertise-peer-urls http://10.33.1.15:2380 \
  -listen-peer-urls http://10.33.1.15:2380 \
  -listen-client-urls http://10.33.1.15:2379,http://127.0.0.1:2379 \
  -advertise-client-urls http://10.33.1.15:2379 \
  -initial-cluster-token etcd-cluster-aa \
  -initial-cluster hostA=http://10.33.1.10:2380,hostC=http://10.33.1.12:2380,hostD=http://10.33.1.15:2380 \
  -initial-cluster-state existing &

Check the cluster status:
# etcdctl  cluster-health                                            
cluster is healthy
member 2519d1a6a8cb83da is healthy
member 4121a0b16ac4c150 is healthy
member d8dd30b565cff95c is healthy
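The earlier output printed `cluster is healthy` even while hostB was unhealthy, so that line alone does not confirm the repair; every member line should also say `healthy`. A minimal check, assuming the v2 output format shown above; `cluster_fully_healthy` is a hypothetical name.

```shell
# Hypothetical helper: succeeds (exit 0) only if the `etcdctl cluster-health`
# output on stdin contains "cluster is healthy", lists at least one member,
# and reports no unhealthy member. Assumes the v2 output format shown above.
cluster_fully_healthy() {
  awk '
    $1 == "cluster" && $3 == "healthy"    { cluster_ok = 1 }
    $1 == "member"                        { members++ }
    $1 == "member" && $4 ~ /^unhealthy/   { bad++ }
    END { exit !(cluster_ok && members > 0 && bad == 0) }
  '
}

# Usage sketch:
#   /usr/local/etcd/etcdctl cluster-health | cluster_fully_healthy && echo "all members healthy"
```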
