Scenario: the osd.21 disk on ceph-4 has failed and needs to be replaced with a new one. ceph-4 hosts osd.21 through osd.26.
Preparation
Lower the OSD priority
In most failure scenarios the node has to be shut down. To keep this invisible to users, we lower the priority of the node we are about to work on ahead of time.
Start by checking the version with ceph -v. The second-generation OpenStack deployment runs Ceph 10.x (Jewel), and we have primary-affinity enabled:
a user's I/O request is handled by the primary PG first and then written to the other replicas.
Find the OSDs that belong to host ceph-4 first.
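One way to list those OSDs is to read them straight off the CRUSH tree; a minimal sketch (the -A 6 window simply matches the six OSDs on this host):
ceph osd tree | grep -A 6 "host ceph-4"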
Then set primary-affinity to 0 on those OSDs, which means their PGs should not act as primary unless the other replicas are down:
for osd in {21..26}; do ceph osd primary-affinity "$osd" 0; done
Use ceph osd tree to confirm the setting on these OSDs:
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-4 21.81499 host ceph-4
21 3.63599 osd.21 up 0.79999 0
22 3.63599 osd.22 up 0.79999 0
23 3.63599 osd.23 up 0.79999 0
24 3.63599 osd.24 up 0.99989 0
25 3.63599 osd.25 up 0.79999 0
26 3.63599 osd.26 up 0.79999 0
Prevent OSDs from being marked out
ceph osd set noout
By default, an OSD that stays unresponsive for too long is automatically marked out of the cluster, which triggers data migration. Shutting the node down and replacing the disk takes a while, so to avoid data being migrated back and forth for nothing, we temporarily stop the cluster from kicking OSDs out automatically. Use ceph -s to confirm the flag has been set.
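Besides reading the full ceph -s output below, a quick grep works too; a minimal sketch:
ceph -s | grep noout          # the health summary should mention the flag
ceph osd dump | grep flags    # the osdmap flags line should now include noout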
The cluster status changes to HEALTH_WARN with an extra note that the noout flag is set, and the flags line gains an entry:
cluster 936a5233-9441-49df-95c1-01de82a192f4
health HEALTH_WARN
noout flag(s) set
election epoch 382, quorum 0,1,2,3,4,5 ceph-1,ceph-2,ceph-3,ceph-4,ceph-5,ceph-6
fsmap e85: 1/1/1 up {0=ceph-2=up:active}
osdmap e62563: 111 osds: 109 up, 109 in
flags noout,sortbitwise,require_jewel_osds
Checks before stopping the service: verify that the PGs have finished switching primaries
ceph pg ls | grep "\[2[1-6],"
39.5 1 0 0 0 0 12 1 1 active+clean 2019-01-17 06:49:18.749517 16598'1 62554:2290 [22,44,76] 22 [22,44,76] 22 16598'1 2019-01-17 06:49:18.749416 16598'1 2019-01-11 15:21:07.641442
As you can see, PG 39.5 still has osd.22 as its primary, so keep waiting. The point of waiting is to keep users unaffected; if the situation is urgent, you can publish a maintenance notice instead and skip this step.
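If you would rather not re-run the grep by hand, it can be wrapped in a small polling loop; a minimal sketch using the same pattern as above (the 60-second interval is arbitrary):
while ceph pg ls | grep -q "\[2[1-6],"; do sleep 60; done   # exits once no PG lists osd.21-26 first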
Check how many nodes are already shut down or out of service
A cluster that keeps 3 replicas can tolerate the failure of any two hosts,
so make sure the number of nodes already shut down does not exceed that limit, to avoid turning this into a larger outage.
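Before shutting ceph-4 down, it is worth confirming that no other OSDs are already down; a minimal sketch:
ceph osd stat                  # prints the up/in counts, e.g. "111 osds: 109 up, 109 in"
ceph osd tree | grep down      # shows which OSDs, if any, are currently down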
Shut down and replace the disk
Rebuild the OSD. In my setup the journals live on an SSD; /dev/nvme0n1p1 is the journal partition used by osd.21.
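If you are not sure which journal partition belongs to which OSD, ceph-disk (shipped with Jewel) can show the mapping; a sketch only, assuming the failed data partition is still readable enough to be listed:
ceph-disk list | grep "osd\.21"   # the "ceph data" line for osd.21 names its journal partition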
Rebuild with ceph-deploy
The old OSD id has to be removed first, in my case osd.21:
1. Stop the OSD process (this tells the other OSDs that this node no longer serves requests): systemctl stop [email protected]
2. Mark the OSD out (this tells the mon that the OSD can no longer serve and that data recovery should start on the other OSDs): ceph osd out osd.21
3. Remove the OSD from the CRUSH map (this tells the cluster the OSD is never coming back and removes it completely from data placement so that CRUSH recalculates; otherwise the old entry keeps its crush weight and distorts the host's crush weight): ceph osd crush remove osd.21
4. Remove the OSD itself: ceph osd rm osd.21
5. Remove the OSD's auth key (if you leave it, the id stays reserved, and we want to recreate an OSD with id 21 later): ceph auth del osd.21
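A quick sanity check that the old id is really gone before recreating it; a minimal sketch:
ceph osd tree | grep "osd\.21 "   # should print nothing any more
ceph auth get osd.21              # should return an error, since the key was deleted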
Then follow the normal OSD creation process (the new disk is /dev/sdh, and its journal is /dev/nvme0n1p1).
Wait until the cluster is back to health: HEALTH_OK before starting the steps below.
It is advisable to raise the new OSD's weight gradually, for example 0.3, then 0.5, then 0.8.
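The text above does not name the exact knob; a minimal sketch assuming ceph osd reweight on the rebuilt osd.21 once it is back in the cluster, ending at 0.8 to match its neighbours in the tree output:
ceph osd reweight 21 0.3
# let the resulting backfill settle, then continue stepping up
ceph osd reweight 21 0.5
ceph osd reweight 21 0.8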
Method 1:
1. Zap the disk
Wipe the disk (delete its partition table) with the following command so it can be used by Ceph:
ceph-deploy disk zap {osd-server-name}:{disk-name}
ceph-deploy disk zap ceph-4:sdh
2. Prepare the OSD
ceph-deploy osd prepare {node-name}:{data-disk}[:{journal-disk}]
ceph-deploy osd prepare ceph-4:sdh:/dev/nvme0n1p1
3. Activate the OSD
ceph-deploy osd activate {node-name}:{data-disk-partition}[:{journal-disk-partition}]
ceph-deploy osd activate ceph-4:/dev/sdh1:/dev/nvme0n1p1
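Before reading the full ceph -s output below, you can confirm the new OSD registered, came up, and picked up the right journal; a minimal sketch (the journal path assumes the default filestore mount layout):
ceph osd tree | grep "osd\.21 "              # should show osd.21 up under host ceph-4
ls -l /var/lib/ceph/osd/ceph-21/journal      # should point at /dev/nvme0n1p1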
Finally, watch the cluster status while the actual data migration takes place:
[root@ceph-4 ~]# ceph -s
cluster 936a5233-9441-49df-95c1-01de82a192f4
health HEALTH_WARN
189 pgs backfill_wait
8 pgs backfilling
35 pgs degraded
1 pgs recovering
34 pgs recovery_wait
35 pgs stuck degraded
232 pgs stuck unclean
recovery 33559/38322730 objects degraded (0.088%)
recovery 1127457/38322730 objects misplaced (2.942%)
monmap e5: 6 mons at {ceph-1=100.100.200.201:6789/0,ceph-2=100.100.200.202:6789/0,ceph-3=100.100.200.203:6789/0,ceph-4=100.100.200.204:6789/0,ceph-5=100.100.200.205:6789/0,ceph-6=100.100.200.206:6789/0}
election epoch 394, quorum 0,1,2,3,4,5 ceph-1,ceph-2,ceph-3,ceph-4,ceph-5,ceph-6
fsmap e87: 1/1/1 up {0=ceph-2=up:active}
osdmap e64105: 111 osds: 109 up, 109 in; 197 remapped pgs
flags sortbitwise,require_jewel_osds
pgmap v79953954: 5064 pgs, 24 pools, 87760 GB data, 12364 kobjects
257 TB used, 149 TB / 407 TB avail
33559/38322730 objects degraded (0.088%)
1127457/38322730 objects misplaced (2.942%)
4828 active+clean
189 active+remapped+wait_backfill
34 active+recovery_wait+degraded
8 active+remapped+backfilling
4 active+clean+scrubbing+deep
1 active+recovering+degraded
recovery io 597 MB/s, 102 objects/s
client io 1457 kB/s rd, 30837 kB/s wr, 271 op/s rd, 846 op/s wr
Check the PG state; osd.21 has rejoined and is serving data again:
[root@ceph-4 ~]# ceph pg ls | grep "\[21,"
13.2c 4837 0 0 9674 0 39526913024 3030 3030 active+remapped+wait_backfill 2019-04-18 10:28:26.804788 64115'27875808 64115:22041197 [21,109,54] 21 [109,54,26] 109 63497'27874419 2019-04-17 11:38:23.070192 63497'27874419 2019-04-17 11:38:23.070192
13.3c4 4769 0 0 4769 0 38960525312 3053 3053 active+remapped+wait_backfill 2019-04-18 10:28:26.478877 64115'30669377 64115:23715977 [21,103,28] 21 [22,103,28] 22 64048'30669336 2019-04-18 09:30:54.731664 63494'30667018 2019-04-17 07:01:35.747738
13.732 4852 0 0 4852 0 39605818368 3033 3033 active+remapped+wait_backfill 2019-04-18 10:28:26.625253 64115'35861577 64115:24577872 [21,109,66] 21 [22,109,66] 22 63494'35851988 2019-04-17 00:23:06.041163 63335'35775155 2019-04-12 08:20:06.199557
14.28 631 0 0 0 0 1245708288 2766 2766 active+clean 2019-04-18 10:37:27.551922 63012'2766 64115:1344 [21,88,59] 21 [21,88,59] 21 63012'2766 2019-04-18 00:35:27.505205 63012'2766 2019-04-18 00:35:27.505205
14.cd 638 0 0 0 0 1392508928 2878 2878 active+clean 2019-04-18 10:34:59.825683 63012'2878 64115:2004 [21,103,82] 21 [21,103,82] 21 63012'2878 2019-04-17 15:29:04.106329 63012'2878 2019-04-14 23:32:15.401675
14.144 642 0 0 642 0 1296039936 3065 3065 active+remapped+wait_backfill 2019-04-18 10:28:26.475525 63012'19235 64115:1809456 [21,108,39] 21 [39,108,85] 39 63012'19235 2019-04-17 11:38:26.926001 63012'19235 2019-04-14 22:23:28.753992
15.1d4 1718 0 0 1718 0 7180375552 3079 3079 active+remapped+wait_backfill 2019-04-18 10:28:27.009492 64115'55809177 64115:56046351 [21,101,30] 21 [23,101,30] 23 64047'55805700 2019-04-18 02:22:58.405438 63329'55704522 2019-04-11 08:45:36.280574
15.255 1636 0 0 1636 0 6831808512 3329 3329 active+remapped+wait_backfill 2019-04-18 10:28:27.077705 64115'50269798 64115:51399985 [21,93,78] 21 [22,93,78] 22 63494'50261333 2019-04-17 08:16:31.952291 63332'50165547 2019-04-11 15:24:03.756344
18.65 0 0 0 0 0 0 0 0 active+clean 2019-04-18 10:28:25.128442 0'0 64106:8 [21,74,60] 21 [21,74,60] 21 0'0 2019-04-17 05:13:16.502404 0'0 2019-04-11 09:45:49.746497
18.8b 0 0 0 0 0 0 0 0 active+clean 2019-04-18 10:28:27.414351 0'0 64106:8 [21,69,27] 21 [21,69,27] 21 0'0 2019-04-18 09:47:32.328522 0'0 2019-04-15 12:37:33.118690
20.b 0 0 0 0 0 0 16 16 active+clean 2019-04-18 10:28:27.417344 55957'16 64106:8 [21,84,54] 21 [21,84,54] 21 55957'16 2019-04-17 21:59:53.986712 55957'16 2019-04-13 21:31:20.267855
21.1 0 0 0 0 0 0 0 0 active+clean 2019-04-18 10:28:27.414667 0'0 64106:8 [21,56,85] 21 [21,56,85] 21 0'0 2019-04-17 15:32:15.034621 0'0 2019-04-17 15:32:15.034621
38.49 16 0 0 0 0 24102240 127 127 active+clean 2019-04-18 10:44:55.727955 16602'127 64114:43 [21,50,81] 21 [21,50,81] 21 16602'127 2019-04-17 09:35:03.687063 16602'127 2019-04-14 17:57:45.779953
38.b2 15 0 0 0 0 23948729 107 107 active+clean 2019-04-18 10:33:32.139150 16602'107 64110:41 [21,86,69] 21 [21,86,69] 21 16602'107 2019-04-18 05:53:38.968505 16602'107 2019-04-16 23:42:37.321383
38.c2 14 0 0 0 0 33846787 118 118 active+clean 2019-04-18 10:28:31.234530 16602'118 64106:39 [21,72,56] 21 [21,72,56] 21 16602'118 2019-04-17 09:48:27.765306 16602'118 2019-04-14 00:56:36.180240
38.ce 9 0 0 0 0 16840679 65 65 active+clean 2019-04-18 10:33:30.229084 16602'65 64110:29 [21,75,30] 21 [21,75,30] 21 16602'65 2019-04-18 03:29:52.768179 16602'65 2019-04-18 03:29:52.768179
Once the OSD rebuild is complete, move on to the next step: restore the OSD flags.
Everything changed during the intervention needs to be reverted: ceph osd unset noout
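The unset only covers noout; the primary-affinity we zeroed during preparation also needs to go back. A minimal sketch, assuming these OSDs were at the default affinity of 1 beforehand:
for osd in {21..26}; do ceph osd primary-affinity "$osd" 1; done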
Wait for the cluster to recover
Wait for the cluster's automatic recovery to bring it back to HEALTH_OK.
If a HEALTH_ERR state shows up in the meantime, follow up promptly and search Google for the specific error.
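To follow the recovery without re-running ceph -s by hand, a minimal sketch:
ceph -w               # streams cluster events, including recovery progress
watch -n 30 ceph -s   # or poll the summary every 30 seconds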
Method 2:
ceph-deploy osd create --data /dev/sdk wuhan31-ceph03
Here the hostname is wuhan31-ceph03 and the disk is /dev/sdk.
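This is the ceph-deploy 2.x form of osd create. If, as in Method 1, you still run filestore with a separate journal, the 2.x tool also accepts a journal device; this is a sketch only, with the journal partition borrowed from the earlier example for illustration (verify the flags with ceph-deploy osd create --help on your version):
ceph-deploy osd create --filestore --data /dev/sdk --journal /dev/nvme0n1p1 wuhan31-ceph03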