References:
https://www.jianshu.com/p/36c2d5682d87
https://blog.csdn.net/wylfengyujiancheng/article/details/89235241?utm_medium=distribute.pc_relevant.none-task-blog-2defaultbaidujs_baidulandingword~default-1.no_search_link&spm=1001.2101.3001.4242
https://github.com/lidaohang/ceph_study/blob/master/%E5%B8%B8%E8%A7%81%20PG%20%E6%95%85%E9%9A%9C%E5%A4%84%E7%90%86.md
《Ceph 之Rados设计原理与实现》
PG state here means the external PG state, that is, the state users can see directly.
You can check the current PG state with the ceph pg stat command; the healthy state is "active+clean".
[root@node-1 ~]# ceph pg stat
464 pgs: 464 active+clean; 802 MiB data, 12 GiB used, 24 GiB / 40 GiB avail
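If you want to check this summary line from a script rather than by eye, a small parser is enough. The sketch below is only an illustration: it assumes the one-line format shown in the sample output above, and the function names `parse_pg_stat` and `all_active_clean` are made up for this example, not part of any Ceph library.

```python
def parse_pg_stat(line):
    """Parse the summary line of `ceph pg stat` into (total_pgs, {state: count}).

    Assumes the layout "<N> pgs: <n1> <state1>, <n2> <state2>; <usage...>".
    """
    head = line.split(";", 1)[0]                 # drop the usage part after ';'
    total_part, states_part = head.split(":", 1)
    total = int(total_part.split()[0])           # "464 pgs" -> 464
    states = {}
    for chunk in states_part.split(","):
        count, state = chunk.split()             # "336 active+clean"
        states[state] = int(count)
    return total, states


def all_active_clean(line):
    """True when every PG in the summary line is active+clean."""
    total, states = parse_pg_stat(line)
    return states.get("active+clean", 0) == total


sample = "464 pgs: 464 active+clean; 802 MiB data, 12 GiB used, 24 GiB / 40 GiB avail"
print(parse_pg_stat(sample))     # (464, {'active+clean': 464})
print(all_active_clean(sample))  # True
```

In practice you would feed it the output of `ceph pg stat`, for example via `subprocess.run(["ceph", "pg", "stat"], capture_output=True, text=True).stdout`.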
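Some common external PG states are listed below (see section 6.3 of 《Ceph 之Rados设计原理与实现》).
State | Meaning |
---|---|
activating | Peering is about to finish; the PG is waiting for all replicas to sync and persist the peering result (info, log, etc.) |
active | Active. The PG can serve client read and write requests normally |
backfilling | Backfilling in the background. Backfill is a special case of recovery: after peering completes, if some PG instances in the Up Set cannot be synced incrementally from the current authoritative log (for example because the OSDs hosting them were offline too long, or because a new OSD joined the cluster and whole PG instances are being migrated), they are synced in full by copying all objects from the current Primary |
backfill_toofull | The OSD holding a replica is short of space, so the backfill is suspended |
backfill_wait | Waiting for the backfill resource reservation to complete |
clean | The PG has no degraded objects (objects awaiting repair); the acting set and the up set are identical and their size equals the pool's replica count |
creating | The PG is being created |
deep | The PG is running or about to run a Deep-Scrub (object consistency scan) |
degraded | The PG has degraded objects (after peering, the PG found that one of its instances is inconsistent), or the acting set is smaller than the pool's replica count (but not smaller than the pool's minimum replica count) |
down | During peering, the PG found an Interval that cannot be skipped and for which the surviving replicas are not enough to complete data recovery |
incomplete | During peering, no authoritative log could be chosen, or the chosen acting set is not sufficient to complete data repair |
inconsistent | Scrub found one or more objects that differ between replicas |
peered | Peering has completed, but the PG's acting set is smaller than the pool's minimum replica count |
peering | Peering is in progress |
recovering | The PG is repairing degraded (inconsistent) objects in the background according to the peering result |
recovery_wait | Waiting for the recovery resource reservation to complete |
remapped | Any change to the PG's acting set causes data to migrate from the old acting set to the new one. During the migration the primary OSD of the old acting set keeps serving client requests; once the migration finishes, the primary OSD of the new acting set takes over |
repair | Repairing inconsistent objects |
scrubbing | The PG is running a Scrub |
stale | The Monitor detected that the OSD hosting the current Primary is down and no failover followed, or the Primary timed out reporting PG statistics to the Monitor (for example because of transient network congestion) |
undersized | The number of replicas in the acting set is smaller than the pool's replica count (but not smaller than the pool's minimum replica count) |
inactive | The PG cannot serve read or write requests |
unclean | The PG cannot recover from a previous failure |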
Reference: http://luqitao.github.io/2016/07/14/ceph-pg-states-introduction/
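PG states are reported as compound strings such as active+undersized+degraded, so it helps to split them into individual flags before reasoning about them. The sketch below is a rough reading aid based only on the state meanings listed above: a PG serves I/O when the active flag is present, and a handful of flags usually deserve a closer look. The function names and the choice of "attention" flags are assumptions for illustration, not an official Ceph classification.

```python
# Flags that, per the state descriptions above, usually mean the PG needs a look.
# This particular selection is a judgment call made for this example.
ATTENTION_FLAGS = {"down", "incomplete", "inconsistent", "stale", "peered"}


def split_state(pg_state):
    """Split 'active+undersized+degraded' into {'active', 'undersized', 'degraded'}."""
    return set(pg_state.split("+"))


def can_serve_io(pg_state):
    """A PG handles client reads and writes only while the 'active' flag is set."""
    return "active" in split_state(pg_state)


def attention_flags(pg_state):
    """Return the sorted flags in this state that suggest manual follow-up."""
    return sorted(split_state(pg_state) & ATTENTION_FLAGS)


for state in ("active+clean", "active+undersized+degraded", "undersized+degraded+peered"):
    print(state, can_serve_io(state), attention_flags(state))
```

Splitting on "+" instead of doing a substring search avoids false matches between flag names that contain one another.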
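The following describes some abnormal PG states (those that need manual repair).
degraded
When a client writes data to the primary OSD, the primary OSD is responsible for writing the replicas to the other replica OSDs. After the primary OSD has written the object to storage, the PG stays in the degraded state until the replica OSDs have created their copies and acknowledged them to the primary OSD. A PG can be active+degraded because an OSD can be active even though it does not yet hold all objects. If an OSD goes down, Ceph marks every PG assigned to that OSD as degraded; once the OSD comes back, the OSDs must peer again. Clients can still write new objects to a degraded PG as long as it is active.
If an OSD is down and the degraded condition persists, Ceph marks the down OSD as out of the cluster and remaps the data from that OSD to other OSDs. The interval between being marked down and being marked out is controlled by mon osd down out interval, which defaults to 600 seconds.
A PG can also be degraded because Ceph cannot find one or more objects that should be in it. You cannot read or write the missing objects, but you can still access all the other objects in the degraded PG.
remapped
When the Acting Set responsible for a PG changes, data has to migrate from the old set to the new one. The new primary OSD needs some time before it can serve requests, so the old primary OSD keeps serving until the PG migration finishes. Once the migration completes, the map uses the primary OSD of the new acting set.
stale
By default, OSD daemons report their PG, up_thru, boot and failure statistics every half second (0.5), which is more frequent than the heartbeat thresholds. If the primary OSD of a PG's acting set fails to report to the monitors, or if other OSDs report that primary OSD as down, the monitors mark the PG as stale.
When you start the cluster, it is common to see the stale state until peering completes. If PGs are still stale after the cluster has been running for a while, the primary OSDs of those PGs are down or are not reporting statistics to the monitors.
inconsistent
A PG normally has several replicas, and the data on all of them should be identical. Factors such as OSD failures or network congestion can sometimes make the replicas diverge, and the inconsistent PG then needs to be repaired.
Pools are usually configured with 3 replicas, so each PG is stored on 3 OSDs. Normally the PG state shows as "active+clean".
If your cluster cannot hold three replicas, for example because it has only 2 OSDs, all OSDs may be up and in while the PGs never reach "active+clean". This is probably because osd pool size/min_size is set to a value greater than 2.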
# min_size=4, size=5, actual number of OSD replicas = 3
[root@node-1 ~]# ceph osd dump | grep pool-1
pool 1 'pool-1' replicated size 5 min_size 4 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode warn last_change 6359 flags hashpspool stripe_width 0 application rbd
[root@node-1 ~]# ceph pg stat
464 pgs: 70 undersized+degraded+peered, 58 undersized+peered, 336 active+clean; 802 MiB data, 12 GiB used, 24 GiB / 40 GiB avail; 192/162165 objects degraded (0.118%)
[root@node-1 ~]# rados -p pool-1 put file_bench_cephfs.f file_bench_cephfs.f
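# The operation blocks; the write request cannot complete
As shown, osd pool min_size is the number of OSD replicas that must be available, while osd pool size is the desired number. The former is mandatory: if it is not met, the pool cannot serve reads or writes. The latter may go unmet, in which case the cluster only raises a warning. Setting sensible osd pool size and osd pool min_size values resolves the problem above.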
[root@node-1 ~]# ceph osd pool set pool-1 size 3
set pool 1 size to 3
[root@node-1 ~]# ceph osd pool set pool-1 min_size 2
set pool 1 min_size to 2
[root@node-1 ~]# ceph pg stat
464 pgs: 464 active+clean; 802 MiB data, 12 GiB used, 24 GiB / 40 GiB avail
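The relationship between size, min_size and the number of replicas actually available can be written down as a tiny decision function. This is a deliberately simplified sketch of the behavior observed above: it ignores the degraded flag and every other state, assumes at least one replica is still available, and `expected_pg_state` is a hypothetical helper name, not a Ceph API.

```python
def expected_pg_state(size, min_size, available_replicas):
    """Rough expected PG state for a replicated pool.

    size               -- desired replica count (osd pool size)
    min_size           -- replicas required to serve I/O (osd pool min_size)
    available_replicas -- replicas the PG can currently place on up OSDs (>= 1)
    """
    if available_replicas >= size:
        return "active+clean"        # fully replicated
    if available_replicas >= min_size:
        return "active+undersized"   # warns, but still serves reads and writes
    return "undersized+peered"       # below min_size: I/O blocks


# The misconfigured pool from the transcript: size 5, min_size 4, only 3 replicas.
print(expected_pg_state(5, 4, 3))  # undersized+peered
# After setting size 3 and min_size 2 with the same 3 replicas.
print(expected_pg_state(3, 2, 3))  # active+clean
```

The two calls correspond to the pool before and after the `ceph osd pool set` commands shown above.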
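CRUSH map errors
Another possible reason PGs never become clean is an error in the cluster's CRUSH map that prevents PGs from being mapped to the right place.
The most common PG failures are caused by one or more OSD processes going down. Restarting the OSD usually restores health.
Use ceph -s or ceph osd stat to check whether any OSD is down.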
[root@node-1 ~]# ceph osd stat
4 osds: 4 up (since 4h), 4 in (since 6d); epoch: e6364
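Try stopping one or more OSDs (a 3-replica cluster with 4 OSDs in total) and watch the cluster state.
# Stop 1 OSD, 3 remain: active+undersized+degraded warnings appear, which means the cluster can still read and write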
[root@node-1 ~]# ceph health detail
HEALTH_WARN 1 osds down; Degraded data redundancy: 52306/162054 objects degraded (32.277%), 197 pgs degraded
OSD_DOWN 1 osds down
osd.0 (root=default,host=node-1) is down
PG_DEGRADED Degraded data redundancy: 52306/162054 objects degraded (32.277%), 197 pgs degraded
pg 1.1d is active+undersized+degraded, acting [2,1]
pg 1.60 is active+undersized+degraded, acting [1,2]
pg 1.62 is active+undersized+degraded, acting [2,1]
...
# Stop 2 OSDs, 2 remain: min_size=2 is still met, so the cluster can still read and write
[root@node-1 ~]# ceph health detail
HEALTH_WARN 2 osds down; 1 host (2 osds) down; Degraded data redundancy: 54018/162054 objects degraded (33.333%), 208 pgs degraded, 441 pgs undersized
OSD_DOWN 2 osds down
osd.0 (root=default,host=node-1) is down
osd.3 (root=default,host=node-1) is down
OSD_HOST_DOWN 1 host (2 osds) down
host node-1 (root=default) (2 osds) is down
PG_DEGRADED Degraded data redundancy: 54018/162054 objects degraded (33.333%), 208 pgs degraded, 441 pgs undersized
pg 1.29 is stuck undersized for 222.261023, current state active+undersized, last acting [2,1]
pg 1.2a is stuck undersized for 222.251868, current state active+undersized, last acting [2,1]
pg 1.2b is stuck undersized for 222.246564, current state active+undersized, last acting [2,1]
pg 1.2c is stuck undersized for 221.679774, current state active+undersized+degraded, last acting [1,2]
# Stop 3 OSDs, 1 remains: min_size=2 is no longer met, the cluster loses read/write ability and undersized+degraded+peered warnings appear
[root@node-2 ~]# ceph -s
cluster:
id: 60e065f1-d992-4d1a-8f4e-f74419674f7e
health: HEALTH_WARN
3 osds down
2 hosts (3 osds) down
Reduced data availability: 192 pgs inactive
Degraded data redundancy: 107832/161748 objects degraded (66.667%), 208 pgs degraded
services:
mon: 3 daemons, quorum node-1,node-2,node-3 (age 5h)
mgr: node-1(active, since 20h)
mds: cephfs:2 {0=mds2=up:active,1=mds1=up:active} 1 up:standby
osd: 4 osds: 1 up (since 47s), 4 in (since 6d)
rgw: 1 daemon active (node-2)
task status:
data:
pools: 9 pools, 464 pgs
objects: 53.92k objects, 803 MiB
usage: 16 GiB used, 24 GiB / 40 GiB avail
pgs: 100.000% pgs not active
107832/161748 objects degraded (66.667%)
256 undersized+peered
208 undersized+degraded+peered
# Stop the 4th OSD: when only one OSD is left, ceph -s still shows it as up even after its process has been stopped
# but checking the process shows that its state is dead
# The PG state is stale+undersized+peered and the cluster cannot read or write
[root@node-1 ~]# systemctl status ceph-osd@0
● ceph-osd@0.service - Ceph object storage daemon osd.0
Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: disabled)
Active: inactive (dead) since Thu 2021-10-14 15:36:14 CST; 1min 56s ago
Process: 5528 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph (code=exited, status=0/SUCCESS)
Process: 5524 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
Main PID: 5528 (code=exited, status=0/SUCCESS)
...
[root@node-1 ~]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.03918 root default
-3 0.01959 host node-1
0 hdd 0.00980 osd.0 up 1.00000 1.00000
3 hdd 0.00980 osd.3 down 0.09999 1.00000
-5 0.00980 host node-2
1 hdd 0.00980 osd.1 down 1.00000 1.00000
-7 0.00980 host node-3
2 hdd 0.00980 osd.2 down 1.00000 1.00000
[root@node-1 ~]# ceph pg stat
464 pgs: 440 down, 14 stale+undersized+peered, 10 stale+undersized+degraded+peered; 801 MiB data, 12 GiB used, 24 GiB / 40 GiB avail; 3426/161460 objects degraded (2.122%)
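Restart all the stopped OSDs and the cluster slowly returns to health.
# All OSDs have been restarted; the PGs are peering, but reads and writes already work.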
[root@node-1 ~]# ceph -s
cluster:
id: 60e065f1-d992-4d1a-8f4e-f74419674f7e
health: HEALTH_WARN
Reduced data availability: 1 pg inactive, 2 pgs peering
Degraded data redundancy: 16715/162054 objects degraded (10.314%), 65 pgs degraded
services:
mon: 3 daemons, quorum node-1,node-2,node-3 (age 5h)
mgr: node-1(active, since 20h)
mds: cephfs:2 {0=mds2=up:active,1=mds1=up:active} 1 up:standby
osd: 4 osds: 4 up (since 5s), 4 in (since 5s)
rgw: 1 daemon active (node-2)
task status:
data:
pools: 9 pools, 464 pgs
objects: 54.02k objects, 803 MiB
usage: 11 GiB used, 19 GiB / 30 GiB avail
pgs: 65.302% pgs not active
16715/162054 objects degraded (10.314%)
294 peering
75 active+undersized
62 active+undersized+degraded
21 active+clean
9 remapped+peering
2 active+recovery_wait+degraded
1 active+recovering+degraded
# After a while, check Ceph health again: rebalancing is in progress, the cluster can still read and write, and the PG state is "active+clean"
[root@node-1 ~]# ceph -s
cluster:
id: 60e065f1-d992-4d1a-8f4e-f74419674f7e
health: HEALTH_OK
...
progress:
Rebalancing after osd.0 marked in
[==............................]
[root@node-1 ~]# rados -p pool-1 put file_bench_cephfs.f file_bench_cephfs.f
[root@node-1 ~]# rados -p pool-1 ls | grep file_bench
file_bench_cephfs.f
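The PG states in which the cluster cannot read or write are listed here:
The stale and peered states were demonstrated above by stopping OSD services.
A classic scenario for down, with OSDs A (primary), B and C:
a. First kill B
b. Write new data to A and C
c. Kill A and C
d. Bring B back up
The surviving B now holds stale data (it lacks the new writes), and no other OSD in the cluster can help it catch up, so the PG is shown as down. Reference: https://zhuanlan.zhihu.com/p/138778000#:~:text=3.8.3%20PG%E4%B8%BADown%E7%9A%84OSD%E4%B8%A2%E5%A4%B1%E6%88%96%E6%97%A0%E6%B3%95%E6%8B%89%E8%B5%B7
The fix for down is again to restart the failed OSDs.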
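The A/B/C scenario can be replayed with a toy model: record each past interval as the set of OSDs that were acting plus whether writes happened, then check whether any interval with writes has no surviving member. This is a heavily simplified sketch of the "interval that cannot be skipped" idea, not Ceph's real peering logic, and `pg_is_down` is a hypothetical name.

```python
def pg_is_down(intervals, up_now):
    """Return True if some past interval that accepted writes has no OSD alive now.

    intervals -- list of (acting_osds, had_writes) tuples in chronological order
    up_now    -- set of OSD names that are currently up
    """
    for acting, had_writes in intervals:
        # An interval with writes can only be accounted for if at least one
        # OSD that took part in it is still reachable.
        if had_writes and not (set(acting) & set(up_now)):
            return True
    return False


history = [
    ({"A", "B", "C"}, True),  # all three replicas up, normal writes
    ({"A", "C"}, True),       # B was killed, new data written to A and C
]

print(pg_is_down(history, {"B"}))       # True: only B is back, and it missed the second interval
print(pg_is_down(history, {"A", "B"}))  # False: A saw every write
```

This matches the fix described above: bringing back A or C gives the PG a replica that covers the interval B missed.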
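Reference: https://ceph.com/geen-categorie/ceph-manually-repair-object/
Usually it is enough to repair the damaged PG manually with ceph pg repair {pgid}
When a PG is inconsistent, some objects in the PG differ between replicas. An OSD disk may be damaged, or data on the disk may have suffered silent corruption.
The following example damages a PG by hand and then repairs it.
# 1. Stop the OSD service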
$ systemctl stop ceph-osd@{id}
# 2. Use ceph-objectstore-tool to mount /var/lib/ceph/osd/ceph-0 at /mnt/ceph-osd@0
[root@node-1 ceph-objectstore-tool-test]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ --op fuse --mountpoint /mnt/ceph-osd@0/
mounting fuse at /mnt/ceph-osd@0/ ...
# 3. Delete one directory (an object of the PG) under /mnt/ceph-osd@0/10.0_head/all, which damages an object of PG 10.0
[root@node-1 all]# rm -rf \#10\:01ec679f\:\:\:10000011eba.00000000\:head#/
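rm: cannot remove "#10:01ec679f:::10000011eba.00000000:head#/bitwise_hash": Operation not permitted
rm: cannot remove "#10:01ec679f:::10000011eba.00000000:head#/omap": Operation not permitted
rm: cannot remove "#10:01ec679f:::10000011eba.00000000:head#/attr": Operation not permitted
# 4. Unmount /mnt/ceph-osd@0, restart the OSD service, and wait for the cluster to recover
# 5. Manually scrub PG 10.0 with ceph pg scrub 10.0 and wait for the background scrub to finish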
[root@node-1 ~]# ceph pg scrub 10.0
instructing pg 10.0 on osd.2 to scrub
# 6. The cluster reports an error: PG 10.0 is in state active+clean+inconsistent
[root@node-1 ~]# ceph health detail
HEALTH_ERR 2 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 2 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
pg 10.0 is active+clean+inconsistent, acting [2,1,0]
# 7. Run the repair; PG state: active+clean+scrubbing+deep+inconsistent+repair
[root@node-1 ~]# ceph pg repair 10.0
instructing pg 10.0 on osd.2 to repair
[root@node-1 ~]# ceph -s
cluster:
id: 60e065f1-d992-4d1a-8f4e-f74419674f7e
health: HEALTH_ERR
2 scrub errors
Possible data damage: 1 pg inconsistent
...
data:
pools: 9 pools, 464 pgs
objects: 53.99k objects, 802 MiB
usage: 16 GiB used, 24 GiB / 40 GiB avail
pgs: 463 active+clean
1 active+clean+scrubbing+deep+inconsistent+repair
# 8. Wait for the cluster to return to health
[root@node-1 ~]# ceph health detail
HEALTH_OK
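When several PGs are inconsistent at once, picking their ids out of ceph health detail by hand gets tedious. The sketch below extracts them from text shaped like the output in step 6 and prints the matching ceph pg repair commands. It assumes the "pg <id> is <state>, acting [...]" line format seen in this article; the regular expression and the function name `inconsistent_pgs` are illustrative only.

```python
import re

# Matches lines such as: "pg 10.0 is active+clean+inconsistent, acting [2,1,0]"
PG_LINE = re.compile(r"^\s*pg\s+(\S+)\s+is\s+(\S+?),\s+acting\s+\[([\d,]*)\]")


def inconsistent_pgs(health_detail):
    """Return the ids of PGs whose state contains the 'inconsistent' flag."""
    found = []
    for line in health_detail.splitlines():
        match = PG_LINE.match(line)
        if match and "inconsistent" in match.group(2).split("+"):
            found.append(match.group(1))
    return found


sample = """HEALTH_ERR 2 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 2 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
pg 10.0 is active+clean+inconsistent, acting [2,1,0]"""

for pgid in inconsistent_pgs(sample):
    # Print the command instead of running it, so it can be reviewed first.
    print(f"ceph pg repair {pgid}")
```

Printing the commands rather than executing them keeps a human in the loop, which seems sensible for an operation that rewrites replica data.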
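If ceph pg repair {pgid} cannot repair the PG, you can import the whole PG with ceph-objectstore-tool instead.
Reference: https://www.jianshu.com/p/36c2d5682d87#:~:text=%E8%B5%B7%E5%A4%AF%E4%BD%8F%E3%80%82-,3.9%20Incomplete,-Peering%E8%BF%87%E7%A8%8B%E4%B8%AD
Constructing the failure
# Set up the failure: use ceph-objectstore-tool to delete the same object from two of the three replicas.
# Note: before using ceph-objectstore-tool, stop the OSD service with systemctl stop ceph-osd@{id}
# Take PG 10.0 and delete object 1000000d4dc.00000000 on both node-2 and node-3; the cluster has 3 replicas and PG 10.0 is spread across node-1, node-2 and node-3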
[root@node-2 ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1/ --pgid 10.0 1000000d4dc.00000000 remove
remove #10:03f57502:::1000000d4dc.00000000:head#
[root@node-3 ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2/ --pgid 10.0 1000000d4dc.00000000 remove
remove #10:03f57502:::1000000d4dc.00000000:head#
[root@node-1 ~]# ceph health detail
HEALTH_ERR 2 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 2 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
pg 10.0 is active+clean+inconsistent, acting [2,1,0]
Repairing with ceph-objectstore-tool
# Query and compare the data
# 1. Export the PG's object list on each replica and collect all lists under ~/export on node-1 for easy comparison
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ --pgid 10.0 --op list > ~/export/pg-10.0-osd0.txt
[root@node-1 ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ --pgid 10.0 --op list > ~/export/pg-10.0-osd0.txt
[root@node-2 ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1/ --pgid 10.0 --op list > ~/pg-10.0-osd1.txt
[root@node-3 ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2/ --pgid 10.0 --op list > ~/pg-10.0-osd2.txt
[root@node-1 export]# scp root@node-2:/root/pg-10.0-osd1.txt ./
pg-10.0-osd1.txt 100% 97KB 19.5MB/s 00:00
pg-10.0-osd0.txt pg-10.0-osd1.txt
[root@node-1 export]# scp root@node-3:/root/pg-10.0-osd2.txt ./
pg-10.0-osd2.txt 100% 97KB 35.0MB/s 00:00
[root@node-1 export]# ls
pg-10.0-osd0.txt pg-10.0-osd1.txt pg-10.0-osd2.txt
# 2. Count the objects in the PG: the copy of PG 10.0 on node-1 has the most objects, 833
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ --pgid 10.0 --op list | wc -l
[root@node-1 export]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ --pgid 10.0 --op list | wc -l
833
[root@node-2 ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1/ --pgid 10.0 --op list | wc -l
832
[root@node-3 ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2/ --pgid 10.0 --op list | wc -l
832
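# 3. Compare the object lists of all replicas. Here node-2 and node-3 match each other; node-1 differs from them but has the most objects
# This example uses the PG copy on node-1 and imports it into the OSDs on node-2 and node-3
# - After the diff, check whether the object lists of all replicas (primary and secondaries) match, to avoid inconsistent data. Use the copy that has the most objects and that the diff shows to contain every object.
# - If the diff shows that the counts differ and the largest copy does not contain every object, consider importing without overwriting and then exporting again, so that the final import uses a complete set of objects. Note: import requires removing the PG first, so it is effectively an overwrite.
# - If the diff shows that the data is consistent, take the copy with the most objects and import it into the PGs with fewer objects, then mark complete on all replicas. Always export a PG backup on every replica's OSD node first, so the PG can be restored if something goes wrong.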
[root@node-1 export]# diff -u ./pg-10.0-osd0.txt ./pg-10.0-osd1.txt
[root@node-1 export]# diff -u ./pg-10.0-osd0.txt ./pg-10.0-osd2.txt
[root@node-1 export]# diff -u ./pg-10.0-osd2.txt ./pg-10.0-osd1.txt
# 4. Export the PG on node-1 (the export file name is up to you) and copy the file to node-2 and node-3
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ --pgid 10.0 --op export --file ~/export/pg-10.0.obj
[root@node-1 export]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ --pgid 10.0 --op export --file ~/export/pg-10.0.obj
Read #10:03f57502:::1000000d4dc.00000000:head#
Read #10:03f6b1a4:::100000091a0.00000000:head#
Read #10:03f6dfc2:::10000010b31.00000000:head#
Read #10:03f913b2:::10000010740.00000000:head#
Read #10:03f99080:::10000010f0f.00000000:head#
Read #10:03fc19a4:::10000011c5e.00000000:head#
Read #10:03fe3b90:::10000010166.00000000:head#
Read #10:03fe60e1:::10000011c44.00000000:head#
........
Export successful
[root@node-1 export]# ls
pg-10.0.obj pg-10.0-osd0.txt pg-10.0-osd1.txt pg-10.0-osd2.txt
[root@node-1 export]# scp pg-10.0.obj root@node-2:/root/
pg-10.0.obj 100% 4025KB 14.7MB/s 00:00
[root@node-1 export]# scp pg-10.0.obj root@node-3:/root/
pg-10.0.obj
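# Note: all following steps are identical on node-2 and node-3; for brevity only node-2 is shown
# 5. Import the backed-up PG on node-2 and node-3
# Before importing, it is advisable to export the PG that is about to be replaced, so it can be restored if problems occur later
# The import writes the given PG data into the current PG. The current PG must be removed first (export a backup before removing it), otherwise the import fails with an "already exists" message.
# 5.1 Back up
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1/ --pgid 10.0 --op export --file ~/pg-10.0-node-2.obj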
[root@node-2 ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1/ --pgid 10.0 --op export --file ~/pg-10.0-node-2.obj
Read #10:03f6b1a4:::100000091a0.00000000:head#
Read #10:03f6dfc2:::10000010b31.00000000:head#
Read #10:03f913b2:::10000010740.00000000:head#
Read #10:03f99080:::10000010f0f.00000000:head#
Read #10:03fc19a4:::10000011c5e.00000000:head#
Read #10:03fe3b90:::10000010166.00000000:head#
Read #10:03fe60e1:::10000011c44.00000000:head#
...
Export successful
# 5.2 Remove
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1/ --pgid 10.0 --op remove --force
[root@node-2 ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1/ --type bluestore --pgid 10.0 --op remove --force
marking collection for removal
setting '_remove' omap key
finish_remove_pgs 10.0_head removing 10.0
Remove successful
# 5.3 Import
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1/ --type bluestore --pgid 10.0 --op import --file ~/pg-10.0.obj
[root@node-2 ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1/ --type bluestore --pgid 10.0 --op import --file ~/pg-10.0.obj
Write #10:03f6dfc2:::10000010b31.00000000:head#
snapset 1=[]:{}
Write #10:03f913b2:::10000010740.00000000:head#
snapset 1=[]:{}
Write #10:03f99080:::10000010f0f.00000000:head#
snapset 1=[]:{}
....
write_pg epoch 6727 info 10.0( v 5925'23733 (5924'20700,5925'23733] local-lis/les=6726/6727 n=814 ec=5833/5833 lis/c 6726/6726 les/c/f 6727/6727/0 6726/6726/6724)
Import successful
# 6. Check: the PG is no longer inconsistent and is slowly recovering
[root@node-3 ~]# ceph -s
cluster:
id: 60e065f1-d992-4d1a-8f4e-f74419674f7e
health: HEALTH_WARN
Degraded data redundancy: 37305/161973 objects degraded (23.032%), 153 pgs degraded, 327 pgs undersized
services:
mon: 3 daemons, quorum node-1,node-2,node-3 (age 23m)
mgr: node-1(active, since 23m)
mds: cephfs:2 {0=mds1=up:active,1=mds2=up:active} 1 up:standby
osd: 4 osds: 4 up (since 6s), 4 in (since 16h)
rgw: 1 daemon active (node-2)
task status:
data:
pools: 9 pools, 464 pgs
objects: 53.99k objects, 802 MiB
usage: 16 GiB used, 24 GiB / 40 GiB avail
pgs: 37305/161973 objects degraded (23.032%)
182 active+undersized
145 active+undersized+degraded
129 active+clean
8 active+recovering+degraded
io:
recovery: 0 B/s, 2 objects/s
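The comparison in step 3 (find the replica whose object list contains every object seen on any replica) can also be done with set operations instead of pairwise diff. The sketch below treats each line of a --op list export as an opaque string; the replica names and the sample lines are placeholders invented for this example, and `pick_source_replica` is a hypothetical helper.

```python
def pick_source_replica(listings):
    """Choose a replica whose object list is a superset of all the others.

    listings -- {replica_name: iterable of object-list lines}
    Returns (best_name_or_None, {replica_name: sorted missing lines}).
    """
    sets = {
        name: {line.strip() for line in lines if line.strip()}
        for name, lines in listings.items()
    }
    union = set().union(*sets.values())  # every object seen on any replica
    complete = sorted(name for name, objs in sets.items() if objs == union)
    missing = {name: sorted(union - objs) for name, objs in sets.items()}
    # None means no single copy is complete, so a plain overwrite import
    # from one replica would lose objects.
    return (complete[0] if complete else None), missing


# Placeholder data; real input would be the lines of the exported list files.
listings = {
    "osd0": ["obj-a\n", "obj-b\n", "obj-c\n"],
    "osd1": ["obj-a\n", "obj-c\n"],
    "osd2": ["obj-a\n", "obj-c\n"],
}
best, missing = pick_source_replica(listings)
print(best)     # osd0
print(missing)  # {'osd0': [], 'osd1': ['obj-b'], 'osd2': ['obj-b']}
```

With the real files you could pass something like `{"osd0": open("pg-10.0-osd0.txt"), ...}`. A result of None corresponds to the case described in step 3 where the largest copy does not contain every object.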
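The sections above fixed cluster failures by restarting OSDs, but sometimes an OSD is down and cannot be restarted.
# A down OSD causes PG failures; checking the OSDs shows that osd.1 on node-2 is down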
[root@node-2 ~]# ceph health detail
HEALTH_WARN Degraded data redundancy: 53852/161556 objects degraded (33.333%), 209 pgs degraded, 464 pgs undersized
PG_DEGRADED Degraded data redundancy: 53852/161556 objects degraded (33.333%), 209 pgs degraded, 464 pgs undersized
...
[root@node-2 ~]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.03918 root default
-3 0.01959 host node-1
0 hdd 0.00980 osd.0 up 1.00000 1.00000
3 hdd 0.00980 osd.3 up 0.09999 1.00000
-5 0.00980 host node-2
1 hdd 0.00980 osd.1 down 0 1.00000
-7 0.00980 host node-3
2 hdd 0.00980 osd.2 up 1.00000 1.00000
# Try to restart the OSD; it fails to start
[root@node-2 ~]# systemctl restart ceph-osd@1
Job for ceph-osd@1.service failed because start of the service was attempted too often. See "systemctl status ceph-osd@1.service" and "journalctl -xe" for details.
To force a start use "systemctl reset-failed ceph-osd@1.service" followed by "systemctl start ceph-osd@1.service" again.
[root@node-2 ~]# systemctl stop ceph-osd@1
[root@node-2 ~]# systemctl restart ceph-osd@1
Job for ceph-osd@1.service failed because start of the service was attempted too often. See "systemctl status ceph-osd@1.service" and "journalctl -xe" for details.
To force a start use "systemctl reset-failed ceph-osd@1.service" followed by "systemctl start ceph-osd@1.service" again.
[root@node-2 ~]# systemctl stop ceph-osd@1
[root@node-2 ~]# systemctl restart ceph-osd@1
Job for ceph-osd@1.service failed because start of the service was attempted too often. See "systemctl status ceph-osd@1.service" and "journalctl -xe" for details.
To force a start use "systemctl reset-failed ceph-osd@1.service" followed by "systemctl start ceph-osd@1.service" again
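There are three options for this situation:
The following example deletes the OSD manually and then recreates it:
# In this example osd.1 on node-2 is deleted
# 1. Remove osd.1 from the OSD map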
[root@node-1 ~]# ceph osd rm osd.1
removed osd.1
[root@node-1 ~]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.03918 root default
-3 0.01959 host node-1
0 hdd 0.00980 osd.0 up 1.00000 1.00000
3 hdd 0.00980 osd.3 up 0.09999 1.00000
-5 0.00980 host node-2
1 hdd 0.00980 osd.1 DNE 0
-7 0.00980 host node-3
2 hdd 0.00980 osd.2 up 1.00000 1.00000
# 2. Remove osd.1 from the CRUSH map
[root@node-1 ~]# ceph osd crush rm osd.1
removed item id 1 name 'osd.1' from crush map
[root@node-1 ~]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.02939 root default
-3 0.01959 host node-1
0 hdd 0.00980 osd.0 up 1.00000 1.00000
3 hdd 0.00980 osd.3 up 0.09999 1.00000
-5 0 host node-2
-7 0.00980 host node-3
2 hdd 0.00980 osd.2 up 1.00000 1.00000
# 3. Delete the auth entry of osd.1
[root@node-1 ~]# ceph auth del osd.1
updated
[root@node-1 ~]# ceph auth ls | grep osd.1
installed auth entries:
# 4. Check the cluster status: with 3 replicas, losing one OSD leaves all PGs active, so cluster reads and writes are unaffected
[root@node-1 ~]# ceph -s
cluster:
id: 60e065f1-d992-4d1a-8f4e-f74419674f7e
health: HEALTH_WARN
Degraded data redundancy: 53157/161724 objects degraded (32.869%), 207 pgs degraded, 461 pgs undersized
services:
mon: 3 daemons, quorum node-1,node-2,node-3 (age 21m)
mgr: node-1(active, since 57m)
mds: cephfs:2 {0=mds1=up:active,1=mds2=up:active} 1 up:standby
osd: 3 osds: 3 up (since 2m), 3 in (since 24m); 4 remapped pgs
rgw: 1 daemon active (node-2)
task status:
data:
pools: 9 pools, 464 pgs
objects: 53.91k objects, 802 MiB
usage: 11 GiB used, 19 GiB / 30 GiB avail
pgs: 53157/161724 objects degraded (32.869%)
816/161724 objects misplaced (0.505%)
254 active+undersized
206 active+undersized+degraded
3 active+clean+remapped
1 active+undersized+degraded+remapped+backfilling
io:
recovery: 35 KiB/s, 7 objects/s
# 5. On node-2, unmount /var/lib/ceph/osd/ceph-1/, the directory that holds the files of osd.1, including a block symlink to the actual disk
[root@node-1 ~]# ssh node-2
Last login: Mon Oct 18 09:48:25 2021 from node-1
[root@node-2 ~]# umount /var/lib/ceph/osd/ceph-1/
[root@node-2 ~]# rm -rf /var/lib/ceph/osd/ceph-1/
# 6. On node-2, osd.1 maps to disk /dev/sdb, which is also the disk the OSD will be rebuilt on
[root@node-2 ~]# ceph-volume lvm list
====== osd.1 =======
[block] /dev/ceph-847ed937-dfbb-485e-af90-9cf27bf08c99/osd-block-66119fd9-226d-4665-b2cc-2b6564b7d715
block device /dev/ceph-847ed937-dfbb-485e-af90-9cf27bf08c99/osd-block-66119fd9-226d-4665-b2cc-2b6564b7d715
block uuid 9owOrT-EMVD-c2kY-53Xj-2ECv-0Kji-euIRkX
cephx lockbox secret
cluster fsid 60e065f1-d992-4d1a-8f4e-f74419674f7e
cluster name ceph
crush device class None
encrypted 0
osd fsid 66119fd9-226d-4665-b2cc-2b6564b7d715
osd id 1
osdspec affinity
type block
vdo 0
devices /dev/sdb
# 7. Format the disk: run dmsetup remove {device name} first, then format
[root@node-2 ~]# dmsetup ls
ceph--847ed937--dfbb--485e--af90--9cf27bf08c99-osd--block--66119fd9--226d--4665--b2cc--2b6564b7d715 (253:2)
centos-swap (253:1)
centos-root (253:0)
[root@node-2 ~]# dmsetup remove ceph--847ed937--dfbb--485e--af90--9cf27bf08c99-osd--block--66119fd9--226d--4665--b2cc--2b6564b7d715
[root@node-2 ~]# mkfs.xfs -f /dev/sdb
meta-data=/dev/sdb isize=512 agcount=4, agsize=655360 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=0, sparse=0
data = bsize=4096 blocks=2621440, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal log bsize=4096 blocks=2560, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
# 8. Recreate the OSD
[root@node-2 ~]# ceph-volume lvm create --data /dev/sdb
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 1e8fb7ca-a870-414b-930d-1e22f32eb84b
Running command: /usr/sbin/vgcreate --force --yes ceph-2c45e0ec-5edd-4df8-b6b0-af18ef077254 /dev/sdb
stdout: Wiping xfs signature on /dev/sdb.
stdout: Physical volume "/dev/sdb" successfully created.
stdout: Volume group "ceph-2c45e0ec-5edd-4df8-b6b0-af18ef077254" successfully created
Running command: /usr/sbin/lvcreate --yes -l 2559 -n osd-block-1e8fb7ca-a870-414b-930d-1e22f32eb84b ceph-2c45e0ec-5edd-4df8-b6b0-af18ef077254
stdout: Logical volume "osd-block-1e8fb7ca-a870-414b-930d-1e22f32eb84b" created.
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /usr/bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-1
Running command: /usr/bin/chown -h ceph:ceph /dev/ceph-2c45e0ec-5edd-4df8-b6b0-af18ef077254/osd-block-1e8fb7ca-a870-414b-930d-1e22f32eb84b
Running command: /usr/bin/chown -R ceph:ceph /dev/dm-2
Running command: /usr/bin/ln -s /dev/ceph-2c45e0ec-5edd-4df8-b6b0-af18ef077254/osd-block-1e8fb7ca-a870-414b-930d-1e22f32eb84b /var/lib/ceph/osd/ceph-1/block
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o /var/lib/ceph/osd/ceph-1/activate.monmap
stderr: 2021-10-18 13:53:29.353 7f369a1e2700 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.bootstrap-osd.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,: (2) No such file or directory
2021-10-18 13:53:29.353 7f369a1e2700 -1 AuthRegistry(0x7f36940662f8) no keyring found at /etc/ceph/ceph.client.bootstrap-osd.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,, disabling cephx
stderr: got monmap epoch 3
Running command: /usr/bin/ceph-authtool /var/lib/ceph/osd/ceph-1/keyring --create-keyring --name osd.1 --add-key AQDYC21hGa9KKRAAZoHogGn9ouEPpb9RY3/FXw==
stdout: creating /var/lib/ceph/osd/ceph-1/keyring
added entity osd.1 auth(key=AQDYC21hGa9KKRAAZoHogGn9ouEPpb9RY3/FXw==)
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-1/keyring
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-1/
Running command: /usr/bin/ceph-osd --cluster ceph --osd-objectstore bluestore --mkfs -i 1 --monmap /var/lib/ceph/osd/ceph-1/activate.monmap --keyfile - --osd-data /var/lib/ceph/osd/ceph-1/ --osd-uuid 1e8fb7ca-a870-414b-930d-1e22f32eb84b --setuser ceph --setgroup ceph
stderr: 2021-10-18 13:53:29.899 7fa4e1df1a80 -1 bluestore(/var/lib/ceph/osd/ceph-1/) _read_fsid unparsable uuid
--> ceph-volume lvm prepare successful for: /dev/sdb
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-1
Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-2c45e0ec-5edd-4df8-b6b0-af18ef077254/osd-block-1e8fb7ca-a870-414b-930d-1e22f32eb84b --path /var/lib/ceph/osd/ceph-1 --no-mon-config
Running command: /usr/bin/ln -snf /dev/ceph-2c45e0ec-5edd-4df8-b6b0-af18ef077254/osd-block-1e8fb7ca-a870-414b-930d-1e22f32eb84b /var/lib/ceph/osd/ceph-1/block
Running command: /usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-1/block
Running command: /usr/bin/chown -R ceph:ceph /dev/dm-2
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-1
Running command: /usr/bin/systemctl enable ceph-volume@lvm-1-1e8fb7ca-a870-414b-930d-1e22f32eb84b
stderr: Created symlink from /etc/systemd/system/multi-user.target.wants/ceph-volume@lvm-1-1e8fb7ca-a870-414b-930d-1e22f32eb84b.service to /usr/lib/systemd/system/ceph-volume@.service.
Running command: /usr/bin/systemctl enable --runtime ceph-osd@1
Running command: /usr/bin/systemctl start ceph-osd@1
--> ceph-volume lvm activate successful for osd ID: 1
--> ceph-volume lvm create successful for: /dev/sdb
# 9. Check the cluster and wait for the data to migrate and recover
[root@node-2 ~]# ceph -s
cluster:
id: 60e065f1-d992-4d1a-8f4e-f74419674f7e
health: HEALTH_WARN
Degraded data redundancy: 53237/161556 objects degraded (32.953%), 213 pgs degraded
services:
mon: 3 daemons, quorum node-1,node-2,node-3 (age 4h)
mgr: node-1(active, since 5h)
mds: cephfs:2 {0=mds1=up:active,1=mds2=up:active} 1 up:standby
osd: 4 osds: 4 up (since 38s), 4 in (since 4h); 111 remapped pgs
rgw: 1 daemon active (node-2)
task status:
data:
pools: 9 pools, 464 pgs
objects: 53.85k objects, 802 MiB
usage: 12 GiB used, 28 GiB / 40 GiB avail
pgs: 53237/161556 objects degraded (32.953%)
814/161556 objects misplaced (0.504%)
241 active+clean
103 active+recovery_wait+degraded
63 active+undersized+degraded+remapped+backfill_wait
46 active+recovery_wait+undersized+degraded+remapped
9 active+recovery_wait
1 active+recovering+undersized+degraded+remapped
1 active+remapped+backfill_wait
io:
recovery: 0 B/s, 43 keys/s, 2 objects/s
[root@node-2 ~]# ceph osd df
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
0 hdd 0.00980 1.00000 10 GiB 4.8 GiB 3.8 GiB 20 MiB 1004 MiB 5.2 GiB 47.59 1.57 440 up
3 hdd 0.00980 0.09999 10 GiB 1.3 GiB 265 MiB 1.5 MiB 1022 MiB 8.7 GiB 12.59 0.42 25 up
1 hdd 0.00980 1.00000 10 GiB 1.2 GiB 212 MiB 0 B 1 GiB 8.8 GiB 12.08 0.40 361 up
2 hdd 0.00980 1.00000 10 GiB 4.9 GiB 3.9 GiB 22 MiB 1002 MiB 5.1 GiB 48.85 1.61 464 up
TOTAL 40 GiB 12 GiB 8.1 GiB 44 MiB 4.0 GiB 28 GiB 30.28
MIN/MAX VAR: 0.40/1.61 STDDEV: 18.02
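When rebuilding an OSD, note that if your cluster uses a customized CRUSH map, you also need to check the CRUSH map.
OSD recovery may affect the I/O service the cluster provides to clients. The following settings can be tuned.
References: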
https://www.cnblogs.com/gzxbkk/p/7704464.html
http://strugglesquirrel.com/2019/02/02/ceph%E8%BF%90%E7%BB%B4%E5%A4%A7%E5%AE%9D%E5%89%91%E4%B9%8B%E9%9B%86%E7%BE%A4osd%E4%B8%BAfull/
To keep the load from PG migration from bringing OSDs down, first add the following settings to the global section of the configuration file
osd_op_thread_suicide_timeout = 900
osd_op_thread_timeout = 900
osd_recovery_thread_suicide_timeout = 900
osd_heartbeat_grace = 900
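Recovery speed settings: the default speed is usually adequate, but if you want faster migration you can try tuning the following parameters
osd_recovery_max_single_start # larger means faster OSD recovery and more impact on client I/O; default 1
osd_recovery_max_active # larger means faster OSD recovery and more impact on client I/O; default 3
osd_recovery_op_priority # larger means faster OSD recovery and more impact on client I/O; default 3
osd_max_backfills # larger means faster OSD recovery and more impact on client I/O; default 1
osd_recovery_sleep # smaller means faster OSD recovery and more impact on client I/O; default 0 seconds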
[root@node-2 ~]# ceph config set osd osd_recovery_max_single_start 32
[root@node-2 ~]# ceph config set osd osd_max_backfills 32
[root@node-2 ~]# ceph config set osd osd_recovery_max_active 32
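After these changes OSD recovery is much faster. Note that the three settings must be raised together; if you raise only one, the others remain the bottleneck and recovery does not speed up.
# Check the recovery speed with ceph -s; before the settings above it was only 6 objects/s.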
[root@node-2 ~]# ceph -s
io:
recovery: 133 KiB/s, 35 objects/s
Configuration commands for reference
# List all options
ceph config ls
# Show the description and default of an option
[root@node-2 ~]# ceph config help osd_recovery_sleep
osd_recovery_sleep - Time in seconds to sleep before next recovery or backfill op
(float, advanced)
Default: 0.000000
Can update at runtime: true
# Show the settings that have been customized
[root@node-2 ~]# ceph config dump
WHO MASK LEVEL OPTION VALUE RO
mon advanced mon_warn_on_insecure_global_id_reclaim false
mon advanced mon_warn_on_insecure_global_id_reclaim_allowed false
mgr advanced mgr/balancer/active false
mgr advanced mgr/balancer/mode upmap
mgr advanced mgr/balancer/sleep_interval 60
mds advanced mds_session_blacklist_on_evict true
mds advanced mds_session_blacklist_on_timeout true
client advanced client_reconnect_stale true
client advanced debug_client 20/20
# Change a setting
ceph config set {mon/osd/client/mgr/..} [config_name] [value]
ceph config set client client_reconnect_stale true
# Query the customized settings of a specific entity (it has to be written as osd.osd, mon.mon, mgr.mgr)
$ ceph config get client.client
WHO MASK LEVEL OPTION VALUE RO
client advanced client_reconnect_stale true
client advanced debug_client 20/20
# Remove a customized setting
$ ceph config rm [who] [name]
Reference: https://zhuanlan.zhihu.com/p/74323736
With three replicas, losing a PG is generally unlikely. If it does happen, the lost data cannot be recovered.
# 1. Find the lost PGs
root@storage01-ib:~# ceph pg dump_stuck unclean | grep unknown
20.37 unknown [] -1 [] -1
20.29 unknown [] -1 [] -1
20.16 unknown [] -1 [] -1
# 2. Recreate the PG
root@storage01-ib:~# ceph osd force-create-pg 20.37 --yes-i-really-mean-it
pg 20.37 now creating, ok
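Note: do not run a cluster with a single replica.
The warning "1 pools have many more objects per pg than average" means that some pool in the cluster has too few PGs: each of its PGs holds more than 10 times the cluster-wide average number of objects per PG. The simplest fix is to increase the pool's PG count.
# Warning: 1 pools have many more objects per pg than average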
[root@lab8106 ~]# ceph -s
cluster fa7ec1a1-662a-4ba3-b478-7cb570482b62
health HEALTH_WARN
1 pools have many more objects per pg than average
# Increase the pool's PG count. From Nautilus on, only pg_num needs to be adjusted; pgp_num follows automatically.
ceph osd pool set cephfs_metadata pg_num 64
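To see which pool would trigger this warning, you can redo the arithmetic yourself: compute the cluster-wide average of objects per PG and compare each pool against 10 times that value. The sketch below is only an approximation of the check described above (the real monitor check has further conditions that are not modeled here), and the pool names and numbers are made-up sample data.

```python
def skewed_pools(pools, skew=10.0):
    """Return pools whose objects-per-PG exceed `skew` times the cluster average.

    pools -- {pool_name: (object_count, pg_num)}
    """
    total_objects = sum(objects for objects, _ in pools.values())
    total_pgs = sum(pg_num for _, pg_num in pools.values())
    if total_objects == 0 or total_pgs == 0:
        return []
    average = total_objects / total_pgs  # cluster-wide objects per PG
    return sorted(
        name
        for name, (objects, pg_num) in pools.items()
        if pg_num > 0 and objects / pg_num > skew * average
    )


# Hypothetical numbers: a metadata pool with many objects squeezed into 8 PGs.
pools = {
    "cephfs_metadata": (50000, 8),
    "cephfs_data": (3000, 128),
    "rbd": (1000, 128),
}
print(skewed_pools(pools))  # ['cephfs_metadata']
```

In this sample the cluster average is 54000 / 264, about 204.5 objects per PG, while cephfs_metadata holds 6250 per PG, well over the tenfold threshold. Raising its pg_num lowers that ratio.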