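Zabbix raised an alert: the journal SSD on one Ceph node had used more than 96% of its rated write endurance. According to Intel, once write endurance reaches the rated value the drive can no longer be written normally. The sensor output shows PercentageUsed : 97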
[root@ceph-11 ~]# isdct show -sensor
PowerOnHours : 0x021B5
EraseFailCount : 0
EndToEndErrorDetectionCount : 0
ReliabilityDegraded : False
AvailableSpare : 100
AvailableSpareBelowThreshold : False
DeviceStatus : Healthy
SpecifiedPCBMaxOperatingTemp : 85
SpecifiedPCBMinOperatingTemp : 0
UnsafeShutdowns : 0x08
CrcErrorCount : 0
AverageNandEraseCycles : 2917
MediaErrors : 0x00
PowerCycles : 0x0C
ProgramFailCount : 0
MaxNandEraseCycles : 2922
HighestLifetimeTemperature : 57
PercentageUsed : 97
ThermalThrottleStatus : 0
ErrorInfoLogEntries : 0x00
MinNandEraseCycles : 2913
LowestLifetimeTemperature : 23
ReadOnlyMode : False
ThermalThrottleCount : 0
TemperatureThresholdExceeded : False
Temperature - Celsius : 50
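The check behind this kind of alert can be sketched as a small filter over the sensor output. This is only an illustration: the helper names sensor_value, wear_alert and hex_to_dec are made up here, the threshold of 90 is an arbitrary assumption, and the actual Zabbix item used on this cluster is not shown in this post.

```shell
#!/bin/bash
# Hypothetical helpers (not part of isdct): they only parse "Key : Value"
# lines such as those printed by `isdct show -sensor`.

# Print the value of one key read from stdin.
sensor_value() {
    local key=$1
    awk -F' *: *' -v k="$key" '$1 == k { print $2; exit }'
}

# Return 0 (alert) when PercentageUsed on stdin is at or above the limit.
# The default limit of 90 is an assumption, not a vendor recommendation.
wear_alert() {
    local limit=${1:-90} used
    used=$(sensor_value PercentageUsed)
    [ -n "$used" ] && [ "$used" -ge "$limit" ]
}

# Convert hex counters such as PowerOnHours (0x021B5) to decimal.
hex_to_dec() {
    printf '%d\n' "$(( $1 ))"
}
```

Possible usage on the node: `isdct show -sensor | wear_alert 95 && echo "journal SSD close to end of life"`.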
Twelve OSDs use this SSD for their journals:
[root@ceph-11 ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 5.5T 0 disk
└─sda1 8:1 0 5.5T 0 part /var/lib/ceph/osd/ceph-87
sdb 8:16 0 5.5T 0 disk
└─sdb1 8:17 0 5.5T 0 part /var/lib/ceph/osd/ceph-88
sdc 8:32 0 5.5T 0 disk
└─sdc1 8:33 0 5.5T 0 part /var/lib/ceph/osd/ceph-89
sdd 8:48 0 5.5T 0 disk
└─sdd1 8:49 0 5.5T 0 part /var/lib/ceph/osd/ceph-90
sde 8:64 0 5.5T 0 disk
└─sde1 8:65 0 5.5T 0 part /var/lib/ceph/osd/ceph-91
sdf 8:80 0 5.5T 0 disk
└─sdf1 8:81 0 5.5T 0 part /var/lib/ceph/osd/ceph-92
sdg 8:96 0 5.5T 0 disk
└─sdg1 8:97 0 5.5T 0 part /var/lib/ceph/osd/ceph-93
sdh 8:112 0 5.5T 0 disk
└─sdh1 8:113 0 5.5T 0 part /var/lib/ceph/osd/ceph-94
sdi 8:128 0 5.5T 0 disk
└─sdi1 8:129 0 5.5T 0 part /var/lib/ceph/osd/ceph-95
sdj 8:144 0 5.5T 0 disk
└─sdj1 8:145 0 5.5T 0 part /var/lib/ceph/osd/ceph-96
sdk 8:160 0 5.5T 0 disk
└─sdk1 8:161 0 5.5T 0 part /var/lib/ceph/osd/ceph-97
sdl 8:176 0 5.5T 0 disk
└─sdl1 8:177 0 5.5T 0 part /var/lib/ceph/osd/ceph-98
sdm 8:192 0 419.2G 0 disk
└─sdm1 8:193 0 419.2G 0 part /
nvme0n1 259:0 0 372.6G 0 disk
├─nvme0n1p1 259:1 0 30G 0 part
├─nvme0n1p2 259:2 0 30G 0 part
├─nvme0n1p3 259:3 0 30G 0 part
├─nvme0n1p4 259:4 0 30G 0 part
├─nvme0n1p5 259:5 0 30G 0 part
├─nvme0n1p6 259:6 0 30G 0 part
├─nvme0n1p7 259:7 0 30G 0 part
├─nvme0n1p8 259:8 0 30G 0 part
├─nvme0n1p9 259:9 0 30G 0 part
├─nvme0n1p10 259:10 0 30G 0 part
├─nvme0n1p11 259:11 0 30G 0 part
└─nvme0n1p12 259:12 0 30G 0 part
[root@ceph-11 ~]#
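lsblk shows the partitions but not which OSD owns which journal. A minimal sketch for listing that mapping follows, assuming each OSD directory holds a journal symlink (the layout the rebuild script later in this post relies on). The function name list_osd_journals is hypothetical.

```shell
#!/bin/bash
# Print "<osd id> <journal link target>" for every OSD directory under the
# given base path (default /var/lib/ceph/osd). Assumes filestore OSDs whose
# journal is a symlink named "journal" inside the OSD directory.
list_osd_journals() {
    local base=${1:-/var/lib/ceph/osd} link id
    for link in "$base"/ceph-*/journal; do
        [ -L "$link" ] || continue      # skip OSDs without a journal symlink
        id=${link%/journal}             # strip the trailing /journal
        id=${id##*/ceph-}               # keep only the numeric OSD id
        printf '%s %s\n' "$id" "$(readlink "$link")"
    done
}
```

Running `list_osd_journals` before the swap gives a record of which OSDs are affected by the SSD.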
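1. Lower the OSD priority
Most hardware failures require powering the node off. To keep the maintenance invisible to users, we lower the priority of the node's OSDs beforehand. First check the Ceph version; ours is 10.x, and we have primary-affinity support enabled. Client I/O is sent to the primary of a PG first and is then written to the other replicas. Find the OSDs that belong to host ceph-11 and set their primary-affinity to 0, which means these OSDs should not act as primary for a PG unless the other replicas are unavailable.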
-12 65.47299 host ceph-11
87 5.45599 osd.87 up 1.00000 0.89999
88 5.45599 osd.88 up 0.79999 0.29999
89 5.45599 osd.89 up 1.00000 0.89999
90 5.45599 osd.90 up 1.00000 0.89999
91 5.45599 osd.91 up 1.00000 0.89999
92 5.45599 osd.92 up 1.00000 0.79999
93 5.45599 osd.93 up 1.00000 0.89999
94 5.45599 osd.94 up 1.00000 0.89999
95 5.45599 osd.95 up 1.00000 0.89999
96 5.45599 osd.96 up 1.00000 0.89999
97 5.45599 osd.97 up 1.00000 0.89999
98 5.45599 osd.98 up 0.89999 0.89999
Set the primary-affinity of osd.87 through osd.98 to 0:
for osd in {87..98}; do ceph osd primary-affinity "$osd" 0; done
ceph osd tree shows the new setting on these OSDs:
-12 65.47299 host ceph-11
87 5.45599 osd.87 up 1.00000 0
88 5.45599 osd.88 up 0.79999 0
89 5.45599 osd.89 up 1.00000 0
90 5.45599 osd.90 up 1.00000 0
91 5.45599 osd.91 up 1.00000 0
92 5.45599 osd.92 up 1.00000 0
93 5.45599 osd.93 up 1.00000 0
94 5.45599 osd.94 up 1.00000 0
95 5.45599 osd.95 up 1.00000 0
96 5.45599 osd.96 up 1.00000 0
97 5.45599 osd.97 up 1.00000 0
98 5.45599 osd.98 up 0.89999 0
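The original primary-affinity values differ per OSD (0.89999, 0.29999, 0.79999), so it can help to record them before zeroing them. The sketch below turns saved `ceph osd tree` output into restore commands. It assumes the six-column OSD line layout shown above (ID, WEIGHT, NAME, UP/DOWN, REWEIGHT, PRIMARY-AFFINITY), and the function name affinity_restore_cmds is hypothetical.

```shell
#!/bin/bash
# Read `ceph osd tree` output on stdin and print one
# "ceph osd primary-affinity <id> <value>" line per OSD whose id lies in
# [lo, hi]. Assumes the last column of each osd.N line is PRIMARY-AFFINITY.
affinity_restore_cmds() {
    local lo=${1:-0} hi=${2:-999999}
    awk -v lo="$lo" -v hi="$hi" '
        $3 ~ /^osd\.[0-9]+$/ && NF >= 6 && $1 + 0 >= lo && $1 + 0 <= hi {
            printf "ceph osd primary-affinity %s %s\n", $1, $6
        }'
}
```

For example, `ceph osd tree | affinity_restore_cmds 87 98 > /root/affinity-restore.sh` run before the change leaves a file that can be reviewed and executed after the maintenance (the file path is only an example).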
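2. Prevent OSDs from being marked out
ceph osd set noout
By default, an OSD that stays unresponsive for too long is automatically marked out of the cluster, which triggers data migration. Shutting down to swap the SSD takes a while, so to avoid moving data back and forth for nothing we temporarily stop the cluster from marking OSDs out. Use ceph -s to confirm the flag took effect: the health changes to HEALTH_WARN with the extra message "noout flag(s) set", and noout appears in the flags line.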
[root@ceph-11 ~]# ceph -s
cluster 936a5233-9441-49df-95c1-01de82a192f4
health HEALTH_WARN
noout flag(s) set
monmap e5: 6 mons at {ceph-1=100.100.200.201:6789/0,ceph-2=100.100.200.202:6789/0,ceph-3=100.100.200.203:6789/0,ceph-4=100.100.200.204:6789/0,ceph-5=100.100.200.205:6789/0,ceph-6=100.100.200.206:6789/0}
election epoch 406, quorum 0,1,2,3,4,5 ceph-1,ceph-2,ceph-3,ceph-4,ceph-5,ceph-6
fsmap e94: 1/1/1 up {0=ceph-1=up:active}, 1 up:standby
osdmap e73511: 111 osds: 108 up, 108 in
flags noout,sortbitwise,require_jewel_osds
pgmap v85913863: 5064 pgs, 24 pools, 89164 GB data, 12450 kobjects
261 TB used, 141 TB / 403 TB avail
5060 active+clean
4 active+clean+scrubbing+deep
client io 27608 kB/s rd, 59577 kB/s wr, 399 op/s rd, 668 op/s wr
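Instead of reading the status by eye, the flags line can be checked in a script. A minimal sketch, assuming the `ceph -s` layout shown above where the flags appear as one comma-separated field after the word flags; the function name has_osd_flag is hypothetical.

```shell
#!/bin/bash
# Return 0 if the given flag appears in the "flags a,b,c" line of
# `ceph -s` output read from stdin, non-zero otherwise.
has_osd_flag() {
    local flag=$1
    awk -v f="$flag" '
        $1 == "flags" {
            n = split($2, a, ",")
            for (i = 1; i <= n; i++) if (a[i] == f) found = 1
        }
        END { exit !found }'
}
```

Possible usage: `ceph -s | has_osd_flag noout || echo "noout is NOT set, do not power off yet"`.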
3. Check whether the primary PGs have moved
[root@ceph-11 ~]# ceph pg ls | grep "\[9[1-8],"
13.24 5066 0 0 0 0 41480507922 3071 3071 active+clean 2019-07-02 19:33:37.537802 73497'120563162 73511:110960694 [94,25,64] 94 [94,25,64] 94 73497'120562718 2019-07-02 19:33:37.537761 73294'120561198 2019-07-01 18:11:54.686413
13.10f 4874 0 0 0 0 39967832064 3083 3083 active+clean 2019-07-01 23:56:13.911259 73511'59603193 73511:52739094 [91,44,38] 91 [91,44,38] 91 73302'59589396 2019-07-01 23:56:13.911226 69213'5954576 2019-06-26 22:58:12.864475
13.17d 5001 0 0 0 0 40919228578 3088 3088 active+clean 2019-07-02 13:51:04.162137 73511'34680543 73511:26095334 [96,45,72] 96 [96,45,72] 96 73497'34678725 2019-07-02 13:51:04.162089 70393'3467604 2019-07-01 08:47:58.771910
13.20d 4872 0 0 0 0 40007166482 3036 3036 active+clean 2019-07-03 07:40:28.677097 73511'27811217 73511:22372286 [93,85,73] 93 [93,85,73] 93 73497'27809831 2019-07-03 07:40:28.677059 73302'2779662 2019-07-01 23:15:14.731237
13.214 5006 0 0 0 0 40940654592 3079 3079 active+clean 2019-07-02 21:10:51.094829 73511'34400529 73511:27161705 [94,61,53] 94 [94,61,53] 94 73497'34398612 2019-07-02 21:10:51.094784 73294'3439396 2019-07-01 18:54:06.249357
13.2fd 4950 0 0 0 0 40522633728 3086 3086 active+clean 2019-07-02 06:36:14.763435 73511'149011011 73511:136693896 [91,58,36] 91 [91,58,36] 91 73497'148963815 2019-07-02 06:36:14.763383 73497'148963815 2019-07-02 06:36:14.763383
13.3ae 4989 0 0 0 0 40879544320 3055 3055 active+clean 2019-07-02 00:30:44.817062 73511'67827999 73511:60578765 [91,54,25] 91 [91,54,25] 91 73302'67806651 2019-07-02 00:30:44.817017 69213'67776352
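Some primaries refuse to move. Never mind: noout is already set and the pools keep three replicas, so if we simply power this machine off Ceph serves I/O from the remaining replicas, and no data migration starts.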
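A cluster that stores three copies can lose any two hosts without losing data, provided the replicas sit on different hosts. Note, however, that if the pool's min_size is 2 (common for three-replica pools), a PG stops serving I/O once two of its replicas are down. So make sure the number of hosts that are powered off at the same time stays within the limit, otherwise you risk a bigger outage.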
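The grep pattern used in this step only matches primaries 91 through 98, so osd.87 to osd.90 are not covered. A sketch of a range-based check follows. It assumes the `ceph pg ls` line layout shown above, where the last bracketed set on a line is the acting set and the field right after it is the acting primary. The function name pgs_with_primary_in is hypothetical.

```shell
#!/bin/bash
# Read `ceph pg ls` output on stdin and print "<pgid> <acting primary>" for
# every PG whose acting primary OSD id lies in [lo, hi].
pgs_with_primary_in() {
    local lo=$1 hi=$2
    awk -v lo="$lo" -v hi="$hi" '
        {
            p = ""
            # The last field shaped like [a,b,c] is taken as the acting set;
            # the field following it is the acting primary.
            for (i = 1; i < NF; i++)
                if ($i ~ /^\[[0-9,]+\]$/) p = $(i + 1)
            if (p != "" && p + 0 >= lo && p + 0 <= hi) print $1, p
        }'
}
```

Possible usage: `ceph pg ls | pgs_with_primary_in 87 98 | wc -l` counts the PGs that still have a primary on this host.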
4. Stop the services, shut down the server, and replace the SSD
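The exact commands used to stop the services are not recorded in this post. As a hedged sketch: on a systemd-based Jewel install the OSD units are usually named ceph-osd@<id>, and because the old SSD was still readable here, flushing each filestore journal with `ceph-osd --flush-journal` after stopping the OSD is one way to make sure nothing is left only in the old journal. The generator below just prints the commands for review; the function name stop_and_flush_cmds is hypothetical, and the unit name is an assumption about this host.

```shell
#!/bin/bash
# Print (do not run) the stop and journal-flush commands for the given OSD
# ids. Assumes systemd units named ceph-osd@<id> and filestore OSDs.
stop_and_flush_cmds() {
    local id
    for id in "$@"; do
        printf 'systemctl stop ceph-osd@%s\n' "$id"
        printf 'ceph-osd -i %s --flush-journal\n' "$id"
    done
}
```

Possible usage: `stop_and_flush_cmds {87..98} > /dev/shm/stop-osds.sh`, then review the file before running it as root.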
The newly installed SSD reports zero wear, PercentageUsed : 0
[root@ceph-11 ~]# isdct show -sensor
PowerOnHours : 0x063F3
EraseFailCount : 0
EndToEndErrorDetectionCount : 0
ReliabilityDegraded : False
AvailableSpare : 100
AvailableSpareBelowThreshold : False
DeviceStatus : Healthy
SpecifiedPCBMaxOperatingTemp : 85
SpecifiedPCBMinOperatingTemp : 0
UnsafeShutdowns : 0x00
CrcErrorCount : 0
AverageNandEraseCycles : 7
MediaErrors : 0x00
PowerCycles : 0x012
ProgramFailCount : 0
MaxNandEraseCycles : 10
HighestLifetimeTemperature : 48
PercentageUsed : 0
ThermalThrottleStatus : 0
ErrorInfoLogEntries : 0x00
MinNandEraseCycles : 6
LowestLifetimeTemperature : 16
ReadOnlyMode : False
ThermalThrottleCount : 0
TemperatureThresholdExceeded : False
Temperature - Celsius : 48
5. The new disk shows up as nvme0n1
[root@ceph-11 ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 5.5T 0 disk
└─sda1 8:1 0 5.5T 0 part /var/lib/ceph/osd/ceph-87
sdb 8:16 0 5.5T 0 disk
└─sdb1 8:17 0 5.5T 0 part /var/lib/ceph/osd/ceph-88
sdc 8:32 0 5.5T 0 disk
└─sdc1 8:33 0 5.5T 0 part /var/lib/ceph/osd/ceph-89
sdd 8:48 0 5.5T 0 disk
└─sdd1 8:49 0 5.5T 0 part /var/lib/ceph/osd/ceph-90
sde 8:64 0 5.5T 0 disk
└─sde1 8:65 0 5.5T 0 part /var/lib/ceph/osd/ceph-91
sdf 8:80 0 5.5T 0 disk
└─sdf1 8:81 0 5.5T 0 part /var/lib/ceph/osd/ceph-92
sdg 8:96 0 5.5T 0 disk
└─sdg1 8:97 0 5.5T 0 part /var/lib/ceph/osd/ceph-93
sdh 8:112 0 5.5T 0 disk
└─sdh1 8:113 0 5.5T 0 part /var/lib/ceph/osd/ceph-94
sdi 8:128 0 5.5T 0 disk
└─sdi1 8:129 0 5.5T 0 part /var/lib/ceph/osd/ceph-95
sdj 8:144 0 5.5T 0 disk
└─sdj1 8:145 0 5.5T 0 part /var/lib/ceph/osd/ceph-96
sdk 8:160 0 5.5T 0 disk
└─sdk1 8:161 0 5.5T 0 part /var/lib/ceph/osd/ceph-97
sdl 8:176 0 5.5T 0 disk
└─sdl1 8:177 0 5.5T 0 part /var/lib/ceph/osd/ceph-98
sdm 8:192 0 419.2G 0 disk
└─sdm1 8:193 0 419.2G 0 part /
nvme0n1 259:0 0 372.6G 0 disk
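Before partitioning, it is worth confirming that twelve 30G journals fit on the new device: 12 x 30G = 360G, which is below the 372.6G reported by lsblk. The same arithmetic as a sketch, assuming sizes are binary gigabytes (GiB) and ignoring the small GPT overhead; the function name journals_fit is hypothetical.

```shell
#!/bin/bash
# Return 0 if <count> journal partitions of <size_gib> GiB each fit on a
# disk of <disk_bytes> bytes. GPT metadata overhead is ignored.
journals_fit() {
    local disk_bytes=$1 count=$2 size_gib=$3
    local needed=$(( count * size_gib * 1024 * 1024 * 1024 ))
    [ "$needed" -le "$disk_bytes" ]
}
```

Possible usage: `journals_fit "$(lsblk -bdno SIZE /dev/nvme0n1)" 12 30 || echo "journals do not fit"`.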
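6. Rebuild the journals
Because the old journals are gone, the OSDs cannot start after boot and the journals have to be recreated. The script below partitions the new SSD and generates the script that is finally executed.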
#!/bin/bash
desc="create ceph journal part for specified osd."
type_journal_uuid=45b0969e-9b03-4f30-b4c6-b4b80ceff106
sgdisk=sgdisk
journal_size=30G            # size of each journal partition
journal_dev=/dev/nvme0n1    # device name of the SSD
sleep=5
# One "path:uuid" line per OSD, taken from the journal_uuid files.
osd_uuids=$(grep "" /var/lib/ceph/osd/ceph-*/journal_uuid 2>/dev/null)
die(){ echo >&2 "$@"; exit 1; }
tip(){ printf >&2 "%b" "$@"; }
[ "$osd_uuids" ] || die "no osd uuid found."
echo "osd journal uuid:"
echo "$osd_uuids"
echo "now sleep $sleep"
sleep $sleep
journal_script="/dev/shm/ceph-journal.sh"
echo '#!/bin/bash' > "$journal_script"
echo "ls -l ${journal_dev}p*" >> "$journal_script"
echo "sleep 5" >> "$journal_script"
# The partition number has to be known first; only then can the name,
# type code and uuid be set on it.
d=0
while IFS=": " read -r osd_path uuid; do
    [ "$osd_path" ] || continue
    d=$((d + 1))
    osd_id=${osd_path#/var/lib/ceph/osd/ceph-}
    osd_id=${osd_id%/journal_uuid}
    journal_link=${osd_path%_uuid}
    # An empty or non-numeric id fails the test and aborts.
    [ "${osd_id:--1}" -ge 0 ] 2>/dev/null || {
        echo "invalid osd id: $osd_id."; exit 11;
    }
    tip "create journal for osd $osd_id ... "
    $sgdisk --mbrtogpt --new="$d":0:+"$journal_size" \
        --change-name="$d":'ceph journal' \
        --typecode="$d":"$type_journal_uuid" \
        --partition-guid="$d":"$uuid" \
        "$journal_dev" || exit 1
    tip "part done.\n"
    # Point the OSD's journal symlink at the new partition.
    ln -sfT /dev/disk/by-partuuid/"$uuid" "$journal_link" || exit 3
    echo "ceph-osd --mkjournal --osd-journal ${journal_dev}p${d} -i ${osd_id}" >> "$journal_script"
    sleep 1
done << EOF
$osd_uuids
EOF
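The script above creates the partitions and relinks each OSD's journal, but it does not initialize the journals itself. It only writes the ceph-osd --mkjournal commands to the final script, whose default path is /dev/shm/ceph-journal.sh
Be sure to check the generated content by hand before running it manually as root:
[root@ceph-11 ~]# bash /dev/shm/ceph-journal.sh
Script content: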
[root@ceph-11 ~]# cat /dev/shm/ceph-journal.sh
#!/bin/bash
ls -l /dev/nvme0n1p*
sleep 5
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p1 -i 87
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p2 -i 88
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p3 -i 89
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p4 -i 90
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p5 -i 91
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p6 -i 92
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p7 -i 93
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p8 -i 94
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p9 -i 95
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p10 -i 96
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p11 -i 97
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p12 -i 98
[root@ceph-11 ~]#
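The manual review of the generated script can be backed by a quick sanity check. A minimal sketch, assuming the generated lines have exactly the shape shown above (`ceph-osd --mkjournal --osd-journal <partition> -i <id>`); the function name check_mkjournal_lines is hypothetical, and this does not replace reading the script.

```shell
#!/bin/bash
# Read a generated journal script on stdin, print the number of valid
# mkjournal lines, and return non-zero if a line is malformed or if a
# partition or OSD id appears more than once.
check_mkjournal_lines() {
    awk '
        $1 == "ceph-osd" && $2 == "--mkjournal" {
            # Expected: ceph-osd --mkjournal --osd-journal <dev> -i <id>
            if (NF != 6 || $3 != "--osd-journal" || $5 != "-i") { bad = 1; next }
            if (seen_dev[$4]++ || seen_id[$6]++) bad = 1
            n++
        }
        END { print n + 0; exit bad }'
}
```

Possible usage: `check_mkjournal_lines < /dev/shm/ceph-journal.sh` should print 12 on this host and exit with status 0.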
7. Journal replacement is done; check and restore the services
The OSD services are back up:
[root@ceph-11 ~]# ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-10008 0 root sas6t3
-10007 0 root sas6t2
-10006 130.94598 root sas6t1
-12 65.47299 host ceph-11
87 5.45599 osd.87 up 1.00000 0
88 5.45599 osd.88 up 0.79999 0
89 5.45599 osd.89 up 1.00000 0
90 5.45599 osd.90 up 1.00000 0
91 5.45599 osd.91 up 1.00000 0
92 5.45599 osd.92 up 1.00000 0
93 5.45599 osd.93 up 1.00000 0
94 5.45599 osd.94 up 1.00000 0
95 5.45599 osd.95 up 1.00000 0
96 5.45599 osd.96 up 1.00000 0
97 5.45599 osd.97 up 1.00000 0
98 5.45599 osd.98 up 0.89999 0
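Restore the OSD flag. Everything that was changed during the maintenance has to be reverted:
ceph osd unset noout
Restore the OSD priority. Note that the command below sets every OSD to 0.8, whereas before the maintenance most of them were at 0.89999 (osd.88 was at 0.29999 and osd.92 at 0.79999):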
[root@ceph-11 ~]# for osd in {87..98}; do ceph osd primary-affinity "$osd" 0.8; done
set osd.87 primary-affinity to 0.8 (8524282)
set osd.88 primary-affinity to 0.8 (8524282)
set osd.89 primary-affinity to 0.8 (8524282)
set osd.90 primary-affinity to 0.8 (8524282)
set osd.91 primary-affinity to 0.8 (8524282)
set osd.92 primary-affinity to 0.8 (8524282)
set osd.93 primary-affinity to 0.8 (8524282)
set osd.94 primary-affinity to 0.8 (8524282)
set osd.95 primary-affinity to 0.8 (8524282)
set osd.96 primary-affinity to 0.8 (8524282)
set osd.97 primary-affinity to 0.8 (8524282)
set osd.98 primary-affinity to 0.8 (8524282)
[root@ceph-11 ~]#
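Wait for the cluster to recover
Wait for automatic recovery to bring the cluster back to HEALTH_OK.
If HEALTH_ERR shows up along the way, follow up promptly and search Google for the specific message.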
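Rather than rerunning ceph -s by hand, the wait can be scripted around `ceph health`. This is a sketch under assumptions: the function name wait_for_health_ok, the 30 second interval and the limit of 120 tries are arbitrary choices, not values from this cluster.

```shell
#!/bin/bash
# Poll `ceph health` until it reports HEALTH_OK.
# Returns 0 on HEALTH_OK, 2 as soon as HEALTH_ERR is seen, 1 on timeout.
wait_for_health_ok() {
    local interval=${1:-30} max_tries=${2:-120} status i
    for ((i = 0; i < max_tries; i++)); do
        status=$(ceph health 2>/dev/null)
        case $status in
            HEALTH_OK*)  echo "$status"; return 0 ;;
            HEALTH_ERR*) echo "$status" >&2; return 2 ;;
        esac
        sleep "$interval"      # HEALTH_WARN or no answer: keep waiting
    done
    return 1
}
```

Possible usage: `wait_for_health_ok 30 240 || echo "cluster needs attention"`.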
[root@ceph-11 ~]# ceph -s
cluster 936a5233-9441-49df-95c1-01de82a192f4
health HEALTH_WARN
12 pgs degraded
2 pgs recovering
10 pgs recovery_wait
12 pgs stuck unclean
recovery 116/38259009 objects degraded (0.000%)
monmap e5: 6 mons at {ceph-1=100.100.200.201:6789/0,ceph-2=100.100.200.202:6789/0,ceph-3=100.100.200.203:6789/0,ceph-4=100.100.200.204:6789/0,ceph-5=100.100.200.205:6789/0,ceph-6=100.100.200.206:6789/0}
election epoch 406, quorum 0,1,2,3,4,5 ceph-1,ceph-2,ceph-3,ceph-4,ceph-5,ceph-6
fsmap e94: 1/1/1 up {0=ceph-1=up:active}, 1 up:standby
osdmap e73609: 111 osds: 108 up, 108 in
flags sortbitwise,require_jewel_osds
pgmap v85918476: 5064 pgs, 24 pools, 89195 GB data, 12454 kobjects
261 TB used, 141 TB / 403 TB avail
116/38259009 objects degraded (0.000%)
5049 active+clean
10 active+recovery_wait+degraded
3 active+clean+scrubbing+deep
2 active+recovering+degraded
recovery io 22105 kB/s, 4 objects/s
client io 55017 kB/s rd, 77280 kB/s wr, 944 op/s rd, 590 op/s wr
[root@ceph-11 ~]#
[root@ceph-11 ~]# ceph -s
cluster 936a5233-9441-49df-95c1-01de82a192f4
health HEALTH_WARN
1 pgs degraded
1 pgs recovering
1 pgs stuck unclean
recovery 2/38259009 objects degraded (0.000%)
monmap e5: 6 mons at {ceph-1=100.100.200.201:6789/0,ceph-2=100.100.200.202:6789/0,ceph-3=100.100.200.203:6789/0,ceph-4=100.100.200.204:6789/0,ceph-5=100.100.200.205:6789/0,ceph-6=100.100.200.206:6789/0}
election epoch 406, quorum 0,1,2,3,4,5 ceph-1,ceph-2,ceph-3,ceph-4,ceph-5,ceph-6
fsmap e94: 1/1/1 up {0=ceph-1=up:active}, 1 up:standby
osdmap e73609: 111 osds: 108 up, 108 in
flags sortbitwise,require_jewel_osds
pgmap v85918493: 5064 pgs, 24 pools, 89195 GB data, 12454 kobjects
261 TB used, 141 TB / 403 TB avail
2/38259009 objects degraded (0.000%)
5060 active+clean
3 active+clean+scrubbing+deep
1 active+recovering+degraded
client io 81789 kB/s rd, 245 MB/s wr, 1441 op/s rd, 651 op/s wr
[root@ceph-11 ~]# ceph -s
cluster 936a5233-9441-49df-95c1-01de82a192f4
health HEALTH_OK
monmap e5: 6 mons at {ceph-1=100.100.200.201:6789/0,ceph-2=100.100.200.202:6789/0,ceph-3=100.100.200.203:6789/0,ceph-4=100.100.200.204:6789/0,ceph-5=100.100.200.205:6789/0,ceph-6=100.100.200.206:6789/0}
election epoch 406, quorum 0,1,2,3,4,5 ceph-1,ceph-2,ceph-3,ceph-4,ceph-5,ceph-6
fsmap e94: 1/1/1 up {0=ceph-1=up:active}, 1 up:standby
osdmap e73609: 111 osds: 108 up, 108 in
flags sortbitwise,require_jewel_osds
pgmap v85918494: 5064 pgs, 24 pools, 89195 GB data, 12454 kobjects
261 TB used, 141 TB / 403 TB avail
5061 active+clean
3 active+clean+scrubbing+deep
recovery io 7388 kB/s, 0 objects/s
client io 67551 kB/s rd, 209 MB/s wr, 1153 op/s rd, 901 op/s wr
[root@ceph-11 ~]#
The cluster is back to normal.