zabbix监控报警一台ceph节点journal盘写入寿命已经达到96%以上,根据intel官方说法写入寿命达到设置值将会无法正常写入。PercentageUsed : 97

[root@ceph-11 ~]# isdct show -sensor
PowerOnHours : 0x021B5
EraseFailCount : 0
EndToEndErrorDetectionCount : 0
ReliabilityDegraded : False
AvailableSpare : 100
AvailableSpareBelowThreshold : False
DeviceStatus : Healthy
SpecifiedPCBMaxOperatingTemp : 85
SpecifiedPCBMinOperatingTemp : 0
UnsafeShutdowns : 0x08
CrcErrorCount : 0
AverageNandEraseCycles : 2917
MediaErrors : 0x00
PowerCycles : 0x0C
ProgramFailCount : 0
MaxNandEraseCycles : 2922
HighestLifetimeTemperature : 57
PercentageUsed : 97
ThermalThrottleStatus : 0
ErrorInfoLogEntries : 0x00
MinNandEraseCycles : 2913
LowestLifetimeTemperature : 23
ReadOnlyMode : False
ThermalThrottleCount : 0
TemperatureThresholdExceeded : False
Temperature - Celsius : 50

有12个osd用这块盘做的日志

[root@ceph-11 ~]# lsblk 
NAME         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda            8:0    0   5.5T  0 disk 
└─sda1         8:1    0   5.5T  0 part /var/lib/ceph/osd/ceph-87
sdb            8:16   0   5.5T  0 disk 
└─sdb1         8:17   0   5.5T  0 part /var/lib/ceph/osd/ceph-88
sdc            8:32   0   5.5T  0 disk 
└─sdc1         8:33   0   5.5T  0 part /var/lib/ceph/osd/ceph-89
sdd            8:48   0   5.5T  0 disk 
└─sdd1         8:49   0   5.5T  0 part /var/lib/ceph/osd/ceph-90
sde            8:64   0   5.5T  0 disk 
└─sde1         8:65   0   5.5T  0 part /var/lib/ceph/osd/ceph-91
sdf            8:80   0   5.5T  0 disk 
└─sdf1         8:81   0   5.5T  0 part /var/lib/ceph/osd/ceph-92
sdg            8:96   0   5.5T  0 disk 
└─sdg1         8:97   0   5.5T  0 part /var/lib/ceph/osd/ceph-93
sdh            8:112  0   5.5T  0 disk 
└─sdh1         8:113  0   5.5T  0 part /var/lib/ceph/osd/ceph-94
sdi            8:128  0   5.5T  0 disk 
└─sdi1         8:129  0   5.5T  0 part /var/lib/ceph/osd/ceph-95
sdj            8:144  0   5.5T  0 disk 
└─sdj1         8:145  0   5.5T  0 part /var/lib/ceph/osd/ceph-96
sdk            8:160  0   5.5T  0 disk 
└─sdk1         8:161  0   5.5T  0 part /var/lib/ceph/osd/ceph-97
sdl            8:176  0   5.5T  0 disk 
└─sdl1         8:177  0   5.5T  0 part /var/lib/ceph/osd/ceph-98
sdm            8:192  0 419.2G  0 disk 
└─sdm1         8:193  0 419.2G  0 part /
nvme0n1      259:0    0 372.6G  0 disk 
├─nvme0n1p1  259:1    0    30G  0 part 
├─nvme0n1p2  259:2    0    30G  0 part 
├─nvme0n1p3  259:3    0    30G  0 part 
├─nvme0n1p4  259:4    0    30G  0 part 
├─nvme0n1p5  259:5    0    30G  0 part 
├─nvme0n1p6  259:6    0    30G  0 part 
├─nvme0n1p7  259:7    0    30G  0 part 
├─nvme0n1p8  259:8    0    30G  0 part 
├─nvme0n1p9  259:9    0    30G  0 part 
├─nvme0n1p10 259:10   0    30G  0 part 
├─nvme0n1p11 259:11   0    30G  0 part 
└─nvme0n1p12 259:12   0    30G  0 part 
[root@ceph-11 ~]#

1,降低osd优先级
在大部分故障场景, 我们需要关机操作, 为了让用户无感知, 我们需要提前降低待操作的节点的优先级。首先看下ceph版本号,ceph版本为10.x. 我们启用了primary-affinity支持, 用户的io请求会先转给primary pg处理. 然后写入其他replica(副本).。先找出host ceph-11对应的osd,然后把这些osd的primary-affinity设为0, 意思就是上面的pg除非其他副本挂了, 否则不应该成为主pg.

   -12  65.47299     host ceph-11                                   
    87   5.45599         osd.87        up  1.00000          0.89999 
    88   5.45599         osd.88        up  0.79999          0.29999 
    89   5.45599         osd.89        up  1.00000          0.89999 
    90   5.45599         osd.90        up  1.00000          0.89999 
    91   5.45599         osd.91        up  1.00000          0.89999 
    92   5.45599         osd.92        up  1.00000          0.79999 
    93   5.45599         osd.93        up  1.00000          0.89999 
    94   5.45599         osd.94        up  1.00000          0.89999 
    95   5.45599         osd.95        up  1.00000          0.89999 
    96   5.45599         osd.96        up  1.00000          0.89999 
    97   5.45599         osd.97        up  1.00000          0.89999 
    98   5.45599         osd.98        up  0.89999          0.89999 

将osd87到98优先级设置为0
for osd in {87..98}; do ceph osd primary-affinity "$osd" 0; done

使用ceph osd tree可以看到对应的节点设置

   -12  65.47299     host ceph-11                                   
    87   5.45599         osd.87        up  1.00000                0 
    88   5.45599         osd.88        up  0.79999                0 
    89   5.45599         osd.89        up  1.00000                0 
    90   5.45599         osd.90        up  1.00000                0 
    91   5.45599         osd.91        up  1.00000                0 
    92   5.45599         osd.92        up  1.00000                0 
    93   5.45599         osd.93        up  1.00000                0 
    94   5.45599         osd.94        up  1.00000                0 
    95   5.45599         osd.95        up  1.00000                0 
    96   5.45599         osd.96        up  1.00000                0 
    97   5.45599         osd.97        up  1.00000                0 
    98   5.45599         osd.98        up  0.89999                0 

2,禁止踢出节点
ceph osd set noout

默认情况下, osd长时间无响应则会被自动踢出集群, 从而触发数据迁移. 关机更换ssd操作时间较长, 为了避免数据无意义地来回迁移, 我们需要临时禁止集群自动踢osd,使用ceph -s检查是否配置完成。可以看到集群状态变为WARN, 额外提示说noout flag被设置了, 而且flags这样多了一项

[root@ceph-11 ~]# ceph -s
    cluster 936a5233-9441-49df-95c1-01de82a192f4
     health HEALTH_WARN
            noout flag(s) set
     monmap e5: 6 mons at {ceph-1=100.100.200.201:6789/0,ceph-2=100.100.200.202:6789/0,ceph-3=100.100.200.203:6789/0,ceph-4=100.100.200.204:6789/0,ceph-5=100.100.200.205:6789/0,ceph-6=100.100.200.206:6789/0}
            election epoch 406, quorum 0,1,2,3,4,5 ceph-1,ceph-2,ceph-3,ceph-4,ceph-5,ceph-6
      fsmap e94: 1/1/1 up {0=ceph-1=up:active}, 1 up:standby
     osdmap e73511: 111 osds: 108 up, 108 in
            flags noout,sortbitwise,require_jewel_osds
      pgmap v85913863: 5064 pgs, 24 pools, 89164 GB data, 12450 kobjects
            261 TB used, 141 TB / 403 TB avail
                5060 active+clean
                   4 active+clean+scrubbing+deep
  client io 27608 kB/s rd, 59577 kB/s wr, 399 op/s rd, 668 op/s wr

3,检查pg是否完成切换

[root@ceph-11 ~]# ceph pg ls | grep "\[9[1-8],"
13.24   5066    0   0   0   0   41480507922 3071    3071    active+clean    2019-07-02 19:33:37.537802  73497'120563162 73511:110960694 [94,25,64]  94  [94,25,64]  94  73497'120562718 2019-07-02 19:33:37.537761  73294'120561198 2019-07-01 18:11:54.686413
13.10f  4874    0   0   0   0   39967832064 3083    3083    active+clean    2019-07-01 23:56:13.911259  73511'59603193  73511:52739094  [91,44,38]  91  [91,44,38]  91  73302'59589396  2019-07-01 23:56:13.911226  69213'59545762019-06-26 22:58:12.864475
13.17d  5001    0   0   0   0   40919228578 3088    3088    active+clean    2019-07-02 13:51:04.162137  73511'34680543  73511:26095334  [96,45,72]  96  [96,45,72]  96  73497'34678725  2019-07-02 13:51:04.162089  70393'34676042019-07-01 08:47:58.771910
13.20d  4872    0   0   0   0   40007166482 3036    3036    active+clean    2019-07-03 07:40:28.677097  73511'27811217  73511:22372286  [93,85,73]  93  [93,85,73]  93  73497'27809831  2019-07-03 07:40:28.677059  73302'27796622019-07-01 23:15:14.731237
13.214  5006    0   0   0   0   40940654592 3079    3079    active+clean    2019-07-02 21:10:51.094829  73511'34400529  73511:27161705  [94,61,53]  94  [94,61,53]  94  73497'34398612  2019-07-02 21:10:51.094784  73294'34393962019-07-01 18:54:06.249357
13.2fd  4950    0   0   0   0   40522633728 3086    3086    active+clean    2019-07-02 06:36:14.763435  73511'149011011 73511:136693896 [91,58,36]  91  [91,58,36]  91  73497'148963815 2019-07-02 06:36:14.763383  73497'148963815 2019-07-02 06:36:14.763383
13.3ae  4989    0   0   0   0   40879544320 3055    3055    active+clean    2019-07-02 00:30:44.817062  73511'67827999  73511:60578765  [91,54,25]  91  [91,54,25]  91  73302'67806651  2019-07-02 00:30:44.817017  69213'67776352

主pg不肯走啊,既然这样那就不管它了,我们前面已经设置禁止踢出节点,且我们用的是三副本,直接关闭这台机器ceph会启用副本,也不会出现数据迁移。
一个存储3份的集群, 可以容忍任意两个主机故障.,所以你需要确保已经关机的节点数量不要超出限制. 以免引发更大的故障.

4,停止服务、关闭服务器、更换ssd
新换上去的ssd使用率为0,PercentageUsed : 0

[root@ceph-11 ~]# isdct show -sensor

PowerOnHours : 0x063F3
EraseFailCount : 0
EndToEndErrorDetectionCount : 0
ReliabilityDegraded : False
AvailableSpare : 100
AvailableSpareBelowThreshold : False
DeviceStatus : Healthy
SpecifiedPCBMaxOperatingTemp : 85
SpecifiedPCBMinOperatingTemp : 0
UnsafeShutdowns : 0x00
CrcErrorCount : 0
AverageNandEraseCycles : 7
MediaErrors : 0x00
PowerCycles : 0x012
ProgramFailCount : 0
MaxNandEraseCycles : 10
HighestLifetimeTemperature : 48
PercentageUsed : 0
ThermalThrottleStatus : 0
ErrorInfoLogEntries : 0x00
MinNandEraseCycles : 6
LowestLifetimeTemperature : 16
ReadOnlyMode : False
ThermalThrottleCount : 0
TemperatureThresholdExceeded : False
Temperature - Celsius : 48

5,插入新的磁盘为nvme0n1

[root@ceph-11 ~]# lsblk 
NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda       8:0    0   5.5T  0 disk 
└─sda1    8:1    0   5.5T  0 part /var/lib/ceph/osd/ceph-87
sdb       8:16   0   5.5T  0 disk 
└─sdb1    8:17   0   5.5T  0 part /var/lib/ceph/osd/ceph-88
sdc       8:32   0   5.5T  0 disk 
└─sdc1    8:33   0   5.5T  0 part /var/lib/ceph/osd/ceph-89
sdd       8:48   0   5.5T  0 disk 
└─sdd1    8:49   0   5.5T  0 part /var/lib/ceph/osd/ceph-90
sde       8:64   0   5.5T  0 disk 
└─sde1    8:65   0   5.5T  0 part /var/lib/ceph/osd/ceph-91
sdf       8:80   0   5.5T  0 disk 
└─sdf1    8:81   0   5.5T  0 part /var/lib/ceph/osd/ceph-92
sdg       8:96   0   5.5T  0 disk 
└─sdg1    8:97   0   5.5T  0 part /var/lib/ceph/osd/ceph-93
sdh       8:112  0   5.5T  0 disk 
└─sdh1    8:113  0   5.5T  0 part /var/lib/ceph/osd/ceph-94
sdi       8:128  0   5.5T  0 disk 
└─sdi1    8:129  0   5.5T  0 part /var/lib/ceph/osd/ceph-95
sdj       8:144  0   5.5T  0 disk 
└─sdj1    8:145  0   5.5T  0 part /var/lib/ceph/osd/ceph-96
sdk       8:160  0   5.5T  0 disk 
└─sdk1    8:161  0   5.5T  0 part /var/lib/ceph/osd/ceph-97
sdl       8:176  0   5.5T  0 disk 
└─sdl1    8:177  0   5.5T  0 part /var/lib/ceph/osd/ceph-98
sdm       8:192  0 419.2G  0 disk 
└─sdm1    8:193  0 419.2G  0 part /
nvme0n1 259:0    0 372.6G  0 disk 

6,重建journal
由于journal故障, 开机后无法正常启动osd. 需要重新创建journal,编辑脚本来生成最终执行的脚本。

#!/bin/bash
desc="create ceph journal part for specified osd."

type_journal_uuid=45b0969e-9b03-4f30-b4c6-b4b80ceff106
sgdisk=sgdisk
journal_size=30G  //分区设置大小
journal_dev=/dev/nvme0n1 //ssd磁盘名称
sleep=5

osd_uuids=$(grep "" /var/lib/ceph/osd/ceph-*/journal_uuid 2>/dev/null)
die(){ echo >&2 "$@"; exit 1; }
tip(){ printf >&2 "%b" "$@"; }

[ "$osd_uuids" ] || die "no osd uuid found."
echo "osd journal uuid:"
echo "$osd_uuids"
echo "now sleep $sleep"
sleep $sleep

journal_script="/dev/shm/ceph-journal.sh"
echo "ls -l /dev/nvme0n1p*" > "$journal_script"
echo "sleep 5" >> "$journal_script"
# 需要预先检测分区的位置. 然后才能成功设置名称和uuid之类的数据.
IFS=": "
while read osd_path uuid; do
  let d++
  [ "$osd_path" ] || continue
  osd_id=${osd_path#/var/lib/ceph/osd/ceph-}
  osd_id=${osd_id%/journal_uuid}
  journal_link=${osd_path%_uuid}
  [ ${osd_id:-1} -ge 0 ] || {
    echo "invalid osd id: $osd_id."; exit 11;
  }
  tip "create journal for osd $osd_id ... "
  $sgdisk --mbrtogpt --new=$d:0:+"$journal_size" \
    --change-name=$d:'ceph journal' \
   --typecode=$d:"$type_journal_uuid" \
   --partition-guid=$d:"$uuid" \
   "$journal_dev" || exit 1
  tip "part done.\n"
  ln -sfT /dev/disk/by-partuuid/"$uuid" "$journal_link" || exit 3
  echo "ceph-osd --mkjournal --osd-journal /dev/nvme0n1p"$d "-i "$osd_id >> "$journal_script"
  sleep 1
done << EOF
$osd_uuids
EOF

上述脚本仅用于生成最终的执行脚本. 其默认路径是
/dev/shm/ceph-journal.sh
请务必人工确认内容操作无误, 方可以root权限手动执行之
[root@ceph-11~]# bash /dev/shm/ceph-journal.sh
脚本内容:

[root@ceph-11 ~]# cat /dev/shm/ceph-journal.sh 
#!/bin/bash
ls -l /dev/nvme0n1p*
sleep 5
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p1 -i 87
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p2 -i 88
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p3 -i 89
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p4 -i 90
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p5 -i 91
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p6 -i 92
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p7 -i 93
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p8 -i 94
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p9 -i 95
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p10 -i 96
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p11 -i 97
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p12 -i 98
[root@ceph-11 ~]# 

7,journal跟换完毕,检查恢复服务
osd服务已恢复

[root@ceph-11 ~]# ceph osd tree
ID     WEIGHT    TYPE NAME        UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-10008         0 root sas6t3                                        
-10007         0 root sas6t2                                        
-10006 130.94598 root sas6t1                                        
   -12  65.47299     host ceph-11                                   
    87   5.45599         osd.87        up  1.00000                0 
    88   5.45599         osd.88        up  0.79999                0 
    89   5.45599         osd.89        up  1.00000                0 
    90   5.45599         osd.90        up  1.00000                0 
    91   5.45599         osd.91        up  1.00000                0 
    92   5.45599         osd.92        up  1.00000                0 
    93   5.45599         osd.93        up  1.00000                0 
    94   5.45599         osd.94        up  1.00000                0 
    95   5.45599         osd.95        up  1.00000                0 
    96   5.45599         osd.96        up  1.00000                0 
    97   5.45599         osd.97        up  1.00000                0 
    98   5.45599         osd.98        up  0.89999                0 

恢复osd flag,需要把干预期间的其他操作全部恢复
ceph osd unset noout

恢复osd优先级

[root@ceph-11 ~]# for osd in {87..98}; do ceph osd primary-affinity "$osd" 0.8; done
set osd.87 primary-affinity to 0.8 (8524282)
set osd.88 primary-affinity to 0.8 (8524282)
set osd.89 primary-affinity to 0.8 (8524282)
set osd.90 primary-affinity to 0.8 (8524282)
set osd.91 primary-affinity to 0.8 (8524282)
set osd.92 primary-affinity to 0.8 (8524282)
set osd.93 primary-affinity to 0.8 (8524282)
set osd.94 primary-affinity to 0.8 (8524282)
set osd.95 primary-affinity to 0.8 (8524282)
set osd.96 primary-affinity to 0.8 (8524282)
set osd.97 primary-affinity to 0.8 (8524282)
set osd.98 primary-affinity to 0.8 (8524282)
[root@ceph-11 ~]#

等待集群恢复
等待集群自动recovery恢复到 HEALHTH_OK 状态.
期间如果出现 HEALTH_ERROR 状态, 可以及时跟进, 搜索Google.

[root@ceph-11 ~]# ceph -s
    cluster 936a5233-9441-49df-95c1-01de82a192f4
     health HEALTH_WARN
            12 pgs degraded
            2 pgs recovering
            10 pgs recovery_wait
            12 pgs stuck unclean
            recovery 116/38259009 objects degraded (0.000%)
     monmap e5: 6 mons at {ceph-1=100.100.200.201:6789/0,ceph-2=100.100.200.202:6789/0,ceph-3=100.100.200.203:6789/0,ceph-4=100.100.200.204:6789/0,ceph-5=100.100.200.205:6789/0,ceph-6=100.100.200.206:6789/0}
            election epoch 406, quorum 0,1,2,3,4,5 ceph-1,ceph-2,ceph-3,ceph-4,ceph-5,ceph-6
      fsmap e94: 1/1/1 up {0=ceph-1=up:active}, 1 up:standby
     osdmap e73609: 111 osds: 108 up, 108 in
            flags sortbitwise,require_jewel_osds
      pgmap v85918476: 5064 pgs, 24 pools, 89195 GB data, 12454 kobjects
            261 TB used, 141 TB / 403 TB avail
            116/38259009 objects degraded (0.000%)
                5049 active+clean
                  10 active+recovery_wait+degraded
                   3 active+clean+scrubbing+deep
                   2 active+recovering+degraded
recovery io 22105 kB/s, 4 objects/s
  client io 55017 kB/s rd, 77280 kB/s wr, 944 op/s rd, 590 op/s wr
[root@ceph-11 ~]# 
[root@ceph-11 ~]# ceph -s
    cluster 936a5233-9441-49df-95c1-01de82a192f4
     health HEALTH_WARN
            1 pgs degraded
            1 pgs recovering
            1 pgs stuck unclean
            recovery 2/38259009 objects degraded (0.000%)
     monmap e5: 6 mons at {ceph-1=100.100.200.201:6789/0,ceph-2=100.100.200.202:6789/0,ceph-3=100.100.200.203:6789/0,ceph-4=100.100.200.204:6789/0,ceph-5=100.100.200.205:6789/0,ceph-6=100.100.200.206:6789/0}
            election epoch 406, quorum 0,1,2,3,4,5 ceph-1,ceph-2,ceph-3,ceph-4,ceph-5,ceph-6
      fsmap e94: 1/1/1 up {0=ceph-1=up:active}, 1 up:standby
     osdmap e73609: 111 osds: 108 up, 108 in
            flags sortbitwise,require_jewel_osds
      pgmap v85918493: 5064 pgs, 24 pools, 89195 GB data, 12454 kobjects
            261 TB used, 141 TB / 403 TB avail
            2/38259009 objects degraded (0.000%)
                5060 active+clean
                   3 active+clean+scrubbing+deep
                   1 active+recovering+degraded
  client io 81789 kB/s rd, 245 MB/s wr, 1441 op/s rd, 651 op/s wr
[root@ceph-11 ~]# ceph -s
    cluster 936a5233-9441-49df-95c1-01de82a192f4
     health HEALTH_OK
     monmap e5: 6 mons at {ceph-1=100.100.200.201:6789/0,ceph-2=100.100.200.202:6789/0,ceph-3=100.100.200.203:6789/0,ceph-4=100.100.200.204:6789/0,ceph-5=100.100.200.205:6789/0,ceph-6=100.100.200.206:6789/0}
            election epoch 406, quorum 0,1,2,3,4,5 ceph-1,ceph-2,ceph-3,ceph-4,ceph-5,ceph-6
      fsmap e94: 1/1/1 up {0=ceph-1=up:active}, 1 up:standby
     osdmap e73609: 111 osds: 108 up, 108 in
            flags sortbitwise,require_jewel_osds
      pgmap v85918494: 5064 pgs, 24 pools, 89195 GB data, 12454 kobjects
            261 TB used, 141 TB / 403 TB avail
                5061 active+clean
                   3 active+clean+scrubbing+deep
recovery io 7388 kB/s, 0 objects/s
  client io 67551 kB/s rd, 209 MB/s wr, 1153 op/s rd, 901 op/s wr
[root@ceph-11 ~]# 

集群状态已经恢复正常。