Failure time: 2015-11-05 20:30
Resolution time: 2015-11-05 20:52:33
Symptom: a failed disk on hh-yun-ceph-cinder016-128056.vclound.com caused the ceph cluster to raise abnormal alerts
Handling: the ceph cluster migrated the affected data automatically and no data was lost; once the IDC colleagues replace the disk, the data will be rebalanced onto it again (a minimal sketch of the usual replacement procedure follows below)
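For reference, this is the osd replacement sequence commonly used on Hammer-era clusters; it is a sketch, not the commands actually run in this incident, and /dev/sdX is a placeholder for the new disk:

# stop the dead daemon, then remove osd.14 from crush, auth and the osd map
/etc/init.d/ceph stop osd.14
ceph osd crush remove osd.14
ceph auth del osd.14
ceph osd rm 14
# after the new disk is installed, prepare and activate it as a fresh osd
ceph-disk prepare /dev/sdX
ceph-disk activate /dev/sdX1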
The log analysis is as follows
1. While the CEPH cluster was performing a deep scrub, it found that a disk on hh-yun-ceph-cinder016-128056.vclound.com ( 10.199.128.56 ) had failed; the OSD process stopped responding and was reported as failed
2015-11-05 20:30:26.084840 mon.0 240.30.128.55:6789/0 7291068 : cluster [INF] pgmap v7319699: 20544 pgs: 20542 active+clean, 1 active+clean+scrubbing, 1 active+clean+inconsistent; 13357 GB data, 40048 GB used, 215 TB / 254 TB avail; 142 kB/s rd, 880 kB/s wr, 232 op/s
2015-11-05 20:30:27.034297 mon.0 240.30.128.55:6789/0 7291071 : cluster [INF] osd.14 240.30.128.56:6820/137420 failed (3 reports from 3 peers after 20.000246 >= grace 20.000000)
2015-11-05 20:30:27.087421 mon.0 240.30.128.55:6789/0 7291072 : cluster [INF] pgmap v7319700: 20544 pgs: 20542 active+clean, 1 active+clean+scrubbing, 1 active+clean+inconsistent; 13357 GB data, 40048 GB used, 215 TB / 254 TB avail; 1001 kB/s rd, 1072 kB/s wr, 256 op/s
2015-11-05 20:30:27.142073 mon.0 240.30.128.55:6789/0 7291073 : cluster [INF] osdmap e503: 70 osds: 69 up, 70 in
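The "failed (3 reports from 3 peers after 20.000246 >= grace 20.000000)" line matches the default failure-detection settings of this Ceph era: osd_heartbeat_grace defaults to 20 seconds, and the monitor waits for enough down reports from peer osds before declaring a failure. A hypothetical way to confirm these values on a monitor node (the daemon name is an assumption, not from the original session):

# inspect the failure-detection thresholds on a running monitor
ceph daemon mon.$(hostname -s) config show | grep -E 'osd_heartbeat_grace|mon_osd_min_down'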
2. After the failure, ceph recalculated the pg state and tallied the object data that had been stored on the failed osd
2015-11-05 20:30:34.230435 mon.0 240.30.128.55:6789/0 7291081 : cluster [INF] pgmap v7319706: 20544 pgs: 19595 active+clean, 1 active+undersized+degraded+inconsistent, 948 active+undersized+degraded; 13357 GB data, 40048 GB used, 215 TB / 254 TB avail; 6903 kB/s wr, 287 op/s; 160734/10441218 objects degraded (1.539%)
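As a sanity check on the monitor's arithmetic: 160734 / 10441218 ≈ 0.01539, i.e. the reported 1.539% degraded. This is close to a single osd's share of a 70-osd cluster (1/70 ≈ 1.43%), which is about what one would expect with exactly one osd down.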
3. The ceph cluster kicked the failed osd out of the cluster
2015-11-05 20:35:28.839639 mon.0 240.30.128.55:6789/0 7291328 : cluster [INF] pgmap v7319909: 20544 pgs: 19590 active+clean, 1 active+undersized+degraded+inconsistent, 5 active+clean+scrubbing, 948 active+undersized+degraded; 13358 GB data, 40049 GB used, 215 TB / 254 TB avail; 5988 kB/s rd, 21084 kB/s wr, 1406 op/s; 160742/10441470 objects degraded (1.539%)
2015-11-05 20:35:31.419279 mon.0 240.30.128.55:6789/0 7291329 : cluster [INF] osd.14 out (down for 304.292431)
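The "out (down for 304.292431)" line shows the monitor auto-marking osd.14 out roughly 300 seconds after it went down, matching the default mon_osd_down_out_interval of 300 seconds. A hypothetical check on a monitor node (daemon name assumed):

# how long a down osd is tolerated before being marked out automatically
ceph daemon mon.$(hostname -s) config get mon_osd_down_out_interval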
4. ceph automatically carried out data recovery; no data loss resulted
2015-11-05 20:35:37.483627 mon.0 240.30.128.55:6789/0 7291338 : cluster [INF] pgmap v7319914: 20544 pgs: 19590 active+clean, 1 active+undersized+degraded+inconsistent, 5 active+clean+scrubbing, 948 active+undersized+degraded; 13358 GB data, 39433 GB used, 212 TB / 250 TB avail; 252 kB/s wr, 31 op/s; 160742/10441473 objects degraded (1.539%)
2015-11-05 20:35:39.345830 mon.0 240.30.128.55:6789/0 7291340 : cluster [INF] pgmap v7319915: 20544 pgs: 19599 active+clean, 5 undersized+degraded+remapped, 557 active+undersized+degraded+remapped, 62 active+recovering+degraded, 1 active+undersized+degraded+inconsistent, 5 active+undersized+degraded+remapped+backfilling, 3 active+undersized+degraded+remapped+wait_backfill, 232 remapped+peering, 80 active+undersized+degraded; 13358 GB data, 39437 GB used, 212 TB / 250 TB avail; 5607 kB/s rd, 17453 kB/s wr, 1028 op/s; 136092/10556428 objects degraded (1.289%); 249227/10556428 objects misplaced (2.361%); 2413 MB/s, 6 keys/s, 627 objects/s recovering
2015-11-05 20:35:40.045989 mon.0 240.30.128.55:6789/0 7291341 : cluster [INF] pgmap v7319916: 20544 pgs: 19599 active+clean, 5 undersized+degraded+remapped, 576 active+undersized+degraded+remapped, 67 active+recovering+degraded, 5 active+undersized+degraded+remapped+backfilling, 3 active+undersized+degraded+remapped+wait_backfill, 273 remapped+peering, 15 active+undersized+degraded, 1 active+undersized+degraded+remapped+inconsistent; 13358 GB data, 39437 GB used, 212 TB / 250 TB avail; 4825 kB/s rd, 18997 kB/s wr, 913 op/s; 130350/10567081 objects degraded (1.234%); 261108/10567081 objects misplaced (2.471%); 2136 MB/s, 5 keys/s, 555 objects/s recovering
............... ( intermediate recovery progress entries omitted ) ...............
2015-11-05 20:52:28.966672 mon.0 240.30.128.55:6789/0 7293152 : cluster [INF] pgmap v7321045: 20544 pgs: 20542 active+clean, 1 active+undersized+degraded+remapped+backfilling, 1 active+clean+inconsistent; 13358 GB data, 40058 GB used, 211 TB / 250 TB avail; 3360 kB/s rd, 5667 kB/s wr, 490 op/s; 243/10441812 objects degraded (0.002%); 14/10441812 objects misplaced (0.000%)
2015-11-05 20:52:30.039527 mon.0 240.30.128.55:6789/0 7293154 : cluster [INF] pgmap v7321046: 20544 pgs: 20542 active+clean, 1 active+undersized+degraded+remapped+backfilling, 1 active+clean+inconsistent; 13358 GB data, 40058 GB used, 211 TB / 250 TB avail; 2491 kB/s rd, 6139 kB/s wr, 302 op/s; 243/10441812 objects degraded (0.002%); 14/10441812 objects misplaced (0.000%)
2015-11-05 20:52:31.087910 mon.0 240.30.128.55:6789/0 7293155 : cluster [INF] pgmap v7321047: 20544 pgs: 20543 active+clean, 1 active+clean+inconsistent; 13358 GB data, 40058 GB used, 211 TB / 250 TB avail; 3439 kB/s rd, 6470 kB/s wr, 356 op/s; 16398 kB/s, 4 objects/s recovering
2015-11-05 20:52:32.066947 mon.0 240.30.128.55:6789/0 7293156 : cluster [INF] pgmap v7321048: 20544 pgs: 20543 active+clean, 1 active+clean+inconsistent; 13358 GB data, 40058 GB used, 211 TB / 250 TB avail; 2529 kB/s rd, 8932 kB/s wr, 388 op/s; 16853 kB/s, 4 objects/s recovering
2015-11-05 20:52:33.054558 mon.0 240.30.128.55:6789/0 7293157 : cluster [INF] pgmap v7321049: 20544 pgs: 20543 active+clean, 1 active+clean+inconsistent; 13358 GB data, 40058 GB used, 211 TB / 250 TB avail; 2389 kB/s rd, 13650 kB/s wr, 380 op/s
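Recovery peaked at roughly 2.4 GB/s in the excerpts above. Had the backfill traffic been hurting client I/O, it could have been throttled at runtime; the following is an illustrative adjustment (the values are examples, not something done in this incident):

# limit concurrent backfills and recovery ops per osd to reduce client impact
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'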
5. Verification: after the cluster recalculated the pg placements, no pg is scheduled to store data on osd.14 any more, meaning this incident caused no data loss and does not affect service for the time being; the empty output of the check below is the expected result
[root@hh-yun-ceph-cinder015-128055 tmp]# ceph pg dump | awk '{ print $15}' | grep 14
dumped all in format plain
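Note that the bare grep 14 above would also match osd ids such as 140; with only 70 osds that is harmless here, but a stricter variant plus two follow-up checks may be useful (hypothetical commands, not from the original session; <pgid> is a placeholder to be read from ceph health detail):

# stricter whole-word match for the osd id, and the osd's state in the crush tree
ceph pg dump | awk '{ print $15 }' | grep -w 14
ceph osd tree | grep -w osd.14
# locate the pg that is still active+clean+inconsistent and ask ceph to repair it
ceph health detail | grep inconsistent
ceph pg repair <pgid>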