Ceph failure analysis

Before the Ceph capacity expansion was carried out, the cluster had been stuck in the following state for a long time.
The relevant output:

# ceph -s
    cluster dc4f91c1-8792-4948-b68f-2fcea75f53b9
     health HEALTH_WARN 13 pgs backfill_toofull; 1 pgs degraded; 1 pgs stuck degraded; 13 pgs stuck unclean; 9 requests are blocked > 32 sec; recovery 190/54152986 objects degraded (0.000%); 47030/54152986 objects misplaced (0.087%); 2 near full osd(s); clock skew detected on mon.hh-yun-ceph-cinder025-128075
     monmap e3: 5 mons at {hh-yun-ceph-cinder015-128055=240.30.128.55:6789/0,hh-yun-ceph-cinder017-128057=240.30.128.57:6789/0,hh-yun-ceph-cinder024-128074=240.30.128.74:6789/0,hh-yun-ceph-cinder025-128075=240.30.128.75:6789/0,hh-yun-ceph-cinder026-128076=240.30.128.76:6789/0}, election epoch 168, quorum 0,1,2,3,4 hh-yun-ceph-cinder015-128055,hh-yun-ceph-cinder017-128057,hh-yun-ceph-cinder024-128074,hh-yun-ceph-cinder025-128075,hh-yun-ceph-cinder026-128076
     osdmap e23216: 100 osds: 100 up, 100 in
      pgmap v11159189: 20544 pgs, 2 pools, 70024 GB data, 17620 kobjects
            205 TB used, 158 TB / 363 TB avail
            190/54152986 objects degraded (0.000%); 47030/54152986 objects misplaced (0.087%)
               20527 active+clean
                   1 active+degraded+remapped+backfill_toofull
                   4 active+clean+scrubbing+deep
                  12 active+remapped+backfill_toofull
  client io 7609 kB/s rd, 46866 kB/s wr, 1909 op/s

Key lines:

12 active+remapped+backfill_toofull
1 active+degraded+remapped+backfill_toofull
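
Before dumping every PG, ceph health detail expands the summary counts above into concrete PG IDs and OSD numbers. A minimal sketch (the grep pattern is just one convenient filter, not the only option):

# ceph health detail | grep -Ei 'backfill_toofull|near full'

On this release line the output contains one line per stuck PG, plus messages of the form "osd.N is near full at NN%" for the near-full OSDs.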

Get detailed information on the affected PGs:

# ceph pg dump 2> /dev/null | grep -E 'pg_stat|toofull' | awk '{printf "%-8s %-15s %-15s %-15s %-55s\n", $1, $7, $15, $17, $10}'
pg_stat  bytes           up_primary      acting_primary  state
1.19ae   4427174912      [50,24,31]      [21,33,69]      active+remapped+backfill_toofull
1.f51    2313255936      [51,8,24]       [8,31,58]       active+remapped+backfill_toofull
1.86f    2199311872      [57,24,18]      [57,22,65]      active+degraded+remapped+backfill_toofull
1.531    2257795584      [12,59,24]      [12,59,31]      active+remapped+backfill_toofull
1.186    2359985152      [51,8,24]       [2,27,57]       active+remapped+backfill_toofull
1.4f35   2429229056      [52,24,38]      [12,26,57]      active+remapped+backfill_toofull
1.44cb   2247723008      [51,24,18]      [15,26,60]      active+remapped+backfill_toofull
1.405e   2286564864      [50,24,14]      [16,27,40]      active+remapped+backfill_toofull
1.3bc2   4308700672      [55,12,24]      [55,14,40]      active+remapped+backfill_toofull
1.3b35   4711967232      [43,52,24]      [43,19,26]      active+remapped+backfill_toofull
1.3845   4573419008      [12,59,24]      [12,29,43]      active+remapped+backfill_toofull
1.35f3   4424525312      [45,58,24]      [45,23,59]      active+remapped+backfill_toofull
1.291f   4661793280      [14,50,24]      [14,21,48]      active+remapped+backfill_toofull

From the output above:

1. There are 13 affected PGs (12 backfill_toofull plus 1 degraded+backfill_toofull), each holding roughly 2 GB to 4.5 GB of data.
2. The bracketed columns are in fact the up and acting sets: in the data rows the state_stamp field contains an embedded space, which shifts the fields so that the header's up_primary/acting_primary labels no longer line up. Looking at the up sets, every one of these PGs includes osd.24 as a backfill target, i.e. each needs to write its data onto osd.24. A quick way to total that data up is sketched below.
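
To confirm how much data those backfills would push onto osd.24, the bytes column of the dump can be summed directly. A minimal sketch, assuming the same field layout as the dump above ($7 = bytes in the data rows):

# ceph pg dump 2> /dev/null | grep toofull | awk '{ s += $7 } END { printf "%.1f GB\n", s / 1024 / 1024 / 1024 }'

Summing the 13 byte counts above by hand gives roughly 40.2 GB, which matches the estimate used below.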

Check the current capacity of osd.24:

Size   Used   Avail  Use%  Mounted on
3.7T   3.2T   539G   86%   /var/lib/ceph/osd/ceph-24
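
Since the health summary reports 2 near full osd(s), it is worth checking the remaining OSD mounts as well. A short sketch; note that ceph osd df is only available on Hammer and later releases:

# df -h /var/lib/ceph/osd/ceph-*          (run on each OSD host)
# ceph osd df                             (per-OSD SIZE / USE / AVAIL / %USE, cluster-wide)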

All of the problem PGs above have to be backfilled to new locations, and in every case the backfill destination is osd.24: roughly 40 GB in total has to be written to this one OSD. osd.24 is already at 86% utilization, and the OSD near-full ratio is set to .85. Ceph therefore treats the OSD as too full to accept backfill, which is why the cluster has been stuck in this state for so long.
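
Two commonly used ways out of this state on this release line are to relax the backfill-full threshold temporarily or to push data off osd.24 first. The 0.90 values below are illustrative assumptions, not recommendations, and raising the ratio trades safety margin for progress:

# ceph tell osd.24 injectargs '--osd_backfill_full_ratio 0.90'    (temporarily allow backfill onto a fuller OSD)
# ceph osd reweight 24 0.90                                       (or lower osd.24's weight so data moves off it)

Either way, the lasting fix is the capacity expansion itself, after which any changed ratio should be returned to its default.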
