Requirement
While building the monitoring and alerting system for Ceph, cluster health was initially reported as a simple OK / WARN / ERROR, judged from the output of `ceph status`. On closer thought this is not enough: WARN and ERROR each cover many different conditions. If an alert about Ceph health arrives late at night, all you know is that the cluster has a problem, not what the problem actually is. During working hours that is manageable: just log in to the Ceph environment and have a look. At night, however, some alerts are not urgent and could wait until the next day. So the health states need to be broken down in more detail.
Therefore, the description strings associated with HEALTH_OK, HEALTH_WARN, and HEALTH_ERR are pulled out of the Ceph source code, examined and classified, and then mapped to levels by means of status codes.
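A minimal sketch of the first step, assuming the monitoring check is a Python script that shells out to the `ceph` CLI (the function names here are hypothetical, not the actual monitoring code):

```python
import subprocess

def get_health_lines():
    """Run `ceph health detail` and return its output as a list of lines."""
    out = subprocess.check_output(["ceph", "health", "detail"],
                                  universal_newlines=True)
    return out.splitlines()

def overall_level(lines):
    """The first line of the output begins with HEALTH_OK, HEALTH_WARN or HEALTH_ERR."""
    first = lines[0] if lines else ""
    for level in ("HEALTH_ERR", "HEALTH_WARN", "HEALTH_OK"):
        if first.startswith(level):
            return level
    return "UNKNOWN"
```

The description lines that follow the first line are then matched against the tables below to decide which specific problem fired.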
Ceph's own health status messages:
HEALTH_WARN:
| Cluster health description | What it indicates |
| --- | --- |
| Monitor clock skew detected | Clock skew between monitors |
| mons down, quorum | A Ceph Monitor is down |
| some monitors are running older code | Visible right after deployment; does not appear during normal operation |
| in osds are down | Appears after an OSD goes down |
| flag(s) set | A cluster flag has been set; can be ignored |
| crush map has legacy tunables | Visible right after deployment; does not appear during normal operation |
| crush map has straw_calc_version=0 | Visible right after deployment; does not appear during normal operation |
| cache pools are missing hit_sets | Appears when a cache tier is in use |
| no legacy OSD present but 'sortbitwise' flag is not set | Visible right after deployment; does not appear during normal operation |
| has mon_osd_down_out_interval set to 0 | Appears when mon_osd_down_out_interval is set to 0, which has the same effect as the noout flag |
| 'require_jewel_osds' osdmap flag is not set | Visible right after deployment; does not appear during normal operation |
| is full | Appears after a pool becomes full |
| near full osd | Warning when an OSD is nearly full |
| unscrubbed pgs | Some PGs have not been scrubbed |
| pgs stuck | Shown when PGs are stuck in an unhealthy state |
| requests are blocked | Warning on slow requests |
| osds have slow requests | Warning on slow requests |
| recovery | Reported when recovery is needed |
| at/near target max | Warning when a cache tier is in use |
| too few PGs per OSD | Too few PGs per OSD |
| too many PGs per OSD | Too many PGs per OSD |
| pgp_num | pg_num is larger than pgp_num |
| has many more objects per pg than average (too few pgs?) | Too many objects per PG |
HEALTH_ERR:
| Cluster health description | What it indicates |
| --- | --- |
| no osds | Visible right after deployment; does not appear during normal operation |
| full osd | Appears when an OSD is full |
| pgs are stuck inactive for more than | PGs are stuck inactive; those PGs can be neither read nor written |
| scrub errors | Scrub errors have been reported (unclear whether this means an error in scrubbing itself or PGs found inconsistent by scrubbing) |
Handling in the current monitoring code
From the output above, the few key items are picked out and given dedicated status codes; only these are tracked. The rest either never show up during normal operation or relate to features that are not currently used, so they are ignored (see the sketch after the status-code table below).
Ceph Health Status Code:
| Code | Decimal value |
| --- | --- |
| Other warnings | 0 |
| HEALTH_OK | 1 |
| HEALTH_CLOCK_SKEW = 1 << 1 | 2 |
| HEALTH_NEAR_FULL = 1 << 2 | 4 |
| HEALTH_FULL = 1 << 3 | 8 |
| HEALTH_SLOW_REQUEST = 1 << 4 | 16 |
| HEALTH_PG_STALE = 1 << 5 | 32 |
| HEALTH_SCRUB_ERROR = 1 << 6 | 64 |
Note: a basic legend for the status codes is appended to the alarm description:
ceph cluster not health; clock skew:2,nearfull:4,full:8,slow_request:16,pg_stale:32,scrub_error:64,others:0
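A sketch of how the table above might translate into code, assuming the same Python monitoring script; the bit values follow the table, while the substring patterns and helper names are my own illustration rather than the real monitoring code:

```python
# Status-code bits, matching the table above.
HEALTH_OTHER        = 0        # any warning not listed below
HEALTH_OK           = 1
HEALTH_CLOCK_SKEW   = 1 << 1   # 2
HEALTH_NEAR_FULL    = 1 << 2   # 4
HEALTH_FULL         = 1 << 3   # 8
HEALTH_SLOW_REQUEST = 1 << 4   # 16
HEALTH_PG_STALE     = 1 << 5   # 32
HEALTH_SCRUB_ERROR  = 1 << 6   # 64

# Substring -> bit mapping (assumed; order matters so that the more specific
# "near full osd" is tested before "full osd").
PATTERNS = [
    ("clock skew",           HEALTH_CLOCK_SKEW),
    ("near full osd",        HEALTH_NEAR_FULL),
    ("full osd",             HEALTH_FULL),
    ("requests are blocked", HEALTH_SLOW_REQUEST),
    ("slow requests",        HEALTH_SLOW_REQUEST),
    ("pgs stuck",            HEALTH_PG_STALE),
    ("scrub errors",         HEALTH_SCRUB_ERROR),
]

def classify(health_lines):
    """OR together the bit of every known problem found in the health output."""
    code = 0
    for line in health_lines:
        for pattern, bit in PATTERNS:
            if pattern in line:
                code |= bit
                break  # first (most specific) match wins for this line
    if code == 0 and health_lines and health_lines[0].startswith("HEALTH_OK"):
        return HEALTH_OK
    return code  # 0 here means "other warning", as in the table
```

Because the codes are bit flags, several problems can be reported in a single alarm: clock skew plus a near-full OSD yields 2 | 4 = 6, which the on-call engineer can decode from the legend in the alarm text.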
Links
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2/html/troubleshooting_guide/initial-troubleshooting
Code details
Appendix:
Note: lines marked with [detail] at the end are descriptions that only appear in the output of `ceph health detail`.
HEALTH_WARN:
【Monitor.cc:】
Monitor clock skew detected
【MonmapMonitor.cc:】
mons down, quorum
is down (out of quorum) [detail]
some monitors are running older code
only supports the "classic" command set [detail]
【OSDMonitor.cc:】
osd." << i << " is down since epoch [detail]
in osds are down
flag(s) set
crush map has legacy tunables (require
see http://ceph.com/docs/master/rados/operations/crush-map/#tunables [detail]
crush map has straw_calc_version=0
see http://ceph.com/docs/master/rados/operations/crush-map/#tunables [detail]
with cache_mode needs hit_set_type to be set but it is not [detail]
cache pools are missing hit_sets
no legacy OSD present but 'sortbitwise' flag is not set
has mon_osd_down_out_interval set to 0
this has the same effect as the 'noout' flag [detail]
'require_jewel_osds' osdmap flag is not set
is full
near full osd
【PGMonitor.cc:】
current state/last acting [detail]
ops are blocked > [detail]
deep-scrubbed, last_deep_scrub_stamp [detail]
unscrubbed pgs
pgs stuck
min_size from / may help; search ceph.com/docs for 'incomplete [detail]
requests are blocked >
osds have slow requests
recovery
objects at/near target max [detail]
B at/near target max [detail]
at/near target max
too few PGs per OSD
too many PGs per OSD
pgp_num
has many more objects per pg than average (too few pgs?)
HEALTH_ERR:
【OSDMonitor.cc:】
no osds
full osd
【PGMonitor.cc:】
pgs are stuck inactive for more than
scrub errors