Handling the Ceph OSD_SLOW_PING_TIME_FRONT/BACK warnings

HEALTH_WARN Slow OSD heartbeats on back (longest 1093.720ms); Slow OSD heartbeats on front (longest 1088.357ms)
[WRN] OSD_SLOW_PING_TIME_BACK: Slow OSD heartbeats on back (longest 1093.720ms)
Slow OSD heartbeats on back from osd.15 [] to osd.192 [] 1093.720 msec
Slow OSD heartbeats on back from osd.222 [] to osd.192 [] 1024.067 msec
[WRN] OSD_SLOW_PING_TIME_FRONT: Slow OSD heartbeats on front (longest 1088.357ms)
Slow OSD heartbeats on front from osd.15 [] to osd.192 [] 1088.357 msec
Slow OSD heartbeats on front from osd.222 [] to osd.192 [] 1018.825 msec

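Warnings like the ones above show up often when running Ceph. The cause is almost always one of three things: network problems, CPU problems, or heavy disk I/O. The best fix is better hardware, but sometimes the heartbeats are only slightly slower than 1 s, and in that case raising the threshold a little is enough.
This article is based on Ceph v16; it can serve as a reference for other versions.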

There are two ways to raise the warning threshold to 2 s. Both can be updated at runtime.
Method 1

ceph config set mgr mon_warn_on_slow_ping_time 2000

Method 2

ceph config set mgr mon_warn_on_slow_ping_ratio 0.1

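Once method 1 is in use (a non-zero value), the ratio from method 2 is ignored, so only one of the two needs to be set.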
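To confirm which value is active, or to undo method 1 later, the settings can be read back and removed with the same ceph config family of commands. The following is a minimal sketch: the function names check_slow_ping_settings and reset_slow_ping_time are hypothetical wrappers introduced here for illustration, and it assumes the options were set on the mgr target as in the commands above.

```shell
# Hypothetical helper (not part of Ceph): print the two threshold options
# as stored for the mgr, then show the current health detail.
check_slow_ping_settings() {
    ceph config get mgr mon_warn_on_slow_ping_time
    ceph config get mgr mon_warn_on_slow_ping_ratio
    ceph health detail
}

# Hypothetical helper (not part of Ceph): remove the method 1 override so the
# ratio-based threshold applies again.
reset_slow_ping_time() {
    ceph config rm mgr mon_warn_on_slow_ping_time
}
```

Run check_slow_ping_settings right after changing a value to see the stored settings and whether the warning is still listed.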
The rest of this article explains the logic in the code.
Start with the built-in help:

ceph config help mon_warn_on_slow_ping_time
mon_warn_on_slow_ping_time - Override mon_warn_on_slow_ping_ratio with specified threshold in milliseconds
(float, advanced)
Default: 0.000000
Can update at runtime: true
Services: [mgr]
See also: [mon_warn_on_slow_ping_ratio]

ceph config help osd_heartbeat_grace
osd_heartbeat_grace -
(int, advanced)
Default: 20
Can update at runtime: true

ceph config help mon_warn_on_slow_ping_ratio
mon_warn_on_slow_ping_ratio - Issue a health warning if heartbeat ping longer than percentage of osd_heartbeat_grace
(float, advanced)
Default: 0.050000
Can update at runtime: true
Services: [mgr]
See also: [osd_heartbeat_grace,mon_warn_on_slow_ping_time]

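When mon_warn_on_slow_ping_time is non-zero, that value is used as the threshold, in milliseconds. (You can set it to 1 to see whether the warning appears. On my cluster the heartbeats take at least slightly more than 1 ms, and setting 2 ms made the warning go away.)
When mon_warn_on_slow_ping_time is 0,
threshold = osd_heartbeat_grace * mon_warn_on_slow_ping_ratio
Note that the unit here is seconds.
Changing osd_heartbeat_grace is not recommended: the value is used for other purposes as well, so it is better left alone.
Plugging in the defaults gives 20 s * 0.05 = 1 s, which is where the default 1 s threshold comes from.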
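The rule above can be checked with a small calculation. The sketch below is illustrative only: effective_slow_ping_threshold_ms is a hypothetical helper, not a Ceph command. It takes the three option values as arguments and prints the resulting threshold in milliseconds, following the rule described here rather than reading anything from a cluster.

```shell
# Hypothetical helper (not part of Ceph).
# Arguments: mon_warn_on_slow_ping_time (ms), osd_heartbeat_grace (s),
#            mon_warn_on_slow_ping_ratio
effective_slow_ping_threshold_ms() {
    awk -v t="$1" -v g="$2" -v r="$3" 'BEGIN {
        # A non-zero explicit time wins; otherwise use grace (s) * ratio,
        # converted to milliseconds.
        if (t + 0 != 0) printf "%g\n", t
        else            printf "%g\n", g * r * 1000
    }'
}

effective_slow_ping_threshold_ms 0 20 0.05      # defaults: prints 1000
effective_slow_ping_threshold_ms 2000 20 0.05   # method 1: prints 2000
effective_slow_ping_threshold_ms 0 20 0.1       # method 2: prints 2000
```

Both methods end up at the same 2000 ms threshold. Method 1 states it directly, while method 2 depends on osd_heartbeat_grace staying at 20.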

The following check appears in several places in the source code:

// SLOW_PING_TIME
// Convert milliseconds to microseconds
auto warn_slow_ping_time = cct->_conf.get_val<double>("mon_warn_on_slow_ping_time") * 1000;
auto grace = cct->_conf.get_val<int64_t>("osd_heartbeat_grace");
if (warn_slow_ping_time == 0) {
	double ratio = cct->_conf.get_val<double>("mon_warn_on_slow_ping_ratio");
	warn_slow_ping_time = grace;
	warn_slow_ping_time *= 1000000 * ratio; // Seconds of grace to microseconds at ratio
}
if (warn_slow_ping_time > 0) {
	...
	if (back.pingtime > warn_slow_ping_time) {
		...
	}
	if (front.pingtime > warn_slow_ping_time) {
		...
	}
...
}

Clearly, OSD_SLOW_PING_TIME_BACK and OSD_SLOW_PING_TIME_FRONT are both checked against the same threshold.
Raising that threshold makes both warnings disappear.
