Ceph Troubleshooting: No Reply to Heartbeat Checks Between OSDs

The Ceph storage cluster runs on eight servers with 9 OSDs on each. Coming in to work, I found that 8 OSDs spread across four of the servers were shown as down in the CRUSH map. Restarting those OSDs brought them back to normal, but before long they were marked down again, even though the osd processes were actually still running. Reading the OSD logs, the whole failure unfolded as follows:
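For context, the status checks and the restart step looked roughly like this (osd.49 is only an example id, and the restart command assumes a sysvinit-managed deployment; on systemd hosts it would be systemctl restart ceph-osd@49):

# ceph -s                              # overall cluster health, including how many OSDs are up/in
# ceph osd tree                        # per-OSD up/down status laid out in the CRUSH hierarchy
# /etc/init.d/ceph restart osd.49      # restart a single OSD (sysvinit deployment)
# ps aux | grep ceph-osd               # confirm the osd processes themselves are still running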


1. A single OSD gets no heartbeat reply from any of the OSDs on another server. The log records:

2017-01-12 15:46:44.461929 7f913748c700 -1 error_msg osd.49 359369 heartbeat_check: no reply from osd.57 ever on either front or back, first ping sent 2017-01-12 15:18:09.046948 (cutoff 2017-01-12 15:46:24.461923)
2017-01-12 15:46:44.461946 7f913748c700 -1 error_msg osd.49 359369 heartbeat_check: no reply from osd.58 ever on either front or back, first ping sent 2017-01-12 15:18:09.046948 (cutoff 2017-01-12 15:46:24.461923)
2017-01-12 15:46:44.461967 7f913748c700 -1 error_msg osd.49 359369 heartbeat_check: no reply from osd.61 ever on either front or back, first ping sent 2017-01-12 15:18:09.046948 (cutoff 2017-01-12 15:46:24.461923)
2017-01-12 15:46:44.529580 7f838761f700 -1 error_msg osd.52 359367 heartbeat_check: no reply from osd.58 ever on either front or back, first ping sent 2017-01-12 15:17:14.579226 (cutoff 2017-01-12 15:46:24.529579)
2017-01-12 15:46:44.529596 7f838761f700 -1 error_msg osd.52 359367 heartbeat_check: no reply from osd.61 ever on either front or back, first ping sent 2017-01-12 15:17:14.579226 (cutoff 2017-01-12 15:46:24.529579)
2017-01-12 15:46:44.711242 7f83670f2700 -1 error_msg osd.52 359367 heartbeat_check: no reply from osd.58 ever on either front or back, first ping sent 2017-01-12 15:17:14.579226 (cutoff 2017-01-12 15:46:24.711242)


2. A single OSD likewise gets no heartbeat reply from other OSDs on the same server; the log entries are the same kind of heartbeat_check "no reply from osd.xxx" messages as above.



3. The OSD is then marked down in the OSD map even though it is still running; the log reports: map wrongly marked me down

2017-01-12 15:46:44.711242 7f715610e700  0 log_channel(cluster) log [WRN] : map e83 wrongly marked me down
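A quick way to see this mismatch (daemon alive, map says down) is to compare the process list against the cluster map, for example:

# ps aux | grep ceph-osd               # the ceph-osd processes are running
# ceph osd tree | grep down            # yet the map still reports the OSDs as down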

The troubleshooting went as follows:

1. The first reaction was that this must be a clock problem, i.e. time not being synchronized (the environment uses NTP). A check showed the clocks were indeed out of sync and the time zone was wrong. The clock issue was fixed, but the errors persisted: still no reply from osd.xxx.
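The time check itself can be done with commands along these lines (timedatectl assumes a systemd host; what matters is a synced peer with a small offset and the correct time zone):

# ntpq -p                              # the peer marked with '*' is the sync source; check offset/jitter
# date                                 # confirm local time and time zone
# timedatectl                          # on systemd hosts, shows time zone and NTP sync state in one view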


2. Could it be a network problem?

I picked two of the affected servers, logged in and checked whether the network between them was broken. A quick ping worked fine, with no packet loss. Strange, so what was going on? Then the firewall came to mind, the bane of many a Linux service. After stopping the firewall and disabling SELinux, the Ceph cluster returned to normal!
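In hindsight, ping alone proves little here, since iptables can pass ICMP while dropping the OSD TCP ports. A more direct check would have been something like the following (node2 is a placeholder hostname; 6800-7300 is the default OSD port range):

# ss -tlnp | grep ceph-osd                       # which ports the local OSDs are listening on
# nc -zv node2 6800                              # probe a remote OSD port; a timeout points at the firewall
# iptables -L -n -v | grep -E 'DROP|REJECT'      # look for rules that would drop OSD traffic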


For reference, the commands to stop the firewall and disable SELinux:

Stop iptables:

Stop the service:  service iptables stop
Disable it permanently (so it stays off across reboots): chkconfig iptables off
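Instead of switching the firewall off completely, a production cluster can simply allow the Ceph ports through iptables; a rough sketch using the default ports (6789 for monitors, 6800-7300 for OSD daemons), to be adapted to your own network:

iptables -A INPUT -p tcp --dport 6789 -j ACCEPT        # Ceph monitor
iptables -A INPUT -p tcp --dport 6800:7300 -j ACCEPT   # OSD daemons, including heartbeat traffic
service iptables save                                  # persist the rules (CentOS/RHEL 6 style)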


Disable SELinux
Permanently (requires a server reboot to take effect):
# sed -i 's/SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config

Temporarily (takes effect immediately, but is lost after a reboot):
# setenforce 0 # put SELinux into permissive mode (i.e. off)
# setenforce 1 # put SELinux into enforcing mode (i.e. on)
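The current mode can be verified at any time with:

# getenforce                           # prints Enforcing, Permissive, or Disabled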




