Clock synchronization anomaly on a Ceph mon node:
$ sudo /var/lib/ceph/bin/ceph -s
cluster:
id: 3fe6c651-2a0c-4f15-851b-7215536897eb
health: HEALTH_WARN
clock skew detected on mon.c
$ sudo /var/lib/ceph/bin/ceph health detail
HEALTH_WARN clock skew detected on mon.c
MON_CLOCK_SKEW clock skew detected on mon.c
mon.c clock skew 0.274578s > max 0.15s (latency 0.000160198s)
The cluster's maximum allowed clock drift is configured as 0.15 s:
$ cat /var/lib/ceph/etc/ceph/ceph.conf | grep mon_clock_drift_allowed
mon_clock_drift_allowed = 0.15
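To make the numbers in the health detail line above easier to read, the following minimal sketch extracts the measured skew and the limit and prints by how much the limit is exceeded. The function name parse_skew is made up for this example, and the field positions are assumed from the single sample line shown above, so other Ceph releases may format the line differently.

```shell
# Hypothetical helper: read `ceph health detail` output on stdin and, for each
# per-monitor skew line, print the monitor name, the skew, the limit, and the excess.
parse_skew() {
  awk '/clock skew [0-9.]+s > max/ {
    skew = $4; max = $7          # e.g. "0.274578s" and "0.15s"
    sub(/s$/, "", skew)          # strip the trailing unit
    sub(/s$/, "", max)
    printf "%s skew=%s max=%s over=%.6f\n", $1, skew, max, skew - max
  }'
}

# Sample line copied from the output above.
echo 'mon.c clock skew 0.274578s > max 0.15s (latency 0.000160198s)' | parse_skew
```

On a live cluster the same function could be fed from the real command, for example by piping the output of ceph health detail into parse_skew.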
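The date +%s.%N command shows the precise system time. Checking it confirmed that the warning was not a false positive: the clock really was out of sync.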
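A rough way to quantify the difference between two such readings is sketched below. The helper name clock_diff and the two sample timestamps are hypothetical, and readings collected over ssh include the connection latency, so the result is only an approximation.

```shell
# Hypothetical helper: absolute difference, in seconds, between two
# `date +%s.%N` readings passed as arguments.
clock_diff() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    d = a - b
    if (d < 0) d = -d
    printf "%.3f\n", d           # millisecond resolution is enough here
  }'
}

# Made-up readings taken on two nodes at (nearly) the same moment.
clock_diff 1700000000.274578000 1700000000.000000000

# Possible real use (host name is an assumption):
#   clock_diff "$(ssh mon-c date +%s.%N)" "$(date +%s.%N)"
```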
The cluster nodes synchronize their clocks with chronyd, and every node is configured to use the first mon node as its time source:
$ cat /etc/chrony.conf
driftfile /var/lib/chrony/drift
rtcsync
local stratum 10
#default
server 10.127.15.182 minpoll 0 maxpoll 0
logdir /var/log/chrony
log measurements statistics tracking
$ sudo chronyc sources -v
210 Number of sources = 1
.-- Source mode '^' = server, '=' = peer, '#' = local clock.
/ .- Source state '*' = current synced, '+' = combined , '-' = not combined,
| / '?' = unreachable, 'x' = time may be in error, '~' = time too variable.
|| .- xxxx [ yyyy ] +/- zzzz
|| Reachability register (octal) -. | xxxx = adjusted offset,
|| Log2(Polling interval) --. | | yyyy = measured offset,
|| \ | | zzzz = estimated error.
|| | | \
MS Name/IP address Stratum Poll Reach LastRx Last sample
===============================================================================
^? 10.127.15.182 0 0 0 - +0ns[ +0ns] +/- 0ns
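The source is shown as ^?, i.e. unreachable: the time source cannot be reached. Force an immediate clock step and check again: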
$ sudo chronyc -a makestep
200 OK
$ sudo chronyc sources -v
210 Number of sources = 1
.-- Source mode '^' = server, '=' = peer, '#' = local clock.
/ .- Source state '*' = current synced, '+' = combined , '-' = not combined,
| / '?' = unreachable, 'x' = time may be in error, '~' = time too variable.
|| .- xxxx [ yyyy ] +/- zzzz
|| Reachability register (octal) -. | xxxx = adjusted offset,
|| Log2(Polling interval) --. | | yyyy = measured offset,
|| \ | | zzzz = estimated error.
|| | | \
MS Name/IP address Stratum Poll Reach LastRx Last sample
===============================================================================
^? 10.127.15.182 0 0 0 - +0ns[ +0ns] +/- 0ns
This had no effect: the source is still unreachable and the clock cannot synchronize.
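The unreachable state seen in the chronyc sources output above can also be detected in a script. The sketch below is only an illustration: the function name unreachable_sources is invented, and the column layout (state flags, address, stratum, poll, reach) is assumed from the output shown above.

```shell
# Hypothetical helper: read `chronyc sources` output on stdin and print every
# source that is flagged '?' or whose reachability register is 0.
unreachable_sources() {
  awk 'length($1) == 2 && $1 ~ /^[=#^]/ && (substr($1, 2, 1) == "?" || $5 == "0") {
    print $2 " unreachable (reach=" $5 ")"
  }'
}

# Sample source line copied from the output above.
echo '^? 10.127.15.182 0 0 0 - +0ns[ +0ns] +/- 0ns' | unreachable_sources
```

The length check on the first field keeps the legend and the separator line of the verbose output from being mistaken for source entries.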
$ ping 10.127.15.182
PING 10.127.15.182 (10.127.15.182) 56(84) bytes of data.
64 bytes from 10.127.15.182: icmp_seq=1 ttl=64 time=0.011 ms
64 bytes from 10.127.15.182: icmp_seq=2 ttl=64 time=0.013 ms
64 bytes from 10.127.15.182: icmp_seq=3 ttl=64 time=0.017 ms
^C
--- 10.127.15.182 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 0.011/0.013/0.017/0.002 ms
ping succeeds, so network connectivity is provisionally judged to be normal (10.127.15.182 is the floating address of the time source).
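One detail in the ping output above deserves a second look: an average round-trip time of about 0.013 ms is unusually low for traffic that crosses a physical link and is more typical of replies generated on the local host. The following sketch flags such values. The function name check_rtt and the 0.05 ms default threshold are assumptions chosen for illustration, not established limits.

```shell
# Hypothetical helper: read `ping` output on stdin and compare the average RTT
# with a threshold in milliseconds (default 0.05, an arbitrary assumption).
check_rtt() {
  awk -v limit="${1:-0.05}" '/^rtt/ {
    split($4, v, "/")            # min/avg/max/mdev values
    if (v[2] + 0 < limit + 0)
      print "suspicious: avg rtt " v[2] " ms, possibly answered locally"
    else
      print "ok: avg rtt " v[2] " ms"
  }'
}

# Summary line copied from the ping output above.
echo 'rtt min/avg/max/mdev = 0.011/0.013/0.017/0.002 ms' | check_rtt
```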
The nodes are on the same subnet. Check the operating system firewall configuration:
$ sudo iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source destination
Chain FORWARD (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
Nothing abnormal in the firewall configuration.
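On the time source, try to capture NTP packets from the faulty node. None are captured, which indicates that the NTP requests are not reaching the time source: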
$ sudo tcpdump -nn -i bond0.3530 udp and host 10.127.15.156 and port 123
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on bond0.3530, link-type EN10MB (Ethernet), capture size 262144 bytes
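Running the same capture on the faulty node also captures nothing. Combined with the successful ping earlier, this is odd: the NTP packets never reach the node's own network interface. Next, run a continuous ping from the faulty node to the time source address 10.127.15.182 and capture on the time source: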
$ sudo tcpdump -nn -i bond0.3530 icmp and host 10.127.15.182
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on bond0.3530, link-type EN10MB (Ethernet), capture size 262144 bytes
Again nothing is captured. The ICMP packets never reach the time source, yet ping reports success, so the packets must be going to the wrong host.
Check the ARP table: there is no MAC address entry for 10.127.15.182:
$ ip neigh show | grep 10.127.15.182
# or
$ arp -n | grep 10.127.15.182
Checking the routing table shows that traffic to this address is routed to the loopback interface lo:
$ ip route get 10.127.15.182
local 10.127.15.182 dev lo src 10.127.15.182 uid 1003
cache <local>
$ ip route show table local
xxxxxx
local 10.127.15.156 dev bond0.3530 proto kernel scope host src 10.127.15.156
local 10.127.15.182 dev bond0.3530 proto kernel scope host src 10.127.15.182
xxxxxx
$ ip a
xxxxxx
15: bond0.3530@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 8c:2a:8e:57:5c:d5 brd ff:ff:ff:ff:ff:ff
inet 10.127.15.156/26 brd 10.127.15.191 scope global noprefixroute bond0.3530
valid_lft forever preferred_lft forever
inet6 2409:8c00:7821:4000::a7f:f9c/122 scope global noprefixroute
valid_lft forever preferred_lft forever
inet6 fe80::b9ed:8ee:b44d:286e/64 scope link noprefixroute
valid_lft forever preferred_lft forever
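10.127.15.182 is not configured on any interface of this host, yet the local routing table holds an entry for it, so traffic to it is delivered via the loopback interface. This is the root cause.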
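The comparison made above by eye, local-table entries versus addresses actually configured on an interface, can be sketched as a small script. This is an illustration under assumptions: the function name stale_local_routes is invented, the parsing relies on the output formats shown above, and only IPv4 is handled.

```shell
# Hypothetical helper. $1: output of `ip -o -4 addr show`,
# $2: output of `ip route show table local`.
# Prints each host address that has a "local" route but is not configured
# on any interface.
stale_local_routes() {
  printf '%s\n' "$2" | ADDRS="$1" awk '
    BEGIN {
      # Collect every IPv4 address that follows an "inet" keyword.
      n = split(ENVIRON["ADDRS"], w, /[ \t\n]+/)
      for (i = 1; i < n; i++)
        if (w[i] == "inet") { split(w[i + 1], p, "/"); assigned[p[1]] = 1 }
    }
    # Host entries only: skip prefixes such as 127.0.0.0/8.
    $1 == "local" && $2 !~ /\// && !($2 in assigned) { print $2 }
  '
}

# Sample data taken from the outputs above.
addrs='15: bond0.3530    inet 10.127.15.156/26 brd 10.127.15.191 scope global noprefixroute bond0.3530'
routes='local 10.127.15.156 dev bond0.3530 proto kernel scope host src 10.127.15.156
local 10.127.15.182 dev bond0.3530 proto kernel scope host src 10.127.15.182'
stale_local_routes "$addrs" "$routes"

# Possible real use:
#   stale_local_routes "$(ip -o -4 addr show)" "$(ip route show table local)"
```

With the sample data, only 10.127.15.182 is reported, which matches the stale entry identified above.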
Delete the stale route entry:
$ sudo ip route delete table local 10.127.15.182 dev bond0.3530 src 10.127.15.182
Or restart the network service:
$ sudo systemctl restart NetworkManager