1) Environment
OS: Red Hat Enterprise Linux 5.2 AP
Node1: heartbeat 10.1.1.100, host IP 192.168.150.21, IPMI IP 10.1.1.102
Node2: heartbeat 10.1.1.101, host IP 192.168.150.22, IPMI IP 10.1.1.103
Service IP (floating): 192.168.150.25
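2) Symptom
After the cluster starts normally, run the following on the currently active node to relocate the service:
clusvcadm -r services -m node2
After the relocation, node1 loses the IP address on its physical NIC and the interface goes inactive, while node2 brings up the floating address normally. The fault is reliably reproducible.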
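3) Analysis and testing
After I sent the system's sysreport files to the Red Hat 800 support line, support replied that they could not reproduce the fault in their test environment and passed the problem on to Red Hat's development engineers in the US. The eventual answer was that it is caused by a bug in RHCS itself, bug number 453000.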
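I looked at the /usr/share/cluster/ip.sh script, substituted the real IP address for its variable, and ran the commands below as a test. The original ip.sh does have a bug: when the floating IP address is a substring of the broadcast address, the grep matches two address records. When the cluster removes the IP it takes the first record, which is why the IP address of my physical NIC was lost.
Setting the floating IP to 192.168.150.2 would trigger the same problem. Funny!
The test: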
[root@ngccintfB ~]# /sbin/ip addr list | grep "192.168.150.25" | head -n 2 | awk '{print $w}'
inet 192.168.150.22/24 brd 192.168.150.255 scope global eth0
inet 192.168.150.25/24 scope global secondary eth0
[root@ngccintfB ~]# /sbin/ip addr list | grep "192.168.150.25/" | head -n 1 | awk '{print $w}'
inet 192.168.150.25/24 scope global secondary eth0
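The same behaviour can be reproduced without a cluster. The sketch below is only an illustration: it feeds a canned copy of the two address lines above through both pipelines, using `$2` as ip.sh does. The variable names `sample`, `vip`, `loose`, and `strict` are made up for this example.

```shell
# Canned copy of the two `ip addr list` lines from the session above.
sample='    inet 192.168.150.22/24 brd 192.168.150.255 scope global eth0
    inet 192.168.150.25/24 scope global secondary eth0'

vip="192.168.150.25"

# Original pattern: the broadcast address 192.168.150.255 also contains $vip,
# so the physical address line matches first.
loose=$(printf '%s\n' "$sample" | grep "$vip" | head -n 1 | awk '{print $2}')

# Pattern with a trailing slash: only the address/prefix field can match.
strict=$(printf '%s\n' "$sample" | grep "$vip/" | head -n 1 | awk '{print $2}')

echo "loose:  $loose"
echo "strict: $strict"
```

With this sample, `loose` holds the physical address 192.168.150.22/24 and `strict` holds the floating address 192.168.150.25/24.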
4) Fix
After reading the bug description, I edited /usr/share/cluster/ip.sh directly, changing the line
addr=`/sbin/ip addr list | grep "$addr" | head -n 1 | awk '{print $2}'`
to
addr=`/sbin/ip addr list | grep "$addr/" | head -n 1 | awk '{print $2}'`
Watching the log with tail -f /var/log/messages during a service relocation shows 192.168.150.25 being removed correctly:
Oct 26 12:45:22 ngccintfB clurgmgrd[13157]: <notice> Stopping service service:tomcat_services
Oct 26 12:45:23 ngccintfB clurgmgrd: [13157]: <info> unmounting /oradata2
Oct 26 12:45:23 ngccintfB clurgmgrd: [13157]: <info> Removing IPv4 address 192.168.150.25/24 from eth0
Oct 26 12:45:33 ngccintfB clurgmgrd[13157]: <notice> Service service:tomcat_services is stopped
With that, the problem is solved!
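The trailing slash anchors only the right-hand side of the pattern, and grep still treats each dot as a wildcard. For anyone who wants a stricter match, the sketch below compares the address field exactly with awk. This is not the upstream fix, and `find_exact_addr` is a hypothetical helper that does not exist in ip.sh.

```shell
# find_exact_addr: hypothetical helper, not part of ip.sh.
# Reads `ip addr list`-style text on stdin and prints the first
# address/prefix field whose address part equals $1 exactly.
find_exact_addr() {
    awk -v want="$1" '
        $1 == "inet" || $1 == "inet6" {
            split($2, parts, "/")    # parts[1] = address, parts[2] = prefix length
            # Appending "" forces a string comparison.
            if ((parts[1] "") == (want "")) { print $2; exit }
        }'
}

sample='    inet 192.168.150.22/24 brd 192.168.150.255 scope global eth0
    inet 192.168.150.25/24 scope global secondary eth0'

printf '%s\n' "$sample" | find_exact_addr 192.168.150.25   # exact match on the floating IP
printf '%s\n' "$sample" | find_exact_addr 2.168.150.25     # left-truncated address: no match
```

The first call prints 192.168.150.25/24. The second prints nothing, whereas `grep "2.168.150.25/"` would still match the floating address line.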
5) References
Bug list:
https://bugzilla.redhat.com/show_bug.cgi?id=453000
Problem description: On a cluster node where there is an rgmanager IP resource that is a substring match to the broadcast address of any interface, stopping or relocating that service will bring down that entire interface. If this happens to be the interface used for cluster communication then the node misses its heartbeats and gets fenced.
Example: ashprdgfs01 has eth0 = 172.20.200.21/24. This means the broadcast address is 172.20.200.255. Stopping the following resource:
<ip address="172.20.200.25" monitor_link="1"/>
causes the wrong ip address to be removed:
Jun 23 15:18:27 ashprdgfs01 clurgmgrd[3889]: <notice> Stopping service service:VIP
Jun 23 15:18:27 ashprdgfs01 clurgmgrd: [3889]: <info> Removing IPv4 address 172.20.200.21/24 from eth0
Since this is the main ip for eth0, this kills cluster communication and the node gets fenced:
Jun 23 15:18:42 ashprdgfs02 kernel: dlm: closing connection to node 2
Jun 23 15:18:42 ashprdgfs02 fenced[3199]: fencing node "ashprdgfs01.gspt.net"
The problem is on line 714 of ip.sh:
addr=`/sbin/ip addr list | grep "$addr" | head -n 1 | awk '{print $2}'`
Because 172.20.200.25 is a substring of 172.20.200.255, this actually returns 2 lines:
inet 172.20.200.21/24 brd 172.20.200.255 scope global eth0
inet 172.20.200.25/24 scope global secondary eth0
and the first one is chosen mistakenly, causing it to be the one removed (718):
/sbin/ip -f inet addr del dev $dev $addr
The address we want will always be followed by a '/' for the subnet, so this can be fixed easily like so:
addr=`/sbin/ip addr list | grep "$addr/" | head -n 1 | awk '{print $2}'`