Symptom and logs: after the pmapp1 server went down unexpectedly, VCS did not fail over.
2016/02/02 05:52:57 VCS INFO V-16-1-53504 VCS Engine Alive message!!
2016/02/02 06:44:19 VCS ERROR V-16-2-13067 (pmapp2) Agent is calling clean for resource(app2) because the resource became OFFLINE unexpectedly, on its own.
2016/02/02 06:44:20 VCS INFO V-16-2-13068 (pmapp2) Resource(app2) - clean completed successfully.
2016/02/02 06:44:23 VCS INFO V-16-2-13082 (pmapp2) Resource(app2) recovered from fault, on its own.
2016/02/02 06:52:05 VCS INFO V-16-1-10077 Received new cluster membership
2016/02/02 06:52:05 VCS NOTICE V-16-1-10112 System (pmapp2) - Membership: 0x2, DDNA: 0x0
2016/02/02 06:52:05 VCS ERROR V-16-1-10079 System pmapp1 (Node '0') is in Down State - Membership: 0x2
2016/02/02 06:52:05 VCS ERROR V-16-1-10322 System pmapp1 (Node '0') changed state from RUNNING to FAULTED
2016/02/02 06:52:05 VCS NOTICE V-16-1-10086 System pmapp2 (Node '1') is in Regular Membership - Membership: 0x2
2016/02/02 06:52:05 VCS NOTICE V-16-1-10449 Group app1_grp autodisabled on node pmapp1 until it is probed
2016/02/02 06:52:05 VCS NOTICE V-16-1-10449 Group app2_grp autodisabled on node pmapp1 until it is probed
2016/02/02 06:52:05 VCS NOTICE V-16-1-10449 Group VCShmg autodisabled on node pmapp1 until it is probed
2016/02/02 06:52:05 VCS NOTICE V-16-1-10446 Group app1_grp is offline on system pmapp1
2016/02/02 06:52:06 VCS INFO V-16-6-15015 (pmapp2) hatrigger:/opt/VRTSvcs/bin/triggers/sysoffline is not a trigger scripts directory or can not be executed
2016/02/02 06:52:10 VCS WARNING V-16-1-11141 LLT heartbeat link status changed. Previous status =eth1, UP, eth3, DOWN; Current status =eth1, DOWN, eth3, DOWN.
2016/02/02 09:52:59 VCS INFO V-16-1-53504 VCS Engine Alive message!!
2016/02/02 10:04:58 VCS INFO V-16-1-50133 User admin has logged in from ::ffff:192.168.226.88
A quick diagnosis:
2016/02/02 06:52:10 VCS WARNING V-16-1-11141 LLT heartbeat link status changed. Previous status =eth1, UP, eth3, DOWN; Current status =eth1, DOWN, eth3, DOWN.
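This shows that, before the event, only one of the two LLT heartbeat links was UP and the other was already DOWN. VCS calls this the jeopardy state, and it is an abnormal state.
When the last LLT link also goes down, VCS cannot tell whether the peer node has crashed or whether a network failure has cut the LLT heartbeat. To protect data integrity (to prevent both hosts from bringing the service group online and writing data at the same time), it does not fail over. This is by design.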
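As a rough illustration of the reasoning above, the sketch below pulls the V-16-1-11141 messages out of an engine log and reports how many LLT links were UP before each status change. The function names `llt_prev_up_links` and `llt_peer_loss_verdict` are hypothetical, the parsing assumes the message format shown in the log above, and the verdict is a simplification that ignores I/O fencing and any other cluster configuration.

```shell
# Read VCS engine log lines on stdin. For every V-16-1-11141 message,
# print the number of LLT links that were UP in the "Previous status" part.
llt_prev_up_links() {
    sed -n 's/.*V-16-1-11141.*Previous status *=\(.*\); Current status.*/\1/p' |
    awk -F', *' '{
        n = 0
        # Fields alternate between interface name and link state.
        for (i = 1; i <= NF; i++) if ($i == "UP") n++
        print n
    }'
}

# Map the number of links that were UP before the peer disappeared
# to the behavior described in the text (simplified).
llt_peer_loss_verdict() {
    case "$1" in
        0) echo "no heartbeat: nothing to decide" ;;
        1) echo "jeopardy: no failover expected" ;;
        *) echo "regular membership: failover expected" ;;
    esac
}
```

For example, `llt_prev_up_links < engine_A.log` (the engine log is typically under /var/VRTSvcs/log) prints one count per link status change, and each count can be passed to `llt_peer_loss_verdict`.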
To confirm the conclusion above, collect VRTSexplorer output from both nodes:
# /opt/VRTSspt/VRTSexplorer/VRTSexplorer
...........
SORT Data Collector has not been initialized.
To start collecting troubleshooting information
the SORT Data Collector needs to be initialized.
Initialize now? [y/n] (default: y)y
...........
Generating report for VRTSexplorer:
Do you want to run all VRTSexplorer modules? [y,n,q] (y) y
...........
the default output file is /opt/VRTSspt/DataCollector/sort/reports/VRTSexplorer_XXX.tar.gz
The command to check the VCS heartbeat status on both machines is:
# /sbin/lltstat -vvn
pmapp1:~ # /sbin/lltstat -vvn
LLT node information:
Node State Link Status Address
* 0 pmapp1 OPEN
eth1 UP 08:E8:4F:FE:33:97
eth3 UP 00:15:17:95:5D:AD
1 pmapp2 OPEN
eth1 UP 08:E8:4F:FE:2F:F9
eth3 DOWN
2 CONNWAIT
eth1 DOWN
eth3 DOWN
3 CONNWAIT
eth1 DOWN
eth3 DOWN
pmapp2:~ # /sbin/lltstat -vvn
LLT node information:
Node State Link Status Address
0 pmapp1 OPEN
eth1 UP 08:E8:4F:FE:33:97
eth3 DOWN
* 1 pmapp2 OPEN
eth1 UP 08:E8:4F:FE:2F:F9
eth3 UP 00:15:17:75:DD:0D
2 CONNWAIT
eth1 DOWN
eth3 DOWN
3 CONNWAIT
eth1 DOWN
eth3 DOWN
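There is still a problem: each machine currently sees both of its own heartbeat links as UP, but sees only one of the peer's links as UP (eth3 shows DOWN for the peer on both sides).
Check the network cable connections first.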
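The check above can be automated with a small filter over the `lltstat -vvn` output. This is only a sketch: the function name `llt_up_links` is hypothetical, and it assumes the output layout shown above (a node line with an optional `*`, the node number, name and state, followed by one line per link). It prints, for every node in the OPEN state, how many of its links are UP out of the total.

```shell
# Summarize `lltstat -vvn` output read from stdin as "<node> <up>/<total>"
# for each node whose state is OPEN.
llt_up_links() {
    awk '
        function flush() {
            if (have && state == "OPEN") print name, up "/" total
        }
        # The local node is marked with a leading "*"; skip it if present.
        { i = ($1 == "*") ? 2 : 1 }
        # Node line: <number> [<name>] <state>
        $i ~ /^[0-9]+$/ {
            flush()
            if (NF - i == 2) { name = $(i + 1); state = $(i + 2) }
            else             { name = "-";      state = $(i + 1) }
            have = 1; up = 0; total = 0
            next
        }
        # Link line: <interface> UP|DOWN [<address>]
        have && ($2 == "UP" || $2 == "DOWN") {
            total++
            if ($2 == "UP") up++
        }
        END { flush() }
    '
}
```

Run as `/sbin/lltstat -vvn | llt_up_links` on each node. Any node reported with fewer UP links than its total is the one to trace cabling for.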