Analysis of a Symantec VCS HA Cluster That Failed to Fail Over

Symptoms and logs: after the pmapp1 server went down unexpectedly, VCS did not fail over.

2016/02/02 05:52:57 VCS INFO V-16-1-53504 VCS Engine Alive message!!
2016/02/02 06:44:19 VCS ERROR V-16-2-13067 (pmapp2) Agent is calling clean for resource(app2) because the resource became OFFLINE unexpectedly, on its own.
2016/02/02 06:44:20 VCS INFO V-16-2-13068 (pmapp2) Resource(app2) - clean completed successfully.
2016/02/02 06:44:23 VCS INFO V-16-2-13082 (pmapp2) Resource(app2) recovered from fault, on its own.
2016/02/02 06:52:05 VCS INFO V-16-1-10077 Received new cluster membership
2016/02/02 06:52:05 VCS NOTICE V-16-1-10112 System (pmapp2) - Membership: 0x2, DDNA: 0x0
2016/02/02 06:52:05 VCS ERROR V-16-1-10079 System pmapp1 (Node '0') is in Down State - Membership: 0x2
2016/02/02 06:52:05 VCS ERROR V-16-1-10322 System pmapp1 (Node '0') changed state from RUNNING to FAULTED
2016/02/02 06:52:05 VCS NOTICE V-16-1-10086 System pmapp2 (Node '1') is in Regular Membership - Membership: 0x2
2016/02/02 06:52:05 VCS NOTICE V-16-1-10449 Group app1_grp autodisabled on node pmapp1 until it is probed
2016/02/02 06:52:05 VCS NOTICE V-16-1-10449 Group app2_grp autodisabled on node pmapp1 until it is probed
2016/02/02 06:52:05 VCS NOTICE V-16-1-10449 Group VCShmg autodisabled on node pmapp1 until it is probed
2016/02/02 06:52:05 VCS NOTICE V-16-1-10446 Group app1_grp is offline on system pmapp1
2016/02/02 06:52:06 VCS INFO V-16-6-15015 (pmapp2) hatrigger:/opt/VRTSvcs/bin/triggers/sysoffline is not a trigger scripts directory or can not be executed
2016/02/02 06:52:10 VCS WARNING V-16-1-11141 LLT heartbeat link status changed. Previous status =eth1, UP, eth3, DOWN; Current status =eth1, DOWN, eth3, DOWN.


2016/02/02 09:52:59 VCS INFO V-16-1-53504 VCS Engine Alive message!!
2016/02/02 10:04:58 VCS INFO V-16-1-50133 User admin has logged in from ::ffff:192.168.226.88


A quick diagnosis points to this line:

2016/02/02 06:52:10 VCS WARNING V-16-1-11141 LLT heartbeat link status changed. Previous status =eth1, UP, eth3, DOWN; Current status =eth1, DOWN, eth3, DOWN.

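The "Previous status" field shows that, of the two LLT heartbeat links, only one (eth1) was UP and the other (eth3) was already DOWN before the failure. VCS calls this the jeopardy state, and it is an abnormal state.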
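The reading of that log line can be sketched in a few lines of Python. This is only an illustration: the helper names `parse_link_status` and `describe` are hypothetical, and the message layout is assumed to match the single V-16-1-11141 line quoted above.

```python
import re

# The V-16-1-11141 line quoted above, copied verbatim from the engine log.
LOG_LINE = (
    "2016/02/02 06:52:10 VCS WARNING V-16-1-11141 LLT heartbeat link status changed. "
    "Previous status =eth1, UP, eth3, DOWN; Current status =eth1, DOWN, eth3, DOWN."
)


def parse_link_status(line):
    """Return (previous, current) dicts mapping link name to 'UP' or 'DOWN'."""
    match = re.search(r"Previous status =(.*?); Current status =(.*?)\.?$", line)
    if match is None:
        raise ValueError("not an LLT link status message")

    def to_dict(text):
        # The fields alternate: link name, status, link name, status, ...
        fields = [field.strip() for field in text.split(",")]
        return dict(zip(fields[0::2], fields[1::2]))

    return to_dict(match.group(1)), to_dict(match.group(2))


def describe(links):
    """Label a link set by how many heartbeat links are still UP."""
    up = sum(1 for status in links.values() if status == "UP")
    if up == 0:
        return "no heartbeat"
    if up == 1:
        return "jeopardy"
    return "regular"


previous, current = parse_link_status(LOG_LINE)
print("before:", previous, "->", describe(previous))
print("after: ", current, "->", describe(current))
```

Applied to the quoted line, the "before" state is labelled jeopardy (one link UP) and the "after" state has no heartbeat link left.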

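When the last LLT link also went down, VCS could not tell whether the peer node had crashed or whether a network fault had cut the LLT heartbeat. To preserve data integrity (that is, to prevent both hosts from bringing the same service group online and writing data at the same time), VCS does not fail over in this situation. This behavior is by design.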
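The reasoning above can be written down as a small decision function. It is a simplified model of the argument in this post, not VCS's actual implementation; the function name `expected_action` and its return strings are hypothetical, and the "failover" branch (several healthy links lost at the same moment) is an assumption that the text above does not state.

```python
def expected_action(up_links_before, up_links_after):
    """Simplified model: what the cluster is expected to do after a link change.

    up_links_before: heartbeat links that were UP before the event
    up_links_after:  heartbeat links that are still UP after the event
    """
    if up_links_after > 0:
        # At least one heartbeat still arrives, so the peer is known to be alive.
        return "no membership change"
    if up_links_before <= 1:
        # Jeopardy case: the last remaining link was lost. A node crash and a
        # network fault look identical, so the service groups are not moved.
        return "no failover"
    # Assumption (not stated in the post): all of several healthy links
    # disappearing together is treated as a node failure.
    return "failover"


# The incident in this post: one link UP before, none after.
print(expected_action(1, 0))
```

For the incident described here the model yields "no failover", which matches the log, where the groups were autodisabled on pmapp1 and not brought online on pmapp2.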

To confirm this conclusion, collect VRTSexplorer output from both nodes:

# /opt/VRTSspt/VRTSexplorer/VRTSexplorer
...........
SORT Data Collector has not been initialized.

To start collecting troubleshooting information
the SORT Data Collector needs to be initialized.
Initialize now? [y/n] (default: y)y
...........
Generating report for VRTSexplorer:

Do you want to run all VRTSexplorer modules? [y,n,q] (y) y
...........
the default output file is /opt/VRTSspt/DataCollector/sort/reports/VRTSexplorer_XXX.tar.gz



The command to check the VCS heartbeat status on both machines is:

# /sbin/lltstat -vvn



pmapp1:~ # /sbin/lltstat -vvn
LLT node information:
    Node                 State    Link  Status  Address
   * 0 pmapp1            OPEN
                                  eth1   UP         08:E8:4F:FE:33:97
                                  eth3   UP         00:15:17:95:5D:AD
     1 pmapp2            OPEN
                                  eth1   UP         08:E8:4F:FE:2F:F9
                                  eth3   DOWN
     2                   CONNWAIT
                                  eth1   DOWN
                                  eth3   DOWN
     3                   CONNWAIT
                                  eth1   DOWN
                                  eth3   DOWN




pmapp2:~ # /sbin/lltstat -vvn
LLT node information:
    Node                 State    Link  Status  Address
     0 pmapp1            OPEN
                                  eth1   UP         08:E8:4F:FE:33:97
                                  eth3   DOWN
   * 1 pmapp2            OPEN
                                  eth1   UP         08:E8:4F:FE:2F:F9
                                  eth3   UP         00:15:17:75:DD:0D
     2                   CONNWAIT
                                  eth1   DOWN
                                  eth3   DOWN
     3                   CONNWAIT
                                  eth1   DOWN
                                  eth3   DOWN



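The problem is still present: each machine sees both of its own heartbeat links as UP, but sees only one link (eth1) UP toward its peer, so the eth3 heartbeat path between the two nodes is not working.

Check the network cables and connections first.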
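The same check can be automated by scanning the `lltstat -vvn` output for DOWN links on nodes that are in the OPEN state. The sketch below is a minimal example under stated assumptions: the output layout is assumed to match the transcripts in this post, only the OPEN and CONNWAIT states seen here are recognized, and `down_links_on_open_nodes` is a hypothetical helper name. The sample text is pmapp2's view, abridged to three nodes.

```python
import re

# pmapp2's view from the transcript in this post, abridged to three nodes.
SAMPLE = """\
LLT node information:
    Node                 State    Link  Status  Address
     0 pmapp1            OPEN
                                  eth1   UP         08:E8:4F:FE:33:97
                                  eth3   DOWN
   * 1 pmapp2            OPEN
                                  eth1   UP         08:E8:4F:FE:2F:F9
                                  eth3   UP         00:15:17:75:DD:0D
     2                   CONNWAIT
                                  eth1   DOWN
                                  eth3   DOWN
"""

# Node line: optional '*', node id, optional node name, then the state.
NODE_RE = re.compile(r"^\s*\*?\s*(\d+)\s+(?:(\S+)\s+)?(OPEN|CONNWAIT)\s*$")
# Link line: link name, then UP or DOWN, optionally followed by an address.
LINK_RE = re.compile(r"^\s*(\S+)\s+(UP|DOWN)\b")


def down_links_on_open_nodes(output):
    """Return (node, link) pairs for DOWN links on nodes in the OPEN state."""
    problems = []
    node = None  # (name or id, state) of the node block being read
    for line in output.splitlines():
        node_match = NODE_RE.match(line)
        if node_match:
            name = node_match.group(2) or node_match.group(1)
            node = (name, node_match.group(3))
            continue
        link_match = LINK_RE.match(line)
        if link_match and node and node[1] == "OPEN" and link_match.group(2) == "DOWN":
            problems.append((node[0], link_match.group(1)))
    return problems


for node_name, link in down_links_on_open_nodes(SAMPLE):
    print(f"link {link} toward {node_name} is DOWN")
```

Nodes in CONNWAIT are skipped on purpose, because they are unconfigured slots in this two-node cluster and their DOWN links are expected. In practice the text would come from running the command on each node instead of from the embedded sample.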

