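Environment: 11.2.0.4 RAC on RHEL 6.5, two nodes
Problem description: deliberately unplug the network heartbeat (private interconnect) cable and analyze what each of the two nodes goes through
Analysis:
1. Unplug the heartbeat cable.
2. Check ocssd.log.
Node 1: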
2016-04-19 00:19:59.407: [ CSSD][299706112]clssnmPollingThread: node rac2 (2) at 50% heartbeat fatal, removal in 14.440 seconds
2016-04-19 00:19:59.407: [ CSSD][299706112]clssnmPollingThread: node rac2 (2) is impending reconfig, flag 2229260, misstime 15560
Node 1 sees that node 2 has been missing network heartbeats for some time (misstime 15560). Unless heartbeats resume, the cluster will be reconfigured to remove node 2 in 14.440 seconds.
Node 2:
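2016-04-19 00:19:59.349: [ CSSD][3818866432]clssnmPollingThread: node rac1 (1) at 50% heartbeat fatal, removal in 14.230 seconds
2016-04-19 00:19:59.349: [ CSSD][3818866432]clssnmPollingThread: node rac1 (1) is impending reconfig, flag 2491406, misstime 15770
Node 2 likewise sees that node 1 has been missing network heartbeats for some time (misstime 15770). From its point of view, the cluster will be reconfigured to remove node 1 in 14.230 seconds.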
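The two pairs of polling-thread lines above can be cross-checked with a little arithmetic: on each node, the time already missed (misstime, in milliseconds) plus the time left until removal should add up to the heartbeat budget. The sketch below is only an illustration. The log lines are copied from the excerpts with the timestamp and thread ID trimmed, and the helper name `heartbeat_budget_ms` is hypothetical. The resulting 30000 ms matches the usual CSS misscount of 30 seconds, but the original post does not state the misscount setting, so treat that as an inference.

```python
import re

# Log lines copied from the excerpts above (timestamp and thread ID trimmed).
LINES = [
    "clssnmPollingThread: node rac2 (2) at 50% heartbeat fatal, removal in 14.440 seconds",
    "clssnmPollingThread: node rac2 (2) is impending reconfig, flag 2229260, misstime 15560",
    "clssnmPollingThread: node rac1 (1) at 50% heartbeat fatal, removal in 14.230 seconds",
    "clssnmPollingThread: node rac1 (1) is impending reconfig, flag 2491406, misstime 15770",
]

REMOVAL_RE = re.compile(
    r"node (\S+) \(\d+\) at \d+% heartbeat fatal, removal in ([\d.]+) seconds"
)
MISSTIME_RE = re.compile(
    r"node (\S+) \(\d+\) is impending reconfig, flag \d+, misstime (\d+)"
)


def heartbeat_budget_ms(lines):
    """Return {node: misstime_ms + remaining_ms} for nodes seen in both line types."""
    remaining, missed = {}, {}
    for line in lines:
        m = REMOVAL_RE.search(line)
        if m:
            # "removal in" is printed in seconds; convert to whole milliseconds.
            remaining[m.group(1)] = round(float(m.group(2)) * 1000)
        m = MISSTIME_RE.search(line)
        if m:
            missed[m.group(1)] = int(m.group(2))
    return {node: missed[node] + remaining[node] for node in missed if node in remaining}


budgets = heartbeat_budget_ms(LINES)
print(budgets)  # {'rac2': 30000, 'rac1': 30000}
```

Both nodes arrive at the same total, which is why the two countdowns run almost in lockstep in the log excerpts that follow.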
Node 1:
2016-04-19 00:20:06.409: [ CSSD][299706112]clssnmPollingThread: node rac2 (2) at 75% heartbeat fatal, removal in 7.420 seconds
2016-04-19 00:20:06.410: [ CSSD][315549440]clssnmvDHBValidateNcopy: node 2, rac2, has a disk HB, but no network HB, DHB has rcfg 356437458, wrtcnt, 169517, LATS 39917994, lastSeqNo 169514, uniqueness 1460995653, timestamp 1460996406/44133904
Now at 75%: reconfiguration is about to happen! Node 1 still sees a disk heartbeat from node 2, but no network heartbeat.
Node 2:
2016-04-19 00:20:06.353: [ CSSD][3818866432]clssnmPollingThread: node rac1 (1) at 75% heartbeat fatal, removal in 7.200 seconds
2016-04-19 00:20:06.353: [ CSSD][4030301952]clssnmvDHBValidateNcopy: node 1, rac1, has a disk HB, but no network HB, DHB has rcfg 356437458, wrtcnt, 164891, LATS 44133864, lastSeqNo 164888, uniqueness 1460956156, timestamp 1460996406/39917744
Node 2 reports the same thing: 75%!
Node 1:
2016-04-19 00:20:13.831: [ CSSD][299706112]clssnmPollingThread: Removal started for node rac2 (2), flags 0x22040c, state 3, wt4c 0
2016-04-19 00:20:13.831: [ CSSD][299706112]clssnmMarkNodeForRemoval: node 2, rac2 marked for removal
2016-04-19 00:20:13.831: [ CSSD][299706112]clssnmDiscHelper: rac2, node(2) connection failed, endp (0x1dae5a), probe(0x7f2b00000000), ninf->endp 0x1dae5a
2016-04-19 00:20:13.831: [ CSSD][299706112]clssnmDiscHelper: node 2 clean up, endp (0x1dae5a), init state 5, cur state 5
Node 1 announces that it is going to clean up node 2.
Node 2:
2016-04-19 00:20:13.556: [ CSSD][3818866432]clssnmPollingThread: Removal started for node rac1 (1), flags 0x26040e, state 3, wt4c 0
2016-04-19 00:20:13.556: [ CSSD][3818866432]clssnmMarkNodeForRemoval: node 1, rac1 marked for removal
2016-04-19 00:20:13.556: [ CSSD][3818866432]clssnmDiscHelper: rac1, node(1) connection failed, endp (0x5577), probe(0x7f4e00000000), ninf->endp 0x5577
2016-04-19 00:20:13.556: [ CSSD][3818866432]clssnmDiscHelper: node 1 clean up, endp (0x5577), init state 5, cur state 5
Node 2 likewise announces that it is going to deal with node 1!
Node 1:
2016-04-19 00:20:13.833: [ CSSD][296552192]clssnmCheckDskInfo: Checking disk info...
2016-04-19 00:20:13.833: [ CSSD][296552192]clssnmCheckSplit: Node 2, rac2, is alive, DHB (1460996413, 44140634) more than disk timeout of 27000 after the last NHB (1460996383, 44111344)
Node 1 checks the disk information and finds that node 2 is still alive: its disk heartbeat is still arriving even though its network heartbeat is gone.
Node 2:
2016-04-19 00:20:13.558: [ CSSD][3815712512]clssnmCheckDskInfo: Checking disk info...
2016-04-19 00:20:13.558: [ CSSD][3815712512]clssnmCheckSplit: Node 1, rac1, is alive, DHB (1460996413, 39924894) more than disk timeout of 27000 after the last NHB (1460996383, 39895134)
Node 2 checks the disk information and finds that node 1 is still alive as well.
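The clssnmCheckSplit lines above compare the last disk heartbeat (DHB) with the last network heartbeat (NHB). As a rough sketch of that comparison, the snippet below subtracts the second value of each pair and checks the gap against the 27000 printed in the log. It assumes the second value of each pair is a millisecond counter, which is consistent with the first values (epoch seconds) being 30 apart but is not stated in the original. The helper name `dhb_after_nhb_ms` is hypothetical.

```python
DISK_TIMEOUT_MS = 27000  # value printed in the clssnmCheckSplit lines


def dhb_after_nhb_ms(dhb, nhb):
    """Gap between last disk HB and last network HB.

    Each argument is the (epoch_seconds, tick) pair from the log; the tick
    is assumed to be in milliseconds.
    """
    return dhb[1] - nhb[1]


# (DHB pair, NHB pair) as printed by each node about its peer.
views = {
    "rac2 as seen by node 1": ((1460996413, 44140634), (1460996383, 44111344)),
    "rac1 as seen by node 2": ((1460996413, 39924894), (1460996383, 39895134)),
}

gaps = {name: dhb_after_nhb_ms(dhb, nhb) for name, (dhb, nhb) in views.items()}
for name, gap in gaps.items():
    # A gap above the disk timeout means the peer kept writing to the voting
    # disk long after its last network heartbeat, so it is alive but unreachable.
    print(name, gap, gap > DISK_TIMEOUT_MS)
```

Under that assumption the gaps come out to 29290 ms and 29760 ms, both above 27000, so each node concludes that its peer is alive but cut off from the network, which is the split-brain situation the following lines resolve.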
At this point:
Node 1:
2016-04-19 00:20:13.833: [ CSSD][296552192]clssnmCheckDskInfo: My cohort: 1
2016-04-19 00:20:13.833: [ CSSD][296552192]clssnmRemove: Start
2016-04-19 00:20:13.833: [ CSSD][296552192](:CSSNM00007:)clssnmrRemoveNode: Evicting node 2, rac2, from the cluster in incarnation 356437458, node birth incarnation 356437457, death incarnation 356437458, stateflags 0x224000 uniqueness value 1460995653
So node 1 evicts node 2 from the cluster.
Node 2:
2016-04-19 00:20:13.558: [ CSSD][3815712512]clssnmCheckDskInfo: My cohort: 2
2016-04-19 00:20:13.558: [ CSSD][3815712512]clssnmCheckDskInfo: Surviving cohort: 1
2016-04-19 00:20:13.558: [ CSSD][3815712512](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 2, rac2, is smaller than cohort of 1 nodes led by node 1, rac1, based on map type 2
Node 2 aborts itself to avoid split-brain and leaves the cluster.
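The last message suggests how the survivor is chosen: both cohorts have one node, and the cohort led by node 1 wins. A minimal sketch of that rule, assuming the commonly described behavior that the larger cohort survives and a tie goes to the cohort containing the lowest node number, might look like the following. This is an illustration only, not Oracle's implementation, and the function name `surviving_cohort` is hypothetical.

```python
def surviving_cohort(cohorts):
    """Pick the cohort that stays up.

    Assumed rule: the cohort with the most members wins; on a tie, the
    cohort containing the lowest node number wins.
    """
    return max(cohorts, key=lambda cohort: (len(cohort), -min(cohort)))


# The situation in this test: each node forms a cohort of its own.
winner = surviving_cohort([{1}, {2}])
print(winner)  # {1}
```

With cohorts {1} and {2} the sizes tie, so the cohort holding node 1 survives and node 2 aborts, matching what the logs show.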