2015-01-12 09:09:30.742: [ CSSCLNT][1]clssscConnect: gipc request failed with 13 (1a)
The connection to CSS failed (CSS, Cluster Synchronization Services, relies on two mechanisms: the network heartbeat and the disk heartbeat).
2015-01-12 09:09:30.742: [ CSSCLNT][1]clsssInitNative: connect to (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_node2_)) failed, rc 13
2015-01-12 09:09:30.745: [ CRSRTI][1] CSS is not ready. Received status 3
CSS not ready
2015-01-12 09:09:30.745: [ CRSMAIN][1] First attempt: init CSS context failed. Error = 3
[ clsdmt][515]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=jzh2DBG_CRSD))
The first attempt failed.
2015-01-12 09:09:30.812: [ clsdmt][515]PID for the Process [4522624], connkey 1
2015-01-12 09:09:30.812: [ clsdmt][515]Creating PID [4522624] file for home /u01/app/oracle/grid/asm host jzh2 bin crs to /u01/app/oracle/grid/asm/crs/init/
2015-01-12 09:09:30.812: [ clsdmt][515]Writing PID [4522624] to the file [/u01/app/oracle/grid/asm/crs/init/jzh2.pid]
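2015-01-12 09:09:31.863: [ CRSMAIN][1] CRS Daemon Starting --> CRS is starting (the crsd service itself is fine; verified shortly)
Note: the entries above show that the CRS service itself is fine. At 09:09:30 the CSS (cluster synchronization service) hit an error. CSS depends on the disk heartbeat and the network heartbeat, so the problem most likely lies in one of the two. Next, look at the ocssd.log on jzh1 to see what node1 (jzh1) was doing around 2015-01-12 09:09:30.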
2015-01-12 09:09:00.934: [ CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
2015-01-12 09:09:04.946: [ CSSD][3862]clssnmSendingThread: sending status msg to all nodes
2015-01-12 09:09:04.946: [ CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
2015-01-12 09:09:08.980: [ CSSD][3862]clssnmSendingThread: sending status msg to all nodes
2015-01-12 09:09:08.980: [ CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
2015-01-12 09:09:12.994: [ CSSD][3862]clssnmSendingThread: sending status msg to all nodes
2015-01-12 09:09:12.994: [ CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
2015-01-12 09:09:16.758: [ CSSD][2577]clssnmSetupReadLease: status 1
2015-01-12 09:09:16.762: [ CSSD][2577]clssnmCompleteGMReq: Completed request type 17 with status 1
2015-01-12 09:09:16.762: [ CSSD][2577]clssgmDoneQEle: re-queueing req 110617b30 status 1
2015-01-12 09:09:16.763: [ CSSD][1029]clssgmCheckReqNMCompletion: Completing request type 17 for proc (111b38850), operation status 1, client status 0
2015-01-12 09:09:17.009: [ CSSD][3862]clssnmSendingThread: sending status msg to all nodes
2015-01-12 09:09:17.009: [ CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
2015-01-12 09:09:21.028: [ CSSD][3862]clssnmSendingThread: sending status msg to all nodes
2015-01-12 09:09:21.028: [ CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
2015-01-12 09:09:25.349: [ CSSD][3862]clssnmSendingThread: sending status msg to all nodes
2015-01-12 09:09:25.349: [ CSSD][3862]clssnmSendingThread: sent 3 status msgs to all nodes
2015-01-12 09:09:29.355: [ CSSD][3862]clssnmSendingThread: sending status msg to all nodes
2015-01-12 09:09:29.355: [ CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
2015-01-12 09:09:33.362: [ CSSD][3862]clssnmSendingThread: sending status msg to all nodes
2015-01-12 09:09:33.362: [ CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
2015-01-12 09:09:37.366: [ CSSD][3862]clssnmSendingThread: sending status msg to all nodes
2015-01-12 09:09:37.366: [ CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
2015-01-12 09:09:41.377: [ CSSD][3862]clssnmSendingThread: sending status msg to all nodes
2015-01-12 09:09:41.377: [ CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
2015-01-12 09:09:45.394: [ CSSD][3862]clssnmSendingThread: sending status msg to all nodes
2015-01-12 09:09:45.394: [ CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
2015-01-12 09:09:49.371: [ CSSD][1029]clssgmQueueGrockEvent: groupName(DAALL_DB_jzh-cluster) count(2) master(0) event(6), incarn 6, mbrc 1, to member 0, events 0x8, state 0x0
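2015-01-12 09:09:49.371: [ CSSD][1029]clssgmQueueGrockEvent: groupName(DAALL_DB_jzh-cluster) count(2) master(0) event(6), incarn 6, mbrc 1, to member 1, events 0x8, state 0x0
Note: node1's CSSD process sends status messages to all nodes. Why? After RAC starts (node1's ASM log shows a reconfiguration at 09:08:48), each node writes its own information to the OCR and the voting disk. The master then collects this information and sends it to all nodes, telling them who the master is and how many nodes there are; the node information is recorded in the voting disk, and then a vote is taken. At this point the cluster has two members, member 0 (jzh1) and member 1 (jzh2), so the CRSD process is fine (verified). What else does this tell us? The other node can write its information to the voting disk, so the disk heartbeat is probably fine (to be verified shortly).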
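As a quick check of the sending pattern, the timestamps of the "sending status msg" lines can be diffed. This is a minimal sketch: the timestamps are copied from the log above, and the helper name `send_intervals` is made up for illustration. It suggests the sending thread wakes up roughly every 4 seconds on this system.

```python
from datetime import datetime

# Timestamps copied from the clssnmSendingThread "sending status msg" lines above.
stamps = [
    "2015-01-12 09:09:04.946",
    "2015-01-12 09:09:08.980",
    "2015-01-12 09:09:12.994",
    "2015-01-12 09:09:17.009",
    "2015-01-12 09:09:21.028",
    "2015-01-12 09:09:25.349",
    "2015-01-12 09:09:29.355",
]

def send_intervals(timestamps):
    """Return the gaps, in seconds, between consecutive log timestamps."""
    parsed = [datetime.strptime(t, "%Y-%m-%d %H:%M:%S.%f") for t in timestamps]
    return [round((b - a).total_seconds(), 3) for a, b in zip(parsed, parsed[1:])]

intervals = send_intervals(stamps)
print(intervals)
```

A gap that grows well beyond 4 seconds would be worth a closer look; here the largest gap (before 09:09:25.349) is still under 4.5 seconds.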
Reading on:
2015-01-12 09:17:02.539: [ CSSD][4376]clssscSelect: cookie accept request 110991628
2015-01-12 09:17:02.539: [ CSSD][4376]clssnmeventhndlr: gipcAssociate endp 1d2198b in container 73 type of conn gipcha
2015-01-12 09:17:02.540: [ CSSD][4376]clssnmConnSetNames: hostname jzh2 privname 10.10.0.20 con 1d2198b --> connection from jzh2, private IP 10.10.0.20 (recall that the ASM log on jzh1 showed the private IP as 169.254.189.151)
2015-01-12 09:17:02.540: [ CSSD][4376]clssnmSetNodeProperties: properties node 2 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17 --> properties of jzh2
2015-01-12 09:17:02.540: [ CSSD][4376]clssnmConnComplete: node node2 softver 11.2.0.4.0 --> software version of node2
2015-01-12 09:17:02.540: [ CSSD][4376]clssnmCompleteConnProtocol: Incoming connect from node 2 (node2) ninf endp 0, probendp 0, endp 1d2198b
2015-01-12 09:17:02.540: [ CSSD][4376]clssnmSendConnAck: connected to node 2, jzh2, con (1d2198b), state 0
2015-01-12 09:17:02.540: [ CSSD][4376]clssnmCompleteConnProtocol: node jzh2, 2, uniqueness 1421024974, msg uniqueness 1421024974, endp 1d2198b probendp 0 endp 1d2198b
2015-01-12 09:17:03.044: [ CSSD][4376]clssnmHandleJoin: node 2 JOINING, state 0->1 ninfendp 1d2198b
2015-01-12 09:17:03.354: [ CSSD][3862]clssnmSendingThread: sending status msg to all nodes
2015-01-12 09:17:03.355: [ CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
2015-01-12 09:17:03.360: [ CSSD][2577]clssnmvReadDskHeartbeat: Reading DHBs to get the latest info for node(2/jzh2), LATSvalid(0), nodeInfoDHB uniqueness(1420692326) --> read the disk heartbeat
2015-01-12 09:17:03.360: [ CSSD][2577]clssnmvDHBValidateNcopy: Setting LATS valid due to uniqueness change for node(jzh2) number(2), nodeInfoDHB(1420692326), readInfo(1421024974)
2015-01-12 09:17:03.360: [ CSSD][2577]clssnmvDHBValidateNcopy: Saving DHB uniqueness for node jzh2, number 2 latestInfo(1421024974), readInfo(1421024974), nodeInfoDHB(1420692326) --> save jzh2's disk heartbeat information
2015-01-12 09:17:03.754: [ CSSD][4119]clssnmDoSyncUpdate: Initiating sync 315891915
2015-01-12 09:17:03.754: [ CSSD][4119]clssscCompareSwapEventValue: changed NMReconfigInProgress val 1, from -1, changes 20
2015-01-12 09:17:03.754: [ CSSD][4119]clssnmDoSyncUpdate: local disk timeout set to 200000 ms, remote disk timeout set to 200000 --> sets the disk heartbeat timeout (200 s, the 11g default)
2015-01-12 09:17:03.754: [ CSSD][4119]clssnmDoSyncUpdate: new values for local disk timeout and remote disk timeout will take effect when the sync is completed --> the new local and remote disk timeouts take effect once the sync completes
2015-01-12 09:17:03.754: [ CSSD][4119]clssnmDoSyncUpdate: Starting cluster reconfig with incarnation 315891915
2015-01-12 09:17:03.754: [ CSSD][4119]clssnmSetupAckWait: Ack message type (11)
2015-01-12 09:17:03.754: [ CSSD][4119]clssnmSetupAckWait: node(1) is ALIVE --> node 1 is alive
2015-01-12 09:17:03.754: [ CSSD][4119]clssnmSetupAckWait: node(2) is ALIVE --> node 2 is alive
Reading on:
2015-01-12 09:24:20.392: [ CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
2015-01-12 09:24:23.578: [ CSSD][3605]clssnmPollingThread: node jzh2 (2) at 50% heartbeat fatal, removal in 14.647 seconds --> node1's CSSD polled jzh2: half of the heartbeat timeout has elapsed, and jzh2 will be removed in 14.647 s (remember this number). The disk heartbeat looked fine above, so why is an error reported here?
2015-01-12 09:24:23.578: [ CSSD][3605]clssnmPollingThread: node jzh2 (2) is impending reconfig, flag 2294796, misstime 15353 --> misstime 15.353 s (remember this number) + 14.647 s = 30 s
2015-01-12 09:24:23.578: [ CSSD][3605]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1) --> the disk timeout is now set to 27 s; wasn't it 200 s?
2015-01-12 09:24:23.578: [ CSSD][2577]clssnmvDHBValidateNcopy: node 2, jzh2, has a disk HB, but no network HB, DHB has rcfg 315891916, wrtcnt, 19771770, LATS 706581505, lastSeqNo 19771331, uniqueness 1421024974, timestamp 1421025905/706212683 --> so the disk heartbeat of node2 (jzh2) is still detected, which is why the 200 s timeout is no longer needed. "DHB has rcfg" confirms once more that the disk heartbeat is fine. But "no network HB": could the network heartbeat be the problem?
2015-01-12 09:24:23.618: [ CSSD][2063]clssnmvDiskPing: Writing with status 0x3, timestamp 1421025863/706581544
2015-01-12 09:24:24.082: [ CSSD][2577]clssnmvDHBValidateNcopy: node 2, jzh2, has a disk HB, but no network HB, DHB has rcfg 315891916, wrtcnt, 19771771, LATS 706582008, lastSeqNo 19771770, uniqueness 1421024974, timestamp 1421025905/706213190
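One way to see that jzh2's disk heartbeat is still alive is to watch the `wrtcnt` field in these lines. The sketch below assumes that `wrtcnt` is a write counter that increases with each disk heartbeat (an interpretation, not something the log states); the lines are trimmed copies of the two entries above, and `write_counts` is an illustrative helper name.

```python
import re

# The two clssnmvDHBValidateNcopy lines above, trimmed to the fields used here.
dhb_lines = [
    "node 2, jzh2, has a disk HB, but no network HB, DHB has rcfg 315891916, "
    "wrtcnt, 19771770, LATS 706581505",
    "node 2, jzh2, has a disk HB, but no network HB, DHB has rcfg 315891916, "
    "wrtcnt, 19771771, LATS 706582008",
]

def write_counts(lines):
    """Pull the wrtcnt value out of each line (note the comma after 'wrtcnt')."""
    found = (re.search(r"wrtcnt, (\d+)", line) for line in lines)
    return [int(m.group(1)) for m in found if m]

counts = write_counts(dhb_lines)
# Assumption: a strictly increasing counter means the remote node keeps writing.
disk_hb_advancing = all(b > a for a, b in zip(counts, counts[1:]))
print(counts, disk_hb_advancing)
```

If the counter keeps advancing while the same lines say "no network HB", the disk path is working and suspicion falls on the interconnect.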
2015-01-12 09:24:24.119: [ CSSD][2063]clssnmvDiskPing: Writing with status 0x3, timestamp 1421025864/706582045 --> the disk ping appears to report an error.
2015-01-12 09:24:24.398: [ CSSD][3862]clssnmSendingThread: sending status msg to all nodes --> what does node1 want to tell everyone?
2015-01-12 09:24:39.805: [ CSSD][4119]clssnmNeedConfReq: No configuration to change
2015-01-12 09:24:39.805: [ CSSD][4119]clssnmDoSyncUpdate: Terminating node 2, jzh2, misstime(31566) state(5) --> node2 (jzh2) is about to be terminated; misstime is 31.566 s. Recall 15.353 + 14.647 = 30 s above: that is Oracle's default maximum network heartbeat threshold of 30 s.
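The arithmetic in the annotations can be reproduced from the log text. This is a small sketch: the three strings are excerpts of the lines above, and `THRESHOLD_MS` is the 30 s network heartbeat threshold named in the annotation, hard-coded here rather than read from the cluster.

```python
import re

THRESHOLD_MS = 30_000  # 30 s network heartbeat threshold, as stated in the annotation

warn_line = "clssnmPollingThread: node jzh2 (2) at 50% heartbeat fatal, removal in 14.647 seconds"
impending_line = "clssnmPollingThread: node jzh2 (2) is impending reconfig, flag 2294796, misstime 15353"
terminate_line = "clssnmDoSyncUpdate: Terminating node 2, jzh2, misstime(31566) state(5)"

# Time left before removal, converted from seconds to milliseconds.
removal_ms = round(float(re.search(r"removal in ([\d.]+) seconds", warn_line).group(1)) * 1000)
# Time already elapsed since the last network heartbeat, in milliseconds.
missed_ms = int(re.search(r"misstime (\d+)", impending_line).group(1))
# misstime at the moment node 2 is terminated.
final_ms = int(re.search(r"misstime\((\d+)\)", terminate_line).group(1))

print(missed_ms + removal_ms)   # elapsed + remaining = threshold
print(final_ms > THRESHOLD_MS)  # the node is terminated once the threshold is exceeded
```

The elapsed and remaining times add up to exactly 30000 ms, and the termination happens at 31566 ms, just past the threshold.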
2015-01-12 09:24:39.805: [ CSSD][4119]clssnmDoSyncUpdate: Wait for 0 vote ack(s) --> the voting disk is about to be updated and a vote taken
2015-01-12 09:24:39.805: [ CSSD][4119]clssnmCheckDskInfo: Checking disk info...
2015-01-12 09:24:39.805: [ CSSD][4119]clssnmCheckSplit: Node 2, jzh2, is alive, DHB (1421025918, 706226248) more than disk timeout of 27000 after the last NHB (1421025890, 706197520) --> confirms once more that the disk heartbeat is fine
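The clssnmCheckSplit message compares the last disk heartbeat (DHB) with the last network heartbeat (NHB). The sketch below pulls both pairs out of the line and computes the gap. It assumes that the first number in each pair is a Unix timestamp in seconds and the second is a millisecond counter; that reading fits the values but is not documented in the log itself. `hb_gap` is an illustrative helper name.

```python
import re

line = ("clssnmCheckSplit: Node 2, jzh2, is alive, DHB (1421025918, 706226248) "
        "more than disk timeout of 27000 after the last NHB (1421025890, 706197520)")

def hb_gap(text):
    """Return (gap in seconds, gap in milliseconds, disk timeout in ms)."""
    dhb = re.search(r"DHB \((\d+), (\d+)\)", text)
    nhb = re.search(r"NHB \((\d+), (\d+)\)", text)
    timeout_ms = int(re.search(r"disk timeout of (\d+)", text).group(1))
    gap_s = int(dhb.group(1)) - int(nhb.group(1))    # assumed: epoch seconds
    gap_ms = int(dhb.group(2)) - int(nhb.group(2))   # assumed: millisecond ticks
    return gap_s, gap_ms, timeout_ms

gap_s, gap_ms, timeout_ms = hb_gap(line)
print(gap_s, gap_ms, gap_ms > timeout_ms)
```

Under these assumptions the disk heartbeat kept arriving for about 28.7 s after the last network heartbeat, longer than the 27 s disk timeout, which matches the "is alive" verdict in the message.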
2015-01-12 09:24:39.805: [ CSSD][4119]clssnmCheckDskInfo: My cohort: 1
2015-01-12 09:24:39.805: [ CSSD][4119](:CSSNM00007:)clssnmrRemoveNode: Evicting node 2, jzh2, from the cluster in incarnation 315891916, node birth incarnation 315891915, death incarnation 315891916, stateflags 0x234000 uniqueness value 1421024974 --> node1 is about to evict node2 (jzh2)
2015-01-12 09:24:39.806: [ default][4119]kgzf_gen_node_reid2: generated reid cid=41daa0e19d0a6f84ff29b9f37a2f1a38,icin=315891908,nmn=2,lnid=315891915,gid=0,gin=0,gmn=0,umemid=0,opid=0,opsn=0,lvl=node hdr=0xfece0100
2015-01-12 09:24:39.806: [ CSSD][4119]clssnmrFenceSage: Fenced node jzh2, number 2, with EXADATA, handle 0
2015-01-12 09:24:39.806: [ CSSD][4119]clssnmSendShutdown: req to node 2, kill time 706597731 --> node1 asks node2 to shut down (kill)
2015-01-12 09:24:39.806: [ CSSD][4119]clssnmsendmsg: not connected to node 2 --> cannot connect to node2
2015-01-12 09:24:39.806: [ CSSD][4119]clssnmSendShutdown: Send to node 2 failed --> to keep the data consistent node2 must be shut down, but the shutdown request failed
2015-01-12 09:24:39.806: [ CSSD][4119]clssnmWaitOnEvictions: Start --> the eviction begins.
2015-01-12 09:25:07.095: [ CSSD][4376]clssnmUpdateNodeState: node jzh1, number 1, current state 3, proposed state 3, current unique 1420396557, proposed unique 1420396557, prevConuni 0, birth 315891909
2015-01-12 09:25:07.095: [ CSSD][4376]clssnmUpdateNodeState: node jzh2, number 2, current state 5, proposed state 0, current unique 1421024974, proposed unique 1421024974, prevConuni 1421024974, birth 315891915
2015-01-12 09:25:07.095: [ CSSD][4376]clssnmDeactivateNode: node 2, state 5
2015-01-12 09:25:07.095: [ CSSD][4376]clssnmDeactivateNode: node 2 (jzh2) left cluster --> node2 (jzh2) has left the cluster
2015-01-12 10:11:27.825: [ CSSD][4119]clssnmWaitForAcks: Ack message type(11), ackCount(2)
2015-01-12 10:11:27.825: [ CSSD][4376]clssnmHandleSync: Node jzh1, number 1, is EXADATA fence capable
2015-01-12 10:11:27.825: [ CSSD][4376]clssscUpdateEventValue: NMReconfigInProgress val 1, changes 33
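2015-01-12 10:11:27.825: [ CSSD][4376]clssnmHandleSync: local disk timeout set to 200000 ms, remote disk timeout set to 200000 --> the local and remote disk timeouts are set to 200 s
2015-01-12 10:11:27.825: [ CSSD][4376]clssnmHandleSync: initleader 1 newleader 1 --> node1 is now the leader, i.e. the master node
Note: based on the analysis above, the disk heartbeat is fine; the problem lies in the network heartbeat. The entries below appear to come from the ocssd.log on node2 (jzh2): further down, the log reports "My cohort: 2" and aborts the local node jzh2.
2015-01-12 09:25:19.224: [ CSSD][1]clssgmSuspendAllGrocks: done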
2015-01-12 09:25:19.224: [ CSSD][1]clssgmCompareSwapEventValue: changed CmInfo State val 2, from 5, changes 13
2015-01-12 09:25:19.224: [ CSSD][1]clssgmUpdateEventValue: ConnectedNodes val 315891915, changes 5
2015-01-12 09:25:19.224: [ CSSD][1]clssgmCleanupNodeContexts(): cleaning up nodes, rcfg(315891915)
2015-01-12 09:25:19.224: [ CSSD][1]clssgmCleanupNodeContexts(): successful cleanup of nodes rcfg(315891915)
2015-01-12 09:25:19.224: [ CSSD][1]clssgmStartNMMon: completed node cleanup
2015-01-12 09:25:19.224: [ CSSD][4119]clssnmSendSync: syncSeqNo(315891916), indicating EXADATA fence initialization complete
2015-01-12 09:25:19.224: [ CSSD][4119]List of nodes that have ACKed my sync: 2
2015-01-12 09:25:19.224: [ CSSD][4119]clssnmWaitForAcks: done, syncseq(315891916), msg type(11)
2015-01-12 09:25:19.224: [ CSSD][4119]clssnmSetMinMaxVersion:node2 product/protocol (11.2/1.4)
2015-01-12 09:25:19.224: [ CSSD][4119]clssnmSetMinMaxVersion: properties common to all nodes: 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
2015-01-12 09:25:19.224: [ CSSD][4119]clssnmSetMinMaxVersion: min product/protocol (11.2/1.4)
2015-01-12 09:25:19.224: [ CSSD][4119]clssnmSetMinMaxVersion: max product/protocol (11.2/1.4)
2015-01-12 09:25:19.224: [ CSSD][4119]clssnmNeedConfReq: No configuration to change
2015-01-12 09:25:19.224: [ CSSD][4119]clssnmDoSyncUpdate: Terminating node 1, jzh1, misstime(30000) state(5) --> node2 (jzh2) runs its own sync against node1; misstime has reached 30 s (the network heartbeat threshold), so it moves to terminate node1.
2015-01-12 09:25:19.224: [ CSSD][4119]clssnmDoSyncUpdate: Wait for 0 vote ack(s) --> waiting for votes.
2015-01-12 09:25:19.224: [ CSSD][4119]clssnmCheckDskInfo: Checking disk info...
2015-01-12 09:25:19.224: [ CSSD][4119]clssnmCheckSplit: Node 1, jzh1, is alive, DHB (1421025877, 706595081) more than disk timeout of 27000 after the last NHB (1421025847, 706565177) --> the disk heartbeat of node1 (jzh1) is fine
2015-01-12 09:25:19.224: [ CSSD][4119]clssnmCheckDskInfo: My cohort: 2 --> the local node number
2015-01-12 09:25:19.224: [ CSSD][4119]clssnmCheckDskInfo: Surviving cohort: 1 --> node1 (jzh1) survives
2015-01-12 09:25:19.224: [ CSSD][4119](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 2, jzh2, is smaller than cohort of 1 nodes led by node 1, jzh1, based on map type 2 --> the local node, node2 (jzh2), is aborted; node1 (jzh1) is the leader.
2015-01-12 09:25:19.224: [ CSSD][4119]###################################
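2015-01-12 09:25:19.224: [ CSSD][4119]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread --> CSSD aborts from the clssnmRcfgMgrThread thread
At this point we can conclude that the RAC split-brain was caused by a failure of network heartbeat communication between the nodes.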
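The outcome in the CSSNM00008 message (two cohorts of one node each, with the cohort led by node 1 surviving) is consistent with a simple rule: the larger cohort wins, and on a tie the cohort containing the lowest node number wins. The sketch below encodes only that simplified rule as an assumption; the real CSSD decision takes more into account (the message mentions a "map type", for example), and `surviving_cohort` is a hypothetical name, not an Oracle API.

```python
def surviving_cohort(cohorts):
    """Pick the cohort that survives a split.

    Simplified assumption: the larger cohort wins; on a tie, the cohort
    that contains the lowest node number wins.
    """
    return max(cohorts, key=lambda members: (len(members), -min(members)))

# The situation in this incident: node 1 and node 2 each form a cohort of one.
winner = surviving_cohort([[1], [2]])
print(winner)

# A hypothetical three-node split for comparison: the two-node side wins.
print(surviving_cohort([[1], [2, 3]]))
```

With the two single-node cohorts from this log, the rule picks the cohort of node 1, so node 2 is the one that aborts itself.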
Source: ITPUB blog, link: http://blog.itpub.net/10271187/viewspace-1407451/. Please credit the source when reposting; otherwise legal action may be taken.