Notes from a RAC Brain Split (Split-Brain) Analysis

1. Environment
DB version: 11.2.0.4, 64-bit
OS version: AIX 6.1, 64-bit
2. Symptoms
ASM instance alert log on node 1:
Mon Jan 12 09:08:48 2015
Reconfiguration started (old inc 8, new inc 10)
List of instances:
1 (myinst: 1) 
Global Resource Directory frozen
* dead instance detected - domain 1 invalid = TRUE 
* dead instance detected - domain 2 invalid = TRUE 
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Mon Jan 12 09:08:48 2015
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info 
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Submitted all GCS remote-cache requests
Post SMON to start 1st pass IR
Fix write in gcs resources
Reconfiguration complete
Mon Jan 12 09:08:48 2015
NOTE: SMON starting instance recovery for group DG domain 1 (mounted)
NOTE: F1X0 found on disk 0 au 2 fcn 0.332
NOTE: starting recovery of thread=2 ckpt=10.4585 group=1 (DG)
NOTE: SMON waiting for thread 2 recovery enqueue
NOTE: SMON about to begin recovery lock claims for diskgroup 1 (DG)
NOTE: SMON successfully validated lock domain 1
NOTE: advancing ckpt for group 1 (DG) thread=2 ckpt=10.4585
NOTE: SMON did instance recovery for group DG domain 1
NOTE: SMON starting instance recovery for group VOTE domain 2 (mounted)
NOTE: F1X0 found on disk 0 au 2 fcn 0.0
NOTE: starting recovery of thread=2 ckpt=15.26 group=2 (VOTE)
NOTE: SMON waiting for thread 2 recovery enqueue
NOTE: SMON about to begin recovery lock claims for diskgroup 2 (VOTE)
NOTE: SMON successfully validated lock domain 2
NOTE: advancing ckpt for group 2 (VOTE) thread=2 ckpt=15.26
NOTE: SMON did instance recovery for group VOTE domain 2
Mon Jan 12 09:21:06 2015
Reconfiguration started (old inc 10, new inc 12)
List of instances:
1 2 (myinst: 1) 
Global Resource Directory frozen
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Mon Jan 12 09:21:06 2015
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info 
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
Mon Jan 12 09:25:08 2015
Reconfiguration started (old inc 12, new inc 14)
List of instances:
1 (myinst: 1) 
Global Resource Directory frozen
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Mon Jan 12 09:25:08 2015
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info 
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Submitted all GCS remote-cache requests
Post SMON to start 1st pass IR
Fix write in gcs resources
Reconfiguration complete
Mon Jan 12 09:51:08 2015
Reconfiguration started (old inc 14, new inc 18)
List of instances:
1 (myinst: 1) 
Global Resource Directory frozen
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Mon Jan 12 09:51:08 2015
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info 
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
Mon Jan 12 10:15:52 2015
Reconfiguration started (old inc 18, new inc 20)
List of instances:
1 2 (myinst: 1) 
Global Resource Directory frozen
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Mon Jan 12 10:15:53 2015
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info 
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
Mon Jan 12 12:20:06 2015
Warning: VKTM detected a time drift.
Time drifts can result in an unexpected behavior such as time-outs. Please check trace file for more
details.
Mon Jan 12 12:20:26 2015
Reconfiguration started (old inc 20, new inc 22)
List of instances:
1 (myinst: 1) 
Global Resource Directory frozen
* dead instance detected - domain 1 invalid = TRUE 
* dead instance detected - domain 2 invalid = TRUE 
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Mon Jan 12 12:20:26 2015
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info 
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Submitted all GCS remote-cache requests
Post SMON to start 1st pass IR
Fix write in gcs resources
Reconfiguration complete
Mon Jan 12 12:20:26 2015
NOTE: SMON starting instance recovery for group DG domain 1 (mounted)
NOTE: F1X0 found on disk 0 au 2 fcn 0.332
NOTE: starting recovery of thread=2 ckpt=11.4610 group=1 (DG)
NOTE: SMON waiting for thread 2 recovery enqueue
Mon Jan 12 12:32:50 2015
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info 
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
Mon Jan 12 13:44:50 2015
NOTE: No asm libraries found in the system
MEMORY_TARGET defaulting to 1128267776.
* instance_number obtained from CSS = 1, checking for the existence of node 0... 
* node 0 does not exist. instance_number = 1 
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Initial number of CPU is 32
Number of processor cores in the system is 8
Private Interface 'en0' configured from GPnP for use as a private interconnect.
[name='en0', type=1, ip=169.254.189.151, mac=00-90-fa-52-21-be, net=169.254.0.0/16,
mask=255.255.0.0, use=haip:cluster_interconnect/62]
Public Interface 'en2' configured from GPnP for use as a public interface.
[name='en2', type=1, ip=192.168.0.12, mac=00-90-fa-52-22-3c, net=192.168.0.0/24,
mask=255.255.255.0, use=public/1]
CELL communication is configured to use 0 interface(s):
CELL IP affinity details:
NUMA status: non-NUMA system
cellaffinity.ora status: N/A
CELL communication will use 1 IP group(s):
Grp 0: 
Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameter default value as /u01/app/oracle/grid/asm/dbs/arch
Autotune of undo retention is turned on.
Note: the ASM log on the primary node, node1, shows the ASM instance going through reconfigurations at Mon Jan 12 09:08:48 2015, 09:21:06, 09:25:08, 09:51:08, 10:15:52, and 12:20:26. In other words, the ASM instances kept dropping out of and rejoining the cluster, the hallmark of a split-brain. The log also shows en0 as the private interconnect (with the HAIP link-local address 169.254.189.151) and en2 as the public interface (192.168.0.12).
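As a quick cross-check, the interface roles that GPnP reported in the alert log can be confirmed from the Grid home (a minimal sketch; the interconnect subnet 10.10.0.0 in the sample output is inferred from the privname that appears later in ocssd.log, not captured from this system):

oifcfg getif
# sample output for a cluster shaped like this one:
#   en0  10.10.0.0    global  cluster_interconnect
#   en2  192.168.0.0  global  public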
3. Analysis
ASM instance alert log on node 2:
Mon Jan 12 09:09:29 2015
NOTE: client exited [4719298]
Mon Jan 12 09:09:29 2015
NOTE: ASMB process exiting, either shutdown is in progress 
NOTE: or foreground connected to ASMB was killed. 
Mon Jan 12 09:09:29 2015
PMON (ospid: 3867214): terminating the instance due to error 471
Mon Jan 12 09:51:49 2015
Shutting down instance (abort)
License high water mark = 2
USER (ospid: 2425590): terminating the instance
Instance terminated by USER, pid = 2425590
Mon Jan 12 09:51:50 2015
Instance shutdown complete
Mon Jan 12 10:16:27 2015
NOTE: No asm libraries found in the system
MEMORY_TARGET defaulting to 1128267776.
* instance_number obtained from CSS = 2, checking for the existence of node 0... 
* node 0 does not exist. instance_number = 2 
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Mon Jan 12 12:21:07 2015
NOTE: client exited [2098012]
Mon Jan 12 12:21:07 2015
NOTE: ASMB process exiting, either shutdown is in progress 
NOTE: or foreground connected to ASMB was killed. 
Mon Jan 12 12:21:09 2015
PMON (ospid: 2818726): terminating the instance due to error 481
Instance terminated by PMON, pid = 2818726
Mon Jan 12 12:33:30 2015
NOTE: No asm libraries found in the system
MEMORY_TARGET defaulting to 1128267776.
* instance_number obtained from CSS = 2, checking for the existence of node 0... 
* node 0 does not exist. instance_number = 2 
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Initial number of CPU is 32
Number of processor cores in the system is 8
Mon Jan 12 12:38:19 2015
NOTE: client jzhprd2:jzhprd registered, osid 3146138, mbr 0x1
Mon Jan 12 12:38:51 2015
ALTER SYSTEM SET local_listener=' (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.0.13)(PORT=1521))))' SCOPE=MEMORY SID='+ASM2';
Mon Jan 12 13:01:34 2015
NOTE: ASMB process exiting, either shutdown is in progress 
NOTE: or foreground connected to ASMB was killed. 
Mon Jan 12 13:01:34 2015
NOTE: client exited [4915336]
NOTE: force a map free for map id 2
Mon Jan 12 13:01:36 2015
PMON (ospid: 3212222): terminating the instance due to error 481
Instance terminated by PMON, pid = 3212222
Mon Jan 12 13:46:02 2015
NOTE: No asm libraries found in the system
MEMORY_TARGET defaulting to 1128267776.
* instance_number obtained from CSS = 2, checking for the existence of node 0... 
* node 0 does not exist. instance_number = 2 
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Initial number of CPU is 32
Number of processor cores in the system is 8
Note: the ASMB process on node 2 (jzh2) started failing at 09:09:29, after which PMON terminated the ASM instance. When ASM dies like this, the CRS and CSS logs must be examined next.
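For reference, the relevant 11.2 Grid Infrastructure logs live under $GRID_HOME/log/<hostname> (the Grid home /u01/app/oracle/grid/asm comes from the log lines above; substitute your own node names):

GRID_HOME=/u01/app/oracle/grid/asm
more $GRID_HOME/log/jzh2/alertjzh2.log      # clusterware alert log
more $GRID_HOME/log/jzh2/cssd/ocssd.log     # CSSD: heartbeats and evictions
more $GRID_HOME/log/jzh2/crsd/crsd.log      # CRSD: resource management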

CRSD log on node 2:

2015-01-12 09:09:30.742: [ CSSCLNT][1]clssscConnect: gipc request failed with 13 (1a)

CSS has failed here. CSS (Cluster Synchronization Services) maintains cluster membership via two mechanisms: the network heartbeat and the disk heartbeat.

2015-01-12 09:09:30.742: [ CSSCLNT][1]clsssInitNative: connect to (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_node2_)) failed, rc 13

2015-01-12 09:09:30.745: [  CRSRTI][1] CSS is not ready. Received status 3
CSS reports that it is not ready.


2015-01-12 09:09:30.745: [ CRSMAIN][1] First attempt: init CSS context failed. Error = 3

[  clsdmt][515]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=jzh2DBG_CRSD))

The first attempt to initialize the CSS context failed.

2015-01-12 09:09:30.812: [  clsdmt][515]PID for the Process [4522624], connkey 1

2015-01-12 09:09:30.812: [  clsdmt][515]Creating PID [4522624] file for home /u01/app/oracle/grid/asm host jzh2 bin crs to /u01/app/oracle/grid/asm/crs/init/

2015-01-12 09:09:30.812: [  clsdmt][515]Writing PID [4522624] to the file [/u01/app/oracle/grid/asm/crs/init/jzh2.pid]

2015-01-12 09:09:31.863: [ CRSMAIN][1] CRS Daemon Starting --> CRSD is starting (the crsd daemon itself is fine, as verified below)

Note: the above shows that the CRS daemon itself was fine, but at 09:09:30 CSS (the cluster synchronization service) failed. CSS depends on the disk heartbeat and the network heartbeat, so one of the two must be in trouble. Before digging in, a couple of standard stack checks are sketched below; then we look at what node 1 (jzh1) was doing at 2015-01-12 09:09:30 in its ocssd.log.
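In a live triage, the stack state would normally be confirmed first (standard 11.2 commands; nothing here is specific to this incident):

crsctl check crs    # CRS, CSS and EVM health in one shot
crsctl check css    # CSS only
olsnodes -n -s      # node list with node numbers and active/inactive state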

CSSD log (ocssd.log) on node 1:
2015-01-12 09:09:00.934: [    CSSD][3862]clssnmSendingThread: sending status msg to all nodes


2015-01-12 09:09:00.934: [    CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes

2015-01-12 09:09:04.946: [    CSSD][3862]clssnmSendingThread: sending status msg to all nodes

2015-01-12 09:09:04.946: [    CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes

2015-01-12 09:09:08.980: [    CSSD][3862]clssnmSendingThread: sending status msg to all nodes

2015-01-12 09:09:08.980: [    CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes

2015-01-12 09:09:12.994: [    CSSD][3862]clssnmSendingThread: sending status msg to all nodes

2015-01-12 09:09:12.994: [    CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes

2015-01-12 09:09:16.758: [    CSSD][2577]clssnmSetupReadLease: status 1

2015-01-12 09:09:16.762: [    CSSD][2577]clssnmCompleteGMReq: Completed request type 17 with status 1

2015-01-12 09:09:16.762: [    CSSD][2577]clssgmDoneQEle: re-queueing req 110617b30 status 1

2015-01-12 09:09:16.763: [    CSSD][1029]clssgmCheckReqNMCompletion: Completing request type 17 for proc (111b38850), operation status 1, client status 0

2015-01-12 09:09:17.009: [    CSSD][3862]clssnmSendingThread: sending status msg to all nodes

2015-01-12 09:09:17.009: [    CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes

2015-01-12 09:09:21.028: [    CSSD][3862]clssnmSendingThread: sending status msg to all nodes

2015-01-12 09:09:21.028: [    CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes

2015-01-12 09:09:25.349: [    CSSD][3862]clssnmSendingThread: sending status msg to all nodes

2015-01-12 09:09:25.349: [    CSSD][3862]clssnmSendingThread: sent 3 status msgs to all nodes

2015-01-12 09:09:29.355: [    CSSD][3862]clssnmSendingThread: sending status msg to all nodes

2015-01-12 09:09:29.355: [    CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes

2015-01-12 09:09:33.362: [    CSSD][3862]clssnmSendingThread: sending status msg to all nodes

2015-01-12 09:09:33.362: [    CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes

2015-01-12 09:09:37.366: [    CSSD][3862]clssnmSendingThread: sending status msg to all nodes

2015-01-12 09:09:37.366: [    CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes

2015-01-12 09:09:41.377: [    CSSD][3862]clssnmSendingThread: sending status msg to all nodes

2015-01-12 09:09:41.377: [    CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes

2015-01-12 09:09:45.394: [    CSSD][3862]clssnmSendingThread: sending status msg to all nodes

2015-01-12 09:09:45.394: [    CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes

2015-01-12 09:09:49.371: [    CSSD][1029]clssgmQueueGrockEvent: groupName(DAALL_DB_jzh-cluster) count(2) master(0) event(6), incarn 6, mbrc 1, to member 0, events 0x8, state 0x0

2015-01-12 09:09:49.371: [    CSSD][1029]clssgmQueueGrockEvent: groupName(DAALL_DB_jzh-cluster) count(2) master(0) event(6), incarn 6, mbrc 1, to member 1, events 0x8, state 0x0

Note: node 1's CSSD is sending status messages to all nodes. Why? After the RAC stack starts (node 1's ASM log shows the 09:08:48 reconfiguration), each node writes its own state to the OCR/voting disk; the master then collects this information and broadcasts it to every node, announcing who the master is and how many nodes there are. Node state is recorded on the voting disk and used when voting. At this point the cluster clearly has two members, member 0 (jzh1) and member 1 (jzh2), which confirms the CRSD layer is fine (verified earlier). It also tells us something else: the other node can still write its state to the voting disk, so the disk heartbeat looks healthy (verified again below).
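The voting disks and CSS heartbeat thresholds can be listed directly (standard 11.2 commands; the 30 s / 200 s figures are the generic non-Exadata defaults, not values captured from this system):

crsctl query css votedisk    # voting disk locations (here presumably inside the VOTE diskgroup)
crsctl get css misscount     # network heartbeat threshold, 30 s by default
crsctl get css disktimeout   # long disk heartbeat threshold, 200 s by default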

Reading further:
2015-01-12 09:17:02.539: [    CSSD][4376]clssscSelect: cookie accept request 110991628

2015-01-12 09:17:02.539: [    CSSD][4376]clssnmeventhndlr: gipcAssociate endp 1d2198b in container 73 type of conn gipcha

2015-01-12 09:17:02.540: [    CSSD][4376]clssnmConnSetNames: hostname jzh2 privname 10.10.0.20 con 1d2198b --> connected to jzh2, whose physical private IP is 10.10.0.20 (the 169.254.189.151 seen in jzh1's ASM log is the HAIP link-local address layered on top of this physical private network)

2015-01-12 09:17:02.540: [    CSSD][4376]clssnmSetNodeProperties: properties node 2 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17 --> jzh2's node properties

2015-01-12 09:17:02.540: [    CSSD][4376]clssnmConnComplete: node node2 softver 11.2.0.4.0 --> node2's software version

2015-01-12 09:17:02.540: [    CSSD][4376]clssnmCompleteConnProtocol: Incoming connect from node 2 (node2) ninf endp 0, probendp 0, endp 1d2198b

2015-01-12 09:17:02.540: [    CSSD][4376]clssnmSendConnAck: connected to node 2, jzh2, con (1d2198b), state 0

2015-01-12 09:17:02.540: [    CSSD][4376]clssnmCompleteConnProtocol: node jzh2, 2, uniqueness 1421024974, msg uniqueness 1421024974, endp 1d2198b probendp 0 endp 1d2198b

2015-01-12 09:17:03.044: [    CSSD][4376]clssnmHandleJoin: node 2 JOINING, state 0->1 ninfendp 1d2198b

2015-01-12 09:17:03.354: [    CSSD][3862]clssnmSendingThread: sending status msg to all nodes

2015-01-12 09:17:03.355: [    CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes

2015-01-12 09:17:03.360: [    CSSD][2577]clssnmvReadDskHeartbeat: Reading DHBs to get the latest info for node(2/jzh2), LATSvalid(0), nodeInfoDHB uniqueness(1420692326) --> reading node2's disk heartbeats (DHBs)

2015-01-12 09:17:03.360: [    CSSD][2577]clssnmvDHBValidateNcopy: Setting LATS valid due to uniqueness change for node(jzh2) number(2), nodeInfoDHB(1420692326), readInfo(1421024974)

2015-01-12 09:17:03.360: [    CSSD][2577]clssnmvDHBValidateNcopy: Saving DHB uniqueness for node jzh2, number 2 latestInfo(1421024974), readInfo(1421024974), nodeInfoDHB(1420692326) --> saving jzh2's disk heartbeat info

2015-01-12 09:17:03.754: [    CSSD][4119]clssnmDoSyncUpdate: Initiating sync 315891915

2015-01-12 09:17:03.754: [    CSSD][4119]clssscCompareSwapEventValue: changed NMReconfigInProgress  val 1, from -1, changes 20

2015-01-12 09:17:03.754: [    CSSD][4119]clssnmDoSyncUpdate: local disk timeout set to 200000 ms, remote disk timeout set to 200000 --> disk heartbeat timeout set to 200000 ms (200 s, the 11g default)

2015-01-12 09:17:03.754: [    CSSD][4119]clssnmDoSyncUpdate: new values for local disk timeout and remote disk timeout will take effect when the sync is completed --> the new local and remote disk timeouts take effect once the sync completes

2015-01-12 09:17:03.754: [    CSSD][4119]clssnmDoSyncUpdate: Starting cluster reconfig with incarnation 315891915

2015-01-12 09:17:03.754: [    CSSD][4119]clssnmSetupAckWait: Ack message type (11)

2015-01-12 09:17:03.754: [    CSSD][4119]clssnmSetupAckWait: node(1) is ALIVE --> node 1 is alive

2015-01-12 09:17:03.754: [    CSSD][4119]clssnmSetupAckWait: node(2) is ALIVE --> node 2 is alive

Note: from node 1's (jzh1's) point of view, both jzh1 and jzh2 are alive. Note that this view still comes via the disk heartbeat, which once again confirms that the disk heartbeat is working.


Reading further:

2015-01-12 09:24:20.392: [    CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes

2015-01-12 09:24:23.578: [    CSSD][3605]clssnmPollingThread: node jzh2 (2) at 50% heartbeat fatal, removal in 14.647 seconds --> node1's CSSD is now flagging jzh2: the heartbeat has been missing for 50% of the fatal threshold, and jzh2 will be removed in 14.647 seconds (remember this number). The disk heartbeat was shown to be fine above, so why this warning?

2015-01-12 09:24:23.578: [    CSSD][3605]clssnmPollingThread: node jzh2 (2) is impending reconfig, flag 2294796, misstime 15353 --> misstime is 15.353 s (remember this number too); 15.353 s + 14.647 s = 30 s

2015-01-12 09:24:23.578: [    CSSD][3605]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1) --> the disk timeout has been lowered to 27 s; wasn't it supposed to be 200 s?

2015-01-12 09:24:23.578: [    CSSD][2577]clssnmvDHBValidateNcopy: node 2, jzh2, has a disk HB, but no network HB, DHB has rcfg 315891916, wrtcnt, 19771770, LATS 706581505, lastSeqNo 19771331, uniqueness 1421024974, timestamp 1421025905/706212683 --> node2's disk heartbeat is still being observed ("has a disk HB", "DHB has rcfg"), which once more confirms the disk heartbeat is fine; "but no network HB" means the network heartbeat has stopped. That also answers the 27 s question: once a reconfiguration is impending, CSS drops from the long disk timeout (200 s) to the short disk timeout, misscount minus reboottime = 30 - 3 = 27 s.

2015-01-12 09:24:23.618: [    CSSD][2063]clssnmvDiskPing: Writing with status 0x3, timestamp 1421025863/706581544

2015-01-12 09:24:24.082: [    CSSD][2577]clssnmvDHBValidateNcopy: node 2, jzh2, has a disk HB, but no network HB, DHB has rcfg 315891916, wrtcnt, 19771771, LATS 706582008, lastSeqNo 19771770, uniqueness 1421024974, timestamp 1421025905/706213190

2015-01-12 09:24:24.119: [    CSSD][2063]clssnmvDiskPing: Writing with status 0x3, timestamp 1421025864/706582045 --> CSSD keeps writing its own heartbeat to the voting disk

2015-01-12 09:24:24.398: [    CSSD][3862]clssnmSendingThread: sending status msg to all nodes --> what is node1 about to announce to everyone?

2015-01-12 09:24:39.805: [    CSSD][4119]clssnmNeedConfReq: No configuration to change

2015-01-12 09:24:39.805: [    CSSD][4119]clssnmDoSyncUpdate: Terminating node 2, jzh2, misstime(31566) state(5) --> node2 (jzh2) is to be terminated: the misstime of 31.566 s has passed the 15.353 + 14.647 = 30 s computed above, Oracle's default network heartbeat threshold (misscount)

2015-01-12 09:24:39.805: [    CSSD][4119]clssnmDoSyncUpdate: Wait for 0 vote ack(s) --> the voting disk will be updated; waiting for vote acks

2015-01-12 09:24:39.805: [    CSSD][4119]clssnmCheckDskInfo: Checking disk info...

2015-01-12 09:24:39.805: [    CSSD][4119]clssnmCheckSplit: Node 2, jzh2, is alive, DHB (1421025918, 706226248) more than disk timeout of 27000 after the last NHB (1421025890, 706197520) --> node2's disk heartbeat (DHB) is current while its last network heartbeat (NHB) is more than 27 s old: once again, disk heartbeat fine, network heartbeat dead

2015-01-12 09:24:39.805: [    CSSD][4119]clssnmCheckDskInfo: My cohort: 1

2015-01-12 09:24:39.805: [    CSSD][4119](:CSSNM00007:)clssnmrRemoveNode: Evicting node 2, jzh2, from the cluster in incarnation 315891916, node birth incarnation 315891915, death incarnation 315891916, stateflags 0x234000 uniqueness value 1421024974 --> node1 evicts node2 (jzh2) from the cluster

2015-01-12 09:24:39.806: [ default][4119]kgzf_gen_node_reid2: generated reid cid=41daa0e19d0a6f84ff29b9f37a2f1a38,icin=315891908,nmn=2,lnid=315891915,gid=0,gin=0,gmn=0,umemid=0,opid=0,opsn=0,lvl=node hdr=0xfece0100

2015-01-12 09:24:39.806: [    CSSD][4119]clssnmrFenceSage: Fenced node jzh2, number 2, with EXADATA, handle 0

2015-01-12 09:24:39.806: [    CSSD][4119]clssnmSendShutdown: req to node 2, kill time 706597731 --> node1 sends a shutdown/kill request to node2

2015-01-12 09:24:39.806: [    CSSD][4119]clssnmsendmsg: not connected to node 2 --> but it cannot connect to node2

2015-01-12 09:24:39.806: [    CSSD][4119]clssnmSendShutdown: Send to node 2 failed --> to keep the data consistent, node2 must be shut down, but the shutdown message cannot be delivered

2015-01-12 09:24:39.806: [    CSSD][4119]clssnmWaitOnEvictions: Start --> the eviction wait begins

2015-01-12 09:25:07.095: [    CSSD][4376]clssnmUpdateNodeState: node jzh1, number 1, current state 3, proposed state 3, current unique 1420396557, proposed unique 1420396557, prevConuni 0, birth 315891909

2015-01-12 09:25:07.095: [    CSSD][4376]clssnmUpdateNodeState: node jzh2, number 2, current state 5, proposed state 0, current unique 1421024974, proposed unique 1421024974, prevConuni 1421024974, birth 315891915

2015-01-12 09:25:07.095: [    CSSD][4376]clssnmDeactivateNode: node 2, state 5

2015-01-12 09:25:07.095: [    CSSD][4376]clssnmDeactivateNode: node 2 (jzh2) left cluster --> node2 (jzh2) has left the cluster

2015-01-12 10:11:27.825: [    CSSD][4119]clssnmWaitForAcks: Ack message type(11), ackCount(2)

2015-01-12 10:11:27.825: [    CSSD][4376]clssnmHandleSync: Node jzh1, number 1, is EXADATA fence capable

2015-01-12 10:11:27.825: [    CSSD][4376]clssscUpdateEventValue: NMReconfigInProgress  val 1, changes 33

2015-01-12 10:11:27.825: [    CSSD][4376]clssnmHandleSync: local disk timeout set to 200000 ms, remote disk timeout set to 200000 --> local and remote disk timeouts restored to 200 s

2015-01-12 10:11:27.825: [    CSSD][4376]clssnmHandleSync: initleader 1 newleader 1 --> node1 is the leader, i.e., the master node

Note: per the analysis above, the disk heartbeat is healthy; the problem is the network heartbeat. The timing values in the log line up exactly with the default CSS thresholds, as the sketch below shows.
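The arithmetic behind the logged values, using the 11.2 non-Exadata CSS defaults (generic defaults, not values captured from this system; confirm on a live cluster with the crsctl commands shown):

crsctl get css misscount     # network heartbeat threshold: 30 s by default
crsctl get css reboottime    # 3 s by default
# 50% warning        : 0.5 * 30 s             = 15 s       (log: misstime 15353 ms)
# removal countdown  : 30 s - 15.353 s        = 14.647 s   (log: "removal in 14.647 seconds")
# short disk timeout : misscount - reboottime = 30 - 3 = 27 s  (log: "diskTimeout set to 27000 ms")
# eviction           : misstime 31566 ms > 30 s            (log: "Terminating node 2 ... misstime(31566)")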
CSSD log (ocssd.log) on node 2:

2015-01-12 09:25:19.224: [    CSSD][1]clssgmSuspendAllGrocks: done

2015-01-12 09:25:19.224: [    CSSD][1]clssgmCompareSwapEventValue: changed CmInfo State  val 2, from 5, changes 13

2015-01-12 09:25:19.224: [    CSSD][1]clssgmUpdateEventValue: ConnectedNodes  val 315891915, changes 5

2015-01-12 09:25:19.224: [    CSSD][1]clssgmCleanupNodeContexts():  cleaning up nodes, rcfg(315891915)

2015-01-12 09:25:19.224: [    CSSD][1]clssgmCleanupNodeContexts():  successful cleanup of nodes rcfg(315891915)

2015-01-12 09:25:19.224: [    CSSD][1]clssgmStartNMMon:  completed node cleanup

2015-01-12 09:25:19.224: [    CSSD][4119]clssnmSendSync: syncSeqNo(315891916), indicating EXADATA fence initialization complete

2015-01-12 09:25:19.224: [    CSSD][4119]List of nodes that have ACKed my sync: 2

2015-01-12 09:25:19.224: [    CSSD][4119]clssnmWaitForAcks: done, syncseq(315891916), msg type(11)

2015-01-12 09:25:19.224: [    CSSD][4119]clssnmSetMinMaxVersion:node2  product/protocol (11.2/1.4)

2015-01-12 09:25:19.224: [    CSSD][4119]clssnmSetMinMaxVersion: properties common to all nodes: 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17

2015-01-12 09:25:19.224: [    CSSD][4119]clssnmSetMinMaxVersion: min product/protocol (11.2/1.4)

2015-01-12 09:25:19.224: [    CSSD][4119]clssnmSetMinMaxVersion: max product/protocol (11.2/1.4)

2015-01-12 09:25:19.224: [    CSSD][4119]clssnmNeedConfReq: No configuration to change

2015-01-12 09:25:19.224: [    CSSD][4119]clssnmDoSyncUpdate: Terminating node 1, jzh1, misstime(30000) state(5) --> node2 is syncing with node1; from node2's viewpoint node1's misstime has hit 30 s (the network heartbeat threshold), so node2 proposes terminating node1

2015-01-12 09:25:19.224: [    CSSD][4119]clssnmDoSyncUpdate: Wait for 0 vote ack(s) --> waiting for vote acks

2015-01-12 09:25:19.224: [    CSSD][4119]clssnmCheckDskInfo: Checking disk info...

2015-01-12 09:25:19.224: [    CSSD][4119]clssnmCheckSplit: Node 1, jzh1, is alive, DHB (1421025877, 706595081) more than disk timeout of 27000 after the last NHB (1421025847, 706565177) --> node1's (jzh1's) disk heartbeat is fine; only its network heartbeat is stale

2015-01-12 09:25:19.224: [    CSSD][4119]clssnmCheckDskInfo: My cohort: 2 --> the cohort node2 itself belongs to: {2}

2015-01-12 09:25:19.224: [    CSSD][4119]clssnmCheckDskInfo: Surviving cohort: 1 --> the surviving cohort: {1}, i.e., node1 (jzh1) survives

2015-01-12 09:25:19.224: [    CSSD][4119](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 2, jzh2, is smaller than cohort of 1 nodes led by node 1, jzh1, based on map type 2 --> node2 aborts itself to avoid a split-brain; the cohort led by node1 (jzh1) wins (with two equal single-node cohorts, the tie goes to the cohort containing the lower node number, which is why node2's cohort is deemed "smaller")

2015-01-12 09:25:19.224: [    CSSD][4119]###################################

2015-01-12 09:25:19.224: [    CSSD][4119]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread --> CSSD aborts from the clssnmRcfgMgrThread thread

Note: while syncing with node1, node2 sees the misstime exceed the threshold and, having lost the survival vote, terminates the local node, i.e., itself.

 
At this point the verdict is clear: the RAC split-brain was caused by a failure of the network heartbeat, i.e., loss of private-interconnect communication between the nodes. A few OS-level checks for the interconnect path are sketched below.
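To chase the root cause at the OS level, the private interconnect can be examined on AIX (a minimal sketch; en0 and 10.10.0.20 are the interface and peer address taken from the logs above, and the exact entstat field names vary by adapter driver):

oifcfg getif                      # interfaces registered with the clusterware
entstat -d en0 | grep -i error    # adapter/link error counters on the private NIC
ping -c 5 10.10.0.20              # reachability of node2's physical private address
netstat -i                        # per-interface packet and error counters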



Source: http://blog.itpub.net/10271187/viewspace-1407451/