This article simulates a private-interconnect failure that causes a cluster node to go down, and walks through the resulting log analysis.
# ifconfig eth1 down
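Before taking the interface down, it is worth confirming which NIC actually carries the cluster interconnect. A quick check, assuming the Grid Infrastructure binaries are in the PATH:

oifcfg getif

oifcfg getif lists each configured interface with its subnet and role (public or cluster_interconnect); in this environment eth1 on the 10.10.10.0 network is the interconnect, which is why downing eth1 breaks the network heartbeat while the public network and the voting disks stay reachable.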
Node 1
2019-01-16 13:44:16.839
[cssd(28522)]CRS-1612:Network communication with node rac2 (2) missing for 50% of timeout interval. Removal of this node from cluster in 14.400 seconds
2019-01-16 13:44:23.855
[cssd(28522)]CRS-1611:Network communication with node rac2 (2) missing for 75% of timeout interval. Removal of this node from cluster in 7.380 seconds
2019-01-16 13:44:28.865
[cssd(28522)]CRS-1610:Network communication with node rac2 (2) missing for 90% of timeout interval. Removal of this node from cluster in 2.370 seconds
The messages above show that network heartbeats from rac2 have been missing for 50%, 75% and then 90% of the timeout interval, and that rac2 will be removed from the cluster in roughly 14, 7 and 2 seconds respectively.
2019-01-16 13:44:31.243
[cssd(28522)]CRS-1607:Node rac2 is being evicted in cluster incarnation 322157220; details at (:CSSNM00007:) in /u01/11.2.0/grid/log/rac1/cssd/ocssd.log.
2019-01-16 13:44:54.468
[ohasd(28301)]CRS-8011:reboot advisory message from host: rac2, component: mo093358, with time stamp: L-2019-01-16-13:44:53.124
[ohasd(28301)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS
2019-01-16 13:45:02.236
[cssd(28522)]CRS-1601:CSSD Reconfiguration complete. Active nodes are rac1 .
2019-01-16 13:45:02.393
[ctssd(28691)]CRS-2407:The new Cluster Time Synchronization Service reference node is host rac1.
2019-01-16 13:45:27.304
[crsd(28825)]CRS-5504:Node down event reported for node 'rac2'.
2019-01-16 13:45:32.086
[crsd(28825)]CRS-2773:Server 'rac2' has been removed from pool 'Generic'.
2019-01-16 13:45:32.086
[crsd(28825)]CRS-2773:Server 'rac2' has been removed from pool 'ora.orcl'.
2019-01-16 13:46:37.328
[ctssd(28691)]CRS-2406:The Cluster Time Synchronization Service timed out on host rac1. Details in /u01/11.2.0/grid/log/rac1/ctssd/octssd.log.
Node rac2 is evicted from the cluster and asked to reboot; the cluster reconfiguration completes and rac2 is removed from the server pools.
Node 2
2019-01-16 13:44:17.512
[cssd(24201)]CRS-1612:Network communication with node rac1 (1) missing for 50% of timeout interval. Removal of this node from cluster in 14.400 seconds
2019-01-16 13:44:24.529
[cssd(24201)]CRS-1611:Network communication with node rac1 (1) missing for 75% of timeout interval. Removal of this node from cluster in 7.380 seconds
2019-01-16 13:44:29.539
[cssd(24201)]CRS-1610:Network communication with node rac1 (1) missing for 90% of timeout interval. Removal of this node from cluster in 2.370 seconds
Symmetrically, node 2 reports that network heartbeats from rac1 have been missing for 50%, 75% and then 90% of the timeout interval, and that rac1 will be removed from the cluster in roughly 14, 7 and 2 seconds.
2019-01-16 13:44:31.915
[cssd(24201)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /u01/11.2.0/grid/log/rac2/cssd/ocssd.log.
2019-01-16 13:44:32.025
[cssd(24201)]CRS-1608:This node was evicted by node 1, rac1; details at (:CSSNM00005:) in /u01/11.2.0/grid/log/rac2/cssd/ocssd.log.
2019-01-16 13:47:33.459
[ohasd(2892)]CRS-2112:The OLR service started on node rac2.
2019-01-16 13:47:34.870
[ohasd(2892)]CRS-8011:reboot advisory message from host: rac2, component: ag125511, with time stamp: L-2019-01-15-16:20:15.283
[ohasd(2892)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS
2019-01-16 13:47:35.804
[ohasd(2892)]CRS-8011:reboot advisory message from host: rac2, component: mo093358, with time stamp: L-2019-01-16-13:44:53.124
[ohasd(2892)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS
The messages above show that node 2 could not communicate with the other node in the cluster and shut down to preserve cluster integrity; node 2 was evicted by node 1.
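The CRS-1607/CRS-1608 messages point to ocssd.log for details. To follow the eviction more closely, the CSSD trace and the clusterware alert log of each node can be examined; the first path below is the one printed in the messages, while the alert log path assumes the usual 11.2 naming of $GRID_HOME/log/<hostname>/alert<hostname>.log:

tail -200 /u01/11.2.0/grid/log/rac2/cssd/ocssd.log
tail -200 /u01/11.2.0/grid/log/rac2/alertrac2.log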
/var/log/messages
Jan 16 13:45:02 rac1 kernel: [Oracle OKS] Cluster Membership change - Current incarn 0x19
Jan 16 13:45:02 rac1 kernel: [Oracle OKS] Nodes in cluster:
Jan 16 13:45:02 rac1 kernel: [Oracle OKS] Node 1 (IP 0xa0a0a0a)
Jan 16 13:45:02 rac1 kernel: [Oracle OKS] Node count 1
Jan 16 13:45:02 rac1 kernel: [Oracle ADVM] Cluster reconfiguration started.
Jan 16 13:45:02 rac1 kernel: [Oracle ADVM] Cluster reconfiguration completed.
Jan 16 13:45:02 rac1 kernel: [Oracle ADVM] Cluster reconfiguration completed.
Jan 16 13:45:02 rac1 kernel: [Oracle OKS] Cluster Membership change setup complete
Jan 16 13:45:30 rac1 avahi-daemon[2647]: Registering new address record for 172.18.4.187 on eth0.
Jan 16 13:45:30 rac1 avahi-daemon[2647]: Withdrawing address record for 172.18.4.187 on eth0.
Jan 16 13:45:30 rac1 avahi-daemon[2647]: Registering new address record for 172.18.4.187 on eth0.
Jan 16 13:45:30 rac1 avahi-daemon[2647]: Registering new address record for 172.18.4.172 on eth0.
Jan 16 13:45:30 rac1 avahi-daemon[2647]: Withdrawing address record for 172.18.4.172 on eth0.
Jan 16 13:45:30 rac1 avahi-daemon[2647]: Registering new address record for 172.18.4.172 on eth0.
Jan 16 13:45:31 rac1 avahi-daemon[2647]: Withdrawing address record for fe80::a00:27ff:feb1:3373 on eth1.
Jan 16 13:45:31 rac1 avahi-daemon[2647]: Withdrawing address record for 10.10.10.10 on eth1.
Jan 16 13:45:31 rac1 avahi-daemon[2647]: Withdrawing address record for fe80::a00:27ff:fe2e:e4e on eth0.
Jan 16 13:45:31 rac1 avahi-daemon[2647]: Withdrawing address record for 172.18.4.172 on eth0.
Jan 16 13:45:31 rac1 avahi-daemon[2647]: Withdrawing address record for 172.18.4.186 on eth0.
Jan 16 13:45:31 rac1 avahi-daemon[2647]: Withdrawing address record for 172.18.4.182 on eth0.
Jan 16 13:45:31 rac1 avahi-daemon[2647]: Host name conflict, retrying with <rac1-3>
There is no information in /var/log/messages for the failure window itself.
From the messages above, the cluster was reconfigured and node 2's VIP and SCAN IP were re-registered on node 1's eth0 interface.
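The relocation can be confirmed both at the OS level and from the clusterware resources; these are generic checks rather than output captured from this test, and the VIP resource name (typically ora.rac2.vip) may differ:

ifconfig eth0
crsctl stat res -t

On rac1, ifconfig eth0 should now list the additional VIP/SCAN addresses, and crsctl stat res -t should show node 2's VIP and the SCAN VIPs running on rac1.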
Node 1
2019-01-16 13:44:16.839: [ CSSD][1218210112]clssnmPollingThread: node rac2 (2) at 50% heartbeat fatal, removal in 14.400 seconds
2019-01-16 13:44:16.839: [ CSSD][1218210112]clssnmPollingThread: node rac2 (2) is impending reconfig, flag 394254, misstime 15600
The messages above show that node 1 has detected node 2 missing network heartbeats for a sustained period. If this continues, a cluster reconfiguration will start 14.400 seconds later. In 11gR2 the default CSS misscount is 30 seconds, which can be checked with:
crsctl get css misscount
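The related CSS timeouts can be queried the same way; the usual 11gR2 defaults are 200 seconds for disktimeout and 3 seconds for reboottime, but the values on a given cluster should be confirmed with:

crsctl get css disktimeout
crsctl get css reboottime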
Node 2
2019-01-16 13:44:17.512: [ CSSD][1222531392]clssnmPollingThread: node rac1 (1) at 50% heartbeat fatal, removal in 14.400 seconds
2019-01-16 13:44:17.512: [ CSSD][1222531392]clssnmPollingThread: node rac1 (1) is impending reconfig, flag 132108, misstime 15600
Node 2 reports the same kind of messages about node 1. Since this is a two-node cluster, that is expected: both nodes lose each other's network heartbeats at the same time. Once the missed-heartbeat time reaches misscount, the cluster has to reconfigure.
Node 1
2019-01-16 13:44:16.839: [ CSSD][1218210112]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
2019-01-16 13:44:16.926: [ CSSD][1249679680]clssnmvSchedDiskThreads: DiskPingThread for voting file /dev/raw/raw1 sched delay 950 > margin 750 cur_ms 59003974 lastalive 59003024
2019-01-16 13:44:18.914: [ CSSD][1228699968]clssnmSendingThread: sending status msg to all nodes
2019-01-16 13:44:18.914: [ CSSD][1228699968]clssnmSendingThread: sent 5 status msgs to all nodes
2019-01-16 13:44:20.946: [ CSSD][1099692352]clssscSelect: cookie accept request 0xa430f48
2019-01-16 13:44:20.946: [ CSSD][1099692352]clssgmAllocProc: (0xad8a000) allocated
2019-01-16 13:44:20.946: [ CSSD][1099692352]clssgmClientConnectMsg: properties of cmProc 0xad8a000 - 1,2,3,4
2019-01-16 13:44:20.946: [ CSSD][1099692352]clssgmClientConnectMsg: Connect from con(0x5a2bd) proc(0xad8a000) pid(9116) version 11:2:1:4, properties: 1,2,3,4
2019-01-16 13:44:20.946: [ CSSD][1099692352]clssgmClientConnectMsg: msg flags 0x0000
2019-01-16 13:44:20.947: [ CSSD][1099692352]clssgmExecuteClientRequest: Node name request from client ((nil))
2019-01-16 13:44:20.950: [ CSSD][1099692352]clssgmDeadProc: proc 0xad8a000
2019-01-16 13:44:20.950: [ CSSD][1099692352]clssgmDestroyProc: cleaning up proc(0xad8a000) con(0x5a2bd) skgpid ospid 9116 with 0 clients, refcount 0
2019-01-16 13:44:20.950: [ CSSD][1099692352]clssgmDiscEndpcl: gipcDestroy 0x5a2bd
2019-01-16 13:44:23.855: [ CSSD][1218210112]clssnmPollingThread: node rac2 (2) at 75% heartbeat fatal, removal in 7.380 seconds
2019-01-16 13:44:23.917: [ CSSD][1228699968]clssnmSendingThread: sending status msg to all nodes
2019-01-16 13:44:23.917: [ CSSD][1228699968]clssnmSendingThread: sent 5 status msgs to all nodes
2019-01-16 13:44:28.865: [ CSSD][1218210112]clssnmPollingThread: node rac2 (2) at 90% heartbeat fatal, removal in 2.370 seconds, seedhbimpd 1
2019-01-16 13:44:28.921: [ CSSD][1228699968]clssnmSendingThread: sending status msg to all nodes
2019-01-16 13:44:28.921: [ CSSD][1228699968]clssnmSendingThread: sent 5 status msgs to all nodes
2019-01-16 13:44:30.731: [ CSSD][1099692352]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 850 > margin 750 cur_ms 59017784 lastalive 59016934
2019-01-16 13:44:31.242: [ CSSD][1218210112]clssnmPollingThread: Removal started for node rac2 (2), flags 0x6040e, state 3, wt4c 0
2019-01-16 13:44:31.242: [ CSSD][1218210112]clssnmDiscHelper: rac2, node(2) connection failed, endp (0x331), probe(0x100000000), ninf->endp 0x331
2019-01-16 13:44:31.242: [ CSSD][1218210112]clssnmDiscHelper: node 2 clean up, endp (0x331), init state 5, cur state 5
Node 1 concludes that node 2 has continuously missed network heartbeats and decides to remove node 2 from the cluster.
Node 2
2019-01-16 13:44:17.512: [ CSSD][1222531392]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
2019-01-16 13:44:18.247: [ CSSD][1254000960]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 840 > margin 750 cur_ms 20599214 lastalive 20598374
2019-01-16 13:44:19.232: [ CSSD][1254000960]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 820 > margin 750 cur_ms 20600194 lastalive 20599374
2019-01-16 13:44:20.232: [ CSSD][1138612544]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 820 > margin 750 cur_ms 20601194 lastalive 20600374
2019-01-16 13:44:21.239: [ CSSD][1254000960]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 830 > margin 750 cur_ms 20602204 lastalive 20601374
2019-01-16 13:44:22.236: [ CSSD][1233021248]clssnmSendingThread: sending status msg to all nodes
2019-01-16 13:44:22.236: [ CSSD][1233021248]clssnmSendingThread: sent 5 status msgs to all nodes
2019-01-16 13:44:22.243: [ CSSD][1254000960]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 830 > margin 750 cur_ms 20603204 lastalive 20602374
2019-01-16 13:44:23.238: [ CSSD][1254000960]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 830 > margin 750 cur_ms 20604204 lastalive 20603374
2019-01-16 13:44:24.238: [ CSSD][1138612544]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 830 > margin 750 cur_ms 20605204 lastalive 20604374
2019-01-16 13:44:24.529: [ CSSD][1222531392]clssnmPollingThread: node rac1 (1) at 75% heartbeat fatal, removal in 7.380 seconds
2019-01-16 13:44:25.237: [ CSSD][1254000960]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 830 > margin 750 cur_ms 20606204 lastalive 20605374
2019-01-16 13:44:26.246: [ CSSD][1254000960]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 840 > margin 750 cur_ms 20607214 lastalive 20606374
2019-01-16 13:44:27.230: [ CSSD][1233021248]clssnmSendingThread: sending status msg to all nodes
2019-01-16 13:44:27.230: [ CSSD][1233021248]clssnmSendingThread: sent 5 status msgs to all nodes
2019-01-16 13:44:27.232: [ CSSD][1254000960]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 820 > margin 750 cur_ms 20608194 lastalive 20607374
2019-01-16 13:44:28.231: [ CSSD][1254000960]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 820 > margin 750 cur_ms 20609194 lastalive 20608374
2019-01-16 13:44:29.236: [ CSSD][1254000960]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 820 > margin 750 cur_ms 20610194 lastalive 20609374
2019-01-16 13:44:29.539: [ CSSD][1222531392]clssnmPollingThread: node rac1 (1) at 90% heartbeat fatal, removal in 2.370 seconds, seedhbimpd 1
2019-01-16 13:44:30.241: [ CSSD][1254000960]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 820 > margin 750 cur_ms 20611204 lastalive 20610384
2019-01-16 13:44:31.229: [ CSSD][1138612544]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 810 > margin 750 cur_ms 20612194 lastalive 20611384
2019-01-16 13:44:31.914: [ CSSD][1222531392]clssnmPollingThread: Removal started for node rac1 (1), flags 0x2040c, state 3, wt4c 0
2019-01-16 13:44:31.914: [ CSSD][1222531392]clssnmDiscHelper: rac1, node(1) connection failed, endp (0x256), probe(0x100000000), ninf->endp 0x256
2019-01-16 13:44:31.914: [ CSSD][1222531392]clssnmDiscHelper: node 1 clean up, endp (0x256), init state 5, cur state 5
Node 2 likewise concludes that node 1 has continuously missed network heartbeats and decides to remove node 1 from the cluster.
Node 1
2019-01-16 13:44:31.242: [GIPCXCPT][1218210112]gipcInternalDissociate: obj 0xab2fe10 [0000000000000331] { gipcEndpoint : localAddr 'gipc://rac1:3f93-7c9d-5dcd-3d8d#10.10.10.10#43951', remoteAddr 'gipc://rac2:nm_raccluster#10.10.10.20#56183', numPend 5, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x5e89, pidPeer 0, flags 0x2616, usrFlags 0x0 } not associated with any container, ret gipcretFail (1)
2019-01-16 13:44:31.242: [GIPCXCPT][1218210112]gipcDissociateF [clssnmDiscHelper : clssnm.c : 3215]: EXCEPTION[ ret gipcretFail (1) ] failed to dissociate obj 0xab2fe10 [0000000000000331] { gipcEndpoint : localAddr 'gipc://rac1:3f93-7c9d-5dcd-3d8d#10.10.10.10#43951', remoteAddr 'gipc://rac2:nm_raccluster#10.10.10.20#56183', numPend 5, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x5e89, pidPeer 0, flags 0x2616, usrFlags 0x0 }, flags 0x0
2019-01-16 13:44:31.242: [ CSSD][1239189824]clssnmDoSyncUpdate: Initiating sync 322157220
2019-01-16 13:44:31.242: [ CSSD][1239189824]clssscUpdateEventValue: NMReconfigInProgress val 1, changes 3
2019-01-16 13:44:31.242: [ CSSD][1239189824]clssnmDoSyncUpdate: local disk timeout set to 27000 ms, remote disk timeout set to 27000
2019-01-16 13:44:31.242: [ CSSD][1239189824]clssnmDoSyncUpdate: new values for local disk timeout and remote disk timeout will take effect when the sync is completed.
2019-01-16 13:44:31.242: [ CSSD][1239189824]clssnmDoSyncUpdate: Starting cluster reconfig with incarnation 322157220
2019-01-16 13:44:31.242: [ CSSD][1239189824]clssnmSetupAckWait: Ack message type (11)
2019-01-16 13:44:31.242: [ CSSD][1239189824]clssnmSetupAckWait: node(1) is ALIVE
2019-01-16 13:44:31.242: [ CSSD][1239189824]clssnmSendSync: syncSeqNo(322157220), indicating EXADATA fence initialization complete
2019-01-16 13:44:31.242: [ CSSD][1239189824]List of nodes that have ACKed my sync: NULL
Node 2
2019-01-16 13:44:31.914: [GIPCXCPT][1222531392]gipcInternalDissociate: obj 0x4ab8b60 [0000000000000256] { gipcEndpoint : localAddr 'gipc://rac2:34a4-9438-68a9-4123#10.10.10.20#56183', remoteAddr 'gipc://10.10.10.10:3f4c-bc6d-9424-f406#10.10.10.10#43951', numPend 5, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x6f6a, pidPeer 0, flags 0x102616, usrFlags 0x0 } not associated with any container, ret gipcretFail (1)
2019-01-16 13:44:31.914: [GIPCXCPT][1222531392]gipcDissociateF [clssnmDiscHelper : clssnm.c : 3215]: EXCEPTION[ ret gipcretFail (1) ] failed to dissociate obj 0x4ab8b60 [0000000000000256] { gipcEndpoint : localAddr 'gipc://rac2:34a4-9438-68a9-4123#10.10.10.20#56183', remoteAddr 'gipc://10.10.10.10:3f4c-bc6d-9424-f406#10.10.10.10#43951', numPend 5, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x6f6a, pidPeer 0, flags 0x102616, usrFlags 0x0 }, flags 0x0
2019-01-16 13:44:31.914: [ CSSD][1243511104]clssnmDoSyncUpdate: Initiating sync 322157220
2019-01-16 13:44:31.914: [ CSSD][1243511104]clssscUpdateEventValue: NMReconfigInProgress val 1, changes 9
2019-01-16 13:44:31.914: [ CSSD][1243511104]clssnmDoSyncUpdate: local disk timeout set to 27000 ms, remote disk timeout set to 27000
2019-01-16 13:44:31.914: [ CSSD][1243511104]clssnmDoSyncUpdate: new values for local disk timeout and remote disk timeout will take effect when the sync is completed.
2019-01-16 13:44:31.914: [ CSSD][1243511104]clssnmDoSyncUpdate: Starting cluster reconfig with incarnation 322157220
2019-01-16 13:44:31.914: [ CSSD][1243511104]clssnmSetupAckWait: Ack message type (11)
2019-01-16 13:44:31.914: [ CSSD][1243511104]clssnmSetupAckWait: node(2) is ALIVE
2019-01-16 13:44:31.914: [ CSSD][1243511104]clssnmSendSync: syncSeqNo(322157220), indicating EXADATA fence initialization complete
2019-01-16 13:44:31.914: [ CSSD][1243511104]List of nodes that have ACKed my sync: NULL
Each node sends its reconfiguration (sync) message to the nodes it still considers healthy, but in a two-node cluster there is no third node left to receive it. The next step is to use the disk heartbeats on the voting files to determine the state of each node.
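The disk heartbeat is written to the voting files (the DiskPingThread messages earlier reference /dev/raw/raw1); the configured voting files can be listed with the following generic check:

crsctl query css votedisk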
Node 1
2019-01-16 13:44:31.243: [ CSSD][1239189824]clssnmCheckDskInfo: Checking disk info...
2019-01-16 13:44:31.243: [ CSSD][1239189824]clssnmCheckSplit: Node 2, rac2, is alive, DHB (1547617471, 20612054) more than disk timeout of 27000 after the last NHB (1547617441, 20582204)
Node 2
2019-01-16 13:44:31.915: [ CSSD][1243511104]clssnmCheckDskInfo: Checking disk info...
2019-01-16 13:44:31.915: [ CSSD][1243511104]clssnmCheckSplit: Node 1, rac1, is alive, DHB (1547617471, 59018214) more than disk timeout of 27000 after the last NHB (1547617441, 58988964)
It looks as if a split-brain is about to occur, because each node can see from the disk heartbeats that the other node is still alive. The cluster now applies its split-brain resolution rules to decide which node has to leave.
Split-brain resolution rules:
For an Oracle cluster, split-brain means that the network heartbeat between some nodes is lost while their disk heartbeats remain normal. When this happens the cluster splits into several sub-clusters (cohorts) and has to reconfigure. The basic rule is: the cohort with more nodes survives; if the cohorts contain the same number of nodes, the cohort containing the lowest-numbered node survives. In this case each node forms a one-node cohort, so the cohort led by node 1 (the lower node number) survives.
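The node numbers this comparison is based on can be listed with the command below; in this cluster rac1 is node 1 and rac2 is node 2, as the CRS messages already show:

olsnodes -n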
Node 1
2019-01-16 13:44:31.243: [ CSSD][1239189824]clssnmCheckDskInfo: My cohort: 1
2019-01-16 13:44:31.243: [ CSSD][1239189824]clssnmEvict: Start
2019-01-16 13:44:31.243: [ CSSD][1239189824](:CSSNM00007:)clssnmrEvict: Evicting node 2, rac2, from the cluster in incarnation 322157220, node birth incarnation 322157217, death incarnation 322157220, stateflags 0x64000
2019-01-16 13:44:31.243: [ CSSD][1239189824]clssnmrFenceSage: Fenced node rac2, number 2, with EXADATA, handle 1239186920
2019-01-16 13:44:31.243: [ CSSD][1239189824]clssnmSendShutdown: req to node 2, kill time 59018294
Node 2
2019-01-16 13:44:31.915: [ CSSD][1243511104]clssnmCheckDskInfo: My cohort: 2
2019-01-16 13:44:31.915: [ CSSD][1243511104]clssnmCheckDskInfo: Surviving cohort: 1
2019-01-16 13:44:31.915: [ CSSD][1243511104](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 2, rac2, is smaller than cohort of 1 nodes led by node 1, rac1, based on map type 2
To avoid split-brain, node 2 left the cluster and the node was rebooted. Note that starting with 11.2.0.2, the rebootless restart feature means the node itself is normally not rebooted; instead the Grid Infrastructure stack on that node is restarted.
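Once node 2 is back up, cluster membership and stack health can be verified with standard commands (generic checks, not output captured from this test):

olsnodes -n -s
crsctl check cluster -all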
Reference: 《RAC核心技术详解》