Simulating a Private Network Failure That Brings Down a Cluster Node

    • Purpose
    • Analysis
      • GI alert log
      • OS log
      • ocssd.log
    • References

Purpose

This article simulates a private network failure that brings down a cluster node and then walks through the resulting logs. The failure is injected by taking the private interconnect interface (eth1) down on one of the nodes:

# ifconfig eth1 down
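
Before injecting the fault it is worth confirming which interface is registered as the cluster interconnect; a minimal sketch using oifcfg (the output in the comments is illustrative for this setup, where eth0 is public and eth1 carries the 10.10.10.0 interconnect), with ip link as a modern equivalent of ifconfig ... down:

    # List the interfaces registered with Oracle Clusterware
    /u01/11.2.0/grid/bin/oifcfg getif
    #   eth0  172.18.4.0  global  public
    #   eth1  10.10.10.0  global  cluster_interconnect

    # Equivalent fault injection on distributions without ifconfig
    # ip link set eth1 down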

Analysis

GI alert log

$GRID_HOME/log/<node name>/alert<node name>.log
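
Assuming the default alert<node name>.log naming and the Grid home shown in the logs, the alert log can be tailed live while the fault is injected, e.g. on node 1:

    tail -f /u01/11.2.0/grid/log/rac1/alertrac1.log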

  • Node 1

    2019-01-16 13:44:16.839
    [cssd(28522)]CRS-1612:Network communication with node rac2 (2) missing for 50% of timeout interval.  Removal of this node from cluster in 14.400 seconds
    2019-01-16 13:44:23.855
    [cssd(28522)]CRS-1611:Network communication with node rac2 (2) missing for 75% of timeout interval.  Removal of this node from cluster in 7.380 seconds
    2019-01-16 13:44:28.865
    [cssd(28522)]CRS-1610:Network communication with node rac2 (2) missing for 90% of timeout interval.  Removal of this node from cluster in 2.370 seconds
    

    Node 1 reports that network communication with rac2 has been missing for 50%, 75%, and then 90% of the timeout interval, counting down to rac2's removal from the cluster (14.4 s, 7.4 s, and 2.4 s remaining).

    2019-01-16 13:44:31.243
    [cssd(28522)]CRS-1607:Node rac2 is being evicted in cluster incarnation 322157220; details at (:CSSNM00007:) in /u01/11.2.0/grid/log/rac1/cssd/ocssd.log.
    2019-01-16 13:44:54.468
    [ohasd(28301)]CRS-8011:reboot advisory message from host: rac2, component: mo093358, with time stamp: L-2019-01-16-13:44:53.124
    [ohasd(28301)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS
    2019-01-16 13:45:02.236
    [cssd(28522)]CRS-1601:CSSD Reconfiguration complete. Active nodes are rac1 .
    2019-01-16 13:45:02.393
    [ctssd(28691)]CRS-2407:The new Cluster Time Synchronization Service reference node is host rac1.
    2019-01-16 13:45:27.304
    [crsd(28825)]CRS-5504:Node down event reported for node 'rac2'.
    2019-01-16 13:45:32.086
    [crsd(28825)]CRS-2773:Server 'rac2' has been removed from pool 'Generic'.
    2019-01-16 13:45:32.086
    [crsd(28825)]CRS-2773:Server 'rac2' has been removed from pool 'ora.orcl'.
    2019-01-16 13:46:37.328
    [ctssd(28691)]CRS-2406:The Cluster Time Synchronization Service timed out on host rac1. Details in /u01/11.2.0/grid/log/rac1/ctssd/octssd.log.		
    

    rac2 is evicted in cluster incarnation 322157220 and told to reboot. Once the reconfiguration completes, rac1 is the only active node and rac2 is removed from the 'Generic' and 'ora.orcl' server pools; the new membership can be verified as shown below.
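
    A quick check of the surviving membership and resource placement from node 1, using standard clusterware commands (output will vary by environment):

    # rac2 should now be reported as Inactive
    /u01/11.2.0/grid/bin/olsnodes -s -n

    # Cluster resource placement after the reconfiguration
    /u01/11.2.0/grid/bin/crsctl stat res -t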

  • Node 2

    2019-01-16 13:44:17.512
    [cssd(24201)]CRS-1612:Network communication with node rac1 (1) missing for 50% of timeout interval.  Removal of this node from cluster in 14.400 seconds
    2019-01-16 13:44:24.529
    [cssd(24201)]CRS-1611:Network communication with node rac1 (1) missing for 75% of timeout interval.  Removal of this node from cluster in 7.380 seconds
    2019-01-16 13:44:29.539
    [cssd(24201)]CRS-1610:Network communication with node rac1 (1) missing for 90% of timeout interval.  Removal of this node from cluster in 2.370 seconds
    

    Node 2 logs the mirror-image messages: network communication with rac1 has been missing for 50%, 75%, and 90% of the timeout interval, with the same removal countdown.

    2019-01-16 13:44:31.915
    [cssd(24201)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /u01/11.2.0/grid/log/rac2/cssd/ocssd.log.
    2019-01-16 13:44:32.025
    [cssd(24201)]CRS-1608:This node was evicted by node 1, rac1; details at (:CSSNM00005:) in /u01/11.2.0/grid/log/rac2/cssd/ocssd.log.
    2019-01-16 13:47:33.459
    [ohasd(2892)]CRS-2112:The OLR service started on node rac2.
    2019-01-16 13:47:34.870
    [ohasd(2892)]CRS-8011:reboot advisory message from host: rac2, component: ag125511, with time stamp: L-2019-01-15-16:20:15.283
    [ohasd(2892)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS
    2019-01-16 13:47:35.804
    [ohasd(2892)]CRS-8011:reboot advisory message from host: rac2, component: mo093358, with time stamp: L-2019-01-16-13:44:53.124
    [ohasd(2892)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS
    

    These messages show that node 2 could not communicate with the other node in the cluster and shut itself down to preserve cluster integrity; it was evicted by node 1 (rac1). After the node came back up at 13:47, OHASD started the OLR service and reported the pending reboot advisory messages.

OS log

/var/log/messages
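
The kernel-side entries relevant to the reconfiguration can be pulled out of the system log with a simple filter, for example:

    grep -E 'Oracle (OKS|ADVM)|avahi-daemon' /var/log/messages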

  • Node 1
    Jan 16 13:45:02 rac1 kernel: [Oracle OKS] Cluster Membership change - Current  incarn 0x19
    Jan 16 13:45:02 rac1 kernel: [Oracle OKS] Nodes in cluster:
    Jan 16 13:45:02 rac1 kernel: [Oracle OKS]   Node 1 (IP 0xa0a0a0a) 
    Jan 16 13:45:02 rac1 kernel: [Oracle OKS] Node count 1
    Jan 16 13:45:02 rac1 kernel: [Oracle ADVM] Cluster reconfiguration started.
    Jan 16 13:45:02 rac1 kernel: [Oracle ADVM] Cluster reconfiguration completed.
    Jan 16 13:45:02 rac1 kernel: [Oracle ADVM] Cluster reconfiguration completed.
    Jan 16 13:45:02 rac1 kernel: [Oracle OKS] Cluster Membership change setup complete
    Jan 16 13:45:30 rac1 avahi-daemon[2647]: Registering new address record for 172.18.4.187 on eth0.
    Jan 16 13:45:30 rac1 avahi-daemon[2647]: Withdrawing address record for 172.18.4.187 on eth0.
    Jan 16 13:45:30 rac1 avahi-daemon[2647]: Registering new address record for 172.18.4.187 on eth0.
    Jan 16 13:45:30 rac1 avahi-daemon[2647]: Registering new address record for 172.18.4.172 on eth0.
    Jan 16 13:45:30 rac1 avahi-daemon[2647]: Withdrawing address record for 172.18.4.172 on eth0.
    Jan 16 13:45:30 rac1 avahi-daemon[2647]: Registering new address record for 172.18.4.172 on eth0.
    Jan 16 13:45:31 rac1 avahi-daemon[2647]: Withdrawing address record for fe80::a00:27ff:feb1:3373 on eth1.
    Jan 16 13:45:31 rac1 avahi-daemon[2647]: Withdrawing address record for 10.10.10.10 on eth1.
    Jan 16 13:45:31 rac1 avahi-daemon[2647]: Withdrawing address record for fe80::a00:27ff:fe2e:e4e on eth0.
    Jan 16 13:45:31 rac1 avahi-daemon[2647]: Withdrawing address record for 172.18.4.172 on eth0.
    Jan 16 13:45:31 rac1 avahi-daemon[2647]: Withdrawing address record for 172.18.4.186 on eth0.
    Jan 16 13:45:31 rac1 avahi-daemon[2647]: Withdrawing address record for 172.18.4.182 on eth0.
    Jan 16 13:45:31 rac1 avahi-daemon[2647]: Host name conflict, retrying with <rac1-3>
    
  • Node 2
    No entries were recorded for the failure window.
    

    From the entries above, the cluster reconfigured around node 1, and node 2's VIP and SCAN IP were re-registered (failed over) onto node 1's eth0 interface; this can be verified as shown below.
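
    The failover can be confirmed from node 1 with standard srvctl and ip commands (exact output depends on the environment):

    # Node 2's VIP should now be running on rac1
    /u01/11.2.0/grid/bin/srvctl status vip -n rac2

    # SCAN VIP placement
    /u01/11.2.0/grid/bin/srvctl status scan

    # Failed-over addresses show up as secondary addresses on eth0
    ip addr show eth0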

ocssd.log

$GRID_HOME/log/<node name>/cssd/ocssd.log
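
ocssd.log is verbose, so it helps to filter for the node-monitor routines that tell the eviction story (all of the function names below appear in the excerpts that follow); for example on node 1:

    grep -E 'clssnmPollingThread|clssnmCheckDskInfo|clssnmCheckSplit|clssnmEvict|clssnmDoSyncUpdate' \
        /u01/11.2.0/grid/log/rac1/cssd/ocssd.log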

  • Node 1

    2019-01-16 13:44:16.839: [    CSSD][1218210112]clssnmPollingThread: node rac2 (2) at 50% heartbeat fatal, removal in 14.400 seconds
    2019-01-16 13:44:16.839: [    CSSD][1218210112]clssnmPollingThread: node rac2 (2) is impending reconfig, flag 394254, misstime 15600
    

    These messages show that node 1 has detected that node 2 has been missing network heartbeats for a sustained period. If this continues, the cluster will start a reconfiguration in 14.400 seconds. In 11gR2 the CSS misscount defaults to 30 s and can be checked with crsctl get css misscount, as shown below.
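
    The CSS timeouts that drive this countdown can be read directly from the clusterware, for example:

    /u01/11.2.0/grid/bin/crsctl get css misscount      # network heartbeat timeout (30 s here)
    /u01/11.2.0/grid/bin/crsctl get css disktimeout    # disk heartbeat timeout
    /u01/11.2.0/grid/bin/crsctl get css reboottime     # time allowed for a node reboot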

  • Node 2

    2019-01-16 13:44:17.512: [    CSSD][1222531392]clssnmPollingThread: node rac1 (1) at 50% heartbeat fatal, removal in 14.400 seconds
    2019-01-16 13:44:17.512: [    CSSD][1222531392]clssnmPollingThread: node rac1 (1) is impending reconfig, flag 132108, misstime 15600
    

    Node 2 logs essentially the same messages about node 1. In a two-node cluster this symmetry is expected; once the missed network heartbeats reach misscount, the cluster has to reconfigure.

  • Node 1

    2019-01-16 13:44:16.839: [    CSSD][1218210112]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
    2019-01-16 13:44:16.926: [    CSSD][1249679680]clssnmvSchedDiskThreads: DiskPingThread for voting file /dev/raw/raw1 sched delay 950 > margin 750 cur_ms 59003974 lastalive 59003024
    2019-01-16 13:44:18.914: [    CSSD][1228699968]clssnmSendingThread: sending status msg to all nodes
    2019-01-16 13:44:18.914: [    CSSD][1228699968]clssnmSendingThread: sent 5 status msgs to all nodes
    2019-01-16 13:44:20.946: [    CSSD][1099692352]clssscSelect: cookie accept request 0xa430f48
    2019-01-16 13:44:20.946: [    CSSD][1099692352]clssgmAllocProc: (0xad8a000) allocated
    2019-01-16 13:44:20.946: [    CSSD][1099692352]clssgmClientConnectMsg: properties of cmProc 0xad8a000 - 1,2,3,4
    2019-01-16 13:44:20.946: [    CSSD][1099692352]clssgmClientConnectMsg: Connect from con(0x5a2bd) proc(0xad8a000) pid(9116) version 11:2:1:4, properties: 1,2,3,4
    2019-01-16 13:44:20.946: [    CSSD][1099692352]clssgmClientConnectMsg: msg flags 0x0000
    2019-01-16 13:44:20.947: [    CSSD][1099692352]clssgmExecuteClientRequest: Node name request from client ((nil))
    2019-01-16 13:44:20.950: [    CSSD][1099692352]clssgmDeadProc: proc 0xad8a000
    2019-01-16 13:44:20.950: [    CSSD][1099692352]clssgmDestroyProc: cleaning up proc(0xad8a000) con(0x5a2bd) skgpid  ospid 9116 with 0 clients, refcount 0
    2019-01-16 13:44:20.950: [    CSSD][1099692352]clssgmDiscEndpcl: gipcDestroy 0x5a2bd
    2019-01-16 13:44:23.855: [    CSSD][1218210112]clssnmPollingThread: node rac2 (2) at 75% heartbeat fatal, removal in 7.380 seconds
    2019-01-16 13:44:23.917: [    CSSD][1228699968]clssnmSendingThread: sending status msg to all nodes
    2019-01-16 13:44:23.917: [    CSSD][1228699968]clssnmSendingThread: sent 5 status msgs to all nodes
    2019-01-16 13:44:28.865: [    CSSD][1218210112]clssnmPollingThread: node rac2 (2) at 90% heartbeat fatal, removal in 2.370 seconds, seedhbimpd 1
    2019-01-16 13:44:28.921: [    CSSD][1228699968]clssnmSendingThread: sending status msg to all nodes
    2019-01-16 13:44:28.921: [    CSSD][1228699968]clssnmSendingThread: sent 5 status msgs to all nodes
    2019-01-16 13:44:30.731: [    CSSD][1099692352]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 850 > margin 750 cur_ms 59017784 lastalive 59016934
    2019-01-16 13:44:31.242: [    CSSD][1218210112]clssnmPollingThread: Removal started for node rac2 (2), flags 0x6040e, state 3, wt4c 0
    2019-01-16 13:44:31.242: [    CSSD][1218210112]clssnmDiscHelper: rac2, node(2) connection failed, endp (0x331), probe(0x100000000), ninf->endp 0x331
    2019-01-16 13:44:31.242: [    CSSD][1218210112]clssnmDiscHelper: node 2 clean up, endp (0x331), init state 5, cur state 5
    

    Node 1 concludes that node 2 has continuously missed network heartbeats and starts removing node 2 from the cluster.

  • Node 2

    2019-01-16 13:44:17.512: [    CSSD][1222531392]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
    2019-01-16 13:44:18.247: [    CSSD][1254000960]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 840 > margin 750 cur_ms 20599214 lastalive 20598374
    2019-01-16 13:44:19.232: [    CSSD][1254000960]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 820 > margin 750 cur_ms 20600194 lastalive 20599374
    2019-01-16 13:44:20.232: [    CSSD][1138612544]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 820 > margin 750 cur_ms 20601194 lastalive 20600374
    2019-01-16 13:44:21.239: [    CSSD][1254000960]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 830 > margin 750 cur_ms 20602204 lastalive 20601374
    2019-01-16 13:44:22.236: [    CSSD][1233021248]clssnmSendingThread: sending status msg to all nodes
    2019-01-16 13:44:22.236: [    CSSD][1233021248]clssnmSendingThread: sent 5 status msgs to all nodes
    2019-01-16 13:44:22.243: [    CSSD][1254000960]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 830 > margin 750 cur_ms 20603204 lastalive 20602374
    2019-01-16 13:44:23.238: [    CSSD][1254000960]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 830 > margin 750 cur_ms 20604204 lastalive 20603374
    2019-01-16 13:44:24.238: [    CSSD][1138612544]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 830 > margin 750 cur_ms 20605204 lastalive 20604374
    2019-01-16 13:44:24.529: [    CSSD][1222531392]clssnmPollingThread: node rac1 (1) at 75% heartbeat fatal, removal in 7.380 seconds
    2019-01-16 13:44:25.237: [    CSSD][1254000960]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 830 > margin 750 cur_ms 20606204 lastalive 20605374
    2019-01-16 13:44:26.246: [    CSSD][1254000960]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 840 > margin 750 cur_ms 20607214 lastalive 20606374
    2019-01-16 13:44:27.230: [    CSSD][1233021248]clssnmSendingThread: sending status msg to all nodes
    2019-01-16 13:44:27.230: [    CSSD][1233021248]clssnmSendingThread: sent 5 status msgs to all nodes
    2019-01-16 13:44:27.232: [    CSSD][1254000960]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 820 > margin 750 cur_ms 20608194 lastalive 20607374
    2019-01-16 13:44:28.231: [    CSSD][1254000960]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 820 > margin 750 cur_ms 20609194 lastalive 20608374
    2019-01-16 13:44:29.236: [    CSSD][1254000960]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 820 > margin 750 cur_ms 20610194 lastalive 20609374
    2019-01-16 13:44:29.539: [    CSSD][1222531392]clssnmPollingThread: node rac1 (1) at 90% heartbeat fatal, removal in 2.370 seconds, seedhbimpd 1
    2019-01-16 13:44:30.241: [    CSSD][1254000960]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 820 > margin 750 cur_ms 20611204 lastalive 20610384
    2019-01-16 13:44:31.229: [    CSSD][1138612544]clssnmvSchedDiskThreads: DiskPingMonitorThread sched delay 810 > margin 750 cur_ms 20612194 lastalive 20611384
    2019-01-16 13:44:31.914: [    CSSD][1222531392]clssnmPollingThread: Removal started for node rac1 (1), flags 0x2040c, state 3, wt4c 0
    2019-01-16 13:44:31.914: [    CSSD][1222531392]clssnmDiscHelper: rac1, node(1) connection failed, endp (0x256), probe(0x100000000), ninf->endp 0x256
    2019-01-16 13:44:31.914: [    CSSD][1222531392]clssnmDiscHelper: node 1 clean up, endp (0x256), init state 5, cur state 5
    

    Node 2 likewise concludes that node 1 has continuously missed network heartbeats and starts removing node 1 from the cluster.

  • Node 1

    2019-01-16 13:44:31.242: [GIPCXCPT][1218210112]gipcInternalDissociate: obj 0xab2fe10 [0000000000000331] { gipcEndpoint : localAddr 'gipc://rac1:3f93-7c9d-5dcd-3d8d#10.10.10.10#43951', remoteAddr 'gipc://rac2:nm_raccluster#10.10.10.20#56183', numPend 5, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x5e89, pidPeer 0, flags 0x2616, usrFlags 0x0 } not associated with any container, ret gipcretFail (1)
    2019-01-16 13:44:31.242: [GIPCXCPT][1218210112]gipcDissociateF [clssnmDiscHelper : clssnm.c : 3215]: EXCEPTION[ ret gipcretFail (1) ]  failed to dissociate obj 0xab2fe10 [0000000000000331] { gipcEndpoint : localAddr 'gipc://rac1:3f93-7c9d-5dcd-3d8d#10.10.10.10#43951', remoteAddr 'gipc://rac2:nm_raccluster#10.10.10.20#56183', numPend 5, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x5e89, pidPeer 0, flags 0x2616, usrFlags 0x0 }, flags 0x0
    2019-01-16 13:44:31.242: [    CSSD][1239189824]clssnmDoSyncUpdate: Initiating sync 322157220
    2019-01-16 13:44:31.242: [    CSSD][1239189824]clssscUpdateEventValue: NMReconfigInProgress  val 1, changes 3
    2019-01-16 13:44:31.242: [    CSSD][1239189824]clssnmDoSyncUpdate: local disk timeout set to 27000 ms, remote disk timeout set to 27000
    2019-01-16 13:44:31.242: [    CSSD][1239189824]clssnmDoSyncUpdate: new values for local disk timeout and remote disk timeout will take effect when the sync is completed.
    2019-01-16 13:44:31.242: [    CSSD][1239189824]clssnmDoSyncUpdate: Starting cluster reconfig with incarnation 322157220
    2019-01-16 13:44:31.242: [    CSSD][1239189824]clssnmSetupAckWait: Ack message type (11) 
    2019-01-16 13:44:31.242: [    CSSD][1239189824]clssnmSetupAckWait: node(1) is ALIVE
    2019-01-16 13:44:31.242: [    CSSD][1239189824]clssnmSendSync: syncSeqNo(322157220), indicating EXADATA fence initialization complete
    2019-01-16 13:44:31.242: [    CSSD][1239189824]List of nodes that have ACKed my sync: NULL	
    
  • Node 2

    2019-01-16 13:44:31.914: [GIPCXCPT][1222531392]gipcInternalDissociate: obj 0x4ab8b60 [0000000000000256] { gipcEndpoint : localAddr 'gipc://rac2:34a4-9438-68a9-4123#10.10.10.20#56183', remoteAddr 'gipc://10.10.10.10:3f4c-bc6d-9424-f406#10.10.10.10#43951', numPend 5, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x6f6a, pidPeer 0, flags 0x102616, usrFlags 0x0 } not associated with any container, ret gipcretFail (1)
    2019-01-16 13:44:31.914: [GIPCXCPT][1222531392]gipcDissociateF [clssnmDiscHelper : clssnm.c : 3215]: EXCEPTION[ ret gipcretFail (1) ]  failed to dissociate obj 0x4ab8b60 [0000000000000256] { gipcEndpoint : localAddr 'gipc://rac2:34a4-9438-68a9-4123#10.10.10.20#56183', remoteAddr 'gipc://10.10.10.10:3f4c-bc6d-9424-f406#10.10.10.10#43951', numPend 5, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x6f6a, pidPeer 0, flags 0x102616, usrFlags 0x0 }, flags 0x0
    2019-01-16 13:44:31.914: [    CSSD][1243511104]clssnmDoSyncUpdate: Initiating sync 322157220
    2019-01-16 13:44:31.914: [    CSSD][1243511104]clssscUpdateEventValue: NMReconfigInProgress  val 1, changes 9
    2019-01-16 13:44:31.914: [    CSSD][1243511104]clssnmDoSyncUpdate: local disk timeout set to 27000 ms, remote disk timeout set to 27000
    2019-01-16 13:44:31.914: [    CSSD][1243511104]clssnmDoSyncUpdate: new values for local disk timeout and remote disk timeout will take effect when the sync is completed.
    2019-01-16 13:44:31.914: [    CSSD][1243511104]clssnmDoSyncUpdate: Starting cluster reconfig with incarnation 322157220
    2019-01-16 13:44:31.914: [    CSSD][1243511104]clssnmSetupAckWait: Ack message type (11) 
    2019-01-16 13:44:31.914: [    CSSD][1243511104]clssnmSetupAckWait: node(2) is ALIVE
    2019-01-16 13:44:31.914: [    CSSD][1243511104]clssnmSendSync: syncSeqNo(322157220), indicating EXADATA fence initialization complete
    2019-01-16 13:44:31.914: [    CSSD][1243511104]List of nodes that have ACKed my sync: NULL
    

    Each node sends a reconfiguration (sync) message to the nodes it still considers healthy, but in a two-node cluster there is no other node left to receive it. The disk heartbeat information on the voting file is what determines each node's status next; the voting files in use can be listed as shown below.
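
    The voting files that carry these disk heartbeats can be listed with crsctl; in this setup the voting file is the raw device /dev/raw/raw1 seen in the DiskPingThread messages earlier:

    /u01/11.2.0/grid/bin/crsctl query css votedisk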

  • Node 1

    2019-01-16 13:44:31.243: [    CSSD][1239189824]clssnmCheckDskInfo: Checking disk info...
    2019-01-16 13:44:31.243: [    CSSD][1239189824]clssnmCheckSplit: Node 2, rac2, is alive, DHB (1547617471, 20612054) more than disk timeout of 27000 after the last NHB (1547617441, 20582204)
    
    
  • Node 2

    2019-01-16 13:44:31.915: [    CSSD][1243511104]clssnmCheckDskInfo: Checking disk info...
    2019-01-16 13:44:31.915: [    CSSD][1243511104]clssnmCheckSplit: Node 1, rac1, is alive, DHB (1547617471, 59018214) more than disk timeout of 27000 after the last NHB (1547617441, 58988964)
    
    

    A split-brain is now imminent: through the disk heartbeat, each node can see that the other is still alive. The cluster has to apply its split-brain resolution rule to decide which node leaves.

    Split-brain resolution rule:
    In an Oracle cluster, split-brain means that the network heartbeats between some nodes are lost while their disk heartbeats remain normal. When this happens the cluster splits into sub-clusters (cohorts) and has to reconfigure. The basic rule is: the cohort with more nodes survives; if the cohorts contain the same number of nodes, the cohort containing the lowest-numbered node survives (a small sketch of this rule follows below).
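
    A minimal bash sketch of that rule, purely as an illustration (this is not an Oracle tool); with the cohorts from this test it reproduces the decision in the logs, where rac1's cohort {1} survives over rac2's cohort {2}:

    # Decide which of two cohorts survives: the larger cohort wins,
    # and ties go to the cohort containing the lowest node number.
    surviving_cohort() {
        local -a a=($1) b=($2)            # cohorts as space-separated node numbers
        if   (( ${#a[@]} > ${#b[@]} )); then echo "cohort A (${a[*]}) survives"
        elif (( ${#a[@]} < ${#b[@]} )); then echo "cohort B (${b[*]}) survives"
        else
            local min_a=$(printf '%s\n' "${a[@]}" | sort -n | head -1)
            local min_b=$(printf '%s\n' "${b[@]}" | sort -n | head -1)
            if (( min_a < min_b )); then echo "cohort A (${a[*]}) survives"
            else echo "cohort B (${b[*]}) survives"; fi
        fi
    }

    surviving_cohort "1" "2"              # -> cohort A (1) survives, as rac1 did here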

  • Node 1

    2019-01-16 13:44:31.243: [    CSSD][1239189824]clssnmCheckDskInfo: My cohort: 1
    2019-01-16 13:44:31.243: [    CSSD][1239189824]clssnmEvict: Start
    2019-01-16 13:44:31.243: [    CSSD][1239189824](:CSSNM00007:)clssnmrEvict: Evicting node 2, rac2, from the cluster in incarnation 322157220, node birth incarnation 322157217, death incarnation 322157220, stateflags 0x64000
    2019-01-16 13:44:31.243: [    CSSD][1239189824]clssnmrFenceSage: Fenced node rac2, number 2, with EXADATA, handle 1239186920
    2019-01-16 13:44:31.243: [    CSSD][1239189824]clssnmSendShutdown: req to node 2, kill time 59018294
    
  • Node 2

    2019-01-16 13:44:31.915: [    CSSD][1243511104]clssnmCheckDskInfo: My cohort: 2
    2019-01-16 13:44:31.915: [    CSSD][1243511104]clssnmCheckDskInfo: Surviving cohort: 1
    2019-01-16 13:44:31.915: [    CSSD][1243511104](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 2, rac2, is smaller than cohort of 1 nodes led by node 1, rac1, based on map type 2
    

    To avoid the split-brain, node 2 left the cluster and the node rebooted: its cohort lost to the cohort led by node 1, which holds the lower node number. Note that from 11.2.0.2 onwards, the rebootless restart feature means the node is usually not rebooted at all; instead the clusterware restarts the Grid Infrastructure stack on that node.
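
    Whether a given environment is on a release with rebootless restart can be checked from the clusterware itself, for example:

    /u01/11.2.0/grid/bin/crsctl query crs activeversion
    /u01/11.2.0/grid/bin/crsctl query crs softwareversion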

References

《RAC核心技术详解》
