A failure I ran into last month:
The user reported that one node of a RAC environment would not start and asked for it to be handled. We needed to determine why this cluster node could not start, and also why a node reboot had occurred in the first place. The conclusion of the analysis: RAC node 2 was abnormal; the clusterware would not start and reported that it could not find the shared disks holding the voting files, and prior to that, on 2017/01/17, the clusterware and the OS had restarted.
The reasons the shared disks could not be discovered and the cluster restarted are summarized below:
1. Why cluster node 2 restarted abnormally at 11:17 on 2017/01/17: at around 2017-01-17 11:17:12 both RAC nodes reported a failure of the private-interconnect NIC (the user confirmed that fiber cables were being swapped at the time). The gipcd process detected the interconnect failure first; afterwards the ocssd logs on both nodes began to report network heartbeat timeouts, while the disk heartbeats remained normal. Once the network heartbeat had been missed for the default timeout of 30 seconds, the RAC cluster split into two sub-clusters (split-brain). At around 2017-01-17 11:17:39, following the split-brain resolution rules, the sub-cluster containing node 2 was evicted and the node rebooted. (From 11.2.0.2 onwards, the rebootless restart feature normally means that losing the network heartbeat only restarts the GI clusterware stack, not the host. The host rebooted here because, during the graceful shutdown phase of rebootless restart, CRSD failed to finish cleaning up its resources; see: http://www.haibusuanyun.com/?p=142)
2. Why the cluster on node 2 could not start after the 11:22 reboot on 2017/01/17: the devices {} section of multipath.conf on both nodes contained the line getuid_callout "/lib/udev/scsi_id -g -u -s /dev/%n", which is the Linux 5 syntax for retrieving the scsi_id. The OS here is Linux 6.5, where it should be getuid_callout "/lib/udev/scsi_id --whitelisted --device=/dev/%n".
3. Why node 1 did not show the shared-disk problem: both nodes had the same error in the configuration file, but on node 1 the file was modified after the node had already booted, so the multipath devices for the shared disks were still in place there. Node 2 rebooted on 2017/01/17, at which point the bad configuration prevented multipath from building the shared-disk devices, so the clusterware could not find the disks holding the voting files and could not start.
With the root cause identified, the follow-up recommendation is to correct the multipath configuration and to bond the private-interconnect NICs (a sketch follows below); that is enough to avoid a recurrence.
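A minimal sketch of the two fixes, assuming this environment's storage (the vendor "EIM" / product "OSNSolution" strings are the ones shown in multipath.conf later in this post); the bonding interface name and IP are placeholders for illustration only:

# /etc/multipath.conf -- devices section with the RHEL 6 getuid_callout
devices {
    device {
        vendor                  "EIM"
        product                 "OSNSolution"
        path_grouping_policy    failover
        getuid_callout          "/lib/udev/scsi_id --whitelisted --device=/dev/%n"
    }
}

# /etc/sysconfig/network-scripts/ifcfg-bond0 -- active-backup bond for the private interconnect
DEVICE=bond0
BOOTPROTO=none
ONBOOT=yes
IPADDR=172.16.0.2
NETMASK=255.255.255.0
BONDING_OPTS="mode=active-backup miimon=100"

Each slave NIC (for example eth3, the interconnect NIC seen in the gipcd messages below) would additionally need MASTER=bond0 and SLAVE=yes in its own ifcfg file, and the GI private interconnect would have to be repointed at bond0 with oifcfg before the old interface definition is removed.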
The failure-analysis process and the supporting logs follow:
1. Messages in the clusterware alert log
[xslydb2:/app/11.2/grid/log/xslydb2:]tail -n 1000 alertxslydb2.log|more
2017-01-17 11:17:39.771: [cssd(18611)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /app/11.2/grid/log/xslydb2/cssd/ocssd.log
2017-01-17 11:17:39.799: [cssd(18611)]CRS-1652:Starting clean up of CRSD resources.
2017-01-17 11:17:39.929: [cssd(18611)]CRS-1608:This node was evicted by node 1, xslydb1; details at (:CSSNM00005:) in /app/11.2/grid/log/xslydb2/cssd/ocssd.log.
2017-01-17 11:17:39.966: [cssd(18611)]CRS-1653:The clean up of the CRSD resources failed.
2017-01-17 11:17:41.955: [ohasd(18384)]CRS-2765:Resource 'ora.ctssd' has failed on server 'xslydb2'.
2017-01-17 11:17:41.959: [ohasd(18384)]CRS-2765:Resource 'ora.evmd' has failed on server 'xslydb2'.
2017-01-17 11:17:42.338: [/app/11.2/grid/bin/oraagent.bin(18504)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/app/11.2/grid/log/xslydb2/agent/ohasd/oraagent_grid/oraagent_grid.log"
[ohasd(18634)]CRS-2112:The OLR service started on node xslydb2.
2017-01-17 11:22:09.987: [ohasd(18634)]CRS-1301:Oracle High Availability Service started on node xslydb2. ===== the time node 2's stack came back up after the reboot
2017-01-17 11:22:10.011: [ohasd(18634)]CRS-8011:reboot advisory message from host: xslydb2, component: cssagent, with time stamp: L-2017-01-17-11:17:42.855
[ohasd(18634)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS
2017-01-17 11:22:10.021: [ohasd(18634)]CRS-8011:reboot advisory message from host: xslydb2, component: cssmonit, with time stamp: L-2017-01-17-11:17:42.855
[ohasd(18634)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS
2017-01-17 11:22:10.021: [ohasd(18634)]CRS-8017:location: /etc/oracle/lastgasp has 2 reboot advisory log files, 2 were announced and 0 errors occurred
2017-01-17 11:22:11.627: [/app/11.2/grid/bin/oraagent.bin(18820)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/app/11.2/grid/log/xslydb2/agent/ohasd/oraagent_grid/oraagent_grid.log"
2017-01-17 11:22:14.928: [/app/11.2/grid/bin/orarootagent.bin(18824)]CRS-2302:Cannot get GPnP profile. Error CLSGPNP_NO_DAEMON (GPNPD daemon is not running).
2017-01-17 11:22:18.773: [ohasd(18634)]CRS-2302:Cannot get GPnP profile. Error CLSGPNP_NO_DAEMON (GPNPD daemon is not running).
2017-01-17 11:22:18.948: [gpnpd(18975)]CRS-2328:GPNPD started on node xslydb2. ===== GPNPD started; ocssd obtains the voting-file disk information from the service this process provides
2017-01-17 11:22:21.627: [cssd(19048)]CRS-1713:CSSD daemon is started in clustered mode
2017-01-17 11:22:23.092: [ohasd(18634)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE
2017-01-17 11:22:23.099: [ohasd(18634)]CRS-2769:Unable to failover resource 'ora.diskmon'.
2017-01-17 11:22:25.765: [cssd(19048)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /app/11.2/grid/log/xslydb2/cssd/ocssd.log ===== ocssd cannot find the required shared disks
2017-01-17 11:22:40.778: [cssd(19048)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /app/11.2/grid/log/xslydb2/cssd/ocssd.log ……… identical entries omitted
2017-01-17 11:32:26.249: [ohasd(18634)]CRS-2765:Resource 'ora.cssdmonitor' has failed on server 'xslydb2'.
2017-01-17 11:32:27.970: [cssd(20199)]CRS-1713:CSSD daemon is started in clustered mode
2017-01-17 11:32:28.038: [cssd(20199)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /app/11.2/grid/log/xslydb2/cssd/ocssd.log ………………
(Logs of the clusterware's repeated attempts to restart cssd are omitted here.) The log shows that CSSD failed to start, over and over, because it could not find the shared disks holding the voting files; at 13:03 the clusterware gave up restarting CSSD and the startup stalled at that point.
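For reference, the voting-file locations the clusterware expects can be listed from the surviving node (node 1, where CSS is still up); a quick check of this kind, using the grid home shown in the logs:

# /app/11.2/grid/bin/crsctl query css votedisk

On this system the listed paths would be the ASM disks under /dev/mapper, which is exactly what node 2 could no longer present.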
From the state of the cluster nodes and the relevant clusterware logs, the preliminary cause of node 2 failing to start is:
cssd obtains the voting-file path (/dev/mapper) from the gpnp profile, but cannot find the required shared disks under that path, so the cssd daemon cannot start. On both node 1 and node 2 the disks are visible at the OS level (the /dev/sd* listing). Node 1's multipath output is correct; node 2's multipath has not built the shared-disk devices. Further analysis of node 2 shows that the devices {} section of its multipath.conf contains the line:
getuid_callout "/lib/udev/scsi_id -g -u -s /dev/%n"
This is the Linux 5 syntax for retrieving the scsi_id; the OS here is Linux 6.5, where the correct form is getuid_callout "/lib/udev/scsi_id --whitelisted --device=/dev/%n".
[testdb2:/app/11.2/grid/log/testdb2:]df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root 244G 14G 218G 6% /
tmpfs 63G 224K 63G 1% /dev/shm
/dev/sda2 477M 55M 397M 13% /boot
/dev/sda1 200M 264K 200M 1% /boot/efi
[testdb2:/app/11.2/grid/log/testdb2:]ls -al /dev/sd*
brw-rw---- 1 root disk 8, 0 Jan 17 19:21 /dev/sda
brw-rw---- 1 root disk 8, 1 Jan 17 19:21 /dev/sda1
brw-rw---- 1 root disk 8, 2 Jan 17 11:21 /dev/sda2
brw-rw---- 1 root disk 8, 3 Jan 17 19:21 /dev/sda3
brw-rw---- 1 root disk 8, 16 Jan 17 19:21 /dev/sdb
brw-rw---- 1 root disk 8, 32 Jan 17 19:21 /dev/sdc
brw-rw---- 1 root disk 8, 48 Jan 17 19:21 /dev/sdd
brw-rw---- 1 root disk 8, 64 Jan 17 19:21 /dev/sde
brw-rw---- 1 root disk 8, 80 Jan 17 19:21 /dev/sdf
brw-rw---- 1 root disk 8, 96 Jan 17 19:21 /dev/sdg
brw-rw---- 1 root disk 8, 112 Jan 17 19:21 /dev/sdh
brw-rw---- 1 root disk 8, 128 Jan 17 19:21 /dev/sdi
brw-rw---- 1 root disk 8, 144 Jan 17 19:21 /dev/sdj
brw-rw---- 1 root disk 8, 160 Jan 17 19:21 /dev/sdk
brw-rw---- 1 root disk 8, 176 Jan 17 19:21 /dev/sdl
brw-rw---- 1 root disk 8, 192 Jan 17 19:21 /dev/sdm
[testdb2:/app/11.2/grid/log/testdb2:]gpnptool get
Warning: some command line parameters were defaulted. Resulting command line:/app/11.2/grid/bin/gpnptool.bin get -o- …………Use="cluster_interconnect"/>
………… Success.
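The GPnP profile returned above also carries the ASM discovery string that ocssd uses when it searches for the voting files; a hedged way to pull just that attribute out of the XML (run as the grid user):

# /app/11.2/grid/bin/gpnptool get 2>/dev/null | tr ' ' '\n' | grep -i DiscoveryString

On this cluster the attribute should point at /dev/mapper, consistent with the voting-file path mentioned in the analysis above.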
# multipath -ll
# cd /dev/mapper/
# ls -al
total 0
drwxr-xr-x 2 root root 100 Jan 17 19:21 .
drwxr-xr-x 18 root root 4180 Apr 19 10:48 ..
crw-rw---- 1 root root 10, 236 Jan 17 11:21 control
lrwxrwxrwx 1 root root 7 Jan 17 11:21 VolGroup-lv_root -> ../dm-0
lrwxrwxrwx 1 root root 7 Jan 17 11:21 VolGroup-lv_swap -> ../dm-1
# cat /var/log/messages
Apr 16 03:13:01 testdb2 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="18158" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Apr 19 10:48:22 testdb2 kernel: device-mapper: table: 252:2: multipath: error getting device
Apr 19 10:48:22 testdb2 kernel: device-mapper: ioctl: error adding target to table
# cat multipath.conf
# blacklist { }
defaults {
    udev_dir                /dev
    polling_interval        10
    user_friendly_names     yes
    path_grouping_policy    failover
    path_checker            tur
    checker_timeout         90
    failback                10
}
devices {
    device {
        vendor                  "EIM"
        product                 "OSNSolution"
        path_grouping_policy    failover
        getuid_callout          "/lib/udev/scsi_id -g -u -s /dev/%n"    ===== the Linux 5 option syntax, used here by mistake
    }
}
# cd multipath
# ls
bindings  wwids
# cat bindings
mpatha 3600605b00bc86af01f8bec9f0d8aa650
data1 3600050cc2c9c56fc189ce61180c0246e
ocr 3600050cc349c56fc189ce61180c0246e
fra 3600050cc1a9c56fc189ce61180c0246e
# cat wwids
# Multipath wwids, Version : 1.0
# NOTE: This file is automatically maintained by multipath and multipathd.
# You should not need to edit this file in normal circumstances.
#
# Valid WWIDs:
/3600050cc2c9c56fc189ce61180c0246e/
/3600050cc349c56fc189ce61180c0246e/
/3600050cc1a9c56fc189ce61180c0246e/
# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.5 (Santiago)
2017-01-17 11:17:09.667: [ CSSD][74032896]clssnmSendingThread: sent 4 status msgs to all nodes
2017-01-17 11:17:12.588: [GIPCHGEN][86673152] gipchaInterfaceFail: marking interface failing 0x7f9fe82207d0 { host '', haName 'CSS_testdb-cluster', local (nil), ip '172.16.0.2:27773', subnet '172.16.0.0', mask '255.255.255.0', mac '6c-0b-84-b7-d7-c5', ifname 'eth3', numRef 1, numFail 0, idxBoot 0, flags 0x184d }
2017-01-17 11:17:12.669: [GIPCHGEN][88250112] gipchaInterfaceFail: marking interface failing 0x7f9ff4049410 { host 'testdb1', haName 'CSS_testdb-cluster', local 0x7f9fe82207d0, ip '172.16.0.1:20731', subnet '172.16.0.0', mask '255.255.255.0', mac '', ifname '', numRef 0, numFail 0, idxBoot 2, flags 0x6 }
2017-01-17 11:17:13.093: [GIPCHGEN][88250112] gipchaInterfaceDisable: disabling interface 0x7f9fe82207d0 { host '', haName 'CSS_testdb-cluster', local (nil), ip '172.16.0.2:27773', subnet '172.16.0.0', mask '255.255.255.0', mac '6c-0b-84-b7-d7-c5', ifname 'eth3', numRef 0, numFail 1, idxBoot 0, flags 0x19cd } ===== the cluster's gipcd process detects that the private-interconnect NIC (ip '172.16.0.2') has failed; the interface is disabled
2017-01-17 11:17:22.676: [GIPCHALO][88250112] gipchaLowerProcessNode: no valid interfaces found to node for 11010 ms, node 0x7f9ff4065e40 { host 'testdb1', haName 'CSS_testdb-cluster', srcLuid 1a947c9b-5bb0e40b, dstLuid 31b13d49-3dbc6b8f numInf 0, contigSeq 6734734, lastAck 6735227, lastValidAck 6734733, sendSeq [6735228 : 6735251], createTime 1502167534, sentRegister 1, localMonitor 1, flags 0x2408 }
2017-01-17 11:17:23.675: [ CSSD][74032896]clssnmSendingThread: sending status msg to all nodes
2017-01-17 11:17:23.675: [ CSSD][74032896]clssnmSendingThread: sent 4 status msgs to all nodes
2017-01-17 11:17:25.585: [ CSSD][75609856]clssnmPollingThread: node testdb1 (1) at 50% heartbeat fatal, removal in 14.180 seconds ===== 50% of misscount (default 30 seconds) already missed; node removal in 14.18 seconds
2017-01-17 11:17:37.588: [ CSSD][75609856]clssnmPollingThread: node testdb1 (1) at 90% heartbeat fatal, removal in 2.180 seconds, seedhbimpd 1 ===== by now 90% of the heartbeat window has been missed
2017-01-17 11:17:37.589: [ CSSD][81934080]clssnmvDHBValidateNcopy: node 1, testdb1, has a disk HB, but no network HB, DHB has rcfg 375585397, wrtcnt, 6005211, LATS 1739661218, lastSeqNo 6005210, uniqueness 1480090576, timestamp 1484623063/1741096478
2017-01-17 11:17:39.589: [ CSSD][81934080]clssnmvDHBValidateNcopy: node 1, testdb1, has a disk HB, but no network HB, DHB has rcfg 375585397, wrtcnt, 6005215, LATS 1739663218, lastSeqNo 6005214, uniqueness 1480090576, timestamp 1484623065/1741098498
2017-01-17 11:17:39.769: [ CSSD][75609856]clssnmPollingThread: Removal started for node testdb1 (1), flags 0x26040c, state 3, wt4c 0
2017-01-17 11:17:39.769: [ CSSD][75609856]clssnmMarkNodeForRemoval: node 1, testdb1 marked for removal ===== node 2 proposes evicting node 1, which is normal when the network heartbeat is lost
2017-01-17 11:17:39.770: [ CSSD][72455936]clssnmSetMinMaxVersion: max product/protocol (11.2/1.4)
2017-01-17 11:17:39.770: [ CSSD][72455936]clssnmNeedConfReq: No configuration to change
2017-01-17 11:17:39.770: [ CSSD][72455936]clssnmDoSyncUpdate: Terminating node 1, testdb1, misstime(30000) state(5)
2017-01-17 11:17:39.770: [ CSSD][72455936]clssnmDoSyncUpdate: Wait for 0 vote ack(s) ===== node 2 proposes evicting node 1 and waits for arbitration via the voting files
2017-01-17 11:17:39.770: [ CSSD][108664576]clssgmQueueGrockEvent: groupName(IGHRPSYS$USERS) count(2) master(1) event(2), incarn 2, mbrc 2, to member 2, events 0x0, state 0x0
2017-01-17 11:17:39.770: [ CSSD][72455936]clssnmCheckDskInfo: Checking disk info...
2017-01-17 11:17:39.770: [ CSSD][72455936]clssnmCheckSplit: Node 1, testdb1, is alive, DHB (1484623065, 1741098498) more than disk timeout of 27000 after the last NHB (1484623035, 1741068778)
2017-01-17 11:17:39.770: [ CSSD][108664576]clssgmQueueGrockEvent: groupName(IG+ASMSYS$USERS) count(2) master(1) event(2), incarn 2, mbrc 2, to member 2, events 0x0, state 0x0
2017-01-17 11:17:39.770: [ CSSD][72455936]clssnmCheckDskInfo: My cohort: 2 ===== after losing the network heartbeat the cluster has split into two sub-clusters; this node belongs to cohort 2
2017-01-17 11:17:39.770: [ CSSD][72455936]clssnmCheckDskInfo: Surviving cohort: 1
2017-01-17 11:17:39.770: [ CSSD][72455936](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 2, testdb2, is smaller than cohort of 1 nodes led by node 1, testdb1, based on map type 2 ===== the local cssd reads from the voting files that the local node must be terminated to avoid split-brain. The local node is evicted because, in an 11gR2 cluster, the sub-cluster with more nodes survives a split-brain, and when the sub-clusters are the same size the one containing the lower node number survives.
2017-01-17 11:17:39.771: [ CSSD][108664576]clssgmQueueGrockEvent: groupName(crs_version) count(3) master(0) event(2), incarn 5, mbrc 3, to member 2, events 0x0, state 0x0
2017-01-17 11:17:39.771: [ CSSD][72455936]###################################
2017-01-17 11:17:39.771: [ CSSD][72455936]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread ===== cssd's reconfiguration thread clssnmRcfgMgrThread aborts the local cluster stack
2017-01-17 11:17:39.771: [ CSSD][72455936]###################################
2017-01-17 11:17:39.771: [ CSSD][72455936](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally
2017-01-17 11:17:39.771: [ CSSD][108664576]clssgmQueueGrockEvent: groupName(CRF-) count(4) master(0) event(2), incarn 4, mbrc 4, to member 1, events 0x38, state 0x0
Starting with 11.2.0.2, Oracle introduced the rebootless restart feature. In the following situations the Grid Infrastructure (GI) restarts the clusterware stack rather than rebooting the node:
1. A node misses network heartbeats for longer than misscount.
2. A node cannot access the majority of the voting files (VF).
3. A member kill is escalated to a node kill.
In earlier versions the clusterware (CRS) rebooted the node outright in these cases. Before GI restarts the stack it first performs a graceful shutdown, roughly as follows:
1. Stop all heartbeats on the local node (network, disk, and local heartbeat).
2. Notify the cssd agent that ocssd.bin is about to stop.
3. Stop all I/O-capable processes registered with CSS, such as lmon.
4. cssd asks crsd to stop all resources; if crsd cannot stop all of them, a node reboot still occurs.
5. cssd waits for all I/O-capable processes to exit; if they do not all exit within the short I/O timeout, a node reboot still occurs.
6. Notify the cssd agent that all I/O-capable processes have exited.
7. ohasd restarts the cluster stack.
8. The local node tells the other nodes to perform a cluster reconfiguration.
The reason this node rebooted is therefore: after the interconnect heartbeat failure on this 11.2.0.4 RAC, the cluster evicted node 2, which matches the rebootless restart conditions, so a GI stack restart rather than a host reboot should have followed. During the graceful shutdown that precedes the stack restart (similar to crsctl stop crs/has), the step where cssd asks crsd to stop all resources failed (2017-01-17 11:17:39.966: [CSSD][72455936]clssscExit: CRSD cleanup failed with 184), and therefore the host rebooted.
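The thresholds involved here (misscount for the network heartbeat, disktimeout for the disk heartbeat, and reboottime) can be read back from CSS; a quick check, using the grid home from the logs:

# /app/11.2/grid/bin/crsctl get css misscount      ===== 30 seconds by default, matching misstime(30000) in the log below
# /app/11.2/grid/bin/crsctl get css disktimeout    ===== 200 seconds by default
# /app/11.2/grid/bin/crsctl get css reboottime     ===== 3 seconds by default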
2017-01-17 11:17:37.588: [CSSD][75609856]clssnmPollingThread: node testdb1 (1) at 90% heartbeat fatal, removal in 2.180 seconds, seedhbimpd 1 ===>> 90% of the network heartbeat window missed
…………
2017-01-17 11:17:39.502: [CSSD][85096192]clssnmvDiskPing: Writing with status 0x3, timestamp 1484623059/1739663128
2017-01-17 11:17:39.589: [CSSD][81934080]clssnmvDHBValidateNcopy: node 1, testdb1, has a disk HB, but no network HB, DHB has rcfg 375585397, wrtcnt, 6005215, LATS 1739663218, lastSeqNo 6005214, uniqueness 1480090576, timestamp 1484623065/1741098498
2017-01-17 11:17:39.769: [CSSD][75609856]clssnmPollingThread: Removal started for node testdb1 (1), flags 0x26040c, state 3, wt4c 0
2017-01-17 11:17:39.769: [CSSD][75609856]clssnmMarkNodeForRemoval: node 1, testdb1 marked for removal =====>> node 2 proposes evicting node 1, which is normal when the network heartbeat is lost; node 1 likewise proposes evicting node 2, each side considering the other to have failed
2017-01-17 11:17:39.769: [CSSD][75609856]clssnmDiscHelper: testdb1, node(1) connection failed, endp (0x87c), probe(0x7fa000000000), ninf->endp 0x87c =====>>
………………
2017-01-17 11:17:39.769: [CSSD][72455936]clssnmDoSyncUpdate: Starting cluster reconfig with incarnation 375585397
2017-01-17 11:17:39.770: [CSSD][72455936]clssnmSetupAckWait: Ack message type (11)
2017-01-17 11:17:39.770: [CSSD][72455936]clssnmSetupAckWait: node(2) is ALIVE =====>> node 2 sees itself as alive
………………
2017-01-17 11:17:39.770: [CSSD][70878976]clssnmHandleSync: Acknowledging sync: src[2] srcName[testdb2] seq[1] sync[375585397]
2017-01-17 11:17:39.770: [CSSD][70878976]clssnmSendAck: node 2, testdb2, syncSeqNo(375585397) type(11)
2017-01-17 11:17:39.770: [CSSD][108664576]clssgmStartNMMon: node 1 active, birth 375585395 =====>> the disk heartbeat shows both nodes are still alive
2017-01-17 11:17:39.770: [CSSD][108664576]clssgmStartNMMon: node 2 active, birth 375585396 =====>>
2017-01-17 11:17:39.770: [CSSD][108664576]NMEVENT_SUSPEND [00][00][00][06]
2017-01-17 11:17:39.770: [CSSD][108664576]clssgmCompareSwapEventValue: changed CmInfo State val 5, from 11, changes 12
2017-01-17 11:17:39.770: [CSSD][70878976]clssnmHandleAck: Received ack type 11 from node testdb2, number 2, with seq 0 for sync 375585397, waiting for 0 acks
2017-01-17 11:17:39.770: [CSSD][108664576]clssgmSuspendAllGrocks: Issue SUSPEND
…………
2017-01-17 11:17:39.770: [CSSD][72455936]clssnmNeedConfReq: No configuration to change
2017-01-17 11:17:39.770: [CSSD][72455936]clssnmDoSyncUpdate: Terminating node 1, testdb1, misstime(30000) state(5) =====>>
2017-01-17 11:17:39.770: [CSSD][72455936]clssnmDoSyncUpdate: Wait for 0 vote ack(s)
2017-01-17 11:17:39.770: [CSSD][108664576]clssgmQueueGrockEvent: groupName(IGHRPSYS$USERS) count(2) master(1) event(2), incarn 2, mbrc 2, to member 2, events 0x0, state 0x0
2017-01-17 11:17:39.770: [CSSD][72455936]clssnmCheckDskInfo: Checking disk info... =====>> checking the voting disk information
2017-01-17 11:17:39.770: [CSSD][72455936]clssnmCheckSplit: Node 1, testdb1, is alive, DHB (1484623065, 1741098498) more than disk timeout of 27000 after the last NHB (1484623035, 1741068778) =====>> node 1 is found to be alive
2017-01-17 11:17:39.770: [CSSD][108664576]clssgmQueueGrockEvent: groupName(IG+ASMSYS$USERS) count(2) master(1) event(2), incarn 2, mbrc 2, to member 2, events 0x0, state 0x0
2017-01-17 11:17:39.770: [CSSD][72455936]clssnmCheckDskInfo: My cohort: 2 =====>> this node's own cohort is cohort 2
2017-01-17 11:17:39.770: [CSSD][72455936]clssnmCheckDskInfo: Surviving cohort: 1 =====>> the surviving cohort, read from the voting disk, is cohort 1
2017-01-17 11:17:39.770: [CSSD][72455936](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 2, testdb2, is smaller than cohort of 1 nodes led by node 1, testdb1, based on map type 2 =====>> aborting the local node to avoid split-brain
2017-01-17 11:17:39.771: [CSSD][108664576]clssgmQueueGrockEvent: groupName(crs_version) count(3) master(0) event(2), incarn 5, mbrc 3, to member 2, events 0x0, state 0x0
2017-01-17 11:17:39.771: [CSSD][72455936]###################################
2017-01-17 11:17:39.771: [CSSD][72455936]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread =====>>
2017-01-17 11:17:39.771: [CSSD][72455936]###################################
2017-01-17 11:17:39.771: [CSSD][72455936](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally =====>>
2017-01-17 11:17:39.771: [CSSD][108664576]clssgmQueueGrockEvent: groupName(CRF-) count(4) master(0) event(2), incarn 4, mbrc 4, to member 1, events 0x38, state 0x0
Under normal circumstances CSSD would now ask CRSD to stop the relevant resources and shut the clusterware down cleanly; on 11gR2, thanks to the rebootless restart feature, this would restart the GI stack rather than reboot the host. The log below shows that CRSD failed while cleaning up its resources (2017-01-17 11:17:39.966: [CSSD][72455936]clssscExit: CRSD cleanup failed with 184), i.e. the cluster resources could not be shut down cleanly, so the clusterware decided a host reboot was needed. Exactly which resources CRSD failed to clean up is not analyzed here.
2017-01-17 11:17:39.773: [CSSD][108664576]clssgmCleanupNodeContexts(): cleaning up nodes, rcfg(375585396)
2017-01-17 11:17:39.774: [CSSD][108664576]clssgmCleanupNodeContexts(): successful cleanup of nodes rcfg(375585396)
2017-01-17 11:17:39.774: [CSSD][108664576]clssgmStartNMMon: completed node cleanup =====>> node-level cleanup
----- End of Call Stack Trace -----
2017-01-17 11:17:39.799: [CSSD][72455936]clssnmSendMeltdownStatus: node testdb2, number 2, has experienced a failure in thread number 3 and is shutting down
2017-01-17 11:17:39.799: [CSSD][72455936]clssscExit: Starting CRSD cleanup =====>> CSSD asks CRSD to clean up its resources
2017-01-17 11:17:39.799: [CSSD][92903168]clssgmProcClientReqs: Checking RPC Q
2017-01-17 11:17:39.799: [CSSD][92903168]clssgmProcClientReqs: Checking dead client
2017-01-17 11:17:39.799: [CSSD][92903168]clssgmProcClientReqs: Checking dead proc
2017-01-17 11:17:39.831: [CSSD][92903168]clssgmProcClientReqs: Checking RPC Q
2017-01-17 11:17:39.831: [CSSD][92903168]clssgmProcClientReqs: Checking dead client
2017-01-17 11:17:39.831: [CSSD][92903168]clssgmProcClientReqs: Checking dead proc
2017-01-17 11:17:39.831: [GIPCHAUP][88250112] gipchaUpperDisconnect: initiated discconnect umsg 0x7f9ff4061520 { msg 0x7f9ff402b1d8, ret gipcretRequestPending (15), flags 0x2 }, msg 0x7f9ff402b1d8 { type gipchaMsgTypeDisconnect (5), srcCid 00000000-00000806, dstCid 00000000-000007da }, endp 0x7f9ff403eb00 [0000000000000806] { gipchaEndpoint : port 'nm2_testdb-cluster/8422-2126-402d-6c2f', peer 'testdb1:cc84-74bb-b80e-4918', srcCid 00000000-00000806, dstCid 00000000-000007da, numSend 30, maxSend 100, groupListType 2, hagroup 0x14c75d0, usrFlags 0x4000, flags 0x21c }
2017-01-17 11:17:39.831: [CSSD][92903168]clssgmProcClientReqs: Checking RPC Q
2017-01-17 11:17:39.831: [CSSD][92903168]clssgmProcClientReqs: Checking dead client Q
2017-01-17 11:17:39.831: [CSSD][92903168]clssgmProcClientReqs: Checking dead proc Q
2017-01-17 11:17:39.930: [CSSD][83519232](:CSSNM00005:)clssnmvDiskKillCheck: Aborting, evicted by node testdb1, number 1, sync 375585397, stamp 1741098688 =====>>
2017-01-17 11:17:39.930: [CSSD][83519232]clssscExit: abort already set 1
2017-01-17 11:17:39.966: [ default][69302016]clsc_connect: (0x7f9fc0053ed0) no listener at (ADDRESS=(PROTOCOL=IPC)(KEY=CRSD_UI_SOCKET))
2017-01-17 11:17:39.966: [CSSD][72455936]clssscExit: CRSD cleanup status 184
2017-01-17 11:17:39.966: [CSSD][72455936]clssscExit: CRSD cleanup failed with 184 =====>> CRSD failed while cleaning up its resources
2017-01-17 11:17:39.973: [CSSD][92903168]clssgmProcClientReqs: Checking RPC Q
…………………… (what follows is memory-dump output, omitted here)
2017-01-17 11:17:40.148: [ CSSD][72455936]--- END OF GROCK STATE DUMP --- ………………
2017-01-17 11:22:21.627: [ CSSD][3826484992]clssscmain: Starting CSS daemon, version 11.2.0.4.0, in (clustered) mode with uniqueness value 1484623341
2017-01-17 11:22:21.627: [ CSSD][3826484992]clssscmain: Environment is production
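Finally, a quick way to pull this whole eviction-and-cleanup sequence back out of node 2's logs is a grep along these lines (log locations as shown earlier in this post; the lastgasp directory holds the reboot advisory records referenced by CRS-8017):

# grep -E "heartbeat fatal|Aborting local node|CRSD cleanup" /app/11.2/grid/log/xslydb2/cssd/ocssd.log
# ls -l /etc/oracle/lastgasp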