Recently, the ORACLE 11G RAC cluster our company had built ran into a failure. The troubleshooting dragged on for about a week before the root cause was finally pinned down. First, a brief overview of the ORACLE 11G RAC environment:
jbdb1 IBM P740 AIX 6100-08-02-1316
jbdb2 IBM P740 AIX 6100-08-02-1316
Oracle 11G 11.2.0.3 grid and database
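For reference, these versions can be confirmed directly on the hosts; a quick sketch using standard AIX and Grid Infrastructure commands (nothing here is specific to this cluster):

oslevel -s                          # AIX technology level / service pack
crsctl query crs activeversion      # active Grid Infrastructure version in the cluster
crsctl query crs softwareversion    # GI software version installed on this node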
That day the application team reported that they could not connect to the database. Puzzled, I logged in to the servers and found the whole RAC in an abnormal state. While checking the cluster status I got:
hostname:/home/grid$crs_stat -t
CRS-0184: Cannot communicate with the CRS daemon.
Logging in to both nodes showed the same problem: Oracle Clusterware could not bring CRS up, and with CRS down there were no cluster services to speak of. The database would not start and service was interrupted.
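One quick way to see which Grid Infrastructure daemons are actually alive at the OS level is to look at the processes directly; a minimal sketch:

# the main 11.2 GI daemons; anything missing here narrows the search
ps -ef | egrep 'ohasd|crsd|ocssd|evmd|gipcd|gpnpd' | grep -v grep
# and the summary view of daemon health
crsctl check crs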
At that point the priority was simply to get the application back, so I manually ran crsctl start cluster -all, but the CRS resources still refused to start and the error stayed the same. I then checked CRS again:
hostname:/home/grid$crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4529: Cluster Synchronization Services is online
CRS-4534: Cannot communicate with Event Manager
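For reference, the affected daemons can also be driven individually through their -init resources (as root); this is only a sketch, and with the underlying fault still present it is likely to fail in the same way:

# try to bring up just the CRSD and EVMD daemon resources on this node
crsctl start res ora.crsd -init
crsctl start res ora.evmd -init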
So CRS could not be reached, and neither could the Event Manager. I also queried the state of the clusterware -init resources on both nodes.
Node 1:

hostname:/home/grid$crsctl stat resource -t -init
--------------------------------------------------------------------------------
NAME                            TARGET  STATE         SERVER   STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm                       1 ONLINE  OFFLINE                Instance Shutdown
ora.cluster_interconnect.haip 1 ONLINE  ONLINE        jbdb1
ora.crf                       1 ONLINE  ONLINE        jbdb1
ora.crsd                      1 ONLINE  OFFLINE
ora.cssd                      1 ONLINE  ONLINE        jbdb1
ora.cssdmonitor               1 ONLINE  ONLINE        jbdb1
ora.ctssd                     1 ONLINE  ONLINE        jbdb1    OBSERVER
ora.diskmon                   1 OFFLINE OFFLINE
ora.drivers.acfs              1 ONLINE  ONLINE        jbdb1
ora.evmd                      1 ONLINE  INTERMEDIATE  jbdb1
ora.gipcd                     1 ONLINE  ONLINE        jbdb1
ora.gpnpd                     1 ONLINE  ONLINE        jbdb1
ora.mdnsd                     1 ONLINE  ONLINE        jbdb1

hostname:/home/grid$crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4529: Cluster Synchronization Services is online
CRS-4534: Cannot communicate with Event Manager

Node 2:

hostname:/home/grid$crsctl stat resource -t -init
--------------------------------------------------------------------------------
NAME                            TARGET  STATE         SERVER   STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm                       1 ONLINE  ONLINE        jbdb2    Started
ora.cluster_interconnect.haip 1 ONLINE  ONLINE        jbdb2
ora.crf                       1 ONLINE  ONLINE        jbdb2
ora.crsd                      1 ONLINE  INTERMEDIATE  jbdb2
ora.cssd                      1 ONLINE  ONLINE        jbdb2
ora.cssdmonitor               1 ONLINE  ONLINE        jbdb2
ora.ctssd                     1 ONLINE  ONLINE        jbdb2    OBSERVER
ora.diskmon                   1 OFFLINE OFFLINE
ora.drivers.acfs              1 ONLINE  ONLINE        jbdb2
ora.evmd                      1 ONLINE  INTERMEDIATE  jbdb2
ora.gipcd                     1 ONLINE  ONLINE        jbdb2
ora.gpnpd                     1 ONLINE  ONLINE        jbdb2
ora.mdnsd                     1 ONLINE  ONLINE        jbdb2

hostname:/home/grid$crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4529: Cluster Synchronization Services is online
CRS-4534: Cannot communicate with Event Manager
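When the -init resources look like this, the next place to dig is the Grid Infrastructure logs; under 11.2 they sit beneath $GRID_HOME/log/<hostname>. A sketch (the grid home /grid/app/11.2/grid and the exact file names are assumed from the log paths that appear later in this post):

# Clusterware alert log plus the CRSD, CSSD and GIPCD daemon logs on node 1
tail -100 /grid/app/11.2/grid/log/jbdb1/alertjbdb1.log
tail -100 /grid/app/11.2/grid/log/jbdb1/crsd/crsd.log
tail -100 /grid/app/11.2/grid/log/jbdb1/cssd/ocssd.log
tail -100 /grid/app/11.2/grid/log/jbdb1/gipcd/gipcd.log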
From the two listings, node 1's ora.asm and ora.crsd are OFFLINE, which also means the ASM instance cannot come up there, while node 2's ora.crsd is stuck in INTERMEDIATE. During the investigation I rebooted both nodes several times, but after every reboot the CRS state was exactly the same and the database remained unusable. With no better way to isolate the problem I took a gamble and shut one node down, and then the surprising thing happened: everything on node 2 came good. All cluster resources failed over to node 2, including the underlying ASM and the database instance, which became usable again, as the crs_stat output below shows.
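For reference, the clean way to take one node out of the cluster (rather than simply powering it off) is to stop the Clusterware stack on that node as root; a minimal sketch:

# stop the full stack on this node only; add -f if resources refuse to stop
crsctl stop crs
# then watch the surviving node pick up the VIPs, ASM and the instance
crs_stat -t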
hostname:/home/grid$crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora.DATA.dg    ora....up.type ONLINE    ONLINE    jbdb2
ora....BASE.dg ora....up.type ONLINE    ONLINE    jbdb2
ora....ER.lsnr ora....er.type ONLINE    ONLINE    jbdb2
ora....N1.lsnr ora....er.type ONLINE    ONLINE    jbdb2
ora.asm        ora.asm.type   ONLINE    ONLINE    jbdb2
ora.cvu        ora.cvu.type   ONLINE    ONLINE    jbdb2
ora.gsd        ora.gsd.type   OFFLINE   OFFLINE
ora.jbdb.db    ora....se.type ONLINE    ONLINE    jbdb2
ora.jbdb1.vip  ora....t1.type ONLINE    ONLINE    jbdb2
ora....SM2.asm application    ONLINE    ONLINE    jbdb2
ora....B2.lsnr application    ONLINE    ONLINE    jbdb2
ora.jbdb2.gsd  application    OFFLINE   OFFLINE
ora.jbdb2.ons  application    ONLINE    ONLINE    jbdb2
ora.jbdb2.vip  ora....t1.type ONLINE    ONLINE    jbdb2
ora....network ora....rk.type ONLINE    ONLINE    jbdb2
ora.oc4j       ora.oc4j.type  ONLINE    ONLINE    jbdb2
ora.ons        ora.ons.type   ONLINE    ONLINE    jbdb2
ora....ry.acfs ora....fs.type ONLINE    ONLINE    jbdb2
ora.scan1.vip  ora....ip.type ONLINE    ONLINE    jbdb2
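With everything running on jbdb2, it is worth a quick sanity check that the database itself is open and reachable; a sketch, where the database name jbdb is taken from the ora.jbdb.db resource above:

# run as the grid/oracle software owner
srvctl status database -d jbdb
srvctl status listener
lsnrctl status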
With the application limping along on one node, I went back through crsd.log on node 1 and found entries like these:
2015-01-11 13:09:49.196: [UiServer][12338] {1:55097:16132} Done for ctx=126cef9b0
2015-01-11 13:09:51.152: [GIPCHALO][2314] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30009, msg 126d68dd8 { len 1160, seq 4382378, type gipchaHdrTypeRecvEstablish (5), lastSeq 0, lastAck 0, minAck 4382377, flags 0x1, srcLuid a0a5bfd0-2d6a5e20, dstLuid 00000000-00000000, msgId 4382376 }, node 126ceeb10 { host 'jbdb2', haName '100e-6e24-76d3-7ed0', srcLuid 2a58b1e0-aa5c4864, dstLuid a0a5bfd0-2d6a5e20 numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [79 : 79], createTime 2789741483, sentRegister 1, localMonitor 0, flags 0x4 }
2015-01-11 13:09:51.898: [GIPCHALO][2314] gipchaLowerProcessNode: no valid interfaces found to node for 2789821594 ms, node 126ceeb10 { host 'jbdb2', haName '100e-6e24-76d3-7ed0', srcLuid 2a58b1e0-aa5c4864, dstLuid a0a5bfd0-2d6a5e20 numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [80 : 80], createTime 2789741483, sentRegister 1, localMonitor 0, flags 0x4 }
2015-01-11 13:09:52.160: [GIPCHALO][2314] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30008, msg 126da6958 { len 1160, seq 4382379, type gipchaHdrTypeRecvEstablish (5), lastSeq 0, lastAck 0, minAck 4382378, flags 0x1, srcLuid a0a5bfd0-2d6a5e20, dstLuid 00000000-00000000, msgId 4382377 }, node 126ceeb10 { host 'jbdb2', haName '100e-6e24-76d3-7ed0', srcLuid 2a58b1e0-aa5c4864, dstLuid a0a5bfd0-2d6a5e20 numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [80 : 80], createTime 2789741483, sentRegister 1, localMonitor 0, flags 0x4 }
2015-01-11 13:09:53.167: [GIPCHALO][2314] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30009, msg 126d6b678 { len 1160, seq 4382380, type gipchaHdrTypeRecvEstablish (5), lastSeq 0, lastAck 0, minAck 4382379, flags 0x1, srcLuid a0a5bfd0-2d6a5e20, dstLuid 00000000-00000000, msgId 4382378 }, node 126ceeb10 { host 'jbdb2', haName '100e-6e24-76d3-7ed0', srcLuid 2a58b1e0-aa5c4864, dstLuid a0a5bfd0-2d6a5e20 numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [81 : 81], createTime 2789741483, sentRegister 1, localMonitor 0, flags 0x4 }
2015-01-11 13:09:54.173: [GIPCHALO][2314] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30007, msg 126da91f8 { len 1160, seq 4382381, type gipchaHdrTypeRecvEstablish (5), lastSeq 0, lastAck 0, minAck 4382380, flags 0x1, srcLuid a0a5bfd0-2d6a5e20, dstLuid 00000000-00000000, msgId 4382379 }, node 126ceeb10 { host 'jbdb2', haName '100e-6e24-76d3-7ed0', srcLuid 2a58b1e0-aa5c4864, dstLuid a0a5bfd0-2d6a5e20 numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [82 : 82], createTime 2789741483, sentRegister 1, localMonitor 0, flags 0x4 }
2015-01-11 13:09:56.188: [GIPCHALO][2314] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30008, msg 126d5cd18 { len 1160, seq 4382383, type gipchaHdrTypeRecvEstablish (5), lastSeq 0, lastAck 0, minAck 4382382, flags 0x1, srcLuid a0a5bfd0-2d6a5e20, dstLuid 00000000-00000000, msgId 4382381 }, node 126ceeb10 { host 'jbdb2', haName '100e-6e24-76d3-7ed0', srcLuid 2a58b1e0-aa5c4864, dstLuid a0a5bfd0-2d6a5e20 numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [84 : 84], createTime 2789741483, sentRegister 1, localMonitor 0, flags 0x4 }
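These GIPC messages repeat continuously; a quick way to gauge how pervasive they are is to count them in crsd.log (a sketch, using the 11.2 log path of this grid home):

grep -c "gipchaLowerDropMsg" /grid/app/11.2/grid/log/jbdb1/crsd/crsd.log
grep -c "no valid interfaces found" /grid/app/11.2/grid/log/jbdb1/crsd/crsd.log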
The line that stands out is "gipchaLowerProcessNode: no valid interfaces found to node for 2789821594 ms", which on the face of it points at the cluster interconnect (the private heartbeat). A simple ping between node 1 and node 2 over the heartbeat addresses looked fine at the time, so I ignored it and put my effort into the operating system instead. That went on for five days. I tried everything at the OS level, including storage multipathing, system patches and known database bugs, and none of it made any difference: with both nodes up the RAC stayed unusable, yet with one node shut down all resources failed over cleanly and CRS was healthy. That pattern told me the storage and OS layers were probably fine, otherwise the resources would not run happily on a single node, so the problem had to be in the cluster itself. Having narrowed it down, I went back to testing the RAC heartbeat traffic, this time with a sustained ping (output below).
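Alongside the ping test below, it is worth confirming which subnet and interface Grid Infrastructure has registered as the cluster interconnect; a sketch with oifcfg (the interface names and subnets in the comment are placeholders, not values from this system):

# networks registered with the clusterware: public vs cluster_interconnect
oifcfg getif
# e.g. output of the form:  en0 10.0.0.0 global public  /  en1 192.168.10.0 global cluster_interconnect
# interfaces and subnets the node itself can currently see
oifcfg iflist -p -n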
ping 192.168.10.254 -t

Pinging 192.168.10.254 with 32 bytes of data:
Reply from 192.168.10.254: bytes=32 time=1ms TTL=255
Reply from 192.168.10.254: bytes=32 time=1ms TTL=255
Reply from 192.168.10.254: bytes=32 time=1ms TTL=255
Reply from 192.168.10.254: bytes=32 time=3ms TTL=255
Reply from 192.168.10.254: bytes=32 time=3ms TTL=255
Reply from 192.168.10.254: bytes=32 time=1ms TTL=255
Reply from 192.168.10.254: bytes=32 time=1ms TTL=255
Reply from 192.168.10.254: bytes=32 time=1ms TTL=255
Reply from 192.168.10.254: bytes=32 time=1ms TTL=255
Reply from 192.168.10.254: bytes=32 time=1ms TTL=255
Reply from 192.168.10.254: bytes=32 time=1ms TTL=255
Reply from 192.168.10.254: bytes=32 time=1ms TTL=255

Ping statistics for 192.168.10.254:
    Packets: Sent = 12, Received = 8, Lost = 4 (33% loss),
Approximate round trip times in milli-seconds:
    Minimum = 1ms, Maximum = 3ms, Average = 1ms
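Because the loss could come from the NICs themselves or from the cabling and switch in between, the AIX adapter statistics are also worth a look; a sketch, where ent1 is a placeholder for whichever adapter carries the heartbeat:

# adapter-level error, CRC and collision counters on the heartbeat NIC (as root)
entstat -d ent1 | egrep -i 'error|crc|collision|overrun'
# per-interface packet and error counts
netstat -in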
So packets were being lost even though the heartbeat address still answered. Could RAC be declaring the heartbeat dead because of this packet loss, producing something like a split-brain? Taken together with the "gipchaLowerProcessNode: no valid interfaces found to node for 2789821594 ms" entries in crsd.log, everything pointed at the RAC interconnect. I therefore connected the heartbeat NICs of nodes 1 and 2 through a dedicated, standalone switch, and sure enough CRS came up on both nodes and none of the cluster resources showed any problem. The crsd.log afterwards looked like this:
hostname:/grid/app/11.2/grid/log/jbdb2/crsd$tail -n 100 crsd.log
2015-01-15 17:19:06.666: [UiServer][11568] {2:24121:114} Done for ctx=11236f630
2015-01-15 17:19:20.233: [UiServer][11825] CS(11237a810)set Properties ( grid,11238c390)
2015-01-15 17:19:20.244: [UiServer][11568] {2:24121:115} Sending message to PE. ctx= 112004170
2015-01-15 17:19:20.300: [UiServer][11568] {2:24121:115} Done for ctx=112004170
2015-01-15 17:19:42.929: [ AGFW][10026] {2:24121:2} Received the reply to the message: RESOURCE_START[ora.jbdb.db 2 1] ID 4098:323 from the agent /grid/app/11.2/grid/bin/oraagent_oracle
2015-01-15 17:19:42.930: [ AGFW][10026] {2:24121:2} Agfw Proxy Server sending the reply to PE for message:RESOURCE_START[ora.jbdb.db 2 1] ID 4098:881
2015-01-15 17:19:42.932: [ AGFW][10026] {2:24121:2} Received the reply to the message: RESOURCE_START[ora.jbdb.db 2 1] ID 4098:323 from the agent /grid/app/11.2/grid/bin/oraagent_oracle
2015-01-15 17:19:42.933: [ AGFW][10026] {2:24121:2} Agfw Proxy Server sending the last reply to PE for message:RESOURCE_START[ora.jbdb.db 2 1] ID 4098:881
2015-01-15 17:19:42.946: [UiServer][11568] {2:24121:97} Container [ Name: ORDER MESSAGE: TextMessage[CRS-5702: Resource 'ora.jbdb.db' is already running on 'jbdb2'] MSGTYPE: TextMessage[1] OBJID: TextMessage[ora.jbdb.db 2 1] WAIT: TextMessage[0] ]
2015-01-15 17:19:42.946: [ COMMCRS][11568]clscsendx: (11237cbb0) Connection not active

hostname:/grid/app/11.2/grid/log/jbdb2/crsd$crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora.DATA.dg    ora....up.type ONLINE    ONLINE    jbdb1
ora....BASE.dg ora....up.type ONLINE    ONLINE    jbdb1
ora....ER.lsnr ora....er.type ONLINE    ONLINE    jbdb1
ora....N1.lsnr ora....er.type ONLINE    ONLINE    jbdb1
ora.asm        ora.asm.type   ONLINE    ONLINE    jbdb1
ora.cvu        ora.cvu.type   ONLINE    ONLINE    jbdb1
ora.gsd        ora.gsd.type   OFFLINE   OFFLINE
ora.jbdb.db    ora....se.type ONLINE    ONLINE    jbdb1
ora....SM1.asm application    ONLINE    ONLINE    jbdb1
ora....B1.lsnr application    ONLINE    ONLINE    jbdb1
ora.jbdb1.gsd  application    OFFLINE   OFFLINE
ora.jbdb1.ons  application    ONLINE    ONLINE    jbdb1
ora.jbdb1.vip  ora....t1.type ONLINE    ONLINE    jbdb1
ora....SM2.asm application    ONLINE    ONLINE    jbdb2
ora....B2.lsnr application    ONLINE    ONLINE    jbdb2
ora.jbdb2.gsd  application    OFFLINE   OFFLINE
ora.jbdb2.ons  application    ONLINE    ONLINE    jbdb2
ora.jbdb2.vip  ora....t1.type ONLINE    ONLINE    jbdb2
ora....network ora....rk.type ONLINE    ONLINE    jbdb1
ora.oc4j       ora.oc4j.type  ONLINE    ONLINE    jbdb1
ora.ons        ora.ons.type   ONLINE    ONLINE    jbdb1
ora....ry.acfs ora....fs.type ONLINE    ONLINE    jbdb1
ora.scan1.vip  ora....ip.type ONLINE    ONLINE    jbdb1

hostname:/grid/app/11.2/grid/log/jbdb2/crsd$crsctl stat res -t -init
--------------------------------------------------------------------------------
NAME                            TARGET  STATE         SERVER   STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm                       1 ONLINE  ONLINE        jbdb2    Started
ora.cluster_interconnect.haip 1 ONLINE  ONLINE        jbdb2
ora.crf                       1 ONLINE  ONLINE        jbdb2
ora.crsd                      1 ONLINE  ONLINE        jbdb2
ora.cssd                      1 ONLINE  ONLINE        jbdb2
ora.cssdmonitor               1 ONLINE  ONLINE        jbdb2
ora.ctssd                     1 ONLINE  ONLINE        jbdb2    OBSERVER
ora.diskmon                   1 OFFLINE OFFLINE
ora.drivers.acfs              1 ONLINE  ONLINE        jbdb2
ora.evmd                      1 ONLINE  ONLINE        jbdb2
ora.gipcd                     1 ONLINE  ONLINE        jbdb2
ora.gpnpd                     1 ONLINE  ONLINE        jbdb2
ora.mdnsd                     1 ONLINE  ONLINE        jbdb2

hostname:/grid/app/11.2/grid/log/jbdb1/crsd$tail -n 100 crsd.log
2015-01-15 17:19:18.657: [ CRSPE][12082] {2:24121:115} Processing PE command id=157. Description: [Stat Resource : 112df6db0]
2015-01-15 17:19:41.343: [ CRSPE][12082] {2:24121:2} Received reply to action [Start] message ID: 881
2015-01-15 17:19:41.346: [ CRSPE][12082] {2:24121:2} Received reply to action [Start] message ID: 881
2015-01-15 17:19:41.346: [ CRSPE][12082] {2:24121:2} RI [ora.jbdb.db 2 1] new internal state: [STABLE] old value: [STARTING]
2015-01-15 17:19:41.346: [ CRSPE][12082] {2:24121:2} RI [ora.jbdb.db 2 1] new external state [ONLINE] old value: [OFFLINE] on jbdb2 label = [Open]
2015-01-15 17:19:41.346: [ CRSPE][12082] {2:24121:2} Set State Details to [Open] from [ ] for [ora.jbdb.db 2 1]
2015-01-15 17:19:41.347: [ CRSPE][12082] {2:24121:2} CRS-2676: Start of 'ora.jbdb.db' on 'jbdb2' succeeded
2015-01-15 17:19:41.351: [ CRSRPT][12339] {2:24121:2} Published to EVM CRS_RESOURCE_STATE_CHANGE for ora.jbdb.db
2015-01-15 17:19:41.353: [ CRSPE][12082] {2:24121:97} Re-evaluation of queued op [START of [ora.jbdb.db 2 1] on [jbdb2] : local=0, unplanned=0114e51c50]. found it no longer needed:CRS-5702: Resource 'ora.jbdb.db' is already running on 'jbdb2' . Finishing the op.
2015-01-15 17:19:41.354: [ CRSPE][12082] {2:24121:97} PE Command [ Start Resource : 11200d6d0 ] has completed
2015-01-15 17:19:41.354: [ CRSPE][12082] {2:24121:97} UI Command [Start Resource : 11200d6d0] is replying to sender.
2015-01-15 17:19:41.357: [ CRSPE][12082] {2:24121:97} PE Command [ Server Join: Resource Startup {jbdb2 } : 11211a670 ] has completed
2015-01-15 17:19:41.357: [ AGFW][10797] {2:24121:97} Agfw Proxy Server received the message: CMD_COMPLETED[Proxy] ID 20482:930
2015-01-15 17:19:41.358: [ AGFW][10797] {2:24121:97} Agfw Proxy Server replying to the message: CMD_COMPLETED[Proxy] ID 20482:930
2015-01-15 17:19:41.358: [ CRSPE][12082] {2:24121:97} Updated state details for server jbdb2from [AUTOSTARTING RESOURCES] to : []
2015-01-15 17:19:41.358: [ CRSOCR][11054] {2:24121:97} Multi Write Batch processing...
2015-01-15 17:19:41.360: [ CRSRPT][12339] {2:24121:97} Published to EVM CRS_SERVER_STATE_CHANGE for jbdb2
2015-01-15 17:19:41.364: [ CRSOCR][11054] {2:24121:97} Multi Write Batch done.
2015-01-15 17:19:41.364: [ CRSPE][12082] {2:24121:97} Resource Autostart completed for jbdb2
2015-01-15 17:19:41.506: [ CRSPE][12082] {2:24121:116} Processing PE command id=158. Description: [Stat Resource : 1118a75b0]
2015-01-15 17:19:41.841: [ CRSPE][12082] {2:24121:117} Processing PE command id=159. Description: [Stat Resource : 11200c5b0]
2015-01-15 17:19:42.080: [ CRSPE][12082] {2:24121:118} Processing PE command id=160. Description: [Stat Resource : 1118a75b0]
2015-01-15 17:21:40.306: [UiServer][12596] CS(11200c570)set Properties ( root,11200def0)
From these logs and the CRS status it was clear that the cluster resources had recovered and started, and that the database, its instances and ASM were all running normally. Further investigation traced the root cause to a faulty patch panel in the heartbeat cabling between the two servers, which had been dropping interconnect packets. With everything resolved I could finally breathe out and get a decent night's sleep.
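Since the root cause turned out to be physical cabling, a lightweight periodic probe of interconnect packet loss is cheap insurance; a sketch that could run from cron on either node (the peer address is a placeholder for the other node's heartbeat IP):

#!/bin/sh
# probe the private interconnect and log any packet loss
PEER=192.168.10.2    # placeholder: the other node's heartbeat address
RESULT=$(ping -c 50 "$PEER" | grep "packet loss")
echo "$RESULT" | grep -q " 0% packet loss" || \
    echo "$(date): interconnect loss towards $PEER -> $RESULT" >> /tmp/rac_heartbeat.log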