RAC ASM Disk Path Failure Causes OCR and Voting Disk to Be Force-Dismounted
The financial trading system of 创金合信基金管理有限公司 runs on an Oracle database in a two-node RAC configuration: Oracle Database 11.2.0.4 for AMD64 on Red Hat Enterprise Linux 5.7 x86-64, hosted on physical DELL servers with 32 GB of memory each. After the close of the previous trading day, the operations team of the hosting data center asked the storage vendor to replace a memory module in a storage controller, without requesting that the Oracle RAC database and the Clusterware stack be stopped first. As a result, the pre-market inspection on the next trading day found that new database connections could no longer be established, although existing application-layer connections were unaffected and could still access the database.
Querying the status of the relevant cluster resources produced the output below. The crsctl stat res -t query itself failed, and the ora.crsd resource was down on both node 1 and node 2, so the status was taken from crsctl stat res -t -init:
+ASM1@prod1:/u01/app/oragrid/11.2.0.2/bin>./crsctl stat res -t -init
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  ONLINE       prod1                    Started
ora.cluster_interconnect.haip
      1        ONLINE  ONLINE       prod1
ora.crf
      1        ONLINE  ONLINE       prod1
ora.crsd
      1        ONLINE  OFFLINE
ora.cssd
      1        ONLINE  ONLINE       prod1
ora.cssdmonitor
      1        ONLINE  ONLINE       prod1
ora.ctssd
      1        ONLINE  ONLINE       prod1                    OBSERVER
ora.diskmon
      1        ONLINE  ONLINE       prod1
ora.drivers.acfs
      1        ONLINE  ONLINE       prod1
ora.evmd
      1        ONLINE  INTERMEDIATE prod1
ora.gipcd
      1        ONLINE  ONLINE       prod1
ora.gpnpd
      1        ONLINE  ONLINE       prod1
ora.mdnsd
      1        ONLINE  ONLINE       prod1
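To dig into why ora.crsd is OFFLINE, the full attribute set of that resource can be dumped from the same directory. This is a minimal sketch rather than output captured during the incident; the flags are standard 11.2 crsctl options:

./crsctl stat res ora.crsd -init -f     # full run-time attributes, including STATE_DETAILS
./crsctl stat res ora.crsd -init -p     # static profile: AUTO_START, dependencies, restart attempts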
Next we checked the state of the CRS daemons with the crsctl check crs command, which returned the error below: Cluster Ready Services could not be contacted.
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
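Because crsctl check crs aggregates several daemons, it can help to check the components one by one and confirm that only Cluster Ready Services is down while CSS and EVM are still alive. A minimal sketch using the standard 11.2 checks (not the output we captured at the time):

crsctl check css        # Cluster Synchronization Services (cssd)
crsctl check evm        # Event Manager (evmd)
crsctl check cluster    # CSS/CRS/EVM summary for the local node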
We then used the ocrcheck command to verify the Voting Disk and the Oracle Cluster Registry (OCR) and hit the following error; no verification information for the OCR or the Voting Disk could be obtained:
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command check failed, or completed with errors.
This tells us that the cluster can no longer validate the disk group that holds the Voting Disk and the OCR: if that disk group has failed or hit an internal error, ocrcheck has no way to return any verification information. Combined with the earlier observation that the ora.crsd resource (the Cluster Ready Services daemon) is not ONLINE, it is very likely that the cluster registry is no longer accessible. The natural next step is to look at the state of the underlying ASM disk groups (V$ASM_DISKGROUP):
SQL> select name, state from v$asm_diskgroup;
NAME                           STATE
------------------------------ -----------
DATADG                         MOUNTED
CRSDG                          DISMOUNTED
ASMCMD> ls
ASMCMD> lsdg +DATADG
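Since the ASM instance itself is still up, v$asm_disk can also be queried to see which member disks ASM can no longer read; disks belonging to the dismounted CRSDG typically fall back to group_number 0 with an abnormal header or mount status. A minimal sketch of such a query (not captured from the incident):

SQL> select group_number, path, header_status, mount_status, state from v$asm_disk order by group_number, path;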
From this we know that CRSDG, the disk group holding the Voting Disk and the cluster registry, is already DISMOUNTED at the level of the Oracle RAC ASM instance. The same information can be obtained from the ASMCMD command line with ls or lsdg. At the operating-system level, however, the disks behind the corresponding raw devices had not been dismounted, which suggests that either our multipath configuration had failed or the ASM devices were temporarily not visible to the RAC ASM instance. The crsd log gives us more detail on what actually happened.
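Before going further it is worth confirming, at the operating-system level, that the multipath devices behind CRSDG are present and that their ASM disk headers are intact. The following is a hedged sketch, assuming the device-mapper aliases ocrvote1/2/3 that appear in the /var/log/messages excerpt later in this section and the kfed tool shipped with Grid Infrastructure:

multipath -ll | grep -A 6 ocrvote     # topology and per-path state of the ocrvote* maps
ls -l /dev/mapper/ocrvote*            # device-mapper nodes for the OCR/Voting Disk LUNs

# Check that the ASM disk header on each device is still readable and valid
/u01/grid/11.2.0/bin/kfed read /dev/mapper/ocrvote1 | grep -E 'kfbh.type|kfdhdb.dskname|kfdhdb.grpname'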
The crsd.log file records exactly when the failure occurred and what errors were raised, and its timestamps match the maintenance window of the hosting data center: the engineers of 深圳证券通信公司 performed the maintenance between 16:10 and 17:30, crsd.log starts reporting device access failures at 16:13:15 (the /var/log/messages excerpt below shows the multipath paths failing at 16:12:57), and the problem was detected at the RAC ASM layer. The relevant log entries are as follows:
2016-04-30 16:13:15.564: [UiServer][2968028928]{1:53576:13269} Container [ Name: UI_STOP
        API_HDR_VER: TextMessage[2]
        ASYNC_TAG: TextMessage[1]
        CLIENT: TextMessage[]
        CLIENT_NAME: TextMessage[/u01/grid/11.2.0/bin/oracle]
        CLIENT_PID: TextMessage[3337]
        CLIENT_PRIMARY_GROUP: TextMessage[oinstall]
        EVENT_TAG: TextMessage[1]
        FILTER: TextMessage[(((NAME==ora.CRSDG.dg)&&(LAST_SERVER==rac1))&&(STATE!=OFFLINE))USR_ORA_OPI=true]
        FILTER_TAG: TextMessage[1]
        FORCE_TAG: TextMessage[1]
        LOCALE: TextMessage[AMERICAN_AMERICA.AL32UTF8]
        NO_WAIT_TAG: TextMessage[1]
        QUEUE_TAG: TextMessage[1]
]
2016-04-30 16:13:15.564: [UiServer][2968028928]{1:53576:13269} Sending message to PE. ctx= 0x7f6a1800c1c0, Client PID: 3337
2016-04-30 16:13:15.564: [ CRSPE][2970130176]{1:53576:13269} Cmd : 0x7f6a2412f480 : flags: EVENT_TAG | FORCE_TAG | QUEUE_TAG
2016-04-30 16:13:15.564: [ CRSPE][2970130176]{1:53576:13269} Processing PE command id=78811. Description: [Stop Resource : 0x7f6a2412f480]
2016-04-30 16:13:15.564: [ CRSPE][2970130176]{1:53576:13269} Expression Filter : (((NAME == ora.CRSDG.dg) AND (LAST_SERVER == rac1)) AND (STATE != OFFLINE))
2016-04-30 16:13:15.565: [ CRSPE][2970130176]{1:53576:13269} Expression Filter : (((NAME == ora.CRSDG.dg) AND (LAST_SERVER == rac1)) AND (STATE != OFFLINE))
2016-04-30 16:13:15.565: [ CRSPE][2970130176]{1:53576:13269} Attribute overrides for the command: USR_ORA_OPI = true;
2016-04-30 16:13:15.565: [ CRSPE][2970130176]{1:53576:13269} Filtering duplicate ops: server [] state [OFFLINE]
2016-04-30 16:13:15.565: [ CRSPE][2970130176]{1:53576:13269} Op 0x7f6a2410d8b0 has 5 WOs
2016-04-30 16:13:15.566: [ CRSPE][2970130176]{1:53576:13269} RI [ora.CRSDG.dg rac1 1] new target state: [OFFLINE] old value: [ONLINE]
2016-04-30 16:13:15.566: [ CRSOCR][2978535168]{1:53576:13269} Multi Write Batch processing...
2016-04-30 16:13:15.566: [ CRSPE][2970130176]{1:53576:13269} RI [ora.CRSDG.dg rac1 1] new internal state: [STOPPING] old value: [STABLE]
2016-04-30 16:13:15.566: [ CRSPE][2970130176]{1:53576:13269} Sending message to agfw: id = 1774249
2016-04-30 16:13:15.566: [ AGFW][2980636416]{1:53576:13269} Agfw Proxy Server received the message: RESOURCE_STOP[ora.CRSDG.dg rac1 1] ID 4099:1774249
2016-04-30 16:13:15.566: [ CRSPE][2970130176]{1:53576:13269} CRS-2673: Attempting to stop 'ora.CRSDG.dg' on 'rac1'
These entries show that the moment the data-center staff pulled the storage controller, the event was detected at the Oracle layer and recorded in crsd.log, and Oracle Clusterware decided that the CRSDG disk group had to be taken out of service to avoid unnecessary corruption: it issued a UI_STOP container command (Container [ Name: UI_STOP ...]) that stopped the related resource and dismounted the disk group. The log excerpt below supports this conclusion; the TextMessage lines emitted by the crsd daemon, quoted at the top of the excerpt, are particularly worth noting:
TextMessage[CRS-2673: Attempting to stop 'ora.CRSDG.dg' on 'rac1']
TextMessage[CRS-2677: Stop of 'ora.CRSDG.dg' on 'rac1' succeeded]

2016-04-30 16:13:15.566: [ AGFW][2980636416]{1:53576:13269} Agfw Proxy Server forwarding the message: RESOURCE_STOP[ora.CRSDG.dg rac1 1] ID 4099:1774249 to the agent /u01/grid/11.2.0/bin/oraagent_grid
2016-04-30 16:13:15.566: [UiServer][2968028928]{1:53576:13269} Container [ Name: ORDER
        MESSAGE: TextMessage[CRS-2673: Attempting to stop 'ora.CRSDG.dg' on 'rac1']
        MSGTYPE: TextMessage[3]
        OBJID: TextMessage[ora.CRSDG.dg rac1 1]
        WAIT: TextMessage[0]
]
2016-04-30 16:13:15.566: [ COMMCRS][2968028928]clscsendx: (0x7f6a5c0e9eb0) Connection not active
2016-04-30 16:13:15.566: [UiServer][2968028928]{1:53576:13269} CS(0x7f6a1c009ec0)Error sending msg over socket.6
2016-04-30 16:13:15.567: [UiServer][2968028928]{1:53576:13269} Communication exception sending reply back to client.FatalCommsException : Failed to send response to client. (File: clsMessageStream.cpp, line: 275
2016-04-30 16:13:15.568: [ AGFW][2980636416]{1:53576:13269} Received the reply to the message: RESOURCE_STOP[ora.CRSDG.dg rac1 1] ID 4099:1774250 from the agent /u01/grid/11.2.0/bin/oraagent_grid
2016-04-30 16:13:15.569: [ AGFW][2980636416]{1:53576:13269} Agfw Proxy Server sending the reply to PE for message:RESOURCE_STOP[ora.CRSDG.dg rac1 1] ID 4099:1774249
2016-04-30 16:13:15.569: [ CRSPE][2970130176]{1:53576:13269} Received reply to action [Stop] message ID: 1774249
2016-04-30 16:13:15.587: [ AGFW][2980636416]{1:53576:13269} Received the reply to the message: RESOURCE_STOP[ora.CRSDG.dg rac1 1] ID 4099:1774250 from the agent /u01/grid/11.2.0/bin/oraagent_grid
2016-04-30 16:13:15.587: [ AGFW][2980636416]{1:53576:13269} Agfw Proxy Server sending the last reply to PE for message:RESOURCE_STOP[ora.CRSDG.dg rac1 1] ID 4099:1774249
2016-04-30 16:13:15.587: [ CRSPE][2970130176]{1:53576:13269} Received reply to action [Stop] message ID: 1774249
2016-04-30 16:13:15.587: [ CRSPE][2970130176]{1:53576:13269} RI [ora.CRSDG.dg rac1 1] new internal state: [STABLE] old value: [STOPPING]
2016-04-30 16:13:15.588: [ CRSPE][2970130176]{1:53576:13269} RI [ora.CRSDG.dg rac1 1] new external state [OFFLINE] old value: [ONLINE] label = []
2016-04-30 16:13:15.588: [ CRSPE][2970130176]{1:53576:13269} CRS-2677: Stop of 'ora.CRSDG.dg' on 'rac1' succeeded
2016-04-30 16:13:15.588: [ CRSRPT][2968028928]{1:53576:13269} Published to EVM CRS_RESOURCE_STATE_CHANGE for ora.CRSDG.dg
2016-04-30 16:13:15.588: [UiServer][2968028928]{1:53576:13269} Container [ Name: ORDER
        MESSAGE: TextMessage[CRS-2677: Stop of 'ora.CRSDG.dg' on 'rac1' succeeded]
        MSGTYPE: TextMessage[3]
        OBJID: TextMessage[ora.CRSDG.dg rac1 1]
        WAIT: TextMessage[0]
]
2016-04-30 16:13:15.588: [UiServer][2968028928]{1:53576:13269} CS(0x7f6a1c009ec0)No connection to client.6
2016-04-30 16:13:15.588: [UiServer][2968028928]{1:53576:13269} Communication exception sending reply back to client.FatalCommsException : Failed to send response to client. (File: clsMessageStream.cpp, line: 275
The entry confirming that the STOP of the ora.CRSDG.dg resource completed is shown below; it uses the same container mechanism, this time with UI_DATA instead of UI_STOP:
2016-04-30 16:13:15.590: [UiServer][2968028928]{1:53576:13269} Container [ Name: UI_DATA
        ora.CRSDG.dg rac1 1: TextMessage[0]
]
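For reference, these UI_STOP/UI_DATA containers and the CRS-2673/CRS-2677 messages are easy to locate by searching the crsd log directly; a minimal sketch, using the log directory of this installation (/u01/grid/11.2.0/log/rac1/crsd, as also seen in crsdOUT.log below):

grep -n "ora.CRSDG.dg" /u01/grid/11.2.0/log/rac1/crsd/crsd.log | head -20
grep -nE "UI_STOP|UI_DATA" /u01/grid/11.2.0/log/rac1/crsd/crsd.log | head -20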
We can also look at crsdOUT.log to see whether the clusterware's subsequent behaviour was abnormal. The abnormal entries are easy to spot there, because this log records only the directory change made when starting the crsd daemon and the final result of each attempt; it contains no detailed trace information, so it is short and unambiguous.
2016-04-30 18:36:52
CRSD REBOOT
CRSD exiting: Could not init OCR, code: 26
2016-04-30 18:36:54
Changing directory to /u01/grid/11.2.0/log/rac1/crsd
2016-04-30 18:36:54
CRSD REBOOT
CRSD exiting: Could not init OCR, code: 26
2016-04-30 18:36:56
Changing directory to /u01/grid/11.2.0/log/rac1/crsd
2016-04-30 18:36:56
CRSD REBOOT
Clearly the CRSD (Cluster Ready Services) daemon keeps trying to initialise the OCR and keeps failing, because, as the earlier log entries showed, the ASM disk group holding the OCR is already DISMOUNTED. The operating-system log /var/log/messages, excerpted below, confirms our hypothesis that the multipath-bound raw devices behind the dismounted disk group had failed:
Apr 30 16:12:57 rac1 kernel: rport-8:0-1: blocked FC remote port time out: removing target and saving binding
Apr 30 16:12:57 rac1 kernel: sd 8:0:1:1: rejecting I/O to offline device
Apr 30 16:12:57 rac1 kernel: sd 8:0:1:2: rejecting I/O to offline device
Apr 30 16:12:57 rac1 kernel: sd 8:0:1:3: rejecting I/O to offline device
Apr 30 16:12:57 rac1 kernel: device-mapper: multipath: Failing path 8:224.
Apr 30 16:12:57 rac1 kernel: device-mapper: multipath: Failing path 8:192.
Apr 30 16:12:57 rac1 kernel: device-mapper: multipath: Failing path 8:208.
Apr 30 16:12:57 rac1 kernel: scsi 8:0:1:0: rejecting I/O to offline device
Apr 30 16:12:57 rac1 kernel: scsi 8:0:1:0: [sdl] killing request
Apr 30 16:12:57 rac1 kernel: device-mapper: multipath: Failing path 8:176.
Apr 30 16:12:57 rac1 kernel: scsi 8:0:1:0: [sdl] Unhandled error code
Apr 30 16:12:57 rac1 kernel: scsi 8:0:1:0: [sdl] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Apr 30 16:12:57 rac1 kernel: scsi 8:0:1:0: [sdl] CDB: Read(10): 28 00 00 03 50 60 00 00 20 00
Apr 30 16:12:57 rac1 kernel: rport-9:0-0: blocked FC remote port time out: removing target and saving binding
Apr 30 16:12:57 rac1 kernel: sd 9:0:0:0: rejecting I/O to offline device
Apr 30 16:12:57 rac1 multipathd: ocrvote1: load table [0 4194304 multipath 1 queue_if_no_path 1 alua 2 1 round-robin 0 3 1 8:112 1 8:192 1 65:16 1 round-robin 0 1 1 8:32 1]
Apr 30 16:12:57 rac1 kernel: sd 9:0:0:0: rejecting I/O to offline device
Apr 30 16:12:57 rac1 kernel: device-mapper: multipath: Failing path 8:96.
Apr 30 16:12:57 rac1 kernel: sd 9:0:0:0: rejecting I/O to offline device
Apr 30 16:12:57 rac1 kernel: sd 9:0:0:0: rejecting I/O to offline device
Apr 30 16:12:57 rac1 kernel: sd 9:0:0:1: rejecting I/O to offline device
Apr 30 16:12:57 rac1 kernel: sd 9:0:0:1: rejecting I/O to offline device
Apr 30 16:12:57 rac1 kernel: sd 9:0:0:3: rejecting I/O to offline device
Apr 30 16:12:57 rac1 kernel: sd 9:0:0:3: rejecting I/O to offline device
Apr 30 16:12:57 rac1 kernel: sd 9:0:0:3: rejecting I/O to offline device
Apr 30 16:12:57 rac1 kernel: device-mapper: multipath: Failing path 8:144.
Apr 30 16:12:57 rac1 kernel: device-mapper: multipath: Failing path 8:112.
Apr 30 16:12:57 rac1 kernel: sd 9:0:0:1: rejecting I/O to offline device
Apr 30 16:12:57 rac1 kernel: sd 9:0:0:1: alua: rtpg failed with 10000
Apr 30 16:12:57 rac1 kernel: device-mapper: multipath: Failing path 8:112.
Apr 30 16:12:57 rac1 kernel: device-mapper: multipath: Could not failover the device: Handler scsi_dh_alua Error 15.
Apr 30 16:12:57 rac1 kernel: device-mapper: multipath: Failing path 8:192.
Apr 30 16:12:57 rac1 kernel: sd 9:0:1:1: alua: port group 03 state A non-preferred supports TolUsNA
Apr 30 16:12:57 rac1 kernel: sd 9:0:0:1: rejecting I/O to offline device
Apr 30 16:12:57 rac1 multipathd: 8:144: mark as failed
Apr 30 16:12:57 rac1 multipathd: ocrvote3: remaining active paths: 3
Apr 30 16:12:57 rac1 multipathd: 8:224: mark as failed
Apr 30 16:12:57 rac1 multipathd: ocrvote3: remaining active paths: 2
Apr 30 16:12:57 rac1 multipathd: 8:208: mark as failed
Apr 30 16:12:57 rac1 multipathd: ocrvote2: remaining active paths: 3
Apr 30 16:12:57 rac1 multipathd: sdl: remove path (uevent)
Apr 30 16:12:57 rac1 kernel: sd 9:0:0:1: alua: rtpg failed with 10000
Apr 30 16:12:57 rac1 kernel: device-mapper: multipath: Failing path 8:112.
Apr 30 16:12:57 rac1 kernel: device-mapper: multipath: Could not failover the device: Handler scsi_dh_alua Error 15.
Apr 30 16:12:57 rac1 kernel: device-mapper: multipath: Failing path 8:192.
Apr 30 16:12:57 rac1 multipathd: data1: load table [0 482402304 multipath 1 queue_if_no_path 1 alua 2 1 round-robin 0 1 1 8:96 1 round-robin 0 2 1 8:16 1 65:0 1]
Apr 30 16:12:57 rac1 multipathd: sdl: path removed from map data1
Apr 30 16:12:57 rac1 multipathd: 8:112: mark as failed
Apr 30 16:12:57 rac1 multipathd: ocrvote1: remaining active paths: 3
Apr 30 16:12:57 rac1 multipathd: 8:192: mark as failed
Apr 30 16:12:57 rac1 multipathd: ocrvote1: remaining active paths: 2
Apr 30 16:12:57 rac1 multipathd: sdm: remove path (uevent)
Apr 30 16:12:57 rac1 kernel: device-mapper: multipath: Could not failover the device: Handler scsi_dh_alua Error 15.
Apr 30 16:12:57 rac1 kernel: scsi 8:0:1:0: alua: Detached
Apr 30 16:12:57 rac1 kernel: sd 9:0:0:2: rejecting I/O to offline device
Apr 30 16:12:57 rac1 kernel: sd 9:0:0:2: [sdi] killing request
Apr 30 16:12:57 rac1 kernel: sd 9:0:0:2: rejecting I/O to offline device
Apr 30 16:12:57 rac1 kernel: sd 9:0:0:2: [sdi] Unhandled error code
Apr 30 16:12:57 rac1 kernel: device-mapper: multipath: Failing path 8:128.
Apr 30 16:12:57 rac1 kernel: sd 9:0:0:2: [sdi] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Apr 30 16:12:57 rac1 kernel: sd 9:0:0:2: [sdi] CDB: Write(10): 2a 00 00 08 00 11 00 00 01 00
Apr 30 16:12:57 rac1 kernel: device-mapper: multipath: Failing path 8:96.
Apr 30 16:12:57 rac1 kernel: sd 8:0:0:0: alua: port group 03 state A non-preferred supports TolUsNA
Apr 30 16:12:57 rac1 kernel: sd 9:0:1:0: alua: port group 03 state A non-preferred supports TolUsNA
Apr 30 16:12:57 rac1 kernel: sd 8:0:0:2: alua: port group 03 state A non-preferred supports TolUsNA
Apr 30 16:12:57 rac1 kernel: sd 9:0:1:2: alua: port group 03 state A non-preferred supports TolUsNA
Apr 30 16:12:57 rac1 kernel: sd 8:0:0:3: alua: port group 03 state A non-preferred supports TolUsNA
Apr 30 16:12:57 rac1 kernel: sd 9:0:1:3: alua: port group 03 state A non-preferred supports TolUsNA
Apr 30 16:12:57 rac1 multipathd: ocrvote1: load table [0 4194304 multipath 1 queue_if_no_path 1 alua 2 1 round-robin 0 2 1 8:112 1 65:16 1 round-robin 0 1 1 8:32 1]
Apr 30 16:12:57 rac1 multipathd: sdm: path removed from map ocrvote1
Apr 30 16:12:57 rac1 multipathd: sdg: remove path (uevent)
Apr 30 16:12:57 rac1 kernel: device-mapper: multipath: Could not failover the device: Handler scsi_dh_alua Error 15.
Apr 30 16:12:57 rac1 kernel: scsi 8:0:1:1: alua: Detached
Apr 30 16:12:57 rac1 kernel: device-mapper: multipath: Failing path 8:112.
Apr 30 16:12:57 rac1 kernel: sd 9:0:1:1: alua: port group 03 state A non-preferred supports TolUsNA
Apr 30 16:12:57 rac1 multipathd: data1: load table [0 482402304 multipath 1 queue_if_no_path 1 alua 1 1 round-robin 0 2 1 8:16 1 65:0 1]
Apr 30 16:12:57 rac1 multipathd: sdg: path removed from map data1
Apr 30 16:12:57 rac1 multipathd: sdn: remove path (uevent)
Apr 30 16:12:57 rac1 kernel: scsi 9:0:0:0: alua: Detached
Apr 30 16:12:57 rac1 kernel: device-mapper: table: 253:2: multipath: error getting device
Apr 30 16:12:57 rac1 kernel: device-mapper: ioctl: error adding target to table
Apr 30 16:12:57 rac1 kernel: sd 8:0:0:0: alua: port group 03 state A non-preferred supports TolUsNA
Apr 30 16:12:57 rac1 kernel: sd 9:0:1:0: alua: port group 03 state A non-preferred supports TolUsNA
Apr 30 16:12:57 rac1 kernel: device-mapper: table: 253:2: multipath: error getting device
Apr 30 16:12:57 rac1 multipathd: ocrvote2: failed in domap for removal of path sdn
Apr 30 16:12:57 rac1 multipathd: uevent trigger error
Apr 30 16:12:57 rac1 multipathd: sdh: remove path (uevent)
Apr 30 16:12:57 rac1 kernel: device-mapper: ioctl: error adding target to table
Apr 30 16:12:57 rac1 kernel: scsi 9:0:0:1: alua: Detached
Apr 30 16:12:57 rac1 multipathd: ocrvote1: load table [0 4194304 multipath 1 queue_if_no_path 1 alua 2 1 round-robin 0 1 1 65:16 1 round-robin 0 1 1 8:32 1]
Apr 30 16:12:57 rac1 multipathd: sdh: path removed from map ocrvote1
Apr 30 16:12:57 rac1 multipathd: sdo: remove path (uevent)
Apr 30 16:12:57 rac1 kernel: sd 9:0:1:1: alua: port group 03 state A non-preferred supports TolUsNA
Apr 30 16:12:57 rac1 kernel: device-mapper: table: 253:3: multipath: error getting device
Apr 30 16:12:57 rac1 kernel: device-mapper: ioctl: error adding target to table
Apr 30 16:12:57 rac1 kernel: device-mapper: table: 253:3: multipath: error getting device
Apr 30 16:12:57 rac1 kernel: device-mapper: ioctl: error adding target to table
Apr 30 16:12:57 rac1 multipathd: ocrvote3: failed in domap for removal of path sdo
Apr 30 16:12:57 rac1 multipathd: uevent trigger error
Apr 30 16:12:57 rac1 multipathd: sdp: remove path (uevent)
Apr 30 16:12:57 rac1 kernel: device-mapper: table: 253:4: multipath: error getting device
Apr 30 16:12:57 rac1 kernel: device-mapper: ioctl: error adding target to table
Apr 30 16:12:57 rac1 multipathd: data2: failed in domap for removal of path sdp
Apr 30 16:12:57 rac1 multipathd: uevent trigger error
Apr 30 16:12:57 rac1 multipathd: sdi: remove path (uevent)
Apr 30 16:12:57 rac1 multipathd: ocrvote2: load table [0 4194304 multipath 1 queue_if_no_path 1 alua 1 1 round-robin 0 2 1 8:48 1 65:32 1]
Apr 30 16:12:57 rac1 multipathd: sdi: path removed from map ocrvote2
Apr 30 16:12:57 rac1 multipathd: sdj: remove path (uevent)
Apr 30 16:12:57 rac1 kernel: device-mapper: table: 253:4: multipath: error getting device
Apr 30 16:12:57 rac1 kernel: device-mapper: ioctl: error adding target to table
Apr 30 16:12:57 rac1 kernel: scsi 9:0:0:2: alua: Detached
Apr 30 16:12:57 rac1 kernel: scsi 8:0:1:2: alua: Detached
This raises an obvious question: why did ora.crsd go down while ora.cssd never went OFFLINE (crsctl stat res -t -init confirms that ora.cssd stayed up, the database instances kept running, and no node was evicted)? The reason is that the disks behind the OCR and Voting Disk were inaccessible only briefly: the cssd process reads the three ASM disks that carry the voting files directly and does not require the OCR/Voting Disk disk group to be MOUNTED, and the default Clusterware voting-disk heartbeat timeout (disktimeout) is 200 seconds, so cssd rode through the interruption without trouble.
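Both points can be verified against the surviving CSS daemon itself; a minimal sketch of the relevant 11.2 commands (run from $GRID_HOME/bin):

crsctl query css votedisk     # lists the voting files and the ASM disks that carry them
crsctl get css disktimeout    # voting-disk I/O timeout, 200 seconds by default
crsctl get css misscount      # network heartbeat timeout, 30 seconds by default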
If the devices bound through multipath are not recognised and available at the operating-system level, the first step is to get those devices back, either by rebooting the database server or by some other means of restoring the raw-device bindings (if raw devices are used to bind the ASM disk names); in most cases the fault can be cleared by rebooting the host or restarting the HAS stack. There are, however, many cases in which the ASM disk headers of CRSDG are corrupted, or the OCR or Voting Disk data has been lost; CRSDG then cannot simply be remounted, and a different recovery approach has to be found.
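In this incident the paths came back once the controller maintenance was finished, so the lightest-weight recovery is to remount CRSDG in the ASM instance and restart crsd, falling back to a full stack restart if that is not enough. The following is a hedged sketch of both options, assuming the disk group and its disks are intact; it is not presented as the exact procedure executed at the site:

# Option 1: remount the disk group as the grid owner, then start crsd (as root)
sqlplus / as sysasm <<'EOF'
alter diskgroup CRSDG mount;
exit
EOF
crsctl start res ora.crsd -init

# Option 2: restart the whole Grid Infrastructure stack on this node (as root)
crsctl stop crs -f
crsctl start crs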