GRID Software Fails to Start After an Abnormal Shutdown of an Oracle RAC Cluster

Problem:

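On 2019/12/12, a storage maintenance vendor was expanding the ASM disks of a customer's Linux + Oracle 11.2.0.4 RAC environment. The plan was to add new LUNs online, without rebooting the database's Linux hosts. During the operation (the exact steps taken are unknown), the ASM disks began reporting I/O errors, and the hosts were then forcibly rebooted. After the reboot, the GRID software would not start, and we were called in for emergency support...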

Analysis:

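First, I logged in to the environment and ran crsctl stat res -t -init to check the startup state of the cluster's init resources. The ASM resource was failing during startup.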
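When many resources are listed, it can help to filter for the ones that have not reached their target state. The following is a minimal sketch, not part of the original troubleshooting. It assumes the output was captured from crsctl stat res -init without -t, and that this form prints NAME=/TARGET=/STATE= attribute blocks separated by blank lines; the function names are hypothetical.

```python
def parse_resources(text):
    """Split NAME=/TARGET=/STATE= attribute blocks into a list of dicts.

    Assumption: one KEY=VALUE attribute per line, blocks separated by
    blank lines or by the next NAME= line.
    """
    resources, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            if current:
                resources.append(current)
                current = {}
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # ignore lines that are not KEY=VALUE
        if key == "NAME" and current:
            resources.append(current)
            current = {}
        current[key] = value
    if current:
        resources.append(current)
    return resources


def not_at_target(resources):
    """Return names of resources whose TARGET is ONLINE but whose STATE is not."""
    return [
        r.get("NAME")
        for r in resources
        if r.get("TARGET", "").startswith("ONLINE")
        and not r.get("STATE", "").startswith("ONLINE")
    ]
```

In an incident like this one, such a filter would be expected to single out the ASM resource, which is the one that failed to start.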

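The ASM alert log showed the instance hanging after reconfiguration. Based on the trace files referenced by the errors, I initially suspected the cluster's private interconnect (the heartbeat network). However, the clusterware logs showed that CSSD had started and joined the cluster successfully, and the ASM instance only hung after "Reconfiguration complete". I checked each scenario described in the MOS notes listed below, one by one, but none of them resolved the problem, and for a while I was stuck...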
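To compare the alert logs of both nodes quickly, the key messages can be pulled out with a small script. This is a minimal sketch rather than the method used during the incident: the three patterns are taken from the log excerpts at the end of this article, while the function name and the label names are hypothetical.

```python
import re

# Message fragments copied from the ASM alert-log excerpts in this article.
PATTERNS = {
    "hung_detected": re.compile(r"detects hung instances during IMR reconfiguration"),
    "eviction_notice": re.compile(
        r"received an instance eviction notification from instance (\d+)"
    ),
    "terminated": re.compile(r"terminating the instance due to error (\d+)"),
}


def scan_alert_log(lines):
    """Return (line_number, label, captured_value) for every matching line.

    captured_value is the instance or error number when the pattern
    captures one, otherwise None.
    """
    findings = []
    for lineno, line in enumerate(lines, start=1):
        for label, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                value = match.group(1) if match.groups() else None
                findings.append((lineno, label, value))
    return findings
```

Run against each node's alert log, this shows at a glance which instance requested the eviction and which error number terminated the other one.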

MOS references:

ASM on Non-First Node (Second or Others) Fails to Start: PMON (ospid: nnnn): terminating the instance due to error 481 (Doc ID 1383737.1)
Top 5 Issues That Cause Node Reboots, Evictions, or Unexpected CRS Restarts (Doc ID 1524455.1)
Top 5 Grid Infrastructure Startup Issues (Doc ID 1526147.1)

Resolution:

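A colleague then joined the effort and found that the failed node could start once the surviving node was temporarily shut down. In other words, only one node could be up at a time, which again pointed to a cluster communication problem. We then cleared the socket files under /var/tmp/.oracle and restarted the clusterware, after which everything returned to normal. From this we can infer that the forced power-off during the I/O errors caused by the disk addition left the socket files under /var/tmp/.oracle in a bad state (normally the clusterware recreates these socket files when it starts).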
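The cleanup step can be sketched as a small function that lists the files first and deletes them only when told to. This is an illustration under stated assumptions, not the exact commands used on site: the function name and the dry_run parameter are hypothetical, and the clusterware is assumed to be stopped on the node before anything is removed.

```python
import os
import stat


def clear_socket_files(directory, dry_run=True):
    """Remove socket and regular files directly under `directory`.

    Returns the sorted names of the files that were (or, with dry_run=True,
    would be) removed. Subdirectories are left untouched.
    """
    removed = []
    for entry in sorted(os.listdir(directory)):
        path = os.path.join(directory, entry)
        mode = os.lstat(path).st_mode
        if stat.S_ISSOCK(mode) or stat.S_ISREG(mode):
            if not dry_run:
                os.unlink(path)
            removed.append(entry)
    return removed
```

For the case described here, the directory would be /var/tmp/.oracle. Calling the function once with the default dry_run=True gives a chance to review the list before running it again with dry_run=False and then restarting the clusterware.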
 

ASM instance logs
Node 1:

Reconfiguration complete
Thu Dec 12 15:50:06 2019
LMON (ospid: 8523) detects hung instances during IMR reconfiguration
LMON (ospid: 8523) tries to kill the instance 2 in 37 seconds.
Please check instance 2's alert log and LMON trace file for more details.
LMON (ospid: 8523) aborts 1 previously scheduled instance kills



Node 2:

MMNL started with pid=21, OS id=26248 
lmon registered with NM - instance number 2 (internal mem no 1)
Thu Dec 12 16:24:08 2019
LMON received an instance eviction notification from instance 1
The instance eviction reason is 0x20000000
The instance eviction map is 2 
Thu Dec 12 16:24:11 2019
PMON (ospid: 26206): terminating the instance due to error 481
Thu Dec 12 16:24:11 2019
ORA-1092 : opitsk aborting process
Thu Dec 12 16:24:13 2019
System state dump requested by (instance=2, osid=26206 (PMON)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_diag_26216_20191212162413.trc
Dumping diagnostic data in directory=[cdmp_20191212162411], requested by (instance=2, osid=26206 (PMON)), summary=[abnormal instance termination].
Instance terminated by PMON, pid = 26206
Thu Dec 12 16:24:33 2019
NOTE: No asm libraries found in the system
MEMORY_TARGET defaulting to 1128267776.
* instance_number obtained from CSS = 2, checking for the existence of node 0... 
* node 0 does not exist. instance_number = 2 
Starting ORACLE instance (normal)

 
