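1. Symptoms
Running crsctl to check the status of cluster resources fails immediately on every node with CRS-4535 and CRS-4000, yet the database itself can still be accessed normally.
The details are as follows:
# Query on node 1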
grid@bjdb1:/home/grid>crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
# Query on node 2
root@bjdb2:/>crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
Likewise, crs_stat -t fails, this time with error code CRS-0184:
root@bjdb1:/>crs_stat -t
CRS-0184: Cannot communicate with the CRS daemon.
Node 2 behaves the same way.
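When crsctl stat res -t cannot reach CRSD, it can help to see which daemons of the stack are still running. The sketch below wraps two standard crsctl commands; the function name check_lower_stack is hypothetical, and it assumes crsctl is on the PATH of the user running it (typically grid or root).

```shell
#!/bin/sh
# Hypothetical helper: show which Clusterware daemons are still reachable.
# Assumes crsctl is on PATH; returns 127 when it is not.
check_lower_stack() {
    if ! command -v crsctl >/dev/null 2>&1; then
        echo "crsctl not found in PATH" >&2
        return 127
    fi
    # Health of OHAS, CRS, CSS and EVM on the local node.
    crsctl check crs
    # Resources managed by ohasd (the lower stack), which do not depend on CRSD.
    crsctl stat res -t -init
}

# Usage on a cluster node:
#   check_lower_stack
```

If CSS and ASM are still online while only CRSD is unreachable, that would be consistent with the database remaining accessible as observed here.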
Confirm that the database is still accessible at this point:
# On node 2, simulate a client login to the RAC cluster through the SCAN IP; the database can be reached normally
oracle@bjdb2:/home/oracle>sqlplus jingyu/[email protected]/bjdb
SQL*Plus: Release 11.2.0.4.0 Production on Mon Oct 10 14:24:47 2016
Copyright (c) 1982, 2013, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
SQL>
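2. Locating the problem
Start with the cluster logs on node 1:
Clusterware (GI) logs are stored under $GRID_HOME/log/<nodename>;
Clusterware (GI) has several key background daemons (css, crs, evm), and their logs are kept in the cssd, crsd, and evmd directories respectively.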
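As a minimal sketch of the layout just described, the helper below prints where these logs would normally sit for a given Grid home and node name. The function name gi_log_paths is hypothetical, and the file names ocssd.log, crsd.log, and evmd.log are the usual 11.2 names, which should be confirmed on your own system.

```shell
#!/bin/sh
# Hypothetical helper: print the usual 11.2 Clusterware log locations.
# $1 = Grid home, $2 = node name (both supplied by the caller).
gi_log_paths() {
    grid_home=$1
    node=$2
    base="$grid_home/log/$node"
    printf '%s\n' \
        "$base/alert$node.log" \
        "$base/cssd/ocssd.log" \
        "$base/crsd/crsd.log" \
        "$base/evmd/evmd.log"
}

# Example with the Grid home and node name seen in this post.
gi_log_paths /opt/u01/app/11.2.0/grid bjdb1
```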
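Logs on node 1:
# Check the GI alert log. The most recent entries only warn that the storage used by the Grid Infrastructure Management Repository is nearly full. It can be cleaned up later, and some free space still remains, so this is clearly not the cause of the failure.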
root@bjdb1:/opt/u01/app/11.2.0/grid/log/bjdb1>tail -f alert*.log
2016-10-10 14:18:26.125:
[crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
2016-10-10 14:23:31.125:
[crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
2016-10-10 14:28:36.125:
[crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
2016-10-10 14:33:41.125:
[crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
2016-10-10 14:38:46.125:
[crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
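Because the same CRS-9520 warning repeats every five minutes, it can drown out other messages when tailing the alert log. One way to get an overview is to count how often each message code appears. This is only a sketch: the function name count_crs_codes is hypothetical, and the sample input is taken from the entries above.

```shell
#!/bin/sh
# Hypothetical helper: count occurrences of each CRS-nnnn code read from stdin,
# most frequent first.
count_crs_codes() {
    awk '
        {
            line = $0
            # Walk through every CRS-nnnn code on the line.
            while (match(line, /CRS-[0-9]+/)) {
                seen[substr(line, RSTART, RLENGTH)]++
                line = substr(line, RSTART + RLENGTH)
            }
        }
        END { for (code in seen) print seen[code], code }
    ' | sort -rn
}

# Demo on two of the entries shown above.
count_crs_codes <<'EOF'
2016-10-10 14:18:26.125:
[crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full.
2016-10-10 14:23:31.125:
[crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full.
EOF

# Usage against a real alert log:
#   tail -n 500 "$GRID_HOME"/log/bjdb1/alert*.log | count_crs_codes
```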
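# Because crsctl is unusable, check the crsd log next. Errors were already logged on the 3rd (2016-10-03): the raw device could not be opened, so the OCR could not be initialized. The error messages that follow show that access to the shared storage failed at that moment. This points to a storage problem at that time, so confirm with the on-site staff whether any storage-related work was under way at that point.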
root@bjdb1:/opt/u01/app/11.2.0/grid/log/bjdb1/crsd>tail -f crsd.log
2016-10-03 18:04:40.248: [OCRRAW][1]proprinit: Could not open raw device
2016-10-03 18:04:40.248: [OCRASM][1]proprasmcl: asmhandle is NULL
2016-10-03 18:04:40.252: [OCRAPI][1]a_init:16!: Backend init unsuccessful : [26]
2016-10-03 18:04:40.253: [CRSOCR][1] OCR context init failure. Error: PROC-26: Error while accessing the physical storage
2016-10-03 18:04:40.253: [CRSD][1] Created alert : (:CRSD00111:) : Could not init OCR, error: PROC-26: Error while accessing the physical storage
2016-10-03 18:04:40.253: [CRSD][1][PANIC] CRSD exiting: Could not init OCR, code: 26
2016-10-03 18:04:40.253: [CRSD][1] Done.
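To find when CRSD first failed to initialize the OCR, rather than relying on the tail of the file, the log can be filtered for the messages seen above. This is a minimal sketch: the function name find_ocr_failures is hypothetical, and the pattern list is taken only from the entries in this log, so it is not an exhaustive set of OCR errors.

```shell
#!/bin/sh
# Hypothetical helper: print crsd.log lines (from stdin) that indicate an
# OCR initialization failure. Exit status is non-zero when nothing matches.
find_ocr_failures() {
    grep -E 'PROC-26|Could not open raw device|Could not init OCR|CRSD exiting'
}

# Demo on two entries from the log above; only the first one matches.
find_ocr_failures <<'EOF'
2016-10-03 18:04:40.248: [OCRRAW][1]proprinit: Could not open raw device
2016-10-03 18:04:40.253: [CRSD][1] Done.
EOF

# Usage against the real file, showing the earliest matches first:
#   find_ocr_failures < "$GRID_HOME"/log/bjdb1/crsd/crsd.log | head
```

The timestamp of the earliest match is the point in time to raise with the storage team.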
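Logs on node 2:
# Check the GI alert log. On node 2, ctssd reports CRS-2409. According to MOS Doc ID 1135337.1, "This is not an error. ctssd is reporting that there is a time difference and it is not doing anything about it as it is running in observer mode."
# The note only asks you to check whether the clocks on the two nodes agree, and in this case the node clocks were in fact consistent: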
root@bjdb2:/opt/u01/app/11.2.0/grid/log/bjdb2>tail -f alert*.log
2016-10-10 12:29:22.145:
[ctssd(5243030)]CRS-2409:Th