Oracle 11g RAC: resolving the CRS-4535 and CRS-4000 errors

1. Symptoms

When crsctl is used to check the status of the cluster resources, either node immediately returns CRS-4535 and CRS-4000; the database itself, however, can still be accessed normally at this time.

The specific symptoms are as follows:

# Query on node 1

grid@bjdb1:/home/grid>crsctl stat res -t

CRS-4535: Cannot communicate with Cluster Ready Services

CRS-4000: Command Status failed, or completed with errors.

# Query on node 2

root@bjdb2:/>crsctl stat res -t

CRS-4535: Cannot communicate with Cluster Ready Services

CRS-4000: Command Status failed, or completed with errors.

Similarly, crs_stat -t fails as well, this time with error code CRS-0184:

root@bjdb1:/>crs_stat -t

CRS-0184: Cannot communicate with the CRS daemon.

Node 2 shows exactly the same error.

Confirm that the database can still be accessed normally at this point, as shown below:

# From node 2, simulate a client login to the RAC cluster through the SCAN IP; the database can be accessed normally

oracle@bjdb2:/home/oracle>sqlplus jingyu/[email protected]/bjdb

SQL*Plus: Release 11.2.0.4.0 Production on Mon Oct 10 14:24:47 2016

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management,
OLAP, Data Mining and Real Application Testing options

SQL>

2. Locating the problem

First, check the cluster-related logs on node 1:

Clusterware (GI) logs are stored under $GRID_HOME/log/nodename. The key GI background daemons are css, crs, and evm, and their logs live in the cssd, crsd, and evmd subdirectories respectively; a quick way to browse them is sketched below.
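A minimal sketch of browsing these directories, assuming GRID_HOME points to the Grid home (/opt/u01/app/11.2.0/grid in the listings that follow) and that the per-node log directory matches the short hostname:

export GRID_HOME=/opt/u01/app/11.2.0/grid   # assumed Grid home, adjust to your environment
cd $GRID_HOME/log/$(hostname -s)            # per-node GI log directory
ls -d cssd crsd evmd                        # daemon log subdirectories
tail -50 alert$(hostname -s).log            # GI alert log for this node
tail -50 crsd/crsd.log                      # CRS daemon log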

Check the relevant logs on node 1:

# Check the GI alert log. The most recent entries only warn that the storage used by GI is heavily utilized; it can be cleaned up later, and there is still some free space left, so this is clearly not the cause of the current failure (a way to size that space is sketched after the log excerpt below).

root@bjdb1:/opt/u01/app/11.2.0/grid/log/bjdb1>tail -f alert*.log

2016-10-10 14:18:26.125:

[crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.

2016-10-10 14:23:31.125:

[crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.

2016-10-10 14:28:36.125:

[crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.

2016-10-10 14:33:41.125:

[crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.

2016-10-10 14:38:46.125:

[crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
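Before cleaning it up, a minimal sketch of sizing the space that CRS-9520 complains about; the path is taken from the messages above, and the oclumon calls assume the Cluster Health Monitor utility shipped with this GI version is available:

du -sh /opt/u01/app/11.2.0/grid/crf/db/bjdb1   # CHM repository directory named in the alert log
df -h /opt/u01/app/11.2.0/grid                 # free space remaining on the Grid home filesystem
oclumon manage -get reppath                    # repository location, if oclumon is available
oclumon manage -get repsize                    # currently configured repository retention size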

# Since crsctl cannot be used, check the crsd log next. It turns out errors had already appeared on October 3: the raw device could not be opened, so the OCR could not be initialized. Reading further, the errors show that access to the shared storage failed at that time. This points to a storage problem at that moment, and the on-site staff need to be asked whether any storage-related work was carried out around that point in time (a way to verify access to the OCR and voting disks is sketched after the log excerpt below).

root@bjdb1:/opt/u01/app/11.2.0/grid/log/bjdb1/crsd>tail -f crsd.log

2016-10-03 18:04:40.248:

[OCRRAW][1]proprinit: Could not open raw device

2016-10-03 18:04:40.248:

[OCRASM][1]proprasmcl: asmhandle is NULL

2016-10-03 18:04:40.252:

[OCRAPI][1]a_init:16!: Backend init unsuccessful : [26]

2016-10-03 18:04:40.253:

[CRSOCR][1] OCR context init failure. Error: PROC-26: Error while accessing the physical storage

2016-10-03 18:04:40.253:

[CRSD][1] Created alert : (:CRSD00111:) : Could not init OCR, error: PROC-26: Error while accessing the physical storage

2016-10-03 18:04:40.253:

[CRSD][1][PANIC] CRSD exiting: Could not init OCR, code: 26

2016-10-03 18:04:40.253: [CRSD][1] Done.
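Given the suspicion of a storage problem, a minimal sketch of verifying whether the OCR and voting disks are currently reachable; this assumes the lower stack (ohasd, cssd, ASM) is still running, which the fact that the database remains accessible suggests:

ocrcheck                        # as root: OCR integrity and the diskgroup/device it resides on
crsctl query css votedisk       # voting disk locations and their states
asmcmd lsdg                     # as the grid user: confirm the ASM diskgroups are mounted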

Check the relevant logs on node 2:

# The GI alert log on node 2 shows CRS-2409 errors from ctssd. According to MOS Doc ID 1135337.1, "This is not an error. ctssd is reporting that there is a time difference and it is not doing anything about it as it is running in observer mode." So it is only necessary to check whether the two nodes' clocks agree, and in fact they do (a quick way to compare them is sketched after the log excerpt below):

root@bjdb2:/opt/u01/app/11.2.0/grid/log/bjdb2>tail -f alert*.log

2016-10-10 12:29:22.145:

[ctssd(5243030)]CRS-2409:Th
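A minimal sketch of comparing the two nodes' clocks and confirming the CTSS mode; the ssh call to bjdb2 is illustrative and assumes user equivalence is configured between the nodes:

date; ssh bjdb2 date            # compare wall-clock time on both nodes
crsctl check ctss               # reports whether CTSS is in Observer or Active mode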
