The first fault diagnosed and resolved in 2014: a colleague reported that a two-node 11.2 RAC on Solaris would not start and asked me to take a look. Analysis showed that an ill-considered sga_target setting was preventing ASM from starting.
GI fails to start
[email protected]:~$crsctl status resource -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
[email protected]:~$crsctl status resource -t -init
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  OFFLINE                               Instance Shutdown
ora.cluster_interconnect.haip
      1        ONLINE  ONLINE       zwq-rpt1
ora.crf
      1        ONLINE  ONLINE       zwq-rpt1
ora.crsd
      1        ONLINE  OFFLINE
ora.cssd
      1        ONLINE  ONLINE       zwq-rpt1
ora.cssdmonitor
      1        ONLINE  ONLINE       zwq-rpt1
ora.ctssd
      1        ONLINE  ONLINE       zwq-rpt1                 ACTIVE:0
ora.diskmon
      1        OFFLINE OFFLINE
ora.evmd
      1        ONLINE  INTERMEDIATE zwq-rpt1
ora.gipcd
      1        ONLINE  ONLINE       zwq-rpt1
ora.gpnpd
      1        ONLINE  ONLINE       zwq-rpt1
ora.mdnsd
      1        ONLINE  ONLINE       zwq-rpt1
As the output shows, ora.asm is OFFLINE with "Instance Shutdown": ASM has not started.
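A quick way to get the same picture in one line is crsctl check crs (a hedged sketch, not from the original session; run as root):

# Hedged sketch: one-line health check of the GI daemons. In this state
# it should report Oracle High Availability Services online but Cluster
# Ready Services unreachable, matching the OFFLINE ora.crsd above.
[email protected]:~# crsctl check crs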
Errors in the GI alert log
2014-01-01 00:40:47.708
[cssd(1418)]CRS-1605:CSSD voting file is online: /dev/rdsk/emcpower0a; details in /export/home/app/grid/log/zwq-rpt1/cssd/ocssd.log.
2014-01-01 00:40:53.234
[cssd(1418)]CRS-1601:CSSD Reconfiguration complete. Active nodes are zwq-rpt1 zwq-rpt2 .
2014-01-01 00:40:56.659
[ctssd(1483)]CRS-2407:The new Cluster Time Synchronization Service reference node is host zwq-rpt2.
2014-01-01 00:40:56.661
[ctssd(1483)]CRS-2401:The Cluster Time Synchronization Service started on host zwq-rpt1.
2014-01-01 00:41:02.016
[ctssd(1483)]CRS-2408:The clock on host zwq-rpt1 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.
2014-01-01 00:43:23.874
[/export/home/app/grid/bin/oraagent.bin(1348)]CRS-5019:All OCR locations are on ASM disk groups [], and none of these disk groups are mounted. Details are at "(:CLSN00100:)" in "/export/home/app/grid/log/zwq-rpt1/agent/ohasd/oraagent_grid/oraagent_grid.log".
2014-01-01 00:45:42.837
[/export/home/app/grid/bin/oraagent.bin(1348)]CRS-5019:All OCR locations are on ASM disk groups [], and none of these disk groups are mounted. Details are at "(:CLSN00100:)" in "/export/home/app/grid/log/zwq-rpt1/agent/ohasd/oraagent_grid/oraagent_grid.log".
2014-01-01 00:48:02.087
[/export/home/app/grid/bin/oraagent.bin(1348)]CRS-5019:All OCR locations are on ASM disk groups [], and none of these disk groups are mounted. Details are at "(:CLSN00100:)" in "/export/home/app/grid/log/zwq-rpt1/agent/ohasd/oraagent_grid/oraagent_grid.log".
2014-01-01 00:48:18.836
[ohasd(1083)]CRS-2807:Resource 'ora.asm' failed to start automatically.
2014-01-01 00:48:18.837
[ohasd(1083)]CRS-2807:Resource 'ora.crsd' failed to start automatically.
2014-01-01 01:05:15.396
[/export/home/app/grid/bin/oraagent.bin(1348)]CRS-5019:All OCR locations are on ASM disk groups [CRSDG], and none of these disk groups are mounted. Details are at "(:CLSN00100:)" in "/export/home/app/grid/log/zwq-rpt1/agent/ohasd/oraagent_grid/oraagent_grid.log".
2014-01-01 01:05:45.101
[/export/home/app/grid/bin/oraagent.bin(1348)]CRS-5019:All OCR locations are on ASM disk groups [CRSDG], and none of these disk groups are mounted. Details are at "(:CLSN00100:)" in "/export/home/app/grid/log/zwq-rpt1/agent/ohasd/oraagent_grid/oraagent_grid.log".
2014-01-01 01:06:15.104
[/export/home/app/grid/bin/oraagent.bin(1348)]CRS-5019:All OCR locations are on ASM disk groups [CRSDG], and none of these disk groups are mounted. Details are at "(:CLSN00100:)" in "/export/home/app/grid/log/zwq-rpt1/agent/ohasd/oraagent_grid/oraagent_grid.log".
From this it is fairly clear what is happening: the ASM disk group CRSDG is not mounted, so the OCR stored in it cannot be accessed, and CRS therefore cannot start.
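Where the OCR lives can be confirmed independently of the cluster state (a hedged sketch, assuming a standard Solaris GI install; not from the original session):

# Hedged sketch: the OCR registry location file on Solaris sits under
# /var/opt/oracle; ocrcheck's integrity check is expected to fail while
# CRSDG is dismounted.
[email protected]:~# cat /var/opt/oracle/ocr.loc
[email protected]:~# ocrcheck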
oraagent log
2014-01-01 00:43:23.870: [ora.asm][9] {0:0:2} [start] InstConnection::connectInt (2) Exception OCIException
2014-01-01 00:43:23.870: [ora.asm][9] {0:0:2} [start] InstConnection:connect:excp OCIException OCI error 604
2014-01-01 00:43:23.870: [ora.asm][9] {0:0:2} [start] DgpAgent::queryDgStatus excp ORA-00604: error occurred at recursive SQL level 1
ORA-04031: unable to allocate 32 bytes of shared memory ("shared pool","unknown object","KGLH0^34f764db","kglHeapInitialize:temp")
The agent reports a fairly clear ORA-04031. Next, check the ASM alert log.
Errors in the ASM alert log
Wed Jan 01 00:47:33 2014
ORACLE_BASE not set in environment. It is recommended that ORACLE_BASE be set in the environment
Reusing ORACLE_BASE from an earlier startup = /export/home/app/oracle
Wed Jan 01 00:47:39 2014
Errors in file /export/home/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ora_1728.trc (incident=291447):
ORA-04031: unable to allocate 32 bytes of shared memory ("shared pool","unknown object","KGLH0^34f764db","kglHeapInitialize:temp")
Incident details in: /export/home/app/oracle/diag/asm/+asm/+ASM1/incident/incdir_291447/+ASM1_ora_1728_i291447.trc
Wed Jan 01 00:47:48 2014
Dumping diagnostic data in directory=[cdmp_20140101004748], requested by (instance=1, osid=1728), summary=[incident=291447].
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Wed Jan 01 00:47:53 2014
Errors in file /export/home/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ora_1730.trc (incident=291448):
ORA-04031: unable to allocate 32 bytes of shared memory ("shared pool","unknown object","KGLH0^34f764db","kglHeapInitialize:temp")
Incident details in: /export/home/app/oracle/diag/asm/+asm/+ASM1/incident/incdir_291448/+ASM1_ora_1730_i291448.trc
Wed Jan 01 00:48:01 2014
Dumping diagnostic data in directory=[cdmp_20140101004801], requested by (instance=1, osid=1730), summary=[incident=291448].
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Wed Jan 01 00:48:07 2014
Errors in file /export/home/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ora_1732.trc (incident=291449):
ORA-04031: unable to allocate 32 bytes of shared memory ("shared pool","unknown object","KGLH0^34f764db","kglHeapInitialize:temp")
Incident details in: /export/home/app/oracle/diag/asm/+asm/+ASM1/incident/incdir_291449/+ASM1_ora_1732_i291449.trc
Wed Jan 01 00:48:16 2014
Dumping diagnostic data in directory=[cdmp_20140101004816], requested by (instance=1, osid=1732), summary=[incident=291449].
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Wed Jan 01 00:48:16 2014
License high water mark = 1
USER (ospid: 1736): terminating the instance
Instance terminated by USER, pid = 1736
Here it is plain that the shared pool was too small: ASM hit ORA-04031 and could not start.
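For reference, shared pool headroom can be checked with v$sgastat once an instance is up (a generic sketch, not from the original session; by the time ORA-04031 fires an instance is usually too starved to query, so this is most useful after the fix):

-- Generic sketch: free memory remaining in the shared pool.
SELECT pool, name, ROUND(bytes/1024/1024, 1) AS mb
FROM   v$sgastat
WHERE  pool = 'shared pool'
AND    name = 'free memory';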
Cause analysis
Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options.
ORACLE_HOME = /export/home/app/grid
System name:    SunOS
Node name:      zwq-rpt1
Release:        5.11
Version:        11.1
Machine:        sun4v
Using parameter settings in server-side spfile +CRSDG/zwq-rpt-cluster/asmparameterfile/registry.253.823992831
System parameters with non-default values:
  sga_max_size             = 2G
  large_pool_size          = 16M
  instance_type            = "asm"
  sga_target               = 0
  remote_login_passwordfile= "EXCLUSIVE"
  asm_diskstring           = "/dev/rdsk/*"
  asm_diskgroups           = "FRADG"
  asm_diskgroups           = "DATADG"
  asm_power_limit          = 1
  diagnostic_dest          = "/export/home/app/oracle"
The startup parameters show that sga_target was set to 0 while shared_pool_size was not configured at all, so the shared pool came up far too small. That caused the ORA-04031, the CRS attempt to start ASM failed, the OCR became inaccessible, and CRS could not start.
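Note that the faulty spfile itself sits inside ASM (+CRSDG), so it cannot simply be edited while ASM is down; that is why the fix below goes through a temporary pfile. The spfile location can still be read from the GPnP profile even with ASM down (a hedged sketch using gpnptool, which ships with GI; run as the grid owner):

# Hedged sketch: pull the ASM spfile path out of the GPnP profile
# without a running ASM instance.
[email protected]:~$ gpnptool get 2>/dev/null | tr ' ' '\n' | grep -i spfile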
Resolution
1. Edit a pfile
[email protected]:/export/home/app/oracle/diag/asm/+asm/+ASM1/trace$vi /tmp/asm.pfile
memory_target            = 2G
large_pool_size          = 16M
instance_type            = "asm"
sga_target               = 0
remote_login_passwordfile= "EXCLUSIVE"
asm_diskstring           = "/dev/rdsk/*"
asm_diskgroups           = "FRADG"
asm_diskgroups           = "DATADG"
asm_power_limit          = 1
diagnostic_dest          = "/export/home/app/oracle"
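The key change is replacing sga_max_size with memory_target = 2G, so automatic memory management sizes the shared pool even though sga_target stays at 0. An equally workable alternative (my suggestion, not what was done here) would be to keep manual SGA sizing and give the shared pool an explicit floor:

# Alternative pfile fragment (an assumption, not the fix applied here):
# fixed SGA with an explicitly sized shared pool instead of AMM.
sga_max_size     = 2G
sga_target       = 2G
shared_pool_size = 512M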
2. Start ASM with the pfile
[email protected]:/export/home/app/oracle/diag/asm/+asm/+ASM1/trace$sqlplus / as sysasm

SQL*Plus: Release 11.2.0.3.0 Production on Wed Jan 1 01:04:10 2014

Copyright (c) 1982, 2011, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> startup pfile='/tmp/asm.pfile'
ASM instance started

Total System Global Area 2138521600 bytes
Fixed Size                  2161024 bytes
Variable Size            2102806144 bytes
ASM Cache                  33554432 bytes
ASM diskgroups mounted
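With the instance up on the temporary pfile, it is worth confirming that all three disk groups really mounted before recreating the spfile (a quick check, not in the original transcript):

-- Quick check: every disk group should report MOUNTED.
SELECT name, state FROM v$asm_diskgroup;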
3. Create an spfile
SQL> create spfile='+CRSDG' FROM PFILE='/tmp/asm.pfile';

File created.

-- ASM alert log
Wed Jan 01 01:08:59 2014
NOTE: updated gpnp profile ASM SPFILE to
NOTE: updated gpnp profile ASM diskstring: /dev/rdsk/*
NOTE: updated gpnp profile ASM diskstring: /dev/rdsk/*
NOTE: updated gpnp profile ASM SPFILE to +CRSDG/zwq-rpt-cluster/asmparameterfile/registry.253.835664939
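The registration can be double-checked with asmcmd, which reads the spfile path back from the GPnP profile (a hedged sketch, run as the grid owner):

# Hedged sketch: should print the registry.253.835664939 path noted in
# the alert log above.
[email protected]:~$ asmcmd spget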
4. Shut down ASM
SQL> shutdown immediate
ORA-15097: cannot SHUTDOWN ASM instance with connected client (process 1971)
SQL> shutdown abort
ASM instance shutdown
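The ORA-15097 is expected: the GI stack itself keeps an OCR connection into ASM, which blocks shutdown immediate and is why shutdown abort is used here. The connected clients can be listed first (a generic sketch, not in the original transcript):

-- Generic sketch: who is still connected to this ASM instance.
SELECT group_number, instance_name, db_name, status
FROM   v$asm_client;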
5. Restart CRS
[email protected]:~# crsctl stop crs -f
[email protected]:~# crsctl start crs
6. Restart CRS on the other node
[email protected]:~# crsctl stop crs -f
[email protected]:~# crsctl start crs
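After both restarts, the cluster-wide daemon state can be confirmed in one command (a hedged sketch, run as root):

# Hedged sketch: CSS, CRS and EVM should report online on every node.
[email protected]:~# crsctl check cluster -all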
7. Check the result
[email protected]:~# crsctl status res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.CRSDG.dg
               ONLINE  ONLINE       zwq-rpt1
               ONLINE  ONLINE       zwq-rpt2
ora.DATADG.dg
               ONLINE  ONLINE       zwq-rpt1
               ONLINE  ONLINE       zwq-rpt2
ora.FRADG.dg
               ONLINE  ONLINE       zwq-rpt1
               ONLINE  ONLINE       zwq-rpt2
ora.LISTENER.lsnr
               ONLINE  ONLINE       zwq-rpt1
               ONLINE  ONLINE       zwq-rpt2
ora.asm
               ONLINE  ONLINE       zwq-rpt1                 Started
               ONLINE  ONLINE       zwq-rpt2                 Started
ora.gsd
               OFFLINE OFFLINE      zwq-rpt1
               OFFLINE OFFLINE      zwq-rpt2
ora.net1.network
               ONLINE  ONLINE       zwq-rpt1
               ONLINE  ONLINE       zwq-rpt2
ora.ons
               ONLINE  ONLINE       zwq-rpt1
               ONLINE  ONLINE       zwq-rpt2
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       zwq-rpt1
ora.cvu
      1        ONLINE  ONLINE       zwq-rpt1
ora.oc4j
      1        ONLINE  ONLINE       zwq-rpt1
ora.rptdb.db
      1        ONLINE  ONLINE       zwq-rpt1                 Open
      2        ONLINE  ONLINE       zwq-rpt2                 Open
ora.scan1.vip
      1        ONLINE  ONLINE       zwq-rpt1
ora.zwq-rpt1.vip
      1        ONLINE  ONLINE       zwq-rpt1
ora.zwq-rpt2.vip
      1        ONLINE  ONLINE       zwq-rpt2
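As a last sanity check (not part of the original transcript), the restarted ASM instances can be asked for their effective memory settings:

-- Not in the original transcript: run in sqlplus / as sysasm on each node.
show parameter memory_target
show parameter large_pool_size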
With that, everything is back to normal, and the first fault of 2014 is resolved.