A friend from the DBA chat group had a RAC environment where the ONS process would not start. The platform is Red Hat 5.3, 64-bit.
The ONS log is as follows:
2010-10-18 09:42:11.384: [RACG][3041022624] [16815][3041022624][ora.rac1.ons]: clsrcexecut: env ORACLE_CONFIG_HOME=/u01/oracle/product/10.2.0/crs_1
2010-10-18 09:42:11.384: [RACG][3041022624] [16815][3041022624][ora.rac1.ons]: clsrcexecut: cmd = /u01/oracle/product/10.2.0/crs_1/bin/racgeut -e _USR_ORA_DEBUG=0 540 /u01/oracle/product/10.2.0/crs_1/bin/onsctl stop
2010-10-18 09:42:11.384: [RACG][3041022624] [16815][3041022624][ora.rac1.ons]: clsrcexecut: rc = 99, time = 540.630s
2010-10-18 10:55:44.720: [RACG][1604653728] [18288][1604653728][ora.rac1.ons]: timeout: killed the spawned process
2010-10-18 10:55:44.721: [RACG][1604653728] [18288][1604653728][ora.rac1.ons]: clsrcexecut: env ORACLE_CONFIG_HOME=/u01/oracle/product/10.2.0/crs_1
2010-10-18 10:55:44.721: [RACG][1604653728] [18288][1604653728][ora.rac1.ons]: clsrcexecut: cmd = /u01/oracle/product/10.2.0/crs_1/bin/racgeut -e _USR_ORA_DEBUG=0 540 /u01/oracle/product/10.2.0/crs_1/bin/onsctl start
2010-10-18 10:55:44.721: [RACG][1604653728] [18288][1604653728][ora.rac1.ons]: clsrcexecut: rc = 99, time = 540.410s
2010-10-18 10:59:12.517: [RACG][1604653728] [18288][1604653728][ora.rac1.ons]: /u01/oracle/product/10.2.0/crs_1/bin/onsctl: line 81: 31584 Terminated $ONSADMIN ping
ons is not running ...
2010-10-18 10:59:12.517: [RACG][1604653728] [18288][1604653728][ora.rac1.ons]: clsrcexecut: env ORACLE_CONFIG_HOME=/u01/oracle/product/10.2.0/crs_1
2010-10-18 10:59:12.517: [RACG][1604653728] [18288][1604653728][ora.rac1.ons]: clsrcexecut: cmd = /u01/oracle/product/10.2.0/crs_1/bin/racgeut -e _USR_ORA_DEBUG=0 540 /u01/oracle/product/10.2.0/crs_1/bin/onsctl ping
2010-10-18 10:59:12.517: [RACG][1604653728] [18288][1604653728][ora.rac1.ons]: clsrcexecut: rc = 1, time = 207.800s
2010-10-18 10:59:12.517: [RACG][1604653728] [18288][1604653728][ora.rac1.ons]: end for resource = ora.rac1.ons, action = start, status = 1, time = 748.230s
2010-10-18 10:59:13.781: [RACG][1366147744] [1357][1366147744][ora.rac1.ons]: onsctl: shutting down ons daemon ...
/u01/oracle/product/10.2.0/crs_1/bin/onsctl: line 118: 1362 Terminated $ONSADMIN shutdown
onsctl: shutdown of ons failed!
The crsd.log shows:
timeout for ora.rac1.ons timeout=600
start resource error for ora.rac1.ons error code=-2
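Judging from the errors, this is a connection timeout. RAC itself was running normally, but there were many ONS processes and they were consuming a large amount of CPU, with CPU usage at 100%. Because this is a production database, we proceeded cautiously. We added 布豆 from the DBA1 group to the discussion, since 布豆 has extensive RAC experience.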
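To make the timeout reading of the log easier to check, here is a minimal Python sketch that pairs each `clsrcexecut: cmd` line with the `rc` line that follows it and flags actions whose elapsed time reached the limit. It rests on two assumptions that the log suggests but does not state: that the number `540` passed to `racgeut` is a time limit in seconds, and that an elapsed time at or above that number means the command was killed for running too long. The function name `find_timeouts` and the `SAMPLE` list are only illustrative; the sample lines are shortened copies of the log above.

```python
import re

# Assumption: in "racgeut -e <env> <N> <path>/onsctl <action>", N is the time
# limit in seconds that racgeut allows the wrapped onsctl command.
CMD_RE = re.compile(r"racgeut\s+-e\s+\S+\s+(\d+)\s+\S*onsctl\s+(\w+)")
RC_RE = re.compile(r"clsrcexecut: rc = (\d+), time = ([\d.]+)s")


def find_timeouts(lines):
    """Pair each 'cmd' line with the next 'rc' line and report the outcome."""
    results = []
    pending = None  # (action, limit) from the most recent cmd line
    for line in lines:
        cmd = CMD_RE.search(line)
        if cmd:
            pending = (cmd.group(2), int(cmd.group(1)))
            continue
        rc = RC_RE.search(line)
        if rc and pending:
            action, limit = pending
            elapsed = float(rc.group(2))
            results.append({
                "action": action,
                "limit": limit,
                "rc": int(rc.group(1)),
                "elapsed": elapsed,
                # Assumption: reaching the limit means the call timed out.
                "timed_out": elapsed >= limit,
            })
            pending = None
    return results


# Shortened copies of the log lines shown earlier (prefixes trimmed).
SAMPLE = [
    "clsrcexecut: cmd = /u01/oracle/product/10.2.0/crs_1/bin/racgeut -e _USR_ORA_DEBUG=0 540 /u01/oracle/product/10.2.0/crs_1/bin/onsctl stop",
    "clsrcexecut: rc = 99, time = 540.630s",
    "clsrcexecut: cmd = /u01/oracle/product/10.2.0/crs_1/bin/racgeut -e _USR_ORA_DEBUG=0 540 /u01/oracle/product/10.2.0/crs_1/bin/onsctl ping",
    "clsrcexecut: rc = 1, time = 207.800s",
]

if __name__ == "__main__":
    for entry in find_timeouts(SAMPLE):
        print(entry)
```

Run against the full log, the `stop` and `start` calls both ran slightly longer than 540 seconds, which matches the "timeout: killed the spawned process" message, while `ping` returned early with a non-zero code.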
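According to 布豆, Oracle RAC processes sometimes misbehave for no apparent reason, and even Oracle itself cannot always explain why. After my friend rebooted the node 1 server, ONS started normally, and node 2 was then rebooted as well. My friend suspects that a change to the network policy had affected the system.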
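Once the problem was solved, the three of us chatted for a while, and one topic was backups. For a database, backups matter more than anything else. Back up the database, the control file, and the spfile; these files are essential for recovery. Only a valid backup can keep the losses from an incident to a minimum.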