During the routine morning check we found that node 3 of the cluster was down, even though CRS showed every resource in the cluster as normal. Examining node 3's database instance log, ASM log, and CRS log revealed that node 3 had been evicted from the cluster at 04:14 that morning, which rebooted the operating system and brought the database instance down. The logs of the other three nodes showed nothing abnormal.
Below is the diagnostic information from node 3:
1. crs_stat -t showed every resource in the cluster as normal. Note that this command was run from another node; on node 3 itself, crs_stat -t did not respond.
[oracle@RACe00-ser01 ~]$ crs_stat -t
Name Type Target State Host
------------------------------------------------------------
ora....SM1.asm application ONLINE ONLINE RAC...er01
ora....01.lsnr application ONLINE ONLINE RAC...er01
ora....r01.gsd application ONLINE ONLINE RAC...er01
ora....r01.ons application ONLINE ONLINE RAC...er01
ora....r01.vip application ONLINE ONLINE RAC...er01
ora....SM2.asm application ONLINE ONLINE RAC...er02
ora....02.lsnr application ONLINE ONLINE RAC...er02
ora....r02.gsd application ONLINE ONLINE RAC...er02
ora....r02.ons application ONLINE ONLINE RAC...er02
ora....r02.vip application ONLINE ONLINE RAC...er02
ora....SM3.asm application ONLINE ONLINE RAC...er03
ora....03.lsnr application ONLINE ONLINE RAC...er03
ora....r03.gsd application ONLINE ONLINE RAC...er03
ora....r03.ons application ONLINE ONLINE RAC...er03
ora....r03.vip application ONLINE ONLINE RAC...er03
ora....SM4.asm application ONLINE ONLINE RAC...er04
ora....04.lsnr application ONLINE ONLINE RAC...er04
ora....r04.gsd application ONLINE ONLINE RAC...er04
ora....r04.ons application ONLINE ONLINE RAC...er04
ora....r04.vip application ONLINE ONLINE RAC...er04
ora....d1.inst application ONLINE ONLINE RAC...er01
ora....d2.inst application ONLINE ONLINE RAC...er02
ora....d3.inst application ONLINE ONLINE RAC...er03
ora....d4.inst application ONLINE ONLINE RAC...er04
ora.caprod.db application ONLINE ONLINE RAC...er01
2. Checked alert*.log; no abnormal messages.
[oracle@RAC-ser03 ~]$ tail -f alert_caprod3.log |more
Wed Mar 23 14:09:26 2011
Thread 3 advanced to log sequence 6440 (LGWR switch)
Current log# 17 seq# 6440 mem# 0: +DG1/caprod/onlinelog/group_17.324.702599663
3. Checked the Oracle background processes; none were running.
[oracle@RAC-ser03 ~]$ ps -ef|grep ora_
oracle 13772 10533 0 09:01 pts/0 00:00:00 grep ora_
4. Checked the ASM instance; it had been shut down as well.
[oracle@RAC-ser03 ~]$ ps -ef|grep ASM
oracle 13878 10533 0 09:01 pts/0 00:00:00 grep ASM
Note: at this point the third node of the cluster was already down, yet crs_stat still reported RAC-ser03's resources as normal.
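crs_stat reports the last resource states recorded in the OCR, so a node that dies abruptly can linger there as ONLINE. A quick cross-check that does not rely on those cached states (a sketch using standard 10g CRS commands, run from a surviving node):

[oracle@RAC-ser01 ~]$ ping -c 2 RAC-ser03
[oracle@RAC-ser01 ~]$ ssh RAC-ser03 crsctl check crs

crsctl check crs polls the CSS, CRS, and EVM daemons directly, so it fails immediately when the stack on that node is actually down.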
5. Ran last to review OS logins; it showed the system had rebooted at 04:14.
[oracle@RAC-ser03 ~]$ last
oracle pts/1 19.168.0.16 Thu Mar 24 08:59 - down (00:09)
oracle pts/0 192.168.0.3 Thu Mar 24 08:38 - down (00:30)
reboot system boot 2.6.18-128.el5 Thu Mar 24 04:14 (04:54)
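The boot time can be confirmed with standard tools; note that a fencing reboot initiated by CSS is immediate, so /var/log/messages often shows nothing unusual right before it (a sketch):

[oracle@RAC-ser03 ~]$ who -b
[oracle@RAC-ser03 ~]$ last -x | head
[root@RAC-ser03 ~]# grep 'Mar 24 04:1' /var/log/messages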
6. Checked the ASM log; nothing abnormal.
[oracle@RAC-ser03 bdump]$ pwd
/home/oracle/product/admin/+ASM/bdump
[oracle@RAC-ser03 bdump]$ tail -n 100 alert_+ASM3.log
Mon Mar 21 10:53:04 2011
Submitted all GCS remote-cache requests
Post SMON to start 1st pass IR
Fix write in gcs resources
7. Checked the CRS log:
[oracle@RAC-ser03 ~]$ cd $ORACLE_BASE/crs/log/RAC-ser03/
[oracle@RAC-ser03 RAC-ser03]$ tail -n 100 alertRAC-ser03.log
2011-03-24 04:10:31.834
[cssd(7573)]CRS-1612:node RAC-ser04 (4) at 50% heartbeat fatal, eviction in 29.172 seconds
2011-03-24 04:10:32.835
[cssd(7573)]CRS-1612:node RAC-ser04 (4) at 50% heartbeat fatal, eviction in 28.162 seconds
2011-03-24 04:10:46.834
[cssd(7573)]CRS-1611:node RAC-ser04 (4) at 75% heartbeat fatal, eviction in 14.172 seconds
2011-03-24 04:10:47.836
[cssd(7573)]CRS-1611:node RAC-ser04 (4) at 75% heartbeat fatal, eviction in 13.172 seconds
2011-03-24 04:10:55.832
[cssd(7573)]CRS-1610:node RAC-ser04 (4) at 90% heartbeat fatal, eviction in 5.172 seconds
2011-03-24 04:10:56.833
[cssd(7573)]CRS-1610:node RAC-ser04 (4) at 90% heartbeat fatal, eviction in 4.172 seconds
2011-03-24 04:10:57.835
[cssd(7573)]CRS-1610:node RAC-ser04 (4) at 90% heartbeat fatal, eviction in 3.172 seconds
2011-03-24 04:10:58.838
[cssd(7573)]CRS-1610:node RAC-ser04 (4) at 90% heartbeat fatal, eviction in 2.172 seconds
2011-03-24 04:10:59.840
[cssd(7573)]CRS-1610:node RAC-ser04 (4) at 90% heartbeat fatal, eviction in 1.162 seconds
2011-03-24 04:11:00.832
[cssd(7573)]CRS-1610:node RAC-ser04 (4) at 90% heartbeat fatal, eviction in 0.172 seconds
RAC-ser03 had been evicted from the cluster, causing its host to reboot. (The messages name RAC-ser04, but they are written from RAC-ser03's point of view: node 3 stopped receiving node 4's network heartbeats, and when CSS resolved the split it was node 3 that got evicted and fenced.) The logs on the other nodes again showed nothing abnormal.
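The countdown in these messages is driven by the CSS misscount, the network heartbeat timeout; "50% heartbeat fatal ... 29.172 seconds" is consistent with the 10g Linux default of 60 seconds. The configured value can be read with a standard 10g command (a sketch):

[oracle@RAC-ser03 ~]$ crsctl get css misscount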
At 09:12 we manually rebooted the RAC-ser03 operating system:
reboot system boot 2.6.18-128.el5 Thu Mar 24 09:12 (00:31)
We then tried to restart the Oracle instance on RAC-ser03:
[root@RAC-ser03 /]# ./etc/rc.local
Illegal operation: The specified slave interface 'eth0' is already a slave
Master 'bond0', Slave 'eth0': Error: Enslave failed
Illegal operation: The specified slave interface 'eth1' is already a slave
Master 'bond0', Slave 'eth1': Error: Enslave failed
/dev/raw/raw1: bound to major 253, minor 1
/dev/raw/raw2: bound to major 253, minor 2
/dev/raw/raw3: bound to major 253, minor 3
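Since the voting disk sits on /dev/raw/raw2 (as the CRS alert below confirms), it is worth verifying the raw bindings and their ownership before CRS comes up; in 10g the OCR device is typically owned root:oinstall and the voting disk oracle:oinstall (a sketch):

[root@RAC-ser03 ~]# raw -qa
[root@RAC-ser03 ~]# ls -l /dev/raw/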
CRS started at around 09:15:
2011-03-24 04:11:00.832
[cssd(7573)]CRS-1610:node RAC-ser04 (4) at 90% heartbeat fatal, eviction in 0.172 seconds
2011-03-24 09:15:16.937
[cssd(7728)]CRS-1605:CSSD voting file is online: /dev/raw/raw2. Details in /home/oracle/product/crs/log/RAC-ser03/cssd/ocssd.log
[cssd(7728)]CRS-1601:CSSD Reconfiguration complete. Active nodes are RAC-ser01 RAC-ser02 RAC-ser03 RAC-ser04 .
2011-03-24 09:15:18.281
[crsd(7141)]CRS-1012:The OCR service started on node RAC-ser03.
2011-03-24 09:15:18.313
[evmd(7064)]CRS-1401:EVMD started on node RAC-ser03.
2011-03-24 09:15:20.657
[crsd(7141)]CRS-1201:CRSD started on node RAC-ser03.
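With the stack up, the ASM and database registrations can be checked through srvctl rather than ps (a sketch; standard 10g syntax, using the caprod database name seen in crs_stat):

[oracle@RAC-ser03 ~]$ srvctl status asm -n RAC-ser03
[oracle@RAC-ser03 ~]$ srvctl status instance -d caprod -i caprod3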
Checking the alert log showed the database opening normally. Note that the instance reports itself first to open and performs crash recovery for redo thread 4: by this point the other instances had evidently already gone down, as described below.
Thu Mar 24 09:30:20 2011
Database mounted in Shared Mode (CLUSTER_DATABASE=TRUE)
Completed: ALTER DATABASE MOUNT
Thu Mar 24 09:30:20 2011
ALTER DATABASE OPEN
This instance was first to open
Thu Mar 24 09:30:23 2011
Beginning crash recovery of 1 threads
parallel recovery started with 15 processes
Thu Mar 24 09:30:25 2011
Started redo scan
Thu Mar 24 09:30:25 2011
Completed redo scan
496 redo blocks read, 147 data blocks need recovery
Thu Mar 24 09:30:25 2011
Started redo application at
Thread 4: logseq 7911, block 657430
Thu Mar 24 09:30:25 2011
Recovery of Online Redo Log: Thread 4 Group 18 Seq 7911 Reading mem 0
Mem# 0: +DG1/caprod/onlinelog/group_18.325.702599667
Thu Mar 24 09:30:25 2011
Completed redo application
Thu Mar 24 09:30:25 2011
Completed crash recovery at
Thread 4: logseq 7911, block 657926, scn 9932862661957
147 data blocks read, 147 data blocks written, 496 redo blocks read
Thu Mar 24 09:30:26 2011
Thread 4 advanced to log sequence 7912 (thread recovery)
Picked broadcast on commit scheme to generate SCNs
Thu Mar 24 09:30:26 2011
Thread 3 advanced to log sequence 6442 (thread open)
Thread 3 opened at log sequence 6442
Current log# 16 seq# 6442 mem# 0: +DG1/caprod/onlinelog/group_16.323.702599657
Successful open of redo thread 3
Thu Mar 24 09:30:26 2011
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Thu Mar 24 09:30:26 2011
SMON: enabling cache recovery
Thu Mar 24 09:30:27 2011
Successfully onlined Undo Tablespace 5.
Thu Mar 24 09:30:27 2011
SMON: enabling tx recovery
Thu Mar 24 09:30:27 2011
Database Characterset is ZHS16GBK
Opening with internal Resource Manager plan
where NUMA PG = 1, CPUs = 16
replication_dependency_tracking turned off (no async multimaster replication found)
Starting background process QMNC
QMNC started with pid=45, OS id=13654
Thu Mar 24 09:30:31 2011
Completed: ALTER DATABASE OPEN
We thought that was the end of it, but it was only the beginning...
While RAC-ser03 was starting up, the other nodes in the cluster went down one after another because the ASM disk group had filled up. And shortly after RAC-ser03 came up, its ASM instance crashed too, taking the database instance down with it.
Alert log on RAC-ser01:
Thu Mar 24 09:17:30 2011
Errors in file /home/oracle/product/admin/caprod/bdump/caprod1_asmb_26891.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Thu Mar 24 09:17:30 2011
ASMB: terminating instance due to error 15064
Alert log on RAC-ser02:
Thu Mar 24 09:22:03 2011
Errors in file /home/oracle/product/admin/caprod/bdump/caprod2_asmb_8227.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Thu Mar 24 09:22:03 2011
ASMB: terminating instance due to error 15064
Alert log on RAC-ser04:
Thu Mar 24 09:29:51 2011
Errors in file /home/oracle/product/admin/caprod/bdump/caprod4_asmb_8610.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Thu Mar 24 09:29:51 2011
ASMB: terminating instance due to error 15064
Alert log on RAC-ser03:
Thu Mar 24 09:38:35 2011
Errors in file /home/oracle/product/admin/caprod/bdump/caprod3_asmb_13523.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Thu Mar 24 09:38:35 2011
ASMB: terminating instance due to error 15064
[oracle@RAC-ser03 bdump]$ more caprod3_asmb_13523.trc
*** 2011-03-24 09:38:35.480
*** SERVICE NAME:(SYS$BACKGROUND) 2011-03-24 09:38:35.480
*** SESSION ID:(639.1) 2011-03-24 09:38:35.480
error 15064 detected in background process
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
ksuitm: waiting up to [5] seconds before killing DIAG
The logs from all four nodes tell the same story: the database instances went down because their ASM instances died, and the ASM instances died because the ASM disk group had run out of space. At this point all four nodes in the cluster were down.
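In hindsight, an alert on ASM free space would have caught this long before it reached zero. A minimal cron-able sketch (assumptions: runs as oracle with sqlplus on the PATH and ORACLE_SID set to the local database instance; the script name and 10% threshold are illustrative):

#!/bin/sh
# check_asm_free.sh -- print any ASM disk group that is under 10% free
export ORACLE_SID=caprod3
sqlplus -s / as sysdba <<'EOF'
set heading off feedback off
select name || ' free ' || round(free_mb / total_mb * 100, 1) || '%'
  from v$asm_diskgroup
 where total_mb > 0
   and free_mb / total_mb < 0.10;
EOF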
We restarted the four hosts one by one. Nodes 1 and 4 came up normally, but nodes 2 and 3 would not: both answered ping yet could not be logged into remotely. We planned to bring up the database instance on node 1 first, only to find that it would not start either.
CRS log on RAC-ser01:
2011-03-24 10:16:07.978
[crsd(7611)]CRS-1205:Auto-start failed for the CRS resource . Details in RAC-ser01.
2011-03-24 10:17:04.611
[cssd(8232)]CRS-1612:node RAC-ser03 (3) at 50% heartbeat fatal, eviction in 29.008 seconds
2011-03-24 10:17:05.613
[cssd(8232)]CRS-1612:node RAC-ser03 (3) at 50% heartbeat fatal, eviction in 28.008 seconds
2011-03-24 10:17:19.639
[cssd(8232)]CRS-1611:node RAC-ser03 (3) at 75% heartbeat fatal, eviction in 14.244 seconds
2011-03-24 10:17:28.656
[cssd(8232)]CRS-1610:node RAC-ser03 (3) at 90% heartbeat fatal, eviction in 5.224 seconds
2011-03-24 10:17:29.658
[cssd(8232)]CRS-1610:node RAC-ser03 (3) at 90% heartbeat fatal, eviction in 4.224 seconds
2011-03-24 10:17:30.660
[cssd(8232)]CRS-1610:node RAC-ser03 (3) at 90% heartbeat fatal, eviction in 3.224 seconds
2011-03-24 10:17:31.662
[cssd(8232)]CRS-1610:node RAC-ser03 (3) at 90% heartbeat fatal, eviction in 2.214 seconds
2011-03-24 10:17:32.654
[cssd(8232)]CRS-1610:node RAC-ser03 (3) at 90% heartbeat fatal, eviction in 1.224 seconds
2011-03-24 10:17:33.656
[cssd(8232)]CRS-1610:node RAC-ser03 (3) at 90% heartbeat fatal, eviction in 0.224 seconds
2011-03-24 10:17:34.139
[cssd(8232)]CRS-1607:CSSD evicting node RAC-ser03. Details in /home/oracle/product/crs/log/RAC-ser01/cssd/ocssd.log
After forcibly powering off the node 3 and node 2 hosts, node 1 was able to start normally. Once node 1 was back up, we checked the ASM disk group's space and state:
SQL> select inst_id, group_number, name, state, total_mb, free_mb from gv$asm_diskgroup;

   INST_ID GROUP_NUMBER NAME  STATE        TOTAL_MB    FREE_MB
---------- ------------ ----- ----------- ---------- ----------
         1            1 DG1   CONNECTED      1012470          0
The disk group's free space was 0 MB; this is what brought the ASM instances down.
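The same figures are available from the ASM instance itself with asmcmd, which ships with 10.2 (a sketch; assumes the environment points at the ASM home, and +ASM1 is the local ASM SID on node 1):

[oracle@RAC-ser01 ~]$ export ORACLE_SID=+ASM1
[oracle@RAC-ser01 ~]$ asmcmd lsdg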
Checking the tablespaces showed that the TEMP tablespace had grown to 103 GB and the UNDO tablespace to roughly 50 GB. After shrinking these two tablespaces, the space in the ASM disk group was released.
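For reference, 10g has no ALTER TABLESPACE ... SHRINK SPACE for temp (that arrived in 11g), so "shrinking" usually means rebuilding the tablespaces. A sketch with hypothetical names and sizes (temp is assumed to be the default temporary tablespace, and undotbs3 stands in for whichever undo tablespace each instance uses):

-- TEMP: create a small replacement, switch the default, drop the old one
create temporary tablespace temp2 tempfile '+DG1' size 2g;
alter database default temporary tablespace temp2;
drop tablespace temp including contents and datafiles;

-- UNDO (per instance): create a new undo tablespace, switch to it, drop the old
create undo tablespace undotbs3a datafile '+DG1' size 2g autoextend on maxsize 20g;
alter system set undo_tablespace = undotbs3a scope=both sid='caprod3';
-- wait for active transactions in the old undo tablespace to finish, then:
drop tablespace undotbs3 including contents and datafiles;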