ORA-15064 and ASM

During the routine morning check, node 3 of the cluster was found to be down, yet crs_stat showed every resource on every node as normal.


After reviewing node 3's database instance alert log, ASM alert log, and CRS logs, the following was found:


Node 3 was evicted from the cluster at 04:14, which caused its operating system to reboot and its database instance to go down. The logs on the other three nodes in the cluster showed nothing abnormal.

 

Below are the findings from checking node 3:

1. crs_stat -t shows every resource in the cluster as ONLINE. Note that this command was run from one of the other nodes; crs_stat -t gets no response when run on node 3 itself.

[oracle@RACe00-ser01 ~]$ crs_stat -t
Name           Type           Target    State     Host       
------------------------------------------------------------
ora....SM1.asm application    ONLINE    ONLINE    RAC...er01
ora....01.lsnr application    ONLINE    ONLINE    RAC...er01
ora....r01.gsd application    ONLINE    ONLINE    RAC...er01
ora....r01.ons application    ONLINE    ONLINE    RAC...er01
ora....r01.vip application    ONLINE    ONLINE    RAC...er01
ora....SM2.asm application    ONLINE    ONLINE    RAC...er02
ora....02.lsnr application    ONLINE    ONLINE    RAC...er02
ora....r02.gsd application    ONLINE    ONLINE    RAC...er02
ora....r02.ons application    ONLINE    ONLINE    RAC...er02
ora....r02.vip application    ONLINE    ONLINE    RAC...er02
ora....SM3.asm application    ONLINE    ONLINE    RAC...er03
ora....03.lsnr application    ONLINE    ONLINE    RAC...er03
ora....r03.gsd application    ONLINE    ONLINE    RAC...er03
ora....r03.ons application    ONLINE    ONLINE    RAC...er03
ora....r03.vip application    ONLINE    ONLINE    RAC...er03
ora....SM4.asm application    ONLINE    ONLINE    RAC...er04
ora....04.lsnr application    ONLINE    ONLINE    RAC...er04
ora....r04.gsd application    ONLINE    ONLINE    RAC...er04
ora....r04.ons application    ONLINE    ONLINE    RAC...er04
ora....r04.vip application    ONLINE    ONLINE    RAC...er04
ora....d1.inst application    ONLINE    ONLINE    RAC...er01
ora....d2.inst application    ONLINE    ONLINE    RAC...er02
ora....d3.inst application    ONLINE    ONLINE    RAC...er03
ora....d4.inst application    ONLINE    ONLINE    RAC...er04
ora.caprod.db  application    ONLINE    ONLINE    RAC...er01
 

2. Check alert*.log: no abnormal messages.


[oracle@RAC-ser03 ~]$ tail -f alert_caprod3.log |more
Wed Mar 23 14:09:26 2011
Thread 3 advanced to log sequence 6440 (LGWR switch)
  Current log# 17 seq# 6440 mem# 0: +DG1/caprod/onlinelog/group_17.324.702599663
 

3. Check the Oracle background processes: no ora_* processes are running.
[oracle@RAC-ser03 ~]$ ps -ef|grep ora_
oracle   13772 10533  0 09:01 pts/0    00:00:00 grep ora_

 

4. Check the ASM instance: the ASM instance has been shut down as well.
[oracle@RAC-ser03 ~]$ ps -ef|grep ASM
oracle   13878 10533  0 09:01 pts/0    00:00:00 grep ASM

 

Note: at this point the third node of the cluster is already down, yet crs_stat still shows the RAC-ser03 resources as ONLINE.
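Since crs_stat on the surviving nodes still reports the dead node's resources as ONLINE, a more direct check is to ask the clusterware daemons on the suspect node itself. A minimal sketch for this 10g setup, assuming the CRS home binaries are in the PATH:

# run on the suspect node as the clusterware owner
crsctl check crs                                     # health of the CSS, CRS and EVM daemons
ps -ef | grep -E 'ocssd|crsd|evmd' | grep -v grep    # are the daemon processes even running?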

 

5. Use last to review OS logins: the system rebooted at 04:14.
[oracle@RAC-ser03 ~]$ last
oracle   pts/1        19.168.0.16        Thu Mar 24 08:59 - down   (00:09)   
oracle   pts/0        192.168.0.3         Thu Mar 24 08:38 - down   (00:30)   
reboot   system boot  2.6.18-128.el5   Thu Mar 24 04:14          (04:54) 

 

6. Check the ASM alert log: nothing abnormal.
[oracle@RAC-ser03 bdump]$ pwd
/home/oracle/product/admin/+ASM/bdump
[oracle@RAC-ser03 bdump]$ tail -n 100 alert_+ASM3.log
Mon Mar 21 10:53:04 2011
 Submitted all GCS remote-cache requests
 Post SMON to start 1st pass IR
 Fix write in gcs resources

 

7. Check the CRS alert log:
[oracle@RAC-ser03 ~]$ cd $ORACLE_BASE/crs/log/RAC-ser03/

[oracle@RAC-ser03 RAC-ser03]$ tail -n 100 alertRAC-ser03.log
2011-03-24 04:10:31.834
[cssd(7573)]CRS-1612:node RAC-ser04 (4) at 50% heartbeat fatal, eviction in 29.172 seconds
2011-03-24 04:10:32.835
[cssd(7573)]CRS-1612:node RAC-ser04 (4) at 50% heartbeat fatal, eviction in 28.162 seconds
2011-03-24 04:10:46.834
[cssd(7573)]CRS-1611:node RAC-ser04 (4) at 75% heartbeat fatal, eviction in 14.172 seconds
2011-03-24 04:10:47.836
[cssd(7573)]CRS-1611:node RAC-ser04 (4) at 75% heartbeat fatal, eviction in 13.172 seconds
2011-03-24 04:10:55.832
[cssd(7573)]CRS-1610:node RAC-ser04 (4) at 90% heartbeat fatal, eviction in 5.172 seconds
2011-03-24 04:10:56.833
[cssd(7573)]CRS-1610:node RAC-ser04 (4) at 90% heartbeat fatal, eviction in 4.172 seconds
2011-03-24 04:10:57.835
[cssd(7573)]CRS-1610:node RAC-ser04 (4) at 90% heartbeat fatal, eviction in 3.172 seconds
2011-03-24 04:10:58.838
[cssd(7573)]CRS-1610:node RAC-ser04 (4) at 90% heartbeat fatal, eviction in 2.172 seconds
2011-03-24 04:10:59.840
[cssd(7573)]CRS-1610:node RAC-ser04 (4) at 90% heartbeat fatal, eviction in 1.162 seconds
2011-03-24 04:11:00.832
[cssd(7573)]CRS-1610:node RAC-ser04 (4) at 90% heartbeat fatal, eviction in 0.172 seconds

The CSS log on RAC-ser03 shows its network heartbeat with RAC-ser04 failing; RAC-ser03 ended up being the node evicted from the cluster, which is what caused its host to reboot. The logs on the other nodes showed nothing abnormal.
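CSSD starts this countdown once network heartbeats to a peer have gone missing, and evicts the node when the CSS misscount interval expires. The thresholds configured in this cluster can be read with crsctl; a hedged sketch (availability of the second command depends on the 10g patch level):

crsctl get css misscount      # network heartbeat timeout, in seconds
crsctl get css disktimeout    # voting disk I/O timeout, in seconds (10.2.0.2 onwards)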

 

At 09:12 the RAC-ser03 operating system was rebooted manually:
reboot   system boot  2.6.18-128.el5   Thu Mar 24 09:12          (00:31) 

 

Attempt to restart the Oracle instance on RAC-ser03:
 [root@RAC-ser03 /]# ./etc/rc.local
Illegal operation: The specified slave interface 'eth0' is already a slave
Master 'bond0', Slave 'eth0': Error: Enslave failed
Illegal operation: The specified slave interface 'eth1' is already a slave
Master 'bond0', Slave 'eth1': Error: Enslave failed
/dev/raw/raw1:  bound to major 253, minor 1
/dev/raw/raw2:  bound to major 253, minor 2
/dev/raw/raw3:  bound to major 253, minor 3

 

Around 09:15 CRS came up:
2011-03-24 04:11:00.832
[cssd(7573)]CRS-1610:node RAC-ser04 (4) at 90% heartbeat fatal, eviction in 0.172 seconds
2011-03-24 09:15:16.937
[cssd(7728)]CRS-1605:CSSD voting file is online: /dev/raw/raw2. Details in /home/oracle/product/crs/log/RAC-ser03/cssd/ocssd.log
[cssd(7728)]CRS-1601:CSSD Reconfiguration complete. Active nodes are RAC-ser01 RAC-ser02 RAC-ser03 RAC-ser04 .
2011-03-24 09:15:18.281
[crsd(7141)]CRS-1012:The OCR service started on node RAC-ser03.
2011-03-24 09:15:18.313
[evmd(7064)]CRS-1401:EVMD started on node RAC-ser03.
2011-03-24 09:15:20.657
[crsd(7141)]CRS-1201:CRSD started on node RAC-ser03.
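With CRS back, the ASM and database instances on RAC-ser03 are normally brought up automatically by their CRS resources. If they are not, they can be started by hand; a hedged sketch using the database and instance names that appear in the logs above:

srvctl start asm -n RAC-ser03
srvctl start instance -d caprod -i caprod3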

 

Checking the alert log shows the database opened normally:
Thu Mar 24 09:30:20 2011
Database mounted in Shared Mode (CLUSTER_DATABASE=TRUE)
Completed: ALTER DATABASE   MOUNT
Thu Mar 24 09:30:20 2011
ALTER DATABASE OPEN
This instance was first to open
Thu Mar 24 09:30:23 2011
Beginning crash recovery of 1 threads
 parallel recovery started with 15 processes
Thu Mar 24 09:30:25 2011
Started redo scan
Thu Mar 24 09:30:25 2011
Completed redo scan
 496 redo blocks read, 147 data blocks need recovery
Thu Mar 24 09:30:25 2011
Started redo application at
 Thread 4: logseq 7911, block 657430
Thu Mar 24 09:30:25 2011
Recovery of Online Redo Log: Thread 4 Group 18 Seq 7911 Reading mem 0
  Mem# 0: +DG1/caprod/onlinelog/group_18.325.702599667
Thu Mar 24 09:30:25 2011
Completed redo application
Thu Mar 24 09:30:25 2011
Completed crash recovery at
 Thread 4: logseq 7911, block 657926, scn 9932862661957
 147 data blocks read, 147 data blocks written, 496 redo blocks read
Thu Mar 24 09:30:26 2011
Thread 4 advanced to log sequence 7912 (thread recovery)
Picked broadcast on commit scheme to generate SCNs
Thu Mar 24 09:30:26 2011
Thread 3 advanced to log sequence 6442 (thread open)
Thread 3 opened at log sequence 6442
  Current log# 16 seq# 6442 mem# 0: +DG1/caprod/onlinelog/group_16.323.702599657
Successful open of redo thread 3
Thu Mar 24 09:30:26 2011
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Thu Mar 24 09:30:26 2011
SMON: enabling cache recovery
Thu Mar 24 09:30:27 2011
Successfully onlined Undo Tablespace 5.
Thu Mar 24 09:30:27 2011
SMON: enabling tx recovery
Thu Mar 24 09:30:27 2011
Database Characterset is ZHS16GBK
Opening with internal Resource Manager plan
where NUMA PG = 1, CPUs = 16
replication_dependency_tracking turned off (no async multimaster replication found)
Starting background process QMNC
QMNC started with pid=45, OS id=13654
Thu Mar 24 09:30:31 2011
Completed: ALTER DATABASE OPEN
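A quick way to see which instances are open at any point during the incident is to query gv$instance from a surviving instance; a minimal sketch:

SQL> select inst_id, instance_name, status from gv$instance order by inst_id;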

It seemed the incident was over at this point, but it was only the beginning...


While RAC-ser03 was starting up, the other nodes in the cluster went down one after another because the ASM disk group was full.
Shortly after RAC-ser03 came up, its ASM instance also died and the database instance was shut down along with it.


Alert log on RAC-ser01:
 
Thu Mar 24 09:17:30 2011
Errors in file /home/oracle/product/admin/caprod/bdump/caprod1_asmb_26891.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Thu Mar 24 09:17:30 2011
ASMB: terminating instance due to error 15064

 

Alert log on RAC-ser02:
Thu Mar 24 09:22:03 2011
Errors in file /home/oracle/product/admin/caprod/bdump/caprod2_asmb_8227.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel

Thu Mar 24 09:22:03 2011
ASMB: terminating instance due to error 15064

Alert log on RAC-ser04:
Thu Mar 24 09:29:51 2011
Errors in file /home/oracle/product/admin/caprod/bdump/caprod4_asmb_8610.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel

Thu Mar 24 09:29:51 2011
ASMB: terminating instance due to error 15064

 

Alert log on RAC-ser03:
Thu Mar 24 09:38:35 2011
Errors in file /home/oracle/product/admin/caprod/bdump/caprod3_asmb_13523.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel

Thu Mar 24 09:38:35 2011
ASMB: terminating instance due to error 15064

 

[oracle@RAC-ser03 bdump]$ more caprod3_asmb_13523.trc
*** 2011-03-24 09:38:35.480
*** SERVICE NAME:(SYS$BACKGROUND) 2011-03-24 09:38:35.480
*** SESSION ID:(639.1) 2011-03-24 09:38:35.480
error 15064 detected in background process
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
ksuitm: waiting up to [5] seconds before killing DIAG

 

The logs from all four nodes show that the database instances went down because their ASM instances died, and the ASM instances hung because the ASM disk group had run out of free space.
At this point all four nodes of the cluster were down.


The four hosts were then rebooted one by one. Nodes 1 and 4 came back up normally, but nodes 2 and 3 did not.

Nodes 2 and 3 could be pinged, but remote login to them was not possible.

The plan was to bring up the database instance on node 1 first, but it would not start either.
CRS alert log on RAC-ser01:
2011-03-24 10:16:07.978
[crsd(7611)]CRS-1205:Auto-start failed for the CRS resource . Details in RAC-ser01.
2011-03-24 10:17:04.611
[cssd(8232)]CRS-1612:node RAC-ser03 (3) at 50% heartbeat fatal, eviction in 29.008 seconds
2011-03-24 10:17:05.613
[cssd(8232)]CRS-1612:node RAC-ser03 (3) at 50% heartbeat fatal, eviction in 28.008 seconds
2011-03-24 10:17:19.639
[cssd(8232)]CRS-1611:node RAC-ser03 (3) at 75% heartbeat fatal, eviction in 14.244 seconds
2011-03-24 10:17:28.656
[cssd(8232)]CRS-1610:node RAC-ser03 (3) at 90% heartbeat fatal, eviction in 5.224 seconds
2011-03-24 10:17:29.658
[cssd(8232)]CRS-1610:node RAC-ser03 (3) at 90% heartbeat fatal, eviction in 4.224 seconds
2011-03-24 10:17:30.660
[cssd(8232)]CRS-1610:node RAC-ser03 (3) at 90% heartbeat fatal, eviction in 3.224 seconds
2011-03-24 10:17:31.662
[cssd(8232)]CRS-1610:node RAC-ser03 (3) at 90% heartbeat fatal, eviction in 2.214 seconds
2011-03-24 10:17:32.654
[cssd(8232)]CRS-1610:node RAC-ser03 (3) at 90% heartbeat fatal, eviction in 1.224 seconds
2011-03-24 10:17:33.656
[cssd(8232)]CRS-1610:node RAC-ser03 (3) at 90% heartbeat fatal, eviction in 0.224 seconds
2011-03-24 10:17:34.139
[cssd(8232)]CRS-1607:CSSD evicting node RAC-ser03. Details in /home/oracle/product/crs/log/RAC-ser01/cssd/ocssd.log

 

After the node 2 and node 3 hosts were forcibly powered off, node 1 started normally.

Once node 1 was back up, the ASM disk group space and state were checked:

SQL> select inst_id ,group_number ,name ,state,total_mb , free_mb  from gv$asm_diskgroup;

 

   INST_ID GROUP_NUMBER NAME       STATE         TOTAL_MB    FREE_MB
---------- ------------ ---------- ----------- ---------- ----------
         1            1 DG1        CONNECTED      1012470          0

 

The ASM disk group had 0 MB of free space left, which is why the ASM instances hung.
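To catch this earlier, free space in the disk groups can be watched with a query such as the following; the 10% threshold is only an illustrative assumption:

SQL> select name, total_mb, free_mb, round(free_mb/total_mb*100,2) pct_free from v$asm_diskgroup where total_mb > 0 and free_mb/total_mb < 0.10;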

 

Checking the tablespaces showed that the TEMP tablespace had grown to 103 GB and the undo tablespace to around 50 GB.
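The sizes can be read from the data dictionary; a sketch of the kind of query involved, assuming the undo tablespace names start with UNDO:

SQL> select tablespace_name, round(sum(bytes)/1024/1024/1024,1) size_gb from dba_temp_files group by tablespace_name;
SQL> select tablespace_name, round(sum(bytes)/1024/1024/1024,1) size_gb from dba_data_files where tablespace_name like 'UNDO%' group by tablespace_name;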

After these two tablespaces were shrunk, the space in the ASM disk group was released.
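On 10g, one common way to reclaim this space is to create smaller replacements for TEMP and undo and drop the bloated originals; the sketch below is hedged, and every object name and size in it is an assumption:

-- TEMP: create a smaller temporary tablespace, switch the default, drop the old one
-- (the drop only succeeds once no sessions still hold segments in it)
SQL> create temporary tablespace temp2 tempfile '+DG1' size 8g;
SQL> alter database default temporary tablespace temp2;
SQL> drop tablespace temp including contents and datafiles;

-- Undo is per instance in RAC: create a new undo tablespace, point the instance at it,
-- then drop the old one after its unexpired undo has aged out
SQL> create undo tablespace undotbs3_new datafile '+DG1' size 8g;
SQL> alter system set undo_tablespace=UNDOTBS3_NEW sid='caprod3';
SQL> drop tablespace undotbs3 including contents and datafiles;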
