One RAC node down and unreachable after a memory fault

Environment:

Application: SAP ERP 6.0

Database: Oracle 11g RAC + ASM

During a routine health check, transaction code DB02 was run in SAP:

[Figure 1: SAP DB02 screenshot]



Checking the Clusterware status showed that only one node remained:

bash-3.00$ ./crsctl status resource -t

--------------------------------------------------------------------------------

NAME TARGET STATE SERVER STATE_DETAILS

--------------------------------------------------------------------------------

Local Resources

--------------------------------------------------------------------------------

ora.ACFS.dg

ONLINE ONLINE r3prddb01

ora.ARCH.dg

ONLINE ONLINE r3prddb01

ora.DATA.dg

ONLINE ONLINE r3prddb01

ora.LISTENER.lsnr

ONLINE ONLINE r3prddb01

ora.MLOG.dg

ONLINE ONLINE r3prddb01

ora.OLOG.dg

ONLINE ONLINE r3prddb01

ora.RECO.dg

ONLINE ONLINE r3prddb01

ora.VCR.dg

ONLINE ONLINE r3prddb01

ora.acfs.acfs.acfs

ONLINE ONLINE r3prddb01

ora.asm

ONLINE ONLINE r3prddb01

ora.gsd

OFFLINE OFFLINE r3prddb01

ora.net1.network

ONLINE ONLINE r3prddb01

ora.ons

ONLINE ONLINE r3prddb01

ora.registry.acfs

ONLINE ONLINE r3prddb01

--------------------------------------------------------------------------------

Cluster Resources

--------------------------------------------------------------------------------

ora.LISTENER_SCAN1.lsnr

1 ONLINE ONLINE r3prddb01

ora.cvu

1 ONLINE ONLINE r3prddb01

ora.oc4j

1 ONLINE ONLINE r3prddb01

ora.p01.db

1 ONLINE ONLINE r3prddb01 Open

2 ONLINE OFFLINE

ora.r3prddb01.vip

1 ONLINE ONLINE r3prddb01

ora.r3prddb02.vip

1 ONLINE INTERMEDIATE r3prddb01 FAILED OVER

ora.scan1.vip

1 ONLINE ONLINE r3prddb01

bash-3.00$
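Besides `crsctl status resource -t`, a few more commands can confirm which node and instance survived. A minimal sketch, assuming a standard 11gR2 Grid Infrastructure layout with `$GRID_HOME` set and the database name `p01` from the output above; run on the surviving node r3prddb01:

```shell
# Per-node health of the CRS, CSS and EVM daemons
$GRID_HOME/bin/crsctl check cluster -all

# Node list with Active/Inactive status
$GRID_HOME/bin/olsnodes -s

# Which database instances are currently running
srvctl status database -d p01
```

These require a live cluster stack, so they are shown here only as a reference checklist.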

Alert log of the database instance on db01:

Sat Aug 23 11:17:20 2014

Reconfiguration started (old inc 4, new inc 6)

List of instances:

1 (myinst: 1)

Global Resource Directory frozen

* dead instance detected - domain 0 invalid = TRUE

Communication channels reestablished

Master broadcasted resource hash value bitmaps

Non-local Process blocks cleaned out

Sat Aug 23 11:17:20 2014

LMS 1: 6 GCS shadows cancelled, 0 closed, 0 Xw survived

Sat Aug 23 11:17:20 2014

LMS 0: 5 GCS shadows cancelled, 1 closed, 0 Xw survived

Sat Aug 23 11:17:20 2014

LMS 2: 5 GCS shadows cancelled, 2 closed, 0 Xw survived

Set master node info

Submitted all remote-enqueue requests

Dwn-cvts replayed, VALBLKs dubious

All grantable enqueues granted

Post SMON to start 1st pass IR

Sat Aug 23 11:17:20 2014

Instance recovery: looking for dead threads

Beginning instance recovery of 1 threads

Submitted all GCS remote-cache requests

Post SMON to start 1st pass IR

Fix write in gcs resources

Reconfiguration complete

parallel recovery started with 31 processes

Started redo scan

Completed redo scan

read 58319 KB redo, 11434 data blocks need recovery

Started redo application at

Thread 2: logseq 49560, block 81359

Recovery of Online Redo Log: Thread 2 Group 44 Seq 49560 Reading mem 0

Mem# 0: +OLOG/p01/onlinelog/log_g44m1.dbf

Mem# 1: +MLOG/p01/onlinelog/log_g44m2.dbf

Sat Aug 23 11:17:25 2014

Setting Resource Manager plan SCHEDULER[0x447BF]:DEFAULT_MAINTENANCE_PLAN via scheduler window

Setting Resource Manager plan DEFAULT_MAINTENANCE_PLAN via parameter

Sat Aug 23 11:17:25 2014

minact-scn: master found reconf/inst-rec before recscn scan old-inc#:6 new-inc#:6

Completed redo application of 48.31MB

Completed instance recovery at

Thread 2: logseq 49560, block 197998, scn 11518227963

10738 data blocks read, 11483 data blocks written, 58319 redo k-bytes read

Thread 2 advanced to log sequence 49561 (thread recovery)

Redo thread 2 internally disabled at seq 49561 (SMON)

Sat Aug 23 11:17:27 2014

Archived Log entry 91800 added for thread 2 sequence 49560 ID 0x592ddd4a dest 1:

Sat Aug 23 11:17:27 2014

ARC0: Archiving disabled thread 2 sequence 49561

Archived Log entry 91801 added for thread 2 sequence 49561 ID 0x592ddd4a dest 1:

minact-scn: master continuing after IR

minact-scn: Master considers inst:2 dead

Sat Aug 23 11:17:28 2014

Beginning log switch checkpoint up to RBA [0xa4e4.2.10], SCN: 11518240393

Thread 1 advanced to log sequence 42212 (LGWR switch)

Current log# 35 seq# 42212 mem# 0: +OLOG/p01/onlinelog/log_g35m1.dbf

Current log# 35 seq# 42212 mem# 1: +MLOG/p01/onlinelog/log_g35m2.dbf

Archived Log entry 91802 added for thread 1 sequence 42211 ID 0x592ddd4a dest 1:



Alert log of the ASM instance (ASM01) on db01:

Sat Aug 23 11:17:20 2014

Reconfiguration started (old inc 4, new inc 6)

List of instances:

1 (myinst: 1)

Global Resource Directory frozen

* dead instance detected - domain 1 invalid = TRUE

* dead instance detected - domain 2 invalid = TRUE

* dead instance detected - domain 3 invalid = TRUE

* dead instance detected - domain 4 invalid = TRUE

* dead instance detected - domain 5 invalid = TRUE

* dead instance detected - domain 6 invalid = TRUE

* dead instance detected - domain 7 invalid = TRUE

Communication channels reestablished

Master broadcasted resource hash value bitmaps

Non-local Process blocks cleaned out

Sat Aug 23 11:17:20 2014

LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived

Set master node info

Submitted all remote-enqueue requests

Dwn-cvts replayed, VALBLKs dubious

All grantable enqueues granted

Post SMON to start 1st pass IR

Sat Aug 23 11:17:20 2014

NOTE: SMON starting instance recovery for group ACFS domain 1 (mounted)

NOTE: F1X0 found on disk 1 au 49 fcn 0.12248

Submitted all GCS remote-cache requests

Post SMON to start 1st pass IR

Fix write in gcs resources

Reconfiguration complete

NOTE: starting recovery of thread=2 ckpt=19.43 group=1 (ACFS)

NOTE: SMON waiting for thread 2 recovery enqueue

NOTE: SMON about to begin recovery lock claims for diskgroup 1 (ACFS)

NOTE: SMON successfully validated lock domain 1

NOTE: advancing ckpt for thread=2 ckpt=19.43

NOTE: SMON did instance recovery for group ACFS domain 1

Sat Aug 23 11:17:20 2014

ASM Volume(VDBG) - Unable to send message 'disk status' to the volume driver.

NOTE: SMON starting instance recovery for group ARCH domain 2 (mounted)

NOTE: F1X0 found on disk 9 au 113 fcn 0.41343439

NOTE: starting recovery of thread=2 ckpt=77.3254 group=2 (ARCH)

NOTE: SMON waiting for thread 2 recovery enqueue

NOTE: SMON about to begin recovery lock claims for diskgroup 2 (ARCH)

NOTE: SMON successfully validated lock domain 2

NOTE: advancing ckpt for thread=2 ckpt=77.3254

NOTE: SMON did instance recovery for group ARCH domain 2

ASM Volume(VDBG) - Unable to send message 'disk status' to the volume driver.

NOTE: SMON starting instance recovery for group DATA domain 3 (mounted)

NOTE: F1X0 found on disk 15 au 60241 fcn 0.5143392

NOTE: starting recovery of thread=2 ckpt=22.3858 group=3 (DATA)

NOTE: SMON waiting for thread 2 recovery enqueue

NOTE: SMON about to begin recovery lock claims for diskgroup 3 (DATA)

NOTE: SMON successfully validated lock domain 3

NOTE: advancing ckpt for thread=2 ckpt=22.3858

NOTE: SMON did instance recovery for group DATA domain 3

ASM Volume(VDBG) - Unable to send message 'disk status' to the volume driver.

NOTE: SMON starting instance recovery for group MLOG domain 4 (mounted)

NOTE: F1X0 found on disk 3 au 639 fcn 0.120137

NOTE: starting recovery of thread=2 ckpt=23.2161 group=4 (MLOG)

NOTE: SMON waiting for thread 2 recovery enqueue

NOTE: SMON about to begin recovery lock claims for diskgroup 4 (MLOG)

NOTE: SMON successfully validated lock domain 4

NOTE: advancing ckpt for thread=2 ckpt=23.2161

NOTE: SMON did instance recovery for group MLOG domain 4

ASM Volume(VDBG) - Unable to send message 'disk status' to the volume driver.

NOTE: SMON starting instance recovery for group OLOG domain 5 (mounted)

NOTE: F1X0 found on disk 3 au 637 fcn 0.121291

NOTE: starting recovery of thread=2 ckpt=23.2261 group=5 (OLOG)

NOTE: SMON waiting for thread 2 recovery enqueue

NOTE: SMON about to begin recovery lock claims for diskgroup 5 (OLOG)

NOTE: SMON successfully validated lock domain 5

NOTE: advancing ckpt for thread=2 ckpt=23.2261

NOTE: SMON did instance recovery for group OLOG domain 5

ASM Volume(VDBG) - Unable to send message 'disk status' to the volume driver.

NOTE: SMON starting instance recovery for group RECO domain 6 (mounted)

NOTE: F1X0 found on disk 11 au 11 fcn 0.2264

NOTE: starting recovery of thread=2 ckpt=19.6 group=6 (RECO)

NOTE: SMON waiting for thread 2 recovery enqueue

NOTE: SMON about to begin recovery lock claims for diskgroup 6 (RECO)

NOTE: SMON successfully validated lock domain 6

NOTE: advancing ckpt for thread=2 ckpt=19.6

NOTE: SMON did instance recovery for group RECO domain 6

ASM Volume(VDBG) - Unable to send message 'disk status' to the volume driver.

NOTE: SMON starting instance recovery for group VCR domain 7 (mounted)

NOTE: F1X0 found on disk 2 au 177 fcn 0.1216

NOTE: starting recovery of thread=2 ckpt=16.13 group=7 (VCR)

NOTE: SMON waiting for thread 2 recovery enqueue

NOTE: SMON about to begin recovery lock claims for diskgroup 7 (VCR)

NOTE: SMON successfully validated lock domain 7

NOTE: advancing ckpt for thread=2 ckpt=16.13

NOTE: SMON did instance recovery for group VCR domain 7



The logs above put the failure at 11:17 on August 23. Colleagues in the data center were contacted; they found a memory-fault alarm for this machine on the M9000 and filed it for repair.

The machine is under repair...

The next day the memory fault on node 02 was fixed and its operating system was booted. After a short wait, Clusterware came up on its own and the database instance on node 02 recovered as well.

On this machine the cluster stack is configured to start automatically with the operating system, which is also the default setting after installation.

To turn automatic startup off, run crsctl disable crs; to turn it back on, run crsctl enable crs.

With autostart disabled, the stack must be started manually with crsctl start crs.
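The autostart controls can be summarized as follows. A sketch assuming root access on the node in question and `$GRID_HOME` pointing at the Grid Infrastructure home; these commands only make sense on a live cluster node:

```shell
# Disable automatic startup of the Oracle HA stack at OS boot
$GRID_HOME/bin/crsctl disable crs

# Restore the default behavior (start automatically at OS boot)
$GRID_HOME/bin/crsctl enable crs

# Start the stack manually when autostart is disabled
$GRID_HOME/bin/crsctl start crs

# Verify that CRS, CSS and EVM are online afterwards
$GRID_HOME/bin/crsctl check crs
```

In this incident the default (enabled) setting meant no manual intervention was needed once node 02 booted.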

bash-3.00$ ./crsctl status resource -t

--------------------------------------------------------------------------------

NAME TARGET STATE SERVER STATE_DETAILS

--------------------------------------------------------------------------------

Local Resources

--------------------------------------------------------------------------------

ora.ACFS.dg

ONLINE ONLINE r3prddb01

ONLINE ONLINE r3prddb02

ora.ARCH.dg

ONLINE ONLINE r3prddb01

ONLINE ONLINE r3prddb02

ora.DATA.dg

ONLINE ONLINE r3prddb01

ONLINE ONLINE r3prddb02

ora.LISTENER.lsnr

ONLINE ONLINE r3prddb01

ONLINE ONLINE r3prddb02

ora.MLOG.dg

ONLINE ONLINE r3prddb01

ONLINE ONLINE r3prddb02

ora.OLOG.dg

ONLINE ONLINE r3prddb01

ONLINE ONLINE r3prddb02

ora.RECO.dg

ONLINE ONLINE r3prddb01

ONLINE ONLINE r3prddb02

ora.VCR.dg

ONLINE ONLINE r3prddb01

ONLINE ONLINE r3prddb02

ora.acfs.acfs.acfs

ONLINE ONLINE r3prddb01

ONLINE ONLINE r3prddb02

ora.asm

ONLINE ONLINE r3prddb01

ONLINE ONLINE r3prddb02

ora.gsd

OFFLINE OFFLINE r3prddb01

OFFLINE OFFLINE r3prddb02

ora.net1.network

ONLINE ONLINE r3prddb01

ONLINE ONLINE r3prddb02

ora.ons

ONLINE ONLINE r3prddb01

ONLINE ONLINE r3prddb02

ora.registry.acfs

ONLINE ONLINE r3prddb01

ONLINE ONLINE r3prddb02

--------------------------------------------------------------------------------

Cluster Resources

--------------------------------------------------------------------------------

ora.LISTENER_SCAN1.lsnr

1 ONLINE ONLINE r3prddb01

ora.cvu

1 ONLINE ONLINE r3prddb01

ora.oc4j

1 ONLINE ONLINE r3prddb01

ora.p01.db

1 ONLINE ONLINE r3prddb01 Open

2 ONLINE ONLINE r3prddb02 Open

ora.r3prddb01.vip

1 ONLINE ONLINE r3prddb01

ora.r3prddb02.vip

1 ONLINE ONLINE r3prddb02

ora.scan1.vip

1 ONLINE ONLINE r3prddb01


DB02 now shows both instances:

[Figure 2: SAP DB02 showing both instances]

Source: ITPUB blog, http://blog.itpub.net/27771627/viewspace-1257887/ (please credit the source when reprinting).
