Corrupted block in the ASM diskgroup of a VM-based RAC forces a DB rebuild

2011.11.23
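I had just built a 10gR2 RAC environment with VMware Workstation 8 on a PC at the office, using raw devices plus ASM. After the installation succeeded, the host was accidentally rebooted outright. The next time the virtual machines started, there was a warning about a damaged disk, which I ignored. Starting RAC then ran into trouble. The first symptom was that the following resources would not start along with the others: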
ora.node1.LISTENER_NODE1.lsnr
ora.node2.LISTENER_NODE2.lsnr
ora.RAC.RAC1.inst
ora.RAC.RAC2.inst
ora.RAC.db
Here is the startup sequence in detail:
[oracle@node1 bin]$ crs_stat -t
Name           Type           Target    State     Host        
------------------------------------------------------------
ora....C1.inst application    OFFLINE   OFFLINE               
ora....C2.inst application    OFFLINE   OFFLINE               
ora.RAC.db     application    OFFLINE   OFFLINE               
ora....SM1.asm application    OFFLINE   OFFLINE               
ora....E1.lsnr application    OFFLINE   OFFLINE               
ora.node1.gsd  application    OFFLINE   OFFLINE               
ora.node1.ons  application    OFFLINE   OFFLINE               
ora.node1.vip  application    OFFLINE   OFFLINE               
ora....SM2.asm application    OFFLINE   OFFLINE               
ora....E2.lsnr application    OFFLINE   OFFLINE               
ora.node2.gsd  application    OFFLINE   OFFLINE               
ora.node2.ons  application    OFFLINE   OFFLINE               
ora.node2.vip  application    OFFLINE   OFFLINE               
[oracle@node1 bin]$ crs_start -all
Attempting to start `ora.node1.vip` on member `node1`
Attempting to start `ora.node2.vip` on member `node2`
Start of `ora.node1.vip` on member `node1` succeeded.
Start of `ora.node2.vip` on member `node2` succeeded.
Attempting to start `ora.node1.ASM1.asm` on member `node1`
Attempting to start `ora.node2.ASM2.asm` on member `node2`
Start of `ora.node2.ASM2.asm` on member `node2` succeeded.
Attempting to start `ora.RAC.RAC2.inst` on member `node2`
Start of `ora.RAC.RAC2.inst` on member `node2` failed.
node1 : CRS-1018: Resource ora.node2.vip (application) is already running on node2




node1 : CRS-1018: Resource ora.node2.vip (application) is already running on node2




Start of `ora.node1.ASM1.asm` on member `node1` succeeded.
Attempting to start `ora.RAC.RAC1.inst` on member `node1`
Start of `ora.RAC.RAC1.inst` on member `node1` failed.
node2 : CRS-1018: Resource ora.node1.vip (application) is already running on node1




node2 : CRS-1018: Resource ora.node1.vip (application) is already running on node1




CRS-1002: Resource 'ora.node1.ons' is already running on member 'node1'


CRS-1002: Resource 'ora.node2.ons' is already running on member 'node2'


Attempting to start `ora.node1.gsd` on member `node1`
Attempting to start `ora.RAC.db` on member `node1`
Attempting to start `ora.node2.gsd` on member `node2`
Start of `ora.node1.gsd` on member `node1` succeeded.
Start of `ora.node2.gsd` on member `node2` succeeded.
Start of `ora.RAC.db` on member `node1` failed.
Attempting to start `ora.RAC.db` on member `node2`
Start of `ora.RAC.db` on member `node2` failed.
CRS-1006: No more members to consider


CRS-0215: Could not start resource 'ora.RAC.RAC1.inst'.


CRS-0215: Could not start resource 'ora.RAC.RAC2.inst'.


CRS-0215: Could not start resource 'ora.RAC.db'.


CRS-0223: Resource 'ora.node1.LISTENER_NODE1.lsnr' has placement error.


CRS-0223: Resource 'ora.node1.ons' has placement error.


CRS-0223: Resource 'ora.node2.LISTENER_NODE2.lsnr' has placement error.


CRS-0223: Resource 'ora.node2.ons' has placement error.


[oracle@node1 bin]$ crs_stat -t
Name           Type           Target    State     Host        
------------------------------------------------------------
ora....C1.inst application    ONLINE    OFFLINE               
ora....C2.inst application    ONLINE    OFFLINE               
ora.RAC.db     application    ONLINE    OFFLINE               
ora....SM1.asm application    ONLINE    ONLINE    node1       
ora....E1.lsnr application    OFFLINE   OFFLINE               
ora.node1.gsd  application    ONLINE    ONLINE    node1       
ora.node1.ons  application    ONLINE    ONLINE    node1       
ora.node1.vip  application    ONLINE    ONLINE    node1       
ora....SM2.asm application    ONLINE    ONLINE    node2       
ora....E2.lsnr application    OFFLINE   OFFLINE               
ora.node2.gsd  application    ONLINE    ONLINE    node2       
ora.node2.ons  application    ONLINE    ONLINE    node2       
ora.node2.vip  application    ONLINE    ONLINE    node2       
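
The listing above is easier to scan if only the resources whose Target is ONLINE but whose State is not are printed. The following is a minimal sketch, assuming the five-column `crs_stat -t` layout shown here; `stuck_resources` is a hypothetical helper name, not an Oracle tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `crs_stat -t` style output on stdin and prints
# the name of every resource whose Target is ONLINE but whose State is not.
stuck_resources() {
    awk '$3 == "ONLINE" && $4 != "ONLINE" { print $1 }'
}

# Demo on rows copied from the listing above. On a live node one would
# pipe the real command instead:  crs_stat -t | stuck_resources
stuck_resources <<'EOF'
Name           Type           Target    State     Host
------------------------------------------------------------
ora....C1.inst application    ONLINE    OFFLINE
ora....SM1.asm application    ONLINE    ONLINE    node1
ora....E1.lsnr application    OFFLINE   OFFLINE
EOF
```

The header and separator rows are skipped because their third field is never ONLINE. The listeners do not show up here: their Target is still OFFLINE, so they need to be started explicitly.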
First, try bringing up the listeners:
[oracle@node1 bin]$ crs_start ora.node1.LISTENER_NODE1.lsnr
Attempting to start `ora.node1.LISTENER_NODE1.lsnr` on member `node1`
Start of `ora.node1.LISTENER_NODE1.lsnr` on member `node1` succeeded.
[oracle@node1 bin]$ crs_start ora.node2.LISTENER_NODE2.lsnr
Attempting to start `ora.node2.LISTENER_NODE2.lsnr` on member `node2`
Start of `ora.node2.LISTENER_NODE2.lsnr` on member `node2` succeeded.
Next, start the two instances. This is where the problem shows up: the instance cannot be brought up:
[oracle@node1 bin]$ crs_start ora.RAC.RAC1.inst
Attempting to start `ora.RAC.RAC1.inst` on member `node1`
Start of `ora.RAC.RAC1.inst` on member `node1` failed.
node2 : CRS-1018: Resource ora.node1.vip (application) is already running on node1




CRS-0215: Could not start resource 'ora.RAC.RAC1.inst'.
Check the relevant logs.
First, the ASM alert log:
alert_+ASM1.log
Wed Nov 23 15:14:12 2011
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Interface type 1 eth1 192.168.91.0 configured from OCR for use as a cluster interconnect
Interface type 1 eth0 192.168.88.0 configured from OCR for use as  a public interface
Picked latch-free SCN scheme 2
Using LOG_ARCHIVE_DEST_1 parameter default value as /opt/app/product/10.2.0/db_1/dbs/arch
Autotune of undo retention is turned off. 
LICENSE_MAX_USERS = 0
SYS auditing is disabled
ksdpec: called for event 13740 prior to event group initialization
Starting up ORACLE RDBMS Version: 10.2.0.1.0.
System parameters with non-default values:
  large_pool_size          = 12582912
  instance_type            = asm
  cluster_database         = TRUE
  instance_number          = 1
  remote_login_passwordfile= EXCLUSIVE
  background_dump_dest     = /opt/app/admin/+ASM/bdump
  user_dump_dest           = /opt/app/admin/+ASM/udump
  core_dump_dest           = /opt/app/admin/+ASM/cdump
  asm_diskgroups           = DATA1
Cluster communication is configured to use the following interface(s) for this instance
  192.168.91.100
Wed Nov 23 15:14:13 2011
cluster interconnect IPC version:Oracle UDP/IP
IPC Vendor 1 proto 2
PMON started with pid=2, OS id=25132
DIAG started with pid=3, OS id=25134
PSP0 started with pid=4, OS id=25136
LMON started with pid=5, OS id=25138
LMD0 started with pid=6, OS id=25140
LMS0 started with pid=7, OS id=25142
MMAN started with pid=8, OS id=25152
DBW0 started with pid=9, OS id=25154
LGWR started with pid=10, OS id=25156
CKPT started with pid=11, OS id=25158
SMON started with pid=12, OS id=25160
RBAL started with pid=13, OS id=25162
GMON started with pid=14, OS id=25164
Wed Nov 23 15:14:13 2011
lmon registered with NM - instance id 1 (internal mem no 0)
Wed Nov 23 15:14:13 2011
Reconfiguration started (old inc 0, new inc 1)
ASM instance 
List of nodes:
 0 1
 Global Resource Directory frozen
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Wed Nov 23 15:14:14 2011
 LMS 0: 0 GCS shadows cancelled, 0 closed
 Set master node info 
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Post SMON to start 1st pass IR
Wed Nov 23 15:14:14 2011
 LMS 0: 0 GCS shadows traversed, 0 replayed
Wed Nov 23 15:14:14 2011
 Submitted all GCS remote-cache requests
 Post SMON to start 1st pass IR
 Fix write in gcs resources
Reconfiguration complete
LCK0 started with pid=15, OS id=25208
Wed Nov 23 15:14:15 2011
SQL> ALTER DISKGROUP ALL MOUNT 
Wed Nov 23 15:14:15 2011
NOTE: cache registered group DATA1 number=1 incarn=0x6f877cd9
* allocate domain 1, invalid = TRUE 
freeing rdom 1
 Received dirty detach msg from node 1 for dom 1
Wed Nov 23 15:14:22 2011
Loaded ASM Library - Generic Linux, version 2.0.4 (KABI_V2) library for asmlib interface
Wed Nov 23 15:14:22 2011
ORA-15186: ASMLIB error function = [asm_open],  error = [1],  mesg = [Operation not permitted]
Wed Nov 23 15:14:22 2011
ORA-15186: ASMLIB error function = [asm_open],  error = [1],  mesg = [Operation not permitted]
Wed Nov 23 15:14:23 2011
NOTE: Hbeat: instance first (grp 1)
Wed Nov 23 15:14:27 2011
NOTE: start heartbeating (grp 1)
NOTE: cache opening disk 0 of grp 1: DATA1_0000 path:/dev/raw/raw3
Wed Nov 23 15:14:27 2011
NOTE: F1X0 found on disk 0 fcn 0.0
NOTE: cache opening disk 1 of grp 1: DATA1_0001 path:/dev/raw/raw4
NOTE: cache mounting (first) group 1/0x6F877CD9 (DATA1)
* allocate domain 1, invalid = TRUE 
kjbdomatt send to node 1
Wed Nov 23 15:14:27 2011
NOTE: attached to recovery domain 1
Wed Nov 23 15:14:27 2011
NOTE: starting recovery of thread=1 ckpt=3.315
NOTE: starting recovery of thread=2 ckpt=3.50
WARNING: cache failed to read fn=3  indblk=0 from disk(s): 1
ORA-15196: invalid ASM block header [kfc.c:7910] [endian_kfbh] [3] [2147483648] [0 != 1]
NOTE: a corrupted block was dumped to the trace file
System State dumped to trace file /opt/app/admin/+ASM/udump/+asm1_ora_25219.trc
NOTE: cache initiating offline of disk 1  group 1
WARNING: offlining disk 1.3914828841 (DATA1_0001) with mask 0x3
NOTE: PST update: grp = 1, dsk = 1, mode = 0x6
Wed Nov 23 15:14:27 2011
ERROR: too many offline disks in PST (grp 1)
Wed Nov 23 15:14:27 2011
NOTE: halting all I/Os to diskgroup DATA1
NOTE: active pin found: 0x0x2427ccd0
NOTE: active pin found: 0x0x2427cc64
Abort recovery for domain 1
NOTE: crash recovery signalled OER-15130
ERROR: ORA-15130 signalled during mount of diskgroup DATA1
NOTE: cache dismounting group 1/0x6F877CD9 (DATA1) 
Wed Nov 23 15:14:28 2011
kjbdomdet send to node 1
detach from dom 1, sending detach message to node 1
Wed Nov 23 15:14:28 2011
Dirty detach reconfiguration started (old inc 1, new inc 1)
List of nodes:
 0 1
 Global Resource Directory partially frozen for dirty detach 
* dirty detach - domain 1 invalid = TRUE 
 0 GCS resources traversed, 0 cancelled
Dirty Detach Reconfiguration complete
Wed Nov 23 15:14:28 2011
freeing rdom 1
Wed Nov 23 15:14:28 2011
WARNING: dirty detached from domain 1
Wed Nov 23 15:14:28 2011
ERROR: diskgroup DATA1 was not mounted
Wed Nov 23 15:14:28 2011
WARNING: PST-initiated MANDATORY DISMOUNT of group DATA1 not performed - group not mounted
Wed Nov 23 15:14:28 2011
Errors in file /opt/app/admin/+ASM/bdump/+asm1_b000_25521.trc:
ORA-15001: diskgroup "DATA1" does not exist or is not mounted
[oracle@node1 bdump]$ 


The two excerpts below show that ASM hit an error while mounting the diskgroup:
.....
WARNING: cache failed to read fn=3  indblk=0 from disk(s): 1
ORA-15196: invalid ASM block header [kfc.c:7910] [endian_kfbh] [3] [2147483648] [0 != 1]
NOTE: a corrupted block was dumped to the trace file
System State dumped to trace file /opt/app/admin/+ASM/udump/+asm1_ora_25219.trc
NOTE: cache initiating offline of disk 1  group 1
WARNING: offlining disk 1.3914828841 (DATA1_0001) with mask 0x3
.....
Wed Nov 23 15:14:28 2011
WARNING: dirty detached from domain 1
Wed Nov 23 15:14:28 2011
ERROR: diskgroup DATA1 was not mounted
Wed Nov 23 15:14:28 2011
WARNING: PST-initiated MANDATORY DISMOUNT of group DATA1 not performed - group not mounted
Wed Nov 23 15:14:28 2011
Errors in file /opt/app/admin/+ASM/bdump/+asm1_b000_25521.trc:
ORA-15001: diskgroup "DATA1" does not exist or is not mounted
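
Since the alert log repeats the same errors on every mount attempt, it can help to reduce it to the distinct ORA codes in order of first appearance. The following is a minimal sketch; `ora_codes` is a hypothetical helper name, and the pattern assumes five-digit codes as seen in this log.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads log text on stdin and prints each distinct
# ORA-nnnnn code once, in order of first appearance.
ora_codes() {
    grep -o 'ORA-[0-9]\{5\}' | awk '!seen[$0]++'
}

# Demo on lines copied from the alert log. On the node one would run:
#   ora_codes < alert_+ASM1.log
ora_codes <<'EOF'
ORA-15196: invalid ASM block header [kfc.c:7910] [endian_kfbh] [3] [2147483648] [0 != 1]
ERROR: ORA-15130 signalled during mount of diskgroup DATA1
ORA-15001: diskgroup "DATA1" does not exist or is not mounted
ORA-15001: diskgroup "DATA1" does not exist or is not mounted
EOF
```

Each code can then be looked up with `oerr ora <number>`.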
Look at the trace files in detail:
cat /opt/app/admin/+ASM/udump/+asm1_ora_25219.trc | less
It contains the following error:
******************************************************
*** 2011-11-23 15:14:28.703
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [723], [529336], [529336], [memory leak], [], [], [], []
Current SQL information unavailable - no SGA.


cat /opt/app/admin/+ASM/bdump/+asm1_b000_25521.trc| less
/opt/app/admin/+ASM/bdump/+asm1_b000_25521.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options
ORACLE_HOME = /opt/app/product/10.2.0/db_1
System name:    Linux
Node name:      node1
Release:        2.6.18-164.el5
Version:        #1 SMP Tue Aug 18 15:51:54 EDT 2009
Machine:        i686
Instance name: +ASM1
Redo thread mounted by this instance: 0 <none>
Oracle process number: 17
Unix process pid: 25521, image: oracle@node1 (B000)


*** SERVICE NAME:() 2011-11-23 15:14:28.679
*** SESSION ID:(33.1) 2011-11-23 15:14:28.679
ORA-15001: diskgroup "DATA1" does not exist or is not mounted


Everything points to the diskgroup not having been mounted, so check it and try mounting it manually:
[oracle@node1 bdump]$ sqlplus /nolog


SQL*Plus: Release 10.2.0.1.0 - Production on Wed Nov 23 15:33:36 2011


Copyright (c) 1982, 2005, Oracle.  All rights reserved.


SQL> exit
[oracle@node1 bdump]$ export ORACLE_SID=+ASM1
[oracle@node1 bdump]$ sqlplus /nolog


SQL*Plus: Release 10.2.0.1.0 - Production on Wed Nov 23 15:33:41 2011


Copyright (c) 1982, 2005, Oracle.  All rights reserved.


SQL> conn /as sysdba
Connected.
SQL> desc v$asm_diskgroup;
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 GROUP_NUMBER                                       NUMBER
 NAME                                               VARCHAR2(30)
 SECTOR_SIZE                                        NUMBER
 BLOCK_SIZE                                         NUMBER
 ALLOCATION_UNIT_SIZE                               NUMBER
 STATE                                              VARCHAR2(11)
 TYPE                                               VARCHAR2(6)
 TOTAL_MB                                           NUMBER
 FREE_MB                                            NUMBER
 REQUIRED_MIRROR_FREE_MB                            NUMBER
 USABLE_FILE_MB                                     NUMBER
 OFFLINE_DISKS                                      NUMBER
 UNBALANCED                                         VARCHAR2(1)
 COMPATIBILITY                                      VARCHAR2(60)
 DATABASE_COMPATIBILITY                             VARCHAR2(60)


SQL> set linesize 150
SQL> column name format a30;
SQL> column state format a10;
SQL> select name,state from v$asm_diskgroup;


NAME                           STATE
------------------------------ ----------
DATA1                          DISMOUNTED


Sure enough, the diskgroup is not mounted. Try mounting it manually:
SQL> alter diskgroup data1 mount;
alter diskgroup data1 mount
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15130: diskgroup "DATA1" is being dismounted
ORA-15066: offlining disk "DATA1_0001" may result in a data loss




SQL> 
It fails. Check the log:
[oracle@node1 bdump]$ tail -50 alert_+ASM1.log 
NOTE: F1X0 found on disk 0 fcn 0.0
NOTE: cache opening disk 1 of grp 1: DATA1_0001 path:/dev/raw/raw4
NOTE: cache mounting (first) group 1/0x26277CDE (DATA1)
* allocate domain 1, invalid = TRUE 
kjbdomatt send to node 1
Wed Nov 23 15:37:49 2011
NOTE: attached to recovery domain 1
Wed Nov 23 15:37:49 2011
NOTE: starting recovery of thread=1 ckpt=3.315
NOTE: starting recovery of thread=2 ckpt=3.50
WARNING: cache failed to read fn=3  indblk=0 from disk(s): 1
ORA-15196: invalid ASM block header [kfc.c:7910] [endian_kfbh] [3] [2147483648] [0 != 1]
NOTE: a corrupted block was dumped to the trace file
System State dumped to trace file /opt/app/admin/+ASM/udump/+asm1_ora_21931.trc
NOTE: cache initiating offline of disk 1  group 1
WARNING: offlining disk 1.3914828843 (DATA1_0001) with mask 0x3
NOTE: PST update: grp = 1, dsk = 1, mode = 0x6
Wed Nov 23 15:37:49 2011
ERROR: too many offline disks in PST (grp 1)
Wed Nov 23 15:37:49 2011
NOTE: halting all I/Os to diskgroup DATA1
NOTE: active pin found: 0x0x2427ccd0
NOTE: active pin found: 0x0x2427cc64
Abort recovery for domain 1
NOTE: crash recovery signalled OER-15130
ERROR: ORA-15130 signalled during mount of diskgroup DATA1
NOTE: cache dismounting group 1/0x26277CDE (DATA1) 
Wed Nov 23 15:37:51 2011
kjbdomdet send to node 1
detach from dom 1, sending detach message to node 1
Wed Nov 23 15:37:51 2011
Dirty detach reconfiguration started (old inc 1, new inc 1)
List of nodes:
 0 1
 Global Resource Directory partially frozen for dirty detach 
* dirty detach - domain 1 invalid = TRUE 
 0 GCS resources traversed, 0 cancelled
Wed Nov 23 15:37:51 2011
freeing rdom 1
Dirty Detach Reconfiguration complete
Wed Nov 23 15:37:51 2011
WARNING: dirty detached from domain 1
Wed Nov 23 15:37:51 2011
ERROR: diskgroup DATA1 was not mounted
Wed Nov 23 15:37:52 2011
WARNING: PST-initiated MANDATORY DISMOUNT of group DATA1 not performed - group not mounted
Wed Nov 23 15:37:52 2011
Errors in file /opt/app/admin/+ASM/bdump/+asm1_b000_25521.trc:
ORA-15001: diskgroup "DATA1" does not exist or is not mounted
ORA-15001: diskgroup "DATA1" does not exist or is not mounted


The fatal error is ORA-15196: invalid ASM block header, which indicates a corrupted block on the disk.
[oracle@node1 bdump]$ oerr ora 15196
15196, 00000, "invalid ASM block header [%s:%s] [%s] [%s] [%s] [%s != %s]"
// *Cause:  ASM encountered an invalid metadata block.
// *Action: Contact Oracle Support Services.
//
[oracle@node1 bdump]$ oerr ora 15001
15001, 00000, "diskgroup \"%s\" does not exist or is not mounted"
// *Cause:  An operation failed because the diskgroup specified does not 
//          exist or is not mounted by the current ASM instance.
// *Action: Verify that the diskgroup name used is valid, that the 
//          diskgroup exists, and that the diskgroup is mounted by
//          the current ASM instance.
//
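There is nothing more to be done. Luckily this is a test environment, so rebuild it:
First remove the DB with dbca, then recreate the diskgroup, and finally recreate the DB.
Run the following as root on both nodes. Note that raw3 and raw4 are the devices the diskgroup will be created on: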
dd if=/dev/zero of=/dev/raw/raw3 bs=1024 count=4
dd if=/dev/zero of=/dev/raw/raw4 bs=1024 count=4
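
The two dd commands overwrite the first 4 KiB of each raw device with zeros, which erases the old ASM disk header stored there. Before recreating the diskgroup it can be worth confirming that the region really is zeroed. The following is a minimal sketch run against a scratch file rather than a real device; `header_is_zeroed` is a hypothetical helper name, and the scratch file only stands in for /dev/raw/raw3.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the first 4 KiB of a file or device
# contain only zero bytes (the region the dd commands above overwrite).
header_is_zeroed() {
    local nonzero
    nonzero=$(dd if="$1" bs=1024 count=4 2>/dev/null | tr -d '\0' | wc -c)
    [ "$((nonzero))" -eq 0 ]
}

# Demo on a scratch file standing in for /dev/raw/raw3.
scratch=$(mktemp)
printf 'fake ASM disk header' > "$scratch"
header_is_zeroed "$scratch" || echo "before: header present"

# Same dd as above; conv=notrunc is added only because this is a regular file.
dd if=/dev/zero of="$scratch" bs=1024 count=4 conv=notrunc 2>/dev/null
header_is_zeroed "$scratch" && echo "after: header zeroed"
rm -f "$scratch"
```

On the real system the check would be `header_is_zeroed /dev/raw/raw3`, run as root.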
Then recreate the diskgroup:
SQL> column header_status format a15;
SQL> column path format a30;
SQL> select header_status,path from v$asm_disk;


HEADER_STATUS   PATH
--------------- ------------------------------
CANDIDATE       /dev/raw/raw3
CANDIDATE       /dev/raw/raw4
UNKNOWN         ORCL:VOL2
FOREIGN         /dev/raw/raw1
UNKNOWN         ORCL:VOL1
FOREIGN         /dev/raw/raw2


6 rows selected.


SQL> create diskgroup datadisk1 external redundancy disk '/dev/raw/raw3' name d1 disk '/dev/raw/raw4' name d2;


Diskgroup created.
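Finally, recreate the DB with dbca.
After the rebuild I rebooted the virtual machines and the host several times and the problem did not reappear. OK.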
-The End-
