ORA-01578 ORACLE data block corrupted

A user hit the corrupt-block error ORA-01578: ORACLE data block corrupted while connected to the database and executing a stored procedure.

The error message pointed to genuine block corruption in the database, which a query against the data dictionary confirmed:

SQL> select * from V$DATABASE_BLOCK_CORRUPTION;

     FILE#     BLOCK#     BLOCKS CORRUPTION_CHANGE# CORRUPTIO
---------- ---------- ---------- ------------------ ---------
        53      88510          1                  0 FRACTURED
        54       2048          1                  0 CORRUPT
        54     771072        512                  0 CORRUPT
        54     856239         23                  0 CORRUPT
        54     856262          1                  0 FRACTURED
        54     856263         85                  0 CORRUPT
        54     856352        137                  0 CORRUPT
        54     856496         80                  0 CORRUPT
        54     856064        172                  0 CORRUPT
        54     856492          3                  0 CORRUPT
        54     839168        334                  0 CORRUPT

     FILE#     BLOCK#     BLOCKS CORRUPTION_CHANGE# CORRUPTIO
---------- ---------- ---------- ------------------ ---------
        54     839504          6                  0 CORRUPT
        54     839511          1                  0 FRACTURED
        54     839512          6                  0 CORRUPT
        54     839520         22                  0 CORRUPT
        54     839543          1                  0 FRACTURED
        54     839544         56                  0 CORRUPT
        54     839600          1                  0 FRACTURED
        54     839601         79                  0 CORRUPT
        54    1112064        512                  0 CORRUPT
        53    2625308          3                  0 CORRUPT
        53    2625393          1                  0 FRACTURED

     FILE#     BLOCK#     BLOCKS CORRUPTION_CHANGE# CORRUPTIO
---------- ---------- ---------- ------------------ ---------
        53    2625394          6                  0 CORRUPT
        53    2625408          3                  0 CORRUPT

24 rows selected.

Files 53 and 54 showed well over 1,000 corrupt blocks between them (2,046 blocks across the 24 corrupt extents listed above), yet the user only noticed the problem today — so this corruption appeared suddenly.
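To size the damage, the BLOCKS column can simply be totaled per file. The quick awk pass below does the arithmetic over the (FILE#, BLOCKS) pairs from the query above, reproduced here as a here-document:

```shell
# Total corrupt blocks per datafile and overall, from the (FILE#, BLOCKS)
# pairs reported by V$DATABASE_BLOCK_CORRUPTION above.
awk '{ perfile[$1] += $2; total += $2 }
     END {
       for (f in perfile) printf "file %s: %d corrupt blocks\n", f, perfile[f]
       printf "total: %d corrupt blocks\n", total
     }' <<'EOF'
53 1
54 1
54 512
54 23
54 1
54 85
54 137
54 80
54 172
54 3
54 334
54 6
54 1
54 6
54 22
54 1
54 56
54 1
54 79
54 512
53 3
53 1
53 6
53 3
EOF
```

On the rows shown, this reports 14 corrupt blocks in file 53 and 2,032 in file 54 — 2,046 in total.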

While I was still examining the data dictionary, the instance I was connected to suddenly became unreachable. Checking the database processes showed that pmon and the other background processes were gone — and not just for one instance: both instances on this host had lost their pmon processes. Two instances dying together suggested the problem was not in the database itself, so I checked the cluster resources:

bjscwbdb01:/home/grid$crsctl stat res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS       
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
               OFFLINE OFFLINE      bjscwbdb01                                   
               ONLINE  ONLINE       bjscwbdb02                                   
ora.FRA.dg
               ONLINE  ONLINE       bjscwbdb01                                   
               ONLINE  ONLINE       bjscwbdb02                                   
ora.GRID.dg
               ONLINE  ONLINE       bjscwbdb01                                   
               ONLINE  ONLINE       bjscwbdb02                                   
ora.LISTENER.lsnr
               ONLINE  ONLINE       bjscwbdb01                                   
               ONLINE  ONLINE       bjscwbdb02                                   
ora.asm
               ONLINE  ONLINE       bjscwbdb01               Started             
               ONLINE  ONLINE       bjscwbdb02               Started             
ora.gsd
               OFFLINE OFFLINE      bjscwbdb01                                   
               OFFLINE OFFLINE      bjscwbdb02                                   
ora.net1.network
               ONLINE  ONLINE       bjscwbdb01                                   
               ONLINE  ONLINE       bjscwbdb02                                   
ora.ons
               ONLINE  ONLINE       bjscwbdb01                                   
               ONLINE  ONLINE       bjscwbdb02                                   
ora.registry.acfs
               ONLINE  ONLINE       bjscwbdb01                                   
               ONLINE  ONLINE       bjscwbdb02                                   
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       bjscwbdb01                                   
ora.bjscwbdb01.vip
      1        ONLINE  ONLINE       bjscwbdb01                                   
ora.bjscwbdb02.vip
      1        ONLINE  ONLINE       bjscwbdb02                                   
ora.cvu
      1        ONLINE  ONLINE       bjscwbdb01                                   
ora.oc4j
      1        ONLINE  ONLINE       bjscwbdb01                                   
ora.scan1.vip
      1        ONLINE  ONLINE       bjscwbdb01                                   
ora.zjwbjh.db
      1        ONLINE  OFFLINE                               Instance Shutdown   
      2        ONLINE  ONLINE       bjscwbdb02               Open                
ora.zjwbjhdsf.db
      1        ONLINE  OFFLINE                               Instance Shutdown   
      2        ONLINE  ONLINE       bjscwbdb02               Open                
bjscwbdb01:/home/grid$

Hold on — why has the DATA diskgroup been dismounted?

I went straight to the ASM alert log:

Thu Nov 20 16:18:05 2014
NOTE: SMON starting instance recovery for group DATA domain 1 (mounted)
NOTE: F1X0 found on disk 0 au 2 fcn 0.983825
NOTE: starting recovery of thread=2 ckpt=51.1789 group=1 (DATA)
NOTE: SMON waiting for thread 2 recovery enqueue
NOTE: SMON about to begin recovery lock claims for diskgroup 1 (DATA)
NOTE: SMON successfully validated lock domain 1
NOTE: advancing ckpt for group 1 (DATA) thread=2 ckpt=51.1791
NOTE: SMON did instance recovery for group DATA domain 1
Thu Nov 20 16:18:43 2014
WARNING: cache read  a corrupt block: group=1(DATA) fn=374 indblk=0 disk=20 (DATA_0020) incarn=2727947309 au=30611 blk=0 count=16
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_59637778.trc:
ORA-15196: invalid ASM block header [kfc.c:26076] [endian_kfbh] [374] [2147483648] [22 != 0]
NOTE: a corrupted block from group DATA was dumped to /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_59637778.trc
WARNING: cache read (retry) a corrupt block: group=1(DATA) fn=374 indblk=0 disk=20 (DATA_0020) incarn=2727947309 au=30611 blk=0 count=1
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_59637778.trc:
ORA-15196: invalid ASM block header [kfc.c:26076] [endian_kfbh] [374] [2147483648] [22 != 0]
ORA-15196: invalid ASM block header [kfc.c:26076] [endian_kfbh] [374] [2147483648] [22 != 0]
ERROR: cache failed to read group=1(DATA) fn=374 indblk=0 from disk(s): 20(DATA_0020)
ORA-15196: invalid ASM block header [kfc.c:26076] [endian_kfbh] [374] [2147483648] [22 != 0]
ORA-15196: invalid ASM block header [kfc.c:26076] [endian_kfbh] [374] [2147483648] [22 != 0]
NOTE: cache initiating offline of disk 20 group DATA
NOTE: process _user59637778_+asm1 (59637778) initiating offline of disk 20.2727947309 (DATA_0020) with mask 0x7e in group 1
WARNING: Disk 20 (DATA_0020) in group 1 in mode 0x7f is now being taken offline on ASM inst 1
NOTE: initiating PST update: grp = 1, dsk = 20/0xa2992c2d, mask = 0x6a, op = clear
Thu Nov 20 16:18:44 2014
GMON updating disk modes for group 1 at 16 for pid 33, osid 59637778
ERROR: Disk 20 cannot be offlined, since diskgroup has external redundancy.
ERROR: too many offline disks in PST (grp 1)
Thu Nov 20 16:18:44 2014
NOTE: cache dismounting (not clean) group 1/0xE5E9DB0B (DATA) 
NOTE: messaging CKPT to quiesce pins Unix process pid: 54001828, image: oracle@bjscwbdb01 (B000)
WARNING: Offline of disk 20 (DATA_0020) in group 1 and mode 0x7f failed on ASM inst 1
Thu Nov 20 16:18:44 2014
NOTE: halting all I/Os to diskgroup 1 (DATA)
System State dumped to trace file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_59637778.trc
Thu Nov 20 16:18:45 2014
ERROR: ORA-15130 in COD recovery for diskgroup 1/0xe5e9db0b (DATA)
ERROR: ORA-15130 thrown in RBAL for group number 1
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_10485844.trc:
ORA-15130: diskgroup "DATA" is being dismounted
Thu Nov 20 16:18:45 2014
NOTE: LGWR doing non-clean dismount of group 1 (DATA)
NOTE: LGWR sync ABA=54.3653 last written ABA 54.3653
Thu Nov 20 16:18:45 2014
kjbdomdet send to inst 2
detach from dom 1, sending detach message to inst 2
Thu Nov 20 16:18:45 2014
List of instances:
 1 2
Dirty detach reconfiguration started (new ddet inc 2, cluster inc 12)
 Global Resource Directory partially frozen for dirty detach
* dirty detach - domain 1 invalid = TRUE 
 48 GCS resources traversed, 0 cancelled
Dirty Detach Reconfiguration complete
freeing rdom 1
Thu Nov 20 16:18:45 2014
WARNING: dirty detached from domain 1
NOTE: cache dismounted group 1/0xE5E9DB0B (DATA) 
NOTE: AMDU dump of disk group DATA created at /u01/app/grid/diag/asm/+asm/+ASM1/trace
SQL> alter diskgroup DATA dismount force /* ASM SERVER */ 
Thu Nov 20 16:18:48 2014
NOTE: cache deleting context for group DATA 1/0xe5e9db0b
ERROR: ORA-15130 in COD recovery for diskgroup 1/0xe5e9db0b (DATA)
ERROR: ORA-15130 thrown in RBAL for group number 1
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_10485844.trc:
ORA-15130: diskgroup "" is being dismounted
GMON dismounting group 1 at 17 for pid 57, osid 54001828
NOTE: Disk  in mode 0x8 marked for de-assignment
NOTE: Disk  in mode 0x8 marked for de-assignment
NOTE: Disk  in mode 0x8 marked for de-assignment
NOTE: Disk  in mode 0x8 marked for de-assignment
NOTE: Disk  in mode 0x8 marked for de-assignment
NOTE: Disk  in mode 0x8 marked for de-assignment
NOTE: Disk  in mode 0x8 marked for de-assignment
NOTE: Disk  in mode 0x8 marked for de-assignment
NOTE: Disk  in mode 0x8 marked for de-assignment
NOTE: Disk  in mode 0x8 marked for de-assignment
NOTE: Disk  in mode 0x8 marked for de-assignment
NOTE: Disk  in mode 0x8 marked for de-assignment
NOTE: Disk  in mode 0x8 marked for de-assignment
NOTE: Disk  in mode 0x8 marked for de-assignment
NOTE: Disk  in mode 0x8 marked for de-assignment
NOTE: Disk  in mode 0x8 marked for de-assignment
NOTE: Disk  in mode 0x8 marked for de-assignment
NOTE: Disk  in mode 0x8 marked for de-assignment
NOTE: Disk  in mode 0x8 marked for de-assignment
NOTE: Disk  in mode 0x8 marked for de-assignment
NOTE: Disk  in mode 0x8 marked for de-assignment
NOTE: Disk  in mode 0x8 marked for de-assignment
SUCCESS: diskgroup DATA was dismounted
Thu Nov 20 16:18:49 2014
NOTE: ASM client zjwbjhds1:zjwbjhdsf disconnected unexpectedly.
NOTE: check client alert log.
NOTE: Trace records dumped in trace file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_18546746.trc
NOTE: ASM client zjwbjh1:zjwbjh disconnected unexpectedly.
NOTE: check client alert log.
NOTE: Trace records dumped in trace file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_13303980.trc
Thu Nov 20 16:19:09 2014
SUCCESS: alter diskgroup DATA dismount force /* ASM SERVER */

Why would disk DATA_0020 throw ORA-15196: invalid ASM block header [kfc.c:26076] [endian_kfbh] [374] [2147483648] [22 != 0]?

Which physical disk was DATA_0020? Had its disk header been corrupted?

The first question was easy: querying the NAME and PATH columns of V$ASM_DISK showed that DATA_0020 mapped to /dev/rhdisk16.

A kfed read /dev/rhdisk16 then showed the disk header itself was intact. A corrupt disk header is a simple edit-and-fix; corruption in the data inside the disk is much harder to deal with.
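Lacking kfed, a rough sanity check is to look for the "ORCLDISK" provision string that ASM writes into a disk header. This is a sketch under assumptions, not the author's procedure: the mock file below stands in for the raw device, and on a real system the dd input would be the device itself (e.g. /dev/rhdisk16).

```shell
# Sketch: look for the ASM "ORCLDISK" provision string in the first header
# block. /tmp/mock_asm_hdr stands in for the raw device (assumption);
# real check: dd if=/dev/rhdisk16 bs=4096 count=1 | grep -ao ORCLDISK
printf 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxORCLDISK' > /tmp/mock_asm_hdr
dd if=/tmp/mock_asm_hdr bs=4096 count=1 2>/dev/null | grep -ao ORCLDISK
```

If the marker is missing, the header is gone; if it is present — as here, where the check prints ORCLDISK — the damage, if any, lies deeper inside the disk.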

I then tried mounting the diskgroup again to see whether the same error would recur — and, oddly, it mounted without complaint.

Thu Nov 20 16:19:15 2014
SQL> ALTER DISKGROUP DATA MOUNT  /* asm agent *//* {0:6:55035} */ 
NOTE: cache registered group DATA number=1 incarn=0x184ba133
NOTE: cache began mount (first) of group DATA number=1 incarn=0x184ba133
NOTE: Assigning number (1,20) to disk (/dev/rhdisk16)
NOTE: Assigning number (1,13) to disk (/dev/rhdisk17)
NOTE: Assigning number (1,14) to disk (/dev/rhdisk18)
NOTE: Assigning number (1,15) to disk (/dev/rhdisk19)
NOTE: Assigning number (1,0) to disk (/dev/rhdisk2)
NOTE: Assigning number (1,16) to disk (/dev/rhdisk20)
NOTE: Assigning number (1,17) to disk (/dev/rhdisk21)
NOTE: Assigning number (1,18) to disk (/dev/rhdisk22)
NOTE: Assigning number (1,19) to disk (/dev/rhdisk23)
NOTE: Assigning number (1,21) to disk (/dev/rhdisk25)
NOTE: Assigning number (1,22) to disk (/dev/rhdisk26)
NOTE: Assigning number (1,23) to disk (/dev/rhdisk27)
NOTE: Assigning number (1,24) to disk (/dev/rhdisk28)
NOTE: Assigning number (1,25) to disk (/dev/rhdisk29)
NOTE: Assigning number (1,1) to disk (/dev/rhdisk3)
NOTE: Assigning number (1,26) to disk (/dev/rhdisk30)
NOTE: Assigning number (1,27) to disk (/dev/rhdisk31)
NOTE: Assigning number (1,28) to disk (/dev/rhdisk32)
NOTE: Assigning number (1,2) to disk (/dev/rhdisk4)
NOTE: Assigning number (1,3) to disk (/dev/rhdisk5)
NOTE: Assigning number (1,4) to disk (/dev/rhdisk6)
NOTE: Assigning number (1,5) to disk (/dev/rhdisk7)
Thu Nov 20 16:19:23 2014
NOTE: GMON heartbeating for grp 1
GMON querying group 1 at 20 for pid 29, osid 29819072
NOTE: cache opening disk 0 of grp 1: DATA_0000 path:/dev/rhdisk2
NOTE: F1X0 found on disk 0 au 2 fcn 0.983825
NOTE: cache opening disk 1 of grp 1: DATA_0001 path:/dev/rhdisk3
NOTE: cache opening disk 2 of grp 1: DATA_0002 path:/dev/rhdisk4
NOTE: cache opening disk 3 of grp 1: DATA_0003 path:/dev/rhdisk5
NOTE: cache opening disk 4 of grp 1: DATA_0004 path:/dev/rhdisk6
NOTE: cache opening disk 5 of grp 1: DATA_0005 path:/dev/rhdisk7
NOTE: cache opening disk 13 of grp 1: DATA_0013 path:/dev/rhdisk17
NOTE: cache opening disk 14 of grp 1: DATA_0014 path:/dev/rhdisk18
NOTE: cache opening disk 15 of grp 1: DATA_0015 path:/dev/rhdisk19
NOTE: cache opening disk 16 of grp 1: DATA_0016 path:/dev/rhdisk20
NOTE: cache opening disk 17 of grp 1: DATA_0017 path:/dev/rhdisk21
NOTE: cache opening disk 18 of grp 1: DATA_0018 path:/dev/rhdisk22
NOTE: cache opening disk 19 of grp 1: DATA_0019 path:/dev/rhdisk23
NOTE: cache opening disk 20 of grp 1: DATA_0020 path:/dev/rhdisk16
NOTE: cache opening disk 21 of grp 1: DATA_0021 path:/dev/rhdisk25
NOTE: cache opening disk 22 of grp 1: DATA_0022 path:/dev/rhdisk26
NOTE: cache opening disk 23 of grp 1: DATA_0023 path:/dev/rhdisk27
NOTE: cache opening disk 24 of grp 1: DATA_0024 path:/dev/rhdisk28
NOTE: cache opening disk 25 of grp 1: DATA_0025 path:/dev/rhdisk29
NOTE: cache opening disk 26 of grp 1: DATA_0026 path:/dev/rhdisk30
NOTE: cache opening disk 27 of grp 1: DATA_0027 path:/dev/rhdisk31
NOTE: cache opening disk 28 of grp 1: DATA_0028 path:/dev/rhdisk32
NOTE: cache mounting (first) external redundancy group 1/0x184BA133 (DATA)
Thu Nov 20 16:19:23 2014
* allocate domain 1, invalid = TRUE 
kjbdomatt send to inst 2
Thu Nov 20 16:19:23 2014
NOTE: attached to recovery domain 1
NOTE: starting recovery of thread=1 ckpt=54.3654 group=1 (DATA)
NOTE: advancing ckpt for group 1 (DATA) thread=1 ckpt=54.3654
NOTE: cache recovered group 1 to fcn 0.4788896
NOTE: redo buffer size is 512 blocks (2101760 bytes)
Thu Nov 20 16:19:23 2014
NOTE: LGWR attempting to mount thread 1 for diskgroup 1 (DATA)
NOTE: LGWR found thread 1 closed at ABA 54.3653
NOTE: LGWR mounted thread 1 for diskgroup 1 (DATA)
NOTE: LGWR opening thread 1 at fcn 0.4788896 ABA 55.3654
NOTE: cache mounting group 1/0x184BA133 (DATA) succeeded
NOTE: cache ending mount (success) of group DATA number=1 incarn=0x184ba133
GMON querying group 1 at 21 for pid 18, osid 10485844
Thu Nov 20 16:19:23 2014
NOTE: Instance updated compatible.asm to 11.2.0.0.0 for grp 1
SUCCESS: diskgroup DATA was mounted
SUCCESS: ALTER DISKGROUP DATA MOUNT  /* asm agent *//* {0:6:55035} */

So it was fine again? It seemed so — but while I was still puzzling over it, diskgroup DATA was force-dismounted on the second node. I then mounted DATA on node 2, and within three minutes node 1 force-dismounted it in turn: the two nodes could apparently use this disk only one at a time.

Just as I was muttering that this looked like an rhdisk16 problem, the on-site engineer overheard me and said: "We only added that disk to diskgroup DATA on the afternoon of November 17. The previous attempt to add it had failed — the disk seemed a bit off — and the hardware engineers we asked couldn't find anything wrong with it."

Comparing logs, the corrupt-block errors in the database alert log began only after the time the ASM alert log recorded rhdisk16 being added, so a botched disk addition was the likely cause. I then made a point of checking whether the disk's PVID had been cleared — and what I found, to my shock, was the root cause of the corruption.

Running lspv on both nodes gave:

bjscwbdb01:/home/grid$lspv
hdisk2          none                                None            
hdisk3          none                                None            
:::           
hdisk14         none                                None            
hdisk15         none                                None            
hdisk16         00f7e1c0abd3b11d                    goldengatevg    active    <<<<<<<<<< goldengate disk
hdisk17         none                                None            
:::     
hdisk32         none                                None            
bjscwbdb01:/home/grid$


bjscwbdb02:/u01/app/grid/diag/asm/+asm/+ASM2/trace$lspv
hdisk2          none                                None            
::: 
hdisk15         none                                None            
hdisk16         00f7e1c0abd3b11d                    None            
hdisk17         none                                None            
hdisk18         none                                None            
:::
hdisk32         none                                None   
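The dangerous pattern in lspv output is a candidate ASM disk that still carries a PVID — worse still, one whose volume group is active on a node. A small awk filter (run here over sample lines copied from the listings above) surfaces such disks before they are ever handed to ASM:

```shell
# Flag candidate ASM disks that still carry a PVID (second lspv column is not
# "none"); a disk owned by an LVM volume group must never join a diskgroup.
awk '$2 != "none" { printf "%s: PVID %s, VG %s\n", $1, $2, $3 }' <<'EOF'
hdisk15         none                                None
hdisk16         00f7e1c0abd3b11d                    goldengatevg    active
hdisk17         none                                None
EOF
```

On the sample lines it flags exactly one disk — hdisk16, with its PVID and the goldengatevg volume group. On a live system the here-document would simply be replaced with a pipe from lspv.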

This disk held the GoldenGate software and had been added to diskgroup DATA by mistake. That explains both of the symptoms we had seen:

1. Once diskgroup DATA was mounted on one node, the other node force-dismounted it: the GoldenGate disk was configured to be active on only one node at a time.

2. The database developed corrupt blocks: with GoldenGate and the RAC database both writing to the same disk, each inevitably overwrote the other's data, which is guaranteed to produce corruption.

The root cause was now beyond doubt.

Separately, while diagnosing this fault I ran into another problem, caused by a new feature in AIX 7.1.

With no business load running at all, the first node of this cluster was at roughly 90% CPU with only a few dozen megabytes of free memory, as vmstat showed:

kthr    memory              page              faults              cpu          
----- ----------- ------------------------ ------------ -----------------------
 r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa    pc    ec
10  0 8383596  9049   0   0   0 4669156116821704704 4671567070943510528   0 1804 73984 21442 72  9 19  0  5.88 147.0
13  0 8381263  9246   0   0   0 4669840708353808881 4670415209250807178   0 1875 75725 22706 70 10 19  0  5.87 146.9
 9  0 8381265  9099   0   0   0 4668223181205536768 4668224555595071488   0 1859 74071 21039 72  9 19  1  5.86 146.4
10  0 8381221  9068   0   0   0 4669563211001888768 4669567609048399872   0 1894 75266 21850 70  9 20  1  5.88 146.9
10  0 8381217  9136   0   0   0 4669875594370350367 4669875869191003008   0 1890 75058 21624 71  9 20  0  5.87 146.7
10  0 8381197  9353   0   0   0 4669212437484529857 4669948253929140635   0 1874 72257 21279 71  9 20  1  5.85 146.1

The top CPU consumer was topasrec at 100% CPU (that screenshot cannot be pasted here, so it is omitted).

A count showed more than 100 topasrec processes; investigation traced them to a new AIX feature. After these processes were killed, CPU and memory returned to normal.
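Counting the runaway recorders is a one-liner over ps output. The mock ps -ef lines below (including the process details) are assumptions standing in for the real host, where the pipe would simply be `ps -ef | grep -c '[t]opasrec'` — the `[t]` character class keeps grep from counting itself:

```shell
# Count topasrec processes from ps -ef style output; on the live system,
# replace the here-document with the real `ps -ef` pipe.
grep -c '[t]opasrec' <<'EOF'
    root  1234567        1  99   Nov 17      - 99:59 /usr/bin/topasrec -L -s 300
    root  2345678        1  98   Nov 17      - 88:10 /usr/bin/topasrec -L -s 300
    grid  3456789  4567890   0   Nov 20      -  0:01 /u01/app/grid/bin/tnslsnr
EOF
```

Here it counts the two mock topasrec lines and ignores the unrelated process.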

Solution:

1. Because the data on rhdisk16 was corrupted, the cluster could not complete a rebalance, so the disk could not be dropped. We therefore advised the user to restore the database from backup to a point before the disk was added on November 17.

2. For the high CPU caused by the topasrec processes, we asked the customer to contact the operating-system vendor about working around or disabling this feature.
