A user connecting to the database to run a stored procedure hit the corrupt-block error ORA-01578 "ORACLE data block corrupted".
The error message showed that the database had indeed hit corrupt blocks. Querying the data dictionary confirmed it:
SQL> select * from V$DATABASE_BLOCK_CORRUPTION;

     FILE#     BLOCK#     BLOCKS CORRUPTION_CHANGE# CORRUPTIO
---------- ---------- ---------- ------------------ ---------
        53      88510          1                  0 FRACTURED
        54       2048          1                  0 CORRUPT
        54     771072        512                  0 CORRUPT
        54     856239         23                  0 CORRUPT
        54     856262          1                  0 FRACTURED
        54     856263         85                  0 CORRUPT
        54     856352        137                  0 CORRUPT
        54     856496         80                  0 CORRUPT
        54     856064        172                  0 CORRUPT
        54     856492          3                  0 CORRUPT
        54     839168        334                  0 CORRUPT
        54     839504          6                  0 CORRUPT
        54     839511          1                  0 FRACTURED
        54     839512          6                  0 CORRUPT
        54     839520         22                  0 CORRUPT
        54     839543          1                  0 FRACTURED
        54     839544         56                  0 CORRUPT
        54     839600          1                  0 FRACTURED
        54     839601         79                  0 CORRUPT
        54    1112064        512                  0 CORRUPT
        53    2625308          3                  0 CORRUPT
        53    2625393          1                  0 FRACTURED
        53    2625394          6                  0 CORRUPT
        53    2625408          3                  0 CORRUPT

24 rows selected.
Files 53 and 54 together showed more than 1,000 corrupt blocks. That this many corrupt blocks were only noticed by the user today suggested the corruption was sudden.
While I was still checking the data dictionary, the instance I was connected to suddenly became unreachable. Checking the database processes showed no pmon and friends. This host runs two instances, and neither instance's pmon process existed any more, which suggested the problem was not inside the database itself. So I checked the cluster resources:
bjscwbdb01:/home/grid$ crsctl stat res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
               OFFLINE OFFLINE      bjscwbdb01
               ONLINE  ONLINE       bjscwbdb02
ora.FRA.dg
               ONLINE  ONLINE       bjscwbdb01
               ONLINE  ONLINE       bjscwbdb02
ora.GRID.dg
               ONLINE  ONLINE       bjscwbdb01
               ONLINE  ONLINE       bjscwbdb02
ora.LISTENER.lsnr
               ONLINE  ONLINE       bjscwbdb01
               ONLINE  ONLINE       bjscwbdb02
ora.asm
               ONLINE  ONLINE       bjscwbdb01               Started
               ONLINE  ONLINE       bjscwbdb02               Started
ora.gsd
               OFFLINE OFFLINE      bjscwbdb01
               OFFLINE OFFLINE      bjscwbdb02
ora.net1.network
               ONLINE  ONLINE       bjscwbdb01
               ONLINE  ONLINE       bjscwbdb02
ora.ons
               ONLINE  ONLINE       bjscwbdb01
               ONLINE  ONLINE       bjscwbdb02
ora.registry.acfs
               ONLINE  ONLINE       bjscwbdb01
               ONLINE  ONLINE       bjscwbdb02
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       bjscwbdb01
ora.bjscwbdb01.vip
      1        ONLINE  ONLINE       bjscwbdb01
ora.bjscwbdb02.vip
      1        ONLINE  ONLINE       bjscwbdb02
ora.cvu
      1        ONLINE  ONLINE       bjscwbdb01
ora.oc4j
      1        ONLINE  ONLINE       bjscwbdb01
ora.scan1.vip
      1        ONLINE  ONLINE       bjscwbdb01
ora.zjwbjh.db
      1        ONLINE  OFFLINE                               Instance Shutdown
      2        ONLINE  ONLINE       bjscwbdb02               Open
ora.zjwbjhdsf.db
      1        ONLINE  OFFLINE                               Instance Shutdown
      2        ONLINE  ONLINE       bjscwbdb02               Open
bjscwbdb01:/home/grid$
Uh-oh! Why had the DATA diskgroup been dismounted?
I went straight to the ASM alert log:
Thu Nov 20 16:18:05 2014
NOTE: SMON starting instance recovery for group DATA domain 1 (mounted)
NOTE: F1X0 found on disk 0 au 2 fcn 0.983825
NOTE: starting recovery of thread=2 ckpt=51.1789 group=1 (DATA)
NOTE: SMON waiting for thread 2 recovery enqueue
NOTE: SMON about to begin recovery lock claims for diskgroup 1 (DATA)
NOTE: SMON successfully validated lock domain 1
NOTE: advancing ckpt for group 1 (DATA) thread=2 ckpt=51.1791
NOTE: SMON did instance recovery for group DATA domain 1
Thu Nov 20 16:18:43 2014
WARNING: cache read a corrupt block: group=1(DATA) fn=374 indblk=0 disk=20 (DATA_0020) incarn=2727947309 au=30611 blk=0 count=16
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_59637778.trc:
ORA-15196: invalid ASM block header [kfc.c:26076] [endian_kfbh] [374] [2147483648] [22 != 0]
NOTE: a corrupted block from group DATA was dumped to /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_59637778.trc
WARNING: cache read (retry) a corrupt block: group=1(DATA) fn=374 indblk=0 disk=20 (DATA_0020) incarn=2727947309 au=30611 blk=0 count=1
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_59637778.trc:
ORA-15196: invalid ASM block header [kfc.c:26076] [endian_kfbh] [374] [2147483648] [22 != 0]
ORA-15196: invalid ASM block header [kfc.c:26076] [endian_kfbh] [374] [2147483648] [22 != 0]
ERROR: cache failed to read group=1(DATA) fn=374 indblk=0 from disk(s): 20(DATA_0020)
ORA-15196: invalid ASM block header [kfc.c:26076] [endian_kfbh] [374] [2147483648] [22 != 0]
ORA-15196: invalid ASM block header [kfc.c:26076] [endian_kfbh] [374] [2147483648] [22 != 0]
NOTE: cache initiating offline of disk 20 group DATA
NOTE: process _user59637778_+asm1 (59637778) initiating offline of disk 20.2727947309 (DATA_0020) with mask 0x7e in group 1
WARNING: Disk 20 (DATA_0020) in group 1 in mode 0x7f is now being taken offline on ASM inst 1
NOTE: initiating PST update: grp = 1, dsk = 20/0xa2992c2d, mask = 0x6a, op = clear
Thu Nov 20 16:18:44 2014
GMON updating disk modes for group 1 at 16 for pid 33, osid 59637778
ERROR: Disk 20 cannot be offlined, since diskgroup has external redundancy.
ERROR: too many offline disks in PST (grp 1)
Thu Nov 20 16:18:44 2014
NOTE: cache dismounting (not clean) group 1/0xE5E9DB0B (DATA)
NOTE: messaging CKPT to quiesce pins Unix process pid: 54001828, image: oracle@bjscwbdb01 (B000)
WARNING: Offline of disk 20 (DATA_0020) in group 1 and mode 0x7f failed on ASM inst 1
Thu Nov 20 16:18:44 2014
NOTE: halting all I/Os to diskgroup 1 (DATA)
System State dumped to trace file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_59637778.trc
Thu Nov 20 16:18:45 2014
ERROR: ORA-15130 in COD recovery for diskgroup 1/0xe5e9db0b (DATA)
ERROR: ORA-15130 thrown in RBAL for group number 1
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_10485844.trc:
ORA-15130: diskgroup "DATA" is being dismounted
Thu Nov 20 16:18:45 2014
NOTE: LGWR doing non-clean dismount of group 1 (DATA)
NOTE: LGWR sync ABA=54.3653 last written ABA 54.3653
Thu Nov 20 16:18:45 2014
kjbdomdet send to inst 2
detach from dom 1, sending detach message to inst 2
Thu Nov 20 16:18:45 2014
List of instances:
 1 2
Dirty detach reconfiguration started (new ddet inc 2, cluster inc 12)
 Global Resource Directory partially frozen for dirty detach
 * dirty detach - domain 1 invalid = TRUE
 48 GCS resources traversed, 0 cancelled
Dirty Detach Reconfiguration complete
 freeing rdom 1
Thu Nov 20 16:18:45 2014
WARNING: dirty detached from domain 1
NOTE: cache dismounted group 1/0xE5E9DB0B (DATA)
NOTE: AMDU dump of disk group DATA created at /u01/app/grid/diag/asm/+asm/+ASM1/trace
SQL> alter diskgroup DATA dismount force /* ASM SERVER */
Thu Nov 20 16:18:48 2014
NOTE: cache deleting context for group DATA 1/0xe5e9db0b
ERROR: ORA-15130 in COD recovery for diskgroup 1/0xe5e9db0b (DATA)
ERROR: ORA-15130 thrown in RBAL for group number 1
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_10485844.trc:
ORA-15130: diskgroup "" is being dismounted
GMON dismounting group 1 at 17 for pid 57, osid 54001828
NOTE: Disk  in mode 0x8 marked for de-assignment
  ... (the line above repeats 22 times in total, once per disk in the group) ...
SUCCESS: diskgroup DATA was dismounted
Thu Nov 20 16:18:49 2014
NOTE: ASM client zjwbjhds1:zjwbjhdsf disconnected unexpectedly.
NOTE: check client alert log.
NOTE: Trace records dumped in trace file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_18546746.trc
NOTE: ASM client zjwbjh1:zjwbjh disconnected unexpectedly.
NOTE: check client alert log.
NOTE: Trace records dumped in trace file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_13303980.trc
Thu Nov 20 16:19:09 2014
SUCCESS: alter diskgroup DATA dismount force /* ASM SERVER */
Why would disk DATA_0020 report ORA-15196: invalid ASM block header [kfc.c:26076] [endian_kfbh] [374] [2147483648] [22 != 0]? The [endian_kfbh] [22 != 0] part suggests the endian byte of the ASM block header read 22 where 0 was expected, i.e. the block on disk did not look like an ASM block at all.
Which physical disk is DATA_0020? Could its disk header be corrupted?
The first question is easy: checking the NAME and PATH columns of the v$asm_disk view tells you which device it is. Here DATA_0020 mapped to /dev/rhdisk16.
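For reference, the lookup can be done with a query along these lines (a minimal sketch run from the ASM instance; the literal disk name is the one from this incident, adjust to taste):

```sql
-- Sketch: map an ASM disk name to its OS device path.
-- 'DATA_0020' is the disk name from this incident.
SELECT g.name AS diskgroup, d.name AS disk, d.path
FROM   v$asm_disk d
       JOIN v$asm_diskgroup g ON g.group_number = d.group_number
WHERE  d.name = 'DATA_0020';
```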
I then checked the disk header with kfed read /dev/rhdisk16 and found it was not damaged. A corrupt disk header is easy to repair with a small kfed patch; corruption in the metadata deeper inside the disk is much harder to deal with.
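For context, the first few fields of a healthy disk header read by kfed look roughly like this (the values shown are illustrative, not captured from this incident); a damaged header typically reports kfbh.type as KFBTYP_INVALID rather than KFBTYP_DISKHEAD:

```
$ kfed read /dev/rhdisk16
kfbh.endian:                          0 ; 0x000: 0x00
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
kfbh.datfmt:                          1 ; 0x003: 0x01
kfbh.block.blk:                       0 ; 0x004: blk=0
...
```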
At this point I tried to mount the diskgroup again, to see whether it would throw the same error. Strangely, it mounted without complaint:

Thu Nov 20 16:19:15 2014
SQL> ALTER DISKGROUP DATA MOUNT /* asm agent *//* {0:6:55035} */
NOTE: cache registered group DATA number=1 incarn=0x184ba133
NOTE: cache began mount (first) of group DATA number=1 incarn=0x184ba133
NOTE: Assigning number (1,20) to disk (/dev/rhdisk16)
NOTE: Assigning number (1,13) to disk (/dev/rhdisk17)
NOTE: Assigning number (1,14) to disk (/dev/rhdisk18)
NOTE: Assigning number (1,15) to disk (/dev/rhdisk19)
NOTE: Assigning number (1,0) to disk (/dev/rhdisk2)
NOTE: Assigning number (1,16) to disk (/dev/rhdisk20)
NOTE: Assigning number (1,17) to disk (/dev/rhdisk21)
NOTE: Assigning number (1,18) to disk (/dev/rhdisk22)
NOTE: Assigning number (1,19) to disk (/dev/rhdisk23)
NOTE: Assigning number (1,21) to disk (/dev/rhdisk25)
NOTE: Assigning number (1,22) to disk (/dev/rhdisk26)
NOTE: Assigning number (1,23) to disk (/dev/rhdisk27)
NOTE: Assigning number (1,24) to disk (/dev/rhdisk28)
NOTE: Assigning number (1,25) to disk (/dev/rhdisk29)
NOTE: Assigning number (1,1) to disk (/dev/rhdisk3)
NOTE: Assigning number (1,26) to disk (/dev/rhdisk30)
NOTE: Assigning number (1,27) to disk (/dev/rhdisk31)
NOTE: Assigning number (1,28) to disk (/dev/rhdisk32)
NOTE: Assigning number (1,2) to disk (/dev/rhdisk4)
NOTE: Assigning number (1,3) to disk (/dev/rhdisk5)
NOTE: Assigning number (1,4) to disk (/dev/rhdisk6)
NOTE: Assigning number (1,5) to disk (/dev/rhdisk7)
Thu Nov 20 16:19:23 2014
NOTE: GMON heartbeating for grp 1
GMON querying group 1 at 20 for pid 29, osid 29819072
NOTE: cache opening disk 0 of grp 1: DATA_0000 path:/dev/rhdisk2
NOTE: F1X0 found on disk 0 au 2 fcn 0.983825
NOTE: cache opening disk 1 of grp 1: DATA_0001 path:/dev/rhdisk3
NOTE: cache opening disk 2 of grp 1: DATA_0002 path:/dev/rhdisk4
NOTE: cache opening disk 3 of grp 1: DATA_0003 path:/dev/rhdisk5
NOTE: cache opening disk 4 of grp 1: DATA_0004 path:/dev/rhdisk6
NOTE: cache opening disk 5 of grp 1: DATA_0005 path:/dev/rhdisk7
NOTE: cache opening disk 13 of grp 1: DATA_0013 path:/dev/rhdisk17
NOTE: cache opening disk 14 of grp 1: DATA_0014 path:/dev/rhdisk18
NOTE: cache opening disk 15 of grp 1: DATA_0015 path:/dev/rhdisk19
NOTE: cache opening disk 16 of grp 1: DATA_0016 path:/dev/rhdisk20
NOTE: cache opening disk 17 of grp 1: DATA_0017 path:/dev/rhdisk21
NOTE: cache opening disk 18 of grp 1: DATA_0018 path:/dev/rhdisk22
NOTE: cache opening disk 19 of grp 1: DATA_0019 path:/dev/rhdisk23
NOTE: cache opening disk 20 of grp 1: DATA_0020 path:/dev/rhdisk16
NOTE: cache opening disk 21 of grp 1: DATA_0021 path:/dev/rhdisk25
NOTE: cache opening disk 22 of grp 1: DATA_0022 path:/dev/rhdisk26
NOTE: cache opening disk 23 of grp 1: DATA_0023 path:/dev/rhdisk27
NOTE: cache opening disk 24 of grp 1: DATA_0024 path:/dev/rhdisk28
NOTE: cache opening disk 25 of grp 1: DATA_0025 path:/dev/rhdisk29
NOTE: cache opening disk 26 of grp 1: DATA_0026 path:/dev/rhdisk30
NOTE: cache opening disk 27 of grp 1: DATA_0027 path:/dev/rhdisk31
NOTE: cache opening disk 28 of grp 1: DATA_0028 path:/dev/rhdisk32
NOTE: cache mounting (first) external redundancy group 1/0x184BA133 (DATA)
Thu Nov 20 16:19:23 2014
* allocate domain 1, invalid = TRUE
kjbdomatt send to inst 2
Thu Nov 20 16:19:23 2014
NOTE: attached to recovery domain 1
NOTE: starting recovery of thread=1 ckpt=54.3654 group=1 (DATA)
NOTE: advancing ckpt for group 1 (DATA) thread=1 ckpt=54.3654
NOTE: cache recovered group 1 to fcn 0.4788896
NOTE: redo buffer size is 512 blocks (2101760 bytes)
Thu Nov 20 16:19:23 2014
NOTE: LGWR attempting to mount thread 1 for diskgroup 1 (DATA)
NOTE: LGWR found thread 1 closed at ABA 54.3653
NOTE: LGWR mounted thread 1 for diskgroup 1 (DATA)
NOTE: LGWR opening thread 1 at fcn 0.4788896 ABA 55.3654
NOTE: cache mounting group 1/0x184BA133 (DATA) succeeded
NOTE: cache ending mount (success) of group DATA number=1 incarn=0x184ba133
GMON querying group 1 at 21 for pid 18, osid 10485844
Thu Nov 20 16:19:23 2014
NOTE: Instance updated compatible.asm to 11.2.0.0.0 for grp 1
SUCCESS: diskgroup DATA was mounted
SUCCESS: ALTER DISKGROUP DATA MOUNT /* asm agent *//* {0:6:55035} */
How had it recovered? Everything seemed fine again. While I was still puzzling over this, diskgroup DATA was force-dismounted on the second node. I then mounted DATA on the second node; within about three minutes it was force-dismounted on the first node. It looked as though only one node at a time could operate on this diskgroup.
I was muttering to myself that rhdisk16 must be the problem when the on-site engineer overheard and said: "We only added that disk to diskgroup DATA on the afternoon of November 17. The first attempt to add it failed; it seemed a bit flaky. We asked the hardware engineers to look at it, but they didn't find anything."
Comparing timestamps, the corrupt-block errors in the database alert log began only after the time the ASM alert log recorded the addition of rhdisk16, so an improperly added disk was the likely cause. I then made a point of checking whether the disk's PVID had been cleared, and the result, to my great surprise, revealed the root cause of the corruption.
Running lspv on both nodes gave:
bjscwbdb01:/home/grid$ lspv
hdisk2          none                                None
hdisk3          none                                None
:::
hdisk14         none                                None
hdisk15         none                                None
hdisk16         00f7e1c0abd3b11d                    goldengatevg    active    <<<<<<<<<< goldengate disk
hdisk17         none                                None
:::
hdisk32         none                                None
bjscwbdb01:/home/grid$

bjscwbdb02:/u01/app/grid/diag/asm/+asm/+ASM2/trace$ lspv
hdisk2          none                                None
:::
hdisk15         none                                None
hdisk16         00f7e1c0abd3b11d                    None
hdisk17         none                                None
hdisk18         none                                None
:::
hdisk32         none                                None
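This class of mistake can be caught before a disk is ever added to ASM by screening lspv output for disks that already carry a PVID or belong to a volume group. A minimal sketch (the lspv-style sample is embedded for illustration; on a real AIX host you would pipe the output of `lspv` itself into the filter):

```shell
#!/bin/sh
# Flag disks that already have a PVID or a volume-group assignment.
# Embedded sample stands in for `lspv` output; in practice: lspv | awk ...
cat <<'EOF' |
hdisk15         none                None
hdisk16         00f7e1c0abd3b11d    goldengatevg    active
hdisk17         none                None
EOF
awk '$2 != "none" || $3 != "None" {
    print $1 " is already in use (PVID=" $2 ", VG=" $3 ") - do NOT add to ASM"
}'
# prints: hdisk16 is already in use (PVID=00f7e1c0abd3b11d, VG=goldengatevg) - do NOT add to ASM
```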
This disk holds the GoldenGate software and had been mistakenly added to diskgroup DATA. That explains both symptoms we saw.

1. Once diskgroup DATA was mounted on one node, the other node had to force-dismount it: the GoldenGate disk had been set up to be active on only one node at a time.
2. Corrupt database blocks: with GoldenGate and the RAC database both writing to the same disk, each was bound to overwrite the other's data, which inevitably produced corrupt blocks.

The root cause of this fault was now clear.
Separately, while diagnosing this fault we ran into another problem, caused by a new feature of AIX 7.1.
With no workload at all on the cluster, CPU usage on the first node was at 90% and only a few dozen megabytes of memory remained free, as shown below (note the fr/sr columns have overflowed the display):

kthr    memory                         page                          faults              cpu
----- ------------- --------------------------------------------- ----------------- -----------------------
 r  b     avm   fre re pi po                  fr                  sr cy   in    sy    cs us sy id wa   pc    ec
10  0 8383596  9049  0  0  0 4669156116821704704 4671567070943510528  0 1804 73984 21442 72  9 19  0 5.88 147.0
13  0 8381263  9246  0  0  0 4669840708353808881 4670415209250807178  0 1875 75725 22706 70 10 19  0 5.87 146.9
 9  0 8381265  9099  0  0  0 4668223181205536768 4668224555595071488  0 1859 74071 21039 72  9 19  1 5.86 146.4
10  0 8381221  9068  0  0  0 4669563211001888768 4669567609048399872  0 1894 75266 21850 70  9 20  1 5.88 146.9
10  0 8381217  9136  0  0  0 4669875594370350367 4669875869191003008  0 1890 75058 21624 71  9 20  0 5.87 146.7
10  0 8381197  9353  0  0  0 4669212437484529857 4669948253929140635  0 1874 72257 21279 71  9 20  1 5.85 146.1
Among the top CPU consumers, topasrec was taking 100% of a CPU; that screenshot could not be pasted, so it is omitted here.
A count showed more than 100 topasrec processes. Investigation traced them to a new AIX feature; after killing these processes, CPU and memory usage returned to normal.
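As a hedged sketch, the count-and-kill step can be scripted along these lines (the process name is taken from the incident above; the kill itself is left commented out for safety and would need root on the affected host):

```shell
#!/bin/sh
# Count runaway topasrec processes. The bracketed pattern [t]opasrec
# keeps the awk process itself from matching its own ps entry.
pids=$(ps -ef | awk '/[t]opasrec/ {print $2}')
count=$(printf '%s\n' "$pids" | grep -c '[0-9]')
echo "topasrec processes found: $count"
# To actually terminate them (run as root), uncomment:
# for p in $pids; do kill "$p"; done
```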
Remedies:

1. Because the data on rhdisk16 was corrupted, the cluster could not complete a rebalance, so the disk could not simply be dropped. We therefore recommended the user restore the database from backup to a point before the disk was added on November 17.
2. For the high CPU caused by the topasrec processes, we asked the customer to contact the operating-system vendor about whether this feature can be avoided or disabled.