1,现象描述:服务器存储断电,导致数据库down机,再次尝试启动数据库,数据库不能正常启动,数据库报错如下
SQL> alter database open; alter database open * ERROR at line 1: ORA-00600: internal error code, arguments:[kcratr_nab_less_than_odr], [1], [2022], [523240], [523468], [], [], [], [],[], [], []2,查看alert日志,数据库启动时,已开始日志恢复,但恢复时报错ORA-600,除了生成一个trc文件外,没有更详细的信息
Thu Feb 06 23:13:19 2014 ALTER DATABASE OPEN Beginning crash recovery of 1 threads parallel recovery started with 3 processes Started redo scan Completed redo scan read1580 KB redo, 211 data blocks need recovery Errors in file/app/oracle/diag/rdbms/epm/epm1/trace/epm1_ora_4679.trc (incident=375883): ORA-00600: internal error code, arguments:[kcratr_nab_less_than_odr], [1], [2022], [523240], [523468], [], [], [], [],[], [], [] Incident details in:/app/oracle/diag/rdbms/epm/epm1/incident/incdir_375883/epm1_ora_4679_i375883.trc Aborting crash recovery due to error 600 Errors in file/app/oracle/diag/rdbms/epm/epm1/trace/epm1_ora_4679.trc: ORA-00600: internal error code, arguments:[kcratr_nab_less_than_odr], [1], [2022], [523240], [523468], [], [], [], [],[], [], [] Errors in file/app/oracle/diag/rdbms/epm/epm1/trace/epm1_ora_4679.trc: ORA-00600: internal error code, arguments:[kcratr_nab_less_than_odr], [1], [2022], [523240], [523468], [], [], [], [],[], [], [] ORA-600 signalled during: ALTER DATABASEOPEN... Trace dumping is performingid=[cdmp_20140206231320]3,查看trc文件,通过WARNING可以看出,数据库恢复时认为seq2022(对应日志文件)中的记录,应该恢复到523240,但是恢复到523468却被迫停止了。应该是控制文件和日志文件记录不完全一致导致,注意这几个数据正好对应到kcratr_nab_less_than_odr的几个参数:[kcratr_nab_less_than_odr],[1], [2022], [523240], [523468]
*** 2014-02-06 23:37:11.872 Successfully allocated 3 recovery slaves Using 45 overflow buffers per recoveryslave Thread 1 checkpoint: logseq 2022, block 2,scn 9187726619852 cache-low rba: logseq 2022, block 520079 on-disk rba: logseq 2022, block 523468, scn 9187726672777 start recovery at logseq 2022, block 520079, scn 0 *** 2014-02-06 23:37:11.890 Started writing zeroblks thread 1 seq 2022blocks 523240-523247 ---------------------------------------------- WARNING! Crash recovery of thread 1 seq2022 is ending at redo block 523240 but should nothave ended before redo block 523468 Incident 375884 created, dump file:/app/oracle/diag/rdbms/epm/epm1/incident/incdir_375884/epm1_ora_4971_i375884.trc ORA-00600: internal error code, arguments:[kcratr_nab_less_than_odr], [1], [2022], [523240], [523468], [], [], [], [],[], [], [] ORA-00600: internal error code, arguments:[kcratr_nab_less_than_odr], [1], [2022], [523240], [523468], [], [], [], [],[], [], [] ORA-00600: internal error code, arguments:[kcratr_nab_less_than_odr], [1], [2022], [523240], [523468], [], [], [], [],[], [], []
4,查询metalink,已有相关案例,主要是由于存储突然断电,导致向在线日志文件中写日志失败。
This Problem is caused by Storage Problemof the Database Files. The Subsystem (eg. SAN) crashed while the Database wasopen. The Database then crashed since the Database Files were not accessibleanymore. This caused a lost Write into the Online RedoLogs and so InstanceRecovery is not possible and raising the ORA-600.
官网中提供的解决办法是通过备份的控制文件恢复数据库,然后resetlog启动,当然如果没有备份,也可以通过参考文件中重建控制文件的方法。
1. If you could restore your StorageEnvironment and the Online RedoLogs from the Time of the crash you can try amanual Recovery followed by a RESETLOGS: SQL> startup mount; SQL> recover database until cancel usingbackup controlfile; -> manually provide Online RedoLogcontaining the last (current) Sequence when asked, eg. ORA-00279: change 100000 generated atxx/xx/xxxx xx:xx:xx needed for thread 1 ORA-00289: suggestion : /flash_recovery/archivelog/xxxx_xx_xx/o1_mf_1_100_%u_.arc ORA-00280: change 100000 for thread 1 is insequence #100 Specify log: {<RET>=suggested |filename | AUTO | CANCEL} /ora/oradata/dbtest/redo04_1.rdo Log applied. Media recovery complete. SQL> alter database open resetlogs;
5,通过metalink上的解决方法,根据提示输入在线日志文件名,数据库却再次报错ORA-00600 [2662] [2139]: internal errorcode, arguments: [2662], [2139], [791626372], [2139],
[oracle@localhost oradata]$ sqlplus"/as sysdba" SQL*Plus: Release 11.2.0.1.0 Production onThu Feb 6 17:53:47 2014 Copyright (c) 1982, 2009, Oracle. All rights reserved. Connected to an idle instance. SQL> startup mount; ORACLE instance started. Total System Global Area 6714322944 bytes Fixed Size 2226056 bytes Variable Size 5033166968 bytes Database Buffers 1660944384 bytes Redo Buffers 17985536 bytes Database mounted. SQL> recover database until cancel usingbackup controlfile; ORA-00279: change 9187726668895 generatedat 02/05/2014 01:00:04 needed for thread 1 ORA-00289: suggestion :/app/archive_log/1_2022_804560942.dbf ORA-00280: change 9187726668895 for thread1 is in sequence #2022 Specify log: {<RET>=suggested |filename | AUTO | CANCEL} /app/oracle/oradata/epm/redo03.log Log applied. Media recovery complete. SQL> alter database open resetlogs; alter database open resetlogs * ERROR at line 1: ORA-00603: ORACLE server session terminatedby fatal error ORA-00600: internal error code, arguments:[2662], [2139], [791626372], [2139], [791626610], [12583120], [], [], [], [],[], [] ORA-00600: internal error code, arguments:[2662], [2139], [791626371], [2139], [791626610], [12583120], [], [], [], [],[], [] ORA-01092: ORACLE instance terminated.Disconnection forced ORA-00600: internal error code, arguments:[2662], [2139], [791626368], [2139], [791626610], [12583120], [], [], [], [],[], [] Process ID: 9658 Session ID: 1705 Serial number: 5
6,再次查看alert日志,报错基本和上述相同,并提示,生成了一个trc文件,网上查询,确定为resetlogs仍然存在数据库坏块
[oracle@localhost trace]$ tail -falert_epm1.log ORA-00600: internal error code, arguments:[2662], [2139], [791626371], [2139], [791626610], [12583120], [], [], [], [],[], [] ORA-01092: ORACLE instance terminated.Disconnection forced ORA-00600: internal error code, arguments:[2662], [2139], [791626368], [2139], [791626610], [12583120], [], [], [], [],[], [] Incident details in:/app/oracle/diag/rdbms/epm/epm1/incident/incdir_411735/epm1_ora_9658_i411735.trc Errors in file/app/oracle/diag/rdbms/epm/epm1/incident/incdir_411735/epm1_ora_9658_i411735.trc: ORA-00603: ORACLE server session terminatedby fatal error ORA-00600: internal error code, arguments:[2662], [2139], [791626372], [2139], [791626610], [12583120], [], [], [], [],[], [] ORA-00600: internal error code, arguments:[2662], [2139], [791626371], [2139], [791626610], [12583120], [], [], [], [],[], [] ORA-01092: ORACLE instance terminated.Disconnection forced ORA-00600: internal error code, arguments:[2662], [2139], [791626368], [2139], [791626610], [12583120], [], [], [], [],[], []
7.重启启动数据库,数据库已能正常启动,但启动后很快就down掉,
SQL> startup mount; ORACLE instance started. Total System Global Area 6714322944 bytes Fixed Size 2226056 bytes Variable Size 5033166968 bytes Database Buffers 1660944384 bytes RedoBuffers 17985536 bytes Database mounted. SQL> alter database open; Database altered. SQL> select * from v$instance; select * from v$instance * ERROR at line 1: ORA-03135: connection lost contact Process ID: 10171 Session ID: 1705 Serial number: 5
8,查看日志,日志报错,ORA-01595,ora-600 [4194]
Block recovery completed at rba 2.137.16,scn 2139.791646684
Errors in file/app/oracle/diag/rdbms/epm/epm1/trace/epm1_smon_10046.trc:
ORA-01595: error freeing extent (49) ofrollback segment (6))
ORA-00600: internal error code, arguments:[4194], [], [], [], [], [], [], [], [], [], [], []
9,继续查询metalink,关于ora-600[4194],文章ID 1428786.1有详细讲解,原因是由于断电或者硬件故障,数据库实例恢复rollback时报错,
Symptoms The following error is occurring in thealert.log right before the database crashes. ORA-00600: internal error code, arguments:[4194], [#], [#], [], [], [], [], [] This error indicates that a mismatch hasbeen detected between redo records and rollback (undo) records. ARGUMENTS: Arg [a] - Maximum Undo record number inUndo block Arg [b] - Undo record number from Redoblock Since we are adding a new undo record toour undo block, we would expect that the new record number is equal to themaximum record number in the undo block plus one. Before Oracle can add a newundo record to the undo block it validates that this is correct. If thisvalidation fails, then an ORA-600 [4194] will be triggered. Changes This issue generally occurs when there is apower outage or hardware failure that initially crashes the database. Onstartup, the database does the normal roll forward (redo) and then rollback(undo), this is where the error is generated on the rollback.
10,metalink的解决方案是重建undo表空间,基本思路是先把undo设置为manual管理方式,重建undo,然后重启库即可
SQL> Create pfile='/tmp/corrupt.ora'from spfile ;
vi /tmp/corrupt.ora
*.Undo_management=Manual
启动数据库到mount状态
SQL> Startup mountpfile='/tmp/corrupt.ora' ORACLE instance started. Total System Global Area 6714322944 bytes Fixed Size 2226056 bytes Variable Size 5033166968 bytes Database Buffers 1660944384 bytes Redo Buffers 17985536 bytes Database mounted. SQL> Show parameter undo NAME TYPE VALUE ----------------------------------------------- ------------------------------ undo_management string MANUAL undo_retention integer 10800 undo_tablespace string UNDOTBS1 SQL> Alter database open ; Database altered. SQL> Create rollback segment r01 ; Rollback segment created. SQL> Alter rollback segment r01 online ; Rollback segment altered. SQL> Create undo tablespace undotbs_2datafile '/app/oracle/oradata/epm/undotbs_2.dbf' size 200M ; Tablespace created. SQL> alter system set undo_tablespace ='undotbs_2' scope=spfile; System altered.
再次重启数据库,此时也能正确启动
SQL> startup ORACLE instance started. Total System Global Area 6714322944 bytes Fixed Size 2226056 bytes Variable Size 5033166968 bytes Database Buffers 1660944384 bytes Redo Buffers 17985536 bytes Database mounted. Database opened.
11,此时,查看数据库日志,有数据库坏块方面的报错,虽然只有只报出一个数据库坏块,实际上可能存在更多的坏块
Thu Feb 06 18:52:57 2014 Errors in file/app/oracle/diag/rdbms/epm/epm1/trace/epm1_j000_13117.trc (incident=460023): ORA-01578: ORACLE data block corrupted(file # 2, block # 31061) ORA-01110: data file 2:'/app/oracle/oradata/epm/sysaux01.dbf' Errors in file /app/oracle/diag/rdbms/epm/epm1/trace/epm1_j000_13117.trc (incident=460024): ORA-01578: ORACLE data block corrupted(file # 2, block # 31061) ORA-01110: data file 2:'/app/oracle/oradata/epm/sysaux01.dbf' Errors in file/app/oracle/diag/rdbms/epm/epm1/trace/epm1_j000_13117.trc (incident=460025): ORA-01578: ORACLE data block corrupted(file # , block # ) ORA-01578: ORACLE data block corrupted(file # 2, block # 31061) ORA-01110: data file 2:'/app/oracle/oradata/epm/sysaux01.dbf
12,验证数据库坏块,常用的有两种方法,两种方法本质上是一致的,本文采用的是rman命令
1),rman命令
run { allocate channel d1 type disk; allocate channel d2 type disk; backup validate check logical database; }
验证某个文件
RMAN> backup validate datafile 4;
Starting backup at 06-FEB-14
using channel ORA_DISK_1
channel ORA_DISK_1: starting full datafilebackup set
channel ORA_DISK_1: specifying datafile(s)in backup set
input datafile file number=00004name=/app/oracle/oradata/epm/users01.dbf
channel ORA_DISK_1: backup set complete,elapsed time: 00:04:25
List of Datafiles
=================
File Status Marked Corrupt Empty BlocksBlocks Examined High SCN
---- ------ -------------- --------------------------- ----------
4 FAILED 0 46754 477122 9187726907487
File Name: /app/oracle/oradata/epm/users01.dbf
Block Type Blocks Failing Blocks Processed
---------- -------------- ----------------
Data 0 292182
Index 0 45604
Other 2 92580
validate found one or more corrupt blocks
See trace file/app/oracle/diag/rdbms/epm/epm1/trace/epm1_ora_13757.trc for details
Finished backup at 06-FEB-14
2),dbv命令
验证某个数据库文件
dbvFILE=/app/oracle/oradata/epm/users01.dbf
Page 439168 is influx - most likely media corrupt
Corrupt block relative dba: 0x0106b380(file 4, block 439168)
Fractured block found during dbv:
Data in bad block:
type: 32 format: 2 rdba: 0x0106b380
lastchange scn: 0x085b.2ef4cb1d seq: 0xe flg: 0x04
spare1: 0x0 spare2: 0x0 spare3: 0x0
consistency value in tail: 0x72a7201b
check value in block header: 0xfd1d
computed block checksum: 0xb9af
Page 439296 is influx - most likely mediacorrupt
Corrupt block relative dba: 0x0106b400(file 4, block 439296)
Fractured block found during dbv:
Data in bad block:
type: 32 format: 2 rdba: 0x0106b400
lastchange scn: 0x085b.2ef4cb24 seq: 0x19 flg: 0x04
spare1: 0x0 spare2: 0x0 spare3: 0x0
consistency value in tail: 0x72ac2025
check value in block header: 0xfd1f
computed block checksum: 0xb9b4
DBVERIFY - Verification complete
Total Pages Examined : 477120
Total Pages Processed (Data) : 292182
Total Pages Failing (Data) : 0
Total Pages Processed (Index): 45604
Total Pages Failing (Index): 0
Total Pages Processed (Other): 92578
Total Pages Processed (Seg) : 0
Total Pages Failing (Seg) : 0
Total Pages Empty : 46754
Total Pages Marked Corrupt : 2
Total Pages Influx : 2
Total Pages Encrypted : 0
Highest block SCN : 791817065 (2139.791817065)
13,此时,全库逻辑导出,保护好现有的数据
expdp userid=\"/ as sysdba\"full=y dumpfile=epm20140206_3.dmp directory=dpdata1 LOGFILE=epm20140206_3.log PARALLEL=3
14,通过rman命令,共检验出22个数据库坏块
SQL> select * fromv$database_block_corruption; FILE# BLOCK# BLOCKS CORRUPTION_CHANGE# CORRUPTIO ---------- ---------- ---------------------------- --------- 4 439296 1 0 FRACTURED 4 439168 1 0 FRACTURED 8 2610431 1 0 FRACTURED 8 2547178 1 0 FRACTURED 8 2547114 1 0 FRACTURED 8 2547050 1 0 FRACTURED 8 2546986 1 0 FRACTURED 8 2546922 1 0 FRACTURED 8 2546890 1 0 FRACTURED 8 2546858 1 0 FRACTURED 8 2546826 1 0 FRACTURED FILE# BLOCK# BLOCKSCORRUPTION_CHANGE# CORRUPTIO ---------- ---------- ---------------------------- --------- 8 2546794 1 0 FRACTURED 8 2546762 1 0 FRACTURED 8 2546730 1 0 FRACTURED 8 2546698 1 0 FRACTURED 8 2459433 1 0 FRACTURED 8 2459305 1 0 FRACTURED 8 1596687 1 0 FRACTURED 9 876808 1 0 FRACTURED 9 662038 1 0 FRACTURED 9 345491 1 0 FRACTURED 9 281617 1 0 FRACTURED
15,进一步定位到数据库对象
查看坏块对应的对象
Selecttablespace_name,segment_type,owner,segment_name From dba_extents Wherefile_id=2 and 31061 between block_id and block_id+blocks-1;
根据坏块生产查看对象的sql
select 'Select tablespace_name,segment_type,owner,segment_nameFrom dba_extents Where file_id=' || FILE# || ' and ' || BLOCK# || ' between block_id and block_id+blocks-1;'from v$database_block_corruption;
16,对于索引坏块,处理比较简单,直接rebuild即可,如果索引坏块较多,可以选择重建所有索引
重建索引
SQL> alter INDEX EPM.IDX_QRTZ_T_NEXT_FIRE_TIMErebuild online;
对于表比较复杂
1),可以通过备份恢复数据块,本例由于备份有问题,未成功
RMAN> blockrecover datafile 2 block31061 from backupset; Starting recover at 06-FEB-14 using channel ORA_DISK_1 channel ORA_DISK_1: restoring block(s) channel ORA_DISK_1: specifying block(s) torestore from backup set restoring blocks of datafile 00002 channel ORA_DISK_1: reading from backuppiece /app/backup/rman_bak/rman.c0_20140122_1629_1_EPM.bak channel ORA_DISK_1: ORA-19870: error whilerestoring backup piece /app/backup/rman_bak/rman.c0_20140122_1629_1_EPM.bak ORA-19501: read error on file"/app/backup/rman_bak/rman.c0_20140122_1629_1_EPM.bak", block number154496 (block size=8192) ORA-27061: waiting for async I/Os failed Linux-x86_64 Error: 5: Input/output error Additional information: -1 Additional information: 1048576
2),如果没有备份,则可以通过如下方法跳过坏块,当然会丢失部分数据
SQL> ALTER SESSION SET EVENTS 2 '10231 TRACE NAME CONTEXTFOREVER, LEVEL 10'; Session altered. SQL> create tableEPM.REQ_RESPSB_SUPPLIER_BAK as select * from EPM.REQ_RESPSB_SUPPLIER; Table created. SQL> select count(*) fromEPM.REQ_RESPSB_SUPPLIER_BAK; COUNT(*) ---------- 188786
17,至此,数据库启动成功,业务也基本恢复正常了
参考文档
metalink:ORA-600 [kcratr_nab_less_than_odr]during Instance Recovery after Database Crash (Doc ID 1299564.1)
metalink:Step by step to resolve ORA-6004194 4193 4197 on database crash (Doc ID 1428786.1)
http://www.eygle.com/archives/2010/05/ora-00600_kcratr1_lostwrt.html
http://www.xifenfei.com/2347.html
http://www.askmaclean.com/archives/ora-6004194%E9%94%99%E8%AF%AF%E4%B8%80%E4%BE%8B.html