The database version is:
BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - Prod
PL/SQL Release 10.2.0.3.0 - Production
CORE 10.2.0.3.0 Production
TNS for Linux: Version 10.2.0.3.0 - Production
NLSRTL Version 10.2.0.3.0 - Production
It runs on Linux:
[ora10g@hzmc admin]$ uname -a
Linux hzmc 2.6.18-53.el5xen #1 SMP Mon Nov 12 03:26:12 EST 2007 i686 i686 i386 GNU/Linux
Because all of the database's online redo logs were lost, the following errors appeared when the database was opened:
alter database open
Thu Nov 3 16:55:02 2011
Beginning crash recovery of 1 threads
parallel recovery started with 2 processes
Thu Nov 3 16:55:02 2011
Started redo scan
Thu Nov 3 16:55:02 2011
Errors in file /ora10g/admin/drb/udump/drb_ora_14761.trc:
ORA-00313: open failed for members of log group 1 of thread 1
ORA-00312: online log 1 thread 1: '/Tbackup/drb/redo01_1.log'
ORA-27037: unable to obtain file status
Linux Error: 2: No such file or directory
Additional information: 3
ORA-00312: online log 1 thread 1: '/Tbackup/drb/redo01.log'
ORA-27037: unable to obtain file status
Linux Error: 2: No such file or directory
Additional information: 3
Thu Nov 3 16:55:02 2011
Aborting crash recovery due to error 313
Thu Nov 3 16:55:02 2011
Errors in file /ora10g/admin/drb/udump/drb_ora_14761.trc:
ORA-00313: open failed for members of log group 1 of thread 1
ORA-00312: online log 1 thread 1: '/Tbackup/drb/redo01_1.log'
ORA-27037: unable to obtain file status
Linux Error: 2: No such file or directory
Additional information: 3
ORA-00312: online log 1 thread 1: '/Tbackup/drb/redo01.log'
ORA-27037: unable to obtain file status
Linux Error: 2: No such file or directory
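As we know, when a database is opened again after an abnormal shutdown, it has to scan the online redo logs so that no data is lost. If the redo logs are missing, the business data stored in the database may be left in an inconsistent state.
(If the redo logs are intact, the database performs transaction recovery during open, which guarantees transactional consistency.) The difficulty here is how to open the database once every online log is gone, so that the business data can be exported.
In this situation the first thing that came to mind was the hidden parameter _allow_resetlogs_corruption, which Oracle describes as: allow resetlogs even if it will cause corruption.
In other words, if a redo log in CURRENT or ACTIVE status turns out to be damaged while the database is being opened and this parameter is set to true, Oracle skips redo recovery and opens the database anyway. Oracle also recreates the redo logs while opening the database.
Given the situation, we set _allow_resetlogs_corruption to true and tried to open the database.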
SQL> alter system set "_allow_resetlogs_corruption"=true scope=spfile;
System altered.
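Because the redo logs have to be recreated, the database must go through an incomplete recovery; only then can the resetlogs option be used at open. So after setting _allow_resetlogs_corruption and restarting the database to the mount state, we performed an incomplete recovery.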
SQL> recover database until cancel;
ORA-00279: change 11000485117844 generated at 11/03/2011 16:36:32 needed for
thread 1
ORA-00289: suggestion :
/ora10g/oracle/product/10.2.0/db_1/dbs/arch1_2_766254516.dbf
ORA-00280: change 11000485117844 for thread 1 is in sequence #2
Specify log: {=suggested | filename | AUTO | CANCEL}
cancel
We tried to open the database again, and something unexpected happened: the resetlogs_change# of some datafiles was inconsistent with the rest.
SQL> alter database open RESETLOGS;
alter database open RESETLOGS
*
ERROR at line 1:
ORA-01190: control file or data file 11 is from before the last RESETLOGS
ORA-01110: data file 11: '/Tbackup/drb/wentest03.dbf'
SQL> select resetlogs_change#,file# from v$datafile_header;
RESETLOGS_CHANGE# FILE#
----------------- ----------
11000485117845 1
11000485117845 2
11000485117845 3
11000485117845 4
...
11000485117845 10
456954 11
RESETLOGS_CHANGE# FILE#
----------------- ----------
456954 12
456954 13
11000485117845 14
11000485117845 15
11000485117845 16
11000485117845 17
...
11000485117845 37
11000485117845 38
9745339313660 39
...
11000485117845 51
11000485117845 52
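As an aside, the odd files in a listing like the one above can be picked out mechanically. The following is a minimal Python sketch, not part of the original session. It uses only the rows that are actually shown above (the elided rows are left out), and the function name odd_files is made up for illustration. It treats the most common resetlogs_change# as the expected value and reports every file that differs from it.

```python
from collections import Counter

# (resetlogs_change#, file#) pairs copied from the rows visible in the listing;
# rows hidden behind "..." are deliberately not filled in.
rows = [
    (11000485117845, 1), (11000485117845, 2), (11000485117845, 3),
    (11000485117845, 4), (11000485117845, 10),
    (456954, 11), (456954, 12), (456954, 13),
    (11000485117845, 14), (11000485117845, 15), (11000485117845, 16),
    (11000485117845, 17), (11000485117845, 37), (11000485117845, 38),
    (9745339313660, 39),
    (11000485117845, 51), (11000485117845, 52),
]

def odd_files(rows):
    """Return file numbers whose resetlogs_change# differs from the majority value."""
    majority, _ = Counter(scn for scn, _ in rows).most_common(1)[0]
    return [file_no for scn, file_no in rows if scn != majority]

suspects = odd_files(rows)
print(suspects)
```

On these rows the result is files 11, 12, 13 and 39, the same four files examined next.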
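Checking the datafile status further shows that datafiles 11, 12 and 13 are online. These files must be repaired, otherwise business data will be lost. Datafile 39 was already offline while the database was in normal use, so it can be left alone for now and dealt with later if time permits.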
SQL> select file#,status from v$datafile where file# in (11,12,13,39);
FILE# STATUS
---------- -------
11 ONLINE
12 ONLINE
13 ONLINE
39 OFFLINE
SQL> select name from v$datafile where file# in (11,12,13);
NAME
--------------------------------------------------------------------------------
/Tbackup/drb/wentest03.dbf
/Tbackup/drb/wentest02.dbf
/Tbackup/drb/wentest01.dbf
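First, one concept needs to be clear: when Oracle opens a database, it runs a series of checks on each datafile. It checks whether the datafile belongs to this database (mainly db_id and db_name), whether the datafile numbers match those stored in the control file (file#, rfile#), whether the datafile's checkpoint change and checkpoint count agree with the control file's checkpoint change and checkpoint count, whether the datafile's resetlogs change and resetlogs count are consistent as they would be in a healthy database, and so on.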
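Under Oracle 10g, the checked items sit at the following positions in block 1:
Block 1:
offset:0028-0031
These four bytes store the dbid.
offset:0032-0039
These eight bytes store the database name as ASCII codes.
offset:0040-0041
These two bytes store the Control Seq.
offset:0052-0053
These two bytes store the file number.
offset:0112-0115
These four bytes store the reset logs count.
offset:0116-0123
The first six bytes of this range store the reset logs scn.
offset:0148-0149
These two bytes hold the datafile ctl cnt.
offset:0368-0371
These four bytes hold the relative file number.
offset:0484-0489
These six bytes store the checkpoint scn.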
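To make the offset list concrete, here is a small Python sketch that decodes some of these fields from a raw copy of block 1. It is an illustration under assumptions, not a tool used in this case: the function name parse_block1 is made up, the byte order is assumed to be little-endian (the platform here is Linux on x86), and each six-byte SCN is assumed to be stored as a four-byte base followed by a two-byte wrap. The block below is synthetic, filled with values taken from the header dump in this post, so nothing is read from a real datafile.

```python
import struct

def parse_block1(block: bytes) -> dict:
    """Decode selected header fields from a raw copy of block 1 (little-endian assumed)."""
    def scn_at(offset: int) -> int:
        # Assumed layout: 4-byte base first, then 2-byte wrap.
        base, wrap = struct.unpack_from("<IH", block, offset)
        return (wrap << 32) | base

    return {
        "dbid": struct.unpack_from("<I", block, 28)[0],
        "db_name": block[32:40].rstrip(b"\x00 ").decode("ascii"),
        "file_no": struct.unpack_from("<H", block, 52)[0],
        "resetlogs_count": struct.unpack_from("<I", block, 112)[0],
        "resetlogs_scn": scn_at(116),
        "rel_file_no": struct.unpack_from("<I", block, 368)[0],
        "checkpoint_scn": scn_at(484),
    }

# Build a synthetic 8192-byte block with values quoted in this post.
block = bytearray(8192)
struct.pack_into("<I", block, 28, 3342305182)             # Db ID
block[32:35] = b"DRB"                                     # Db Name, null padded
struct.pack_into("<H", block, 52, 11)                     # file number
struct.pack_into("<I", block, 112, 0x2803DF21)            # reset logs count
struct.pack_into("<IH", block, 116, 0x4001FF95, 0x0A01)   # reset logs scn 0x0a01.4001ff95
struct.pack_into("<I", block, 368, 12)                    # relative file number
struct.pack_into("<IH", block, 484, 0x33A89FAE, 0x0883)   # checkpoint scn 0x0883.33a89fae

header = parse_block1(bytes(block))
for name, value in header.items():
    print(name, value)
```

The reset logs scn 0x0a01.4001ff95 decodes to 11000485117845, the resetlogs_change# reported by the healthy datafiles.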
Dumping the datafile headers revealed further problems: resetlogs_change, resetlogs count and file number were all wrong.
SQL> ALTER SESSION SET EVENTS 'immediate trace name file_hdrs level 10';
Session altered.
[color=red]DATA FILE #11[/color]:
(name #19) /Tbackup/drb/wentest03.dbf
creation size=2560 block size=8192 status=0xf head=19 tail=19 dup=1
tablespace 11, index=12 krfil=12 prev_file=0
unrecoverable scn: 0x0883.33ab315d 01/01/1988 00:00:00
Checkpoint cnt:4 scn: 0x0883.33a89fae 04/16/2009 02:27:10
Stop scn: 0x0883.33a89fae 01/01/1988 00:00:00
Creation Checkpointed at scn: 0x0883.33a89f0c 04/16/2009 02:21:30
thread:0 rba:(0x0.0.0)
...
aux_file is NOT DEFINED
File 11 with tablespace ID 11 is plugged in read only
V10 STYLE FILE HEADER:
Compatibility Vsn = 169870080=0xa200300
Db ID=3342305182=0xc737879e, Db Name='DRB'
Activation ID=0=0x0
Control Seq=93349=0x16ca5, File size=2560=0xa00
[color=red]File Number=12[/color], Blksiz=8192, File Type=3 DATA
Tablespace #11 - WEN rel_fn:12
Creation at scn: 0x0883.33a89f0c 04/16/2009 02:21:30
Backup taken at scn: 0x0000.00000000 01/01/1988 00:00:00 thread:0
[color=red] reset logs count:0x2803df21 scn: 0x0a01.4001ff95[/color] reset logs terminal rcv data:0x0 scn: 0x0000.00000000
prev reset logs count:0x24293a18 scn: 0x0000.00000001 prev reset logs terminal rcv data:0x0 scn: 0x0000.00000000
recovered at 01/01/1988 00:00:00
status:0x0 root dba:0x00000000 chkpt cnt: 4 ctl cnt:3
begin-hot-backup file size: 0
Checkpointed at scn: 0x0883.33a89fae 04/16/2009 02:27:10
thread:1 rba:(0x5d12.14eb5.10)
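The goal is now clear: first use bbed to repair the resetlogs_change, resetlogs count and file number of datafiles 11, 12 and 13. Datafile 11 is used as the example; files 12 and 13 are modified in the same way.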
[ora10g@hzmc ~]$ bbed listfile=l blocksize=8192 password=blockedit
BBED> dump offset 112
BBED> modify 0xb724
BBED> dump offset 114
BBED> modify 0xac2d
BBED> dump offset 116
BBED> modify 0x95ff
BBED> dump offset 118
BBED> modify 0x0140
BBED> dump offset 52
BBED> modify 0x0b
BBED> sum apply
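The values 0x95ff and 0x0140 written at offsets 116 and 118 are the target resetlogs SCN (11000485117845, taken from the healthy datafiles) laid out byte by byte. The Python sketch below shows the conversion. It assumes little-endian storage with a four-byte base followed by a two-byte wrap; the helper name scn_to_le_bytes is made up for illustration.

```python
import struct

def scn_to_le_bytes(scn: int) -> bytes:
    """Split an SCN into wrap/base and lay it out as base (4 bytes) + wrap (2 bytes), little-endian."""
    wrap, base = divmod(scn, 1 << 32)
    return struct.pack("<IH", base, wrap)

target = 11000485117845          # resetlogs_change# of the healthy datafiles
raw = scn_to_le_bytes(target)

print(f"wrap=0x{target >> 32:04x} base=0x{target & 0xFFFFFFFF:08x}")
print(raw.hex())
```

The first four bytes come out as 95 ff 01 40, matching the two modify commands at offsets 116 and 118. Under the same assumption the wrap would occupy the next two bytes as 01 0a; the session as shown does not include a modify at offset 120.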
After the modification we ran the incomplete recovery again, but the command hung. Another problem had appeared.
SQL> recover database until cancel;
The trace file shows:
/ora10g/admin/drb/udump/drb_ora_31042.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - Production
With the Partitioning, OLAP and Data Mining options
ORACLE_HOME = /ora10g/oracle/product/10.2.0/db_1
System name: Linux
Node name: hzmc
Release: 2.6.18-53.el5xen
Version: #1 SMP Mon Nov 12 03:26:12 EST 2007
Machine: i686
Instance name: drb
Redo thread mounted by this instance: 1
Oracle process number: 15
Unix process pid: 31042, image: oracle@hzmc (TNS V1-V3)
*** SERVICE NAME:() 2011-11-03 17:23:58.006
*** SESSION ID:(159.3) 2011-11-03 17:23:58.006
*** 2011-11-03 17:23:58.006
Beginning recovery file header examination (51 files)
*** 2011-11-03 17:23:58.007
Completed recovery file header examination
*** 2011-11-03 17:23:58.007
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [2130], [0], [8], [2], [], [], [], []
Current SQL statement for this session:
ALTER DATABASE RECOVER database until cancel
----- Call Stack Trace -----
calling call entry argument values in hex
location type point (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedst()+27 call ksedst1() 0 ? 1 ?
ksedmp()+557 call ksedst() 0 ? 9B8E606 ? 9B8E606 ? 155 ?
0 ? 0 ?
ksfdmp()+19 call ksedmp() 3 ? BFF69F8C ? AD0710C ?
CD49D00 ? 3 ? CCF9E38 ?
kgeriv()+188 call 00000000 CD49D00 ? 3 ?
kgeasi()+113 call kgeriv() CD49D00 ? B7F70020 ? 852 ?
3 ? BFF69FC8 ?
kccugg()+433 call kgeasi() CD49D00 ? B7F70020 ? 852 ?
2 ? 3 ? 0 ?
kcc_get_record()+32 call kccugg() 1 ? BFF6B58C ? 0 ? BFF6A1F0 ?
2 ? 200 ?
kccgri()+31 call kcc_get_record() BFF6B58C ? 0 ? BFF6A1F0 ? 2 ?
BFF6A2F0 ?
kctgdc()+52 call kccgri() BFF6B58C ? 0 ? BFF6A1F0 ?
BFF6B58C ? BFF6A0AC ?
krdsmr()+16009 call kctgdc() BFF6B58C ? BFF6AF58 ?
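Checking the database wait events at the same time showed a session waiting on enq: CF - contention. In other words, the incomplete recovery had locked the control file and never released the lock.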
SQL> select event from v$session_wait;
EVENT
----------------------------------------------------------------
pmon timer
rdbms ipc message
rdbms ipc message
rdbms ipc message
rdbms ipc message
rdbms ipc message
rdbms ipc message
rdbms ipc message
rdbms ipc message
rdbms ipc message
rdbms ipc message
EVENT
----------------------------------------------------------------
direct path read
smon timer
SQL*Net message to client
enq: CF - contention
A control file backup issued at this point hangs as well:
SQL> alter database backup controlfile to trace resetlogs;
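A search of Metalink turned up documents describing this as a bug in 10.2.0.3, but the symptoms did not match ours. We upgraded the database software to 10.2.0.5 anyway, just to see.
Note that the 10.2.0.3 software should be backed up before the upgrade so that a quick rollback is possible. The result was disappointing: the error remained. This was the real test. With no solution to be found online, we had to work it out ourselves.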
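In the end there are two ways to bring a database to the point where it can be opened with open resetlogs:
1. Recover with an incomplete recovery command, such as
recover database until cancel;
recover database until change <scn>;
2. Recover using a backup control file
recover database using backup controlfile;
One way to make using backup controlfile available is to recreate the control file with the resetlogs option. The goal was clear again, so we got to work!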
Back up the control file to trace with the resetlogs option:
SQL> alter database backup controlfile to trace resetlogs;
Database altered.
But when the control file was recreated, errors appeared in the alert log:
WARNING: Default Temporary Tablespace not specified in CREATE DATABASE command
Default Temporary Tablespace will be necessary for a locally managed database in future release
Thu Nov 3 17:41:34 2011
Errors in file /ora10g/admin/drb/udump/drb_ora_17372.trc:
ORA-00600: internal error code, arguments: [kccscf_1], [9], [93440], [65535], [], [], [], []
Thu Nov 3 17:41:35 2011
Errors in file /ora10g/admin/drb/udump/drb_ora_17372.trc:
ORA-00600: internal error code, arguments: [kccscf_1], [9], [93440], [65535], [], [], [], []
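Fortunately Metalink had an explanation this time: lower the MAXLOGHISTORY option from 93440 to a number below 65536. After some effort the control file was finally created, and we ran the incomplete recovery again.
Sadly, the database failed again at open resetlogs and the instance terminated abnormally.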
SQL> recover database using backup controlfile;
ORA-00279: change 11000485137852 generated at 11/03/2011 19:28:40 needed for
thread 1
ORA-00289: suggestion :
/ora10g/oracle/product/10.2.0/db_1/dbs/arch1_2_766265181.dbf
ORA-00280: change 11000485137852 for thread 1 is in sequence #2
Specify log: {=suggested | filename | AUTO | CANCEL}
cancel
Media recovery cancelled.
SQL> alter database open resetlogs;
alter database open resetlogs
*
ERROR at line 1:
ORA-03113: end-of-file on communication channel
We checked the alert log at once and found the familiar ORA-600 [2662]. Seeing this error was like meeting an old friend: we were not far from our goal!
ARC1: Becoming the heartbeat ARCH
Thu Nov 3 19:36:42 2011
SMON: enabling cache recovery
Thu Nov 3 19:36:42 2011
Errors in file /ora10g/admin/drb/udump/drb_ora_32054.trc:
ORA-00600: internal error code, arguments: [2662], [2561], [1073892802], [2561], [1073892833], [4194313], [], []
Thu Nov 3 19:37:46 2011
Shutting down instance (abort)
License high water mark = 3
Instance terminated by USER, pid = 5440
Before fixing this error, here is a quick primer on where it comes from.
ERROR:
ORA-600 [2662] [a] [b] [c] [d] [e]
Arg [a] Current SCN WRAP
Arg [b] Current SCN BASE
Arg [c] dependent SCN WRAP
Arg [d] dependent SCN BASE
Arg [e] Where present this is the DBA where the dependent SCN came from.
What the error essentially means is that, during open or a query, the database found a data block whose SCN is greater than the current SCN.
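Applied to the arguments logged in this case, the decoding looks like the Python sketch below. It is illustrative only: the function name decode_2662 is made up, and the split of the DBA into a relative file number (upper 10 bits) and a block number (lower 22 bits) is the usual layout for a smallfile tablespace, assumed here rather than taken from the trace.

```python
def decode_2662(args):
    """Turn the five ORA-600 [2662] arguments into full SCNs and a file/block pair."""
    cur_wrap, cur_base, dep_wrap, dep_base, dba = args
    return {
        "current_scn": (cur_wrap << 32) | cur_base,
        "dependent_scn": (dep_wrap << 32) | dep_base,
        "rel_file": dba >> 22,        # assumed: upper 10 bits
        "block": dba & 0x3FFFFF,      # assumed: lower 22 bits
    }

# Arguments [a]..[e] from the alert log entry in this case.
info = decode_2662([2561, 1073892802, 2561, 1073892833, 4194313])
print(info)
print("dependent SCN is ahead by", info["dependent_scn"] - info["current_scn"])
```

Here the dependent SCN (11000485137889) is 31 ahead of the current SCN (11000485137858), and under the assumed DBA layout the block in question is block 9 of relative file 1.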
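Once this is understood, the fix is straightforward:
1. Use bbed to lower the SCN in the data blocks. This is impractical when many blocks are involved.
2. Raise Oracle's current SCN. This is much more practical.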
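We chose the second option, which raises another question: how is the current SCN changed, and by how much? The calculation is described below.
The rule is as follows: multiply Arg [c] by 4 and call the result V_Wrap.
If Arg [d] = 0, the required level is V_Wrap
If Arg [d] < 1073741824, the required level is V_Wrap+1
If Arg [d] < 2147483648, the required level is V_Wrap+2
If Arg [d] < 3221225472, the required level is V_Wrap+3
In this case Arg [d] is 1073892833, which is greater than 1073741824 and less than 2147483648, so the level is 2561*4+2 = 10246.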
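The rule can be written down as a short Python sketch. This is an illustration of the calculation described above, not an Oracle utility: the function name giga_scn_level is made up, and the case Arg [d] >= 3221225472 is rejected because the rule as stated does not cover it.

```python
GIGA = 1 << 30   # 1073741824

def giga_scn_level(dep_wrap: int, dep_base: int) -> int:
    """Compute the level from Arg [c] (dependent SCN wrap) and Arg [d] (dependent SCN base)."""
    v_wrap = dep_wrap * 4
    if dep_base == 0:
        return v_wrap
    if dep_base < 1 * GIGA:
        return v_wrap + 1
    if dep_base < 2 * GIGA:
        return v_wrap + 2
    if dep_base < 3 * GIGA:
        return v_wrap + 3
    raise ValueError("Arg [d] >= 3221225472 is not covered by the rule in the text")

level = giga_scn_level(2561, 1073892833)
print(level)           # level for this case
print(level * GIGA)    # the SCN that corresponds to this many 'giga' units
```

For this case the level is 10246, and 10246 * 1073741824 = 11001558728704, which is the SCN the alert log reports after the database is opened.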
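With the hidden parameter _minimum_giga_scn set to 10246, the database finally opened. The alert log shows that the SCN was advanced to 11001558728704 (10246 * 1073741824).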
Thu Nov 3 19:39:42 2011
Completed crash recovery at
Thread 1: logseq 1, block 3, scn 11000485157857
0 data blocks read, 0 data blocks written, 0 redo blocks read
Advancing SCN to 11001558728704 according to _minimum_giga_scn
...
QMNC started with pid=20, OS id=21397
Thu Nov 3 19:46:36 2011
LOGSTDBY: Validating controlfile with logical metadata
Thu Nov 3 19:46:36 2011
LOGSTDBY: Validation complete
Thu Nov 3 19:46:51 2011
Completed: alter database open resetlogs
With the database open, what remained was the cleanup. A check showed that the tablespace was in READ ONLY status:
SQL> select TABLESPACE_NAME,STATUS from dba_tablespaces where TABLESPACE_NAME='WEN';
TABLESPACE_NAME STATUS
------------------------------ ---------
WEN READ ONLY
Switching the tablespace to read write produced the following error. An online search again turned up nothing, so once more we were on our own.
SQL> alter database datafile 13 online;
Database altered.
SQL> alter tablespace wen read write;
alter tablespace wen read write
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [kcpgucv1], [13], [], [], [], [],
[], []
We know that tablespace status is actually stored in the ts$ table, and the meaning of each status value is spelled out in sql.bsq:
create table ts$ /* tablespace table */
( ts# number not null, /* tablespace identifier number */
name varchar2("M_IDEN") not null, /* name of tablespace */
owner# number not null, /* owner of tablespace */
online$ number not null, /* status (see KTT.H): */
/*[color=red] 1 = ONLINE, 2 = OFFLINE, 3 = INVALID[/color] */
contents$ number not null, /* TEMPORARY/PERMANENT */
undofile# number, /* undo_off segment file number (status is OFFLINE) */
undoblock# number, /* undo_off segment header file number */
blocksize number not null, /* size of block in bytes */
inc# number not null, /* incarnation number of extent */
scnwrp number, /* clean offline scn - zero if not offline clean */
scnbas number, /* scnbas - scn base, scnwrp - scn wrap */
dflminext number not null, /* default minimum number of extents */
dflmaxext number not null, /* default maximum number of extents */
dflinit number not null, /* default initial extent size */
dflincr number not null, /* default next extent size */
dflminlen number not null, /* default minimum extent size */
dflextpct number not null, /* default percent extent size increase */
dflogging number not null,
/* lowest bit: default logging attribute: clear=NOLOGGING, set=LOGGING */
/* second lowest bit: force logging mode */
affstrength number not null, /* Affinity strength */
bitmapped number not null, /* If not bitmapped, 0 else unit size */
/* in blocks */
plugged number not null, /* If plugged */
directallowed number not null, /* Operation which invalidate standby are */
/* allowed */
flags number not null, /* various flags */
/* 0x01 = system managed allocation */
/* 0x02 = uniform allocation */
/* if above 2 bits not set then user managed */
/* 0x04 = migrated tablespace */
/* 0x08 = tablespace being migrated */
/* 0x10 = undo tablespace */
/* 0x20 = auto segment space management */
/* if above bit not set then freelist segment managed */
/* 0x40 = COMPRESS */
/* 0x80 = ROW MOVEMENT */
/* 0x100 = SFT */
/* 0x200 = undo retention guarantee */
/* 0x400 = tablespace belongs to a group */
/* 0x800 = this actually describes a group */
pitrscnwrp number, /* scn wrap when ts was created */
pitrscnbas number, /* scn base when ts was created */
ownerinstance varchar("M_IDEN"), /* Owner instance name */
backupowner varchar("M_IDEN"), /* Backup owner instance name */
groupname varchar("M_IDEN"), /* Group name */
spare1 number, /* plug-in SCN wrap */
spare2 number, /* plug-in SCN base */
spare3 varchar2(1000),
spare4 date
)
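Since the command did not work, we modified the base table directly. [color=red]Note that this method must not be used under normal circumstances.[/color]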
SQL> update ts$ set ONLINE$=1 where name='WEN';
1 row updated.
The tablespace wen now shows as online:
SQL> select TABLESPACE_NAME,STATUS from dba_tablespaces where TABLESPACE_NAME='WEN';
TABLESPACE_NAME STATUS
------------------------------ ---------
WEN ONLINE
It can also be read and written again:
SQL> create table test111 tablespace wen as select * from v$datafile;
Table created.
It is advisable to restart the database at this point. If the restart is clean, the problem is resolved for now. Applause!!!