
Oracle Database 10g Enterprise Edition Release - Prod
PL/SQL Release - Production
CORE      Production
TNS for Linux: Version - Production
NLSRTL Version - Production
[ora10g@hzmc admin]$ uname -a
Linux hzmc 2.6.18-53.el5xen #1 SMP Mon Nov 12 03:26:12 EST 2007 i686 i686 i386 GNU/Linux
All of the database's online redo logs had been lost, so the database reported the following errors during the open phase:
alter database open                                                       
Thu Nov  3 16:55:02 2011                                                  
Beginning crash recovery of 1 threads                                     
parallel recovery started with 2 processes                               
Thu Nov  3 16:55:02 2011                                                  
Started redo scan                                                         
Thu Nov  3 16:55:02 2011                                                  
Errors in file /ora10g/admin/drb/udump/drb_ora_14761.trc:                 
Thu Nov  3 16:55:02 2011                                                  
Aborting crash recovery due to error 313                                  
Thu Nov  3 16:55:02 2011                                                  
Errors in file /ora10g/admin/drb/udump/drb_ora_14761.trc:                 
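As we know, after an abnormal shutdown the database has to scan the online redo logs when it is opened again, to make sure no data is lost. If the redo logs are missing, the business data stored in the database may be left in an inconsistent state.
In this situation, the first thing that came to mind was the hidden parameter _allow_resetlogs_corruption, which Oracle describes as: allow resetlogs even if it will cause corruption.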
SQL> alter system set "_allow_resetlogs_corruption"=true scope=spfile;     
System altered.


SQL> recover database until cancel;                                                                                                                         
ORA-00279: change 11000485117844 generated at 11/03/2011 16:36:32 needed for
thread 1                                                                    
ORA-00289: suggestion :                                                     
ORA-00280: change 11000485117844 for thread 1 is in sequence #2             
Specify log: {<RET>=suggested | filename | AUTO | CANCEL}                   

SQL> alter database open RESETLOGS;                                         
alter database open RESETLOGS                                                                                                                 
ERROR at line 1:                                                            
ORA-01190: control file or data file 11 is from before the last RESETLOGS   
ORA-01110: data file 11: '/Tbackup/drb/wentest03.dbf'     

SQL> select resetlogs_change#,file# from v$datafile_header;              
RESETLOGS_CHANGE#      FILE#                                
----------------- ----------                                
   11000485117845          1                                
   11000485117845          2                                
   11000485117845          3                                
   11000485117845          4                                
   11000485117845         10                                
           456954         11                                
RESETLOGS_CHANGE#      FILE#                                
----------------- ----------                                
           456954         12                                
           456954         13                                
   11000485117845         14                                
   11000485117845         15                                
   11000485117845         16                                
   11000485117845         17 
   11000485117845         37                                  
   11000485117845         38   
    9745339313660         39   
   11000485117845         51   
   11000485117845         52   
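The mismatch is easier to spot once the rows are grouped by value. The following is a minimal Python sketch, not part of the original recovery session, that takes the (file#, resetlogs_change#) pairs shown in the listing above and reports the files whose value differs from the most common one. The function name and the in-memory list are illustrative assumptions; on a live system the rows would come from v$datafile_header.

```python
from collections import Counter

# (file#, resetlogs_change#) pairs copied from the v$datafile_header output above.
HEADER_ROWS = [
    (1, 11000485117845), (2, 11000485117845), (3, 11000485117845),
    (4, 11000485117845), (10, 11000485117845), (11, 456954),
    (12, 456954), (13, 456954), (14, 11000485117845),
    (15, 11000485117845), (16, 11000485117845), (17, 11000485117845),
    (37, 11000485117845), (38, 11000485117845), (39, 9745339313660),
    (51, 11000485117845), (52, 11000485117845),
]


def find_resetlogs_outliers(rows):
    """Return {file#: resetlogs_change#} for files that differ from the majority value."""
    majority, _ = Counter(change for _, change in rows).most_common(1)[0]
    return {file_no: change for file_no, change in rows if change != majority}


if __name__ == "__main__":
    for file_no, change in sorted(find_resetlogs_outliers(HEADER_ROWS).items()):
        print(f"file {file_no}: resetlogs_change# = {change}")
```

With the rows above, the sketch flags files 11, 12, 13, and 39. File 39 is reported as OFFLINE by the status query that follows, which is presumably why the repair concentrates on files 11, 12, and 13.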

SQL> select file#,status from v$datafile where file# in (11,12,13,39);

---------- -------
        11 ONLINE
        12 ONLINE
        13 ONLINE
        39 OFFLINE
SQL> select name from v$datafile where file# in (11,12,13);                    

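The consistency checks include: whether the data file number matches the file number stored in the control file (file#, rfile#); whether the data file's checkpoint change and checkpoint count match the checkpoint change and checkpoint count in the control file;
whether, in the normal state, the data file's resetlogs change and resetlogs count are consistent; and so on.
In Oracle 10g, the checked items are located in block 1 as follows:
Block 1:
Two bytes store the Control Seq
Four bytes store the reset logs count
Six bytes store the reset logs scn
Two bytes store the data file ctl cnt
Six bytes store the checkpoint scn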

Dumping the data file headers revealed further problems: the resetlogs_change, the resetlogs count, and the file number were all incorrect.
SQL> ALTER SESSION SET EVENTS 'immediate trace name file_hdrs level 10';
Session altered.                                                        

DATA FILE #11:                                                                                  
  (name #19) /Tbackup/drb/wentest03.dbf                                                         
creation size=2560 block size=8192 status=0xf head=19 tail=19 dup=1                             
tablespace 11, index=12 krfil=12 prev_file=0                                                   
unrecoverable scn: 0x0883.33ab315d 01/01/1988 00:00:00                                         
Checkpoint cnt:4 scn: 0x0883.33a89fae 04/16/2009 02:27:10                                      
Stop scn: 0x0883.33a89fae 01/01/1988 00:00:00                                                  
Creation Checkpointed at scn:  0x0883.33a89f0c 04/16/2009 02:21:30                             
thread:0 rba:(0x0.0.0)                                                                         
aux_file is NOT DEFINED                                                                        
File 11 with tablespace ID 11 is plugged in read only                                           
V10 STYLE FILE HEADER:                                                                         
        Compatibility Vsn = 169870080=0xa200300                                                 
        Db ID=3342305182=0xc737879e, Db Name='DRB'                                              
        Activation ID=0=0x0                                                                     
        Control Seq=93349=0x16ca5, File size=2560=0xa00                                         
        File Number=12, Blksiz=8192, File Type=3 DATA                                           
Tablespace #11 - WEN  rel_fn:12                                                                 
Creation   at   scn: 0x0883.33a89f0c 04/16/2009 02:21:30                                        
Backup taken at scn: 0x0000.00000000 01/01/1988 00:00:00 thread:0                               
reset logs count:0x2803df21 scn: 0x0a01.4001ff95 reset logs terminal rcv data:0x0 scn: 0x      
prev reset logs count:0x24293a18 scn: 0x0000.00000001 prev reset logs terminal rcv data:0      
x0 scn: 0x0000.00000000                                                                         
recovered at 01/01/1988 00:00:00                                                               
status:0x0 root dba:0x00000000 chkpt cnt: 4 ctl cnt:3                                          
begin-hot-backup file size: 0                                                                   
Checkpointed at scn:  0x0883.33a89fae 04/16/2009 02:27:10                                       
thread:1 rba:(0x5d12.14eb5.10)                                                                 
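The dump prints SCNs in hexadecimal as wrap.base, while v$datafile_header reports them in decimal. The Python sketch below converts between the two so the dump can be compared against the query output. It assumes the usual 10g layout in which the SCN equals wrap * 2^32 + base; the helper name is hypothetical.

```python
def scn_from_dump(text):
    """Convert a 'wrap.base' hex SCN such as '0x0a01.4001ff95' to a decimal SCN."""
    wrap_hex, base_hex = text.split(".")
    wrap = int(wrap_hex, 16)  # int() accepts the leading '0x' when the base is 16
    base = int(base_hex, 16)
    return (wrap << 32) | base


if __name__ == "__main__":
    # The "reset logs ... scn" value from the header dump above.
    print(scn_from_dump("0x0a01.4001ff95"))  # 11000485117845
```

The result, 11000485117845, is the resetlogs_change# that v$datafile_header shows for the healthy data files.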

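Our goal was now clear: first use bbed to repair the resetlogs_change, resetlogs count, and file number of data files 11, 12, and 13. Data file 11 is used as the example here; files 12 and 13 are modified in the same way.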
[ora10g@hzmc ~]$ bbed listfile=l blocksize=8192 password=blockedit
BBED> dump offset 112
BBED> modify 0xb724
BBED> dump offset 114
BBED> modify 0xac2d
BBED> dump offset 116
BBED> modify 0x95ff
BBED> dump offset 118
BBED> modify 0x0140
BBED> dump offset 52
BBED> modify 0x0b
BBED> sum apply
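The two-byte values written at offsets 112 to 118 are easier to check once they are reassembled into four-byte fields. The Python sketch below does that under two assumptions that the original session does not state: that bbed stores the hex digits in the order they are typed, and that the header fields are little-endian, as expected on the i686 host shown by uname. The helper name is hypothetical.

```python
def le_value(*chunks):
    """Join hex chunks in written order and read the bytes as a little-endian integer."""
    raw = b"".join(bytes.fromhex(chunk) for chunk in chunks)
    return int.from_bytes(raw, "little")


if __name__ == "__main__":
    print(hex(le_value("b724", "ac2d")))  # bytes at offsets 112-115
    print(hex(le_value("95ff", "0140")))  # bytes at offsets 116-119
```

Under these assumptions, offsets 116 to 119 decode to 0x4001ff95, which is the base part of the reset logs scn 0x0a01.4001ff95 seen in the header dump, and offsets 112 to 115 decode to 0x2dac24b7, presumably the reset logs count being written.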

SQL> recover database until cancel;
Oracle Database 10g Enterprise Edition Release - Production         
With the Partitioning, OLAP and Data Mining options                                                    
ORACLE_HOME = /ora10g/oracle/product/10.2.0/db_1                               
System name:    Linux                                                          
Node name:      hzmc                                                           
Release:        2.6.18-53.el5xen                                               
Version:        #1 SMP Mon Nov 12 03:26:12 EST 2007                            
Machine:        i686                                                           
Instance name: drb                                                             
Redo thread mounted by this instance: 1                                        
Oracle process number: 15                                                      
Unix process pid: 31042, image: oracle@hzmc (TNS V1-V3)                        
*** SERVICE NAME:() 2011-11-03 17:23:58.006                                    
*** SESSION ID:(159.3) 2011-11-03 17:23:58.006                                 
*** 2011-11-03 17:23:58.006                                                    
Beginning recovery file header examination (51 files)                          
*** 2011-11-03 17:23:58.007                                                    
Completed recovery file header examination                                     
*** 2011-11-03 17:23:58.007                                                    
ksedmp: internal or fatal error                                                
ORA-00600: internal error code, arguments: [2130], [0], [8], [2], [], [], [], []
Current SQL statement for this session:                                        
ALTER DATABASE RECOVER  database until cancel                                  
----- Call Stack Trace -----                                                   
calling              call     entry                argument values in hex      
location             type     point                (? means dubious value)     
-------------------- -------- -------------------- ----------------------------
ksedst()+27          call     ksedst1()            0 ? 1 ?                     
ksedmp()+557         call     ksedst()             0 ? 9B8E606 ? 9B8E606 ? 155 ?
                                                   0 ? 0 ?                     
ksfdmp()+19          call     ksedmp()             3 ? BFF69F8C ? AD0710C ?    
                                                   CD49D00 ? 3 ? CCF9E38 ?     
kgeriv()+188         call     00000000             CD49D00 ? 3 ?               
kgeasi()+113         call     kgeriv()             CD49D00 ? B7F70020 ? 852 ?  
                                                   3 ? BFF69FC8 ?              
kccugg()+433         call     kgeasi()             CD49D00 ? B7F70020 ? 852 ?  
                                                   2 ? 3 ? 0 ?                 
kcc_get_record()+32  call     kccugg()             1 ? BFF6B58C ? 0 ? BFF6A1F0 ?
                                                   2 ? 200 ?                   
kccgri()+31          call     kcc_get_record()     BFF6B58C ? 0 ? BFF6A1F0 ? 2 ?
                                                   BFF6A2F0 ?                  
kctgdc()+52          call     kccgri()             BFF6B58C ? 0 ? BFF6A1F0 ?   
                                                   BFF6B58C ? BFF6A0AC ?       
krdsmr()+16009       call     kctgdc()             BFF6B58C ? BFF6AF58 ?       

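Checking the database wait events at the same time showed a session waiting on enq: CF - contention. In other words, the incomplete recovery had locked the control file and never released the lock.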
SQL> select event from v$session_wait;                            
pmon timer                                                        
rdbms ipc message                                                 
rdbms ipc message                                                 
rdbms ipc message                                                 
rdbms ipc message                                                 
rdbms ipc message                                                 
rdbms ipc message                                                 
rdbms ipc message                                                 
rdbms ipc message                                                 
rdbms ipc message                                                 
rdbms ipc message                                                 
direct path read                                                  
smon timer                                                        
SQL*Net message to client                                         
enq: CF - contention       

SQL> alter database backup controlfile to trace resetlogs;
Ultimately there are two ways to open a database with open resetlogs:
1. Run one of the incomplete recovery commands, for example:
recover database until cancel;
recover database until change;
recover database using backup controlfile;
One way to make using backup controlfile applicable is to recreate the control file with the resetlogs option. The goal was clear once again, so we got to work!
SQL> alter database backup controlfile to trace resetlogs;
Database altered.

WARNING: Default Temporary Tablespace not specified in CREATE DATABASE command                                
Default Temporary Tablespace will be necessary for a locally managed database in future release                                        
Thu Nov  3 17:41:34 2011                                                                              
Errors in file /ora10g/admin/drb/udump/drb_ora_17372.trc:                                             
ORA-00600: internal error code, arguments: [kccscf_1], [9], [93440], [65535], [], [], [], []          
Thu Nov  3 17:41:35 2011                                                                              
Errors in file /ora10g/admin/drb/udump/drb_ora_17372.trc:                                             
ORA-00600: internal error code, arguments: [kccscf_1], [9], [93440], [65535], [], [], [], []  
Unfortunately, the database hit an error again during open resetlogs, and the instance terminated abnormally.
SQL> recover database using backup controlfile;                                
ORA-00279: change 11000485137852 generated at 11/03/2011 19:28:40 needed for   
thread 1                                                                               
ORA-00289: suggestion :                                                        
ORA-00280: change 11000485137852 for thread 1 is in sequence #2                
Specify log: {<RET>=suggested | filename | AUTO | CANCEL}                      
Media recovery cancelled.                                                      
SQL> alter database open resetlogs;                                            
alter database open resetlogs                                                  
ERROR at line 1:                                                               
ORA-03113: end-of-file on communication channel        

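We checked the alert log right away and found the familiar ORA-600 [2662]. Seeing this error was like meeting a long-lost old friend: we were no longer far from our goal!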
ARC1: Becoming the heartbeat ARCH                                                                                                        
Thu Nov  3 19:36:42 2011                                                                                         
SMON: enabling cache recovery                                                                                    
Thu Nov  3 19:36:42 2011                                                                                         
Errors in file /ora10g/admin/drb/udump/drb_ora_32054.trc:                                                        
ORA-00600: internal error code, arguments: [2662], [2561], [1073892802], [2561], [1073892833], [4194313], [], [] 
Thu Nov  3 19:37:46 2011                                                                                         
Shutting down instance (abort)                                                                                   
License high water mark = 3                                                                                      
Instance terminated by USER, pid = 5440

  ORA-600 [2662] [a] [b] [c] [d] [e]

  Arg [a]  Current SCN WRAP
  Arg [b]  Current SCN BASE
  Arg [c]  dependent SCN WRAP
  Arg [d]  dependent SCN BASE
  Arg [e]  Where present this is the DBA where the dependent SCN came from.

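This error essentially means that, while opening the database or running a query, Oracle found a data block whose SCN is higher than the current SCN.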
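To make the arguments concrete, the Python sketch below combines the wrap and base values from the ORA-00600 [2662] line in the alert log into full SCNs and compares them. It assumes that an SCN equals wrap * 2^32 + base and that Arg [e] uses the common 10-bit file / 22-bit block data block address layout of smallfile tablespaces. The helper names are hypothetical.

```python
def scn(wrap, base):
    """Combine an SCN wrap and a 32-bit SCN base into a single SCN."""
    return (wrap << 32) | base


def split_dba(dba):
    """Split a data block address into (relative file#, block#).

    Assumes the 10-bit file / 22-bit block layout of smallfile tablespaces.
    """
    return dba >> 22, dba & 0x3FFFFF


# Arguments [a]..[e] copied from the ORA-00600 [2662] line in the alert log.
ARGS = (2561, 1073892802, 2561, 1073892833, 4194313)

current_scn = scn(ARGS[0], ARGS[1])
dependent_scn = scn(ARGS[2], ARGS[3])

if __name__ == "__main__":
    print("current SCN  :", current_scn)
    print("dependent SCN:", dependent_scn)
    print("gap          :", dependent_scn - current_scn)
    print("file#, block#:", split_dba(ARGS[4]))
```

Here the dependent SCN (11000485137889) is 31 higher than the current SCN (11000485137858), and under the stated layout the block address points at file 1, block 9.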
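2. Increase Oracle's current SCN. This is the more feasible option.
So we took the second approach, which raised another question: how should the current SCN be changed, and by how much? The calculation works as follows.
Multiply Arg [c] by 4 to get a value, call it V_Wrap.
If Arg [d] = 0, the required level is V_Wrap.
If Arg [d] < 1073741824, the required level is V_Wrap+1.
If Arg [d] < 2147483648, the required level is V_Wrap+2.
If Arg [d] < 3221225472, the required level is V_Wrap+3.
In this case Arg [d] is 1073892833, which is at least 1073741824 but less than 2147483648, so the level is 2561*4+2 = 10246.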
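The rule can be written as a short Python sketch, using Arg [c] and Arg [d] from the ORA-00600 [2662] line in the alert log. The function name is an assumption. Multiplying the level by 2^30 reflects the "giga" unit suggested by the _minimum_giga_scn message in the alert log excerpt that follows. The rule as stated does not cover Arg [d] values of 3221225472 or more, so the sketch refuses them rather than guessing.

```python
GIGA = 1 << 30  # 1073741824


def required_level(dep_wrap, dep_base):
    """Apply the rule from the text: V_Wrap = Arg[c] * 4, plus 0..3 depending on Arg[d]."""
    v_wrap = dep_wrap * 4
    if dep_base == 0:
        return v_wrap
    for step in (1, 2, 3):
        if dep_base < step * GIGA:
            return v_wrap + step
    raise ValueError("Arg [d] >= 3221225472 is not covered by the rule in the text")


if __name__ == "__main__":
    level = required_level(2561, 1073892833)
    print(level)         # 10246
    print(level * GIGA)  # 11001558728704
```

The product 11001558728704 is the value reported by "Advancing SCN to 11001558728704 according to _minimum_giga_scn" in the alert log.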

Thu Nov  3 19:39:42 2011                                        
Completed crash recovery at                                                                                                           
Thread 1: logseq 1, block 3, scn 11000485157857                
0 data blocks read, 0 data blocks written, 0 redo blocks read  
Advancing SCN to 11001558728704 according to _minimum_giga_scn  
QMNC started with pid=20, OS id=21397                          
Thu Nov  3 19:46:36 2011                                       
LOGSTDBY: Validating controlfile with logical metadata         
Thu Nov  3 19:46:36 2011                                       
LOGSTDBY: Validation complete                                  
Thu Nov  3 19:46:51 2011                                       
Completed: alter database open resetlogs

With the database open, what remained was the cleanup. A check showed that the tablespace was in READ ONLY status.
SQL> select TABLESPACE_NAME,STATUS from dba_tablespaces where   TABLESPACE_NAME='WEN';
------------------------------ ---------
WEN                            READ ONLY
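Switching the tablespace to read write produced the following error. Searching online turned up nothing about it, so once again we had to rely on ourselves.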
SQL> alter database datafile 13 online;                                    
Database altered.                                                          
SQL> alter tablespace wen read write;                                      
alter tablespace wen read write                                            
ERROR at line 1:                                                           
ORA-00600: internal error code, arguments: [kcpgucv1], [13], [], [], [], [],
[], []   

create table ts$                                         /* tablespace table */
( ts#           number not null,             /* tablespace identifier number */
  name          varchar2("M_IDEN") not null,           /* name of tablespace */
  owner#        number not null,                      /* owner of tablespace */
  online$       number not null,                      /* status (see KTT.H): */
                                     /* 1 = ONLINE, 2 = OFFLINE, 3 = INVALID */
  contents$     number not null,     /* TEMPORARY/PERMANENT                  */
  undofile#     number,  /* undo_off segment file number (status is OFFLINE) */
  undoblock#    number,               /* undo_off segment header file number */
  blocksize     number not null,                   /* size of block in bytes */
  inc#          number not null,             /* incarnation number of extent */
  scnwrp        number,     /* clean offline scn - zero if not offline clean */
  scnbas        number,              /* scnbas - scn base, scnwrp - scn wrap */
  dflminext     number not null,       /*  default minimum number of extents */
  dflmaxext     number not null,        /* default maximum number of extents */
  dflinit       number not null,              /* default initial extent size */
  dflincr       number not null,                 /* default next extent size */
  dflminlen     number not null,              /* default minimum extent size */
  dflextpct     number not null,     /* default percent extent size increase */
  dflogging     number not null,
      /* lowest bit: default logging attribute: clear=NOLOGGING, set=LOGGING */
                                    /* second lowest bit: force logging mode */
  affstrength   number not null,                        /* Affinity strength */
  bitmapped     number not null,       /* If not bitmapped, 0 else unit size */
                                                                /* in blocks */
  plugged       number not null,                               /* If plugged */
  directallowed number not null,   /* Operation which invalidate standby are */
                                                                  /* allowed */
  flags         number not null,                            /* various flags */
                                         /* 0x01 = system managed allocation */
                                         /* 0x02 = uniform allocation        */
                                /* if above 2 bits not set then user managed */
                                         /* 0x04 = migrated tablespace       */
                                         /* 0x08 = tablespace being migrated */
                                         /* 0x10 = undo tablespace           */
                                     /* 0x20 = auto segment space management */
                       /* if above bit not set then freelist segment managed */
                                                          /* 0x40 = COMPRESS */
                                                      /* 0x80 = ROW MOVEMENT */
                                                              /* 0x100 = SFT */
                                         /* 0x200 = undo retention guarantee */
                                    /* 0x400 = tablespace belongs to a group */
                                  /* 0x800 = this actually describes a group */
  pitrscnwrp    number,                      /* scn wrap when ts was created */
  pitrscnbas    number,                      /* scn base when ts was created */
  ownerinstance varchar("M_IDEN"),                    /* Owner instance name */
  backupowner   varchar("M_IDEN"),             /* Backup owner instance name */
  groupname     varchar("M_IDEN"),                             /* Group name */
  spare1        number,                                  /* plug-in SCN wrap */
  spare2        number,                                  /* plug-in SCN base */
  spare3        varchar2(1000),
  spare4        date

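Since the command did not work, we modified the base table directly. Note that this method must not be used under normal circumstances.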
SQL> update ts$ set ONLINE$=1 where name='WEN';

1 row updated.

SQL> select TABLESPACE_NAME,STATUS from dba_tablespaces where   TABLESPACE_NAME='WEN';

------------------------------ ---------
WEN                            ONLINE

SQL> create table test111 tablespace wen as select * from v$datafile;

Table created.
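At this point it is advisable to restart the database once. If it comes up normally, the problem is solved and the work is done for now. Applause!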
