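Today I received a request for help with a case: an Oracle database would not start, and the owner asked us to recover it. After carefully reading the alert log that came with the request, I found it to be a textbook example of "the more you touch it, the worse it gets."
Below, I walk through the alert log entry by entry to review how this DBA brought the database to a state where it could not be started at all.
The trouble began at 3:30 a.m. on January 11. While archiving, the database unexpectedly found that the header block of one of its control files had been completely zeroed out. This was probably a problem in the storage itself rather than human error.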
Fri Jan 11 03:30:24 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc:
ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)
ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'
Master background archival failure: 227
Fri Jan 11 03:31:24 2013
Hex dump of (file 0, block 1) in trace file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc
Corrupt block relative dba: 0x00000001 (file 0, block 1)
Completely zero block found during control file header read
Fri Jan 11 03:31:24 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc:
ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'
Fri Jan 11 03:31:24 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc:
ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)
ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'
Fri Jan 11 03:30:24 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc:
ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)
ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'
Fri Jan 11 03:30:24 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc:
ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)
ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'
Fri Jan 11 03:30:24 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc:
ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)
ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'
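Reading an alert log entry by entry, as this review does, can be partly mechanized. The following is a minimal sketch, not a complete alert-log parser: it assumes the 10g-style layout seen above (a timestamp line followed by message lines), and the function names `parse_alert_log` and `count_ora_codes` are made up for this illustration. The timestamp pattern tolerates a missing space before the year, since pasted logs often lose it.

```python
import re
from collections import Counter
from datetime import datetime

# Timestamp lines look like "Fri Jan 11 03:30:24 2013". The space before the
# year is optional so that copy-paste damage ("03:30:242013") still parses.
TS_RE = re.compile(r"^(\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) ?(\d{4})$")
# Matches ORA-00227 as well as the short form ORA-227.
ORA_RE = re.compile(r"ORA-0*(\d+)")


def parse_alert_log(text):
    """Split alert log text into (timestamp, [message lines]) entries."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        m = TS_RE.match(line)
        if m:
            ts = datetime.strptime(" ".join(m.groups()), "%a %b %d %H:%M:%S %Y")
            entries.append((ts, []))
        elif line and entries:
            # Message lines belong to the most recent timestamp.
            entries[-1][1].append(line)
    return entries


def count_ora_codes(entries):
    """Count how often each ORA error number appears across all entries."""
    counts = Counter()
    for _, lines in entries:
        for line in lines:
            counts.update(int(code) for code in ORA_RE.findall(line))
    return counts


# A few lines taken from the excerpts in this article.
SAMPLE = """\
Fri Jan 11 03:30:24 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc:
ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)
ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'
Fri Jan 11 04:01:25 2013
ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)
CKPT: terminating instance due to error 227
"""

entries = parse_alert_log(SAMPLE)
counts = count_ora_codes(entries)
first_seen = entries[0][0]
print(first_seen, dict(counts))
```

A count like this makes it obvious at a glance which error dominates a log and when it first appeared, which is usually the right place to start reading.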
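Over the next half hour the database struggled on, reporting the same ORA-00227 error again and again. At 4:01 a.m. the CKPT process also found that it could not update the control file header, and it terminated the instance outright.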
Fri Jan 11 04:01:25 2013
Hex dump of (file 0, block 1) in trace file /oracle/admin/dpdata/bdump/dpdata_ckpt_3007.trc
Corrupt block relative dba: 0x00000001 (file 0, block 1)
Completely zero block found during control file header read
Fri Jan 11 04:01:25 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_ckpt_3007.trc:
ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'
Fri Jan 11 04:01:25 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_ckpt_3007.trc:
ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)
ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'
CKPT: terminating instance due to error 227
Fri Jan 11 04:01:25 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_pmon_2997.trc:
ORA-00227: corrupt block detected in control file: (block , # blocks )
Fri Jan 11 04:01:26 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_psp0_2999.trc:
ORA-00227: corrupt block detected in control file: (block , # blocks )
Instance terminated by CKPT, pid = 3007
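For the next five hours the database lay silent in the machine room. Nobody knew it was down until the DBA came to work in the morning. He found the database inaccessible and tried to restart it.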
Fri Jan 11 09:15:51 2013
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Picked latch-free SCN scheme 3
Autotune of undo retention is turned on.
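The unnoticed outage can be read straight off the timestamps: the instance was terminated at 04:01:26 and the next startup attempt was at 09:15:51. As a rough sketch of how such gaps could be pulled out of an alert log automatically, the snippet below pairs each "Instance terminated" line with the next "Starting ORACLE instance" line. The function name `unnoticed_outages` is hypothetical, and the matching relies only on the message wording shown in the excerpts above.

```python
import re
from datetime import datetime

# Same timestamp layout as the excerpts; a missing space before the year is tolerated.
TS_RE = re.compile(r"^(\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) ?(\d{4})$")


def unnoticed_outages(text):
    """Return (terminated_at, restarted_at, gap) for each crash followed by a startup."""
    outages = []
    current_ts = None     # timestamp of the entry currently being read
    terminated_at = None  # set when an "Instance terminated" line is seen
    for raw in text.splitlines():
        line = raw.strip()
        m = TS_RE.match(line)
        if m:
            current_ts = datetime.strptime(" ".join(m.groups()), "%a %b %d %H:%M:%S %Y")
            continue
        if current_ts is None:
            continue
        # Compare without spaces so that lines with lost spaces still match.
        key = line.replace(" ", "").lower()
        if key.startswith("instanceterminated"):
            terminated_at = current_ts
        elif key.startswith("startingoracleinstance") and terminated_at:
            outages.append((terminated_at, current_ts, current_ts - terminated_at))
            terminated_at = None
    return outages


SAMPLE = """\
Fri Jan 11 04:01:26 2013
Instance terminated by CKPT, pid = 3007
Fri Jan 11 09:15:51 2013
Starting ORACLE instance (normal)
"""

outages = unnoticed_outages(SAMPLE)
for down, up, gap in outages:
    print(f"down {down}, next startup {up}, gap {gap}")
```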
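Naturally the database could not start normally. It could not even reach the mount state, because the header of one control file was corrupted. The alert log states clearly that the header block of control03.ctl could not be read, so the attempt to mount the database failed with ORA-00205.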
Fri Jan 11 09:15:56 2013
ALTER DATABASE MOUNT
Fri Jan 11 09:15:56 2013
ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'
ORA-27047: unable to read the header block of file
Additional information: 2
Fri Jan 11 09:15:59 2013
ORA-205 signalled during: ALTER DATABASE MOUNT...
Fri Jan 11 09:19:31 2013
Starting ORACLE instance (normal)
Fri Jan 11 09:19:43 2013
alter database mount
Fri Jan 11 09:19:43 2013
ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'
ORA-27047: unable to read the header block of file
Additional information: 2
Fri Jan 11 09:19:43 2013
ORA-205 signalled during: alter database mount
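Next, the DBA began shutting the instance down and starting it up again, over and over. This went on for an hour, until 10:28 in the morning. One can imagine what an agonizing hour that was. A word of advice: except in very unusual recovery scenarios, repeatedly stopping and starting the database instance does not help.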
Shutting down instance: further logons disabled
Fri Jan 11 09:23:47 2013
Stopping background process CJQ0
...
Fri Jan 11 09:38:02 2013
Starting ORACLE instance (normal)
...
Fri Jan 11 09:43:00 2013
Shutting down instance: further logons disabled
...
Fri Jan 11 09:43:58 2013
Starting ORACLE instance (normal)
...
Fri Jan 11 09:55:34 2013
ALTER DATABASE CLOSE NORMAL
...
Fri Jan 11 09:56:55 2013
Starting ORACLE instance (normal)
...
Fri Jan 11 10:28:10 2013
ALTER DATABASE CLOSE NORMAL
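A restart loop like this one leaves a recognizable pattern in the alert log: several "Starting ORACLE instance" lines close together. The sketch below flags that pattern. It is only an illustration; the names `startup_times` and `looks_like_restart_loop`, and the default threshold of three startups within one hour, are arbitrary assumptions rather than any Oracle convention.

```python
import re
from datetime import datetime, timedelta

# Same timestamp layout as the excerpts; a missing space before the year is tolerated.
TS_RE = re.compile(r"^(\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) ?(\d{4})$")


def startup_times(text):
    """Collect the timestamp of every 'Starting ORACLE instance' entry."""
    times = []
    current_ts = None
    for raw in text.splitlines():
        line = raw.strip()
        m = TS_RE.match(line)
        if m:
            current_ts = datetime.strptime(" ".join(m.groups()), "%a %b %d %H:%M:%S %Y")
        elif current_ts and line.replace(" ", "").lower().startswith("startingoracleinstance"):
            times.append(current_ts)
    return times


def looks_like_restart_loop(times, threshold=3, window=timedelta(hours=1)):
    """True if at least `threshold` startups fall within any span of length `window`."""
    times = sorted(times)
    for i in range(len(times) - threshold + 1):
        if times[i + threshold - 1] - times[i] <= window:
            return True
    return False


# Startups and shutdowns from the excerpt above.
SAMPLE = """\
Fri Jan 11 09:38:02 2013
Starting ORACLE instance (normal)
Fri Jan 11 09:43:00 2013
Shutting down instance: further logons disabled
Fri Jan 11 09:43:58 2013
Starting ORACLE instance (normal)
Fri Jan 11 09:55:34 2013
ALTER DATABASE CLOSE NORMAL
Fri Jan 11 09:56:55 2013
Starting ORACLE instance (normal)
"""

starts = startup_times(SAMPLE)
loop = looks_like_restart_loop(starts)
print(len(starts), loop)
```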
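Before the next restart at 10:29, the DBA finally realized that a control file might be the problem, so he edited the initialization parameters and removed the failing control03.ctl from control_files. This time the database started and opened normally.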
Fri Jan 11 10:29:20 2013
Starting ORACLE instance (normal)
...
control_files = /data1/oradata/dpdata/control01.ctl,/data1/oradata/dpdata/control02.ctl
...
Fri Jan 11 10:29:37 2013
Completed: ALTER DATABASE OPEN
The DBA also promptly backed up the control file, but it was precisely this operation that led to the nightmarish outcome later on.
Fri Jan 11 10:36:14 2013
alter database backup controlfile to trace
Fri Jan 11 10:36:14 2013
Completed: alter database backup controlfile to trace
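The database then ran smoothly for a few more hours. The calm was broken at 14:16, when the database began reporting ORA-600 errors, and when the CKPT process performed a checkpoint it reported that the file headers of datafiles 10 and 31 could not be read correctly. Unless there is some deeper cause, this is probably the same kind of failure as the unexpected control file corruption in the early morning, that is, a problem in the storage subsystem itself.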
Fri Jan 11 14:16:07 2013
Errors in file /oracle/admin/dpdata/udump/dpdata_ora_22240.trc:
ORA-00600: internal error code, arguments: [6002], [0], [0], [3], [0], [], [], []
Fri Jan 11 14:19:44 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_ckpt_9579.trc:
ORA-01171: datafile 10 going offline due to error advancing checkpoint
ORA-01122: database file 10 failed verification check
ORA-01110: data file 10: '/data2/DECTR_HIS2.dbf'
ORA-01251: Unknown File Header Version read for file number 10
Fri Jan 11 14:19:59 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_ckpt_9579.trc:
ORA-01171: datafile 31 going offline due to error advancing checkpoint
ORA-01122: database file 31 failed verification check
ORA-01110: data file 31: '/data3/ts2_dpcis.dbf'
ORA-01251: Unknown File Header Version read for file number 31
Right after that, one of the application's jobs also began to fail because the datafile could not be accessed.
Fri Jan 11 14:30:19 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_j001_12993.trc:
ORA-12012: error on auto execute of job 88
ORA-00376: file 10 cannot be read at this time
ORA-01110: data file 10: '/data2/DECTR_HIS2.dbf'
ORA-06512: at "DECTR.P_MOVE_CONTS_SHIP", line 77
ORA-06512: at line 1
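The same errors continued for more than forty minutes before the DBA stepped in again. Oddly, he abruptly took both datafiles offline and then, a little over two minutes later, tried to bring them online again. Because the files were damaged, the online attempts naturally hit ORA-1122 and failed.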
Fri Jan 11 15:16:23 2013
alter database datafile '/data3/ts2_dpcis.dbf' offline
Fri Jan 11 15:16:23 2013
Completed: alter database datafile '/data3/ts2_dpcis.dbf' offline
Fri Jan 11 15:17:05 2013
alter database datafile '/data2/DECTR_HIS2.dbf'
offline
Fri Jan 11 15:17:05 2013
Completed: alter database datafile '/data2/DECTR_HIS2.dbf'
offline
Fri Jan 11 15:19:41 2013
alter database datafile '/data3/ts2_dpcis.dbf' online
Fri Jan 11 15:19:41 2013
ORA-1122 signalled during: alter database datafile '/data3/ts2_dpcis.dbf' online...
Fri Jan 11 15:21:10 2013
alter database datafile '/data2/DECTR_HIS2.dbf' online
Fri Jan 11 15:21:10 2013
ORA-1122 signalled during: alter database datafile '/data2/DECTR_HIS2.dbf' online...
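This was only the beginning of the nightmare. Everything that follows is a dangerous maneuver; do not try it lightly.
After hitting ORA-1122, the DBA thought for nine seconds, then abruptly shut the database down again and restarted it. Because only user tablespace datafiles were damaged, and they had already been taken offline, the instance opened without any trouble.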
Fri Jan 11 15:21:19 2013
Shutting down instance: further logons disabled
Fri Jan 11 15:21:19 2013
Stopping background process QMNC
Fri Jan 11 15:21:19 2013
Stopping background process CJQ0
Fri Jan 11 15:21:21 2013
Stopping background process MMNL
Fri Jan 11 15:21:22 2013
Stopping background process MMON
Fri Jan 11 15:21:23 2013
Shutting down instance (immediate)
...
Fri Jan 11 15:22:59 2013
Starting ORACLE instance (normal)
...
Fri Jan 11 15:23:13 2013
Completed: ALTER DATABASE OPEN
The DBA tried once more to bring the datafile online and got the same ORA-1122 error.
Fri Jan 11 15:23:31 2013
alter database datafile '/data3/ts2_dpcis.dbf' online
Fri Jan 11 15:23:31 2013
ORA-1122 signalled during: alter database datafile '/data3/ts2_dpcis.dbf' online...
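After about a minute and a half of thought, the DBA perhaps remembered that he had backed up the control file that morning, and so decided to perform a database recovery. He restarted the instance to the nomount state and issued RECOVER DATABASE USING BACKUP CONTROLFILE. The resulting ORA-1507 error ("database not mounted") means that to recover with a backup control file, the database must first be brought to the mount state using that backup control file.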
Fri Jan 11 15:25:05 2013
Shutting down instance: further logons disabled
Fri Jan 11 15:25:05 2013
Stopping background process QMNC
Fri Jan 11 15:25:05 2013
Stopping background process CJQ0
Fri Jan 11 15:25:07 2013
Stopping background process MMNL
Fri Jan 11 15:25:08 2013
Stopping background process MMON
Fri Jan 11 15:25:09 2013
Shutting down instance (immediate)
...
Fri Jan 11 15:26:32 2013
Starting ORACLE instance (normal)
...
Fri Jan 11 15:26:46 2013
ALTER DATABASE RECOVER database using backup controlfile until cancel
Fri Jan 11 15:26:46 2013
ORA-1507 signalled during: ALTER DATABASE RECOVER database using backup controlfile until cancel ...
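The DBA then brought the database to the mount state and continued the recovery. The ORA errors seen here are all expected. ORA-279 asks for an archived log in order to continue the recovery. ORA-308 reports that the archived log 1_87749_604491553.dbf cannot be opened; from the earlier part of the alert log we know that log sequence 87749 was in fact the current redo log and had not been archived yet, so of course it could not be found. ORA-1547 indicates that the recovery has finished but that OPEN RESETLOGS would still hit an error.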
Fri Jan 11 15:26:56 2013
alter database mount
Fri Jan 11 15:27:00 2013
Setting recovery target incarnation to 2
Fri Jan 11 15:27:00 2013
Successful mount of redo thread 1, with mount id 560899584
Fri Jan 11 15:27:00 2013
Database mounted in Exclusive Mode
Completed: alter database mount
Fri Jan 11 15:27:10 2013
ALTER DATABASE RECOVER database using backup controlfile until cancel
Media Recovery Start
parallel recovery started with 3 processes
ORA-279 signalled during: ALTER DATABASE RECOVER database using backup controlfile until cancel ...
Fri Jan 11 15:27:28 2013
ALTER DATABASE RECOVER CONTINUE DEFAULT
Fri Jan 11 15:27:28 2013
Media Recovery Log /soft/db_arch/1_87749_604491553.dbf
Errors with log /soft/db_arch/1_87749_604491553.dbf
ORA-308 signalled during: ALTER DATABASE RECOVER CONTINUE DEFAULT ...
Fri Jan 11 15:27:28 2013
ALTER DATABASE RECOVER CONTINUE DEFAULT
Fri Jan 11 15:27:28 2013
Media Recovery Log /soft/db_arch/1_87749_604491553.dbf
Errors with log /soft/db_arch/1_87749_604491553.dbf
ORA-308 signalled during: ALTER DATABASE RECOVER CONTINUE DEFAULT ...
Fri Jan 11 15:27:28 2013
ALTER DATABASE RECOVER CANCEL
ORA-1547 signalled during: ALTER DATABASE RECOVER CANCEL ...
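The DBA ignored this error and tried to open the database, which predictably failed with ORA-1589. He then tried to open it with NORESETLOGS, which just as predictably failed with ORA-1588. A database that has gone through incomplete recovery must be opened with RESETLOGS.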
Fri Jan 11 15:29:52 2013
alter database open
Fri Jan 11 15:29:52 2013
ORA-1589 signalled during: alter database open...
Fri Jan 11 15:30:11 2013
alter database open NORESETLOGS
Fri Jan 11 15:30:11 2013
ORA-1588 signalled during: alter database open NORESETLOGS
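Much of the afternoon's trouble came from pressing on without reading what each error was saying. As a small illustration, the sketch below attaches a one-line gloss to alert-log lines that carry one of the ORA codes discussed in this walkthrough. The `ORA_NOTES` table and the `annotate` helper are hypothetical, and the glosses paraphrase the explanations given in this article rather than quoting the official message text.

```python
import re

# Short glosses for the ORA codes discussed in this walkthrough. The wording
# paraphrases the article; it is not the official Oracle message text.
ORA_NOTES = {
    279: "recovery needs the next archived log",
    308: "cannot open the requested archived log",
    1507: "database is not mounted",
    1547: "RECOVER finished, but OPEN RESETLOGS would still fail",
    1588: "must open with RESETLOGS",
    1589: "must specify RESETLOGS or NORESETLOGS when opening",
}

# Matches both ORA-01589 and the short form ORA-1589 used in "signalled during" lines.
ORA_RE = re.compile(r"ORA-0*(\d+)")


def annotate(line):
    """Append a gloss to a log line if it mentions a known ORA code."""
    m = ORA_RE.search(line)
    if not m:
        return line
    note = ORA_NOTES.get(int(m.group(1)))
    return f"{line}    <- {note}" if note else line


for log_line in [
    "ORA-1589 signalled during: alter database open...",
    "ORA-1588 signalled during: alter database open NORESETLOGS",
    "Media Recovery Start",
]:
    print(annotate(log_line))
```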
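After that, the DBA made a difficult decision: he shut down and restarted the database yet again, then tried the same OPEN steps once more. Oracle, of course, returned the same errors, and the database still could not be opened. By this point the database had been out of service for more than an hour.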
Fri Jan 11 15:30:42 2013
Shutting down instance: further logons disabled
Fri Jan 11 15:30:42 2013
Stopping background process CJQ0
Fri Jan 11 15:30:42 2013
Stopping background process MMNL
Fri Jan 11 15:30:43 2013
Stopping background process MMON
Fri Jan 11 15:30:44 2013
Shutting down instance (immediate)
...
Fri Jan 11 15:30:59 2013
Starting ORACLE instance (normal)
...
Fri Jan 11 15:31:08 2013
ALTER DATABASE OPEN
ORA-1589 signalled during: ALTER DATABASE OPEN...
Fri Jan 11 15:31:28 2013
alter database open NORESETLOGS
Fri Jan 11 15:31:28 2013
ORA-1588 signalled during: alter database open NORESETLOGS...
Fri Jan 11 15:31:41 2013
alter database open RESETLOGS
Fri Jan 11 15:31:41 2013
ORA-1122 signalled during: alter database open RESETLOGS...
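What followed was chaos. The DBA restarted the database many times and tried all sorts of recovery steps: offline the datafiles, recover the datafiles, recover the database, online the datafiles, recover again, offline again, open again. Every attempt was in vain. It was not until 18:35 in the evening, after the database had been down for more than four hours, that they asked us to help recover it.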
Fri Jan 11 15:41:28 2013
alter database datafile '/data2/DECTR_HIS2.dbf' offline
Completed: alter database datafile '/data2/DECTR_HIS2.dbf' offline
Fri Jan 11 15:41:35 2013
alter database open
Fri Jan 11 15:41:35 2013
ORA-1589 signalled during: alter database open...
Fri Jan 11 15:42:20 2013
alter database open resetlogs
Fri Jan 11 15:42:20 2013
ORA-1245 signalled during: alter database open resetlogs...
Fri Jan 11 15:43:40 2013
ALTER DATABASE RECOVER datafile '/data3/ts2_dpcis.dbf'
Fri Jan 11 15:43:40 2013
Media Recovery Start
Fri Jan 11 15:43:40 2013
Media Recovery failed with error 1610
ORA-283 signalled during: ALTER DATABASE RECOVER datafile '/data3/ts2_dpcis.dbf' ...
Fri Jan 11 15:46:09 2013
ALTER DATABASE RECOVER datafile 10
Fri Jan 11 15:46:09 2013
Media Recovery Start
Fri Jan 11 15:46:09 2013
Media Recovery failed with error 1610
ORA-283 signalled during: ALTER DATABASE RECOVER datafile 10 ...
...
Fri Jan 11 16:37:51 2013
ALTER DATABASE RECOVER database
Fri Jan 11 16:37:51 2013
Media Recovery Start
Fri Jan 11 16:37:51 2013
Media Recovery failed with error 1610
ORA-283 signalled during: ALTER DATABASE RECOVER database ...
Fri Jan 11 16:39:29 2013
ALTER DATABASE RECOVER database using backup controlfile until cancel
Fri Jan 11 16:39:29 2013
Media Recovery Start
parallel recovery started with 3 processes
ORA-279 signalled during: ALTER DATABASE RECOVER database using backup controlfile until cancel ...
Fri Jan 11 16:39:43 2013
ALTER DATABASE RECOVER CANCEL
Fri Jan 11 16:39:44 2013
ORA-1547 signalled during: ALTER DATABASE RECOVER CANCEL ...
Fri Jan 11 16:39:44 2013
ALTER DATABASE RECOVER CANCEL
ORA-1112 signalled during: ALTER DATABASE RECOVER CANCEL ...
Fri Jan 11 16:40:15 2013
alter database datafile 10 online
Fri Jan 11 16:40:15 2013
Completed: alter database datafile 10 online
Fri Jan 11 16:40:25 2013
alter database datafile 31 online
Completed: alter database datafile 31 online
Fri Jan 11 16:40:47 2013
ALTER DATABASE RECOVER database using backup controlfile until cancel
Fri Jan 11 16:40:47 2013
Media Recovery Start
Fri Jan 11 16:40:47 2013
Media Recovery failed with error 1110
ORA-283 signalled during: ALTER DATABASE RECOVER database using backup controlfile until cancel ...
Fri Jan 11 16:47:12 2013
WARNING: inbound connection timed out (ORA-3136)
Fri Jan 11 17:44:47 2013
ALTER DATABASE RECOVER datafile 10
Fri Jan 11 17:44:47 2013
Media Recovery Start
Fri Jan 11 17:44:47 2013
Media Recovery failed with error 1610
ORA-283 signalled during: ALTER DATABASE RECOVER datafile 10 ...
Fri Jan 11 17:45:19 2013
ALTER DATABASE RECOVER database until cancel using backup controlfile
Fri Jan 11 17:45:19 2013
Media Recovery Start
Fri Jan 11 17:45:19 2013
Media Recovery failed with error 1110
ORA-283 signalled during: ALTER DATABASE RECOVER database until cancel using backup controlfile ...
Fri Jan 11 17:46:39 2013
alter database datafile 10 offline
Fri Jan 11 17:46:40 2013
Completed: alter database datafile 10 offline
Fri Jan 11 17:47:18 2013
ALTER DATABASE RECOVER database until cancel
Fri Jan 11 17:47:18 2013
Media Recovery Start
Fri Jan 11 17:47:18 2013
Media Recovery failed with error 1610
ORA-283 signalled during: ALTER DATABASE RECOVER database until cancel ...
Fri Jan 11 18:11:31 2013
alter database open
Fri Jan 11 18:11:31 2013
ORA-1589 signalled during: alter database open...
Fri Jan 11 18:35:29 2013
Starting ORACLE instance (normal)
Fri Jan 11 18:35:43 2013
alter database open
Fri Jan 11 18:35:43 2013
ORA-1589 signalled during: alter database open...
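This was a database with no backups. If a storage subsystem problem is what damaged the datafiles, then that part probably has little to do with the DBA. But after an afternoon of thrashing, a database that had really only lost two datafiles and could still be opened easily had been "recovered" into a state where it could not readily be opened by any means, and that has a great deal to do with the DBA.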