DBA, if you don't know what you are doing, please don't do it

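Today I received a request for help with a case: an Oracle database would not start, and the owner asked for help recovering it. After carefully reading the alert log that came with the request, I found it to be a textbook example of making things worse with every step.

Below, I walk through the alert log entry by entry to review how this DBA got the database into a state where it could not be started at all.

The failure first appeared at 3:30 a.m. on January 11. While archiving, the database unexpectedly found that the header block of one of the control files had been completely zeroed. This was probably a problem in the storage itself rather than human error.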

Fri Jan 11 03:30:24 2013

Errors in file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc:

ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)

ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'

Master background archival failure: 227

Fri Jan 11 03:31:24 2013

Hex dump of (file 0, block 1) in trace file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc

Corrupt block relative dba: 0x00000001 (file 0, block 1)

Completely zero block found during control file header read

Fri Jan 11 03:31:24 2013

Errors in file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc:

ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'

Fri Jan 11 03:31:24 2013

Errors in file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc:

ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)

ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'

Fri Jan 11 03:30:24 2013

Errors in file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc:

ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)

ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'

Fri Jan 11 03:30:24 2013

Errors in file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc:

ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)

ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'

Fri Jan 11 03:30:24 2013

Errors in file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc:

ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)

ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'
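
The "completely zero block" symptom in the log can be checked by hand on a copy of a suspect file. The following is only a minimal sketch: the function name `is_zero_block` is hypothetical, and the block size and block numbering are assumptions passed in by the caller, since the log does not state the control file block size. It is demonstrated on a throwaway temporary file, not on a real control file.

```python
import os
import tempfile

def is_zero_block(path, block_no, block_size):
    """Return True if block `block_no` is fully present and holds only zero bytes."""
    with open(path, "rb") as f:
        f.seek(block_no * block_size)
        data = f.read(block_size)
    # A short read (block past end of file) compares unequal and yields False.
    return data == bytes(block_size)

# Demo on a throwaway file: block 0 holds data, block 1 has been zeroed.
# The 512-byte block size is chosen only for the demo.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x01" * 512)
    tmp.write(bytes(512))
    demo_path = tmp.name

block0_zero = is_zero_block(demo_path, 0, 512)
block1_zero = is_zero_block(demo_path, 1, 512)
os.remove(demo_path)
print(block0_zero, block1_zero)
```

A check like this only reads the file, so it is safe to run against a copy while deciding what to do next.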

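For the next half hour the database struggled painfully, reporting the same ORA-00227 error over and over. At 4:01 a.m. the CKPT process finally found that it could not update the control file header either, and forcibly terminated the instance.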

Fri Jan 11 04:01:25 2013

Hex dump of (file 0, block 1) in trace file /oracle/admin/dpdata/bdump/dpdata_ckpt_3007.trc

Corrupt block relative dba: 0x00000001 (file 0, block 1)

Completely zero block found during control file header read

Fri Jan 11 04:01:25 2013

Errors in file /oracle/admin/dpdata/bdump/dpdata_ckpt_3007.trc:

ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'

Fri Jan 11 04:01:25 2013

Errors in file /oracle/admin/dpdata/bdump/dpdata_ckpt_3007.trc:

ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)

ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'

CKPT: terminating instance due to error 227

Fri Jan 11 04:01:25 2013

Errors in file /oracle/admin/dpdata/bdump/dpdata_pmon_2997.trc:

ORA-00227: corrupt block detected in control file: (block , # blocks )

Fri Jan 11 04:01:26 2013

Errors in file /oracle/admin/dpdata/bdump/dpdata_psp0_2999.trc:

ORA-00227: corrupt block detected in control file: (block , # blocks )

Instance terminated by CKPT, pid = 3007
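
The length of this window can be read straight off the alert log timestamps. Here is a minimal sketch, assuming the alert log format seen in this case, where each entry starts with a line such as `Fri Jan 11 03:30:24 2013`. The helper names `parse_ts` and `error_window` are hypothetical, and the sample lines are a shortened excerpt of the log above.

```python
from datetime import datetime

# Timestamp layout assumed from the alert log lines in this case.
TS_FORMAT = "%a %b %d %H:%M:%S %Y"

def parse_ts(line):
    """Return a datetime if the line is an alert log timestamp, else None."""
    try:
        return datetime.strptime(line.strip(), TS_FORMAT)
    except ValueError:
        return None

def error_window(lines, code):
    """Return (first, last) timestamps of the entries that mention `code`."""
    current = first = last = None
    for line in lines:
        ts = parse_ts(line)
        if ts is not None:
            current = ts          # remember the timestamp of the current entry
            continue
        if code in line and current is not None:
            if first is None or current < first:
                first = current
            if last is None or current > last:
                last = current
    return first, last

sample = [
    "Fri Jan 11 03:30:24 2013",
    "ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)",
    "Fri Jan 11 04:01:25 2013",
    "ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)",
    "CKPT: terminating instance due to error 227",
]
first, last = error_window(sample, "ORA-00227")
print(last - first)
```

On this excerpt the difference comes out at a little over half an hour, which matches the narrative.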

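For the next five hours the database lay quietly in the machine room, and nobody knew it was down until the DBA came to work in the morning. He found that the database could not be reached and tried to restart it.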

Fri Jan 11 09:15:51 2013

Starting ORACLE instance (normal)

LICENSE_MAX_SESSION = 0

LICENSE_SESSIONS_WARNING = 0

Picked latch-free SCN scheme 3

Autotune of undo retention is turned on.

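Naturally the database could not start normally. It could not even reach the mount state, because the header of one control file was damaged. The alert log states plainly that the header block of control03.ctl could not be read, which is why the attempt to mount the database failed with ORA-00205.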

Fri Jan 11 09:15:56 2013

ALTER DATABASE   MOUNT

Fri Jan 11 09:15:56 2013

ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'

ORA-27047: unable to read the header block of file

Additional information: 2

Fri Jan 11 09:15:59 2013

ORA-205 signalled during: ALTER DATABASE   MOUNT...

Fri Jan 11 09:19:31 2013

Starting ORACLE instance (normal)

Fri Jan 11 09:19:43 2013

alter database mount

Fri Jan 11 09:19:43 2013

ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'

ORA-27047: unable to read the header block of file

Additional information: 2

Fri Jan 11 09:19:43 2013

ORA-205 signalled during: alter database mount

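Next, the DBA began shutting the instance down and starting it up again, over and over. This went on for an hour, until 10:28 a.m.; one can imagine what an agonizing hour it was. A word of advice: except in very unusual recovery cases, repeatedly stopping and starting a database instance does not help.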

Shutting down instance: further logons disabled

Fri Jan 11 09:23:47 2013

Stopping background process CJQ0

……

Fri Jan 11 09:38:02 2013

Starting ORACLE instance (normal)

……

Fri Jan 11 09:43:00 2013

Shutting down instance: further logons disabled

……

Fri Jan 11 09:43:58 2013

Starting ORACLE instance (normal)

……

Fri Jan 11 09:55:34 2013

ALTER DATABASE CLOSE NORMAL

……

Fri Jan 11 09:56:55 2013

Starting ORACLE instance (normal)

……

Fri Jan 11 10:28:10 2013

ALTER DATABASE CLOSE NORMAL
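
The advice about restarts can be expressed as a simple check on the alert log: if two consecutive startups end in exactly the same errors, another restart is unlikely to change anything. What follows is a rough sketch rather than a monitoring tool. The function names are hypothetical, and the sample lines are taken from the two failed mount attempts shown earlier.

```python
import re

def startup_outcomes(lines):
    """Group the ORA codes 'signalled during' a statement by the startup they follow."""
    outcomes = []
    for line in lines:
        if line.startswith("Starting ORACLE instance"):
            outcomes.append([])            # a new startup attempt begins
        elif outcomes:
            m = re.match(r"ORA-(\d+)\s*signalled during", line)
            if m:
                outcomes[-1].append(int(m.group(1)))
    return outcomes

def restarts_are_futile(outcomes):
    """True when two consecutive startups failed with identical error lists."""
    return any(bool(a) and a == b for a, b in zip(outcomes, outcomes[1:]))

sample = [
    "Starting ORACLE instance (normal)",
    "ALTER DATABASE   MOUNT",
    "ORA-205 signalled during: ALTER DATABASE   MOUNT...",
    "Starting ORACLE instance (normal)",
    "alter database mount",
    "ORA-205 signalled during: alter database mount",
]
outcomes = startup_outcomes(sample)
print(outcomes, restarts_are_futile(outcomes))
```

Once the check returns true, the time is better spent reading the error text than issuing another shutdown and startup.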

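Before the next restart at 10:29, the DBA finally realized that the control file might be the problem. He changed the initialization parameters and removed the failing control03.ctl from control_files. This time the database started normally.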

Fri Jan 11 10:29:20 2013

Starting ORACLE instance (normal)

……

control_files            = /data1/oradata/dpdata/control01.ctl,/data1/oradata/dpdata/control02.ctl

……

Fri Jan 11 10:29:37 2013

Completed: ALTER DATABASE OPEN

The DBA also quickly backed up the control file, but it was exactly this operation that led to the nightmare that followed.

Fri Jan 11 10:36:14 2013

alter database backup controlfile to trace

Fri Jan 11 10:36:14 2013

Completed: alter database backup controlfile to trace

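The database then ran smoothly for the rest of the morning. The calm was broken at 14:16, when the database started reporting ORA-600 errors, and when the CKPT process performed a checkpoint it reported that the headers of data files 10 and 31 could not be read correctly. Unless there is some deeper cause, this is probably the same kind of failure as the unexpected control file corruption in the early morning, perhaps a problem in the storage subsystem itself.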

Fri Jan 11 14:16:07 2013

Errors in file /oracle/admin/dpdata/udump/dpdata_ora_22240.trc:

ORA-00600: internal error code, arguments: [6002], [0], [0], [3], [0], [], [], []

Fri Jan 11 14:19:44 2013

Errors in file /oracle/admin/dpdata/bdump/dpdata_ckpt_9579.trc:

ORA-01171: datafile 10 going offline due to error advancing checkpoint

ORA-01122: database file 10 failed verification check

ORA-01110: data file 10: '/data2/DECTR_HIS2.dbf'

ORA-01251: Unknown File Header Version read for file number 10

Fri Jan 11 14:19:59 2013

Errors in file /oracle/admin/dpdata/bdump/dpdata_ckpt_9579.trc:

ORA-01171: datafile 31 going offline due to error advancing checkpoint

ORA-01122: database file 31 failed verification check

ORA-01110: data file 31: '/data3/ts2_dpcis.dbf'

ORA-01251: Unknown File Header Version read for file number 31

Right after that, one of the application's jobs also began to fail because the data file could not be accessed.

Fri Jan 11 14:30:19 2013

Errors in file /oracle/admin/dpdata/bdump/dpdata_j001_12993.trc:

ORA-12012: error on auto execute of job 88

ORA-00376: file 10 cannot be read at this time

ORA-01110: data file 10: '/data2/DECTR_HIS2.dbf'

ORA-06512: at "DECTR.P_MOVE_CONTS_SHIP", line 77

ORA-06512: at line 1

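The same errors continued for more than forty minutes before the DBA stepped in again. Strangely, he went straight ahead and took the two data files offline, then a little over two minutes later tried to bring them online again. Because the files were damaged, the online attempts naturally hit ORA-1122 and failed.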

Fri Jan 11 15:16:23 2013

alter database datafile '/data3/ts2_dpcis.dbf' offline

Fri Jan 11 15:16:23 2013

Completed: alter database datafile '/data3/ts2_dpcis.dbf' offline

Fri Jan 11 15:17:05 2013

alter database datafile  '/data2/DECTR_HIS2.dbf'

offline

Fri Jan 11 15:17:05 2013

Completed: alter database datafile  '/data2/DECTR_HIS2.dbf'

offline

Fri Jan 11 15:19:41 2013

alter database datafile '/data3/ts2_dpcis.dbf' online

Fri Jan 11 15:19:41 2013

ORA-1122 signalled during: alter database datafile '/data3/ts2_dpcis.dbf' online...

Fri Jan 11 15:21:10 2013

alter database datafile  '/data2/DECTR_HIS2.dbf' online

Fri Jan 11 15:21:10 2013

ORA-1122 signalled during: alter database datafile '/data2/DECTR_HIS2.dbf' online...

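This was only the beginning of the nightmare. Everything that follows is dangerous; do not try it lightly.

After hitting ORA-1122, the DBA thought for 9 seconds and then once again abruptly shut down the database and restarted it. Because only user tablespace data files were damaged, and they had already been taken offline, the instance opened without any trouble.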

Fri Jan 11 15:21:19 2013

Shutting down instance: further logons disabled

Fri Jan 11 15:21:19 2013

Stopping background process QMNC

Fri Jan 11 15:21:19 2013

Stopping background process CJQ0

Fri Jan 11 15:21:21 2013

Stopping background process MMNL

Fri Jan 11 15:21:22 2013

Stopping background process MMON

Fri Jan 11 15:21:23 2013

Shutting down instance (immediate)

……

Fri Jan 11 15:22:59 2013

Starting ORACLE instance (normal)

……

Fri Jan 11 15:23:13 2013

Completed: ALTER DATABASE OPEN

The DBA tried again to bring the data file online and got the same ORA-1122 error.

Fri Jan 11 15:23:31 2013

alter database datafile '/data3/ts2_dpcis.dbf' online

Fri Jan 11 15:23:31 2013

ORA-1122 signalled during: alter database datafile '/data3/ts2_dpcis.dbf' online...

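After thinking for about a minute and a half, the DBA perhaps remembered the control file backup taken that morning and decided to recover the database. He restarted the database to the nomount state and ran RECOVER DATABASE USING BACKUP CONTROLFILE. The ORA-1507 error (database not mounted) means that to recover a database with a backup control file, the database must first be brought to the mount state using that backup control file.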

Fri Jan 11 15:25:05 2013

Shutting down instance: further logons disabled

Fri Jan 11 15:25:05 2013

Stopping background process QMNC

Fri Jan 11 15:25:05 2013

Stopping background process CJQ0

Fri Jan 11 15:25:07 2013

Stopping background process MMNL

Fri Jan 11 15:25:08 2013

Stopping background process MMON

Fri Jan 11 15:25:09 2013

Shutting down instance (immediate)

……

Fri Jan 11 15:26:32 2013

Starting ORACLE instance (normal)

……

Fri Jan 11 15:26:46 2013

ALTER DATABASE RECOVER  database using backup controlfile until cancel

Fri Jan 11 15:26:46 2013

ORA-1507 signalled during: ALTER DATABASE RECOVER database using backup controlfile   until cancel  ...

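The DBA then brought the database to the mount state and continued the recovery. The ORA errors that appear here are all normal. ORA-279 indicates that an archived log is needed to complete the recovery. ORA-308 indicates that the archived log 1_87749_604491553.dbf could not be opened; from the earlier alert log entries we know that redo log 87749 was actually the current log and had not yet been archived, so naturally it could not be found. ORA-1547 indicates that recovery has finished, but that OPEN RESETLOGS would still hit an error.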

Fri Jan 11 15:26:56 2013

alter database mount

Fri Jan 11 15:27:00 2013

Setting recovery target incarnation to 2

Fri Jan 11 15:27:00 2013

Successful mount of redo thread 1, with mount id 560899584

Fri Jan 11 15:27:00 2013

Database mounted in Exclusive Mode

Completed: alter database mount

Fri Jan 11 15:27:10 2013

ALTER DATABASE RECOVER  database using backup controlfile until cancel

Media Recovery Start

 parallel recovery started with 3 processes

ORA-279 signalled during: ALTER DATABASE RECOVER database using backup controlfile until cancel  ...

Fri Jan 11 15:27:28 2013

ALTER DATABASE RECOVER    CONTINUE DEFAULT

Fri Jan 11 15:27:28 2013

Media Recovery Log /soft/db_arch/1_87749_604491553.dbf

Errors with log /soft/db_arch/1_87749_604491553.dbf

ORA-308 signalled during: ALTER DATABASE RECOVER   CONTINUE DEFAULT  ...

Fri Jan 11 15:27:28 2013

ALTER DATABASE RECOVER    CONTINUE DEFAULT

Fri Jan 11 15:27:28 2013

Media Recovery Log /soft/db_arch/1_87749_604491553.dbf

Errors with log /soft/db_arch/1_87749_604491553.dbf

ORA-308 signalled during: ALTER DATABASE RECOVER   CONTINUE DEFAULT  ...

Fri Jan 11 15:27:28 2013

ALTER DATABASE RECOVER CANCEL

ORA-1547 signalled during: ALTER DATABASE RECOVER CANCEL ...

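The DBA ignored this error and tried to open the database, which obviously hit ORA-1589. He then tried to open the database with NORESETLOGS, which just as obviously hit ORA-1588. A database that has undergone incomplete recovery must be opened with RESETLOGS.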

Fri Jan 11 15:29:52 2013

alter database open

Fri Jan 11 15:29:52 2013

ORA-1589 signalled during: alter database open...

Fri Jan 11 15:30:11 2013

alter database open NORESETLOGS

Fri Jan 11 15:30:11 2013

ORA-1588 signalled during: alter database open NORESETLOGS

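After that, the DBA made a difficult decision: he shut down and restarted the database once more, then tried the same OPEN steps again. Oracle, of course, returned the same errors, and the database still could not be opened. By this point the database had been unable to provide service for more than an hour.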

Fri Jan 11 15:30:42 2013

Shutting down instance: further logons disabled

Fri Jan 11 15:30:42 2013

Stopping background process CJQ0

Fri Jan 11 15:30:42 2013

Stopping background process MMNL

Fri Jan 11 15:30:43 2013

Stopping background process MMON

Fri Jan 11 15:30:44 2013

Shutting down instance (immediate)

……

Fri Jan 11 15:30:59 2013

Starting ORACLE instance (normal)

……

Fri Jan 11 15:31:08 2013

ALTER DATABASE OPEN

ORA-1589 signalled during: ALTER DATABASE OPEN...

Fri Jan 11 15:31:28 2013

alter database open NORESETLOGS

Fri Jan 11 15:31:28 2013

ORA-1588 signalled during: alter database open NORESETLOGS...

Fri Jan 11 15:31:41 2013

alter database open RESETLOGS

Fri Jan 11 15:31:41 2013

ORA-1122 signalled during: alter database open RESETLOGS...
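
The error codes in this stretch of the log each say something specific, and reading them before retrying would have saved time. As a small illustration, the sketch below pairs each "signalled during" line with a one-line note. The notes are paraphrases of the explanations in this walkthrough and of the standard Oracle message texts as I understand them, not the exact message wording, and the names `ORA_NOTES` and `annotate` are hypothetical.

```python
import re

# Paraphrased meanings; not Oracle's exact message text.
ORA_NOTES = {
    279: "recovery needs another archived log",
    308: "the requested archived log could not be opened",
    1547: "recovery stopped, but OPEN RESETLOGS would still fail",
    1589: "open must specify RESETLOGS or NORESETLOGS",
    1588: "open must use RESETLOGS",
    1122: "a data file failed its verification check",
}

def annotate(lines):
    """Return (code, statement, note) for every 'ORA-n signalled during' line."""
    notes = []
    for line in lines:
        m = re.match(r"ORA-(\d+)\s*signalled during: (.*)", line)
        if m:
            code = int(m.group(1))
            note = ORA_NOTES.get(code, "not covered in this sketch")
            notes.append((code, m.group(2).strip(), note))
    return notes

sample = [
    "alter database open",
    "ORA-1589 signalled during: alter database open...",
    "ORA-1588 signalled during: alter database open NORESETLOGS...",
    "ORA-1122 signalled during: alter database open RESETLOGS...",
]
for code, stmt, note in annotate(sample):
    print(f"ORA-{code:05d} ({stmt}): {note}")
```

Read in order, the three notes already tell the story: the open needs RESETLOGS, and RESETLOGS in turn fails on a data file that does not pass verification.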

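What came next was chaos. The DBA restarted the database many times and tried all kinds of recovery methods: taking data files offline, recovering data files, recovering the database, bringing data files online, recovering again, taking files offline again, opening again. Every attempt was in vain. Finally, at 18:35, after the database had been down for more than four hours, he asked us to help recover it.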

Fri Jan 11 15:41:28 2013

alter database datafile '/data2/DECTR_HIS2.dbf' offline

Completed: alter database datafile '/data2/DECTR_HIS2.dbf' offline

Fri Jan 11 15:41:35 2013

alter database open

Fri Jan 11 15:41:35 2013

ORA-1589 signalled during: alter database open...

Fri Jan 11 15:42:20 2013

alter database  open resetlogs

Fri Jan 11 15:42:20 2013

ORA-1245 signalled during: alter database  open resetlogs...

Fri Jan 11 15:43:40 2013

ALTER DATABASE RECOVER  datafile '/data3/ts2_dpcis.dbf'

Fri Jan 11 15:43:40 2013

Media Recovery Start

Fri Jan 11 15:43:40 2013

Media Recovery failed with error 1610

ORA-283 signalled during: ALTER DATABASE RECOVER datafile '/data3/ts2_dpcis.dbf'  ...

Fri Jan 11 15:46:09 2013

ALTER DATABASE RECOVER  datafile 10

Fri Jan 11 15:46:09 2013

Media Recovery Start

Fri Jan 11 15:46:09 2013

Media Recovery failed with error 1610

ORA-283 signalled during: ALTER DATABASE RECOVER datafile 10  ...

……

Fri Jan 11 16:37:51 2013

ALTER DATABASE RECOVER  database

Fri Jan 11 16:37:51 2013

Media Recovery Start

Fri Jan 11 16:37:51 2013

Media Recovery failed with error 1610

ORA-283 signalled during: ALTER DATABASE RECOVER database  ...

Fri Jan 11 16:39:29 2013

ALTER DATABASE RECOVER  database using backup controlfile until cancel

Fri Jan 11 16:39:29 2013

Media Recovery Start

 parallel recovery started with 3 processes

ORA-279 signalled during: ALTER DATABASE RECOVER  database using backup controlfile until cancel  ...

Fri Jan 11 16:39:43 2013

ALTER DATABASE RECOVER    CANCEL

Fri Jan 11 16:39:44 2013

ORA-1547 signalled during: ALTER DATABASE RECOVER   CANCEL  ...

Fri Jan 11 16:39:44 2013

ALTER DATABASE RECOVER CANCEL

ORA-1112 signalled during: ALTER DATABASE RECOVER CANCEL ...

Fri Jan 11 16:40:15 2013

alter database datafile 10 online

Fri Jan 11 16:40:15 2013

Completed: alter database datafile 10 online

Fri Jan 11 16:40:25 2013

alter database datafile 31 online

Completed: alter database datafile 31 online

Fri Jan 11 16:40:47 2013

ALTER DATABASE RECOVER  database using backup controlfile until cancel

Fri Jan 11 16:40:47 2013

Media Recovery Start

Fri Jan 11 16:40:47 2013

Media Recovery failed with error 1110

ORA-283 signalled during: ALTER DATABASE RECOVER database using backup controlfile until cancel  ...

Fri Jan 11 16:47:12 2013

WARNING: inbound connection timed out (ORA-3136)

Fri Jan 11 17:44:47 2013

ALTER DATABASE RECOVER  datafile 10

Fri Jan 11 17:44:47 2013

Media Recovery Start

Fri Jan 11 17:44:47 2013

Media Recovery failed with error 1610

ORA-283 signalled during: ALTER DATABASE RECOVER datafile 10  ...

Fri Jan 11 17:45:19 2013

ALTER DATABASE RECOVER  database until cancel using backup controlfile

Fri Jan 11 17:45:19 2013

Media Recovery Start

Fri Jan 11 17:45:19 2013

Media Recovery failed with error 1110

ORA-283 signalled during: ALTER DATABASE RECOVER database until cancel using backup controlfile  ...

Fri Jan 11 17:46:39 2013

alter database datafile 10 offline

Fri Jan 11 17:46:40 2013

Completed: alter database datafile 10 offline

Fri Jan 11 17:47:18 2013

ALTER DATABASE RECOVER  database until cancel

Fri Jan 11 17:47:18 2013

Media Recovery Start

Fri Jan 11 17:47:18 2013

Media Recovery failed with error 1610

ORA-283 signalled during: ALTER DATABASE RECOVER database until cancel  ...

Fri Jan 11 18:11:31 2013

alter database open

Fri Jan 11 18:11:31 2013

ORA-1589 signalled during: alter database open...

Fri Jan 11 18:35:29 2013

Starting ORACLE instance (normal)

Fri Jan 11 18:35:43 2013

alter database open

Fri Jan 11 18:35:43 2013

ORA-1589 signalled during: alter database open...

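This was a database with no backups. If a storage subsystem problem really caused the data file corruption, the DBA may have had little to do with that. But after an afternoon of thrashing, a database that in fact had only two damaged data files and could easily have been opened was "recovered" into a state where it could not easily be opened by any means, and that has a great deal to do with the DBA.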

 

