故障场景描述:这是我之前服务过的一家客户,是一套跑在Linux机器上的11gR2单实例数据库,由于服务器异常断电导致的故障!印象较为深刻的是就在2011年2月14日情人节那天,这个库也曾出现过另外一则ORA-00600的错误。较为不幸的是,就在今天又遭遇到这个ORA-600的错误!
下面记录一下这次故障的处理过程:
1 启库时遭遇ORA-600的错误:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
|
[oracle@os5 ~]$ uname -rm
2.6.18-128.el5PAE i686
[oracle@os5 ~]$ ps -ef|grep asm_
oracle 9099 9068 0 10:54 pts/8 00:00:00 grep asm_
[oracle@os5 ~]$ ps -ef|grep ora_
oracle 4797 1 0 10:52 ? 00:00:00 ora_pmon_database
oracle 4799 1 0 10:52 ? 00:00:00 ora_vktm_database
oracle 4803 1 0 10:52 ? 00:00:00 ora_gen0_database
oracle 4805 1 0 10:52 ? 00:00:00 ora_diag_database
oracle 4807 1 0 10:52 ? 00:00:00 ora_dbrm_database
oracle 4809 1 0 10:52 ? 00:00:00 ora_psp0_database
oracle 4811 1 0 10:52 ? 00:00:01 ora_dia0_database
oracle 4813 1 0 10:52 ? 00:00:00 ora_mman_database
oracle 4815 1 0 10:52 ? 00:00:00 ora_dbw0_database
oracle 4817 1 0 10:52 ? 00:00:00 ora_lgwr_database
oracle 4819 1 0 10:52 ? 00:00:00 ora_ckpt_database
oracle 4821 1 0 10:52 ? 00:00:00 ora_smon_database
oracle 4823 1 0 10:52 ? 00:00:00 ora_reco_database
oracle 4825 1 0 10:52 ? 00:00:00 ora_mmon_database
oracle 4827 1 0 10:52 ? 00:00:00 ora_mmnl_database
oracle 4829 1 0 10:52 ? 00:00:00 ora_d000_database
oracle 4831 1 0 10:52 ? 00:00:00 ora_s000_database
oracle 9105 9068 0 11:55 pts/8 00:00:00 grep ora_
[oracle@os5 ~]$ sqlplus /
as
sysdba
SQL*Plus: Release 11.2.0.1.0 Production
on
Wed Nov 21 10:54:16 2012
Copyright (c) 1982, 2009, Oracle.
All
rights reserved.
Connected
to
:
Oracle
Database
11g Enterprise Edition Release 11.2.0.1.0 - Production
With
the Partitioning, OLAP, Data Mining
and
Real
Application Testing options
SQL>
alter
database
open
;
alter
database
open
*
ERROR
at
line 1:
ORA-00600: internal error code, arguments: [kcratr_nab_less_than_odr], [1],
[3464], [12432], [12534], [], [], [], [], [], [], []
SQL>
select
open_mode
from
v$
database
;
OPEN_MODE
--------------------
MOUNTED
SQL>
|
从上可以看到,当前数据库停留在MOUNT状态,没法OPEN,然后,就用RMAN重新做了个全备,并手工拷贝了联机日志文件,保留现场!
2 接下来,重新尝试以RESETLOGS方式也打不开,尝试RECOVER依然报错!
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
|
SQL>
alter
database
open
resetlogs;
alter
database
open
resetlogs
*
ERROR
at
line 1:
ORA-01139: RESETLOGS
option
only
valid
after
an incomplete
database
recovery
SQL> recover
database
;
ORA-00283: recovery session canceled due
to
errors
ORA-00264:
no
recovery required
SQL>
alter
database
open
;
alter
database
open
*
ERROR
at
line 1:
ORA-00600: internal error code, arguments: [kcratr_nab_less_than_odr], [1], [3464], [12432], [12534], [], [], [], [], [], [], []
SQL>
|
3 那么接下来,看看ALERT日志都记录哪些有用的信息:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
|
Wed Nov 21 10:59:16 2012
alter
database
open
Beginning crash recovery
of
1 threads
parallel recovery started
with
2 processes
Started redo scan
Completed redo scan
read
1642 KB redo, 0 data blocks need recovery
Errors
in
file /oracle/ora11gR2/diag/rdbms/
database
/
database
/trace/database_ora_4848.trc (incident=129790):
ORA-00600: internal error code, arguments: [kcratr_nab_less_than_odr], [1], [3464], [12432], [12534], [], [], [], [], [], [], []
Incident details
in
: /oracle/ora11gR2/diag/rdbms/
database
/
database
/incident/incdir_129790/database_ora_4848_i129790.trc
Aborting crash recovery due
to
error 600
Errors
in
file /oracle/ora11gR2/diag/rdbms/
database
/
database
/trace/database_ora_4848.trc:
ORA-00600: internal error code, arguments: [kcratr_nab_less_than_odr], [1], [3464], [12432], [12534], [], [], [], [], [], [], []
Errors
in
file /oracle/ora11gR2/diag/rdbms/
database
/
database
/trace/database_ora_4848.trc:
ORA-00600: internal error code, arguments: [kcratr_nab_less_than_odr], [1], [3464], [12432], [12534], [], [], [], [], [], [], []
ORA-600 signalled during:
alter
database
open
...
Wed Nov 21 10:59:17 2012
Sweep [inc][129790]: completed
Sweep [inc2][129790]: completed
Wed Nov 21 10:59:17 2012
Trace dumping
is
performing id=[cdmp_20121121105917]
Wed Nov 21 11:07:50 2012
db_recovery_file_dest_size
of
3852 MB
is
0.00% used. This
is
a
user
-specified limit
on
the amount
of
space
that will be used
by
this
database
for
recovery-related files,
and
does
not
reflect the amount
of
space
available
in
the underlying filesystem
or
ASM diskgroup.
|
从alert里,一直看到ORA-00600: internal error code, arguments: [kcratr_nab_less_than_odr], [1], [3464], [12432], [12534], [], [], [], [], [], [], []的错误,还有一些提示就是Errors in file /oracle/ora11gR2/diag/rdbms/database/database/trace/database_ora_4848.trc。
4 就去看看上述的TRACE文件:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
|
*** 2012-11-21 10:59:16.123
Successfully allocated 2 recovery slaves
Using 66 overflow buffers per recovery slave
Thread 1
checkpoint
: logseq 3464, block 4945, scn 203336710
cache-low rba: logseq 3464, block 9148
on
-disk rba: logseq 3464, block 12534, scn 203340512
start recovery
at
logseq 3464, block 9148, scn 0
*** 2012-11-21 10:59:16.144
Started writing zeroblks thread 1 seq 3464 blocks 12432-12439
*** 2012-11-21 10:59:16.144
Completed writing zeroblks thread 1 seq 3464
==== Redo
read
statistics
for
thread 1 ====
Total physical reads (
from
disk
and
memory): 4096Kb
-- Redo read_disk statistics --
Read
rate (ASYNC): 1642Kb
in
0.02s => 80.18 Mb/sec
Longest record: 14Kb, moves: 0/1953 (0%)
Change moves: 22/192 (11%), moved: 0Mb
Longest LWN: 445Kb, moves: 0/733 (0%), moved: 0Mb
Last
redo scn: 0x0000.0c1eba6c (203340396)
----------------------------------------------
----- Recovery Hash Table Statistics ---------
Hash
table
buckets = 32768
Longest hash chain = 0
Average hash chain = 0/0 = 0.0
Max
compares per lookup = 0
Avg
compares per lookup = 0/0 = 0.0
----------------------------------------------
WARNING! Crash recovery
of
thread 1 seq 3464
is
ending
at
redo block 12432 but should
not
have ended before
redo block 12534
Incident 129790 created, dump file: /oracle/ora11gR2/diag/rdbms/
database
/
database
/incident/incdir_129790/database_ora_4848_i129790.trc
ORA-00600: internal error code, arguments: [kcratr_nab_less_than_odr], [1], [3464], [12432], [12534], [], [], [], [], [], [], []
ORA-00600: internal error code, arguments: [kcratr_nab_less_than_odr], [1], [3464], [12432], [12534], [], [], [], [], [], [], []
ORA-00600: internal error code, arguments: [kcratr_nab_less_than_odr], [1], [3464], [12432], [12534], [], [], [], [], [], [], []
|
5 结合ALERT里的错误ORA-00600: internal error code, arguments: [kcratr_nab_less_than_odr], [1], [3464], [12432], [12534], 和TRACE里的错误提示WARNING! Crash recovery of thread 1 seq 3464 is ending at redo block 12432 but should not have ended before redo block 12534 以及查询MetaLink文档ID 1299564.1获取的指导性信息,应该是由于服务器异常断电,导致LGWR写联机日志文件时失败,下次重新启动数据库时,需要做实例级恢复,而又无法从联机日志文件里获取到这些redo信息,因为上次断电时,写日志失败了。
那么ORA-00600的错误里,那几个参数 [1], [3464], [12432], [12534]又表示什么呢? 从EYGLE的网站上查询到类似的案例信息,结合本故障场景分析,原来是实例需要恢复日志序列号为3464的联机日志文件,需要恢复到编号为12534的日志块,而实际上只能恢复到第12432个日志块儿,所以库就启不来了。不过,从当前日志文件信息,可以看到,当前日志组的确是3464:
1
2
3
4
5
6
7
8
9
10
11
12
13
|
SQL>
alter
session
set
nls_date_format=
'yyyy/mm/dd hh24:mi:ss'
;
Session altered.
SQL>
select
*
from
v$log;
GROUP
# THREAD#
SEQUENCE
# BYTES BLOCKSIZE MEMBERS ARC STATUS FIRST_CHANGE# FIRST_TIME NEXT_CHANGE# NEXT_TIME
---------- ---------- ---------- ---------- ---------- ---------- --- ---------------- ------------- ------------------- ------------ -------------------
1 1 3463 52428800 512 1 YES INACTIVE 203285629 2012/11/21 03:00:32 203335249 2012/11/21 08:59:46
3 1 3462 52428800 512 1 YES INACTIVE 203238520 2012/11/20 21:00:52 203285629 2012/11/21 03:00:32
2 1 3464 52428800 512 1
NO
CURRENT
203335249 2012/11/21 08:59:46 2.8147E+14
SQL>
|
6 参照MetaLink文档ID 1299564.1的方案来恢复数据库:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
|
SQL> recover
database
until cancel using backup controlfile;
ORA-00279: change 203360397 generated
at
11/21/2012 09:21:51 needed
for
thread 1
ORA-00289: suggestion : /home/oracle/archive_no_delete/ARC_743097220_0000003464_1.arc
ORA-00280: change 203360397
for
thread 1
is
in
sequence
#3464
Specify log: {
/oracle/ora11gR2/oradata/
database
/redo02.log
Log applied.
Media recovery complete.
SQL>
alter
database
open
;
alter
database
open
*
ERROR
at
line 1:
ORA-01589: must use RESETLOGS
or
NORESETLOGS
option
for
database
open
SQL>
alter
database
open
resetlogs;
Database
altered.
SQL>
|
至此,这个库是成功恢复,并且启动了。恢复之后,再次对数据库做了备份。看到Alert日志信息:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
|
Wed Nov 21 12:43:44 2012
ALTER
DATABASE
RECOVER LOGFILE
'/oracle/ora11gR2/oradata/database/redo02.log'
Media Recovery Log /oracle/ora11gR2/oradata/
database
/redo02.log
Wed Nov 21 12:43:44 2012
Incomplete recovery applied
all
redo ever generated.
Recovery completed through change 203360398
time
11/21/2012 09:21:51
Media Recovery Complete (
database
)
Completed:
ALTER
DATABASE
RECOVER LOGFILE
'/oracle/ora11gR2/oradata/database/redo02.log'
Wed Nov 21 12:43:55 2012
alter
database
open
Errors
in
file /oracle/ora11gR2/diag/rdbms/
database
/
database
/trace/database_ora_10698.trc:
ORA-01589: must use RESETLOGS
or
NORESETLOGS
option
for
database
open
ORA-1589 signalled during:
alter
database
open
...
Wed Nov 21 12:44:37 2012
alter
database
open
resetlogs
Archived Log entry 3382 added
for
thread 1
sequence
3463 ID 0xd9842c95 dest 1:
Archived Log entry 3383 added
for
thread 1
sequence
3464 ID 0xd9842c95 dest 1:
Archived Log entry 3384 added
for
thread 1
sequence
3462 ID 0xd9842c95 dest 1:
RESETLOGS
after
complete recovery through change 203360398
Resetting resetlogs activation ID 3649318037 (0xd9842c95)
Wed Nov 21 12:44:43 2012
Setting recovery target incarnation
to
4
Wed Nov 21 12:44:43 2012
Assigning activation ID 3706166088 (0xdce79b48)
LGWR: STARTING ARCH PROCESSES
Wed Nov 21 12:44:43 2012
ARC0 started
with
pid=24, OS id=11236
ARC0: Archival started
LGWR: STARTING ARCH PROCESSES COMPLETE
ARC0: STARTING ARCH PROCESSES
Wed Nov 21 12:44:44 2012
ARC1 started
with
pid=25, OS id=11238
Wed Nov 21 12:44:44 2012
ARC2 started
with
pid=26, OS id=11240
Thread 1 opened
at
log
sequence
1
Current
log# 1 seq# 1 mem# 0: /oracle/ora11gR2/oradata/
database
/redo01.log
Successful
open
of
redo thread 1
MTTR advisory
is
disabled because FAST_START_MTTR_TARGET
is
not
set
Wed Nov 21 12:44:44 2012
SMON: enabling cache recovery
ARC1: Archival started
ARC2: Archival started
Wed Nov 21 12:44:44 2012
ARC3 started
with
pid=29, OS id=11242
ARC1: Becoming the
'no FAL'
ARCH
ARC1: Becoming the
'no SRL'
ARCH
ARC2: Becoming the heartbeat ARCH
Successfully onlined Undo Tablespace 2.
Dictionary
check
beginning
Dictionary
check
complete
Verifying file header compatibility
for
11g tablespace encryption..
Verifying 11g file header compatibility
for
tablespace encryption completed
|
从Metalink上看到,目前这个ORA-600的错误应该只有在Oracle 11g上才有的。