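A friend runs an Oracle database for development on Linux, version 10.2.0.1.0.
During an RMAN backup, an ORA-00600 error was reported. After exiting RMAN, any query against DBA_JOBS raised ORA-00600 again, and the database then shut itself down.
I. Symptoms
I took a look at the alert.log. The relevant part is as follows: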
Database mounted in Exclusive Mode
Completed: ALTER DATABASE MOUNT
Mon Oct 15 14:55:00 2012
ALTER DATABASE OPEN
Mon Oct 15 14:55:00 2012
LGWR: STARTING ARCH PROCESSES
ARC0 started with pid=16, OS id=4306
Mon Oct 15 14:55:00 2012
ARC0: Archival started
ARC1: Archival started
LGWR: STARTING ARCH PROCESSES COMPLETE
ARC1 started with pid=17, OS id=4308
Mon Oct 15 14:55:00 2012
Thread 1 opened at log sequence 29
Current log# 1 seq# 29 mem# 0: /opt/ora10g/product/10.2.0/oradata/tftdb/REDO01.LOG
Successful open of redo thread 1
Mon Oct 15 14:55:00 2012
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Mon Oct 15 14:55:00 2012
ARC0: STARTING ARCH PROCESSES
Mon Oct 15 14:55:00 2012
SMON: enabling cache recovery
Mon Oct 15 14:55:00 2012
ARC1: Becoming the 'no FAL' ARCH
ARC1: Becoming the 'no SRL' ARCH
Mon Oct 15 14:55:00 2012
ARC2: Archival started
ARC0: STARTING ARCH PROCESSES COMPLETE
ARC0: Becoming the heartbeat ARCH
ARC2 started with pid=18, OS id=4310
Mon Oct 15 14:55:00 2012
Successfully onlined Undo Tablespace 1.
Mon Oct 15 14:55:00 2012
SMON: enabling tx recovery
Mon Oct 15 14:55:00 2012
Database Characterset is ZHS16GBK
Mon Oct 15 14:55:00 2012
Errors in file /opt/ora10g/product/10.2.0/admin/tftdb/bdump/tftdb_smon_4289.trc:
ORA-00600: internal error code, arguments: [4000], [2426], [], [], [], [], [], []
replication_dependency_tracking turned off (no async multimaster replication found)
Mon Oct 15 14:55:00 2012
Errors in file /opt/ora10g/product/10.2.0/admin/tftdb/udump/tftdb_ora_4304.trc:
ORA-00600: internal error code, arguments: [4000], [2411], [], [], [], [], [], []
Mon Oct 15 14:55:01 2012
Non-fatal internal error happenned while SMON was doing logging scn->time mapping.
SMON encountered 1 out of maximum 100 non-fatal internal errors.
Mon Oct 15 14:55:01 2012
Starting background process QMNC
QMNC started with pid=19, OS id=4312
Mon Oct 15 14:55:01 2012
db_recovery_file_dest_size of 2048 MB is 48.96% used. This is a
user-specified limit on the amount of space that will be used by this
database for recovery-related files, and does not reflect the amount of
space available in the underlying filesystem or ASM diskgroup.
Mon Oct 15 14:55:01 2012
Completed: ALTER DATABASE OPEN
Mon Oct 15 14:55:01 2012
Errors in file /opt/ora10g/product/10.2.0/admin/tftdb/bdump/tftdb_cjq0_4293.trc:
ORA-00600: internal error code, arguments: [4000], [2411], [], [], [], [], [], []
Mon Oct 15 14:55:01 2012
Errors in file /opt/ora10g/product/10.2.0/admin/tftdb/bdump/tftdb_cjq0_4293.trc:
ORA-00600: internal error code, arguments: [4000], [2411], [], [], [], [], [], []
Mon Oct 15 14:55:01 2012
Errors in file /opt/ora10g/product/10.2.0/admin/tftdb/bdump/tftdb_cjq0_4293.trc:
ORA-00600: internal error code, arguments: [4000], [2411], [], [], [], [], [], []
Mon Oct 15 14:55:02 2012
Errors in file /opt/ora10g/product/10.2.0/admin/tftdb/bdump/tftdb_cjq0_4293.trc:
ORA-00600: internal error code, arguments: [4000], [2411], [], [], [], [], [], []
Mon Oct 15 14:55:06 2012
Errors in file /opt/ora10g/product/10.2.0/admin/tftdb/bdump/tftdb_cjq0_4293.trc:
ORA-00600: internal error code, arguments: [4000], [2411], [], [], [], [], [], []
Mon Oct 15 14:55:06 2012
As the log shows, the SMON process already raised an error while the database was starting up.
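With this many repeated entries, it can help to tally the ORA-00600 errors by trace file and argument list before opening any trace. The following is a minimal sketch, assuming the alert.log keeps the "Errors in file ...:" line directly before each ORA-00600 line, as in the excerpt above. The function name `summarize_ora600` is made up for this illustration.

```python
import re
from collections import Counter

TRACE_RE = re.compile(r"^Errors in file (\S+):\s*$")
ORA600_RE = re.compile(r"^ORA-00600: internal error code, arguments: (.*)$")


def summarize_ora600(alert_text: str) -> Counter:
    """Count ORA-00600 lines keyed by (trace file, non-empty arguments).

    Assumes each ORA-00600 line follows the "Errors in file ...:" line
    it belongs to, which is how the excerpt in this post is laid out.
    """
    counts = Counter()
    trace_file = None
    for raw in alert_text.splitlines():
        line = raw.strip()
        m = TRACE_RE.match(line)
        if m:
            trace_file = m.group(1)  # remember the most recent trace file
            continue
        m = ORA600_RE.match(line)
        if m:
            # Pull out the bracketed arguments and drop the empty ones.
            args = tuple(a for a in re.findall(r"\[([^\]]*)\]", m.group(1)) if a)
            counts[(trace_file, args)] += 1
    return counts
```

Feeding it the contents of alert.log and printing the counter items gives one line per trace file and argument pair, which makes it easy to see that SMON and CJQ0 fail with different second arguments.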
II. Analysis
There are two main errors: ORA-00600 [4000], [2426], [], [], [], [], [], [] and ORA-00600 [4000], [2411], [], [], [], [], [], []
1) ORA-00600 [4000], [2426], [], [], [], [], [], []
Opening tftdb_smon_4289.trc shows the following:
/opt/ora10g/product/10.2.0/admin/tftdb/bdump/tftdb_smon_4289.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Production
With the Partitioning, OLAP and Data Mining options
ORACLE_HOME = /opt/ora10g/product/10.2.0/db_1
System name: Linux
Node name: node1
Release: 2.6.18-194.el5
Version: #1 SMP Tue Mar 16 21:52:43 EDT 2010
Machine: i686
Instance name: tftdb
Redo thread mounted by this instance: 1
Oracle process number: 8
Unix process pid: 4289, image: oracle@node1 (SMON)
*** SERVICE NAME:() 2012-10-15 14:55:00.738
*** SESSION ID:(164.1) 2012-10-15 14:55:00.738
*** 2012-10-15 14:55:00.738
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [4000], [2426], [], [], [], [], [], []
Current SQL statement for this session:
select smontabv.cnt, smontab.time_mp, smontab.scn, smontab.num_mappings, smontab.tim_scn_map, smontab.orig_thread from smon_scn_time smontab, (select max(scn) scnmax,count(*)+sum(NVL2(TIM_SCN_MAP,NUM_MAPPINGS,0)) cnt from smon_scn_time where thread=0) smontabv where smontab.scn = smontabv.scnmax and thread=0
----- Call Stack Trace -----
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedst()+27          call     ksedst1()            0 ? 1 ?
ksedmp()+557         call     ksedst()             0 ? D ? CBD2D20 ? 2A ? CBD2D20 ? 2A ?
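It looks like the smon_scn_time table is the problem. The SMON process updates this table roughly every 6 seconds, writing the mapping between SCN and time. The table is created as follows: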
create cluster smon_scn_to_time (
  thread number                 /* thread, compatibility */
)
/
create index smon_scn_to_time_idx on cluster smon_scn_to_time
/
create table smon_scn_time (
  thread number,                /* thread, compatibility */
  time_mp number,               /* time this recent scn represents */
  time_dp date,                 /* time as date, compatibility */
  scn_wrp number,               /* scn.wrp, compatibility */
  scn_bas number,               /* scn.bas, compatibility */
  num_mappings number,
  tim_scn_map raw(1200),
  scn number default 0,         /* scn */
  orig_thread number default 0  /* for downgrade */
) cluster smon_scn_to_time (thread)
/
create unique index smon_scn_time_tim_idx on smon_scn_time(time_mp)
/
create unique index smon_scn_time_scn_idx on smon_scn_time(scn)
/
Running a query against it:
select count(*) from sys.smon_scn_time;
ERROR at line 1:
ORA-00600: internal error code, arguments: [4000], [2521], [], [], [], [], [], []
However, the following statement runs correctly:
select * from smon_scn_time where rownum<1000;
With rownum<2000, the ORA-00600 error appeared again and the database went down.
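The gap between a limit that works (1000) and one that fails (2000) could be narrowed by bisection. This is only a sketch of the idea, not something done in this case. `probe` is a hypothetical callback that would run `select * from smon_scn_time where rownum<n` and return True on success, and the search assumes that once a limit fails, every larger limit fails too. Note that in this incident a failing query brought the database down, so every failed probe would cost a restart.

```python
from typing import Callable


def first_failing_limit(probe: Callable[[int], bool],
                        known_good: int, known_bad: int) -> int:
    """Return the smallest limit in (known_good, known_bad] where probe fails.

    probe(n) is a hypothetical check that returns True when the query
    with rownum<n succeeds. The search assumes failures are monotone:
    if probe(n) fails, probe(m) fails for every m > n.
    """
    lo, hi = known_good, known_bad  # invariant: probe(lo) ok, probe(hi) fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if probe(mid):
            lo = mid
        else:
            hi = mid
    return hi
```

Starting from 1000 and 2000, this needs about ten probes to pin down the first failing limit.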
Suspecting corrupt blocks, I ran a block check on the data file:
[oracle@node1 tftdb]$ dbv file=SYSTEM01.DBF blocksize=8192
DBVERIFY: Release 10.2.0.1.0 - Production on Mon Oct 15 15:46:44 2012
Copyright (c) 1982, 2005, Oracle. All rights reserved.
DBVERIFY - Verification starting : FILE = SYSTEM01.DBF
DBVERIFY - Verification complete
Total Pages Examined         : 88320
Total Pages Processed (Data) : 50252
Total Pages Failing   (Data) : 0
Total Pages Processed (Index): 19344
Total Pages Failing   (Index): 0
Total Pages Processed (Other): 1785
Total Pages Processed (Seg)  : 0
Total Pages Failing   (Seg)  : 0
Total Pages Empty            : 16939
Total Pages Marked Corrupt   : 0
Total Pages Influx           : 0
Highest block SCN            : 4458516 (0.4458516)
[oracle@node1 tftdb]$
No corrupt blocks were found.
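When several data files have to be checked, reading each DBVERIFY summary by eye gets tedious. The sketch below pulls out only the "Failing" and "Marked Corrupt" counters that are non-zero, assuming the summary uses the "Total Pages ... : N" layout shown above. The function name `dbv_problem_counts` is an assumption made for this example. As this case shows, an empty result only means dbv saw no bad blocks, not that the data is healthy.

```python
import re

# Matches summary lines such as "Total Pages Failing   (Data) : 0".
COUNTER_RE = re.compile(r"^\s*(Total Pages [^:]+?)\s*:\s*(\d+)\s*$")


def dbv_problem_counts(dbv_output: str) -> dict:
    """Return the non-zero 'Failing' and 'Marked Corrupt' counters."""
    problems = {}
    for line in dbv_output.splitlines():
        m = COUNTER_RE.match(line)
        if not m:
            continue
        # Collapse runs of spaces so keys look the same across dbv versions.
        name = " ".join(m.group(1).split())
        value = int(m.group(2))
        if value and ("Failing" in name or "Marked Corrupt" in name):
            problems[name] = value
    return problems
```

For the output above the function returns an empty dict, since every watched counter is 0.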
2) ORA-00600 [4000], [2411], [], [], [], [], [], []
Running select * from dba_jobs also raises ORA-00600, after which the database goes down.
The alert log shows:
Mon Oct 15 14:55:01 2012
Errors in file /opt/ora10g/product/10.2.0/admin/tftdb/bdump/tftdb_cjq0_4293.trc:
ORA-00600: internal error code, arguments: [4000], [2411], [], [], [], [], [], []
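III. Solution
Since select * from smon_scn_time where rownum<1000 succeeds but rownum<2000 fails, the most likely cause is bad data in the smon_scn_time table.
After taking a full backup of the database, you can try emptying this table.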
sql> conn / as sysdba
sql> alter system set events '12500 trace name context forever, level 10'; --stop SMON from recording the SCN-to-time mapping
sql> delete from sys.smon_scn_time;
delete from sys.smon_scn_time
            *
ERROR at line 1:
ORA-00600: internal error code, arguments: [4000], [2521], [], [], [], [], [], []
sql> truncate cluster sys.smon_scn_to_time;
sql> alter system set events '12500 trace name context off';
sql> shutdown immediate;
sql> startup
After restarting the database, ORA-00600 [4000], [2426], [], [], [], [], [], [] no longer appears, but ORA-00600 [4000], [2411], [], [], [], [], [], [] is still reported, as shown here:
ARC1: STARTING ARCH PROCESSES COMPLETE
ARC1: Becoming the heartbeat ARCH
ARC2 started with pid=18, OS id=5087
Mon Oct 15 16:17:41 2012
Successfully onlined Undo Tablespace 1.
Mon Oct 15 16:17:41 2012
SMON: enabling tx recovery
Mon Oct 15 16:17:41 2012
Database Characterset is ZHS16GBK
replication_dependency_tracking turned off (no async multimaster replication found)
Mon Oct 15 16:17:41 2012
Errors in file /opt/ora10g/product/10.2.0/admin/tftdb/udump/tftdb_ora_5081.trc:
ORA-00600: internal error code, arguments: [4000], [2411], [], [], [], [], [], []
Starting background process QMNC
QMNC started with pid=19, OS id=5089
Mon Oct 15 16:17:42 2012
db_recovery_file_dest_size of 2048 MB is 49.06% used. This is a
user-specified limit on the amount of space that will be used by this
database for recovery-related files, and does not reflect the amount of
space available in the underlying filesystem or ASM diskgroup.
Mon Oct 15 16:17:42 2012
Completed: ALTER DATABASE OPEN
Mon Oct 15 16:17:42 2012
Errors in file /opt/ora10g/product/10.2.0/admin/tftdb/bdump/tftdb_cjq0_5070.trc:
ORA-00600: internal error code, arguments: [4000], [2411], [], [], [], [], [], []
Mon Oct 15 16:17:42 2012
Errors in file /opt/ora10g/product/10.2.0/admin/tftdb/bdump/tftdb_cjq0_5070.trc:
ORA-00600: internal error code, arguments: [4000], [2411], [], [], [], [], [], []
Mon Oct 15 16:17:42 2012
Errors in file /opt/ora10g/product/10.2.0/admin/tftdb/bdump/tftdb_cjq0_5070.trc:
ORA-00600: internal error code, arguments: [4000], [2411], [], [], [], [], [], []
Mon Oct 15 16:17:43 2012
The trace file contains the following:
/opt/ora10g/product/10.2.0/admin/tftdb/udump/tftdb_ora_5081.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Production
With the Partitioning, OLAP and Data Mining options
ORACLE_HOME = /opt/ora10g/product/10.2.0/db_1
System name: Linux
Node name: node1
Release: 2.6.18-194.el5
Version: #1 SMP Tue Mar 16 21:52:43 EDT 2010
Machine: i686
Instance name: tftdb
Redo thread mounted by this instance: 1
Oracle process number: 15
Unix process pid: 5081, image: oracle@node1 (TNS V1-V3)
*** SERVICE NAME:(SYS$USERS) 2012-10-15 16:17:41.308
*** SESSION ID:(159.3) 2012-10-15 16:17:41.308
tkcrrsarc: (WARN) Failed to find ARCH for message (message:0x1)
tkcrrpa: (WARN) Failed initial attempt to send ARCH message (message:0x1)
*** 2012-10-15 16:17:41.590
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [4000], [2411], [], [], [], [], [], []
Current SQL statement for this session:
select count(*) from dba_jobs where what = 'sys.dbms_aqadm_sys.register_driver();' and instance = :1
----- Call Stack Trace -----
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedst()+27          call     ksedst1()            0 ? 1 ?
ksedmp()+557         call     ksedst()             0 ? 13 ? CBD2D20 ? 2A ? CBD2D20 ? 2A ?
ksfdmp()+19          call     ksedmp()             3 ? BFB0C3CC ? AC152A0 ?
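In both trace files the useful clue was the statement listed under "Current SQL statement for this session", which pointed first at smon_scn_time and then at dba_jobs. A small helper can pull that statement out of a trace. This is a sketch that assumes the statement sits between that marker and the "----- Call Stack Trace -----" line, as in the two traces in this post. The name `current_sql` is invented here.

```python
from typing import Optional

SQL_MARKER = "Current SQL statement for this session:"
STACK_MARKER = "----- Call Stack Trace -----"


def current_sql(trace_text: str) -> Optional[str]:
    """Return the SQL text recorded in a trace dump, or None if absent."""
    start = trace_text.find(SQL_MARKER)
    if start == -1:
        return None
    rest = trace_text[start + len(SQL_MARKER):]
    end = rest.find(STACK_MARKER)
    if end != -1:
        rest = rest[:end]
    # Collapse line breaks and repeated spaces into single spaces.
    return " ".join(rest.split())
```

Because it only looks for the two markers, it works whether the trace is read from the file or pasted as a single flattened line.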
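Querying dba_jobs still brings the database down.
It seems the DBA_JOBS data is damaged as well, so I deleted that data and then recreated the jobs.
The database returned to normal, and the ORA-00600 errors above no longer appear.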