ora-00600: internal error code, arguments: [KGHLKREM1], [0x838000020], [], [], [], [], [], [], [], [

author:skate
time:2012/02/22

 

At midnight last night I received a monitoring SMS alert (dataguard is down), so I logged in to the system to find out why.

First I checked the standby database's alert log. Every entry from the most recent half hour looked like this:
........
Tue Feb 21 00:02:03 2012
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Tue Feb 21 00:02:03 2012
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Tue Feb 21 00:02:05 2012
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Tue Feb 21 00:02:05 2012
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Tue Feb 21 00:02:06 2012
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
.......

Looking further back in the alert log:

.....
Mon Feb 20 09:35:59 2012
Archived Log entry 10127 added for thread 1 sequence 11099 ID 0x263e89b dest 1:
Mon Feb 20 10:01:06 2012
RFS[6]: Selected log 13 for thread 1 sequence 11101 dbid 40093083 branch 760555291
Mon Feb 20 10:01:06 2012
Media Recovery Waiting for thread 1 sequence 11101 (in transit)
Recovery of Online Redo Log: Thread 1 Group 13 Seq 11101 Reading mem 0
  Mem# 0: /oracle/oradata/skate01/standbyredo13.log
Mon Feb 20 10:01:14 2012
Archived Log entry 10128 added for thread 1 sequence 11100 ID 0x263e89b dest 1:
Mon Feb 20 10:03:58 2012
Errors in file /oracle/app/diag/rdbms/skate01/skate01/trace/skate01_ora_17783.trc  (incident=264961):
ora-00600: internal error code, arguments: [KGHLKREM1], [0x838000020], [], [], [], [], [], [], [], [], [], []
Incident details in: /oracle/app/diag/rdbms/skate01/skate01/incident/incdir_264961/skate01_ora_17783_i264961.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Mon Feb 20 10:04:27 2012
Errors in file /oracle/app/diag/rdbms/skate01/skate01/trace/skate01_mmon_1590.trc  (incident=264121):
ora-00600: internal error code, arguments: [KGHLKREM1], [0x838000020], [], [], [], [], [], [], [], [], [], []
Incident details in: /oracle/app/diag/rdbms/skate01/skate01/incident/incdir_264121/skate01_mmon_1590_i264121.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Mon Feb 20 10:04:29 2012
Restarting dead background process MMON
Mon Feb 20 10:04:29 2012
MMON started with pid=15, OS id=17808
Mon Feb 20 10:04:29 2012
Dumping diagnostic data in directory=[cdmp_20120220100429], requested by (instance=1, osid=1590 (MMON)), summary=[incident=264121].
Errors in file /oracle/app/diag/rdbms/skate01/skate01/trace/skate01_mmon_17808.trc  (incident=264122):
ora-00600: internal error code, arguments: [KGHLKREM1], [0x838000020], [], [], [], [], [], [], [], [], [], []
Incident details in: /oracle/app/diag/rdbms/skate01/skate01/incident/incdir_264122/skate01_mmon_17808_i264122.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /oracle/app/diag/rdbms/skate01/skate01/trace/skate01_mmon_17808.trc  (incident=264123):
ora-00600: internal error code, arguments: [KGHLKREM1], [0x838000020], [], [], [], [], [], [], [], [], [], []
Incident details in: /oracle/app/diag/rdbms/skate01/skate01/incident/incdir_264123/skate01_mmon_17808_i264123.trc
Dumping diagnostic data in directory=[cdmp_20120220100432], requested by (instance=1, osid=17808 (MMON)), summary=[incident=264122].
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Mon Feb 20 10:04:52 2012
.......
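Scrolling back through a large alert log by hand is slow. The following is a minimal sketch, not part of my original troubleshooting, that prints the timestamp of the first ORA-00600 entry. The function name `first_ora600` is hypothetical, and it assumes the 11g text alert log layout shown above, where each timestamp sits on its own line.

```shell
# Print the timestamp line that precedes the first ORA-00600 entry.
# Reads the alert log from the files given as arguments, or from stdin.
first_ora600() {
    awk '
        # Remember the most recent timestamp line, e.g. "Mon Feb 20 10:03:58 2012"
        /^(Mon|Tue|Wed|Thu|Fri|Sat|Sun) [A-Z][a-z][a-z] +[0-9]+ [0-9:]+ [0-9]+$/ { ts = $0; next }
        # The alert log writes the code in lower case, so compare case-insensitively
        tolower($0) ~ /ora-00600/ { print ts; exit }
    ' "$@"
}
```

Usage would be something like `first_ora600 alert_skate01.log`, where the file name is only an example.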

The ORA-600 errors had already started at "Mon Feb 20 10:03:58 2012". First, I checked what state the database was currently in and whether it was healthy.

1. $ ps -ef |grep $ORACLE_SID     # check that the Oracle processes are running
2. $ netstat -an | grep 1588| wc -l   # check whether Oracle has connections (port 1588 here)
3. Check the OS state: vmstat, top, iostat
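A plain `grep 1588` also counts the LISTEN socket, ports such as 15880, and remote ports that happen to be 1588. The sketch below is a stricter variant under the assumption that the input is Linux `netstat -an` output (local address in column 4, state in column 6). The function name `count_established` is made up for this example.

```shell
# Count ESTABLISHED TCP connections whose local port is exactly $1.
# Expects `netstat -an` output on stdin.
count_established() {
    port=$1
    awk -v port="$port" '
        $1 ~ /^tcp/ && $6 == "ESTABLISHED" && $4 ~ (":" port "$") { n++ }
        END { print n + 0 }   # n + 0 prints 0 when nothing matched
    '
}
```

For example: `netstat -an | count_established 1588`.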

 

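These checks showed nothing abnormal. I then remembered that a project had been migrated onto this active standby on the 20th, which might be related, so I tried to log in to the database to investigate further. The login failed with the following errors: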

[root@skate01 ~]# su - oracle
[oracle@skate01 ~]$ sqlplus "/as sysdba"

SQL*Plus: Release 11.2.0.2.0 Production on Tue Feb 21 00:26:12 2012

Copyright (c) 1982, 2010, Oracle.  All rights reserved.

ERROR:
ORA-01075: you are currently logged on


Enter user-name:
ERROR:
ORA-01017: invalid username/password; logon denied


SP2-0157: unable to CONNECT to ORACLE after 3 attempts, exiting SQL*Plus


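I tried twice and got the same errors both times, so I could not log in. The database service appeared to have hung, and restarting it seemed to be the only option. ORA-01075 is usually caused by insufficient disk space or by auditing, but I checked and neither applied to my environment, so I killed the processes at the OS level with the following two commands: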

 

1. $ ps -ef |grep $ORACLE_SID|grep -v grep|awk '{print $2}' | xargs kill -9     # kill the processes
2. $ ipcs -m | grep oracle | awk '{print $2}' | xargs ipcrm shm                 # remove Oracle's shared memory segments
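Before resorting to `kill -9`, the disk-space cause of ORA-01075 can be ruled out quickly. This is a minimal sketch, assuming POSIX `df -P` output (capacity in column 5, mount point in column 6) and mount points without spaces. The function name `full_filesystems` and the default threshold of 90% are my own choices for illustration.

```shell
# List mount points whose usage is at or above a threshold (default 90%).
# Expects `df -P` output on stdin.
full_filesystems() {
    limit=${1:-90}
    awk -v limit="$limit" '
        NR > 1 {                      # skip the header line
            pct = $5
            sub(/%/, "", pct)         # "95%" -> "95"
            if (pct + 0 >= limit + 0) print $6, $5
        }
    '
}
```

For example: `df -P | full_filesystems 90`. No output means no filesystem is over the threshold.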

First, list the processes that need to be killed:
[oracle@skate01 ~]$ ps -ef |grep $ORACLE_SID|grep -v grep |grep -v avahi


Kill the processes:
[oracle@skate01 ~]$ ps -ef |grep $ORACLE_SID|grep -v grep |grep -v avahi |awk '{print $2}' | xargs kill -9
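Grepping for the SID matches anything that contains the string, which is why avahi had to be filtered out by hand above. A narrower selection is sketched below, assuming the usual Oracle process naming on Linux (`ora_<name>_<SID>` for background processes, `oracle<SID>` for server processes) and a SID made of plain letters and digits. The function name `oracle_pids` is hypothetical.

```shell
# Print the PIDs of Oracle background and server processes for one SID.
# Expects `ps -eo pid,args` output on stdin.
oracle_pids() {
    sid=$1
    awk -v sid="$sid" '
        # $1 is the PID, $2 is the first word of the command line
        $2 ~ ("^ora_[a-z0-9]+_" sid "$") || $2 == ("oracle" sid) { print $1 }
    '
}
```

Review the list first with `ps -eo pid,args | oracle_pids "$ORACLE_SID"`, and only then pipe it to `xargs kill -9`.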

Killing only the Oracle processes is not enough; you still cannot log in to Oracle.

 


List the shared memory segments to be removed:
[oracle@skate01 ~]$ ipcs -m | grep oracle | awk '{print $2}'


Remove the shared memory segments:
[oracle@skate01 ~]$ ipcs -m | grep oracle | awk '{print $2}' | xargs ipcrm shm

resource(s) deleted
[oracle@skate01 ~]$
[oracle@skate01 ~]$ ipcs -m | grep oracle | awk '{print $2}'
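`grep oracle` selects every line that mentions the string, attached or not. A more cautious selection is sketched here, assuming the Linux `ipcs -m` column layout (key, shmid, owner, perms, bytes, nattch, status). It picks only segments owned by the given user that no process is attached to, which is what should remain once all Oracle processes are gone. The function name `detached_shm_ids` is made up for this example.

```shell
# Print the shmid of every segment owned by user $1 with nattch == 0.
# Expects `ipcs -m` output on stdin.
detached_shm_ids() {
    owner=$1
    awk -v owner="$owner" '
        $3 == owner && $6 == 0 { print $2 }   # header lines never match the owner
    '
}
```

The result can then be removed one id at a time, for example `ipcs -m | detached_shm_ids oracle | xargs -r -n1 ipcrm -m`. The `ipcrm shm <id>` form used above is the older syntax, and `-r` is a GNU xargs option.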


Try logging in to Oracle:
[oracle@skate01 ~]$ sqlplus "/as sysdba"

SQL*Plus: Release 11.2.0.2.0 Production on Tue Feb 21 00:47:36 2012

Copyright (c) 1982, 2010, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> startup nomount;
ORACLE instance started.

Total System Global Area 3.5275E+10 bytes
Fixed Size                  2233656 bytes
Variable Size            3623881416 bytes
Database Buffers         3.1541E+10 bytes
Redo Buffers              108003328 bytes
SQL> alter database mount standby database;

Database altered.

SQL> alter database open read only;

Database altered.

SQL> alter database recover managed standby database disconnect using current logfile;

Database altered.

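I then checked the alert log for anything abnormal and found that everything looked normal, confirmed that the OS layer was healthy, and then logged in to the database to check whether Data Guard was healthy.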

1. Time lag between the standby and the primary (run on the standby):

select 'Last applied  : ' Logs,
        to_char(next_time, 'DD-MON-YY:HH24:MI:SS') Time
   from v$archived_log
  where sequence# =
        (select max(sequence#) from v$archived_log where applied = 'YES')
 union
 select 'Last received : ' Logs,
        to_char(next_time, 'DD-MON-YY:HH24:MI:SS') Time
   from v$archived_log
  where sequence# = (select max(sequence#) from v$archived_log);    
 

2. Check the activity status of the processes (run on the standby):

select process, status, thread#, sequence#, block#, blocks
  from v$managed_standby;

3. Check the log recovery speed:
select * from v$dataguard_status;
select * from v$recovery_progress;
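For monitoring, the second check can be reduced to a single value. The sketch below reads the text output of the `v$managed_standby` query and prints the status of the managed recovery process. The function name `mrp_status` and the `NOT_RUNNING` label are my own inventions, and it assumes the process name is in the first column and the status in the second, as in the query above.

```shell
# Print the STATUS column for the MRP (managed recovery) process,
# or NOT_RUNNING if no MRP row is present.
# Expects the output of the v$managed_standby query on stdin.
mrp_status() {
    awk '
        $1 ~ /^MRP/ { print $2; found = 1 }
        END { if (!found) print "NOT_RUNNING" }
    '
}
```

One way to feed it is to save the query in a script file and run `sqlplus -s "/ as sysdba" @check.sql | mrp_status`, where `check.sql` is a hypothetical file name.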

 

Having confirmed that the database was now healthy, I went back to find out why it had gone down and why it had reported ORA-600.

Look at the trace file:

[root@skate01 ~]#  more /oracle/app/diag/rdbms/skate01/skate01/incident/incdir_264961/skate01_ora_17783_i264961.trc
Dump file /oracle/app/diag/rdbms/skate01/skate01/incident/incdir_264961/skate01_ora_17783_i264961.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
ORACLE_HOME = /oracle/app/product/11.2.0/db_1
System name:    Linux
Node name:      skate01
Release:        2.6.18-194.el5
Version:        #1 SMP Fri Apr 2 14:58:14 EDT 2010
Machine:        x86_64
Instance name: skate01
Redo thread mounted by this instance: 1
Oracle process number: 120
Unix process pid: 17783, image: oracle@skate01


*** 2012-02-20 10:03:58.215
*** SESSION ID:(17.5) 2012-02-20 10:03:58.215
*** CLIENT ID:() 2012-02-20 10:03:58.215
*** SERVICE NAME:(SYS$USERS) 2012-02-20 10:03:58.215
*** MODULE NAME:(JDBC Thin Client) 2012-02-20 10:03:58.215
*** ACTION NAME:() 2012-02-20 10:03:58.215
 
Dump continued from file: /oracle/app/diag/rdbms/skate01/skate01/trace/skate01_ora_17783.trc
ORA-00600: internal error code, arguments: [KGHLKREM1], [0x838000020], [], [], [], [], [], [], [], [], [], []

========= Dump for incident 264961 (ORA 600 [KGHLKREM1]) ========
----- Beginning of Customized Incident Dump(s) -----
***** Internal heap ERROR KGHLKREM1 addr=0x838000020 ds=0x60001188 *****
***** Dump of memory around addr 0x838000020:
837FFF020 00000000 00000000 00000000 00000000 
[................]
  Repeat 511 times
Recovery state: ds=0x60001188 rtn=(nil) *rtn=(nil) szo=0 u4o=0 hdo=0 off=0
 Szo:
 UB4o:
 Hdo:
 Off:
 Hla: 0
******************************************************
HEAP DUMP heap name="sga heap"  desc=0x60001188
 extent sz=0x9800 alt=248 het=32767 rec=9 flg=-126 opc=4
 parent=(nil) owner=(nil) nex=(nil) xsz=0x0 heap=(nil)
 fl2=0x60, nex=(nil)
 ds for latch 1: 0x600551d8 0x60056a30 0x60058288 0x60059ae0
 ds for latch 2: 0x6005eaa0 0x600602f8 0x60061b50 0x600633a8
 reserved granule count 12 (granule size 134217728)
----- End of Customized Incident Dump(s) -----

*** 2012-02-20 10:03:58.341
dbkedDefDump(): Starting incident default dumps (flags=0x2, level=3, mask=0x0)
----- Current SQL Statement for this session (sql_id=40p7rprfbt1as) -----
select 'a' from dual

----- Call Stack Trace -----
calling              call     entry                argument values in hex     
location             type     point                (? means dubious value)    
-------------------- -------- -------------------- ----------------------------
skdstdst()+36        call     kgdsdst()            000000000 ? 000000000 ?
                                                   7FFF0B5CCD88 ? 000000001 ?
                                                   000000001 ? 000000002 ?
ksedst1()+98         call     skdstdst()           000000000 ? 000000000 ?
                                                   7FFF0B5CCD88 ? 000000001 ?
                                                   000000000 ? 000000002 ?
ksedst()+34          call     ksedst1()            000000000 ? 000000001 ?
                                                   7FFF0B5CCD88 ? 000000001 ?
                                                   000000000 ? 000000002 ?
dbkedDefDump()+2741  call     ksedst()             000000000 ? 000000001 ?
                                                   7FFF0B5CCD88 ? 000000001 ?
                                                   000000000 ? 000000002 ?
ksedmp()+36          call     dbkedDefDump()       000000003 ? 000000002 ?
                                                   7FFF0B5CCD88 ? 000000001 ?
                                                   000000000 ? 000000002 ?
ksfdmp()+64          call     ksedmp()             000000003 ? 000000002 ?
                                                   7FFF0B5CCD88 ? 000000001 ?
                                                   000000000 ? 000000002 ?
dbgexPhaseII()+1764  call     ksfdmp()             000000003 ? 000000002 ?
                                                   7FFF0B5CCD88 ? 000000001 ?
                                                   000000000 ? 000000002 ?
dbgexExplicitEndInc  call     dbgexPhaseII()       2B1892AF1710 ? 2B1892EA06A8 ?
()+750                                             7FFF0B5D88C0 ? 000000001 ?
                                                   000000000 ? 000000002 ?
dbgeEndDDEInvocatio  call     dbgexExplicitEndInc  2B1892AF1710 ? 2B1892EA06A8 ?
nImpl()+767                   ()                   7FFF0B5D88C0 ? 000000001 ?
                                                   000000000 ? 000000002 ?
dbgeEndDDEInvocatio  call     dbgeEndDDEInvocatio  2B1892AF1710 ? 2B1892EA06A8 ?
n()+47                        nImpl()              7FFF0B5D88C0 ? 000000001 ?
                                                   000000000 ? 000000002 ?
kghnerror()+394      call     dbgeEndDDEInvocatio  2B1892AF1710 ? 2B1892EA06A8 ?
                              n()                  7FFF0B5D88C0 ? 000000001 ?
                                                   000000000 ? 000000002 ?
kghadd_reserved_ext  call     kghnerror()          00B7CCEA0 ? 060001188 ?
ent()+945                                          00A0EF0C0 ? 838000020 ?
                                                   100000000 ? 000000002 ?
kghget_reserved_ext  call     kghadd_reserved_ext  00B7CCEA0 ? 060001188 ?
ent()+526                     ent()                060059AE0 ? 060059B28 ?
                                                   000000000 ? 000000000 ?
kghgex()+1455        call     kghget_reserved_ext  00B7CCEA0 ? 060004CD8 ?
                              ent()                060059AE0 ? 060059B28 ?
                                                   000000000 ? 000000000 ?
kghfnd()+734         call     kghgex()             00B7CCEA0 ? 060004CD8 ?
                                                   060059AE0 ? 000001058 ?
                                                   000000000 ? 000000000 ?
kghalo()+536         call     kghfnd()             00B7CCEA0 ? 000000000 ?
                                                   060004CD8 ? 000000000 ?
                                                   060059AE0 ? 7FFF0B5C95C0 ?
kghgex()+437         call     kghalo()             00B7CCEA0 ? 060059AE0 ?
                                                   000001000 ? 000001000 ?
                                                   060059AE0 ? 060004CD8 ?
kghalf()+395         call     kghgex()             00B7CCEA0 ? 000000000 ?
                                                   85DD58D08 ? 000000FD0 ?
                                                   060059AE0 ? 060004CD8 ?
kksLoadChild()+2785  call     kghalf()             00B7CCEA0 ? 85DD58D08 ?
                                                   000000001 ? 060004CD8 ?
                                                   000000000 ? 009A98F10 ?
kxsGetRuntimeLock()  call     kksLoadChild()       00B7CCEA0 ? 88FD354E0 ?
+2061                                              7FFF0B5DB5B0 ? 2B1892F59070 ?
                                                   85DD586F8 ? 000000000 ?
kksfbc()+14522       call     kxsGetRuntimeLock()  00B7CCEA0 ? 2B1892F59070 ?
                                                   7FFF0B5DB5B0 ? 2B1892F59070 ?
                                                   85DD586F8 ? 88FD354E0 ?
kkspsc0()+2020       call     kksfbc()             2B1892F59070 ? 000000003 ?
                                                   000000108 ? 7FFF0B5DD6F8 ?
                                                   000000015 ? 000000000 ?
kksParseCursor()+13  call     kkspsc0()            2B1892F41BB8 ? 7FFF0B5DD6F8 ?
9                                                  000000015 ? 000000003 ?
                                                   000000006 ? 0000000A4 ?
opiosq0()+2022       call     kksParseCursor()     7FFF0B5DC0D0 ? 7FFF0B5DD6F8 ?
                                                   000000015 ? 000000003 ?
                                                   000000006 ? 0000000A4 ?
kpooprx()+269        call     opiosq0()            000000003 ? 00000000E ?
                                                   7FFF0B5DC2A0 ? 0000000A4 ?
                                                   000000000 ? 7FFF0B5DBFB0 ?
kpoal8()+795         call     kpooprx()            7FFF0B5DF694 ? 7FFF0B5DD6F8 ?
                                                   000000014 ? 000000001 ?
                                                   000000000 ? 7FFF0B5DBFB0 ?
opiodr()+910         call     kpoal8()             00000005E ? 00000001C ?
                                                   7FFF0B5DF690 ? 000000001 ?
                                                   000000000 ? 000000001 ?
ttcpip()+2289        call     opiodr()             00000005E ? 00000001C ?
                                                   7FFF0B5DF690 ? 000000000 ?
                                                   0098A1530 ? 000000001 ?
opitsk()+1665        call     ttcpip()             00B7E2B10 ? 00923BB90 ?
                                                   7FFF0B5DF690 ? 000000000 ?
                                                   7FFF0B5DF0F0 ? 7FFF0B5DF888 ?
opiino()+961         call     opitsk()             00B7E2B10 ? 000000001 ?
                                                   7FFF0B5DF690 ? 000000000 ?
                                                   7FFF0B5DF0F0 ? 7FFF0B5DF888 ?
opiodr()+910         call     opiino()             00000003C ? 000000004 ?
                                                   7FFF0B5E0E18 ? 000000000 ?
                                                   7FFF0B5DF0F0 ? 7FFF0B5DF888 ?
opidrv()+565         call     opiodr()             00000003C ? 000000004 ?
                                                   7FFF0B5E0E18 ? 000000000 ?
                                                   0098A0FE0 ? 7FFF0B5DF888 ?
sou2o()+98           call     opidrv()             00000003C ? 000000004 ?
                                                   7FFF0B5E0E18 ? 000000000 ?
                                                   0098A0FE0 ? 7FFF0B5DF888 ?
opimai_real()+128    call     sou2o()              7FFF0B5E0DF0 ? 00000003C ?
                                                   000000004 ? 7FFF0B5E0E18 ?
                                                   0098A0FE0 ? 7FFF0B5DF888 ?
ssthrdmain()+252     call     opimai_real()        000000002 ? 7FFF0B5E0FE0 ?
                                                   000000004 ? 7FFF0B5E0E18 ?
                                                   0098A0FE0 ? 7FFF0B5DF888 ?
main()+196           call     ssthrdmain()         000000002 ? 7FFF0B5E0FE0 ?
                                                   000000001 ? 000000000 ?
                                                   0098A0FE0 ? 7FFF0B5DF888 ?
__libc_start_main()  call     main()               000000002 ? 7FFF0B5E1188 ?
+244                                               000000001 ? 000000000 ?
                                                   0098A0FE0 ? 7FFF0B5DF888 ?
_start()+36          call     __libc_start_main()  000A07368 ? 000000002 ?
                                                   7FFF0B5E1178 ? 000000000 ?
                                                   0098A0FE0 ? 000000002 ?
 

--------------------- Binary Stack Dump ---------------------
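One detail in this trace is easy to miss: the memory around the failing address is all zeros. The sketch below checks that automatically. It looks only at the first row after the "Dump of memory around addr" line, so it is a quick hint rather than a full analysis, and the function name `dump_first_row` is hypothetical.

```shell
# Report whether the first row of the heap memory dump is all zeros.
# Expects an incident trace file on stdin; prints "zeroed" or "non-zero",
# or nothing if the trace contains no such dump.
dump_first_row() {
    awk '
        /Dump of memory around addr/ { armed = 1; next }
        armed && NF >= 2 {
            zero = 1
            # Field 1 is the address; skip a trailing "[....]" ASCII column
            for (i = 2; i <= NF; i++)
                if ($i !~ /^0+$/ && $i !~ /^\[/) zero = 0
            print (zero ? "zeroed" : "non-zero")
            exit
        }
    '
}
```

For example: `dump_first_row < skate01_ora_17783_i264961.trc`.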

Looking even further back in the alert log, I found that ORA-07445 had been reported as well:

Tue Jan 17 08:42:12 2012
Archived Log entry 7472 added for thread 1 sequence 8444 ID 0x263e89b dest 1:
Tue Jan 17 09:00:14 2012
Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0x8] [PC:0xB0997A, ksmdscan_internal()+82] [flags: 0x0, count: 1]
Errors in file /oracle/app/diag/rdbms/skate01/skate01/trace/skate01_ora_25574.trc  (incident=264155):
ora-07445: exception encountered: core dump [ksmdscan_internal()+82] [SIGSEGV] [ADDR:0x8] [PC:0xB0997A] [Address not mapped to object] []
Incident details in: /oracle/app/diag/rdbms/skate01/skate01/incident/incdir_264155/skate01_ora_25574_i264155.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Tue Jan 17 09:00:21 2012
Dumping diagnostic data in directory=[cdmp_20120117090021], requested by (instance=1, osid=25574), summary=[incident=264155].
Tue Jan 17 09:00:22 2012
Sweep [inc][264155]: completed
Sweep [inc2][264155]: completed
Tue Jan 17 09:06:08 2012
Media Recovery Waiting for thread 1 sequence 8446

I then consulted Oracle document ID 1070812.1 and found that the problem is related to my having enabled HugePages.

When vm.drop_caches is set to a value greater than 0 and HugePages is enabled, the two conflict, because drop_caches tries to release memory while HugePages holds on to memory.

Reference: http://blog.csdn.net/wyzxg/article/details/7279986
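Whether a host has this combination can be checked without touching the database. This is a minimal sketch, assuming `HugePages_Total` in `/proc/meminfo` and a readable `/proc/sys/vm/drop_caches`. The function name `drop_caches_conflict` and its message format are made up for this example, and the two optional file arguments exist only so that the logic can be tried on copies of the files.

```shell
# Report whether HugePages is configured while drop_caches is non-zero.
# $1: meminfo file (default /proc/meminfo)
# $2: drop_caches file (default /proc/sys/vm/drop_caches)
drop_caches_conflict() {
    meminfo=${1:-/proc/meminfo}
    dropfile=${2:-/proc/sys/vm/drop_caches}
    pages=$(awk '/^HugePages_Total:/ { print $2 }' "$meminfo")
    drop=$(cat "$dropfile")
    if [ "${pages:-0}" -gt 0 ] && [ "${drop:-0}" -gt 0 ]; then
        echo "conflict: HugePages_Total=${pages:-0} vm.drop_caches=${drop:-0}"
    else
        echo "ok: HugePages_Total=${pages:-0} vm.drop_caches=${drop:-0}"
    fi
}
```

Run with no arguments, it inspects the live system.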

 

Solution


1. If HugePages is enabled, set vm.drop_caches=0
[root@localhost ~]# more /proc/sys/vm/drop_caches
3
[root@localhost ~]# sysctl -a | grep drop_caches
vm.drop_caches = 3
[root@localhost ~]# vi /etc/sysctl.conf

##skate add
vm.drop_caches=0

Make it take effect immediately:
[root@localhost ~]# sysctl -p

 

Verify that it took effect:
[root@localhost ~]# sysctl -a | grep drop_caches
vm.drop_caches = 0
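The sysctl setting does not stop a cron job or startup script from running `echo 3 > /proc/sys/vm/drop_caches`, which the Oracle note also lists as a trigger. The sketch below searches a set of files for uncommented references to drop_caches. The function name `drop_caches_writers` is hypothetical, the files to search are whatever you pass in, and `grep -H` is assumed to be available (it is in GNU grep).

```shell
# Print file:line:text for every non-comment line mentioning drop_caches.
# Arguments: the files to search.
drop_caches_writers() {
    # -H prints the file name even when only one file is given;
    # the second grep drops lines whose text starts with "#"
    grep -Hn 'drop_caches' "$@" 2>/dev/null |
        grep -v -E '^[^:]+:[0-9]+:[[:space:]]*#' || true
}
```

A typical call might be `drop_caches_writers /etc/crontab /etc/cron.d/* /var/spool/cron/* /etc/rc.local`, adjusted to wherever scheduled jobs live on your system.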

Or
2. Upgrade the Linux kernel to version 2.6.18-194.0.0.0.4.EL5

 

 

The official document is attached below:

 

ORA-600 [KGHLKREM1] On Linux Using Parameter drop_cache On hugepages Configuration [ID 1070812.1]

 

Applies to:

Oracle Server - Enterprise Edition - Version: 10.2.0.1 and later   [Release: 10.2 and later ]
Generic Linux

Symptoms


You are running an Oracle Database, single-instance or RAC. You have the SGA backed by hugepages.
 
You are getting the error

ORA-00600: internal error code, arguments: [KGHLKREM1], [0x06BC00020]
with stack trace similar to: kghnerror kghadd_reserved_ext kghgex

or also

ORA-07445: exception encountered: core dump
[kglhdal()+1105][SIGSEGV] [Address not mapped to object] [0x000000008] [] []

ORA-07445: exception encountered: core dump [kghfnd()+2328] [SIGSEGV]
[Address not mapped to object] [0xFFFFFFFFFFFFFFF0] [] []


and the SGA heap Dump of memory around the offending addr (in this particular example: 0x6bc00020)
it's showing zeroed out :

asm1_lmd0_8600.trc
~~~~~~~~~~~~~~~~~~
*** 2010-02-08 15:57:38.274
***** Internal heap ERROR KGHLKREM1 addr=0x6c400020 ds=0x60000058 *****
***** Dump of memory around addr 0x6c400020:
06C3FF020 00000000 00000000 00000000 00000000 [................]
Repeat 511 times





 

Changes

1. On your system you are running with vm.drop_caches=1 (or 3), drop_cache have been set to a value greater than zero , or you are executing

echo 3 > /proc/sys/vm/drop_caches


 

/proc/sys/vm/drop_caches (since Linux 2.6.16)
Writing to this file causes the kernel to drop clean caches, dentries and inodes from memory, causing that memory to become free.

To free pagecache:

* echo 1 > /proc/sys/vm/drop_caches

To free dentries and inodes:

* echo 2 > /proc/sys/vm/drop_caches

To free pagecache, dentries and inodes:

* echo 3 > /proc/sys/vm/drop_caches

As this is a non-destructive operation, and dirty objects are not freeable, the user should run "sync" first in order to make sure all cached objects are freed.


2. You have setup the Hugepages

Cause

This is a Linux Kernel issue.
Using the linux kernel "drop_cache" parameter and having the hugepages a memory corruption can occurs.

Per internal Bug 9461825, executing vm.drop_caches corrupts Oracle Database SGA hugepages;
it is fixed in Linux Kernel version 2.6.18-194.0.0.0.4.EL5


Solution

1.  As a workaround when hugepages are set avoid any vm.drop_cache settings.

OR

2.  Upgrade to Linux Kernel version 2.6.18-194.0.0.0.4.EL5


----------end-----------
