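Around midnight last night I received a monitoring SMS alert (dataguard is down), so I logged in to the system to find the cause.
I first checked the standby database's alert log; the entries for the most recent half hour all looked like this: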
........
Tue Feb 21 00:02:03 2012
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Tue Feb 21 00:02:03 2012
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Tue Feb 21 00:02:05 2012
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Tue Feb 21 00:02:05 2012
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Tue Feb 21 00:02:06 2012
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
.......
Looking further back in the alert log:
.....
Mon Feb 20 09:35:59 2012
Archived Log entry 10127 added for thread 1 sequence 11099 ID 0x263e89b dest 1:
Mon Feb 20 10:01:06 2012
RFS[6]: Selected log 13 for thread 1 sequence 11101 dbid 40093083 branch 760555291
Mon Feb 20 10:01:06 2012
Media Recovery Waiting for thread 1 sequence 11101 (in transit)
Recovery of Online Redo Log: Thread 1 Group 13 Seq 11101 Reading mem 0
Mem# 0: /oracle/oradata/skate01/standbyredo13.log
Mon Feb 20 10:01:14 2012
Archived Log entry 10128 added for thread 1 sequence 11100 ID 0x263e89b dest 1:
Mon Feb 20 10:03:58 2012
Errors in file /oracle/app/diag/rdbms/skate01/skate01/trace/skate01_ora_17783.trc (incident=264961):
ora-00600: internal error code, arguments: [KGHLKREM1], [0x838000020], [], [], [], [], [], [], [], [], [], []
Incident details in: /oracle/app/diag/rdbms/skate01/skate01/incident/incdir_264961/skate01_ora_17783_i264961.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Mon Feb 20 10:04:27 2012
Errors in file /oracle/app/diag/rdbms/skate01/skate01/trace/skate01_mmon_1590.trc (incident=264121):
ora-00600: internal error code, arguments: [KGHLKREM1], [0x838000020], [], [], [], [], [], [], [], [], [], []
Incident details in: /oracle/app/diag/rdbms/skate01/skate01/incident/incdir_264121/skate01_mmon_1590_i264121.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Mon Feb 20 10:04:29 2012
Restarting dead background process MMON
Mon Feb 20 10:04:29 2012
MMON started with pid=15, OS id=17808
Mon Feb 20 10:04:29 2012
Dumping diagnostic data in directory=[cdmp_20120220100429], requested by (instance=1, osid=1590 (MMON)), summary=[incident=264121].
Errors in file /oracle/app/diag/rdbms/skate01/skate01/trace/skate01_mmon_17808.trc (incident=264122):
ora-00600: internal error code, arguments: [KGHLKREM1], [0x838000020], [], [], [], [], [], [], [], [], [], []
Incident details in: /oracle/app/diag/rdbms/skate01/skate01/incident/incdir_264122/skate01_mmon_17808_i264122.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /oracle/app/diag/rdbms/skate01/skate01/trace/skate01_mmon_17808.trc (incident=264123):
ora-00600: internal error code, arguments: [KGHLKREM1], [0x838000020], [], [], [], [], [], [], [], [], [], []
Incident details in: /oracle/app/diag/rdbms/skate01/skate01/incident/incdir_264123/skate01_mmon_17808_i264123.trc
Dumping diagnostic data in directory=[cdmp_20120220100432], requested by (instance=1, osid=17808 (MMON)), summary=[incident=264122].
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Mon Feb 20 10:04:52 2012
.......
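Paging backwards through a large alert log by hand is slow. As a minimal sketch, assuming the 11g text alert log format shown above (a timestamp line followed by the message lines), the helper below prints the timestamp of the first ORA-00600 entry. The function name `first_ora600` is hypothetical, not an Oracle tool.

```shell
#!/bin/sh
# first_ora600 (hypothetical helper): read an 11g-style alert log on stdin
# and print the timestamp of the first entry that mentions ORA-00600.
first_ora600() {
    awk '
        # Remember the most recent timestamp line, e.g. "Mon Feb 20 10:03:58 2012".
        /^(Mon|Tue|Wed|Thu|Fri|Sat|Sun) [A-Z][a-z][a-z] [ 0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9] [0-9][0-9][0-9][0-9]$/ { ts = $0; next }
        # The excerpt above shows the code in lower case, so match case-insensitively.
        tolower($0) ~ /ora-00600/ { print ts; exit }
    '
}
```

It could be run as `first_ora600 < /oracle/app/diag/rdbms/skate01/skate01/trace/alert_skate01.log`. The directory is taken from the log excerpts above; the file name `alert_skate01.log` is an assumption based on the usual ADR naming.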
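The ORA-600 errors had already started at "Mon Feb 20 10:03:58 2012". First, check what state the database is in now and whether it is healthy.
1. $ ps -ef |grep $ORACLE_SID    # check whether the Oracle processes are healthy
2. $ netstat -an | grep 1588| wc -l    # check whether Oracle has any connections
3. Check the OS state: vmstat, top, iostat
These checks showed nothing abnormal. I remembered that on the 20th a project had been migrated onto this active standby, which might be related, so I tried to log in to the database to investigate further. The login failed with the following errors: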
[root@skate01 ~]# su - oracle
[oracle@skate01 ~]$ sqlplus "/as sysdba"
SQL*Plus: Release 11.2.0.2.0 Production on Tue Feb 21 00:26:12 2012
Copyright (c) 1982, 2010, Oracle. All rights reserved.
ERROR:
ORA-01075: you are currently logged on
Enter user-name:
ERROR:
ORA-01017: invalid username/password; logon denied
SP2-0157: unable to CONNECT to ORACLE after 3 attempts, exiting SQL*Plus
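Two attempts gave the same errors, so the database service had evidently hung and a restart was the only option. ORA-01075 is usually caused by a full disk or by auditing, but neither applied in my environment, so I killed the processes at the OS level with these two commands:
1. $ ps -ef |grep $ORACLE_SID|grep -v grep|awk '{print $2}' | xargs kill -9    # kill the processes
2. $ ipcs -m | grep oracle | awk '{print $2}' | xargs ipcrm shm    # remove Oracle's shared memory segments
First, list the processes to be killed: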
[oracle@skate01 ~]$ ps -ef |grep $ORACLE_SID|grep -v grep |grep -v avahi
Kill the processes:
[oracle@skate01 ~]$ ps -ef |grep $ORACLE_SID|grep -v grep |grep -v avahi |awk '{print $2}' | xargs kill -9
Killing only the Oracle processes is not enough; logging in to Oracle still fails.
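The grep-based pipeline above matches any process whose command line contains the SID, which is why avahi had to be filtered out by hand. A stricter sketch, assuming the usual Oracle process naming (`ora_<name>_<SID>` for background processes and `oracle<SID>` for dedicated server processes), is shown below. The function name `oracle_pids` is hypothetical, and the patterns should be checked against the actual `ps` output before anything is killed.

```shell
#!/bin/sh
# oracle_pids SID (hypothetical helper): read `ps -eo pid,args` output on stdin
# and print only the PIDs that belong to the given instance.
oracle_pids() {
    awk -v sid="$1" '
        # Skip the header line printed by ps.
        NR == 1 && $1 == "PID" { next }
        # Background processes look like ora_pmon_<SID>;
        # dedicated server processes look like oracle<SID> (LOCAL=NO).
        $2 ~ ("^ora_[a-z0-9]+_" sid "$") || $2 == ("oracle" sid) { print $1 }
    '
}
```

A possible invocation is `ps -eo pid,args | oracle_pids "$ORACLE_SID"`; review the list first, then pipe it to `xargs kill -9` as in the original command.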
List the shared memory segments to be removed:
[oracle@skate01 ~]$ ipcs -m | grep oracle | awk '{print $2}'
Remove the shared memory segments:
[oracle@skate01 ~]$ ipcs -m | grep oracle | awk '{print $2}' | xargs ipcrm shm
resource(s) deleted
[oracle@skate01 ~]$
[oracle@skate01 ~]$ ipcs -m | grep oracle | awk '{print $2}'
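The `grep oracle` filter above matches the string anywhere on the line, and `ipcrm shm <id>` is the older calling form, which current util-linux documents as deprecated in favour of `ipcrm -m <id>`. As a sketch under those assumptions, the helper below selects segment IDs by the owner column only. The name `shmids_owned_by` is hypothetical.

```shell
#!/bin/sh
# shmids_owned_by OWNER (hypothetical helper): read `ipcs -m` output on stdin
# and print the shmid of every segment whose owner column equals OWNER.
shmids_owned_by() {
    awk -v owner="$1" '
        # Data rows are: key shmid owner perms bytes nattch [status].
        # Header and separator lines have no numeric second field and are skipped.
        $3 == owner && $2 ~ /^[0-9]+$/ { print $2 }
    '
}
```

For example, `ipcs -m | shmids_owned_by oracle` lists the candidates, and appending `| xargs -n1 ipcrm -m` removes them one by one. Only do this after every process of the instance has exited.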
Try logging in to Oracle:
[oracle@skate01 ~]$ sqlplus "/as sysdba"
SQL*Plus: Release 11.2.0.2.0 Production on Tue Feb 21 00:47:36 2012
Copyright (c) 1982, 2010, Oracle. All rights reserved.
Connected to an idle instance.
SQL> startup nomount;
ORACLE instance started.
Total System Global Area 3.5275E+10 bytes
Fixed Size 2233656 bytes
Variable Size 3623881416 bytes
Database Buffers 3.1541E+10 bytes
Redo Buffers 108003328 bytes
SQL> alter database mount standby database;
Database altered.
SQL> alter database open read only;
Database altered.
SQL> alter database recover managed standby database disconnect using current logfile;
Database altered.
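Then I checked the alert log for anything abnormal and found it clean, confirmed that the OS layer was healthy, and logged in to the database to verify that Data Guard was healthy.
1. Time lag between the standby and the primary (run on the standby):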
select 'Last applied : ' Logs,
to_char(next_time, 'DD-MON-YY:HH24:MI:SS') Time
from v$archived_log
where sequence# =
(select max(sequence#) from v$archived_log where applied = 'YES')
union
select 'Last received : ' Logs,
to_char(next_time, 'DD-MON-YY:HH24:MI:SS') Time
from v$archived_log
where sequence# = (select max(sequence#) from v$archived_log);
2. Check process activity (run on the standby):
select process, status, thread#, sequence#, block#, blocks
from v$managed_standby;
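3. Check the log recovery rate:
select * from v$dataguard_status;
select * from v$recovery_progress;
With the database confirmed healthy, I went back to find out why it had gone down and why it reported ORA-600.
Check the trace file: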
[root@skate01 ~]# more /oracle/app/diag/rdbms/skate01/skate01/incident/incdir_264961/skate01_ora_17783_i264961.trc
Dump file /oracle/app/diag/rdbms/skate01/skate01/incident/incdir_264961/skate01_ora_17783_i264961.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
ORACLE_HOME = /oracle/app/product/11.2.0/db_1
System name: Linux
Node name: skate01
Release: 2.6.18-194.el5
Version: #1 SMP Fri Apr 2 14:58:14 EDT 2010
Machine: x86_64
Instance name: skate01
Redo thread mounted by this instance: 1
Oracle process number: 120
Unix process pid: 17783, image: oracle@skate01
*** 2012-02-20 10:03:58.215
*** SESSION ID:(17.5) 2012-02-20 10:03:58.215
*** CLIENT ID:() 2012-02-20 10:03:58.215
*** SERVICE NAME:(SYS$USERS) 2012-02-20 10:03:58.215
*** MODULE NAME:(JDBC Thin Client) 2012-02-20 10:03:58.215
*** ACTION NAME:() 2012-02-20 10:03:58.215
Dump continued from file: /oracle/app/diag/rdbms/skate01/skate01/trace/skate01_ora_17783.trc
ORA-00600: internal error code, arguments: [KGHLKREM1], [0x838000020], [], [], [], [], [], [], [], [], [], []
========= Dump for incident 264961 (ORA 600 [KGHLKREM1]) ========
----- Beginning of Customized Incident Dump(s) -----
***** Internal heap ERROR KGHLKREM1 addr=0x838000020 ds=0x60001188 *****
***** Dump of memory around addr 0x838000020:
837FFF020 00000000 00000000 00000000 00000000 [................]
Repeat 511 times
Recovery state: ds=0x60001188 rtn=(nil) *rtn=(nil) szo=0 u4o=0 hdo=0 off=0
Szo:
UB4o:
Hdo:
Off:
Hla: 0
******************************************************
HEAP DUMP heap name="sga heap" desc=0x60001188
extent sz=0x9800 alt=248 het=32767 rec=9 flg=-126 opc=4
parent=(nil) owner=(nil) nex=(nil) xsz=0x0 heap=(nil)
fl2=0x60, nex=(nil)
ds for latch 1: 0x600551d8 0x60056a30 0x60058288 0x60059ae0
ds for latch 2: 0x6005eaa0 0x600602f8 0x60061b50 0x600633a8
reserved granule count 12 (granule size 134217728)
----- End of Customized Incident Dump(s) -----
*** 2012-02-20 10:03:58.341
dbkedDefDump(): Starting incident default dumps (flags=0x2, level=3, mask=0x0)
----- Current SQL Statement for this session (sql_id=40p7rprfbt1as) -----
select 'a' from dual
----- Call Stack Trace -----
calling call entry argument values in hex
location type point (? means dubious value)
-------------------- -------- -------------------- ----------------------------
skdstdst()+36 call kgdsdst() 000000000 ? 000000000 ?
7FFF0B5CCD88 ? 000000001 ?
000000001 ? 000000002 ?
ksedst1()+98 call skdstdst() 000000000 ? 000000000 ?
7FFF0B5CCD88 ? 000000001 ?
000000000 ? 000000002 ?
ksedst()+34 call ksedst1() 000000000 ? 000000001 ?
7FFF0B5CCD88 ? 000000001 ?
000000000 ? 000000002 ?
dbkedDefDump()+2741 call ksedst() 000000000 ? 000000001 ?
7FFF0B5CCD88 ? 000000001 ?
000000000 ? 000000002 ?
ksedmp()+36 call dbkedDefDump() 000000003 ? 000000002 ?
7FFF0B5CCD88 ? 000000001 ?
000000000 ? 000000002 ?
ksfdmp()+64 call ksedmp() 000000003 ? 000000002 ?
7FFF0B5CCD88 ? 000000001 ?
000000000 ? 000000002 ?
dbgexPhaseII()+1764 call ksfdmp() 000000003 ? 000000002 ?
7FFF0B5CCD88 ? 000000001 ?
000000000 ? 000000002 ?
dbgexExplicitEndInc call dbgexPhaseII() 2B1892AF1710 ? 2B1892EA06A8 ?
()+750 7FFF0B5D88C0 ? 000000001 ?
000000000 ? 000000002 ?
dbgeEndDDEInvocatio call dbgexExplicitEndInc 2B1892AF1710 ? 2B1892EA06A8 ?
nImpl()+767 () 7FFF0B5D88C0 ? 000000001 ?
000000000 ? 000000002 ?
dbgeEndDDEInvocatio call dbgeEndDDEInvocatio 2B1892AF1710 ? 2B1892EA06A8 ?
n()+47 nImpl() 7FFF0B5D88C0 ? 000000001 ?
000000000 ? 000000002 ?
kghnerror()+394 call dbgeEndDDEInvocatio 2B1892AF1710 ? 2B1892EA06A8 ?
n() 7FFF0B5D88C0 ? 000000001 ?
000000000 ? 000000002 ?
kghadd_reserved_ext call kghnerror() 00B7CCEA0 ? 060001188 ?
ent()+945 00A0EF0C0 ? 838000020 ?
100000000 ? 000000002 ?
kghget_reserved_ext call kghadd_reserved_ext 00B7CCEA0 ? 060001188 ?
ent()+526 ent() 060059AE0 ? 060059B28 ?
000000000 ? 000000000 ?
kghgex()+1455 call kghget_reserved_ext 00B7CCEA0 ? 060004CD8 ?
ent() 060059AE0 ? 060059B28 ?
000000000 ? 000000000 ?
kghfnd()+734 call kghgex() 00B7CCEA0 ? 060004CD8 ?
060059AE0 ? 000001058 ?
000000000 ? 000000000 ?
kghalo()+536 call kghfnd() 00B7CCEA0 ? 000000000 ?
060004CD8 ? 000000000 ?
060059AE0 ? 7FFF0B5C95C0 ?
kghgex()+437 call kghalo() 00B7CCEA0 ? 060059AE0 ?
000001000 ? 000001000 ?
060059AE0 ? 060004CD8 ?
kghalf()+395 call kghgex() 00B7CCEA0 ? 000000000 ?
85DD58D08 ? 000000FD0 ?
060059AE0 ? 060004CD8 ?
kksLoadChild()+2785 call kghalf() 00B7CCEA0 ? 85DD58D08 ?
000000001 ? 060004CD8 ?
000000000 ? 009A98F10 ?
kxsGetRuntimeLock() call kksLoadChild() 00B7CCEA0 ? 88FD354E0 ?
+2061 7FFF0B5DB5B0 ? 2B1892F59070 ?
85DD586F8 ? 000000000 ?
kksfbc()+14522 call kxsGetRuntimeLock() 00B7CCEA0 ? 2B1892F59070 ?
7FFF0B5DB5B0 ? 2B1892F59070 ?
85DD586F8 ? 88FD354E0 ?
kkspsc0()+2020 call kksfbc() 2B1892F59070 ? 000000003 ?
000000108 ? 7FFF0B5DD6F8 ?
000000015 ? 000000000 ?
kksParseCursor()+13 call kkspsc0() 2B1892F41BB8 ? 7FFF0B5DD6F8 ?
9 000000015 ? 000000003 ?
000000006 ? 0000000A4 ?
opiosq0()+2022 call kksParseCursor() 7FFF0B5DC0D0 ? 7FFF0B5DD6F8 ?
000000015 ? 000000003 ?
000000006 ? 0000000A4 ?
kpooprx()+269 call opiosq0() 000000003 ? 00000000E ?
7FFF0B5DC2A0 ? 0000000A4 ?
000000000 ? 7FFF0B5DBFB0 ?
kpoal8()+795 call kpooprx() 7FFF0B5DF694 ? 7FFF0B5DD6F8 ?
000000014 ? 000000001 ?
000000000 ? 7FFF0B5DBFB0 ?
opiodr()+910 call kpoal8() 00000005E ? 00000001C ?
7FFF0B5DF690 ? 000000001 ?
000000000 ? 000000001 ?
ttcpip()+2289 call opiodr() 00000005E ? 00000001C ?
7FFF0B5DF690 ? 000000000 ?
0098A1530 ? 000000001 ?
opitsk()+1665 call ttcpip() 00B7E2B10 ? 00923BB90 ?
7FFF0B5DF690 ? 000000000 ?
7FFF0B5DF0F0 ? 7FFF0B5DF888 ?
opiino()+961 call opitsk() 00B7E2B10 ? 000000001 ?
7FFF0B5DF690 ? 000000000 ?
7FFF0B5DF0F0 ? 7FFF0B5DF888 ?
opiodr()+910 call opiino() 00000003C ? 000000004 ?
7FFF0B5E0E18 ? 000000000 ?
7FFF0B5DF0F0 ? 7FFF0B5DF888 ?
opidrv()+565 call opiodr() 00000003C ? 000000004 ?
7FFF0B5E0E18 ? 000000000 ?
0098A0FE0 ? 7FFF0B5DF888 ?
sou2o()+98 call opidrv() 00000003C ? 000000004 ?
7FFF0B5E0E18 ? 000000000 ?
0098A0FE0 ? 7FFF0B5DF888 ?
opimai_real()+128 call sou2o() 7FFF0B5E0DF0 ? 00000003C ?
000000004 ? 7FFF0B5E0E18 ?
0098A0FE0 ? 7FFF0B5DF888 ?
ssthrdmain()+252 call opimai_real() 000000002 ? 7FFF0B5E0FE0 ?
000000004 ? 7FFF0B5E0E18 ?
0098A0FE0 ? 7FFF0B5DF888 ?
main()+196 call ssthrdmain() 000000002 ? 7FFF0B5E0FE0 ?
000000001 ? 000000000 ?
0098A0FE0 ? 7FFF0B5DF888 ?
__libc_start_main() call main() 000000002 ? 7FFF0B5E1188 ?
+244 000000001 ? 000000000 ?
0098A0FE0 ? 7FFF0B5DF888 ?
_start()+36 call __libc_start_main() 000A07368 ? 000000002 ?
7FFF0B5E1178 ? 000000000 ?
0098A0FE0 ? 000000002 ?
--------------------- Binary Stack Dump ---------------------
Looking still further back in the alert log, I found that it had also reported ORA-07445:
Tue Jan 17 08:42:12 2012
Archived Log entry 7472 added for thread 1 sequence 8444 ID 0x263e89b dest 1:
Tue Jan 17 09:00:14 2012
Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0x8] [PC:0xB0997A, ksmdscan_internal()+82] [flags: 0x0, count: 1]
Errors in file /oracle/app/diag/rdbms/skate01/skate01/trace/skate01_ora_25574.trc (incident=264155):
ora-07445: exception encountered: core dump [ksmdscan_internal()+82] [SIGSEGV] [ADDR:0x8] [PC:0xB0997A] [Address not mapped to object] []
Incident details in: /oracle/app/diag/rdbms/skate01/skate01/incident/incdir_264155/skate01_ora_25574_i264155.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Tue Jan 17 09:00:21 2012
Dumping diagnostic data in directory=[cdmp_20120117090021], requested by (instance=1, osid=25574), summary=[incident=264155].
Tue Jan 17 09:00:22 2012
Sweep [inc][264155]: completed
Sweep [inc2][264155]: completed
Tue Jan 17 09:06:08 2012
Media Recovery Waiting for thread 1 sequence 8446
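I then consulted Oracle document "ID 1070812.1" and found that the problem is related to my having HugePages enabled.
When vm.drop_caches is set to a value greater than 0 and HugePages are enabled, the two conflict: drop_caches tries to release memory, while HugePages hold on to memory.
Reference: http://blog.csdn.net/wyzxg/article/details/7279986
Solution
1. If HugePages are enabled, set vm.drop_caches=0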
[root@localhost ~]# more /proc/sys/vm/drop_caches
3
[root@localhost ~]# sysctl -a | grep drop_caches
vm.drop_caches = 3
[root@localhost ~]# vi /etc/sysctl.conf
##skate add
vm.drop_caches=0
Apply it immediately:
[root@localhost ~]# sysctl -p
Verify that it took effect:
[root@localhost ~]# sysctl -a | grep drop_caches
vm.drop_caches = 0
or
2. Upgrade the Linux kernel to version 2.6.18-194.0.0.0.4.EL5
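To find other hosts with the same combination, the two conditions from the note (HugePages configured and a non-zero drop_caches value) can be checked together. The following is only a sketch: the function names `hugepages_total` and `at_risk` are hypothetical, and the files are passed as arguments so the logic can be tried on saved copies before pointing it at /proc.

```shell
#!/bin/sh
# hugepages_total (hypothetical helper): read /proc/meminfo-style text on stdin
# and print the HugePages_Total value, or 0 if the field is absent.
hugepages_total() {
    awk '$1 == "HugePages_Total:" { print $2; found = 1 }
         END { if (!found) print 0 }'
}

# at_risk MEMINFO_FILE DROP_CACHES_FILE (hypothetical helper): exit status 0
# when HugePages are configured AND the drop_caches value is greater than zero.
at_risk() {
    hp=$(hugepages_total < "$1")
    dc=$(cat "$2")
    [ "$hp" -gt 0 ] && [ "$dc" -gt 0 ]
}
```

On a live host this might be used as `at_risk /proc/meminfo /proc/sys/vm/drop_caches && echo "HugePages and drop_caches are both active"`. It does not check the kernel version, so a host that already runs the fixed kernel would still be flagged.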
The official document is attached below:
ORA-600 [KGHLKREM1] On Linux Using Parameter drop_cache On hugepages Configuration [ID 1070812.1]
Oracle Server - Enterprise Edition - Version: 10.2.0.1 and later [Release: 10.2 and later ]
Generic Linux
You are running an Oracle Database, single-instance or RAC. You have the SGA backed by hugepages.
You are getting the error
ORA-00600: internal error code, arguments: [KGHLKREM1], [0x06BC00020]
with stack trace similar to: kghnerror kghadd_reserved_ext kghgex
or also
ORA-07445: exception encountered: core dump
[kglhdal()+1105][SIGSEGV] [Address not mapped to object] [0x000000008] [] []
ORA-07445: exception encountered: core dump [kghfnd()+2328] [SIGSEGV]
[Address not mapped to object] [0xFFFFFFFFFFFFFFF0] [] []
and the SGA heap Dump of memory around the offending addr (in this particular example: 0x6bc00020)
it's showing zeroed out :
asm1_lmd0_8600.trc
~~~~~~~~~~~~~~~~~~
*** 2010-02-08 15:57:38.274
***** Internal heap ERROR KGHLKREM1 addr=0x6c400020 ds=0x60000058 *****
***** Dump of memory around addr 0x6c400020:
06C3FF020 00000000 00000000 00000000 00000000 [................]
Repeat 511 times
1. On your system you are running with vm.drop_caches=1 (or 3), drop_cache have been set to a value greater than zero , or you are executing
echo 3 > /proc/sys/vm/drop_caches
2. You have setup the Hugepages
This is a Linux Kernel issue.
Using the linux kernel "drop_cache" parameter and having the hugepages a memory corruption can occurs.
Per internal Bug 9461825, executing vm.drop_caches corrupts Oracle Database SGA hugepages;
it is fixed in Linux Kernel version 2.6.18-194.0.0.0.4.EL5
1. As a workaround when hugepages are set avoid any vm.drop_cache settings.
OR
2. Upgrade to Linux Kernel version 2.6.18-194.0.0.0.4.EL5
----------end-----------