故障描述:
数据库版本:11.2.0.3.0(单实例) 操作系统版本:redhat linux 6.5
数据库在凌晨3点自动关闭,同事发现故障后,立马手动startup开启了数据库,开启后数据库运行正常。
故障时数据库alert日志如下:
Wed Oct 10 02:51:47 2018
Thread 1 advanced to log sequence 75978 (LGWR switch)
Current log# 5 seq# 75978 mem# 0: /u01/app/oracle/oradata/XXX/redo05.log
Wed Oct 10 02:51:52 2018
Archived Log entry 149289 added for thread 1 sequence 75977 ID 0xbca5e435 dest 1:
Wed Oct 10 03:00:05 2018
ALTER SYSTEM ARCHIVE LOG
Wed Oct 10 03:00:05 2018
Thread 1 advanced to log sequence 75979 (LGWR switch)
Current log# 2 seq# 75979 mem# 0: /u01/app/oracle/oradata/XXX/redo02.log
Archived Log entry 149291 added for thread 1 sequence 75978 ID 0xbca5e435 dest 1:
Wed Oct 10 03:06:34 2018
Exception [type: SIGSEGV, SI_KERNEL(general_protection)] [ADDR:0x0] [PC:0x17E86CF, opiaba()+639] [flags: 0x0, count: 1]
Errors in file /u01/app/oracle/diag/rdbms/XXX1/XXX/trace/XXX_ora_48752.trc (incident=113956):
ORA-07445: 出现异常错误: 核心转储 [opiaba()+639] [SIGSEGV] [ADDR:0x0] [PC:0x17E86CF] [SI_KERNEL(general_protection)] []
Incident details in: /u01/app/oracle/diag/rdbms/XXX1/XXX/incident/incdir_113956/XXX_ora_48752_i113956.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Wed Oct 10 03:06:48 2018
Dumping diagnostic data in directory=[cdmp_20181010030648], requested by (instance=1, osid=48752), summary=[incident=113956].
Wed Oct 10 03:12:03 2018
Thread 1 advanced to log sequence 75980 (LGWR switch)
Current log# 6 seq# 75980 mem# 0: /u01/app/oracle/oradata/XXX/redo06.log
Wed Oct 10 03:12:15 2018
Archived Log entry 149293 added for thread 1 sequence 75979 ID 0xbca5e435 dest 1:
Wed Oct 10 03:17:09 2018
Process 0x0x26c1c99fc8 appears to be hung while dumping
Current time = 1550270160, process death time = 1550208043 interval = 60000
Attempting to kill process 0x0x26c1c99fc8 with OS pid = 48752
OSD kill succeeded for process 0x26c1c99fc8
Errors in file /u01/app/oracle/diag/rdbms/XXX1/XXX/trace/XXX_pmon_4152.trc (incident=112148):
ORA-00600: internal error code, arguments: [17147], [0x2581886088], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/XXX1/XXX/incident/incdir_112148/XXX_pmon_4152_i112148.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /u01/app/oracle/diag/rdbms/XXX1/XXX/trace/XXX_pmon_4152.trc:
ORA-00600: internal error code, arguments: [17147], [0x2581886088], [], [], [], [], [], [], [], [], [], []
PMON (ospid: 4152): terminating the instance due to error 472
Wed Oct 10 03:17:12 2018
System state dump requested by (instance=1, osid=4152 (PMON)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/XXX1/XXX/trace/XXX_diag_4164.trc
Wed Oct 10 03:17:12 2018
opiodr aborting process unknown ospid (50967) as a result of ORA-1092
Wed Oct 10 03:17:12 2018
opiodr aborting process unknown ospid (50979) as a result of ORA-1092
Wed Oct 10 03:17:12 2018
opiodr aborting process unknown ospid (53178) as a result of ORA-1092
Wed Oct 10 03:17:12 2018
ORA-1092 : opitsk aborting process
Wed Oct 10 03:17:12 2018
ORA-1092 : opitsk aborting process
Wed Oct 10 03:17:13 2018
opiodr aborting process unknown ospid (53176) as a result of ORA-1092
Wed Oct 10 03:17:13 2018
ORA-1092 : opitsk aborting process
Wed Oct 10 03:17:13 2018
opiodr aborting process unknown ospid (53174) as a result of ORA-1092
Wed Oct 10 03:17:13 2018
ORA-1092 : opitsk aborting process
Wed Oct 10 03:17:14 2018
ORA-1092 : opitsk aborting process
Instance terminated by PMON, pid = 4152
Wed Oct 10 03:57:17 2018
Adjusting the default value of parameter parallel_max_servers
from 1600 to 985 due to the value of parameter processes (1000)
Starting ORACLE instance (normal)
****************** Large Pages Information *****************
通过alert日志我们可以看出,
数据库在3:00:05的时候还是正常运行的。
在03:06:34报了第一个错误ORA-07445[opiaba()+639],
在03:17:09报了第二个错误ORA-00600[17147],紧接着pmon进程中止了实例(PMON (ospid: 4152): terminating the instance due to error 472),
在03:57:17,同事发现数据库故障后,手动开启了数据库Starting ORACLE instance (normal)。开启后数据库运行正常,alert日志也未出现报错。
故障分析:
1,数据库是因为pmon进程中止的,先查看pmon进程的trace文件。(/u01/app/oracle/diag/rdbms/XXX1/XXX/trace/XXX_pmon_4152.trc)
…………
Process 0x0x26c1c99fc8 appears to be hung while dumping
Current time = 1550270160, process death time = 1550208043 interval = 60000
Attempting to kill process 0x0x26c1c99fc8 with OS pid = 48752
OSD kill succeeded for process 0x26c1c99fc8
KGX cleanup...
KGX Atomic Operation Log 0x1ef541bc0
Mutex 0x25a26eace0(1095, 0) idn 96fd9677 oper EXCL
Cursor Parent uid 1095 efd 11 whr 17 slp 0
oper=OPERATION_DEFAULT pt1=0x25afcf4400 pt2=0x25a26eac68 pt3=(nil)
pt4=(nil) u41=0 stt=0
Incident 112148 created, dump file: /u01/app/oracle/diag/rdbms/XXX1/XXX/incident/incdir_112148/XXX_pmon_4152_i112148.trc
ORA-00600: internal error code, arguments: [17147], [0x2581886088], [], [], [], [], [], [], [], [], [], []
PMON: fatal error while deleting s.o. 0x1782b9760 in this tree:
----------------------------------------
…………
…………
SO: 0x1782b9760, type: 78, owner: 0x26e1f01fb8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
proc=0x26c1c99fc8, name=LIBRARY OBJECT LOCK, file=kgl.h LINE:8547, pg=0
LibraryObjectLock: Address=0x1782b9760 Handle=0x280f602980 Mode=N CanBeBrokenCount=1 Incarnation=1 ExecutionCount=0
ClusterLock=0x178588100 Context=0x7fb6cc0cb790 User=0x26e1f01fb8 Session=0x26e1f01fb8 ReferenceCount=1
Flags=BRO/CBK/[0020] SavepointNum=0
LibraryHandle: Address=0x280f602980 Hash=0 LockMode=N PinMode=X LoadLockMode=0 Status=INVL
Name: Namespace=SQL AREA(00) Type=CURSOR(00)
Statistics: InvalidationCount=1 ExecutionCount=0 LoadCount=1 ActiveLocks=1 TotalLockCount=1 TotalPinCount=2
Counters: BrokenCount=2 RevocablePointer=2 KeepDependency=0 BucketInUse=0 HandleInUse=0 HandleReferenceCount=0
Concurrency: DependencyMutex=0x280f602a30(0, 0, 0, 0) Mutex=0x1b0564720(0, 24, 0, 0)
Flags=RON/PIN/PN0/EXP/CHD/[10010111]
WaitersLists:
Lock=0x280f602a10[0x280f602a10,0x280f602a10]
Pin=0x280f6029f0[0x280f6029f0,0x280f6029f0]
LoadLock=0x280f602a68[0x280f602a68,0x280f602a68]
LibraryObject: Address=0x25bba25230 HeapMask=0000-0001-0001-0000
DataBlocks:
Block: #='0' name=KGLH0^96fd9677 pins=0 Change=NONE
Heap=0x280a78ad78 Pointer=0x25bba252d0 Extent=0x25bba251b0 Flags=I/-/P/A/-/-
FreedLocation=0 Alloc=1.875000 Size=4.000000 LoadTime=15502013660
NamespaceDump:
Child Cursor: Heap0=0x25bba252d0 Heap6=0x280f602980 Heap0 Load Time=10-10-2018 03:05:41 Heap6 Load Time=10-10-2018 03:05:41 ----------------------------------------
KGX Atomic Operation Log 0x178588100
Mutex 0x25a26eb1c0(1095, 0) idn 96fd9677 oper LONG_EXCL
Cursor Pin uid 1095 efd 0 whr 1 slp 0
opr=3 pso=0x1782b9760 flg=0
pcs=0x25a26eb128 nxt=(nil) flg=31 cld=0 hd=0x280f602980 par=0x25a26eac20
ct=0 hsh=0 unp=(nil) unn=0 hvl=a26ebac8 nhv=0 ses=(nil)
hep=0x25a26eb1c0 flg=80 ld=1 ob=0x25bba25230 ptr=0x2446d95de0 fex=0x2446d95188
…………
…………
这里可以看到,pmon进程kill process(OS pid = 48752),在做KGX cleanup...的时候发生致命的错误:PMON: fatal error while deleting s.o. 0x1782b9760 in this tree。也就是ORA-00600: internal error code, arguments: [17147], [0x2581886088], [], [], [], [], [], [], [], [], [], []该错误。
由于pmon是为了回收OS pid = 48752进程资源时,出现的错误。查看OS pid = 48752进程出现dead的原因,查看报错的trace文件:/u01/app/oracle/diag/rdbms/XXX1/XXX/incident/incdir_113956/XXX_ora_48752_i113956.trc
*** 2018-10-10 03:06:34.474
*** SESSION ID:(1095.5747) 2018-10-10 03:06:34.474
*** CLIENT ID:() 2018-10-10 03:06:34.474
*** SERVICE NAME:(XXX) 2018-10-10 03:06:34.474
*** MODULE NAME:(XXXX.Service.Console.exe) 2018-10-10 03:06:34.474
*** ACTION NAME:() 2018-10-10 03:06:34.474
Dump continued from file: /u01/app/oracle/diag/rdbms/XXX1/XXX/trace/XXX_ora_48752.trc
ORA-07445: 出现异常错误: 核心转储 [opiaba()+639] [SIGSEGV] [ADDR:0x0] [PC:0x17E86CF] [SI_KERNEL(general_protection)] []
========= Dump for incident 113956 (ORA 7445 [opiaba()+639]) ========
----- Beginning of Customized Incident Dump(s) -----
Exception [type: SIGSEGV, SI_KERNEL(general_protection)] [ADDR:0x0] [PC:0x17E86CF, opiaba()+639] [flags: 0x0, count: 1]
Registers:
%rax: 0x00000025a26eb540 %rbx: 0x0000002449f5fd00 %rcx: 0x0000000000005000
%rdx: 0x0000000000000000 %rdi: 0x0000000000000400 %rsi: 0x0000000000000400
%rsp: 0x00007fff86989b50 %rbp: 0x00007fff86989bc0 %r8: 0x0000002581884168
%r9: 0x0000000000000059 %r10: 0x0000000000000b33 %r11: 0x00000025a26eb4c0
%r12: 0x00000025afe69330 %r13: 0x0000000000000000 %r14: 0x0200000025a26eb5
%r15: 0x0000000000000005 %rip: 0x00000000017e86cf %efl: 0x0000000000010202
*** 2018-10-10 03:06:34.664
opiaba()+625 (0x17e86c1) jmp 0x17e86fe
opiaba()+627 (0x17e86c3) movzbl 0x2c(%rbx),%r15d
opiaba()+632 (0x17e86c8) mov %r14,-0x60(%rbp)
opiaba()+636 (0x17e86cc) mov %rax,%r14
> opiaba()+639 (0x17e86cf) movswq 0xa(%r14),%rcx
opiaba()+644 (0x17e86d4) cmp %ecx,%r15d
opiaba()+647 (0x17e86d7) jne 0x17e86f2
opiaba()+649 (0x17e86d9) mov 0x18(%rbx),%rdi
opiaba()+653 (0x17e86dd) lea 0xc(%r14),%rsi
*** 2018-10-10 03:06:34.664
dbkedDefDump(): Starting a non-incident diagnostic dump (flags=0x3, level=3, mask=0x0)
----- Current SQL Statement for this session (sql_id=fpthk5ubgv5mr) -----
BEGIN insert into TB_1 (col1,…………,col100) values (:p1,…… ……,:p100);
insert into TB_1 (col1,…… ……,col100) values (:p101,…… ……,:p200);
…… ……
…… ……
insert into TB_1 (col1,…… ……,col100) values (:p29901,…… ……,:p30000);
insert into TB_2 (col1,…… ……,col100) values (:p30001,…… ……,:p30100);
…… ……
…… ……
insert into TB_2 (col1,…… ……,col100) values (:p65901,…… ……,:p66000);END;
----- Call Stack Trace -----
calling call entry argument values in hex
…… ……
…… ……
这时在trace文件中,我们可以看到故障进程的SESSION ID,MODULE NAME,sql_id=fpthk5ubgv5mr等信息。通过sql_id这里可以看出非常奇怪的现象,(案例中所有的信息都脱敏处理,但关键信息保留)从begin 到end的执行体中,执行的是带有绑定变量的insert写法。但是绑定变量竟然高达6万多个。这里不禁怀疑是否是因为触发了数据库bug造成问题,(正常情况下sql语句不会有6万多个绑定变量)
通过查询MOS,发现和如下bug相匹配:
这里提供的解决办法是打patch 12578873补丁包,但是补丁包仅仅是解决超过65535个绑定变量不会造成实例的中止,数据库仍然不支持大于65535个绑定变量。
最终解决
找出出现问题时的sql语句,和程序名。反馈给研发同事。研发同事改写应用程序,不适用超过65535个绑定变量,问题得到解决。