0. Summary
1. Symptoms
2. Analysis
   2.1 SGA parameter settings
   2.2 Large pool size and automatic resizing
   2.3 Parallel parameters
   2.4 Detailed alert log analysis
   2.5 Shared pool size
3. Recommendations
1. Symptoms
#### alert log ####
Sat Feb 04 02:08:41 2017
Memory Notification: Library Cache Object loaded into SGA
Heap size 51201K exceeds notification threshold (51200K)
Details in trace file /app/oracle/diag/rdbms/noap/noap/trace/noap_j000_4031.trc
KGL object name :SZ1X.IN_JS_CDR_HW_AC_TI
Memory Notification: Library Cache Object loaded into SGA
Heap size 331660K exceeds notification threshold (51200K)
Details in trace file /app/oracle/diag/rdbms/noap/noap/trace/noap_j000_4031.trc
KGL object name :alter table MOD_JS_CDR_HW drop partition SYS_P1404186
Sat Feb 04 02:09:20 2017
TABLE SZ1X.MOD_CDR_HW: ADDED INTERVAL PARTITION SYS_P1404388 (47883) VALUES LESS THAN (TO_DATE(' 2017-02-04 03:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN'))
Sat Feb 04 02:14:22 2017
Thread 1 advanced to log sequence 603901 (LGWR switch)
Current log# 1 seq# 603901 mem# 0: /app/oracle/oradata/noap/redo01.log
Sat Feb 04 02:32:34 2017
Errors in file /app/oracle/diag/rdbms/noap/noap/trace/noap_p085_5172.trc (incident=109601):
ORA-04031: unable to allocate 2048024 bytes of shared memory ("large pool","unknown object","large pool","PX msg pool")
Incident details in: /app/oracle/diag/rdbms/noap/noap/incident/incdir_109601/noap_p085_5172_i109601.trc
Sat Feb 04 02:32:34 2017
Errors in file /app/oracle/diag/rdbms/noap/noap/trace/noap_p034_5070.trc (incident=101439):
ORA-04031: unable to allocate 2048024 bytes of shared memory ("large pool","unknown object","large pool","PX msg pool")
Incident details in: /app/oracle/diag/rdbms/noap/noap/incident/incdir_101439/noap_p034_5070_i101439.trc
Sat Feb 04 02:32:34 2017
Errors in file /app/oracle/diag/rdbms/noap/noap/trace/noap_p092_5186.trc (incident=109832):
ORA-04031: unable to allocate 2048024 bytes of shared memory ("large pool","unknown object","large pool","PX msg pool")
Incident details in: /app/oracle/diag/rdbms/noap/noap/incident/incdir_109832/noap_p092_5186_i109832.trc
Sat Feb 04 02:32:34 2017
Errors in file /app/oracle/diag/rdbms/noap/noap/trace/noap_p051_5104.trc (incident=107372):
ORA-04031: unable to allocate 2048024 bytes of shared memory ("large pool","unknown object","large pool","PX msg pool")
Incident details in: /app/oracle/diag/rdbms/noap/noap/incident/incdir_107372/noap_p051_5104_i107372.trc
......
#### noap_p085_5172_i109601.trc ####
Dump continued from file: /app/oracle/diag/rdbms/noap/noap/trace/noap_p085_5172.trc
ORA-04031: unable to allocate 2048024 bytes of shared memory ("large pool","unknown object","large pool","PX msg pool")
========= Dump for incident 109601 (ORA 4031) ========
*** 2017-02-04 02:32:34.673
dbkedDefDump(): Starting incident default dumps (flags=0x2, level=3, mask=0x0)
----- Current SQL Statement for this session (sql_id=385pbhfh4g7rn) -----
insert /*+append*/ into c_cdr_railway_huning
select /*+full(t) parallel(64)*/
RELEASE_CAUSE AS 呼叫释放原因
ACCESS_TIME AS 接入时刻
......
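The alert log shows ORA-04031 errors. Judging from the error text, the direct cause is that parallel execution ran the large pool out of memory: every failing allocation is for "PX msg pool" in the large pool.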
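To check that every ORA-04031 in the log really points at the same pool, the errors can be tallied by pool and allocation comment. The following is a minimal sketch rather than a standard tool: it assumes the alert log lines are already available as strings, and `count_4031` is a hypothetical helper name. It keys only on the four quoted fields at the end of the message, so it also works when the message text itself is in another language or mis-encoded.

```python
import re
from collections import Counter

# Matches the four quoted fields Oracle appends to ORA-04031, e.g.
# ("large pool","unknown object","large pool","PX msg pool")
ORA_4031 = re.compile(
    r'ORA-04031:.*?\("([^"]*)","([^"]*)","([^"]*)","([^"]*)"\)'
)

def count_4031(lines):
    """Count ORA-04031 occurrences by (pool, allocation comment)."""
    counts = Counter()
    for line in lines:
        m = ORA_4031.search(line)
        if m:
            # Field 1 is the pool, field 4 the allocation comment.
            counts[(m.group(1), m.group(4))] += 1
    return counts

# Sample lines taken from the alert log excerpt above.
sample = [
    'Sat Feb 04 02:32:34 2017',
    'ORA-04031: unable to allocate 2048024 bytes of shared memory '
    '("large pool","unknown object","large pool","PX msg pool")',
    'ORA-04031: unable to allocate 2048024 bytes of shared memory '
    '("large pool","unknown object","large pool","PX msg pool")',
]

for key, n in count_4031(sample).items():
    print(key, n)
```

If more than one (pool, comment) pair shows up, each pair is worth analyzing separately, since shared pool and large pool shortages have different causes.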
2. Analysis
2.1 SGA parameter settings
SQL> show parameter sga
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
lock_sga boolean FALSE
pre_page_sga boolean FALSE
sga_max_size big integer 32G
sga_target big integer 32G
SQL> show parameter db_cache
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
db_cache_advice string ON
db_cache_size big integer 22G
SQL> show parameter shared_pool
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
shared_pool_reserved_size big integer 510027366
shared_pool_size big integer 8G
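The SGA is currently managed automatically by ASMM (sga_target is 32G). db_cache_size (22G) and shared_pool_size (8G) are set explicitly, so under ASMM they act as minimum sizes and leave about 2G for all other components.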
2.2 Large pool size and automatic resizing
SQL> show parameter large
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
large_pool_size big integer 0
use_large_pages string TRUE
SQL> select t.*
2 from (select name,
3 bytes / (1024 * 1024) "MB",
4 round(bytes / (select value
5 from v$parameter t
6 where t.name = 'shared_pool_size') * 100,
7 2) || '%' "USED%"
8 from v$sgastat
9 where pool = 'large pool'
10 order by 2 desc) t
11 where rownum < 20;
NAME MB USED%
-------------------------- ---------- -----------------------------------------
free memory 119.8125 1.46%
PX msg pool 7.8125 .1%
ASM map operations hashta .375 0%
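The SGA is managed by ASMM and no minimum is set for the large pool (large_pool_size is 0). Usage looks normal at the moment. Because memory is managed automatically, the large pool can be squeezed whenever other components are resized. Note that the USED% column in the query above divides by shared_pool_size (8G), so it is not a percentage of the large pool itself.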
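Since the query's percentage is taken against shared_pool_size, the share of each component within the large pool itself has to be recomputed from the MB column. A small sketch of that arithmetic, using the three rows shown above (`pool_shares` is a hypothetical helper name, and the sketch assumes these three rows are the whole large pool):

```python
def pool_shares(components_mb):
    """Return each component's share of the pool total, in percent."""
    total = sum(components_mb.values())
    return {
        name: round(mb / total * 100, 2)
        for name, mb in components_mb.items()
    }

# MB values copied from the v$sgastat output above.
large_pool_mb = {
    "free memory": 119.8125,
    "PX msg pool": 7.8125,
    "ASM map operations hashta": 0.375,
}

total_mb = sum(large_pool_mb.values())
print("large pool total (MB):", total_mb)
for name, pct in pool_shares(large_pool_mb).items():
    print(f"{name}: {pct}%")
```

The rows add up to 128 MB, which matches the size the large pool is shrunk to in the resize history, so at the time of the query the pool was at its small size and almost entirely free.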
SQL> select start_time,
2 component,
3 oper_type,
4 oper_mode,
5 initial_size / 1024 / 1024 "INITIAL",
6 final_size / 1024 / 1024 "FINAL",
7 end_time
8 from v$sga_resize_ops
9 where component in ('large pool')
10 order by start_time, component;
START_TIME COMPONENT OPER_TYPE OPER_MODE INITIAL FINAL END_TIME
------------------- ------------------------- ------------- --------- -------------------- -------------------- -------------------
30/01/2017 03:02:01 large pool GROW IMMEDIATE 192 256 30/01/2017 03:02:03
30/01/2017 03:02:01 large pool GROW IMMEDIATE 192 256 30/01/2017 03:02:03
30/01/2017 03:02:01 large pool GROW IMMEDIATE 192 256 30/01/2017 03:02:02
30/01/2017 03:02:01 large pool GROW IMMEDIATE 192 256 30/01/2017 03:02:02
30/01/2017 03:02:01 large pool GROW IMMEDIATE 192 256 30/01/2017 03:02:02
30/01/2017 03:02:01 large pool GROW IMMEDIATE 192 256 30/01/2017 03:02:02
30/01/2017 03:02:01 large pool GROW IMMEDIATE 192 256 30/01/2017 03:02:02
30/01/2017 03:02:01 large pool GROW IMMEDIATE 192 256 30/01/2017 03:02:02
30/01/2017 03:02:01 large pool GROW IMMEDIATE 192 256 30/01/2017 03:02:02
......
04/02/2017 02:32:32 large pool GROW IMMEDIATE 320 384 04/02/2017 02:32:33
04/02/2017 02:32:32 large pool GROW IMMEDIATE 320 384 04/02/2017 02:32:33
04/02/2017 02:32:32 large pool GROW IMMEDIATE 320 384 04/02/2017 02:32:33
04/02/2017 02:32:32 large pool GROW IMMEDIATE 320 384 04/02/2017 02:32:33
04/02/2017 02:35:47 large pool SHRINK DEFERRED 384 128 04/02/2017 02:35:47
......
04/02/2017 03:01:56 large pool GROW IMMEDIATE 320 384 04/02/2017 03:01:57
04/02/2017 03:01:56 large pool GROW IMMEDIATE 320 384 04/02/2017 03:01:57
04/02/2017 03:01:56 large pool GROW IMMEDIATE 320 384 04/02/2017 03:01:57
04/02/2017 03:01:56 large pool GROW IMMEDIATE 320 384 04/02/2017 03:01:57
04/02/2017 03:01:56 large pool GROW IMMEDIATE 320 384 04/02/2017 03:01:57
04/02/2017 03:01:56 large pool GROW IMMEDIATE 320 384 04/02/2017 03:01:57
04/02/2017 03:04:24 large pool SHRINK DEFERRED 384 128 04/02/2017 03:04:24
The output shows that the large pool grows and shrinks fairly frequently.
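The grow/shrink pattern is easier to see once the repeated rows are collapsed. The sketch below assumes the v$sga_resize_ops output has been saved as text in the column layout shown above; `summarize` and `count_flips` are hypothetical helper names, and the regular expression only handles the large pool rows.

```python
import re
from itertools import groupby

# start_time, component, oper_type, oper_mode, initial MB, final MB
ROW = re.compile(
    r'(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\s+large pool\s+'
    r'(GROW|SHRINK)\s+(\w+)\s+(\d+)\s+(\d+)'
)

def summarize(rows):
    """Collapse consecutive identical rows into distinct resize events."""
    events = []
    for row in rows:
        m = ROW.search(row)
        if not m:
            continue  # skip headers, separators and "......" lines
        event = (m.group(1), m.group(2), int(m.group(4)), int(m.group(5)))
        if not events or events[-1] != event:
            events.append(event)
    return events

def count_flips(events):
    """Count direction changes between GROW and SHRINK."""
    ops = [op for op, _ in groupby(e[1] for e in events)]
    return max(len(ops) - 1, 0)

# Rows copied from the output above.
rows = [
    "04/02/2017 02:32:32 large pool GROW IMMEDIATE 320 384 04/02/2017 02:32:33",
    "04/02/2017 02:32:32 large pool GROW IMMEDIATE 320 384 04/02/2017 02:32:33",
    "04/02/2017 02:35:47 large pool SHRINK DEFERRED 384 128 04/02/2017 02:35:47",
    "......",
    "04/02/2017 03:01:56 large pool GROW IMMEDIATE 320 384 04/02/2017 03:01:57",
    "04/02/2017 03:04:24 large pool SHRINK DEFERRED 384 128 04/02/2017 03:04:24",
]

events = summarize(rows)
for e in events:
    print(e)
print("direction changes:", count_flips(events))
```

Each immediate GROW to 384 MB is followed within a few minutes by a deferred SHRINK back to 128 MB, which is the oscillation the analysis refers to.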
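2.3 Parallel parameters
The trace file for the error shows that the SQL statement uses a high degree of parallelism (64). Check the parallel-related parameters: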
SQL> show parameter cpu_count
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
cpu_count integer 16
SQL> show parameter parallel_max
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
parallel_max_servers integer 640
A degree of parallelism of 64 is high for a host whose cpu_count is only 16. Consider lowering it.
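A rough sense of the oversubscription can be had with simple arithmetic. The sketch below is an estimate under a stated assumption, not a measurement: it assumes the statement runs with two PX server sets (a producer set and a consumer set), so up to 2 x DOP servers may be active, and `px_demand` is a hypothetical helper name. Whether this particular insert actually uses two sets depends on its execution plan.

```python
def px_demand(dop, cpu_count, server_sets=2):
    """Rough PX server demand for one statement and its ratio to CPU count.

    server_sets=2 is an assumption (producer/consumer sets); use 1 for a
    plan that needs only a single set.
    """
    servers = dop * server_sets
    return servers, servers / cpu_count

# DOP from the hint in the failing SQL, cpu_count from the parameter output.
servers, ratio = px_demand(dop=64, cpu_count=16)
print(f"PX servers: {servers}, servers per CPU: {ratio}")
```

Even with a single server set the statement would ask for 64 servers on 16 CPUs, and each PX server needs message buffers, which at this version are taken from the large pool as "PX msg pool".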
2.4 Detailed alert log analysis
Memory Notification: Library Cache Object loaded into SGA
Heap size 51201K exceeds notification threshold (51200K)
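This message means that the memory required by a library cache object exceeded a threshold, which is controlled by the hidden parameter _kgl_large_heap_warning_threshold. The notification was introduced in 10gR2. On its own it does not indicate a problem; what matters is whether ORA-04031 errors follow.
Reference:
Memory Notification: Library Cache Object loaded into SGA / ORA-600 [KGL-heap-size-exceeded] (Doc ID 330239.1)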
#### noap_j000_4031.trc ####
Memory Notification: Library Cache Object loaded into SGA
Heap size 73935K exceeds notification threshold (51200K)
LibraryHandle: Address=0x855ade650 Hash=70548654 LockMode=N PinMode=0 LoadLockMode=0 Status=VALD
ObjectName: Name=alter table MOD_CDR_HW drop partition SYS_P1399962
FullHashValue=3aa1433897dd4d6fc458246c70548654 Namespace=SQL AREA(00) Type=CURSOR(00) Identifier=1884587604 OwnerIdn=83
Statistics: InvalidationCount=0 ExecutionCount=0 LoadCount=2 ActiveLocks=1 TotalLockCount=1 TotalPinCount=1
Counters: BrokenCount=1 RevocablePointer=1 KeepDependency=1 BucketInUse=0 HandleInUse=0 HandleReferenceCount=0
Concurrency: DependencyMutex=0x855ade700(0, 1, 0, 0) Mutex=0x855ade780(1011, 21, 0, 6)
Flags=RON/PIN/TIM/PN0/DBN/[10012841]
WaitersLists:
Lock=0x855ade6e0[0x855ade6e0,0x855ade6e0]
Pin=0x855ade6c0[0x855ade6c0,0x855ade6c0]
Timestamp: Current=02-04-2017 02:00:34
HandleReference: Address=0x855ade820 Handle=(nil) Flags=[00]
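The trace file that triggered this message records the statement, which is the one printed after the notification in the alert log: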
KGL object name :alter table MOD_JS_CDR_HW drop partition SYS_P1404186
Next, the large pool errors:
Sat Feb 04 02:32:34 2017
Errors in file /app/oracle/diag/rdbms/noap/noap/trace/noap_p085_5172.trc (incident=109601):
ORA-04031: unable to allocate 2048024 bytes of shared memory ("large pool","unknown object","large pool","PX msg pool")
Incident details in: /app/oracle/diag/rdbms/noap/noap/incident/incdir_109601/noap_p085_5172_i109601.trc
Sat Feb 04 02:32:34 2017
Errors in file /app/oracle/diag/rdbms/noap/noap/trace/noap_p034_5070.trc (incident=101439):
ORA-04031: unable to allocate 2048024 bytes of shared memory ("large pool","unknown object","large pool","PX msg pool")
Incident details in: /app/oracle/diag/rdbms/noap/noap/incident/incdir_101439/noap_p034_5070_i101439.trc
Sat Feb 04 02:32:34 2017
Errors in file /app/oracle/diag/rdbms/noap/noap/trace/noap_p092_5186.trc (incident=109832):
ORA-04031: unable to allocate 2048024 bytes of shared memory ("large pool","unknown object","large pool","PX msg pool")
Incident details in: /app/oracle/diag/rdbms/noap/noap/incident/incdir_109832/noap_p092_5186_i109832.trc
Sat Feb 04 02:32:34 2017
Errors in file /app/oracle/diag/rdbms/noap/noap/trace/noap_p051_5104.trc (incident=107372):
ORA-04031: unable to allocate 2048024 bytes of shared memory ("large pool","unknown object","large pool","PX msg pool")
Incident details in: /app/oracle/diag/rdbms/noap/noap/incident/incdir_107372/noap_p051_5104_i107372.trc
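These large pool errors at 02:32:34 coincide with the large pool resize activity returned by the earlier query: an immediate GROW at 02:32:32, followed by a deferred SHRINK at 02:35:47.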
04/02/2017 02:32:32 large pool GROW IMMEDIATE 320 384 04/02/2017 02:32:33
04/02/2017 02:35:47 large pool SHRINK DEFERRED 384 128 04/02/2017 02:35:47
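Reference:
Multiple ORA-4031 Errors Of Reducing Sizes For "PX msg pool" In The Large Pool (Doc ID 1515877.1)
This is related to Bug:13072654 - ORA-4031 CANT ALLOC 14MB IN LARGE POOL, PX MSG POOL. A one-off patch for this bug exists for 11.2.0.2 and can be considered. The alternatives are to set a minimum size for the large pool or to switch the SGA to manual memory management.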
2.5 Shared pool size
Because the alert log also contains Library Cache Object (LCO) notifications, check the current shared pool usage:
SQL> select t.*
2 from (select name,
3 bytes / (1024 * 1024) "MB",
4 round(bytes / (select value
5 from v$parameter t
6 where t.name = 'shared_pool_size') * 100,
7 2) || '%' "USED%"
8 from v$sgastat
9 where pool = 'shared pool'
10 order by 2 desc) t
11 where rownum < 20;
NAME MB USED%
-------------------------- -------------------- -----------------------------------------
free memory 2971.667167663574219 36.28%
PRTMV 2858.977592468261719 34.9%
SQLA 1862.341529846191406 22.73%
PRTDS 614.0089035034179688 7.5%
KQR M PO 315.4219131469726563 3.85%
KGLH0 199.7758560180664063 2.44%
dbktb: trace buffer 81.90625 1%
FileOpenBlock 60.796417236328125 .74%
ASM extent pointer array 52.86400604248046875 .65%
db_block_hash_buckets 44.50390625 .54%
dbwriter coalesce buffer 32.03125 .39%
ASH buffers 32 .39%
KGLHD 29.06992340087890625 .35%
kglsim object batch 19.32781219482421875 .24%
private strands 17.5341796875 .21%
Checkpoint queue 15.6328125 .19%
event statistics per sess 15.33984375 .19%
write state object 14.6377716064453125 .18%
ksunfy : SSO free list 14.32470703125 .17%
PRTMV is an unfamiliar component here, and it occupies about 2.8G. For comparison, the same query on another database returns:
NAME MB USED%
-------------------------- ---------- -----------------------------------------
free memory 6021.38947 58.8%
SQLA 1472.31024 14.38%
KGLH0 1264.46631 12.35%
PRTMV 219.83268 2.15%
KGLHD 199.816628 1.95%
db_block_hash_buckets 178.003906 1.74%
dbktb: trace buffer 102.390625 1%
ASH buffers 96 .94%
dbwriter coalesce buffer 80.078125 .78%
FileOpenBlock 71.1162643 .69%
KGLDA 65.826004 .64%
Checkpoint queue 46.8984375 .46%
KKSSP 40.4567947 .4%
private strands 25.9765625 .25%
dirty object counts array 24 .23%
event statistics per sess 22.8779297 .22%
ksunfy : SSO free list 21.7646484 .21%
parameter table block 19.9453812 .19%
KGLS 19.5513763 .19%
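The comparison between the two listings can be made explicit by taking the ratio of each component's USED% in this database to its USED% in the other one. This is a minimal sketch: `outliers` is a hypothetical helper name and the threshold of 5 is an arbitrary illustration value, not an Oracle guideline. Only components that appear in both listings are compared.

```python
def outliers(this_db, other_db, factor=5.0):
    """Return components whose USED% here is >= factor times the other DB's."""
    flagged = {}
    for name, pct in this_db.items():
        other = other_db.get(name)
        if other and pct / other >= factor:
            flagged[name] = round(pct / other, 1)
    return flagged

# USED% values copied from the two v$sgastat listings above.
this_db = {"PRTMV": 34.9, "SQLA": 22.73, "KGLH0": 2.44, "KGLHD": 0.35}
other_db = {"PRTMV": 2.15, "SQLA": 14.38, "KGLH0": 12.35, "KGLHD": 1.95}

for name, ratio in outliers(this_db, other_db).items():
    print(f"{name}: {ratio}x the comparison database")
```

Using the percentage rather than the MB column keeps the comparison fair when the two databases have different shared pool sizes.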
The comparison suggests that this value may be abnormal. A search on MOS does turn up related bugs:
Bug 19461270 - high PRTMV allocations in shared pool executing concurrent DML and DDLs on interval partitioned tables (Doc ID 19461270.8)
Description
Concurrent DDLs and DMLs happening on interval partitioned table that was created with deferred segment creation clause may do high PRTMV allocations.
Workaround
Do not run DDLs concurrently.
This bug can be triggered when interval partitioning is in use, which matches the current symptoms fairly well.
Bug 17037130 - Excess shared pool "PRTMV" memory use / ORA-4031 with partitioned tables (Doc ID 17037130.8)
Description
This bug is only relevant when using Partitioned Tables
SQL on a partitioned table may cause excess shared pool usage and
ultimately fail with ORA-4031.
Rediscovery Notes:
ORA-4031 with child cursor(s) having dependency table entries
referencing obsolete (OBS) multi-versioned objects.
Workaround
Flushing the shared_pool and avoiding DDLs during high load time
can help to avoid this issue.
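3. Recommendations
Based on the analysis above, the large pool ORA-04031 errors are very likely related to shrinking of the large pool. The shared pool also has a problem of its own.
For the large pool bug: this database is 11.2.0.2 with no PSU applied. A one-off patch for the bug exists for 11.2.0.2. If the patch is not applied, the following workarounds can be considered:
- Set a minimum size for the large pool to avoid frequent shrinks. The database uses ASMM, and db_cache (22G) and shared_pool (8G) already have minimums set. Setting large_pool to 200M is recommended (with scope=spfile the change takes effect at the next instance restart).
alter system set large_pool_size=200M scope=spfile sid='*';
If parallel jobs are affected frequently, apply the one-off patch or switch to manual memory management.
- The degree of parallelism of 64 used by the parallel job is high for a host whose cpu_count is only 16. Consider lowering it.
For the shared pool problem: the database is the 11.2.0.2 base release with no PSU, and neither of the two bugs has a one-off patch for 11.2.0.2 on Linux. If an upgrade to 11.2.0.3 or later is not immediately possible, the recommendations are:
- According to its description, bug 19461270 involves not only interval partitioning but also the 11g deferred segment creation feature. Disabling that feature is recommended.
alter system set deferred_segment_creation=false scope=spfile sid='*';
The other bug, 17037130, is by its description unrelated to deferred segment creation. After applying the first setting, keep monitoring. The temporary remedies are to flush the shared pool or to avoid DDL during high-load periods.
To release the memory already held by the PRTMV component, flush the shared pool manually during a quiet business period.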