ORA-04031这个错误,几乎每一个专业的DBA都遇到过。这是一个相当严重的错误,Oracle进程在向SGA申请内存时,如果申请失败,则会报这个错误。大部分情况下是在向SGA中的shared pool申请内存时失败,而少有向large pool等池中申请内存失败。比如下面的报错:
- Wed Apr 27 16:00:25 2011
- Errors in file /oracle/app/oracle/admin/zxin/bdump/zxin1_ora_2052294.trc:
- ORA-04031: unable to allocate 4128 bytes of shared memory
- ("shared pool","unknown object","sga heap(3,0)","kgllk hash table")
这里很清楚地表示出来,是在向shared pool申请内存时失败。
shared pool内存申请(分配)失败,通常有如下的几种可能:
- shared pool过小,比如在SGA Manual Management方式下,shared pool设置过小。比如一套数千连接的大系统,shared pool只设置了几百M。这种情况下,要解决问题很解单,增加shared pool的大小即可。
- 应用没有使用绑定变量,硬解析非常多,导致shared pool内存碎片严重,分配大块内存时不能获得连续的内存空间。硬解析多的一个变种是虽然使用了绑定变量,但是由于某种原因,Cursor不能共享,导致Child Cursor非常多。实际上,如果shared pool较大(比如数GB大小),这种问题还是很少出现的,并且出现也通常出现在申请大块内存时。这种情况如果使用alter system flush shared_pool可以暂时缓解问题。但是这条命令又通常不适用于shared pool较大而且比较繁忙的系统。使用绑定变量
- Cache的cursor很多,同时cursor_space_for_time这一参数设置为TRUE,可能会使shared pool碎片化严重,导致不能分配到大块的连续内存。
- Oracle的BUG导致内存泄露,比如在一些版本中查询v$segment_statistics这样的视图导致内存泄露,使shared pool内存耗光。同样的情形还有类似于“obj stat memory”,”gcs resources”,”ges resources”等。通常这类内存为perm类型(permanet),这类内存通常是在分配时就确定了固定的用途,不能用于其他用途,因此极容易产生碎片。
- Oracle从9i开始,根据shared pool的大小将shared pool分为多个子池(subpool),每个子池有独立的free list,同时在分配时单独管理(有其独立 的shared pool latch)。Oracle的BUG或者说是内存分配策略缺陷导致某一类shared pool的内存分配只在一个子池(subpool)中,即多个子池的使用极不均衡,导致向那个使用得最多的子池申请内存时失败。报错信息中的”sga heap(3,0)”即指明是在第3个子池申请内存时失败。本文案例中的ORA-04031错误其产生的原因可以归结为Oracle对shared pool的分配/使用策略问题。
- 操作系统内存不足,这只会出现在shared pool的使用还没有达到最大值时才会出现,并且在操作系统都有swap的情况下,只有部分操作系统才可能有这种情况,比如在HP-UX下,reserved 内存过多导致swap满。
- 其他原因,多数是因为BUG。请参考下面提及的MOS参考文档。
本文中的案例,其数据库是运行在AIX 5.3系统中的10.2.0.4 RAC,RAC节点数为2。数据库是从9i升级到10g,而目前处于正式升级前的测试阶段。数据库报的ORA-04031错误信息如本文前面所示(其中的数据库名称已经做了处理)。
在继续讲解案例之前,不得不提到MOS上的几篇关于ORA-04031错误的文档:
- Master Note for Diagnosing ORA-4031 [ID 1088239.1]
- Diagnosing and Resolving Error ORA-04031 on the Shared Pool or Other Memory Pools [Video] [ID 146599.1]
- Interpreting the automatically generated ORA-4031 diagnostic trace. [ID 809560.1]
- Troubleshooting and Diagnosing ORA-4031 Error [Video] [ID 396940.1]
- ORA-4031 Common Analysis/Diagnostic Scripts [Video] [ID 430473.1]
其实分析ORA-04031错误,通常有以下几个要点:
- 判断错误发生所有的内存区域,是shared pool,large pool还是streams pool等。这个很容易从错误信息中判断出来,本文主要描述shared pool的ORA-04031错误,这也是最常见的。
- 检查Shared Pool的总大小以及free memory的大小。如果free memory看上去挺多,以subpool为单位检查是否存在是由于碎片导致没有足够的连续内存以供分配,特别是关注报错信息中提及的子池。
- 如果Shared Pool相较于系统规模来说足够大(通常数GB都已经是很大的了),检查Shared Pool中有没有占用非常多的内存类型或内存组件,如果有,是什么样的类型的内存,在各个子池之间是否分布均匀。如果有异常占用较多的内存类型,根据此类型在MOS上搜寻是否是会有相应的BUG引起,或者分析这种类型的内存消耗较多的原因。比如如果是sql area很大,检查是不是硬解析特别多,或者是不是child cursor特别多引起。
- 基于以上分析的数据,来判断shared pool内存分配失败的原因。
上面的步骤写得比较粗略,关于分析和解决ORA-04031问题,这里也有一篇不错的文章:Simplified Approach to Resolve ORA-4031
这里关键的是分析Shared Pool的内存数据。ORA-04031错误发生后如果有条件可以马上连接到数据库中查询相应的x$表和v$视图得到相应的数据,否则只能通过ORA-4031错误发生时产生的trace文件。_4031_dump_bitvec这个隐含参数用于控制发生ORA-04031错误时对SGA的dump行为,而trace文件的分析就不像使用SQL那样简单了。
下面再来详细地分析案例:
从错误信息来看,很显然,是向shared pool的第3个subpool申请内存时出错。
以下的数据是shared pool的数据:
- SQL> select sum(bytes)/1024/1024 mb from v$sgastat where pool='shared pool';
- MB
- ----------
- 4096.53062
- SQL> SELECT KSMCHCLS CLASS, COUNT(KSMCHCLS) NUM, SUM(KSMCHSIZ)
- 2 SIZ,
- 3 To_char( ((SUM(KSMCHSIZ)/COUNT(KSMCHCLS)/1024)),'999,999.00')|
- 4 |'k' "AVG SIZE"
- 5 FROM X$KSMSP GROUP BY KSMCHCLS;
- CLASS NUM SIZ AVG SIZE
- -------- ---------- ---------- ------------
- R-freea 512 24576 .05k
- freeabl 807395 1643969848 1.99k
- recr 530728 662065240 1.22k
- R-free 256 214910976 819.82k
- free 43063 100605496 2.28k
- perm 140 1673368632 11,672.49k
虽然free的数量不是太多,但是freeable的数量还是很多的。
下面是各个子池更详细的数据:
- SQL> SELECT KSMCHIDX,KSMCHDUR, KSMCHCLS CLASS, COUNT(KSMCHCLS) NUM, SUM(KSMCHSIZ)
- 2 SIZ,
- 3 To_char( ((SUM(KSMCHSIZ)/COUNT(KSMCHCLS)/1024)),'999,999.00')|
- 4 |'k' "AVG SIZE"
- 5 FROM X$KSMSP GROUP BY KSMCHIDX,KSMCHDUR, KSMCHCLS
- 6 order by 1,2,3;
- KSMCHIDX KSMCHDUR CLASS NUM SIZ AVG SIZE
- ---------- ---------- -------- ---------- ---------- ------------
- 1 1 R-free 27 22666392 819.82k
- R-freea 54 2592 .05k
- free 26 14024 .53k
- perm 32 430299448 13,131.70k
- 2 R-free 12 10073952 819.82k
- R-freea 24 1152 .05k
- free 10531 24519112 2.27k
- freeabl 44922 32457736 .71k
- recr 163177 134273584 .80k
- 3 R-free 9 7555464 819.82k
- R-freea 18 864 .05k
- free 1678 4555704 2.65k
- freeabl 79815 102514024 1.25k
- recr 32689 36368096 1.09k
- 4 R-free 20 16789920 819.82k
- R-freea 40 1920 .05k
- free 2182 5810056 2.60k
- freeabl 66235 254656184 3.75k
- recr 16245 58284480 3.50k
- 2 1 R-free 25 20987400 819.82k
- R-freea 50 2400 .05k
- free 23 20016 .85k
- perm 35 398418384 11,116.58k
- 2 R-free 4 3357984 819.82k
- R-freea 8 384 .05k
- free 5137 6604176 1.26k
- freeabl 40377 12140944 .29k
- recr 54942 45005024 .80k
- 3 R-free 9 7555464 819.82k
- R-freea 18 864 .05k
- free 1477 5524568 3.65k
- freeabl 79548 101879808 1.25k
- recr 32380 36033448 1.09k
- 4 R-free 21 17629416 819.82k
- R-freea 42 2016 .05k
- free 2540 7092424 2.73k
- freeabl 70133 270332800 3.76k
- recr 15924 57263032 3.51k
- 3 1 R-free 26 21826896 819.82k
- R-freea 52 2496 .05k
- free 26 20520 .77k
- perm 33 414355416 12,261.94k
- 2 R-free 4 3357984 819.82k
- R-freea 8 384 .05k
- free 4693 7053032 1.47k
- freeabl 49723 14339800 .28k
- recr 52771 42357312 .78k
- 3 R-free 11 9234456 819.82k
- R-freea 22 1056 .05k
- free 3594 9280904 2.52k
- freeabl 95823 121934488 1.24k
- recr 39643 44097504 1.09k
- 4 R-free 25 20987400 819.82k
- R-freea 50 2400 .05k
- free 2822 7291680 2.52k
- freeabl 84443 323149712 3.74k
- recr 19148 67997008 3.47k
- 4 1 R-free 27 22666392 819.82k
- R-freea 54 2592 .05k
- free 26 18088 .68k
- perm 40 430295384 10,505.26k
- 2 R-free 6 5036976 819.82k
- R-freea 12 576 .05k
- free 4818 11473920 2.33k
- freeabl 46089 39963224 .85k
- recr 54061 44188072 .80k
- 3 R-free 9 7555464 819.82k
- R-freea 18 864 .05k
- free 1427 4109504 2.81k
- freeabl 80026 102379080 1.25k
- recr 33217 36949240 1.09k
- 4 R-free 21 17629416 819.82k
- R-freea 42 2016 .05k
- free 1993 6228464 3.05k
- freeabl 70581 269087136 3.72k
- recr 16614 59372656 3.49k
KSMCHDUR是什么意思?在9i里面这个列的值通常为1。实际上,Oracle从9i开始,将shared pool划分为多个sub pool。而在10g以上的版本中(具体开始的版本号已经不记得),每个sub pool又分了4个更小的池,我们暂且称之为mini heap。每个mini heap有其自己的free list。KSMCHDUR这一列即表示mini heap的编号。”heap(3,0)”中的0是指第1个mini heap。
在上面的数据中,可以看到这个子池的第1个mini heap的free已经很少,只有10来K。另外,我们可以观察到,perm类型的内存块只存在于每个sub pool的第1个min heap中。这个是一个重点,在后面会有解释。
这里本应该有通过查询v$sgastat得到shared pool的各个组件占用的内存分布,只是写BLOG时找不到了….但是我们可以在trace文件中找到数据,下面只列出sub pool 3的数据:
- ==============================
- Memory Utilization of Subpool 3
- ================================
- Allocation Name Size
- _________________________ __________
- "free memory " 81466568
- "miscellaneous " 0
- "dpslut_kfdsg " 512
- "trace buffer " 737280
- "trace_knlasg " 504
- "gcs res hash bucket " 1048576
- "gcs res latch table " 12288
- "evaluation con " 0
- "sql area " 344545752
- "UNDO STAT INFO " 59904
- "txncallback " 141744
- "transaction " 2103264
- "ges resource pools " 3968
- "sessions " 4526488
- "dlo fib struct " 128032
- "KJCTS process batching st" 240
- "row cache " 3272
- "KCB where statistics arra" 25888
- "KCB buffer wait statistic" 32000
- "KCB incremental ckpt entr" 512
- "invalid low rba queue " 1024
- "table definiti " 108704
- "temporary tabl " 4136
- "KCL instance cache transf" 131072
- "resumable " 2720
- "KESTB existence bitvec se" 128
- "type object de " 392448
- "enqueue_hash " 318960
- "KSXR pending consumption " 20192
- "KTI SGA freeable small po" 0
- "trigger defini " 885472
- "trigger source " 99264
- "trigger inform " 960
- "KTCN: Obj Invalidation Se" 2336
- "kmgsb circular statistics" 108800
- "kgl lock hash table state" 45360
- "kglsim size of pinned mem" 8024
- "kelr system metrics table" 280
- "kzctxgjsi ksuseclid memor" 117360
- "kzctxgjci ksuseclid memo" 0
- "CCursor " 95912048
- "ksr message pool free que" 188960
- "ksb ci process list (each" 144
- "ksunfy: nodes of hierarch" 320
- "ksuloi: long op free list" 256
- "kwqmncal: allocate buffer" 4048
- "ksim group query request " 0
- "ksuxds ksuseclid memory " 0
- "call " 87304
- "dictionary cache " 0
- "KSXR pending reply queue " 255488
- "hng: All sessions data fo" 0
- "ksfv subheap descriptor " 184
- "gcs resources " 169082312
- "gcs affinity " 8320
- "gcs opaque in " 12312
- "PCursor " 50743128
- "ges resource " 539376
- "fdrec_kffsg " 24
- "work area tab " 80640
- "kglsim main lru count " 38400
- "plwpil:wa " 4264
- "grptab_kfgsg " 2464
- "AW SGA " 40
- "KEWS sesstat seg tbl " 8
- "kebm slave descriptors " 1456
- "kglsim hash table bkts " 1048576
- "KSXR global channels " 1288
- "ges enqueues " 17333720
- "PLS chunk " 352
- "KSQ event description " 1440
- "KESTB existence bitvec " 4096
- "gcs shadows " 101246344
- "qmtb_init_data " 224
- "Core dump directory " 264
- "sort segment handle " 7480
- "SERVICE NAME ENTRY " 48
- "PQ/BizCard " 1536
- "qtree_kwqbspse " 40
- "latch descriptor table " 1576
- "recovery domain " 29856
- "parameters " 30056
- "SHARED SERVERS INFO " 240
- "qtree_kwqbsgn " 40
- "post agent " 0
- "pspool_kfsg " 80
- "plwsppwp:wa " 0
- "PL/SQL DIANA " 14050624
- "segmented arrays " 2072
- "Checkpoint queue " 4097024
- "sim lru segments " 2560
- "sim segment hits " 2560
- "sim state object " 40
- "partitioning d " 199616
- "ASH buffers " 8388608
- "message pool freequeue " 276336
- "PL/SQL MPCODE " 4499360
- "PL/SQL PPCODE " 3984944
- "procs: ksunfy " 1512000
- "primem_kfmdsg " 1032
- "SYSTEM PARAMETERS " 76624
- "object queue hash buckets" 262656
- "object queue hash table d" 7552
- "object level stats hash t" 512
- "object stat dummy statprv" 144
- "sim cache sizes " 320
- "logout storm management " 24000
- "pl/sql source " 21256
- "sys event stats " 199136
- "parameter handle " 67896
- "Parameter Handle " 1656
- "channel handle " 828672
- "API blockers array " 64
- "PARAMETER TABLE " 2048
- "PARAMETER ENTRY " 8
- "LGWR post requested array" 24
- "bloom filter " 3104
- "param hash values " 5984
- "sql area:PLSQL " 11477776
- "PX subheap desc " 256
- "repository " 213544
- "sql area:KOKA " 16192
- "archive_lag_target " 9624
- "state objects " 640
- "latch nowait fails or sle" 116832
- "sched job slv " 5952
- "pso tbs: ksunfy " 390000
- "dummy " 269928
- "Sort Segment " 37440
- "Cursor Stats " 6095760
- "Banner Storage " 2048
- "quiescing session " 3872
- "API data buffer " 16
- "buffer handles " 1020000
- "prmtzdini tz region " 408320
- "sga node map " 16
- "savepoints " 0
- "Managed Standby Proc Arra" 24576
- "OS proc request holder " 4664
- "db_files " 416576
- "PX server msg stats " 2288
- "KQR M PO " 283376
- "kks stats " 40
- "parameter table block " 483168
- "KSFV SGA " 824
- "plugin datafile array " 36016
- "plwda:PLW_STR_NEW_RVAL " 24
- "plwspv:PLW_STR_NEW_VAL " 16
- "KGKP sga " 32
- "BRANCH TABLE SEGMENTED AR" 70176
- "mvobj part des " 306544
- "parameter value memory " 216
- "multiblock re " 98496
- "parameter text value " 1080
- "parallel_max_servers " 8192
- "KGLS heap " 13290800
- "KGSKI sga " 80
- "resize request state obje" 368000
- "MTTR advisory " 1462832
- "monitoring co " 12480
- "rules engine aggregate st" 1416
- "krbmror " 36400
- "joxs heap " 136
- "krbmrsr " 152
- "ksfqpar " 4008
- "SGA - SWRF DrvMet Runtime" 2656
- "SGA - SWRF Metrics ksuTim" 72
- "SGA - SWRF RawMet Runtime" 1408
- "SGA - SWRF Metrics WCTime" 32
- "SQL Memory Manager Base W" 13400
- "change notification regis" 4096
- "simulator latch/bucket st" 59392
- "Prefetch history buffer " 2832
- "change notification obj m" 4096
- "KQR ENQ " 16512
- "kksss " 16464
- "API data buffer length " 0
- "kokcd " 0
- "kohsg " 8
- "Sequence Background Insta" 88
- "ksfqpn " 416
- "KGLS SP " 4704
- "knstsg " 48
- "latch classes " 352
- "system default language h" 568
- "name-service entry " 2592
- "API data buffer array " 0
- "kzull " 4096
- "kzulu " 392
- "kfgsga " 104
- "library cache " 46604712
- "kcrrny " 25320
- "spfile cleanup structure " 16760
- "xssinfo " 5952
- "buffer_pool_desc_array " 3384
- "row cache child latch " 3360
- "rm request queue link " 5320
- "SCHEDULING POLICY TABLE " 160
- "namhsh_kfdsg " 4104
- "Closed Thread SCN Bitvec " 8448
- "Client ID trace settings " 3872
- "osp allocation " 21104
- "os statistics " 9192
- "plwppwp:PLW_STR_NEW_LEN_V" 16
- "plwgc: plwgc_garbage_clea" 0
- "plwiiw: kglpql warnings " 0
- "object queue " 808080
- "obj stat memo " 599184
- "obj htab chun " 122960
- "object level " 111888
- "XCT XGA " 0
- "SGA - SWRF Metric Eidbuf " 900840
- "Processor group descripto" 64
- "Prefetch client count per" 32
- "X$SKGXPIA " 2680
- "simulator hash buckets " 2101248
- "State object subpools " 896
- "API data buffer length ma" 0
- "AWR Table Info (KEW layer" 872
- "character set memory " 4856
- "sim segment num bufs " 1280
- "character set object " 129728
- "session idle latches " 2560
- "qesmmaInitialize: " 112
- "returns from remote ops " 49152
- "name-service " 4080
- "SGA - SWRF Metric CHBs " 10912
- "listener addresses " 32
- "db_block_hash_buckets " 67108864
- "KSI resource types " 2704
- "kglsim object batch " 4196304
- "trigger condition node " 72
- "ksws service events " 18560
- "Heap0: KGL " 11642128
- "fixed allocation callback" 392
- "kqlpWrntoStr:value " 0
- "KEWS statistic name " 424
- "KEWS statistic maps " 1096
- "KCL partition table " 131072
- "kebm slave message " 88
- "kcbl state objects " 12800
- "free rm request queue lin" 0
- "xsoqsehift " 3104
- "DBWR event stats array " 192
- "kgllk hash table " 659456
- "event descriptor table " 192
- "kpssnfy: kpsssgct " 32
- "kpscad: kpscscon " 1952
- "dbwriter coalesce buffer " 3158016
- "kglsim hash table " 8208
- "gcs resource freelist dyn" 256
- "gcs shadow locks dyn seg " 256
- "kks stats latch " 160
- "KTC latch cleanup " 576
- "ges enqueue max. usage pe" 64
- "ges lmd process descripto" 2760
- "KTU latch cleanup " 2496
- "kscdnfyinithead " 16
- "X$KSVIT table " 512
- "kqlpaac:value-1 " 64
- "KCL buffer header " 192064
- "kxfpdp pointers " 28800
- "kodosgi kopfdo " 104
- "kglsim latches " 136
- "TXN TABLE SEGMENTED ARRAY" 54784
- "KJCT remote i " 1640
- "KKJ SHRD WRQS " 288
- "KJC dest ctx " 3560
- "kwrsnfy: kwrs " 1624
- "kwqmn:tskdata " 0
- "KKKI consumer " 4136
- "dbwr suspend/resume ptr a" 16
- "dbwr actual working sets " 64
- "KGSKI schedule " 0
- "temp lob duration state o" 3296
- "ges regular msg buffers " 3078008
- "jsksncb: 9 " 28672
- "Transportable DB Converte" 2552
- "KTU lat struct " 800
- "kks stats hds " 256
- "KSFD SGA I/O b " 4190248
- "HTTP fixed headers " 72
- "UNDO INFO SEGMENTED ARRAY" 649856
- "ges process hash table " 132000
- "jsksncb-latch " 1280
- "kfkid hrec " 24
- "KTCCC OBJECT " 0
- "KTPR HIST TB " 2808
- "KTF MAPPINGS " 12288
- "kksss-heap " 35136
- "kglsim heap " 3431232
- "event statistics per sess" 7665280
- "eventlist to post commits" 16
从上面的数据可以看到,第3个sub pool中,占用较多的内存是gcs resources、gcs shadows以及sql area。但是没有明显的异常。
下面是第3个sub pool中第1个mini-heap中free memory的更详细数据:
- SQL> break on ksmchidx on ksmchdur
- SQL> select
- 2 ksmchidx,ksmchdur,
- 3 case
- 4 when ksmchsiz < 1672 then trunc((ksmchsiz-32)/8)
- 5 when ksmchsiz < 4120 then trunc((ksmchsiz+7928)/48)
- 6 when ksmchsiz < 8216 then 250
- 7 when ksmchsiz < 16408 then 251
- 8 when ksmchsiz < 32792 then 252
- 9 when ksmchsiz < 65560 then 253
- 10 when ksmchsiz >= 65560 then 253
- 11 end bucket,
- 12 sum(ksmchsiz) free_space,
- 13 count(*) free_chunks,
- 14 trunc(avg(ksmchsiz)) average_size,
- 15 max(ksmchsiz) biggest
- 16 from
- 17 sys.x$ksmsp
- 18 where
- 19 inst_id = userenv('Instance') and
- 20 ksmchcls = 'free'
- 21 group by
- 22 case
- 23 when ksmchsiz < 1672 then trunc((ksmchsiz-32)/8)
- 24 when ksmchsiz < 4120 then trunc((ksmchsiz+7928)/48)
- 25 when ksmchsiz < 8216 then 250
- 26 when ksmchsiz < 16408 then 251
- 27 when ksmchsiz < 32792 then 252
- 28 when ksmchsiz < 65560 then 253
- 29 when ksmchsiz >= 65560 then 253
- 30 end ,
- 31 ksmchidx, ksmchdur
- 32 order by ksmchidx , ksmchdur
- 33 /
- KSMCHIDX KSMCHDUR BUCKET FREE_SPACE FREE_CHUNKS AVERAGE_SIZE BIGGEST
- ---------- ---------- ---------- ---------- ----------- ------------ ----------
- 3 1 5 72 1 72 72
- 13 136 1 136 136
- 27 248 1 248 248
- 48 416 1 416 416
- 56 1920 4 480 480
- 66 1680 3 560 560
- 68 4608 8 576 576
- 164 1344 1 1344 1344
- 180 1472 1 1472 1472
- 188 1536 1 1536 1536
- 190 1552 1 1552 1552
- 199 1624 1 1624 1624
- 204 1880 1 1880 1880
- 207 2032 1 2032 2032
可以看到,最大的free memory块才2032字节,而报错中提到的申请的内存大小为4128字节。由于在第3个sub pool的第1个mini heap中没有4128字节的连续free memory,所以导致内存申请失败。
那么这里的问题是,为什么这个mini heap中的free memory那么少?正如前面提及,为什么这个mini heap中的已经使用的类型全是perm类型?这个问题的答案就在于”DURATION”。Oracle在启用了SGA自动管理的模式下,为了便于在shared pool与buffer cache或其他内存之间动态调整大小,规定了在每一个mini heap中分配内存按照duration来进行。这里duration可以理解为内存块的持久时间。perm类型的内存块,就是分配后不能释放,只能用于相同组件的重用。比如gcs resources这种组件的内存是perm类型,这种内存被分配后,不能释放给sql area使用,也不能给gcs shadows使用,只能给其他的gcs resource使用。按DURATION分配内存时,perm类型的内存就只能从每个sub pool的第1个mini heap中分配。而其他类型的内存通常在sub pool的第2-4个mini heap中分配。由于perm类型的内存不能释放,也不能被其他组件的内存重用,所以里面的内存会越用越少,如果没有了free memory怎么办?前面说到,这种模式主要是工作在SGA自动管理模式下,如果free memory没有了,就会从SGA中的其他部分,比如buffer cache中取得memory chunk,加入到缺少内存的mini heap中。正常情况下这种机制没有问题。
完全使用SGA自动管理有一个缺陷就是,如果应用系统绑定变量做得不好,或者由于BUG,child cursor过多,导致shared pool会变得很大,甚至超过10G,严重的比buffer cache还大,另一方面,在buffer cache和shared pool之间频繁地调整大小,可能会导致严重的解析问题和其他性能问题。针对这个问题,通常有2种解决办法:一种就是关闭SGA自动管理,即将SGA_TARGET设置为0,以9i的方式来设置shared_pool_size,db_cache_size这些参数,来手动管理SGA;第二种就是sga_target仍然大于0,即自动管理SGA,但是通过设置shared_pool_size,db_cache_size等参数限制这些内存组件的最小大小,而只留给系统极少的自动调整空间。
而出现问题的这套系统,正是使用了第二种方式,开启了SGA自动调整,但是留给自动调整的空间极少。SGA_TARGET为35G,buffer_cache_size为30G,shared_pool_size为4G,再加上large_pool等组件,几乎没有什么可自动调整的余地。这种方式下,就存在了问题。下面来做一个按时间的分析:
- 时间T1,数据库启动,shared pool只消耗了极少量的内存。
- 时间T2至时间T3,Oracle进程请求shared pool内存,Oracle会向操作系统以指定的粒度为单位(比如16MB)请求物理内存,加入到所请求内存所在的mini heap中。直至shared pool的大小达到shared pool最大容许的大小。这个容许大小由各参数计算而来。比如说SGA_TARGET为10G,其他组件的参数设置后最小值为8G,shared_pool_size的值为1G,但是shared pool的最大容许大小为2G。这个时候,每个sub pool的mini heap的大小已经固定。在到达shared pool最大容许大小这一阶段,可能会从buffer cache等组件中占用。
- 时间T4,Oracle进程请求shared pool内存,这个时候只能从free list或age out内存块后获取内存,对于sub pool的第1个mini heap,只能从free list中获取,因为这个mini heap中的已用内存全是perm,是不能age out的。
- 时间T5,Oracle进程请求shared sub pool中第1个mini heap的内存,但是free list中已经没有内存。所以报ORA-04031错误。
在上面的时间点T5那里,如果SGA有较大的自动调整空间,比如说完全没有限制,即buffer_cache_size等参数很少或为0,这样在请求第1个mini heap中的内存时,完全可以从buffer cache中占用,这样的后果是使shared pool越来越大。
而本文案例的ORA-04031,正是由于SGA自动管理,而自动调整的余地又太小,最终使sub pool的第1个mini heap空间用光。当然我们可以分析为什么会用光,这个就显得更为复杂,这跟数据量、应用系统都有很大的关系。而系统中第1次出现ORA-04031错误的进程,是一个job进程,而此后大部分出现的错误均是job进程,能检查job代码,发现在做大量的表的大量数据的UPDATE操作,这可能是引起gcs shadows和gcs resources大量内存使用的原因。在一套RAC数据库中,gcs和ges相关的perm内存占用可能会比较大。
那么除了调整应用,应该怎么样解决这样问题?这里的解决方法是增加shared_pool_size参数到6G,同时将sga_target设置为0,再重启。
而另一种可能的办法是将参数“_enable_shared_pool_durations”设置为FALSE。这一参数为FALSE,将会使shared pool内存分配时,不再使某一类型的内存(比如perm)必须要求在一个固定的mini heap中。而实际上,sga_target设置为0之后,这一个参数自动会设为FALSE(由于这一参数是静态参数,所以修改了sga_target之后需要重启才会使这个隐含参数改变),所以建议的解决办法是设置sga_target参数,而不建议修改隐含参数。当然还有一种办法是完全让Oracle自动管理SGA,将buffer_cache_size和shared_pool_size等参数设置为0,但是正如前面所说,这种方法有比较大的缺陷。