ORA-04031这个错误,几乎每一个专业的DBA都遇到过。这是一个相当严重的错误,Oracle进程在向SGA申请内存时,如果申请失败,则会报这个错误。大部分情况下是在向SGA中的shared pool申请内存时失败,而少有向large pool等池中申请内存失败。比如下面的报错:
view plaincopy to clipboardprint? Wed Apr 27 16:00:25 2011 Errors in file /oracle/app/oracle/admin/zxin/bdump/zxin1_ora_2052294.trc: ORA-04031: unable to allocate 4128 bytes of shared memory ("shared pool","unknown object","sga heap(3,0)","kgllk hash table")
这里很清楚地表示出来,是在向shared pool申请内存时失败。
shared pool内存申请(分配)失败,通常有如下的几种可能:
本文中的案例,其数据库是运行在AIX 5.3系统中的10.2.0.4 RAC,RAC节点数为2。数据库是从9i升级到10g,而目前处于正式升级前的测试阶段。数据库报的ORA-04031错误信息如本文前面所示(其中的数据库名称已经做了处理)。
在继续讲解案例之前,不得不提到MOS上的几篇关于ORA-04031错误的文档:
其实分析ORA-04031错误,通常有以下几个要点:
上面的步骤写得比较粗略,关于分析和解决ORA-04031问题,这里也有一篇不错的文章:Simplified Approach to Resolve ORA-4031
这里关键的是分析Shared Pool的内存数据。ORA-04031错误发生后如果有条件可以马上连接到数据库中查询相应的x$表和v$视图得到相应的数据,否则只能通过ORA-4031错误发生时产生的trace文件。_4031_dump_bitvec这个隐含参数用于控制发生ORA-04031错误时对SGA的dump行为,而trace文件的分析就不像使用SQL那样简单了。
下面再来详细地分析案例:
从错误信息来看,很显然,是向shared pool的第3个subpool申请内存时出错。
以下的数据是shared pool的数据:
view plaincopy to clipboardprint? SQL> select sum(bytes)/1024/1024 mb from v$sgastat where pool='shared pool'; MB ---------- 4096.53062 SQL> SELECT KSMCHCLS CLASS, COUNT(KSMCHCLS) NUM, SUM(KSMCHSIZ) 2 SIZ, 3 To_char( ((SUM(KSMCHSIZ)/COUNT(KSMCHCLS)/1024)),'999,999.00')| 4 |'k' "AVG SIZE" 5 FROM X$KSMSP GROUP BY KSMCHCLS; CLASS NUM SIZ AVG SIZE -------- ---------- ---------- ------------ R-freea 512 24576 .05k freeabl 807395 1643969848 1.99k recr 530728 662065240 1.22k R-free 256 214910976 819.82k free 43063 100605496 2.28k perm 140 1673368632 11,672.49k
view plaincopy to clipboardprint? SQL> SELECT KSMCHIDX,KSMCHDUR, KSMCHCLS CLASS, COUNT(KSMCHCLS) NUM, SUM(KSMCHSIZ) 2 SIZ, 3 To_char( ((SUM(KSMCHSIZ)/COUNT(KSMCHCLS)/1024)),'999,999.00')| 4 |'k' "AVG SIZE" 5 FROM X$KSMSP GROUP BY KSMCHIDX,KSMCHDUR, KSMCHCLS 6 order by 1,2,3; KSMCHIDX KSMCHDUR CLASS NUM SIZ AVG SIZE ---------- ---------- -------- ---------- ---------- ------------ 1 1 R-free 27 22666392 819.82k R-freea 54 2592 .05k free 26 14024 .53k perm 32 430299448 13,131.70k 2 R-free 12 10073952 819.82k R-freea 24 1152 .05k free 10531 24519112 2.27k freeabl 44922 32457736 .71k recr 163177 134273584 .80k 3 R-free 9 7555464 819.82k R-freea 18 864 .05k free 1678 4555704 2.65k freeabl 79815 102514024 1.25k recr 32689 36368096 1.09k 4 R-free 20 16789920 819.82k R-freea 40 1920 .05k free 2182 5810056 2.60k freeabl 66235 254656184 3.75k recr 16245 58284480 3.50k 2 1 R-free 25 20987400 819.82k R-freea 50 2400 .05k free 23 20016 .85k perm 35 398418384 11,116.58k 2 R-free 4 3357984 819.82k R-freea 8 384 .05k free 5137 6604176 1.26k freeabl 40377 12140944 .29k recr 54942 45005024 .80k 3 R-free 9 7555464 819.82k R-freea 18 864 .05k free 1477 5524568 3.65k freeabl 79548 101879808 1.25k recr 32380 36033448 1.09k 4 R-free 21 17629416 819.82k R-freea 42 2016 .05k free 2540 7092424 2.73k freeabl 70133 270332800 3.76k recr 15924 57263032 3.51k 3 1 R-free 26 21826896 819.82k R-freea 52 2496 .05k free 26 20520 .77k perm 33 414355416 12,261.94k 2 R-free 4 3357984 819.82k R-freea 8 384 .05k free 4693 7053032 1.47k freeabl 49723 14339800 .28k recr 52771 42357312 .78k 3 R-free 11 9234456 819.82k R-freea 22 1056 .05k free 3594 9280904 2.52k freeabl 95823 121934488 1.24k recr 39643 44097504 1.09k 4 R-free 25 20987400 819.82k R-freea 50 2400 .05k free 2822 7291680 2.52k freeabl 84443 323149712 3.74k recr 19148 67997008 3.47k 4 1 R-free 27 22666392 819.82k R-freea 54 2592 .05k free 26 18088 .68k perm 40 430295384 10,505.26k 2 R-free 6 5036976 819.82k R-freea 12 576 .05k free 4818 11473920 2.33k freeabl 46089 39963224 .85k recr 54061 44188072 .80k 3 R-free 9 7555464 819.82k R-freea 18 864 .05k free 1427 4109504 2.81k freeabl 80026 102379080 1.25k recr 33217 36949240 1.09k 4 R-free 21 17629416 819.82k R-freea 42 2016 .05k free 1993 6228464 3.05k freeabl 70581 269087136 3.72k recr 16614 59372656 3.49k
KSMCHDUR是什么意思?在9i里面这个列的值通常为1。实际上,Oracle从9i开始,将shared pool划分为多个sub pool。而在10g以上的版本中(具体开始的版本号已经不记得),每个sub pool又分了4个更小的池,我们暂且称之为mini heap。每个mini heap有其自己的free list。KSMCHDUR这一列即表示mini heap的编号。”heap(3,0)”中的0是指第1个mini heap。
在上面的数据中,可以看到这个子池的第1个mini heap的free已经很少,只有10来K。另外,我们可以观察到,perm类型的内存块只存在于每个sub pool的第1个min heap中。这个是一个重点,在后面会有解释。
这里本应该有通过查询v$sgastat得到shared pool的各个组件占用的内存分布,只是写BLOG时找不到了….但是我们可以在trace文件中找到数据,下面只列出sub pool 3的数据:
view plaincopy to clipboardprint? ============================== Memory Utilization of Subpool 3 ================================ Allocation Name Size _________________________ __________ "free memory " 81466568 "miscellaneous " 0 "dpslut_kfdsg " 512 "trace buffer " 737280 "trace_knlasg " 504 "gcs res hash bucket " 1048576 "gcs res latch table " 12288 "evaluation con " 0 "sql area " 344545752 "UNDO STAT INFO " 59904 "txncallback " 141744 "transaction " 2103264 "ges resource pools " 3968 "sessions " 4526488 "dlo fib struct " 128032 "KJCTS process batching st" 240 "row cache " 3272 "KCB where statistics arra" 25888 "KCB buffer wait statistic" 32000 "KCB incremental ckpt entr" 512 "invalid low rba queue " 1024 "table definiti " 108704 "temporary tabl " 4136 "KCL instance cache transf" 131072 "resumable " 2720 "KESTB existence bitvec se" 128 "type object de " 392448 "enqueue_hash " 318960 "KSXR pending consumption " 20192 "KTI SGA freeable small po" 0 "trigger defini " 885472 "trigger source " 99264 "trigger inform " 960 "KTCN: Obj Invalidation Se" 2336 "kmgsb circular statistics" 108800 "kgl lock hash table state" 45360 "kglsim size of pinned mem" 8024 "kelr system metrics table" 280 "kzctxgjsi ksuseclid memor" 117360 "kzctxgjci ksuseclid memo" 0 "CCursor " 95912048 "ksr message pool free que" 188960 "ksb ci process list (each" 144 "ksunfy: nodes of hierarch" 320 "ksuloi: long op free list" 256 "kwqmncal: allocate buffer" 4048 "ksim group query request " 0 "ksuxds ksuseclid memory " 0 "call " 87304 "dictionary cache " 0 "KSXR pending reply queue " 255488 "hng: All sessions data fo" 0 "ksfv subheap descriptor " 184 "gcs resources " 169082312 "gcs affinity " 8320 "gcs opaque in " 12312 "PCursor " 50743128 "ges resource " 539376 "fdrec_kffsg " 24 "work area tab " 80640 "kglsim main lru count " 38400 "plwpil:wa " 4264 "grptab_kfgsg " 2464 "AW SGA " 40 "KEWS sesstat seg tbl " 8 "kebm slave descriptors " 1456 "kglsim hash table bkts " 1048576 "KSXR global channels " 1288 "ges enqueues " 17333720 "PLS chunk " 352 "KSQ event description " 1440 "KESTB existence bitvec " 4096 "gcs shadows " 101246344 "qmtb_init_data " 224 "Core dump directory " 264 "sort segment handle " 7480 "SERVICE NAME ENTRY " 48 "PQ/BizCard " 1536 "qtree_kwqbspse " 40 "latch descriptor table " 1576 "recovery domain " 29856 "parameters " 30056 "SHARED SERVERS INFO " 240 "qtree_kwqbsgn " 40 "post agent " 0 "pspool_kfsg " 80 "plwsppwp:wa " 0 "PL/SQL DIANA " 14050624 "segmented arrays " 2072 "Checkpoint queue " 4097024 "sim lru segments " 2560 "sim segment hits " 2560 "sim state object " 40 "partitioning d " 199616 "ASH buffers " 8388608 "message pool freequeue " 276336 "PL/SQL MPCODE " 4499360 "PL/SQL PPCODE " 3984944 "procs: ksunfy " 1512000 "primem_kfmdsg " 1032 "SYSTEM PARAMETERS " 76624 "object queue hash buckets" 262656 "object queue hash table d" 7552 "object level stats hash t" 512 "object stat dummy statprv" 144 "sim cache sizes " 320 "logout storm management " 24000 "pl/sql source " 21256 "sys event stats " 199136 "parameter handle " 67896 "Parameter Handle " 1656 "channel handle " 828672 "API blockers array " 64 "PARAMETER TABLE " 2048 "PARAMETER ENTRY " 8 "LGWR post requested array" 24 "bloom filter " 3104 "param hash values " 5984 "sql area:PLSQL " 11477776 "PX subheap desc " 256 "repository " 213544 "sql area:KOKA " 16192 "archive_lag_target " 9624 "state objects " 640 "latch nowait fails or sle" 116832 "sched job slv " 5952 "pso tbs: ksunfy " 390000 "dummy " 269928 "Sort Segment " 37440 "Cursor Stats " 6095760 "Banner Storage " 2048 "quiescing session " 3872 "API data buffer " 16 "buffer handles " 1020000 "prmtzdini tz region " 408320 "sga node map " 16 "savepoints " 0 "Managed Standby Proc Arra" 24576 "OS proc request holder " 4664 "db_files " 416576 "PX server msg stats " 2288 "KQR M PO " 283376 "kks stats " 40 "parameter table block " 483168 "KSFV SGA " 824 "plugin datafile array " 36016 "plwda:PLW_STR_NEW_RVAL " 24 "plwspv:PLW_STR_NEW_VAL " 16 "KGKP sga " 32 "BRANCH TABLE SEGMENTED AR" 70176 "mvobj part des " 306544 "parameter value memory " 216 "multiblock re " 98496 "parameter text value " 1080 "parallel_max_servers " 8192 "KGLS heap " 13290800 "KGSKI sga " 80 "resize request state obje" 368000 "MTTR advisory " 1462832 "monitoring co " 12480 "rules engine aggregate st" 1416 "krbmror " 36400 "joxs heap " 136 "krbmrsr " 152 "ksfqpar " 4008 "SGA - SWRF DrvMet Runtime" 2656 "SGA - SWRF Metrics ksuTim" 72 "SGA - SWRF RawMet Runtime" 1408 "SGA - SWRF Metrics WCTime" 32 "SQL Memory Manager Base W" 13400 "change notification regis" 4096 "simulator latch/bucket st" 59392 "Prefetch history buffer " 2832 "change notification obj m" 4096 "KQR ENQ " 16512 "kksss " 16464 "API data buffer length " 0 "kokcd " 0 "kohsg " 8 "Sequence Background Insta" 88 "ksfqpn " 416 "KGLS SP " 4704 "knstsg " 48 "latch classes " 352 "system default language h" 568 "name-service entry " 2592 "API data buffer array " 0 "kzull " 4096 "kzulu " 392 "kfgsga " 104 "library cache " 46604712 "kcrrny " 25320 "spfile cleanup structure " 16760 "xssinfo " 5952 "buffer_pool_desc_array " 3384 "row cache child latch " 3360 "rm request queue link " 5320 "SCHEDULING POLICY TABLE " 160 "namhsh_kfdsg " 4104 "Closed Thread SCN Bitvec " 8448 "Client ID trace settings " 3872 "osp allocation " 21104 "os statistics " 9192 "plwppwp:PLW_STR_NEW_LEN_V" 16 "plwgc: plwgc_garbage_clea" 0 "plwiiw: kglpql warnings " 0 "object queue " 808080 "obj stat memo " 599184 "obj htab chun " 122960 "object level " 111888 "XCT XGA " 0 "SGA - SWRF Metric Eidbuf " 900840 "Processor group descripto" 64 "Prefetch client count per" 32 "X$SKGXPIA " 2680 "simulator hash buckets " 2101248 "State object subpools " 896 "API data buffer length ma" 0 "AWR Table Info (KEW layer" 872 "character set memory " 4856 "sim segment num bufs " 1280 "character set object " 129728 "session idle latches " 2560 "qesmmaInitialize: " 112 "returns from remote ops " 49152 "name-service " 4080 "SGA - SWRF Metric CHBs " 10912 "listener addresses " 32 "db_block_hash_buckets " 67108864 "KSI resource types " 2704 "kglsim object batch " 4196304 "trigger condition node " 72 "ksws service events " 18560 "Heap0: KGL " 11642128 "fixed allocation callback" 392 "kqlpWrntoStr:value " 0 "KEWS statistic name " 424 "KEWS statistic maps " 1096 "KCL partition table " 131072 "kebm slave message " 88 "kcbl state objects " 12800 "free rm request queue lin" 0 "xsoqsehift " 3104 "DBWR event stats array " 192 "kgllk hash table " 659456 "event descriptor table " 192 "kpssnfy: kpsssgct " 32 "kpscad: kpscscon " 1952 "dbwriter coalesce buffer " 3158016 "kglsim hash table " 8208 "gcs resource freelist dyn" 256 "gcs shadow locks dyn seg " 256 "kks stats latch " 160 "KTC latch cleanup " 576 "ges enqueue max. usage pe" 64 "ges lmd process descripto" 2760 "KTU latch cleanup " 2496 "kscdnfyinithead " 16 "X$KSVIT table " 512 "kqlpaac:value-1 " 64 "KCL buffer header " 192064 "kxfpdp pointers " 28800 "kodosgi kopfdo " 104 "kglsim latches " 136 "TXN TABLE SEGMENTED ARRAY" 54784 "KJCT remote i " 1640 "KKJ SHRD WRQS " 288 "KJC dest ctx " 3560 "kwrsnfy: kwrs " 1624 "kwqmn:tskdata " 0 "KKKI consumer " 4136 "dbwr suspend/resume ptr a" 16 "dbwr actual working sets " 64 "KGSKI schedule " 0 "temp lob duration state o" 3296 "ges regular msg buffers " 3078008 "jsksncb: 9 " 28672 "Transportable DB Converte" 2552 "KTU lat struct " 800 "kks stats hds " 256 "KSFD SGA I/O b " 4190248 "HTTP fixed headers " 72 "UNDO INFO SEGMENTED ARRAY" 649856 "ges process hash table " 132000 "jsksncb-latch " 1280 "kfkid hrec " 24 "KTCCC OBJECT " 0 "KTPR HIST TB " 2808 "KTF MAPPINGS " 12288 "kksss-heap " 35136 "kglsim heap " 3431232 "event statistics per sess" 7665280 "eventlist to post commits" 16从上面的数据可以看到,第3个sub pool中,占用较多的内存是gcs resources、gcs shadows以及sql area。但是没有明显的异常。
view plaincopy to clipboardprint? SQL> break on ksmchidx on ksmchdur SQL> select 2 ksmchidx,ksmchdur, 3 case 4 when ksmchsiz < 1672 then trunc((ksmchsiz-32)/8) 5 when ksmchsiz < 4120 then trunc((ksmchsiz+7928)/48) 6 when ksmchsiz < 8216 then 250 7 when ksmchsiz < 16408 then 251 8 when ksmchsiz < 32792 then 252 9 when ksmchsiz < 65560 then 253 10 when ksmchsiz >= 65560 then 253 11 end bucket, 12 sum(ksmchsiz) free_space, 13 count(*) free_chunks, 14 trunc(avg(ksmchsiz)) average_size, 15 max(ksmchsiz) biggest 16 from 17 sys.x$ksmsp 18 where 19 inst_id = userenv('Instance') and 20 ksmchcls = 'free' 21 group by 22 case 23 when ksmchsiz < 1672 then trunc((ksmchsiz-32)/8) 24 when ksmchsiz < 4120 then trunc((ksmchsiz+7928)/48) 25 when ksmchsiz < 8216 then 250 26 when ksmchsiz < 16408 then 251 27 when ksmchsiz < 32792 then 252 28 when ksmchsiz < 65560 then 253 29 when ksmchsiz >= 65560 then 253 30 end , 31 ksmchidx, ksmchdur 32 order by ksmchidx , ksmchdur 33 / KSMCHIDX KSMCHDUR BUCKET FREE_SPACE FREE_CHUNKS AVERAGE_SIZE BIGGEST ---------- ---------- ---------- ---------- ----------- ------------ ---------- 3 1 5 72 1 72 72 13 136 1 136 136 27 248 1 248 248 48 416 1 416 416 56 1920 4 480 480 66 1680 3 560 560 68 4608 8 576 576 164 1344 1 1344 1344 180 1472 1 1472 1472 188 1536 1 1536 1536 190 1552 1 1552 1552 199 1624 1 1624 1624 204 1880 1 1880 1880 207 2032 1 2032 2032
可以看到,最大的free memory块才2032字节,而报错中提到的申请的内存大小为4128字节。由于在第3个sub pool的第1个mini heap中没有4128字节的连续free memory,所以导致内存申请失败。
那么这里的问题是,为什么这个mini heap中的free memory那么少?正如前面提及,为什么这个mini heap中的已经使用的类型全是perm类型?这个问题的答案就在于”DURATION”。Oracle在启用了SGA自动管理的模式下,为了便于在shared pool与buffer cache或其他内存之间动态调整大小,规定了在每一个mini heap中分配内存按照duration来进行。这里duration可以理解为内存块的持久时间。perm类型的内存块,就是分配后不能释放,只能用于相同组件的重用。比如gcs resources这种组件的内存是perm类型,这种内存被分配后,不能释放给sql area使用,也不能给gcs shadows使用,只能给其他的gcs resource使用。按DURATION分配内存时,perm类型的内存就只能从每个sub pool的第1个mini heap中分配。而其他类型的内存通常在sub pool的第2-4个mini heap中分配。由于perm类型的内存不能释放,也不能被其他组件的内存重用,所以里面的内存会越用越少,如果没有了free memory怎么办?前面说到,这种模式主要是工作在SGA自动管理模式下,如果free memory没有了,就会从SGA中的其他部分,比如buffer cache中取得memory chunk,加入到缺少内存的mini heap中。正常情况下这种机制没有问题。
完全使用SGA自动管理有一个缺陷就是,如果应用系统绑定变量做得不好,或者由于BUG,child cursor过多,导致shared pool会变得很大,甚至超过10G,严重的比buffer cache还大,另一方面,在buffer cache和shared pool之间频繁地调整大小,可能会导致严重的解析问题和其他性能问题。针对这个问题,通常有2种解决办法:一种就是关闭SGA自动管理,即将SGA_TARGET设置为0,以9i的方式来设置shared_pool_size,db_cache_size这些参数,来手动管理SGA;第二种就是sga_target仍然大于0,即自动管理SGA,但是通过设置shared_pool_size,db_cache_size等参数限制这些内存组件的最小大小,而只留给系统极少的自动调整空间。
而出现问题的这套系统,正是使用了第二种方式,开启了SGA自动调整,但是留给自动调整的空间极少。SGA_TARGET为35G,buffer_cache_size为30G,shared_pool_size为4G,再加上large_pool等组件,几乎没有什么可自动调整的余地。这种方式下,就存在了问题。下面来做一个按时间的分析:
在上面的时间点T5那里,如果SGA有较大的自动调整空间,比如说完全没有限制,即buffer_cache_size等参数很少或为0,这样在请求第1个mini heap中的内存时,完全可以从buffer cache中占用,这样的后果是使shared pool越来越大。
而本文案例的ORA-04031,正是由于SGA自动管理,而自动调整的余地又太小,最终使sub pool的第1个mini heap空间用光。当然我们可以分析为什么会用光,这个就显得更为复杂,这跟数据量、应用系统都有很大的关系。而系统中第1次出现ORA-04031错误的进程,是一个job进程,而此后大部分出现的错误均是job进程,能检查job代码,发现在做大量的表的大量数据的UPDATE操作,这可能是引起gcs shadows和gcs resources大量内存使用的原因。在一套RAC数据库中,gcs和ges相关的perm内存占用可能会比较大。
那么除了调整应用,应该怎么样解决这样问题?这里的解决方法是增加shared_pool_size参数到6G,同时将sga_target设置为0,再重启。
而另一种可能的办法是将参数“_enable_shared_pool_durations”设置为FALSE。这一参数为FALSE,将会使shared pool内存分配时,不再使某一类型的内存(比如perm)必须要求在一个固定的mini heap中。而实际上,sga_target设置为0之后,这一个参数自动会设为FALSE(由于这一参数是静态参数,所以修改了sga_target之后需要重启才会使这个隐含参数改变),所以建议的解决办法是设置sga_target参数,而不建议修改隐含参数。当然还有一种办法是完全让Oracle自动管理SGA,将buffer_cache_size和shared_pool_size等参数设置为0,但是正如前面所说,这种方法有比较大的缺陷。
原作地址http://www.laoxiong.net/an-ora-04031-case.html