Oracle 11.2.0.2.0 RAC环境一次内存溢出ORA-04031问题的处理

Oracle 11.2.0.2.0 RAC环境一次内存溢出ORA-04031问题的处理

环境信息

数据库版本:11.2.0.2.0

操作系统版本:AIX 6.1

 

 

 

一.现象

使用sar 观察cpu使用情况,注意观察 %usr 参数

 

1.RAC环境中其中一台服务器node2负载上升(平时为idle为90左右),node1数据库正常,下面为node2当时的SAR报告

bash-3.00#sar -f sa15|more

AIXp740db2 1 6 00F688714C00    08/15/12

Systemconfiguration: lcpu=32  mode=Capped

00:00:00   %usr    %sys    %wio  %idle   physc

16:15:00                 5               2                             4                           89           8.00

16:16:00      8       5      0      87    8.00

16:17:00                              36                       6                       0                        57              8.00

16:18:00     50       5      4      42    8.00

16:19:00      7       2      7      85    8.00

16:20:00      6       2      5      88    8.00

16:21:00      6       2      8      84    8.00

16:22:00      6       2      9      83    8.00

16:23:00      7       2      7      84    8.00

16:24:00      6       2      5      87    8.00

16:25:00      6       2      6      87    8.00

16:26:00      6       2      6      86    8.00

16:27:00      5       2      5      89    8.00

16:28:00      5       2      5      88    8.00

16:29:00      5       2      4      89    8.00

16:30:00      5       2      4      90    8.00

16:31:00     25       3      3      70    8.00

16:32:01     63       2      0      34    8.00

16:33:00     27       2      0      71    8.00

16:34:00     46       2      0      51    8.00

16:35:00     66       2      0      32    8.00

16:36:01     37       2      0      61    8.00

16:37:00     27       2      0      72    8.00

16:38:00     15       1      0      83    8.00

16:39:00     29       2      0      69    8.00

16:40:00     41       2      0      57    8.00

16:41:00     45       2      0      53    8.00

16:42:00     47       2      0      51    8.00

16:43:00     39       2      0      59    8.00

16:44:00     50       2      1      48    8.00

16:45:00     15       2      5      78    8.00

16:46:00      1       1      3      95    8.00

16:47:00      6       2      1      91    8.00

 

2.登陆数据库一直挂起,长时间无响应,查看alert log,发现从16:15开始报ORA-04031 shared pool不够的错

ArchivedLog entry 15904 added forthread 2 sequence 2418 ID 0xffffffffd412add4 dest 1:

WedAug 15 16:15:09 2012

Errorsin file/db/diag/rdbms/xxxxshprod/xxxxshpd2/trace/xxxxshpd2_ora_24707320.trc (incident=5241532):

ORA-04031:鎊谞娉^斟^萟崎^臹?32瀛^阻^蔨络^赹腻^疟浜 ^芧佩璣?("shared pool","selectcompany_id,user_id fr...","KGLH0^28668775","k

glHeapInitialize:temp")

Incidentdetails in:/db/diag/rdbms/xxxxshprod/xxxxshpd2/incident/incdir_5241532/xxxxshpd2_ora_24707320_i5241532.trc

WedAug 15 16:15:09 2012

Noncritical error ORA-48913 caught while writing to trace file"/db/diag/rdbms/xxxxshprod/xxxxshpd2/trace/xxxxshpd2_ora_40042558.trc"

Writingto the above trace file is disabled for now on...

Noncritical error ORA-48913 caught while writing to trace file"/db/diag/rdbms/xxxxshprod/xxxxshpd2/incident/incdir_5241532/xxxxshp

d2_ora_24707320_i5241532.trc"

Writingto the above trace file is disabled for now on...

WedAug 15 16:16:02 2012

Errorsin file/db/diag/rdbms/xxxxshprod/xxxxshpd2/trace/xxxxshpd2_ora_36044968.trc (incident=5241340):

ORA-04031:unable to allocate 32 bytes of shared memory ("sharedpool","select count(*) as totalrec...","KGLH0^c0f94bbb","kglHeapInitialize:temp")

 

二.诊断过程

1.查看这段时间的ASH可以得知过去数据库发生了什么事,其中最严重的等待事件为“cursor: pin S wait on X

 

SQL> select * from (

   select /*+ rule */

     event,count(1) cnt

      from dba_hist_active_sess_history

     where sample_time > to_date('2013-09-2310:15:00', 'yyyy-MM-dd HH24:mi:ss')

       and sample_time <to_date('2013-09-23 11:40:00', 'yyyy-MM-dd HH24:mi:ss')

       group by event

       order by cnt desc) whererownum<=5;

 

EVENT                                     CNT

--------------------------------------------------------------------------

cursor:pinwaitonX              944          latch:sharedpool                                     305 dbfilesequentialread                     277   nullevent                  95          library cache: mutex                             87

SQL>

 

--根据等待事情查找相关的SQL

SQL> select sql_id,count(1)

      fromdba_hist_active_sess_history

    where

     sample_time > to_date('2013-09-2316:15:00', 'yyyy-MM-dd HH24:mi:ss')

     and sample_time < to_date('2013-09-2316:40:00', 'yyyy-MM-dd HH24:mi:ss')

     and event = 'cursor: pin S wait on X'

     group by sql_id;

 

SQL_ID         COUNT(1)

------------- ----------

22bb9t1167adc         5

6yx2qgbpw6j8x       909

bunvx480ynf57        22

dcrjrtxqrsrsw         8

根据上面结果,有问题的SQL_id为6yx2qgbpw6j8x,根据SQL_ID查找问题的SQL

Select *from v$sql where sql_id='6yx2qgbpw6j8x'

取得的SQL如下:

selectcompany_id,user_id from t_xxxx_user where user_name=:1

 

同时查看这个SQL的执行计划是否正常,通过下面结果可以看到,这个SQL非常简单,全表只有883条记录,查询时走的是唯一索引

SQL>

SQL> --查看历吏执行计划

SQL> SELECT sp.id,

  2        lpad(' ', sp.DEPTH * 3, ' ') || sp.OPERATION || ' ' || sp.OPTIONS operation,

  3        sp.OBJECT_NAME

  4    FROM dba_hist_sql_plan sp

  5   WHERESQL_ID ='6yx2qgbpw6j8x'

  6   ORDER BY ID;

 

 ID    OPERATION   OBJECT_NAME

-------------------------------------------------------------------------------------------------------------------------

         0SELECTSTATEMENT                                                                

        1    TABLE ACCESSBYINDEXROWID                                                  T_XXXX_USER

        2       INDEX UNIQUESCAN                                                         U_TXXXX_USER__USER_NAME

 

SQL> select * fromtable(dbms_xplan.display_awr('6yx2qgbpw6j8x'));

 

PLAN_TABLE_OUTPUT

--------------------------------------------------------------------------------

SQL_ID 6yx2qgbpw6j8x

--------------------

select company_id,user_id from t_xxxx_user whereuser_name=:1

Plan hash value: 377560326

--------------------------------------------------------------------------------

| Id  |Operation                  |Name                   | Rows  | Bytes |

--------------------------------------------------------------------------------

|   0 | SELECTSTATEMENT           |                        |       |       |

|   1 |  TABLE ACCESS BY INDEX ROWID|T_XXXX_USER            |     1 |    20 |

|   2 |   INDEX UNIQUESCAN         | U_TXXXX_USER__USER_NAME|     1 |       |

--------------------------------------------------------------------------------

 

14 rows selected

 

 

2.分析这期间的AWR报告如下

 产生AWR比对报告

@?/rdbms/admin/awrddrpt

比较出问题时的AWR和正常情况下的AWR,发现平均每秒的硬解析和软解析没有明显的增长,再比较了AWR中的SQL ordered by Parse Calls和SQL ordered by Version Count都和平时差不多,包括SQL_ID为6yx2qgbpw6j8x的SQL的执行次数,初步怀疑为oracle bug

出问题期间的AWR

Per Second

Per Transaction

Per Exec

Per Call

DB Time(s):

13.8

0.3

0.02

0.01

DB CPU(s):

0.7

0.0

0.00

0.00

Redo size:

194,547.2

4,767.2

 

 

Logical reads:

7,571.0

185.5

 

 

Block changes:

815.7

20.0

 

 

Physical reads:

296.2

7.3

 

 

Physical writes:

71.7

1.8

 

 

User calls:

1,339.8

32.8

 

 

Parses:

497.4

12.2

 

 

Hard parses:

18.2

0.5

 

 

W/A MB processed:

1.3

0.0

 

 

Logons:

0.1

0.0

 

 

Executes:

774.7

19.0

 

 

Rollbacks:

0.0

0.0

 

 

Transactions:

40.8

 

 

 

 

Instance Efficiency Percentages (Target 100%)

Buffer Nowait %:

99.90

Redo NoWait %:

100.00

Buffer Hit %:

96.12

In-memory Sort %:

100.00

Library Hit %:

92.39

Soft Parse %:

96.35

Execute to Parse %:

35.79

Latch Hit %:

99.73

Parse CPU to Parse Elapsd %:

2.49

% Non-Parse CPU:

73.0

 

 

 

 

正常情况下的AWR

Load Profile

Per Second

Per Transaction

Per Exec

Per Call

DB Time(s):

2.7

0.1

0.00

0.00

DB CPU(s):

0.3

0.0

0.00

0.00

Redo size:

182,077.7

4,810.9

 

 

Logical reads:

9,038.7

238.8

 

 

Block changes:

756.9

20.0

 

 

Physical reads:

338.2

8.9

 

 

Physical writes:

78.9

2.1

 

 

User calls:

1,284.9

34.0

 

 

Parses:

482.7

12.8

 

 

Hard parses:

17.5

0.5

 

 

W/A MB processed:

1.0

0.0

 

 

Logons:

0.1

0.0

 

 

Executes:

744.9

19.7

 

 

Rollbacks:

0.0

0.0

 

 

Transactions:

37.9

 

 

 

 

 

Instance Efficiency Percentages (Target100%)

Buffer Nowait %:

99.94

Redo NoWait %:

100.00

Buffer Hit %:

96.29

In-memory Sort %:

100.00

Library Hit %:

92.88

Soft Parse %:

96.37

Execute to Parse %:

35.20

Latch Hit %:

99.78

Parse CPU to Parse Elapsd %:

16.07

% Non-Parse CPU:

92.86

 

 

 

3.metalink1349387.1中发现下面一段话,If the database has been migrated from 10g to 11g and mutexperformance issue surfaces,please consider the 11.2.0.2.2 psu + fix for12431716.  Many mutex fixes are already included in this patch,查看Bug12431716的说明,版本和现象都比较相似,oracle强烈推荐在11.2.0.2版本中打上补丁12431716

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Troubleshooting  'cursor: pin S wait on X' waits. [ID1349387.1]

To  Bottom

Modified:Jan 7, 2012Type:TROUBLESHOOTINGStatus:PUBLISHEDPriority:1

Comments (0)

In this Document
  Purpose
  
LastReview Date
  
Instructionsfor the Reader
  
TroubleshootingDetails
     
Whatis a 'Cursor: pin S wait on X' wait?
     
Whatcauses 'Cursor: pin S wait on X' waits?
     
How todiagnose the cause.
     
How toExamine the Diagnostics.
     
PotentialSolutions
     
TroubleshootingOther Issues
  
References

Applies to:

Oracle Server - Enterprise Edition - Version:10.2.0.1 to 11.2.0.2 - Release: 10.2 to 11.2
Information in this document applies to any platform.

Purpose

The purpose of the note is to helpcustomers troubleshoot 'Cursor: pin S wait on X' waits.

Last Review Date

August 17, 2011

Instructions for the Reader

A Troubleshooting Guide is provided toassist in debugging a specific issue. When possible, diagnostic tools areincluded in the document to assist in troubleshooting.

Troubleshooting Details

What is a 'Cursor: pin S wait on X'wait?

A cursor wait is associated with parsing in some form. Asession may wait for this event when it is trying to get a mutex pin in Sharemode but another session is holding the mutex pin on the same cursor object inexclusive.  Frequently, waits for 'Cursor: pin S wait on X' is a symptomand not the cause.  There may be underlying tuning requirements or knownissues. 

What causes 'Cursor: pin S wait on X'waits?

·        Frequent Hard Parses
If the frequency of Hard Parsing is extremely high, then contention can occuron this pin.

·        High Version Counts
When Version counts become excessive, a long chain of versions needs to beexamined and this can lead to contention on this event

·        Known bugs

How to diagnose the cause.

Obtain information and diagnostics to help locate thecause.  

1. Run AWR and ADDM reports during the time that you see 'Cursor: pin S wait onX' waits and also collect baseline reports when you do not see the waits tomake comparisons.  The baseline shows us the typical 'background'concurrency and activity going on during the time of both the issue and thebaseline and  may help identify (for example) culprit SQLs with highversion counts. 

To gather this it is suggested to run AWR and ADDM for half an hour to an hourinterval as follows:

SQL>@$ORACLE_HOME/rdbms/admin/awrrpt.sql

SQL>@$ORACLE_HOME/rdbms/admin/addmrpt.sql



2. Sometimes system state dump is necessary to match known issues.  Forexample, if there is no obvious candidate SQL in AWR, capturing holder orwaiter processes in systemstate allows you to focus in on potential problems.Run system state when processes appear hung on 'Cursor: pin S wait on X' waits:

(a) Non-Rac
sqlplus "/as sysdba"

oradebugsetmypid
oradebug unlimit
oradebug dumpsystemstate 266
wait 90 seconds
oradebug dumpsystemstate 266
wait 90 seconds
oradebug dumpsystemstate 266
quit

(b) RAC
$ sqlplus '/ assysdba'
oradebugsetmypid
oradebug unlimit
oradebug setinstall
oradebug -g allhanganalyze 4
oradebug -g alldump systemstate 266
quit



3. Errorstacks: Another way to obtain process information is witherrorstack.  Assuming you can identify a blocker, taking errorstacks willprovide much the same information as systemstates but with a much reduced diskfootprint for trace. Once the ospid of the blocker has been found, anerrorstack can be generated:


$ sqlplus
SQL> oradebugsetospid <p.spid from above>
oradebug dumperrorstack 3
<< wait1min>> 
oradebug dumperrorstack 3
<< wait1min>>
oradebug dumperrorstack 3
exit


In particular, the stack from the resultant trace can be used to match knownissues.

The system state and errorstacks are not easily readable; so a Service Requestmay need to be opened to read the files.

4. It is not always feasible to run system state dump. Thus, this note onfinding the blocker can be used as well:

Note:786507.1 How toDetermine the Blocking Session for Event: 'cursor: pin S wait on X'


5. Furthermore, the following sqls can be ran to identify the waiters:

SELECT s.sid, t.sql_text
FROM v$sessions, v$sql t
WHERE s.eventLIKE '%cursor: pin S wait on X%'
AND t.sql_id =s.sql_id


6. In 11g RAC, there is another less resource intensive tool that can be usedwhen compared with taking system state dumps:

Note:459694.1 Procwatcher:Script. to Monitor and Examine Oracle DB and Clusterware Processes

 

How to Examine the Diagnostics.

1. Look for high parsing and high version counts fromAWR.
Click on *SQL Statistics under Main Report of AWR:



Then, under SQL Statistics click on 'SQL ordered by Parse Calls' or 'SQLordered by Version Count' to view that information:

SQL ordered by Parse Calls

* Total Parse Calls: 2,935,390
* Captured SQLaccount for 95.6% of Total 

Parse CallsExecutions % Total Parses SQL Id SQL Module SQL Text
668,174 668,01422.76 gpkdazvcfwkv select DATA_TABLE (...
667,440 668,23422.74 5p9vjzht9jqb INSERT INTO DATA_TABLE (DATA_I...


From this list, investigate the top SQLs to determine whether this volume ofparse calls is excessive or can be reduced. 

SQL ordered by Version Count

 

* Only Statements with Version Countgreater than 20 are displayed 

Version CountExecutions SQL Id SQL Module SQL Text
277 392,73730d5a6v16mpb select FLOW_CONTEXT (...
78 131,1047c0gj35488xs INSERT INTO PROJECT (DOC_ID, ...


From this list, investigate the SQLs with the high version count. What are thereasons that these statements are not shared? Can this be addressed? 

2. From systemstate and errorstack, we are looking for stacks to match knownissues.  

For systemstates/errorstacks to be useful, it relies them being taken "atthe right time". On a fast moving system, processes may have moved on soholders and waiters are not shown. In a perfect example the  systemstatesshould show the blocker and processes being blocked or one of these.  Thestack from blocker may indicate known issues.

Potential Solutions

1. Tune sqls with  high parse count by eitherinvestigating the application or the sql. 

From AWR, the parse statistics is at the top of the report under load profile:





In this case there are mainly soft parses, however, if there were high hardparses, this may indicate high usage of literals or introduction of many newsqls. In this case consider using bind variables in place ofliterals.  

If there is high percentage of soft parses, then check the application to seeif it is using shareable sqls.  

Ideally, Execute to Parse should be closer to 100%.  The applicationshould parse less and execute many times. See:

Note:62143.1 Understandingand Tuning the Shared Pool


Also remember if the shared pool is flushed, then sqls will need to be hardparsed.  This also may cause mutex waits.  So make sure sqls are inmemory once hard parsed and monitor to see if mutex waits arealleviated.  

2. High version counts can also cause 'cursor: pin S wait on X' waits.

Check V$SQL_SHARED_CURSOR to see the potential reason for the high versioncount using:

Note:438755.1 FormatedV$SQL_SHARED_CURSOR Report by SQLID or Hash Value
Note:296377.1 Troubleshooting:High Version Count Issues (Doc ID 296377.1)


There are some notable bugs with high version counts:

Note:10157392.8 Bug10157392 - High version counts for SQL with binds (BIND_MISMATCH)
Note:9689310.8 Bug9689310 - Excessive child cursors / high VERSION_COUNT / OERI:17059 due to bindmismatch


3. For more known defects, please go to following note and click on known bugs:

Note:1298015.1 WAITEVENT:"cursor: pin S wait on X" Reference Note


Click on the version that applies and review bug or bugs with similar scenario.

4. If the database has been migrated from 10g to 11g and mutex performanceissue surfaces,
please consider the 11.2.0.2.2 psu + fix for 12431716.  Manymutex fixes are already included in this patch:

Note:1291879.1 Oracle Database Patch Set Update11.2.0.2.2 Known Issues




Troubleshooting Other Issues

For guidance troubleshooting other performance issuessee:

Document 1377446.1 TroubleshootingPerformance Issues

 

References

NOTE:1291879.1 - Oracle DatabasePatch Set Update 11.2.0.2.2 Known Issues
NOTE:1298015.1 -WAITEVENT: "cursor: pin S wait on X" Reference Note
NOTE:296377.1 -Troubleshooting: High Version Count Issues
NOTE:459694.1 -Procwatcher: Script. to Monitor and Examine Oracle DB and Clusterware Processes
NOTE:786507.1 - How toDetermine the Blocking Session for Event: 'cursor: pin S wait on X'
NOTE:1356828.1 - FAQ:'cursor: ' waits


 

Attachments

 

·        Parse_Vers.png(13.75 KB)

·        Parses.png(16.69 KB)

·        SQL_Stats_link.png(9.94 KB)

·        Soft_Parse.png(13.39 KB)

 

 

 

 

Mutex waits may  cause higher CPU usage in 11.2.0.2.2 PSU - Overlay fix Recommended [ID  1321817.1]

To  Bottom

Modified:May 17, 2011Type:ALERTStatus:PUBLISHEDPriority:1

Comments (0)

In this Document
  Description
  
Likelihoodof Occurrence
  
PossibleSymptoms
  
Workaroundor Resolution
  
Patches
  
ModificationHistory
  
References

Applies to:

Oracle Server - Enterprise Edition -Version: 11.2.0.2 to 11.2.0.2 - Release: 11.2 to 11.2
Information in this document applies to any platform.
This issue affects the following server releases only:
- 11.2.0.2.2 Database Patch Set Update (PSU) [ 11724916 ]
- 11.2.0.2.2 Grid Infrastructure Patch Set Update (GI PSU) [ 12311357 ]
- Windows 11.2.0.2 Patch bundles 4, 5 and 6.

This issue also affects 11.2.0.2.1 interim overlay patches for Bug 10411618 onUnix platforms downloaded before 14/May/2011.
(Interim patches for other versions are NOT affected. Interim patchesdownloaded on or after 14/May/2011 are NOT affected).

This issue does NOT affect any of the Exadata Database Recommended Patches.

Description

An incomplete fix in the above releasescan cause database mutex waits to be accompanied by higher CPU usage thanexpected.

Likelihood of Occurrence

This problem is most likely to impactlarge heavily concurrent systems especially if those systems have some degreeof mutex contention.

Possible Symptoms

There are 2 mains symptoms of theproblem:

1.     Higher CPU usagemay be seen for mutex waits compared to earlier releases. This may appear asCPU spikes or as general higher CPU usage levels depending on the nature of thecontention.

2.     The hiddenparameters described in Note:10411618.8, which allowconfiguration of the mutex wait scheme, do not work as expected. Notablysetting "_mutex_wait_scheme" has no effect.

Workaround or Resolution

·        For non-Windows platforms it is recommended toapply Patch:12431716 on top of the 11.2.0.2.2 PSU.

·        For Windows platforms on 11.2.0.2 bundle patch 4, 5 or 6it is recommended to apply bundle patch 7 when available. If that patch cannotbe applied then it is recommended to set "_mutex_wait_time"=8 atinstance level until such time as it is applied.

·        11.2.0.2.1 systems with overlay patches for bug 10411618from before 14/May/2011 are advised to replace the patch with the latestversion.

·        Once the relevant patch is in place then the mutex waitscheme can be configured as described in Note:10411618.8 ifrequired.

 

Patches

·        For non-Windows platforms with 11.2.0.2.2 PSU installedit is recommended to apply overlay Patch:12431716 (or any patch containing thatfix).

·        For Windows platforms it is recommended to apply 11.2.0.2bundle patch 7 (or later) once available.

·        11.2.0.2.1 overlay patches for 10411618 on non-Windowsplatforms were reloaded on 14/May/2011. If you have an 11.2.0.2.1 overlay patchfor bug 10411618 from before that date then it is advisable to replace it withthe current version.
eg: Rollback the installed patch, download the latest version of 
Patch:10411618 for 11.2.0.2.1, apply thereplacement patch. 

·        The problem will be fixed generally in the next 11.2.0.2Patch Set Update (11.2.0.2.3) due for release in July 2011.

Modification History

13-May-2011 - Created
16-May-2011 - Confirm this does NOT affect Exadata Recommended Patches
17-May-2011 - Add references to 11.2.0.2.2 PSU / GI PSU Known Issues (whichmention this problem)

References

NOTE:10411618.8 - Bug10411618 - Enhancement to add different "Mutex" wait schemes
NOTE:11724916.8 - Bug11724916 - 11.2.0.2.2 Patch Set Update (PSU)
NOTE:12311357.8 - Bug12311357 - 11.2.0.2.2 Grid Infrastructure Patch Set Update (GI PSU)
NOTE:12431716.8 - Bug12431716 - Mutex waits may cause higher CPU usage in 11.2.0.2.2 PSU / GI PSU
NOTE:1272288.1 -11.2.0.2.X Grid Infrastructure Bundle/PSU Known Issues
NOTE:1291879.1 - OracleDatabase Patch Set Update 11.2.0.2.2 Known Issues

 

你可能感兴趣的:(Oracle 11.2.0.2.0 RAC环境一次内存溢出ORA-04031问题的处理)