In this Document
Purpose
Troubleshooting Steps
Mutex Event Waits
How to Identify Mutex Event Waits
Diagnosing Potential Causes using AWR Report
Load Profile
Increased Parse Counts
Mutex Sleeps
Database Appears 'Hung'
Documents with Suggestions on How to Diagnose Specific Waits
Potential Solutions
Use Recommended Patch Levels
Reduce Parsing
High Version Counts
CURSOR_SHARING=SIMILAR
OS Resources
Size the Shared Pool Correctly
Known Issues
11.2.0.X
11.1.0.7.X
Further Details on Specific Mutex Wait Events
Troubleshooting Other Issues
References |
The purpose of the note is to help customers troubleshoot mutex waits.
"Mutex Waits" is a collective term for waits for resources associated with the management of cursor objects in the shared pool during parsing. Mutexes were introduced in 10g as a faster, more lightweight alternative to latches, and waits for these resources occur in normal operation. However, when these waits become excessive, contention can occur and cause problems.
Full troubleshooting and diagnostics for every type of mutex-related issue are beyond the scope of this article, but the basic principles and problem identification can be covered.
First, confirm that mutex waits are actually occurring.
Mutex waits are characterised by sessions waiting for one or more of the following events:
Cursor mutexes are used to protect the parent cursor and also with cursor statistic operations.
Cursor pins are used to pin a cursor in preparation for a related operation on the cursor.
Library cache mutexes are similar to library cache operations in earlier versions except they are now implemented using mutexes. In all these cases, waits for these resources occur when two or more sessions are working with the same cursors simultaneously. When one session takes and holds a resource required by another, the second session waits on one of these events.
Mutex contention is typically characterised by a perception of slow performance at the session or even the database level. Since mutexes are almost wholly a CPU-consuming resource, CPU usage can rise when contention occurs and will quickly start to impact users. In normal operation the amount of CPU used per mutex operation and the time taken are extremely small, but when contention occurs and the number of mutex operations against the same objects runs into the millions, these small numbers add up. Additionally, as the CPU becomes busier, the mutex operations themselves can start to take longer (because of the time spent waiting on the CPU run queue), further adding to the problem.
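As a rough way to see which mutex-related wait events a given instance knows about, the event names can be listed from V$EVENT_NAME. This is only a sketch: the LIKE patterns are an assumption based on the three families described above (cursor mutexes, cursor pins and library cache mutexes), and the exact set of events returned varies by version.

```sql
-- Sketch: list the wait events belonging to the three mutex families
-- described above. The LIKE patterns are assumptions, not an official list.
select name, wait_class
from   v$event_name
where  name like 'cursor: mutex%'
   or  name like 'cursor: pin%'
   or  name like 'library cache: mutex%'
order  by name;
```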
The best starting point for identification of Mutex waits is the use of a general database report such as the Automatic Workload Repository (AWR) Reports.
When looking for mutex contention it is best to collect AWR reports for 2 separate periods:
Collection of both an active report and a baseline is extremely useful for comparison purposes.
For information on how to collect AWR reports refer to:
For mutex contention, it is preferable to look at snapshots with a maximum duration of an hour. Durations as short as 5-10 minutes can be used as long as the durations are the same for the baseline and problem periods.
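To pick matching problem and baseline windows, it can help to list the available snapshots first. The following is a minimal sketch against DBA_HIST_SNAPSHOT; the one-day lookback is an arbitrary assumption and should be adjusted to cover both periods.

```sql
-- Sketch: list recent AWR snapshots so that begin/end snapshot IDs of equal
-- duration can be chosen for the problem and baseline periods.
-- The 1-day lookback is an illustrative assumption.
select snap_id,
       dbid,
       instance_number,
       begin_interval_time,
       end_interval_time
from   dba_hist_snapshot
where  begin_interval_time >= systimestamp - interval '1' day
order  by instance_number, snap_id;
```

The SNAP_ID and DBID values returned here are the ones needed by any query that filters on a snapshot range.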
If mutex contention is occurring then usually mutex waits will surface to the top timed events:
Problem Period AWR Report: (1 hour duration)
Compare to the baseline report: (1 hour duration)
Baseline AWR Report:
In the problem report, the top wait is for the cursor operation 'library cache: mutex X', which means that sessions are waiting to get a library cache mutex in eXclusive mode for one or more cursors. From the figures, this accounts for more than 56.42% of database time. The average wait time of 294 ms (milliseconds) is extremely high, as is the number of waits at more than 1.3 million in an hour.
In comparison, during the baseline, there is no evidence of high waits for mutex events in the top 5 at all and the events seen are the more normal I/O waits.
Now that we have identified a problem, we want to dig deeper and determine the area the problem is in so that we can ultimately get to a root cause and a solution.
- If you have an AWR report that shows high mutex waits, start by running:
-- Top 10 P1/SQL_ID combinations for 'library cache: mutex X' waits
-- sampled in ASH history between the two snapshots.
select *
from  (select p1,
              sql_id,
              count(*),
              (ratio_to_report(count(*)) over ()) * 100 pct
       from   dba_hist_active_sess_history
       where  event = 'library cache: mutex X'
       and    snap_id between <begin snap> and <end snap>
       and    dbid = <dbid>
       group  by p1, sql_id
       order  by count(*) desc)
where  rownum <= 10;
This will give you the top 10 P1/SQL_ID arguments of the waits.
The SQL_ID is the SQL statement the session is running.
The P1 is the object the mutex is against.
For the topmost P1 run:
-- X$ tables are only visible when connected as SYS.
select KGLNAOBJ, KGLNAOWN, KGLHDNSP, KGLOBTYP
from   x$kglob
where  KGLNAHSH = {value of P1};
This will tell you the object the mutex is against. If the same SQL_ID shows up with different P1 values in the top 10, then the problem is likely to be related to that SQL statement. If the SQL_ID and P1 are unique, it is likely to be a hot object.
If there is a hot object, review the following bug:
If there is no hot object, but high general mutex waits, start diagnosing the load profile.
The load profile on the server and the location of that load can help you drill down. For mutex contention issues we are primarily interested in parse information.
Problem Period AWR Report: (1 hour duration)
Baseline AWR Report: (1 hour duration)
Generally, the load is higher in the "Problem Period" report. Furthermore, the parse statistics are higher in the 'bad' report; hard parse is 45 vs 23 per second. So this indicates that there is a higher rate of parsing in the Problem period which may be causing contention issues. Now we should look to see the SQL that is being parsed the most as this is likely to be the cause of the problem.
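The same parse figures can also be pulled directly from the AWR history rather than read off the report. The following is a sketch only: the bind variables :begin_snap, :end_snap and :dbid are placeholders for the snapshot range and database ID, and the result is only meaningful if the instance was not restarted between the two snapshots (the statistics are cumulative since startup).

```sql
-- Sketch: parse and execute counts accumulated between two AWR snapshots.
-- :begin_snap, :end_snap and :dbid are placeholder bind variables.
-- Divide each delta by the elapsed seconds to get a per-second rate.
select e.instance_number,
       e.stat_name,
       e.value - b.value as delta
from   dba_hist_sysstat b
       join dba_hist_sysstat e
         on  e.dbid            = b.dbid
         and e.instance_number = b.instance_number
         and e.stat_id         = b.stat_id
where  b.dbid    = :dbid
and    b.snap_id = :begin_snap
and    e.snap_id = :end_snap
and    e.stat_name in ('parse count (total)',
                       'parse count (hard)',
                       'execute count')
order  by e.instance_number, e.stat_name;
```

Running it once for the problem window and once for the baseline window gives directly comparable numbers, provided both windows have the same duration.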
Under SQL ordered by Parse Calls, we are looking for the total parse calls and then the parse calls for particular statements:
Problem Period AWR Report: (1 hour duration)
Baseline AWR Report: (1 hour duration)
Overall, the parse count has increased from 1.8M to 3.1M. Focusing on specific statements, SQL_IDs '68subccxd9b03' and '12235mxs4h54u' have doubled their number of parses, and '3j91frnd21kks' has come in from 'nowhere'; its parses must also have at least doubled, since the lowest parse count shown in the baseline is 15,000 and this statement shows 42,000.
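If the AWR report is not to hand, a similar "SQL ordered by Parse Calls" listing can be approximated from DBA_HIST_SQLSTAT. This is a sketch under stated assumptions: :begin_snap, :end_snap and :dbid are placeholder bind variables, and only statements captured by AWR in that range will appear.

```sql
-- Sketch: top 10 statements by parse calls between two AWR snapshots.
-- Each *_DELTA value covers the interval ending at its snapshot, so the
-- begin snapshot itself is excluded from the range.
select *
from  (select sql_id,
              sum(parse_calls_delta) as parse_calls,
              sum(executions_delta)  as executions
       from   dba_hist_sqlstat
       where  dbid    = :dbid
       and    snap_id >  :begin_snap
       and    snap_id <= :end_snap
       group  by sql_id
       order  by sum(parse_calls_delta) desc)
where  rownum <= 10;
```

A statement whose parse calls are close to its executions is being parsed on almost every execution, which is one reason it may be worth a closer look.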
These SQL statements are good candidates for investigation:
By answering these kinds of questions, you can often find potential causes.
See the "Over Parsing" section in:
When a mutex is requested, this is called a get request.
If a session tries to get a mutex but another session is already holding it, then the get request cannot be granted and the session requesting the mutex will 'spin' for the mutex a number of times in the hope that it will be quickly freed. The session spins in a tight loop, testing each time to see if the mutex has been freed.
If the mutex has not been freed by the end of this spinning, the session waits.
When this happens the sleeps column for the particular code location where the session is waiting is incremented in the v$mutex_sleep* views.
This 'Sleeps' count for a particular location is very useful for identification of the area in which mutex contention is occurring.
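As an illustration, the sleep counts per location can be read from V$MUTEX_SLEEP. This is a minimal sketch; the figures are cumulative since instance startup, and the unit of WAIT_TIME (documented as microseconds) should be checked for your version before comparing it with AWR output.

```sql
-- Sketch: mutex sleeps by type and code location since instance startup,
-- with the locations that have accumulated the most wait time first.
select mutex_type,
       location,
       sleeps,
       wait_time
from   v$mutex_sleep
order  by wait_time desc, sleeps desc;
```

Because the values are cumulative, sampling the view twice and comparing the two results gives a better picture of where sleeps are occurring right now.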
In later versions this information is externalised in the 'Mutex Sleep Summary' section of the AWR report:
Mutex Type        Location              Sleeps     Time (ms)
----------------  --------------  ------------  ------------
Library Cache     kglpin1 4         20,053,325       201,203
Library Cache     kglget1 1             38,809       110,015
Library Cache     kglpndl1 95           25,147        55,946
Library Cache     kglpin1 4             24,887        52,524
What we are interested in here is the location and primarily the Time spent in each. The number of sleeps is also important but if it takes no time then it is unlikely to be affecting performance.
This information can be used to search for other similar issues that have also resulted in contention in this particular area and from these determine solutions that have previously been used to address these.
As an example:
In this case the top location for sleeps is the Library Cache location 'kglpin1 4'.
In terms of time, it takes almost twice as long as the next location and is also responsible for about 20 million more sleeps. It would therefore be a good candidate for a search for known issues. In this case, if you search on 'kglpin1 4', one of the documents you will find is:
which may be directly applicable, or may give pointers as to potential solutions.
Interpretation is as with the AWR example above.
Sometimes contention for mutexes will become so intense that the database may appear to hang. In these cases, it is useful to determine which session or sessions are blocking others and to investigate what the blocking sessions are doing.
By running the following select (which outputs the session ID and the text of the SQL being executed) at short intervals, you can pick up common blockers and investigate their activities. If the same SQL is seen repeatedly, it can be investigated for problems in the same way as the high-parsing SQL was investigated previously.
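A query of this kind might look like the sketch below. It is an illustration rather than a definitive statement: the event patterns are assumptions, the BLOCKING_SESSION column is an addition beyond the session ID and SQL text mentioned above, and BLOCKING_SESSION may not be populated for mutex waits on every version.

```sql
-- Sketch: sessions currently waiting on mutex events, with the SQL they are
-- running and, where available, the session reported as blocking them.
select s.sid,
       s.event,
       s.blocking_session,
       s.sql_id,
       t.sql_text
from   v$session s
       left join v$sqlarea t
         on t.sql_id = s.sql_id
where  s.state = 'WAITING'
and   (s.event like 'cursor: %'
       or s.event like 'library cache: mutex%')
order  by s.event, s.sid;
```

Repeating the query every few seconds and noting which SQL_ID or blocking session keeps reappearing helps to separate a persistent blocker from transient waits.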
For guidance troubleshooting other performance issues see: