数据库 hang问题的诊断信息收集方法
来源于:
How to Collect Diagnostics for Database Hanging Issues (文档 ID 452358.1)
适用于:
Oracle Database - Enterprise Edition - Version 9.0.1.0 and later
Oracle Database - Personal Edition - Version 9.0.1.0 and later
Oracle Database - Standard Edition - Version 9.0.1.0 and later
Information in this document applies to any platform.
目标:
当一个数据库hang时,为了确定hang的根本原因,进行数据库的信息收集是很用的。hang的根本原因经常被隔离并通过诊断信息收集来解决。
作为一个可选的方案,如果不可行,我们可以使用收集到的信息来防止将来再次方式此类事件。
Performance Service Request Diagnostic Collection (SRDC) documents
Service Request Data Collection (SRDC) documents have been specially designed to provide the reader with the necessary instructions to provide a step by step guide to collecting information for a various common Database Performance Issues.
Document 1938786.1 List of Available Database Performance Related SRDC Documents
What is needed to diagnose "Database Hang" issues? 为了诊断“database hang”问题, 什么是需要做的?
数据库hang的特指是:许多进程在等待一些其他activities 完成。典型情况是:有一个或者多个阻塞者,这些阻塞者被卡住(stuck)或者 正在努力工作并且没有快速释放资源。
为了判断这个问题,需要做下面的诊断:
A.Hanganalyze and Systemstate Dumps
B.AWR/Statspack snapshots of General database performance
C.Up to date RDA
注意: 多租户数据库中的Hang 如果你正在运行多租户数据库,那么你需要确定你遇到的hang问题是container级别的还是一个特定的pluggable database(PDB)级别的。 Once this is established then connect to and collect the diagnostics only within the PDB that is experiencing the issue. 如果不明确在哪里hang住了,那么从诊断的观点来说,最好是连接到Root container中收集诊断信息,以便于所有PDB中的所有的processes被覆盖到,而不会忽略掉有用的信息。 但是,如果你有大量的PDB,并且只是他们其中的一个 hang住了,那么这回导致收集到很多不相关的数据。有了这个想法,请付出每一个努力以辨别是哪个database hang住了,并在那个hang住的database中收集信息。
A. Dumps and Traces
Hanganalyze and Systemstate Dumps
Hanganalyze 和 Systemstate Dumps 提供了在某一个特定的时间点上,数据库里边进程的信息。Hanganalyze 提供了卷入到hang chain中的所有的process信息,而systemstate 提供了数据库里边所有的process的信息。
当查找一个潜在的hang situation的时候,你需要确定一个process是否被卡住(stuck)或者 缓慢的移动。通过连续收集两个间隔的dump,这个可以被看出来。
如果一个process 被卡住了,这些traces会提供信息以启动更进一步的诊断,并帮助提供解决方案。
Hanganalyze 是一个概括,将会确认(confirm)db是否是真的hang住,还是只是慢(slow),并提供一个连续的snapshot
Systemstate dump 显示了数据库中的process正在干些什么。
收集Hanganalyze and Systemstate Dumps
登陆进入系统:
使用下列命令登陆:
sqlplus '/ as sysdba'
若是如上方式连接数据库有问题,那么在10gR2以及更高版本中,可以使用 sqlplus preliminary connection方式:
sqlplus -prelim '/ as sysdba'
注意:从11.2.0.2开始到更高的版本,sqlplus "preliminary connection" 下的hanganalyze 不会产生输出,因为,hanganalyze需要 a process state object and a session state object。
如果 hanganalyze被尝试执行,虽然hanganalyze 表面上是成功的:
SQL> oradebug hanganalyze 3 Statement processed.
tracefile将会包括下面的输出:
HANG ANALYSIS: ERROR: Can not perform hang analysis dump without a process state object and a session state object. ( process=(nil), sess=(nil) )
For more about connecting with a preliminary connection, see:
Document 986640.1 How To Connect Using A Sqlplus Preliminary Connection
非rac时, Hanganalyze and Systemstate的收集命令
有些时候, database 实际上是很慢,并不是真正的hang。
It is therefore recommended, where possible to get 2 hanganalyze and 2 systemstate dumps in order to determine whether processes are moving at all or whether they are "frozen".
Hanganalyze
sqlplus '/ as sysdba' oradebug setmypid oradebug unlimit oradebug hanganalyze 3 -- Wait one minute before getting the second hanganalyze oradebug hanganalyze 3 oradebug tracefile_name exit
Systemstate
sqlplus '/ as sysdba' oradebug setmypid oradebug unlimit oradebug dump systemstate 266 oradebug dump systemstate 266 oradebug tracefile_name exit
rac时, Hanganalyze and Systemstate的收集命令
若是没有打相关的patch,会有两个影响rac的bug,make using level 266 or 267 very costly。因此,没有打这两个补丁的话,不推荐使用这些level
有关这些补丁的信息,请见:
Document 11800959.8 Bug 11800959 - A SYSTEMSTATE dump with level >= 10 in RAC dumps huge BUSY GLOBAL CACHE ELEMENTS - can hang/crash instances Document 11827088.8 Bug 11827088 - Latch 'gc element' contention, LMHB terminates the instance
注意: 这两个bug 在11.2.0.3中被fix掉了。
当bug 11800959 和 bug 11827088被fix掉时,收集rac的Hanganalyze and Systemstate:
For 11g: sqlplus '/ as sysdba' oradebug setorapname reco oradebug unlimit oradebug -g all hanganalyze 3 oradebug -g all hanganalyze 3 oradebug -g all dump systemstate 266 oradebug -g all dump systemstate 266 exit
当bug 11800959 和 bug 11827088没有被fix掉时,收集rac的Hanganalyze and Systemstate:
sqlplus '/ as sysdba' oradebug setorapname reco oradebug unlimit oradebug -g all hanganalyze 3 oradebug -g all hanganalyze 3 oradebug -g all dump systemstate 258 oradebug -g all dump systemstate 258 exit
对于10g,用 oradebug setmypid 来替代oradebug setorapname reco
sqlplus '/ as sysdba' oradebug setmypid oradebug unlimit oradebug -g all hanganalyze 3 oradebug -g all hanganalyze 3 oradebug -g all dump systemstate 258 oradebug -g all dump systemstate 258 exit
在rac环境中,一个dump file 被创建在所有的rac实例上的diag trace file中。
解释Hanganalyze and Systemstate Levels
Hanganalyze levels:
Level 3: In 11g onwards, level 3 also collects a short stack for relevant processes in hang chain
Systemstate levels:
Level 258 is a fast alternative but we'd lose some lock element data
Level 267 can be used if additional buffer cache / lock element data is needed with an understanding of the cost
其他方法:
If connection to the system is not possible in any form, then please refer to the following article which describes how to collect systemstates in that situation:
Document 121779.1 Taking a SYSTEMSTATE dump when you cannot CONNECT to Oracle.
On RAC Systems, hanganalyze, systemstates and some other RAC information can be collected using the 'racdiag.sql' script, see:
Document 135714.1 Script to Collect RAC Diagnostic Information (racdiag.sql)
Sometimes you may wish to collect systemstate information at the time that a particular error occurs. This can be done by setting an event in the session or system wide to trigger based upon the detection of an error. For example, if a hang was being encountered that was related to an ORA-00054 error, then you could capture a systemstate when the ORA-00054 occurs using the following command:
sqlplus '/ as sysdba' ALTER SYSTEM SET events '54 trace name systemstate level 258';
The next time an ORA-00054 is encountered, a systemstate will be dumped.
The tracing can be disabled with :
ALTER SYSTEM SET events '54 trace name context off';
You can also set such events in the spfile. See:
Document 160178.1 How To Set EVENTS In The SPFILE
Note: be aware that this will produce a trace for every occurrence of the error.
v$wait_chain
从11gR1开始,dia0后台进程会启动收集hanganalyze information 并把这些信息存在内存中的 "hang analysis cache"区中。
Oracle每3秒做一次本地 hanganalyze information ,每10秒做一次 global (RAC) hanganalyze information.
当hang正在发生的时候,这个信息可以提供一个对hang chain occurring的quick view
更多信息,请见:
Document 1428210.1 Troubleshooting Database Contention With V$Wait_Chains
B. Provide AWR/Statspack snapshots of General database performance
Hangs are a visible effect of a number of potential causes, this can range from a single process issue to something brought on by a global problem.
Collecting information about the general performance of the database in the build up to, during and after the problem is of primary importance since these snapshots can help to determine the nature of the load on the database at these times and can provide vital diagnostic information. This may prove invaluable in identifying the area of the problem and ultimately resolving the issue.
To do this, please take and upload snapshot reports of database performance (AWR (or statspack) reports) immediately before, during and after the hang..
Please refer to the following article for details of what to collect:
Document 781198.1 Diagnostics for Database Performance Issues
C. Gather an up to date RDA
An up to date current RDA provides a lot of additional information about the configuration of the database and performance metrics and can be examined to spot background issues that may impact performance.
See the following note on My Oracle Support:
Document 314422.1 Remote Diagnostic Agent (RDA) 4 - Getting Started
On some systems a hang can occur when the DBA is not available to run diagnostics or at times it may be too late to collect the relevant diagnostics. In these cases, the following methods may be used to gather diagnostics:
Real-Time ADDM is a feature of Oracle Enterprise Manager Cloud Control 12c that allows you to analyze database performance automatically when you cannot logon to the database because it is hung or performing very slowly due to a performance issue. It analyzes current performance when database is hanging or running slow and reports sources of severe contention.
For more information see the following video:
Sometimes we may only notice a hang after it has occurred. In this case the following information may help with Root Cause Analysis: