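Generating a trace file with an Oracle systemstate dump is fairly simple, but distilling useful information from the sea of data in that trace file and turning it into a diagnosis takes real skill. This post collects and organizes notes on how to analyze and interpret the trace file produced by a systemstate dump.
Reading a systemstate dump trace by hand is hard labor: these files easily run to several hundred MB or more, because the trace contains state information for every process in the system. Each process has its own section in the trace file reflecting its state: process information, session information, enqueue information (mainly locks), buffer information, and the state of the objects the process holds in the SGA. The trace covers all processes in the system over the interval from the moment the dump starts until the dump completes. What we need to do is find the process that caused the problem, work out the root cause from that process's information, and then resolve the problem.
Fortunately, someone has written a tool, ass109.awk, that saves time when analyzing systemstate dumps and other trace files by pulling out and organizing the key information in them. For the full details you still have to read the trace by hand. Below is an example from a case I worked through while learning. I am still studying this myself, so please point out anything I have analyzed incorrectly.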
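Since each process has its own section in the trace, a first step when reading by hand is to cut the file into per-process chunks. The following is only a minimal Python sketch. It assumes that every process section begins with a header line of the form `PROCESS <n>:`, which should be checked against your own trace; the function name `split_by_process` and the sample lines are made up for illustration and are not real trace content.

```python
import re

# Assumption: a process section in the raw trace starts with "PROCESS <n>:".
PROCESS_HEADER = re.compile(r"^PROCESS (\d+):")

def split_by_process(lines):
    """Return a list of (process_number, section_lines) in file order.

    Lines before the first header are skipped. A list is used instead of a
    dict so that a trace holding several dumps keeps each occurrence of the
    same process number separate.
    """
    sections = []
    for line in lines:
        match = PROCESS_HEADER.match(line)
        if match:
            sections.append((int(match.group(1)), []))
        if sections:
            sections[-1][1].append(line.rstrip("\n"))
    return sections

# Placeholder input, not real trace content.
sample = [
    "(dump header)",
    "PROCESS 1:",
    "(details for process 1)",
    "PROCESS 2:",
    "(details for process 2)",
]
sections = split_by_process(sample)
for pid, body in sections:
    print(pid, len(body))
```

For a real file, pass an open file object (for example `open("scm2_ora_25575.trc", errors="replace")`) instead of `sample`, so the trace is read line by line rather than loaded whole.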
[oracle@db-server udump]$ awk -f ass109.awk scm2_ora_25575.trc
Starting Systemstate 1
.................................................
Starting Systemstate 2
....................................................
Starting Systemstate 3
....................................................
Ass.Awk Version 1.0.9 - Processing scm2_ora_25575.trc
System State 1
~~~~~~~~~~~~~~~~
1:
2: waiting for 'pmon timer'
3: waiting for 'rdbms ipc message'
4: waiting for 'rdbms ipc message'
5: waiting for 'rdbms ipc message'
6: waiting for 'rdbms ipc message'
7: waiting for 'rdbms ipc message'
8: last wait for 'smon timer'
9: waiting for 'rdbms ipc message'
10: waiting for 'rdbms ipc message'
11: waiting for 'rdbms ipc message'
12: waiting for 'rdbms ipc message'
13: waiting for 'SQL*Net message from client'[Latch 855675ae0]
Cmd: Select
14: waiting for 'SQL*Net message from client'[Latch 8556759a0]
Cmd: Select
15: waiting for 'SQL*Net message from client'
Cmd: Select
16: waiting for 'SQL*Net message from client'[Latch 855675720]
17: waiting for 'SQL*Net message from client'[Latch 8556755e0]
Cmd: Insert
18: waiting for 'SQL*Net message from client'[Latch 8556755e0]
Cmd: Select
19: waiting for 'SQL*Net message from client'[Latch 8556755e0]
Cmd: Select
20: waiting for 'SQL*Net message from client'[Latch 8556755e0]
Cmd: Select
21: waiting for 'SQL*Net message from client'[Latch 8556755e0]
Cmd: Insert
22: waiting for 'virtual circuit status'
Cmd: Select
23:
24:
25: waiting for 'virtual circuit status'
Cmd: Select
26:
27: waiting for 'virtual circuit status'
Cmd: Select
28:
29: waiting for 'latch: shared pool' [Latch 8556759a0]
Cmd: Select
30:
31: waiting for 'virtual circuit status'
Cmd: Select
33: waiting for 'jobq slave wait'
34:
35:
36: waiting for 'Streams AQ: qmn slave idle wait'
37: waiting for 'rdbms ipc message'
38: waiting for 'rdbms ipc message'
39: waiting for 'rdbms ipc message'
40: waiting for 'rdbms ipc message'
41: waiting for 'rdbms ipc message'
42: waiting for 'rdbms ipc message'
43: waiting for 'rdbms ipc message'
44: waiting for 'rdbms ipc message'
45: waiting for 'rdbms ipc message'
46: waiting for 'rdbms ipc message'
47: waiting for 'Streams AQ: qmn coordinator idle wait'
49: for 'Streams AQ: waiting for time management or cleanup tasks'
58:
61: waiting for 'virtual circuit status'
Cmd: Select
Blockers
~~~~~~~~
Above is a list of all the processes. If they are waiting for a resource
then it will be given in square brackets. Below is a summary of the
waited upon resources, together with the holder of that resource.
Notes:
~~~~~
o A process id of '???' implies that the holder was not found in the
systemstate.
Resource Holder State
Latch 855675ae0 ??? Blocker
Latch 8556759a0 ??? Blocker
Latch 855675720 ??? Blocker
Latch 8556755e0 ??? Blocker
Object Names
~~~~~~~~~~~~
Latch 855675ae0 Child library cache
Latch 8556759a0 Child library cache
Latch 855675720 Child library cache
Latch 8556755e0 Child library cache
System State 2
~~~~~~~~~~~~~~~~
1:
2: waiting for 'pmon timer'
3: waiting for 'rdbms ipc message'
4: waiting for 'rdbms ipc message'
5: waiting for 'rdbms ipc message'
6: waiting for 'rdbms ipc message'
7: waiting for 'rdbms ipc message'
8: waiting for 'smon timer'
9: waiting for 'rdbms ipc message'
10: waiting for 'rdbms ipc message'
11: waiting for 'rdbms ipc message'
12: waiting for 'rdbms ipc message'
13: waiting for 'SQL*Net message from client'[Latch 855675900]
Cmd: Select
14: waiting for 'SQL*Net message from client'[Latch 855675900]
Cmd: Select
15: waiting for 'SQL*Net message from client'[Latch 855675900]
Cmd: Select
16: waiting for 'SQL*Net message from client'[Latch 855675900]
Cmd: Select
17: waiting for 'SQL*Net message from client'[Latch 855675900]
Cmd: Select
18: waiting for 'SQL*Net message from client'[Latch 855675900]
Cmd: Select
19: waiting for 'SQL*Net message from client'[Latch 855675900]
Cmd: Select
20: waiting for 'SQL*Net message from client'[Latch 855675900]
Cmd: Select
21: waiting for 'SQL*Net message from client'[Latch 855675680]
Cmd: Select
22:
23:
24:
25:
26:
27:
28: waiting for 'virtual circuit status'
29:
30:
31: waiting for 'virtual circuit status'
Cmd: Select
32: waiting for 'jobq slave wait'
33: last wait for 'latch: shared pool' [Latch 600f7320]
34:
35: waiting for 'virtual circuit status'
Cmd: Select
36: waiting for 'Streams AQ: qmn slave idle wait'
37: waiting for 'rdbms ipc message'
38: waiting for 'rdbms ipc message'
39: waiting for 'rdbms ipc message'
40: waiting for 'rdbms ipc message'
41: waiting for 'rdbms ipc message'
42: waiting for 'rdbms ipc message'
43: waiting for 'rdbms ipc message'
44: waiting for 'rdbms ipc message'
45: waiting for 'rdbms ipc message'
46: waiting for 'rdbms ipc message'
47: waiting for 'Streams AQ: qmn coordinator idle wait'
48: waiting for 'library cache load lock'
49: for 'Streams AQ: waiting for time management or cleanup tasks'
50: waiting for 'library cache load lock'
58:
61: waiting for 'virtual circuit status'
Cmd: Select
Blockers
~~~~~~~~
Above is a list of all the processes. If they are waiting for a resource
then it will be given in square brackets. Below is a summary of the
waited upon resources, together with the holder of that resource.
Notes:
~~~~~
o A process id of '???' implies that the holder was not found in the
systemstate.
Resource Holder State
Latch 855675900 ??? Blocker
Latch 855675680 ??? Blocker
Latch 600f7320 ??? Blocker
Object Names
~~~~~~~~~~~~
Latch 855675900 Child library cache
Latch 855675680 Child library cache
Latch 600f7320 Child shared pool
System State 3
~~~~~~~~~~~~~~~~
1:
2: waiting for 'pmon timer'
3: waiting for 'rdbms ipc message'
4: waiting for 'rdbms ipc message'
5: waiting for 'rdbms ipc message'
6: waiting for 'rdbms ipc message'
7: waiting for 'rdbms ipc message'
8: waiting for 'smon timer'
9: waiting for 'rdbms ipc message'
10: waiting for 'rdbms ipc message'
11: waiting for 'latch: shared pool' [Latch 600f7320]
12: waiting for 'rdbms ipc message'
13: waiting for 'SQL*Net message from client'[Latch 855675540]
Cmd: Select
14: waiting for 'SQL*Net message from client'[Latch 855675540]
Cmd: Select
15: waiting for 'SQL*Net message from client'[Latch 855675b80]
Cmd: Select
16: waiting for 'SQL*Net message from client'[Latch 8556757c0]
Cmd: Select
17: waiting for 'SQL*Net message from client'[Latch 855675680]
Cmd: Select
18: waiting for 'SQL*Net message from client'[Latch 855675680]
Cmd: Select
19: waiting for 'SQL*Net message from client'[Latch 855675680]
Cmd: Select
20: waiting for 'SQL*Net message from client'
Cmd: Select
21: waiting for 'SQL*Net message from client'[Latch 855675680]
Cmd: Select
22:
23:
24:
25:
26:
27:
28: waiting for 'virtual circuit status'
29:
30:
31: waiting for 'virtual circuit status'
Cmd: Select
32: waiting for 'jobq slave wait'
33: last wait for 'latch: shared pool' [Latch 600f7320]
Cmd: Select
34:
35: waiting for 'virtual circuit status'
Cmd: Select
36: waiting for 'Streams AQ: qmn slave idle wait'
37: waiting for 'rdbms ipc message'
38: waiting for 'rdbms ipc message'
39: waiting for 'rdbms ipc message'
40: waiting for 'rdbms ipc message'
41: waiting for 'rdbms ipc message'
42: waiting for 'rdbms ipc message'
43: waiting for 'rdbms ipc message'
44: waiting for 'rdbms ipc message'
45: waiting for 'rdbms ipc message'
46: waiting for 'rdbms ipc message'
47: waiting for 'Streams AQ: qmn coordinator idle wait'
48: waiting for 'SQL*Net message from client'
49: for 'Streams AQ: waiting for time management or cleanup tasks'
50: waiting for 'latch: shared pool' [Latch 600f7320]
Cmd: Select
58:
61: waiting for 'virtual circuit status'
Cmd: Select
Blockers
~~~~~~~~
Above is a list of all the processes. If they are waiting for a resource
then it will be given in square brackets. Below is a summary of the
waited upon resources, together with the holder of that resource.
Notes:
~~~~~
o A process id of '???' implies that the holder was not found in the
systemstate.
Resource Holder State
Latch 600f7320 ??? Blocker
Latch 855675540 ??? Blocker
Latch 855675b80 ??? Blocker
Latch 8556757c0 ??? Blocker
Latch 855675680 ??? Blocker
Object Names
~~~~~~~~~~~~
Latch 600f7320 Child shared pool
Latch 855675540 Child library cache
Latch 855675b80 Child library cache
Latch 8556757c0 Child library cache
Latch 855675680 Child library cache
From the output we can tell that I took three system state dumps (in fact I ran oradebug dump systemstate 266 three times). System State 2 shows three Blockers; let's analyze part of that information.
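With several dumps in one report, it can help to see which waited-upon resources show up in more than one System State, since those may be worth looking at first. The sketch below is an assumption-laden helper of my own, not part of ass109.awk: it parses the "Resource Holder State" rows of the report format shown above, and the function names `blockers_by_state` and `seen_in_multiple` are hypothetical. The sample lines are copied from System State 2 and 3 above.

```python
import re
from collections import Counter

STATE_HEADER = re.compile(r"^System State (\d+)")
# Matches rows such as "Latch 855675900 ??? Blocker".
BLOCKER_ROW = re.compile(r"^(\w+ [0-9a-f]+)\s+(\S+)\s+Blocker")

def blockers_by_state(lines):
    """Map each System State number to the set of resources flagged as Blocker."""
    result = {}
    state = None
    for line in lines:
        line = line.strip()
        header = STATE_HEADER.match(line)
        if header:
            state = int(header.group(1))
            result[state] = set()
            continue
        row = BLOCKER_ROW.match(line)
        if row and state is not None:
            result[state].add(row.group(1))
    return result

def seen_in_multiple(blockers):
    """Return, sorted, the resources that appear in more than one System State."""
    counts = Counter(r for resources in blockers.values() for r in resources)
    return sorted(r for r, n in counts.items() if n > 1)

report = """\
System State 2
Resource Holder State
Latch 855675900 ??? Blocker
Latch 855675680 ??? Blocker
Latch 600f7320 ??? Blocker
Object Names
Latch 855675900 Child library cache
System State 3
Resource Holder State
Latch 600f7320 ??? Blocker
Latch 855675540 ??? Blocker
Latch 855675b80 ??? Blocker
Latch 8556757c0 ??? Blocker
Latch 855675680 ??? Blocker
""".splitlines()

blockers = blockers_by_state(report)
print(seen_in_multiple(blockers))
```

A resource that recurs across dumps is only a hint, not proof of a hang; the raw trace still has to confirm who holds it.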
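One of the Blockers is latch 855675900, and this latch left processes 20, 17, 16, 19, 14, 21 and 15 in waiting for 'SQL*Net message from client' (the ass109.awk summary for System State 2 above lists processes 13 through 20 against this latch, and process 21 against latch 855675680). The trace file shows that latch 855675900 is held by the oracle@xxxx (J000) process, that is, a job process. In other words, because this J000 process was misbehaving, it held on to latch 855675900.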
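To see at a glance which processes are queued behind which resource in one System State, the per-process lines of the report can be grouped by the resource in square brackets. This is a minimal Python sketch under the assumption that the report keeps the line format shown above; `waiters_by_resource` is a hypothetical helper name, and the sample lines are copied from System State 2.

```python
import re
from collections import defaultdict

# Matches lines such as "13: waiting for '...'[Latch 855675900]".
WAITER_ROW = re.compile(r"^(\d+):.*\[(\w+ [0-9a-f]+)\]\s*$")

def waiters_by_resource(lines):
    """Map each bracketed resource to the process numbers listed against it."""
    waiters = defaultdict(list)
    for line in lines:
        row = WAITER_ROW.match(line.strip())
        if row:
            waiters[row.group(2)].append(int(row.group(1)))
    return dict(waiters)

state2 = [
    "13: waiting for 'SQL*Net message from client'[Latch 855675900]",
    "Cmd: Select",
    "14: waiting for 'SQL*Net message from client'[Latch 855675900]",
    "Cmd: Select",
    "21: waiting for 'SQL*Net message from client'[Latch 855675680]",
    "Cmd: Select",
    "28: waiting for 'virtual circuit status'",
    "33: last wait for 'latch: shared pool' [Latch 600f7320]",
]

waiters = waiters_by_resource(state2)
for resource, pids in waiters.items():
    print(resource, pids)
```

The grouping only reflects what the summary prints; identifying the holder still means going back to the matching process section in the raw trace.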
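This case is actually somewhat similar to "一个Job运行失败导致数据库挂死" (a failed Job run causing the database to hang): in the end the JOB here also turned out to be EMD_MAINTENANCE.EXECUTE_EM_DBMS_JOB_PROCS. The cause of the problem was more complicated, of course, and is not discussed here. There is also a Metalink article on how to read and understand systemstate dumps: Reading and Understanding Systemstate Dumps (Doc ID 423153.1). Its stated purpose and intended audience are as follows.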
To be able to read a systemstate, or navigate through a systemstate in order to identify what sessions are doing and, in the case of a waiting session, which session(s) hold the resource it requires.
This document is intended for DBAs.
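In addition, if you want to understand and interpret the contents of a systemstate dump, the article 如何阅读systemstate dump (How to Read a Systemstate Dump) is a must-read. It goes into a great deal of detail, and I benefited a lot from it.
References: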
http://www.oracleblog.org/working-case/database-hang-due-to-job-dead/
http://www.askmaclean.com/archives/%E8%BD%AC%E5%A6%82%E4%BD%95%E9%98%85%E8%AF%BBsystemstate-dump.html