Why use hanganalyze
When an Oracle database is "truly" hung, it can be thought of as an internal deadlock. For an ordinary DML deadlock, the Oracle server detects the dependency between the sessions automatically and rolls back one of the statements, breaking the mutual wait. But when the deadlock is over kernel-level resources (such as pins or latches), Oracle cannot detect and resolve it automatically.
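For contrast, an ordinary DML deadlock is easy to reproduce and is resolved by Oracle itself; a minimal two-session sketch (the table tb_dl and its rows are hypothetical, used only for illustration):

-- Session 1:
SQL> update tb_dl set val = 1 where id = 1;   -- locks row 1
-- Session 2:
SQL> update tb_dl set val = 2 where id = 2;   -- locks row 2
-- Session 1:
SQL> update tb_dl set val = 1 where id = 2;   -- now waits on session 2
-- Session 2:
SQL> update tb_dl set val = 2 where id = 1;   -- closes the cycle
-- Within a few seconds Oracle detects the deadlock and raises ORA-00060
-- in one of the sessions, rolling back that statement; no DBA
-- intervention (and no hanganalyze) is needed.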
In many cases, though, the database is not actually hung; it is just slow, and because of a performance problem the work simply takes a long time to finish.
The hanganalyze tool uses internal kernel calls to determine which resource each session is waiting for and reports the relationships between holders and waiters. In addition, it dumps the state of the more "interesting" processes, depending on the level at which hanganalyze is run.
hanganalyze has been available since Oracle 8i Release 2. In 9i it was enhanced with "cluster-wide" diagnostics for RAC environments, meaning it can report on the sessions of every instance in the cluster.
There are currently three ways to invoke hanganalyze:
-- At the session level:
SQL> ALTER SESSION SET EVENTS 'immediate trace name HANGANALYZE level <level>';

-- At the instance level:
SQL> ORADEBUG hanganalyze <level>

-- Cluster-wide (RAC):
SQL> ORADEBUG setmypid
SQL> ORADEBUG setinst all
SQL> ORADEBUG -g def hanganalyze <level>

The <level> argument controls how much detail is dumped. Broadly: levels 1-2 produce only the hang analysis itself; level 3 also dumps the processes considered to be in a hang (IN_HANG); level 4 adds leaf nodes, i.e. the blockers; level 5 and above add the remaining nodes in the wait chains; level 10 dumps every process (IGN). The exact state-to-level mapping varies by version and is listed in the "Extra information that will be dumped at higher levels" section of the trace itself.
1. Session 1 updates a row
SQL> connect scott/scott
Connected.
SQL> create table tb_hang(id number,remark varchar2(20));
Table created.
SQL> insert into tb_hang values(1,'test');
1 row created.
SQL> commit;
Commit complete.
SQL> select USERENV('sid') from dual;
USERENV('SID')
--------------
           146
SQL> update tb_hang set remark='hang' where id=1;
1 row updated.
-- Do not commit at this point.

2. Session 2 updates the same row that session 1 updated
SQL> select USERENV('sid') from dual;
USERENV('SID')
--------------
           154
SQL> update tb_hang set remark='hang' where id=1;
-- This session is now hung.

3. Session 3 runs hanganalyze to generate a trace file
SQL> connect / as sysdba
Connected.
SQL> oradebug hanganalyze 3;
Hang Analysis in /u01/app/oracle/admin/oracl/udump/oracl_ora_3941.trc

4. Examine the contents of the trace file
$ more /u01/app/oracle/admin/oracl/udump/oracl_ora_3941.trc
/u01/app/oracle/admin/oracl/udump/oracl_ora_3941.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Production
With the Partitioning, OLAP and Data Mining options
ORACLE_HOME = /u01/app/oracle/product/10.2.0/db_1
System name:    Linux
Node name:      hxl
Release:        2.6.18-8.el5xen
Version:        #1 SMP Fri Jan 26 14:42:21 EST 2007
Machine:        i686
Instance name: oracl
Redo thread mounted by this instance: 1
Oracle process number: 21
Unix process pid: 3941, image: oracle@hxl (TNS V1-V3)

*** SERVICE NAME:(SYS$USERS) 2012-06-16 01:13:29.241
*** SESSION ID:(144.14) 2012-06-16 01:13:29.241
*** 2012-06-16 01:13:29.241
==============
HANG ANALYSIS:
==============
Open chains found:
Chain 1 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/146/5/0x7861b254/3858/SQL*Net message from client>
 -- <0/154/5/0x7861c370/3903/enq: TX - row lock contention>
Other chains found:
Chain 2 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/144/14/0x7861d48c/3941/No Wait>
Chain 3 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/149/1/0x7861ced8/3806/Streams AQ: waiting for time man>
Chain 4 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/151/1/0x7861c924/3804/Streams AQ: qmn coordinator idle>
Chain 5 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/158/5/0x7861da40/3810/Streams AQ: qmn slave idle wait>
Extra information that will be dumped at higher levels:
[level  4] :  1 node dumps -- [REMOTE_WT] [LEAF] [LEAF_NW]
[level  5] :  4 node dumps -- [SINGLE_NODE] [SINGLE_NODE_NW] [IGN_DMP]
[level  6] :  1 node dumps -- [NLEAF]
[level 10] : 13 node dumps -- [IGN]

State of nodes
([nodenum]/cnode/sid/sess_srno/session/ospid/state/start/finish/[adjlist]/predecessor):
[143]/0/144/14/0x786fa3dc/3941/SINGLE_NODE_NW/1/2//none
[145]/0/146/5/0x786fc944/3858/LEAF/3/4//153
[148]/0/149/1/0x78700160/3806/SINGLE_NODE/5/6//none
[150]/0/151/1/0x787026c8/3804/SINGLE_NODE/7/8//none
[153]/0/154/5/0x78705ee4/3903/NLEAF/9/10/[145]/none
[154]/0/155/1/0x78707198/3797/IGN/11/12//none
[155]/0/156/1/0x7870844c/3799/IGN/13/14//none
[157]/0/158/5/0x7870a9b4/3810/SINGLE_NODE/15/16//none
[159]/0/160/1/0x7870cf1c/3782/IGN/17/18//none
[160]/0/161/1/0x7870e1d0/3784/IGN/19/20//none
[161]/0/162/1/0x7870f484/3788/IGN/21/22//none
[162]/0/163/1/0x78710738/3786/IGN/23/24//none
[163]/0/164/1/0x787119ec/3774/IGN/25/26//none
[164]/0/165/1/0x78712ca0/3780/IGN/27/28//none
[165]/0/166/1/0x78713f54/3778/IGN/29/30//none
[166]/0/167/1/0x78715208/3776/IGN/31/32//none
[167]/0/168/1/0x787164bc/3770/IGN/33/34//none
[168]/0/169/1/0x78717770/3772/IGN/35/36//none
[169]/0/170/1/0x78718a24/3768/IGN/37/38//none
====================
END OF HANG ANALYSIS
====================
The sections of the trace file are explained below:
CYCLES: This section reports the process dependencies between sessions that are in a deadlock condition. Cycles are considered “true” hangs.
Cycle 1 : <sid/sess_srno/proc_ptr/ospid/wait_event> :
    <980/3887/0xe4214964/24065/latch free>
 -- <2518/352/0xe4216560/24574/latch free>
 -- <55/10/0xe41236a8/13751/latch free>

BLOCKER OF MANY SESSIONS: This section is found when a process is blocking a lot of other sessions; it usually appears in the trace file when a process is blocking more than 10 sessions.
Found 21 objects waiting for <sid/sess_srno/proc_ptr/ospid/wait_event>
    <55/10/0xe41236a8/13751/latch free>
Found 12 objects waiting for <sid/sess_srno/proc_ptr/ospid/wait_event>
    <2098/2280/0xe42870d0/3022/db file scattered read>
Found 12 objects waiting for <sid/sess_srno/proc_ptr/ospid/wait_event>
    <1941/1783/0xe41ac9e0/462/No Wait>
Found 12 objects waiting for <sid/sess_srno/proc_ptr/ospid/wait_event>
    <980/3887/0xe4214964/24065/latch free>

OPEN CHAINS: This section reports sessions involved in a wait chain. A wait chain means that one session is blocking one or more other sessions.
Open chains found:
Chain 1 : <sid/sess_srno/proc_ptr/ospid/wait_event> :
    <2/1/0xe411b0f4/12280/db file parallel write>
Chain 2 : <sid/sess_srno/proc_ptr/ospid/wait_event> :
    <3/1/0xe411b410/12282/No Wait>
Chain 6 : <sid/sess_srno/proc_ptr/ospid/wait_event> :
    <18/1631/0xe4243cf8/25457/db file scattered read>
 -- <229/1568/0xe422b84c/8460/buffer busy waits>
Chain 17 : <sid/sess_srno/proc_ptr/ospid/wait_event> :
    <56/11/0xe4123ce0/13755/latch free>
 -- <2384/599/0xe41890dc/22488/latch free>
 -- <32/2703/0xe41fa284/25693/latch free>
OTHER CHAINS: These are chains of blockers and waiters related to sessions already identified in the "Open chains" section, but not blocked directly by the process reported in those chains.
Other chains found:
Chain 676 : <sid/sess_srno/proc_ptr/ospid/wait_event> :
    <20/93/0xe411d644/13597/latch free>
Chain 677 : <sid/sess_srno/proc_ptr/ospid/wait_event> :
    <27/1201/0xe41d3188/15809/latch free>
Chain 678 : <sid/sess_srno/proc_ptr/ospid/wait_event> :
    <36/1532/0xe428be8c/4232/latch free>
 -- <706/1216/0xe4121aac/23317/latch free>
Chain 679 : <sid/sess_srno/proc_ptr/ospid/wait_event> :
    <43/12/0xe4122d54/13745/latch free>
Chain 680 : <sid/sess_srno/proc_ptr/ospid/wait_event> :
    <80/2/0xe41290d4/13811/library cache pin>
 -- <1919/1134/0xe421fdbc/3343/enqueue>

STATE OF NODES:
1. This is the most important part of the trace.
State of nodes ([nodenum]/cnode/sid/sess_srno/session/ospid/state/start/finish/[adjlist]/predecessor):
[145]/0/146/5/0x786fc944/3858/LEAF/9/10//153
    -- LEAF marks this session (SID 146) as a blocker: its predecessor 153 is the node of SID 154, the session it is blocking.
[148]/0/149/1/0x78700160/3806/SINGLE_NODE/11/12//none
[150]/0/151/1/0x787026c8/3804/SINGLE_NODE/13/14//none
[153]/0/154/5/0x78705ee4/3903/NLEAF/15/16/[145]/none
    -- NLEAF marks this session (SID 154) as blocked: its adjlist entry 145 is the node of SID 146, the blocker.
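While the tb_hang demo above is still hanging, the blocker/waiter relationship that hanganalyze reports can be cross-checked from the data dictionary; a minimal sketch, assuming a 10g or later database where v$session exposes blocking_session:

-- Who is blocking whom (run from a third session):
SQL> select sid, serial#, blocking_session, event
       from v$session
      where blocking_session is not null;

-- The classic lock view gives the same picture:
SQL> select sid, type, id1, id2, lmode, request, block
       from v$lock
      where block = 1 or request > 0;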
A real-world case: the following SPLIT PARTITION statement hung during execution.

SQL> ALTER TABLE A_PDA_SP_STAT SPLIT PARTITION P_MAX AT (20090609)
     INTO (PARTITION P_20090608 TABLESPACE TS_DATA_A, PARTITION P_MAX TABLESPACE TS_DATA_A);

-- Check the session's wait event:
EVENT                          P1         P2         P3
------------------------------ ---------- ---------- ----------
row cache lock                          8          0          5

Some material found online suggests this is related either to an undersized shared pool in the SGA or to a sequence whose cache is too small. After analysis, neither cause applies here: first, if the shared pool were too small, the usual symptom would be a SQL statement running slowly but still completing, not hanging the way this one does; second, this is only a partition split and has nothing to do with sequences.
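For a row cache lock wait, the P1 value (8 here) is the dictionary cache id, which v$rowcache can translate into a cache name; a minimal sketch (the numeric-to-name mapping differs across versions, so the query result, not this example, is authoritative):

-- Translate the row cache lock's P1 (cache id) into a dictionary cache name:
SQL> select cache#, parameter, gets, getmisses
       from v$rowcache
      where cache# = 8;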
SQL> select spid from v$session a,v$process b where a.paddr=b.addr and a.sid=295;

SPID
------------
19237

SQL> oradebug SETOSPID 19237
Oracle pid: 235, Unix process pid: 19237, image: oracle@hl_rdb01 (TNS V1-V3)
SQL> oradebug hanganalyze 3;
Cycle 1: (0/295)
Cycle 2: (0/254)--(0/239)
Hang Analysis in /oracle/app/oracle/admin/hlreport/udump/hlreport_ora_25247.trc
$ more /oracle/app/oracle/admin/hlreport/udump/hlreport_ora_25247.trc
Dump file /oracle/app/oracle/admin/hlreport/udump/hlreport_ora_25247.trc
Oracle9i Enterprise Edition Release 9.2.0.6.0 - 64bit Production
With the Partitioning, OLAP and Oracle Data Mining options
JServer Release 9.2.0.6.0 - Production
ORACLE_HOME = /oracle/app/oracle/product/9.2.0
System name:    HP-UX
Node name:      hl_rdb01
Release:        B.11.11
Version:        U
Machine:        9000/800
Instance name: hlreport
Redo thread mounted by this instance: 1
Oracle process number: 157
Unix process pid: 25247, image: oracle@hl_rdb01 (TNS V1-V3)

*** SESSION ID:(312.10459) 2009-05-20 16:21:58.423
*** 2009-05-20 16:21:58.423
==============
HANG ANALYSIS:
==============
Cycle 1 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/329/43816/0x4d6b5638/23487/row cache lock>
 -- <0/254/19761/0x4d687438/23307/library cache lock>
Cycle 2 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/295/57125/0x4d6b8978/19237/row cache lock>
Cycle 3 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/295/57125/0x4d6b8978/19237/row cache lock>
Open chains found:
Other chains found:
Chain 1 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/312/10459/0x4d69f9b8/25247/No Wait>
Extra information that will be dumped at higher levels:
[level  3] :   4 node dumps -- [IN_HANG]
[level  5] :   1 node dumps -- [SINGLE_NODE] [SINGLE_NODE_NW] [IGN_DMP]
[level 10] : 223 node dumps -- [IGN]

State of nodes
([nodenum]/cnode/sid/sess_srno/session/ospid/state/start/finish/[adjlist]/predecessor):
[0]/0/1/1/0x4d7146c0/5132/IGN/1/2//none
......
[238]/0/239/57618/0x4d7b18a0/13476/IN_HANG/395/402/[294][238][328][253]/none
......
[253]/0/254/19761/0x4d7bb710/23307/IN_HANG/397/400/[328][238][294]/294
......
[294]/0/295/57125/0x4d7d6820/19237/IN_HANG/396/401/[294][238][253]/238
[328]/0/329/43816/0x4d7ecf40/23487/IN_HANG/398/399/[253]/253
......
Dumping System_State and Fixed_SGA in process with ospid 13476
Dumping Process information for process with ospid 13476
Dumping Process information for process with ospid 23307
Dumping Process information for process with ospid 19237
Dumping Process information for process with ospid 23487
====================
END OF HANG ANALYSIS
====================
*** 2009-05-20 16:48:20.686

Now let's look at what the trace shows:
Cycle 1 : <0/329/43816/0x4d6b5638/23487/row cache lock>
       -- <0/254/19761/0x4d687438/23307/library cache lock>
Cycle 2 : <0/295/57125/0x4d6b8978/19237/row cache lock>
Cycle 3 : <0/295/57125/0x4d6b8978/19237/row cache lock>

A cycle is a deadlock that Oracle has determined internally. Session 295, the one manually running our split, is among them. Let's see what the other sessions are doing, for example 329:
SQL> select machine,status,program,sql_text from v$session a,v$sqlarea b
     where a.sql_address=b.address and a.sid=329;

MACHINE   STATUS  PROGRAM                      SQL_TEXT
--------- ------- ---------------------------- ----------------------------------------------------------
hl_rdb01  ACTIVE  sqlplus@hl_rdb01 (TNS V1-V3) ALTER TABLE A_PDA_SP_STAT SPLIT PARTITION P_MAX AT (20090609)
                                               INTO (PARTITION P_20090608 TABLESPACE TS_DATA_A,
                                               PARTITION P_MAX TABLESPACE TS_DATA_A)

SQL> select event from v$session_wait where sid=329;

EVENT
--------------------------------------------
row cache lock

So session 329 is also executing a split statement. But a colleague confirmed he had already killed the script that failed earlier; presumably the process was left hanging inside the database and its resources were never fully released.
==============
HANG ANALYSIS:
==============
Cycle 1 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/295/57125/0x4d6b8978/19237/row cache lock>
Cycle 2 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <0/254/19761/0x4d687438/23307/library cache lock>
 -- <0/239/57618/0x4d6b74f8/13476/row cache lock>

We went on killing the remaining processes, and finally the split in session 295 completed successfully.
SQL> ALTER TABLE A_PDA_SP_STAT SPLIT PARTITION P_MAX AT (20090609)
     INTO (PARTITION P_20090608 TABLESPACE TS_DATA_A, PARTITION P_MAX TABLESPACE TS_DATA_A);

Table altered.

Elapsed: 00:31:03.21

Splitting the next partition was then also quick:
SQL> ALTER TABLE A_PDA_SP_STAT SPLIT PARTITION P_MAX AT (20090610)
     INTO (PARTITION P_20090609 TABLESPACE TS_DATA_A, PARTITION P_MAX TABLESPACE TS_DATA_A);

At this point the problem was solved.
[238]/0/239/57618/0x4d7b18a0/13476/IN_HANG/395/402/[294][238][328][253]/none
[253]/0/254/19761/0x4d7bb710/23307/IN_HANG/397/400/[328][238][294]/294
[294]/0/295/57125/0x4d7d6820/19237/IN_HANG/396/401/[294][238][253]/238
[328]/0/329/43816/0x4d7ecf40/23487/IN_HANG/398/399/[253]/253

Session 329 blocked 254, 254 blocked 295, and 295 blocked 239, so the sessions to kill were 329, 254 and 295.
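A sketch of how those sessions could be removed once identified; the sid/serial# pairs and the ospid come straight from the trace above, and the IMMEDIATE keyword is optional:

SQL> alter system kill session '329,43816' immediate;
SQL> alter system kill session '254,19761' immediate;
-- If the server process does not go away on its own, it can be killed
-- at the OS level using the ospid reported in the trace, e.g.:
$ kill -9 23487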
Tip: if the hang prevents a normal sqlplus login, use the -prelim option, e.g. sqlplus -prelim "/ as sysdba". A session connected this way cannot query views, but it can still shutdown abort the database.
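A minimal sketch of that workflow: a preliminary connection skips session creation, so ordinary queries fail, but oradebug still works, which is usually enough to capture a hanganalyze trace before deciding on a shutdown abort (level 3 here is simply the value used in the examples above):

$ sqlplus -prelim "/ as sysdba"
SQL> oradebug setmypid
SQL> oradebug hanganalyze 3
-- Wait a minute or two, then take a second snapshot for comparison:
SQL> oradebug hanganalyze 3
SQL> shutdown abort        -- only if the instance really must be restarted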