记一次数据库无法增删改趋于HANG住状态的故障诊断和处理

数据库无法增删改,包括v$transaction视图无法查询,类似于HANG的状态,我首先我通过查询v$session_wait视图,情况如下:
SQL> select sid,event,p1,p2,p3,wait_time,seconds_in_wait,state from v$session_wait where wait_class <> 'Idle';

       SID EVENT                                             P1         P2         P3  WAIT_TIME SECONDS_IN_WAIT STATE
---------- ----------------------------------------- ---------- ---------- ---------- ---------- --------------- -------------------
       125 library cache lock                        2130013560 2158412560        301          0            7850 WAITING
       126 library cache lock                        2130014088 2158399144        301          0            7850 WAITING
       127 library cache lock                        2130014088 2158495528        301          0            9231 WAITING
       128 library cache lock                        2130013560 2158380576        301          0            9231 WAITING
       129 library cache lock                        2130013560 2158307736        301          0           10611 WAITING
       130 library cache lock                        2130014088 2158498840        301          0           10611 WAITING
       131 buffer busy waits                                  1      11170          1          0           11839 WAITING
       132 library cache lock                        2130013560 2158391432        301          0           11992 WAITING
       133 log file switch (checkpoint incomplete)            0          0          0          0           12616 WAITING
       136 log file switch (checkpoint incomplete)            0          0          0          0           12947 WAITING
       138 enq: TX - row lock contention             1415053318     589854        665          0           13321 WAITING
       139 buffer busy waits                                  2          9         17          0           12616 WAITING
       141 enq: WF - contention                      1464205318          0          0          0            1650 WAITING
       144 enq: CI - contention                      1128857606          1          5          0           15355 WAITING
       150 log file switch (checkpoint incomplete)            0          0          0          0           12891 WAITING
       159 switch logfile command                             0          0          0          0           17051 WAITING
       161 log file switch (checkpoint incomplete)            0          0          0          0           12715 WAITING
       164 rdbms ipc reply                                    7   21457644          0          0               0 WAITING  

18 rows selected

以上看到有三个log file switch (checkpoint incomplete)等待事件,意味着checkpoint没完成,于是查询v$log检查在线日志的情况:

SQL> select * from v$log;

    GROUP#    THREAD#  SEQUENCE#      BYTES    MEMBERS ARC STATUS         FIRST_CHANGE# FIRST_TIM
---------- ---------- ---------- ---------- ---------- --- ---------------- ------------- ---------
     1         1          113   52428800          1 NO  CURRENT            8590086940 21-MAR-15  <<<< CURRENT
     2         1          111   52428800          1 YES ACTIVE            8590086619 21-MAR-15    <<<< ACTIVE
     3         1          112   52428800          1 YES ACTIVE            8590086938 21-MAR-15    <<<< ACTIVE

SQL> archive log list;
Database log mode            Archive Mode
Automatic archival            Enabled   <<<<<< 归模式档
Archive destination            /home/oracle/arch
Oldest online log sequence     111
Next log sequence to archive   113
Current log sequence            113
SQL> !
[oracle@ora10g bdump]$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda3              95G   14G   77G  15% /         <<<<<<<<<<<< 磁盘空间足够,说明不是磁盘空间不够无法归档造成的
/dev/sda1              99M   12M   82M  13% /boot
tmpfs                1006M     0 1006M   0% /dev/shm
[oracle@ora10g bdump]$ ls -lrt /home/oracle/arch/
total 107176
-rw-r----- 1 oracle oinstall   952832 Mar 15 12:02 1_103_847657195.dbf
-rw-r----- 1 oracle oinstall 29585920 Mar 17 21:35 1_104_847657195.dbf
-rw-r----- 1 oracle oinstall 14306816 Mar 21 12:02 1_105_847657195.dbf
-rw-r----- 1 oracle oinstall 22298112 Mar 21 12:19 1_106_847657195.dbf
-rw-r----- 1 oracle oinstall 42112000 Mar 21 17:14 1_107_847657195.dbf
-rw-r----- 1 oracle oinstall   159232 Mar 21 17:20 1_108_847657195.dbf
-rw-r----- 1 oracle oinstall     1536 Mar 21 17:21 1_109_847657195.dbf
-rw-r----- 1 oracle oinstall    15360 Mar 21 17:24 1_110_847657195.dbf
-rw-r----- 1 oracle oinstall   148480 Mar 21 17:30 1_111_847657195.dbf
-rw-r----- 1 oracle oinstall     1024 Mar 21 17:30 1_112_847657195.dbf        <<<<<<<< 自17:30时间以后不再有归档日志产生
[oracle@ora10g bdump]$ date
Sat Mar 21 22:24:09 CST 2015

在线日志切换之后没有并归档之前都是处于ACTIVE状态,该状态下无法被复用,没有在线日志组可用时候,数据库会挂起.

做了HANGANALYZE:
==============
HANG ANALYSIS:
==============
Open chains found:
Chain 1 : :
    <0/ 133 /2/0x83a69028/4790/log file switch (checkpoint inco>
-- <0/ 129 /1/0x83a6afc8/4877/library cache lock>
-- <0/ 125 /1/0x83a6cf68/4993/library cache lock>
-- <0/ 128 /3/0x83a6b7b0/4931/library cache lock>
-- <0/ 132 /3/0x83a69810/4815/library cache lock>        <<<<<<<<<<<<sid为 132,128,125,129被sid为133的会话阻塞,而133又在等检查点完成
Chain 2 : :
    <0/ 164 /1/0x83a5f208/3048/ rdbms ipc reply >
-- <0/ 144 /47/0x83a660b8/4402/ enq: CI - contention >
-- <0/ 138 /137/0x83a68058/4757/enq: TX - row lock contention>  <<<<<< sid为138,144的会话被sid为164的会话阻塞,而144在等待的 enq: CI - contention却被
Other chains found:
Chain 3 : :
    <0/123/6/0x83a6d750/5378/No Wait>
Chain 4 : :
    <0/126/2/0x83a6c780/4991/library cache lock>
-- <0/127/1/0x83a6bf98/4933/library cache lock>
-- <0/130/1/0x83a6a7e0/4875/library cache lock>
Chain 5 : :
    <0/131/1/0x83a69ff8/4821/buffer busy waits>
-- <0/126/2/0x83a6c780/4991/library cache lock>
-- <0/127/1/0x83a6bf98/4933/library cache lock>
-- <0/130/1/0x83a6a7e0/4875/library cache lock>
Chain 6 : :
    <0/136/7/0x83a68840/4770/log file switch (checkpoint inco>
Chain 7 : :
    <0/139/10/0x83a67870/4609/buffer busy waits>
Chain 8 : :
    <0/141/6/0x83a668a0/4748/enq: WF - contention>
Chain 9 : :
    <0/147/3/0x83a658d0/3079/Streams AQ: qmn slave idle wait>
Chain 10 : :
    <0/148/4/0x83a650e8/3077/Streams AQ: waiting for time man>
Chain 11 : :
    <0/150/166/0x83a64900/4752/log file switch (checkpoint inco>
Chain 12 : :
    <0/152/1/0x83a64118/3068/Streams AQ: qmn coordinator idle>
Chain 13 : :
    <0/159/3/0x83a62960/3062/switch logfile command>
Chain 14 : :
    <0/161/1/0x83a609c0/3054/log file switch (checkpoint inco>
Extra information that will be dumped at higher levels:
[level  4] :   2 node dumps -- [REMOTE_WT] [LEAF] [LEAF_NW]
[level  5] :   8 node dumps -- [SINGLE_NODE] [SINGLE_NODE_NW] [IGN_DMP]
[level  6] :  12 node dumps -- [NLEAF]
[level 10] :  12 node dumps -- [IGN]

State of nodes
([nodenum]/cnode/sid/sess_srno/session/ospid/state/start/finish/[adjlist]/predecessor):
[122]/0/123/6/0x83b52750/5378/SINGLE_NODE_NW/1/2//none
[124]/0/125/1/0x83b55220/4993/NLEAF/3/8/[128][132]/127
[125]/0/126/2/0x83b56788/4991/NLEAF/9/12/[130]/126
[126]/0/127/1/0x83b57cf0/4933/NLEAF/13/14/[125][130]/129
[127]/0/128/3/0x83b59258/4931/NLEAF/15/16/[128][124][132]/131
[128]/0/129/1/0x83b5a7c0/4877/NLEAF/4/7/[132]/124
[129]/0/130/1/0x83b5bd28/4875/NLEAF/17/18/[125][126][130]/none
[130]/0/131/1/0x83b5d290/4821/NLEAF/10/11/[132]/125
[131]/0/132/3/0x83b5e7f8/4815/NLEAF/19/20/[128][124][127][132]/none
[ 132 ]/0/ 133 /2/0x83b5fd60/ 4790 / LEAF /5/6//128      <<<<<<< sid为 133 的阻塞了sid为 128 的会话
[135]/0/136/7/0x83b63d98/4770/SINGLE_NODE/21/22//none
[137]/0/138/137/0x83b66868/4757/NLEAF/23/28/[143]/none
[138]/0/139/10/0x83b67dd0/4609/NLEAF/29/30/[132]/none
[139]/0/140/8/0x83b69338/4605/IGN/31/32//none
[140]/0/141/6/0x83b6a8a0/4748/NLEAF/33/34/[143]/none
[ 143 ]/0/ 144 /47/0x83b6e8d8/ 4402 /NLEAF/24/27/[163]/137
[146]/0/147/3/0x83b72910/3079/SINGLE_NODE/35/36//none
[147]/0/148/4/0x83b73e78/3077/SINGLE_NODE/37/38//none
[149]/0/150/166/0x83b76948/4752/SINGLE_NODE/39/40//none
[151]/0/152/1/0x83b79418/3068/SINGLE_NODE/41/42//none
[154]/0/155/1/0x83b7d450/3066/IGN/43/44//none
[155]/0/156/1/0x83b7e9b8/3064/IGN/45/46//none
[158]/0/159/3/0x83b829f0/3062/SINGLE_NODE/47/48//none
[159]/0/160/1/0x83b83f58/3056/IGN/49/50//none
[160]/0/161/1/0x83b854c0/3054/SINGLE_NODE/51/52//none
[161]/0/162/1/0x83b86a28/3052/IGN/53/54//none
[162]/0/163/1/0x83b87f90/3050/IGN/55/56//none
[163]/0/ 164 /1/0x83b894f8/ 3048 / LEAF /25/26// 143      <<<<<<< sid为164的阻塞了sid为143的会话
[164]/0/165/1/0x83b8aa60/3046/IGN/57/58//none
[165]/0/166/1/0x83b8bfc8/3044/IGN/59/60//none
[166]/0/167/1/0x83b8d530/3042/IGN/61/62//none
[167]/0/168/1/0x83b8ea98/3040/IGN/63/64//none
[168]/0/169/1/0x83b90000/3038/IGN/65/66//none
[169]/0/170/1/0x83b91568/3036/IGN/67/68//none

做了SYSTEMSTATE DUMP:
用ass对SYSTEMSTATE处理以后:
System State 1
~~~~~~~~~~~~~~~~
1:                                      
2:  waiting for 'pmon timer'            wait
3:  waiting for 'rdbms ipc message'     wait
4:  waiting for 'rdbms ipc message'     wait
5:  waiting for 'rdbms ipc message'     wait
6:  waiting for 'rdbms ipc message'     wait
7:  waiting for 'rdbms ipc message'     wait
8:  waiting for 'rdbms ipc reply'       wait
9:  waiting for 'rdbms ipc message'     wait
10: waiting for 'rdbms ipc message'     wait
11: waiting for 'log file switch (checkpoint incomplete)' wait
12: waiting for 'rdbms ipc message'     wait
13:                                     
14:                                     
15: waiting for 'switch logfile command' wait
16: waiting for 'rdbms ipc message'     wait
17: waiting for 'rdbms ipc message'     wait
18: waiting for 'Streams AQ: qmn coordinator idle wait' wait
19: waiting for 'log file switch (checkpoint incomplete)' wait
     Cmd: Insert
20: for 'Streams AQ: waiting for time management or cleanup tasks' wait
21: waiting for 'Streams AQ: qmn slave idle wait' wait
22: waiting for 'enq: CI - contention' [Enqueue CI-00000001-00000005] wait
23: waiting for 'enq: WF - contention' [Enqueue WF-00000000-00000000] wait
     Cmd: PL/SQL Execute
24: waiting for 'SQL*Net message from client' wait
25: waiting for 'buffer busy waits' (2,9,11) wait
     Cmd: Select
26: waiting for 'enq: TX - row lock contention'[Enqueue TX-0009001E-00000299] wait
27: waiting for 'log file switch (checkpoint incomplete)' wait
     Cmd: Insert
28: waiting for 'log file switch (checkpoint incomplete)' wait
29: waiting for 'library cache lock'   [LOCK: handle=7ef56d78] wait
30: waiting for 'buffer busy waits' (1,2ba2,1) wait
31: waiting for 'library cache lock'   [LOCK: handle=7ef56f88] wait
32: waiting for 'library cache lock'   [LOCK: handle=7ef56d78] wait
33: waiting for 'library cache lock'   [LOCK: handle=7ef56d78] wait
34: waiting for 'library cache lock'   [LOCK: handle=7ef56f88] wait
35: waiting for 'library cache lock'   [LOCK: handle=7ef56f88] wait
36: waiting for 'library cache lock'   [LOCK: handle=7ef56d78] wait
37: last wait for 'ksdxexeotherwait'    
Blockers
~~~~~~~~

     Above is a list of all the processes. If they are waiting for a resource
     then it will be given in square brackets. Below is a summary of the
     waited upon resources, together with the holder of that resource.
     Notes:
     ~~~~~
     o A process id of '???' implies that the holder was not found in the
        systemstate.

                    Resource Holder State
Enqueue CI-00000001-00000005     8: waiting for 'rdbms ipc reply'
Enqueue WF-00000000-00000000    22: 22: is waiting for 8:
Enqueue TX-0009001E-00000299    22: 22: is waiting for 8:
       LOCK: handle=7ef56d78    28: waiting for 'log file switch (checkpoint incomplete)'
       LOCK: handle=7ef56f88    30: waiting for 'buffer busy waits' (1,2ba2,1)


Object Names
~~~~~~~~~~~~
Enqueue CI-00000001-00000005                                   
Enqueue WF-00000000-00000000                                   
Enqueue TX-0009001E-00000299                                   
LOCK: handle=7ef56d78     TABL:EXFSYS.RLM$SCHDNEGACTION 
LOCK: handle=7ef56f88     TABL:EXFSYS.RLM$EVTCLEANUP    

System State 2
~~~~~~~~~~~~~~~~
1:                                      
2:  waiting for 'pmon timer'            wait
3:  waiting for 'rdbms ipc message'     wait
4:  waiting for 'rdbms ipc message'     wait
5:  waiting for 'rdbms ipc message'     wait
6:  waiting for 'rdbms ipc message'     wait
7:  waiting for 'rdbms ipc message'     wait
8:  waiting for 'rdbms ipc reply'       wait
9:  waiting for 'rdbms ipc message'     wait
10: waiting for 'rdbms ipc message'     wait
11: waiting for 'log file switch (checkpoint incomplete)' wait
12: waiting for 'rdbms ipc message'     wait
13:                                     
14:                                     
15: waiting for 'switch logfile command' wait
16: waiting for 'rdbms ipc message'     wait
17: waiting for 'rdbms ipc message'     wait
18: waiting for 'Streams AQ: qmn coordinator idle wait' wait
19: waiting for 'log file switch (checkpoint incomplete)' wait
     Cmd: Insert
20: for 'Streams AQ: waiting for time management or cleanup tasks' wait
21: waiting for 'Streams AQ: qmn slave idle wait' wait
22: waiting for 'enq: CI - contention' [Enqueue CI-00000001-00000005] wait
23: waiting for 'enq: WF - contention' [Enqueue WF-00000000-00000000] wait
     Cmd: PL/SQL Execute
24: waiting for 'SQL*Net message from client' wait
25: waiting for 'buffer busy waits' (2,9,11) wait
     Cmd: Select
26: waiting for 'enq: TX - row lock contention'[Enqueue TX-0009001E-00000299] wait
27: waiting for 'log file switch (checkpoint incomplete)' wait
     Cmd: Insert
28: waiting for 'log file switch (checkpoint incomplete)' wait
29: waiting for 'library cache lock'   [LOCK: handle=7ef56d78] wait
30: waiting for 'buffer busy waits' (1,2ba2,1) wait
31: waiting for 'library cache lock'   [LOCK: handle=7ef56f88] wait
32: waiting for 'library cache lock'   [LOCK: handle=7ef56d78] wait
33: waiting for 'library cache lock'   [LOCK: handle=7ef56d78] wait
34: waiting for 'library cache lock'   [LOCK: handle=7ef56f88] wait
35: waiting for 'library cache lock'   [LOCK: handle=7ef56f88] wait
36: waiting for 'library cache lock'   [LOCK: handle=7ef56d78] wait
37: last wait for 'ksdxexeotherwait'    
Blockers
~~~~~~~~

     Above is a list of all the processes. If they are waiting for a resource
     then it will be given in square brackets. Below is a summary of the
     waited upon resources, together with the holder of that resource.
     Notes:
     ~~~~~
     o A process id of '???' implies that the holder was not found in the
        systemstate.

                    Resource Holder State
Enqueue CI-00000001-00000005     8: waiting for 'rdbms ipc reply'
Enqueue WF-00000000-00000000    22: 22: is waiting for 8:
Enqueue TX-0009001E-00000299    22: 22: is waiting for 8:
       LOCK: handle=7ef56d78    28: waiting for 'log file switch (checkpoint incomplete)'
       LOCK: handle=7ef56f88    30: waiting for 'buffer busy waits' (1,2ba2,1)


Object Names
~~~~~~~~~~~~
Enqueue CI-00000001-00000005                                   
Enqueue WF-00000000-00000000                                   
Enqueue TX-0009001E-00000299                                   
LOCK: handle=7ef56d78     TABL:EXFSYS.RLM$SCHDNEGACTION 
LOCK: handle=7ef56f88     TABL:EXFSYS.RLM$EVTCLEANUP    

List of Processes That May Be Stuck
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
8: waiting for 'rdbms ipc reply' wait
11: waiting for 'log file switch (checkpoint incomplete)' wait
15: waiting for 'switch logfile command' wait
18: waiting for 'Streams AQ: qmn coordinator idle wait' wait
19: waiting for 'log file switch (checkpoint incomplete)' wait
20: for 'Streams AQ: waiting for time management or cleanup tasks' wait
21: waiting for 'Streams AQ: qmn slave idle wait' wait
22: waiting for 'enq: CI - contention' wait
23: waiting for 'enq: WF - contention' wait
24: waiting for 'SQL*Net message from client' wait
25: waiting for 'buffer busy waits' (2,9,11) wait
26: waiting for 'enq: TX - row lock contention' wait
27: waiting for 'log file switch (checkpoint incomplete)' wait
28: waiting for 'log file switch (checkpoint incomplete)' wait
29: waiting for 'library cache lock' wait
30: waiting for 'buffer busy waits' (1,2ba2,1) wait
31: waiting for 'library cache lock' wait
32: waiting for 'library cache lock' wait
33: waiting for 'library cache lock' wait
34: waiting for 'library cache lock' wait
35: waiting for 'library cache lock' wait
36: waiting for 'library cache lock' wait

 
    SO: 0x83b5fd60, type: 4, owner: 0x83a69028, flag: INIT/-/-/0x00
    (session) sid: 133 trans: 0x822d5790, creator: 0x83a69028, flag: (41) USR/- BSY/-/-/-/-/-
              DID: 0001-001C-00000002, short-term DID: 0000-0000-00000000
              txn branch: (nil)
              oct: 0, prv: 0, sql: (nil), psql: (nil), user: 0/SYS
    service name: SYS$USERS
    O/S info: user: oracle, term: UNKNOWN,   ospid: 4790, machine: ora10g
              program: oracle@ora10g ( J001)
    waiting for 'log file switch (checkpoint incomplete)' blocking sess=0x(nil) seq=1 wait_time=0 seconds since wait started=12130
                =0, =0, =0
    Dumping Session Wait History
     for 'log file switch ( checkpoint incomplete)' count=1 wait_time=1107898
                =0, =0, =0
     for 'log file switch ( checkpoint incomplete)' count=1 wait_time=336
...

PROCESS 8:
  ----------------------------------------
  SO: 0x83a5f208, type: 2, owner: (nil), flag: INIT/-/-/0x00
  (process) Oracle pid=8, calls cur/top: 0x83ba9a28/0x83ba9a28, flag: (16) SYSTEM
            int error: 0, call error: 0, sess error: 0, txn error 0
  (post info) last post received: 0 0 33
              last post received-location: ksrpublish
              last process to post me: 83a68840 1 0
              last post sent: 0 0 24
              last post sent-location: ksasnd
              last process posted by me: 83a5ea20 1 6
    (latch info) wait_event=0 bits=0
    Process Group: DEFAULT, pseudo proc: 0x83aa67e8
    O/S info: user: oracle, term: UNKNOWN, ospid: 3048
    OSD pid info: Unix process   pid: 3048, image: oracle@ora10g ( SMON)
    Short stack dump:
...

PROCESS 22:
  ----------------------------------------
  SO: 0x83a660b8, type: 2, owner: (nil), flag: INIT/-/-/0x00
  (process) Oracle pid=22, calls cur/top: 0x83bafe00/0x83bafb40, flag: (2) SYSTEM
            int error: 0, call error: 0, sess error: 0, txn error 0
  (post info) last post received: 0 0 0
              last post received-location: No post
              last process to post me: none
              last post sent: 0 0 24
              last post sent-location: ksasnd
              last process posted by me: 83a5e238 1 6
    (latch info) wait_event=0 bits=0
    Process Group: DEFAULT, pseudo proc: 0x83aa67e8
    O/S info: user: oracle, term: UNKNOWN, ospid:   4402
    OSD pid info: Unix process pid: 4402, image: oracle@ora10g ( m000)
    Short stack dump:

自此,把注意力转移至SMON进程,首先检查SMON的trace文件:
/u01/app/oracle/admin/ora10g/bdump/ora10g_smon_3048.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
ORACLE_HOME = /u01/app/oracle/product/10.2.0
System name:     Linux
Node name:     ora10g
Release:     2.6.18-194.el5
Version:     #1 SMP Tue Mar 16 21:52:39 EDT 2010
Machine:     x86_64
Instance name: ora10g
Redo thread mounted by this instance: 1
Oracle process number: 8
Unix process pid: 3048, image: oracle@ora10g ( SMON)

*** 2015-03-21 17:34:17.611
*** SERVICE NAME:(SYS$BACKGROUND) 2015-03-21 17:34:17.611
*** SESSION ID:(164.1) 2015-03-21 17:34:17.611
Waited for detached process: CKPT for 300 seconds:
*** 2015-03-21 17:34:17.611
Dumping diagnostic information for CKPT:
OS pid = 3046
loadavg : 0.00 0.00 0.00
memory info: free memory = 0.00M
swap info:   free = 0.00M alloc = 0.00M total = 0.00M
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
0 S oracle    3046     1  0  75   0 - 184266 -     17:24 ?        00:00:00 ora_ckpt_ora10g
[Thread debugging using libthread_db enabled]
0x00000035cc2d517a in semtimedop () from /lib64/libc.so.6
#0  0x00000035cc2d517a in semtimedop () from /lib64/libc.so.6
#1  0x0000000003be9b09 in sskgpwwait ()
#2  0x0000000003bccdf0 in skgpwwait ()
#3  0x0000000000855f4a in ksdxsus ()
#4  0x0000000000857118 in ksdxffrz ()
#5  0x0000000000853003 in ksdxcb ()
#6  0x0000000001ebd0cf in sspuser ()
#7 
#8  0x00000035cc2d517a in semtimedop () from /lib64/libc.so.6
#9  0x0000000003be9b09 in sskgpwwait ()
#10 0x0000000003bccdf0 in skgpwwait ()
#11 0x0000000000798319 in kslwaitns_timed ()
#12 0x00000000008c3b1d in kskthbwt ()
#13 0x0000000000797e54 in kslwait ()
#14 0x00000000029f3f3b in ksarcv ()
#15 0x000000000082e8bf in ksbabs ()
#16 0x0000000000835822 in ksbrdp ()
#17 0x0000000002f4d840 in opirip ()
#18 0x000000000132b016 in opidrv ()
#19 0x0000000001eb3146 in sou2o ()
#20 0x0000000000723245 in opimai_real ()
#21 0x00000000007230fc in main ()
A debugging session is active.
     Inferior 1 [process 3046] will be detached.
Quit anyway? (y or n) [answered Y; input not from terminal]

*** 2015-03-21 17:34:19.125
*** 2015-03-21 17:34:29.129
Waited for detached process: CKPT for 310 seconds:
*** 2015-03-21 17:34:29.129
Dumping diagnostic information for CKPT:
OS pid = 3046
loadavg : 0.07 0.02 0.00
memory info: free memory = 0.00M
swap info:   free = 0.00M alloc = 0.00M total = 0.00M
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
0 S oracle    3046     1  0  75   0 - 184266 -     17:24 ?        00:00:00 ora_ckpt_ora10g
[Thread debugging using libthread_db enabled]
0x00000035cc2d517a in semtimedop () from /lib64/libc.so.6
#0  0x00000035cc2d517a in semtimedop () from /lib64/libc.so.6
#1  0x0000000003be9b09 in sskgpwwait ()
#2  0x0000000003bccdf0 in skgpwwait ()
#3  0x0000000000855f4a in ksdxsus ()
#4  0x0000000000857118 in ksdxffrz ()
#5  0x0000000000853003 in ksdxcb ()
#6  0x0000000001ebd0cf in sspuser ()
#7 
#8  0x00000035cc2d517a in semtimedop () from /lib64/libc.so.6
#9  0x0000000003be9b09 in sskgpwwait ()
#10 0x0000000003bccdf0 in skgpwwait ()
#11 0x0000000000798319 in kslwaitns_timed ()
#12 0x00000000008c3b1d in kskthbwt ()
#13 0x0000000000797e54 in kslwait ()
#14 0x00000000029f3f3b in ksarcv ()
#15 0x000000000082e8bf in ksbabs ()
#16 0x0000000000835822 in ksbrdp ()
#17 0x0000000002f4d840 in opirip ()
#18 0x000000000132b016 in opidrv ()
#19 0x0000000001eb3146 in sou2o ()
#20 0x0000000000723245 in opimai_real ()
#21 0x00000000007230fc in main ()
A debugging session is active.
     Inferior 1 [process 3046] will be detached.
Quit anyway? (y or n) [answered Y; input not from terminal]
*** 2015-03-21 17:34:29.895
*** 2015-03-21 17:34:39.899
Waited for detached process: CKPT for 320 seconds:
*** 2015-03-21 17:34:39.899
Dumping diagnostic information for CKPT:
OS pid = 3046
loadavg : 0.06 0.01 0.00
memory info: free memory = 0.00M
swap info:   free = 0.00M alloc = 0.00M total = 0.00M
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
0 S oracle    3046     1  0  75   0 - 184266 -     17:24 ?        00:00:00 ora_ckpt_ora10g
[Thread debugging using libthread_db enabled]
0x00000035cc2d517a in semtimedop () from /lib64/libc.so.6
#0  0x00000035cc2d517a in semtimedop () from /lib64/libc.so.6
#1  0x0000000003be9b09 in sskgpwwait ()
#2  0x0000000003bccdf0 in skgpwwait ()
#3  0x0000000000855f4a in ksdxsus ()
#4  0x0000000000857118 in ksdxffrz ()
#5  0x0000000000853003 in ksdxcb ()
#6  0x0000000001ebd0cf in sspuser ()
#7 
#8  0x00000035cc2d517a in semtimedop () from /lib64/libc.so.6
#9  0x0000000003be9b09 in sskgpwwait ()
#10 0x0000000003bccdf0 in skgpwwait ()
#11 0x0000000000798319 in kslwaitns_timed ()
#12 0x00000000008c3b1d in kskthbwt ()
#13 0x0000000000797e54 in kslwait ()
#14 0x00000000029f3f3b in ksarcv ()
#15 0x000000000082e8bf in ksbabs ()
#16 0x0000000000835822 in ksbrdp ()
#17 0x0000000002f4d840 in opirip ()
#18 0x000000000132b016 in opidrv ()
#19 0x0000000001eb3146 in sou2o ()
#20 0x0000000000723245 in opimai_real ()
#21 0x00000000007230fc in main ()
A debugging session is active.
     Inferior 1 [process 3046] will be detached.
Quit anyway? (y or n) [answered Y; input not from terminal]
*** 2015-03-21 17:34:40.598

从日志文件看出,SMON进程在等待与CKPT进程通信,难道CKPT进程异常吗?再将注意力转移至CKPT,查看CKPT的trace文件:
*** 2015-03-21 17:26:15.972
*** SERVICE NAME:(SYS$BACKGROUND) 2015-03-21 17:26:15.972
*** SESSION ID:(165.1) 2015-03-21 17:26:15.972
Received ORADEBUG command 'suspend' from process Unix process pid: 3062, image: <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<问题出在这里!!!!!
Received ORADEBUG command 'tracefile_name' from process Unix process pid: 3062, image: 


自此,我们就明白了事情的缘由经过:

1. 由于CKPT进程被挂起,使得检查点无法完成、SMON进程长时间持有资源Enqueue CI-00000001-00000005,最后导致数据库HANG住;
2. 随着资源Enqueue CI-00000001-00000005被SMON进程持有,导致一系列等待事件library cache lock,enq: CI - contention,log file switch (checkpoint incomplete)的出现

重新把CKPT进程恢复:

SQL> oradebug setospid 3046
Oracle pid: 7, Unix process pid: 3046, image: oracle@ora10g (CKPT)
SQL> oradebug resume;
Statement processed.
SQL> select sid,event,p1,p2,p3,wait_time,seconds_in_wait,state from v$session_wait where wait_class <> 'Idle';

       SID EVENT                                                                    P1         P2         P3  WAIT_TIME SECONDS_IN_WAIT STATE
---------- ---------------------------------------------------------------- ---------- ---------- ---------- ---------- --------------- -------------------

到此,数据库恢复正常
 

---------------------------------------------------

道行尚浅,欢迎拍砖。

转载请注明出处.

你可能感兴趣的:(故障处理)