今天检查数据库中的备份输出脚本时,发现RMAN备份出现了错误。
这一篇主要描述问题的现象。
错误信息如下:
bash-3.00$ more /data/backup/backup_tradedb_090523.out
Script. /data/backup/backup_tradedb.sh
==== started on Sat May 23 23:00:00 CST 2009 ====
RMAN: /opt/oracle/product/10.2/database/bin/rman
ORACLE_SID: tradedb1
ORACLE_HOME: /opt/oracle/product/10.2/database
Recovery Manager: Release 10.2.0.3.0 - Production on Sat May 23 23:00:01 2009
Copyright (c) 1982, 2005, Oracle. All rights reserved.
RMAN-06900: WARNING: unable to generate V$RMAN_STATUS or V$RMAN_OUTPUT row
RMAN-06901: WARNING: disabling update of the V$RMAN_STATUS and V$RMAN_OUTPUT rows
ORACLE error from target database:
ORA-03113: end-of-file on communication channel
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-00554: initialization of internal recovery manager package failed
RMAN-06003: ORACLE error from target database:
ORA-03114: not connected to ORACLE
Script. /data/backup/backup_tradedb.sh
==== ended in error on Sat May 23 23:22:34 CST 2009 ====
这个错误信息以前还真的没有碰到过,出现了ORA-3113错误,还有两个告警信息,初步怀疑可能与网络不稳定有关,尝试重新启动后台备份,如果可以顺利运行,说明这个问题可能是瞬时网络不稳定造成的。
bash-3.00$ more /data/backup/backup_tradedb_090525.out
Script. bash
==== started on Mon May 25 14:34:14 CST 2009 ====
RMAN: /opt/oracle/product/10.2/database/bin/rman
ORACLE_SID: tradedb1
ORACLE_HOME: /opt/oracle/product/10.2/database
恢复管理器: Release 10.2.0.3.0 - Production on星期一5月25 14:34:15 2009
Copyright (c) 1982, 2005, Oracle. All rights reserved.
RMAN-06900:警告:无法生成V$RMAN_STATUS或V$RMAN_OUTPUT行
RMAN-06901:警告:禁止更新V$RMAN_STATUS和V$RMAN_OUTPUT行
来自目标数据库的ORACLE错误:
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-00554:内部恢复管理器程序包初始化失败
RMAN-06003:来自目标数据库的ORACLE错误:
ORA-03114:未连接到ORALCE
Script. bash
==== ended in error on Mon May 25 14:56:54 CST 2009 ====
在后台运行备份后,很快输出脚本中就显示了错误信息,和前一次错误信息略有区别,这次没有ORA-3113错误了。看来问题是可以重现的,尝试直接RMAN连接:
bash-3.00$ rman target /
恢复管理器: Release 10.2.0.3.0 - Production on星期一5月25 15:02:51 2009
Copyright (c) 1982, 2005, Oracle. All rights reserved.
RMAN连接数据库的操作被挂起。在等待RMAN相应的时候,查询了一下alert文件,检查备份启动时刻是否有对应的错误信息,结果果然有对应的错误:
Sat May 23 17:16:43 2009
GES: Potential blocker (pid=17895) on resource WF-00000046-00000000;
enqueue info in file /opt/oracle/admin/tradedb/bdump/tradedb1_lmd0_7210.trc and DIAG trace file
Sat May 23 21:28:36 2009
GES: Potential blocker (pid=15807) on resource TM-00002391-00000000;
enqueue info in file /opt/oracle/admin/tradedb/bdump/tradedb1_lmd0_7210.trc and DIAG trace file
Sat May 23 23:15:04 2009
GES: Potential blocker (pid=21881) on resource CF-00000000-00000000;
enqueue info in file /opt/oracle/admin/tradedb/bdump/tradedb1_lmd0_7210.trc and DIAG trace file
Killing enqueue blocker (pid=21881) on resource CF-00000000-00000000
by killing session 275.54937
Sat May 23 23:15:58 2009
Killing enqueue blocker (pid=21881) on resource CF-00000000-00000000
by killing session 275.54937
Sat May 23 23:15:58 2009
Killing enqueue blocker (pid=21881) on resource CF-00000000-00000000
by killing session 275.54937
Sat May 23 23:17:14 2009
Killing enqueue blocker (pid=21881) on resource CF-00000000-00000000
by killing session 275.54937
Sat May 23 23:22:34 2009
Killing enqueue blocker (pid=21881) on resource CF-00000000-00000000
by terminating the process
看这个信息,问题似乎是由于全局资源被占用导致的。而且刚才后台运行的备份脚本也有对应的错误信息:
Mon May 25 15:17:58 2009
GES: Potential blocker (pid=8568) on resource CF-00000000-00000000;
enqueue info in file /opt/oracle/admin/tradedb/bdump/tradedb1_lmd0_7210.trc and DIAG trace file
Mon May 25 15:17:58 2009
Killing enqueue blocker (pid=8568) on resource CF-00000000-00000000
by killing session 87.1697
Mon May 25 15:17:58 2009
Killing enqueue blocker (pid=8568) on resource CF-00000000-00000000
by killing session 87.1697
Mon May 25 15:21:57 2009
Killing enqueue blocker (pid=8568) on resource CF-00000000-00000000
by killing session 87.1697
Mon May 25 15:21:58 2009
Killing enqueue blocker (pid=8568) on resource CF-00000000-00000000
by killing session 87.1697
Mon May 25 15:25:28 2009
Killing enqueue blocker (pid=8568) on resource CF-00000000-00000000
by terminating the process
很快刚才RMAN操作也报错退出了:
bash-3.00$ rman target /
恢复管理器: Release 10.2.0.3.0 - Production on星期一5月25 15:02:51 2009
Copyright (c) 1982, 2005, Oracle. All rights reserved.
RMAN-06900:警告:无法生成V$RMAN_STATUS或V$RMAN_OUTPUT行
RMAN-06901:警告:禁止更新V$RMAN_STATUS和V$RMAN_OUTPUT行
来自目标数据库的ORACLE错误:
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-00554:内部恢复管理器程序包初始化失败
RMAN-06003:来自目标数据库的ORACLE错误:
ORA-03114:未连接到ORALCE
对比rman的错误信息,以及Oracle的alert文件中的错误信息,很可能正是由于无法获取到共享的资源,导致无法更新V$RMAN_STATUS和V$RMAN_OUTPUT信息,从而导致RMAN连接报错。
检查trace文件/opt/oracle/admin/tradedb/bdump/tradedb1_lmd0_7210.trc的信息,在RMAN报错的时间点发现了错误信息:
*** 2009-05-25 15:17:58.130
DUMP LOCAL BLOCKER/HOLDER: block level 3 res [0x0][0x0],[CF]
----------resource 0x5763846d0----------------------
resname : [0x0][0x0],[CF]
Local node : 0
dir_node : 1
master_node : 1
hv idx : 95
hv last r.inc : 2
current inc : 4
hv status : 0
hv master : 1
open options : dd
Held mode : KJUSERPW
Cvt mode : KJUSERNL
Next Cvt mode : KJUSERPR
msg_seq : bdddbdcf
res_seq : 5
grant_bits : KJUSERNL KJUSERCR KJUSERPW
grant mode : KJUSERNL KJUSERCR KJUSERCW KJUSERPR KJUSERPW KJUSEREX
count : 2 1 0 0 1 0
val_state : KJUSERVS_VALUE
valblk : 0x0009c399000000000038000000000000 .8
access_node : 0
vbreq_state : 0
state : x8
resp : 5763846d0
On Scan_q? : N
Total accesses: 4017991
Imm. accesses: 3763043
Granted_locks : 2
Cvting_locks : 2
value_block: 0009 c3 99 00 00 00 00 00 38 00 00 00 00 00 00
GRANTED_Q :
lp 8b6213238 gl KJUSERPW rp 5763846d0 [0x0][0x0],[CF]
master 1 gl owner 8b60b8da0 possible pid 8568 xid 1006-0066-000016ACbast 0 rseq 5 mseq 0 history 0x49a51495
open opt KJUSERDEADLOCK KJUSERIDLK
lp 737246d38 gl KJUSERCR rp 5763846d0 [0x0][0x0],[CF]
master 1 gl owner737115f20 possible pid 7228 xid 0001-0013-00000004 bast 0 rseq 5 mseq 0 history 0x9a5
open opt KJUSERDEADLOCK KJUSERIDLK
CONVERT_Q:
lp 737336690 gl KJUSERNL rl KJUSERPR rp 5763846d0 [0x0][0x0],[CF]
master 1 gl owner 73710ce50 possible pid 7654 xid 1001-001D-00000019 bast 0 rseq 5 mseq 0 history 0x49a5149a
convert opt KJUSERGETVALUE KJUSERNODEADLOCKWAIT
lp 8b62134d8 gl KJUSERNL rl KJUSERPR rp 5763846d0 [0x0][0x0],[CF]
master 1 gl owner 8b61173d0 possible pid 7652 xid 1001-0012-00000002 bast 0 rseq 5 mseq 0 history 0x49a5149a
convert opt KJUSERGETVALUE KJUSERNODEADLOCKWAIT
----------enqueue 0x8b6213238------------------------
lock version : 21790201
Owner node : 0
grant_level : KJUSERPW
req_level : KJUSERPW
bast_level : KJUSERNL
notify_func : 0
resp : 5763846d0
procp : 58e58acd8
pid : 8568
proc version : 1264
oprocp : 0
opid : 0
group lock owner : 8b60b8da0
possible pid : 8568
xid : 1006-0066-000016AC
dd_time : 0.0 secs
dd_count : 0
timeout : 0.0 secs
On_timer_q? : N
On_dd_q? : N
lock_state : GRANTED
Open Options : KJUSERDEADLOCK KJUSERIDLK
Convert options : KJUSERGETVALUE KJUSERNODEADLOCKWAIT
History : 0x49a51495
Msg_Seq : 0x0
res_seq : 5
valblk : 0x0009c399000000000038000000000000 .8
Potential blocker (pid=8568) on resource CF-00000000-00000000
DUMP LOCAL BLOCKER: initiate state dump for TIMEOUT
possible owner[102.8568]
Submitting asynchronized dump request [28]
KCC: controlfile sequence number from SGA = 639897
Killing enqueue blocker (pid=8568) on resource CF-00000000-00000000
by killing session 87.1697
Kill session sent to session 87.1697
*** 2009-05-25 15:18:02.165
Setting 3-way CR grants to 0 global-lru off? 0
*** 2009-05-25 15:19:04.032
Setting 3-way CR grants to 1 global-lru off? 0
*** 2009-05-25 15:21:57.782
Killing enqueue blocker (pid=8568) on resource CF-00000000-00000000
by killing session 87.1697
Kill session sent to session 87.1697
这个信息进一步证实问题确实是由于资源被占用所引发的。既然是资源占用问题,检查一下系统当前的锁信息:
SQL> SELECT SID, TYPE, ID1, ID2, LMODE, REQUEST, CTIME, BLOCK
2 FROM V$LOCK
3 WHERE SID IN (SELECT SID FROM V$SESSION WHERE USERNAME IS NOT NULL);
SID TY ID1 ID2 LMODE REQUEST CTIME BLOCK
---------- -- ---------- ---------- ---------- ---------- ---------- ----------
118 TO 196404 1 3 0 144940 2
118 JQ 0 4 6 0 144943 2
102 JQ 0 74 6 0 141340 2
289 TO -40016215 1 3 0 141337 2
289 JQ 0 27 6 0 141343 2
102 TM 70434 0 3 0 141319 2
118 TM 196404 0 3 0 144940 2
118 TX 262181 87690 6 0 144940 2
289 TX 458770 86777 6 0 141337 2
已选择9行。
当前持有锁的会话都和JOB有关,且这个锁的持有时间已经不短了,检查一下当前系统运行的JOB:
SQL> SELECT SID, JOB, LAST_DATE, THIS_DATE
2 FROM DBA_JOBS_RUNNING;
SID JOB LAST_DATE THIS_DATE
---------- ---------- ------------------- -------------------
118 4 2009-05-10 00:00:05 2009-05-24 00:00:02
102 74 2009-05-23 01:00:00 2009-05-24 01:00:04
289 27 2009-05-17 01:00:03 2009-05-24 01:00:04
可以看到,这3个JOB的运行时间都不正常,最长的一个居然已经运行了将近两天。
另外在检查操作系统的时候还发现一个问题,后台存在大量的racgmain进程:
bash-3.00$ ps -ef|grep racgmain
oracle 24085 1 0 May 23 ? 0:00 /opt/oracle/product/10.2/crs/bin/racgmain check
oracle 20734 1 0 May 23 ? 0:00 /opt/oracle/product/10.2/crs/bin/racgmain check
oracle 19021 1 0 May 23 ? 0:00 /opt/oracle/product/10.2/database/bin/racgmain check
oracle 19035 1 0 May 23 ? 0:00 /opt/oracle/product/10.2/database/bin/racgmain check
oracle 20207 1 0 May 23 ? 0:00 /opt/oracle/product/10.2/database/bin/racgmain check
.
.
.
oracle 18643 1 0 May 23 ? 0:00 /opt/oracle/product/10.2/database/bin/racgmain check
oracle 18420 1 0 May 23 ? 0:00 /opt/oracle/product/10.2/database/bin/racgmain check
统计一下,发现居然有四百多个进程:
bash-3.00$ ps -ef|grep -c "racgmain check"
449
看来整个问题还是很复杂的,而且要解决这个问题,还需要找到一个方向来入手解决这个问题。
oracle视频教程请关注:http://u.youku.com/user_video/id_UMzAzMjkxMjE2.html