一次enq: CF - contention 导致数据库宕机的故障分析

数据库意外宕机,alert日志信息如下:


Sun Jul 27 01:02:48 2014
System State dumped to trace file /oracle/app/oracle/product/1020/admin/jcgl/bdump/jcgl2_diag_569650.trc
Sun Jul 27 01:03:48 2014
Killing enqueue blocker (pid=721256) on resource CF-00000000-00000000
 by killing session 3289.1
Killing enqueue blocker (pid=721256) on resource CF-00000000-00000000
 by terminating the process
LMD0: terminating instance due to error 2103
Sun Jul 27 01:03:48 2014
System state dump is made for local instance
Sun Jul 27 01:03:48 2014
Errors in file /oracle/app/oracle/product/1020/admin/jcgl/bdump/jcgl2_lms5_1048666.trc:
ORA-02103: PCC: inconsistent cursor cache (out-of-range cuc ref)
Sun Jul 27 01:03:48 2014
Trace dumping is performing id=[cdmp_20140727010348]
Sun Jul 27 01:03:49 2014
Shutting down instance (abort)
License high water mark = 1575
Sun Jul 27 01:03:57 2014
Termination issued to instance processes. Waiting for the processes to exit
Sun Jul 27 01:04:02 2014
Instance termination failed to kill one or more processes
Sun Jul 27 01:08:53 2014
Termination issued to instance processes. Waiting for the processes to exit
Sun Jul 27 01:08:59 2014
Instance termination failed to kill one or more processes
Sun Jul 27 01:12:30 2014
Shutting down instance (abort)
License high water mark = 1575
Termination issued to instance processes. Waiting for the processes to exit
Sun Jul 27 01:12:45 2014
Instance termination failed to kill one or more processes
Sun Jul 27 01:13:21 2014
Shutting down instance (abort)
License high water mark = 1575
Termination issued to instance processes. Waiting for the processes to exit
Sun Jul 27 01:13:36 2014
Instance termination failed to kill one or more processes
Sun Jul 27 01:21:06 2014
Instance terminated by USER, pid = 16228674
Sun Jul 27 01:21:52 2014
Instance terminated by USER, pid = 4215324
Sun Jul 27 01:28:44 2014
Instance terminated by USER, pid = 5947584

-----------------------------------------------------------------------------------------------------------------
分析:


首先看到,27号1点03分
oracle执行如下操作:  意思是杀死阻塞CF事件的进程号721256


Killing enqueue blocker (pid=721256) on resource CF-00000000-00000000
 by killing session 3289.1
Killing enqueue blocker (pid=721256) on resource CF-00000000-00000000



查看历史trace文件,查找进程721256在实例崩溃前是什么进程:


通过查看当时时间点的trace文件可知:jcgl2_ckpt_721256.trw


进程721256是oracle的检查点进程,也就是说,oracle自动杀死了ckpt进程


从而出现了如下信息:LMD0进程中止了实例


LMD0: terminating instance due to error 2103


那oracle为什么会自动杀死持有CF锁的进程呢?


为什么杀死:(参见metalink doc id:753290.1)


在oracle中:

a process was unable to obtain the CF enqueue within the specified timeout (default 900 sesconds)

如果某个进程持有CF锁,超过这个参数设置值则:

The instance terminated the blocker pid 721256 to avoid hanging up all the instances behind the CF enqueue blocker.

这个操作是受隐含参数 _kill_controlfile_enqueue_blocker 控制
若是想避免这种操作,可以将此参数设为false
当此参数设置为false的时候,不管任何情况的CF阻塞,oracle都不会再干预,


如果我们仅仅是想让oracle不干预oracle后台程序的话,即后台进程不被kill
则需设置参数:_kill_enqueue_blocker=1 


那么为什么ckpt会长时间持有CF锁:

关于ckpt持有锁CF:(doc id :406191.1)
出现进程持有CF锁时间过长问题的原因有:

1、very slow I/O subsystem where the control files are stored
2、frequent log switching -redo logs too small or insufficient logs
3、async I/O issue or multiple db_writers,you can't use both of them,back one out
4、OS / hardware issues


再次查看alert日志进行分析:
Sun Jul 27 00:44:14 2014
Thread 2 advanced to log sequence 388559 (LGWR switch)
  Current log# 19 seq# 388559 mem# 0: /gljcdata/zgls/redo19.log
Sun Jul 27 00:45:29 2014
Thread 2 advanced to log sequence 388560 (LGWR switch)
  Current log# 20 seq# 388560 mem# 0: /gljcdata/zgls/redo20.log
Sun Jul 27 00:46:07 2014
Thread 2 advanced to log sequence 388561 (LGWR switch)
  Current log# 11 seq# 388561 mem# 0: /gljcdata/zgls/redo11.log
Sun Jul 27 00:47:16 2014
Thread 2 advanced to log sequence 388562 (LGWR switch)
  Current log# 12 seq# 388562 mem# 0: /gljcdata/zgls/redo12.log
Sun Jul 27 01:02:48 2014

可以看出,在故障发生之前,redo log的切换大概在1分钟一次(切换频率过快),而且是有规律的,但是突然在零点47分之后,到1点02分之间,没有任何日志切换存在

大概时间是15分钟,也就是900秒,正好佐证了CF锁持有时间超过了900秒,导致oracle杀死该持有进程。


类似文档见(doc id 28045.1)
建议修改redo logfile大小,让系统保持日志切换大概在半小时左右一次,另外查看当时系统日志,是否存在I/O缓慢情况,是不是有异步I/O的使用。



-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

另外,和CF锁相关,常会出现ora-2103,ora-600 [2103] 之类的错误,以后遇见此类错误的话,可以立马着手分析是否是CF问题


你可能感兴趣的:(一次enq: CF - contention 导致数据库宕机的故障分析)