A database hang is the kind of fault in which the database instance stops responding to SQL requests from clients: a client submits a SQL statement and then just sits there waiting for the instance to return a result. In the most severe cases the client cannot connect to the database at all, and even a local sqlplus / as sysdba on the server fails to get in; the connection request simply stays in a waiting state.
The workloads running on an Oracle database are usually business-critical, so a database hang has to be detected promptly and handled as an emergency.
Databases are normally covered by monitoring, which will usually raise an alert along these lines when the instance stops responding.
Reference for collecting diagnostic information when the database hangs:
How to Collect Diagnostics for Database Hanging Issues (Doc ID 452358.1)
References for troubleshooting a non-responding database:
Troubleshooting Database Hang Issues (Doc ID 1378583.1)
How to Investigate Slow or Hanging Database Performance Issues (Doc ID 1362329.1)
The impact of this kind of fault is very large: every application system on the instance is severely affected, and before the root cause is found and finally resolved the instance usually has to be restarted in order to restore service quickly.
Sometimes the business returns to normal after the instance restart, but the data needed for analysis disappears with it. For example, with severe shared pool fragmentation reporting ORA-4031, the shared pool state is lost once the instance restarts and the root cause can no longer be traced.
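If the instance is still partially responsive (an ORA-4031 scenario rather than a complete hang), it is worth snapshotting the shared pool state before restarting. A minimal sketch, assuming a normal SQL connection is still possible and /tmp on the server is writable:

sqlplus / as sysdba
spool /tmp/shared_pool_before_restart.txt
-- shared pool memory breakdown, largest consumers first
select name, bytes from v$sgastat where pool = 'shared pool' order by bytes desc;
-- reserved pool usage, relevant for ORA-4031 analysis
select * from v$shared_pool_reserved;
spool off
exit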
1) Check the database alert log (a quick way to locate and tail it is sketched below)
Single instance: the instance alert log (alert_<SID>.log)
RAC: the alert log of every node
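A minimal sketch for finding and tailing the alert log; the trace directory is read from v$diag_info so the path does not have to be hard-coded (assumes 11g or later with ADR and that a SQL connection is still possible; otherwise go straight to the file under the ADR trace directory):

sqlplus / as sysdba
-- directory that contains the text alert log
select value from v$diag_info where name = 'Diag Trace';
exit
-- tail the file found above (the path shown is only an example)
tail -200 /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/alert_orcl1.log
-- or through ADRCI (pick the right home first if there are several)
adrci exec="show alert -tail 200"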
If no obvious anomaly is found, move on to the next check.
2) Test whether the database can be connected to normally (a timeout-wrapped version of the test is sketched below)
Local: sqlplus / as sysdba
Remote: sqlplus system/pass@ip:1521/service_name
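When the instance is suspected to be hung, the connection test itself can block forever, so it is safer to wrap it with a timeout. A sketch, assuming Linux with the coreutils timeout command; adjust the connect string to your own environment:

# give up after 15 seconds instead of leaving the terminal stuck
timeout 15 bash -c 'echo "select 1 from dual;" | sqlplus -S system/pass@ip:1521/service_name'
echo "exit code: $?"    # 124 means the connection or query timed out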
3) If a normal connection is not possible, check whether the operating system itself is showing problems
Check the operating system (a small collection script is sketched after this list):
Check user CPU and sys CPU in the Zabbix monitoring
top
vmstat 1 10
iostat -mx 1
Check whether swapping is severe: free -g
Check the operating system logs:
AIX: errpt
Linux: less /var/log/messages
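A small sketch that gathers the OS checks above into one file before any restart decision is made; Linux is assumed, and the output path and sampling intervals are only examples:

#!/bin/bash
# quick OS snapshot: CPU, memory/swap, I/O and recent system messages
OUT=/tmp/os_snapshot_$(date +%Y%m%d_%H%M%S).txt
{
  echo "===== top =====";      top -b -n 1 | head -40
  echo "===== vmstat =====";   vmstat 1 5
  echo "===== iostat =====";   iostat -mx 1 5
  echo "===== memory =====";   free -g
  echo "===== messages ====="; tail -200 /var/log/messages
} > "$OUT" 2>&1
echo "collected to $OUT"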
If this quick pass does not turn up much, or the root cause cannot be identified quickly, or the cause is known but cannot be fixed on the spot, then a restart has to be considered in order to restore service urgently.
Before restarting, first collect a round of diagnostic logs.
When the whole instance is hung there is no way to log in to the database through a normal connection, so the preliminary connection (sqlplus -prelim) is used.
Collect hanganalyze and systemstate dumps in a single-instance environment:
Hanganalyze
sqlplus -prelim / as sysdba
oradebug setmypid
oradebug unlimit
oradebug hanganalyze 3
-- Wait one minute before getting the second hanganalyze
oradebug hanganalyze 3
oradebug tracefile_name
exit
Systemstate
sqlplus -prelim / as sysdba
oradebug setmypid
oradebug unlimit
oradebug dump systemstate 258
oradebug dump systemstate 258
oradebug tracefile_name
exit
Collect hanganalyze and systemstate dumps in a RAC environment:
sqlplus -prelim / as sysdba
oradebug setmypid
oradebug unlimit
oradebug -g all hanganalyze 3
oradebug -g all hanganalyze 3
oradebug -g all dump systemstate 258
oradebug -g all dump systemstate 258
exit
Then restart the instance:
sqlplus -prelim / as sysdba
shutdown immediate;
If necessary, kill the server processes first:
ps -ef | grep "LOCAL=NO" | grep -v grep | awk '{ print $2 }' | xargs kill -9
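If shutdown immediate itself hangs on a hung instance, shutdown abort is the usual fallback, and in RAC it can be cleaner to stop and start the instance through Clusterware. A sketch; the database and instance names are only examples:

sqlplus -prelim / as sysdba
shutdown abort
exit
-- RAC alternative via Clusterware
srvctl stop instance -d orcl -i orcl1 -o abort
srvctl start instance -d orcl -i orcl1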
For an Oracle database the alert log usually contains fairly detailed information. The hanganalyze trace file gives you even more: it lays out the blocking chains at the session level, which is extremely helpful when analysing the problem.
In my experience, when you dig into a hang trace file the story usually ends in a bug, and the official advice is to apply a patch or upgrade. The workaround or fix can normally be found on MOS; with its huge knowledge base and bug database, Oracle support lets you find a solution perhaps 90% of the time.
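If a SQL connection is still possible (the instance is slow rather than completely hung), the same blocking-chain picture can be pulled live from v$wait_chains instead of a trace file. A minimal sketch, assuming 11gR2 or later:

sqlplus / as sysdba
-- blocking chains: who waits on what, and who blocks them
select chain_id, instance, sid, sess_serial#, blocker_sid,
       wait_event_text, in_wait_secs, num_waiters
from   v$wait_chains
order  by chain_id, num_waiters desc;
exit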
Reference MOS document: RAC Database Hangs Trying To Archive ('Log archive I/O'<='enq: DM - contention'). (Doc ID 1565777.1)
1) The RAC database hangs at regular intervals; every time it tries to archive a redo log, the database stops responding.
2) If a hanganalyze dump was taken for the database, the trace file shows information like the following:
Chains most likely to have caused the hang:
 [a] Chain 1 Signature: 'Log archive I/O'<='enq: DM - contention'
     Chain 1 Signature Hash: 0x7abeb73c
. . .
    session serial #: 5
  }
  which is waiting for 'Log archive I/O' with wait info:
  {
    p1: 'count'=0x1
    p2: 'intr'=0x100
    p3: 'timeout'=0xffffffff
    time in wait: 4.166054 sec
    timeout after: never
    wait id: 5635
    blocking: 1 session
    current sql: ALTER DATABASE OPEN
. . .
    2. event: 'Log archive I/O'
       wait id: 5633  p1: 'count'=0x1
       time waited: 0.013025 sec
       p2: 'intr'=0x100
       p3: 'timeout'=0xffffffff
    3. event: 'Log archive I/O'
       wait id: 5632  p1: 'count'=0x1
       time waited: 3.126933 sec
       p2: 'intr'=0x100
       p3: 'timeout'=0xffffffff
  }
Chain 1 Signature: 'Log archive I/O'<='enq: DM - contention'
Chain 1 Signature Hash: 0x7abeb73c
1) The RAC database uses two archive log destinations
1.1) The first destination is in a shared ASM disk group (+RECO)
log_archive_dest_1 = "LOCATION=USE_DB_RECOVERY_FILE_DEST"
db_recovery_file_dest = "+RECO"
1.2) The second destination is on a NAS file system
log_archive_dest_2 = "LOCATION=/NAS1/arch01"
2) Archiving to the ASM disk group destination works without any problem.
3) The secondary archive log destination on the NAS/NFS file system hangs when it is accessed; for example, the OS command 'df' hangs against the NFS file system (a safe way to check both findings is sketched after this list).
4) In addition, the OS logs report several network problems on the network interfaces (including the switch used to reach the NAS).
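Two quick checks, hedged as sketches: the status of each archive destination can be read from v$archive_dest_status, and the suspect NFS mount can be probed under a timeout so that your own shell does not freeze. The /NAS1/arch01 path follows the example above:

sqlplus / as sysdba
-- status and last error of each configured archive destination
select dest_id, dest_name, status, error
from   v$archive_dest_status
where  status <> 'INACTIVE';
exit

# df against only the suspect mount, killed after 10 seconds if it hangs
timeout 10 df -h /NAS1/arch01
[ $? -eq 124 ] && echo "NFS mount /NAS1/arch01 appears to be hung"
# mount options and NFS server can be read without touching the mount itself
grep NAS1 /proc/mounts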
1) Remove the NAS/NFS archive log destination from the affected database and keep only the disk group archive log destination (see the command sketch after this list).
2) In addition, unmount the NAS file system from all RAC nodes until the NAS problem is resolved.
3) Once the NAS/NFS archive log destination is removed from the affected database, the database opens on both nodes and is able to archive the redo log files again (archive logs are generated again).
4) Later, after the NAS problem has been fixed, add the NAS location back as the secondary archive destination.
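A minimal sketch of removing (or first only deferring) the secondary destination and later adding it back; the parameter number and path follow the example above:

sqlplus / as sysdba
-- option 1: defer the destination (setting is kept but ignored)
alter system set log_archive_dest_state_2='DEFER' scope=both sid='*';
-- option 2: remove the destination completely
alter system set log_archive_dest_2='' scope=both sid='*';
-- after the NAS is repaired, add it back and re-enable it
alter system set log_archive_dest_2='LOCATION=/NAS1/arch01' scope=both sid='*';
alter system set log_archive_dest_state_2='ENABLE' scope=both sid='*';
exit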
Reference article:
Active Dataguard Standby Database Hang Due To ROW CACHE ENQUEUE LOCK (Doc ID 2586299.1)
The hanganalyze trace log looks like this:
Chains most likely to have caused the hang:
 [a] Chain 1 Signature: 'row cache lock'<='cursor: pin S wait on X'
     Chain 1 Signature Hash: 0x6b385219
 [b] Chain 2 Signature: 'row cache lock'
     Chain 2 Signature Hash: 0x95d00c11
 [c] Chain 3 Signature: 'row cache lock'
     Chain 3 Signature Hash: 0x95d00c11
-------------------------------------------------------------------------------
Chain 1:
-------------------------------------------------------------------------------
    Oracle session identified by:
    {
                instance: 1 (xyz.xyz)
                   os id: 15615
              process id: 40, oracle@abc-dg (MMON)
              session id: 1121
        session serial #: 52846
    }
    is waiting for 'cursor: pin S wait on X' with wait info:
    {
      ...
    }
    and is blocked by
 => Oracle session identified by:
    {
                instance: 1 (xyz.xyz)
                   os id: 378537
              process id: 350, oracle@abc-dg (W001)
              session id: 397
        session serial #: 32479
    }
    which is waiting for 'row cache lock' with wait info:
    {
                      p1: 'cache id'=0x8
                      p2: 'mode'=0x0
                      p3: 'request'=0x3
            time in wait: 5247 min 32 sec (last interval)
            time in wait: 5747 min 47 sec (total)
           timeout after: never
                 wait id: 707
                blocking: 1 session
             current sql: select s.file#, s.block#, s.ts#, t.obj#, s.hwmincr, t.obj# from tab$ t, seg$ s
                          where bitand(s.spare1, 4503599627370496) = 4503599627370496
                          and bitand(s.spare1, 65536) <> 65536
                          and s.file# = t.file# and s.ts# = t.ts# and s.block# = t.block#
                          UNION
                          select s.file#, s.block#, s.ts#, t.obj#, s.hwmincr, tab.obj# from tabp
The Function Stack of Blocker:
ksedsts <- ksdxfstk <- ksdxcb <- sspuser <- __sighandler
<- semtimedop <- skgpwwait <- ksliwat <- kslwaitctx <- kqrget
<- kqrLockAndPinPo <- kqrpre1 <- kkdlSetTableVersion <- kkdlgstd <- kkmfcbloCbk
<- kkmpfcbk <- qcsprfro <- qcsprfro_tree <- qcsprfro_tree <- qcspafq
<- qcspqbDescendents <- qcspqb <- kkmdrv <- opiSem <- opiprs
<- kksParseChildCursor <- rpiswu2 <- kksLoadChild <- kxsGetRuntimeLock
This is due to the bug below:
Bug 27716177 : ADG: ORA-04021:ORA-04024:ROW CACHE ENQUEUE AGAINST DC_OBJECTS:OBJ$
Duplicate of
Bug 28228168 : ORA-04024: SELF-DEADLOCK DETECTED WHILE TRYING TO MUTEX PIN CURSOR
Duplicate of
Unpublished Bug 28423598 : ROW CACHE ENQUEUE AGAINST DC_OBJECTS:OBJ$ on Active Data Guard
Document 28423598.8 ORA-4021: ORA-4024: ROW CACHE ENQUEUE AGAINST DC_OBJECTS:OBJ$ on Active Data Guard
Dataguard Standby database could freeze and wait on row cache enqueue whilst trying to apply a change to a bootstrap object (eg. OBJ$). Sometimes it can crash as well.
Apply the latest RU for 12.2.0.1 (12.2.0.1.190716, the DB JUL 2019 RU), which includes the fix for Bug 28423598: Patch 29757449 for the 12.2.0.1.190716 RU.
Or
Apply One-off Patch for Bug 28423598.
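Whichever route is taken, the applied patches can be verified afterwards both from the Oracle home and from inside the database. A sketch; the exact columns of dba_registry_sqlpatch vary slightly by version:

# patches registered in the Oracle home
$ORACLE_HOME/OPatch/opatch lspatches
# patches recorded in the database itself
sqlplus -S / as sysdba <<EOF
select patch_id, action, status, description
from   dba_registry_sqlpatch
order  by action_time;
EOF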