RAC hang fault analysis (sskgxpsnd2)

A RAC cluster running on Kylin OS in an x86 virtualized environment. Timeline: at around 20:24, one node was found to be hung and would not respond to shutdown commands. After that hung node was powered off, the other node hung as well, and it was then rebooted.

Unproductive analysis

First, the GI alert log:

[Figure 1: GI alert log screenshot]

There is a large backwards time jump.

Next, the crsd log:

[Figure 2: crsd log screenshot]

It simply crashed outright; there is no information of diagnostic value.

The ctssd log was checked as well, and it was not of much use either.

[Figure 3: ctssd log screenshot]

After some checking, this turned out to be the VM synchronizing its clock to the host after the restart. It is not the cause of the hang.

Productive analysis

Check the ASM instance alert log:

2023-08-29T14:08:32.939390+08:00
skgxpvfynet: mtype: 61 process 12032 failed because of a resource problem in the OS. The OS has most likely run out of buffers (rval: 4)
2023-08-29T14:08:32.995797+08:00
opiodr aborting process unknown ospid (11828) as a result of ORA-603
Errors in file /oracle/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_m000_12032.trc  (incident=247819):
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurred at: sskgxpsnd2
Incident details in: /oracle/app/grid/diag/asm/+asm/+ASM1/incident/incdir_247819/+ASM1_m000_12032_i247819.trc
2023-08-29T14:08:33.135129+08:00
skgxpvfynet: mtype: 61 process 12030 failed because of a resource problem in the OS. The OS has most likely run out of buffers (rval: 4)
Errors in file /oracle/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_12030.trc  (incident=247820):
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurred at: sskgxpsnd2
Incident details in: /oracle/app/grid/diag/asm/+asm/+ASM1/incident/incdir_247820/+ASM1_ora_12030_i247820.trc
2023-08-29T14:08:35.419844+08:00
opidrv aborting process M000 ospid (12032) as a result of ORA-603
2023-08-29T14:08:35.699231+08:00
opiodr aborting process unknown ospid (12030) as a result of ORA-603
2023-08-29T14:08:35.868746+08:00
Process m000 died, see its trace file
2023-08-29T15:46:28.224445+08:00
NOTE: cleaning up empty system-created directory '+OCRVOTE/gz11db/OCRBACKUP/backup01.ocr.303.1146123969'
NOTE: cleaning up empty system-created directory '+OCRVOTE/gz11db/OCRBACKUP/backup00.ocr.302.1146138375'
NOTE: cleaning up empty system-created directory '+OCRVOTE/gz11db/OCRBACKUP/40051182.285.1146152783'
2023-08-29T16:10:47.520135+08:00
skgxpvfynet: mtype: 61 process 8178 failed because of a resource problem in the OS. The OS has most likely run out of buffers (rval: 4)
Errors in file /oracle/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_8178.trc  (incident=247821):
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurred at: sskgxpsnd2
Incident details in: /oracle/app/grid/diag/asm/+asm/+ASM1/incident/incdir_247821/+ASM1_ora_8178_i247821.trc
2023-08-29T16:10:49.918887+08:00
opiodr aborting process unknown ospid (8178) as a result of ORA-603
2023-08-29T19:46:34.829519+08:00
NOTE: cleaning up empty system-created directory '+OCRVOTE/gz11db/OCRBACKUP/backup01.ocr.302.1146138375'
NOTE: cleaning up empty system-created directory '+OCRVOTE/gz11db/OCRBACKUP/backup00.ocr.285.1146152783'
NOTE: cleaning up empty system-created directory '+OCRVOTE/gz11db/OCRBACKUP/40735354.289.1146167189'
2023-08-29T20:21:04.087366+08:00
NOTE: ASM client -MGMTDB:_mgmtdb:gz11db disconnected unexpectedly.
NOTE: check client alert log.
NOTE: cleaned up ASM client -MGMTDB:_mgmtdb:gz11db connection state (reg:1944369513)
2023-08-29T20:21:04.693879+08:00
Dumping diagnostic data in directory=[cdmp_20230829202104], requested by (instance=0, osid=23886), summary=[trace bucket dump request (kfnclDelete0)].
2023-08-29T20:21:05.338195+08:00
NOTE: detected orphaned client id 0x10001.
2023-08-29T20:21:50.697339+08:00
NOTE: ASM client YXPTDB1:YXPTDB:gz11db disconnected unexpectedly.
NOTE: check client alert log.
NOTE: cleaned up ASM client YXPTDB1:YXPTDB:gz11db connection state (reg:1287912358)
2023-08-29T20:21:53.345392+08:00
NOTE: detected orphaned client id 0x10002.
2023-08-29T20:22:08.589123+08:00
NOTE: client exited [11015]
2023-08-29T20:22:09.901462+08:00
NOTE: client +ASM1:+ASM:gz11db no longer has group 2 (OCRVOTE) mounted
NOTE: client +ASM1:+ASM:gz11db no longer has group 1 (DATA) mounted
2023-08-29T20:22:09.972392+08:00
NOTE: ASMB0 process exiting due to ASM instance shutdown (inactive for 1 seconds)
NOTE: ASMB0 clearing idle groups before exit
2023-08-29T20:22:10.123064+08:00
NOTE: client +ASM1:+ASM:gz11db deregistered
2023-08-29T20:22:10.195414+08:00
Shutting down instance (immediate) (OS id: 20496)
Shutting down instance: further logons disabled
Stopping background process MMNL
2023-08-29T20:22:11.333414+08:00
Stopping background process MMON
2023-08-29T20:22:13.335540+08:00
License high water mark = 19
2023-08-29T20:22:14.528890+08:00
SQL> ALTER DISKGROUP ALL DISMOUNT /* asm agent *//* {0:0:6404} */ 
2023-08-29T20:22:14.641327+08:00
NOTE: cache dismounting (clean) group 1/0xA5E0179B (DATA)
NOTE: messaging CKPT to quiesce pins Unix process pid: 20496, image: oracle@gz11db1 (TNS V1-V3)
2023-08-29T20:22:15.254098+08:00
NOTE: LGWR doing clean dismount of group 1 (DATA) thread 1
NOTE: LGWR closing thread 1 of diskgroup 1 (DATA) at ABA 31.3948
NOTE: LGWR released recovery enqueue for thread 1 group 1 (DATA)
2023-08-29T20:22:15.525844+08:00
kjbdomdet send to inst 2
detach from dom 1, sending detach message to inst 2
2023-08-29T20:22:15.652953+08:00
NOTE: detached from domain 1
2023-08-29T20:22:15.656908+08:00
NOTE: cache dismounted group 1/0xA5E0179B (DATA)
2023-08-29T20:22:15.714886+08:00
GMON dismounting group 1 at 6571 for pid 33, osid 20496
2023-08-29T20:22:15.778396+08:00
NOTE: Disk DATA_0000 in mode 0x7f marked for de-assignment
2023-08-29T20:22:15.813238+08:00
SUCCESS: diskgroup DATA was dismounted
NOTE: cache deleting context for group DATA 1/0xa5e0179b
2023-08-29T20:22:15.820667+08:00
NOTE: cache dismounting (clean) group 2/0xA5F0179C (OCRVOTE)
NOTE: messaging CKPT to quiesce pins Unix process pid: 20496, image: oracle@gz11db1 (TNS V1-V3)
2023-08-29T20:22:15.830477+08:00
NOTE: LGWR doing clean dismount of group 2 (OCRVOTE) thread 1
NOTE: LGWR closing thread 1 of diskgroup 2 (OCRVOTE) at ABA 33.8895
NOTE: LGWR released recovery enqueue for thread 1 group 2 (OCRVOTE)
2023-08-29T20:22:16.030466+08:00
kjbdomdet send to inst 2
detach from dom 2, sending detach message to inst 2
2023-08-29T20:22:16.058295+08:00
NOTE: detached from domain 2
2023-08-29T20:22:16.066375+08:00
NOTE: cache dismounted group 2/0xA5F0179C (OCRVOTE)
2023-08-29T20:22:16.068034+08:00
GMON dismounting group 2 at 6572 for pid 33, osid 20496
2023-08-29T20:22:16.076227+08:00
NOTE: Disk OCRVOTE_0000 in mode 0x7f marked for de-assignment
NOTE: Disk OCRVOTE_0001 in mode 0x7f marked for de-assignment
NOTE: Disk OCRVOTE_0002 in mode 0x7f marked for de-assignment
2023-08-29T20:22:16.086385+08:00
SUCCESS: diskgroup OCRVOTE was dismounted
NOTE: cache deleting context for group OCRVOTE 2/0xa5f0179c
2023-08-29T20:22:16.092465+08:00
SUCCESS: ALTER DISKGROUP ALL DISMOUNT /* asm agent *//* {0:0:6404} */
Shutting down archive processes
Archiving is disabled
2023-08-29T20:22:17.511581+08:00
Shutting down archive processes
Archiving is disabled
2023-08-29T20:22:17.702392+08:00
Stopping background process VKTM
2023-08-29T20:22:24.267747+08:00
freeing rdom 2
freeing rdom 1
freeing rdom 0
2023-08-29T20:22:27.651980+08:00
Instance shutdown complete (OS id: 20496)
2023-08-29T20:05:25.578362+08:00
MEMORY_TARGET defaulting to 1128267776.
WARNING: ASM does not support ipclw. Switching to skgxp
WARNING: ASM does not support ipclw. Switching to skgxp
WARNING: ASM does not support ipclw. Switching to skgxp
ksxp_exafusion_enabled_dcf: ipclw_enabled=0 
WARNING: ASM does not support ipclw. Switching to skgxp
WARNING: ASM does not support ipclw. Switching to skgxp
WARNING: ASM does not support ipclw. Switching to skgxp
* instance_number obtained from CSS = 1, checking for the existence of node 0... 
* node 0 does not exist. instance_number = 1 
Starting ORACLE instance (normal) (OS id: 10339)
2023-08-29T20:05:25.597821+08:00
CLI notifier numLatches:7 maxDescs:2103
2023-08-29T20:05:25.640338+08:00
**********************************************************************
2023-08-29T20:05:25.640902+08:00
Dump of system resources acquired for SHARED GLOBAL AREA (SG

The ASM log from before the restart shows a significant error:

ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurred at: sskgxpsnd2

The ASM instance is reporting buffer-space exhaustion, which points to a problem in the OS-level socket buffers rather than inside Oracle itself.
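The "status: 105" in the ORA-27300 line is the raw Linux errno returned by the failed sendmsg() call. On Linux, errno 105 is ENOBUFS, which matches the "No buffer space available" text in ORA-27301. A minimal sketch to confirm the mapping (assuming a Linux host, where ENOBUFS has the value 105):

```python
import errno
import os

# ORA-27300 reports: "sendmsg failed with status: 105".
# On Linux, errno 105 is ENOBUFS -- the kernel could not allocate
# socket buffer memory for the sendmsg() system call.
print(errno.ENOBUFS)                     # 105 on Linux
print(errno.errorcode[errno.ENOBUFS])    # ENOBUFS
print(os.strerror(errno.ENOBUFS))        # No buffer space available
```

So the OS, not Oracle, is the layer that ran out of resources, which is consistent with the skgxpvfynet message about the OS running out of buffers.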

Next, the DB instance alert log:

2023-08-29T20:06:53.015478+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0) 
2023-08-29T20:06:53.015515+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0) 
2023-08-29T20:06:53.015616+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0) 
2023-08-29T20:06:53.015651+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0) 
2023-08-29T20:06:53.015861+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0) 
2023-08-29T20:06:53.015911+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0) 
2023-08-29T20:06:53.016094+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0) 
2023-08-29T20:06:53.016176+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0) 
2023-08-29T20:06:53.016311+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0) 
2023-08-29T20:06:53.016330+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0) 
2023-08-29T20:06:53.016377+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0) 
2023-08-29T20:06:53.016444+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0) 
2023-08-29T20:06:53.066125+08:00
Started redo scan
2023-08-29T20:06:53.220605+08:00
Completed redo scan
 read 7782 KB redo, 707 data blocks need recovery
2023-08-29T20:06:53.314712+08:00
Started redo application at
 Thread 2: logseq 17827, block 147240, offset 0
2023-08-29T20:06:53.393986+08:00
Recovery of Online Redo Log: Thread 2 Group 3 Seq 17827 Reading mem 0
  Mem# 0: +DATA/YXPTDB/redo03.log
2023-08-29T20:06:53.491915+08:00
Completed redo application of 4.09MB

Scrolling further back, there is also a log of the same error killing a process:

Errors in file /oracle/app/oracle/diag/rdbms/yxptdb/YXPTDB1/trace/YXPTDB1_m000_14903.trc  (incident=420364):
ORA-00700: soft internal error, arguments: [kfnRConnect2], [0], [0x7F263EE44E48], [], [], [], [], [], [], [], [], []
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurred at: sskgxpsnd2
Incident details in: /oracle/app/oracle/diag/rdbms/yxptdb/YXPTDB1/incident/incdir_420364/YXPTDB1_m000_14903_i420364.trc
2023-08-25T19:25:01.488828+08:00
Dumping diagnostic data in directory=[cdmp_20230825192501], requested by (instance=1, osid=14903 (M000)), summary=[incident=420364].
2023-08-25T21:42:03.079640+08:00
Thread 1 advanced to log sequence 10126 (LGWR switch)
  Current log# 2 seq# 10126 mem# 0: +DATA/YXPTDB/redo02.log

It was the m000 process that crashed here; since that process is not critical, there was no impact on the business at the time.

Confirmation against official documentation

A search on MOS turned up Doc ID 2484025.1.

[Figure 4: MOS Doc ID 2484025.1 screenshot]

The symptoms are quite similar. The cause is an internal bug; no further explanation of it is given.

However, that bug is reported only on UEK4 kernels, and the workaround is to raise the MTU of the lo (loopback) interface to 16436.
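Before and after applying the workaround, it is worth recording the current loopback MTU. A small illustrative helper (a sketch, assuming a Linux host; the `read_mtu` helper name is ours, not from the MOS note -- sysfs exposes the current MTU at /sys/class/net/<interface>/mtu as a decimal string):

```python
def read_mtu(path="/sys/class/net/lo/mtu"):
    """Return the interface MTU exposed by sysfs as an integer.

    On Linux, /sys/class/net/<if>/mtu holds the current MTU; the
    MOS workaround raises the loopback MTU to 16436.
    """
    with open(path) as f:
        return int(f.read().strip())
```

The change itself is `ip link set lo mtu 16436`, made persistent via the distribution's network configuration scripts so it survives a reboot.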

Our environment:

NeoKylin v7u6, kernel 3.10.0-957

The error signature matches, but the kernel does not, so the note can only serve as a reference. The MTU change still seems worth applying; we are waiting on further feedback.
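Since the root cause remains unconfirmed, a pragmatic follow-up is to watch kernel socket-buffer usage for recurrence. A minimal sketch, assuming a Linux host (`parse_sockstat` is our name; the line format follows /proc/net/sockstat, whose "mem" fields count pages of socket buffer memory -- the resource that is exhausted in an ENOBUFS situation):

```python
def parse_sockstat(text):
    """Parse /proc/net/sockstat-style text into {proto: {field: value}}.

    Lines look like: "TCP: inuse 5 orphan 0 tw 2 alloc 7 mem 3".
    """
    stats = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        proto, rest = line.split(":", 1)
        fields = rest.split()
        # fields alternate name/value pairs: inuse 5 orphan 0 ...
        stats[proto.strip()] = {
            name: int(val) for name, val in zip(fields[::2], fields[1::2])
        }
    return stats
```

Sampling this periodically (e.g. feeding it the contents of /proc/net/sockstat from cron) and alerting on a climbing TCP "mem" value would give early warning before sendmsg() starts failing with ENOBUFS again.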
