Diagnosing "IPC Send timeout" Problems on RAC Databases (original)

Symptoms of IPC Send timeout

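One of the more common problems on RAC databases is "IPC Send timeout". Once "IPC Send timeout" appears in the database alert log, it is often accompanied by ORA-29740 or "Waiting for clusterware split-brain resolution", and the database instance terminates abnormally or is evicted from the cluster as a result.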

For example:

Alert log of instance 1:

Thu Jul 02 05:24:50 2012

IPC Send timeout detected.Sender: ospid 6143755      <== sender

Receiver: inst 2 binc 1323620776 ospid 49715160        <== receiver

Thu Jul 02 05:24:51 2012

IPC Send timeout to 1.7 inc 120 for msg type 65516 from opid 13

Thu Jul 02 05:24:51 2012

Communications reconfiguration: instance_number 2

Waiting for clusterware split-brain resolution       <== split-brain occurs

Thu Jul 02 05:24:51 2012

Trace dumping is performing id=[cdmp_20120702052451]

Thu Jul 02 05:34:51 2012

Evicting instance 2 from cluster   <== 10 minutes later, instance 2 is evicted from the cluster

Alert log of instance 2:

Thu Jul 02 05:24:50 2012

IPC Send timeout detected. Receiver ospid 49715160       <== receiver

Thu Jul 02 05:24:50 2012

Errors in file /u01/oracle/product/admin/sales/bdump/sales2_lms6_49715160.trc:

Thu Jul 02 05:24:51 2012

Waiting for clusterware split-brain resolution

Thu Jul 02 05:24:51 2012

Trace dumping is performing id=[cdmp_20120702052451]

Thu Jul 02 05:35:02 2012

Errors in file /u01/oracle/product/admin/sales/bdump/sales2_lmon_6257780.trc:

ORA-29740: evicted by member 0, group incarnation 122  <== instance 2 hits ORA-29740 and is evicted from the cluster

Thu Jul 02 05:35:02 2012

LMON: terminating instance due to error 29740

Thu Jul 02 05:35:02 2012

Errors in file /u01/oracle/product/admin/sales/bdump/sales2_lms7_49453031.trc:

ORA-29740: evicted by member , group incarnation

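The main processes that communicate between RAC instances are LMON, LMD, and LMS. Normally, after a message is sent to another instance, the sender expects the receiver to reply with an acknowledgement. If that acknowledgement does not arrive within the specified time (300 seconds by default), the sender concludes that the message never reached the receiver and reports "IPC Send timeout".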
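
The acknowledgement rule described above can be pictured with a small sketch. It only illustrates the timeout idea and is not Oracle's implementation: the class and function names are hypothetical, and the 300-second value is simply the default quoted in the text.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: 300 s is the default mentioned in this article.
ACK_TIMEOUT_SECONDS = 300.0


@dataclass
class PendingMessage:
    """Hypothetical record of a message waiting for its acknowledgement."""
    receiver_instance: int
    sent_at: float                     # clock reading (seconds) at send time
    acked_at: Optional[float] = None   # set once the acknowledgement arrives


def send_timed_out(msg: PendingMessage, now: float,
                   timeout: float = ACK_TIMEOUT_SECONDS) -> bool:
    """Return True if the message is still unacknowledged after `timeout` seconds."""
    if msg.acked_at is not None:
        return False                   # acknowledgement received, nothing to report
    return (now - msg.sent_at) > timeout


msg = PendingMessage(receiver_instance=2, sent_at=0.0)
print(send_timed_out(msg, now=120.0))   # False: still inside the window
print(send_timed_out(msg, now=301.0))   # True: the sender would report a timeout
```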

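This problem usually has one of the following causes:

1. Network problems that cause packet loss or communication failures.

2. Host resource (CPU, memory) shortages that prevent these processes from being scheduled or leave them unresponsive.

3. Oracle bugs.

4. On AIX, the fix for APAR IZ97457 has not been installed.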

Example of "IPC Send timeout" caused by a network problem

The alert log of instance 1 shows that the receiver is process 49715160 on node 2:

Thu Jul 02 05:24:50 2012

IPC Send timeout detected.Sender: ospid 6143755       <== sender

Receiver: inst 2 binc 1323620776 ospid 49715160       <== receiver
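
When many such entries have to be reviewed, the sender and receiver can be pulled out with a short script. The following is a minimal sketch that assumes the alert log uses exactly the "Sender: ospid ..." and "Receiver: inst ... binc ... ospid ..." layout shown in this article; the function name and the dictionary keys are made up for illustration.

```python
import re
from typing import Dict, List

# Patterns modelled on the alert log lines quoted in this article.
SENDER_RE = re.compile(r"IPC Send timeout detected\.\s*Sender: ospid (\d+)")
RECEIVER_RE = re.compile(r"Receiver: inst (\d+) binc (-?\d+) ospid (\d+)")


def find_ipc_send_timeouts(alert_log_text: str) -> List[Dict[str, int]]:
    """Return one dict per 'IPC Send timeout detected' entry found in the text."""
    events: List[Dict[str, int]] = []
    current = None
    for line in alert_log_text.splitlines():
        sender = SENDER_RE.search(line)
        if sender:
            current = {"sender_ospid": int(sender.group(1))}
            events.append(current)
        receiver = RECEIVER_RE.search(line)
        # Attach the first Receiver line that follows a Sender line.
        if receiver and current is not None and "receiver_inst" not in current:
            current["receiver_inst"] = int(receiver.group(1))
            current["receiver_ospid"] = int(receiver.group(3))
    return events


sample = """Thu Jul 02 05:24:50 2012
IPC Send timeout detected.Sender: ospid 6143755
Receiver: inst 2 binc 1323620776 ospid 49715160
"""
print(find_ipc_send_timeouts(sample))
```

The receiver's instance number and OS process ID tell you which node's OSWatcher data to inspect next.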

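OSWatcher's vmstat output for node 2 at that time shows no CPU or memory pressure. OSWatcher's netstat output, however, shows a large volume of packets on the private network interface in the minutes before the problem occurred.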

Node2:

zzz Thu Jul 02 05:12:38 CDT 2012

Name  Mtu   Network     Address            Ipkts Ierrs    Opkts Oerrs  Coll

en1   1500  10.182.3    10.182.3.2       4073847798     0 512851119     0     0 <== 4073847798 - 4073692530 = 155268 packets/30 s

zzz Thu Jul 02 05:13:08 CDT 2012

Name  Mtu   Network     Address            Ipkts Ierrs    Opkts Oerrs  Coll

en1   1500  10.182.3    10.182.3.2       4074082951     0 513107924     0     0 <== 4074082951 - 4073847798 = 235153 packets/30 s

Node1:

zzz Thu Jul 02 05:12:54 CDT 2012

Name  Mtu   Network     Address            Ipkts Ierrs    Opkts Oerrs  Coll

en1   1500  10.182.3    10.182.3.1       502159550     0 4079190700     0     0 <== 502159550 - 501938658 = 220892 packets/30 s

zzz Thu Jul 02 05:13:25 CDT 2012

Name  Mtu   Network     Address            Ipkts Ierrs    Opkts Oerrs  Coll

en1   1500  10.182.3    10.182.3.1       502321317     0 4079342048     0     0 <== 502321317 - 502159550 = 161767 packets/30 s

When this system is running normally, it transfers only a few thousand packets every 30 seconds:

zzz Thu Jul 02 04:14:09 CDT 2012

Name  Mtu   Network     Address            Ipkts Ierrs    Opkts Oerrs  Coll

en1   1500  10.182.3    10.182.3.2       4074126796     0 513149195     0     0 <== 4074126796 - 4074122374 = 4422 packets/30 s

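A sudden burst of network traffic like this can cause transmission failures, and loss of UDP or IP packets can also trigger the error. In this situation, ask the network administrator to check the network. In some cases the problem stopped after the private-network switch was restarted or replaced. (Note that the normal traffic volume varies with the hardware and the workload.)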
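
The per-interval packet counts annotated above are plain differences between consecutive cumulative Ipkts readings. The sketch below reproduces that arithmetic for the node 2 samples. The function names are hypothetical, and the "10 times the normal rate" threshold is an arbitrary assumption chosen only for illustration, since normal traffic differs from system to system.

```python
from typing import List, Tuple


def packet_deltas(samples: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Given (label, cumulative Ipkts) pairs in time order, return
    (label of the later sample, packets received since the previous sample)."""
    deltas = []
    for (_, previous), (label, current) in zip(samples, samples[1:]):
        deltas.append((label, current - previous))
    return deltas


def flag_bursts(deltas: List[Tuple[str, int]], baseline: int,
                factor: int = 10) -> List[Tuple[str, int]]:
    """Keep intervals whose packet count exceeds factor x baseline (assumed threshold)."""
    return [(label, count) for label, count in deltas if count > baseline * factor]


# Node 2 readings from the netstat output quoted in this article.
node2_samples = [
    ("previous sample", 4073692530),   # value implied by the first subtraction
    ("05:12:38", 4073847798),
    ("05:13:08", 4074082951),
]
deltas = packet_deltas(node2_samples)
print(deltas)                      # [('05:12:38', 155268), ('05:13:08', 235153)]
print(flag_bursts(deltas, 4422))   # both intervals exceed 10 x 4422
```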

Example of "IPC Send timeout" caused by excessive CPU load

The alert log of instance 1 shows that the receiver is process 1596935 on node 2:

Fri Aug 01 02:04:29 2008 

 IPC Send timeout detected.Sender: ospid 1506825 <== sender

 Receiver: inst 2 binc -298848812 ospid 1596935  <== receiver

OSWatcher's vmstat output for node 2 at that time:

 zzz ***Fri Aug 01 02:01:51 CST 2008 

 System Configuration: lcpu=32 mem=128000MB 

 kthr     memory             page              faults        cpu     

 ----- ----------- ------------------------ ------------ ----------- 

  r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa 

 25  1 7532667 19073986   0   0   0   0    5   0 9328 88121 20430 32 10 47 11 

58  0 7541201 19065392   0   0   0   0    0   0 11307 177425 10440 87 13  0  0 <== idle CPU is 0, meaning the CPU is 100% utilized

61  1 7552592 19053910   0   0   0   0    0   0 11122 206738 10970 85 15  0  0 

 zzz ***Fri Aug 01 02:03:52 CST 2008 

   System Configuration: lcpu=32 mem=128000MB 

   kthr     memory             page              faults        cpu     

 ----- ----------- ------------------------ ------------ ----------- 

  r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa 

 25  1 7733673 18878037   0   0   0   0    5   0 9328 88123 20429 32 10 47 11 

81  0 7737034 18874601   0   0   0   0    0   0 9081 209529 14509 87 13  0  0 <== the CPU run queue is very high

80  0 7736142 18875418   0   0   0   0    0   0 9765 156708 14997 91  9  0  0 <== idle CPU is 0, meaning the CPU is 100% utilized

This example shows that when the host CPU load is very high, the receiving process cannot respond to the sender, which triggers "IPC Send timeout".
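
The same check can be scripted for long OSWatcher archives. The sketch below assumes the 17-column AIX vmstat layout shown above (r b avm fre re pi po fr sr cy in sy cs us sy id wa) and flags a sample when idle CPU is 0 or the run queue exceeds the number of logical CPUs. Both criteria are assumptions taken from the annotations in this example, and the function and column names are hypothetical.

```python
from typing import Dict, List

# Column names follow the AIX vmstat header quoted above; the first "sy"
# (system calls) is renamed so it does not clash with the CPU "sy" column.
AIX_VMSTAT_COLUMNS = ["r", "b", "avm", "fre", "re", "pi", "po", "fr", "sr",
                      "cy", "in", "sy_calls", "cs", "us", "sy", "id", "wa"]


def parse_vmstat_rows(text: str) -> List[Dict[str, int]]:
    """Parse numeric vmstat data rows, skipping headers and other lines."""
    rows = []
    for line in text.splitlines():
        fields = line.split("<==")[0].split()   # drop trailing annotations
        if len(fields) != len(AIX_VMSTAT_COLUMNS):
            continue
        if not all(field.isdigit() for field in fields):
            continue                            # header or separator line
        rows.append(dict(zip(AIX_VMSTAT_COLUMNS, map(int, fields))))
    return rows


def cpu_saturated(row: Dict[str, int], lcpu: int) -> bool:
    """Assumed criteria: no idle CPU, or more runnable threads than logical CPUs."""
    return row["id"] == 0 or row["r"] > lcpu


sample = """  r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa
 25  1 7532667 19073986   0   0   0   0    5   0 9328 88121 20430 32 10 47 11
58  0 7541201 19065392   0   0   0   0    0   0 11307 177425 10440 87 13  0  0
"""
for row in parse_vmstat_rows(sample):
    print(row["r"], row["id"], cpu_saturated(row, lcpu=32))
```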

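Common bugs that cause IPC Send timeout

On 10g, the common bugs behind this problem are Bug 5190596 and Bug 6200820. They occur mostly in 10.2.0.3 and 10.2.0.4 and are fixed in 10.2.0.5. For details, see the following MOS articles: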

LMON dumps LMS0 too often during DRM leading to IPC send timeout [ID 5190596.8]

'IPC Send Timeout Detected' errors between QMON Processes after RAC reconfiguration [ID 458912.1]

On 11g, the common bugs are Bug 6200820 and Bug 7653579. For details, see the following MOS articles:

Bug 6200820  AQ node affinity not reconfigured after RAC reconfiguration (QMNC timeouts)

Bug 7653579 - IPC send timeout in RAC after only short period [ID 7653579.8]

IPC Send timeout caused by a missing fix for APAR IZ97457 on AIX

The following MOS article covers this case:

AIX VIO: Block Lost or IPC Send Timeout Possible Without Fix of APAR IZ97457 [ID 1305174.1]

It states:

Applies to:

Oracle Server - Enterprise Edition - Version 9.2.0.2 and later

IBM AIX on POWER Systems (64-bit)

Symptoms

Environment with IBM AIX VIO experiences one or some or all of the following symptoms:

Packet Loss

Cache Fusion "block lost"

IPC Send timeout

Instance Eviction

SKGXPSEGRCV: MESSAGE TRUNCATED user data nnnn bytes payload nnnn bytes

Cause

AIX issue APAR IZ97457 - A VIOS Server will not forward traffic from its VIO Clients to the external network

Solution

Please engage your OS vendor for fix.

Oracle's recommendation is to install the fix. The description of APAR IZ97457 is as follows:
Error description
A VIOS Server will not forward traffic from its VIO Clients to the external network.
Packets from the VIO Client travel to the hypervisor(phype) but the packets are dropped by the hypervisor as it attempts to deliver the packet to the VIO Server's trunk adapter.
The hypervisor will have dropped the packets because there are no buffers to place the data in. On the VIOServer,interrupts are not activating the trunk adapter to read and remove data from its buffers. This results in having full buffers at the trunk adapter.
Since the trunk adapter's buffers are full, phype cannot deliver the data and so VIO Clients cannot get packets through the SEA adapter and out to the network.
The problem was discovered on P7 systems where Vlans on the SEA are used.
"Hypervisor Receive" errors on the trunk adapter will increase as this problem occurs and the VIO Clients are not able to reach the outside network.
Problem summary
Unresponsive VIO Clients with traffic not forwarded to external network.
Problem conclusion
Ensure proper locking around receive scheduling operations.
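As the description shows, the IZ97457 fix addresses the situation where the network buffers fill up. AIX users are advised to check whether this fix is installed.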
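
One way to run that check is the AIX `instfix -ik <APAR>` command. The sketch below wraps it in Python. The message fragments it looks for ("All filesets for ... were found", "Not all filesets ...") are assumptions about the command's usual wording and should be verified on the target system; the function names are hypothetical. A result of "unknown" does not prove the fix is missing, so confirm with IBM support before drawing conclusions.

```python
import subprocess


def classify_instfix_output(output: str) -> str:
    """Map instfix output text to 'installed', 'partial', or 'unknown'.
    The matched phrases are assumed, not taken from this article."""
    text = output.lower()
    if "not all filesets" in text:
        return "partial"
    if "all filesets" in text and "were found" in text:
        return "installed"
    return "unknown"


def check_apar(apar: str = "IZ97457") -> str:
    """Run `instfix -ik <apar>` on the host where the fix applies and classify the result."""
    result = subprocess.run(["instfix", "-ik", apar],
                            capture_output=True, text=True)
    return classify_instfix_output(result.stdout + result.stderr)


# Classification of an example message (wording assumed):
print(classify_instfix_output("    All filesets for IZ97457 were found."))
```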

 

References:

https://blogs.oracle.com/Database4CN/entry/%E5%A6%82%E4%BD%95%E8%AF%8A%E6%96%ADrac%E6%95%B0%E6%8D%AE%E5%BA%93%E4%B8%8A%E7%9A%84_ipc_send_timeout_%E9%97%AE%E9%A2%98

http://www.killdb.com/2011/11/29/ipc-send-timeout-error-caused-2-nodes-to-reboot-in-rac.html

http://www.eygle.com/archives/2009/06/ipc_send_timeout_instance_evicted.html

AIX VIO: Block Lost or IPC Send Timeout Possible Without Fix of APAR IZ97457 [ID 1305174.1]

LMON dumps LMS0 too often during DRM leading to IPC send timeout [ID 5190596.8]

'IPC Send Timeout Detected' errors between QMON Processes after RAC reconfiguration [ID 458912.1]

Bug 6200820  AQ node affinity not reconfigured after RAC reconfiguration (QMNC timeouts)

Bug 7653579 - IPC send timeout in RAC after only short period [ID 7653579.8]

Top 5 issues for Instance Eviction (Doc ID 1374110.1)

This article is original work. When reposting, please credit the source and the author.

Corrections are welcome.

Email: [email protected]
