某生产系统RAC故障分析日志

故障情况描述

**公司生产系统数据库**f2010422凌晨846左右发生了重启,经过核实,这次重启操作不是人为原因导致的。为了保证生产库的顺利运行,保证业务的安全性以及稳定性,我们需要详细了解整个事件的起因,规避可能再次发生的风险。

故障机器情况

发生故障的服务器为IBM BLADE CENTER刀片上的JS21机器,安装了AIX操作系统,后台连接了磁盘阵列IBM DS4800。此台服务器与另一台刀片机JS21组成了ORACLE RAC系统。

具体的情况如下

硬件平台信息

主机操作系统

数据库版本

JS21

5300-10-00-0000

ORACLE 10.2.0.4

 

主机网络信息

故障主机所在的公网和私网信息如下:

主机ip地址

主机网卡信息

对应的RAC地址

**f

en0

*.*.*.A

**f-vip

en0

*.*.*.B(vip)

**f-priv

en1

*.*.*.C(心跳地址)

 

故障日志以及分析

从日志中可以发现,因为网络的连接失败,导致了RAC**f节点的驱逐操作。

我们可以发现,oracle RAC节点驱逐原因分三种,《可以参考oracle的文档(Doc ID 559365.1)

1Node is not pinging via the network heartbeat

2Node is not pinging the Voting disk

3Node is hung/busy and is unable to perform either of the earlier tasks

因为发生故障之前,用户在**f节点上执行了一个应用进程,然后才发生了服务器**f的重启,

我们有理由怀疑因为这个应用进程导致了上面的第3个情况,因此CRS把节点**f驱逐了,最后发生了**f节点的重启操作。

 

节点**g的日志

ocssd.log

    [    CSSD]2010-04-02 21:36:55.123 [3858] >TRACE:   clssgmReconfigThread:  completed for reconfig(2), with status(1)
[    CSSD]2010-04-22 08:40:39.079 [1801] >WARNING: clssnmeventhndlr: Receive failure with node 2 (**f), state 3, con(11085fbf0), probe(0), rc=11
[    CSSD]2010-04-22 08:40:39.079 [1801] >TRACE:   clssnmDiscHelper: **f, node(2) connection failed, con (11085fbf0), probe(0)
[    CSSD]2010-04-22 08:40:39.079 [2829] >TRACE:   clssgmPeerDeactivate: node 2(**f), death 0, state 0x1 connstate 0xf
[    CSSD]2010-04-22 08:40:53.819 [3086] >WARNING: clssnmPollingThread: node ma0
2 (2) at 50 2.481040e-265artbeat fatal, eviction in 14.634 seconds
[    CSSD]2010-04-22 08:40:53.819 [3086] >TRACE:   clssnmPollingThread: node ma0
2 (2) is impending reconfig, flag 1, misstime 15366
[    CSSD]2010-04-22 08:40:53.819 [3086] >TRACE:   clssnmPollingThread: diskTime
out set to (27000)ms impending reconfig status(1)
[    CSSD]2010-04-22 08:40:54.819 [3086] >WARNING: clssnmPollingThread: node ma0
2 (2) at 50 2.481040e-265artbeat fatal, eviction in 13.633 seconds
[    CSSD]2010-04-22 08:41:01.821 [3086] >WARNING: clssnmPollingThread: node ma0
2 (2) at 75 2.481040e-265artbeat fatal, eviction in 6.632 seconds
[    CSSD]2010-04-22 08:41:05.821 [3086] >WARNING: clssnmPollingThread: node ma0
2 (2) at 90 2.481040e-265artbeat fatal, eviction in 2.632 seconds
[    CSSD]2010-04-22 08:41:06.821 [3086] >WARNING: clssnmPollingThread: node ma0
2 (2) at 90 2.481040e-265artbeat fatal, eviction in 1.632 seconds
[    CSSD]2010-04-22 08:41:07.821 [3086] >WARNING: clssnmPollingThread: node ma0
2 (2) at 90 2.481040e-265artbeat fatal, eviction in 0.632 seconds
[    CSSD]2010-04-22 08:41:08.454 [3086] >TRACE:   clssnmPollingThread: Eviction started for node **f (2), flags 0x0001, state 3, wt4c 0
[    CSSD]2010-04-22 08:41:08.454 [3600] >TRACE:   clssnmDoSyncUpdate: Initiating sync 3

      

cssd.log


2010-04-22 08:41:08.492: [ CRSCOMM][11592]32CLEANUP: Searching for connections to failed node **f
2010-04-22 08:41:08.511: [  CRSEVT][11592]32Processing member leave for **f, incarnation: 3
2010-04-22 08:41:08.512: [    CRSD][11592]32SM: recovery in process: 8
2010-04-22 08:41:08.512: [  CRSEVT][11592]32Do failover for: **f
2010-04-22 08:41:08.556: [  OCRSRV][11335 s_update_remote_cache_int: FAILED TO RCV ACK FROM node 2 retcode 7
2010-04-22 08:41:08.697: [  CRSRES][11592]32 startup = 0
2010-04-22 08:41:08.708: [  CRSRES][11592]32 startup = 0
2010-04-22 08:41:08.714: [  CRSRES][11592]32 startup = 0
2010-04-22 08:41:08.720: [  CRSRES][11592]32 startup = 0
2010-04-22 08:41:08.725: [  CRSRES][11592]32 startup = 0
2010-04-22 08:41:08.732: [  CRSRES][11592]32 startup = 0
2010-04-22 08:41:08.790: [  CRSRES][11850]32startRunnable: setting CLI values
2010-04-22 08:41:08.792: [  CRSRES][11850]32Attempting to start `ora.**f.vip` on member `**g`
2010-04-22 08:41:10.500: [  CRSRES][11850]32Start of `ora.**f.vip` on member `**g` succeeded.
2010-04-22 08:41:10.514: [  CRSEVT][11592]32Post recovery done evmd event for: **f
2010-04-22 08:41:10.515: [    CRSD][11592]32SM: recoveryDone: 0
2010-04-22 08:41:10.516: [  CRSEVT][11349]32Processing RecoveryDone
2010-04-22 08:53:08.060: [    CRSD][11427]32SM: Recovery processing done.0
2010-04-22 08:53:08.060: [    CRSD][11427]32SM: Ongoing commands done.0
2010-04-22 08:53:08.074: [    CRSD][11427]32SM: asked E2E to exit.
2010-04-22 08:53:08.074: [ CRSCOMM][10292]32CLEANUP: quiesceing connection 11434a8d0
2010-04-22 08:53:08.075: [ CRSMAIN][10292]32runCommandServer for E2E exiting.
2010-04-22 08:53:08.075: [    CRSD][11427]32SM: E2E port closed.3
2010-04-22 08:53:08.075: [    CRSD][11427]32SM: stoppedResources: all =1 m_state= 3
2010-04-22 08:53:08.110: [  CRSRES][11685]32StopResource: setting CLI values
2010-04-22 08:53:08.115: [  CRSRES][12968]32StopResource: setting CLI values

 

 

故障分析详细过程

根据日志时间排序如下所示

[    CSSD]2010-04-22 08:40:39.079 [1801] >WARNING: clssnmeventhndlr: Receive failure with node 2 (**f), state 3, con(11085fbf0), probe(0), rc=11”------à “[CSSD]2010-04-22 08:40:53.819 [3086] >WARNING: clssnmPollingThread: node **f (2) at 50 2.481040e-265artbeat fatal, eviction in 14.634 seconds”-------à“2010-04-22 08:41:08.492: [ CRSCOMM][11592]32CLEANUP: Searching for connections to failed node **f”--------------> 2010-04-22 08:41:08.556: [  OCRSRV][11335 s_update_remote_cache_int: FAILED TO RCV ACK FROM node 2 retcode 7

 

从上面的日志中可以发现,因为网络的连接失败,导致了RAC**f节点的驱逐操作。

oracle RAC节点驱逐原因分三种,《参照oracle文档(Doc ID 559365.1)

1Node is not pinging via the network heartbeat

2Node is not pinging the Voting disk

3Node is hung/busy and is unable to perform either of the earlier tasks

因为发生故障之前,用户在**f节点上执行了一个应用进程,然后才发生了服务器**f的重启,

我们有理由怀疑因为这个应用进程导致了上面的第3个情况,因此CRS把节点**f驱逐了,最后发生了**f节点的重启操作。

 

||||||||||||||

Document TitleUsing Diagwait as a diagnostic to get more information

for diagnosing Oracle Clusterware Node evictions (Doc ID 559365.1)

Modified Date LabelModifiedModified Date09-FEB-2010Document Type LabelTypeDocument TypePROBLEM

In this Document

  Symptoms

  Changes

  Cause

  Solution

  Unsetting/Removing diagwait

  References

--------------------------------------------------------------------------------

Applies to:

Oracle Server - Standard Edition - Version: 10.1.0.5 to 11.2.0.1 - Release: to 11.2

Oracle Server - Enterprise Edition - Version: 10.1.0.5 to 11.2.0.1.0 - Release: to 11.2

Linux x86

HP-UX PA-RISC (64-bit)

IBM AIX on POWER Systems (64-bit)

Sun Solaris SPARC (64-bit)

HP-UX Itanium

Red Hat Enterprise Linux Advanced Server x86-64 (AMD Opteron Architecture)

Red Hat Enterprise Linux Advanced Server Itanium

Sun Solaris x86-64 (64-bit)

Linux x86-64

UnitedLinux Itanium

Oracle Server Enterprise Edition - Version: 10.1.0.5 to 11.1.0.7

Oracle Clusterware

Symptoms

Oracle Clusterware evicts the node from the cluster when

Node is not pinging via the network heartbeat

Node is not pinging the Voting disk

Node is hung/busy and is unable to perform either of the earlier tasks

In Most cases when the node is evicted, there is information written to the logs to analyze the cause of the node eviction. However in certain cases this may be missing, the steps documented in this note are to be used for those cases where there is not enough information or no information to diagnose the cause of the eviction for Clusterware versions less than 11gR2 (11.2.0.1).

Starting with 11.2.0.1, Customers do not need to set diagwait as the architecture has been changed.

Changes

None

Cause

When the node is evicted and the node is extremely busy in terms of CPU (or lack of it) it is possible that the OS did  not get time to flush the logs/traces to the file system. It may be useful to set diagwait attribute to delay the node reboot to give additional time to the OS to write the traces. This setting will provide more time for diagnostic data to be collected by safely and will NOT increase probability of corruption. After setting diagwait, the Clusterware will wait an additional 10 seconds (Diagwait - reboottime). Customers can unset diagwait by following the steps documented below after fixing their OS scheduling issues.

* -- Diagwait can be set on windows but it does not change the behaviour as it does on Unix-Linux platforms

@ For internal Support Staff

Diagwait attribute was introduced in 10.2.0.3 and is included in 10.2.0.4 & 11.1.0.6 and higher releases. It has also been subsequently backported to 10.1.0.5 on most platforms. This means it is possible to set diagwait on 10.1.0.5 (or higher), 10.2.0.3 (or higher) and in 11.1.0.6 (or higher). If the command crsctl set/get css diagwait reports "unrecognized parameter diagwait specified" then it can be safely assumed that the Clusterware version does not the necessary fixes to implement diagwait. If that is the case then customer is adviced to apply the latest patchset available before attempting to set diagwait

Solution

It is important that the clusterware stack must be down on all the nodes when changing diagwait .The following steps provides the step-by-step instructions on setting diagwait.

Execute as root

#crsctl stop crs

#<CRS_HOME>/bin/oprocd stop

Ensure that Clusterware stack is down on all nodes by executing

#ps -ef |egrep "crsd.bin|ocssd.bin|evmd.bin|oprocd"This should return no processes. If there are clusterware processes running and you proceed to the next step, you will corrupt your OCR. Do not continue until the clusterware processes are down on all the nodes of the cluster.

From one node of the cluster, change the value of the "diagwait" parameter to 13 seconds by issuing the command as root:

#crsctl set css diagwait 13 -force

Check if diagwait is set successfully by executing. the following command. The command should return 13. If diagwait is not set, the following message will be returned "Configuration parameter diagwait is not defined"

#crsctl get css diagwait

Restart the Oracle Clusterware on all the nodes by executing:

#crsctl start crs

Validate that the node is running by executing:

#crsctl check crs

Unsetting/Removing diagwait

Customers should not unset diagwait without fixing the OS scheduling issues as that can lead to node evictions via reboot. Diagwait delays the node eviction (and reconfiguration) by diagwait (13) seconds and as such setting diagwait does not affect most customers.In case there is a need to remove diagwait, the above mentioned steps need to be followed except step 3 needs to be replaced by the following command

#crsctl unset css diagwait

References

NOTE:726833.1 - Hangcheck-Timer Module Requirements for Oracle 9i, 10g, and 11g RAC on Linux

   Related

  Products

--------------------------------------------------------------------------------

|||||||||||||||||||||||

 

 

故障解决建议

    因为发生故障时间很短,从日志中没有查到相关的进程信息,建议安装OSW来监控服务器信息。如果下次发现同样的问题,可以从OSW中抓取到具体进程信息。再来调试相关的出错进程。

 

 

 



你可能感兴趣的:(oracle,server,command,服务器,processing,2010)