Troubleshooting 11.2 Clusterware Node Evictions (Reboots) [ID 1050693.1]
Modified 29-JUL-2010, Type BULLETIN, Status PUBLISHED
In this Document
Purpose
Scope and Application
Troubleshooting 11.2 Clusterware Node Evictions (Reboots)
NODE EVICTION OVERVIEW
1.0 - PROCESS ROLES FOR REBOOTS
2.0 - DETERMINING WHICH PROCESS IS RESPONSIBLE FOR A REBOOT
3.0 - TROUBLESHOOTING OCSSD EVICTIONS
3.1 - COMMON CAUSES OF OCSSD EVICTIONS
3.2 - FILES TO REVIEW AND GATHER FOR OCSSD EVICTIONS
4.0 - TROUBLESHOOTING CSSDAGENT OR CSSDMONITOR EVICTIONS
4.1 - COMMON CAUSES OF CSSDAGENT OR CSSDMONITOR EVICTIONS
4.2 - FILES TO REVIEW AND GATHER FOR CSSDAGENT OR CSSDMONITOR EVICTIONS
References
Oracle Server - Enterprise Edition - Version: 11.2.0.1 to 11.2.0.2 - Release: 11.2 to 11.2
Information in this document applies to any platform.
This document provides a reference for troubleshooting 11.2 Clusterware node evictions. For Clusterware node evictions prior to 11.2, see Note 265769.1.
This document is intended for DBAs and support analysts experiencing Clusterware node evictions (reboots).
The Oracle Clusterware is designed to perform a node eviction by removing one or more nodes from the cluster if some critical problem is detected. A critical problem could be a node not responding via a network heartbeat, a node not responding via a disk heartbeat, a hung or severely degraded machine, or a hung ocssd.bin process. The purpose of this node eviction is to maintain the overall health of the cluster by removing bad members.
OCSSD (aka CSS daemon) - This process is spawned by the cssdagent process. It runs in both vendor clusterware and non-vendor clusterware environments. OCSSD's primary job is internode health monitoring and RDBMS instance endpoint discovery. The health monitoring includes a network heartbeat and a disk heartbeat (to the voting files). OCSSD can also evict a node after escalation of a member kill from a client (such as a database LMON process). This is a multi-threaded process that runs at an elevated priority and runs as the Oracle user.
Startup sequence: INIT --> init.ohasd --> ohasd --> ohasd.bin --> cssdagent --> ocssd --> ocssd.bin
CSSDAGENT - This process is spawned by OHASD and is responsible for spawning the OCSSD process, monitoring for node hangs (via oprocd functionality), monitoring the OCSSD process for hangs (via oclsomon functionality), and monitoring vendor clusterware (via vmon functionality). This is a multi-threaded process that runs at an elevated priority and runs as the root user.
Startup sequence: INIT --> init.ohasd --> ohasd --> ohasd.bin --> cssdagent
CSSDMONITOR - This process also monitors for node hangs (via oprocd functionality), monitors the OCSSD process for hangs (via oclsomon functionality), and monitors vendor clusterware (via vmon functionality). This is a multi-threaded process that runs at an elevated priority and runs as the root user.
Startup sequence: INIT --> init.ohasd --> ohasd --> ohasd.bin --> cssdmonitor
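The three startup sequences above can be written down as plain data, which makes it easier to see where the chains diverge. This is only an illustrative sketch: the names STARTUP_CHAINS, RUN_AS, parent_of and common_ancestor are hypothetical helpers, not part of any Oracle tool, and the data is copied from the descriptions in this section.

```python
# Startup chains copied from the "Startup sequence" lines in this section.
STARTUP_CHAINS = {
    "ocssd.bin": ["INIT", "init.ohasd", "ohasd", "ohasd.bin",
                  "cssdagent", "ocssd", "ocssd.bin"],
    "cssdagent": ["INIT", "init.ohasd", "ohasd", "ohasd.bin", "cssdagent"],
    "cssdmonitor": ["INIT", "init.ohasd", "ohasd", "ohasd.bin", "cssdmonitor"],
}

# Run-as user for each process, as stated in the process descriptions.
RUN_AS = {
    "ocssd.bin": "Oracle user",
    "cssdagent": "root",
    "cssdmonitor": "root",
}


def parent_of(process):
    """Return the step that directly precedes `process` in its startup chain."""
    return STARTUP_CHAINS[process][-2]


def common_ancestor(proc_a, proc_b):
    """Return the last startup step shared by two processes, or None."""
    shared = None
    for step_a, step_b in zip(STARTUP_CHAINS[proc_a], STARTUP_CHAINS[proc_b]):
        if step_a != step_b:
            break
        shared = step_a
    return shared


if __name__ == "__main__":
    for name in STARTUP_CHAINS:
        print(name, "runs as", RUN_AS[name], "and is started by", parent_of(name))
    print(common_ancestor("cssdagent", "cssdmonitor"))
```

Read this way, cssdagent and cssdmonitor are siblings under ohasd.bin, while ocssd.bin sits below cssdagent.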
Important files to review:
* Messages file locations:
Note that the diagcollection.pl script in <GRID_HOME>/bin can be used to obtain the <GRID_HOME>/log files.
11.2 Clusterware evictions should, in most cases, have some kind of meaningful error in the clusterware alert log. This can be used to determine which process is responsible for the reboot. Example message from a clusterware alert log:
[ohasd(11243)]CRS-8011:reboot advisory message from host: sta00129, component: cssagent, with timestamp: L-2009-05-05-10:03:25.340
[ohasd(11243)]CRS-8013:reboot advisory message text: Rebooting after limit 28500 exceeded; disk timeout 27630, network timeout 28500, last heartbeat from CSSD at epoch seconds 1241543005.340, 4294967295 milliseconds ago based on invariant clock value of 93235653
This particular eviction happened when the cssagent timed out heartbeating to the CSSD process (oclsomon functionality). Once you know the process (in the first message), the corresponding logs can be reviewed.
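When many alert logs have to be scanned, the host and component can be pulled out of the CRS-8011 message with a short script. The sketch below is not an Oracle utility. It assumes the message layout matches the single example quoted above (other releases or components may format it differently), and the function names parse_advisory and last_heartbeat_utc are hypothetical.

```python
import re
from datetime import datetime, timezone

# The two alert-log messages quoted above, each joined onto one line.
ADVISORY = ("[ohasd(11243)]CRS-8011:reboot advisory message from host: sta00129, "
            "component: cssagent, with timestamp: L-2009-05-05-10:03:25.340")
DETAIL = ("[ohasd(11243)]CRS-8013:reboot advisory message text: Rebooting after "
          "limit 28500 exceeded; disk timeout 27630, network timeout 28500, "
          "last heartbeat from CSSD at epoch seconds 1241543005.340, "
          "4294967295 milliseconds ago based on invariant clock value of 93235653")

# Patterns inferred from the single example above (an assumption, not a spec).
CRS_8011 = re.compile(
    r"CRS-8011:reboot advisory message from host:\s*(?P<host>[^,\s]+),\s*"
    r"component:\s*(?P<component>[^,\s]+),\s*with timestamp:\s*(?P<timestamp>\S+)"
)
CRS_8013_EPOCH = re.compile(r"CRS-8013:.*epoch seconds (?P<epoch>\d+(?:\.\d+)?)")


def parse_advisory(line):
    """Return host, component and timestamp from a CRS-8011 line, or None."""
    match = CRS_8011.search(line)
    return match.groupdict() if match else None


def last_heartbeat_utc(line):
    """Return the 'epoch seconds' value of a CRS-8013 line as a UTC datetime."""
    match = CRS_8013_EPOCH.search(line)
    if match is None:
        return None
    return datetime.fromtimestamp(float(match.group("epoch")), tz=timezone.utc)


if __name__ == "__main__":
    print(parse_advisory(ADVISORY))
    print(last_heartbeat_utc(DETAIL))
```

For the example messages, the component comes back as cssagent and the last heartbeat converts to 2009-05-05 17:03:25 UTC. That agrees with the 10:03:25 timestamp in the first message if the L- prefix denotes local time on a host seven hours behind UTC, which is an inference rather than something the messages state.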
More examples to come later...
If no message is in the evicted node's clusterware alert log, check the lastgasp logs on the local node and/or the clusterware alert logs of other nodes.
If you have encountered an OCSSD eviction, review the common causes in section 3.1 below. If the problem cannot be determined by reviewing the common causes, review and collect the data from section 3.2.
Files to review and gather for OCSSD evictions: all files from section 2.0, from all cluster nodes. More data may be required.
If you have encountered a CSSDAGENT or CSSDMONITOR eviction, review the common causes in section 4.1 below. If the problem cannot be determined by reviewing the common causes, review and collect the data from section 4.2.
Files to review and gather for CSSDAGENT or CSSDMONITOR evictions: all files from section 2.0, from all cluster nodes. More data may be required.
NOTE:265769.1 - Troubleshooting 10g and 11.1 Clusterware Reboots
NOTE:301137.1 - OS Watcher User Guide
NOTE:736752.1 - Introducing Cluster Health Monitor (IPD/OS)