Troubleshooting 10g and 11.1 Clusterware Reboots (Doc ID 265769.1)

In this Document

  Purpose
  Troubleshooting Steps
  1.0 - PROCESS ROLES FOR REBOOTS
  2.0 - DETERMINING WHICH PROCESS IS RESPONSIBLE FOR A REBOOT
  3.0 - TROUBLESHOOTING OCSSD REBOOTS
  3.1 - COMMON CAUSES OF OCSSD REBOOTS
  3.2 - FILES TO REVIEW AND GATHER FOR OCSSD REBOOTS
  4.0 - TROUBLESHOOTING OPROCD REBOOTS
  4.1 - COMMON CAUSES OF OPROCD REBOOTS
  4.2 - FILES TO REVIEW AND GATHER FOR OPROCD REBOOTS
  5.0 - TROUBLESHOOTING OCLSOMON REBOOTS
  5.1 - COMMON CAUSES OF OCLSOMON REBOOTS
  5.2 - FILES TO REVIEW AND GATHER FOR OCLSOMON REBOOTS
  References

APPLIES TO:

Oracle Database - Enterprise Edition - Version 10.1.0.5 to 11.1.0.7 [Release 10.1 to 11.1]
Information in this document applies to any platform.

PURPOSE


This document is intended for DBAs and support analysts experiencing 10g or 11.1 Clusterware reboots. For 11.2 and above, see Note 1050693.1.

TROUBLESHOOTING STEPS

10g or 11.1 RAC: TROUBLESHOOTING CRS REBOOTS
-- For 11.2 and above, see Note: 1050693.1
-------------------------------------

A node in a RAC cluster will reboot if there is an ocssd.bin problem or failure, if 
the oprocd daemon detects a scheduling problem, or if some other fatal problem 
occurs. This functionality is used for I/O fencing: it ensures that writes from 
I/O-capable clients can be cleared, avoiding potential corruption scenarios in the 
event of a network split, node hang, or some other fatal event. 

1.0 - PROCESS ROLES FOR REBOOTS


OCSSD (aka CSS daemon) - This process is spawned in init.cssd. It runs in both 
vendor clusterware and non-vendor clusterware environments and is armed with a 
node kill via the init script. OCSSD's primary job is internode health monitoring 
and RDBMS instance endpoint discovery. It runs as the Oracle user.

PS Output:
oracle 686 0.0 0.2 32072 16608 ? S 11:42:42 0:12 /oracle/10g/crs/bin/ocssd.bin

INIT.CSSD - In a normal environment, init spawns init.cssd, which in turn spawns 
OCSSD as a child. If ocssd dies or is killed, the node kill functionality of the 
init script will kill the node. If the script is killed, its ocssd survives and 
continues operating. However init has been instructed to respawn init.cssd via 
inittab. When it does so, the second init.cssd will attempt to start its own ocssd. 
That ocssd starts up, finds that its endpoint is owned by the first ocssd, fails, 
and then the 2nd init.cssd kills the node. 

PS Output:
root 635 0.0 0.0 1120 840 ? S 11:41:41 0:00 /bin/sh /etc/init.d/init.cssd fatal

OPROCD - This process is spawned in any non-vendor clusterware environment, with two 
exceptions: Windows, where Oracle uses a kernel driver to perform the same actions, 
and Linux prior to version 10.2.0.4. It is spawned in init.cssd and runs as root. 
This daemon is used to detect hardware and driver freezes on the machine. If oprocd 
detects problems, it will kill the node via C code. If a machine were frozen for long 
enough that the other nodes evicted it from the cluster, it needs to kill itself to 
prevent any I/O from being reissued to the disk after the rest of the cluster has 
remastered locks.

PS Output:
root 684 0.0 0.0 2240 968 ? S 11:42:42 0:00 /oracle/10g/crs/bin/oprocd start -t 1000 -m 50

OCLSOMON (10.2.0.2 and above) - This process monitors the CSS daemon for hangs or 
scheduling issues and can reboot a node if there is a perceived hang. 
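
As a quick check of which of the processes described above are present on a node, 
the ps output can be scanned for their command lines. The following is a minimal 
sketch, not an Oracle-supplied tool. The function name and the patterns are 
illustrative assumptions; in particular, the pattern for OCLSOMON assumes that the 
binary path contains "/oclsomon", which this note does not show.

```python
import re

# Illustrative patterns (assumptions, not an Oracle interface). The OCSSD,
# INIT.CSSD and OPROCD patterns follow the PS output samples in this note;
# the OCLSOMON pattern is a guess at the binary name.
SIGNATURES = {
    "OCSSD": re.compile(r"/ocssd\.bin\b"),
    "INIT.CSSD": re.compile(r"init\.cssd fatal\b"),
    "OPROCD": re.compile(r"/oprocd\b"),
    "OCLSOMON": re.compile(r"/oclsomon"),
}


def find_reboot_daemons(ps_text):
    """Return {role: [matching ps lines]} for each role found in ps_text."""
    found = {}
    for line in ps_text.splitlines():
        for role, pattern in SIGNATURES.items():
            if pattern.search(line):
                found.setdefault(role, []).append(line.strip())
    return found


# Sample input: user, PID and command taken from the PS output above,
# with the middle columns abbreviated to "..." for brevity.
sample_ps = """\
oracle 686 ... /oracle/10g/crs/bin/ocssd.bin
root 635 ... /bin/sh /etc/init.d/init.cssd fatal
root 684 ... /oracle/10g/crs/bin/oprocd start -t 1000 -m 50
"""

for role, lines in sorted(find_reboot_daemons(sample_ps).items()):
    print(role, len(lines))
```

A role missing from the result is not necessarily a fault: as described above, 
OPROCD and OCLSOMON only run on certain versions and platforms.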

2.0 - DETERMINING WHICH PROCESS IS RESPONSIBLE FOR A REBOOT



* Messages file locations:
Sun: /var/adm/messages
HP-UX: /var/adm/syslog/syslog.log
Tru64: /var/adm/messages
Linux: /var/log/messages
IBM: /bin/errpt -a > messages.out

** CSS log locations:
11.1 and 10.2: <CRS_HOME>/log/<node name>/cssd
10.1: <CRS_HOME>/css/log

*** Oprocd log locations:
In /etc/oracle/oprocd or /var/opt/oracle/oprocd depending on version/platform.

Note that oprocd only runs when no vendor clusterware is running or on Linux with 10.2.0.4 and above.
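
The locations listed above can be captured in a small lookup so that a collection 
script picks the right path for each node. This is a minimal sketch; the function 
names and the lowercase platform keys are illustrative assumptions, and only the 
paths themselves come from this note.

```python
import posixpath

# Messages file per platform, as listed above.
MESSAGES_FILES = {
    "sun": "/var/adm/messages",
    "hp-ux": "/var/adm/syslog/syslog.log",
    "tru64": "/var/adm/messages",
    "linux": "/var/log/messages",
}

# For IBM the note gives a command rather than a file, so it is kept apart.
MESSAGES_COMMANDS = {
    "ibm": "/bin/errpt -a > messages.out",
}


def messages_source(platform):
    """Return ("file", path) or ("command", command) for the platform."""
    key = platform.lower()
    if key in MESSAGES_FILES:
        return ("file", MESSAGES_FILES[key])
    if key in MESSAGES_COMMANDS:
        return ("command", MESSAGES_COMMANDS[key])
    raise ValueError("no messages location listed for platform %r" % platform)


def css_log_dir(crs_home, version, node_name):
    """Return the CSS log directory for a 10.1, 10.2 or 11.1 CRS home."""
    release = tuple(int(part) for part in version.split(".")[:2])
    if release == (10, 1):
        return posixpath.join(crs_home, "css", "log")
    if release in ((10, 2), (11, 1)):
        return posixpath.join(crs_home, "log", node_name, "cssd")
    raise ValueError("version %s is outside the scope of this note" % version)


print(messages_source("Linux"))
print(css_log_dir("/oracle/10g/crs", "10.2.0.4", "node1"))
```

The oprocd log directory is left out on purpose, because the note only states that 
it is one of two locations depending on version and platform.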

 

3.0 - TROUBLESHOOTING OCSSD REBOOTS


If you have encountered an OCSSD reboot, review common causes in section 3.1 below. 
If the problem cannot be determined by reviewing the common causes, review and 
collect the data from section 3.2.

3.1 - COMMON CAUSES OF OCSSD REBOOTS


- Network failure or latency between nodes. It would take at least 30 consecutive
missed checkins to cause a reboot, where heartbeats are issued once per second.

Example of missed checkins in the CSS log:

WARNING: clssnmPollingThread: node <node> (1) at 50% heartbeat fatal, eviction in 29.100 seconds
WARNING: clssnmPollingThread: node <node> (1) at 75% heartbeat fatal, eviction in 14.960 seconds
WARNING: clssnmPollingThread: node <node> (1) at 75% heartbeat fatal, eviction in 13.950 seconds

The first thing to do is find out if the missed checkins ARE the problem or are a 
result of the node going down due to other reasons. Check the messages file to see 
what exact time the node went down and compare it to the time of the missed checkins.

- If the reboot time in the messages file is earlier than the missed checkin time, 
then the node eviction was likely not due to these missed checkins.

- If the reboot time in the messages file is later than the missed checkin time, 
then the node eviction was likely a result of the missed checkins.


- Problems writing to or reading from the CSS voting disk. 

Example of a voting disk problem in the CSS log:

ERROR: clssnmDiskPingMonitorThread: voting device access hanging (160008 miliseconds)

- Lack of CPU resources. Some situations appear to be missed heartbeat issues but 
turn out to be caused by a sustained high load average on the machine. When a 
machine is too heavily loaded, scheduling becomes unreliable, so CSS may not get 
scheduled in time to do its work. If this happens, the node is declared not 
viable for cluster work and is evicted. 

- A problem with the executables (for example, removing CRS Home files)

- Misconfiguration of CRS. Possible misconfigurations:

- Wrong network selected as the private network for CRS (confirm with CSS log, 
/etc/hosts, and ifconfig output). Make sure it is not the public or VIP 
address. Look in the CSS log for strings like...
clsc_listen: (*) Listening on 
(ADDRESS=(PROTOCOL=tcp)(HOST=dlsun2046)(PORT=61196))

- Putting the CSS vote file on a Netapp that's shared over some kind of public 
network or otherwise excessively loaded/unreliable network. If this is the 
case, you are likely to see the following message in the CSS logfile:

ERROR: clssnmDiskPingThread(): Large disk IO timeout * seconds.

If you ever see this error, then it's important to investigate why the disk 
subsystem is unresponsive.

See section 3.2 for information on how to correct common misconfiguration
problems.

- Killing the "init.cssd fatal" process or "ocssd" process.

- An unexpected failure of the OCSSD process. This can be caused by any of the 
above issues.

- An Oracle bug. Known bugs that can cause CSS reboots:


Note 264699.1 - CSS Fails to Flush Writes After Installing 10.1.0.2 CRS on Linux with OCFS
Workaround: Put OCR and CSS Voting files on raw devices
Fixed in OCFS 1.0.11 and above.

Bug 3942568 - A deadlock can occur between 2 threads of the CSS daemon process.
Fixed in 10.1.0.4 and above.




SOLARIS ONLY: See these bugids that fixed the problem (in Solaris 9; the fixes were 
backported to Solaris 8 Update 6): 
Bug 4308370 cond_timedwait(), sigtimedwait(), poll() and /proc time out too soon 
Bug 4391799 Fix for BugID 4308370 causes timeout failures when system time is reset 
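
The timestamp comparison described above for missed checkins can be sketched as 
follows. This is a minimal illustration, not an Oracle utility. It assumes that 
each CSS log line carries a "YYYY-MM-DD HH:MM:SS" timestamp somewhere before the 
warning text, which the example lines in this note do not show; the node name 
racnode2, the timestamps, and the function names are all hypothetical.

```python
import re
from datetime import datetime

# Assumed timestamp layout on each CSS log line (not shown in this note).
TS_RE = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")

# Matches the missed checkin warnings quoted in section 3.1.
WARN_RE = re.compile(
    r"clssnmPollingThread: node (\S+) \((\d+)\) at (\d+)% heartbeat fatal, "
    r"eviction in ([\d.]+) seconds"
)


def missed_checkins(css_log_text):
    """Return one dict per missed checkin warning that has a timestamp."""
    events = []
    for line in css_log_text.splitlines():
        warn = WARN_RE.search(line)
        stamp = TS_RE.search(line)
        if warn and stamp:
            events.append({
                "time": datetime.strptime(stamp.group(1), "%Y-%m-%d %H:%M:%S"),
                "node": warn.group(1),
                "percent": int(warn.group(3)),
                "eviction_in": float(warn.group(4)),
            })
    return events


def classify_reboot(reboot_time, events):
    """Apply the earlier/later rule from section 3.1 to the first warning."""
    if not events:
        return "no missed checkins logged"
    first_warning = min(event["time"] for event in events)
    if reboot_time < first_warning:
        return "reboot preceded missed checkins: likely not the cause"
    return "reboot followed missed checkins: likely the cause"


# Hypothetical log excerpt; the message text follows the examples above.
sample_log = """\
[    CSSD]2009-03-10 14:22:01.100 >WARNING: clssnmPollingThread: node racnode2 (1) at 50% heartbeat fatal, eviction in 29.100 seconds
[    CSSD]2009-03-10 14:22:15.240 >WARNING: clssnmPollingThread: node racnode2 (1) at 75% heartbeat fatal, eviction in 14.960 seconds
"""

events = missed_checkins(sample_log)
print(classify_reboot(datetime(2009, 3, 10, 14, 22, 31), events))
```

The reboot time has to be read from the messages file by hand or by a separate 
parser, since its format differs per platform. Clocks on the nodes should also be 
in sync before the two times are compared.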


3.2 - FILES TO REVIEW AND GATHER FOR OCSSD REBOOTS


If logging a service request, please provide ALL of the following files to Oracle 
Support if possible:

- All the files in the following directories from all nodes. 

For 10.2 and above, all files under:

<CRS_HOME>/log

Recommended method for gathering these for each node would be to run the 
diagcollection.pl script.

For 10.1:

<CRS_HOME>/crs/log
<CRS_HOME>/crs/init 
<CRS_HOME>/css/log
<CRS_HOME>/css/init 
<CRS_HOME>/evm/log
<CRS_HOME>/evm/init 
<CRS_HOME>/srvm/log

Recommended method for gathering these for each node:

cd <CRS_HOME>
tar cf crs.tar crs/init crs/log css/init css/log evm/init evm/log srvm/log

- Messages or Syslog from all nodes from the time of the problem:

Sun: /var/adm/messages
HP-UX: /var/adm/syslog/syslog.log
Tru64: /var/adm/messages
Linux: /var/log/messages
IBM: /bin/errpt -a > messages.out

- If a core file was written, it would be useful to obtain a stack trace of the
core file using Note 1812.1 "TECH: Getting a Stack Trace from a CORE file on Unix".
Core files are usually written in one of the following directories:

<CRS_HOME>/crs/init
<CRS_HOME>/css/init
<CRS_HOME>/evm/init

You should also check all threads of the core file and get a stack trace for each.
Note 118252.1 has information on gathering multiple threads.

- OCR dump file - To get this, cd to <CRS_HOME>/bin as the root user and issue 
"ocrdump <unique filename>". This will generate two files (ocrdump.log and a
dump file with the name given for it). 

- 'opatch lsinventory -detail' output for the CRS home

- Back up the scls_scr directory, inittab, and hosts file for analysis with:

Sun, HP-UX, HP Tru64:

cd /
tar cf /var/backup/oraclecrs.tar var/opt/oracle/scls_scr etc/hosts etc/inittab

Linux, IBM-AIX:

cd /
tar cf /var/backup/oraclecrs.tar etc/oracle/scls_scr etc/hosts etc/inittab


- Ifconfig output from each node (ifconfig -a on unix platforms). 


- It would also be useful to get the following from each node leading up to the time
of the reboot:

- netstat -is (or equivalent)
- iostat -x (or equivalent)
- vmstat (or equivalent)
- ping -s (or equivalent) output of the private network 

There is a tool called "OS Watcher" that helps gather this information. This tool 
will dump netstat, vmstat, iostat, and other output at an interval and save x number 
of hours of archived data. For more information about this tool see Note 301137.1.
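
For 10.1, where diagcollection.pl is not mentioned, the tar step above fails partway 
if one of the listed directories does not exist on a node. The following is a 
minimal sketch of the same collection that skips missing directories and reports 
them. The function name is a hypothetical helper, and the directory list is the one 
given above for 10.1.

```python
import os
import tarfile

# Directories listed above for 10.1, relative to <CRS_HOME>.
DIRS_10_1 = [
    "crs/init", "crs/log",
    "css/init", "css/log",
    "evm/init", "evm/log",
    "srvm/log",
]


def gather_10_1_logs(crs_home, archive_path):
    """Archive the listed directories that exist under crs_home.

    Returns (added, missing) as two lists of relative directory names.
    """
    added, missing = [], []
    with tarfile.open(archive_path, "w") as tar:
        for rel in DIRS_10_1:
            full = os.path.join(crs_home, rel)
            if os.path.isdir(full):
                # arcname keeps paths relative, like "cd <CRS_HOME>; tar cf".
                tar.add(full, arcname=rel)
                added.append(rel)
            else:
                missing.append(rel)
    return added, missing
```

A call such as gather_10_1_logs("/oracle/10g/crs", "/tmp/crs.tar") would be run on 
each node; both paths here are examples only. The archive should be written outside 
the directories being collected.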

4.0 - TROUBLESHOOTING OPROCD REBOOTS


If you have encountered an OPROCD reboot, review common causes in section 4.1 below. 
If the problem cannot be determined by reviewing the common causes, review and 
collect the data from section 4.2.

4.1 - COMMON CAUSES OF OPROCD REBOOTS


- A problem detected by the OPROCD process. This can be caused by 4 things:

1) An OS scheduler problem.
2) The OS is getting locked up in a driver or hardware. 
3) Excessive amounts of load on the machine, thus preventing the scheduler from 
behaving reasonably.
4) An Oracle bug.

OPROCD Bugs Known to Cause Reboots:


Bug 5015469 - OPROCD may reboot the node whenever the system date is moved
backwards.
Fixed in 10.2.0.3+

Bug 4206159 - Oprocd is prone to time regression due to current API used (AIX only)
Fixed in 10.1.0.3 + One off patch for Bug 4206159.


Diagnostic Fixes (VERY NECESSARY IN MOST CASES):

Bug 5137401 - Oprocd logfile is cleared after a reboot
Fixed in 10.2.0.4+

Bug 5037858 - Increase the warning levels if a reboot is approaching 
Fixed in 10.2.0.3+
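
To see which of the fixes above a given CRS home lacks, its version can be compared 
against the "Fixed in" releases. This is a minimal sketch with hypothetical function 
names. It only considers the base release, so one-off patches are not detected and 
the result should be confirmed with 'opatch lsinventory -detail'. Bug 4206159 is 
left out because its fix depends on a one-off patch.

```python
# First release containing each fix, from the list above.
OPROCD_FIXES = {
    "5015469": (10, 2, 0, 3),  # reboot when system date moves backwards
    "5137401": (10, 2, 0, 4),  # oprocd logfile cleared after a reboot
    "5037858": (10, 2, 0, 3),  # more warnings as a reboot approaches
}


def parse_version(text):
    """Turn a dotted version such as "10.2.0.3" into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def missing_oprocd_fixes(crs_version):
    """Return the bug numbers whose fix release is newer than crs_version."""
    version = parse_version(crs_version)
    return sorted(bug for bug, fixed_in in OPROCD_FIXES.items()
                  if version < fixed_in)


for candidate in ("10.2.0.2", "10.2.0.3", "11.1.0.7"):
    print(candidate, missing_oprocd_fixes(candidate))
```

Tuple comparison is used so that, for example, 10.2.0.10 would sort after 10.2.0.4, 
which a plain string comparison would get wrong.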

4.2 - FILES TO REVIEW AND GATHER FOR OPROCD REBOOTS


If logging a service request, please provide ALL of the following files to Oracle 
Support if possible:

- Oprocd logs in /etc/oracle/oprocd or /var/opt/oracle/oprocd depending on version/platform.

- All the files in the following directories from all nodes. 

For 10.2 and above, all files under:

<CRS_HOME>/log

Recommended method for gathering these for each node would be to run the 
diagcollection.pl script.

For 10.1:

<CRS_HOME>/crs/log
<CRS_HOME>/crs/init 
<CRS_HOME>/css/log
<CRS_HOME>/css/init 
<CRS_HOME>/evm/log
<CRS_HOME>/evm/init 
<CRS_HOME>/srvm/log

Recommended method for gathering these for each node:

cd <CRS_HOME>
tar cf crs.tar crs/init crs/log css/init css/log evm/init evm/log srvm/log

- Messages or Syslog from all nodes from the time of the problem:

Sun: /var/adm/messages
HP-UX: /var/adm/syslog/syslog.log
Tru64: /var/adm/messages
Linux: /var/log/messages
IBM: /bin/errpt -a > messages.out

- 'opatch lsinventory -detail' output for the CRS home

- It would also be useful to get the following from each node leading up to the time
of the reboot:

- netstat -is (or equivalent)
- iostat -x (or equivalent)
- vmstat (or equivalent)

There is a tool called "OS Watcher" that helps gather this information. This tool 
will dump netstat, vmstat, iostat, and other output at an interval and save x number 
of hours of archived data. For more information about this tool see Note 301137.1.


5.0 - TROUBLESHOOTING OCLSOMON REBOOTS


If you have encountered an OCLSOMON reboot, review common causes in section 5.1 below. 
If the problem cannot be determined by reviewing the common causes, review and 
collect the data from section 5.2.

5.1 - COMMON CAUSES OF OCLSOMON REBOOTS


- A problem detected by the OCLSOMON process. This can be caused by 4 things:

1) One or more threads within the CSS daemon hung.
2) An OS scheduler problem.
3) Excessive amounts of load on the machine, thus preventing the scheduler from 
behaving reasonably.
4) An Oracle bug.

5.2 - FILES TO REVIEW AND GATHER FOR OCLSOMON REBOOTS


If logging a service request, please provide ALL of the following files to Oracle 
Support if possible:

- All the files in the following directories from all nodes. For a description of
these directories, see Note 259301.1 :

For 10.2, all files under:

<CRS_HOME>/log

Recommended method for gathering these for each node would be to run the 
diagcollection.pl script.

- Messages or Syslog from all nodes from the time of the problem:

Sun: /var/adm/messages
HP-UX: /var/adm/syslog/syslog.log
Tru64: /var/adm/messages
Linux: /var/log/messages
IBM: /bin/errpt -a > messages.out

- 'opatch lsinventory -detail' output for the CRS home

- It would also be useful to get the following from each node leading up to the time
of the reboot:

- netstat -is (or equivalent)
- iostat -x (or equivalent)
- vmstat (or equivalent)

There is a tool called "OS Watcher" that helps gather this information. This tool 
will dump netstat, vmstat, iostat, and other output at an interval and save x number 
of hours of archived data. For more information about this tool see Note 301137.1.


REFERENCES

NOTE:239989.1 - 10g RAC: Stopping Reboot Loops When CRS Problems Occur
NOTE:259301.1 - CRS and 10g/11.1 Real Application Clusters
NOTE:264699.1 - CSS Fails to Flush Writes After Installing 10.1.0.2 CRS on Linux with OCFS
NOTE:301137.1 - OSWatcher Black Box (Includes: [Video])
NOTE:1050693.1 - Troubleshooting 11.2 Clusterware Node Evictions (Reboots)
NOTE:110888.1 - How to Trace Unix System Calls
NOTE:118252.1 - How to Process an Express Core File Using dbx, dbg, dde, gdb or ladebug
NOTE:1812.1 - TECH: Getting a Stack Trace from a CORE file on Unix
NOTE:559365.1 - Using Diagwait as a diagnostic to get more information for diagnosing Oracle Clusterware Node evictions
NOTE:605449.1 - No Core or Stack Traces are Produced Upon Failure of ocssd.bin

BUG:3875098 - INSTANCE STAYED UP AFTER ONE NODE LOST CONNECTION TO VOTING DEVICE
BUG:3942568 - NODE CRASHING WHEN HEARTBEAT CABLE IS PULLED ON OTHER (MASTER) NODE
BUG:4206159 - OPROCD IS PRONE TO TIME REGRESSION DUE TO CURRENT API USED
BUG:5015469 - OPROCD REBOOTS NODE WHEN TIME IS SET BACK BY XNTPD
BUG:5137401 - OPROCD LOGFILE IS CLEARED AFTER A REBOOT
