How to Troubleshoot Grid Infrastructure Startup Issues [ID 1050908.1]
Modified: 25-JUN-2010, Type: HOWTO, Status: PUBLISHED
In this Document
Goal
Solution
Start up sequence:
Cluster status
Case 1: OHASD.BIN does not start
Case 2: OHASD Agents does not start
Case 3: CSSD.BIN does not start
Case 4: CRSD.BIN does not start
Case 5: GPNPD.BIN does not start
Case 6: Various other daemons does not start
Case 7: CRSD Agents does not start
Network and Naming Resolution Verification
Log File Location, Ownership and Permission
Network Socket File Location, Ownership and Permission
Diagnostic file collection
References
Oracle Server - Enterprise Edition - Version: 11.2.0.1 and later [Release: 11.2 and later]
Information in this document applies to any platform.
The goal of this note is to provide a reference for troubleshooting 11gR2 Grid Infrastructure clusterware startup issues. It applies to issues in both new environments (during root.sh or rootupgrade.sh) and unhealthy existing environments. For root.sh issues specifically, see Note 1053970.1.
In a nutshell, the operating system starts ohasd, ohasd starts agents that start the daemons (gipcd, mdnsd, gpnpd, ctssd, ocssd, crsd, evmd, asm, etc.), and crsd starts agents that start user resources (database, SCAN, listener, etc.).
For the detailed Grid Infrastructure clusterware startup sequence, refer to Note 1053147.1.
To find out cluster and daemon status:
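A minimal sketch of such a status check, assuming the standard 11gR2 crsctl commands "check crs" and "stat res -t -init", and assuming GRID_HOME points at the Grid Infrastructure home. The function name cluster_status is illustrative and not part of this note:

```shell
# Hypothetical helper: print overall stack health, then the state of the
# resources managed by ohasd. GRID_HOME is assumed to be the Grid home.
cluster_status() {
    if [ -z "${GRID_HOME:-}" ] || [ ! -x "$GRID_HOME/bin/crsctl" ]; then
        echo "crsctl not found; is GRID_HOME set correctly?" >&2
        return 1
    fi
    "$GRID_HOME/bin/crsctl" check crs           # overall clusterware health
    "$GRID_HOME/bin/crsctl" stat res -t -init   # daemons started by ohasd
}
```

Run it as the Grid owner or root, for example: GRID_HOME=/path/to/grid cluster_status (the path is a placeholder).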
As ohasd.bin is responsible for starting all other clusterware processes, directly or indirectly, it needs to start up properly for the rest of the stack to come up.
Automatic ohasd.bin start up depends on the following:
1. OS is at appropriate run level:
The OS needs to be at the specified run level before CRS will try to start up.
To find out at which run level the clusterware needs to come up, look at the run-level field of the init.ohasd entry in /etc/inittab.
For example, a run-level field of 35 in that entry means CRS is supposed to run at run levels 3 and 5; note that, depending on the platform, CRS comes up at different run levels.
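As a sketch of that lookup, the helper below prints the run-level field of the init.ohasd entry. It assumes the usual inittab layout of id:runlevels:action:process; the function name is hypothetical:

```shell
# Print the run-level field (2nd colon-separated field) of the init.ohasd
# entry in an inittab-style file, skipping commented-out lines.
ohasd_runlevels() {
    inittab="${1:-/etc/inittab}"
    awk -F: '/init\.ohasd/ && $1 !~ /^#/ { print $2 }' "$inittab"
}
```

Called without arguments it reads /etc/inittab; a result of 35 corresponds to run levels 3 and 5.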
To find out the current run level:
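One common way to read the current run level is "who -r". The sketch below, with hypothetical function names, extracts the run level from that output and checks it against a configured run-level string such as 35:

```shell
# Read "who -r" style output on stdin and print the value that follows
# the "run-level" keyword (its column position differs by platform).
parse_runlevel() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "run-level") { print $(i + 1); exit } }'
}

# Succeed if run level $1 is one of the characters in string $2.
runlevel_matches() {
    [ -n "$1" ] || return 1
    case "$2" in
        *"$1"*) return 0 ;;
        *)      return 1 ;;
    esac
}
```

For example: runlevel_matches "$(who -r | parse_runlevel)" 35 || echo "CRS will not start at this run level".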
2. "init.ohasd run" is up
On Linux/UNIX, "init.ohasd run" is configured in /etc/inittab, so process init (pid 1, /sbin/init on Linux, Solaris and HP-UX, /usr/sbin/init on AIX) will start "init.ohasd run" and respawn it if it fails. Without "init.ohasd run" up and running, ohasd.bin will not start:
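A minimal sketch of checking for that process, assuming "ps -ef" output is piped in. The function name is illustrative:

```shell
# Read "ps -ef" style output on stdin; succeed if an "init.ohasd run"
# process is listed. The [i] bracket keeps this grep's own command line
# from matching itself in the process list.
init_ohasd_listed() {
    grep -q '[i]nit\.ohasd run'
}
```

For example: ps -ef | init_ohasd_listed || echo "init.ohasd run is not up".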
3. Clusterware auto start is enabled - it's enabled by default
By default CRS is enabled for auto start upon node reboot; to enable it, use "crsctl enable crs".
To verify whether it's currently enabled, check the autostart file under SCRBASE.
SCRBASE is /etc/oracle/scls_scr on Linux and AIX, and /var/opt/oracle/scls_scr on HP-UX and Solaris.
Note: NEVER EDIT THE FILE MANUALLY, use "crsctl enable/disable crs" command instead.
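The read-only check below is a sketch that mirrors the "enable*" test used by the ohasd init script. The file location is an assumption (commonly $SCRBASE/<hostname>/root/ohasdstr) and should be verified on your system; "crsctl config crs" is the supported way to query the setting. The function name is hypothetical:

```shell
# Report the autostart setting recorded in the given file without
# modifying it. The caller supplies the path; its location is assumed.
autostart_state() {
    file="$1"
    if [ ! -r "$file" ]; then
        echo "unknown"
        return 2
    fi
    case "$(cat "$file")" in
        enable*) echo "enabled" ;;
        *)       echo "disabled" ;;
    esac
}
```

For example, on Linux: autostart_state /etc/oracle/scls_scr/$(hostname)/root/ohasdstr (path assumed, as noted above).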
4. syslogd is up and OS is able to execute init script S96ohasd
The OS may get stuck in some other Snn script while the node is coming up and thus never get the chance to execute S96ohasd; if that's the case, the message "Oracle HA daemon is enabled for autostart." will not be in the OS messages.
If you don't see that message, the other possibility is that syslogd (/usr/sbin/syslogd) is not fully up. Grid may fail to come up in that case as well. This may not apply to AIX.
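A sketch of searching the OS messages for that text follows. The default log path /var/log/messages is an assumption that fits typical Linux systems; other platforms log elsewhere. The function name is illustrative:

```shell
# Print how many lines in the messages file contain the autostart message.
# grep -c prints 0 (and exits non-zero) when nothing matches.
autostart_messages() {
    logfile="${1:-/var/log/messages}"
    grep -c 'Oracle HA daemon is enabled for autostart' "$logfile"
}
```

A count of 0 after a reboot points to one of the two possibilities described above.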
To find out whether the OS is able to execute S96ohasd while the node is coming up, modify the ohasd script:
From:
case `$CAT $AUTOSTARTFILE` in
enable*)
$LOGERR "Oracle HA daemon is enabled for autostart."
To:
case `$CAT $AUTOSTARTFILE` in
enable*)
/bin/touch /tmp/ohasd.start."`date`"
$LOGERR "Oracle HA daemon is enabled for autostart."
After a node reboot, if /tmp/ohasd.start.timestamp has not been created, the OS is stuck in some other Snn script. If /tmp/ohasd.start.timestamp exists but "Oracle HA daemon is enabled for autostart" is not in the messages, syslogd is likely not fully up. In both cases, you will need to engage the System Administrator to find the issue at the OS level. For the latter case, the workaround is to "sleep" for about 2 minutes; modify ohasd:
From:
case `$CAT $AUTOSTARTFILE` in
enable*)
$LOGERR "Oracle HA daemon is enabled for autostart."
To:
case `$CAT $AUTOSTARTFILE` in
enable*)
/bin/sleep 120
$LOGERR "Oracle HA daemon is enabled for autostart."
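The decision logic described above can be sketched as a small check run after the reboot. It assumes the /bin/touch marker from the earlier modification is in place; the default log path and the function name are assumptions:

```shell
# Classify an autostart failure from two observations:
#   $1 = directory holding the ohasd.start.* marker files (the note uses /tmp)
#   $2 = OS messages file to search (default path is an assumption)
diagnose_autostart() {
    marker_dir="${1:-/tmp}"
    logfile="${2:-/var/log/messages}"
    if ! ls "$marker_dir"/ohasd.start.* >/dev/null 2>&1; then
        echo "no marker: S96ohasd was not executed (OS stuck in another Snn script?)"
    elif ! grep -q 'Oracle HA daemon is enabled for autostart' "$logfile" 2>/dev/null; then
        echo "marker but no message: syslogd is likely not fully up"
    else
        echo "marker and message present: S96ohasd ran and syslogd logged it"
    fi
}
```

Markers and messages left over from earlier boots can mislead this check, so compare their timestamps against the reboot time before drawing a conclusion.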