Recently I encountered a situation in an Oracle RAC cluster whereby
files were accidentally deleted within the CRS_HOME on one of the nodes
resulting in a node failure. I discovered that the conventional method
of removing the node from the cluster and adding it back didn't work.
This document describes how to clean up a cluster in such a situation.
1. During or after the conventional Oracle method of removing a node
(as documented in Note:269320.1,
Removing a Node from a 10g RAC Cluster),
various errors might be encountered, such as:
[oracle@<working node name> bin]$ ./srvctl stop nodeapps -n <broken node name>
CRS-0216: Could not stop resource 'ora.<broken node name>.ons'.
CRS-0216: Could not stop resource 'ora.<broken node name>.vip'.
CRS-0216: Could not stop resource 'ora.<broken node name>.gsd'.
[oracle@<working node name> bin]$
[root@<working node name> bin]# ./srvctl remove nodeapps -n <broken node name>
Please confirm that you intend to remove the node-level applications on node <broken node name> (y/[n]) y
PRKO-2112 : Some or all node applications are not removed successfully on node: <broken node name>
2. However, according to the OCR, all information for the broken node has been removed; only resources for the working node are listed:
[oracle@<working node name> bin]$ ./crs_stat -u
NAME=ora.<working node name>.inst
TYPE=application
TARGET=ONLINE
STATE=ONLINE on <working node name>
NAME=ora.cmastage.db
TYPE=application
TARGET=ONLINE
STATE=ONLINE on <working node name>
NAME=ora.<working node name>.ASM1.asm
TYPE=application
TARGET=ONLINE
STATE=OFFLINE
NAME=ora.<working node name>.LISTENER_<WORKING_NODE_NAME>.lsnr
TYPE=application
TARGET=ONLINE
STATE=ONLINE on <working node name>
NAME=ora.<working node name>.gsd
TYPE=application
TARGET=ONLINE
STATE=ONLINE on <working node name>
NAME=ora.<working node name>.ons
TYPE=application
TARGET=ONLINE
STATE=UNKNOWN on <working node name>
NAME=ora.<working node name>.vip
TYPE=application
TARGET=ONLINE
STATE=ONLINE on <working node name>
3. If resources for the broken node still appear in the OCR,
they can be removed as follows:
$CRS_HOME/bin/crs_unregister <resource name>
(where resource name is acquired from the output of the crs_stat command as above)
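As a rough sketch of step 3, the resource names for the broken node can be picked out of the crs_stat output and turned into crs_unregister commands. The function names here are hypothetical, and the sketch assumes resource names carry the node name as a dot-delimited component (as in ora.<node name>.vip in the output above). It only prints the commands, so the list can be reviewed before anything is run.

```shell
# List resource names in crs_stat-style output that mention a given node.
# Reads the crs_stat output on stdin; $1 is the broken node's name.
list_node_resources() {
    node="$1"
    # Keep only NAME= lines, strip the prefix, then match the node name
    # as a dot-delimited component (e.g. ora.<node>.vip).
    sed -n 's/^NAME=//p' | grep -F ".${node}."
}

# Dry run: print the crs_unregister commands instead of executing them.
print_unregister_cmds() {
    list_node_resources "$1" | while read -r res; do
        echo "\$CRS_HOME/bin/crs_unregister $res"
    done
}
```

For example, `$CRS_HOME/bin/crs_stat -u | print_unregister_cmds <broken node name>` prints one command per matching resource.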
4. At this point it may appear that the procedure has completed and that the broken node
can be added back into the cluster using the standard add node procedure. However,
various unexpected errors might be encountered from here on. If so, this indicates
that the OCR might have become corrupted and will need to be re-initialised. This
requires an outage to the cluster and is detailed below.
5. Shut down the Oracle Clusterware stack on all the nodes by running crsctl stop crs as the root user.
6. Execute the following on all nodes:
<CRS_HOME>/install/rootdelete.sh
7. Execute the following on the node which is supposed to be the first node:
<CRS_HOME>/install/rootdeinstall.sh
8. The following commands should return nothing on any node:
ps -e | grep -i 'ocs[s]d'
ps -e | grep -i 'cr[s]d.bin'
ps -e | grep -i 'ev[m]d.bin'
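The three checks in step 8 could be folded into one helper, sketched below. The function name is hypothetical; the daemon names are the ones from the commands above. It reads a process listing on stdin, prints any Clusterware daemons it finds, and returns non-zero if any are still present.

```shell
# Report any Clusterware daemons found in a process listing read from stdin.
# Returns 0 when none are present, 1 otherwise (matching lines are printed).
stack_is_down() {
    # Daemon names are taken from the three checks in step 8.
    if grep -i -E 'ocssd|crsd\.bin|evmd\.bin'; then
        return 1
    fi
    return 0
}
```

A possible use on each node is `ps -e | stack_is_down && echo "Clusterware stack is down"`.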
9. Execute <CRS_HOME>/root.sh on first node
10. After successful root.sh execution on first node, execute root.sh on the rest of the nodes of the cluster.
11. The nodeapps might need to be added manually using the srvctl command as follows (as root user for each node):
[root@<working node name> bin]# ./srvctl add nodeapps -n <working node name> -o /u01/app/oracle/product/10.2/db_1 -A <working node name vip>/<netmask>/<device name>
(where <working node name vip> = hosts file entry for vip, or IP address, and <device name> = device name such as eth0)
12. Add the database to the OCR using the appropriate srvctl add database command as the user who owns the database.
Ensure that this is not run as the root user.
13. Add the ASM, database, instance and service resources using the appropriate srvctl add commands.
14. Add the listener using netca. This may give errors if listener.ora already contains the entries.
If this is the case, move listener.ora to /tmp from $ORACLE_HOME/network/admin, or from the
$TNS_ADMIN directory if the TNS_ADMIN environment variable is defined, and then run netca.
Re-add all the listeners that existed previously.
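The listener.ora move in step 14 might be scripted along these lines. This is a minimal sketch: the function name, the optional backup-directory argument and the timestamp suffix on the moved file are assumptions of the example, not part of the Oracle procedure.

```shell
# Move an existing listener.ora out of the way before running netca.
# Uses $TNS_ADMIN when it is set, otherwise $ORACLE_HOME/network/admin.
# The optional argument (backup directory) is an assumption of this sketch;
# it defaults to /tmp as in step 14.
stash_listener_ora() {
    admin_dir="${TNS_ADMIN:-$ORACLE_HOME/network/admin}"
    backup_dir="${1:-/tmp}"
    if [ -f "$admin_dir/listener.ora" ]; then
        # A timestamp suffix avoids overwriting an earlier copy.
        mv "$admin_dir/listener.ora" \
           "$backup_dir/listener.ora.$(date +%Y%m%d%H%M%S)"
    else
        echo "no listener.ora in $admin_dir" >&2
        return 1
    fi
}
```

Run it as the Oracle software owner before starting netca; the moved file can be consulted afterwards to see which listeners need to be re-created.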
References:
Removing a Node from a 10g RAC Cluster. Note:269320.1 (Oracle Metalink)
Re-initialising the OCR. Note:399482.1 (Oracle Metalink)