Oracle Clusterware Cannot Start on all Nodes: Network communication with node missing for 90% of timeout interval (Doc ID 1507482.1)

In this Document

  Purpose
  Troubleshooting Steps
  Step 1. Basic connectivity:
  Step 2. After basic connectivity is confirmed, perform advanced connectivity checks:
  Step 3. Known bugs

APPLIES TO:

Oracle Database - Enterprise Edition - Version 11.2.0.1 to 11.2.0.4 [Release 11.2]
Information in this document applies to any platform.

PURPOSE

This note is a troubleshooting guide for the following situation: Oracle Clusterware cannot be started on all nodes at once. For example, in a two-node cluster, Oracle Clusterware on the second node will not start, or attempting to start the clusterware on the second node causes the first node's clusterware to shut down.

In the clusterware alert log ($GRID_HOME/log/<node_name>/alert<node_name>.log) of one or more nodes where Oracle Clusterware is started, the following messages are seen:

2012-07-14 19:24:18.420
[cssd(6192)]CRS-1612:Network communication with node racnode02 (2) missing for 50% of timeout interval.  Removal of this node from cluster in 14.500 seconds
2012-07-14 19:24:25.422
[cssd(6192)]CRS-1611:Network communication with node racnode02 (2) missing for 75% of timeout interval.  Removal of this node from cluster in 7.500 seconds
2012-07-14 19:24:30.424
[cssd(6192)]CRS-1610:Network communication with node racnode02 (2) missing for 90% of timeout interval.  Removal of this node from cluster in 2.500 seconds
2012-07-14 19:24:32.925
[cssd(6192)]CRS-1607:Node racnode02 is being evicted in cluster incarnation 179915229; details at (:CSSNM00007:) in /u01/app/gridhome/log/racnode01/cssd/ocssd.log.

 

In the clusterware alert log ($GRID_HOME/log/<node_name>/alert<node_name>.log) of the evicted node(s), the following messages are seen:

2012-07-14 19:24:29.282
[cssd(8625)]CRS-1608:This node was evicted by node 1, racnode01; details at (:CSSNM00005:) in /u01/app/gridhome/log/racnode02/cssd/ocssd.log.
2012-07-14 19:24:29.282
[cssd(8625)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/gridhome/log/racnode02/cssd/ocssd.log

 

TROUBLESHOOTING STEPS

Oracle Clusterware cannot be up on two (or more) nodes simultaneously if those nodes cannot communicate with each other over the private interconnect.

The CRS-1612, CRS-1611, and CRS-1610 messages "Network communication with node NAME(n) missing for PCT% of timeout interval" are warnings that ocssd on that node cannot communicate with ocssd on the other node(s) over the interconnect. If this persists for the full timeout interval (usually thirty seconds - reference: Document 294430.1), Oracle Clusterware is designed to evict one of the nodes.
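
To confirm the timeout interval (the CSS misscount) in effect on your cluster, query CSS directly with crsctl, for example:

$GRID_HOME/bin/crsctl get css misscount

The value is reported in seconds; the default on Linux/Unix in 11.2 is 30.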

Therefore, the issue that requires troubleshooting in such a case is why the nodes cannot communicate over the interconnect.

 

Step 1. Basic connectivity:

Follow the steps in Note 1054902.1 to validate the network connectivity:
Note 1054902.1 - How to Validate Network and Name Resolution Setup for the Clusterware and RAC
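
As a quick first check before working through that note, verify that each node can reach every other node over the private interface. For example (racnode02-priv is a hypothetical private hostname; substitute your own):

ping -c 2 racnode02-priv
traceroute racnode02-priv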

 

Note: If the problem is intermittent, also perform the TCP/IP communication test from the following My Oracle Support document:

Note 1445075.1 - Node reboot or eviction: How to check if your private interconnect CRS can transmit network heartbeats
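
For an intermittent problem, it can also help to leave a timestamped ping loop running on each node against the remote private address until the problem reproduces; a simple sketch (racnode02-priv is again a hypothetical private hostname):

while true; do date; ping -c 1 -W 1 racnode02-priv; sleep 1; done >> /tmp/ping_priv.log 2>&1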

 

Step 2. After basic connectivity is confirmed, perform advanced connectivity checks:

1. Firewall

Any firewall must be turned off on the private network.

If you are unsure whether a firewall exists between the nodes, verify with a packet-capture tool such as ipmon or Wireshark.

Linux: Turn off iptables completely on all nodes and test:

service iptables stop

If the clusterware on all nodes can come up when iptables is turned off completely, but cannot come up when iptables is running, then the IP packet filter rules need to be adjusted to allow ALL traffic between the private interconnects of all the nodes.
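
As a sketch, the full test sequence on Linux (run as root on all nodes; the chkconfig step is optional and only prevents iptables from restarting at the next boot):

service iptables status     # show current filter rules, if any
service iptables stop       # stop the firewall for the test
chkconfig iptables off      # optional: keep it off across reboots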

 

2. Multicast

In 11.2.0.2 (only), multicast must be configured on either 230.0.1.0 or 224.0.0.251 for Clusterware startup. Follow the steps in Document 1212703.1 to check multicast communication.

Reference:  Grid Infrastructure 11.2.0.2 Installation or Upgrade may fail due to Multicasting Requirement (Doc ID 1212703.1)
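
Document 1212703.1 supplies a test script, mcasttest.pl, that checks multicast communication on both addresses. A typical invocation (assuming the script has been downloaded to the current directory, racnode01/racnode02 are your node names, and eth1 is the private interface on both nodes):

perl mcasttest.pl -n racnode01,racnode02 -i eth1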

 

3. Jumbo Frames Configuration

If Jumbo Frames are configured, check to make sure they are configured properly:

a) Check the MTU on the private interconnect interface(s) of each node:

/bin/netstat -in
Kernel Interface table
Iface        MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0       1500   0   203273      0      0      0     2727      0      0      0 BMRU

Note:  In the above example MTU is set to 1500 for eth0

b) If MTU > 1500 on any interface, follow the steps in Note 1085638.1 to check if Jumbo Frames are properly configured.
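
A quick end-to-end check is to send a non-fragmenting ping of near-MTU size across the interconnect. For an MTU of 9000, the ICMP payload is 9000 minus 28 bytes of IP/ICMP headers (racnode02-priv is a hypothetical private hostname):

ping -c 2 -M do -s 8972 racnode02-priv

If this fails while a standard ping succeeds, a switch or NIC in the path is not configured for Jumbo Frames.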

 

4. Third-party mDNS daemons running

HAIP uses mDNS. If any third-party mDNS daemons are running, such as avahi or bonjour, they can remove the HAIP addresses and prevent cluster communication. Make sure that no third-party mDNS daemons are running on any node.

Note 1501093.1 - CSSD Fails to Join the Cluster After Private Network Recovered if avahi Daemon is up and Running
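
For example, to check for and disable the avahi daemon on Linux (run as root on each node):

ps -ef | egrep -i 'avahi|mdns' | grep -v grep
service avahi-daemon stop
chkconfig avahi-daemon off

You can also confirm that the HAIP addresses (169.254.x.x) are still present on the private interface(s):

ifconfig -a | grep 169.254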

 

5. Advanced UDP checks

Please refer to the steps in the following document to check UDP communication over the interconnect:
Note 563566.1 - Troubleshooting gc block lost and Poor Network Performance in a RAC Environment
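
As a quick first pass before the full steps in that note, review the kernel's protocol statistics on each node:

netstat -s

Look for growing counts of "packet receive errors" in the UDP section and "reassembles failed" in the IP section, which commonly accompany lost interconnect packets.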

 

Step 3. Known bugs

After reviewing all of the above, if no problems were found, check the following known issues:

Document 1488378.1 - List of gipc defects that prevent GI from starting/joining after private network is restored or node rebooted

