APPLIES TO:
Oracle Database - Enterprise Edition - Version 10.2.0.5 and later
Information in this document applies to any platform.
SYMPTOMS
VIPs often go offline unexpectedly, with the following message in crsd.log:
2011-02-17 15:11:16.437: [ CRSAPP][11321]32CheckResource error for ora.node02.vip error code = 1
2011-02-17 15:11:16.441: [ CRSRES][11321]32In stateChanged, ora.node02.vip target is ONLINE
2011-02-17 15:11:16.441: [ CRSRES][11321]32ora.node02.vip on node02 went OFFLINE unexpectedly
To collect more detail, enable VIP tracing with the following commands:
#crsctl debug log res "ora.node01.vip:5"
#crsctl debug log res "ora.node02.vip:5"
The following error messages (note the "IsIfAlive: RX packets checked if=en1 failed" and "Interface en1 checked failed" lines) can be seen in the generated VIP trace "CRS_HOME/log/node02:
2011-02-18 15:32:39.481: [ RACG][1] [4587556][1][ora.node02.vip]: Fri Feb 18 15:32:37 GMT+08:00 2011 [ 8257768 ] About to execute command: /usr/sbin/ping -S 192.168.220.36 -c 1 -w 1 192.168.220.33
Fri Feb 18 15:32:39 GMT+08:00 2011 [ 8257768 ] IsIfAlive: RX packets checked if=en1 failed
2011-02-18 15:32:39.481: [ RACG][1] [4587556][1][ora.node02.vip]: Fri Feb 18 15:32:39 GMT+08:00 2011 [ 8257768 ] Interface en1 checked failed (host=node02)
Fri Feb 18 15:32:39 GMT+08:00 2011 [ 8257768 ] IsIfAlive: end for if=en1
Fri Feb 18 15:32:39 GMT+08:00 2011 [ 8257768 ] checkIf: end for if=en1
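To gauge how often the interface check fails, the failure lines shown above can be counted in a saved copy of the VIP trace. The following is a minimal sketch only: the function name count_if_failures is hypothetical, the trace file path must be supplied by you, and the patterns are taken from the excerpt above rather than from any documented trace format.

```shell
#!/bin/sh
# Hypothetical helper (not part of Oracle Clusterware): count the lines in a
# VIP trace file that report a failed interface check, using the two message
# forms seen in the excerpt above.
count_if_failures() {
  trace_file=$1
  # grep -c prints the number of matching lines (0 if there are none).
  grep -c -E 'IsIfAlive: RX packets checked if=[^ ]+ failed|Interface [^ ]+ checked failed' "$trace_file"
}
```

For example, `count_if_failures /path/to/saved_vip_trace.log` prints the number of failure lines in that file; the path here is a placeholder.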
You can reset the VIP tracing to the default level by using the following commands:
#crsctl debug log res "ora.node01.vip:0"
#crsctl debug log res "ora.node02.vip:0"
CAUSE
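The VIP check pings the gateway from the public IP with a 1-second timeout. If the network is slow and the reply does not arrive within that time, the check fails and the VIP is reported as offline.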
See "man ping" on AIX:
-S hostname/IP addr
Uses the IP address as the source address in outgoing ping packets.
-c Count
Specifies the number of echo requests, as indicated by the Count
variable, to be sent (and received).
-w timeout
This option works only with the -c option. It causes ping to wait
for a maximum of 'timeout' seconds for a reply (after sending the
last packet).
The following command therefore checks whether a single packet sent from 192.168.220.36 to 192.168.220.33 receives a reply within 1 second:
ping -S 192.168.220.36 -c 1 -w 1 192.168.220.33
Here, 192.168.220.36 is the public IP and 192.168.220.33 is the gateway.
If the network is slow, the reply to the above "ping" command takes longer than 1 second, the check fails, and the VIP goes offline unexpectedly and relocates to another node.
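To confirm that slow gateway replies are the cause, the same ping can be repeated by hand and the failures counted. This is a minimal sketch under stated assumptions: the function name count_ping_failures and the PING_CMD override variable are hypothetical, and the flags are the AIX ping options described above (other platforms may use different options).

```shell
#!/bin/sh
# Hypothetical helper: repeat the gateway ping shown above and count how many
# attempts fail (no reply within the timeout, or any other ping error).
# PING_CMD is an assumed override for the ping binary; the default is the
# /usr/sbin/ping path that appears in the trace.
count_ping_failures() {
  src=$1            # source address (the public IP)
  gw=$2             # destination address (the gateway)
  tries=$3          # number of single-packet pings to send
  timeout=${4:-1}   # seconds to wait for each reply (the check uses 1)
  ping_cmd=${PING_CMD:-/usr/sbin/ping}
  fails=0
  i=0
  while [ "$i" -lt "$tries" ]; do
    if ! "$ping_cmd" -S "$src" -c 1 -w "$timeout" "$gw" >/dev/null 2>&1; then
      fails=$((fails + 1))
    fi
    i=$((i + 1))
  done
  echo "$fails"
}
```

For example, `count_ping_failures 192.168.220.36 192.168.220.33 100` sends 100 single-packet pings with a 1-second timeout and prints how many failed. A non-zero result at the 1-second timeout that drops to zero with a longer timeout would be consistent with the cause described above.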
SOLUTION
To resolve the issue, ask your network administrator to tune the network so that the gateway replies to the ping command within 1 second.
If network performance cannot be improved, the following temporary workaround is available, although it is not recommended:
1. Stop all node applications (supply the node name after -n).
% srvctl stop nodeapps -n
2. Back up the racgvip script, then modify it.
Change:
# timeout of ping in number of loops (1 sec)
PING_TIMEOUT=" -c 1 -w 1"
To:
# timeout of ping in number of loops (3 sec)
PING_TIMEOUT=" -c 1 -w 3"
3. Start the node applications and other necessary resources (supply the node name after -n).
% srvctl start nodeapps -n
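The backup and edit in step 2 could be scripted roughly as follows. This is a sketch, not an Oracle-supplied procedure: the function name patch_ping_timeout and the ".bak" backup suffix are hypothetical, the path to racgvip is passed in by you, and the sed pattern assumes the PING_TIMEOUT line appears exactly as shown in step 2. Only the assignment is changed; the comment line above it is left as-is.

```shell
#!/bin/sh
# Hypothetical helper: back up a racgvip script and raise the ping timeout
# in its PING_TIMEOUT line from 1 second to the given value (default 3).
patch_ping_timeout() {
  script=$1
  new_timeout=${2:-3}
  backup="$script.bak"   # assumed backup name; choose your own convention

  [ -f "$script" ] || { echo "no such file: $script" >&2; return 1; }
  # Refuse to overwrite an earlier backup.
  [ ! -e "$backup" ] || { echo "backup already exists: $backup" >&2; return 1; }

  # Keep a copy with the original permissions and timestamps.
  cp -p "$script" "$backup" || return 1

  # Rewrite from the backup into the original file. Redirecting into the
  # existing file keeps its owner and mode, and avoids relying on "sed -i".
  sed "s/^PING_TIMEOUT=\" -c 1 -w 1\"/PING_TIMEOUT=\" -c 1 -w $new_timeout\"/" \
    "$backup" > "$script"
}
```

After running it, compare the script with its backup (for example with diff) to confirm that only the PING_TIMEOUT line changed before restarting the node applications in step 3.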