Notes on handling a RAC VIP failure

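This RAC VIP failure was caused mainly by a fault on ent3, the adapter carrying the VIP (an EtherChannel, i.e. an active/backup adapter bond), which made the VIP of node 1 fail over to node 2.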
$ crs_stat -t
Name           Type           Target    State     Host       
------------------------------------------------------------
ora....b1.inst application    ONLINE    ONLINE    crmdb01    
ora....b2.inst application    ONLINE    ONLINE    crmdb02    
ora....db2.srv application    ONLINE    ONLINE    crmdb02    
ora....srv1.cs application    ONLINE    ONLINE    crmdb02    
ora.crmdb.db   application    ONLINE    ONLINE    crmdb02    
ora....01.lsnr application    ONLINE    OFFLINE              
ora....b01.gsd application    ONLINE    ONLINE    crmdb01    
ora....b01.ons application    ONLINE    ONLINE    crmdb01    
ora....b01.vip application    ONLINE    ONLINE    crmdb02    
ora....02.lsnr application    ONLINE    ONLINE    crmdb02    
ora....b02.gsd application    ONLINE    ONLINE    crmdb02    
ora....b02.ons application    ONLINE    ONLINE    crmdb02    
ora....b02.vip application    ONLINE    ONLINE    crmdb02 
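
The table above can also be scanned by a script rather than by eye. The following is a minimal Python sketch, not an Oracle tool: the function names parse_crs_stat and find_anomalies are made up for illustration, and matching a VIP to its home node by the visible tail of the truncated resource name is a heuristic that assumes node names like the ones in this cluster.

```python
def parse_crs_stat(text):
    """Parse `crs_stat -t` rows into dicts; host is empty for OFFLINE resources."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Data rows start with "ora." and carry at least name, type, target, state.
        if len(parts) < 4 or not parts[0].startswith("ora."):
            continue
        host = parts[4] if len(parts) > 4 else ""
        rows.append({"name": parts[0], "target": parts[2],
                     "state": parts[3], "host": host})
    return rows


def find_anomalies(rows):
    """Return (name, reason) pairs for resources that look out of place."""
    found = []
    for r in rows:
        if r["target"] != r["state"]:
            found.append((r["name"],
                          "target %s but state %s" % (r["target"], r["state"])))
        elif r["name"].endswith(".vip"):
            # crs_stat -t truncates names, so only the tail of the
            # home node name is visible (heuristic, see the note above).
            node_tail = r["name"].split(".")[-2]
            if not r["host"].endswith(node_tail):
                found.append((r["name"], "running on %s" % r["host"]))
    return found


# Three rows copied from the output above.
SAMPLE = """\
ora....01.lsnr application    ONLINE    OFFLINE
ora....b01.vip application    ONLINE    ONLINE    crmdb02
ora....b02.vip application    ONLINE    ONLINE    crmdb02
"""

for name, reason in find_anomalies(parse_crs_stat(SAMPLE)):
    print(name, "->", reason)
```

On the sample rows this flags the offline listener of node 1 and the node 1 VIP that is running on crmdb02, which are the two symptoms of the failover.
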
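The fix itself is fairly simple: replace the faulty adapter and restart nodeapps on node 1, after which the VIP moves back from node 2 to node 1 automatically.
But this failure is also a good opportunity to dig a little deeper: what actually happens behind a RAC VIP failover?
When the failure occurred on node 1, some errors were visible at the operating system level: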
$ netstat -in
Name  Mtu   Network     Address            Ipkts Ierrs    Opkts Oerrs  Coll
en0   1500  link#2      0.11.25.be.50.e9  2364166277     0 1352130944   371     0
en0   1500  3.3.22      3.3.22.1          2364166277     0 1352130944   371     0
en3   1500  link#3      0.11.25.be.4d.41  3591277841     0 1817998840     5     0
en3   1500  130.36.23   130.36.23.8       3591277841     0 1817998840     5     0

lo0   16896 link#1                        1335635349     0 1335747477     0     0
lo0   16896 127         127.0.0.1         1335635349     0 1335747477     0     0
lo0   16896 ::1                           1335635349     0 1335747477     0     0
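
As a first look at the counters above, a short Python sketch can list the interfaces whose Oerrs column is non-zero. The function name output_errors is hypothetical, and the sketch assumes the AIX column layout shown here, where the Address column may be empty. The counters are cumulative, so a non-zero value only hints at where to look.

```python
def output_errors(netstat_text):
    """Return {interface: (Opkts, Oerrs)} for interfaces with Oerrs > 0."""
    result = {}
    for line in netstat_text.splitlines():
        parts = line.split()
        if len(parts) < 8 or parts[0] == "Name":
            continue
        # The Address column may be empty (lo0), so the counters are
        # taken from the right: Ipkts Ierrs Opkts Oerrs Coll.
        opkts, oerrs = int(parts[-3]), int(parts[-2])
        if oerrs > 0:
            result[parts[0]] = (opkts, oerrs)
    return result


# Rows copied from the netstat -in output above.
SAMPLE = """\
Name  Mtu   Network     Address            Ipkts Ierrs    Opkts Oerrs  Coll
en0   1500  link#2      0.11.25.be.50.e9  2364166277     0 1352130944   371     0
en0   1500  3.3.22      3.3.22.1          2364166277     0 1352130944   371     0
en3   1500  link#3      0.11.25.be.4d.41  3591277841     0 1817998840     5     0
en3   1500  130.36.23   130.36.23.8       3591277841     0 1817998840     5     0
lo0   16896 link#1                        1335635349     0 1335747477     0     0
"""

for name, (opkts, oerrs) in sorted(output_errors(SAMPLE).items()):
    print(name, "Opkts=%d Oerrs=%d" % (opkts, oerrs))
```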

$ errpt
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
173C787F   0416124011 I S topsvcs        Possible malfunction on local adapter
4FC185D1   0416124011 T H ent1           TRANSMIT FAILURE

173C787F   0416095911 I S topsvcs        Possible malfunction on local adapter
4FC185D1   0416095811 T H ent1           TRANSMIT FAILURE
4FC185D1   0416065011 T H ent1           TRANSMIT FAILURE
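
The errpt summary is easier to read once it is in chronological order. The sketch below is a Python illustration under two assumptions: the TIMESTAMP column uses the mmddhhmmyy layout (which agrees with the Date/Time field of the detailed records), and the helper name parse_errpt is invented for this example.

```python
from datetime import datetime


def parse_errpt(text):
    """Return errpt summary rows as (time, identifier, resource, description), oldest first."""
    entries = []
    for line in text.splitlines():
        parts = line.split(None, 5)
        if len(parts) < 6 or parts[0] == "IDENTIFIER":
            continue
        ident, stamp, _etype, _eclass, resource, desc = parts
        # Assumed layout: month, day, hour, minute, two-digit year.
        when = datetime.strptime(stamp, "%m%d%H%M%y")
        entries.append((when, ident, resource, desc.strip()))
    return sorted(entries)


# Rows copied from the errpt output above.
SAMPLE = """\
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
173C787F   0416124011 I S topsvcs        Possible malfunction on local adapter
4FC185D1   0416124011 T H ent1           TRANSMIT FAILURE
173C787F   0416095911 I S topsvcs        Possible malfunction on local adapter
4FC185D1   0416095811 T H ent1           TRANSMIT FAILURE
4FC185D1   0416065011 T H ent1           TRANSMIT FAILURE
"""

for when, ident, resource, desc in parse_errpt(SAMPLE):
    print(when.strftime("%Y-%m-%d %H:%M"), ident, resource, desc)
```

Read in this order, the ent1 transmit failures started hours before the VIP failed over at 12:40.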

The detailed error records are as follows:
$ errpt -a -j 4FC185D1|more
---------------------------------------------------------------------------
LABEL:          GOENT_TX_ERR
IDENTIFIER:     4FC185D1

Date/Time:       Sat Apr 16 12:40:04 BEIST 2011
Sequence Number: 10413
Machine Id:      00CE37F34C00
Node Id:         crmdb01
Class:           H
Type:            TEMP
Resource Name:   ent1           
Resource Class:  adapter
Resource Type:   14106802
Location:        U5791.001.99B18ND-P1-C06-T1
VPD:            
        Product Specific.(  ).......Gigabit Ethernet-SX PCI-X Adapter
        Part Number.................10N8586
        FRU Number..................10N8586
        EC Level....................D76267
        Manufacture ID..............YL1021
        Network Address.............001125BE4D41
        ROM Level.(alterable).......GOL021

Description
TRANSMIT FAILURE

        Recommended Actions
        PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
FILE NAME
line: 2187 file: goent_tx.c
PCI ETHERNET STATISTICS
0000 25C5 0063 081B 0000 0003 0000 0003 0000 0000 0000 0000 0000 0000 0000 00DA
0000 010C D192 B18E 0001 B2FA DD4E 1CFC 0000 0041 1C93 93A5 0000 0000 0031 20A1
0000 00EE 256D C53E 0002 3042 90A3 0EE5 0000 0000 0000 0000 0000 0001 0001 B321
0000 09DF 0000 0000 0000 0000 0000 01DF 0000 000F 0000 0205 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 BBA3 087C 0200 D400 4120 8000 01A0 0000 0000
0230 0156 0009 F007 0443 C808 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000
DEVICE DRIVER INTERNAL STATE
2222 2222 256D C53E 0000 00C8
SOURCE ADDRESS
0011 25BE 4D41
---------------------------------------------------------------------------
LABEL:          GOENT_TX_ERR
IDENTIFIER:     4FC185D1
$ errpt -a -j 173C787F|more
---------------------------------------------------------------------------
LABEL:          TS_LOC_DOWN_ST
IDENTIFIER:     173C787F

Date/Time:       Sat Apr 16 12:40:21 BEIST 2011
Sequence Number: 10414
Machine Id:      00CE37F34C00
Node Id:         crmdb01
Class:           S
Type:            INFO
Resource Name:   topsvcs        

Description
Possible malfunction on local adapter

Probable Causes
Local adapter mal-functioned
Local adapter lost connection to network
Local adapter mis-configured

Failure Causes
Local adapter mal-functioned
Local adapter lost connection to network
Local adapter mis-configured

        Recommended Actions
        Verify adapter configuration
        Verify network connectivity

Detail Data
DETECTING MODULE
rsct,nim_control.C,1.39.1.21,4983            
ERROR ID
6zV5DL.pqFeB/ThN//Ml.1....................
REFERENCE CODE
                                         
Adapter interface name
en3
Adapter offset
           0
Adapter IP address
130.36.23.8
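Since this was a hardware fault, we will not analyze the OS logs in detail. What we care about is what Oracle did at the moment of the failure.
When the failure occurred, racg was the first to detect that the VIP had failed. It ran the VIP check again (racgvip check crmdb01) and recorded the result in ora.crmdb01.vip.log: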
2011-04-16 12:40:13.049: [    RACG][1] [4276526][1][ora.crmdb01.vip]: Invalid parameters, or failed to bring up VIP (host=crmdb01)

2011-04-16 12:40:13.054: [    RACG][1] [4276526][1][ora.crmdb01.vip]: clsrcexecut: env ORACLE_CONFIG_HOME=/opt/oracle/product/10.2.0.4/crs

2011-04-16 12:40:13.054: [    RACG][1] [4276526][1][ora.crmdb01.vip]: clsrcexecut: cmd = /opt/oracle/product/10.2.0.4/crs/bin/racgeut -e _USR_ORA_DEBUG=0 54 /opt/oracle/product/10.2.0.4/crs/bin/racgvip check crmdb01

2011-04-16 12:40:13.054: [    RACG][1] [4276526][1][ora.crmdb01.vip]: clsrcexecut: rc = 1, time = 4.405s

2011-04-16 12:40:13.054: [    RACG][1] [4276526][1][ora.crmdb01.vip]: end for resource = ora.crmdb01.vip, action = check, status = 1, time = 4.572s
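Once the check had finished and confirmed the fault, the crs process carried out the VIP failover. The log shows that after crs found the VIP unexpectedly offline (OFFLINE unexpectedly),
it stopped the VIP and the listener on crmdb01, and relocated the resource ora.crmdb.crmsrv1.crmdb2.srv to crmdb02, i.e. node 2.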
2011-04-16 12:40:13.058: [  CRSAPP][11051]32CheckResource error for ora.crmdb01.vip error code = 1
2011-04-16 12:40:13.071: [  CRSRES][11051]32In stateChanged, ora.crmdb01.vip target is ONLINE
2011-04-16 12:40:13.072: [  CRSRES][11051]32ora.crmdb01.vip on crmdb01 went OFFLINE unexpectedly
2011-04-16 12:40:13.072: [  CRSRES][11051]32StopResource: setting CLI values
2011-04-16 12:40:13.086: [  CRSRES][11051]32Attempting to stop `ora.crmdb01.vip` on member `crmdb01`
2011-04-16 12:40:13.487: [  CRSRES][11312]32In stateChanged, ora.crmdb.crmsrv1.crmdb2.srv target is ONLINE
2011-04-16 12:40:13.487: [  CRSRES][11312]32ora.crmdb.crmsrv1.crmdb2.srv on crmdb01 went OFFLINE unexpectedly
2011-04-16 12:40:13.488: [  CRSRES][11312]32StopResource: setting CLI values
2011-04-16 12:40:13.520: [  CRSRES][11312]32Attempting to stop `ora.crmdb.crmsrv1.crmdb2.srv` on member `crmdb01`
2011-04-16 12:40:13.636: [  CRSRES][11051]32Stop of `ora.crmdb01.vip` on member `crmdb01` succeeded.
2011-04-16 12:40:13.636: [  CRSRES][11051]32ora.crmdb01.vip RESTART_COUNT=0 RESTART_ATTEMPTS=0
2011-04-16 12:40:13.650: [  CRSRES][11051]32ora.crmdb01.vip failed on crmdb01 relocating.
2011-04-16 12:40:13.770: [  CRSRES][11051]32StopResource: setting CLI values
2011-04-16 12:40:13.786: [  CRSRES][11051]32Attempting to stop `ora.crmdb01.LISTENER_CRMDB01.lsnr` on member `crmdb01`
2011-04-16 12:40:14.093: [  CRSRES][11312]32Stop of `ora.crmdb.crmsrv1.crmdb2.srv` on member `crmdb01` succeeded.
2011-04-16 12:40:14.094: [  CRSRES][11312]32ora.crmdb.crmsrv1.crmdb2.srv RESTART_COUNT=0 RESTART_ATTEMPTS=0
2011-04-16 12:40:14.105: [  CRSRES][11312]32ora.crmdb.crmsrv1.crmdb2.srv failed on crmdb01 relocating.
2011-04-16 12:40:14.150: [  CRSRES][11312]32Attempting to start `ora.crmdb.crmsrv1.crmdb2.srv` on member `crmdb02`
2011-04-16 12:40:14.442: [  CRSRES][11312]32Start of `ora.crmdb.crmsrv1.crmdb2.srv` on member `crmdb02` succeeded.
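
To see how long a relocation took, the crsd log lines can be paired up by resource name. This is a minimal Python sketch that assumes the message wording shown in the excerpt above; the function name relocation_times and the two regular expressions are illustrative, not part of any Oracle utility.

```python
import re
from datetime import datetime

STAMP = re.compile(r"^(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d\.\d{3}): ")
OFFLINE = re.compile(r"(ora\.\S+) on (\S+) went OFFLINE unexpectedly")
STARTED = re.compile(r"Start of `(ora\.\S+?)` on member `(\S+?)` succeeded")


def relocation_times(log_text):
    """Map resource -> (from_node, to_node, elapsed) for completed relocations."""
    went_offline = {}
    relocated = {}
    for line in log_text.splitlines():
        m = STAMP.match(line)
        if not m:
            continue
        when = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
        off = OFFLINE.search(line)
        if off:
            # Remember where and when the resource was lost.
            went_offline[off.group(1)] = (off.group(2), when)
            continue
        up = STARTED.search(line)
        if up and up.group(1) in went_offline:
            node, since = went_offline[up.group(1)]
            relocated[up.group(1)] = (node, up.group(2), when - since)
    return relocated


# Three lines copied from the crsd log excerpt above.
SAMPLE = """\
2011-04-16 12:40:13.072: [  CRSRES][11051]32ora.crmdb01.vip on crmdb01 went OFFLINE unexpectedly
2011-04-16 12:40:13.487: [  CRSRES][11312]32ora.crmdb.crmsrv1.crmdb2.srv on crmdb01 went OFFLINE unexpectedly
2011-04-16 12:40:14.442: [  CRSRES][11312]32Start of `ora.crmdb.crmsrv1.crmdb2.srv` on member `crmdb02` succeeded.
"""

for name, (src, dst, elapsed) in relocation_times(SAMPLE).items():
    print(name, src, "->", dst, "in", elapsed.total_seconds(), "s")
```

The VIP is absent from the result because the excerpt does not contain its start message on crmdb02.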

At the same time, the crs log on node 2 showed the following:
2011-04-16 12:40:14.148: [  CRSRES][11617]32startRunnable: setting CLI values
2011-04-16 12:40:24.488: [  CRSRES][12145]32CRS-1002: Resource 'ora.crmdb.crmsrv1.cs' is already running on member 'crmdb02'

Note that a VIP failure can even cause every resource that depends on the VIP to be stopped:
If the VIP fails for any reason and cannot be restarted, CRS will bring down all dependent resources, including the Listener, ASM instance and database instance. CRS will attempt to bring these resources down gracefully - hence, a shutdown immediate will be issued, and will be seen in the alert log of the ASM instance - no errors will be evident in the alert log for the ASM instance.
The following comes from a Metalink case (ID 277274.1); this failure occurs frequently on 10.1:
`ora.rmsclnxclu1.vip` on `rmsclnxclu1` went OFFLINE unexpectedly
2004-06-21 21:21:05.562: Attempting to stop `ora.rmsclnxclu1.vip` on member `rmsclnxclu1`
RTD #0: Action Script /home/oracle/product/crs/bin/racgwrap(stop) timed out for ora.rmsclnxclu1.vip! (timeout=60)
2004-06-21 21:22:16.472: [RTI:884782] StopResource error for ora.rmsclnxclu1.vip error code = 1
2004-06-21 21:22:18.611: `ora.rmsclnxclu1.vip` on member `rmsclnxclu1` has experienced an unrecoverable failure.
2004-06-21 21:22:18.611: Human intervention required to resume its availability.
2004-06-21 21:22:18.790: [RUNNABLELISTENER:884782] Resource failed into UNKNOWN, killing dependents
`ora.rmsclnxclu1.vip` experienced a failure on `rmsclnxclu1`. Stopping dependent resources.
2004-06-21 21:22:20.525: Attempting to stop `ora.gofod.gofod1.inst` on member `rmsclnxclu1`
2004-06-21 21:25:38.531: Stop of `ora.gofod.gofod1.inst` on member `rmsclnxclu1` succeeded.
2004-06-21 21:25:38.611: Attempting to stop `ora.rmsclnxclu1.LISTENER_rmsclnxclu1.lsnr` on member `rmsclnxclu1`
2004-06-21 21:25:38.983: Stop of `ora.rmsclnxclu1.LISTENER_rmsclnxclu1.lsnr` on member `rmsclnxclu1` succeeded.
2004-06-21 21:25:39.041: Attempting to stop `ora.rmsclnxclu1.ASM1.asm` on member `rmsclnxclu1`
2004-06-21 21:25:46.669: Stop of `ora.rmsclnxclu1.ASM1.asm` on member `rmsclnxclu1` succeeded.
2004-06-21 21:25:46.728: Attempting to stop `ora.rmsclnxclu1.vip` on member `rmsclnxclu1`
2004-06-21 21:25:55.547: Stop of `ora.rmsclnxclu1.vip` on member `rmsclnxclu1` succeeded.

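If you hit the failure above, or the VIP frequently goes offline by itself, the following approach can help:
1) Enable VIP tracing, so that more detailed log information is available if the VIP fails again.
Enable VIP tracing: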
[root@node1 admin]#  crsctl debug log res ora.node1.vip:1
Set Resource Debug Module: ora.node1.vip  Level: 1
Disable VIP tracing:
[root@node1 admin]#  crsctl debug log res ora.node1.vip:0
Set Resource Debug Module: ora.node1.vip  Level: 0
In 11g R2 the syntax for enabling tracing changes to:
#crsctl set log res "ora.rmntops1.vip.com:1"

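2) Change the VIP check interval and the script timeout: raise the check interval from the default 30 seconds to 120 seconds, and the script timeout from 60 seconds to 120 seconds.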
1. Create the .cap file for each vip resource (on each node):

./crs_stat -p ora.rmsclnxclu1.vip > /tmp/ora.rmsclnxclu1.vip.cap

2. Then, update the .cap file using the following syntax and values:

./crs_profile -update ora.rmsclnxclu1.vip -dir /tmp -o ci=120,st=120

(Where ci = the CHECK_INTERVAL and st = the SCRIPT_TIMEOUT value.)

3. Finally, re-register it using the '-u' option:

./crs_register ora.rmsclnxclu1.vip -dir /tmp -u
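
After re-registering, it is worth confirming that the new values took effect. The Python sketch below assumes that the profile printed by crs_stat -p consists of KEY=value lines; the sample profiles are illustrative rather than captured from a real cluster, and the names parse_profile and timing_matches are hypothetical.

```python
def parse_profile(text):
    """Parse KEY=value lines (the assumed `crs_stat -p` layout) into a dict."""
    profile = {}
    for line in text.splitlines():
        if "=" in line:
            key, _, value = line.partition("=")
            profile[key.strip()] = value.strip()
    return profile


def timing_matches(profile, check_interval=120, script_timeout=120):
    """True when both timing attributes carry the intended values."""
    return (profile.get("CHECK_INTERVAL") == str(check_interval)
            and profile.get("SCRIPT_TIMEOUT") == str(script_timeout))


# Illustrative profiles, not real output: before and after the update.
BEFORE = """\
NAME=ora.rmsclnxclu1.vip
CHECK_INTERVAL=30
SCRIPT_TIMEOUT=60
"""
AFTER = """\
NAME=ora.rmsclnxclu1.vip
CHECK_INTERVAL=120
SCRIPT_TIMEOUT=120
"""

print("before:", timing_matches(parse_profile(BEFORE)))
print("after: ", timing_matches(parse_profile(AFTER)))
```

In practice the text would come from running crs_stat -p against the VIP resource on each node.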

3) On 10.1, the dependency on the VIP can be removed from the ASM resource:
ASM resource name is in the form of ora.<nodename>.<ASM instance name>.asm.
VIP resource name is in the form of ora.<nodename>.vip
- crs_stat -p <ASM resource name> > /tmp/<ASM resource name>.cap
- Edit /tmp/<ASM resource name>.cap to remove VIP resource name from the REQUIRED_RESOURCES attribute.
- crs_register -u <ASM resource name> -dir /tmp
- Use "crs_stat -p <ASM resource name>" to verify that the REQUIRED_RESOURCES attribute is updated.
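
The manual edit of the .cap file in the steps above can be sketched in Python. This assumes the .cap file holds KEY=value lines and that REQUIRED_RESOURCES is a space-separated list; the function name drop_required_resource and the sample content are illustrative only, so review the result before running crs_register -u.

```python
def drop_required_resource(cap_text, resource):
    """Return cap_text with `resource` removed from REQUIRED_RESOURCES."""
    out = []
    for line in cap_text.splitlines():
        if line.startswith("REQUIRED_RESOURCES="):
            names = line.split("=", 1)[1].split()
            kept = [n for n in names if n != resource]
            line = "REQUIRED_RESOURCES=" + " ".join(kept)
        out.append(line)
    return "\n".join(out) + "\n"


# Illustrative .cap content, not taken from a real cluster.
SAMPLE_CAP = """\
NAME=ora.rmsclnxclu1.ASM1.asm
REQUIRED_RESOURCES=ora.rmsclnxclu1.vip
"""

print(drop_required_resource(SAMPLE_CAP, "ora.rmsclnxclu1.vip"), end="")
```

Other attributes and any other listed dependencies are left untouched.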
