One node of an 11.2.0.3 RAC cluster backing a payment system suddenly failed to start.

1. Stop the cluster stack on the node
[root@rac2 ~]$ crsctl stop crs -f

2. Start the cluster stack again
[root@rac2 ~]$ crsctl start crs
CRS-4123: Oracle High Availability Services has been started.

The stack started without reporting any error, and at first glance ASM appeared to be up:
[grid@rac2 ~]$ ps -ef | grep smon
root 8472 1 1 14:14 ? 00:00:01 /u01/app/11.2/product/crs_1/bin/osysmond.bin
grid 9238 1 0 14:15 ? 00:00:00 asm_smon_+ASM2
grid 9500 6212 0 14:16 pts/5 00:00:00 grep smon

Connecting to ASM to check the disk group status showed something odd: the login succeeded, but querying v$asm_disk failed.
[grid@rac2 ~]$ sqlplus / as sysasm
SQL*Plus: Release 11.2.0.3.0 Production on Wed May 21 15:35:05 2014
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options
SQL> select status from v$asm_disk;
select status from v$asm_disk
*
ERROR at line 1:
ORA-01034: ORACLE not available
Process ID: 14465
Session ID: 2707 Serial number: 3

Listing the d.bin processes showed that crsd had not started:
[root@rac01 ~]# ps -ef | grep d.bin
root 4142 1 0 09:27 ? 00:00:05 /u01/app/11.2.0/grid/product/db_1/bin/ohasd.bin reboot
grid 4548 1 0 09:28 ? 00:00:00 /u01/app/11.2.0/grid/product/db_1/bin/mdnsd.bin
grid 4558 1 0 09:28 ? 00:00:01 /u01/app/11.2.0/grid/product/db_1/bin/gpnpd.bin
grid 4568 1 0 09:28 ? 00:00:05 /u01/app/11.2.0/grid/product/db_1/bin/gipcd.bin
root 4590 1 0 09:28 ? 00:00:10 /u01/app/11.2.0/grid/product/db_1/bin/osysmond.bin

Checking again, the ASM instance process was still there:
[grid@rac2 ~]$ ps -ef | grep smon
root 17377 1 2 14:31 ? 00:00:09 /u01/app/11.2/product/crs_1/bin/osysmond.bin
grid 21518 1 0 14:37 ? 00:00:00 asm_smon_+ASM2
grid 21615 18834 0 14:37 pts/5 00:00:00 grep smon

asmcmd raised the same error:
[grid@rac2 ~]$ asmcmd
ORA-01034: ORACLE not available
Process ID: 21716
Session ID: 2707 Serial number: 1 (DBD ERROR: OCIStmtExecute/Describe)

ocrcheck could not complete either:
[root@rac2 bin]# ./ocrcheck
PROT-602: Failed to retrieve data from the cluster registry
PROC-26: Error while accessing the physical storage
ORA-15077: could not locate ASM instance serving a required diskgroup

Shutting ASM down and starting the instance by hand failed as well:
[grid@rac2 ~]$ sqlplus / as sysasm
SQL*Plus: Release 11.2.0.3.0 Production on Wed May 21 14:33:37 2014
Copyright (c) 1982, 2011, Oracle. All rights reserved.
SQL> shutdown abort
ASM instance shutdown
SQL> startup
ORA-27103: internal error
Linux-x86_64 Error: 2: No such file or directory
Additional information: 1
Additional information: 25919497
Additional information: 2

Starting +ASM2 from a new pfile created from the +ASM1 spfile failed too:
SQL> startup nomount pfile='/tmp/init+asm2.ora';
ORA-24324: service handle not initialized
ORA-01041: internal error.
hostdef extension doesn't exist

Initially I suspected the storage, so I checked it first. This RAC uses a multipath + ASMLib setup.

multipath reported every path as active:
[root@rac2 bin]# multipath -ll
360a9800044336b327a24446172587864 dm-5 NETAPP,LUN
[size=200G][features=3 queue_if_no_path pg_init_retries 50][hwhandler=1 alua][rw]
\_ round-robin 0 [prio=50][active]
 \_ 4:0:1:0 sdau 66:224 [active][ready]
 \_ 3:0:1:0 sdq 65:0 [active][ready]
\_ round-robin 0 [prio=10][enabled]
 \_ 4:0:0:0 sdaf 65:240 [active][ready]
 \_ 3:0:0:0 sdb 8:16 [active][ready]
360a9800044336b327a24446172587862 dm-6 NETAPP,LUN
[size=200G][features=3 queue_if_no_path pg_init_retries 50][hwhandler=1 alua][rw]
\_ round-robin 0 [prio=50][active]
 \_ 4:0:1:1 sdav 66:240 [active][ready]
 \_ 3:0:1:1 sdr 65:16 [active][ready]
\_ round-robin 0 [prio=10][enabled]
 \_ 4:0:0:1 sdag 66:0 [active][ready]
 \_ 3:0:0:1 sdc 8:32 [active][ready]
360a9800044336b327a2444617258786e dm-13 NETAPP,LUN
[size=200G][features=3 queue_if_no_path pg_init_retries 50][hwhandler=1 alua][rw]
\_ round-robin 0 [prio=50][active]
 \_ 4:0:1:8 sdbc 67:96 [active][ready]
 \_ 3:0:1:8 sdy 65:128 [active][ready]
\_ round-robin 0 [prio=10][enabled]
 \_ 4:0:0:8 sdan 66:112 [active][ready]
 \_ 3:0:0:8 sdj 8:144 [active][ready]
360a9800044336b327a24446172587876 dm-11 NETAPP,LUN
[size=2.0G][features=3 queue_if_no_path pg_init_retries 50][hwhandler=1 alua][rw]
\_ round-robin 0 [prio=50][active]
 \_ 4:0:1:6 sdba 67:64 [active][ready]
 \_ 3:0:1:6 sdw 65:96 [active][ready]
\_ round-robin 0 [prio=10][enabled]
 \_ 4:0:0:6 sdal 66:80 [active][ready]
 \_ 3:0:0:6 sdh 8:112 [active][ready]
360a9800044336b327a24446172587870 dm-9 NETAPP,LUN
[size=2.0G][features=3 queue_if_no_path pg_init_retries 50][hwhandler=1 alua][rw]
\_ round-robin 0 [prio=50][active]
 \_ 4:0:1:4 sday 67:32 [active][ready]
 \_ 3:0:1:4 sdu 65:64 [active][ready]
\_ round-robin 0 [prio=10][enabled]
 \_ 4:0:0:4 sdaj 66:48 [active][ready]
 \_ 3:0:0:4 sdf 8:80 [active][ready]
360a9800044336b327a24446172587970 dm-1 NETAPP,LUN
\_ round-robin 0 [prio=50][active]
 \_ 3:0:1:15 sdae 65:224 [active][ready]
 \_ 4:0:1:15 sdbi 67:192 [active][ready]
\_ round-robin 0 [prio=10][enabled]
 \_ 4:0:0:15 sdat 66:208 [active][ready]
 \_ 3:0:0:15 sdp 8:240 [active][ready]
360a9800044336b327a24446172587a55 dm-3 NETAPP,LUN
[size=5.0G][features=3 queue_if_no_path pg_init_retries 50][hwhandler=1 alua][rw]
\_ round-robin 0 [prio=50][active]
 \_ 3:0:1:14 sdad 65:208 [active][ready]
 \_ 4:0:1:14 sdbh 67:176 [active][ready]
\_ round-robin 0 [prio=10][enabled]
 \_ 4:0:0:14 sdas 66:192 [active][ready]
 \_ 3:0:0:14 sdo 8:224 [active][ready]
360a9800044336b327a2444617258786a dm-8 NETAPP,LUN
[size=200G][features=3 queue_if_no_path pg_init_retries 50][hwhandler=1 alua][rw]
\_ round-robin 0 [prio=50][active]
 \_ 4:0:1:3 sdax 67:16 [active][ready]
 \_ 3:0:1:3 sdt 65:48 [active][ready]
\_ round-robin 0 [prio=10][enabled]
 \_ 4:0:0:3 sdai 66:32 [active][ready]
 \_ 3:0:0:3 sde 8:64 [active][ready]
360a9800044336b327a24446172587872 dm-14 NETAPP,LUN
[size=200G][features=3 queue_if_no_path pg_init_retries 50][hwhandler=1 alua][rw]
\_ round-robin 0 [prio=50][active]
 \_ 4:0:1:9 sdbd 67:112 [active][ready]
 \_ 3:0:1:9 sdz 65:144 [active][ready]
\_ round-robin 0 [prio=10][enabled]
 \_ 4:0:0:9 sdao 66:128 [active][ready]
 \_ 3:0:0:9 sdk 8:160 [active][ready]
360a9800044336b327a24446172587972 dm-2 NETAPP,LUN
[size=10G][features=3 queue_if_no_path pg_init_retries 50][hwhandler=1 alua][rw]
\_ round-robin 0 [prio=50][active]
 \_ 3:0:1:13 sdac 65:192 [active][ready]
 \_ 4:0:1:13 sdbg 67:160 [active][ready]
\_ round-robin 0 [prio=10][enabled]
 \_ 4:0:0:13 sdar 66:176 [active][ready]
 \_ 3:0:0:13 sdn 8:208 [active][ready]
360a9800044336b327a24446172587868 dm-0 NETAPP,LUN
[size=200G][features=3 queue_if_no_path pg_init_retries 50][hwhandler=1 alua][rw]
\_ round-robin 0 [prio=50][active]
 \_ 3:0:1:10 sdaa 65:160 [active][ready]
 \_ 4:0:1:10 sdbe 67:128 [active][ready]
\_ round-robin 0 [prio=10][enabled]
 \_ 4:0:0:10 sdap 66:144 [active][ready]
 \_ 3:0:0:10 sdl 8:176 [active][ready]

The ASM disks under ASMLib were all recognized as well. That ruled out a storage fault, so I moved on to the grid logs.
[root@rac2 bin]# /etc/init.d/multipathd restart
Stopping multipathd daemon: [ OK ]
Starting multipathd daemon: [ OK ]
[root@rac2 bin]# /etc/init.d/oracleasm restart
Dropping Oracle ASMLib disks: [ OK ]
Shutting down the Oracle ASMLib driver: [ OK ]
Initializing the Oracle ASMLib driver: [ OK ]
Scanning the system for Oracle ASMLib disks: [ OK ]
[root@rac2 bin]# oracleasm scandisks
Reloading disk partitions: done
Cleaning any stale ASM disks...
Scanning system for ASM disks...
[root@rac2 bin]# oracleasm listdisks
ASMLB_OCR01
ASMLB_OCR02
ASMLB_OCR03
ASMLB_ORCL01
ASMLB_ORCL02
ASMLB_ORCL03
ASMLB_ORCLCTL
ASMLB_ORCLFRA
ASMLB_ORCLREDO
ASMLB_YYZF01
ASMLB_YYZF02
ASMLB_YYZF03
ASMLB_YYZFCTL
ASMLB_YYZFFRA
ASMLB_YYZREDO

After CRS was restarted, the ASM processes were still present, but ASM could not mount its disk groups and crsd still failed to start.

The cluster alert log (alertrac2.log) showed the following:
[/u01/app/11.2/product/crs_1/bin/orarootagent.bin(13382)]CRS-5018:(:CLSN00037:) Removed unused HAIP route: 169.254.95.0 / 255.255.255.0 / 0.0.0.0 / usb0
2014-05-21 13:18:09.087
[/u01/app/11.2/product/crs_1/bin/orarootagent.bin(13382)]CRS-5018:(:CLSN00037:) Removed unused HAIP route: 169.254.95.0 / 255.255.255.0 / 0.0.0.0 / usb0
2014-05-21 13:22:37.716
[ctssd(17298)]CRS-2409:The clock on host rac2 is not synchronous with the mean cluster time.
No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2014-05-21 13:38:09.677
[/u01/app/11.2/product/crs_1/bin/orarootagent.bin(13382)]CRS-5018:(:CLSN00037:) Removed unused HAIP route: 169.254.95.0 / 255.255.255.0 / 0.0.0.0 / usb0
2014-05-21 13:52:38.315
[ctssd(17298)]CRS-2409:The clock on host rac2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2014-05-21 13:58:10.245
[/u01/app/11.2/product/crs_1/bin/orarootagent.bin(13382)]CRS-5018:(:CLSN00037:) Removed unused HAIP route: 169.254.95.0 / 255.255.255.0 / 0.0.0.0 / usb0
[gpnpd(8438)]CRS-2328:GPNPD started on node rac2.
2014-05-21 14:14:29.618
[cssd(8520)]CRS-1713:CSSD daemon is started in clustered mode
2014-05-21 14:14:31.419
[ohasd(8217)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE
2014-05-21 14:14:50.409
[cssd(8520)]CRS-1707:Lease acquisition for node rac2 number 2 completed
2014-05-21 14:14:51.693
[cssd(8520)]CRS-1621:The IPMI configuration data for this node stored in the Oracle registry is incomplete; details at (:CSSNK00002:) in /u01/app/11.2/product/crs_1/log/rac2/cssd/ocssd.log
2014-05-21 14:14:51.693
[cssd(8520)]CRS-1617:The information required to do node kill for node rac2 is incomplete; details at (:CSSNM00004:) in /u01/app/11.2/product/crs_1/log/rac2/cssd/ocssd.log
2014-05-21 14:14:51.696
[cssd(8520)]CRS-1605:CSSD voting file is online: ORCL:ASMLB_OCR03; details in /u01/app/11.2/product/crs_1/log/rac2/cssd/ocssd.log.
2014-05-21 14:14:51.700
[cssd(8520)]CRS-1605:CSSD voting file is online: ORCL:ASMLB_OCR02; details in /u01/app/11.2/product/crs_1/log/rac2/cssd/ocssd.log.
2014-05-21 14:14:51.704
[cssd(8520)]CRS-1605:CSSD voting file is online: ORCL:ASMLB_OCR01; details in /u01/app/11.2/product/crs_1/log/rac2/cssd/ocssd.log.
2014-05-21 14:14:56.235
[cssd(8520)]CRS-1601:CSSD Reconfiguration complete.
Active nodes are rac2 rac2 .
2014-05-21 14:14:58.695
[ctssd(8722)]CRS-2403:The Cluster Time Synchronization Service on host rac2 is in observer mode.
2014-05-21 14:14:58.973
[ctssd(8722)]CRS-2407:The new Cluster Time Synchronization Service reference node is host rac2.
2014-05-21 14:14:58.974
[ctssd(8722)]CRS-2401:The Cluster Time Synchronization Service started on host rac2.
[client(8758)]CRS-10001:21-May-14 14:14 ACFS-9391: Checking for existing ADVM/ACFS installation.
[client(8763)]CRS-10001:21-May-14 14:14 ACFS-9392: Validating ADVM/ACFS installation files for operating system.
[client(8765)]CRS-10001:21-May-14 14:14 ACFS-9393: Verifying ASM Administrator setup.
[client(8768)]CRS-10001:21-May-14 14:14 ACFS-9308: Loading installed ADVM/ACFS drivers.
[client(8771)]CRS-10001:21-May-14 14:14 ACFS-9154: Loading 'oracleoks.ko' driver.
[client(8812)]CRS-10001:21-May-14 14:15 ACFS-9154: Loading 'oracleadvm.ko' driver.
[client(8854)]CRS-10001:21-May-14 14:15 ACFS-9154: Loading 'oracleacfs.ko' driver.
[client(8966)]CRS-10001:21-May-14 14:15 ACFS-9327: Verifying ADVM/ACFS devices.
[client(8971)]CRS-10001:21-May-14 14:15 ACFS-9156: Detecting control device '/dev/asm/.asm_ctl_spec'.
[client(8975)]CRS-10001:21-May-14 14:15 ACFS-9156: Detecting control device '/dev/ofsctl'.
[client(8981)]CRS-10001:21-May-14 14:15 ACFS-9322: completed
2014-05-21 14:26:29.083
[/u01/app/11.2/product/crs_1/bin/orarootagent.bin(14037)]CRS-5018:(:CLSN00037:) Removed unused HAIP route: 169.254.95.0 / 255.255.255.0 / 0.0.0.0 / usb0
[cssd(14105)]CRS-1612:Network communication with node rac2 (1) missing for 50% of timeout interval. Removal of this node from cluster in 14.010 seconds
2014-05-21 14:28:40.898
[cssd(14105)]CRS-1662:Member kill requested by node rac2 for member number 1, group DB+ASM
2014-05-21 14:28:42.637
[/u01/app/11.2/product/crs_1/bin/oraagent.bin(13997)]CRS-5019:All OCR locations are on ASM disk groups [OCRDG], and none of these disk groups are mounted.
Details are at "(:CLSN00100:)" in "/u01/app/11.2/product/crs_1/log/rac2/agent/ohasd/oraagent_grid/oraagent_grid.log".
2014-05-21 14:28:42.637
[/u01/app/11.2/product/crs_1/bin/oraagent.bin(13997)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/u01/app/11.2/product/crs_1/log/rac2/agent/ohasd/oraagent_grid/oraagent_grid.log"
2014-05-21 14:28:42.919
[/u01/app/11.2/product/crs_1/bin/oraagent.bin(13997)]CRS-5019:All OCR locations are on ASM disk groups [OCRDG], and none of these disk groups are mounted. Details are at "(:CLSN00100:)" in "/u01/app/11.2/product/crs_1/log/rac2/agent/ohasd/oraagent_grid/oraagent_grid.log".
2014-05-21 14:28:42.920
[/u01/app/11.2/product/crs_1/bin/oraagent.bin(13997)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/u01/app/11.2/product/crs_1/log/rac2/agent/ohasd/oraagent_grid/oraagent_grid.log"
2014-05-21 14:28:43.147
[/u01/app/11.2/product/crs_1/bin/oraagent.bin(13997)]CRS-5019:All OCR locations are on ASM disk groups [OCRDG], and none of these disk groups are mounted. Details are at "(:CLSN00100:)" in "/u01/app/11.2/product/crs_1/log/rac2/agent/ohasd/oraagent_grid/oraagent_grid.log".
2014-05-21 14:28:43.147
[/u01/app/11.2/product/crs_1/bin/oraagent.bin(13997)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/u01/app/11.2/product/crs_1/log/rac2/agent/ohasd/oraagent_grid/oraagent_grid.log"
2014-05-21 14:28:43.429
[/u01/app/11.2/product/crs_1/bin/oraagent.bin(13997)]CRS-5019:All OCR locations are on ASM disk groups [OCRDG], and none of these disk groups are mounted. Details are at "(:CLSN00100:)" in "/u01/app/11.2/product/crs_1/log/rac2/agent/ohasd/oraagent_grid/oraagent_grid.log".
2014-05-21 14:28:43.430
[/u01/app/11.2/product/crs_1/bin/oraagent.bin(13997)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/u01/app/11.2/product/crs_1/log/rac2/agent/ohasd/oraagent_grid/oraagent_grid.log"
2014-05-21 14:30:44.648
[cssd(14105)]CRS-1662:Member kill requested by node rac2 for member number 1, group DB+ASM
2014-05-21 14:30:47.200
[/u01/app/11.2/product/crs_1/bin/oraagent.bin(13997)]CRS-5019:All OCR locations are on ASM disk groups [OCRDG], and none of these disk groups are mounted. Details are at "(:CLSN00100:)" in "/u01/app/11.2/product/crs_1/log/rac2/agent/ohasd/oraagent_grid/oraagent_grid.log".
2014-05-21 14:30:47.201
[/u01/app/11.2/product/crs_1/bin/oraagent.bin(13997)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/u01/app/11.2/product/crs_1/log/rac2/agent/ohasd/oraagent_grid/oraagent_grid.log"
2014-05-21 14:30:48.565
[/u01/app/11.2/product/crs_1/bin/oraagent.bin(13997)]CRS-5019:All OCR locations are on ASM disk groups [OCRDG], and none of these disk groups are mounted. Details are at "(:CLSN00100:)" in "/u01/app/11.2/product/crs_1/log/rac2/agent/ohasd/oraagent_grid/oraagent_grid.log".
2014-05-21 14:30:48.792
[/u01/app/11.2/product/crs_1/bin/oraagent.bin(13997)]CRS-5019:All OCR locations are on ASM disk groups [OCRDG], and none of these disk groups are mounted. Details are at "(:CLSN00100:)" in "/u01/app/11.2/product/crs_1/log/rac2/agent/ohasd/oraagent_grid/oraagent_grid.log".
2014-05-21 14:30:49.079
[/u01/app/11.2/product/crs_1/bin/oraagent.bin(13997)]CRS-5019:All OCR locations are on ASM disk groups [OCRDG], and none of these disk groups are mounted. Details are at "(:CLSN00100:)" in "/u01/app/11.2/product/crs_1/log/rac2/agent/ohasd/oraagent_grid/oraagent_grid.log".
2014-05-21 14:30:49.290
[ohasd(13804)]CRS-2807:Resource 'ora.asm' failed to start automatically.
2014-05-21 14:30:49.290
[ohasd(13804)]CRS-2807:Resource 'ora.crsd' failed to start automatically.
2014-05-21 14:30:49.294
[mdnsd(14011)]CRS-5602:mDNS service stopping by request.
2014-05-21 14:30:49.306
[ctssd(14319)]CRS-2405:The Cluster Time Synchronization Service on host rac2 is shutdown by user

crsd log:
MESSAGE: TextMessage[CRS-2672: Attempting to start 'ora.asm' on 'rac2'
CRS-5017: The resource action "ora.asm start" encountered the following error:
ORA-03113: end-of-file on communication channel
Process ID: 0
Session ID: 0 Serial number: 0
. For details refer to "(:CLSN00107:)" in "/u01/app/11.2/product/crs_1/log/rac2/agent/ohasd/oraagent_grid/oraagent_grid.log".
CRS-2674: Start of 'ora.asm' on 'rac2' failed
CRS-2679: Attempting to clean 'ora.asm' on 'rac2'
CRS-2681: Clean of 'ora.asm' on 'rac2' succeeded

+ASM2 instance alert log:
Time drift detected. Please check VKTM trace file for more details.
Wed Feb 26 09:52:57 2014
IPC Send timeout detected. Sender: ospid 17566 [oracle@yyzfrac2 (RBAL)]
Receiver: inst 1 binc 436769961 ospid 3983
IPC Send timeout to 1.0 inc 28 for msg type 8 from opid 18
Wed Feb 26 09:53:01 2014
Communications reconfiguration: instance_number 1
Detected an inconsistent instance membership by instance 2

Throughout all of this, ASM never managed to mount its disk groups, which is why crsd could not start. One message stood out as odd: CRS-5018:(:CLSN00037:) Removed unused HAIP route.

That prompted a look at the host network. The system bonds several NICs, and the usb0 interface that is specific to IBM PC servers held a 169.254.xxx.xxx address. From 11.2.0.2 onward, Grid Infrastructure uses HAIP across the private NICs, so the private NICs should sit on different subnets; if they all share one subnet, unplugging a single NIC can get the node evicted.

Note: IBM PC servers use usb0 for their management network. While the USB NIC is not connected, it keeps requesting an address over DHCP; if no DHCP server answers, it falls back to a 169.254.xxx.xxx address. That address conflicts with Oracle HAIP and causes the HAIP route to be lost.

[root@rac2 ~]# ifconfig
bond0     Link encap:Ethernet HWaddr 40:F2:E9:63:14:98
          inet addr:10.118.7.32 Bcast:10.118.7.255 Mask:255.255.255.0
          UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
          RX packets:6349 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8373 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:909372 (888.0 KiB) TX bytes:7154180 (6.8 MiB)
bond1     Link encap:Ethernet HWaddr 40:F2:E9:63:14:9A
          inet addr:10.10.10.20 Bcast:10.10.10.255 Mask:255.255.255.0
          UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
          RX packets:14396 errors:0 dropped:0 overruns:0 frame:0
          TX packets:13380 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:7089083 (6.7 MiB) TX bytes:5387148 (5.1 MiB)
bond1:1   Link encap:Ethernet HWaddr 40:F2:E9:63:14:9A
          inet addr:169.254.112.114 Bcast:169.254.255.255 Mask:255.255.0.0
          UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
eth0      Link encap:Ethernet HWaddr 40:F2:E9:63:14:98
          UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
          RX packets:222 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:42708 (41.7 KiB) TX bytes:0 (0.0 b)
          Interrupt:169 Memory:94000000-94012800
eth1      Link encap:Ethernet HWaddr 40:F2:E9:63:14:9A
          UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
          RX packets:209 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:41610 (40.6 KiB) TX bytes:0 (0.0 b)
          Interrupt:217 Memory:92000000-92012800
eth4      Link encap:Ethernet HWaddr 40:F2:E9:63:14:98
          UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
          RX packets:6127 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8376 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:866664 (846.3 KiB) TX bytes:7154738 (6.8 MiB)
          Memory:c0300000-c0380000
eth5      Link encap:Ethernet HWaddr 40:F2:E9:63:14:9A
          UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
          RX packets:14187 errors:0 dropped:0 overruns:0 frame:0
          TX packets:13380 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:7047473 (6.7 MiB) TX bytes:5387148 (5.1 MiB)
          Memory:c0380000-c0400000
lo        Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          UP LOOPBACK RUNNING MTU:16436 Metric:1
          RX packets:6584 errors:0 dropped:0 overruns:0 frame:0
          TX packets:6584 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:9343532 (8.9 MiB) TX bytes:9343532 (8.9 MiB)
usb0      Link encap:Ethernet HWaddr 42:F2:E9:64:94:9B
          inet addr:169.254.95.120 Bcast:169.254.95.255 Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:907 errors:0 dropped:0 overruns:0 frame:0
          TX packets:81 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:59498 (58.1 KiB) TX bytes:14668 (14.3 KiB)

Note: before the ora.cluster_interconnect.haip resource starts, cssd checks the private network to decide whether it can start. Once ora.cluster_interconnect.haip is up, the LMON process of ora.asm checks communication over the private network to decide whether the clustered ASM instance can start. At that point the interconnect address is a 169.254.x.x address. If the nodes cannot reach each other on the 169.254 network, ASM can run on only some of the nodes, and on a node where ASM cannot start, neither Clusterware nor the database instances can start.

That is what happened here: usb0 held an address in the 169.254 range, so the HAIP on the private network could not be bound and used for communication, and the cluster evicted node 2.

Running netstat -rn on every node shows where the HAIP route is bound. On node 2 there was indeed no 169.254 route on bond1:
[root@rac2 bin]# netstat -rn
Kernel IP routing table
Destination Gateway Genmask Flags MSS Window irtt Iface
10.10.10.0 0.0.0.0 255.255.255.0 U 0 0 0 bond1
10.118.7.0 0.0.0.0 255.255.255.0 U 0 0 0 bond0
0.0.0.0 10.118.7.1 0.0.0.0 UG 0 0 0 bond0
[root@rac2 bin]# ./crsctl stat res ora.cluster_interconnect.haip -init
NAME=ora.cluster_interconnect.haip
TYPE=ora.haip.type
TARGET=ONLINE
STATE=ONLINE on rac2

Manually add the HAIP route (169.254.0.0/16) to bond1 on node 2:
[root@rac2 bin]# route add -net 169.254.0.0 netmask 255.255.0.0 dev bond1

netstat now shows the HAIP route on bond1:
[root@rac2 bin]# netstat -rn
Kernel IP routing table
Destination Gateway Genmask Flags MSS Window irtt Iface
10.10.10.0 0.0.0.0 255.255.255.0 U 0 0 0 bond1
10.118.7.0 0.0.0.0 255.255.255.0 U 0 0 0 bond0
169.254.0.0 0.0.0.0 255.255.0.0 U 0 0 0 bond1
0.0.0.0 10.118.7.1 0.0.0.0 UG 0 0 0 bond0

Bring usb0 down by hand, then set ONBOOT=no in its configuration file so it stays down after a reboot:
[root@rac2 bin]# ifdown usb0
[root@rac2 bin]# vi /etc/sysconfig/network-scripts/ifcfg-usb0
# File created by IBM
# IMM_DRIVER_STATUS=1
# IMM_IFACE_STATUS=0
# IBM RNDIS/CDC ETHER
NM_CONTROLLED=no
DEVICE=usb0
ONBOOT=no
BOOTPROTO=dhcp
HWADDR=42:f2:e9:64:96:df
TYPE=Ethernet

After grid was restarted, the cluster came up normally.

Summary
On IBM System x servers, disable the unused usb0 interface so that the HAIP address range (169.254.x.x) is used only by the private interconnect NICs and by no other device.
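
The two manual checks used in this diagnosis (looking for a stray 169.254.x.x address in ifconfig and for the 169.254.0.0 route in netstat -rn) can be scripted. What follows is only a rough sketch, not an Oracle-supplied tool. The function names list_foreign_linklocal and has_haip_route are made up for illustration, the parsing assumes the older net-tools output format shown in this post ("inet addr:" lines), and bond1 is assumed to be the private interconnect interface.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not part of any Oracle tool).
# Both read command output on stdin, so they can be tried on saved text.

# list_foreign_linklocal PRIVATE_IF
# Reads net-tools style `ifconfig` output and prints "<interface> <address>"
# for every interface that holds a 169.254.x.x address but is neither
# PRIVATE_IF nor one of its aliases (such as bond1:1, where HAIP lives).
list_foreign_linklocal() {
    awk -v priv="$1" '
        /^[^ \t]/ { ifname = $1 }            # unindented line opens a new interface block
        /inet addr:169\.254\./ {
            base = ifname
            sub(/:.*/, "", base)             # strip alias suffix: bond1:1 -> bond1
            if (base != priv) {
                addr = $0
                sub(/.*inet addr:/, "", addr)
                sub(/[ \t].*/, "", addr)     # keep only the address itself
                print ifname, addr
            }
        }
    '
}

# has_haip_route PRIVATE_IF
# Reads `netstat -rn` output and succeeds (exit 0) only if a 169.254.0.0
# route is present on PRIVATE_IF.
has_haip_route() {
    awk -v priv="$1" '
        $1 == "169.254.0.0" && $NF == priv { found = 1 }
        END { exit(found ? 0 : 1) }
    '
}

# Example use on a live node (assumes bond1 is the private interconnect):
#   ifconfig | list_foreign_linklocal bond1
#   netstat -rn | has_haip_route bond1 || echo "no 169.254.0.0 route on bond1"
```

Run against the ifconfig output in this post, the first helper would single out usb0, and the second would fail on node 2 until the route is added. Reading from stdin rather than calling ifconfig directly keeps the helpers testable against captured output from a broken node.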