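Symptom: in a two-node 11g RAC environment, while a replication tool was running, rac1 went down outright, and the ASM instance on rac2 restarted once and then recovered. rac1, however, stayed down.
Because rac1 was down and could not be logged in to, I had to log in to rac2 and check the logs there.
Alert log on rac2: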
- Stopping background process CJQ0
- Tue Jun 26 14:38:18 2012
- NOTE: ASMB terminating
- Errors in file /oracle/app/db/diag/rdbms/xxxx/xxxx2/trace/xxxx2_asmb_12325.trc:
- ORA-15064: communication failure with ASM instance
- ORA-03113: end-of-file on communication channel
- Process ID:
- Session ID: 103 Serial number: 5
- Errors in file /oracle/app/db/diag/rdbms/xxxx/xxxx2/trace/xxxx2_asmb_12325.trc:
- ORA-15064: communication failure with ASM instance
- ORA-03113: end-of-file on communication channel
- Process ID:
- Session ID: 103 Serial number: 5
- ASMB (ospid: 12325): terminating the instance due to error 15064
- Termination issued to instance processes. Waiting for the processes to exit
- Tue Jun 26 14:38:30 2012
- Instance termination failed to kill one or more processes
- Instance terminated by ASMB, pid = 12325
- Tue Jun 26 14:58:40 2012
- Adjusting the default value of parameter parallel_max_servers
- from 2560 to 1485 due to the value of parameter processes (1500)
- Starting ORACLE instance (normal)
- Tue Jun 26 14:59:04 2012
- LICENSE_MAX_SESSION = 0
- LICENSE_SESSIONS_WARNING = 0
- Private Interface 'ib0:1' configured from GPnP for use as a private interconnect.
- [name='ib0:1', type=1, ip=169.254.41.10, mac=80-00-00-48-fe-80, net=169.254.0.0/18, mask=255.255.192.0, use=haip:cluster_interconnect/62]
- Private Interface 'ib1:1' configured from GPnP for use as a private interconnect.
- [name='ib1:1', type=1, ip=169.254.67.102, mac=80-00-00-49-fe-80, net=169.254.64.0/18, mask=255.255.192.0, use=haip:cluster_interconnect/62]
- Private Interface 'ib2:1' configured from GPnP for use as a private interconnect.
- [name='ib2:1', type=1, ip=169.254.163.124, mac=80-00-00-48-fe-80, net=169.254.128.0/18, mask=255.255.192.0, use=haip:cluster_interconnect/62]
- Private Interface 'ib3:1' configured from GPnP for use as a private interconnect.
- [name='ib3:1', type=1, ip=169.254.232.204, mac=80-00-00-49-fe-80, net=169.254.192.0/18, mask=255.255.192.0, use=haip:cluster_interconnect/62]
- Public Interface 'bond0' configured from GPnP for use as a public interface.
- [name='bond0', type=1, ip=10.240.52.148, mac=00-16-35-02-7f-02, net=10.240.52.128/25, mask=255.255.255.128, use=public/1]
- Public Interface 'bond0:1' configured from GPnP for use as a public interface.
- [name='bond0:1', type=1, ip=10.240.52.151, mac=00-16-35-02-7f-02, net=10.240.52.128/25, mask=255.255.255.128, use=public/1]
- Public Interface 'bond0:2' configured from GPnP for use as a public interface.
- [name='bond0:2', type=1, ip=10.240.52.149, mac=00-16-35-02-7f-02, net=10.240.52.128/25, mask=255.255.255.128, use=public/1]
- Public Interface 'bond0:3' configured from GPnP for use as a public interface.
- [name='bond0:3', type=1, ip=10.240.52.150, mac=00-16-35-02-7f-02, net=10.240.52.128/25, mask=255.255.255.128, use=public/1]
- Picked latch-free SCN scheme 3
- Tue Jun 26 15:00:22 2012
- Autotune of undo retention is turned on.
- LICENSE_MAX_USERS = 0
- SYS auditing is disabled
- Starting up:
- Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
- With the Partitioning, Real Application Clusters, OLAP, Data Mining
The errors start at 14:38.
The trace file referenced by the alert log:
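When the alert log is long, a small script can make the first failing minute easier to spot. The following is only a rough sketch, assuming the log has been read into a list of lines without the leading "- " list markers used in this post; the function name `ora_errors_by_time` is made up for illustration.

```python
import re
from datetime import datetime

# Alert-log timestamp lines look like "Tue Jun 26 14:38:18 2012".
TS_FORMAT = "%a %b %d %H:%M:%S %Y"
# Matches codes such as ORA-15064 as well as shorter forms like ORA-1092.
ORA_RE = re.compile(r"ORA-\d+")

def ora_errors_by_time(lines):
    """Pair every ORA code with the most recent timestamp line seen before it."""
    current = None
    found = []
    for raw in lines:
        line = raw.strip()
        try:
            current = datetime.strptime(line, TS_FORMAT)
            continue  # a timestamp line carries no error itself
        except ValueError:
            pass  # not a timestamp line, fall through and look for ORA codes
        for code in ORA_RE.findall(line):
            found.append((current, code))
    return found

# Sample lines copied from the rac2 alert log above.
sample = [
    "Tue Jun 26 14:38:18 2012",
    "NOTE: ASMB terminating",
    "ORA-15064: communication failure with ASM instance",
    "ORA-03113: end-of-file on communication channel",
    "ASMB (ospid: 12325): terminating the instance due to error 15064",
]

for when, code in ora_errors_by_time(sample):
    print(when, code)
```

The sketch only groups errors under the preceding timestamp; it does not interpret them.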
- Trace file /oracle/app/db/diag/rdbms/xxxx/xxxx2/trace/xxxx2_asmb_12325.trc
- Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
- With the Partitioning, Real Application Clusters, OLAP, Data Mining
- and Real Application Testing options
- ORACLE_HOME = /oracle/app/db/11gr2
- System name: Linux
- Node name: xxxx
- Release: 2.6.18-274.el5
- Version: #1 SMP Fri Jul 8 17:36:59 EDT 2011
- Machine: x86_64
- Instance name: xxxx2
- Redo thread mounted by this instance: 0 <none>
- Oracle process number: 33
- Unix process pid: 12325, image: oracle@xxxx (ASMB)
- *** 2012-06-25 14:07:00.911
- *** SESSION ID:(1189.1) 2012-06-25 14:07:00.911
- *** CLIENT ID:() 2012-06-25 14:07:00.911
- *** SERVICE NAME:() 2012-06-25 14:07:00.911
- *** MODULE NAME:() 2012-06-25 14:07:00.911
- *** ACTION NAME:() 2012-06-25 14:07:00.911
- NOTE: initiating MARK startup
- *** 2012-06-26 14:38:18.936
- NOTE: ASMB terminating
- ORA-15064: communication failure with ASM instance
- ORA-03113: end-of-file on communication channel
- Process ID:
- Session ID: 103 Serial number: 5
- error 15064 detected in background process
- ORA-15064: communication failure with ASM instance
- ORA-03113: end-of-file on communication channel
- Process ID:
- Session ID: 103 Serial number: 5
- kjzduptcctx: Notifying DIAG for crash event
- ----- Abridged Call Stack Trace -----
- ksedsts()+461<-kjzdssdmp()+267<-kjzduptcctx()+232<-kjzdicrshnfy()+53<-ksuitm()+1325<-ksbrdp()+3344<-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+266<-ssthrdmain()+252<-main()+201<-__libc_start_main()+244<-_start()+36
- ----- End of Abridged Call Stack Trace -----
- *** 2012-06-26 14:38:19.509
- ASMB (ospid: 12325): terminating the instance due to error 15064
- *** 2012-06-26 14:38:30.826
- Instance termination failed to kill one or more processes
- ksuitm_check: OS PID=13918 is still alive
- ksuitm_check: OS PID=13914 is still alive
- ksuitm_check: OS PID=13910 is still alive
- ksuitm_check: OS PID=13905 is still alive
- ksuitm_check: OS PID=12309 is still alive
- ksuitm_check: OS PID=12305 is still alive
- ksuitm_check: OS PID=12301 is still alive
- ksuitm_check: OS PID=12297 is still alive
- ksuitm_check: OS PID=12293 is still alive
- ksuitm_check: OS PID=12289 is still alive
- ksuitm_check: OS PID=12285 is still alive
- ksuitm_check: OS PID=12281 is still alive
- ksuitm_check: OS PID=12277 is still alive
- ksuitm_check: OS PID=12273 is still alive
- ksuitm_check: OS PID=12229 is still alive
ocssd.log
- 2012-06-26 14:38:14.007: [ CSSD][1077279040]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 196740 msecs
- 2012-06-26 14:38:16.543: [ CSSD][1115167040]clssnmHandleMeltdownStatus: node bjyq-hist-par-db01, number 1, has experienced a failure in thread number 9 and is shutting down
- 2012-06-26 14:38:16.984: [ CSSD][1101257024](:CSSNM00058:)clssnmvDiskCheck: No I/O completions for 200720 ms for voting file /dev/mapper/crsdisk001)
- 2012-06-26 14:38:16.984: [ CSSD][1101257024]clssnmvDiskAvailabilityChange: voting file /dev/mapper/crsdisk001 now offline
- 2012-06-26 14:38:16.984: [ CSSD][1101257024](:CSSNM00018:)clssnmvDiskCheck: Aborting, 0 of 1 configured voting disks available, need 1
- 2012-06-26 14:38:16.984: [ CSSD][1101257024]###################################
- 2012-06-26 14:38:16.984: [ CSSD][1101257024]clssscExit: CSSD aborting from thread clssnmvDiskPingMonitorThread
- 2012-06-26 14:38:16.984: [ CSSD][1101257024]###################################
- 2012-06-26 14:38:16.984: [ CSSD][1101257024](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally
- 2012-06-26 14:38:16.984: [ SKGFD][1107282240]Lib :UFS:: closing handle 0x2aaaac04fa00 for disk :/dev/mapper/crsdisk001:
- 2012-06-26 14:38:16.984: [ CSSD][1101257024]
- ----- Call Stack Trace -----
- 2012-06-26 14:38:16.984: [ CSSD][1101257024]calling call entry argument values in hex
- 2012-06-26 14:38:16.984: [ CSSD][1101257024]location type point (? means dubious value)
- 2012-06-26 14:38:16.984: [ CSSD][1101257024]-------------------- -------- -------------------- ----------------------------
- 2012-06-26 14:38:17.012: [ CSSD][1101257024]clssscExit()+726 call kgdsdst() 000000000 ? 000000000 ?
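The key figure in this ocssd.log excerpt is how long the voting file went without completed I/O. As a hedged sketch (the helper name `voting_file_stalls` is hypothetical, and the input is assumed to be raw log lines without the "- " markers), the stall can be pulled out like this:

```python
import re

# Matches the CSSD message that reports a voting file with no completed I/O.
# The log line ends with a stray ")" which is excluded from the path.
NO_IO_RE = re.compile(
    r"No I/O completions for (\d+) ms for voting file (\S+?)\)?$"
)

def voting_file_stalls(lines):
    """Return (voting file path, stall in seconds) for each matching line."""
    stalls = []
    for line in lines:
        match = NO_IO_RE.search(line.strip())
        if match:
            stalls.append((match.group(2), int(match.group(1)) / 1000.0))
    return stalls

# Line copied from the ocssd.log excerpt above.
sample = [
    "2012-06-26 14:38:16.984: [ CSSD][1101257024](:CSSNM00058:)clssnmvDiskCheck: "
    "No I/O completions for 200720 ms for voting file /dev/mapper/crsdisk001)",
]

for path, seconds in voting_file_stalls(sample):
    print(path, seconds)
```

The stall reported here is a little over 200 seconds. That is close to the 200-second value usually quoted as the default CSS disk timeout, but treat that threshold as an assumption and confirm the configured value on the cluster with `crsctl get css disktimeout`.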
The ASM log contains the following passage:
- SUCCESS: diskgroup ARCHDG was mounted
- GMON querying group 2 at 17 for pid 18, osid 25807
- NOTE: Instance updated compatible.asm to 11.2.0.0.0 for grp 2
- SUCCESS: diskgroup CRS was mounted
- GMON querying group 3 at 18 for pid 18, osid 25807
- NOTE: Instance updated compatible.asm to 11.2.0.0.0 for grp 3
- SUCCESS: diskgroup DATADG was mounted
- GMON querying group 4 at 19 for pid 18, osid 25807
- NOTE: Instance updated compatible.asm to 11.2.0.0.0 for grp 4
- SUCCESS: diskgroup IDXDG was mounted
- GMON querying group 5 at 20 for pid 18, osid 25807
- NOTE: Instance updated compatible.asm to 11.2.0.0.0 for grp 5
- SUCCESS: diskgroup SYSDG was mounted
- SUCCESS: ALTER DISKGROUP ALL MOUNT /* asm agent call crs *//* {0:0:2} */
- SQL> ALTER DISKGROUP ALL ENABLE VOLUME ALL /* asm agent *//* {0:0:2} */
- SUCCESS: ALTER DISKGROUP ALL ENABLE VOLUME ALL /* asm agent *//* {0:0:2} */
- Tue Jun 26 14:58:00 2012
This shows that the CRS disk group is managed inside ASM.
By this time node 1 had come back up, so I checked the log on rac1:
- Tue Jun 26 14:38:18 2012
- NOTE: ASMB process exiting, either shutdown is in progress
- NOTE: or foreground connected to ASMB was killed.
- Tue Jun 26 14:38:18 2012
- NOTE: client exited [18463]
- NOTE: force a map free for map id 2
- Tue Jun 26 14:38:20 2012
- PMON (ospid: 18351): terminating the instance due to error 481
- Tue Jun 26 14:38:20 2012
- ORA-1092 : opitsk aborting process
- Tue Jun 26 14:38:20 2012
- License high water mark = 75
- Termination issued to instance processes. Waiting for the processes to exit
- Tue Jun 26 14:38:30 2012
- Instance termination failed to kill one or more processes
- Instance terminated by PMON, pid = 18351
- Tue Jun 26 14:38:30 2012
- USER (ospid: 26836): terminating the instance
- Termination issued to instance processes. Waiting for the processes to exit
Log of the ASM1 instance:
- NOTE: ASMB process exiting, either shutdown is in progress
- NOTE: or foreground connected to ASMB was killed.
- Tue Jun 26 14:38:18 2012
- NOTE: client exited [18463]
- NOTE: force a map free for map id 2
- Tue Jun 26 14:38:20 2012
- PMON (ospid: 18351): terminating the instance due to error 481
- Tue Jun 26 14:38:20 2012
- ORA-1092 : opitsk aborting process
- Tue Jun 26 14:38:20 2012
- License high water mark = 75
- Termination issued to instance processes. Waiting for the processes to exit
- Tue Jun 26 14:38:30 2012
Before this point, I also noticed the following errors:
- ORA-15055: unable to connect to ASM instance
- ORA-00020: maximum number of processes (100) exceeded
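Thinking it over afterwards: the CRS disks live in ASM, and the application opened so many connections that the ASM processes limit was exceeded (ORA-00020, limit 100). At that point even the most basic CRS communication could not start a new process, which would cause the services to fail over. Two things are still unclear to me. First, why did node 1 go down? Second, why did the ASM instance restart itself during the failover?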
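One way to approach the two open questions is to place the events from the different logs on a single timeline. The sketch below is illustrative only: the events and source labels are copied by hand from the excerpts above, the helper names `at` and `timeline` are made up, and ordering events does not by itself prove which one caused the others.

```python
from datetime import datetime

def at(clock):
    """Parse a time of day on 2012-06-26, the date shown in the logs."""
    return datetime.strptime("2012-06-26 " + clock, "%Y-%m-%d %H:%M:%S")

# (time, source, message) tuples taken from the log excerpts in this post.
events = [
    (at("14:38:18"), "rac2 alert log", "ASMB terminating, ORA-15064 / ORA-03113"),
    (at("14:58:40"), "rac2 alert log", "Starting ORACLE instance (normal)"),
    (at("14:38:14"), "ocssd.log", "clssnmvDiskPingThread not scheduled for 196740 msecs"),
    (at("14:38:16"), "ocssd.log", "voting file /dev/mapper/crsdisk001 now offline, CSSD aborting"),
    (at("14:38:20"), "rac1 log", "PMON terminating the instance due to error 481"),
]

def timeline(items):
    """Sort events by time; entries with equal times keep their input order."""
    return sorted(items, key=lambda event: event[0])

for when, source, message in timeline(events):
    print(when.strftime("%H:%M:%S"), f"[{source}]", message)
```

Read this way, the voting file problem in ocssd.log (14:38:14 to 14:38:16) comes before the ASMB and PMON terminations (14:38:18 to 14:38:20), so the voting disk I/O stall may be worth investigating alongside the ORA-00020 theory.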
This article comes from the blog “成神之路”. Please do not repost.