author:skate
time:2010-05-09
My test environment:
Host OS: win2003
Virtualization software: vmware3.2.1
Guest OS: centos4.7
oracle db: oracle 10.2.0.1
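Below are the errors I ran into while rebuilding the RAC OCR/voting disk, together with how I resolved them, written down for the record.
Summary of RAC failure symptoms:
0. Checking the CRS status
Look at the processes directly with "ps -ef | grep d.bin":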
[root@rac1 oracle]# ps -ef |grep d.bin
root 15716 6979 0 11:04 pts/0 00:00:00 grep d.bin
root 28240 1 1 09:29 ? 00:01:00 /u01/crs/oracle/product/10.2.0/crs/bin/crsd.bin reboot
oracle 29059 28223 0 09:32 ? 00:00:11 /u01/crs/oracle/product/10.2.0/crs/bin/evmd.bin
oracle 29209 29181 0 09:32 ? 00:00:44 /u01/crs/oracle/product/10.2.0/crs/bin/ocssd.bin
If the processes above (crsd.bin, evmd.bin, ocssd.bin) are present, CRS has started normally.
Or check with the command "crsctl check crs":
[oracle@rac2 ~]$ crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
With the output above, CRS has started normally.
A situation like the following means CRS did not start successfully:
# ps -ef | grep css
root 6929 1 0 19:56 ? 00:00:00 /bin/sh /etc/init.d/init.cssd fatal
root 6960 6928 0 19:56 ? 00:00:00 /bin/sh /etc/init.d/init.cssd startcheck
root 6963 6929 0 19:56 ? 00:00:00 /bin/sh /etc/init.d/init.cssd startcheck
root 7064 6935 0 19:56 ? 00:00:00 /bin/sh /etc/init.d/init.cssd startcheck
In that case, check the CRS logs: crsd.log, ocssd.log, evmd.log
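The "crsctl check crs" test above is easy to script. The following is a minimal sketch, assuming the three "appears healthy" lines shown earlier are the only healthy output to recognise; the function name crs_is_healthy is my own and not part of Oracle's tooling.

```shell
#!/bin/bash
# crs_is_healthy: read `crsctl check crs` output on stdin and succeed only
# if CSS, CRS and EVM all report "appears healthy".
# (Hypothetical helper name; the matched strings come from the output above.)
crs_is_healthy() {
    local out component
    out=$(cat)
    for component in CSS CRS EVM; do
        # Each component must have its own "appears healthy" line.
        printf '%s\n' "$out" | grep -q "^${component} appears healthy" || return 1
    done
    return 0
}

# Usage on a cluster node:
#   crsctl check crs | crs_is_healthy && echo "CRS is up"
```

Matching on the text rather than on the exit status of crsctl is a deliberate choice here, since I have not verified what exit status crsctl returns in each failure case.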
1. CRS failures:
1.1 Error: Insufficient user privileges.
Symptom:
[oracle@green ~]$ crsctl stop crs
Insufficient user privileges.
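Resolution:
crsctl stop crs has to be run as root. Root's own environment does not set $ORACLE_HOME or the CRS variables, so a root login shell reports that the command does not exist; switching to root with a plain su from the oracle user keeps oracle's environment, and the command then works: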
[root@rac2 ~]# crsctl check crs
-bash: crsctl: command not found
[root@rac2 ~]# su - oracle
[oracle@rac2 ~]$ crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
[oracle@rac2 ~]$ crsctl stop crs
Insufficient user privileges.
[oracle@rac2 ~]$ su
Password:
[root@rac2 oracle]# crsctl stop crs
Stopping resources.
Successfully stopped CRS resources.
Stopping CSSD.
Shutting down CSS daemon.
Shutdown request successfully issued.
[root@rac2 oracle]#
1.2 Error:
Failure 1 contacting CSS daemon
Cannot communicate with CRS
Cannot communicate with EVM
Symptom:
[root@rac2 oracle]# crsctl check crs
Failure 1 contacting CSS daemon
Cannot communicate with CRS
Cannot communicate with EVM
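Check whether the CRS processes have started: ps -ef | grep d.bin
After CRS is started (crsctl start crs), it takes a little while to come up; checking too soon produces the error above.
Another cause of this error is clock drift between the nodes. That was the cause in my case.
I used a simple rdate to keep the two nodes in sync; other methods work as well, such as ntpdate or setting up
a time server.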
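Since CRS needs a moment to come up after "crsctl start crs", polling is more reliable than checking once. Here is a minimal sketch of a generic retry loop; the name retry and its arguments are my own, and the attempt count and delay in the usage line are arbitrary values to adjust.

```shell
#!/bin/bash
# retry MAX DELAY CMD [ARGS...]: run CMD up to MAX times, sleeping DELAY
# seconds between attempts; succeed as soon as CMD succeeds.
# (Hypothetical helper; name and arguments are illustrative.)
retry() {
    local max="$1" delay="$2" attempt=1
    shift 2
    while [ "$attempt" -le "$max" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        attempt=$((attempt + 1))
        # Sleep only if another attempt will follow.
        if [ "$attempt" -le "$max" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Usage (as root, after "crsctl start crs"):
#   retry 30 10 sh -c 'crsctl check crs | grep -q "^CSS appears healthy"' \
#       || echo "CRS still not up; check ocssd.log and the node clocks"
```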
CRS can also be managed directly through the following scripts:
/etc/rc.d/init.d/init.crs
/etc/rc.d/init.d/init.crsd
/etc/rc.d/init.d/init.cssd
/etc/rc.d/init.d/init.evmd
Reference: http://www.dbspecialists.com/files/presentations/rac_quick_reference.html
######################################################################################
ASM on rac2 fails to start, with the following error:
[oracle@rac1 ~]$ srvctl start asm -n rac2
PRKS-1009 : Failed to start ASM instance "+ASM2" on node "rac2", [PRKS-1009 : Failed to start ASM instance "+ASM2" on node "rac2", [CRS-0215: Could not start resource 'ora.rac2.ASM2.asm'.]]
[PRKS-1009 : Failed to start ASM instance "+ASM2" on node "rac2", [CRS-0215: Could not start resource 'ora.rac2.ASM2.asm'.]]
Then I started it on its own with crs_start to see what it would report, and got another pile of errors:
[oracle@rac2 ~]$ srvctl start asm -n rac2
PRKS-1009 : Failed to start ASM instance "+ASM2" on node "rac2", [PRKS-1009 : Failed to start ASM instance "+ASM2" on node "rac2", [rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:SQL*Plus: Release 10.2.0.1.0 - Production on Thu May 6 15:25:59 2010
rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:Copyright (c) 1982, 2005, Oracle. All rights reserved.
rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:Enter user-name: Connected to an idle instance.
rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:SQL> ORA-27504: IPC error creating OSD context
rac2:ora.rac2.ASM2.asm:ORA-27300: OS system dependent operation:if_not_found failed with status: 0
rac2:ora.rac2.ASM2.asm:ORA-27301: OS failure message: Error 0
rac2:ora.rac2.ASM2.asm:ORA-27302: failure occurred at: skgxpvaddr9
rac2:ora.rac2.ASM2.asm:ORA-27303: additional information: requested interface 192.0.22.0 not found. Check output from ifconfig command
rac2:ora.rac2.ASM2.asm:SQL> Disconnected
rac2:ora.rac2.ASM2.asm:
CRS-0215: Could not start resource 'ora.rac2.ASM2.asm'.]]
[PRKS-1009 : Failed to start ASM instance "+ASM2" on node "rac2", [rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:SQL*Plus: Release 10.2.0.1.0 - Production on Thu May 6 15:25:59 2010
rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:Copyright (c) 1982, 2005, Oracle. All rights reserved.
rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:Enter user-name: Connected to an idle instance.
rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:SQL> ORA-27504: IPC error creating OSD context
rac2:ora.rac2.ASM2.asm:ORA-27300: OS system dependent operation:if_not_found failed with status: 0
rac2:ora.rac2.ASM2.asm:ORA-27301: OS failure message: Error 0
rac2:ora.rac2.ASM2.asm:ORA-27302: failure occurred at: skgxpvaddr9
rac2:ora.rac2.ASM2.asm:ORA-27303: additional information: requested interface 192.0.22.0 not found. Check output from ifconfig command
rac2:ora.rac2.ASM2.asm:SQL> Disconnected
rac2:ora.rac2.ASM2.asm:
CRS-0215: Could not start resource 'ora.rac2.ASM2.asm'.]]
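The error "ORA-27504: IPC error creating OSD context" generally means there is a problem with communication between the nodes.
First check the /etc/hosts file.
The correct format looks like this: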
[oracle@rac1 ~]$ more /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 localhost.localdomain localhost
#skate add
# Public
192.168.2.31 rac1.localdomain rac1
192.168.2.22 rac2.localdomain rac2
#Private
192.168.0.31 rac1-priv.localdomain rac1-priv
192.168.0.22 rac2-priv.localdomain rac2-priv
#Virtual
192.168.2.131 rac1-vip.localdomain rac1-vip
192.168.2.122 rac2-vip.localdomain rac2-vip
[oracle@rac1 ~]$
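A check like the one above can be automated on each node. The following is a minimal sketch, assuming the public, -priv and -vip names shown in this file; the function name hosts_has_names is my own.

```shell
#!/bin/bash
# hosts_has_names FILE NAME...: succeed only if every NAME appears as a
# hostname or alias on a non-comment line of FILE.
# (Hypothetical helper; the names passed in are whatever your cluster uses.)
hosts_has_names() {
    local file="$1" name missing=0
    shift
    for name in "$@"; do
        if ! awk -v n="$name" '
            /^[ \t]*#/ { next }                 # skip full-line comments
            {
                for (i = 2; i <= NF; i++) {
                    if ($i ~ /^#/) break        # stop at a trailing comment
                    if ($i == n) found = 1
                }
            }
            END { exit !found }
        ' "$file"; then
            echo "missing from $file: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
#   hosts_has_names /etc/hosts rac1 rac2 rac1-priv rac2-priv rac1-vip rac2-vip
```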
My file was fine. When I discussed this in a chat group, everyone, myself included, focused on the following error:
ORA-27303: additional information: requested interface 192.0.22.0 not found. Check output from ifconfig command
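But what was causing it?
I first suspected the NIC settings: perhaps a wrong IP configuration, or an MTU problem. After checking, though, my NIC settings were all normal.
Network on node rac1: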
[root@rac1 tmp]# ip a
1: lo: <LOOPBACK,UP> mtu 16436 qdisc noqueue
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 brd 127.255.255.255 scope host lo
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 1000
link/ether 00:0c:29:2a:81:d3 brd ff:ff:ff:ff:ff:ff
inet 192.168.2.31/24 brd 192.168.2.255 scope global eth0
inet 192.168.2.131/24 brd 192.168.2.255 scope global secondary eth0:1
inet6 fe80::20c:29ff:fe2a:81d3/64 scope link
valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 1000
link/ether 00:0c:29:2a:81:dd brd ff:ff:ff:ff:ff:ff
inet 192.168.0.31/24 brd 192.168.0.255 scope global eth1
inet6 fe80::20c:29ff:fe2a:81dd/64 scope link
valid_lft forever preferred_lft forever
4: sit0: <NOARP> mtu 1480 qdisc noop
link/sit 0.0.0.0 brd 0.0.0.0
Network on node rac2:
[root@rac2 ~]# ip a
1: lo: <LOOPBACK,UP> mtu 16436 qdisc noqueue
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 brd 127.255.255.255 scope host lo
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 1000
link/ether 00:0c:29:81:22:38 brd ff:ff:ff:ff:ff:ff
inet 192.168.2.22/24 brd 192.168.2.255 scope global eth0
inet 192.168.2.122/24 brd 192.168.2.255 scope global secondary eth0:1
inet6 fe80::20c:29ff:fe81:2238/64 scope link
valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 1000
link/ether 00:0c:29:81:22:42 brd ff:ff:ff:ff:ff:ff
inet 192.168.0.22/24 brd 192.168.0.255 scope global eth1
inet6 fe80::20c:29ff:fe81:2242/64 scope link
valid_lft forever preferred_lft forever
4: sit0: <NOARP> mtu 1480 qdisc noop
link/sit 0.0.0.0 brd 0.0.0.0
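The ORA-27303 message asks for exactly this kind of comparison between the requested address and the interface list. Here is a minimal sketch that looks for an address in "ip a" output; the function name ip_is_assigned is my own.

```shell
#!/bin/bash
# ip_is_assigned ADDR: read `ip a` output on stdin and succeed if ADDR is
# configured on some interface, i.e. a line of the form "inet ADDR/prefix".
# (Hypothetical helper name.)
ip_is_assigned() {
    awk -v a="$1" '
        $1 == "inet" {
            split($2, parts, "/")       # drop the /prefix length
            if (parts[1] == a) found = 1
        }
        END { exit !found }
    '
}

# Usage on rac2:
#   ip a | ip_is_assigned 192.168.0.22 || echo "interconnect address not found"
```

Run against the rac2 listing above, 192.168.0.22 is found on eth1, while the 192.0.22.0 from the error message is not on any interface.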
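I googled for a long while and found a post saying the following changes would fix it:
1. Shut down the Oracle instance.
2. cd $ORACLE_HOME/rdbms/lib
3. make -f ins_rdbms.mk rac_off
4. make -f ins_rdbms.mk ioracle
I followed these steps, but they did not help; instead I got the following error: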
[oracle@rac2 lib]$ srvctl start asm -n rac2
PRKS-1009 : Failed to start ASM instance "+ASM2" on node "rac2", [PRKS-1009 : Failed to start ASM instance "+ASM2" on node "rac2", [rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:SQL*Plus: Release 10.2.0.1.0 - Production on Thu May 6 16:34:23 2010
rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:Copyright (c) 1982, 2005, Oracle. All rights reserved.
rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:Enter user-name: Connected to an idle instance.
rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:SQL> ORA-00439: feature not enabled: Real Application Clusters
rac2:ora.rac2.ASM2.asm:SQL> Disconnected
rac2:ora.rac2.ASM2.asm:
CRS-0215: Could not start resource 'ora.rac2.ASM2.asm'.]]
[PRKS-1009 : Failed to start ASM instance "+ASM2" on node "rac2", [rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:SQL*Plus: Release 10.2.0.1.0 - Production on Thu May 6 16:34:23 2010
rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:Copyright (c) 1982, 2005, Oracle. All rights reserved.
rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:Enter user-name: Connected to an idle instance.
rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:SQL> ORA-00439: feature not enabled: Real Application Clusters
rac2:ora.rac2.ASM2.asm:SQL> Disconnected
rac2:ora.rac2.ASM2.asm:
CRS-0215: Could not start resource 'ora.rac2.ASM2.asm'.]]
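The error "ORA-00439: feature not enabled: Real Application Clusters" shows that the cluster feature had now been disabled, so I reversed the change:
1. Shut down the Oracle instance. (I skipped this step, since my instance never came up in the first place.)
2. cd $ORACLE_HOME/rdbms/lib
3. make -f ins_rdbms.mk rac_on
4. make -f ins_rdbms.mk ioracle
After that, the earlier error came back. The other alert logs showed no errors, but the last two lines of the ASM2 alert log did: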
[root@rac2 ~]# tail -50 /u01/app/oracle/admin/+ASM/bdump/alert_+ASM2.log |more
USER: terminating instance due to error 27504
Instance terminated by USER, pid = 27244
Fri May 7 03:26:47 2010
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Picked latch-free SCN scheme 2
Using LOG_ARCHIVE_DEST_1 parameter default value as /u01/app/oracle/product/10.2.0/db_1/dbs/arch
Autotune of undo retention is turned off.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
ksdpec: called for event 13740 prior to event group initialization
Starting up ORACLE RDBMS Version: 10.2.0.1.0.
System parameters with non-default values:
large_pool_size = 12582912
instance_type = asm
cluster_interconnects = 192,168.0.22
cluster_database = TRUE
instance_number = 2
remote_login_passwordfile= EXCLUSIVE
background_dump_dest = /u01/app/oracle/admin/+ASM/bdump
user_dump_dest = /u01/app/oracle/admin/+ASM/udump
core_dump_dest = /u01/app/oracle/admin/+ASM/cdump
asm_diskgroups = DATA
USER: terminating instance due to error 27504
Instance terminated by USER, pid = 29732
Fri May 7 03:28:40 2010
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Picked latch-free SCN scheme 2
Using LOG_ARCHIVE_DEST_1 parameter default value as /u01/app/oracle/product/10.2.0/db_1/dbs/arch
Autotune of undo retention is turned off.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
ksdpec: called for event 13740 prior to event group initialization
Starting up ORACLE RDBMS Version: 10.2.0.1.0.
System parameters with non-default values:
large_pool_size = 12582912
instance_type = asm
cluster_interconnects = 192,168.0.22
cluster_database = TRUE
instance_number = 2
remote_login_passwordfile= EXCLUSIVE
background_dump_dest = /u01/app/oracle/admin/+ASM/bdump
user_dump_dest = /u01/app/oracle/admin/+ASM/udump
core_dump_dest = /u01/app/oracle/admin/+ASM/cdump
asm_diskgroups = DATA
USER: terminating instance due to error 27504
Instance terminated by USER, pid = 31201
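This still did not pinpoint the error. Meanwhile, I could start the database successfully on rac1 with sqlplus. Finally, a friend in the group suggested looking at
the contents of the ASM2 parameter file.
On checking, I found that I had written +ASM2.cluster_interconnects='192.168.0.22' as +ASM2.cluster_interconnects='192,168.0.22',
with a comma where the first dot should be (the same bad value is visible in the alert log above). I corrected it right away, started the ASM2 instance again, and it came up.
Looking back, the error ORA-27303: additional information: requested interface 192.0.22.0 not found. Check output from ifconfig command
now makes sense: the malformed address matched no interface on the node, so the interconnect between the nodes could not be set up.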
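A typo like this is cheap to catch before starting the instance. The following is a minimal sketch of a dotted-quad check for the value given to cluster_interconnects; the function name is_ipv4 is my own, and it only validates the format, not whether the address exists on the node.

```shell
#!/bin/bash
# is_ipv4 STR: succeed only if STR is four dot-separated decimal numbers,
# each in the range 0-255. (Hypothetical helper name.)
is_ipv4() {
    local addr="$1" octet count=0
    case $addr in
        # Reject empty input, any character other than digits and dots,
        # leading or trailing dots, and empty octets.
        "" | *[!0-9.]* | .* | *. | *..*) return 1 ;;
    esac
    local IFS=.
    for octet in $addr; do
        count=$((count + 1))
        [ "${#octet}" -le 3 ] && [ "$octet" -le 255 ] || return 1
    done
    [ "$count" -eq 4 ]
}

# Usage:
#   is_ipv4 '192,168.0.22' || echo "cluster_interconnects is not a valid IPv4 address"
```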
#####################################################################################
Error when starting the database
[oracle@rac2 ~]$ srvctl start database -d rac
PRKP-1001 : Error starting instance rac1 on node rac1
CRS-0215: Could not start resource 'ora.rac.rac1.inst'.
PRKP-1001 : Error starting instance rac2 on node rac2
CRS-0215: Could not start resource 'ora.rac.rac2.inst'.
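Although srvctl could not start the database, it could be started normally with sqlplus on each of the two nodes.
For this error, some posts online say the following fixes it: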
as root:
crsctl stop crs
rm -f /var/tmp/.oracle/*
crsctl start crs
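Wait a while; once CRS is up normally, the database can be started.
In my environment, however, the problem remained. Then it occurred to me that the case of the database and instance names might matter.
Everything I had just registered in the OCR was in lowercase, and I suspected that was the cause. So I removed the lowercase
entries and registered them again in uppercase.
This is the original lowercase registration: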
[oracle@rac2 ~]$ srvctl add database -d rac -o /u01/app/oracle/product/10.2.0/db_1
[oracle@rac2 ~]$ srvctl add instance -d rac -i rac1 -n rac1
[oracle@rac2 ~]$ srvctl add instance -d rac -i rac2 -n rac2
[oracle@rac2 ~]$ srvctl modify instance -d rac -i rac1 -s +ASM1
[oracle@rac2 ~]$ srvctl modify instance -d rac -i rac2 -s +ASM2
Remove the lowercase entries:
[oracle@rac2 ~]$ srvctl remove instance -d rac -i rac1
Remove instance rac1 from the database rac? (y/[n]) y
[oracle@rac2 ~]$ srvctl remove instance -d rac -i rac2
Remove instance rac2 from the database rac? (y/[n]) y
[oracle@rac2 ~]$ srvctl remove database -d rac
Remove the database rac? (y/[n]) y
[oracle@rac2 ~]$
Register the database and instances in uppercase:
[oracle@rac2 ~]$ srvctl add database -d RAC -o $ORACLE_HOME
[oracle@rac2 ~]$ srvctl add instance -d RAC -i RAC1 -n rac1
[oracle@rac2 ~]$ srvctl add instance -d RAC -i RAC2 -n rac2
[oracle@rac2 ~]$ srvctl modify instance -d RAC -i RAC1 -s +ASM1
[oracle@rac2 ~]$ srvctl modify instance -d RAC -i RAC2 -s +ASM2
Then I started the database again, and this time it actually came up.
[oracle@rac2 ~]$ srvctl start database -d rac
######################################################################
Errors when removing the instance and database from the OCR:
[oracle@rac2 ~]$ srvctl remove instance -d rac -i rac1
Remove instance rac1 from the database rac? (y/[n]) y
PRKP-1023 : The instance {0} is still running.rac
[oracle@rac2 ~]$ srvctl remove instance -d rac -i rac1
Remove instance rac1 from the database rac? (y/[n]) y
PRKP-1023 : The instance {0} is still running.rac
[oracle@rac2 ~]$ srvctl remove instance -d rac
PRKO-2001 : Invalid command line syntax
[oracle@rac2 ~]$ srvctl remove database -d rac
Remove the database rac? (y/[n]) y
PRKP-1022 : The database rac is still running.
Resolution: stop all services with crs_stop -all, then check the state of each service with crs_stat -t -v.
If any service shows a state of UNKNOWN, those have to be stopped one by one.
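Finding the UNKNOWN entries by eye gets tedious with many resources. Here is a minimal sketch that filters them out of the tabular listing, assuming the output has a header row beginning with "Name" and containing a "State" column (that layout is my assumption, not something shown in these notes); the function name crs_unknown_resources is my own.

```shell
#!/bin/bash
# crs_unknown_resources: read `crs_stat -t` (or `crs_stat -t -v`) output on
# stdin and print the name of every resource whose State column is UNKNOWN.
# Assumes a header row that starts with "Name" and includes "State".
# (Hypothetical helper name.)
crs_unknown_resources() {
    awk '
        $1 == "Name" {
            # Locate the State column from the header, so the extra
            # columns printed by -v do not matter.
            for (i = 1; i <= NF; i++) if ($i == "State") col = i
            next
        }
        col && $col == "UNKNOWN" { print $1 }
    '
}

# Usage:
#   crs_stat -t -v | crs_unknown_resources
```

The tabular listing may abbreviate long resource names, so treat the result as a pointer to which resources need attention rather than as names to feed straight back into crs_stop.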
################################################################
ONS startup problem (onsctl):
[oracle@rac1 ~]$ onsctl ping
Number of onsconfiguration retrieved, numcfg = 0
ons is not running ...
Resolution:
[oracle@rac1 ~]$ onsctl start
Number of onsconfiguration retrieved, numcfg = 0
Number of onsconfiguration retrieved, numcfg = 0
onsctl: ons started
[oracle@rac1 ~]$ onsctl ping
Number of onsconfiguration retrieved, numcfg = 0
ons is running ...
##########################################################
Symptom: oifcfg getif returns nothing
[root@rac2 public]# /u01/crs/oracle/product/10.2.0/crs/bin/oifcfg iflist
eth0 192.168.2.0
eth1 192.168.0.0
[root@rac2 public]# /u01/crs/oracle/product/10.2.0/crs/bin/oifcfg getif
[root@rac2 public]# /u01/crs/oracle/product/10.2.0/crs/bin/oifcfg getif -global
[root@rac2 public]# /u01/crs/oracle/product/10.2.0/crs/bin/oifcfg getif -global rac1
[root@rac2 public]# /u01/crs/oracle/product/10.2.0/crs/bin/oifcfg getif -global rac2
Register the network information in the OCR by hand:
[oracle@rac2 public]# oifcfg setif -global eth0/192.168.2.0:public
[oracle@rac2 public]# oifcfg setif -global eth1/192.168.0.0:cluster_interconnect
After that, the query returns the interfaces:
[oracle@rac1 ~]$ oifcfg getif
eth0 192.168.2.0 global public
eth1 192.168.0.0 global cluster_interconnect
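After a rebuild it is worth confirming that both roles are registered. The following is a minimal sketch that checks "oifcfg getif" output in the four-column format shown above; the function name oifcfg_has_roles is my own.

```shell
#!/bin/bash
# oifcfg_has_roles: read `oifcfg getif` output on stdin and succeed only if
# at least one interface is registered as "public" and at least one as
# "cluster_interconnect" (the role is the last field of each line).
# (Hypothetical helper name.)
oifcfg_has_roles() {
    awk '
        $NF == "public"               { pub = 1 }
        $NF == "cluster_interconnect" { ic = 1 }
        END { exit !(pub && ic) }
    '
}

# Usage:
#   oifcfg getif | oifcfg_has_roles || echo "interfaces missing from OCR; run oifcfg setif"
```

An empty result from oifcfg getif, as in the symptom above, fails this check.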
#########################################################################
These are some of the errors I ran into while rebuilding the OCR/voting disk.
----------end---------