Problems encountered while rebuilding the RAC OCR/voting disk

author:skate

time:2010-05-09

My test environment:

 

Host OS: win2003
Virtualization software: vmware3.2.1
Guest OS: centos4.7
oracle db: oracle10.2.0.1

 

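These are the errors I ran into while rebuilding the OCR/voting disk of my RAC cluster, together with how I resolved them, recorded here for reference.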


Summary of RAC failure symptoms:

 

0. Checking the status of CRS

 

Look at the processes directly with "ps -ef |grep d.bin":


[root@rac1 oracle]# ps -ef |grep d.bin
root     15716  6979  0 11:04 pts/0    00:00:00 grep d.bin
root     28240     1  1 09:29 ?        00:01:00 /u01/crs/oracle/product/10.2.0/crs/bin/crsd.bin reboot
oracle   29059 28223  0 09:32 ?        00:00:11 /u01/crs/oracle/product/10.2.0/crs/bin/evmd.bin
oracle   29209 29181  0 09:32 ?        00:00:44 /u01/crs/oracle/product/10.2.0/crs/bin/ocssd.bin

 

If the processes above (crsd.bin, evmd.bin, ocssd.bin) are present, CRS has started normally.

 

Alternatively, check with the command "crsctl check crs":

 

[oracle@rac2 ~]$ crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy

 

This output means CRS has started normally.

 

Output like the following, on the other hand, means CRS did not start successfully:

 

# ps -ef | grep css

root      6929     1  0 19:56 ?        00:00:00 /bin/sh /etc/init.d/init.cssd fatal
root      6960  6928  0 19:56 ?        00:00:00 /bin/sh /etc/init.d/init.cssd startcheck
root      6963  6929  0 19:56 ?        00:00:00 /bin/sh /etc/init.d/init.cssd startcheck
root      7064  6935  0 19:56 ?        00:00:00 /bin/sh /etc/init.d/init.cssd startcheck

 

In that case, check the CRS logs: crsd.log, ocssd.log, evmd.log
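The process check above can be scripted. This is a minimal sketch, not part of any Oracle tooling: the function name crs_daemons_up is made up for illustration, and it only looks for the three daemon names that appear in the ps listing shown earlier.

```shell
# Succeeds only if all three CRS daemons appear in the `ps -ef` text on stdin.
# crs_daemons_up is a hypothetical helper name, not an Oracle command.
crs_daemons_up() {
    ps_text=$(cat)
    for daemon in crsd.bin evmd.bin ocssd.bin; do
        # -F: fixed string, -q: quiet; a single missing daemon fails the check
        printf '%s\n' "$ps_text" | grep -F -q "/$daemon" || return 1
    done
    return 0
}

# Usage on a cluster node:
#   ps -ef | crs_daemons_up && echo "CRS daemons are running"
```

Matching on "/crsd.bin" rather than "d.bin" keeps the grep process itself out of the result.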

 

 

 

 


1. CRS failures:

 

1.1 Error: Insufficient user privileges.

 

Symptom:
[oracle@green ~]$ crsctl stop crs
Insufficient user privileges.

 

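Solution:
Stopping CRS has to be done as root. In my case root's own environment did not set $ORACLE_HOME or the CRS variables, so a root login shell could not find the command. Switching to root with a plain "su" from the oracle user keeps oracle's environment, and the command then works: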


[root@rac2 ~]# crsctl check crs
-bash: crsctl: command not found


[root@rac2 ~]# su - oracle


[oracle@rac2 ~]$ crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy


[oracle@rac2 ~]$ crsctl stop crs
Insufficient user privileges.

 

[oracle@rac2 ~]$ su
Password:
[root@rac2 oracle]# crsctl stop crs
Stopping resources.
Successfully stopped CRS resources.
Stopping CSSD.
Shutting down CSS daemon.
Shutdown request successfully issued.
[root@rac2 oracle]#

 

 


1.2 Error:

Failure 1 contacting CSS daemon
Cannot communicate with CRS
Cannot communicate with EVM


Symptom:
[root@rac2 oracle]# crsctl check crs
Failure 1 contacting CSS daemon
Cannot communicate with CRS
Cannot communicate with EVM

 

Check whether the CRS processes have started: ps -ef |grep d.bin

 

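After CRS is started (crsctl start crs), it takes a while to come up; checking too soon produces the error above.
Another cause of this error is clock drift between the nodes, and that is what I hit this time.
I used a simple rdate to keep the two nodes in sync; other methods work too, such as ntpdate or
setting up a time server.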
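Since the stack needs time to come up, it can help to poll instead of checking once. The following is a rough sketch under stated assumptions: wait_for_crs is a hypothetical helper, the attempt count and delay are arbitrary, and "healthy" is taken to mean three "appears healthy" lines, as in the crsctl output shown earlier.

```shell
# Poll a health-check command until it prints three "appears healthy" lines.
# Arguments: max attempts, seconds between attempts, then the command to run.
# wait_for_crs is a made-up name for this sketch.
wait_for_crs() {
    attempts=$1; delay=$2; shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        # grep -c prints the number of matching lines (CSS, CRS, EVM expected)
        healthy=$("$@" 2>&1 | grep -c 'appears healthy' || true)
        [ "$healthy" -eq 3 ] && return 0
        i=$((i + 1))
        [ "$i" -le "$attempts" ] && sleep "$delay"
    done
    return 1
}

# Usage as root on a node (values are only an example):
#   crsctl start crs
#   wait_for_crs 30 10 crsctl check crs || echo "CRS still not healthy, check time sync and logs"
```

If the loop times out, the clock difference between the nodes is worth checking before anything else.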

 

CRS can also be managed directly through the following init scripts:
 /etc/rc.d/init.d/init.crs
 /etc/rc.d/init.d/init.crsd
 /etc/rc.d/init.d/init.cssd
 /etc/rc.d/init.d/init.evmd


 

Reference: http://www.dbspecialists.com/files/presentations/rac_quick_reference.html


######################################################################################


ASM on rac2 would not start and reported the following error:


[oracle@rac1 ~]$ srvctl start asm -n rac2
PRKS-1009 : Failed to start ASM instance "+ASM2" on node "rac2", [PRKS-1009 : Failed to start ASM instance "+ASM2" on node "rac2", [CRS-0215: Could not start resource 'ora.rac2.ASM2.asm'.]]
  [PRKS-1009 : Failed to start ASM instance "+ASM2" on node "rac2", [CRS-0215: Could not start resource 'ora.rac2.ASM2.asm'.]]


I then tried starting it on its own with crs_start to see what it would report, and got a whole pile of errors:


[oracle@rac2 ~]$ srvctl start asm -n rac2
PRKS-1009 : Failed to start ASM instance "+ASM2" on node "rac2", [PRKS-1009 : Failed to start ASM instance "+ASM2" on node "rac2", [rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:SQL*Plus: Release 10.2.0.1.0 - Production on Thu May 6 15:25:59 2010
rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:Copyright (c) 1982, 2005, Oracle.  All rights reserved.
rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:Enter user-name: Connected to an idle instance.
rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:SQL> ORA-27504: IPC error creating OSD context
rac2:ora.rac2.ASM2.asm:ORA-27300: OS system dependent operation:if_not_found failed with status: 0
rac2:ora.rac2.ASM2.asm:ORA-27301: OS failure message: Error 0
rac2:ora.rac2.ASM2.asm:ORA-27302: failure occurred at: skgxpvaddr9
rac2:ora.rac2.ASM2.asm:ORA-27303: additional information: requested interface 192.0.22.0 not found. Check output from ifconfig command
rac2:ora.rac2.ASM2.asm:SQL> Disconnected
rac2:ora.rac2.ASM2.asm:
CRS-0215: Could not start resource 'ora.rac2.ASM2.asm'.]]
  [PRKS-1009 : Failed to start ASM instance "+ASM2" on node "rac2", [rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:SQL*Plus: Release 10.2.0.1.0 - Production on Thu May 6 15:25:59 2010
rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:Copyright (c) 1982, 2005, Oracle.  All rights reserved.
rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:Enter user-name: Connected to an idle instance.
rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:SQL> ORA-27504: IPC error creating OSD context
rac2:ora.rac2.ASM2.asm:ORA-27300: OS system dependent operation:if_not_found failed with status: 0
rac2:ora.rac2.ASM2.asm:ORA-27301: OS failure message: Error 0
rac2:ora.rac2.ASM2.asm:ORA-27302: failure occurred at: skgxpvaddr9
rac2:ora.rac2.ASM2.asm:ORA-27303: additional information: requested interface 192.0.22.0 not found. Check output from ifconfig command
rac2:ora.rac2.ASM2.asm:SQL> Disconnected
rac2:ora.rac2.ASM2.asm:
CRS-0215: Could not start resource 'ora.rac2.ASM2.asm'.]]

 

The error "ORA-27504: IPC error creating OSD context" usually means there is a problem with communication between the nodes.

 

First, check the /etc/hosts file.

 

The correct format looks like this:

 

[oracle@rac1 ~]$ more /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1      localhost.localdomain localhost

#skate add

# Public
192.168.2.31   rac1.localdomain        rac1
192.168.2.22   rac2.localdomain        rac2
#Private
192.168.0.31   rac1-priv.localdomain   rac1-priv
192.168.0.22   rac2-priv.localdomain   rac2-priv
#Virtual
192.168.2.131   rac1-vip.localdomain    rac1-vip
192.168.2.122   rac2-vip.localdomain    rac2-vip
[oracle@rac1 ~]$

 

 

My file was fine. When I discussed the problem in a chat group, everyone's attention, mine included, went to this error:
ORA-27303: additional information: requested interface 192.0.22.0 not found. Check output from ifconfig command


But what was causing it?

I first suspected the NIC configuration: perhaps a wrong IP setting or an MTU problem. After checking, though, my NIC settings were all normal.

 

Network on node rac1:


[root@rac1 tmp]# ip a
1: lo: <LOOPBACK,UP> mtu 16436 qdisc noqueue
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 brd 127.255.255.255 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:0c:29:2a:81:d3 brd ff:ff:ff:ff:ff:ff
    inet 192.168.2.31/24 brd 192.168.2.255 scope global eth0
    inet 192.168.2.131/24 brd 192.168.2.255 scope global secondary eth0:1
    inet6 fe80::20c:29ff:fe2a:81d3/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:0c:29:2a:81:dd brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.31/24 brd 192.168.0.255 scope global eth1
    inet6 fe80::20c:29ff:fe2a:81dd/64 scope link
       valid_lft forever preferred_lft forever
4: sit0: <NOARP> mtu 1480 qdisc noop
    link/sit 0.0.0.0 brd 0.0.0.0


Network on node rac2:


[root@rac2 ~]# ip a
1: lo: <LOOPBACK,UP> mtu 16436 qdisc noqueue
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 brd 127.255.255.255 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:0c:29:81:22:38 brd ff:ff:ff:ff:ff:ff
    inet 192.168.2.22/24 brd 192.168.2.255 scope global eth0
    inet 192.168.2.122/24 brd 192.168.2.255 scope global secondary eth0:1
    inet6 fe80::20c:29ff:fe81:2238/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:0c:29:81:22:42 brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.22/24 brd 192.168.0.255 scope global eth1
    inet6 fe80::20c:29ff:fe81:2242/64 scope link
       valid_lft forever preferred_lft forever
4: sit0: <NOARP> mtu 1480 qdisc noop
    link/sit 0.0.0.0 brd 0.0.0.0

 

 

After a lot more googling I found a post saying the following change might fix it:

 

1. Shut down the Oracle instance.
2. cd $ORACLE_HOME/rdbms/lib
3. make -f ins_rdbms.mk rac_off
4. make -f ins_rdbms.mk ioracle

 

I followed those steps, but they did not help; instead I got the following error:

 

[oracle@rac2 lib]$ srvctl start asm -n rac2
PRKS-1009 : Failed to start ASM instance "+ASM2" on node "rac2", [PRKS-1009 : Failed to start ASM instance "+ASM2" on node "rac2", [rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:SQL*Plus: Release 10.2.0.1.0 - Production on Thu May 6 16:34:23 2010
rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:Copyright (c) 1982, 2005, Oracle.  All rights reserved.
rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:Enter user-name: Connected to an idle instance.
rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:SQL> ORA-00439: feature not enabled: Real Application Clusters
rac2:ora.rac2.ASM2.asm:SQL> Disconnected
rac2:ora.rac2.ASM2.asm:
CRS-0215: Could not start resource 'ora.rac2.ASM2.asm'.]]
  [PRKS-1009 : Failed to start ASM instance "+ASM2" on node "rac2", [rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:SQL*Plus: Release 10.2.0.1.0 - Production on Thu May 6 16:34:23 2010
rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:Copyright (c) 1982, 2005, Oracle.  All rights reserved.
rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:Enter user-name: Connected to an idle instance.
rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:SQL> ORA-00439: feature not enabled: Real Application Clusters
rac2:ora.rac2.ASM2.asm:SQL> Disconnected
rac2:ora.rac2.ASM2.asm:
CRS-0215: Could not start resource 'ora.rac2.ASM2.asm'.]]

 

The error "ORA-00439: feature not enabled: Real Application Clusters" shows that the cluster feature had now been disabled, so I reversed the change:

 

1. Shut down the Oracle instance. (I skipped this step, since my instance never came up anyway.)
2. cd $ORACLE_HOME/rdbms/lib
3. make -f ins_rdbms.mk rac_on
4. make -f ins_rdbms.mk ioracle

 

After that, the original error came back. The other alert logs showed no errors, but the ASM2 alert log ended with two error lines:

 

[root@rac2 ~]# tail -50 /u01/app/oracle/admin/+ASM/bdump/alert_+ASM2.log |more
USER: terminating instance due to error 27504
Instance terminated by USER, pid = 27244
Fri May  7 03:26:47 2010
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Picked latch-free SCN scheme 2
Using LOG_ARCHIVE_DEST_1 parameter default value as /u01/app/oracle/product/10.2
.0 _1 s/arch
Autotune of undo retention is turned off.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
ksdpec: called for event 13740 prior to event group initialization
Starting up ORACLE RDBMS Version: 10.2.0.1.0.
System parameters with non-default values:
  large_pool_size          = 12582912
  instance_type            = asm
  cluster_interconnects    = 192,168.0.22
  cluster_database         = TRUE
  instance_number          = 2
  remote_login_passwordfile= EXCLUSIVE
  background_dump_dest     = /u01/app/oracle/admin/+ASM/bdump
  user_dump_dest           = /u01/app/oracle/admin/+ASM/udump
  core_dump_dest           = /u01/app/oracle/admin/+ASM ump
  asm_diskgroups           = DATA
USER: terminating instance due to error 27504
Instance terminated by USER, pid = 29732
Fri May  7 03:28:40 2010
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Picked latch-free SCN scheme 2
Using LOG_ARCHIVE_DEST_1 parameter default value as /u01/app/oracle/product/10.2
.0 _1 s/arch
Autotune of undo retention is turned off.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
ksdpec: called for event 13740 prior to event group initialization
Starting up ORACLE RDBMS Version: 10.2.0.1.0.
System parameters with non-default values:
  large_pool_size          = 12582912
  instance_type            = asm
  cluster_interconnects    = 192,168.0.22
  cluster_database         = TRUE
  instance_number          = 2
  remote_login_passwordfile= EXCLUSIVE
  background_dump_dest     = /u01/app/oracle/admin/+ASM/bdump
  user_dump_dest           = /u01/app/oracle/admin/+ASM/udump
  core_dump_dest           = /u01/app/oracle/admin/+ASM ump
  asm_diskgroups           = DATA
USER: terminating instance due to error 27504
Instance terminated by USER, pid = 31201


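This still did not pinpoint the problem. Meanwhile, I was able to start the database on rac1 with sqlplus. Finally, a friend in the group suggested looking at
the contents of the ASM2 parameter file.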


On checking, I found that I had written +ASM2.cluster_interconnects='192,168.0.22' where it should have been +ASM2.cluster_interconnects='192.168.0.22'.


I had typed a comma in place of a dot. I corrected it right away, started the ASM2 instance again, and this time it came up.

 

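Looking back, the error "ORA-27303: additional information: requested interface 192.0.22.0 not found. Check output from ifconfig command"
now makes sense: the interconnect address in the parameter file was malformed, so no matching interface could be found and communication between the nodes could not be set up.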
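A typo like this is easy to catch before the value goes into the parameter file. The function below is only an illustrative sketch (is_ipv4 is a hypothetical name); it accepts a plain dotted-quad address and rejects anything else, including a comma in place of a dot.

```shell
# Return 0 if $1 is a dotted-quad IPv4 address with every octet in 0-255.
# is_ipv4 is a made-up helper for this sketch.
is_ipv4() {
    case $1 in
        # reject empty input, stray characters such as ',', and misplaced dots
        ''|*[!0-9.]*|.*|*.|*..*) return 1 ;;
    esac
    old_ifs=$IFS; IFS=.
    set -- $1            # split the address on dots into $1..$n
    IFS=$old_ifs
    [ "$#" -eq 4 ] || return 1
    for octet in "$@"; do
        [ "${#octet}" -le 3 ] && [ "$octet" -le 255 ] || return 1
    done
    return 0
}

# Usage:
#   is_ipv4 '192,168.0.22' || echo "not a valid address"
```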

 

 


#####################################################################################

 

Error when starting the database

 

[oracle@rac2 ~]$ srvctl start database -d rac
PRKP-1001 : Error starting instance rac1 on node rac1
CRS-0215: Could not start resource 'ora.rac.rac1.inst'.
PRKP-1001 : Error starting instance rac2 on node rac2
CRS-0215: Could not start resource 'ora.rac.rac2.inst'.

 

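Although srvctl could not start the database, it could be started normally with sqlplus on each of the two nodes.

For this error, some posts online say the following fixes it: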

 

as root:
crsctl stop crs
rm -f /var/tmp/.oracle/*
crsctl start crs

 

Wait a while; once CRS is up again, the database should start normally.

 

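In my environment, however, the problem persisted. It then occurred to me that the case of the database and instance names might matter.
I had just registered them in the OCR in lowercase and suspected that was the cause, so I removed the lowercase entries
and registered them again in uppercase.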

 

This is the original lowercase registration:


[oracle@rac2 ~]$ srvctl add database -d rac -o /u01/app/oracle/product/10.2.0/db_1
[oracle@rac2 ~]$ srvctl add instance -d rac -i rac1 -n rac1
[oracle@rac2 ~]$ srvctl add instance -d rac -i rac2 -n rac2
[oracle@rac2 ~]$ srvctl modify instance -d rac -i rac1 -s +ASM1
[oracle@rac2 ~]$ srvctl modify instance -d rac -i rac2 -s +ASM2

 

Remove the lowercase entries:

 

[oracle@rac2 ~]$ srvctl remove instance -d rac -i rac1
Remove instance rac1 from the database rac? (y/[n]) y
[oracle@rac2 ~]$ srvctl remove instance -d rac -i rac2
Remove instance rac2 from the database rac? (y/[n]) y
[oracle@rac2 ~]$ srvctl remove database -d rac
Remove the database rac? (y/[n]) y
[oracle@rac2 ~]$

 

Register the database and instances in uppercase:

 

[oracle@rac2 ~]$ srvctl add database -d RAC -o $ORACLE_HOME
[oracle@rac2 ~]$ srvctl add instance -d RAC -i RAC1 -n rac1
[oracle@rac2 ~]$ srvctl add instance -d RAC -i RAC2 -n rac2
[oracle@rac2 ~]$ srvctl modify  instance -d RAC -i RAC1 -s +ASM1
[oracle@rac2 ~]$ srvctl modify  instance -d RAC -i RAC2 -s +ASM2

 

Then I started the database again and, to my surprise, it came up.

 

[oracle@rac2 ~]$ srvctl start database -d rac

 

 


######################################################################


Errors when removing the instance and database from the OCR:

 

[oracle@rac2 ~]$ srvctl remove instance -d rac -i rac1
Remove instance rac1 from the database rac? (y/[n]) y
PRKP-1023 : The instance {0} is still running.rac
[oracle@rac2 ~]$ srvctl remove instance -d rac -i rac1
Remove instance rac1 from the database rac? (y/[n]) y
PRKP-1023 : The instance {0} is still running.rac
[oracle@rac2 ~]$ srvctl remove instance -d rac
PRKO-2001 : Invalid command line syntax
[oracle@rac2 ~]$ srvctl remove database -d rac
Remove the database rac? (y/[n]) y
PRKP-1022 : The database rac is still running.

 

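Solution: stop all resources with crs_stop -all, then check the state of each one with crs_stat -t -v.
If any resource is in state UNKNOWN, those have to be stopped individually, one at a time.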
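Picking out the resources stuck in UNKNOWN can be scripted. This sketch assumes the block layout printed by plain crs_stat (NAME=..., TYPE=..., TARGET=..., STATE=... lines per resource) rather than the table from crs_stat -t, because the table abbreviates resource names. The function name unknown_resources is made up for illustration.

```shell
# Print the full name of every resource whose STATE line starts with UNKNOWN.
# Expects plain `crs_stat` style text on stdin (NAME=/STATE= lines); that
# layout is an assumption of this sketch. unknown_resources is a made-up name.
unknown_resources() {
    awk -F= '
        /^NAME=/          { name = $2 }   # remember the current resource
        /^STATE=UNKNOWN/  { print name }  # report it if its state is UNKNOWN
    '
}

# Usage on a cluster node:
#   crs_stat | unknown_resources
# Each name printed can then be passed to crs_stop one at a time.
```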

 

################################################################

 

Problem with ONS not running (onsctl):

 

[oracle@rac1 ~]$ onsctl ping
Number of onsconfiguration retrieved, numcfg = 0
ons is not running ...

 

Solution:
[oracle@rac1 ~]$ onsctl start
Number of onsconfiguration retrieved, numcfg = 0
Number of onsconfiguration retrieved, numcfg = 0
onsctl: ons started


[oracle@rac1 ~]$ onsctl ping
Number of onsconfiguration retrieved, numcfg = 0
ons is running ...


##########################################################

 

Symptom: oifcfg getif returns nothing

 

[root@rac2 public]# /u01/crs/oracle/product/10.2.0/crs/bin/oifcfg iflist
eth0  192.168.2.0
eth1  192.168.0.0
[root@rac2 public]# /u01/crs/oracle/product/10.2.0/crs/bin/oifcfg getif
[root@rac2 public]# /u01/crs/oracle/product/10.2.0/crs/bin/oifcfg getif -global
[root@rac2 public]# /u01/crs/oracle/product/10.2.0/crs/bin/oifcfg getif -global rac1
[root@rac2 public]# /u01/crs/oracle/product/10.2.0/crs/bin/oifcfg getif -global rac2

 

Register the network information in the OCR manually:

 

[oracle@rac2 public]# oifcfg setif -global eth0/192.168.2.0:public
[oracle@rac2 public]# oifcfg setif -global eth1/192.168.0.0:cluster_interconnect


After that, the query returns the interfaces:

 

[oracle@rac1 ~]$ oifcfg getif
eth0  192.168.2.0  global  public
eth1  192.168.0.0  global  cluster_interconnect
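To spot this situation quickly, the two oifcfg listings can be compared. The helper below is a sketch only (unregistered_ifaces is a hypothetical name): it takes the text of "oifcfg iflist" and "oifcfg getif" as two arguments and prints, in the name/subnet form that setif expects, every interface that has no entry in the getif output.

```shell
# Print "name/subnet" for each interface in the iflist text ($1) that has no
# line with the same name and subnet in the getif text ($2).
# unregistered_ifaces is a made-up helper for this sketch.
unregistered_ifaces() {
    printf '%s\n' "$1" | while read -r name subnet; do
        [ -n "$name" ] || continue      # skip blank lines
        printf '%s\n' "$2" |
            awk -v n="$name" -v s="$subnet" \
                '$1 == n && $2 == s { found = 1 } END { exit !found }' ||
            printf '%s/%s\n' "$name" "$subnet"
    done
}

# Usage on a cluster node:
#   unregistered_ifaces "$(oifcfg iflist)" "$(oifcfg getif)"
```

The script deliberately does not guess whether a missing interface is public or cluster_interconnect; that choice still has to be made by hand when running setif.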

 

#########################################################################

 

 

These are some of the errors I ran into while rebuilding the OCR/voting disk.

 

 

----------end---------

 

 

 

 
