RAC nodes misbehave after a reboot; checking the CRS status returns the following error
CRS-0184: Cannot communicate with the CRS daemon
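A few days ago I built a two-node RAC database.
I ran into a problem: whenever I reboot both hosts, the following happens.
All the virtual IPs end up on one of the nodes: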
# ifconfig -a
en0: flags=5e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),PSEG,LARGESEND,CHAIN>
inet 182.1.21.151 netmask 0xffff0000 broadcast 182.1.255.255
inet 182.1.21.156 netmask 0xffff0000 broadcast 182.1.255.255
inet 182.1.21.157 netmask 0xffff0000 broadcast 182.1.255.255 tcp_sendspace 131072 tcp_recvspace 65536
en1: flags=5e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),PSEG,LARGESEND,CHAIN>
inet 192.168.128.1 netmask 0xffffff00 broadcast 192.168.128.255
lo0: flags=e08084b<UP,BROADCAST,LOOPBACK,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT>
inet 127.0.0.1 netmask 0xff000000 broadcast 127.255.255.255
inet6 ::1/0
tcp_sendspace 131072 tcp_recvspace 131072 rfc1323 1
#
The other node was then started, but its VIP was not brought up on it during startup:
# ifconfig -a
en0: flags=5e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),PSEG,LARGESEND,CHAIN>
inet 182.1.21.152 netmask 0xffff0000 broadcast 182.1.255.255 tcp_sendspace 131072 tcp_recvspace 65536
en1: flags=5e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),PSEG,LARGESEND,CHAIN>
inet 192.168.128.2 netmask 0xffffff00 broadcast 192.168.128.255
lo0: flags=e08084b<UP,BROADCAST,LOOPBACK,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT>
inet 127.0.0.1 netmask 0xff000000 broadcast 127.255.255.255
inet6 ::1/0
tcp_sendspace 131072 tcp_recvspace 131072 rfc1323 1
Looking at the CRS processes on this node at this point, nothing appears to be wrong:
# ps -ef|grep crs
oracle 30076 40192 2 15:20:51 - 0:00 /u01/crs/oracle/product/10.2.0/crs/bin/ocssd.bin
root 30902 37254 11 15:21:18 - 0:00 /bin/sh /u01/crs/oracle/product/10.2.0/crs/bin/racgwrap check
oracle 36098 37854 0 15:20:59 - 0:00 /u01/crs/oracle/product/10.2.0/crs/bin/evmlogger.bin -o /u01/crs/oracle/product/10.2.0/crs/evm/log/evmlogger.info -l /u01/crs/oracle/product/10.2.0/crs/evm/log/evmlogger.log
root 37254 1 15 15:21:14 - 0:00 /u01/crs/oracle/product/10.2.0/crs/bin/crsd.bin restart
oracle 37854 1 0 15:18:38 - 0:00 /u01/crs/oracle/product/10.2.0/crs/bin/evmd.bin
root 39134 41568 0 15:20:49 - 0:00 /u01/crs/oracle/product/10.2.0/crs/bin/oprocd run -t 1000 -m 500 -f
root 39514 30902 7 15:21:18 - 0:00 /u01/crs/oracle/product/10.2.0/crs/bin/racgmain check
oracle 40192 35124 0 15:20:51 - 0:00 /bin/sh -c ulimit -c unlimited; cd /u01/crs/oracle/product/10.2.0/crs/log/zj1/cssd; /u01/crs/oracle/product/10.2.0/crs/bin/ocssd || exit $?
root 41376 38730 0 15:21:18 pts/1 0:00 grep crs
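However, the database and related services were not started on it; only those on the first node were running normally.
Checking the CRS status returns: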
$crs_stat -t
CRS-0184: Cannot communicate with the CRS daemon
I then restarted CRS:
# /etc/init.crs stop
Shutting down Oracle Cluster Ready Services (CRS):
Apr 14 15:13:45.107 | INF | daemon shutting down
Stopping resources.
Successfully stopped CRS resources
Stopping CSSD.
Shutting down CSS daemon.
Shutdown request successfully issued.
Shutdown has begun. The daemons should exit soon.
# /etc/init.crs start
Startup will be queued to init within 30 seconds.
After that, everything was back to normal.
===================================
But the problem came back on the next reboot. I needed to find the root cause instead of going through this every time.
crsd.log contained:
2008-04-15 09:38:33.286: [ CRSMAIN][10548]32Failed to spawn a thread for UI connection. error=-11
2008-04-15 09:49:15.441: [ CRSMAIN][10548]32Failed to spawn a thread for UI connection. error=-11
......
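To get a quick sense of how often this message appears, something like the following could be used. This is only a sketch: the function name `count_spawn_failures` is made up for illustration, and the log path in the usage note is an assumption based on the CRS home seen in the `ps` output above.

```shell
# Count "Failed to spawn a thread" entries in a crsd.log-style file.
# $1 = path to the log file (hypothetical helper, not an Oracle tool).
count_spawn_failures() {
    # grep -c prints the number of matching lines (0 if there are none)
    grep -c 'Failed to spawn a thread' "$1"
}
```

For example, assuming the log lives under the CRS home in `log/<hostname>/crsd/`: `count_spawn_failures /u01/crs/oracle/product/10.2.0/crs/log/zj1/crsd/crsd.log`.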
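To my surprise, Metalink had no solution for this specific error.
Searching Metalink for everything on CRS-0184 turned up mentions of /etc/security/limits, suggesting that root's ulimit parameters may need to be changed, even though Oracle's installation documentation does not mention changing any parameters for root.
I edited /etc/security/limits on both nodes
and changed the following settings: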
root:
fsize = -1
core = 2097151
cpu = -1
data = -1
rss = -1
stack = -1
nofiles = -1
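To confirm that both nodes carry the same root stanza after editing, a small helper along these lines could print it. This is a sketch under the assumption that the file follows the usual stanza layout (a `user:` header line followed by `attribute = value` lines, with `*` starting a comment); `show_limits_stanza` is a hypothetical name, not an AIX command.

```shell
# Print the attribute lines of one user's stanza from a limits-style file.
# $1 = file to read, $2 = user name (hypothetical helper for illustration).
show_limits_stanza() {
    awk -v user="$2" '
        /^[ \t]*\*/ { next }                      # skip comment lines
        /^[ \t]*[A-Za-z0-9_]+:[ \t]*$/ {          # stanza header such as "root:"
            in_stanza = ($1 == (user ":"))
            next
        }
        in_stanza && NF {                         # non-empty line inside the stanza
            sub(/^[ \t]+/, "")                    # drop leading indentation
            print
        }
    ' "$1"
}
```

Usage: `show_limits_stanza /etc/security/limits root` on each node, then compare the output. Running `ulimit -a` in a fresh root login shows the limits that are actually in effect for that session.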
After rebooting the systems,
everything is normal:
zj1@oracle>ifconfig -a
en0: flags=5e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),PSEG,LARGESEND,CHAIN>
inet 182.1.21.151 netmask 0xffff0000 broadcast 182.1.255.255
inet 182.1.21.156 netmask 0xffff0000 broadcast 182.1.255.255
tcp_sendspace 131072 tcp_recvspace 65536
en1: flags=5e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),PSEG,LARGESEND,CHAIN>
inet 192.168.128.1 netmask 0xffffff00 broadcast 192.168.128.255
lo0: flags=e08084b<UP,BROADCAST,LOOPBACK,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT>
inet 127.0.0.1 netmask 0xff000000 broadcast 127.255.255.255
inet6 ::1/0
tcp_sendspace 131072 tcp_recvspace 131072 rfc1323 1
zj1@oracle >
zj2@oracle>ifconfig -a
en0: flags=5e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),PSEG,LARGESEND,CHAIN>
inet 182.1.21.152 netmask 0xffff0000 broadcast 182.1.255.255
inet 182.1.21.157 netmask 0xffff0000 broadcast 182.1.255.255 tcp_sendspace 131072 tcp_recvspace 65536
en1: flags=5e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),PSEG,LARGESEND,CHAIN>
inet 192.168.128.2 netmask 0xffffff00 broadcast 192.168.128.255
lo0: flags=e08084b<UP,BROADCAST,LOOPBACK,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT>
inet 127.0.0.1 netmask 0xff000000 broadcast 127.255.255.255
inet6 ::1/0
tcp_sendspace 131072 tcp_recvspace 131072 rfc1323 1
zj2@oracle >
zj1@oracle>crs_stat -t
Name Type Target State Host
------------------------------------------------------------
ora.mdzj.db application ONLINE ONLINE zj1
ora....j1.inst application ONLINE ONLINE zj1
ora....j2.inst application ONLINE ONLINE zj2
ora.....zj1.cs application ONLINE ONLINE zj2
ora....zj1.srv application ONLINE ONLINE zj1
ora....zj2.srv application ONLINE ONLINE zj2
ora.....zj2.cs application ONLINE ONLINE zj2
ora....zj1.srv application ONLINE ONLINE zj1
ora....zj2.srv application ONLINE ONLINE zj2
ora.....zj3.cs application ONLINE ONLINE zj2
ora....zj1.srv application ONLINE ONLINE zj1
ora....zj2.srv application ONLINE ONLINE zj2
ora....J1.lsnr application ONLINE ONLINE zj1
ora.zj1.gsd application ONLINE ONLINE zj1
ora.zj1.ons application ONLINE ONLINE zj1
ora.zj1.vip application ONLINE ONLINE zj1
ora....J2.lsnr application ONLINE ONLINE zj2
ora.zj2.gsd application ONLINE ONLINE zj2
ora.zj2.ons application ONLINE ONLINE zj2
ora.zj2.vip application ONLINE ONLINE zj2
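Because the operating system was freshly installed, I had overlooked changing root's ulimit parameters, and that is what caused this problem.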