A few days ago a colleague ran into a problem: in a certain RAC environment the SAs needed to apply a patch and wanted the cluster to run in single-node mode, so they first shut down one server and then asked the DBA to bring the instances up on the other.
This environment hosts two RAC databases on two servers, with each server running two instances. That is:
for the SIAP database, SIAP1 runs on server1 and SIAP2 on server2; for the SIMP database, SIMP1 runs on server1 and SIMP2 on server2.
The databases are 11g RAC; shared storage and the heartbeat are managed by Veritas, and the RAC cluster is managed by CRS.
At this point server2 was already down, and we wanted to start SIAP1 and SIMP1 on server1. The problem: SIMP1 started fine, but SIAP1 would not. The startup error was:
SQL> startup
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation: check if cable failed with status: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgxpcini1
ORA-27303: additional information: requested interface ce2 interface not running.
set _disable_interface_checking = TRUE to disable this check for single instance cluster.
Check output from ifconfig command
Judging from the error message, the ce2 NIC is not running, and Oracle suggests setting _disable_interface_checking to TRUE in order to bring the database up.
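For reference, a minimal sketch of that workaround, assuming the instance uses an spfile (we did not actually apply it here; /tmp/initSIAP1.ora is an arbitrary path chosen for illustration). _disable_interface_checking is a hidden parameter, so it has to be added to the pfile by hand; treat it as an emergency-only switch while the cluster runs on a single node, and remove it once both nodes are back:

create pfile='/tmp/initSIAP1.ora' from spfile;
-- append this line to /tmp/initSIAP1.ora by hand:
--   _disable_interface_checking=TRUE
startup pfile='/tmp/initSIAP1.ora'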
Since the situation was urgent, we had no time to dig into it; once the SAs brought the other server back up, SIAP1 started without trouble.
So the obvious question: why, on the very same server, could one instance start while the other could not?
Let's look at the RAC network configuration and find out which NIC ce2 is. (The environment is back to normal at this point: two servers, all four instances running.)
Information in CRS:
oracle@vus029pa:SIAP1:/opt/app/oracle/admin $ oifcfg getif
ce0 144.135.159.0 global public
ce6 144.135.159.0 global public
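Note that oifcfg reports only public interfaces here — nothing is registered as a cluster_interconnect in CRS (this matters later). Had the private NIC been registered, we would expect an extra line of the following form (hypothetical, using ce2's subnet):

ce2 192.168.0.0 global cluster_interconnect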
Configuration in the hosts file:
oracle@vus029pa:SIAP1:/opt/app/oracle/admin $ more /etc/hosts
#Public
144.135.159.111 vus029pa vus029pa.in.telstra.com.au loghost
#Oracle Virtual IP Addresses
144.135.159.110 osiiprd1dbr01.in.telstra.com.au
144.135.159.112 osiiprd1dbr02.in.telstra.com.au
# Private Interconnects
192.168.0.1 osiiprd1db2-priv osiiprd1db2-priv.in.telstra.com.au
192.168.0.2 osiiprd1db1-priv osiiprd1db1-priv.in.telstra.com.au
NIC configuration:
oracle@vus029pa:SIAP1:/opt/app/oracle/admin $ ifconfig -a
ce0: flags=1009040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER,FIXEDMTU> mtu 1500 index 2
        inet 144.135.159.13 netmask ffffff00 broadcast 144.135.159.255
        groupname clustermgmt-mnb
ce0:1: flags=1001000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,FIXEDMTU> mtu 1500 index 2
        inet 144.135.159.111 netmask ffffff00 broadcast 144.135.159.255
ce0:2: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
        inet 144.135.159.109 netmask ffffff00 broadcast 144.135.159.255
ce2: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 18
        inet 192.168.0.1 netmask ffffff00 broadcast 192.168.0.255
ce6: flags=1009040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER,FIXEDMTU> mtu 1500 index 6
        inet 144.135.159.63 netmask ffffff00 broadcast 144.135.159.255
        groupname clustermgmt-mnb
ce6:1: flags=1040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4> mtu 1500 index 6
        inet 144.135.159.110 netmask ffffff00 broadcast 144.135.159.255
We can see that the ce2 NIC carries IP 192.168.0.1, which the hosts file identifies as a private (interconnect) address. In other words, when SIAP1 starts, it checks whether the NIC holding the private address is up, and the instance can only start if it is.
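Incidentally, once an instance is up you can confirm which interface it registered for the interconnect and where that choice came from. A minimal sketch, assuming a running 10g/11g instance:

-- shows the interconnect interface(s) the running instance registered and
-- the SOURCE of that configuration (OCR, cluster_interconnects parameter, ...)
select name, ip_address, is_public, source
  from v$cluster_interconnects;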
The same kind of check of the private-network NIC also exists in an ASM + 10g RAC environment:
[root@rac1 ~]# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:0C:29:AE:9A:38
          inet addr:192.168.190.131  Bcast:192.168.190.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:feae:9a38/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:3648 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3809 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:362192 (353.7 KiB)  TX bytes:357537 (349.1 KiB)
          Interrupt:10 Base address:0x1480

eth1      Link encap:Ethernet  HWaddr 00:0C:29:AE:9A:42
          inet addr:10.10.10.31  Bcast:10.10.10.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:feae:9a42/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:595 errors:0 dropped:0 overruns:0 frame:0
          TX packets:22 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:107822 (105.2 KiB)  TX bytes:1092 (1.0 KiB)
          Interrupt:5 Base address:0x1800

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:41187 errors:0 dropped:0 overruns:0 frame:0
          TX packets:41187 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:12143968 (11.5 MiB)  TX bytes:12143968 (11.5 MiB)
[root@rac1 ~]# cat /etc/hosts |grep priv
10.10.10.31     rac1-priv.mycorpdomain.com      rac1-priv
10.10.10.32     rac2-priv.mycorpdomain.com      rac2-priv
10.10.10.33     rac3-priv.mycorpdomain.com      rac3-priv
[root@rac1 ~]#
[root@rac1 ~]#
[root@rac1 ~]# ifconfig eth1 down
[root@rac1 ~]# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:0C:29:AE:9A:38
          inet addr:192.168.190.131  Bcast:192.168.190.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:feae:9a38/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:3769 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3914 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:373764 (365.0 KiB)  TX bytes:368107 (359.4 KiB)
          Interrupt:10 Base address:0x1480

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:41240 errors:0 dropped:0 overruns:0 frame:0
          TX packets:41240 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:12145505 (11.5 MiB)  TX bytes:12145505 (11.5 MiB)

[root@rac1 ~]#
[root@rac1 ~]#
[root@rac1 ~]# su - oracle
rac1->
rac1->
rac1-> crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora.devdb.db   application    OFFLINE   OFFLINE
ora....b1.inst application    OFFLINE   OFFLINE
ora....b2.inst application    OFFLINE   OFFLINE
ora....b3.inst application    OFFLINE   OFFLINE
ora....SM1.asm application    OFFLINE   OFFLINE
ora....C1.lsnr application    OFFLINE   OFFLINE
ora.rac1.gsd   application    OFFLINE   OFFLINE
ora.rac1.ons   application    OFFLINE   OFFLINE
ora.rac1.vip   application    OFFLINE   OFFLINE
ora....SM2.asm application    OFFLINE   OFFLINE
ora....C2.lsnr application    OFFLINE   OFFLINE
ora.rac2.gsd   application    OFFLINE   OFFLINE
ora.rac2.ons   application    OFFLINE   OFFLINE
ora.rac2.vip   application    OFFLINE   OFFLINE
ora....SM3.asm application    OFFLINE   OFFLINE
ora....C3.lsnr application    OFFLINE   OFFLINE
ora.rac3.gsd   application    OFFLINE   OFFLINE
ora.rac3.ons   application    OFFLINE   OFFLINE
ora.rac3.vip   application    OFFLINE   OFFLINE
rac1-> export ORACLE_SID=+ASM1
rac1-> sqlplus "/as sysdba"

SQL*Plus: Release 10.2.0.1.0 - Production on Fri Jul 13 22:07:31 2012

Copyright (c) 1982, 2005, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> startup
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation: if_not_up failed with status: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgxpvaddr5
ORA-27303: additional information: requested interface eth1 is not UP.
Check output from ifconfig command
SQL>
During startup, the ASM alert log also shows the private network being checked (picked up from the OCR):
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Interface type 1 eth1 10.10.10.0 configured from OCR for use as a cluster interconnect
Interface type 1 eth0 192.168.190.0 configured from OCR for use as a public interface
Picked latch-free SCN scheme 2
......
In other words, whether on 10g or 11g, and whether for an ASM instance or a database instance, a RAC instance checks at startup whether the private NIC is up, and it can only start if it is.
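(A quick way to see which interface an already-running instance actually bound for interconnect traffic is to dump its IPC information with oradebug — a sketch, run as SYSDBA; the interconnect and NIC details are written to the trace file named by the last command:)

SQL> oradebug setmypid
SQL> oradebug ipc
SQL> oradebug tracefile_name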
Then why was our SIMP1 able to start?
Let's compare the alert logs of SIAP1 and SIMP1 at startup and see how they differ.
SIAP1's alert log at startup:
Sat Jul 07 19:27:24 GMT 2012
Starting ORACLE instance (normal)
cluster_interconnects = 192.168.0.1
Cluster communication is configured to use the following interface(s) for this instance
  192.168.0.1
Sat Jul 07 19:28:07 GMT 2012
cluster interconnect IPC version:Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
......
(Note: the SIAP1 instance sets cluster_interconnects and does not list the NIC information from the OCR, while the SIMP1 instance does not set cluster_interconnects and does list it.)
SIMP1's alert log at startup:
Sat Jul 07 15:09:36 GMT 2012
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Interface type 1 ce0 144.135.159.0 configured from OCR for use as a public interface
Interface type 1 ce6 144.135.159.0 configured from OCR for use as a public interface
WARNING: No cluster interconnect has been specified. Depending on
         the communication driver configured Oracle cluster traffic
         may be directed to the public interface of this machine.
         Oracle recommends that RAC clustered databases be configured
         with a private interconnect for enhanced security and
         performance.
....
Cluster communication is configured to use the following interface(s) for this instance
  144.135.159.111
Sat Jul 07 15:09:43 GMT 2012
cluster interconnect IPC version:Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
We can see that SIMP1 came up using 144.135.159.111 for inter-node communication, while SIAP1 came up using 192.168.0.1.
Why, when neither database has a cluster_interconnect configured in CRS, do the two instances take completely different IPs?
We know that when no cluster_interconnect is configured in CRS, interconnect traffic falls back to the public IP, and SIMP1 indeed behaves that way. Then why doesn't SIAP1?
That brings to mind another setting: the initialization parameter cluster_interconnects. When this parameter is set, the CRS configuration is overridden, because the initialization parameter takes higher precedence. Let's check it on both instances (show parameter cluster_interconnects):
SIAP1:
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
cluster_interconnects                string      192.168.0.1
On SIMP1:
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
cluster_interconnects                string
So that is the problem. Because SIAP1 has the initialization parameter cluster_interconnects hard-coded to a fixed private address, the CRS settings are ignored and the instance never falls back to the public address; so when the SAs took down the private NIC — ce2 — SIAP1 could no longer start.
SIMP1, on the other hand, has no cluster_interconnects set, so it uses the configuration in CRS; and since CRS has no cluster_interconnect entry either, it falls back to the public network — the 144.135.159.111 address we saw in the alert log.
In short, the interconnect can come from three places: the cluster_interconnects initialization parameter, a global cluster_interconnect entry in CRS, and the global public entry in CRS as the final fallback. Once you understand how these relate and which takes precedence, the cause of the failure is clear.
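To avoid this failure mode, the configuration could be made consistent in either direction — a hedged sketch of the two options, not what was done in this case:

-- Option 1 (SQL*Plus, as SYSDBA): remove the hard-coded parameter so SIAP1
-- falls back to the CRS/OCR configuration at its next startup
alter system reset cluster_interconnects scope=spfile sid='SIAP1';

$ # Option 2 (OS shell, as the clusterware owner): register the private subnet
$ # in CRS so both instances pick it up from the OCR
$ oifcfg setif -global ce2/192.168.0.0:cluster_interconnect

Note that with option 2 both instances would then depend on ce2 being up, so the interface check described above would apply to both of them.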
Original post: http://www.oracleblog.org/working-case/can-not-startup-single-node-of-rac/