As I mentioned in the article "How Should We Install an Oracle Database?", as long as you follow the documentation strictly when installing the Oracle software, you usually will not run into major problems. Real-world environments, however, are always more complicated: when installing on a new combination of software, or on a new version, it is easy to hit problems that are fairly complex, or at least not easy to resolve. This article describes an installation in exactly such an environment: HP-UX 11.31 IA64, Symantec SFRAC, and Oracle 10g RAC. When root.sh was run during the CRS installation, it hung indefinitely; in other words, root.sh never completed. The troubleshooting process is documented below for reference.
First, let's look at the output from running root.sh:
cwxndb01:[/oracle/app/oracle/product/crs]#./root.sh
WARNING: directory '/oracle/app/oracle/product' is not owned by root
WARNING: directory '/oracle/app/oracle' is not owned by root
WARNING: directory '/oracle/app' is not owned by root
WARNING: directory '/oracle' is not owned by root
Checking to see if Oracle CRS stack is already configured
Checking to see if any 9i GSD is up
Setting the permissions on OCR backup directory
Setting up NS directories
Oracle Cluster Registry configuration upgraded successfully
WARNING: directory '/oracle/app/oracle/product' is not owned by root
WARNING: directory '/oracle/app/oracle' is not owned by root
WARNING: directory '/oracle/app' is not owned by root
WARNING: directory '/oracle' is not owned by root
Successfully accumulated necessary OCR keys.
Using ports: CSS=49895 CRS=49896 EVMC=49898 and EVMR=49897.
node <nodenumber>: <nodename> <private interconnect name> <hostname>
node 0: cwxndb01 cwxndb01-priv cwxndb01
Creating OCR keys for user 'root', privgrp 'sys'..
Operation successful.
Now formatting voting device: /dev/vx/rdsk/vgdata01/lv_vote_128m_01
Now formatting voting device: /dev/vx/rdsk/vgdata02/lv_vote_128m_02
Now formatting voting device: /dev/vx/rdsk/vgdata03/lv_vote_128m_03
Format of 3 voting devices complete.
Startup will be queued to init within 30 seconds.
Right after printing "Startup will be queued to init within 30 seconds", it simply hung. When root.sh fails during a CRS installation, the usual behavior is to report an error and exit; it is rare for it to hang indefinitely.
To solve an unknown problem, we usually proceed in three steps: find out what is currently being done, understand why what is being done causes the problem, and then fix it. So the first thing we need to know is what the root.sh script is doing and how far it has gotten.
Using "ps -ef | grep root.sh", we find the PID of root.sh, then look for that PID's child processes and follow them all the way down. The command running at the very bottom turns out to be "sleep 60", and the parent command of that "sleep 60" is "init.cssd startcheck CSS".
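As an illustration, the walk down the process tree looks roughly like this (the PID 12345 is hypothetical; in ps -ef output the third column is the PPID):

ps -ef | grep root.sh          # note the PID of root.sh, say 12345
ps -ef | awk '$3 == 12345'     # list the children of that PID
# repeat with each child PID until reaching the leaf process;
# here the leaf was "sleep 60", whose parent was "init.cssd startcheck CSS"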
The heart of root.sh is actually the script $ORA_CRS_HOME/install/rootconfig. Following the clue above, we find the following code in that file:
# Prepare to start the daemons.
$ID/init.crs start
# Check to see if they are going to start.
$ID/init.cssd startcheck CSS
if [ "$?" != "0" ]; then
  $ECHO CRS daemons not set to start. See $MSGFILE for details.
  exit 1
fi
The line "Startup will be queued to init within 30 seconds." in the root.sh output is in fact printed by init.crs start. Judging from the child processes of root.sh, the command currently running is "$ID/init.cssd startcheck CSS". So the next thing to find out is where inside that command it keeps hanging.
On HP-UX, init.cssd lives in the /sbin/init.d directory, and it turns out to be a fairly complex script as well. To work out where a shell script goes wrong during execution, we can run it in debug mode with sh -x. (In fact, when analyzing what root.sh itself was doing, we could equally have run it as sh -x ./root.sh.)
We manually run the following commands:
cd /sbin/init.d
sh -x init.cssd startcheck CSS
and find that it hangs in the while loop near the end of the following code:
# check on vendor clusterware and filesystem dependencies.
# Startcheck is supposed to block for a long time waiting on dependencies.
'startcheck')   # Returns 0 if we should start
                # Returns 1 on a non-cluster boot
                # Returns 2 if disabled by admin
                # Returns 3 on error

    # Check our startup run status, initialized by the automatic startup
    # or manual startup routes.
    $ID/init.cssd runcheck
    STATUS=$?
    if [ "$STATUS" != "0" ]; then
      exit $STATUS;
    fi

    # If we have vendor clusterware, and CLINFO indicates a non-clustered boot
    # then prevent CRS startup.
    if [ $USING_VC -eq 1 ]
    then
      # Run CLINFO to get cluster status. It returns 0 if the machine is
      # booting in clustered mode.
      $EVAL $CLINFO
      if [ "$?" != "0" ]; then
        $LOGMSG "Oracle Cluster Ready Services disabled by non-clustered boot."
        $ID/init.cssd norun
        exit 1;
      fi

      # Wait for the Vendor Clusterware to start
      $EVAL $VC_UP
      RC=$?
      while [ $RC -ne 0 ]; do
        $LOGMSG "Oracle Cluster Ready Services waiting for $VC_NAME to start."
        $SLEEP $DEP_CHECK_WAIT
        $EVAL $VC_UP
        RC=$?
      done
    fi
That while loop, as the comments indicate, is waiting for the Vendor Clusterware to start: if the vendor clusterware never comes up, the loop never exits, and the script hangs forever. The line "$EVAL $VC_UP" is what tests whether the Vendor Clusterware is already running. So what is $VC_UP set to? Searching the same file, we find the following code:
HP-UX) MACH_HARDWARE=`/bin/uname -m`
       CLUSTERDIR=/opt/nmapi/nmapi2
       if [ "$MACH_HARDWARE" = "ia64" ]; then
         SO_EXT=so
         NMAPIDIR_64=$CLUSTERDIR/lib/hpux64
       else
         SO_EXT=sl
         NMAPIDIR_64=$CLUSTERDIR/lib/pa20_64
       fi
       LD_LIBRARY_PATH=$ORA_CRS_HOME/lib:$NMAPIDIR_64:/usr/lib:$LD_LIBRARY_PATH
       export LD_LIBRARY_PATH

       # Presence of this file indicates that vendor clusterware is installed
       SKGXNLIB=${NMAPIDIR_64}/libnmapi2.${SO_EXT}
       if [ -f $SKGXNLIB ]; then
         USING_VC=1
       fi

       VC_UP="$PSEF | $GREP '/usr/lbin/cm[g]msd' 1>$NULL 2>$NULL"
       # NOTE: Not supporting Hardware partitioning machines.
       REBOOT_TOC="/var/opt/oracle/bin/reboot_toc"
       if [ -x $REBOOT_TOC ]; then
         FAST_REBOOT="$REBOOT_TOC || /usr/sbin/reboot -r -n -q"
         SLOW_REBOOT="/bin/kill -HUP `$CAT /var/run/syslog.pid` ; /bin/sync & $SLEEP 2 ; $REBOOT_TOC || /usr/sbin/reboot -r -n -q"
       else
         FAST_REBOOT="/usr/sbin/reboot -r -n -q"
         SLOW_REBOOT="/bin/kill -HUP `$CAT /var/run/syslog.pid` ; /bin/sync & $SLEEP 2 ; /usr/sbin/reboot -r -n -q"
       fi

       DFL_CLSINFO=$CLUSTERDIR/bin/clsinfo
       if [ $USING_VC -eq 0 ]; then
         # We are not using vendor clusterware.
         VC_NAME=""
         VC_UP=/bin/true
         CLINFO=/bin/true
         # We do not have a slow reboot without vendor clusterware
         SLOW_REBOOT=$FAST_REBOOT
       elif [ -f $DFL_CLSINFO ]; then
         # We are using Generic HP Vendor Clusterware
         VC_NAME="Generic HP-UX Vendor Clusterware"
         VC_UP=/bin/true
         CLINFO=$DFL_CLSINFO
       else
         # We are using HPUX ServiceGuard Clusterware
         VC_NAME="HP-UX Service Guard"
         CLINFO=/bin/true
       fi

       ID=/sbin/init.d
The gist of the code above is this: if the file /opt/nmapi/nmapi2/lib/hpux64/libnmapi2.so exists, the system is considered to be using Vendor Clusterware, i.e. third-party cluster software (USING_VC=1). If, in addition, the file /opt/nmapi/nmapi2/bin/clsinfo exists, the system is treated as running generic HP vendor clusterware; that executable is used as CLINFO to check the cluster status, and VC_UP is simply set to /bin/true. If clsinfo does not exist, the script assumes the vendor clusterware is HP ServiceGuard, and VC_UP stays as "$PSEF | $GREP '/usr/lbin/cm[g]msd' 1>$NULL 2>$NULL", i.e. it decides whether ServiceGuard is up by checking whether the cmgmsd daemon is running. From the sh -x output, the script had concluded that the system was running ServiceGuard, which is obviously wrong here; it kept waiting for ServiceGuard to start, and therefore hung forever.
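To see which branch applies on this machine, the same checks can be run by hand (a quick sketch; the paths and the grep pattern are taken directly from the script above):

ls -l /opt/nmapi/nmapi2/lib/hpux64/libnmapi2.so   # present => USING_VC=1
ls -l /opt/nmapi/nmapi2/bin/clsinfo               # absent  => the ServiceGuard branch is taken
ps -ef | grep '/usr/lbin/cm[g]msd'                # what VC_UP actually runs; no cmgmsd process exists here, so the loop never exits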
So far, then, we know the problem is caused by the script misidentifying the Vendor Clusterware, which implies that the file it uses for that decision, /opt/nmapi/nmapi2/bin/clsinfo, must be missing from the system. Sure enough, there is no clsinfo under /opt/nmapi/nmapi2/bin, whereas the corresponding library does exist under /opt/nmapi/nmapi2/lib:
cwxndb01:[/opt/nmapi/nmapi2/lib/hpux64]#ll
total 0
lr-xr-xr-x 1 bin bin 27 Jul 18 13:33 libnmapi2.so -> /usr/lib/hpux64/libvcsmm.so
As the listing shows, the library is a symbolic link to a Symantec SFRAC library, yet the bin directory contains no files at all. This looks like a bug in the Symantec SFRAC installation.
So how do we fix this? Once the cause is known, the solution is simple:
cwxndb01:[/opt/nmapi/nmapi2]#cd /opt/nmapi/nmapi2/bin
cwxndb01:[/opt/nmapi/nmapi2/bin]#ln -s /opt/VRTSvcs/ops/bin/clsinfo
All that is needed is to create this symbolic link in that directory (ln -s with a single operand creates a link named clsinfo, pointing at the binary shipped with SFRAC). After doing this, cleaning up the environment, and rerunning root.sh, everything went through smoothly.
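For reference, the cleanup-and-rerun sequence was roughly the following (a sketch only: rootdelete.sh and rootdeinstall.sh are the standard 10g CRS cleanup scripts under $ORA_CRS_HOME/install, but check the arguments they expect in your environment before running them):

cd /oracle/app/oracle/product/crs/install
./rootdelete.sh        # remove the partially configured CRS stack on this node (may require arguments such as "local nosharedvar")
./rootdeinstall.sh     # clear the OCR configuration written by the failed root.sh
cd /oracle/app/oracle/product/crs
./root.sh              # rerun; with clsinfo in place, "init.cssd startcheck CSS" now returns promptly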
The root cause of this problem thus appears to be a bug in the Symantec SFRAC installation. That said, installing a RAC database on recent SFRAC versions has become much easier and is no different from a standard CRS installation, whereas with earlier SFRAC versions the RAC installation had to be performed through SFRAC's own dedicated installation interface.