[oracle@node1 crsd]$ crs_stat -t
CRS-0184: Cannot communicate with the CRS daemon.
[oracle@node1 crsd]$ crsctl check crs
Failure 1 contacting CSS daemon
Cannot communicate with CRS
Cannot communicate with EVM
[root@node1 crs]# ps -ef|grep crs
root 3926 1 0 17:46 ? 00:00:00 /bin/sh /etc/init.d/init.crsd run
root 29408 25855 0 22:09 pts/1 00:00:00 grep crs
[root@node1 bin]# ./racgvip
There is no VIP name
[root@node1 crsd]# /etc/init.d/init.crs stop
Shutting down Oracle Cluster Ready Services (CRS):
OCR initialization failed accessing OCR device: PROC-26: Error while accessing the physical storage Operating System error [Device or resource busy] [16]
Shutdown has begun. The daemons should exit soon.
[root@node1 crsd]# raw -qa
/dev/raw/raw1: bound to major 8, minor 17
/dev/raw/raw2: bound to major 8, minor 33
[root@node1 crsd]# ls -al /dev/raw/raw2
crw-rw---- 1 oracle dba 162, 2 Sep 15 17:45 /dev/raw/raw2
[root@node1 bin]# ./crsctl query css votedisk
OCR initialization failed accessing OCR device: PROC-26: Error while accessing the physical storage Operating System error [Device or resource busy] [16]
[root@node1 bin]# ./ocrcheck
PROT-602: Failed to retrieve data from the cluster registry
[root@node1 ~]# ll /etc/oracle/ocr.loc
-rw-r--r-- 1 root oinstall 45 2012-01-17 /etc/oracle/ocr.loc
[root@node1 bin]# more /etc/oracle/ocr.loc
ocrconfig_loc=/dev/raw/raw1
local_only=FALSE
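The OCR location that every later command depends on comes from the ocrconfig_loc line in this file. A minimal sketch for pulling that value out of an ocr.loc-style file; the function name ocr_device is made up for illustration and is not an Oracle tool:

```shell
# Print the value of ocrconfig_loc from an ocr.loc-style file.
# ocr_device is a hypothetical helper name, not part of Oracle Clusterware.
ocr_device() {
    sed -n 's/^ocrconfig_loc=//p' "$1"
}

# Example: ocr_device /etc/oracle/ocr.loc
```

The printed path can then be checked against the output of raw -qa to see which block device it is really bound to.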
[root@node1 ~]# dd if=/dev/raw/raw1 of=/opt/oracle/ocr_raw.bak
dd: opening '/dev/raw/raw1': Device or resource busy
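lsof|grep /dev/raw/raw1
No process had the device open.
I decided to reformat the partition behind raw1, and in the process found that sdb1 was actually 10.7 GB, not the 100 MB raw-device partition.
A system administrator came over to help.
Running fdisk on sdb broke the file systems mounted at boot, so the machine would not boot normally. After entering the root password at the boot prompt we could run fdisk on sdb again,
create sdb1 on the 10.7 GB disk and sdc1 on the raw-device disk, and then format with mkfs.ext3 /dev/sdb1.
That got us back into the system. I then corrected the device bindings in /etc/sysconfig/rawdevices.
After another reboot, dd could back up the contents of /dev/raw/raw1 without reporting an error.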
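The raw -qa output near the top reports each binding as a major/minor pair rather than a device name, which makes a shifted disk easy to miss. Under block major 8 (SCSI disks), each disk owns 16 consecutive minors, so the name can be worked out by hand. A small sketch of that arithmetic; the function name sd_name_from_minor is hypothetical:

```shell
# Translate a minor number under block major 8 into an sdXN name.
# Major 8 covers sda..sdp: 16 minors per disk, minor % 16 = partition number.
# sd_name_from_minor is a made-up helper name for this illustration.
sd_name_from_minor() {
    sd_minor=$1
    sd_disk=$((sd_minor / 16))     # 0 = sda, 1 = sdb, ...
    sd_part=$((sd_minor % 16))     # 0 = whole disk
    sd_letter=$(printf '%s' abcdefghijklmnop | cut -c $((sd_disk + 1)))
    if [ "$sd_part" -eq 0 ]; then
        printf 'sd%s\n' "$sd_letter"
    else
        printf 'sd%s%s\n' "$sd_letter" "$sd_part"
    fi
}

sd_name_from_minor 17   # binding shown for /dev/raw/raw1
sd_name_from_minor 33   # binding shown for /dev/raw/raw2
```

For the bindings listed earlier this gives sdb1 and sdc1, which is how raw1 ended up pointing at the 10.7 GB partition once the device names had shifted.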
[root@node1 tmp]# dd if=/dev/zero of=/dev/raw/raw1 bs=512 count=2048
2048+0 records in
2048+0 records out
[root@node1 tmp]# dd if=/dev/zero of=/dev/raw/raw2 bs=512 count=2048
2048+0 records in
2048+0 records out
The raw devices are working normally...
No new errors appeared in /tmp.
Stop CRS:
[root@node1 ~]# /etc/init.d/init.crs stop
Shutting down Oracle Cluster Ready Services (CRS):
OCR initialization failed accessing OCR device: PROC-26: Error while accessing the physical storage
Shutdown has begun. The daemons should exit soon.
Run the OCR restore:
ocrconfig -restore /opt/oracle/crshome/product/10.2.0/db_1/cdata/crs/backup00.ocr
No response.
Check the OCR log:
cd /opt/oracle/crshome/product/10.2.0/db_1/log/node1/client
[root@node1 client]# cat ocrconfig_6090.log
Oracle Database 10g CRS Release 10.2.0.1.0 Production Copyright 1996, 2005 Oracle. All rights reserved.
2012-09-19 10:51:08.056: [ OCRCONF][3086915264]ocrconfig starts...
2012-09-19 10:51:08.109: [ OCROSD][3086915264]utopen:12:Not enough space in the backing store
2012-09-19 10:51:08.109: [ OCROSD][3086915264]utopen:10:None of the OCR devices are usable
2012-09-19 10:51:08.109: [ OCRRAW][3086915264]phy_rec:1:could not open OCR device
2012-09-19 10:51:08.109: [ OCRCONF][3086915264]Failed to restore OCR from [/opt/oracle/crshome/product/10.2.0/db_1/cdata/crs/backup00.ocr]
2012-09-19 10:51:08.109: [ OCRCONF][3086915264]Exiting [status=failed]...
Probably a permissions problem:
[root@node1 client]# ll /dev/raw/raw*
crw-rw---- 1 root disk 162, 1 Sep 18 18:41 /dev/raw/raw1
crw-rw---- 1 root disk 162, 2 Sep 18 18:41 /dev/raw/raw2
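The permissions had been changed earlier so that OCR would stop running continuously and keeping the raw device busy, which had prevented dd from reading it.
Temporarily disabled automatic start of the CRS daemons: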
[root@node1 opt]# vi /etc/inittab
# Run xdm in runlevel 5
x:5:respawn:/etc/X11/prefdm -nodaemon
#h1:35:respawn:/etc/init.d/init.evmd run >/dev/null 2>&1 </dev/null
#h2:35:respawn:/etc/init.d/init.cssd fatal >/dev/null 2>&1 </dev/null
#h3:35:respawn:/etc/init.d/init.crsd run >/dev/null 2>&1 </dev/null
A colleague pointed out that the partitioning was still wrong:
Disk /dev/sdc: 107 MB, 107374080 bytes
64 heads, 32 sectors/track, 102 cylinders
Units = cylinders of 2048 * 512 = 1048576 bytes
   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1             102         102        1024   83  Linux
Disk /dev/sdd: 107 MB, 107374080 bytes
64 heads, 32 sectors/track, 102 cylinders
Units = cylinders of 2048 * 512 = 1048576 bytes
   Device Boot      Start         End      Blocks   Id  System
/dev/sdd1   *           1         102      104432   83  Linux
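The problem in the two listings is visible in the Start and End columns: sdc1 covers only cylinder 102, while sdd1 covers cylinders 1 to 102. A rough sketch of the size arithmetic, using the "Units = cylinders of 2048 * 512 = 1048576 bytes" value from the output; partition_kib is a hypothetical helper name:

```shell
# Approximate size in KiB of a partition spanning cylinders start..end
# (inclusive), given the bytes-per-cylinder value from fdisk's Units line.
# partition_kib is a made-up name for this illustration.
partition_kib() {
    p_start=$1
    p_end=$2
    p_unit_bytes=$3
    echo $(( (p_end - p_start + 1) * p_unit_bytes / 1024 ))
}

partition_kib 102 102 1048576   # /dev/sdc1
partition_kib 1 102 1048576     # /dev/sdd1
```

This prints 1024 for sdc1, matching the Blocks column, and 104448 for sdd1. The listing shows 104432 for sdd1, 16 KiB less, which is the size of one 32-sector track and presumably reflects the track skipped at the start of the disk. A 1 MiB sdc1 would also be consistent with the "Not enough space in the backing store" message in the ocrconfig log, though the log alone does not prove that.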
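Repartitioned with fdisk /dev/sdc.
Re-imported the raw image exported earlier (raw1.file).
Ran ocrconfig -restore /opt/oracle/crshome/product/10.2.0/db_1/cdata/crs/backup00.ocr again.
Still no response. Infuriating!
Rebooting the system did not help either.
The next day I turned to node 2. Node 2 reported the same error, and for the same reason: a disk had been added on SCSI bus 0, which changed the device names.
Unlike node 1, though, it had not been through my two colleagues' hands.
Booted node 2.
Edited /etc/sysconfig/rawdevices: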
[root@node2 ~]# cat /etc/sysconfig/rawdevices
# This file and interface are deprecated.
# Applications needing raw device access should open regular
# block devices with O_DIRECT.
# raw device bindings
# format: <rawdev> <major> <minor>
# <rawdev> <blockdev>
# example: /dev/raw/raw1 /dev/sda1
# /dev/raw/raw2 8 5
/dev/raw/raw1 /dev/sdc1
/dev/raw/raw2 /dev/sdd1
[root@node2 ~]# service rawdevices restart
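The OCR still did not work after that, but a system reboot fixed it.
crsctl check crs: all three checks were OK.
crs_stat -t: everything on node 2 was OK.
The plan was for node 2 to repair the contents of the OCR disk automatically, so that node 1 could read a valid OCR and start successfully.
Shut down node 2:
crsctl stop crs (the virtual machine was rather busy)
Started node 1: nothing had changed. OCR wrote no logs to /tmp or the client directory, and there were no CRS logs either.
Infuriating. Had the OCR programs themselves been damaged? Surely not. I started node 2 again and compared the files one by one:
ll /dev/raw/raw* (permissions)
cat /etc/sysconfig/rawdevices (device names)
Today I brought along the book 大话RAC and turned to the OCR tools part of chapter 6, page 163, which covers configuring whether the CRS stack starts automatically.
It says that the crsctl disable crs command actually modifies the following file: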
/etc/oracle/scls_scr/dbp/root/crsstart
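Note: replace dbp with node1.
Comparing the file on the two nodes: node 2 said enable, node 1 said disable.
I remembered that a colleague had asked me to stop CRS on node 1 from starting automatically. So I changed it back to enable and rebooted node 1.
ps no longer showed only /etc/init.d/init.crsd run, but a whole list of processes: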
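The check that finally exposed the difference can be repeated on each node by reading the crsstart file directly. A minimal sketch, assuming the directory layout quoted above (/etc/oracle/scls_scr/<hostname>/root/crsstart) and that the file holds a single word such as enable or disable; the function name crs_autostart_flag is hypothetical:

```shell
# Print the autostart flag stored for one host under a scls_scr-style tree.
# Arguments: base directory, host name. crs_autostart_flag is a made-up name.
crs_autostart_flag() {
    flag_base=$1
    flag_host=$2
    cat "$flag_base/$flag_host/root/crsstart"
}

# Example: crs_autostart_flag /etc/oracle/scls_scr node1
```

Running it on both nodes and comparing the two words is a quick way to spot a node that was left disabled.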
[root@node1 ~]# ps -ef | grep crs*
root 3392 1 0 15:38 ? 00:00:00 crond
root 3427 1 0 15:38 ? 00:00:00 anacron -s
root 4045 1 0 15:38 ? 00:00:00 /bin/su -l oracle -c sh -c 'ulimit -c unlimited; cd /opt/oracle/crshome/product/10.2.0/db_1/log/node1/evmd; exec /opt/oracle/crshome/product/10.2.0/db_1/bin/evmd '
root 4052 1 1 15:38 ? 00:00:08 /opt/oracle/crshome/product/10.2.0/db_1/bin/crsd.bin reboot
oracle 4773 4045 0 15:39 ? 00:00:01 /opt/oracle/crshome/product/10.2.0/db_1/bin/evmd.bin
root 4890 4752 0 15:39 ? 00:00:00 /bin/su -l oracle -c /bin/sh -c 'ulimit -c unlimited; cd /opt/oracle/crshome/product/10.2.0/db_1/log/node1/cssd;
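The listing above includes crond and anacron because the pattern crs* is a regular expression meaning "cr followed by zero or more s", so any line containing cr matches. Left unquoted, the shell may also expand it as a file glob if a matching name exists in the current directory. A small sketch using two lines copied from that listing; the [c]rsd pattern is only one common way to narrow the match, and it also keeps grep from matching its own command line:

```shell
# Two sample lines taken from the ps output above.
sample='root 3392 1 0 15:38 ? 00:00:00 crond
root 4052 1 1 15:38 ? 00:00:08 /opt/oracle/crshome/product/10.2.0/db_1/bin/crsd.bin reboot'

# Loose pattern: "cr" plus zero or more "s", so crond matches too.
loose=$(printf '%s\n' "$sample" | grep -c 'crs*')

# Narrower, quoted pattern: only lines that contain "crsd".
strict=$(printf '%s\n' "$sample" | grep -c '[c]rsd')

echo "loose=$loose strict=$strict"
```

With these two lines the loose pattern counts both, while the narrower one counts only the crsd.bin line.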
[root@node1 ~]# crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
[root@node1 ~]# su - oracle
[oracle@node1 ~]$ crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora....C1.inst application    ONLINE    ONLINE    node1
ora....C2.inst application    ONLINE    ONLINE    node2
ora.MYRAC.db   application    ONLINE    ONLINE    node2
ora....SM1.asm application    ONLINE    ONLINE    node1
ora....E1.lsnr application    ONLINE    ONLINE    node1
ora.node1.gsd  application    ONLINE    ONLINE    node1
ora.node1.ons  application    ONLINE    ONLINE    node1
ora.node1.vip  application    ONLINE    ONLINE    node1
ora....SM2.asm application    ONLINE    ONLINE    node2
ora....E2.lsnr application    ONLINE    ONLINE    node2
ora.node2.gsd  application    ONLINE    ONLINE    node2
ora.node2.ons  application    ONLINE    ONLINE    node2
ora.node2.vip  application    ONLINE    ONLINE    node2
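Summary
1 When adding a disk, watch out for device names changing.
2 When partitioning with fdisk, check the start and end values, especially when the prompts offer 1 twice while creating a partition.
3 The OCR is opened before CRS starts; if the OCR cannot be accessed, CRS cannot start either.
4 My two colleagues know the commands well and work fast, which makes it very easy to miss details in the output.
5 Do not change CRS settings by trial and error, especially while the problem has not yet been pinpointed.
6 Record every change by hand, in a notebook or a Word document, because repeated changes and trial and error easily damage the environment.
7 This bug took a week and I asked many people. The ones who really helped were two good colleagues. People in the chat group supplied commands and files, which at least made me familiar with some Linux commands and configuration files. So when you cannot solve something alone, sleep on it or ask someone else. As the saying goes, the onlooker sees more clearly than the player. After too long at it the mind gets foggy and the eyes tire, and important information and hints slip past.
8 Luckily this was a virtual machine. On a production system, having to fix it quickly amid noise, pressure and heat, I probably could not have solved it. Resorting to trial and error under pressure may well have caused even more problems.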