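Overview: in 11g RAC, the ora.cssd and ora.crsd resources frequently fail to start. A cssd failure is usually caused by damaged voting disks or by too few voting disks being available, while a crsd failure is usually caused by a damaged OCR or damaged cluster configuration data.
1. By default the OCR is backed up automatically every 4 hours. The backup location holds five automatic backups: the three most recent 4-hour backups, one daily backup, and one weekly backup.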
[grid@rac1 rac1]$ ocrconfig -showbackup
rac2     2015/08/25 14:54:37     /u01/app/11.2/grid/cdata/rac-cluster/backup00.ocr
rac2     2015/08/24 21:12:34     /u01/app/11.2/grid/cdata/rac-cluster/backup01.ocr
rac2     2015/08/24 17:12:34     /u01/app/11.2/grid/cdata/rac-cluster/backup02.ocr
rac2     2015/08/24 13:12:33     /u01/app/11.2/grid/cdata/rac-cluster/day.ocr
rac1     2015/08/13 13:12:12     /u01/app/11.2/grid/cdata/rac-cluster/week.ocr
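Since step 8 needs the path of a recent backup, it can help to pick the newest entry from this listing automatically. The following is a minimal sketch, assuming each backup line has the form "node date time path" as in the listing above; the function name latest_ocr_backup is a hypothetical helper, not part of the Oracle tooling.

```shell
# Print the path of the newest backup found in `ocrconfig -showbackup`-style
# text read from stdin. Assumes lines of the form: node yyyy/mm/dd hh:mm:ss path
latest_ocr_backup() {
  awk 'NF >= 4 && $2 ~ /^[0-9][0-9][0-9][0-9]\/[0-9][0-9]\/[0-9][0-9]$/ {
         print $2, $3, $4      # date, time, path
       }' |
    LC_ALL=C sort |            # yyyy/mm/dd hh:mm:ss sorts correctly as text
    tail -n 1 |
    awk '{ print $3 }'
}
```

It could be used as: ocrconfig -showbackup | latest_ocr_backup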
2. Take a manual OCR backup (ocrconfig -manualbackup, see step 13). The new file then appears at the end of the backup list:
[root@rac1 grid]# ocrconfig -showbackup
rac2     2015/08/25 14:54:37     /u01/app/11.2/grid/cdata/rac-cluster/backup00.ocr
rac2     2015/08/24 21:12:34     /u01/app/11.2/grid/cdata/rac-cluster/backup01.ocr
rac2     2015/08/24 17:12:34     /u01/app/11.2/grid/cdata/rac-cluster/backup02.ocr
rac2     2015/08/24 13:12:33     /u01/app/11.2/grid/cdata/rac-cluster/day.ocr
rac1     2015/08/13 13:12:12     /u01/app/11.2/grid/cdata/rac-cluster/week.ocr
rac1     2015/08/28 09:09:18     /u01/app/11.2/grid/cdata/rac-cluster/backup_20150828_090918.ocr
3. Simulate damage to the OCR disk
Check which disk the OCR and voting files use:
[root@rac1 grid]# crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   1da20ec3577a4fa9bf2882a391d66afb (/dev/raw/raw1) [DATA]
Overwrite the beginning of the disk to simulate the damage:
dd if=/dev/zero of=/dev/raw/raw1 bs=4K count=100
4. Start the cluster while watching the cluster alert log
[grid@rac1 ~]$ cd /u01/app/11.2/grid/log/rac1/
[grid@rac1 rac1]$ tail -f alertrac1.log
[root@rac1 grid]# crsctl start cluster -all
The alert log shows the following errors:
2015-08-28 09:22:17.471: [ohasd(1990)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE
2015-08-28 09:22:17.471: [ohasd(1990)]CRS-2769:Unable to failover resource 'ora.diskmon'.
2015-08-28 09:22:30.845: [cssd(7243)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /u01/app/11.2/grid/log/rac1/cssd/ocssd.log
In other words, the cluster cannot find any voting file. The OCR holds the cluster configuration and, in this setup, lives on the same disk (disk group DATA) as the voting file, so this is the result we expected after overwriting the disk with dd.
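When the alert log is long, a quick summary of the distinct CRS message codes makes errors such as CRS-1714 easier to spot. This is a small sketch that only relies on the CRS-nnnn pattern visible in the log lines above; crs_codes is a hypothetical helper name.

```shell
# List the distinct CRS-nnnn message codes found in log text read from stdin.
crs_codes() {
  grep -o 'CRS-[0-9][0-9]*' | LC_ALL=C sort -u
}
```

For example: crs_codes < alertrac1.log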
The following steps restore the OCR and bring the cluster back up:
5. Stop the clusterware stack on all nodes
[root@rac1 grid]# crsctl stop crs -f
If the stack cannot be stopped this way, kill the processes instead. The first command only lists them, so each PID has to be killed manually with kill -9:
ps -elf | egrep "PID|d.bin|ohas|oraagent|orarootagent|cssdagent|cssdmonitor" | grep -v grep
The second command extracts the PIDs and kills them in one go:
ps -elf | egrep "d.bin|ohas|oraagent|orarootagent|cssdagent|cssdmonitor" | grep -v grep |awk '{print $4}' |xargs -n 10 kill -9
Confirm that the stack has really stopped:
[root@rac1 grid]# ps -ef|grep crs
root      9229  4909  0 09:42 pts/0    00:00:00 grep crs
[root@rac1 grid]# ps -ef|grep css
root      9231  4909  0 09:42 pts/0    00:00:00 grep css
[root@rac1 grid]# ps -ef|grep evm
root      9236  4909  0 09:42 pts/0    00:00:00 grep evm
[root@rac1 grid]# ps -ef|grep ohas
root      9204     1  0 09:41 ?        00:00:00 /bin/sh /etc/init.d/init.ohasd run
root      9239  4909  0 09:42 pts/0    00:00:00 grep ohas
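The four separate ps checks above can be folded into a single count. The sketch below assumes the clusterware daemons show up in ps -ef as ohasd.bin, ocssd.bin, crsd.bin and evmd.bin, and it deliberately ignores the /etc/init.d/init.ohasd wrapper script, which is still listed above after the stack has stopped. stack_procs_left is a hypothetical helper name.

```shell
# Count clusterware daemon processes in `ps -ef` output read from stdin.
# Assumed daemon names: ohasd.bin, ocssd.bin, crsd.bin, evmd.bin.
stack_procs_left() {
  # drop the grep commands themselves, then count matching daemon lines;
  # `|| true` keeps the exit status 0 when the count is 0
  grep -v 'grep' | grep -c -E '(ohasd|ocssd|crsd|evmd)\.bin' || true
}
```

For example, ps -ef | stack_procs_left should print 0 once the stack is down.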
6. Start CRS in exclusive mode
[root@rac1 grid]# crsctl start crs -excl -nocrs
CRS-4123: Oracle High Availability Services has been started.
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'rac1'
CRS-2677: Stop of 'ora.drivers.acfs' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.mdnsd' on 'rac1'
CRS-2676: Start of 'ora.mdnsd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'rac1'
CRS-2676: Start of 'ora.gpnpd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'rac1'
CRS-2672: Attempting to start 'ora.gipcd' on 'rac1'
CRS-2676: Start of 'ora.cssdmonitor' on 'rac1' succeeded
CRS-2676: Start of 'ora.gipcd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'rac1'
CRS-2672: Attempting to start 'ora.diskmon' on 'rac1'
CRS-2676: Start of 'ora.diskmon' on 'rac1' succeeded
CRS-2676: Start of 'ora.cssd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.drivers.acfs' on 'rac1'
CRS-2679: Attempting to clean 'ora.cluster_interconnect.haip' on 'rac1'
CRS-2672: Attempting to start 'ora.ctssd' on 'rac1'
CRS-2681: Clean of 'ora.cluster_interconnect.haip' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'rac1'
CRS-2676: Start of 'ora.drivers.acfs' on 'rac1' succeeded
CRS-2676: Start of 'ora.ctssd' on 'rac1' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'rac1'
CRS-2676: Start of 'ora.asm' on 'rac1' succeeded
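Notes:
-excl   starts the stack in exclusive mode
-nocrs  does not start the crsd daemon, so the stack can come up even though the OCR is unavailable
Cluster state at this point: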
[grid@rac1 trace]$ crsctl stat res -t -init
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  INTERMEDIATE rac1                     OCR not started
ora.cluster_interconnect.haip
      1        ONLINE  ONLINE       rac1
ora.crf
      1        OFFLINE OFFLINE
ora.crsd
      1        OFFLINE OFFLINE
ora.cssd
      1        ONLINE  ONLINE       rac1
ora.cssdmonitor
      1        ONLINE  ONLINE       rac1
ora.ctssd
      1        ONLINE  ONLINE       rac1                     ACTIVE:0
ora.diskmon
      1        OFFLINE OFFLINE
ora.drivers.acfs
      1        ONLINE  ONLINE       rac1
ora.evmd
      1        OFFLINE OFFLINE
ora.gipcd
      1        ONLINE  ONLINE       rac1
ora.gpnpd
      1        ONLINE  ONLINE       rac1
ora.mdnsd
      1        ONLINE  ONLINE       rac1
7. Recreate the disk group that holds the OCR and voting files
SQL> create diskgroup data external redundancy disk '/dev/raw/raw1' attribute 'au_size'='1M','compatible.asm' = '11.2.0','compatible.rdbms' = '11.2.0';
Note that the disk group must have exactly the same name as before the damage; otherwise the OCR restore in the next step fails with:
PROT-35: The configured OCR locations are not accessible.
8. Restore the OCR from a backup
[root@rac1 bin]# ./ocrconfig -restore /u01/app/11.2/grid/cdata/rac-cluster/backup00.ocr
The result can be checked with the following commands:
cluvfy comp ocr -n all
ocrcheck
9. Restore the voting disk
[root@rac1 bin]# ./crsctl replace votedisk +DATA
Successful addition of voting disk 4201f39953204fbdbf2b502ef4abe9cb.
Successfully replaced voting disk group with +DATA.
CRS-4266: Voting file(s) successfully replaced
Note that this step may fail with an error like the following:
crsctl replace votedisk +ocrvote
CRS-4602: Failed 27 to add voting file 5a71f4b0868e4f8abfc4808566c5c7fa.
CRS-4602: Failed 27 to add voting file 66699f04c8a74f57bf08e0682294e449.
CRS-4602: Failed 27 to add voting file 7181a4d009884fecbff2cab4c69f2de2.
Failed to replace voting disk group with +ocrvote.
CRS-4000: Command Replace failed, or completed with errors.
It can be fixed by setting asm_diskstring in the ASM instance, which is empty here:
SQL> show parameter disk

NAME                                 TYPE
------------------------------------ ----------------------
VALUE
------------------------------
asm_diskgroups                       string
OCRVOTE
asm_diskstring                       string

SQL> alter system set asm_diskstring='/dev/raw/*';
Then run the command again to restore the voting disk.
Verify the result:
[grid@rac1 ~]$ crsctl query css votedisk
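To turn this check into something a script can act on, the ONLINE entries in the output can be counted. A minimal sketch, assuming the row layout shown in step 3 (an index such as "1." followed by the state); count_online_votedisks is a hypothetical helper name.

```shell
# Count voting files reported as ONLINE in `crsctl query css votedisk`
# output read from stdin. Rows are assumed to look like:
#  1. ONLINE   <file universal id> (<file name>) [<disk group>]
count_online_votedisks() {
  awk '$1 ~ /^[0-9]+\.$/ && $2 == "ONLINE" { n++ } END { print n + 0 }'
}
```

For example: crsctl query css votedisk | count_online_votedisks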
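10. Recreate the spfile
Note: if the spfile used by the cluster's ASM instances was stored in the shared OCR disk group, it has to be recreated at this point. There are two ways to do so:
1) Use the 11g feature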
create spfile from memory
2) Create it manually
[root@rac2 ~]# vi /tmp/asm_pfile.txt
Add the following parameters:
*.asm_power_limit=1
*.diagnostic_dest='/u01/app/grid/11.2.0/log'
*.instance_type='asm'
*.large_pool_size=12M
*.remote_login_passwordfile='EXCLUSIVE'
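If the pfile should be created without an interactive editor, a here-document does the same job. This sketch writes exactly the five parameters listed above; write_asm_pfile is a hypothetical helper name, and the values are this article's, so adjust them to your own environment.

```shell
# Write the minimal ASM pfile from this article to the path given as $1.
write_asm_pfile() {
  # the quoted 'EOF' keeps the shell from expanding anything in the body
  cat > "$1" <<'EOF'
*.asm_power_limit=1
*.diagnostic_dest='/u01/app/grid/11.2.0/log'
*.instance_type='asm'
*.large_pool_size=12M
*.remote_login_passwordfile='EXCLUSIVE'
EOF
}
```

For example: write_asm_pfile /tmp/asm_pfile.txt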
Recreate the spfile from the file we just edited:
SQL> create spfile='+DATA' from pfile='/tmp/asm_pfile.txt';
11. Stop the cluster, then start it again:
[root@rac1 grid]# crsctl stop crs -f
[root@rac1 grid]# crsctl start crs
[root@rac1 grid]# crsctl start cluster -all
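Steps 5 to 11 follow a fixed command order, so it can be useful to print the sequence for a given backup file and disk group before running anything. The sketch below is a dry run only: it echoes the commands used in this article and executes nothing. recovery_plan and its two arguments are hypothetical names chosen for illustration.

```shell
# Print (do not run) the recovery command sequence from this article.
#   $1: OCR backup file to restore
#   $2: disk group name without the leading '+'
recovery_plan() {
  backup=$1
  dg=$2
  cat <<EOF
crsctl stop crs -f
crsctl start crs -excl -nocrs
# recreate disk group $dg in the ASM instance (SQL step)
ocrconfig -restore $backup
crsctl replace votedisk +$dg
# recreate the ASM spfile if it was stored in $dg (SQL step)
crsctl stop crs -f
crsctl start crs
EOF
}
```

For example: recovery_plan /u01/app/11.2/grid/cdata/rac-cluster/backup00.ocr DATA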
12. Check the state of the cluster resources:
1) Cluster resources
[root@rac1 grid]# crsctl stat res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
               ONLINE  ONLINE       rac1
               ONLINE  ONLINE       rac2
ora.LISTENER.lsnr
               ONLINE  ONLINE       rac1
               ONLINE  ONLINE       rac2
ora.ORADATA.dg
               ONLINE  ONLINE       rac1
               ONLINE  ONLINE       rac2
ora.asm
               ONLINE  ONLINE       rac1                     Started
               ONLINE  ONLINE       rac2                     Started
ora.gsd
               OFFLINE OFFLINE      rac1
               OFFLINE OFFLINE      rac2
ora.net1.network
               ONLINE  ONLINE       rac1
               ONLINE  ONLINE       rac2
ora.ons
               ONLINE  ONLINE       rac1
               ONLINE  ONLINE       rac2
ora.registry.acfs
               ONLINE  ONLINE       rac1
               ONLINE  ONLINE       rac2
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       rac1
ora.cvu
      1        ONLINE  ONLINE       rac1
ora.oc4j
      1        ONLINE  ONLINE       rac1
ora.rac1.vip
      1        ONLINE  ONLINE       rac1
ora.rac2.vip
      1        ONLINE  ONLINE       rac2
ora.scan1.vip
      1        ONLINE  ONLINE       rac1
ora.sunny.db
      1        ONLINE  ONLINE       rac1                     Open
      2        ONLINE  ONLINE       rac2                     Open
2) Check the OCR
[root@rac1 grid]# ocrcheck
Status of Oracle Cluster Registry is as follows :
         Version                  :          3
         Total space (kbytes)     :     262120
         Used space (kbytes)      :       3084
         Available space (kbytes) :     259036
         ID                       :  101930821
         Device/File Name         :      +DATA
                                    Device/File integrity check succeeded
                                    Device/File not configured
                                    Device/File not configured
                                    Device/File not configured
                                    Device/File not configured
         Cluster registry integrity check succeeded
         Logical corruption check succeeded
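The two lines that matter in this output are the registry integrity check and the logical corruption check. A small sketch that tests for both, based only on the message texts shown above; ocr_ok is a hypothetical helper name.

```shell
# Return success (exit status 0) only if ocrcheck output read from stdin
# reports both the integrity check and the logical corruption check as passed.
ocr_ok() {
  out=$(cat)
  printf '%s\n' "$out" | grep -q 'Cluster registry integrity check succeeded' &&
    printf '%s\n' "$out" | grep -q 'Logical corruption check succeeded'
}
```

For example: ocrcheck | ocr_ok && echo "OCR looks healthy"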
3) Check the spfile
SQL> show parameter spfile

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
spfile                               string      /u01/app/11.2/grid/dbs/spfile+
                                                 ASM1.ora
4) Check that the disk groups are healthy
[grid@rac1 rac-cluster]$ asmcmd lsdg
State    Type    Rebal  Sector  Block       AU  Total_MB  Free_MB  Req_mir_free_MB  Usable_file_MB  Offline_disks  Voting_files  Name
MOUNTED  EXTERN  N         512   4096  1048576      3082     2687                0            2687              0             Y  DATA/
MOUNTED  EXTERN  N         512   4096  1048576      8197     5802                0            5802              0             N  ORADATA/
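The Voting_files column shows which disk group carries the voting files after the recovery. The sketch below extracts those names, assuming that Voting_files is the second-to-last column and Name the last, as in the output above; voting_diskgroups is a hypothetical helper name.

```shell
# Print the names of disk groups whose Voting_files column is "Y",
# from `asmcmd lsdg` output read from stdin.
voting_diskgroups() {
  awk 'NF >= 2 && $(NF-1) == "Y" {
         name = $NF
         sub(/\/$/, "", name)   # strip the trailing slash: DATA/ -> DATA
         print name
       }'
}
```

For example: asmcmd lsdg | voting_diskgroups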
The recovery is now complete. Finally, move the spfile into the DATA disk group:
SQL> create pfile='/tmp/aa.txt' from spfile;
File created.
SQL> create spfile='+DATA' from pfile='/tmp/aa.txt';
File created.
Stop the cluster and start it again:
crsctl stop cluster -all
crsctl start cluster -all
13. Further notes
1) Manual OCR backups, and backups with export and import:
[root@rac1 rac-cluster]# ocrconfig -manualbackup
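The -export option can also be used to export the OCR contents, and such an export can likewise be used to restore the OCR:
ocrconfig -export /tmp/ocr.bak
ocrconfig -import file_name
An OCR backup taken with -export cannot be restored with -restore; it has to be restored with -import.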
2) Reading the disk header with kfed to find where the voting file is located on the disk
[root@rac1 rac-cluster]# kfed read /dev/raw/raw1 | grep -E 'vfstart|vfend'
kfdhdb.vfstart:                     320 ; 0x0ec: 0x00000140
kfdhdb.vfend:                       352 ; 0x0f0: 0x00000160
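The two header fields can be reduced to a start, an end and a length with a little arithmetic. This sketch only subtracts the decimal values printed by kfed; reading them as allocation unit numbers (so that the span would be 32 AUs of 1 MB here) is an assumption on my part, not something the output itself states. vf_span is a hypothetical helper name.

```shell
# Print "start end length" from the vfstart/vfend lines of `kfed read`
# output on stdin. The unit of the values is assumed to be allocation units.
vf_span() {
  awk '$1 == "kfdhdb.vfstart:" { s = $2 }
       $1 == "kfdhdb.vfend:"   { e = $2 }
       END { print s, e, e - s }'
}
```

For example: kfed read /dev/raw/raw1 | vf_span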
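Important: if no backup exists, the OCR cannot be restored and can only be rebuilt from scratch, so make checking the OCR backups part of your routine work.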