Simulating OLR Loss That Prevents a RAC Node from Starting

    • 1. OLR
    • 2. OLR Backup
    • 3. Simulating OLR Loss
    • 4. Problem Analysis
      • 4.1 Identifying the GI Startup Stage
    • 5. Restoring the OLR File
    • 6. References

1. OLR

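The OLR (Oracle Local Registry) is a cluster registry stored locally on each node.
Its main purpose is to provide the ohasd daemon with the cluster configuration information and the definitions of the resources it initializes.
When the cluster starts, it reads the location of the OLR from /etc/oracle/olr.loc. By default the OLR is stored in the cdata directory under the Grid home, in a file named <node name>.olr.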

[root@rac1 grid]# cat /etc/oracle/olr.loc
olrconfig_loc=/u01/11.2.0/grid/cdata/rac1.olr
crs_home=/u01/11.2.0/grid

[root@rac1 grid]# ls -l /u01/11.2.0/grid/cdata/rac1.olr
-rw------- 1 root oinstall 272756736 Jan 15 12:57 /u01/11.2.0/grid/cdata/rac1.olr
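
The lookup shown above can be wrapped in a small script. This is a minimal sketch, assuming olr.loc has the key=value layout shown in the listing; the function names olr_path_from_loc and check_olr_present are hypothetical helpers and are not part of Grid Infrastructure.

```shell
# Hypothetical helper: print the OLR file path recorded in an olr.loc file.
# Defaults to /etc/oracle/olr.loc, the location used in this article.
olr_path_from_loc() {
    local loc_file="${1:-/etc/oracle/olr.loc}"
    # Print the value after "olrconfig_loc=" from the first matching line.
    sed -n 's/^olrconfig_loc=//p' "$loc_file" | head -n 1
}

# Hypothetical helper: report whether the OLR file named in olr.loc exists.
# Returns 0 when the file is present and 1 otherwise.
check_olr_present() {
    local olr
    olr="$(olr_path_from_loc "$1")"
    if [ -n "$olr" ] && [ -f "$olr" ]; then
        echo "OLR present: $olr"
    else
        echo "OLR missing: ${olr:-<no olrconfig_loc entry>}"
        return 1
    fi
}
```

Calling check_olr_present without arguments checks the default /etc/oracle/olr.loc. Because the OLR file is readable only by root, run it as root.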

2. OLR Backup

The OLR is not backed up automatically. Whenever the cluster configuration changes, back it up manually:

[root@rac1 grid]# ocrconfig -local -manualbackup

rac1     2019/01/15 13:07:02     /u01/11.2.0/grid/cdata/rac1/backup_20190115_130702.olr

rac1     2015/03/18 16:09:38     /u01/11.2.0/grid/cdata/rac1/backup_20150318_160938.olr

If the OLR is lost, it can be restored with the command ocrconfig -local -restore.

3. Simulating OLR Loss

#Delete the OLR file manually
[root@rac1 ~]# rm -rf /u01/11.2.0/grid/cdata/rac1.olr

#Start node 1
[root@rac1 ~]# crsctl start crs

4. Problem Analysis

4.1 Identifying the GI Startup Stage

Run crsctl stat res -t -init to check the state of the lower-stack resources:

[root@rac1 ~]# crsctl stat res -t -init
CRS-4639: Could not contact Oracle High Availability Services
CRS-4000: Command Status failed, or completed with errors.

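The output above shows that the ohasd layer is not up. Either the init.ohasd script, which /etc/inittab uses to start the cluster stack, was never invoked, or the ohasd.bin daemon failed to start. This needs further verification: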

[root@rac1 ~]# ps -ef|grep has
root      2747     1  0 05:42 ?        00:00:01 /bin/sh /etc/init.d/init.ohasd run
root     24733  2747  0 13:32 ?        00:00:00 /u01/11.2.0/grid/perl/bin/perl -I/u01/11.2.0/grid/perl/lib /u01/11.2.0/grid/bin/crswrapexece.pl /u01/11.2.0/grid/crs/install/s_crsconfig_rac1_env.txt /u01/11.2.0/grid/bin/ohasd.bin restart
root     24738  2747  0 13:32 ?        00:00:00 /u01/11.2.0/grid/perl/bin/perl -I/u01/11.2.0/grid/perl/lib /u01/11.2.0/grid/bin/crswrapexece.pl /u01/11.2.0/grid/crs/install/s_crsconfig_rac1_env.txt /u01/11.2.0/grid/bin/ohasd.bin restart
root     24742 14550  0 13:32 pts/2    00:00:00 grep has

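From this output we can infer that the init.ohasd script has been invoked and is attempting to launch ohasd.bin (through the crswrapexece.pl wrapper with the restart argument), so the problem is that ohasd itself fails to start successfully. The next step is to examine the ohasd log file: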

[grid@rac1 ohasd]$ tail -100f /u01/11.2.0/grid/log/rac1/ohasd/ohasd.log
...
2019-01-15 13:36:45.723: [ default][651043328] OHASD Daemon Starting. Command string :restart
2019-01-15 13:36:45.724: [ default][651043328] Initializing OLR
2019-01-15 13:36:45.725: [  OCROSD][651043328]utopen:6m':failed in stat OCR file/disk /u01/11.2.0/grid/cdata/rac1.olr, errno=2, os err string=No such file or directory
2019-01-15 13:36:45.725: [  OCROSD][651043328]utopen:7:failed to open any OCR file/disk, errno=2, os err string=No such file or directory
2019-01-15 13:36:45.725: [  OCRRAW][651043328]proprinit: Could not open raw device 
2019-01-15 13:36:45.725: [  OCRAPI][651043328]a_init:16!: Backend init unsuccessful : [26]
2019-01-15 13:36:45.726: [  CRSOCR][651043328] OCR context init failure.  Error: PROCL-26: Error while accessing the physical storage Operating System error [No such file or directory] [2]
2019-01-15 13:36:45.726: [ default][651043328] OLR initalization failured, rc=26
2019-01-15 13:36:45.726: [ default][651043328]Created alert : (:OHAS00106:) :  Failed to initialize Oracle Local Registry
2019-01-15 13:36:45.726: [ default][651043328][PANIC] OHASD exiting; Could not init OLR
2019-01-15 13:36:45.726: [ default][651043328] Done.
...

The log shows that /u01/11.2.0/grid/cdata/rac1.olr is missing, which prevents the cluster stack from starting. Verify that the file is really gone:

[grid@rac1 ohasd]$ ls -l /u01/11.2.0/grid/cdata/rac1.olr
ls: /u01/11.2.0/grid/cdata/rac1.olr: No such file or directory
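
The log check above can be reduced to a single filter. The following is a minimal sketch; the search patterns are copied from the log excerpt in this article, and find_olr_init_errors is a hypothetical helper name.

```shell
# Hypothetical helper: print the lines of an ohasd log that indicate a failed
# OLR initialization. Patterns come from the excerpt above: the PROCL-26
# storage error, the OHAS00106 alert, and the PANIC message.
# Returns non-zero (grep's exit status) when no such line is found.
find_olr_init_errors() {
    local log_file="$1"
    grep -E 'PROCL-26|OHAS00106|Could not init OLR' "$log_file"
}
```

In this environment it would be called as find_olr_init_errors /u01/11.2.0/grid/log/rac1/ohasd/ohasd.log.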

5. Restoring the OLR File

OLR backup files are stored under the Grid home as cdata/<node name>/backup_<timestamp>.olr:

[grid@rac1 ohasd]$ cd /u01/11.2.0/grid/cdata/rac1/
[grid@rac1 rac1]$ ls -l
total 38236
-rw------- 1 root root 6356992 Mar 18  2015 backup_20150318_160938.olr
-rw------- 1 root root 6385664 Jan  2 14:55 backup_20190102_145536.olr
-rw------- 1 root root 6553600 Jan 15 10:50 backup_20190115_105011.olr
-rw------- 1 root root 6574080 Jan 15 10:50 backup_20190115_105031.olr
-rw------- 1 root root 6594560 Jan 15 13:07 backup_20190115_130702.olr
-rw------- 1 root root 6615040 Jan 15 13:22 backup_20190115_132243.olr
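
Choosing the most recent backup from a listing like this can be scripted. This is a minimal sketch, assuming the backup_<YYYYMMDD>_<HHMMSS>.olr naming seen above, where names sort chronologically as plain text; latest_olr_backup is a hypothetical helper name.

```shell
# Hypothetical helper: print the newest backup_*.olr file in a directory.
# Returns non-zero when the directory contains no matching backup.
latest_olr_backup() {
    local backup_dir="$1" f latest=""
    for f in "$backup_dir"/backup_*.olr; do
        [ -e "$f" ] || continue   # the glob matched nothing
        latest="$f"               # globs expand in sorted order, so the last match is the newest
    done
    [ -n "$latest" ] && echo "$latest"
}
```

For the directory listed above, latest_olr_backup /u01/11.2.0/grid/cdata/rac1 would select backup_20190115_132243.olr, the file used in the restore that follows.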

Restore the OLR:

[root@rac1 rac1]# ocrconfig -local -restore /u01/11.2.0/grid/cdata/rac1/backup_20190115_132243.olr

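#Verify whether the OLR has been restored
[root@rac1 rac1]# ls -l /u01/11.2.0/grid/cdata/rac1.olr
ls: /u01/11.2.0/grid/cdata/rac1.olr: No such file or directory

#The file was not restored. According to the reference material, the OLR file must first be created manually with touch before running the restore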

[root@rac1 rac1]# touch /u01/11.2.0/grid/cdata/rac1.olr
[root@rac1 rac1]# ocrconfig -local -restore /u01/11.2.0/grid/cdata/rac1/backup_20190115_132243.olr 
PROTL-19: Cannot proceed while the Oracle High Availability Service is running

#Kill the init.ohasd process (PID 2747) that launches ohasd
[root@rac1 rac1]# ps -ef|grep grid
root     12936 31479  0 12:55 pts/2    00:00:00 su - grid
grid     12939 12936  0 12:55 pts/2    00:00:00 -bash
root     14454  2747  4 13:47 ?        00:00:01 /u01/11.2.0/grid/bin/ohasd.bin restart
root     14517     1  0 13:47 ?        00:00:00 /u01/11.2.0/grid/bin/orarootagent.bin
root     14589     1  0 13:47 ?        00:00:00 /u01/11.2.0/grid/bin/cssdagent
root     14591     1  0 13:47 ?        00:00:00 /u01/11.2.0/grid/bin/cssdmonitor
root     14617  9726  0 13:47 pts/1    00:00:00 grep grid
root     14711 12483  0 13:10 pts/1    00:00:00 su - grid
grid     14712 14711  0 13:10 pts/1    00:00:00 -bash
root     31478 31442  0 09:01 pts/2    00:00:00 su - grid
grid     31479 31478  0 09:01 pts/2    00:00:00 -bash

[root@rac1 rac1]# kill -9 2747

[root@rac1 rac1]# ocrconfig -local -restore /u01/11.2.0/grid/cdata/rac1/backup_20190115_132243.olr 

#The node now starts normally
[root@rac1 ~]# crsctl start crs
CRS-4123: Oracle High Availability Services has been started.
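
The restore sequence that finally worked can be summarized as a dry-run script. This is only a sketch: print_olr_restore_plan is a hypothetical helper that echoes the commands used in this article for review and executes nothing.

```shell
# Hypothetical dry-run helper: print the restore commands from this article
# for a given OLR path and backup file. It only echoes text.
print_olr_restore_plan() {
    local olr_file="$1" backup_file="$2"
    echo "# run as root with ohasd stopped (otherwise ocrconfig reports PROTL-19)"
    echo "touch $olr_file"
    echo "ocrconfig -local -restore $backup_file"
    echo "crsctl start crs"
}
```

For example, print_olr_restore_plan /u01/11.2.0/grid/cdata/rac1.olr /u01/11.2.0/grid/cdata/rac1/backup_20190115_132243.olr prints the three commands executed above, preceded by the reminder line.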

6. References

《RAC核心技术详解》
