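The cause: node 1 (racj1) had an NFS export mounted from another machine (232.100). Security hardening required the NFS service on that server to be shut down, and it was stopped without first unmounting the NFS directory on racj1. That left RAC node 1 hung.
At this point, open another window on racj1 and check the NFS mount points with the mount command: the server's NFS directory is still mounted on racj1.
fuser -ck /mnt was then forced. All connections to racj1 were dropped; after about 5 seconds it was possible to reconnect, and commands such as df -h ran normally again. (The safe approach: restart the NFS service on the server first so that the client racj1 recovers, then umount the directory. That would have avoided the chain of failures that followed. I did not expect fuser -ck to cause this much trouble.)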
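Before stopping an NFS server, it helps to confirm which NFS file systems a client still has mounted. The following is only a minimal sketch, assuming a mount table in the /proc/mounts layout; the function name list_nfs_mounts is hypothetical and was not part of the original procedure.

```shell
#!/bin/sh
# list_nfs_mounts [FILE]: print "source mountpoint" for every NFS entry in a
# mount table that uses the /proc/mounts layout
# (source, mountpoint, fstype, options, dump, pass).
# FILE defaults to /proc/mounts. The function name is illustrative only.
list_nfs_mounts() {
    awk '$3 == "nfs" || $3 == "nfs4" { print $1, $2 }' "${1:-/proc/mounts}"
}
```

Run without arguments on the client, it lists the mount points that would need a clean umount before the server side is shut down.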
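At this point the RAC cluster services on node 1 (racj1) were found to be down, and all the database ora_ processes had disappeared.
As root, try starting CRS manually; the cluster was back to normal after about 1 minute: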
/oracle/app/11.2.0/grid/bin/crsctl start crs
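The /ogg directory was then found not to be mounted on node 1, while it was still mounted on node 2. OGG uses the ACFS shared cluster file system.
Every attempt to start it failed.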
[grid@racj1 ~]$ asmcmd
ASMCMD> volinfo -a
Diskgroup Name: OGGDG
Volume Name: OGGVOL
Volume Device: /dev/asm/oggvol-141
State: ENABLED
Size (MB): 409600
Resize Unit (MB): 32
Redundancy: UNPROT
Stripe Columns: 4
Stripe Width (K): 128
Usage: ACFS
Mountpath: /ogg
Ran /oracle/app/11.2.0/grid/bin/srvctl stop filesystem -d /dev/asm/oggvol-141
After this, /ogg on node 2 was unmounted as well.
Tried to start it again, but the start failed:
[root@racj1 ~]# /oracle/app/11.2.0/grid/bin/srvctl start filesystem -d /dev/asm/oggvol-141
PRCR-1079 : Failed to start resource ora.oggdg.oggvol.acfs
CRS-5016: Process "/oracle/app/11.2.0/grid/bin/acfssinglefsmount" spawned by agent "/oracle/app/11.2.0/grid/bin/orarootagent.bin" for action "start" failed: details at "(:CLSN00010:)" in "/oracle/app/11.2.0/grid/log/racj1/agent/crsd/orarootagent_root//orarootagent_root.log"
CRS-5016: Process "/sbin/acfsutil" spawned by agent "/oracle/app/11.2.0/grid/bin/orarootagent.bin" for action "start" failed: details at "(:CLSN00010:)" in "/oracle/app/11.2.0/grid/log/racj1/agent/crsd/orarootagent_root//orarootagent_root.log"
CRS-2674: Start of 'ora.oggdg.oggvol.acfs' on 'racj1' failed
[root@racj1 ~]# /oracle/app/11.2.0/grid/bin/srvctl stop filesystem -d /dev/asm/oggvol-141
[root@racj1 ~]# /oracle/app/11.2.0/grid/bin/srvctl start filesystem -d /dev/asm/oggvol-141
PRCR-1079 : Failed to start resource ora.oggdg.oggvol.acfs
CRS-5016: Process "/oracle/app/11.2.0/grid/bin/acfssinglefsmount" spawned by agent "/oracle/app/11.2.0/grid/bin/orarootagent.bin" for action "start" failed: details at "(:CLSN00010:)" in "/oracle/app/11.2.0/grid/log/racj1/agent/crsd/orarootagent_root//orarootagent_root.log"
CRS-5016: Process "/oracle/app/11.2.0/grid/bin/acfssinglefsmount" spawned by agent "/oracle/app/11.2.0/grid/bin/orarootagent.bin" for action "start" failed: details at "(:CLSN00010:)" in "/oracle/app/11.2.0/grid/log/racj2/agent/crsd/orarootagent_root//orarootagent_root.log"
CRS-5016: Process "/sbin/acfsutil" spawned by agent "/oracle/app/11.2.0/grid/bin/orarootagent.bin" for action "start" failed: details at "(:CLSN00010:)" in "/oracle/app/11.2.0/grid/log/racj1/agent/crsd/orarootagent_root//orarootagent_root.log"
CRS-2674: Start of 'ora.oggdg.oggvol.acfs' on 'racj1' failed
CRS-5016: Process "/sbin/acfsutil" spawned by agent "/oracle/app/11.2.0/grid/bin/orarootagent.bin" for action "start" failed: details at "(:CLSN00010:)" in "/oracle/app/11.2.0/grid/log/racj2/agent/crsd/orarootagent_root//orarootagent_root.log"
CRS-2674: Start of 'ora.oggdg.oggvol.acfs' on 'racj2' failed
[root@racj1 ~]# more
The log did in fact contain the key error; it simply went unnoticed at the time:
[grid@racj1 ~]$ srvctl stop filesystem -d /dev/asm/oggvol-141
[grid@racj1 ~]$ acfsutil registry -f -a /dev/asm/oggvol-141 /ogg
acfsutil registry: CLSU-00100: Operating System function: open64 failed with error data: 13
acfsutil registry: CLSU-00101: Operating System error message: Permission denied
acfsutil registry: CLSU-00103: error location: OOF_1
acfsutil registry: CLSU-00104: additional error information: open64 (/dev/asm/oggvol-141)
acfsutil registry: ACFS-03141: unable to open device /dev/asm/oggvol-141
This pointed to a permissions problem. Comparing nodes 1 and 2 confirmed it: the ownership and permissions of /dev/asm/oggvol-141 on racj1 were wrong.
[root@racj1 orarootagent_root]# ls -l /dev/asm/oggvol-141
brw------- 1 root root 251, 72193 Apr 25 09:18 /dev/asm/oggvol-141
[root@racj1 orarootagent_root]# chown root:asmadmin /dev/asm/oggvol-141
[root@racj1 orarootagent_root]# ls -l /dev/asm/oggvol-141
brw------- 1 root asmadmin 251, 72193 Apr 25 09:18 /dev/asm/oggvol-141
[root@racj1 orarootagent_root]# chmod 770 /dev/asm/oggvol-141
[root@racj1 orarootagent_root]# ls -l /dev/asm/oggvol-141
brwxrwx--- 1 root asmadmin 251, 72193 Apr 25 09:18 /dev/asm/oggvol-141
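The comparison between the two nodes can be made repeatable with a small check. This is a sketch under the assumption that GNU stat is available (as on Linux); the function name check_dev_perms is hypothetical, and the expected values are whatever the healthy node reports.

```shell
#!/bin/sh
# check_dev_perms PATH OWNER:GROUP MODE
# Succeeds only if PATH has exactly the given owner:group and octal mode.
# Relies on GNU stat (-c). The function name is illustrative only.
check_dev_perms() {
    actual=$(stat -c '%U:%G %a' "$1") || return 2
    if [ "$actual" = "$2 $3" ]; then
        echo "OK $1 ($actual)"
    else
        echo "MISMATCH $1: got '$actual', expected '$2 $3'"
        return 1
    fi
}
```

For example, check_dev_perms /dev/asm/oggvol-141 root:asmadmin 770 on each node would flag the node whose device file differs from the values shown in the transcript above.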
crsctl status resource -t showed that ora.oggdg.oggvol.acfs was still offline.
Starting it failed, and restarting the ACFS resource failed as well:
crsctl stop resource ora.oggdg.oggvol.acfs
crsctl start resource ora.oggdg.oggvol.acfs
Mounting it manually as root reported an error:
mount -t acfs -rw /dev/asm/oggvol-141 /ogg
[root@racj1 orarootagent_root]# mount -t acfs -rw /dev/asm/oggvol-141 /ogg
mount: wrong fs type, bad option, bad superblock on /dev/asm/oggvol-141,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so
mount.acfs: CLSU-00100: Operating System function: mount failed with error data: 22
mount.acfs: CLSU-00101: Operating System error message: Invalid argument
mount.acfs: CLSU-00103: error location: MOUNT_3
mount.acfs: ACFS-02126: Volume /dev/asm/oggvol-141 cannot be mounted.
dmesg showed the key message:
ACFSK-0021: FSCK-NEEDED set for volume /dev/asm/oggvol-141 . Internal ACFS Location 838 .
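Since the decisive hint was buried in the kernel log, a filter for it can save time. A minimal sketch, assuming the message text has the same shape as the ACFSK-0021 line above; the function name fsck_needed_volumes is hypothetical.

```shell
#!/bin/sh
# fsck_needed_volumes: read kernel log text on stdin and print, once each,
# every volume device named in an "FSCK-NEEDED set for volume" message.
# The function name is illustrative only.
fsck_needed_volumes() {
    sed -n 's/.*FSCK-NEEDED set for volume \([^ ]*\).*/\1/p' | sort -u
}
```

A typical call would be dmesg | fsck_needed_volumes, which prints the device to pass to fsck.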
Following that hint, the fsck command was run and completed successfully:
[root@racj1 ~]# /sbin/fsck -a -v -y -t acfs /dev/asm/oggvol-141
[root@racj1 ~]# su - grid
[grid@racj1 ~]$ crsctl start resource ora.oggdg.oggvol.acfs
Once the directory was mounted successfully, continue with starting OGG.
view report gdcq shows the error:
[/ogg/12c/extract(ggs::gglib::MultiThreading::MainThread::ExecMain()+0x60) [0x752c80]]
: [/ogg/12c/extract(ggs::gglib::MultiThreading::Thread::RunThread(ggs::gglib::MultiThreading::Thread::ThreadArgs*)+0x14d) [0x753d5d]]
: [/ogg/12c/extract(ggs::gglib::MultiThreading::MainThread::Run(int, char**)+0xb1) [0x753e41]]
: [/ogg/12c/extract(main+0x3b) [0x6eff1b]]
: [/lib64/libc.so.6(__libc_start_main+0xfd) [0x3396a1ed1d]]
: [/ogg/12c/extract() [0x69aed1]]
2019-04-25 10:12:39 ERROR OGG-00446 Opening file +ARCHDG/2_5183_986573398.dbf in DBLOGREADER mode: (308) ORA-00308: cannot open archived log '+ARCHDG/2_5183_986573398.dbf'
ORA-17503: ksfdopn:2 Failed to open file +ARCHDG/2_5183_986573398.dbf
ORA-15173: entry '2_5183_986573398.dbf' does not exist in directory '/'
Not able to establish initial position for sequence 5183, rba 1626514448.
2019-04-25 10:12:39 ERROR OGG-01668 PROCESS ABENDING.
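The current policy backs up the archived logs to the tape library every 4 hours and then deletes them, so the archived logs OGG needed for recovery had already been removed from disk.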
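To find out which archived log has to come back from tape, the thread and sequence can be read from the ORA-00308 message in the Extract report. This is a sketch that assumes archived log file names follow the thread_sequence_resetlogs-id.dbf pattern seen in the report above; the function name missing_archivelog is hypothetical.

```shell
#!/bin/sh
# missing_archivelog: read an Extract report on stdin and print
# "thread sequence" for each archived log named in an ORA-00308 message.
# Assumption: file names look like <thread>_<sequence>_<resetlogs id>.dbf.
# The function name is illustrative only.
missing_archivelog() {
    sed -n "s|.*ORA-00308: cannot open archived log '[^']*/\([0-9][0-9]*\)_\([0-9][0-9]*\)_[0-9][0-9]*\.dbf'.*|\1 \2|p" | sort -u
}
```

Fed the report shown above, it identifies thread 2, sequence 5183 as the starting point for the restore.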
[root@racj1 ~]# su - grid
[grid@racj1 ~]$ asmcmd
ASMCMD> ls
ARCHDG/
CRSDG/
DATADG/
OGGDG/
ASMCMD> cd archdg
ASMCMD> ls
GDDB/
ASMCMD> cd gddb
ASMCMD> ls
ARCHIVELOG/
ASMCMD> cd archivelog
ASMCMD> ls
2019_04_25/
ASMCMD> cd 2019*
ASMCMD> ls
thread_1_seq_7409.702.1006510689
ASMCMD> ls
thread_1_seq_7409.702.1006510689
SQL> set line 132 wrap off
SQL> select * from v$Log;
truncating (as requested) before column NEXT_CHANGE#
GROUP# THREAD# SEQUENCE# BYTES BLOCKSIZE MEMBERS ARC STATUS FIRST_CHANGE# FIRST_TIME NEXT_TIME
---------- ---------- ---------- ---------- ---------- ---------- --- ---------------- ------------- ------------------- -----------
1 1 7409 2147483648 512 1 YES ACTIVE 1.5742E+13 2019-04-25 09:56:19 2019-04-25
2 1 7407 2147483648 512 1 YES INACTIVE 1.5742E+13 2019-04-25 09:13:43 2019-04-25
3 1 7410 2147483648 512 1 NO CURRENT 1.5742E+13 2019-04-25 10:18:08
4 1 7408 2147483648 512 1 YES INACTIVE 1.5742E+13 2019-04-25 09:19:05 2019-04-25
5 1 7406 2147483648 512 1 YES INACTIVE 1.5742E+13 2019-04-25 08:37:07 2019-04-25
6 2 5193 2147483648 512 1 NO CURRENT 1.5742E+13 2019-04-25 09:56:17
7 2 5189 2147483648 512 1 YES INACTIVE 1.5742E+13 2019-04-25 08:37:04 2019-04-25
8 2 5190 2147483648 512 1 YES INACTIVE 1.5742E+13 2019-04-25 08:52:05 2019-04-25
9 2 5191 2147483648 512 1 YES INACTIVE 1.5742E+13 2019-04-25 09:17:26 2019-04-25
10 2 5192 2147483648 512 1 YES INACTIVE 1.5742E+13 2019-04-25 09:47:27 2019-04-25
10 rows selected.
/usr/openv/netbackup/bin/bplist -S 'nbujxq' -C 'racj2' -t 4 -R -l /
-rw-rw---- oracle asmadmin 5052160K 4月 25 09:56 /al_2879_1_1006509383
-rw-rw---- oracle asmadmin 8343552K 4月 25 09:56 /al_2878_1_1006509383
-rw-rw---- oracle asmadmin 7269376K 4月 25 09:56 /al_2877_1_1006509383
-rw-rw---- oracle asmadmin 8310016K 4月 25 09:56 /al_2876_1_1006509383
RUN {
allocate channel D1 type 'sbt_tape' parms 'SBT_LIBRARY=/usr/openv/netbackup/bin/libobk.so64';
allocate channel D2 type 'sbt_tape' parms 'SBT_LIBRARY=/usr/openv/netbackup/bin/libobk.so64';
send 'NB_ORA_SERV=nbujxq,NB_ORA_CLIENT=racj2';
restore archivelog from logseq 5183 until logseq 5193 thread 2;
restore archivelog from logseq 7390 until logseq 7408 thread 1;
RELEASE CHANNEL D1;
RELEASE CHANNEL D2;
}
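When several threads need logs restored, the restore statements can be generated rather than typed by hand. A minimal sketch that only prints text in the same form as the statements in the RUN block above; the function name rman_restore_line is hypothetical, and the output still has to be placed inside a RUN block with suitable channels.

```shell
#!/bin/sh
# rman_restore_line THREAD FROM_SEQ TO_SEQ
# Print one "restore archivelog" statement for the given thread and
# sequence range. Rejects non-numeric input and reversed ranges.
# The function name is illustrative only.
rman_restore_line() {
    [ "$#" -eq 3 ] || { echo "usage: rman_restore_line THREAD FROM TO" >&2; return 2; }
    for n in "$1" "$2" "$3"; do
        case "$n" in
            ''|*[!0-9]*) echo "arguments must be non-negative integers" >&2; return 2 ;;
        esac
    done
    if [ "$2" -gt "$3" ]; then
        echo "FROM must not be greater than TO" >&2
        return 1
    fi
    printf 'restore archivelog from logseq %s until logseq %s thread %s;\n' "$2" "$3" "$1"
}
```

For instance, rman_restore_line 2 5183 5193 reproduces the thread 2 statement used in the block above.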