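First thing this morning I noticed that the DB Time monitoring value for our core system had been stuck at a single point for some time, which looked wrong.
Our monitoring script relies on the SNAP_ID from the dba_hist_snapshot view, so I started generating an AWR report to check whether the SNAP_IDs looked abnormal: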

                          21220 19 Sep 2018 09:00      1
                          21221 19 Sep 2018 10:00      1
                          21222 19 Sep 2018 11:00      1
                          21223 19 Sep 2018 12:00      1
                          21224 19 Sep 2018 13:00      1
                          21225 19 Sep 2018 14:00      1
                          21226 19 Sep 2018 15:00      1
                          21227 19 Sep 2018 16:00      1
                          21228 19 Sep 2018 17:00      1
                          21229 19 Sep 2018 18:00      1
                          21230 19 Sep 2018 19:00      1

Specify the Begin and End Snapshot Ids


Enter value for begin_snap: 
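
The gap in the listing above can also be measured mechanically. What follows is only a minimal sketch, not the monitoring script mentioned earlier: `newest_snap_age_min` is a hypothetical helper name, it assumes GNU `date`, and it assumes listing rows of the form `SNAP_ID DD Mon YYYY HH:MM LEVEL` as shown above. It reads the listing on stdin and prints how many minutes old the newest snapshot is relative to a reference time passed as the first argument.

```shell
# Hypothetical helper (not part of Oracle's tooling): reads the snapshot
# listing on stdin and prints the age, in minutes, of the newest snapshot
# relative to the reference time given in $1.
# Assumptions: GNU date; rows look like "21230 19 Sep 2018 19:00      1".
newest_snap_age_min() {
  now_epoch=$(date -u -d "$1" +%s) || return 1
  # Keep only 6-field rows whose first field is a numeric SNAP_ID;
  # remember the timestamp of the last such row.
  last_ts=$(awk 'NF == 6 && $1 ~ /^[0-9]+$/ { ts = $2 " " $3 " " $4 " " $5 }
                 END { print ts }')
  [ -n "$last_ts" ] || return 1
  snap_epoch=$(date -u -d "$last_ts" +%s) || return 1
  echo $(( (now_epoch - snap_epoch) / 60 ))
}
```

With hourly snapshots, as in the listing, an age well above 60 minutes suggests that automatic snapshots have stopped. Both timestamps are parsed as UTC only so that the difference does not depend on the local time zone.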

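The newest snapshot in the list is from 19:00 on 19 Sep, and nothing has been taken since. There were indeed CBC (cache buffers chains) related waits on the system last night, but they cleared up quickly. So what is going on? Is the archive destination full, or has one of the MM* processes gone down? I tried creating a snapshot manually, and that worked.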

[oracle@bapdb2 trace]$ sqlplus / as sysdba

SQL*Plus: Release 11.2.0.4.0 Production on Thu Sep 20 10:40:33 2018

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options

10:40:33 SYS@bapdb2(bapdb2)> set line 300 pages 1000
10:40:35 SYS@bapdb2(bapdb2)> BEGIN
10:40:37   2  DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT ();
10:40:37   3  END;
10:40:37   4  /

PL/SQL procedure successfully completed.

The archive disk group also has plenty of free space, so an archiving problem is not what caused a process to fail:

10:43:57 SYS@b2(db2)> select group_number,block_size,name,allocation_unit_size,state,type,total_mb,free_mb,offline_disks from v$asm_diskgroup;

GROUP_NUMBER BLOCK_SIZE NAME                           ALLOCATION_UNIT_SIZE STATE       TYPE     TOTAL_MB    FREE_MB OFFLINE_DISKS
------------ ---------- ------------------------------ -------------------- ----------- ------ ---------- ---------- -------------
           1       4096 SAS_ARCH                                    1048576 CONNECTED   EXTERN    1024000     617921             0

Processes on node 1:
[oracle@db1 ~]$ ps -ef |grep mm
grid       6634      1  0  2017 ?        00:33:47 asm_mman_+ASM1
grid       6648      1  0  2017 ?        01:52:06 asm_mmon_+ASM1
grid       6650      1  0  2017 ?        2-00:53:46 asm_mmnl_+ASM1
oracle     8610      1  0  2017 ?        00:33:56 ora_mman_db1
oracle     8650      1  0  2017 ?        3-11:28:35 ora_mmon_db1
oracle     8655      1  1  2017 ?        4-07:20:56 ora_mmnl_db1

Processes on node 2:
[oracle@bapdb2 ~]$ ps -ef |grep mm
oracle    54354  53982  0 11:09 pts/1    00:00:00 grep mm
grid     105256      1  0  2017 ?        00:23:52 asm_mman_+ASM2
grid     105295      1  0  2017 ?        01:15:06 asm_mmon_+ASM2
grid     105312      1  0  2017 ?        1-03:49:26 asm_mmnl_+ASM2
oracle   106889      1  0  2017 ?        00:28:00 ora_mman_db2
oracle   106927      1  0  2017 ?        3-04:47:42 ora_mmnl_db2
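
The comparison done by eye on the two listings above can be sketched as a small shell function. This is an illustration under assumptions, not an existing tool: `missing_bg_procs` is a hypothetical name, and the process naming pattern `ora_<name>_<SID>` is taken from the `ps` output shown here. It reads `ps -ef` output on stdin and prints each of the three MM* background processes that is absent for the given instance.

```shell
# Hypothetical helper: reads `ps -ef` output on stdin and prints the name of
# every expected background process (mman, mmon, mmnl) that is missing for
# the instance SID given in $1. Prints nothing when all three are present.
missing_bg_procs() {
  mbp_sid="$1"
  mbp_ps="$(cat)"
  for mbp_proc in mman mmon mmnl; do
    # The command name is the last column of `ps -ef`, so anchor at end of line.
    if ! printf '%s\n' "$mbp_ps" |
         grep -Eq "[[:space:]]ora_${mbp_proc}_${mbp_sid}[[:space:]]*\$"; then
      echo "ora_${mbp_proc}_${mbp_sid}"
    fi
  done
}

# Example on a live host (SID is an assumption): ps -ef | missing_bg_procs db2
```

Anchoring the match at the end of the line keeps the `grep mm` process itself, and the `asm_*` processes, from being counted.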

The MMON process on node 2 is down. Searching the alert logs for when MMON was started:
Tue Sep 19 03:49:00 2017
MMON started with pid=36, OS id=8650
Tue Sep 19 03:49:00 2017
MMNL started with pid=37, OS id=8655  

Tue Sep 19 04:01:47 2017
MMON started with pid=36, OS id=106923
Tue Sep 19 04:01:47 2017
MMNL started with pid=37, OS id=106927

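The process with OS id 106923 is indeed gone. I have handled a similar case before: the MMON-related processes can be restarted directly on node 2 by enabling and then disabling restricted session: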

SQL> alter system enable restricted session; 
System altered. 
SQL> alter system disable restricted session; 
System altered. 

The alert log confirms the restart:
Thu Sep 20 11:10:28 2018
Stopping background process MMNL
Starting background process MMON
Starting background process MMNL
Thu Sep 20 11:10:29 2018
MMON started with pid=37, OS id=55936 
Thu Sep 20 11:10:29 2018
MMNL started with pid=236, OS id=55938 
ALTER SYSTEM enable restricted session;
minact-scn: Inst 2 is a slave inc#:16 mmon proc-id:55936 status:0x2
minact-scn status: grec-scn:0x0026.4dcf0d36 gmin-scn:0x0026.4dcf0d36 gcalc-scn:0x0026.4dcf1208
Thu Sep 20 11:11:05 2018
ALTER SYSTEM disable restricted session;
Thu Sep 20 11:13:25 2018
LGWR: Standby redo logfile selected for thread 2 sequence 154126 for destination LOG_ARCHIVE_DEST_3

Checking the processes again shows that they have started normally:
11:10:29 SYS@db2(xxxdb2)> !ps -ef |grep mm
oracle    55936      1  0 11:10 ?        00:00:00 ora_mmon_db2
oracle    55938      1  0 11:10 ?        00:00:00 ora_mmnl_db2
grid     105256      1  0  2017 ?        00:23:52 asm_mman_+ASM2
grid     105295      1  0  2017 ?        01:15:06 asm_mmon_+ASM2
grid     105312      1  0  2017 ?        1-03:49:26 asm_mmnl_+ASM2
oracle   106889      1  0  2017 ?        00:28:00 ora_mman_db2

I then looked at the trace (.trc) file of the old MMON process and found this at the very end:
*** 2018-09-19 18:46:41.432
minact-scn slave-status: grec-scn:0x0026.4db016c0 gmin-scn:0x0026.4db016c0 gcalc-scn:0x0026.4db0273c
minact-scn slave-status: grec-scn:0x0026.4dbdde59 gmin-scn:0x0026.4dbdde59 gcalc-scn:0x0026.4dbdf492

*** 2018-09-19 18:56:44.302
minact-scn slave-status: grec-scn:0x0026.4dca45db gmin-scn:0x0026.4dca45db gcalc-scn:0x0026.4dca5990

*** 2018-09-19 19:01:37.026
error 28 detected in background process
OPIRIP: Uncaught error 447. Error stack:
ORA-00447: fatal error in background process
ORA-00028: your session has been killed
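
When a trace file is long, pulling out just the error codes makes the tail easier to read. The following is a minimal sketch: `ora_codes` is a hypothetical helper name, and it assumes a `grep` that supports `-o` (GNU and BSD grep do). It reads trace text on stdin and prints each distinct ORA- code once, in order of first appearance.

```shell
# Hypothetical helper: prints the distinct ORA-nnnnn codes found on stdin,
# in order of first appearance.
ora_codes() {
  grep -Eo 'ORA-[0-9]{5}' | awk '!seen[$0]++'
}

# Example (the trace file path is left to the reader): ora_codes < mmon_trace.trc
```

It only lists codes; the surrounding lines still have to be read in the trace itself to see the order of events.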

My guess is that it was caused by the issue described here:
Fixed Objects Statistics (GATHER_FIXED_OBJECTS_STATS) Considerations (Doc ID 798257.1)