不知为什么,向目的库插入时速度超慢,在数据库里发现,这些库都有MLOG$东西被锁。
在网上,找到一篇blog,讲的还蛮有道理的,转过来自己收藏:
truncate MLOG$
解决高级复制环境下刷新日志HWM太高导致刷新越来越慢。
Doc ID: Note:69432.1 Subject: CAN AN INDEX BE CREATED ON MLOG$ (SNAPSHOT LOG)? Type: BULLETIN Status: PUBLISHED
Can an index be created on MLOG$ (SNAPSHOT LOG)?
================================================
You have noticed that your fast refreshes are becoming slower and slower and
less efficient as time goes on. You want to add an index on the MLOG$ snapshot
log table to speed up the refreshes and to prevent a Full Table Scan on the
snapshot log.
First of all, the answer is NOT to add an index on the snapshot log table.
Oracle has to do a full table scan on the snapshot log so that it can read all
the rows that it needs to refresh for a particular snapshot. Besides, all SQL
statements generated by the refresh operation is hardcoded in the kernel.
What may be causing the performance degradation is that your snapshot log's
High Water Mark (HWM) may be wastefully high. The snapshot log table grows at
peak times, but never shrinks once the refresh is done. Therefore, during a
fast refresh of the snapshot, Oracle will read the snapshot log using full
table scan all the way to the HWM.
The answer to speeding up the performance of the snapshot refresh is to reset
the HWM. The only way to do this is to truncate the snapshot log or recreate
it.
Once your snapshot log is purged (meaning all snapshots have already refreshed
against that master table), then issue a truncate on it.
i.e. SQL> truncate table mlog$_EMP;
This will reset the HWM without affecting any of your snapshot's fast refreshes.
If you choose to RECREATE your snapshot log, you will have to follow up by
performing a COMPLETE refresh on all the affected snapshots.
试验下,再看。
= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
高级复制中mlog$表的HWM过高导致性能问题
这篇论坛文章(赛迪网技术社区)针对高级复制中由于mlog$表的HWM过高导致的性能问题进行了详细的讲解,更多内容请参考下文:
某系统升级后,报告事务端的数据库反应非常慢,用户大量投诉。
现场检查系统后,发现CPU占用到了80%(平时该系统正常时只有10%~20%)。wio也非常高(说明磁盘读写严重)。一段时间后,现场将statspack报告发回来了:
STATSPACK report for
DB Name DB Id Instance Inst Num Release Cluster Host
------------ ----------- ------------ -------- ----------- ------- ------------
OLTPDB 3781951398 OLTPDB 1 9.2.0.5.0 NO jnaip21
Snap Id Snap Time Sessions Curs/Sess Comment
------- ------------------ -------- --------- -------------------
Begin Snap: 3 27-Sep-04 13:15:58 32 83.4
End Snap: 4 27-Sep-04 14:10:36 37 83.4
Elapsed: 54.63 (mins)
Cache Sizes (end)
~~~~~~~~~~~~~~~~~
Buffer Cache: 720M Std Block Size: 8K
Shared Pool Size: 112M Log Buffer: 512K
Load Profile
~~~~~~~~~~~~ Per Second Per Transaction
--------------- ---------------
Redo size: 11,763.97 7,400.17
Logical reads: 8,251.68 5,190.75
Block changes: 83.17 52.32
Physical reads: 2,488.23 1,565.23
Physical writes: 6.15 3.87
User calls: 411.88 259.09
Parses: 56.57 35.59
Hard parses: 0.29 0.18
Sorts: 14.59 9.18
Logons: 0.12 0.08
Executes: 80.04 50.35
Transactions: 1.59
% Blocks changed per Read: 1.01 Recursive Call %: 28.14
Rollback per transaction %: 0.17 Rows per Sort: 1818.01
Instance Efficiency Percentages (Target 100%)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Buffer Nowait %: 99.99 Redo NoWait %: 100.00
Buffer Hit %: 69.88 In-memory Sort %: 99.99
Library Hit %: 99.78 Soft Parse %: 99.50
Execute to Parse %: 29.32 Latch Hit %: 99.98
Parse CPU to Parse Elapsd %: 96.72 % Non-Parse CPU: 99.02
Shared Pool Statistics Begin End
------ ------
Memory Usage %: 93.60 93.71
% SQL with executions>1: 20.28 19.05
% Memory for SQL w/exec>1: 25.59 25.88
Top 5 Timed Events
~~~~~~~~~~~~~~~~~~ % Total
Event Waits Time (s) Ela Time
-------------------------------------------- ------------ ----------- --------
db file scattered read 599,637 3,886 63.79
CPU time 1,142 18.74
db file sequential read 238,983 629 10.33
enqueue 263 366 6.00
log file sync 5,154 23 .37
-------------------------------------------------------------
Wait Events for DB: OLTPDB Instance: OLTPDB Snaps: 3 -4
-> s - second
-> cs - centisecond - 100th of a second
-> ms - millisecond - 1000th of a second
-> us - microsecond - 1000000th of a second
-> ordered by wait time desc, waits desc (idle events last)
Avg
Total Wait wait Waits
Event Waits Timeouts Time (s) (ms) /txn
---------------------------- ------------ ---------- ---------- ------ --------
db file scattered read 599,637 0 3,886 6 115.1
db file sequential read 238,983 0 629 3 45.9
enqueue 263 4 366 1390 0.1
log file sync 5,154 0 23 4 1.0
log file parallel write 15,143 8,060 20 1 2.9
db file parallel write 1,008 0 12 12 0.2
control file parallel write 1,065 0 4 4 0.2
buffer busy waits 3,195 0 4 1 0.6
... ...
SQL ordered by Gets for DB: OLTPDB Instance: OLTPDB Snaps: 3 -4
-> End Buffer Gets Threshold: 10000
-> Note that resources reported for PL/SQL includes the resources used by
all SQL statements called within the PL/SQL code. As individual SQL
statements are also reported, it is possible and valid for the summed
total % to exceed 100
CPU Elapsd
Buffer Gets Executions Gets per Exec %Total Time (s) Time (s) Hash Value
--------------- ------------ -------------- ------ -------- --------- ----------
6,509,469 1,060 6,141.0 24.1 66.35 64.61 2833840857
Module: JDBC Thin Client
select lib.code, lib.name, lib.info, leaf.leafdirindex, tp.tpname, from t_commend com, t_leafinfo leaf, t_lib lib,t_tpinfo tp where
2,258,385 343 6,584.2 8.3 65.41 1289.82 3307794804
Module: (TNS V1-V3)
SELECT /*+ */ "A2"."PHONENUMBER","A2"."TGROUPID","A2"."LOOPNO
","A2"."TCODE" FROM "RU"."T_LOOP" "A2", (SELECT /*+ */
DISTINCT "A3"."LOOPNO" "LOOPNO","A3"."TGROUPID" "TGROUPID"
FROM "RU"."MLOG$_T_LOOP" "A3" WHERE "A3"."SNAPTIME$$">:1
AND "A3"."DMLTYPE$$"<>'D') "A1" WHERE "A2"."LOOPNO"="A1"."LOOPNO
2,251,917 343 6,565.4 8.3 66.51 1306.09 1606105459
Module: (TNS V1-V3)
SELECT /*+ */ DISTINCT "A1"."LOOPNO","A1"."TGROUPID" FROM "RI
NG"."MLOG$_T_LOOP" "A1" WHERE (("A1"."LOOPNO","A1"."TGROU
PID")<>ALL (SELECT "A2"."LOOPNO","A2"."TGROUPID" FROM "RU".
"T_LOOP" "A2" WHERE "A1"."LOOPNO"="A2"."LOOPNO" AND "A1"."TO
NEGROUPID"="A2"."TGROUPID")) AND "A1"."SNAPTIME$$">:1 AND "A1
2,249,895 343 6,559.5 8.3 66.34 1098.20 4039622144
delete from "RU"."MLOG$_T_LOOP" where snaptime$$ <= :1
... ...
Physical Reads Executions Reads per Exec %Total Time (s) Time (s) Hash Value
--------------- ------------ -------------- ------ -------- --------- ----------
2,129,843 343 6,209.5 26.1 66.34 1098.20 4039622144
delete from "RU"."MLOG$_T_LOOP" where snaptime$$ <= :1
2,129,004 343 6,207.0 26.1 65.41 1289.82 3307794804
Module: (TNS V1-V3)
SELECT /*+ */ "A2"."PHONENUMBER","A2"."TGROUPID","A2"."LOOPNO
","A2"."TCODE" FROM "RU"."T_LOOP" "A2", (SELECT /*+ */
DISTINCT "A3"."LOOPNO" "LOOPNO","A3"."TGROUPID" "TGROUPID"
FROM "RU"."MLOG$_T_LOOP" "A3" WHERE "A3"."SNAPTIME$$">:1
AND "A3"."DMLTYPE$$"<>'D') "A1" WHERE "A2"."LOOPNO"="A1"."LOOPNO
2,123,001 343 6,189.5 26.0 66.51 1306.09 1606105459
Module: (TNS V1-V3)
SELECT /*+ */ DISTINCT "A1"."LOOPNO","A1"."TGROUPID" FROM "RI
NG"."MLOG$_T_LOOP" "A1" WHERE (("A1"."LOOPNO","A1"."TGROU
PID")<>ALL (SELECT "A2"."LOOPNO","A2"."TGROUPID" FROM "RU".
"T_LOOP" "A2" WHERE "A1"."LOOPNO"="A2"."LOOPNO" AND "A1"."TO
NEGROUPID"="A2"."TGROUPID")) AND "A1"."SNAPTIME$$">:1 AND "A1
1,724,081 278 6,201.7 21.1 54.05 887.05 2153620124
update "RU"."MLOG$_T_LOOP" set snaptime$$ = :1 where snap
time$$ > to_date('2100-01-01:00:00:00','YYYY-MM-DD:HH24:MI:SS')
关注statspack中以下信息:Buffer Hit非常低,只有69.88%;在Top Event的等待事件中,db file scattered read的等待最高,占到了63.79%。很明显,这是由于数据文件扫描导致的buffer cache下降非常严重。而就肯定是由于存在全表扫描(全表扫描到的数据块不会被cache在buffer cache,而是会已最快的速度被置换出内存,这就是为什么全表扫描会导致buffer hit下降)语句所致。
再看top sql中的内容,我们发现,大多数top sql都涉及到一个奇特的表:MLOG$_T_LOOP。这张表的使用是不会出现在应用程序中。因此相关语句都是被系统调用的。那么这张表是做什么用的呢?
我们的系统为了将业务处理与业务查询分离,通过高级复制将业务处理端的某些表同步到业务查询端去。而mlog$_表就是记录最后一次刷新后,被同步表的数据变化的日志表,当查询端(即快照端)的刷新作业(我们设置10秒钟间隔运行)通过mlog$_读取到源表的数据变化,更新查询端相应的表。而业务处理端检测到所有快照端都更新过这些数据后,就会将这些数据删除。此外,mlog表上是没有任何索引,因此对mlog表的查询肯定是全表扫描。但是由于我们的刷新作业10秒钟更新一次,而10秒钟所产生的数据变化量非常小,在正确情况下,这样的全表扫描是不会影响到系统性能的。
我查了一下MLOG$_T_LOOP,发现它记录的数据确实非常少(一般只有几条数据),但是对它查询钟,physical reads却达到了6200多,这显然不正常。说明这张表的高水位(HWM,高水位记录的是表曾经达到的最大数据块,就像洪水过后留下的水位线。而对表的扫描并不是以当前的实际水位为基准的,而是以高水位为基准的,也就是说会扫描表曾经占用过的所有数据块)非常高。
通过现场了解,我们得知现场在做升级时,并没有停止同步刷新作业,而是将刷新作业间隔延长到2个小时。升级过程中,两边的数据库都没有停止。但是,期间在业务处理端做了一些数据处理,特别是表T_LOOP有大量的数据更新。
问题很明确了!由于升级过程中T_LOOP有大量的数据更新,而快照没有被刷新。因此mlog$_t_loop中保存了两个小时内的所有数据变化记录,导致高水位上升!而刷新作业再次恢复到10秒间隔后,尽管mlog$_t_loop中的数据量下降了,但是它的高水位并没有下降!
于是安排现场将业务处理端的客户连接断开,然后停止快照端的刷新作业,手工刷新一次同步后,检查业务处理端的mlog$表,确保这些表为空,然后truncate mlog$表,恢复同步作业,最后恢复业务处理端的客户连接。经过一段时间观察后,数据库恢复正常。
= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
mlog$_xxx表hwm过高导致性能问题的解决办法
在fuyuncat的个人站点上,有篇关于mlog$表的HWM过高导致性能问题的一个解决方案,在用物化视图进行快速刷新来同步数据的中,这个问题是非常容易遇到的,特别在数据复制源端数据更新非常频繁的情况下。。。
fuyuncat的解决办法是:安排现场,将业务处理端的客户连接断开,然后停止快照端的刷新作业,手工刷新一次同步后,检查业务处理端的mlog$表,确保这些表为空,然后truncate table mlog$表,恢复同步作业,最后恢复业务处理端的客户连接。
这样做确实没有错,但如果在确实不能停应用的情况下,该如何解决这个问题呢?其实只要通过dbms_metadata.get_ddl抽取出mlog$_表的创建语法,就知道如何做了.这个表看起来象系统表,没有索引,其实是允许dba做一些DDL,DML等操作的,由于该表的HWM提高了,所以mlog$_xxx即使没有任何记录,全表扫描一次也是很慢的。。。
create table MLOG$_TEST
( ID VARCHAR2(32),
SNAPTIME$$ DATE,
DMLTYPE$$ VARCHAR2(1),
OLD_NEW$$ VARCHAR2(1),
CHANGE_VECTOR$$ RAW(255))
为了不阻塞应用,可以用online redefinition的方式移动一下表;对于没有索引的表,另一种风险较大的办法,就是选择业务最少的时间直接alter table mlog$_test move,这样做虽然可能会对应用有一点阻塞,但只要存储的性能好,几百兆的mlog$_表能在十几秒内完成,基本上也不会有什么影响。。。