问题描述
数据库hang,大量latch: cache buffers chains异常等待事件 业务慢,主机CPU达到100%
Event | Waits | Time(s) | Avg wait (ms) | % DB time | Wait Class |
---|---|---|---|---|---|
latch: cache buffers chains | 644,380 | 321,720 | 499 | 47.19 | Concurrency |
log file sync | 1,324,320 | 173,152 | 131 | 25.40 | Commit |
DB CPU | 109,560 | 16.07 | |||
latch free | 20,281 | 4,206 | 207 | 0.62 | Other |
db file sequential read | 254,620 | 3,612 | 14 | 0.53 | User I/O |
执行慢的语句几乎都是一条undate语句,该语句执行走主键索引。该语句执行计划一直都没有变化,且为主键索引,性能一直较好。
5y22vy8jp06yz UPDATE service_deal_info SET server_info = :1, finishserv_time = to_char(systimestamp, 'yyyy-MM-dd HH24:MI:SS.ff3'), response_info = :2, restransaction_id = :3 WHERE orchestration_id = :4 AND business_id = :5
[10:25:48]---------------------------------------------------------------------------------------------- [10:25:48]| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| [10:25:48]---------------------------------------------------------------------------------------------- [10:25:48]| 0 | UPDATE STATEMENT | | | | 1 (100)| [10:25:48]| 1 | UPDATE | SERVICE_DEAL_INFO | | | | [10:25:48]| 2 | TABLE ACCESS BY INDEX ROWID| SERVICE_DEAL_INFO | 1 | 2367 | 0 (0)| [10:25:48]| 3 | INDEX UNIQUE SCAN | PK_SERVICE_DEAL_INFO | 1 | | 0 (0)| [10:25:48]----------------------------------------------------------------------------------------------
3xccqv0u3f49x DELETE FROM SERVICE_DEAL_INFO WHERE 1=1 AND CALLSERV_TIME BETWEEN '2012-07-17 00:00:00.000' AND '2012-07-17 23:59:59.999'。
经过分析,比较了有无此delete时的数据库性能变化
无此删除操作时,平均逻辑读为15.47
Buffer Gets Executions Gets per Exec %Total Elapsed Time (s) %CPU %IO SQL Id SQL Module SQL Text
1,155,588 74,721 15.47 2.81 52.61 84.2 13.3 5y22vy8jp06yz bwengine@tibapp6 (TNS V1-V3) UPDATE service_deal_info S...
有delete操作时,平均逻辑读是31,829.71,增加了2000倍,整个过程中直接计划并没有变化
Buffer Gets Executions Gets per Exec %Total Elapsed Time (s) %CPU %IO SQL Id SQL Module SQL Text
2,093,885,334 65,784 31,829.71 9.86 251,438.86 21.2 0 5y22vy8jp06yz bwengine@tibapp6 (TNS V1-V3) UPDATE service_deal_info S...
delete操作对update操作为什么有这么大的影响呢?下面在测试机上做了模拟实验,版本是11gr2
create table as select * from dba_objects create index idx_t on t(object_id); SQL> select * from t where object_id=100; OWNER OBJECT_NAME SUBOBJECT_NAME ------------------------------ -------------------------------------------------------------------------------------------------------------------------------- ------------------------------ OBJECT_ID DATA_OBJECT_ID OBJECT_TYPE CREATED LAST_DDL_TIME TIMESTAMP STATUS T G S NAMESPACE EDITION_NAME ---------- -------------- ------------------- ------------------- ------------------- ------------------- ------- - - - ---------- ------------------------------ SYS ORA$BASE 100 EDITION 2012-07-10 15:27:10 2012-07-10 15:28:10 2012-07-10:15:27:10 VALID N N N 64 ---------------------------------------------------------- Plan hash value: 1594971208 ------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 128 | 11392 | 141 (0)| 00:00:02 | | 1 | TABLE ACCESS BY INDEX ROWID| T | 128 | 11392 | 141 (0)| 00:00:02 | |* 2 | INDEX RANGE SCAN | IDX_T | 137 | | 3 (0)| 00:00:01 | ------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - access("OBJECT_ID"=100) í3??D??¢ ---------------------------------------------------------- 0 recursive calls 0 db block gets 4 consistent gets 0 physical reads 0 redo size 1612 bytes sent via SQL*Net to client 520 bytes received via SQL*Net from client 2 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 1 rows processed SQL> delete t where object_id >100; ò?é?3y13643DD?£
SQL> select * from t where object_id=100; OWNER OBJECT_NAME SUBOBJECT_NAME ------------------------------ -------------------------------------------------------------------------------------------------------------------------------- ------------------------------ OBJECT_ID DATA_OBJECT_ID OBJECT_TYPE CREATED LAST_DDL_TIME TIMESTAMP STATUS T G S NAMESPACE EDITION_NAME ---------- -------------- ------------------- ------------------- ------------------- ------------------- ------- - - - ---------- ------------------------------ SYS ORA$BASE 100 EDITION 2012-07-10 15:27:10 2012-07-10 15:28:10 2012-07-10:15:27:10 VALID N N N 64 ?′DD???? ---------------------------------------------------------- Plan hash value: 1594971208 ------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 128 | 11392 | 141 (0)| 00:00:02 | | 1 | TABLE ACCESS BY INDEX ROWID| T | 128 | 11392 | 141 (0)| 00:00:02 | |* 2 | INDEX RANGE SCAN | IDX_T | 137 | | 3 (0)| 00:00:01 | ------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - access("OBJECT_ID"=100) í3??D??¢ ---------------------------------------------------------- 0 recursive calls 0 db block gets 83 consistent gets 0 physical reads 172 redo size 1612 bytes sent via SQL*Net to client 520 bytes received via SQL*Net from client 2 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 1 rows processed 09:57:50 SQL> / ------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 128 | 11392 | 141 (0)| 00:00:02 | | 1 | TABLE ACCESS BY INDEX ROWID| T | 128 | 11392 | 141 (0)| 00:00:02 | |* 2 | INDEX RANGE SCAN | IDX_T | 137 | | 3 (0)| 00:00:01 | ------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - access("OBJECT_ID"=100) í3??D??¢ ---------------------------------------------------------- 0 recursive calls 0 db block gets 83 consistent gets 0 physical reads 128 redo size 1612 bytes sent via SQL*Net to client 520 bytes received via SQL*Net from client 2 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 1 rows processed
SQL> @showundo SESS User Name STA Machine OS User Process Start Time STA UBLK UREC UNDO ------------ ---------- --- --------------- ---------- --------- ----------- --- --------- --------- ------------ 493,2841 SYS INA dtydb3 oracle 5401 07/19 11:1 ACT 448 14464 _SYSSMU14_39
-- 开启级别为12的Trace,level后面的数字设置了Trace的级别 ALTER SESSION SET EVENTS '10046 trace name context forever, level 12' --为了方面跟踪,刷新内存中的数据 alter system flush buffer_cache; --再次运行select查询 select * from t where object_id=100;查看trace文件发现如下内容,该select语句要对一些数据库做物理读
*** 2012-07-19 13:53:21.117 WAIT #47325232720248: nam='SQL*Net message from client' ela= 77532123 driver id=1650815232 #bytes=1 p3=0 obj#=0 tim=1342677201117155 CLOSE #47325232720248:c=0,e=26,dep=0,type=3,tim=1342677201117394 PARSE #47325232720248:c=999,e=100,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=1594971208,tim=1342677201117583 EXEC #47325232720248:c=0,e=115,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=1594971208,tim=1342677201117748 WAIT #47325232720248: nam='SQL*Net message to client' ela= 12 driver id=1650815232 #bytes=1 p3=0 obj#=0 tim=1342677201117838 WAIT #47325232720248: nam='gc cr grant 2-way' ela= 494 p1=1 p2=56761 p3=1 obj#=14679 tim=1342677201118711 WAIT #47325232720248: nam='db file sequential read' ela= 703 file#=1 block#=56761 blocks=1 obj#=14679 tim=1342677201119677 WAIT #47325232720248: nam='gc cr grant 2-way' ela= 355 p1=1 p2=56762 p3=1 obj#=14679 tim=1342677201120387 WAIT #47325232720248: nam='db file sequential read' ela= 746 file#=1 block#=56762 blocks=1 obj#=14679 tim=1342677201121279 WAIT #47325232720248: nam='gc current grant 2-way' ela= 277 p1=1 p2=56762 p3=33619969 obj#=14679 tim=1342677201121938 WAIT #47325232720248: nam='db file sequential read' ela= 648 file#=4 block#=2834 blocks=1 obj#=0 tim=1342677201123117 WAIT #47325232720248: nam='gc cr grant 2-way' ela= 440 p1=1 p2=32250 p3=1 obj#=13803 tim=1342677201124671 WAIT #47325232720248: nam='db file sequential read' ela= 576 file#=1 block#=32250 blocks=1 obj#=13803 tim=1342677201125414 WAIT #47325232720248: nam='gc current grant 2-way' ela= 1205 p1=1 p2=32250 p3=33619969 obj#=13803 tim=1342677201126887 WAIT #47325232720248: nam='db file sequential read' ela= 8551 file#=4 block#=155526 blocks=1 obj#=0 tim=1342677201136316 WAIT #47325232720248: nam='db file sequential read' ela= 553 file#=4 block#=155525 blocks=1 obj#=0 tim=1342677201137873 WAIT #47325232720248: nam='db file sequential read' ela= 556 file#=4 block#=155524 blocks=1 obj#=0 tim=1342677201140705 FETCH #47325232720248:c=10999,e=24598,p=7,cr=82,cu=0,mis=0,r=1,dep=0,og=1,plh=1594971208,tim=1342677201142513 WAIT #47325232720248: nam='SQL*Net message from client' ela= 744 driver id=1650815232 #bytes=1 p3=0 obj#=0 tim=1342677201143371 FETCH #47325232720248:c=0,e=34,p=0,cr=1,cu=0,mis=0,r=0,dep=0,og=1,plh=1594971208,tim=1342677201143493 WAIT #47325232720248: nam='SQL*Net message to client' ela= 7 driver id=1650815232 #bytes=1 p3=0 obj#=0 tim=1342677201143622继续查看,物理读的对象除了表T和相关索引,还有undo数据块
SQL> select FILE_NAME,file_id from dba_data_files where file_id in(1,4); FILE_NAME FILE_ID ---------------------------------------------- +DATA/hrdb/datafile/system.374.788282825 1 +DATA/hrdb/datafile/undotbs2.378.788282973 4 SELECT SEGMENT_TYPE,OWNER||'.'||SEGMENT_NAME FROM DBA_EXTENTS WHERE 4 = FILE_ID AND 155524 BETWEEN BLOCK_ID AND BLOCK_ID+BLOCKS -1; SQL> SELECT SEGMENT_TYPE,OWNER||'.'||SEGMENT_NAME FROM DBA_EXTENTS WHERE 4 = FILE_ID AND 155526 BETWEEN BLOCK_ID AND BLOCK_ID+BLOCKS -1; SEGMENT_TYPE OWNER||'.'||SEGMENT_NAME ------------------ ---------------------------------------------------------------------------------------------------------------- TYPE2 UNDO SYS._SYSSMU20_2938845397$ SQL> SELECT SEGMENT_TYPE,OWNER||'.'||SEGMENT_NAME FROM DBA_EXTENTS WHERE 4 = FILE_ID AND 155525 BETWEEN BLOCK_ID AND BLOCK_ID+BLOCKS -1; SEGMENT_TYPE OWNER||'.'||SEGMENT_NAME ------------------ ---------------------------------------------------------------------------------------------------------------- TYPE2 UNDO SYS._SYSSMU20_2938845397$ SQL> SELECT SEGMENT_TYPE,OWNER||'.'||SEGMENT_NAME FROM DBA_EXTENTS WHERE 4 = FILE_ID AND 155524 BETWEEN BLOCK_ID AND BLOCK_ID+BLOCKS -1; SEGMENT_TYPE OWNER||'.'||SEGMENT_NAME ------------------ ---------------------------------------------------------------------------------------------------------------- TYPE2 UNDO SYS._SYSSMU20_2938845397$
总结:原因就很明显了,当批量删除数据时,如果另一个会话需要读写该表时,由于没有提交,为了保持数据的一致性,需要读取undo数据库进行数据库的还原操作。而这些读的操作往往是之前逻辑读的几十倍甚至上千倍,如果修改数据的事务长时间没有提交,会严重影响其它对改表的语句的性能。
对于大事务,特别是更新或DELETE数千万记录的大事务,在生产系统上尽量避免单条SQL一次性做。这造成的影响特别大,比如: