走主键索引的查询sql变慢的问题

问题描述
数据库hang,大量latch: cache buffers chains异常等待事件 业务慢,主机CPU达到100%

Event Waits Time(s) Avg wait (ms) % DB time Wait Class
latch: cache buffers chains 644,380 321,720 499 47.19 Concurrency
log file sync 1,324,320 173,152 131 25.40 Commit
DB CPU 109,560 16.07
latch free 20,281 4,206 207 0.62 Other
db file sequential read 254,620 3,612 14 0.53 User I/O

执行慢的语句几乎都是一条undate语句,该语句执行走主键索引。该语句执行计划一直都没有变化,且为主键索引,性能一直较好。

5y22vy8jp06yz UPDATE service_deal_info SET server_info = :1, finishserv_time = to_char(systimestamp, 'yyyy-MM-dd HH24:MI:SS.ff3'), response_info = :2, restransaction_id = :3 WHERE orchestration_id = :4 AND business_id = :5

[10:25:48]----------------------------------------------------------------------------------------------
[10:25:48]| Id  | Operation                    | Name                     | Rows  | Bytes | Cost (%CPU)|
[10:25:48]----------------------------------------------------------------------------------------------
[10:25:48]|   0 | UPDATE STATEMENT             |                          |       |       |     1 (100)|
[10:25:48]|   1 |  UPDATE                      |   SERVICE_DEAL_INFO    |       |       |            |
[10:25:48]|   2 |   TABLE ACCESS BY INDEX ROWID| SERVICE_DEAL_INFO    |     1 |  2367 |     0   (0)|
[10:25:48]|   3 |    INDEX UNIQUE SCAN         | PK_SERVICE_DEAL_INFO |     1 |       |     0   (0)|
[10:25:48]----------------------------------------------------------------------------------------------

同时,发现此时还有一个delete操作,正在对改表做delete操作。初步判断改delete对其它undate有影响

3xccqv0u3f49x DELETE FROM SERVICE_DEAL_INFO WHERE 1=1 AND CALLSERV_TIME BETWEEN '2012-07-17 00:00:00.000' AND '2012-07-17 23:59:59.999'。

经过分析,比较了有无此delete时的数据库性能变化

无此删除操作时,平均逻辑读为15.47
Buffer Gets Executions Gets per Exec %Total Elapsed Time (s) %CPU %IO SQL Id SQL Module SQL Text
1,155,588 74,721 15.47 2.81 52.61 84.2 13.3 5y22vy8jp06yz bwengine@tibapp6 (TNS V1-V3) UPDATE service_deal_info S...
有delete操作时,平均逻辑读是31,829.71,增加了2000倍,整个过程中直接计划并没有变化
Buffer Gets Executions Gets per Exec %Total Elapsed Time (s) %CPU %IO SQL Id SQL Module SQL Text
2,093,885,334 65,784 31,829.71 9.86 251,438.86 21.2 0 5y22vy8jp06yz bwengine@tibapp6 (TNS V1-V3) UPDATE service_deal_info S...


delete操作对update操作为什么有这么大的影响呢?下面在测试机上做了模拟实验,版本是11gr2

create table as select * from dba_objects
create index idx_t  on t(object_id);

SQL> select * from t where object_id=100;

OWNER                          OBJECT_NAME                                                                                  SUBOBJECT_NAME
------------------------------ -------------------------------------------------------------------------------------------------------------------------------- ------------------------------
 OBJECT_ID DATA_OBJECT_ID OBJECT_TYPE         CREATED             LAST_DDL_TIME       TIMESTAMP           STATUS  T G S  NAMESPACE EDITION_NAME
---------- -------------- ------------------- ------------------- ------------------- ------------------- ------- - - - ---------- ------------------------------
SYS                            ORA$BASE
       100                EDITION             2012-07-10 15:27:10 2012-07-10 15:28:10 2012-07-10:15:27:10 VALID   N N N     64
----------------------------------------------------------
Plan hash value: 1594971208

-------------------------------------------------------------------------------------
| Id  | Operation                   | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |       |   128 | 11392 |   141   (0)| 00:00:02 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T     |   128 | 11392 |   141   (0)| 00:00:02 |
|*  2 |   INDEX RANGE SCAN          | IDX_T |   137 |       |     3   (0)| 00:00:01 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("OBJECT_ID"=100)
í3??D??¢
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          4  consistent gets
          0  physical reads
          0  redo size
       1612  bytes sent via SQL*Net to client
        520  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

SQL> delete t where object_id >100;
ò?é?3y13643DD?£

再次查询表t,逻辑读变为83,并且产生了172的redo size,且多次查询逻辑读均为83,且有了少量redo

SQL> select * from t where object_id=100;

OWNER                          OBJECT_NAME                                                                                  SUBOBJECT_NAME
------------------------------ -------------------------------------------------------------------------------------------------------------------------------- ------------------------------
 OBJECT_ID DATA_OBJECT_ID OBJECT_TYPE         CREATED             LAST_DDL_TIME       TIMESTAMP           STATUS  T G S  NAMESPACE EDITION_NAME
---------- -------------- ------------------- ------------------- ------------------- ------------------- ------- - - - ---------- ------------------------------
SYS                            ORA$BASE
       100                EDITION             2012-07-10 15:27:10 2012-07-10 15:28:10 2012-07-10:15:27:10 VALID   N N N     64
?′DD????
----------------------------------------------------------
Plan hash value: 1594971208

-------------------------------------------------------------------------------------
| Id  | Operation                   | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |       |   128 | 11392 |   141   (0)| 00:00:02 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T     |   128 | 11392 |   141   (0)| 00:00:02 |
|*  2 |   INDEX RANGE SCAN          | IDX_T |   137 |       |     3   (0)| 00:00:01 |
-------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("OBJECT_ID"=100)
í3??D??¢
----------------------------------------------------------
          0  recursive calls
          0  db block gets
         83  consistent gets
          0  physical reads
        172  redo size
       1612  bytes sent via SQL*Net to client
        520  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

09:57:50 SQL> /
-------------------------------------------------------------------------------------
| Id  | Operation                   | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |       |   128 | 11392 |   141   (0)| 00:00:02 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T     |   128 | 11392 |   141   (0)| 00:00:02 |
|*  2 |   INDEX RANGE SCAN          | IDX_T |   137 |       |     3   (0)| 00:00:01 |
-------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("OBJECT_ID"=100)
í3??D??¢
----------------------------------------------------------
          0  recursive calls
          0  db block gets
         83  consistent gets
          0  physical reads
        128  redo size
       1612  bytes sent via SQL*Net to client
        520  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

实验看,应该是批量delete操作,导致了select操作的逻辑读变化很大,增加了20倍。为什么同样的操作,逻辑读差别这么大呢,很明显和delete操作有关。delete操作会产生大量的undo段,这undo被缓存在内存中,此时有448个undo块

SQL> @showundo

SESS         User Name  STA Machine         OS  User   Process   Start Time  STA      UBLK      UREC UNDO
------------ ---------- --- --------------- ---------- --------- ----------- --- --------- --------- ------------
493,2841     SYS        INA dtydb3          oracle     5401      07/19  11:1 ACT       448     14464 _SYSSMU14_39

为了弄清原因,使用10046 trace进行跟踪分析

-- 开启级别为12的Trace,level后面的数字设置了Trace的级别
ALTER SESSION SET EVENTS '10046 trace name context forever, level 12'
--为了方面跟踪,刷新内存中的数据
alter system flush buffer_cache;
--再次运行select查询
select * from t where object_id=100;
查看trace文件发现如下内容,该select语句要对一些数据库做物理读

*** 2012-07-19 13:53:21.117
WAIT #47325232720248: nam='SQL*Net message from client' ela= 77532123 driver id=1650815232 #bytes=1 p3=0 obj#=0 tim=1342677201117155
CLOSE #47325232720248:c=0,e=26,dep=0,type=3,tim=1342677201117394
PARSE #47325232720248:c=999,e=100,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=1594971208,tim=1342677201117583
EXEC #47325232720248:c=0,e=115,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=1594971208,tim=1342677201117748
WAIT #47325232720248: nam='SQL*Net message to client' ela= 12 driver id=1650815232 #bytes=1 p3=0 obj#=0 tim=1342677201117838
WAIT #47325232720248: nam='gc cr grant 2-way' ela= 494 p1=1 p2=56761 p3=1 obj#=14679 tim=1342677201118711
WAIT #47325232720248: nam='db file sequential read' ela= 703 file#=1 block#=56761 blocks=1 obj#=14679 tim=1342677201119677
WAIT #47325232720248: nam='gc cr grant 2-way' ela= 355 p1=1 p2=56762 p3=1 obj#=14679 tim=1342677201120387
WAIT #47325232720248: nam='db file sequential read' ela= 746 file#=1 block#=56762 blocks=1 obj#=14679 tim=1342677201121279
WAIT #47325232720248: nam='gc current grant 2-way' ela= 277 p1=1 p2=56762 p3=33619969 obj#=14679 tim=1342677201121938
WAIT #47325232720248: nam='db file sequential read' ela= 648 file#=4 block#=2834 blocks=1 obj#=0 tim=1342677201123117
WAIT #47325232720248: nam='gc cr grant 2-way' ela= 440 p1=1 p2=32250 p3=1 obj#=13803 tim=1342677201124671
WAIT #47325232720248: nam='db file sequential read' ela= 576 file#=1 block#=32250 blocks=1 obj#=13803 tim=1342677201125414
WAIT #47325232720248: nam='gc current grant 2-way' ela= 1205 p1=1 p2=32250 p3=33619969 obj#=13803 tim=1342677201126887
WAIT #47325232720248: nam='db file sequential read' ela= 8551 file#=4 block#=155526 blocks=1 obj#=0 tim=1342677201136316
WAIT #47325232720248: nam='db file sequential read' ela= 553 file#=4 block#=155525 blocks=1 obj#=0 tim=1342677201137873
WAIT #47325232720248: nam='db file sequential read' ela= 556 file#=4 block#=155524 blocks=1 obj#=0 tim=1342677201140705
FETCH #47325232720248:c=10999,e=24598,p=7,cr=82,cu=0,mis=0,r=1,dep=0,og=1,plh=1594971208,tim=1342677201142513
WAIT #47325232720248: nam='SQL*Net message from client' ela= 744 driver id=1650815232 #bytes=1 p3=0 obj#=0 tim=1342677201143371
FETCH #47325232720248:c=0,e=34,p=0,cr=1,cu=0,mis=0,r=0,dep=0,og=1,plh=1594971208,tim=1342677201143493
WAIT #47325232720248: nam='SQL*Net message to client' ela= 7 driver id=1650815232 #bytes=1 p3=0 obj#=0 tim=1342677201143622
继续查看,物理读的对象除了表T和相关索引,还有undo数据块

SQL> select FILE_NAME,file_id from dba_data_files where file_id in(1,4);

FILE_NAME                                                FILE_ID
---------------------------------------------- 
+DATA/hrdb/datafile/system.374.788282825                 1
+DATA/hrdb/datafile/undotbs2.378.788282973               4


SELECT SEGMENT_TYPE,OWNER||'.'||SEGMENT_NAME FROM DBA_EXTENTS WHERE  4 = FILE_ID AND 155524 BETWEEN BLOCK_ID AND BLOCK_ID+BLOCKS -1;


SQL> SELECT SEGMENT_TYPE,OWNER||'.'||SEGMENT_NAME FROM DBA_EXTENTS WHERE  4 = FILE_ID AND 155526 BETWEEN BLOCK_ID AND BLOCK_ID+BLOCKS -1;

SEGMENT_TYPE       OWNER||'.'||SEGMENT_NAME
------------------ ----------------------------------------------------------------------------------------------------------------
TYPE2 UNDO         SYS._SYSSMU20_2938845397$

SQL> SELECT SEGMENT_TYPE,OWNER||'.'||SEGMENT_NAME FROM DBA_EXTENTS WHERE  4 = FILE_ID AND 155525 BETWEEN BLOCK_ID AND BLOCK_ID+BLOCKS -1;

SEGMENT_TYPE       OWNER||'.'||SEGMENT_NAME
------------------ ----------------------------------------------------------------------------------------------------------------
TYPE2 UNDO         SYS._SYSSMU20_2938845397$

SQL> SELECT SEGMENT_TYPE,OWNER||'.'||SEGMENT_NAME FROM DBA_EXTENTS WHERE  4 = FILE_ID AND 155524 BETWEEN BLOCK_ID AND BLOCK_ID+BLOCKS -1;

SEGMENT_TYPE       OWNER||'.'||SEGMENT_NAME
------------------ ----------------------------------------------------------------------------------------------------------------
TYPE2 UNDO         SYS._SYSSMU20_2938845397$

总结:原因就很明显了,当批量删除数据时,如果另一个会话需要读写该表时,由于没有提交,为了保持数据的一致性,需要读取undo数据库进行数据库的还原操作。而这些读的操作往往是之前逻辑读的几十倍甚至上千倍,如果修改数据的事务长时间没有提交,会严重影响其它对改表的语句的性能。

对于大事务,特别是更新或DELETE数千万记录的大事务,在生产系统上尽量避免单条SQL一次性做。这造成的影响特别大,比如:

  • 事务可能意外中断,回滚时间很长,事务恢复时过高的并行度可能引起负载增加。
  • 表中大量的行长时间被锁住。
  • 如果事务意外中断,长时间的回滚(恢复)过程中,可能严重影响SQL性能(因为查询时需要回滚块)。
  • 事务还未提交时,影响SQL性能,比如本文中提到的情况。
  • 消耗过多UNDO空间。
  • 对于DELETE大事务,有些版本的oracle在空闲空间查找上会有问题,导致在INSERT数据时,查找空间导致过长的时间。
  • 对于RAC数据库,由于一致性读的代价更大,所以大事务的危害更大。
参考文档
http://www.laoxiong.net/full-table-scan-but-lots-of-db-file-sequential-read%E4%B8%80%E4%BE%8B.html

你可能感兴趣的:(走主键索引的查询sql变慢的问题)