某省电信在做批扣(批销)时,出现严重的性能问题,发现下面这一条SQL性能非常低下:
SELECT A.ACCT_BALANCE_ID,
A.BALANCE_TYPE_ID,
A.ACCT_ID,
NVL(A.SERV_ID, -1) SERV_ID,
NVL(A.ITEM_GROUP_ID, -1) ITEM_GROUP_ID,
A.OBJECT_TYPE_ID,
F.PRIORITY,
A.BALANCE,
NVL(A.CYCLE_UPPER, -1) CYCLE_UPPER,
NVL(A.CYCLE_LOWER, -1) CYCLE_LOWER,
NVL(A.CYCLE_UPPER_TYPE, ’ ’) CYCLE_UPPER_TYPE,
NVL(A.CYCLE_LOWER_TYPE, ’ ’) CYCLE_LOWER_TYPE,
B.ADJUST_FLAG ADJUST_FLAG,
B.ALLOW_TRANS ALLOW_TRANS,
B.CORPUS_FLAG,
NVL(TO_CHAR(A.EFF_DATE, ’YYYYMMDDHH24MISS’), ’ ’) EFF_DATE,
NVL(TO_CHAR(A.EXP_DATE, ’YYYYMMDDHH24MISS’), ’ ’) EXP_DATE,
A.STATE,
TO_CHAR(A.STATE_DATE, ’YYYYMMDDHH24MISS’) STATE_DATE,
B.BALANCE_TYPE_NAME,
NVL(C.ACCT_NAME, ’ ’) ACCT_NAME,
NVL(D.ACC_NBR, ’ ’) SERV_NAME,
NVL(E.ITEM_GROUP_NAME, ’ ’) ITEM_GROUP_NAME
FROM (SELECT ACCT_BALANCE_ID,
BALANCE_TYPE_ID,
ACCT_ID,
NVL(SERV_ID, -1) SERV_ID,
NVL(ITEM_GROUP_ID, -1) ITEM_GROUP_ID,
OBJECT_TYPE_ID,
BALANCE,
NVL(CYCLE_UPPER, -1) CYCLE_UPPER,
NVL(CYCLE_LOWER, -1) CYCLE_LOWER,
NVL(CYCLE_UPPER_TYPE, ’ ’) CYCLE_UPPER_TYPE,
NVL(CYCLE_LOWER_TYPE, ’ ’) CYCLE_LOWER_TYPE,
EFF_DATE,
EXP_DATE,
STATE,
STATE_DATE,
0 SHARE_RULE_PRIORITY
FROM ACCT_BALANCE
WHERE ACCT_ID = :LACCTID
AND BALANCE > 0
AND BALANCE_TYPE_ID != 1
AND STATE = ’10A’
UNION
SELECT A1.ACCT_BALANCE_ID,
A1.BALANCE_TYPE_ID,
B1.ACCT_ID,
NVL(B1.SERV_ID, -1) SERV_ID,
NVL(B1.ITEM_GROUP_ID, -1) ITEM_GROUP_ID,
A1.OBJECT_TYPE_ID,
A1.BALANCE,
NVL(B1.UPPER_AMOUNT, -1) CYCLE_UPPER,
NVL(B1.LOWER_AMOUNT, -1) CYCLE_LOWER,
NVL(A1.CYCLE_UPPER_TYPE, ’ ’) CYCLE_UPPER_TYPE,
NVL(A1.CYCLE_LOWER_TYPE, ’ ’) CYCLE_LOWER_TYPE,
A1.EFF_DATE,
A1.EXP_DATE,
A1.STATE,
A1.STATE_DATE,
B1.PRIORITY SHARE_RULE_PRIORITY
FROM ACCT_BALANCE A1, BALANCE_SHARE_RULE B1
WHERE A1.ACCT_BALANCE_ID = B1.ACCT_BALANCE_ID
AND B1.ACCT_ID = :LACCTID
AND A1.BALANCE > 0
AND A1.BALANCE_TYPE_ID != 1
AND A1.STATE = ’10A’
AND NVL(B1.EFF_DATE, SYSDATE) < = SYSDATE
AND NVL(B1.EXP_DATE, SYSDATE) >= SYSDATE) A,
BALANCE_TYPE B,
ACCT C,
SERV D,
A_BALANCE_ITEM_GROUP E,
A_BALANCE_OBJECT_TYPE F
WHERE A.BALANCE_TYPE_ID = B.BALANCE_TYPE_ID
AND A.OBJECT_TYPE_ID = F.OBJECT_TYPE_ID
AND A.ACCT_ID = C.ACCT_ID
AND A.SERV_ID = D.SERV_ID(+)
AND A.ITEM_GROUP_ID = E.ITEM_GROUP_ID(+)
ORDER BY F.PRIORITY,
B.PRIORITY,
A.SHARE_RULE_PRIORITY ASC,
A.EXP_DATE ASC,
A.EFF_DATE ASC,
A.BALANCE ASC
查看执行计划:
SQL> select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost | Pstart| Pstop |
------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2723G| 696T| | 7776M| | |
| 1 | SORT ORDER BY | | 2723G| 696T| 1503T| 7776M| | |
|* 2 | HASH JOIN | | 2723G| 696T| 59M| 39355 | | |
|* 3 | HASH JOIN | | 228K| 56M| | 23918 | | |
| 4 | TABLE ACCESS FULL | BALANCE_TYPE | 8 | 184 | | 7 | | |
|* 5 | HASH JOIN | | 228K| 51M| | 23907 | | |
| 6 | TABLE ACCESS FULL | A_BALANCE_OBJECT_TYPE | 4 | 16 | | 7 | | |
|* 7 | HASH JOIN OUTER | | 228K| 50M| 43M| 23896 | | |
|* 8 | HASH JOIN OUTER | | 228K| 40M| 38M| 23199 | | |
| 9 | VIEW | | 228K| 36M| | 2043 | | |
| 10 | SORT UNIQUE | | 228K| 11M| 38M| 2043 | | |
| 11 | UNION-ALL | | | | | | | |
|* 12 | TABLE ACCESS BY GLOBAL INDEX ROWID | ACCT_BALANCE | 228K| 11M| | 50 | ROWID | ROW L |
|* 13 | INDEX RANGE SCAN | IDX_ACCT_BALANCE_ACCT_ID42 | 121K| | | 3 | | |
| 14 | NESTED LOOPS | | 1 | 146 | | 4 | | |
|* 15 | TABLE ACCESS FULL | BALANCE_SHARE_RULE | 1 | 109 | | 2 | | |
|* 16 | TABLE ACCESS BY GLOBAL INDEX ROWID| ACCT_BALANCE | 1 | 37 | | 2 | ROWID | ROW L |
|* 17 | INDEX UNIQUE SCAN | PK_P_ACCT_BALANCE2 | 1 | | | 1 | | |
| 18 | PARTITION RANGE ALL | | | | | | 1 | 63 |
| 19 | TABLE ACCESS FULL | SERV | 12M| 258M| | 14070 | 1 | 63 |
| 20 | TABLE ACCESS FULL | A_BALANCE_ITEM_GROUP | 244 | 11224 | | 7 | | |
| 21 | PARTITION RANGE ALL | | | | | | 1 | 63 |
| 22 | TABLE ACCESS FULL | ACCT | 11M| 239M| | 8505 | 1 | 63 |
------------------------------------------------------------------------------------------------------------------------------------Predicate Information (identified by operation id):
---------------------------------------------------2 - access(”A”.”ACCT_ID”=”C”.”ACCT_ID”)
3 - access(”A”.”BALANCE_TYPE_ID”=”B”.”BALANCE_TYPE_ID”)
5 - access(”A”.”OBJECT_TYPE_ID”=”F”.”OBJECT_TYPE_ID”)
7 - access(”A”.”ITEM_GROUP_ID”=”E”.”ITEM_GROUP_ID”(+))
8 - access(”A”.”SERV_ID”=”D”.”SERV_ID”(+))
12 - filter(”ACCT_BALANCE”.”BALANCE”>0 AND “ACCT_BALANCE”.”BALANCE_TYPE_ID”<>1 AND “ACCT_BALANCE”.”STATE”=’10A’)
13 - access(”ACCT_BALANCE”.”ACCT_ID”=TO_NUMBER(:Z))
15 - filter(”B1″.”ACCT_ID”=TO_NUMBER(:Z) AND NVL(”B1″.”EFF_DATE”,SYSDATE@!)< =SYSDATE@! AND
NVL("B1"."EXP_DATE",SYSDATE@!)>=SYSDATE@!)
16 - filter(”A1″.”BALANCE”>0 AND “A1″.”BALANCE_TYPE_ID”<>1 AND “A1″.”STATE”=’10A’)
17 - access(”A1″.”ACCT_BALANCE_ID”=”B1″.”ACCT_BALANCE_ID”)
从执行计划来看,Oracle评估出来的COST是相当的惊人(7776M),而返回的行数是2723G。看起来应该是统计信息不准确导致了选择了错误的执行计划。
对于一个SQL,如果其执行计划错误地采用于full table scan和hash join,则一般是由于评估表访问返回的数据行数过多,使其高估了成本。反之,如果是错误地采用了索引扫描,则一般是由于评估表访问返回的数据行数太少,使其低估了成本。
我们来分析这个执行计划。首先查看访问表时,Oracle评估返回的行的数目是否有偏大的情况。我们发现看以下的几行,返回的结果集行数很高:
|* 12 | TABLE ACCESS BY GLOBAL INDEX ROWID | ACCT_BALANCE | 228K| 11M| | 50 | ROWID | ROW L |
|* 13 | INDEX RANGE SCAN | IDX_ACCT_BALANCE_ACCT_ID42 | 121K| | | 3 | | |
| 19 | TABLE ACCESS FULL | SERV | 12M| 258M| | 14070 | 1 | 63 |
| 22 | TABLE ACCESS FULL | ACCT | 11M| 239M| | 8505 | 1 | 63 |
最值得怀疑的首先是第12行和第13行。第13行是一个索引范围扫描,返回的结果集行数居然有121K行,而这个访问条件只是 “access(”ACCT_BALANCE”.”ACCT_ID”=TO_NUMBER(:Z))”,从业务上来看,看起来这条SQL是查询某个帐户的 余额及其他信息的。输入的条件为ACCT_ID,而一般来说,帐户余额表中,一个ACCT_ID的对应的行,只有很少的行(10行以下),而不会返回这么 多行。
IDX_ACCT_BALANCE_ACCT_ID42这个索引,是一个在ACCT_BALANCE表(这是个分区表)的ACCT_ID上的全局普 通索引(GLOBAL NORMAL INDEX)。检查统计信息,这个表是当天才收集的,ACCT_ID这个字段没有任何统计信息(已经被删除)。同时索引中的统计信息如下:
Index Tree Leaf Distinct Number Leaf Blocks Data Blocks Cluster
Name Unique Level Blks Keys of Rows Per Key Per Key Factor
--------------- --------- ----- ---- -------------- -------------- ----------- ----------- ------------
IDX_ACCT_BALANC NONUNIQUE 2 #### 580,079 30,742,660 1 47 27,447,720
E_ACCT_ID42
看起来索引的Distinct Keys值太小,才580,079。索引的总行数为30,742,660。重新分析该索引后,Distinct Keys的值才是正确的值:10,540,493。
再看看ACCT_BALANCE表中的ACCT_ID字段,由于没有统计信息,那么Oracle会使用一个默认的选择率(对于表的单个字段条件过滤,一般 是1%)。对于这样一个3000多万行的表来说,默认的选择率造成评估的返回行数过高。由于正是出帐期间,重新分析这样一个表,影响是非常大的,而这个表 之前刚分析过。我们这里可以使用DBMS_STATS.SET_COLUMN_STATS过程手工设置统计信息:
SQL> exec dbms_stats.set_column_stats(ownname=>user,tabname=>’ACCT_BALANCE’,colname=>’ACCT_ID’,
NULLCNT=>0,DENSITY=>1/10540393,DISTCNT=>10540393);
这里的统计信息来自于ACCT_ID字段上的索引(ACCT_ID字段定义为NOT NULL,所以比较方便地手工计算出这些统计信息)。
我们再次查看执行计划:
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost | Pstart| Pstop |
------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 35M| 9603M| | 1472K| | |
| 1 | SORT ORDER BY | | 35M| 9603M| 20G| 1472K| | |
| 2 | NESTED LOOPS | | 35M| 9603M| | 62 | | |
| 3 | NESTED LOOPS OUTER | | 3 | 780 | | 56 | | |
| 4 | NESTED LOOPS OUTER | | 3 | 717 | | 50 | | |
| 5 | NESTED LOOPS | | 3 | 579 | | 47 | | |
| 6 | NESTED LOOPS | | 3 | 510 | | 44 | | |
| 7 | VIEW | | 3 | 498 | | 42 | | |
| 8 | SORT UNIQUE | | 3 | 248 | | 42 | | |
| 9 | UNION-ALL | | | | | | | |
|* 10 | TABLE ACCESS BY GLOBAL INDEX ROWID | ACCT_BALANCE | 2 | 102 | | 6 | ROWID | ROW L |
|* 11 | INDEX RANGE SCAN | IDX_ACCT_BALANCE_ACCT_ID42 | 3 | | | 3 | | |
| 12 | NESTED LOOPS | | 1 | 146 | | 4 | | |
|* 13 | TABLE ACCESS FULL | BALANCE_SHARE_RULE | 1 | 109 | | 2 | | |
|* 14 | TABLE ACCESS BY GLOBAL INDEX ROWID| ACCT_BALANCE | 1 | 37 | | 2 | ROWID | ROW L |
|* 15 | INDEX UNIQUE SCAN | PK_P_ACCT_BALANCE2 | 1 | | | 1 | | |
| 16 | TABLE ACCESS BY INDEX ROWID | A_BALANCE_OBJECT_TYPE | 1 | 4 | | 1 | | |
|* 17 | INDEX UNIQUE SCAN | PK_A_BALANCE_OBJECT_TYPE | 1 | | | | | |
| 18 | TABLE ACCESS BY INDEX ROWID | BALANCE_TYPE | 1 | 23 | | 1 | | |
|* 19 | INDEX UNIQUE SCAN | PK_BALANCE_TYPE | 1 | | | | | |
| 20 | TABLE ACCESS BY INDEX ROWID | A_BALANCE_ITEM_GROUP | 1 | 46 | | 1 | | |
|* 21 | INDEX UNIQUE SCAN | PK_ITEM_GROUP_ID_HN | 1 | | | | | |
| 22 | TABLE ACCESS BY GLOBAL INDEX ROWID | SERV | 1 | 21 | | 2 | ROWID | ROW L |
|* 23 | INDEX UNIQUE SCAN | PK_SERV | 1 | | | 1 | | |
| 24 | TABLE ACCESS BY GLOBAL INDEX ROWID | ACCT | 11M| 239M| | 2 | ROWID | ROW L |
|* 25 | INDEX UNIQUE SCAN | PK_ACCT | 1 | | | 1 | | |
------------------------------------------------------------------------------------------------------------------------------------
现在可以看到,执行计划已经是改善了很多,只有唯一的一个全表扫描--BALANCE_SHARE_RULE,不过那个表的行数为0。
这个时候,客户反映系统正常了。虽然Oracle在ACCT表上评估返回的行数仍然很高,但是从系统故障处理这个角度上讲目的已经达到了。接下来就是处理这个ACCT表,类似的问题,类似的过程,这里不再细述。
总结:
此次性能问题的处理,从接到客户电话,到拨号连接,到最终处理完成,不到1小时。个人认为,这个处理速度还是比较快速的。做一个如下的总结: