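2013-08-08, Thursday afternoon
---------------Histograms--------------------------
Histogram information: gathered together with the other optimizer statistics; it has a large influence on execution plans.
The dbms_stats package analyzes tables and indexes at three levels:
1. The table itself: number of rows, row length, number of data blocks, etc.; part of this can be seen in user_tables
2. The columns: number of distinct values, NULLs in the column, and how the data is distributed in the column (the histogram)
3. The indexes: number of leaf blocks, index height, and the index clustering factor.
A histogram refers specifically to how data is distributed in a column:
How a histogram is built: when Oracle runs a histogram analysis, it divides the data in the column into a number of parts,
and each part is called a bucket.
This lets the CBO see how the values in the column are distributed,
and that distribution goes into the cost calculation as an important factor.
Histogram types:
1. Height-balanced histogram
2. Frequency histogram
Example:
1. Create a test table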
create table t as
select rownum as id,round(dbms_random.normal*1000) as val1,
100+round(ln(rownum/3.25+2)) as val2,
100+round(ln(rownum/3.25+2)) as val3,
dbms_random.string('P',250) as pad
from dual connect by level<=1000
order by dbms_random.value;
2. Add a primary key
SQL> alter table t add constraint t_pk primary key(id);
Table altered.
3. Create indexes
SQL> create index t_val1_i on t(val1);
Index created.
SQL> create index t_val2_i on t(val2);
Index created.
SQL> exec dbms_stats.gather_table_stats(user,'t',estimate_percent=>100,method_opt=>'for all columns size auto',cascade=>true);--consider all columns for histograms; Oracle decides which columns get one and how many buckets to use
PL/SQL procedure successfully completed.
method_opt:
Accepts:
FOR ALL [INDEXED | HIDDEN] COLUMNS [size_clause]
FOR COLUMNS [size_clause] column|attribute [size_clause] [,column|attribute [size_clause]...]
size_clause is defined as size_clause := SIZE {integer | REPEAT | AUTO | SKEWONLY}
- integer : Number of histogram buckets. Must be in the range [1,254].
- REPEAT : Collects histograms only on the columns that already have histograms.
- AUTO : Oracle determines the columns to collect histograms based on data distribution and the workload of the columns.
- SKEWONLY : Oracle determines the columns to collect histograms based on the data distribution of the columns.
The default is FOR ALL COLUMNS SIZE AUTO. The default value can be changed using the SET_PARAM Procedure.
Query the view:
select column_name,num_buckets,low_value,high_value,density,num_nulls,avg_col_len,histogram,num_distinct from user_tab_col_statistics where table_name='T'
Finding: histogram=NONE, which means no histogram was generated for any column.
The low and high values of VAL1 are not readable as shown: they are stored as RAW in Oracle's internal format (they are not encrypted).
Oracle provides a way to convert these values back:
select column_name,num_buckets,utl_raw.cast_to_number(low_value) as low_value,utl_raw.cast_to_number(high_value) as high_value,density,num_nulls,avg_col_len,histogram,num_distinct from user_tab_col_statistics where table_name='T'
--decode numeric columns
select column_name,num_buckets,utl_raw.cast_to_varchar2(low_value) as low_value,utl_raw.cast_to_varchar2(high_value) as high_value,density,num_nulls,avg_col_len,histogram,num_distinct from user_tab_col_statistics where table_name='T'
--decode string columns
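To make the point that LOW_VALUE and HIGH_VALUE are encoded rather than encrypted, here is a minimal Python sketch that decodes such bytes outside the database. It assumes the commonly described NUMBER layout (one sign/exponent byte followed by base-100 digits); the function name decode_oracle_number is invented for this note, and inside Oracle utl_raw.cast_to_number remains the way to do it.

```python
from decimal import Decimal

def decode_oracle_number(raw_hex: str) -> Decimal:
    """Decode the hex dump of an Oracle NUMBER (assumed layout, see lead-in)."""
    data = bytes.fromhex(raw_hex)
    if data == b"\x80":                      # a lone 0x80 byte represents zero
        return Decimal(0)
    head = data[0]
    if head & 0x80:                          # high bit set: positive number
        exponent = head - 193                # base-100 exponent
        digits = [b - 1 for b in data[1:]]   # each digit is stored as digit + 1
        sign = 1
    else:                                    # negative number
        exponent = 62 - head
        body = data[1:]
        if body and body[-1] == 102:         # trailing 0x66 terminator byte
            body = body[:-1]
        digits = [101 - b for b in body]     # each digit is stored as 101 - digit
        sign = -1
    value = Decimal(0)
    for i, digit in enumerate(digits):
        value += Decimal(digit) * (Decimal(100) ** (exponent - i))
    return sign * value

print(decode_oracle_number("C20202"))  # 101
print(decode_oracle_number("C20207"))  # 106
print(decode_oracle_number("3E6066"))  # -5
```

The hex strings in the usage lines are what this layout gives for 101, 106 and -5; they are illustrative inputs, not values copied from the transcript.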
SQL> set linesize 100
SQL> desc user_tab_col_statistics;
Name Null? Type
----------------------------------------------------- -------- ------------------------------------
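TABLE_NAME VARCHAR2(30) --table name
COLUMN_NAME VARCHAR2(30) --column name
NUM_DISTINCT NUMBER --number of distinct values in the column
LOW_VALUE RAW(32) --lowest value
HIGH_VALUE RAW(32) --highest value
DENSITY NUMBER --density
NUM_NULLS NUMBER --number of NULLs
NUM_BUCKETS NUMBER --number of buckets
LAST_ANALYZED DATE --time of the last analysis
SAMPLE_SIZE NUMBER --sample size
GLOBAL_STATS VARCHAR2(3) --status (YES/NO): whether the statistics were gathered for the table as a whole
USER_STATS VARCHAR2(3)
AVG_COL_LEN NUMBER
HISTOGRAM VARCHAR2(15) --histogram type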
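DENSITY: a number between 0 and 1. The closer it is to 0, the more rows a filter on the column removes (many distinct values, little duplication); the closer it is to 1, the fewer rows a filter removes (few distinct values, heavy duplication).
Without a histogram, DENSITY=1/NUM_DISTINCT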
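As a small worked example of the formula, the Python sketch below computes the density and the row estimate it implies for an equality predicate. The function names are invented for this note, and rounding to the nearest integer is an assumption made for display.

```python
def density_without_histogram(num_distinct: int) -> float:
    """DENSITY = 1 / NUM_DISTINCT when the column has no histogram."""
    return 1.0 / num_distinct

def estimated_rows(num_rows: int, num_distinct: int, num_nulls: int = 0) -> int:
    """Rows expected for `col = value` under a uniform distribution."""
    non_null_rows = num_rows - num_nulls
    return round(non_null_rows * density_without_histogram(num_distinct))

# ID column: 1000 distinct values in 1000 rows
print(density_without_histogram(1000))  # 0.001
# A column with 6 distinct values in 1000 rows
print(estimated_rows(1000, 6))          # 167
```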
Force histogram creation (SKEWONLY looks only at the data distribution, not at the workload):
SQL> exec dbms_stats.gather_table_stats(user,'t',estimate_percent=>100,method_opt=>'for all columns size SKEWONLY',cascade=>true);
PL/SQL procedure successfully completed.
SQL> set linesize 1000
SQL> select column_name,num_buckets,num_distinct,density,histogram from user_tab_col_statistics where table_name='T';
COLUMN_NAME NUM_BUCKETS NUM_DISTINCT DENSITY HISTOGRAM
------------------------------ ----------- ------------ ---------- ---------------
ID 1 1000 .001 NONE
VAL1 254 871 .001288 HEIGHT BALANCED
VAL2 6 6 .0005 FREQUENCY
VAL3 6 6 .0005 FREQUENCY
PAD 254 1000 .001 HEIGHT BALANCED
SQL> select val2,count(1) from t group by val2 order by val2;
VAL2 COUNT(1)
---------- ----------
101 8
102 25
103 68
104 185
105 502
106 212
6 rows selected.
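FREQUENCY histogram: all rows with the same value go into one bucket, so there is one bucket per distinct value.
HEIGHT BALANCED: the number of distinct values exceeds 254, so a frequency histogram is not possible and a height-balanced histogram is built instead; the buckets hold the same number of rows, and a bucket can contain different values.
select endpoint_value,endpoint_number
from user_tab_histograms where table_name='T' and column_name='VAL2' --this view stores cumulative row counts
order by endpoint_number;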
ENDPOINT_VALUE ENDPOINT_NUMBER
-------------- ---------------
101 8
102 33
103 101
104 286
105 788
106 1000
6 rows selected.
Show the cumulative count and the per-value count (yz):
select endpoint_value,endpoint_number,endpoint_number-lag(endpoint_number,1,0) over(order by endpoint_number) as yz
from user_tab_histograms where table_name='T' and column_name='VAL2' --lag fetches the previous row, so this subtracts each row from the next
order by endpoint_number;
ENDPOINT_VALUE ENDPOINT_NUMBER YZ
-------------- --------------- ----------
101 8 8
102 33 25
103 101 68
104 286 185
105 788 502
106 1000 212
6 rows selected.
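The lag() subtraction can be reproduced outside the database. The Python sketch below takes the (endpoint_value, endpoint_number) pairs from the output above and recovers the per-value row counts; the function name per_value_counts is invented for this note.

```python
def per_value_counts(endpoints):
    """Turn cumulative endpoint numbers into per-value row counts."""
    counts = {}
    previous = 0
    # Walk the rows in endpoint_number order, like ORDER BY endpoint_number
    for value, cumulative in sorted(endpoints, key=lambda pair: pair[1]):
        counts[value] = cumulative - previous   # same as endpoint_number - lag(...)
        previous = cumulative
    return counts

val2_endpoints = [(101, 8), (102, 33), (103, 101), (104, 286), (105, 788), (106, 1000)]
print(per_value_counts(val2_endpoints))
# {101: 8, 102: 25, 103: 68, 104: 185, 105: 502, 106: 212}
```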
How the CBO uses the frequency histogram to estimate precisely the number of rows returned by a filter on column val2:
SQL> explain plan set statement_id '101' for select * from t where val2=101;
Explained.
SQL> explain plan set statement_id '102' for select * from t where val2=102;
Explained.
SQL> explain plan set statement_id '103' for select * from t where val2=103;
Explained.
SQL> explain plan set statement_id '104' for select * from t where val2=104;
Explained.
SQL> explain plan set statement_id '105' for select * from t where val2=105;
Explained.
SQL> explain plan set statement_id '106' for select * from t where val2=106;
Explained.
SQL> select statement_id,cardinality from plan_table where id=0 order by statement_id;
STATEMENT_ID CARDINALITY
------------------------------ -----------
101 8 --the Rows value in the execution plan is taken from the frequency histogram
102 25
103 68
104 185
105 502
106 212
6 rows selected.
SQL> select * from t where val2=101;
8 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 289244162
----------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 8 | 2136 | 3 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID| T | 8 | 2136 | 3 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | T_VAL2_I | 8 | | 1 (0)| 00:00:01 |
----------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("VAL2"=101)
Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
10 consistent gets
0 physical reads
0 redo size
2757 bytes sent via SQL*Net to client
400 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
8 rows processed --matches endpoint_number-lag(endpoint_number,1,0) over(order by endpoint_number)
SQL> select * from t where val2=106;
212 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 1601196873
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 212 | 56604 | 11 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| T | 212 | 56604 | 11 (0)| 00:00:01 |
--------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("VAL2"=106)
Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
57 consistent gets
0 physical reads
0 redo size
58464 bytes sent via SQL*Net to client
554 bytes received via SQL*Net from client
16 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
212 rows processed
Simulate how a height-balanced histogram is computed:
select count(1),max(val2),bucket from(
select val2,ntile(5) over(order by val2) as bucket from t)
group by bucket;
COUNT(1) MAX(VAL2) BUCKET
---------- ---------- ----------
200 104 1
200 105 2
200 106 4
200 106 5
200 105 3
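The same bucketing can be sketched in Python to see where the bucket maxima come from. This is only an illustration of the ntile(5) query, using the per-value counts of val2 listed earlier; ntile_buckets is a name invented for this note, not something Oracle exposes.

```python
def ntile_buckets(values, buckets):
    """Split sorted values into `buckets` groups of (almost) equal size.

    Returns (bucket_number, row_count, max_value) per bucket, mirroring
    count(1), max(val2), bucket in the ntile query.
    """
    ordered = sorted(values)
    base, extra = divmod(len(ordered), buckets)
    result, start = [], 0
    for b in range(buckets):
        size = base + (1 if b < extra else 0)   # the first `extra` buckets get one more row
        chunk = ordered[start:start + size]
        start += size
        if chunk:
            result.append((b + 1, len(chunk), chunk[-1]))
    return result

# val2 distribution taken from the group by output
val2_counts = {101: 8, 102: 25, 103: 68, 104: 185, 105: 502, 106: 212}
val2_values = [v for v, n in val2_counts.items() for _ in range(n)]

for row in ntile_buckets(val2_values, 5):
    print(row)
# (1, 200, 104)
# (2, 200, 105)
# (3, 200, 105)
# (4, 200, 106)
# (5, 200, 106)
```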
Force height-balanced histograms (5 buckets, fewer than the 6 distinct values of val2):
SQL> exec dbms_stats.gather_table_stats(user,'t',estimate_percent=>100,method_opt=>'for all columns size 5',cascade=>true);
PL/SQL procedure successfully completed.
SQL> select column_name,num_buckets,num_distinct,density,histogram from user_tab_col_statistics where table_name='T';
COLUMN_NAME NUM_BUCKETS NUM_DISTINCT DENSITY HISTOGRAM
------------------------------ ----------- ------------ ---------- ---------------
ID 5 1000 .001 HEIGHT BALANCED
VAL1 5 871 .001288 HEIGHT BALANCED
VAL2 5 6 .138244755 HEIGHT BALANCED
VAL3 5 6 .138244755 HEIGHT BALANCED
PAD 5 1000 .001 HEIGHT BALANCED
All columns now have height-balanced histograms.
SQL> select endpoint_value,endpoint_number from user_tab_histograms where table_name='T' and column_name='VAL2' order by endpoint_number;
ENDPOINT_VALUE ENDPOINT_NUMBER --ENDPOINT_NUMBER is now a bucket number instead of a cumulative row count
-------------- ---------------
101 0
104 1
105 3
106 5
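0: the starting point; the values start at 101
1: bucket 1 ends at 104 and holds 101, 102, 103, 104; each value is estimated at about 50 rows
2: bucket 2 holds 104, 105; it is not shown because its end value is the same as the next bucket's
3: bucket 3 ends at 105 and holds only 105
4: bucket 4 holds 105, 106; not shown, for the same reason as bucket 2
5: bucket 5 ends at 106 --each stored row records the end point of a range
Rough per-value estimate, with 1000/5=200 rows per bucket and each bucket split evenly among its values:
101: 200/4=50
102: 200/4=50
103: 200/4=50
104: 200/4+100=150 (the optimizer reports 50 for this value, see the plan_table output)
105: 100+200+100=400
106: 100+200=300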
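Another way to read the four stored rows is to count how many buckets each end value spans; a value that ends more than one bucket is usually called a popular value. The Python sketch below does that for val2 and applies the often quoted rule of thumb rows = num_rows * buckets_spanned / num_buckets. Both function names are invented for this note, and the rule is an assumption about the optimizer rather than something shown in the transcript: it reproduces the 400 reported for 105 but gives 400 for 106, where the optimizer reports 300, so the real calculation clearly involves more than this.

```python
def buckets_spanned(endpoints):
    """Map each end value to the number of buckets it closes."""
    spans = {}
    previous_number = None
    for value, number in sorted(endpoints, key=lambda pair: pair[1]):
        if previous_number is not None:       # the row with number 0 is only the start point
            spans[value] = number - previous_number
        previous_number = number
    return spans

def popular_value_estimate(spans, value, num_buckets, num_rows):
    """Rule-of-thumb row estimate for a value that spans several buckets."""
    return num_rows * spans[value] / num_buckets

stored_rows = [(101, 0), (104, 1), (105, 3), (106, 5)]
spans = buckets_spanned(stored_rows)
print(spans)                                        # {104: 1, 105: 2, 106: 2}
print(popular_value_estimate(spans, 105, 5, 1000))  # 400.0
```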
SQL> conn hr/hr
Connected.
SQL> explain plan set statement_id '101' for select * from t where val2=101;
Explained.
SQL> explain plan set statement_id '102' for select * from t where val2=102;
Explained.
SQL> explain plan set statement_id '103' for select * from t where val2=103;
Explained.
SQL> explain plan set statement_id '104' for select * from t where val2=104;
Explained.
SQL> explain plan set statement_id '105' for select * from t where val2=105;
Explained.
SQL> explain plan set statement_id '106' for select * from t where val2=106;
Explained.
SQL> select statement_id,cardinality from plan_table where id=0 order by statement_id;
STATEMENT_ID CARDINALITY
------------------------------ -----------
101 50
102 50
103 50
104 50 --?
105 400
106 300
6 rows selected.
SQL> select * from t where val2=101;
8 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 289244162
----------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 50 | 13350 | 10 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID| T | 50 | 13350 | 10 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | T_VAL2_I | 50 | | 1 (0)| 00:00:01 |
----------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("VAL2"=101)
SQL> select count(1) from t where val2=101;
COUNT(1)
----------
8 --there are actually 8 rows
SQL> select count(1) from t where val2=105;
COUNT(1)
----------
502
SQL> select * from t where val2=105;
502 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 1601196873
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 400 | 104K| 11 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| T | 400 | 104K| 11 (0)| 00:00:01 |
--------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("VAL2"=105)
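The number of buckets cannot exceed 254 in this release. The estimates above are off because too few buckets were requested; using more buckets generally reduces the error.
The Rows value in the execution plan comes from the cardinality derived from the histogram, and that estimate can be inaccurate: the more buckets, the smaller the error; the larger the data volume, the larger the error.
When the error is large, a small change in the data can change the histogram considerably.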
SQL> update t set val2=105 where val2=106 and rownum<13;
12 rows updated.
SQL> commit;
Commit complete.
SQL> exec dbms_stats.gather_table_stats(user,'t',estimate_percent=>100,method_opt=>'for all columns size 5',cascade=>true);
PL/SQL procedure successfully completed.
SQL> select endpoint_value,endpoint_number from user_tab_histograms where table_name='T' and column_name='VAL2' order by endpoint_number;
ENDPOINT_VALUE ENDPOINT_NUMBER
-------------- ---------------
101 0
104 1
105 4
106 5
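To see why moving only 12 rows shifts an endpoint, the histogram can be rebuilt by hand before and after the update. The Python sketch below is a simplified model that assumes the row count divides evenly by the number of buckets; height_balanced_endpoints is a name invented for this note, not Oracle's actual algorithm.

```python
def height_balanced_endpoints(counts, buckets):
    """Build (endpoint_value, endpoint_number) rows from per-value row counts."""
    ordered = [value for value in sorted(counts) for _ in range(counts[value])]
    size = len(ordered) // buckets             # assumes an even split
    rows = [(ordered[0], 0)]                   # row 0 records the minimum value
    for b in range(1, buckets + 1):
        value = ordered[b * size - 1]          # last value in bucket b
        if rows[-1][0] == value and rows[-1][1] > 0:
            rows[-1] = (value, b)              # same end value: keep only the last bucket
        else:
            rows.append((value, b))
    return rows

before = {101: 8, 102: 25, 103: 68, 104: 185, 105: 502, 106: 212}
after = dict(before)
after[105] += 12                               # the 12 updated rows
after[106] -= 12

print(height_balanced_endpoints(before, 5))  # [(101, 0), (104, 1), (105, 3), (106, 5)]
print(height_balanced_endpoints(after, 5))   # [(101, 0), (104, 1), (105, 4), (106, 5)]
```

Before the update, row 800 in sorted order is a 106, so bucket 4 ends at 106; after it, the last 105 sits exactly at row 800, so bucket 4 ends at 105 and 106 is left with a single bucket.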
SQL> conn hr/hr
Connected.
SQL> explain plan set statement_id '101' for select * from t where val2=101;
Explained.
SQL> explain plan set statement_id '102' for select * from t where val2=102;
Explained.
SQL> explain plan set statement_id '103' for select * from t where val2=103;
Explained.
SQL> explain plan set statement_id '104' for select * from t where val2=104;
Explained.
SQL> explain plan set statement_id '105' for select * from t where val2=105;
Explained.
SQL> explain plan set statement_id '106' for select * from t where val2=106;
Explained.
SQL> select statement_id,cardinality from plan_table where id=0 order by statement_id;
STATEMENT_ID CARDINALITY
------------------------------ -----------
101 80
102 80
103 80
104 80
105 600
106 80
6 rows selected.
SQL> select * from t where val2=105;
514 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 1601196873
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 600 | 156K| 11 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| T | 600 | 156K| 11 (0)| 00:00:01 |
--------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("VAL2"=105)
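select * from user_tab_histograms where table_name='T' --how the buckets are distributed
select * from user_tab_columns where table_name='T' --histogram information per column
select * from user_tab_col_statistics where table_name='T' --histogram information per column
Delete the column statistics (histogram included) for the ID column: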
SQL> exec dbms_stats.delete_column_stats(user,'T','ID');
PL/SQL procedure successfully completed.
select * from user_tab_columns where table_name='T' --confirm: histogram=NONE on the ID column
Regather the statistics with SIZE 1, i.e. one bucket per column and no histograms:
SQL> exec dbms_stats.gather_table_stats(user,'t',estimate_percent=>100,method_opt=>'for all columns size 1',cascade=>true); --no histograms
PL/SQL procedure successfully completed.
select * from user_tab_columns where table_name='T' --confirm: the columns no longer have histograms
SQL> select endpoint_value,endpoint_number from user_tab_histograms where table_name='T' and column_name='VAL2' order by endpoint_number;
ENDPOINT_VALUE ENDPOINT_NUMBER
-------------- ---------------
101 0
106 1
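Without a histogram there is only one bucket: endpoint_number takes just the two values 0 and 1, and the two rows record the column's minimum and maximum values.
Another data dictionary view shows the same information:
select * from user_tab_col_statistics where table_name='T'
Question: when there is no histogram, where does the cardinality come from?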
SQL> select endpoint_value,endpoint_number from user_tab_histograms where table_name='T' and column_name='VAL2' order by endpoint_number;
ENDPOINT_VALUE ENDPOINT_NUMBER
-------------- ---------------
101 0
106 1
SQL> conn hr/hr
Connected.
SQL> explain plan set statement_id '101' for select * from t where val2=101;
Explained.
SQL> explain plan set statement_id '102' for select * from t where val2=102;
Explained.
SQL> explain plan set statement_id '103' for select * from t where val2=103;
Explained.
SQL> explain plan set statement_id '104' for select * from t where val2=104;
Explained.
SQL> explain plan set statement_id '105' for select * from t where val2=105;
Explained.
SQL> explain plan set statement_id '106' for select * from t where val2=106;
Explained.
SQL> select statement_id,cardinality from plan_table where id=0 order by statement_id;
STATEMENT_ID CARDINALITY
------------------------------ -----------
101 167
102 167
103 167
104 167
105 167
106 167
6 rows selected.
SQL> select * from t where val2=103;
68 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 1601196873
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 167 | 44589 | 11 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| T | 167 | 44589 | 11 (0)| 00:00:01 |
--------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("VAL2"=103)
SQL> select count(1) from t where val2=103;
COUNT(1)
----------
68
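Summary: without a histogram there is only one bucket, and Oracle computes the cardinality on the assumption that the values are uniformly distributed
(here 1000 rows / 6 distinct values, about 167 rows per value).
If the column values really are evenly distributed this works fine; if they are skewed, the estimates, and with them the execution plan, can be far off.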
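To put numbers on the three cases, the Python sketch below compares the cardinalities reported in the plan_table outputs with the actual row counts and prints the worst absolute error for each. The figures are copied from the transcripts; the function name max_abs_error is invented for this note. The no-histogram run took place after the 12-row update, so it is compared with the updated counts.

```python
def max_abs_error(estimates, actual):
    """Largest absolute difference between estimated and actual row counts."""
    return max(abs(estimates[value] - actual[value]) for value in actual)

actual_before_update = {101: 8, 102: 25, 103: 68, 104: 185, 105: 502, 106: 212}
actual_after_update = {101: 8, 102: 25, 103: 68, 104: 185, 105: 514, 106: 200}

frequency_estimates = dict(actual_before_update)             # frequency histogram: exact
height_balanced_estimates = {101: 50, 102: 50, 103: 50, 104: 50, 105: 400, 106: 300}
no_histogram_estimates = {value: 167 for value in actual_after_update}

print(max_abs_error(frequency_estimates, actual_before_update))        # 0
print(max_abs_error(height_balanced_estimates, actual_before_update))  # 135
print(max_abs_error(no_histogram_estimates, actual_after_update))      # 347
```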