13. Histograms

2013-08-08, Thursday afternoon


---------------Histograms--------------------------


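Histogram information is part of what is collected when optimizer statistics are gathered, and it has a large effect on execution plans.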


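The dbms_stats package analyzes tables and indexes at three levels:

1. The table itself: row count, row length, number of data blocks and so on; part of this can be seen in user_tables.

2. The columns: number of distinct values, number of NULLs, and how the data is distributed within the column (the histogram).

3. The indexes: number of leaf blocks, index height, and clustering factor.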


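A histogram describes only the distribution of data within a column.


How a histogram is built: when Oracle gathers a histogram, it divides the data in the column into a number of parts,

and each part is called a bucket.

This lets the CBO see how the values in the column are distributed,

and that distribution becomes an important input to the cost calculation.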


Histogram types:

1. Height-balanced histogram (HEIGHT BALANCED)

2. Frequency histogram (FREQUENCY)


Example:


1. Create a test table


create table t as

select rownum as id,round(dbms_random.normal*1000) as val1,

100+round(ln(rownum/3.25+2)) as val2,

100+round(ln(rownum/3.25+2)) as val3,

dbms_random.string('P',250) as pad

from dual connect by level<=1000

order by dbms_random.value;


2. Add a primary key


SQL> alter table t add constraint t_pk primary key(id);


Table altered.


3. Create indexes


SQL> create index t_val1_i on t(val1);


Index created.


SQL> create index t_val2_i on t(val2);


Index created.


SQL> exec dbms_stats.gather_table_stats(user,'t',estimate_percent=>100,method_opt=>'for all columns size auto',cascade=>true);  -- SIZE AUTO: Oracle decides which columns get a histogram and how many buckets it uses


PL/SQL procedure successfully completed.


method_opt:

Accepts:

FOR ALL [INDEXED | HIDDEN] COLUMNS [size_clause]

FOR COLUMNS [size clause] column|attribute [size_clause] [,column|attribute [size_clause]...]


size_clause is defined as size_clause := SIZE {integer | REPEAT | AUTO | SKEWONLY}


- integer : Number of histogram buckets. Must be in the range [1,254].

- REPEAT : Collects histograms only on the columns that already have histograms.

- AUTO : Oracle determines the columns to collect histograms based on data distribution and the workload of the columns.

- SKEWONLY : Oracle determines the columns to collect histograms based on the data distribution of the columns.


The default is FOR ALL COLUMNS SIZE AUTO. The default value can be changed using the SET_PARAM Procedure.


Query the view:

select column_name,num_buckets,low_value,high_value,density,num_nulls,avg_col_len,histogram,num_distinct from user_tab_col_statistics where table_name='T'


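Result: HISTOGRAM=NONE for every column, so no histograms were generated. This is most likely because SIZE AUTO also considers how the columns have been used in queries, and this newly created table has no such workload yet.


The LOW_VALUE and HIGH_VALUE of column VAL1 are not directly readable: they are stored as RAW in Oracle's internal format (they are not encrypted).


Oracle provides functions to convert them back: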

select column_name,num_buckets,utl_raw.cast_to_number(low_value) as low_value,utl_raw.cast_to_number(high_value) as high_value,density,num_nulls,avg_col_len,histogram,num_distinct from user_tab_col_statistics where table_name='T'

-- converts numeric values


select column_name,num_buckets,utl_raw.cast_to_varchar2(low_value) as low_value,utl_raw.cast_to_varchar2(high_value) as high_value,density,num_nulls,avg_col_len,histogram,num_distinct from user_tab_col_statistics where table_name='T'

-- converts character values
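To make clear that the RAW values are an internal storage format rather than ciphertext, here is a minimal Python sketch of what utl_raw.cast_to_number undoes, based on my understanding of the NUMBER encoding (an exponent byte followed by base-100 digits). The function name decode_oracle_number is invented for this note, and the byte strings are the encodings I would expect for 101, 106 and -1, not values copied from the session above.

```python
from decimal import Decimal

def decode_oracle_number(raw: bytes) -> Decimal:
    """Decode Oracle's internal NUMBER bytes (hypothetical helper, sketch only)."""
    if raw == b"\x80":                      # the single byte 0x80 encodes zero
        return Decimal(0)
    head, body = raw[0], raw[1:]
    if head > 0x80:                         # positive number
        exponent = head - 193               # power of 100 of the first digit
        digits = [b - 1 for b in body]      # each byte is a base-100 digit plus 1
        sign = 1
    else:                                   # negative number
        exponent = 62 - head
        if body and body[-1] == 102:        # a trailing 102 is a terminator
            body = body[:-1]
        digits = [101 - b for b in body]    # digits are stored complemented
        sign = -1
    value = sum(Decimal(d) * Decimal(100) ** (exponent - i)
                for i, d in enumerate(digits))
    return sign * value

print(decode_oracle_number(bytes.fromhex("C20202")))  # 101
print(decode_oracle_number(bytes.fromhex("C20207")))  # 106
print(decode_oracle_number(bytes.fromhex("3E6466")))  # -1
```

Negative values matter here because VAL1 is generated from dbms_random.normal and so contains negative numbers. In practice the utl_raw functions shown above are the right tool; the sketch only illustrates that the conversion is a plain format decode.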


SQL> set linesize 100

SQL> desc user_tab_col_statistics;

Name                                                  Null?    Type

----------------------------------------------------- -------- ------------------------------------

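TABLE_NAME                                                     VARCHAR2(30)   -- table name

COLUMN_NAME                                                    VARCHAR2(30)   -- column name

NUM_DISTINCT                                                   NUMBER         -- number of distinct values in the column

LOW_VALUE                                                      RAW(32)        -- lowest value

HIGH_VALUE                                                     RAW(32)        -- highest value

DENSITY                                                        NUMBER         -- density

NUM_NULLS                                                      NUMBER         -- number of NULLs

NUM_BUCKETS                                                    NUMBER         -- number of buckets

LAST_ANALYZED                                                  DATE           -- time of the last analysis

SAMPLE_SIZE                                                    NUMBER         -- sample size

GLOBAL_STATS                                                   VARCHAR2(3)    -- YES if gathered for the table as a whole

USER_STATS                                                     VARCHAR2(3)    -- YES if the statistics were set by the user

AVG_COL_LEN                                                    NUMBER         -- average column length in bytes

HISTOGRAM                                                      VARCHAR2(15)   -- histogram type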


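DENSITY: a number between 0 and 1. The closer it is to 0, the more rows an equality filter on the column removes (many distinct values, little repetition); the closer it is to 1, the fewer rows the filter removes (few distinct values, heavy repetition).


Without a histogram, DENSITY = 1/NUM_DISTINCT.


Gather histograms based on data skew alone (SIZE SKEWONLY looks only at the data distribution, not at the workload):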

SQL> exec dbms_stats.gather_table_stats(user,'t',estimate_percent=>100,method_opt=>'for all columns size SKEWONLY',cascade=>true);


PL/SQL procedure successfully completed.


SQL> set linesize 1000

SQL> select column_name,num_buckets,num_distinct,density,histogram from user_tab_col_statistics where table_name='T';


COLUMN_NAME                    NUM_BUCKETS NUM_DISTINCT    DENSITY HISTOGRAM

------------------------------ ----------- ------------ ---------- ---------------

ID                                       1         1000       .001 NONE

VAL1                                   254          871    .001288 HEIGHT BALANCED

VAL2                                     6            6      .0005 FREQUENCY

VAL3                                     6            6      .0005 FREQUENCY

PAD                                    254         1000       .001 HEIGHT BALANCED


SQL> select val2,count(1) from t group by val2 order by val2;


VAL2   COUNT(1)

---------- ----------

      101          8

      102         25

      103         68

      104        185

      105        502

      106        212


6 rows selected.


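FREQUENCY histogram: each distinct value gets its own bucket.

HEIGHT BALANCED: used when the number of distinct values exceeds the number of buckets (at most 254), so a frequency histogram is not possible. All buckets have the same height (row count), and one bucket can hold several different values.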
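The choice between the two types, as it shows up in the outputs in these notes, can be summarized in a few lines of Python. This is only a model of the observed behaviour once a histogram is actually built; it ignores the skew and workload checks that SIZE SKEWONLY and SIZE AUTO perform, and histogram_type is a name invented here.

```python
MAX_BUCKETS = 254  # upper limit for SIZE given in the method_opt description

def histogram_type(num_distinct: int, requested_buckets: int) -> str:
    """Model which histogram type results once a histogram is built."""
    buckets = min(requested_buckets, MAX_BUCKETS)
    if buckets <= 1:
        return "NONE"                 # SIZE 1 means no histogram
    if num_distinct <= buckets:
        return "FREQUENCY"            # one bucket per distinct value fits
    return "HEIGHT BALANCED"          # too many values: equal-height buckets

print(histogram_type(6, 254))    # VAL2 with up to 254 buckets
print(histogram_type(871, 254))  # VAL1 with up to 254 buckets
print(histogram_type(6, 5))      # VAL2 when only 5 buckets are requested
```

The three calls correspond to VAL2 and VAL1 in the output above and to VAL2 when only 5 buckets are requested.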


select endpoint_value,endpoint_number

from user_tab_histograms where table_name='T' and column_name='VAL2'   -- for a frequency histogram, endpoint_number holds the cumulative row count

order by endpoint_number;


ENDPOINT_VALUE ENDPOINT_NUMBER

-------------- ---------------

          101               8

          102              33

          103             101

          104             286

          105             788

          106            1000


6 rows selected.


Show both the cumulative count and the per-value count:


select endpoint_value,endpoint_number,endpoint_number-lag(endpoint_number,1,0) over(order by endpoint_number) as yz

from user_tab_histograms where table_name='T' and column_name='VAL2'  -- lag() subtracts the previous row's cumulative count

order by endpoint_number;


ENDPOINT_VALUE ENDPOINT_NUMBER         YZ

-------------- --------------- ----------

          101               8          8

          102              33         25

          103             101         68

          104             286        185

          105             788        502

          106            1000        212


6 rows selected.
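The same subtraction can be checked outside the database. The following Python sketch takes the (endpoint_value, endpoint_number) pairs listed above and recovers the per-value row counts; the function name frequencies is just for illustration.

```python
# (endpoint_value, endpoint_number) pairs as listed by user_tab_histograms
endpoints = [(101, 8), (102, 33), (103, 101), (104, 286), (105, 788), (106, 1000)]

def frequencies(cumulative):
    """Turn cumulative endpoint numbers into per-value row counts."""
    result = {}
    previous = 0                      # plays the role of lag(..., 1, 0)
    for value, endpoint_number in cumulative:
        result[value] = endpoint_number - previous
        previous = endpoint_number
    return result

freq = frequencies(endpoints)
print(freq[101], freq[105], freq[106])  # 8 502 212
```

These are the same numbers as the COUNT(1) column of the GROUP BY query on val2.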


How the CBO uses the frequency histogram to estimate precisely the number of rows left after filtering on column val2:


SQL> explain plan set statement_id '101' for select * from t where val2=101;


Explained.


SQL> explain plan set statement_id '102' for select * from t where val2=102;


Explained.


SQL> explain plan set statement_id '103' for select * from t where val2=103;


Explained.


SQL> explain plan set statement_id '104' for select * from t where val2=104;


Explained.


SQL> explain plan set statement_id '105' for select * from t where val2=105;


Explained.


SQL> explain plan set statement_id '106' for select * from t where val2=106;


Explained.


SQL> select statement_id,cardinality from plan_table where id=0 order by statement_id;


STATEMENT_ID    CARDINALITY

------------------------------ -----------

101                                      8    -- the Rows value in the execution plan is taken from the frequency histogram

102                                     25

103                                     68

104                                    185

105                                    502

106                                    212


6 rows selected.


SQL> select * from t where val2=101;


8 rows selected.



Execution Plan

----------------------------------------------------------

Plan hash value: 289244162


----------------------------------------------------------------------------------------

| Id  | Operation                   | Name     | Rows  | Bytes | Cost (%CPU)| Time     |

----------------------------------------------------------------------------------------

|   0 | SELECT STATEMENT            |          |     8 |  2136 |     3   (0)| 00:00:01 |

|   1 |  TABLE ACCESS BY INDEX ROWID| T        |     8 |  2136 |     3   (0)| 00:00:01 |

|*  2 |   INDEX RANGE SCAN          | T_VAL2_I |     8 |       |     1   (0)| 00:00:01 |

----------------------------------------------------------------------------------------


Predicate Information (identified by operation id):

---------------------------------------------------


  2 - access("VAL2"=101)


Statistics

----------------------------------------------------------

         1  recursive calls

         0  db block gets

        10  consistent gets

         0  physical reads

         0  redo size

      2757  bytes sent via SQL*Net to client

       400  bytes received via SQL*Net from client

         2  SQL*Net roundtrips to/from client

         0  sorts (memory)

         0  sorts (disk)

         8  rows processed   -- equals endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) for value 101


SQL> select * from t where val2=106;


212 rows selected.



Execution Plan

----------------------------------------------------------

Plan hash value: 1601196873


--------------------------------------------------------------------------

| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |

--------------------------------------------------------------------------

|   0 | SELECT STATEMENT  |      |   212 | 56604 |    11   (0)| 00:00:01 |

|*  1 |  TABLE ACCESS FULL| T    |   212 | 56604 |    11   (0)| 00:00:01 |

--------------------------------------------------------------------------


Predicate Information (identified by operation id):

---------------------------------------------------


  1 - filter("VAL2"=106)



Statistics

----------------------------------------------------------

         1  recursive calls

         0  db block gets

        57  consistent gets

         0  physical reads

         0  redo size

     58464  bytes sent via SQL*Net to client

       554  bytes received via SQL*Net from client

        16  SQL*Net roundtrips to/from client

         0  sorts (memory)

         0  sorts (disk)

       212  rows processed


Simulating how a height-balanced histogram is computed:


select count(1),max(val2),bucket from(

select val2,ntile(5) over(order by val2) as bucket from t)

group by bucket;


 COUNT(1)  MAX(VAL2)     BUCKET

---------- ---------- ----------

      200        104          1

      200        105          2

      200        106          4

      200        106          5

      200        105          3
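The NTILE simulation can also be reproduced without the database. This Python sketch rebuilds the sorted val2 column from the row counts shown earlier, cuts it into 5 equal buckets and reports the largest value in each one. The helper name bucket_maxima is made up for this note, and it assumes the row count divides evenly by the number of buckets.

```python
# Row counts per VAL2 value, taken from the GROUP BY output earlier in these notes
counts = {101: 8, 102: 25, 103: 68, 104: 185, 105: 502, 106: 212}

def bucket_maxima(value_counts, buckets):
    """Split the sorted values into equal-sized buckets; return each bucket's maximum."""
    rows = [v for v in sorted(value_counts) for _ in range(value_counts[v])]
    size = len(rows) // buckets       # assumes the row count divides evenly
    return [rows[(b + 1) * size - 1] for b in range(buckets)]

print(bucket_maxima(counts, 5))  # [104, 105, 105, 106, 106]
```

The list is in bucket order 1 to 5, whereas the SQL output above is unordered because the query has no ORDER BY.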


Force height-balanced histograms by requesting only 5 buckets:

SQL> exec dbms_stats.gather_table_stats(user,'t',estimate_percent=>100,method_opt=>'for all columns size 5',cascade=>true);


PL/SQL procedure successfully completed.


SQL> select column_name,num_buckets,num_distinct,density,histogram from user_tab_col_statistics where table_name='T';


COLUMN_NAME                    NUM_BUCKETS NUM_DISTINCT    DENSITY HISTOGRAM

------------------------------ ----------- ------------ ---------- ---------------

ID                                       5         1000       .001 HEIGHT BALANCED

VAL1                                     5          871    .001288 HEIGHT BALANCED

VAL2                                     5            6 .138244755 HEIGHT BALANCED

VAL3                                     5            6 .138244755 HEIGHT BALANCED

PAD                                      5         1000       .001 HEIGHT BALANCED


All columns now have height-balanced histograms.


SQL> select endpoint_value,endpoint_number from user_tab_histograms where table_name='T' and column_name='VAL2' order by endpoint_number;


ENDPOINT_VALUE ENDPOINT_NUMBER   -- ENDPOINT_NUMBER is now a bucket number, not a cumulative row count

-------------- ---------------

           101               0

           104               1

           105               3

           106               5


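0: the starting point; the lowest value is 101.

1: bucket 1 ends at 104 and holds 101, 102, 103 and 104, so each value is estimated at about 50 rows.

2: bucket 2 (holds 104 and 105) is not listed, because its end value is the same as that of the next bucket.

3: bucket 3 ends at 105 and holds only 105.

4: bucket 4 (holds 105 and 106) is not listed either, for the same reason.

5: bucket 5 ends at 106. Only the end point of each interval is recorded.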



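Rough hand estimate of the rows per value from these buckets (200 rows per bucket):

101: 200/4 = 50

102: 200/4 = 50

103: 200/4 = 50

104: 150

105: 100+200+100 = 400

106: 200+100 = 300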


SQL> conn hr/hr

Connected.

SQL> explain plan set statement_id '101' for select * from t where val2=101;


Explained.


SQL> explain plan set statement_id '102' for select * from t where val2=102;


Explained.


SQL> explain plan set statement_id '103' for select * from t where val2=103;


Explained.


SQL> explain plan set statement_id '104' for select * from t where val2=104;


Explained.


SQL> explain plan set statement_id '105' for select * from t where val2=105;


Explained.


SQL> explain plan set statement_id '106' for select * from t where val2=106;


Explained.


SQL> select statement_id,cardinality from plan_table where id=0 order by statement_id;


STATEMENT_ID    CARDINALITY

------------------------------ -----------

101                                     50

102                                     50

103                                     50

104                                     50    -- ? (the hand estimate above was 150)

105                                    400

106                                    300


6 rows selected.


SQL> select * from t where val2=101;


8 rows selected.



Execution Plan

----------------------------------------------------------

Plan hash value: 289244162


----------------------------------------------------------------------------------------

| Id  | Operation                   | Name     | Rows  | Bytes | Cost (%CPU)| Time     |

----------------------------------------------------------------------------------------

|   0 | SELECT STATEMENT            |          |    50 | 13350 |    10   (0)| 00:00:01 |

|   1 |  TABLE ACCESS BY INDEX ROWID| T        |    50 | 13350 |    10   (0)| 00:00:01 |

|*  2 |   INDEX RANGE SCAN          | T_VAL2_I |    50 |       |     1   (0)| 00:00:01 |

----------------------------------------------------------------------------------------


Predicate Information (identified by operation id):

---------------------------------------------------


  2 - access("VAL2"=101)


SQL>  select count(1) from t where val2=101;


 COUNT(1)

----------

        8    -- there are actually 8 rows


SQL>  select count(1) from t where val2=105;


 COUNT(1)

----------

      502

SQL> select * from t where val2=105;


502 rows selected.



Execution Plan

----------------------------------------------------------

Plan hash value: 1601196873


--------------------------------------------------------------------------

| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |

--------------------------------------------------------------------------

|   0 | SELECT STATEMENT  |      |   400 |   104K|    11   (0)| 00:00:01 |

|*  1 |  TABLE ACCESS FULL| T    |   400 |   104K|    11   (0)| 00:00:01 |

--------------------------------------------------------------------------


Predicate Information (identified by operation id):

---------------------------------------------------


  1 - filter("VAL2"=105)


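The number of buckets cannot exceed 254. The misestimates above come from using too few buckets; with more buckets the error generally gets smaller.


The Rows value in the execution plan comes from the cardinality derived from the histogram, and that estimate can be off: the more buckets, the smaller the error; the larger the data volume, the larger the error.


Height-balanced histograms with large errors are also unstable: a small change in the data can change the histogram considerably.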



SQL> update t set val2=105 where val2=106 and rownum<13;


12 rows updated.


SQL> commit;


Commit complete.


SQL> exec dbms_stats.gather_table_stats(user,'t',estimate_percent=>100,method_opt=>'for all columns size 5',cascade=>true);


PL/SQL procedure successfully completed.


SQL> select endpoint_value,endpoint_number from user_tab_histograms where table_name='T' and column_name='VAL2' order by endpoint_number;


ENDPOINT_VALUE ENDPOINT_NUMBER

-------------- ---------------

          101               0

          104               1

          105               4

          106               5
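The move of value 105 from endpoint 3 to endpoint 4 can be reproduced by hand. The Python sketch below builds the compressed endpoint list for the val2 counts before and after the 12-row update. It is a simplified model, not Oracle's implementation: the function name endpoints is invented, an even split into buckets is assumed, and for each repeated end value only the highest bucket number is kept.

```python
def endpoints(value_counts, buckets):
    """Return (endpoint_value, endpoint_number) pairs, merging repeated end values."""
    rows = [v for v in sorted(value_counts) for _ in range(value_counts[v])]
    size = len(rows) // buckets                   # assumes an even split
    last_bucket = {}
    for b in range(1, buckets + 1):
        last_bucket[rows[b * size - 1]] = b       # later buckets overwrite earlier ones
    # bucket 0 records the minimum value of the column
    return [(rows[0], 0)] + sorted(last_bucket.items())

before = {101: 8, 102: 25, 103: 68, 104: 185, 105: 502, 106: 212}
after = dict(before)
after[105] += 12                                  # the 12 updated rows
after[106] -= 12

print(endpoints(before, 5))  # [(101, 0), (104, 1), (105, 3), (106, 5)]
print(endpoints(after, 5))   # [(101, 0), (104, 1), (105, 4), (106, 5)]
```

Before the update, row 800 of the sorted column is a 106, so bucket 4 ends on 106. After 12 rows move to 105, that row becomes a 105, which is enough to change which value bucket 4 ends on.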


SQL> conn hr/hr

Connected.

SQL> explain plan set statement_id '101' for select * from t where val2=101;


Explained.


SQL> explain plan set statement_id '102' for select * from t where val2=102;


Explained.


SQL> explain plan set statement_id '103' for select * from t where val2=103;


Explained.


SQL> explain plan set statement_id '104' for select * from t where val2=104;


Explained.


SQL> explain plan set statement_id '105' for select * from t where val2=105;


Explained.


SQL> explain plan set statement_id '106' for select * from t where val2=106;


Explained.


SQL>  select statement_id,cardinality from plan_table where id=0 order by statement_id;


STATEMENT_ID                   CARDINALITY

------------------------------ -----------

101                                     80

102                                     80

103                                     80

104                                     80

105                                    600

106                                     80


6 rows selected.


SQL> select * from t where val2=105;


514 rows selected.



Execution Plan

----------------------------------------------------------

Plan hash value: 1601196873


--------------------------------------------------------------------------

| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |

--------------------------------------------------------------------------

|   0 | SELECT STATEMENT  |      |   600 |   156K|    11   (0)| 00:00:01 |

|*  1 |  TABLE ACCESS FULL| T    |   600 |   156K|    11   (0)| 00:00:01 |

--------------------------------------------------------------------------


Predicate Information (identified by operation id):

---------------------------------------------------


  1 - filter("VAL2"=105)


select * from user_tab_histograms where table_name='T' -- bucket endpoints

select * from user_tab_columns where table_name='T'    -- per-column statistics, including the histogram type

select * from user_tab_col_statistics where table_name='T'  -- per-column statistics, including the histogram type



Delete the statistics on column ID (this removes its histogram as well):

SQL> exec dbms_stats.delete_column_stats(user,'T','ID');


PL/SQL procedure successfully completed.



select * from user_tab_columns where table_name='T'  -- confirm: HISTOGRAM=NONE for column ID


Re-gather statistics with SIZE 1, i.e. without histograms:


SQL> exec dbms_stats.gather_table_stats(user,'t',estimate_percent=>100,method_opt=>'for all columns size 1',cascade=>true);  -- SIZE 1: no histograms


PL/SQL procedure successfully completed.



select * from user_tab_columns where table_name='T'   -- confirm: the columns no longer have histograms


SQL> select endpoint_value,endpoint_number from user_tab_histograms where table_name='T' and column_name='VAL2' order by endpoint_number;


ENDPOINT_VALUE ENDPOINT_NUMBER

-------------- ---------------

          101               0

          106               1



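Without a histogram there is a single bucket: endpoint_number has only the two positions 0 and 1, which delimit that one bucket and record the minimum and maximum values.


Another dictionary view shows the same information: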

select * from user_tab_col_statistics where table_name='T'


Question: when there is no histogram, where does the cardinality come from?


SQL> select endpoint_value,endpoint_number from user_tab_histograms where table_name='T' and column_name='VAL2' order by endpoint_number;


ENDPOINT_VALUE ENDPOINT_NUMBER

-------------- ---------------

          101               0

          106               1


SQL> conn hr/hr

Connected.

SQL> explain plan set statement_id '101' for select * from t where val2=101;


Explained.


SQL> explain plan set statement_id '102' for select * from t where val2=102;


Explained.


SQL> explain plan set statement_id '103' for select * from t where val2=103;


Explained.


SQL> explain plan set statement_id '104' for select * from t where val2=104;


Explained.


SQL> explain plan set statement_id '105' for select * from t where val2=105;


Explained.


SQL> explain plan set statement_id '106' for select * from t where val2=106;


Explained.


SQL> select statement_id,cardinality from plan_table where id=0 order by statement_id;


STATEMENT_ID                   CARDINALITY

------------------------------ -----------

101                                    167

102                                    167

103                                    167

104                                    167

105                                    167

106                                    167


6 rows selected.


SQL> select * from t where val2=103;


68 rows selected.



Execution Plan

----------------------------------------------------------

Plan hash value: 1601196873


--------------------------------------------------------------------------

| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |

--------------------------------------------------------------------------

|   0 | SELECT STATEMENT  |      |   167 | 44589 |    11   (0)| 00:00:01 |

|*  1 |  TABLE ACCESS FULL| T    |   167 | 44589 |    11   (0)| 00:00:01 |

--------------------------------------------------------------------------


Predicate Information (identified by operation id):

---------------------------------------------------


  1 - filter("VAL2"=103)


SQL> select count(1) from t where val2=103;


 COUNT(1)

----------

       68


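Summary: without a histogram there is only one bucket, and Oracle assumes the values are uniformly distributed when it estimates cardinality

     (here 1000 rows / 6 distinct values, about 167 rows for every value).

     If the column really is uniformly distributed, this works well; if it is skewed, the estimates, and with them the execution plan, can be badly off.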
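As a quick check of where 167 comes from, the following Python sketch applies DENSITY = 1/NUM_DISTINCT to the 1000 rows and compares the result with the real row counts of val2 after the 12-row update. It models only the arithmetic, not the optimizer itself.

```python
num_rows = 1000
# Actual VAL2 row counts after the 12-row update (105: 502+12, 106: 212-12)
actual = {101: 8, 102: 25, 103: 68, 104: 185, 105: 514, 106: 200}

num_distinct = len(actual)
density = 1 / num_distinct                 # DENSITY = 1/NUM_DISTINCT without a histogram
estimate = round(num_rows * density)       # the same estimate for every value

print(estimate)  # 167
for value, rows in actual.items():
    # ratio of estimated to actual rows; far from 1.0 means a poor estimate
    print(value, rows, f"{estimate / rows:.1f}x")
```

The estimate is roughly 20 times too high for 101 and about a third of the real count for 105, which is the kind of error that can push the optimizer toward the wrong access path.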

