自动建立正确索引(btree,hash,gin,gist,sp-gist,brin,bitmap...)的方法

背景

PostgreSQL的索引接口是开放的，支持btree,hash,bitmap,gin,gist,sp-gist,brin,rum,bloom,zoomdb等索引接口。因此，不同的数据类型，有不同的索引结构可以选择。

由于索引接口众多（应对不同的场景），一些用户可能无法判断应该选择使用哪种索引方法。

虽然我在很多文章中有提到过索引的原理以及选择方法，但是一切为懒人服务，所以我们看看如何把创建索引变得更简单，让数据库自动选择最合适的索引方法。

如果你想了解各种索引的用途，你也可以参考一下手册或者以下CASE的文章，了解索引的原理和应用场景。

《懒人推动社会进步 - 多列聚合, gin与数据分布(选择性)》

语法

CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] name ] ON table_name [ USING method ]  
    ( { column_name | ( expression ) } [ COLLATE collation ] [ opclass ] [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] )  
    [ WITH ( storage_parameter = value [, ... ] ) ]  
    [ TABLESPACE tablespace_name ]  
    [ WHERE predicate ]

索引选择的要素

必要约束

1、如果用户要创建 unique 索引，那么只能选择btree索引接口。

2、某些类型指支持某类索引接口，例如数组类型、全文检索类型，只支持GIN索引。GIS类型只支持gist或sp-gist索引。

选择性

如果某个类型支持多种索引接口，那么到底选择哪种接口比较好呢？

和数据的选择性是有关系的。

1、选择性差（例如1亿记录，有100-10万条唯一值），建议使用gin或bitmap索引。

2、选择性好（例如1亿记录，有8000万唯一值），建议使用btree或hash索引。

数据分布

1、对于数据值与行号呈现较强的线性相关特性时，加入用户没有按该列顺序输出的需求。则建议使用brin块级索引。

2、当列长度超过数据块的1/3时，不能使用btree，建议使用hash索引。或者使用表达式btree索引，建少索引entry的大小。

查询需求

如果某个类型支持多种索引接口，那么到底选择哪种接口比较好呢？

和数据的查询需要也是有关系的。

1、范围查询、排序查询、等值查询

可以使用btree, brin.

2、仅仅有等值查询

可以使用btree, hash

3、有多个列的任意组合查询需求

可以使用bitmap, gin, btree等索引接口

4、有包含、相交等查询需求

可以使用gin等索引接口

5、有距离、距离排序、相交、包含、贯穿等查询需求

可以使用gist等索引接口

性能要求

当一个列支持多种索引接口时，应该选择哪个索引接口，和业务对性能的要求也有关系。

例如，某个列同时支持btree和brin索引，应该选哪个呢？

除了前面提到的线性相关性，还需要考虑业务的查询要求，以及对性能的要求，BTREE对写入性能影响比BRIN大很多，但是明细查询速度，排序速度，limit输出等，都比GIN好很多。

但是当重复值较多时，建议使用GIN，因为它是将元素值作为索引KEY，将行号作为VALUE的倒排索引。

小结

在创建索引时，我们需要指定索引方法，抛开其他的原因，对于同一列可以使用多种索引方法建立索引时，到底使用什么方法？

本文接下来的例子将给大家介绍这个问题的解法。

统计时需要使用参与索引的字段的pg_class, pg_stats的统计信息。

pg_class.reltuples，行数  
  
pg_stats.n_distinct，唯一值个数或比例  
  
pg_stats.correlation，列的线性相关性

将来PG 10还能支持自定义组合列的统计信息收集，比如A,B列的唯一值，相关性等统计信息。支持更加复杂的优化器成本计算。

例子

我讲字段选择索引方法的功能加入到UDF中，利用参数控制自动建立索引的名字。要求等。

Greenplum

语法

Command:     CREATE INDEX  
Description: define a new index  
Syntax:  
CREATE [UNIQUE] INDEX name ON table   
       [USING btree|bitmap|gist]  
       ( {column | (expression)} [opclass] [, ...] )  
       [ WITH ( FILLFACTOR = value ) ]  
       [TABLESPACE tablespace]  
       [WHERE predicate]

UDF

指定表名，列名，索引名，自动创建合适的索引方法的索引。

例子如下，用户可以更精细的调整内容，自动建立合适的索引。

create or replace function build_idx(  
sch name, -- schema name  
tbl name, -- 表名  
col name, -- 列名  
idx name -- 索引名  
) returns void as   
$$  
declare  
  v_tuples numeric;  
  v_n_distinct_cnt numeric;  
  v_n_distinct numeric;  
  v_corr numeric;  
  v_sql text;  
  am text := 'btree';  
begin  
  -- 建议先收集统计信息, 或者根据统计信息的状态，判断是否要更新统计信息  
  
  select reltuples into v_tuples from pg_class where relname=tbl and relnamespace=(select oid from pg_namespace where nspname=sch);  
  
  -- 唯一值个数，100 ~ 10万 , 同时唯一值比例低于1/10000，使用bitmap(gp) , 或者使用gin(pg)  
  select case when n_distinct >=1 then n_distinct when n_distinct=-1 then v_tuples else abs(n_distinct*v_tuples) end into v_n_distinct_cnt   
  from pg_stats where schemaname=sch and tablename=tbl and attname=col;  
      
  -- 唯一值比例，低于1/10000 ，使用bitmap(gp) , 或者使用gin(pg)。否则使用btree  
  select case when n_distinct >=1 then n_distinct/v_tuples when n_distinct=-1 then 1 else abs(n_distinct) end into v_n_distinct  
  from pg_stats where schemaname=sch and tablename=tbl and attname=col;  
  
  if v_n_distinct <= 1/10000.0 and v_n_distinct_cnt between 100 and 100000 then  
    am = 'bitmap';  
  end if;  
  
  v_sql := 'create index '||quote_ident(idx)||' on '||quote_ident(sch)||'.'||quote_ident(tbl)||' using '||am||' ('||quote_ident(col)||')';  
  raise notice '%', v_sql;  
  execute v_sql;  
end;  
$$ language plpgsql strict;

测试，如下，自动创建bitmap索引。

postgres=# create table tbl(id int, info text);  
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.  
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.  
CREATE TABLE  
postgres=# insert into tbl select 200*random(), 'test' from generate_series(1,3000000);  
INSERT 0 3000000  
postgres=# select build_idx('public','tbl','id','test_idx_id');  
NOTICE:  create index test_idx_id on public.tbl using bitmap (id)  
 build_idx   
-----------  
   
(1 row)

PostgreSQL

语法

Command:     CREATE INDEX  
Description: define a new index  
Syntax:  
CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] name ] ON table_name [ USING method ]  
    ( { column_name | ( expression ) } [ COLLATE collation ] [ opclass ] [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] )  
    [ WITH ( storage_parameter = value [, ... ] ) ]  
    [ TABLESPACE tablespace_name ]  
    [ WHERE predicate ]

UDF