PostgreSQL , 多个exists , 范围检索 , IN检索 , 模糊检索 , 组合 , gin , recheck , filter , subplan
当一个SQL包含复杂的多个exists , 范围检索 , IN检索 , 模糊检索 , 组合查询时,可能由于索引使用不当导致查询性能较慢。
主要的问题在于,索引使用不当,可能导致几个问题:
1、索引本身扫描的耗时过多
2、位图扫描引入的recheck过多
3、subplan 引入的 filter过多
一个现实的例子,可以看到耗时集中在recheck和filter上面,每个索引扫描返回的记录数都很多,但是组合起来是0条符合条件的记录。
问题就出在索引不正确上,导致了问题。
-> Subquery Scan on "*SELECT* 2" (cost=273453.65..432483146.70 rows=223 width=349) (actual time=25932.371..25932.371 rows=0 loops=1)
Output: ...................................
Buffers: shared hit=920071 read=269255
I/O Timings: read=1552.767
-> Bitmap Heap Scan on zjxftypt.tab1010201 t_1 (cost=273453.65..432483144.47 rows=223 width=349) (actual time=25932.370..25932.370 rows=0 loops=1)
Output: t_1.storeid, t_1.xfjbh, t_1.wtsd, t_1.rs, t_1.digoal123x, t_1.dz, t_1.blfsjd, t_1.qx, t_1.gk, t_1.xfrq, t_1.djsj, t_1.djdw, t_1.xfjclzt, t_1.digoal123, t_1.xfxs
-- 位图扫描的条件重新过滤 , 过滤太多了
Recheck Cond: ((t_1.xfrq < (to_date('2018-06-11'::character varying, 'yyyy-mm-dd'::character varying) + 1)) AND (t_1.xfrq >= to_date('2014-02-12'::character varying, 'yyyy-mm-dd'::character varying)) AND (t_1.digoal123 = 1::numeric))
Rows Removed by Index Recheck: 1214155
-- 过滤exists的JOIN条件值是否满足 ,过滤太多了
Filter: (((t_1.digoal123x)::text ~~ '%阿里巴巴%'::text) AND ((alternatives: SubPlan 4 or hashed SubPlan 5) OR(alternatives: SubPlan 6 or hashed SubPlan 7)))
Rows Removed by Filter: 5215804
Buffers: shared hit=920071 read=269255
I/O Timings: read=1552.767
-- 条件1,2位图扫描
-> BitmapAnd (cost=273453.65..273453.65 rows=4909643 width=0) (actual time=2510.718..2510.718 rows=0 loops=1)
Buffers: shared hit=27036 read=16539
I/O Timings: read=101.425
-- 自身条件1 符合条件的记录太多了
-> Bitmap Index Scan on index_tab1010201_xfrq (cost=0.00..126565.99 rows=4943755 width=0) (actual time=1085.429..1085.429 rows=5268071 loops=1)
Index Cond: ((t_1.xfrq < (to_date('2018-06-11'::character varying, 'yyyy-mm-dd'::character varying) + 1)) AND (t_1.xfrq >= to_date('2014-02-12'::character varying, 'yyyy-mm-dd'::character varying)))
Buffers: shared hit=3288 read=16539
I/O Timings: read=101.425
-- 自身条件2 符合条件的记录太多了
-> Bitmap Index Scan on index_tab1010201_digoal123 (cost=0.00..146887.30 rows=6599316 width=0) (actual time=1355.825..1355.825 rows=6845646 loops=1)
Index Cond: (t_1.digoal123 = 1::numeric)
Buffers: shared hit=23748
..............sub plans
1、复现问题,创建测试表
create table test(id int, c1 text, c2 date, c3 text);
SQL如下
select * from test
where
(
exists (select 1 from pg_class where oid::int = test.id)
or
exists (select 1 from pg_attribute where attrelid::int=test.id)
)
and c1 in ('1','2','3')
and c2 between current_date-1 and current_date
and c3 ~ 'abcdef';
2、写入测试护甲1000万条
insert into test select id, (random()*10)::int::text, current_date, md5(random()::text) from generate_series(1,10000000) t(id);
3、创建索引,使之可以在索引层面过滤掉所有数据
create extension pg_trgm;
create extension btree_gin;
create index idx_test_1 on test using gin (c1, c2, c3 gin_trgm_ops);
如果是复现问题,应该是这两个索引
create index idx1 on test (c1);
create index idx2 on test (c2);
4、查看执行计划
postgres=# explain (analyze,verbose,timing,costs,buffers)
select * from test
where
(
exists (select 1 from pg_class where oid::int = test.id)
or
exists (select 1 from pg_attribute where attrelid::int=test.id)
)
and c1 in ('1','2','3')
and c2 between current_date-1 and current_date
and c3 ~ 'abcdef';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test (cost=156.43..8593.79 rows=228 width=43) (actual time=837.151..837.151 rows=0 loops=1)
Output: test.id, test.c1, test.c2, test.c3
-- 位图扫描重新RECHECK过滤
Recheck Cond: ((test.c1 = ANY ('{1,2,3}'::text[])) AND (test.c2 >= (CURRENT_DATE - 1)) AND (test.c2 <= CURRENT_DATE) AND (test.c3 ~ 'abcdef'::text))
Rows Removed by Index Recheck: 1
-- exists子句的条件检查,过滤
Filter: ((alternatives: SubPlan 1 or hashed SubPlan 2) OR (alternatives: SubPlan 3 or hashed SubPlan 4))
Rows Removed by Filter: 7
Heap Blocks: exact=8
Buffers: shared hit=11658 read=23
-- 所有条件压到GIN复合索引里面
-- GIN多个条件时,会自动内部位图扫描
-> Bitmap Index Scan on idx_test_1 (cost=0.00..156.37 rows=304 width=0) (actual time=834.418..834.418 rows=8 loops=1)
Index Cond: ((test.c1 = ANY ('{1,2,3}'::text[])) AND (test.c2 >= (CURRENT_DATE - 1)) AND (test.c2 <= CURRENT_DATE) AND (test.c3 ~ 'abcdef'::text))
Buffers: shared hit=11582 read=23
SubPlan 1
-> Seq Scan on pg_catalog.pg_class (cost=0.00..15.84 rows=1 width=0) (never executed)
Filter: ((pg_class.oid)::integer = test.id)
SubPlan 2
-> Seq Scan on pg_catalog.pg_class pg_class_1 (cost=0.00..14.87 rows=387 width=4) (actual time=0.014..0.155 rows=388 loops=1)
Output: (pg_class_1.oid)::integer
Buffers: shared hit=11
SubPlan 3
-> Index Only Scan using pg_attribute_relid_attnum_index on pg_catalog.pg_attribute (cost=0.28..84.39 rows=8 width=0) (never executed)
Filter: ((pg_attribute.attrelid)::integer = test.id)
Heap Fetches: 0
SubPlan 4
-> Index Only Scan using pg_attribute_relid_attnum_index on pg_catalog.pg_attribute pg_attribute_1 (cost=0.28..77.13 rows=2904 width=4) (actual time=0.029..1.081 rows=2941 loops=1)
Output: (pg_attribute_1.attrelid)::integer
Heap Fetches: 459
Buffers: shared hit=57
Planning time: 1.070 ms
Execution time: 839.834 ms
(29 rows)
看起来还不错,但是仔细深究实际上并没有优化太多,还可以有更好的优化。
5、深入优化,需要理解GIN复合索引内部的执行机制(位图扫描)。
因为满足C3条件的记录本身就很少,所以完全不需要使用GIN内部的位图扫描。
postgres=# select count(*) from test where c3 ~ 'abcdef';
count
-------
23
(1 row)
修改为如下索引
postgres=# drop index idx_test_1 ;
DROP INDEX
postgres=# create index idx_test_1 on test using gin (c3 gin_trgm_ops) ;
CREATE INDEX
6、耗时编程24毫秒
postgres=# explain (analyze,verbose,timing,costs,buffers)
select * from test
where
(
exists (select 1 from pg_class where oid::int = test.id)
or
exists (select 1 from pg_attribute where attrelid::int=test.id)
)
and c1 in ('1','2','3')
and c2 between current_date-1 and current_date
and c3 ~ 'abcdef';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test (cost=53.76..27798.16 rows=228 width=43) (actual time=24.287..24.287 rows=0 loops=1)
Output: test.id, test.c1, test.c2, test.c3
Recheck Cond: (test.c3 ~ 'abcdef'::text)
Rows Removed by Index Recheck: 6
Filter: ((test.c1 = ANY ('{1,2,3}'::text[])) AND (test.c2 <= CURRENT_DATE) AND (test.c2 >= (CURRENT_DATE - 1)) AND ((alternatives: SubPlan 1 or hashed SubPlan 2) OR (alternatives: SubPlan 3 or hashed SubPlan 4)))
Rows Removed by Filter: 23
Heap Blocks: exact=29
Buffers: shared hit=226
-> Bitmap Index Scan on idx_test_1 (cost=0.00..53.70 rows=1000 width=0) (actual time=21.517..21.517 rows=29 loops=1)
Index Cond: (test.c3 ~ 'abcdef'::text)
Buffers: shared hit=128
SubPlan 1
-> Seq Scan on pg_catalog.pg_class (cost=0.00..15.84 rows=1 width=0) (never executed)
Filter: ((pg_class.oid)::integer = test.id)
SubPlan 2
-> Seq Scan on pg_catalog.pg_class pg_class_1 (cost=0.00..14.87 rows=387 width=4) (actual time=0.011..0.156 rows=387 loops=1)
Output: (pg_class_1.oid)::integer
Buffers: shared hit=11
SubPlan 3
-> Index Only Scan using pg_attribute_relid_attnum_index on pg_catalog.pg_attribute (cost=0.28..84.39 rows=8 width=0) (never executed)
Filter: ((pg_attribute.attrelid)::integer = test.id)
Heap Fetches: 0
SubPlan 4
-> Index Only Scan using pg_attribute_relid_attnum_index on pg_catalog.pg_attribute pg_attribute_1 (cost=0.28..77.13 rows=2904 width=4) (actual time=0.028..1.099 rows=2938 loops=1)
Output: (pg_attribute_1.attrelid)::integer
Heap Fetches: 456
Buffers: shared hit=58
Planning time: 0.801 ms
Execution time: 24.403 ms
(29 rows)
Time: 26.052 ms
本文的SQL比较复杂,优化的思路和其他SQL差不多,只是本例可以理解BITMAP SCAN以及GIN索引的内部BITMAP SCAN在对较大数据进行合并时,可能引入的开销。
切入点依旧是explain,找耗时段,找背后的原因,解决。
1、什么时候使用GIN复合?
当任意一个条件,选择性不好时,使用复合。
什么时候使用GIN非复合?
2、当有有一个条件,选择性很好时,把它单独拿出来,作为一个独立索引。比如本例的c3模糊查询字段,过滤性好,应该单独拿出来。
其实就是说,选择性不好的列,不要放到索引里面,即使要放,也应该等PG出了分区索引后,将这种列作为分区索引的分区键。(多颗树),或者使用partial index。
《PostgreSQL 黑科技 - 空间聚集存储, 内窥GIN, GiST, SP-GiST索引》
《宝剑赠英雄 - 任意组合字段等效查询, 探探PostgreSQL多列展开式B树 (GIN)》
《PostgreSQL GIN索引实现原理》
《PostgreSQL GIN multi-key search 优化》
《从难缠的模糊查询聊开 - PostgreSQL独门绝招之一 GIN , GiST , SP-GiST , RUM 索引原理与技术背景》
《PostgreSQL GIN索引limit慢的原因分析》
《beyond b-tree (gin\gist索引讲解PDF)》