Optimizing Large-Table Joins in PostgreSQL

I. Background

1. Row counts:

Table            Rows
f_invoice        87346130
f_invoice_item   97535867

2. Indexes:

Table: f_invoice_item

CREATE INDEX f_invoice_item_order_item_id_idx ON ins_dw_prd12.f_invoice_item USING btree (order_item_id);
CREATE INDEX f_invoice_item_invoice_id_idx ON ins_dw_prd12.f_invoice_item USING btree (invoice_id) WITH (fillfactor='100');

Table: f_invoice

CREATE INDEX idx_f_invoice_gin ON ins_dw_prd12.f_invoice USING gin (source_type, invoice_type, invoice_status, invoice_title, invoice_date, seller_taxer_code, shop_id, create_time);
CREATE INDEX idx_f_invoice_invoice_date ON ins_dw_prd12.f_invoice USING btree (invoice_date) WITH (fillfactor='100');
CREATE INDEX idx_f_invoice_seller_taxer_code ON ins_dw_prd12.f_invoice USING btree (seller_taxer_code) WITH (fillfactor='100');
CREATE INDEX idx_invoice_createtime_btree ON ins_dw_prd12.f_invoice USING btree (create_time) WITH (fillfactor='100');

II. Before optimization

SQL:

explain(analyse, timing)
SELECT count(*)
from (SELECT fi.invoice_id
      FROM ins_dw_prd12.f_invoice fi
      WHERE (fi.seller_taxer_code in ('91320200704046760T', '91340100149067617J', '91320214MA1YGE8F94') and
             fi.create_time >= '2020-01-01 00:00:00' and fi.create_time <= '2020-01-31 00:00:00')) AS mm
         INNER JOIN ins_dw_prd12.f_invoice_item fit ON fit.invoice_id = mm.invoice_id
         inner join ins_dw_prd12.f_invoice m on m.invoice_id = mm.invoice_id;

Execution plan:

Finalize Aggregate  (cost=3083416.86..3083416.87 rows=1 width=8) (actual time=85251.980..85251.980 rows=1 loops=1)
  ->  Gather  (cost=3083416.44..3083416.85 rows=4 width=8) (actual time=85251.097..85269.008 rows=5 loops=1)
        Workers Planned: 4
        Workers Launched: 4
        ->  Partial Aggregate  (cost=3082416.44..3082416.45 rows=1 width=8) (actual time=85244.739..85244.739 rows=1 loops=5)
              ->  Nested Loop  (cost=184106.68..3082211.80 rows=81856 width=0) (actual time=2308.041..85237.967 rows=57076 loops=5)
                    ->  Parallel Hash Join  (cost=184106.11..2879308.96 rows=81856 width=16) (actual time=2307.992..85029.464 rows=57076 loops=5)
                          Hash Cond: (fit.invoice_id = fi.invoice_id)
                          ->  Parallel Seq Scan on f_invoice_item fit  (cost=0.00..2631148.52 rows=24401652 width=8) (actual time=0.466..79746.085 rows=19507465 loops=5)
                          ->  Parallel Hash  (cost=183190.09..183190.09 rows=73282 width=8) (actual time=334.243..334.243 rows=54056 loops=5)
                                Buckets: 524288  Batches: 1  Memory Usage: 14752kB
                                ->  Parallel Index Scan using idx_invoice_createtime_btree on f_invoice fi  (cost=0.57..183190.09 rows=73282 width=8) (actual time=0.177..314.460 rows=54056 loops=5)
                                      Index Cond: ((create_time >= '2020-01-01 00:00:00'::timestamp without time zone) AND (create_time <= '2020-01-31 00:00:00'::timestamp without time zone))
                                      Filter: ((seller_taxer_code)::text = ANY ('{91320200704046760T,91340100149067617J,91320214MA1YGE8F94}'::text[]))
                                      Rows Removed by Filter: 455651
                    ->  Index Only Scan using f_invoice_pkey on f_invoice m  (cost=0.57..2.48 rows=1 width=8) (actual time=0.003..0.003 rows=1 loops=285380)
                          Index Cond: (invoice_id = fit.invoice_id)
                          Heap Fetches: 285380
Planning Time: 8.120 ms
Execution Time: 85269.153 ms

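Analysis:

The most expensive step is the parallel sequential scan of f_invoice_item. The node reports loops=5 (the leader plus four workers) and rows=19507465, which is the average per process. In total that is 5 × 19,507,465 ≈ 97.5 million rows, i.e. the entire f_invoice_item table was read. This node alone takes about 79.7 seconds of the 85.3-second execution time.

->  Parallel Seq Scan on f_invoice_item fit  (cost=0.00..2631148.52 rows=24401652 width=8) (actual time=0.466..79746.085 rows=19507465 loops=5)

Question: f_invoice_item has the index f_invoice_item_invoice_id_idx on invoice_id, so why does the planner not use it?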
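
The source leaves this question open. One plausible reading, offered here as an assumption rather than a confirmed diagnosis, is a row-estimate problem. In the optimized plan in section III the planner expects rows=233 per invoice_id lookup on f_invoice_item, while the actual figure is 1. With roughly 73,000 estimated probes per worker at an estimated cost of about 70 each, the index path would have been costed above the sequential scan's estimate of about 2.6 million. The sketch below shows one way to check this. The final query is a simplified form of the original that drops the extra join back to f_invoice; it assumes invoice_id is unique in f_invoice, as the f_invoice_pkey index in the plan suggests.

```sql
-- Sketch only: diagnostics for the planner's choice, not a confirmed fix.

-- 1. What does the planner believe about invoice_id on the item table?
--    An estimate of 233 rows per lookup on ~97.5M rows would correspond to
--    an n_distinct of roughly 420,000, far below the real number of invoices.
SELECT attname, n_distinct, null_frac, correlation
FROM pg_stats
WHERE schemaname = 'ins_dw_prd12'
  AND tablename  = 'f_invoice_item'
  AND attname    = 'invoice_id';

-- 2. Refresh the statistics, then re-run the EXPLAIN from this section
--    to see whether the estimates (and the plan) change.
ANALYZE ins_dw_prd12.f_invoice_item;

-- 3. For comparison only: discourage sequential scans inside one transaction
--    to see which plan and timing the index path would give.
--    SET LOCAL is undone at the end of the transaction.
BEGIN;
SET LOCAL enable_seqscan = off;
EXPLAIN (ANALYZE, TIMING)
SELECT count(*)
FROM ins_dw_prd12.f_invoice fi
         INNER JOIN ins_dw_prd12.f_invoice_item fit ON fit.invoice_id = fi.invoice_id
WHERE fi.seller_taxer_code IN ('91320200704046760T', '91340100149067617J', '91320214MA1YGE8F94')
  AND fi.create_time >= '2020-01-01 00:00:00'
  AND fi.create_time <= '2020-01-31 00:00:00';
ROLLBACK;
```

If the forced plan is much faster than the sequential-scan plan, that would support the misestimate reading. enable_seqscan is meant for this kind of experiment, not as a permanent setting.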

 

III. After optimization

SQL:

explain(analyse, timing)
SELECT count(*)
FROM (select *
      from ins_dw_prd12.f_invoice fi
      where fi.seller_taxer_code in ('91320200704046760T', '91340100149067617J', '91320214MA1YGE8F94')
        and fi.create_time >= '2020-03-01 00:00:00'
        and fi.create_time <= '2020-03-31 00:00:00') m
         INNER JOIN (select *
                     from ins_dw_prd12.f_invoice_item
                     where invoice_id in (SELECT fi.invoice_id
                                          FROM ins_dw_prd12.f_invoice fi
                                          WHERE fi.seller_taxer_code in
                                                ('91320200704046760T', '91340100149067617J', '91320214MA1YGE8F94')
                                            and fi.create_time >= '2020-03-01 00:00:00'
                                            and fi.create_time <= '2020-03-31 00:00:00')) fit
                    ON fit.invoice_id = m.invoice_id;

Execution plan:

Finalize Aggregate  (cost=428280.97..428280.98 rows=1 width=8) (actual time=2400.367..2400.367 rows=1 loops=1)
  ->  Gather  (cost=428280.55..428280.96 rows=4 width=8) (actual time=2399.218..2432.599 rows=5 loops=1)
        Workers Planned: 4
        Workers Launched: 4
        ->  Partial Aggregate  (cost=427280.55..427280.56 rows=1 width=8) (actual time=2394.585..2394.585 rows=1 loops=5)
              ->  Nested Loop  (cost=203100.20..427279.71 rows=334 width=0) (actual time=1465.895..2388.019 rows=52988 loops=5)
                    ->  Parallel Hash Join  (cost=203099.63..405399.83 rows=299 width=16) (actual time=1459.954..1850.252 rows=47458 loops=5)
                          Hash Cond: (fi.invoice_id = fi_1.invoice_id)
                          ->  Parallel Index Scan using idx_invoice_createtime_btree on f_invoice fi  (cost=0.57..202088.56 rows=80840 width=8) (actual time=0.313..363.616 rows=47458 loops=5)
                                Index Cond: ((create_time >= '2020-03-01 00:00:00'::timestamp without time zone) AND (create_time <= '2020-03-31 00:00:00'::timestamp without time zone))
                                Filter: ((seller_taxer_code)::text = ANY ('{91320200704046760T,91340100149067617J,91320214MA1YGE8F94}'::text[]))
                                Rows Removed by Filter: 601517
                          ->  Parallel Hash  (cost=202088.56..202088.56 rows=80840 width=8) (actual time=1459.076..1459.076 rows=47458 loops=5)
                                Buckets: 524288  Batches: 1  Memory Usage: 13472kB
                                ->  Parallel Index Scan using idx_invoice_createtime_btree on f_invoice fi_1  (cost=0.57..202088.56 rows=80840 width=8) (actual time=1.947..1438.735 rows=47458 loops=5)
                                      Index Cond: ((create_time >= '2020-03-01 00:00:00'::timestamp without time zone) AND (create_time <= '2020-03-31 00:00:00'::timestamp without time zone))
                                      Filter: ((seller_taxer_code)::text = ANY ('{91320200704046760T,91340100149067617J,91320214MA1YGE8F94}'::text[]))
                                      Rows Removed by Filter: 601517
                    ->  Index Only Scan using f_invoice_item_invoice_id_idx on f_invoice_item  (cost=0.57..70.85 rows=233 width=8) (actual time=0.011..0.011 rows=1 loops=237290)
                          Index Cond: (invoice_id = fi_1.invoice_id)
                          Heap Fetches: 264945
Planning Time: 0.591 ms
Execution Time: 2432.666 ms

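Result

Execution time dropped from about 85.3 seconds before optimization to about 2.4 seconds after, roughly a 35x speedup (85269 ms / 2433 ms). Note that the two runs filter on different months (January vs. March 2020), although the number of joined rows is similar (about 285,000 vs. about 265,000).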
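
Because the two measurements above use different date ranges, a like-for-like check is to rerun the original query on the March range used by the optimized query. The sketch below does that and adds the BUFFERS option, which reports how many pages came from shared buffers versus disk, so cache warm-up can be told apart from a real plan improvement. Whether the numbers change materially on this dataset is not something the source shows.

```sql
-- Sketch: the original (pre-optimization) query, rerun on the March 2020 range
-- so it can be compared directly with the optimized query.
-- BUFFERS adds shared hit/read counts to each plan node.
EXPLAIN (ANALYZE, TIMING, BUFFERS)
SELECT count(*)
FROM (SELECT fi.invoice_id
      FROM ins_dw_prd12.f_invoice fi
      WHERE fi.seller_taxer_code IN ('91320200704046760T', '91340100149067617J', '91320214MA1YGE8F94')
        AND fi.create_time >= '2020-03-01 00:00:00'
        AND fi.create_time <= '2020-03-31 00:00:00') AS mm
         INNER JOIN ins_dw_prd12.f_invoice_item fit ON fit.invoice_id = mm.invoice_id
         INNER JOIN ins_dw_prd12.f_invoice m ON m.invoice_id = mm.invoice_id;
```

Running each variant more than once and comparing the later runs helps keep a cold cache from skewing the first measurement.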

 

 
