PostgreSQL , 10.0 , merge sort , gather merge
在数据库中,经常会有多个节点append,然后sort的情况。
例如一张表有10个分区,查询所有分区,并按某列排序输出,常规的做法是所有的记录append,然后sort。
PostgreSQL 10.0 将支持append node的并行计算,也就是说所有的分区表可以并行的sort,然后返回,此时就可以使用merge sort来提高排序的速度。
另外,像单表的并行计算,如果需要排序输出的话,每个worker process可以并行的排序,然后在merge sort输出结果。
在许多分布式数据库中,merge sort也是必备的,否则排序都收到上层节点做是非常耗费CPU和内存的。
merge sort原理,首先要保证下层的所有节点返回的数据是有序的(例如每个NODE都按ID的顺序返回),以轮询所有下层node的方式组装并返回有序结果。
merge sort和merge join的原理也类似。
merge sort的性能提升示例
Query 4: With GM 7901.480 -> Without GM 9064.776
Query 5: With GM 53452.126 -> Without GM 55059.511
Query 9: With GM 52613.132 -> Without GM 98206.793
Query 15: With GM 68051.058 -> Without GM 68918.378
Query 17: With GM 129236.075 -> Without GM 160451.094
Query 20: With GM 259144.232 -> Without GM 306256.322
Query 21: With GM 153483.497 -> Without GM 168169.916
讨论
Hi hackers,
Attached is the patch to implement Gather Merge. The Gather Merge node would
assume that the results from each worker are ordered with respect to each
other,
and then do a final merge pass over those. This is so that we get the
top-level
query ordering we want. The final plan for such a query would look something
like this:
Gather Merge
-> Sort
-> Parallel Seq Scan on foo
Filter: something
With this we now have a new parallel node which will always return the
sorted
results, so that any query having the pathkey can build the gather merge
path.
Currently if a query has a pathkey and we want to make it parallel-aware,
the
plan would be something like this:
Sort
-> Gather
-> Parallel Seq Scan on foo
Filter: something
With GatherMerge now it is also possible to have plan like:
Finalize GroupAggregate
-> Gather Merge
-> Partial GroupAggregate
-> Sort
With gather merge, sort can be pushed under the Gather Merge. It's valuable
as it has very good performance benefits. Consider the following test
results:
1) ./db/bin/pgbench postgres -i -F 100 -s 20
2) update pgbench_accounts set filler = 'foo' where aid%10 = 0;
3) vacuum analyze pgbench_accounts;
4) set max_parallel_workers_per_gather = 4;
Without patch:
postgres=# explain analyze select aid, sum(abalance) from pgbench_accounts
where filler like '%foo%' group by aid;
QUERY
PLAN
----------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=81696.51..85242.09 rows=202605 width=12) (actual
time=1037.212..1162.086 rows=200000 loops=1)
Group Key: aid
-> Sort (cost=81696.51..82203.02 rows=202605 width=8) (actual
time=1037.203..1072.446 rows=200000 loops=1)
Sort Key: aid
Sort Method: external sort Disk: 3520kB
-> Seq Scan on pgbench_accounts (cost=0.00..61066.59 rows=202605
width=8) (actual time=801.398..868.390 rows=200000 loops=1)
Filter: (filler ~~ '%foo%'::text)
Rows Removed by Filter: 1800000
Planning time: 0.133 ms
Execution time: 1171.656 ms
(10 rows)
With Patch:
postgres=# explain analyze select aid, sum(abalance) from pgbench_accounts
where filler like '%foo%' group by aid;
QUERY
PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize GroupAggregate (cost=47274.13..56644.58 rows=202605 width=12)
(actual time=315.457..561.825 rows=200000 loops=1)
Group Key: aid
-> Gather Merge (cost=47274.13..54365.27 rows=50651 width=0) (actual
time=315.451..451.886 rows=200000 loops=1)
Workers Planned: 4
Workers Launched: 4
-> Partial GroupAggregate (cost=46274.09..47160.49 rows=50651
width=12) (actual time=306.830..333.908 rows=40000 loops=5)
Group Key: aid
-> Sort (cost=46274.09..46400.72 rows=50651 width=8)
(actual time=306.822..310.800 rows=40000 loops=5)
Sort Key: aid
Sort Method: quicksort Memory: 2543kB
-> Parallel Seq Scan on pgbench_accounts
(cost=0.00..42316.15 rows=50651 width=8) (actual time=237.552..255.968
rows=40000 loops=5)
Filter: (filler ~~ '%foo%'::text)
Rows Removed by Filter: 360000
Planning time: 0.200 ms
Execution time: 572.221 ms
(15 rows)
I ran the TPCH benchmark queries with the patch and found that 7 out of 22
queries end up picking the Gather Merge path.
Below benchmark numbers taken under following configuration:
- Scale factor = 10
- max_worker_processes = DEFAULT (8)
- max_parallel_workers_per_gather = 4
- Cold cache environment is ensured. With every query execution - server is
stopped and also OS caches were dropped.
- The reported values of execution time (in ms) is median of 3 executions.
- power2 machine with 512GB of RAM
- PFA performance machine cpu into (benchmark_machine_info.txt)
Query 4: With GM 7901.480 -> Without GM 9064.776
Query 5: With GM 53452.126 -> Without GM 55059.511
Query 9: With GM 52613.132 -> Without GM 98206.793
Query 15: With GM 68051.058 -> Without GM 68918.378
Query 17: With GM 129236.075 -> Without GM 160451.094
Query 20: With GM 259144.232 -> Without GM 306256.322
Query 21: With GM 153483.497 -> Without GM 168169.916
Here from the results we can see that query 9, 17 and 20 are the one which
show good performance benefit with the Gather Merge.
PFA tpch_result.tar.gz for the explain analysis output for TPCH queries
(with
and without patch)
I ran the TPCH benchmark queries with different number of workers and found
that
Query 18 also started picking Gather merge with worker > 6. PFA attach
TPCH_GatherMerge.pdf for the detail benchmark results.
Implementation details:
New Gather Merge node:
The patch introduces a new node type for Gather Merge. The Gather Merge
implementation is mostly similar to what Gather does. The major difference
is
that the Gather node has two TupleTableSlots; one for leader and one for the
tuple read from the worker, and Gather Merge has a TupleTableSlot per
worker,
plus one for the leader. As for Gather Merge, we need to fill every slot,
then
build a heap of the tuples and return the lowest one.
The patch generates the gather merge path from:
a) create_ordered_paths() for partial_pathlist. If the pathkey contain the
sort_pathkey, then directly create the gather merge. If not then create sort
and then create the gather merge path.
Example:
explain analyze
select * from pgbench_accounts where filler like '%foo%' order by aid;
b) create_distinct_paths(): when sort-based implementations of DISTINCT is
possible.
Example:
explain analyze
select distinct * from pgbench_accounts where filler like '%foo%' order by
aid;
c) create_grouping_paths() : While generating a complete GroupAgg Path, loop
over the partial path list and if partial path contains the group_pathkeys
generate the gather merge path.
Example:
explain analyze
select * from pgbench_accounts where filler like '%foo%' group by aid;
In all the above mentioned cases, with the patches it's giving almost a 2x
performance gain. PFA pgbench_query.out, for the explain analyze output for
the
queries.
Gather Merge reads the tuple from each queue and then picks the lowest one,
so
every time it has to read the tuple from the queue into wait mode. During
testing I found that some of the query spent some time reading tuple
from the queue. So in the patch I introduced the tuple array; once we get
the tuple into wait mode, it tries to read more tuples in nowait mode and
store it into the tuple array. Once we get one tuple through the queue,
there
are chances to have more ready tuple into queue, so just read it and, if
any,
store it to the tuple array. With this I found good performance benefits
with
some of the TPC-H complex queries.
Costing:
GatherMerge merges several pre-sorted input streams using a heap.
Considering
that costing for Gather Merge is the combination of cost_gather +
cost_merge_append.
Open Issue:
- Commit af33039317ddc4a0e38a02e2255c2bf453115fd2 fixed the leak into
tqueue.c by
calling gather_readnext() into per-tuple context. Now for gather merge that
is
not possible, as we storing the tuple into Tuple array and we want tuple to
be
free only its get pass through the merge sort algorithm. One idea is, we can
also call gm_readnext_tuple() under per-tuple context (which will fix the
leak
into tqueue.c) and then store the copy of tuple into tuple array.
- Need to see a way to add the simple test as part of regression.
Thanks to my colleague Robert Haas for his help in design and some
preliminary review for the patch.
Please let me know your thought, and thanks for reading.
Regards,
Rushabh Lathia
www.EnterpriseDB.com
这个patch的讨论,详见邮件组,本文末尾URL。
PostgreSQL社区的作风非常严谨,一个patch可能在邮件组中讨论几个月甚至几年,根据大家的意见反复的修正,patch合并到master已经非常成熟,所以PostgreSQL的稳定性也是远近闻名的。
https://commitfest.postgresql.org/13/810/
https://www.postgresql.org/message-id/flat/CAGPqQf09oPX-cQRpBKS0Gq49Z+m6KBxgxd_p9gX8CKk_d75HoQ@mail.gmail.com#CAGPqQf09oPX-cQRpBKS0Gq49Z+m6KBxgxd_p9gX8CKk_d75HoQ@mail.gmail.com