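Most people are familiar with GROUP BY and HAVING. But do you also know CUBE, ROLLUP, and GROUPING SETS?
The examples use data from the BP energy report.
The table is defined as follows: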
CREATE TABLE t_oil (
region text,
country text,
year int,
production int,
consumption int
);
Load the data:
postgres=# COPY t_oil FROM './oil_ext.txt';
COPY 644
Time: 22.798 ms
The data covers 14 countries in two regions for the years 1965 to 2010:
postgres=# SELECT region, avg(production) FROM t_oil GROUP BY region;
region | avg
---------------+-----------------------
Middle East | 1992.6036866359447005
North America | 4541.3623188405797101
(2 rows)
GROUP BY returns one row per group. However, you may also be interested in the overall average.
postgres=# SELECT region, avg(production) FROM t_oil GROUP BY ROLLUP (region);
region | avg
---------------+-----------------------
Middle East | 1992.6036866359447005
North America | 4541.3623188405797101
| 2607.5139860139860140
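(3 rows)
ROLLUP injects one additional row that contains the overall average. In a report, this works much like a summary row. PostgreSQL returns all the required data without the query having to run twice. Note, however, that different PostgreSQL versions may return the rows in a different order. Up to and including 9.6, PostgreSQL had to do a lot of sorting for this kind of query. Starting with 10.0, it can use hashing instead, which improves performance: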
postgres=# explain SELECT region, avg(production) FROM t_oil GROUP BY ROLLUP (region);
QUERY PLAN
----------------------------------------------------------------------------------------
MixedAggregate (cost=0.00..17.31 rows=3 width=44)
Hash Key: region
Group Key: ()
-> Seq Scan on t_oil (cost=0.00..12.44 rows=644 width=16)
(4 rows)
If you need the rows sorted, and in the same order on every version, add an ORDER BY clause.
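As a minimal sketch of that advice, the query below sorts the regions alphabetically and places the summary row, whose region is NULL, at the end. It assumes that region contains no genuine NULL values; otherwise those rows would be sorted next to the summary row.

```sql
-- The ROLLUP summary row has region = NULL, so NULLS LAST puts it at the end.
-- Assumption: region itself never contains real NULL values.
SELECT region, avg(production)
FROM t_oil
GROUP BY ROLLUP (region)
ORDER BY region NULLS LAST;
```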
Multiple columns are supported as well:
postgres=# SELECT region, country, avg(production) FROM t_oil
postgres-# WHERE country IN ('USA', 'Canada', 'Iran', 'Oman')
postgres-# GROUP BY ROLLUP (region, country);
region | country | avg
---------------+---------+-----------------------
Middle East | Iran | 3631.6956521739130435
Middle East | Oman | 586.4545454545454545
Middle East | | 2142.9111111111111111
North America | Canada | 2123.2173913043478261
North America | USA | 9141.3478260869565217
North America | | 5632.2826086956521739
| | 3906.7692307692307692
(7 rows)
In this example, PostgreSQL injected three rows: a subtotal for Middle East, a subtotal for North America, and a final row for the overall average.
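The injected rows are recognizable only by their empty columns, which becomes ambiguous once the data itself contains NULLs. One possible way to label them is the GROUPING function, which returns a bit mask telling which of its arguments were aggregated away in a given row. The following is a sketch; the alias lvl is made up for illustration.

```sql
-- GROUPING(region, country) returns a bit mask, rightmost argument = lowest bit:
--   0 = regular row (both columns are part of the group)
--   1 = country aggregated away (subtotal per region)
--   3 = region and country aggregated away (overall row)
SELECT region, country, avg(production),
       GROUPING(region, country) AS lvl
FROM t_oil
WHERE country IN ('USA', 'Canada', 'Iran', 'Oman')
GROUP BY ROLLUP (region, country)
ORDER BY region NULLS LAST, lvl, country;
```

Filtering on lvl in a HAVING clause, for example HAVING GROUPING(region, country) > 0, would keep only the injected rows.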
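You may want to precompute even more aggregates for added flexibility. CUBE does that by aggregating over every combination of the listed columns: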
postgres=# SELECT region, country, avg(production) FROM t_oil
WHERE country IN ('USA', 'Canada', 'Iran', 'Oman')
GROUP BY CUBE (region, country);
region | country | avg
---------------+---------+-----------------------
| | 3906.7692307692307692
Middle East | Iran | 3631.6956521739130435
North America | Canada | 2123.2173913043478261
North America | USA | 9141.3478260869565217
Middle East | Oman | 586.4545454545454545
North America | | 5632.2826086956521739
Middle East | | 2142.9111111111111111
| Oman | 586.4545454545454545
| Canada | 2123.2173913043478261
| Iran | 3631.6956521739130435
| USA | 9141.3478260869565217
(11 rows)
ROLLUP and CUBE are really just convenience features built on top of GROUPING SETS. With GROUPING SETS, you list the desired aggregates explicitly:
postgres=# SELECT region, country, avg(production) FROM t_oil
WHERE country IN ('USA', 'Canada', 'Iran', 'Oman')
GROUP BY GROUPING SETS ( (), region, country);
region | country | avg
---------------+---------+-----------------------
| | 3906.7692307692307692
North America | | 5632.2826086956521739
Middle East | | 2142.9111111111111111
| Oman | 586.4545454545454545
| Canada | 2123.2173913043478261
| Iran | 3631.6956521739130435
| USA | 9141.3478260869565217
(7 rows)
Here I asked for three grouping sets: the overall average, GROUP BY region, and GROUP BY country.
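To make the relationship concrete, the ROLLUP and CUBE queries shown earlier can be spelled out as explicit grouping sets. The two statements below are a sketch of that expansion and should return the same rows as their shorthand forms, although not necessarily in the same order.

```sql
-- Same grouping sets as GROUP BY ROLLUP (region, country):
-- detail rows, one subtotal per region, and the overall row.
SELECT region, country, avg(production)
FROM t_oil
WHERE country IN ('USA', 'Canada', 'Iran', 'Oman')
GROUP BY GROUPING SETS ((region, country), (region), ());

-- Same grouping sets as GROUP BY CUBE (region, country):
-- every combination of the two columns, including the empty one.
SELECT region, country, avg(production)
FROM t_oil
WHERE country IN ('USA', 'Canada', 'Iran', 'Oman')
GROUP BY GROUPING SETS ((region, country), (region), (country), ());
```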
Internally, PostgreSQL uses traditional GroupAggregate nodes. A GroupAggregate node needs sorted input, so PostgreSQL has to do a lot of temporary sorting:
postgres=# explain SELECT region, country, avg(production) FROM t_oil
WHERE country IN ('USA', 'Canada', 'Iran', 'Oman')
GROUP BY GROUPING SETS ( (), region, country);
QUERY PLAN
----------------------------------------------------------------------------------
GroupAggregate (cost=22.58..32.69 rows=34 width=52)
Group Key: region
Group Key: ()
Sort Key: country
Group Key: country
-> Sort (cost=22.58..23.04 rows=184 width=24)
Sort Key: region
-> Seq Scan on t_oil (cost=0.00..15.66 rows=184 width=24)
Filter: (country = ANY ('{USA,Canada,Iran,Oman}'::text[]))
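Up to 9.6, PostgreSQL's hash aggregates supported only a plain GROUP BY, not grouping sets. Compared with 9.6, the planner in PostgreSQL 10.0 has more options to choose from.
A test on the newer PostgreSQL 11.0 shows a further improved plan: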
postgres=# explain SELECT region, country, avg(production) FROM t_oil
WHERE country IN ('USA', 'Canada', 'Iran', 'Oman')
GROUP BY GROUPING SETS ( (), region, country);
QUERY PLAN
----------------------------------------------------------------------------------
MixedAggregate (cost=10000000000.00..10000000018.17 rows=17 width=52)
Hash Key: region
Hash Key: country
Group Key: ()
-> Seq Scan on t_oil (cost=10000000000.00..10000000015.66 rows=184 width=24)
Filter: (country = ANY ('{USA,Canada,Iran,Oman}'::text[]))
(6 rows)
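To compare the two strategies on the same server, one option is to switch hash aggregation off for the current session and look at the plan again. This is only a sketch, assuming PostgreSQL 10 or later. The plan you actually get, and its costs, depend on the version, the configuration, and the table statistics.

```sql
-- enable_hashagg is a session-level planner setting; turning it off
-- discourages hashed aggregation, so a sort-based plan is the likely result.
SET enable_hashagg = off;

EXPLAIN SELECT region, country, avg(production)
FROM t_oil
WHERE country IN ('USA', 'Canada', 'Iran', 'Oman')
GROUP BY GROUPING SETS ((), region, country);

-- Restore the default so later queries are planned normally.
RESET enable_hashagg;
```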
Grouping sets combine well with the FILTER clause, which lets you run partial aggregates:
postgres=# SELECT region,
avg(production) AS all,
avg(production) FILTER (WHERE year < 1990) AS old,
avg(production) FILTER (WHERE year >= 1990) AS new
FROM t_oil GROUP BY ROLLUP (region);
region | all | old | new
---------------+-----------------------+-----------------------+-----------------------
Middle East | 1992.6036866359447005 | 1747.3258928571428571 | 2254.2333333333333333
North America | 4541.3623188405797101 | 4471.6533333333333333 | 4624.3492063492063492
| 2607.5139860139860140 | 2430.6856187290969900 | 2801.1831501831501832
(3 rows)
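For readers who have not used FILTER before, it may help to see it next to the older CASE idiom. The sketch below computes the same partial average both ways; it works because avg ignores NULL inputs, and a CASE without an ELSE branch yields NULL for rows that do not match. The column aliases are illustrative.

```sql
-- Both columns should contain identical values in every row:
-- FILTER drops non-matching rows before aggregation, while the CASE
-- variant turns them into NULLs, which avg() skips.
SELECT region,
       avg(production) FILTER (WHERE year < 1990)     AS old_filter,
       avg(CASE WHEN year < 1990 THEN production END) AS old_case
FROM t_oil
GROUP BY ROLLUP (region);
```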
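Ordered sets are a powerful feature as well. The data is grouped, sorted within each group by a given criterion, and then the calculation is performed on the sorted data.
A typical use is computing the median.
One way to find it is to sort the data and move 50% of the way into the data set. This is exactly what the WITHIN GROUP clause asks PostgreSQL to do: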
postgres=# SELECT region, percentile_disc(0.5) WITHIN GROUP (ORDER BY production)
FROM t_oil GROUP BY 1;
region | percentile_disc
---------------+-----------------
Middle East | 1082
North America | 3054
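(2 rows)
The percentile_disc function skips 50% of the rows in each group and returns the value found at that position. Note that the median can differ significantly from the average. In economics, the gap between median and average income is even used as an indicator of social inequality: the larger the gap, the more unequal the incomes. To provide more flexibility, the ANSI standard does not propose just a median function. Instead, percentile_disc accepts any fraction between 0 and 1.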
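Since any fraction between 0 and 1 is allowed, quartiles are a natural next step. PostgreSQL also offers a variant of percentile_disc that takes an array of fractions and returns an array of results, which avoids repeating the aggregate three times. A small sketch; the alias quartiles is made up, and the explicit cast is there only to make the argument type unambiguous.

```sql
-- One pass per group returns the 25th, 50th and 75th percentile as an array.
SELECT region,
       percentile_disc(ARRAY[0.25, 0.5, 0.75]::double precision[])
           WITHIN GROUP (ORDER BY production) AS quartiles
FROM t_oil
GROUP BY region;
```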
The nice thing is that ordered sets can be combined with grouping sets:
postgres=# SELECT region, percentile_disc(0.5) WITHIN GROUP (ORDER BY production)
FROM t_oil GROUP BY ROLLUP (1);
region | percentile_disc
---------------+-----------------
Middle East | 1082
North America | 3054
| 1696
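(3 rows)
As proposed by the ANSI SQL standard, PostgreSQL provides two percentile_ functions: percentile_disc returns a value that actually exists in the data set, whereas percentile_cont interpolates a value if no exact match is found: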
postgres=# SELECT percentile_disc(0.62) WITHIN GROUP (ORDER BY id),
percentile_cont(0.62) WITHIN GROUP (ORDER BY id)
FROM generate_series(1, 5) AS id;
percentile_disc | percentile_cont
-----------------+-----------------
4 | 3.48
(1 row)
4 is a value that really exists in the data, whereas 3.48 is an interpolated value.
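The interpolated result can be reproduced by hand. For N sorted rows, percentile_cont(p) targets the fractional row position 1 + p * (N - 1) and interpolates linearly between the two neighbouring rows. With the values 1 to 5 and p = 0.62, the position is 1 + 0.62 * 4 = 3.48, which lies between the third row (value 3) and the fourth row (value 4), so the result is 3 + 0.48 * (4 - 3) = 3.48. The sketch below puts that arithmetic next to the built-in function; the column aliases are illustrative.

```sql
-- All three columns should show 3.48 for this input.
SELECT percentile_cont(0.62) WITHIN GROUP (ORDER BY id) AS built_in,
       1 + 0.62 * (count(*) - 1)                         AS target_position,
       3 + 0.48 * (4 - 3)                                AS interpolated_by_hand
FROM generate_series(1, 5) AS id;
```

Position and value coincide here only because the input is the consecutive series 1 to 5; on real data the two differ.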
To find the most frequent value in a group, use the mode function. Before the example, let us take a closer look at the table contents:
postgres=# SELECT production, count(*) FROM t_oil
WHERE country = 'Other Middle East'
GROUP BY production ORDER BY 2 DESC LIMIT 4;
production | count
------------+-------
48 | 5
52 | 5
50 | 5
53 | 4
(4 rows)
Three different values occur five times each. The mode function returns just one of them:
postgres=# SELECT country, mode() WITHIN GROUP (ORDER BY production)
FROM t_oil
WHERE country = 'Other Middle East' GROUP BY 1;
country | mode
-------------------+------
Other Middle East | 48
(1 row)
The most frequent value is returned, but the query does not tell us how often it actually occurs.
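If the frequency matters too, one simple option is to fall back to an ordinary GROUP BY and keep only the top row. The sketch below breaks ties by taking the smallest production value. Given the counts listed above, it should return 48 with a count of 5. The alias occurrences is illustrative.

```sql
-- Most frequent production value together with its frequency.
-- Ties are resolved in favour of the smallest value.
SELECT production, count(*) AS occurrences
FROM t_oil
WHERE country = 'Other Middle East'
GROUP BY production
ORDER BY occurrences DESC, production
LIMIT 1;
```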
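Hypothetical aggregates are similar to standard ordered sets, but they answer a different kind of question: what would the result be if a given value were present?
The rank function is one of them (PostgreSQL also offers dense_rank, percent_rank, and cume_dist in hypothetical form):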
postgres=# SELECT region,
rank(9000) WITHIN GROUP (ORDER BY production DESC NULLS LAST)
FROM t_oil GROUP BY ROLLUP (1);
region | rank
---------------+------
Middle East | 21
North America | 27
| 47
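(3 rows)
This tells us that if a country produced 9000 barrels per day, that year would rank 27th in North America, 21st in the Middle East, and 47th overall.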
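The hypothetical rank can be cross-checked with a plain count. With a descending sort, the rank of a hypothetical value is one more than the number of existing rows that are strictly larger. The sketch below shows both side by side, and the two columns should agree in every row. The column aliases are illustrative.

```sql
-- rank(9000) under ORDER BY production DESC equals
-- (number of rows with production > 9000) + 1.
-- NULLS LAST keeps NULL production values behind the hypothetical row.
SELECT region,
       rank(9000) WITHIN GROUP (ORDER BY production DESC NULLS LAST) AS hypothetical_rank,
       count(*) FILTER (WHERE production > 9000) + 1                 AS counted_by_hand
FROM t_oil
GROUP BY ROLLUP (region);
```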