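Presto also supports complex aggregations using the GROUPING SETS, CUBE and ROLLUP syntax. This syntax lets users perform analysis that requires aggregation on multiple sets of columns in a single query. Complex grouping operations do not support grouping on expressions composed of input columns. Only column names or ordinals are allowed.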
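Because expressions cannot be grouped on directly, one possible workaround is to compute the expression in a subquery, give it an alias, and group on that alias. The sketch below assumes the shipping table used in the examples of this section; the alias is_heavy and the threshold of 100 are hypothetical and exist only for illustration.

```sql
-- Sketch: group on a derived value by naming it in a subquery first.
-- is_heavy and the threshold 100 are illustrative assumptions.
SELECT origin_state, is_heavy, sum(package_weight)
FROM (
    SELECT origin_state,
           package_weight > 100 AS is_heavy,  -- expression becomes a plain column
           package_weight
    FROM shipping
) t
GROUP BY GROUPING SETS (
    (origin_state, is_heavy),
    (is_heavy));
```

The outer query only ever sees column names, so it stays within the restriction described above.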
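Complex grouping operations are often equivalent to a UNION ALL of simple GROUP BY expressions, as the following examples show. This equivalence does not hold, however, when the data source for the aggregation is non-deterministic.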
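Grouping sets allow users to specify multiple lists of columns to group on. Columns that are not part of a given sublist of grouping columns are set to NULL.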
SELECT * FROM shipping;
origin_state | origin_zip | destination_state | destination_zip | package_weight
--------------+------------+-------------------+-----------------+----------------
California | 94131 | New Jersey | 8648 | 13
California | 94131 | New Jersey | 8540 | 42
New Jersey | 7081 | Connecticut | 6708 | 225
California | 90210 | Connecticut | 6927 | 1337
California | 94131 | Colorado | 80302 | 5
New York | 10002 | New Jersey | 8540 | 3
SELECT origin_state, origin_zip, destination_state, sum(package_weight)
FROM shipping
GROUP BY GROUPING SETS (
(origin_state),
(origin_state, origin_zip),
(destination_state));
origin_state | origin_zip | destination_state | _col3
--------------+------------+-------------------+-------
New Jersey | NULL | NULL | 225
California | NULL | NULL | 1397
New York | NULL | NULL | 3
California | 90210 | NULL | 1337
California | 94131 | NULL | 60
New Jersey | 7081 | NULL | 225
New York | 10002 | NULL | 3
NULL | NULL | Colorado | 5
NULL | NULL | New Jersey | 58
NULL | NULL | Connecticut | 1562
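For comparison, the GROUPING SETS query above may be considered logically equivalent to a UNION ALL of three simple GROUP BY queries, one per grouping set. The following sketch uses the same shipping table and fills the columns that are not grouped on with NULL:

```sql
-- One simple GROUP BY per grouping set; ungrouped columns are NULL.
SELECT origin_state, NULL, NULL, sum(package_weight)
FROM shipping GROUP BY origin_state

UNION ALL

SELECT origin_state, origin_zip, NULL, sum(package_weight)
FROM shipping GROUP BY origin_state, origin_zip

UNION ALL

SELECT NULL, NULL, destination_state, sum(package_weight)
FROM shipping GROUP BY destination_state;
```

Each branch scans shipping separately, so this form reads the table once per grouping set.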
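Unlike the equivalent UNION ALL form, a query that uses the complex grouping syntax (GROUPING SETS, CUBE or ROLLUP) reads the underlying data source only once. The UNION ALL form of the example above reads the underlying data three times, once per grouping set. This is why a query written with UNION ALL may produce inconsistent results when the data source is non-deterministic.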
The CUBE operator generates all possible grouping sets (i.e. a power set) for a given set of columns. For example, the query:
SELECT origin_state, destination_state, sum(package_weight)
FROM shipping
GROUP BY CUBE (origin_state, destination_state);
is equivalent to:
SELECT origin_state, destination_state, sum(package_weight)
FROM shipping
GROUP BY GROUPING SETS (
(origin_state, destination_state),
(origin_state),
(destination_state),
());
origin_state | destination_state | _col2
--------------+-------------------+-------
California | New Jersey | 55
California | Colorado | 5
New York | New Jersey | 3
New Jersey | Connecticut | 225
California | Connecticut | 1337
California | NULL | 1397
New York | NULL | 3
New Jersey | NULL | 225
NULL | New Jersey | 58
NULL | Connecticut | 1562
NULL | Colorado | 5
NULL | NULL | 1625
(12 rows)
The ROLLUP operator generates all possible subtotals for a given set of columns. For example, the query:
SELECT origin_state, origin_zip, sum(package_weight)
FROM shipping
GROUP BY ROLLUP (origin_state, origin_zip);
origin_state | origin_zip | _col2
--------------+------------+-------
California | 94131 | 60
California | 90210 | 1337
New Jersey | 7081 | 225
New York | 10002 | 3
California | NULL | 1397
New York | NULL | 3
New Jersey | NULL | 225
NULL | NULL | 1625
(8 rows)
This ROLLUP query is equivalent to:
SELECT origin_state, origin_zip, sum(package_weight)
FROM shipping
GROUP BY GROUPING SETS ((origin_state, origin_zip), (origin_state), ());
Hive writes this differently. Where Presto uses GROUP BY ROLLUP (origin_state, origin_zip), the Hive form is GROUP BY origin_state, origin_zip WITH ROLLUP.
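As a rough illustration of the Hive form, the ROLLUP example from this section could be written as follows. This is a sketch that assumes a shipping table with the same columns exists in Hive; the WITH CUBE variant is included for symmetry.

```sql
-- Hive: list the grouping columns, then append WITH ROLLUP.
SELECT origin_state, origin_zip, sum(package_weight)
FROM shipping
GROUP BY origin_state, origin_zip WITH ROLLUP;

-- Hive: the same pattern with WITH CUBE covers every combination
-- of the listed columns.
SELECT origin_state, destination_state, sum(package_weight)
FROM shipping
GROUP BY origin_state, destination_state WITH CUBE;
```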