presto cube等复杂聚合函数

Presto还使用GROUPING SETSCUBEROLLUP语法支持复杂的聚合。 使用此语法,用户可以执行需要在单个查询中对多组列进行聚合的分析。 复杂的分组操作不支持对由输入列组成的表达式进行分组。 仅允许使用列名或常规名称

复杂的分组操作通常等效于简单GROUP BY表达式的UNION ALL,如以下示例所示。 但是,当聚合的数据源不确定时,此等效项将不适用。

GROUPING SETS

分组集允许用户指定多个列进行分组。 不属于分组列的给定子列表的列被设置为NULL。

SELECT * FROM shipping;
 origin_state | origin_zip | destination_state | destination_zip | package_weight
--------------+------------+-------------------+-----------------+----------------
 California   |      94131 | New Jersey        |            8648 |             13
 California   |      94131 | New Jersey        |            8540 |             42
 New Jersey   |       7081 | Connecticut       |            6708 |            225
 California   |      90210 | Connecticut       |            6927 |           1337
 California   |      94131 | Colorado          |           80302 |              5
 New York     |      10002 | New Jersey        |            8540 |              3
SELECT origin_state, origin_zip, destination_state, sum(package_weight)
FROM shipping
GROUP BY GROUPING SETS (
    (origin_state),
    (origin_state, origin_zip),
    (destination_state));
 origin_state | origin_zip | destination_state | _col0
--------------+------------+-------------------+-------
 New Jersey   | NULL       | NULL              |   225
 California   | NULL       | NULL              |  1397
 New York     | NULL       | NULL              |     3
 California   |      90210 | NULL              |  1337
 California   |      94131 | NULL              |    60
 New Jersey   |       7081 | NULL              |   225
 New York     |      10002 | NULL              |     3
 NULL         | NULL       | Colorado          |     5
 NULL         | NULL       | New Jersey        |    58
 NULL         | NULL       | Connecticut       |  1562

但是,使用复杂分组语法(GROUPING SETS,CUBE或ROLLUP)的查询仅从基础数据源读取一次,而使用UNION ALL的查询则读取基础数据三次。 这就是为什么当数据源不确定时,使用UNION ALL的查询可能会产生不一致的结果的原因。

cube

The CUBE operator generates all possible grouping sets (i.e. a power set) for a given set of columns. For example, the query:

SELECT origin_state, destination_state, sum(package_weight)
FROM shipping
GROUP BY CUBE (origin_state, destination_state);

等价于

SELECT origin_state, destination_state, sum(package_weight)
FROM shipping
GROUP BY GROUPING SETS (
    (origin_state, destination_state),
    (origin_state),
    (destination_state),
    ());
origin_state | destination_state | _col0
--------------+-------------------+-------
 California   | New Jersey        |    55
 California   | Colorado          |     5
 New York     | New Jersey        |     3
 New Jersey   | Connecticut       |   225
 California   | Connecticut       |  1337
 California   | NULL              |  1397
 New York     | NULL              |     3
 New Jersey   | NULL              |   225
 NULL         | New Jersey        |    58
 NULL         | Connecticut       |  1562
 NULL         | Colorado          |     5
 NULL         | NULL              |  1625
(12 rows)

rollup

The ROLLUP operator generates all possible subtotals for a given set of columns. For example, the query:

SELECT origin_state, origin_zip, sum(package_weight)
FROM shipping
GROUP BY ROLLUP (origin_state, origin_zip);
 origin_state | origin_zip | _col2
--------------+------------+-------
 California   |      94131 |    60
 California   |      90210 |  1337
 New Jersey   |       7081 |   225
 New York     |      10002 |     3
 California   | NULL       |  1397
 New York     | NULL       |     3
 New Jersey   | NULL       |   225
 NULL         | NULL       |  1625
(8 rows)

等价于

SELECT origin_state, origin_zip, sum(package_weight)
FROM shipping
GROUP BY GROUPING SETS ((origin_state, origin_zip), (origin_state), ());

总结

  • 注意和hive写法的不同
  • 比如
  • GROUP BY ROLLUP (origin_state, origin_zip);
  • hive的写法是GROUP BY origin_state, origin_zip with rollup

你可能感兴趣的:(presto,presto,rollup,cube)