Hive_6. Data Aggregation -- GROUP BY & GROUPING SETS & ROLLUP & CUBE & HAVING

Today I'd like to introduce one of Hive's more advanced operations: data aggregation. This post covers the common kinds of aggregation in Hive in three parts:

  • Basic aggregation functions with GROUP BY
  • Advanced aggregation -- GROUPING SETS & ROLLUP & CUBE
  • Aggregation condition -- HAVING

1. Basic Aggregation Functions with GROUP BY

Data aggregation collects and expresses information in a summarized form based on specific conditions. Hive offers built-in aggregate functions such as MAX, MIN, and AVG. It also supports advanced aggregation with GROUPING SETS, ROLLUP, and CUBE, as well as analytic functions and windowing.
Hive's basic built-in aggregate functions are usually used together with a GROUP BY clause. If no GROUP BY clause is specified, they aggregate over the whole table by default. Apart from the aggregate functions themselves, every other column in the SELECT list must also appear in the GROUP BY clause (analytic functions are the exception). Here are a few examples using the built-in aggregate functions.
Note: for window functions and partitioned table functions, see the SQL Windowing project: http://blog.csdn.net/mike_h/article/details/50245995

  • Aggregation without a GROUP BY column:
jdbc:hive2://> SELECT count(*) AS row_cnt FROM employee;
+----------+
| row_cnt  |
+----------+
| 5        |
+----------+
1 row selected (60.709 seconds)
  • Aggregation on GROUP BY columns:
jdbc:hive2://> SELECT sex_age.sex, count(*) AS row_cnt 
. . . . . . .> FROM employee 
. . . . . . .> GROUP BY sex_age.sex;
+--------------+----------+
| sex_age.sex  | row_cnt  |
+--------------+----------+
| Female       | 2        |
| Male         | 3        |
+--------------+----------+
2 rows selected (100.565 seconds)

--every non-aggregated column in the SELECT list must appear in the GROUP BY clause
jdbc:hive2://> SELECT name, sex_age.sex, count(*) AS row_cnt 
. . . . . . .> FROM employee GROUP BY sex_age.sex;
Error: Error while compiling statement: FAILED: SemanticException [Error 10025]: Line 1:7 Expression not in GROUP BY key 'name' (state=42000,code=10025) 

If we have to SELECT columns that are not in the GROUP BY clause, there are two approaches:

  • Use analytic functions, which avoid the GROUP BY clause entirely (covered in detail later; see the sketch right after this list)
  • Use the collect_set function, which returns a set of objects with duplicate elements removed
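As a quick preview of method 1, here is a minimal sketch (assuming the same employee table as above and a Hive release with windowing support, 0.11 or later): an analytic count over a window attaches the per-sex row count to every row, so name can be selected without any GROUP BY.

--Sketch only: count(*) OVER (PARTITION BY ...) is computed per row,
--so non-grouped columns such as name can still be selected
SELECT name, sex_age.sex,
       count(*) OVER (PARTITION BY sex_age.sex) AS row_cnt
FROM employee;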

Method 2 looks like this:

--count rows for each sex, and pick one sampled age per sex
jdbc:hive2://> SELECT sex_age.sex,
. . . . . . .> collect_set(sex_age.age)[0] AS random_age, 
. . . . . . .> count(*) AS row_cnt 
. . . . . . .> FROM employee GROUP BY sex_age.sex;
+--------------+-------------+----------+
| sex_age.sex  | random_age  | row_cnt  |
+--------------+-------------+----------+
| Female       | 27          | 2        |
| Male         | 35          | 3        |
+--------------+-------------+----------+
2 rows selected (48.15 seconds)

Multiple aggregate functions can be used in the same SELECT statement, and they can be combined with other functions, such as conditional functions, in a nested fashion. However, nesting one aggregate function inside another is not supported. See the following examples for details:
  • Multiple aggregate functions in one SELECT statement:
jdbc:hive2://> SELECT sex_age.sex, AVG(sex_age.age) AS avg_age, 
. . . . . . .> count(*) AS row_cnt 
. . . . . . .> FROM employee GROUP BY sex_age.sex; 
+--------------+---------------------+----------+
| sex_age.sex  | avg_age             | row_cnt  |
+--------------+---------------------+----------+
| Female       | 42.0                | 2        |
| Male         | 31.666666666666668  | 3        |
+--------------+---------------------+----------+
2 rows selected (98.857 seconds)
  • Aggregate functions used with CASE WHEN:
jdbc:hive2://> SELECT sum(CASE WHEN sex_age.sex = 'Male' 
. . . . . . .> THEN sex_age.age ELSE 0 END)/
. . . . . . .> count(CASE WHEN sex_age.sex = 'Male' THEN 1 
. . . . . . .> ELSE NULL END) AS male_age_avg FROM employee;
+---------------------+
| male_age_avg        |
+---------------------+
| 31.666666666666668  |
+---------------------+
1 row selected (38.415 seconds)
 
     
  • Aggregate functions used with COALESCE and IF:
jdbc:hive2://> SELECT  
. . . . . . .> sum(coalesce(sex_age.age,0)) AS age_sum,      -- coalesce returns the first non-NULL value in its argument list, or NULL if all are NULL
. . . . . . .> sum(if(sex_age.sex = 'Female',sex_age.age,0)) -- the IF returns 0 when the condition is false
. . . . . . .> AS female_age_sum FROM employee;
+----------+-----------------+
| age_sum  | female_age_sum  |
+----------+-----------------+
| 179      | 84              |
+----------+-----------------+
1 row selected (42.137 seconds)
  • Nested aggregate functions are not allowed, as shown below (each aggregate function already operates on its whole group at the coarsest granularity, much like an RDD-wide operation in Spark); a subquery-based workaround is sketched right after the error:
jdbc:hive2://> SELECT avg(count(*)) AS row_cnt
. . . . . . .> FROM employee;
Error: Error while compiling statement: FAILED: SemanticException [Error 10128]: Line 1:11 Not yet supported place for UDAF 'count' (state=42000,code=10128)
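If the goal is something like the average group size, the usual workaround is to compute the inner aggregate in a subquery and aggregate again outside. A minimal sketch, assuming we want the average number of rows per sex:

--Sketch only: count per group first, then average the counts
SELECT avg(a.cnt) AS avg_rows_per_sex
FROM (SELECT sex_age.sex, count(*) AS cnt
      FROM employee
      GROUP BY sex_age.sex) a;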

Aggregate functions can also be used with the DISTINCT keyword to aggregate over unique values only.

jdbc:hive2://> SELECT count(DISTINCT sex_age.sex) AS sex_uni_cnt,
. . . . . . .> count(DISTINCT name) AS name_uni_cnt 
. . . . . . .> FROM employee;     
+--------------+---------------+
| sex_uni_cnt  | name_uni_cnt  |
+--------------+---------------+
| 2            | 5             |
+--------------+---------------+
1 row selected (35.935 seconds)

 
   

Note:

When COUNT and DISTINCT are used together, Hive ignores the configured number of reducers (for example mapred.reduce.tasks = 20) and uses only one reducer. On a large data set, that single reducer becomes the performance bottleneck. The usual trade-off is to rewrite the query with a subquery:

--the whole query runs through a single reducer
SELECT count(distinct sex_age.sex) AS sex_uni_cnt FROM employee;
--select the distinct values in a subquery before aggregating; this is more efficient
SELECT count(*) AS sex_uni_cnt FROM (SELECT distinct sex_age.sex FROM employee) a;

In the second form, the first stage uses multiple reducers to compute the DISTINCT values, so the mapper output feeding the COUNT stage shrinks accordingly and the final reducer is no longer overloaded.

When processing data with Hive you may also run into NULL values in the aggregated columns. If a column inside the aggregate expression is NULL, the entire row is skipped (in the example below, the second row (NULL, 2) is dropped from sum(val1+val2)). To avoid this, we can use COALESCE to substitute a default value for NULL. For example:

--create a test table
jdbc:hive2://> CREATE TABLE t AS SELECT * FROM
. . . . . . .> (SELECT employee_id-99 AS val1, 
. . . . . . .> (employee_id-98) AS val2 FROM employee_hr 
. . . . . . .> WHERE employee_id <= 101
. . . . . . .> UNION ALL
. . . . . . .> SELECT null val1, 2 AS val2 FROM employee_hr 
. . . . . . .> WHERE employee_id = 100) a;
No rows affected (0.138 seconds) 

--check the rows in the new table
jdbc:hive2://> SELECT * FROM t;
+---------+---------+
| t.val1  | t.val2  |
+---------+---------+
| 1       | 2       |
| NULL    | 2       |
| 2       | 3       |
+---------+---------+
3 rows selected (0.069 seconds)

--sum(val1+val2) skips the second row (NULL, 2)
jdbc:hive2://> SELECT sum(val1), sum(val1+val2) 
. . . . . . .> FROM t;                   
+------+------+
| _c0  | _c1  |
+------+------+
| 3    | 8    |
+------+------+
1 row selected (57.775 seconds)

jdbc:hive2://> SELECT sum(coalesce(val1,0)), 
. . . . . . .> sum(coalesce(val1,0)+val2) FROM t;
+------+------+
| _c0  | _c1  |
+------+------+
| 3    | 10   |
+------+------+
1 row selected (69.967 seconds)

The hive.map.aggr property controls aggregation inside map tasks. When it is set to true, Hive performs the first stage of aggregation directly in the map tasks, which usually improves performance but consumes more memory.

jdbc:hive2://> SET hive.map.aggr=true;
No rows affected (0.002 seconds)

2. Advanced Aggregation -- GROUPING SETS

Hive provides the GROUPING SETS keyword to perform several GROUP BY operations on the same data set in a single statement. In effect, GROUPING SETS combines all of that processing into one stage of one job, which is clearly more efficient than chaining separate GROUP BY queries with UNION ALL. An empty grouping set, (), produces the overall (grand total) aggregate. The examples below show the equivalences. For a better understanding, think of GROUPING SETS as a UNION ALL on the outside with a separate GROUP BY on the inside for each grouping set.

SELECT name, work_place[0] AS main_place, 
count(employee_id) AS emp_id_cnt 
FROM employee_id
GROUP BY name, work_place[0] 
GROUPING SETS((name, work_place[0]));
--is equivalent to:
SELECT name, work_place[0] AS main_place, 
count(employee_id) AS emp_id_cnt 
FROM employee_id
GROUP BY name, work_place[0];

SELECT name, work_place[0] AS main_place, 
count(employee_id) AS emp_id_cnt 
FROM employee_id
GROUP BY name, work_place[0] 
GROUPING SETS(name, work_place[0]);
--is equivalent to:
SELECT name, NULL AS main_place, count(employee_id) AS emp_id_cnt 
FROM employee_id
GROUP BY name
UNION ALL
SELECT NULL AS name, work_place[0] AS main_place, 
count(employee_id) AS emp_id_cnt 
FROM employee_id
GROUP BY work_place[0];

SELECT name, work_place[0] AS main_place, 
count(employee_id) AS emp_id_cnt 
FROM employee_id
GROUP BY name, work_place[0] 
GROUPING SETS((name, work_place[0]), name);
--is equivalent to:
SELECT name, work_place[0] AS main_place, 
count(employee_id) AS emp_id_cnt 
FROM employee_id
GROUP BY name, work_place[0]
UNION ALL
SELECT name, NULL AS main_place, count(employee_id) AS emp_id_cnt 
FROM employee_id
GROUP BY name;

SELECT name, work_place[0] AS main_place, 
count(employee_id) AS emp_id_cnt 
FROM employee_id
GROUP BY name, work_place[0]
GROUPING SETS((name, work_place[0]), name, work_place[0], ());
--is equivalent to:
SELECT name, work_place[0] AS main_place, 
count(employee_id) AS emp_id_cnt 
FROM employee_id
GROUP BY name, work_place[0]
UNION ALL
SELECT name, NULL AS main_place, count(employee_id) AS emp_id_cnt 
FROM employee_id
GROUP BY name
UNION ALL
SELECT NULL AS name, work_place[0] AS main_place, 
count(employee_id) AS emp_id_cnt 
FROM employee_id
GROUP BY work_place[0]
UNION ALL
SELECT NULL AS name, NULL AS main_place, 
count(employee_id) AS emp_id_cnt 
FROM employee_id;

As people use Hive more heavily, more and more issues have been filed against the GROUPING SETS operator. Most of them have already been fixed, but more will no doubt surface as workloads grow; if you are interested, it is worth browsing the Hive JIRA.

Here is one issue affecting the Hive version used here, a parse error on GROUPING SETS: https://issues.apache.org/jira/browse/HIVE-6950

jdbc:hive2://> SELECT sex_age.sex, sex_age.age, 
. . . . . . .> count(name) AS name_cnt 
. . . . . . .> FROM employee
. . . . . . .> GROUP BY sex_age.sex, sex_age.age
. . . . . . .> GROUPING SETS((sex_age.sex, sex_age.age));
Error: Error while compiling statement: FAILED: ParseException line 1:131 missing ) at ',' near ''
line 1:145 extraneous input ')' expecting EOF near '' (state=42000,code=40000)
This issue has been fixed in Hive 1.2.0.
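On affected versions, a common workaround (a sketch only, based on the parse error being triggered by the dotted struct references) is to project the struct fields to plain column aliases in a subquery, so that GROUPING SETS only sees simple identifiers:

--Sketch only: alias sex_age.sex / sex_age.age first, then group on the aliases
SELECT sex, age, count(name) AS name_cnt
FROM (SELECT sex_age.sex AS sex, sex_age.age AS age, name FROM employee) t
GROUP BY sex, age
GROUPING SETS((sex, age));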

3. Advanced Aggregation -- ROLLUP and CUBE

For examples of the CUBE and GROUPING__ID keywords in Hive, see: https://www.qubole.com/blog/product/cube-keyword-in-apache-hive/

The ROLLUP clause lets a SELECT statement compute multiple levels of aggregation over a specified group of dimensions. It is an efficient extension of the GROUP BY clause with minimal query overhead. Whereas GROUPING SETS creates exactly the aggregation levels you specify, ROLLUP creates n+1 levels, where n is the number of grouping columns. ROLLUP does the following:

  • It computes the standard aggregate values specified in the GROUP BY clause
  • It creates progressively higher-level subtotals over the left-to-right prefixes of the grouping columns, up to a grand total

For example:

GROUP BY a,b,c WITH ROLLUP

is equivalent to:

GROUP BY a,b,c GROUPING SETS ((a,b,c),(a,b),(a),())
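As a concrete illustration (a sketch only, reusing the employee_hr table from the CUBE example later in this section), rolling up name within start_date yields per-(start_date, name) counts, per-start_date subtotals, and a grand total:

--Sketch only: 2 grouping columns, so ROLLUP produces 2 + 1 = 3 levels of aggregation
SELECT start_date, name, count(employee_id) AS emp_id_cnt
FROM employee_hr
GROUP BY start_date, name WITH ROLLUP;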

The CUBE clause takes a specified set of grouping columns and creates aggregates for every possible combination of them. If n columns are specified for CUBE, 2^n combinations of aggregates are returned. For example:

GROUP BY a,b,c WITH CUBE

is equivalent to:

GROUP BY a,b,c GROUPING SETS ((a,b,c),(a,b),(b,c),(a,c),(a),(b),(c),())

The GROUPING__ID function works as an extension to distinguish these aggregated rows from one another. It accepts one or more columns and returns the decimal equivalent of a BIT vector over the columns specified after GROUP BY. The decimal number is converted from a binary string of 1s and 0s: a bit of 1 means the corresponding column participates in the grouping for that row (its value is not rolled up to NULL), while 0 means it has been aggregated away. The lowest-order bit corresponds to the column closest to GROUP BY, which is start_date in the example below.
jdbc:hive2://> SELECT GROUPING__ID, 
. . . . . . .> BIN(CAST(GROUPING__ID AS BIGINT)) AS bit_vector, 
. . . . . . .> name, start_date, count(employee_id) emp_id_cnt 
. . . . . . .> FROM employee_hr 
. . . . . . .> GROUP BY start_date, name 
. . . . . . .> WITH CUBE ORDER BY start_date;
+---------------+-------------+----------+-------------+------------+
| grouping__id  | bit_vector  |   name   | start_date  | emp_id_cnt |
+---------------+-------------+----------+-------------+------------+
| 2             | 10          | Steven   | NULL        | 1          |
| 2             | 10          | Michael  | NULL        | 1          |
| 2             | 10          | Lucy     | NULL        | 1          |
| 0             | 0           | NULL     | NULL        | 4          |
| 2             | 10          | Will     | NULL        | 1          |
| 3             | 11          | Lucy     | 2010-01-03  | 1          |
| 1             | 1           | NULL     | 2010-01-03  | 1          |
| 1             | 1           | NULL     | 2012-11-03  | 1          |
| 3             | 11          | Steven   | 2012-11-03  | 1          |
| 1             | 1           | NULL     | 2013-10-02  | 1          |
| 3             | 11          | Will     | 2013-10-02  | 1          |
| 1             | 1           | NULL     | 2014-01-29  | 1          |
| 3             | 11          | Michael  | 2014-01-29  | 1          |
+---------------+-------------+----------+-------------+------------+
13 rows selected (136.708 seconds)
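In practice GROUPING__ID is handy for labelling the aggregation level of each row. A minimal sketch, reusing the query above (the labels are made up for illustration, and the numeric values follow the bit encoding shown in the output above; newer Hive releases changed this encoding):

--Sketch only: map the bit vector to readable labels for each aggregation level
SELECT CASE CAST(GROUPING__ID AS INT)
         WHEN 3 THEN 'name + start_date'
         WHEN 2 THEN 'per name'
         WHEN 1 THEN 'per start_date'
         ELSE 'grand total'
       END AS agg_level,
       name, start_date, count(employee_id) AS emp_id_cnt
FROM employee_hr
GROUP BY start_date, name WITH CUBE;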

4. Aggregation Condition -- HAVING

Hive has supported the HAVING clause for filtering GROUP BY results since version 0.7.0. Using HAVING lets us avoid adding a subquery after the GROUP BY. For example:

jdbc:hive2://> SELECT sex_age.age FROM employee 
. . . . . . .> GROUP BY sex_age.age HAVING count(*)<=1;
+--------------+
| sex_age.age  |
+--------------+
| 57           |
| 27           |
| 35           |
+--------------+
3 rows selected (74.376 seconds)

Without HAVING, the same result can be obtained with a subquery:

jdbc:hive2://> SELECT a.age
. . . . . . .> FROM
. . . . . . .> (SELECT count(*) as cnt, sex_age.age 
. . . . . . .> FROM employee GROUP BY sex_age.age
. . . . . . .> ) a WHERE a.cnt<=1;
+--------+
| a.age  |
+--------+
| 57     |
| 27     |
| 35     |
+--------+
3 rows selected (87.298 seconds)
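HAVING can also filter on an aggregate that does not appear in the SELECT list. A quick sketch, reusing the same employee table (the threshold of 30 is arbitrary):

--Sketch only: keep only the sexes whose average age is above 30
SELECT sex_age.sex
FROM employee
GROUP BY sex_age.sex
HAVING avg(sex_age.age) > 30;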

