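Today I'd like to introduce one of Hive's more advanced operations: data aggregation. This post covers the common kinds of aggregation in Hive in three parts.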
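Data aggregation gathers and expresses information in summary form, based on specific conditions. Hive provides a number of built-in aggregate functions, such as MAX, MIN, and AVG. It also supports advanced aggregation through GROUPING SETS, ROLLUP, CUBE, analytic functions, and windowing.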
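Hive's basic built-in aggregate functions are usually used with a GROUP BY clause. If no GROUP BY clause is specified, the aggregation runs over the whole table by default. Apart from the aggregate functions themselves, every other column in the SELECT list must also appear in the GROUP BY clause (analytic functions are the exception).
Note: for window functions and partitioned table functions, see the SQL Windowing project: http://blog.csdn.net/mike_h/article/details/50245995
Here are a few examples that use the built-in aggregate functions: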
-- Aggregation over the whole table, without GROUP BY
jdbc:hive2://> SELECT count(*) AS row_cnt FROM employee;
+----------+
| row_cnt  |
+----------+
| 5        |
+----------+
1 row selected (60.709 seconds)

-- Aggregation with a GROUP BY column
jdbc:hive2://> SELECT sex_age.sex, count(*) AS row_cnt
. . . . . . .> FROM employee
. . . . . . .> GROUP BY sex_age.sex;
+--------------+----------+
| sex_age.sex  | row_cnt  |
+--------------+----------+
| Female       | 2        |
| Male         | 3        |
+--------------+----------+
2 rows selected (100.565 seconds)

-- Columns in the SELECT list must appear in the GROUP BY clause
jdbc:hive2://> SELECT name, sex_age.sex, count(*) AS row_cnt
. . . . . . .> FROM employee GROUP BY sex_age.sex;
Error: Error while compiling statement: FAILED: SemanticException [Error 10025]: Line 1:7 Expression not in GROUP BY key 'name' (state=42000,code=10025)
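If we must SELECT columns that are not in the GROUP BY clause, there are two ways to do it. The first is to use analytic functions, which are exempt from the GROUP BY rule. The second is the collect_set function, which returns a set of objects with duplicate elements removed. The second approach looks like this:
-- Count the rows for each sex and pick one sample age per sex
jdbc:hive2://> SELECT sex_age.sex,
. . . . . . .> collect_set(sex_age.age)[0] AS random_age,
. . . . . . .> count(*) AS row_cnt
. . . . . . .> FROM employee GROUP BY sex_age.sex;
+--------------+-------------+----------+
| sex_age.sex  | random_age  | row_cnt  |
+--------------+-------------+----------+
| Female       | 27          | 2        |
| Male         | 35          | 3        |
+--------------+-------------+----------+
2 rows selected (48.15 seconds)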
-- Several aggregate functions can be used in the same SELECT
jdbc:hive2://> SELECT sex_age.sex, AVG(sex_age.age) AS avg_age,
. . . . . . .> count(*) AS row_cnt
. . . . . . .> FROM employee GROUP BY sex_age.sex;
+--------------+---------------------+----------+
| sex_age.sex  | avg_age             | row_cnt  |
+--------------+---------------------+----------+
| Female       | 42.0                | 2        |
| Male         | 31.666666666666668  | 3        |
+--------------+---------------------+----------+
2 rows selected (98.857 seconds)

-- Aggregate functions can be combined with CASE WHEN
jdbc:hive2://> SELECT sum(CASE WHEN sex_age.sex = 'Male'
. . . . . . .> THEN sex_age.age ELSE 0 END)/
. . . . . . .> count(CASE WHEN sex_age.sex = 'Male' THEN 1
. . . . . . .> ELSE NULL END) AS male_age_avg FROM employee;
+---------------------+
| male_age_avg        |
+---------------------+
| 31.666666666666668  |
+---------------------+
1 row selected (38.415 seconds)

Aggregate functions used with COALESCE and IF:
jdbc:hive2://> SELECT
. . . . . . .> sum(coalesce(sex_age.age,0)) AS age_sum, -- COALESCE returns its first non-NULL argument, or NULL if all are NULL
. . . . . . .> sum(if(sex_age.sex = 'Female',sex_age.age,0)) -- yields 0 when the condition is false
. . . . . . .> AS female_age_sum FROM employee;
+----------+-----------------+
| age_sum  | female_age_sum  |
+----------+-----------------+
| 179      | 84              |
+----------+-----------------+
1 row selected (42.137 seconds)

-- Aggregate functions cannot be nested
jdbc:hive2://> SELECT avg(count(*)) AS row_cnt
. . . . . . .> FROM employee;
Error: Error while compiling statement: FAILED: SemanticException [Error 10128]: Line 1:11 Not yet supported place for UDAF 'count' (state=42000,code=10128)
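The last query above fails because Hive does not accept one aggregate function nested directly inside another. A common way around this in SQL is to compute the inner aggregate in a subquery and aggregate its result in the outer query. The sketch below illustrates the idea with Python's sqlite3 module standing in for Hive; the flat `employee` table, the names, and the choice to average the per-sex counts are all assumptions made for illustration, with ages picked to match the results shown in this post.

```python
import sqlite3

# SQLite stands in for Hive; table layout and names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, sex TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [("A", "Male", 30), ("B", "Male", 35), ("C", "Female", 57),
     ("D", "Female", 27), ("E", "Male", 30)],
)

# Inner query: one row count per sex. Outer query: average of those counts.
avg_cnt = conn.execute(
    "SELECT avg(cnt) FROM "
    "(SELECT count(*) AS cnt FROM employee GROUP BY sex) a"
).fetchone()[0]
print(avg_cnt)
```

The same two-level shape should carry over to HiveQL, with `sex_age.sex` in place of the flat `sex` column.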
Aggregate functions can also be used with the DISTINCT keyword to aggregate over unique values only.
jdbc:hive2://> SELECT count(DISTINCT sex_age.sex) AS sex_uni_cnt,
. . . . . . .> count(DISTINCT name) AS name_uni_cnt
. . . . . . .> FROM employee;
+--------------+---------------+
| sex_uni_cnt | name_uni_cnt |
+--------------+---------------+
| 2 | 5 |
+--------------+---------------+
1 row selected (35.935 seconds)
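When COUNT and DISTINCT are used together, Hive usually ignores the configured number of reducers (for example, mapred.reduce.tasks = 20) and uses a single reducer. With large data volumes, that single reducer obviously becomes the performance bottleneck. The workaround is to use a subquery:
-- Triggers only a single reducer for the whole job
SELECT count(distinct sex_age.sex) AS sex_uni_cnt FROM employee;
-- Use a subquery to select the unique values before aggregating, which is more efficient
SELECT count(*) AS sex_uni_cnt FROM (SELECT distinct sex_age.sex FROM employee) a;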
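In this case, the first stage runs the DISTINCT query with more than one reducer and reduces the data to its unique values. The mappers then emit far fewer records to the COUNT stage, so the final reducer is no longer overloaded.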
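The rewrite only changes how the work is distributed, not the answer. As a quick sanity check, the sketch below runs both forms against a small in-memory table, using Python's sqlite3 module as a stand-in for Hive. The flat `sex` column and the names are assumptions for illustration, and SQLite says nothing about reducer counts.

```python
import sqlite3

# SQLite stands in for Hive; table layout and names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, sex TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("A", "Male"), ("B", "Male"), ("C", "Female"),
     ("D", "Female"), ("E", "Male")],
)

# Form 1: COUNT(DISTINCT ...) in a single query.
direct = conn.execute(
    "SELECT count(DISTINCT sex) FROM employee"
).fetchone()[0]

# Form 2: de-duplicate in a subquery first, then count the rows.
two_step = conn.execute(
    "SELECT count(*) FROM (SELECT DISTINCT sex FROM employee) a"
).fetchone()[0]

print(direct, two_step)
```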
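When processing data with Hive, a column being aggregated may contain NULL values. Aggregate functions skip NULL inputs, so if the aggregated expression evaluates to NULL for a row, that whole row is left out of the result. To avoid this, we can use COALESCE to assign a default value to NULL columns, as follows:
-- Create a test table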
jdbc:hive2://> CREATE TABLE t AS SELECT * FROM
. . . . . . .> (SELECT employee_id-99 AS val1,
. . . . . . .> (employee_id-98) AS val2 FROM employee_hr
. . . . . . .> WHERE employee_id <= 101
. . . . . . .> UNION ALL
. . . . . . .> SELECT null val1, 2 AS val2 FROM employee_hr
. . . . . . .> WHERE employee_id = 100) a;
No rows affected (0.138 seconds)
jdbc:hive2://> SELECT * FROM t;
+---------+---------+
| t.val1 | t.val2 |
+---------+---------+
| 1 | 2 |
| NULL | 2 |
| 2 | 3 |
+---------+---------+
3 rows selected (0.069 seconds)
jdbc:hive2://> SELECT sum(val1), sum(val1+val2)
. . . . . . .> FROM t;
+------+------+
| _c0 | _c1 |
+------+------+
| 3 | 8 |
+------+------+
1 row selected (57.775 seconds)
jdbc:hive2://> SELECT sum(coalesce(val1,0)),
. . . . . . .> sum(coalesce(val1,0)+val2) FROM t;
+------+------+
| _c0 | _c1 |
+------+------+
| 3 | 10 |
+------+------+
1 row selected (69.967 seconds)
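The difference between 8 and 10 comes from the middle row: NULL + 2 evaluates to NULL, so sum() skips the row entirely unless val1 is defaulted first. The sketch below reproduces the same three rows with Python's sqlite3 module standing in for Hive, on the assumption that both engines follow the standard SQL rule that aggregates ignore NULL inputs.

```python
import sqlite3

# SQLite stands in for Hive; the rows mirror table t above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (val1 INTEGER, val2 INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, 2), (None, 2), (2, 3)])

# Without COALESCE: the row where val1 is NULL drops out of sum(val1 + val2).
plain = conn.execute(
    "SELECT sum(val1), sum(val1 + val2) FROM t"
).fetchone()

# With COALESCE: NULL becomes 0, so the row contributes its val2.
fixed = conn.execute(
    "SELECT sum(coalesce(val1, 0)), sum(coalesce(val1, 0) + val2) FROM t"
).fetchone()

print(plain, fixed)
```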
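The hive.map.aggr property controls aggregation in the map tasks. It defaulted to false in Hive 0.2 and has defaulted to true since Hive 0.3. When it is true, Hive performs the first-stage aggregation directly in the map tasks, which improves performance but also consumes more memory.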
jdbc:hive2://> SET hive.map.aggr=true;
No rows affected (0.002 seconds)
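To see why map-side aggregation trades memory for speed, here is a deliberately simplified Python model, not Hive's actual implementation. Each hypothetical mapper keeps an in-memory hash table of partial counts for its input split, and only those partial results are passed on to be merged. The function names and the sample splits are invented for illustration.

```python
from collections import Counter

def map_side_aggregate(split):
    """Pre-aggregate one mapper's input split in an in-memory hash table."""
    return Counter(split)

def reduce_merge(partials):
    """Merge the partial counts produced by all mappers."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Two hypothetical input splits holding the grouping key of each row.
splits = [["Male", "Female", "Male"], ["Male", "Female"]]
partials = [map_side_aggregate(s) for s in splits]

shuffled_without = sum(len(s) for s in splits)   # one record per input row
shuffled_with = sum(len(p) for p in partials)    # one record per key per mapper
result = reduce_merge(partials)
print(shuffled_without, shuffled_with, dict(result))
```

The saving grows with the number of rows per key, while the hash table grows with the number of distinct keys per mapper, which is where the extra memory goes.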
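Hive provides the GROUPING SETS keyword to perform multiple GROUP BY operations against the same data set. In effect, GROUPING SETS consolidates all of that processing into one stage of the job, which is more efficient than running several GROUP BY stages and combining them with UNION ALL. An empty set () in the GROUPING SETS clause computes the overall aggregate. The examples below show what GROUPING SETS is equivalent to; in each pair, the query above the || marker is equivalent to the query below it. As a mental model, the outer level of GROUPING SETS defines which result sets are combined with UNION ALL, and the inner level defines the GROUP BY columns of each of those result sets.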
SELECT name, work_place[0] AS main_place,
count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY name, work_place[0]
GROUPING SETS((name, work_place[0]));
||
SELECT name, work_place[0] AS main_place,
count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY name, work_place[0];
SELECT name, work_place[0] AS main_place,
count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY name, work_place[0]
GROUPING SETS(name, work_place[0]);
||
SELECT name, NULL AS main_place, count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY name
UNION ALL
SELECT NULL AS name, work_place[0] AS main_place,
count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY work_place[0];
SELECT name, work_place[0] AS main_place,
count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY name, work_place[0]
GROUPING SETS((name, work_place[0]), name);
||
SELECT name, work_place[0] AS main_place,
count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY name, work_place[0]
UNION ALL
SELECT name, NULL AS main_place, count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY name;
SELECT name, work_place[0] AS main_place,
count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY name, work_place[0]
GROUPING SETS((name, work_place[0]), name, work_place[0], ());
||
SELECT name, work_place[0] AS main_place,
count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY name, work_place[0]
UNION ALL
SELECT name, NULL AS main_place, count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY name
UNION ALL
SELECT NULL AS name, work_place[0] AS main_place,
count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY work_place[0]
UNION ALL
SELECT NULL AS name, NULL AS main_place,
count(employee_id) AS emp_id_cnt
FROM employee_id;
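The equivalences above can be modeled in a few lines of Python: run one GROUP BY per grouping set, report the columns outside the set as NULL, and concatenate the results as UNION ALL would. This is only a sketch of the semantics, not of how Hive executes the query. The sample rows are hypothetical, and every row is assumed to have a non-NULL employee_id, so count(employee_id) is the same as counting rows.

```python
from collections import Counter

COLUMNS = ("name", "main_place")

def grouping_sets(rows, sets):
    """Run one GROUP BY per grouping set and concatenate the results."""
    result = []
    for gset in sets:
        counts = Counter()
        for row in rows:
            # Columns outside the current grouping set are reported as None (NULL).
            key = tuple(row[c] if c in gset else None for c in COLUMNS)
            counts[key] += 1
        result.extend(key + (cnt,) for key, cnt in counts.items())
    return result

# Hypothetical sample data.
rows = [
    {"name": "Lucy", "main_place": "Vancouver"},
    {"name": "Michael", "main_place": "Montreal"},
    {"name": "Will", "main_place": "Montreal"},
]

# Models GROUPING SETS(name, work_place[0]).
by_each = grouping_sets(rows, [("name",), ("main_place",)])
# Models GROUPING SETS(()): the overall aggregate.
overall = grouping_sets(rows, [()])
for line in by_each + overall:
    print(line)
```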
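As Hive usage has deepened, more issues have been reported against the GROUPING SETS operator. Most of them have been resolved, but more are likely to surface as workloads grow; interested readers can browse the Hive issue tracker.
One such issue is a parse error in GROUPING SETS: https://issues.apache.org/jira/browse/HIVE-6950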
jdbc:hive2://> SELECT sex_age.sex, sex_age.age,
. . . . . . .> count(name) AS name_cnt
. . . . . . .> FROM employee
. . . . . . .> GROUP BY sex_age.sex, sex_age.age
. . . . . . .> GROUPING SETS((sex_age.sex, sex_age.age));
Error: Error while compiling statement: FAILED: ParseException line 1:131 missing ) at ',' near ''
line 1:145 extraneous input ')' expecting EOF near '' (state=42000,code=40000)
This issue has been fixed in Hive 1.2.0.
For examples of the CUBE and GROUPING__ID keywords in Hive, see: https://www.qubole.com/blog/product/cube-keyword-in-apache-hive/
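The ROLLUP statement lets a SELECT statement compute multiple levels of aggregation across a specified group of dimensions. It is a simple extension of the GROUP BY clause that is efficient and adds minimal overhead to the query. Whereas GROUPING SETS creates exactly the levels of aggregation you list, ROLLUP creates n+1 levels, where n is the number of grouping columns. ROLLUP first computes the standard aggregate for the full GROUP BY list, then builds higher-level subtotals by dropping grouping columns from right to left, ending with the grand total.
For example:
GROUP BY a,b,c WITH ROLLUP
is equivalent to: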
GROUP BY a,b,c GROUPING SETS ((a,b,c),(a,b),(a),())
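The CUBE statement takes a set of grouping columns and creates aggregations for all of their possible combinations. If n columns are specified for CUBE, 2^n combinations of aggregations are returned, as shown below:
GROUP BY a,b,c WITH CUBE
is equivalent to: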
GROUP BY a,b,c GROUPING SETS ((a,b,c),(a,b),(b,c),(a,c),(a),(b),(c),())
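The n+1 and 2^n counts are easy to verify by enumerating the grouping sets that each keyword expands to. The Python helpers below are hypothetical names written for this post; they only list the sets and do not run any aggregation.

```python
from itertools import combinations

def rollup_sets(cols):
    """Grouping sets for WITH ROLLUP: drop columns from right to left."""
    return [tuple(cols[:i]) for i in range(len(cols), -1, -1)]

def cube_sets(cols):
    """Grouping sets for WITH CUBE: every combination of the columns."""
    return [c for r in range(len(cols), -1, -1) for c in combinations(cols, r)]

cols = ["a", "b", "c"]
rollup = rollup_sets(cols)
cube = cube_sets(cols)
print(len(rollup), rollup)   # n + 1 sets
print(len(cube), cube)       # 2 ** n sets
```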
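The GROUPING__ID function works as an extension that distinguishes the rows of the different grouping sets from each other. It takes the columns specified after GROUP BY and returns the decimal equivalent of a bit vector with one bit per column. In the output below, a bit is 1 when the column takes part in the grouping for that row (its value is not NULL) and 0 when the column has been aggregated away (its value is NULL). Columns are counted starting from the one nearest to GROUP BY, which maps to the lowest-order bit; in the example below that is start_date, followed by name. Hive 2.3.0 changed how GROUPING__ID is computed, so later releases return different values for the same query.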
jdbc:hive2://> SELECT GROUPING__ID,
. . . . . . .> BIN(CAST(GROUPING__ID AS BIGINT)) AS bit_vector,
. . . . . . .> name, start_date, count(employee_id) emp_id_cnt
. . . . . . .> FROM employee_hr
. . . . . . .> GROUP BY start_date, name
. . . . . . .> WITH CUBE ORDER BY start_date;
+---------------+-------------+----------+-------------+------------+
| grouping__id | bit_vector | name | start_date | emp_id_cnt |
+---------------+-------------+----------+-------------+------------+
| 2 | 10 | Steven | NULL | 1 |
| 2 | 10 | Michael | NULL | 1 |
| 2 | 10 | Lucy | NULL | 1 |
| 0 | 0 | NULL | NULL | 4 |
| 2 | 10 | Will | NULL | 1 |
| 3 | 11 | Lucy | 2010-01-03 | 1 |
| 1 | 1 | NULL | 2010-01-03 | 1 |
| 1 | 1 | NULL | 2012-11-03 | 1 |
| 3 | 11 | Steven | 2012-11-03 | 1 |
| 1 | 1 | NULL | 2013-10-02 | 1 |
| 3 | 11 | Will | 2013-10-02 | 1 |
| 1 | 1 | NULL | 2014-01-29 | 1 |
| 3 | 11 | Michael | 2014-01-29 | 1 |
+---------------+-------------+----------+-------------+------------+
13 rows selected (136.708 seconds)
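The GROUPING__ID values in this output can be reproduced by hand. The sketch below follows the convention visible in the result set above, where bit i is set when the i-th column after GROUP BY is part of the row's grouping set. The function name is hypothetical, and the convention is an assumption read off this particular output, so it should not be taken as the rule for every Hive release.

```python
def grouping_id(group_by_cols, grouping_set):
    """Bit i is 1 when the i-th GROUP BY column is part of the grouping set."""
    return sum(1 << i for i, col in enumerate(group_by_cols)
               if col in grouping_set)

cols = ["start_date", "name"]  # order as written after GROUP BY
ids = {}
for gset in [("start_date", "name"), ("name",), ("start_date",), ()]:
    gid = grouping_id(cols, gset)
    ids[gset] = (gid, format(gid, "b"))  # decimal value and its bit vector
    print(gid, format(gid, "b"), gset)
```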
Since Hive 0.7.0, HAVING is supported for filtering the result set of a GROUP BY on a condition. With HAVING, we avoid wrapping the GROUP BY in a subquery. For example:
jdbc:hive2://> SELECT sex_age.age FROM employee
. . . . . . .> GROUP BY sex_age.age HAVING count(*)<=1;
+--------------+
| sex_age.age |
+--------------+
| 57 |
| 27 |
| 35 |
+--------------+
3 rows selected (74.376 seconds)
Without HAVING, we can get the same result with a subquery:
jdbc:hive2://> SELECT a.age
. . . . . . .> FROM
. . . . . . .> (SELECT count(*) as cnt, sex_age.age
. . . . . . .> FROM employee GROUP BY sex_age.age
. . . . . . .> ) a WHERE a.cnt<=1;
+--------+
| a.age |
+--------+
| 57 |
| 27 |
| 35 |
+--------+
3 rows selected (87.298 seconds)
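Both forms filter on the aggregated count, so they should always agree. The sketch below checks this on a small table with Python's sqlite3 module standing in for Hive. The flat `age` column and the names are assumptions for illustration, with ages chosen to match the results above.

```python
import sqlite3

# SQLite stands in for Hive; table layout and names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("A", 30), ("B", 35), ("C", 57), ("D", 27), ("E", 30)],
)

# Filter groups directly with HAVING.
with_having = sorted(r[0] for r in conn.execute(
    "SELECT age FROM employee GROUP BY age HAVING count(*) <= 1"
))

# Same filter expressed as a subquery plus WHERE.
with_subquery = sorted(r[0] for r in conn.execute(
    "SELECT a.age FROM "
    "(SELECT count(*) AS cnt, age FROM employee GROUP BY age) a "
    "WHERE a.cnt <= 1"
))

print(with_having, with_subquery)
```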