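These analytic features (GROUPING SETS, GROUPING__ID, CUBE, and ROLLUP) are typically used in OLAP to compute metrics that cannot simply be summed across groups, such as distinct user counts, and that have to be reported while drilling up and down across different dimensions.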
Environment:
Hive version: apache-hive-0.14.0-bin
Hadoop version: hadoop-2.6.0
Tez version: tez-0.7.0
Data:
2016-03,2016-03-10,user1
2016-03,2016-03-10,user5
2016-03,2016-03-12,user7
2016-04,2016-04-12,user3
2016-04,2016-04-13,user2
2016-04,2016-04-13,user4
2016-04,2016-04-16,user4
2016-03,2016-03-10,user2
2016-03,2016-03-10,user3
2016-04,2016-04-12,user5
2016-04,2016-04-13,user6
2016-04,2016-04-15,user3
2016-04,2016-04-15,user2
2016-04,2016-04-16,user1
Create the table:
CREATE TABLE windows_gcr (
op_month STRING,
op_day STRING,
userno STRING
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile;
Load the data into the Hive table:
load data local inpath '/home/hadoop/testhivedata/windows_gcr.txt' into table windows_gcr;
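To see why a metric such as a distinct user count cannot be summed across groups, here is a minimal Python sketch over the same 14 sample rows. It is only an illustration outside Hive; the names `ROWS` and `distinct_users` are made up for this example.

```python
# Sample rows from windows_gcr.txt: (op_month, op_day, userno)
ROWS = [
    ("2016-03", "2016-03-10", "user1"), ("2016-03", "2016-03-10", "user5"),
    ("2016-03", "2016-03-12", "user7"), ("2016-04", "2016-04-12", "user3"),
    ("2016-04", "2016-04-13", "user2"), ("2016-04", "2016-04-13", "user4"),
    ("2016-04", "2016-04-16", "user4"), ("2016-03", "2016-03-10", "user2"),
    ("2016-03", "2016-03-10", "user3"), ("2016-04", "2016-04-12", "user5"),
    ("2016-04", "2016-04-13", "user6"), ("2016-04", "2016-04-15", "user3"),
    ("2016-04", "2016-04-15", "user2"), ("2016-04", "2016-04-16", "user1"),
]


def distinct_users(rows):
    """Count distinct userno values, like COUNT(DISTINCT userno)."""
    return len({userno for _, _, userno in rows})


# Distinct users per month (hypothetical helper dict for this sketch).
per_month = {}
for month in sorted({row[0] for row in ROWS}):
    per_month[month] = distinct_users([row for row in ROWS if row[0] == month])

total = distinct_users(ROWS)

print(per_month)                # {'2016-03': 5, '2016-04': 6}
print(sum(per_month.values()))  # 11
print(total)                    # 7
```

The monthly counts add up to 11, but only 7 distinct users exist overall, because several users are active in both months. Each aggregation level therefore has to be computed from the detail rows, which is what the features below do in a single query.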
1. GROUPING SETS
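Within a single GROUP BY query, GROUPING SETS aggregates over several combinations of dimensions. It is equivalent to running a separate GROUP BY for each combination and combining the result sets with UNION ALL.
Aggregate by op_month and by op_day: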
SELECT
op_month,
op_day,
COUNT(DISTINCT userno) AS userno_count,
GROUPING__ID
FROM windows_gcr
GROUP BY op_month,op_day
GROUPING SETS (op_month,op_day)
ORDER BY GROUPING__ID;
Result:
op_month op_day userno_count grouping__id
2016-03 NULL 5 1
2016-04 NULL 6 1
NULL 2016-03-10 4 2
NULL 2016-03-12 1 2
NULL 2016-04-12 2 2
NULL 2016-04-13 3 2
NULL 2016-04-15 2 2
NULL 2016-04-16 2 2
Here GROUPING__ID indicates which grouping set each result row belongs to.
This is equivalent to:
SELECT op_month,NULL AS op_day,COUNT(DISTINCT userno) AS userno_count,1 AS GROUPING__ID FROM windows_gcr GROUP BY op_month
UNION ALL
SELECT NULL AS op_month,op_day,COUNT(DISTINCT userno) AS userno_count,2 AS GROUPING__ID FROM windows_gcr GROUP BY op_day;
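The "one GROUP BY per grouping set, then UNION ALL" idea can be sketched in plain Python as a way to cross-check the result of the first query. This is an emulation, not how Hive executes the query; `grouping_sets`, `ROWS`, and `DIMS` are hypothetical names, and the ID rule (bit i set when the i-th GROUP BY column is in the set) is an assumption inferred from the Hive 0.14 output shown above.

```python
from collections import defaultdict

# Sample rows from windows_gcr.txt: (op_month, op_day, userno)
ROWS = [
    ("2016-03", "2016-03-10", "user1"), ("2016-03", "2016-03-10", "user5"),
    ("2016-03", "2016-03-12", "user7"), ("2016-04", "2016-04-12", "user3"),
    ("2016-04", "2016-04-13", "user2"), ("2016-04", "2016-04-13", "user4"),
    ("2016-04", "2016-04-16", "user4"), ("2016-03", "2016-03-10", "user2"),
    ("2016-03", "2016-03-10", "user3"), ("2016-04", "2016-04-12", "user5"),
    ("2016-04", "2016-04-13", "user6"), ("2016-04", "2016-04-15", "user3"),
    ("2016-04", "2016-04-15", "user2"), ("2016-04", "2016-04-16", "user1"),
]
DIMS = ("op_month", "op_day")  # GROUP BY column order


def grouping_sets(rows, sets):
    """Emulate GROUP BY ... GROUPING SETS with COUNT(DISTINCT userno)."""
    out = []
    for cols in sets:
        groups = defaultdict(set)
        for row in rows:
            # Columns outside the current set become None (NULL in Hive).
            key = tuple(value if name in cols else None
                        for name, value in zip(DIMS, row))
            groups[key].add(row[2])
        # Assumed ID rule: bit i is set when the i-th GROUP BY column is kept.
        gid = sum(1 << i for i, name in enumerate(DIMS) if name in cols)
        for key, users in groups.items():
            out.append(key + (len(users), gid))
    # Sort by ID, then by the dimension values (None sorts as empty string).
    return sorted(out, key=lambda r: (r[3], r[0] or "", r[1] or ""))


result = grouping_sets(ROWS, [("op_month",), ("op_day",)])
for r in result:
    print(r)
```

The printed tuples carry the same counts and IDs as the Hive result above; only the row order within one ID may differ, since the Hive query orders by GROUPING__ID alone.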
Another example:
SELECT
op_month,
op_day,
COUNT(DISTINCT userno) AS userno_count,
GROUPING__ID
FROM windows_gcr
GROUP BY op_month,op_day
GROUPING SETS(op_month,op_day,(op_month,op_day))
ORDER BY GROUPING__ID;
Result:
op_month op_day userno_count grouping__id
2016-04 NULL 6 1
2016-03 NULL 5 1
NULL 2016-03-12 1 2
NULL 2016-04-12 2 2
NULL 2016-04-13 3 2
NULL 2016-04-15 2 2
NULL 2016-04-16 2 2
NULL 2016-03-10 4 2
2016-03 2016-03-10 4 3
2016-03 2016-03-12 1 3
2016-04 2016-04-12 2 3
2016-04 2016-04-13 3 3
2016-04 2016-04-15 2 3
2016-04 2016-04-16 2 3
This is equivalent to:
SELECT op_month,NULL AS op_day,COUNT(DISTINCT userno) AS userno_count,1 AS GROUPING__ID FROM windows_gcr GROUP BY op_month
UNION ALL
SELECT NULL AS op_month,op_day,COUNT(DISTINCT userno) AS userno_count,2 AS GROUPING__ID FROM windows_gcr GROUP BY op_day
UNION ALL
SELECT op_month,op_day,COUNT(DISTINCT userno) AS userno_count,3 AS GROUPING__ID FROM windows_gcr GROUP BY op_month,op_day;
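In the two results above, GROUPING__ID behaves like a bitmask over the GROUP BY columns: op_month contributes 1, op_day contributes 2, and a row grouped by both gets 3. A small Python sketch of that reading follows. The rule is inferred from the output of this Hive 0.14 environment, and `columns_in_group` is a hypothetical helper; later Hive releases changed how GROUPING__ID is computed, so check the documentation for your version before relying on it.

```python
DIMS = ("op_month", "op_day")  # GROUP BY column order used in the queries


def columns_in_group(grouping_id, dims=DIMS):
    """Return the GROUP BY columns whose bit is set in grouping_id.

    Assumes bit i corresponds to the i-th GROUP BY column, as observed
    in the Hive 0.14 results in this article.
    """
    return tuple(name for i, name in enumerate(dims) if grouping_id & (1 << i))


for gid in (1, 2, 3):
    print(gid, columns_in_group(gid))
# 1 ('op_month',)
# 2 ('op_day',)
# 3 ('op_month', 'op_day')
```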
2. CUBE
CUBE aggregates over every combination of the GROUP BY dimensions.
SELECT
op_month,
op_day,
COUNT(DISTINCT userno) AS userno_count,
GROUPING__ID
FROM windows_gcr
GROUP BY op_month,op_day
WITH CUBE
ORDER BY GROUPING__ID;
Result:
op_month op_day userno_count grouping__id
NULL NULL 7 0
2016-03 NULL 5 1
2016-04 NULL 6 1
NULL 2016-04-12 2 2
NULL 2016-04-13 3 2
NULL 2016-04-15 2 2
NULL 2016-04-16 2 2
NULL 2016-03-10 4 2
NULL 2016-03-12 1 2
2016-03 2016-03-10 4 3
2016-03 2016-03-12 1 3
2016-04 2016-04-16 2 3
2016-04 2016-04-12 2 3
2016-04 2016-04-13 3 3
2016-04 2016-04-15 2 3
This is equivalent to:
SELECT NULL AS op_month,NULL AS op_day,COUNT(DISTINCT userno) AS userno_count,0 AS GROUPING__ID FROM windows_gcr
UNION ALL
SELECT op_month,NULL AS op_day,COUNT(DISTINCT userno) AS userno_count,1 AS GROUPING__ID FROM windows_gcr GROUP BY op_month
UNION ALL
SELECT NULL AS op_month,op_day,COUNT(DISTINCT userno) AS userno_count,2 AS GROUPING__ID FROM windows_gcr GROUP BY op_day
UNION ALL
SELECT op_month,op_day,COUNT(DISTINCT userno) AS userno_count,3 AS GROUPING__ID FROM windows_gcr GROUP BY op_month,op_day;
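"Every combination of the dimensions" means every subset of the GROUP BY columns, including the empty one for the grand total. The following Python sketch enumerates those subsets; `cube_sets` is a hypothetical helper used only to illustrate the expansion, not part of Hive.

```python
from itertools import combinations


def cube_sets(dims):
    """All subsets of dims, i.e. the grouping sets WITH CUBE expands to."""
    return [combo
            for size in range(len(dims) + 1)
            for combo in combinations(dims, size)]


sets = cube_sets(("op_month", "op_day"))
print(sets)
# [(), ('op_month',), ('op_day',), ('op_month', 'op_day')]

# The number of grouping sets doubles with each added dimension.
print(len(cube_sets(("a", "b", "c"))))  # 8
```

The four subsets correspond one to one to the four SELECT statements in the UNION ALL above. Because n dimensions produce 2^n grouping sets, CUBE over many columns can get expensive.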
3. ROLLUP
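ROLLUP produces a subset of what CUBE produces: it aggregates hierarchically, starting from the leftmost dimension.
For example, hierarchical aggregation led by the op_month dimension: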
SELECT
op_month,
op_day,
COUNT(DISTINCT userno) AS userno_count,
GROUPING__ID
FROM windows_gcr
GROUP BY op_month,op_day
WITH ROLLUP
ORDER BY GROUPING__ID;
Result:
op_month op_day userno_count grouping__id
NULL NULL 7 0
2016-03 NULL 5 1
2016-04 NULL 6 1
2016-03 2016-03-10 4 3
2016-03 2016-03-12 1 3
2016-04 2016-04-12 2 3
2016-04 2016-04-13 3 3
2016-04 2016-04-15 2 3
2016-04 2016-04-16 2 3
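This supports the following drill-up path:
users per month and day -> users per month -> total users
-- Swapping the order of op_month and op_day makes op_day the leading dimension of the hierarchy: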
SELECT
op_month,
op_day,
COUNT(DISTINCT userno) AS userno_count,
GROUPING__ID
FROM windows_gcr
GROUP BY op_day,op_month
WITH ROLLUP
ORDER BY GROUPING__ID;
Result:
op_month op_day userno_count grouping__id
NULL NULL 7 0
NULL 2016-04-13 3 1
NULL 2016-03-12 1 1
NULL 2016-04-15 2 1
NULL 2016-03-10 4 1
NULL 2016-04-16 2 1
NULL 2016-04-12 2 1
2016-04 2016-04-12 2 3
2016-03 2016-03-10 4 3
2016-03 2016-03-12 1 3
2016-04 2016-04-13 3 3
2016-04 2016-04-15 2 3
2016-04 2016-04-16 2 3
This supports the following drill-up path:
UV per day and month -> UV per day -> total UV
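The effect of column order on ROLLUP can be sketched by listing the grouping sets it expands to, which are the prefixes of the GROUP BY column list. `rollup_sets` below is a hypothetical Python helper for illustration only.

```python
def rollup_sets(dims):
    """Prefixes of dims, longest first: the grouping sets WITH ROLLUP expands to."""
    return [tuple(dims[:n]) for n in range(len(dims), -1, -1)]


print(rollup_sets(("op_month", "op_day")))
# [('op_month', 'op_day'), ('op_month',), ()]

print(rollup_sets(("op_day", "op_month")))
# [('op_day', 'op_month'), ('op_day',), ()]
```

With op_month first, the middle level is the per-month total; with op_day first, it is the per-day total. In both cases the set containing only the second column is missing, which is why ROLLUP returns fewer rows than CUBE.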
For details, see the official Hive documentation:
https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C+Grouping+and+Rollup