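This post introduces three advanced aggregation features in Hive: grouping sets, cube, and rollup.
grouping sets aggregates over several dimension combinations within a single group by query. It is equivalent to running a separate group by for each combination and combining the result sets with union all.
The following example demonstrates this:
Create a file named test.txt and enter the three columns of data below, separated by spaces. The first column is the user's phone platform (either ios or android), the second is the app version number, and the third is the user ID.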
[root@hadoop ~]# vim test.txt
ios 1.1 1
ios 1.1 2
ios 1.2 3
android 1.1 4
android 1.1 5
android 1.2 6
Create the table temp_test8 in Hive, load the data from test.txt into it, and check the contents.
CREATE TABLE temp_test8 (
platform STRING comment 'platform'
,version STRING comment 'app version'
,uid STRING comment 'user ID'
) row format delimited fields terminated BY ' ';
load data local inpath '/root/test.txt' into table temp_test8;
select * from temp_test8;
temp_test8.platform temp_test8.version temp_test8.uid
ios 1.1 1
ios 1.1 2
ios 1.2 3
android 1.1 4
android 1.1 5
android 1.2 6
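Suppose we want the number of users per platform, per version, and per platform-version combination, all at the same time. As covered in earlier posts, we could run a separate group by for each set of dimensions. Advanced aggregation lets a single query aggregate over several sets of dimensions at once. The examples below start with grouping sets.
Count the users per platform and the users per version.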
SELECT platform
,version
,COUNT(uid) AS uv
FROM temp_test8
GROUP BY platform
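,version GROUPING SETS(platform, version);--any number of dimension columns can be listed here; each one is aggregated separately, and the other columns are returned as NULL
This is equivalent to: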
SELECT platform
,NULL AS version
,COUNT(uid) AS uv
FROM temp_test8
GROUP BY platform
UNION ALL
SELECT NULL AS platform
,version
,COUNT(uid) AS uv
FROM temp_test8
GROUP BY version;
Result:
platform version uv
NULL 1.1 4
NULL 1.2 2
android NULL 3
ios NULL 3
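The "one group by per grouping set, then union all" idea can be modeled outside Hive. The following is a minimal Python sketch, not Hive code: the helper name `grouping_sets` and the use of `None` to stand in for NULL are assumptions made for illustration only.

```python
from collections import Counter

# Rows from test.txt: (platform, version, uid)
ROWS = [
    ("ios", "1.1", "1"),
    ("ios", "1.1", "2"),
    ("ios", "1.2", "3"),
    ("android", "1.1", "4"),
    ("android", "1.1", "5"),
    ("android", "1.2", "6"),
]
COLUMNS = ("platform", "version")


def grouping_sets(rows, columns, sets):
    """Model GROUPING SETS: one GROUP BY per set, results concatenated.

    Columns that are not part of the current grouping set are emitted as
    None, which plays the role of NULL in the Hive output.
    """
    result = []
    for grouping_set in sets:
        counts = Counter()
        for row in rows:
            # zip stops after the two dimension columns, so uid is skipped.
            key = tuple(
                value if name in grouping_set else None
                for name, value in zip(columns, row)
            )
            counts[key] += 1  # COUNT(uid): every uid here is non-NULL
        result.extend(key + (uv,) for key, uv in counts.items())
    return result


result = grouping_sets(ROWS, COLUMNS, [("platform",), ("version",)])
for row in sorted(result, key=str):
    print(row)
```

The four rows printed correspond to the four rows in the Hive result above; only the row order may differ.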
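Count the users per platform and per version, and use grouping__id to show which grouping set each result row belongs to. Note that the grouping__id values in this post follow the numbering used before Hive 2.3.0. Hive 2.3.0 changed how the value is computed, so newer versions can report different numbers for the same query.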
SELECT platform
,version
,COUNT(uid) AS uv
,grouping__id --identifies the grouping set the row belongs to; it is a virtual column provided by Hive, not a column of the table
FROM temp_test8
GROUP BY platform
,version GROUPING SETS(platform, version)
ORDER BY grouping__id;
This is equivalent to:
SELECT platform
,NULL AS version
,COUNT(uid) AS uv
,1 as grouping__id
FROM temp_test8
GROUP BY platform
UNION ALL
SELECT NULL AS platform
,version
,COUNT(uid) AS uv
,2 as grouping__id
FROM temp_test8
GROUP BY version;
Result:
platform version uv grouping__id
ios NULL 3 1
android NULL 3 1
NULL 1.2 2 2
NULL 1.1 4 2
Count the users per platform, per version, and per platform-version combination, again using grouping__id to show which grouping set each result row belongs to.
SELECT platform
,version
,COUNT(uid) AS uv
,GROUPING__ID
FROM temp_test8
GROUP BY platform
,version GROUPING SETS(platform, version,(platform, version))--wrap any number of dimension columns in parentheses to aggregate over that combination
ORDER BY GROUPING__ID;
This is equivalent to:
SELECT platform
,NULL AS version
,COUNT(uid) AS uv
,1 as grouping__id
FROM temp_test8
GROUP BY platform
UNION ALL
SELECT NULL AS platform
,version
,COUNT(uid) AS uv
,2 as grouping__id
FROM temp_test8
GROUP BY version
UNION ALL
SELECT platform
,version
,COUNT(uid) AS uv
,3 as grouping__id
FROM temp_test8
GROUP BY platform
,version;
Result:
platform version uv grouping__id
ios NULL 3 1
android NULL 3 1
NULL 1.2 2 2
NULL 1.1 4 2
ios 1.2 1 3
ios 1.1 2 3
android 1.2 1 3
android 1.1 2 3
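The grouping__id values 1, 2, and 3 above can be read as a bitmask over the group by columns. The Python sketch below models two numbering schemes. `grouping_id_legacy` reproduces the values shown in this post. `grouping_id_hive_2_3` reflects my understanding of the numbering documented for Hive 2.3.0 and later; treat it as an assumption and check it against your own Hive version. Both function names are made up for illustration.

```python
COLUMNS = ("platform", "version")  # order of the GROUP BY columns


def grouping_id_legacy(columns, grouping_set):
    """Numbering matching this post: the first GROUP BY column is the lowest
    bit, and a bit is 1 when the column IS part of the grouping set."""
    return sum(1 << i for i, name in enumerate(columns) if name in grouping_set)


def grouping_id_hive_2_3(columns, grouping_set):
    """Assumed Hive 2.3.0+ numbering: the first GROUP BY column is the highest
    bit, and a bit is 1 when the column is NOT part of the grouping set."""
    n = len(columns)
    return sum(
        1 << (n - 1 - i)
        for i, name in enumerate(columns)
        if name not in grouping_set
    )


for grouping_set in [(), ("platform",), ("version",), ("platform", "version")]:
    print(
        grouping_set,
        grouping_id_legacy(COLUMNS, grouping_set),
        grouping_id_hive_2_3(COLUMNS, grouping_set),
    )
```

With two columns, the single-column sets get the same number under both schemes. Only the full set (platform, version) and the empty set () swap between 3 and 0. Sorting or filtering on grouping__id is therefore the part most likely to change after an upgrade.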
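cube aggregates over every combination of the group by dimensions.
The last grouping sets example above can be written more simply with cube, which also adds the grand total. When there are many dimensions and every combination of them must be aggregated, cube is the better choice.
Example:
Count the users per platform, per version, per platform-version combination, and overall with no dimensions. In other words, aggregate over every combination of the dimensions.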
SELECT platform
,version
,COUNT(uid) AS uv
,GROUPING__ID
FROM temp_test8
GROUP BY platform
,version
WITH CUBE
ORDER BY GROUPING__ID;
This is equivalent to:
SELECT platform
,version
,COUNT(uid) AS uv
,GROUPING__ID
FROM temp_test8
GROUP BY platform
,version GROUPING SETS(platform, version,(platform, version),())--columns wrapped in parentheses are aggregated as one combination; empty parentheses mean aggregating over all rows with no dimensions
ORDER BY GROUPING__ID;
This in turn is equivalent to:
SELECT NULL AS platform
,NULL AS version
,COUNT(uid) AS uv
,0 as grouping__id
FROM temp_test8
UNION ALL
SELECT platform
,NULL AS version
,COUNT(uid) AS uv
,1 as grouping__id
FROM temp_test8
GROUP BY platform
UNION ALL
SELECT NULL AS platform
,version
,COUNT(uid) AS uv
,2 as grouping__id
FROM temp_test8
GROUP BY version
UNION ALL
SELECT platform
,version
,COUNT(uid) AS uv
,3 as grouping__id
FROM temp_test8
GROUP BY platform
,version;
Result:
platform version uv grouping__id
NULL NULL 6 0
ios NULL 3 1
android NULL 3 1
NULL 1.2 2 2
NULL 1.1 4 2
ios 1.2 1 3
ios 1.1 2 3
android 1.2 1 3
android 1.1 2 3
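To see why cube grows quickly with the number of dimensions, it helps to enumerate the grouping sets it stands for. The Python sketch below lists every subset of the group by columns. The helper name `cube_grouping_sets` is hypothetical, and the code only models the enumeration, not Hive's execution.

```python
from itertools import combinations


def cube_grouping_sets(columns):
    """All subsets of the GROUP BY columns, from () up to the full list."""
    return [
        subset
        for size in range(len(columns) + 1)
        for subset in combinations(columns, size)
    ]


sets = cube_grouping_sets(("platform", "version"))
print(sets)

# n dimensions produce 2**n grouping sets, e.g. 8 for three dimensions.
print(len(cube_grouping_sets(("a", "b", "c"))))
```

For the two dimensions used here, the enumeration yields the same four sets as the GROUPING SETS form above: (), platform, version, and (platform, version).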
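rollup produces a subset of what cube produces: it aggregates hierarchically, starting from the leftmost dimension.
Example:
Aggregate downward in the order total -> platform -> platform, version. Count the total number of users, the users per platform, and the users per platform-version combination.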
SELECT platform
,version
,COUNT(uid) AS uv
,GROUPING__ID
FROM temp_test8
GROUP BY platform
,version --rollup drills down following the column order after group by
WITH ROLLUP
ORDER BY GROUPING__ID;
This is equivalent to:
SELECT NULL AS platform
,NULL AS version
,COUNT(uid) AS uv
,0 as grouping__id
FROM temp_test8
UNION ALL
SELECT platform
,NULL AS version
,COUNT(uid) AS uv
,1 as grouping__id
FROM temp_test8
GROUP BY platform
UNION ALL
SELECT platform
,version
,COUNT(uid) AS uv
,3 as grouping__id
FROM temp_test8
GROUP BY platform
,version;
Result:
platform version uv grouping__id
NULL NULL 6 0
ios NULL 3 1
android NULL 3 1
ios 1.2 1 3
ios 1.1 2 3
android 1.2 1 3
android 1.1 2 3
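Aggregate downward in the order total -> version -> version, platform. Count the total number of users, the users per version, and the users per platform within each version.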
SELECT version
,platform
,COUNT(uid) AS uv
,GROUPING__ID
FROM temp_test8
GROUP BY version
,platform --rollup drills down following the column order after group by
WITH ROLLUP
ORDER BY GROUPING__ID;
This is equivalent to:
SELECT NULL as version
,NULL as platform
,COUNT(uid) AS uv
,0 as grouping__id
FROM temp_test8
UNION ALL
SELECT version
,NULL as platform
,COUNT(uid) AS uv
,1 as grouping__id
FROM temp_test8
GROUP BY version
UNION ALL
SELECT version
,platform
,COUNT(uid) AS uv
,3 as grouping__id
FROM temp_test8
GROUP BY version
,platform;
Result:
version platform uv grouping__id
NULL NULL 6 0
1.2 NULL 2 1
1.1 NULL 4 1
1.2 ios 1 3
1.2 android 1 3
1.1 ios 2 3
1.1 android 2 3
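The two rollup examples differ only in the column order after group by. A short Python sketch makes that dependence explicit by listing the grouping sets rollup stands for, namely the prefixes of the column list. The helper name `rollup_grouping_sets` is hypothetical and models only the enumeration.

```python
def rollup_grouping_sets(columns):
    """Prefixes of the GROUP BY columns, from () up to the full list."""
    return [tuple(columns[:size]) for size in range(len(columns) + 1)]


by_platform = rollup_grouping_sets(("platform", "version"))
by_version = rollup_grouping_sets(("version", "platform"))
print(by_platform)
print(by_version)
```

For n dimensions rollup yields n + 1 grouping sets, compared with 2**n for cube. Swapping the column order changes which intermediate level appears: platform in the first example, version in the second.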