groupByClause: GROUP BY groupByExpression (, groupByExpression)*
groupByExpression: expression
groupByQuery: SELECT expression (, expression)* FROM src groupByClause?
Multi-Group-By Inserts
Map-Side Aggregation for Group By
GROUPING SETS
CUBE
ROLLUP
GROUPING__ID
Grouping function
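By default, the GROUP BY clause refers to columns by name. Starting with Hive 0.11.0, columns can also be referenced by position number: for Hive 0.11.0 through 2.1.x, set hive.groupby.orderby.position.alias = true (the default is false); for Hive 2.2.0 and later, set hive.groupby.position.alias = true (the default is false).
Counting rows:
SELECT COUNT(*) FROM table2;
You can also use COUNT(1) in place of COUNT(*).
To count the distinct users in each group (here, per gender), write: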
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count (DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;
INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(*), sum(DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;
INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT pv_users.ip)
FROM pv_users
GROUP BY pv_users.gender;
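Select Statement and Group By Clause
When you use GROUP BY, the SELECT list may contain only the columns named in the GROUP BY clause. Any other column has to be wrapped in an aggregate function (a UDAF). (UDFs will be covered separately later.)
For example:
CREATE TABLE t1(a INTEGER, b INTEGER);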
SELECT
a,
sum(b)
FROM
t1
GROUP BY
a;
The following statement, however, does not work:
SELECT
a,
b
FROM
t1
GROUP BY
a;
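This is because the SELECT list references column b, which is not in the GROUP BY clause. Suppose the table (columns a and b) looks like this: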
100 1
100 2
100 3
After grouping by a, which value should b take? Hive could return the lowest or the highest value, but it does not take that approach. To use such columns, wrap them in a UDAF.
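To make the ambiguity concrete, here is a minimal Python sketch that emulates the two queries on the three sample rows. The function names are hypothetical and exist only for this illustration; this models the semantics and is not how Hive executes the query.

```python
from collections import defaultdict

# The three sample rows of t1(a, b) from the text.
rows = [(100, 1), (100, 2), (100, 3)]

def group_sum(rows):
    """Emulate SELECT a, sum(b) FROM t1 GROUP BY a."""
    totals = defaultdict(int)
    for a, b in rows:
        totals[a] += b
    return dict(totals)

def group_values(rows):
    """Collect every b seen for each a, to show why a bare b is ambiguous."""
    seen = defaultdict(list)
    for a, b in rows:
        seen[a].append(b)
    return dict(seen)

print(group_sum(rows))     # one well-defined value per group
print(group_values(rows))  # three candidate values for b in the same group
```

The aggregate collapses each group to a single value, while a bare b would have to pick one of several candidates.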
Multi-Group-By Inserts
Example:
FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count(DISTINCT pv_users.userid)
GROUP BY pv_users.gender
INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum'
SELECT pv_users.age, count(DISTINCT pv_users.userid)
GROUP BY pv_users.age;
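Map-Side Aggregation for Group By
Set hive.map.aggr=true to enable it. The default is given here as false; Hive's configuration reference lists the default as true from Hive 0.3 onward, so check your version.
Map-side aggregation improves execution efficiency at the cost of more memory.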
set hive.map.aggr=true;
SELECT COUNT(*) FROM table2;
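The idea behind map-side aggregation can be sketched in a few lines of Python. This is a rough model under simplifying assumptions (in-memory lists stand in for input splits, and the function names are hypothetical), not Hive's actual implementation: each mapper pre-aggregates its own split in a hash table, so fewer records reach the reducer, and that hash table is where the extra memory goes.

```python
from collections import Counter

def map_side_count(split):
    """Partial aggregation inside one mapper: count rows per key locally."""
    return Counter(split)

def reduce_counts(partials):
    """Reducer: merge the partial counts emitted by all mappers."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return dict(total)

# Two hypothetical input splits, each a list of group-by keys.
splits = [["m", "f", "m"], ["f", "f", "m"]]
partials = [map_side_count(s) for s in splits]
result = reduce_counts(partials)
print(result)
```

Without the map-side step, all six rows would be shuffled to the reducer; with it, each mapper emits only one record per distinct key.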
Grouping Sets, Cubes, Rollups, and the GROUPING__ID Function (Enhanced Aggregation)
This section covers the enhanced aggregation features of the GROUP BY clause.
GROUPING SETS
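GROUPING SETS is a sub-clause of GROUP BY that lets you specify several groupings after a single GROUP BY statement. It can be understood simply as several GROUP BY queries whose results are combined with UNION ALL.
Take the table acorn_3g.test_xinyan_reg as an example:
hive -e "use acorn_3g;desc test_xinyan_reg;"
user_id bigint None
device_id int None phone, tablet
os_id int None operating system type
app_id int None mobile app_id
client_version string None client version
from_id int None level-4 channel
The following demos illustrate this: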
### GROUPING SETS statements and their equivalent Hive statements
Demo 1, GROUPING SETS statement:
select device_id,os_id,app_id,count(user_id) from test_xinyan_reg group by device_id,os_id,app_id grouping sets((device_id));
Equivalent Hive statement:
SELECT device_id,null,null,count(user_id) FROM test_xinyan_reg group by device_id;
Demo 2, GROUPING SETS statement:
select device_id,os_id,app_id,count(user_id) from test_xinyan_reg group by device_id,os_id,app_id grouping sets((device_id,os_id),(os_id,app_id));
Equivalent Hive statement:
SELECT device_id,os_id,null,count(user_id) FROM test_xinyan_reg group by device_id,os_id
UNION ALL
SELECT null,os_id,app_id,count(user_id) FROM test_xinyan_reg group by os_id,app_id;
Demo 3, GROUPING SETS statement:
select device_id,os_id,app_id,count(user_id) from test_xinyan_reg group by device_id,os_id,app_id grouping sets((device_id,os_id),(device_id));
Equivalent Hive statement:
SELECT device_id,os_id,null,count(user_id) FROM test_xinyan_reg group by device_id,os_id
UNION ALL
SELECT device_id,null,null,count(user_id) FROM test_xinyan_reg group by device_id;
Demo 4, GROUPING SETS statement:
select device_id,os_id,app_id,count(user_id) from test_xinyan_reg group by device_id,os_id,app_id grouping sets((device_id),(os_id),(device_id,os_id),());
Equivalent Hive statement:
SELECT device_id,null,null,count(user_id) FROM test_xinyan_reg group by device_id
UNION ALL
SELECT null,os_id,null,count(user_id) FROM test_xinyan_reg group by os_id
UNION ALL
SELECT device_id,os_id,null,count(user_id) FROM test_xinyan_reg group by device_id,os_id
UNION ALL
SELECT null,null,null,count(user_id) FROM test_xinyan_reg;
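The "several GROUP BY queries combined with UNION ALL" reading can be checked with a small Python emulation. This is a sketch under stated assumptions: the three sample rows are invented for illustration, the function name is hypothetical, and the code models the semantics only, not Hive's execution plan. It also does not reproduce the single zero-count row that the empty grouping set returns on an empty table.

```python
from collections import Counter

COLUMNS = ("device_id", "os_id", "app_id")

def grouping_sets_count(rows, grouping_sets):
    """Emulate count(user_id) with GROUP BY device_id, os_id, app_id GROUPING SETS (...).

    Returns (device_id, os_id, app_id, count) tuples. Columns outside the
    current grouping set are reported as None, standing in for SQL NULL.
    """
    out = []
    for gset in grouping_sets:  # one GROUP BY pass per grouping set
        counts = Counter()
        for row in rows:
            key = tuple(row[c] if c in gset else None for c in COLUMNS)
            # count(col) ignores NULLs but the group itself still appears
            counts[key] += 0 if row["user_id"] is None else 1
        out.extend(key + (n,) for key, n in counts.items())  # UNION ALL
    return out

# Invented sample data, for illustration only.
rows = [
    {"user_id": 1, "device_id": 1, "os_id": 10, "app_id": 100},
    {"user_id": 2, "device_id": 1, "os_id": 11, "app_id": 100},
    {"user_id": 3, "device_id": 2, "os_id": 10, "app_id": 101},
]

result1 = grouping_sets_count(rows, [("device_id",)])
result2 = grouping_sets_count(rows, [("device_id", "os_id"), ()])
for line in result1 + result2:
    print(line)
```

Each grouping set contributes its own block of rows, and the empty set `()` contributes the grand total.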
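CUBE
CUBE, often called a "data cube", lets Hive aggregate over any combination of dimensions. cube(a,b,c) first groups by (a,b,c), then by (a,b), (a,c), (a), (b,c), (b), and (c), and finally aggregates over the whole table. In other words, it computes the aggregate for every combination of the selected columns.
select device_id,os_id,app_id,client_version,from_id,count(user_id) from test_xinyan_reg group by device_id,os_id,app_id,client_version,from_id with cube;
Implementing this by hand requires the following HQL (generated by a small program; writing it out manually would be exhausting):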
SELECT device_id,null,null,null,null ,count(user_id) FROM test_xinyan_reg group by device_id
UNION ALL
SELECT null,os_id,null,null,null ,count(user_id) FROM test_xinyan_reg group by os_id
UNION ALL
SELECT device_id,os_id,null,null,null ,count(user_id) FROM test_xinyan_reg group by device_id,os_id
UNION ALL
SELECT null,null,app_id,null,null ,count(user_id) FROM test_xinyan_reg group by app_id
UNION ALL
SELECT device_id,null,app_id,null,null ,count(user_id) FROM test_xinyan_reg group by device_id,app_id
UNION ALL
SELECT null,os_id,app_id,null,null ,count(user_id) FROM test_xinyan_reg group by os_id,app_id
UNION ALL
SELECT device_id,os_id,app_id,null,null ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,app_id
UNION ALL
SELECT null,null,null,client_version,null ,count(user_id) FROM test_xinyan_reg group by client_version
UNION ALL
SELECT device_id,null,null,client_version,null ,count(user_id) FROM test_xinyan_reg group by device_id,client_version
UNION ALL
SELECT null,os_id,null,client_version,null ,count(user_id) FROM test_xinyan_reg group by os_id,client_version
UNION ALL
SELECT device_id,os_id,null,client_version,null ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,client_version
UNION ALL
SELECT null,null,app_id,client_version,null ,count(user_id) FROM test_xinyan_reg group by app_id,client_version
UNION ALL
SELECT device_id,null,app_id,client_version,null ,count(user_id) FROM test_xinyan_reg group by device_id,app_id,client_version
UNION ALL
SELECT null,os_id,app_id,client_version,null ,count(user_id) FROM test_xinyan_reg group by os_id,app_id,client_version
UNION ALL
SELECT device_id,os_id,app_id,client_version,null ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,app_id,client_version
UNION ALL
SELECT null,null,null,null,from_id ,count(user_id) FROM test_xinyan_reg group by from_id
UNION ALL
SELECT device_id,null,null,null,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,from_id
UNION ALL
SELECT null,os_id,null,null,from_id ,count(user_id) FROM test_xinyan_reg group by os_id,from_id
UNION ALL
SELECT device_id,os_id,null,null,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,from_id
UNION ALL
SELECT null,null,app_id,null,from_id ,count(user_id) FROM test_xinyan_reg group by app_id,from_id
UNION ALL
SELECT device_id,null,app_id,null,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,app_id,from_id
UNION ALL
SELECT null,os_id,app_id,null,from_id ,count(user_id) FROM test_xinyan_reg group by os_id,app_id,from_id
UNION ALL
SELECT device_id,os_id,app_id,null,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,app_id,from_id
UNION ALL
SELECT null,null,null,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by client_version,from_id
UNION ALL
SELECT device_id,null,null,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,client_version,from_id
UNION ALL
SELECT null,os_id,null,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by os_id,client_version,from_id
UNION ALL
SELECT device_id,os_id,null,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,client_version,from_id
UNION ALL
SELECT null,null,app_id,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by app_id,client_version,from_id
UNION ALL
SELECT device_id,null,app_id,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,app_id,client_version,from_id
UNION ALL
SELECT null,os_id,app_id,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by os_id,app_id,client_version,from_id
UNION ALL
SELECT device_id,os_id,app_id,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,app_id,client_version,from_id
UNION ALL
SELECT null,null,null,null,null ,count(user_id) FROM test_xinyan_reg
Painful to look at, isn't it? That is how much work CUBE saves. (On older Hive versions the UNION ALL approach still works, as a workaround of last resort.)
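The text mentions that the UNION ALL query was generated by a program. That program is not shown, so the following is only a sketch of how such a generator might look; the function name and its parameters are hypothetical. It emits the same 32 SELECT statements, though in a different order from the listing above.

```python
from itertools import product

def cube_union_all(table, dims, agg):
    """Build a UNION ALL query that emulates GROUP BY ... WITH CUBE."""
    selects = []
    # Each dimension is either kept (True) or replaced by null (False),
    # which yields 2 ** len(dims) combinations.
    for mask in product([True, False], repeat=len(dims)):
        kept = [d for d, keep in zip(dims, mask) if keep]
        cols = [d if keep else "null" for d, keep in zip(dims, mask)]
        sql = f"SELECT {','.join(cols)},{agg} FROM {table}"
        if kept:  # the grand-total query has no GROUP BY clause
            sql += f" group by {','.join(kept)}"
        selects.append(sql)
    return "\nUNION ALL\n".join(selects)

dims = ["device_id", "os_id", "app_id", "client_version", "from_id"]
query = cube_union_all("test_xinyan_reg", dims, "count(user_id)")
print(query)
```

With five dimensions the generator produces 2^5 = 32 statements, which matches the hand-written listing.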
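ROLLUP
ROLLUP computes aggregates at successively coarser levels by dropping columns from right to left, which gives the aggregates for each level of a hierarchy.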
select device_id,os_id,app_id,client_version,from_id,count(user_id)
from test_xinyan_reg
group by device_id,os_id,app_id,client_version,from_id with rollup;
This is equivalent to the following SQL statement:
select device_id,os_id,app_id,client_version,from_id,count(user_id)
from test_xinyan_reg
group by device_id,os_id,app_id,client_version,from_id
grouping sets ((device_id,os_id,app_id,client_version,from_id),(device_id,os_id,app_id,client_version),(device_id,os_id,app_id),(device_id,os_id),(device_id),());
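The right-to-left expansion is mechanical enough to sketch in Python. The helper names below are hypothetical; the code simply lists every prefix of the column list, from the full list down to the empty set, and renders them as a GROUPING SETS clause.

```python
def rollup_grouping_sets(columns):
    """Return the grouping sets WITH ROLLUP expands to: every prefix of
    the column list, from the full list down to the empty set."""
    return [tuple(columns[:n]) for n in range(len(columns), -1, -1)]

def to_grouping_sets_clause(columns):
    """Render those sets as a GROUPING SETS clause."""
    sets = rollup_grouping_sets(columns)
    body = ",".join("(" + ",".join(s) + ")" for s in sets)
    return "grouping sets (" + body + ")"

columns = ["device_id", "os_id", "app_id", "client_version", "from_id"]
clause = to_grouping_sets_clause(columns)
print(clause)
```

For n columns ROLLUP yields n + 1 grouping sets, compared with 2^n for CUBE.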
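GROUPING__ID Function
When a column is not part of a given grouping, its value is shown as NULL, which can be confused with a NULL that is actually stored in the column. We therefore need a way to tell "not part of this grouping" apart from "the value really is NULL". (Write a small permutation-and-combination algorithm and it becomes clear at once: grouping_id is simply the binary encoding of the columns being aggregated.)
grouping_id computes the grouping level. To use it, the query must have a GROUP BY clause, and the columns in the GROUP BY clause must match the arguments of grouping_id; for example, with GROUP BY A, B you must use grouping_id(A, B). (In Hive the value is exposed as the GROUPING__ID virtual column, as in the query below.) The relationship between grouping_id() and grouping() can be stated as an equivalence: grouping_id(A, B) is equivalent to grouping(A) + grouping(B), where + is not arithmetic addition but concatenation of binary digits. For example, if grouping(A)=1 and grouping(B)=1, then grouping_id(A, B) is 11 in binary, that is, 3 in decimal. The original table data produced too many result rows for the SQL below to show the effect clearly, so I modified the table data; comparing the two result sets makes the effect obvious.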
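The "concatenate the bits" rule can be written out as a few lines of Python. This is a minimal sketch of the relationship described above, with a hypothetical function name; it takes the individual grouping() results in argument order and treats the leftmost one as the most significant bit.

```python
def grouping_id(*grouping_bits):
    """Concatenate grouping() bits; the leftmost argument is the most
    significant bit of the result."""
    value = 0
    for bit in grouping_bits:
        if bit not in (0, 1):
            raise ValueError("each grouping() result must be 0 or 1")
        value = (value << 1) | bit  # shift left, then append the new bit
    return value

# grouping(A)=1, grouping(B)=1 gives binary 11
print(grouping_id(1, 1))
# grouping(A)=1, grouping(B)=0 gives binary 10
print(grouping_id(1, 0))
```

Swapping the argument order changes the result whenever the bits differ, which is why grouping(key, value) and grouping(value, key) can return different numbers for the same row.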
Here is an example taken directly from the official documentation:
Column1 (key) Column2 (value)
1 NULL
1 1
2 2
3 3
3 NULL
4 5
The HQL query:
SELECT key, value, GROUPING__ID, count(*) from T1 GROUP BY key, value WITH ROLLUP;
The result is as follows:
key value GROUPING__ID (binary) count
NULL NULL 0 00 6
1 NULL 1 10 2
1 NULL 3 11 1
1 1 3 11 1
2 NULL 1 10 1
2 2 3 11 1
3 NULL 1 10 2
3 NULL 3 11 1
3 3 3 11 1
4 NULL 1 10 1
4 5 3 11 1
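Convert GROUPING__ID to binary: if a column's bit is set and the column nevertheless shows NULL, the NULL is the column's own value. (Scanning with the DataFilterNull.py class can filter out results whose columns are NULL or "".)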
Grouping function
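The grouping function distinguishes between two kinds of NULL: a NULL that is already present in the table data, and a NULL generated by ROLLUP, CUBE, or GROUPING SETS. For the first kind, grouping returns 0; for the second kind, grouping returns 1. An example follows. In the results you can see that, in the second result set, values shown as NULL for which grouping returns 1 are displayed as the string ROLLUP-NULL.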
SELECT key, value, GROUPING__ID,
grouping(key, value), grouping(value, key), grouping(key), grouping(value),
count(*)
FROM T1
GROUP BY key, value WITH ROLLUP;
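This query produces the following results. The columns are key, value, GROUPING__ID, grouping(key, value), grouping(value, key), grouping(key), grouping(value), and count(*). These values follow the convention of Hive 2.3.0 and later, in which a bit is 1 when the column is aggregated away, which is why the GROUPING__ID values differ from the earlier table.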
NULL NULL 3 3 3 1 1 6
1 NULL 0 0 0 0 0 2
1 NULL 1 1 2 0 1 1
1 1 0 0 0 0 0 1
2 NULL 1 1 2 0 1 1
2 2 0 0 0 0 0 1
3 NULL 0 0 0 0 0 2
3 NULL 1 1 2 0 1 1
3 3 0 0 0 0 0 1
4 NULL 1 1 2 0 1 1
4 5 0 0 0 0 0 1
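hive.new.job.grouping.set.cardinality
As you may have anticipated, operations such as GROUPING SETS, CUBE, or ROLLUP can run into problems when the cardinality is large, so this setting is provided as an optimization. It should be set higher than the number of combinations that the GROUPING SETS, CUBE, or ROLLUP generates.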
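To pick a value, it helps to know how many grouping sets a query generates. The following sketch uses hypothetical helper names and counts them for CUBE and ROLLUP; for an explicit GROUPING SETS clause the number is simply the count of sets listed.

```python
def cube_cardinality(num_columns):
    """CUBE generates every subset of the columns: 2 ** n grouping sets."""
    return 2 ** num_columns

def rollup_cardinality(num_columns):
    """ROLLUP generates every prefix of the column list: n + 1 grouping sets."""
    return num_columns + 1

# The five-column examples used earlier in this document.
print(cube_cardinality(5))
print(rollup_cardinality(5))
```

For the five-column examples above, this gives 32 grouping sets for CUBE and 6 for ROLLUP.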