Groupby语句,GroupBy高级特性

语法:

groupByClause: GROUP BY groupByExpression (, groupByExpression)*

groupByExpression: expression

groupByQuery: SELECT expression (, expression)* FROM srcgroupByClause?

高级使用:

多GroupBy 插入

Group By的Map-Side聚合

GROUPING SETS

CUBE

ROLL UP

Grouping_ID

Grouping function

Group By 语法

groupByClause: GROUP BY groupByExpression (, groupByExpression)* groupByExpression: expression

groupByQuery: SELECT expression (, expression)* FROM src groupByClause?

默认的:Group By子句是指定列名的,但在0.11之后可以通过指定列编号来写语句: 0.11.0 - 2.1.x, 设置 hive.groupby.orderby.position.alias = true (默认是 false). 2.2.0 以后, 设置hive.groupby.position.alias = true (默认是false).

简单例子

统计行数:

SELECT COUNT() FROM table2;
当然你也可以使用COUNT(1)来替换COUNT(
).
统计分组后,每组的个数,如下所示:

INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count (DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;

你可以这样:

INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(*), sum(DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;

但你不能这样:

INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT pv_users.ip)
FROM pv_users
GROUP BY pv_users.gender;

选择语句和groupby子句

当使用group by语句的时候,你在select中只能使用被group by的字段。想使用其他字段你得使用udaf。(UDF以后单独讲解)
比如下面的例子:

CREATE TABLE t1(a INTEGER, b INTGER);

SELECT
a,
sum(b)
FROM
t1
GROUP BY
a;

下面的语句就不行:

SELECT
a,
b
FROM
t1
GROUP BY
a;

这是因为在select语句中引用了非group by字段b,如果那张表像以下这样:

a b

100 1
100 2
100 3

对a字段进行group by之后,b的值应该给多少呢?尽管可以给一个最低值或者最高值,但是hive并没有采用这个方式。如果想使用这些字段你可以使用UDAF

高级特性

  • GroupBy插入

例子:

ROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count(DISTINCT pv_users.userid)
GROUP BY pv_users.gender
INSERT OVERWRITE DIRECTORY ‘/user/facebook/tmp/pv_age_sum’
SELECT pv_users.age, count(DISTINCT pv_users.userid)
GROUP BY pv_users.age;
Group By的Map-Side聚合
设置hive.map.aggr=true 来开启,默认值为false。
这样做可以提高执行效率,那么牺牲的是内存。
set hive.map.aggr=true;
SELECT COUNT(*) FROM table2;

rouping Sets, Cubes, Rollups, and the GROUPING__ID Function 高级聚合,Cube,分组,RollUp
该文档主要介绍了groupby子句的高级聚合特性

  • GROUPING SETS

GROUPING SETS作为GROUP BY的子句,允许开发人员在GROUP BY语句后面指定多个统计选项,可以简单理解为多条group by语句通过union all把查询结果聚合起来结合起来,下面是几个实例可以帮助我们了解,

以acorn_3g.test_xinyan_reg为例:

hive -e “use acorn_3g;desc test_xinyan_reg;”
user_id bigint None
device_id int None 手机,平板
os_id int None 操作系统类型
app_id int None 手机app_id
client_version string None 客户端版本
from_id int None 四级渠道
几个Demo帮助大家了解:

	### `		grouping sets语句	等价hive语句`

elect device_id,os_id,app_id,count(user_id) from test_xinyan_reg group by device_id,os_id,app_id grouping sets((device_id)) SELECT device_id,null,null,count(user_id) FROM test_xinyan_reg group by device_id
select device_id,os_id,app_id,count(user_id) from test_xinyan_reg group by device_id,os_id,app_id grouping sets((device_id,os_id),(os_id,app_id)) SELECT device_id,os_id,null,count(user_id) FROM test_xinyan_reg group by device_id,os_id

  • Union all

elect null,os_id,app_id,count(user_id) from
Test_xinyan_reg group by os_id,app_id;
select device_id,os_id,app_id,count(user_id) from test_xinyan_reg group by device_id,os_id,app_id grouping sets((device_id,os_id),(device_id)) SELECT device_id,os_id,null,count(user_id) FROM test_xinyan_reg group by device_id,os_id

  • UNION ALL

ELECT device_id,null,null,count(user_id) FROM test_xinyan_reg group by device_id
select device_id,os_id,app_id,count(user_id) from test_xinyan_reg group by device_id,os_id,app_id grouping sets((device_id),(os_id),(device_id,os_id),()) SELECT device_id,null,null,count(user_id) FROM test_xinyan_reg group by device_id
UNION ALL
SELECT null,os_id,null,count(user_id) FROM test_xinyan_reg group by os_id
UNION ALL
SELECT device_id,os_id,null,count(user_id) FROM test_xinyan_reg group by device_id,os_id
UNION ALL
SELECT null,null,null,count(user_id) FROM test_xinyan_reg

  • CUBE函数

cube简称数据魔方,可以实现hive多个任意维度的查询,cube(a,b,c)则首先会对(a,b,c)进行group by,然后依次是(a,b),(a,c),(a),(b,c),(b),(c),最后在对全表进行group by,他会统计所选列中值的所有组合的聚合 select device_id,os_id,app_id,client_version,from_id,count(user_id) from test_xinyan_reg group by device_id,os_id,app_id,client_version,from_id with cube; 手工实现需要写的hql语句(写个程序自己生成的,手写累死):

SELECT device_id,null,null,null,null ,count(user_id) FROM test_xinyan_reg group by device_id
UNION ALL
SELECT null,os_id,null,null,null ,count(user_id) FROM test_xinyan_reg group by os_id
UNION ALL
SELECT device_id,os_id,null,null,null ,count(user_id) FROM test_xinyan_reg group by device_id,os_id
UNION ALL
SELECT null,null,app_id,null,null ,count(user_id) FROM test_xinyan_reg group by app_id
UNION ALL
SELECT device_id,null,app_id,null,null ,count(user_id) FROM test_xinyan_reg group by device_id,app_id
UNION ALL
SELECT null,os_id,app_id,null,null ,count(user_id) FROM test_xinyan_reg group by os_id,app_id
UNION ALL
SELECT device_id,os_id,app_id,null,null ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,app_id
UNION ALL
SELECT null,null,null,client_version,null ,count(user_id) FROM test_xinyan_reg group by client_version
UNION ALL
SELECT device_id,null,null,client_version,null ,count(user_id) FROM test_xinyan_reg group by device_id,client_version
UNION ALL
SELECT null,os_id,null,client_version,null ,count(user_id) FROM test_xinyan_reg group by os_id,client_version
UNION ALL
SELECT device_id,os_id,null,client_version,null ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,client_version
UNION ALL
SELECT null,null,app_id,client_version,null ,count(user_id) FROM test_xinyan_reg group by app_id,client_version
UNION ALL
SELECT device_id,null,app_id,client_version,null ,count(user_id) FROM test_xinyan_reg group by device_id,app_id,client_version
UNION ALL
SELECT null,os_id,app_id,client_version,null ,count(user_id) FROM test_xinyan_reg group by os_id,app_id,client_version
UNION ALL
SELECT device_id,os_id,app_id,client_version,null ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,app_id,client_version
UNION ALL
SELECT null,null,null,null,from_id ,count(user_id) FROM test_xinyan_reg group by from_id
UNION ALL
SELECT device_id,null,null,null,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,from_id
UNION ALL
SELECT null,os_id,null,null,from_id ,count(user_id) FROM test_xinyan_reg group by os_id,from_id
UNION ALL
SELECT device_id,os_id,null,null,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,from_id
UNION ALL
SELECT null,null,app_id,null,from_id ,count(user_id) FROM test_xinyan_reg group by app_id,from_id
UNION ALL
SELECT device_id,null,app_id,null,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,app_id,from_id
UNION ALL
SELECT null,os_id,app_id,null,from_id ,count(user_id) FROM test_xinyan_reg group by os_id,app_id,from_id
UNION ALL
SELECT device_id,os_id,app_id,null,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,app_id,from_id
UNION ALL
SELECT null,null,null,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by client_version,from_id
UNION ALL
SELECT device_id,null,null,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,client_version,from_id
UNION ALL
SELECT null,os_id,null,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by os_id,client_version,from_id
UNION ALL
SELECT device_id,os_id,null,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,client_version,from_id
UNION ALL
SELECT null,null,app_id,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by app_id,client_version,from_id
UNION ALL
SELECT device_id,null,app_id,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,app_id,client_version,from_id
UNION ALL
SELECT null,os_id,app_id,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by os_id,app_id,client_version,from_id
UNION ALL
SELECT device_id,os_id,app_id,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,app_id,client_version,from_id
UNION ALL

SELECT null,null,null,null,null ,count(user_id) FROM test_xinyan_reg
看着很蛋疼是不是,体会到cube的强大了吗!(低版本hive可以通过union all方式解决,算是没有办法的办法)

  • ROLL UP函数

rollup可以实现从右到左递减多级的统计,显示统计某一层次结构的聚合。

select device_id,os_id,app_id,client_version,from_id,count(user_id)
from test_xinyan_reg
group by device_id,os_id,app_id,client_version,from_id with rollup;

等价以下sql语句:

select device_id,os_id,app_id,client_version,from_id,count(user_id)
from test_xinyan_reg
group by device_id,os_id,app_id,client_version,from_id
grouping sets ((device_id,os_id,app_id,client_version,from_id),(device_id,os_id,app_id,client_version),(device_id,os_id,app_id),(device_id,os_id),(device_id),());

  • Grouping_ID函数

当我们没有统计某一列时,它的值显示为null,这可能与列本身就有null值冲突,这就需要一种方法区分是没有统计还是值本来就是null。(写一个排列组合的算法,就马上理解了,grouping_id其实就是所统计各列二进制和)

grouping_id函数是计算分组级别的函数,注意如果要使用grouping_id函数那必须得有group by字句,而且group by字句的中的列与grouping_id函数的参数必须相等。比如group by A,B,那么必须使用grouping_id(A,B)。下面用一个等效关系来说明grouping_id()与grouping()的联系,grouping_id(A, B)等效于grouping(A) + grouping(B),但要注意这里的+号不是算术相加,它表示的是二进制数据组合在一起,比如grouping(A)=1,grouping(B)=1,那么grouping_id(A, B)=11B,也就是十进制数3。原来的表数据执行下面的sql语句结果太多效果不明显,所以我改了下表数据,不过对比两个结果集效果很明显。

直接拿官方文档一个例子

Column1 (key) Column2 (value)
1 NULL
1 1
2 2
3 3
3 NULL
4 5

hql统计:

>  SELECT key, value, GROUPING__ID, count(*) from T1 GROUP BY key, value WITH ROLLUP

统计结果如下:
key value Grouping__ID count
NULL NULL 0 00 6
1 NULL 1 10 2
1 NULL 3 11 1
1 1 3 11 1
2 NULL 1 10 1
2 2 3 11 1
3 NULL 1 10 2
3 NULL 3 11 1
3 3 3 11 1
4 NULL 1 10 1
4 5 3 11 1

	GROUPING__ID转变为二进制,如果对应位上有值为null,说明这列本身值就是null。(通过类DataFilterNull.py 扫描,可以筛选过滤掉列中null、“”统计结果),
  • Grouping function

grouping函数用来区分NULL值,这里NULL值有2种情况,一是原本表中的数据就为NULL,二是由rollup、cube、grouping sets生成的NULL值。 当为第一种情况中的空值时,grouping(NULL)返回0;当为第二种情况中的空值时,grouping(NULL)返回1。实例如下,从结果中可以看到第二个结果集中原本为null的数据由于grouping函数为1,故显示ROLLUP-NULL字符串。

SELECT key, value, GROUPING__ID,
grouping(key, value), grouping(value, key), grouping(key), grouping(value),
count(*)
FROM T1
GROUP BY key, value WITH ROLLUP;

This query will produce the following results.

NULL NULL 3 3 3 1 1 6
1 NULL 0 0 0 0 0 2
1 NULL 1 1 2 0 1 1
1 1 0 0 0 0 0 1
2 NULL 1 1 2 0 1 1
2 2 0 0 0 0 0 1
3 NULL 0 0 0 0 0 2
3 NULL 1 1 2 0 1 1
3 3 0 0 0 0 0 1
4 NULL 1 1 2 0 1 1
4 5 0 0 0 0 0 1

hive.new.job.grouping.set.cardinality 可能大家已经考虑到了,如果使用sets、cube或者rollup之类的操作,当基数很大时可能会出现一些问题。所以,使用这个配置来完成一些优化。该配置应该大于sets、cube或者rollup产生的组合大小。

你可能感兴趣的:(Hive)