DDL of the source table:
hive> show create table bi_all_access_log;
OK
CREATE TABLE `bi_all_access_log`(
`appsource` string,
`appkey` string,
`identifier` string,
`uid` string)
PARTITIONED BY (
`pt_month` string,
`pt_day` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'=',',
'line.delim'='\n',
'serialization.format'=',')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://emr-cluster/user/hive/warehouse/bi_all_access_log'
TBLPROPERTIES (
'transient_lastDdlTime'='1481864860')
Time taken: 0.025 seconds, Fetched: 22 row(s)
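1. GROUPING SETS
GROUPING SETS is a clause of GROUP BY that lets you specify several groupings in a single GROUP BY statement. It can be understood simply as several GROUP BY queries whose results are combined with UNION ALL. Columns that are not part of a given grouping set are returned as NULL in the rows for that set.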
select pt_day,appsource,appkey,count(identifier),count(uid)
from bi_all_access_log
where pt_month='2017-11'
group by pt_day,appsource,appkey
grouping sets((pt_day))
;
select pt_day,appsource,appkey,count(identifier),count(uid)
from bi_all_access_log
where pt_month='2017-11'
group by pt_day,appsource,appkey
grouping sets((pt_day,appkey))
;
select pt_day,appsource,appkey,count(identifier),count(uid)
from bi_all_access_log
where pt_month='2017-11'
group by pt_day,appsource,appkey
grouping sets((pt_day),(pt_day,appkey))
;
select pt_day,appsource,appkey,count(identifier),count(uid)
from bi_all_access_log
where pt_month='2017-11'
group by pt_day,appsource,appkey
grouping sets((),(pt_day),(pt_day,appkey))
;
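The UNION ALL equivalence described at the start of this section can be sketched as follows. This is a hand-written rewrite of the grouping sets((pt_day),(pt_day,appkey)) query above, not output produced by Hive; the aliases identifier_cnt and uid_cnt and the cast(null as string) placeholders are introduced here for illustration only.

```sql
-- Grouping set (pt_day): appsource and appkey are not grouped, so they come back as NULL
select pt_day,
       cast(null as string) as appsource,
       cast(null as string) as appkey,
       count(identifier) as identifier_cnt,
       count(uid) as uid_cnt
from bi_all_access_log
where pt_month='2017-11'
group by pt_day
union all
-- Grouping set (pt_day, appkey): only appsource is NULL
select pt_day,
       cast(null as string) as appsource,
       appkey,
       count(identifier) as identifier_cnt,
       count(uid) as uid_cnt
from bi_all_access_log
where pt_month='2017-11'
group by pt_day, appkey
;
```

Each extra grouping set adds one more branch to the UNION ALL form, whereas the grouping sets form only adds one more entry to the list.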
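2. CUBE
CUBE, often called the "data cube", aggregates over every combination of the chosen dimensions in Hive. cube(a,b,c) first groups by (a,b,c), then by (a,b), (a,c), (a), (b,c), (b), and (c) in turn, and finally aggregates over the whole table. In other words, it computes the aggregates for all combinations of the selected columns.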
select pt_day,appsource,appkey,count(identifier),count(uid)
from bi_all_access_log
where pt_month='2017-11'
group by pt_day,appsource,appkey
with cube
;
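As a sketch of the expansion described above, the same result can be requested by listing all 2^3 = 8 combinations explicitly with grouping sets. The query below is written out by hand for the three columns used in this article:

```sql
-- Explicit grouping sets form of "group by pt_day,appsource,appkey with cube"
select pt_day,appsource,appkey,count(identifier),count(uid)
from bi_all_access_log
where pt_month='2017-11'
group by pt_day,appsource,appkey
grouping sets(
  (pt_day,appsource,appkey),
  (pt_day,appsource),
  (pt_day,appkey),
  (pt_day),
  (appsource,appkey),
  (appsource),
  (appkey),
  ()
)
;
```

With n grouping columns, cube produces 2^n grouping sets, so the amount of output grows quickly as dimensions are added.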
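3. ROLLUP
ROLLUP computes multi-level aggregates by dropping the grouping columns one at a time from right to left, which yields the aggregates for a single hierarchy of levels.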
select pt_day,appsource,appkey,count(identifier),count(uid)
from bi_all_access_log
where pt_month='2017-11'
group by pt_day,appsource,appkey
with rollup
;
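The right-to-left reduction can be made explicit with grouping sets. The following hand-written sketch lists the four levels that rollup generates for the three columns used here:

```sql
-- Explicit grouping sets form of "group by pt_day,appsource,appkey with rollup"
select pt_day,appsource,appkey,count(identifier),count(uid)
from bi_all_access_log
where pt_month='2017-11'
group by pt_day,appsource,appkey
grouping sets(
  (pt_day,appsource,appkey),  -- most detailed level
  (pt_day,appsource),         -- appkey dropped
  (pt_day),                   -- appsource dropped as well
  ()                          -- grand total
)
;
```

With n grouping columns, rollup produces n+1 grouping sets, compared with 2^n for cube.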
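4. GROUPING__ID
GROUPING__ID (written with two underscores) is used to tell NULLs that exist in the data apart from NULLs generated by cube, rollup, and grouping sets.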
select pt_day,appsource,appkey,GROUPING__ID,count(identifier),count(uid)
from bi_all_access_log
where pt_month='2017-11'
group by pt_day,appsource,appkey
with rollup
;
select pt_day,appsource,appkey,GROUPING__ID,count(identifier),count(uid)
from bi_all_access_log
where pt_month='2017-11'
group by pt_day,appsource,appkey
with cube
;
select pt_day,appsource,appkey,GROUPING__ID,count(identifier),count(uid)
from bi_all_access_log
where pt_month='2017-11'
group by pt_day,appsource,appkey
grouping sets((),(pt_day),(pt_day,appkey))
;
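The numeric value of GROUPING__ID depends on the Hive version, since its encoding was changed in Hive 2.3.0, so reading the raw number requires checking the documentation for the version in use. A minimal sketch that avoids decoding it by hand, assuming Hive 2.3.0 or later where the grouping() function is available, is shown below. The aliases appsource_rolled_up and appkey_rolled_up are illustrative names, not part of the original queries.

```sql
-- Assumption: Hive 2.3.0+ (grouping() is not available in earlier versions).
-- grouping(col) returns 1 when col was aggregated away in this row (its NULL is generated),
-- and 0 when col is part of the grouping set (a NULL then comes from the data itself).
select pt_day,appsource,appkey,
       GROUPING__ID,
       grouping(appsource) as appsource_rolled_up,
       grouping(appkey) as appkey_rolled_up,
       count(identifier),count(uid)
from bi_all_access_log
where pt_month='2017-11'
group by pt_day,appsource,appkey
with rollup
;
```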
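5. Summary
cube produces the most complete set of groupings: the Cartesian combination of the values of every dimension (including null);
rollup restricts the combinations of dimensions: if a dimension is null, every dimension after it must also be null; if a dimension is non-null, the next dimension may be either;
grouping sets lets you define the groupings yourself, so you group only by what you need.
ps:
Using grouping sets simplifies the SQL and generally performs better than running a separate single-dimension group by for each grouping and combining the results with union.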