Hive analytic functions: learning GROUPING SETS, CUBE, and ROLLUP

DDL of the source table:
hive> show create table bi_all_access_log;
OK
CREATE TABLE `bi_all_access_log`(
  `appsource` string, 
  `appkey` string, 
  `identifier` string, 
  `uid` string)
PARTITIONED BY ( 
  `pt_month` string, 
  `pt_day` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
WITH SERDEPROPERTIES ( 
  'field.delim'=',', 
  'line.delim'='\n', 
  'serialization.format'=',') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://emr-cluster/user/hive/warehouse/bi_all_access_log'
TBLPROPERTIES (
  'transient_lastDdlTime'='1481864860')
Time taken: 0.025 seconds, Fetched: 22 row(s)

1、GROUPING SETS
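GROUPING SETS is a clause of GROUP BY that lets you list several grouping combinations after a single GROUP BY. A simple way to think about it: the result is what you would get by running one GROUP BY query per grouping set and combining the results with UNION ALL. Columns that are not part of a given grouping set come back as NULL in that set's rows.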

select pt_day,appsource,appkey,count(identifier),count(uid)
from bi_all_access_log
where pt_month='2017-11'
group by pt_day,appsource,appkey
grouping sets((pt_day))
;

select pt_day,appsource,appkey,count(identifier),count(uid)
from bi_all_access_log
where pt_month='2017-11'
group by pt_day,appsource,appkey
grouping sets((pt_day,appkey))
;

select pt_day,appsource,appkey,count(identifier),count(uid)
from bi_all_access_log
where pt_month='2017-11'
group by pt_day,appsource,appkey
grouping sets((pt_day),(pt_day,appkey))
;

select pt_day,appsource,appkey,count(identifier),count(uid)
from bi_all_access_log
where pt_month='2017-11'
group by pt_day,appsource,appkey
grouping sets((),(pt_day),(pt_day,appkey))
;
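
To make the UNION ALL reading concrete, here is a rough sketch of what the third query above, grouping sets((pt_day),(pt_day,appkey)), corresponds to when written by hand. The cast(null as string) placeholders are my own addition so that both branches have the same column types; row order is not guaranteed to match.

```sql
-- Hand-written equivalent (sketch) of:
--   group by pt_day,appsource,appkey grouping sets((pt_day),(pt_day,appkey))

-- Branch 1: the (pt_day) grouping set; appsource and appkey are not grouped, so they are NULL
select pt_day,
       cast(null as string) as appsource,
       cast(null as string) as appkey,
       count(identifier),
       count(uid)
from bi_all_access_log
where pt_month='2017-11'
group by pt_day

union all

-- Branch 2: the (pt_day,appkey) grouping set; only appsource is NULL
select pt_day,
       cast(null as string) as appsource,
       appkey,
       count(identifier),
       count(uid)
from bi_all_access_log
where pt_month='2017-11'
group by pt_day,appkey
;
```

Each extra grouping set adds one more branch to the hand-written form, which is why the GROUPING SETS version stays much shorter.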

2、CUBE
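CUBE, also known as the data cube, aggregates over every combination of the chosen dimensions in Hive. cube(a,b,c) first groups by (a,b,c), then by (a,b), (a,c), (a), (b,c), (b), and (c) in turn, and finally aggregates over the whole table. In other words, it computes aggregates for all 2^3 = 8 combinations of the selected columns.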
select pt_day,appsource,appkey,count(identifier),count(uid)
from bi_all_access_log
where pt_month='2017-11'
group by pt_day,appsource,appkey
with cube
;
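
As a sketch of what WITH CUBE expands to, the query above can be spelled out with explicit grouping sets. For three columns that means listing all eight combinations:

```sql
-- WITH CUBE on (pt_day,appsource,appkey) written as explicit grouping sets
select pt_day,appsource,appkey,count(identifier),count(uid)
from bi_all_access_log
where pt_month='2017-11'
group by pt_day,appsource,appkey
grouping sets(
  (pt_day,appsource,appkey),                          -- all three dimensions
  (pt_day,appsource),(pt_day,appkey),(appsource,appkey), -- every pair
  (pt_day),(appsource),(appkey),                      -- every single dimension
  ()                                                  -- grand total over the filtered rows
)
;
```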

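3、ROLLUP
ROLLUP computes multi-level aggregates by dropping grouping columns one at a time from right to left, so it produces the subtotals along one hierarchy of the dimensions.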
select pt_day,appsource,appkey,count(identifier),count(uid)
from bi_all_access_log
where pt_month='2017-11'
group by pt_day,appsource,appkey
with rollup
;
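
Written out with explicit grouping sets, the WITH ROLLUP query above would look roughly like this. With three columns there are four levels, each one dropping the rightmost remaining column:

```sql
-- WITH ROLLUP on (pt_day,appsource,appkey) written as explicit grouping sets
select pt_day,appsource,appkey,count(identifier),count(uid)
from bi_all_access_log
where pt_month='2017-11'
group by pt_day,appsource,appkey
grouping sets(
  (pt_day,appsource,appkey),  -- most detailed level
  (pt_day,appsource),         -- appkey rolled up
  (pt_day),                   -- appsource rolled up as well
  ()                          -- grand total over the filtered rows
)
;
```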

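4、GROUPING__ID
GROUPING__ID (note the two underscores) is used to distinguish NULLs that are present in the data from NULLs produced by CUBE, ROLLUP, and GROUPING SETS.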
select pt_day,appsource,appkey,GROUPING__ID,count(identifier),count(uid)
from bi_all_access_log
where pt_month='2017-11'
group by pt_day,appsource,appkey
with rollup
;

select pt_day,appsource,appkey,GROUPING__ID,count(identifier),count(uid)
from bi_all_access_log
where pt_month='2017-11'
group by pt_day,appsource,appkey
with cube
;

select pt_day,appsource,appkey,GROUPING__ID,count(identifier),count(uid)
from bi_all_access_log
where pt_month='2017-11'
group by pt_day,appsource,appkey
grouping sets((),(pt_day),(pt_day,appkey))
;
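
The numeric value of GROUPING__ID is a bitmask over the GROUP BY columns, and as far as I know its bit convention changed in Hive 2.3.0, so it is worth checking the values on your own version before hard-coding them. On Hive 2.3.0 and later the grouping() function gives a per-column view that is easier to read. The following is a minimal sketch under that version assumption; the column aliases are just illustrative names:

```sql
-- Assumes Hive 2.3.0 or later, where grouping() is available.
-- grouping(col) is 0 when col is part of the current row's grouping set,
-- and 1 when col was aggregated away (its NULL comes from the rollup, not from the data).
select pt_day,
       appkey,
       GROUPING__ID,
       grouping(pt_day) as pt_day_rolled_up,
       grouping(appkey) as appkey_rolled_up,
       count(identifier),
       count(uid)
from bi_all_access_log
where pt_month='2017-11'
group by pt_day,appkey
with rollup
;
```

A row where appkey is NULL and appkey_rolled_up is 0 holds a genuine NULL from the data, while a row where appkey_rolled_up is 1 is a subtotal row.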

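5、Summary
CUBE gives the most complete set of groupings: the Cartesian product of every dimension's values, with NULL included as a possible value for each dimension.
ROLLUP only produces combinations that follow the hierarchy: if a dimension is NULL, every dimension after it must also be NULL; if a dimension is non-NULL, the next dimension may be either.
GROUPING SETS lets you define the groupings yourself, so you list only the combinations you need.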

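PS: Using GROUPING SETS simplifies the SQL, and it generally performs better than running a separate single-dimension GROUP BY for each grouping and combining the results with UNION.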
