Hive Analytic Functions: GROUPING SETS, CUBE, and ROLLUP

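These constructs are typically used in OLAP scenarios for metrics that are non-additive (they cannot simply be summed up from finer-grained results) and that must be computed while drilling up and down across different dimensions.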

Environment:

Hive version: apache-hive-0.14.0-bin
Hadoop version: hadoop-2.6.0
Tez version: tez-0.7.0

 

Sample data (columns: op_month, op_day, userno):

2016-03,2016-03-10,user1
2016-03,2016-03-10,user5
2016-03,2016-03-12,user7
2016-04,2016-04-12,user3
2016-04,2016-04-13,user2
2016-04,2016-04-13,user4
2016-04,2016-04-16,user4
2016-03,2016-03-10,user2
2016-03,2016-03-10,user3
2016-04,2016-04-12,user5
2016-04,2016-04-13,user6
2016-04,2016-04-15,user3
2016-04,2016-04-15,user2
2016-04,2016-04-16,user1

 

Create the table:

CREATE TABLE windows_gcr (
   op_month STRING,
   op_day STRING,
   userno STRING
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;


Load the data into the Hive table:

load data local inpath '/home/hadoop/testhivedata/windows_gcr.txt' into table windows_gcr;
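Before running the queries, it may help to see why COUNT(DISTINCT userno) counts as a non-additive metric on this data. The following is a minimal Python sketch, not Hive code; the helper name distinct_users is hypothetical and the rows are copied from the sample data above.

```python
from collections import defaultdict

# Sample rows copied from the data section: (op_month, op_day, userno).
ROWS = [
    ("2016-03", "2016-03-10", "user1"), ("2016-03", "2016-03-10", "user5"),
    ("2016-03", "2016-03-12", "user7"), ("2016-04", "2016-04-12", "user3"),
    ("2016-04", "2016-04-13", "user2"), ("2016-04", "2016-04-13", "user4"),
    ("2016-04", "2016-04-16", "user4"), ("2016-03", "2016-03-10", "user2"),
    ("2016-03", "2016-03-10", "user3"), ("2016-04", "2016-04-12", "user5"),
    ("2016-04", "2016-04-13", "user6"), ("2016-04", "2016-04-15", "user3"),
    ("2016-04", "2016-04-15", "user2"), ("2016-04", "2016-04-16", "user1"),
]


def distinct_users(rows, key):
    """Return {key(row): number of distinct userno values} (hypothetical helper)."""
    groups = defaultdict(set)
    for row in rows:
        groups[key(row)].add(row[2])
    return {k: len(v) for k, v in groups.items()}


per_day = distinct_users(ROWS, key=lambda r: (r[0], r[1]))
per_month = distinct_users(ROWS, key=lambda r: r[0])

# Summing the daily distinct counts of a month over-counts users
# who were active on more than one day of that month.
april_sum_of_days = sum(n for (month, _), n in per_day.items() if month == "2016-04")
print(april_sum_of_days, per_month["2016-04"])  # prints "9 6"
```

Because the April total cannot be derived by adding up the daily rows, each aggregation level has to be computed from the detail data, which is what the constructs below do in a single query.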

 

 

1. GROUPING SETS

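Within a single GROUP BY query, GROUPING SETS aggregates by several different combinations of dimensions. It is equivalent to running a separate GROUP BY for each combination and combining the result sets with UNION ALL.

Aggregate by op_month and by op_day separately:

SELECT
   op_month,
   op_day,
   COUNT(DISTINCT userno) AS userno_count,
   GROUPING__ID
FROM windows_gcr
GROUP BY op_month, op_day
GROUPING SETS (op_month, op_day)
ORDER BY GROUPING__ID;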

 

Result:

op_month  op_day      userno_count  grouping__id
2016-03   NULL        5             1
2016-04   NULL        6             1
NULL      2016-03-10  4             2
NULL      2016-03-12  1             2
NULL      2016-04-12  2             2
NULL      2016-04-13  3             2
NULL      2016-04-15  2             2
NULL      2016-04-16  2             2

 

The GROUPING__ID column indicates which grouping set each result row belongs to.
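The IDs in the result above (and the values 0 and 3 that appear in later examples) are consistent with reading GROUPING__ID as a bitmask over the GROUP BY columns, with the first column as the lowest bit. The Python sketch below illustrates that reading; it is an inference from the Hive 0.14 output shown here, and grouping_id is a hypothetical helper, not a Hive API. Newer Hive releases compute GROUPING__ID differently, so check the documentation for your version.

```python
def grouping_id(group_by_cols, grouping_set):
    """Bitmask with bit i set when the i-th GROUP BY column is in the set.

    Assumption: this mirrors the numbering seen in this article's
    Hive 0.14 output; it is not taken from Hive's source code.
    """
    return sum(1 << i for i, col in enumerate(group_by_cols) if col in grouping_set)


cols = ["op_month", "op_day"]
ids = {
    gs: grouping_id(cols, gs)
    for gs in [(), ("op_month",), ("op_day",), ("op_month", "op_day")]
}
for gs, gid in ids.items():
    print(gid, gs)
```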

 

 

This is equivalent to:

SELECT op_month, NULL AS op_day, COUNT(DISTINCT userno) AS userno_count, 1 AS GROUPING__ID FROM windows_gcr GROUP BY op_month
UNION ALL
SELECT NULL AS op_month, op_day, COUNT(DISTINCT userno) AS userno_count, 2 AS GROUPING__ID FROM windows_gcr GROUP BY op_day;
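The equivalence with the UNION ALL query can be checked outside Hive. The following Python sketch runs one grouping pass per grouping set and concatenates the results; it assumes the sample rows from the data section, and grouping_sets is a hypothetical helper written only for this illustration.

```python
from collections import defaultdict

# Sample rows copied from the data section: (op_month, op_day, userno).
ROWS = [
    ("2016-03", "2016-03-10", "user1"), ("2016-03", "2016-03-10", "user5"),
    ("2016-03", "2016-03-12", "user7"), ("2016-04", "2016-04-12", "user3"),
    ("2016-04", "2016-04-13", "user2"), ("2016-04", "2016-04-13", "user4"),
    ("2016-04", "2016-04-16", "user4"), ("2016-03", "2016-03-10", "user2"),
    ("2016-03", "2016-03-10", "user3"), ("2016-04", "2016-04-12", "user5"),
    ("2016-04", "2016-04-13", "user6"), ("2016-04", "2016-04-15", "user3"),
    ("2016-04", "2016-04-15", "user2"), ("2016-04", "2016-04-16", "user1"),
]
COLS = ("op_month", "op_day")


def grouping_sets(rows, labelled_sets):
    """One GROUP BY pass per grouping set, results concatenated (UNION ALL)."""
    out = []
    for gid, gs in labelled_sets:
        users = defaultdict(set)
        for month, day, user in rows:
            record = {"op_month": month, "op_day": day}
            # Columns outside the current grouping set come back as NULL (None).
            key = tuple(record[c] if c in gs else None for c in COLS)
            users[key].add(user)
        # COUNT(DISTINCT userno) is the size of each group's user set.
        out += [key + (len(u), gid) for key, u in users.items()]
    return out


# The IDs 1 and 2 are hard-coded here, as in the UNION ALL query.
result = grouping_sets(ROWS, [(1, ("op_month",)), (2, ("op_day",))])
for row in sorted(result, key=lambda r: (r[3], r[0] or "", r[1] or "")):
    print(row)
```

The printed tuples should match the eight rows of the result table for the first query, with None standing in for NULL.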

 

 

Another example, which adds the combined (op_month, op_day) grouping set:

SELECT
   op_month,
   op_day,
   COUNT(DISTINCT userno) AS userno_count,
   GROUPING__ID
FROM windows_gcr
GROUP BY op_month, op_day
GROUPING SETS (op_month, op_day, (op_month, op_day))
ORDER BY GROUPING__ID;

 

Result:

op_month  op_day      userno_count  grouping__id
2016-04   NULL        6             1
2016-03   NULL        5             1
NULL      2016-03-12  1             2
NULL      2016-04-12  2             2
NULL      2016-04-13  3             2
NULL      2016-04-15  2             2
NULL      2016-04-16  2             2
NULL      2016-03-10  4             2
2016-03   2016-03-10  4             3
2016-03   2016-03-12  1             3
2016-04   2016-04-12  2             3
2016-04   2016-04-13  3             3
2016-04   2016-04-15  2             3
2016-04   2016-04-16  2             3

 

 

This is equivalent to:

SELECT op_month, NULL AS op_day, COUNT(DISTINCT userno) AS userno_count, 1 AS GROUPING__ID FROM windows_gcr GROUP BY op_month
UNION ALL
SELECT NULL AS op_month, op_day, COUNT(DISTINCT userno) AS userno_count, 2 AS GROUPING__ID FROM windows_gcr GROUP BY op_day
UNION ALL
SELECT op_month, op_day, COUNT(DISTINCT userno) AS userno_count, 3 AS GROUPING__ID FROM windows_gcr GROUP BY op_month, op_day;

 

2. CUBE

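CUBE aggregates by every possible combination of the GROUP BY dimensions, including the empty combination (the grand total).

SELECT
   op_month,
   op_day,
   COUNT(DISTINCT userno) AS userno_count,
   GROUPING__ID
FROM windows_gcr
GROUP BY op_month, op_day
WITH CUBE
ORDER BY GROUPING__ID;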

 

Result:

op_month  op_day      userno_count  grouping__id
NULL      NULL        7             0
2016-03   NULL        5             1
2016-04   NULL        6             1
NULL      2016-04-12  2             2
NULL      2016-04-13  3             2
NULL      2016-04-15  2             2
NULL      2016-04-16  2             2
NULL      2016-03-10  4             2
NULL      2016-03-12  1             2
2016-03   2016-03-10  4             3
2016-03   2016-03-12  1             3
2016-04   2016-04-16  2             3
2016-04   2016-04-12  2             3
2016-04   2016-04-13  3             3
2016-04   2016-04-15  2             3

 

This is equivalent to:

SELECT NULL AS op_month, NULL AS op_day, COUNT(DISTINCT userno) AS userno_count, 0 AS GROUPING__ID FROM windows_gcr
UNION ALL
SELECT op_month, NULL AS op_day, COUNT(DISTINCT userno) AS userno_count, 1 AS GROUPING__ID FROM windows_gcr GROUP BY op_month
UNION ALL
SELECT NULL AS op_month, op_day, COUNT(DISTINCT userno) AS userno_count, 2 AS GROUPING__ID FROM windows_gcr GROUP BY op_day
UNION ALL
SELECT op_month, op_day, COUNT(DISTINCT userno) AS userno_count, 3 AS GROUPING__ID FROM windows_gcr GROUP BY op_month, op_day;
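The four SELECT branches above correspond to the four subsets of the two GROUP BY columns. A short Python sketch of that enumeration follows; cube_sets is a hypothetical helper and only illustrates which grouping sets CUBE covers, not how Hive plans the query.

```python
from itertools import combinations


def cube_sets(cols):
    """All 2**n subsets of the GROUP BY columns, smallest first."""
    return [s for k in range(len(cols) + 1) for s in combinations(cols, k)]


sets = cube_sets(("op_month", "op_day"))
print(sets)
# [(), ('op_month',), ('op_day',), ('op_month', 'op_day')]
```

The number of grouping sets doubles with every added dimension (three columns give eight sets), so CUBE over many columns can become expensive.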

 

 

3. ROLLUP

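ROLLUP produces a subset of the CUBE result: it aggregates hierarchically, starting from the leftmost dimension in the GROUP BY.

For example, hierarchical aggregation starting from op_month:

SELECT
   op_month,
   op_day,
   COUNT(DISTINCT userno) AS userno_count,
   GROUPING__ID
FROM windows_gcr
GROUP BY op_month, op_day
WITH ROLLUP
ORDER BY GROUPING__ID;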

Result:

op_month  op_day      userno_count  grouping__id
NULL      NULL        7             0
2016-03   NULL        5             1
2016-04   NULL        6             1
2016-03   2016-03-10  4             3
2016-03   2016-03-12  1             3
2016-04   2016-04-12  2             3
2016-04   2016-04-13  3             3
2016-04   2016-04-15  2             3
2016-04   2016-04-16  2             3

 

This supports the following drill-up path:

users per (month, day) -> users per month -> total users

 

If op_month and op_day are swapped in the GROUP BY clause, the hierarchy starts from op_day instead:

SELECT
   op_month,
   op_day,
   COUNT(DISTINCT userno) AS userno_count,
   GROUPING__ID
FROM windows_gcr
GROUP BY op_day, op_month
WITH ROLLUP
ORDER BY GROUPING__ID;

 

Result:

op_month  op_day      userno_count  grouping__id
NULL      NULL        7             0
NULL      2016-04-13  3             1
NULL      2016-03-12  1             1
NULL      2016-04-15  2             1
NULL      2016-03-10  4             1
NULL      2016-04-16  2             1
NULL      2016-04-12  2             1
2016-04   2016-04-12  2             3
2016-03   2016-03-10  4             3
2016-03   2016-03-12  1             3
2016-04   2016-04-13  3             3
2016-04   2016-04-15  2             3
2016-04   2016-04-16  2             3

 

This supports the following drill-up path:

UV (unique visitors) per (day, month) -> UV per day -> total UV
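Both ROLLUP results can be read as the prefixes of the GROUP BY column list, which is why the column order changes the output. The Python sketch below illustrates that reading; rollup_sets is a hypothetical helper, not a Hive function.

```python
def rollup_sets(cols):
    """Prefixes of the GROUP BY column list, from () up to the full list."""
    return [tuple(cols[:k]) for k in range(len(cols) + 1)]


month_first = rollup_sets(("op_month", "op_day"))
day_first = rollup_sets(("op_day", "op_month"))
print(month_first)  # [(), ('op_month',), ('op_month', 'op_day')]
print(day_first)    # [(), ('op_day',), ('op_day', 'op_month')]
```

With n columns ROLLUP yields n + 1 grouping sets, compared with 2**n for CUBE, and the per-day level appears only when op_day is listed first.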

 

For details, see the official Hive documentation:

https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C+Grouping+and+Rollup

 

 

 
