Hive from Getting Started to Giving Up: Playing with Hive's Window Functions for Data Analysis (Part 15)

Background

  Hive carries the title of a big-data data warehouse, so data analysis is about the most common thing it is used for. Within data analysis, aggregate functions and window functions are the two most important pieces: aggregate functions were already covered in the post "Hive从入门到放弃——Hive 用户内置函数简介(十一)" (Part 11 of this series, on built-in functions), while window functions, because of how flexibly they can be applied, get a post of their own here.
  Some newcomers confuse aggregate functions with window functions. An aggregate function groups rows by chosen dimensions and computes the required summary metrics, collapsing each group into a single row; a window function treats the distinct values of the chosen dimensions as partitions and computes metrics within each partition while keeping every row. Take a table that records the Chinese-language scores of every class in a school: aggregate functions are what you use to get the school-wide or per-class average, total, highest and lowest scores; window functions can produce those same statistics, but they can also rank scores within each class, rank scores across the whole school, or accumulate head counts starting from class one, so they express per-group metrics far more flexibly. With that, let's get into window functions themselves.
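  To make the difference concrete, here is a minimal sketch against a hypothetical table school_chinese_score(class_name, student_name, score); the table and its columns are made up purely for illustration. The GROUP BY query collapses each class into one summary row, while the same aggregates written with OVER keep every student row and attach the class-level statistics and rankings to it.

-- Aggregate: one summary row per class (hypothetical table for illustration)
select class_name,
       avg(score) as class_avg,
       max(score) as class_max,
       min(score) as class_min
from school_chinese_score
group by class_name;

-- Window: every student row kept, class-level stats and ranks attached
select class_name, student_name, score,
       avg(score)  over(partition by class_name)                      as class_avg,
       max(score)  over(partition by class_name)                      as class_max,
       rank()      over(partition by class_name order by score desc)  as rank_in_class,
       rank()      over(order by score desc)                          as rank_in_school
from school_chinese_score;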

Preparing the Sample Table

  Here we create a new table, ods_sale_order_producttion_amount (ODS daily product sales totals). The DDL, the data load, and a data preview follow;

-- DDL: create the table
CREATE TABLE `ods_sale_order_producttion_amount`(
  `month_key` string COMMENT '月份', 
  `date_key` string COMMENT '日期', 
  `production_amount` decimal(18,2) COMMENT '产品总值')
COMMENT 'ods产品每天销售总额'
PARTITIONED BY ( 
  `event_year` int, 
  `event_week` int, 
  `event_day` string, 
  `event_hour` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'hdfs://dw-test-cluster/hive/warehouse/ods/sale/ods_sale_order_producttion_amount'
TBLPROPERTIES ('parquet.compression'='snappy');

-- Load the data
insert into ods_sale_order_producttion_amount partition(event_year=2020,event_week=30,event_day='20200731',event_hour='00')
select  '202005','20200501',199000.00
union all 
select  '202005','20200502',185000.00
union all 
select  '202005','20200503',199000.00
union all 
select  '202005','20200504',138500.00
union all 
select  '202005','20200505',196540.00
union all 
select  '202005','20200506',138500.00
union all 
select  '202005','20200507',159840.00
union all 
select  '202005','20200508',189462.00
union all
select  '202005','20200509',200000.00
union all
select  '202005','20200510',198540.00
union all 
select  '202006','20200601',189000.00
union all 
select  '202006','20200602',185000.00
union all 
select  '202006','20200603',189000.00
union all 
select  '202006','20200604',158500.00
union all 
select  '202006','20200605',200140.00
union all 
select  '202006','20200606',158500.00
union all 
select  '202006','20200607',198420.00
union all 
select  '202006','20200608',158500.00
union all
select  '202006','20200609',200100.00
union all
select  '202006','20200610',135480.00;

-- Data preview
month_key       date_key        production_amount       event_year      event_week      event_day       event_hour
202005  20200501        199000.00       2020    30      20200731        00
202005  20200502        185000.00       2020    30      20200731        00
202005  20200503        199000.00       2020    30      20200731        00
202005  20200504        138500.00       2020    30      20200731        00
202005  20200505        196540.00       2020    30      20200731        00
202005  20200506        138500.00       2020    30      20200731        00
202005  20200507        159840.00       2020    30      20200731        00
202005  20200508        189462.00       2020    30      20200731        00
202005  20200509        200000.00       2020    30      20200731        00
202005  20200510        198540.00       2020    30      20200731        00
202006  20200601        189000.00       2020    30      20200731        00
202006  20200602        185000.00       2020    30      20200731        00
202006  20200603        189000.00       2020    30      20200731        00
202006  20200604        158500.00       2020    30      20200731        00
202006  20200605        200140.00       2020    30      20200731        00
202006  20200606        158500.00       2020    30      20200731        00
202006  20200607        198420.00       2020    30      20200731        00
202006  20200608        158500.00       2020    30      20200731        00
202006  20200609        200100.00       2020    30      20200731        00
202006  20200610        135480.00       2020    30      20200731        00
Time taken: 0.233 seconds, Fetched: 20 row(s)

Window Ranking Functions

  • ROW_NUMBER
    Usage: ROW_NUMBER() OVER([PARTITION BY col1,col2,…] ORDER BY col1,col2,… ASC|DESC)
    Returns: for each partition defined by the partition columns, the position of each row under the given ordering. Even when rows tie on the ordering columns they still receive distinct, consecutive numbers, so the largest value in a partition always equals the number of rows ranked in that partition;
  • RANK
    Usage: RANK() OVER([PARTITION BY col1,col2,…] ORDER BY col1,col2,… ASC|DESC)
    Returns: for each partition, the rank of each row under the given ordering. Rows with equal ordering values share the same rank, and the next distinct value skips as many ranks as there were ties, so the numbering can have gaps; a row's rank equals the number of rows ordered strictly ahead of it plus one, so the largest rank in a partition is at most the number of ranked rows;
  • DENSE_RANK
    Usage: DENSE_RANK() OVER([PARTITION BY col1,col2,…] ORDER BY col1,col2,… ASC|DESC)
    Returns: for each partition, the rank of each row under the given ordering. Rows with equal ordering values share the same rank, but the next distinct value does not skip the tied positions, so the ranks stay consecutive and the largest rank in a partition is less than or equal to the number of ranked rows;
  • CUME_DIST
    Usage: CUME_DIST() OVER([PARTITION BY col1,col2,…] ORDER BY col1,col2,… ASC|DESC)
    Returns: for each partition, the cumulative-distribution percentile of each row under the given ordering, in the interval (0, 1] (greater than 0, at most 1). Rows with equal ordering values get the same percentile. The value equals the number of rows in the partition that sort at or before the current row's ordering value, divided by the total number of rows in the partition, so the largest percentile in each partition is always 1.0;
  • PERCENT_RANK
    Usage: PERCENT_RANK() OVER([PARTITION BY col1,col2,…] ORDER BY col1,col2,… ASC|DESC)
    Returns: for each partition, the relative rank of each row under the given ordering, in the interval [0, 1] (at least 0, at most 1). The first value in a partition is always 0, and rows with equal ordering values share the same value. It is computed as (RANK of the current row - 1) / (number of rows in the partition - 1), so the last value in a partition is not necessarily 1.0: if the lowest-ranked value is tied it stays below 1.0, otherwise it reaches 1.0;
  • NTILE
    Usage: NTILE(num) OVER([PARTITION BY col1,col2,…] ORDER BY col1,col2,… ASC|DESC)
    Returns: splits each partition, in the given order, into num slices as evenly as possible and returns the slice number (starting at 1) that the current row falls into. When the rows do not divide evenly, the leftover rows are spread one each across the lowest-numbered slices: for example, splitting 10 rows into 7 slices leaves a remainder of 3, so slices 1, 2 and 3 get two rows each and the remaining slices get one row each;
    Typical use: picking the best or worst value within the first 1/n of a period, e.g. order by date within each group, split into 3 slices, and keep the rows whose NTILE() value is less than or equal to m to get the best result in the first m slices (a sketch of this appears after Table 1 below);
    The effect of ROW_NUMBER(), RANK(), DENSE_RANK(), CUME_DIST(), PERCENT_RANK() and NTILE() in the Hive CLI is shown below;
hive> set hive.cli.print.header=true;
hive> select
    >
    >         ROW_NUMBER() OVER(PARTITION BY month_key order by production_amount DESC)   row_number_result
    >        ,RANK() OVER(PARTITION BY month_key order by production_amount DESC)         rank_result
    >        ,DENSE_RANK() OVER(PARTITION BY month_key order by production_amount DESC)   dense_rank_result
    >        ,CUME_DIST() OVER(PARTITION BY month_key order by production_amount DESC)    cume_dist_result
    >        ,PERCENT_RANK() OVER(PARTITION BY month_key order by production_amount DESC) percent_rank_result
    >        ,NTILE(2) OVER(PARTITION BY month_key order by production_amount DESC)       ntile2_result
    >        ,NTILE(3) OVER(PARTITION BY month_key order by production_amount DESC)       ntile3_result
    >        ,NTILE(4) OVER(PARTITION BY month_key order by production_amount DESC)       ntile4_result
    >        ,NTILE(7) OVER(PARTITION BY month_key order by production_amount DESC)       ntile7_result
    >        ,month_key
    >        ,date_key
    >        ,production_amount
    > from ods_sale_order_producttion_amount where event_day='20200731';
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hadoop_20200721163446_108a897b-ddf1-46ae-9bd8-a81a2afec8c9
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1592876386879_0123, Tracking URL = http://dw-test-cluster-007:8088/proxy/application_1592876386879_0123/
Kill Command = /usr/local/tools/hadoop/current//bin/hadoop job  -kill job_1592876386879_0123
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-07-21 16:35:03,990 Stage-1 map = 0%,  reduce = 0%
2020-07-21 16:35:12,456 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.76 sec
2020-07-21 16:35:17,720 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.95 sec
MapReduce Total cumulative CPU time: 6 seconds 950 msec
Ended Job = job_1592876386879_0123
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 6.95 sec   HDFS Read: 14952 HDFS Write: 970 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 950 msec
OK
Time taken: 32.507 seconds, Fetched: 20 row(s)

  To make the results easier to read, I pulled them into a table with a tidier layout; see Table 1.

Table 1  Ranking results of ROW_NUMBER(), RANK(), DENSE_RANK(), CUME_DIST(), PERCENT_RANK() and NTILE()
row_number_result rank_result dense_rank_result cume_dist_result percent_rank_result ntile2_result ntile3_result ntile4_result ntile7_result month_key date_key production_amount
1 1 1 0.1 0.0 1 1 1 1 202006 20200605 200140
2 2 2 0.2 0.1111111111111111 1 1 1 1 202006 20200609 200100
3 3 3 0.3 0.2222222222222222 1 1 1 2 202006 20200607 198420
4 4 4 0.5 0.3333333333333333 1 1 2 2 202006 20200601 189000
5 4 4 0.5 0.3333333333333333 1 2 2 3 202006 20200603 189000
6 6 5 0.6 0.5555555555555556 2 2 2 3 202006 20200602 185000
7 7 6 0.9 0.6666666666666666 2 2 3 4 202006 20200604 158500
8 7 6 0.9 0.6666666666666666 2 3 3 5 202006 20200606 158500
9 7 6 0.9 0.6666666666666666 2 3 4 6 202006 20200608 158500
10 10 7 1.0 1.0 2 3 4 7 202006 20200610 135480
1 1 1 0.1 0.0 1 1 1 1 202005 20200509 200000
2 2 2 0.3 0.1111111111111111 1 1 1 1 202005 20200501 199000
3 2 2 0.3 0.1111111111111111 1 1 1 2 202005 20200503 199000
4 4 3 0.4 0.3333333333333333 1 1 2 2 202005 20200510 198540
5 5 4 0.5 0.4444444444444444 1 2 2 3 202005 20200505 196540
6 6 5 0.6 0.5555555555555556 2 2 2 3 202005 20200508 189462
7 7 6 0.7 0.6666666666666666 2 2 3 4 202005 20200502 185000
8 8 7 0.8 0.7777777777777778 2 3 3 5 202005 20200507 159840
9 9 8 1.0 0.8888888888888888 2 3 4 6 202005 20200504 138500
10 9 8 1.0 0.8888888888888888 2 3 4 7 202005 20200506 138500
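  As mentioned in the NTILE description above, a typical use is taking the best figure within the first 1/n of a period. A minimal sketch against the sample table, assuming n = 3 slices by date within each month and m = 1 (the earliest third of the month):

-- Best daily amount within the earliest third of each month
select t.month_key,
       max(t.production_amount) as best_amount_in_first_third
from (
    select month_key, date_key, production_amount,
           ntile(3) over(partition by month_key order by date_key) as time_slice
    from dw.ods_sale_order_producttion_amount
) t
where t.time_slice <= 1
group by t.month_key;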

Basic Window Aggregate Functions

  • COUNT
    Usage: COUNT(*) OVER([PARTITION BY col1,col2,…])
    Returns: the total number of rows in each partition defined by the partition columns;
  • SUM
    Usage: SUM(coln) OVER([PARTITION BY col1,col2,…])
    Returns: the sum of coln over each partition;
  • MIN
    Usage: MIN(coln) OVER([PARTITION BY col1,col2,…])
    Returns: the minimum of coln over each partition;
  • MAX
    Usage: MAX(coln) OVER([PARTITION BY col1,col2,…])
    Returns: the maximum of coln over each partition;
  • AVG
    Usage: AVG(coln) OVER([PARTITION BY col1,col2,…])
    Returns: the average of coln over each partition;
    The Hive CLI run for COUNT, SUM, MIN, MAX and AVG is shown below;
hive> set hive.cli.print.header=true;
hive> select month_key,date_key,production_amount
    >       ,count(3) over(partition by month_key) as `每月份数`
    >       ,sum(production_amount) OVER(partition by month_key) as `每月总和`
    >       ,min(production_amount) OVER(partition by month_key) as `每月最小收入`
    >       ,max(production_amount) OVER(partition by month_key) as `每月最大收入`
    >       ,avg(production_amount) OVER(partition by month_key) as `每月平均收入`
    > from dw.ods_sale_order_producttion_amount
    > ;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hadoop_20200729105357_e57ed4d0-f210-431c-8889-5a138cbb0032
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1592876386879_0139, Tracking URL = http://dw-test-cluster-007:8088/proxy/application_1592876386879_0139/
Kill Command = /usr/local/tools/hadoop/current//bin/hadoop job  -kill job_1592876386879_0139
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-07-29 10:54:10,089 Stage-1 map = 0%,  reduce = 0%
2020-07-29 10:54:19,185 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.93 sec
2020-07-29 10:54:25,548 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 8.35 sec
MapReduce Total cumulative CPU time: 8 seconds 350 msec
Ended Job = job_1592876386879_0139
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 8.35 sec   HDFS Read: 18912 HDFS Write: 1807 SUCCESS
Total MapReduce CPU Time Spent: 8 seconds 350 msec
OK
month_key       date_key        production_amount       每月份数        每月总和        每月最小收入    每月最大收入    每月平均收入
202005  20200501        199000.00       10      1804382.00      138500.00       200000.00       180438.200000
202005  20200510        198540.00       10      1804382.00      138500.00       200000.00       180438.200000
202005  20200509        200000.00       10      1804382.00      138500.00       200000.00       180438.200000
202005  20200508        189462.00       10      1804382.00      138500.00       200000.00       180438.200000
202005  20200507        159840.00       10      1804382.00      138500.00       200000.00       180438.200000
202005  20200506        138500.00       10      1804382.00      138500.00       200000.00       180438.200000
202005  20200505        196540.00       10      1804382.00      138500.00       200000.00       180438.200000
202005  20200504        138500.00       10      1804382.00      138500.00       200000.00       180438.200000
202005  20200503        199000.00       10      1804382.00      138500.00       200000.00       180438.200000
202005  20200502        185000.00       10      1804382.00      138500.00       200000.00       180438.200000
202006  20200610        135480.00       10      1772640.00      135480.00       200140.00       177264.000000
202006  20200609        200100.00       10      1772640.00      135480.00       200140.00       177264.000000
202006  20200608        158500.00       10      1772640.00      135480.00       200140.00       177264.000000
202006  20200607        198420.00       10      1772640.00      135480.00       200140.00       177264.000000
202006  20200606        158500.00       10      1772640.00      135480.00       200140.00       177264.000000
202006  20200605        200140.00       10      1772640.00      135480.00       200140.00       177264.000000
202006  20200604        158500.00       10      1772640.00      135480.00       200140.00       177264.000000
202006  20200603        189000.00       10      1772640.00      135480.00       200140.00       177264.000000
202006  20200602        185000.00       10      1772640.00      135480.00       200140.00       177264.000000
202006  20200601        189000.00       10      1772640.00      135480.00       200140.00       177264.000000
Time taken: 29.364 seconds, Fetched: 20 row(s)
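  A natural follow-up to SUM(...) OVER(PARTITION BY ...) is turning each day's amount into its share of the monthly total; a minimal sketch against the same table:

-- Each day's share of its month's total sales
select month_key, date_key, production_amount,
       round(production_amount / sum(production_amount) over(partition by month_key), 4) as month_share
from dw.ods_sale_order_producttion_amount;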

  • LEAD
    Usage: LEAD(col,n,DEFAULT) returns the value n rows after the current row within the window. The first argument is the column, the second is the offset n (optional, default 1), and the third is the default to use when the row n positions ahead does not exist (if not specified, NULL is returned).
    Returns: the value of col from the row n positions below the current row within each partition;
  • LAG
    Usage: LAG(col,n,DEFAULT) returns the value n rows before the current row within the window. The first argument is the column, the second is the offset n (optional, default 1), and the third is the default to use when the row n positions back does not exist (if not specified, NULL is returned).
    Returns: the value of col from the row n positions above the current row within each partition;
  • FIRST_VALUE
    Usage: FIRST_VALUE(coln) OVER (PARTITION BY col1,col2 ORDER BY col1,col2)
    Returns: the value of coln from the first row of the window within each partition under the given ordering;
  • LAST_VALUE
    Usage: LAST_VALUE(coln) OVER (PARTITION BY col1,col2 ORDER BY col1,col2)
    Returns: the value of coln from the last row of the window up to the current row within each partition under the given ordering;
    The Hive CLI run for LEAD, LAG, FIRST_VALUE and LAST_VALUE:
hive> set hive.cli.print.header=true;
hive> select month_key,date_key,production_amount
    >       ,FIRST_VALUE(production_amount) OVER(partition by month_key order by date_key) as `月初数据`
    >       ,LAST_VALUE(production_amount) OVER(partition by month_key)  as `总月末数据`
    >       ,LAST_VALUE(production_amount) OVER(partition by month_key order by date_key)  as `截至当前月末数据`
    >       ,LAG(date_key)  OVER(partition by month_key order by date_key)             as `LAG默认参数`
    >       ,LAG(date_key,1)  OVER(partition by month_key order by date_key) as `LAG取date_key值向上移动1`
    >       ,LAG(date_key,2,'19000101')  OVER(partition by month_key order by date_key) as `LAG取date_key值向上移动2位,取到null19000101代替`
    >       ,LEAD(date_key)   OVER(partition by month_key order by date_key)               as `LEAD默认参数`
    >       ,LEAD(date_key,1)  OVER(partition by month_key order by date_key)   as `LEAD取date_key值向下移动1位`
    >       ,LEAD(date_key,2,'19000101')   OVER(partition by month_key order by date_key)  as `LEAD取date_key值向下移动2位,取到null用19000101代替`
    > from dw.ods_sale_order_producttion_amount
    > ;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hadoop_20200729113052_9e57c0fe-5c46-4589-9406-f394787885bc
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1592876386879_0144, Tracking URL = http://dw-test-cluster-007:8088/proxy/application_1592876386879_0144/
Kill Command = /usr/local/tools/hadoop/current//bin/hadoop job  -kill job_1592876386879_0144
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-07-29 11:30:59,931 Stage-1 map = 0%,  reduce = 0%
2020-07-29 11:31:06,206 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.27 sec
2020-07-29 11:31:12,498 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.62 sec
MapReduce Total cumulative CPU time: 7 seconds 620 msec
Ended Job = job_1592876386879_0144
Launching Job 2 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1592876386879_0145, Tracking URL = http://dw-test-cluster-007:8088/proxy/application_1592876386879_0145/
Kill Command = /usr/local/tools/hadoop/current//bin/hadoop job  -kill job_1592876386879_0145
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2020-07-29 11:31:24,159 Stage-2 map = 0%,  reduce = 0%
2020-07-29 11:31:30,442 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 2.7 sec
2020-07-29 11:31:36,698 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 4.97 sec
MapReduce Total cumulative CPU time: 4 seconds 970 msec
Ended Job = job_1592876386879_0145
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 7.62 sec   HDFS Read: 20982 HDFS Write: 2084 SUCCESS
Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 4.97 sec   HDFS Read: 21609 HDFS Write: 2499 SUCCESS
Total MapReduce CPU Time Spent: 12 seconds 590 msec
OK
month_key       date_key        production_amount       月初数据        总月末数据      截至当前月末数据        lag默认参数     lag取date_key值向上移动1位      lag取date_key值向上移动2位,取到null19000101代替   lead默认参数    lead取date_key值向下移动1位     lead取date_key值向下移动2位,取到null用19000101代替
202005  20200501        199000.00       199000.00       185000.00       199000.00       NULL    NULL    19000101        20200502        20200502        20200503
202005  20200510        198540.00       199000.00       185000.00       198540.00       20200509        20200509        20200508        NULL    NULL    19000101
202005  20200509        200000.00       199000.00       185000.00       200000.00       20200508        20200508        20200507        20200510        20200510        19000101
202005  20200508        189462.00       199000.00       185000.00       189462.00       20200507        20200507        20200506        20200509        20200509        20200510
202005  20200507        159840.00       199000.00       185000.00       159840.00       20200506        20200506        20200505        20200508        20200508        20200509
202005  20200506        138500.00       199000.00       185000.00       138500.00       20200505        20200505        20200504        20200507        20200507        20200508
202005  20200505        196540.00       199000.00       185000.00       196540.00       20200504        20200504        20200503        20200506        20200506        20200507
202005  20200504        138500.00       199000.00       185000.00       138500.00       20200503        20200503        20200502        20200505        20200505        20200506
202005  20200503        199000.00       199000.00       185000.00       199000.00       20200502        20200502        20200501        20200504        20200504        20200505
202005  20200502        185000.00       199000.00       185000.00       185000.00       20200501        20200501        19000101        20200503        20200503        20200504
202006  20200610        135480.00       189000.00       189000.00       135480.00       20200609        20200609        20200608        NULL    NULL    19000101
202006  20200609        200100.00       189000.00       189000.00       200100.00       20200608        20200608        20200607        20200610        20200610        19000101
202006  20200608        158500.00       189000.00       189000.00       158500.00       20200607        20200607        20200606        20200609        20200609        20200610
202006  20200607        198420.00       189000.00       189000.00       198420.00       20200606        20200606        20200605        20200608        20200608        20200609
202006  20200606        158500.00       189000.00       189000.00       158500.00       20200605        20200605        20200604        20200607        20200607        20200608
202006  20200605        200140.00       189000.00       189000.00       200140.00       20200604        20200604        20200603        20200606        20200606        20200607
202006  20200604        158500.00       189000.00       189000.00       158500.00       20200603        20200603        20200602        20200605        20200605        20200606
202006  20200603        189000.00       189000.00       189000.00       189000.00       20200602        20200602        20200601        20200604        20200604        20200605
202006  20200602        185000.00       189000.00       189000.00       185000.00       20200601        20200601        19000101        20200603        20200603        20200604
202006  20200601        189000.00       189000.00       189000.00       189000.00       NULL    NULL    19000101        20200602        20200602        20200603
Time taken: 45.744 seconds, Fetched: 20 row(s)

For LAG and LEAD, the second argument defaults to 1 and the third to NULL.
In OVER(PARTITION BY col1 ORDER BY col2), adding the finer-grained ORDER BY column makes the window stop at the current row, which is why 截至当前月末数据 above changes row by row; this is covered in more detail in the extension section below. Note also that LAST_VALUE without an ORDER BY (总月末数据 above) just returns the last row of the partition in whatever physical order the rows happen to arrive, not necessarily the true month-end value, so it is normally paired with an ORDER BY and an explicit window frame.
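  A typical use of LAG together with ORDER BY is day-over-day comparison. A minimal sketch that computes each day's change versus the previous day within its month (the first day of a month has no previous row, so its change is NULL):

-- Day-over-day change of production_amount within each month
select month_key, date_key, production_amount,
       production_amount
         - lag(production_amount) over(partition by month_key order by date_key) as dod_change
from dw.ods_sale_order_producttion_amount;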

Basic Window Aggregate Functions, Extended

  Another common use is the running total, running average, running maximum and so on for each month up to the current row. Taking the running total as the example, this is what we usually call MTD (month-to-date) figures: order the rows by amount from smallest to largest and sum everything up to the current row, as illustrated in the figure below:

Figure 1  Cumulative windowing with RANGE or ROWS

  RANGE or ROWS can be combined with the window for this. RANGE and ROWS do differ, even if the difference is subtle in this example: for rows that share the same PARTITION BY and ORDER BY values, RANGE produces the same final result (see the tied 199000.00, 189000.00 and 158500.00 rows in the output below), while ROWS does not. The Hive CLI code and output are as follows;


hive> set hive.cli.print.header=true;
hive> select month_key,date_key,production_amount
  >      ,sum(production_amount) OVER(partition by month_key order by production_amount RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as `RANGE每月总和`
  >      ,sum(production_amount) OVER(partition by month_key order by production_amount ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as `ROWS每月总和`
  >
  >      ,min(production_amount) OVER(partition by month_key order by production_amount RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as `RANGE每月最小收入`
  >      ,max(production_amount) OVER(partition by month_key order by production_amount RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as `RANGE每月最大收入`
  >      ,avg(production_amount) OVER(partition by month_key order by production_amount RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as `RANGE每月平均收入`
  >      ,min(production_amount) OVER(partition by month_key order by production_amount ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as `ROWS每月最小收入`
  >      ,max(production_amount) OVER(partition by month_key order by production_amount ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as `ROWS每月最大收入`
  >      ,avg(production_amount) OVER(partition by month_key order by production_amount ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as `ROWS每月平均收入`
  > from dw.ods_sale_order_producttion_amount
  > ;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hadoop_20200729151835_4e818a96-4447-4fad-aae5-8e14371e2ba6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1592876386879_0152, Tracking URL = http://dw-test-cluster-007:8088/proxy/application_1592876386879_0152/
Kill Command = /usr/local/tools/hadoop/current//bin/hadoop job  -kill job_1592876386879_0152
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-07-29 15:18:43,001 Stage-1 map = 0%,  reduce = 0%
2020-07-29 15:18:53,853 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.91 sec
2020-07-29 15:19:01,158 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 8.56 sec
MapReduce Total cumulative CPU time: 8 seconds 560 msec
Ended Job = job_1592876386879_0152
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 8.56 sec   HDFS Read: 22462 HDFS Write: 2646 SUCCESS
Total MapReduce CPU Time Spent: 8 seconds 560 msec
OK
month_key       date_key        production_amount       range每月总和   rows每月总和    range每月最小收入       range每月最大收入       range每月平均收入       rows每月最小收入        rows每月最大收入    rows每月平均收入
202005  20200504        138500.00       277000.00       138500.00       138500.00       138500.00       138500.000000   138500.00       138500.00       138500.000000
202005  20200506        138500.00       277000.00       277000.00       138500.00       138500.00       138500.000000   138500.00       138500.00       138500.000000
202005  20200507        159840.00       436840.00       436840.00       138500.00       159840.00       145613.333333   138500.00       159840.00       145613.333333
202005  20200502        185000.00       621840.00       621840.00       138500.00       185000.00       155460.000000   138500.00       185000.00       155460.000000
202005  20200508        189462.00       811302.00       811302.00       138500.00       189462.00       162260.400000   138500.00       189462.00       162260.400000
202005  20200505        196540.00       1007842.00      1007842.00      138500.00       196540.00       167973.666667   138500.00       196540.00       167973.666667
202005  20200510        198540.00       1206382.00      1206382.00      138500.00       198540.00       172340.285714   138500.00       198540.00       172340.285714
202005  20200501        199000.00       1604382.00      1405382.00      138500.00       199000.00       178264.666667   138500.00       199000.00       175672.750000
202005  20200503        199000.00       1604382.00      1604382.00      138500.00       199000.00       178264.666667   138500.00       199000.00       178264.666667
202005  20200509        200000.00       1804382.00      1804382.00      138500.00       200000.00       180438.200000   138500.00       200000.00       180438.200000
202006  20200610        135480.00       135480.00       135480.00       135480.00       135480.00       135480.000000   135480.00       135480.00       135480.000000
202006  20200604        158500.00       610980.00       293980.00       135480.00       158500.00       152745.000000   135480.00       158500.00       146990.000000
202006  20200606        158500.00       610980.00       452480.00       135480.00       158500.00       152745.000000   135480.00       158500.00       150826.666667
202006  20200608        158500.00       610980.00       610980.00       135480.00       158500.00       152745.000000   135480.00       158500.00       152745.000000
202006  20200602        185000.00       795980.00       795980.00       135480.00       185000.00       159196.000000   135480.00       185000.00       159196.000000
202006  20200601        189000.00       1173980.00      984980.00       135480.00       189000.00       167711.428571   135480.00       189000.00       164163.333333
202006  20200603        189000.00       1173980.00      1173980.00      135480.00       189000.00       167711.428571   135480.00       189000.00       167711.428571
202006  20200607        198420.00       1372400.00      1372400.00      135480.00       198420.00       171550.000000   135480.00       198420.00       171550.000000
202006  20200609        200100.00       1572500.00      1572500.00      135480.00       200100.00       174722.222222   135480.00       200100.00       174722.222222
202006  20200605        200140.00       1772640.00      1772640.00      135480.00       200140.00       177264.000000   135480.00       200140.00       177264.000000
Time taken: 26.882 seconds, Fetched: 20 row(s)

   The common frame parameters that follow ROWS and RANGE are listed in Table 2:

Table 2  Common frame parameters for ROWS and RANGE
UNBOUNDED PRECEDING    The window starts at the first row of the partition.
UNBOUNDED FOLLOWING    The window ends at the last row of the partition.
CURRENT ROW    The window begins or ends at the current row.
n PRECEDING or n FOLLOWING    The window starts or ends n rows before or after the current row. For example, ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING means the window runs from the first row of the partition to the row that stands (in the ordered set) immediately before the current row.
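  To illustrate the n PRECEDING form from Table 2, a minimal sketch of a 3-day moving average (the current day plus the two preceding days, ordered by date within each month; the first one or two days of a month simply average over fewer rows):

-- 3-day moving average using ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
select month_key, date_key, production_amount,
       avg(production_amount) over(partition by month_key order by date_key
                                   rows between 2 preceding and current row) as moving_avg_3d
from dw.ods_sale_order_producttion_amount;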

OLAP Functions

  Hive also supports OLAP-style analysis, rolling metrics up and drilling them down across different combinations of dimensions; the commonly used OLAP functions are as follows;

  • GROUPING SETS and GROUPING__ID
    Returns: within a single GROUP BY query, aggregates over several different dimension combinations, which is equivalent to a UNION ALL over the individual GROUP BY result sets; GROUPING__ID indicates which grouping set a result row belongs to;
    Hive CLI statement and output: as follows;
hive> set hive.cli.print.header=true;
hive> select month_key,date_key
    >       ,sum(production_amount) `生产总值`
    >       ,GROUPING__ID
    > from     dw.ods_sale_order_producttion_amount
    > group by  month_key,date_key
    > grouping sets(month_key,date_key)
    > order by GROUPING__ID
    > ;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hadoop_20200729154417_180a7883-ed66-49e3-a52e-bdebfbf61507
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1592876386879_0156, Tracking URL = http://dw-test-cluster-007:8088/proxy/application_1592876386879_0156/
Kill Command = /usr/local/tools/hadoop/current//bin/hadoop job  -kill job_1592876386879_0156
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-07-29 15:44:24,616 Stage-1 map = 0%,  reduce = 0%
2020-07-29 15:44:32,981 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.61 sec
2020-07-29 15:44:38,197 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.62 sec
MapReduce Total cumulative CPU time: 7 seconds 620 msec
Ended Job = job_1592876386879_0156
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1592876386879_0157, Tracking URL = http://dw-test-cluster-007:8088/proxy/application_1592876386879_0157/
Kill Command = /usr/local/tools/hadoop/current//bin/hadoop job  -kill job_1592876386879_0157
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2020-07-29 15:44:49,818 Stage-2 map = 0%,  reduce = 0%
2020-07-29 15:44:56,078 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 2.34 sec
2020-07-29 15:45:02,288 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 4.51 sec
MapReduce Total cumulative CPU time: 4 seconds 510 msec
Ended Job = job_1592876386879_0157
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 7.62 sec   HDFS Read: 11977 HDFS Write: 796 SUCCESS
Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 4.51 sec   HDFS Read: 7133 HDFS Write: 877 SUCCESS
Total MapReduce CPU Time Spent: 12 seconds 130 msec
OK
month_key       date_key        生产总值        grouping__id
202005  NULL    1804382.00      1
202006  NULL    1772640.00      1
NULL    20200601        189000.00       2
NULL    20200610        135480.00       2
NULL    20200609        200100.00       2
NULL    20200608        158500.00       2
NULL    20200607        198420.00       2
NULL    20200606        158500.00       2
NULL    20200605        200140.00       2
NULL    20200604        158500.00       2
NULL    20200603        189000.00       2
NULL    20200602        185000.00       2
NULL    20200510        198540.00       2
NULL    20200509        200000.00       2
NULL    20200508        189462.00       2
NULL    20200507        159840.00       2
NULL    20200506        138500.00       2
NULL    20200505        196540.00       2
NULL    20200504        138500.00       2
NULL    20200503        199000.00       2
NULL    20200502        185000.00       2
NULL    20200501        199000.00       2
Time taken: 45.797 seconds, Fetched: 22 row(s)

Result analysis: the query computes the metric at different grouping levels; the rows with GROUPING__ID = 1 are grouped by month_key and those with GROUPING__ID = 2 by date_key, each carrying its own 生产总值. Note that GROUPING__ID is spelled with two underscores. Its value here behaves as a bitmask over the GROUP BY columns, with the first column as the most significant bit and a bit set to 1 when that column is aggregated away (shows as NULL), so grouping by month_key only gives binary 01 = 1 and grouping by date_key only gives binary 10 = 2.
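  Since GROUPING SETS is equivalent to a UNION ALL of the individual GROUP BY result sets, the query above can be sketched as the following rewrite (GROUPING__ID is emulated here with literals):

-- Equivalent UNION ALL rewrite of grouping sets(month_key, date_key)
select month_key, cast(null as string) as date_key, sum(production_amount) as total_amount, 1 as grouping_id
from dw.ods_sale_order_producttion_amount
group by month_key
union all
select cast(null as string) as month_key, date_key, sum(production_amount) as total_amount, 2 as grouping_id
from dw.ods_sale_order_producttion_amount
group by date_key;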

  • CUBE
    Returns: aggregates over every possible combination of the GROUP BY dimensions;
    Hive CLI statement and output: as follows;
hive> set hive.cli.print.header=true;
hive> select month_key,date_key
    >       ,sum(production_amount) `生产总值`
    >       ,GROUPING__ID
    > from     dw.ods_sale_order_producttion_amount
    > group by  month_key,date_key
    > WITH CUBE
    > ORDER BY GROUPING__ID;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hadoop_20200729155154_4b3bf262-89b1-4834-a6b9-a8ed0a9eafda
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1592876386879_0159, Tracking URL = http://dw-test-cluster-007:8088/proxy/application_1592876386879_0159/
Kill Command = /usr/local/tools/hadoop/current//bin/hadoop job  -kill job_1592876386879_0159
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-07-29 15:52:14,628 Stage-1 map = 0%,  reduce = 0%
2020-07-29 15:52:20,895 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.17 sec
2020-07-29 15:52:27,147 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.0 sec
MapReduce Total cumulative CPU time: 7 seconds 0 msec
Ended Job = job_1592876386879_0159
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1592876386879_0161, Tracking URL = http://dw-test-cluster-007:8088/proxy/application_1592876386879_0161/
Kill Command = /usr/local/tools/hadoop/current//bin/hadoop job  -kill job_1592876386879_0161
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2020-07-29 15:53:01,847 Stage-2 map = 0%,  reduce = 0%
2020-07-29 15:53:08,105 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 2.92 sec
2020-07-29 15:53:14,350 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 5.13 sec
MapReduce Total cumulative CPU time: 5 seconds 130 msec
Ended Job = job_1592876386879_0161
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 7.0 sec   HDFS Read: 12026 HDFS Write: 1599 SUCCESS
Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 5.13 sec   HDFS Read: 7936 HDFS Write: 1708 SUCCESS
Total MapReduce CPU Time Spent: 12 seconds 130 msec
OK
month_key       date_key        生产总值        grouping__id
202005  20200501        199000.00       0
202006  20200610        135480.00       0
202006  20200608        158500.00       0
202006  20200607        198420.00       0
202006  20200606        158500.00       0
202006  20200605        200140.00       0
202006  20200604        158500.00       0
202006  20200603        189000.00       0
202006  20200602        185000.00       0
202006  20200601        189000.00       0
202006  20200609        200100.00       0
202005  20200510        198540.00       0
202005  20200509        200000.00       0
202005  20200508        189462.00       0
202005  20200507        159840.00       0
202005  20200506        138500.00       0
202005  20200505        196540.00       0
202005  20200504        138500.00       0
202005  20200503        199000.00       0
202005  20200502        185000.00       0
202005  NULL    1804382.00      1
202006  NULL    1772640.00      1
NULL    20200610        135480.00       2
NULL    20200609        200100.00       2
NULL    20200608        158500.00       2
NULL    20200607        198420.00       2
NULL    20200606        158500.00       2
NULL    20200605        200140.00       2
NULL    20200604        158500.00       2
NULL    20200603        189000.00       2
NULL    20200602        185000.00       2
NULL    20200601        189000.00       2
NULL    20200510        198540.00       2
NULL    20200509        200000.00       2
NULL    20200508        189462.00       2
NULL    20200507        159840.00       2
NULL    20200506        138500.00       2
NULL    20200505        196540.00       2
NULL    20200504        138500.00       2
NULL    20200503        199000.00       2
NULL    20200502        185000.00       2
NULL    20200501        199000.00       2
NULL    NULL    3577022.00      3
Time taken: 80.686 seconds, Fetched: 43 row(s)

Result analysis: WITH CUBE aggregates over every combination of the GROUP BY dimensions, producing the metric at each level: per month and day (GROUPING__ID 0), per month (1), per day (2), and the grand total (3);
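  WITH CUBE over (month_key, date_key) is shorthand for enumerating all four grouping combinations; a sketch of the equivalent GROUPING SETS form:

-- WITH CUBE over two columns equals all four grouping sets, including the empty one
select month_key, date_key, sum(production_amount) as total_amount, GROUPING__ID
from dw.ods_sale_order_producttion_amount
group by month_key, date_key
grouping sets((month_key, date_key), (month_key), (date_key), ())
order by GROUPING__ID;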

  • ROLLUP
    Returns: a subset of CUBE that aggregates hierarchically starting from the leftmost dimension; subtotals are produced only along prefixes of the GROUP BY column list, so only the first column after GROUP BY gets its own subtotal level;
    Hive CLI statement and output: shown below twice, once with month_key written first in the GROUP BY and once with date_key first, which makes the behaviour obvious;
-- month_key leftmost in the GROUP BY
hive> set hive.cli.print.header=true;
hive> select month_key,date_key
    >       ,sum(production_amount) `生产总值`
    >       ,GROUPING__ID
    > from     dw.ods_sale_order_producttion_amount
    > group by  month_key,date_key
    > WITH ROLLUP
    > ORDER BY GROUPING__ID;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hadoop_20200729155836_b7c2669b-9763-469f-8337-03cc5b71f68e
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1592876386879_0163, Tracking URL = http://dw-test-cluster-007:8088/proxy/application_1592876386879_0163/
Kill Command = /usr/local/tools/hadoop/current//bin/hadoop job  -kill job_1592876386879_0163
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-07-29 15:58:56,098 Stage-1 map = 0%,  reduce = 0%
2020-07-29 15:59:03,410 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.12 sec
2020-07-29 15:59:08,635 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.25 sec
MapReduce Total cumulative CPU time: 7 seconds 250 msec
Ended Job = job_1592876386879_0163
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1592876386879_0165, Tracking URL = http://dw-test-cluster-007:8088/proxy/application_1592876386879_0165/
Kill Command = /usr/local/tools/hadoop/current//bin/hadoop job  -kill job_1592876386879_0165
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2020-07-29 15:59:42,370 Stage-2 map = 0%,  reduce = 0%
2020-07-29 15:59:48,660 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 2.13 sec
2020-07-29 15:59:53,878 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 4.1 sec
MapReduce Total cumulative CPU time: 4 seconds 100 msec
Ended Job = job_1592876386879_0165
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 7.25 sec   HDFS Read: 12022 HDFS Write: 959 SUCCESS
Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 4.1 sec   HDFS Read: 7296 HDFS Write: 988 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 350 msec
OK
month_key       date_key        生产总值        grouping__id
202006  20200610        135480.00       0
202006  20200609        200100.00       0
202006  20200608        158500.00       0
202006  20200607        198420.00       0
202006  20200606        158500.00       0
202006  20200605        200140.00       0
202006  20200604        158500.00       0
202006  20200603        189000.00       0
202006  20200602        185000.00       0
202006  20200601        189000.00       0
202005  20200510        198540.00       0
202005  20200509        200000.00       0
202005  20200508        189462.00       0
202005  20200507        159840.00       0
202005  20200506        138500.00       0
202005  20200505        196540.00       0
202005  20200504        138500.00       0
202005  20200503        199000.00       0
202005  20200502        185000.00       0
202005  20200501        199000.00       0
202005  NULL    1804382.00      1
202006  NULL    1772640.00      1
NULL    NULL    3577022.00      3
Time taken: 78.751 seconds, Fetched: 23 row(s)

-- date_key leftmost in the GROUP BY

hive> set hive.cli.print.header=true;
hive> select month_key,date_key
    >       ,sum(production_amount) `生产总值`
    >       ,GROUPING__ID
    > from     dw.ods_sale_order_producttion_amount
    > group by  date_key,month_key
    > WITH ROLLUP
    > ORDER BY GROUPING__ID;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hadoop_20200729160309_ad0a688d-edd1-4017-ab14-3407c90d2054
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1592876386879_0166, Tracking URL = http://dw-test-cluster-007:8088/proxy/application_1592876386879_0166/
Kill Command = /usr/local/tools/hadoop/current//bin/hadoop job  -kill job_1592876386879_0166
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-07-29 16:03:19,204 Stage-1 map = 0%,  reduce = 0%
2020-07-29 16:03:25,534 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.87 sec
2020-07-29 16:03:30,752 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.77 sec
MapReduce Total cumulative CPU time: 6 seconds 770 msec
Ended Job = job_1592876386879_0166
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1592876386879_0167, Tracking URL = http://dw-test-cluster-007:8088/proxy/application_1592876386879_0167/
Kill Command = /usr/local/tools/hadoop/current//bin/hadoop job  -kill job_1592876386879_0167
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2020-07-29 16:03:41,661 Stage-2 map = 0%,  reduce = 0%
2020-07-29 16:03:47,991 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 3.47 sec
2020-07-29 16:03:53,245 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 5.59 sec
MapReduce Total cumulative CPU time: 5 seconds 590 msec
Ended Job = job_1592876386879_0167
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 6.77 sec   HDFS Read: 11941 HDFS Write: 1539 SUCCESS
Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 5.59 sec   HDFS Read: 7876 HDFS Write: 1638 SUCCESS
Total MapReduce CPU Time Spent: 12 seconds 360 msec
OK
month_key       date_key        生产总值        grouping__id
202006  20200610        135480.00       0
202006  20200609        200100.00       0
202006  20200608        158500.00       0
202006  20200607        198420.00       0
202006  20200606        158500.00       0
202006  20200605        200140.00       0
202006  20200604        158500.00       0
202006  20200603        189000.00       0
202006  20200602        185000.00       0
202006  20200601        189000.00       0
202005  20200510        198540.00       0
202005  20200509        200000.00       0
202005  20200508        189462.00       0
202005  20200507        159840.00       0
202005  20200506        138500.00       0
202005  20200505        196540.00       0
202005  20200504        138500.00       0
202005  20200503        199000.00       0
202005  20200502        185000.00       0
202005  20200501        199000.00       0
NULL    20200505        196540.00       1
NULL    20200510        198540.00       1
NULL    20200610        135480.00       1
NULL    20200502        185000.00       1
NULL    20200609        200100.00       1
NULL    20200509        200000.00       1
NULL    20200608        158500.00       1
NULL    20200504        138500.00       1
NULL    20200607        198420.00       1
NULL    20200508        189462.00       1
NULL    20200606        158500.00       1
NULL    20200605        200140.00       1
NULL    20200507        159840.00       1
NULL    20200604        158500.00       1
NULL    20200503        199000.00       1
NULL    20200603        189000.00       1
NULL    20200506        138500.00       1
NULL    20200602        185000.00       1
NULL    20200501        199000.00       1
NULL    20200601        189000.00       1
NULL    NULL    3577022.00      3
Time taken: 44.948 seconds, Fetched: 41 row(s)

Result analysis: the hierarchy follows the leftmost dimension, so with month_key first you get the per-(month, day) rows, the per-month subtotals and the grand total, while with date_key first you get per-day subtotals instead; the non-leading column never gets its own subtotal level;
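  In other words, WITH ROLLUP over (month_key, date_key) is shorthand for the prefix hierarchy of grouping sets below, which is why only the leftmost column gets its own subtotal level; a sketch of the equivalent form:

-- WITH ROLLUP over (month_key, date_key) equals this prefix hierarchy of grouping sets
select month_key, date_key, sum(production_amount) as total_amount, GROUPING__ID
from dw.ods_sale_order_producttion_amount
group by month_key, date_key
grouping sets((month_key, date_key), (month_key), ())
order by GROUPING__ID;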

Advanced Window Aggregate Functions

  Window functions also support more advanced statistical analysis, such as correlation and linear regression, and standard deviation and variance. Since they are not used that often they are not listed one by one here; when you need them, check the function signatures in the reference (link: 复杂窗口函数) and rewrite them as HiveQL.
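  For reference, a minimal sketch of two such statistical aggregates used as window functions over the sample table. stddev_pop and var_pop are Hive built-in aggregates; whether they are accepted inside an OVER clause, and whether the population or sample variants fit, may depend on your Hive version and analysis.

-- Per-month standard deviation and variance of daily sales, attached to every row
select month_key, date_key, production_amount,
       stddev_pop(production_amount) over(partition by month_key) as month_stddev,
       var_pop(production_amount)    over(partition by month_key) as month_var
from dw.ods_sale_order_producttion_amount;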
