hive sql——窗口函数使用实战小结

文章目录

    • 一、什么是窗口函数
    • 二、窗口函数介绍
    • 三、窗口函数基本语法
    • 四、理解Window子句
      • 1. 理解下什么是`WINDOW`子句(灵活控制窗口的子集)
      • 2. 官网有一段话列出了如下窗口函数是不支持window子句的:
    • Hive窗口函数实例:
      • 聚合函数
        • 实例表准备:
        • 数据准备
          • SUM()
          • COUNT()
          • MAX()
          • MIN()
          • AVG()
      • 排序函数
        • 实例表准备
          • 数据准备
          • NTILE()
          • ROW_NUMBER()\RANK()\DENSE_RANK()
      • 比例函数
        • CUME_DIST()
        • PERCENT_RANK()
      • 取值函数
        • LAG
        • LEAD()
        • FIRST_VALUE()、LAST_VALUE()
      • 增强GROUP
        • GROUPING SETS
        • CUBE
        • ROLLUP

hive推出的窗口函数功能是对hive sql的功能增强,确实目前用于离线数据分析逻辑日趋复杂,很多场景都需要用到。用于实现分组内所有和连续累积的统计。

一、什么是窗口函数

1、窗口函数指定了函数工作的数据窗口大小(当前行的上下多少行),这个数据窗口大小可能会随着行的变化而变化。
2、窗口函数对于每个组返回多行,组内每一行对应返回一行值。

二、窗口函数介绍

()、聚合函数:
	1.sum(col) over() :  分组对col累计求和,over() 中的语法如下
	2.count(col) over() : 分组对col累计,over() 中的语法如下
	3.min(col) over() : 分组对col求最小
	4.max(col) over() : 分组求col的最大值
	5.avg(col) over() : 分组求col列的平均值

()、取值函数:
	1.first_value(col) over() : 某分区排序后的第一个col值
	2.last_value(col) over() : 某分区排序后的最后一个col值
	3.lag(col,n,DEFAULT) : 统计往前n行的col值,n可选,默认为1DEFAULT当往上第n行为NULL时候,取默认值,如不指定,则为NULL
	4.lead(col,n,DEFAULT) : 统计往后n行的col值,n可选,默认为1DEFAULT当往下第n行为NULL时候,取默认值,如不指定,则为NULL()、排序函数:
	1.row_number() over() : 排名函数,不会重复,适合于生成主键或者不并列排名
	2.rank() over() :  排名函数,有并列名次,名次不连续。如:1,1,3
	3.dense_rank() over() : 排名函数,有并列名次,名次连续。如:114.ntile(n) : 用于将分组数据按照顺序切分成n片,返回当前切片值。注意:n必须为int类型。

()、比例函数:
	1.cume_dist() 小于等于当前值的行数/分组内总行数比如,统计小于等于当前薪水的人数,所占总人数的比例
	2.percent_rank() 计算给定行的百分比排名。分组内当前行的RANK值-1/分组内总行数-1,可以用来计算超过了百分之多少的人。
	
()、增强GROUP BY
    1.GROUPING SETS
    2.GROUPING__ID
    3.CUBE
    4.ROLLUP

三、窗口函数基本语法

<窗口函数>()
OVER
(
	[PARTITION BY <COLUMN 1 , COLUMN 2,COLUMN 3 ...>]
	[ORDER BY <排序用的清单列>][ASC/DESC]
	(ROWS | RANGE) <范围条件>
)

窗口函数的语法分为四个部分:

  • 函数子句:指明具体操作,如sum-求和,first_value-取第一个值;
  • partition by子句:指明分区字段,如果没有,则将所有数据作为一个分区;
  • order by子句:指明了每个分区排序的字段和方式,也是可选的,没有就是按照表中的顺序;
  • 窗口子句:指明相对当前记录的计算范围,可以向上(preceding),可以向下(following),也可以使用between指明,上下边界的值,没有的话默认为当前分区。ROWS BETWEEN,也叫做window子句。数字+PRECEDING 向前n条,数字+FOLLOWING 向后n条, CURRENT ROW 当前行,UNBOUNDED 无边界。 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING表示从最前面的起点开始,表示到最后面的终点,UNBOUNDED PRECEDING 向前无边界,UNBOUNDED FOLLOWING向后无边界。

四、理解Window子句

1. 理解下什么是WINDOW子句(灵活控制窗口的子集)

  • PRECEDING:往前
  • FOLLOWING:往后
  • CURRENT ROW:当前行
  • UNBOUNDED:无边界(一般结合PRECEDING,FOLLOWING使用)
  • UNBOUNDED PRECEDING:表示该窗口最前面的行(起点)
  • UNBOUNDED FOLLOWING:表示该窗口最后面的行(终点)

比如说:

  • ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW(表示从起点到当前行)
  • ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING(表示往前2行到往后1行)
  • ROWS BETWEEN 2 PRECEDING AND 1 CURRENT ROW(表示往前2行到当前行)
  • ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING(表示当前行到终点)

2. 官网有一段话列出了如下窗口函数是不支持window子句的:

  • rank(): 排名函数,有并列名次,名次不连续。如:1,1,3。
  • ntile(n): 用于将分组数据按照顺序切分成n片,返回当前切片值。注意:n必须为int类型。
  • dense_rank() over(): 排名函数,有并列名次,名次连续。如:1,1。
  • cume_dist():小于等于当前值的行数/分组内总行数比如,统计小于等于当前薪水的人数,所占总人数的比例。
  • percent_rank(): 计算给定行的百分比排名。分组内当前行的RANK值-1/分组内总行数-1,可以用来计算超过了百分之多少的人。

Hive窗口函数实例:

聚合函数

实例表准备:
CREATE EXTERNAL TABLE test.student_score (
`student_id` string,
`date_key` string,
`school_id` string,
`grade` string,
`class` string,
`score` string
) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
stored as textfile 
location '/tmp/test/student_score/';
数据准备
10001,2021-05-20,1001,初一,1,11
10002,2021-05-21,1001,初二,2,55
10003,2021-05-23,1001,初三,1,77
10004,2021-05-24,1001,初一,3,33
10005,2021-05-25,1001,初一,1,22
10006,2021-05-26,1001,初三,2,99
10007,2021-05-27,1001,初二,2,99
SUM()

SUM — 注意,结果和ORDER BY相关,默认为升序

测试代码:

select 
    school_id,
    student_id,
    score,
    -- 默认为从起点到当前行
    SUM(score) OVER(PARTITION BY school_id ORDER BY student_id) AS scores1, 
    --从起点到当前行,结果同pv1 
    SUM(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS scores2, 
    --分组内所有行
    SUM(score) OVER(PARTITION BY school_id) AS scores3,
     --当前行+往前3行                             
    SUM(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS scores4,  
    --当前行+往前3行+往后1行
    SUM(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS scores5,   
    --当前行+往后所有行  
    SUM(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS scores6    
from test.student_score;

结果:

school_id student_id score scores1 scores2 scores3 scores4 scores5 scores6
1001 10001 11 11.0 11.0 396.0 11.0 66.0 396.0
1001 10002 55 66.0 66.0 396.0 66.0 143.0 385.0
1001 10003 77 143.0 143.0 396.0 143.0 176.0 330.0
1001 10004 33 176.0 176.0 396.0 176.0 198.0 253.0
1001 10005 22 198.0 198.0 396.0 187.0 286.0 220.0
1001 10006 99 297.0 297.0 396.0 231.0 330.0 198.0
1001 10007 99 396.0 396.0 396.0 253.0 253.0 99.0

如果不指定ROWS BETWEEN,默认为从起点到当前行;
如果不指定ORDER BY,则将分组内所有值累加;

COUNT()

在Window子句上与sum()的理解不同,结果和ORDER BY相关,默认为升序

测试代码:

select 
    school_id,
    grade,
    class,
     -- 默认为从起点到当前行,明细数据会按照排序主键排序来一行加一行,如果排序主键有多行数据一样,也就是多行数据顺序一致,则会一起被加上。
    COUNT(student_id) OVER(PARTITION BY grade ORDER BY class) AS sv1,
    --从起点到当前行,在分组内,按照排序建顺序,来一行加一行,就算是排序主键一样也会一次加,不会一起被加。
    COUNT(student_id) OVER(PARTITION BY grade ORDER BY class ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS sv2, 
    --分组内所有行
    COUNT(student_id) OVER(PARTITION BY grade) AS sv3, 
    --当前行+往前3行,规则同sv2
    COUNT(student_id) OVER(PARTITION BY grade ORDER BY class ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS sv4,  
     --当前行+往前3行+往后1行,规则同sv2 
    COUNT(student_id) OVER(PARTITION BY grade ORDER BY class ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS sv5,   
     ---当前行+往后所有行,规则同sv2
    COUNT(student_id) OVER(PARTITION BY grade ORDER BY class ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS sv6  
from test.student_score;

注意:窗口函数不支持COUNT(DISTINCT xxx)操作,这种情况可以利用子查询先对数据进行去重之后再进行统计,可以参见我的文章:HIVE如何实现如何实现 COUNT(DISTINCT ) OVER (PARTITION BY )?

结果:

school_id grade class sv1 sv2 sv3 sv4 sv5 sv6
1001 初二 2 2 1 2 1 2 2
1001 初二 2 2 2 2 2 2 1
1001 初一 1 2 1 3 1 2 3
1001 初一 1 2 2 3 2 3 2
1001 初一 3 3 3 3 3 3 1
1001 初三 1 1 1 2 1 2 2
1001 初三 2 2 2 2 2 2 1
MAX()

在Window子句上与sum()的理解相同,结果和ORDER BY相关,默认为升序。

代码测试:

select 
    school_id,
    student_id,
    score,
    -- 默认为从起点到当前行
    MAX(score) OVER(PARTITION BY school_id ORDER BY student_id) AS mas1, 
    --从起点到当前行,结果同pv1 
    MAX(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS mas2, 
    --分组内所有行
    MAX(score) OVER(PARTITION BY school_id) AS mas3,
     --当前行+往前3行                             
    MAX(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS mas4,  
    --当前行+往前3行+往后1行
    MAX(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS mas5,   
    --当前行+往后所有行  
    MAX(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS mas6    
from test.student_score;

结果:

school_id student_id score mas1 mas2 mas3 mas4 mas5 mas6
1001 10001 11 11 11 99 11 55 99
1001 10002 55 55 55 99 55 77 99
1001 10003 77 77 77 99 77 77 99
1001 10004 33 77 77 99 77 77 99
1001 10005 22 77 77 99 77 99 99
1001 10006 99 99 99 99 99 99 99
1001 10007 99 99 99 99 99 99 99
MIN()

在Window子句上与sum()的理解相同,结果和ORDER BY相关,默认为升序

代码测试:

select 
    school_id,
    student_id,
    score,
    -- 默认为从起点到当前行
    MIN(score) OVER(PARTITION BY school_id ORDER BY student_id) AS mis1, 
    --从起点到当前行,结果同pv1 
    MIN(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS mis2, 
    --分组内所有行
    MIN(score) OVER(PARTITION BY school_id) AS mis3,
     --当前行+往前3行                             
    MIN(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS mis4,  
    --当前行+往前3行+往后1行
    MIN(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS mis5,   
    --当前行+往后所有行  
    MIN(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS mis6    
from test.student_score;

结果:

school_id student_id score mis1 mis2 mis3 mis4 mis5 mis6
1001 10001 11 11 11 11 11 11 11
1001 10002 55 11 11 11 11 11 22
1001 10003 77 11 11 11 11 11 22
1001 10004 33 11 11 11 11 11 22
1001 10005 22 11 11 11 22 22 22
1001 10006 99 11 11 11 22 22 99
1001 10007 99 11 11 11 22 22 99
AVG()

在Window子句上与sum()的理解相同,结果和ORDER BY相关,默认为升序

测试代码:

select 
    school_id,
    student_id,
    score,
    -- 默认为从起点到当前行
    AVG(score) OVER(PARTITION BY school_id ORDER BY student_id) AS as1, 
    --从起点到当前行,结果同pv1 
    AVG(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS as2, 
    --分组内所有行
    AVG(score) OVER(PARTITION BY school_id) AS as3,
     --当前行+往前3行                             
    AVG(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS as4,  
    --当前行+往前3行+往后1行
    AVG(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS as5,   
    --当前行+往后所有行  
    AVG(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS as6    
from test.student_score;

结果:

school_id student_id score mis1 mis2 mis3 mis4 mis5 mis6
1001 10001 11 11.0 11.0 56.57142857142857 11.0 33.0 56.57142857142857
1001 10002 55 33.0 33.0 56.57142857142857 33.0 47.666666666666664 64.16666666666667
1001 10003 77 47.666666666666664 47.666666666666664 56.57142857142857 47.666666666666664 44.0 66.0
1001 10004 33 44.0 44.0 56.57142857142857 44.0 39.6 63.25
1001 10005 22 39.6 39.6 56.57142857142857 46.75 57.2 73.33333333333333
1001 10006 99 49.5 49.5 56.57142857142857 57.75 66.0 99.0
1001 10007 99 56.57142857142857 56.57142857142857 56.57142857142857 63.25 63.25 99.0

排序函数

实例表准备
CREATE EXTERNAL TABLE test.student_score(
    `student_id` STRING, 
    `date_key` STRING, 
    `school_id` STRING, 
    `grade` STRING, 
    `class` STRING, 
    `score` STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS textfile
LOCATION '/tmp/test/student_score';
数据准备
10001,2021-05-20,1001,初一,1,11
10002,2021-05-21,1001,初二,2,55
10003,2021-05-23,1001,初三,1,77
10004,2021-05-24,1001,初一,3,33
10005,2021-05-25,1001,初一,1,22
10006,2021-05-26,1001,初三,2,99
10007,2021-05-27,1001,初二,2,99
10001,2021-05-20,1001,初一,1,22
10002,2021-05-21,1001,初二,2,66
10003,2021-05-23,1001,初三,1,88
10004,2021-05-24,1001,初一,3,44
10005,2021-05-25,1001,初一,1,33
10006,2021-05-26,1001,初三,2,33
10007,2021-05-27,1001,初二,2,11
NTILE()

NTILE(n),用于将分组数据按照顺序切分成n片,返回当前切片值
NTILE不支持ROWS BETWEEN,比如 NTILE(2) OVER(PARTITION BY school_id ORDER BY date_key ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)。如果切片不均匀,默认增加第一个切片的分布

SELECT 
    school_id,
    student_id,
    score,
    --分组内将数据分成2片
    NTILE(2) OVER(PARTITION BY school_id ORDER BY student_id) AS rn1,
    --分组内将数据分成3片
    NTILE(3) OVER(PARTITION BY school_id ORDER BY student_id) AS rn2,
    --将所有数据分成4片
    NTILE(4) OVER(ORDER BY school_id) AS rn3
FROM test.student_score;

结果:

school_id student_id score rn1 rn2 rn3
1001 10001 11 1 1 1
1001 10002 55 1 1 1
1001 10003 77 1 1 1
1001 10004 33 1 2 1
1001 10005 22 2 2 2
1001 10006 99 2 3 2
1001 10007 99 2 3 2
1002 10001 22 1 1 2
1002 10002 66 1 1 3
1002 10003 88 1 1 3
1002 10004 44 1 2 3
1002 10005 33 2 2 4
1002 10006 33 2 3 4
1002 10007 11 2 3 4

统计不同学校school,成绩最高的前1/3的学生:

--rn1 = 1 的记录,就是我们想要的结果
SELECT 
school_id,
student_id,
score,
NTILE(3) OVER(PARTITION BY school_id ORDER BY student_id DESC) AS rn1
FROM test.student_score;

结果:

school_id student_id score rn1
1002 10007 11 1
1002 10006 33 1
1002 10005 33 1
1002 10004 44 2
1002 10003 88 2
1002 10002 66 3
1002 10001 22 3
1001 10007 99 1
1001 10006 99 1
1001 10005 22 1
1001 10004 33 2
1001 10003 77 2
1001 10002 55 3
1001 10001 11 3
ROW_NUMBER()\RANK()\DENSE_RANK()
SELECT
    school_id,
    student_id,
    score,
    ROW_NUMBER() OVER(PARTITION BY school_id ORDER BY score) AS rank1,
    RANK() OVER(PARTITION BY school_id ORDER BY score) AS rank2,
    DENSE_RANK() OVER(PARTITION BY school_id ORDER BY score) AS rank3
FROM test.student_score
ORDER BY school_id,score;

结果:

school_id student_id score rank1 rank2 rank3
1001 10001 11 1 1 1
1001 10005 22 2 2 2
1001 10004 33 3 3 3
1001 10002 55 4 4 4
1001 10003 77 5 5 5
1001 10006 99 6 6 6
1001 10007 99 7 6 6
1002 10007 11 1 1 1
1002 10001 22 2 2 2
1002 10005 33 3 3 3
1002 10006 33 4 3 3
1002 10004 44 5 5 4
1002 10002 66 6 6 5
1002 10003 88 7 7 6

比例函数

CUME_DIST()

CUME_DIST 小于等于当前值的行数占分组内总行数的比例。

--–比如,统计小于等于当前成绩的人数,所占总人数的比例
SELECT 
    school_id,
    student_id,
    score,
    CUME_DIST() OVER(ORDER BY score) AS rn1,
    CUME_DIST() OVER(PARTITION BY school_id ORDER BY score) AS rn2 
FROM test.student_score;

结果:

school_id student_id score rn1 rn2
1001 10001 11 0.14285714285714285 0.14285714285714285
1001 10005 22 0.2857142857142857 0.2857142857142857
1001 10004 33 0.5 0.42857142857142855
1001 10002 55 0.6428571428571429 0.5714285714285714
1001 10003 77 0.7857142857142857 0.7142857142857143
1001 10006 99 1.0 1.0
1001 10007 99 1.0 1.0
1002 10007 11 0.14285714285714285 0.14285714285714285
1002 10001 22 0.2857142857142857 0.2857142857142857
1002 10005 33 0.5 0.5714285714285714
1002 10006 33 0.5 0.5714285714285714
1002 10004 44 0.5714285714285714 0.7142857142857143
1002 10002 66 0.7142857142857143 0.8571428571428571
1002 10003 88 0.8571428571428571 1.0
PERCENT_RANK()

PERCENT_RANK 分组内当前行的RANK值-1/分组内总行数-1

SELECT 
    school_id,
    student_id,
    score,
    PERCENT_RANK() OVER(ORDER BY score) AS rn1,   --分组内
    RANK() OVER(ORDER BY score) AS rank1,          --分组内RANK值
    SUM(1) OVER(PARTITION BY NULL) AS sum1,     --分组内总行数
    PERCENT_RANK() OVER(PARTITION BY school_id ORDER BY score)
     AS rn2,
     RANK() OVER(PARTITION BY school_id ORDER BY score) AS rank2,
    SUM(1) OVER(PARTITION BY school_id) AS sum2
FROM test.student_score;

结果:

school_id student_id score rn1 rank1 sum1 rn2 rank2 sum2
1001 10001 11 0.0 1 14 0.0 1 7
1001 10005 22 0.15384615384615385 3 14 0.16666666666666666 2 7
1001 10004 33 0.3076923076923077 5 14 0.3333333333333333 3 7
1001 10002 55 0.6153846153846154 9 14 0.5 4 7
1001 10003 77 0.7692307692307693 11 14 0.6666666666666666 5 7
1001 10006 99 0.9230769230769231 13 14 0.8333333333333334 6 7
1001 10007 99 0.9230769230769231 13 14 0.8333333333333334 6 7
1002 10007 11 0.0 1 14 0.0 1 7
1002 10001 22 0.15384615384615385 3 14 0.16666666666666666 2 7
1002 10005 33 0.3076923076923077 5 14 0.3333333333333333 3 7
1002 10006 33 0.3076923076923077 5 14 0.3333333333333333 3 7
1002 10004 44 0.5384615384615384 8 14 0.6666666666666666 5 7
1002 10002 66 0.6923076923076923 10 14 0.8333333333333334 6 7
1002 10003 88 0.8461538461538461 12 14 1.0 7 7

取值函数

数据和表准备同排序函数。

LAG
  • 作用:
    LAG(col,n,DEFAULT)用于统计窗口内往上第n行值
  • 参数:
    第一个参数为列名,
    第二个参数为往上第n行(可选,默认为1),
    第三个参数为默认值(当往上第n行为NULL时候,取默认值,如不指定,则为NULL)
SELECT
    school_id,
    date_key,
    grade,
    ROW_NUMBER() OVER(PARTITION BY school_id ORDER BY date_key) AS rn,
    LAG(date_key,1,'1970-01-01') OVER(PARTITION BY school_id ORDER BY date_key) AS last_day_1,
    LAG(date_key,2) OVER(PARTITION BY school_id ORDER BY date_key) AS last_day_2 
FROM test.student_score;

结果:

school_id date_key grade rn last_day_1 last_day_2
1002 2021-05-20 初一 1 1970-01-01 NULL
1002 2021-05-21 初二 2 2021-05-20 NULL
1002 2021-05-23 初三 3 2021-05-21 2021-05-20
1002 2021-05-24 初一 4 2021-05-23 2021-05-21
1002 2021-05-25 初一 5 2021-05-24 2021-05-23
1002 2021-05-26 初三 6 2021-05-25 2021-05-24
1002 2021-05-27 初二 7 2021-05-26 2021-05-25
1001 2021-05-20 初一 1 1970-01-01 NULL
1001 2021-05-21 初二 2 2021-05-20 NULL
1001 2021-05-23 初三 3 2021-05-21 2021-05-20
1001 2021-05-24 初一 4 2021-05-23 2021-05-21
1001 2021-05-25 初一 5 2021-05-24 2021-05-23
1001 2021-05-26 初三 6 2021-05-25 2021-05-24
1001 2021-05-27 初二 7 2021-05-26 2021-05-25

结果分析:

last_day_1: 指定了往上第1行的值,default'1970-01-01'  
            1002第一行,往上1行为NULL,因此取默认值 1970-01-01
            1002第三行,往上1行值为第二行值,2021-05-20
            1002第六行,往上1行值为第五行值,2021-05-25
last_day_2: 指定了往上第2行的值,为指定默认值
			1002第一行,往上2行为NULL
			1002第二行,往上2行为NULL
		    1002第四行,往上2行为第二行值,2021-05-21
		    1002第七行,往上2行为第五行值,2021-05-25
LEAD()
  • 作用:
    LAG相反。LEAD(col,n,DEFAULT)用于统计窗口内往下第n行值。
  • 参数:
    第一个参数为列名.
    第二个参数为往下第n行(可选,默认为1).
    第三个参数为默认值(当往下第n行为NULL时候,取默认值,如不指定,则为NULL)
SELECT
    school_id,
    date_key,
    grade,
    ROW_NUMBER() OVER(PARTITION BY school_id ORDER BY date_key) AS rn,
    LEAD(date_key,1,'1970-01-01') OVER(PARTITION BY school_id ORDER BY date_key) AS next_day_1,
    LEAD(date_key,2) OVER(PARTITION BY school_id ORDER BY date_key) AS next_day_2 
FROM test.student_score;

结果:

school_id date_key grade rn next_day_1 next_day_2
1002 2021-05-20 初一 1 2021-05-21 2021-05-23
1002 2021-05-21 初二 2 2021-05-23 2021-05-24
1002 2021-05-23 初三 3 2021-05-24 2021-05-25
1002 2021-05-24 初一 4 2021-05-25 2021-05-26
1002 2021-05-25 初一 5 2021-05-26 2021-05-27
1002 2021-05-26 初三 6 2021-05-27 NULL
1002 2021-05-27 初二 7 1970-01-01 NULL
1001 2021-05-20 初一 1 2021-05-21 2021-05-23
1001 2021-05-21 初二 2 2021-05-23 2021-05-24
1001 2021-05-23 初三 3 2021-05-24 2021-05-25
1001 2021-05-24 初一 4 2021-05-25 2021-05-26
1001 2021-05-25 初一 5 2021-05-26 2021-05-27
1001 2021-05-26 初三 6 2021-05-27 NULL
1001 2021-05-27 初二 7 1970-01-01 NULL

结果分析:

next_day_1: 指定了往下第1行的值,default'1970-01-01'  
            1002第七行,往下行为NULL,因此取默认值 1970-01-01
            1002第四行,往下1行值为第二行值,2021-05-25
            1002第二行,往下1行值为第五行值,2021-05-23
next_day_2: 指定了往下第2行的值,为指定默认值
			1002第七行,往下2行为NULL
			1002第六行,往下2行为NULL
		    1002第三行,往下2行为第二行值,2021-05-25
		    1002第一行,往上2行为第五行值,2021-05-23
FIRST_VALUE()、LAST_VALUE()
  1. FIRST_VALUE()取分组内排序后,截止到当前行,第一个值。
  2. LAST_VALUE()取分组内排序后,截止到当前行,最后一个值。
  3. 如果不指定ORDER BY,则默认按照记录在文件中的偏移量进行排序,会出现错误的结果。
  4. 如果想要取分组内排序后最后一个值,则需要变通一下,如实例中last_3last_4
SELECT
    school_id,
    date_key,
    grade,
    ROW_NUMBER() OVER(PARTITION BY school_id ORDER BY date_key) AS rn,
    FIRST_VALUE(date_key) OVER(PARTITION BY school_id ORDER BY date_key) AS first_1,
    -- 如果不指定`ORDER BY`,则默认按照记录在文件中的偏移量进行排序,会出现错误的结果。
    FIRST_VALUE(date_key) OVER(PARTITION BY school_id) AS first_2,
    LAST_VALUE(date_key) OVER(PARTITION BY school_id) AS last_2,
    --我们发现LAST_VALUE并不能取最后一个值而是默认取从起点到当前行的最后一个值
    LAST_VALUE(date_key) OVER(PARTITION BY school_id ORDER BY date_key) AS last_3,
    --如果想要取分组内排序后最后一个值,则需要变通一下
    FIRST_VALUE(date_key) OVER(PARTITION BY school_id ORDER BY date_key DESC) AS last_4 
FROM test.student_score;

结果:

school_id date_key grade rn first_1 first_2 last_2 last_3 last_4
1002 2021-05-20 初一 1 2021-05-20 2021-05-20 2021-05-27 2021-05-20 2021-05-27
1002 2021-05-21 初二 2 2021-05-20 2021-05-20 2021-05-27 2021-05-21 2021-05-27
1002 2021-05-23 初三 3 2021-05-20 2021-05-20 2021-05-27 2021-05-23 2021-05-27
1002 2021-05-24 初一 4 2021-05-20 2021-05-20 2021-05-27 2021-05-24 2021-05-27
1002 2021-05-25 初一 5 2021-05-20 2021-05-20 2021-05-27 2021-05-25 2021-05-27
1002 2021-05-26 初三 6 2021-05-20 2021-05-20 2021-05-27 2021-05-26 2021-05-27
1002 2021-05-27 初二 7 2021-05-20 2021-05-20 2021-05-27 2021-05-27 2021-05-27
1001 2021-05-20 初一 1 2021-05-20 2021-05-20 2021-05-27 2021-05-20 2021-05-27
1001 2021-05-21 初二 2 2021-05-20 2021-05-20 2021-05-27 2021-05-21 2021-05-27
1001 2021-05-23 初三 3 2021-05-20 2021-05-20 2021-05-27 2021-05-23 2021-05-27
1001 2021-05-24 初一 4 2021-05-20 2021-05-20 2021-05-27 2021-05-24 2021-05-27
1001 2021-05-25 初一 5 2021-05-20 2021-05-20 2021-05-27 2021-05-25 2021-05-27
1001 2021-05-26 初三 6 2021-05-20 2021-05-20 2021-05-27 2021-05-26 2021-05-27
1001 2021-05-27 初二 7 2021-05-20 2021-05-20 2021-05-27 2021-05-27 2021-05-27

增强GROUP

GROUPING SETS,GROUPING__ID,CUBE,ROLLUP,这几个分析函数通常用于OLAP中,使用于多指标需要根据不同的维度上钻和下钻的指标统计,比如,分学校、年级、班级的统计不同维度组合数据。

GROUPING SETS

在一个GROUP BY查询中,根据不同的维度组合进行聚合,等价于将不同维度的GROUP BY结果集进行UNION ALL。其中的 GROUPING__ID,表示结果属于哪一个分组集合。

SELECT 
    school_id,
    grade,
    class,
    COUNT(DISTINCT student_id) AS uv,
    GROUPING__ID 
FROM test.student_score
GROUP BY school_id,grade,class 
GROUPING SETS (school_id,grade,class,(school_id,grade,class),(school_id,grade),(grade,class),(school_id,class)) 
ORDER BY GROUPING__ID;

结果:

school_id grade class uv GROUPING__ID
1001 初一 3 1 0
1002 初三 1 1 0
1001 初一 1 2 0
1002 初一 3 1 0
1001 初二 2 2 0
1001 初三 2 1 0
1002 初一 1 2 0
1002 初三 2 1 0
1002 初二 2 2 0
1001 初三 1 1 0
1001 初一 NULL 3 1
1002 初二 NULL 2 1
1001 初二 NULL 2 1
1002 初一 NULL 3 1
1001 初三 NULL 2 1
1002 初三 NULL 2 1
1002 NULL 2 3 2
1002 NULL 3 1 2
1001 NULL 1 3 2
1001 NULL 3 1 2
1001 NULL 2 3 2
1002 NULL 1 3 2
1001 NULL NULL 7 3
1002 NULL NULL 7 3
NULL 初三 1 1 4
NULL 初三 2 1 4
NULL 初一 1 2 4
NULL 初二 2 2 4
NULL 初一 3 1 4
NULL 初三 NULL 2 5
NULL 初一 NULL 3 5
NULL 初二 NULL 2 5
NULL NULL 3 1 6
NULL NULL 2 3 6
NULL NULL 1 3 6

经过结果分析我们发现,使用 GROUPING SETS可以再同一层SQL查询进行多维度统计并将结果合并(union all)在一起。其效果等价于如下:

SELECT school_id,grade,class,COUNT(DISTINCT student_id) AS uv,0 AS GROUPING__ID FROM test.student_score GROUP BY school_id,grade,class
UNION ALL
SELECT school_id,grade,NULL AS class,COUNT(DISTINCT student_id) AS uv,1 AS GROUPING__ID FROM test.student_score GROUP BY school_id,grade
UNION ALL
SELECT school_id,NULL AS class,class,COUNT(DISTINCT student_id) AS uv,2 AS GROUPING__ID FROM test.student_score GROUP BY school_id,class
UNION ALL
SELECT NULL AS school_id,grade,class,COUNT(DISTINCT student_id) AS uv,3 AS GROUPING__ID FROM test.student_score GROUP BY grade,class
UNION ALL
SELECT school_id,NULL AS grade,NULL AS class,COUNT(DISTINCT student_id) AS uv,4 AS GROUPING__ID FROM test.student_score GROUP BY school_id
UNION ALL
SELECT NULL AS school_id,grade,NULL AS class,COUNT(DISTINCT student_id) AS uv,5 AS GROUPING__ID FROM test.student_score GROUP BY grade
UNION ALL
SELECT NULL AS school_id,NULL AS grade,class,COUNT(DISTINCT student_id) AS uv,6 AS GROUPING__ID FROM test.student_score GROUP BY class
CUBE

根据GROUP BY的维度的所有组合进行聚合

SELECT 
    school_id,
    grade,
    class,
    COUNT(DISTINCT student_id) AS uv,
    GROUPING__ID 
FROM test.student_score 
GROUP BY school_id,grade,class 
WITH CUBE 
ORDER BY GROUPING__ID;

结果:

school_id grade class uv GROUPING__ID
1001 初三 1 1 0
1002 初一 3 1 0
1001 初三 2 1 0
1001 初一 1 2 0
1002 初一 1 2 0
1001 初一 3 1 0
1002 初二 2 2 0
1001 初二 2 2 0
1002 初三 1 1 0
1002 初三 2 1 0
1002 初二 NULL 2 1
1002 初一 NULL 3 1
1001 初二 NULL 2 1
1001 初三 NULL 2 1
1002 初三 NULL 2 1
1001 初一 NULL 3 1
1001 NULL 3 1 2
1002 NULL 3 1 2
1002 NULL 1 3 2
1001 NULL 1 3 2
1002 NULL 2 3 2
1001 NULL 2 3 2
1002 NULL NULL 7 4
1001 NULL NULL 7 4
NULL 初三 2 1 3
NULL 初一 3 1 3
NULL 初一 1 2 3
NULL 初三 1 1 3
NULL 初二 2 2 3
NULL 初二 NULL 2 5
NULL 初一 NULL 3 5
NULL 初三 NULL 2 5
NULL NULL 3 1 6
NULL NULL 1 3 6
NULL NULL 2 3 6
NULL NULL NULL 7 7

经过结果分析我们发现,使用 CUBE可以再同一层SQL,GROUP BY的维度的所有组合进行多维度统计并将结果合并(union all)在一起。其效果等价于如下:

SELECT school_id,grade,class,COUNT(DISTINCT student_id) AS uv,0 AS GROUPING__ID FROM test.student_score GROUP BY school_id,grade,class
UNION ALL
SELECT school_id,grade,NULL AS class,COUNT(DISTINCT student_id) AS uv,1 AS GROUPING__ID FROM test.student_score GROUP BY school_id,grade
UNION ALL
SELECT school_id,NULL AS class,class,COUNT(DISTINCT student_id) AS uv,2 AS GROUPING__ID FROM test.student_score GROUP BY school_id,class
UNION ALL
SELECT school_id,NULL AS grade,NULL AS class,COUNT(DISTINCT student_id) AS uv,4 AS GROUPING__ID FROM test.student_score GROUP BY school_id
UNION ALL
SELECT NULL AS school_id,grade,class,COUNT(DISTINCT student_id) AS uv,3 AS GROUPING__ID FROM test.student_score GROUP BY grade,class
UNION ALL
SELECT NULL AS school_id,grade,NULL AS class,COUNT(DISTINCT student_id) AS uv,5 AS GROUPING__ID FROM test.student_score GROUP BY grade
UNION ALL
SELECT NULL AS school_id,NULL AS grade,class,COUNT(DISTINCT student_id) AS uv,6 AS GROUPING__ID FROM test.student_score GROUP BY class
UNION ALL
SELECT NULL AS school_id,NULL AS grade,NULL AS class,COUNT(DISTINCT student_id) AS uv,7 AS GROUPING__ID FROM test.student_score
ROLLUP

CUBE的子集,以最左侧的维度为主,从该维度进行层级聚合。
group by rollup(school_id,grade,class), 可以理解为从右到左以一次少一列的方式依次进行group by。

学校、年级、班级的UV->学校、年级的UV->学校的UV->总的UV:

SELECT 
    school_id,
    grade,
    class,
    COUNT(DISTINCT student_id) AS uv,
    GROUPING__ID 
FROM test.student_score 
GROUP BY school_id,grade,class 
WITH ROLLUP 
ORDER BY GROUPING__ID;

结果:

school_id grade class uv GROUPING__ID
1001 初一 3 1 0
1002 初三 1 1 0
1001 初一 1 2 0
1002 初一 3 1 0
1001 初二 2 2 0
1001 初三 2 1 0
1002 初一 1 2 0
1002 初三 2 1 0
1002 初二 2 2 0
1001 初三 1 1 0
1001 初一 NULL 3 1
1002 初二 NULL 2 1
1001 初二 NULL 2 1
1002 初一 NULL 3 1
1001 初三 NULL 2 1
1002 初三 NULL 2 1
1001 NULL NULL 7 3
1002 NULL NULL 7 3
NULL NULL NULL 7 7

经过对结果分析: group by rollup(school_id,grade,class) 则以group by(school_id,grade,class) -> group by(school_id,grade) -> group by(school_id) -> group by null(最终汇总)的顺序进行分组相当于:

SELECT school_id,grade,class,COUNT(DISTINCT student_id) AS uv,GROUPING__ID FROM test.student_score GROUP BY school_id,grade,class 
UNION ALL
SELECT school_id,grade,NULL AS class,COUNT(DISTINCT student_id) AS uv,GROUPING__ID FROM test.student_score GROUP BY school_id,grade 
UNION ALL
SELECT school_id,NULL AS grade,NULL AS class,COUNT(DISTINCT student_id) AS uv,GROUPING__ID FROM test.student_score GROUP BY school_id
UNION ALL
SELECT NULL AS school_id,NULL AS grade,NULL AS class,COUNT(DISTINCT student_id) AS uv,GROUPING__ID FROM test.student_score

你可能感兴趣的:(sql,hivessql,hive,大数据,sql,hive,开窗函数,spark)