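Hive's window functions extend Hive SQL. As offline data analysis logic grows more complex, many scenarios call for them. They compute statistics over all rows of a group as well as running (cumulative) statistics within a group.
1. A window function specifies the data window it works on (how many rows before and after the current row); this window can change from row to row.
2. Unlike a GROUP BY aggregate, a window function returns multiple rows per group: every row in the group gets its own result value.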
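(1) Aggregate functions:
1.sum(col) over() : running sum of col within the group; the syntax inside over() is described below
2.count(col) over() : running count of col within the group; the syntax inside over() is described below
3.min(col) over() : minimum of col within the group
4.max(col) over() : maximum of col within the group
5.avg(col) over() : average of col within the group
(2) Value functions:
1.first_value(col) over() : the first col value in the sorted partition
2.last_value(col) over() : the last col value in the sorted partition
3.lag(col,n,DEFAULT) : the col value n rows before the current row; n is optional and defaults to 1; DEFAULT is returned when there is no such row, and is NULL if not specified
4.lead(col,n,DEFAULT) : the col value n rows after the current row; n is optional and defaults to 1; DEFAULT is returned when there is no such row, and is NULL if not specified
(3) Ranking functions:
1.row_number() over() : ranking without repeats; suitable for generating primary keys or rankings without ties
2.rank() over() : ranking with ties; the ranks after a tie are skipped, e.g. 1,1,3
3.dense_rank() over() : ranking with ties; ranks stay consecutive, e.g. 1,1,2
4.ntile(n) : splits the ordered rows of a group into n buckets and returns the bucket number of the current row. Note: n must be an int.
(4) Distribution functions:
1.cume_dist() : (number of rows with a value less than or equal to the current value) / (total rows in the group); for example, the share of people whose salary is at most the current salary
2.percent_rank() : the percentage rank of a row, (RANK of the current row - 1) / (total rows in the group - 1); it can be used to work out what percentage of people a row outranks.
(5) Enhanced GROUP BY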
1.GROUPING SETS
2.GROUPING__ID
3.CUBE
4.ROLLUP
<window function>()
OVER
(
[PARTITION BY <column 1, column 2, column 3 ...>]
[ORDER BY <sort columns> [ASC | DESC]]
[(ROWS | RANGE) <frame specification>]
)
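A window function call has four parts:
- Function clause: names the operation, e.g. sum for summation, first_value for the first value.
- PARTITION BY clause: names the partition columns; if omitted, all rows form a single partition.
- ORDER BY clause: names the columns and direction used to sort each partition. It is optional; without it, the row order inside a partition is not guaranteed.
- Window clause (ROWS BETWEEN, also called the window clause): sets the range of rows, relative to the current row, that the function computes over. The range can extend backward (PRECEDING), forward (FOLLOWING), or have both bounds given with BETWEEN. If it is omitted, the window is the whole partition when there is no ORDER BY, and runs from the start of the partition to the current row when there is an ORDER BY.
In the window clause, n PRECEDING means n rows before the current row, n FOLLOWING means n rows after it, CURRENT ROW is the current row, and UNBOUNDED means no bound. ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING therefore runs from the first row of the partition (UNBOUNDED PRECEDING) to the last (UNBOUNDED FOLLOWING).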
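WINDOW clause (fine-grained control over which rows fall in the window)
- PRECEDING: rows before the current row
- FOLLOWING: rows after the current row
- CURRENT ROW: the current row
- UNBOUNDED: no bound (normally combined with PRECEDING or FOLLOWING)
- UNBOUNDED PRECEDING: the first row of the window (start)
- UNBOUNDED FOLLOWING: the last row of the window (end)
For example:
- ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW (from the start to the current row)
- ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING (from 2 rows before to 1 row after)
- ROWS BETWEEN 2 PRECEDING AND CURRENT ROW (from 2 rows before to the current row)
- ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING (from the current row to the end)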
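To make the frame bounds concrete, here is a minimal Python sketch of how a ROWS frame selects rows for a sum. The helper `rows_frame_sum` and the score list are hypothetical illustrations, not Hive APIs; `None` stands for UNBOUNDED and `0` for CURRENT ROW.

```python
from typing import List, Optional


def rows_frame_sum(values: List[int],
                   preceding: Optional[int],
                   following: Optional[int]) -> List[int]:
    """Sum over ROWS BETWEEN <preceding> PRECEDING AND <following> FOLLOWING.

    `values` must already be in partition order.
    None means UNBOUNDED, 0 means CURRENT ROW (hypothetical convention).
    """
    n = len(values)
    out = []
    for i in range(n):
        start = 0 if preceding is None else max(0, i - preceding)
        end = n - 1 if following is None else min(n - 1, i + following)
        out.append(sum(values[start:end + 1]))
    return out


scores = [11, 55, 77, 33, 22, 99, 99]        # illustrative values
running = rows_frame_sum(scores, None, 0)    # UNBOUNDED PRECEDING .. CURRENT ROW
sliding = rows_frame_sum(scores, 2, 1)       # 2 PRECEDING .. 1 FOLLOWING
tail = rows_frame_sum(scores, 0, None)       # CURRENT ROW .. UNBOUNDED FOLLOWING
print(running)
print(sliding)
print(tail)
```

The frame is clipped at the partition edges, which is why the first and last rows of a sliding window cover fewer rows than the rows in the middle.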
CREATE EXTERNAL TABLE test.student_score (
`student_id` string,
`date_key` string,
`school_id` string,
`grade` string,
`class` string,
`score` string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
stored as textfile
location '/tmp/test/student_score/';
10001,2021-05-20,1001,初一,1,11
10002,2021-05-21,1001,初二,2,55
10003,2021-05-23,1001,初三,1,77
10004,2021-05-24,1001,初一,3,33
10005,2021-05-25,1001,初一,1,22
10006,2021-05-26,1001,初三,2,99
10007,2021-05-27,1001,初二,2,99
SUM
Note: the result depends on ORDER BY, which sorts ascending by default.
Test code:
select
school_id,
student_id,
score,
-- default frame: from the start of the partition to the current row
SUM(score) OVER(PARTITION BY school_id ORDER BY student_id) AS scores1,
-- from the start to the current row; same result as scores1
SUM(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS scores2,
-- all rows in the group
SUM(score) OVER(PARTITION BY school_id) AS scores3,
-- current row + the 3 rows before it
SUM(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS scores4,
-- current row + the 3 rows before it + the 1 row after it
SUM(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS scores5,
-- current row + all rows after it
SUM(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS scores6
from test.student_score;
Result:
school_id | student_id | score | scores1 | scores2 | scores3 | scores4 | scores5 | scores6 |
---|---|---|---|---|---|---|---|---|
1001 | 10001 | 11 | 11.0 | 11.0 | 396.0 | 11.0 | 66.0 | 396.0 |
1001 | 10002 | 55 | 66.0 | 66.0 | 396.0 | 66.0 | 143.0 | 385.0 |
1001 | 10003 | 77 | 143.0 | 143.0 | 396.0 | 143.0 | 176.0 | 330.0 |
1001 | 10004 | 33 | 176.0 | 176.0 | 396.0 | 176.0 | 198.0 | 253.0 |
1001 | 10005 | 22 | 198.0 | 198.0 | 396.0 | 187.0 | 286.0 | 220.0 |
1001 | 10006 | 99 | 297.0 | 297.0 | 396.0 | 231.0 | 330.0 | 198.0 |
1001 | 10007 | 99 | 396.0 | 396.0 | 396.0 | 253.0 | 253.0 | 99.0 |
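If ROWS BETWEEN is not specified, the window defaults to running from the start of the partition to the current row;
if ORDER BY is not specified, all values in the group are summed.
COUNT
The window clause is interpreted as for SUM, with one point to watch: under the default frame, rows that tie on the ORDER BY key are counted together (compare sv1 and sv2). The result depends on ORDER BY, which sorts ascending by default.
Test code: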
select
school_id,
grade,
class,
-- default frame (start to current row): rows are accumulated in sort-key order, and rows that share the same sort key are added together at once
COUNT(student_id) OVER(PARTITION BY grade ORDER BY class) AS sv1,
-- explicit ROWS frame from the start to the current row: rows are added one at a time in sort-key order, even when their sort keys tie
COUNT(student_id) OVER(PARTITION BY grade ORDER BY class ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS sv2,
-- all rows in the group
COUNT(student_id) OVER(PARTITION BY grade) AS sv3,
-- current row + the 3 rows before it; same rule as sv2
COUNT(student_id) OVER(PARTITION BY grade ORDER BY class ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS sv4,
-- current row + the 3 rows before it + the 1 row after it; same rule as sv2
COUNT(student_id) OVER(PARTITION BY grade ORDER BY class ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS sv5,
-- current row + all rows after it; same rule as sv2
COUNT(student_id) OVER(PARTITION BY grade ORDER BY class ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS sv6
from test.student_score;
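Note: COUNT(DISTINCT xxx) is not supported as a window function in older Hive versions (the Hive documentation lists DISTINCT support for windowed aggregates from Hive 2.1.0 onward). Where it is unavailable, deduplicate the data in a subquery first and then count; see my article "How to implement COUNT(DISTINCT) OVER (PARTITION BY) in Hive?"
Result: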
school_id | grade | class | sv1 | sv2 | sv3 | sv4 | sv5 | sv6 |
---|---|---|---|---|---|---|---|---|
1001 | 初二 | 2 | 2 | 1 | 2 | 1 | 2 | 2 |
1001 | 初二 | 2 | 2 | 2 | 2 | 2 | 2 | 1 |
1001 | 初一 | 1 | 2 | 1 | 3 | 1 | 2 | 3 |
1001 | 初一 | 1 | 2 | 2 | 3 | 2 | 3 | 2 |
1001 | 初一 | 3 | 3 | 3 | 3 | 3 | 3 | 1 |
1001 | 初三 | 1 | 1 | 1 | 2 | 1 | 2 | 2 |
1001 | 初三 | 2 | 2 | 2 | 2 | 2 | 2 | 1 |
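The difference between sv1 and sv2 above can be reproduced in a small, self-contained sketch. It uses Python's built-in sqlite3 module as a stand-in for Hive (an assumption made only so the example runs locally; it needs SQLite 3.25 or later for window functions), and the table `t` with three rows is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (student_id TEXT, class INTEGER)")
# Two students tie on class = 1, one student is in class 3.
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [("10001", 1), ("10005", 1), ("10004", 3)])

rows = conn.execute("""
    SELECT class,
           -- default frame: tied rows (peers) are counted together
           COUNT(student_id) OVER (ORDER BY class) AS default_frame,
           -- explicit ROWS frame: rows are counted one at a time
           COUNT(student_id) OVER (ORDER BY class
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_frame
    FROM t
    ORDER BY class, rows_frame
""").fetchall()
print(rows)  # [(1, 2, 1), (1, 2, 2), (3, 3, 3)]
```

Both rows with class 1 get 2 under the default frame, while the explicit ROWS frame gives them 1 and 2.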
MAX
The window clause is interpreted as for SUM; the result depends on ORDER BY, which sorts ascending by default.
Test code:
select
school_id,
student_id,
score,
-- default frame: from the start of the partition to the current row
MAX(score) OVER(PARTITION BY school_id ORDER BY student_id) AS mas1,
-- from the start to the current row; same result as mas1
MAX(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS mas2,
-- all rows in the group
MAX(score) OVER(PARTITION BY school_id) AS mas3,
-- current row + the 3 rows before it
MAX(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS mas4,
-- current row + the 3 rows before it + the 1 row after it
MAX(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS mas5,
-- current row + all rows after it
MAX(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS mas6
from test.student_score;
Result:
school_id | student_id | score | mas1 | mas2 | mas3 | mas4 | mas5 | mas6 |
---|---|---|---|---|---|---|---|---|
1001 | 10001 | 11 | 11 | 11 | 99 | 11 | 55 | 99 |
1001 | 10002 | 55 | 55 | 55 | 99 | 55 | 77 | 99 |
1001 | 10003 | 77 | 77 | 77 | 99 | 77 | 77 | 99 |
1001 | 10004 | 33 | 77 | 77 | 99 | 77 | 77 | 99 |
1001 | 10005 | 22 | 77 | 77 | 99 | 77 | 99 | 99 |
1001 | 10006 | 99 | 99 | 99 | 99 | 99 | 99 | 99 |
1001 | 10007 | 99 | 99 | 99 | 99 | 99 | 99 | 99 |
MIN
The window clause is interpreted as for SUM; the result depends on ORDER BY, which sorts ascending by default.
Test code:
select
school_id,
student_id,
score,
-- default frame: from the start of the partition to the current row
MIN(score) OVER(PARTITION BY school_id ORDER BY student_id) AS mis1,
-- from the start to the current row; same result as mis1
MIN(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS mis2,
-- all rows in the group
MIN(score) OVER(PARTITION BY school_id) AS mis3,
-- current row + the 3 rows before it
MIN(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS mis4,
-- current row + the 3 rows before it + the 1 row after it
MIN(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS mis5,
-- current row + all rows after it
MIN(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS mis6
from test.student_score;
Result:
school_id | student_id | score | mis1 | mis2 | mis3 | mis4 | mis5 | mis6 |
---|---|---|---|---|---|---|---|---|
1001 | 10001 | 11 | 11 | 11 | 11 | 11 | 11 | 11 |
1001 | 10002 | 55 | 11 | 11 | 11 | 11 | 11 | 22 |
1001 | 10003 | 77 | 11 | 11 | 11 | 11 | 11 | 22 |
1001 | 10004 | 33 | 11 | 11 | 11 | 11 | 11 | 22 |
1001 | 10005 | 22 | 11 | 11 | 11 | 22 | 22 | 22 |
1001 | 10006 | 99 | 11 | 11 | 11 | 22 | 22 | 99 |
1001 | 10007 | 99 | 11 | 11 | 11 | 22 | 22 | 99 |
AVG
The window clause is interpreted as for SUM; the result depends on ORDER BY, which sorts ascending by default.
Test code:
select
school_id,
student_id,
score,
-- default frame: from the start of the partition to the current row
AVG(score) OVER(PARTITION BY school_id ORDER BY student_id) AS as1,
-- from the start to the current row; same result as as1
AVG(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS as2,
-- all rows in the group
AVG(score) OVER(PARTITION BY school_id) AS as3,
-- current row + the 3 rows before it
AVG(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS as4,
-- current row + the 3 rows before it + the 1 row after it
AVG(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS as5,
-- current row + all rows after it
AVG(score) OVER(PARTITION BY school_id ORDER BY student_id ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS as6
from test.student_score;
Result:
school_id | student_id | score | as1 | as2 | as3 | as4 | as5 | as6 |
---|---|---|---|---|---|---|---|---|
1001 | 10001 | 11 | 11.0 | 11.0 | 56.57142857142857 | 11.0 | 33.0 | 56.57142857142857 |
1001 | 10002 | 55 | 33.0 | 33.0 | 56.57142857142857 | 33.0 | 47.666666666666664 | 64.16666666666667 |
1001 | 10003 | 77 | 47.666666666666664 | 47.666666666666664 | 56.57142857142857 | 47.666666666666664 | 44.0 | 66.0 |
1001 | 10004 | 33 | 44.0 | 44.0 | 56.57142857142857 | 44.0 | 39.6 | 63.25 |
1001 | 10005 | 22 | 39.6 | 39.6 | 56.57142857142857 | 46.75 | 57.2 | 73.33333333333333 |
1001 | 10006 | 99 | 49.5 | 49.5 | 56.57142857142857 | 57.75 | 66.0 | 99.0 |
1001 | 10007 | 99 | 56.57142857142857 | 56.57142857142857 | 56.57142857142857 | 63.25 | 63.25 | 99.0 |
CREATE EXTERNAL TABLE test.student_score(
`student_id` STRING,
`date_key` STRING,
`school_id` STRING,
`grade` STRING,
`class` STRING,
`score` STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS textfile
LOCATION '/tmp/test/student_score';
10001,2021-05-20,1001,初一,1,11
10002,2021-05-21,1001,初二,2,55
10003,2021-05-23,1001,初三,1,77
10004,2021-05-24,1001,初一,3,33
10005,2021-05-25,1001,初一,1,22
10006,2021-05-26,1001,初三,2,99
10007,2021-05-27,1001,初二,2,99
10001,2021-05-20,1002,初一,1,22
10002,2021-05-21,1002,初二,2,66
10003,2021-05-23,1002,初三,1,88
10004,2021-05-24,1002,初一,3,44
10005,2021-05-25,1002,初一,1,33
10006,2021-05-26,1002,初三,2,33
10007,2021-05-27,1002,初二,2,11
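NTILE(n) splits the ordered rows of a group into n buckets and returns the bucket number of the current row.
NTILE does not support ROWS BETWEEN; for example, NTILE(2) OVER(PARTITION BY school_id ORDER BY date_key ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) is not allowed.
If the rows cannot be split evenly, the extra rows go to the first buckets.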
SELECT
school_id,
student_id,
score,
-- split the rows of each group into 2 buckets
NTILE(2) OVER(PARTITION BY school_id ORDER BY student_id) AS rn1,
-- split the rows of each group into 3 buckets
NTILE(3) OVER(PARTITION BY school_id ORDER BY student_id) AS rn2,
-- split all rows into 4 buckets
NTILE(4) OVER(ORDER BY school_id) AS rn3
FROM test.student_score;
Result:
school_id | student_id | score | rn1 | rn2 | rn3 |
---|---|---|---|---|---|
1001 | 10001 | 11 | 1 | 1 | 1 |
1001 | 10002 | 55 | 1 | 1 | 1 |
1001 | 10003 | 77 | 1 | 1 | 1 |
1001 | 10004 | 33 | 1 | 2 | 1 |
1001 | 10005 | 22 | 2 | 2 | 2 |
1001 | 10006 | 99 | 2 | 3 | 2 |
1001 | 10007 | 99 | 2 | 3 | 2 |
1002 | 10001 | 22 | 1 | 1 | 2 |
1002 | 10002 | 66 | 1 | 1 | 3 |
1002 | 10003 | 88 | 1 | 1 | 3 |
1002 | 10004 | 44 | 1 | 2 | 3 |
1002 | 10005 | 33 | 2 | 2 | 4 |
1002 | 10006 | 33 | 2 | 3 | 4 |
1002 | 10007 | 11 | 2 | 3 | 4 |
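The bucket sizes in the result above follow a simple rule, sketched here in Python. The function `ntile` is a hypothetical helper written for illustration, not Hive's implementation: when the row count does not divide evenly, the first buckets each receive one extra row.

```python
def ntile(row_count: int, n: int) -> list:
    """Bucket number (1-based) for each of `row_count` ordered rows split into n buckets."""
    base, extra = divmod(row_count, n)
    buckets = []
    for bucket in range(1, n + 1):
        # the first `extra` buckets get one more row than the rest
        size = base + (1 if bucket <= extra else 0)
        buckets.extend([bucket] * size)
    return buckets


two = ntile(7, 2)    # 7 rows per school, 2 buckets
three = ntile(7, 3)  # 7 rows per school, 3 buckets
four = ntile(14, 4)  # all 14 rows, 4 buckets
print(two)    # [1, 1, 1, 1, 2, 2, 2]
print(three)  # [1, 1, 1, 2, 2, 3, 3]
print(four)
```

These sequences match the rn1, rn2 and rn3 columns: 4 + 3 rows, 3 + 2 + 2 rows, and 4 + 4 + 3 + 3 rows.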
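Find the top third of students in each school. Note that the query below orders by student_id DESC, so rn1 = 1 marks the top third by student_id; to rank by score instead, order by score DESC (casting score to a number, since the column is a STRING):
-- rows with rn1 = 1 are the ones we want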
SELECT
school_id,
student_id,
score,
NTILE(3) OVER(PARTITION BY school_id ORDER BY student_id DESC) AS rn1
FROM test.student_score;
Result:
school_id | student_id | score | rn1 |
---|---|---|---|
1002 | 10007 | 11 | 1 |
1002 | 10006 | 33 | 1 |
1002 | 10005 | 33 | 1 |
1002 | 10004 | 44 | 2 |
1002 | 10003 | 88 | 2 |
1002 | 10002 | 66 | 3 |
1002 | 10001 | 22 | 3 |
1001 | 10007 | 99 | 1 |
1001 | 10006 | 99 | 1 |
1001 | 10005 | 22 | 1 |
1001 | 10004 | 33 | 2 |
1001 | 10003 | 77 | 2 |
1001 | 10002 | 55 | 3 |
1001 | 10001 | 11 | 3 |
SELECT
school_id,
student_id,
score,
ROW_NUMBER() OVER(PARTITION BY school_id ORDER BY score) AS rank1,
RANK() OVER(PARTITION BY school_id ORDER BY score) AS rank2,
DENSE_RANK() OVER(PARTITION BY school_id ORDER BY score) AS rank3
FROM test.student_score
ORDER BY school_id,score;
Result:
school_id | student_id | score | rank1 | rank2 | rank3 |
---|---|---|---|---|---|
1001 | 10001 | 11 | 1 | 1 | 1 |
1001 | 10005 | 22 | 2 | 2 | 2 |
1001 | 10004 | 33 | 3 | 3 | 3 |
1001 | 10002 | 55 | 4 | 4 | 4 |
1001 | 10003 | 77 | 5 | 5 | 5 |
1001 | 10006 | 99 | 6 | 6 | 6 |
1001 | 10007 | 99 | 7 | 6 | 6 |
1002 | 10007 | 11 | 1 | 1 | 1 |
1002 | 10001 | 22 | 2 | 2 | 2 |
1002 | 10005 | 33 | 3 | 3 | 3 |
1002 | 10006 | 33 | 4 | 3 | 3 |
1002 | 10004 | 44 | 5 | 5 | 4 |
1002 | 10002 | 66 | 6 | 6 | 5 |
1002 | 10003 | 88 | 7 | 7 | 6 |
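The three ranking columns differ only in how they treat ties. A minimal Python sketch (the helper `rankings` is hypothetical and assumes its input is already sorted ascending) reproduces the rank1, rank2 and rank3 values for school 1002:

```python
def rankings(sorted_values):
    """Return (row_number, rank, dense_rank) tuples for values sorted ascending."""
    result = []
    rank = dense = 0
    previous = object()  # sentinel that is unequal to every value
    for row_number, value in enumerate(sorted_values, start=1):
        if value != previous:
            rank = row_number  # RANK jumps to the row position after a tie
            dense += 1         # DENSE_RANK only advances by one
            previous = value
        result.append((row_number, rank, dense))
    return result


ranks_1002 = rankings([11, 22, 33, 33, 44, 66, 88])
for row in ranks_1002:
    print(row)
```

The two rows with score 33 share rank 3; the next row is 5 under RANK and 4 under DENSE_RANK, while ROW_NUMBER keeps counting 3, 4, 5.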
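CUME_DIST
The number of rows with a value less than or equal to the current value, divided by the total number of rows in the group.
-- e.g. the share of students whose score is less than or equal to the current score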
SELECT
school_id,
student_id,
score,
CUME_DIST() OVER(ORDER BY score) AS rn1,
CUME_DIST() OVER(PARTITION BY school_id ORDER BY score) AS rn2
FROM test.student_score;
Result:
school_id | student_id | score | rn1 | rn2 |
---|---|---|---|---|
1001 | 10001 | 11 | 0.14285714285714285 | 0.14285714285714285 |
1001 | 10005 | 22 | 0.2857142857142857 | 0.2857142857142857 |
1001 | 10004 | 33 | 0.5 | 0.42857142857142855 |
1001 | 10002 | 55 | 0.6428571428571429 | 0.5714285714285714 |
1001 | 10003 | 77 | 0.7857142857142857 | 0.7142857142857143 |
1001 | 10006 | 99 | 1.0 | 1.0 |
1001 | 10007 | 99 | 1.0 | 1.0 |
1002 | 10007 | 11 | 0.14285714285714285 | 0.14285714285714285 |
1002 | 10001 | 22 | 0.2857142857142857 | 0.2857142857142857 |
1002 | 10005 | 33 | 0.5 | 0.5714285714285714 |
1002 | 10006 | 33 | 0.5 | 0.5714285714285714 |
1002 | 10004 | 44 | 0.5714285714285714 | 0.7142857142857143 |
1002 | 10002 | 66 | 0.7142857142857143 | 0.8571428571428571 |
1002 | 10003 | 88 | 0.8571428571428571 | 1.0 |
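The rn2 column for school 1002 can be checked by hand with a short Python sketch. The function `cume_dist` below is a hypothetical helper mirroring the definition above, not Hive code.

```python
from bisect import bisect_right


def cume_dist(values):
    """(rows with value <= current value) / (total rows), for every value."""
    ordered = sorted(values)
    total = len(ordered)
    # bisect_right gives the count of elements <= v in the sorted list
    return [bisect_right(ordered, v) / total for v in values]


scores_1002 = [11, 22, 33, 33, 44, 66, 88]
dist = cume_dist(scores_1002)
print(dist)
```

Both rows with score 33 get 4/7, because four rows have a score of 33 or less.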
PERCENT_RANK: (RANK of the current row within the group - 1) / (total number of rows in the group - 1)
SELECT
school_id,
student_id,
score,
PERCENT_RANK() OVER(ORDER BY score) AS rn1, -- percent rank over all rows
RANK() OVER(ORDER BY score) AS rank1, -- RANK value over all rows
SUM(1) OVER(PARTITION BY NULL) AS sum1, -- total number of rows
PERCENT_RANK() OVER(PARTITION BY school_id ORDER BY score)
AS rn2,
RANK() OVER(PARTITION BY school_id ORDER BY score) AS rank2,
SUM(1) OVER(PARTITION BY school_id) AS sum2
FROM test.student_score;
Result:
school_id | student_id | score | rn1 | rank1 | sum1 | rn2 | rank2 | sum2 |
---|---|---|---|---|---|---|---|---|
1001 | 10001 | 11 | 0.0 | 1 | 14 | 0.0 | 1 | 7 |
1001 | 10005 | 22 | 0.15384615384615385 | 3 | 14 | 0.16666666666666666 | 2 | 7 |
1001 | 10004 | 33 | 0.3076923076923077 | 5 | 14 | 0.3333333333333333 | 3 | 7 |
1001 | 10002 | 55 | 0.6153846153846154 | 9 | 14 | 0.5 | 4 | 7 |
1001 | 10003 | 77 | 0.7692307692307693 | 11 | 14 | 0.6666666666666666 | 5 | 7 |
1001 | 10006 | 99 | 0.9230769230769231 | 13 | 14 | 0.8333333333333334 | 6 | 7 |
1001 | 10007 | 99 | 0.9230769230769231 | 13 | 14 | 0.8333333333333334 | 6 | 7 |
1002 | 10007 | 11 | 0.0 | 1 | 14 | 0.0 | 1 | 7 |
1002 | 10001 | 22 | 0.15384615384615385 | 3 | 14 | 0.16666666666666666 | 2 | 7 |
1002 | 10005 | 33 | 0.3076923076923077 | 5 | 14 | 0.3333333333333333 | 3 | 7 |
1002 | 10006 | 33 | 0.3076923076923077 | 5 | 14 | 0.3333333333333333 | 3 | 7 |
1002 | 10004 | 44 | 0.5384615384615384 | 8 | 14 | 0.6666666666666666 | 5 | 7 |
1002 | 10002 | 66 | 0.6923076923076923 | 10 | 14 | 0.8333333333333334 | 6 | 7 |
1002 | 10003 | 88 | 0.8461538461538461 | 12 | 14 | 1.0 | 7 | 7 |
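As a cross-check of the formula (RANK - 1) / (rows - 1), here is a small Python sketch. The function `percent_rank` is a hypothetical helper, not Hive code; it uses the scores of school 1001.

```python
from bisect import bisect_left


def percent_rank(values):
    """(RANK - 1) / (total rows - 1) for every value."""
    ordered = sorted(values)
    total = len(ordered)
    if total == 1:
        return [0.0]
    # bisect_left counts the strictly smaller values, which equals RANK - 1
    return [bisect_left(ordered, v) / (total - 1) for v in values]


scores_1001 = [11, 22, 33, 55, 77, 99, 99]
pr = percent_rank(scores_1001)
print(pr)
```

The two rows with score 99 share RANK 6, so both get 5/6, matching the rn2 value 0.8333333333333334 in the table.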
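The table and data are the same as for the ranking functions.
- Purpose: LAG(col,n,DEFAULT) returns the value of col from the n-th row before the current row within the window.
- Parameters:
the first parameter is the column name,
the second parameter is how many rows to look back (optional, default 1),
the third parameter is the default value (returned when the n-th preceding row does not exist; NULL if not specified)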
SELECT
school_id,
date_key,
grade,
ROW_NUMBER() OVER(PARTITION BY school_id ORDER BY date_key) AS rn,
LAG(date_key,1,'1970-01-01') OVER(PARTITION BY school_id ORDER BY date_key) AS last_day_1,
LAG(date_key,2) OVER(PARTITION BY school_id ORDER BY date_key) AS last_day_2
FROM test.student_score;
Result:
school_id | date_key | grade | rn | last_day_1 | last_day_2 |
---|---|---|---|---|---|
1002 | 2021-05-20 | 初一 | 1 | 1970-01-01 | NULL |
1002 | 2021-05-21 | 初二 | 2 | 2021-05-20 | NULL |
1002 | 2021-05-23 | 初三 | 3 | 2021-05-21 | 2021-05-20 |
1002 | 2021-05-24 | 初一 | 4 | 2021-05-23 | 2021-05-21 |
1002 | 2021-05-25 | 初一 | 5 | 2021-05-24 | 2021-05-23 |
1002 | 2021-05-26 | 初三 | 6 | 2021-05-25 | 2021-05-24 |
1002 | 2021-05-27 | 初二 | 7 | 2021-05-26 | 2021-05-25 |
1001 | 2021-05-20 | 初一 | 1 | 1970-01-01 | NULL |
1001 | 2021-05-21 | 初二 | 2 | 2021-05-20 | NULL |
1001 | 2021-05-23 | 初三 | 3 | 2021-05-21 | 2021-05-20 |
1001 | 2021-05-24 | 初一 | 4 | 2021-05-23 | 2021-05-21 |
1001 | 2021-05-25 | 初一 | 5 | 2021-05-24 | 2021-05-23 |
1001 | 2021-05-26 | 初三 | 6 | 2021-05-25 | 2021-05-24 |
1001 | 2021-05-27 | 初二 | 7 | 2021-05-26 | 2021-05-25 |
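Analysis of the result:
last_day_1: takes the value 1 row up, with default '1970-01-01'
1002, row 1: there is no row above it, so the default 1970-01-01 is returned
1002, row 3: 1 row up is row 2, 2021-05-21
1002, row 6: 1 row up is row 5, 2021-05-25
last_day_2: takes the value 2 rows up, with no default specified
1002, row 1: there is no row 2 rows up, so NULL
1002, row 2: there is no row 2 rows up, so NULL
1002, row 4: 2 rows up is row 2, 2021-05-21
1002, row 7: 2 rows up is row 5, 2021-05-25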
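- Purpose: the opposite of LAG. LEAD(col,n,DEFAULT) returns the value of col from the n-th row after the current row within the window.
- Parameters:
the first parameter is the column name,
the second parameter is how many rows to look ahead (optional, default 1),
the third parameter is the default value (returned when the n-th following row does not exist; NULL if not specified)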
SELECT
school_id,
date_key,
grade,
ROW_NUMBER() OVER(PARTITION BY school_id ORDER BY date_key) AS rn,
LEAD(date_key,1,'1970-01-01') OVER(PARTITION BY school_id ORDER BY date_key) AS next_day_1,
LEAD(date_key,2) OVER(PARTITION BY school_id ORDER BY date_key) AS next_day_2
FROM test.student_score;
Result:
school_id | date_key | grade | rn | next_day_1 | next_day_2 |
---|---|---|---|---|---|
1002 | 2021-05-20 | 初一 | 1 | 2021-05-21 | 2021-05-23 |
1002 | 2021-05-21 | 初二 | 2 | 2021-05-23 | 2021-05-24 |
1002 | 2021-05-23 | 初三 | 3 | 2021-05-24 | 2021-05-25 |
1002 | 2021-05-24 | 初一 | 4 | 2021-05-25 | 2021-05-26 |
1002 | 2021-05-25 | 初一 | 5 | 2021-05-26 | 2021-05-27 |
1002 | 2021-05-26 | 初三 | 6 | 2021-05-27 | NULL |
1002 | 2021-05-27 | 初二 | 7 | 1970-01-01 | NULL |
1001 | 2021-05-20 | 初一 | 1 | 2021-05-21 | 2021-05-23 |
1001 | 2021-05-21 | 初二 | 2 | 2021-05-23 | 2021-05-24 |
1001 | 2021-05-23 | 初三 | 3 | 2021-05-24 | 2021-05-25 |
1001 | 2021-05-24 | 初一 | 4 | 2021-05-25 | 2021-05-26 |
1001 | 2021-05-25 | 初一 | 5 | 2021-05-26 | 2021-05-27 |
1001 | 2021-05-26 | 初三 | 6 | 2021-05-27 | NULL |
1001 | 2021-05-27 | 初二 | 7 | 1970-01-01 | NULL |
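Analysis of the result:
next_day_1: takes the value 1 row down, with default '1970-01-01'
1002, row 7: there is no row below it, so the default 1970-01-01 is returned
1002, row 4: 1 row down is row 5, 2021-05-25
1002, row 2: 1 row down is row 3, 2021-05-23
next_day_2: takes the value 2 rows down, with no default specified
1002, row 7: there is no row 2 rows down, so NULL
1002, row 6: there is no row 2 rows down, so NULL
1002, row 3: 2 rows down is row 5, 2021-05-25
1002, row 1: 2 rows down is row 3, 2021-05-23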
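The offset logic of LAG and LEAD can be sketched in a few lines of Python. The functions `lag` and `lead` are hypothetical helpers that work on one already-sorted partition; they are not Hive APIs, and `None` plays the role of NULL.

```python
def lag(values, n=1, default=None):
    """Value n rows before each row, or `default` when no such row exists."""
    return [values[i - n] if i - n >= 0 else default
            for i in range(len(values))]


def lead(values, n=1, default=None):
    """Value n rows after each row, or `default` when no such row exists."""
    return [values[i + n] if i + n < len(values) else default
            for i in range(len(values))]


dates = ["2021-05-20", "2021-05-21", "2021-05-23", "2021-05-24",
         "2021-05-25", "2021-05-26", "2021-05-27"]
last_day_1 = lag(dates, 1, "1970-01-01")
next_day_2 = lead(dates, 2)
print(last_day_1)
print(next_day_2)
```

Only the first row of `last_day_1` falls back to the default, and only the last two rows of `next_day_2` are `None`, as in the result tables.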
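FIRST_VALUE() returns the first value within the sorted group, up to the current row.
LAST_VALUE() returns the last value within the sorted group, up to the current row.
- If ORDER BY is not specified, rows are taken in the order in which the records appear in the file, which is not a meaningful sort order, so the result can be wrong.
- To get the last value of the whole sorted group, a workaround is needed, as shown by last_4 in the example (compare it with last_3).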
SELECT
school_id,
date_key,
grade,
ROW_NUMBER() OVER(PARTITION BY school_id ORDER BY date_key) AS rn,
FIRST_VALUE(date_key) OVER(PARTITION BY school_id ORDER BY date_key) AS first_1,
-- without `ORDER BY`, rows are taken in file order, so the result can be wrong
FIRST_VALUE(date_key) OVER(PARTITION BY school_id) AS first_2,
LAST_VALUE(date_key) OVER(PARTITION BY school_id) AS last_2,
-- LAST_VALUE does not return the last value of the group here: with the default frame it returns the last value between the start and the current row
LAST_VALUE(date_key) OVER(PARTITION BY school_id ORDER BY date_key) AS last_3,
-- workaround to get the last value of the sorted group: sort descending and take the first value
FIRST_VALUE(date_key) OVER(PARTITION BY school_id ORDER BY date_key DESC) AS last_4
FROM test.student_score;
Result:
school_id | date_key | grade | rn | first_1 | first_2 | last_2 | last_3 | last_4 |
---|---|---|---|---|---|---|---|---|
1002 | 2021-05-20 | 初一 | 1 | 2021-05-20 | 2021-05-20 | 2021-05-27 | 2021-05-20 | 2021-05-27 |
1002 | 2021-05-21 | 初二 | 2 | 2021-05-20 | 2021-05-20 | 2021-05-27 | 2021-05-21 | 2021-05-27 |
1002 | 2021-05-23 | 初三 | 3 | 2021-05-20 | 2021-05-20 | 2021-05-27 | 2021-05-23 | 2021-05-27 |
1002 | 2021-05-24 | 初一 | 4 | 2021-05-20 | 2021-05-20 | 2021-05-27 | 2021-05-24 | 2021-05-27 |
1002 | 2021-05-25 | 初一 | 5 | 2021-05-20 | 2021-05-20 | 2021-05-27 | 2021-05-25 | 2021-05-27 |
1002 | 2021-05-26 | 初三 | 6 | 2021-05-20 | 2021-05-20 | 2021-05-27 | 2021-05-26 | 2021-05-27 |
1002 | 2021-05-27 | 初二 | 7 | 2021-05-20 | 2021-05-20 | 2021-05-27 | 2021-05-27 | 2021-05-27 |
1001 | 2021-05-20 | 初一 | 1 | 2021-05-20 | 2021-05-20 | 2021-05-27 | 2021-05-20 | 2021-05-27 |
1001 | 2021-05-21 | 初二 | 2 | 2021-05-20 | 2021-05-20 | 2021-05-27 | 2021-05-21 | 2021-05-27 |
1001 | 2021-05-23 | 初三 | 3 | 2021-05-20 | 2021-05-20 | 2021-05-27 | 2021-05-23 | 2021-05-27 |
1001 | 2021-05-24 | 初一 | 4 | 2021-05-20 | 2021-05-20 | 2021-05-27 | 2021-05-24 | 2021-05-27 |
1001 | 2021-05-25 | 初一 | 5 | 2021-05-20 | 2021-05-20 | 2021-05-27 | 2021-05-25 | 2021-05-27 |
1001 | 2021-05-26 | 初三 | 6 | 2021-05-20 | 2021-05-20 | 2021-05-27 | 2021-05-26 | 2021-05-27 |
1001 | 2021-05-27 | 初二 | 7 | 2021-05-20 | 2021-05-20 | 2021-05-27 | 2021-05-27 | 2021-05-27 |
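Another way to get the last value of the whole partition is to widen the frame explicitly with ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. The sketch below runs that idea through Python's built-in sqlite3 module as a local stand-in for Hive (an assumption for the sake of a runnable example; it needs SQLite 3.25 or later), on a hypothetical three-row table `s`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE s (school_id TEXT, date_key TEXT)")
conn.executemany("INSERT INTO s VALUES (?, ?)",
                 [("1001", "2021-05-20"),
                  ("1001", "2021-05-21"),
                  ("1001", "2021-05-23")])

rows = conn.execute("""
    SELECT date_key,
           -- default frame ends at the current row
           LAST_VALUE(date_key) OVER (PARTITION BY school_id
               ORDER BY date_key) AS last_default,
           -- explicit frame covers the whole partition
           LAST_VALUE(date_key) OVER (PARTITION BY school_id
               ORDER BY date_key
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_full
    FROM s
    ORDER BY date_key
""").fetchall()
for row in rows:
    print(row)
```

`last_default` simply echoes the current row's date, while `last_full` is 2021-05-23 on every row.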
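GROUPING SETS, GROUPING__ID, CUBE and ROLLUP are typically used in OLAP, where several metrics must be rolled up and drilled down along different dimensions, for example statistics by different combinations of school, grade and class.
GROUPING SETS aggregates by several dimension combinations within a single GROUP BY query; it is equivalent to the UNION ALL of the GROUP BY results for each combination. GROUPING__ID indicates which grouping set a result row belongs to.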
SELECT
school_id,
grade,
class,
COUNT(DISTINCT student_id) AS uv,
GROUPING__ID
FROM test.student_score
GROUP BY school_id,grade,class
GROUPING SETS (school_id,grade,class,(school_id,grade,class),(school_id,grade),(grade,class),(school_id,class))
ORDER BY GROUPING__ID;
Result:
school_id | grade | class | uv | GROUPING__ID |
---|---|---|---|---|
1001 | 初一 | 3 | 1 | 0 |
1002 | 初三 | 1 | 1 | 0 |
1001 | 初一 | 1 | 2 | 0 |
1002 | 初一 | 3 | 1 | 0 |
1001 | 初二 | 2 | 2 | 0 |
1001 | 初三 | 2 | 1 | 0 |
1002 | 初一 | 1 | 2 | 0 |
1002 | 初三 | 2 | 1 | 0 |
1002 | 初二 | 2 | 2 | 0 |
1001 | 初三 | 1 | 1 | 0 |
1001 | 初一 | NULL | 3 | 1 |
1002 | 初二 | NULL | 2 | 1 |
1001 | 初二 | NULL | 2 | 1 |
1002 | 初一 | NULL | 3 | 1 |
1001 | 初三 | NULL | 2 | 1 |
1002 | 初三 | NULL | 2 | 1 |
1002 | NULL | 2 | 3 | 2 |
1002 | NULL | 3 | 1 | 2 |
1001 | NULL | 1 | 3 | 2 |
1001 | NULL | 3 | 1 | 2 |
1001 | NULL | 2 | 3 | 2 |
1002 | NULL | 1 | 3 | 2 |
1001 | NULL | NULL | 7 | 3 |
1002 | NULL | NULL | 7 | 3 |
NULL | 初三 | 1 | 1 | 4 |
NULL | 初三 | 2 | 1 | 4 |
NULL | 初一 | 1 | 2 | 4 |
NULL | 初二 | 2 | 2 | 4 |
NULL | 初一 | 3 | 1 | 4 |
NULL | 初三 | NULL | 2 | 5 |
NULL | 初一 | NULL | 3 | 5 |
NULL | 初二 | NULL | 2 | 5 |
NULL | NULL | 3 | 1 | 6 |
NULL | NULL | 2 | 3 | 6 |
NULL | NULL | 1 | 3 | 6 |
The result shows that GROUPING SETS computes statistics for several dimension combinations in a single SQL query and merges (UNION ALL) the results. It is equivalent to:
SELECT school_id,grade,class,COUNT(DISTINCT student_id) AS uv,0 AS GROUPING__ID FROM test.student_score GROUP BY school_id,grade,class
UNION ALL
SELECT school_id,grade,NULL AS class,COUNT(DISTINCT student_id) AS uv,1 AS GROUPING__ID FROM test.student_score GROUP BY school_id,grade
UNION ALL
SELECT school_id,NULL AS grade,class,COUNT(DISTINCT student_id) AS uv,2 AS GROUPING__ID FROM test.student_score GROUP BY school_id,class
UNION ALL
SELECT NULL AS school_id,grade,class,COUNT(DISTINCT student_id) AS uv,4 AS GROUPING__ID FROM test.student_score GROUP BY grade,class
UNION ALL
SELECT school_id,NULL AS grade,NULL AS class,COUNT(DISTINCT student_id) AS uv,3 AS GROUPING__ID FROM test.student_score GROUP BY school_id
UNION ALL
SELECT NULL AS school_id,grade,NULL AS class,COUNT(DISTINCT student_id) AS uv,5 AS GROUPING__ID FROM test.student_score GROUP BY grade
UNION ALL
SELECT NULL AS school_id,NULL AS grade,class,COUNT(DISTINCT student_id) AS uv,6 AS GROUPING__ID FROM test.student_score GROUP BY class
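The "one GROUP BY per grouping set, then UNION ALL" idea can also be sketched in plain Python. The function `grouping_sets` and the three sample rows are hypothetical; the GROUPING__ID bit layout (school_id as the most significant bit, a bit set to 1 when the column is aggregated away) is an assumption chosen to match the values in the GROUPING SETS result table above.

```python
from collections import defaultdict

COLUMNS = ("school_id", "grade", "class")


def grouping_sets(rows, sets):
    """Distinct student count per grouping set.

    Returns {(grouping_id, school_id, grade, class): uv}.
    """
    result = {}
    for kept in sets:
        # a bit is 1 when its column is NOT part of the grouping set
        grouping_id = sum(1 << (len(COLUMNS) - 1 - i)
                          for i, c in enumerate(COLUMNS) if c not in kept)
        students = defaultdict(set)
        for row in rows:
            key = tuple(row[c] if c in kept else None for c in COLUMNS)
            students[key].add(row["student_id"])
        for key, ids in students.items():
            result[(grouping_id,) + key] = len(ids)
    return result


rows = [
    {"student_id": "10001", "school_id": "1001", "grade": "初一", "class": "1"},
    {"student_id": "10005", "school_id": "1001", "grade": "初一", "class": "1"},
    {"student_id": "10004", "school_id": "1001", "grade": "初一", "class": "3"},
]
uv = grouping_sets(rows, [("school_id", "grade", "class"),
                          ("school_id", "grade"),
                          ("school_id",)])
for key, count in sorted(uv.items(), key=lambda kv: kv[0][0]):
    print(key, count)
```

Columns that are not part of a grouping set come back as `None`, just as they appear as NULL in the Hive result.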
CUBE aggregates over every combination of the GROUP BY dimensions.
SELECT
school_id,
grade,
class,
COUNT(DISTINCT student_id) AS uv,
GROUPING__ID
FROM test.student_score
GROUP BY school_id,grade,class
WITH CUBE
ORDER BY GROUPING__ID;
Result:
school_id | grade | class | uv | GROUPING__ID |
---|---|---|---|---|
1001 | 初三 | 1 | 1 | 0 |
1002 | 初一 | 3 | 1 | 0 |
1001 | 初三 | 2 | 1 | 0 |
1001 | 初一 | 1 | 2 | 0 |
1002 | 初一 | 1 | 2 | 0 |
1001 | 初一 | 3 | 1 | 0 |
1002 | 初二 | 2 | 2 | 0 |
1001 | 初二 | 2 | 2 | 0 |
1002 | 初三 | 1 | 1 | 0 |
1002 | 初三 | 2 | 1 | 0 |
1002 | 初二 | NULL | 2 | 1 |
1002 | 初一 | NULL | 3 | 1 |
1001 | 初二 | NULL | 2 | 1 |
1001 | 初三 | NULL | 2 | 1 |
1002 | 初三 | NULL | 2 | 1 |
1001 | 初一 | NULL | 3 | 1 |
1001 | NULL | 3 | 1 | 2 |
1002 | NULL | 3 | 1 | 2 |
1002 | NULL | 1 | 3 | 2 |
1001 | NULL | 1 | 3 | 2 |
1002 | NULL | 2 | 3 | 2 |
1001 | NULL | 2 | 3 | 2 |
1002 | NULL | NULL | 7 | 3 |
1001 | NULL | NULL | 7 | 3 |
NULL | 初三 | 2 | 1 | 4 |
NULL | 初一 | 3 | 1 | 4 |
NULL | 初一 | 1 | 2 | 4 |
NULL | 初三 | 1 | 1 | 4 |
NULL | 初二 | 2 | 2 | 4 |
NULL | 初二 | NULL | 2 | 5 |
NULL | 初一 | NULL | 3 | 5 |
NULL | 初三 | NULL | 2 | 5 |
NULL | NULL | 3 | 1 | 6 |
NULL | NULL | 1 | 3 | 6 |
NULL | NULL | 2 | 3 | 6 |
NULL | NULL | NULL | 7 | 7 |
The result shows that CUBE computes statistics for every combination of the GROUP BY dimensions in a single SQL query and merges (UNION ALL) the results. It is equivalent to:
SELECT school_id,grade,class,COUNT(DISTINCT student_id) AS uv,0 AS GROUPING__ID FROM test.student_score GROUP BY school_id,grade,class
UNION ALL
SELECT school_id,grade,NULL AS class,COUNT(DISTINCT student_id) AS uv,1 AS GROUPING__ID FROM test.student_score GROUP BY school_id,grade
UNION ALL
SELECT school_id,NULL AS grade,class,COUNT(DISTINCT student_id) AS uv,2 AS GROUPING__ID FROM test.student_score GROUP BY school_id,class
UNION ALL
SELECT school_id,NULL AS grade,NULL AS class,COUNT(DISTINCT student_id) AS uv,3 AS GROUPING__ID FROM test.student_score GROUP BY school_id
UNION ALL
SELECT NULL AS school_id,grade,class,COUNT(DISTINCT student_id) AS uv,4 AS GROUPING__ID FROM test.student_score GROUP BY grade,class
UNION ALL
SELECT NULL AS school_id,grade,NULL AS class,COUNT(DISTINCT student_id) AS uv,5 AS GROUPING__ID FROM test.student_score GROUP BY grade
UNION ALL
SELECT NULL AS school_id,NULL AS grade,class,COUNT(DISTINCT student_id) AS uv,6 AS GROUPING__ID FROM test.student_score GROUP BY class
UNION ALL
SELECT NULL AS school_id,NULL AS grade,NULL AS class,COUNT(DISTINCT student_id) AS uv,7 AS GROUPING__ID FROM test.student_score
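ROLLUP is a subset of CUBE: it aggregates hierarchically, starting from the leftmost dimension.
GROUP BY ROLLUP(school_id,grade,class) can be understood as running GROUP BY repeatedly, dropping one column at a time from right to left.
UV by school, grade and class -> UV by school and grade -> UV by school -> total UV: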
SELECT
school_id,
grade,
class,
COUNT(DISTINCT student_id) AS uv,
GROUPING__ID
FROM test.student_score
GROUP BY school_id,grade,class
WITH ROLLUP
ORDER BY GROUPING__ID;
Result:
school_id | grade | class | uv | GROUPING__ID |
---|---|---|---|---|
1001 | 初一 | 3 | 1 | 0 |
1002 | 初三 | 1 | 1 | 0 |
1001 | 初一 | 1 | 2 | 0 |
1002 | 初一 | 3 | 1 | 0 |
1001 | 初二 | 2 | 2 | 0 |
1001 | 初三 | 2 | 1 | 0 |
1002 | 初一 | 1 | 2 | 0 |
1002 | 初三 | 2 | 1 | 0 |
1002 | 初二 | 2 | 2 | 0 |
1001 | 初三 | 1 | 1 | 0 |
1001 | 初一 | NULL | 3 | 1 |
1002 | 初二 | NULL | 2 | 1 |
1001 | 初二 | NULL | 2 | 1 |
1002 | 初一 | NULL | 3 | 1 |
1001 | 初三 | NULL | 2 | 1 |
1002 | 初三 | NULL | 2 | 1 |
1001 | NULL | NULL | 7 | 3 |
1002 | NULL | NULL | 7 | 3 |
NULL | NULL | NULL | 7 | 7 |
The result shows that GROUP BY ROLLUP(school_id,grade,class) groups in the order GROUP BY (school_id,grade,class) -> GROUP BY (school_id,grade) -> GROUP BY (school_id) -> no grouping (grand total), which is equivalent to:
SELECT school_id,grade,class,COUNT(DISTINCT student_id) AS uv,0 AS GROUPING__ID FROM test.student_score GROUP BY school_id,grade,class
UNION ALL
SELECT school_id,grade,NULL AS class,COUNT(DISTINCT student_id) AS uv,1 AS GROUPING__ID FROM test.student_score GROUP BY school_id,grade
UNION ALL
SELECT school_id,NULL AS grade,NULL AS class,COUNT(DISTINCT student_id) AS uv,3 AS GROUPING__ID FROM test.student_score GROUP BY school_id
UNION ALL
SELECT NULL AS school_id,NULL AS grade,NULL AS class,COUNT(DISTINCT student_id) AS uv,7 AS GROUPING__ID FROM test.student_score
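To see how ROLLUP relates to CUBE, the following Python sketch enumerates the grouping sets each one expands to. The helpers `rollup_sets`, `cube_sets` and `grouping_id` are hypothetical, and the bit layout of the id (leftmost column as the most significant bit, 1 when the column is aggregated away) is an assumption chosen to match the GROUPING__ID values 0, 1, 3, 7 in the ROLLUP result table.

```python
from itertools import combinations


def rollup_sets(columns):
    """Grouping sets produced by ROLLUP: drop one column at a time from the right."""
    return [tuple(columns[:k]) for k in range(len(columns), -1, -1)]


def cube_sets(columns):
    """Grouping sets produced by CUBE: every subset of the columns."""
    return [combo
            for k in range(len(columns), -1, -1)
            for combo in combinations(columns, k)]


def grouping_id(columns, kept):
    """Bitmask with a 1 for every column that is not in the grouping set."""
    return sum(1 << (len(columns) - 1 - i)
               for i, c in enumerate(columns) if c not in kept)


cols = ("school_id", "grade", "class")
rollup = rollup_sets(cols)
cube = cube_sets(cols)
rollup_ids = [grouping_id(cols, s) for s in rollup]
cube_ids = sorted(grouping_id(cols, s) for s in cube)
print(rollup)
print(rollup_ids)  # [0, 1, 3, 7]
print(cube_ids)    # [0, 1, 2, 3, 4, 5, 6, 7]
```

ROLLUP yields 4 of the 8 grouping sets that CUBE produces for three columns, which is the sense in which it is a subset of CUBE.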