1. 介绍
普通聚合函数聚合的行集是组,开窗函数聚合的行集是窗口。
因此,普通聚合函数每组(Group by)只有一个返回值,而开窗函数则可以为窗口中的每行都返回一个值。
1.1 基础结构
分析函数(如:sum(), max(), row_number()...) + 窗口子句(over函数)
1.2 over 函数
over(partition by [column_n] order by [column_m])
先按照column_n
分区,相同的column_n
分为一区,每个分区根据column_m
排序(默认升序)。
1.3 测试数据
建表并插入数据
-- 建表
create table student_scores(
id int,
studentId int,
language int,
math int,
english int,
classId string,
departmentId string
);
-- 写入数据
insert into table student_scores values
(1,111,68,69,90,'class1','department1'),
(2,112,73,80,96,'class1','department1'),
(3,113,90,74,75,'class1','department1'),
(4,114,89,94,93,'class1','department1'),
(5,115,99,93,89,'class1','department1'),
(6,121,96,74,79,'class2','department1'),
(7,122,89,86,85,'class2','department1'),
(8,123,70,78,61,'class2','department1'),
(9,124,76,70,76,'class2','department1'),
(10,211,89,93,60,'class1','department2'),
(11,212,76,83,75,'class1','department2'),
(12,213,71,94,90,'class1','department2'),
(13,214,94,94,66,'class1','department2'),
(14,215,84,82,73,'class1','department2'),
(15,216,85,74,93,'class1','department2'),
(16,221,77,99,61,'class2','department2'),
(17,222,80,78,96,'class2','department2'),
(18,223,79,74,96,'class2','department2'),
(19,224,75,80,78,'class2','department2'),
(20,225,82,85,63,'class2','department2');
1.4 窗口含义
select studentId,math,departmentId,classId,
-- 符合所有条件的行作为窗口,这里符合department1的有9个
count(math) over() as count1,
-- 按照classId分组的所有行作为窗口
count(math) over(partition by classId) as count2,
-- 按照classId分组、按照math排序的所有行作为窗口
count(math) over(partition by classId order by math) as count3,
-- 按照classId分组、按照math排序,当前行向前1行向后2行的行作为窗口
count(math) over(partition by classId order by math rows between 1 preceding and 2 following) as count4,
-- 按照classId分组、按照math排序,当前行向后所有行作为窗口
count(math) over(partition by classId order by math rows between current row and unbounded following) as count5
from student_scores where departmentId='department1';
结果:
studentId math departmentId classId count1 count2 count3 count4 count5
111 69 department1 class1 9 5 1 3 5
113 74 department1 class1 9 5 2 4 4
112 80 department1 class1 9 5 3 4 3
115 93 department1 class1 9 5 4 3 2
114 94 department1 class1 9 5 5 2 1
124 70 department1 class2 9 4 1 3 4
121 74 department1 class2 9 4 2 4 3
123 78 department1 class2 9 4 3 3 2
122 86 department1 class2 9 4 4 2 1
结果解析:
studentId=115,
count1为departmentId=department1的行数为9,
count2为分区class1中的行数5,
count3为分区class1中math<=93的行数4,
count4为分区class1中math值向前1行和向后2行的行数3,
count5为分区class1中当前math值到class1分区结束的行数2。
上面可以看到:如果不指定ROWS BETWEEN
,默认统计窗口是从起点到当前行
关键是ROWS BETWEEN
,也叫做window子句
。
PRECEDING
:向前
FOLLOWING
:向后
CURRENT ROW
:当前行
UNBOUNDED
:无边界,UNBOUNDED PRECEDING
表示从最前面的起点开始,UNBOUNDED FOLLOWING
表示到最后面的终点
开窗函数可以粗略地分为两类:聚合开窗函数和排序开窗函数。
2. 聚合开窗函数
2.1 sum 函数
select studentId,math,departmentId,classId,
sum(math) over() as sum1,
sum(math) over(partition by classId) as sum2,
sum(math) over(partition by classId order by math) as sum3,
sum(math) over(partition by classId order by math rows between 1 preceding and 2 following) as sum4,
sum(math) over(partition by classId order by math rows between current row and unbounded following) as sum5
from student_scores where departmentId='department1';
-- 结果解析:类似count()函数
min(),max(),avg()都与 count()类似
2.2 first_value 开窗函数
作用:返回分区中的第一个值
select studentId,math,departmentId,classId,
first_value(math) over(partition by classId order by math) as first_value1,
first_value(math) over(partition by classId order by math rows between 1 preceding and 2 following) as first_value2
from student_scores where departmentId='department1';
结果:
studentid math departmentid classid first_value1 first_value2
111 69 department1 class1 69 69
113 74 department1 class1 69 69
112 80 department1 class1 69 74
115 93 department1 class1 69 80
114 94 department1 class1 69 93
124 70 department1 class2 70 70
121 74 department1 class2 70 70
123 78 department1 class2 70 74
122 86 department1 class2 70 78
结果解析:
studentId=122,
first_value1为分区class2中按照math排序的第一个值70,
first_value2为分区class2中按照math排序后当前行向前1行向后2行区间的第一个值78。
2.3 last_vlaue 开窗函数
作用:返回分区最后一个值
hive> select * from my_table;
OK
1 3 10 A
2 NULL 10 NULL
3 5 10 A
4 NULL 10 NULL
5 NULL 10 NULL
6 2 10 A
Time taken: 0.06 seconds, Fetched: 6 row(s)
- 您需要做的就是在上一行和当前行之间的窗口上滑动,并找到最近的非空值。 LAST_VALUE可窗口函数有一个参数,可以将空值忽略为boolean。 LAST_VALUE(
, as boolean);
SELECT
COALESCE(`date`, LAST_VALUE(`date`, TRUE) OVER(ORDER BY `date` ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)),
COALESCE(number, LAST_VALUE(number, TRUE) OVER(ORDER BY `date` ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)),
COALESCE(estimate, LAST_VALUE(estimate, TRUE) OVER(ORDER BY `date` ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)),
COALESCE(client, LAST_VALUE(client, TRUE) OVER(ORDER BY `date` ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW))
FROM my_table;
结果将是
OK
1 3 10 A
2 3 10 A
3 5 10 A
4 5 10 A
5 5 10 A
6 2 10 A
Time taken: 19.177 seconds, Fetched: 6 row(s)
2.4 lag 开窗函数
作用:LAG(col, n, DEFAULT)
用于统计窗口内向上第 n 行的值
col
:列名
n
:向上 n 行,[可选,默认为 1]
DEFAULT
:当向上 n 行为 NULL 时,取默认值;如果不指定,则为 NULL
select studentId,math,departmentId,classId,
lag(math,2,60) over(partition by classId order by math) as lag1,
lag(math,2) over(partition by classId order by math) as lag2
from student_scores where departmentId='department1';
结果:
studentid math departmentid classid lag1 lag2
111 69 department1 class1 60 NULL
113 74 department1 class1 60 NULL
112 80 department1 class1 69 69
115 93 department1 class1 74 74
114 94 department1 class1 80 80
124 70 department1 class2 60 NULL
121 74 department1 class2 60 NULL
123 78 department1 class2 70 70
122 86 department1 class2 74 74
结果解析:
studentId=113,
lag1为分区class1按照math排序后当前行向上2行的值NULL,但是设置了DEFUALT,所以为60,
lag2因为没有设置DEFAULT,所以为NULL。
2.5 lead 开窗函数
作用:LEAD(col, n, DEFAULT)
与LAG
相反,用于统计窗口内向下 n 行的值
col
:列名
n
:向下 n 行,[可选,默认为 1]
DEFAULT
:当向下 n 行为 NULL 时,取默认值;如果不指定,则为 NULL
2.6 cume_dist 开窗函数
作用:计算某个窗口或分区中某个值的累积分布。假定升序排序,则使用以下公式确定累积分布:(小于等于当前值的行数) / (分区内总行数)
select studentId,math,departmentId,classId,
cume_dist() over(order by math) as cume_dist1,
cume_dist() over(order by math desc) as cume_dist2,
cume_dist() over(partition by classId order by math) as cume_dist3
from student_scores where departmentId='department1';
结果:
studentid math departmentid classid cume_dist1 cume_dist2 cume_dist3
111 69 department1 class1 0.1111111111111111 1.0 0.2
113 74 department1 class1 0.4444444444444444 0.7777777777777778 0.4
112 80 department1 class1 0.6666666666666666 0.4444444444444444 0.6
115 93 department1 class1 0.8888888888888888 0.2222222222222222 0.8
114 94 department1 class1 1.0 0.1111111111111111 1.0
124 70 department1 class2 0.2222222222222222 0.8888888888888888 0.25
121 74 department1 class2 0.4444444444444444 0.7777777777777778 0.5
123 78 department1 class2 0.5555555555555556 0.5555555555555556 0.75
122 86 department1 class2 0.7777777777777778 0.3333333333333333 1.0
结果解析:
studentId=115,
cume_dist1=小于等于93的行数8/总行数9=0.8888888888888888
cume_dist2=大于等于93的行数2/总行数9=0.2222222222222222
cume_dist3=class1分区内小于等于93的行数4/总行数5=0.8
3. 排序开窗函数
3.1 row_number 开窗函数
作用:row_number() over([partition by col1] [order by col2])
开窗函数是基于over
子句中order by
列的一个排名。在窗口或分区内从 1 开始排序,即使遇到col2
相等时,名次依旧增加。例如:有两条记录相等,但一个是第一,一个是第二。
select id,studentId,language,math,english,classId,departmentId,
row_number() over (order by math) as row_number1
from student_scores;
结果:
id studentid language math english classid departmentid row_number1
1 111 68 69 90 class1 department1 1
9 124 76 70 76 class2 department1 2
18 223 79 74 96 class2 department2 3
15 216 85 74 93 class1 department2 4
3 113 90 74 75 class1 department1 5
6 121 96 74 79 class2 department1 6
8 123 70 78 61 class2 department1 7
17 222 80 78 96 class2 department2 8
19 224 75 80 78 class2 department2 9
2 112 73 80 96 class1 department1 10
14 215 84 82 73 class1 department2 11
11 212 76 83 75 class1 department2 12
20 225 82 85 63 class2 department2 13
7 122 89 86 85 class2 department1 14
5 115 99 93 89 class1 department1 15
10 211 89 93 60 class1 department2 16
12 213 71 94 90 class1 department2 17
4 114 89 94 93 class1 department1 18
13 214 94 94 66 class1 department2 19
16 221 77 99 61 class2 department2 20
studentId=223, 216, 113, 121的math都为74,但是row_number1的值是递增的
3.2 rank 开窗函数
作用:rank() over([partition by col1] [order by col2])
,当遇到col2
相等时,名次相同,但是下一个 col2 值的名次递增 N(N 是重复的次数)。例如:有两条记录是并列第一,下一个是第三,没有第二。
select id,studentId,language,math,english,classId,departmentId,
dense_rank() over (order by math) as rank1
from student_scores;
结果:
id studentid language math english classid departmentid rank1
1 111 68 69 90 class1 department1 1
9 124 76 70 76 class2 department1 2
18 223 79 74 96 class2 department2 3
15 216 85 74 93 class1 department2 3
3 113 90 74 75 class1 department1 3
6 121 96 74 79 class2 department1 3
8 123 70 78 61 class2 department1 7
17 222 80 78 96 class2 department2 7
19 224 75 80 78 class2 department2 9
2 112 73 80 96 class1 department1 9
14 215 84 82 73 class1 department2 11
11 212 76 83 75 class1 department2 12
20 225 82 85 63 class2 department2 13
7 122 89 86 85 class2 department1 14
5 115 99 93 89 class1 department1 15
10 211 89 93 60 class1 department2 15
12 213 71 94 90 class1 department2 17
4 114 89 94 93 class1 department1 17
13 214 94 94 66 class1 department2 17
16 221 77 99 61 class2 department2 20
从第3行到第6行,重复出现了4次,所以第7行rank1为3+4=7
3.3 dense_rank 开窗函数
作用:dense_rank() over([partition by col1] [order by col2])
与rank
类似,当遇到col2
相等时,名次同样相等,不同的是,下一个 col2 值的名次+1,而不是+N。
select id,studentId,language,math,english,classId,departmentId,
dense_rank() over (order by math) as dense_rank1
from student_scores;
结果:
id studentid language math english classid departmentid dense_rank1
1 111 68 69 90 class1 department1 1
9 124 76 70 76 class2 department1 2
18 223 79 74 96 class2 department2 3
15 216 85 74 93 class1 department2 3
3 113 90 74 75 class1 department1 3
6 121 96 74 79 class2 department1 3
8 123 70 78 61 class2 department1 4
17 222 80 78 96 class2 department2 4
19 224 75 80 78 class2 department2 5
2 112 73 80 96 class1 department1 5
14 215 84 82 73 class1 department2 6
11 212 76 83 75 class1 department2 7
20 225 82 85 63 class2 department2 8
7 122 89 86 85 class2 department1 9
5 115 99 93 89 class1 department1 10
10 211 89 93 60 class1 department2 10
12 213 71 94 90 class1 department2 11
4 114 89 94 93 class1 department1 11
13 214 94 94 66 class1 department2 11
16 221 77 99 61 class2 department2 12
从第3行到第6行math值相等,dense_rank1为3,所以第7行dense_rank1为4
3.4 ntile 开窗函数
作用:ntile(N) over([partition by col1] [order by col2])
,将分区中的数据按照顺序划分为 N 片,返回当前片的值。
注 1:如果切片分布不均匀,默认增加第一个切片的分布
注 2:ntile
不支持ROWS BETWEEN
select id,studentId,language,math,english,classId,departmentId,
ntile(2) over(partition by departmentid order by math) as ntile1
from student_scores;
结果:
id studentid language math english classid departmentid ntile1
1 111 68 69 90 class1 department1 1
9 124 76 70 76 class2 department1 1
6 121 96 74 79 class2 department1 1
3 113 90 74 75 class1 department1 1
8 123 70 78 61 class2 department1 1
2 112 73 80 96 class1 department1 2
7 122 89 86 85 class2 department1 2
5 115 99 93 89 class1 department1 2
4 114 89 94 93 class1 department1 2
15 216 85 74 93 class1 department2 1
18 223 79 74 96 class2 department2 1
17 222 80 78 96 class2 department2 1
19 224 75 80 78 class2 department2 1
14 215 84 82 73 class1 department2 1
11 212 76 83 75 class1 department2 1
20 225 82 85 63 class2 department2 2
10 211 89 93 60 class1 department2 2
12 213 71 94 90 class1 department2 2
13 214 94 94 66 class1 department2 2
16 221 77 99 61 class2 department2 2
3.5 percent_rank 开窗函数
作用:percent_rank() over([partition by col1] [order by col2])
,计算给定行的百分比排名。分组内当前行的 RANK 值-1/分组内总行数-1,可以用来计算超过了百分之多少的人。
select id,studentid,language,math,english,classid,departmentid,
percent_rank() over(order by math) as rn1
from student_scores;
结果:
id studentid language math english classid departmentid rn1
1 111 68 69 90 class1 department1 0.0
9 124 76 70 76 class2 department1 0.05263157894736842
18 223 79 74 96 class2 department2 0.10526315789473684
15 216 85 74 93 class1 department2 0.10526315789473684
3 113 90 74 75 class1 department1 0.10526315789473684
6 121 96 74 79 class2 department1 0.10526315789473684
8 123 70 78 61 class2 department1 0.3157894736842105
17 222 80 78 96 class2 department2 0.3157894736842105
19 224 75 80 78 class2 department2 0.42105263157894735
2 112 73 80 96 class1 department1 0.42105263157894735
14 215 84 82 73 class1 department2 0.5263157894736842
11 212 76 83 75 class1 department2 0.5789473684210527
20 225 82 85 63 class2 department2 0.631578947368421
7 122 89 86 85 class2 department1 0.6842105263157895
5 115 99 93 89 class1 department1 0.7368421052631579
10 211 89 93 60 class1 department2 0.7368421052631579
12 213 71 94 90 class1 department2 0.8421052631578947
4 114 89 94 93 class1 department1 0.8421052631578947
13 214 94 94 66 class1 department2 0.8421052631578947
16 221 77 99 61 class2 department2 1.0
第二行,rn1的值为(2-1)/(20-1)=0.05263157894736842