To borrow a widely quoted definition: a window function calculates a return value for every input row of a table based on a group of rows. How window functions differ from other kinds of functions:
- Ordinary functions: operate on a single record and compute a new column value (the number of records is unchanged);
- Aggregate functions: operate on a group of records (the data is partitioned into groups in some way) and compute one aggregated value per group (the number of records shrinks);
- Window functions: operate on each record, and for every record select a set of related records over which a value is computed (the number of records is unchanged); the sketch below illustrates the contrast.
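A minimal sketch of that contrast, assuming a hypothetical table t(department, salary) (the table and column names are illustrative only): the GROUP BY query collapses rows, while the window version keeps one output row per input row.

# Aggregate: one output row per department.
spark.sql("SELECT department, max(salary) AS max_salary FROM t GROUP BY department").show()

# Window: every input row is kept, each annotated with its department's maximum.
spark.sql("SELECT department, salary, max(salary) OVER (PARTITION BY department) AS max_salary FROM t").show()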
Window function syntax
<window function>(arguments)
OVER
(
    [PARTITION BY <column list>]
    [ORDER BY <sort column list> [ASC|DESC]]
    [(ROWS | RANGE) <frame specification>]
)
- Function name: the window function being applied;
- OVER: keyword marking this as a window function rather than an ordinary aggregate;
- Clauses:
  - PARTITION BY: the partitioning (grouping) columns;
  - ORDER BY: the ordering columns within each partition;
  - ROWS/RANGE frame clause: controls the size and boundaries of the window; there are two kinds (ROWS, RANGE):
    - ROWS: a physical window; rows are selected by their position (index) after sorting;
    - RANGE: a logical window; rows are selected by value.
There are three main kinds of window functions (a DataFrame-API sketch of the OVER clause follows this list):
- ranking functions: ordering/ranking, e.g. rank(...), row_number(...), etc.
- analytic functions: offset and comparison functions, e.g. lead(...), lag(...), first_value(...), etc.
- aggregate functions: e.g. sum(...), max(...), min(...), avg(...), etc.
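The examples below use Spark SQL, but the same OVER specification can be built with the DataFrame API through pyspark.sql.Window; a minimal sketch (the column names match the data loaded below):

from pyspark.sql import Window, functions as F

# Equivalent of OVER (PARTITION BY department ORDER BY salary)
w = Window.partitionBy('department').orderBy('salary')

# A frame can be added with rowsBetween / rangeBetween, e.g. a running frame:
w_frame = w.rowsBetween(Window.unboundedPreceding, Window.currentRow)

# Usage: df.withColumn('rank', F.rank().over(w))
#        df.withColumn('running_sum', F.sum('salary').over(w_frame))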
Data loading
# Assumes an active SparkSession named `spark`.
from datetime import datetime
from pyspark.sql.types import StructType, StringType, TimestampType, IntegerType

schema = (StructType()
          .add('name', StringType(), True)
          .add('create_time', TimestampType(), True)
          .add('department', StringType(), True)
          .add('salary', IntegerType(), True))
df = spark.createDataFrame([
("Tom", datetime.strptime("2020-01-01 00:01:00", "%Y-%m-%d %H:%M:%S"), "Sales", 4500),
("Georgi", datetime.strptime("2020-01-02 12:01:00", "%Y-%m-%d %H:%M:%S"), "Sales", 4200),
("Kyoichi", datetime.strptime("2020-02-02 12:10:00", "%Y-%m-%d %H:%M:%S"), "Sales", 3000),
("Berni", datetime.strptime("2020-01-10 11:01:00", "%Y-%m-%d %H:%M:%S"), "Sales", 4700),
("Berni", datetime.strptime("2020-01-07 11:01:00", "%Y-%m-%d %H:%M:%S"), "Sales", None),
("Guoxiang", datetime.strptime("2020-01-08 12:11:00", "%Y-%m-%d %H:%M:%S"), "Sales", 4200),
("Parto", datetime.strptime("2020-02-20 12:01:00", "%Y-%m-%d %H:%M:%S"), "Finance", 2700),
("Anneke", datetime.strptime("2020-01-02 08:20:00", "%Y-%m-%d %H:%M:%S"), "Finance", 3300),
("Sumant", datetime.strptime("2020-01-30 12:01:05", "%Y-%m-%d %H:%M:%S"), "Finance", 3900),
("Jeff", datetime.strptime("2020-01-02 12:01:00", "%Y-%m-%d %H:%M:%S"), "Marketing", 3100),
("Patricio", datetime.strptime("2020-01-05 12:18:00", "%Y-%m-%d %H:%M:%S"), "Marketing", 2500)
], schema=schema)
df.createOrReplaceTempView('salary')
df.show()
+--------+-------------------+----------+------+
| name| create_time|department|salary|
+--------+-------------------+----------+------+
| Tom|2020-01-01 00:01:00| Sales| 4500|
| Georgi|2020-01-02 12:01:00| Sales| 4200|
| Kyoichi|2020-02-02 12:10:00| Sales| 3000|
| Berni|2020-01-10 11:01:00| Sales| 4700|
| Berni|2020-01-07 11:01:00| Sales| null|
|Guoxiang|2020-01-08 12:11:00| Sales| 4200|
| Parto|2020-02-20 12:01:00| Finance| 2700|
| Anneke|2020-01-02 08:20:00| Finance| 3300|
| Sumant|2020-01-30 12:01:05| Finance| 3900|
| Jeff|2020-01-02 12:01:00| Marketing| 3100|
|Patricio|2020-01-05 12:18:00| Marketing| 2500|
+--------+-------------------+----------+------+
ranking functions
| SQL | DataFrame | Description |
| --- | --- | --- |
| row_number | rowNumber | Unique sequence number within the partition, from 1 to n |
| rank | rank | Ranking; rows with equal values get the same rank. Unlike dense_rank, rank leaves gaps after ties (e.g. 1, 2, 2, 4) |
| dense_rank | denseRank | Same as rank, but without gaps after ties (e.g. 1, 2, 2, 3) |
| percent_rank | percentRank | Computed as (rank within group - 1) / (rows in group - 1); 0 when the group has a single row |
| ntile | ntile | After sorting within the group, splits the rows into n buckets and returns the current row's bucket number (starting from 1) |
spark.sql("""
SELECT
name
,department
,salary
,row_number() over(partition by department order by salary) as index
,rank() over(partition by department order by salary) as rank
,dense_rank() over(partition by department order by salary) as dense_rank
,percent_rank() over(partition by department order by salary) as percent_rank
,ntile(2) over(partition by department order by salary) as ntile
FROM salary
""").toPandas()
|    | name     | department | salary | index | rank | dense_rank | percent_rank | ntile |
|----|----------|------------|--------|-------|------|------------|--------------|-------|
| 0  | Patricio | Marketing  | 2500.0 | 1     | 1    | 1          | 0.0          | 1     |
| 1  | Jeff     | Marketing  | 3100.0 | 2     | 2    | 2          | 1.0          | 2     |
| 2  | Berni    | Sales      | NaN    | 1     | 1    | 1          | 0.0          | 1     |
| 3  | Kyoichi  | Sales      | 3000.0 | 2     | 2    | 2          | 0.2          | 1     |
| 4  | Georgi   | Sales      | 4200.0 | 3     | 3    | 3          | 0.4          | 1     |
| 5  | Guoxiang | Sales      | 4200.0 | 4     | 3    | 3          | 0.4          | 2     |
| 6  | Tom      | Sales      | 4500.0 | 5     | 5    | 4          | 0.8          | 2     |
| 7  | Berni    | Sales      | 4700.0 | 6     | 6    | 5          | 1.0          | 2     |
| 8  | Parto    | Finance    | 2700.0 | 1     | 1    | 1          | 0.0          | 1     |
| 9  | Anneke   | Finance    | 3300.0 | 2     | 2    | 2          | 0.5          | 1     |
| 10 | Sumant   | Finance    | 3900.0 | 3     | 3    | 3          | 1.0          | 2     |
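For reference, a sketch of the same ranking query written with the DataFrame API (it assumes the df defined above):

from pyspark.sql import Window, functions as F

w = Window.partitionBy('department').orderBy('salary')

df.select(
    'name', 'department', 'salary',
    F.row_number().over(w).alias('index'),
    F.rank().over(w).alias('rank'),
    F.dense_rank().over(w).alias('dense_rank'),
    F.percent_rank().over(w).alias('percent_rank'),
    F.ntile(2).over(w).alias('ntile'),
).toPandas()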
analytic functions
| SQL | DataFrame | Description |
| --- | --- | --- |
| cume_dist | cumeDist | Computed as (rows in the group with a value <= the current row's value) / (total rows in the group) |
| lag | lag | lag(input, [offset[, default]]): the value of input at offset rows before the current row; default is returned when no such row exists |
| lead | lead | The opposite of lag: the value at offset rows after the current row |
| first_value | first_value | The first value in the partition after sorting, up to the current row |
| last_value | last_value | The last value in the partition after sorting, up to the current row |
spark.sql("""
SELECT
name
,department
,salary
,row_number() over(partition by department order by salary) as index
,cume_dist() over(partition by department order by salary) as cume_dist
,lag(salary, 1) over(partition by department order by salary) as lag -- previous row (offset 1 upward)
,lead(salary, 1) over(partition by department order by salary) as lead -- next row (offset 1 downward)
,lag(salary, 0) over(partition by department order by salary) as lag_0
,lead(salary, 0) over(partition by department order by salary) as lead_0
,first_value(salary) over(partition by department order by salary) as first_value
,last_value(salary) over(partition by department order by salary) as last_value
FROM salary
""").toPandas()
|    | name     | department | salary | index | cume_dist | lag    | lead   | lag_0  | lead_0 | first_value | last_value |
|----|----------|------------|--------|-------|-----------|--------|--------|--------|--------|-------------|------------|
| 0  | Patricio | Marketing  | 2500.0 | 1     | 0.500000  | NaN    | 3100.0 | 2500.0 | 2500.0 | 2500.0      | 2500.0     |
| 1  | Jeff     | Marketing  | 3100.0 | 2     | 1.000000  | 2500.0 | NaN    | 3100.0 | 3100.0 | 2500.0      | 3100.0     |
| 2  | Berni    | Sales      | NaN    | 1     | 0.166667  | NaN    | 3000.0 | NaN    | NaN    | NaN         | NaN        |
| 3  | Kyoichi  | Sales      | 3000.0 | 2     | 0.333333  | NaN    | 4200.0 | 3000.0 | 3000.0 | NaN         | 3000.0     |
| 4  | Georgi   | Sales      | 4200.0 | 3     | 0.666667  | 3000.0 | 4200.0 | 4200.0 | 4200.0 | NaN         | 4200.0     |
| 5  | Guoxiang | Sales      | 4200.0 | 4     | 0.666667  | 4200.0 | 4500.0 | 4200.0 | 4200.0 | NaN         | 4200.0     |
| 6  | Tom      | Sales      | 4500.0 | 5     | 0.833333  | 4200.0 | 4700.0 | 4500.0 | 4500.0 | NaN         | 4500.0     |
| 7  | Berni    | Sales      | 4700.0 | 6     | 1.000000  | 4500.0 | NaN    | 4700.0 | 4700.0 | NaN         | 4700.0     |
| 8  | Parto    | Finance    | 2700.0 | 1     | 0.333333  | NaN    | 3300.0 | 2700.0 | 2700.0 | 2700.0      | 2700.0     |
| 9  | Anneke   | Finance    | 3300.0 | 2     | 0.666667  | 2700.0 | 3900.0 | 3300.0 | 3300.0 | 2700.0      | 3300.0     |
| 10 | Sumant   | Finance    | 3900.0 | 3     | 1.000000  | 3300.0 | NaN    | 3900.0 | 3900.0 | 2700.0      | 3900.0     |
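Note that last_value above simply echoes the current row's salary: with ORDER BY and no explicit frame, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so the last value in the frame is the current row. To get the last value of the whole partition instead, widen the frame explicitly; a minimal sketch:

spark.sql("""
SELECT
name
,department
,salary
,last_value(salary) over(partition by department order by salary
                         rows between unbounded preceding and unbounded following) as partition_last_value
FROM salary
""").toPandas()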
aggregate functions
These are just the ordinary aggregate functions, evaluated over a window.
| SQL | Description |
| --- | --- |
| avg | Average |
| sum | Sum |
| min | Minimum |
| max | Maximum |
spark.sql("""
SELECT
name
,department
,salary
,row_number() over(partition by department order by salary) as index
,sum(salary) over(partition by department order by salary) as sum
,avg(salary) over(partition by department order by salary) as avg
,min(salary) over(partition by department order by salary) as min
,max(salary) over(partition by department order by salary) as max
FROM salary
""").toPandas()
|    | name     | department | salary | index | sum     | avg    | min    | max    |
|----|----------|------------|--------|-------|---------|--------|--------|--------|
| 0  | Patricio | Marketing  | 2500.0 | 1     | 2500.0  | 2500.0 | 2500.0 | 2500.0 |
| 1  | Jeff     | Marketing  | 3100.0 | 2     | 5600.0  | 2800.0 | 2500.0 | 3100.0 |
| 2  | Berni    | Sales      | NaN    | 1     | NaN     | NaN    | NaN    | NaN    |
| 3  | Kyoichi  | Sales      | 3000.0 | 2     | 3000.0  | 3000.0 | 3000.0 | 3000.0 |
| 4  | Georgi   | Sales      | 4200.0 | 3     | 11400.0 | 3800.0 | 3000.0 | 4200.0 |
| 5  | Guoxiang | Sales      | 4200.0 | 4     | 11400.0 | 3800.0 | 3000.0 | 4200.0 |
| 6  | Tom      | Sales      | 4500.0 | 5     | 15900.0 | 3975.0 | 3000.0 | 4500.0 |
| 7  | Berni    | Sales      | 4700.0 | 6     | 20600.0 | 4120.0 | 3000.0 | 4700.0 |
| 8  | Parto    | Finance    | 2700.0 | 1     | 2700.0  | 2700.0 | 2700.0 | 2700.0 |
| 9  | Anneke   | Finance    | 3300.0 | 2     | 6000.0  | 3000.0 | 2700.0 | 3300.0 |
| 10 | Sumant   | Finance    | 3900.0 | 3     | 9900.0  | 3300.0 | 2700.0 | 3900.0 |
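Because ORDER BY is present, the default frame again stops at the current row, so sum/avg/min/max above are running (cumulative) values; the tied salaries (the two 4200 rows) share the same running sum because the default frame is RANGE-based and includes peer rows. For whole-partition totals, drop the ORDER BY (or widen the frame); a minimal sketch:

spark.sql("""
SELECT
name
,department
,salary
,sum(salary) over(partition by department) as department_sum
,avg(salary) over(partition by department) as department_avg
FROM salary
""").toPandas()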
Frame clause
ROWS/RANGE frame clause: controls the size and boundaries of the window; there are two kinds (ROWS, RANGE):
- ROWS: a physical window; rows are selected by their position (index) after sorting
- RANGE: a logical window; rows are selected by value
Syntax: OVER (PARTITION BY ... ORDER BY ... frame_type BETWEEN start AND end)
The following boundary specifiers are available (a DataFrame-API sketch follows this list):
- CURRENT ROW: the current row
- UNBOUNDED PRECEDING: the first row of the partition
- UNBOUNDED FOLLOWING: the last row of the partition
- n PRECEDING: n rows before the current row
- n FOLLOWING: n rows after the current row
- UNBOUNDED: no limit in the given direction, used as UNBOUNDED PRECEDING / UNBOUNDED FOLLOWING
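In the DataFrame API these boundaries map to rowsBetween / rangeBetween with the constants Window.unboundedPreceding, Window.currentRow and Window.unboundedFollowing; a minimal sketch mirroring the frame used in the query below:

from pyspark.sql import Window, functions as F

# ROWS frame covering everything before the current row, i.e.
# "rows between UNBOUNDED PRECEDING AND 1 PRECEDING".
w_prev = (Window.partitionBy('department')
                .orderBy('create_time')
                .rowsBetween(Window.unboundedPreceding, -1))

df.withColumn('before_salarys', F.collect_list('salary').over(w_prev))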
spark.sql("""
SELECT
name
,department
,create_time
,row_number() over(partition by department order by create_time) as index
,row_number() over(partition by department order by (case when salary is not null then create_time end)) as index_ignore_null
,salary
,collect_list(salary) over(partition by department order by create_time rows between UNBOUNDED PRECEDING AND 1 PRECEDING) as before_salarys
,last(salary) over(partition by department order by create_time rows between UNBOUNDED PRECEDING AND 1 PRECEDING) as before_salary1
,lag(salary, 1) over(partition by department order by create_time) as before_salary2
,lead(salary, 1) over(partition by department order by create_time) as after_salary
FROM salary
ORDER BY department, index
""").toPandas()
|    | name     | department | create_time         | index | index_ignore_null | salary | before_salarys           | before_salary1 | before_salary2 | after_salary |
|----|----------|------------|---------------------|-------|-------------------|--------|--------------------------|----------------|----------------|--------------|
| 0  | Anneke   | Finance    | 2020-01-02 08:20:00 | 1     | 1                 | 3300.0 | []                       | NaN            | NaN            | 3900.0       |
| 1  | Sumant   | Finance    | 2020-01-30 12:01:05 | 2     | 2                 | 3900.0 | [3300]                   | 3300.0         | 3300.0         | 2700.0       |
| 2  | Parto    | Finance    | 2020-02-20 12:01:00 | 3     | 3                 | 2700.0 | [3300, 3900]             | 3900.0         | 3900.0         | NaN          |
| 3  | Jeff     | Marketing  | 2020-01-02 12:01:00 | 1     | 1                 | 3100.0 | []                       | NaN            | NaN            | 2500.0       |
| 4  | Patricio | Marketing  | 2020-01-05 12:18:00 | 2     | 2                 | 2500.0 | [3100]                   | 3100.0         | 3100.0         | NaN          |
| 5  | Tom      | Sales      | 2020-01-01 00:01:00 | 1     | 2                 | 4500.0 | []                       | NaN            | NaN            | 4200.0       |
| 6  | Georgi   | Sales      | 2020-01-02 12:01:00 | 2     | 3                 | 4200.0 | [4500]                   | 4500.0         | 4500.0         | NaN          |
| 7  | Berni    | Sales      | 2020-01-07 11:01:00 | 3     | 1                 | NaN    | [4500, 4200]             | 4200.0         | 4200.0         | 4200.0       |
| 8  | Guoxiang | Sales      | 2020-01-08 12:11:00 | 4     | 4                 | 4200.0 | [4500, 4200]             | NaN            | NaN            | 4700.0       |
| 9  | Berni    | Sales      | 2020-01-10 11:01:00 | 5     | 5                 | 4700.0 | [4500, 4200, 4200]       | 4200.0         | 4200.0         | 3000.0       |
| 10 | Kyoichi  | Sales      | 2020-02-02 12:10:00 | 6     | 6                 | 3000.0 | [4500, 4200, 4200, 4700] | 4700.0         | 4700.0         | NaN          |
spark.sql("""
SELECT
name
,department
,create_time
,index
,salary
,before_salarys[size(before_salarys)-1] as before_salary
FROM(
SELECT
name
,department
,create_time
,row_number() over(partition by department order by create_time) as index
,salary
,collect_list(salary) over(partition by department order by create_time rows between UNBOUNDED PRECEDING AND 1 PRECEDING) as before_salarys
FROM salary
ORDER BY department, index
) AS base
""").toPandas()
|    | name     | department | create_time         | index | salary | before_salary |
|----|----------|------------|---------------------|-------|--------|---------------|
| 0  | Anneke   | Finance    | 2020-01-02 08:20:00 | 1     | 3300.0 | NaN           |
| 1  | Sumant   | Finance    | 2020-01-30 12:01:05 | 2     | 3900.0 | 3300.0        |
| 2  | Parto    | Finance    | 2020-02-20 12:01:00 | 3     | 2700.0 | 3900.0        |
| 3  | Jeff     | Marketing  | 2020-01-02 12:01:00 | 1     | 3100.0 | NaN           |
| 4  | Patricio | Marketing  | 2020-01-05 12:18:00 | 2     | 2500.0 | 3100.0        |
| 5  | Tom      | Sales      | 2020-01-01 00:01:00 | 1     | 4500.0 | NaN           |
| 6  | Georgi   | Sales      | 2020-01-02 12:01:00 | 2     | 4200.0 | 4500.0        |
| 7  | Berni    | Sales      | 2020-01-07 11:01:00 | 3     | NaN    | 4200.0        |
| 8  | Guoxiang | Sales      | 2020-01-08 12:11:00 | 4     | 4200.0 | 4200.0        |
| 9  | Berni    | Sales      | 2020-01-10 11:01:00 | 5     | 4700.0 | 4200.0        |
| 10 | Kyoichi  | Sales      | 2020-02-02 12:10:00 | 6     | 3000.0 | 4700.0        |
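The before_salarys[size(before_salarys)-1] expression just takes the last element of the collected array; assuming Spark 2.4 or later, the same value can be obtained with element_at and a negative index (with ANSI mode off, the out-of-range index on the first row's empty array yields NULL). A sketch:

spark.sql("""
SELECT
name
,department
,salary
,element_at(
    collect_list(salary) over(partition by department order by create_time
                              rows between unbounded preceding and 1 preceding),
    -1) as before_salary
FROM salary
""").toPandas()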
Combined usage
spark.sql("""
SELECT
name
,department
,salary
,row_number() over(partition by department order by salary) as index
,salary - (min(salary) over(partition by department order by salary)) as salary_diff -- how much above the department's minimum salary
,min(salary) over() as min_salary_0 -- overall minimum salary
,first_value(salary) over(order by salary) as min_salary_1 -- same, via first_value
,max(salary) over(order by salary) as current_max_salary_0 -- running maximum salary so far
,last_value(salary) over(order by salary) as current_max_salary_1
,max(salary) over(partition by department order by salary rows between 1 FOLLOWING and 1 FOLLOWING) as next_salary_0 -- next record when ordered by salary
,lead(salary) over(partition by department order by salary) as next_salary_1
FROM salary
WHERE salary is not null
""").toPandas()
|   | name     | department | salary | index | salary_diff | min_salary_0 | min_salary_1 | current_max_salary_0 | current_max_salary_1 | next_salary_0 | next_salary_1 |
|---|----------|------------|--------|-------|-------------|--------------|--------------|----------------------|----------------------|---------------|---------------|
| 0 | Patricio | Marketing  | 2500   | 1     | 0           | 2500         | 2500         | 2500                 | 2500                 | 3100.0        | 3100.0        |
| 1 | Parto    | Finance    | 2700   | 1     | 0           | 2500         | 2500         | 2700                 | 2700                 | 3300.0        | 3300.0        |
| 2 | Kyoichi  | Sales      | 3000   | 1     | 0           | 2500         | 2500         | 3000                 | 3000                 | 4200.0        | 4200.0        |
| 3 | Jeff     | Marketing  | 3100   | 2     | 600         | 2500         | 2500         | 3100                 | 3100                 | NaN           | NaN           |
| 4 | Anneke   | Finance    | 3300   | 2     | 600         | 2500         | 2500         | 3300                 | 3300                 | 3900.0        | 3900.0        |
| 5 | Sumant   | Finance    | 3900   | 3     | 1200        | 2500         | 2500         | 3900                 | 3900                 | NaN           | NaN           |
| 6 | Georgi   | Sales      | 4200   | 2     | 1200        | 2500         | 2500         | 4200                 | 4200                 | 4200.0        | 4200.0        |
| 7 | Guoxiang | Sales      | 4200   | 3     | 1200        | 2500         | 2500         | 4200                 | 4200                 | 4500.0        | 4500.0        |
| 8 | Tom      | Sales      | 4500   | 4     | 1500        | 2500         | 2500         | 4500                 | 4500                 | 4700.0        | 4700.0        |
| 9 | Berni    | Sales      | 4700   | 5     | 1700        | 2500         | 2500         | 4700                 | 4700                 | NaN           | NaN           |
References
- Introducing Window Functions in Spark SQL
- Standard Functions for Window Aggregation (Window Functions)
- List Of Spark SQL Window Functions
- 在hive、Spark SQL中引入窗口函数
- Hive 分析函数进阶指南
- Hive SQL 分析函数面试题