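Window functions, also called OLAP (Online Analytical Processing) functions, make it possible to analyze database data in real time. They are a standard database feature: mainstream databases such as Oracle and PostgreSQL support them, while MySQL only added support in version 8.0.
Put simply, a window function takes the result set of a query and partitions it according to specified rules, with each partition acting as a window. For every row in a partition, the function is computed over the rows of that partition, and the result becomes that row's window function value.
Window functions resemble GROUP BY aggregate queries in that both compute over a group (partition) of rows. The difference is that GROUP BY returns a single row per group, whereas a window function produces a result for every row in the group.
Let's look at an example.
Suppose we have a sales table that records each employee's annual sales: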
CREATE TABLE sales(
sales_employee VARCHAR(50) NOT NULL,
fiscal_year INT NOT NULL,
sale DECIMAL(14,2) NOT NULL,
PRIMARY KEY(sales_employee,fiscal_year)
);
INSERT INTO sales(sales_employee,fiscal_year,sale)
VALUES('Bob',2016,100),
('Bob',2017,150),
('Bob',2018,200),
('Alice',2016,150),
('Alice',2017,100),
('Alice',2018,200),
('John',2016,200),
('John',2017,150),
('John',2018,250);
SELECT
*
FROM
sales;
+----------------+-------------+--------+
| sales_employee | fiscal_year | sale |
+----------------+-------------+--------+
| Alice | 2016 | 150.00 |
| Alice | 2017 | 100.00 |
| Alice | 2018 | 200.00 |
| Bob | 2016 | 100.00 |
| Bob | 2017 | 150.00 |
| Bob | 2018 | 200.00 |
| John | 2016 | 200.00 |
| John | 2017 | 150.00 |
| John | 2018 | 250.00 |
+----------------+-------------+--------+
9 rows in set (0.01 sec)
For example, the following query uses the SUM() function with GROUP BY to return the total sales of all employees for each fiscal year:
SELECT
fiscal_year,
SUM(sale)
FROM
sales
GROUP BY
fiscal_year;
The result is as follows:
+-------------+-----------+
| fiscal_year | SUM(sale) |
+-------------+-----------+
| 2016 | 450.00 |
| 2017 | 400.00 |
| 2018 | 650.00 |
+-------------+-----------+
3 rows in set (0.01 sec)
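In the example above, the aggregate function reduces the number of rows returned by the query.
Like aggregate functions with a GROUP BY clause, window functions also operate on a subset of rows, but they do not reduce the number of rows returned by the query.
For example, the following query returns each employee's sales together with the total sales of all employees for that fiscal year: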
SELECT
fiscal_year,
sales_employee,
sale,
SUM(sale) OVER (PARTITION BY fiscal_year) total_sales
FROM
sales;
The result is as follows:
+-------------+----------------+--------+-------------+
| fiscal_year | sales_employee | sale | total_sales |
+-------------+----------------+--------+-------------+
| 2016 | Alice | 150.00 | 450.00 |
| 2016 | Bob | 100.00 | 450.00 |
| 2016 | John | 200.00 | 450.00 |
| 2017 | Alice | 100.00 | 400.00 |
| 2017 | Bob | 150.00 | 400.00 |
| 2017 | John | 150.00 | 400.00 |
| 2018 | Alice | 200.00 | 650.00 |
| 2018 | Bob | 200.00 | 650.00 |
| 2018 | John | 250.00 | 650.00 |
+-------------+----------------+--------+-------------+
9 rows in set (0.02 sec)
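In this example, SUM() is used as a window function: it operates on the set of rows defined by the contents of the OVER clause. The set of rows to which SUM() is applied is called a window.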
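The contrast between the two queries can be sketched in plain Python. This is only an illustration of the semantics, not of how MySQL evaluates the queries; the variable names grouped and windowed are made up for this example.

```python
from collections import defaultdict

# Rows copied from the sales table above: (sales_employee, fiscal_year, sale).
sales = [
    ("Alice", 2016, 150), ("Alice", 2017, 100), ("Alice", 2018, 200),
    ("Bob", 2016, 100), ("Bob", 2017, 150), ("Bob", 2018, 200),
    ("John", 2016, 200), ("John", 2017, 150), ("John", 2018, 250),
]

# Total per fiscal year, shared by both variants.
totals = defaultdict(int)
for _, year, sale in sales:
    totals[year] += sale

# GROUP BY fiscal_year: one output row per group.
grouped = sorted(totals.items())

# SUM(sale) OVER (PARTITION BY fiscal_year): every input row is kept and
# annotated with the total of its own partition.
windowed = [(year, name, sale, totals[year]) for name, year, sale in sales]

print(grouped)        # [(2016, 450), (2017, 400), (2018, 650)]
print(len(windowed))  # 9
```

The grouped variant collapses nine rows into three, while the windowed variant keeps all nine and repeats the partition total on each of them.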
The general syntax for calling a window function is:
window_function_name(expression)
OVER (
[partition_definition]
[order_definition]
[frame_definition]
)
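In this syntax:
The OVER clause has three possible elements: the partition definition, the order definition, and the frame definition. The opening and closing parentheses after the OVER clause are mandatory even when they contain no expression, for example: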
window_function_name(expression) OVER()
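partition_clause syntax
The partition_clause divides the rows into chunks, or partitions. Two partitions are separated by a partition boundary.
The window function is evaluated within each partition and is re-initialized when it crosses a partition boundary.
The partition_clause syntax is as follows:
PARTITION BY <expression>[{,<expression>...}]
You can specify one or more expressions in the PARTITION BY clause. Multiple expressions are separated by commas.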
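order_by_clause syntax
The order_by_clause syntax is as follows:
ORDER BY <expression> [ASC|DESC], [{,<expression>...}]
The ORDER BY clause specifies how rows are ordered within a partition. Data can be ordered within a partition on multiple keys, each specified by an expression. Multiple expressions are again separated by commas.
Like the PARTITION BY clause, the ORDER BY clause is supported by all window functions. However, it only makes sense to use it with order-sensitive window functions.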
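frame_clause syntax
A frame is a subset of the current partition. To define the subset, use the frame clause as follows:
frame_unit {<frame_start>|<frame_between>}
A frame is defined relative to the current row, which allows the frame to move within the partition as the position of the current row changes.
The frame unit specifies the type of relationship between the current row and the frame rows. It can be ROWS or RANGE. With ROWS, the offsets between the current row and the frame rows are row numbers; with RANGE, they are row values.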
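frame_start and frame_between define the frame boundaries.
frame_start is one of the following:
UNBOUNDED PRECEDING: the frame starts at the first row of the partition.
N PRECEDING: N physical rows before the current row. N can be a literal number or an expression that evaluates to a number.
CURRENT ROW: the row currently being computed.
frame_between is as follows: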
BETWEEN frame_boundary_1 AND frame_boundary_2
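frame_boundary_1 and frame_boundary_2 can each be one of the following:
frame_start: as described above.
UNBOUNDED FOLLOWING: the frame ends at the last row of the partition.
N FOLLOWING: N physical rows after the current row.
If frame_definition is not specified in the OVER clause and an ORDER BY is present, MySQL uses the following frame by default (without ORDER BY, the default frame is the whole partition):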
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
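To make the frame boundaries more concrete, here is a minimal Python sketch of how a SUM over a ROWS frame could be evaluated for one ordered partition. The helper rows_frame_sum is hypothetical and written for this illustration; it covers only ROWS frames whose bounds are UNBOUNDED, N PRECEDING, N FOLLOWING, or CURRENT ROW (N = 0).

```python
def rows_frame_sum(values, preceding, following):
    """Sum over a ROWS frame around each row of one ordered partition.

    preceding / following are row counts; None stands for UNBOUNDED
    and 0 stands for CURRENT ROW.
    """
    n = len(values)
    out = []
    for i in range(n):
        start = 0 if preceding is None else max(0, i - preceding)
        end = n - 1 if following is None else min(n - 1, i + following)
        out.append(sum(values[start:end + 1]))
    return out


bob = [100, 150, 200]  # Bob's sales for 2016..2018, ordered by fiscal_year

# ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW (running total)
running = rows_frame_sum(bob, None, 0)
# ROWS BETWEEN 1 PRECEDING AND CURRENT ROW (two-row moving sum)
moving = rows_frame_sum(bob, 1, 0)
# ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING (whole partition)
whole = rows_frame_sum(bob, None, None)

print(running)  # [100, 250, 450]
print(moving)   # [100, 250, 350]
print(whole)    # [450, 450, 450]
```

A RANGE frame differs in that its bounds are expressed in terms of the ORDER BY value, so rows tied with the current row are always included together.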
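Commonly used aggregate functions include:
Function | Purpose |
---|---|
max | Returns the maximum value of the specified column |
min | Returns the minimum value of the specified column |
count | Counts the rows in the query result |
sum | Returns the sum of the specified column |
avg | Returns the average of the specified column |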
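The ranking functions are row_number(), rank(), and dense_rank(). For these three functions the ordering clause (order_definition) is needed for the ranking to be meaningful, while the partition clause (partition_definition) is optional: without it the whole table is ranked, and with it rows are ranked within each group.
Here is an example.
Suppose a table employee stores employee salaries and departments.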
CREATE TABLE employee(
`id` varchar(10) PRIMARY KEY NOT NULL COMMENT 'primary key',
`name` varchar(50) NOT NULL COMMENT 'name',
`salary` int NOT NULL COMMENT 'salary',
`department` varchar(50) NOT NULL COMMENT 'department'
);
INSERT INTO employee(`id`,`name`,`salary`,`department`)
VALUES ('1','Joe',85000,'IT'),
('2','Henry',85000,'Sales'),
('3','Sam',60000,'Sales'),
('4','Max',90000,'IT'),
('5','Janet',69000,'IT'),
('6','Randy',85000,'IT'),
('7','Will',70000,'IT');
SELECT * FROM employee;
+----+-------+--------+------------+
| id | name | salary | department |
+----+-------+--------+------------+
| 1 | Joe | 85000 | IT |
| 2 | Henry | 85000 | Sales |
| 3 | Sam | 60000 | Sales |
| 4 | Max | 90000 | IT |
| 5 | Janet | 69000 | IT |
| 6 | Randy | 85000 | IT |
| 7 | Will | 70000 | IT |
+----+-------+--------+------------+
7 rows in set (0.00 sec)
The following statement ranks all rows without partitioning:
SELECT
`id`,
`name`,
`salary`,
`department`,
row_number() over(order by salary desc) as `row_number`,
rank() over(order by salary desc) as `rank`,
dense_rank() over(order by salary desc) as `dense_rank`
FROM
employee;
The result is as follows:
+----+-------+--------+------------+------------+------+------------+
| id | name | salary | department | row_number | rank | dense_rank |
+----+-------+--------+------------+------------+------+------------+
| 4 | Max | 90000 | IT | 1 | 1 | 1 |
| 1 | Joe | 85000 | IT | 2 | 2 | 2 |
| 2 | Henry | 85000 | Sales | 3 | 2 | 2 |
| 6 | Randy | 85000 | IT | 4 | 2 | 2 |
| 7 | Will | 70000 | IT | 5 | 5 | 3 |
| 5 | Janet | 69000 | IT | 6 | 6 | 4 |
| 3 | Sam | 60000 | Sales | 7 | 7 | 5 |
+----+-------+--------+------------+------------+------+------------+
7 rows in set (0.00 sec)
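The difference between the three ranking functions shows up only when values tie. The following Python sketch reproduces the three columns of the result above for an already sorted list of salaries; the function name ranking is an assumption made for this illustration.

```python
def ranking(values):
    """Return (row_number, rank, dense_rank) tuples.

    values must already be sorted in the window order.
    """
    out = []
    rank = dense = 0
    prev = object()  # sentinel that compares unequal to every real value
    for row_number, v in enumerate(values, start=1):
        if v != prev:
            rank = row_number  # after a tie, rank jumps to the row position
            dense += 1         # dense_rank only ever moves up by one
            prev = v
        out.append((row_number, rank, dense))
    return out


salaries = [90000, 85000, 85000, 85000, 70000, 69000, 60000]
for salary, row in zip(salaries, ranking(salaries)):
    print(salary, row)
```

For the three rows with salary 85000, row_number yields 2, 3, 4, while rank and dense_rank both yield 2; on the next row rank continues at 5 and dense_rank at 3.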
The following statement ranks rows within each department:
SELECT
`id`,
`name`,
`salary`,
`department`,
row_number() over(partition by department order by salary desc) as `row_number`,
rank() over(partition by department order by salary desc) as `rank`,
dense_rank() over(partition by department order by salary desc) as `dense_rank`
FROM
employee;
The result is as follows:
+----+-------+--------+------------+------------+------+------------+
| id | name | salary | department | row_number | rank | dense_rank |
+----+-------+--------+------------+------------+------+------------+
| 4 | Max | 90000 | IT | 1 | 1 | 1 |
| 1 | Joe | 85000 | IT | 2 | 2 | 2 |
| 6 | Randy | 85000 | IT | 3 | 2 | 2 |
| 7 | Will | 70000 | IT | 4 | 4 | 3 |
| 5 | Janet | 69000 | IT | 5 | 5 | 4 |
| 2 | Henry | 85000 | Sales | 1 | 1 | 1 |
| 3 | Sam | 60000 | Sales | 2 | 2 | 2 |
+----+-------+--------+------------+------------+------+------------+
7 rows in set (0.00 sec)
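Basic syntax: ntile(n) over(partition by ... order by ...), where n is the number of buckets to split into.
Meaning: ntile(n) divides the rows of a partition as evenly as possible into n buckets. If the rows cannot be divided evenly, the earlier buckets receive one more row than the later ones.
Example: ntile() is typically used for queries such as finding the top third of earners in each department. In that case n is 3 and a WHERE clause selects the first bucket. The SQL is as follows: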
SELECT temp.* FROM (
SELECT
`id`,
`name`,
`salary`,
`department`,
row_number() over(partition by department order by salary desc) as `row_number`,
ntile(3) over(partition by department order by salary desc) as `ntile`
FROM
employee) temp
WHERE temp.ntile <= 1;
The result is as follows:
+----+-------+--------+------------+------------+-------+
| id | name | salary | department | row_number | ntile |
+----+-------+--------+------------+------------+-------+
| 4 | Max | 90000 | IT | 1 | 1 |
| 1 | Joe | 85000 | IT | 2 | 1 |
| 2 | Henry | 85000 | Sales | 1 | 1 |
+----+-------+--------+------------+------------+-------+
3 rows in set (0.01 sec)
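How the buckets are sized when the rows do not divide evenly can be sketched in Python as follows. The function ntile here is a hypothetical helper that only mirrors the bucket assignment for one ordered partition of a given size.

```python
def ntile(row_count, n):
    """Bucket number (1..n) for each of row_count ordered rows."""
    base, extra = divmod(row_count, n)
    buckets = []
    for b in range(1, n + 1):
        # The first `extra` buckets receive one additional row.
        size = base + (1 if b <= extra else 0)
        buckets.extend([b] * size)
    return buckets


print(ntile(5, 3))  # IT has 5 rows:    [1, 1, 2, 2, 3]
print(ntile(2, 3))  # Sales has 2 rows: [1, 2]
```

This matches the query result above: the first bucket holds two rows for IT (Max and Joe) and one row for Sales (Henry).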
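Basic syntax: first_value(column) over(partition by ... order by ...), where column is the name of a column.
Meaning: returns the value of column from the first row of the window.
Example: append a column with the name of the highest-paid employee in each department.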
SELECT
`id`,
`name`,
`salary`,
`department`,
first_value(name) over(partition by department order by salary desc) as `max_salary_name`
FROM
employee;
The result is as follows:
+----+-------+--------+------------+-----------------+
| id | name | salary | department | max_salary_name |
+----+-------+--------+------------+-----------------+
| 4 | Max | 90000 | IT | Max |
| 1 | Joe | 85000 | IT | Max |
| 6 | Randy | 85000 | IT | Max |
| 7 | Will | 70000 | IT | Max |
| 5 | Janet | 69000 | IT | Max |
| 2 | Henry | 85000 | Sales | Henry |
| 3 | Sam | 60000 | Sales | Henry |
+----+-------+--------+------------+-----------------+
7 rows in set (0.01 sec)
Basic syntax:
LAG(<expression>[,offset[, default_value]]) OVER (
PARTITION BY expr,...
ORDER BY expr [ASC|DESC],...
)
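The LAG() function returns the value of expression from the row that precedes the current row by offset rows within its partition or result set.
offset is the number of rows to go back from the current row to get the value. It must be zero or a literal positive integer. If offset is zero, LAG() evaluates expression for the current row. If offset is omitted, LAG() uses 1 by default.
If there is no preceding row, LAG() returns default_value. For example, if offset is 2, the return value for the first row is default_value. If default_value is omitted, LAG() returns NULL.
PARTITION BY clause
The PARTITION BY clause divides the rows of the result set into the partitions to which LAG() is applied. If the PARTITION BY clause is omitted, LAG() treats the whole result set as a single partition.
ORDER BY clause
The ORDER BY clause specifies the order of the rows in each partition before LAG() is applied.
LAG() is useful for computing the difference between the current row and the previous row.
Meaning: returns the value from the N-th row before the current row in the partition. If no such row exists, it returns NULL.
Example: append a column with the name of the employee ranked immediately above the current employee by salary within the department.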
SELECT
`id`,
`name`,
`salary`,
`department`,
lag(name,1) over(partition by department order by salary desc) as `higher_salary_name`
FROM
employee;
The result is as follows:
+----+-------+--------+------------+--------------------+
| id | name | salary | department | higher_salary_name |
+----+-------+--------+------------+--------------------+
| 4 | Max | 90000 | IT | NULL |
| 1 | Joe | 85000 | IT | Max |
| 6 | Randy | 85000 | IT | Joe |
| 7 | Will | 70000 | IT | Randy |
| 5 | Janet | 69000 | IT | Will |
| 2 | Henry | 85000 | Sales | NULL |
| 3 | Sam | 60000 | Sales | Henry |
+----+-------+--------+------------+--------------------+
7 rows in set (0.00 sec)
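The behaviour of offset and default_value can be sketched in a few lines of Python. The helper lag below is an assumption made for illustration and works on a single ordered partition; the second part applies it to the difference use case mentioned above, using Bob's rows from the sales table.

```python
def lag(values, offset=1, default=None):
    """Value found `offset` rows before each row of one ordered partition."""
    return [values[i - offset] if i - offset >= 0 else default
            for i in range(len(values))]


names = ["Max", "Joe", "Randy", "Will", "Janet"]  # IT, ordered by salary desc
print(lag(names))            # [None, 'Max', 'Joe', 'Randy', 'Will']
print(lag(names, 2, "n/a"))  # ['n/a', 'n/a', 'Max', 'Joe', 'Randy']

# Difference between the current row and the previous row.
bob = [100, 150, 200]  # Bob's sales for 2016..2018
diffs = [None if prev is None else cur - prev
         for cur, prev in zip(bob, lag(bob))]
print(diffs)                 # [None, 50, 50]
```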
Basic syntax:
LAST_VALUE (expression) OVER (
[partition_clause]
[order_clause]
[frame_clause]
)
Example: append a column with the name of the lowest-paid employee in each department.
SELECT
`id`,
`name`,
`salary`,
`department`,
last_value(name) over(partition by department order by salary desc) as `min_salary_name`
FROM
employee;
The result is shown below, and it is clearly not what we expected:
+----+-------+--------+------------+-----------------+
| id | name | salary | department | min_salary_name |
+----+-------+--------+------------+-----------------+
| 4 | Max | 90000 | IT | Max |
| 1 | Joe | 85000 | IT | Randy |
| 6 | Randy | 85000 | IT | Randy |
| 7 | Will | 70000 | IT | Will |
| 5 | Janet | 69000 | IT | Janet |
| 2 | Henry | 85000 | Sales | Henry |
| 3 | Sam | 60000 | Sales | Sam |
+----+-------+--------+------------+-----------------+
7 rows in set (0.00 sec)
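Why does min_salary_name show the value of the current row (or of its last peer) instead?
The reason is that first_value and last_value operate on a frame, whose scope can be specified with ROWS or RANGE. When ORDER BY is present, the default scope is
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
that is, from the first row of the window to the current row, including any rows tied with the current row on the ORDER BY value (which is why Joe's row shows Randy). The last row seen by last_value is therefore the current row or its last peer. Knowing the cause, we only need to change the frame.
The SQL is as follows: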
SELECT
`id`,
`name`,
`salary`,
`department`,
last_value(name) over(partition by department order by salary desc rows between UNBOUNDED PRECEDING AND UNBOUNDED following) as `min_salary_name`
FROM
employee;
The result is as follows, and this time it is correct:
+----+-------+--------+------------+-----------------+
| id | name | salary | department | min_salary_name |
+----+-------+--------+------------+-----------------+
| 4 | Max | 90000 | IT | Janet |
| 1 | Joe | 85000 | IT | Janet |
| 6 | Randy | 85000 | IT | Janet |
| 7 | Will | 70000 | IT | Janet |
| 5 | Janet | 69000 | IT | Janet |
| 2 | Henry | 85000 | Sales | Sam |
| 3 | Sam | 60000 | Sales | Sam |
+----+-------+--------+------------+-----------------+
7 rows in set (0.00 sec)
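To see why the two queries differ, here is a minimal Python sketch of last_value under the two frames for the IT department. The function last_value and its full_frame flag are hypothetical; the sketch assumes the rows are already in window order, so that rows with equal salary are adjacent.

```python
def last_value(rows, key, value, full_frame=False):
    """LAST_VALUE for each row of one ordered partition.

    full_frame=False mimics RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW:
    the frame ends at the last peer of the current row (equal ORDER BY key).
    full_frame=True mimics ... AND UNBOUNDED FOLLOWING: the frame ends at the
    last row of the partition.
    """
    out = []
    for row in rows:
        if full_frame:
            end = len(rows) - 1
        else:
            end = max(i for i, r in enumerate(rows) if r[key] == row[key])
        out.append(rows[end][value])
    return out


it_rows = [  # IT department, ordered by salary desc
    {"name": "Max", "salary": 90000},
    {"name": "Joe", "salary": 85000},
    {"name": "Randy", "salary": 85000},
    {"name": "Will", "salary": 70000},
    {"name": "Janet", "salary": 69000},
]

default_result = last_value(it_rows, "salary", "name")
full_result = last_value(it_rows, "salary", "name", full_frame=True)
print(default_result)  # ['Max', 'Randy', 'Randy', 'Will', 'Janet']
print(full_result)     # ['Janet', 'Janet', 'Janet', 'Janet', 'Janet']
```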
Basic syntax:
LEAD(<expression>[,offset[, default_value]]) OVER (
PARTITION BY (expr)
ORDER BY (expr)
)
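The LEAD() function returns the value of expression from the row that follows the current row by offset rows within the ordered partition.
offset is the number of rows to go forward from the current row to get the value.
offset must be a non-negative integer. If offset is zero, LEAD() evaluates expression for the current row.
If offset is omitted, LEAD() uses 1 by default.
If there is no subsequent row, LEAD() returns default_value. For example, if offset is 1, the return value for the last row is default_value.
If you do not specify default_value, the function returns NULL.
The PARTITION BY clause divides the rows of the result set into the partitions to which LEAD() is applied.
If the PARTITION BY clause is not specified, all rows of the result set are treated as a single partition.
The ORDER BY clause determines the order of the rows in each partition before LEAD() is applied.
Meaning: returns the value from the N-th row after the current row in the partition. If no such row exists, it returns NULL.
Example: append a column with the name of the employee ranked immediately below the current employee by salary within the department.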
SELECT
`id`,
`name`,
`salary`,
`department`,
lead(name,1) over(partition by department order by salary desc) as `lower_salary_name`
FROM
employee;
The result is as follows:
+----+-------+--------+------------+-------------------+
| id | name | salary | department | lower_salary_name |
+----+-------+--------+------------+-------------------+
| 4 | Max | 90000 | IT | Joe |
| 1 | Joe | 85000 | IT | Randy |
| 6 | Randy | 85000 | IT | Will |
| 7 | Will | 70000 | IT | Janet |
| 5 | Janet | 69000 | IT | NULL |
| 2 | Henry | 85000 | Sales | Sam |
| 3 | Sam | 60000 | Sales | NULL |
+----+-------+--------+------------+-------------------+
7 rows in set (0.00 sec)
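LEAD() looks forward instead of backward, so a sketch only needs to check the upper bound of the partition. The helper lead below is hypothetical and handles a single ordered partition.

```python
def lead(values, offset=1, default=None):
    """Value found `offset` rows after each row of one ordered partition."""
    n = len(values)
    return [values[i + offset] if i + offset < n else default
            for i in range(n)]


names = ["Max", "Joe", "Randy", "Will", "Janet"]  # IT, ordered by salary desc
print(lead(names))            # ['Joe', 'Randy', 'Will', 'Janet', None]
print(lead(names, 2, "n/a"))  # ['Randy', 'Will', 'Janet', 'n/a', 'n/a']
```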
Basic syntax:
NTH_VALUE(expression, N)
FROM FIRST
OVER (
partition_clause
order_clause
frame_clause
)
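The NTH_VALUE() function returns the value of expression from the N-th row of the window frame. If the N-th row does not exist, the function returns NULL. N must be a positive integer, such as 1, 2, or 3.
FROM FIRST instructs NTH_VALUE() to start counting at the first row of the window frame.
Note that the SQL standard supports both FROM FIRST and FROM LAST. MySQL, however, supports only FROM FIRST. To simulate the effect of FROM LAST, you can use ORDER BY in the over_clause to sort the result set in reverse order.
Meaning: returns the argument value from the N-th row of the window frame. Pay attention to the default frame boundary: the result differs depending on whether RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING is specified.
Example: append a column with the name of the employee with the second-highest salary in each department.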
SELECT
`id`,
`name`,
`salary`,
`department`,
nth_value(name,2) over(partition by department order by salary desc RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as `second_salary_name`
FROM
employee;
The result is as follows:
+----+-------+--------+------------+--------------------+
| id | name | salary | department | second_salary_name |
+----+-------+--------+------------+--------------------+
| 4 | Max | 90000 | IT | Joe |
| 1 | Joe | 85000 | IT | Joe |
| 6 | Randy | 85000 | IT | Joe |
| 7 | Will | 70000 | IT | Joe |
| 5 | Janet | 69000 | IT | Joe |
| 2 | Henry | 85000 | Sales | Sam |
| 3 | Sam | 60000 | Sales | Sam |
+----+-------+--------+------------+--------------------+
7 rows in set (0.00 sec)
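The effect of the frame on NTH_VALUE() can be sketched in Python. The function nth_value and its full_frame flag are assumptions made for this illustration; the rows are the IT department in window order.

```python
def nth_value(rows, key, value, n, full_frame):
    """NTH_VALUE(value, n) for each row of one ordered partition."""
    out = []
    for row in rows:
        if full_frame:
            # RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
            frame = rows
        else:
            # Default frame: first row through the last peer of the current row.
            end = max(i for i, r in enumerate(rows) if r[key] == row[key])
            frame = rows[:end + 1]
        out.append(frame[n - 1][value] if len(frame) >= n else None)
    return out


it_rows = [  # IT department, ordered by salary desc
    {"name": "Max", "salary": 90000},
    {"name": "Joe", "salary": 85000},
    {"name": "Randy", "salary": 85000},
    {"name": "Will", "salary": 70000},
    {"name": "Janet", "salary": 69000},
]

print(nth_value(it_rows, "salary", "name", 2, full_frame=True))
# ['Joe', 'Joe', 'Joe', 'Joe', 'Joe']
print(nth_value(it_rows, "salary", "name", 2, full_frame=False))
# [None, 'Joe', 'Joe', 'Joe', 'Joe']
```

With the default frame, the frame of Max's row contains only Max, so there is no second row and the sketch yields None (NULL in SQL) for that row.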
Basic syntax:
CUME_DIST() OVER (
PARTITION BY expr, ...
ORDER BY expr [ASC | DESC], ...
)
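Meaning: returns the cumulative distribution of a value within a set of values, that is, the number of rows that come before or tie with the current row in the window ordering, divided by the total number of rows.
Example: append a column with the cumulative salary distribution within each department (the fraction of employees in the department whose salary is greater than or equal to the current employee's).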
SELECT
`id`,
`name`,
`salary`,
`department`,
cume_dist() over(partition by department order by salary desc ) as `cume`
FROM
employee;
The result is as follows:
+----+-------+--------+------------+------+
| id | name | salary | department | cume |
+----+-------+--------+------------+------+
| 4 | Max | 90000 | IT | 0.2 |
| 1 | Joe | 85000 | IT | 0.6 |
| 6 | Randy | 85000 | IT | 0.6 |
| 7 | Will | 70000 | IT | 0.8 |
| 5 | Janet | 69000 | IT | 1 |
| 2 | Henry | 85000 | Sales | 0.5 |
| 3 | Sam | 60000 | Sales | 1 |
+----+-------+--------+------------+------+
7 rows in set (0.00 sec)
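The definition can be checked against the result above with a short Python sketch. The function cume_dist is a hypothetical helper that takes the ORDER BY values of one partition, already in window order.

```python
def cume_dist(keys):
    """Cumulative distribution for each row of one ordered partition."""
    n = len(keys)
    out = []
    for k in keys:
        # Rows up to and including the last peer of the current row.
        last_peer = max(i for i, x in enumerate(keys) if x == k)
        out.append((last_peer + 1) / n)
    return out


print(cume_dist([90000, 85000, 85000, 70000, 69000]))  # IT:    [0.2, 0.6, 0.6, 0.8, 1.0]
print(cume_dist([85000, 60000]))                       # Sales: [0.5, 1.0]
```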
Basic syntax:
PERCENT_RANK()
OVER (
PARTITION BY expr,...
ORDER BY expr [ASC|DESC],...
)
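Meaning: the PERCENT_RANK() function returns a number from 0 to 1.
For a given row, PERCENT_RANK() computes the row's rank minus 1, divided by the number of rows in the evaluated partition or query result set minus 1:
(rank - 1) / (total_rows - 1)
In this formula, rank is the rank of the given row and total_rows is the number of rows being evaluated.
PERCENT_RANK() always returns zero for the first row in a partition or result set. Rows with duplicate column values receive the same PERCENT_RANK() value.
As with other window functions, the PARTITION BY clause distributes the rows into partitions and the ORDER BY clause specifies the logical order of the rows in each partition. PERCENT_RANK() is computed independently for each ordered partition.
Both the PARTITION BY and ORDER BY clauses are optional. However, PERCENT_RANK() is an order-sensitive function, so you should always use the ORDER BY clause.
Example: append a column with each employee's percent rank by salary within the department.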
SELECT
`id`,
`name`,
`salary`,
`department`,
percent_rank() over(partition by department order by salary desc ) as `percent_rank`
FROM
employee;
The result is as follows:
+----+-------+--------+------------+--------------+
| id | name | salary | department | percent_rank |
+----+-------+--------+------------+--------------+
| 4 | Max | 90000 | IT | 0 |
| 1 | Joe | 85000 | IT | 0.25 |
| 6 | Randy | 85000 | IT | 0.25 |
| 7 | Will | 70000 | IT | 0.75 |
| 5 | Janet | 69000 | IT | 1 |
| 2 | Henry | 85000 | Sales | 0 |
| 3 | Sam | 60000 | Sales | 1 |
+----+-------+--------+------------+--------------+
7 rows in set (0.00 sec)
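The formula (rank - 1) / (total_rows - 1) can be applied by hand in Python. The function percent_rank below is a hypothetical helper for one ordered partition; returning 0.0 for a single-row partition is how this sketch avoids dividing by zero.

```python
def percent_rank(keys):
    """(rank - 1) / (total_rows - 1) for each row of one ordered partition."""
    n = len(keys)
    out = []
    for k in keys:
        rank = keys.index(k) + 1  # position of the first peer, i.e. RANK()
        out.append((rank - 1) / (n - 1) if n > 1 else 0.0)
    return out


print(percent_rank([90000, 85000, 85000, 70000, 69000]))  # IT:    [0.0, 0.25, 0.25, 0.75, 1.0]
print(percent_rank([85000, 60000]))                       # Sales: [0.0, 1.0]
```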
In MySQL 8.0 we can use row_number directly to rank salaries within each department, with the following SQL:
SELECT
`id`,
`name`,
`salary`,
`department`,
row_number() over(partition by department order by salary desc) as `row_number`
FROM
employee;
The result is as follows:
+----+-------+--------+------------+------------+
| id | name | salary | department | row_number |
+----+-------+--------+------------+------------+
| 4 | Max | 90000 | IT | 1 |
| 1 | Joe | 85000 | IT | 2 |
| 6 | Randy | 85000 | IT | 3 |
| 7 | Will | 70000 | IT | 4 |
| 5 | Janet | 69000 | IT | 5 |
| 2 | Henry | 85000 | Sales | 1 |
| 3 | Sam | 60000 | Sales | 2 |
+----+-------+--------+------------+------------+
7 rows in set (0.00 sec)
MySQL 5.7 does not have window functions yet, so we have to implement the logic ourselves. The query is:
SELECT
a.*,
@rn := ( IF ( @department = department, @rn + 1, 1 ) ) AS num,
@department := department AS temp_department # Key point: the grouping column must be assigned to the variable, and this assignment must come after the row-number expression
FROM
( SELECT * FROM employee ORDER BY department, salary ) a,
( SELECT @rn := 0, @department := '' ) b;
The result is as follows (note that this query sorts salaries in ascending order):
+----+-------+--------+------------+------+-----------------+
| id | name | salary | department | num | temp_department |
+----+-------+--------+------------+------+-----------------+
| 5 | Janet | 69000 | IT | 1 | IT |
| 7 | Will | 70000 | IT | 2 | IT |
| 1 | Joe | 85000 | IT | 3 | IT |
| 6 | Randy | 85000 | IT | 4 | IT |
| 4 | Max | 90000 | IT | 5 | IT |
| 3 | Sam | 60000 | Sales | 1 | Sales |
| 2 | Henry | 85000 | Sales | 2 | Sales |
+----+-------+--------+------------+------+-----------------+
7 rows in set, 4 warnings (0.00 sec)
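The logic above is simple but rather clever. The idea is: sort the rows by department and salary, then walk through them in order while @department remembers the department of the previous row. If the current row's department equals @department, @rn is incremented; otherwise a new group has started and @rn is reset to 1.
rank() and dense_rank() can of course be implemented in the same way; this is left as an exercise for the reader.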
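The single pass performed by the user-variable query can be written out as an ordinary loop. The Python sketch below is only an illustration of that logic; the function name row_number_by_group is made up, and rn and prev_department play the roles of @rn and @department.

```python
def row_number_by_group(rows):
    """rows: (department, salary) tuples already sorted by department, salary."""
    rn = 0
    prev_department = ""  # plays the role of @department
    out = []
    for department, salary in rows:
        # @rn := IF(@department = department, @rn + 1, 1)
        rn = rn + 1 if department == prev_department else 1
        # @department := department, assigned only after rn has been computed
        prev_department = department
        out.append((department, salary, rn))
    return out


rows = [
    ("IT", 69000), ("IT", 70000), ("IT", 85000), ("IT", 85000), ("IT", 90000),
    ("Sales", 60000), ("Sales", 85000),
]
for row in row_number_by_group(rows):
    print(row)
```

If the department were remembered before the counter is updated, the comparison would always succeed and the counter would never reset, which is why the order of the two assignments matters.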
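Ranking problems: rank the employees of each department by performance.
Top-N problems: find the top N employees of each department to reward.
LeetCode 185. Department Top Three Salaries (Hard)
Having read this article, readers should be able to solve this problem with their eyes closed.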
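As a rough illustration of the top-N idea, the following Python sketch keeps every employee whose salary is among the N highest distinct salaries of the department, which is what filtering on dense_rank <= N would select. It uses the employee rows from this article rather than the LeetCode schema, and the function name top_n_per_department is an assumption.

```python
from collections import defaultdict

# (name, salary, department), copied from the employee table above.
employees = [
    ("Joe", 85000, "IT"), ("Henry", 85000, "Sales"), ("Sam", 60000, "Sales"),
    ("Max", 90000, "IT"), ("Janet", 69000, "IT"), ("Randy", 85000, "IT"),
    ("Will", 70000, "IT"),
]


def top_n_per_department(rows, n):
    """Employees whose salary is among the n highest distinct salaries of their department."""
    by_dept = defaultdict(list)
    for name, salary, dept in rows:
        by_dept[dept].append((name, salary))
    result = []
    for dept in sorted(by_dept):
        # The n highest distinct salaries are exactly those with dense_rank <= n.
        top_salaries = sorted({s for _, s in by_dept[dept]}, reverse=True)[:n]
        for name, salary in sorted(by_dept[dept], key=lambda r: -r[1]):
            if salary in top_salaries:
                result.append((dept, name, salary))
    return result


for row in top_n_per_department(employees, 3):
    print(row)
```

For the IT department this keeps Max, Joe, Randy, and Will: Joe and Randy tie on the second-highest salary, so four employees share the top three distinct salaries.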