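MySQL has supported window functions since version 8.0.
Window functions: a window is like a window in a wall. It delimits a range, which you can think of as a set of rows. A window function runs a function over the set of rows that satisfy some condition, and it is evaluated once for every row, inside that row's window. If every row sees a window of the same fixed size, the window is static; if different rows see different windows, the window changes dynamically and is called a sliding window.
The basic usage of a window function is:
function_name([expr]) over clause
function() over()
Here over is the keyword that specifies the window the function runs over. It can contain three analytic clauses: a partition (partition by) clause, an ordering (order by) clause, and a frame (rows) clause. If the parentheses are left empty, the window contains every row that satisfies the where condition and the function is computed over all of them. If they are not empty, the window is set with the following syntax:
function_name([expr]) over (partition by <columns to group by> order by <columns to sort by> rows between <frame range>)
Key points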
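rows between 2 preceding and current row -- the current row and the two rows before it
rows between unbounded preceding and current row -- the current row and all rows before it
rows between current row and unbounded following -- the current row and all rows after it
rows between 3 preceding and current row -- the current row and the three rows before it
rows between 3 preceding and 1 following -- the three rows before, the current row, and the one row after: five rows in total
-- With order by but no frame clause, the default frame is range between unbounded preceding and current row. It matches the rows form only when the order by values are unique, because rows that tie with the current row are also included.
-- With neither order by nor a frame clause, the default frame is the whole partition, equivalent to rows between unbounded preceding and unbounded following.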
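In MySQL, an ORDER BY without a frame clause defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, and the difference from the ROWS form only shows up when the ordering column has ties. The sketch below models both running sums in plain Python. The function names and the sample data are made up for illustration; it is a model of the frame rules, not what MySQL executes.

```python
def running_sum_rows(rows):
    """SUM(v) OVER (ORDER BY k ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW).

    rows: list of (k, v) pairs, already sorted by k.
    """
    out, total = [], 0
    for _, v in rows:
        total += v
        out.append(total)
    return out


def running_sum_range(rows):
    """SUM(v) OVER (ORDER BY k), i.e. the default RANGE frame.

    The frame ends at the last row that ties with the current row on k,
    so rows with the same k all receive the same running total.
    """
    return [sum(v for k2, v in rows if k2 <= k) for k, _ in rows]


# Hypothetical data: two rows share the ordering key 2.
data = [(1, 10), (2, 20), (2, 30), (3, 40)]
print(running_sum_rows(data))   # [10, 30, 60, 100]
print(running_sum_range(data))  # [10, 60, 60, 100]
```

When the ordering column is unique, as with one row per month, both functions return the same list.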
# Requirement 1: the monthly payment total for 2019 and the cumulative payment total for the year
-- step1: keep only the 2019 data
select * from user_trade where year(pay_time)=2019;
-- step2: building on step 1, group by month and sum the payments of each month
select MONTH(pay_time),sum(pay_amount)
FROM user_trade
WHERE YEAR(pay_time) = 2019
GROUP BY MONTH(pay_time);
-- step3: building on step 2, apply a window function to meet the requirement
SELECT a.MONTH, a.pay_amount,
sum(a.pay_amount) over (ORDER BY a.MONTH) -- no frame clause here, so the window defaults to the current row and all rows before it
FROM( SELECT MONTH(pay_time) MONTH, sum(pay_amount) pay_amount
FROM user_trade WHERE YEAR(pay_time) = 2019
GROUP BY MONTH(pay_time)) a;
# Requirement 2: the monthly payment total for 2018-2019 and the cumulative payment total within each year
SELECT a.year,a.month,a.pay_amount,
sum(a.pay_amount) over(partition by a.year order by a.month)
FROM
(SELECT year(pay_time) year,month(pay_time) month,
sum(pay_amount) pay_amount
FROM user_trade
WHERE year(pay_time) in (2018,2019)
GROUP BY year(pay_time),
month(pay_time))a;
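To make the effect of partition by concrete, here is a minimal Python sketch of a running total that restarts for each year. The function name and the sample figures are hypothetical, and it assumes one input row per (year, month), which is what the GROUP BY subquery produces.

```python
from itertools import accumulate, groupby


def running_total_by_year(monthly):
    """monthly: list of (year, month, amount), one row per month.

    Returns (year, month, amount, running_total) rows where the total
    restarts at every new year, like PARTITION BY year ORDER BY month.
    """
    result = []
    ordered = sorted(monthly)  # sort by year, then by month
    for _, group in groupby(ordered, key=lambda r: r[0]):
        rows = list(group)
        totals = accumulate(r[2] for r in rows)
        result.extend((y, m, amt, t) for (y, m, amt), t in zip(rows, totals))
    return result


# Hypothetical monthly totals.
sample = [(2018, 11, 50), (2018, 12, 70), (2019, 1, 30), (2019, 2, 20)]
for row in running_total_by_year(sample):
    print(row)
```

The running total for January 2019 is 30, not 150, because the 2018 rows belong to a different partition.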
# Requirement 3: the three-month moving average of the monthly payment total for 2019
SELECT a.month, a.pay_amount,
avg(a.pay_amount) over(order by a.month rows between 2 preceding and current row)
FROM
(SELECT month(pay_time) month, sum(pay_amount) pay_amount
FROM user_trade
WHERE year(pay_time)=2019
GROUP BY month(pay_time))a;
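A sliding frame such as rows between 2 preceding and current row is clipped at the start of the partition, so the first rows are averaged over fewer values. A small Python sketch of that behaviour, with a hypothetical function name and made-up monthly totals:

```python
def moving_avg(values, preceding):
    """AVG(v) OVER (ORDER BY ... ROWS BETWEEN <preceding> PRECEDING AND CURRENT ROW).

    values: the partition's values, already in ORDER BY order.
    """
    out = []
    for i in range(len(values)):
        # The frame cannot reach before the first row of the partition.
        frame = values[max(0, i - preceding): i + 1]
        out.append(sum(frame) / len(frame))
    return out


monthly = [100, 200, 300, 400]  # hypothetical monthly totals
print(moving_avg(monthly, 2))   # [100.0, 150.0, 200.0, 300.0]
```

Only from the third month on does the average really cover three months.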
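# Requirement 4: the largest monthly payment total within each four-month window (the current month and the three before it)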
SELECT a.month,
a.pay_amount,
max(a.pay_amount) over(order by a.month rows between 3 preceding
and current row)
FROM
(SELECT substr(pay_time,1,7) as month,
sum(pay_amount) as pay_amount
FROM user_trade
GROUP BY substr(pay_time,1,7))a;
# Requirement 5: rank users by the number of distinct goods categories they bought in January 2020
SELECT
user_name,
count( DISTINCT goods_category ) category_count,
row_number() over(order by count( DISTINCT goods_category ) ) order1,
-- row_number numbers the rows sequentially, starting from 1
rank() over(order by count( DISTINCT goods_category ) ) order2,
dense_rank() over(order by count( DISTINCT goods_category ) ) order3
FROM
user_trade
WHERE
substring( pay_time, 1, 7 ) = '2020-01'
GROUP BY
user_name;
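All three functions return a sequence number according to their own ranking rule.
row_number: generates a number for every row returned, in order and without repeats.
rank and dense_rank: within each partition, rows that tie get the same number. rank then skips ahead (1, 2, 2, 4), while dense_rank continues without gaps (1, 2, 2, 3).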
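The three numbering rules can be compared side by side with a short Python sketch. The function name is hypothetical and the input is assumed to be one partition that is already sorted.

```python
def rankings(sorted_values):
    """Return (row_number, rank, dense_rank) for each value of a sorted list."""
    out = []
    rank = dense = 0
    prev = object()  # sentinel that is not equal to any real value
    for row_number, v in enumerate(sorted_values, start=1):
        if v != prev:
            rank = row_number  # after a tie, rank jumps to the row position
            dense += 1         # dense_rank only ever moves up by one
            prev = v
        out.append((row_number, rank, dense))
    return out


print(rankings([1, 2, 2, 3]))
# [(1, 1, 1), (2, 2, 2), (3, 2, 2), (4, 4, 3)]
```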
# Requirement 6: split the users who paid in February 2020 into 5 groups by payment amount
SELECT user_name,
sum(pay_amount) pay_amount,
ntile(5) over(order by sum(pay_amount) desc) level
FROM user_trade
WHERE substr(pay_time,1,7)='2020-02'
GROUP BY user_name;
# Requirement 7: all users in the top 30% by payment amount in 2020
SELECT a.user_name,
a.pay_amount,
a.level FROM
(SELECT user_name,
sum(pay_amount) pay_amount,
ntile(10) over(order by sum(pay_amount) desc) level
FROM user_trade
WHERE year(pay_time)=2020
GROUP BY user_name)a
WHERE a.level in (1,2,3);
ntile(n) over(partition by …A… order by …B… )
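n: the number of buckets to split into
A: the column(s) to partition by
B: the column(s) to order by
ntile(n) splits the ordered rows of each partition into n buckets and returns the bucket number of the current row. NTILE always works on the whole partition: MySQL accepts a ROWS BETWEEN clause on it but ignores it.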
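How the buckets are sized when the rows do not divide evenly is easiest to see in code. The following is a minimal Python sketch of the standard NTILE rule (the earlier buckets take the extra rows); the function name is hypothetical.

```python
def ntile(n_rows, n_buckets):
    """Bucket number (1-based) for each of n_rows ordered rows.

    When the rows do not divide evenly, the first (n_rows % n_buckets)
    buckets each receive one extra row.
    """
    base, extra = divmod(n_rows, n_buckets)
    out = []
    for bucket in range(1, n_buckets + 1):
        size = base + (1 if bucket <= extra else 0)
        out.extend([bucket] * size)
    return out


print(ntile(7, 3))  # [1, 1, 1, 2, 2, 3, 3]
```

This matters for the top 30% query: with 12 users, ntile(10) places 5 of them in buckets 1 to 3, which is about 42% rather than 30%, so the bucket filter is only approximate unless the user count is a multiple of 10.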
# Requirement 8: time offsets for King and West (previous N rows)
SELECT user_name,pay_time,
lag(pay_time,1,pay_time) over(partition by user_name order by
pay_time) lag1,
-- no offset is passed, so it defaults to 1; no default value is given either, so NULL is returned when there is no such row
lag(pay_time) over(partition by user_name order by pay_time) lag1_s,
lag(pay_time,2,pay_time) over(partition by user_name order by pay_time) lag2,
lag(pay_time,2) over(partition by user_name order by pay_time) lag2_s
FROM user_trade
WHERE user_name in ('King','West');
# Requirement 9: time offsets for King and West (following N rows)
SELECT user_name,pay_time,
lead(pay_time,1,pay_time) over(partition by user_name order by pay_time) lead1,
lead(pay_time) over(partition by user_name order by pay_time) lead2,
lead(pay_time,2,pay_time) over(partition by user_name order by pay_time) lead3,
lead(pay_time,2) over(partition by user_name order by pay_time) lead4
FROM user_trade
WHERE user_name in ('King','West');
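Lag and Lead return, within a single query, the value of a column from N rows before (Lag) or N rows after (Lead) the current row, each as a separate column.
In practice they are especially useful when you need the difference between today's and yesterday's value of a column.
lag(exp_str,offset,defval) over(partition by … order by …)
lead(exp_str,offset,defval) over(partition by … order by …)
exp_str is the column name (or expression) to read.
offset is how many rows back (lag) or ahead (lead) to look. For example, if the current row is row 5 of its partition and offset is 3, lag reads row 2 (5 - 3 = 2). offset defaults to 1.
defval is the value returned when the row N back or N ahead falls outside the partition. If it is not specified, NULL is returned, so supply a default whenever the result feeds arithmetic that should not come out as NULL.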
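The offset and default rules can be sketched in a few lines of Python. The function names mirror the SQL ones but are hypothetical helpers that operate on one partition already in ORDER BY order, with a non-negative offset. One simplification to keep in mind: the default here is a single constant, whereas in SQL it can be an expression evaluated for the current row, as in lag(pay_time,1,pay_time).

```python
def lag(values, offset=1, default=None):
    """Value from `offset` rows before each row, or `default` if out of range."""
    return [values[i - offset] if i - offset >= 0 else default
            for i in range(len(values))]


def lead(values, offset=1, default=None):
    """Value from `offset` rows after each row, or `default` if out of range."""
    return [values[i + offset] if i + offset < len(values) else default
            for i in range(len(values))]


days = ["2020-01-01", "2020-01-05", "2020-01-09"]  # hypothetical pay dates of one user
print(lag(days))      # [None, '2020-01-01', '2020-01-05']
print(lead(days, 2))  # ['2020-01-09', None, None]
```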
# Requirement 11: the user with the longest gap between payments in each year
SELECT years, b.user_name, b.pay_days
FROM (SELECT years, a.user_name,
datediff(a.pay_time,a.lag_dt) pay_days,
rank() over(partition by years order by datediff(a.pay_time,a.lag_dt) desc) rank1
FROM
(SELECT year(pay_time) as years, user_name, pay_time,
lag(pay_time) over(partition by user_name,year(pay_time)
order by pay_time) lag_dt
FROM user_trade)a
WHERE a.lag_dt is not null)b
WHERE b.rank1=1;
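1. CUME_DIST(): the cumulative distribution of a value within an ordered set of values, i.e. the fraction of rows whose value is less than or equal to the current row's value (for ascending order)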
SELECT
name,
score,
ROW_NUMBER() OVER (ORDER BY score) row_num,
CUME_DIST() OVER (ORDER BY score) cume_dist_val
FROM
scores;
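As a rough model of what CUME_DIST() OVER (ORDER BY score) returns, the Python sketch below counts, for each row, the share of rows with a score less than or equal to its own. The function name and scores are hypothetical.

```python
def cume_dist(sorted_values):
    """Fraction of rows with a value <= the current row's value (ascending order)."""
    n = len(sorted_values)
    return [sum(1 for w in sorted_values if w <= v) / n for v in sorted_values]


print(cume_dist([50, 60, 60, 80]))  # [0.25, 0.75, 0.75, 1.0]
```

Tied rows share the same result, and the last row is always 1.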
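2. FIRST_VALUE() and LAST_VALUE(): return the value from the first and the last row of the window frame. With order by and the default frame, the frame ends at the current row, so LAST_VALUE() needs an explicit frame such as range between unbounded preceding and unbounded following to reach the last row of the partition.
1. Get each employee's name and overtime hours, together with the employee who has the least overtime: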
SELECT employee_name,hours,
FIRST_VALUE(employee_name) OVER (ORDER BY hours) least_over_time
FROM overtime;
2. Find the employee with the least overtime in each department:
SELECT employee_name, department, hours,
FIRST_VALUE(employee_name) OVER (PARTITION BY department ORDER BY hours) least_over_time
FROM overtime;
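3. NTH_VALUE(expression,N): returns the value from the N-th row of the window frame
Find the employee with the second-highest salary
SELECT employee_name, salary,
-- explicit frame: the default frame ends at the current row, so the first row could come back NULL
NTH_VALUE(employee_name, 2) OVER (ORDER BY salary DESC
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) second_highest_salary
FROM basic_pays;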
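4. PERCENT_RANK(): the percentile rank of a row within its partition
PERCENT_RANK() returns a number from 0 to 1.
For a given row, PERCENT_RANK() is computed as (rank - 1) / (total_rows - 1), where rank is the row's RANK() and total_rows is the number of rows in the partition.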
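The formula can be checked by hand with a small Python sketch. The function name and scores are hypothetical; the single-row case is treated as 0, since the formula would otherwise divide by zero.

```python
def percent_rank(sorted_values):
    """(rank - 1) / (total_rows - 1) for each value of a sorted partition."""
    n = len(sorted_values)
    if n == 1:
        return [0.0]  # a single-row partition has percent rank 0
    out = []
    for v in sorted_values:
        rank = 1 + sum(1 for w in sorted_values if w < v)  # same as RANK()
        out.append((rank - 1) / (n - 1))
    return out


print(percent_rank([50, 60, 60, 80, 90]))  # [0.0, 0.25, 0.25, 0.75, 1.0]
```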
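1. Write a SQL query to find the id and name of active users, where an active user is one who logged in on at least 5 consecutive days. Return the result table ordered by id.
Accounts table: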
id | name |
---|---|
1 | Winston |
7 | Jonathan |
Logins table:
id | login_date |
---|---|
7 | 2020-05-30 |
1 | 2020-05-30 |
7 | 2020-05-31 |
7 | 2020-06-01 |
7 | 2020-06-02 |
7 | 2020-06-02 |
7 | 2020-06-03 |
7 | 2020-06-07 |
7 | 2020-06-10 |
select distinct tmp3.id, a.name from
(
select id, sub_d, count(distinct login_date) num from
(
select id, login_date, SUBDATE(login_date, a) sub_d from
(
select id, login_date,dense_rank() over(partition by id order by login_date) a
from Logins
) tmp
) tmp2 group by id, sub_d
) tmp3 left join Accounts a on a.id=tmp3.id where num>=5 order by tmp3.id;
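The query relies on a common trick: for consecutive dates, the date minus its dense_rank is constant, so grouping on that difference isolates each streak. The Python sketch below replays the idea on the logins of id 7 from the sample table; the function name is hypothetical.

```python
from datetime import date, timedelta


def longest_streak(login_dates):
    """Length of the longest run of consecutive days in login_dates."""
    days = sorted(set(login_dates))         # distinct dates, like dense_rank
    groups = {}
    for rank, d in enumerate(days, start=1):
        key = d - timedelta(days=rank)      # identical for consecutive days
        groups[key] = groups.get(key, 0) + 1
    return max(groups.values(), default=0)


# Logins of id 7 from the sample Logins table.
logins_7 = [date(2020, 5, 30), date(2020, 5, 31), date(2020, 6, 1),
            date(2020, 6, 2), date(2020, 6, 2), date(2020, 6, 3),
            date(2020, 6, 7), date(2020, 6, 10)]
print(longest_streak(logins_7))  # 5
```

The streak from 2020-05-30 to 2020-06-03 is five days long, so id 7 qualifies as active, while id 1 with a single login does not.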
2. Write a SQL query that computes the average customer spend over each 7-day period (a given date plus the 6 days before it)
Customer table:
custom_id | name | visited_on | amount |
---|---|---|---|
1 | Jhon | 2019-01-01 | 100 |
2 | Daniel | 2019-01-02 | 110 |
3 | Jade | 2019-01-03 | 120 |
4 | Khaled | 2019-01-04 | 130 |
5 | Winston | 2019-01-05 | 110 |
6 | Elvis | 2019-01-06 | 140 |
7 | Anna | 2019-01-07 | 150 |
8 | Maria | 2019-01-08 | 80 |
9 | Jaze | 2019-01-09 | 110 |
1 | Jhon | 2019-01-10 | 130 |
3 | Jade | 2019-01-10 | 150 |
Result table:
visited_on | amount | average_amount |
---|---|---|
2019-01-07 | 860 | 122.86 |
2019-01-08 | 840 | 120 |
2019-01-09 | 840 | 120 |
2019-01-10 | 1000 | 142.86 |
SELECT visited_on, amount, round(average_amount, 2) average_amount
FROM
(
SELECT visited_on,
SUM(amount) OVER (ORDER BY visited_on ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS amount,
AVG(amount) OVER (ORDER BY visited_on ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS average_amount,
ROW_NUMBER() OVER (ORDER BY visited_on) AS rn
FROM
(
SELECT visited_on,
SUM(amount) amount
FROM Customer group by visited_on
) a
) b
WHERE b.rn >= 7;
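The result table can be reproduced outside the database with a short Python sketch. The function name is hypothetical. Like the SQL, it counts rows rather than calendar days, so it assumes every day in the range has at least one visit; a missing day would make the 7-row window span more than 7 days.

```python
from collections import defaultdict


def seven_day_stats(visits):
    """visits: list of (visited_on, amount), visited_on as 'YYYY-MM-DD' strings.

    Returns (visited_on, 7-day total, 7-day average) from the 7th day on.
    """
    daily = defaultdict(int)
    for day, amount in visits:
        daily[day] += amount                 # GROUP BY visited_on
    days = sorted(daily)
    result = []
    for i in range(6, len(days)):            # skip the first six rows (rn >= 7)
        total = sum(daily[d] for d in days[i - 6: i + 1])
        result.append((days[i], total, round(total / 7, 2)))
    return result


# Rows of the sample Customer table.
visits = [("2019-01-01", 100), ("2019-01-02", 110), ("2019-01-03", 120),
          ("2019-01-04", 130), ("2019-01-05", 110), ("2019-01-06", 140),
          ("2019-01-07", 150), ("2019-01-08", 80), ("2019-01-09", 110),
          ("2019-01-10", 130), ("2019-01-10", 150)]
for row in seven_day_stats(visits):
    print(row)
```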
Reference tutorials