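MySQL Window Functions

Window Functions

What Is a Window Function

MySQL has supported window functions since version 8.0.

Window function: a window is like a window frame that bounds a region, and can be understood as a set of records. A window function runs a special function over the set of records that satisfy some condition, and the function is evaluated within that window for every record. If the window is the same size for every record, it is a static window; if different records correspond to different windows, the dynamically changing window is called a sliding window.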

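The basic usage of a window function is:

function_name([expr]) OVER clause
function() OVER ()

Here OVER is the keyword that specifies the window the function runs over. It can contain three analytic clauses: a grouping clause (PARTITION BY), an ordering clause (ORDER BY), and a frame clause (ROWS). If the parentheses are left empty, the window contains every row that satisfies the WHERE condition and the function is computed over all of them. If they are not empty, the window is set with the following syntax:

function_name([expr]) OVER (PARTITION BY <columns to group by> ORDER BY <columns to sort by> ROWS BETWEEN <frame range>)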

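Key points

rows between 2 preceding and current row -- the current row and the two rows before it
rows between unbounded preceding and current row -- the current row and all rows before it
rows between current row and unbounded following -- the current row and all rows after it
rows between 3 preceding and current row -- the current row and the three rows before it
rows between 3 preceding and 1 following -- three rows before, the current row, and one row after: five rows in total
-- When ORDER BY is present but the frame clause is omitted, the default frame is range between unbounded preceding and current row, which also includes any rows that tie with the current row on the ORDER BY value.
-- When both ORDER BY and the frame clause are omitted, the default frame covers the whole partition, equivalent to rows between unbounded preceding and unbounded following.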
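
The difference between these frames is easiest to see on a tiny data set. The sketch below is only an illustration: it assumes SQLite 3.25+ (through Python's built-in sqlite3 module) as a stand-in for MySQL 8.0 so that it runs without a server, and the table `t` with its four rows is invented. It compares the default frame with an explicit ROWS frame and with an empty OVER().

```python
import sqlite3

# Assumption: SQLite 3.25+ stands in for MySQL 8.0; table t and its rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (k INTEGER, v INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, 10), (2, 20), (2, 30), (3, 40)])

frame_rows = conn.execute("""
    SELECT k, v,
           -- ORDER BY without a frame clause: RANGE frame, rows tied on k are summed together
           SUM(v) OVER (ORDER BY k) AS default_frame,
           -- explicit ROWS frame: strictly row by row (v is added to ORDER BY to break the tie)
           SUM(v) OVER (ORDER BY k, v
                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_frame,
           -- empty OVER(): every row sees the whole result set
           SUM(v) OVER () AS whole_set
    FROM t
    ORDER BY k, v
""").fetchall()

for row in frame_rows:
    print(row)
```

The two rows with k = 2 get the same default_frame value because they are peers under ORDER BY k, while rows_frame grows one row at a time.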

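Applications of window functions

Aggregate window functions

  1. Cumulative calculation: running total, sum() over()
# Requirement 1: for 2019, return each month's total payments and the cumulative total for the year

-- step 1: filter the 2019 data
select * from user_trade where year(pay_time)=2019;
-- step 2: building on step 1, group by month and total the payments for each month
select MONTH(pay_time),sum(pay_amount)
FROM user_trade
WHERE YEAR(pay_time) = 2019
GROUP BY MONTH(pay_time);
-- step 3: building on step 2, apply a window function to meet the requirement
SELECT a.MONTH, a.pay_amount,
sum(a.pay_amount) over (ORDER BY a.MONTH) -- no frame clause here, so the default frame is the current row and all rows before it
FROM( SELECT MONTH(pay_time) MONTH, sum(pay_amount) pay_amount
FROM user_trade WHERE YEAR(pay_time) = 2019
GROUP BY MONTH(pay_time)) a;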
# Requirement 2: for 2018-2019, return each month's total payments and the cumulative total within each year
SELECT a.year,a.month,a.pay_amount,
sum(a.pay_amount) over(partition by a.year order by a.month)
FROM
(SELECT year(pay_time) year,month(pay_time) month,
sum(pay_amount) pay_amount
FROM user_trade
WHERE year(pay_time) in (2018,2019)
GROUP BY year(pay_time),
month(pay_time))a;
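
To check how PARTITION BY restarts the running total at each year, the query from Requirement 2 can be tried on a few invented rows. This is a sketch under assumptions: SQLite 3.25+ via Python's sqlite3 stands in for MySQL 8.0, so `strftime` replaces `year()` and `month()`, and the column types and sample rows of `user_trade` are made up.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL 8.0; schema types and rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_trade (user_name TEXT, pay_amount INTEGER, pay_time TEXT)")
conn.executemany(
    "INSERT INTO user_trade VALUES (?, ?, ?)",
    [
        ("King", 100, "2018-01-05"),
        ("West", 50, "2018-01-20"),
        ("King", 200, "2018-02-11"),
        ("West", 300, "2019-01-03"),
        ("King", 400, "2019-02-14"),
        ("West", 100, "2019-02-28"),
    ],
)

running = conn.execute("""
    SELECT a.year, a.month, a.pay_amount,
           -- the running total restarts for every year because of PARTITION BY
           SUM(a.pay_amount) OVER (PARTITION BY a.year ORDER BY a.month) AS running_total
    FROM (
        SELECT strftime('%Y', pay_time) AS year,   -- SQLite substitute for year(pay_time)
               strftime('%m', pay_time) AS month,  -- SQLite substitute for month(pay_time)
               SUM(pay_amount)          AS pay_amount
        FROM user_trade
        GROUP BY strftime('%Y', pay_time), strftime('%m', pay_time)
    ) a
    ORDER BY a.year, a.month
""").fetchall()

for row in running:
    print(row)
```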
  2. Moving average: avg() over()
# Requirement 3: for each month of 2019, return the moving average of payments over the most recent three months
SELECT a.month, a.pay_amount,
avg(a.pay_amount) over(order by a.month rows between 2 preceding and current row)
FROM 
(SELECT month(pay_time) month, sum(pay_amount) pay_amount
FROM user_trade
WHERE year(pay_time)=2019
GROUP BY month(pay_time))a;
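  3. Maximum / minimum: max()/min() over()
# Requirement 4: for each month, return the largest monthly payment total within the four-month window ending at that month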
SELECT a.month,
       a.pay_amount,
       max(a.pay_amount) over(order by a.month rows between 3 preceding
and current row)
FROM
      (SELECT substr(pay_time,1,7) as month,
              sum(pay_amount) as pay_amount
       FROM user_trade
       GROUP BY substr(pay_time,1,7))a;
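
The sliding frames used for the moving average and moving maximum can be verified on pre-aggregated monthly totals. The following is a minimal sketch, assuming SQLite 3.25+ via Python's sqlite3 in place of MySQL 8.0; the `monthly` table and its figures are hypothetical and stand for the result of the inner GROUP BY subquery.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL 8.0; `monthly` is an invented, pre-aggregated table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly (month INTEGER, pay_amount INTEGER)")
conn.executemany(
    "INSERT INTO monthly VALUES (?, ?)",
    [(1, 100), (2, 200), (3, 600), (4, 100), (5, 200), (6, 300), (7, 100)],
)

moving = conn.execute("""
    SELECT month, pay_amount,
           -- three-row window: the current month and the two before it
           AVG(pay_amount) OVER (ORDER BY month
                                 ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS avg_3m,
           -- four-row window: the current month and the three before it
           MAX(pay_amount) OVER (ORDER BY month
                                 ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS max_4m
    FROM monthly
    ORDER BY month
""").fetchall()

for row in moving:
    print(row)
```

For the first rows the frame is simply shorter (there are fewer than two or three preceding rows), and in month 7 the peak of 600 from month 3 has slid out of the four-row window.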

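Dedicated window functions

  1. Ranking functions
  • row_number() over()
  • rank() over()
  • dense_rank() over()
# Requirement 5: rank users by the number of product categories they purchased in January 2020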
SELECT
    user_name,
    count( DISTINCT goods_category )  category_count,
    row_number() over(order by count( DISTINCT goods_category ) )  order1,
    -- row_number numbers the rows starting from 1
    rank() over(order by count( DISTINCT goods_category ) ) order2,
    dense_rank() over(order by count( DISTINCT goods_category ) ) order3
FROM
    user_trade
WHERE
    substring( pay_time, 1, 7 ) = '2020-01'
GROUP BY
    user_name;

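All three functions return a rank number according to their respective rules.
row_number: generates a sequence number for every row returned, assigned in order with no repeats.
Within each partition, rank is a skipping rank (tied rows share a rank and the ranks after them are skipped), while dense_rank is a consecutive rank (tied rows share a rank and no ranks are skipped).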
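
The three numbering rules only differ when there are ties. A small sketch makes this visible; it assumes SQLite 3.25+ via Python's sqlite3 as a stand-in for MySQL 8.0, and the `scores` rows are invented.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL 8.0; the rows are invented and contain one tie.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("a", 90), ("b", 85), ("c", 85), ("d", 70)],
)

ranked = conn.execute("""
    SELECT name, score,
           -- name is added as a tie-breaker so the numbering is deterministic
           ROW_NUMBER() OVER (ORDER BY score DESC, name) AS row_num,
           RANK()       OVER (ORDER BY score DESC)       AS rnk,
           DENSE_RANK() OVER (ORDER BY score DESC)       AS dense_rnk
    FROM scores
    ORDER BY score DESC, name
""").fetchall()

for row in ranked:
    print(row)
```

After the two rows tied at 85, rank continues with 4 while dense_rank continues with 3.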

  2. Bucketing function: ntile() over()
# Requirement 6: split the users who paid in February 2020 into 5 groups by payment amount
SELECT user_name,
       sum(pay_amount) pay_amount,
       ntile(5) over(order by sum(pay_amount) desc) level
FROM user_trade
WHERE substr(pay_time,1,7)='2020-02'
GROUP BY user_name;
# Requirement 7: find all users whose 2020 payment total ranks in the top 30%
SELECT a.user_name,
       a.pay_amount,
       a.level
FROM
      (SELECT user_name,
              sum(pay_amount) pay_amount,
              ntile(10) over(order by sum(pay_amount) desc) level
      FROM user_trade
      WHERE year(pay_time)=2020
      GROUP BY user_name)a
WHERE a.level in (1,2,3);

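ntile(n) over(partition by …A… order by …B… )
n: the number of buckets to split into
A: the column(s) to partition by
B: the column(s) to order by
ntile(n) splits the ordered rows of each partition into n buckets and returns the bucket number of the current row. NTILE always works on the whole partition, so a ROWS BETWEEN frame does not apply to it.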
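
When the row count is not divisible by n, the buckets cannot all be the same size. The sketch below shows how seven rows are spread over three buckets; it assumes SQLite 3.25+ via Python's sqlite3 as a stand-in for MySQL 8.0, and the `pay_totals` table (standing for per-user sums) is hypothetical.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL 8.0; pay_totals is an invented per-user summary.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pay_totals (user_name TEXT, pay_amount INTEGER)")
conn.executemany(
    "INSERT INTO pay_totals VALUES (?, ?)",
    [("u1", 700), ("u2", 600), ("u3", 500), ("u4", 400),
     ("u5", 300), ("u6", 200), ("u7", 100)],
)

buckets = conn.execute("""
    SELECT user_name, pay_amount,
           -- 7 rows into 3 buckets: the extra row goes to the first bucket
           NTILE(3) OVER (ORDER BY pay_amount DESC) AS level
    FROM pay_totals
    ORDER BY pay_amount DESC
""").fetchall()

for row in buckets:
    print(row)
```

Bucket sizes differ by at most one row, which is why filtering on `level in (1,2,3)` of `ntile(10)` only approximates the top 30% when the user count is not a multiple of 10.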

  3. Offset functions
  • lag() over()
  • lead() over()
# Requirement 8: time offsets for King and West (N rows back)
SELECT user_name,pay_time,
lag(pay_time,1,pay_time) over(partition by user_name order by pay_time) lag1,
-- no offset is passed, so it defaults to 1; no default value is given either, so NULL is returned when no such row exists
lag(pay_time) over(partition by user_name order by pay_time) lag1_s,
lag(pay_time,2,pay_time) over(partition by user_name order by pay_time) lag2,
lag(pay_time,2) over(partition by user_name order by pay_time) lag2_s
FROM user_trade
WHERE user_name in ('King','West');
# Requirement 9: time offsets for King and West (N rows ahead)
SELECT user_name,pay_time,
lead(pay_time,1,pay_time) over(partition by user_name order by pay_time) lead1,
lead(pay_time) over(partition by user_name order by pay_time) lead2,
lead(pay_time,2,pay_time) over(partition by user_name order by pay_time) lead3,
lead(pay_time,2) over(partition by user_name order by pay_time) lead4
FROM user_trade
WHERE user_name in ('King','West');

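Lag and Lead let a single query pull the value of the same column from N rows back (Lag) and N rows ahead (Lead) as separate columns.
In practice they are especially useful when you need the difference between today's and yesterday's value of a column.
lag(exp_str,offset,defval) over(partition by … order by …)
lead(exp_str,offset,defval) over(partition by … order by …)
exp_str is the column name (or expression).
offset is the number of rows to step back (lag) or ahead (lead). For example, if the current row is row 5 of its partition and offset is 3, lag looks at row 2 (5-3=2). offset defaults to 1.
defval is the default value: when stepping N rows back or ahead from the current row would go past the edge of the partition, the function returns defval, or NULL if no default was given. Arithmetic on NULL yields NULL, so supply a default whenever the result feeds into a calculation.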
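
The effect of the offset and default arguments can be seen on a handful of rows. This sketch assumes SQLite 3.25+ via Python's sqlite3 as a stand-in for MySQL 8.0; the `user_trade` rows are invented and only the two columns needed here are created.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL 8.0; only two columns of user_trade are modelled.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_trade (user_name TEXT, pay_time TEXT)")
conn.executemany(
    "INSERT INTO user_trade VALUES (?, ?)",
    [("King", "2020-01-01"), ("King", "2020-01-05"),
     ("King", "2020-01-06"), ("West", "2020-02-01")],
)

offsets = conn.execute("""
    SELECT user_name, pay_time,
           -- offset defaults to 1, no default value: NULL on the first row of a partition
           LAG(pay_time) OVER (PARTITION BY user_name ORDER BY pay_time) AS prev_or_null,
           -- default value given: falls back to the row's own pay_time
           LAG(pay_time, 1, pay_time) OVER (PARTITION BY user_name ORDER BY pay_time) AS prev_or_self,
           -- two rows ahead, NULL when the partition has no such row
           LEAD(pay_time, 2) OVER (PARTITION BY user_name ORDER BY pay_time) AS next_2
    FROM user_trade
    ORDER BY user_name, pay_time
""").fetchall()

for row in offsets:
    print(row)
```

West has a single row, so every offset lookup for that user falls outside the partition and only the variant with a default value returns something other than NULL.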

Requirement 11: for each year, find the user with the longest interval between payments
SELECT years, b.user_name, b.pay_days
FROM (SELECT years, a.user_name,
datediff(a.pay_time,a.lag_dt) pay_days,
rank() over(partition by years order by datediff(a.pay_time,a.lag_dt) desc) rank1
FROM
(SELECT year(pay_time) as years, user_name, pay_time,
lag(pay_time) over(partition by user_name,year(pay_time)
order by pay_time) lag_dt
FROM user_trade)a
WHERE a.lag_dt is not null)b
WHERE b.rank1=1;

Less commonly used dedicated window functions

1. CUME_DIST(): computes the cumulative distribution of a value within an ordered set of values

SELECT
    name,
    score,
    ROW_NUMBER() OVER (ORDER BY score) row_num,
    CUME_DIST() OVER (ORDER BY score) cume_dist_val
FROM
    scores;

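2. FIRST_VALUE() and LAST_VALUE(): return the value from the first row and the last row of the window frame
1. Get each employee's name and overtime hours, together with the employee who has the least overtime: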

SELECT employee_name,hours,
FIRST_VALUE(employee_name) OVER (ORDER BY hours) least_over_time
FROM overtime;

2. Find the employee with the least overtime in each department:

SELECT employee_name, department, hours,
FIRST_VALUE(employee_name) OVER (PARTITION BY department ORDER BY hours) least_over_time
FROM overtime;

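3. NTH_VALUE(expression, N): returns the value from the Nth row of the window frame
Find the employee with the second-highest salary

SELECT employee_name, salary,
-- full-partition frame: with the default frame, the first row would get NULL
NTH_VALUE(employee_name, 2) OVER (ORDER BY salary DESC
    RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) second_highest_salary
FROM basic_pays;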
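
FIRST_VALUE, LAST_VALUE and NTH_VALUE read from the window frame, so the frame clause changes their result. The sketch below compares the default frame with a full-partition frame; it assumes SQLite 3.25+ via Python's sqlite3 as a stand-in for MySQL 8.0, and the three `basic_pays` rows are invented.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL 8.0; the basic_pays rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE basic_pays (employee_name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO basic_pays VALUES (?, ?)",
    [("Ann", 5000), ("Bob", 4000), ("Cid", 3000)],
)

frame_values = conn.execute("""
    SELECT employee_name, salary,
           -- default frame: partition start through the current row
           NTH_VALUE(employee_name, 2) OVER (ORDER BY salary DESC) AS nth_default,
           -- full frame: every row sees the whole partition
           NTH_VALUE(employee_name, 2) OVER (ORDER BY salary DESC
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS nth_full,
           LAST_VALUE(employee_name) OVER (ORDER BY salary DESC) AS last_default,
           LAST_VALUE(employee_name) OVER (ORDER BY salary DESC
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_full
    FROM basic_pays
    ORDER BY salary DESC
""").fetchall()

for row in frame_values:
    print(row)
```

With the default frame, LAST_VALUE just echoes the current row and NTH_VALUE(…, 2) is NULL on the first row; the full-partition frame gives the same answer on every row.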

4. PERCENT_RANK(): computes the percentile rank of a row within a partition

PERCENT_RANK() returns a number from 0 to 1.
For a given row, PERCENT_RANK() is calculated as: (rank - 1) / (total_rows - 1)
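
The formula can be checked by hand against CUME_DIST on a small example. This is a sketch assuming SQLite 3.25+ via Python's sqlite3 as a stand-in for MySQL 8.0; the five `scores` rows are invented and include one tie.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL 8.0; the five rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("a", 50), ("b", 60), ("c", 60), ("d", 80), ("e", 90)],
)

dist_rows = conn.execute("""
    SELECT name, score,
           -- rows with score <= current score, divided by total rows
           CUME_DIST()    OVER (ORDER BY score) AS cume_dist_val,
           -- (rank - 1) / (total_rows - 1)
           PERCENT_RANK() OVER (ORDER BY score) AS percent_rank_val
    FROM scores
    ORDER BY score, name
""").fetchall()

for name, score, cd, pr in dist_rows:
    print(name, score, round(cd, 2), round(pr, 2))
```

For the tied score 60, three of five rows are at or below it (CUME_DIST = 0.6), and its rank is 2, so PERCENT_RANK = (2 - 1) / (5 - 1) = 0.25.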

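Window function case studies

1. Write a SQL query to find the id and name of active users, where an active user is one who logged in to their account on at least 5 consecutive days. Return the result table ordered by id.

Accounts table:

id  name
1   Winston
7   Jonathan

Logins table:

id  login_date
7   2020-05-30
1   2020-05-30
7   2020-05-31
7   2020-06-01
7   2020-06-02
7   2020-06-02
7   2020-06-03
7   2020-06-07
7   2020-06-10
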
select distinct tmp3.id, a.name
from (
    select id, sub_d, count(distinct login_date) num
    from (
        -- consecutive dates minus their rank collapse to the same sub_d
        select id, login_date, SUBDATE(login_date, rn) sub_d
        from (
            select id, login_date,
                   dense_rank() over(partition by id order by login_date) rn
            from Logins
        ) tmp
    ) tmp2
    group by id, sub_d
) tmp3
left join Accounts a on a.id = tmp3.id
where num >= 5
order by tmp3.id;
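
The idea behind this query is that, within a run of consecutive days, the login date minus its dense rank is constant, so grouping by that difference isolates each run. The sketch below replays it on the sample data; it assumes SQLite 3.25+ via Python's sqlite3 as a stand-in for MySQL 8.0, so SQLite's `date(..., '-N days')` replaces `SUBDATE`, and the column types are assumptions.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL 8.0; column types are guesses, rows match the sample tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Accounts (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE Logins (id INTEGER, login_date TEXT)")
conn.executemany("INSERT INTO Accounts VALUES (?, ?)", [(1, "Winston"), (7, "Jonathan")])
conn.executemany(
    "INSERT INTO Logins VALUES (?, ?)",
    [(7, "2020-05-30"), (1, "2020-05-30"), (7, "2020-05-31"), (7, "2020-06-01"),
     (7, "2020-06-02"), (7, "2020-06-02"), (7, "2020-06-03"), (7, "2020-06-07"),
     (7, "2020-06-10")],
)

active = conn.execute("""
    SELECT DISTINCT g.id, a.name
    FROM (
        SELECT id, grp, COUNT(DISTINCT login_date) AS num
        FROM (
            SELECT id, login_date,
                   -- date minus dense rank: identical for every day of one consecutive run
                   date(login_date,
                        '-' || DENSE_RANK() OVER (PARTITION BY id ORDER BY login_date) || ' days') AS grp
            FROM Logins
        ) t
        GROUP BY id, grp
    ) g
    JOIN Accounts a ON a.id = g.id
    WHERE g.num >= 5
    ORDER BY g.id
""").fetchall()

print(active)
```

DENSE_RANK (rather than ROW_NUMBER) matters here because the duplicate login on 2020-06-02 must not advance the counter, otherwise the run would be split.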

2. Write a SQL query that computes the average customer spending over each 7-day window (a given date plus the 6 days before it).
Customer table:

custom_id  name     visited_on  amount
1          Jhon     2019-01-01  100
2          Daniel   2019-01-02  110
3          Jade     2019-01-03  120
4          Khaled   2019-01-04  130
5          Winston  2019-01-05  110
6          Elvis    2019-01-06  140
7          Anna     2019-01-07  150
8          Maria    2019-01-08  80
9          Jaze     2019-01-09  110
1          Jhon     2019-01-10  130
3          Jade     2019-01-10  150

Result table:

visited_on  amount  average_amount
2019-01-07  860     122.86
2019-01-08  840     120
2019-01-09  840     120
2019-01-10  1000    142.86

SELECT visited_on, amount, round(average_amount, 2) average_amount
FROM
(
    SELECT visited_on,
           SUM(amount) OVER (ORDER BY visited_on ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS amount,
           AVG(amount) OVER (ORDER BY visited_on ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS average_amount,
           ROW_NUMBER() OVER (ORDER BY visited_on) AS rn
    FROM
    (
        SELECT visited_on,
               SUM(amount) amount
        FROM Customer group by visited_on
    ) a
) b
WHERE b.rn >= 7;
