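The project uses PostgreSQL as its database. I finally got to use window functions, which I had been curious about for a while, so I am writing it down here.
Since projects on MySQL may have the same need, I also worked out how to do this before MySQL 8.0 (8.0 and later ship window functions, so there is no need to reinvent the wheel). Along the way I picked up User-Defined Variables as well, and took some notes based on the official documentation.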
SELECT
*
FROM
main main
LEFT JOIN (
SELECT
*
FROM
(
SELECT *, ROW_NUMBER () OVER ( PARTITION BY main_id ORDER BY create_time DESC ) AS rw FROM main_history
) TEMP
WHERE TEMP.rw = 1
) history ON main.main_id = history.main_id;
SELECT
*
FROM
main main
LEFT JOIN (
SELECT
*
FROM
(
-- [This part is the focus of this post; it is pulled out and discussed below]
) TEMP
WHERE TEMP.rw = 1
) history ON main.main_id = history.main_id;
For the discussion of ROW_NUMBER(), RANK(), and DENSE_RANK() below, only the following SQL matters:
SELECT *, ROW_NUMBER () OVER ( PARTITION BY main_id ORDER BY create_time DESC ) AS rw FROM main_history
OVER ( PARTITION BY main_id ORDER BY create_time DESC )
Rows are partitioned by main_id (unlike GROUP BY, multiple rows are not collapsed into one), and within each partition the rows are ordered by create_time descending.
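To make the difference from GROUP BY concrete, here is a minimal sketch. It is an assumption on my side to run it on Python's built-in sqlite3 module (SQLite 3.25+ supports window functions) instead of PostgreSQL, and the three sample rows are made up for illustration.

```python
import sqlite3

# Hypothetical sample data: two history rows for main_id 1, one for main_id 2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE main_history (id INTEGER, main_id INTEGER, create_time TEXT)")
conn.executemany(
    "INSERT INTO main_history VALUES (?, ?, ?)",
    [(1, 1, "2023-01-01"), (2, 1, "2023-01-02"), (3, 2, "2023-01-01")],
)

# GROUP BY collapses each main_id into a single row.
grouped = conn.execute(
    "SELECT main_id, COUNT(*) FROM main_history GROUP BY main_id ORDER BY main_id"
).fetchall()

# PARTITION BY keeps every row and only attaches a per-partition value to it.
windowed = conn.execute(
    "SELECT id, main_id, COUNT(*) OVER (PARTITION BY main_id) AS cnt "
    "FROM main_history ORDER BY id"
).fetchall()

print(grouped)   # [(1, 2), (2, 1)]
print(windowed)  # [(1, 1, 2), (2, 1, 2), (3, 2, 1)]
```

GROUP BY returns two rows, while the window version still returns all three, each carrying the count of its partition.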
ROW_NUMBER ()
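Numbers the rows according to the ordering in OVER. Even when create_time values are equal, the numbering stays consecutive: 1, 2, 3…n. Which of the tied rows gets the lower number is not guaranteed unless the ORDER BY includes a tiebreaker.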
SELECT *, ROW_NUMBER () OVER ( PARTITION BY main_id ORDER BY create_time DESC ) AS rw FROM main_history
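Implementing the requirement with ROW_NUMBER() implies the following behavior:
A main record can have several history records, but only one history record is picked, even when several of them share the same latest create_time.
If the main record should bring back both tied history records, rank() is needed: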
SELECT *, rank() OVER ( PARTITION BY main_id ORDER BY create_time DESC ) AS rw FROM main_history
In the result, the rw column of the main_id = 1 partition reads 1, 1, 3.
If you need dense numbering such as 1, 1, 2, use dense_rank:
SELECT *, dense_rank() OVER ( PARTITION BY main_id ORDER BY create_time DESC ) AS rw FROM main_history
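The three functions are easiest to compare side by side. The following is only a sketch: it assumes SQLite (through Python's sqlite3 module) as a stand-in for PostgreSQL, and the three rows, two of which share the latest create_time, are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE main_history (id INTEGER, main_id INTEGER, create_time TEXT)")
# Hypothetical data: ids 1 and 2 tie on the most recent create_time.
conn.executemany(
    "INSERT INTO main_history VALUES (?, ?, ?)",
    [(1, 1, "2023-01-03"), (2, 1, "2023-01-03"), (3, 1, "2023-01-01")],
)

# One shared window definition, three numbering functions.
numbering = conn.execute(
    """
    SELECT ROW_NUMBER() OVER w AS rn,
           RANK()       OVER w AS rk,
           DENSE_RANK() OVER w AS drk
    FROM main_history
    WINDOW w AS (PARTITION BY main_id ORDER BY create_time DESC)
    ORDER BY rn
    """
).fetchall()

print(numbering)  # [(1, 1, 1), (2, 1, 1), (3, 3, 2)]
```

Reading the columns top to bottom gives 1, 2, 3 for ROW_NUMBER, 1, 1, 3 for RANK, and 1, 1, 2 for DENSE_RANK.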
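Without window functions: MAX(create_time) per main_id is the most recent time, so joining main_history back on that value yields the latest history records (tied records are all returned, as with rank() = 1).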
SELECT
*
FROM
main main
LEFT JOIN (
SELECT
historyTemp.*
FROM
main_history historyTemp
JOIN
( SELECT main_id, MAX(create_time) AS max_create_time FROM main_history GROUP BY main_id ) TEMP
ON historyTemp.main_id = TEMP.main_id AND historyTemp.create_time = TEMP.max_create_time
) history ON main.main_id = history.main_id;
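As a quick sanity check that the MAX(create_time) join behaves like filtering on rank() = 1, here is a small sketch. SQLite via Python's sqlite3 module is my assumption as a stand-in database, and the five rows are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE main_history (id INTEGER, main_id INTEGER, create_time TEXT)")
# Hypothetical data: main_id 1 has two rows tied on the latest create_time.
conn.executemany(
    "INSERT INTO main_history VALUES (?, ?, ?)",
    [
        (1, 1, "2023-01-03"), (2, 1, "2023-01-03"), (3, 1, "2023-01-01"),
        (4, 2, "2023-01-02"), (5, 2, "2023-01-01"),
    ],
)

# Latest rows via the aggregate-and-join approach.
via_max = conn.execute(
    """
    SELECT h.id
    FROM main_history h
    JOIN (SELECT main_id, MAX(create_time) AS max_create_time
          FROM main_history GROUP BY main_id) t
      ON h.main_id = t.main_id AND h.create_time = t.max_create_time
    ORDER BY h.id
    """
).fetchall()

# Latest rows via the window function.
via_rank = conn.execute(
    """
    SELECT id FROM (
        SELECT id, RANK() OVER (PARTITION BY main_id ORDER BY create_time DESC) AS rw
        FROM main_history
    ) t
    WHERE rw = 1
    ORDER BY id
    """
).fetchall()

print(via_max)   # [(1,), (2,), (4,)]
print(via_rank)  # [(1,), (2,), (4,)]
```

Both approaches keep the two tied rows for main_id 1, so this variant cannot replace ROW_NUMBER() when exactly one row per main record is wanted.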
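This part draws on material found online to implement the window functions in a form that MySQL before 8.0 can parse. The variables made my head spin at first, but things became clear after reading the official documentation.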
SELECT
history.*,
IF( @pre_main_id = history.main_id, @cur_rank := @cur_rank + 1, @cur_rank := 1 ) row_number ,
@pre_main_id := history.main_id
FROM
-- The connection may be reused by a connection pool, so re-initialize the variables on every use to avoid stale values
main_history history ,( SELECT @cur_rank := 0, @pre_main_id := NULL ) r
ORDER BY
history.main_id, history.create_time DESC
SELECT * FROM (
SELECT history.*,
IF(@pre_main_id = history.main_id, IF(@pre_create_time = history.create_time, @cur_rank, @cur_rank := @cur_rank + 1), @cur_rank := 1) dense_rank,
-- Post-processing: remember the current row's values for the next row
@pre_create_time := history.create_time temp2, @pre_main_id := history.main_id temp3
FROM
-- The connection may be reused by a connection pool, so re-initialize the variables on every use to avoid stale values
main_history history, (SELECT @cur_rank := 0, @pre_main_id := NULL, @pre_create_time := NULL, @rank_counter := 1) r
ORDER BY history.main_id, history.create_time DESC
) temp where dense_rank = 1;
SELECT history.*,
-- Pre-processing: keep the running count increasing so the result is 1, 1, 3 rather than 1, 1, 2
IF(@pre_main_id = history.main_id, @rank_counter := @rank_counter + 1, @rank_counter := 1) temp1,
IF(@pre_main_id = history.main_id, IF(@pre_create_time = history.create_time, @cur_rank, @cur_rank := @rank_counter), @cur_rank := 1) rank,
-- Post-processing: remember the current row's values for the next row
@pre_create_time := history.create_time temp2, @pre_main_id := history.main_id temp3
FROM
-- The connection may be reused by a connection pool, so re-initialize the variables on every use to avoid stale values
main_history history, (SELECT @cur_rank := 0, @pre_main_id := NULL, @pre_create_time := NULL, @rank_counter := 1) r
ORDER BY history.main_id, history.create_time DESC
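What helped me read the three queries above was to think of the user variables as plain loop state that is updated once per row. The sketch below restates that logic in Python. It is not MySQL code: the names mirror the variables above, the sample rows are invented, and it assumes rows are processed in exactly the ORDER BY order. As far as I know, the MySQL manual states that the order of evaluation for expressions involving user variables is undefined, so real MySQL does not strictly promise this.

```python
from operator import itemgetter

# Hypothetical history rows as (main_id, create_time) pairs, in arbitrary order.
history = [
    (1, "2023-01-03"), (2, "2023-01-02"), (1, "2023-01-01"),
    (1, "2023-01-03"), (2, "2023-01-01"),
]

# Equivalent of ORDER BY main_id, create_time DESC: sort by the secondary key
# first, then rely on the stable sort for the primary key.
ordered = sorted(history, key=itemgetter(1), reverse=True)
ordered.sort(key=itemgetter(0))

# Equivalent of (SELECT @cur_rank := 0, @pre_main_id := NULL, ...) r
pre_main_id = None
pre_create_time = None
row_number = rank = dense_rank = 0

result = []
for main_id, create_time in ordered:
    if main_id != pre_main_id:
        # New partition: every counter restarts at 1.
        row_number = rank = dense_rank = 1
    else:
        row_number += 1            # always advances (the role of @rank_counter)
        if create_time != pre_create_time:
            rank = row_number      # jumps past ties: 1, 1, 3
            dense_rank += 1        # no gaps: 1, 1, 2
    result.append((main_id, create_time, row_number, rank, dense_rank))
    # "Post-processing": remember this row for the next iteration.
    pre_main_id, pre_create_time = main_id, create_time

for row in result:
    print(row)
```

For main_id 1 the three counters come out as 1, 2, 3 / 1, 1, 3 / 1, 1, 2, the same pattern the window functions produce.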
User-Defined Variables: official documentation
[Background] Execution path of a SQL statement: storage engine -> server -> client
In a SELECT statement, each select expression is evaluated only when sent to the client. This means that in a HAVING, GROUP BY, or ORDER BY clause, referring to a variable that is assigned a value in the select expression list does not work as expected: SELECT (@aa:=id) AS a, (@aa+3) AS b FROM tbl_name HAVING b=5;
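Interpretation: in a SELECT statement, the select expressions that assign user variables are evaluated only when the row is sent to the client.
The implication: when the HAVING, GROUP BY, or ORDER BY clauses are evaluated, the user variables have not yet been assigned for that row, so do not reference them in those clauses.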
If you refer to a variable that has not been initialized, it has a value of NULL and a type of string.
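Interpretation: a variable that is used before being initialized has the value NULL and the type string.
Once user variables are involved, it is natural to ask whether they are thread-safe. With setups such as Spring with a connection pool, connections are shared, so thread safety deserves particular attention.
Someone has already answered this question online: reference link (the content is reposted from Stack Overflow)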
Including (select @num := 0) initializes the variable at the beginning of the query. User-defined variables are scoped to the individual connection, and a connection can only run one query at a time, so this specific case is perfectly “thread-safe.”
However, it’s also a bit of a hack.
select
@num := (@num + 1) as row_number
from
user u,
(select @num := 0) r;
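Short interpretation: a connection runs only one statement at a time. (select @num := 0) initializes a connection-scoped user variable. As long as connections are isolated from each other, the user variable is safe here.
Addendum: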
(select @num := 0)
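Spring reuses connections, but because the user variables are re-initialized on every use, there is no need to worry about a connection's previous user handing a stale variable to the next one.
ChatGPT is very popular lately, and it is easy to imagine ChatGPT eventually translating window-function SQL into an equivalent MySQL 5.x implementation. I hope this kind of productivity-boosting technology becomes available domestically soon.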