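Window functions are a family of SQL functions introduced in the SQL:2003 standard, and they are handy for certain complex calculations. MySQL, however, widely used as it is, long supported them poorly: only recent versions (MySQL 8.0 onward) have begun to offer some support, which has been a real frustration for MySQL programmers.
In practice, window function behavior can be pieced together in MySQL with plain SQL, but doing so relies on user variables and on the implicit rule that multiple SELECT expressions are evaluated from left to right. Let's look at two examples (for ease of debugging, we use esProc (集算器) directly as the test environment).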
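1. Sales ranking for January 2016
  | A |
1 | set @i1=0, @i2=0, @d1=null; |
2 | select @i1:=@i1+1 `row_number`, province, curr_sales, prev_sales,
@i2:=if(prev_sales=curr_sales,@i2,@i1) `rank` from (select province, cast(@d1 as decimal(15,2)) as prev_sales, @d1:=sales as curr_sales from detail where yearmonth=201601 order by sales desc ) t1; |
3 | =connect("mysql") |
4 | >A3.execute(A1) |
5 | =A3.query@x(A2) |
(1) The statement in A1 initializes the user variables;
(2) The statement in A2 first sorts sales in descending order, then compares each row's sales with the previous row's: if they are equal the rank stays unchanged, otherwise the rank equals the row number;
(3) A3 connects to the database;
(4) A4 executes the initialization statement;
(5) A5 executes the query, closes the database connection (the @x option) and returns the result.
After execution, A5 holds the required result.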
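To make the previous-row comparison in A2 easier to follow, here is a minimal Python sketch of the same logic. It is only an illustration and not part of the original script: the function name `rank_by_previous_row` and the sample figures are made up, and the local variables merely mimic the roles of @i1, @i2 and @d1.

```python
def rank_by_previous_row(sales_desc):
    """Number and rank rows that are already sorted from high to low.

    Hypothetical helper: row_number plays the role of @i1, rank of @i2,
    and prev of @d1 in the MySQL statement.
    """
    rows = []
    row_number = 0
    rank = 0
    prev = None
    for curr in sales_desc:
        row_number += 1              # @i1 := @i1 + 1
        if prev != curr:             # a new value takes the row number as its rank
            rank = row_number        # @i2 := @i1
        rows.append((row_number, curr, prev, rank))
        prev = curr                  # @d1 := sales
    return rows


if __name__ == "__main__":
    # Made-up sales figures, already in descending order.
    for row in rank_by_previous_row([900.0, 700.0, 700.0, 500.0]):
        print(row)
```

The tied values share a rank and the next distinct value skips ahead to its row number, which is the behavior of RANK().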
2. Percent rank of sales within each month, for January and February 2016
  | A |
1 | set @i1=null, @i2=0, @i3=0, @d1=null; |
2 | select curr_month, t1.province, curr_sales, sale_rank,
if(count>1, (sale_rank-1)/(count-1), 0) as `percent_rank` from (select prev_month, curr_month, province, @i2:=if(prev_month=curr_month,@i2+1,1) as `row_number`, @i3:=if(prev_month<>curr_month, 1, if(prev_sales=curr_sales, @i3, @i2)) as `sale_rank`, prev_sales, curr_sales from (select @i1 as prev_month, @i1:=yearmonth as curr_month, province, @d1 as prev_sales, @d1:=sales as curr_sales from (select * from detail where yearmonth in (201601,201602) order by yearmonth, sales desc ) t111 ) t11 ) t1 join (select yearmonth, count(*) count from detail where yearmonth in (201601, 201602) group by yearmonth ) t2 on t1.curr_month=t2.yearmonth; |
3 | =connect("mysql") |
4 | >A3.execute(A1) |
5 | =A3.query@x(A2) |
(1) The statement in A1 initializes the user variables;
(2) In the A2 statement, subquery t11 obtains the previous row's month and sales, t1 then works out the row number and rank within the month, and t2 counts the rows of each month. Finally t1 is joined with t2 and the percent rank is computed with the formula if(rows in the month>1, (rank within the month-1)/(rows in the month-1), 0).
After execution, A5 holds the required result.
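As these two examples show, the SQL needed to emulate window functions is long, complicated and hard to read. It also relies on the implicit rule that SELECT expressions are evaluated from left to right, which the MySQL reference manual advises against. If that rule can no longer be relied on, the SQL will become even more complicated. How, for instance, would you fetch a field value from the previous row without it? That is left for readers to work out.
Fortunately, with esProc and its own SPL language we need not go to such trouble. MySQL only has to run the most basic SQL, and esProc takes care of the rest.
Let's now see how esProc's SPL syntax implements the equivalent window functions.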
1. SUM(), COUNT(), AVG(), MAX(), MIN()
a) select province, sales, sum(sales) over() `sum`,
avg(sales) over() `avg`, max(sales) over() `max`,
min(sales) over() `min`, count(*) over() `count`
from detail
where yearmonth=201601
order by sales;
  | A |
1 | =connect("mysql") |
2 | =A1.query@x("select * from detail where yearmonth=201601 order by sales") |
3 | =A2.sum(sales) |
4 | =A2.avg(sales) |
5 | =A2.max(sales) |
6 | =A2.min(sales) |
7 | =A2.count() |
8 | =A2.new(province, sales, A3:sum, A4:avg,A5:max,A6:min, A7:count) |
(1) A3 to A7 compute, in turn, the sum, average, maximum and minimum of sales and the total row count;
(2) A8 builds a table sequence in which every row carries the month's total sales, average, maximum, minimum and row count.
After execution, A8 holds the result.
This example is routine and poses no challenge; it is just a warm-up. Now for the real thing.
b) select yearmonth,province,sales,
sum(sales) over (partition by yearmonth) `sum`,
avg(sales) over (partition by yearmonth) `avg`,
max(sales) over (partition by yearmonth) `max`,
min(sales) over (partition by yearmonth) `min`,
count(*) over (partition by yearmonth) `count`
from detail
where yearmonth in (201601,201602) and sales>49500
order by yearmonth, sales desc;
  | A |
1 | =connect("mysql") |
2 | =A1.query@x("select * from detail where yearmonth in (201601,201602) and sales>49500 order by yearmonth,sales desc") |
3 | =A2.groups(yearmonth;sum(sales):sum,avg(sales):avg,max(sales):max,min(sales):min, count(1):count) |
4 | =A2.switch(yearmonth,A3) |
5 | =A4.new(yearmonth.yearmonth:yearmonth,province,sales,yearmonth.sum:sum, yearmonth.avg:avg,yearmonth.max:max,yearmonth.min:min,yearmonth.count:count) |
(1) A3 groups A2 by month and computes the sum, average, maximum and minimum of sales and the row count for each group;
(2) A4 replaces each yearmonth value in A2 with the A3 record for the same month;
(3) A5 builds the result, reading the per-month aggregates through the yearmonth reference.
After execution, A5 holds the result.
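The group-then-attach pattern used above (aggregate once per month, then let every detail row look up its month's summary) can be sketched in Python as follows. This is only an illustration under assumed names: `attach_partition_aggregates` and the sample rows are hypothetical, and a dictionary lookup stands in for the record reference that switch creates.

```python
from collections import defaultdict


def attach_partition_aggregates(rows):
    """rows: iterable of (yearmonth, province, sales) tuples (assumed layout).

    Returns one dict per input row, each carrying the aggregates of its month,
    which is what an aggregate with OVER (PARTITION BY yearmonth) produces.
    """
    rows = list(rows)

    # Step 1: collect sales per month (the grouping step).
    by_month = defaultdict(list)
    for yearmonth, _province, sales in rows:
        by_month[yearmonth].append(sales)

    # Step 2: one summary record per month.
    summary = {
        month: {
            "sum": sum(values),
            "avg": sum(values) / len(values),
            "max": max(values),
            "min": min(values),
            "count": len(values),
        }
        for month, values in by_month.items()
    }

    # Step 3: every detail row looks up the summary of its own month.
    return [
        {"yearmonth": ym, "province": province, "sales": sales, **summary[ym]}
        for ym, province, sales in rows
    ]


if __name__ == "__main__":
    sample = [(201601, "A", 10.0), (201601, "B", 30.0), (201602, "A", 5.0)]  # made-up data
    for record in attach_partition_aggregates(sample):
        print(record)
```

Computing the summary once per month and sharing it across rows avoids recomputing the aggregates for every detail row.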
2. VARIANCE(), STD()
a) select province, sales, variance(sales) over() `variance`, std(sales) over() `std`
from detail where yearmonth=201601;
  | A |
1 | =connect("mysql") |
2 | =A1.query@x("select * from detail where yearmonth=201601") |
3 | =A2.variance(sales) |
4 | =sqrt(A3) |
5 | =A2.new(province,sales,A3:variance,A4:std) |
(1) A3 computes the variance of sales;
(2) A4 takes the square root of A3, which is the standard deviation.
After execution, A5 holds the result.
b) select yearmonth, province, sales,
variance(sales) over(partition by yearmonth) `variance`,
std(sales) over(partition by yearmonth) `std`
from detail
where yearmonth in (201601, 201602);
  | A |
1 | =connect("mysql") |
2 | =A1.query@x("select * from detail where yearmonth in (201601,201602) order by yearmonth") |
3 | =A2.group(yearmonth) |
4 | =A3.new(yearmonth:m,~.variance(sales):v, sqrt(v):v2) |
5 | =A2.switch(yearmonth, A4:m) |
6 | =A5.new(yearmonth.m:yearmonth, province, sales, yearmonth.v:variance, yearmonth.v2:std) |
(1) A3 groups by month;
(2) A4 computes the variance of each month's sales, and its square root as the standard deviation.
After execution, A6 holds the result.
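As a cross-check of the arithmetic, the per-month variance and standard deviation can be sketched in Python. MySQL's VARIANCE() and STD() are population statistics, so the sketch uses `statistics.pvariance`; whether SPL's variance() follows the same definition is an assumption here, not something the text above confirms. The function name and data are made up for illustration.

```python
import math
import statistics
from collections import defaultdict


def variance_and_std_by_month(rows):
    """rows: iterable of (yearmonth, sales) tuples (assumed layout).

    Returns (yearmonth, sales, variance, std) per row, using the population
    variance, which is what MySQL's VARIANCE() computes.
    """
    rows = list(rows)
    by_month = defaultdict(list)
    for yearmonth, sales in rows:
        by_month[yearmonth].append(sales)

    stats = {}
    for month, values in by_month.items():
        variance = statistics.pvariance(values)       # population variance
        stats[month] = (variance, math.sqrt(variance))  # std is its square root

    return [(ym, sales, *stats[ym]) for ym, sales in rows]


if __name__ == "__main__":
    sample = [(201601, v) for v in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0)]  # made-up data
    print(variance_and_std_by_month(sample)[0])
```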
3. ROW_NUMBER(), RANK(), DENSE_RANK(), PERCENT_RANK()
a) select province, sales, row_number() over(order by sales desc) `row_number`,
rank() over (order by sales desc) `rank`,
dense_rank() over (order by sales desc) `dense_rank`,
percent_rank() over (order by sales desc) `percent_rank`
from detail
where yearmonth=201601;
  | A |
1 | =connect("mysql") |
2 | =A1.query@x("select * from detail where yearmonth=201601") |
3 | =A2.sort(sales:-1) |
4 | =A2.count() |
5 | =A3.new(province,sales,#:row_number,rank(sales):rank,ranki(sales):dense_rank, if(A4>1,(rank-1)/(A4-1),0):percent_rank) |
(1) In A5, # is the sequence number of the current row in A3;
(2) The percent rank formula is if(row count>1, (rank-1)/(row count-1), 0).
After execution, A5 holds the result.
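The four ranking functions differ only in how they treat ties, which a short Python sketch makes concrete. The helper name `rankings` and the sample values are hypothetical; the percent rank follows the formula given in note (2).

```python
def rankings(sales):
    """Return (value, row_number, rank, dense_rank, percent_rank) per row,
    ordered by value from high to low."""
    ordered = sorted(sales, reverse=True)
    total = len(ordered)
    result = []
    rank = 0
    dense_rank = 0
    prev = None
    for row_number, value in enumerate(ordered, start=1):
        if value != prev:
            rank = row_number      # RANK jumps to the row number after a tie
            dense_rank += 1        # DENSE_RANK only ever advances by one
        prev = value
        percent_rank = (rank - 1) / (total - 1) if total > 1 else 0
        result.append((value, row_number, rank, dense_rank, percent_rank))
    return result


if __name__ == "__main__":
    for row in rankings([40, 10, 50, 40, 30]):  # made-up sales values
        print(row)
```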
b) select yearmonth, province, sales,
row_number() over(partition by yearmonth order by sales desc)
`row_number`,
rank() over (partition by yearmonth order by sales desc) `rank`,
dense_rank() over (partition by yearmonth order by sales desc)
`dense_rank`,
percent_rank() over (partition by yearmonth order by sales desc)
`percent_rank`
from detail
where yearmonth in (201601,201602);
  | A |
1 | =connect("mysql") |
2 | =A1.query@x("select * from detail where yearmonth in (201601,201602)") |
3 | =A2.sort(yearmonth,sales:-1) |
4 | =A2.groups(yearmonth:m;count(1):count) |
5 | =A2.switch(yearmonth,A4:m) |
6 | =A3.new(yearmonth.m:yearmonth,province,sales,seq(yearmonth):row_number,rank(sales;yearmonth):rank, ranki(sales;yearmonth):dense_rank, if(yearmonth.count>1, (rank-1)/(yearmonth.count-1),0):percent_rank) |
After execution, A6 holds the result.
4. NTILE()
a) select province, sales, ntile(3) over() `ntile`
from detail
where yearmonth=201601;
  | A |
1 | =connect("mysql") |
2 | =A1.query@x("select * from detail where yearmonth=201601") |
3 | =桶数=3 |
4 | =A2.count() |
5 | =A2.new(province,sales,z(#,桶数,A4):ntile) |
(1) A3 sets the number of buckets (the variable 桶数) to 3;
(2) In A5, z(i, number of buckets, total rows) computes the bucket number of row i.
After execution, A5 holds the result.
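For reference, the bucket assignment that SQL's NTILE() performs can be written out as a small Python function: rows are split into buckets as evenly as possible, and when the division leaves a remainder the earliest buckets each take one extra row. This is a sketch of the SQL semantics under a hypothetical name, `ntile_bucket`; that the SPL function z() distributes rows in exactly the same way is assumed rather than confirmed by the text.

```python
def ntile_bucket(row_number, buckets, total):
    """1-based bucket of the row_number-th row (1-based) out of `total` rows,
    following the SQL NTILE(buckets) distribution."""
    base, extra = divmod(total, buckets)   # `extra` buckets hold base+1 rows
    boundary = extra * (base + 1)          # last row that falls in a larger bucket
    if row_number <= boundary:
        return (row_number - 1) // (base + 1) + 1
    # Remaining rows go into buckets of exactly `base` rows.
    return extra + (row_number - boundary - 1) // base + 1


if __name__ == "__main__":
    total_rows = 10  # made-up row count
    print([ntile_bucket(i, 3, total_rows) for i in range(1, total_rows + 1)])
```

When there are fewer rows than buckets, every row lands in the larger-bucket branch, so the division by `base` is never reached with a zero divisor.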
b) select yearmonth, province, sales, ntile(3) over(partition by yearmonth) `ntile`
from detail
where yearmonth=201601 or (yearmonth=201602 and province!='上海');
  | A |
1 | =connect("mysql") |
2 | =A1.query@x("select * from detail where yearmonth=201601 or (yearmonth=201602 and province!='上海') order by yearmonth") |
3 | =桶数=3 |
4 | =A2.group(yearmonth:m;~.count():count) |
5 | =A2.switch(yearmonth,A4:m) |
6 | =A5.new(yearmonth.m:yearmonth,province,sales, z(seq(yearmonth), 桶数, yearmonth.count):ntile) |
After execution, A6 holds the result.
5. FIRST_VALUE(), LAST_VALUE(), NTH_VALUE(), LAG(), LEAD()
a) select province,sales,
first_value(sales) over(partition by yearmonth) `first_value`,
last_value(sales) over(partition by yearmonth) `last_value`,
nth_value(sales, 5) over(partition by yearmonth) `nth_value`,
lag(sales, 2) over(partition by yearmonth) `lag`,
lead(sales, 3) over(partition by yearmonth) `lead`
from detail
where yearmonth=201601;
  | A |
1 | =connect("mysql") |
2 | =A1.query@x("select * from detail where yearmonth=201601") |
3 | =A2.new( province, sales, A2.m(1).sales:first_value,A2.m(-1).sales:last_value, A2.m(5).sales:nth_value, ~[-2].sales:lag,~[3].sales:lead) |
(1) A2.m(i) takes the i-th record of A2 and returns null when i is out of range; a negative i counts from the end, giving the abs(i)-th record from the last. A2(i) cannot be used here, because it raises an error when i is out of range.
After execution, A3 holds the result.
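The out-of-range behavior described in note (1) is what makes the nth_value column safe when a partition has fewer than five rows. A rough Python analogue is shown below; the helper `m` only imitates the described behavior of A2.m(i), and both function names and the data are illustrative assumptions.

```python
def m(seq, i):
    """1-based access that mimics the described A.m(i) behavior:
    negative i counts from the end, and an out-of-range i yields None."""
    n = len(seq)
    if 1 <= i <= n:
        return seq[i - 1]
    if -n <= i <= -1:
        return seq[i]
    return None


def value_functions(sales, nth=5):
    """Return (sales, first_value, last_value, nth_value) per row.

    With no ORDER BY in the window, the frame is the whole partition,
    so all three values are the same for every row.
    """
    first, last, nth_value = m(sales, 1), m(sales, -1), m(sales, nth)
    return [(s, first, last, nth_value) for s in sales]


if __name__ == "__main__":
    print(value_functions([10, 20, 30]))              # fewer than 5 rows (made-up data)
    print(value_functions([10, 20, 30, 40, 50, 60]))
```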
b) select yearmonth,province,sales,
first_value(sales) over(partition by yearmonth) `first_value`,
last_value(sales) over(partition by yearmonth) `last_value`,
nth_value(sales, 5) over(partition by yearmonth) `nth_value`,
lag(sales, 2) over(partition by yearmonth) `lag`,
lead(sales, 3) over(partition by yearmonth) `lead`
from detail
where yearmonth=201601 or (yearmonth=201602 and sales>50000);
  | A |
1 | =connect("mysql") |
2 | =A1.query@x("select * from detail where yearmonth=201601 or (yearmonth=201602 and sales>50000) order by yearmonth") |
3 | =A2.group(yearmonth:m;~.count():count,~.m(1).sales:first_value, ~.m(-1).sales:last_value,~.m(5).sales:nth_value) |
4 | =A2.switch(yearmonth, A3:m) |
5 | =A2.new(yearmonth.m:yearmonth, province, sales, yearmonth.first_value:first_value,yearmonth.last_value:last_value,yearmonth.nth_value:nth_value, (seq=seq(yearmonth),if(seq>2,~[-2].sales,null)):lag,if(yearmonth.count-seq>=3,~[3].sales,null):lead) |
(1) In A5, avoid calling seq(yearmonth) inside an if function wherever possible: seq accumulates as the records of A2 are looped over, so every time the call is skipped the counter misses one increment.
(2) In A5, the earlier expression assigns the variable seq with seq=seq(yearmonth), so the later expressions can refer to seq.
After execution, A5 holds the result.
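The boundary checks in A5 (seq>2 for lag, count-seq>=3 for lead) exist so that an offset never reaches into a neighbouring month. The same idea in Python, as a sketch with hypothetical names and made-up data, where the position counter is updated exactly once per row before either check uses it:

```python
def lag_lead_by_partition(rows, lag=2, lead=3):
    """rows: list of (yearmonth, sales) tuples already ordered by yearmonth.

    Returns (yearmonth, sales, lag_value, lead_value) per row, with None
    wherever the offset would cross the boundary of the month.
    """
    # Row count of each month, needed for the lead check.
    counts = {}
    for month, _sales in rows:
        counts[month] = counts.get(month, 0) + 1

    result = []
    seq = 0
    prev_month = None
    for i, (month, sales) in enumerate(rows):
        # Position inside the month; advanced once per row, unconditionally.
        seq = seq + 1 if month == prev_month else 1
        prev_month = month
        lag_value = rows[i - lag][1] if seq > lag else None
        lead_value = rows[i + lead][1] if counts[month] - seq >= lead else None
        result.append((month, sales, lag_value, lead_value))
    return result


if __name__ == "__main__":
    sample = [(201601, 1), (201601, 2), (201601, 3), (201601, 4),
              (201602, 5), (201602, 6), (201602, 7)]
    for row in lag_lead_by_partition(sample):
        print(row)
```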
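6. CUME_DIST()
a) select province,sales, cume_dist() over(order by sales) `cume_dist`
from detail
where yearmonth=201601;
  | A |
1 | =connect("mysql") |
2 | =A1.query@x("select * from detail where yearmonth=201601 order by sales desc") |
3 | =A2.count() |
4 | =A2.new(province,sales,(A3-rank(sales)+1)/A3:cume_dist) |
5 | =A4.rvs() |
(1) CUME_DIST() over (order by sales) gives the cumulative distribution of sales from low to high, computed as (number of rows with sales less than or equal to the current sales) / (total rows);
(2) The number of rows with sales less than or equal to the current sales = total rows - rank of the current sales in descending order + 1;
(3) A2 must be sorted by sales in descending order;
(4) A5 reverses the order of the rows.
After execution, A5 holds the result.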
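The identity in note (2) can be checked with a short Python sketch that computes the cumulative distribution twice, once through the descending rank as the script does and once directly from the definition in note (1). Function names and sample values are made up for illustration.

```python
def cume_dist_via_desc_rank(sales):
    """Cumulative distribution derived from the descending rank, then reversed
    so the rows run from low to high."""
    ordered = sorted(sales, reverse=True)
    total = len(ordered)
    result = []
    rank = 0
    prev = None
    for row_number, value in enumerate(ordered, start=1):
        if value != prev:
            rank = row_number                      # descending RANK()
        prev = value
        result.append((value, (total - rank + 1) / total))
    result.reverse()                               # the final reversal step
    return result


def cume_dist_direct(sales):
    """Definition: rows with a value <= the current value, over total rows."""
    total = len(sales)
    return [(v, sum(1 for x in sales if x <= v) / total) for v in sorted(sales)]


if __name__ == "__main__":
    sample = [5, 1, 5, 3]  # made-up sales values with a tie
    print(cume_dist_via_desc_rank(sample))
    print(cume_dist_direct(sample))
```

Both functions return the same list, including for tied values, which is why the descending sort plus a final reversal is sufficient.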
b) select yearmonth, province,sales,
cume_dist() over(partition by yearmonth order by sales) `cume_dist`
from detail
where yearmonth in (201601,201602);
  | A |
1 | =connect("mysql") |
2 | =A1.query@x("select * from detail where yearmonth in (201601,201602) order by yearmonth desc,sales desc") |
3 | =A2.groups(yearmonth:m;count(1):count) |
4 | =A2.switch(yearmonth,A3:m) |
5 | =A2.new(yearmonth.m:yearmonth,province,sales,(yearmonth.count-rank(sales;yearmonth)+1)/yearmonth.count:cume_dist) |
6 | =A5.rvs() |
(1) Because the rows are reversed at the end, A2 sorts by month in descending order as well.
After execution, A6 holds the result.
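After a dozen or so examples, doesn't implementing these functions in esProc look easy? What's more, because esProc computes step by step, cell by cell, we can follow a natural line of thought and inspect intermediate results along the way, which makes refining the whole query script simpler and more intuitive. Give it a try, and you will discover many more convenient and powerful features!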