我们本次使用到的 sql 都是在 presto 上跑的,如果想在 hive 或者 其他平台上跑的话,
请自行将 sql 转成对应的 sql 。
首先看一下,array_agg() over() 函数。
select leader , employee , array_agg(employee) over(partition by leader order by employee)
from (
select 'A' as leader , 'E1' as employee
union all select 'A' as leader , 'E2' as employee
union all select 'A' as leader , 'E3' as employee
union all select 'B' as leader , 'E4' as employee
union all select 'B' as leader , 'E5' as employee
union all select 'B' as leader , 'E6' as employee
) as a
计算结果如下所示:
leader | employee | array_agg |
---|---|---|
B | E4 | [E4] |
B | E5 | [E4, E5] |
B | E6 | [E4, E5, E6] |
A | E1 | [E1] |
A | E2 | [E1, E2] |
A | E3 | [E1, E2, E3] |
接下来是几个简单的函数:
slice(array,start_index, end_index)
为了更加形象的说明问题,我们使用的下面的例子进行说明。
select sale_date , shop_id , array_agg(ARRAY[sale_date , shop_id , cast(sale_amt as varchar)]) over(partition by shop_id order by sale_date)
from (
select '2019-06-06' as sale_date , 'shop1' as shop_id , 10 as sale_amt
union all select '2019-06-05' as sale_date , 'shop1' as shop_id , 14 as sale_amt
union all select '2019-06-03' as sale_date , 'shop1' as shop_id , 17 as sale_amt
union all select '2019-06-02' as sale_date , 'shop1' as shop_id , 18 as sale_amt
union all select '2019-06-01' as sale_date , 'shop1' as shop_id , 13 as sale_amt
union all
select '2019-06-06' as sale_date , 'shop2' as shop_id , 11 as sale_amt
union all select '2019-06-05' as sale_date , 'shop2' as shop_id , 15 as sale_amt
union all select '2019-06-03' as sale_date , 'shop2' as shop_id , 18 as sale_amt
union all select '2019-06-02' as sale_date , 'shop2' as shop_id , 19 as sale_amt
union all select '2019-06-01' as sale_date , 'shop2' as shop_id , 16 as sale_amt
) as a
上面这段 sql 的结果如下所示:
sale_date | shop_id | sale_index |
---|---|---|
2019-06-01 | shop1 | [[2019-06-01, shop1, 13]] |
2019-06-02 | shop1 | [[2019-06-01, shop1, 13], [2019-06-02, shop1, 18], [2019-06-03, shop1, 17]] |
2019-06-03 | shop1 | [[2019-06-01, shop1, 13], [2019-06-02, shop1, 18], [2019-06-03, shop1, 17], [2019-06-05, shop1, 14]] |
2019-06-04 | shop1 | [[2019-06-01, shop1, 13], [2019-06-02, shop1, 18], [2019-06-03, shop1, 17], [2019-06-05, shop1, 14], [2019-06-06, shop1, 10]] |
2019-06-01 | shop2 | [[2019-06-01, shop2, 16]] |
2019-06-02 | shop2 | [[2019-06-01, shop2, 16], [2019-06-02, shop2, 19]] |
2019-06-03 | shop2 | [[2019-06-01, shop2, 16], [2019-06-02, shop2, 19], [2019-06-03, shop2, 18]] |
2019-06-05 | shop2 | [[2019-06-01, shop2, 16], [2019-06-02, shop2, 19], [2019-06-03, shop2, 18], [2019-06-05, shop2, 15]] |
2019-06-06 | shop2 | [[2019-06-01, shop2, 16], [2019-06-02, shop2, 19], [2019-06-03, shop2, 18], [2019-06-05, shop2, 15], [2019-06-06, shop2, 11]] |
请注意观察数组里面的数字。我们可以观察到已经是排好序了。但是我们需要的是倒序。这是个问题。
另外,我只想取出前 3 天的数据汇总,所以我们需要取 array 的子集。这是个问题。
然后我们还会遇到一个问题,如果日期不连续的话,例如,shop2 在 2019-06-04 没有数据,所以我们需要对数组里面的元素进行过滤。
过滤之后我们需要单独把数组里的销售额拿出来相加。
排序我们可以使用 reverse 函数,slice 函数取数组的子集,filter(array, lamabda) 过滤出我们需要的元素,然后使用 reduce 把销售额加起来。
这里使用到的 reduce 在官网上没有讲到,reduce(array,cast(ROW(0,0.00) as ROW(cnt int,amt doubel)), (x,y) ->cast( as ROW(cnt int , amt double)) , s -> s )
这些都是在数组上的操作详细的可以查看:https://prestodb.github.io/docs/current/functions/array.html
这里先弄清楚 filter 和 reduce 的功能特性。
select cast(reduce(filter(ARRAY[1,2,3,4,5],x -> x >= 4),cast(ROW(0) as ROW(cnt int)), (x,y) ->cast( ROW(y + x.cnt) as ROW(cnt int ) ), x -> x ) as json) as rs
其中, filter(array,filterFunction(x)): 第一个参数的是要操作的数组,第二个是 filter 的条件,这里是一个正则表达式。
reduce(array,initialState , inputFunction(S, T, S), outputFunction(S, R))
在了解完 array 的几个函数以后,我可以得到如下所示的结果
最终的结果为:
select sale_date
, shop_id
, sale_pre3
, json_array_get(cast(reduce(filter(array_amt,x -> x[1] >= sale_pre3),cast(ROW(0) as ROW(cnt int)), (x,y) ->cast( ROW(cast(y[3] as integer) + x.cnt) as ROW(cnt int ) ), x -> x ) as json),0) as rs
from (
select sale_date
, shop_id
, cast(date_add('day',-2,date(sale_date)) as varchar) as sale_pre3
, array_agg(ARRAY[sale_date , shop_id , cast(sale_amt as varchar)]) over(partition by shop_id order by sale_date) as array_amt
from (
select '2019-06-06' as sale_date , 'shop1' as shop_id , 10 as sale_amt
union all select '2019-06-05' as sale_date , 'shop1' as shop_id , 14 as sale_amt
union all select '2019-06-03' as sale_date , 'shop1' as shop_id , 17 as sale_amt
union all select '2019-06-02' as sale_date , 'shop1' as shop_id , 18 as sale_amt
union all select '2019-06-01' as sale_date , 'shop1' as shop_id , 13 as sale_amt
union all
select '2019-06-06' as sale_date , 'shop2' as shop_id , 11 as sale_amt
union all select '2019-06-05' as sale_date , 'shop2' as shop_id , 15 as sale_amt
union all select '2019-06-03' as sale_date , 'shop2' as shop_id , 18 as sale_amt
union all select '2019-06-02' as sale_date , 'shop2' as shop_id , 19 as sale_amt
union all select '2019-06-01' as sale_date , 'shop2' as shop_id , 16 as sale_amt
) as a
) as b
其中,json_array_get(json_str,index)
函数是将 json_str
转成 json_object
,然后取出第一个数组。
其实我们也可以使用 lag 或者 leader 函数来实现这个功能。最终的解决结果如下所示:
select sale_date
, shop_id
, (
sale_amt
+if(_last_3_day < last_1_day,last_1_amt,0)
+if(_last_3_day < last_2_day,last_1_amt,0)
+if(_last_3_day < last_3_day,last_1_amt,0)
) as last_3_amt_sum
from (
select sale_date
, shop_id
, sale_amt
, cast(date(date_add('day' , cast(-3 as bigint) , cast (sale_date as timestamp))) as varchar) as _last_3_day
, lag(sale_amt,1) over(partition by shop_id order by sale_date) as last_1_amt
, lag(sale_date,1) over(partition by shop_id order by sale_date) as last_1_day
, lag(sale_amt,2) over(partition by shop_id order by sale_date) as last_2_amt
, lag(sale_date,2) over(partition by shop_id order by sale_date) as last_2_day
, lag(sale_amt,3) over(partition by shop_id order by sale_date) as last_3_amt
, lag(sale_date,3) over(partition by shop_id order by sale_date) as last_3_day
from (
select '2019-06-06' as sale_date , 'shop1' as shop_id , 10 as sale_amt
union all select '2019-06-05' as sale_date , 'shop1' as shop_id , 14 as sale_amt
union all select '2019-06-03' as sale_date , 'shop1' as shop_id , 17 as sale_amt
union all select '2019-06-02' as sale_date , 'shop1' as shop_id , 18 as sale_amt
union all select '2019-06-01' as sale_date , 'shop1' as shop_id , 13 as sale_amt
union all
select '2019-06-06' as sale_date , 'shop2' as shop_id , 11 as sale_amt
union all select '2019-06-05' as sale_date , 'shop2' as shop_id , 15 as sale_amt
union all select '2019-06-03' as sale_date , 'shop2' as shop_id , 18 as sale_amt
union all select '2019-06-02' as sale_date , 'shop2' as shop_id , 19 as sale_amt
union all select '2019-06-01' as sale_date , 'shop2' as shop_id , 16 as sale_amt
) as a
) as aa
先说动态 TOP N 问题之前我们先了认识一下 TOP N 。 TOP N 是什么呢?让我们先闪回到学生时代,那时在期末考试结束后
伟大的校长说学生们辛苦了,要有奖励,于是就有了奖励。教务处的老师们经过一番思考之后,制定了一系列的奖励政策,其中之一是
每个班级总成绩前 10 名的学生发一本《三体》回去研读。如果你是教务处的老师,应该取出这批学生呢? 其实有点 SQL 基础的人
我们都应该有一些思路了。如下所示:
select stu_name
, class
, score
from (
select stu_name
, class
, score
, row_number() over(partition by class order by score desc) as student_ranking
from (
select 'a' as stu_name, 'C1' as class 213 as score
union all select 'b' as stu_name, 'C1' as class 223 as score
union all select 'c' as stu_name, 'C1' as class 223 as score
union all select 'd' as stu_name, 'C1' as class 223 as score
union all select 'e' as stu_name, 'C1' as class 223 as score
union all select 'g' as stu_name, 'C2' as class 213 as score
union all select 'h' as stu_name, 'C2' as class 223 as score
union all select 'm' as stu_name, 'C2' as class 223 as score
union all select 'n' as stu_name, 'C2' as class 223 as score
union all select 'i' as stu_name, 'C2' as class 223 as score
) as school_report_card
) as school_report_card1
where student_ranking <= 10
但是呢,有的班主任提出了不同的观点,有的我们人多,只奖励 10 个人太少了,要按照每个班级人数的 10% 奖励。好像还挺有道理的。
那么这有改怎么写呢?
借用周星驰的经典台词“你让我打狗,总得给我一个棒子啊”,我们首先要知道每个班的总人数。
select 'C1' as class , 40 as population
union all select 'C2' as class , 50 as population
那么我们的 SQL 重新写成:
select stu_name
, class
, score
, student_ranking
from (
select stu_name
, class
, score
, row_number() over(partition by class order by score desc) as student_ranking
from (
select 'a' as stu_name, 'C1' as class ,213 as score
union all select 'b' as stu_name, 'C1' as class ,223 as score
union all select 'c' as stu_name, 'C1' as class ,223 as score
union all select 'd' as stu_name, 'C1' as class ,223 as score
union all select 'e' as stu_name, 'C1' as class ,223 as score
union all select 'g' as stu_name, 'C2' as class ,213 as score
union all select 'h' as stu_name, 'C2' as class ,223 as score
union all select 'm' as stu_name, 'C2' as class ,223 as score
union all select 'n' as stu_name, 'C2' as class ,223 as score
union all select 'i' as stu_name, 'C2' as class ,223 as score
) as school_report_card
) as school_report_card1
left join (
select 'C1' as class , 40 as population
union all select 'C2' as class , 50 as population
) as class_population
on school_report_card1.class = class_population.class
where student_ranking <= population*0.1
上面使用到的数组,那么我们能不能也使用数组呢?答案是肯定的
不知道,你使用过没有 clickhouse ,下面来结束一下如何使用 clickhouse 做动态 TOP N 。
clickhouse 没有像 presto 那样开窗函数,所以只能使用数组进行处理了。
首先,来看看关于数组的几个基本的函数
select class
, groupArray(4)([stu_name , cast(score as String)])
from (
select 'a' as stu_name, 'C1' as class ,213 as score
union all select 'b' as stu_name, 'C1' as class ,223 as score
union all select 'c' as stu_name, 'C1' as class ,212 as score
union all select 'd' as stu_name, 'C1' as class ,877 as score
union all select 'e' as stu_name, 'C1' as class ,665 as score
union all select 'g' as stu_name, 'C2' as class ,6564 as score
union all select 'h' as stu_name, 'C2' as class ,444 as score
union all select 'm' as stu_name, 'C2' as class ,111 as score
union all select 'n' as stu_name, 'C2' as class ,222 as score
union all select 'i' as stu_name, 'C2' as class ,333 as score
)
group by class
结果如下所示:
class | array_group |
---|---|
C1 | b,223,a,213,e,665,c,212 |
C2 | m,111,g,6564,h,444,n,222 |
由上面的结果可以看到 groupArray(record element) 的功能是和 presto 的 array_agg() 的
功能差不多。就是把分组内的每个元素汇总放到一个数组里面。
接下来看一下,看下一个重要的函数:arrayResize 。
select class , arrayResize(a.array_group , b.population) as resized_array_group
from (
select class
, groupArray(4)([stu_name , cast(score as String)]) as array_group
from (
select 'a' as stu_name, 'C1' as class ,213 as score
union all select 'b' as stu_name, 'C1' as class ,223 as score
union all select 'c' as stu_name, 'C1' as class ,212 as score
union all select 'd' as stu_name, 'C1' as class ,877 as score
union all select 'e' as stu_name, 'C1' as class ,665 as score
union all select 'g' as stu_name, 'C2' as class ,6564 as score
union all select 'h' as stu_name, 'C2' as class ,444 as score
union all select 'm' as stu_name, 'C2' as class ,111 as score
union all select 'n' as stu_name, 'C2' as class ,222 as score
union all select 'i' as stu_name, 'C2' as class ,333 as score
)
group by class
) as a all
left join (
select 'C1' as class , 4 as population
union all select 'C2' as class , 3 as population
) as b
on a.class = b.class
结果为:
class | resized_array_group |
---|---|
C1 | a,213,c,212,b,223,d,877 |
C2 | g,6564,m,111,i,333 |
由例子我们知道 resize(array,new_size) 的功能就是重新设置数组的大小,如果 new_size
> old_size
那么可以 ‘’ 或者 0 填充数组。
如果 new_size
< old_size
则会将尾部的几个元素删除掉。
然后是 arrayJoin 此函数和 arrayConcat 看起来一样,但是其实差别特别的大。
arrayJoin 是将数组中的元素 split 放到列上,功能有点类似 hive 的 LATERAL VIEW 函数。
还有一个函数 indexOf(element,array),这个函数是将 element 在 array 中的位置返回,这样我们就可以得到成绩的排名。
select aa.class
,arrayConcat(joined_array ,[toString(element_index)] )
from (
select class
, arrayJoin(arrayResize(a.array_group , b.population)) as joined_array
, indexOf(arrayResize(a.array_group , b.population) ,arrayJoin(arrayResize(a.array_group , b.population))) as element_index
from (
select class
, groupArray(4)([stu_name , cast(score as String)]) as array_group
from (
select *
from (
select 'a' as stu_name, 'C1' as class ,213 as score
union all select 'b' as stu_name, 'C1' as class ,223 as score
union all select 'c' as stu_name, 'C1' as class ,212 as score
union all select 'd' as stu_name, 'C1' as class ,877 as score
union all select 'e' as stu_name, 'C1' as class ,665 as score
union all select 'g' as stu_name, 'C2' as class ,6564 as score
union all select 'h' as stu_name, 'C2' as class ,444 as score
union all select 'm' as stu_name, 'C2' as class ,111 as score
union all select 'n' as stu_name, 'C2' as class ,222 as score
union all select 'i' as stu_name, 'C2' as class ,333 as score
) as aa
order by class , score desc
)
group by class
) as a all
left join (
select 'C1' as class , 4 as population
union all select 'C2' as class , 3 as population
) as b
on a.class = b.class
) as aa
class | array |
---|---|
C1 | d,877,1 |
C1 | e,665,2 |
C1 | b,223,3 |
C1 | a,213,4 |
C2 | g,6564,1 |
C2 | h,444,2 |
C2 | i,333,3 |
后面就好说吧,使用 arrayJoin 展开使用数组的下标展示出不同的字段就可以了。
所以最后的结果是:
select class
,arry[1] as stu_name
,arry[2] as score
,arry[3] as student_ranking
from (
select aa.class
,arrayConcat(joined_array ,[toString(element_index)] ) as arry
from (
select class
, arrayJoin(arrayResize(a.array_group , b.population)) as joined_array
, indexOf(arrayResize(a.array_group , b.population) ,arrayJoin(arrayResize(a.array_group , b.population))) as element_index
from (
select class
, groupArray(4)([stu_name , cast(score as String)]) as array_group
from (
select *
from (
select 'a' as stu_name, 'C1' as class ,213 as score
union all select 'b' as stu_name, 'C1' as class ,223 as score
union all select 'c' as stu_name, 'C1' as class ,212 as score
union all select 'd' as stu_name, 'C1' as class ,877 as score
union all select 'e' as stu_name, 'C1' as class ,665 as score
union all select 'g' as stu_name, 'C2' as class ,6564 as score
union all select 'h' as stu_name, 'C2' as class ,444 as score
union all select 'm' as stu_name, 'C2' as class ,111 as score
union all select 'n' as stu_name, 'C2' as class ,222 as score
union all select 'i' as stu_name, 'C2' as class ,333 as score
) as aa
order by class , score desc
)
group by class
) as a all
left join (
select 'C1' as class , 4 as population
union all select 'C2' as class , 3 as population
) as b
on a.class = b.class
) as aa
) as aaa
结果为:
header 1 | header 2 |
---|---|
row 1 col 1 | row 1 col 2 |
row 2 col 1 | row 2 col 2 |
class | stu_name | score | student_ranking |
---|---|---|---|
C1 | d | 877 | 1 |
C1 | e | 665 | 2 |
C1 | b | 223 | 3 |
C1 | a | 213 | 4 |
C2 | g | 6564 | 1 |
C2 | h | 444 | 2 |
C2 | i | 333 | 3 |
随机分组的问题是这样的,如果我们 100 名学生成绩都差不多,那么我们可以随机的分到两个班级里面去。
select stu_name
,case when group_index = 1
then 'C1'
when group_index = 2
then 'C2'
else 'none' end as class
from (
select stu_name
,(index + max_index/2 -1)/(max_index/2) as group_index
from (
select stu_name
,index
,(max(index) over()) as max_index
from (
select *
from (
select stu_name
,row_number() over() as index
from (
select 'a' as stu_name
union all select 'b' as stu_name
union all select 'c' as stu_name
union all select 'd' as stu_name
union all select 'e' as stu_name
union all select 'f' as stu_name
)
) as a
) as aa
) as aaa
) as aaaa
需要注意的是我们这个例子对于总人数是偶数的时候是有效的,但是奇数的时候就会有问题,那些人特殊处理一下就可以了
例如,
a、b、c、d、e
做取商操作的时候,得到 (1 + 2 -1)/2 = 1 、 (2 + 2 -1)/2 = 1、 (2 + 3 -1)/2 = 2、 (2 + 4 -1)/2 = 2、 (2 + 5 -1)/2 = 3
我们发现,5 会分到 3 里面,所以如果总人数是奇数的时候,需要对余出来的几行记录做一下特殊处理。
还可以使用取模的办法来分组。请看下面的公式。
group=index%goup_size
% 是取于操作,举个例子,
1,2,3,4,5,6,7,8,9,10
2 取模后:
1,0,1,0,1,0,1,0,1,0
3 取模后:
1,2,0,1,2,0,1,2,0,1
所以 sql 可以改造成:
select stu_name
,index
,index%3
,substring('ABC',1+index%3,1) as group_index
from(
select stu_name
,row_number() over() as index
from (
select 'a' as stu_name
union all select 'b' as stu_name
union all select 'c' as stu_name
union all select 'd' as stu_name
union all select 'e' as stu_name
union all select 'f' as stu_name
)
)
order by index
结果:
stu_name | index | group_index | group_name |
---|---|---|---|
b | 1 | 1 | B |
c | 2 | 2 | C |
d | 3 | 0 | A |
e | 4 | 1 | B |
f | 5 | 2 | C |
a | 6 | 0 | A |
其中,interval = 7 ,每隔 7 个数字的分成了一个组。
select (index + interval -1 )/interval
, index
from (
select dim_date_id
,max(index) over() as max_index
,index
from (
select dim_date_id
,row_number() over() as index
from dim.dim_date
where dim_date_id between '20190701' and '20190721'
)
)