This post summarizes some SQL techniques on the time dimension: consecutive purchases, longest sign-in streaks, cumulative spend, and so on. Mapped onto other business domains the computations look much the same; in gaming, for instance, you run into consecutive login time, sign-in streak length, and maximum consecutive sign-in days. The methods are all shared, so everything here is implemented in Spark SQL; for Hive SQL some of the code needs small tweaks, e.g. the having clauses would have to be rewritten as a where in an outer query, which I won't call out again each time.
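For instance, several queries below filter on a window-function alias with having, which the Spark SQL used here accepts; a minimal sketch of the Hive rewrite (t and rn are placeholder names, not from the original):
-- Spark SQL style used throughout this post
select *, row_number() over(order by dt) as rn from t having rn = 1
-- Hive: wrap in an outer query and filter with where instead
select * from (select *, row_number() over(order by dt) as rn from t) a where rn = 1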
To make the raw data easy to split, I glued the fields together with @: the first is the date, the second the user, the third whether a purchase was made, and the fourth the purchase amount.
20190531@156@1@20
20190601@156@1@20
20190602@156@1@10
20190603@156@0@0
20190604@156@0@0
20190605@156@1@10
20190606@156@1@10
20190607@156@1@10
20190608@156@0@0
20190609@156@1@20
20190610@156@1@20
20190531@187@0@0
20190601@187@1@10
20190602@187@1@20
20190603@187@1@30
20190604@187@1@40
20190605@187@0@0
20190606@187@1@10
20190607@187@0@0
20190608@187@1@20
20190609@187@1@20
20190610@187@1@10
20190609@173@0@0
20190610@173@1@10
Mapped into a table with the following structure:
create table tmp_time_exp
(
dt string,
passenger_phone string,
is_call string comment 'whether a purchase was made',
cost bigint comment 'purchase amount'
)
row format DELIMITED fields terminated by '@'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location '/hdfslocation'
Query it to check the data landed as expected:
tmp_time_exp.dt tmp_time_exp.passenger_phone tmp_time_exp.is_call tmp_time_exp.cost
20190531 156 1 20
20190601 156 1 20
20190602 156 1 10
20190603 156 0 0
20190604 156 0 0
20190605 156 1 10
20190606 156 1 10
20190607 156 1 10
20190608 156 0 0
20190609 156 1 20
20190610 156 1 20
20190531 187 0 0
20190601 187 1 10
20190602 187 1 20
20190603 187 1 30
20190604 187 1 40
20190605 187 0 0
20190606 187 1 10
20190607 187 0 0
20190608 187 1 20
20190609 187 1 20
20190610 187 1 10
20190609 173 0 0
20190610 173 1 10
Example: find the users who purchased on three consecutive days, along with the start and end dates of each streak.
select
passenger_phone,
is_call,
cost,
unix_timestamp(lag(dt,2,0) over(partition by passenger_phone order by dt),'yyyyMMdd') as start_dt, -- date two purchase rows back (as a unix timestamp, matching the output below)
dt as end_dt,
datediff(from_unixtime(unix_timestamp(dt,'yyyyMMdd'),'yyyy-MM-dd'),from_unixtime(unix_timestamp(lag(dt,2,0) over(partition by passenger_phone order by dt),'yyyyMMdd'),'yyyy-MM-dd')) as last3day -- gap in days between this row and the row two back
from
tmp_time_exp
where
is_call != 0
having
last3day = 2
Output:
passenger_phone is_call cost start_dt end_dt last3day
156 1 10 1559232000 20190602 2
156 1 10 1559664000 20190607 2
187 1 30 1559318400 20190603 2
187 1 40 1559404800 20190604 2
187 1 10 1559923200 20190610 2
1. When using datediff, the arguments must be in the standard date format, hence the unix_timestamp/from_unixtime conversions. 2. Either lag or lead can implement this: partition by user, order by purchase date, shift the purchase date by some number of rows, and take the difference. As above, shifting the date back two positions and getting a difference of exactly 2 means all three days must be consecutive.
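A minimal sketch of the lead variant (shifting two rows forward instead of backward; next3day is a name introduced here, not from the original). Each surviving row marks the start of a three-day streak rather than the end:
select
passenger_phone,
dt as start_dt,
lead(dt,2) over(partition by passenger_phone order by dt) as end_dt,
datediff(from_unixtime(unix_timestamp(lead(dt,2) over(partition by passenger_phone order by dt),'yyyyMMdd'),'yyyy-MM-dd'),from_unixtime(unix_timestamp(dt,'yyyyMMdd'),'yyyy-MM-dd')) as next3day
from
tmp_time_exp
where
is_call != 0
having
next3day = 2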
Example: for user 156, the consecutive purchase periods are 5.31-6.2, 6.5-6.7, and 6.9-6.10, with totals of 50, 30, and 40 respectively.
select
passenger_phone,
min(dt) as start_day,
max(dt) as end_day,
count(1) as last_days,
sum(cost) as cost_sum
from
(
select
*,
row_number() over(partition by passenger_phone order by dt) as ranker
from
tmp_time_exp
where
is_call != 0
)a
group by
passenger_phone,date_sub(from_unixtime(unix_timestamp(dt,'yyyyMMdd'),'yyyy-MM-dd'),ranker) -- date minus rank stays constant within a consecutive run
Output:
passenger_phone start_day end_day last_days cost_sum
156 20190531 20190602 3 50
156 20190605 20190607 3 30
156 20190609 20190610 2 40
173 20190610 20190610 1 10
187 20190601 20190604 4 100
187 20190606 20190606 1 10
187 20190608 20190610 3 50
The approach above is borrowed from a blog post whose link I can no longer find. It handles this very neatly: subtract each row's rank (in date order) from the date itself and group on the difference. If the difference is the same across rows, the dates are consecutive, and the number of rows sharing that difference is the streak length.
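To see the trick in action, you can inspect the group key directly; for user 156, the first three purchase rows (20190531/20190601/20190602 with ranker 1/2/3) all map to the same key 2019-05-30 (an inspection sketch; grp_key is a name introduced here):
select
dt,
ranker,
date_sub(from_unixtime(unix_timestamp(dt,'yyyyMMdd'),'yyyy-MM-dd'),ranker) as grp_key
from
(
select
*,
row_number() over(partition by passenger_phone order by dt) as ranker
from
tmp_time_exp
where
is_call != 0
)a
where
passenger_phone = '156'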
Example: user 156 purchased on 6.10 and, going backwards, also on 6.9, but not on 6.8, so their streak so far is 2 days. This pattern is common in sign-in features: miss a day and the accumulated sign-in count starts over.
select
*
from
(
select
passenger_phone,
min(dt) as start_time,
max(dt) as end_time,
count(1) as day_cnt
from
(
select
*,
row_number() over(partition by passenger_phone order by dt) as ranker
from
tmp_time_exp
where
is_call = 1
)aa
group by
passenger_phone,date_sub(from_unixtime(unix_timestamp(dt,'yyyyMMdd'),'yyyy-MM-dd'),ranker)
)bb
where
end_time = '20190610'
That query reuses problem 2's solution, just pinning the end date to today (6.10). A second approach works off each user's most recent non-purchase date:
with end_dt as
(
select
passenger_phone,
max(dt) as end_dt
from
tmp_time_exp
where
dt between '20190531' and '20190610'
and is_call = 0 -- first find each user's latest non-purchase date
group by
passenger_phone
)
select
aa.dt,
aa.passenger_phone,
datediff(from_unixtime(unix_timestamp(aa.dt,'yyyyMMdd'),'yyyy-MM-dd'),from_unixtime(unix_timestamp(bb.end_dt,'yyyyMMdd'),'yyyy-MM-dd')) as day_cnt
from
(
select
dt,
passenger_phone
from
tmp_time_exp
where
dt = '20190610' -- users active on the reference day (6.10)
)aa
join
end_dt as bb
on
aa.passenger_phone = bb.passenger_phone
First get each user's latest non-purchase date: counting back from 6.10, you stop as soon as you hit a non-purchase day, so the difference between 6.10 and that date is exactly the length of the unbroken purchase streak ending on 6.10.
Both approaches produce:
passenger_phone start_time end_time day_cnt
156 20190609 20190610 2
173 20190610 20190610 1
187 20190608 20190610 3
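One caveat on the join approach: the inner join silently drops any user whose window contains no non-purchase day at all. If such users can exist, a hedged sketch of a left join fallback, reusing the end_dt CTE from above ('20190530', the day before the window starts, is a placeholder introduced here):
select
aa.dt,
aa.passenger_phone,
datediff(from_unixtime(unix_timestamp(aa.dt,'yyyyMMdd'),'yyyy-MM-dd'),from_unixtime(unix_timestamp(coalesce(bb.end_dt,'20190530'),'yyyyMMdd'),'yyyy-MM-dd')) as day_cnt
from
(
select dt, passenger_phone from tmp_time_exp where dt = '20190610'
)aa
left join
end_dt as bb
on
aa.passenger_phone = bb.passenger_phone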
Example: for user 156, the consecutive purchase periods are 5.31-6.2, 6.5-6.7, and 6.9-6.10, lasting 3, 3, and 2 days with totals of 50, 30, and 40; we want the longest streak. This is really just a derivative of problem 2.
select
passenger_phone,
start_day,
end_day,
last_days,
rank() over(partition by passenger_phone order by last_days desc) as appose_rank, -- ties share rank 1
row_number() over(partition by passenger_phone order by last_days desc) as last_ranker -- ties broken arbitrarily
from
(
select
passenger_phone,
min(dt) as start_day,
max(dt) as end_day,
count(1) as last_days
from
(
select
*,
row_number() over(partition by passenger_phone order by dt) as ranker
from
tmp_time_exp
where
is_call != 0
)a
group by
passenger_phone,date_sub(from_unixtime(unix_timestamp(dt,'yyyyMMdd'),'yyyy-MM-dd'),ranker)
)aa
having
-- last_ranker = 1
appose_rank = 1
Reusing the solution from problem 2, one more layer of computation on top of its result pulls out each user's longest consecutive purchase streak.
select
cc.*,
length(dd) as max_length,
row_number() over(partition by passenger_phone order by length(dd) desc) as ranker
from
(
select
passenger_phone,
concat_ws('',collect_list(is_call)) as call_list
from
(
select
dt,
passenger_phone,
is_call
from
tmp_time_exp
order by
passenger_phone desc, dt desc
)aa
group by
passenger_phone
)cc
lateral view explode(split(call_list,'0')) asTable as dd
having
ranker = 1
This is a rather clever trick that an interviewer once pointed me to, and it solves the problem equally well: concatenate each user's is_call flags into one string, split on '0', and the length of the longest surviving run of '1's is the longest streak. If you also need the dates it gets a bit more involved, since you would have to concat part of the date data in up front and unpack it again afterwards. Be aware, too, that collect_list does not strictly guarantee ordering after a shuffle, which is why the subquery sorts first.
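As a quick standalone illustration of the idea: in ascending date order, user 156's is_call flags concatenate to '11100111011', and splitting on '0' exposes the runs directly:
select split('11100111011','0')
-- ["111","","111","11"]: run lengths 3, 0, 3, 2; the empty string comes from the two adjacent zeros, and the max length 3 is the longest streak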
The results are consistent either way:
passenger_phone start_day end_day last_days appose_rank last_ranker
156 20190531 20190602 3 1 1
156 20190605 20190607 3 1 2
173 20190610 20190610 1 1 1
187 20190601 20190604 4 1 1
Example: find the date with the highest number of purchasing users.
select
dt,
passenger_phone,
is_call_cnt,
rank() over(order by is_call_cnt desc) as call_ord_ranker
from
(
select
*,
sum(is_call) over(partition by dt) as is_call_cnt
from
tmp_time_exp
)aa
having
call_ord_ranker = 1
Alternatively, with first_value:
select
*,
first_value(dt) over(order by is_call_cnt desc) as max_dt
from
(
select
*,
sum(is_call) over(partition by dt) as is_call_cnt
from
tmp_time_exp
)aa
having
max_dt = dt
Result (is_call_cnt comes out as a double because is_call is stored as a string):
dt passenger_phone is_call cost is_call_cnt max_dt
20190610 187 1 10 3.0 20190610
20190610 173 1 10 3.0 20190610
20190610 156 1 20 3.0 20190610
Example: for user 156, the date their cumulative spend first reached 50 was 6.2, and the date it first reached 100 was 6.9.
select
passenger_phone,
max(min_gt50_dt) as min_gt50_dt, -- the running sum is non-decreasing, so max picks the >=50 partition's earliest date
max(min_gt100_dt) as min_gt100_dt
from
(
select
*,
min(dt) over(partition by passenger_phone,if(cost_until_today >= 50,1,0)) as min_gt50_dt,
min(dt) over(partition by passenger_phone,if(cost_until_today >= 100,1,0)) as min_gt100_dt
from
(
select
dt,
passenger_phone,
cost,
sum(cost) over(partition by passenger_phone order by dt) as cost_until_today
from
tmp_time_exp
)aa
)bb
group by
passenger_phone
Result:
passenger_phone min_gt50_dt min_gt100_dt
156 20190602 20190609
173 20190609 20190609
187 20190603 20190604
The core here is the sum() over(partition by ... order by dt) clause, which yields the running total up to dt within each partition, i.e. a cumulative sum. One caveat: a user who never reaches the threshold (like 173, whose total spend is only 10) still gets a date back, because all of their rows fall into the 0 partition; filter such users out if that matters. For the window frame boundary details, see problem 7 below.
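A minimal sketch of that distinction: with order by in the window, the default frame runs from the partition start to the current row, giving a running total; without order by, you get the whole-partition sum (running_total and overall_total are names introduced here):
select
dt,
passenger_phone,
cost,
sum(cost) over(partition by passenger_phone order by dt) as running_total, -- cumulative spend up to and including dt
sum(cost) over(partition by passenger_phone) as overall_total -- total spend across the whole partition
from
tmp_time_exp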
Example: say the requirement is to find the maximum purchase amount within the three days before and after 6.5. Range-style maxima like this are almost always done with window functions, along the lines of max() over(partition by ... order by dt rows between 3 preceding and 3 following): for each dt, look three rows back and three rows forward, seven rows in total including the current one, and take the maximum over that window. Swapping the aggregate to sum gives the window total instead. (Strictly speaking, rows between counts rows rather than days; the two coincide here because each user has exactly one row per day.)
select
dt,
passenger_phone,
cost,
max(cost) over(partition by passenger_phone order by dt rows between unbounded preceding and current row) as until_cur_max,
max(cost) over(partition by passenger_phone order by dt) as until_cur_max2, -- same as above: with order by, the default frame ends at the current row
max(cost) over(partition by passenger_phone order by dt rows between 3 preceding and 3 following) as before3later3_max,
sum(cost) over(partition by passenger_phone order by dt rows between 3 preceding and 3 following) as before3later3_sum
from
tmp_time_exp
Result:
dt passenger_phone cost until_cur_max until_cur_max2 before3later3_max before3later3_sum
20190531 156 20 20 20 20 50
20190601 156 20 20 20 20 50
20190602 156 10 20 20 20 60
20190603 156 0 20 20 20 70
20190604 156 0 20 20 20 60
20190605 156 10 20 20 10 40
20190606 156 10 20 20 20 50
20190607 156 10 20 20 20 70
20190608 156 0 20 20 20 70
20190609 156 20 20 20 20 60
20190610 156 20 20 20 20 50
20190609 173 0 0 0 10 10
20190610 173 10 10 10 10 10
20190531 187 0 0 0 30 60
20190601 187 10 10 10 40 100
20190602 187 20 20 20 40 100
20190603 187 30 30 30 40 110
20190604 187 40 40 40 40 110
20190605 187 0 40 40 40 120
20190606 187 10 40 40 40 120
20190607 187 0 40 40 40 100
20190608 187 20 40 40 20 60
20190609 187 20 40 40 20 60
20190610 187 10 40 40 20 50