Time-Interval Operations in Spark SQL (Window Functions)

This article collects SQL techniques for time-interval problems: consecutive-consumption streaks, longest check-in streaks, cumulative spending, and so on. The same methods map directly onto other business scenarios, such as consecutive login days or maximum consecutive check-in days in games. Everything here is written in Spark SQL; for Hive SQL some queries need minor adjustments, e.g. replacing a bare HAVING with an extra outer query and a WHERE, which I won't repeat each time.

Constructing test data

To make the fields easy to split, I use @ as the delimiter: the first field is the date, the second the user, the third whether they consumed that day, and the fourth the amount spent.

20190531@156@1@20
20190601@156@1@20
20190602@156@1@10
20190603@156@0@0
20190604@156@0@0
20190605@156@1@10
20190606@156@1@10
20190607@156@1@10
20190608@156@0@0
20190609@156@1@20
20190610@156@1@20
20190531@187@0@0
20190601@187@1@10
20190602@187@1@20
20190603@187@1@30
20190604@187@1@40
20190605@187@0@0
20190606@187@1@10
20190607@187@0@0
20190608@187@1@20
20190609@187@1@20
20190610@187@1@10
20190609@173@0@0
20190610@173@1@10

Mapped onto a table with the following structure:

create table tmp_time_exp 
(
    dt string,  
    passenger_phone string,
    is_call string comment 'whether the user consumed',
    cost bigint comment 'amount spent'
)
row format DELIMITED fields terminated by '@'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location '/hdfslocation'

Query it to check the data loaded as expected:

tmp_time_exp.dt	tmp_time_exp.passenger_phone	tmp_time_exp.is_call	tmp_time_exp.cost
20190531	156	1	20
20190601	156	1	20
20190602	156	1	10
20190603	156	0	0
20190604	156	0	0
20190605	156	1	10
20190606	156	1	10
20190607	156	1	10
20190608	156	0	0
20190609	156	1	20
20190610	156	1	20
20190531	187	0	0
20190601	187	1	10
20190602	187	1	20
20190603	187	1	30
20190604	187	1	40
20190605	187	0	0
20190606	187	1	10
20190607	187	0	0
20190608	187	1	20
20190609	187	1	20
20190610	187	1	10
20190609	173	0	0
20190610	173	1	10

Common problems

1. Users who consumed on n consecutive days

Example: find users who consumed on three consecutive days, together with the start and end dates of each streak.

select
    passenger_phone,
    is_call,
    cost,
    unix_timestamp(lag(dt,2,0) over(partition by passenger_phone order by dt),'yyyyMMdd') as start_dt,
    dt as end_dt,
    datediff(from_unixtime(unix_timestamp(dt,'yyyyMMdd'),'yyyy-MM-dd'),from_unixtime(unix_timestamp(lag(dt,2,0) over(partition by passenger_phone order by dt),'yyyyMMdd'),'yyyy-MM-dd')) as last3day  
from
    tmp_time_exp
where
    is_call != 0 
having  
    last3day = 2 

Output

passenger_phone	is_call	cost	start_dt	end_dt	last3day
156	1	10	1559232000	20190602	2
156	1	10	1559664000	20190607	2
187	1	30	1559318400	20190603	2
187	1	40	1559404800	20190604	2
187	1	10	1559923200	20190610	2

1. When using datediff, the arguments must be in the standard date format, hence the conversions. 2. Either lag or lead works here: partition by user, order by consumption date, shift the date column by two rows, and take the difference. As above, if the date two rows back is exactly 2 days earlier, those three days must form an unbroken consecutive run.
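To make the lag-and-diff logic concrete, here is a minimal plain-Python sketch of the same idea (not the Spark SQL itself); the data is a hand-copied subset of user 156's consuming days from the table above:

```python
from datetime import datetime

# Consuming days (is_call = 1) of user 156, from the sample table above.
rows = [
    ("20190531", "156"), ("20190601", "156"), ("20190602", "156"),
    ("20190605", "156"), ("20190606", "156"), ("20190607", "156"),
    ("20190609", "156"), ("20190610", "156"),
]

def three_day_streaks(rows):
    """Emulate lag(dt, 2) over (partition by user order by dt):
    a row ends a 3-day streak when the date two rows back is exactly
    2 days earlier, i.e. datediff(dt, lag(dt, 2)) = 2."""
    by_user = {}
    for dt, user in rows:
        by_user.setdefault(user, []).append(datetime.strptime(dt, "%Y%m%d"))
    streaks = []
    for user, dates in sorted(by_user.items()):
        dates.sort()
        for i in range(2, len(dates)):
            if (dates[i] - dates[i - 2]).days == 2:
                streaks.append((user,
                                dates[i - 2].strftime("%Y%m%d"),  # start_dt
                                dates[i].strftime("%Y%m%d")))     # end_dt
    return streaks
```

For user 156 this yields the streaks 20190531-20190602 and 20190605-20190607, matching the SQL output above.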

2. Each user's consecutive consumption periods, their length, and the total spent in each period

Example: user 156's consecutive periods are 5.31-6.2, 6.5-6.7 and 6.9-6.10, with totals of 50, 30 and 40.

select
    passenger_phone,
    min(dt) as start_day,
    max(dt) as end_day,
    count(1) as last_days,
    sum(cost) as cost_sum
from
(
    select
        *,
        row_number() over(partition by passenger_phone order by dt) as ranker
    from
        tmp_time_exp
    where
        is_call != 0
)a
group by
    passenger_phone,date_sub(from_unixtime(unix_timestamp(dt,'yyyyMMdd'),'yyyy-MM-dd'),ranker)

Output

passenger_phone	start_day	end_day	last_days	cost_sum
156	20190531	20190602	3	50
156	20190605	20190607	3	30
156	20190609	20190610	2	40
173	20190610	20190610	1	10
187	20190601	20190604	4	100
187	20190606	20190606	1	10
187	20190608	20190610	3	50

This approach follows a blog post whose link I can no longer find, and it is quite elegant: rank each user's consuming days by date and subtract the rank from the date itself. Within a consecutive run the difference is constant, so it serves as a group key, and the size of each group is the length of that run.
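The gap-and-islands trick translates directly to plain Python; here is a minimal sketch over user 156's consuming days (hand-copied from the table above):

```python
from datetime import datetime, timedelta
from itertools import groupby

# (dt, cost) for user 156's consuming days, from the sample table above.
days = [
    ("20190531", 20), ("20190601", 20), ("20190602", 10),
    ("20190605", 10), ("20190606", 10), ("20190607", 10),
    ("20190609", 20), ("20190610", 20),
]

def consecutive_runs(days):
    """date minus row_number is constant inside a consecutive run,
    so it works as a group key, just like the SQL above."""
    items = sorted((datetime.strptime(dt, "%Y%m%d"), cost) for dt, cost in days)
    keyed = [(d - timedelta(days=i), d, c) for i, (d, c) in enumerate(items)]
    runs = []
    for _, grp in groupby(keyed, key=lambda t: t[0]):
        grp = list(grp)
        runs.append((grp[0][1].strftime("%Y%m%d"),   # start_day
                     grp[-1][1].strftime("%Y%m%d"),  # end_day
                     len(grp),                        # last_days
                     sum(c for _, _, c in grp)))      # cost_sum
    return runs
```

This reproduces the three runs for user 156: (20190531-20190602, 3 days, 50), (20190605-20190607, 3 days, 30), (20190609-20190610, 2 days, 40).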

3. Consecutive consumption days up to and including 6.10; broken streaks don't count (check-in streaks)

Example: user 156 consumed on 6.10 and 6.9 but not on 6.8, so the current streak is 2 days. This is common in check-in features: miss a day and the accumulated streak resets.

Method 1
select
    *
from
(
    select
        passenger_phone,
        min(dt) as start_time,
        max(dt) as end_time,
        count(1) as day_cnt
    from
    (
        select
            *,
            row_number() over(partition by passenger_phone order by dt) as ranker
        from
            tmp_time_exp
        where
            is_call = 1
    )aa
    group by
        passenger_phone,date_sub(from_unixtime(unix_timestamp(dt,'yyyyMMdd'),'yyyy-MM-dd'),ranker)
)bb
where
    end_time = '20190610'

This is just problem 2 with the end date pinned to today (6.10).

Method 2
with end_dt as
(
    select
        passenger_phone,
        max(dt) as end_dt
    from
        tmp_time_exp
    where
        dt between '20190531' and '20190610'
        and is_call = 0  -- first find each user's latest non-consuming date
    group by
        passenger_phone
)
select
    aa.dt,
    aa.passenger_phone,
    datediff(from_unixtime(unix_timestamp(aa.dt,'yyyyMMdd'),'yyyy-MM-dd'),from_unixtime(unix_timestamp(bb.end_dt,'yyyyMMdd'),'yyyy-MM-dd')) as day_cnt
from
(
    select
        dt,
        passenger_phone
    from
        tmp_time_exp
    where
        dt = '20190610'  -- users present on 6.10
)aa
join
    end_dt as bb
on
    aa.passenger_phone = bb.passenger_phone

First find each user's latest non-consuming date: walking back from 6.10, the streak stops at the first non-consuming day, so the gap between 6.10 and that day is the length of the unbroken streak ending on 6.10. (Note that the inner join silently drops users who consumed on every day in the range, since they have no non-consuming date.)

Both methods report the same streak lengths (method 2 returns dt, passenger_phone and day_cnt rather than the start/end columns shown):

passenger_phone start_time      end_time        day_cnt
156	20190609	20190610	2
173	20190610	20190610	1
187	20190608	20190610	3
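Method 2's "walk back until the first gap" idea can be sketched in plain Python (this is an illustrative re-implementation, with flags hand-copied from the sample table):

```python
from datetime import datetime, timedelta

# is_call per (user, dt) near the end of the range, from the sample table.
calls = {
    ("156", "20190608"): 0, ("156", "20190609"): 1, ("156", "20190610"): 1,
    ("187", "20190607"): 0, ("187", "20190608"): 1, ("187", "20190609"): 1,
    ("187", "20190610"): 1,
}

def streak_up_to(user, today="20190610"):
    """Count consecutive consuming days ending at `today`;
    the first non-consuming (or missing) day stops the count."""
    d = datetime.strptime(today, "%Y%m%d")
    n = 0
    while calls.get((user, d.strftime("%Y%m%d")), 0) == 1:
        n += 1
        d -= timedelta(days=1)
    return n
```

streak_up_to('156') gives 2 and streak_up_to('187') gives 3, matching the day_cnt column above.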

4. Longest consecutive consumption streak

Example: user 156's streaks are 5.31-6.2, 6.5-6.7 and 6.9-6.10, with lengths 3, 3 and 2 and totals 50, 30 and 40. This is really a follow-up to problem 2.

Method 1
select
    passenger_phone,
    start_day,
    end_day,
    last_days,
    rank() over(partition by passenger_phone order by last_days desc) as appose_rank, -- keeps ties for first place
    row_number() over(partition by passenger_phone order by last_days desc) as last_ranker  -- breaks ties
from
(
    select
        passenger_phone,
        min(dt) as start_day,
        max(dt) as end_day,
        count(1) as last_days
    from
    (
        select
            *,
            row_number() over(partition by passenger_phone order by dt) as ranker
        from
            tmp_time_exp
        where
            is_call != 0
    )a
    group by
        passenger_phone,date_sub(from_unixtime(unix_timestamp(dt,'yyyyMMdd'),'yyyy-MM-dd'),ranker)
)aa
having
    -- last_ranker = 1
    appose_rank = 1


Building on the solution to problem 2, one more outer layer picks out the longest streak per user.

Method 2
select
    cc.*,
    length(dd) as max_length,
    row_number() over(partition by passenger_phone order by length(dd) desc) as ranker
from
(
    select
        passenger_phone,
        concat_ws('',collect_list(is_call)) as call_list
    from
    (
        select
            dt,
            passenger_phone,
            is_call
        from
            tmp_time_exp
        order by
            passenger_phone desc, dt desc
    )aa
    group by
        passenger_phone
)cc
lateral view explode(split(call_list,'0')) asTable as dd
having
    ranker = 1

A rather clever alternative that an interviewer once pointed out to me: concatenate each user's 0/1 flags into a string, split it on '0', and the longest remaining segment of '1's is the longest streak. It solves this problem just as well, but if you also need the dates it gets a bit more involved, since you would have to concat part of the date data in first and unpack it again afterwards.

The results of both methods are consistent:

passenger_phone start_day       end_day last_days       appose_rank     last_ranker
156	20190531	20190602	3	1	1
156	20190605	20190607	3	1	2
173	20190610	20190610	1	1	1
187	20190601	20190604	4	1	1
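The string-splitting trick from method 2 is easy to see in plain Python; the flag strings below are each user's is_call column from the sample table, concatenated in date order:

```python
# is_call flags per user in dt order, read off the sample table above.
flags = {
    "156": "11100111011",
    "187": "01111010111",
    "173": "01",
}

def longest_streak(flag_str):
    """Split the 0/1 string on '0': the longest surviving
    run of '1's is the longest consecutive streak."""
    return max(len(seg) for seg in flag_str.split("0"))
```

This gives 3 for user 156, 4 for 187 and 1 for 173, matching the last_days of the top-ranked rows above.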

5. Peak consumption date

Example: the date with the highest number of consumers.

Method 1
select
    dt,
    passenger_phone,
    is_call_cnt,
    rank() over(order by is_call_cnt desc) as call_ord_ranker
from
(
    select
        *,
        sum(is_call) over(partition by dt) as is_call_cnt
    from
        tmp_time_exp
)aa
having
    call_ord_ranker = 1
Method 2
select
    *,
    first_value(dt) over(order by is_call_cnt desc) as max_dt
from
(
    select
        *,
        sum(is_call) over(partition by dt) as is_call_cnt
    from
        tmp_time_exp
)aa
having
    max_dt = dt

Output

dt	passenger_phone	is_call	cost	is_call_cnt	max_dt
20190610	187	1	10	3.0	20190610
20190610	173	1	10	3.0	20190610
20190610	156	1	20	3.0	20190610
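The count-then-rank logic, sketched in plain Python on a trimmed sample (only the last two dates of the table, where 6.10 has three consumers):

```python
from collections import Counter

# (dt, is_call) rows for the last two dates of the sample table.
rows = [
    ("20190609", 1), ("20190609", 1), ("20190609", 0),
    ("20190610", 1), ("20190610", 1), ("20190610", 1),
]

def peak_date(rows):
    """sum(is_call) per dt, then take the dt with the highest
    consumer count (the rank-1 row in the SQL above)."""
    counts = Counter()
    for dt, is_call in rows:
        counts[dt] += is_call
    return max(counts.items(), key=lambda kv: kv[1])
```

On this sample peak_date returns ('20190610', 3), matching the result above.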

6. The date on which cumulative spending reaches x yuan

Example: user 156 first reached 50 yuan on 6.2 and 100 yuan on 6.9.

select
    passenger_phone,
    max(min_gt50_dt) as min_gt50_dt,
    max(min_gt100_dt) as min_gt100_dt
from
(
    select
        *,
        min(dt) over(partition by passenger_phone,if(cost_until_today >= 50,1,0)) as min_gt50_dt,
        min(dt) over(partition by passenger_phone,if(cost_until_today >= 100,1,0)) as min_gt100_dt
    from
    (
        select
            dt,
            passenger_phone,
            cost,
            sum(cost) over(partition by passenger_phone order by dt) as cost_until_today
        from
            tmp_time_exp
    )aa
)bb
group by 
    passenger_phone

Output

passenger_phone	min_gt50_dt	min_gt100_dt
156	20190602	20190609
173	20190609	20190609
187	20190603	20190604

The core here is sum() over(partition by ... order by dt), which expresses the running total up to and including dt. One caveat with the query above: a user who never reaches a threshold (like 173, whose total never hits 50) falls entirely into the '0' bucket, so the query still reports a date for them; wrapping the date as if(cost_until_today >= 50, dt, null) inside the min() would fix that. For more on window frame boundaries, see problem 7 below.
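A minimal plain-Python sketch of the running-sum idea, including the null guard for users who never reach the threshold (which the SQL above lacks); the data is hand-copied from the sample table:

```python
# (dt, cost) per user in dt order, from the sample table above.
costs = {
    "156": [("20190531", 20), ("20190601", 20), ("20190602", 10),
            ("20190605", 10), ("20190606", 10), ("20190607", 10),
            ("20190609", 20), ("20190610", 20)],
    "173": [("20190609", 0), ("20190610", 10)],
}

def first_reach(user, threshold):
    """First dt at which the running total of cost reaches
    `threshold`; None if the user never gets there."""
    running = 0
    for dt, cost in costs[user]:
        running += cost
        if running >= threshold:
            return dt
    return None
```

first_reach('156', 50) is 20190602 and first_reach('156', 100) is 20190609, matching the output; first_reach('173', 50) is None, whereas the SQL above wrongly reports 20190609 for user 173.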

7. Maximum consumption within a time window

Example: say we want the largest spend within three days either side of 6.5. Range-style lookups like this almost always use a window frame, e.g. max() over(partition by ... order by dt rows between 3 preceding and 3 following): for each dt, the frame covers the three rows before and the three rows after (seven rows in total, including the current one), and the function takes the maximum over that frame. Swap max for sum and you get the total over the same frame. Note that rows between counts physical rows, not calendar days, so if the data has date gaps the frame can span more than seven days.

select
    dt,
    passenger_phone,
    cost,
    max(cost) over(partition by passenger_phone order by dt rows between unbounded preceding and current row) as until_cur_max,
    max(cost) over(partition by passenger_phone order by dt) as until_cur_max2,  -- same effect as above
    max(cost) over(partition by passenger_phone order by dt rows between 3 preceding and 3 following) as before3later3_max,
    sum(cost) over(partition by passenger_phone order by dt rows between 3 preceding and 3 following) as before3later3_sum
from
    tmp_time_exp

Output

dt	passenger_phone	cost	until_cur_max	until_cur_max2	before3later3_max	before3later3_sum
20190531	156	20	20	20	20	50
20190601	156	20	20	20	20	50
20190602	156	10	20	20	20	60
20190603	156	0	20	20	20	70
20190604	156	0	20	20	20	60
20190605	156	10	20	20	10	40
20190606	156	10	20	20	20	50
20190607	156	10	20	20	20	70
20190608	156	0	20	20	20	70
20190609	156	20	20	20	20	60
20190610	156	20	20	20	20	50
20190609	173	0	0	0	10	10
20190610	173	10	10	10	10	10
20190531	187	0	0	0	30	60
20190601	187	10	10	10	40	100
20190602	187	20	20	20	40	100
20190603	187	30	30	30	40	110
20190604	187	40	40	40	40	110
20190605	187	0	40	40	40	120
20190606	187	10	40	40	40	120
20190607	187	0	40	40	40	100
20190608	187	20	40	40	20	60
20190609	187	20	40	40	20	60
20190610	187	10	40	40	20	50
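The rows-between frame is just a sliding slice over the ordered rows; here is a plain-Python sketch on user 156's cost column (hand-copied from the table above):

```python
def windowed(vals, fn, before=3, after=3):
    """Emulate fn() over (order by dt rows between 3 preceding
    and 3 following) on an already-ordered list of values."""
    return [fn(vals[max(0, i - before): i + after + 1]) for i in range(len(vals))]

# User 156's cost column in dt order, from the sample table above.
costs_156 = [20, 20, 10, 0, 0, 10, 10, 10, 0, 20, 20]
```

windowed(costs_156, max) reproduces user 156's before3later3_max column, and windowed(costs_156, sum) the before3later3_sum column.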
