sql 通过lead lag获取数据连续最大条数-获取用户连续登录最长天数

目录

  • 一、业务需求
  • 二、初始化测试数据
  • 三、实现
    • (一)最终语句(hive)
    • (二)思路

一、业务需求

获取每个用户连续登录的最长天数

二、初始化测试数据

drop table if EXISTS login_log;
create table login_log ( times timestamp, user_id string);
truncate table login_log;
insert into login_log VALUES
(from_unixtime(unix_timestamp('2022-03-16 12:00:00','yyyy-MM-dd HH:mm:ss')), '100001'),
(from_unixtime(unix_timestamp('2022-03-16 12:01:00','yyyy-MM-dd HH:mm:ss')), '100001'),
(from_unixtime(unix_timestamp('2022-03-13 12:01:00','yyyy-MM-dd HH:mm:ss')), '100011',
(from_unixtime(unix_timestamp('2022-03-13 12:01:00','yyyy-MM-dd HH:mm:ss')), '100011'),
(from_unixtime(unix_timestamp('2022-03-13 02:01:00','yyyy-MM-dd HH:mm:ss')), '100009'),
(from_unixtime(unix_timestamp('2022-03-12 12:01:00','yyyy-MM-dd HH:mm:ss')), '100004'),
(from_unixtime(unix_timestamp('2022-03-12 02:01:00','yyyy-MM-dd HH:mm:ss')), '100009'),
(from_unixtime(unix_timestamp('2022-03-11 02:01:00','yyyy-MM-dd HH:mm:ss')), '100009'),
(from_unixtime(unix_timestamp('2022-03-11 11:01:00','yyyy-MM-dd HH:mm:ss')), '100012'),
(from_unixtime(unix_timestamp('2022-03-10 11:18:00','yyyy-MM-dd HH:mm:ss')), '100012'),
(from_unixtime(unix_timestamp('2022-03-09 11:01:00','yyyy-MM-dd HH:mm:ss')), '100009'),
(from_unixtime(unix_timestamp('2022-03-08 11:18:00','yyyy-MM-dd HH:mm:ss')), '100009');

三、实现

(一)最终语句(hive)

with tmp_user_serial as (
select b.user_id,
       case when b.last_exists = 0 and b.next_exists = 0 then 1 
                 when b.next_exists = 1 then datediff(lead(b.date_date,1) over(partition by b.user_id order by b.date_date),  b.date_date)+1
                 else 0 end serial_day
   from (select a.user_id, a.date_date, 
                case when lag(a.date_date,1) over(partition by a.user_id order by a.date_date) = last_date then 1 else 0 end  last_exists,
                case when lead(a.date_date,1) over(partition by a.user_id order by a.date_date) = next_date then 1 else 0 end  next_exists
           from (select user_id,to_date(times) date_date,date_sub(times,1) last_date, date_add(times,1) next_date
                   from login_log 
                  group by user_id,to_date(times), date_sub(times,1),date_add(times,1)) a 
        ) b 
  where last_exists = 0 or next_exists = 0)
  select user_id, max(serial_day) max_seriral_day 
            from tmp_user_serial 
           group by user_id 

(二)思路

实现考虑如下场景:
1、某用户就只登录过一次;
2、某用户登录过多次,但都不连续;
3、某用户连续登录过多次;
4、某用户一天内多登录;

实现思路:
1、对于一天多次登录的情况,用user_id ,登录日期 两个数据分组,得到某人某天的唯一数据;
并获取该日期的前一天及后一天数值,便于后续对比取用;

对应sql中的最内层

select user_id,to_date(times) date_date,date_sub(times,1) last_date, date_add(times,1) next_date
  from login_log 
 group by user_id,to_date(times), date_sub(times,1),date_add(times,1)

2、通过对数据的lead lag处理,判断按照用户、日期分组排序后的数据前一条日期是否是当前登录日期的前一天 last_exists ,后一条是否是当前登录日期的后一天 next_exists

对应sql 中的第二层

         select a.user_id, a.date_date, 
                case when lag(a.date_date,1) over(partition by a.user_id order by a.date_date) = last_date then 1 else 0 end  last_exists,
                case when lead(a.date_date,1) over(partition by a.user_id order by a.date_date) = next_date then 1 else 0 end  next_exists
           from (select user_id,to_date(times) date_date,date_sub(times,1) last_date, date_add(times,1) next_date
                   from login_log 
                  group by user_id,to_date(times), date_sub(times,1),date_add(times,1)) a 

3、根据登录日期的前后一天数据是否存在判断是否连续
(1)若都存在,则说明该数据是连续日期中的中间数据,last_exists = 1 and next_exists =1
(2)若仅前一天数据存在,则说明该数据是连续日期中的最后一天,last_exists = 1 and next_exists =0
(3)若仅后一天数据存在,则说明该数据是连续日期中的第一天, last_exists = 0 and next_exists =1
(4)若前后一天都不存在,则说明该数据前后不连续,连续天数为1, last_exists = 0 and next_exists =0

该处计算思路:
1、对于连续的日期数据(两条及以上),排除掉连续日期的中间数据后,按照时间顺序排列,前一天和最后一天相连,lead一下可以直接获取到其日期差,从而得到连续登录天数;
2、对于不连续的日期数据,直接给值1 即可;

对应sql中的第三层

select b.user_id,
       case when b.last_exists = 0 and b.next_exists = 0 then 1 
                 when b.next_exists = 1 then datediff(lead(b.date_date,1) over(partition by b.user_id order by b.date_date),  b.date_date)+1
                 else 0 end serial_day
   from (select a.user_id, a.date_date, 
                case when lag(a.date_date,1) over(partition by a.user_id order by a.date_date) = last_date then 1 else 0 end  last_exists,
                case when lead(a.date_date,1) over(partition by a.user_id order by a.date_date) = next_date then 1 else 0 end  next_exists
           from (select user_id,to_date(times) date_date,date_sub(times,1) last_date, date_add(times,1) next_date
                   from login_log 
                  group by user_id,to_date(times), date_sub(times,1),date_add(times,1)) a 
        ) b 
  where last_exists = 0 or next_exists = 0

4、根据用户获取其最大连续登录天数,以用户分组,获取连续天数的最大值即可

对应sql

  select user_id, max(serial_day) max_seriral_day 
            from tmp_user_serial 
           group by user_id 

你可能感兴趣的:(问题解决,算法题目,sql,hive,数据库)