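I recently came across a SQL question that comes up often in big data interviews: find the users who logged in on three or more consecutive days (variants of the same question: users who paid for membership in 3 consecutive months, users who bought something on N consecutive days, and so on). This post records one way to solve it.
Required output format: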
+---------+--------+-------------+-------------+
| uid     | times  | start_date  | end_date    |
+---------+--------+-------------+-------------+
First, create the table:
create table test.user_login_info(
     user_id    string COMMENT 'user ID'
    ,login_date date   COMMENT 'login date'
)
row format delimited
fields terminated by ',';
Load the raw data by copying the file into the table's directory in the warehouse:
hdfs dfs -put /root/user_login_info.txt /user/hive/warehouse/test.db/user_login_info/
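The hdfs dfs -put command only works if the table's directory is at the default warehouse path. As an alternative, here is a minimal sketch that lets Hive do the load. It assumes the file is on the local filesystem of the machine that executes the statement (with Beeline that is the HiveServer2 host), and that each line has the form user_id,login_date (for example user01,2018-02-28), which is inferred from the comma delimiter in the table definition rather than shown in the original file.

```sql
-- Sketch: load the local file through Hive instead of copying it into the warehouse directory.
-- The path is the one used in the hdfs command above.
load data local inpath '/root/user_login_info.txt' into table test.user_login_info;
```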
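Check the data (run use test; first, because the queries in this post refer to the table without the database prefix):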
select * from user_login_info;
+--------------------------+-----------------------------+
| user_login_info.user_id  | user_login_info.login_date  |
+--------------------------+-----------------------------+
| user01                   | 2018-02-28                  |
| user01                   | 2018-03-01                  |
| user01                   | 2018-03-02                  |
| user01                   | 2018-03-04                  |
| user01                   | 2018-03-05                  |
| user01                   | 2018-03-06                  |
| user01                   | 2018-03-07                  |
| user02                   | 2018-03-01                  |
| user02                   | 2018-03-02                  |
| user02                   | 2018-03-03                  |
| user02                   | 2018-03-06                  |
+--------------------------+-----------------------------+
Approach:
1. Partition the rows by user_id, sort each partition by login_date in ascending order, and number the rows:
select
     user_id
    ,login_date
    ,row_number() over(partition by user_id order by login_date asc) as rank
from user_login_info
;
Result:
+----------+-------------+-------+
| user_id  | login_date  | rank  |
+----------+-------------+-------+
| user01   | 2018-02-28  | 1     |
| user01   | 2018-03-01  | 2     |
| user01   | 2018-03-02  | 3     |
| user01   | 2018-03-04  | 4     |
| user01   | 2018-03-05  | 5     |
| user01   | 2018-03-06  | 6     |
| user01   | 2018-03-07  | 7     |
| user02   | 2018-03-01  | 1     |
| user02   | 2018-03-02  | 2     |
| user02   | 2018-03-03  | 3     |
| user02   | 2018-03-06  | 4     |
+----------+-------------+-------+
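The numbering above relies on each user having at most one row per day, which holds for the sample data. If the real table can contain several logins by the same user on the same day, the row numbers would advance without the date advancing and the later steps would split one streak into several. A minimal sketch of one way to guard against that, assuming it is acceptable to collapse same-day logins into a single row (the alias rn is only an illustrative name):

```sql
-- Sketch: remove repeated (user_id, login_date) pairs before numbering the rows
select
     d.user_id
    ,d.login_date
    ,row_number() over(partition by d.user_id order by d.login_date asc) as rn
from
(
    -- one row per user per day
    select distinct user_id, login_date
    from user_login_info
)d
;
```

On the sample data this returns the same numbering as the query above, since there are no duplicate days.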
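2. Subtract rank (as a number of days) from login_date. Within a run of consecutive days both login_date and rank grow by 1 per row, so the difference stays the same; whenever a day is skipped, the difference jumps to a later date. Rows of one user that share the same difference therefore belong to the same run of consecutive logins:
select
     t1.user_id    as user_id
    ,t1.login_date as login_date
    ,date_sub(t1.login_date , t1.rank) as date_diff
from
(
    select
         user_id
        ,login_date
        ,row_number() over(partition by user_id order by login_date asc) as rank
    from user_login_info
)t1
;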
Result:
+----------+-------------+-------------+
| user_id  | login_date  | date_diff   |
+----------+-------------+-------------+
| user01   | 2018-02-28  | 2018-02-27  |
| user01   | 2018-03-01  | 2018-02-27  |
| user01   | 2018-03-02  | 2018-02-27  |
| user01   | 2018-03-04  | 2018-02-28  |
| user01   | 2018-03-05  | 2018-02-28  |
| user01   | 2018-03-06  | 2018-02-28  |
| user01   | 2018-03-07  | 2018-02-28  |
| user02   | 2018-03-01  | 2018-02-28  |
| user02   | 2018-03-02  | 2018-02-28  |
| user02   | 2018-03-03  | 2018-02-28  |
| user02   | 2018-03-06  | 2018-03-02  |
+----------+-------------+-------------+
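The grouping key does not have to be a date difference. As a cross-check, the same runs can be labeled by comparing each login with the previous one and keeping a running count of the breaks. This is a different technique from the one used in this post, sketched here under the assumption that each user has at most one row per day; the names prev_date, is_new_run and run_id are illustrative only.

```sql
-- Sketch: label each run of consecutive days with a running counter
select
     t2.user_id
    ,t2.login_date
    -- running total of run starts = identifier of the run this row belongs to
    ,sum(t2.is_new_run) over(partition by t2.user_id order by t2.login_date asc) as run_id
from
(
    select
         t1.user_id
        ,t1.login_date
        -- 1 when there is no previous login or it is not exactly one day earlier
        ,case when datediff(t1.login_date, t1.prev_date) = 1 then 0 else 1 end as is_new_run
    from
    (
        select
             user_id
            ,login_date
            ,lag(login_date, 1) over(partition by user_id order by login_date asc) as prev_date
        from user_login_info
    )t1
)t2
;
```

On the sample data, user01's first three rows should get run_id 1 and the remaining four run_id 2, which matches the two date_diff values in the table above. Grouping by user_id and run_id would then play the same role as grouping by user_id and date_diff.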
3. Group by user_id and date_diff. Within each group the earliest login_date is start_date and the latest is end_date; the groups with count(*) >= 3 are the final result:
select
     t2.user_id         as user_id
    ,count(*)           as times
    ,min(t2.login_date) as start_date
    ,max(t2.login_date) as end_date
from
(
    select
         t1.user_id    as user_id
        ,t1.login_date as login_date
        ,date_sub(t1.login_date , t1.rank) as date_diff
    from
    (
        select
             user_id
            ,login_date
            ,row_number() over(partition by user_id order by login_date asc) as rank
        from user_login_info
    )t1
)t2
group by t2.user_id,t2.date_diff
having count(*) >= 3
;
Result:
+----------+--------+-------------+-------------+
| user_id  | times  | start_date  | end_date    |
+----------+--------+-------------+-------------+
| user01   | 3      | 2018-02-28  | 2018-03-02  |
| user01   | 4      | 2018-03-04  | 2018-03-07  |
| user02   | 3      | 2018-03-01  | 2018-03-03  |
+----------+--------+-------------+-------------+
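The same idea carries over to the "consecutive months" variant mentioned at the start. The sketch below is only an illustration: the post has no recharge table, so it reuses user_login_info as a stand-in and looks for users with logins in 3 or more consecutive calendar months. It assumes a Hive version in which trunc(date, 'MM') and add_months are available; the names login_month, rn and month_diff are hypothetical.

```sql
-- Sketch: users with logins in 3 or more consecutive calendar months
select
     t2.user_id          as user_id
    ,count(*)            as months
    ,min(t2.login_month) as start_month
    ,max(t2.login_month) as end_month
from
(
    select
         t1.user_id
        ,t1.login_month
        -- stays constant within a run of consecutive months
        ,add_months(t1.login_month, 0 - t1.rn) as month_diff
    from
    (
        select
             m.user_id
            ,m.login_month
            ,row_number() over(partition by m.user_id order by m.login_month asc) as rn
        from
        (
            -- one row per user per month, keyed by the first day of that month
            select distinct user_id, trunc(login_date, 'MM') as login_month
            from user_login_info
        )m
    )t1
)t2
group by t2.user_id, t2.month_diff
having count(*) >= 3
;
```

The sample data only covers February and March 2018, so this query should return no rows on it. For "N consecutive days" the original query needs only one change: replace the 3 in the having clause with N.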
That is one way to solve it. If you know a better approach, please leave a comment.