数据准备 | |
---|---|
id | login_date |
01 | 2021-02-28 |
01 | 2021-03-01 |
01 | 2021-03-02 |
01 | 2021-03-04 |
01 | 2021-03-05 |
01 | 2021-03-06 |
01 | 2021-03-08 |
02 | 2021-03-01 |
02 | 2021-03-02 |
02 | 2021-03-03 |
02 | 2021-03-06 |
03 | 2021-03-06 |
方案一
1.先把数据按照用户id分组,根据登录日期排序
SQL:
SELECT
id,
login_date,
row_number() over(partition by id order by login_date asc) as rn
FROM data;
结果: | ||
---|---|---|
id | login_date | rn |
01 | 2021-02-28 | 1 |
01 | 2021-03-01 | 2 |
01 | 2021-03-02 | 3 |
01 | 2021-03-04 | 4 |
01 | 2021-03-05 | 5 |
01 | 2021-03-06 | 6 |
01 | 2021-03-08 | 7 |
02 | 2021-03-01 | 1 |
02 | 2021-03-02 | 2 |
02 | 2021-03-03 | 3 |
02 | 2021-03-06 | 4 |
03 | 2021-03-06 | 1 |
2.用登录日期与rn求date_sub,得到的差值日期如果是相等大数据培训的,则说明这两天肯定是连续的
SQL:
SELECT
t1.id,
t1.login_date,
date_sub(t1.login_date, rn) as diff_date
FROM
(
SELECT
id,
login_date,
row_number() over(partition by id order by login_date asc) as rn
FROM data
) t1;
结果: | ||
---|---|---|
id | login_date | diff_date |
01 | 2021-02-28 | 2021-02-27 |
01 | 2021-03-01 | 2021-02-27 |
01 | 2021-03-02 | 2021-02-27 |
01 | 2021-03-04 | 2021-02-28 |
01 | 2021-03-05 | 2021-02-28 |
01 | 2021-03-06 | 2021-02-28 |
01 | 2021-03-08 | 2021-03-01 |
02 | 2021-03-01 | 2021-02-28 |
02 | 2021-03-02 | 2021-02-28 |
02 | 2021-03-03 | 2021-02-28 |
02 | 2021-03-06 | 2021-03-02 |
03 | 2021-03-06 | 2021-03-05 |
3.根据id和日期差date_diff分组,登录次数即为分组后的count(1)
SQL:
SELECT
t2.id,
count(1) as login_times,
min(t2.login_date) as start_date,
max(t2.login_date) as end_date
FROM
(
SELECT
t1.id,
t1.login_date,
date_sub(t1.login_date,rn) as diff_date
FROM
(
SELECT
id,
login_date,
row_number() over(partition by id order by login_date asc) as rn
FROM data
) t1
) t2
group by t2.id, t2.diff_date
having login_times >= 3;
结果: | |||
---|---|---|---|
id | login_times | start_date | end_date |
01 | 3 | 2021-02-28 | 2021-03-02 |
01 | 3 | 2021-03-04 | 2021-03-06 |
02 | 3 | 2021-03-01 | 2021-03-03 |
方案二
方案二利用lag和lead函数进行处理,思路类似。
SQL:
SELECT
id,
lag_login_date,
login_date,lead_login_date
FROM
(SELECT
id,
login_date,
lag(login_date,1,login_date) over(partition by id order by login_date) as lag_login_date,
lead(login_date,1,login_date) over(partition by id order by login_date) as lead_login_date
FROM data
) t1
where datediff(login_date,lag_login_date) =1 and datediff(lead_login_date,login_date) =1;结果: | |||
---|---|---|---|
id | lag_login_date | login_date | lead_login_date |
01 | 2018-02-28 | 2018-03-01 | 2018-03-02 |
01 | 2018-03-04 | 2018-03-05 | 2018-03-06 |
02 | 2018-03-01 | 2018-03-02 | 2018-03-03 |
行转列和列转行
以"连续登录"中的数据为例:
select id,
concat_ws(',',collect_list(login_date)) cw
from data
group by id;结果: | |
---|---|
id | cw |
01 | 2018-02-28,2018-03-01,2018-03-02,2018-03-04,2018-03-05,2018-03-06,2018-03-08 |
02 | 2018-03-01,2018-03-02,2018-03-03,2018-03-06 |
03 | 2018-03-06 |
以上面SQL生成的数据为基准,执行下列SQL:
select id, login_date
from t
lateral view explode(split(cw,',')) b AS login_date;结果: | |
---|---|
id | login_date |
01 | 2018-02-28 |
01 | 2018-03-01 |
01 | 2018-03-02 |
01 | 2018-03-04 |
01 | 2018-03-05 |
01 | 2018-03-06 |
01 | 2018-03-08 |
02 | 2018-03-01 |
02 | 2018-03-02 |
02 | 2018-03-03 |
02 | 2018-03-06 |
03 | 2018-03-06 |