Table data:
1 Alise 2020-5-12 09:25:56
2 Alise 2020-5-13 10:25:30
3 Alise 2020-5-15 05:25:05
4 Alise 2020-5-15 15:23:11
5 Alise 2020-5-16 21:06:14
6 Bob 2020-5-12 11:17:48
7 Bob 2020-5-13 21:26:28
8 Bob 2020-5-13 21:27:28
9 Bob 2020-5-14 09:27:01
10 Bob 2020-5-14 21:27:01
11 Bob 2020-5-16 22:15:14
12 Jack 2020-5-16 23:14:45
13 Jack 2020-5-16 23:18:17
14 Jack 2020-5-17 12:07:22
15 Tom 2020-5-12 09:25:56
16 Tom 2020-5-14 09:21:00
17 Tom 2020-5-15 09:00:56
18 Tom 2020-5-16 21:06:14
Create the table and load the data:
CREATE TABLE `player`(
`id` int, -- record ID
`name` string, -- user name
`login_time` timestamp -- login timestamp
)
row format delimited
fields terminated by '\t';
load data local inpath '/tmp/hive/data/tbl/player.txt' overwrite into table player;
1) Convert the timestamps to dates and deduplicate, producing table t1:
select distinct
name,
to_date(login_time) as login_date
from
player;
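2) For each record in t1, compute diff: the login date minus the record's row number within its user partition, with rows numbered by date in ascending order. This produces table t2: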
select
name,
login_date,
date_sub(login_date,row_number() over(partition by name order by login_date)) as diff
from (
select distinct
name,
to_date(login_time) as login_date
from
player
) as t1;
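To make the effect of step 2 concrete, the sketch below runs the same expression for a single user and shows the row number next to diff. The `rn` column, the `where name = 'Bob'` filter and the final `order by` are additions for illustration only. The expected rows in the comments assume the sample data above loaded as shown.

```sql
-- Illustration only: show the row number beside diff for one user.
select
  name,
  login_date,
  row_number() over(partition by name order by login_date) as rn,
  date_sub(login_date, row_number() over(partition by name order by login_date)) as diff
from (
  select distinct
    name,
    to_date(login_time) as login_date
  from
    player
  where
    name = 'Bob'
) as t1
order by
  login_date;
-- Expected rows for the sample data (name, login_date, rn, diff):
-- Bob  2020-05-12  1  2020-05-11
-- Bob  2020-05-13  2  2020-05-11
-- Bob  2020-05-14  3  2020-05-11
-- Bob  2020-05-16  4  2020-05-12
```

The three consecutive days share diff 2020-05-11. The gap before 2020-05-16 moves diff to 2020-05-12, which starts a new group.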
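3) Group the records in t2 by name and diff. Records with the same name and diff necessarily belong to the same run of consecutive login days (the principle: if two sequences have the same common difference, subtracting them position by position always yields the same value). Take the minimum and maximum dates in each group as the start and end dates of the run, and keep only the groups with at least n records (n=3). This produces table t3: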
select
name,
min(login_date) as start_date,
max(login_date) as end_date
from (
select
name,
login_date,
date_sub(login_date,row_number() over(partition by name order by login_date)) as diff
from (
select distinct
name,
to_date(login_time) as login_date
from
player
) as t1
) as t2
group by
t2.name,t2.diff
having
count(1) >= 3;
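To see which groups the having clause discards, it can help to run the same grouping without the filter. The sketch below does that. The `diff` and `cnt` output columns and the `order by` are additions for illustration only, and the expected rows assume the sample data above.

```sql
-- Illustration only: list every (name, diff) group before filtering.
select
  name,
  diff,
  min(login_date) as start_date,
  max(login_date) as end_date,
  count(1) as cnt
from (
  select
    name,
    login_date,
    date_sub(login_date, row_number() over(partition by name order by login_date)) as diff
  from (
    select distinct
      name,
      to_date(login_time) as login_date
    from
      player
  ) as t1
) as t2
group by
  name, diff
order by
  name, start_date;
-- Expected rows for the sample data (name, diff, start_date, end_date, cnt):
-- Alise  2020-05-11  2020-05-12  2020-05-13  2
-- Alise  2020-05-12  2020-05-15  2020-05-16  2
-- Bob    2020-05-11  2020-05-12  2020-05-14  3
-- Bob    2020-05-12  2020-05-16  2020-05-16  1
-- Jack   2020-05-15  2020-05-16  2020-05-17  2
-- Tom    2020-05-11  2020-05-12  2020-05-12  1
-- Tom    2020-05-12  2020-05-14  2020-05-16  3
```

Only the Bob and Tom groups with cnt = 3 pass `having count(1) >= 3`. The dates in a group are distinct and consecutive, so cnt always equals datediff(end_date,start_date)+1.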
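4) Join table t3 with the original table t0 on t0.name=t3.name, filter with t0.login_time>=t3.start_date AND t0.login_time<=t3.end_date+1, and compute the number of consecutive login days:
PS: Older versions of Hive (before 2.2.0) do not support non-equi join conditions, so a WHERE filter is used instead. Because the WHERE clause filters on t3 columns, the left join here behaves like an inner join.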
select
t0.id,
t0.name,
t0.login_time,
t3.start_date,
t3.end_date,
datediff(end_date,start_date)+1 as days
from
player as t0
left join (
select
name,
min(login_date) as start_date,
max(login_date) as end_date
from (
select
name,
login_date,
date_sub(login_date,row_number() over(partition by name order by login_date)) as diff
from (
select distinct
name,
to_date(login_time) as login_date
from
player
) as t1
) as t2
group by
t2.name,t2.diff
having
count(1) >= 3
) as t3
on
t0.name = t3.name
where
t0.login_time>=t3.start_date
and
t0.login_time<=date_add(t3.end_date,1);
Query result:
+--------+----------+------------------------+----------------+--------------+-------+
| t0.id | t0.name | t0.login_time | t3.start_date | t3.end_date | days |
+--------+----------+------------------------+----------------+--------------+-------+
| 6 | Bob | 2020-05-12 11:17:48.0 | 2020-05-12 | 2020-05-14 | 3 |
| 7 | Bob | 2020-05-13 21:26:28.0 | 2020-05-12 | 2020-05-14 | 3 |
| 8 | Bob | 2020-05-13 21:27:28.0 | 2020-05-12 | 2020-05-14 | 3 |
| 9 | Bob | 2020-05-14 09:27:01.0 | 2020-05-12 | 2020-05-14 | 3 |
| 10 | Bob | 2020-05-14 21:27:01.0 | 2020-05-12 | 2020-05-14 | 3 |
| 16 | Tom | 2020-05-14 09:21:00.0 | 2020-05-14 | 2020-05-16 | 3 |
| 17 | Tom | 2020-05-15 09:00:56.0 | 2020-05-14 | 2020-05-16 | 3 |
| 18 | Tom | 2020-05-16 21:06:14.0 | 2020-05-14 | 2020-05-16 | 3 |
+--------+----------+------------------------+----------------+--------------+-------+
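The nested subqueries can also be written as common table expressions, which keeps each step named after the table it produces. The sketch below is one possible rewrite, not the original author's query. It assumes a Hive version with `with` support (0.13 or later), uses a plain inner join, and compares at date granularity with `to_date(...) between ...` in place of the timestamp comparison. On the sample data it should select the same eight rows as above.

```sql
-- Sketch: the same pipeline expressed with CTEs (t1, t2, t3 as in the steps above).
with t1 as (
  -- one row per user per login day
  select distinct
    name,
    to_date(login_time) as login_date
  from
    player
),
t2 as (
  -- date minus row number is constant within a run of consecutive days
  select
    name,
    login_date,
    date_sub(login_date, row_number() over(partition by name order by login_date)) as diff
  from
    t1
),
t3 as (
  -- runs of at least 3 consecutive days
  select
    name,
    min(login_date) as start_date,
    max(login_date) as end_date
  from
    t2
  group by
    name, diff
  having
    count(1) >= 3
)
select
  t0.id,
  t0.name,
  t0.login_time,
  t3.start_date,
  t3.end_date,
  datediff(t3.end_date, t3.start_date) + 1 as days
from
  player as t0
join
  t3
on
  t0.name = t3.name
where
  to_date(t0.login_time) between t3.start_date and t3.end_date;
```

To use a different streak length, change the threshold in `having count(1) >= 3`.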