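Using Hive to find the login records of users who logged in on n consecutive days

Preface

  • Hadoop: 2.7.7
  • Hive: 2.3.0
  • This post is an exercise in using Hive SQL to find the login records of users who logged in on n consecutive days. It relies mainly on window functions.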

Test table

Table data (tab-separated: id, name, login_time):

1	Alise	2020-5-12 09:25:56
2	Alise	2020-5-13 10:25:30
3	Alise	2020-5-15 05:25:05
4	Alise	2020-5-15 15:23:11
5	Alise	2020-5-16 21:06:14
6	Bob	2020-5-12 11:17:48
7	Bob	2020-5-13 21:26:28
8	Bob	2020-5-13 21:27:28
9	Bob	2020-5-14 09:27:01
10	Bob	2020-5-14 21:27:01
11	Bob	2020-5-16 22:15:14
12	Jack	2020-5-16 23:14:45
13	Jack	2020-5-16 23:18:17
14	Jack	2020-5-17 12:07:22
15	Tom	2020-5-12 09:25:56
16	Tom	2020-5-14 09:21:00
17	Tom	2020-5-15 09:00:56
18	Tom	2020-5-16 21:06:14

Create the table and load the data:

CREATE TABLE `player`(
    `id` int,              -- record ID
    `name` string,         -- user name
    `login_time` timestamp -- login timestamp
)
row format delimited
fields terminated by '\t';

load data local inpath '/tmp/hive/data/tbl/player.txt' overwrite into table player;

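Find the login records of users who logged in on more than 2 consecutive days, and show the start date, end date, and length of each streak

1) Convert each timestamp to a date and remove duplicates, so that several logins on the same day count once. This produces table t1: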

select distinct
    name,
    to_date(login_time) as login_date
from
    player;
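
As a rough illustration outside Hive, the Python sketch below mimics what `select distinct name, to_date(login_time)` does on three rows taken from the sample data. The function name `to_login_dates` is made up for this example and is not part of Hive.

```python
from datetime import datetime

# Three rows copied from the sample data: (id, name, login_time).
rows = [
    (3, "Alise", "2020-5-15 05:25:05"),
    (4, "Alise", "2020-5-15 15:23:11"),
    (5, "Alise", "2020-5-16 21:06:14"),
]

def to_login_dates(rows):
    """Return the sorted, de-duplicated (name, login_date) pairs."""
    pairs = {
        (name, datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").date())
        for _, name, ts in rows
    }
    return sorted(pairs)

t1 = to_login_dates(rows)
for name, login_date in t1:
    print(name, login_date)
```

The two logins on 2020-05-15 collapse into a single (name, date) pair, which is what keeps the row numbers in the next step aligned with calendar days.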

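2) For each row of t1, number the rows within each user in ascending date order, then subtract that row number (in days) from the row's date to get diff. This produces table t2: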

select
    name,
    login_date,
    date_sub(login_date,row_number() over(partition by name order by login_date)) as diff
from (
    select distinct
        name,
        to_date(login_time) as login_date
    from
        player
) as t1;
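
To see what diff looks like, here is a small Python sketch of the same arithmetic applied to Bob's distinct login dates from the sample data. The helper name `date_minus_row_number` is hypothetical; it stands in for `date_sub(login_date, row_number() over(...))`.

```python
from datetime import date, timedelta

# Bob's distinct login dates from the sample data.
bob_dates = [date(2020, 5, 12), date(2020, 5, 13), date(2020, 5, 14), date(2020, 5, 16)]

def date_minus_row_number(dates):
    """Pair each date with (date - row number), numbering from 1 in ascending date order."""
    return [(d, d - timedelta(days=rn)) for rn, d in enumerate(sorted(dates), start=1)]

diffs = date_minus_row_number(bob_dates)
for login_date, diff in diffs:
    print(login_date, diff)
```

The first three dates all map to 2020-05-11, while 2020-05-16 maps to 2020-05-12, because the missing 2020-05-15 shifts the difference by one day.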

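3) Group the rows of t2 by name and diff. Rows with the same name and diff always belong to the same streak of consecutive logins. The reason is that two sequences with the same common difference have a constant element-wise difference: within a streak, both the date and the row number grow by 1 per row, so date minus row number stays the same, and any gap in the dates changes it. Take the minimum and maximum date of each group as the start and end dates of the streak, and keep only the groups with at least n rows (n = 3 here). This produces table t3: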

select
    name,
    min(login_date) as start_date,
    max(login_date) as end_date
from (
    select
        name,
        login_date,
        date_sub(login_date,row_number() over(partition by name order by login_date)) as diff
    from (
        select distinct
            name,
            to_date(login_time) as login_date
        from
            player
    ) as t1
) as t2
group by 
    t2.name,t2.diff
having 
    count(1) >= 3;
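
The grouping step can be sketched in Python as follows, with the streak length n as a parameter. This is only an illustration of the logic under the assumption that the dates have already been de-duplicated per user; `find_streaks` and `day` are names invented for the example.

```python
from collections import defaultdict
from datetime import date, timedelta

def day(d):
    """Shorthand for a date in May 2020 (the month used by the sample data)."""
    return date(2020, 5, d)

def find_streaks(login_dates, n):
    """Return sorted (name, start_date, end_date) for every streak of at least n days."""
    streaks = []
    for name, dates in login_dates.items():
        groups = defaultdict(list)
        # Same idea as "group by name, diff": the key is date minus row number.
        for rn, d in enumerate(sorted(set(dates)), start=1):
            groups[d - timedelta(days=rn)].append(d)
        for members in groups.values():
            if len(members) >= n:  # mirrors "having count(1) >= n"
                streaks.append((name, min(members), max(members)))
    return sorted(streaks)

# Distinct login dates per user, taken from the sample data.
sample = {
    "Alise": [day(12), day(13), day(15), day(16)],
    "Bob": [day(12), day(13), day(14), day(16)],
    "Jack": [day(16), day(17)],
    "Tom": [day(12), day(14), day(15), day(16)],
}

for streak in find_streaks(sample, 3):
    print(streak)
```

With n = 3 only Bob (2020-05-12 to 2020-05-14) and Tom (2020-05-14 to 2020-05-16) qualify. Lowering n to 2 would also return both of Alise's two-day streaks and Jack's.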

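4) Join t3 with the original table t0 (player) on t0.name = t3.name, and filter with t0.login_time >= t3.start_date AND t0.login_time <= date_add(t3.end_date, 1). One day is added to end_date because end_date is a date without a time part, so logins later on the last day of the streak would otherwise be excluded. The streak length in days is computed as datediff(end_date, start_date) + 1.

PS: Hive versions before 2.2.0 do not support non-equi join conditions in the ON clause, so the range condition is written as a WHERE filter instead, which works on any version. Because the WHERE clause references t3 columns, rows without a matching streak are dropped, so the left join behaves like an inner join here.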

select
    t0.id,
    t0.name,
    t0.login_time,
    t3.start_date,
    t3.end_date,
    datediff(end_date,start_date)+1 as days
from
    player as t0
left join (
    select
        name,
        min(login_date) as start_date,
        max(login_date) as end_date
    from (
        select
            name,
            login_date,
            date_sub(login_date,row_number() over(partition by name order by login_date)) as diff
        from (
            select distinct
                name,
                to_date(login_time) as login_date
            from
                player
        ) as t1
    ) as t2
    group by 
        t2.name,t2.diff
    having 
        count(1) >= 3
) as t3
on 
    t0.name = t3.name
where 
    t0.login_time>=t3.start_date
    and
    t0.login_time<=date_add(t3.end_date,1);
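
The range check and the day count can be sketched in Python like this. It is a simplified stand-in for the WHERE clause, written with a half-open interval (from midnight of start_date up to, but not including, midnight of the day after end_date); `in_streak` and `streak_days` are hypothetical helper names.

```python
from datetime import date, datetime, time, timedelta

def in_streak(login_time, start_date, end_date):
    """True if login_time falls on any day from start_date to end_date, inclusive."""
    lower = datetime.combine(start_date, time.min)
    upper = datetime.combine(end_date + timedelta(days=1), time.min)
    return lower <= login_time < upper

def streak_days(start_date, end_date):
    """Same arithmetic as datediff(end_date, start_date) + 1."""
    return (end_date - start_date).days + 1

# Tom's streak from the sample data.
start, end = date(2020, 5, 14), date(2020, 5, 16)

print(in_streak(datetime(2020, 5, 16, 21, 6, 14), start, end))  # row 18, last day of the streak
print(in_streak(datetime(2020, 5, 12, 9, 25, 56), start, end))  # row 15, before the streak
print(streak_days(start, end))
```

Row 18 is kept even though it is late in the evening of the last day, while row 15 is filtered out because it predates the streak.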

Query result

+--------+----------+------------------------+----------------+--------------+-------+
| t0.id  | t0.name  |     t0.login_time      | t3.start_date  | t3.end_date  | days  |
+--------+----------+------------------------+----------------+--------------+-------+
| 6      | Bob      | 2020-05-12 11:17:48.0  | 2020-05-12     | 2020-05-14   | 3     |
| 7      | Bob      | 2020-05-13 21:26:28.0  | 2020-05-12     | 2020-05-14   | 3     |
| 8      | Bob      | 2020-05-13 21:27:28.0  | 2020-05-12     | 2020-05-14   | 3     |
| 9      | Bob      | 2020-05-14 09:27:01.0  | 2020-05-12     | 2020-05-14   | 3     |
| 10     | Bob      | 2020-05-14 21:27:01.0  | 2020-05-12     | 2020-05-14   | 3     |
| 16     | Tom      | 2020-05-14 09:21:00.0  | 2020-05-14     | 2020-05-16   | 3     |
| 17     | Tom      | 2020-05-15 09:00:56.0  | 2020-05-14     | 2020-05-16   | 3     |
| 18     | Tom      | 2020-05-16 21:06:14.0  | 2020-05-14     | 2020-05-16   | 3     |
+--------+----------+------------------------+----------------+--------------+-------+

End~
