hive之连续登录问题

目录

  • 1、开窗函数的格式
  • 2、窗口范围图例
  • 3、连续登录问题:
  • 4、 限制时间段内登录次数问题:

1、开窗函数的格式

FUNCTION_NAME([argument_list])

OVER (

[PARTITION BY window_partition,]

[ORDER BY window_ordering,[ASC|DESC]])

[ { ROWS | RANGE } BETWEEN frame_start AND frame_end ] );

FUNCTION_NAME:函数名称。如row_number()、sum()、first_value()等。

argument_list:函数的参数列表。

PARTITION BY:根据window_partition(分区字段)进行分区,该子句也被称为查询分区子句。类似于group by,都是将数据按照边界值进行分组。而OVER之前的函数在每一个分组之内进行,如果超出了分组,则函数会重新计算。

ORDER BY:将各个分区内的数据,根据window_ordering(排序字段)进行排序。ORDER BY子句会对输入的数据强制排序(窗口函数是SQL语句最后执行的函数,因此可以把SQL结果集想象成输入数据)。ORDER BY子句对于诸如row_number(),lead(),lag()等函数是必须的。如果数据无序,这些函数的结果就没有意义。

ROWS和RANGE分别表示选择前后几行、选择数据范围。

2、窗口范围图例

hive之连续登录问题_第1张图片
注释:
PRECEDING:往前

FOLLOWING:往后

CURRENT ROW:当前行

UNBOUNDED:起点

UNBOUNDED PRECEDING 表示从前面的起点

UNBOUNDED FOLLOWING:表示到后面的终点
常用的范围:
01:ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
02:ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
03:ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING,n PRECEDING m FOLLOWING:表示窗口的范围是[(当前行的行数)- n, (当前行的行数)+ m] row。
04:ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING

3、连续登录问题:

数据准备:
±–±---------+
|id |login_date|
±–±---------+
|01 |2021-02-28|
|01 |2021-03-01|
|01 |2021-03-02|
|01 |2021-03-04|
|01 |2021-03-05|
|01 |2021-03-06|
|01 |2021-03-08|
|02 |2021-03-01|
|02 |2021-03-02|
|02 |2021-03-03|
|02 |2021-03-06|
|03 |2021-03-06|
±–±---------+
统计连续登录天数超过3天的用户,输出信息包括:用户id,登录天数,起始时间,结束时间;
方法1:

SELECT
 t2.id,
 count(1)           as login_times,
 min(t2.login_date) as start_date,
 max(t2.login_date) as end_date
FROM
(
    SELECT
     t1.id,
     t1.login_date,
     date_sub(t1.login_date,rn) as diff_date
    FROM
    (
        SELECT
         id,
         login_date,
         row_number() over(partition by id order by login_date asc) as rn 
        FROM data
    ) t1
) t2
group by t2.id, t2.diff_date
having login_times >= 3;

+---+-----------+----------+----------+
|id |login_times|start_date|end_date  |
+---+-----------+----------+----------+
| 01|3          |2021-02-28|2021-03-02|
| 01|3          |2021-03-04|2021-03-06|
| 02|3          |2021-03-01|2021-03-03|
+---+-----------+----------+----------+

方法2:

SELECT 
  id,
  lag_login_date,
  login_date,lead_login_date
FROM
      (SELECT 
         id,
         login_date,
         lag(login_date,1,login_date) over(partition by id order by login_date) as lag_login_date,
         lead(login_date,1,login_date) over(partition by id order by login_date) as lead_login_date
      FROM data
      ) t1
where datediff(login_date,lag_login_date) =1 and datediff(lead_login_date,login_date) =1;

+---+--------------+----------+---------------+
|id |lag_login_date|login_date|lead_login_date|
+---+--------------+----------+---------------+
|01 |2018-02-28    |2018-03-01|2018-03-02     |
|01 |2018-03-04    |2018-03-05|2018-03-06     |
|02 |2018-03-01    |2018-03-02|2018-03-03     |
+---+--------------+----------+---------------+

方法3:

SELECT 
  id,
  lag_login_date,
  login_date
FROM
      (SELECT 
         id,
         login_date,
         lag(login_date,2,login_date) over(partition by id order by login_date) as lag_login_date,
         --lead(login_date,1,login_date) over(partition by id order by login_date) as lead_login_date
      FROM data
      ) t1
where datediff(login_date,lag_login_date) =2
+---+--------------+----------+
|id |lag_login_date|login_date|
+---+--------------+----------+
|01 |2018-02-28    |2018-03-02|
|01 |2018-03-04    |2018-03-06|
|02 |2018-03-01    |2018-03-03|
+---+--------------+----------+

4、 限制时间段内登录次数问题:

有一张用户购买记录表.现在我们需要找出所有的特殊用户.特殊用户的定义如下:
在当前购买时间的近7天内(含当天)购买次数超过3次(含),且近7天的购买总金额超过1000的用户即为特殊用户.
数据准备:
±–±---------+
|user_id |lbuy_date|amount|
±–±---------±-----------+
|101 |2021-01-01|1000|
|101 |2021-01-02|2000|
|102|2021-10-01|10|
|102 |2021-10-02|700|
|102 |2021-10-07|200|
|103 |2021-11-07|500|
|103 |2021-11-08|500|
|103 |2021-11-20|500|
|104|2021-03-01|10|
|104|2021-03-05|200|
|104|2021-03-09|800|
|104 |2021-03-09|800|
±–±---------+

实现:

---sql逻辑
select
distinct user_id
from
(
  select
    user_id
    ,buy_date
    ,count(1) over(PARTITION by user_id order by datediff(buy_date,'2021-01-01') RANGE between 6 PRECEDING and CURRENT row) as cnt
    ,sum(amount) over(PARTITION by user_id order by datediff(buy_date,'2021-01-01') RANGE between 6 PRECEDING and CURRENT row) as amount
  from test.aaa001
)t1
where cnt>=3 and amount>1000
;
user_id
104
106

你可能感兴趣的:(大数据系列二,hive,hadoop,big,data)