Hive Interview Questions: Consecutive Logins, Rows to Columns, and Columns to Rows

Data preparation
The queries below read from a table named data with two columns, id and login_date:
id login_date
01 2021-02-28
01 2021-03-01
01 2021-03-02
01 2021-03-04
01 2021-03-05
01 2021-03-06
01 2021-03-08
02 2021-03-01
02 2021-03-02
02 2021-03-03
02 2021-03-06
03 2021-03-06

Approach 1
1. First, partition the data by user id and number each user's rows in ascending login_date order.
SQL:
SELECT
id,
login_date,
row_number() over(partition by id order by login_date asc) as rn
FROM data;

Result:
id login_date rn
01 2021-02-28 1
01 2021-03-01 2
01 2021-03-02 3
01 2021-03-04 4
01 2021-03-05 5
01 2021-03-06 6
01 2021-03-08 7
02 2021-03-01 1
02 2021-03-02 2
02 2021-03-03 3
02 2021-03-06 4
03 2021-03-06 1
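
As a cross-check of step 1, here is a minimal plain-Python sketch of what row_number() over(partition by id order by login_date asc) computes. It is not Hive code; the names ROWS and add_row_number are made up for this illustration, and the rows are copied from the data preparation table.

```python
from collections import defaultdict

# Sample rows copied from the data preparation table (id, login_date).
ROWS = [
    ("01", "2021-02-28"), ("01", "2021-03-01"), ("01", "2021-03-02"),
    ("01", "2021-03-04"), ("01", "2021-03-05"), ("01", "2021-03-06"),
    ("01", "2021-03-08"), ("02", "2021-03-01"), ("02", "2021-03-02"),
    ("02", "2021-03-03"), ("02", "2021-03-06"), ("03", "2021-03-06"),
]


def add_row_number(rows):
    """Number each user's rows by ascending login_date, starting at 1."""
    by_id = defaultdict(list)
    for user_id, login_date in rows:
        by_id[user_id].append(login_date)

    numbered = []
    for user_id in sorted(by_id):
        # yyyy-MM-dd strings sort in chronological order.
        for rn, login_date in enumerate(sorted(by_id[user_id]), start=1):
            numbered.append((user_id, login_date, rn))
    return numbered


numbered = add_row_number(ROWS)
for row in numbered:
    print(*row)
```

The numbering restarts at 1 for every id, which is what the partition by clause does in the query.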

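2. Subtract rn days from login_date with date_sub. Rows of the same user that yield the same resulting date (diff_date) belong to one run of consecutive days: within a run, login_date and rn both increase by 1 from row to row, so their difference stays constant, while any gap in the dates shifts it.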
SQL:
SELECT
    t1.id,
    t1.login_date,
    date_sub(t1.login_date, rn) as diff_date
FROM
(
    SELECT
        id,
        login_date,
        row_number() over(partition by id order by login_date asc) as rn
    FROM data
) t1;

Result:
id login_date diff_date
01 2021-02-28 2021-02-27
01 2021-03-01 2021-02-27
01 2021-03-02 2021-02-27
01 2021-03-04 2021-02-28
01 2021-03-05 2021-02-28
01 2021-03-06 2021-02-28
01 2021-03-08 2021-03-01
02 2021-03-01 2021-02-28
02 2021-03-02 2021-02-28
02 2021-03-03 2021-02-28
02 2021-03-06 2021-03-02
03 2021-03-06 2021-03-05
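
To see why the subtraction works, the following plain-Python sketch recomputes diff_date for user 01. It only mimics date_sub with the standard datetime module; the helper name diff_date and the list user_01 are assumptions made for this example.

```python
from datetime import date, timedelta


def diff_date(login_date: str, rn: int) -> str:
    """Mimic date_sub(login_date, rn): the date rn days before login_date."""
    return (date.fromisoformat(login_date) - timedelta(days=rn)).isoformat()


# User 01's login dates paired with their row numbers from step 1.
user_01 = [
    ("2021-02-28", 1), ("2021-03-01", 2), ("2021-03-02", 3),
    ("2021-03-04", 4), ("2021-03-05", 5), ("2021-03-06", 6),
    ("2021-03-08", 7),
]

keys = [diff_date(d, rn) for d, rn in user_01]
for (d, rn), key in zip(user_01, keys):
    print(d, rn, key)
```

The first three rows share one key, the next three share another, and the isolated login on 2021-03-08 gets a key of its own, so the key identifies each run of consecutive days.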

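3. Group by id and diff_date. The count(1) of each group is the number of consecutive login days in that run; keep only the groups with at least 3.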
SQL:
SELECT
    t2.id,
    count(1) as login_times,
    min(t2.login_date) as start_date,
    max(t2.login_date) as end_date
FROM
(
    SELECT
        t1.id,
        t1.login_date,
        date_sub(t1.login_date, rn) as diff_date
    FROM
    (
        SELECT
            id,
            login_date,
            row_number() over(partition by id order by login_date asc) as rn
        FROM data
    ) t1
) t2
group by t2.id, t2.diff_date
having login_times >= 3;

Result:
id login_times start_date end_date
01 3 2021-02-28 2021-03-02
01 3 2021-03-04 2021-03-06
02 3 2021-03-01 2021-03-03
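
The three steps of Approach 1 can be sketched end to end in plain Python as a sanity check. This is an illustration, not Hive code: ROWS and login_streaks are hypothetical names, and, like the query, the sketch assumes at most one row per user per day (duplicate dates would break the row-number trick).

```python
from collections import defaultdict
from datetime import date, timedelta

# Sample rows copied from the data preparation table (id, login_date).
ROWS = [
    ("01", "2021-02-28"), ("01", "2021-03-01"), ("01", "2021-03-02"),
    ("01", "2021-03-04"), ("01", "2021-03-05"), ("01", "2021-03-06"),
    ("01", "2021-03-08"), ("02", "2021-03-01"), ("02", "2021-03-02"),
    ("02", "2021-03-03"), ("02", "2021-03-06"), ("03", "2021-03-06"),
]


def login_streaks(rows, min_days=3):
    """Return (id, login_times, start_date, end_date) for each run
    of at least min_days consecutive login days."""
    by_id = defaultdict(list)
    for user_id, login_date in rows:
        by_id[user_id].append(date.fromisoformat(login_date))

    # Steps 1 and 2: number the dates per user, then key each date by
    # (id, login_date - rn days), the equivalent of diff_date.
    groups = defaultdict(list)
    for user_id, dates in by_id.items():
        for rn, d in enumerate(sorted(dates), start=1):
            groups[(user_id, d - timedelta(days=rn))].append(d)

    # Step 3: group by (id, diff_date) and keep the large groups.
    streaks = [
        (user_id, len(ds), min(ds).isoformat(), max(ds).isoformat())
        for (user_id, _), ds in groups.items()
        if len(ds) >= min_days
    ]
    return sorted(streaks, key=lambda s: (s[0], s[2]))


streaks = login_streaks(ROWS)
for s in streaks:
    print(*s)
```

Changing min_days plays the role of the threshold in the having clause.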

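Approach 2
Approach 2 uses the lag and lead window functions; the idea is similar. For each row, fetch the same user's previous and next login dates. A row whose previous login is exactly one day earlier and whose next login is exactly one day later is the middle day of three consecutive login days. A run longer than three days therefore yields one row for each of its interior days.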
SQL:
SELECT
    id,
    lag_login_date,
    login_date,
    lead_login_date
FROM
(
    SELECT
        id,
        login_date,
        lag(login_date,1,login_date) over(partition by id order by login_date) as lag_login_date,
        lead(login_date,1,login_date) over(partition by id order by login_date) as lead_login_date
    FROM data
) t1
where datediff(login_date,lag_login_date) = 1 and datediff(lead_login_date,login_date) = 1;

Result:
id lag_login_date login_date lead_login_date
01 2021-02-28 2021-03-01 2021-03-02
01 2021-03-04 2021-03-05 2021-03-06
02 2021-03-01 2021-03-02 2021-03-03
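
The lag/lead filter can be checked with a small plain-Python sketch over the sample data from the data preparation table (dated 2021). It is an illustration rather than Hive code; ROWS and middle_days are names assumed for this example. The default value of lag and lead is reproduced by falling back to the row's own date at the edges of each partition, which makes the day difference 0 there and so excludes those rows.

```python
from collections import defaultdict
from datetime import date

# Sample rows copied from the data preparation table (id, login_date).
ROWS = [
    ("01", "2021-02-28"), ("01", "2021-03-01"), ("01", "2021-03-02"),
    ("01", "2021-03-04"), ("01", "2021-03-05"), ("01", "2021-03-06"),
    ("01", "2021-03-08"), ("02", "2021-03-01"), ("02", "2021-03-02"),
    ("02", "2021-03-03"), ("02", "2021-03-06"), ("03", "2021-03-06"),
]


def middle_days(rows):
    """Return (id, lag_login_date, login_date, lead_login_date) for every
    login that has a login on both the previous and the next calendar day."""
    by_id = defaultdict(list)
    for user_id, login_date in rows:
        by_id[user_id].append(date.fromisoformat(login_date))

    result = []
    for user_id in sorted(by_id):
        dates = sorted(by_id[user_id])
        for i, cur in enumerate(dates):
            # lag/lead with offset 1, defaulting to the current row's date.
            prev = dates[i - 1] if i > 0 else cur
            nxt = dates[i + 1] if i + 1 < len(dates) else cur
            if (cur - prev).days == 1 and (nxt - cur).days == 1:
                result.append(
                    (user_id, prev.isoformat(), cur.isoformat(), nxt.isoformat())
                )
    return result


middle = middle_days(ROWS)
for row in middle:
    print(*row)
```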

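Rows to columns and columns to rows
Using the data from the consecutive-login example, collect each user's login dates into a single comma-separated string. Note that collect_list does not guarantee the order of the collected elements, so the dates are not certain to come out sorted.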
select id,
       concat_ws(',',collect_list(login_date)) cw
from data
group by id;

Result:
id cw
01 2021-02-28,2021-03-01,2021-03-02,2021-03-04,2021-03-05,2021-03-06,2021-03-08
02 2021-03-01,2021-03-02,2021-03-03,2021-03-06
03 2021-03-06
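
For readers who think in terms of ordinary collections, the aggregation above behaves roughly like the following plain-Python sketch: group the dates per id, keep duplicates (as collect_list does), and join them with a comma (as concat_ws does). The names ROWS and rows_to_column are assumptions for this example, and the Python list keeps input order, which Hive does not promise.

```python
from collections import defaultdict

# Sample rows copied from the data preparation table (id, login_date).
ROWS = [
    ("01", "2021-02-28"), ("01", "2021-03-01"), ("01", "2021-03-02"),
    ("01", "2021-03-04"), ("01", "2021-03-05"), ("01", "2021-03-06"),
    ("01", "2021-03-08"), ("02", "2021-03-01"), ("02", "2021-03-02"),
    ("02", "2021-03-03"), ("02", "2021-03-06"), ("03", "2021-03-06"),
]


def rows_to_column(rows, sep=","):
    """Collapse the rows of each id into one separator-joined string."""
    grouped = defaultdict(list)
    for user_id, login_date in rows:
        grouped[user_id].append(login_date)  # duplicates are kept
    return [(user_id, sep.join(dates)) for user_id, dates in sorted(grouped.items())]


collapsed = rows_to_column(ROWS)
for user_id, cw in collapsed:
    print(user_id, cw)
```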

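Taking the output of the query above as the base table (called t here, with columns id and cw), run the following SQL to split each string back into one row per date: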
select id, login_date
from t
lateral view explode(split(cw,',')) b AS login_date;

Result:
id login_date
01 2021-02-28
01 2021-03-01
01 2021-03-02
01 2021-03-04
01 2021-03-05
01 2021-03-06
01 2021-03-08
02 2021-03-01
02 2021-03-02
02 2021-03-03
02 2021-03-06
03 2021-03-06
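
The reverse direction can be sketched the same way in plain Python: split plays the role of Hive's split, and the nested loop plays the role of lateral view explode, emitting one output row per element next to the id it came from. The table T and the function column_to_rows are hypothetical names for this illustration, with T holding the three aggregated rows for the 2021 sample data.

```python
# The aggregated table t: one row per id with a comma-separated cw column.
T = [
    ("01", "2021-02-28,2021-03-01,2021-03-02,2021-03-04,2021-03-05,2021-03-06,2021-03-08"),
    ("02", "2021-03-01,2021-03-02,2021-03-03,2021-03-06"),
    ("03", "2021-03-06"),
]


def column_to_rows(table, sep=","):
    """Expand each (id, cw) row into one (id, login_date) row per element."""
    return [
        (user_id, login_date)
        for user_id, cw in table
        for login_date in cw.split(sep)
    ]


exploded = column_to_rows(T)
for user_id, login_date in exploded:
    print(user_id, login_date)
```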
