Spark SQL: maximum consecutive login days

Data:

+-------+----------+
|user_id|        dt|
+-------+----------+
|      1|2020-09-01|
|      1|2020-09-02|
|      1|2020-09-03|
|      1|2020-09-04|
|      1|2020-09-05|
|      1|2020-09-07|
|      2|2020-08-01|
|      2|2020-09-02|
|      2|2020-09-03|
|      2|2020-09-04|
|      2|2020-09-05|
|      2|2020-09-07|
|      3|2020-08-01|
|      3|2020-09-02|
|      3|2020-09-04|
|      3|2020-09-04|
|      3|2020-09-05|
|      3|2020-09-07|
+-------+----------+

Note:
| 3|2020-09-04|
| 3|2020-09-04|
These rows are duplicates, so
Step 1 is to deduplicate:

SELECT DISTINCT user_id, dt FROM test
+-------+----------+                                                            
|user_id|        dt|
+-------+----------+
|      3|2020-09-04|
|      1|2020-09-07|
|      3|2020-08-01|
|      1|2020-09-05|
|      2|2020-09-07|
|      2|2020-09-02|
|      3|2020-09-07|
|      1|2020-09-03|
|      2|2020-09-05|
|      3|2020-09-05|
|      1|2020-09-04|
|      2|2020-09-04|
|      1|2020-09-01|
|      2|2020-09-03|
|      2|2020-08-01|
|      3|2020-09-02|
|      1|2020-09-02|
+-------+----------+
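As a quick cross-check outside Spark, the dedup step can be sketched in plain Python; the rows below are a small slice of the sample data, and a set plays the role of SELECT DISTINCT:

```python
# SELECT DISTINCT user_id, dt -- a set drops the duplicate (3, 2020-09-04) row.
rows = [(3, "2020-09-04"), (3, "2020-09-04"), (3, "2020-09-05")]
deduped = sorted(set(rows))
print(deduped)  # [(3, '2020-09-04'), (3, '2020-09-05')]
```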

Step 2:
Sort each user_id's login dates and assign a row number

SELECT user_id, dt, row_number() over(partition BY user_id order by dt) AS rk FROM (SELECT DISTINCT user_id, dt FROM test)
+-------+----------+---+                                                        
|user_id|        dt| rk|
+-------+----------+---+
|      3|2020-08-01|  1|
|      3|2020-09-02|  2|
|      3|2020-09-04|  3|
|      3|2020-09-05|  4|
|      3|2020-09-07|  5|
|      1|2020-09-01|  1|
|      1|2020-09-02|  2|
|      1|2020-09-03|  3|
|      1|2020-09-04|  4|
|      1|2020-09-05|  5|
|      1|2020-09-07|  6|
|      2|2020-08-01|  1|
|      2|2020-09-02|  2|
|      2|2020-09-03|  3|
|      2|2020-09-04|  4|
|      2|2020-09-05|  5|
|      2|2020-09-07|  6|
+-------+----------+---+
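For intuition, the row_number() window can be emulated in plain Python (a sketch, not Spark's implementation): group by user_id, sort the dates, and number them from 1.

```python
from collections import defaultdict

def row_number(rows):
    """Emulate row_number() OVER (PARTITION BY user_id ORDER BY dt)."""
    by_user = defaultdict(list)
    for user_id, dt in rows:
        by_user[user_id].append(dt)
    ranked = []
    for user_id, dts in by_user.items():
        # Sort within the partition, then number from 1 upward.
        for rk, dt in enumerate(sorted(dts), start=1):
            ranked.append((user_id, dt, rk))
    return ranked

ranked = row_number([(1, "2020-09-02"), (1, "2020-09-01"), (2, "2020-09-05")])
print(ranked)
# [(1, '2020-09-01', 1), (1, '2020-09-02', 2), (2, '2020-09-05', 1)]
```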

Step 3:
Subtract the rank from dt. If the dates are consecutive, the result is the same for every date in the run.

SELECT user_id, date_add(dt, - rk) AS sim FROM (SELECT user_id, dt, row_number() over(partition BY user_id order by dt) AS rk FROM (SELECT DISTINCT user_id, dt FROM test))
+-------+----------+                                                            
|user_id|       sim|
+-------+----------+
|      3|2020-07-31|
|      3|2020-08-31|
|      3|2020-09-01|
|      3|2020-09-01|
|      3|2020-09-02|
|      1|2020-08-31|
|      1|2020-08-31|
|      1|2020-08-31|
|      1|2020-08-31|
|      1|2020-08-31|
|      1|2020-09-01|
|      2|2020-07-31|
|      2|2020-08-31|
|      2|2020-08-31|
|      2|2020-08-31|
|      2|2020-08-31|
|      2|2020-09-01|
+-------+----------+
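This is the classic gaps-and-islands trick: subtracting the rank collapses every run of consecutive dates onto a single anchor date. A plain-Python sketch using user 1's dates from the sample:

```python
from datetime import date, timedelta

def group_key(dt, rk):
    """date_add(dt, -rk): consecutive dates map to the same anchor date."""
    return date.fromisoformat(dt) - timedelta(days=rk)

# User 1: 09-01..09-05 are consecutive; 09-07 breaks the run.
dts = ["2020-09-01", "2020-09-02", "2020-09-03",
       "2020-09-04", "2020-09-05", "2020-09-07"]
keys = [group_key(dt, rk) for rk, dt in enumerate(dts, start=1)]
print(keys)  # five copies of 2020-08-31, then 2020-09-01
```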

Step 4:
Count the rows that share the same key:

SELECT user_id, sim, COUNT(sim) AS cs FROM (SELECT user_id, date_add(dt, - rk) AS sim FROM (SELECT user_id, dt, row_number() over(partition BY user_id order by dt) AS rk FROM (SELECT DISTINCT user_id, dt FROM test))) GROUP BY user_id, sim
+-------+----------+---+                                                        
|user_id|       sim| cs|
+-------+----------+---+
|      3|2020-07-31|  1|
|      3|2020-08-31|  1|
|      3|2020-09-01|  2|
|      3|2020-09-02|  1|
|      1|2020-08-31|  5|
|      1|2020-09-01|  1|
|      2|2020-07-31|  1|
|      2|2020-08-31|  4|
|      2|2020-09-01|  1|
+-------+----------+---+
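In Python terms this grouping is just a Counter keyed on (user_id, sim); a sketch with user 1's keys from the previous step:

```python
from collections import Counter

# GROUP BY user_id, sim with COUNT(sim): count rows sharing the same key.
pairs = [(1, "2020-08-31")] * 5 + [(1, "2020-09-01")]
cs = Counter(pairs)
print(cs)  # Counter({(1, '2020-08-31'): 5, (1, '2020-09-01'): 1})
```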

Step 5:
Take the maximum count as the longest consecutive streak:

SELECT user_id, MAX(cs) AS max_date FROM (SELECT user_id, sim, COUNT(sim) AS cs FROM (SELECT user_id, date_add(dt, - rk) AS sim FROM (SELECT user_id, dt, row_number() over(partition BY user_id order by dt) AS rk FROM (SELECT DISTINCT user_id, dt FROM test))) GROUP BY user_id, sim) GROUP BY user_id
+-------+--------+                                                              
|user_id|max_date|
+-------+--------+
|      3|       2|
|      1|       5|
|      2|       4|
+-------+--------+

The full SQL:

SELECT user_id, MAX(cs) AS max_date
FROM
    (
        SELECT user_id, sim, COUNT(sim) AS cs
        FROM
            (
                SELECT user_id, date_add(dt, - rk) AS sim
                FROM
                    (
                        SELECT dt, user_id, row_number() over(partition BY user_id order by dt) AS rk
                        FROM
                            (
                                SELECT DISTINCT user_id, dt FROM test
                            )
                    )
            ) GROUP BY user_id, sim
    ) GROUP BY user_id
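For reference, the whole pipeline can be reproduced end-to-end in plain Python against the sample data; this is only a cross-check of the logic, not Spark code:

```python
from collections import Counter, defaultdict
from datetime import date, timedelta

def max_streak(rows):
    """Dedup -> rank -> date-minus-rank -> count per key -> max per user."""
    by_user = defaultdict(set)
    for user_id, dt in rows:                        # set = SELECT DISTINCT
        by_user[user_id].add(date.fromisoformat(dt))
    result = {}
    for user_id, dts in by_user.items():
        keys = Counter(d - timedelta(days=rk)       # date_add(dt, -rk)
                       for rk, d in enumerate(sorted(dts), start=1))
        result[user_id] = max(keys.values())        # MAX(cs)
    return result

# The sample data from the top of the post (user 3 has a duplicate 09-04).
rows = ([(1, "2020-09-0%d" % d) for d in (1, 2, 3, 4, 5, 7)]
        + [(2, "2020-08-01")]
        + [(2, "2020-09-0%d" % d) for d in (2, 3, 4, 5, 7)]
        + [(3, "2020-08-01")]
        + [(3, "2020-09-0%d" % d) for d in (2, 4, 4, 5, 7)])
print(max_streak(rows))  # {1: 5, 2: 4, 3: 2}
```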

Note:

You can also use the date_sub function: replacing date_add(dt, -rk) with date_sub(dt, rk) gives the same result.

date_add shifts a date forward by n days; date_sub shifts it back by n days.
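The equivalence is easy to verify with Python's datetime arithmetic (adding -n days equals subtracting n days):

```python
from datetime import date, timedelta

d = date(2020, 9, 5)
# date_add(dt, -3) and date_sub(dt, 3) land on the same day.
assert d + timedelta(days=-3) == d - timedelta(days=3)
print(d - timedelta(days=3))  # 2020-09-02
```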
