ClickHouse in Practice, Note 2: Active User Retention Statistics (brace yourself~~)

Original link: https://blog.csdn.net/weixin_44181744/article/details/115430787?utm_source=app&app_version=4.5.8

Preface

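As is well known, user retention is one of the most important metrics in a product data monitoring system. In concrete business scenarios, retention breaks down into several categories: new-user retention, active-user retention, paying-user retention, returning-user retention, new-paying-user retention, returning-paying-user retention, and so on. This post designs ClickHouse queries for active-user retention specifically.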

Table structure and query requirement

Table structure

time                         user_id
January 1, 2021, 08:33 AM    112233abc
January 1, 2021, 09:00 AM    221133bac
January 1, 2021, 09:20 AM    113322acb
January 2, 2021, 10:12 AM    113322acb
January 2, 2021, 11:01 AM    221133bac
January 3, 2021, 11:45 AM    112233abc
……                           ……

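Query requirement: from the table User_login, for a given date range, report for each day in that range the active-user next-day, 3-day, 7-day, 14-day, and 30-day retention.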
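To make the target numbers concrete before the SQL, here is a minimal plain-Python sketch of next-day and 3-day retention for January 1. It uses only the six rows visible in the table above (the elided rows are ignored), and the helper names `active_users` and `retention` are made up for this illustration.

```python
from datetime import date, timedelta

# The six visible sample rows, reduced to (login day, user_id).
logins = [
    (date(2021, 1, 1), "112233abc"),
    (date(2021, 1, 1), "221133bac"),
    (date(2021, 1, 1), "113322acb"),
    (date(2021, 1, 2), "113322acb"),
    (date(2021, 1, 2), "221133bac"),
    (date(2021, 1, 3), "112233abc"),
]

def active_users(day):
    """Set of users with at least one login on the given day."""
    return {uid for d, uid in logins if d == day}

def retention(base_day, offset_days):
    """Share of base_day's active users who are active again offset_days later."""
    base = active_users(base_day)
    later = active_users(base_day + timedelta(days=offset_days))
    return len(base & later) / len(base)

next_day = retention(date(2021, 1, 1), 1)   # 2 of the 3 users log in on Jan 2
third_day = retention(date(2021, 1, 1), 2)  # 1 of the 3 users logs in on Jan 3
```

Note the naming convention used throughout this post: the base day counts as day 1, so "3-day retention" looks at logins two days after the base day.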

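How to implement it in ClickHouse

PS: this involves several levels of nested subqueries.

  1. For each user_id, pair up login days (one row per user per day) so that the gap between any two of that user's logins can be computed;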
select
    uid,
    base_date,
    retention_date
from
    -- table a: users who logged in on the base day
    (select
        distinct
            date(time) base_date,
            user_id uid
    from
        User_login
    where
        -- the date range to compute retention for
        base_date between '2021-03-01' and '2021-03-31'
    group by
        uid, base_date) a
left join
    -- table b: the join matches later logins, used to compute retention
    (select
        distinct
            date(time) retention_date,
            user_id uid
    from
        User_login
    where
        -- do not cap the maximum date here, otherwise later retention rates may come out as 0
        retention_date >= '2021-03-01'
    group by
        uid, retention_date) b
on
    a.uid = b.uid
where
    -- make sure a row never joins to data earlier than its own base date, e.g. yesterday's login
    base_date <= retention_date
order by
    uid, base_date

Results (demo data):

image

  2. Subtract base_date from retention_date to get the number of days between the two logins;
select
    uid,
    base_date,
    retention_date,
    retention_date - base_date datediff
from
    -- the query from the previous step
    (select
        uid,
        base_date,
        retention_date
    from
        -- table a: users who logged in on the base day
        (select
            distinct
                date(time) base_date,
                user_id uid
        from
            User_login
        where
            -- the date range to compute retention for
            base_date between '2021-03-01' and '2021-03-31'
        group by
            uid, base_date) a
    left join
        -- table b: the join matches later logins, used to compute retention
        (select
            distinct
                date(time) retention_date,
                user_id uid
        from
            User_login
        where
            -- do not cap the maximum date here, otherwise later retention rates may come out as 0
            retention_date >= '2021-03-01'
        group by
            uid, retention_date) b
    on
        a.uid = b.uid
    where
        -- make sure a row never joins to data earlier than its own base date, e.g. yesterday's login
        base_date <= retention_date
    order by
        uid, base_date) c
order by
    uid, base_date

Results (demo data):

image
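The first two steps can be hard to picture from the nested SQL alone, so here is a rough plain-Python sketch of the same idea: a self-join on the user id, a filter that keeps only return days on or after the base day, and a day difference. The demo rows and the ids `u1` and `u2` are invented for this illustration and are not the author's data.

```python
from datetime import date

# Hypothetical demo rows: (login day, user_id), already one row per user per day.
daily_logins = {
    (date(2021, 3, 1), "u1"),
    (date(2021, 3, 2), "u1"),
    (date(2021, 3, 4), "u1"),
    (date(2021, 3, 1), "u2"),
    (date(2021, 2, 28), "u2"),  # before the window, so it is dropped on both sides
}

start, end = date(2021, 3, 1), date(2021, 3, 31)

# "Table a": base days inside the reporting window.
a = [(uid, d) for d, uid in daily_logins if start <= d <= end]
# "Table b": candidate return days, with a lower bound only.
b = [(uid, d) for d, uid in daily_logins if d >= start]

# Join on uid, keep return days on or after the base day, and take the day difference.
rows = sorted(
    (uid_a, base, ret, (ret - base).days)
    for uid_a, base in a
    for uid_b, ret in b
    if uid_a == uid_b and base <= ret
)
```

Every base-day row pairs with itself (difference 0), which is why the later steps can use the difference-0 count as the denominator.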

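  3. Pick out the day differences of interest to get the number of users who logged in on the base day, the next day, the third day, the seventh day, and so on;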
select
    base_date,
    sum(case when datediff = 0 then 1 else 0 end) day_0,
    sum(case when datediff = 1 then 1 else 0 end) day_1,
    sum(case when datediff = 2 then 1 else 0 end) day_2,
    sum(case when datediff = 6 then 1 else 0 end) day_6,
    sum(case when datediff = 13 then 1 else 0 end) day_13,
    sum(case when datediff = 29 then 1 else 0 end) day_29
from
    -- the query from the previous step
    (select
        uid,
        base_date,
        retention_date,
        retention_date - base_date datediff
    from
        (select
            uid,
            base_date,
            retention_date
        from
            -- table a: users who logged in on the base day
            (select
                distinct
                    date(time) base_date,
                    user_id uid
            from
                User_login
            where
                -- the date range to compute retention for
                base_date between '2021-03-01' and '2021-03-31'
            group by
                uid, base_date) a
        left join
            -- table b: the join matches later logins, used to compute retention
            (select
                distinct
                    date(time) retention_date,
                    user_id uid
            from
                User_login
            where
                -- do not cap the maximum date here, otherwise later retention rates may come out as 0
                retention_date >= '2021-03-01'
            group by
                uid, retention_date) b
        on
            a.uid = b.uid
        where
            -- make sure a row never joins to data earlier than its own base date, e.g. yesterday's login
            base_date <= retention_date
        order by
            uid, base_date) c
    order by
        uid, base_date) d
group by base_date
order by base_date

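Results (demo data; the query returns the retention counts for each of the 31 days, where day_0 is the number of users who logged in on the base day itself):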

image
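The `sum(case when ...)` columns are a conditional count per base day. A small Python sketch of that pivot, under the assumption that each (uid, base day, return day) combination appears only once (which the `distinct` subqueries provide); the rows below are hypothetical and use string dates to keep things short.

```python
from collections import defaultdict

# Hypothetical (uid, base_day, datediff) rows as produced by the previous step.
rows = [
    ("u1", "2021-03-01", 0), ("u1", "2021-03-01", 1), ("u1", "2021-03-01", 6),
    ("u2", "2021-03-01", 0), ("u2", "2021-03-01", 1),
    ("u3", "2021-03-01", 0),
    ("u1", "2021-03-02", 0), ("u1", "2021-03-02", 5),
]

# Same offsets as the day_N columns in the query.
offsets = (0, 1, 2, 6, 13, 29)

# One counter per base day, with every offset starting at 0.
counts = defaultdict(lambda: dict.fromkeys(offsets, 0))
for uid, base_day, datediff in rows:
    if datediff in offsets:  # differences such as 5 are not reported and are skipped
        counts[base_day][datediff] += 1
```

Because each user contributes at most one row per offset, the counters are user counts rather than login counts.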

  4. Divide the next-day, 3-day, 7-day, etc. login counts by the base-day login count to get the corresponding retention rates;
select
    base_date,
    -- keep 4 decimal places, for display as a percentage
    round(day_1/day_0, 4) retention_2,
    round(day_2/day_0, 4) retention_3,
    round(day_6/day_0, 4) retention_7,
    round(day_13/day_0, 4) retention_14,
    round(day_29/day_0, 4) retention_30
from
    -- the query from the previous step
    (select
        base_date,
        sum(case when datediff = 0 then 1 else 0 end) day_0,
        sum(case when datediff = 1 then 1 else 0 end) day_1,
        sum(case when datediff = 2 then 1 else 0 end) day_2,
        sum(case when datediff = 6 then 1 else 0 end) day_6,
        sum(case when datediff = 13 then 1 else 0 end) day_13,
        sum(case when datediff = 29 then 1 else 0 end) day_29
    from
        (select
            uid,
            base_date,
            retention_date,
            retention_date - base_date datediff
        from
            (select
                uid,
                base_date,
                retention_date
            from
                -- table a: users who logged in on the base day
                (select
                    distinct
                        date(time) base_date,
                        user_id uid
                from
                    User_login
                where
                    -- the date range to compute retention for
                    base_date between '2021-03-01' and '2021-03-31'
                group by
                    uid, base_date) a
            left join
                -- table b: the join matches later logins, used to compute retention
                (select
                    distinct
                        date(time) retention_date,
                        user_id uid
                from
                    User_login
                where
                    -- do not cap the maximum date here, otherwise later retention rates may come out as 0
                    retention_date >= '2021-03-01'
                group by
                    uid, retention_date) b
            on
                a.uid = b.uid
            where
                -- make sure a row never joins to data earlier than its own base date, e.g. yesterday's login
                base_date <= retention_date
            order by
                uid, base_date) c
        order by
            uid, base_date) d
    group by base_date
    order by base_date)

Results (demo data):

image
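The last step is plain arithmetic: each later-day count divided by the base-day count, rounded to 4 decimals. A short Python sketch with made-up counts (the figures are illustrative, not taken from the demo screenshots):

```python
# Hypothetical per-day counts in the shape produced by the previous step.
day_counts = {
    "2021-03-01": {"day_0": 3, "day_1": 2, "day_2": 0,
                   "day_6": 1, "day_13": 0, "day_29": 0},
}

# Column naming follows the query: day_1 feeds retention_2, day_6 feeds retention_7, etc.
columns = {
    "day_1": "retention_2",
    "day_2": "retention_3",
    "day_6": "retention_7",
    "day_13": "retention_14",
    "day_29": "retention_30",
}

# Divide each later-day count by the base-day count and keep 4 decimals.
rates = {
    base_day: {out: round(c[src] / c["day_0"], 4) for src, out in columns.items()}
    for base_day, c in day_counts.items()
}
```

Since every base-day user matches their own login with a difference of 0, day_0 is at least 1 for any base_date that appears in the result, so the division is safe. A rate of 0 for a recent base_date may only mean that the later day has not been reached yet.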

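At this point all the active-user retention figures have been computed. To show them as percentages, just adjust the display setting in Metabase; the result looks like this: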
image

Hmm... it still feels like something is missing. That's right, there is one more piece!!

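  5. Add a date-picker widget so the query runs for a user-chosen date range;

This is the final step. If the date range is hard-coded in the query, then whenever the business side wants to look at retention on the platform they have to ask someone on the data team to change the dates. That adds work for the data team and slows the business side down.

If the date range is not restricted at all, every run has to scan the entire table. That hurts query performance, and as the data grows day by day the query puts more and more pressure on the database.

So the better option is a date filter widget that lets the business side choose the date range they want to see retention for. To set it up, write the date conditions in the {{date}} form and then, in the settings, change the widget type to a date: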

select
    base_date,
    -- keep 4 decimal places, for display as a percentage
    round(day_1/day_0, 4) retention_2,
    round(day_2/day_0, 4) retention_3,
    round(day_6/day_0, 4) retention_7,
    round(day_13/day_0, 4) retention_14,
    round(day_29/day_0, 4) retention_30
from
    (select
        base_date,
        sum(case when datediff = 0 then 1 else 0 end) day_0,
        sum(case when datediff = 1 then 1 else 0 end) day_1,
        sum(case when datediff = 2 then 1 else 0 end) day_2,
        sum(case when datediff = 6 then 1 else 0 end) day_6,
        sum(case when datediff = 13 then 1 else 0 end) day_13,
        sum(case when datediff = 29 then 1 else 0 end) day_29
    from
        (select
            uid,
            base_date,
            retention_date,
            retention_date - base_date datediff
        from
            (select
                uid,
                base_date,
                retention_date
            from
                -- table a: users who logged in on the base day
                (select
                    distinct
                        date(time) base_date,
                        user_id uid
                from
                    User_login
                where
                    -- the date range to compute retention for;
                    -- pick the range in the date widgets, then run the query
                    base_date between {{start_date}} and {{end_date}}
                group by
                    uid, base_date) a
            left join
                -- table b: the join matches later logins, used to compute retention
                (select
                    distinct
                        date(time) retention_date,
                        user_id uid
                from
                    User_login
                where
                    -- do not cap the maximum date here, otherwise later retention rates may come out as 0
                    retention_date >= {{start_date}}
                group by
                    uid, retention_date) b
            on
                a.uid = b.uid
            where
                -- make sure a row never joins to data earlier than its own base date, e.g. yesterday's login
                base_date <= retention_date
            order by
                uid, base_date) c
        order by
            uid, base_date) d
    group by base_date
    order by base_date)

The result looks like this:

image

image



