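Original link: https://blog.csdn.net/weixin_44181744/article/details/115430787?utm_source=app&app_version=4.5.8
Preface
As is well known, user retention is a key metric in any product data-monitoring system. In practice, retention can be broken down into several categories, including new-user retention, active-user retention, paying-user retention, returning-user retention, new-paying-user retention, returning-paying-user retention, and so on. This post focuses on designing ClickHouse queries for active-user retention.
Table structure and query requirements
Table structure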
time | user_id |
---|---|
January 1, 2021, 08:33 AM | 112233abc |
January 1, 2021, 09:00 AM | 221133bac |
January 1, 2021, 09:20 AM | 113322acb |
January 2, 2021, 10:12 AM | 113322acb |
January 2, 2021, 11:01 AM | 221133bac |
January 3, 2021, 11:45 AM | 112233abc |
…… | …… |
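Query requirement: for the table User_login and a given date range, return the active-user day-2 (next-day), day-3, day-7, day-14, and day-30 retention for each day in the range.
Implementing it in ClickHouse
PS: this involves several levels of nested subqueries.
- Query each user_id's login days, grouped by date, and pair every active day of a user with that user's same-day and later login days, so that the gap between two logins can be computed;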
select
uid,
base_date,
retention_date
from
-- subquery a: users who logged in on the base day
(select
distinct
date(time) base_date,
user_id uid
from
User_login
where
-- date range for which retention is computed
base_date between '2021-03-01' and '2021-03-31'
group by
uid, base_date) a
left join
-- subquery b: joined in to match subsequent logins for the retention calculation
(select
distinct
date(time) retention_date,
user_id uid
from
User_login
where
-- do not cap the maximum date here, otherwise later retention rates may come out as 0
retention_date >= '2021-03-01'
group by
uid, retention_date) b
on
a.uid = b.uid
where
-- make sure a row is never joined to logins earlier than its own base date (e.g. yesterday's login)
base_date <= retention_date
order by
uid, base_date
Result (demo data):
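To make the shape of this intermediate result concrete, here is a minimal Python sketch of the same pairing logic, run on the six sample rows from the table above. It is an illustration only, not part of the ClickHouse query: the function name `login_pairs` and the January date range are assumptions chosen to fit the sample data.

```python
from datetime import date, datetime

# Sample rows copied from the User_login table shown earlier.
LOGINS = [
    ("2021-01-01 08:33", "112233abc"),
    ("2021-01-01 09:00", "221133bac"),
    ("2021-01-01 09:20", "113322acb"),
    ("2021-01-02 10:12", "113322acb"),
    ("2021-01-02 11:01", "221133bac"),
    ("2021-01-03 11:45", "112233abc"),
]


def login_pairs(logins, start, end):
    """Return sorted (uid, base_date, retention_date) tuples.

    Mirrors the SQL: distinct (uid, day) on both sides, joined on uid,
    keeping only rows where base_date <= retention_date.
    """
    days = {(uid, datetime.strptime(ts, "%Y-%m-%d %H:%M").date())
            for ts, uid in logins}
    base = [(u, d) for u, d in days if start <= d <= end]   # subquery a
    later = [(u, d) for u, d in days if d >= start]         # subquery b
    return sorted((u, b, r) for u, b in base
                  for u2, r in later if u2 == u and b <= r)


pairs = login_pairs(LOGINS, date(2021, 1, 1), date(2021, 1, 3))
for row in pairs:
    print(row)
```

With the sample rows this yields nine pairs: one same-day row for each of the six user/day combinations, plus one row for each of the three later logins.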
- Subtract base_date from retention_date to get the number of days between the two logins;
select
uid,
base_date,
retention_date,
retention_date - base_date datediff
from
/* previous step's query
(select
uid,
base_date,
retention_date
from
-- subquery a: users who logged in on the base day
(select
distinct
date(time) base_date,
user_id uid
from
User_login
where
-- date range for which retention is computed
base_date between '2021-03-01' and '2021-03-31'
group by
uid, base_date) a
left join
-- subquery b: joined in to match subsequent logins for the retention calculation
(select
distinct
date(time) retention_date,
user_id uid
from
User_login
where
-- do not cap the maximum date here, otherwise later retention rates may come out as 0
retention_date >= '2021-03-01'
group by
uid, retention_date) b
on
a.uid = b.uid
where
-- make sure a row is never joined to logins earlier than its own base date (e.g. yesterday's login)
base_date <= retention_date
order by
uid, base_date) c
*/
order by
uid, base_date
Result (demo data):
- Count rows by day difference to get the number of users who logged in on the same day, the next day, day 3, day 7, and so on;
select
base_date,
sum(case when datediff = 0 then 1 else 0 end) day_0,
sum(case when datediff = 1 then 1 else 0 end) day_1,
sum(case when datediff = 2 then 1 else 0 end) day_2,
sum(case when datediff = 6 then 1 else 0 end) day_6,
sum(case when datediff = 13 then 1 else 0 end) day_13,
sum(case when datediff = 29 then 1 else 0 end) day_29
from
/* previous step's query
(select
uid,
base_date,
retention_date,
retention_date - base_date datediff
from
(select
uid,
base_date,
retention_date
from
-- subquery a: users who logged in on the base day
(select
distinct
date(time) base_date,
user_id uid
from
User_login
where
-- date range for which retention is computed
base_date between '2021-03-01' and '2021-03-31'
group by
uid, base_date) a
left join
-- subquery b: joined in to match subsequent logins for the retention calculation
(select
distinct
date(time) retention_date,
user_id uid
from
User_login
where
-- do not cap the maximum date here, otherwise later retention rates may come out as 0
retention_date >= '2021-03-01'
group by
uid, retention_date) b
on
a.uid = b.uid
where
-- make sure a row is never joined to logins earlier than its own base date (e.g. yesterday's login)
base_date <= retention_date
order by
uid, base_date) c
order by
uid, base_date) d
*/
group by base_date
order by base_date
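Result (demo data; the query returns the retention counts for each of the 31 days, where day_0 is the number of users who logged in on the base day itself):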
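The conditional sums in this step amount to a pivot: count the rows for each (base_date, datediff) combination, then read off the offsets of interest. The sketch below shows that idea in Python. The names `ROWS` and `pivot_counts` are hypothetical, and the input rows are what the six sample logins from the table at the top would produce for a January 1 to January 3 range.

```python
from collections import Counter
from datetime import date

# (base_date, datediff) rows as the previous step would produce them for
# the six sample logins: one row per user and base/retention date pair.
ROWS = [
    (date(2021, 1, 1), 0), (date(2021, 1, 1), 2),   # 112233abc
    (date(2021, 1, 3), 0),
    (date(2021, 1, 1), 0), (date(2021, 1, 1), 1),   # 113322acb
    (date(2021, 1, 2), 0),
    (date(2021, 1, 1), 0), (date(2021, 1, 1), 1),   # 221133bac
    (date(2021, 1, 2), 0),
]

OFFSETS = (0, 1, 2, 6, 13, 29)  # same day offsets as the SQL columns


def pivot_counts(rows, offsets=OFFSETS):
    """Count rows per base_date for each tracked offset (day_0, day_1, ...)."""
    tally = Counter(rows)  # key: (base_date, datediff); missing keys count as 0
    base_dates = sorted({b for b, _ in rows})
    return {b: {f"day_{k}": tally[(b, k)] for k in offsets} for b in base_dates}


counts = pivot_counts(ROWS)
for base_date, cols in counts.items():
    print(base_date, cols)
```

Because each user appears at most once per (base_date, retention_date) pair, counting rows here is the same as counting distinct users.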
- Divide the next-day, day-3, day-7, etc. login counts by the same-day login count to get the corresponding retention rates;
select
base_date,
-- keep 4 decimal places, for display as a percentage
round(day_1/day_0, 4) retention_2,
round(day_2/day_0, 4) retention_3,
round(day_6/day_0, 4) retention_7,
round(day_13/day_0, 4) retention_14,
round(day_29/day_0, 4) retention_30
from
/* previous step's query
(select
base_date,
sum(case when datediff = 0 then 1 else 0 end) day_0,
sum(case when datediff = 1 then 1 else 0 end) day_1,
sum(case when datediff = 2 then 1 else 0 end) day_2,
sum(case when datediff = 6 then 1 else 0 end) day_6,
sum(case when datediff = 13 then 1 else 0 end) day_13,
sum(case when datediff = 29 then 1 else 0 end) day_29
from
(select
uid,
base_date,
retention_date,
retention_date - base_date datediff
from
(select
uid,
base_date,
retention_date
from
-- subquery a: users who logged in on the base day
(select
distinct
date(time) base_date,
user_id uid
from
User_login
where
-- date range for which retention is computed
base_date between '2021-03-01' and '2021-03-31'
group by
uid, base_date) a
left join
-- subquery b: joined in to match subsequent logins for the retention calculation
(select
distinct
date(time) retention_date,
user_id uid
from
User_login
where
-- do not cap the maximum date here, otherwise later retention rates may come out as 0
retention_date >= '2021-03-01'
group by
uid, retention_date) b
on
a.uid = b.uid
where
-- make sure a row is never joined to logins earlier than its own base date (e.g. yesterday's login)
base_date <= retention_date
order by
uid, base_date) c
order by
uid, base_date) d
group by base_date
order by base_date)
*/
Result (demo data):
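As a quick check of the arithmetic, the following Python sketch applies the same division and four-decimal rounding to one row of counts. The names `retention_rates` and `jan_1` are hypothetical; the counts are what the six sample logins from the table at the top give for base date 2021-01-01 (3 users active that day, 2 back the next day, 1 back two days later).

```python
def retention_rates(counts, digits=4):
    """Divide each day_N count by day_0 and name the result retention_(N+1)."""
    return {
        f"retention_{int(name.split('_')[1]) + 1}": round(n / counts["day_0"], digits)
        for name, n in counts.items()
        if name != "day_0"
    }


# Counts for base date 2021-01-01, derived from the sample table.
jan_1 = {"day_0": 3, "day_1": 2, "day_2": 1, "day_6": 0, "day_13": 0, "day_29": 0}
rates = retention_rates(jan_1)
print(rates)
```

Here retention_2 comes out as 0.6667 and retention_3 as 0.3333, which a percentage format would show as 66.67% and 33.33%.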
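That completes the active-retention query. To display the values as percentages, just adjust the formatting in the Metabase settings; the result looks like this:
Hmm... something still feels missing. That's right, there's one more piece!
- Add a date-picker widget so the query runs over a user-selected date range;
This is the final step. If the date range is hard-coded in the query, then whenever business users want to view retention on the platform they have to ask the data team to change the dates. That adds work for the data team and slows the business side down;
If the dates are not restricted at all, every run scans the entire table. That hurts query performance, and as the data grows the query puts more and more load on the database;
So use a date-filter widget instead and let business users pick the date range they want to see retention for. To set it up, write the date condition in the {{date}} form in the query, then set the widget type to Date in the settings: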
select
base_date,
-- keep 4 decimal places, for display as a percentage
round(day_1/day_0, 4) retention_2,
round(day_2/day_0, 4) retention_3,
round(day_6/day_0, 4) retention_7,
round(day_13/day_0, 4) retention_14,
round(day_29/day_0, 4) retention_30
from
(select
base_date,
sum(case when datediff = 0 then 1 else 0 end) day_0,
sum(case when datediff = 1 then 1 else 0 end) day_1,
sum(case when datediff = 2 then 1 else 0 end) day_2,
sum(case when datediff = 6 then 1 else 0 end) day_6,
sum(case when datediff = 13 then 1 else 0 end) day_13,
sum(case when datediff = 29 then 1 else 0 end) day_29
from
(select
uid,
base_date,
retention_date,
retention_date - base_date datediff
from
(select
uid,
base_date,
retention_date
from
-- subquery a: users who logged in on the base day
(select
distinct
date(time) base_date,
user_id uid
from
User_login
where
-- date range for which retention is computed
-- filled in by the date-picker widgets: select the range manually, then run
base_date between {{start_date}} and {{end_date}}
group by
uid, base_date) a
left join
-- subquery b: joined in to match subsequent logins for the retention calculation
(select
distinct
date(time) retention_date,
user_id uid
from
User_login
where
-- do not cap the maximum date here, otherwise later retention rates may come out as 0
retention_date >= {{start_date}}
group by
uid, retention_date) b
on
a.uid = b.uid
where
-- make sure a row is never joined to logins earlier than its own base date (e.g. yesterday's login)
base_date <= retention_date
order by
uid, base_date) c
order by
uid, base_date) d
group by base_date
order by base_date)
The result looks like this:
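One possible refinement, which is not part of the original query: subquery b has no upper date bound, but the largest offset the query reports is 29 days, so logins later than the end date plus 29 days cannot affect any column. The Python sketch below only computes that window; the function name `retention_scan_window` is hypothetical, and whether to add the resulting upper bound as an extra filter on retention_date is left as an assumption to verify against your own data.

```python
from datetime import date, timedelta

OFFSETS = (0, 1, 2, 6, 13, 29)  # day offsets used by the retention columns


def retention_scan_window(start_date, end_date, offsets=OFFSETS):
    """Return the (first, last) login dates that can affect any retention column."""
    return start_date, end_date + timedelta(days=max(offsets))


first, last = retention_scan_window(date(2021, 3, 1), date(2021, 3, 31))
print(first, last)
```

For the March 2021 range used earlier, the window ends on 2021-04-29. Capping below that date would zero out the later columns, which is the failure the comment in subquery b warns about.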
Mission accomplished!