最近我们要根据 平时的 投放日志 统计出 每天的 活跃用户,留存 等指标,为此我进行了细致的调研。
我们的留存规则如下: 对于第1天的新增用户,如果第i 天,该用户有访问行为,则认为该用户为留存用户。
为此,我们简化为如下3张表的一个统计需求。
日志表,全量用户表,每天去重用户表
这几张表的字段如下:
日志表主要有以下几个字段:
id 用户id
action 用户的具体行为
day 产生日志的时间 (按天分区)
日志表存储了每天的全量日志,不会对用户维度的数据进行去重,结构和数据如下:
0: jdbc:hive2://cdh-manager:10000> show create table org_log;
INFO : Compiling command(queryId=hive_20190411041523_d77fcf8d-74d4-4d56-a70e-47464e48beea): show create table org_log
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:createtab_stmt, type:string, comment:from deserializer)], properties:null)
INFO : Completed compiling command(queryId=hive_20190411041523_d77fcf8d-74d4-4d56-a70e-47464e48beea); Time taken: 0.053 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hive_20190411041523_d77fcf8d-74d4-4d56-a70e-47464e48beea): show create table org_log
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=hive_20190411041523_d77fcf8d-74d4-4d56-a70e-47464e48beea); Time taken: 0.044 seconds
INFO : OK
+----------------------------------------------------+
| createtab_stmt |
+----------------------------------------------------+
| CREATE TABLE `org_log`( |
| `id` int, |
| `action` string) |
| PARTITIONED BY ( |
| `day` date) |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.mapred.TextInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
| LOCATION |
| 'hdfs://cdh-manager:8020/user/hive/warehouse/keep_active_test.db/org_log' |
| TBLPROPERTIES ( |
| 'last_modified_by'='hive', |
| 'last_modified_time'='1554816805', |
| 'transient_lastDdlTime'='1554816805') |
+----------------------------------------------------+
日志表内的数据:
0: jdbc:hive2://cdh-manager:10000> select * from org_log;
INFO : Compiling command(queryId=hive_20190411041725_54407313-8d5f-499a-ad9e-33f77e6acba5): select * from org_log
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:org_log.id, type:int, comment:null), FieldSchema(name:org_log.action, type:string, comment:null), FieldSchema(name:org_log.day, type:date, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20190411041725_54407313-8d5f-499a-ad9e-33f77e6acba5); Time taken: 0.623 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hive_20190411041725_54407313-8d5f-499a-ad9e-33f77e6acba5): select * from org_log
INFO : Completed executing command(queryId=hive_20190411041725_54407313-8d5f-499a-ad9e-33f77e6acba5); Time taken: 0.001 seconds
INFO : OK
+-------------+-----------------+--------------+
| org_log.id | org_log.action | org_log.day |
+-------------+-----------------+--------------+
| 1 | find | 2019-04-09 |
| 2 | touch | 2019-04-09 |
| 1 | touch | 2019-04-09 |
| 3 | touch | 2019-04-09 |
| 4 | touch | 2019-04-10 |
| 1 | touch | 2019-04-10 |
| 2 | find | 2019-04-10 |
| 2 | key | 2019-04-10 |
| 5 | touch | 2019-04-10 |
| 1 | touch | 2019-04-11 |
| 2 | touch | 2019-04-11 |
| 3 | touch | 2019-04-11 |
| 1 | touch | 2019-04-12 |
| 2 | touch | 2019-04-12 |
| 3 | touch | 2019-04-12 |
+-------------+-----------------+--------------+
15 rows selected (1.16 seconds)
----------------------------------------------------------------
全量用户表 ,存储了所有去重后的用户
表内主要有2个字段 :
id 用户id
create_day 用户第一次有记录的时间
0: jdbc:hive2://cdh-manager:10000> show create table all_user;
INFO : Compiling command(queryId=hive_20190411041955_b65eedab-5205-4443-9ea2-8e6f1cf09a02): show create table all_user
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:createtab_stmt, type:string, comment:from deserializer)], properties:null)
INFO : Completed compiling command(queryId=hive_20190411041955_b65eedab-5205-4443-9ea2-8e6f1cf09a02); Time taken: 0.035 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hive_20190411041955_b65eedab-5205-4443-9ea2-8e6f1cf09a02): show create table all_user
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=hive_20190411041955_b65eedab-5205-4443-9ea2-8e6f1cf09a02); Time taken: 0.016 seconds
INFO : OK
+----------------------------------------------------+
| createtab_stmt |
+----------------------------------------------------+
| CREATE TABLE `all_user`( |
| `id` int, |
| `create_day` date) |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.mapred.TextInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
| LOCATION |
| 'hdfs://cdh-manager:8020/user/hive/warehouse/keep_active_test.db/all_user' |
| TBLPROPERTIES ( |
| 'transient_lastDdlTime'='1554819131') |
+----------------------------------------------------+
13 rows selected (0.091 seconds)
内部数据:
0: jdbc:hive2://cdh-manager:10000> select * from all_user;
INFO : Compiling command(queryId=hive_20190411045422_3b3e77eb-f023-4a31-a683-1ae2b9de81ac): select * from all_user
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:all_user.id, type:int, comment:null), FieldSchema(name:all_user.create_day, type:date, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20190411045422_3b3e77eb-f023-4a31-a683-1ae2b9de81ac); Time taken: 0.086 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hive_20190411045422_3b3e77eb-f023-4a31-a683-1ae2b9de81ac): select * from all_user
INFO : Completed executing command(queryId=hive_20190411045422_3b3e77eb-f023-4a31-a683-1ae2b9de81ac); Time taken: 0.001 seconds
INFO : OK
+--------------+----------------------+
| all_user.id | all_user.create_day |
+--------------+----------------------+
| 1 | 2019-04-09 |
| 2 | 2019-04-09 |
| 3 | 2019-04-09 |
| 4 | 2019-04-10 |
| 5 | 2019-04-10 |
+--------------+----------------------+
5 rows selected (0.158 seconds)
插入数据脚本:
insert into org_log partition(day='2019-04-11') values (1,'touch'),(2,'touch'),(3,'touch');
---------------------------------------------
每天去重用户表,存储了每天去重后的用户 :
表内主要有以下3个字段:
id 用户id
day 日期(按天分区)
0: jdbc:hive2://cdh-manager:10000> show create table daily_active;
INFO : Compiling command(queryId=hive_20190411045050_933fd2c2-e4fd-4aff-8943-f1f5747120b5): show create table daily_active
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:createtab_stmt, type:string, comment:from deserializer)], properties:null)
INFO : Completed compiling command(queryId=hive_20190411045050_933fd2c2-e4fd-4aff-8943-f1f5747120b5); Time taken: 0.037 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hive_20190411045050_933fd2c2-e4fd-4aff-8943-f1f5747120b5): show create table daily_active
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=hive_20190411045050_933fd2c2-e4fd-4aff-8943-f1f5747120b5); Time taken: 0.02 seconds
INFO : OK
+----------------------------------------------------+
| createtab_stmt |
+----------------------------------------------------+
| CREATE TABLE `daily_active`( |
| `id` int) |
| PARTITIONED BY ( |
| `day` date) |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.mapred.TextInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
| LOCATION |
| 'hdfs://cdh-manager:8020/user/hive/warehouse/keep_active_test.db/daily_active' |
| TBLPROPERTIES ( |
| 'transient_lastDdlTime'='1554866026') |
+----------------------------------------------------+
14 rows selected (0.105 seconds)
内部数据:
0: jdbc:hive2://cdh-manager:10000> select * from daily_active;
INFO : Compiling command(queryId=hive_20190411045310_215d6bfd-ece8-49e0-8ebb-80e03765922c): select * from daily_active
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:daily_active.id, type:int, comment:null), FieldSchema(name:daily_active.day, type:date, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20190411045310_215d6bfd-ece8-49e0-8ebb-80e03765922c); Time taken: 0.126 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hive_20190411045310_215d6bfd-ece8-49e0-8ebb-80e03765922c): select * from daily_active
INFO : Completed executing command(queryId=hive_20190411045310_215d6bfd-ece8-49e0-8ebb-80e03765922c); Time taken: 0.001 seconds
INFO : OK
+------------------+-------------------+
| daily_active.id | daily_active.day |
+------------------+-------------------+
| 1 | 2019-04-09 |
| 2 | 2019-04-09 |
| 3 | 2019-04-09 |
| 1 | 2019-04-10 |
| 2 | 2019-04-10 |
| 4 | 2019-04-10 |
| 5 | 2019-04-10 |
| 1 | 2019-04-11 |
| 2 | 2019-04-11 |
| 3 | 2019-04-11 |
| 1 | 2019-04-12 |
| 2 | 2019-04-12 |
| 3 | 2019-04-12 |
+------------------+-------------------+
13 rows selected (0.259 seconds)
--------------------------------------------
流程如下:
第一步. 得出每天的去重用户
第二步. 根据当前的去重用户,更新全量用户表
第三步. 选取当天 之前的 1-30 day 用户,计算留存
HQL :
[root@cdh-manager active_test]# cat insert_daily_active.hql
use keep_active_test;
INSERT OVERWRITE table daily_active partition(day = '${cal_day}') SELECT DISTINCT ID FROM org_log WHERE day = '${cal_day}';
脚本:
[root@cdh-manager active_test]# cat insert_daily_active.sh
#!/bin/bash
beeline -u jdbc:hive2://cdh-manager:10000 -n hive -f insert_daily_active.hql --hivevar cal_day=$1
HQL :
[root@cdh-manager active_test]# cat insert_distinct_alluser.hql
use keep_active_test;
INSERT INTO table all_user (id, create_day)
SELECT tmp.id, '${cal_day}' AS create_day FROM (
SELECT DISTINCT id FROM org_log WHERE day = '${cal_day}'
) AS tmp
WHERE tmp.id NOT IN (SELECT id FROM all_user);
脚本:
[root@cdh-manager active_test]# cat insert_distinct_alluser.sh
#!/bin/bash
echo $1
beeline -u jdbc:hive2://cdh-manager:10000 -n hive -f insert_distinct_alluser.hql --hivevar cal_day=$1
这里描述下主要的思路:
通过 全量用户表 选出 30天的增量用户, 与每天去重的用户表 JOIN , JOIN 的字段是用户ID , 计算时间差值,按照时间差值分组统计。
有两个脚本,一个较为基础,
即原始脚本。帮助理解,无分组函数:
[root@cdh-manager active_test]# cat cal_daily_active2.hql
use keep_active_test;
SELECT all_user.id, all_user.create_day, daily_active.day, datediff(daily_active.day, all_user.create_day )
FROM all_user INNER JOIN daily_active
ON all_user.id = daily_active.id
WHERE all_user.create_day = '2019-04-09';
结果:
+--------------+----------------------+-------------------+------+
| all_user.id | all_user.create_day | daily_active.day | _c3 |
+--------------+----------------------+-------------------+------+
| 1 | 2019-04-09 | 2019-04-11 | 2 |
| 1 | 2019-04-09 | 2019-04-09 | 0 |
| 1 | 2019-04-09 | 2019-04-12 | 3 |
| 1 | 2019-04-09 | 2019-04-10 | 1 |
| 2 | 2019-04-09 | 2019-04-11 | 2 |
| 2 | 2019-04-09 | 2019-04-09 | 0 |
| 2 | 2019-04-09 | 2019-04-12 | 3 |
| 2 | 2019-04-09 | 2019-04-10 | 1 |
| 3 | 2019-04-09 | 2019-04-11 | 2 |
| 3 | 2019-04-09 | 2019-04-09 | 0 |
| 3 | 2019-04-09 | 2019-04-12 | 3 |
+--------------+----------------------+-------------------+------+
11 rows selected (54.082 seconds)
最后是实际统计函数:
root@cdh-manager active_test]# cat cal_daily_active.hql
use keep_active_test;
SELECT datediff(daily_active.day, all_user.create_day) AS day , COUNT(daily_active.id) AS total
FROM all_user INNER JOIN daily_active
ON all_user.id = daily_active.id
WHERE all_user.create_day = '2019-04-09'
GROUP BY datediff(daily_active.day, all_user.create_day);
结果:
+------+--------+
| day | total |
+------+--------+
| 0 | 3 |
| 1 | 2 |
| 2 | 3 |
| 3 | 3 |
+------+--------+
第一行 : 留存天数
第二行: 留存人数