Contents
0 Problem description
1 Data analysis
2 Summary
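Given a user behavior table tracking_log with roughly the following fields:
(user_id: user ID, op_id: operation ID, op_time: operation time)
Requirements: (1) For each day, count the users who satisfy the following condition: an A operation is followed by a B operation, and A and B must be adjacent.
(2) Count the users whose behavior sequence is A-B-D,
where any other browsing records (such as C or E) may appear between A and B, and any browsing records except C (such as A or E) may appear between B and D.
(1) Data generation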
create table tracking_log(
user_id int ,
op_id string,
op_time string
)row format delimited fields terminated by '\t';
insert into tracking_log(user_id, op_id, op_time) values
(1, 'A', '2020-1-1 12:01:03'),
(2, 'A', '2020-1-1 12:01:04'),
(3, 'A', '2020-1-1 12:01:05'),
(1, 'B', '2020-1-1 12:03:03'),
(1, 'A', '2020-1-1 12:04:03'),
(1, 'C', '2020-1-1 12:06:03'),
(1, 'D', '2020-1-1 12:11:03'),
(2, 'A', '2020-1-1 12:07:04'),
(3, 'C', '2020-1-1 12:02:05'),
(2, 'C', '2020-1-1 12:09:03'),
(2, 'A', '2020-1-1 12:10:03'),
(4, 'A', '2020-1-1 12:01:03'),
(4, 'C', '2020-1-1 12:11:05'),
(4, 'D', '2020-1-1 12:15:05'),
(1, 'A', '2020-1-2 12:01:03'),
(2, 'A', '2020-1-2 12:01:04'),
(3, 'A', '2020-1-2 12:01:05'),
(1, 'B', '2020-1-2 12:03:03'),
(1, 'A', '2020-1-2 12:04:03'),
(1, 'C', '2020-1-2 12:06:03'),
(2, 'A', '2020-1-2 12:07:04'),
(3, 'B', '2020-1-2 12:08:05'),
(3, 'E', '2020-1-2 12:09:05'),
(3, 'D', '2020-1-2 12:11:05'),
(2, 'C', '2020-1-2 12:09:03'),
(4, 'E', '2020-1-2 12:05:03'),
(4, 'B', '2020-1-2 12:06:03'),
(4, 'E', '2020-1-2 12:07:03'),
(2, 'A', '2020-1-2 12:10:03');
hive> select * from tracking_log;
OK
1 A 2020-1-1 12:01:03
2 A 2020-1-1 12:01:04
3 A 2020-1-1 12:01:05
1 B 2020-1-1 12:03:03
1 A 2020-1-1 12:04:03
1 C 2020-1-1 12:06:03
1 D 2020-1-1 12:11:03
2 A 2020-1-1 12:07:04
3 C 2020-1-1 12:02:05
2 C 2020-1-1 12:09:03
2 A 2020-1-1 12:10:03
4 A 2020-1-1 12:01:03
4 C 2020-1-1 12:11:05
4 D 2020-1-1 12:15:05
1 A 2020-1-2 12:01:03
2 A 2020-1-2 12:01:04
3 A 2020-1-2 12:01:05
1 B 2020-1-2 12:03:03
1 A 2020-1-2 12:04:03
1 C 2020-1-2 12:06:03
2 A 2020-1-2 12:07:04
3 B 2020-1-2 12:08:05
3 E 2020-1-2 12:09:05
3 D 2020-1-2 12:11:05
2 C 2020-1-2 12:09:03
4 E 2020-1-2 12:05:03
4 B 2020-1-2 12:06:03
4 E 2020-1-2 12:07:03
2 A 2020-1-2 12:10:03
Time taken: 0.095 seconds, Fetched: 29 row(s)
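The sample data stores op_time as a string whose month and day are not zero-padded. Ordering by that string works for these rows, but not in general: '2020-1-10 ...' sorts before '2020-1-2 ...' lexicographically. The following is a minimal sketch of one way to derive a sortable value, assuming the pattern 'yyyy-M-d HH:mm:ss' matches every row; the alias op_time_norm is illustrative and not part of the original table.

```sql
-- Sketch: normalize the non-zero-padded op_time string.
-- unix_timestamp(string, pattern) parses the text into epoch seconds;
-- from_unixtime() formats it back as 'yyyy-MM-dd HH:mm:ss', which sorts chronologically.
select user_id
      ,op_id
      ,op_time
      ,from_unixtime(unix_timestamp(op_time, 'yyyy-M-d HH:mm:ss')) as op_time_norm -- hypothetical alias
from tracking_log;
```

A window could then use order by op_time_norm instead of order by op_time.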
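(2) Data analysis
Path analysis is essentially string-sequence analysis.
The sequence is built with concat_ws(',', collect_set()). Note that collect_set removes duplicates, so each op_id appears at most once in the resulting string.
The SQL for the first question is: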
select to_date(op_time) as dt
      ,count(distinct user_id)
from
(
    select user_id
          ,op_time
          ,concat_ws(',',collect_set(op_id) over(partition by user_id order by op_time)) as op_id_str -- the user's operation trajectory over time
    from tracking_log
) t
where locate('A,B',op_id_str) > 0 -- keep rows whose trajectory contains the string 'A,B'
group by to_date(op_time)
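locate() tests whether the sequence 'A,B' occurs in the string: it returns the 1-based position of the first match, or 0 when there is no match, so a result greater than 0 means the sequence is present.
The result is as follows: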
OK
2020-01-01 2
2020-01-02 2
Time taken: 10.599 seconds, Fetched: 2 row(s)
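The query above relies on collect_set, which drops repeated operations, and its window is partitioned by user_id only, so a trajectory carries over from one day to the next. If the requirement is read strictly (B is the very next operation after A, within the same day), one alternative is to compare each row with its successor using lead(). The following is a minimal sketch under that reading, not the method used in this article; the aliases next_op_id and user_cnt are illustrative.

```sql
-- Sketch: strict same-day adjacency check with lead().
-- Assumption: "adjacent" means B is the immediately following operation
-- of the same user on the same calendar day.
select dt
      ,count(distinct user_id) as user_cnt
from
(
    select user_id
          ,to_date(op_time) as dt
          ,op_id
          ,lead(op_id) over(partition by user_id, to_date(op_time) order by op_time) as next_op_id -- next operation that day
    from tracking_log
) t
where op_id = 'A' and next_op_id = 'B'
group by dt;
```

Tracing the sample rows by hand, the condition holds for user 1 on 2020-1-1 and for users 1 and 3 on 2020-1-2.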
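Second question: we need to match the path A-B-D, where any other browsing records may appear between A and B, and any records except C may appear between B and D, so we solve it with wildcard pattern matching on the string using like.
The user trajectory data is as follows: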
select user_id
      ,op_time
      ,concat_ws(',',collect_set(op_id) over(partition by user_id order by op_time)) as op_id_str
from tracking_log
--------------------------------------------------------------------------------
OK
1 2020-1-1 12:01:03 A
1 2020-1-1 12:03:03 A,B
1 2020-1-1 12:04:03 A,B
1 2020-1-1 12:06:03 A,B,C
1 2020-1-1 12:11:03 A,B,C,D
1 2020-1-2 12:01:03 A,B,C,D
1 2020-1-2 12:03:03 A,B,C,D
1 2020-1-2 12:04:03 A,B,C,D
1 2020-1-2 12:06:03 A,B,C,D
2 2020-1-1 12:01:04 A
2 2020-1-1 12:07:04 A
2 2020-1-1 12:09:03 A,C
2 2020-1-1 12:10:03 A,C
2 2020-1-2 12:01:04 A,C
2 2020-1-2 12:07:04 A,C
2 2020-1-2 12:09:03 A,C
2 2020-1-2 12:10:03 A,C
3 2020-1-1 12:01:05 A
3 2020-1-1 12:02:05 A,C
3 2020-1-2 12:01:05 A,C
3 2020-1-2 12:08:05 A,C,B
3 2020-1-2 12:09:05 A,C,B,E
3 2020-1-2 12:11:05 A,C,B,E,D
4 2020-1-1 12:01:03 A
4 2020-1-1 12:11:05 A,C
4 2020-1-1 12:15:05 A,C,D
4 2020-1-2 12:05:03 A,C,D,E
4 2020-1-2 12:06:03 A,C,D,E,B
4 2020-1-2 12:07:03 A,C,D,E,B
Time taken: 6.456 seconds, Fetched: 29 row(s)
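Because collect_set keeps each op_id only once, the strings above do not show repeated operations, and they continue across days. A variant that keeps repeats and restarts every day can be sketched with collect_list, assuming your Hive version accepts collect_list as a window function in the same way it accepts collect_set here; the alias op_id_list_str is illustrative.

```sql
-- Sketch: per-day trajectory that keeps repeated operations.
-- Assumption: collect_list is usable as a windowed aggregate in this Hive version.
select user_id
      ,op_time
      ,concat_ws(',',collect_list(op_id) over(partition by user_id, to_date(op_time) order by op_time)) as op_id_list_str
from tracking_log;
```

For user 1 on 2020-1-1, for example, the last row of the day would read A,B,A,C,D rather than A,B,C,D.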
The final SQL is:
select to_date(op_time) as dt
      ,count(distinct user_id)
from
(
    select user_id
          ,op_time
          ,concat_ws(',',collect_set(op_id) over(partition by user_id order by op_time)) as op_id_str
    from tracking_log
) t
where op_id_str like '%A%B%D' and op_id_str not like '%A%B%C%D'
group by to_date(op_time)
The result is as follows:
--------------------------------------------------------------------------------
OK
2020-01-02 1
Time taken: 10.544 seconds, Fetched: 1 row(s)
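The pair of like conditions works on the deduplicated strings shown earlier. If repeated operations are kept, "not like '%A%B%C%D'" can reject a valid path such as A,B,C,B,D, where the second B reaches D without passing through C. A single regular expression can state the rule directly. The following is a hedged sketch, assuming single-character op_id values, a per-day reading of the path, and collect_list being usable as a window function; user_cnt is an illustrative alias.

```sql
-- Sketch: A ... B ... D with no C between that B and the D, via rlike.
-- 'A.*B[^C]*D' : an A, anything, a B, then any characters except C, then a D.
-- rlike matches if any substring of op_id_str satisfies the pattern.
select dt
      ,count(distinct user_id) as user_cnt
from
(
    select user_id
          ,to_date(op_time) as dt
          ,concat_ws(',',collect_list(op_id) over(partition by user_id, to_date(op_time) order by op_time)) as op_id_str
    from tracking_log
) t
where op_id_str rlike 'A.*B[^C]*D'
group by dt;
```

Tracing the sample rows by hand, only user 3 on 2020-1-2 (A,B,E,D) satisfies the pattern, which agrees with the result above.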
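This article presented a method for analyzing user behavior paths: convert each user's path into a string sequence, then use like for wildcard (fuzzy) matching of the path.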