为了突出重点,总结就写在最前面了。从拿到需求开始,我们经历了以下步骤来完成工作:
其中数据清洗主要是做了两个工作:
实现的判断依据如下:
实现过程中主要使用了下面的函数:
一般来说,客户会按照A->B->C->D的顺序来访问页面,而且越到后面的页面访问率就越低,比如A是网站首页,B是列表页,C是详情页,D是支付页,根据常识,很容易知道A页面的访问率是最高的而D则是最低的。
现在有一个需求:使用SQL语句来完成以下的工作:
这个需求表述的不是很清楚,我们再来明确一下各种情况应该怎么处理。
首先要完成这种数据漏斗,必须能够将一个用户的每次访问严格的界定开来,即有办法区分每个用户的每次访问都请求了哪些网址,这项工作显然不应该是我们这次的任务,故假定数据集符合该要求。
其次,还有一些特殊的情况的处理需要进一步明确:
对于第1个问题,按照最后一次访问的时间为准;对于第2个问题,则约定其为乱序访问。
我们并没有现成的数据集可以使用。为了说明问题,需要我们自己创建一个数据集用于测试。
为了说明问题方便,在不影响结论的情况下,应该让表结构尽可能的简单。必须的字段包括:
用等价类划分法来给一些测试数据。等价类大概划成下面这样:
基于这样的等价类划分,可以给出下面的测试数据集:
页面无缺失,顺序,超过间隔,无重复页面
session_id req_url req_time
s_01 a.html 2017-03-26 08:00:00
s_01 b.html 2017-03-26 10:01:00
s_01 c.html 2017-03-26 10:03:00
s_01 d.html 2017-03-26 10:04:00
页面无缺失,顺序,超过间隔,有重复页面
session_id req_url req_time
s_02 a.html 2017-03-26 08:00:00
s_02 a.html 2017-03-26 09:00:00
s_02 b.html 2017-03-26 11:01:00
s_02 c.html 2017-03-26 11:03:00
s_02 d.html 2017-03-26 11:04:00
页面无缺失,顺序,间隔不足,无重复页面
session_id req_url req_time
s_03 a.html 2017-03-26 08:00:00
s_03 b.html 2017-03-26 09:01:00
s_03 c.html 2017-03-26 09:03:00
s_03 d.html 2017-03-26 09:04:00
页面无缺失,顺序,间隔不足,有重复页面
session_id req_url req_time
s_04 a.html 2017-03-26 08:00:00
s_04 a.html 2017-03-26 09:00:00
s_04 b.html 2017-03-26 10:01:00
s_04 c.html 2017-03-26 10:03:00
s_04 d.html 2017-03-26 10:04:00
页面无缺失,顺序访问完后乱序,超过间隔,有重复页面
session_id req_url req_time
s_05 a.html 2017-03-26 08:00:00
s_05 b.html 2017-03-26 10:01:00
s_05 c.html 2017-03-26 10:03:00
s_05 d.html 2017-03-26 10:04:00
s_05 c.html 2017-03-26 11:04:00
页面无缺失,顺序访问完后乱序,间隔不足,有重复页面
session_id req_url req_time
s_06 a.html 2017-03-26 09:00:00
s_06 b.html 2017-03-26 10:01:00
s_06 c.html 2017-03-26 10:03:00
s_06 d.html 2017-03-26 10:04:00
s_06 c.html 2017-03-26 11:04:00
页面无缺失,其他乱序,超过间隔,有重复页面
session_id req_url req_time
s_07 a.html 2017-03-26 07:00:00
s_07 b.html 2017-03-26 10:01:00
s_07 c.html 2017-03-26 10:03:00
s_07 d.html 2017-03-26 10:04:00
s_07 c.html 2017-03-26 11:04:00
页面无缺失,其他乱序,超过间隔,无重复页面
session_id req_url req_time
s_08 a.html 2017-03-26 07:00:00
s_08 c.html 2017-03-26 10:01:00
s_08 b.html 2017-03-26 10:03:00
s_08 d.html 2017-03-26 10:04:00
页面无缺失,其他乱序,间隔不足,有重复页面
session_id req_url req_time
s_09 a.html 2017-03-26 09:00:00
s_09 b.html 2017-03-26 10:01:00
s_09 c.html 2017-03-26 10:03:00
s_09 d.html 2017-03-26 10:04:00
s_09 c.html 2017-03-26 11:04:00
页面无缺失,其他乱序,间隔不足,无重复页面
session_id req_url req_time
s_10 a.html 2017-03-26 09:00:00
s_10 c.html 2017-03-26 10:01:00
s_10 b.html 2017-03-26 10:03:00
s_10 d.html 2017-03-26 10:04:00
页面缺失,其他乱序,超过间隔,有重复页面
session_id req_url req_time
s_11 a.html 2017-03-26 07:00:00
s_11 b.html 2017-03-26 10:03:00
s_11 d.html 2017-03-26 10:04:00
s_11 d.html 2017-03-26 11:04:00
页面缺失,其他乱序,超过间隔,无重复页面
session_id req_url req_time
s_12 a.html 2017-03-26 07:00:00
s_12 b.html 2017-03-26 10:03:00
s_12 d.html 2017-03-26 10:04:00
页面缺失,其他乱序,间隔不足,有重复页面
session_id req_url req_time
s_13 a.html 2017-03-26 07:00:00
s_13 d.html 2017-03-26 10:04:00
s_13 d.html 2017-03-26 11:04:00
页面缺失,其他乱序,间隔不足,无重复页面
session_id req_url req_time
s_14 a.html 2017-03-26 07:00:00
s_14 d.html 2017-03-26 10:04:00
有了数据集,我们就可以给出下面的测试用例:
数据漏斗 | 期望输出 | 实际输出 |
---|---|---|
乱序漏斗 | s_05 s_06 s_07 s_08 s_09 s_10 s_11 s_12 s_13 s_14 |
|
顺序漏斗 | s_01 s_02 s_03 s_04 |
|
顺序间隔漏斗 | s_01 s_02 |
完成了需求分析以及测试数据和测试用例的准备以后,我们就可以开始实现这几个数据漏斗了。
使用下面的命令在Hive中创建一个表,用于存放用户访问数据。
use test_db;
create table t_visitlog(session_id string,req_url string,req_time timestamp)
row format delimited
fields terminated by ',';
然后创建一个数据文件,visitlog.data,输入下面的内容:
s_01,a.html,2017-03-26 08:00:00
s_01,b.html,2017-03-26 10:01:00
s_01,c.html,2017-03-26 10:03:00
s_01,d.html,2017-03-26 10:04:00
s_02,a.html,2017-03-26 08:00:00
s_02,a.html,2017-03-26 09:00:00
s_02,b.html,2017-03-26 11:01:00
s_02,c.html,2017-03-26 11:03:00
s_02,d.html,2017-03-26 11:04:00
s_03,a.html,2017-03-26 08:00:00
s_03,b.html,2017-03-26 09:01:00
s_03,c.html,2017-03-26 09:03:00
s_03,d.html,2017-03-26 09:04:00
s_04,a.html,2017-03-26 08:00:00
s_04,a.html,2017-03-26 09:00:00
s_04,b.html,2017-03-26 10:01:00
s_04,c.html,2017-03-26 10:03:00
s_04,d.html,2017-03-26 10:04:00
s_05,a.html,2017-03-26 08:00:00
s_05,b.html,2017-03-26 10:01:00
s_05,c.html,2017-03-26 10:03:00
s_05,d.html,2017-03-26 10:04:00
s_05,c.html,2017-03-26 11:04:00
s_06,a.html,2017-03-26 09:00:00
s_06,b.html,2017-03-26 10:01:00
s_06,c.html,2017-03-26 10:03:00
s_06,d.html,2017-03-26 10:04:00
s_06,c.html,2017-03-26 11:04:00
s_07,a.html,2017-03-26 07:00:00
s_07,b.html,2017-03-26 10:01:00
s_07,c.html,2017-03-26 10:03:00
s_07,d.html,2017-03-26 10:04:00
s_07,c.html,2017-03-26 11:04:00
s_08,a.html,2017-03-26 07:00:00
s_08,c.html,2017-03-26 10:01:00
s_08,b.html,2017-03-26 10:03:00
s_08,d.html,2017-03-26 10:04:00
s_09,a.html,2017-03-26 09:00:00
s_09,b.html,2017-03-26 10:01:00
s_09,c.html,2017-03-26 10:03:00
s_09,d.html,2017-03-26 10:04:00
s_09,c.html,2017-03-26 11:04:00
s_10,a.html,2017-03-26 09:00:00
s_10,c.html,2017-03-26 10:01:00
s_10,b.html,2017-03-26 10:03:00
s_10,d.html,2017-03-26 10:04:00
s_11,a.html,2017-03-26 07:00:00
s_11,b.html,2017-03-26 10:03:00
s_11,d.html,2017-03-26 10:04:00
s_11,d.html,2017-03-26 11:04:00
s_12,a.html,2017-03-26 07:00:00
s_12,b.html,2017-03-26 10:03:00
s_12,d.html,2017-03-26 10:04:00
s_13,a.html,2017-03-26 07:00:00
s_13,d.html,2017-03-26 10:04:00
s_13,d.html,2017-03-26 11:04:00
s_14,a.html,2017-03-26 07:00:00
s_14,d.html,2017-03-26 10:04:00
然后使用下面的命令将其上传到hdfs中。
hadoop fs -put visitlog.data /user/hive/warehouse/test_db.db/t_visitlog/
然后使用下面的SQL语句检查是否上传成功:
select * from t_visitlog;
我这边是上传成功的。到这里我们就完成了准备工作。
接下来,我们要将数据进行一些处理,使得后面的工作更容易开展,主要工作包括:
使用下面的SQL语句即可实现去掉每一次访问中重复的页面记录,同时只保留每个页面最后一次访问记录。
select session_id,req_url,req_time from(
select session_id,req_url,req_time,rank() over(partition by session_id,req_url order by req_time desc) as rank
from(
select session_id,req_url,req_time
from t_visitlog
distribute by session_id,req_url
sort by session_id,req_time desc
)a
)b
where rank = 1;
为了粘贴方便,这里就不缩进了。
为了方便后续的操作,我们使用下面的SQL语句将查询结果放到一个新的表t_vlog_norepeat中。
使用下面的命令来查看一下新创建的表的结构:
desc t_vlog_norepeat;
输出结果如下:
session_id string
req_url string
req_time timestamp
是符合我们要求的。
使用下面的命令将访问记录保存到t_vlog_merge表中去。
create table t_vlog_merge as
select session_id,concat_ws(',',collect_set(vr)) as vtext
from(
select distinct session_id,concat_ws('_',cast(req_time as string),req_url) as vr
from (
select session_id,req_url,req_time from t_vlog_norepeat sort by session_id,req_time asc
)a
) temp
group by session_id;
结果如下:
s_01 2017-03-26 08:00:00_a.html,2017-03-26 10:01:00_b.html,2017-03-26 10:03:00_c.html,2017-03-26 10:04:00_d.html
s_02 2017-03-26 09:00:00_a.html,2017-03-26 11:01:00_b.html,2017-03-26 11:03:00_c.html,2017-03-26 11:04:00_d.html
s_03 2017-03-26 08:00:00_a.html,2017-03-26 09:01:00_b.html,2017-03-26 09:03:00_c.html,2017-03-26 09:04:00_d.html
s_04 2017-03-26 09:00:00_a.html,2017-03-26 10:01:00_b.html,2017-03-26 10:03:00_c.html,2017-03-26 10:04:00_d.html
s_05 2017-03-26 08:00:00_a.html,2017-03-26 10:01:00_b.html,2017-03-26 10:04:00_d.html,2017-03-26 11:04:00_c.html
s_06 2017-03-26 09:00:00_a.html,2017-03-26 10:01:00_b.html,2017-03-26 10:04:00_d.html,2017-03-26 11:04:00_c.html
s_07 2017-03-26 07:00:00_a.html,2017-03-26 10:01:00_b.html,2017-03-26 10:04:00_d.html,2017-03-26 11:04:00_c.html
s_08 2017-03-26 07:00:00_a.html,2017-03-26 10:01:00_c.html,2017-03-26 10:03:00_b.html,2017-03-26 10:04:00_d.html
s_09 2017-03-26 09:00:00_a.html,2017-03-26 10:01:00_b.html,2017-03-26 10:04:00_d.html,2017-03-26 11:04:00_c.html
s_10 2017-03-26 09:00:00_a.html,2017-03-26 10:01:00_c.html,2017-03-26 10:03:00_b.html,2017-03-26 10:04:00_d.html
s_11 2017-03-26 07:00:00_a.html,2017-03-26 10:03:00_b.html,2017-03-26 11:04:00_d.html
s_12 2017-03-26 07:00:00_a.html,2017-03-26 10:03:00_b.html,2017-03-26 10:04:00_d.html
s_13 2017-03-26 07:00:00_a.html,2017-03-26 11:04:00_d.html
s_14 2017-03-26 07:00:00_a.html,2017-03-26 10:04:00_d.html
我们来分析一下乱序漏斗的几个特征,第1个是按照”,”分割后长度不为4,第2个就是按照”,”分割后页面顺序不对。
使用下面的代码即可实现乱序漏斗:
select session_id
from(
select session_id,vtext,split(vtext,',') as arr from t_vlog_merge
) temp
where size(arr) != 4 or split(arr[0],'_')[1] != 'a.html'
or split(arr[1],'_')[1] != 'b.html' or split(arr[2],'_')[1] != 'c.html'
or split(arr[3],'_')[1] != 'd.html';
输出结果为:
s_05
s_06
s_07
s_08
s_09
s_10
s_11
s_12
s_13
s_14
使用下面的代码即可实现顺序漏斗:
select session_id
from(
select session_id,vtext,split(vtext,',') as arr from t_vlog_merge
) temp
where size(arr) == 4 and split(arr[0],'_')[1] == 'a.html'
and split(arr[1],'_')[1] == 'b.html' and split(arr[2],'_')[1] == 'c.html'
and split(arr[3],'_')[1] == 'd.html';
输出结果:
s_01
s_02
s_03
s_04
使用下面的代码可以实现顺序间隔漏斗(访问A页面2小时以后访问B):
select session_id
from(
select session_id,vtext,split(vtext,',') as arr from t_vlog_merge
) temp
where size(arr) == 4 and split(arr[0],'_')[1] == 'a.html'
and split(arr[1],'_')[1] == 'b.html' and split(arr[2],'_')[1] == 'c.html'
and split(arr[3],'_')[1] == 'd.html'
and unix_timestamp(split(arr[1],'_')[0]) - unix_timestamp(split(arr[0],'_')[0]) > 7200;
输出结果:
s_01
s_02
上面已经测试过了,测试结果如下:
数据漏斗 | 期望输出 | 实际输出 |
---|---|---|
乱序漏斗 | s_05 s_06 s_07 s_08 s_09 s_10 s_11 s_12 s_13 s_14 |
s_05 s_06 s_07 s_08 s_09 s_10 s_11 s_12 s_13 s_14 |
顺序漏斗 | s_01 s_02 s_03 s_04 |
s_01 s_02 s_03 s_04 |
顺序间隔漏斗 | s_01 s_02 |
s_01 s_02 |
http://blog.csdn.net/suiyingli39/article/details/53319704
http://blog.csdn.net/hua245942641/article/details/50298989
http://blog.csdn.net/liyantianmin/article/details/48262109
http://blog.csdn.net/mtj66/article/details/52629876
https://yq.aliyun.com/articles/25890