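This project takes real user behavior data from the Taobao e-commerce platform as its data source, cleans it with Navicat 12 for MySQL, analyzes it with the AARRR and RFM models, and builds the visualizations in Power BI.
The data comes from Alibaba Tianchi: UserBehavior.csv
The dataset contains all behaviors (clicks, purchases, add-to-cart actions, and favorites) of about one million randomly selected users who were active between November 25, 2017 and December 3, 2017. Each row represents one user behavior and consists of a user ID, item ID, item category ID, behavior type, and timestamp, separated by commas.
The dataset fields are as follows:
User ID: string, the ID of the user
Item ID: string, the ID of the item
Category ID: string, the ID of the category the item belongs to
Behavior type: string, one of pv (viewing an item detail page), buy (purchasing an item), cart (adding an item to the cart), fav (adding an item to favorites)
Timestamp: the Unix timestamp at which the behavior occurred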
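Import the dataset into MySQL through Navicat. Because the original dataset is large, this analysis imports only 100,000 rows.
If the timestamps field is set to the datetime type during import, every value in that field shows as 0. To import successfully, choose varchar as the type of timestamps and convert it to a date afterwards with MySQL statements.
The imported data has no column names, so give each field an English column name:
user ID: userid, item ID: itemid, category ID: categoryid, behavior type: behavior, timestamp: timestamps
Setting userid, itemid, and timestamps together as a composite primary key succeeds, which shows that the data contains no duplicate rows.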
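The same duplicate check can be sketched outside the database. This is a minimal illustration, assuming the rows have been loaded as dictionaries with the column names above; the function name `find_duplicate_keys` and the sample rows are hypothetical and not part of the original project.

```python
from collections import Counter

def find_duplicate_keys(rows):
    """Return the (userid, itemid, timestamps) keys that occur more than once."""
    counts = Counter((r["userid"], r["itemid"], r["timestamps"]) for r in rows)
    return [key for key, n in counts.items() if n > 1]

# Hypothetical sample rows, not taken from the dataset.
sample_rows = [
    {"userid": "1", "itemid": "100", "timestamps": "1511544070"},
    {"userid": "1", "itemid": "100", "timestamps": "1511544070"},  # exact repeat
    {"userid": "1", "itemid": "101", "timestamps": "1511544070"},
]
print(find_duplicate_keys(sample_rows))
```

An empty list corresponds to the case where the composite primary key can be created without errors.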
Use the COUNT function to count the non-null values in each field; if all the counts match, no field has missing values.
SELECT COUNT(userid),
COUNT(itemid),
COUNT(categoryid),
COUNT(behavior),
COUNT(timestamps)
FROM UserBehavior;
Add a dates column for the date and an hours column for the time of day:
ALTER TABLE UserBehavior ADD dates varchar(255);
UPDATE UserBehavior SET dates=FROM_UNIXTIME(timestamps,'%Y-%m-%d');
ALTER TABLE UserBehavior ADD hours varchar(255);
UPDATE UserBehavior SET hours=FROM_UNIXTIME(timestamps,'%H:%i:%s');
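To make the conversion concrete, here is a small sketch of the same split into a date string and a time string. It assumes the session time zone is UTC+8 (China Standard Time), which the original does not state; `split_timestamp` is a hypothetical helper name.

```python
from datetime import datetime, timezone, timedelta

# Assumption: timestamps are interpreted in UTC+8 (China Standard Time).
CST = timezone(timedelta(hours=8))

def split_timestamp(ts):
    """Convert a Unix timestamp (int or numeric string) to (date, time) strings."""
    moment = datetime.fromtimestamp(int(ts), tz=CST)
    # In Python, %M is minutes and %m is the month; MySQL uses %i for minutes.
    return moment.strftime("%Y-%m-%d"), moment.strftime("%H:%M:%S")

print(split_timestamp("1511539200"))
```

Passing the time zone explicitly keeps the result independent of the machine the sketch runs on.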
The main check is whether the date column contains data outside 2017-11-25 to 2017-12-03; querying the minimum and maximum dates is enough.
SELECT min(dates),max(dates) from userbehavior;
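The result shows that the earliest date is September 11, 2017, which is before November 25, 2017. Next, query the rows that fall outside the required range and delete them.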
SELECT * from userbehavior where dates < '2017-11-25';
DELETE from userbehavior where dates<'2017-11-25';
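Before deleting data, double-check the condition carefully. A small mistake can make the data unrecoverable, leaving no option but to re-import and start over. After the deletion, check the date range again.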
SELECT min(dates),max(dates) from userbehavior;
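One way to preview what a delete would remove is to split the rows first and inspect the dropped part. The sketch below is illustrative only: `split_by_date_range` is a hypothetical name, the sample rows are invented, and it checks both ends of the range, whereas the statements above only remove dates before 2017-11-25.

```python
def split_by_date_range(rows, start="2017-11-25", end="2017-12-03"):
    """Split rows into (kept, dropped) by their 'dates' value, bounds inclusive."""
    kept, dropped = [], []
    for row in rows:
        # ISO 'YYYY-MM-DD' strings sort in the same order as the dates they encode.
        (kept if start <= row["dates"] <= end else dropped).append(row)
    return kept, dropped

# Hypothetical sample rows, not taken from the dataset.
sample_rows = [{"dates": "2017-09-11"}, {"dates": "2017-11-25"}, {"dates": "2017-12-03"}]
kept, dropped = split_by_date_range(sample_rows)
print(len(kept), len(dropped))
```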
Count the number of users, items, item categories, and behavior types.
SELECT COUNT(DISTINCT userid) AS 'customer',
COUNT(DISTINCT itemid) AS 'item',
COUNT(DISTINCT categoryid) AS 'category',
COUNT(DISTINCT behavior) AS 'behaviortype'
FROM UserBehavior;
The result shows that the dataset contains 983 users, 64,440 items, 3,128 item categories, and 4 behavior types.
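In the AARRR model, user acquisition is usually evaluated with metrics such as channel exposure rate, channel conversion rate, daily new users (DNU), and customer acquisition cost (CAC).
Because of the limited fields, this analysis looks mainly at daily new users (DNU), where a user counts as new on the first date they appear in the data. As the chart shows, new users appear only on the six days from November 25 to 30, and the number drops sharply starting on the 26th. Because the 25th is the first day of the observation window, a large share of the users counted as new on that day were already active earlier rather than newly acquired, but even so the number of new users falls to single digits by the 30th.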
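The DNU calculation can be sketched as follows, assuming a user is "new" on the earliest date on which they appear. The function name `daily_new_users` and the sample rows are hypothetical.

```python
from collections import Counter

def daily_new_users(rows):
    """Count users by the first date on which each one appears."""
    first_seen = {}
    for row in rows:
        uid, day = row["userid"], row["dates"]
        if uid not in first_seen or day < first_seen[uid]:
            first_seen[uid] = day
    return dict(sorted(Counter(first_seen.values()).items()))

# Hypothetical sample rows, not taken from the dataset.
sample_rows = [
    {"userid": "u1", "dates": "2017-11-26"},
    {"userid": "u1", "dates": "2017-11-25"},
    {"userid": "u2", "dates": "2017-11-26"},
    {"userid": "u3", "dates": "2017-11-25"},
]
print(daily_new_users(sample_rows))
```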
After cleaning, 99,955 rows remain. PV is 89,664, UV is 983, and the average number of page views per user is about 91.2.
select count(distinct userid) as 'UV',
sum(case when behavior='pv' then 1 else 0 end) as 'PV',
sum(case when behavior='pv' then 1 else 0 end)/count(distinct userid) as '人均浏览次数'
from userbehavior;
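For readers who want to verify these three figures outside MySQL, the following is a rough equivalent. `traffic_summary` is a hypothetical name and the sample rows are invented.

```python
def traffic_summary(rows):
    """Return page views, unique visitors, and page views per visitor."""
    pv = sum(1 for row in rows if row["behavior"] == "pv")
    uv = len({row["userid"] for row in rows})
    return {"PV": pv, "UV": uv, "views_per_user": pv / uv if uv else 0.0}

# Hypothetical sample rows, not taken from the dataset.
sample_rows = [
    {"userid": "u1", "behavior": "pv"},
    {"userid": "u1", "behavior": "pv"},
    {"userid": "u1", "behavior": "buy"},
    {"userid": "u2", "behavior": "pv"},
]
print(traffic_summary(sample_rows))
```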
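User behavior is examined along three time dimensions: date, hour of day, and day of week.
In the daily view, pv and cart both rise considerably from November 30 to December 2, and fav also shows a clear upward trend, while buy continues to fluctuate only slightly. A likely reason is that Double 12 (the December 12 shopping festival) is approaching, and users are collecting items and waiting for the deepest discounts before placing orders.
By hour of day, 20:00 to 22:00 is the peak of user activity, with a smaller peak at 15:00. Activity grows fastest between 18:00 and 19:00.
By day of week, users are more active on weekends, while activity fluctuates little on weekdays.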
# Add an onlyhours column first
alter table userbehavior add onlyhours VARCHAR(255);
update userbehavior set onlyhours=FROM_UNIXTIME(timestamps,'%H');
# User behavior by date and hour
select dates,onlyhours,
sum(case when behavior='pv' then 1 else 0 end) as pv,
sum(case when behavior='fav' then 1 else 0 end) as fav,
sum(case when behavior='cart' then 1 else 0 end) as cart,
sum(case when behavior='buy' then 1 else 0 end) as buy,
count(behavior) as all_click,
count(distinct userid) as all_user
from userbehavior GROUP BY dates,onlyhours ORDER BY dates,onlyhours;
# User behavior by day of week
select date_format(dates,'%W') as weeks,
sum(case when behavior='pv' then 1 else 0 end) as pv,
sum(case when behavior='fav' then 1 else 0 end) as fav,
sum(case when behavior='cart' then 1 else 0 end) as cart,
sum(case when behavior='buy' then 1 else 0 end) as buy,
count(behavior) as all_click,
count(distinct userid) as all_user
from userbehavior GROUP BY weeks ORDER BY field(weeks,'Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday');
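The grouping by weekday and hour can also be sketched directly from the raw timestamps. This is only an illustration: it assumes UTC+8 as the time zone, and `behavior_by_weekday_hour` and the sample rows are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone, timedelta

CST = timezone(timedelta(hours=8))  # assumption: China Standard Time
WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")

def behavior_by_weekday_hour(rows):
    """Count behaviors keyed by (weekday name, hour of day, behavior type)."""
    counts = Counter()
    for row in rows:
        moment = datetime.fromtimestamp(int(row["timestamps"]), tz=CST)
        counts[(WEEKDAYS[moment.weekday()], moment.hour, row["behavior"])] += 1
    return counts

# Hypothetical sample rows, not taken from the dataset.
sample_rows = [
    {"timestamps": "1511539200", "behavior": "pv"},
    {"timestamps": str(1511539200 + 20 * 3600), "behavior": "cart"},
]
print(behavior_by_weekday_hour(sample_rows))
```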
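Browse-page bounce rate = users with only click behavior / total UV, which is 7.02%. It counts users who have only pv behavior and no fav, cart, or buy behavior. A low browse-page bounce rate suggests that users have some interest in the landing pages and recommended items.
Key-page bounce rate = users who favorited or added to cart but did not buy / total UV, which is 51.88%. Consistent with the daily behavior pattern above, Double 12 is approaching, so many users favorite items or add them to the cart and wait for bigger discounts before buying. Stock shortages or unavailable sizes and colors may also contribute.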
# Browse-page bounce rate
create view bounce_rate as
select (select count(distinct userid) from userbehavior) as 总用户,
count(distinct userid) as 仅pv用户,
concat(format(count(distinct userid)/(select count(distinct userid) from userbehavior) * 100,2),'%') as 浏览页跳失率
from userbehavior
where userid not in
(select distinct userid from userbehavior where behavior='fav')
and userid not in
(select distinct userid from userbehavior where behavior='cart')
and userid not in
(select distinct userid from userbehavior where behavior='buy') ;
# Key-page bounce rate
create view key_rate as
select (select count(distinct userid) from userbehavior) as 总用户,
count(distinct userid) as 收藏加购用户,
concat(format(count(distinct userid)/(select count(distinct userid) from userbehavior)*100,2),'%') as 关键页跳失率
from userbehavior
# AND binds tighter than OR, so the fav/cart condition must be parenthesized
where (userid in (select distinct userid from userbehavior where behavior='fav')
or userid in (select distinct userid from userbehavior where behavior='cart'))
and userid not in (select distinct userid from userbehavior where behavior='buy');
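Both bounce rates reduce to set operations on user IDs, which the sketch below tries to make explicit. It follows the two definitions given in the text; `bounce_rates` is a hypothetical name and the sample rows are invented.

```python
from collections import defaultdict

def bounce_rates(rows):
    """Return (browse-page bounce rate, key-page bounce rate) as fractions of UV."""
    users_by_behavior = defaultdict(set)
    all_users = set()
    for row in rows:
        all_users.add(row["userid"])
        users_by_behavior[row["behavior"]].add(row["userid"])
    if not all_users:
        return 0.0, 0.0
    engaged = users_by_behavior["fav"] | users_by_behavior["cart"]
    buyers = users_by_behavior["buy"]
    pv_only = all_users - engaged - buyers      # never fav, cart, or buy
    engaged_no_buy = engaged - buyers           # (fav OR cart) AND NOT buy
    return len(pv_only) / len(all_users), len(engaged_no_buy) / len(all_users)

# Hypothetical sample rows, not taken from the dataset.
sample_rows = [
    {"userid": "u1", "behavior": "pv"},
    {"userid": "u2", "behavior": "pv"}, {"userid": "u2", "behavior": "cart"},
    {"userid": "u3", "behavior": "fav"}, {"userid": "u3", "behavior": "buy"},
    {"userid": "u4", "behavior": "pv"}, {"userid": "u4", "behavior": "buy"},
]
print(bounce_rates(sample_rows))
```

Writing the second set as `engaged - buyers` mirrors the grouping "(fav or cart) and not buy" from the definition.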
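Because the dataset covers only 9 days, this section looks mainly at next-day, 3-day, and 7-day retention. Retention rises after the start of December, suggesting that the Double 12 warm-up had some effect.
The code is as follows: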
create view time_inter as
select a.*,b.firstday,datediff(a.dates,b.firstday) as day_diff
from (select userid,dates from userbehavior group by userid,dates) as a,
(select userid,min(dates) as firstday from userbehavior GROUP BY userid) as b
where a.userid=b.userid ORDER BY userid,dates;
create view retention_day as
select firstday,
sum(case when day_diff=0 then 1 else 0 end) as day_0,
sum(case when day_diff=1 then 1 else 0 end) as day_1,
sum(case when day_diff=2 then 1 else 0 end) as day_2,
sum(case when day_diff=3 then 1 else 0 end) as day_3,
sum(case when day_diff=4 then 1 else 0 end) as day_4,
sum(case when day_diff=5 then 1 else 0 end) as day_5,
sum(case when day_diff=6 then 1 else 0 end) as day_6,
sum(case when day_diff=7 then 1 else 0 end) as day_7,
sum(case when day_diff=8 then 1 else 0 end) as day_8
from time_inter
group by firstday
order by firstday;
# Build the retention rate view retention_rate
create view retention_rate as
select firstday, day_0,
concat(format(day_1/day_0*100, 2), '%') as day_1,
concat(format(day_2/day_0*100, 2), '%') as day_2,
concat(format(day_3/day_0*100, 2), '%') as day_3,
concat(format(day_4/day_0*100, 2), '%') as day_4,
concat(format(day_5/day_0*100, 2), '%') as day_5,
concat(format(day_6/day_0*100, 2), '%') as day_6,
concat(format(day_7/day_0*100, 2), '%') as day_7,
concat(format(day_8/day_0*100, 2), '%') as day_8
from retention_day;
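The cohort logic behind the three views can be condensed into one function. The sketch assumes each row has a userid and an ISO-formatted dates value; `retention_table` is a hypothetical name and the sample rows are invented.

```python
from collections import defaultdict
from datetime import date

def retention_table(rows):
    """Map each first-active date to {day offset: share of that cohort active}."""
    active_days = defaultdict(set)
    for row in rows:
        active_days[row["userid"]].add(date.fromisoformat(row["dates"]))
    cohorts = defaultdict(lambda: defaultdict(int))
    for days in active_days.values():
        first = min(days)
        for day in days:
            cohorts[first][(day - first).days] += 1
    # Offset 0 is the cohort size, so dividing by it gives the retention rate.
    return {
        first.isoformat(): {k: n / counts[0] for k, n in sorted(counts.items())}
        for first, counts in sorted(cohorts.items())
    }

# Hypothetical sample rows, not taken from the dataset.
sample_rows = [
    {"userid": "u1", "dates": "2017-11-25"}, {"userid": "u1", "dates": "2017-11-26"},
    {"userid": "u2", "dates": "2017-11-25"},
    {"userid": "u3", "dates": "2017-11-26"}, {"userid": "u3", "dates": "2017-11-27"},
]
print(retention_table(sample_rows))
```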
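A user's purchase journey can follow four paths, matching the queries below: click → buy; click → add to cart → buy; click → favorite → buy; click → favorite → add to cart → buy.
Assume that these steps can only proceed in order or stop, and that no intermediate step can be skipped. To examine conversion in each flow more concretely, each path is broken into steps and the conversion rate of each step is computed.
Create a view that counts each user's behaviors for each item: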
create view c as
Select userid,itemid,
sum(case when behavior='pv' then 1 else 0 end) as '点击',
sum(case when behavior='fav' then 1 else 0 end) as '收藏',
sum(case when behavior='cart' then 1 else 0 end) as '加入购物车',
sum(case when behavior='buy' then 1 else 0 end) as '购买'
from userbehavior GROUP BY userid,itemid;
SELECT * FROM c;
select count(userid) as '点击' from c where 点击>0;
select count(userid) as '点击、购买' from c where 点击>0 and 加入购物车=0 and 收藏=0 and 购买>0;
select count(userid) as '点击、加入购物车' from c where 点击>0 and 收藏=0 and 加入购物车>0;
SELECT count(userid) as '点击、加入购物车、购买' from c where 点击>0 and 收藏=0 and 加入购物车>0 and 购买>0;
select count(userid) as '点击、收藏' from c where 点击>0 and 收藏>0 and 加入购物车=0;
SELECT count(userid) as '点击、收藏、购买' from c where 点击>0 and 加入购物车=0 and 购买>0 and 收藏>0;
SELECT count(userid) as '点击、收藏、加入购物车' from c where 点击>0 and 加入购物车>0 and 收藏>0;
SELECT count(userid) as '点击、收藏、加入购物车、购买' from c where 点击>0 and 加入购物车>0 and 购买>0 and 收藏>0;
# Funnel model: behavior funnel and unique-visitor funnel
select behavior,
count(behavior) as behavior_times, count(distinct userid) as user_times
from userbehavior
GROUP BY behavior order by field(behavior,'pv','fav','cart','buy');
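The per-path counts can be sketched in one pass over (user, item) pairs. This version assumes the four paths are mutually exclusive, so a pair with both a favorite and a cart action is counted only on the favorite-and-cart path; `path_counts`, the path labels, and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict

def path_counts(rows):
    """Count (user, item) pairs reaching each step of the four purchase paths."""
    seen = defaultdict(set)
    for row in rows:
        seen[(row["userid"], row["itemid"])].add(row["behavior"])
    counts = Counter()
    for behaviors in seen.values():
        if "pv" not in behaviors:
            continue  # every path starts with a click
        counts["pv"] += 1
        fav, cart, buy = "fav" in behaviors, "cart" in behaviors, "buy" in behaviors
        if fav and cart:
            path = "pv>fav>cart"
        elif fav:
            path = "pv>fav"
        elif cart:
            path = "pv>cart"
        else:
            path = "pv"
        if path != "pv":
            counts[path] += 1
        if buy:
            counts[path + ">buy"] += 1
    return counts

# Hypothetical sample rows, not taken from the dataset.
sample_rows = [
    {"userid": "u1", "itemid": "i1", "behavior": "pv"},
    {"userid": "u1", "itemid": "i1", "behavior": "buy"},
    {"userid": "u1", "itemid": "i2", "behavior": "pv"},
    {"userid": "u1", "itemid": "i2", "behavior": "cart"},
    {"userid": "u2", "itemid": "i1", "behavior": "pv"},
    {"userid": "u2", "itemid": "i1", "behavior": "fav"},
    {"userid": "u2", "itemid": "i1", "behavior": "cart"},
    {"userid": "u2", "itemid": "i1", "behavior": "buy"},
    {"userid": "u2", "itemid": "i3", "behavior": "buy"},
]
print(path_counts(sample_rows))
```

Dividing the count for a step by the count for the previous step on the same path gives that step's conversion rate.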
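During these nine days, users made between 0 and 28 purchases. A small number of users made more than 2 purchases, and users with more than 8 purchases may be heavy shoppers or may reflect fake orders placed to inflate sales. The overall repurchase rate is 65.87%, which is considerable. The platform could adjust the order and placement of home-page sections to match the preferences and habits of repeat buyers, offering more convenient service while encouraging users to explore other sections, so that more recommended items are shown and more traffic is generated.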
create view user_behavior_times as
select userid,
sum(case when behavior='pv' then 1 else 0 end) as pv_times,
sum(case when behavior='fav' then 1 else 0 end) as fav_times,
sum(case when behavior='cart' then 1 else 0 end) as cart_times,
sum(case when behavior='buy' then 1 else 0 end) as buy_times,
concat(format(sum(case when behavior='buy' then 1 else 0 end)/sum(case when behavior='pv' then 1 else 0 end)*100,2),'%') as 购买率,
sum(case when behavior='buy' then 1 else 0 end)/sum(case when behavior='pv' then 1 else 0 end) as sort
from userbehavior GROUP BY userid ORDER BY sort desc;
In this article, repurchase rate = users with more than 1 purchase / users with at least 1 purchase
create view repurchase_rate as
select concat(format((select count(userid) from user_behavior_times where buy_times>1)/(select count(userid) from user_behavior_times where buy_times>0)*100,2),'%') as 复购率;
select * from repurchase_rate;
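As a quick check of the definition above, the ratio can be computed directly from the purchase rows. `repurchase_rate` here is a hypothetical Python function, and the sample rows are invented.

```python
from collections import Counter

def repurchase_rate(rows):
    """Users with more than one purchase divided by users with at least one."""
    purchases = Counter(row["userid"] for row in rows if row["behavior"] == "buy")
    buyers = len(purchases)
    repeat_buyers = sum(1 for n in purchases.values() if n > 1)
    return repeat_buyers / buyers if buyers else 0.0

# Hypothetical sample rows, not taken from the dataset.
sample_rows = [
    {"userid": "u1", "behavior": "buy"},
    {"userid": "u1", "behavior": "buy"},
    {"userid": "u2", "behavior": "buy"},
    {"userid": "u3", "behavior": "pv"},
]
print(repurchase_rate(sample_rows))
```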
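The RFM model segments customers by value according to how active they are and how much they contribute in transaction amount. It measures user value with three indicators: the time since the customer's most recent transaction (Recency), the number of recent transactions (Frequency), and the amount of recent transactions (Monetary). The dataset has no transaction amount field, so the segmentation below uses only R and F.
Create the R view: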
create view R as
select userid,max(dates) as 'recency' from userbehavior where behavior='buy' GROUP BY userid;
select * from R;
create view R1 as
select userid,recency,
(case when datediff('2017-12-03',recency) between 0 and 2 then 4
when datediff('2017-12-03',recency) between 3 and 4 then 3
when datediff('2017-12-03',recency) between 5 and 6 then 2
when datediff('2017-12-03',recency) >6 then 1 end) as R1 from R;
select * from R1;
select avg(R1) as R_avg from R1;
Create the F view:
create view F as
select userid,count(behavior) as 购买次数 from userbehavior where behavior='buy' group by userid;
select * from F;
create view F1 as
select userid,购买次数,
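# Branches are checked in order, so each needs only an upper bound (a chained comparison such as 2<x<=4 does not work in MySQL)
(case when 购买次数<=2 then 1
when 购买次数<=4 then 2
when 购买次数<=8 then 3
else 4 end) as F1 from F;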
select * from F1;
select avg(F1) as F_avg from F1;
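# The thresholds 3.2846 and 1.4352 are the R_avg and F_avg values returned by the queries above; recompute them if the scoring rules change
create view RFM as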
select a.*,b.F1,
(case when a.R1>=3.2846 and b.F1>=1.4352 then '重要价值用户'
when a.R1>=3.2846 and b.F1<1.4352 then '重要发展用户'
when a.R1<3.2846 and b.F1>=1.4352 then '重要保持用户'
when a.R1<3.2846 and b.F1<1.4352 then '重要挽留用户' end) as 用户分类
from R1 as a,F1 as b where a.userid=b.userid;
select * from RFM;
select 用户分类,count(用户分类) as 用户个数 from RFM GROUP BY 用户分类;
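The scoring and segmentation rules can be summarized in a short sketch. It assumes the last purchase is never later than the reference date 2017-12-03, and it uses English renderings of the four labels in the RFM view; the function names are hypothetical.

```python
from datetime import date

def recency_score(last_buy, ref="2017-12-03"):
    """Score 4 (most recent) to 1 from the days between last purchase and ref."""
    gap = (date.fromisoformat(ref) - date.fromisoformat(last_buy)).days
    if gap <= 2:
        return 4
    if gap <= 4:
        return 3
    if gap <= 6:
        return 2
    return 1

def frequency_score(purchases):
    """Score 1 (fewest) to 4 from the number of purchases."""
    if purchases <= 2:
        return 1
    if purchases <= 4:
        return 2
    if purchases <= 8:
        return 3
    return 4

def segment(r, f, r_avg, f_avg):
    """Classify a user by comparing the R and F scores with their averages."""
    if r >= r_avg:
        return "important value user" if f >= f_avg else "important development user"
    return "important retention user" if f >= f_avg else "important win-back user"

print(segment(recency_score("2017-12-02"), frequency_score(5), 3.2846, 1.4352))
```

The two averages passed to `segment` are the thresholds used in the RFM view and would need to be recomputed for a different sample.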
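Sales by item category and by item are examined using the number of purchases.
The item categories with at least 10 purchases are shown below ▼
Although nearly 97% of items had no purchases, this is mainly because the observation window is short and purchases of some items are seasonal, which leads to low sales in this period. In addition, while these low-selling items serve only a small number of customers, their profit can accumulate through the long-tail effect. They also give the platform a layered shopping atmosphere, and some of them highlight the advantages of mainstream items, making customers more confident in buying.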
create view hot_item as
select categoryid,itemid,
sum(case when behavior='pv' then 1 else 0 end) as pv_times,
sum(case when behavior='buy' then 1 else 0 end) as buy_times
from userbehavior GROUP BY categoryid,itemid ORDER BY categoryid,itemid,buy_times;
select * from hot_item;
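The two figures discussed in this section, the share of items with no purchases and the categories with at least 10 purchases, can be derived along these lines. `sales_summary` and its `min_sales` parameter are hypothetical, and the sample rows are invented.

```python
from collections import Counter

def sales_summary(rows, min_sales=10):
    """Return (share of items never bought, [(category, purchases)] at or above min_sales)."""
    items = {row["itemid"] for row in rows}
    bought_items = {row["itemid"] for row in rows if row["behavior"] == "buy"}
    category_sales = Counter(
        row["categoryid"] for row in rows if row["behavior"] == "buy"
    )
    unsold_share = (len(items) - len(bought_items)) / len(items) if items else 0.0
    top = [(c, n) for c, n in category_sales.most_common() if n >= min_sales]
    return unsold_share, top

# Hypothetical sample rows, not taken from the dataset.
sample_rows = [
    {"itemid": "i1", "categoryid": "c1", "behavior": "buy"},
    {"itemid": "i1", "categoryid": "c1", "behavior": "buy"},
    {"itemid": "i2", "categoryid": "c1", "behavior": "pv"},
    {"itemid": "i3", "categoryid": "c2", "behavior": "pv"},
    {"itemid": "i4", "categoryid": "c2", "behavior": "fav"},
]
print(sales_summary(sample_rows, min_sales=2))
```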