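As the internet's demographic dividend fades, using big-data analysis to understand users in depth and run precisely targeted operations is becoming increasingly important. This project takes an e-commerce perspective: it uses the Taobao App user behavior dataset from Alibaba Tianchi and analyzes it with MySQL.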
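1 Understanding the data
Data source
Alibaba Tianchi: https://tianchi.aliyun.com/dataset/dataDetail?dataId=649
The dataset contains all behaviors (clicks, purchases, add-to-cart, favorites) of about one million randomly selected users who were active on the Taobao App between November 25, 2017 and December 3, 2017. Because of performance constraints, this analysis uses a random sample of 100,000 records.
Field definitions
user_id: user ID
item_id: item ID
category_id: item category ID
behaviour_type: behavior type (pv = page view, i.e. a click; fav = favorite; cart = add to cart; buy = purchase)
timestamp: time of the behavior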
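2 Analysis directions and goals
E-commerce analysis usually covers dimensions such as sales, traffic and search, users, products, and promotions and coupons. Given what this Taobao user behavior dataset contains, the analysis covers three areas (traffic, users, and products), looking for patterns in user behavior so that user and product operations can work from more precise strategies and ultimately raise GMV.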
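3 Data cleaning
Preprocess the data, handling missing values, outliers, and duplicates.
3.1 Importing the data
Create a new table ub. Because the source dataset has no primary key, use rand() together with limit to draw a random sample of 100,000 rows and insert it into the new table.
create table ub like userbehavior;
insert into ub
select * from userbehavior order by rand() limit 100000;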
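3.2 Removing duplicates
Because this step could delete data, first create a new table so that nothing is lost irrecoverably. Rows that are identical in user, item, category, behavior, and time are treated as duplicates and collapsed with select distinct.
create table ub2 as
select distinct user_id, item_id, category_id, behaviour_type, timestamp from ub;
The row count is still 100,000, so there are no duplicates, and ub2 can simply be dropped (for example by right-clicking it in the client).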
3.3 Handling missing values
Count the missing values in each column: there are none.
select sum(case when user_id is null then 1 else 0 end) as userid,
sum(case when item_id is null then 1 else 0 end) as itemid,
sum(case when category_id is null then 1 else 0 end) as cateid,
sum(case when behaviour_type is null then 1 else 0 end) as bt,
sum(case when timestamp is null then 1 else 0 end) as date
from ub;
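3.4 Normalizing formats
Timestamp handling: convert the Unix timestamp to a datetime, then extract the date and the hour to make the later time-series analysis easier.
-- convert the timestamp to a datetime; timestamp is a column, so it takes backticks, not single quotes
-- %Y is the four-digit year and %H the 24-hour hour (%y and %h would give a two-digit year and a 12-hour hour)
alter table ub add ubtime datetime;
update ub set ubtime = from_unixtime(`timestamp`, '%Y-%m-%d %H:%i:%s');
-- reduce the datetime to a date
alter table ub add ubdate date;
update ub set ubdate = date_format(ubtime, '%Y-%m-%d');
-- extract the hour
alter table ub add ubhour int;
update ub set ubhour = hour(ubtime);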
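As a cross-check on the conversion above, the same derivation can be sketched in Python. This is only an illustration: it assumes the raw values are Unix seconds that should be read in Beijing time (UTC+8), and the function name split_timestamp is hypothetical rather than part of the original workflow. MySQL's from_unixtime uses the session time zone, so the two agree only when the session is set to UTC+8.

```python
from datetime import datetime, timedelta, timezone

# Assumption: timestamps are Unix seconds, interpreted in Beijing time (UTC+8).
BEIJING = timezone(timedelta(hours=8))


def split_timestamp(ts):
    """Return (date string, hour) for a Unix timestamp, mirroring ubdate and ubhour."""
    dt = datetime.fromtimestamp(ts, tz=BEIJING)
    return dt.strftime("%Y-%m-%d"), dt.hour


# 1511539200 is midnight at the start of November 25, 2017 in UTC+8.
print(split_timestamp(1511539200))  # ('2017-11-25', 0)
```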
3.5 Adding a primary key
alter table ub add column behaviour_id int not null auto_increment primary key first;
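3.6 Handling outliers
Keep only records between November 25, 2017 and December 3, 2017. Because this involves deletion, first run a select to verify that the condition is correct.
select * from ub where ubdate not between '2017-11-25' and '2017-12-03';
Note: in MySQL, between ... and ... includes both endpoints, so not between matches only rows strictly outside the range.
The query returns 48 rows, all dated before November 25, so the condition is correct. Remove them with a delete statement.
delete from ub where ubdate not between '2017-11-25' and '2017-12-03';
After cleaning, the main table ub contains 99,952 records.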
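The inclusive-endpoint behavior of the date filter can be illustrated with a small Python sketch. It is not the MySQL implementation, only the same comparison written out; the function name split_by_window and the sample dates in the test are made up for illustration.

```python
from datetime import date

# Analysis window taken from the text: November 25 to December 3, 2017.
START, END = date(2017, 11, 25), date(2017, 12, 3)


def split_by_window(days, start=START, end=END):
    """Return (kept, removed); both endpoints count as inside the window."""
    kept = [d for d in days if start <= d <= end]
    removed = [d for d in days if not (start <= d <= end)]
    return kept, removed
```

Rows dated exactly November 25 or December 3 end up in kept, which corresponds to not between leaving them untouched.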
4 Data analysis
Analyze the cleaned data along the directions outlined above.
4.1 Traffic analysis
4.1.1 PV/UV trends over time
-- by day
select ubdate, count(1) as pv, count(distinct user_id) as uv
from ub where behaviour_type = 'pv'
group by ubdate
order by ubdate;
-- by hour
select ubhour, count(1) as pv, count(distinct user_id) as uv
from ub where behaviour_type = 'pv'
group by ubhour order by ubhour;
• Daily results
• Hourly results
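To make the difference between PV and UV concrete, the daily query can be mirrored on a handful of toy rows in Python. This is a sketch under stated assumptions: events is a hypothetical list of (user_id, behaviour_type, day) tuples, and pv_uv is an illustrative name.

```python
from collections import defaultdict


def pv_uv(events):
    """events: list of (user_id, behaviour_type, day). Returns {day: (pv, uv)}."""
    views = defaultdict(int)     # number of pv events per day
    visitors = defaultdict(set)  # distinct users with a pv event per day
    for user_id, behaviour, day in events:
        if behaviour == "pv":
            views[day] += 1
            visitors[day].add(user_id)
    return {day: (views[day], len(visitors[day])) for day in sorted(views)}
```

A user who views three pages in one day adds 3 to that day's PV but only 1 to its UV.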
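4.1.2 Bounce rate
Bounce rate = users with click (pv) behavior only / total users
-- aliases: 只有点击行为的用户 = users with pv only, 跳失率 = bounce rate
-- the denominator is the total number of users (29,483 in this sample), computed here instead of hard-coded
select count(distinct user_id) as '只有点击行为的用户',
concat(round(count(distinct user_id)*100/(select count(distinct user_id) from ub),1),'%') as '跳失率'
from ub
where user_id not in (select distinct user_id from ub where behaviour_type = 'cart')
and user_id not in (select distinct user_id from ub where behaviour_type = 'buy')
and user_id not in (select distinct user_id from ub where behaviour_type = 'fav');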
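The bounce-rate definition above can be checked on toy data with a short Python sketch. It assumes events is a list of (user_id, behaviour_type) pairs; the function name bounce_rate is illustrative only.

```python
def bounce_rate(events):
    """events: list of (user_id, behaviour_type). Returns (bounced_users, rate)."""
    all_users = set()
    engaged = set()  # users with at least one cart, buy or fav event
    for user_id, behaviour in events:
        all_users.add(user_id)
        if behaviour in ("cart", "buy", "fav"):
            engaged.add(user_id)
    bounced = all_users - engaged  # users whose only behaviour is pv
    return len(bounced), len(bounced) / len(all_users)
```

Here the denominator is derived from the data itself, so the result stays correct if the sample changes.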
4.2 User analysis
4.2.1 User behavior trends
select ubdate, sum(case behaviour_type when 'pv' then 1 else 0 end) as 'pv',
sum(case behaviour_type when 'cart' then 1 else 0 end) as cart,
sum(case behaviour_type when 'buy' then 1 else 0 end) as buy,
sum(case behaviour_type when 'fav' then 1 else 0 end) as fav
from ub group by ubdate order by ubdate;
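4.2.2 Behavior funnel analysis
Funnel conversion can be analyzed along two dimensions. One counts behaviors: how many page views, purchases, and so on occurred, and the corresponding conversion rates. The other counts UV (unique visitors): how many users viewed, added to cart, and went on to purchase. In practice, choose the dimension that matches the business need.
• By behavior count
The purchase path is either view (pv) - favorite (fav) - purchase (buy), or view (pv) - add to cart (cart) - purchase (buy). A funnel model is used to analyze the conversion rates.
-- count views, purchases, add-to-carts, and favorites
select sum(case behaviour_type when 'pv' then 1 else 0 end) as 'pv',
sum(case behaviour_type when 'buy' then 1 else 0 end) as 'buy',
sum(case behaviour_type when 'cart' then 1 else 0 end) as 'cart',
sum(case behaviour_type when 'fav' then 1 else 0 end) as 'fav' from ub;
-- overall conversion rates relative to pv (购买转化率 = purchase, 收藏转化率 = favorite, 加购转化率 = add-to-cart)
select concat(round(sum(case behaviour_type when 'buy' then 1 else 0 end)*100/sum(case behaviour_type when 'pv' then 1 else 0 end),1),'%') as '购买转化率',
concat(round(sum(case behaviour_type when 'fav' then 1 else 0 end)*100/sum(case behaviour_type when 'pv' then 1 else 0 end),1),'%') as '收藏转化率',
concat(round(sum(case behaviour_type when 'cart' then 1 else 0 end)*100/sum(case behaviour_type when 'pv' then 1 else 0 end),1),'%') as '加购转化率' from ub;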
• By UV (unique visitors)
-- total UV
select count(distinct user_id) from ub;
-- UV for each of the four behaviors
create view uv as select behaviour_type, count(distinct user_id) as uv
from ub group by behaviour_type;
-- conversion rates relative to pv users
select concat(round(sum(case behaviour_type when 'buy' then uv else 0 end)*100/sum(case behaviour_type when 'pv' then uv else 0 end),1),'%') as 购买转化率,
concat(round(sum(case behaviour_type when 'fav' then uv else 0 end)*100/sum(case behaviour_type when 'pv' then uv else 0 end),1),'%') as 收藏转化率,
concat(round(sum(case behaviour_type when 'cart' then uv else 0 end)*100/sum(case behaviour_type when 'pv' then uv else 0 end),1),'%') as 加购转化率 from uv;
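The two counting dimensions give different conversion rates for the same data, which a small Python sketch on toy rows makes visible. It assumes events is a list of (user_id, behaviour_type) tuples; funnel and rate are illustrative names, not part of the original analysis.

```python
from collections import Counter


def funnel(events):
    """events: list of (user_id, behaviour_type).

    Returns (counts per event, counts per distinct user), keyed by behaviour type.
    """
    by_event = Counter(b for _, b in events)
    by_user = Counter(b for _, b in set(events))  # one (user, behaviour) pair per user
    return by_event, by_user


def rate(counts, step, base="pv"):
    """Share of `step` relative to `base`, in percent, rounded to one decimal."""
    return round(counts[step] * 100 / counts[base], 1)
```

With five page views and one purchase spread over three viewers, the behavior-based purchase rate is 20.0 while the UV-based rate is 33.3, so the two should not be compared directly.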
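4.2.3 Repurchase
• Repurchase rate
Definition: users with repeat purchases / users with at least one purchase
-- create view f holding the user_id and purchase count (购买次数) of every user with a repeat purchase
create view f as
select user_id, count(1) as '购买次数'
from ub where behaviour_type='buy'
group by user_id
having count(behaviour_type) >= 2;
-- repurchase rate (复购率); both operands are scalar subqueries, so no from clause is needed
select (select count(user_id) from f)/(select count(distinct user_id)
from ub where behaviour_type ='buy') as '复购率';
• Distribution of repurchase frequency
-- 人数 = number of users with that purchase count
select 购买次数, count(购买次数) as 人数
from f group by 购买次数 order by 购买次数;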
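The repurchase rate and its frequency distribution can be reproduced on toy data with a short Python sketch. As before, events is an assumed list of (user_id, behaviour_type) tuples and repurchase is an illustrative name; the sketch expects at least one buy event.

```python
from collections import Counter


def repurchase(events):
    """events: list of (user_id, behaviour_type) containing at least one 'buy'.

    Returns (repurchase rate, {purchase count: number of users}) where the
    distribution covers only users with two or more purchases.
    """
    buys = Counter(u for u, b in events if b == "buy")  # purchases per buyer
    repeat = {u: n for u, n in buys.items() if n >= 2}  # repeat buyers only
    rate = len(repeat) / len(buys)
    return rate, dict(Counter(repeat.values()))
```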
4.2.4 Retention
-- get each user's active dates and first active date
create view retention as
select a.user_id, a.ubdate, b.firstday
from (select user_id, ubdate from ub group by user_id, ubdate) as a inner join ( select user_id, min(ubdate) as firstday from ub group by user_id) as b
on a.user_id = b.user_id
order by a.user_id, a.ubdate;
-- days between the first active date and each active date (by_day)
create view byday as
select user_id, ubdate, firstday, datediff(ubdate,firstday) as by_day
from retention;
-- retained users per day offset
-- day0 is the number of new users on that first day
create view retention2 as
select firstday,
sum(case when by_day = 0 then 1 else 0 end) as day0,
sum(case when by_day = 1 then 1 else 0 end) as day1,
sum(case when by_day = 2 then 1 else 0 end) as day2,
sum(case when by_day = 3 then 1 else 0 end) as day3,
sum(case when by_day = 4 then 1 else 0 end) as day4,
sum(case when by_day = 5 then 1 else 0 end) as day5,
sum(case when by_day = 6 then 1 else 0 end) as day6,
sum(case when by_day = 7 then 1 else 0 end) as day7,
sum(case when by_day = 8 then 1 else 0 end) as day8
from byday group by firstday order by firstday;
-- retention rates; retention2 already has one row per firstday, so no further group by is needed
select firstday,
concat(round(day1*100/day0,1),'%') as day1, concat(round(day2*100/day0,1),'%') as day2,
concat(round(day3*100/day0,1),'%') as day3, concat(round(day4*100/day0,1),'%') as day4,
concat(round(day5*100/day0,1),'%') as day5, concat(round(day6*100/day0,1),'%') as day6,
concat(round(day7*100/day0,1),'%') as day7, concat(round(day8*100/day0,1),'%') as day8
from retention2 order by firstday;
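The cohort logic behind the three views (first active day, day offset, count per offset) can be sketched in a few lines of Python. This is an illustration on assumed inputs: visits is a hypothetical list of (user_id, date) pairs and retention_counts is an illustrative name.

```python
from datetime import date


def retention_counts(visits):
    """visits: list of (user_id, date). Returns {first_day: {day_offset: users}}."""
    # First active day per user, like the firstday column.
    first_day = {}
    for user, day in visits:
        if user not in first_day or day < first_day[user]:
            first_day[user] = day
    # Count each user once per active day, bucketed by offset from the first day.
    cohorts = {}
    for user, day in set(visits):
        offset = (day - first_day[user]).days
        row = cohorts.setdefault(first_day[user], {})
        row[offset] = row.get(offset, 0) + 1
    return cohorts
```

Dividing the count at offset n by the count at offset 0 for the same first day gives the day-n retention rate.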
4.2.5 RFM user value model
Because this dataset has no monetary (M) data, only R and F are analyzed.
-- r
Query each user's most recent purchase date (近期购买时间) and create a view
create view r as
select user_id, max(ubdate) as '近期购买时间'
from ub where behaviour_type = 'buy' group by user_id;
Build the R score tiers (r等级划分); 距今天数 is the number of days between the last purchase and December 3, 2017
create view r等级划分 as
select user_id, 近期购买时间, datediff('2017-12-03',近期购买时间) as 距今天数,
(case when datediff('2017-12-03',近期购买时间)<=2 then 5
when datediff('2017-12-03',近期购买时间)<=4 then 4
when datediff('2017-12-03',近期购买时间)<=6 then 3
when datediff('2017-12-03',近期购买时间)<=8 then 2 else 1 end) as r,
(case when datediff('2017-12-03',近期购买时间)<=2 then '5'
when datediff('2017-12-03',近期购买时间)<=4 then '4'
when datediff('2017-12-03',近期购买时间)<=6 then '3'
when datediff('2017-12-03',近期购买时间)<=8 then '2' else '1' end) as r值
from r;
-- f
Query each user's purchase count (购买次数) and create a view
drop view if exists f;
create view f as
select user_id, count(user_id) as '购买次数'
from ub where behaviour_type = 'buy' group by user_id;
Build the F score tiers (f等级划分)
create view f等级划分 as
select user_id, 购买次数,
(case when 购买次数<=1 then 1 when 购买次数<=2 then 2
when 购买次数<=3 then 3 when 购买次数<=4 then 4 else 5 end) as 'f',
(case when 购买次数<=1 then '1' when 购买次数<=2 then '2'
when 购买次数<=3 then '3' when 购买次数<=4 then '4' else '5' end) as 'f值'
from f;
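-- rf
Average r score
select avg(r) as 'r平均值' from r等级划分;
Average f score
select avg(f) as 'f平均值' from f等级划分;
-- summary: classify each user against the two averages obtained above (3.7794 for r, 1.0601 for f)
-- 高价值客户 = high-value, 唤回客户 = win-back, 深耕客户 = develop, 挽留客户 = retain; 客户分类 = customer segment
create view rfm汇总 as
select a.*, b.f, b.f值,
(case when a.r > 3.7794 and b.f>1.0601 then '高价值客户'
when a.r < 3.7794 and b.f>1.0601 then '唤回客户'
when a.r > 3.7794 and b.f<1.0601 then '深耕客户'
when a.r < 3.7794 and b.f<1.0601 then '挽留客户' end) as 客户分类
from r等级划分 as a inner join f等级划分 as b on a.user_id = b.user_id;
-- number of customers (客户数) per segment
select 客户分类,count(1) as 客户数 from rfm汇总 group by 客户分类;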
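The four-way split on the average R and F scores can be sketched in Python without hard-coding the averages. This is an illustration under assumptions: scores is a hypothetical mapping from user_id to an (r, f) pair, segment is an illustrative name, and the English labels are translations of the segment names used in the query. Unlike the SQL, which leaves the segment NULL when a score equals the average exactly, this sketch places such ties in the lower group.

```python
def segment(scores):
    """scores: {user_id: (r, f)}. Splits users on the mean R and mean F scores."""
    r_avg = sum(r for r, _ in scores.values()) / len(scores)
    f_avg = sum(f for _, f in scores.values()) / len(scores)
    # Keys are (r above average, f above average).
    labels = {
        (True, True): "high value",
        (False, True): "win back",
        (True, False): "develop",
        (False, False): "retain",
    }
    return {u: labels[(r > r_avg, f > f_avg)] for u, (r, f) in scores.items()}
```

Computing the averages inside the function avoids having to update the thresholds by hand when the sample changes.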
4.3 Product analysis
4.3.1 Top 20 item/category analysis
• By category
-- create view cate (浏览量 = views, 购买量 = purchases, 购买转化率 = purchase conversion rate)
-- the name is lowercase throughout because view names can be case-sensitive, depending on the server
create view cate as
select category_id,
sum(case when behaviour_type='pv' then 1 else 0 end) as '浏览量',
sum(case when behaviour_type='buy' then 1 else 0 end) as '购买量',
concat(round(sum(case when behaviour_type='buy' then 1 else 0 end)*100/sum(case when behaviour_type='pv' then 1 else 0 end),1),'%') as '购买转化率'
from ub group by category_id;
-- top 20 categories by purchases
select * from cate order by 购买量 desc limit 20;
-- top 20 categories by views
select * from cate order by 浏览量 desc limit 20;
• By item
-- create view item
create view item as
select item_id,
sum(case when behaviour_type='pv' then 1 else 0 end) as '浏览量',
sum(case when behaviour_type='buy' then 1 else 0 end) as '购买量',
concat(round(sum(case when behaviour_type='buy' then 1 else 0 end)*100/sum(case when behaviour_type='pv' then 1 else 0 end),1),'%') as '购买转化率'
from ub group by item_id;
-- top 20 items by purchases
select * from item order by 购买量 desc limit 20;
-- top 20 items by views
select * from item order by 浏览量 desc limit 20;
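The top-N ranking can be mirrored in Python with collections.Counter. The sketch assumes events is a list of (item_id, behaviour_type) tuples; top_items is an illustrative name and the default of 20 simply follows the queries above.

```python
from collections import Counter


def top_items(events, behaviour, n=20):
    """events: list of (item_id, behaviour_type).

    Returns up to n (item_id, count) pairs for the given behaviour,
    ordered from the most to the least frequent.
    """
    counts = Counter(item for item, b in events if b == behaviour)
    return counts.most_common(n)
```

Calling it once with "buy" and once with "pv" gives the two rankings that can then be compared item by item.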
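5 Conclusions and recommendations
-- Traffic
The daily PV and UV trends are broadly the same: traffic stays low on weekdays (Nov 27 to Dec 1) and is higher on weekends. PV and UV on the first weekend of December were both substantially higher than on the last weekend of November; a closer look indicates that the Double 12 warm-up campaigns drove that weekend's traffic growth.
By hour, 1:00 to 6:00 is the low period for visits and 20:00 to 22:00 is the most active period. Visits grow fastest from 6:00 to 10:00 and after dinner from 18:00 to 21:00, which matches normal daily routines. Recommendation: push new products and promotions during the active period to raise the purchase rate; on campaign days, deliver campaign messages and coupons in stages across the active and fast-growth periods.
Traffic analysis is also commonly combined with marketing campaigns and ad placements to evaluate the effect of on-site and off-site activity.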
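-- Users
User behavior: all four behaviors peaked on the first weekend of December. Page views and add-to-cart counts continued to rise on Sunday while favorites fell; explaining these fluctuations requires looking at the specific campaigns running at the time.
By behavior count, the view-to-cart conversion rate is 9.5% and the view-to-purchase rate is 2.3%. By unique visitors, about 6.8% of active users made a purchase during the period. These figures should be assessed against conversion benchmarks for the relevant categories and items. If conversion is below normal levels, trace the user behavior path to find the cause; for example, if PV conversion is low, dig into factors such as product price and page information.
The repurchase rate for this period is 5.4%, which is fairly normal. Repurchase rates generally rise as the time window lengthens, so in practice quarterly, half-yearly, or even annual repurchase rates are more common. Repurchase rates also differ across categories.
Retention: taking November 25 as the first day, next-day retention is 30.6%, 3-day retention is 25.2%, and 7-day retention is 32%. The Double 12 warm-up lifted retention on the weekend.
Based on the R and F scores, users are divided into four segments, and each segment can be targeted with tailored marketing.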
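-- Products
Some categories/items with high purchase volume have low view counts, while some with high view counts convert poorly; these need to be analyzed in light of the characteristics of the specific item or category.
For best-selling categories/items, consider bundling and cross-selling related categories/items to raise sales.
New items and traffic-driving items that take part in campaigns should also be tracked separately.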