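I previously posted a user-behavior analysis done in Python on this blog. Afterwards I figured I could walk through the same analysis again in SQL, as practice. The data is the "UserBehavior.csv" file from Tianchi (dated 2020-04-13).
The dataset contains every action taken between 2017-11-25 and 2017-12-03 by roughly one million randomly selected users who were active in that window (actions are click, purchase, add-to-cart, and favorite). It is organized like MovieLens-20M: each row is one user action, made up of user ID, item ID, item category ID, behavior type, and timestamp, separated by commas.
My MySQL version is 5.5 and Navicat is 10.1, so window functions are not supported. Also, this time I did not split the table; all later processing and queries run against a single table. In real work you should still follow normalization rules.
Below is my workflow. I will not repeat the conclusions here, since this post only redoes the data extraction for each module of the earlier article in SQL.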
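As before, the analysis covers the following:
PV and UV.
Retention rate.
Purchase counts per period.
Conversion across behavior types.
Repurchase rate.
Buy-back rate.
RFM model.
Relationship between items and behaviors.
Top items.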
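The time column in the source data is a Unix timestamp. Here I convert it and add a date column and an hour-of-day column for later analysis. Because of my machine's limited specs I only loaded about 800,000 rows; anything more was unbearably slow.
Add the converted datetime as date (%Y-%m-%d %H:%i:%s)
ALTER TABLE ub ADD date VARCHAR(32);
UPDATE ub SET date = FROM_UNIXTIME(time,'%Y-%m-%d %H:%i:%s');
SELECT * FROM ub LIMIT 10;
Add a date-only column (%Y-%m-%d)
ALTER TABLE ub ADD day VARCHAR(32);
UPDATE ub SET day = CAST(date AS DATE);
Add an hour column (the first UPDATE stores the full HH:MM:SS string; the second overwrites it with just the hour)
ALTER TABLE ub ADD hour VARCHAR(32);
UPDATE ub SET hour = RIGHT(CONVERT(date,TIME),8);
UPDATE ub SET hour = HOUR(date);
Change the new columns from VARCHAR to proper types
ALTER TABLE ub MODIFY COLUMN date DATETIME;
ALTER TABLE ub MODIFY COLUMN day DATETIME;
ALTER TABLE ub MODIFY COLUMN hour INT;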
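The conversion steps above could probably be collapsed into one ALTER and one UPDATE, which avoids the VARCHAR detour and scans the table only once. The following is a minimal sketch under stated assumptions: the table ub_demo and its two sample rows are hypothetical, the column layout (user_id, item_id, cate_id, type, time) is inferred from the queries in this post, and day is declared as DATE rather than the DATETIME used in the article.

```sql
-- Hypothetical demo table; ub_demo and its rows exist only for illustration.
DROP TABLE IF EXISTS ub_demo;
CREATE TABLE ub_demo (
  user_id INT,
  item_id INT,
  cate_id INT,
  type    VARCHAR(8),
  time    INT          -- Unix timestamp in seconds
);
INSERT INTO ub_demo VALUES
  (1, 10, 100, 'pv',  1511544070),
  (1, 10, 100, 'buy', 1511561733);

-- Add the three derived columns with their final types in a single ALTER.
ALTER TABLE ub_demo
  ADD COLUMN date DATETIME,
  ADD COLUMN day  DATE,
  ADD COLUMN hour TINYINT;

-- Fill all three columns from the timestamp in one pass.
-- FROM_UNIXTIME uses the session time zone, so the values depend on that setting.
UPDATE ub_demo
SET date = FROM_UNIXTIME(time),
    day  = DATE(FROM_UNIXTIME(time)),
    hour = HOUR(FROM_UNIXTIME(time));

SELECT * FROM ub_demo;
```

On a table with hundreds of thousands of rows, a single UPDATE should be noticeably lighter than three separate ones, though I have not measured it on the data used here.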
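Check for rows with a missing user_id
SELECT COUNT(1) FROM ub WHERE user_id IS NULL;
Better safe than sorry: keep a backup copy (MySQL does not support SELECT ... INTO a new table, so the copy is made with CREATE TABLE ... AS SELECT)
CREATE TABLE behavior_ORIG AS SELECT * FROM ub;
The loaded data is not limited to 2017-11-25 through 2017-12-03; there are a few stray rows outside that window, so I deleted the out-of-range dates.
DELETE FROM ub WHERE day < '2017-11-25' OR day >= '2017-12-04';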
PV per day
SELECT
  day,
  COUNT(1) AS PV
FROM ub
GROUP BY day
ORDER BY day ASC;
PV per hour of day
SELECT hour, COUNT(1)
FROM ub
GROUP BY hour
ORDER BY hour ASC;
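The two queries above only return PV. A daily UV figure can be obtained in the same pass with COUNT(DISTINCT user_id). This is a small sketch, assuming a hypothetical table ub_demo with made-up rows; following the queries above, PV here counts every row regardless of behavior type.

```sql
-- Hypothetical demo table; ub_demo and its rows exist only for illustration.
DROP TABLE IF EXISTS ub_demo;
CREATE TABLE ub_demo (user_id INT, type VARCHAR(8), day DATE);
INSERT INTO ub_demo VALUES
  (1, 'pv',  '2017-11-25'),
  (1, 'pv',  '2017-11-25'),
  (2, 'pv',  '2017-11-25'),
  (1, 'buy', '2017-11-26'),
  (2, 'pv',  '2017-11-26');

-- PV = number of rows, UV = number of distinct users, per day.
SELECT
  day,
  COUNT(1)                                     AS pv,
  COUNT(DISTINCT user_id)                      AS uv,
  ROUND(COUNT(1) / COUNT(DISTINCT user_id), 2) AS pv_per_uv
FROM ub_demo
GROUP BY day
ORDER BY day;
-- 2017-11-25: pv 3, uv 2, pv_per_uv 1.50
-- 2017-11-26: pv 2, uv 2, pv_per_uv 1.00
```

To count only page views in the strict sense, a WHERE type = 'pv' filter could be added.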
Retention: for each first-active day, the number of distinct users who were active again N days later (day_1 is the cohort itself). COUNT(DISTINCT ...) is used so that a user with several actions on the same day is counted once.
SELECT
  diff_day.min_day AS min_day,
  COUNT(DISTINCT CASE WHEN diff_day.to_fday=0 THEN diff_day.user_id END) AS day_1,
  COUNT(DISTINCT CASE WHEN diff_day.to_fday=1 THEN diff_day.user_id END) AS day_2,
  COUNT(DISTINCT CASE WHEN diff_day.to_fday=2 THEN diff_day.user_id END) AS day_3,
  COUNT(DISTINCT CASE WHEN diff_day.to_fday=3 THEN diff_day.user_id END) AS day_4,
  COUNT(DISTINCT CASE WHEN diff_day.to_fday=4 THEN diff_day.user_id END) AS day_5,
  COUNT(DISTINCT CASE WHEN diff_day.to_fday=5 THEN diff_day.user_id END) AS day_6,
  COUNT(DISTINCT CASE WHEN diff_day.to_fday=6 THEN diff_day.user_id END) AS day_7,
  COUNT(DISTINCT CASE WHEN diff_day.to_fday=7 THEN diff_day.user_id END) AS day_8
FROM
  (SELECT a.user_id, a.day, b.min_day, DATEDIFF(a.day,b.min_day) AS to_fday
   FROM ub AS a
   LEFT JOIN
     (SELECT user_id, MIN(day) AS min_day
      FROM ub
      GROUP BY user_id) AS b
   ON a.user_id=b.user_id) AS diff_day
GROUP BY diff_day.min_day
ORDER BY diff_day.min_day;
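The retention query returns head counts per cohort; dividing by the cohort size turns them into a rate. Below is a rough sketch for next-day retention only, assuming a hypothetical table ub_demo with made-up rows and counting distinct users rather than rows.

```sql
-- Hypothetical demo table; ub_demo and its rows exist only for illustration.
DROP TABLE IF EXISTS ub_demo;
CREATE TABLE ub_demo (user_id INT, day DATE);
INSERT INTO ub_demo VALUES
  (1, '2017-11-25'), (1, '2017-11-25'), (1, '2017-11-26'),
  (2, '2017-11-25'),
  (3, '2017-11-26');

-- For each first-active day: cohort size, users back one day later, and the ratio.
SELECT
  f.min_day,
  COUNT(DISTINCT a.user_id) AS cohort_size,
  COUNT(DISTINCT CASE WHEN DATEDIFF(a.day, f.min_day) = 1
                      THEN a.user_id END) AS back_next_day,
  ROUND(COUNT(DISTINCT CASE WHEN DATEDIFF(a.day, f.min_day) = 1
                            THEN a.user_id END)
        / COUNT(DISTINCT a.user_id) * 100, 1) AS next_day_retention_pct
FROM ub_demo AS a
JOIN (SELECT user_id, MIN(day) AS min_day
      FROM ub_demo
      GROUP BY user_id) AS f
  ON a.user_id = f.user_id
GROUP BY f.min_day
ORDER BY f.min_day;
-- 2017-11-25: cohort_size 2, back_next_day 1, next_day_retention_pct 50.0
-- 2017-11-26: cohort_size 1, back_next_day 0, next_day_retention_pct 0.0
```

The same pattern extends to later days by changing the DATEDIFF comparison value.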
Number of actions of each type per day, pivoted so that each day is a column
SELECT
  gp_day_type.type,
  MAX(CASE WHEN gp_day_type.day = '2017-11-25 00:00:00' THEN gp_day_type.type_count ELSE 0 END) AS '2017-11-25',
  MAX(CASE WHEN gp_day_type.day = '2017-11-26 00:00:00' THEN gp_day_type.type_count ELSE 0 END) AS '2017-11-26',
  MAX(CASE WHEN gp_day_type.day = '2017-11-27 00:00:00' THEN gp_day_type.type_count ELSE 0 END) AS '2017-11-27',
  MAX(CASE WHEN gp_day_type.day = '2017-11-28 00:00:00' THEN gp_day_type.type_count ELSE 0 END) AS '2017-11-28',
  MAX(CASE WHEN gp_day_type.day = '2017-11-29 00:00:00' THEN gp_day_type.type_count ELSE 0 END) AS '2017-11-29',
  MAX(CASE WHEN gp_day_type.day = '2017-11-30 00:00:00' THEN gp_day_type.type_count ELSE 0 END) AS '2017-11-30',
  MAX(CASE WHEN gp_day_type.day = '2017-12-01 00:00:00' THEN gp_day_type.type_count ELSE 0 END) AS '2017-12-01',
  MAX(CASE WHEN gp_day_type.day = '2017-12-02 00:00:00' THEN gp_day_type.type_count ELSE 0 END) AS '2017-12-02',
  MAX(CASE WHEN gp_day_type.day = '2017-12-03 00:00:00' THEN gp_day_type.type_count ELSE 0 END) AS '2017-12-03',
  MAX(CASE WHEN gp_day_type.day = '2017-12-04 00:00:00' THEN gp_day_type.type_count ELSE 0 END) AS '2017-12-04',
  MAX(CASE WHEN gp_day_type.day = '2017-12-05 00:00:00' THEN gp_day_type.type_count ELSE 0 END) AS '2017-12-05',
  MAX(CASE WHEN gp_day_type.day = '2017-12-06 00:00:00' THEN gp_day_type.type_count ELSE 0 END) AS '2017-12-06'
FROM
  (SELECT day, type, COUNT(user_id) AS type_count
   FROM ub
   GROUP BY day, type) AS gp_day_type
GROUP BY gp_day_type.type;
Number of actions of each type, as a percentage of the number of pv actions
SELECT
  type,
  COUNT(user_id) AS count,
  ROUND((COUNT(user_id)/(SELECT COUNT(user_id) FROM ub WHERE type='pv')),3)*100 AS 'Proportion'
FROM ub
GROUP BY type
ORDER BY Proportion ASC;
Distinct users per behavior type, as a percentage of the users who viewed, plus an overall uv row. I just noticed that I used a.* here; in practice this style should be avoided.
SELECT a.*
FROM
  (SELECT
     type,
     COUNT(DISTINCT user_id) AS count,
     ROUND((COUNT(DISTINCT user_id)/(SELECT COUNT(DISTINCT user_id) FROM ub WHERE type='pv')),3)*100 AS 'Proportion'
   FROM ub
   GROUP BY type
   UNION ALL
   SELECT
     'uv' type,
     COUNT(DISTINCT user_id) AS count,
     100 AS 'Proportion'
   FROM ub) AS a
ORDER BY a.Proportion ASC;
Here I use the number of users with a repeat purchase (at least two purchases) divided by the number of users with any purchase.
SELECT
  ((SELECT COUNT(*)
    FROM
      (SELECT user_id
       FROM ub
       WHERE type='buy'
       GROUP BY user_id
       HAVING COUNT(user_id)>=2) AS buy_s)
   /
   (SELECT COUNT(*)
    FROM
      (SELECT user_id
       FROM ub
       WHERE type='buy'
       GROUP BY user_id) AS buy_o))*100 AS 'repurchase_rate';
For users who bought at least twice: the first and last purchase day and the number of days between them
SELECT
  user_id,
  MIN(day) AS min_day,
  MAX(day) AS max_day,
  DATEDIFF(MAX(day),MIN(day)) AS diff_days
FROM ub
WHERE type='buy'
GROUP BY user_id
HAVING COUNT(user_id)>=2;
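My approach here is to take each user's most recent purchase time, purchase frequency, and total spend, compute the average of each of the three, and then subtract the average from each user's value: greater than 0 is set to 1, less than 0 is set to 0.
The source data has no monetary amount, so with only R and F the users can only be split into four groups.
Average recency: days from each buyer's last purchase to 2017-12-03
SELECT
  AVG(a.datediff)
FROM
  (SELECT user_id,
          DATEDIFF('2017-12-03 00:00:00',MAX(day)) AS datediff
   FROM ub
   WHERE type='buy'
   GROUP BY user_id) AS a;
Confirm the latest day in the table
SELECT MAX(day) FROM ub;
R flag per user, using the average recency obtained above (2.5076)
SELECT user_id,
       IF(DATEDIFF('2017-12-03 00:00:00',MAX(day))-2.5076>0,1,0) AS r
FROM ub
WHERE type='buy'
GROUP BY user_id;
Average number of purchases per buyer
SELECT
  AVG(a.count)
FROM
  (SELECT user_id, COUNT(1) AS count
   FROM ub
   WHERE type='buy'
   GROUP BY user_id) AS a;
F flag per user, using the average frequency obtained above (3.0346)
SELECT user_id,
       IF(COUNT(1)-3.0346>0,1,0) AS f
FROM ub
WHERE type='buy'
GROUP BY user_id;
Combine the two flags into one RF code per user
SELECT
  r.user_id, CONCAT(r.r,f.f) AS rf
FROM
  (SELECT user_id,
          IF(DATEDIFF('2017-12-03 00:00:00',MAX(day))-2.5076>0,1,0) AS r
   FROM ub
   WHERE type='buy'
   GROUP BY user_id) AS r
JOIN
  (SELECT user_id,
          IF(COUNT(1)-3.0346>0,1,0) AS f
   FROM ub
   WHERE type='buy'
   GROUP BY user_id) AS f
ON r.user_id=f.user_id;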
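The RF queries above paste the two averages in by hand, so they go stale if the data changes. One possible alternative is to compute the averages in a derived table and join them in. This is only a sketch: the table ub_demo and its rows are hypothetical, and the flag logic follows the description above (value above the average gives 1, otherwise 0), so r = 1 means a longer-than-average gap since the last purchase.

```sql
-- Hypothetical demo table; ub_demo and its rows exist only for illustration.
DROP TABLE IF EXISTS ub_demo;
CREATE TABLE ub_demo (user_id INT, type VARCHAR(8), day DATE);
INSERT INTO ub_demo VALUES
  (1, 'buy', '2017-12-03'), (1, 'buy', '2017-12-02'), (1, 'buy', '2017-11-30'),
  (2, 'buy', '2017-11-26'),
  (3, 'buy', '2017-12-01'), (3, 'pv',  '2017-12-03');

SELECT
  u.user_id,
  u.recency,
  u.frequency,
  -- First digit: recency above average; second digit: frequency above average.
  CONCAT(IF(u.recency > m.avg_r, 1, 0),
         IF(u.frequency > m.avg_f, 1, 0)) AS rf
FROM
  -- Per-buyer recency (days before 2017-12-03) and purchase count.
  (SELECT user_id,
          DATEDIFF('2017-12-03', MAX(day)) AS recency,
          COUNT(1) AS frequency
   FROM ub_demo
   WHERE type = 'buy'
   GROUP BY user_id) AS u
CROSS JOIN
  -- One-row table holding the two averages over all buyers.
  (SELECT AVG(t.recency) AS avg_r, AVG(t.frequency) AS avg_f
   FROM (SELECT DATEDIFF('2017-12-03', MAX(day)) AS recency,
                COUNT(1) AS frequency
         FROM ub_demo
         WHERE type = 'buy'
         GROUP BY user_id) AS t) AS m
ORDER BY u.user_id;
-- user 1: recency 0, frequency 3, rf '01'
-- user 2: recency 7, frequency 1, rf '10'
-- user 3: recency 2, frequency 1, rf '00'
```

The per-buyer subquery is written twice because MySQL 5.5 has no common table expressions; on a newer version a WITH clause would remove the duplication.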
Top 10 categories by purchase count, with views, purchases, and purchases as a percentage of views
SELECT buy.cate_id, pv.pv_count, buy.buy_count, ROUND(buy.buy_count/pv.pv_count,3)*100 AS Transform
FROM
  (SELECT cate_id, COUNT(1) AS buy_count
   FROM ub
   WHERE type='buy'
   GROUP BY cate_id) AS buy
JOIN
  (SELECT cate_id, COUNT(1) AS pv_count
   FROM ub
   WHERE type='pv'
   GROUP BY cate_id) AS pv
ON buy.cate_id = pv.cate_id
ORDER BY buy.buy_count DESC
LIMIT 10;
Top 10 items by purchase count, with the same columns
SELECT buy.item_id, pv.pv_count, buy.buy_count, ROUND(buy.buy_count/pv.pv_count,3)*100 AS Transform
FROM
  (SELECT item_id, COUNT(1) AS buy_count
   FROM ub
   WHERE type='buy'
   GROUP BY item_id) AS buy
JOIN
  (SELECT item_id, COUNT(1) AS pv_count
   FROM ub
   WHERE type='pv'
   GROUP BY item_id) AS pv
ON buy.item_id = pv.item_id
ORDER BY buy.buy_count DESC
LIMIT 10;