The data comes from the CDNow website: purchase records of users on a CD retail site.
Fields in the dataset: user_id, order_dt (order date), order_products (number of products), order_amount (order amount).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
%matplotlib inline
columns = ['user_id','order_dt','order_products','order_amount']
df=pd.read_table('CDNOW_master.txt',names=columns,sep='\s+')
df.head()
 | user_id | order_dt | order_products | order_amount
---|---|---|---|---
0 | 1 | 19970101 | 1 | 11.77 |
1 | 2 | 19970112 | 1 | 12.00 |
2 | 2 | 19970112 | 5 | 77.00 |
3 | 3 | 19970102 | 2 | 20.76 |
4 | 3 | 19970330 | 2 | 20.76 |
df.info()
RangeIndex: 69659 entries, 0 to 69658
Data columns (total 4 columns):
user_id 69659 non-null int64
order_dt 69659 non-null int64
order_products 69659 non-null int64
order_amount 69659 non-null float64
dtypes: float64(1), int64(3)
memory usage: 2.1 MB
The order_dt column is stored as int64, so it needs to be converted to a datetime type. Use pandas' to_datetime (not the standard-library datetime module) with format='%Y%m%d' to parse the YYYYMMDD integers.
df['order_dt']=pd.to_datetime(df.order_dt,format='%Y%m%d')
df['order_dt']
0 1997-01-01
1 1997-01-12
2 1997-01-12
3 1997-01-02
4 1997-03-30
...
69654 1997-04-05
69655 1997-04-22
69656 1997-03-25
69657 1997-03-25
69658 1997-03-26
Name: order_dt, Length: 69659, dtype: datetime64[ns]
Next, extract the month from the dates above so user behaviour can be analysed month by month.
df.order_dt.values # take the underlying NumPy array first and call its astype; truncating to month precision via the Series directly would raise an error here
array(['1997-01-01T00:00:00.000000000', '1997-01-12T00:00:00.000000000',
'1997-01-12T00:00:00.000000000', ...,
'1997-03-25T00:00:00.000000000', '1997-03-25T00:00:00.000000000',
'1997-03-26T00:00:00.000000000'], dtype='datetime64[ns]')
df['month']=df.order_dt.values.astype('datetime64[M]')
df['month']
0 1997-01-01
1 1997-01-01
2 1997-01-01
3 1997-01-01
4 1997-03-01
...
69654 1997-04-01
69655 1997-04-01
69656 1997-03-01
69657 1997-03-01
69658 1997-03-01
Name: month, Length: 69659, dtype: datetime64[ns]
df.head()
 | user_id | order_dt | order_products | order_amount | month
---|---|---|---|---|---
0 | 1 | 1997-01-01 | 1 | 11.77 | 1997-01-01 |
1 | 2 | 1997-01-12 | 1 | 12.00 | 1997-01-01 |
2 | 2 | 1997-01-12 | 5 | 77.00 | 1997-01-01 |
3 | 3 | 1997-01-02 | 2 | 20.76 | 1997-01-01 |
4 | 3 | 1997-03-30 | 2 | 20.76 | 1997-03-01 |
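As a quick check of the month truncation used above, a minimal sketch on synthetic dates (not the CDNow data): casting datetime64[ns] values to datetime64[M] snaps each date to the first day of its month.

```python
import pandas as pd

# Synthetic dates (for illustration only): astype('datetime64[M]')
# truncates each timestamp to month precision.
dates = pd.to_datetime(['1997-01-12', '1997-03-30'])
months = dates.values.astype('datetime64[M]')
print(months)  # ['1997-01' '1997-03']
```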
grouped_month = df.groupby('month') # group the orders by month
order_month_amount = grouped_month.order_amount.sum() # total amount spent on CDs in each month
order_month_amount.head()
month
1997-01-01 299060.17
1997-02-01 379590.03
1997-03-01 393155.27
1997-04-01 142824.49
1997-05-01 107933.30
Name: order_amount, dtype: float64
print(plt.style.available) # matplotlib ships a number of plotting styles
['bmh', 'classic', 'dark_background', 'fast', 'fivethirtyeight', 'ggplot', 'grayscale', 'seaborn-bright', 'seaborn-colorblind', 'seaborn-dark-palette', 'seaborn-dark', 'seaborn-darkgrid', 'seaborn-deep', 'seaborn-muted', 'seaborn-notebook', 'seaborn-paper', 'seaborn-pastel', 'seaborn-poster', 'seaborn-talk', 'seaborn-ticks', 'seaborn-white', 'seaborn-whitegrid', 'seaborn', 'Solarize_Light2', 'tableau-colorblind10', '_classic_test']
plt.style.use("ggplot")
order_month_amount.plot()
The plot shows that monthly revenue peaks in the first three months; after that it is fairly stable, drifting slightly downwards.
grouped_month.user_id.count().plot()
Monthly order counts are around 10,000 in the first three months and around 2,500 afterwards.
To count each month's distinct buyers, de-duplicate each month's user records with drop_duplicates, then take the len of the result:
df.groupby('month').user_id.apply(lambda x:len(x.drop_duplicates())).plot() # distinct users per month
As expected, the monthly buyer count is lower than the monthly order count.
In the first three months there are roughly 8,000-10,000 buyers per month; in later months the average falls below 2,000.
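A note on idiom: `len(x.drop_duplicates())` above is equivalent to `Series.nunique`. A sketch on toy rows (synthetic, not the real data):

```python
import pandas as pd

# Toy frame: two distinct buyers in January, one in February.
toy = pd.DataFrame({'month': ['1997-01', '1997-01', '1997-02'],
                    'user_id': [1, 2, 1]})
monthly_users = toy.groupby('month').user_id.nunique()
print(monthly_users.tolist())  # [2, 1]
```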
grouped_user = df.groupby('user_id') # group the orders by user
grouped_user.sum().describe()
 | order_products | order_amount
---|---|---
count | 23570.000000 | 23570.000000 |
mean | 7.122656 | 106.080426 |
std | 16.983531 | 240.925195 |
min | 1.000000 | 0.000000 |
25% | 1.000000 | 19.970000 |
50% | 3.000000 | 43.395000 |
75% | 7.000000 | 106.475000 |
max | 1033.000000 | 13990.930000 |
grouped_user.sum().plot.scatter(x='order_amount',y='order_products')
There are outlier points, which need to be filtered out.
grouped_user.sum().query('order_amount<4000').plot.scatter(x='order_amount',y='order_products')
With the outliers excluded, the number of products a user buys scales roughly linearly with the amount spent.
grouped_user.sum().order_amount # total amount spent by each user
user_id
1 11.77
2 89.00
3 156.46
4 100.50
5 385.61
...
23566 36.00
23567 20.97
23568 121.70
23569 25.74
23570 94.08
Name: order_amount, Length: 23570, dtype: float64
grouped_user.sum().order_amount.plot.hist(bins=20)
The x-axis is the amount spent; the y-axis is the number of users who spent that amount.
The histogram shows that user spending is heavily concentrated at the low end, with a small number of outliers distorting the view. Filtering removes them.
grouped_user.sum().query('order_products<100').order_products#.plot.hist(bins=20)
user_id
1 1
2 6
3 16
4 7
5 29
..
23566 2
23567 1
23568 6
23569 2
23570 5
Name: order_products, Length: 23491, dtype: int64
grouped_user.sum().query('order_products<100').order_products.plot.hist(bins=20)
Using Chebyshev's theorem to justify the cutoff for filtering outliers, the roughly 94% of the product-count distribution that remains shows that most customers' buying power is modest.
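For reference, the Chebyshev bound behind the 94% figure (k = 4 is an assumption here; the theorem guarantees that at least 1 − 1/k² of any distribution lies within k standard deviations of the mean):

```python
# Chebyshev's inequality: coverage >= 1 - 1/k**2 for any distribution.
k = 4
coverage = 1 - 1 / k**2
print(coverage)  # 0.9375, i.e. roughly the 94% cited
```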
grouped_user.min().order_dt # each user's first purchase date
user_id
1 1997-01-01
2 1997-01-12
3 1997-01-02
4 1997-01-01
5 1997-01-01
...
23566 1997-03-25
23567 1997-03-25
23568 1997-03-25
23569 1997-03-25
23570 1997-03-25
Name: order_dt, Length: 23570, dtype: datetime64[ns]
grouped_user.min().order_dt.value_counts().plot()
First purchases are concentrated in the first three months,
with a noticeable spike between February 11 and February 15.
grouped_user.max().order_dt.value_counts().plot()
Last purchases are spread more widely than first purchases.
Most of them still fall in the first three months, which means many users bought once and never returned.
The count of last purchases also rises over time, so churn is gradually increasing.
pivoted_counts = df.pivot_table(index='user_id',columns = 'month',values= 'order_products',aggfunc='count').fillna(0)
pivoted_counts.head() # number of purchases each user made per month
month | 1997-01-01 | 1997-02-01 | 1997-03-01 | 1997-04-01 | 1997-05-01 | 1997-06-01 | 1997-07-01 | 1997-08-01 | 1997-09-01 | 1997-10-01 | 1997-11-01 | 1997-12-01 | 1998-01-01 | 1998-02-01 | 1998-03-01 | 1998-04-01 | 1998-05-01 | 1998-06-01 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
user_id | ||||||||||||||||||
1 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
4 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 | 2.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Now recode the monthly counts: more than one purchase in a month becomes 1 (repeat purchase), exactly one becomes 0, and no purchase becomes np.NaN.
purchase_r = pivoted_counts.applymap(lambda x: 1 if x>1 else np.NaN if x==0 else 0) # chained conditional expression: x>1 -> 1, x==0 -> NaN, otherwise (x==1) -> 0
purchase_r.head()
month | 1997-01-01 | 1997-02-01 | 1997-03-01 | 1997-04-01 | 1997-05-01 | 1997-06-01 | 1997-07-01 | 1997-08-01 | 1997-09-01 | 1997-10-01 | 1997-11-01 | 1997-12-01 | 1998-01-01 | 1998-02-01 | 1998-03-01 | 1998-04-01 | 1998-05-01 | 1998-06-01 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
user_id | ||||||||||||||||||
1 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 0.0 | NaN | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | 0.0 | NaN |
4 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | NaN | NaN | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN |
5 | 1.0 | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN | 1.0 | 0.0 | NaN | NaN | NaN | NaN | NaN |
(purchase_r.sum()/purchase_r.count()).plot(figsize=(10,4)) # repurchase rate: count() includes both 0s and 1s but skips NaN, while sum() adds only the 1s
The repurchase rate is stable at around 20%. It is lower in the first three months because the large influx of new users mostly bought only once in their first month.
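Why sum()/count() yields the repurchase rate, sketched on a toy column: count() skips NaN (users who did not buy that month), while sum() adds only the 1s (users who bought more than once).

```python
import numpy as np
import pandas as pd

# Toy month column: one repeat buyer (1), one single buyer (0),
# one non-buyer (NaN), another repeat buyer (1).
col = pd.Series([1, 0, np.nan, 1])
rate = col.sum() / col.count()  # 2 repeat buyers out of 3 buyers
print(rate)
```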
Next, recode the matrix so that any month with at least one purchase is 1 and months with none are 0.
df_purchase = pivoted_counts.applymap(lambda x: 1 if x > 0 else 0) # applymap applies the function to every cell of the DataFrame
df_purchase.head()
month | 1997-01-01 | 1997-02-01 | 1997-03-01 | 1997-04-01 | 1997-05-01 | 1997-06-01 | 1997-07-01 | 1997-08-01 | 1997-09-01 | 1997-10-01 | 1997-11-01 | 1997-12-01 | 1998-01-01 | 1998-02-01 | 1998-03-01 | 1998-04-01 | 1998-05-01 | 1998-06-01 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
user_id | ||||||||||||||||||
1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
5 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
If a user buys in one month and buys again the following month, that counts as a buy-back. Encode the purchase matrix accordingly:
def purchase_back(data):
    status = []
    for i in range(17):
        if data[i] == 1:
            if data[i+1] == 1:
                status.append(1)
            if data[i+1] == 0:
                status.append(0)
        else:
            status.append(np.NaN)
    status.append(np.NaN)  # the final month has no "next month", so it is marked unknown
    return pd.Series(status, index=data.index)
purchase_b=df_purchase.apply(purchase_back,axis=1)
purchase_b.head()
month | 1997-01-01 | 1997-02-01 | 1997-03-01 | 1997-04-01 | 1997-05-01 | 1997-06-01 | 1997-07-01 | 1997-08-01 | 1997-09-01 | 1997-10-01 | 1997-11-01 | 1997-12-01 | 1998-01-01 | 1998-02-01 | 1998-03-01 | 1998-04-01 | 1998-05-01 | 1998-06-01 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
user_id | ||||||||||||||||||
1 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 0.0 | NaN | 1.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN | 0.0 | NaN |
4 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | NaN | NaN | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN |
5 | 1.0 | 0.0 | NaN | 1.0 | 1.0 | 1.0 | 0.0 | NaN | 0.0 | NaN | NaN | 1.0 | 0.0 | NaN | NaN | NaN | NaN | NaN |
(purchase_b.sum()/purchase_b.count()).plot(figsize=(10,4))
The buy-back rate is higher than the repurchase rate, holding steady around 30% versus roughly 20%.
This suggests existing customers are of higher quality than new ones, i.e. returning customers are fairly loyal.
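For what it's worth, the purchase_back loop can also be written vectorised. A sketch on a one-row toy matrix (the column names m1..m4 are made up): a month is 1 if the user bought that month and the next, 0 if they bought that month only, NaN otherwise, and the final month is always NaN.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame([[1, 1, 0, 1]], columns=['m1', 'm2', 'm3', 'm4'])
nxt = toy.shift(-1, axis=1)  # each cell's "next month" value
back = pd.DataFrame(np.where(toy == 1, np.where(nxt == 1, 1.0, 0.0), np.nan),
                    index=toy.index, columns=toy.columns)
back.iloc[:, -1] = np.nan  # the last month's outcome is unknowable
print(back.iloc[0].tolist())  # [1.0, 0.0, nan, nan]
```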
rfm = df.pivot_table(index='user_id',values=['order_amount','order_products','order_dt'],
                     aggfunc={
                         'order_dt':'max','order_products':'sum','order_amount':'sum'}) # user_id stays fixed as the index
rfm.head()
 | order_amount | order_dt | order_products
---|---|---|---
user_id | |||
1 | 11.77 | 1997-01-01 | 1 |
2 | 89.00 | 1997-01-12 | 6 |
3 | 156.46 | 1998-05-28 | 16 |
4 | 100.50 | 1997-12-12 | 7 |
5 | 385.61 | 1998-01-03 | 29 |
-(rfm.order_dt - rfm.order_dt.max()) # days since each user's most recent purchase, treating the latest date in the dataset as "today"
user_id
1 545 days
2 534 days
3 33 days
4 200 days
5 178 days
...
23566 462 days
23567 462 days
23568 434 days
23569 462 days
23570 461 days
Name: order_dt, Length: 23570, dtype: timedelta64[ns]
rfm['R']=-(rfm.order_dt - rfm.order_dt.max())/np.timedelta64(1,'D') # divide by one day to strip the "days" unit and get a float
rfm['R']
user_id
1 545.0
2 534.0
3 33.0
4 200.0
5 178.0
...
23566 462.0
23567 462.0
23568 434.0
23569 462.0
23570 461.0
Name: R, Length: 23570, dtype: float64
rfm.rename(columns={
'order_products':'F','order_amount':'M'},inplace=True)
rfm
 | M | order_dt | F | R
---|---|---|---|---
user_id | ||||
1 | 11.77 | 1997-01-01 | 1 | 545.0 |
2 | 89.00 | 1997-01-12 | 6 | 534.0 |
3 | 156.46 | 1998-05-28 | 16 | 33.0 |
4 | 100.50 | 1997-12-12 | 7 | 200.0 |
5 | 385.61 | 1998-01-03 | 29 | 178.0 |
... | ... | ... | ... | ... |
23566 | 36.00 | 1997-03-25 | 2 | 462.0 |
23567 | 20.97 | 1997-03-25 | 1 | 462.0 |
23568 | 121.70 | 1997-04-22 | 6 | 434.0 |
23569 | 25.74 | 1997-03-25 | 2 | 462.0 |
23570 | 94.08 | 1997-03-26 | 5 | 461.0 |
23570 rows × 4 columns
def rfm_func(x):
    level = x.apply(lambda x: '1' if x>0 else '0')
    return level
level=rfm[['R','F','M']].apply(lambda x: x-x.mean()).apply(rfm_func)
str_label = level.R + level.F + level.M
d = {
'111':'重要价值客户','011':'重要保持客户','101':'重要挽留客户',
'001':'重要发展客户','110':'一般价值客户',
'010':'一般保持客户','100':'一般挽留客户',
'000':'一般发展客户'}
rfm['label']=str_label.map(d)
rfm
 | M | order_dt | F | R | label
---|---|---|---|---|---
user_id | |||||
1 | 11.77 | 1997-01-01 | 1 | 545.0 | 一般挽留客户 |
2 | 89.00 | 1997-01-12 | 6 | 534.0 | 一般挽留客户 |
3 | 156.46 | 1998-05-28 | 16 | 33.0 | 重要保持客户 |
4 | 100.50 | 1997-12-12 | 7 | 200.0 | 一般发展客户 |
5 | 385.61 | 1998-01-03 | 29 | 178.0 | 重要保持客户 |
... | ... | ... | ... | ... | ... |
23566 | 36.00 | 1997-03-25 | 2 | 462.0 | 一般挽留客户 |
23567 | 20.97 | 1997-03-25 | 1 | 462.0 | 一般挽留客户 |
23568 | 121.70 | 1997-04-22 | 6 | 434.0 | 重要挽留客户 |
23569 | 25.74 | 1997-03-25 | 2 | 462.0 | 一般挽留客户 |
23570 | 94.08 | 1997-03-26 | 5 | 461.0 | 一般挽留客户 |
23570 rows × 5 columns
rfm.groupby('label').sum()
 | M | F | R
---|---|---|---
label | |||
一般价值客户 | 7181.28 | 650 | 36295.0 |
一般保持客户 | 19937.45 | 1712 | 29448.0 |
一般发展客户 | 196971.23 | 13977 | 591108.0 |
一般挽留客户 | 438291.81 | 29346 | 6951815.0 |
重要价值客户 | 167080.83 | 11121 | 358363.0 |
重要保持客户 | 1592039.62 | 107789 | 517267.0 |
重要发展客户 | 45785.01 | 2023 | 56636.0 |
重要挽留客户 | 33028.40 | 1263 | 114482.0 |
The RFM segmentation shows that most revenue sits with the 重要保持客户 (important-to-retain) segment, but this is driven by extreme values; in practice the R/F/M thresholds should be set from business rules rather than the mean.
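A sketch of the mean-split tagging above on toy R/F/M numbers (the values are made up): each column is flagged '1' when above its mean and '0' otherwise, and the three flags concatenate into the lookup key for the label dictionary.

```python
import pandas as pd

toy = pd.DataFrame({'R': [10.0, 500.0], 'F': [20, 1], 'M': [300.0, 15.0]})
# Subtract each column's mean, then flag positive deviations as '1'.
flags = toy.apply(lambda col: (col - col.mean()).apply(lambda v: '1' if v > 0 else '0'))
key = flags.R + flags.F + flags.M
print(key.tolist())  # ['011', '100']
```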
rfm.loc[rfm.label=='重要价值客户','color']='g'
rfm.loc[rfm.label!='重要价值客户','color']='r'
rfm
 | M | order_dt | F | R | label | color
---|---|---|---|---|---|---
user_id | ||||||
1 | 11.77 | 1997-01-01 | 1 | 545.0 | 一般挽留客户 | r |
2 | 89.00 | 1997-01-12 | 6 | 534.0 | 一般挽留客户 | r |
3 | 156.46 | 1998-05-28 | 16 | 33.0 | 重要保持客户 | r |
4 | 100.50 | 1997-12-12 | 7 | 200.0 | 一般发展客户 | r |
5 | 385.61 | 1998-01-03 | 29 | 178.0 | 重要保持客户 | r |
... | ... | ... | ... | ... | ... | ... |
23566 | 36.00 | 1997-03-25 | 2 | 462.0 | 一般挽留客户 | r |
23567 | 20.97 | 1997-03-25 | 1 | 462.0 | 一般挽留客户 | r |
23568 | 121.70 | 1997-04-22 | 6 | 434.0 | 重要挽留客户 | r |
23569 | 25.74 | 1997-03-25 | 2 | 462.0 | 一般挽留客户 | r |
23570 | 94.08 | 1997-03-26 | 5 | 461.0 | 一般挽留客户 | r |
23570 rows × 6 columns
rfm.plot.scatter('F','R',color=rfm.color)
The scatter plot separates 重要价值客户 (important-value customers, green) from everyone else (red).
Next, segment users by their consumption behaviour into a few simple tiers: new, active, inactive, and returning users.
All windows below are measured by month.
For example, a user whose first purchase is in January is a "new" user for January; if they buy again in February they are "active"; with no purchase in March they become "inactive"; buying again in April makes them "returning"; and buying again in May makes them "active" once more.
df_purchase.tail() # whether each user purchased in each month (1 = yes, 0 = no)
month | 1997-01-01 | 1997-02-01 | 1997-03-01 | 1997-04-01 | 1997-05-01 | 1997-06-01 | 1997-07-01 | 1997-08-01 | 1997-09-01 | 1997-10-01 | 1997-11-01 | 1997-12-01 | 1998-01-01 | 1998-02-01 | 1998-03-01 | 1998-04-01 | 1998-05-01 | 1998-06-01 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
user_id | ||||||||||||||||||
23566 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
23567 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
23568 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
23569 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
23570 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
def active_status(data):
    status = []
    for i in range(18):
        # no purchase this month
        if data[i] == 0:
            if len(status) > 0:
                if status[i-1] == 'unreg':
                    status.append('unreg')      # still no purchase history
                else:
                    status.append('unactive')
            else:
                status.append('unreg')          # no purchases yet at all
        # purchased this month
        else:
            if len(status) == 0:
                status.append('new')
            else:
                if status[i-1] == 'unactive':
                    status.append('return')
                elif status[i-1] == 'unreg':
                    status.append('new')
                else:
                    status.append('active')
    return pd.Series(status, index=data.index)
purchase_status = df_purchase.apply(active_status,axis=1) # apply with axis=1 runs the function over each row of the DataFrame
purchase_status.head()
month | 1997-01-01 | 1997-02-01 | 1997-03-01 | 1997-04-01 | 1997-05-01 | 1997-06-01 | 1997-07-01 | 1997-08-01 | 1997-09-01 | 1997-10-01 | 1997-11-01 | 1997-12-01 | 1998-01-01 | 1998-02-01 | 1998-03-01 | 1998-04-01 | 1998-05-01 | 1998-06-01 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
user_id | ||||||||||||||||||
1 | new | unactive | unactive | unactive | unactive | unactive | unactive | unactive | unactive | unactive | unactive | unactive | unactive | unactive | unactive | unactive | unactive | unactive |
2 | new | unactive | unactive | unactive | unactive | unactive | unactive | unactive | unactive | unactive | unactive | unactive | unactive | unactive | unactive | unactive | unactive | unactive |
3 | new | unactive | return | active | unactive | unactive | unactive | unactive | unactive | unactive | return | unactive | unactive | unactive | unactive | unactive | return | unactive |
4 | new | unactive | unactive | unactive | unactive | unactive | unactive | return | unactive | unactive | unactive | return | unactive | unactive | unactive | unactive | unactive | unactive |
5 | new | active | unactive | return | active | active | active | unactive | return | unactive | unactive | return | active | unactive | unactive | unactive | unactive | unactive |
status_update = purchase_status.replace('unreg',np.NaN).apply(pd.value_counts)#.T.fillna(0).plot.area()
status_update
month | 1997-01-01 | 1997-02-01 | 1997-03-01 | 1997-04-01 | 1997-05-01 | 1997-06-01 | 1997-07-01 | 1997-08-01 | 1997-09-01 | 1997-10-01 | 1997-11-01 | 1997-12-01 | 1998-01-01 | 1998-02-01 | 1998-03-01 | 1998-04-01 | 1998-05-01 | 1998-06-01 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
active | NaN | 1157.0 | 1681 | 1773.0 | 852.0 | 747.0 | 746.0 | 604.0 | 528.0 | 532.0 | 624.0 | 632.0 | 512.0 | 472.0 | 571.0 | 518.0 | 459.0 | 446.0 |
new | 7846.0 | 8476.0 | 7248 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
return | NaN | NaN | 595 | 1049.0 | 1362.0 | 1592.0 | 1434.0 | 1168.0 | 1211.0 | 1307.0 | 1404.0 | 1232.0 | 1025.0 | 1079.0 | 1489.0 | 919.0 | 1029.0 | 1060.0 |
unactive | NaN | 6689.0 | 14046 | 20748.0 | 21356.0 | 21231.0 | 21390.0 | 21798.0 | 21831.0 | 21731.0 | 21542.0 | 21706.0 | 22033.0 | 22019.0 | 21510.0 | 22133.0 | 22082.0 | 22064.0 |
status_update.fillna(0).T.apply(lambda x: x/x.sum(),axis=1).plot.area()
In later months, inactive (churned) users dominate, returning users hold steady around 1,000, and no new users arrive after the first three months, which points to a weak acquisition picture.
Looking at just the return and active layers, their combined share is fairly stable; added together they give the fraction of users purchasing in each month.
st_ratio=status_update.fillna(0).apply(lambda x: x/x.sum(),axis=1) # for each status layer, the share contributed by each month
st_ratio
month | 1997-01-01 | 1997-02-01 | 1997-03-01 | 1997-04-01 | 1997-05-01 | 1997-06-01 | 1997-07-01 | 1997-08-01 | 1997-09-01 | 1997-10-01 | 1997-11-01 | 1997-12-01 | 1998-01-01 | 1998-02-01 | 1998-03-01 | 1998-04-01 | 1998-05-01 | 1998-06-01 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
active | 0.000000 | 0.090011 | 0.130776 | 0.137934 | 0.066283 | 0.058114 | 0.058036 | 0.046989 | 0.041077 | 0.041388 | 0.048545 | 0.049168 | 0.039832 | 0.036720 | 0.044422 | 0.040299 | 0.035709 | 0.034697 |
new | 0.332881 | 0.359610 | 0.307510 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
return | 0.000000 | 0.000000 | 0.031390 | 0.055342 | 0.071854 | 0.083988 | 0.075653 | 0.061620 | 0.063888 | 0.068953 | 0.074070 | 0.064996 | 0.054075 | 0.056924 | 0.078554 | 0.048483 | 0.054286 | 0.055922 |
unactive | 0.000000 | 0.019337 | 0.040606 | 0.059981 | 0.061739 | 0.061377 | 0.061837 | 0.063017 | 0.063112 | 0.062823 | 0.062276 | 0.062751 | 0.063696 | 0.063655 | 0.062184 | 0.063985 | 0.063838 | 0.063786 |
st_ratio.loc[['return','active']].T.plot()
Consumption clearly follows an 80/20 pattern, so let's measure how much of the revenue the high-value users contribute.
user_cumsum = grouped_user.sum().sort_values('order_amount').apply(lambda x:x.cumsum()/x.sum()) # sort first, then compute cumulative shares
user_cumsum
 | order_products | order_amount
---|---|---
user_id | ||
10175 | 0.000006 | 0.000000 |
4559 | 0.000012 | 0.000000 |
1948 | 0.000018 | 0.000000 |
925 | 0.000024 | 0.000000 |
10798 | 0.000030 | 0.000000 |
... | ... | ... |
7931 | 0.982940 | 0.985405 |
19339 | 0.985192 | 0.988025 |
7983 | 0.988385 | 0.990814 |
14048 | 0.994538 | 0.994404 |
7592 | 1.000000 | 1.000000 |
23570 rows × 2 columns
user_cumsum.reset_index().order_amount.plot() # reset_index replaces the user_id index with a positional one; plotting against user_id would scramble the x-axis
The curve's x-axis is users ordered by amount contributed; the y-axis is their cumulative share of revenue.
It is easy to see that the first 20,000 users contribute about 40% of revenue, while the remaining ~3,500 contribute the other 60%: a clear 80/20 pattern.
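The cumulative-share computation itself, sketched on four toy amounts: sort ascending, then cumsum/sum gives each user's running share of the total, which is exactly what the Pareto curve plots.

```python
import pandas as pd

amounts = pd.Series([1.0, 2.0, 1.0, 16.0]).sort_values()
share = amounts.cumsum() / amounts.sum()  # running share of total revenue
print(share.tolist())  # [0.05, 0.1, 0.2, 1.0]
```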
user_cumsum1= grouped_user.sum().sort_values('order_products').apply(lambda x:x.cumsum()/x.sum()) # same cumulative-share computation, sorted by product count
user_cumsum1.reset_index().order_products.plot()
By unit sales the picture is similar: the first 20,000 users account for about 40% of products sold, and the heavy buyers for the remaining 60%.
user_life = grouped_user.order_dt.agg(['min','max']) # each user's first and last purchase dates
user_life.head()
 | min | max
---|---|---
user_id | ||
1 | 1997-01-01 | 1997-01-01 |
2 | 1997-01-12 | 1997-01-12 |
3 | 1997-01-02 | 1998-05-28 |
4 | 1997-01-01 | 1997-12-12 |
5 | 1997-01-01 | 1998-01-03 |
(user_life['max'] - user_life['min']).describe()
count 23570
mean 134 days 20:55:36.987696
std 180 days 13:46:43.039788
min 0 days 00:00:00
25% 0 days 00:00:00
50% 0 days 00:00:00
75% 294 days 00:00:00
max 544 days 00:00:00
dtype: object
A median of 0 days means that at least half of all users purchased only once.
(user_life['min'] == user_life['max']).value_counts()
True 12054
False 11516
dtype: int64
About half of all users made exactly one purchase.
((user_life['max'] - user_life['min'])/np.timedelta64(1,'D')).hist(bins=20)
max((user_life['max'] - user_life['min'])/np.timedelta64(1,'D'))
544.0
The longest user lifetime is 544 days.
update_1 =(user_life['max'] - user_life['min']).reset_index()[0]/np.timedelta64(1,'D')
update_1[update_1>0].hist(bins=40)
The x-axis is user lifetime in days; the y-axis is the number of users.
This histogram excludes users who purchased only once.
Ordinary users have lifetimes of roughly 50-300 days; high-quality users, the loyal ones, exceed 400 days.
update_2=update_1[update_1>0]
update_2.mean()
276.0448072247308
Users with two or more purchases have an average lifetime of 276 days, well above the overall 134 days. So after a first purchase, steering users towards a second one should bring additional revenue.
update_3= update_1[update_1>400]
update_3.sum()/update_2.sum()
0.5292126412266761
update_3.count()/update_1.count()
0.15490029698769622
Users with lifetimes over 400 days make up 15.49% of all users, yet account for 52.92% of total lifetime, more than half. A small group of users drives most of the consumption value, consistent with the 80/20 rule.
Retention: the proportion of users who purchase again after their first order.
order_dt_min=grouped_user.order_dt.min() # each user's first purchase date
min_reindex = order_dt_min.reset_index()
min_reindex.head()
 | user_id | order_dt
---|---|---
0 | 1 | 1997-01-01 |
1 | 2 | 1997-01-12 |
2 | 3 | 1997-01-02 |
3 | 4 | 1997-01-01 |
4 | 5 | 1997-01-01 |
user_purchase=df[['user_id','order_dt','order_products','order_amount']]
user_purchase.head()
 | user_id | order_dt | order_products | order_amount
---|---|---|---|---
0 | 1 | 1997-01-01 | 1 | 11.77 |
1 | 2 | 1997-01-12 | 1 | 12.00 |
2 | 2 | 1997-01-12 | 5 | 77.00 |
3 | 3 | 1997-01-02 | 2 | 20.76 |
4 | 3 | 1997-03-30 | 2 | 20.76 |
user_purchase_retention=pd.merge(left=user_purchase,right=min_reindex,on='user_id',how='inner',suffixes=('','_min')) # suffixes disambiguates the duplicated order_dt column
user_purchase_retention.head()
 | user_id | order_dt | order_products | order_amount | order_dt_min
---|---|---|---|---|---
0 | 1 | 1997-01-01 | 1 | 11.77 | 1997-01-01 |
1 | 2 | 1997-01-12 | 1 | 12.00 | 1997-01-12 |
2 | 2 | 1997-01-12 | 5 | 77.00 | 1997-01-12 |
3 | 3 | 1997-01-02 | 2 | 20.76 | 1997-01-02 |
4 | 3 | 1997-03-30 | 2 | 20.76 | 1997-01-02 |
user_purchase_retention['order_diff']=user_purchase_retention['order_dt']-user_purchase_retention['order_dt_min'] # interval between each order and the user's first order
user_purchase_retention['order_diff']=user_purchase_retention['order_diff'].apply(lambda x:x/np.timedelta64(1,'D')) # strip the "days" unit, leaving a float
user_purchase_retention.order_diff.max()
544.0
To study how retained users' spending is distributed over the first year, split the retention time into buckets (the variable is named bins to avoid shadowing the built-in bin).
bins = [0,3,7,15,30,60,90,180,365]
user_purchase_retention['order_diff_bin'] = pd.cut(user_purchase_retention.order_diff,bins = bins)
user_purchase_retention.head(20)
 | user_id | order_dt | order_products | order_amount | order_dt_min | order_diff | order_diff_bin
---|---|---|---|---|---|---|---
0 | 1 | 1997-01-01 | 1 | 11.77 | 1997-01-01 | 0.0 | NaN |
1 | 2 | 1997-01-12 | 1 | 12.00 | 1997-01-12 | 0.0 | NaN |
2 | 2 | 1997-01-12 | 5 | 77.00 | 1997-01-12 | 0.0 | NaN |
3 | 3 | 1997-01-02 | 2 | 20.76 | 1997-01-02 | 0.0 | NaN |
4 | 3 | 1997-03-30 | 2 | 20.76 | 1997-01-02 | 87.0 | (60.0, 90.0] |
5 | 3 | 1997-04-02 | 2 | 19.54 | 1997-01-02 | 90.0 | (60.0, 90.0] |
6 | 3 | 1997-11-15 | 5 | 57.45 | 1997-01-02 | 317.0 | (180.0, 365.0] |
7 | 3 | 1997-11-25 | 4 | 20.96 | 1997-01-02 | 327.0 | (180.0, 365.0] |
8 | 3 | 1998-05-28 | 1 | 16.99 | 1997-01-02 | 511.0 | NaN |
9 | 4 | 1997-01-01 | 2 | 29.33 | 1997-01-01 | 0.0 | NaN |
10 | 4 | 1997-01-18 | 2 | 29.73 | 1997-01-01 | 17.0 | (15.0, 30.0] |
11 | 4 | 1997-08-02 | 1 | 14.96 | 1997-01-01 | 213.0 | (180.0, 365.0] |
12 | 4 | 1997-12-12 | 2 | 26.48 | 1997-01-01 | 345.0 | (180.0, 365.0] |
13 | 5 | 1997-01-01 | 2 | 29.33 | 1997-01-01 | 0.0 | NaN |
14 | 5 | 1997-01-14 | 1 | 13.97 | 1997-01-01 | 13.0 | (7.0, 15.0] |
15 | 5 | 1997-02-04 | 3 | 38.90 | 1997-01-01 | 34.0 | (30.0, 60.0] |
16 | 5 | 1997-04-11 | 3 | 45.55 | 1997-01-01 | 100.0 | (90.0, 180.0] |
17 | 5 | 1997-05-31 | 3 | 38.71 | 1997-01-01 | 150.0 | (90.0, 180.0] |
18 | 5 | 1997-06-16 | 2 | 26.14 | 1997-01-01 | 166.0 | (90.0, 180.0] |
19 | 5 | 1997-07-22 | 2 | 28.14 | 1997-01-01 | 202.0 | (180.0, 365.0] |
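Note the NaN values in order_diff_bin above: pd.cut builds left-open intervals by default, so a 0-day gap (the first order itself) and anything beyond the last edge (e.g. 511) fall outside every bin. A minimal sketch:

```python
import pandas as pd

# 0 falls outside (0, 3] and 511 falls beyond 365, so both come back NaN.
cut = pd.cut(pd.Series([0.0, 2.0, 511.0]),
             bins=[0, 3, 7, 15, 30, 60, 90, 180, 365])
print(cut.isna().tolist())  # [True, False, True]
```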
pivoted_retention = user_purchase_retention.pivot_table(index='user_id',values='order_amount',columns='order_diff_bin',aggfunc='sum')
pivoted_retention.head()
order_diff_bin | (0, 3] | (3, 7] | (7, 15] | (15, 30] | (30, 60] | (60, 90] | (90, 180] | (180, 365] |
---|---|---|---|---|---|---|---|---|
user_id | ||||||||
3 | NaN | NaN | NaN | NaN | NaN | 40.3 | NaN | 78.41 |
4 | NaN | NaN | NaN | 29.73 | NaN | NaN | NaN | 41.44 |
5 | NaN | NaN | 13.97 | NaN | 38.90 | NaN | 110.40 | 155.54 |
7 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 97.43 |
8 | NaN | NaN | NaN | NaN | 13.97 | NaN | 45.29 | 104.17 |
pivoted_retention.mean()
order_diff_bin
(0, 3] 35.905798
(3, 7] 36.385121
(7, 15] 42.669895
(15, 30] 45.964649
(30, 60] 50.215070
(60, 90] 48.975277
(90, 180] 67.223297
(180, 365] 91.960059
dtype: float64
Computing the average spend in each follow-up window shows that the longer a user stays retained, the more they spend.
pivoted_retention_trans = pivoted_retention.fillna(0).applymap(lambda x: 1 if x>0 else 0)
pivoted_retention_trans
order_diff_bin | (0, 3] | (3, 7] | (7, 15] | (15, 30] | (30, 60] | (60, 90] | (90, 180] | (180, 365] |
---|---|---|---|---|---|---|---|---|
user_id | ||||||||
3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
4 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
5 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 |
7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
8 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
23561 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
23563 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
23564 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
23568 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
23570 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
10810 rows × 8 columns
(pivoted_retention_trans.sum()/pivoted_retention_trans.count()).plot.bar()
Only 2.5% of users purchased again within 3 days of their first order, and 3% within 3-7 days. About 20% purchased between one month and six months after their first order, and 23% between six months and a year. From an operations standpoint, the CD site should pair new-user growth with loyalty building, bringing users back to purchase within a defined window.
df.order_dt
0 1997-01-01
1 1997-01-12
2 1997-01-12
3 1997-01-02
4 1997-03-30
...
69654 1997-04-05
69655 1997-04-22
69656 1997-03-25
69657 1997-03-25
69658 1997-03-26
Name: order_dt, Length: 69659, dtype: datetime64[ns]
df.order_dt.shift()
0 NaT
1 1997-01-01
2 1997-01-12
3 1997-01-12
4 1997-01-02
...
69654 1997-03-25
69655 1997-04-05
69656 1997-04-22
69657 1997-03-25
69658 1997-03-25
Name: order_dt, Length: 69659, dtype: datetime64[ns]
order_diff=grouped_user.apply(lambda x:x.order_dt-x.order_dt.shift()) # gap between each user's consecutive orders
order_diff
user_id
1 0 NaT
2 1 NaT
2 0 days
3 3 NaT
4 87 days
...
23568 69654 11 days
69655 17 days
23569 69656 NaT
23570 69657 NaT
69658 1 days
Name: order_dt, Length: 69659, dtype: timedelta64[ns]
order_diff.describe()
count 46089
mean 68 days 23:22:13.567662
std 91 days 00:47:33.924168
min 0 days 00:00:00
25% 10 days 00:00:00
50% 31 days 00:00:00
75% 89 days 00:00:00
max 533 days 00:00:00
Name: order_dt, dtype: object
On average, a user's consecutive orders are 68 days apart.
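As an aside, `Series.diff()` is the idiomatic shorthand for the `x - x.shift()` pattern used in the groupby above; a sketch on synthetic dates:

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['1997-01-01', '1997-01-12', '1997-03-30']))
gaps = dates.diff()  # NaT for the first row, then the interval to the previous order
print(gaps.dt.days.tolist())  # [nan, 11.0, 77.0]
```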
(order_diff/np.timedelta64(1,'D')).hist(bins=20)
The histogram shows a typical long-tail distribution: most purchase intervals are indeed short. Reasonable recall points follow directly: a coupon right after purchase, a day-10 follow-up asking how the CDs were, a day-30 reminder that the coupon is expiring, and an SMS push at day 60. This is where the analysis turns into action.