1 Individual User Consumption Analysis
1.1 Descriptive Statistics of Purchase Frequency
import pandas as pd
import numpy as np
from datetime import datetime
from sqlalchemy import create_engine
import pymysql
from matplotlib import pyplot as plt
import seaborn as sns
## load the data
dic={'host':'106*******2',
'user':'*******',
'port':3306,
'password':'root',
'database':'adventure'}
con=pymysql.connect(**dic)
sql="select * from orderCustomer_mdx"
df_order=pd.read_sql(sql,con)
df_order.head()
df_order.info()
RangeIndex: 197313 entries, 0 to 197312
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 index 197313 non-null int64
1 sales_order_key 197313 non-null int64
2 create_date_x 197313 non-null object
3 customer_key 197313 non-null object
4 product_key 197313 non-null object
5 english_product_name 197313 non-null object
6 cpzl_zw 197313 non-null object
7 cplb_zw 197313 non-null object
8 unit_price 197313 non-null float64
9 create_date_y 197313 non-null object
10 birth_date 197313 non-null object
11 gender 197313 non-null object
12 marital_status 197313 non-null object
13 yearly_income 197313 non-null object
14 province 197313 non-null object
15 city 197313 non-null object
16 chinese_territory 197313 non-null object
dtypes: float64(1), int64(2), object(14)
memory usage: 25.6+ MB
grouped_user=df_order.groupby('customer_key') # group by customer id
consume_times=grouped_user['create_date_x'].agg('count')
consume_times.describe()
count 173261.000000
mean 1.138819
std 0.419065
min 1.000000
25% 1.000000
50% 1.000000
75% 1.000000
max 10.000000
Name: create_date_x, dtype: float64
- The repeat-purchase rate is very low: roughly 90% of customers bought only once (the 75th percentile of purchase counts is still 1)
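The one-time-buyer share can be read directly off the frequency series; a minimal sketch using made-up counts (not the Adventure data):

```python
import pandas as pd

# Synthetic per-customer purchase counts (hypothetical values, not the real data).
consume_times = pd.Series([1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3])

# Share of customers who purchased exactly once.
one_time_share = (consume_times == 1).mean()
print(round(one_time_share, 2))  # 0.75 for this toy series
```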
1.2 Purchase Frequency Distribution
consume_times.hist(bins=[0,1,2,3,4,5,6,7,8,9,10,11])
consume_times[consume_times>1].hist(bins=[1,2,3,4,5,6,7,8,9,10,11])
- After excluding customers who bought only once, roughly 10% of customers turn out to have bought twice
1.3 Descriptive Statistics of Purchase Amount
grouped_user['unit_price'].sum().describe()
count 173261.000000
mean 557.398775
std 1016.186028
min 2.290000
25% 8.990000
50% 34.990000
75% 564.990000
max 10734.810000
Name: unit_price, dtype: float64
- The average spend per user is 557.4, while the median is only 35: the mean is heavily inflated by extreme values
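A tiny synthetic series illustrates how a handful of large spenders pull the mean far above the median:

```python
import pandas as pd

# Hypothetical per-user totals: mostly small orders plus one big spender.
totals = pd.Series([9, 9, 35, 35, 40, 3000])
print(totals.mean(), totals.median())  # mean ~521.3 vs median 35.0
```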
1.4 Purchase Amount Distribution
grouped_user['unit_price'].sum().hist(bins=\
[0,50,100,200,500,1000,1500,2000,2500,3000,3500,4000,5000,6000,7000,8000,10000])
- The most common spend bucket is under 50; the next largest is above 2000, a marked gap
pd.DataFrame(grouped_user['unit_price'].sum().
sort_values(ascending=False)).\
apply(lambda x:x.cumsum()/x.sum()).\
reset_index().plot()
- Sorting users by total spend in descending order, the cumulative curve shows that the bottom 80% of users contribute only about 10% of revenue, while the top 20% contribute about 90%
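The cumulative-contribution (Pareto) curve used above can be sketched on synthetic spend totals:

```python
import pandas as pd

# Hypothetical total spend per user.
spend = pd.Series([1000, 500, 100, 50, 30, 20], name='unit_price')

# Sort descending, then express the running total as a share of all revenue.
pareto = spend.sort_values(ascending=False).cumsum() / spend.sum()
print(pareto.round(3).tolist())  # [0.588, 0.882, 0.941, 0.971, 0.988, 1.0]
```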
2. User Purchase Behaviour Analysis
2.1 First and Most Recent Purchase
df_order['create_date_x']=pd.to_datetime(df_order['create_date_x']) # convert from object dtype; needed for the date arithmetic below
time_delta=df_order.groupby('customer_key')['create_date_x'].agg(['max','min'])
time_delta.head()
2.2 User Segmentation
- RFM model
- new / old / active / returning / churned
R: recency (most recent purchase date)
F: purchase frequency
M: total spend
RFM=pd.pivot_table(df_order,index='customer_key',
values=['create_date_x','product_key','unit_price'],
aggfunc={'create_date_x':'max','product_key':'count','unit_price':'sum'})
RFM_layer=RFM.rename(columns={'create_date_x':'R','product_key':'F','unit_price':'M'})
RFM_layer['RR']=-(RFM_layer.R-RFM_layer.R.max())/np.timedelta64(1,'D')
RFM_layer.head()
def RFM_model(df):
    # expects a mean-centred row of RR/F/M values
    level=df.apply(lambda x:'1' if x>=0 else '0')
    label=level['RR']+level['F']+level['M']
    dic={
        '111':'重要价值客户',  # key value customers
        '011':'重要保持客户',  # key retention customers
        '101':'重要发展客户',  # key development customers
        '001':'重要挽留客户',  # key win-back customers
        '110':'一般价值客户',  # general value customers
        '010':'一般保持客户',  # general retention customers
        '100':'一般发展客户',  # general development customers
        '000':'一般挽留客户'   # general win-back customers
    }
    return dic[label]
RFM_layer['lab']=RFM_layer[['RR','F','M']].apply(lambda x:x-x.mean()).apply(RFM_model,axis=1)
RFM_layer.head()
RFM_layer.loc[RFM_layer['lab']=='重要价值客户','color']='g'
RFM_layer.loc[RFM_layer['lab']=='重要保持客户','color']='b'
RFM_layer.loc[~RFM_layer['lab'].isin(['重要价值客户','重要保持客户']),'color']='r'
RFM_layer.plot.scatter(x='F',y='RR',c=RFM_layer.color)
- The scatter shows that key value (重要价值客户) and key retention (重要保持客户) customers make up a large share of the customers who bought two or more times
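The 3-bit labelling rule can be restated on a single mean-centred row; the RR/F/M values below are made up, and '1' means at or above the group mean, matching RFM_model above:

```python
import pandas as pd

def rfm_bits(centred_row):
    # One character per dimension: '1' if the mean-centred value is >= 0.
    bits = centred_row.apply(lambda v: '1' if v >= 0 else '0')
    return bits['RR'] + bits['F'] + bits['M']

# Hypothetical centred values: long recency, below-average frequency, high spend.
row = pd.Series({'RR': 12.5, 'F': -0.3, 'M': 210.0})
print(rfm_bits(row))  # '101', which maps to 重要发展客户 in the dictionary above
```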
# sns.swarmplot(x=RFM_layer.F,y=RFM_layer.RR)
# sns.violinplot(x=RFM_layer.F,y=RFM_layer.RR,scale= 'count')
2.3 AARRR
- monthly statuses: registered, active, returning, churned
df_order['year_month']=df_order.create_date_x.apply(lambda x:str(x)[:7])
pivoted_counts=df_order.pivot_table(index='customer_key',
columns='year_month',
values='unit_price',
aggfunc='count').fillna(0)
pivoted_counts.head()
def aarrr(x):
    # walk through the monthly purchase flags and assign a status per month
    status=[]
    for i in range(len(x)):  # use len(x) instead of a hard-coded month count
        if x[i]==0:
            if len(status)>0:
                if status[i-1]=='unreg':
                    status.append('unreg')      # still not a customer
                else:
                    status.append('unactive')   # customer, but no purchase this month
            else:
                status.append('unreg')
        else:
            if len(status)>0:
                if status[i-1]=='unreg':
                    status.append('new')        # first ever purchase
                else:
                    if status[i-1]=='unactive':
                        status.append('return') # came back after a gap
                    else:
                        status.append('active') # bought in consecutive months
            else:
                status.append('new')
    return status
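The state machine above can be exercised on a toy purchase-flag vector; this is a self-contained restatement of the same rules:

```python
def monthly_status(flags):
    # flags: 1 if the user bought in that month, else 0.
    status = []
    for i, bought in enumerate(flags):
        if bought:
            if not status or status[i - 1] == 'unreg':
                status.append('new')        # first ever purchase
            elif status[i - 1] == 'unactive':
                status.append('return')     # came back after a gap
            else:
                status.append('active')     # bought in consecutive months
        else:
            if not status or status[i - 1] == 'unreg':
                status.append('unreg')      # not yet a customer
            else:
                status.append('unactive')   # lapsed this month
    return status

print(monthly_status([0, 1, 0, 1, 1]))
# ['unreg', 'new', 'unactive', 'return', 'active']
```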
df_purchase=pivoted_counts.applymap(lambda x: 1 if x>0 else 0)
user_status=df_purchase.apply(lambda x:pd.Series(aarrr(x)),axis=1)  # avoid shadowing the aarrr function
user_status.columns=df_purchase.columns
status_count=user_status.replace('unreg',np.NaN).apply(lambda x:pd.value_counts(x))
status_count
status_count.T.iloc[:,:3].plot.bar(stacked=True)
- New users dominate every month; returning and active users are only a small minority
status_count.T.plot.area()
- A pyecharts chart makes this easier to read:
from pyecharts.charts import Line
import pyecharts.options as opts
M=status_count.fillna(0).T
c = (
    Line()
    .add_xaxis(M.index.to_list())
    # pass plain lists so pyecharts can serialise the values to JSON
    .add_yaxis("active", M['active'].tolist(), areastyle_opts=opts.AreaStyleOpts(opacity=0.5))
    .add_yaxis("new", M['new'].tolist(), areastyle_opts=opts.AreaStyleOpts(opacity=0.5))
    .add_yaxis("return", M['return'].tolist(), areastyle_opts=opts.AreaStyleOpts(opacity=0.5))
    .add_yaxis("unactive", M['unactive'].tolist(), areastyle_opts=opts.AreaStyleOpts(opacity=0.5))
    .set_global_opts(title_opts=opts.TitleOpts(title="Monthly user status (area)"))
)
c.render_notebook()
- Inactive users grow almost linearly and dominate the totals, while new-user growth is steady.
- Active and returning users are invisible at this scale, so drop the inactive series and re-plot:
M=status_count.fillna(0).T
c = (
    Line()
    .add_xaxis(M.index.to_list())
    .add_yaxis("active", M['active'].tolist(), areastyle_opts=opts.AreaStyleOpts(opacity=0.5))
    .add_yaxis("new", M['new'].tolist(), areastyle_opts=opts.AreaStyleOpts(opacity=0.5))
    .add_yaxis("return", M['return'].tolist(), areastyle_opts=opts.AreaStyleOpts(opacity=0.5))
    .set_global_opts(title_opts=opts.TitleOpts(title="Monthly user status, inactive removed (area)"))
)
c.render_notebook()
- New users grow steadily each month and returning users grow slightly, but active users keep declining
2.4 Average Repurchase Interval
df_order['create_date_x']=pd.to_datetime(df_order['create_date_x'],format = '%Y-%m-%d')
order_diff1=df_order.groupby('customer_key').apply(lambda x:(x.create_date_x-x.create_date_x.shift()))
order_diff1=order_diff1/np.timedelta64(1,'D')
order_diff1.head(20)
customer_key
1000006 156305 NaN
1000011 163001 NaN
100003 101728 NaN
1000046 157708 NaN
1000060 142570 NaN
142571 125.0
1000084 134251 NaN
1000090 151375 NaN
100011 79827 NaN
79828 138.0
1000129 143249 NaN
100014 37076 NaN
1000140 194473 NaN
100015 20121 NaN
1000158 165030 NaN
100017 24801 NaN
1000174 168904 NaN
100018 162404 NaN
162405 27.0
1000184 160136 NaN
Name: create_date_x, dtype: float64
order_diff1.groupby(level=0).mean().dropna()
customer_key
1000060 125.0
100011 138.0
100018 27.0
1000410 25.0
1000455 15.0
...
99916 115.0
99923 21.0
999270 13.5
999520 115.0
99984 258.0
Name: create_date_x, Length: 20231, dtype: float64
order_diff1.groupby(level=0).mean().\
dropna().hist(bins=[0,20,50,80,100,130,
160,200,250,300,350,400,450,500])
# average repurchase interval per user
- Roughly half of users have an average repurchase interval under 100 days
order_diff1.hist()
## repurchase interval per order
- By order, most repurchase intervals are also within 100 days
order_diff1.describe()
count 24052.000000
mean 113.276276
std 103.227265
min -55.000000
25% 28.000000
50% 82.000000
75% 174.000000
max 485.000000
Name: create_date_x, dtype: float64
- The mean repurchase interval across all orders is 113 days. The negative minimum (-55 days) suggests the rows are not date-sorted within each customer before diffing; sorting by create_date_x first would remove it
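On a toy frame, sorting by date before diffing yields only non-negative gaps; the customer keys and dates below are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical orders, deliberately out of date order for customer 'a'.
orders = pd.DataFrame({
    'customer_key': ['a', 'a', 'a', 'b', 'b'],
    'create_date_x': pd.to_datetime(
        ['2019-03-01', '2019-01-01', '2019-02-01', '2019-01-10', '2019-01-30']),
})

# Sort first so per-customer diffs always run forward in time.
gaps = (orders.sort_values('create_date_x')
              .groupby('customer_key')['create_date_x']
              .diff() / np.timedelta64(1, 'D'))
print(sorted(gaps.dropna().tolist()))  # [20.0, 28.0, 31.0]
```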
order_diff1.groupby(level=0).mean().\
dropna().describe()
count 20231.000000
mean 118.559745
std 100.622285
min -55.000000
25% 37.000000
50% 92.500000
75% 176.000000
max 485.000000
Name: create_date_x, dtype: float64
- The per-user mean repurchase interval differs little from the per-order mean
- This is likely because most repeat buyers purchased exactly twice; for them the per-order and per-user views coincide, so the two measures barely diverge
2.5 User Lifecycle
date_delta=(time_delta['max']-time_delta['min'])/np.timedelta64(1,'D')
date_delta.plot.hist(bins=50)
# date_delta.head()
date_delta.describe()
count 173261.000000
mean 15.730995
std 57.631400
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 485.000000
dtype: float64
- Since the vast majority of users bought only once (a lifecycle of 0 days), exclude them:
date_delta[date_delta>0].plot.hist(bins=25)
date_delta[date_delta>0].describe()
count 19953.000000
mean 136.599409
std 111.044766
min 1.000000
25% 43.000000
50% 110.000000
75% 208.000000
max 485.000000
dtype: float64
- The user lifecycle distribution is roughly exponential
- Among users with a lifecycle above 0, half have a lifecycle under 110 days
- The standard deviation is large and the mass is concentrated at the short end (a right-skewed distribution): short-lifecycle customers dominate
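The lifecycle computation above (last order minus first order, in days) can be sketched on a toy frame:

```python
import numpy as np
import pandas as pd

# Hypothetical orders: 'a' buys twice, four months apart; 'b' buys once.
orders = pd.DataFrame({
    'customer_key': ['a', 'a', 'b'],
    'create_date_x': pd.to_datetime(['2019-01-01', '2019-05-01', '2019-02-15']),
})

span = orders.groupby('customer_key')['create_date_x'].agg(['max', 'min'])
lifecycle = (span['max'] - span['min']) / np.timedelta64(1, 'D')
print(lifecycle.to_dict())  # {'a': 120.0, 'b': 0.0}; one-time buyers have lifecycle 0
```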