Adventure: Analysis of Individual Users and User Behavior

1 Individual User Consumption Analysis

1.1 Descriptive Statistics of Purchase Frequency per User

import pandas as pd
import numpy as np
from datetime import datetime
from sqlalchemy import create_engine
import pymysql
from matplotlib import pyplot as plt
import seaborn as sns
## load the order data from MySQL
dic={'host':'106*******2',
    'user':'*******',
    'port':3306,
    'password':'root',
    'database':'adventure'}
con=pymysql.connect(**dic)
sql="select * from orderCustomer_mdx"
df_order=pd.read_sql(sql,con)
df_order.head()
[figure: df_order.head() output]
df_order.info()

RangeIndex: 197313 entries, 0 to 197312
Data columns (total 17 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   index                 197313 non-null  int64  
 1   sales_order_key       197313 non-null  int64  
 2   create_date_x         197313 non-null  object 
 3   customer_key          197313 non-null  object 
 4   product_key           197313 non-null  object 
 5   english_product_name  197313 non-null  object 
 6   cpzl_zw               197313 non-null  object 
 7   cplb_zw               197313 non-null  object 
 8   unit_price            197313 non-null  float64
 9   create_date_y         197313 non-null  object 
 10  birth_date            197313 non-null  object 
 11  gender                197313 non-null  object 
 12  marital_status        197313 non-null  object 
 13  yearly_income         197313 non-null  object 
 14  province              197313 non-null  object 
 15  city                  197313 non-null  object 
 16  chinese_territory     197313 non-null  object 
dtypes: float64(1), int64(2), object(14)
memory usage: 25.6+ MB
df_order['create_date_x']=pd.to_datetime(df_order['create_date_x'])  # parse order dates up front; later date arithmetic needs datetimes
grouped_user=df_order.groupby('customer_key')  # group by customer id
consume_times=grouped_user['create_date_x'].agg('count')  # number of orders per customer
consume_times.describe()
count    173261.000000
mean          1.138819
std           0.419065
min           1.000000
25%           1.000000
50%           1.000000
75%           1.000000
max          10.000000
Name: create_date_x, dtype: float64
  • Repeat purchasing is very low: roughly 90% of customers bought only once (exact breakdown checked in the sketch below).
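A quick check of the exact frequency breakdown, reusing the consume_times Series computed above (a minimal sketch, not part of the original notebook output):

freq_share = consume_times.value_counts(normalize=True).sort_index()
print(freq_share)  # share of customers making 1, 2, 3, ... purchases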

1.2 Distribution of Purchase Frequency

consume_times.hist(bins=[0,1,2,3,4,5,6,7,8,9,10,11])

[figure: histogram of purchase counts per customer]
consume_times[consume_times>1].hist(bins=[1,2,3,4,5,6,7,8,9,10,11])

[figure: histogram of purchase counts, customers with more than one purchase]


  • Excluding customers who bought only once, about 10% of customers bought twice.

1.3 Descriptive Statistics of Spend per User

grouped_user['unit_price'].sum().describe()
count    173261.000000
mean        557.398775
std        1016.186028
min           2.290000
25%           8.990000
50%          34.990000
75%         564.990000
max       10734.810000
Name: unit_price, dtype: float64
  • Mean spend per customer is 557.4, while the median is only about 35, so the mean is pulled up heavily by a few extreme values.

1.4 Distribution of Spend per User

grouped_user['unit_price'].sum().hist(bins=\
[0,50,100,200,500,1000,1500,2000,2500,3000,3500,4000,5000,6000,7000,8000,10000])

[figure: histogram of total spend per customer]
  • The most common spend range is under 50, followed by amounts above 2000; the gap between the two groups is pronounced (bucket counts checked in the sketch below).
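The counts behind the histogram can be printed directly. A minimal sketch with the same bin edges (amounts above 10000 fall outside the bins and are dropped):

bins=[0,50,100,200,500,1000,1500,2000,2500,3000,3500,4000,5000,6000,7000,8000,10000]
amount_buckets = pd.cut(grouped_user['unit_price'].sum(), bins=bins).value_counts().sort_index()
print(amount_buckets)  # number of customers in each spend bucket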
pd.DataFrame(grouped_user['unit_price'].sum().
             sort_values(ascending=False)).\
    apply(lambda x:x.cumsum()/x.sum()).\
reset_index().plot()


[figure: cumulative share of total spend, users ranked by spend]
  • Ranking users by total spend in descending order and plotting the cumulative share shows that the bottom 80% of users contribute only about 10% of revenue, while the top-spending 20% contribute roughly 90%.
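To read exact numbers off the curve instead of eyeballing the plot, a minimal sketch reusing grouped_user from above:

user_amount = grouped_user['unit_price'].sum().sort_values(ascending=False)
cum_share = user_amount.cumsum() / user_amount.sum()
top_n = (cum_share < 0.9).sum() + 1       # smallest number of top spenders covering 90% of revenue
print(top_n, top_n / len(user_amount))    # count and share of the customer base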

2. User Purchase Behavior Analysis

2.1 First and Most Recent Purchase

time_delta=grouped_user['create_date_x'].agg(['max','min'])
time_delta.head()
[figure: time_delta.head() - latest and earliest purchase date per customer]

2.2 User Segmentation

  • RFM model
  • New, active, returning, and churned users

R: recency of the last purchase
F: purchase frequency
M: monetary value (total spend)

RFM=pd.pivot_table(df_order,index='customer_key',
                   values=['create_date_x','product_key','unit_price'],
                   aggfunc={'create_date_x':'max','product_key':'count','unit_price':'sum'})
RFM_layer=RFM.rename(columns={'create_date_x':'R','product_key':'F','unit_price':'M'})

RFM_layer['RR']=-(RFM_layer.R-RFM_layer.R.max())/np.timedelta64(1,'D')  # days between a customer's last order and the latest order date in the data
RFM_layer.head()
[figure: RFM_layer.head() output]
def RFM_model(df):
    # df is one row of (RR, F, M) already centred on the column means
    level=df.apply(lambda x:'1' if x>=0 else '0')
    label=level['RR']+level['F']+level['M']
    dic={
        '111':'重要价值客户',  # key value customers
        '011':'重要保持客户',  # key retention customers
        '101':'重要发展客户',  # key development customers
        '001':'重要挽留客户',  # key win-back customers
        '110':'一般价值客户',  # general value customers
        '010':'一般保持客户',  # general retention customers
        '100':'一般发展客户',  # general development customers
        '000':'一般挽留客户'   # general win-back customers
    }
    return dic[label]


RFM_layer['lab']=RFM_layer[['RR','F','M']].apply(lambda x:x-x.mean()).apply(RFM_model,axis=1)


RFM_layer.head()
[figure: RFM_layer.head() with segment labels]
RFM_layer.loc[RFM_layer['lab']=='重要价值客户','color']='g'
RFM_layer.loc[RFM_layer['lab']=='重要保持客户','color']='b'
RFM_layer.loc[(RFM_layer['lab']!='重要保持客户')&(RFM_layer['lab']!='重要价值客户'),'color']='r'  # all other segments in red

RFM_layer.plot.scatter(x='F',y='RR',c=RFM_layer.color)

[figure: scatter of F vs RR coloured by segment]
  • The scatter shows that key value customers (重要价值客户) and key retention customers (重要保持客户) make up a large share of the customers who purchased two or more times (exact segment sizes in the sketch below).
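The size of each segment can be checked directly. A minimal sketch using the lab column created above:

seg_counts = RFM_layer['lab'].value_counts()
print(seg_counts)                     # number of customers per RFM segment
print(seg_counts / seg_counts.sum())  # share of the customer base per segment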
# sns.swarmplot(x=RFM_layer.F,y=RFM_layer.RR)
# sns.violinplot(x=RFM_layer.F,y=RFM_layer.RR,scale= 'count')

2.3 AARRR

  • Monthly statuses: unregistered, new, active, returning, inactive (churned)
df_order['year_month']=df_order.create_date_x.apply(lambda x:str(x)[:7])
pivoted_counts=df_order.pivot_table(index='customer_key',
                             columns='year_month',
                             values='unit_price',
                             aggfunc='count').fillna(0)
pivoted_counts.head()
[figure: pivoted_counts.head() - monthly order counts per customer]
def aarrr(x):
    # walk through the months for one customer and assign a status to each month
    status=[]
    for i in range(len(x)):
        if x.iloc[i]==0:                    # no purchase this month
            if len(status)>0:
                if status[i-1]=='unreg':
                    status.append('unreg')     # still never purchased
                else:
                    status.append('unactive')  # purchased before, but not this month
            else:
                status.append('unreg')
        else:                               # purchased this month
            if len(status)>0:
                if status[i-1]=='unreg':
                    status.append('new')       # first ever purchase
                elif status[i-1]=='unactive':
                    status.append('return')    # came back after a gap
                else:
                    status.append('active')    # purchased in consecutive months
            else:
                status.append('new')
    return status
df_purchase=pivoted_counts.applymap(lambda x: 1 if x>0 else 0)
status_df=df_purchase.apply(lambda x:pd.Series(aarrr(x)),axis=1)  # one status per customer per month (renamed to avoid shadowing the function)
status_df.columns=df_purchase.columns
status_count=status_df.replace('unreg',np.nan).apply(lambda x:x.value_counts())  # number of customers in each status per month
status_count
[figure: status_count - number of customers in each status per month]
status_count.T.iloc[:,:3].plot.bar(stacked=True)

[figure: stacked bar chart of monthly active/new/return counts]
  • New users dominate every month; returning and active users are only a tiny minority (monthly shares computed in the sketch below).
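The monthly composition of purchasers can be quantified as row-wise shares. A minimal sketch over status_count from above:

purchasers = status_count.T[['new','active','return']].fillna(0)
monthly_share = purchasers.div(purchasers.sum(axis=1), axis=0)  # share of new/active/return among customers who purchased that month
print(monthly_share)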
status_count.T.plot.area()

[figure: stacked area chart of all statuses per month]
  • pyecharts makes the same picture easier to read:
from pyecharts.charts import Line
import pyecharts.options as opts
M=status_count.fillna(0).T
c = (
    Line()
    .add_xaxis(M.index.to_list())
    .add_yaxis("active", M['active'].tolist(), areastyle_opts=opts.AreaStyleOpts(opacity=0.5))
    .add_yaxis("new", M['new'].tolist(), areastyle_opts=opts.AreaStyleOpts(opacity=0.5))
    .add_yaxis("return", M['return'].tolist(), areastyle_opts=opts.AreaStyleOpts(opacity=0.5))
    .add_yaxis("unactive", M['unactive'].tolist(), areastyle_opts=opts.AreaStyleOpts(opacity=0.5))
    .set_global_opts(title_opts=opts.TitleOpts(title="Monthly user status (area chart)"))
)
c.render_notebook()
[figure: pyecharts area chart of all four statuses]
  • The inactive count rises almost linearly and keeps growing, while new-user growth is steady.
  • Active and returning users are invisible at this scale, so remove the inactive series and look again:
M=status_count.fillna(0).T
c = (
    Line()
    .add_xaxis(M.index.to_list())
    .add_yaxis("active", M['active'].tolist(), areastyle_opts=opts.AreaStyleOpts(opacity=0.5))
    .add_yaxis("new", M['new'].tolist(), areastyle_opts=opts.AreaStyleOpts(opacity=0.5))
    .add_yaxis("return", M['return'].tolist(), areastyle_opts=opts.AreaStyleOpts(opacity=0.5))
    .set_global_opts(title_opts=opts.TitleOpts(title="Monthly user status excluding inactive (area chart)"))
)
c.render_notebook()
[figure: pyecharts area chart of new/active/return only]
  • With the inactive series removed, new customers per month grow steadily, returning customers grow slowly, but active customers keep declining.

2.4 Average Purchase Cycle

df_order['create_date_x']=pd.to_datetime(df_order['create_date_x'],format = '%Y-%m-%d')  # no-op if already converted above
order_diff1=df_order.groupby('customer_key').apply(lambda x:(x.create_date_x-x.create_date_x.shift()))  # gap to the previous order within each customer
order_diff1=order_diff1/np.timedelta64(1,'D')  # convert to days; rows are not date-sorted within a group, hence a few negative values below
order_diff1.head(20)
customer_key        
1000006       156305      NaN
1000011       163001      NaN
100003        101728      NaN
1000046       157708      NaN
1000060       142570      NaN
              142571    125.0
1000084       134251      NaN
1000090       151375      NaN
100011        79827       NaN
              79828     138.0
1000129       143249      NaN
100014        37076       NaN
1000140       194473      NaN
100015        20121       NaN
1000158       165030      NaN
100017        24801       NaN
1000174       168904      NaN
100018        162404      NaN
              162405     27.0
1000184       160136      NaN
Name: create_date_x, dtype: float64
order_diff1.groupby(level=0).mean().dropna()
customer_key
1000060    125.0
100011     138.0
100018      27.0
1000410     25.0
1000455     15.0
           ...  
99916      115.0
99923       21.0
999270      13.5
999520     115.0
99984      258.0
Name: create_date_x, Length: 20231, dtype: float64
order_diff1.groupby(level=0).mean().\
dropna().hist(bins=[0,20,50,80,100,130,
                    160,200,250,300,350,400,450,500])
# average purchase cycle per customer

[figure: histogram of average purchase cycle per customer]
  • Roughly half of repeat customers have an average repurchase interval within 100 days (share checked in the sketch below).
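A quick check of that "about half" reading on the per-customer means (a minimal sketch):

user_cycle = order_diff1.groupby(level=0).mean().dropna()
print((user_cycle <= 100).mean())  # share of repeat customers whose average repurchase interval is at most 100 days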
order_diff1.hist()
## repurchase interval per order (order level)

[figure: histogram of order-level repurchase intervals]
  • At the order level, most repurchase intervals also fall within about 100 days.
order_diff1.describe()
count    24052.000000
mean       113.276276
std        103.227265
min        -55.000000
25%         28.000000
50%         82.000000
75%        174.000000
max        485.000000
Name: create_date_x, dtype: float64
  • The mean repurchase interval across all orders is about 113 days (the negative minimum comes from unsorted dates within a customer group; see the sketch below).
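The min of -55 days above appears because the orders inside each customer group are not sorted by date before differencing. A minimal sketch of the same interval calculation with explicit sorting, reusing the columns above:

order_diff_sorted = (df_order.sort_values('create_date_x')
                     .groupby('customer_key')['create_date_x']
                     .diff() / np.timedelta64(1,'D'))
order_diff_sorted.describe()  # intervals are now non-negative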
order_diff1.groupby(level=0).mean().\
dropna().describe()
count    20231.000000
mean       118.559745
std        100.622285
min        -55.000000
25%         37.000000
50%         92.500000
75%        176.000000
max        485.000000
Name: create_date_x, dtype: float64
  • The per-customer average purchase cycle and the per-order purchase cycle differ very little.
  • This is likely because the most common repeat-purchase count is two; for those customers the per-order and per-customer intervals are the same number, so the two ways of slicing the data give nearly identical statistics (checked in the sketch below).
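That reasoning can be checked by looking at how many repeat customers bought exactly twice, reusing consume_times from section 1.1 (a minimal sketch):

repeat = consume_times[consume_times > 1]
print((repeat == 2).mean())  # share of repeat customers with exactly two purchases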

2.5 User Lifecycle

date_delta=(time_delta['max']-time_delta['min'])/np.timedelta64(1,'D')
date_delta.plot.hist(bins=50)
# date_delta.head()

[figure: histogram of user lifecycle in days]
date_delta.describe()
count    173261.000000
mean         15.730995
std          57.631400
min           0.000000
25%           0.000000
50%           0.000000
75%           0.000000
max         485.000000
dtype: float64
  • Since the vast majority of customers purchased only once (a lifecycle of 0 days), remove them before looking at the distribution (the share of zero-lifecycle customers is checked in the sketch below):
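A minimal sketch of that share, using date_delta from above:

print((date_delta == 0).mean())  # share of customers whose first and last purchase fall on the same day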
date_delta[date_delta>0].plot.hist(bins=25)

[figure: histogram of user lifecycle, lifecycle > 0 only]
date_delta[date_delta>0].describe()
count    19953.000000
mean       136.599409
std        111.044766
min          1.000000
25%         43.000000
50%        110.000000
75%        208.000000
max        485.000000
dtype: float64
  • The user lifecycle is roughly exponentially distributed (a quick skewness check follows below).
  • Among users with a lifecycle greater than 0, half have a lifecycle shorter than 110 days.
  • The standard deviation is large and the mass of the distribution sits at the low end, i.e. customers with short lifecycles dominate.
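A minimal skewness check on the positive lifecycles, assuming date_delta from above:

print(date_delta[date_delta > 0].skew())  # a positive value means a long right tail: short lifecycles dominate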
