数据来源于CDnow网站的用户购买记录,字段包括用户ID,订单日期,订单数量和订单金额。
字段名 | 说明 |
---|---|
user_id | 用户ID |
order_dt | 订单日期 |
order_product | 订单数量 |
order_amount | 订单金额 |
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
import matplotlib.pyplot as plt
df = pd.read_csv('./CDNOW_master.txt', header=None, sep='\s+', names=['user_id', 'order_date', 'order_quantity', 'order_amount'])
df.head()
'''
user_id order_date order_quantity order_amount
0 1 19970101 1 11.77
1 2 19970112 1 12.00
2 2 19970112 5 77.00
3 3 19970102 2 20.76
4 3 19970330 2 20.76
'''
df.shape # (69659, 4)
df.info()
'''
RangeIndex: 69659 entries, 0 to 69658
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 user_id 69659 non-null int64
1 order_date 69659 non-null int64
2 order_quantity 69659 non-null int64
3 order_amount 69659 non-null float64
dtypes: float64(1), int64(3)
memory usage: 2.1 MB
'''
df['order_date'] = pd.to_datetime(df['order_date'], format="%Y%m%d")
增加一列,显示月份。
df['order_month'] = df['order_date'].astype('datetime64[M]')
df.head()
'''
user_id order_date order_quantity order_amount order_month
0 1 1997-01-01 1 11.77 1997-01-01
1 2 1997-01-12 1 12.00 1997-01-01
2 2 1997-01-12 5 77.00 1997-01-01
3 3 1997-01-02 2 20.76 1997-01-01
4 3 1997-03-30 2 20.76 1997-03-01
'''
describe函数用于观察这一系列数据的范围、大小、波动趋势等,便于判断对数据使用哪类模型更合适。
df.describe()
'''
user_id order_quantity order_amount
count 69659.000000 69659.000000 69659.000000
mean 11470.854592 2.410040 35.893648
std 6819.904848 2.333924 36.281942
min 1.000000 1.000000 0.000000
25% 5506.000000 1.000000 14.490000
50% 11410.000000 2.000000 25.980000
75% 17273.000000 3.000000 43.700000
max 23570.000000 99.000000 1286.010000
'''
df.groupby(by='order_month')['order_amount'].sum()
'''
order_month
1997-01-01 299060.17
1997-02-01 379590.03
1997-03-01 393155.27
1997-04-01 142824.49
1997-05-01 107933.30
1997-06-01 108395.87
1997-07-01 122078.88
1997-08-01 88367.69
1997-09-01 81948.80
1997-10-01 89780.77
1997-11-01 115448.64
1997-12-01 95577.35
1998-01-01 76756.78
1998-02-01 77096.96
1998-03-01 108970.15
1998-04-01 66231.52
1998-05-01 70989.66
1998-06-01 76109.30
'''
df.groupby(by='order_month')['order_amount'].sum().plot()
df.groupby(by='order_month')['order_quantity'].sum()
'''
order_month
1997-01-01 19416
1997-02-01 24921
1997-03-01 26159
1997-04-01 9729
1997-05-01 7275
1997-06-01 7301
1997-07-01 8131
1997-08-01 5851
1997-09-01 5729
1997-10-01 6203
1997-11-01 7812
1997-12-01 6418
1998-01-01 5278
1998-02-01 5340
1998-03-01 7431
1998-04-01 4697
1998-05-01 4903
1998-06-01 5287
'''
df.groupby(by='order_month')['user_id'].count()
'''
order_month
1997-01-01 8928
1997-02-01 11272
1997-03-01 11598
1997-04-01 3781
1997-05-01 2895
1997-06-01 3054
1997-07-01 2942
1997-08-01 2320
1997-09-01 2296
1997-10-01 2562
1997-11-01 2750
1997-12-01 2504
1998-01-01 2032
1998-02-01 2026
1998-03-01 2793
1998-04-01 1878
1998-05-01 1985
1998-06-01 2043
'''
df.groupby(by='order_month')['user_id'].nunique()
'''
order_month
1997-01-01 7846
1997-02-01 9633
1997-03-01 9524
1997-04-01 2822
1997-05-01 2214
1997-06-01 2339
1997-07-01 2180
1997-08-01 1772
1997-09-01 1739
1997-10-01 1839
1997-11-01 2028
1997-12-01 1864
1998-01-01 1537
1998-02-01 1551
1998-03-01 2060
1998-04-01 1437
1998-05-01 1488
1998-06-01 1506
'''
df['order_quantity'].sum(), df['order_amount'].sum() # (167881, 2500315.6300000004)
users_amount_series = df.groupby(by='user_id')['order_amount'].sum()
users_quantity_series = df.groupby(by='user_id')['order_quantity'].sum()
plt.scatter(users_amount_series, users_quantity_series)
order_amount_1000_series = df.query('order_amount <= 1000').groupby(by='user_id')['order_amount'].sum()
order_quantity_1000_series = df.query('order_quantity <= 100').groupby(by='user_id')['order_quantity'].sum()
df.groupby(by='user_id')['order_month'].min().value_counts()
'''
1997-02-01 8476
1997-01-01 7846
1997-03-01 7248
'''
df.groupby(by='user_id')['order_month'].max().value_counts()
'''
1997-02-01 4912
1997-03-01 4478
1997-01-01 4192
1998-06-01 1506
1998-05-01 1042
1998-03-01 993
1998-04-01 769
1997-04-01 677
1997-12-01 620
1997-11-01 609
1998-02-01 550
1998-01-01 514
1997-06-01 499
1997-07-01 493
1997-05-01 480
1997-10-01 455
1997-09-01 397
1997-08-01 384
'''
获取每位用户的首次消费和最后一次消费的消费时间。
new_old_df = df.groupby(by='user_id')['order_date'].agg(['min', 'max'])
(new_old_df['min'] == new_old_df['max']).value_counts()
'''
True 12054 # 新用户
False 11516 # 老用户
'''
获取新老客户的消费比例。
new_num = new_old_df.loc[new_old_df['min'] == new_old_df['max']].shape[0]
old_num = new_old_df.loc[new_old_df['min'] != new_old_df['max']].shape[0]
new_num / old_num # 1.0467176102813478
rfm = df.pivot_table(
index='user_id',
aggfunc={
'order_quantity': 'sum',
'order_amount': 'sum',
'order_date': 'max'
}
)
rfm.head()
'''
order_amount order_date order_quantity
user_id
1 11.77 1997-01-01 1
2 89.00 1997-01-12 6
3 156.46 1998-05-28 16
4 100.50 1997-12-12 7
5 385.61 1998-01-03 29
'''
RFM分析模型
R表示客户最近一次交易到现在时间间隔;
F表示客户购买商品的总数量。F值越大,表示客户交易越频繁,反之则表示客户交易不够活跃;
M表示客户交易金额。M值越大,表示客户价值越高,反之则表示客户价值较低。
# 计算R
rfm['R'] = (df['order_date'].max() - rfm['order_date']) / np.timedelta64(1, 'D')
rfm = rfm[['order_amount', 'order_quantity', 'R']]
rfm.columns = ['M', 'F', 'R']
rfm.head()
'''
M F R
user_id
1 11.77 1 545.0
2 89.00 6 534.0
3 156.46 16 33.0
4 100.50 7 200.0
5 385.61 29 178.0
'''
# RFM分层算法
def rfm_func(x):
# 存储的数据是三个字符串形式的0或1
level = x.map(lambda x: '1' if x >= 0 else '0')
label = level['R'] + level.F + level.M
d = {
'111': '重要价值客户',
'011': '重要保持客户',
'101': '重要挽留客户',
'001': '重要发展客户',
'110': '一般价值客户',
'010': '一般保持客户',
'100': '一般挽留客户',
'000': '一般发展客户',
}
result = d[label]
return result
# df.apply(func):用于对df中的行或列进行指定形式(func)的运算
rfm['label'] = rfm.apply(lambda x: x - x.mean(), axis=0).apply(rfm_func, axis = 1)
rfm.head()
'''
M F R label
user_id
1 11.77 1 545.0 一般挽留客户
2 89.00 6 534.0 一般挽留客户
3 156.46 16 33.0 重要保持客户
4 100.50 7 200.0 一般发展客户
5 385.61 29 178.0 重要保持客户
'''
统计每位用户每月的消费次数
consumption_per_month_df = df.pivot_table(index='user_id', values='order_date', aggfunc='count', columns='order_month', fill_value=0)
consumption_per_month_df.head()
'''
order_month 1997-01-01 1997-02-01 1997-03-01 1997-04-01 1997-05-01 1997-06-01 1997-07-01 1997-08-01 1997-09-01 1997-10-01 1997-11-01 1997-12-01 1998-01-01 1998-02-01 1998-03-01 1998-04-01 1998-05-01 1998-06-01
user_id
1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 1 0 1 1 0 0 0 0 0 0 2 0 0 0 0 0 1 0
4 2 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0
5 2 1 0 1 1 1 1 0 1 0 0 2 1 0 0 0 0 0
'''
统计每位用户每月是否消费,有则为1,否则为0。
is_consumption_per_month_df = consumption_per_month_df.applymap(lambda x: 1 if x > 0 else 0)
is_consumption_per_month_df.head()
'''
order_month 1997-01-01 1997-02-01 1997-03-01 1997-04-01 1997-05-01 1997-06-01 1997-07-01 1997-08-01 1997-09-01 1997-10-01 1997-11-01 1997-12-01 1998-01-01 1998-02-01 1998-03-01 1998-04-01 1998-05-01 1998-06-01
user_id
1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 1 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0
4 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0
5 1 1 0 1 1 1 1 0 1 0 0 1 1 0 0 0 0 0
'''
unreg:观望用户,即前两月没买,第三个月才首次购买,用户在前两个月为观望用户;
unactive:用户首月购买后,后面的月份中没有购买,用户在没有购买的月份中为非活跃用户;
new:在当前月进行首次购买的用户,在当前月中为新用户;
active:连续月份购买的用户,在这些月中为活跃用户;
return:购买后间隔数个月才再次购买的用户,在再次开始购买的首月用户为回头客。
consumption_per_month_df.shape # (23570, 18)
def user_classify(data):
status = [] # 用于存储某位用户在每个月的活跃度
# 18 - consumption_per_month_df.shape[1]
for i in range(18):
# 若本月没有消费
if data[i] == 0:
if len(status) > 0:
if status[i-1] == 'unreg':
status.append('unreg')
else:
status.append('unactive')
else:
status.append('unreg')
# 若本月有消费
else:
if len(status) == 0:
status.append('new')
else:
if status[i-1] == 'unactive':
status.append('return')
elif status[i-1] == 'unreg':
status.append('new')
else:
status.append('active')
return status
user_status = consumption_per_month_df.apply(user_classify, axis = 1)
user_status.head()
'''
user_id
1 [new, unactive, unactive, unactive, unactive, ...
2 [new, unactive, unactive, unactive, unactive, ...
3 [new, unactive, return, active, unactive, unac...
4 [new, unactive, unactive, unactive, unactive, ...
5 [new, active, unactive, return, active, active...
'''
user_status_per_month_df = DataFrame(data=user_status.tolist(), index=consumption_per_month_df.index, columns=consumption_per_month_df.columns)
user_status_per_month_df.head()
'''
order_month 1997-01-01 1997-02-01 1997-03-01 1997-04-01 1997-05-01 1997-06-01 1997-07-01 1997-08-01 1997-09-01 1997-10-01 1997-11-01 1997-12-01 1998-01-01 1998-02-01 1998-03-01 1998-04-01 1998-05-01 1998-06-01
user_id
1 new unactive unactive unactive unactive unactive unactive unactive unactive unactive unactive unactive unactive unactive unactive unactive unactive unactive
2 new unactive unactive unactive unactive unactive unactive unactive unactive unactive unactive unactive unactive unactive unactive unactive unactive unactive
3 new unactive return active unactive unactive unactive unactive unactive unactive return unactive unactive unactive unactive unactive return unactive
4 new unactive unactive unactive unactive unactive unactive return unactive unactive unactive return unactive unactive unactive unactive unactive unactive
5 new active unactive return active active active unactive return unactive unactive return active unactive unactive unactive unactive unactive
'''
user_status_per_month_df.apply(lambda x: pd.value_counts(x), axis=0).fillna(0).T
'''
active new return unactive unreg
order_month
1997-01-01 0.0 7846.0 0.0 0.0 15724.0
1997-02-01 1157.0 8476.0 0.0 6689.0 7248.0
1997-03-01 1681.0 7248.0 595.0 14046.0 0.0
1997-04-01 1773.0 0.0 1049.0 20748.0 0.0
1997-05-01 852.0 0.0 1362.0 21356.0 0.0
1997-06-01 747.0 0.0 1592.0 21231.0 0.0
1997-07-01 746.0 0.0 1434.0 21390.0 0.0
1997-08-01 604.0 0.0 1168.0 21798.0 0.0
1997-09-01 528.0 0.0 1211.0 21831.0 0.0
1997-10-01 532.0 0.0 1307.0 21731.0 0.0
1997-11-01 624.0 0.0 1404.0 21542.0 0.0
1997-12-01 632.0 0.0 1232.0 21706.0 0.0
1998-01-01 512.0 0.0 1025.0 22033.0 0.0
1998-02-01 472.0 0.0 1079.0 22019.0 0.0
1998-03-01 571.0 0.0 1489.0 21510.0 0.0
1998-04-01 518.0 0.0 919.0 22133.0 0.0
1998-05-01 459.0 0.0 1029.0 22082.0 0.0
1998-06-01 446.0 0.0 1060.0 22064.0 0.0
'''