Data source: https://tianchi.aliyun.com/dataset/dataDetail?dataId=649&userId=1
Goal: analyze user behavior data, get familiar with the relevant metrics, and see what patterns and information can be extracted from behavioral logs.
This post records how I cleaned the data and did the feature engineering.
1. Read the data and take a first look
import pandas as pd
import numpy as np
df = pd.read_csv("UserBehavior.csv", sep=",", names=["User_ID","Item_ID","Category_ID","Behavior_type","Timestamp"])
df.head()
   User_ID  Item_ID  Category_ID Behavior_type   Timestamp
0        1  2268318      2520377            pv  1511544070
1        1  2333346      2520771            pv  1511561733
2        1  2576651       149192            pv  1511572885
3        1  3830808      4181361            pv  1511593493
4        1  4365585      2520377            pv  1511596146
df.info()
RangeIndex: 3835330 entries, 0 to 3835329
Data columns (total 5 columns):
# Column Dtype
--- ------ -----
0 User_ID int64
1 Item_ID int64
2 Category_ID int64
3 Behavior_type object
4 Timestamp int64
dtypes: int64(4), object(1)
memory usage: 146.3+ MB
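A side note (not from the original run): at ~3.8 million rows the frame takes 146 MB, and a good chunk of that is the object-typed Behavior_type column. Declaring it a category at load time is a cheap way to shrink it; a minimal sketch, same column names as above:

df = pd.read_csv(
    "UserBehavior.csv",
    names=["User_ID", "Item_ID", "Category_ID", "Behavior_type", "Timestamp"],
    dtype={"Behavior_type": "category"},  # only a handful of distinct values, far smaller than object
)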
df.describe()
            User_ID       Item_ID   Category_ID     Timestamp
count  3.835330e+06  3.835330e+06  3.835330e+06  3.835330e+06
mean   2.411328e+05  2.578345e+06  2.711667e+06  1.511963e+09
std    2.548230e+05  1.487859e+06  1.464903e+06  8.665418e+05
min    1.000000e+00  3.000000e+00  2.171000e+03  4.401600e+04
25%    1.299830e+05  1.294202e+06  1.349561e+06  1.511763e+09
50%    1.786180e+05  2.576338e+06  2.725426e+06  1.511966e+09
75%    2.255740e+05  3.860809e+06  4.145813e+06  1.512181e+09
max    1.018011e+06  5.163067e+06  5.161669e+06  2.122867e+09
df.isnull().sum()
User_ID 0
Item_ID 0
Category_ID 0
Behavior_type 0
Timestamp 0
dtype: int64
A first pass over the data tells us: there are 3,835,330 rows and 5 columns: User_ID, Item_ID, Category_ID, Behavior_type and Timestamp. User_ID, Item_ID, Category_ID and Timestamp are integers, Behavior_type is a string, and Timestamp will be converted to a proper date/time later. The data is fairly clean, with no null values. One thing describe() does reveal: Timestamp ranges from 4.4e4 (1970) up to 2.12e9 (2037), well outside the period the dataset is supposed to cover, so some records will have to be dropped during cleaning.
2. Clean the data
print(df.shape)
# Drop rows that are duplicated across all five columns
list_ = df.columns.to_list()
df.drop_duplicates(list_, keep='first', inplace=True)
print(df.shape)
(3835330, 5)
(3835328, 5)
import datetime
# Turn the Unix timestamp into a calendar date and an hour-of-day string
df["date"] = df["Timestamp"].apply(lambda x: datetime.datetime.fromtimestamp(x).date())
df["time"] = df["Timestamp"].apply(lambda x: datetime.datetime.fromtimestamp(x).strftime('%H'))
del df['Timestamp']
df.head()
   User_ID  Item_ID  Category_ID Behavior_type        date time
0        1  2268318      2520377            pv  2017-11-25   01
1        1  2333346      2520771            pv  2017-11-25   06
2        1  2576651       149192            pv  2017-11-25   09
3        1  3830808      4181361            pv  2017-11-25   15
4        1  4365585      2520377            pv  2017-11-25   15
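A note in passing: apply with a Python lambda converts the column row by row. The vectorized pd.to_datetime is usually much faster on millions of rows; a sketch of the same conversion, assuming the timestamps are seconds:

ts = pd.to_datetime(df["Timestamp"], unit="s")  # one vectorized call
# caution: this yields UTC, while datetime.fromtimestamp above used local time;
# ts.dt.tz_localize("UTC").tz_convert("Asia/Shanghai") would reproduce local hours
df["date"] = ts.dt.date
df["time"] = ts.dt.strftime("%H")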
df['date'].value_counts()
2017-12-02 532774
2017-12-03 528928
2017-12-01 417976
2017-11-26 406792
2017-11-30 400366
2017-11-25 395034
2017-11-29 390884
2017-11-27 382304
2017-11-28 378326
2017-11-24 1507
2017-11-23 181
2017-11-22 53
2017-11-19 33
2017-11-21 25
2017-11-20 20
2017-11-18 19
2017-11-17 18
2018-08-28 16
2017-11-03 14
2017-11-16 10
2017-11-14 6
2017-11-13 6
2017-12-04 6
2017-11-11 5
2017-11-15 4
2017-12-06 3
2017-11-10 2
2017-09-16 2
2017-07-03 2
2017-11-12 2
1970-01-01 1
2015-02-06 1
2017-10-31 1
2017-09-11 1
2017-11-02 1
2017-11-05 1
2017-10-10 1
2017-11-06 1
2037-04-09 1
2017-11-04 1
Name: date, dtype: int64
print(df.shape)
# Keep only the dataset's intended window, 2017-11-25 through 2017-12-03;
# out-of-range dates become NaN and are then dropped
df['date'] = df['date'][(df['date']>=datetime.date(2017,11,25)) & (df['date']<=datetime.date(2017,12,3))]
df = df.dropna(axis=0, how='any')
print(df.shape)
(3835328, 6)
(3833384, 6)
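The same cleanup can be written as a single boolean mask, which skips the intermediate NaN step; an equivalent sketch under the same date bounds:

mask = (df['date'] >= datetime.date(2017, 11, 25)) & (df['date'] <= datetime.date(2017, 12, 3))
df = df[mask]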
df['week'] = df['date'].apply(lambda x: x.weekday()+1)
df.head()
   User_ID  Item_ID  Category_ID Behavior_type        date time  week
0        1  2268318      2520377            pv  2017-11-25   01     6
1        1  2333346      2520771            pv  2017-11-25   06     6
2        1  2576651       149192            pv  2017-11-25   09     6
3        1  3830808      4181361            pv  2017-11-25   15     6
4        1  4365585      2520377            pv  2017-11-25   15     6
3. RFM customer segmentation
RFM scores customers on Recency, Frequency and Monetary value; this dataset has no purchase amounts, so only the R and F dimensions are used here.
df_rfm = df[df['Behavior_type']=='buy']
df_rfm.shape
(76705, 7)
df_rfm['datediff'] = (datetime.date(2017,12,4) - df['date']).dt.days
df_rfm.head()
D:\Anconda\envs\pytorch\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
     User_ID  Item_ID  Category_ID Behavior_type        date time  week  datediff
71       100  1603476      2951233           buy  2017-11-25   11     6         9
73       100  2971043      4869428           buy  2017-11-25   21     6         9
100      100   598929      2429887           buy  2017-11-27   13     1         7
119      100  1046201      3002561           buy  2017-11-27   15     1         7
125      100  1606258      4098232           buy  2017-11-27   21     1         7
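The SettingWithCopyWarning above appears because df_rfm is a slice of df, and pandas cannot tell whether assigning datediff is also meant to modify df. Taking an explicit copy, and computing datediff from df_rfm's own date column, gives the same result without the warning; a sketch:

df_rfm = df[df['Behavior_type'] == 'buy'].copy()  # explicit copy: safe to add columns
df_rfm['datediff'] = (datetime.date(2017, 12, 4) - df_rfm['date']).dt.days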
df_r = df_rfm.groupby(by=['User_ID'])['datediff'].agg([('r', 'min')])
df_f = df_rfm.groupby(by=['User_ID'])['Behavior_type'].agg([('f', 'count')])
df_rfm_2 = df_r.join(df_f)
df_rfm_2.head()
         r   f
User_ID
100      6   8
117      6  10
119      5   3
121      9   1
122      2   3
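The two groupbys plus join can also be collapsed into a single pass with named aggregation (pandas >= 0.25); an equivalent sketch:

df_rfm_2 = df_rfm.groupby('User_ID').agg(
    r=('datediff', 'min'),         # days since the most recent purchase
    f=('Behavior_type', 'count'),  # number of purchases
)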
df_rfm_2.info()
Int64Index: 25400 entries, 100 to 1018011
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 r 25400 non-null int64
1 f 25400 non-null int64
dtypes: int64(2)
memory usage: 1.8 MB
df_rfm_2.describe()
                  r             f
count  25400.000000  25400.000000
mean       3.535906      3.019882
std        2.405743      3.039492
min        1.000000      1.000000
25%        1.000000      1.000000
50%        3.000000      2.000000
75%        5.000000      4.000000
max        9.000000     84.000000
# Recency bins [1,3], (3,5], (5,7], (7,9]; a smaller r (more recent purchase) gets a higher score
bins_r = [1, 3, 5, 7, 9]
labels_r = np.arange(4, 0, -1)
df_rfm_2['r_score'] = pd.cut(df_rfm_2['r'], bins=bins_r, labels=labels_r, include_lowest=True)
df_rfm_2['r_score'].unique()
[2, 3, 1, 4]
Categories (4, int64): [4 < 3 < 2 < 1]
df_rfm_2['f'].value_counts()
1 8688
2 5899
3 3849
4 2321
5 1545
6 955
7 639
8 421
9 275
10 188
11 148
12 102
13 63
14 59
15 58
16 34
17 24
18 22
20 16
19 14
21 12
23 12
22 9
29 6
25 5
31 4
30 4
26 3
36 3
28 3
32 3
27 2
51 1
33 1
65 1
47 1
84 1
69 1
24 1
72 1
57 1
42 1
43 1
60 1
61 1
39 1
Name: f, dtype: int64
# Equal-width frequency bins over [1, 84]; more purchases -> higher score
bins_f = [1, 21, 43, 65, 84]
labels_f = np.arange(1, 5)
df_rfm_2['f_score'] = pd.cut(df_rfm_2['f'], bins=bins_f, labels=labels_f, include_lowest=True)
df_rfm_2['f_score'].unique()
[1, 2, 4, 3]
Categories (4, int64): [1 < 2 < 3 < 4]
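Worth noting: these equal-width bins over [1, 84] put almost every user (f <= 21) into score 1, which the describe() further down confirms (f_score mean ~1.003). If a more balanced split were wanted, quantile bins are the usual tool; f is heavily tied at small values, so ranking first avoids duplicate bin edges. A hypothetical alternative (f_score_q is my own name, not from the notebook):

# rank(method='first') breaks ties so pd.qcut can form four equal-sized bins
df_rfm_2['f_score_q'] = pd.qcut(df_rfm_2['f'].rank(method='first'), 4, labels=[1, 2, 3, 4])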
df_rfm_2.head()
         r   f r_score f_score
User_ID
100      6   8       2       1
117      6  10       2       1
119      5   3       3       1
121      9   1       1       1
122      2   3       4       1
df_rfm_2.describe()
                  r             f
count  25400.000000  25400.000000
mean       3.535906      3.019882
std        2.405743      3.039492
min        1.000000      1.000000
25%        1.000000      1.000000
50%        3.000000      2.000000
75%        5.000000      4.000000
max        9.000000     84.000000
df_rfm_2.info()
Int64Index: 25400 entries, 100 to 1018011
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 r 25400 non-null int64
1 f 25400 non-null int64
2 r_score 25400 non-null category
3 f_score 25400 non-null category
dtypes: category(2), int64(2)
memory usage: 1.9 MB
df_rfm_2[['r_score', 'f_score']] = df_rfm_2[['r_score', 'f_score']].astype(int)
df_rfm_2.info()
Int64Index: 25400 entries, 100 to 1018011
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 r 25400 non-null int64
1 f 25400 non-null int64
2 r_score 25400 non-null int32
3 f_score 25400 non-null int32
dtypes: int32(2), int64(2)
memory usage: 2.0 MB
df_rfm_2.describe()
                  r             f       r_score       f_score
count  25400.000000  25400.000000  25400.000000  25400.000000
mean       3.535906      3.019882      3.257441      1.003150
std        2.405743      3.039492      1.000799      0.065734
min        1.000000      1.000000      1.000000      1.000000
25%        1.000000      1.000000      3.000000      1.000000
50%        3.000000      2.000000      4.000000      1.000000
75%        5.000000      4.000000      4.000000      1.000000
max        9.000000     84.000000      4.000000      4.000000
# Split on threshold scores (roughly the rounded means seen above)
r_avg = 3
f_avg = 1
df_rfm_2['label_customer'] = np.where((df_rfm_2['r_score']>r_avg) & (df_rfm_2['f_score']>f_avg), 'important value customer',
                             np.where((df_rfm_2['r_score']>r_avg) & (df_rfm_2['f_score']<=f_avg), 'important retention customer',
                             np.where((df_rfm_2['r_score']<=r_avg) & (df_rfm_2['f_score']>f_avg), 'important development customer',
                             np.where((df_rfm_2['r_score']<=r_avg) & (df_rfm_2['f_score']<=f_avg), 'general customer',
                             np.nan))))
df_rfm_2.head()
         r   f  r_score  f_score                label_customer
User_ID
100      6   8        2        1              general customer
117      6  10        2        1              general customer
119      5   3        3        1              general customer
121      9   1        1        1              general customer
122      2   3        4        1  important retention customer
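The nested np.where above works, but np.select keeps the four quadrant rules flat and easier to read; an equivalent sketch:

conditions = [
    (df_rfm_2['r_score'] > r_avg) & (df_rfm_2['f_score'] > f_avg),   # recent and frequent
    (df_rfm_2['r_score'] > r_avg) & (df_rfm_2['f_score'] <= f_avg),  # recent, not frequent
    (df_rfm_2['r_score'] <= r_avg) & (df_rfm_2['f_score'] > f_avg),  # frequent, not recent
]
labels = ['important value customer', 'important retention customer', 'important development customer']
df_rfm_2['label_customer'] = np.select(conditions, labels, default='general customer')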
df_rfm_2 = df_rfm_2.reset_index()
df_rfm_2.head()
   User_ID  r   f  r_score  f_score                label_customer
0      100  6   8        2        1              general customer
1      117  6  10        2        1              general customer
2      119  5   3        3        1              general customer
3      121  9   1        1        1              general customer
4      122  2   3        4        1  important retention customer
df.shape
(3833384, 7)
df_rfm_2 = df_rfm_2[['User_ID','label_customer']]
df_rfm_2.shape
(25400, 2)
df = pd.merge(df, df_rfm_2, how='left', left_on='User_ID', right_on='User_ID')
df.shape
(3833384, 8)
df.head()
   User_ID  Item_ID  Category_ID Behavior_type        date time  week label_customer
0        1  2268318      2520377            pv  2017-11-25   01     6            NaN
1        1  2333346      2520771            pv  2017-11-25   06     6            NaN
2        1  2576651       149192            pv  2017-11-25   09     6            NaN
3        1  3830808      4181361            pv  2017-11-25   15     6            NaN
4        1  4365585      2520377            pv  2017-11-25   15     6            NaN
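Users who never bought anything are absent from the RFM table, so the left merge leaves their label as NaN (visible for user 1 above). Depending on the downstream analysis it may help to label them explicitly; a sketch ('non-buyer' is my own placeholder, not from the notebook):

df['label_customer'] = df['label_customer'].fillna('non-buyer')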
4. Product classification
In real e-commerce work, products are classified by sales volume, profit and other features, typically into states such as hot, popular, active and clearance. The dataset has no sales figures, so the number of buy events per product stands in for sales volume here.
df_pro = df[df['Behavior_type']=='buy']
print(df_pro.shape)
print(df_pro['Item_ID'].value_counts().describe())
# After reset_index, the 'Item_ID' column holds the buy count and the item id sits in 'index'
df_4 = df_pro['Item_ID'].value_counts().reset_index()
df_5 = df_4[df_4['Item_ID']>1]  # items bought more than once
print(df_5['Item_ID'].value_counts())
print(df_5['Item_ID'].describe())
df_4.head()
df_4.columns = ['Item_ID','buy_num']  # rename to item id + number of buys
df_pro = pd.merge(df_pro, df_4, how='left', left_on='Item_ID', right_on='Item_ID')
df_pro['buy_num'].value_counts()
df_pro.head()
(76705, 8)
count 56726.000000
mean 1.352202
std 1.109932
min 1.000000
25% 1.000000
50% 1.000000
75% 1.000000
max 58.000000
Name: Item_ID, dtype: float64
2 6990
3 1940
4 819
5 411
6 197
7 138
8 95
9 68
10 38
11 31
13 18
12 17
15 13
14 12
18 6
17 5
16 3
23 3
21 2
22 2
19 2
26 1
58 1
27 1
25 1
20 1
32 1
Name: Item_ID, dtype: int64
count 10816.000000
mean 2.847171
std 1.923511
min 2.000000
25% 2.000000
50% 2.000000
75% 3.000000
max 58.000000
Name: Item_ID, dtype: float64
   User_ID  Item_ID  Category_ID Behavior_type        date time  week    label_customer  buy_num
0      100  1603476      2951233           buy  2017-11-25   11     6  general customer        3
1      100  2971043      4869428           buy  2017-11-25   21     6  general customer        1
2      100   598929      2429887           buy  2017-11-27   13     1  general customer        1
3      100  1046201      3002561           buy  2017-11-27   15     1  general customer        2
4      100  1606258      4098232           buy  2017-11-27   21     1  general customer        1
df.shape
(3833384, 8)
df_pro = df_pro[['Item_ID','buy_num']]
df_pro.shape
(76705, 2)
df_pro.drop_duplicates(['Item_ID'],keep='first',inplace=True)
df_pro.shape
(56726, 2)
df = pd.merge(df, df_pro, how='left', on='Item_ID')
df.shape
(3833384, 9)
# Classify by buy count: >=9 hot, 3-8 popular, 1-2 active, never bought -> clearance
df['pro_type'] = np.where(df['buy_num']>=9, 'hot', np.where(df['buy_num']>=3, 'popular', np.where(df['buy_num']>0, 'active', 'clearance')))
del df['buy_num']
df.head()
   User_ID  Item_ID  Category_ID Behavior_type        date time  week label_customer   pro_type
0        1  2268318      2520377            pv  2017-11-25   01     6            NaN     active
1        1  2333346      2520771            pv  2017-11-25   06     6            NaN     active
2        1  2576651       149192            pv  2017-11-25   09     6            NaN  clearance
3        1  3830808      4181361            pv  2017-11-25   15     6            NaN  clearance
4        1  4365585      2520377            pv  2017-11-25   15     6            NaN  clearance
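The same thresholds can be expressed with pd.cut once never-bought items are filled with 0, which keeps the cutoffs in one place; an equivalent sketch (it would have to run before the del above):

df['buy_num'] = df['buy_num'].fillna(0)  # items never bought
df['pro_type'] = pd.cut(df['buy_num'],
                        bins=[-1, 0, 2, 8, np.inf],
                        labels=['clearance', 'active', 'popular', 'hot'])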
df.to_csv("bi_tmall_users.csv", sep=",", index=False)