Taobao User Behavior Analysis - 1 - Data Processing (Python)

Dataset source: https://tianchi.aliyun.com/dataset/dataDetail?dataId=649&userId=1
Project goal: analyze user behavior data, get familiar with the relevant metrics, and see what patterns and information can be extracted from behavior data.
This post records how I processed the data and did the feature engineering.

1. Read the data and take a first look at it

# data-processing packages
import pandas as pd
import numpy as np
df = pd.read_csv("UserBehavior.csv", sep=",", names=["User_ID","Item_ID","Category_ID","Behavior_type","Timestamp"])
# the nrows parameter limits how many rows are read
# df = pd.read_csv("UserBehavior.csv", sep=",", names=["User_ID","Item_ID","Category_ID","Behavior_type","Timestamp"], nrows =500)
# sample randomly selects n rows
# df = df.sample(n=1000000, replace=False, random_state=1, axis=0)
df.head()
User_ID Item_ID Category_ID Behavior_type Timestamp
0 1 2268318 2520377 pv 1511544070
1 1 2333346 2520771 pv 1511561733
2 1 2576651 149192 pv 1511572885
3 1 3830808 4181361 pv 1511593493
4 1 4365585 2520377 pv 1511596146
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3835330 entries, 0 to 3835329
Data columns (total 5 columns):
 #   Column         Dtype 
---  ------         ----- 
 0   User_ID        int64 
 1   Item_ID        int64 
 2   Category_ID    int64 
 3   Behavior_type  object
 4   Timestamp      int64 
dtypes: int64(4), object(1)
memory usage: 146.3+ MB
df.describe()
User_ID Item_ID Category_ID Timestamp
count 3.835330e+06 3.835330e+06 3.835330e+06 3.835330e+06
mean 2.411328e+05 2.578345e+06 2.711667e+06 1.511963e+09
std 2.548230e+05 1.487859e+06 1.464903e+06 8.665418e+05
min 1.000000e+00 3.000000e+00 2.171000e+03 4.401600e+04
25% 1.299830e+05 1.294202e+06 1.349561e+06 1.511763e+09
50% 1.786180e+05 2.576338e+06 2.725426e+06 1.511966e+09
75% 2.255740e+05 3.860809e+06 4.145813e+06 1.512181e+09
max 1.018011e+06 5.163067e+06 5.161669e+06 2.122867e+09
# check whether df has null values and how many per column
df.isnull().sum()
User_ID          0
Item_ID          0
Category_ID      0
Behavior_type    0
Timestamp        0
dtype: int64
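Before summarizing, it is worth checking what values Behavior_type actually takes. This quick check is an extra step, not part of the original run; per the dataset page there are four event types.

# Distinct behavior types: pv = page view, buy = purchase,
# cart = add-to-cart, fav = add-to-favorites
df['Behavior_type'].value_counts()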

A first look at the data: there are 3,835,330 rows and 5 columns: User_ID, Item_ID, Category_ID, Behavior_type, and Timestamp. User_ID, Item_ID, Category_ID, and Timestamp are integer columns, Behavior_type is a string column, and Timestamp will be converted to a proper time type later. The data is fairly clean, with no null values, though the Timestamp min and max in describe() (which fall in 1970 and 2037) hint at a few out-of-range records that get removed during cleaning below.
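As an aside, info() reported roughly 146 MB of memory. A minimal sketch of how the frame could be shrunk, assuming the value ranges shown by describe() above:

# Downcast the int64 columns to the smallest integer type that fits
for col in ["User_ID", "Item_ID", "Category_ID", "Timestamp"]:
    df[col] = pd.to_numeric(df[col], downcast="integer")
# Behavior_type has only a few distinct values, so a category dtype saves memory
df["Behavior_type"] = df["Behavior_type"].astype("category")
df.info()  # memory usage should drop noticeably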

# Read the merged CSV file in iterator mode
# merge_dt = pd.read_csv( cwd  + '\\' + "UserBehavior.csv", encoding="utf_8", engine='python' ,iterator = True ) 
# # merge_dt  
# # output: 
# # fetch a chunk of the data
# df = merge_dt.get_chunk(10000)
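For reference, a sketch of chunked reading with the chunksize parameter; this matters because the complete UserBehavior.csv has on the order of 100 million rows and may not fit in memory at once. The buy-only filter is just an example of shrinking each chunk before concatenating.

cols = ["User_ID", "Item_ID", "Category_ID", "Behavior_type", "Timestamp"]
chunks = []
for chunk in pd.read_csv("UserBehavior.csv", sep=",", names=cols, chunksize=1000000):
    chunks.append(chunk[chunk["Behavior_type"] == "buy"])  # keep only buy events
df_buy = pd.concat(chunks, ignore_index=True)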

2. Clean the data

# deduplicate the rows
print(df.shape)
list_ = df.columns.to_list()
df.drop_duplicates(list_, keep='first', inplace=True)
print(df.shape)
(3835330, 5)
(3835328, 5)
import datetime
import time
# convert the Unix timestamp to a datetime, then to a string
# df["date"] = df["Timestamp"].apply(lambda x: datetime.datetime.fromtimestamp(x).strftime('%Y-%m-%d'))
df["date"] = df["Timestamp"].apply(lambda x: datetime.datetime.fromtimestamp(x).date())
df["time"] = df["Timestamp"].apply(lambda x: datetime.datetime.fromtimestamp(x).strftime('%H'))
# df["time"] = df["Timestamp"].apply(lambda x: datetime.datetime.fromtimestamp(x).time())
del df['Timestamp']
df.head()
User_ID Item_ID Category_ID Behavior_type date time
0 1 2268318 2520377 pv 2017-11-25 01
1 1 2333346 2520771 pv 2017-11-25 06
2 1 2576651 149192 pv 2017-11-25 09
3 1 3830808 4181361 pv 2017-11-25 15
4 1 4365585 2520377 pv 2017-11-25 15
# shortcut: split a combined datetime string directly into 2 columns
# df[["date","time"]] = df["datetime"].str.split(' ', 1, expand=True)
# df.head()
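A note on this step: datetime.fromtimestamp() uses the machine's local timezone, so the hours above assume the notebook ran in UTC+8. A vectorized sketch that makes the timezone explicit (assuming the data is Beijing time) and runs much faster than apply on millions of rows; it would replace the two apply calls above, before Timestamp is deleted:

ts = pd.to_datetime(df["Timestamp"], unit="s", utc=True).dt.tz_convert("Asia/Shanghai")
df["date"] = ts.dt.date            # calendar date in Beijing time
df["time"] = ts.dt.strftime("%H")  # hour of day as a two-digit string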
df['date'].value_counts()
2017-12-02    532774
2017-12-03    528928
2017-12-01    417976
2017-11-26    406792
2017-11-30    400366
2017-11-25    395034
2017-11-29    390884
2017-11-27    382304
2017-11-28    378326
2017-11-24      1507
2017-11-23       181
2017-11-22        53
2017-11-19        33
2017-11-21        25
2017-11-20        20
2017-11-18        19
2017-11-17        18
2018-08-28        16
2017-11-03        14
2017-11-16        10
2017-11-14         6
2017-11-13         6
2017-12-04         6
2017-11-11         5
2017-11-15         4
2017-12-06         3
2017-11-10         2
2017-09-16         2
2017-07-03         2
2017-11-12         2
1970-01-01         1
2015-02-06         1
2017-10-31         1
2017-09-11         1
2017-11-02         1
2017-11-05         1
2017-10-10         1
2017-11-06         1
2037-04-09         1
2017-11-04         1
Name: date, dtype: int64
# Outside 2017-11-25 through 2017-12-03 the row counts are tiny; drop those dates from the analysis
print(df.shape)
# df['date'] = df['date'][df['date'].between('2017-11-25','2017-12-03', inclusive=True)]
df['date'] = df['date'][(df['date']>=datetime.date(2017,11,25)) & (df['date']<=datetime.date(2017,12,3))]
# drop rows that contain NaN
df = df.dropna(axis=0, how='any')
print(df.shape)
(3835328, 6)
(3833384, 6)
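The same filter can be written more directly with a boolean mask, which avoids writing NaN into the column first; a sketch:

# keep only rows whose date falls inside the 9-day window
mask = (df['date'] >= datetime.date(2017,11,25)) & (df['date'] <= datetime.date(2017,12,3))
df = df[mask]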
# With more data, the weekday dimension might reveal patterns, but this project covers only 9 days (one week plus 2 days),
# so analyzing by weekday is barely different from analyzing by date; this feature is optional.
df['week'] = df['date'].apply(lambda x: x.weekday()+1)
df.head()
User_ID Item_ID Category_ID Behavior_type date time week
0 1 2268318 2520377 pv 2017-11-25 01 6
1 1 2333346 2520771 pv 2017-11-25 06 6
2 1 2576651 149192 pv 2017-11-25 09 6
3 1 3830808 4181361 pv 2017-11-25 15 6
4 1 4365585 2520377 pv 2017-11-25 15 6

3. RFM

This dataset has no order amounts, so only the R (recency) and F (frequency) dimensions of the RFM model are computed here; the M (monetary) dimension is unavailable.

df_rfm = df[df['Behavior_type']=='buy']
df_rfm.shape
(76705, 7)
# how many users are there?
# len(df_rfm['User_ID'].unique())
# how many orders? (assume each buy event is one order; in reality an order can contain multiple products)
# len(df_rfm['Behavior_type'])
# add a day-gap column, used to compute the difference to a reference date (2017-12-04, the day after the window ends)
df_rfm['datediff'] = (datetime.date(2017,12,4) - df['date']).dt.days
df_rfm.head()
D:\Anconda\envs\pytorch\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
User_ID Item_ID Category_ID Behavior_type date time week datediff
71 100 1603476 2951233 buy 2017-11-25 11 6 9
73 100 2971043 4869428 buy 2017-11-25 21 6 9
100 100 598929 2429887 buy 2017-11-27 13 1 7
119 100 1046201 3002561 buy 2017-11-27 15 1 7
125 100 1606258 4098232 buy 2017-11-27 21 1 7
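The SettingWithCopyWarning above appears because df_rfm is a slice of df. A sketch of the safer pattern: take an explicit copy first and compute the difference on datetime64 values.

df_rfm = df[df['Behavior_type'] == 'buy'].copy()   # explicit copy silences the warning
df_rfm['datediff'] = (pd.Timestamp('2017-12-04') - pd.to_datetime(df_rfm['date'])).dt.days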
# recency: days since the most recent purchase
df_r = df_rfm.groupby(by=['User_ID'])['datediff'].agg([('r', 'min')])
# frequency: number of purchases
df_f = df_rfm.groupby(by=['User_ID'])['Behavior_type'].agg([('f', 'count')])
# join the two frames
df_rfm_2 = df_r.join(df_f)
df_rfm_2.head()
r f
User_ID
100 6 8
117 6 10
119 5 3
121 9 1
122 2 3
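The two groupbys plus the join can also be collapsed into a single named-aggregation call; a sketch (requires pandas >= 0.25):

df_rfm_2 = df_rfm.groupby('User_ID').agg(
    r=('datediff', 'min'),         # recency: days since the last purchase
    f=('Behavior_type', 'count'),  # frequency: number of buy events
)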
df_rfm_2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25400 entries, 100 to 1018011
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   r       25400 non-null  int64
 1   f       25400 non-null  int64
dtypes: int64(2)
memory usage: 1.8 MB
df_rfm_2.describe()
r f
count 25400.000000 25400.000000
mean 3.535906 3.019882
std 2.405743 3.039492
min 1.000000 1.000000
25% 1.000000 1.000000
50% 3.000000 2.000000
75% 5.000000 4.000000
max 9.000000 84.000000
# Recency: the window covers 9 days, so r ranges from 1 to 9; split it into 4 bins (edges 1, 3, 5, 7, 9, as in bins_r below). The more recent the last purchase, the higher the score.
# bins_r = df_rfm_2['r'].quantile(q=np.linspace(0, 1, num=5), interpolation='nearest')
bins_r = [1, 3, 5, 7, 9]
labels_r = np.arange(4, 0, -1)
df_rfm_2['r_score'] = pd.cut(df_rfm_2['r'], bins=bins_r, labels=labels_r, include_lowest=True)
df_rfm_2['r_score'].unique()
[2, 3, 1, 4]
Categories (4, int64): [4 < 3 < 2 < 1]
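To make the binning concrete, a tiny worked example of how these bins map r to r_score: include_lowest=True turns the first interval into [1, 3], so the intervals are [1, 3], (3, 5], (5, 7], (7, 9].

# r in {1,2,3} -> 4, {4,5} -> 3, {6,7} -> 2, {8,9} -> 1
pd.cut(pd.Series([1, 3, 4, 7, 9]), bins=[1, 3, 5, 7, 9],
       labels=[4, 3, 2, 1], include_lowest=True)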
df_rfm_2['f'].value_counts()
1     8688
2     5899
3     3849
4     2321
5     1545
6      955
7      639
8      421
9      275
10     188
11     148
12     102
13      63
14      59
15      58
16      34
17      24
18      22
20      16
19      14
21      12
23      12
22       9
29       6
25       5
31       4
30       4
26       3
36       3
28       3
32       3
27       2
51       1
33       1
65       1
47       1
84       1
69       1
24       1
72       1
57       1
42       1
43       1
60       1
61       1
39       1
Name: f, dtype: int64
# purchase frequency goes up to 84; split into 4 bins: 1-21, 22-43, 44-65, 66-84
bins_f = [1, 21, 43, 65, 84]
labels_f = np.arange(1, 5)
df_rfm_2['f_score'] = pd.cut(df_rfm_2['f'], bins=bins_f, labels=labels_f, include_lowest=True)
df_rfm_2['f_score'].unique()
[1, 2, 4, 3]
Categories (4, int64): [1 < 2 < 3 < 4]
df_rfm_2.head()
r f r_score f_score
User_ID
100 6 8 2 1
117 6 10 2 1
119 5 3 3 1
121 9 1 1 1
122 2 3 4 1
# df_rfm_2['total_score'] = df_rfm_2.apply(lambda x: x['r_score'] + x['f_score'], axis=1)
# df_rfm_2['total_score'].value_counts()
df_rfm_2.describe()
r f
count 25400.000000 25400.000000
mean 3.535906 3.019882
std 2.405743 3.039492
min 1.000000 1.000000
25% 1.000000 1.000000
50% 3.000000 2.000000
75% 5.000000 4.000000
max 9.000000 84.000000
# df_rfm_2['r_score'] = df_rfm_2['r_score'].apply(lambda x: int(str(x)))
# df_rfm_2['f_score'] = df_rfm_2['f_score'].apply(lambda x: int(str(x)))
df_rfm_2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25400 entries, 100 to 1018011
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype   
---  ------   --------------  -----   
 0   r        25400 non-null  int64   
 1   f        25400 non-null  int64   
 2   r_score  25400 non-null  category
 3   f_score  25400 non-null  category
dtypes: category(2), int64(2)
memory usage: 1.9 MB
# convert the category columns to the int data type
df_rfm_2[['r_score', 'f_score']] = df_rfm_2[['r_score', 'f_score']].astype(int)
df_rfm_2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25400 entries, 100 to 1018011
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   r        25400 non-null  int64
 1   f        25400 non-null  int64
 2   r_score  25400 non-null  int32
 3   f_score  25400 non-null  int32
dtypes: int32(2), int64(2)
memory usage: 2.0 MB
df_rfm_2.describe()
r f r_score f_score
count 25400.000000 25400.000000 25400.000000 25400.000000
mean 3.535906 3.019882 3.257441 1.003150
std 2.405743 3.039492 1.000799 0.065734
min 1.000000 1.000000 1.000000 1.000000
25% 1.000000 1.000000 3.000000 1.000000
50% 3.000000 2.000000 4.000000 1.000000
75% 5.000000 4.000000 4.000000 1.000000
max 9.000000 84.000000 4.000000 4.000000
# The describe() above shows the r_score and f_score means are 3.257441 and 1.003150, so round down to 3 and 1
r_avg = 3
f_avg = 1
# bins = df_rfm_2['total_score'].quantile(q=np.linspace(0,1,num=5), interpolation='nearest')
# labels = ['general maintenance customer','potential customer','important deepening customer','important value customer']
# df_rfm_2['label_customer'] = pd.cut(df_rfm_2['total_score'], bins=bins, labels=labels, include_lowest=True)
# split users into 4 groups against the r_score and f_score averages
df_rfm_2['label_customer'] = np.where((df_rfm_2['r_score']>r_avg) & (df_rfm_2['f_score']>f_avg), 'important value customer',
                                     np.where((df_rfm_2['r_score']>r_avg) & (df_rfm_2['f_score']<=f_avg), 'important retention customer',
                                             np.where((df_rfm_2['r_score']<=r_avg) & (df_rfm_2['f_score']>f_avg), 'important development customer',
                                                     np.where((df_rfm_2['r_score']<=r_avg) & (df_rfm_2['f_score']<=f_avg), 'general customer', 
                                                              np.nan))))
df_rfm_2.head()
r f r_score f_score label_customer
User_ID
100 6 8 2 1 general customer
117 6 10 2 1 general customer
119 5 3 3 1 general customer
121 9 1 1 1 general customer
122 2 3 4 1 important retention customer
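For readability, the nested np.where above can be replaced with np.select, which takes a list of conditions and a default; a sketch producing the same labels:

conditions = [
    (df_rfm_2['r_score'] > r_avg) & (df_rfm_2['f_score'] > f_avg),   # recent and frequent
    (df_rfm_2['r_score'] > r_avg) & (df_rfm_2['f_score'] <= f_avg),  # recent, infrequent
    (df_rfm_2['r_score'] <= r_avg) & (df_rfm_2['f_score'] > f_avg),  # frequent, not recent
]
labels = ['important value customer', 'important retention customer', 'important development customer']
df_rfm_2['label_customer'] = np.select(conditions, labels, default='general customer')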
# turn the index back into a regular column
df_rfm_2 = df_rfm_2.reset_index()
df_rfm_2.head()
User_ID r f r_score f_score label_customer
0 100 6 8 2 1 general customer
1 117 6 10 2 1 general customer
2 119 5 3 3 1 general customer
3 121 9 1 1 1 general customer
4 122 2 3 4 1 important retention customer
df.shape
(3833384, 7)
df_rfm_2 = df_rfm_2[['User_ID','label_customer']]
df_rfm_2.shape
(25400, 2)
# join the labels back onto the main frame
df = pd.merge(df, df_rfm_2, how='left', left_on='User_ID', right_on='User_ID')
df.shape
(3833384, 8)
df.head()
User_ID Item_ID Category_ID Behavior_type date time week label_customer
0 1 2268318 2520377 pv 2017-11-25 01 6 NaN
1 1 2333346 2520771 pv 2017-11-25 06 6 NaN
2 1 2576651 149192 pv 2017-11-25 09 6 NaN
3 1 3830808 4181361 pv 2017-11-25 15 6 NaN
4 1 4365585 2520377 pv 2017-11-25 15 6 NaN

4. Product classification

In real e-commerce, products are classified by sales volume, margin, and other features, commonly into states such as hot seller, popular, on sale, and clearance. Since this dataset has no sales figures, the number of buy events per product stands in for sales volume when classifying products.
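As a small illustration of this kind of threshold labeling (hypothetical buy counts; the actual cut points are picked below from the buy-count distribution), pd.cut can express the same idea as nested np.where:

demo = pd.Series([1, 2, 5, 12], name='buy_num')  # hypothetical per-item buy counts
pd.cut(demo, bins=[0, 2, 8, np.inf], labels=['on sale', 'popular', 'hot seller'])
# items with no recorded buys fall outside the bins and get NaN -> treat as clearance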

df_pro = df[df['Behavior_type']=='buy']
print(df_pro.shape)
# look at per-item sales (buy counts by Item_ID)
print(df_pro['Item_ID'].value_counts().describe())
df_4 = df_pro['Item_ID'].value_counts().reset_index()
# there are too many Item_IDs; look only at those bought more than once, to help pick the classification cut points
df_5 = df_4[df_4['Item_ID']>1]
print(df_5['Item_ID'].value_counts())
print(df_5['Item_ID'].describe())
df_4.head()
df_4.columns=['Item_ID','buy_num']
df_pro = pd.merge(df_pro, df_4, how='left', left_on='Item_ID', right_on='Item_ID')
df_pro['buy_num'].value_counts()
df_pro.head()
(76705, 8)
count    56726.000000
mean         1.352202
std          1.109932
min          1.000000
25%          1.000000
50%          1.000000
75%          1.000000
max         58.000000
Name: Item_ID, dtype: float64
2     6990
3     1940
4      819
5      411
6      197
7      138
8       95
9       68
10      38
11      31
13      18
12      17
15      13
14      12
18       6
17       5
16       3
23       3
21       2
22       2
19       2
26       1
58       1
27       1
25       1
20       1
32       1
Name: Item_ID, dtype: int64
count    10816.000000
mean         2.847171
std          1.923511
min          2.000000
25%          2.000000
50%          2.000000
75%          3.000000
max         58.000000
Name: Item_ID, dtype: float64
User_ID Item_ID Category_ID Behavior_type date time week label_customer buy_num
0 100 1603476 2951233 buy 2017-11-25 11 6 general customer 3
1 100 2971043 4869428 buy 2017-11-25 21 6 general customer 1
2 100 598929 2429887 buy 2017-11-27 13 1 general customer 1
3 100 1046201 3002561 buy 2017-11-27 15 1 general customer 2
4 100 1606258 4098232 buy 2017-11-27 21 1 general customer 1
df.shape
(3833384, 8)
df_pro = df_pro[['Item_ID','buy_num']]
df_pro.shape
(76705, 2)
# deduplicate, otherwise the join with df would be one-to-many and blow up the row count
df_pro.drop_duplicates(['Item_ID'],keep='first',inplace=True)
df_pro.shape
(56726, 2)
df = pd.merge(df, df_pro, how='left', on='Item_ID')
df.shape
(3833384, 9)
# set threshold conditions to classify the products
df['pro_type'] = np.where(df['buy_num']>=9,'hot seller',np.where(df['buy_num']>=3,'popular',np.where(df['buy_num']>0,'on sale','clearance')))
del df['buy_num']
df.head()
User_ID Item_ID Category_ID Behavior_type date time week label_customer pro_type
0 1 2268318 2520377 pv 2017-11-25 01 6 NaN on sale
1 1 2333346 2520771 pv 2017-11-25 06 6 NaN on sale
2 1 2576651 149192 pv 2017-11-25 09 6 NaN clearance
3 1 3830808 4181361 pv 2017-11-25 15 6 NaN clearance
4 1 4365585 2520377 pv 2017-11-25 15 6 NaN clearance
df.to_csv("bi_tmall_users.csv", sep=",", index=False)
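Finally, a sketch of reading the export back in the analysis notebook: parse_dates restores date as a real datetime, and the dtype hint keeps the zero-padded hour strings intact.

df_check = pd.read_csv("bi_tmall_users.csv", sep=",", parse_dates=["date"], dtype={"time": str})
df_check.dtypes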
