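Merchants sometimes run large promotions or hand out coupons on particular dates, such as Boxing Day, Black Friday, or Double Eleven (November 11), to attract shoppers. Many of the buyers drawn in this way are one-time deal hunters, however, so these promotions may do little for long-term sales growth. To address this, merchants need to identify which shoppers can be converted into repeat buyers. By targeting these potentially loyal customers, merchants can substantially cut promotion costs and raise their return on investment (ROI). Targeting customers accurately in online advertising is well known to be hard, especially for new shoppers. With the user behavior logs that Tmall has accumulated over a long period, though, we may be able to solve this problem.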
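Field | Description |
user_id | Unique ID of the shopper |
item_id | Unique ID of the item |
cat_id | Unique ID of the category the item belongs to |
merchant_id | Unique ID of the merchant (this column is named seller_id in user_log_format1.csv) |
brand_id | Unique ID of the item's brand |
time_stamp | Date the action took place (format: mmdd) |
action_type | One of {0, 1, 2, 3}: 0 = click, 1 = add to cart, 2 = purchase, 3 = add to favorites |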
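Field | Description |
user_id | Unique ID of the shopper |
age_range | Age range of the user: 1 for <18; 2 for [18,24]; 3 for [25,29]; 4 for [30,34]; 5 for [35,39]; 6 for [40,49]; 7 and 8 for >= 50; 0 and NULL mean unknown |
gender | Gender of the user: 0 = female, 1 = male, 2 and NULL mean unknown |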
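Field | Description |
user_id | Unique ID of the shopper |
merchant_id | Unique ID of the merchant |
label | One of {0, 1}: 1 = repeat buyer, 0 = non-repeat buyer. This field is what has to be predicted for the test set, so it is empty there. |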
$$AUC = \cfrac{\sum_{i \in positiveClass} rank_{i} - \cfrac{M(1+M)}{2}}{M \times N}$$
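The rank-based AUC formula above can be checked on a toy example. The sketch below assumes the usual reading of the symbols, which the text does not spell out: M is the number of positive samples, N the number of negative samples, and rank_i is the rank of sample i after sorting all predicted scores in ascending order (rank 1 = lowest score, tied scores share their average rank). The function name `auc_by_rank` and the toy data are made up for illustration.

```python
import numpy as np
import pandas as pd

def auc_by_rank(y_true, y_score):
    """AUC from the rank-sum formula; tied scores receive their average rank."""
    y = np.asarray(y_true).astype(int)
    scores = pd.Series(np.asarray(y_score, dtype=float))
    ranks = scores.rank(method="average").to_numpy()  # rank 1 = lowest score
    M = int((y == 1).sum())  # number of positive samples (assumed meaning of M)
    N = int((y == 0).sum())  # number of negative samples (assumed meaning of N)
    if M == 0 or N == 0:
        raise ValueError("AUC needs at least one positive and one negative sample")
    return (ranks[y == 1].sum() - M * (1 + M) / 2) / (M * N)

# Toy data: ranks are 1, 3, 2, 4, so the positives hold ranks 2 and 4
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auc_by_rank(toy_labels, toy_scores)
print(toy_auc)  # (6 - 3) / (2 * 2) = 0.75
```

Three of the four positive/negative pairs are ordered correctly here, which matches the value 0.75 and shows why the formula counts correctly ranked pairs.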
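import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
import scipy.stats  # needed for scipy.stats.norm and scipy.stats.probplot used later
import gc
from collections import Counter
import warnings
from matplotlib import rcParams
config = {
    "font.family": 'Times New Roman',  # font family
    # (any further rcParams entries are missing from the source)
}
rcParams.update(config)  # assumed step: apply the settings defined above
%matplotlib inline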
# Layout of the data files
!tree data/data_format1/
├── test_format1.csv
├── train_format1.csv
├── user_info_format1.csv
└── user_log_format1.csv
0 directories, 4 files
# Question: how can the memory footprint of the loaded data be reduced?
# Reduce memory by downcasting each numeric column to the smallest dtype that covers its range
def reduce_mem(df):
    starttime = time.time()
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2  # memory usage before conversion, in MB
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            # min()/max() skip NaN, so they are null only for an all-NaN column: leave it untouched
            if pd.isnull(c_min) or pd.isnull(c_max):
                continue
            if str(col_type)[:3] == 'int':
                # convert to the narrowest integer type that fits
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # caution: float16 represents every integer exactly only up to 2048
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024**2
    # the format arguments after end_mem were cut off in the source; they are restored from the placeholders
    print('-- Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction),time spend:{:2.2f} min'.format(end_mem,
          100 * (start_mem - end_mem) / start_mem,
          (time.time() - starttime) / 60))
    return df
train_data = reduce_mem(pd.read_csv("./data/data_format1/train_format1.csv"))
-- Mem. usage decreased to 1.74 Mb (70.8% reduction),time spend:0.00 min
| user_id | merchant_id | label |
0 | 34176 | 3906 | 0 |
1 | 34176 | 121 | 0 |
2 | 34176 | 4356 | 1 |
3 | 34176 | 2217 | 0 |
4 | 230784 | 4818 | 0 |
... | ... | ... | ... |
260859 | 359807 | 4325 | 0 |
260860 | 294527 | 3971 | 0 |
260861 | 294527 | 152 | 0 |
260862 | 294527 | 2537 | 0 |
260863 | 229247 | 4140 | 0 |
260864 rows × 3 columns
RangeIndex: 260864 entries, 0 to 260863
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 user_id 260864 non-null int32
1 merchant_id 260864 non-null int16
2 label 260864 non-null int8
dtypes: int16(1), int32(1), int8(1)
memory usage: 1.7 MB
user_id 212062
merchant_id 1993
label 2
dtype: int64
user_info = reduce_mem(pd.read_csv("./data/data_format1/user_info_format1.csv"))
-- Mem. usage decreased to 3.24 Mb (66.7% reduction),time spend:0.00 min
| user_id | age_range | gender |
0 | 376517 | 6.0 | 1.0 |
1 | 234512 | 5.0 | 0.0 |
2 | 344532 | 5.0 | 0.0 |
3 | 186135 | 5.0 | 0.0 |
4 | 30230 | 5.0 | 0.0 |
... | ... | ... | ... |
424165 | 395814 | 3.0 | 1.0 |
424166 | 245950 | 0.0 | 1.0 |
424167 | 208016 | NaN | NaN |
424168 | 272535 | 6.0 | 1.0 |
424169 | 18031 | 3.0 | 1.0 |
424170 rows × 3 columns
RangeIndex: 424170 entries, 0 to 424169
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 user_id 424170 non-null int32
1 age_range 421953 non-null float16
2 gender 417734 non-null float16
dtypes: float16(2), int32(1)
memory usage: 3.2 MB
user_id 424170
age_range 9
gender 3
dtype: int64
# Question: how can pandas read a very large file?
# The file is too large to load in one go, so read it chunk by chunk
reader = pd.read_csv("./data/data_format1/user_log_format1.csv", iterator=True)
# try:
#     df = reader.get_chunk(100000)
# except StopIteration:
#     print("Iteration is stopped.")
loop = True
chunkSize = 100000
chunks = []
while loop:
    try:
        chunk = reader.get_chunk(chunkSize)
        chunks.append(chunk)  # collect each chunk so it can be concatenated below
    except StopIteration:
        loop = False
        print("Iteration is stopped.")
df = pd.concat(chunks, ignore_index=True)
user_log = reduce_mem(df)
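As a possible alternative to the `get_chunk` loop, `pd.read_csv` also accepts `chunksize` (which returns an iterator of DataFrames) and `dtype` (which sets compact types while parsing instead of downcasting afterwards). The sketch below uses a tiny in-memory CSV with made-up values in place of user_log_format1.csv; the column names follow the log table shown in this post, while the dtype choices (in particular float32 for `brand_id`) are assumptions.

```python
import io
import pandas as pd

# Tiny stand-in for user_log_format1.csv (values are made up for illustration)
csv_text = """user_id,item_id,cat_id,seller_id,brand_id,time_stamp,action_type
1,10,100,7,55,829,0
1,11,100,7,,829,2
2,12,101,8,56,1111,0
"""

# Assumed dtype choices; brand_id stays a float because it contains missing values
dtypes = {
    "user_id": "int32", "item_id": "int32", "cat_id": "int16",
    "seller_id": "int16", "brand_id": "float32",
    "time_stamp": "int16", "action_type": "int8",
}

# With chunksize set, read_csv yields DataFrames of at most that many rows
chunks = []
for chunk in pd.read_csv(io.StringIO(csv_text), dtype=dtypes, chunksize=2):
    chunks.append(chunk)

demo_log = pd.concat(chunks, ignore_index=True)
print(demo_log.dtypes)
```

For the real file, the `io.StringIO(...)` argument would be replaced by the CSV path and `chunksize` by something like the 100000 used above.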
| user_id | item_id | cat_id | seller_id | brand_id | time_stamp | action_type |
0 | 328862 | 323294 | 833 | 2882 | 2660.0 | 829 | 0 |
1 | 328862 | 844400 | 1271 | 2882 | 2660.0 | 829 | 0 |
2 | 328862 | 575153 | 1271 | 2882 | 2660.0 | 829 | 0 |
3 | 328862 | 996875 | 1271 | 2882 | 2660.0 | 829 | 0 |
4 | 328862 | 1086186 | 1271 | 1253 | 1049.0 | 829 | 0 |
... | ... | ... | ... | ... | ... | ... | ... |
54925325 | 208016 | 107662 | 898 | 1346 | 7996.0 | 1110 | 0 |
54925326 | 208016 | 1058313 | 898 | 1346 | 7996.0 | 1110 | 0 |
54925327 | 208016 | 449814 | 898 | 983 | 7996.0 | 1110 | 0 |
54925328 | 208016 | 634856 | 898 | 1346 | 7996.0 | 1110 | 0 |
54925329 | 208016 | 272094 | 898 | 1346 | 7996.0 | 1111 | 0 |
54925330 rows × 7 columns
RangeIndex: 54925330 entries, 0 to 54925329
Data columns (total 7 columns):
# Column Dtype
--- ------ -----
0 user_id int32
1 item_id int32
2 cat_id int16
3 seller_id int16
4 brand_id float16
5 time_stamp int16
6 action_type int8
dtypes: float16(1), int16(3), int32(2), int8(1)
memory usage: 890.5 MB
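The summary above shows `brand_id` stored as float16. That saves memory, but float16 has only 11 significant bits, so integer IDs above 2048 are rounded to the nearest representable value and distinct brands can end up with the same ID. The sketch below illustrates the effect with made-up IDs and shows two possible alternatives (float32, or pandas' nullable `Int16`); whether the rounding matters for this dataset depends on how `brand_id` is used later.

```python
import numpy as np
import pandas as pd

# Made-up brand IDs, including one missing value
brand = pd.Series([1049.0, 2049.0, 7997.0, np.nan])

as_f16 = brand.astype(np.float16)  # what the downcast in reduce_mem produces
as_f32 = brand.astype(np.float32)  # keeps integers exact up to 2**24
as_int = brand.astype("Int16")     # nullable integer dtype, keeps the missing value

comparison = pd.DataFrame({
    "original": brand,
    "float16": as_f16.astype(float),
    "float32": as_f32.astype(float),
})
print(comparison)
print(as_int)
```

In this toy series 2049 becomes 2048 and 7997 becomes 7996 under float16, while float32 and `Int16` keep every ID intact.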
test_data = reduce_mem(pd.read_csv("./data/data_format1/test_format1.csv"))
-- Mem. usage decreased to 3.49 Mb (41.7% reduction),time spend:0.00 min
| user_id | merchant_id | prob |
0 | 163968 | 4605 | NaN |
1 | 360576 | 1581 | NaN |
2 | 98688 | 1964 | NaN |
3 | 98688 | 3645 | NaN |
4 | 295296 | 3361 | NaN |
... | ... | ... | ... |
261472 | 228479 | 3111 | NaN |
261473 | 97919 | 2341 | NaN |
261474 | 97919 | 3971 | NaN |
261475 | 32639 | 3536 | NaN |
261476 | 32639 | 3319 | NaN |
261477 rows × 3 columns
RangeIndex: 261477 entries, 0 to 261476
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 user_id 261477 non-null int32
1 merchant_id 261477 non-null int16
2 prob 0 non-null float64
dtypes: float64(1), int16(1), int32(1)
memory usage: 3.5 MB
# user_log: user behavior log table
Total = user_log.isnull().sum().sort_values(ascending=False)
percent = (user_log.isnull().sum()/user_log.isnull().count()).sort_values(ascending=False)*100
missing_data = pd.concat([Total, percent], axis=1, keys=["Total", "Percent"])
| Total | Percent |
brand_id | 91015 | 0.165707 |
user_id | 0 | 0.000000 |
item_id | 0 | 0.000000 |
cat_id | 0 | 0.000000 |
seller_id | 0 | 0.000000 |
time_stamp | 0 | 0.000000 |
action_type | 0 | 0.000000 |
# user_info: user profile table
Total = user_info.isnull().sum().sort_values(ascending=False)
percent = (user_info.isnull().sum()/user_info.isnull().count()).sort_values(ascending=False)*100
missing_data = pd.concat([Total, percent], axis=1, keys=["Total", "Percent"])
| Total | Percent |
gender | 6436 | 1.517316 |
age_range | 2217 | 0.522668 |
user_id | 0 | 0.000000 |
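The same missing-value summary is computed twice above, once per table, so it could be wrapped in a small helper. The function name `missing_summary` and the toy frame below are hypothetical; the logic mirrors the `Total` / `Percent` computation used for `user_log` and `user_info`.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Count and percentage of NaN values per column, largest count first."""
    total = df.isnull().sum()
    percent = df.isnull().mean() * 100  # mean of a boolean mask = share of NaN
    summary = pd.concat([total, percent], axis=1, keys=["Total", "Percent"])
    return summary.sort_values("Total", ascending=False)

# Toy frame with made-up values
demo = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "age_range": [6.0, np.nan, 3.0, 0.0],
    "gender": [np.nan, np.nan, 1.0, 0.0],
})
summary = missing_summary(demo)
print(summary)
```

Note that this only counts NaN; codes that mean "unknown" (0 for `age_range`, 2 for `gender`) still have to be handled separately.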
# Values taken by the age_range field
array([ 6., 5., 4., 7., 3., 0., 8., 2., nan, 1.], dtype=float16)
# age_range equal to 0 or null counts as missing: 95131 records in total
user_info[(user_info["age_range"] == 0) | (user_info["age_range"].isna())].count()
user_id 95131
age_range 92914
gender 90664
dtype: int64
# Number of users in each age range (null values are not counted)
age_range | user_id |
0.0 | 92914 |
1.0 | 24 |
2.0 | 52871 |
3.0 | 111654 |
4.0 | 79991 |
5.0 | 40777 |
6.0 | 35464 |
7.0 | 6992 |
8.0 | 1266 |
# Values taken by the gender field
array([ 1., 0., 2., nan], dtype=float16)
# Count the records with missing gender
user_info[(user_info["gender"].isna()) | (user_info["gender"] == 2)].count()
user_id 16862
age_range 14664
gender 10426
dtype: int64
# Note: count() only counts non-NaN values, so the user_id count gives the true number of records with missing data
user_info[(user_info["gender"].isna()) | (user_info["gender"] == 2) | (user_info["age_range"] == 0) | (user_info["age_range"].isna())].count()
user_id 106330
age_range 104113
gender 99894
dtype: int64
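Because "unknown" is encoded in two ways (NaN, plus 0 for `age_range` and 2 for `gender`), every check above needs a compound condition. One possible simplification, shown here as a sketch on made-up rows, is to map the sentinel codes to NaN once so that ordinary `isna()` calls cover both cases. This is an illustrative option, not a step taken in the original analysis.

```python
import numpy as np
import pandas as pd

# Toy user profile rows (made-up values)
demo_info = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "age_range": [6.0, 0.0, np.nan, 3.0, 5.0],
    "gender": [1.0, 0.0, 2.0, np.nan, 0.0],
})

cleaned = demo_info.copy()
cleaned["age_range"] = cleaned["age_range"].replace(0, np.nan)  # 0 means unknown age
cleaned["gender"] = cleaned["gender"].replace(2, np.nan)        # 2 means unknown gender; 0 is female and is kept

# Rows where age or gender is unknown
unknown_any = cleaned[["age_range", "gender"]].isna().any(axis=1)
n_unknown = int(unknown_any.sum())
print(n_unknown)  # 3 (users 2, 3 and 4)
```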
label_gp = train_data.groupby("label")["user_id"].count()
0 244912
1 15952
Name: user_id, dtype: int64
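The label counts above imply a strongly imbalanced problem. The short calculation below, using only the two counts printed above, shows the share of repeat buyers, the accuracy a model would get by always predicting 0, and the negative-to-positive ratio. The variable names are made up for illustration.

```python
# Label counts taken from the groupby output above
neg, pos = 244912, 15952
total = neg + pos

pos_rate = pos / total       # share of repeat buyers
majority_acc = neg / total   # accuracy of always predicting label 0
neg_per_pos = neg / pos      # negatives per positive sample

print(f"repeat buyers: {pos_rate:.1%}")
print(f"always-0 accuracy: {majority_acc:.1%}")
print(f"negatives per positive: {neg_per_pos:.1f}")
```

Only about 6.1% of the pairs are repeat buyers, so a trivial all-zero prediction already reaches roughly 93.9% accuracy. This is one reason a ranking metric such as AUC is more informative than accuracy here.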
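fig, ax = plt.subplots(1, 2, figsize=(12, 6))
# The head of this call is missing from the source; label_gp (computed above) is assumed to be the plotted data
label_gp.plot(kind="pie"
              , ax=ax[0]
              , shadow=True
              , explode=[0, 0.1]
              , autopct="%1.1f%%"
              )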
sns.countplot(x="label", data=train_data, ax=ax[1])
Key point: countplot
train_data_merchant = train_data.copy()
top5_idx = train_data_merchant.merchant_id.value_counts().head().index.tolist()
# Add a column that flags Top-5 versus other merchants, to make their repeat-purchase counts easy to compare
train_data_merchant["Top5"] = train_data_merchant["merchant_id"].map(lambda x: 1 if x in top5_idx else 0)
train_data_merchant = train_data_merchant[train_data_merchant.Top5 == 1]
sns.countplot(x="merchant_id", hue="label", data=train_data_merchant)
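The count plot shows raw counts per label. To compare merchants of different sizes it can help to put the repeat-purchase rate next to the number of user-merchant pairs. The sketch below does this for the top merchants using `isin` and a named aggregation; the toy data and the names `top_k` and `top_stats` are made up for illustration.

```python
import pandas as pd

# Toy training pairs (made-up values)
demo_train = pd.DataFrame({
    "user_id":     [1, 2, 3, 4, 5, 6, 7, 8],
    "merchant_id": [10, 10, 10, 20, 20, 30, 30, 30],
    "label":       [0, 1, 0, 0, 0, 1, 1, 0],
})

top_k = 2
# Merchants with the most user-merchant pairs
top_ids = demo_train["merchant_id"].value_counts().head(top_k).index

# Keep only rows of those merchants, then summarize each one
top_rows = demo_train[demo_train["merchant_id"].isin(top_ids)]
top_stats = top_rows.groupby("merchant_id")["label"].agg(
    pairs="count",       # number of user-merchant pairs
    repeat_rate="mean",  # share of pairs labeled 1
)
print(top_stats)
```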
merchant_repeat_buy = [rate for rate in train_data.groupby("merchant_id")["label"].mean() if rate <= 1 and rate >0]
plt.figure(figsize=(8, 4))
ax1 = plt.subplot(1, 2, 1)
sns.distplot(merchant_repeat_buy, fit=scipy.stats.norm)
ax2 = plt.subplot(1, 2, 2)
res = scipy.stats.probplot(merchant_repeat_buy, plot=plt)
user_repeat_buy = [rate for rate in train_data.groupby("user_id")["label"].mean() if rate <= 1 and rate >0]
plt.figure(figsize=(8, 4))
ax1 = plt.subplot(1, 2, 1)
sns.distplot(user_repeat_buy, fit=scipy.stats.norm)
ax2 = plt.subplot(1, 2, 2)
res = scipy.stats.probplot(user_repeat_buy, plot=plt)
train_data_user_info = train_data.merge(user_info, on=["user_id"], how="left")
plt.figure(figsize=(6, 4))
plt.title("Gender VS Label")
ax = sns.countplot(x="gender", hue="label", data=train_data_user_info)
for p in ax.patches:
    height = p.get_height()
    # (the rest of the loop body, presumably a bar annotation, is missing from the source)
repeat_buy = [rate for rate in train_data_user_info.groupby("gender")["label"].mean() if rate <= 1 and rate >0]
plt.figure(figsize=(8, 4))
ax1 = plt.subplot(1, 2, 1)
sns.distplot(repeat_buy, fit=scipy.stats.norm)
ax2 = plt.subplot(1, 2, 2)
res = scipy.stats.probplot(repeat_buy, plot=plt)
plt.figure(figsize=(5, 4))
plt.title("Age VS Label")
res = sns.countplot(x="age_range", hue="label", data=train_data_user_info)
repeat_buy = [rate for rate in train_data_user_info.groupby("age_range")["label"].mean() if rate <= 1 and rate >0]
plt.figure(figsize=(8, 4))
ax1 = plt.subplot(1, 2, 1)
sns.distplot(repeat_buy, fit=scipy.stats.norm)
ax2 = plt.subplot(1, 2, 2)
res = scipy.stats.probplot(repeat_buy, plot=plt)
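With only a handful of groups (three gender codes, nine age codes), a fitted normal curve and probability plot rest on very few points, so a plain table of repeat-purchase rates per group may be easier to read. The sketch below joins toy training pairs with toy user profiles, treats `age_range == 0` as unknown, and reports the number of pairs and the repeat rate per age range. All data are made up; `validate="many_to_one"` is an optional safety check that each `user_id` appears at most once in the profile table.

```python
import numpy as np
import pandas as pd

# Toy training pairs and user profiles (made-up values)
demo_train = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 4],
    "merchant_id": [10, 20, 10, 30, 20],
    "label": [0, 1, 0, 1, 0],
})
demo_info = pd.DataFrame({
    "user_id": [1, 2, 3],
    "age_range": [3.0, 0.0, 3.0],
    "gender": [0.0, 1.0, np.nan],
})

# Left join keeps every training pair, even for users without a profile
merged = demo_train.merge(demo_info, on="user_id", how="left", validate="many_to_one")
merged["age_range"] = merged["age_range"].replace(0, np.nan)  # 0 means unknown

# dropna=False keeps the unknown group as its own row
by_age = merged.groupby("age_range", dropna=False)["label"].agg(
    pairs="count", repeat_rate="mean"
)
print(by_age)
```

Showing the pair count beside each rate makes it clear when a rate comes from very few samples, as with age code 1 in the real data.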