This post uses the Tianchi beginners' offline competition for hands-on practice. Because of limited time and compute, and more importantly limited experience, this attempt still has plenty of problems: I have not yet learned the traditional machine-learning algorithms, I am still unfamiliar with many parts of deep learning, and my data preprocessing is not yet fluent.
This article is original work by 拎着激光炮的野人. Reposting is welcome; please credit the author and link to the original:
https://www.jianshu.com/p/ef1fc958e30f
Problem analysis
After reading the problem statement carefully, this turns out to be an O2O-style prediction task: the items mostly come from the service industry and are bought online but consumed offline, so an item's geographic location matters a lot. To keep things simple, we assume that different item categories are largely non-substitutable and uncorrelated, and that a user's geohash is strongly correlated with the geohash of the items they act on.
1. Loading the data
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import math
from sklearn.metrics import f1_score
idx = pd.IndexSlice
#read the items table
items = pd.read_csv("./tianchi_fresh_comp_train_item.csv")
print("read items", items.count()[0])
actions = pd.read_csv("./tianchi_fresh_comp_train_user.csv")
print("user action read, total:", actions.count()[0])
read items 620918
user action read, total: 23291027
# read and transform the actions table (all user behaviour)
# TODO: ignore all geo information for now
def prepare_data(actions, items):
    #convert time
    actions.time = pd.to_datetime(actions.time)
    #index users
    user_index = actions.user_id.drop_duplicates()
    user_index = user_index.reset_index(drop=True).reset_index().set_index("user_id")
    user_index.columns = ['user']
    actions = pd.merge(actions, user_index, left_on='user_id', right_index=True, how='left')
    #index items
    item_ids = actions.item_id.drop_duplicates()
    item_ids = item_ids.reset_index(drop=True).reset_index().set_index("item_id")
    item_ids.columns = ['item']
    actions = pd.merge(actions, item_ids, left_on='item_id', right_index=True, how='left')
    items = pd.merge(items, item_ids, left_on='item_id', right_index=True, how='left')
    #index categories
    category = actions.item_category.drop_duplicates()
    category = category.reset_index(drop=True).reset_index().set_index("item_category")
    category.columns = ['category']
    actions = pd.merge(actions, category, left_on='item_category', right_index=True, how='left')
    #drop the raw ids
    actions = actions.drop(['user_id', 'item_id', 'item_category'], axis=1)
    items = items.drop(['item_id', 'item_category'], axis=1)
    #reorder columns
    actions = actions.loc[:, ['user', 'item', 'behavior_type', 'category', 'time', 'user_geohash']]
    #add date and hour
    actions['date'] = actions.time.dt.date
    actions['hour'] = actions.time.dt.hour
    return actions, items, user_index, item_ids, category
# actions, items = prepare_data(actions, items)
actions, items, user_index, item_ids, _ = prepare_data(actions, items)
actions.head()
items.head()
2. Exploring the data
geo = pd.concat([items.item_geohash, actions.user_geohash]).drop_duplicates()
item_geo = items.item_geohash.drop_duplicates().dropna()
print("商品的geo去重后总数的统计", item_geo.count())
action_geo = actions.user_geohash.drop_duplicates().dropna()
print("用户行为的geo去重后总数的统计",action_geo.count())
print("商品与用户行为的geo去重后总数的统计:\n",
"交集 / 用户行为geo:",
len(action_geo[action_geo.isin(item_geo)]) / len(action_geo),
"\n交集 / 商品geo:",
len(item_geo[item_geo.isin(action_geo)]) / len(item_geo)
)
del item_geo
del action_geo
#only about 2.5% of the user-action geohashes also appear among the item geohashes, while roughly 45% of the item geohashes appear among the user-action geohashes, so the two sets of locations overlap only partially
distinct item geohashes: 57358
distinct user-action geohashes: 1018981
overlap between item and user-action geohashes:
 intersection / user-action geohashes: 0.025223237724746585 
intersection / item geohashes: 0.44809791136371563
ag = actions.loc[:, ['user', 'user_geohash']].dropna()
print("用户行为带有geohash的数量", len(ag))
ag = ag.drop_duplicates()
print("用户行为带有geohash的数量(去重后)", len(ag))
ag['c'] = 1
ag = ag.loc[:, ['user', 'c']].groupby('user').sum()
print(ag.describe())
del ag
#observations about users:
#the median user appears at 68 distinct geohashes (see the describe output below), so most users' geohash changes frequently
#a user sits at many different geohashes at different times, i.e. the geohash is fairly fine-grained and may be some distance away from any item geohash
#a question worth exploring: are two geohashes observed close together in time also close together in space? (a rough sketch follows the next code block)
user actions with a geohash: 7380017
user actions with a geohash (deduplicated): 1257674
c
count 16240.000000
mean 77.442980
std 53.782759
min 1.000000
25% 42.000000
50% 68.000000
75% 103.000000
max 709.000000
df = actions[actions.user_geohash.notna()]
print("购买的时候, 有geo信息的行为数量", len(df), "占全部行为的", len(df[df.user_geohash.isin(items.item_geohash)]) / len(df))
del df
actions with geo info: 7380017 fraction whose geohash also appears among item geohashes: 0.03044234179948366
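Following up on the question raised above (whether two geohashes observed close together in time are also close in space): one cheap proxy is the length of the common prefix of the two geohash strings, since geohash is hierarchical and a longer shared prefix means a smaller shared bounding box. This is only a sketch, and it assumes the competition's anonymized geohashes preserve the usual prefix structure, which may not hold; it is not used in the rest of this post.
def common_prefix_len(g1, g2):
    # length of the shared geohash prefix; a longer prefix ~ a smaller shared cell
    n = 0
    for a, b in zip(g1, g2):
        if a != b:
            break
        n += 1
    return n
# purely illustrative: compare consecutive geohash observations of each user, ordered by time
g = actions.loc[actions.user_geohash.notna(), ['user', 'time', 'user_geohash']].sort_values(['user', 'time'])
g['prev_geohash'] = g.groupby('user').user_geohash.shift(1)
g = g.dropna(subset=['prev_geohash'])
g['prefix_len'] = [common_prefix_len(a, b) for a, b in zip(g.user_geohash, g.prev_geohash)]
print(g.prefix_len.describe())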
3. Feature extraction
First, decide which features to extract; they should capture characteristics of the user, the item, the item category, the location, and so on (the count features are implemented in 3.1-3.3; a sketch of the groupby pattern they share follows this list):
- user: total action counts, plus something that reflects purchasing preferences, e.g. a preference for a particular category of items?
- item / category: how many users acted on it in total, and total action counts across all users
- category: how many items it contains in total
- time-dependent versions of the features above?
- geographic versions of the features above
- cross features of the above, e.g. a particular user who especially likes buying a particular item
- time-related features: a user's action counts on a given day, used to predict whether they buy on the following day
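Sections 3.1 to 3.3 all use the same pandas pattern: group by an entity plus behavior_type, count, then unstack behavior_type into columns. A small helper capturing that shared pattern, as a sketch only (the sections below keep their original explicit code):
def count_by_behavior(df, entity, counted, dedup=False):
    # count `counted` per (entity, behavior_type) and spread behavior_type into columns
    if dedup:
        # count distinct values of `counted` rather than raw action rows
        df = df.drop_duplicates([entity, 'behavior_type', counted])
    return df.groupby([entity, 'behavior_type'])[[counted]].count() \
             .unstack().fillna(0).astype(np.int64)
# e.g. count_by_behavior(actions, 'user', 'item') reproduces the first block of 3.1,
# and count_by_behavior(actions, 'user', 'item', dedup=True) reproduces the second one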
3.0 Saving the features
saved_actions = actions
print(len(actions))
actions.head()
23291027
# actions = saved_actions  # restore the full actions table
print("highest user index: {}".format(actions.user.max()))
highest user index: 19999
# # restrict feature extraction to a subset of users to reduce memory usage; it was far too slow, remove later
# actions = actions.set_index("user").loc[:10000, :]
# actions = actions.reset_index()
# print(actions.user.max())
# actions.head()
3.1 User features
#per-user action counts, broken down by behaviour type (behaviour 4 = purchase)
user = actions.groupby(['user', 'behavior_type'])[['item']].count().unstack().fillna(0).astype(np.int)
user.rename(columns={'item': 'c'}, level=0, inplace=True)
user.head()
# number of distinct items per user and behaviour type
c = actions.drop_duplicates(['user', 'behavior_type', 'item']) \
.groupby(['user', 'behavior_type'])[['item']].count().unstack().fillna(0).astype(np.int)
user = user.merge(c, left_index=True, right_index=True, how='left')
user.head()
#number of distinct categories per user and behaviour type
c = actions.drop_duplicates(['user', 'behavior_type', 'category']) \
.groupby(['user', 'behavior_type'])[['category']].count().unstack().fillna(0).astype(np.int)
user = user.merge(c, left_index=True, right_index=True, how='left')
user.head()
user = pd.DataFrame(user.values, index=user.index, columns=["u{}".format(i) for i in range(0, 12, 1)])
user.head()
user = user / (user.mean() + user.std() * 3)
user.head()
# user.to_csv("user.csv")
# del user
3.2 Item features
#per-item action counts by behaviour type
good = actions.groupby(['item', 'behavior_type'])[['user']].count().unstack().fillna(0).astype(np.int)
good.rename(columns={'user': 'c'}, level=0, inplace=True)
good.head()
#number of distinct users per item and behaviour type
c = actions.drop_duplicates(['user', 'behavior_type', 'item']) \
.groupby(['item', 'behavior_type'])[['user']].count().unstack().fillna(0).astype(np.int)
good = good.merge(c, left_index=True, right_index=True, how='left')
good.head()
good = pd.DataFrame(good.values, index=good.index, columns=["g{}".format(i) for i in range(0, 8, 1)])
good.head()
good = good / (good.mean() + good.std() * 3)
good.head()
![](https://upload-images.jianshu.io/upload_images/7100403-5cbe4cd6a773ce6b.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
# good.to_csv("good.csv")
# del good
3.3 Category features
#per-category action counts by behaviour type
cat = actions.groupby(['category', 'behavior_type'])[['user']].count().unstack().fillna(0).astype(np.int)
cat.rename(columns={'user': 'c'}, level=0, inplace=True)
cat.head()
#number of distinct users per category and behaviour type
c = actions.drop_duplicates(['user', 'behavior_type', 'category']) \
.groupby(['category', 'behavior_type'])[['user']].count().unstack().fillna(0).astype(np.int)
cat = cat.merge(c, left_index=True, right_index=True, how='left')
cat.head()
#number of distinct items per category and behaviour type
c = actions.drop_duplicates(['item', 'behavior_type', 'category']) \
.groupby(['category', 'behavior_type'])[['item']].count().unstack().fillna(0).astype(np.int)
cat = cat.merge(c, left_index=True, right_index=True, how='left')
cat.head()
cat = pd.DataFrame(cat.values, index=cat.index, columns=["c{}".format(i) for i in range(0, 12, 1)])
cat.head()
cat = cat / (cat.mean() + cat.std() * 3)
cat.head()
# cat.to_csv('cat.csv')
# del cat
del c
3.4 Time features
3.5 Geo features
3.6 Cross features
3.7 Actions within 24 hours
def read_csv():
return pd.read_csv("user.csv", index_col=0)
def read_good():
return pd.read_csv("good.csv", index_col=0)
def read_cat():
return pd.read_csv('cat.csv', index_col=0)
def read_label():
return pd.read_csv("label.csv", index_col=0)
# label: whether the user buys the item on the following day (purchase dates are shifted back one day so they line up with the previous day's features)
label = actions[actions.behavior_type == 4].copy()
label.date = (pd.to_datetime(label.date) - np.timedelta64(1, 'D'))
# label.date = label.date.dt.date
print(label.date.dtypes)
label['buy'] = 1
# label = label.loc[:, ['date', 'user','category','item','buy']].groupby(['date', 'user','category','item']).sum()
label = label.set_index(['date', 'user']).loc[:, ['item', 'category', 'buy']].drop_duplicates()
label.set_index(['category','item',], append=True, inplace=True)
label.head()
datetime64[ns]
# label.to_csv("label.csv")
# del label
# read_label().head()
# daily action counts for each (user, category, item)
d_action = actions.copy()
d_action['d'] = 1
d_action.date = pd.to_datetime(d_action.date)
d_action = d_action.groupby([ 'date', 'user', 'category', 'item', 'behavior_type']).sum()[['d']]
d_action = d_action / (d_action.mean() + d_action.std() * 3)
d_action = d_action.unstack().fillna(0).astype(np.float32)
d_action.columns = d_action.columns.droplevel(0)
d_action.columns = ['d_t{}'.format(i) for i in range(1, 5, 1)]
d_action.head()
# d_action.to_csv('d_action.csv')
# pd.read_csv('d_action.csv', index_col=0).dtypes
#a user's actions during the last few hours of each day
x_action = actions.copy()
x_action['c'] = 1
x_action.date = pd.to_datetime(x_action.date).dt.date
#the data is too large, so keep only the last 3 hours of each day (21:00-23:00)
x_action = x_action.loc[x_action.hour.isin([23, 22, 21])]
x_action.date = pd.to_datetime(x_action.date)
x_action = x_action.groupby([ 'date', 'user', 'category', 'item', 'hour', 'behavior_type']).sum()
x_action = x_action.unstack()
x_action = x_action / (x_action.mean() + x_action.std() * 3)
x_action = x_action.stack().astype(np.float32)
x_action.head()
x_action = x_action.unstack(['hour', 'behavior_type'], fill_value=0).sort_index(axis=1)
x_action.columns = x_action.columns.droplevel(0)
# print(x_action.describe())
#build the full column MultiIndex explicitly so the frame always expands to all 12 columns (3 hours x 4 behaviour types), rather than only the combinations that happen to occur
x_action = pd.DataFrame(x_action.values, index=x_action.index, columns=pd.MultiIndex.from_product([range(21, 24, 1), range(1, 5, 1)], names=['hour', 'behavior_type']))
x_action = x_action.fillna(0)
# x_action.info()
# x_action[:, :] = x_action[:, :].astype(np.int8)
# x_action.info()
# x_action = x_action.apply(lambda x: x.astype(np.int32))
x_action = pd.DataFrame(x_action.values, index=x_action.index, columns = ["h{}_{}".format(h, t) for h in range(21, 24, 1) for t in [1, 2, 3, 4]])
# print(x_action.describe())
x_action.head()
x_action = d_action.merge(x_action, left_index=True, right_index=True, how='left')
x_action.fillna(0, inplace=True)
x_action.head()
# merge the x and y data; how='left' drops the rows where a user bought an item without any recorded prior action on it
# of course this also drops cases such as: I browsed n cheap items and then bought a best-seller from the same category
# TODO: handle this later
x_action = x_action.merge(label, left_index=True, right_index=True, how='left')
x_action.fillna(0, inplace=True)
x_action.buy = x_action.buy.astype(np.int8)
x_action.head()
#attach the user, item and category features to each (date, user, category, item) row
x_action.reset_index(inplace=True)
x_action = user.merge(x_action, right_on='user', left_index=True, how='right')
x_action = good.merge(x_action, right_on='item', left_index=True, how='right')
x_action = cat.merge(x_action, right_on='category', left_index=True, how='right')
x_action.set_index(['date', 'user', 'category', 'item'], inplace=True)
x_action.head()
#x_action.to_csv("x_action.csv") # the frame is huge and writing it out as CSV is very slow; how can this be solved? (one option is sketched below)
#pd.read_csv("x_action.csv").head()
3.8 Directions for improvement
3.8.1 Consider adding a noise layer later; otherwise a user who viewed an item once and bought it once may be memorized by the network as a guaranteed buyer (a sketch follows below)
3.8.2 How to train by group: a user typically browses one category of items and then picks one item from it to buy
3.8.3 Is the current "normalization" (dividing by mean + 3 x std) reasonable? Is there a better or more standard way to preprocess the data, e.g. plain standardization?
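A minimal sketch for 3.8.1, not the model actually trained below: keras.layers.GaussianNoise injects zero-mean noise at training time only, which may make it harder for the network to memorize one-off view-then-buy users. The noise level (0.05) and the relu activations are assumptions of this sketch, not choices made elsewhere in this post.
from keras.layers import Input, Dense, Dropout, GaussianNoise
from keras.models import Model
def build_noisy_model(n_features):
    # same overall shape as the model below, with a noise layer on the input
    inputs = Input(shape=(n_features,))
    x = GaussianNoise(0.05)(inputs)  # the stddev is an arbitrary assumption
    x = Dense(256, activation='relu')(x)
    x = Dropout(0.2)(x)
    x = Dense(128, activation='relu')(x)
    x = Dropout(0.2)(x)
    outputs = Dense(1, activation='sigmoid')(x)
    model = Model(inputs=inputs, outputs=outputs)
    model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
    return model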
from keras.layers import Dense, LSTM, Dropout
from keras.models import Model, Input
from sklearn.model_selection import train_test_split
# take x and y from the same slice so the rows stay aligned (the slice up to 2014-12-18 covers the whole frame)
xy = x_action.loc[:'2014-12-18'].values
x_train, x_test, y_train, y_test = train_test_split(xy[:, :-1], xy[:, -1], test_size=0.1)
x_train, y_train
(array([[0.03628967, 0.03287605, 0.04344553, ..., 0. , 0. ,
0. ],
[2.19627494, 2.48323743, 2.35556257, ..., 0. , 0. ,
0. ],
[0.12262843, 0.05917688, 0.07467201, ..., 0. , 0. ,
0. ],
...,
[0.08175772, 0.04602647, 0.16020541, ..., 0. , 0. ,
0. ],
[9.42243503, 9.34921724, 9.17379612, ..., 0. , 0. ,
0. ],
[2.09024259, 1.75996439, 2.44856316, ..., 0. , 0. ,
0. ]]), array([0., 0., 0., ..., 0., 0., 0.]))
x_train.shape # (17482, 128) for test, (9014805, 48) for all
(9014805, 48)
inputs = Input(shape=(x_train.shape[1], ))
x = Dense(256)(inputs)
x = Dropout(0.2)(x)
x = Dense(128)(x)
x = Dropout(0.2)(x)
outputs = Dense(1, activation='sigmoid')(x)
model = Model(inputs=inputs, outputs=outputs)
model.compile(loss='binary_crossentropy', optimizer='rmsprop',metrics=['accuracy'])
history = model.fit(x_train, y_train, batch_size=64, epochs=50, validation_data=[x_test, y_test])
# results when training on only 10,000 users:
# Epoch 11/50
# 4065324/4065324 [==============================] - 364s 90us/step - loss: 0.0219 - acc: 0.9963 - val_loss: 0.0196 - val_acc: 0.9968
Train on 9014805 samples, validate on 1001646 samples
Epoch 1/50
9014805/9014805 [==============================] - 796s 88us/step - loss: 0.0271 - acc: 0.9960 - val_loss: 0.0293 - val_acc: 0.9959
Epoch 2/50
9014805/9014805 [==============================] - 787s 87us/step - loss: 0.0271 - acc: 0.9960 - val_loss: 0.0215 - val_acc: 0.9968
Epoch 3/50
9014805/9014805 [==============================] - 787s 87us/step - loss: 0.0270 - acc: 0.9960 - val_loss: 0.0208 - val_acc: 0.9968
Epoch 4/50
9014805/9014805 [==============================] - 786s 87us/step - loss: 0.0270 - acc: 0.9960 - val_loss: 0.0253 - val_acc: 0.9966
Epoch 5/50
9014805/9014805 [==============================] - 787s 87us/step - loss: 0.0271 - acc: 0.9960 - val_loss: 0.0249 - val_acc: 0.9961
Epoch 6/50
9014805/9014805 [==============================] - 785s 87us/step - loss: 0.0271 - acc: 0.9960 - val_loss: 0.0258 - val_acc: 0.9966
Epoch 7/50
9014805/9014805 [==============================] - 786s 87us/step - loss: 0.0270 - acc: 0.9960 - val_loss: 0.0235 - val_acc: 0.9969
Epoch 8/50
9014805/9014805 [==============================] - 783s 87us/step - loss: 0.0272 - acc: 0.9960 - val_loss: 0.0216 - val_acc: 0.9968
Epoch 9/50
9014805/9014805 [==============================] - 785s 87us/step - loss: 0.0271 - acc: 0.9960 - val_loss: 0.0242 - val_acc: 0.9964
Epoch 10/50
9014805/9014805 [==============================] - 786s 87us/step - loss: 0.0273 - acc: 0.9960 - val_loss: 0.0229 - val_acc: 0.9969
Epoch 11/50
9014805/9014805 [==============================] - 783s 87us/step - loss: 0.0271 - acc: 0.9960 - val_loss: 0.0234 - val_acc: 0.9967
Epoch 12/50
9014805/9014805 [==============================] - 782s 87us/step - loss: 0.0271 - acc: 0.9960 - val_loss: 0.0230 - val_acc: 0.9968
Epoch 13/50
9014805/9014805 [==============================] - 781s 87us/step - loss: 0.0271 - acc: 0.9960 - val_loss: 0.0236 - val_acc: 0.9969
Epoch 14/50
9014805/9014805 [==============================] - 781s 87us/step - loss: 0.0272 - acc: 0.9960 - val_loss: 0.0229 - val_acc: 0.9967
Epoch 15/50
9014805/9014805 [==============================] - 782s 87us/step - loss: 0.0271 - acc: 0.9960 - val_loss: 0.0249 - val_acc: 0.9961
Epoch 16/50
5038976/9014805 [===============>..............] - ETA: 5:37 - loss: 0.0271 - acc: 0.9960
............................
y_predict = model.predict(x_test)
y_predict
array([[1.0223342e-03],
[5.4835586e-10],
[1.0230833e-03],
...,
[5.3039176e-04],
[1.6522235e-03],
[1.2059750e-03]], dtype=float32)
import matplotlib.pyplot as plt
# list all data in history
print(history.history.keys())
# summarize history for accuracy
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
dict_keys(['val_loss', 'val_acc', 'loss', 'acc'])
#find which decision threshold gives the highest F1 score
def get_f1_by(true, predict, n):
    predict = predict.reshape(-1)
    return f1_score(true, np.where(predict >= n, np.ones_like(predict), np.zeros_like(predict)))
def cal_f1_score(true, predict):
    # `area` is the global grid of candidate thresholds defined below
    result = area.apply(lambda i: get_f1_by(true, predict, i))
    return result
area = pd.Series(np.arange(1e-9, 0.9, 0.05))
result = cal_f1_score(y_test.reshape(-1), y_predict.reshape(-1))
print('when n =', area[result.idxmax()], 'best result is', result.max())
plt.scatter(area, result)
plt.show()
#0.043000000000000024 0.15744941753525443
when n = 0.050000001 best result is 0.061109622085231845
area = pd.Series(np.arange(1e-9, 0.1, 0.001))
result = cal_f1_score(y_test.reshape(-1), y_predict.reshape(-1))
print('when n =', area[result.idxmax()], 'best result is', result.max())
plt.scatter(area, result)
plt.show()
#0.043000000000000024 0.15744941753525443
when n = 0.016000001 best result is 0.19509536784741144
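An equivalent way to pick the best-F1 threshold, sketched here with sklearn's precision_recall_curve instead of scanning a fixed grid; it is not used for the submission below.
from sklearn.metrics import precision_recall_curve
precision, recall, thresholds = precision_recall_curve(y_test.reshape(-1), y_predict.reshape(-1))
f1 = 2 * precision * recall / (precision + recall + 1e-12)  # guard against 0/0
best = f1[:-1].argmax()                                     # the last point has no threshold
print("best threshold:", thresholds[best], "F1:", f1[best])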
#the search above shows that a threshold of roughly 0.016 gives the highest F1 on the validation set; use it to predict the final day
y_predict = model.predict(x_action.loc['2014-12-18'].iloc[:, :-1])
y_predict = np.where(y_predict > 0.016, np.ones_like(y_predict), np.zeros_like(y_predict))
len(y_predict[y_predict==1])
1154
result = x_action.loc['2014-12-18'].copy()
result.buy = y_predict
result = result.loc[result.buy > 0, 'buy'].reset_index().loc[:, ['user', 'item']]
result.head()
user_index.head()
item_ids.head()
result = result.merge(item_ids.reset_index(), left_on='item', right_on='item', how='left') \
.merge(user_index.reset_index(), left_on='user', right_on='user', how='left').loc[:, ['user_id', 'item_id']]
result.head()
result.to_csv('tianchi_mobile_recommendation_predict.csv', index=False)
At this point we have a first set of predictions. It has been submitted to Tianchi, but the score has not come back yet:
to be updated
My original plan was to build the network with LSTM and embedding layers, but I first built it in this "traditional" way as a baseline for the later models, and even that took a very long time; my fundamentals clearly still need work. I plan to keep learning while practising. If you know a better approach, or see places where I did things poorly, please do point them out! (A rough sketch of the embedding idea follows.)
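As a rough outline of the embedding-based direction mentioned above (a sketch only; the embedding sizes, activations and the plain dense head instead of an LSTM are assumptions of this sketch): the user, item and category indices built in prepare_data can feed keras Embedding layers, concatenated with the dense behaviour features used above.
from keras.layers import Input, Embedding, Flatten, Concatenate, Dense, Dropout
from keras.models import Model
def build_embedding_model(n_users, n_items, n_categories, n_dense_features):
    # id inputs plus the dense behaviour features
    user_in = Input(shape=(1,), name='user')
    item_in = Input(shape=(1,), name='item')
    cat_in = Input(shape=(1,), name='category')
    dense_in = Input(shape=(n_dense_features,), name='dense_features')
    # embedding sizes are arbitrary assumptions
    u = Flatten()(Embedding(n_users, 16)(user_in))
    i = Flatten()(Embedding(n_items, 16)(item_in))
    c = Flatten()(Embedding(n_categories, 8)(cat_in))
    x = Concatenate()([u, i, c, dense_in])
    x = Dense(128, activation='relu')(x)
    x = Dropout(0.2)(x)
    out = Dense(1, activation='sigmoid')(x)
    model = Model(inputs=[user_in, item_in, cat_in, dense_in], outputs=out)
    model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
    return model
# e.g. n_users = len(user_index), n_items = len(item_ids); an LSTM variant would then consume
# a sequence of per-day or per-hour feature vectors per (user, item) pair instead of the flattened columns used here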