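Keras Practice 2 -- Tianchi Newcomer Competition [Offline Round] (1)

This time I use the offline round of the Tianchi newcomer competition for hands-on practice. Because of limited time and compute, and above all limited experience, this attempt still has many problems: I do not yet know the traditional machine learning algorithms, there is a lot about deep learning I am not familiar with, and I am not yet fluent at data preprocessing.

This article is an original work by 拎着激光炮的野人. Reposting is welcome; please credit the author and link to the original.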

https://www.jianshu.com/p/ef1fc958e30f

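Problem Analysis

After reading the problem statement carefully, I found that the task is an O2O-style prediction: the items mostly come from the service industry and are bought online but consumed offline, so purchases depend heavily on where the item is located. To keep things simple, we assume that different item categories are neither strongly substitutable nor strongly correlated, and that the user's geo position and the item's geo position are strongly correlated.

1. Loading the Data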

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import math
from sklearn.metrics import f1_score
idx = pd.IndexSlice

# Read the items table
items = pd.read_csv("./tianchi_fresh_comp_train_item.csv")
print("read items", len(items))
actions = pd.read_csv("./tianchi_fresh_comp_train_user.csv")
print("user action read, total:", len(actions))

read items 620918
user action read, total: 23291027

# Read and convert the actions table, which holds every user action
# TODO: ignore all geo information for now

def prepare_data(actions, items):
    #convert time
    actions.time = pd.to_datetime(actions.time)

    #index user
    user_index = actions.user_id.drop_duplicates()
    user_index = user_index.reset_index(drop=True).reset_index().set_index("user_id")
    user_index.columns = ['user']
    actions = pd.merge(actions, user_index, left_on='user_id', right_index=True, how='left')

    #index item
    item_ids = actions.item_id.drop_duplicates()
    item_ids = item_ids.reset_index(drop=True).reset_index().set_index("item_id")
    item_ids.columns = ['item']
    actions = pd.merge(actions, item_ids, left_on='item_id', right_index=True, how='left')

    items = pd.merge(items, item_ids, left_on='item_id', right_index=True, how='left')

    # index category
    category = actions.item_category.drop_duplicates()
    category = category.reset_index(drop=True).reset_index().set_index("item_category")
    category.columns = ['category']
    actions = pd.merge(actions, category, left_on='item_category', right_index=True, how='left')

    #drop user_id, item_id, item_category
    actions = actions.drop(['user_id', 'item_id', 'item_category'], axis=1)
    items = items.drop(['item_id', 'item_category'], axis=1)

    #reorder columns
    actions = actions.loc[:, ['user', 'item', 'behavior_type', 'category', 'time', 'user_geohash']]
    
    #add date and hour
    actions['date'] = actions.time.dt.date
    actions['hour'] = actions.time.dt.hour
    return actions, items, user_index, item_ids, category

# actions, items = prepare_data(actions, items)
actions, items, user_index, item_ids, _ = prepare_data(actions, items)

actions.head()


items.head()
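
The drop_duplicates / reset_index / merge pattern in prepare_data gives every raw ID a dense integer, numbered in order of first appearance. As a minimal sketch on made-up data (the toy frame and the names `codes`, `uniques` and `user_lookup` are illustrative and not part of the original notebook), `pd.factorize` builds the same kind of mapping in one call:

```python
import pandas as pd

# Toy stand-in for the actions table (values are made up).
toy = pd.DataFrame({"user_id": [501, 502, 501, 777],
                    "item_id": [9, 9, 8, 7]})

# factorize numbers each distinct value in order of first appearance.
codes, uniques = pd.factorize(toy["user_id"])
toy["user"] = codes

# Lookup table to map the dense index back to the raw ID later on.
user_lookup = pd.Series(uniques, name="user_id")
print(toy)
print(user_lookup)
```

The lookup Series plays the role of user_index: its position is the dense index and its value is the original user_id.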

2. Exploring the Data

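geo = pd.concat([items.item_geohash, actions.user_geohash]).drop_duplicates()
item_geo = items.item_geohash.drop_duplicates().dropna()
print("distinct item geohashes:", item_geo.count())
action_geo = actions.user_geohash.drop_duplicates().dropna()
print("distinct user-action geohashes:", action_geo.count())
print("overlap between distinct item and user-action geohashes:\n", 
      "intersection / user-action geohashes:",
      len(action_geo[action_geo.isin(item_geo)]) / len(action_geo),
      "\nintersection / item geohashes:",
      len(item_geo[item_geo.isin(action_geo)]) / len(item_geo)
     )
del item_geo
del action_geo
# Only about 2.5% of the distinct user geohashes coincide with an item geohash,
# while about 45% of the item geohashes also appear among the user geohashes,
# so an exact user/item geohash match is the exception rather than the rule

distinct item geohashes: 57358
distinct user-action geohashes: 1018981
overlap between distinct item and user-action geohashes:
intersection / user-action geohashes: 0.025223237724746585
intersection / item geohashes: 0.44809791136371563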

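ag = actions.loc[:, ['user', 'user_geohash']].dropna()
print("user actions with a geohash:", len(ag))
ag = ag.drop_duplicates()
print("distinct (user, geohash) pairs:", len(ag))
ag['c'] = 1
ag = ag.loc[:, ['user', 'c']].groupby('user').sum()
print(ag.describe())
del ag
# Among users with at least one geohash, the median number of distinct geohashes
# per user is 68 (the 25th percentile is 42), so most users' geohash changes often
# Users are at many different geo positions at different times (so the geohash is fairly precise and may lie some distance away from an item's geohash)
# One idea to consider: do two geohashes recorded close together in time also lie close together in space?

user actions with a geohash: 7380017
distinct (user, geohash) pairs: 1257674
c
count 16240.000000
mean 77.442980
std 53.782759
min 1.000000
25% 42.000000
50% 68.000000
75% 103.000000
max 709.000000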
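
The per-user count of distinct geohashes can also be computed directly. This is a minimal sketch on made-up data (the toy frame and the name `geo_per_user` are illustrative), using `groupby(...).nunique()` instead of the drop_duplicates / helper-column / sum sequence:

```python
import pandas as pd

# Toy user/geohash log (made-up values); None marks actions without a location.
toy = pd.DataFrame({
    "user": [0, 0, 0, 1, 1, 2],
    "user_geohash": ["9q8yy", "9q8yy", "9q8yz", "wx4g0", None, None],
})

# nunique ignores missing values by default, so no explicit dropna is needed.
geo_per_user = toy.groupby("user")["user_geohash"].nunique()
print(geo_per_user)
```

One difference to keep in mind: a user with no geohash at all appears here with a count of 0, whereas the dropna-based version leaves that user out, so filter with `geo_per_user[geo_per_user > 0]` before calling describe() if the statistics should be comparable.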

df = actions[actions.user_geohash.notna()]
print("actions with geo info:", len(df), "share whose geohash matches an item geohash:", len(df[df.user_geohash.isin(items.item_geohash)]) / len(df))
del df

actions with geo info: 7380017 share whose geohash matches an item geohash: 0.03044234179948366

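3. Feature Extraction

First decide which features to extract. They should capture the characteristics of users, items, item categories, and locations:

Users: total action counts, plus something that reflects purchase preferences, e.g. a preference for buying a particular category of items?
Items / categories: how many users bought them, and the total action count over all users
Categories: how many items they contain
Time-related aspects of the features above?
Geographic aspects of the items above
Cross features of the above, e.g. a user who particularly likes buying a certain item
Time-dependent features: the user's action counts on a given day, used to predict whether they buy the next day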
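
As a hypothetical illustration of the cross feature idea from the list above (this is not something the notebook computes; the toy data and the names `user_cat` and `user_cat_share` are made up for the sketch), a user-by-category purchase preference could look like this:

```python
import pandas as pd

# Toy action log (made-up values); behavior_type 4 is treated as a purchase,
# matching the label construction used later in this post.
toy = pd.DataFrame({
    "user":          [0, 0, 0, 0, 1, 1],
    "category":      [10, 10, 11, 10, 11, 11],
    "behavior_type": [4, 4, 4, 1, 4, 1],
})

purchases = toy[toy.behavior_type == 4]

# Rows: users, columns: categories, values: number of purchases.
user_cat = purchases.groupby(["user", "category"]).size().unstack(fill_value=0)

# Share of each user's purchases that falls into each category.
user_cat_share = user_cat.div(user_cat.sum(axis=1), axis=0)
print(user_cat_share)
```

On the full data such a table would be very wide, so in practice it would probably be kept in long form (one row per user and category) and merged onto the samples.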

3.0 Saving the Features

saved_actions = actions
print(len(actions))
actions.head()

23291027

# actions = saved_actions  # restore actions

# User indices start at 0, so this prints the number of users minus one
print("largest user index: {}".format(actions.user.max()))

largest user index: 19999

# # Limit the data by user while extracting features, it is really too slow; remove this later
# actions = actions.set_index("user").loc[:10000, :]
# actions = actions.reset_index()
# print(actions.user.max())
# actions.head()

3.1 User Features

# Count each user's actions per behavior type
user = actions.groupby(['user', 'behavior_type'])[['item']].count().unstack().fillna(0).astype(int)
user.rename(columns={'item': 'c'}, level=0, inplace=True)
user.head()
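
The groupby / count / unstack chain builds a table with one row per user and one column per behavior type. A minimal sketch on made-up data (the toy frame and the name `counts` are illustrative) shows the same shape with `pd.crosstab`, which may be easier to read:

```python
import pandas as pd

# Toy action log (made-up values).
toy = pd.DataFrame({
    "user":          [0, 0, 0, 1, 1],
    "item":          [5, 5, 6, 5, 7],
    "behavior_type": [1, 1, 4, 1, 3],
})

# One row per user, one column per behavior type, cells hold action counts.
counts = pd.crosstab(toy["user"], toy["behavior_type"])
print(counts)
```

Unlike the chain above, crosstab returns plain single-level columns, and it only creates columns for behavior types that actually occur in the data.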

# Count the distinct items each user acted on, per behavior type
c = actions.drop_duplicates(['user', 'behavior_type', 'item']) \
    .groupby(['user', 'behavior_type'])[['item']].count().unstack().fillna(0).astype(int)
user = user.merge(c, left_index=True, right_index=True, how='left')
user.head()

# Count the distinct categories each user acted on, per behavior type
c = actions.drop_duplicates(['user', 'behavior_type', 'category']) \
    .groupby(['user', 'behavior_type'])[['category']].count().unstack().fillna(0).astype(int)
user = user.merge(c, left_index=True, right_index=True, how='left')
user.head()

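# Flatten the MultiIndex columns into u0..u11 (3 feature groups x 4 behavior types)
user = pd.DataFrame(user.values, index=user.index, columns=["u{}".format(i) for i in range(0, 12, 1)])
user.head()

# Scale every column by (mean + 3 * std), so values below that bound fall in [0, 1)
user = user / (user.mean() + user.std() * 3)
user.head()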

# user.to_csv("user.csv")
# del user

3.2 Item Features

# Count each item's actions per behavior type
good = actions.groupby(['item', 'behavior_type'])[['user']].count().unstack().fillna(0).astype(int)
good.rename(columns={'user': 'c'}, level=0, inplace=True)
good.head()

# Count the distinct users who acted on each item, per behavior type
c = actions.drop_duplicates(['user', 'behavior_type', 'item']) \
    .groupby(['item', 'behavior_type'])[['user']].count().unstack().fillna(0).astype(int)
good = good.merge(c, left_index=True, right_index=True, how='left')
good.head()

# Flatten the MultiIndex columns into g0..g7 (2 feature groups x 4 behavior types)
good = pd.DataFrame(good.values, index=good.index, columns=["g{}".format(i) for i in range(0, 8, 1)])
good.head()

# Same scaling as for the user features
good = good / (good.mean() + good.std() * 3)
good.head()

# good.to_csv("good.csv")
# del good

3.3 Category Features

# Count each category's actions per behavior type
cat = actions.groupby(['category', 'behavior_type'])[['user']].count().unstack().fillna(0).astype(int)
cat.rename(columns={'user': 'c'}, level=0, inplace=True)
cat.head()

# Count the distinct users who acted on each category, per behavior type
c = actions.drop_duplicates(['user', 'behavior_type', 'category']) \
    .groupby(['category', 'behavior_type'])[['user']].count().unstack().fillna(0).astype(int)
cat = cat.merge(c, left_index=True, right_index=True, how='left')
cat.head()

# Count the distinct items in each category, per behavior type
c = actions.drop_duplicates(['item', 'behavior_type', 'category']) \
    .groupby(['category', 'behavior_type'])[['item']].count().unstack().fillna(0).astype(int)
cat = cat.merge(c, left_index=True, right_index=True, how='left')
cat.head()

# Flatten the MultiIndex columns into c0..c11 (3 feature groups x 4 behavior types)
cat = pd.DataFrame(cat.values, index=cat.index, columns=["c{}".format(i) for i in range(0, 12, 1)])
cat.head()

# Same scaling as for the user features
cat = cat / (cat.mean() + cat.std() * 3)
cat.head()

# cat.to_csv('cat.csv')

# del cat
del c

3.4 Time Features

3.5 Geographic Features

3.6 Cross Features

3.7 Actions Within 24 Hours

def read_csv():
    return pd.read_csv("user.csv", index_col=0)
def read_good():
    return pd.read_csv("good.csv", index_col=0)
def read_cat():
    return pd.read_csv('cat.csv', index_col=0)
def read_label():
    return pd.read_csv("label.csv", index_col=0)

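# Label: whether the user buys the item on the following day
label = actions[actions.behavior_type == 4].copy()
# Shift each purchase back by one day so it aligns with the previous day's features
label.date = (pd.to_datetime(label.date) - np.timedelta64(1, 'D'))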
# label.date = label.date.dt.date
print(label.date.dtypes)
label['buy'] = 1
# label = label.loc[:, ['date', 'user','category','item','buy']].groupby(['date', 'user','category','item']).sum()
label = label.set_index(['date', 'user']).loc[:, ['item', 'category', 'buy']].drop_duplicates()
label.set_index(['category','item',], append=True, inplace=True)
label.head()
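
To make the one-day shift concrete, here is a minimal sketch on made-up data (the toy frame and the names `features`, `labels` and `joined` are illustrative). A view on December 1 followed by a purchase on December 2 should end up as a single December 1 row with buy = 1:

```python
import pandas as pd

# Toy log: user 0 views item 5 on Dec 1 and buys it on Dec 2 (made-up values).
toy = pd.DataFrame({
    "user": [0, 0],
    "item": [5, 5],
    "behavior_type": [1, 4],
    "date": pd.to_datetime(["2014-12-01", "2014-12-02"]),
})

features = toy[toy.behavior_type == 1][["date", "user", "item"]]

# Move each purchase back by one day so it lines up with the previous day's features.
labels = toy[toy.behavior_type == 4][["date", "user", "item"]].copy()
labels["date"] = labels["date"] - pd.Timedelta(days=1)
labels["buy"] = 1

joined = features.merge(labels, on=["date", "user", "item"], how="left")
joined["buy"] = joined["buy"].fillna(0).astype(int)
print(joined)
```

The left join is what turns "no purchase the next day" into buy = 0 for every feature row without a matching label.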

datetime64[ns]


# label.to_csv("label.csv")
# del label

# read_label().head()

# Count the user's actions on the day before the label date (one row per date, user, category, item)
d_action = actions.copy()
d_action['d']  = 1
d_action.date = pd.to_datetime(d_action.date)
# Select the counter column before summing so non-numeric columns are never aggregated
d_action = d_action.groupby([ 'date', 'user', 'category', 'item', 'behavior_type'])[['d']].sum()
d_action = d_action / (d_action.mean() + d_action.std() * 3)
d_action = d_action.unstack().fillna(0).astype(np.float32)
d_action.columns = d_action.columns.droplevel(0)
d_action.columns = ['d_t{}'.format(i) for i in range(1, 5, 1)]
d_action.head()

# d_action.to_csv('d_action.csv')

# pd.read_csv('d_action.csv', index_col=0).dtypes

# A user's actions in the last 3 hours of each day
x_action = actions.copy()
x_action['c']  = 1
# Too much data, so only the last 3 hours of each day are considered
x_action = x_action.loc[x_action.hour.isin([23, 22, 21])]
x_action.date = pd.to_datetime(x_action.date)
# Select the counter column before summing so non-numeric columns are never aggregated
x_action = x_action.groupby([ 'date', 'user', 'category', 'item', 'hour', 'behavior_type'])[['c']].sum()
x_action = x_action.unstack()
x_action = x_action / (x_action.mean() + x_action.std() * 3)
x_action = x_action.stack().astype(np.float32)
x_action.head()

x_action = x_action.unstack(['hour', 'behavior_type'], fill_value=0).sort_index(axis=1)
x_action.columns = x_action.columns.droplevel(0)
# print(x_action.describe())
# Reindex against the full hour x behavior_type grid so the frame always has
# 12 columns (3 hours x 4 behavior types), even if some combination never occurs
x_action = x_action.reindex(columns=pd.MultiIndex.from_product([range(21, 24, 1), range(1, 5, 1)], names=['hour', 'behavior_type']), fill_value=0)
x_action = x_action.fillna(0)
# Flatten the columns to h<hour>_<behavior_type>, in the same hour-major order
x_action = pd.DataFrame(x_action.values, index=x_action.index, columns = ["h{}_{}".format(h, t) for h in range(21, 24, 1) for t in [1, 2, 3, 4]])
# print(x_action.describe())
x_action.head()

x_action = d_action.merge(x_action, left_index=True, right_index=True, how='left')
x_action.fillna(0, inplace=True)
x_action.head()

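# Merge features (x) and labels (y); how='left' drops purchases of items
# the user did not act on during the previous day
# Of course this also drops the case where I looked at n cheap items and then bought a best-seller from the same category
# TODO: find a way to handle this later
x_action = x_action.merge(label, left_index=True, right_index=True, how='left')
x_action.fillna(0, inplace=True)
x_action.buy = x_action.buy.astype(np.int8)
x_action.head()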

# Attach the user, item, and category features that belong to each row
x_action.reset_index(inplace=True)
x_action = user.merge(x_action, right_on='user', left_index=True, how='right')
x_action = good.merge(x_action, right_on='item', left_index=True, how='right')
x_action = cat.merge(x_action, right_on='category', left_index=True, how='right')
x_action.set_index(['date', 'user', 'category', 'item'], inplace=True)
x_action.head()

#x_action.to_csv("x_action.csv") # the data is too large and writing is very slow; how can this be solved?
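
One possible answer to the slow CSV write, as a sketch rather than a measured result: a binary format such as pickle skips the conversion of every number to text, and it usually writes faster and keeps dtypes intact. The toy frame and the temporary path below are made up for illustration:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy frame standing in for x_action (made-up values).
frame = pd.DataFrame(np.random.rand(1000, 5).astype(np.float32),
                     columns=["f{}".format(i) for i in range(5)])

path = os.path.join(tempfile.mkdtemp(), "x_action.pkl")
frame.to_pickle(path)             # binary dump, no text formatting of each number
restored = pd.read_pickle(path)   # dtypes and index come back unchanged
print(restored.dtypes.unique(), restored.equals(frame))
```

Pickle files should only be loaded from sources you trust, and they are tied to Python, so CSV is still the right choice for the final submission file.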

#pd.read_csv("x_action.csv").head()

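3.8 Possible Improvements

3.8.1 Consider adding a noise layer later; otherwise a user who viewed an item once and bought it once may be memorized by the network as a guaranteed buyer

3.8.2 How to train by group: after all, a user usually looks at one kind of item and then picks one of them to buy

3.8.3 Is the current "normalization" reasonable, is there a better or more general way to preprocess the data, or would plain standardization work better?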
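
To make the question in 3.8.3 easier to reason about, here is a small sketch on made-up counts that puts the scaling used in this post (dividing by mean + 3 * std) next to two common alternatives. The toy Series and the names `post_scaled`, `z_scored` and `logged` are illustrative, and which option works best for this data set is an open question:

```python
import numpy as np
import pandas as pd

# Toy count feature with one heavy outlier (made-up values).
counts = pd.Series([0, 1, 2, 3, 4, 200], dtype=float)

# Scaling used in this post: values above mean + 3 * std map to more than 1,
# everything else lands in [0, 1).
post_scaled = counts / (counts.mean() + 3 * counts.std())

# Z-score standardization: zero mean, unit standard deviation.
z_scored = (counts - counts.mean()) / counts.std()

# log1p compresses the long tail of count data before any further scaling.
logged = np.log1p(counts)

print(pd.DataFrame({"post": post_scaled, "z": z_scored, "log1p": logged}))
```

With heavy-tailed counts, the first two options leave most values squeezed near the bottom of the range, while log1p spreads them out; that may matter for a network fed with such features.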

from keras.layers import Dense, LSTM, Dropout, Input
from keras.models import Model
from sklearn.model_selection import train_test_split

# Take features and labels from the same slice so their rows always line up
train = x_action.loc[:'2014-12-18']
x_train, x_test, y_train, y_test = train_test_split(train.values[:, :-1], train.values[:, -1], test_size=0.1)
x_train, y_train

(array([[0.03628967, 0.03287605, 0.04344553, ..., 0.        , 0.        ,
         0.        ],
        [2.19627494, 2.48323743, 2.35556257, ..., 0.        , 0.        ,
         0.        ],
        [0.12262843, 0.05917688, 0.07467201, ..., 0.        , 0.        ,
         0.        ],
        ...,
        [0.08175772, 0.04602647, 0.16020541, ..., 0.        , 0.        ,
         0.        ],
        [9.42243503, 9.34921724, 9.17379612, ..., 0.        , 0.        ,
         0.        ],
        [2.09024259, 1.75996439, 2.44856316, ..., 0.        , 0.        ,
         0.        ]]), array([0., 0., 0., ..., 0., 0., 0.]))

x_train.shape # (17482, 128) for test, (9014805, 48) for all

(9014805, 48)

inputs = Input(shape=(x_train.shape[1], ))
x = Dense(256)(inputs)
x = Dropout(0.2)(x)
x = Dense(128)(x)
x = Dropout(0.2)(x)
outputs = Dense(1, activation='sigmoid')(x)

model = Model(inputs=inputs, outputs=outputs)
model.compile(loss='binary_crossentropy', optimizer='rmsprop',metrics=['accuracy'])

history = model.fit(x_train, y_train, batch_size=64, epochs=50, validation_data=(x_test, y_test))
# Result with 10000 users:
# Epoch 11/50
# 4065324/4065324 [==============================] - 364s 90us/step - loss: 0.0219 - acc: 0.9963 - val_loss: 0.0196 - val_acc: 0.9968

Train on 9014805 samples, validate on 1001646 samples
Epoch 1/50
9014805/9014805 [==============================] - 796s 88us/step - loss: 0.0271 - acc: 0.9960 - val_loss: 0.0293 - val_acc: 0.9959
Epoch 2/50
9014805/9014805 [==============================] - 787s 87us/step - loss: 0.0271 - acc: 0.9960 - val_loss: 0.0215 - val_acc: 0.9968
Epoch 3/50
9014805/9014805 [==============================] - 787s 87us/step - loss: 0.0270 - acc: 0.9960 - val_loss: 0.0208 - val_acc: 0.9968
Epoch 4/50
9014805/9014805 [==============================] - 786s 87us/step - loss: 0.0270 - acc: 0.9960 - val_loss: 0.0253 - val_acc: 0.9966
Epoch 5/50
9014805/9014805 [==============================] - 787s 87us/step - loss: 0.0271 - acc: 0.9960 - val_loss: 0.0249 - val_acc: 0.9961
Epoch 6/50
9014805/9014805 [==============================] - 785s 87us/step - loss: 0.0271 - acc: 0.9960 - val_loss: 0.0258 - val_acc: 0.9966
Epoch 7/50
9014805/9014805 [==============================] - 786s 87us/step - loss: 0.0270 - acc: 0.9960 - val_loss: 0.0235 - val_acc: 0.9969
Epoch 8/50
9014805/9014805 [==============================] - 783s 87us/step - loss: 0.0272 - acc: 0.9960 - val_loss: 0.0216 - val_acc: 0.9968
Epoch 9/50
9014805/9014805 [==============================] - 785s 87us/step - loss: 0.0271 - acc: 0.9960 - val_loss: 0.0242 - val_acc: 0.9964
Epoch 10/50
9014805/9014805 [==============================] - 786s 87us/step - loss: 0.0273 - acc: 0.9960 - val_loss: 0.0229 - val_acc: 0.9969
Epoch 11/50
9014805/9014805 [==============================] - 783s 87us/step - loss: 0.0271 - acc: 0.9960 - val_loss: 0.0234 - val_acc: 0.9967
Epoch 12/50
9014805/9014805 [==============================] - 782s 87us/step - loss: 0.0271 - acc: 0.9960 - val_loss: 0.0230 - val_acc: 0.9968
Epoch 13/50
9014805/9014805 [==============================] - 781s 87us/step - loss: 0.0271 - acc: 0.9960 - val_loss: 0.0236 - val_acc: 0.9969
Epoch 14/50
9014805/9014805 [==============================] - 781s 87us/step - loss: 0.0272 - acc: 0.9960 - val_loss: 0.0229 - val_acc: 0.9967
Epoch 15/50
9014805/9014805 [==============================] - 782s 87us/step - loss: 0.0271 - acc: 0.9960 - val_loss: 0.0249 - val_acc: 0.9961
Epoch 16/50
5038976/9014805 [===============>..............] - ETA: 5:37 - loss: 0.0271 - acc: 0.9960

............................
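
The accuracy in the log stays at about 0.996 from the first epoch on. If purchases are rare among the samples (an assumption here, since the class balance is not printed above), a model that never predicts a purchase would reach a similar accuracy, which is why accuracy says little and F1 is the more useful metric. A minimal sketch with a made-up proportion of positives:

```python
import numpy as np

# Toy labels: 4 purchases among 1000 samples (made-up proportion).
y_true = np.zeros(1000, dtype=int)
y_true[:4] = 1

# A "model" that never predicts a purchase.
y_pred = np.zeros_like(y_true)

accuracy = (y_true == y_pred).mean()

tp = np.sum((y_true == 1) & (y_pred == 1))
fp = np.sum((y_true == 0) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))
# F1 = 2TP / (2TP + FP + FN); defined as 0 here when there are no true positives.
f1 = 2 * tp / (2 * tp + fp + fn) if tp > 0 else 0.0

print("accuracy:", accuracy, "F1:", f1)
```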

y_predict = model.predict(x_test)
y_predict

array([[1.0223342e-03],
       [5.4835586e-10],
       [1.0230833e-03],
       ...,
       [5.3039176e-04],
       [1.6522235e-03],
       [1.2059750e-03]], dtype=float32)

import matplotlib.pyplot as plt
# list all data in history
print(history.history.keys())
# summarize history for accuracy
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

dict_keys(['val_loss', 'val_acc', 'loss', 'acc'])


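# Find the threshold with the highest F1 score
def get_f1_by(true, predict, n):
    # Use the predictions passed in, not the global y_predict
    predict = predict.reshape(-1)
    return f1_score(true, np.where(predict >= n, np.ones_like(predict), np.zeros_like(predict)))

def cal_f1_score(true, predict):
    # `area` is the global Series of candidate thresholds defined below
    result = area.apply(lambda i : get_f1_by(true, predict, i))
    return result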

area = pd.Series(np.arange(1e-9, 0.9, 0.05))
result = cal_f1_score(y_test.reshape(-1), y_predict.reshape(-1))
print('when n =', area[result.idxmax()], 'best result is', result.max())
plt.scatter(area, result)
plt.show()
#0.043000000000000024 0.15744941753525443

when n = 0.050000001 best result is 0.061109622085231845


area = pd.Series(np.arange(1e-9, 0.1, 0.001))
result = cal_f1_score(y_test.reshape(-1), y_predict.reshape(-1))
print('when n =', area[result.idxmax()], 'best result is', result.max())
plt.scatter(area, result)
plt.show()
#0.043000000000000024 0.15744941753525443

when n = 0.016000001 best result is 0.19509536784741144
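
The two scans above follow a coarse-then-fine pattern: try a handful of thresholds, then zoom in where F1 peaks. As a self-contained sketch of what one scan does (the labels, scores and thresholds are made up, and the helper `f1_at` is a hypothetical stand-in for sklearn's f1_score):

```python
import numpy as np

def f1_at(y_true, scores, threshold):
    """F1 of the rule `score >= threshold`; 0.0 when there is no true positive."""
    pred = scores >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    return 2 * tp / (2 * tp + fp + fn) if tp > 0 else 0.0

# Toy labels and predicted scores (made-up values).
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.20, 0.30])

thresholds = np.array([0.015, 0.045, 0.10, 0.25])
f1s = np.array([f1_at(y_true, scores, t) for t in thresholds])
best = thresholds[f1s.argmax()]
print("best threshold:", best, "F1:", f1s.max())
```

A low threshold catches every purchase but adds many false positives, a high one misses purchases, and F1 peaks somewhere in between; with rare positives that peak tends to sit far below 0.5.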

# This scan shows that a threshold of about 0.016 gives the highest F1; we use it to predict the result for the last day

y_predict = model.predict(x_action.loc['2014-12-18'].iloc[:, :-1])

y_predict = np.where(y_predict > 0.016, np.ones_like(y_predict), np.zeros_like(y_predict))

len(y_predict[y_predict==1])

result = x_action.loc['2014-12-18'].copy()
result.buy = y_predict.reshape(-1)  # flatten the (n, 1) predictions to one value per row
result = result.loc[result.buy > 0, 'buy'].reset_index().loc[:, ['user', 'item']]
result.head()

user_index.head()

item_ids.head()

# Map the dense indices back to the original user_id / item_id
result = result.merge(item_ids.reset_index(), on='item', how='left') \
    .merge(user_index.reset_index(), on='user', how='left').loc[:, ['user_id', 'item_id']]
result.head()

# index=False keeps the row index out of the output file
result.to_csv('tianchi_mobile_recommendation_predict.csv', index=False)

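At this point we have a prediction file. It has been submitted to Tianchi, but the score is not available yet; I will update this post later.

My original plan was to build the network with LSTM and embeddings. I first built a network the "traditional" way to serve as a baseline for the later ones, and even that took a very long time. My fundamentals are still lacking, so I plan to learn while practicing. If you know a better approach, or see something I did badly, please do not hesitate to point it out!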
