Through the recall step we have already reduced the scale of the problem: for each user, N articles were selected as the candidate set, and on top of that candidate set we built features related to the user's history, the user's own attribute features, the article's own attribute features, and user-article interaction features. The next step is to have a machine learning model learn from these constructed features, predict on the test set to obtain the click probability of each candidate article for each user, and return the top-k articles with the highest click probability as the final result.
Three fairly representative ranking models were chosen for the ranking stage.
After obtaining the outputs of the final ranking models, two classic model-ensembling methods were also applied.
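The ensembling code itself is not shown in this section, but as an illustration, one classic scheme is a weighted average of each model's min-max-normalized prediction scores. A minimal sketch, where `ensemble_rank_scores` is a hypothetical helper and each csv follows the `user_id, click_article_id, pred_score` layout that the ranking code below writes out (e.g. `din_rank_score.csv`):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def ensemble_rank_scores(score_files, weights):
    """Weighted average of per-model prediction scores (hypothetical helper)."""
    merged = None
    for i, path in enumerate(score_files):
        df = pd.read_csv(path)
        # Normalize each model's scores to [0, 1] so they are comparable across models
        df['pred_score'] = MinMaxScaler().fit_transform(df[['pred_score']])
        df = df.rename(columns={'pred_score': f'score_{i}'})
        merged = df if merged is None else merged.merge(df, on=['user_id', 'click_article_id'])
    # Blend the normalized scores with the given weights
    merged['pred_score'] = sum(w * merged[f'score_{i}'] for i, w in enumerate(weights))
    return merged[['user_id', 'click_article_id', 'pred_score']]

# Usage: equal-weight blend of two models' score files
# (lgb_rank_score.csv is hypothetical; din_rank_score.csv is written further below)
# blended = ensemble_rank_scores(['./temp_results/lgb_rank_score.csv',
#                                 './temp_results/din_rank_score.csv'], [0.5, 0.5])
```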
import numpy as np
import pandas as pd
import pickle
from tqdm import tqdm
import gc, os
import time
from datetime import datetime
import lightgbm as lgb
from sklearn.preprocessing import MinMaxScaler
import warnings
warnings.filterwarnings('ignore')
data_path = '../data/'
save_path = './temp_results/'
offline = False
# On re-reading the data, click_article_id comes back as a float, so cast it to int
trn_user_item_feats_df_din_model = pd.read_csv(save_path + 'trn_user_item_feats_df.csv')
trn_user_item_feats_df_din_model['click_article_id'] = trn_user_item_feats_df_din_model['click_article_id'].astype(int)
if offline:
    val_user_item_feats_df_din_model = pd.read_csv(save_path + 'val_user_item_feats_df.csv')
    val_user_item_feats_df_din_model['click_article_id'] = val_user_item_feats_df_din_model['click_article_id'].astype(int)
else:
    val_user_item_feats_df_din_model = None
tst_user_item_feats_df_din_model = pd.read_csv(save_path + 'tst_user_item_feats_df.csv')
tst_user_item_feats_df_din_model['click_article_id'] = tst_user_item_feats_df_din_model['click_article_id'].astype(int)
# During feature engineering the test set was given a dummy label for convenience; just drop it here
del tst_user_item_feats_df_din_model['label']
The following prepares the user history sequences that the DIN model needs later.
if offline:
    all_data = pd.read_csv('./data/train_click_log.csv')
else:
    trn_data = pd.read_csv('./data/train_click_log.csv')
    tst_data = pd.read_csv('./data/testA_click_log.csv')
    all_data = pd.concat([trn_data, tst_data])
# Build each user's historical click sequence
hist_click = all_data[['user_id', 'click_article_id']].groupby('user_id').agg(list).reset_index()
his_behavior_df = pd.DataFrame()
his_behavior_df['user_id'] = hist_click['user_id']
his_behavior_df['hist_click_article_id'] = hist_click['click_article_id']
# Join the user's historical click-article sequence onto the feature frames
trn_user_item_feats_df_din_model = trn_user_item_feats_df_din_model.merge(his_behavior_df, on='user_id')

if offline:
    val_user_item_feats_df_din_model = val_user_item_feats_df_din_model.merge(his_behavior_df, on='user_id')
else:
    val_user_item_feats_df_din_model = None

tst_user_item_feats_df_din_model = tst_user_item_feats_df_din_model.merge(his_behavior_df, on='user_id')
trn_user_item_feats_df_din_model.head()
| | user_id | click_article_id | sim0 | time_diff0 | word_diff0 | sim_max | sim_min | sim_sum | sim_mean | score | ... | click_region | click_referrer_type | user_time_hob1 | user_time_hob2 | word_hbo | category_id | created_at_ts | words_count | is_cat_hab | hist_click_article_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 191890 | 0.215748 | 1603305000 | 80 | 0.215748 | 0.215748 | 0.215748 | 0.215748 | 0.996221 | ... | 25 | 2 | 0.343715 | 0.992865 | 266.0 | 309 | 1506581786000 | 242 | 0 | [30760, 157507] |
| 1 | 11 | 191890 | 0.068161 | 1600533000 | 54 | 0.068161 | 0.068161 | 0.068161 | 0.068161 | 0.996075 | ... | 25 | 2 | 0.343551 | 0.992781 | 200.0 | 309 | 1506581786000 | 242 | 0 | [50644, 234481] |
| 2 | 31 | 191890 | -0.023481 | 1595980000 | 28 | -0.023481 | -0.023481 | -0.023481 | -0.023481 | 0.995851 | ... | 25 | 1 | 0.343456 | 0.992715 | 218.0 | 309 | 1506581786000 | 242 | 0 | [156279, 161526] |
| 3 | 86 | 191890 | 0.263226 | 1599786000 | 30 | 0.263226 | 0.263226 | 0.263226 | 0.263226 | 0.996370 | ... | 25 | 2 | 0.343011 | 0.992780 | 213.5 | 309 | 1506581786000 | 242 | 0 | [234481, 16346] |
| 4 | 94 | 191890 | -0.002702 | 1605934000 | 2 | -0.002702 | -0.002702 | -0.002702 | -0.002702 | 0.995544 | ... | 25 | 2 | 0.342910 | 0.992766 | 244.5 | 309 | 1506581786000 | 242 | 0 | [211442, 48074] |
5 rows × 29 columns
# Count how many articles each user has clicked historically
trn_user_item_feats_df_din_model["seq_length"] = trn_user_item_feats_df_din_model.hist_click_article_id.apply(len)
if val_user_item_feats_df_din_model is not None:
    val_user_item_feats_df_din_model["seq_length"] = val_user_item_feats_df_din_model.hist_click_article_id.apply(len)
tst_user_item_feats_df_din_model["seq_length"] = tst_user_item_feats_df_din_model.hist_click_article_id.apply(len)
trn_user_item_feats_df_din_model.head()
| | user_id | click_article_id | sim0 | time_diff0 | word_diff0 | sim_max | sim_min | sim_sum | sim_mean | score | ... | click_referrer_type | user_time_hob1 | user_time_hob2 | word_hbo | category_id | created_at_ts | words_count | is_cat_hab | hist_click_article_id | seq_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 191890 | 0.215748 | 1603305000 | 80 | 0.215748 | 0.215748 | 0.215748 | 0.215748 | 0.996221 | ... | 2 | 0.343715 | 0.992865 | 266.0 | 309 | 1506581786000 | 242 | 0 | [30760, 157507] | 2 |
| 1 | 11 | 191890 | 0.068161 | 1600533000 | 54 | 0.068161 | 0.068161 | 0.068161 | 0.068161 | 0.996075 | ... | 2 | 0.343551 | 0.992781 | 200.0 | 309 | 1506581786000 | 242 | 0 | [50644, 234481] | 2 |
| 2 | 31 | 191890 | -0.023481 | 1595980000 | 28 | -0.023481 | -0.023481 | -0.023481 | -0.023481 | 0.995851 | ... | 1 | 0.343456 | 0.992715 | 218.0 | 309 | 1506581786000 | 242 | 0 | [156279, 161526] | 2 |
| 3 | 86 | 191890 | 0.263226 | 1599786000 | 30 | 0.263226 | 0.263226 | 0.263226 | 0.263226 | 0.996370 | ... | 2 | 0.343011 | 0.992780 | 213.5 | 309 | 1506581786000 | 242 | 0 | [234481, 16346] | 2 |
| 4 | 94 | 191890 | -0.002702 | 1605934000 | 2 | -0.002702 | -0.002702 | -0.002702 | -0.002702 | 0.995544 | ... | 2 | 0.342910 | 0.992766 | 244.5 | 309 | 1506581786000 | 242 | 0 | [211442, 48074] | 2 |
5 rows × 30 columns
Next we try the DIN model. DIN (Deep Interest Network) is a model Alibaba proposed in 2018 to address the inability of earlier deep models to express the diversity of user interests: it computes a representation vector of user interest by considering the relevance between the given candidate ad and the user's historical behavior. Concretely, it introduces a local activation unit that soft-searches the relevant parts of the behavior history to attend to the relevant user interests, and takes a weighted sum to obtain a representation of the user's interest with respect to the candidate ad. Behaviors that are more relevant to the candidate receive higher activation weights and dominate the interest representation, so the representation differs across candidates, which greatly improves the expressive power of the model (the architecture figure can be found in the DIN paper). This makes DIN a good fit for the news recommendation task as well: here we compute a user's interest in an article from the relevance between the current candidate article and the user's historically clicked articles.
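In symbols, the weighted-sum pooling described above (Eq. 3 in the DIN paper) produces the candidate-dependent interest representation

$$v_U(A) = f(v_A, e_1, e_2, \ldots, e_H) = \sum_{j=1}^{H} a(e_j, v_A)\, e_j,$$

where $e_1, \ldots, e_H$ are the embedding vectors of the user's $H$ historical behaviors, $v_A$ is the embedding of candidate $A$, and $a(\cdot, \cdot)$ is the local activation unit whose output is the activation weight.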
Here we simply call a packaged implementation of the model; the finer details of the model will be covered in the next round of the recommendation-system study group. Below is how to use it concretely. The deepctr function prototype is:
def DIN(dnn_feature_columns, history_feature_list, dnn_use_bn=False,
        dnn_hidden_units=(200, 80), dnn_activation='relu', att_hidden_size=(80, 40), att_activation="dice",
        att_weight_normalization=False, l2_reg_dnn=0, l2_reg_embedding=1e-6, dnn_dropout=0, seed=1024,
        task='binary'):
- dnn_feature_columns: the feature columns, a list containing all features of the data
- history_feature_list: the list of features that reflect the user's historical behavior
- dnn_use_bn: whether to use BatchNormalization
- dnn_hidden_units: the number of layers and units per layer of the fully connected network, as a list or tuple
- dnn_activation: activation function of the fully connected network
- att_hidden_size: the number of layers and units per layer of the attention network's fully connected layers
- att_activation: activation function of the attention layers
- att_weight_normalization: whether to normalize the attention scores
- l2_reg_dnn: L2 regularization coefficient of the fully connected network
- l2_reg_embedding: L2 regularization coefficient of the embedding vectors
- dnn_dropout: dropout probability of the fully connected network's units
- task: the task, which can be classification ('binary') or regression
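Putting these parameters together, here is a minimal, illustrative sketch of calling DIN (assuming deepctr 0.8.x; the real feature construction for this project is done by the `get_din_feats_columns` function further below):

```python
from deepctr.models import DIN
from deepctr.feature_column import SparseFeat, VarLenSparseFeat, DenseFeat

# One candidate id, one dense feature, and a variable-length click history
# that shares its embedding table with the candidate id.
feature_columns = [
    SparseFeat('click_article_id', vocabulary_size=364047, embedding_dim=32),
    DenseFeat('sim0', 1),
    VarLenSparseFeat(SparseFeat('hist_click_article_id', vocabulary_size=364047,
                                embedding_dim=32, embedding_name='click_article_id'),
                     maxlen=50, length_name='seq_length'),
]

# history_feature_list names the candidate-side features that the attention unit
# matches against their 'hist_'-prefixed sequence counterparts.
model = DIN(feature_columns, history_feature_list=['click_article_id'])
model.compile('adam', 'binary_crossentropy')
```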
In actual use, we must pass in the feature columns and the history-behavior feature list, but before passing them in, the feature columns need some preprocessing.
The code below shows this concretely. The logic is: first write a data-preparation function that builds the model inputs and the feature columns; then build the DIN model and train it; finally run prediction on the test set with the trained model.
# Import deepctr and the Keras utilities used below
from deepctr.models import DIN
from deepctr.feature_column import SparseFeat, VarLenSparseFeat, DenseFeat, get_feature_names
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import backend as K
from tensorflow.keras.layers import *
from tensorflow.keras.models import *
from tensorflow.keras.callbacks import *
import tensorflow as tf
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
### Build the model inputs
# Data preparation function
def get_din_feats_columns(df, dense_fea, sparse_fea, behavior_fea, his_behavior_fea, emb_dim=32, max_len=100):
    """
    Prepare the DIN inputs and feature columns:
    df: the dataset
    dense_fea: numerical feature columns
    sparse_fea: categorical feature columns
    behavior_fea: the user's candidate-behavior feature columns
    his_behavior_fea: the user's historical-behavior feature columns
    emb_dim: embedding dimension; for simplicity all categorical features share the same dimension
    max_len: maximum length of the user history sequence
    """
    sparse_feature_columns = []
    for feat in sparse_fea:
        if feat != "click_article_id":
            sparse_feature_columns.append(SparseFeat(feat, vocabulary_size=df[feat].max() + 1, embedding_dim=emb_dim))
        else:
            # the article-id vocabulary is fixed by the full article pool, not by this df
            sparse_feature_columns.append(SparseFeat(feat, vocabulary_size=364046 + 1, embedding_dim=emb_dim))

    dense_feature_columns = [DenseFeat(feat, 1) for feat in dense_fea]

    # the variable-length history shares its embedding table with click_article_id
    var_feature_columns = [VarLenSparseFeat(SparseFeat(feat, vocabulary_size=364046 + 1,
                                                       embedding_dim=emb_dim, embedding_name='click_article_id'),
                                            maxlen=max_len, length_name="seq_length")
                           for feat in his_behavior_fea]

    dnn_feature_columns = sparse_feature_columns + dense_feature_columns + var_feature_columns

    # Build x, a dict mapping feature names to model inputs
    x = {}
    for name in get_feature_names(dnn_feature_columns):
        if name in his_behavior_fea:
            # historical behavior sequence, padded to max_len (a 2-D array)
            his_list = [l for l in df[name]]
            x[name] = pad_sequences(his_list, maxlen=max_len, padding='post')
        else:
            x[name] = df[name].values

    return x, dnn_feature_columns
# Split the features into groups
sparse_fea = ['user_id', 'click_article_id', 'category_id', 'click_environment', 'click_deviceGroup',
'click_os', 'click_country', 'click_region', 'click_referrer_type', 'is_cat_hab']
behavior_fea = ['click_article_id']
hist_behavior_fea = ['hist_click_article_id']
dense_fea = ['sim0', 'time_diff0', 'word_diff0', 'sim_max', 'sim_min', 'sim_sum', 'sim_mean', 'score',
'rank','click_size','time_diff_mean','active_level','user_time_hob1','user_time_hob2',
'word_hbo','words_count']
# Normalize the dense features: neural-network training needs normalized numeric inputs
mm = MinMaxScaler()

# Special handling: if invalid values (inf / -inf) slipped in during feature engineering,
# normalization will fail. You can comment out these two lines at first; if the code below
# then errors, go back and fix the source of the inf values rather than just masking them.
trn_user_item_feats_df_din_model.replace([np.inf, -np.inf], 0, inplace=True)
tst_user_item_feats_df_din_model.replace([np.inf, -np.inf], 0, inplace=True)
for feat in dense_fea:
    trn_user_item_feats_df_din_model[feat] = mm.fit_transform(trn_user_item_feats_df_din_model[[feat]])
    if val_user_item_feats_df_din_model is not None:
        val_user_item_feats_df_din_model[feat] = mm.fit_transform(val_user_item_feats_df_din_model[[feat]])
    tst_user_item_feats_df_din_model[feat] = mm.fit_transform(tst_user_item_feats_df_din_model[[feat]])
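Note that the loop above fits a separate scaler on each split, so train and test values end up on slightly different scales. A common alternative, sketched below (not what this notebook does), is to fit the scaler on the training split only and reuse it for validation and test:

```python
# Fit on train only, then apply the same transform to val/test, so all splits
# share one scale and no test-set statistics leak into the scaler.
for feat in dense_fea:
    scaler = MinMaxScaler()
    trn_user_item_feats_df_din_model[feat] = scaler.fit_transform(trn_user_item_feats_df_din_model[[feat]])
    if val_user_item_feats_df_din_model is not None:
        val_user_item_feats_df_din_model[feat] = scaler.transform(val_user_item_feats_df_din_model[[feat]])
    tst_user_item_feats_df_din_model[feat] = scaler.transform(tst_user_item_feats_df_din_model[[feat]])
```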
# Prepare the training data
x_trn, dnn_feature_columns = get_din_feats_columns(trn_user_item_feats_df_din_model, dense_fea,
                                                   sparse_fea, behavior_fea, hist_behavior_fea, max_len=50)
y_trn = trn_user_item_feats_df_din_model['label'].values

if offline:
    # Prepare the validation data
    x_val, dnn_feature_columns = get_din_feats_columns(val_user_item_feats_df_din_model, dense_fea,
                                                       sparse_fea, behavior_fea, hist_behavior_fea, max_len=50)
    y_val = val_user_item_feats_df_din_model['label'].values

dense_fea = [x for x in dense_fea if x != 'label']
x_tst, dnn_feature_columns = get_din_feats_columns(tst_user_item_feats_df_din_model, dense_fea,
                                                   sparse_fea, behavior_fea, hist_behavior_fea, max_len=50)
trn_user_item_feats_df_din_model[sparse_fea].describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| user_id | 180286.0 | 99957.159336 | 57758.499264 | 0.0 | 49932.25 | 99944.5 | 149992.75 | 199999.0 |
| click_article_id | 180286.0 | 195545.750319 | 108824.289772 | 12185.0 | 74494.00 | 191890.0 | 307279.00 | 360826.0 |
| category_id | 180286.0 | 297.700831 | 126.888484 | 7.0 | 141.00 | 309.0 | 430.00 | 455.0 |
| click_environment | 180286.0 | 3.939596 | 0.356480 | 1.0 | 4.00 | 4.0 | 4.00 | 4.0 |
| click_deviceGroup | 180286.0 | 1.936002 | 1.051360 | 1.0 | 1.00 | 1.0 | 3.00 | 5.0 |
| click_os | 180286.0 | 12.296756 | 7.320290 | 2.0 | 2.00 | 17.0 | 17.00 | 20.0 |
| click_country | 180286.0 | 1.300828 | 1.589765 | 1.0 | 1.00 | 1.0 | 1.00 | 11.0 |
| click_region | 180286.0 | 18.182599 | 7.093410 | 1.0 | 13.00 | 21.0 | 25.00 | 28.0 |
| click_referrer_type | 180286.0 | 2.069101 | 1.293896 | 1.0 | 1.00 | 2.0 | 2.00 | 7.0 |
| is_cat_hab | 180286.0 | 0.000000 | 0.000000 | 0.0 | 0.00 | 0.0 | 0.00 | 0.0 |
# Build the model
model = DIN(dnn_feature_columns, behavior_fea)
# Inspect the model structure
model.summary()
Model: "functional_1"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
user_id (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
click_article_id (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
category_id (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
click_environment (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
click_deviceGroup (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
click_os (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
click_country (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
click_region (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
click_referrer_type (InputLayer [(None, 1)] 0
__________________________________________________________________________________________________
is_cat_hab (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
sparse_emb_user_id (Embedding) (None, 1, 32) 8000000 user_id[0][0]
__________________________________________________________________________________________________
sparse_seq_emb_hist_click_artic multiple 11649504 click_article_id[0][0]
hist_click_article_id[0][0]
click_article_id[0][0]
__________________________________________________________________________________________________
sparse_emb_category_id (Embeddi (None, 1, 32) 14592 category_id[0][0]
__________________________________________________________________________________________________
sparse_emb_click_environment (E (None, 1, 32) 160 click_environment[0][0]
__________________________________________________________________________________________________
sparse_emb_click_deviceGroup (E (None, 1, 32) 192 click_deviceGroup[0][0]
__________________________________________________________________________________________________
sparse_emb_click_os (Embedding) (None, 1, 32) 672 click_os[0][0]
__________________________________________________________________________________________________
sparse_emb_click_country (Embed (None, 1, 32) 384 click_country[0][0]
__________________________________________________________________________________________________
sparse_emb_click_region (Embedd (None, 1, 32) 928 click_region[0][0]
__________________________________________________________________________________________________
sparse_emb_click_referrer_type (None, 1, 32) 256 click_referrer_type[0][0]
__________________________________________________________________________________________________
sparse_emb_is_cat_hab (Embeddin (None, 1, 32) 32 is_cat_hab[0][0]
__________________________________________________________________________________________________
no_mask (NoMask) (None, 1, 32) 0 sparse_emb_user_id[0][0]
sparse_seq_emb_hist_click_article
sparse_emb_category_id[0][0]
sparse_emb_click_environment[0][0
sparse_emb_click_deviceGroup[0][0
sparse_emb_click_os[0][0]
sparse_emb_click_country[0][0]
sparse_emb_click_region[0][0]
sparse_emb_click_referrer_type[0]
sparse_emb_is_cat_hab[0][0]
__________________________________________________________________________________________________
hist_click_article_id (InputLay [(None, 50)] 0
__________________________________________________________________________________________________
concatenate (Concatenate) (None, 1, 320) 0 no_mask[0][0]
no_mask[1][0]
no_mask[2][0]
no_mask[3][0]
no_mask[4][0]
no_mask[5][0]
no_mask[6][0]
no_mask[7][0]
no_mask[8][0]
no_mask[9][0]
__________________________________________________________________________________________________
no_mask_1 (NoMask) (None, 1, 320) 0 concatenate[0][0]
__________________________________________________________________________________________________
attention_sequence_pooling_laye (None, 1, 32) 13961 sparse_seq_emb_hist_click_article
sparse_seq_emb_hist_click_article
__________________________________________________________________________________________________
concatenate_1 (Concatenate) (None, 1, 352) 0 no_mask_1[0][0]
attention_sequence_pooling_layer[
__________________________________________________________________________________________________
sim0 (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
time_diff0 (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
word_diff0 (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
sim_max (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
sim_min (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
sim_sum (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
sim_mean (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
score (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
rank (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
click_size (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
time_diff_mean (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
active_level (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
user_time_hob1 (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
user_time_hob2 (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
word_hbo (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
words_count (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
flatten (Flatten) (None, 352) 0 concatenate_1[0][0]
__________________________________________________________________________________________________
no_mask_3 (NoMask) (None, 1) 0 sim0[0][0]
time_diff0[0][0]
word_diff0[0][0]
sim_max[0][0]
sim_min[0][0]
sim_sum[0][0]
sim_mean[0][0]
score[0][0]
rank[0][0]
click_size[0][0]
time_diff_mean[0][0]
active_level[0][0]
user_time_hob1[0][0]
user_time_hob2[0][0]
word_hbo[0][0]
words_count[0][0]
__________________________________________________________________________________________________
no_mask_2 (NoMask) (None, 352) 0 flatten[0][0]
__________________________________________________________________________________________________
concatenate_2 (Concatenate) (None, 16) 0 no_mask_3[0][0]
no_mask_3[1][0]
no_mask_3[2][0]
no_mask_3[3][0]
no_mask_3[4][0]
no_mask_3[5][0]
no_mask_3[6][0]
no_mask_3[7][0]
no_mask_3[8][0]
no_mask_3[9][0]
no_mask_3[10][0]
no_mask_3[11][0]
no_mask_3[12][0]
no_mask_3[13][0]
no_mask_3[14][0]
no_mask_3[15][0]
__________________________________________________________________________________________________
flatten_1 (Flatten) (None, 352) 0 no_mask_2[0][0]
__________________________________________________________________________________________________
flatten_2 (Flatten) (None, 16) 0 concatenate_2[0][0]
__________________________________________________________________________________________________
no_mask_4 (NoMask) multiple 0 flatten_1[0][0]
flatten_2[0][0]
__________________________________________________________________________________________________
concatenate_3 (Concatenate) (None, 368) 0 no_mask_4[0][0]
no_mask_4[1][0]
__________________________________________________________________________________________________
dnn (DNN) (None, 80) 89880 concatenate_3[0][0]
__________________________________________________________________________________________________
dense (Dense) (None, 1) 80 dnn[0][0]
__________________________________________________________________________________________________
seq_length (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
prediction_layer (PredictionLay (None, 1) 1 dense[0][0]
==================================================================================================
Total params: 19,770,642
Trainable params: 19,770,402
Non-trainable params: 240
__________________________________________________________________________________________________
# Compile the model
model.compile('adam', 'binary_crossentropy', metrics=['binary_crossentropy', tf.keras.metrics.AUC()])

# Train the model
if offline:
    history = model.fit(x_trn, y_trn, verbose=1, epochs=10, validation_data=(x_val, y_val), batch_size=256)
else:
    # Alternatively, validate on a split carved out of the training data:
    # history = model.fit(x_trn, y_trn, verbose=1, epochs=3, validation_split=0.3, batch_size=256)
    history = model.fit(x_trn, y_trn, verbose=1, epochs=2, batch_size=256)
Epoch 1/2
705/705 [==============================] - 102s 145ms/step - loss: 0.0098 - binary_crossentropy: 0.0098 - auc: 0.4829
Epoch 2/2
705/705 [==============================] - 99s 141ms/step - loss: 6.7655e-04 - binary_crossentropy: 6.3514e-04 - auc: 0.8093
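When a validation set is available, it can be worth adding early stopping instead of hand-picking the epoch count. A sketch using the standard Keras callback (not used in the run above; assumes the AUC metric is registered under its default name 'auc'):

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop when validation AUC stops improving and roll back to the best weights
es = EarlyStopping(monitor='val_auc', mode='max', patience=2, restore_best_weights=True)
history = model.fit(x_trn, y_trn, verbose=1, epochs=10, batch_size=256,
                    validation_data=(x_val, y_val), callbacks=[es])
```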
tst_user_item_feats_df_din_model['pred_score'] = model.predict(x_tst, verbose=1, batch_size=256)
tst_user_item_feats_df_din_model[['user_id', 'click_article_id', 'pred_score']].to_csv(save_path + 'din_rank_score.csv', index=False)
3711/3711 [==============================] - 37s 10ms/step
def submit(recall_df, topk=5, model_name=None):
    recall_df = recall_df.sort_values(by=['user_id', 'pred_score'])
    recall_df['rank'] = recall_df.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')

    # Check that every user has at least topk candidate articles
    tmp = recall_df.groupby('user_id').apply(lambda x: x['rank'].max())
    assert tmp.min() >= topk

    del recall_df['pred_score']
    submit = recall_df[recall_df['rank'] <= topk].set_index(['user_id', 'rank']).unstack(-1).reset_index()
    submit.columns = [int(col) if isinstance(col, int) else col for col in submit.columns.droplevel(0)]

    # Rename the columns to match the submission format
    submit = submit.rename(columns={'': 'user_id', 1: 'article_1', 2: 'article_2',
                                    3: 'article_3', 4: 'article_4', 5: 'article_5'})

    save_name = save_path + model_name + '_' + datetime.today().strftime('%m-%d') + '.csv'
    submit.to_csv(save_name, index=False, header=True)
# Re-rank the predictions and generate the submission file
rank_results = tst_user_item_feats_df_din_model[['user_id', 'click_article_id', 'pred_score']]
submit(rank_results, topk=5, model_name='din')
# Show the first 10 lines of the result
!head -n 10 ./temp_results/din_01-19.csv
user_id,article_1,article_2,article_3,article_4,article_5
200000,258007,207111,123938,301743,169138
200001,207111,301743,40054,108758,68719
200002,207111,301743,142300,40054,68719
200003,258007,207111,142300,301743,309813
200004,142300,301743,207111,309813,40054
200005,258007,207111,301743,272266,40054
200006,207111,301743,142300,40054,74494
200007,207111,142300,301743,272266,40054
200008,123938,169138,301743,207111,309813
This hands-on "zero-to-one recommendation systems: news recommendation" project is now essentially complete. Its main value was getting familiar with how each stage of a recommendation system is carried out, which helps with further study and review. That said, there is still plenty of room for improvement: the data was not mined thoroughly enough, which is part of why the ranking models built here do not perform very well. Since some of the problems encountered cannot be solved at this stage, further theory and practice will help in getting a firmer grip on the whole project. Finally, these notes are meant to record my own learning process and inevitably contain some mistakes; for deeper study, see:
第19期_学习者手册(新闻推荐)
零基础入门推荐系统 - 新闻推荐