Recently I wanted to run some comparison experiments. While implementing AttRec, I found that it uses a pairwise loss, which comes from BPR (Bayesian Personalized Ranking). So I first built an MF-BPR model to experiment with. I ran into quite a few problems while building BPR, so here I share the whole modeling process. The model source code can be found on my GitHub: https://github.com/ZiyaoGeng/Recommender-System-with-TF2.0
This article is about 2.9k words; estimated reading time: 10 minutes.
1. First, define the model's initialization parameters:
feature_columns: the input feature columns, split into user feature columns and item feature columns (each column holds the feature name, the feature's maximum value, and the embedding dimension);
embed_reg: the regularization coefficient for the embeddings.
Build the user embedding layers and item embedding layers separately. Since there may be multiple features on each side (e.g. user id, gender, age; item id, category, etc.), the user and item embedding layers are each kept as a list.
The initialization is as follows:
import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.layers import Embedding
from tensorflow.keras.regularizers import l2


class BPR(Model):
    def __init__(self, feature_columns, embed_reg=1e-6):
        """
        BPR
        :param feature_columns: A list. user feature columns + item feature columns
        :param embed_reg: A scalar. The regularizer of embedding.
        """
        super(BPR, self).__init__()
        # feature columns
        self.user_fea_col, self.item_fea_col = feature_columns
        # field num
        self.user_field_num = len(self.user_fea_col)
        self.item_field_num = len(self.item_fea_col)
        # user embedding layers [id, age, ...]
        self.embed_user_layers = [Embedding(input_dim=feat['feat_num'],
                                            input_length=1,
                                            output_dim=feat['embed_dim'],
                                            embeddings_initializer='random_uniform',
                                            embeddings_regularizer=l2(embed_reg))
                                  for feat in self.user_fea_col]
        # item embedding layers [id, cate_id, ...]
        self.embed_item_layers = [Embedding(input_dim=feat['feat_num'],
                                            input_length=1,
                                            output_dim=feat['embed_dim'],
                                            embeddings_initializer='random_uniform',
                                            embeddings_regularizer=l2(embed_reg))
                                  for feat in self.item_fea_col]
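For reference, the feature_columns argument produced by the dataset function later in this post only contains the id features. With ml-1m the counts work out to max id + 1, so it looks roughly like this (exact numbers depend on the data):

feature_columns = [
    [{'feat': 'user_id', 'feat_num': 6041, 'embed_dim': 16}],   # user feature columns
    [{'feat': 'item_id', 'feat_num': 3953, 'embed_dim': 16}],   # item feature columns
]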
2. Model structure
(1) The model's input is a triple: (user_inputs, pos_inputs, neg_inputs);
(2) The user embedding and the positive/negative item embeddings are each obtained by average-pooling the embeddings of all corresponding sparse features;
(3) The scores of the positive and negative samples are computed separately;
(4) In TF 2.0, the Model class inherits from tf.keras.layers.Layer and therefore has an add_loss method, so we use add_loss to attach the loss inside the model structure;
(5) The positive and negative sample scores are returned.
The model's forward pass is built as follows:
    def call(self, inputs):
        user_inputs, pos_inputs, neg_inputs = inputs  # (None, user_field_num), (None, item_field_num)
        # user info
        user_embed = tf.add_n([self.embed_user_layers[i](user_inputs[:, i])
                               for i in range(self.user_field_num)]) / self.user_field_num  # (None, dim)
        # item info
        pos_embed = tf.add_n([self.embed_item_layers[i](pos_inputs[:, i])
                              for i in range(self.item_field_num)]) / self.item_field_num  # (None, dim)
        neg_embed = tf.add_n([self.embed_item_layers[i](neg_inputs[:, i])
                              for i in range(self.item_field_num)]) / self.item_field_num  # (None, dim)
        # calculate positive item scores and negative item scores
        pos_scores = tf.reduce_sum(tf.multiply(user_embed, pos_embed), axis=1, keepdims=True)  # (None, 1)
        neg_scores = tf.reduce_sum(tf.multiply(user_embed, neg_embed), axis=1, keepdims=True)  # (None, 1)
        # add loss. Computes softplus: log(exp(features) + 1)
        # self.add_loss(tf.reduce_mean(tf.math.softplus(neg_scores - pos_scores)))
        self.add_loss(tf.reduce_mean(-tf.math.log(tf.nn.sigmoid(pos_scores - neg_scores))))
        return pos_scores, neg_scores
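The loss added here is the standard BPR pairwise loss, and the commented-out softplus line is algebraically identical, since -ln σ(x) = ln(1 + e^{-x}) = softplus(-x):

\mathcal{L}_{\mathrm{BPR}} = -\frac{1}{N}\sum_{(u,i,j)} \ln\sigma(\hat{x}_{ui} - \hat{x}_{uj}) = \frac{1}{N}\sum_{(u,i,j)} \mathrm{softplus}(\hat{x}_{uj} - \hat{x}_{ui})

where \hat{x}_{ui} is the inner-product score of user u with positive item i, and \hat{x}_{uj} the score with negative item j.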
For a simple test, ratings.dat from ml-1m is used:
Positive/negative split: ratings at or above a threshold are treated as positive samples; ratings below it (and unrated items) as negative;
Build the training, validation, and test sets:
Training set: take time steps 0 ~ t-2 as training positives; for each, randomly generate one negative sample and form a triple together with the user info;
Validation set: take time step t-1 as the validation positive; otherwise the same as above;
Test set: take time step t as the test positive, sample 100 random negatives, and form one triple per negative together with the user info;
Generate the user feature columns and item feature columns.
import random

import numpy as np
import pandas as pd
from tqdm import tqdm


def sparseFeature(feat, feat_num, embed_dim=4):
    # helper from the repo's utils: describes one sparse feature column
    return {'feat': feat, 'feat_num': feat_num, 'embed_dim': embed_dim}


def create_implicit_ml_1m_dataset(file, trans_score=2, embed_dim=8):
    """
    Create implicit ml-1m dataset.
    :param file: A string. dataset path.
    :param trans_score: A scalar. Greater than or equal to it is 1, and less than it is 0.
    :param embed_dim: A scalar. latent factor.
    :return: feature_columns, train_X, val_X, test_X
    """
    print('==========Data Preprocess Start============')
    data_df = pd.read_csv(file, sep="::", engine='python',
                          names=['user_id', 'item_id', 'label', 'Timestamp'])
    # implicit dataset
    data_df.loc[data_df.label < trans_score, 'label'] = 0
    data_df.loc[data_df.label >= trans_score, 'label'] = 1
    # sort by user and timestamp
    data_df = data_df.sort_values(by=['user_id', 'Timestamp'])
    # create train, val, test data
    train_data, val_data, test_data = [], [], []
    item_id_max = data_df['item_id'].max()
    for user_id, df in tqdm(data_df[['user_id', 'item_id']].groupby('user_id')):
        pos_list = df['item_id'].tolist()

        def gen_neg():
            # sample a random item the user has not interacted with
            neg = pos_list[0]
            while neg in pos_list:
                neg = random.randint(1, item_id_max)
            return neg

        neg_list = [gen_neg() for i in range(len(pos_list) + 100)]
        for i in range(1, len(pos_list)):
            if i == len(pos_list) - 1:
                for neg in neg_list[i:]:
                    test_data.append([[user_id], [pos_list[i]], [neg]])
            elif i == len(pos_list) - 2:
                val_data.append([[user_id], [pos_list[i]], [neg_list[i]]])
            else:
                train_data.append([[user_id], [pos_list[i]], [neg_list[i]]])
    # feature columns
    user_num, item_num = data_df['user_id'].max() + 1, data_df['item_id'].max() + 1
    feature_columns = [[sparseFeature('user_id', user_num, embed_dim)],
                       [sparseFeature('item_id', item_num, embed_dim)]]
    # shuffle
    random.shuffle(train_data)
    random.shuffle(val_data)
    random.shuffle(test_data)
    # create dataframe
    train = pd.DataFrame(train_data, columns=['user_id', 'pos_item', 'neg_item'])
    val = pd.DataFrame(val_data, columns=['user_id', 'pos_item', 'neg_item'])
    test = pd.DataFrame(test_data, columns=['user_id', 'pos_item', 'neg_item'])
    # create model inputs: [user array, pos_item array, neg_item array]
    def df_to_list(data):
        return [np.array(data['user_id'].tolist()),
                np.array(data['pos_item'].tolist()), np.array(data['neg_item'].tolist())]
    train_X = df_to_list(train)
    val_X = df_to_list(val)
    test_X = df_to_list(test)
    print('============Data Preprocess End=============')
    return feature_columns, train_X, val_X, test_X
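A quick sanity check of the returned structure (the shapes shown are illustrative):

feature_columns, train_X, val_X, test_X = create_implicit_ml_1m_dataset('../dataset/ml-1m/ratings.dat')
# each of train_X / val_X / test_X is [user_id, pos_item, neg_item]
print(len(train_X), train_X[0].shape)  # 3 (num_train_samples, 1)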
Model hyperparameters:
trans_score: threshold for turning ratings into positives, 2;
embed_dim: embedding dimension, 16;
K: top-K, 10;
learning_rate: learning rate, 0.001;
epochs: 20;
batch_size: 512;
Optimizer: Adam;
Evaluation metrics: Hit@K, NDCG@K, MRR (see the sketch after this list).
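evaluate_model comes from the repo's utils and is not shown in this post. A minimal sketch of what it might look like for this test format (each user contributes a block of triples sharing a single positive) is below; this is a hypothetical reconstruction, not the repo's actual code:

import numpy as np

def evaluate_model(model, test, K):
    # hypothetical sketch: leave-one-out ranking over (user, pos, neg) triples
    pos_scores, neg_scores = model.predict(test, batch_size=512)
    pos_scores, neg_scores = pos_scores.flatten(), neg_scores.flatten()
    users = test[0].flatten()
    hr, ndcg, mrr = [], [], []
    for u in np.unique(users):
        mask = users == u
        # rank of the positive among this user's sampled negatives (0 = best)
        rank = int(np.sum(neg_scores[mask] > pos_scores[mask][0]))
        hr.append(1.0 if rank < K else 0.0)
        ndcg.append(1.0 / np.log2(rank + 2) if rank < K else 0.0)
        mrr.append(1.0 / (rank + 1))
    return np.mean(hr), np.mean(ndcg), np.mean(mrr)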
import os
from time import time

import pandas as pd
import tensorflow as tf
from tensorflow.keras.optimizers import Adam

# BPR, create_implicit_ml_1m_dataset and evaluate_model are defined above / in the repo's modules

if __name__ == '__main__':
    # =============================== GPU ==============================
    # gpu = tf.config.experimental.list_physical_devices(device_type='GPU')
    # print(gpu)
    os.environ['CUDA_VISIBLE_DEVICES'] = '0, 1, 2'
    # ========================= Hyper Parameters =======================
    file = '../dataset/ml-1m/ratings.dat'
    trans_score = 2
    embed_dim = 16
    embed_reg = 1e-6
    K = 10
    learning_rate = 0.001
    epochs = 20
    batch_size = 512
    # ========================== Create Dataset ========================
    feature_columns, train, val, test = create_implicit_ml_1m_dataset(file, trans_score, embed_dim)
    train_X = train
    val_X = val
    # ========================== Build Model ===========================
    model = BPR(feature_columns, embed_reg)
    model.summary()
    # ========================== Model Checkpoint ======================
    # check_path = 'save/bpr_weights.epoch_{epoch:04d}.val_loss_{val_loss:.4f}.ckpt'
    # checkpoint = tf.keras.callbacks.ModelCheckpoint(check_path, save_weights_only=True,
    #                                                 verbose=1, period=5)
    # ============================= Compile ============================
    model.compile(optimizer=Adam(learning_rate=learning_rate))
    results = []
    for epoch in range(epochs):
        # ============================= Fit ============================
        t1 = time()
        model.fit(
            train_X,
            None,  # the loss is added inside the model, so no labels are needed
            validation_data=(val_X, None),
            epochs=1,
            # callbacks=[checkpoint],
            batch_size=batch_size,
        )
        # ============================= Test ===========================
        t2 = time()
        hit_rate, ndcg, mrr = evaluate_model(model, test, K)
        print('Iteration %d Fit [%.1f s], Evaluate [%.1f s]: HR = %.4f, NDCG = %.4f, MRR = %.4f'
              % (epoch + 1, t2 - t1, time() - t2, hit_rate, ndcg, mrr))
        results.append([epoch, t2 - t1, time() - t2, hit_rate, ndcg, mrr])
    # ========================== Write Log =============================
    pd.DataFrame(results, columns=['Iteration', 'fit_time', 'evaluate_time',
                                   'hit_rate', 'ndcg', 'mrr']).to_csv(
        'log/BPR_log_dim_{}_K_{}_epoch_{}_batch_size_{}.csv'.format(embed_dim, K, epochs, batch_size), index=False)
Final experimental result: hit rate 0.54.
In TensorFlow 2.0, the way to add a loss inside the model while building it is self.add_loss().
[Note]: I eventually found that the loss function can also be built by subclassing Loss; in that case the y_pred argument has shape (batch_size, d1, ..., dn).
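A minimal sketch of that alternative, assuming the model is changed to output the score difference pos_scores - neg_scores as y_pred (the class name and wiring here are illustrative, not from the post):

import tensorflow as tf

class BPRLoss(tf.keras.losses.Loss):
    # illustrative subclass: y_pred is assumed to be pos_scores - neg_scores, shape (batch_size, 1)
    def call(self, y_true, y_pred):
        # per-sample BPR loss; y_true is a dummy label and Keras applies the reduction
        return -tf.math.log(tf.nn.sigmoid(y_pred))

# usage: model.compile(optimizer=Adam(learning_rate), loss=BPRLoss())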
The complete code can be found on GitHub:
https://github.com/ZiyaoGeng/Recommender-System-with-TF2.0