https://zhuanlan.zhihu.com/p/32078473?edition=yidianzixun&utm_source=yidianzixun&yidian_docid=0I2070US
程世东
《深度学习私房菜:跟着案例学Tensorflow》作者
请安装TensorFlow1.0,Python3.5
项目地址: chengstone/movie_recommender
本项目使用文本卷积神经网络,并使用MovieLens
数据集完成电影推荐的任务。
推荐系统在日常的网络应用中无处不在,比如网上购物、网上买书、新闻app、社交网络、音乐网站、电影网站等等等等,有人的地方就有推荐。根据个人的喜好,相同喜好人群的习惯等信息进行个性化的内容推荐。比如打开新闻类的app,因为有了个性化的内容,每个人看到的新闻首页都是不一样的。
这当然是很有用的,在信息爆炸的今天,获取信息的途径和方式多种多样,人们花费时间最多的不再是去哪获取信息,而是要在众多的信息中寻找自己感兴趣的,这就是信息超载问题。为了解决这个问题,推荐系统应运而生。
协同过滤是推荐系统应用较广泛的技术,该方法搜集用户的历史记录、个人喜好等信息,计算与其他用户的相似度,利用相似用户的评价来预测目标用户对特定项目的喜好程度。优点是会给用户推荐未浏览过的项目,缺点呢,对于新用户来说,没有任何与商品的交互记录和个人喜好等信息,存在冷启动问题,导致模型无法找到相似的用户或商品。
为了解决冷启动的问题,通常的做法是对于刚注册的用户,要求用户先选择自己感兴趣的话题、群组、商品、性格、喜欢的音乐类型等信息,比如豆瓣FM:
本项目使用的是MovieLens 1M 数据集,包含6000个用户在近4000部电影上的1亿条评论。
数据集分为三个文件:用户数据users.dat,电影数据movies.dat和评分数据ratings.dat。
分别有用户ID、性别、年龄、职业ID和邮编等字段。
数据中的格式:UserID::Gender::Age::Occupation::Zip-code
可以看出UserID、Gender、Age和Occupation都是类别字段,其中邮编字段是我们不使用的。
分别有电影ID、电影名和电影风格等字段。
数据中的格式:MovieID::Title::Genres
MovieID是类别字段,Title是文本,Genres也是类别字段
分别有用户ID、电影ID、评分和时间戳等字段。
数据中的格式:UserID::MovieID::Rating::Timestamp
评分字段Rating就是我们要学习的targets,时间戳字段我们不使用。
数据预处理的代码请参见项目。
通过研究数据集中的字段类型,我们发现有一些是类别字段,通常的处理是将这些字段转成one hot编码,但是像UserID、MovieID这样的字段就会变成非常的稀疏,输入的维度急剧膨胀,这是我们不愿意见到的,毕竟我这小笔记本不像大厂动辄能处理数以亿计维度的输入:)
所以在预处理数据时将这些字段转成了数字,我们用这个数字当做嵌入矩阵的索引,在网络的第一层使用了嵌入层,维度是(N,32)和(N,16)。
电影类型的处理要多一步,有时一个电影有多个电影类型,这样从嵌入矩阵索引出来是一个(n,32)的矩阵,因为有多个类型嘛,我们要将这个矩阵求和,变成(1,32)的向量。
电影名的处理比较特殊,没有使用循环神经网络,而是用了文本卷积网络,下文会进行说明。
从嵌入层索引出特征以后,将各特征传入全连接层,将输出再次传入全连接层,最终分别得到(1,200)的用户特征和电影特征两个特征向量。
我们的目的就是要训练出用户特征和电影特征,在实现推荐功能时使用。得到这两个特征以后,就可以选择任意的方式来拟合评分了。我使用了两种方式,一个是上图中画出的将两个特征做向量乘法,将结果与真实评分做回归,采用MSE优化损失。因为本质上这是一个回归问题,另一种方式是,将两个特征作为输入,再次传入全连接层,输出一个值,将输出值回归到真实评分,采用MSE优化损失。
实际上第二个方式的MSE loss在0.8附近,第一个方式在1附近,5次迭代的结果。
网络看起来像下面这样
图片来自Kim Yoon的论文:Convolutional Neural Networks for Sentence Classification
将卷积神经网络用于文本的文章建议你阅读Understanding Convolutional Neural Networks for NLP
网络的第一层是词嵌入层,由每一个单词的嵌入向量组成的嵌入矩阵。下一层使用多个不同尺寸(窗口大小)的卷积核在嵌入矩阵上做卷积,窗口大小指的是每次卷积覆盖几个单词。这里跟对图像做卷积不太一样,图像的卷积通常用2x2、3x3、5x5之类的尺寸,而文本卷积要覆盖整个单词的嵌入向量,所以尺寸是(单词数,向量维度),比如每次滑动3个,4个或者5个单词。第三层网络是max pooling得到一个长向量,最后使用dropout做正则化,最终得到了电影Title的特征。
全部代码请参见项目。
#嵌入矩阵的维度
embed_dim = 32
#用户ID个数
uid_max = max(features.take(0,1)) + 1 # 6040
#性别个数
gender_max = max(features.take(2,1)) + 1 # 1 + 1 = 2
#年龄类别个数
age_max = max(features.take(3,1)) + 1 # 6 + 1 = 7
#职业个数
job_max = max(features.take(4,1)) + 1# 20 + 1 = 21
#电影ID个数
movie_id_max = max(features.take(1,1)) + 1 # 3952
#电影类型个数
movie_categories_max = max(genres2int.values()) + 1 # 18 + 1 = 19
#电影名单词个数
movie_title_max = len(title_set) # 5216
#对电影类型嵌入向量做加和操作的标志,考虑过使用mean做平均,但是没实现mean
combiner = "sum"
#电影名长度
sentences_size = title_count # = 15
#文本卷积滑动窗口,分别滑动2, 3, 4, 5个单词
window_sizes = {2, 3, 4, 5}
#文本卷积核数量
filter_num = 8
#电影ID转下标的字典,数据集中电影ID跟下标不一致,比如第5行的数据电影ID不一定是5
movieid2idx = {val[0]:i for i, val in enumerate(movies.values)}
超参
# Number of Epochs
num_epochs = 5
# Batch Size
batch_size = 256
dropout_keep = 0.5
# Learning Rate
learning_rate = 0.0001
# Show stats for every n number of batches
show_every_n_batches = 20
save_dir = './save'
输入
定义输入的占位符
def get_inputs():
uid = tf.placeholder(tf.int32, [None, 1], name="uid")
user_gender = tf.placeholder(tf.int32, [None, 1], name="user_gender")
user_age = tf.placeholder(tf.int32, [None, 1], name="user_age")
user_job = tf.placeholder(tf.int32, [None, 1], name="user_job")
movie_id = tf.placeholder(tf.int32, [None, 1], name="movie_id")
movie_categories = tf.placeholder(tf.int32, [None, 18], name="movie_categories")
movie_titles = tf.placeholder(tf.int32, [None, 15], name="movie_titles")
targets = tf.placeholder(tf.int32, [None, 1], name="targets")
LearningRate = tf.placeholder(tf.float32, name = "LearningRate")
dropout_keep_prob = tf.placeholder(tf.float32, name = "dropout_keep_prob")
return uid, user_gender, user_age, user_job, movie_id, movie_categories, movie_titles, targets, LearningRate, dropout_keep_prob
定义User的嵌入矩阵
def get_user_embedding(uid, user_gender, user_age, user_job):
with tf.name_scope("user_embedding"):
uid_embed_matrix = tf.Variable(tf.random_uniform([uid_max, embed_dim], -1, 1), name = "uid_embed_matrix")
uid_embed_layer = tf.nn.embedding_lookup(uid_embed_matrix, uid, name = "uid_embed_layer")
gender_embed_matrix = tf.Variable(tf.random_uniform([gender_max, embed_dim // 2], -1, 1), name= "gender_embed_matrix")
gender_embed_layer = tf.nn.embedding_lookup(gender_embed_matrix, user_gender, name = "gender_embed_layer")
age_embed_matrix = tf.Variable(tf.random_uniform([age_max, embed_dim // 2], -1, 1), name="age_embed_matrix")
age_embed_layer = tf.nn.embedding_lookup(age_embed_matrix, user_age, name="age_embed_layer")
job_embed_matrix = tf.Variable(tf.random_uniform([job_max, embed_dim // 2], -1, 1), name = "job_embed_matrix")
job_embed_layer = tf.nn.embedding_lookup(job_embed_matrix, user_job, name = "job_embed_layer")
return uid_embed_layer, gender_embed_layer, age_embed_layer, job_embed_layer
将User的嵌入矩阵一起全连接生成User的特征
def get_user_feature_layer(uid_embed_layer, gender_embed_layer, age_embed_layer, job_embed_layer):
with tf.name_scope("user_fc"):
#第一层全连接
uid_fc_layer = tf.layers.dense(uid_embed_layer, embed_dim, name = "uid_fc_layer", activation=tf.nn.relu)
gender_fc_layer = tf.layers.dense(gender_embed_layer, embed_dim, name = "gender_fc_layer", activation=tf.nn.relu)
age_fc_layer = tf.layers.dense(age_embed_layer, embed_dim, name ="age_fc_layer", activation=tf.nn.relu)
job_fc_layer = tf.layers.dense(job_embed_layer, embed_dim, name = "job_fc_layer", activation=tf.nn.relu)
#第二层全连接
user_combine_layer = tf.concat([uid_fc_layer, gender_fc_layer, age_fc_layer, job_fc_layer], 2) #(?, 1, 128)
user_combine_layer = tf.contrib.layers.fully_connected(user_combine_layer, 200, tf.tanh) #(?, 1, 200)
user_combine_layer_flat = tf.reshape(user_combine_layer, [-1, 200])
return user_combine_layer, user_combine_layer_flat
定义Movie ID的嵌入矩阵
def get_movie_id_embed_layer(movie_id):
with tf.name_scope("movie_embedding"):
movie_id_embed_matrix = tf.Variable(tf.random_uniform([movie_id_max, embed_dim], -1, 1), name = "movie_id_embed_matrix")
movie_id_embed_layer = tf.nn.embedding_lookup(movie_id_embed_matrix, movie_id, name = "movie_id_embed_layer")
return movie_id_embed_layer
对电影类型的多个嵌入向量做加和
def get_movie_categories_layers(movie_categories):
with tf.name_scope("movie_categories_layers"):
movie_categories_embed_matrix = tf.Variable(tf.random_uniform([movie_categories_max, embed_dim], -1, 1), name = "movie_categories_embed_matrix")
movie_categories_embed_layer = tf.nn.embedding_lookup(movie_categories_embed_matrix, movie_categories, name = "movie_categories_embed_layer")
if combiner == "sum":
movie_categories_embed_layer = tf.reduce_sum(movie_categories_embed_layer, axis=1, keep_dims=True)
# elif combiner == "mean":
return movie_categories_embed_layer
Movie Title的文本卷积网络实现
def get_movie_cnn_layer(movie_titles):
#从嵌入矩阵中得到电影名对应的各个单词的嵌入向量
with tf.name_scope("movie_embedding"):
movie_title_embed_matrix = tf.Variable(tf.random_uniform([movie_title_max, embed_dim], -1, 1), name = "movie_title_embed_matrix")
movie_title_embed_layer = tf.nn.embedding_lookup(movie_title_embed_matrix, movie_titles, name = "movie_title_embed_layer")
movie_title_embed_layer_expand = tf.expand_dims(movie_title_embed_layer, -1)
#对文本嵌入层使用不同尺寸的卷积核做卷积和最大池化
pool_layer_lst = []
for window_size in window_sizes:
with tf.name_scope("movie_txt_conv_maxpool_{}".format(window_size)):
filter_weights = tf.Variable(tf.truncated_normal([window_size, embed_dim, 1, filter_num],stddev=0.1),name = "filter_weights")
filter_bias = tf.Variable(tf.constant(0.1, shape=[filter_num]), name="filter_bias")
conv_layer = tf.nn.conv2d(movie_title_embed_layer_expand, filter_weights, [1,1,1,1], padding="VALID", name="conv_layer")
relu_layer = tf.nn.relu(tf.nn.bias_add(conv_layer,filter_bias), name ="relu_layer")
maxpool_layer = tf.nn.max_pool(relu_layer, [1,sentences_size - window_size + 1 ,1,1], [1,1,1,1], padding="VALID", name="maxpool_layer")
pool_layer_lst.append(maxpool_layer)
#Dropout层
with tf.name_scope("pool_dropout"):
pool_layer = tf.concat(pool_layer_lst, 3, name ="pool_layer")
max_num = len(window_sizes) * filter_num
pool_layer_flat = tf.reshape(pool_layer , [-1, 1, max_num], name = "pool_layer_flat")
dropout_layer = tf.nn.dropout(pool_layer_flat, dropout_keep_prob, name = "dropout_layer")
return pool_layer_flat, dropout_layer
将Movie的各个层一起做全连接
def get_movie_feature_layer(movie_id_embed_layer, movie_categories_embed_layer, dropout_layer):
with tf.name_scope("movie_fc"):
#第一层全连接
movie_id_fc_layer = tf.layers.dense(movie_id_embed_layer, embed_dim, name = "movie_id_fc_layer", activation=tf.nn.relu)
movie_categories_fc_layer = tf.layers.dense(movie_categories_embed_layer, embed_dim, name = "movie_categories_fc_layer", activation=tf.nn.relu)
#第二层全连接
movie_combine_layer = tf.concat([movie_id_fc_layer, movie_categories_fc_layer, dropout_layer], 2) #(?, 1, 96)
movie_combine_layer = tf.contrib.layers.fully_connected(movie_combine_layer, 200, tf.tanh) #(?, 1, 200)
movie_combine_layer_flat = tf.reshape(movie_combine_layer, [-1, 200])
return movie_combine_layer, movie_combine_layer_flat
tf.reset_default_graph()
train_graph = tf.Graph()
with train_graph.as_default():
#获取输入占位符
uid, user_gender, user_age, user_job, movie_id, movie_categories, movie_titles, targets, lr, dropout_keep_prob = get_inputs()
#获取User的4个嵌入向量
uid_embed_layer, gender_embed_layer, age_embed_layer, job_embed_layer = get_user_embedding(uid, user_gender, user_age, user_job)
#得到用户特征
user_combine_layer, user_combine_layer_flat = get_user_feature_layer(uid_embed_layer, gender_embed_layer, age_embed_layer, job_embed_layer)
#获取电影ID的嵌入向量
movie_id_embed_layer = get_movie_id_embed_layer(movie_id)
#获取电影类型的嵌入向量
movie_categories_embed_layer = get_movie_categories_layers(movie_categories)
#获取电影名的特征向量
pool_layer_flat, dropout_layer = get_movie_cnn_layer(movie_titles)
#得到电影特征
movie_combine_layer, movie_combine_layer_flat = get_movie_feature_layer(movie_id_embed_layer,
movie_categories_embed_layer,
dropout_layer)
#计算出评分,要注意两个不同的方案,inference的名字(name值)是不一样的,后面做推荐时要根据name取得tensor
with tf.name_scope("inference"):
#将用户特征和电影特征作为输入,经过全连接,输出一个值的方案
# inference_layer = tf.concat([user_combine_layer_flat, movie_combine_layer_flat], 1) #(?, 200)
# inference = tf.layers.dense(inference_layer, 1,
# kernel_initializer=tf.truncated_normal_initializer(stddev=0.01),
# kernel_regularizer=tf.nn.l2_loss, name="inference")
#简单的将用户特征和电影特征做矩阵乘法得到一个预测评分
#感谢网友 @风惜殇 指出此处的问题,这里使用矩阵乘法是错的,得到的是多个用户对多个电影的评分矩阵,shape是[batch_size, batch_size],跟做mse的targets形状是不同的
# inference = tf.matmul(user_combine_layer_flat, tf.transpose(movie_combine_layer_flat))
#应该使用点乘求和,得到一个用户对一个电影的评分,shape是[batch_size, 1],此处改动已经同步到github
inference = tf.reduce_sum(user_combine_layer_flat * movie_combine_layer_flat, axis=1)
inference = tf.expand_dims(inference, axis=1)
with tf.name_scope("loss"):
# MSE损失,将计算值回归到评分
cost = tf.losses.mean_squared_error(targets, inference )
loss = tf.reduce_mean(cost)
# 优化损失
# train_op = tf.train.AdamOptimizer(lr).minimize(loss) #cost
global_step = tf.Variable(0, name="global_step", trainable=False)
optimizer = tf.train.AdamOptimizer(lr)
gradients = optimizer.compute_gradients(loss) #cost
train_op = optimizer.apply_gradients(gradients, global_step=global_step)
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import matplotlib.pyplot as plt
import time
import datetime
losses = {'train':[], 'test':[]}
with tf.Session(graph=train_graph) as sess:
#搜集数据给tensorBoard用
# Keep track of gradient values and sparsity
grad_summaries = []
for g, v in gradients:
if g is not None:
grad_hist_summary = tf.summary.histogram("{}/grad/hist".format(v.name.replace(':', '_')), g)
sparsity_summary = tf.summary.scalar("{}/grad/sparsity".format(v.name.replace(':', '_')), tf.nn.zero_fraction(g))
grad_summaries.append(grad_hist_summary)
grad_summaries.append(sparsity_summary)
grad_summaries_merged = tf.summary.merge(grad_summaries)
# Output directory for models and summaries
timestamp = str(int(time.time()))
out_dir = os.path.abspath(os.path.join(os.path.curdir, "runs", timestamp))
print("Writing to {}\n".format(out_dir))
# Summaries for loss and accuracy
loss_summary = tf.summary.scalar("loss", loss)
# Train Summaries
train_summary_op = tf.summary.merge([loss_summary, grad_summaries_merged])
train_summary_dir = os.path.join(out_dir, "summaries", "train")
train_summary_writer = tf.summary.FileWriter(train_summary_dir, sess.graph)
# Inference summaries
inference_summary_op = tf.summary.merge([loss_summary])
inference_summary_dir = os.path.join(out_dir, "summaries", "inference")
inference_summary_writer = tf.summary.FileWriter(inference_summary_dir, sess.graph)
sess.run(tf.global_variables_initializer())
saver = tf.train.Saver()
for epoch_i in range(num_epochs):
#将数据集分成训练集和测试集,随机种子不固定
train_X,test_X, train_y, test_y = train_test_split(features,
targets_values,
test_size = 0.2,
random_state = 0)
train_batches = get_batches(train_X, train_y, batch_size)
test_batches = get_batches(test_X, test_y, batch_size)
#训练的迭代,保存训练损失
for batch_i in range(len(train_X) // batch_size):
x, y = next(train_batches)
categories = np.zeros([batch_size, 18])
for i in range(batch_size):
categories[i] = x.take(6,1)[i]
titles = np.zeros([batch_size, sentences_size])
for i in range(batch_size):
titles[i] = x.take(5,1)[i]
feed = {
uid: np.reshape(x.take(0,1), [batch_size, 1]),
user_gender: np.reshape(x.take(2,1), [batch_size, 1]),
user_age: np.reshape(x.take(3,1), [batch_size, 1]),
user_job: np.reshape(x.take(4,1), [batch_size, 1]),
movie_id: np.reshape(x.take(1,1), [batch_size, 1]),
movie_categories: categories, #x.take(6,1)
movie_titles: titles, #x.take(5,1)
targets: np.reshape(y, [batch_size, 1]),
dropout_keep_prob: dropout_keep, #dropout_keep
lr: learning_rate}
step, train_loss, summaries, _ = sess.run([global_step, loss, train_summary_op, train_op], feed) #cost
losses['train'].append(train_loss)
train_summary_writer.add_summary(summaries, step) #
# Show every batches
if (epoch_i * (len(train_X) // batch_size) + batch_i) % show_every_n_batches == 0:
time_str = datetime.datetime.now().isoformat()
print('{}: Epoch {:>3} Batch {:>4}/{} train_loss = {:.3f}'.format(
time_str,
epoch_i,
batch_i,
(len(train_X) // batch_size),
train_loss))
#使用测试数据的迭代
for batch_i in range(len(test_X) // batch_size):
x, y = next(test_batches)
categories = np.zeros([batch_size, 18])
for i in range(batch_size):
categories[i] = x.take(6,1)[i]
titles = np.zeros([batch_size, sentences_size])
for i in range(batch_size):
titles[i] = x.take(5,1)[i]
feed = {
uid: np.reshape(x.take(0,1), [batch_size, 1]),
user_gender: np.reshape(x.take(2,1), [batch_size, 1]),
user_age: np.reshape(x.take(3,1), [batch_size, 1]),
user_job: np.reshape(x.take(4,1), [batch_size, 1]),
movie_id: np.reshape(x.take(1,1), [batch_size, 1]),
movie_categories: categories, #x.take(6,1)
movie_titles: titles, #x.take(5,1)
targets: np.reshape(y, [batch_size, 1]),
dropout_keep_prob: 1,
lr: learning_rate}
step, test_loss, summaries = sess.run([global_step, loss, inference_summary_op], feed) #cost
#保存测试损失
losses['test'].append(test_loss)
inference_summary_writer.add_summary(summaries, step) #
time_str = datetime.datetime.now().isoformat()
if (epoch_i * (len(test_X) // batch_size) + batch_i) % show_every_n_batches == 0:
print('{}: Epoch {:>3} Batch {:>4}/{} test_loss = {:.3f}'.format(
time_str,
epoch_i,
batch_i,
(len(test_X) // batch_size),
test_loss))
# Save Model
saver.save(sess, save_dir) #, global_step=epoch_i
print('Model Trained and Saved')
使用函数 get_tensor_by_name()
从 loaded_graph
中获取tensors,后面的推荐功能要用到。
def get_tensors(loaded_graph):
uid = loaded_graph.get_tensor_by_name("uid:0")
user_gender = loaded_graph.get_tensor_by_name("user_gender:0")
user_age = loaded_graph.get_tensor_by_name("user_age:0")
user_job = loaded_graph.get_tensor_by_name("user_job:0")
movie_id = loaded_graph.get_tensor_by_name("movie_id:0")
movie_categories = loaded_graph.get_tensor_by_name("movie_categories:0")
movie_titles = loaded_graph.get_tensor_by_name("movie_titles:0")
targets = loaded_graph.get_tensor_by_name("targets:0")
dropout_keep_prob = loaded_graph.get_tensor_by_name("dropout_keep_prob:0")
lr = loaded_graph.get_tensor_by_name("LearningRate:0")
#两种不同计算预测评分的方案使用不同的name获取tensor inference
# inference = loaded_graph.get_tensor_by_name("inference/inference/BiasAdd:0")
inference = loaded_graph.get_tensor_by_name("inference/ExpandDims:0") #之前是MatMul:0 因为inference代码修改了 这里也要修改 感谢网友 @清歌 指出问题
movie_combine_layer_flat = loaded_graph.get_tensor_by_name("movie_fc/Reshape:0")
user_combine_layer_flat = loaded_graph.get_tensor_by_name("user_fc/Reshape:0")
return uid, user_gender, user_age, user_job, movie_id, movie_categories, movie_titles, targets, lr, dropout_keep_prob, inference, movie_combine_layer_flat, user_combine_layer_flat
这部分就是对网络做正向传播,计算得到预测的评分
def rating_movie(user_id_val, movie_id_val):
loaded_graph = tf.Graph() #
with tf.Session(graph=loaded_graph) as sess: #
# Load saved model
loader = tf.train.import_meta_graph(load_dir + '.meta')
loader.restore(sess, load_dir)
# Get Tensors from loaded model
uid, user_gender, user_age, user_job, movie_id, movie_categories, movie_titles, targets, lr, dropout_keep_prob, inference,_, __ = get_tensors(loaded_graph) #loaded_graph
categories = np.zeros([1, 18])
categories[0] = movies.values[movieid2idx[movie_id_val]][2]
titles = np.zeros([1, sentences_size])
titles[0] = movies.values[movieid2idx[movie_id_val]][1]
feed = {
uid: np.reshape(users.values[user_id_val-1][0], [1, 1]),
user_gender: np.reshape(users.values[user_id_val-1][1], [1, 1]),
user_age: np.reshape(users.values[user_id_val-1][2], [1, 1]),
user_job: np.reshape(users.values[user_id_val-1][3], [1, 1]),
movie_id: np.reshape(movies.values[movieid2idx[movie_id_val]][0], [1, 1]),
movie_categories: categories, #x.take(6,1)
movie_titles: titles, #x.take(5,1)
dropout_keep_prob: 1}
# Get Prediction
inference_val = sess.run([inference], feed)
return (inference_val)
将训练好的电影特征组合成电影特征矩阵并保存到本地
loaded_graph = tf.Graph() #
movie_matrics = []
with tf.Session(graph=loaded_graph) as sess: #
# Load saved model
loader = tf.train.import_meta_graph(load_dir + '.meta')
loader.restore(sess, load_dir)
# Get Tensors from loaded model
uid, user_gender, user_age, user_job, movie_id, movie_categories, movie_titles, targets, lr, dropout_keep_prob, _, movie_combine_layer_flat, __ = get_tensors(loaded_graph) #loaded_graph
for item in movies.values:
categories = np.zeros([1, 18])
categories[0] = item.take(2)
titles = np.zeros([1, sentences_size])
titles[0] = item.take(1)
feed = {
movie_id: np.reshape(item.take(0), [1, 1]),
movie_categories: categories, #x.take(6,1)
movie_titles: titles, #x.take(5,1)
dropout_keep_prob: 1}
movie_combine_layer_flat_val = sess.run([movie_combine_layer_flat], feed)
movie_matrics.append(movie_combine_layer_flat_val)
pickle.dump((np.array(movie_matrics).reshape(-1, 200)), open('movie_matrics.p', 'wb'))
movie_matrics = pickle.load(open('movie_matrics.p', mode='rb'))
将训练好的用户特征组合成用户特征矩阵并保存到本地
loaded_graph = tf.Graph() #
users_matrics = []
with tf.Session(graph=loaded_graph) as sess: #
# Load saved model
loader = tf.train.import_meta_graph(load_dir + '.meta')
loader.restore(sess, load_dir)
# Get Tensors from loaded model
uid, user_gender, user_age, user_job, movie_id, movie_categories, movie_titles, targets, lr, dropout_keep_prob, _, __,user_combine_layer_flat = get_tensors(loaded_graph) #loaded_graph
for item in users.values:
feed = {
uid: np.reshape(item.take(0), [1, 1]),
user_gender: np.reshape(item.take(1), [1, 1]),
user_age: np.reshape(item.take(2), [1, 1]),
user_job: np.reshape(item.take(3), [1, 1]),
dropout_keep_prob: 1}
user_combine_layer_flat_val = sess.run([user_combine_layer_flat], feed)
users_matrics.append(user_combine_layer_flat_val)
pickle.dump((np.array(users_matrics).reshape(-1, 200)), open('users_matrics.p', 'wb'))
users_matrics = pickle.load(open('users_matrics.p', mode='rb'))
使用生产的用户特征矩阵和电影特征矩阵做电影推荐
思路是计算当前看的电影特征向量与整个电影特征矩阵的余弦相似度,取相似度最大的top_k个,这里加了些随机选择在里面,保证每次的推荐稍稍有些不同。
def recommend_same_type_movie(movie_id_val, top_k = 20):
loaded_graph = tf.Graph() #
with tf.Session(graph=loaded_graph) as sess: #
# Load saved model
loader = tf.train.import_meta_graph(load_dir + '.meta')
loader.restore(sess, load_dir)
norm_movie_matrics = tf.sqrt(tf.reduce_sum(tf.square(movie_matrics), 1, keep_dims=True))
normalized_movie_matrics = movie_matrics / norm_movie_matrics
#推荐同类型的电影
probs_embeddings = (movie_matrics[movieid2idx[movie_id_val]]).reshape([1, 200])
probs_similarity = tf.matmul(probs_embeddings, tf.transpose(normalized_movie_matrics))
sim = (probs_similarity.eval())
# results = (-sim[0]).argsort()[0:top_k]
# print(results)
print("您看的电影是:{}".format(movies_orig[movieid2idx[movie_id_val]]))
print("以下是给您的推荐:")
p = np.squeeze(sim)
p[np.argsort(p)[:-top_k]] = 0
p = p / np.sum(p)
results = set()
while len(results) != 5:
c = np.random.choice(3883, 1, p=p)[0]
results.add(c)
for val in (results):
print(val)
print(movies_orig[val])
return results
思路是使用用户特征向量与电影特征矩阵计算所有电影的评分,取评分最高的top_k个,同样加了些随机选择部分。
def recommend_your_favorite_movie(user_id_val, top_k = 10):
loaded_graph = tf.Graph() #
with tf.Session(graph=loaded_graph) as sess: #
# Load saved model
loader = tf.train.import_meta_graph(load_dir + '.meta')
loader.restore(sess, load_dir)
#推荐您喜欢的电影
probs_embeddings = (users_matrics[user_id_val-1]).reshape([1, 200])
probs_similarity = tf.matmul(probs_embeddings, tf.transpose(movie_matrics))
sim = (probs_similarity.eval())
# print(sim.shape)
# results = (-sim[0]).argsort()[0:top_k]
# print(results)
# sim_norm = probs_norm_similarity.eval()
# print((-sim_norm[0]).argsort()[0:top_k])
print("以下是给您的推荐:")
p = np.squeeze(sim)
p[np.argsort(p)[:-top_k]] = 0
p = p / np.sum(p)
results = set()
while len(results) != 5:
c = np.random.choice(3883, 1, p=p)[0]
results.add(c)
for val in (results):
print(val)
print(movies_orig[val])
return results
import random
def recommend_other_favorite_movie(movie_id_val, top_k = 20):
loaded_graph = tf.Graph() #
with tf.Session(graph=loaded_graph) as sess: #
# Load saved model
loader = tf.train.import_meta_graph(load_dir + '.meta')
loader.restore(sess, load_dir)
probs_movie_embeddings = (movie_matrics[movieid2idx[movie_id_val]]).reshape([1, 200])
probs_user_favorite_similarity = tf.matmul(probs_movie_embeddings, tf.transpose(users_matrics))
favorite_user_id = np.argsort(probs_user_favorite_similarity.eval())[0][-top_k:]
# print(normalized_users_matrics.eval().shape)
# print(probs_user_favorite_similarity.eval()[0][favorite_user_id])
# print(favorite_user_id.shape)
print("您看的电影是:{}".format(movies_orig[movieid2idx[movie_id_val]]))
print("喜欢看这个电影的人是:{}".format(users_orig[favorite_user_id-1]))
probs_users_embeddings = (users_matrics[favorite_user_id-1]).reshape([-1, 200])
probs_similarity = tf.matmul(probs_users_embeddings, tf.transpose(movie_matrics))
sim = (probs_similarity.eval())
# results = (-sim[0]).argsort()[0:top_k]
# print(results)
# print(sim.shape)
# print(np.argmax(sim, 1))
p = np.argmax(sim, 1)
print("喜欢看这个电影的人还喜欢看:")
results = set()
while len(results) != 5:
c = p[random.randrange(top_k)]
results.add(c)
for val in (results):
print(val)
print(movies_orig[val])
return results
以上就是实现的常用的推荐功能,将网络模型作为回归问题进行训练,得到训练好的用户特征矩阵和电影特征矩阵进行推荐。
如果你对个性化推荐感兴趣,以下资料建议你看看:
Understanding Convolutional Neural Networks for NLP
Convolutional Neural Networks for Sentence Classification
利用TensorFlow实现卷积神经网络做文本分类
Convolutional Neural Network for Text Classification in Tensorflow
SVD Implement Recommendation systems
1.使用更多的特征,进一步降低MSE
1) 用户属性数据中的Zip-code可以标识用户所处地区,不同地域的人可能有不同的喜好,应该是一个有用处的特征。
2) 时间特征:电影名中的上映时间,不同时代的电影,评分可能略有差异;用户评分时间距电影上映时间可能也会影响评分。
2.使用得到的特征做电影推荐
参考:
https://blog.csdn.net/ChuQiDeCha/article/details/84729177
谷歌开发者发表于Tenso...
深度学习于...发表于深度学习与...
以下文章来自 刘建平Pinard - 博客园,对tensorflow机器学习模型的跨平台上线学习分享在 用PMML实现机器学习模型的跨平台上线中,我们讨论了使用PMML文件来实现跨平台模型上线的方法,这个…
漫漫成长
TensorFlow是Google的开源深度学习库,在图形分类、音频处理、推荐系统和自然语言处理等场景下都有丰富的应用。毫不夸张得说,TensorFlow的流行让深度学习门槛变得越来越低,只要你有Python…
七月在线 ...发表于从零学AI