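This article covers the following recommendation algorithms and how each was tuned:
1. Baseline algorithm
2. Item-based collaborative filtering
3. Item-based collaborative filtering combined with the baseline
4. Item-based collaborative filtering (top-K + baseline)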
MovieLens dataset (ml-100k):
http://files.grouplens.org/datasets/movielens/ml-100k.zip
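How the baseline algorithm works: the NaN entries of the user rating matrix are filled with item_mean[item] + user_mean[user] - all_mean, which predicts a user's rating for an item they have not rated. Here item_mean[item] is the mean rating the item received across all users, user_mean[user] is the mean of all ratings given by that user, and all_mean is the mean of all ratings in the data.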
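To make the formula concrete, here is a minimal sketch on a toy 3x3 matrix. The matrix and the `toy_*` names are made up for illustration and are not part of the MovieLens data; 0 marks an unrated cell, as in the rest of this article.

```python
import numpy as np

# Toy rating matrix (rows = users, columns = items); 0 means "not rated".
toy = np.array([
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 2.0],
    [0.0, 1.0, 3.0],
])
rated = toy != 0

toy_all_mean = toy[rated].mean()                     # mean of all observed ratings
toy_user_mean = toy.sum(axis=1) / rated.sum(axis=1)  # per-user mean over rated items
toy_item_mean = toy.sum(axis=0) / rated.sum(axis=0)  # per-item mean over raters

# Baseline for every (user, item) pair at once via broadcasting.
toy_baseline = toy_item_mean[np.newaxis, :] + toy_user_mean[:, np.newaxis] - toy_all_mean

# Baseline prediction for user 0 on item 2, which user 0 has not rated.
print(toy_baseline[0, 2])
```

User 0 rates above the global mean (4 versus 3) and item 2 is rated below it (2.5 versus 3), so the baseline lands at 2.5 + 4 - 3 = 3.5.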
First, look at the structure of the data, [user_id, movie_id, rating, timestamp]:
1 1 5 874965758
1 2 3 876893171
1 3 4 878542960
Read the data with pandas:
import numpy as np
import pandas as pd
title=['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv("C://Users/Administrator/Desktop/ml-100k/u1.base",sep='\t',names = title)
Check the largest user and item IDs, which give the dimensions of the rating matrix:
print(np.max(df['user_id']), np.max(df['item_id']))
943 1682
Build the rating matrix ratings:
ratings = np.zeros((943, 1682))
for row in df.itertuples():
    # row[0] is the DataFrame index; IDs are 1-based, so shift them to 0-based
    ratings[row[1]-1, row[2]-1] = row[3]
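The same matrix can also be filled without a Python loop by using NumPy integer-array indexing. The sketch below uses a small hypothetical DataFrame (`demo_df`) with the same column names as the data file, and assumes each (user, item) pair appears at most once.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the u1.base DataFrame (same column names).
demo_df = pd.DataFrame({
    'user_id':   [1, 1, 2, 3],
    'item_id':   [1, 3, 2, 3],
    'rating':    [5, 4, 3, 2],
    'timestamp': [0, 0, 0, 0],
})

n_users = int(demo_df['user_id'].max())
n_items = int(demo_df['item_id'].max())

demo_ratings = np.zeros((n_users, n_items))
# IDs are 1-based in the file, so shift them to 0-based array indices.
rows = demo_df['user_id'].to_numpy() - 1
cols = demo_df['item_id'].to_numpy() - 1
demo_ratings[rows, cols] = demo_df['rating'].to_numpy()
print(demo_ratings)
```

On a file of this size the loop is fast enough, so this is a matter of style more than speed.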
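Check the density of the rating matrix:
# Share of non-zero (rated) cells; the variable is named sparsity but holds the density
sparsity = float(len(ratings.nonzero()[0]))
sparsity /= (ratings.shape[0] * ratings.shape[1])
sparsity *= 100
print('Training matrix density: {:4.2f}%'.format(sparsity))
Training matrix density: 5.04%
The rating matrix is very sparse: about 95% of its entries are empty.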
The baseline algorithm starts by computing item_mean, user_mean and all_mean:
all_mean = np.mean(ratings[ratings!=0])
# Means over rated entries only; 0/0 gives NaN for a user or item with no ratings
user_mean = ratings.sum(axis=1) / (ratings != 0).sum(axis=1)
item_mean = ratings.sum(axis=0) / (ratings != 0).sum(axis=0)
# Fill any NaN in user_mean and item_mean with all_mean
user_mean = np.where(np.isnan(user_mean), all_mean, user_mean)
item_mean = np.where(np.isnan(item_mean), all_mean, item_mean)
Predict a user's rating for an item:
def predict_naive(user, item):
    prediction = item_mean[item] + user_mean[user] - all_mean
    return prediction
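Use the root-mean-square error (RMSE) to measure accuracy:
from sklearn.metrics import mean_squared_error
def rmse(pred, actual):
    '''Compute the RMSE of the predictions.'''
    # Score only the entries that carry a real (non-zero) rating
    pred = pred[actual.nonzero()].flatten()
    actual = actual[actual.nonzero()].flatten()
    return np.sqrt(mean_squared_error(actual, pred))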
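For reference, RMSE is just the square root of the mean squared difference, so it can be written with NumPy alone. The helper name `rmse_np` is introduced here for illustration; unlike the function above, it does not mask out zero ratings.

```python
import numpy as np

def rmse_np(pred, actual):
    """RMSE over two equal-length 1-D sequences of predictions and true ratings."""
    pred = np.asarray(pred, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((pred - actual) ** 2)))

# Errors are 0, 2 and 0, so the result is sqrt(4/3).
print(rmse_np([3.0, 4.0, 5.0], [3.0, 2.0, 5.0]))
```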
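Evaluate the algorithm on the test set:
# The test file is assumed to be the same u3.test file that the later sections load
test_df = pd.read_csv("C://Users/Administrator/Desktop/ml-100k/u3.test", sep='\t', names=title)
predictions = []
actuals = []
for row in test_df.itertuples():
    user, item, actual = row[1]-1, row[2]-1, row[3]
    predictions.append(predict_naive(user, item))
    actuals.append(actual)
print('Test RMSE: %.4f' % rmse(np.array(predictions), np.array(actuals)))
Test RMSE: 0.9344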
Item-based collaborative filtering
# Compute the user and item similarity matrices (cosine similarity)
user_s = ratings.dot(ratings.T)
item_s = ratings.T.dot(ratings)
user_norm = np.array([np.sqrt(np.diagonal(user_s))])
item_norm = np.array([np.sqrt(np.diagonal(item_s))])
user_sim = (user_s/user_norm/user_norm.T)
item_sim = (item_s/item_norm/item_norm.T)
print(np.round(item_sim[:10,:10], 3))
[[ 1. 0.296 0.279 0.388 0.252 0.114 0.518 0.41 0.416 0.199]
[ 0.296 1. 0.177 0.405 0.211 0.099 0.331 0.31 0.207 0.152]
[ 0.279 0.177 1. 0.275 0.118 0.104 0.311 0.125 0.207 0.121]
[ 0.388 0.405 0.275 1. 0.265 0.091 0.411 0.391 0.357 0.219]
[ 0.252 0.211 0.118 0.265 1. 0.016 0.28 0.214 0.202 0.031]
[ 0.114 0.099 0.104 0.091 0.016 1. 0.128 0.065 0.164 0.139]
[ 0.518 0.331 0.311 0.411 0.28 0.128 1. 0.342 0.43 0.279]
[ 0.41 0.31 0.125 0.391 0.214 0.065 0.342 1. 0.364 0.166]
[ 0.416 0.207 0.207 0.357 0.202 0.164 0.43 0.364 1. 0.25 ]
[ 0.199 0.152 0.121 0.219 0.031 0.139 0.279 0.166 0.25 1. ]]
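One detail the similarity code above leaves open is an item with no ratings at all: its norm is 0 and the division produces NaN. A minimal sketch of one way to guard against that is shown below; the function name `cosine_item_sim` and the `eps` parameter are assumptions introduced here, not part of the original code.

```python
import numpy as np

def cosine_item_sim(matrix, eps=1e-9):
    """Item-item cosine similarity; eps keeps all-zero columns from dividing by 0."""
    dot = matrix.T.dot(matrix)               # item-item dot products
    norm = np.sqrt(np.diagonal(dot)) + eps   # column norms, nudged away from 0
    return dot / norm[:, np.newaxis] / norm[np.newaxis, :]

toy = np.array([
    [5.0, 5.0, 0.0],
    [4.0, 4.0, 0.0],
    [0.0, 1.0, 0.0],   # item 2 has no ratings at all
])
sim = cosine_item_sim(toy)
print(np.round(sim, 3))
```

Items 0 and 1 come out almost identical, and the unrated item 2 gets similarity 0 to everything instead of NaN.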
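Rating prediction function:
def predict_itemCF(user, item, k=100):
    '''Item-based collaborative filtering: predict a rating.'''
    # Items this user has rated (k is not used in this version)
    nzero = ratings[user].nonzero()[0]
    # Similarity-weighted average of the user's own ratings
    prediction = ratings[user, nzero].dot(item_sim[item, nzero])\
        / sum(item_sim[item, nzero])
    return prediction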
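The prediction is a similarity-weighted average of the ratings the user has already given. A small worked example with made-up numbers (the arrays below are hypothetical, not taken from the data set):

```python
import numpy as np

# Hypothetical values: the ratings a user gave to three items, and the
# similarity of each of those items to the target item.
user_ratings = np.array([5.0, 3.0, 1.0])
sims_to_target = np.array([0.5, 0.3, 0.2])

# Same form as predict_itemCF: weighted sum divided by the sum of weights.
prediction = user_ratings.dot(sims_to_target) / sims_to_target.sum()
print(prediction)
```

The weighted sum is 2.5 + 0.9 + 0.2 = 3.6 and the weights add up to 1, so the prediction is about 3.6, pulled toward the rating of the most similar item.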
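Test the predictions. One caveat: in ml-100k the five u*.test files are disjoint folds, so the ratings in u3.test are also contained in u1.base, and the RMSE values reported here are measured on ratings the model has already seen; u1.test is the held-out counterpart of u1.base.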
test_df = pd.read_csv("C://Users/Administrator/Desktop/ml-100k/u3.test", sep='\t', names=title)
test_df.head()
predictions = []
targets = []
print('Test set size: %d' % len(test_df))
print('Predicting with item-based collaborative filtering...')
for row in test_df.itertuples():
    user, item, actual = row[1]-1, row[2]-1, row[3]
    predictions.append(predict_itemCF(user, item))
    targets.append(actual)
print('Test RMSE: %.4f' % rmse(np.array(predictions), np.array(targets)))
Test set size: 20000
Predicting with item-based collaborative filtering...
Test RMSE: 0.9534
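Item-based collaborative filtering combined with the baseline
def predict_itemCF_baseline(user, item):
    '''Item-based CF combined with the baseline: predict a rating.'''
    nzero = ratings[user].nonzero()[0]
    # Baseline estimate for this user on every item
    baseline = item_mean + user_mean[user] - all_mean
    # Weighted average of the deviations from the baseline, plus the target item's baseline
    prediction = (ratings[user, nzero] - baseline[nzero]).dot(item_sim[item, nzero])\
        / sum(item_sim[item, nzero]) + baseline[item]
    return prediction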
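Written as a formula, the function above can be read as follows. The notation is introduced here for illustration: \bar{r}_j is item_mean[j], \bar{r}_u is user_mean[user], \mu is all_mean, s_{ij} is item_sim[i, j], and R(u) is the set of items the user has rated (nzero).

```latex
b_{uj} = \bar{r}_j + \bar{r}_u - \mu
\qquad
\hat{r}_{ui} = b_{ui} + \frac{\sum_{j \in R(u)} s_{ij}\,\bigl(r_{uj} - b_{uj}\bigr)}{\sum_{j \in R(u)} s_{ij}}
```

The collaborative-filtering term only has to model how far each rating deviates from the baseline, rather than the rating itself.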
test_df = pd.read_csv("C://Users/Administrator/Desktop/ml-100k/u3.test", sep='\t', names=title)
test_df.head()
predictions = []
targets = []
print('Test set size: %d' % len(test_df))
print('Predicting with item-item collaborative filtering combined with the baseline...')
for row in test_df.itertuples():
    user, item, actual = row[1]-1, row[2]-1, row[3]
    predictions.append(predict_itemCF_baseline(user, item))
    targets.append(actual)
print('Test RMSE: %.4f' % rmse(np.array(predictions), np.array(targets)))
Test set size: 20000
Predicting with item-item collaborative filtering combined with the baseline...
Test RMSE: 0.8794
Correct out-of-range ratings: predictions above 5 are set to 5, and predictions below 1 are set to 1.
def predict_itemCF_baseline(user, item, k=100):
    '''Item-based CF combined with the baseline: predict a rating.'''
    nzero = ratings[user].nonzero()[0]
    baseline = item_mean + user_mean[user] - all_mean
    prediction = (ratings[user, nzero] - baseline[nzero]).dot(item_sim[item, nzero])\
        / sum(item_sim[item, nzero]) + baseline[item]
    # Clamp the prediction to the valid rating range [1, 5]
    if prediction > 5:
        prediction = 5
    if prediction < 1:
        prediction = 1
    return prediction
test_df = pd.read_csv("C://Users/Administrator/Desktop/ml-100k/u3.test", sep='\t', names=title)
test_df.head()
predictions = []
targets = []
print('Test set size: %d' % len(test_df))
print('Predicting with item-item collaborative filtering combined with the baseline...')
for row in test_df.itertuples():
    user, item, actual = row[1]-1, row[2]-1, row[3]
    predictions.append(predict_itemCF_baseline(user, item))
    targets.append(actual)
print('Test RMSE after correcting ratings: %.4f' % rmse(np.array(predictions), np.array(targets)))
Test set size: 20000
Predicting with item-item collaborative filtering combined with the baseline...
Test RMSE after correcting ratings: 0.8793
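The range correction can also be written with np.clip, which bounds a scalar or a whole array in one call. The values below are made up for illustration.

```python
import numpy as np

raw_predictions = np.array([0.4, 3.2, 5.7])
# np.clip limits every value to the valid rating range [1, 5].
clipped = np.clip(raw_predictions, 1, 5)
print(clipped)
```

Applying the clip once to the full array of predictions before computing the RMSE avoids having to repeat the two if statements in every predictor.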
Item-based collaborative filtering (top-K + baseline)
print('------ Top-k collaborative filtering (item-based + baseline) ------')
def predict_topkCF(user, item, k=10):
    '''Top-k CF: item-based collaborative filtering combined with the baseline; predict a rating.'''
    nzero = ratings[user].nonzero()[0]
    baseline = item_mean + user_mean[user] - all_mean
    # Keep only the k rated items most similar to the target item
    choice = nzero[item_sim[item, nzero].argsort()[::-1][:k]]
    prediction = (ratings[user, choice] - baseline[choice]).dot(item_sim[item, choice])\
        / sum(item_sim[item, choice]) + baseline[item]
    # Clamp the prediction to the valid rating range [1, 5]
    if prediction > 5: prediction = 5
    if prediction < 1: prediction = 1
    return prediction
print('Loading the test set...')
test_df = pd.read_csv("C://Users/Administrator/Desktop/ml-100k/u3.test", sep='\t', names=title)
test_df.head()
predictions = []
targets = []
print('Test set size: %d' % len(test_df))
print('Predicting with top-K collaborative filtering...')
k = 20
print('Chosen K: %d.' % k)
for row in test_df.itertuples():
    user, item, actual = row[1]-1, row[2]-1, row[3]
    predictions.append(predict_topkCF(user, item, k))
    targets.append(actual)
print('Test RMSE: %.4f' % rmse(np.array(predictions), np.array(targets)))
------ Top-k collaborative filtering (item-based + baseline) ------
Loading the test set...
Test set size: 20000
Predicting with top-K collaborative filtering...
Chosen K: 20.
Test RMSE: 0.7799
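The line that builds choice in predict_topkCF does the neighbor selection. The sketch below isolates that step on made-up data so the effect of different K values is easy to see; `topk_choice`, `sims` and `rated_idx` are names introduced here for illustration.

```python
import numpy as np

def topk_choice(sims, rated_idx, k):
    """Indices of the k rated items most similar to the target item."""
    # Sort the rated items by similarity (descending) and keep the first k.
    order = sims[rated_idx].argsort()[::-1][:k]
    return rated_idx[order]

# Hypothetical similarities of five items to the target item, and the
# indices of the items one user has rated.
sims = np.array([0.9, 0.1, 0.6, 0.3, 0.8])
rated_idx = np.array([0, 1, 3, 4])

for k in (1, 2, 3):
    print(k, topk_choice(sims, rated_idx, k))
```

Item 2 is similar to the target but never selected, because the user has not rated it. When k exceeds the number of rated items, the slice simply returns all of them, so predict_topkCF falls back to the plain baseline-adjusted prediction for users with few ratings.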