Today I read a nice tutorial on collaborative filtering. It mainly covers the following two algorithms:
Implementing your own recommender systems in Python
- Memory-Based Collaborative Filtering
- Model-Based Collaborative Filtering
After coding along with it, I found there is also a Chinese translation. The meaning is basically the same, though it still reads slightly machine-translated:
在Python中实现你自己的推荐系统
Memory-based collaborative filtering comes in two flavors: user-item filtering and item-item filtering. In the original article's words:
- Item-Item Collaborative Filtering: “Users who liked this item also liked …”
- User-Item Collaborative Filtering: “Users who are similar to you also liked …”
That introduction should be clear enough.
Let's go straight to the code. The original appears to have been written for Python 2; porting it to a Python 3 environment takes only minor changes.
Loading the data
import numpy as np
import pandas as pd
# load the data
header = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('D:/PythonSource/ml-100k/u.data', sep='\t', names=header)
n_users = df.user_id.unique().shape[0]
n_items = df.item_id.unique().shape[0]
print('Number of users = ' + str(n_users) + ', number of items = ' + str(n_items))
Number of users = 943, number of items = 1682
Splitting the dataset
# split the dataset (sklearn's old cross_validation module has been removed; use model_selection)
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(df, test_size=0.25)

# create 2 user-item matrices, one for training and one for testing
train_data_matrix = np.zeros((n_users, n_items))
for line in train_data.itertuples():
    # user and item IDs in the data start at 1, so shift to 0-based indices
    train_data_matrix[line[1] - 1, line[2] - 1] = line[3]

test_data_matrix = np.zeros((n_users, n_items))
for line in test_data.itertuples():
    test_data_matrix[line[1] - 1, line[2] - 1] = line[3]
Computing similarity (cosine)
from sklearn.metrics.pairwise import pairwise_distances
user_similarity=pairwise_distances(train_data_matrix,metric='cosine')
item_similarity=pairwise_distances(train_data_matrix.T,metric='cosine')
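Worth noting (my own observation, not from the article): `pairwise_distances` with `metric='cosine'` returns the cosine *distance*, i.e. 1 − cosine similarity, so the two matrices above are really distance matrices. A small self-contained check, on a made-up toy ratings matrix; if you want genuine similarities you can subtract from 1:

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

# toy ratings matrix: 3 users x 4 items (0 = unrated)
ratings = np.array([[5., 3., 0., 1.],
                    [4., 0., 0., 1.],
                    [1., 1., 5., 4.]])

# pairwise_distances returns cosine distance = 1 - cosine similarity
user_distance = pairwise_distances(ratings, metric='cosine')
user_similarity = 1 - user_distance  # convert back to similarity

# a row compared with itself has distance 0, hence similarity 1 on the diagonal
print(np.allclose(np.diag(user_similarity), 1.0))
```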
Computing predictions
# prediction function
def predict(ratings, similarity, type='user'):
    if type == 'user':
        mean_user_rating = ratings.mean(axis=1)
        ratings_diff = ratings - mean_user_rating[:, np.newaxis]
        pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    elif type == 'item':
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
    return pred
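To see what the mean-centering in the user branch does, here is a tiny hand-checkable example (toy numbers, not from the dataset): each user's mean rating is subtracted before the weighted sum and added back afterwards, which corrects for users who rate systematically high or low.

```python
import numpy as np

# two toy users, three items
ratings = np.array([[4., 0., 2.],
                    [5., 1., 3.]])

# per-user mean: [2., 3.]
mean_user_rating = ratings.mean(axis=1)

# np.newaxis turns the 1-D means into a column so broadcasting
# subtracts each user's own mean from that user's row
ratings_diff = ratings - mean_user_rating[:, np.newaxis]
print(ratings_diff)  # both rows become [ 2., -2., 0.]
```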
Computing the user-based prediction involves transposing an array for the normalization step; for background on user- vs. item-based recommenders, see:
Collaborative filtering using RapidMiner: user vs. item recommenders
Outputting predictions
# output the results
item_prediction=predict(train_data_matrix,item_similarity,type='item')
np.savetxt('D:/PythonSource/item_prediction.csv',item_prediction,delimiter=',')
user_prediction=predict(train_data_matrix,user_similarity,type='user')
np.savetxt('D:/PythonSource/user_prediction.csv',user_prediction,delimiter=',')
Evaluating accuracy
# evaluate accuracy
from sklearn.metrics import mean_squared_error
from math import sqrt
def rmse(prediction, ground_truth):
    # only evaluate entries that are actually rated in the ground truth
    prediction = prediction[ground_truth.nonzero()].flatten()
    ground_truth = ground_truth[ground_truth.nonzero()].flatten()
    return sqrt(mean_squared_error(prediction, ground_truth))
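Before running it on the real matrices, a quick sanity check of `rmse` on toy matrices (made-up values): zeros in `ground_truth` mark unrated entries, and `nonzero()` excludes them from the error.

```python
import numpy as np
from math import sqrt
from sklearn.metrics import mean_squared_error

def rmse(prediction, ground_truth):
    # only evaluate entries that are actually rated in the ground truth
    prediction = prediction[ground_truth.nonzero()].flatten()
    ground_truth = ground_truth[ground_truth.nonzero()].flatten()
    return sqrt(mean_squared_error(prediction, ground_truth))

pred = np.array([[3., 9.], [4., 9.]])   # the 9s sit where ground truth is 0
truth = np.array([[4., 0.], [2., 0.]])  # only the first column is rated
print(rmse(pred, truth))  # sqrt(((3-4)^2 + (4-2)^2) / 2) = sqrt(2.5)
```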
print('user-based CF RMSE='+str(rmse(user_prediction,test_data_matrix)))
print('item-based CF RMSE='+str(rmse(item_prediction,test_data_matrix)))
user-based CF RMSE=3.138256866186845
item-based CF RMSE=3.464855694296178
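The article's second topic, model-based collaborative filtering, is not reproduced above. As a rough sketch (my own, following common practice rather than the original post's exact code): a truncated SVD of the training matrix gives a low-rank reconstruction that serves directly as the prediction matrix. The toy matrix and the choice of k below are assumptions for illustration.

```python
import numpy as np
from scipy.sparse.linalg import svds

# toy matrix standing in for train_data_matrix (users x items, 0 = unrated)
rng = np.random.default_rng(0)
train = rng.integers(0, 6, size=(30, 40)).astype(float)

# truncated SVD: keep only the k strongest latent factors
k = 5
u, s, vt = svds(train, k=k)

# low-rank reconstruction = predicted ratings for every user-item pair
pred = u.dot(np.diag(s)).dot(vt)
print(pred.shape)  # same shape as the input matrix
```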