2018-07-08 Learning Memory-Based Collaborative Filtering

Today I read a tutorial on collaborative filtering that I found quite good: Implementing your own recommender systems in Python. It mainly covers the following two algorithms:

  • Memory-Based Collaborative Filtering
  • Model-Based Collaborative Filtering

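After coding along with it, I found that a Chinese translation also exists. The content is essentially the same, although it reads a little like machine translation:
在Python中实现你自己的推荐系统 (Implementing Your Own Recommender System in Python)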


Memory-based collaborative filtering falls into two categories: user-item filtering and item-item filtering. In the words of the original article:

  • Item-Item Collaborative Filtering: “Users who liked this item also liked …”
  • User-Item Collaborative Filtering: “Users who are similar to you also liked …”

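That description should be clear enough.
On to the code. The original appears to have been written for Python 2; porting it to Python 3 requires few changes.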

Loading the data

import numpy as np
import pandas as pd
# Load the data
header=['user_id','item_id','rating','timestamp']
df=pd.read_csv('D:/PythonSource/ml-100k/u.data',sep='\t',names=header)
n_users=df.user_id.unique().shape[0]
n_items=df.item_id.unique().shape[0]
print('Numbers of users='+str(n_users),'and Numbers of items='+str(n_items))

Numbers of users=943 and Numbers of items=1682
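
The matrix-building code further down indexes rows and columns by `user_id - 1` and `item_id - 1`, which only works if the IDs run contiguously from 1. A minimal sketch of how that could be checked, using a few made-up tab-separated rows in place of the real file (`sample_text`, `sample_df` and the values in it are hypothetical, not taken from `u.data`):

```python
import io
import pandas as pd

# Hypothetical stand-in for a few lines of a tab-separated ratings file
sample_text = (
    "1\t1\t5\t100\n"
    "1\t3\t3\t101\n"
    "2\t2\t4\t102\n"
    "3\t1\t2\t103\n"
)
header = ['user_id', 'item_id', 'rating', 'timestamp']
sample_df = pd.read_csv(io.StringIO(sample_text), sep='\t', names=header)

# nunique() counts distinct values, like unique().shape[0]
sample_n_users = sample_df.user_id.nunique()
sample_n_items = sample_df.item_id.nunique()

# IDs run 1..n exactly when the smallest ID is 1 and the largest equals the count
ids_are_contiguous = bool(
    sample_df.user_id.min() == 1
    and sample_df.user_id.max() == sample_n_users
    and sample_df.item_id.min() == 1
    and sample_df.item_id.max() == sample_n_items
)
print(sample_n_users, sample_n_items, ids_are_contiguous)
```

If the check fails on a given dataset, the IDs would need to be remapped to a dense range before they are used as matrix indices.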

Splitting the dataset

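# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split
# Split the dataset: 75% for training, 25% for testing
train_data,test_data=train_test_split(df,test_size=0.25)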
# create 2 user-item matrices
train_data_matrix=np.zeros((n_users,n_items))
for line in train_data.itertuples():
    # User and item IDs in the data start at 1, so subtract 1 to get 0-based indices
    train_data_matrix[line[1]-1,line[2]-1]=line[3]

test_data_matrix=np.zeros((n_users,n_items))
for line in test_data.itertuples():
    test_data_matrix[line[1]-1,line[2]-1]=line[3]
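
The two loops above fill the matrices one rating at a time. As a sketch of an alternative, NumPy integer-array indexing can do the same assignment in one step. The toy data below is hypothetical, and the sketch assumes each (user, item) pair appears at most once:

```python
import numpy as np

# Hypothetical toy ratings: columns are user_id, item_id, rating (IDs start at 1)
toy_ratings = np.array([
    [1, 1, 5],
    [1, 3, 3],
    [2, 2, 4],
    [3, 1, 2],
])
toy_n_users, toy_n_items = 3, 3

# Loop version, mirroring the per-row assignment used above
loop_matrix = np.zeros((toy_n_users, toy_n_items))
for user_id, item_id, rating in toy_ratings:
    loop_matrix[user_id - 1, item_id - 1] = rating

# Vectorized version: index with whole columns of row and column positions
fast_matrix = np.zeros((toy_n_users, toy_n_items))
fast_matrix[toy_ratings[:, 0] - 1, toy_ratings[:, 1] - 1] = toy_ratings[:, 2]

print(np.array_equal(loop_matrix, fast_matrix))
```

Entries that never receive a rating stay at 0, which is how the rest of the code distinguishes unrated from rated cells.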

Computing similarity (cosine)

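from sklearn.metrics.pairwise import pairwise_distances
# Note: metric='cosine' yields cosine distance (1 - cosine similarity),
# so these matrices hold distances despite their names
user_similarity=pairwise_distances(train_data_matrix,metric='cosine')
item_similarity=pairwise_distances(train_data_matrix.T,metric='cosine')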
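
To make the difference between cosine similarity and cosine distance concrete, here is a minimal NumPy-only sketch. The helper `cosine_similarity_matrix` and the toy matrix are hypothetical and are not part of the original tutorial; the sketch only illustrates the definition distance = 1 - similarity:

```python
import numpy as np

def cosine_similarity_matrix(matrix):
    """Pairwise cosine similarity between the rows of `matrix`."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for all-zero rows
    unit_rows = matrix / norms
    return unit_rows @ unit_rows.T

toy = np.array([
    [5.0, 0.0, 3.0],
    [5.0, 0.0, 3.0],  # identical to row 0
    [0.0, 4.0, 0.0],  # shares no rated item with row 0
])

toy_similarity = cosine_similarity_matrix(toy)
toy_distance = 1.0 - toy_similarity

print(np.round(toy_similarity, 3))
print(np.round(toy_distance, 3))
```

Identical rows have similarity 1 and distance 0, while rows with no overlap have similarity 0 and distance 1, so using one in place of the other inverts which neighbours receive the most weight.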

Computing predictions

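# Prediction function
def predict(ratings,similarity,type='user'):
    if type=='user':
        # Center each user's ratings on that user's row mean (unrated entries count as 0)
        mean_user_rating=ratings.mean(axis=1)
        ratings_diff=(ratings-mean_user_rating[:,np.newaxis])
        # Weighted average of the centered ratings, added back to the user's mean
        pred=mean_user_rating[:,np.newaxis]+similarity.dot(ratings_diff)/np.array([np.abs(similarity).sum(axis=1)]).T
    elif type=='item':
        # Weighted average of each user's ratings across items
        pred=ratings.dot(similarity)/np.array([np.abs(similarity).sum(axis=1)])
    return pred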

The user-based branch involves transposing an array. For an explanation of that step, see:
Collaborative filtering using RapidMiner: user vs. item recommenders
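
To see what the transpose does, it may help to work through the user-based formula on a tiny example. The sketch below uses a hypothetical 2x2 rating matrix and weight matrix, and compares the vectorized expression with an explicit loop. `weight_sums[:, np.newaxis]` plays the same role as `np.array([...]).T` in `predict`: it turns the per-user sums into a column so that each row is divided by its own user's total weight.

```python
import numpy as np

# Hypothetical toy inputs: 2 users x 2 items, and a 2 x 2 user-user weight matrix
toy_ratings = np.array([
    [4.0, 2.0],
    [2.0, 4.0],
])
toy_similarity = np.array([
    [1.0, 0.5],
    [0.5, 1.0],
])

user_means = toy_ratings.mean(axis=1)                # shape (2,)
centered = toy_ratings - user_means[:, np.newaxis]   # subtract each row's mean
weight_sums = np.abs(toy_similarity).sum(axis=1)     # shape (2,)

# Vectorized form: the column vector broadcasts across items
vectorized = (user_means[:, np.newaxis]
              + toy_similarity.dot(centered) / weight_sums[:, np.newaxis])

# Explicit form of the same computation
looped = np.zeros_like(toy_ratings)
for u in range(2):
    for i in range(2):
        weighted = sum(toy_similarity[u, v] * centered[v, i] for v in range(2))
        looped[u, i] = user_means[u] + weighted / weight_sums[u]

print(np.allclose(vectorized, looped))
print(np.round(vectorized, 3))
```

Without the column reshape, the division would broadcast along the wrong axis and divide each column, rather than each row, by a user's weight total.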

Saving the predictions

# Save the results
item_prediction=predict(train_data_matrix,item_similarity,type='item')
np.savetxt('D:/PythonSource/item_prediction.csv',item_prediction,delimiter=',')
user_prediction=predict(train_data_matrix,user_similarity,type='user')
np.savetxt('D:/PythonSource/user_prediction.csv',user_prediction,delimiter=',')

Evaluating accuracy

# Evaluate accuracy (RMSE over the entries that are rated in the test set)
from sklearn.metrics import mean_squared_error
from math import sqrt
def rmse(prediction,ground_truth):
    prediction=prediction[ground_truth.nonzero()].flatten()
    ground_truth=ground_truth[ground_truth.nonzero()].flatten()
    return sqrt(mean_squared_error(prediction,ground_truth))
print('user-based CF RMSE='+str(rmse(user_prediction,test_data_matrix)))
print('item-based CF RMSE='+str(rmse(item_prediction,test_data_matrix)))

user-based CF RMSE=3.138256866186845
item-based CF RMSE=3.464855694296178
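
The `rmse` function only scores cells where the test matrix is non-zero, since a 0 means "no rating" rather than a rating of zero. A minimal NumPy-only sketch of the same idea on hypothetical toy matrices (`masked_rmse` is an illustrative name, not a library function):

```python
import numpy as np

def masked_rmse(prediction, ground_truth):
    """RMSE computed only over the non-zero (rated) entries of ground_truth."""
    observed = ground_truth.nonzero()
    errors = prediction[observed] - ground_truth[observed]
    return float(np.sqrt(np.mean(errors ** 2)))

# Hypothetical toy data: only two of the four cells are rated
toy_truth = np.array([
    [5.0, 0.0],
    [0.0, 3.0],
])
toy_pred = np.array([
    [4.0, 2.5],
    [1.0, 5.0],
])

toy_score = masked_rmse(toy_pred, toy_truth)
print(toy_score)
```

Here the errors on the rated cells are -1 and 2, so the result is the square root of 2.5. The predictions for the unrated cells do not affect the score.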
