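User-based collaborative filtering finds the few users most similar to the target user, collects the items those users liked, removes the ones the target user has already seen, and recommends what is left.
In a typical system, user data comes from traces of past behaviour: search history, ratings, viewing records, and the like.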
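Methods for computing similarity (A and B are the item sets of two users):
1. Cosine similarity
|A ∩ B| / sqrt(|A| * |B|)
2. Jaccard coefficient
|A ∩ B| / |A ∪ B|
The similarity measure can be adapted to suit your needs.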
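As a minimal sketch of the two measures, the snippet below computes both on plain Python sets and checks that the set form of cosine similarity agrees with the usual vector form for 0/1 vectors. The function names `cosine_sim`, `jaccard_sim` and `cosine_vec` are illustrative choices, not part of any library.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two item sets (implicit 0/1 ratings)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def jaccard_sim(a, b):
    """Jaccard coefficient: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

def cosine_vec(x, y):
    """Textbook cosine similarity on two equal-length numeric vectors."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm = math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(yi * yi for yi in y))
    return dot / norm if norm else 0.0

# Two users as 0/1 vectors over five items, and the same users as item sets.
x = [1, 0, 1, 1, 0]
y = [0, 1, 1, 0, 1]
items_x = {i for i, v in enumerate(x) if v}  # {0, 2, 3}
items_y = {i for i, v in enumerate(y) if v}  # {1, 2, 4}

print(cosine_sim(items_x, items_y))   # about 0.333
print(cosine_vec(x, y))               # about 0.333, same up to rounding
print(jaccard_sim(items_x, items_y))  # 0.2
```

The set form avoids building aligned vectors, which matters when the item catalogue is large and each user has rated only a small part of it.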
Data source: a movie ratings dataset from the web, the 1M version.
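import math

import pandas as pd


class Usercf:
    # Outline of the class; the full method bodies follow.
    def __init__(self):
        # ratings.csv must provide the UserID, MovieID and Rating columns
        self.data = pd.read_csv("ratings.csv")

    @staticmethod
    def _cosine_sim(target_movies, movies): ...
    def _get_top_n_users(self, target_user_id, top_n): ...
    def _get_candidates_items(self, target_user_id): ...
    def _get_top_n_items(self, top_n_users, candidates_movies, top_n): ...
    def calculate(self, target_user_id=1, top_n=10): ...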

    @staticmethod
    def _cosine_sim(target_movies, movies):
        '''
        Simple cosine similarity between two users' movie lists.
        e.g. x = [1 0 1 1 0], y = [0 1 1 0 1]
        cosine = (x1*y1 + x2*y2 + ...) / [sqrt(x1^2 + x2^2 + ...) * sqrt(y1^2 + y2^2 + ...)]
        For 0/1 vectors this equals
        intersection_len(movies1, movies2) / sqrt(len(movies1) * len(movies2)).
        Set intersection stands in for the element-wise product, so the arrays
        do not need to be aligned, which simplifies the computation.
        '''
        intersection_len = len(set(target_movies) & set(movies))
        if intersection_len == 0:
            return 0.0
        product = len(target_movies) * len(movies)
        cosine = intersection_len / math.sqrt(product)
        return cosine

    def _get_top_n_users(self, target_user_id, top_n):
        '''
        Calculate the similarity between the target user and every other user,
        and return the top N most similar users as (user_id, similarity) pairs.
        '''
        # movies the target user has watched
        target_movies = self.data[self.data['UserID'] == target_user_id]['MovieID']
        # IDs of all other users
        other_users_id = [i for i in set(self.data['UserID']) if i != target_user_id]
        # movies each of the other users has watched
        other_movies = [self.data[self.data['UserID'] == i]['MovieID'] for i in other_users_id]
        # similarity of each user to the target user, sorted in descending order
        sim_list = [self._cosine_sim(target_movies, movies) for movies in other_movies]
        sim_list = sorted(zip(other_users_id, sim_list), key=lambda x: x[1], reverse=True)
        return sim_list[:top_n]

    def _get_candidates_items(self, target_user_id):
        """
        Find all movies in the source data that the target user has not seen.
        """
        target_user_movies = set(self.data[self.data['UserID'] == target_user_id]['MovieID'])
        other_user_movies = set(self.data[self.data['UserID'] != target_user_id]['MovieID'])
        # set difference, not symmetric difference (^): the latter would also
        # keep movies that only the target user has seen
        candidates_movies = list(other_user_movies - target_user_movies)
        return candidates_movies

    def _get_top_n_items(self, top_n_users, candidates_movies, top_n):
        """
        Calculate an interest score for each candidate movie and return the top n.
        e.g. interest = sum(sim * normalized_rating)
        """
        top_n_user_data = [self.data[self.data['UserID'] == k] for k, _ in top_n_users]
        interest_list = []
        for movie_id in candidates_movies:
            tmp = []
            for user_data in top_n_user_data:
                if movie_id in user_data['MovieID'].values:
                    # normalize the rating by dividing by 5
                    tmp.append(user_data[user_data['MovieID'] == movie_id]['Rating'].values[0] / 5)
                else:
                    # a similar user who has not rated the movie contributes nothing
                    tmp.append(0)
            interest = sum([top_n_users[i][1] * tmp[i] for i in range(len(top_n_users))])
            interest_list.append((movie_id, interest))
        interest_list = sorted(interest_list, key=lambda x: x[1], reverse=True)
        return interest_list[:top_n]

    def calculate(self, target_user_id=1, top_n=10):
        """
        User-CF for movie recommendation.
        """
        # top n most similar users
        top_n_users = self._get_top_n_users(target_user_id, top_n)
        # candidate movies for recommendation
        candidates_movies = self._get_candidates_items(target_user_id)
        # top n movies by interest score
        top_n_movies = self._get_top_n_items(top_n_users, candidates_movies, top_n)
        return top_n_movies
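
To sanity-check the three steps without the ratings file, here is a condensed sketch that runs the same pipeline on a tiny in-memory table. The `ratings` data and the `recommend` function are hypothetical and only for illustration. Unlike the class above, it scores only movies that at least one neighbour has rated, since every other candidate would get an interest of 0.

```python
import math

import pandas as pd

# Hypothetical toy ratings with the same columns as ratings.csv.
ratings = pd.DataFrame({
    "UserID":  [1, 1, 2, 2, 2, 3, 3, 4],
    "MovieID": [10, 20, 10, 20, 30, 10, 40, 50],
    "Rating":  [5, 4, 4, 5, 5, 3, 2, 4],
})

def recommend(data, target_user_id, top_n=2):
    # Movies seen by each user, as sets.
    seen = {uid: set(movies) for uid, movies in data.groupby("UserID")["MovieID"]}
    target = seen[target_user_id]

    # Step 1: cosine similarity of every other user to the target user.
    sims = {
        uid: len(target & movies) / math.sqrt(len(target) * len(movies))
        for uid, movies in seen.items()
        if uid != target_user_id
    }
    neighbours = sorted(sims.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

    # Steps 2 and 3: score unseen movies with sum(sim * rating / 5).
    scores = {}
    for uid, sim in neighbours:
        rows = data[data["UserID"] == uid]
        for movie_id, rating in zip(rows["MovieID"], rows["Rating"]):
            if movie_id not in target:
                scores[movie_id] = scores.get(movie_id, 0.0) + sim * rating / 5
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# User 1 has seen movies 10 and 20; users 2 and 3 are the nearest neighbours.
print(recommend(ratings, target_user_id=1))  # movie 30 first, then movie 40
```

User 2 shares both of user 1's movies and rated movie 30 highly, so movie 30 ranks first; movie 40 comes from the less similar user 3 with a low rating and ranks second.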
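Problems with the CF algorithm:
Cold start:
A brand-new system has no usage records, so the model cannot be used at all. Likewise, when a new user joins an existing system, the model cannot make recommendations for that user.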
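The effect can be seen in a small sketch with hypothetical viewing histories: a user with an empty history has zero similarity to everyone, so there are no meaningful neighbours to borrow recommendations from. The names `history` and `cosine_sim` are illustrative only.

```python
import math

def cosine_sim(a, b):
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0  # an empty history shares nothing with anyone
    return len(a & b) / math.sqrt(len(a) * len(b))

# Hypothetical viewing histories; user "new" has just signed up.
history = {
    "alice": {10, 20, 30},
    "bob": {20, 30, 40},
    "new": set(),
}

sims = {
    user: cosine_sim(history["new"], movies)
    for user, movies in history.items()
    if user != "new"
}
print(sims)  # {'alice': 0.0, 'bob': 0.0}
```

With every similarity at 0, every candidate movie ends up with an interest of 0, so any ranking is arbitrary.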
Solutions: