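The ItemCF algorithm does not use item content attributes to compute the similarity between items; it derives similarity mainly from users' behavior records. The algorithm assumes that each person's interests are confined to a few areas, so when many people are interested in both of two items, the two items are considered highly similar. In other words, items A and B are highly similar because most users who like item A also like item B. Writing $N(i)$ for the set of users who like item $i$, the similarity between items $i$ and $j$ is: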
$$W_{i,j}=\frac{|N(i) \cap N(j)|}{\sqrt{|N(i)|\,|N(j)|}}$$
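As a quick check of the formula, here is a minimal sketch that evaluates it on a toy data set. The data and the helper name `cosine_item_similarity` are made up for illustration, and $N(i)$ is taken to be the set of users who like item $i$.

```python
import math

# Hypothetical toy data: item -> set of users who like it, i.e. N(i).
item_users = {
    "A": {"u1", "u2", "u3"},
    "B": {"u1", "u2", "u4"},
    "C": {"u4"},
}


def cosine_item_similarity(i, j, item_users):
    """W_{i,j} = |N(i) ∩ N(j)| / sqrt(|N(i)| * |N(j)|)."""
    n_i, n_j = item_users[i], item_users[j]
    if not n_i or not n_j:
        return 0.0  # an item nobody likes is similar to nothing
    return len(n_i & n_j) / math.sqrt(len(n_i) * len(n_j))


w_ab = cosine_item_similarity("A", "B", item_users)  # 2 common users, 3 users each
w_ac = cosine_item_similarity("A", "C", item_users)  # no common users
print(w_ab, w_ac)
```

A and B share two of their three users, so they score 2/3, while A and C share none and score 0.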
The interest of user u in an item j is then computed with the following formula:
$$p_{u,j}=\sum_{i \in N(u) \cap S(j,K)} W_{j,i}\, r_{u,i}$$
Here $N(u)$ is the set of items the user likes, $S(j,K)$ is the set of the K items most similar to item j, $W_{j,i}$ is the similarity between items $j$ and $i$, and $r_{u,i}$ is user u's interest in item i.
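The prediction formula can be sketched on a tiny example as follows. The similarity values, the ratings, and the function name `predict_interest` are all hypothetical; the point is only to show how the intersection of $N(u)$ and $S(j,K)$ restricts the sum.

```python
# Hypothetical toy inputs.
# similarity[j][i] plays the role of W_{j,i}; ratings[u][i] plays the role of r_{u,i}.
similarity = {
    "D": {"A": 0.8, "B": 0.5, "C": 0.1},
}
ratings = {"u1": {"A": 5, "C": 2}}


def predict_interest(user, item, k, similarity, ratings):
    """p_{u,j}: sum of W_{j,i} * r_{u,i} over i in N(u) ∩ S(j, K)."""
    # S(j, K): the K items most similar to `item`.
    top_k = sorted(similarity[item].items(), key=lambda kv: kv[1], reverse=True)[:k]
    liked = ratings[user]  # N(u), together with the interest values r_{u,i}
    return sum(w * liked[i] for i, w in top_k if i in liked)


p_top2 = predict_interest("u1", "D", 2, similarity, ratings)
p_top3 = predict_interest("u1", "D", 3, similarity, ratings)
print(p_top2, p_top3)
```

With K = 2 only item A contributes, because C is not among the two nearest neighbours of D and the user has not rated B. With K = 3, item C is included as well.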
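A very active user tells us less about how similar two items are than a less active one, so each common user's contribution can be down-weighted by the logarithm of that user's activity, where $N(u)$ is the set of items user $u$ likes:
$$W_{i,j}=\frac{\sum_{u \in N(i) \cap N(j)} \frac{1}{\log(1+|N(u)|)}}{\sqrt{|N(i)|\,|N(j)|}}$$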
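To make the effect of the activity weighting concrete, here is a small sketch under made-up data. The names `user_items`, `item_users`, and `iuf_similarity` are illustrative assumptions, not taken from the original code.

```python
import math
from collections import defaultdict

# Hypothetical toy data: user -> set of liked items, i.e. N(u).
user_items = {
    "u1": {"A", "B"},
    "u2": {"A", "B", "C", "D", "E", "F", "G", "H"},  # a very active user
    "u3": {"B"},
}

# Derive item -> set of users, i.e. N(i), from the user histories.
item_users = defaultdict(set)
for user, items in user_items.items():
    for item in items:
        item_users[item].add(user)


def iuf_similarity(i, j, item_users, user_items):
    """Cosine-style similarity where each common user counts 1 / log(1 + |N(u)|)."""
    common = item_users[i] & item_users[j]
    weighted = sum(1.0 / math.log(1 + len(user_items[u])) for u in common)
    return weighted / math.sqrt(len(item_users[i]) * len(item_users[j]))


w_ab = iuf_similarity("A", "B", item_users, user_items)
print(w_ab)
```

Both u1 and u2 like A and B, but u2, who likes eight items, adds 1/log(9) to the numerator while u1 adds the larger 1/log(3).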
The similarity matrix can additionally be normalized by dividing each row by its maximum value:
$$W'_{i,j} = \frac{W_{i,j}}{\max_{j} W_{i,j}}$$
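The row-wise normalization can be sketched like this, assuming the similarities are stored as a nested dict. The function name `normalize_rows` and the numbers are hypothetical.

```python
def normalize_rows(w):
    """Divide every similarity in row i by the largest similarity in that row."""
    normalized = {}
    for i, row in w.items():
        row_max = max(row.values(), default=0.0)
        if row_max > 0:
            normalized[i] = {j: value / row_max for j, value in row.items()}
        else:
            normalized[i] = dict(row)  # nothing to scale in an empty or all-zero row
    return normalized


# Hypothetical similarity values.
w = {
    "A": {"B": 0.4, "C": 0.2},
    "B": {"A": 0.4, "C": 0.8},
}
w_norm = normalize_rows(w)
print(w_norm)
```

After normalization the largest entry in every row is 1. Note that the result is generally no longer symmetric: here the A-to-B similarity becomes 1.0 while the B-to-A similarity becomes 0.5.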
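The code below uses a movie rating dataset, which can be downloaded here (extraction code: 7k1w). The data contains 100,000 ratings of 1,682 movies by 943 users. It has already been converted to CSV format and can be read directly with pandas.read_csv.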
import numpy as np
import pandas as pd
from itertools import combinations
from operator import itemgetter
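
def trans_df2dict(df):
    """Convert the ratings DataFrame into a nested dict: {user_id: {movie_id: rating}}."""
    user_rating = dict()  # per-user rating data
    for row in df.values:
        # The first three columns are read as user ID, movie ID, and rating.
        user_id, movie_id, rating = row[0], row[1], row[2]
        if user_id not in user_rating:
            user_rating[user_id] = {}
        user_rating[user_id][movie_id] = rating
    return user_rating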
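
def get_items_similarity(df, item_num):
    """Compute the item-item similarity matrix.

    Returns a dict mapping each movie ID to a list of (movie ID, similarity) pairs.
    Movie IDs are assumed to be the consecutive integers 1..item_num.
    """
    # 1. Build the user -> items table
    inverted_table = df.groupby(by='userId')['moviesId'].agg(list).to_dict()
    # 2. Initialise the co-occurrence matrix; for each user, pair up the items
    #    and add 1 to the matrix for every pair
    W = np.zeros((item_num, item_num))
    # Count how many users have rated each movie
    count_item_users_num = df.groupby(by='moviesId')['userId'].agg('count').to_dict()
    for key, val in inverted_table.items():
        val.sort(reverse=True)  # descending order
        for per in combinations(val, 2):
            W[per[0] - 1][per[1] - 1] += 1
            W[per[1] - 1][per[0] - 1] += 1
    # 3. Turn the co-occurrence counts into similarities
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            W[i][j] /= np.sqrt(count_item_users_num.get(i + 1) * count_item_users_num.get(j + 1))
    # 4. Convert each row into a list of (movie ID, similarity) pairs
    w_dict = {}
    for i in range(W.shape[0]):
        tmp = []
        for index, k in enumerate(W[i]):
            tmp.append((index + 1, k))
        w_dict[i + 1] = tmp
    return w_dict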

def user_interest_with_items(user_id, item_id, K, user_rating, w_dict):
    """Compute how interested the given user is in the given item."""
    interest = 0
    # Walk over the K items most similar to item_id
    for i in sorted(w_dict[item_id], key=itemgetter(1), reverse=True)[0:K]:
        item_index = i[0]
        item_simi = i[1]
        # Only items the user has rated contribute to the score
        if item_index in user_rating[user_id]:
            interest += item_simi * user_rating[user_id][item_index]
    return interest

def get_user_interest_list(user_id, K, user_rating, w_dict):
    """Compute the user's interest list, sorted from most to least interesting."""
    rank = []
    item_id_list = w_dict.keys()
    for item_id in item_id_list:
        # Skip items the user has already rated
        if item_id in user_rating[user_id]:
            continue
        interest = user_interest_with_items(user_id, item_id, K, user_rating, w_dict)
        rank.append((item_id, interest))
    return sorted(rank, key=itemgetter(1), reverse=True)

if __name__ == '__main__':
    df = pd.read_csv('./ml-100k.csv')
    item_num = df.moviesId.nunique()
    user_num = df.userId.nunique()
    user_rating = trans_df2dict(df)
    w_dict = get_items_similarity(df, item_num)
    recommend_list = get_user_interest_list(2, 20, user_rating, w_dict)
    print(recommend_list[0:20])
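
The listing above stores co-occurrence counts in a dense matrix indexed by `movie_id - 1`, which only works when the movie IDs are the consecutive integers starting at 1. As a possible alternative, the sketch below keeps the counts in nested dictionaries, so arbitrary IDs work and only pairs that actually co-occur are stored. The function name `sparse_item_similarity` and the toy data are assumptions for illustration; this is not part of the original code.

```python
import math
from collections import defaultdict
from itertools import combinations


def sparse_item_similarity(user_items):
    """Item-item cosine similarity from {user: [items]}, stored as nested dicts."""
    co_counts = defaultdict(lambda: defaultdict(int))  # co_counts[i][j] = |N(i) ∩ N(j)|
    item_counts = defaultdict(int)                     # item_counts[i] = |N(i)|
    for items in user_items.values():
        unique_items = set(items)  # guard against an item listed twice for a user
        for i in unique_items:
            item_counts[i] += 1
        for i, j in combinations(unique_items, 2):
            co_counts[i][j] += 1
            co_counts[j][i] += 1
    return {
        i: {j: c / math.sqrt(item_counts[i] * item_counts[j]) for j, c in related.items()}
        for i, related in co_counts.items()
    }


# Hypothetical histories with non-consecutive item IDs.
user_items = {1: [10, 20, 500], 2: [10, 20], 3: [20, 500]}
w = sparse_item_similarity(user_items)
print(w[10][20], w[10][500])
```

The trade-off is that dictionary lookups are slower than array indexing, but memory use grows with the number of co-occurring pairs instead of the square of the number of items.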
References:
推荐系统实战 (Recommender Systems in Action)