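As machine learning has matured, recommender systems have increasingly adopted machine learning ideas, and there are many ways to apply machine learning to recommendation. The following is a rough classification of Model-Based CF algorithms:
Next we focus on the following widely used approaches:
Collaborative filtering based on K nearest neighbors is essentially MemoryBased CF, with the added restriction that only the K nearest neighbors are used when selecting neighbors.
Here we build directly on the MemoryBased CF code
and modify the following places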
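class CollaborativeFiltering(object):

    based = None

    def __init__(self, k=40, rules=None, use_cache=False, standard=None):
        '''
        :param k: number of nearest neighbors used for prediction
        :param rules: filtering rule, must be one of "unhot", "rated", ["unhot", "rated"], None; anything else raises an exception
        :param use_cache: whether to cache the similarity results
        :param standard: rating normalization method: None (no normalization), "mean" (mean centering), "zscore" (Z-score standardization)
        '''
        # Use the k passed by the caller (the original hard-coded 40 here)
        self.k = k
        self.rules = rules
        self.use_cache = use_cache
        self.standard = standard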
Change every place where neighbors are selected so that the K most similar neighbors are chosen by similarity:
similar_users = self.similar[uid].drop([uid]).dropna().sort_values(ascending=False)[:self.k]
similar_items = self.similar[iid].drop([iid]).dropna().sort_values(ascending=False)[:self.k]
However, because our raw dataset is small, the KNN variant here performs worse than plain MemoryBased CF.
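The neighbor selection above can be tried in isolation. The following is a minimal sketch, assuming a square pandas similarity matrix; the helper name `top_k_neighbors` and the similarity values are made up for illustration and are not part of the original class.

```python
import pandas as pd


def top_k_neighbors(similar: pd.DataFrame, target, k: int) -> pd.Series:
    """Return the k most similar neighbors of `target`, excluding itself."""
    return (
        similar[target]
        .drop([target])                 # a user/item is not its own neighbor
        .dropna()                       # skip pairs with undefined similarity
        .sort_values(ascending=False)[:k]
    )


# Hypothetical 4x4 user-user similarity matrix (values are invented).
nan = float("nan")
similar = pd.DataFrame(
    [[1.0, 0.9, 0.2, nan],
     [0.9, 1.0, 0.5, 0.1],
     [0.2, 0.5, 1.0, 0.7],
     [nan, 0.1, 0.7, 1.0]],
    index=["A", "B", "C", "D"],
    columns=["A", "B", "C", "D"],
)

neighbors = top_k_neighbors(similar, "A", k=2)
print(neighbors)
```

With a small `k`, users who have few valid similarities simply get fewer neighbors, which is one plausible reason the restriction can hurt on a small dataset.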
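If we treat a rating as a continuous value rather than a discrete one, we can borrow the idea of linear regression to predict the target user's rating of an item. One such strategy is called Baseline (baseline prediction).
The Baseline design rests on the following assumptions:
The amount by which a user or an item is generally above or below the average is called the bias.
Baseline goal:
The steps for predicting a rating with the Baseline approach are:
Compute the average rating over all movies, $\mu$ (the global mean rating)
Compute each user's bias $b_u$ relative to the mean rating $\mu$
Compute each movie's bias $b_i$, based on the ratings it received, relative to the mean rating $\mu$
Predict the user's rating of the movie:
$$\hat{r}_{ui} = b_{ui} = \mu + b_u + b_i$$
Example:
Suppose we want to use Baseline to predict user A's rating of the movie "Forrest Gump". First we compute the mean rating of the whole dataset, $\mu$, which is 3.5. User A is a demanding rater whose ratings are generally 0.5 below the mean, so user A's bias $b_u$ is -0.5. "Forrest Gump" is a popular and well-reviewed movie whose ratings are generally 1.2 above the mean, so its bias $b_i$ is +1.2. The predicted rating of user A for "Forrest Gump" is therefore $3.5+(-0.5)+1.2$, that is, 4.2.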
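The arithmetic in this example can be written as a tiny function. This is only a sketch: the function name `baseline_predict` is hypothetical, and clipping the result to a 0.5 to 5.0 rating scale is an added assumption that the text does not make.

```python
def baseline_predict(mu, b_u, b_i, low=0.5, high=5.0):
    """Baseline prediction mu + b_u + b_i, clipped to an assumed rating scale."""
    raw = mu + b_u + b_i
    # Clipping is an assumption for illustration; the formula itself is just the sum.
    return min(max(raw, low), high)


# Values from the example: global mean 3.5, user bias -0.5, movie bias +1.2.
prediction = baseline_predict(3.5, -0.5, 1.2)
print(round(prediction, 2))
```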
The mean rating $\mu$ over all movies can be computed directly, so the real problem is to estimate $b_u$ for each user and $b_i$ for each movie.
For a linear regression problem, we can build the loss function from the squared error:
$$
\begin{split}
Cost &= \sum_{u,i\in R}(r_{ui}-\hat{r}_{ui})^2 \\
&=\sum_{u,i\in R}(r_{ui}-\mu-b_u-b_i)^2
\end{split}
$$
Adding L2 regularization:
$$Cost=\sum_{u,i\in R}(r_{ui}-\mu-b_u-b_i)^2 + \lambda*(\sum_u {b_u}^2 + \sum_i {b_i}^2)$$
Formula breakdown:
To solve this minimization problem, we usually optimize with stochastic gradient descent or alternating least squares.
Estimating the Baseline biases with stochastic gradient descent
$$
\begin{split}
&J(\theta)=Cost=f(b_u, b_i)\\
\\
&J(\theta)=\sum_{u,i\in R}(r_{ui}-\mu-b_u-b_i)^2 + \lambda*(\sum_u {b_u}^2 + \sum_i {b_i}^2)
\end{split}
$$
$$\theta_j:=\theta_j-\alpha\cfrac{\partial }{\partial \theta_j}J(\theta)$$
Deriving the partial derivative of the loss function:
$$
\begin{split}
\cfrac{\partial}{\partial b_u} J(\theta)&=\cfrac{\partial}{\partial b_u} f(b_u, b_i) \\
&=2\sum_{u,i\in R}(r_{ui}-\mu-b_u-b_i)(-1) + 2\lambda{b_u} \\
&=-2\sum_{u,i\in R}(r_{ui}-\mu-b_u-b_i) + 2\lambda*b_u
\end{split}
$$
Update of $b_u$ (the factor 2 can be dropped because it can be absorbed into the learning rate $\alpha$, which we choose ourselves):
$$
\begin{split}
b_u&:=b_u - \alpha*(-\sum_{u,i\in R}(r_{ui}-\mu-b_u-b_i) + \lambda * b_u)\\
&:=b_u + \alpha*(\sum_{u,i\in R}(r_{ui}-\mu-b_u-b_i) - \lambda* b_u)
\end{split}
$$
In the same way, the gradient descent update for $b_i$ is:
$$b_i:=b_i + \alpha*(\sum_{u,i\in R}(r_{ui}-\mu-b_u-b_i) -\lambda*b_i)$$
Stochastic gradient descent updates the parameters using the loss of one sample at a time instead of the sum of the losses over all samples, so with SGD we have:
Single-sample error:
$$
\begin{split}
error &=r_{ui}-\hat{r}_{ui} \\
&= r_{ui}-(\mu+b_u+b_i) \\
&= r_{ui}-\mu-b_u-b_i
\end{split}
$$
Parameter update:
$$
\begin{split}
b_u&:=b_u + \alpha*((r_{ui}-\mu-b_u-b_i) -\lambda*b_u) \\
&:=b_u + \alpha*(error - \lambda*b_u) \\
\\
b_i&:=b_i + \alpha*((r_{ui}-\mu-b_u-b_i) -\lambda*b_i)\\
&:=b_i + \alpha*(error -\lambda*b_i)
\end{split}
$$
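To make the single-sample update concrete, here is a small sketch that applies it twice to one rating. The function name `sgd_step` and all numbers are invented for illustration; only the update rule itself comes from the derivation.

```python
def sgd_step(r_ui, mu, b_u, b_i, alpha, lam):
    """Apply one single-sample SGD update to the user and item biases."""
    error = r_ui - (mu + b_u + b_i)
    # Both biases are updated with the same error, computed before either changes.
    new_b_u = b_u + alpha * (error - lam * b_u)
    new_b_i = b_i + alpha * (error - lam * b_i)
    return new_b_u, new_b_i, error


# Made-up sample: true rating 4.5, global mean 3.5, biases start at 0.
b_u1, b_i1, err1 = sgd_step(4.5, 3.5, 0.0, 0.0, alpha=0.1, lam=0.1)
b_u2, b_i2, err2 = sgd_step(4.5, 3.5, b_u1, b_i1, alpha=0.1, lam=0.1)
print(b_u1, b_i1, err1)
print(b_u2, b_i2, err2)
```

The error shrinks from the first step to the second, which is the behavior one would expect from repeated updates on the same sample.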
import pandas as pd
import numpy as np


class BaselineCFBySGD(object):

    def __init__(self, number_epochs, alpha, reg, columns=["uid", "iid", "rating"]):
        # Maximum number of gradient-descent epochs
        self.number_epochs = number_epochs
        # Learning rate
        self.alpha = alpha
        # Regularization parameter
        self.reg = reg
        # Names of the user, item and rating columns in the dataset
        self.columns = columns

    def fit(self, dataset):
        '''
        :param dataset: DataFrame with uid, iid, rating columns
        :return:
        '''
        self.dataset = dataset
        # Ratings grouped by user
        self.users_ratings = dataset.groupby(self.columns[0]).agg([list])[[self.columns[1], self.columns[2]]]
        # Ratings grouped by item
        self.items_ratings = dataset.groupby(self.columns[1]).agg([list])[[self.columns[0], self.columns[2]]]
        # Global mean rating
        self.global_mean = self.dataset[self.columns[2]].mean()
        # Train the model parameters with the sgd method
        self.bu, self.bi = self.sgd()

    def sgd(self):
        '''
        Optimize bu and bi with stochastic gradient descent
        :return: bu, bi
        '''
        # Initialize every bu and bi to 0
        bu = dict(zip(self.users_ratings.index, np.zeros(len(self.users_ratings))))
        bi = dict(zip(self.items_ratings.index, np.zeros(len(self.items_ratings))))

        for i in range(self.number_epochs):
            print("iter%d" % i)
            for uid, iid, real_rating in self.dataset.itertuples(index=False):
                error = real_rating - (self.global_mean + bu[uid] + bi[iid])

                bu[uid] += self.alpha * (error - self.reg * bu[uid])
                bi[iid] += self.alpha * (error - self.reg * bi[iid])

        return bu, bi

    def predict(self, uid, iid):
        predict_rating = self.global_mean + self.bu[uid] + self.bi[iid]
        return predict_rating


if __name__ == '__main__':
    dtype = [("userId", np.int32), ("movieId", np.int32), ("rating", np.float32)]
    dataset = pd.read_csv("datasets/ml-latest-small/ratings.csv", usecols=range(3), dtype=dict(dtype))

    bcf = BaselineCFBySGD(20, 0.1, 0.1, ["userId", "movieId", "rating"])
    bcf.fit(dataset)

    while True:
        uid = int(input("uid: "))
        iid = int(input("iid: "))
        print(bcf.predict(uid, iid))
Add a test method, then compute the accuracy metrics with the accuray method implemented earlier
import pandas as pd
import numpy as np


def data_split(data_path, x=0.8, random=False):
    '''
    Split the dataset. To keep the number of users unchanged, each user's ratings are split proportionally.
    :param data_path: path to the dataset
    :param x: proportion of the training set, e.g. x=0.8 leaves 0.2 for the test set
    :param random: whether to split randomly, default False
    :return: trainset, testset
    '''
    print("Splitting dataset...")
    # Types of the columns to load
    dtype = {"userId": np.int32, "movieId": np.int32, "rating": np.float32}
    # Load only the first three columns: user ID, movie ID and the user's rating of the movie
    ratings = pd.read_csv(data_path, dtype=dtype, usecols=range(3))

    testset_index = []
    # Group by userId so that every user has data in both the training and test sets
    for uid in ratings.groupby("userId").any().index:
        user_rating_data = ratings.where(ratings["userId"]==uid).dropna()
        if random:
            # shuffle works in place and cannot act on an immutable index, so convert to a list first
            index = list(user_rating_data.index)
            np.random.shuffle(index)    # shuffle the list
            _index = round(len(user_rating_data) * x)
            testset_index += list(index[_index:])
        else:
            # Use the first x share of each user's ratings for training and the rest for testing
            index = round(len(user_rating_data) * x)
            testset_index += list(user_rating_data.index.values[index:])

    testset = ratings.loc[testset_index]
    trainset = ratings.drop(testset_index)
    print("Dataset split finished...")
    return trainset, testset


def accuray(predict_results, method="all"):
    '''
    Compute accuracy metrics
    :param predict_results: prediction results, a container whose elements are sequences of uid, iid, real_rating, pred_rating
    :param method: metric name as a string, "rmse" or "mae"; any other value returns both rmse and mae
    :return:
    '''

    def rmse(predict_results):
        '''
        RMSE metric
        :param predict_results:
        :return: rmse
        '''
        length = 0
        _rmse_sum = 0
        for uid, iid, real_rating, pred_rating in predict_results:
            length += 1
            _rmse_sum += (pred_rating - real_rating) ** 2
        return round(np.sqrt(_rmse_sum / length), 4)

    def mae(predict_results):
        '''
        MAE metric
        :param predict_results:
        :return: mae
        '''
        length = 0
        _mae_sum = 0
        for uid, iid, real_rating, pred_rating in predict_results:
            length += 1
            _mae_sum += abs(pred_rating - real_rating)
        return round(_mae_sum / length, 4)

    def rmse_mae(predict_results):
        '''
        RMSE and MAE metrics
        :param predict_results:
        :return: rmse, mae
        '''
        length = 0
        _rmse_sum = 0
        _mae_sum = 0
        for uid, iid, real_rating, pred_rating in predict_results:
            length += 1
            _rmse_sum += (pred_rating - real_rating) ** 2
            _mae_sum += abs(pred_rating - real_rating)
        return round(np.sqrt(_rmse_sum / length), 4), round(_mae_sum / length, 4)

    # Return the result in every branch (the single-metric branches previously dropped it)
    if method.lower() == "rmse":
        return rmse(predict_results)
    elif method.lower() == "mae":
        return mae(predict_results)
    else:
        return rmse_mae(predict_results)


class BaselineCFBySGD(object):

    def __init__(self, number_epochs, alpha, reg, columns=["uid", "iid", "rating"]):
        # Maximum number of gradient-descent epochs
        self.number_epochs = number_epochs
        # Learning rate
        self.alpha = alpha
        # Regularization parameter
        self.reg = reg
        # Names of the user, item and rating columns in the dataset
        self.columns = columns

    def fit(self, dataset):
        '''
        :param dataset: DataFrame with uid, iid, rating columns
        :return:
        '''
        self.dataset = dataset
        # Ratings grouped by user
        self.users_ratings = dataset.groupby(self.columns[0]).agg([list])[[self.columns[1], self.columns[2]]]
        # Ratings grouped by item
        self.items_ratings = dataset.groupby(self.columns[1]).agg([list])[[self.columns[0], self.columns[2]]]
        # Global mean rating
        self.global_mean = self.dataset[self.columns[2]].mean()
        # Train the model parameters with the sgd method
        self.bu, self.bi = self.sgd()

    def sgd(self):
        '''
        Optimize bu and bi with stochastic gradient descent
        :return: bu, bi
        '''
        # Initialize every bu and bi to 0
        bu = dict(zip(self.users_ratings.index, np.zeros(len(self.users_ratings))))
        bi = dict(zip(self.items_ratings.index, np.zeros(len(self.items_ratings))))

        for i in range(self.number_epochs):
            print("iter%d" % i)
            for uid, iid, real_rating in self.dataset.itertuples(index=False):
                error = real_rating - (self.global_mean + bu[uid] + bi[iid])

                bu[uid] += self.alpha * (error - self.reg * bu[uid])
                bi[iid] += self.alpha * (error - self.reg * bi[iid])

        return bu, bi

    def predict(self, uid, iid):
        '''Predict a rating'''
        if iid not in self.items_ratings.index:
            raise Exception("Cannot predict the rating of user <{uid}> for movie <{iid}>: movie <{iid}> is missing from the training set".format(uid=uid, iid=iid))

        predict_rating = self.global_mean + self.bu[uid] + self.bi[iid]
        return predict_rating

    def test(self, testset):
        '''Predict every rating in the test set'''
        for uid, iid, real_rating in testset.itertuples(index=False):
            try:
                pred_rating = self.predict(uid, iid)
            except Exception as e:
                print(e)
            else:
                yield uid, iid, real_rating, pred_rating


if __name__ == '__main__':
    trainset, testset = data_split("datasets/ml-latest-small/ratings.csv", random=True)

    bcf = BaselineCFBySGD(20, 0.1, 0.1, ["userId", "movieId", "rating"])
    bcf.fit(trainset)

    pred_results = bcf.test(testset)

    rmse, mae = accuray(pred_results)

    print("rmse: ", rmse, "mae: ", mae)
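As a cross-check on the loop-based metrics, the same two quantities can be computed with NumPy array operations. This is a sketch under the assumption that the results fit in memory as a list; the function name `rmse_mae_vectorized` and the sample tuples are hypothetical.

```python
import numpy as np


def rmse_mae_vectorized(predict_results):
    """Compute (rmse, mae) from (uid, iid, real_rating, pred_rating) tuples."""
    rows = list(predict_results)  # also consumes a generator such as test()
    real = np.array([row[2] for row in rows], dtype=float)
    pred = np.array([row[3] for row in rows], dtype=float)
    diff = pred - real
    rmse = float(np.sqrt(np.mean(diff ** 2)))
    mae = float(np.mean(np.abs(diff)))
    return rmse, mae


# Made-up predictions: errors of -1 and +2.
sample = [(1, 10, 4.0, 3.0), (1, 11, 2.0, 4.0)]
sample_rmse, sample_mae = rmse_mae_vectorized(sample)
print(round(sample_rmse, 4), round(sample_mae, 4))
```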
Estimating the Baseline biases with alternating least squares
Like gradient descent, the least squares method can be used to find an extremum.
The idea of least squares: take the partial derivatives of the loss function, then set them to 0.
Again, the loss function is:
$$J(\theta)=\sum_{u,i\in R}(r_{ui}-\mu-b_u-b_i)^2 + \lambda*(\sum_u {b_u}^2 + \sum_i {b_i}^2)$$
Taking the partial derivative of the loss function:
$$\cfrac{\partial}{\partial b_u} f(b_u, b_i) =-2 \sum_{u,i\in R}(r_{ui}-\mu-b_u-b_i) + 2\lambda * b_u$$
Setting the partial derivative to 0 gives:
$$
\begin{split}
\sum_{u,i\in R}(r_{ui}-\mu-b_u-b_i) &= \lambda* b_u \\
\sum_{u,i\in R}(r_{ui}-\mu-b_i) &= \sum_{u,i\in R} b_u+\lambda * b_u
\end{split}
$$
To simplify the formula, let $\sum_{u,i\in R} b_u \approx |R(u)|*b_u$, that is, assume the bias is the same in every term, which gives:
$$b_u := \cfrac {\sum_{u,i\in R}(r_{ui}-\mu-b_i)}{\lambda_1 + |R(u)|}$$
where $|R(u)|$ is the number of ratings given by user $u$
In the same way:
$$b_i := \cfrac {\sum_{u,i\in R}(r_{ui}-\mu-b_u)}{\lambda_2 + |R(i)|}$$
where $|R(i)|$ is the number of ratings received by item $i$
$b_u$ and $b_i$ are the biases of users and items respectively, so their regularization parameters can be set independently ($\lambda_1$ and $\lambda_2$).
The least squares derivation gives us expressions for $b_u$ and $b_i$, but each expression contains the other quantity. We therefore compute them with a method called alternating least squares: fix one set of biases, solve for the other, then swap and repeat.
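One way to convince yourself of the closed-form update is to check numerically that it zeroes the partial derivative when the other biases are held fixed. The sketch below does this for a single user; the helper names and the residual values are invented for illustration.

```python
def als_update_bias(residuals, reg):
    """Closed-form bias: sum of residuals divided by (reg + number of ratings).

    `residuals` holds r_ui - mu - b_i for every rating of one user
    (or r_ui - mu - b_u for every rating of one item).
    """
    return sum(residuals) / (reg + len(residuals))


def grad_wrt_bias(residuals, reg, b):
    """Partial derivative of the regularized loss with respect to this bias."""
    return -2 * sum(res - b for res in residuals) + 2 * reg * b


# Hypothetical residuals for one user who rated four items.
residuals = [0.5, -1.0, 1.5, 1.0]
b_u = als_update_bias(residuals, reg=25)
gradient = grad_wrt_bias(residuals, reg=25, b=b_u)
print(b_u, gradient)
```

A larger `reg` pulls the bias toward 0, which matters most for users or items with only a few ratings.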
import pandas as pd
import numpy as np


class BaselineCFByALS(object):

    def __init__(self, number_epochs, reg_bu, reg_bi, columns=["uid", "iid", "rating"]):
        # Maximum number of ALS iterations
        self.number_epochs = number_epochs
        # Regularization parameter for bu
        self.reg_bu = reg_bu
        # Regularization parameter for bi
        self.reg_bi = reg_bi
        # Names of the user, item and rating columns in the dataset
        self.columns = columns

    def fit(self, dataset):
        '''
        :param dataset: DataFrame with uid, iid, rating columns
        :return:
        '''
        self.dataset = dataset
        # Ratings grouped by user
        self.users_ratings = dataset.groupby(self.columns[0]).agg([list])[[self.columns[1], self.columns[2]]]
        # Ratings grouped by item
        self.items_ratings = dataset.groupby(self.columns[1]).agg([list])[[self.columns[0], self.columns[2]]]
        # Global mean rating
        self.global_mean = self.dataset[self.columns[2]].mean()
        # Train the model parameters with the als method
        self.bu, self.bi = self.als()

    def als(self):
        '''
        Optimize bu and bi with alternating least squares
        :return: bu, bi
        '''
        # Initialize every bu and bi to 0
        bu = dict(zip(self.users_ratings.index, np.zeros(len(self.users_ratings))))
        bi = dict(zip(self.items_ratings.index, np.zeros(len(self.items_ratings))))

        for i in range(self.number_epochs):
            print("iter%d" % i)
            # Fix bu and solve for every bi
            for iid, uids, ratings in self.items_ratings.itertuples(index=True):
                _sum = 0
                for uid, rating in zip(uids, ratings):
                    _sum += rating - self.global_mean - bu[uid]
                bi[iid] = _sum / (self.reg_bi + len(uids))

            # Fix bi and solve for every bu
            for uid, iids, ratings in self.users_ratings.itertuples(index=True):
                _sum = 0
                for iid, rating in zip(iids, ratings):
                    _sum += rating - self.global_mean - bi[iid]
                bu[uid] = _sum / (self.reg_bu + len(iids))
        return bu, bi

    def predict(self, uid, iid):
        predict_rating = self.global_mean + self.bu[uid] + self.bi[iid]
        return predict_rating


if __name__ == '__main__':
    dtype = [("userId", np.int32), ("movieId", np.int32), ("rating", np.float32)]
    dataset = pd.read_csv("datasets/ml-latest-small/ratings.csv", usecols=range(3), dtype=dict(dtype))

    bcf = BaselineCFByALS(20, 25, 15, ["userId", "movieId", "rating"])
    bcf.fit(dataset)

    while True:
        uid = int(input("uid: "))
        iid = int(input("iid: "))
        print(bcf.predict(uid, iid))
import pandas as pd
import numpy as np
# data_split and accuray are identical to the definitions in the SGD evaluation listing above and are reused here unchanged.


class BaselineCFByALS(object):

    def __init__(self, number_epochs, reg_bu, reg_bi, columns=["uid", "iid", "rating"]):
        # Maximum number of ALS iterations
        self.number_epochs = number_epochs
        # Regularization parameter for bu
        self.reg_bu = reg_bu
        # Regularization parameter for bi
        self.reg_bi = reg_bi
        # Names of the user, item and rating columns in the dataset
        self.columns = columns

    def fit(self, dataset):
        '''
        :param dataset: DataFrame with uid, iid, rating columns
        :return:
        '''
        self.dataset = dataset
        # Ratings grouped by user
        self.users_ratings = dataset.groupby(self.columns[0]).agg([list])[[self.columns[1], self.columns[2]]]
        # Ratings grouped by item
        self.items_ratings = dataset.groupby(self.columns[1]).agg([list])[[self.columns[0], self.columns[2]]]
        # Global mean rating
        self.global_mean = self.dataset[self.columns[2]].mean()
        # Train the model parameters with the als method
        self.bu, self.bi = self.als()

    def als(self):
        '''
        Optimize bu and bi with alternating least squares
        :return: bu, bi
        '''
        # Initialize every bu and bi to 0
        bu = dict(zip(self.users_ratings.index, np.zeros(len(self.users_ratings))))
        bi = dict(zip(self.items_ratings.index, np.zeros(len(self.items_ratings))))

        for i in range(self.number_epochs):
            print("iter%d" % i)
            # Fix bu and solve for every bi
            for iid, uids, ratings in self.items_ratings.itertuples(index=True):
                _sum = 0
                for uid, rating in zip(uids, ratings):
                    _sum += rating - self.global_mean - bu[uid]
                bi[iid] = _sum / (self.reg_bi + len(uids))

            # Fix bi and solve for every bu
            for uid, iids, ratings in self.users_ratings.itertuples(index=True):
                _sum = 0
                for iid, rating in zip(iids, ratings):
                    _sum += rating - self.global_mean - bi[iid]
                bu[uid] = _sum / (self.reg_bu + len(iids))
        return bu, bi

    def predict(self, uid, iid):
        '''Predict a rating'''
        if iid not in self.items_ratings.index:
            raise Exception("Cannot predict the rating of user <{uid}> for movie <{iid}>: movie <{iid}> is missing from the training set".format(uid=uid, iid=iid))

        predict_rating = self.global_mean + self.bu[uid] + self.bi[iid]
        return predict_rating

    def test(self, testset):
        '''Predict every rating in the test set'''
        for uid, iid, real_rating in testset.itertuples(index=False):
            try:
                pred_rating = self.predict(uid, iid)
            except Exception as e:
                print(e)
            else:
                yield uid, iid, real_rating, pred_rating


if __name__ == '__main__':
    trainset, testset = data_split("datasets/ml-latest-small/ratings.csv", random=True)

    bcf = BaselineCFByALS(20, 25, 15, ["userId", "movieId", "rating"])
    bcf.fit(trainset)

    pred_results = bcf.test(testset)

    rmse, mae = accuray(pred_results)

    print("rmse: ", rmse, "mae: ", mae)
Derivative rules:
[Figure: derivatives of common functions (./img/常见函数求导.png)]
[Figure: arithmetic rules for derivatives (./img/导数的四则运算.png)]
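SVD matrix factorization usually refers to the singular value decomposition technique. Here we will call it Traditional SVD (the classic form). Its formula is:
$$M_{m \times n}=U_{m \times k}\Sigma_{k \times k}V_{k \times n}^T$$
Traditional SVD factorizes a matrix into the product of three matrices, the middle one being the matrix of singular values. A precondition for applying SVD is that the matrix is dense, that is, it has no missing entries; otherwise SVD cannot be applied.
Our data is clearly sparse in the vast majority of cases. To use Traditional SVD, the usual approach is therefore to fill the matrix first with the mean or another statistical method and then apply Traditional SVD for dimensionality reduction, but this clearly distorts the original data to some extent.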
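The fill-then-factorize procedure can be sketched with `numpy.linalg.svd`. The rating matrix below is made up, and filling with the global mean of the observed entries is just one of the possible filling strategies mentioned in the text.

```python
import numpy as np

# Hypothetical 4x4 user-item rating matrix; NaN marks a missing rating.
R = np.array([
    [5.0, 3.0, np.nan, 1.0],
    [4.0, np.nan, np.nan, 1.0],
    [1.0, 1.0, np.nan, 5.0],
    [np.nan, 1.0, 5.0, 4.0],
])

# Step 1: fill the missing entries with the mean of the observed ratings.
filled = np.where(np.isnan(R), np.nanmean(R), R)

# Step 2: factorize the dense matrix into U, singular values s, and V^T.
U, s, Vt = np.linalg.svd(filled, full_matrices=False)

# Step 3: keep only the k largest singular values (dimensionality reduction).
k = 2
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.round(approx, 2))
```

The predicted values at the originally missing positions depend heavily on the fill value, which is the distortion the paragraph above warns about.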
The Traditional SVD described above first has to fill the matrix and then factorize it for dimensionality reduction, and it is also computationally expensive because it factorizes into three matrices. The Funk SVD method was proposed later: instead of three matrices, it factorizes the rating matrix into two, a user-latent-feature matrix and an item-latent-feature matrix. Funk SVD is also regarded as the original LFM model.
$$\sum_{i,j}(m_{ij}-q_j^Tp_i)^2$$
Borrowing the idea of linear regression, it looks for the best latent vectors of users and items by minimizing the squared error on the observed data. To avoid overfitting the observed data, a FunkSVD with an L2 regularization term was also proposed:
$$\min_{q^*,p^*} \sum_{(u,i) \in K} (r_{ui}-q_i^T p_u)^2+\lambda (\left\| q_i \right\|^2 + \left\| p_u \right\|^2)$$
Both of these objective functions can be optimized with gradient descent or stochastic gradient descent.
After FunkSVD was proposed, many variants appeared. One relatively successful variant is BiasSVD which, as the name suggests, is an SVD factorization with bias terms (here $i$ indexes users and $j$ indexes items):
$$\underbrace {\arg \min }_{p_i,q_j}\sum\limits_{i,j} \left( m_{ij} - \mu - b_i - b_j - q_j^T p_i \right)^2 + \lambda (\left\| p_i \right\|_2^2 + \left\| q_j \right\|_2^2 + \left\| b_i \right\|_2^2 + \left\| b_j \right\|_2^2)$$
[Figure: matrix factorization illustration (./img/矩阵分解4.jpg)]
It rests on the same assumptions as the Baseline predictor, but brings the Baseline biases into the matrix factorization.
An improved BiasSVD, called SVD++, was proposed later. It adds the user's implicit feedback to BiasSVD:
$$
\begin{split}
\underbrace {\arg \min }_{p_i,q_j} &\sum\limits_{i,j} \left( m_{ij} - \mu - b_i - b_j - q_j^T p_i - q_j^T \left| N(i) \right|^{ - \frac{1}{2}}\sum\limits_{s \in N(i)} y_s \right)^2 \\
&+ \lambda (\left\| p_i \right\|_2^2 + \left\| q_j \right\|_2^2 + \left\| b_i \right\|_2^2 + \left\| b_j \right\|_2^2 + \sum\limits_{s \in N(i)} \left\| y_s \right\|_2^2 )
\end{split}
$$
[Figure: matrix factorization illustration (./img/矩阵分解5.jpg)]
Explicit feedback refers to behavior such as a user's ratings; implicit feedback refers to a user's browsing history, purchase history, listening history and so on.
SVD++ is based on the following assumption: on top of BiasSVD, a user's browsing, purchase and listening history for items indirectly reflects the user's preferences.
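Reading the prediction term off the SVD++ objective, a single prediction can be sketched as below. The function name `svdpp_predict`, the argument layout, and all numbers are assumptions for illustration; setting the implicit part to zero recovers the BiasSVD prediction.

```python
import numpy as np


def svdpp_predict(mu, b_user, b_item, p_user, q_item, y_implicit):
    """Predict mu + b_user + b_item + q_item . (p_user + |N|^(-1/2) * sum of y_s).

    `y_implicit` has shape (|N|, k): one latent vector y_s per item the user
    interacted with implicitly (browsed, bought, listened to).
    """
    n = len(y_implicit)
    if n > 0:
        implicit = y_implicit.sum(axis=0) / np.sqrt(n)
    else:
        implicit = np.zeros_like(p_user)  # no implicit feedback: plain BiasSVD
    return mu + b_user + b_item + float(np.dot(q_item, p_user + implicit))


# Made-up parameters with k = 2 latent factors and 4 implicit interactions.
p_user = np.array([1.0, 0.0])
q_item = np.array([0.5, 0.5])
with_feedback = svdpp_predict(3.5, -0.5, 1.2, p_user, q_item, np.ones((4, 2)))
without_feedback = svdpp_predict(3.5, -0.5, 1.2, p_user, q_item, np.zeros((0, 2)))
print(with_feedback, without_feedback)
```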
LFM is the Funk SVD matrix factorization mentioned above.
The core idea of LFM (latent factor model) is to connect users and items through latent features, as shown below:
[Figure: LFM matrix factorization diagram (./img/LFM矩阵分解图解.png)]
Using matrix factorization, the original User-Item rating matrix (dense or sparse) is factorized into the matrices P and Q, and $P*Q$ is then used to reconstruct the User-Item rating matrix $R$. The whole process amounts to dimensionality reduction, where:
The entry $P_{11}$ is the weight of user 1 on latent feature 1
The entry $Q_{11}$ is the weight of latent feature 1 on item 1
The entry $R_{11}$ is the predicted rating of user 1 for item 1, with $R_{11}=\vec{P_{1,k}}\cdot \vec{Q_{k,1}}$
[Figure: LFM matrix factorization diagram 2 (./img/LFM矩阵分解图解2.png)]
Predicting a user's rating of an item with LFM, where $k$ is the number of latent features:
$$
\begin{split}
\hat {r}_{ui} &=\vec {p_{uk}}\cdot \vec {q_{ik}} \\
&=\sum_{k=1}^{k} p_{uk}q_{ik}
\end{split}
$$
In the end, our goal is therefore to find the matrices P and Q and every value in them, and then use them to predict user-item ratings.
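The reconstruction $R = P*Q$ is an ordinary matrix product, as the following sketch shows. The matrices here are invented (3 users, 2 latent features, 4 items); in practice P and Q are learned from the observed ratings.

```python
import numpy as np

# Hypothetical user-latent-feature matrix P: 3 users x 2 latent features.
P = np.array([
    [1.0, 0.5],
    [0.2, 1.5],
    [0.8, 0.8],
])

# Hypothetical latent-feature-item matrix Q: 2 latent features x 4 items.
Q = np.array([
    [2.0, 1.0, 0.0, 3.0],
    [1.0, 2.0, 4.0, 0.0],
])

# Predicted rating matrix: every entry is a dot product of a row of P and a column of Q.
R_hat = P @ Q

# Prediction for user 1 and item 1 (index 0 in code) as an explicit dot product.
r_11 = np.dot(P[0], Q[:, 0])
print(R_hat)
print(r_11)
```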
As before, we build the loss function for rating prediction from the squared error:
$$
\begin{split}
Cost &= \sum_{u,i\in R} (r_{ui}-\hat{r}_{ui})^2 \\
&=\sum_{u,i\in R} (r_{ui}-\sum_{k=1}^{k} p_{uk}q_{ik})^2
\end{split}
$$
Adding L2 regularization:
$$Cost = \sum_{u,i\in R} (r_{ui}-\sum_{k=1}^{k} p_{uk}q_{ik})^2 + \lambda(\sum_U{p_{uk}}^2+\sum_I{q_{ik}}^2)$$
Taking the partial derivatives of the loss function:
$$
\begin{split}
\cfrac {\partial}{\partial p_{uk}}Cost &= \cfrac {\partial}{\partial p_{uk}}[\sum_{u,i\in R} (r_{ui}-\sum_{k=1}^{k} p_{uk}q_{ik})^2 + \lambda(\sum_U{p_{uk}}^2+\sum_I{q_{ik}}^2)] \\
&=2\sum_{u,i\in R} (r_{ui}-\sum_{k=1}^{k} p_{uk}q_{ik})(-q_{ik}) + 2\lambda p_{uk} \\
\\
\cfrac {\partial}{\partial q_{ik}}Cost &= \cfrac {\partial}{\partial q_{ik}}[\sum_{u,i\in R} (r_{ui}-\sum_{k=1}^{k} p_{uk}q_{ik})^2 + \lambda(\sum_U{p_{uk}}^2+\sum_I{q_{ik}}^2)] \\
&=2\sum_{u,i\in R} (r_{ui}-\sum_{k=1}^{k} p_{uk}q_{ik})(-p_{uk}) + 2\lambda q_{ik}
\end{split}
$$
Gradient descent update for the parameter $p_{uk}$:
$$
\begin{split}
p_{uk}&:=p_{uk} - \alpha\cfrac {\partial}{\partial p_{uk}}Cost \\
&:=p_{uk}-\alpha [2\sum_{u,i\in R} (r_{ui}-\sum_{k=1}^{k} p_{uk}q_{ik})(-q_{ik}) + 2\lambda p_{uk}] \\
&:=p_{uk}+\alpha [\sum_{u,i\in R} (r_{ui}-\sum_{k=1}^{k} p_{uk}q_{ik})q_{ik} - \lambda p_{uk}]
\end{split}
$$
Similarly:
$$q_{ik}:=q_{ik} + \alpha[\sum_{u,i\in R} (r_{ui}-\sum_{k=1}^{k} p_{uk}q_{ik})p_{uk} - \lambda q_{ik}]$$
Stochastic gradient descent (the vector product multiplies corresponding components and sums the results):
$$
\begin{split}
&p_{uk}:=p_{uk}+\alpha [(r_{ui}-\sum_{k=1}^{k} p_{uk}q_{ik})q_{ik} - \lambda_1 p_{uk}] \\
&q_{ik}:=q_{ik} + \alpha[(r_{ui}-\sum_{k=1}^{k} p_{uk}q_{ik})p_{uk} - \lambda_2 q_{ik}]
\end{split}
$$
Because P and Q are two different matrices, they usually get separate regularization parameters, such as $\lambda_1$ and $\lambda_2$.
'''
LFM Model
'''
import pandas as pd
import numpy as np


# Rating prediction on a 1-5 scale
class LFM(object):

    def __init__(self, alpha, reg_p, reg_q, number_LatentFactors=10, number_epochs=10, columns=["uid", "iid", "rating"]):
        self.alpha = alpha  # learning rate
        self.reg_p = reg_p  # regularization for the P matrix
        self.reg_q = reg_q  # regularization for the Q matrix
        self.number_LatentFactors = number_LatentFactors  # number of latent factors
        self.number_epochs = number_epochs  # maximum number of epochs
        self.columns = columns

    def fit(self, dataset):
        '''
        fit dataset
        :param dataset: uid, iid, rating
        :return:
        '''
        self.dataset = pd.DataFrame(dataset)

        self.users_ratings = dataset.groupby(self.columns[0]).agg([list])[[self.columns[1], self.columns[2]]]
        self.items_ratings = dataset.groupby(self.columns[1]).agg([list])[[self.columns[0], self.columns[2]]]

        self.globalMean = self.dataset[self.columns[2]].mean()

        self.P, self.Q = self.sgd()

    def _init_matrix(self):
        '''
        Initialize the P and Q matrices with random values between 0 and 1
        :return:
        '''
        # User-LF
        P = dict(zip(
            self.users_ratings.index,
            np.random.rand(len(self.users_ratings), self.number_LatentFactors).astype(np.float32)
        ))
        # Item-LF
        Q = dict(zip(
            self.items_ratings.index,
            np.random.rand(len(self.items_ratings), self.number_LatentFactors).astype(np.float32)
        ))
        return P, Q

    def sgd(self):
        '''
        Optimize P and Q with stochastic gradient descent
        :return:
        '''
        P, Q = self._init_matrix()

        for i in range(self.number_epochs):
            print("iter%d" % i)
            error_list = []
            for uid, iid, r_ui in self.dataset.itertuples(index=False):
                v_pu = P[uid]  # user vector (User-LF, row of P)
                v_qi = Q[iid]  # item vector (Item-LF, row of Q)
                err = np.float32(r_ui - np.dot(v_pu, v_qi))

                # Compute both updates from the old vectors. An in-place "+=" on v_pu
                # would feed the already-updated user vector into the item update.
                new_pu = v_pu + self.alpha * (err * v_qi - self.reg_p * v_pu)
                new_qi = v_qi + self.alpha * (err * v_pu - self.reg_q * v_qi)

                P[uid] = new_pu
                Q[iid] = new_qi

                error_list.append(err ** 2)
            # Training RMSE for this epoch
            print(np.sqrt(np.mean(error_list)))
        return P, Q

    def predict(self, uid, iid):
        # If uid or iid is unknown, return the global mean rating as the prediction
        if uid not in self.users_ratings.index or iid not in self.items_ratings.index:
            return self.globalMean

        p_u = self.P[uid]
        q_i = self.Q[iid]

        return np.dot(p_u, q_i)

    def test(self, testset):
        '''Predict every rating in the test set'''
        for uid, iid, real_rating in testset.itertuples(index=False):
            try:
                pred_rating = self.predict(uid, iid)
            except Exception as e:
                print(e)
            else:
                yield uid, iid, real_rating, pred_rating


if __name__ == '__main__':
    dtype = [("userId", np.int32), ("movieId", np.int32), ("rating", np.float32)]
    dataset = pd.read_csv("datasets/ml-latest-small/ratings.csv", usecols=range(3), dtype=dict(dtype))

    lfm = LFM(0.02, 0.01, 0.01, 10, 100, ["userId", "movieId", "rating"])
    lfm.fit(dataset)

    while True:
        uid = input("uid: ")
        iid = input("iid: ")
        print(lfm.predict(int(uid), int(iid)))
BiasSvd其实就是前面提到的Funk SVD矩阵分解基础上加上了偏置项。
利用BiasSvd预测用户对物品的评分, k k k表示隐含特征数量:
r ^ u i = μ + b u + b i + p u k ⃗ ⋅ q k i ⃗ = μ + b u + b i + ∑ k = 1 k p u k q i k \begin{split} \hat {r}_{ui} &=\mu + b_u + b_i + \vec {p_{uk}}\cdot \vec {q_{ki}} \\&=\mu + b_u + b_i + {\sum_{k=1}}^k p_{uk}q_{ik} \end{split} r^ui=μ+bu+bi+puk⋅qki=μ+bu+bi+k=1∑kpukqik
同样对于评分预测我们利用平方差来构建损失函数:
C o s t = ∑ u , i ∈ R ( r u i − r ^ u i ) 2 = ∑ u , i ∈ R ( r u i − μ − b u − b i − ∑ k = 1 k p u k q i k ) 2 \begin{split} Cost &= \sum_{u,i\in R} (r_{ui}-\hat{r}_{ui})^2 \\&=\sum_{u,i\in R} (r_{ui}-\mu - b_u - b_i -{\sum_{k=1}}^k p_{uk}q_{ik})^2 \end{split} Cost=u,i∈R∑(rui−r^ui)2=u,i∈R∑(rui−μ−bu−bi−k=1∑kpukqik)2
加入L2正则化:
C o s t = ∑ u , i ∈ R ( r u i − μ − b u − b i − ∑ k = 1 k p u k q i k ) 2 + λ ( ∑ U b u 2 + ∑ I b i 2 + ∑ U p u k 2 + ∑ I q i k 2 ) Cost = \sum_{u,i\in R} (r_{ui}-\mu - b_u - b_i-{\sum_{k=1}}^k p_{uk}q_{ik})^2 \\+ \lambda(\sum_U{b_u}^2+\sum_I{b_i}^2+\sum_U{p_{uk}}^2+\sum_I{q_{ik}}^2) Cost=u,i∈R∑(rui−μ−bu−bi−k=1∑kpukqik)2+λ(U∑bu2+I∑bi2+U∑puk2+I∑qik2)
对损失函数求偏导:
∂ ∂ p u k C o s t = ∂ ∂ p u k [ ∑ u , i ∈ R ( r u i − μ − b u − b i − ∑ k = 1 k p u k q i k ) 2 + λ ( ∑ U b u 2 + ∑ I b i 2 + ∑ U p u k 2 + ∑ I q i k 2 ) ] = 2 ∑ u , i ∈ R ( r u i − μ − b u − b i − ∑ k = 1 k p u k q i k ) ( − q i k ) + 2 λ p u k ∂ ∂ q i k C o s t = ∂ ∂ q i k [ ∑ u , i ∈ R ( r u i − μ − b u − b i − ∑ k = 1 k p u k q i k ) 2 + λ ( ∑ U b u 2 + ∑ I b i 2 + ∑ U p u k 2 + ∑ I q i k 2 ) ] = 2 ∑ u , i ∈ R ( r u i − μ − b u − b i − ∑ k = 1 k p u k q i k ) ( − p u k ) + 2 λ q i k \begin{split}\cfrac {\partial}{\partial p_{uk}}Cost &= \cfrac {\partial}{\partial p_{uk}}[\sum_{u,i\in R} (r_{ui}-\mu - b_u - b_i-{\sum_{k=1}}^k p_{uk}q_{ik})^2 + \lambda(\sum_U{b_u}^2+\sum_I{b_i}^2+\sum_U{p_{uk}}^2+\sum_I{q_{ik}}^2)]\\&=2\sum_{u,i\in R} (r_{ui}-\mu - b_u - b_i-{\sum_{k=1}}^k p_{uk}q_{ik})(-q_{ik}) + 2\lambda p_{uk}\\\\\cfrac {\partial}{\partial q_{ik}}Cost &= \cfrac {\partial}{\partial q_{ik}}[\sum_{u,i\in R} (r_{ui}-\mu - b_u - b_i-{\sum_{k=1}}^k p_{uk}q_{ik})^2 + \lambda(\sum_U{b_u}^2+\sum_I{b_i}^2+\sum_U{p_{uk}}^2+\sum_I{q_{ik}}^2)]\\&=2\sum_{u,i\in R} (r_{ui}-\mu - b_u - b_i-{\sum_{k=1}}^k p_{uk}q_{ik})(-p_{uk}) + 2\lambda q_{ik}\end{split} ∂puk∂Cost∂qik∂Cost=∂puk∂[u,i∈R∑(rui−μ−bu−bi−k=1∑kpukqik)2+λ(U∑bu2+I∑bi2+U∑puk2+I∑qik2)]=2u,i∈R∑(rui−μ−bu−bi−k=1∑kpukqik)(−qik)+2λpuk=∂qik∂[u,i∈R∑(rui−μ−bu−bi−k=1∑kpukqik)2+λ(U∑bu2+I∑bi2+U∑puk2+I∑qik2)]=2u,i∈R∑(rui−μ−bu−bi−k=1∑kpukqik)(−puk)+2λqik
∂ ∂ b u C o s t = ∂ ∂ b u [ ∑ u , i ∈ R ( r u i − μ − b u − b i − ∑ k = 1 k p u k q i k ) 2 + λ ( ∑ U b u 2 + ∑ I b i 2 + ∑ U p u k 2 + ∑ I q i k 2 ) ] = 2 ∑ u , i ∈ R ( r u i − μ − b u − b i − ∑ k = 1 k p u k q i k ) ( − 1 ) + 2 λ b u ∂ ∂ b i C o s t = ∂ ∂ b i [ ∑ u , i ∈ R ( r u i − μ − b u − b i − ∑ k = 1 k p u k q i k ) 2 + λ ( ∑ U b u 2 + ∑ I b i 2 + ∑ U p u k 2 + ∑ I q i k 2 ) ] = 2 ∑ u , i ∈ R ( r u i − μ − b u − b i − ∑ k = 1 k p u k q i k ) ( − 1 ) + 2 λ b i \begin{split} \cfrac {\partial}{\partial b_u}Cost &= \cfrac {\partial}{\partial b_u}[\sum_{u,i\in R} (r_{ui}-\mu - b_u - b_i-{\sum_{k=1}}^k p_{uk}q_{ik})^2 + \lambda(\sum_U{b_u}^2+\sum_I{b_i}^2+\sum_U{p_{uk}}^2+\sum_I{q_{ik}}^2)] \\&=2\sum_{u,i\in R} (r_{ui}-\mu - b_u - b_i-{\sum_{k=1}}^k p_{uk}q_{ik})(-1) + 2\lambda b_u \\\\ \cfrac {\partial}{\partial b_i}Cost &= \cfrac {\partial}{\partial b_i}[\sum_{u,i\in R} (r_{ui}-\mu - b_u - b_i-{\sum_{k=1}}^k p_{uk}q_{ik})^2 + \lambda(\sum_U{b_u}^2+\sum_I{b_i}^2+\sum_U{p_{uk}}^2+\sum_I{q_{ik}}^2)] \\&=2\sum_{u,i\in R} (r_{ui}-\mu - b_u - b_i-{\sum_{k=1}}^k p_{uk}q_{ik})(-1) + 2\lambda b_i \end{split} ∂bu∂Cost∂bi∂Cost=∂bu∂[u,i∈R∑(rui−μ−bu−bi−k=1∑kpukqik)2+λ(U∑bu2+I∑bi2+U∑puk2+I∑qik2)]=2u,i∈R∑(rui−μ−bu−bi−k=1∑kpukqik)(−1)+2λbu=∂bi∂[u,i∈R∑(rui−μ−bu−bi−k=1∑kpukqik)2+λ(U∑bu2+I∑bi2+U∑puk2+I∑qik2)]=2u,i∈R∑(rui−μ−bu−bi−k=1∑kpukqik)(−1)+2λbi
梯度下降更新参数 p u k p_{uk} puk:
p u k : = p u k − α ∂ ∂ p u k C o s t : = p u k − α [ 2 ∑ u , i ∈ R ( r u i − μ − b u − b i − ∑ k = 1 k p u k q i k ) ( − q i k ) + 2 λ p u k ] : = p u k + α [ ∑ u , i ∈ R ( r u i − μ − b u − b i − ∑ k = 1 k p u k q i k ) q i k − λ p u k ] \begin{split} p_{uk}&:=p_{uk} - \alpha\cfrac {\partial}{\partial p_{uk}}Cost \\&:=p_{uk}-\alpha [2\sum_{u,i\in R} (r_{ui}-\mu - b_u - b_i-{\sum_{k=1}}^k p_{uk}q_{ik})(-q_{ik}) + 2\lambda p_{uk}] \\&:=p_{uk}+\alpha [\sum_{u,i\in R} (r_{ui}-\mu - b_u - b_i-{\sum_{k=1}}^k p_{uk}q_{ik})q_{ik} - \lambda p_{uk}] \end{split} puk:=puk−α∂puk∂Cost:=puk−α[2u,i∈R∑(rui−μ−bu−bi−k=1∑kpukqik)(−qik)+2λpuk]:=puk+α[u,i∈R∑(rui−μ−bu−bi−k=1∑kpukqik)qik−λpuk]
Similarly:

$$
\begin{split}
q_{ik} &:= q_{ik} + \alpha\left[\sum_{u,i\in R}\left(r_{ui}-\mu-b_u-b_i-\sum_{k=1}^{k}p_{uk}q_{ik}\right)p_{uk} - \lambda q_{ik}\right]
\\
b_u &:= b_u + \alpha\left[\sum_{u,i\in R}\left(r_{ui}-\mu-b_u-b_i-\sum_{k=1}^{k}p_{uk}q_{ik}\right) - \lambda b_u\right]
\\
b_i &:= b_i + \alpha\left[\sum_{u,i\in R}\left(r_{ui}-\mu-b_u-b_i-\sum_{k=1}^{k}p_{uk}q_{ik}\right) - \lambda b_i\right]
\end{split}
$$
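The full-batch updates can be sketched with NumPy as follows. This is a minimal illustration under assumptions, not the article's implementation: the toy ratings, the array layout (parallel `users`, `items`, `ratings` arrays) and the helper names `cost` and `batch_step` are hypothetical. All four parameters are updated from the same pre-update values.

```python
import numpy as np


def cost(users, items, ratings, mu, P, Q, bu, bi, lam):
    """Regularized squared-error cost over the observed ratings."""
    err = ratings - mu - bu[users] - bi[items] - np.sum(P[users] * Q[items], axis=1)
    reg = lam * ((bu ** 2).sum() + (bi ** 2).sum() + (P ** 2).sum() + (Q ** 2).sum())
    return (err ** 2).sum() + reg


def batch_step(users, items, ratings, mu, P, Q, bu, bi, alpha, lam):
    """One full-batch gradient descent step for P, Q, bu and bi."""
    err = ratings - mu - bu[users] - bi[items] - np.sum(P[users] * Q[items], axis=1)
    gP, gQ = np.zeros_like(P), np.zeros_like(Q)
    gbu, gbi = np.zeros_like(bu), np.zeros_like(bi)
    # Accumulate the per-user and per-item sums over the observed ratings
    np.add.at(gP, users, err[:, None] * Q[items])
    np.add.at(gQ, items, err[:, None] * P[users])
    np.add.at(gbu, users, err)
    np.add.at(gbi, items, err)
    return (P + alpha * (gP - lam * P),
            Q + alpha * (gQ - lam * Q),
            bu + alpha * (gbu - lam * bu),
            bi + alpha * (gbi - lam * bi))


# Hypothetical toy data: 3 users, 4 items, 5 observed ratings
users = np.array([0, 0, 1, 2, 2])
items = np.array([0, 2, 1, 0, 3])
ratings = np.array([4.0, 3.0, 5.0, 2.0, 4.5])
mu = ratings.mean()

rng = np.random.default_rng(0)
P, Q = rng.random((3, 2)), rng.random((4, 2))
bu, bi = np.zeros(3), np.zeros(4)
alpha, lam = 0.01, 0.1  # arbitrary illustrative values

cost_before = cost(users, items, ratings, mu, P, Q, bu, bi, lam)
for _ in range(200):
    P, Q, bu, bi = batch_step(users, items, ratings, mu, P, Q, bu, bi, alpha, lam)
cost_after = cost(users, items, ratings, mu, P, Q, bu, bi, lam)
print(cost_before, cost_after)
```

With a small enough learning rate the cost should go down from one step to the next. On real data the full sum over $R$ is expensive to recompute at every step.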
Stochastic gradient descent (each update uses a single observed rating $r_{ui}$ instead of the sum over $R$):

$$
\begin{split}
p_{uk} &:= p_{uk} + \alpha\left[\left(r_{ui}-\mu-b_u-b_i-\sum_{k=1}^{k}p_{uk}q_{ik}\right)q_{ik} - \lambda_1 p_{uk}\right]
\\
q_{ik} &:= q_{ik} + \alpha\left[\left(r_{ui}-\mu-b_u-b_i-\sum_{k=1}^{k}p_{uk}q_{ik}\right)p_{uk} - \lambda_2 q_{ik}\right]
\\
b_u &:= b_u + \alpha\left[\left(r_{ui}-\mu-b_u-b_i-\sum_{k=1}^{k}p_{uk}q_{ik}\right) - \lambda_3 b_u\right]
\\
b_i &:= b_i + \alpha\left[\left(r_{ui}-\mu-b_u-b_i-\sum_{k=1}^{k}p_{uk}q_{ik}\right) - \lambda_4 b_i\right]
\end{split}
$$
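Because P and Q are two different matrices, they are usually given separate regularization parameters, such as $\lambda_1$ and $\lambda_2$; the bias terms $b_u$ and $b_i$ likewise get their own, $\lambda_3$ and $\lambda_4$.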
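To make the stochastic update concrete, here is a small sketch of a single SGD step on one rating. The numbers and the function name `sgd_step` are hypothetical, chosen only for illustration. Every parameter is updated from the same pre-update values, and each one has its own regularization weight.

```python
import numpy as np


def sgd_step(r_ui, mu, b_u, b_i, p_u, q_i, alpha, lam1, lam2, lam3, lam4):
    """Apply one SGD update for a single observed rating r_ui."""
    err = r_ui - mu - b_u - b_i - np.dot(p_u, q_i)
    new_p = p_u + alpha * (err * q_i - lam1 * p_u)
    new_q = q_i + alpha * (err * p_u - lam2 * q_i)
    new_bu = b_u + alpha * (err - lam3 * b_u)
    new_bi = b_i + alpha * (err - lam4 * b_i)
    return new_p, new_q, new_bu, new_bi, err


# Hypothetical values for one (user, item) pair
mu, r_ui = 3.5, 4.0
b_u, b_i = 0.0, 0.0
p_u = np.array([0.1, 0.2])
q_i = np.array([0.3, 0.4])

p_u, q_i, b_u, b_i, err_before = sgd_step(
    r_ui, mu, b_u, b_i, p_u, q_i,
    alpha=0.02, lam1=0.01, lam2=0.01, lam3=0.01, lam4=0.01)

# Prediction error on the same rating after the update
err_after = r_ui - mu - b_u - b_i - np.dot(p_u, q_i)
print(err_before, err_after)
```

After one step the error on this rating should be slightly smaller in magnitude. Repeating the step over all ratings for several epochs is what the `sgd` method of the class does.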
'''
BiasSvd Model
'''
import numpy as np
import pandas as pd


class BiasSvd(object):

    def __init__(self, alpha, reg_p, reg_q, reg_bu, reg_bi, number_LatentFactors=10, number_epochs=10, columns=["uid", "iid", "rating"]):
        self.alpha = alpha  # learning rate
        self.reg_p = reg_p  # regularization weight for P
        self.reg_q = reg_q  # regularization weight for Q
        self.reg_bu = reg_bu  # regularization weight for the user biases
        self.reg_bi = reg_bi  # regularization weight for the item biases
        self.number_LatentFactors = number_LatentFactors  # number of latent factors
        self.number_epochs = number_epochs
        self.columns = columns  # column names, in (user, item, rating) order

    def fit(self, dataset):
        '''
        fit dataset
        :param dataset: uid, iid, rating
        :return:
        '''
        # Keep only the three configured columns, in (user, item, rating) order
        self.dataset = pd.DataFrame(dataset)[list(self.columns)]

        self.users_ratings = self.dataset.groupby(self.columns[0]).agg([list])[[self.columns[1], self.columns[2]]]
        self.items_ratings = self.dataset.groupby(self.columns[1]).agg([list])[[self.columns[0], self.columns[2]]]
        self.globalMean = self.dataset[self.columns[2]].mean()

        self.P, self.Q, self.bu, self.bi = self.sgd()

    def _init_matrix(self):
        '''
        Initialize the P and Q matrices with random values in [0, 1)
        :return:
        '''
        # User-LF
        P = dict(zip(
            self.users_ratings.index,
            np.random.rand(len(self.users_ratings), self.number_LatentFactors).astype(np.float32)
        ))
        # Item-LF
        Q = dict(zip(
            self.items_ratings.index,
            np.random.rand(len(self.items_ratings), self.number_LatentFactors).astype(np.float32)
        ))
        return P, Q

    def sgd(self):
        '''
        Optimize the parameters with stochastic gradient descent
        :return:
        '''
        P, Q = self._init_matrix()

        # Initialize bu and bi, all set to 0
        bu = dict(zip(self.users_ratings.index, np.zeros(len(self.users_ratings))))
        bi = dict(zip(self.items_ratings.index, np.zeros(len(self.items_ratings))))

        for epoch in range(self.number_epochs):
            print("iter%d" % epoch)
            error_list = []
            for uid, iid, r_ui in self.dataset.itertuples(index=False):
                v_pu = P[uid]
                v_qi = Q[iid]
                err = np.float32(r_ui - self.globalMean - bu[uid] - bi[iid] - np.dot(v_pu, v_qi))

                # Compute both updates from the pre-update vectors, so that
                # the update of q_i does not see the already-updated p_u
                new_pu = v_pu + self.alpha * (err * v_qi - self.reg_p * v_pu)
                new_qi = v_qi + self.alpha * (err * v_pu - self.reg_q * v_qi)
                P[uid] = new_pu
                Q[iid] = new_qi

                bu[uid] += self.alpha * (err - self.reg_bu * bu[uid])
                bi[iid] += self.alpha * (err - self.reg_bi * bi[iid])

                error_list.append(err ** 2)
            # Training RMSE of this epoch
            print(np.sqrt(np.mean(error_list)))

        return P, Q, bu, bi

    def predict(self, uid, iid):
        # Fall back to the global mean for unseen users or items
        if uid not in self.users_ratings.index or iid not in self.items_ratings.index:
            return self.globalMean

        p_u = self.P[uid]
        q_i = self.Q[iid]

        return self.globalMean + self.bu[uid] + self.bi[iid] + np.dot(p_u, q_i)


if __name__ == '__main__':
    dtype = [("userId", np.int32), ("movieId", np.int32), ("rating", np.float32)]
    dataset = pd.read_csv("datasets/ml-latest-small/ratings.csv", usecols=range(3), dtype=dict(dtype))

    # The column names must match the CSV header; the defaults ("uid", "iid", "rating") do not
    bsvd = BiasSvd(0.02, 0.01, 0.01, 0.01, 0.01, 10, 20, columns=["userId", "movieId", "rating"])
    bsvd.fit(dataset)

    while True:
        uid = input("uid: ")
        iid = input("iid: ")
        print(bsvd.predict(int(uid), int(iid)))