Installation is the usual:
pip install surprise
A common installation problem is this error: (error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": https://visualstudio.microsoft.com/downloads/)
Workarounds:
① The brute-force option: download the Visual Studio build tools that the error message points to;
② The evasive option: look for a prebuilt .whl package of the library matching your Python version at https://www.lfd.uci.edu/~gohlke/pythonlibs/ and install it with pip install xx.whl (see the example after this list); the wheel files for the surprise package are listed at https://pypi.org/project/surprise/#files, though you may not always be able to dodge the compiler this way;
③ Python 2.7 users can download VCForPython27.msi from https://www.microsoft.com/en-us/download/details.aspx?id=44266 to get support for packages written in C;
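For example, installing from a downloaded wheel looks like this (the filename below is purely illustrative; use the wheel that matches your Python version and platform):
pip install scikit_surprise-1.1.1-cp38-cp38-win_amd64.whl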
The Surprise library is well suited for beginners getting to know recommendation algorithms; it ships with built-in datasets, a set of classic prediction algorithms, and tools for evaluation and cross-validation.
This section gives examples of how to use the Surprise library; readers can adapt the example code as needed to implement their own functionality.
from surprise import SVD
from surprise import Dataset
from surprise.model_selection import cross_validate
# Load the built-in ml-100k dataset
data = Dataset.load_builtin('ml-100k')
# Use the SVD algorithm
algo = SVD()
# 5-fold cross-validation: cv sets the number of folds, measures sets the evaluation metrics, and verbose=True prints detailed results
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
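cross_validate also returns a dict of per-fold results, so the summary numbers can be pulled out programmatically; a small follow-up sketch:
results = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=False)
print(results['test_rmse'].mean())  # average RMSE across the 5 folds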
The same can be written in a style closer to the usual sklearn idiom:
from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import train_test_split
data = Dataset.load_builtin('ml-100k')
# As in sklearn: split the data into 75% training and 25% test
trainset, testset = train_test_split(data, test_size=.25)
algo = SVD()
# Unlike the previous example, here we call the fit and test functions explicitly
algo.fit(trainset)
predictions = algo.test(testset)
# Use RMSE as the metric
accuracy.rmse(predictions)
from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise import Reader
# Specify the format of the input file: in this example each line has three fields (user, item, rating) separated by spaces; if the file uses commas or another delimiter, change the sep parameter accordingly
reader = Reader(line_format='user item rating', sep=' ')
# Specify the data file to load, here test.txt
data = Dataset.load_from_file('test.txt', reader=reader)
# Use the entire dataset as the training set
trainset = data.build_full_trainset()
algo = SVD()
algo.fit(trainset)
# With no held-out data, evaluate on the training ratings themselves
predictions = algo.test(trainset.build_testset())
accuracy.rmse(predictions)
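After training on the full dataset, predictions for individual user-item pairs are usually obtained with predict(); a minimal sketch (the raw ids 'u1' and 'i3' are made up for illustration):
pred = algo.predict('u1', 'i3')
print(pred.est)  # the estimated rating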
from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise import Reader
from surprise.model_selection import PredefinedKFold
import os
# The data files live under ~/data/
files_dir = os.path.expanduser('~/data/')
# The training sets are u1.base and u2.base
train_file = files_dir + 'u%d.base'
# The test sets are u1.test and u2.test
test_file = files_dir + 'u%d.test'
# range(1, 3) matches the file naming from u1 to u2; this simply builds a list of (train file, test file) pairs: [(train 1, test 1), (train 2, test 2), ...]
folds_files = [(train_file % i, test_file % i) for i in range(1, 3)]
reader = Reader(line_format='user item rating', sep='\t')
data = Dataset.load_from_folds(folds_files, reader=reader)
pkf = PredefinedKFold()
algo = SVD()
# Because this example has two (train, test) pairs, the results are printed twice
for trainset, testset in pkf.split(data):
    algo.fit(trainset)
    predictions = algo.test(testset)
    accuracy.rmse(predictions, verbose=True)
    accuracy.mae(predictions, verbose=True)
This is the random prediction algorithm: it assumes that the test-set ratings follow a normal distribution and predicts by drawing normally distributed random numbers; the mean and variance of the normal distribution $N(\hat{\mu},\hat{\sigma}^2)$ are estimated from the training set.
$$\hat{\mu}=\frac{1}{\vert R_{train}\vert}\sum_{r_{ui}\in R_{train}}r_{ui}$$
$$\hat{\sigma}=\sqrt{\sum_{r_{ui}\in R_{train}}\frac{(r_{ui}-\hat{\mu})^2}{\vert R_{train}\vert}}$$
algo = NormalPredictor()
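The estimation itself is simple enough to sketch directly; a minimal NumPy illustration of the two formulas above (the toy ratings array is made up):
import numpy as np

train_ratings = np.array([4.0, 3.0, 5.0, 2.0, 4.0])  # toy training ratings
mu_hat = train_ratings.mean()      # estimated mean
sigma_hat = train_ratings.std()    # estimated (population) standard deviation
# each prediction is a draw from N(mu_hat, sigma_hat^2)
prediction = np.random.normal(mu_hat, sigma_hat)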
This is the baseline algorithm proposed by Koren; it models only bias terms and does not take users' preferences for individual items into account:
$$\hat{r}_{ui}=\mu+b_u+b_i$$
For a user $u$ not seen in the training set, $b_u=0$ (and $b_i$ is handled the same way).
Parameter settings:
The training method can be either alternating least squares (ALS) or stochastic gradient descent (SGD), chosen via the method key of bsl_options:
bsl_options = {'method': 'als',
'n_epochs': 5,
'reg_u': 12,
'reg_i': 5
}
algo = BaselineOnly(bsl_options=bsl_options)
bsl_options = {'method': 'sgd',
'learning_rate': .00005,
}
algo = BaselineOnly(bsl_options=bsl_options)
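A minimal end-to-end sketch (assuming the built-in ml-100k dataset, as in the earlier examples):
from surprise import BaselineOnly, Dataset
from surprise.model_selection import cross_validate

data = Dataset.load_builtin('ml-100k')
bsl_options = {'method': 'als', 'n_epochs': 5, 'reg_u': 12, 'reg_i': 5}
algo = BaselineOnly(bsl_options=bsl_options)
cross_validate(algo, data, measures=['RMSE'], cv=3, verbose=True)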
This is the most basic KNN algorithm, which comes in user-based and item-based variants.
The user-based KNN formula:
$$\hat{r}_{ui} = \frac{\sum_{v\in N_i^k(u)} \text{sim}(u,v)\cdot r_{vi}}{\sum_{v\in N_i^k(u)}\text{sim}(u,v)}$$
The item-based KNN formula:
$$\hat{r}_{ui} = \frac{\sum_{j\in N_u^k(i)} \text{sim}(i,j)\cdot r_{uj}}{\sum_{j\in N_u^k(i)}\text{sim}(i,j)}$$
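To make the user-based formula concrete, here is a minimal hand-rolled sketch of the weighted average (the similarity values and neighbor ratings are made up):
# neighbors of user u who rated item i: (similarity sim(u, v), rating r_vi)
neighbors = [(0.9, 4.0), (0.5, 3.0), (0.3, 5.0)]
num = sum(sim * r for sim, r in neighbors)
den = sum(sim for sim, _ in neighbors)
r_hat = num / den  # weighted average: about 3.88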
8. k: the number of neighbors to use, default 40
9. min_k: the minimum number of neighbors; if not enough suitable neighbors are found, the prediction falls back to the global mean, default 1
10. name in sim_options: the similarity function to use, default MSD; it can also be set to cosine, pearson, or pearson_baseline
11. user_based in sim_options: default True, i.e., user-based KNN; set it to False to use item-based KNN
12. min_support in sim_options: the minimum number of common items (user-based) or common users (item-based) required for the similarity between two users/items to be non-zero
13. shrinkage in sim_options: when the similarity function is pearson_baseline, this parameter sets the amount of shrinkage, default 100
sim_options = {'name': 'cosine',
'user_based': False # compute similarities between items
}
algo = KNNBasic(k=10, sim_options=sim_options)
sim_options = {'name': 'pearson_baseline',
'shrinkage': 0 # no shrinkage
}
algo = KNNBasic(k=10, sim_options=sim_options)
Building on the KNNBasic algorithm, this variant takes the user or item mean ratings into account:
$$\hat{r}_{ui} = \mu_u + \frac{\sum\limits_{v \in N^k_i(u)} \text{sim}(u, v) \cdot (r_{vi} - \mu_v)}{\sum\limits_{v \in N^k_i(u)} \text{sim}(u, v)}$$
or
$$\hat{r}_{ui} = \mu_i + \frac{\sum\limits_{j \in N^k_u(i)} \text{sim}(i, j) \cdot (r_{uj} - \mu_j)}{\sum\limits_{j \in N^k_u(i)} \text{sim}(i, j)}$$
The parameter settings are similar to those of KNNBasic.
sim_options = {'name': 'cosine',
'user_based': False # compute similarities between items
}
algo = KNNWithMeans(k=10, sim_options=sim_options)
This variant introduces the idea of the Z-score:
$$\hat{r}_{ui} = \mu_u + \sigma_u \frac{\sum\limits_{v \in N^k_i(u)} \text{sim}(u, v) \cdot (r_{vi} - \mu_v) / \sigma_v}{\sum\limits_{v \in N^k_i(u)} \text{sim}(u, v)}$$
or
$$\hat{r}_{ui} = \mu_i + \sigma_i \frac{\sum\limits_{j \in N^k_u(i)} \text{sim}(i, j) \cdot (r_{uj} - \mu_j) / \sigma_j}{\sum\limits_{j \in N^k_u(i)} \text{sim}(i, j)}$$
The parameter settings are similar to those of KNNBasic.
sim_options = {'name': 'cosine',
'user_based': False # compute similarities between items
}
algo = KNNWithZScore(k=10, sim_options=sim_options)
The difference from KNNWithMeans is that baseline estimates are used instead of means:
$$\hat{r}_{ui} = b_{ui} + \frac{\sum\limits_{v \in N^k_i(u)} \text{sim}(u, v) \cdot (r_{vi} - b_{vi})}{\sum\limits_{v \in N^k_i(u)} \text{sim}(u, v)}$$
or
$$\hat{r}_{ui} = b_{ui} + \frac{\sum\limits_{j \in N^k_u(i)} \text{sim}(i, j) \cdot (r_{uj} - b_{uj})}{\sum\limits_{j \in N^k_u(i)} \text{sim}(i, j)}$$
The parameter settings are similar to those of KNNBasic.
sim_options = {'name': 'cosine',
'user_based': False # compute similarities between items
}
algo = KNNBaseline(k=10, sim_options=sim_options)
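Since KNNBaseline relies on baseline estimates, it is commonly paired with the pearson_baseline similarity and explicit baseline options; a sketch (the option values are illustrative, not tuned):
bsl_options = {'method': 'als', 'n_epochs': 10}
sim_options = {'name': 'pearson_baseline'}
algo = KNNBaseline(k=40, bsl_options=bsl_options, sim_options=sim_options)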
The classic SVD algorithm:
$$\hat{r}_{ui} = \mu + b_u + b_i + q_i^Tp_u$$
The loss function is
$$\sum_{r_{ui} \in R_{train}} \left(r_{ui} - \hat{r}_{ui} \right)^2 + \lambda\left(b_i^2 + b_u^2 + \Vert q_i\Vert^2 + \Vert p_u\Vert^2\right)$$
and the SGD update rules (with prediction error $e_{ui} = r_{ui} - \hat{r}_{ui}$) are
$$b_u \leftarrow b_u + \gamma (e_{ui} - \lambda b_u)$$
$$b_i \leftarrow b_i + \gamma (e_{ui} - \lambda b_i)$$
$$p_u \leftarrow p_u + \gamma (e_{ui} \cdot q_i - \lambda p_u)$$
$$q_i \leftarrow q_i + \gamma (e_{ui} \cdot p_u - \lambda q_i)$$
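A minimal NumPy sketch of a single SGD update for one observed rating, directly mirroring the four rules above (all values are toy numbers):
import numpy as np

gamma, lam = 0.005, 0.02           # learning rate and regularization
mu = 3.5                           # global mean
b_u, b_i = 0.1, -0.2               # user and item biases
p_u = np.random.normal(0, 0.1, 5)  # user factors
q_i = np.random.normal(0, 0.1, 5)  # item factors

r_ui = 4.0                                  # observed rating
e_ui = r_ui - (mu + b_u + b_i + q_i @ p_u)  # prediction error

b_u += gamma * (e_ui - lam * b_u)
b_i += gamma * (e_ui - lam * b_i)
p_u, q_i = (p_u + gamma * (e_ui * q_i - lam * p_u),
            q_i + gamma * (e_ui * p_u - lam * q_i))  # simultaneous update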
14. n_factors: the number of latent factors, default 100
15. n_epochs: the number of iterations, default 20
16. biased: default True, i.e., the bias terms $b_u$ and $b_i$ are used; if False, they are dropped and the model becomes plain matrix factorization, i.e., the PMF algorithm
17. init_mean: the initial values of the $p$ and $q$ vectors are drawn from a normal distribution whose mean is set by this parameter, default 0
18. init_std_dev: the initial values of the $p$ and $q$ vectors are drawn from a normal distribution whose standard deviation is set by this parameter, default 0.1
19. lr_all: sets all the learning rates at once, default 0.005
20. reg_all: sets all the regularization terms at once, default 0.02
21. lr_bu: the learning rate for $b_u$, overrides lr_all, unset by default
22. lr_bi: the learning rate for $b_i$, overrides lr_all, unset by default
23. lr_pu: the learning rate for $p_u$, overrides lr_all, unset by default
24. lr_qi: the learning rate for $q_i$, overrides lr_all, unset by default
25. reg_bu: the regularization term for $b_u$, overrides reg_all, unset by default
26. reg_bi: the regularization term for $b_i$, overrides reg_all, unset by default
27. reg_pu: the regularization term for $p_u$, overrides reg_all, unset by default
28. reg_qi: the regularization term for $q_i$, overrides reg_all, unset by default
29. random_state: the random seed, unset by default; set it to an integer to get identical results across runs (given the same training and test sets)
algo = SVD(n_factors=5, n_epochs=20, lr_all=0.007, reg_all=0.002, verbose=False, init_mean=0.1, init_std_dev=0)
The SVD++ algorithm, also proposed by Koren, which additionally takes implicit feedback into account:
$$\hat{r}_{ui} = \mu + b_u + b_i + q_i^T\left(p_u + |I_u|^{-\frac{1}{2}} \sum_{j \in I_u}y_j\right)$$
Compared with SVD, there are two extra parameters:
30. lr_yj: the learning rate for $y_j$, overrides lr_all, unset by default
31. reg_yj: the regularization term for $y_j$, overrides reg_all, unset by default
algo = SVDpp(n_factors=5, n_epochs=20, lr_all=0.007, reg_all=0.002, verbose=False, init_mean=0.1, init_std_dev=0)
Non-negative matrix factorization: both the $p$ and $q$ factor matrices are constrained to be non-negative.
$$\hat{r}_{ui} = q_i^Tp_u$$
Compared with SVD, there are two extra parameters:
32. init_low: the lower bound for the random initial values, default 0
33. init_high: the upper bound for the random initial values, default 1
# NMF does not take lr_all/reg_all/init_mean/init_std_dev; it uses per-factor regularization and uniform initialization instead
algo = NMF(n_factors=5, n_epochs=20, reg_pu=0.06, reg_qi=0.06, init_low=0, init_high=1, verbose=False)
The SlopeOne algorithm predicts from average rating differences between items:
$$\hat{r}_{ui} = \mu_u + \frac{1}{|R_i(u)|} \sum\limits_{j \in R_i(u)} \text{dev}(i, j)$$
where $R_i(u)$ is the set of items rated by $u$ that have at least one rater in common with $i$, and
$$\text{dev}(i, j) = \frac{1}{|U_{ij}|}\sum\limits_{u \in U_{ij}} (r_{ui} - r_{uj})$$
with $U_{ij}$ the set of users who rated both $i$ and $j$.
algo = SlopeOne()
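A toy illustration of dev(i, j) (the rating dictionaries are made up):
# ratings of items i and j by the users who rated both
r_i = {'u1': 4.0, 'u2': 3.0}
r_j = {'u1': 3.0, 'u2': 1.0}
dev_ij = sum(r_i[u] - r_j[u] for u in r_i) / len(r_i)  # (1 + 2) / 2 = 1.5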
The CoClustering algorithm assigns users and items to clusters (and co-clusters) and predicts as:
$$\hat{r}_{ui} = \overline{C_{ui}} + (\mu_u - \overline{C_u}) + (\mu_i- \overline{C_i})$$
where $\overline{C_{ui}}$ is the average rating of the co-cluster of $u$ and $i$, $\overline{C_u}$ the average rating of $u$'s cluster, and $\overline{C_i}$ the average rating of $i$'s cluster.
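The numbers of user and item clusters are set at construction time; a minimal sketch (the cluster counts are illustrative):
algo = CoClustering(n_cltr_u=3, n_cltr_i=3, n_epochs=20)
Finally, here is a complete script that evaluates an item-based KNNBasic model with top-k ranking metrics (Precision, Recall, MAP, and NDCG) under 5-fold cross-validation, assuming a comma-separated ratings file rating2.txt and item ids running from 0 to num_item-1: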
#!/usr/bin/python
# -*- coding: utf-8 -*-
from surprise import KNNBasic
from surprise import Dataset
import pandas as pd
from surprise import Reader
import numpy as np
from surprise.model_selection import KFold
import math

num_item = 80
reader = Reader(line_format='user item rating', sep=',')
data = Dataset.load_from_file('rating2.txt', reader=reader)
kf = KFold(n_splits=5)
sim_options = {'name': 'cosine',
               'user_based': False
               }
algo = KNNBasic(sim_options=sim_options, verbose=False)
precision = 0.0
recall = 0.0
map_score = 0.0  # renamed from `map` to avoid shadowing the builtin
ndcg = 0.0
topk = 3
for trainset, testset in kf.split(data):
    algo.fit(trainset)
    # number of distinct users in this test fold
    fenmu = pd.DataFrame(np.array(testset)[:, 0]).drop_duplicates().shape[0]
    real = [[] for i in range(fenmu)]  # ground-truth test items per user
    sor = [[] for i in range(fenmu)]   # top-k recommended items per user
    score = 0.0
    dcg = 0.0
    dic = {}  # maps a user id to its row index in real/sor
    m = 0
    for i in range(len(testset)):
        if int(testset[i][0]) not in dic:
            # first occurrence of this user: score every item for them
            dic[int(testset[i][0])] = m
            m += 1
            ls = []
            real[m - 1].append(int(testset[i][1]))
            for j in range(num_item):
                uid = str(testset[i][0])
                iid = str(j)
                pred = algo.predict(uid, iid)
                ls.append([pred.est, j])  # (estimated rating, item id)
            # sort by estimated rating and keep the top-k items
            ls = sorted(ls, key=lambda x: x[0], reverse=True)
            for s in range(topk):
                sor[m - 1].append(int(ls[s][1]))
        else:
            # user already ranked: just record another ground-truth item
            real[dic[int(testset[i][0])]].append(int(testset[i][1]))
    for i in range(fenmu):
        idcg = 0.0
        ap_score = 0.0
        ap = 0.0
        cg = 0.0
        for y in range(topk):
            if sor[i][y] in real[i]:
                ap_score += 1
                ap += ap_score / (y + 1)
                cg += 1 / math.log(y + 2, 2)
        score += ap / min(len(real[i]), topk)
        # ideal DCG: all hits placed at the top of the list
        for z in range(int(ap_score)):
            idcg += 1 / math.log(z + 2, 2)
        if idcg > 0:
            dcg += cg / idcg
        recall += ap_score / (len(real[i]) * fenmu)
        precision += ap_score / (topk * fenmu)
    map_score += score / fenmu
    ndcg += dcg / fenmu
# average each metric over the 5 folds
print('precision ' + str(precision / 5.0))
print('recall ' + str(recall / 5.0))
print('map ' + str(map_score / 5.0))
print('ndcg ' + str(ndcg / 5.0))