This post collects references on SVD, SVD++, and Asymmetric SVD, and walks through an example that runs SVD with the Surprise library.
The SVD paper:
https://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf
The SVD++ paper: http://www.cs.rochester.edu/twiki/pub/Main/HarpSeminar/Factorization_Meets_the_Neighborhood-_a_Multifaceted_Collaborative_Filtering_Model.pdf
Both papers use the Netflix data, which contains only users' ratings of movies.
An open-source toolkit that implements these algorithms: https://github.com/NicolasHug/Surprise
The toolkit's documentation for the two algorithms: http://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVD
A short explanation of the difference between SVD and SVD++: https://www.quora.com/Whats-the-difference-between-SVD-and-SVD++
A simple SVD tutorial that also shows how to load your own data: https://medium.com/@m_n_malaeb/the-easy-guide-for-building-python-collaborative-filtering-recommendation-system-in-2017-d2736d2e92a8
More links: http://surprise.readthedocs.io/en/stable/notation_standards.html#ricci-2010
http://www.inf.ed.ac.uk/teaching/courses/dme/studpres0910/DME-presentation-final.pdf
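For orientation, the prediction rules of the two models can be written side by side. The notation follows the Surprise documentation linked above: $\mu$ is the global mean rating, $b_u$ and $b_i$ are the user and item biases, $p_u$ and $q_i$ are the user and item factor vectors, $I_u$ is the set of items rated by user $u$, and $y_j$ is an extra factor vector attached to item $j$. This is a summary for reading the links, not a derivation.

```latex
\begin{align*}
\text{SVD:}   \quad \hat{r}_{ui} &= \mu + b_u + b_i + q_i^{T} p_u \\
\text{SVD++:} \quad \hat{r}_{ui} &= \mu + b_u + b_i + q_i^{T}\Bigl(p_u + |I_u|^{-1/2} \sum_{j \in I_u} y_j\Bigr)
\end{align*}
```

The only difference is the sum over $y_j$: SVD++ also uses the fact that a user rated an item at all, independent of the rating value.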
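import zipfile
from surprise import Reader, Dataset, SVD
# Requires Surprise 1.0.5 or later (surprise.model_selection, algo.fit).
from surprise.model_selection import cross_validate
Download and unzip the data
# Unzip ml-100k.zip next to the archive, so that the u.data path below exists
with zipfile.ZipFile('D:/LiangYiHuai/kaggle/music-recommendation-data/ml-100k.zip', 'r') as zf:
    zf.extractall('D:/LiangYiHuai/kaggle/music-recommendation-data')
Read the data
u_data = 'D:/LiangYiHuai/kaggle/music-recommendation-data/ml-100k/u.data'
# Prepare the data to be used in Surprise
reader = Reader(line_format='user item rating timestamp', sep='\t')
data = Dataset.load_from_file(u_data, reader=reader)
Split the data into 5 folds: 4 folds are used for training and the remaining one for validation, which produces the reported metrics such as RMSE and MAE. If the split is not specified explicitly, 5 folds are used by default.
In 10-fold cross-validation, for example, the dataset is divided into ten parts. Each part in turn serves as the validation set while the other nine are used for training, and the mean of the 10 results is taken as the estimate of the algorithm's accuracy. For a more precise estimate, the whole procedure is usually repeated several times and averaged, for example 10 runs of 10-fold cross-validation.
# Choose the algorithm
algo = SVD()
Training
# Run 5-fold cross-validation and report the RMSE and MAE scores
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
# Retrieve the trainset.
trainset = data.build_full_trainset()
algo.fit(trainset)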
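The fold-based evaluation described above can be sketched in plain Python. This is only an illustration of the idea, not Surprise's implementation: the helper names (`k_fold_indices`, `cross_validate_mean_baseline`) and the toy ratings are made up, and a global-mean predictor stands in for SVD so that the example has no dependencies.

```python
import math
import random


def k_fold_indices(n_samples, n_folds, seed=0):
    """Shuffle the indices 0..n_samples-1 and deal them into n_folds folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def rmse(errors):
    """Root mean squared error of a list of prediction errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def mae(errors):
    """Mean absolute error of a list of prediction errors."""
    return sum(abs(e) for e in errors) / len(errors)


def cross_validate_mean_baseline(ratings, n_folds=5, seed=0):
    """Return (mean RMSE, mean MAE) of a global-mean predictor over the folds."""
    rmses, maes = [], []
    for held_out in k_fold_indices(len(ratings), n_folds, seed):
        held_out_set = set(held_out)
        train = [r for i, r in enumerate(ratings) if i not in held_out_set]
        global_mean = sum(train) / len(train)  # "training" the stand-in model
        errors = [ratings[i] - global_mean for i in held_out]
        rmses.append(rmse(errors))
        maes.append(mae(errors))
    return sum(rmses) / n_folds, sum(maes) / n_folds


# Made-up ratings on a 1-5 scale, purely for illustration.
toy_ratings = [4, 3, 5, 1, 2, 4, 4, 5, 3, 2]
mean_rmse, mean_mae = cross_validate_mean_baseline(toy_ratings, n_folds=5)
print(f"RMSE={mean_rmse:.3f} MAE={mean_mae:.3f}")
```

Repeating the call with different `seed` values and averaging the results corresponds to the "several runs of k-fold cross-validation" mentioned in the text.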
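Prediction
# Predict the rating for one user-item pair. The raw ids are passed as strings
# because Dataset.load_from_file() reads them from the file as strings.
userid = str(196)
itemid = str(302)
actual_rating = 4
print(algo.predict(userid, itemid, actual_rating))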
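To make the training and prediction steps less of a black box, here is a minimal pure-Python sketch of a biased matrix factorization trained with stochastic gradient descent, the general scheme the Surprise documentation describes for SVD. It is not Surprise's code: the function names, the hyperparameter values, and the toy ratings are assumptions chosen to keep the example small.

```python
import random


def train_biased_mf(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit mu, b_u, b_i, p_u, q_i by SGD on (user, item, rating) triples."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    bu = {u: 0.0 for u in users}
    bi = {i: 0.0 for i in items}
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}
    for _ in range(n_epochs):
        for u, i, r in ratings:
            dot = sum(pf * qf for pf, qf in zip(p[u], q[i]))
            err = r - (mu + bu[u] + bi[i] + dot)
            # Regularized gradient steps on the squared error.
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr * (err * qif - reg * puf)
                q[i][f] += lr * (err * puf - reg * qif)
    return mu, bu, bi, p, q


def predict(model, u, i, low=1.0, high=5.0):
    """Estimate a rating; unknown users or items contribute no bias or factors."""
    mu, bu, bi, p, q = model
    est = mu + bu.get(u, 0.0) + bi.get(i, 0.0)
    if u in p and i in q:
        est += sum(pf * qf for pf, qf in zip(p[u], q[i]))
    return min(high, max(low, est))  # clip to the rating scale


# Made-up (user, item, rating) triples with string ids, for illustration only.
toy = [("196", "302", 4), ("196", "242", 3), ("186", "302", 3),
       ("22", "377", 1), ("22", "242", 2), ("186", "377", 2)]
model = train_biased_mf(toy)
print(predict(model, "196", "302"))
```

With only six ratings the numbers mean little; the point is the shape of the update loop and the fact that a prediction is the global mean plus biases plus a dot product of factor vectors.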
The complete code:
import zipfile
from surprise import Reader, Dataset, SVD
# Requires Surprise 1.0.5 or later (surprise.model_selection, algo.fit).
from surprise.model_selection import cross_validate
# Unzip ml-100k.zip (skipped here because the archive is already extracted)
# with zipfile.ZipFile('D:/LiangYiHuai/kaggle/music-recommendation-data/ml-100k.zip', 'r') as zf:
#     zf.extractall('D:/LiangYiHuai/kaggle/music-recommendation-data')
u_data = 'D:/LiangYiHuai/kaggle/music-recommendation-data/ml-100k/u.data'
# Read data into a list of strings (only for inspection; Surprise does not use it)
with open(u_data) as f:
    all_lines = f.readlines()
# Prepare the data to be used in Surprise
reader = Reader(line_format='user item rating timestamp', sep='\t')
data = Dataset.load_from_file(u_data, reader=reader)
# Choose the algorithm
algo = SVD()
# Run 5-fold cross-validation and report the RMSE and MAE scores
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
# Retrieve the trainset.
trainset = data.build_full_trainset()
algo.fit(trainset)
# Predict a certain item
userid = str(196)
itemid = str(302)
actual_rating = 4
print(algo.predict(userid, itemid, actual_rating))
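The Reader settings used above (line_format='user item rating timestamp', sep='\t') describe how each line of the data file is laid out, which is what you change when loading your own data. The following plain-Python sketch shows the same idea under stated assumptions: `parse_rating_line` is a hypothetical helper, not part of Surprise, and the sample lines are illustrative values in the u.data layout.

```python
def parse_rating_line(line, line_format="user item rating timestamp", sep="\t"):
    """Split one line on sep and map the pieces to the names in line_format."""
    names = line_format.split()
    values = line.rstrip("\n").split(sep)
    if len(values) != len(names):
        raise ValueError(f"expected {len(names)} fields, got {len(values)}: {line!r}")
    record = dict(zip(names, values))
    # Raw ids stay strings; only the rating is converted to a number.
    return record["user"], record["item"], float(record["rating"])


# Illustrative tab-separated lines: user id, item id, rating, timestamp.
sample_lines = ["196\t242\t3\t881250949\n", "186\t302\t3\t891717742\n"]
ratings = [parse_rating_line(line) for line in sample_lines]
print(ratings)
```

For a comma-separated file without timestamps, the equivalent call would be `parse_rating_line(line, line_format="user item rating", sep=",")`, which mirrors how the two Reader arguments are adjusted for a different file.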
The end.