SVD, SVD++, and Asymmetric SVD, with an Example

This post collects resources on SVD, SVD++, and Asymmetric SVD, and walks through an SVD example written with the surprise package.

1. Resources

The SVD paper:
https://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf

The SVD++ paper: http://www.cs.rochester.edu/twiki/pub/Main/HarpSeminar/Factorization_Meets_the_Neighborhood-_a_Multifaceted_Collaborative_Filtering_Model.pdf

Both papers use the Netflix data, which contains only users' ratings of movies.

An open-source toolkit that implements these algorithms: https://github.com/NicolasHug/Surprise
The toolkit's documentation for both algorithms: http://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVD

A short explanation of the difference between SVD and SVD++: https://www.quora.com/Whats-the-difference-between-SVD-and-SVD++
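For orientation, the two prediction rules can be written side by side. This is a sketch in the notation of the Surprise documentation (an assumption here, since this post only links to the sources): \mu is the global mean rating, b_u and b_i are the user and item biases, p_u and q_i are the user and item factor vectors, I_u is the set of items rated by user u, and y_j is an additional per-item factor vector that SVD++ learns.

```latex
% SVD (biased matrix factorization)
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{T} p_u

% SVD++
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{T} \left( p_u + |I_u|^{-\frac{1}{2}} \sum_{j \in I_u} y_j \right)
```

The only difference is the extra sum over y_j, which captures implicit feedback: the fact that a user rated an item at all, regardless of the rating value.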

A simple tutorial on using SVD, including how to load your own data: https://medium.com/@m_n_malaeb/the-easy-guide-for-building-python-collaborative-filtering-recommendation-system-in-2017-d2736d2e92a8

More links: http://surprise.readthedocs.io/en/stable/notation_standards.html#ricci-2010

http://www.inf.ed.ac.uk/teaching/courses/dme/studpres0910/DME-presentation-final.pdf

2. An SVD example:

import zipfile
from surprise import Reader, Dataset, SVD, evaluate

Download and extract the data

# Unzip ml-100k.zip into the current working directory.
# Use a separate variable name so the zipfile module is not shadowed.
with zipfile.ZipFile('D:/LiangYiHuai/kaggle/music-recommendation-data/ml-100k.zip', 'r') as archive:
    archive.extractall()

Read the data

u_data = 'D:/LiangYiHuai/kaggle/music-recommendation-data/ml-100k/u.data'

# Prepare the data to be used in Surprise
reader = Reader(line_format='user item rating timestamp', sep='\t')
data = Dataset.load_from_file(u_data, reader=reader)
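To make the line format concrete, here is a minimal sketch of what `line_format='user item rating timestamp'` with `sep='\t'` describes. The function name `parse_ratings` and the two sample lines are illustrative assumptions; this is not how Surprise reads files internally.

```python
def parse_ratings(lines, sep='\t'):
    """Parse 'user item rating timestamp' lines into tuples."""
    parsed = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user, item, rating, timestamp = line.split(sep)
        # ids stay as strings; only the rating is converted to a number
        parsed.append((user, item, float(rating), timestamp))
    return parsed


# Two illustrative lines in the tab-separated u.data layout
sample = ["196\t242\t3\t881250949\n", "186\t302\t3\t891717742\n"]
rows = parse_ratings(sample)
print(rows[0])
```

Keeping the ids as strings mirrors why the prediction step later passes `str(196)` rather than the integer `196`.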

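Split the data into 5 folds: in each round, 4 folds are used for training and the remaining fold for validation, which produces reports such as RMSE and MAE. If you do not split explicitly, the data is still split into 5 folds by default.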

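In 10-fold cross-validation, for example, the dataset is divided into ten parts; each part in turn serves as the validation set while the other nine are used for training, and the mean of the 10 results is taken as the estimate of the algorithm's accuracy. The whole procedure is usually repeated several times and averaged again (for example, 10 runs of 10-fold cross-validation) to get a somewhat more precise estimate.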
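The fold rotation described above can be sketched in a few lines of plain Python. The helper name `k_fold_splits` and the `seed` parameter are assumptions made for this illustration; Surprise has its own splitting logic.

```python
import random


def k_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold f takes every n_folds-th shuffled index starting at position f,
    # so fold sizes differ by at most one.
    folds = [indices[f::n_folds] for f in range(n_folds)]
    for f in range(n_folds):
        test = folds[f]
        train = [i for g, fold in enumerate(folds) if g != f for i in fold]
        yield train, test


for train, test in k_fold_splits(10, n_folds=5):
    print(len(train), "training samples,", len(test), "validation samples")
```

Each sample lands in the validation set exactly once, so averaging the per-fold scores uses every rating for validation without ever training and validating on the same rating in one round.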

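# Split the dataset into 5 folds and choose the algorithm.
# Note: Dataset.split() belongs to older Surprise releases; later releases
# take the fold count as cv=5 in surprise.model_selection.cross_validate().
data.split(n_folds=5)
algo = SVD()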

Training

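# Train and test on each fold, reporting the RMSE and MAE scores.
# Note: evaluate() belongs to older Surprise releases; later releases
# provide surprise.model_selection.cross_validate() for this step.
evaluate(algo, data, measures=['RMSE', 'MAE'])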
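For reference, the two error measures requested in the evaluation call can be written out directly. This is a minimal sketch: the function names `rmse` and `mae` and the list-of-pairs input are assumptions for illustration, not Surprise's own accuracy API.

```python
import math


def rmse(pairs):
    """Root mean squared error over (actual, estimated) rating pairs."""
    return math.sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error over (actual, estimated) rating pairs."""
    return sum(abs(r - est) for r, est in pairs) / len(pairs)


# Two made-up (actual, estimated) pairs with errors of 1 and -2
pairs = [(4, 3), (2, 4)]
print(rmse(pairs), mae(pairs))
```

RMSE squares the errors before averaging, so it penalizes a few large misses more heavily than MAE does.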

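# Build a trainset from all the ratings and train on it.
# Note: later Surprise releases name this method fit() instead of train().
trainset = data.build_full_trainset()
algo.train(trainset)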
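To give a feel for what training a biased matrix factorization model involves, here is a small pure-Python sketch of stochastic gradient descent on the regularized squared error. It is a simplified illustration, not Surprise's implementation: the function name `train_biased_mf`, the hyperparameter values, and the toy ratings are all assumptions.

```python
import random


def train_biased_mf(ratings, n_factors=2, n_epochs=100, lr=0.01, reg=0.02, seed=0):
    """Fit mu, biases, and factors on (user, item, rating) triples with SGD."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    b_u = {u: 0.0 for u, _, _ in ratings}              # user biases
    b_i = {i: 0.0 for _, i, _ in ratings}              # item biases
    # Factor vectors start as small random values
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in b_u}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in b_i}
    for _ in range(n_epochs):
        for u, i, r in ratings:
            dot = sum(pf * qf for pf, qf in zip(p[u], q[i]))
            err = r - (mu + b_u[u] + b_i[i] + dot)
            # Move each parameter along its gradient, shrunk by the regularizer
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            for f in range(n_factors):
                pf, qf = p[u][f], q[i][f]
                p[u][f] += lr * (err * qf - reg * pf)
                q[i][f] += lr * (err * pf - reg * qf)
    return mu, b_u, b_i, p, q


# Toy ratings, invented for the example
toy = [('a', 'x', 5), ('a', 'y', 4), ('b', 'x', 3),
       ('b', 'y', 2), ('c', 'x', 4), ('c', 'y', 3)]
mu, b_u, b_i, p, q = train_biased_mf(toy)
print(mu, b_u)
```

On data like this, where ratings differ mostly by a per-user and per-item offset, most of the fit comes from the bias terms; the factor vectors matter more when users disagree about which items are good.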

Prediction

# Predict the rating of one item for one user.
# Raw ids are strings because the data was loaded from a file.
userid = str(196)
itemid = str(302)
actual_rating = 4
print(algo.predict(userid, itemid, actual_rating))
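As a rough picture of what a single prediction computes, the sketch below combines the global mean, the two biases, and the factor dot product, then clips the result to the rating scale. The function name `predict_rating`, the parameter values, and the 1 to 5 scale (chosen to match MovieLens ratings) are assumptions for illustration, not values taken from a trained model.

```python
def predict_rating(mu, b_u, b_i, p_u, q_i, rating_scale=(1, 5)):
    """Estimate a rating as mu + b_u + b_i + q_i . p_u, clipped to the scale."""
    dot = sum(pf * qf for pf, qf in zip(p_u, q_i))
    estimate = mu + b_u + b_i + dot
    low, high = rating_scale
    return min(high, max(low, estimate))


# Made-up parameters for one user and one item
print(predict_rating(3.5, 0.2, -0.1, [0.1, 0.2], [0.3, 0.4]))
```

Clipping keeps the estimate inside the range of valid ratings, which also keeps error measures from being inflated by out-of-range predictions.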

3. The complete code:

import zipfile
from surprise import Reader, Dataset, SVD, evaluate

# Unzip ml-100k.zip (only needed once; uncomment on the first run)
# with zipfile.ZipFile('D:/LiangYiHuai/kaggle/music-recommendation-data/ml-100k.zip', 'r') as archive:
#     archive.extractall()

u_data = 'D:/LiangYiHuai/kaggle/music-recommendation-data/ml-100k/u.data'

# Prepare the data to be used in Surprise
reader = Reader(line_format='user item rating timestamp', sep='\t')
data = Dataset.load_from_file(u_data, reader=reader)

# Split the dataset into 5 folds and choose the algorithm
data.split(n_folds=5)
algo = SVD()

# Train and test reporting the RMSE and MAE scores
evaluate(algo, data, measures=['RMSE', 'MAE'])

# Retrieve the trainset.
trainset = data.build_full_trainset()
algo.train(trainset)

# Predict a certain item
userid = str(196)
itemid = str(302)
actual_rating = 4
print(algo.predict(userid, itemid, actual_rating))

The end.
