1: Loading the dataset
from surprise import Dataset, Reader

def load_format2trainset():
    file_path = "F:\\ML\\recommendation_data\\music_playlist_farmat.txt"
    # Describe the file layout: one "user,item,rating,timestamp" record per line
    reader = Reader(line_format='user item rating timestamp', sep=',')
    # Read the ratings from the file
    music_data = Dataset.load_from_file(file_path, reader=reader)
    print("Building the dataset...")
    retrainset = music_data.build_full_trainset()
    return retrainset
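The main classes used are:
Reader --- parses a file containing ratings.
Dataset --- provides dataset operations. Its main methods are:
load_builtin('dataset name') # load a built-in dataset
load_from_df() # load data from a pandas DataFrame
load_from_file() # load the user's own data file
load_from_folds() # load several pre-split files, for example: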
# folds_files is a list of tuples containing file paths:
# [(u1.base, u1.test), (u2.base, u2.test), ... (u5.base, u5.test)]
train_file = files_dir + 'u%d.base'
test_file = files_dir + 'u%d.test'
folds_files = [(train_file % i, test_file % i) for i in (1, 2, 3, 4, 5)]
data = Dataset.load_from_folds(folds_files, reader=reader)
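Operations on the dataset include:
build_full_trainset() # do not split the data; build a trainset from all ratings
split(n_folds=5, shuffle=True) # split the dataset into folds (available in older Surprise versions; it was deprecated and later removed in favor of the iterators in surprise.model_selection, such as KFold)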
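2: Choosing an algorithm. The surprise library mainly offers two families of algorithms, neighborhood-based (KNN) collaborative filtering and matrix factorization, plus a few simpler baselines: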
Algorithm | Description
random_pred.NormalPredictor | Algorithm predicting a random rating based on the distribution of the training set, which is assumed to be normal.
baseline_only.BaselineOnly | Algorithm predicting the baseline estimate for given user and item.
knns.KNNBasic | A basic collaborative filtering algorithm.
knns.KNNWithMeans | A basic collaborative filtering algorithm, taking into account the mean ratings of each user.
knns.KNNWithZScore | A basic collaborative filtering algorithm, taking into account the z-score normalization of each user.
knns.KNNBaseline | A basic collaborative filtering algorithm taking into account a baseline rating.
matrix_factorization.SVD | The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize.
matrix_factorization.SVDpp | The SVD++ algorithm, an extension of SVD taking into account implicit ratings.
matrix_factorization.NMF | A collaborative filtering algorithm based on Non-negative Matrix Factorization.
slope_one.SlopeOne | A simple yet accurate collaborative filtering algorithm.
co_clustering.CoClustering | A collaborative filtering algorithm based on co-clustering.
3: Training the model
Below is sample code (incomplete; it uses playlist recommendation as the example):
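from surprise import KNNBaseline

algo = KNNBaseline()
algo.fit(trainset)  # train the model
# name_id_dic maps playlist names to playlist ids; listid is the index of the chosen playlist
current_playlist = list(name_id_dic.keys())[listid]
playlist_id = name_id_dic[current_playlist]
# Look up the corresponding inner user id with to_inner_uid (raw id -> inner id)
playlist_inner_id = algo.trainset.to_inner_uid(playlist_id)
# Get the 10 nearest neighbor playlists; the returned values are inner ids
playlist_neighbors = algo.get_neighbors(playlist_inner_id, k=10)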
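Commonly used methods and attributes of the Trainset class:
to_inner_uid(ruid): Convert a user raw id to an inner id.
to_inner_iid(riid): Convert an item raw id to an inner id.
to_raw_iid(iiid): Convert an item inner id to a raw id.
to_raw_uid(iuid): Convert a user inner id to a raw id.
all_items(): Generator function to iterate over all items; yields item inner ids.
all_users(): Generator function to iterate over all users; yields user inner ids.
all_ratings(): Generator function to iterate over all ratings; yields tuples (uid, iid, rating), where the ids are inner ids.
build_testset(): Returns a list of ratings that can be used as a testset; the ids in it are raw ids.
ur: The users' ratings. A dictionary whose keys are user inner ids and whose values are lists of (item_inner_id, rating) tuples.
ir: The items' ratings. A dictionary whose keys are item inner ids and whose values are lists of (user_inner_id, rating) tuples.
n_users / n_items / n_ratings: The number of users, items, and ratings in the dataset.
rating_scale: The rating range, as a (minimum, maximum) tuple.
global_mean: The mean of all ratings.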
The algorithm base class (AlgoBase). Its main method is predict(uid, iid, r_ui=None, clip=True, verbose=False):
Parameters:
uid – (Raw) id of the user.
iid – (Raw) id of the item.
r_ui (float) – The true rating $r_{ui}$. Optional, default is None.
clip (bool) – Whether to clip the estimation into the rating scale. For example, if $\hat{r}_{ui}$ is 5.5 while the rating scale is [1, 5], then $\hat{r}_{ui}$ is set to 5. Same goes if $\hat{r}_{ui} < 1$. Default is True.
verbose (bool) – Whether to print details of the prediction. Default is False.
Returns:
A Prediction object containing:
The (raw) user id uid.
The (raw) item id iid.
The true rating r_ui ($r_{ui}$).
The estimated rating ($\hat{r}_{ui}$).
Some additional details about the prediction that might be useful for later analysis.