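Please credit the source when reposting: http://blog.csdn.net/chen19920219/article/details/76905381
I recently found a very good open-source recommender system library on GitHub with more than 700 stars. It includes the commonly used matrix factorization algorithms, such as SVD, SVD++, and NMF. GitHub: https://github.com/NicolasHug/Surprise. Installing and using it involved quite a few pitfalls, so I am recording them here:
Installing Surprise
The official documentation lists Python 2.7 or 3.5 as the supported environment. I used 3.5 and have not tried other versions.
The documentation describes two installation methods; I use the first one here: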
$ pip install numpy
$ pip install scikit-surprise
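Make sure the numpy module is installed before you start. When I then installed Surprise, pip kept failing with the error "unable to find vcvarsall.bat" (on Windows this means pip could not find the Microsoft Visual C++ compiler it needs to build the package's C extensions). A search turned up this fix: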
http://jingyan.baidu.com/article/adc815138162e8f723bf7387.html
After that, running pip install scikit-surprise again worked.
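To confirm that the installation succeeded, a quick check like the one below can help. It is only a sketch: is_installed is a hypothetical helper written for this post, not part of Surprise. Note that the package is installed as scikit-surprise but imported under the name surprise.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # scikit-surprise is imported under the module name "surprise".
    for name in ("numpy", "surprise"):
        print(name, "found" if is_installed(name) else "missing")
```

If surprise is reported as missing, the pip install step above did not complete in the interpreter you are running.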
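Using Surprise
Surprise ships with built-in datasets, and loading them works differently from loading your own data. I will not cover the built-in datasets here; the focus is on how to load a local dataset of your own and on the methods you will use most often.
The official API documentation describes how to load a local dataset: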
Load a custom dataset
You can of course use a custom dataset. Surprise offers two ways of loading a custom dataset:
· you can either specify a single file with all the ratings and use the split() method to perform cross-validation;
· or if your dataset is already split into predefined folds, you can specify a list of files for training and testing.
Either way, you will need to define a Reader object for Surprise to be able to parse the file(s).
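As described above, there are two ways to load your own dataset:
1. Put all the ratings in one file and use the built-in split() method to set up k-fold cross-validation.
2. If you have already split the data into k folds yourself, pass a list of files for training and testing.
In practice the first method is preferable: the library runs the k-fold experiments for you, so you do not have to split the dataset by hand.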
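To make clear what "k-fold" means here, the following standard-library sketch splits rating indices into k folds. It only illustrates the idea and is not how Surprise implements split(); the function name k_fold_indices and the seed parameter are assumptions made for this example.

```python
import random


def k_fold_indices(n_ratings, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    if not 2 <= n_folds <= n_ratings:
        raise ValueError("n_folds must be between 2 and n_ratings")
    indices = list(range(n_ratings))
    # Shuffle once so that each fold is a random sample of the ratings.
    random.Random(seed).shuffle(indices)
    for fold in range(n_folds):
        # Every n_folds-th shuffled index, starting at `fold`, is held out.
        test_idx = indices[fold::n_folds]
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, test_idx


if __name__ == "__main__":
    for train_idx, test_idx in k_fold_indices(10, n_folds=5):
        print(len(train_idx), "train /", len(test_idx), "test")
```

Each rating lands in the test set of exactly one fold and in the training set of all the others, which is what lets a single file serve for every experiment.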
Load an entire dataset
From file examples/load_custom_dataset.py
import os

from surprise import Dataset, Reader

# path to dataset file
file_path = os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/u.data')
# As we're loading a custom dataset, we need to define a reader. In the
# movielens-100k dataset, each line has the following format:
# 'user item rating timestamp', separated by '\t' characters.
reader = Reader(line_format='user item rating timestamp', sep='\t')
data = Dataset.load_from_file(file_path, reader=reader)
# split the ratings into 5 folds for cross-validation
data.split(n_folds=5)
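The example above is the loading demo from the official documentation. A Reader has to be created before the dataset is loaded, because the method that loads a local file takes two parameters: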
classmethod load_from_file(file_path, reader)
Load a dataset from a (custom) file.
Use this if you want to use a custom dataset and all of the ratings are stored in one file. You will have to split your dataset using the split method. See an example in the User Guide.
Parameters:
· file_path (string) – The path to the file containing ratings.
· reader (Reader) – A reader to read the file.
The first is the path to your dataset, and the second is a Reader object. The Reader class is documented as follows:
class surprise.dataset.Reader(name=None, line_format=None, sep=None, rating_scale=(1, 5), skip_lines=0)
The Reader class is used to parse a file containing ratings.
Such a file is assumed to specify only one rating per line, and each line needs to respect the following structure:
user ; item ; rating ; [timestamp]
where the order of the fields and the separator (here ';') may be arbitrarily defined (see below). Brackets indicate that the timestamp field is optional.
Parameters:
· name (string, optional) – If specified, a Reader for one of the built-in datasets is returned and any other parameter is ignored. Accepted values are 'ml-100k', 'ml-1m', and 'jester'. Default is None.
· line_format (string) – The fields names, in the order at which they are encountered on a line. Example: 'item user rating'.
· sep (char) – the separator between fields. Example: ';'.
· rating_scale (tuple, optional) – The rating scale used for every rating. Default is (1, 5).
· skip_lines (int, optional) – Number of lines to skip at the beginning of the file. Default is 0.
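As stated above, the Reader class parses a ratings file, and each line of that file must have the structure:
user ; item ; rating ; [timestamp]
The timestamp can be left out. user is the user's id, item is the item's id, and rating is the rating that user gave to that item.
You can also define a different field order and separator; see the API documentation for details.
Of the Reader parameters, line_format and sep are the ones you will usually set. The rest can stay at their defaults, though you can pass them as well if your data requires it. line_format describes the layout of a line, i.e. the user ; item ; rating above, and sep says how to split a line into fields, for example on spaces or commas.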
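The roles of line_format and sep can be sketched in a few lines of plain Python. This is an illustration only and not the actual Reader implementation; parse_line is a hypothetical helper, and the sample lines are made up for the example.

```python
def parse_line(line, line_format="user item rating timestamp", sep="\t"):
    """Split one ratings line into (user, item, rating, timestamp)."""
    fields = line_format.split()
    values = [value.strip() for value in line.strip().split(sep)]
    if len(values) != len(fields):
        raise ValueError(
            "expected %d fields, got %d" % (len(fields), len(values)))
    record = dict(zip(fields, values))
    # The timestamp is optional; user and item ids are kept as raw strings.
    return (record["user"], record["item"], float(record["rating"]),
            record.get("timestamp"))


if __name__ == "__main__":
    # Tab-separated line in 'user item rating timestamp' order.
    print(parse_line("196\t242\t3\t881250949\n"))
    # Semicolon-separated line with the item first and no timestamp.
    print(parse_line("7;42;4.5", line_format="item user rating", sep=";"))
```

Because line_format names the fields in the order they appear, the same parser handles files whose columns are arranged differently; only the two arguments change.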
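data.split(n_folds=3) sets up 3-fold cross-validation; if you leave the call out, the default is 5 folds. Later Surprise releases moved dataset splitting into the surprise.model_selection module, so check the documentation for the version you have installed.
The next post covers in detail how to run experiments on your own dataset and how to evaluate the results.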