Recommender Systems -- The Surprise Model Selection Module (model_selection)

The model_selection package in the Surprise library provides cross-validation and parameter-search tools for algorithms.

1: Cross-validation iterators (similar to scikit-learn)

KFold: basic k-fold cross-validation.

RepeatedKFold: repeated k-fold cross-validation.

ShuffleSplit: basic cross-validation with shuffled train and test sets.

LeaveOneOut: cross-validation where each user has exactly one rating in the test set.

PredefinedKFold: cross-validation for datasets that were loaded with the load_from_folds method.

The module also provides a train_test_split function for producing a single train/test split of a dataset.

  • surprise.model_selection.split.KFold(n_splits=5, random_state=None, shuffle=True)

This class provides a split(data) method that yields (trainset, testset) tuples.

On each iteration, one fold is held out as the test set while the remaining k-1 folds are used for training:

Parameters: n_splits (int) – The number of folds.

          random_state – Determines the RNG used to split the data:

                  1: int, random_state is used as the seed for a new RNG. This is useful to get the same splits over multiple calls to split().

                  2: RandomState instance, this same instance is used as the RNG (Random Number Generator).

                  3: None, the current RNG from numpy is used.

                  Note: random_state is only used when shuffle is True. Default is None.

      shuffle (bool) – Whether to shuffle the ratings before splitting. Shuffling is not done in place. Default is True.
from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import KFold

# Load the movielens-100k dataset
data = Dataset.load_builtin('ml-100k')

# define a cross-validation iterator
kf = KFold(n_splits=3)

algo = SVD()

for trainset, testset in kf.split(data):

    # train and test algorithm.
    algo.fit(trainset)
    predictions = algo.test(testset)

    # Compute and print Root Mean Squared Error
    accuracy.rmse(predictions, verbose=True)

Output:

RMSE: 0.9374
RMSE: 0.9476
RMSE: 0.9478

  • surprise.model_selection.split.LeaveOneOut(n_splits=5, random_state=None)

    Cross-validation iterator where each testset contains exactly one rating per user. Contrary to other cross-validation strategies, random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets. The parameters are similar to those of KFold above.
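A minimal sketch of using LeaveOneOut, following the same pattern as the KFold example above (the n_splits and random_state values here are arbitrary):

from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import LeaveOneOut

# Load the movielens-100k dataset
data = Dataset.load_builtin('ml-100k')

# each testset will contain exactly one rating per user
loo = LeaveOneOut(n_splits=3, random_state=1)

algo = SVD()

for trainset, testset in loo.split(data):
    algo.fit(trainset)
    predictions = algo.test(testset)
    accuracy.rmse(predictions, verbose=True)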

  • surprise.model_selection.split.PredefinedKFold
import os

from surprise import SVD
from surprise import Dataset
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import PredefinedKFold

# path to dataset folder
files_dir = os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/')

# This time, we'll use the built-in reader.
reader = Reader('ml-100k')

# folds_files is a list of tuples containing file paths:
# [(u1.base, u1.test), (u2.base, u2.test), ... (u5.base, u5.test)]
train_file = files_dir + 'u%d.base'
test_file = files_dir + 'u%d.test'
folds_files = [(train_file % i, test_file % i) for i in (1, 2, 3, 4, 5)]

data = Dataset.load_from_folds(folds_files, reader=reader)
pkf = PredefinedKFold()

algo = SVD()

for trainset, testset in pkf.split(data):

    # train and test algorithm.
    algo.fit(trainset)
    predictions = algo.test(testset)

    # Compute and print Root Mean Squared Error
    accuracy.rmse(predictions, verbose=True)

  • surprise.model_selection.split.RepeatedKFold(n_splits=5, n_repeats=10, random_state=None)

        Repeats KFold n_repeats times, with different random splits in each repetition.
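A minimal sketch, with arbitrary n_splits and n_repeats values; the loop below yields n_splits × n_repeats = 4 (trainset, testset) pairs in total:

from surprise import SVD, Dataset, accuracy
from surprise.model_selection import RepeatedKFold

data = Dataset.load_builtin('ml-100k')
algo = SVD()

# 2 folds, repeated twice -> 4 evaluations
rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=0)

for trainset, testset in rkf.split(data):
    algo.fit(trainset)
    accuracy.rmse(algo.test(testset), verbose=True)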

  • surprise.model_selection.split.ShuffleSplit(n_splits=5,test_size=0.2,train_size=None,random_state=None, shuffle=True)

           Basic cross-validation iterator with random train/test splits.

  • surprise.model_selection.split.train_test_split(data, test_size=0.2, train_size=None, random_state=None, shuffle=True)
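A minimal sketch of train_test_split, which produces a single random split (the 0.25 test size here is arbitrary):

from surprise import SVD, Dataset, accuracy
from surprise.model_selection import train_test_split

data = Dataset.load_builtin('ml-100k')

# hold out 25% of the ratings as a test set
trainset, testset = train_test_split(data, test_size=0.25)

algo = SVD()
algo.fit(trainset)
predictions = algo.test(testset)
accuracy.rmse(predictions, verbose=True)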

2: Cross-validation

  • surprise.model_selection.validation.cross_validate(algo, data, measures=[u'rmse', u'mae'], cv=None, return_train_measures=False, n_jobs=-1, pre_dispatch=u'2*n_jobs', verbose=False)
from surprise import SVD
from surprise import Dataset
from surprise.model_selection import cross_validate


# Load the movielens-100k dataset (download it if needed),
data = Dataset.load_builtin('ml-100k')

# We'll use the famous SVD algorithm.
algo = SVD()

# Run 5-fold cross-validation and print results
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
Output:
Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

            Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std
RMSE        0.9311  0.9370  0.9320  0.9317  0.9391  0.9342  0.0032
MAE         0.7350  0.7375  0.7341  0.7342  0.7375  0.7357  0.0015
Fit time    6.53    7.11    7.23    7.15    3.99    6.40    1.23
Test time   0.26    0.26    0.25    0.15    0.13    0.21    0.06

Parameters:

  • algo (AlgoBase) – The algorithm to evaluate.
  • data (Dataset) – The dataset on which to evaluate the algorithm.
  • measures (list of string) – The performance measures to compute. Allowed names are function names defined in the accuracy module. Default is ['rmse', 'mae'].
  • cv (cross-validation iterator, int or None) – Determines how the data is split into trainsets and testsets. If an int is passed, it is used as the n_splits parameter of KFold. If None, KFold is used with n_splits=5.
  • return_train_measures (bool) – Whether to also compute performance measures on the trainsets. Default is False.
  • n_jobs (int) –

    The maximum number of folds evaluated in parallel.

    • If -1, all CPUs are used.
    • If 1 is given, no parallel computing is used, which is useful for debugging.
    • For n_jobs below -1, (n_cpus + n_jobs + 1) CPUs are used. For example, with n_jobs = -2 all CPUs but one are used.

    Default is -1.

  • pre_dispatch (int or string) – Controls the number of jobs dispatched during parallel execution. Reducing this number can avoid an explosion of memory consumption when more jobs get dispatched than the CPUs can process. This parameter can be:
    • None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs to avoid delays due to on-demand spawning.
    • An int, giving the exact number of total jobs that are spawned.
    • A string, giving an expression as a function of n_jobs, e.g. '2*n_jobs'. Default is '2*n_jobs'.
  • verbose (int) – If True, accuracy measures for each split are printed, as well as the mean and standard deviation over all splits. Default is False: nothing is printed.
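cross_validate also returns its results as a dict of arrays with one entry per fold; a minimal sketch of inspecting it (key names follow the requested measures):

from surprise import SVD, Dataset
from surprise.model_selection import cross_validate

data = Dataset.load_builtin('ml-100k')
results = cross_validate(SVD(), data, measures=['RMSE', 'MAE'], cv=5,
                         return_train_measures=True, verbose=False)

print(results['test_rmse'])   # RMSE on each of the 5 testsets
print(results['train_rmse'])  # present because return_train_measures=True
print(results['fit_time'])    # fit time of each fold, in seconds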

3: Parameter search

class surprise.model_selection.search.GridSearchCV(algo_class, param_grid, measures=[u'rmse', u'mae'], cv=None, refit=False, return_train_measures=False, n_jobs=-1, pre_dispatch=u'2*n_jobs', joblib_verbose=0)

Runs a cross-validation procedure for each combination of the given algorithm parameters. This is useful for finding the best set of parameters for a prediction algorithm. It is analogous to GridSearchCV from scikit-learn.

from surprise import SVD
from surprise import Dataset
from surprise.model_selection import GridSearchCV

# Use movielens-100K
data = Dataset.load_builtin('ml-100k')

param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)

gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

Output:

0.961300130118
{'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}
Parameters:
  • algo_class (AlgoBase) – The class of the algorithm to evaluate.
  • param_grid (dict) – Dictionary with algorithm parameters as keys and lists of values to try as values. All combinations will be evaluated with the desired algorithm. Dict parameters such as sim_options require special treatment; see the sketch after this list.
  • measures (list of string) – The performance measures to compute. Allowed names are function names defined in the accuracy module. Default is ['rmse', 'mae'].
  • cv (cross-validation iterator, int or None) – Determines how the data is split into trainsets and testsets. If an int is passed, it is used as the n_splits parameter of KFold. If None, KFold is used with n_splits=5.
  • refit (bool or str) – If True, refit the algorithm on the whole dataset using the parameter combination that gave the best average performance for the first measure in measures. Other measures can be used by passing the corresponding string; the refit algorithm can then be used with the test() and predict() methods. refit can only be used when the data parameter given to fit() has not been loaded with load_from_folds(). Default is False.
  • return_train_measures (bool) – Whether to also compute performance measures on the trainsets. Default is False.
  • n_jobs (int) – Same as for cross_validate above.
  • pre_dispatch (int or string) – Same as above.
  • joblib_verbose (int) – Controls the verbosity of joblib: the higher, the more messages.
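As noted for param_grid, dict-valued parameters such as sim_options are given as a dict of lists, and all inner combinations are expanded. A minimal sketch, using KNNBasic as an illustrative algorithm:

from surprise import KNNBasic, Dataset
from surprise.model_selection import GridSearchCV

data = Dataset.load_builtin('ml-100k')

# sim_options is itself a dict: provide lists for its inner values
param_grid = {'k': [20, 40],
              'sim_options': {'name': ['msd', 'cosine'],
                              'user_based': [False, True]}}

gs = GridSearchCV(KNNBasic, param_grid, measures=['rmse'], cv=3)
gs.fit(data)
print(gs.best_params['rmse'])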
Attributes:
best_estimator

dict of AlgoBase – Using an accuracy measure as key, gets the algorithm that gave the best accuracy results for that measure, averaged over all splits.

best_score

dict of floats – Using an accuracy measure as key, gets the best average score achieved for that measure.

best_params

dict of dicts – Using an accuracy measure as key, gets the combination of parameters that gave the best accuracy results for that measure.

best_index

dict of ints – Using an accuracy measure as key, gets the index (usable with cv_results) of the combination that achieved the highest accuracy for that measure.

cv_results

dict of arrays – A dict that contains accuracy measures over all splits, as well as train and test time for each parameter combination. Can be imported into a pandas DataFrame.

'split0_test_rmse': [1.0, 1.0, 0.97, 0.98, 0.98, 0.99, 0.96, 0.97]
'split1_test_rmse': [1.0, 1.0, 0.97, 0.98, 0.98, 0.99, 0.96, 0.97]
'split2_test_rmse': [1.0, 1.0, 0.97, 0.98, 0.98, 0.99, 0.96, 0.97]
'mean_test_rmse':   [1.0, 1.0, 0.97, 0.98, 0.98, 0.99, 0.96, 0.97]
'std_test_rmse':    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
'rank_test_rmse':   [7 8 3 5 4 6 1 2]
'split0_test_mae':  [0.81, 0.82, 0.78, 0.79, 0.79, 0.8, 0.77, 0.79]
'split1_test_mae':  [0.8, 0.81, 0.78, 0.79, 0.78, 0.79, 0.77, 0.78]
'split2_test_mae':  [0.81, 0.81, 0.78, 0.79, 0.78, 0.8, 0.77, 0.78]
'mean_test_mae':    [0.81, 0.81, 0.78, 0.79, 0.79, 0.8, 0.77, 0.78]
'std_test_mae':     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
'rank_test_mae':    [7 8 2 5 4 6 1 3]
'mean_fit_time':    [1.53, 1.52, 1.53, 1.53, 3.04, 3.05, 3.06, 3.02]
'std_fit_time':     [0.03, 0.04, 0.0, 0.01, 0.04, 0.01, 0.06, 0.01]
'mean_test_time':   [0.46, 0.45, 0.44, 0.44, 0.47, 0.49, 0.46, 0.34]
'std_test_time':    [0.0, 0.01, 0.01, 0.0, 0.03, 0.06, 0.01, 0.08]
'params':           [{'n_epochs': 5, 'lr_all': 0.002, 'reg_all': 0.4}, {'n_epochs': 5, 'lr_all': 0.002, 'reg_all': 0.6}, {'n_epochs': 5, 'lr_all': 0.005, 'reg_all': 0.4}, {'n_epochs': 5, 'lr_all': 0.005, 'reg_all': 0.6}, {'n_epochs': 10, 'lr_all': 0.002, 'reg_all': 0.4}, {'n_epochs': 10, 'lr_all': 0.002, 'reg_all': 0.6}, {'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}, {'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.6}]
'param_n_epochs':   [5, 5, 5, 5, 10, 10, 10, 10]
'param_lr_all':     [0.0, 0.0, 0.01, 0.01, 0.0, 0.0, 0.01, 0.01]
'param_reg_all':    [0.4, 0.6, 0.4, 0.6, 0.4, 0.6, 0.4, 0.6]
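As mentioned above, cv_results can be loaded into a pandas DataFrame for easier inspection; a minimal sketch, assuming the fitted gs object from the GridSearchCV example:

import pandas as pd

# one row per parameter combination, one column per cv_results key
results_df = pd.DataFrame.from_dict(gs.cv_results)
print(results_df[['params', 'mean_test_rmse', 'rank_test_rmse']])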
fit(data)

Runs the fit() method of the algorithm on all parameter combinations, in parallel, over the splits defined by the cv parameter.

Parameters: data (Dataset) – The dataset on which to evaluate the algorithm.
predict(*args)

Calls predict() on the estimator with the best found parameters (as determined by the refit parameter). See AlgoBase.predict().

Only available if refit is not False.

test(testset, verbose=False)

Calls test() on the estimator with the best found parameters (as determined by the refit parameter). Only available if refit is not False. See AlgoBase.test().
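A minimal sketch of using refit together with predict(); the raw user and item ids '196' and '302' are arbitrary ml-100k ids:

from surprise import SVD, Dataset
from surprise.model_selection import GridSearchCV

data = Dataset.load_builtin('ml-100k')
param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005]}

# refit=True: after the search, SVD is refit on the whole dataset
# with the best parameters for the first measure ('rmse')
gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=3, refit=True)
gs.fit(data)

# gs now delegates to the refit best estimator
pred = gs.predict('196', '302')
print(pred.est)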




