The model_selection package of the Surprise library provides tools for cross-validating algorithms and for searching for their best parameters.
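1: Cross-validation iterators (similar to scikit-learn)
KFold: basic k-fold cross-validation.
RepeatedKFold: repeated k-fold cross-validation.
ShuffleSplit: basic cross-validation with random trainsets and testsets.
LeaveOneOut: cross-validation where each user has exactly one rating in the testset.
PredefinedKFold: cross-validation for a dataset that was loaded with the load_from_folds method.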
The module also provides a train_test_split function that splits a dataset into a trainset and a testset.
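Each iterator class provides a split(dataset) method that yields tuples of (trainset, testset).
KFold: in each iteration one fold is held out as the testset and the other k-1 folds are used for training.
Parameters:
n_splits (int) – The number of folds.
random_state (int, RandomState instance or None) – Determines the RNG (random number generator) used to shuffle the data before splitting:
1: int, random_state is used as the seed of a new RNG. This makes repeated calls to split() return the same splits.
2: RandomState instance, this same instance is used as the RNG.
3: None, the current RNG from numpy is used.
Note: random_state is only used when shuffle is True. Default is None.
shuffle (bool) – Whether to shuffle the ratings before splitting. Shuffling is not done in place. Default is True.
from surprise import SVD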
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import KFold
# Load the movielens-100k dataset
data = Dataset.load_builtin('ml-100k')
# define a cross-validation iterator
kf = KFold(n_splits=3)
algo = SVD()
for trainset, testset in kf.split(data):
    # train and test algorithm.
    algo.fit(trainset)
    predictions = algo.test(testset)
    # Compute and print Root Mean Squared Error
    accuracy.rmse(predictions, verbose=True)
Output:
RMSE: 0.9374
RMSE: 0.9476
RMSE: 0.9478
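To make the fold mechanics concrete, here is a plain-Python sketch of k-fold partitioning over rating indices. It is not Surprise's implementation: kfold_indices is a hypothetical helper, and it only illustrates that every fold serves as the testset exactly once and that a fixed seed reproduces the same splits.

```python
import random


def kfold_indices(n_items, n_splits=5, seed=None, shuffle=True):
    """Yield (train_idx, test_idx) pairs; each fold is the test set exactly once."""
    if not 2 <= n_splits <= n_items:
        raise ValueError("n_splits must be between 2 and n_items")
    indices = list(range(n_items))
    if shuffle:
        # Shuffle a private copy of the indices, never the caller's data.
        random.Random(seed).shuffle(indices)
    start = 0
    for fold in range(n_splits):
        # Spread the remainder over the first folds so sizes differ by at most one.
        size = n_items // n_splits + (1 if fold < n_items % n_splits else 0)
        stop = start + size
        yield indices[:start] + indices[stop:], indices[start:stop]
        start = stop


splits_a = list(kfold_indices(10, n_splits=3, seed=42))
splits_b = list(kfold_indices(10, n_splits=3, seed=42))
print(splits_a == splits_b)                  # True: same seed, same splits
print([len(test) for _, test in splits_a])   # [4, 3, 3]
```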
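LeaveOneOut: each user has exactly one rating in the testset. Contrary to the other cross-validation strategies, the random split does not guarantee that all folds will be different, although this is still very likely for sizeable datasets. Its parameters are similar to those of KFold above.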
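The leave-one-out idea can be sketched in plain Python as follows. This is an illustration under assumptions, not the library's code: leave_one_out_split and the toy ratings are made up. Because the held-out rating is drawn at random for every user on every call, two calls may happen to produce the same split, which is why distinct folds are not guaranteed.

```python
import random
from collections import defaultdict


def leave_one_out_split(ratings, seed=None):
    """Hold out one randomly chosen rating per user (hypothetical helper)."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for user, item, rating in ratings:
        by_user[user].append((user, item, rating))

    trainset, testset = [], []
    for user_ratings in by_user.values():
        held_out = rng.randrange(len(user_ratings))
        for position, row in enumerate(user_ratings):
            (testset if position == held_out else trainset).append(row)
    return trainset, testset


# Made-up ratings as (user, item, rating) tuples.
toy_ratings = [
    ("alice", "m1", 4.0), ("alice", "m2", 3.0), ("alice", "m3", 5.0),
    ("bob", "m1", 2.0), ("bob", "m3", 4.0),
    ("carol", "m2", 1.0), ("carol", "m3", 3.0), ("carol", "m4", 5.0),
]

trainset, testset = leave_one_out_split(toy_ratings, seed=0)
print(sorted(user for user, _, _ in testset))  # ['alice', 'bob', 'carol']
```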
PredefinedKFold: use the folds already defined by the files passed to load_from_folds.
import os

from surprise import SVD
from surprise import Dataset
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import PredefinedKFold
# path to dataset folder
files_dir = os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/')
# This time, we'll use the built-in reader.
reader = Reader('ml-100k')
# folds_files is a list of tuples containing file paths:
# [(u1.base, u1.test), (u2.base, u2.test), ... (u5.base, u5.test)]
train_file = files_dir + 'u%d.base'
test_file = files_dir + 'u%d.test'
folds_files = [(train_file % i, test_file % i) for i in (1, 2, 3, 4, 5)]
data = Dataset.load_from_folds(folds_files, reader=reader)
pkf = PredefinedKFold()
algo = SVD()
for trainset, testset in pkf.split(data):
    # train and test algorithm.
    algo.fit(trainset)
    predictions = algo.test(testset)
    # Compute and print Root Mean Squared Error
    accuracy.rmse(predictions, verbose=True)
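RepeatedKFold: repeated k-fold cross-validation, where every repetition uses a different random split.
ShuffleSplit: cross-validation on randomly drawn trainsets and testsets.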
2: Cross-validation
from surprise import SVD
from surprise import Dataset
from surprise.model_selection import cross_validate
# Load the movielens-100k dataset (download it if needed),
data = Dataset.load_builtin('ml-100k')
# We'll use the famous SVD algorithm.
algo = SVD()
# Run 5-fold cross-validation and print results
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
Output:
Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

            Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std
RMSE        0.9311  0.9370  0.9320  0.9317  0.9391  0.9342  0.0032
MAE         0.7350  0.7375  0.7341  0.7342  0.7375  0.7357  0.0015
Fit time    6.53    7.11    7.23    7.15    3.99    6.40    1.23
Test time   0.26    0.26    0.25    0.15    0.13    0.21    0.06
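The Mean and Std columns can be checked by hand from the per-fold values. The short sketch below recomputes them for the RMSE row, assuming Std is the population standard deviation (the numbers printed above are consistent with that assumption, not with the sample standard deviation).

```python
import statistics

# Per-fold RMSE values copied from the output above.
rmse_per_fold = [0.9311, 0.9370, 0.9320, 0.9317, 0.9391]

mean_rmse = statistics.fmean(rmse_per_fold)
std_rmse = statistics.pstdev(rmse_per_fold)  # population standard deviation

print(f"{mean_rmse:.4f} {std_rmse:.4f}")  # 0.9342 0.0032
```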
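Parameters:
algo (AlgoBase) – The algorithm to evaluate.
data (Dataset) – The dataset on which to evaluate the algorithm.
measures (list of string) – The performance measures to compute. Allowed names are the function names defined in the accuracy module. Default is ['rmse', 'mae'].
cv (cross-validation iterator, int or None) – Determines how the data parameter will be split, i.e. how trainsets and testsets are defined. If an int is passed, KFold is used with that value as n_splits. If None, KFold is used with n_splits=5.
return_train_measures (bool) – Whether to compute performance measures on the trainsets. Default is False.
n_jobs (int) – The maximum number of folds evaluated in parallel. If -1, all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + n_jobs + 1) CPUs are used. For example, with n_jobs = -2 all CPUs but one are used. Default is -1.
pre_dispatch (int or string) – Controls the number of jobs that get dispatched during parallel execution. This parameter can be:
None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs.
An int, giving the exact number of total jobs that are spawned.
A string, giving an expression as a function of n_jobs, such as '2*n_jobs'.
Default is '2*n_jobs'.
verbose (bool) – If True, accuracy measures for each split are printed, as well as train and test times. Averages and standard deviations over all splits are also reported. Default is False: nothing is printed.
3: Parameter selection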
class surprise.model_selection.search.GridSearchCV(algo_class, param_grid, measures=[u'rmse', u'mae'], cv=None, refit=False, return_train_measures=False, n_jobs=-1, pre_dispatch=u'2*n_jobs', joblib_verbose=0)
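GridSearchCV computes accuracy measures for an algorithm over various combinations of parameters, using a cross-validation procedure. This is useful for finding the best set of parameters for a prediction algorithm. It is analogous to GridSearchCV from scikit-learn.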
from surprise import SVD
from surprise import Dataset
from surprise.model_selection import GridSearchCV
# Use movielens-100K
data = Dataset.load_builtin('ml-100k')
param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
'reg_all': [0.4, 0.6]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)
gs.fit(data)
# best RMSE score
print(gs.best_score['rmse'])
# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])
Output:
0.961300130118
{'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}
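A grid search evaluates every combination of the values in param_grid. As a rough sketch of that expansion (written with itertools.product for illustration, without claiming this is how the library does it internally), the grid above yields 2 x 2 x 2 = 8 candidate parameter sets:

```python
from itertools import product

param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}

# One dict per combination; the last key varies fastest.
combinations = [dict(zip(param_grid, values))
                for values in product(*param_grid.values())]

print(len(combinations))   # 8
print(combinations[0])     # {'n_epochs': 5, 'lr_all': 0.002, 'reg_all': 0.4}
print(combinations[6])     # {'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}
```

With cv=3, each of the 8 combinations is fitted and tested 3 times, so the search runs 24 fits in total.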
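Attributes:
best_estimator
dict of AlgoBase – key: an accuracy measure, value: the algorithm that gave the best results for that measure.
best_score
dict of floats – key: an accuracy measure, value: the best score achieved for that measure.
best_params
dict of dicts – key: an accuracy measure, value: the parameter combination that gave the best results for that measure.
best_index
dict of ints – key: an accuracy measure, value: the index, usable with cv_results, of the combination that achieved the best result for that measure.
cv_results
dict of arrays – A dict that contains accuracy measures over all splits, as well as train and test times for each parameter combination. It can be imported into a pandas DataFrame. For the grid search above it looks like this: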
'split0_test_rmse': [1.0, 1.0, 0.97, 0.98, 0.98, 0.99, 0.96, 0.97]
'split1_test_rmse': [1.0, 1.0, 0.97, 0.98, 0.98, 0.99, 0.96, 0.97]
'split2_test_rmse': [1.0, 1.0, 0.97, 0.98, 0.98, 0.99, 0.96, 0.97]
'mean_test_rmse': [1.0, 1.0, 0.97, 0.98, 0.98, 0.99, 0.96, 0.97]
'std_test_rmse': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
'rank_test_rmse': [7 8 3 5 4 6 1 2]
'split0_test_mae': [0.81, 0.82, 0.78, 0.79, 0.79, 0.8, 0.77, 0.79]
'split1_test_mae': [0.8, 0.81, 0.78, 0.79, 0.78, 0.79, 0.77, 0.78]
'split2_test_mae': [0.81, 0.81, 0.78, 0.79, 0.78, 0.8, 0.77, 0.78]
'mean_test_mae': [0.81, 0.81, 0.78, 0.79, 0.79, 0.8, 0.77, 0.78]
'std_test_mae': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
'rank_test_mae': [7 8 2 5 4 6 1 3]
'mean_fit_time': [1.53, 1.52, 1.53, 1.53, 3.04, 3.05, 3.06, 3.02]
'std_fit_time': [0.03, 0.04, 0.0, 0.01, 0.04, 0.01, 0.06, 0.01]
'mean_test_time': [0.46, 0.45, 0.44, 0.44, 0.47, 0.49, 0.46, 0.34]
'std_test_time': [0.0, 0.01, 0.01, 0.0, 0.03, 0.06, 0.01, 0.08]
'params': [{'n_epochs': 5, 'lr_all': 0.002, 'reg_all': 0.4}, {'n_epochs': 5, 'lr_all': 0.002, 'reg_all': 0.6}, {'n_epochs': 5, 'lr_all': 0.005, 'reg_all': 0.4}, {'n_epochs': 5, 'lr_all': 0.005, 'reg_all': 0.6}, {'n_epochs': 10, 'lr_all': 0.002, 'reg_all': 0.4}, {'n_epochs': 10, 'lr_all': 0.002, 'reg_all': 0.6}, {'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}, {'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.6}]
'param_n_epochs': [5, 5, 5, 5, 10, 10, 10, 10]
'param_lr_all': [0.0, 0.0, 0.01, 0.01, 0.0, 0.0, 0.01, 0.01]
'param_reg_all': [0.4, 0.6, 0.4, 0.6, 0.4, 0.6, 0.4, 0.6]
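The relationship between cv_results, best_index and best_params can be traced by hand. The sketch below copies two entries of the dict above into a plain Python dict (it does not call the library) and looks up the combination with the lowest mean RMSE.

```python
# Two entries copied from the cv_results listing above.
cv_results = {
    'mean_test_rmse': [1.0, 1.0, 0.97, 0.98, 0.98, 0.99, 0.96, 0.97],
    'params': [
        {'n_epochs': 5, 'lr_all': 0.002, 'reg_all': 0.4},
        {'n_epochs': 5, 'lr_all': 0.002, 'reg_all': 0.6},
        {'n_epochs': 5, 'lr_all': 0.005, 'reg_all': 0.4},
        {'n_epochs': 5, 'lr_all': 0.005, 'reg_all': 0.6},
        {'n_epochs': 10, 'lr_all': 0.002, 'reg_all': 0.4},
        {'n_epochs': 10, 'lr_all': 0.002, 'reg_all': 0.6},
        {'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4},
        {'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.6},
    ],
}

means = cv_results['mean_test_rmse']
# For an error measure such as RMSE, lower is better.
best_index = min(range(len(means)), key=means.__getitem__)

print(best_index)                        # 6
print(cv_results['params'][best_index])  # {'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}
```

Index 6 is also the entry ranked 1 in rank_test_rmse, and its parameters match the best_params['rmse'] value printed earlier.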
fit(data)
Runs the fit() method of the algorithm for all parameter combinations, over the splits given by the cv parameter.
Parameters: data (Dataset) – The dataset on which to evaluate the algorithm, in parallel.
predict(*args)
Calls predict() on the estimator with the best parameters found (according to the refit parameter). See AlgoBase.predict().
Only available if refit is not False.
test(testset, verbose=False)
Calls test() on the estimator with the best parameters found (according to the refit parameter). See AlgoBase.test().
Only available if refit is not False.