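S-fold cross-validation: the data is randomly partitioned into S mutually disjoint subsets of equal size. The model is trained on S-1 of the subsets and tested on the remaining one. Since there are S ways to choose the test subset, the procedure is repeated for each of the S combinations and the mean of the S test errors is computed. This mean is used as the estimate of the generalization error.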
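The procedure above can be sketched with NumPy alone. This is a minimal illustration, not a library implementation: the helper name `s_fold_cv_error`, the least-squares model, and the synthetic data are all assumptions chosen for the example.

```python
import numpy as np

def s_fold_cv_error(X, y, S=5, seed=0):
    """Estimate the generalization error (MSE) of least squares by S-fold CV."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))      # random order of the sample indices
    folds = np.array_split(indices, S)     # S disjoint subsets of (near-)equal size
    errors = []
    for k in range(S):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(S) if j != k])
        # Train on the S-1 remaining subsets
        w, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        # Test on the held-out subset
        pred = X[test_idx] @ w
        errors.append(float(np.mean((pred - y[test_idx]) ** 2)))
    # The mean of the S test errors is the estimate of the generalization error
    return float(np.mean(errors)), errors

# Synthetic regression data (hypothetical, for illustration only)
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

mean_error, fold_errors = s_fold_cv_error(X, y, S=5)
print("Per-fold test MSE:", fold_errors)
print("Estimated generalization error:", mean_error)
```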
sklearn.model_selection.KFold(n_splits=3,shuffle=False,random_state=None)
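If random_state is an integer, it specifies the seed of the random number generator (random_state only has an effect when shuffle=True).
If it is a RandomState instance, it specifies the random number generator itself.
If it is None, the default random number generator is used.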
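For a dataset with n samples, KFold first divides the integers from 0 to n-1, in order, evenly into n_splits parts, and on each iteration picks the next part in turn as the test-set indices. To pick the test indices randomly rather than sequentially, shuffle the data before splitting by setting shuffle=True.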
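As a quick check of this in-order splitting, the sketch below uses 10 samples, which do not divide evenly by 3. The variable names are illustrative. Without shuffling, the test indices are contiguous blocks, and the first n % n_splits folds receive one extra sample.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)   # 10 samples, not divisible by 3

# Without shuffling: contiguous blocks taken from front to back
test_folds = [test_idx.tolist() for _, test_idx in KFold(n_splits=3).split(X)]
print(test_folds)   # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]

# With shuffling: the folds are no longer contiguous but still partition 0..9
shuffled = KFold(n_splits=3, shuffle=True, random_state=0)
shuffled_folds = [test_idx.tolist() for _, test_idx in shuffled.split(X)]
print(shuffled_folds)
```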
Generate data
from sklearn.model_selection import KFold
import numpy as np
X = np.random.rand(9,4)
y = np.array([1,1,0,0,1,1,0,0,1])
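KFold example (no shuffling)
# random_state is omitted: it has no effect without shuffling, and recent
# scikit-learn versions raise an error if it is set while shuffle=False
folder = KFold(n_splits=3,shuffle=False)
for train_index,test_index in folder.split(X,y):
    print("Train Index:",train_index)
    print("Test Index:",test_index)
    print("X_train:",X[train_index])
    print("X_test:",X[test_index])
    print("")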
Train Index: [3 4 5 6 7 8]
Test Index: [0 1 2]
X_train: [[ 0.13079725 0.48578664 0.64161516 0.16668596]
[ 0.4999551 0.41196095 0.87824022 0.58348625]
[ 0.25872091 0.73951121 0.04957464 0.45203743]
[ 0.72628999 0.52417452 0.06881971 0.95963271]
[ 0.02276032 0.98144591 0.37960828 0.61095952]
[ 0.41491324 0.42039075 0.95688853 0.15339434]]
X_test: [[ 0.17697103 0.42337491 0.44060735 0.12488469]
[ 0.54331568 0.63086644 0.02425023 0.00419293]
[ 0.37441732 0.27994645 0.7224304 0.82671591]]
Train Index: [0 1 2 6 7 8]
Test Index: [3 4 5]
X_train: [[ 0.17697103 0.42337491 0.44060735 0.12488469]
[ 0.54331568 0.63086644 0.02425023 0.00419293]
[ 0.37441732 0.27994645 0.7224304 0.82671591]
[ 0.72628999 0.52417452 0.06881971 0.95963271]
[ 0.02276032 0.98144591 0.37960828 0.61095952]
[ 0.41491324 0.42039075 0.95688853 0.15339434]]
X_test: [[ 0.13079725 0.48578664 0.64161516 0.16668596]
[ 0.4999551 0.41196095 0.87824022 0.58348625]
[ 0.25872091 0.73951121 0.04957464 0.45203743]]
Train Index: [0 1 2 3 4 5]
Test Index: [6 7 8]
X_train: [[ 0.17697103 0.42337491 0.44060735 0.12488469]
[ 0.54331568 0.63086644 0.02425023 0.00419293]
[ 0.37441732 0.27994645 0.7224304 0.82671591]
[ 0.13079725 0.48578664 0.64161516 0.16668596]
[ 0.4999551 0.41196095 0.87824022 0.58348625]
[ 0.25872091 0.73951121 0.04957464 0.45203743]]
X_test: [[ 0.72628999 0.52417452 0.06881971 0.95963271]
[ 0.02276032 0.98144591 0.37960828 0.61095952]
[ 0.41491324 0.42039075 0.95688853 0.15339434]]
KFold example (with shuffling)
folder = KFold(n_splits=3,random_state=0,shuffle=True)
for train_index,test_index in folder.split(X,y):
    print("Shuffled Train Index:",train_index)
    print("Shuffled Test Index:",test_index)
    print("Shuffled X_train:",X[train_index])
    print("Shuffled X_test:",X[test_index])
    print("")
Shuffled Train Index: [0 3 4 5 6 8]
Shuffled Test Index: [1 2 7]
Shuffled X_train: [[ 0.17697103 0.42337491 0.44060735 0.12488469]
[ 0.13079725 0.48578664 0.64161516 0.16668596]
[ 0.4999551 0.41196095 0.87824022 0.58348625]
[ 0.25872091 0.73951121 0.04957464 0.45203743]
[ 0.72628999 0.52417452 0.06881971 0.95963271]
[ 0.41491324 0.42039075 0.95688853 0.15339434]]
Shuffled X_test: [[ 0.54331568 0.63086644 0.02425023 0.00419293]
[ 0.37441732 0.27994645 0.7224304 0.82671591]
[ 0.02276032 0.98144591 0.37960828 0.61095952]]
Shuffled Train Index: [0 1 2 3 5 7]
Shuffled Test Index: [4 6 8]
Shuffled X_train: [[ 0.17697103 0.42337491 0.44060735 0.12488469]
[ 0.54331568 0.63086644 0.02425023 0.00419293]
[ 0.37441732 0.27994645 0.7224304 0.82671591]
[ 0.13079725 0.48578664 0.64161516 0.16668596]
[ 0.25872091 0.73951121 0.04957464 0.45203743]
[ 0.02276032 0.98144591 0.37960828 0.61095952]]
Shuffled X_test: [[ 0.4999551 0.41196095 0.87824022 0.58348625]
[ 0.72628999 0.52417452 0.06881971 0.95963271]
[ 0.41491324 0.42039075 0.95688853 0.15339434]]
Shuffled Train Index: [1 2 4 6 7 8]
Shuffled Test Index: [0 3 5]
Shuffled X_train: [[ 0.54331568 0.63086644 0.02425023 0.00419293]
[ 0.37441732 0.27994645 0.7224304 0.82671591]
[ 0.4999551 0.41196095 0.87824022 0.58348625]
[ 0.72628999 0.52417452 0.06881971 0.95963271]
[ 0.02276032 0.98144591 0.37960828 0.61095952]
[ 0.41491324 0.42039075 0.95688853 0.15339434]]
Shuffled X_test: [[ 0.17697103 0.42337491 0.44060735 0.12488469]
[ 0.13079725 0.48578664 0.64161516 0.16668596]
[ 0.25872091 0.73951121 0.04957464 0.45203743]]
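The role of an integer random_state in the shuffled case can be illustrated as follows. This is a small sketch: the helper `shuffled_test_indices` is a hypothetical name introduced here, and the zero-filled X is a placeholder, since only the number of samples matters for the split.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.zeros((9, 4))   # placeholder data with 9 samples

def shuffled_test_indices(seed):
    """Return the test indices of each fold for a shuffled 3-fold split."""
    folder = KFold(n_splits=3, shuffle=True, random_state=seed)
    return [test_idx.tolist() for _, test_idx in folder.split(X)]

first_run = shuffled_test_indices(0)
second_run = shuffled_test_indices(0)

# The same integer seed reproduces exactly the same folds
print("Folds with seed 0:", first_run)
print("Reproducible:", first_run == second_run)
```

Fixing the seed in this way makes a shuffled split repeatable across runs, which is useful when comparing models on identical folds.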
sklearn.model_selection.StratifiedKFold(n_splits=3,shuffle=False,random_state=None)
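If random_state is an integer, it specifies the seed of the random number generator.
If it is a RandomState instance, it specifies the random number generator itself.
If it is None, the default random number generator is used.
StratifiedKFold is the stratified-sampling version of KFold: each fold keeps the class proportions of y as close as possible to those of the full dataset.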
Generate data
from sklearn.model_selection import KFold,StratifiedKFold
import numpy as np
X = np.random.rand(8,4)
y = np.array([1,1,0,0,1,1,0,0])
KFold
# random_state is omitted because it has no effect when shuffle=False
folder = KFold(n_splits=4,shuffle=False)
for train_index,test_index in folder.split(X,y):
    print("Train Index:",train_index)
    print("Test Index:",test_index)
    print("y_train:",y[train_index])
    print("y_test:",y[test_index])
    print("")
Train Index: [2 3 4 5 6 7]
Test Index: [0 1]
y_train: [0 0 1 1 0 0]
y_test: [1 1]
Train Index: [0 1 4 5 6 7]
Test Index: [2 3]
y_train: [1 1 1 1 0 0]
y_test: [0 0]
Train Index: [0 1 2 3 6 7]
Test Index: [4 5]
y_train: [1 1 0 0 0 0]
y_test: [1 1]
Train Index: [0 1 2 3 4 5]
Test Index: [6 7]
y_train: [1 1 0 0 1 1]
y_test: [0 0]
StratifiedKFold
# random_state is omitted because it has no effect when shuffle=False
stratified_folder = StratifiedKFold(n_splits=4,shuffle=False)
for train_index,test_index in stratified_folder.split(X,y):
    print("Stratified Train Index:",train_index)
    print("Stratified Test Index:",test_index)
    # The labels y are distributed more evenly than with KFold
    print("Stratified y_train:",y[train_index])
    print("Stratified y_test:",y[test_index])
    print("")
Stratified Train Index: [1 3 4 5 6 7]
Stratified Test Index: [0 2]
Stratified y_train: [1 0 1 1 0 0]
Stratified y_test: [1 0]
Stratified Train Index: [0 2 4 5 6 7]
Stratified Test Index: [1 3]
Stratified y_train: [1 0 1 1 0 0]
Stratified y_test: [1 0]
Stratified Train Index: [0 1 2 3 5 7]
Stratified Test Index: [4 6]
Stratified y_train: [1 1 0 0 1 0]
Stratified y_test: [1 0]
Stratified Train Index: [0 1 2 3 4 6]
Stratified Test Index: [5 7]
Stratified y_train: [1 1 0 0 1 0]
Stratified y_test: [1 0]
sklearn.model_selection.cross_val_score(estimator,X,y=None,scoring=None,cv=None,n_jobs=1,verbose=0,fit_params=None,pre_dispatch='2*n_jobs')
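scoring: a string naming the scoring method. Possible values include:
'accuracy': uses the metrics.accuracy_score scoring function.
'average_precision': uses the metrics.average_precision_score scoring function.
'f1' family: uses the metrics.f1_score scoring function.
'log_loss': uses the metrics.log_loss scoring function (named 'neg_log_loss' in recent scikit-learn releases).
'precision' family: uses the metrics.precision_score scoring function.
'recall' family: uses the metrics.recall_score scoring function.
'roc_auc': uses the metrics.roc_auc_score scoring function.
'adjusted_rand_score': uses the metrics.adjusted_rand_score scoring function.
'mean_absolute_error': uses the metrics.mean_absolute_error scoring function (named 'neg_mean_absolute_error' in recent scikit-learn releases).
'mean_squared_error': uses the metrics.mean_squared_error scoring function (named 'neg_mean_squared_error' in recent scikit-learn releases).
'r2': uses the metrics.r2_score scoring function.
cv: an integer, a k-fold cross-validation generator, an iterable, or None.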
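If None, the default 3-fold cross-validation generator is used (the default is 5-fold from scikit-learn 0.22 onward).
If an integer, it specifies the value of k for the k-fold cross-validation generator.
If a k-fold cross-validation generator, that generator is used directly.
If an iterable, the items it yields are used as the train/test splits of the dataset.
fit_params: a dictionary of keyword arguments passed to the estimator's .fit method.
n_jobs: the degree of parallelism. The default in the signature above is 1; -1 dispatches the tasks to all CPUs of the machine.
verbose: an integer that controls the verbosity of the log output.
pre_dispatch: an integer or a string that controls the total number of tasks dispatched during parallel execution.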
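The different forms of cv and the scoring string can be tried out as in the following sketch. KNeighborsClassifier is swapped in here only because it is deterministic and quick to fit; that choice, and the variable names, are assumptions of this example rather than part of the original text.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
clf = KNeighborsClassifier()

# cv as an integer: the value of k
scores_int = cross_val_score(clf, X, y, cv=5)

# cv as a cross-validation generator object
splitter = KFold(n_splits=5, shuffle=True, random_state=0)
scores_splitter = cross_val_score(clf, X, y, cv=splitter, scoring="accuracy")

# cv as an iterable of (train indices, test indices) pairs
splits = list(splitter.split(X))
scores_iterable = cross_val_score(clf, X, y, cv=splits, scoring="accuracy")

# A different scoring string changes the metric computed on each fold
scores_f1 = cross_val_score(clf, X, y, cv=splitter, scoring="f1_macro")

print("cv=5:            ", scores_int)
print("cv=KFold object: ", scores_splitter)
print("cv=list of pairs:", scores_iterable)
print("f1_macro:        ", scores_f1)
```

Because the KFold object uses a fixed integer seed, passing the object and passing its precomputed splits yield the same folds and therefore the same scores.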
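cross_val_score is a convenience function: it runs the specified estimator on the specified dataset with k-fold cross-validation and returns the score obtained on each fold.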
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_digits
from sklearn.svm import LinearSVC
digits = load_digits()
X = digits.data
y = digits.target
result = cross_val_score(LinearSVC(),X,y,cv=10)
print("Cross Val Score is:",result)
Cross Val Score is: [ 0.9027027 0.95081967 0.89502762 0.88888889 0.93296089 0.97206704
0.96648045 0.93820225 0.85875706 0.9375 ]
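What the function returns can be reproduced with an explicit loop over the folds. The sketch below is an approximation of the idea rather than the library's actual implementation, and it assumes KNeighborsClassifier in place of LinearSVC so that the result is deterministic. The per-fold scores are then averaged to obtain a single estimate.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
estimator = KNeighborsClassifier()
cv = StratifiedKFold(n_splits=10)

# Manual loop: fit a fresh copy of the estimator on each training split
# and score it on the corresponding held-out split
manual_scores = []
for train_idx, test_idx in cv.split(X, y):
    model = clone(estimator)
    model.fit(X[train_idx], y[train_idx])
    manual_scores.append(model.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

# Same folds, same estimator, default scoring (the estimator's own score method)
library_scores = cross_val_score(estimator, X, y, cv=cv)

print("Scores agree:", np.allclose(manual_scores, library_scores))
print("Mean score: %.4f (std %.4f)" % (library_scores.mean(), library_scores.std()))
```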