Machine Learning: KFold Cross-Validation

Section I: Brief Introduction to StratifiedKFold

A slight improvement over the standard k-fold cross-validation approach is stratified k-fold cross-validation, which can yield better bias and variance estimates, especially in cases of unequal class proportions. In stratified cross-validation, the class proportions are preserved in each fold to ensure that each fold is representative of the class proportions in the training dataset.

Source:
Sebastian Raschka, Vahid Mirjalili. Python Machine Learning, 2nd ed. Nanjing: Southeast University Press, 2018.
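
To make the difference concrete, here is a minimal sketch comparing the test folds produced by plain KFold and by StratifiedKFold. The toy labels (90 samples of class 0 followed by 10 of class 1) and the helper name fold_class_counts are assumptions made for illustration and do not come from the book.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical toy labels: 90 samples of class 0 followed by 10 of class 1
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # feature values do not influence the split


def fold_class_counts(splitter):
    """Return the per-class sample counts of every test fold."""
    return [np.bincount(y[test_idx], minlength=2).tolist()
            for _, test_idx in splitter.split(X, y)]


plain = fold_class_counts(KFold(n_splits=5))
stratified = fold_class_counts(StratifiedKFold(n_splits=5))

print("KFold:          ", plain)
print("StratifiedKFold:", stratified)
```

Without shuffling, plain KFold cuts the data into consecutive blocks, so the first test folds contain no minority-class samples at all, whereas every stratified test fold keeps the 9:1 ratio.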

Section II: Code and Analyses

Code

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
import numpy as np
from sklearn.model_selection import StratifiedKFold
import warnings
warnings.filterwarnings("ignore")

#Section 1: Load Breast data, i.e., Benign and Malignant
breast=datasets.load_breast_cancer()
X=breast.data
y=breast.target
X_train,X_test,y_train,y_test=\
    train_test_split(X,y,test_size=0.2,stratify=y,random_state=1)

#Section 2: Define PipeLine Model
pipe_lr=make_pipeline(StandardScaler(),
                      PCA(n_components=2),
                      LogisticRegression(random_state=1))

#Section 3: Define StratifiedKFold Model
print("Original Class Dist: %s\n" % np.bincount(y))
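# shuffle keeps its default (False), so the folds are deterministic;
# recent scikit-learn versions reject random_state unless shuffle=True
kfold=StratifiedKFold(n_splits=10).split(X_train,y_train)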
scores=[]
for k,(train_idx,test_idx) in enumerate(kfold):
    pipe_lr.fit(X_train[train_idx],y_train[train_idx])
    score=pipe_lr.score(X_train[test_idx],y_train[test_idx])
    scores.append(score)
    print("Fold: %2d, Class dist: %s, Acc: %.3f" % (k+1,np.bincount(y_train[train_idx]),score))

print('CV Accuracy: %.3f +/- %.3f' % (np.mean(scores),np.std(scores)))

#Section 4: The easier manner when cross_val_score used
from sklearn.model_selection import cross_val_score

scores=cross_val_score(estimator=pipe_lr,
                       X=X_train,
                       y=y_train,
                       cv=10,
                       n_jobs=1)
print("\nCV Accuracy Scores: %s" % scores)
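# Apply np.std once. Nesting it as np.std(np.std(scores)) always returns 0,
# which is where the "+/- 0.000" in the recorded output comes from.
print("CV Accuracy: %.3f +/- %.3f" % (np.mean(scores),np.std(scores)))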

Results

Original Class Dist: [212 357]

Fold:  1, Class dist: [153 256], Acc: 0.978
Fold:  2, Class dist: [153 256], Acc: 0.935
Fold:  3, Class dist: [153 256], Acc: 0.957
Fold:  4, Class dist: [153 256], Acc: 0.935
Fold:  5, Class dist: [153 256], Acc: 0.913
Fold:  6, Class dist: [153 257], Acc: 0.956
Fold:  7, Class dist: [153 257], Acc: 0.933
Fold:  8, Class dist: [153 257], Acc: 0.956
Fold:  9, Class dist: [153 257], Acc: 0.933
Fold: 10, Class dist: [153 257], Acc: 0.956
CV Accuracy: 0.945 +/- 0.018

CV Accuracy Scores: [0.97826087 0.93478261 0.95652174 0.93478261 0.91304348 0.95555556
 0.93333333 0.95555556 0.93333333 0.95555556]
CV Accuracy: 0.945 +/- 0.000
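
As a sanity check on the two summary lines, the mean and standard deviation can be recomputed from the ten scores printed above. This sketch only re-enters those printed values; the variable names are chosen here for illustration.

```python
import numpy as np

# The ten fold accuracies exactly as printed in the results above
scores = np.array([0.97826087, 0.93478261, 0.95652174, 0.93478261, 0.91304348,
                   0.95555556, 0.93333333, 0.95555556, 0.93333333, 0.95555556])

mean = np.mean(scores)
std = np.std(scores)             # spread across the ten folds
nested = np.std(np.std(scores))  # std of a single number is always 0

print("CV Accuracy: %.3f +/- %.3f" % (mean, std))
print("Nested std:  %.3f" % nested)
```

The first line prints 0.945 +/- 0.018, matching the manual loop. A "+/- 0.000" can therefore only come from taking the standard deviation of a single number, not from the fold scores themselves.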

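Comparing the results above shows two things. First, StratifiedKFold samples in proportion to the class distribution when it forms the training and test folds: every training fold contains 153 samples of class 0 and 256 or 257 of class 1, roughly the same 37% to 63% ratio as the full dataset (212 to 357). Second, the per-fold scores from cross_val_score are identical to those from the manual StratifiedKFold loop, so the two approaches split the data the same way. This is expected: when the estimator is a classifier and cv is an integer, cross_val_score uses StratifiedKFold internally.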
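
The match between the two approaches can also be checked directly. The sketch below assumes the same data split and pipeline as in the code section, and compares the scores obtained with cv=10 against those obtained with an explicit, unshuffled StratifiedKFold(n_splits=10); the names scores_int and scores_skf are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same data split and pipeline as in the code section
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

pipe_lr = make_pipeline(StandardScaler(),
                        PCA(n_components=2),
                        LogisticRegression(random_state=1))

# Integer cv: scikit-learn picks the splitter itself
scores_int = cross_val_score(pipe_lr, X_train, y_train, cv=10)

# Explicit, unshuffled stratified splitter
scores_skf = cross_val_score(pipe_lr, X_train, y_train,
                             cv=StratifiedKFold(n_splits=10))

print("Identical fold scores:", np.allclose(scores_int, scores_skf))
```

Passing the splitter object explicitly is also the way to go when shuffled folds are wanted, for example StratifiedKFold(n_splits=10, shuffle=True, random_state=1).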

References
Sebastian Raschka, Vahid Mirjalili. Python Machine Learning, 2nd ed. Nanjing: Southeast University Press, 2018.
