A slight improvement over the standard k-fold cross-validation approach is stratified k-fold cross-validation, which can yield better bias and variance estimates, especially in cases of unequal class proportions. In stratified cross-validation, the class proportions are preserved in each fold to ensure that each fold is representative of the class proportions in the training dataset.
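To see the difference concretely, here is a small sketch (not from the book) on a toy imbalanced label vector: plain KFold can produce folds that contain only the majority class, while StratifiedKFold keeps the 90/10 ratio in every fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy imbalanced labels: 90 samples of class 0 followed by 10 of class 1.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # placeholder features; only the labels matter here

for name, cv in [("KFold", KFold(n_splits=5)),
                 ("StratifiedKFold", StratifiedKFold(n_splits=5))]:
    # Class distribution of each test fold, as [count_class0, count_class1]
    dists = [np.bincount(y[test], minlength=2).tolist()
             for _, test in cv.split(X, y)]
    print(name, dists)
# KFold gives four all-class-0 folds and one [10, 10] fold;
# StratifiedKFold gives [18, 2] in every fold.
```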
From: Sebastian Raschka, Vahid Mirjalili. Python Machine Learning, 2nd ed. Nanjing: Southeast University Press, 2018.
Code
from sklearn import datasets
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
import numpy as np
import warnings
warnings.filterwarnings("ignore")  # silence version-dependent FutureWarnings

# Section 1: Load the breast cancer data (benign vs. malignant)
breast = datasets.load_breast_cancer()
X = breast.data
y = breast.target
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.2, stratify=y, random_state=1)

# Section 2: Define the pipeline model
pipe_lr = make_pipeline(StandardScaler(),
                        PCA(n_components=2),
                        LogisticRegression(random_state=1))

# Section 3: Stratified k-fold cross-validation by hand
print("Original Class Dist: %s\n" % np.bincount(y))
# Note: random_state has no effect (and recent scikit-learn raises an
# error) unless shuffle=True, so it is omitted here.
kfold = StratifiedKFold(n_splits=10).split(X_train, y_train)
scores = []
for k, (train_idx, test_idx) in enumerate(kfold):
    pipe_lr.fit(X_train[train_idx], y_train[train_idx])
    score = pipe_lr.score(X_train[test_idx], y_train[test_idx])
    scores.append(score)
    print("Fold: %2d, Class dist: %s, Acc: %.3f"
          % (k + 1, np.bincount(y_train[train_idx]), score))
print('CV Accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))

# Section 4: The easier way, using cross_val_score
scores = cross_val_score(estimator=pipe_lr,
                         X=X_train,
                         y=y_train,
                         cv=10,
                         n_jobs=1)
print("\nCV Accuracy Scores: %s" % scores)
print("CV Accuracy: %.3f +/- %.3f" % (np.mean(scores), np.std(scores)))
Output
Original Class Dist: [212 357]
Fold: 1, Class dist: [153 256], Acc: 0.978
Fold: 2, Class dist: [153 256], Acc: 0.935
Fold: 3, Class dist: [153 256], Acc: 0.957
Fold: 4, Class dist: [153 256], Acc: 0.935
Fold: 5, Class dist: [153 256], Acc: 0.913
Fold: 6, Class dist: [153 257], Acc: 0.956
Fold: 7, Class dist: [153 257], Acc: 0.933
Fold: 8, Class dist: [153 257], Acc: 0.956
Fold: 9, Class dist: [153 257], Acc: 0.933
Fold: 10, Class dist: [153 257], Acc: 0.956
CV Accuracy: 0.945 +/- 0.018
CV Accuracy Scores: [0.97826087 0.93478261 0.95652174 0.93478261 0.91304348 0.95555556
0.93333333 0.95555556 0.93333333 0.95555556]
CV Accuracy: 0.945 +/- 0.018
Comparing the results above shows two things. First, StratifiedKFold clearly samples the training and validation folds in proportion to the class distribution. Second, the splits produced by cross_val_score match the StratifiedKFold sampling exactly: for a classifier, passing an integer cv to cross_val_score uses (non-shuffled) StratifiedKFold internally.
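The second point can be checked directly. A small sketch (using the same breast cancer data but a simplified pipeline, not the book's exact setup): scores from an integer cv and from an explicit StratifiedKFold should be identical.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
clf = make_pipeline(StandardScaler(), LogisticRegression(random_state=1))

# cv=10 on a classifier uses non-shuffled StratifiedKFold internally,
# so passing StratifiedKFold(n_splits=10) explicitly yields the same splits.
scores_int = cross_val_score(clf, X, y, cv=10)
scores_skf = cross_val_score(clf, X, y, cv=StratifiedKFold(n_splits=10))
print(np.allclose(scores_int, scores_skf))  # True
```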