[Machine Learning] KFold and StratifiedKFold

This post started with an error:

ValueError                                Traceback (most recent call last)
<ipython-input-42-2ab744268d80> in <module>()
     20 print('---------分割线--------------')
     21 sfolder = StratifiedKFold(n_splits=4,random_state=0,shuffle=False)
---> 22 for train, test in sfolder.split(X,y):
     23     print('Train: %s | test: %s' % (train, test))
     24     print(" ")

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py in split(self, X, y, groups)
    330                                                              n_samples))
    331 
--> 332         for train, test in super(_BaseKFold, self).split(X, y, groups):
    333             yield train, test
    334 

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py in split(self, X, y, groups)
     93         X, y, groups = indexable(X, y, groups)
     94         indices = np.arange(_num_samples(X))
---> 95         for test_index in self._iter_test_masks(X, y, groups):
     96             train_index = indices[np.logical_not(test_index)]
     97             test_index = indices[test_index]

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py in _iter_test_masks(self, X, y, groups)
    632 
    633     def _iter_test_masks(self, X, y=None, groups=None):
--> 634         test_folds = self._make_test_folds(X, y)
    635         for i in range(self.n_splits):
    636             yield test_folds == i

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py in _make_test_folds(self, X, y)
    587             raise ValueError(
    588                 'Supported target types are: {}. Got {!r} instead.'.format(
--> 589                     allowed_target_types, type_of_target_y))
    590 
    591         y = column_or_1d(y)

ValueError: Supported target types are: ('binary', 'multiclass'). Got 'continuous' instead.

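This pitfall cost me 3 hours.
Whether the labels are integers or fractional numbers really does decide whether the split succeeds!!! As the error message says, StratifiedKFold only accepts 'binary' or 'multiclass' targets, and labels with fractional values are detected as 'continuous'.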
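To see which category a label array falls into before splitting, you can call scikit-learn's `type_of_target` helper directly. The sketch below is only an illustration: the three arrays are made-up examples, not data from the original run. It also shows that whole-valued floats such as `1.0` still count as class labels; only genuinely fractional values are treated as continuous.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Hypothetical label arrays for illustration only.
int_labels = np.array([1, 1, 0, 0, 1, 1, 0, 4])                    # integer class ids
whole_floats = np.array([1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 4.0])  # floats, but whole-valued
frac_floats = np.array([1.5, 1.0, 0.2, 0.0, 1.0, 1.0, 0.0, 4.0])   # contains fractional values

kinds = {
    "int_labels": type_of_target(int_labels),
    "whole_floats": type_of_target(whole_floats),
    "frac_floats": type_of_target(frac_floats),
}

for name, kind in kinds.items():
    print(f"{name}: {kind}")
```

Only the last array would be rejected by StratifiedKFold, because its target type is not 'binary' or 'multiclass'.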

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# 8 samples with 4 features each
X = np.array([
    [1, 2, 3, 4],
    [11, 12, 13, 14],
    [21, 22, 23, 24],
    [31, 32, 33, 34],
    [41, 42, 43, 44],
    [51, 52, 53, 54],
    [61, 62, 63, 64],
    [71, 72, 73, 74]
])
y = np.array([1, 1, 0, 0, 1, 1, 0, 4])  # integer labels

# random_state is omitted: it has no effect when shuffle=False,
# and recent scikit-learn versions raise an error for that combination.
folder = KFold(n_splits=4, shuffle=False)
for train, test in folder.split(X, y):
    print('Train: %s | test: %s' % (train, test))
    print(" ")
print('---------separator--------------')
sfolder = StratifiedKFold(n_splits=4, shuffle=False)
for train, test in sfolder.split(X, y):
    print('Train: %s | test: %s' % (train, test))
    print(" ")

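Output (from the scikit-learn version I was using; the exact StratifiedKFold folds may differ in other versions):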

Train: [2 3 4 5 6 7] | test: [0 1]
Train: [0 1 4 5 6 7] | test: [2 3]
Train: [0 1 2 3 6 7] | test: [4 5]
Train: [0 1 2 3 4 5] | test: [6 7]
---------separator--------------
Train: [1 3 4 5 6] | test: [0 2 7]
Train: [0 2 4 5 6 7] | test: [1 3]
Train: [0 1 2 3 5 7] | test: [4 6]
Train: [0 1 2 3 4 6 7] | test: [5]

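With integer labels, both splitters work.
With fractional labels, the latter (StratifiedKFold) stops working, which is exactly the trap you fall into with a regression target.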
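If you still want stratified folds on a continuous target, one common workaround is to stratify on binned values instead of the raw labels. This is not something the original run did; it is a minimal sketch, and the toy data, the single median cut, and the names `y_cont` and `y_binned` are all assumptions for illustration. For plain regression, simply using KFold is often enough.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: 8 samples with a continuous (regression-style) target.
X = np.arange(32).reshape(8, 4)
y_cont = np.array([0.1, 0.4, 1.3, 2.2, 2.9, 3.5, 4.8, 5.0])

skf = StratifiedKFold(n_splits=4)

# Splitting on the raw continuous target raises the ValueError shown earlier.
try:
    list(skf.split(X, y_cont))
    failed = False
except ValueError as exc:
    failed = True
    print("StratifiedKFold rejected y_cont:", exc)

# Workaround: cut the target at its median to get two integer bins,
# then stratify on the bin ids instead of the raw values.
edges = np.quantile(y_cont, [0.5])
y_binned = np.digitize(y_cont, edges)

splits = list(skf.split(X, y_binned))
for train, test in splits:
    print('Train: %s | test: %s' % (train, test))
```

The bin ids are used only to build the folds; the model would still be trained on the original `y_cont` values at the returned indices.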

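The example above also shows the real difference between the two: KFold simply cuts the samples into equal-sized folds by position without looking at the labels, while StratifiedKFold samples by class, so that each training and test set keeps roughly the same class proportions as the full dataset.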
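The difference is easiest to see on an imbalanced label array. The following is a small sketch on made-up data (6 negatives followed by 2 positives); the helper `positives_per_test_fold` is a hypothetical name introduced here, not part of scikit-learn.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced data: 6 samples of class 0, then 2 samples of class 1.
X = np.zeros((8, 1))
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])

def positives_per_test_fold(splitter):
    """Count how many class-1 samples land in each test fold."""
    return [int(y[test].sum()) for _, test in splitter.split(X, y)]

kfold_counts = positives_per_test_fold(KFold(n_splits=2))
strat_counts = positives_per_test_fold(StratifiedKFold(n_splits=2))

print("KFold:          ", kfold_counts)   # [0, 2]
print("StratifiedKFold:", strat_counts)   # [1, 1]
```

Without shuffling, KFold puts both positives into the second test fold, so the first fold is evaluated with no positive samples at all, while StratifiedKFold gives each test fold one positive.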
