It all started with this error:
ValueError Traceback (most recent call last)
<ipython-input-42-2ab744268d80> in <module>()
20 print('---------分割线--------------')
21 sfolder = StratifiedKFold(n_splits=4,random_state=0,shuffle=False)
---> 22 for train, test in sfolder.split(X,y):
23 print('Train: %s | test: %s' % (train, test))
24 print(" ")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py in split(self, X, y, groups)
330 n_samples))
331
--> 332 for train, test in super(_BaseKFold, self).split(X, y, groups):
333 yield train, test
334
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py in split(self, X, y, groups)
93 X, y, groups = indexable(X, y, groups)
94 indices = np.arange(_num_samples(X))
---> 95 for test_index in self._iter_test_masks(X, y, groups):
96 train_index = indices[np.logical_not(test_index)]
97 test_index = indices[test_index]
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py in _iter_test_masks(self, X, y, groups)
632
633 def _iter_test_masks(self, X, y=None, groups=None):
--> 634 test_folds = self._make_test_folds(X, y)
635 for i in range(self.n_splits):
636 yield test_folds == i
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py in _make_test_folds(self, X, y)
587 raise ValueError(
588 'Supported target types are: {}. Got {!r} instead.'.format(
--> 589 allowed_target_types, type_of_target_y))
590
591 y = column_or_1d(y)
ValueError: Supported target types are: ('binary', 'multiclass'). Got 'continuous' instead.
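This one cost me three hours.
Whether the labels are integers or decimals really does decide whether the split succeeds! StratifiedKFold only accepts class labels ('binary' or 'multiclass'); labels with fractional values are treated as a 'continuous' target and rejected.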
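To see how scikit-learn classifies a label array before it ever reaches the splitter, you can call `type_of_target` directly, which is the check behind the error message above. The sketch below uses made-up label arrays (`whole_floats` and `frac_floats` are illustrative names, not from the original code). As far as I can tell, floats that happen to be whole numbers still count as class labels; it is the fractional part that flips the result to 'continuous'.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

int_labels = np.array([1, 1, 0, 0, 1, 1, 0, 4])   # integers, three distinct classes
whole_floats = np.array([1.0, 1.0, 0.0, 0.0])     # floats, but every value is a whole number
frac_floats = np.array([1.5, 1.2, 0.3, 0.7])      # floats with fractional parts

kind_int = type_of_target(int_labels)
kind_whole = type_of_target(whole_floats)
kind_frac = type_of_target(frac_floats)

print(kind_int)    # multiclass
print(kind_whole)  # binary
print(kind_frac)   # continuous
```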
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.array([
    [1, 2, 3, 4],
    [11, 12, 13, 14],
    [21, 22, 23, 24],
    [31, 32, 33, 34],
    [41, 42, 43, 44],
    [51, 52, 53, 54],
    [61, 62, 63, 64],
    [71, 72, 73, 74]
])
y = np.array([1, 1, 0, 0, 1, 1, 0, 4])  # integer labels

# random_state is left out: it has no effect when shuffle=False,
# and recent scikit-learn versions raise an error if it is set anyway.
folder = KFold(n_splits=4, shuffle=False)
for train, test in folder.split(X, y):
    print('Train: %s | test: %s' % (train, test))
    print(" ")

print('---------分割线--------------')  # 分割线 = divider
sfolder = StratifiedKFold(n_splits=4, shuffle=False)
for train, test in sfolder.split(X, y):
    print('Train: %s | test: %s' % (train, test))
    print(" ")
Output:
Train: [2 3 4 5 6 7] | test: [0 1]
Train: [0 1 4 5 6 7] | test: [2 3]
Train: [0 1 2 3 6 7] | test: [4 5]
Train: [0 1 2 3 4 5] | test: [6 7]
---------分割线--------------
Train: [1 3 4 5 6] | test: [0 2 7]
Train: [0 2 4 5 6 7] | test: [1 3]
Train: [0 1 2 3 5 7] | test: [4 6]
Train: [0 1 2 3 4 6 7] | test: [5]
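With integer labels, both splitters work.
With decimal labels, the latter (StratifiedKFold) stops working, which makes it a nasty trap for regression targets.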
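If you still want stratified folds for a regression target, one possible workaround (my assumption, not something the error message suggests) is to bin the continuous values and stratify on the bin index instead. The sketch below uses a made-up target `y_cont` and splits it at the median into two bins; the bin labels are used only to build the folds, and you would still train on the original values.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(32).reshape(8, 4)
# Hypothetical continuous target, for illustration only
y_cont = np.array([0.1, 0.2, 1.5, 1.7, 3.2, 3.9, 7.5, 8.8])

# Split at the median: values below it go to bin 0, the rest to bin 1
edges = np.quantile(y_cont, [0.5])
y_binned = np.digitize(y_cont, edges)

sfolder = StratifiedKFold(n_splits=4, shuffle=False)
# Stratify on the bin labels; keep y_cont as the actual training target
splits = list(sfolder.split(X, y_binned))
for train, test in splits:
    print('Train: %s | test: %s | test targets: %s' % (train, test, y_cont[test]))
```

With only two bins the stratification is coarse; more bins give finer control, as long as each bin has at least `n_splits` samples.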
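The example above also shows the real difference between the two: KFold simply divides the samples evenly, so every split has the same train/test ratio (6 to 2 here).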
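Because KFold only cuts the sample indices into equal blocks, it never looks at the labels, so decimal labels are no problem for it. A small sketch with a made-up float target (`y_float` is an illustrative name):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(32).reshape(8, 4)
# Hypothetical decimal labels; KFold does not inspect them
y_float = np.array([0.1, 0.2, 1.5, 1.7, 3.2, 3.9, 7.5, 8.8])

folder = KFold(n_splits=4, shuffle=False)
test_folds = [test.tolist() for _, test in folder.split(X, y_float)]
print(test_folds)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```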