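In machine learning you generally do not train on the whole dataset in one go and trust the result; you use cross-validation instead. Splitting the data several ways adds randomness and reduces the influence of noise, which helps limit overfitting, so the model learns more from limited data and generalizes better. The splitters used most often in sklearn are KFold, StratifiedKFold, StratifiedShuffleSplit, and GroupKFold. This post goes through how they differ, using the small DataFrame shown below. (In practice n_splits is usually 5 or 10.)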
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold,\
StratifiedShuffleSplit, GroupKFold, GroupShuffleSplit
df2 = pd.DataFrame([[6.5, 1, 2],
[8, 1, 0],
[61, 2, 1],
[54, 0, 1],
[78, 0, 1],
[119, 2, 2],
[111, 1, 2],
[23, 0, 0],
[31, 2, 0]], columns=['h', 'w', 'class'])
df2
       h  w  class
0    6.5  1      2
1    8.0  1      0
2   61.0  2      1
3   54.0  0      1
4   78.0  0      1
5  119.0  2      2
6  111.0  1      2
7   23.0  0      0
8   31.0  2      0
X = df2.drop(['class'], axis=1)
y = df2['class']
folder = KFold(n_splits=3, random_state=2020, shuffle=True)
for train_idx, test_idx in folder.split(X, y):
    print("KFold Spliting:")
    print('Train index: %s | test index: %s' % (train_idx, test_idx))
    # print(X.iloc[train_idx], y.iloc[train_idx], '\n', X.iloc[test_idx], y.iloc[test_idx])
===================================================================
KFold Spliting:
Train index: [0 1 3 5 6 8] | test index: [2 4 7]
KFold Spliting:
Train index: [0 2 3 4 7 8] | test index: [1 5 6]
KFold Spliting:
Train index: [1 2 4 5 6 7] | test index: [0 3 8]
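Note that a split returns indices into the data, not the data itself. Looking only at the test indices for now, the splits are not balanced across the values of class: the first test index [2, 4, 7] corresponds to classes 1, 1, 0. The train index has the same problem: its classes are 2, 0, 1, 2, 2, 0. This is often not acceptable, because we usually want the target classes to be evenly represented in every train dataset and valid dataset that a split produces.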
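To make the imbalance concrete, here is a minimal sketch that tallies the class labels in each KFold test fold. It rebuilds the labels of the toy df2 as a plain array; `class_counts` and `kfold_test_counts` are names introduced here for illustration only, and the exact indices depend on the scikit-learn version, so no output is shown.

```python
import numpy as np
from sklearn.model_selection import KFold

# Labels copied from the 'class' column of df2 above.
y = np.array([2, 0, 1, 1, 1, 2, 2, 0, 0])
X = np.zeros((len(y), 1))  # KFold only needs the number of rows


def class_counts(labels, n_classes=3):
    """Return how many samples of each class appear in `labels`."""
    return np.bincount(labels, minlength=n_classes)


kfold_test_counts = []
splitter = KFold(n_splits=3, shuffle=True, random_state=2020)
for train_idx, test_idx in splitter.split(X):
    counts = class_counts(y[test_idx])
    kfold_test_counts.append(counts)
    print('test index: %s | counts for classes 0, 1, 2: %s' % (test_idx, counts))
```

Any row of counts other than one sample per class means that fold is not stratified.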
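Interestingly, if you try n_splits=8 or 9, you will see that the number of test indices per fold changes with the number of splits. With n_splits=8, for example, the first fold has a test index of size n_samples // n_splits + 1 = 2, and the remaining folds have size 1.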
The first n_samples % n_splits folds have size n_samples // n_splits + 1, other folds have size n_samples // n_splits, where n_samples is the number of samples. (from the KFold documentation)
So KFold does not balance its splits by target class. What if the dataset has to be split evenly by target class? That is what StratifiedKFold is for.
sfolder = StratifiedKFold(n_splits=3, random_state=2020, shuffle=True)
for train_idx, test_idx in sfolder.split(X, y):
    print("StratifiedKFold Spliting:")
    print('Train index: %s | test index: %s' % (train_idx, test_idx))
======================================================
StratifiedKFold Spliting:
Train index: [0 3 4 5 7 8] | test index: [1 2 6]
StratifiedKFold Spliting:
Train index: [1 2 3 5 6 8] | test index: [0 4 7]
StratifiedKFold Spliting:
Train index: [0 1 2 4 6 7] | test index: [3 5 8]
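This time the first test index is [1 2 6], which covers classes 0, 1, and 2 once each, and the train index can be checked the same way. In other words, the target classes are evenly distributed in every split. Some data has another kind of structure, though. Suppose the feature column w in df also represents a category, and we want rows with the same w value to stay together in one group, in the same spirit as df.groupby. GroupKFold does this.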
gfolder = GroupKFold(n_splits=3)
for train_idx, test_idx in gfolder.split(X, y, groups=X['w']):
    print("GroupKFold Spliting:")
    print('Train index: %s | test index: %s' % (train_idx, test_idx))
========================================================================
GroupKFold Spliting:
Train index: [0 1 3 4 6 7] | test index: [2 5 8]
GroupKFold Spliting:
Train index: [2 3 4 5 7 8] | test index: [0 1 6]
GroupKFold Spliting:
Train index: [0 1 2 5 6 8] | test index: [3 4 7]
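Here the first test index is [2 5 8], where w is 2 for every row, and [0 1 6] has w equal to 1 throughout. The split therefore follows the groups. Try setting groups=y and see what happens.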
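The group behaviour, including the groups=y experiment, can be sketched as follows. The arrays are copied from df2; `group_splits` and `results` are illustrative names, not part of scikit-learn.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

w = np.array([1, 1, 2, 0, 0, 2, 1, 0, 2])  # 'w' column of df2
y = np.array([2, 0, 1, 1, 1, 2, 2, 0, 0])  # 'class' column of df2
X = np.zeros((len(y), 1))


def group_splits(groups, n_splits=3):
    """Return (train groups, test groups) as sets for every fold."""
    out = []
    for train_idx, test_idx in GroupKFold(n_splits=n_splits).split(X, y, groups=groups):
        out.append((set(groups[train_idx].tolist()), set(groups[test_idx].tolist())))
    return out


results = {'w': group_splits(w), 'class': group_splits(y)}
for name, folds in results.items():
    for train_groups, test_groups in folds:
        print('groups=%s | train groups: %s | test groups: %s'
              % (name, sorted(train_groups), sorted(test_groups)))
```

No group ever appears on both sides of a split. With groups=y each test fold holds a single class, which is the opposite of stratification, so the choice of grouping column matters.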
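StratifiedShuffleSplit is a mash-up of StratifiedKFold and ShuffleSplit. The biggest difference from StratifiedKFold is that the same sample can be drawn again in a later split: in the output below, the first test index is [1 5 4] and the second is [8 0 4], so index 4 appears in both. Two splits could even end up with identical indices; in the words of the documentation, it does "not guarantee that all folds will be different".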
# test_size must be at least the number of classes; samples may repeat across splits
shuffle_split = StratifiedShuffleSplit(n_splits=3, random_state=2020, test_size=3)
for train_idx, test_idx in shuffle_split.split(X, y):
    print("StratifiedShuffleSplit Spliting:")
    print('Train index: %s | test index: %s' % (train_idx, test_idx))
====================================================================
StratifiedShuffleSplit Spliting:
Train index: [8 2 3 0 6 7] | test index: [1 5 4]
StratifiedShuffleSplit Spliting:
Train index: [3 1 6 2 7 5] | test index: [8 0 4]
StratifiedShuffleSplit Spliting:
Train index: [1 8 2 6 0 4] | test index: [7 3 5]
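Many datasets today are badly imbalanced, and training may require splits that respect both a grouping column and the distribution of the target column. Stratified Group KFold was created for this case; think of it as a mash-up of GroupKFold and StratifiedKFold.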
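The code below comes from stratifiedgroupkfold and uses the sklearn iris dataset. An extra ID column is added so that groups=df['ID'], and after splitting, y in the train and valid sets should still follow the same distribution as in the original dataset.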
import numpy as np
import pandas as pd
import random
from sklearn.model_selection import GroupKFold
from collections import Counter, defaultdict
from sklearn.datasets import load_iris

def read_data():
    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names)
    df['target'] = iris.target
    # Add a new ID column with randomly assigned group labels
    list_id = ['A', 'B', 'C', 'D', 'E']
    df['ID'] = np.random.choice(list_id, len(df))
    features = iris.feature_names
    return df, features

df, features = read_data()
print(df.sample(6))
     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target ID
133                6.3               2.8                5.1               1.5       2  C
21                 5.1               3.7                1.5               0.4       0  A
84                 5.4               3.0                4.5               1.5       1  A
62                 6.0               2.2                4.0               1.0       1  D
5                  5.4               3.9                1.7               0.4       0  B
132                6.4               2.8                5.6               2.2       2  E
StratiiedGroupKFold, a step-by-step implementation:
def count_y(y, groups):
    """Count how many samples of each y label fall in each group."""
    unique_num = np.max(y) + 1
    # A missing key defaults to np.zeros(unique_num)
    y_counts_per_group = defaultdict(lambda: np.zeros(unique_num))
    for label, g in zip(y, groups):
        y_counts_per_group[g][label] += 1
    # Example contents:
    # {'A': array([5., 9., 8.]),
    #  'B': array([11., 12., 10.]),
    #  'C': array([13., 8., 8.]),
    #  'D': array([9., 11., 11.]),
    #  'E': array([12., 10., 13.])}
    return y_counts_per_group
def StratiiedGroupKFold(X, y, groups, features, k, seed=None):
    """
    Split the data with GroupKFold, then resample the validation set so that
    its label mix follows the training set; yields the resulting indices.
    :param X: dataset X (must contain the 'ID' column)
    :param y: y target
    :param groups: group labels that drive the split
    :param features: feature column names
    :param k: n_splits
    :param seed: random seed for sampling the validation indices
    """
    max_y = np.max(y)
    # Per-group counts of each y label
    y_counts_per_group = count_y(y, groups)
    gf = GroupKFold(n_splits=k)
    for train_idx, val_idx in gf.split(X, y, groups):
        # Get the train/val data and the unique ID values on each side
        x_train = X.iloc[train_idx, :]
        # Unique values of the ID column
        id_train = x_train['ID'].unique()
        x_train = x_train[features]
        x_val, y_val = X.iloc[val_idx, :], y.iloc[val_idx]
        id_val = x_val['ID'].unique()
        x_val = x_val[features]
        # Count each y label in the training and validation datasets
        y_counts_train = np.zeros(max_y + 1)
        y_counts_val = np.zeros(max_y + 1)
        for id in id_train:
            y_counts_train += y_counts_per_group[id]
        for id in id_val:
            y_counts_val += y_counts_per_group[id]
        # Ratio of each label count to the largest label count in the train set
        numratio_train = y_counts_train / np.max(y_counts_train)
        # Stratified counts: the validation count of the label that is most common
        # in the train set, times numratio_train, rounded up
        stratified_count = np.ceil(y_counts_val[np.argmax(y_counts_train)] * numratio_train).astype(int)
        val_idx = np.array([])
        # Fix: the original called np.random.rand(seed), which does not seed the generator
        np.random.seed(seed)
        for num in range(max_y + 1):
            # np.random.choice samples with replacement by default
            val_idx = np.append(val_idx, np.random.choice(y_val[y_val == num].index, stratified_count[num]))
        val_idx = val_idx.astype(int)
        yield train_idx, val_idx
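The stratified_count step is the least obvious part, so here is a worked example. It reuses the per-group counts from the example comment in count_y and assumes a hypothetical fold with groups A to D in the train set and group E in the validation set.

```python
import numpy as np

# Per-group label counts, taken from the example comment in count_y.
y_counts_per_group = {
    'A': np.array([5., 9., 8.]),
    'B': np.array([11., 12., 10.]),
    'C': np.array([13., 8., 8.]),
    'D': np.array([9., 11., 11.]),
    'E': np.array([12., 10., 13.]),
}
train_ids, val_ids = ['A', 'B', 'C', 'D'], ['E']  # hypothetical fold

y_counts_train = sum(y_counts_per_group[g] for g in train_ids)
y_counts_val = sum(y_counts_per_group[g] for g in val_ids)

# Each label count relative to the most common label in the train set.
numratio_train = y_counts_train / np.max(y_counts_train)
# Scale the validation count of that most common label by the ratios, rounding up.
stratified_count = np.ceil(y_counts_val[np.argmax(y_counts_train)] * numratio_train).astype(int)

print(y_counts_train)    # [38. 40. 37.]
print(stratified_count)  # [10 10 10]
```

Label 1 is the most common in the train set (40 samples) and has 10 samples in the validation group, so the function draws 10 validation samples per label here. Because the draw is with replacement, the same row can appear more than once in the validation indices.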
Check how the split turns out:
def get_distribution(y_vals):
    """Return the share of each y label as a percentage string."""
    y_distribut = Counter(y_vals)
    y_vals_sum = sum(y_distribut.values())
    return [f'{y_distribut[i]/y_vals_sum:.2%}' for i in range(np.max(y_vals) + 1)]
X = df.drop('target', axis=1)
y = df['target']
groups = df['ID']
distribution = [get_distribution(y)]
index = ['all dataset']
# Inspect each fold
for fold, (train_idx, val_idx) in enumerate(StratiiedGroupKFold(X, y, groups, features, k=3, seed=2020)):
    print(f'Train ID - fold {fold:1d}:{groups[train_idx].unique()} '
          f'Test ID - fold {fold:1d}:{groups[val_idx].unique()}')
    distribution.append(get_distribution(y[train_idx]))
    index.append(f'train set - fold{fold:1d}')
    distribution.append(get_distribution(y[val_idx]))
    index.append(f'valid set - fold{fold:1d}')
# Use a list, not a set, for the column names so they stay in label order
print(pd.DataFrame(distribution, index=index,
                   columns=[f' Label{l:2d}' for l in range(np.max(y) + 1)]))
========================================================================
Train ID - fold 0:['B' 'A' 'C' 'D'] Test ID - fold 0:['E']
Train ID - fold 1:['A' 'D' 'E'] Test ID - fold 1:['B' 'C']
Train ID - fold 2:['B' 'C' 'E'] Test ID - fold 2:['A' 'D']
                   Label 0  Label 1  Label 2
all dataset         33.33%   33.33%   33.33%
train set - fold0   32.48%   31.62%   35.90%
valid set - fold0   33.33%   33.33%   33.33%
train set - fold1   34.44%   33.33%   32.22%
valid set - fold1   33.93%   33.93%   32.14%
train set - fold2   33.33%   35.48%   31.18%
valid set - fold2   33.33%   35.42%   31.25%
A general implementation:
def stratified_group_k_fold(X, y, groups, k, seed=None):
    labels_num = np.max(y) + 1
    # Label counts per group, and over the whole dataset
    y_counts_per_group = defaultdict(lambda: np.zeros(labels_num))
    y_distr = Counter()
    for label, g in zip(y, groups):
        y_counts_per_group[g][label] += 1
        y_distr[label] += 1

    y_counts_per_fold = defaultdict(lambda: np.zeros(labels_num))
    groups_per_fold = defaultdict(set)

    def eval_y_counts_per_fold(y_counts, fold):
        # Tentatively add the group to the fold, score the imbalance, then undo
        y_counts_per_fold[fold] += y_counts
        std_per_label = []
        for label in range(labels_num):
            label_std = np.std([y_counts_per_fold[i][label] / y_distr[label] for i in range(k)])
            std_per_label.append(label_std)
        y_counts_per_fold[fold] -= y_counts
        return np.mean(std_per_label)

    groups_and_y_counts = list(y_counts_per_group.items())
    random.Random(seed).shuffle(groups_and_y_counts)

    # Place the most unevenly distributed groups first, each in its best fold
    for g, y_counts in sorted(groups_and_y_counts, key=lambda x: -np.std(x[1])):
        best_fold = None
        min_eval = None
        for i in range(k):
            fold_eval = eval_y_counts_per_fold(y_counts, i)
            if min_eval is None or fold_eval < min_eval:
                min_eval = fold_eval
                best_fold = i
        y_counts_per_fold[best_fold] += y_counts
        groups_per_fold[best_fold].add(g)

    all_groups = set(groups)
    for i in range(k):
        train_groups = all_groups - groups_per_fold[i]
        test_groups = groups_per_fold[i]

        train_indices = [i for i, g in enumerate(groups) if g in train_groups]
        test_indices = [i for i, g in enumerate(groups) if g in test_groups]

        yield train_indices, test_indices
That is all for now. When I have time I will look into sampling methods and how to handle imbalanced data.