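scikit-learn provides StratifiedShuffleSplit() for stratified sampling: every train/test split it produces preserves, as closely as possible, the proportion of samples belonging to each label.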
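Parameters:
n_splits: number of re-shuffling and splitting iterations, i.e. how many train/test pairs are generated.
test_size: fraction (float) or absolute number (int) of samples placed in the test set.
random_state: seed for the shuffling, so that the splits are reproducible.
Usage:
Create the splitter, then iterate over split(X, y), which yields a pair of index arrays (train, test) for each split.
Code example: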
from sklearn.model_selection import StratifiedShuffleSplit
import numpy as np

# Feature matrix, 10 samples x 2 features
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2],
              [3, 4], [1, 2], [3, 4], [4, 5], [4, 5]])
print(X)
# Class labels, one per sample
y = np.array([0, 0, 1, 1, 0, 0, 1, 1, 2, 2])
print(y)
# Generate 2 splits, each with 25% of the samples in the test set
ss = StratifiedShuffleSplit(n_splits=2, test_size=0.25, random_state=42)
for train_index, test_index in ss.split(X, y):
    # split() yields index arrays, not the data itself
    print("TRAIN_INDEX:", train_index, "TEST_INDEX:", test_index)
    X_train, X_test = X[train_index], X[test_index]  # feature rows for each subset
    y_train, y_test = y[train_index], y[test_index]  # labels for each subset
    print("X_train:", X_train)
    print("y_train:", y_train)
Output:
[[1 2]
[3 4]
[1 2]
[3 4]
[1 2]
[3 4]
[1 2]
[3 4]
[4 5]
[4 5]]
[0 0 1 1 0 0 1 1 2 2]
TRAIN_INDEX: [3 9 1 7 2 0 4] TEST_INDEX: [5 6 8]
X_train: [[3 4]
[4 5]
[3 4]
[3 4]
[1 2]
[1 2]
[1 2]]
y_train: [1 2 0 1 1 0 0]
TRAIN_INDEX: [8 5 0 7 6 2 4] TEST_INDEX: [3 1 9]
X_train: [[4 5]
[3 4]
[1 2]
[3 4]
[1 2]
[1 2]
[1 2]]
y_train: [2 0 0 1 1 1 0]
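The example shows that the splitting criterion is to keep the label proportions consistent: in both splits, labels 0, 1 and 2 appear in y_train in the same ratio (3 : 3 : 1). With a larger dataset, the proportions after sampling also closely match the proportions before sampling. In other words, the share of 0, 1 and 2 in y is (approximately) the same as their share in y_train.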
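The claim above can be checked by counting labels instead of reading the printed arrays. The following is a minimal sketch that reuses the X, y and splitter settings from the example; the use of np.bincount and the variable name train_counts are my own additions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2],
              [3, 4], [1, 2], [3, 4], [4, 5], [4, 5]])
y = np.array([0, 0, 1, 1, 0, 0, 1, 1, 2, 2])

ss = StratifiedShuffleSplit(n_splits=2, test_size=0.25, random_state=42)

train_counts = []  # label counts of the training set, one entry per split
for train_index, test_index in ss.split(X, y):
    # np.bincount counts how often each non-negative integer label occurs
    counts_train = np.bincount(y[train_index], minlength=3)
    counts_test = np.bincount(y[test_index], minlength=3)
    train_counts.append(counts_train)
    print("train label counts:", counts_train, "test label counts:", counts_test)
```

With only 10 samples the proportions cannot match exactly (4 : 4 : 2 in y versus 3 : 3 : 1 in y_train), but the counts are the same in every split, which is what the printed y_train arrays in the output show.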
Below is how it behaves on a larger dataset.
Link: https://pan.baidu.com/s/10cnGXuCxd9ABkbMg99vOCA
Extraction code: 7ece
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit
# Load the data
data = pd.read_csv("median_income.csv")
# Preprocess so that no income category is too sparse:
# dividing by 1.5 limits the number of categories, and everything above 5 is merged into category 5
data["income_cat"] = np.ceil(data["median_income"] / 1.5)
data["income_cat"] = data["income_cat"].where(data["income_cat"] < 5, 5.0)
# Proportion of each category in the full dataset
data["income_cat"].value_counts() / len(data)
【OUT】:
3.0 0.350581
2.0 0.318847
4.0 0.176308
5.0 0.114438
1.0 0.039826
Name: income_cat, dtype: float64
# Create the splitter
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(data, data["income_cat"]):
    # split() yields positional indices; .loc works here because read_csv gives a default 0..n-1 index
    strat_train_test = data.loc[train_index]
    strat_test_test = data.loc[test_index]
# Proportion of each category in the stratified training set
strat_train_test["income_cat"].value_counts() / len(strat_train_test)
【OUT】:
3.0 0.350594
2.0 0.318859
4.0 0.176296
5.0 0.114402
1.0 0.039850
Name: income_cat, dtype: float64
The results show that the proportion of each category in the data is essentially the same before and after stratified sampling.
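To see why this matters, it can help to put a stratified split next to a purely random one. The sketch below does not use the CSV from the link: it generates a synthetic income_cat column with roughly the proportions printed above (the synthetic data, the seed and the names compare, strat_err and rand_err are all assumptions made for illustration). It also uses train_test_split with its stratify argument as a shorthand for a single stratified shuffle split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 20000 rows, category shares close to the output above
rng = np.random.default_rng(0)
cats = rng.choice([1.0, 2.0, 3.0, 4.0, 5.0], size=20000,
                  p=[0.04, 0.32, 0.35, 0.176, 0.114])
data = pd.DataFrame({"income_cat": cats})

# Category shares in the full dataset
overall = data["income_cat"].value_counts(normalize=True)

# One stratified split and one purely random split, both with a 20% test set
_, strat_test = train_test_split(data, test_size=0.2, random_state=42,
                                 stratify=data["income_cat"])
_, rand_test = train_test_split(data, test_size=0.2, random_state=42)

# Side-by-side category shares and their absolute deviation from the full dataset
compare = pd.DataFrame({
    "overall": overall,
    "stratified": strat_test["income_cat"].value_counts(normalize=True),
    "random": rand_test["income_cat"].value_counts(normalize=True),
}).sort_index()
compare["strat_err"] = (compare["stratified"] - compare["overall"]).abs()
compare["rand_err"] = (compare["random"] - compare["overall"]).abs()
print(compare)
```

In the stratified column each category can be off by at most about one sample, so strat_err stays tiny; the random column has no such guarantee and will typically drift further, especially for the rare categories.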
Note: the data comes from Kaggle; the code is adapted from the book 《机器学习实战》 (工业出版社).