A pitfall when using sklearn's StratifiedShuffleSplit for stratified sampling

The problem

As a newcomer to machine learning, I kept getting the following error when using StratifiedShuffleSplit to create a test set:
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

The code that fails

import os

import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

housing = pd.read_csv(os.path.join('base_path', 'kc_house_data.csv'))

# Bucket prices into grades, one grade per 100,000
housing["price_grade"] = np.round(housing["price"] / 100000)
# Series.where keeps values where the condition is True and substitutes 1000 elsewhere
housing["price_grade"] = housing["price_grade"].where(housing["price_grade"] < 1000, 1000)
strat_train_set, strat_test_set = None, None
# Then apply stratified sampling
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["price_grade"]):
    # split() yields positional indices, so select rows with iloc
    strat_train_set = housing.iloc[train_index]
    strat_test_set = housing.iloc[test_index]

As shown, the code is much the same as the tutorials online, yet it kept raising the error above.

Troubleshooting

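Analyzing the error:

ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
In other words, at least one class in y contains a single sample. A stratified split needs at least two members in every class so that each class can appear in both the training set and the test set.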
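The error can be reproduced without the housing data. The following is a minimal sketch on made-up labels (the arrays `X` and `y` are hypothetical, not taken from the dataset): class 2 has a single member, so the split is expected to fail with the same message.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical toy data: classes 0 and 1 have five members each, class 2 has one
X = np.zeros((11, 1))
y = np.array([0] * 5 + [1] * 5 + [2])

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
try:
    next(split.split(X, y))
    error_message = None
except ValueError as exc:
    # Stratification is refused because class 2 cannot be placed in both sets
    error_message = str(exc)

print(error_message)
```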

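Checking the distribution of price_grade with housing["price_grade"].value_counts() shows that some classes have only one member.
[Figure 1: output of housing["price_grade"].value_counts() for the original grouping]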
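Rather than reading the whole value_counts() output by eye, the under-populated classes can be filtered out directly. This is a small sketch; the helper name `rare_classes` and the toy series are assumptions made for illustration, standing in for housing["price_grade"].

```python
import pandas as pd


def rare_classes(labels: pd.Series, min_count: int = 2) -> pd.Series:
    """Return the counts of classes that have fewer than `min_count` members."""
    counts = labels.value_counts()
    return counts[counts < min_count]


# Hypothetical labels standing in for housing["price_grade"]
toy_grades = pd.Series([1.0, 1.0, 2.0, 2.0, 2.0, 77.0])
singletons = rare_classes(toy_grades)
print(singletons)
```

An empty result means every class has enough members for a stratified split.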

The fix

Change the grouping so that every class has at least two members.

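# Finer buckets of 20,000 each; every grade of 50 or above is merged into grade 50
housing["price_grade"] = np.ceil(housing["price"] / 20000)
housing["price_grade"] = housing["price_grade"].where(housing["price_grade"] < 50, 50)
housing["price_grade"].value_counts()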

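After this adjustment the smallest class has three members, and running the split again works without error.
[Figure 2: output of housing["price_grade"].value_counts() after regrouping]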
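To see the regrouping and the split working together, here is a self-contained sketch on synthetic prices. The price range and step are invented so that the example runs without kc_house_data.csv; they are not the real data, and the class counts differ from those reported above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the housing data: 1900 evenly spaced prices (assumption)
housing = pd.DataFrame({"price": np.arange(100_500, 2_000_000, 1_000)})

# Same grouping as the fix: buckets of 20,000, grades of 50 or above merged into 50
housing["price_grade"] = np.ceil(housing["price"] / 20000)
housing["price_grade"] = housing["price_grade"].where(housing["price_grade"] < 50, 50)

# Every class must have at least two members before stratifying
smallest_class = housing["price_grade"].value_counts().min()

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_index, test_index = next(split.split(housing, housing["price_grade"]))
strat_train_set = housing.iloc[train_index]
strat_test_set = housing.iloc[test_index]

print(smallest_class, len(strat_train_set), len(strat_test_set))
```

Assigning the result of where() back to the column avoids relying on the in-place form, which may behave differently across pandas versions.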
