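I recommend stratified sampling for splitting data, because it keeps the class proportions consistent between the training set and the test set.
All of the variations below are therefore built on stratified sampling.
The implementation uses sklearn.model_selection.StratifiedShuffleSplit directly. This splitter is often used to generate the splits for cross-validation; since here it only serves to divide the data into two parts, a single split is enough (i.e. set n_splits=1).
StratifiedShuffleSplit combines the functionality of ShuffleSplit and StratifiedKFold.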
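As a quick illustration of the proportion-preserving behavior described above, here is a minimal sketch on made-up labels. The arrays `X` and `y` are hypothetical toy data, not part of the original workflow.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy labels: 90 negatives and 10 positives (10% positive rate).
y = np.array([0] * 90 + [1] * 10)
# Placeholder feature column; only y is used for stratification.
X = np.arange(len(y)).reshape(-1, 1)

# n_splits=1 because we only want one train/test partition.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y))

# Compare the positive rate on each side of the split.
train_pos_rate = y[train_idx].mean()
test_pos_rate = y[test_idx].mean()
print("train positive rate:", train_pos_rate)
print("test positive rate:", test_pos_rate)
```

Both printed rates should match the 10% positive rate of the full label array, which is the point of stratifying.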
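from sklearn.model_selection import StratifiedShuffleSplit


def train_test_sp(data, target, test_size=0.2):
    """
    Simple stratified train/test split,
    based on sklearn.model_selection.StratifiedShuffleSplit.
    param data: pandas DataFrame holding the dataset
    param target: str, name of the column to stratify on
    param test_size: float, fraction of rows assigned to the test set
    """
    splitter = StratifiedShuffleSplit(n_splits=1, test_size=test_size, random_state=42)
    # split() yields positional indices, so select rows with iloc rather than loc;
    # loc would fail or pick wrong rows when the index is not 0..n-1
    train_index, test_index = next(splitter.split(data, data[target]))
    strat_train_set = data.iloc[train_index]
    strat_test_set = data.iloc[test_index]
    return strat_train_set, strat_test_set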
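For a single split like this one, scikit-learn's `train_test_split` also accepts a `stratify` argument, which may be a shorter route to a comparable result. The sketch below uses a hypothetical toy frame (the `feature` and `label` columns and the index values are made up for illustration).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: 80 negatives, 20 positives, with a non-default index.
toy = pd.DataFrame(
    {"feature": range(100), "label": [0] * 80 + [1] * 20},
    index=range(1000, 1100),
)

# stratify=... keeps the label proportions the same in both parts.
train_df, test_df = train_test_split(
    toy, test_size=0.2, stratify=toy["label"], random_state=42
)

print(train_df["label"].value_counts(normalize=True))
print(test_df["label"].value_counts(normalize=True))
```

The explicit StratifiedShuffleSplit version remains useful when the same splitter is later reused with n_splits greater than 1.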
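When the positive and negative classes are imbalanced, downsample the negative class with stratified sampling: cluster the negatives with KMeans and use the cluster labels as the strata, so that the retained negatives keep the same mix of clusters.
Then split the result into training and test sets with the simple stratified sampling shown earlier.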
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.model_selection import StratifiedShuffleSplit


def Kclust_sample(data, target, n_clusters=10, out_size=0.3, neg_flg=0, seed=42):
    """
    Cluster-stratified undersampling of the negative class.
    Author: Scc_hy
    :param data: pandas DataFrame (feature columns must be numeric for KMeans)
    :param target: str, name of the target column
    :param n_clusters: number of KMeans clusters
    :param out_size: fraction of negative samples to keep (the downsampling rate)
    :param neg_flg: label value that marks negative samples, str/int
    :param seed: random seed
    Example:
        Kclust_sample(new_dt, 'IS_LS', n_clusters=10, out_size=0.3, neg_flg=0, seed=42)
    """
    ## Separate positive and negative samples (a binary target is assumed)
    pos_flg = [i for i in data[target].unique() if i != neg_flg][0]
    neg_dt = data[data[target] == neg_flg].copy(deep=True)
    neg_dt.reset_index(drop=True, inplace=True)
    pos_dt = data[data[target] == pos_flg].copy(deep=True)
    pos_dt.reset_index(drop=True, inplace=True)
    ## Cluster the negatives on the feature columns only
    ## (the target is constant here and may be a string label)
    kclust = KMeans(n_clusters=n_clusters, random_state=seed)
    neg_dt['kclust'] = kclust.fit_predict(neg_dt.drop(target, axis=1))
    print('=================== Negative-sample clustering finished ==================')
    split = StratifiedShuffleSplit(n_splits=10, test_size=out_size, random_state=seed)
    ## Stratify on the cluster labels; this yields 10 index pairs
    ## sized (1 - out_size) : out_size
    need_lst = list(split.split(neg_dt.drop('kclust', axis=1), neg_dt['kclust']))
    ## Pick one of the 10 pairs at random (seeded) and keep its test-side rows,
    ## about out_size * len(neg_dt) negatives
    indices = np.random.RandomState(seed).permutation(len(need_lst))
    neg_ot = neg_dt.drop('kclust', axis=1).iloc[need_lst[indices[0]][1], :]
    out_dt = pd.concat([neg_ot, pos_dt], axis=0)
    out_dt.reset_index(drop=True, inplace=True)
    return out_dt
### Downsample first, then do a stratified train/test split
out_dt = Kclust_sample(new_dt, 'IS_LS', n_clusters=10, out_size = 0.3, neg_flg=0, seed=42)
tr_set, te_set = train_test_sp(out_dt, 'IS_LS', test_size = 0.2)
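To make the idea behind the cluster-stratified downsampling concrete, here is a minimal self-contained sketch on synthetic data. The two blobs, the column names `x1` and `x2`, and the choice of two clusters are all assumptions made for illustration; it checks that the sampled negatives keep roughly the same cluster mix as the full negative set.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(0)
# Hypothetical negatives: two well-separated blobs of 150 and 50 rows.
neg = pd.DataFrame(
    np.vstack([rng.normal(0, 1, size=(150, 2)), rng.normal(20, 1, size=(50, 2))]),
    columns=["x1", "x2"],
)

# Cluster the negatives; the cluster labels become the strata.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(neg)

# Keep 30% of the negatives, stratified on the cluster labels.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
_, keep_idx = next(splitter.split(neg, clusters))
neg_sampled = neg.iloc[keep_idx]

# Compare cluster shares before and after sampling.
full_share = pd.Series(clusters).value_counts(normalize=True).sort_index()
sample_share = pd.Series(clusters[keep_idx]).value_counts(normalize=True).sort_index()
print(pd.DataFrame({"all_negatives": full_share, "sampled": sample_share}))
```

Plain random undersampling would only match these shares on average; stratifying on the clusters keeps them close in every draw.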
These are simply the methods I usually rely on in practice. I hope they are helpful, and I may add more later.