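When training deep learning or machine learning models, the available data is usually sampled at random and split into a training set and a test set. A few common cases are discussed below: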
If the original dataset is already in random order, a plain split like the following is enough:
from sklearn.model_selection import train_test_split
# Hold out 20% of the rows as the test set (rows are shuffled by default)
df_train, df_test = train_test_split(df, test_size=0.2)
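As a minimal sketch of the plain split, the snippet below builds a small toy DataFrame and checks the resulting sizes. The column names `x` and `label`, the data itself, and the `random_state` value are assumptions made only for illustration; `random_state` is added so that the shuffle is reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, one feature column and one label column.
df = pd.DataFrame({"x": range(10), "label": [0, 1] * 5})

# random_state fixes the shuffle, so repeated runs give the same split.
df_train, df_test = train_test_split(df, test_size=0.2, random_state=0)

print(len(df_train), len(df_test))  # 8 2
```

Every row ends up in exactly one of the two sets, so the train and test indices never overlap.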
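If the data is still sorted by label (or the classes are imbalanced), the training and test sets can easily end up with different label distributions. In that case, use stratified sampling through the stratify argument: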
from sklearn.model_selection import train_test_split
# stratify keeps the label proportions the same in both splits
df_train, df_test = train_test_split(df, test_size=0.2, stratify=df['label'])
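To see what stratification does, here is a small sketch on made-up data that is sorted by label and imbalanced (80 rows of class 0 followed by 20 rows of class 1). The DataFrame, its column names, and the `random_state` value are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data, sorted by label: 80 rows of class 0, then 20 of class 1.
df = pd.DataFrame({"x": range(100), "label": [0] * 80 + [1] * 20})

# Stratify on the label column so both splits keep the overall class ratio.
df_train, df_test = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=0
)

train_counts = df_train["label"].value_counts().to_dict()
test_counts = df_test["label"].value_counts().to_dict()
print(train_counts)  # {0: 64, 1: 16}
print(test_counts)   # {0: 16, 1: 4}
```

Both splits keep the 80/20 ratio of the full dataset. Without `stratify`, the counts in the test set would vary from one random split to the next.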
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
https://answers.dataiku.com/2352/split-dataset-by-stratified-sampling