sklearn.model_selection.train_test_split

Splitting a dataset: sklearn.model_selection.train_test_split(*arrays, **options)

Main parameters:

*arrays: lists, numpy arrays, scipy sparse matrices, or pandas DataFrames. All inputs must have the same number of samples.
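As a small sketch of how several inputs are split together: each input comes back as a train/test pair, and the rows stay aligned across inputs. The arrays `X`, `y`, and `w` below are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 10 samples, where row i of X is [2*i, 2*i + 1]
X = np.arange(20).reshape(10, 2)
y = list(range(10))        # a plain Python list is accepted too
w = np.arange(10) * 0.1    # e.g. hypothetical per-sample weights

# Three inputs give six outputs, returned as train/test pairs
X_train, X_test, y_train, y_test, w_train, w_test = train_test_split(
    X, y, w, test_size=0.2, random_state=0)

# Row i of X starts with 2*i, so this recovers the original row indices
rows_train = (X_train[:, 0] // 2).tolist()
print(rows_train == y_train)   # X and y were shuffled with the same indices
print(X_test.shape, len(y_test), w_test.shape)
```

Because the same shuffled indices are applied to every input, features, labels, and any extra per-sample arrays can be passed in a single call instead of being split separately.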

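test_size: float, int, or None; default None

① If a float, it is the fraction of all samples placed in the test set (between 0.0 and 1.0)

② If an int, it is the absolute number of test samples

③ If None, it is set to the complement of train_size; if train_size is also None, it defaults to 0.25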

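train_size: float, int, or None; default None

① If a float, it is the fraction of all samples placed in the training set (between 0.0 and 1.0)

② If an int, it is the absolute number of training samples

③ If None, it is set to the complement of test_size (0.75 when test_size is also left as None)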
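The following sketch shows how test_size and train_size interact. It assumes a toy array of 20 samples and only inspects the sizes of the resulting splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20)

# Integer test_size: exactly 5 test samples, the remaining 15 go to training
train, test = train_test_split(X, test_size=5, random_state=0)
sizes_int = (len(train), len(test))

# Only train_size given: the test set is its complement (10 and 10 here)
train, test = train_test_split(X, train_size=0.5, random_state=0)
sizes_complement = (len(train), len(test))

# Both given: they may add up to fewer than all samples; the rest is unused
train, test = train_test_split(X, train_size=10, test_size=5, random_state=0)
sizes_both = (len(train), len(test))

# Neither given: test_size falls back to 0.25
train, test = train_test_split(X, random_state=0)
sizes_default = (len(train), len(test))

print(sizes_int, sizes_complement, sizes_both, sizes_default)
```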

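random_state: int, RandomState instance, or None; default None

① If None, the shuffle is not seeded, so the split can differ from one run to the next

② If an int, it is used as the seed, so the split is the same on every run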

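stratify: array-like or None; default None

① If None, the class proportions in the resulting training and test sets are left to chance

② If not None, the split is stratified: the class proportions in the training and test sets match those of the array passed in, which is useful for imbalanced datasets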

A few simple examples show what each parameter does:

① test_size sets the ratio between the test and training sets

  In [1]: import numpy as np
     ...: from sklearn.model_selection import train_test_split
     ...: X = np.arange(20)
     ...: y = ['A', 'B', 'A', 'A', 'A', 'B', 'A', 'B', 'B', 'A',
     ...:      'A', 'B', 'B', 'A', 'A', 'B', 'A', 'B', 'A', 'A']
     ...: # 25% of the 20 samples go to the test set
     ...: X_train, X_test, y_train, y_test = train_test_split(
     ...:     X, y, test_size=0.25, random_state=0)

  In [2]: X_test.shape
  Out[2]: (5,)

  In [3]: X_train.shape
  Out[3]: (15,)

  In [4]: X_test, y_test
  Out[4]: (array([18,  1, 19,  8, 10]), ['A', 'B', 'A', 'B', 'A'])

② random_state controls whether the split is reproducible

Running the split again with random_state=0 gives the same result as above:

  In [5]: import numpy as np
     ...: from sklearn.model_selection import train_test_split
     ...: X = np.arange(20)
     ...: y = ['A', 'B', 'A', 'A', 'A', 'B', 'A', 'B', 'B', 'A',
     ...:      'A', 'B', 'B', 'A', 'A', 'B', 'A', 'B', 'A', 'A']
     ...: X_train, X_test, y_train, y_test = train_test_split(
     ...:     X, y, test_size=0.25, random_state=0)
     ...: X_test, y_test
  Out[5]: (array([18,  1, 19,  8, 10]), ['A', 'B', 'A', 'B', 'A'])

With random_state=None (the default), running the split twice gives two different results:

  In [6]: import numpy as np
     ...: from sklearn.model_selection import train_test_split
     ...: X = np.arange(20)
     ...: y = ['A', 'B', 'A', 'A', 'A', 'B', 'A', 'B', 'B', 'A',
     ...:      'A', 'B', 'B', 'A', 'A', 'B', 'A', 'B', 'A', 'A']
     ...: X_train, X_test, y_train, y_test = train_test_split(
     ...:     X, y, test_size=0.25)
     ...: X_test, y_test
  Out[6]: (array([ 3, 18, 14,  7,  4]), ['A', 'A', 'A', 'B', 'A'])

  In [7]: import numpy as np
     ...: from sklearn.model_selection import train_test_split
     ...: X = np.arange(20)
     ...: y = ['A', 'B', 'A', 'A', 'A', 'B', 'A', 'B', 'B', 'A',
     ...:      'A', 'B', 'B', 'A', 'A', 'B', 'A', 'B', 'A', 'A']
     ...: X_train, X_test, y_train, y_test = train_test_split(
     ...:     X, y, test_size=0.25)
     ...: X_test, y_test
  Out[7]: (array([18,  6,  3, 14,  8]), ['A', 'A', 'A', 'A', 'B'])
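
The reproducibility shown in these sessions can also be checked in code instead of by eye. This is a minimal sketch; the seed 42 is an arbitrary choice, and the RandomState part illustrates the third accepted type for random_state.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20)

# Same integer seed: the two test sets are identical
_, test_a = train_test_split(X, test_size=0.25, random_state=42)
_, test_b = train_test_split(X, test_size=0.25, random_state=42)
same_seed_equal = np.array_equal(test_a, test_b)

# A RandomState instance advances each time it is used, so consecutive
# calls sharing one instance generally give different splits
rng = np.random.RandomState(42)
_, first = train_test_split(X, test_size=0.25, random_state=rng)
_, second = train_test_split(X, test_size=0.25, random_state=rng)

# Recreating the instance from the same seed replays the same sequence
rng_again = np.random.RandomState(42)
_, first_again = train_test_split(X, test_size=0.25, random_state=rng_again)
replayed_equal = np.array_equal(first, first_again)

print(same_seed_equal, replayed_equal)
```

An integer seed is usually the simpler choice when a single split must be reproducible; a shared RandomState instance is handy when several random steps should be driven from one seed.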
③ The stratify parameter preserves class proportions, which helps with imbalanced data

  In [8]: import numpy as np
     ...: from sklearn.model_selection import train_test_split
     ...: X = np.arange(20)
     ...: y = ['A', 'B', 'A', 'A', 'A', 'B', 'A', 'B', 'B', 'A',
     ...:      'A', 'B', 'B', 'A', 'A', 'B', 'A', 'B', 'A', 'A']
     ...: X_train, X_test, y_train, y_test = train_test_split(
     ...:     X, y, test_size=0.25, stratify=y)
     ...: X_test, y_test
  Out[8]: (array([18,  8,  3, 10, 11]), ['A', 'B', 'A', 'A', 'B'])

  In [9]: import numpy as np
     ...: from sklearn.model_selection import train_test_split
     ...: X = np.arange(20)
     ...: y = ['A', 'B', 'A', 'A', 'A', 'B', 'A', 'B', 'B', 'A',
     ...:      'A', 'B', 'B', 'A', 'A', 'B', 'A', 'B', 'A', 'A']
     ...: X_train, X_test, y_train, y_test = train_test_split(
     ...:     X, y, test_size=0.25, stratify=y)
     ...: X_test, y_test
  Out[9]: (array([ 6, 19,  8, 17,  0]), ['A', 'A', 'B', 'B', 'A'])

  In [10]: X_train, y_train
  Out[10]:
  (array([ 7,  1, 11, 10, 15,  2,  3,  5,  4, 13, 12, 16, 18, 14,  9]),
   ['B', 'B', 'B', 'A', 'B', 'A', 'A', 'B', 'A', 'A', 'B', 'A', 'A', 'A', 'A'])

With stratify=y, the samples chosen differ from run to run, but the test and training sets always keep the class ratio of the original data, B:A = 2:3.
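
The 2:3 ratio can be confirmed by counting labels. The sketch below reuses the same `X` and `y` and counts with `collections.Counter`; no random_state is set, since the per-class counts are the same whichever samples are drawn.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20)
y = ['A', 'B', 'A', 'A', 'A', 'B', 'A', 'B', 'B', 'A',
     'A', 'B', 'B', 'A', 'A', 'B', 'A', 'B', 'A', 'A']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y)

# Count labels in the full data and in each part of the split
full_counts = dict(Counter(y))          # 12 'A' and 8 'B' overall
train_counts = dict(Counter(y_train))
test_counts = dict(Counter(y_test))

print(full_counts, train_counts, test_counts)
```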
