When doing cross-validation for machine learning in Python, we often run into the random_state parameter, for example in the scikit-learn class:
KFold(n_splits=5, shuffle=False, random_state=None)
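This class performs K-fold cross-validation.
n_splits: the number of folds; int, default 5.
shuffle: whether to shuffle the data before splitting it into folds; boolean, default False.
random_state: int, RandomState instance or None, default None. Literally, "random state".
random_state only has an effect when shuffle=True.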
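To make the last point concrete, here is a minimal sketch that runs KFold without shuffling. The array xx mirrors the 25-element example used in this post, and the variable test_folds is my own name. Without shuffling, the folds are consecutive blocks of indices, so there is no randomness left for random_state to control.

```python
import numpy as np
from sklearn.model_selection import KFold

xx = np.arange(25)

# shuffle=False: the samples are cut into consecutive blocks, in their original order
kf = KFold(n_splits=5, shuffle=False)
test_folds = [test_index.tolist() for _, test_index in kf.split(xx)]

for fold in test_folds:
    print(fold)
```

The test folds come out as 0-4, 5-9, 10-14, 15-19 and 20-24 on every run. As far as I know, recent scikit-learn releases go one step further and raise a ValueError if a non-None random_state is passed together with shuffle=False, so it is safest to leave random_state at None in that case.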
KFold(n_splits=5, shuffle=True, random_state=None)
With random_state=None, the data are split differently on every run.
Example:
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
xx=np.arange(25)
kf=KFold(n_splits=5,shuffle=True,random_state=None)
for train_index,test_index in kf.split(xx):
    print('train_index:%s,test_index:%s'%(train_index,test_index))
Output:
train_index:[ 0 1 2 3 4 5 6 8 9 10 11 12 14 15 17 18 19 20 21 23],test_index:[ 7 13 16 22 24]
train_index:[ 0 2 4 5 6 7 8 10 11 12 13 14 15 16 17 19 20 21 22 24],test_index:[ 1 3 9 18 23]
train_index:[ 1 2 3 5 7 8 9 10 11 12 13 14 15 16 18 20 21 22 23 24],test_index:[ 0 4 6 17 19]
train_index:[ 0 1 2 3 4 6 7 9 11 12 13 16 17 18 19 20 21 22 23 24],test_index:[ 5 8 10 14 15]
train_index:[ 0 1 3 4 5 6 7 8 9 10 13 14 15 16 17 18 19 22 23 24],test_index:[ 2 11 12 20 21]
Run it again:
kf=KFold(n_splits=5,shuffle=True,random_state=None)
for train_index,test_index in kf.split(xx):
    print('train_index:%s,test_index:%s'%(train_index,test_index))
Output:
train_index:[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 17 18 19 24],test_index:[16 20 21 22 23]
train_index:[ 1 2 3 4 5 6 7 8 9 11 12 14 15 16 17 20 21 22 23 24],test_index:[ 0 10 13 18 19]
train_index:[ 0 2 4 5 6 7 9 10 11 12 13 14 15 16 18 19 20 21 22 23],test_index:[ 1 3 8 17 24]
train_index:[ 0 1 3 7 8 9 10 12 13 14 15 16 17 18 19 20 21 22 23 24],test_index:[ 2 4 5 6 11]
train_index:[ 0 1 2 3 4 5 6 8 10 11 13 16 17 18 19 20 21 22 23 24],test_index:[ 7 9 12 14 15]
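As you can see, the two runs split the data differently, because the data are reshuffled before every split.
Now set random_state to an integer, for example 1: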
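One way to read this behavior: with random_state=None, KFold draws from NumPy's global random generator, whose state keeps advancing between calls. The sketch below assumes that behavior (to my knowledge it is how scikit-learn treats None) and shows that resetting the global generator with np.random.seed makes even random_state=None repeat itself. The helper name split_once and the seed value 0 are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

xx = np.arange(25)

def split_once():
    """Return the test folds of one shuffled 5-fold split with random_state=None."""
    kf = KFold(n_splits=5, shuffle=True, random_state=None)
    return [test_index.tolist() for _, test_index in kf.split(xx)]

np.random.seed(0)   # reset NumPy's global generator
first = split_once()
np.random.seed(0)   # reset it to the same state again
second = split_once()

# The two splits match only because the global state was reset in between
print(first == second)
```

Without the two np.random.seed calls the comparison would almost certainly print False, which is the behavior seen in the two runs above.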
KFold(n_splits=5, shuffle=True, random_state=1)
With a fixed integer, the data are split the same way on every run.
Example:
xx=np.arange(25)
kf=KFold(n_splits=5,shuffle=True,random_state=1)
for train_index,test_index in kf.split(xx):
    print('train_index:%s,test_index:%s'%(train_index,test_index))
Output:
train_index:[ 0 1 2 4 5 6 7 8 9 10 11 12 15 16 18 19 20 22 23 24],test_index:[ 3 13 14 17 21]
train_index:[ 0 1 3 5 6 7 8 9 11 12 13 14 15 16 17 20 21 22 23 24],test_index:[ 2 4 10 18 19]
train_index:[ 0 2 3 4 5 8 9 10 11 12 13 14 15 16 17 18 19 21 23 24],test_index:[ 1 6 7 20 22]
train_index:[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 17 18 19 20 21 22],test_index:[ 0 15 16 23 24]
train_index:[ 0 1 2 3 4 6 7 10 13 14 15 16 17 18 19 20 21 22 23 24],test_index:[ 5 8 9 11 12]
Run it again:
kf=KFold(n_splits=5,shuffle=True,random_state=1)
for train_index,test_index in kf.split(xx):
    print('train_index:%s,test_index:%s'%(train_index,test_index))
Output:
train_index:[ 0 1 2 4 5 6 7 8 9 10 11 12 15 16 18 19 20 22 23 24],test_index:[ 3 13 14 17 21]
train_index:[ 0 1 3 5 6 7 8 9 11 12 13 14 15 16 17 20 21 22 23 24],test_index:[ 2 4 10 18 19]
train_index:[ 0 2 3 4 5 8 9 10 11 12 13 14 15 16 17 18 19 21 23 24],test_index:[ 1 6 7 20 22]
train_index:[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 17 18 19 20 21 22],test_index:[ 0 15 16 23 24]
train_index:[ 0 1 2 3 4 6 7 10 13 14 15 16 17 18 19 20 21 22 23 24],test_index:[ 5 8 9 11 12]
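As you can see, when random_state is set to the integer 1 both times, the two shuffles are identical and so are the resulting splits.
That raises some questions: would random_state=2 work? What about random_state=160? What is the difference between random_state=1 and random_state=2? I once saw a blog post that set random_state=42 and was completely baffled:
what does the 42 mean?
All right, let's find out.
First set random_state=2 and run it: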
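Comparing two long printouts by eye is error-prone, so the same check can be done programmatically. This is a small sketch rather than anything from the scikit-learn documentation; the helper folds_for_seed is a name I made up.

```python
import numpy as np
from sklearn.model_selection import KFold

xx = np.arange(25)

def folds_for_seed(seed):
    """Collect the test indices of every fold for a given integer seed."""
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    return [test_index.tolist() for _, test_index in kf.split(xx)]

# Two independent KFold objects built with the same seed
run_a = folds_for_seed(1)
run_b = folds_for_seed(1)

print(run_a == run_b)  # True: same seed, same shuffle, same folds
```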
kf=KFold(n_splits=5,shuffle=True,random_state=2)
for train_index,test_index in kf.split(xx):
    print('train_index:%s,test_index:%s'%(train_index,test_index))
Output:
train_index:[ 1 2 3 4 5 7 8 9 10 11 12 13 15 16 18 19 20 21 22 24],test_index:[ 0 6 14 17 23]
train_index:[ 0 1 2 4 5 6 7 8 10 11 13 14 15 17 18 19 20 21 23 24],test_index:[ 3 9 12 16 22]
train_index:[ 0 2 3 6 7 8 9 11 12 13 14 15 16 17 18 20 21 22 23 24],test_index:[ 1 4 5 10 19]
train_index:[ 0 1 3 4 5 6 8 9 10 11 12 13 14 15 16 17 19 22 23 24],test_index:[ 2 7 18 20 21]
train_index:[ 0 1 2 3 4 5 6 7 9 10 12 14 16 17 18 19 20 21 22 23],test_index:[ 8 11 13 15 24]
Keep random_state=2 and run it once more:
kf=KFold(n_splits=5,shuffle=True,random_state=2)
for train_index,test_index in kf.split(xx):
    print('train_index:%s,test_index:%s'%(train_index,test_index))
Output:
train_index:[ 1 2 3 4 5 7 8 9 10 11 12 13 15 16 18 19 20 21 22 24],test_index:[ 0 6 14 17 23]
train_index:[ 0 1 2 4 5 6 7 8 10 11 13 14 15 17 18 19 20 21 23 24],test_index:[ 3 9 12 16 22]
train_index:[ 0 2 3 6 7 8 9 11 12 13 14 15 16 17 18 20 21 22 23 24],test_index:[ 1 4 5 10 19]
train_index:[ 0 1 3 4 5 6 8 9 10 11 12 13 14 15 16 17 19 22 23 24],test_index:[ 2 7 18 20 21]
train_index:[ 0 1 2 3 4 5 6 7 9 10 12 14 16 17 18 19 20 21 22 23],test_index:[ 8 11 13 15 24]
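See? The two runs with random_state=2 give the same split, the two runs with random_state=1 give the same split, and the split for random_state=1 differs from the one for random_state=2.
So the integer passed to random_state carries no numeric meaning. It is a seed for the pseudo-random number generator and acts only as a label: 1, 2, 42 and 160 are not ordered or scaled in any way, and each one simply identifies one particular shuffle.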
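The conclusion can be checked for several seeds at once. The sketch below is my own illustration (the helper folds_for_seed and the dictionary splits are hypothetical names): it confirms that each seed is repeatable, and that whatever the seed, the five test folds still cover all 25 samples exactly once. A larger seed does not mean "more" shuffling.

```python
import numpy as np
from sklearn.model_selection import KFold

xx = np.arange(25)

def folds_for_seed(seed):
    """Return the test folds (as tuples) produced by a given integer seed."""
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    return [tuple(test_index.tolist()) for _, test_index in kf.split(xx)]

seeds = [1, 2, 42, 160]
splits = {seed: folds_for_seed(seed) for seed in seeds}

for seed in seeds:
    # Rebuild the split from scratch and compare it with the stored one
    repeatable = folds_for_seed(seed) == splits[seed]
    # Every sample must land in exactly one test fold, whatever the seed
    covered = sorted(i for fold in splits[seed] for i in fold) == list(range(25))
    print(seed, repeatable, covered, splits[seed][0])
```

Each line prints the seed, whether the split was reproduced, whether the folds form a complete partition, and the first test fold for that seed.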
Appendix:
For readers who want to be thorough, the appendix runs random_state=42 and random_state=160 twice each:
First run with random_state=42:
kf=KFold(n_splits=5,shuffle=True,random_state=42)
for train_index,test_index in kf.split(xx):
    print('train_index:%s,test_index:%s'%(train_index,test_index))
Output:
train_index:[ 1 2 3 4 5 6 7 9 10 12 13 14 15 17 18 19 20 21 22 24],test_index:[ 0 8 11 16 23]
train_index:[ 0 2 3 4 6 7 8 10 11 12 14 15 16 17 18 19 20 21 23 24],test_index:[ 1 5 9 13 22]
train_index:[ 0 1 5 6 7 8 9 10 11 13 14 16 17 18 19 20 21 22 23 24],test_index:[ 2 3 4 12 15]
train_index:[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 19 22 23],test_index:[17 18 20 21 24]
train_index:[ 0 1 2 3 4 5 8 9 11 12 13 15 16 17 18 20 21 22 23 24],test_index:[ 6 7 10 14 19]
Second run with random_state=42:
kf=KFold(n_splits=5,shuffle=True,random_state=42)
for train_index,test_index in kf.split(xx):
    print('train_index:%s,test_index:%s'%(train_index,test_index))
Output:
train_index:[ 1 2 3 4 5 6 7 9 10 12 13 14 15 17 18 19 20 21 22 24],test_index:[ 0 8 11 16 23]
train_index:[ 0 2 3 4 6 7 8 10 11 12 14 15 16 17 18 19 20 21 23 24],test_index:[ 1 5 9 13 22]
train_index:[ 0 1 5 6 7 8 9 10 11 13 14 16 17 18 19 20 21 22 23 24],test_index:[ 2 3 4 12 15]
train_index:[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 19 22 23],test_index:[17 18 20 21 24]
train_index:[ 0 1 2 3 4 5 8 9 11 12 13 15 16 17 18 20 21 22 23 24],test_index:[ 6 7 10 14 19]
First run with random_state=160:
kf=KFold(n_splits=5,shuffle=True,random_state=160)
for train_index,test_index in kf.split(xx):
    print('train_index:%s,test_index:%s'%(train_index,test_index))
Output:
train_index:[ 1 2 3 4 5 7 8 9 10 11 12 14 15 16 17 19 20 22 23 24],test_index:[ 0 6 13 18 21]
train_index:[ 0 1 2 3 4 5 6 7 8 11 12 13 15 16 18 19 21 22 23 24],test_index:[ 9 10 14 17 20]
train_index:[ 0 1 2 6 8 9 10 12 13 14 15 16 17 18 19 20 21 22 23 24],test_index:[ 3 4 5 7 11]
train_index:[ 0 1 3 4 5 6 7 8 9 10 11 13 14 15 17 18 20 21 22 24],test_index:[ 2 12 16 19 23]
train_index:[ 0 2 3 4 5 6 7 9 10 11 12 13 14 16 17 18 19 20 21 23],test_index:[ 1 8 15 22 24]
Second run with random_state=160:
kf=KFold(n_splits=5,shuffle=True,random_state=160)
for train_index,test_index in kf.split(xx):
    print('train_index:%s,test_index:%s'%(train_index,test_index))
Output:
train_index:[ 1 2 3 4 5 7 8 9 10 11 12 14 15 16 17 19 20 22 23 24],test_index:[ 0 6 13 18 21]
train_index:[ 0 1 2 3 4 5 6 7 8 11 12 13 15 16 18 19 21 22 23 24],test_index:[ 9 10 14 17 20]
train_index:[ 0 1 2 6 8 9 10 12 13 14 15 16 17 18 19 20 21 22 23 24],test_index:[ 3 4 5 7 11]
train_index:[ 0 1 3 4 5 6 7 8 9 10 11 13 14 15 17 18 20 21 22 24],test_index:[ 2 12 16 19 23]
train_index:[ 0 2 3 4 5 6 7 9 10 11 12 13 14 16 17 18 19 20 21 23],test_index:[ 1 8 15 22 24]
As you can see, the same random_state value always gives the same split, and different values give different splits, whether the value is 1, 2, 42 or 160.
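One caveat worth adding: the parameter list at the top also allows a RandomState instance, and that case behaves a little differently from an integer. The sketch below is my own illustration (the helper folds is a made-up name). It assumes, in line with scikit-learn's notes on controlling randomness, that an integer seed restarts the generator on every call to split(), while a RandomState instance keeps advancing.

```python
import numpy as np
from sklearn.model_selection import KFold

xx = np.arange(25)

def folds(kf):
    """Run one full split and return the test indices of each fold."""
    return [test_index.tolist() for _, test_index in kf.split(xx)]

# Integer seed: every call to split() starts from the same generator state
kf_int = KFold(n_splits=5, shuffle=True, random_state=42)
int_first, int_second = folds(kf_int), folds(kf_int)

# RandomState instance: the generator's state advances between calls to split()
kf_rng = KFold(n_splits=5, shuffle=True, random_state=np.random.RandomState(42))
rng_first, rng_second = folds(kf_rng), folds(kf_rng)

print(int_first == int_second)  # same object, integer seed: splits repeat
print(rng_first == rng_second)  # same object, RandomState instance: splits differ
print(int_first == rng_first)   # a fresh RandomState(42) starts like the integer 42
```

So if the goal is the same split every time, an integer is the simpler choice; a RandomState instance is reproducible only across whole program runs, not across repeated calls on the same object.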