scikit-learn documentation (Chinese translation)
http://sklearn.apachecn.org/cn/stable/modules/model_evaluation.html
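K-fold cross-validation: sklearn.model_selection.KFold(n_splits=3, shuffle=False, random_state=None)
Idea: split the dataset into n_splits mutually exclusive subsets (folds). Each round uses one fold as the validation set and the remaining n_splits-1 folds as the training set, so training and testing run n_splits times and produce n_splits results.
Note: when the dataset cannot be divided evenly, the first n_samples % n_splits folds contain n_samples // n_splits + 1 samples and the remaining folds contain n_samples // n_splits samples.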
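The fold-size rule above can be checked with a few lines of Python. This is a minimal sketch; expected_fold_sizes is a hypothetical helper written for illustration and is not part of scikit-learn.

```python
import numpy as np
from sklearn.model_selection import KFold


def expected_fold_sizes(n_samples, n_splits):
    """Fold sizes implied by the rule above (hypothetical helper)."""
    sizes = np.full(n_splits, n_samples // n_splits, dtype=int)
    # the first n_samples % n_splits folds receive one extra sample
    sizes[: n_samples % n_splits] += 1
    return sizes


X = np.arange(24).reshape(12, 2)   # 12 samples, 2 features
kf = KFold(n_splits=5)             # shuffle defaults to False
actual_sizes = [len(test_index) for _, test_index in kf.split(X)]

print(expected_fold_sizes(12, 5))  # [3 3 2 2 2]
print(actual_sizes)                # [3, 3, 2, 2, 2]
```

With 12 samples and 5 folds, 12 % 5 = 2, so the first two folds hold 3 samples and the other three hold 2.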
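Parameters:
n_splits: the number of folds to split the data into (the default of 3 shown above comes from an older release; since scikit-learn 0.22 the default is 5)
shuffle: whether to shuffle the samples before splitting them into folds
① If False, the samples are split in their original order into contiguous folds, so every run gives the same split and random_state plays no role
② If True, the samples are shuffled first, so each run gives a different split unless random_state is fixed
random_state: the random seed; it only takes effect when shuffle=True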
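Methods:
① get_n_splits(X=None, y=None, groups=None): returns the number of splitting iterations, i.e. the value of n_splits
② split(X, y=None, groups=None): splits the dataset into training and test sets and returns a generator that yields (train_index, test_index) pairs of index arrays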
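As a rough illustration of how the two methods are typically used together, the sketch below slices X and y with the index arrays produced by split. The label array y is made up for the example.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(24).reshape(12, 2)
y = np.array([1, 2] * 6)           # illustrative labels, assumed for this sketch

kf = KFold(n_splits=5)
print(kf.get_n_splits())           # 5

seen_test = []
for train_index, test_index in kf.split(X):
    # the generator yields index arrays; use them to slice the data
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # within one split, training and test indices never overlap
    assert np.intersect1d(train_index, test_index).size == 0
    seen_test.extend(test_index.tolist())

# across all splits, every sample is used for testing exactly once
print(sorted(seen_test) == list(range(12)))   # True
```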
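The example below uses a dataset that cannot be divided evenly and varies the parameters to observe the results.
① With shuffle=False, two runs give identical results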
In [1]: from sklearn.model_selection import KFold
...: import numpy as np
...: X = np.arange(24).reshape(12,2)
...: y = np.random.choice([1,2],12,p=[0.4,0.6])
...: kf = KFold(n_splits=5,shuffle=False)
...: for train_index , test_index in kf.split(X):
...:     print('train_index:%s , test_index: %s ' % (train_index, test_index))
...:
...:
train_index:[ 3 4 5 6 7 8 9 10 11] , test_index: [0 1 2]
train_index:[ 0 1 2 6 7 8 9 10 11] , test_index: [3 4 5]
train_index:[ 0 1 2 3 4 5 8 9 10 11] , test_index: [6 7]
train_index:[ 0 1 2 3 4 5 6 7 10 11] , test_index: [8 9]
train_index:[0 1 2 3 4 5 6 7 8 9] , test_index: [10 11]
In [2]: from sklearn.model_selection import KFold
...: import numpy as np
...: X = np.arange(24).reshape(12,2)
...: y = np.random.choice([1,2],12,p=[0.4,0.6])
...: kf = KFold(n_splits=5,shuffle=False)
...: for train_index , test_index in kf.split(X):
...:     print('train_index:%s , test_index: %s ' % (train_index, test_index))
...:
...:
train_index:[ 3 4 5 6 7 8 9 10 11] , test_index: [0 1 2]
train_index:[ 0 1 2 6 7 8 9 10 11] , test_index: [3 4 5]
train_index:[ 0 1 2 3 4 5 8 9 10 11] , test_index: [6 7]
train_index:[ 0 1 2 3 4 5 6 7 10 11] , test_index: [8 9]
train_index:[0 1 2 3 4 5 6 7 8 9] , test_index: [10 11]
② With shuffle=True, two runs give different results
In [3]: from sklearn.model_selection import KFold
...: import numpy as np
...: X = np.arange(24).reshape(12,2)
...: y = np.random.choice([1,2],12,p=[0.4,0.6])
...: kf = KFold(n_splits=5,shuffle=True)
...: for train_index , test_index in kf.split(X):
...:     print('train_index:%s , test_index: %s ' % (train_index, test_index))
...:
...:
train_index:[ 0 1 2 4 5 6 7 8 10] , test_index: [ 3 9 11]
train_index:[ 0 1 2 3 4 5 9 10 11] , test_index: [6 7 8]
train_index:[ 2 3 4 5 6 7 8 9 10 11] , test_index: [0 1]
train_index:[ 0 1 3 4 5 6 7 8 9 11] , test_index: [ 2 10]
train_index:[ 0 1 2 3 6 7 8 9 10 11] , test_index: [4 5]
In [4]: from sklearn.model_selection import KFold
...: import numpy as np
...: X = np.arange(24).reshape(12,2)
...: y = np.random.choice([1,2],12,p=[0.4,0.6])
...: kf = KFold(n_splits=5,shuffle=True)
...: for train_index , test_index in kf.split(X):
...:     print('train_index:%s , test_index: %s ' % (train_index, test_index))
...:
...:
train_index:[ 0 1 2 3 4 5 7 8 11] , test_index: [ 6 9 10]
train_index:[ 2 3 4 5 6 8 9 10 11] , test_index: [0 1 7]
train_index:[ 0 1 3 5 6 7 8 9 10 11] , test_index: [2 4]
train_index:[ 0 1 2 3 4 6 7 9 10 11] , test_index: [5 8]
train_index:[ 0 1 2 4 5 6 7 8 9 10] , test_index: [ 3 11]
③ With shuffle=True and random_state set to an integer, every run gives the same result
In [5]: from sklearn.model_selection import KFold
...: import numpy as np
...: X = np.arange(24).reshape(12,2)
...: y = np.random.choice([1,2],12,p=[0.4,0.6])
...: kf = KFold(n_splits=5,shuffle=True,random_state=0)
...: for train_index , test_index in kf.split(X):
...:     print('train_index:%s , test_index: %s ' % (train_index, test_index))
...:
...:
train_index:[ 0 1 2 3 5 7 8 9 10] , test_index: [ 4 6 11]
train_index:[ 0 1 3 4 5 6 7 9 11] , test_index: [ 2 8 10]
train_index:[ 0 2 3 4 5 6 8 9 10 11] , test_index: [1 7]
train_index:[ 0 1 2 4 5 6 7 8 10 11] , test_index: [3 9]
train_index:[ 1 2 3 4 6 7 8 9 10 11] , test_index: [0 5]
In [6]: from sklearn.model_selection import KFold
...: import numpy as np
...: X = np.arange(24).reshape(12,2)
...: y = np.random.choice([1,2],12,p=[0.4,0.6])
...: kf = KFold(n_splits=5,shuffle=True,random_state=0)
...: for train_index , test_index in kf.split(X):
...:     print('train_index:%s , test_index: %s ' % (train_index, test_index))
...:
...:
train_index:[ 0 1 2 3 5 7 8 9 10] , test_index: [ 4 6 11]
train_index:[ 0 1 3 4 5 6 7 9 11] , test_index: [ 2 8 10]
train_index:[ 0 2 3 4 5 6 8 9 10 11] , test_index: [1 7]
train_index:[ 0 1 2 4 5 6 7 8 10 11] , test_index: [3 9]
train_index:[ 1 2 3 4 6 7 8 9 10 11] , test_index: [0 5]
④ Ways to read the value of n_splits (kf.split(X) returns a generator, so Out[8] shows no indices)
In [8]: kf.split(X)
Out[8]:
In [9]: kf.get_n_splits()
Out[9]: 5
In [10]: kf.n_splits
Out[10]: 5
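Source: https://blog.csdn.net/Dream_angel_Z/article/details/47048285
The call has the form scores = cross_validation.cross_val_score(clf, raw_data, raw_target, cv=5, score_func=None). This is the old API: the sklearn.cross_validation module was removed in scikit-learn 0.20, the function now lives in sklearn.model_selection, and score_func has been replaced by the scoring parameter.
clf: the estimator, which can be any classifier, for example a support vector machine: clf = svm.SVC(kernel='linear', C=1);
raw_data: the raw data;
raw_target: the raw class labels;
cv: the cross-validation strategy. Quoting scikit-learn: "When the cv argument is an integer, cross_val_score uses the KFold or StratifiedKFold strategies by default, the latter being used if the estimator derives from ClassifierMixin." In other words, an integer cv selects KFold or StratifiedKFold, and StratifiedKFold is the one used when the estimator is a classifier.
cross_val_score: the return value is an array with one score per split of raw_data, each computed on that split's test portion. The metric can be chosen through the scoring parameter (score_func in the old API); if it is not given, the estimator's own score method is used, which is accuracy for classifiers.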
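>>> from sklearn import datasets, svm
>>> from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in 0.20
>>> iris = datasets.load_iris()
>>> clf = svm.SVC(kernel='linear', C=1)
>>> scores = cross_val_score(
...     clf, iris.data, iris.target, cv=5)
...
>>> scores
array([ 0.96..., 1. ..., 0.96..., 0.96..., 1. ])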
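To make the cv rule concrete, the sketch below compares cv=5 with an explicit StratifiedKFold(n_splits=5) for a classifier, then recomputes the same scores fold by fold. It assumes a current scikit-learn release where cross_val_score is imported from sklearn.model_selection; the variable names are chosen for illustration.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import StratifiedKFold, cross_val_score

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)

# integer cv with a classifier: stratified folds are used
scores_int = cross_val_score(clf, iris.data, iris.target, cv=5)
# the same strategy, spelled out explicitly
scores_skf = cross_val_score(clf, iris.data, iris.target,
                             cv=StratifiedKFold(n_splits=5))
print(np.allclose(scores_int, scores_skf))

# the same scores computed by hand, one fold at a time
manual = []
skf = StratifiedKFold(n_splits=5)
for train_index, test_index in skf.split(iris.data, iris.target):
    clf.fit(iris.data[train_index], iris.target[train_index])
    manual.append(clf.score(iris.data[test_index], iris.target[test_index]))
print(np.allclose(scores_int, manual))
```

Stratified folds keep the class proportions of iris.target roughly equal in every fold, which is why they are the default for classifiers.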
Author: 拾毅者 (CSDN)
Copyright notice: this is the blogger's original article; please include a link to it when reposting.
scores.mean() gives the average of the per-fold scores, i.e. the mean accuracy. A different metric can be selected with the scoring parameter.
>>> from sklearn import metrics
>>> from sklearn.model_selection import cross_val_score
>>> scores = cross_val_score(clf, iris.data, iris.target,
...     cv=5, scoring='f1_weighted')
>>> scores
array([ 0.96..., 1. ..., 0.96..., 0.96..., 1. ])
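The mean and the scoring option can be combined in a short loop. This is a sketch under the assumption of a current scikit-learn release; the summary dictionary is introduced here for illustration only.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)

summary = {}
for scoring in ('accuracy', 'f1_weighted'):
    scores = cross_val_score(clf, iris.data, iris.target, cv=5, scoring=scoring)
    # keep the mean and the standard deviation of the per-fold scores
    summary[scoring] = (scores.mean(), scores.std())
    print('%s: %.2f (std %.2f)' % (scoring, scores.mean(), scores.std()))
```

Reporting the standard deviation next to the mean shows how much the score varies from fold to fold.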