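1 Cross-validation
1) Common model evaluation methods include the hold-out method, cross-validation, and the bootstrap; this section covers cross-validation.
2) k-fold cross-validation: partition the dataset D into k mutually exclusive subsets of similar size, each preserving the overall data distribution as far as possible (for example through stratified sampling). In each round, the union of k-1 subsets is used as the training set and the remaining subset as the test set. This yields k test results, and their mean is reported as the final estimate.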
[Figure: 图片1.jpg]
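The procedure described above can be sketched by hand as follows. This is only an illustrative sketch: the helper name kfold_scores, the choice of KNeighborsClassifier as the model, and the values k=5, n_neighbors=5, and seed=0 are assumptions made for the example, not part of the original text.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier


def kfold_scores(x, y, k=5, n_neighbors=5, seed=0):
    """Return the k per-fold accuracies and their mean (hypothetical helper)."""
    # StratifiedKFold keeps the class proportions roughly equal in every fold
    folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in folds.split(x, y):
        model = KNeighborsClassifier(n_neighbors=n_neighbors)
        # Train on the union of the other k-1 folds
        model.fit(x[train_idx], y[train_idx])
        # Test on the single held-out fold
        scores.append(model.score(x[test_idx], y[test_idx]))
    scores = np.array(scores)
    return scores, float(scores.mean())


iris = load_iris()
scores, mean_score = kfold_scores(iris.data, iris.target, k=5)
print('per-fold accuracy:', scores)
print('mean accuracy:', mean_score)
```

Each sample lands in the test fold exactly once, so the mean uses every sample for testing without ever testing a model on data it was trained on.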
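2 Grid search
Also called hyperparameter search, it is used for tuning: every candidate combination of hyperparameter values is evaluated with cross-validation, and the combination with the best cross-validation score is selected.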
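Conceptually, grid search is a loop over candidate values with a cross-validation run inside it. The sketch below illustrates that idea for a single hyperparameter; the helper name manual_grid_search, the candidate list, and cv=5 are assumptions chosen for the example, and this is not how scikit-learn implements its own search internally.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def manual_grid_search(x, y, candidates, cv=5):
    """Score each candidate n_neighbors by cross-validation (hypothetical helper)."""
    results = {}
    for n in candidates:
        model = KNeighborsClassifier(n_neighbors=n)
        # cross_val_score returns one accuracy per fold; keep the mean
        results[n] = cross_val_score(model, x, y, cv=cv).mean()
    # The best candidate is the one with the highest mean score
    best = max(results, key=results.get)
    return best, results


iris = load_iris()
best_n, mean_scores = manual_grid_search(iris.data, iris.target, [3, 5, 10])
print('mean score per candidate:', mean_scores)
print('best n_neighbors:', best_n)
```

With several hyperparameters the loop would run over every combination of values, which is why the cost grows quickly with the size of the grid.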
3 Case study: iris
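from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV


def knncls():
    # Load the iris data and hold out 25% of it as a test set
    iris = load_iris()
    x = iris.data
    y = iris.target
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)
    # Candidate hyperparameter values to search, given as a dict
    param = {'n_neighbors': [3, 5, 10]}
    # Grid search: each candidate is scored with 2-fold cross-validation
    knn = KNeighborsClassifier()
    # return_train_score=True is needed in recent scikit-learn versions
    # for cv_results_ to include the *_train_score entries
    gc = GridSearchCV(knn, param_grid=param, cv=2, return_train_score=True)
    gc.fit(x_train, y_train)
    # Report accuracy
    print('Accuracy on the test set:', gc.score(x_test, y_test))
    print('Best cross-validation score', gc.best_score_)
    print('Best model:', gc.best_estimator_)
    print('Cross-validation results for each hyperparameter value:', gc.cv_results_)


if __name__ == '__main__':
    knncls()
Output (the train/test split is random, so the numbers vary from run to run):
Accuracy on the test set: 0.9210526315789473
Best cross-validation score 0.9553571428571429
Best model: KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=10, p=2,
weights='uniform')
Cross-validation results for each hyperparameter value: {'mean_fit_time': array([0.07811606, 0. , 0. ]),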
'std_fit_time': array([0.07811606, 0. , 0. ]),
'mean_score_time': array([0.09376121, 0. , 0. ]),
'std_score_time': array([0.09376121, 0. , 0. ]),
'param_n_neighbors': masked_array(data=[3, 5, 10],mask=[False, False, False],fill_value='?',dtype=object),
'params': [{'n_neighbors': 3}, {'n_neighbors': 5}, {'n_neighbors': 10}],
'split0_test_score': array([0.96428571, 0.94642857, 0.94642857]),
'split1_test_score': array([0.91071429, 0.94642857, 0.96428571]),
'mean_test_score': array([0.9375 , 0.94642857, 0.95535714]),
'std_test_score': array([0.02678571, 0. , 0.00892857]),
'rank_test_score': array([3, 2, 1]),
'split0_train_score': array([0.92857143, 0.92857143, 0.92857143]),
'split1_train_score': array([1. , 1. , 0.94642857]),
'mean_train_score': array([0.96428571, 0.96428571, 0.9375 ]),
'std_train_score': array([0.03571429, 0.03571429, 0.00892857])}
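In these arrays, each column corresponds to one candidate value of n_neighbors (written k below; this is not the k of k-fold), and each split row is one of the two cross-validation folds (cv=2):

                      k=3          k=5          k=10
split0_test_score     0.96428571   0.94642857   0.94642857   (fold 1)
split1_test_score     0.91071429   0.94642857   0.96428571   (fold 2)
mean_test_score       0.9375       0.94642857   0.95535714   (mean of the two folds)

k=10 has the highest mean test score, so k=10 is selected.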
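The selection step can be checked by hand. The short sketch below copies the two split score arrays from the output above, averages them, and picks the candidate with the highest mean; the variable names are illustrative only.

```python
import numpy as np

# Candidate n_neighbors values, in the same order as the score arrays
candidates = [3, 5, 10]
# Per-fold test scores copied from the cv_results_ output above
split0 = np.array([0.96428571, 0.94642857, 0.94642857])
split1 = np.array([0.91071429, 0.94642857, 0.96428571])

# Mean over the two folds, one value per candidate
mean_test_score = (split0 + split1) / 2
# The candidate with the highest mean score is the one selected
best_n_neighbors = candidates[int(np.argmax(mean_test_score))]

print('mean_test_score:', mean_test_score)
print('best n_neighbors:', best_n_neighbors)  # prints 10
```

After fitting, GridSearchCV exposes the same choice through its best_params_ attribute.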