class sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, fit_params=None, n_jobs=1,
iid=True, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score='raise', return_train_score='warn')
estimator : the estimator (classifier) to tune
param_grid : the parameters to search over, in the form:
param_grid = {'criterion': ['gini', 'entropy'],
              'max_depth': [2, 3, 4, 5, 6],
              'min_samples_split': [2, 3, 4, 5, 6],
              'min_samples_leaf': [2, 3, 4, 5, 6]}
scoring : the metric used to score the model (see the list below)
refit : defaults to True; after the parameter search finishes, refit the estimator on the whole dataset with the best parameters
cv : cross-validation setting, defaults to 3-fold
verbose : verbosity of the log output (int); 0 prints nothing about the training process, 1 prints occasionally, >1 prints for every sub-model
pre_dispatch='2*n_jobs' : the total number of parallel jobs to dispatch
The scoring parameter: evaluation metrics are available for classification, clustering, and regression. The accepted values are:
['accuracy', 'adjusted_mutual_info_score', 'adjusted_rand_score', 'average_precision', 'completeness_score',
'explained_variance', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'fowlkes_mallows_score',
'homogeneity_score', 'mutual_info_score', 'neg_log_loss', 'neg_mean_absolute_error', 'neg_mean_squared_error',
'neg_mean_squared_log_error', 'neg_median_absolute_error', 'normalized_mutual_info_score', 'precision', 'precision_macro',
'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro',
'recall_samples', 'recall_weighted', 'roc_auc', 'v_measure_score']
For the exact meaning of each metric, see: http://sklearn.apachecn.org/cn/0.19.0/modules/model_evaluation.html#scoring-parameter
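To make the parameter descriptions concrete, here is a minimal sketch that plugs the decision-tree param_grid above into GridSearchCV; the DecisionTreeClassifier, the iris dataset and scoring='accuracy' are illustrative assumptions, not part of the original example:

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# illustrative data: the iris dataset (an assumption for this sketch)
iris = datasets.load_iris()
X, y = iris.data, iris.target

param_grid = {'criterion': ['gini', 'entropy'],
              'max_depth': [2, 3, 4, 5, 6],
              'min_samples_split': [2, 3, 4, 5, 6],
              'min_samples_leaf': [2, 3, 4, 5, 6]}

dt_grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                       param_grid,
                       scoring='accuracy',  # any name from the list above
                       cv=5,                # 5-fold cross-validation
                       refit=True)          # refit on all the data with the best parameters
dt_grid.fit(X, y)
print(dt_grid.best_params_)
print(dt_grid.best_score_)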
Parameters that we have to set by hand are called hyperparameters, and the process of choosing them is called hyperparameter tuning.
# coding: utf-8
# Tuning an estimator's hyperparameters; choosing hyperparameters is called hyperparameter tuning
# 1. Grid search: GridSearchCV
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV  # import the grid search class
# Each data point is an 8x8 image of a digit.
digits = datasets.load_digits()
print(digits.data.shape) # (1797, 64)
print(digits.data[:5,:])
# import matplotlib.pyplot as plt
# plt.gray()
# plt.matshow(digits.images[3])
# plt.show()
n_samples = len(digits.images)
print(n_samples) # 1797
X = digits.data
y = digits.target
print(y[:5])  # [0 1 2 3 4], the corresponding digit labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
param_grid = [{
    'kernel': ['rbf'],
    'gamma': [1e-3, 1e-4],
    'C': [1, 10, 100, 1000]
}, {
    'kernel': ['linear'],
    'C': [1, 10, 100, 1000]
}]
scores = ['precision', 'recall']
for score in scores:
    print('score %s' % score)
    print('------')
    clf = GridSearchCV(SVC(), param_grid, cv=5, scoring='%s_macro' % score)
    clf.fit(X_train, y_train)
    print('best params is :')
    print(clf.best_params_)
    print('grid score')
    print()
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print('%.3f (+-/%0.03f) for %r' % (mean, std * 2, params))
    print('-----')
    print('classification report')
    y_true, y_pred = y_test, clf.predict(X_test)
    report = classification_report(y_true, y_pred)
    print(report)
Output:
score precision
------
best params is :
{'kernel': 'rbf', 'gamma': 0.001, 'C': 10}
grid score
0.986 (+-/0.016) for {'kernel': 'rbf', 'gamma': 0.001, 'C': 1}
0.959 (+-/0.029) for {'kernel': 'rbf', 'gamma': 0.0001, 'C': 1}
0.988 (+-/0.017) for {'kernel': 'rbf', 'gamma': 0.001, 'C': 10}
0.982 (+-/0.026) for {'kernel': 'rbf', 'gamma': 0.0001, 'C': 10}
0.988 (+-/0.017) for {'kernel': 'rbf', 'gamma': 0.001, 'C': 100}
0.982 (+-/0.025) for {'kernel': 'rbf', 'gamma': 0.0001, 'C': 100}
0.988 (+-/0.017) for {'kernel': 'rbf', 'gamma': 0.001, 'C': 1000}
0.982 (+-/0.025) for {'kernel': 'rbf', 'gamma': 0.0001, 'C': 1000}
0.975 (+-/0.014) for {'kernel': 'linear', 'C': 1}
0.975 (+-/0.014) for {'kernel': 'linear', 'C': 10}
0.975 (+-/0.014) for {'kernel': 'linear', 'C': 100}
0.975 (+-/0.014) for {'kernel': 'linear', 'C': 1000}
-----
classification report
precision recall f1-score support
0 1.00 1.00 1.00 89
1 0.97 1.00 0.98 90
2 0.99 0.98 0.98 92
3 1.00 0.99 0.99 93
4 1.00 1.00 1.00 76
5 0.99 0.98 0.99 108
6 0.99 1.00 0.99 89
7 0.99 1.00 0.99 78
8 1.00 0.98 0.99 92
9 0.99 0.99 0.99 92
avg / total 0.99 0.99 0.99 899
score recall
------
best params is :
{'kernel': 'rbf', 'gamma': 0.001, 'C': 10}
grid score
0.986 (+-/0.019) for {'kernel': 'rbf', 'gamma': 0.001, 'C': 1}
0.957 (+-/0.029) for {'kernel': 'rbf', 'gamma': 0.0001, 'C': 1}
0.987 (+-/0.019) for {'kernel': 'rbf', 'gamma': 0.001, 'C': 10}
0.981 (+-/0.028) for {'kernel': 'rbf', 'gamma': 0.0001, 'C': 10}
0.987 (+-/0.019) for {'kernel': 'rbf', 'gamma': 0.001, 'C': 100}
0.981 (+-/0.026) for {'kernel': 'rbf', 'gamma': 0.0001, 'C': 100}
0.987 (+-/0.019) for {'kernel': 'rbf', 'gamma': 0.001, 'C': 1000}
0.981 (+-/0.026) for {'kernel': 'rbf', 'gamma': 0.0001, 'C': 1000}
0.972 (+-/0.012) for {'kernel': 'linear', 'C': 1}
0.972 (+-/0.012) for {'kernel': 'linear', 'C': 10}
0.972 (+-/0.012) for {'kernel': 'linear', 'C': 100}
0.972 (+-/0.012) for {'kernel': 'linear', 'C': 1000}
-----
classification report
precision recall f1-score support
0 1.00 1.00 1.00 89
1 0.97 1.00 0.98 90
2 0.99 0.98 0.98 92
3 1.00 0.99 0.99 93
4 1.00 1.00 1.00 76
5 0.99 0.98 0.99 108
6 0.99 1.00 0.99 89
7 0.99 1.00 0.99 78
8 1.00 0.98 0.99 92
9 0.99 0.99 0.99 92
avg / total 0.99 0.99 0.99 899
Attributes that are useful in practice (clf denotes our GridSearchCV object):
clf.best_params_ returns the best parameter combination
clf.best_score_ returns the best mean test score; its value equals clf.cv_results_['mean_test_score'][clf.best_index_]
clf.best_index_ returns the index of the best-scoring candidate in the results list
clf.best_estimator_ returns the best model
grid_scores_ has been deprecated since sklearn 0.18; use cv_results_ below instead
clf.cv_results_ returns the results of the cross-validated search. It is itself a dict containing many entries; let's look at what clf.cv_results_.keys() contains:
dict_keys(
['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time',
'param_C', 'param_gamma', 'param_kernel', 'params',
'split0_test_score', 'split1_test_score', 'split2_test_score', 'split3_test_score', 'split4_test_score',
'mean_test_score', 'std_test_score', 'rank_test_score',
'split0_train_score', 'split1_train_score', 'split2_train_score', 'split3_train_score', 'split4_train_score',
'mean_train_score', 'std_train_score'] )
These keys fall into a few categories: first, the timings; second, the parameters; third, the test scores, split into per-fold values and their summary statistics; and fourth, the training scores, likewise split into per-fold values and summary statistics.
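Building on the fitted clf from the code above, here is a minimal sketch of reading these attributes; the pandas DataFrame view of cv_results_ is an illustrative convenience, not something the original code does:

import pandas as pd

print(clf.best_params_)     # best parameter combination
print(clf.best_score_)      # best mean cross-validated test score
print(clf.best_index_)      # index of the best candidate in cv_results_
print(clf.best_estimator_)  # the model refitted with the best parameters (requires refit=True)

# cv_results_ is a dict of equal-length arrays, so it loads directly into a DataFrame
results = pd.DataFrame(clf.cv_results_)
print(results[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']])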
References: https://www.cnblogs.com/jiaxin359/p/8641976.html#_labelTop
http://sklearn.apachecn.org/cn/0.19.0/modules/grid_search.html#grid-search-tips