关于sklearn.model_selection.GridSearchCV,为什么值得写一篇博客:
其实网格搜索现在用的并不多,但是作为基础知识我觉得还是有掌握的必要的。
这篇博客主要借鉴sklearn官网教程进行讲解:
官网教程链接
#执行程序前请pip install mglearn
import mglearn
import pylab as plt
mglearn.plots.plot_grid_search_overview()
plt.show()
结构图:
网格搜索完后有fit与predict方法,所以其实也可以不需要返回一个最优模型,再用这个最优模型做fit与predict。
参数:
画红圈的地方很重要,这个截图来自:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV
实现方式一:
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svc = svm.SVC(gamma="scale")
clf = GridSearchCV(svc, parameters, cv=5)
clf.fit(iris.data, iris.target)
关于网格搜索最优模型的保存:
from sklearn.externals import joblib
joblib.dump(grid_search.best_estimator_, '/Users/will/Desktop/sklearn_model/k2_svm.pickle')
model = joblib.load('/Users/will/Desktop/sklearn_modle/k2modle.pickle')
y_pred_svr1 = model.predict(X_test)
实现方式2:
from __future__ import print_function
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.svm import SVC
print(__doc__)
# Loading the Digits dataset
digits = datasets.load_digits()
# To apply an classifier on this data, we need to flatten the image, to
# turn the data in a (samples, feature) matrix:
n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))
y = digits.target
# Split the dataset in two equal parts
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.5, random_state=0)
# Set the parameters by cross-validation
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
'C': [1, 10, 100, 1000]},
{'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]
scores = ['precision', 'recall']
for score in scores:
print("# Tuning hyper-parameters for %s" % score)
print()
clf = GridSearchCV(SVC(), tuned_parameters, cv=5,
scoring='%s_macro' % score)
clf.fit(X_train, y_train)
print("Best parameters set found on development set:")
print()
print(clf.best_params_)
print()
print("Grid scores on development set:")
print()
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
print("%0.3f (+/-%0.03f) for %r"
% (mean, std * 2, params))
print()
print("Detailed classification report:")
print()
print("The model is trained on the full development set.")
print("The scores are computed on the full evaluation set.")
print()
y_true, y_pred = y_test, clf.predict(X_test)
print(classification_report(y_true, y_pred))
print()
# Note the problem is too easy: the hyperparameter plateau is too flat and the
# output model is the same for precision and recall with ties in quality.
输出:
# Tuning hyper-parameters for precision
Best parameters set found on development set:
{'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
Grid scores on development set:
0.986 (+/-0.016) for {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}
0.959 (+/-0.029) for {'C': 1, 'gamma': 0.0001, 'kernel': 'rbf'}
0.988 (+/-0.017) for {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
0.982 (+/-0.026) for {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}
0.988 (+/-0.017) for {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}
0.982 (+/-0.025) for {'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'}
0.988 (+/-0.017) for {'C': 1000, 'gamma': 0.001, 'kernel': 'rbf'}
0.982 (+/-0.025) for {'C': 1000, 'gamma': 0.0001, 'kernel': 'rbf'}
0.975 (+/-0.014) for {'C': 1, 'kernel': 'linear'}
0.975 (+/-0.014) for {'C': 10, 'kernel': 'linear'}
0.975 (+/-0.014) for {'C': 100, 'kernel': 'linear'}
0.975 (+/-0.014) for {'C': 1000, 'kernel': 'linear'}
Detailed classification report:
The model is trained on the full development set.
The scores are computed on the full evaluation set.
precision recall f1-score support
0 1.00 1.00 1.00 89
1 0.97 1.00 0.98 90
2 0.99 0.98 0.98 92
3 1.00 0.99 0.99 93
4 1.00 1.00 1.00 76
5 0.99 0.98 0.99 108
6 0.99 1.00 0.99 89
7 0.99 1.00 0.99 78
8 1.00 0.98 0.99 92
9 0.99 0.99 0.99 92
micro avg 0.99 0.99 0.99 899
macro avg 0.99 0.99 0.99 899
weighted avg 0.99 0.99 0.99 899
# Tuning hyper-parameters for recall
Best parameters set found on development set:
{'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
Grid scores on development set:
0.986 (+/-0.019) for {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}
0.957 (+/-0.029) for {'C': 1, 'gamma': 0.0001, 'kernel': 'rbf'}
0.987 (+/-0.019) for {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
0.981 (+/-0.028) for {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}
0.987 (+/-0.019) for {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}
0.981 (+/-0.026) for {'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'}
0.987 (+/-0.019) for {'C': 1000, 'gamma': 0.001, 'kernel': 'rbf'}
0.981 (+/-0.026) for {'C': 1000, 'gamma': 0.0001, 'kernel': 'rbf'}
0.972 (+/-0.012) for {'C': 1, 'kernel': 'linear'}
0.972 (+/-0.012) for {'C': 10, 'kernel': 'linear'}
0.972 (+/-0.012) for {'C': 100, 'kernel': 'linear'}
0.972 (+/-0.012) for {'C': 1000, 'kernel': 'linear'}
Detailed classification report:
The model is trained on the full development set.
The scores are computed on the full evaluation set.
precision recall f1-score support
0 1.00 1.00 1.00 89
1 0.97 1.00 0.98 90
2 0.99 0.98 0.98 92
3 1.00 0.99 0.99 93
4 1.00 1.00 1.00 76
5 0.99 0.98 0.99 108
6 0.99 1.00 0.99 89
7 0.99 1.00 0.99 78
8 1.00 0.98 0.99 92
9 0.99 0.99 0.99 92
micro avg 0.99 0.99 0.99 899
macro avg 0.99 0.99 0.99 899
weighted avg 0.99 0.99 0.99 899
上述分析1:
关于上面代码的:
clf = GridSearchCV(SVC(), tuned_parameters, cv=5,
scoring='%s_macro' % score)
中的scoring字符串的意思:
因为这是多分类问题,所以需要在平常见的评价标准’precision’, 'recall’后加_macro。
上述分析2:
关于digits数据集:
下面截图来源以及关于digits数据集请参考:
https://blog.csdn.net/Asun0204/article/details/75607948
print (digits.keys())
['images', 'data', 'target_names', 'DESCR', 'target']