Hyperparameter Tuning

1. Grid search: the GridSearchCV class

class sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, fit_params=None, n_jobs=1,
iid=True, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score='raise', return_train_score='warn')
  1. estimator: the classifier (or, more generally, the estimator) to tune

  2. param_grid: the parameters to search over, in the format: param_grid = {'criterion': ['gini', 'entropy'],
              'max_depth': [2,3,4,5,6],
              'min_samples_split':[2,3,4,5,6],
              'min_samples_leaf':[2,3,4,5,6]
              }

  3. scoring: the model evaluation metric (see the list below)

  4. refit: defaults to True; after the search finishes, refit the estimator on the full dataset using the best parameters found

  5. cv: the cross-validation setting; defaults to 3-fold

  6. verbose: verbosity level (int). 0: no training output; 1: occasional output; >1: output for every sub-model

  7. pre_dispatch='2*n_jobs': the total number of parallel tasks to dispatch

 The scoring parameter: evaluation metrics are available for classification, clustering, and regression.

 

['accuracy', 'adjusted_mutual_info_score', 'adjusted_rand_score', 'average_precision', 'completeness_score',
 'explained_variance', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'fowlkes_mallows_score',
 'homogeneity_score', 'mutual_info_score', 'neg_log_loss', 'neg_mean_absolute_error', 'neg_mean_squared_error',
 'neg_mean_squared_log_error', 'neg_median_absolute_error', 'normalized_mutual_info_score', 'precision', 'precision_macro', 
'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', 
'recall_samples', 'recall_weighted', 'roc_auc', 'v_measure_score']

 For the meaning of each, see: http://sklearn.apachecn.org/cn/0.19.0/modules/model_evaluation.html#scoring-parameter
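To make the parameters above concrete, here is a minimal sketch. The DecisionTreeClassifier, the iris dataset, and the specific grid values are illustrative choices (the grid mirrors the param_grid example above):

```python
# Minimal grid search: tune a decision tree on the iris data,
# using the param_grid format shown above and accuracy scoring.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

param_grid = {'criterion': ['gini', 'entropy'],
              'max_depth': [2, 3, 4, 5, 6],
              'min_samples_split': [2, 3, 4, 5, 6],
              'min_samples_leaf': [2, 3, 4, 5, 6]}

grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid,
                    scoring='accuracy',  # any name from the scoring list above
                    cv=5,                # 5-fold cross-validation
                    refit=True)          # refit the best model on all the data
grid.fit(X, y)
print(grid.best_params_)  # best combination found in the grid
print(grid.best_score_)   # best mean cross-validated accuracy
```

Note that this grid has 2*5*5*5 = 250 candidates, each fit cv times; grid size multiplies quickly, which is why the grids in practice are kept small per parameter.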

2. Using grid search:

Parameters that we have to set by hand are called hyperparameters; the process of choosing them is called hyperparameter tuning.

# coding=utf-8
# Tuning an estimator's hyperparameters: the process of choosing them is called tuning

# 1. Grid search with GridSearchCV
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report 

from sklearn.model_selection import GridSearchCV    # import the grid search class
# Each datapoint is an 8x8 image of a digit.
digits = datasets.load_digits()
print(digits.data.shape)    # (1797, 64)
print(digits.data[:5,:])

# import matplotlib.pyplot as plt 
# plt.gray() 
# plt.matshow(digits.images[3]) 
# plt.show() 
n_samples = len(digits.images)
print(n_samples)    # 1797
X = digits.data
y = digits.target
print(y[:5])    # [0 1 2 3 4] the corresponding digits

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

param_grid = [{
    'kernel':['rbf'],
    'gamma':[1e-3, 1e-4],
    'C':[1, 10,100,1000]
    },{
        'kernel':['linear'],
        'C':[1, 10,100,1000]
        }]
scores = ['precision', 'recall']
for score in scores:
    print('score %s' % score)
    print('------')
    # use the macro-averaged version of the metric for multi-class scoring
    clf = GridSearchCV(SVC(), param_grid, cv=5, scoring='%s_macro' % score)
    clf.fit(X_train, y_train)
    print('best params is :')
    print(clf.best_params_)
    print('grid score')
    print()
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print('%.3f (+/-%0.03f) for %r' % (mean, std * 2, params))
    
    print('-----')
    print('classification report')
    y_true, y_pred = y_test, clf.predict(X_test)
    report = classification_report(y_true, y_pred)
    print(report)

Output:

score precision
------
best params is :
{'kernel': 'rbf', 'gamma': 0.001, 'C': 10}
grid score

0.986 (+/-0.016) for {'kernel': 'rbf', 'gamma': 0.001, 'C': 1}
0.959 (+/-0.029) for {'kernel': 'rbf', 'gamma': 0.0001, 'C': 1}
0.988 (+/-0.017) for {'kernel': 'rbf', 'gamma': 0.001, 'C': 10}
0.982 (+/-0.026) for {'kernel': 'rbf', 'gamma': 0.0001, 'C': 10}
0.988 (+/-0.017) for {'kernel': 'rbf', 'gamma': 0.001, 'C': 100}
0.982 (+/-0.025) for {'kernel': 'rbf', 'gamma': 0.0001, 'C': 100}
0.988 (+/-0.017) for {'kernel': 'rbf', 'gamma': 0.001, 'C': 1000}
0.982 (+/-0.025) for {'kernel': 'rbf', 'gamma': 0.0001, 'C': 1000}
0.975 (+/-0.014) for {'kernel': 'linear', 'C': 1}
0.975 (+/-0.014) for {'kernel': 'linear', 'C': 10}
0.975 (+/-0.014) for {'kernel': 'linear', 'C': 100}
0.975 (+/-0.014) for {'kernel': 'linear', 'C': 1000}
-----
classification report
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        89
          1       0.97      1.00      0.98        90
          2       0.99      0.98      0.98        92
          3       1.00      0.99      0.99        93
          4       1.00      1.00      1.00        76
          5       0.99      0.98      0.99       108
          6       0.99      1.00      0.99        89
          7       0.99      1.00      0.99        78
          8       1.00      0.98      0.99        92
          9       0.99      0.99      0.99        92

avg / total       0.99      0.99      0.99       899

score recall
------
best params is :
{'kernel': 'rbf', 'gamma': 0.001, 'C': 10}
grid score

0.986 (+/-0.019) for {'kernel': 'rbf', 'gamma': 0.001, 'C': 1}
0.957 (+/-0.029) for {'kernel': 'rbf', 'gamma': 0.0001, 'C': 1}
0.987 (+/-0.019) for {'kernel': 'rbf', 'gamma': 0.001, 'C': 10}
0.981 (+/-0.028) for {'kernel': 'rbf', 'gamma': 0.0001, 'C': 10}
0.987 (+/-0.019) for {'kernel': 'rbf', 'gamma': 0.001, 'C': 100}
0.981 (+/-0.026) for {'kernel': 'rbf', 'gamma': 0.0001, 'C': 100}
0.987 (+/-0.019) for {'kernel': 'rbf', 'gamma': 0.001, 'C': 1000}
0.981 (+/-0.026) for {'kernel': 'rbf', 'gamma': 0.0001, 'C': 1000}
0.972 (+/-0.012) for {'kernel': 'linear', 'C': 1}
0.972 (+/-0.012) for {'kernel': 'linear', 'C': 10}
0.972 (+/-0.012) for {'kernel': 'linear', 'C': 100}
0.972 (+/-0.012) for {'kernel': 'linear', 'C': 1000}
-----
classification report
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        89
          1       0.97      1.00      0.98        90
          2       0.99      0.98      0.98        92
          3       1.00      0.99      0.99        93
          4       1.00      1.00      1.00        76
          5       0.99      0.98      0.99       108
          6       0.99      1.00      0.99        89
          7       0.99      1.00      0.99        78
          8       1.00      0.98      0.99        92
          9       0.99      0.99      0.99        92

avg / total       0.99      0.99      0.99       899

Attributes that are useful in practice (clf denotes our GridSearchCV object):

clf.best_params_   returns the best parameter combination

clf.best_score_  returns the best mean cross-validated test score; it equals clf.cv_results_['mean_test_score'][clf.best_index_].

clf.best_index_  returns the index of the best-scoring candidate in the results

clf.best_estimator_  returns the best model (refit on the full training set when refit=True)

grid_scores_     deprecated since sklearn 0.18; use cv_results_ below instead

clf.cv_results_     returns the results of the cross-validated search. It is itself a dict with many entries; let's look at what clf.cv_results_.keys() contains for the run above:

dict_keys(
['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time', 
'param_C', 'param_gamma', 'param_kernel', 'params', 
'split0_test_score', 'split1_test_score', 'split2_test_score', 'split3_test_score', 'split4_test_score',
'mean_test_score', 'std_test_score', 'rank_test_score', 
'split0_train_score', 'split1_train_score', 'split2_train_score', 'split3_train_score', 'split4_train_score', 
'mean_train_score', 'std_train_score'] )

These keys fall into four groups: the first is timing, the second is the parameters, the third is the test scores (split into per-fold values and aggregated statistics), and the fourth is the train scores (likewise split into per-fold values and aggregated statistics).
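A small self-contained sketch of reading these attributes after a fit (the tiny SVC grid on iris is just for illustration; note that newer sklearn versions include the *_train_score keys only when return_train_score=True):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
clf = GridSearchCV(SVC(), {'C': [1, 10], 'kernel': ['linear', 'rbf']}, cv=5)
clf.fit(X, y)

print(clf.best_params_)                    # best parameter combination
print(clf.best_index_)                     # row index of the best candidate
# best_score_ is exactly the mean test score at best_index_
assert clf.best_score_ == clf.cv_results_['mean_test_score'][clf.best_index_]
print(clf.best_estimator_.predict(X[:3]))  # best model, refit on all the data
print(sorted(clf.cv_results_.keys()))      # timing, params, test-score entries
```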

 

References: https://www.cnblogs.com/jiaxin359/p/8641976.html#_labelTop

           http://sklearn.apachecn.org/cn/0.19.0/modules/grid_search.html#grid-search-tips
