sklearn: Introduction to the sklearn.GridSearchCV function and a detailed guide to its usage
Contents
Introduction to the sklearn.GridSearchCV function
1. Parameter description
2. Source code
How to use the sklearn.GridSearchCV function
"""Exhaustive search over specified parameter values for an estimator. Important members are fit, predict. """ GridSearchCV implements a "fit" and a "score" method. It also implements "predict", "predict_proba", "decision_function", "transform" and "inverse_transform" if they are implemented in the estimator used. The parameters of the estimator used to apply these methods are optimized by cross-validated grid-search over a parameter grid. Read more in the :ref:`User Guide |
Performs an exhaustive search over specified parameter values for an estimator; its most important members are fit and predict. GridSearchCV implements a "fit" and a "score" method, and it also implements "predict", "predict_proba", "decision_function", "transform" and "inverse_transform" if the underlying estimator implements them. The parameters of the estimator used to apply these methods are optimized by a cross-validated grid search over a parameter grid.
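To make this concrete, here is a minimal sketch (not from the original docstring): fit and score are called on the GridSearchCV object itself, while predict is delegated to the best estimator found. The iris dataset and the SVC grid are just illustrative choices.

from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV

# Toy data and a small grid of candidate settings (illustrative only)
iris = datasets.load_iris()
param_grid = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}

search = GridSearchCV(svm.SVC(), param_grid)
search.fit(iris.data, iris.target)           # runs the cross-validated grid search
print(search.predict(iris.data[:3]))         # delegated to the refitted best estimator
print(search.score(iris.data, iris.target))  # score of the best estimator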
Parameters
estimator : estimator object.
    This is assumed to implement the scikit-learn estimator interface. Either the estimator needs to provide a ``score`` function, or ``scoring`` must be passed.
param_grid : dict or list of dictionaries
    Dictionary with parameter names (string) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.
scoring : string, callable, list/tuple, dict or None, default: None
    A single string (see :ref:`scoring_parameter`) or a callable (see :ref:`scoring`) to evaluate the predictions on the test set. For evaluating multiple metrics, either give a list of (unique) strings or a dict with names as keys and callables as values. NOTE that when using custom scorers, each scorer should return a single value. Metric functions returning a list/array of values can be wrapped into multiple scorers that return one value each. See :ref:`multimetric_grid_search` for an example. If None, the estimator's default scorer (if available) is used.
fit_params : dict, optional
    Parameters to pass to the fit method.
    .. deprecated:: 0.19
       ``fit_params`` as a constructor argument was deprecated in version 0.19 and will be removed in version 0.21. Pass fit parameters to the ``fit`` method instead.
estimator : estimator object. It is assumed to implement the scikit-learn estimator interface. Either the estimator must provide a ``score`` function, or ``scoring`` must be passed.
param_grid : dict or list of dictionaries. A dictionary with parameter names (strings) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grid spanned by each dictionary in the list is explored. This allows searching over any sequence of parameter settings.
scoring : string, callable, list/tuple, dict or None, default: None. A single string (see :ref:`scoring_parameter`) or a callable (see :ref:`scoring`) used to evaluate the predictions on the test set. To evaluate multiple metrics, either give a list of (unique) strings or a dict with names as keys and callables as values. Note that when using custom scorers, each scorer should return a single value; metric functions that return a list/array of values can be wrapped into multiple scorers that each return one value. See :ref:`multimetric_grid_search` for an example. If None, the estimator's default scorer (if available) is used.
fit_params : dict, optional. Deprecated since version 0.19 and removed in version 0.21; pass fit parameters to the ``fit`` method instead.
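As a sketch of how param_grid can be either a single dict or a list of dicts, and how scoring accepts a metric name, consider the following; the particular kernels, C/gamma values and the 'accuracy' metric are assumptions chosen for illustration.

from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()

# A list of dicts: each dict spans its own sub-grid, so 'gamma' is only
# combined with the 'rbf' kernel
param_grid = [
    {'kernel': ['linear'], 'C': [1, 10, 100]},
    {'kernel': ['rbf'], 'C': [1, 10], 'gamma': [0.01, 0.1]},
]

# scoring given as a single metric name string
search = GridSearchCV(svm.SVC(), param_grid, scoring='accuracy')
search.fit(iris.data, iris.target)
print(search.best_params_)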
n_jobs : int, default=1
    Number of jobs to run in parallel.
pre_dispatch : int, or string, optional
    Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:
    - None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs
    - An int, giving the exact number of total jobs that are spawned
    - A string, giving an expression as a function of n_jobs, as in '2*n_jobs'
iid : boolean, default=True
    If True, the data is assumed to be identically distributed across the folds, and the loss minimized is the total loss per sample, and not the mean loss across the folds.
cv : int, cross-validation generator or an iterable, optional
    Determines the cross-validation splitting strategy. Possible inputs for cv are:
    - None, to use the default 3-fold cross validation,
    - integer, to specify the number of folds in a `(Stratified)KFold`,
    - An object to be used as a cross-validation generator.
    - An iterable yielding train, test splits.
    For integer/None inputs, if the estimator is a classifier and ``y`` is either binary or multiclass, :class:`StratifiedKFold` is used. In all other cases, :class:`KFold` is used. Refer to the :ref:`User Guide <cross_validation>` for the various cross-validation strategies that can be used here.
n_jobs : int, default=1. Number of jobs to run in parallel. pre_dispatch controls how many of those jobs are dispatched at once during parallel execution, which helps keep memory consumption under control. iid (default True) assumes the data is identically distributed across the folds, so the minimized loss is the total per-sample loss rather than the mean loss across folds. cv selects the cross-validation splitting strategy: None uses the default 3-fold cross-validation, an integer gives the number of folds of a (Stratified)KFold, and a cross-validation generator or an iterable of train/test splits can also be passed; see the sketch below.
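The sketch below illustrates the two most common forms of the cv argument; the 5-fold setting and the use of StratifiedKFold with shuffling are assumptions chosen for illustration.

from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV, StratifiedKFold

iris = datasets.load_iris()
param_grid = {'C': [1, 10]}

# cv as an integer: number of folds of a (Stratified)KFold
search_int = GridSearchCV(svm.SVC(), param_grid, cv=5)

# cv as an explicit cross-validation generator object
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search_gen = GridSearchCV(svm.SVC(), param_grid, cv=skf, n_jobs=1)

search_gen.fit(iris.data, iris.target)
print(search_gen.n_splits_)  # number of splits actually used, here 5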
refit : boolean, or string, default=True
    Refit an estimator using the best found parameters on the whole dataset. For multiple metric evaluation, this needs to be a string denoting the scorer used to find the best parameters for refitting the estimator at the end.
    Also for multiple metric evaluation, the attributes ``best_index_``, ``best_score_`` and ``best_params_`` will only be available if ``refit`` is set and all of them will be determined w.r.t. this specific scorer.
refit : boolean or string, default=True. Refits an estimator on the whole dataset using the best parameters found. For multiple-metric evaluation, the attributes ``best_index_``, ``best_score_`` and ``best_params_`` are only available if ``refit`` is set, and all of them are determined with respect to (w.r.t.) that specific scorer.
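A minimal sketch of refit together with multi-metric scoring; the metric names 'acc' and 'f1' and the underlying scorers are illustrative assumptions.

from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()

# With several metrics, refit must name the metric used to pick the winner;
# best_index_, best_score_ and best_params_ then refer to that scorer.
search = GridSearchCV(svm.SVC(), {'C': [1, 10]},
                      scoring={'acc': 'accuracy', 'f1': 'f1_macro'},
                      refit='f1')
search.fit(iris.data, iris.target)
print(search.best_params_)  # chosen according to the 'f1' scorer
print(search.best_score_)   # mean cross-validated 'f1' of that setting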
return_train_score : boolean, optional
    If ``False``, the ``cv_results_`` attribute will not include training scores. Current default is ``'warn'``, which behaves as ``True`` in addition to raising a warning when a training score is looked up. That default will be changed to ``False`` in 0.21. Computing training scores is used to get insights on how different parameter settings impact the overfitting/underfitting trade-off. However computing the scores on the training set can be computationally expensive and is not strictly required to select the parameters that yield the best generalization performance.
return_train_score : boolean, optional. If ``False``, the ``cv_results_`` attribute will not include training scores. The current default is ``'warn'``, which behaves like ``True`` but additionally raises a warning when a training score is looked up; the default will change to ``False`` in version 0.21. Computing training scores is useful for understanding how different parameter settings affect the overfitting/underfitting trade-off. However, computing scores on the training set can be computationally expensive and is not strictly required for selecting the parameters that yield the best generalization performance.
Attributes
----------
cv_results_ : dict of numpy (masked) ndarrays
    A dict with keys as column headers and values as columns, that can be imported into a pandas ``DataFrame``. For instance the below given table

    +--------------+-------------+--------------+-------------------+---+-----------------+
    | param_kernel | param_gamma | param_degree | split0_test_score |...| rank_test_score |
    +==============+=============+==============+===================+===+=================+
    | 'poly'       | --          | 2            | 0.8               |...| 2               |
    +--------------+-------------+--------------+-------------------+---+-----------------+
    | 'poly'       | --          | 3            | 0.7               |...| 4               |
    +--------------+-------------+--------------+-------------------+---+-----------------+
    | 'rbf'        | 0.1         | --           | 0.8               |...| 3               |
    +--------------+-------------+--------------+-------------------+---+-----------------+
    | 'rbf'        | 0.2         | --           | 0.9               |...| 1               |
    +--------------+-------------+--------------+-------------------+---+-----------------+

    will be represented by a ``cv_results_`` dict of::

        {
        'param_kernel': masked_array(data = ['poly', 'poly', 'rbf', 'rbf'], mask = [False False False False]...),
        'param_gamma': masked_array(data = [-- -- 0.1 0.2], mask = [ True True False False]...),
        'param_degree': masked_array(data = [2.0 3.0 -- --], mask = [False False True True]...),
        'split0_test_score'  : [0.8, 0.7, 0.8, 0.9],
        'split1_test_score'  : [0.82, 0.5, 0.7, 0.78],
        'mean_test_score'    : [0.81, 0.60, 0.75, 0.82],
        'std_test_score'     : [0.02, 0.01, 0.03, 0.03],
        'rank_test_score'    : [2, 4, 3, 1],
        'split0_train_score' : [0.8, 0.9, 0.7],
        'split1_train_score' : [0.82, 0.5, 0.7],
        'mean_train_score'   : [0.81, 0.7, 0.7],
        'std_train_score'    : [0.03, 0.03, 0.04],
        'mean_fit_time'      : [0.73, 0.63, 0.43, 0.49],
        'std_fit_time'       : [0.01, 0.02, 0.01, 0.01],
        'mean_score_time'    : [0.007, 0.06, 0.04, 0.04],
        'std_score_time'     : [0.001, 0.002, 0.003, 0.005],
        'params'             : [{'kernel': 'poly', 'degree': 2}, ...],
        }
Attributes: cv_results_ is a dict of numpy (masked) ndarrays, with keys as column headers and values as columns, which can be imported into a pandas ``DataFrame``. For example, the table given above is represented by the ``cv_results_`` dict shown.
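Since cv_results_ is column-oriented, it converts directly into a pandas DataFrame, as in this sketch; the column selection is just an illustrative assumption.

import pandas as pd
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
search = GridSearchCV(svm.SVC(), {'kernel': ('linear', 'rbf'), 'C': [1, 10]})
search.fit(iris.data, iris.target)

# Each key of cv_results_ becomes a column; each row is one parameter candidate
results = pd.DataFrame(search.cv_results_)
print(results[['param_kernel', 'param_C', 'mean_test_score', 'rank_test_score']])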
NOTE The key ``'params'`` is used to store a list of parameter settings dicts for all the parameter candidates. The ``mean_fit_time``, ``std_fit_time``, ``mean_score_time`` and ``std_score_time`` are all in seconds. For multi-metric evaluation, the scores for all the scorers are available in the ``cv_results_`` dict at the keys ending with that scorer's name (``'_<scorer_name>'``) instead of ``'_score'`` shown above.
Note: the key ``'params'`` stores a list of parameter-settings dicts for all the parameter candidates. ``mean_fit_time``, ``std_fit_time``, ``mean_score_time`` and ``std_score_time`` are all in seconds. For multi-metric evaluation, the scores of every scorer are available in the ``cv_results_`` dict under the keys ending with that scorer's name (``'_<scorer_name>'`` instead of the ``'_score'`` shown above).
best_estimator_ : estimator or dict
    Estimator that was chosen by the search, i.e. estimator which gave highest score (or smallest loss if specified) on the left out data. Not available if ``refit=False``. See ``refit`` parameter for more information on allowed values.
best_score_ : float
    Mean cross-validated score of the best_estimator. For multi-metric evaluation, this is present only if ``refit`` is specified.
best_params_ : dict
    Parameter setting that gave the best results on the hold out data. For multi-metric evaluation, this is present only if ``refit`` is specified.
best_index_ : int
    The index (of the ``cv_results_`` arrays) which corresponds to the best candidate parameter setting. The dict at ``search.cv_results_['params'][search.best_index_]`` gives the parameter setting for the best model, that gives the highest mean score (``search.best_score_``). For multi-metric evaluation, this is present only if ``refit`` is specified.
best_estimator_ : estimator or dict. The estimator chosen by the search, i.e. the estimator that gave the highest score (or smallest loss, if specified) on the left-out data. Not available if ``refit=False``. See the ``refit`` parameter for more information on allowed values.
best_score_ : float. Mean cross-validated score of the best estimator. For multi-metric evaluation, this is present only if ``refit`` is specified.
best_params_ : dict. The parameter setting that gave the best results on the hold-out data. For multi-metric evaluation, this is present only if ``refit`` is specified.
best_index_ : int. The index (into the ``cv_results_`` arrays) that corresponds to the best candidate parameter setting. The dict at ``search.cv_results_['params'][search.best_index_]`` gives the parameter setting of the best model, which has the highest mean score (``search.best_score_``). For multi-metric evaluation, this is present only if ``refit`` is specified.
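A short sketch (continuing the iris example used above) of how these attributes relate to each other after fitting:

from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
search = GridSearchCV(svm.SVC(), {'kernel': ('linear', 'rbf'), 'C': [1, 10]})
search.fit(iris.data, iris.target)

print(search.best_params_)     # parameter setting with the best mean score
print(search.best_score_)      # mean cross-validated score of that setting
print(search.best_estimator_)  # the refitted SVC (available because refit=True)

# best_index_ points back into cv_results_ at the winning candidate
assert search.cv_results_['params'][search.best_index_] == search.best_params_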
scorer_ : function or a dict
    Scorer function used on the held out data to choose the best parameters for the model. For multi-metric evaluation, this attribute holds the validated ``scoring`` dict which maps the scorer key to the scorer callable.
n_splits_ : int
    The number of cross-validation splits (folds/iterations).

Notes
-----
The parameters selected are those that maximize the score of the left out data, unless an explicit score is passed in which case it is used instead. If `n_jobs` was set to a value higher than one, the data is copied for each point in the grid (and not `n_jobs` times). This is done for efficiency reasons if individual jobs take very little time, but may raise errors if the dataset is large and not enough memory is available. A workaround in this case is to set `pre_dispatch`. Then, the memory is copied only `pre_dispatch` many times. A reasonable value for `pre_dispatch` is `2 * n_jobs`.

See Also
--------
:class:`ParameterGrid`: generates all the combinations of a hyperparameter grid.
:func:`sklearn.model_selection.train_test_split`: utility function to split the data into a development set usable for fitting a GridSearchCV instance and an evaluation set for its final evaluation.
:func:`sklearn.metrics.make_scorer`: Make a scorer from a performance metric or loss function.
scorer_ : function or dict. The scorer function applied to the held-out data to choose the best parameters for the model. For multi-metric evaluation, this attribute holds the validated ``scoring`` dict that maps each scorer key to the scorer callable.
n_splits_ : int. The number of cross-validation splits (folds/iterations).
Notes: the parameters selected are those that maximize the score on the left-out data, unless an explicit score is passed, in which case that is used instead. If `n_jobs` is set to a value higher than one, the data is copied for each point in the grid (not `n_jobs` times). This is done for efficiency when individual jobs take very little time, but it may raise errors if the dataset is large and not enough memory is available. A workaround in that case is to set `pre_dispatch`; the data is then copied only `pre_dispatch` many times. A reasonable value for `pre_dispatch` is `2 * n_jobs`.
See also: :class:`ParameterGrid`, which generates all the combinations of a hyperparameter grid; :func:`sklearn.model_selection.train_test_split`, a utility function that splits the data into a development set usable for fitting a GridSearchCV instance and an evaluation set for its final evaluation; and :func:`sklearn.metrics.make_scorer`, which builds a scorer from a performance metric or loss function. A combined sketch follows.
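The utilities listed under See Also combine with GridSearchCV roughly as in the following sketch; the f1_score metric and the 75/25 split are assumptions chosen for illustration.

from sklearn import svm, datasets
from sklearn.metrics import make_scorer, f1_score
from sklearn.model_selection import GridSearchCV, ParameterGrid, train_test_split

iris = datasets.load_iris()

# Hold out an evaluation set; the development set is what GridSearchCV fits on
X_dev, X_eval, y_dev, y_eval = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0)

param_grid = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
print(len(list(ParameterGrid(param_grid))))  # 4 candidate combinations

# Build a scorer from a metric function and use it for the search
f1_scorer = make_scorer(f1_score, average='macro')
search = GridSearchCV(svm.SVC(), param_grid, scoring=f1_scorer)
search.fit(X_dev, y_dev)

# Final evaluation of the refitted best estimator on the held-out set
print(search.score(X_eval, y_eval))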
class GridSearchCV, found at: sklearn.model_selection._search

class GridSearchCV(BaseSearchCV):
    """Exhaustive search over specified parameter values for an estimator.
    """
    def __init__(self, estimator, param_grid, scoring=None, fit_params=None,
                 n_jobs=1, iid=True, refit=True, cv=None, verbose=0,
                 pre_dispatch='2*n_jobs', error_score='raise',
                 return_train_score="warn"):
        super(GridSearchCV, self).__init__(
            estimator=estimator, scoring=scoring, fit_params=fit_params,
            n_jobs=n_jobs, iid=iid, refit=refit, cv=cv, verbose=verbose,
            pre_dispatch=pre_dispatch, error_score=error_score,
            return_train_score=return_train_score)
        self.param_grid = param_grid
        _check_param_grid(param_grid)

    def _get_param_iterator(self):
        """Return ParameterGrid instance for the given param_grid"""
        return ParameterGrid(self.param_grid)
Examples
--------
>>> from sklearn import svm, datasets
>>> from sklearn.model_selection import GridSearchCV
>>> iris = datasets.load_iris()
>>> parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
>>> svc = svm.SVC()
>>> clf = GridSearchCV(svc, parameters)
>>> clf.fit(iris.data, iris.target)
... # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
GridSearchCV(cv=None, error_score=...,
estimator=SVC(C=1.0, cache_size=..., class_weight=..., coef0=...,
decision_function_shape='ovr', degree=..., gamma=...,
kernel='rbf', max_iter=-1, probability=False,
random_state=None, shrinking=True, tol=...,
verbose=False),
fit_params=None, iid=..., n_jobs=1,
param_grid=..., pre_dispatch=..., refit=..., return_train_score=...,
scoring=..., verbose=...)
>>> sorted(clf.cv_results_.keys())
... # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
['mean_fit_time', 'mean_score_time', 'mean_test_score',...
'mean_train_score', 'param_C', 'param_kernel', 'params',...
'rank_test_score', 'split0_test_score',...
'split0_train_score', 'split1_test_score', 'split1_train_score',...
'split2_test_score', 'split2_train_score',...
'std_fit_time', 'std_score_time', 'std_test_score', 'std_train_score'...]
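Putting the pieces together, a complete usage sketch might look like the following; the grid values, the 5-fold cv and the 70/30 split are assumptions for illustration, and only constructor arguments that are stable across scikit-learn versions are used.

from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, train_test_split

# 1. Prepare the data
iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# 2. Define the estimator and the grid of candidate parameter values
param_grid = {'kernel': ('linear', 'rbf'), 'C': [1, 10, 100]}
search = GridSearchCV(svm.SVC(), param_grid, cv=5, n_jobs=1)

# 3. Run the exhaustive, cross-validated search on the training data
search.fit(X_train, y_train)
print('best params :', search.best_params_)
print('best cv score:', search.best_score_)

# 4. The refitted best estimator serves predictions and the final score
print('predictions  :', search.predict(X_test)[:5])
print('test accuracy:', search.score(X_test, y_test))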