sklearn: GridSearchCV, an introduction and detailed usage guide

 

 

Contents

Introduction to GridSearchCV

1. Parameter description

2. Source code

Using GridSearchCV


 

 

Introduction to GridSearchCV

1. Parameter description

  """Exhaustive search over specified parameter values for an estimator.

    Important members are fit, predict.

    GridSearchCV implements a "fit" and a "score" method.
    It also implements "predict", "predict_proba", "decision_function", "transform" and "inverse_transform" if they are implemented in the estimator used.

    The parameters of the estimator used to apply these methods are optimized by cross-validated grid-search over a parameter grid.

    Read more in the :ref:`User Guide <grid_search>`.
    """
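In other words, GridSearchCV exhaustively tries every combination of the specified parameter values for an estimator. Its key members are fit and predict.

GridSearchCV always provides a "fit" and a "score" method.

It also exposes "predict", "predict_proba", "decision_function", "transform" and "inverse_transform" whenever the wrapped estimator implements them.

The estimator parameters used by these methods are the ones selected by a cross-validated search over the parameter grid.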

 

  Parameters
    ----------
    estimator : estimator object.
    This is assumed to implement the scikit-learn estimator interface.
    Either estimator needs to provide a ``score`` function, or ``scoring`` must be passed.
    
    param_grid : dict or list of dictionaries
    Dictionary with parameters names (string) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.

 

    scoring : string, callable, list/tuple, dict or None, default: None
    A single string (see :ref:`scoring_parameter`) or a callable (see :ref:`scoring`) to evaluate the predictions on the test set.

    For evaluating multiple metrics, either give a list of (unique) strings or a dict with names as keys and callables as values.

    NOTE that when using custom scorers, each scorer should return a single value. Metric functions returning a list/array of values can be wrapped into multiple scorers that return one value each. See :ref:`multimetric_grid_search` for an example.

    If None, the estimator's default scorer (if available) is used.

    fit_params : dict, optional
    Parameters to pass to the fit method.

    .. deprecated:: 0.19
       ``fit_params`` as a constructor argument was deprecated in version 0.19 and will be removed in version 0.21. Pass fit parameters to the ``fit`` method instead.

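Parameters, in plain terms

estimator: an estimator object that implements the scikit-learn estimator interface. Either the estimator provides a ``score`` function, or ``scoring`` must be passed.

param_grid: a dict, or a list of dicts. Each dict maps a parameter name (a string) to the list of values to try. With a list of dicts, the grid spanned by each dict is explored in turn, which makes it possible to search over any sequence of parameter settings.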
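
To make the "list of dicts" form concrete, here is a minimal sketch that expands such a grid with `ParameterGrid`, the helper GridSearchCV uses internally. The kernel, C and gamma values are illustrative assumptions, not values from the docstring.

```python
from sklearn.model_selection import ParameterGrid

# A list of dicts: each dict spans its own grid, and the grids are explored in turn.
param_grid = [
    {"kernel": ["linear"], "C": [1, 10]},
    {"kernel": ["rbf"], "C": [1, 10], "gamma": [0.1, 0.2]},
]

# Expand the grids into the individual candidate settings.
candidates = list(ParameterGrid(param_grid))
for params in candidates:
    print(params)

# 2 linear candidates plus 4 rbf candidates.
print(len(candidates))
```

Splitting the grid this way avoids trying `gamma` with the linear kernel, where it has no effect.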

 

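scoring: string, callable, list/tuple, dict or None (default None). A single string (see :ref:`scoring_parameter`) or a callable (see :ref:`scoring`) used to evaluate the predictions on the test set.

To evaluate several metrics, pass either a list of unique strings or a dict with names as keys and callables as values.

Note that each custom scorer should return a single value. A metric function that returns a list or array of values can be wrapped into several scorers that return one value each; see :ref:`multimetric_grid_search` for an example. If scoring is None, the estimator's default scorer is used when available.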
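
The multi-metric form can be sketched as follows. The iris data, the SVC estimator and the metric names "acc" and "f1" are assumptions chosen for illustration. With several metrics, `refit` has to name the one that decides the winner.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Two metrics, identified by the dict keys; each scorer returns a single value.
scoring = {"acc": "accuracy", "f1": "f1_macro"}

search = GridSearchCV(
    SVC(),
    param_grid={"C": [1, 10]},
    scoring=scoring,
    refit="acc",  # with several metrics, refit names the metric used to pick the best candidate
    cv=3,
)
search.fit(X, y)

# Result keys end with the scorer name instead of "score".
print(sorted(k for k in search.cv_results_ if k.startswith("mean_test_")))
print(search.best_params_)
```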

 

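fit_params: dict, optional. Parameters passed to the fit method. As a constructor argument, ``fit_params`` was deprecated in version 0.19 and is scheduled for removal in version 0.21; pass fit parameters to the ``fit`` method instead.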
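
As the deprecation note says, fit parameters belong in the call to ``fit``. The sketch below passes a hypothetical ``sample_weight`` array that way; the weighting scheme is invented for illustration, and the data and estimator are assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hypothetical per-sample weights: class 2 counts double (illustration only).
sample_weight = np.where(y == 2, 2.0, 1.0)

search = GridSearchCV(SVC(), param_grid={"C": [1, 10]}, cv=3)

# Fit parameters are given to fit(), not to the GridSearchCV constructor.
search.fit(X, y, sample_weight=sample_weight)
print(search.best_params_)
```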

    n_jobs : int, default=1
    Number of jobs to run in parallel.
    
    pre_dispatch : int, or string, optional
    Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:
    - None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs
    - An int, giving the exact number of total jobs that are spawned
    - A string, giving an expression as a function of n_jobs, as in '2*n_jobs'
    
    iid : boolean, default=True
    If True, the data is assumed to be identically distributed across the folds, and the loss minimized is the total loss per sample, and not the mean loss across the folds.
    
    cv : int, cross-validation generator or an iterable, optional
    Determines the cross-validation splitting strategy. 
    Possible inputs for cv are:
    - None, to use the default 3-fold cross validation,
    - integer, to specify the number of folds in a `(Stratified)KFold`,
    - An object to be used as a cross-validation generator.
    - An iterable yielding train, test splits.
    For integer/None inputs, if the estimator is a classifier and ``y`` is either binary or multiclass, :class:`StratifiedKFold` is used. In all other cases, :class:`KFold` is used.
    
    Refer to the :ref:`User Guide <cross_validation>` for the various cross-validation strategies that can be used here.

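n_jobs: int, default 1. Number of jobs to run in parallel.

pre_dispatch: int or string, optional. Controls how many jobs are dispatched during parallel execution. Lowering it helps avoid a spike in memory use when more jobs are dispatched than the CPUs can process. It can be:
- None: all jobs are created and spawned immediately. Suited to lightweight, fast-running jobs, where on-demand spawning would only add delay.
- An int: the exact total number of jobs spawned.
- A string: an expression in terms of n_jobs, such as '2*n_jobs'.

iid: boolean, default True. If True, the data is assumed to be identically distributed across the folds, and the loss minimized is the total loss per sample rather than the mean loss across the folds.

cv: int, cross-validation generator or iterable, optional. Determines the cross-validation splitting strategy. Possible inputs are:
- None: the default 3-fold cross-validation.
- An integer: the number of folds in a `(Stratified)KFold`.
- An object to be used as a cross-validation generator.
- An iterable yielding train, test splits.
For integer or None inputs, :class:`StratifiedKFold` is used when the estimator is a classifier and ``y`` is binary or multiclass; :class:`KFold` is used in all other cases.
See the User Guide for the cross-validation strategies that can be used here.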
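
A short sketch of these options, assuming the iris data and an SVC estimator. It passes an explicit splitter as ``cv`` rather than relying on the version-dependent default, and keeps ``n_jobs=1`` so it runs anywhere; a larger value runs candidates in parallel.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# An explicit splitter instead of an integer, so shuffling and the seed are fixed.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

search = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10]},
    cv=cv,
    n_jobs=1,                 # increase to fit candidates in parallel
    pre_dispatch="2*n_jobs",  # limits how many jobs are queued at once
)
search.fit(X, y)

print(search.n_splits_)
print(len(search.cv_results_["params"]))
```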

    refit : boolean, or string, default=True
    Refit an estimator using the best found parameters on the whole dataset.
    
    For multiple metric evaluation, this needs to be a string denoting the scorer that is used to find the best parameters for refitting the estimator at the end.
    
    The refitted estimator is made available at the ``best_estimator_`` attribute and permits using ``predict`` directly on this ``GridSearchCV`` instance.

    Also for multiple metric evaluation, the attributes ``best_index_``, ``best_score_`` and ``best_params_`` will only be available if ``refit`` is set, and all of them will be determined w.r.t. this specific scorer.
    
    See ``scoring`` parameter to know more about multiple metric evaluation.
    
    verbose : integer
    Controls the verbosity: the higher, the more messages.
    
    error_score : 'raise' (default) or numeric
    Value to assign to the score if an error occurs in estimator fitting.
    If set to 'raise', the error is raised. If a numeric value is given,  FitFailedWarning is raised. This parameter does not affect the refit step, which will always raise the error.

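refit: boolean or string, default True. Refits an estimator on the whole dataset using the best parameters found.

For multiple-metric evaluation, this must be a string naming the scorer used to pick the best parameters for the final refit.

The refitted estimator is available as ``best_estimator_``, which allows calling ``predict`` directly on the ``GridSearchCV`` instance.

Also for multiple-metric evaluation, ``best_index_``, ``best_score_`` and ``best_params_`` are available only if ``refit`` is set, and all of them are determined with respect to that scorer.
See the ``scoring`` parameter for more on multiple-metric evaluation.

verbose: integer. Controls verbosity: the higher the value, the more messages.

error_score: 'raise' (default) or numeric. The value assigned to the score if an error occurs while fitting the estimator.
If set to 'raise', the error is raised. If a numeric value is given, a FitFailedWarning is raised instead. This parameter does not affect the refit step, which always raises the error.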
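
The effect of a numeric ``error_score`` can be sketched as below. The invalid value ``C=-1`` is an assumption chosen only because SVC rejects it at fit time, so one candidate fails while the other succeeds.

```python
import warnings

from sklearn.datasets import load_iris
from sklearn.exceptions import FitFailedWarning
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# C=-1 is invalid for SVC, so every fit of that candidate fails.
search = GridSearchCV(SVC(), param_grid={"C": [-1, 1]}, cv=3, error_score=0.0)

# Record warnings so the FitFailedWarning can be inspected instead of printed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    search.fit(X, y)

failed = [w for w in caught if issubclass(w.category, FitFailedWarning)]
print(len(failed) > 0)

# The failed candidate receives error_score as its score; the valid one wins.
print(search.cv_results_["mean_test_score"])
print(search.best_params_)
```

With ``error_score='raise'`` the same call would stop at the first failing fit instead of scoring it.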

  return_train_score : boolean, optional
    If ``False``, the ``cv_results_`` attribute will not include training scores.
    
    Current default is ``'warn'``, which behaves as ``True`` in addition to raising a warning when a training score is looked up.
    That default will be changed to ``False`` in 0.21.
    Computing training scores is used to get insights on how different parameter settings impact the overfitting/underfitting trade-off.
    However computing the scores on the training set can be computationally expensive and is not strictly required to select the parameters that yield the best generalization performance.
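return_train_score: boolean, optional. If ``False``, the ``cv_results_`` attribute does not include training scores.
The current default is ``'warn'``, which behaves like ``True`` but also raises a warning when a training score is looked up.
The default will change to ``False`` in 0.21.
Training scores show how different parameter settings affect the overfitting/underfitting trade-off.
However, computing scores on the training set can be expensive, and they are not strictly required to select the parameters with the best generalization performance.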
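
A minimal sketch of what the flag changes, assuming the iris data and an SVC estimator. The flag is set explicitly in both searches because its default differs between scikit-learn versions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
grid = {"C": [0.01, 1, 100]}

with_train = GridSearchCV(SVC(), grid, cv=3, return_train_score=True).fit(X, y)
without_train = GridSearchCV(SVC(), grid, cv=3, return_train_score=False).fit(X, y)

# A large gap between training and test score hints at overfitting for that setting.
results = with_train.cv_results_
for C, train, test in zip(grid["C"], results["mean_train_score"], results["mean_test_score"]):
    print(C, round(train - test, 3))

# Without the flag, the training-score columns are simply absent.
print("mean_train_score" in without_train.cv_results_)
```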
    Attributes
    ----------
    cv_results_ : dict of numpy (masked) ndarrays
    A dict with keys as column headers and values as columns, that can be
    imported into a pandas ``DataFrame``.
    
    For instance the below given table
    
    +------------+-----------+------------+-----------------+---+---------+
    |param_kernel|param_gamma|param_degree|split0_test_score|...|rank_t...|
    +============+===========+============+=================+===+=========+
    |  'poly'    |     --    |      2     |        0.8      |...|    2    |
    +------------+-----------+------------+-----------------+---+---------+
    |  'poly'    |     --    |      3     |        0.7      |...|    4    |
    +------------+-----------+------------+-----------------+---+---------+
    |  'rbf'     |     0.1   |     --     |        0.8      |...|    3    |
    +------------+-----------+------------+-----------------+---+---------+
    |  'rbf'     |     0.2   |     --     |        0.9      |...|    1    |
    +------------+-----------+------------+-----------------+---+---------+
    
    will be represented by a ``cv_results_`` dict of::
    
    {
    'param_kernel': masked_array(data = ['poly', 'poly', 'rbf', 'rbf'],
    mask = [False False False False]...)
    'param_gamma': masked_array(data = [-- -- 0.1 0.2],
    mask = [ True  True False False]...),
    'param_degree': masked_array(data = [2.0 3.0 -- --],
    mask = [False False  True  True]...),
    'split0_test_score'  : [0.8, 0.7, 0.8, 0.9],
    'split1_test_score'  : [0.82, 0.5, 0.7, 0.78],
    'mean_test_score'    : [0.81, 0.60, 0.75, 0.82],
    'std_test_score'     : [0.02, 0.01, 0.03, 0.03],
    'rank_test_score'    : [2, 4, 3, 1],
    'split0_train_score' : [0.8, 0.9, 0.7],
    'split1_train_score' : [0.82, 0.5, 0.7],
    'mean_train_score'   : [0.81, 0.7, 0.7],
    'std_train_score'    : [0.03, 0.03, 0.04],
    'mean_fit_time'      : [0.73, 0.63, 0.43, 0.49],
    'std_fit_time'       : [0.01, 0.02, 0.01, 0.01],
    'mean_score_time'    : [0.007, 0.06, 0.04, 0.04],
    'std_score_time'     : [0.001, 0.002, 0.003, 0.005],
    'params'             : [{'kernel': 'poly', 'degree': 2}, ...],
    }
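Attributes, in plain terms

cv_results_: a dict of numpy (masked) ndarrays.
Keys are column headers and values are columns, so the dict can be imported directly into a pandas ``DataFrame``. The table above is one such example.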
    NOTE
    The key ``'params'`` is used to store a list of parameter settings dicts for all the parameter candidates.
    The ``mean_fit_time``, ``std_fit_time``, ``mean_score_time`` and ``std_score_time`` are all in seconds.
    For multi-metric evaluation, the scores for all the scorers are available in the ``cv_results_`` dict at the keys ending with that scorer's name (``'_<scorer_name>'``) instead of ``'_score'`` shown above. ('split0_test_precision', 'mean_train_precision' etc.)
Note
The ``'params'`` key stores a list of parameter-setting dicts, one for each candidate.
``mean_fit_time``, ``std_fit_time``, ``mean_score_time`` and ``std_score_time`` are all in seconds.
For multi-metric evaluation, the scores of every scorer are stored in ``cv_results_`` under keys that end with that scorer's name (``'_<scorer_name>'``) instead of the ``'_score'`` suffix shown above, for example 'split0_test_precision' and 'mean_train_precision'.
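
Loading ``cv_results_`` into a pandas ``DataFrame`` can be sketched as follows. The grid mirrors the poly/rbf table above, but the iris data and the resulting scores are assumptions of this example, so the numbers will differ from the table. pandas must be installed.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Two sub-grids, so some parameter columns are masked for some candidates.
param_grid = [
    {"kernel": ["poly"], "degree": [2, 3]},
    {"kernel": ["rbf"], "gamma": [0.1, 0.2]},
]
search = GridSearchCV(SVC(), param_grid, cv=3).fit(X, y)

# Keys become column headers, values become columns.
results = pd.DataFrame(search.cv_results_)
cols = ["param_kernel", "param_gamma", "param_degree", "mean_test_score", "rank_test_score"]
print(results[cols].sort_values("rank_test_score"))
```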
    best_estimator_ : estimator or dict
    Estimator that was chosen by the search, i.e. estimator
    which gave highest score (or smallest loss if specified)
    on the left out data. Not available if ``refit=False``.
    See ``refit`` parameter for more information on allowed values.
    
    best_score_ : float
    Mean cross-validated score of the best_estimator
    For multi-metric evaluation, this is present only if ``refit`` is specified.
    
    best_params_ : dict
    Parameter setting that gave the best results on the hold out data. For multi-metric evaluation, this is present only if ``refit`` is specified.
    
    best_index_ : int
    The index (of the ``cv_results_`` arrays) which corresponds to the best candidate parameter setting.
    The dict at ``search.cv_results_['params'][search.best_index_]`` gives the parameter setting for the best model, that gives the highest mean score (``search.best_score_``). For multi-metric evaluation, this is present only if ``refit`` is specified.
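best_estimator_: estimator or dict
The estimator chosen by the search, that is, the one that gave the highest score (or the smallest loss, if specified) on the left-out data. Not available if ``refit=False``. See the ``refit`` parameter for the allowed values.

best_score_: float
Mean cross-validated score of the best_estimator.
For multi-metric evaluation, this is present only if ``refit`` is specified.

best_params_: dict
The parameter setting that gave the best results on the hold-out data. For multi-metric evaluation, this is present only if ``refit`` is specified.

best_index_: int
The index into the ``cv_results_`` arrays that corresponds to the best candidate parameter setting.
The dict at ``search.cv_results_['params'][search.best_index_]`` is the parameter setting of the best model, the one with the highest mean score (``search.best_score_``). For multi-metric evaluation, this is present only if ``refit`` is specified.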
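
The relationships between these attributes can be checked directly. This is a minimal sketch on the iris data with an SVC estimator, both of which are assumptions of the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

search = GridSearchCV(SVC(), {"kernel": ["linear", "rbf"], "C": [1, 10]}, cv=3)
search.fit(X, y)

# best_index_ points into the cv_results_ arrays.
best = search.cv_results_["params"][search.best_index_]
print(best == search.best_params_)
print(search.best_score_ == search.cv_results_["mean_test_score"][search.best_index_])

# With refit=True (the default), predict() delegates to best_estimator_.
print(search.predict(X[:3]))
print(search.best_estimator_.predict(X[:3]))
```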
    scorer_ : function or a dict
    Scorer function used on the held out data to choose the best parameters for the model.
    For multi-metric evaluation, this attribute holds the validated ``scoring`` dict which maps the scorer key to the scorer callable.
    
    n_splits_ : int
    The number of cross-validation splits (folds/iterations).
    
    Notes
    ------
    The parameters selected are those that maximize the score of the left  out data, unless an explicit score is passed in which case it is used instead.
    If `n_jobs` was set to a value higher than one, the data is copied for each point in the grid (and not `n_jobs` times). This is done for efficiency reasons if individual jobs take very little time, but may raise errors if the dataset is large and not enough memory is available. A workaround in this case is to set `pre_dispatch`. Then, the memory is copied only `pre_dispatch` many times. A reasonable value for `pre_dispatch` is `2 * n_jobs`.
    
    See Also
    ---------
    :class:`ParameterGrid`:
    generates all the combinations of a hyperparameter grid.
    :func:`sklearn.model_selection.train_test_split`:
    utility function to split the data into a development set usable for fitting a GridSearchCV instance and an evaluation set for its final evaluation.
    :func:`sklearn.metrics.make_scorer`:
    Make a scorer from a performance metric or loss function.
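scorer_: function or dict
The scorer function used on the held-out data to choose the best parameters for the model.
For multi-metric evaluation, this attribute holds the validated ``scoring`` dict, which maps each scorer key to its scorer callable.

n_splits_: int
The number of cross-validation splits (folds/iterations).

Notes
The parameters selected are those that maximize the score on the left-out data, unless an explicit score is passed, in which case that one is used instead.
If `n_jobs` is set to a value higher than one, the data is copied for each point in the grid (and not `n_jobs` times). This is efficient when individual jobs take very little time, but may raise errors if the dataset is large and not enough memory is available. A workaround is to set `pre_dispatch`; the data is then copied only `pre_dispatch` times. A reasonable value for `pre_dispatch` is `2 * n_jobs`.

See also
:class:`ParameterGrid`:
generates all the combinations of a hyperparameter grid.
:func:`sklearn.model_selection.train_test_split`:
utility function that splits the data into a development set usable for fitting a GridSearchCV instance and an evaluation set for its final evaluation.
:func:`sklearn.metrics.make_scorer`:
makes a scorer from a performance metric or loss function.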
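
The two helpers from the "See also" list combine naturally with a search. The sketch below assumes the iris data, an SVC estimator and macro-averaged F1 as the metric; none of these come from the docstring.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Development set for the search, evaluation set for the final check.
X_dev, X_eval, y_dev, y_eval = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Turn a metric function into a scorer that returns a single value.
macro_f1 = make_scorer(f1_score, average="macro")

search = GridSearchCV(SVC(), {"C": [1, 10]}, scoring=macro_f1, cv=3)
search.fit(X_dev, y_dev)

# score() applies the same scorer to data the search never saw.
eval_score = search.score(X_eval, y_eval)
print(search.best_params_, round(eval_score, 3))
```

Keeping the evaluation set out of the search means the final score is not inflated by the parameter selection itself.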

 

2. Source code

class GridSearchCV Found at: sklearn.model_selection._search

# Excerpt of sklearn.model_selection._search; BaseSearchCV, ParameterGrid and
# _check_param_grid are defined in that same module.
class GridSearchCV(BaseSearchCV):
    """Exhaustive search over specified parameter values for an estimator.

    (The full docstring is the one quoted in section 1.)
    """

    def __init__(self, estimator, param_grid, scoring=None, fit_params=None,
                 n_jobs=1, iid=True, refit=True, cv=None, verbose=0,
                 pre_dispatch='2*n_jobs', error_score='raise',
                 return_train_score="warn"):
        # Everything except param_grid is handled by the shared base class.
        super(GridSearchCV, self).__init__(
            estimator=estimator, scoring=scoring, fit_params=fit_params,
            n_jobs=n_jobs, iid=iid, refit=refit, cv=cv, verbose=verbose,
            pre_dispatch=pre_dispatch, error_score=error_score,
            return_train_score=return_train_score)
        self.param_grid = param_grid
        # Validate the grid values before any fitting starts.
        _check_param_grid(param_grid)

    def _get_param_iterator(self):
        """Return ParameterGrid instance for the given param_grid"""
        return ParameterGrid(self.param_grid)
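
Since the class only supplies a ``ParameterGrid`` iterator, the core idea can be imitated by hand. The loop below is a simplified sketch, not the library's implementation: it assumes the iris data, an SVC estimator and plain mean accuracy, and it ignores parallelism, timing and error handling.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"kernel": ["linear", "rbf"], "C": [1, 10]}
base_estimator = SVC()

best_score, best_params = -np.inf, None
for params in ParameterGrid(param_grid):
    # Fresh copy of the estimator for every candidate setting.
    model = clone(base_estimator).set_params(**params)
    # Mean cross-validated score of this candidate.
    score = cross_val_score(model, X, y, cv=3).mean()
    if score > best_score:
        best_score, best_params = score, params

# The "refit" step: train once more on all the data with the winning setting.
final_model = clone(base_estimator).set_params(**best_params).fit(X, y)
print(best_params, round(best_score, 3))
```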

 

Using GridSearchCV

    Examples
    --------
    >>> from sklearn import svm, datasets
    >>> from sklearn.model_selection import GridSearchCV
    >>> iris = datasets.load_iris()
    >>> parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
    >>> svc = svm.SVC()
    >>> clf = GridSearchCV(svc, parameters)
    >>> clf.fit(iris.data, iris.target)
    ...                             # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
    GridSearchCV(cv=None, error_score=...,
    estimator=SVC(C=1.0, cache_size=..., class_weight=..., coef0=...,
    decision_function_shape='ovr', degree=..., gamma=...,
    kernel='rbf', max_iter=-1, probability=False,
    random_state=None, shrinking=True, tol=...,
    verbose=False),
    fit_params=None, iid=..., n_jobs=1,
    param_grid=..., pre_dispatch=..., refit=..., return_train_score=...,
    scoring=..., verbose=...)
    >>> sorted(clf.cv_results_.keys())
    ...                             # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
    ['mean_fit_time', 'mean_score_time', 'mean_test_score',...
    'mean_train_score', 'param_C', 'param_kernel', 'params',...
    'rank_test_score', 'split0_test_score',...
    'split0_train_score', 'split1_test_score', 'split1_train_score',...
    'split2_test_score', 'split2_train_score',...
    'std_fit_time', 'std_score_time', 'std_test_score', 'std_train_score'...]
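
The doctest above reflects the older release this article quotes. A version that should also run on recent scikit-learn releases might look like the sketch below; ``cv=3`` and ``return_train_score=True`` are set explicitly here (an assumption of this example) so that the result keys match the listing above regardless of version defaults.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
parameters = {"kernel": ("linear", "rbf"), "C": [1, 10]}

# Explicit cv and return_train_score keep the output independent of version defaults.
clf = GridSearchCV(svm.SVC(), parameters, cv=3, return_train_score=True)
clf.fit(iris.data, iris.target)

print(sorted(clf.cv_results_.keys()))
print(clf.best_params_)
```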

 

 

 

 

 

 

 

 

 
