Understanding the autosklearn Source Code

self.steps of the classification pipeline looks like this:

0 ['categorical_encoding', <autosklearn.pipeline.components.data_preprocessing.one_hot_encoding.OHEChoice object at 0x7f7a74dfa8d0>]
1 ['imputation', Imputation(random_state=None, strategy='median')]
2 ['variance_threshold', VarianceThreshold(random_state=None)]
3 ['rescaling', <autosklearn.pipeline.components.data_preprocessing.rescaling.RescalingChoice object at 0x7f7a74dfa780>]
4 ['balancing', Balancing(random_state=None, strategy='none')]
5 ['preprocessor', <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7f7a74dc6390>]
6 ['classifier', <autosklearn.pipeline.components.classification.ClassifierChoice object at 0x7f7a74dc6048>]
Which of these pipeline node classes are Choice objects:

OHEChoice                  choice
Imputation                 not a choice
VarianceThreshold          not a choice
RescalingChoice            choice
Balancing                  not a choice
FeaturePreprocessorChoice  choice
ClassifierChoice           choice
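
Whether a node is a Choice can be checked against the common base class; a minimal sketch, assuming pipeline is a SimpleClassificationPipeline instance and that AutoSklearnChoice in autosklearn.pipeline.components.base is the shared base of the *Choice components:

    from autosklearn.pipeline.components.base import AutoSklearnChoice

    for name, node in pipeline.steps:
        kind = "choice" if isinstance(node, AutoSklearnChoice) else "not choice"
        print(name, type(node).__name__, kind)

The components available at each Choice node are discovered dynamically; find_components below (from autosklearn/pipeline/components/base.py) performs that scan: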
from collections import OrderedDict
import importlib
import inspect
import pkgutil
import sys

def find_components(package, directory, base_class):
    components = OrderedDict()

    for module_loader, module_name, ispkg in pkgutil.iter_modules([directory]):
        full_module_name = "%s.%s" % (package, module_name)
        if full_module_name not in sys.modules and not ispkg:
            module = importlib.import_module(full_module_name)

            for member_name, obj in inspect.getmembers(module):
                if inspect.isclass(obj) and issubclass(obj, base_class) and \
                        obj != base_class:
                    # TODO test if the obj implements the interface
                    # Keep in mind that this only instantiates the ensemble_wrapper,
                    # but not the real target classifier
                    classifier = obj
                    components[module_name] = classifier

    return components

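# Note: classifier_directory and AutoSklearnClassificationAlgorithm come from the
# enclosing module (autosklearn/pipeline/components/classification/__init__.py in the source).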
_classifiers = find_components(__package__,
                               classifier_directory,
                               AutoSklearnClassificationAlgorithm)

At autosklearn/ensemble_builder.py:389:

            self.logger.warning("No models better than random - "
                                "using Dummy Score!")

In automl.py, the _fit function obtains an EnsembleBuilder object via self._proc_ensemble = self._get_ensemble_process(time_left_for_ensembles); the code lives in ~/ensemble_builder.py.
_proc_ensemble inherits from the multiprocessing Process class, so its run method executes in a separate process.
run in turn calls EnsembleBuilder.main.
Worth studying in depth: autosklearn.ensembles.ensemble_selection.EnsembleSelection#_fast
The _ensemble variable is presumably loaded via load_model.
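
A minimal, illustrative version of that run-in-a-subprocess pattern (not the actual EnsembleBuilder code):

    import multiprocessing

    class Builder(multiprocessing.Process):
        def run(self):
            # executed in the child process once .start() is called
            self.main()

        def main(self):
            print("building ensembles in a separate process")

    if __name__ == "__main__":
        proc = Builder()
        proc.start()   # spawns the child, which invokes run()
        proc.join()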

Excerpts from the auto-sklearn paper on how the ensemble is built:

We experimented with different approaches to optimize these weights: stacking [26], gradient-free numerical optimization, and the method ensemble selection [24].

we found both numerical optimization and stacking to overfit to the validation set and to be
computationally costly

In a nutshell, ensemble selection (introduced by Caruana et al. [24]) is a greedy procedure that starts from an empty ensemble and then iteratively adds the model that maximizes ensemble validation performance (with uniform weight, but allowing for repetitions)
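
A minimal sketch of that greedy procedure, assuming val_preds is a list of per-model validation predictions and loss_fn is a loss to be minimized (illustrative only; autosklearn's optimized version is EnsembleSelection#_fast):

    import numpy as np

    def ensemble_selection(val_preds, y_val, loss_fn, n_iter=50):
        # Start from an empty ensemble; repeatedly add (with replacement) the model
        # whose inclusion minimizes the ensemble's validation loss.
        ensemble = []
        running_sum = np.zeros_like(val_preds[0])
        for _ in range(n_iter):
            losses = [loss_fn(y_val, (running_sum + p) / (len(ensemble) + 1))
                      for p in val_preds]
            best = int(np.argmin(losses))
            ensemble.append(best)
            running_sum += val_preds[best]
        # uniform weights over picks = per-model weights proportional to pick counts
        return np.bincount(ensemble, minlength=len(val_preds)) / len(ensemble)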

We note that SMAC [9] can handle this conditionality
natively

The 14 possible feature preprocessing methods can be categorized into
feature selection (2), kernel approximation (2), matrix decomposition (3), embeddings (1), feature
clustering (1), polynomial feature expansion (1) and methods that use a classifier for feature selection
(2).

Table of Contents

  • Overview of the basic auto-sklearn execution flow
    • Creating the search space
    • metalearn

Overview of the basic auto-sklearn execution flow

The main entry point: autosklearn.automl.AutoMLClassifier#fit
After the subclass sets up some required parameters, it calls the parent class's fit method, i.e. autosklearn.automl.AutoML#fit.
After wrapping X and y via loaded_data_manager = XYDataManager(..., it calls return self._fit(...

Creating the search space

self.configuration_space, configspace_path = self._create_search_space(
Step into: autosklearn.automl.AutoML#_create_search_space
There we find configuration_space = pipeline.get_configuration_space(
Step into: autosklearn.util.pipeline.get_configuration_space
In this function, after the info dict has been populated, the final block is:

    if info['task'] in REGRESSION_TASKS:
        return _get_regression_configuration_space(info, include, exclude)
    else:
        return _get_classification_configuration_space(info, include, exclude)
  • Step into: autosklearn.util.pipeline._get_classification_configuration_space

The final block:

    return SimpleClassificationPipeline(
        dataset_properties=dataset_properties,
        include=include, exclude=exclude).\
        get_hyperparameter_search_space()
  • Step into: autosklearn.pipeline.base.BasePipeline#get_hyperparameter_search_space

The final block:

        if not hasattr(self, 'config_space') or self.config_space is None:
            self.config_space = self._get_hyperparameter_search_space(
                include=self.include_, exclude=self.exclude_,
                dataset_properties=self.dataset_properties_)
        return self.config_space
  • Step into: autosklearn.pipeline.classification.SimpleClassificationPipeline#_get_hyperparameter_search_space

After many jumps and stack pushes, we have finally reached the region with the richest substance.
We see the following code:

        cs = self._get_base_search_space(
            cs=cs, dataset_properties=dataset_properties,
            exclude=exclude, include=include, pipeline=self.steps)

Note that self.steps here holds all the nodes of the Pipeline that autosklearn tries to optimize.

  • Step into: autosklearn.pipeline.base.BasePipeline#_get_base_search_space

This function fetches matches; we want to know where matches comes from:

  • Step into: autosklearn.pipeline.create_searchspace_util.get_match_array

The loop for node_name, node in pipeline: builds an important variable, node_i_choices. It is a 2-D list: in the vanilla setup the first dimension has length 7, one entry per Pipeline node, and each sub-list holds all the options that node can choose from.
The first four entries serve as an example:

node_i_choices[0]
Out[16]: 
[autosklearn.pipeline.components.data_preprocessing.one_hot_encoding.no_encoding.NoEncoding,
 autosklearn.pipeline.components.data_preprocessing.one_hot_encoding.one_hot_encoding.OneHotEncoder]
node_i_choices[1]
Out[17]: [Imputation(random_state=None, strategy='median')]
node_i_choices[2]
Out[18]: [VarianceThreshold(random_state=None)]
node_i_choices[3]
Out[19]: 
[autosklearn.pipeline.components.data_preprocessing.rescaling.minmax.MinMaxScalerComponent,
 autosklearn.pipeline.components.data_preprocessing.rescaling.none.NoRescalingComponent,
 autosklearn.pipeline.components.data_preprocessing.rescaling.normalize.NormalizerComponent,
 autosklearn.pipeline.components.data_preprocessing.rescaling.quantile_transformer.QuantileTransformerComponent,
 autosklearn.pipeline.components.data_preprocessing.rescaling.robust_scaler.RobustScalerComponent,
 autosklearn.pipeline.components.data_preprocessing.rescaling.standardize.StandardScalerComponent]

Afterwards, matches_dimensions records the length of each sub-list; it is used to allocate a high-dimensional tensor, matches:

matches_dimensions
Out[20]: [2, 1, 1, 6, 1, 15, 15]
matches = np.ones(matches_dimensions, dtype=int)
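
As a toy illustration of this tensor and of the tuple indexing used next (each coordinate picks one option per node; setting an entry to 0 marks that combination as illegal):

    import numpy as np

    dims = [2, 1, 1, 6, 1, 15, 15]
    matches = np.ones(dims, dtype=int)    # 2 * 6 * 15 * 15 = 2700 instantiations
    matches[(0, 0, 0, 0, 0, 0, 0)] = 0    # rule out one pipeline instantiation
    print(matches.size, matches.sum())    # 2700 2699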

Then we see:

    pipeline_idxs = [range(dim) for dim in matches_dimensions]
    for pipeline_instantiation_idxs in itertools.product(*pipeline_idxs):

This can be read as enumerating every possible instantiation of the Pipeline.
pipeline_instantiation_idxs is the coordinate of one such instantiation inside matches:

pipeline_instantiation_idxs
Out[25]: (0, 0, 0, 0, 0, 0, 0)
            node_input = node.get_properties()['input']
            node_output = node.get_properties()['output']
node_input
Out[26]: (5, 6, 10)
node_output
Out[27]: (8,)

This looks puzzling at first glance, but jumping into get_properties we see:

                'input': (DENSE, SPARSE, UNSIGNED_DATA),
                'output': (PREDICTIONS,)}

These tuples evidently declare which data types a node can accept and produce.
First, sparse/dense compatibility is checked:

            # First check if these two instantiations of this node can work
            # together. Do this in multiple if statements to maintain
            # readability
            if (data_is_sparse and SPARSE not in node_input) or \
                    not data_is_sparse and DENSE not in node_input:
                matches[pipeline_instantiation_idxs] = 0
                break
            # No need to check if the node can handle SIGNED_DATA; this is
            # always assumed to be true
            elif not dataset_is_signed and UNSIGNED_DATA not in node_input:
                matches[pipeline_instantiation_idxs] = 0
                break

The remaining checks follow the same pattern; the whole point is to verify that a pipeline instantiation is legal. The source is fairly intricate, so I skip the details for now.
Finally, matches is returned.

  • Return to: autosklearn.pipeline.base.BasePipeline#_get_base_search_space:293
            if not is_choice:
                cs.add_configuration_space(node_name,
                                           node.get_hyperparameter_search_space(dataset_properties))
            # If the node isn't a choice, we have to figure out which of it's
            #  choices are actually legal choices
            else:
                choices_list = \
                    autosklearn.pipeline.create_searchspace_util.find_active_choices(
                        matches, node, node_idx,
                        dataset_properties,
                        include.get(node_name),
                        exclude.get(node_name)
                    )
                sub_config_space = node.get_hyperparameter_search_space(
                    dataset_properties, include=choices_list)
                cs.add_configuration_space(node_name, sub_config_space)

For a choice node, execution enters the else branch, where choices_list holds all the legal candidates:

choices_list
Out[29]: ['no_encoding', 'one_hot_encoding']

Printing further:

sub_config_space
Out[30]: 
Configuration space object:
  Hyperparameters:
    __choice__, Type: Categorical, Choices: {no_encoding, one_hot_encoding}, Default: one_hot_encoding
    one_hot_encoding:minimum_fraction, Type: UniformFloat, Range: [0.0001, 0.5], Default: 0.01, on log-scale
    one_hot_encoding:use_minimum_fraction, Type: Categorical, Choices: {True, False}, Default: True
  Conditions:
    one_hot_encoding:minimum_fraction | one_hot_encoding:use_minimum_fraction == 'True'
    one_hot_encoding:use_minimum_fraction | __choice__ == 'one_hot_encoding'

Printing the feature-preprocessing part:

Configuration space object:
  Hyperparameters:
    __choice__, Type: Categorical, Choices: {extra_trees_preproc_for_classification, fast_ica, feature_agglomeration, kernel_pca, kitchen_sinks, liblinear_svc_preprocessor, no_preprocessing, nystroem_sampler, pca, polynomial, random_trees_embedding, select_percentile_classification, select_rates}, Default: no_preprocessing
    extra_trees_preproc_for_classification:bootstrap, Type: Categorical, Choices: {True, False}, Default: False
    extra_trees_preproc_for_classification:criterion, Type: Categorical, Choices: {gini, entropy}, Default: gini
    extra_trees_preproc_for_classification:max_depth, Type: Constant, Value: None
    extra_trees_preproc_for_classification:max_features, Type: UniformFloat, Range: [0.0, 1.0], Default: 0.5
    extra_trees_preproc_for_classification:max_leaf_nodes, Type: Constant, Value: None
    extra_trees_preproc_for_classification:min_impurity_decrease, Type: Constant, Value: 0.0
    extra_trees_preproc_for_classification:min_samples_leaf, Type: UniformInteger, Range: [1, 20], Default: 1
    extra_trees_preproc_for_classification:min_samples_split, Type: UniformInteger, Range: [2, 20], Default: 2
    extra_trees_preproc_for_classification:min_weight_fraction_leaf, Type: Constant, Value: 0.0
    extra_trees_preproc_for_classification:n_estimators, Type: Constant, Value: 100
    fast_ica:algorithm, Type: Categorical, Choices: {parallel, deflation}, Default: parallel
    fast_ica:fun, Type: Categorical, Choices: {logcosh, exp, cube}, Default: logcosh
    fast_ica:n_components, Type: UniformInteger, Range: [10, 2000], Default: 100
    fast_ica:whiten, Type: Categorical, Choices: {False, True}, Default: False
    feature_agglomeration:affinity, Type: Categorical, Choices: {euclidean, manhattan, cosine}, Default: euclidean
    feature_agglomeration:linkage, Type: Categorical, Choices: {ward, complete, average}, Default: ward
    feature_agglomeration:n_clusters, Type: UniformInteger, Range: [2, 400], Default: 25
    feature_agglomeration:pooling_func, Type: Categorical, Choices: {mean, median, max}, Default: mean
    kernel_pca:coef0, Type: UniformFloat, Range: [-1.0, 1.0], Default: 0.0
    kernel_pca:degree, Type: UniformInteger, Range: [2, 5], Default: 3
    kernel_pca:gamma, Type: UniformFloat, Range: [3.0517578125e-05, 8.0], Default: 1.0, on log-scale
    kernel_pca:kernel, Type: Categorical, Choices: {poly, rbf, sigmoid, cosine}, Default: rbf
    kernel_pca:n_components, Type: UniformInteger, Range: [10, 2000], Default: 100
    kitchen_sinks:gamma, Type: UniformFloat, Range: [3.0517578125e-05, 8.0], Default: 1.0, on log-scale
    kitchen_sinks:n_components, Type: UniformInteger, Range: [50, 10000], Default: 100, on log-scale
    liblinear_svc_preprocessor:C, Type: UniformFloat, Range: [0.03125, 32768.0], Default: 1.0, on log-scale
    liblinear_svc_preprocessor:dual, Type: Constant, Value: False
    liblinear_svc_preprocessor:fit_intercept, Type: Constant, Value: True
    liblinear_svc_preprocessor:intercept_scaling, Type: Constant, Value: 1
    liblinear_svc_preprocessor:loss, Type: Categorical, Choices: {hinge, squared_hinge}, Default: squared_hinge
    liblinear_svc_preprocessor:multi_class, Type: Constant, Value: ovr
    liblinear_svc_preprocessor:penalty, Type: Constant, Value: l1
    liblinear_svc_preprocessor:tol, Type: UniformFloat, Range: [1e-05, 0.1], Default: 0.0001, on log-scale
    nystroem_sampler:coef0, Type: UniformFloat, Range: [-1.0, 1.0], Default: 0.0
    nystroem_sampler:degree, Type: UniformInteger, Range: [2, 5], Default: 3
    nystroem_sampler:gamma, Type: UniformFloat, Range: [3.0517578125e-05, 8.0], Default: 0.1, on log-scale
    nystroem_sampler:kernel, Type: Categorical, Choices: {poly, rbf, sigmoid, cosine}, Default: rbf
    nystroem_sampler:n_components, Type: UniformInteger, Range: [50, 10000], Default: 100, on log-scale
    pca:keep_variance, Type: UniformFloat, Range: [0.5, 0.9999], Default: 0.9999
    pca:whiten, Type: Categorical, Choices: {False, True}, Default: False
    polynomial:degree, Type: UniformInteger, Range: [2, 3], Default: 2
    polynomial:include_bias, Type: Categorical, Choices: {True, False}, Default: True
    polynomial:interaction_only, Type: Categorical, Choices: {False, True}, Default: False
    random_trees_embedding:bootstrap, Type: Categorical, Choices: {True, False}, Default: True
    random_trees_embedding:max_depth, Type: UniformInteger, Range: [2, 10], Default: 5
    random_trees_embedding:max_leaf_nodes, Type: Constant, Value: None
    random_trees_embedding:min_samples_leaf, Type: UniformInteger, Range: [1, 20], Default: 1
    random_trees_embedding:min_samples_split, Type: UniformInteger, Range: [2, 20], Default: 2
    random_trees_embedding:min_weight_fraction_leaf, Type: Constant, Value: 1.0
    random_trees_embedding:n_estimators, Type: UniformInteger, Range: [10, 100], Default: 10
    select_percentile_classification:percentile, Type: UniformFloat, Range: [1.0, 99.0], Default: 50.0
    select_percentile_classification:score_func, Type: Categorical, Choices: {chi2, f_classif, mutual_info}, Default: chi2
    select_rates:alpha, Type: UniformFloat, Range: [0.01, 0.5], Default: 0.1
    select_rates:mode, Type: Categorical, Choices: {fpr, fdr, fwe}, Default: fpr
    select_rates:score_func, Type: Categorical, Choices: {chi2, f_classif}, Default: chi2
  Conditions:
    extra_trees_preproc_for_classification:bootstrap | __choice__ == 'extra_trees_preproc_for_classification'
    extra_trees_preproc_for_classification:criterion | __choice__ == 'extra_trees_preproc_for_classification'
    extra_trees_preproc_for_classification:max_depth | __choice__ == 'extra_trees_preproc_for_classification'
    extra_trees_preproc_for_classification:max_features | __choice__ == 'extra_trees_preproc_for_classification'
    extra_trees_preproc_for_classification:max_leaf_nodes | __choice__ == 'extra_trees_preproc_for_classification'
    extra_trees_preproc_for_classification:min_impurity_decrease | __choice__ == 'extra_trees_preproc_for_classification'
    extra_trees_preproc_for_classification:min_samples_leaf | __choice__ == 'extra_trees_preproc_for_classification'
    extra_trees_preproc_for_classification:min_samples_split | __choice__ == 'extra_trees_preproc_for_classification'
    extra_trees_preproc_for_classification:min_weight_fraction_leaf | __choice__ == 'extra_trees_preproc_for_classification'
    extra_trees_preproc_for_classification:n_estimators | __choice__ == 'extra_trees_preproc_for_classification'
    fast_ica:algorithm | __choice__ == 'fast_ica'
    fast_ica:fun | __choice__ == 'fast_ica'
    fast_ica:n_components | fast_ica:whiten == 'True'
    fast_ica:whiten | __choice__ == 'fast_ica'
    feature_agglomeration:affinity | __choice__ == 'feature_agglomeration'
    feature_agglomeration:linkage | __choice__ == 'feature_agglomeration'
    feature_agglomeration:n_clusters | __choice__ == 'feature_agglomeration'
    feature_agglomeration:pooling_func | __choice__ == 'feature_agglomeration'
    kernel_pca:degree | kernel_pca:kernel == 'poly'
    kernel_pca:kernel | __choice__ == 'kernel_pca'
    kernel_pca:n_components | __choice__ == 'kernel_pca'
    kitchen_sinks:gamma | __choice__ == 'kitchen_sinks'
    kitchen_sinks:n_components | __choice__ == 'kitchen_sinks'
    liblinear_svc_preprocessor:C | __choice__ == 'liblinear_svc_preprocessor'
    liblinear_svc_preprocessor:dual | __choice__ == 'liblinear_svc_preprocessor'
    liblinear_svc_preprocessor:fit_intercept | __choice__ == 'liblinear_svc_preprocessor'
    liblinear_svc_preprocessor:intercept_scaling | __choice__ == 'liblinear_svc_preprocessor'
    liblinear_svc_preprocessor:loss | __choice__ == 'liblinear_svc_preprocessor'
    liblinear_svc_preprocessor:multi_class | __choice__ == 'liblinear_svc_preprocessor'
    liblinear_svc_preprocessor:penalty | __choice__ == 'liblinear_svc_preprocessor'
    liblinear_svc_preprocessor:tol | __choice__ == 'liblinear_svc_preprocessor'
    nystroem_sampler:degree | nystroem_sampler:kernel == 'poly'
    nystroem_sampler:kernel | __choice__ == 'nystroem_sampler'
    nystroem_sampler:n_components | __choice__ == 'nystroem_sampler'
    pca:keep_variance | __choice__ == 'pca'
    pca:whiten | __choice__ == 'pca'
    polynomial:degree | __choice__ == 'polynomial'
    polynomial:include_bias | __choice__ == 'polynomial'
    polynomial:interaction_only | __choice__ == 'polynomial'
    preprocessor:kernel_pca:coef0 | preprocessor:kernel_pca:kernel in {'poly', 'sigmoid'}
    preprocessor:kernel_pca:gamma | preprocessor:kernel_pca:kernel in {'poly', 'rbf'}
    preprocessor:nystroem_sampler:coef0 | preprocessor:nystroem_sampler:kernel in {'poly', 'sigmoid'}
    preprocessor:nystroem_sampler:gamma | preprocessor:nystroem_sampler:kernel in {'poly', 'rbf', 'sigmoid'}
    random_trees_embedding:bootstrap | __choice__ == 'random_trees_embedding'
    random_trees_embedding:max_depth | __choice__ == 'random_trees_embedding'
    random_trees_embedding:max_leaf_nodes | __choice__ == 'random_trees_embedding'
    random_trees_embedding:min_samples_leaf | __choice__ == 'random_trees_embedding'
    random_trees_embedding:min_samples_split | __choice__ == 'random_trees_embedding'
    random_trees_embedding:min_weight_fraction_leaf | __choice__ == 'random_trees_embedding'
    random_trees_embedding:n_estimators | __choice__ == 'random_trees_embedding'
    select_percentile_classification:percentile | __choice__ == 'select_percentile_classification'
    select_percentile_classification:score_func | __choice__ == 'select_percentile_classification'
    select_rates:alpha | __choice__ == 'select_rates'
    select_rates:mode | __choice__ == 'select_rates'
    select_rates:score_func | __choice__ == 'select_rates'
  Forbidden Clauses:
    (Forbidden: preprocessor:feature_agglomeration:affinity in {'cosine', 'manhattan'} && Forbidden: preprocessor:feature_agglomeration:linkage == 'ward')
    (Forbidden: preprocessor:liblinear_svc_preprocessor:penalty == 'l1' && Forbidden: preprocessor:liblinear_svc_preprocessor:loss == 'hinge')

Printing the classifier hyperparameter part:

Configuration space object:
  Hyperparameters:
    __choice__, Type: Categorical, Choices: {adaboost, bernoulli_nb, decision_tree, extra_trees, gaussian_nb, gradient_boosting, k_nearest_neighbors, lda, liblinear_svc, libsvm_svc, multinomial_nb, passive_aggressive, qda, random_forest, sgd}, Default: random_forest
    adaboost:algorithm, Type: Categorical, Choices: {SAMME.R, SAMME}, Default: SAMME.R
    adaboost:learning_rate, Type: UniformFloat, Range: [0.01, 2.0], Default: 0.1, on log-scale
    adaboost:max_depth, Type: UniformInteger, Range: [1, 10], Default: 1
    adaboost:n_estimators, Type: UniformInteger, Range: [50, 500], Default: 50
    bernoulli_nb:alpha, Type: UniformFloat, Range: [0.01, 100.0], Default: 1.0, on log-scale
    bernoulli_nb:fit_prior, Type: Categorical, Choices: {True, False}, Default: True
    decision_tree:criterion, Type: Categorical, Choices: {gini, entropy}, Default: gini
    decision_tree:max_depth_factor, Type: UniformFloat, Range: [0.0, 2.0], Default: 0.5
    decision_tree:max_features, Type: Constant, Value: 1.0
    decision_tree:max_leaf_nodes, Type: Constant, Value: None
    decision_tree:min_impurity_decrease, Type: Constant, Value: 0.0
    decision_tree:min_samples_leaf, Type: UniformInteger, Range: [1, 20], Default: 1
    decision_tree:min_samples_split, Type: UniformInteger, Range: [2, 20], Default: 2
    decision_tree:min_weight_fraction_leaf, Type: Constant, Value: 0.0
    extra_trees:bootstrap, Type: Categorical, Choices: {True, False}, Default: False
    extra_trees:criterion, Type: Categorical, Choices: {gini, entropy}, Default: gini
    extra_trees:max_depth, Type: Constant, Value: None
    extra_trees:max_features, Type: UniformFloat, Range: [0.0, 1.0], Default: 0.5
    extra_trees:max_leaf_nodes, Type: Constant, Value: None
    extra_trees:min_impurity_decrease, Type: Constant, Value: 0.0
    extra_trees:min_samples_leaf, Type: UniformInteger, Range: [1, 20], Default: 1
    extra_trees:min_samples_split, Type: UniformInteger, Range: [2, 20], Default: 2
    extra_trees:min_weight_fraction_leaf, Type: Constant, Value: 0.0
    extra_trees:n_estimators, Type: Constant, Value: 100
    gradient_boosting:early_stop, Type: Categorical, Choices: {off, train, valid}, Default: off
    gradient_boosting:l2_regularization, Type: UniformFloat, Range: [1e-10, 1.0], Default: 1e-10, on log-scale
    gradient_boosting:learning_rate, Type: UniformFloat, Range: [0.01, 1.0], Default: 0.1, on log-scale
    gradient_boosting:loss, Type: Constant, Value: auto
    gradient_boosting:max_bins, Type: Constant, Value: 256
    gradient_boosting:max_depth, Type: Constant, Value: None
    gradient_boosting:max_iter, Type: UniformInteger, Range: [32, 512], Default: 100
    gradient_boosting:max_leaf_nodes, Type: UniformInteger, Range: [3, 2047], Default: 31, on log-scale
    gradient_boosting:min_samples_leaf, Type: UniformInteger, Range: [1, 200], Default: 20, on log-scale
    gradient_boosting:n_iter_no_change, Type: UniformInteger, Range: [1, 20], Default: 10
    gradient_boosting:scoring, Type: Constant, Value: loss
    gradient_boosting:tol, Type: Constant, Value: 1e-07
    gradient_boosting:validation_fraction, Type: UniformFloat, Range: [0.01, 0.4], Default: 0.1
    k_nearest_neighbors:n_neighbors, Type: UniformInteger, Range: [1, 100], Default: 1, on log-scale
    k_nearest_neighbors:p, Type: Categorical, Choices: {1, 2}, Default: 2
    k_nearest_neighbors:weights, Type: Categorical, Choices: {uniform, distance}, Default: uniform
    lda:n_components, Type: UniformInteger, Range: [1, 250], Default: 10
    lda:shrinkage, Type: Categorical, Choices: {None, auto, manual}, Default: None
    lda:shrinkage_factor, Type: UniformFloat, Range: [0.0, 1.0], Default: 0.5
    lda:tol, Type: UniformFloat, Range: [1e-05, 0.1], Default: 0.0001, on log-scale
    liblinear_svc:C, Type: UniformFloat, Range: [0.03125, 32768.0], Default: 1.0, on log-scale
    liblinear_svc:dual, Type: Constant, Value: False
    liblinear_svc:fit_intercept, Type: Constant, Value: True
    liblinear_svc:intercept_scaling, Type: Constant, Value: 1
    liblinear_svc:loss, Type: Categorical, Choices: {hinge, squared_hinge}, Default: squared_hinge
    liblinear_svc:multi_class, Type: Constant, Value: ovr
    liblinear_svc:penalty, Type: Categorical, Choices: {l1, l2}, Default: l2
    liblinear_svc:tol, Type: UniformFloat, Range: [1e-05, 0.1], Default: 0.0001, on log-scale
    libsvm_svc:C, Type: UniformFloat, Range: [0.03125, 32768.0], Default: 1.0, on log-scale
    libsvm_svc:coef0, Type: UniformFloat, Range: [-1.0, 1.0], Default: 0.0
    libsvm_svc:degree, Type: UniformInteger, Range: [2, 5], Default: 3
    libsvm_svc:gamma, Type: UniformFloat, Range: [3.0517578125e-05, 8.0], Default: 0.1, on log-scale
    libsvm_svc:kernel, Type: Categorical, Choices: {rbf, poly, sigmoid}, Default: rbf
    libsvm_svc:max_iter, Type: Constant, Value: -1
    libsvm_svc:shrinking, Type: Categorical, Choices: {True, False}, Default: True
    libsvm_svc:tol, Type: UniformFloat, Range: [1e-05, 0.1], Default: 0.001, on log-scale
    multinomial_nb:alpha, Type: UniformFloat, Range: [0.01, 100.0], Default: 1.0, on log-scale
    multinomial_nb:fit_prior, Type: Categorical, Choices: {True, False}, Default: True
    passive_aggressive:C, Type: UniformFloat, Range: [1e-05, 10.0], Default: 1.0, on log-scale
    passive_aggressive:average, Type: Categorical, Choices: {False, True}, Default: False
    passive_aggressive:fit_intercept, Type: Constant, Value: True
    passive_aggressive:loss, Type: Categorical, Choices: {hinge, squared_hinge}, Default: hinge
    passive_aggressive:tol, Type: UniformFloat, Range: [1e-05, 0.1], Default: 0.0001, on log-scale
    qda:reg_param, Type: UniformFloat, Range: [0.0, 1.0], Default: 0.0
    random_forest:bootstrap, Type: Categorical, Choices: {True, False}, Default: True
    random_forest:criterion, Type: Categorical, Choices: {gini, entropy}, Default: gini
    random_forest:max_depth, Type: Constant, Value: None
    random_forest:max_features, Type: UniformFloat, Range: [0.0, 1.0], Default: 0.5
    random_forest:max_leaf_nodes, Type: Constant, Value: None
    random_forest:min_impurity_decrease, Type: Constant, Value: 0.0
    random_forest:min_samples_leaf, Type: UniformInteger, Range: [1, 20], Default: 1
    random_forest:min_samples_split, Type: UniformInteger, Range: [2, 20], Default: 2
    random_forest:min_weight_fraction_leaf, Type: Constant, Value: 0.0
    random_forest:n_estimators, Type: Constant, Value: 100
    sgd:alpha, Type: UniformFloat, Range: [1e-07, 0.1], Default: 0.0001, on log-scale
    sgd:average, Type: Categorical, Choices: {False, True}, Default: False
    sgd:epsilon, Type: UniformFloat, Range: [1e-05, 0.1], Default: 0.0001, on log-scale
    sgd:eta0, Type: UniformFloat, Range: [1e-07, 0.1], Default: 0.01, on log-scale
    sgd:fit_intercept, Type: Constant, Value: True
    sgd:l1_ratio, Type: UniformFloat, Range: [1e-09, 1.0], Default: 0.15, on log-scale
    sgd:learning_rate, Type: Categorical, Choices: {optimal, invscaling, constant}, Default: invscaling
    sgd:loss, Type: Categorical, Choices: {hinge, log, modified_huber, squared_hinge, perceptron}, Default: log
    sgd:penalty, Type: Categorical, Choices: {l1, l2, elasticnet}, Default: l2
    sgd:power_t, Type: UniformFloat, Range: [1e-05, 1.0], Default: 0.5
    sgd:tol, Type: UniformFloat, Range: [1e-05, 0.1], Default: 0.0001, on log-scale
  Conditions:
    adaboost:algorithm | __choice__ == 'adaboost'
    adaboost:learning_rate | __choice__ == 'adaboost'
    adaboost:max_depth | __choice__ == 'adaboost'
    adaboost:n_estimators | __choice__ == 'adaboost'
    bernoulli_nb:alpha | __choice__ == 'bernoulli_nb'
    bernoulli_nb:fit_prior | __choice__ == 'bernoulli_nb'
    decision_tree:criterion | __choice__ == 'decision_tree'
    decision_tree:max_depth_factor | __choice__ == 'decision_tree'
    decision_tree:max_features | __choice__ == 'decision_tree'
    decision_tree:max_leaf_nodes | __choice__ == 'decision_tree'
    decision_tree:min_impurity_decrease | __choice__ == 'decision_tree'
    decision_tree:min_samples_leaf | __choice__ == 'decision_tree'
    decision_tree:min_samples_split | __choice__ == 'decision_tree'
    decision_tree:min_weight_fraction_leaf | __choice__ == 'decision_tree'
    extra_trees:bootstrap | __choice__ == 'extra_trees'
    extra_trees:criterion | __choice__ == 'extra_trees'
    extra_trees:max_depth | __choice__ == 'extra_trees'
    extra_trees:max_features | __choice__ == 'extra_trees'
    extra_trees:max_leaf_nodes | __choice__ == 'extra_trees'
    extra_trees:min_impurity_decrease | __choice__ == 'extra_trees'
    extra_trees:min_samples_leaf | __choice__ == 'extra_trees'
    extra_trees:min_samples_split | __choice__ == 'extra_trees'
    extra_trees:min_weight_fraction_leaf | __choice__ == 'extra_trees'
    extra_trees:n_estimators | __choice__ == 'extra_trees'
    gradient_boosting:early_stop | __choice__ == 'gradient_boosting'
    gradient_boosting:l2_regularization | __choice__ == 'gradient_boosting'
    gradient_boosting:learning_rate | __choice__ == 'gradient_boosting'
    gradient_boosting:loss | __choice__ == 'gradient_boosting'
    gradient_boosting:max_bins | __choice__ == 'gradient_boosting'
    gradient_boosting:max_depth | __choice__ == 'gradient_boosting'
    gradient_boosting:max_iter | __choice__ == 'gradient_boosting'
    gradient_boosting:max_leaf_nodes | __choice__ == 'gradient_boosting'
    gradient_boosting:min_samples_leaf | __choice__ == 'gradient_boosting'
    gradient_boosting:n_iter_no_change | gradient_boosting:early_stop in {'valid', 'train'}
    gradient_boosting:scoring | __choice__ == 'gradient_boosting'
    gradient_boosting:tol | __choice__ == 'gradient_boosting'
    gradient_boosting:validation_fraction | gradient_boosting:early_stop == 'valid'
    k_nearest_neighbors:n_neighbors | __choice__ == 'k_nearest_neighbors'
    k_nearest_neighbors:p | __choice__ == 'k_nearest_neighbors'
    k_nearest_neighbors:weights | __choice__ == 'k_nearest_neighbors'
    lda:n_components | __choice__ == 'lda'
    lda:shrinkage | __choice__ == 'lda'
    lda:shrinkage_factor | lda:shrinkage == 'manual'
    lda:tol | __choice__ == 'lda'
    liblinear_svc:C | __choice__ == 'liblinear_svc'
    liblinear_svc:dual | __choice__ == 'liblinear_svc'
    liblinear_svc:fit_intercept | __choice__ == 'liblinear_svc'
    liblinear_svc:intercept_scaling | __choice__ == 'liblinear_svc'
    liblinear_svc:loss | __choice__ == 'liblinear_svc'
    liblinear_svc:multi_class | __choice__ == 'liblinear_svc'
    liblinear_svc:penalty | __choice__ == 'liblinear_svc'
    liblinear_svc:tol | __choice__ == 'liblinear_svc'
    libsvm_svc:C | __choice__ == 'libsvm_svc'
    libsvm_svc:coef0 | libsvm_svc:kernel in {'poly', 'sigmoid'}
    libsvm_svc:degree | libsvm_svc:kernel == 'poly'
    libsvm_svc:gamma | __choice__ == 'libsvm_svc'
    libsvm_svc:kernel | __choice__ == 'libsvm_svc'
    libsvm_svc:max_iter | __choice__ == 'libsvm_svc'
    libsvm_svc:shrinking | __choice__ == 'libsvm_svc'
    libsvm_svc:tol | __choice__ == 'libsvm_svc'
    multinomial_nb:alpha | __choice__ == 'multinomial_nb'
    multinomial_nb:fit_prior | __choice__ == 'multinomial_nb'
    passive_aggressive:C | __choice__ == 'passive_aggressive'
    passive_aggressive:average | __choice__ == 'passive_aggressive'
    passive_aggressive:fit_intercept | __choice__ == 'passive_aggressive'
    passive_aggressive:loss | __choice__ == 'passive_aggressive'
    passive_aggressive:tol | __choice__ == 'passive_aggressive'
    qda:reg_param | __choice__ == 'qda'
    random_forest:bootstrap | __choice__ == 'random_forest'
    random_forest:criterion | __choice__ == 'random_forest'
    random_forest:max_depth | __choice__ == 'random_forest'
    random_forest:max_features | __choice__ == 'random_forest'
    random_forest:max_leaf_nodes | __choice__ == 'random_forest'
    random_forest:min_impurity_decrease | __choice__ == 'random_forest'
    random_forest:min_samples_leaf | __choice__ == 'random_forest'
    random_forest:min_samples_split | __choice__ == 'random_forest'
    random_forest:min_weight_fraction_leaf | __choice__ == 'random_forest'
    random_forest:n_estimators | __choice__ == 'random_forest'
    sgd:alpha | __choice__ == 'sgd'
    sgd:average | __choice__ == 'sgd'
    sgd:epsilon | sgd:loss == 'modified_huber'
    sgd:eta0 | sgd:learning_rate in {'invscaling', 'constant'}
    sgd:fit_intercept | __choice__ == 'sgd'
    sgd:l1_ratio | sgd:penalty == 'elasticnet'
    sgd:learning_rate | __choice__ == 'sgd'
    sgd:loss | __choice__ == 'sgd'
    sgd:penalty | __choice__ == 'sgd'
    sgd:power_t | sgd:learning_rate == 'invscaling'
    sgd:tol | __choice__ == 'sgd'
  Forbidden Clauses:
    (Forbidden: liblinear_svc:penalty == 'l1' && Forbidden: liblinear_svc:loss == 'hinge')
    (Forbidden: liblinear_svc:dual == 'False' && Forbidden: liblinear_svc:penalty == 'l2' && Forbidden: liblinear_svc:loss == 'hinge')
    (Forbidden: liblinear_svc:dual == 'False' && Forbidden: liblinear_svc:penalty == 'l1')

At this point we have essentially worked out how the hyperparameter space is constructed.

metalearn

  • Return to: autosklearn.automl.AutoML#_fit:462

The meta-learning computation happens in the _proc_smac stage: _proc_smac = AutoMLSMBO(... followed by _proc_smac.run_smbo().

self._initial_configurations_via_metalearning
Out[35]: 25
  • Step into: autosklearn.smbo.AutoMLSMBO#run_smbo
metalearning_configurations = self.get_metalearning_suggestions()
  • Step into: autosklearn.smbo.AutoMLSMBO#get_metalearning_suggestions

  • Step into: autosklearn.metalearning.metalearning.meta_base.MetaBase#__init__

  • Step into: autosklearn.metalearning.input.aslib_simple.AlgorithmSelectionProblem#__init__

This constructor reads in the files that meta-learning needs.

The "as" in aslib stands for algorithm selection.

(base) ~/PycharmProjects/automl/auto-sklearn/autosklearn/metalearning/files/accuracy_binary.classification_dense (tqc ✘)✹✭ ᐅ tree
.
├── algorithm_runs.arff
├── configurations.csv
├── description.txt
├── feature_costs.arff
├── feature_runstatus.arff
├── feature_values.arff
└── readme.txt

The following code also reveals the semantics of each file:

        self.read_funcs = {
            "algorithm_runs.arff": self._read_algorithm_runs,
            "feature_values.arff": self._read_feature_values,
            "configurations.csv": self._read_configurations
        }

Note the .arff files: this is the Attribute-Relation File Format (ARFF); see the overview for details.
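
Such files can be loaded with an ARFF reader that yields a dict with "attributes" and "data" keys, matching the arff_dict usage below; a hedged sketch using the liac-arff package:

    import arff  # the liac-arff package

    with open("algorithm_runs.arff") as fh:
        arff_dict = arff.load(fh)       # dict with 'relation', 'attributes', 'data', ...

    print(arff_dict["attributes"][0])   # (name, type) tuple for the first column
    print(arff_dict["data"][0])         # first data row, as a list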

  • Step into: autosklearn.metalearning.input.aslib_simple.AlgorithmSelectionProblem#_read_algorithm_runs

Note:

measure_instance_algorithm_triples = defaultdict(lambda: defaultdict(dict))

This is effectively a dictionary of depth 3 whose first two levels are defaultdicts.
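
A toy illustration of the nesting (measure -> instance -> algorithm -> performance):

    from collections import defaultdict

    d = defaultdict(lambda: defaultdict(dict))
    d["accuracy"]["2120"]["1"] = 0.0783   # both outer levels are created on demand
    print(d["accuracy"]["2120"])          # {'1': 0.0783}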

performance_measures
Out[4]: ['accuracy']

arff_dict["data"]
Out[5]: 
[['2120', 1.0, '1', 0.07826496935407823, 'ok'],
 ['75193', 1.0, '2', 0.03999833101239747, 'ok'],
 ['2117', 1.0, '3', 0.1586523546565738, 'ok'],
 ['75156', 1.0, '4', 0.21584478577202915, 'ok'],
        for data in arff_dict["data"]:
            inst_name = str(data[0])
            repetition = data[1]
            algorithm = str(data[2])
            perf_list = data[3:-1]
            status = data[-1]
                measure_instance_algorithm_triples[performance_measure][
                    inst_name][algorithm] = perf_list[i]

measure_instance_algorithm_triples is a three-level dictionary: the first level is the performance measure (e.g. accuracy), the second is the instance (e.g. a training task on some dataset), and the third is the algorithm invoked. The leaf value is the performance under the combination of those three keys (e.g., under the accuracy measure, the performance of an SVM on dataset A).
Inspecting the variable in the console:

measure_instance_algorithm_triples
Out[13]: 
defaultdict(<function autosklearn.metalearning.input.aslib_simple.AlgorithmSelectionProblem._read_algorithm_runs.<locals>.<lambda>()>,
            {'7': defaultdict(dict, {}),
             'accuracy': defaultdict(dict,
                         {'2120': {'1': 0.07826496935407823},
                          '75193': {'2': 0.03999833101239747},
                          '2117': {'3': 0.1586523546565738},
                          '75156': {'4': 0.21584478577202915},
                          ...
                          '75225': {'111': 0.11244019138755978},
                          '75141': {'112': 0.052904180540140566},
                          '75107': {'113': 0.050303030303030294},
                          '75097': {'114': 0.0602053084250439}})})
pd.DataFrame(measure_instance_algorithm_triples[pm])
Out[16]: 
         2120     75193      2117  ...     75141     75107     75097
1    0.078265       NaN       NaN  ...       NaN       NaN       NaN
2         NaN  0.039998       NaN  ...       NaN       NaN       NaN
3         NaN       NaN  0.158652  ...       NaN       NaN       NaN
4         NaN       NaN       NaN  ...       NaN       NaN       NaN
5         NaN       NaN       NaN  ...       NaN       NaN       NaN

The result is a matrix:

        measure_algorithm_matrices = OrderedDict()
        for pm in performance_measures:
            measure_algorithm_matrices[pm] = pd.DataFrame(
                measure_instance_algorithm_triples[pm]).transpose()

        self.algorithm_runs = measure_algorithm_matrices
  • Step into: autosklearn.metalearning.input.aslib_simple.AlgorithmSelectionProblem#_read_feature_values
filename
Out[19]: '/home/tqc/PycharmProjects/automl/auto-sklearn/autosklearn/metalearning/files/accuracy_binary.classification_dense/feature_values.arff'
arff_dict["data"][0]  # 0: 数据集名称 1: 重复次数  2::metafeature
Out[20]: 
['75249',
 1.0,
 0.3891675492147647,
 0.9236550632911392,
 0.5,
 0.07634493670886076,
 0.4236550632911392,
 0.011471518987341773,
 87.17241379310344,
 416.33571239756776,
 46.57141283642045,
 -3.0,
 ...
        for data in arff_dict["data"]:
            inst_name = data[0]
            repetition = data[1]
            features = data[2:]

Printing metafeatures:

'75239': {'ClassEntropy': 0.9443547030267275,
  'ClassProbabilityMax': 0.6379707916986933,
  'ClassProbabilityMean': 0.5,
  'ClassProbabilityMin': 0.3620292083013067,
  'ClassProbabilitySTD': 0.1379707916986933,
  'DatasetRatio': 0.02536510376633359,
  'InverseDatasetRatio': 39.42424242424242,
  'KurtosisMax': 30.96304702758789,
  'KurtosisMean': 5.229958094222913,
  'KurtosisMin': -1.8830041885375977,
  'KurtosisSTD': 9.159290860687747,
  'Landmark1NN': 0.9854139753376394,
  'LandmarkDecisionNodeLearner': 0.6379741632413387,
  'LandmarkDecisionTree': 1.0,
  'LandmarkLDA': 0.7832472108044627,
  'LandmarkNaiveBayes': 0.9976981796829125,
  'LandmarkRandomNodeLearner': 0.6379741632413387,
  'LogDatasetRatio': -3.674380917046025,
  'LogInverseDatasetRatio': 3.674380917046025,
  'LogNumberOfFeatures': 3.4965075614664802,
  'LogNumberOfInstances': 7.170888478512505,
  'NumberOfCategoricalFeatures': 0.0,
  'NumberOfClasses': 2.0,
  'NumberOfFeatures': 33.0,
  'NumberOfFeaturesWithMissingValues': 0.0,
  'NumberOfInstances': 1301.0,
  'NumberOfInstancesWithMissingValues': 0.0,
  'NumberOfMissingValues': 0.0,
  'NumberOfNumericFeatures': 33.0,
  'PCAFractionOfComponentsFor95PercentVariance': 0.5151515151515151,
  'PCAKurtosisFirstPC': 1.5021543502807617,
  'PCASkewnessFirstPC': 1.460637092590332,
  'PercentageOfFeaturesWithMissingValues': 0.0,
  'PercentageOfInstancesWithMissingValues': 0.0,
  'PercentageOfMissingValues': 0.0,
  'RatioNominalToNumerical': 0.0,
  'RatioNumericalToNominal': 0.0,
  'SkewnessMax': 5.591684341430664,
  'SkewnessMean': 1.5151810998266393,
  'SkewnessMin': -0.8763390183448792,
  'SkewnessSTD': 1.68213778251707,
  'SymbolsMax': 0.0,
  'SymbolsMean': 0.0,
  'SymbolsMin': 0.0,
  'SymbolsSTD': 0.0,
  'SymbolsSum': 0.0},

The last line:

self.metafeatures = pd.DataFrame(metafeatures).transpose()
self.metafeatures
Out[24]: 
       ClassEntropy  ClassProbabilityMax  ...  SymbolsSTD  SymbolsSum
75249      0.389168             0.923655  ...         0.0         0.0
75203      2.447791             0.289707  ...         0.0         0.0
75090      3.316085             0.111876  ...         0.0         0.0
75213      0.758988             0.780645  ...         0.0         0.0

self.metafeatures holds the meta-features of each dataset.

  • Step into: autosklearn.metalearning.input.aslib_simple.AlgorithmSelectionProblem#_read_configurations
filename
Out[27]: '/home/tqc/PycharmProjects/automl/auto-sklearn/autosklearn/metalearning/files/accuracy_binary.classification_dense/configurations.csv'

Printing configurations:

 '114': {'balancing:strategy': 'none',
  'categorical_encoding:__choice__': 'no_encoding',
  'classifier:__choice__': 'k_nearest_neighbors',
  'classifier:k_nearest_neighbors:n_neighbors': 3,
  'classifier:k_nearest_neighbors:p': 1,
  'classifier:k_nearest_neighbors:weights': 'uniform',
  'imputation:strategy': 'mean',
  'preprocessor:__choice__': 'polynomial',
  'preprocessor:polynomial:degree': 2,
  'preprocessor:polynomial:include_bias': 'True',
  'preprocessor:polynomial:interaction_only': 'False',
  'rescaling:__choice__': 'quantile_transformer',
  'rescaling:quantile_transformer:n_quantiles': 498,
  'rescaling:quantile_transformer:output_distribution': 'normal'}}

At this point the picture is clear:

Variable         Meaning                                        File                  Loader
configurations   hyperparameter settings per algorithm id       configurations.csv    _read_configurations
metafeatures     meta-features of each dataset                  feature_values.arff   _read_feature_values
algorithm_runs   per-measure matrix of algorithm performance    algorithm_runs.arff   _read_algorithm_runs

Note: self.algorithm_runs = measure_algorithm_matrices

We now know how the meta-features stored in the autosklearn source package correspond to the models it recommends.

  • Return to: autosklearn.smbo.AutoMLSMBO#get_metalearning_suggestions:601

  • Step into: autosklearn.smbo.AutoMLSMBO#_calculate_metafeatures_encoded

  • Step into: autosklearn.smbo._calculate_metafeatures_encoded

    EXCLUDE_META_FEATURES = EXCLUDE_META_FEATURES_CLASSIFICATION \
        if task in CLASSIFICATION_TASKS else EXCLUDE_META_FEATURES_REGRESSION

Depending on whether the task is regression or classification, some meta-features that should not be computed are excluded.
For instance, the task I am stepping through with the debugger is a classification task:

EXCLUDE_META_FEATURES
Out[28]: 
{'Landmark1NN',
 'LandmarkDecisionNodeLearner',
 'LandmarkDecisionTree',
 'LandmarkLDA',
 'LandmarkNaiveBayes',
 'PCA',
 'PCAFractionOfComponentsFor95PercentVariance',
 'PCAKurtosisFirstPC',
 'PCASkewnessFirstPC'}
    result = calculate_all_metafeatures_encoded_labels(
        x_train, y_train, categorical=[False] * x_train.shape[1],
        dataset_name=basename, dont_calculate=EXCLUDE_META_FEATURES)
  • Step into: autosklearn.metalearning.metafeatures.metafeatures.calculate_all_metafeatures_encoded_labels

The body of this function is:

    calculate = set()
    calculate.update(npy_metafeatures)
    return calculate_all_metafeatures(X, y, categorical, dataset_name,
                                      calculate=calculate,
                                      dont_calculate=dont_calculate)

npy_metafeatures here is a module-level global:

npy_metafeatures = set(["LandmarkLDA",
                        "LandmarkNaiveBayes",
                        "LandmarkDecisionTree",
                        "LandmarkDecisionNodeLearner",
                        "LandmarkRandomNodeLearner",
                        "LandmarkWorstNodeLearner",
                        "Landmark1NN",
                        "PCAFractionOfComponentsFor95PercentVariance",
                        "PCAKurtosisFirstPC",
                        "PCASkewnessFirstPC",
                        "Skewnesses",
                        "SkewnessMin",
                        "SkewnessMax",
                        "SkewnessMean",
                        "SkewnessSTD",
                        "Kurtosisses",
                        "KurtosisMin",
                        "KurtosisMax",
                        "KurtosisMean",
                        "KurtosisSTD"])
  • Step into: autosklearn.metalearning.metafeatures.metafeatures.calculate_all_metafeatures

Note the two module-level globals:

metafeatures = MetafeatureFunctions()
helper_functions = HelperFunctions()

Classes decorated with @metafeatures.define register an instance of themselves into the metafeatures object via __setitem__; metafeatures is a MetafeatureFunctions instance.

An excerpt as an example:

@metafeatures.define("NumberOfInstances")
class NumberOfInstances(MetaFeature):
    def _calculate(self, X, y, categorical):
        return float(X.shape[0])
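
A minimal sketch of how such a define decorator can register instances via __setitem__ (an illustrative reimplementation, not the exact autosklearn code):

    class MetafeatureFunctions:
        def __init__(self):
            self.functions = {}

        def __setitem__(self, name, obj):
            self.functions[name] = obj

        def __getitem__(self, name):
            return self.functions[name]

        def define(self, name):
            def wrapper(cls):
                self[name] = cls()   # store an instance, so metafeatures[name](...) works later
                return cls
            return wrapper

    metafeatures = MetafeatureFunctions()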

If the meta-feature currently being parsed belongs to the npy set (if name in npy_metafeatures:), X is first transformed into X_transformed:

X_transformed = imputer.fit_transform(X_transformed)
X_transformed = standard_scaler.fit_transform(X_transformed)

Finally, the value of the current meta-feature is computed directly:

value = metafeatures[name](X_, y_, categorical_)

metafeatures[name] is an instance followed by parentheses, so presumably the __call__ magic method is at work. Tracing two levels up the class hierarchy leads to autosklearn.metalearning.metafeatures.metafeature.AbstractMetaFeature#__call__:

    def __call__(self, X, y, categorical=None):
        if categorical is None:
            categorical = [False for i in range(X.shape[1])]
        starttime = time.time()

        try:
            if scipy.sparse.issparse(X) and hasattr(self, "_calculate_sparse"):
                value = self._calculate_sparse(X, y, categorical)
            else:
                value = self._calculate(X, y, categorical)
            comment = ""
        except MemoryError as e:
            value = None
            comment = "Memory Error"

        endtime = time.time()
        return MetaFeatureValue(self.__class__.__name__, self.type_,
                                0, 0, value, endtime-starttime, comment=comment)

The return value is a MetaFeatureValue instance, which wraps the meta-feature's value along with some bookkeeping information.

mf_ is a dict whose keys are meta-feature names and whose values are the corresponding MetaFeatureValue instances.

At this point we have worked out how a dataset's meta-features are computed. Next:

  • Return to: autosklearn.smbo.AutoMLSMBO#get_metalearning_suggestions:538
meta_features
Out[2]: 
Metafeatures for dataset breast_cancer
  ClassEntropy: 0.9495480401701638
  SymbolsSum: 0.0
  SymbolsSTD: 0
  SymbolsMean: 0
  SymbolsMax: 0
  SymbolsMin: 0
  ClassProbabilitySTD: 0.13145539906103287
  ClassProbabilityMean: 0.5
  ClassProbabilityMax: 0.6314553990610329
  ClassProbabilityMin: 0.3685446009389671
  InverseDatasetRatio: 14.2
  DatasetRatio: 0.07042253521126761
  RatioNominalToNumerical: 0.0
  RatioNumericalToNominal: 0.0
  NumberOfCategoricalFeatures: 0
  NumberOfNumericFeatures: 30
  NumberOfMissingValues: 0.0
  NumberOfFeaturesWithMissingValues: 0.0
  NumberOfInstancesWithMissingValues: 0.0
  NumberOfFeatures: 30.0
  NumberOfClasses: 2.0
  NumberOfInstances: 426.0
  LogInverseDatasetRatio: 2.653241964607215
  LogDatasetRatio: -2.653241964607215
  PercentageOfMissingValues: 0.0
  PercentageOfFeaturesWithMissingValues: 0.0
  PercentageOfInstancesWithMissingValues: 0.0
  LogNumberOfFeatures: 3.4011973816621555
  LogNumberOfInstances: 6.054439346269371
meta_features_encoded
Out[6]: 
Metafeatures for dataset breast_cancer
  LandmarkRandomNodeLearner: 0.346312292358804
  SkewnessSTD: 1.3079569844668182
  SkewnessMean: 1.719863689861412
  SkewnessMax: 5.497448960200661
  SkewnessMin: 0.3638245081126332
  KurtosisSTD: 13.149786948568595
  KurtosisMean: 7.73073011481775
  KurtosisMax: 54.11573179309323
  KurtosisMin: -0.5734114126286567
                    meta_base.add_dataset(self.dataset_name, meta_features)
                    all_metafeatures = meta_base.get_metafeatures(

At this point all_metafeatures is a matrix that includes the current dataset:

all_metafeatures
Out[7]: 
               ClassEntropy  SymbolsSum  ...  KurtosisMax  KurtosisMin
75249              0.389168         0.0  ...   416.335712    -3.000000
75203              2.447791         0.0  ...  1381.000806    -3.000000
75090              3.316085         0.0  ...  2319.000488    -3.000000
75213              0.758988         0.0  ...    34.615789    -1.998501
75157              0.991535         0.0  ...    11.286622    -0.840580
...                     ...         ...  ...          ...          ...
75198              5.269287         0.0  ...  1690.960100    -3.000000
75156              0.995151         0.0  ...  2508.999535    -3.000000
75114              0.770740         0.0  ...  1028.527714    -1.146804
75230              6.618346         0.0  ...     8.500022     0.019823
breast_cancer      0.949548         0.0  ...    54.115732    -0.573411
[133 rows x 38 columns]

Then comes a crucial line of code:

metalearning_configurations = self.collect_metalearning_suggestions(meta_base)

The return value of the function we are in, get_metalearning_suggestions, is exactly metalearning_configurations. So, given the meta-features of the already-trained datasets and of the current dataset, a distance-based lookup can find similar datasets and recommend their configurations to the current training task.
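
Conceptually, the suggestion step ranks previously seen datasets by L1 distance in meta-feature space (the distance='l1' argument below); a hedged sketch of the idea, ignoring the scaling the real MetaLearningOptimizer applies:

    import pandas as pd

    def rank_similar_datasets(all_metafeatures: pd.DataFrame, target_name: str) -> pd.Series:
        # L1 distance between the target dataset's meta-feature vector and every other row
        target = all_metafeatures.loc[target_name]
        others = all_metafeatures.drop(index=target_name)
        distances = (others - target).abs().sum(axis=1)
        return distances.sort_values()   # smallest distance = most similar dataset

The best-known configurations of the nearest datasets are then suggested first.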

  • Step into: autosklearn.smbo.AutoMLSMBO#collect_metalearning_suggestions

  • Step into: autosklearn.smbo._get_metalearning_configurations

  • Step into: autosklearn.metalearning.mismbo.suggest_via_metalearning

    ml = MetaLearningOptimizer(
        dataset_name=dataset_name,
        configuration_space=meta_base.configuration_space,
        meta_base=meta_base,
        distance='l1',
        seed=1,)
    runs = ml.metalearning_suggest_all(exclude_double_configurations=True)
  • Step into: autosklearn.metalearning.optimizers.metalearn_optimizer.metalearner.MetaLearningOptimizer#metalearning_suggest_all

  • Step into: autosklearn.metalearning.optimizers.metalearn_optimizer.metalearner.MetaLearningOptimizer#_learn

Inside _split_metafeature_array we see:

return dataset_metafeatures, all_other_metafeatures

dataset_metafeatures: the meta-features of the current dataset
all_other_metafeatures: the meta-features of all other datasets

all_other_metafeatures.shape
Out[11]: (132, 46)
dataset_metafeatures.shape
Out[12]: (46,)
        keep = []
        for idx in dataset_metafeatures.index:
            if np.isfinite(dataset_metafeatures.loc[idx]):
                keep.append(idx)

keep: all the meta-features that should enter the computation (those with finite values for the current dataset)

all_other_metafeatures = all_other_metafeatures.fillna(all_other_metafeatures.mean())

Missing values are filled in with the column means.
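
A toy example of the fillna-with-column-mean pattern:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 2.0, 4.0]})
    print(df.fillna(df.mean()))   # NaNs become 2.0 (column a) and 3.0 (column b)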

runs
Out[21]: 
     75249  75203  75090  75213  75157  ...  75112  75198     75156  75114  75230
1      NaN    NaN    NaN    NaN    NaN  ...    NaN    NaN       NaN    NaN    NaN
2      NaN    NaN    NaN    NaN    NaN  ...    NaN    NaN       NaN    NaN    NaN
3      NaN    NaN    NaN    NaN    NaN  ...    NaN    NaN       NaN    NaN    NaN
4      NaN    NaN    NaN    NaN    NaN  ...    NaN    NaN  0.215845    NaN    NaN
5      NaN    NaN    NaN    NaN    NaN  ...    NaN    NaN       NaN    NaN    NaN

Having finished with the meta-learning recommendations, we return to where we started.

  • Return to: autosklearn.smbo.AutoMLSMBO#run_smbo:388

Finally, how are the meta-learned configurations added to SMAC? See autosklearn.smbo.get_smac_object:

        default_config = scenario.cs.get_default_configuration()
        initial_configurations = [default_config] + metalearning_configurations
    return SMAC(
        scenario=scenario,
        rng=seed,
        runhistory2epm=rh2EPM,
        tae_runner=ta,
        initial_configurations=initial_configurations,
        runhistory=runhistory,
        run_id=seed,
    )
