Understanding the autosklearn Source Code

self.steps of the classification pipeline looks like this:

0 ['categorical_encoding', <autosklearn.pipeline.components.data_preprocessing.one_hot_encoding.OHEChoice object at 0x7f7a74dfa8d0>]
1 ['imputation', Imputation(random_state=None, strategy='median')]
2 ['variance_threshold', VarianceThreshold(random_state=None)]
3 ['rescaling', <autosklearn.pipeline.components.data_preprocessing.rescaling.RescalingChoice object at 0x7f7a74dfa780>]
4 ['balancing', Balancing(random_state=None, strategy='none')]
5 ['preprocessor', <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7f7a74dc6390>]
6 ['classifier', <autosklearn.pipeline.components.classification.ClassifierChoice object at 0x7f7a74dc6048>]
Which of these pipeline node classes are Choice objects:

OHEChoice                  choice
Imputation                 not a choice
VarianceThreshold          not a choice
RescalingChoice            choice
Balancing                  not a choice
FeaturePreprocessorChoice  choice
ClassifierChoice           choice
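
Whether a node is a Choice can be checked against the common base class; a minimal sketch, assuming pipeline is a SimpleClassificationPipeline instance and that AutoSklearnChoice in autosklearn.pipeline.components.base is the shared base of the *Choice components:

    from autosklearn.pipeline.components.base import AutoSklearnChoice

    for name, node in pipeline.steps:
        kind = "choice" if isinstance(node, AutoSklearnChoice) else "not choice"
        print(name, type(node).__name__, kind)

The components available at each Choice node are discovered dynamically; find_components below (from autosklearn/pipeline/components/base.py) performs that scan: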
from collections import OrderedDict
import importlib
import inspect
import pkgutil
import sys

def find_components(package, directory, base_class):
    components = OrderedDict()

    for module_loader, module_name, ispkg in pkgutil.iter_modules([directory]):
        full_module_name = "%s.%s" % (package, module_name)
        if full_module_name not in sys.modules and not ispkg:
            module = importlib.import_module(full_module_name)

            for member_name, obj in inspect.getmembers(module):
                if inspect.isclass(obj) and issubclass(obj, base_class) and \
                        obj != base_class:
                    # TODO test if the obj implements the interface
                    # Keep in mind that this only instantiates the ensemble_wrapper,
                    # but not the real target classifier
                    classifier = obj
                    components[module_name] = classifier

    return components

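# Note: classifier_directory and AutoSklearnClassificationAlgorithm come from the
# enclosing module (autosklearn/pipeline/components/classification/__init__.py in the source).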
_classifiers = find_components(__package__,
                               classifier_directory,
                               AutoSklearnClassificationAlgorithm)

At autosklearn/ensemble_builder.py:389:

            self.logger.warning("No models better than random - "
                                "using Dummy Score!")

In automl.py, the _fit function obtains an EnsembleBuilder object via self._proc_ensemble = self._get_ensemble_process(time_left_for_ensembles); the code lives in ~/ensemble_builder.py.
_proc_ensemble inherits from the multiprocessing Process class, so its run method executes in a separate process.
run in turn calls EnsembleBuilder.main.
Worth studying in depth: autosklearn.ensembles.ensemble_selection.EnsembleSelection#_fast
The _ensemble variable is presumably loaded via load_model.
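
A minimal, illustrative version of that run-in-a-subprocess pattern (not the actual EnsembleBuilder code):

    import multiprocessing

    class Builder(multiprocessing.Process):
        def run(self):
            # executed in the child process once .start() is called
            self.main()

        def main(self):
            print("building ensembles in a separate process")

    if __name__ == "__main__":
        proc = Builder()
        proc.start()   # spawns the child, which invokes run()
        proc.join()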

Excerpts from the auto-sklearn paper on how the ensemble is built:

We experimented with different approaches to optimize these weights: stacking [26], gradient-free numerical optimization, and the method ensemble selection [24].

we found both numerical optimization and stacking to overfit to the validation set and to be
computationally costly

In a nutshell, ensemble selection (introduced by Caruana et al. [24]) is a greedy procedure that starts from an empty ensemble and then iteratively adds the model that maximizes ensemble validation performance (with uniform weight, but allowing for repetitions)
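
A minimal sketch of that greedy procedure, assuming val_preds is a list of per-model validation predictions and loss_fn is a loss to be minimized (illustrative only; autosklearn's optimized version is EnsembleSelection#_fast):

    import numpy as np

    def ensemble_selection(val_preds, y_val, loss_fn, n_iter=50):
        # Start from an empty ensemble; repeatedly add (with replacement) the model
        # whose inclusion minimizes the ensemble's validation loss.
        ensemble = []
        running_sum = np.zeros_like(val_preds[0])
        for _ in range(n_iter):
            losses = [loss_fn(y_val, (running_sum + p) / (len(ensemble) + 1))
                      for p in val_preds]
            best = int(np.argmin(losses))
            ensemble.append(best)
            running_sum += val_preds[best]
        # uniform weights over picks = per-model weights proportional to pick counts
        return np.bincount(ensemble, minlength=len(val_preds)) / len(ensemble)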

We note that SMAC [9] can handle this conditionality
natively

The 14 possible feature preprocessing methods can be categorized into
feature selection (2), kernel approximation (2), matrix decomposition (3), embeddings (1), feature
clustering (1), polynomial feature expansion (1) and methods that use a classifier for feature selection
(2).

Table of Contents

  • Overview of the basic auto-sklearn execution flow
    • Creating the search space
    • metalearn

Overview of the basic auto-sklearn execution flow

The main entry point: autosklearn.automl.AutoMLClassifier#fit
After the subclass sets up some required parameters, it calls the parent class's fit method, i.e. autosklearn.automl.AutoML#fit.
After wrapping X and y via loaded_data_manager = XYDataManager(..., it calls return self._fit(...

Creating the search space

self.configuration_space, configspace_path = self._create_search_space(
Step into: autosklearn.automl.AutoML#_create_search_space
There we find configuration_space = pipeline.get_configuration_space(
Step into: autosklearn.util.pipeline.get_configuration_space
In this function, after the info dict has been populated, the final block is:

    if info['task'] in REGRESSION_TASKS:
        return _get_regression_configuration_space(info, include, exclude)
    else:
        return _get_classification_configuration_space(info, include, exclude)
  • Step into: autosklearn.util.pipeline._get_classification_configuration_space

The final block:

    return SimpleClassificationPipeline(
        dataset_properties=dataset_properties,
        include=include, exclude=exclude).\
        get_hyperparameter_search_space()
  • Step into: autosklearn.pipeline.base.BasePipeline#get_hyperparameter_search_space

The final block:

        if not hasattr(self, 'config_space') or self.config_space is None:
            self.config_space = self._get_hyperparameter_search_space(
                include=self.include_, exclude=self.exclude_,
                dataset_properties=self.dataset_properties_)
        return self.config_space
  • Step into: autosklearn.pipeline.classification.SimpleClassificationPipeline#_get_hyperparameter_search_space

After many jumps and stack pushes, we have finally reached the region with the richest substance.
We see the following code:

        cs = self._get_base_search_space(
            cs=cs, dataset_properties=dataset_properties,
            exclude=exclude, include=include, pipeline=self.steps)

Note that self.steps here holds all the nodes of the Pipeline that autosklearn tries to optimize.

  • Step into: autosklearn.pipeline.base.BasePipeline#_get_base_search_space

This function fetches matches; we want to know where matches comes from:

  • Step into: autosklearn.pipeline.create_searchspace_util.get_match_array

The loop for node_name, node in pipeline: builds an important variable, node_i_choices. It is a 2-D list: in the vanilla setup the first dimension has length 7, one entry per Pipeline node, and each sub-list holds all the options that node can choose from.
The first four entries serve as an example:

node_i_choices[0]
Out[16]: 
[autosklearn.pipeline.components.data_preprocessing.one_hot_encoding.no_encoding.NoEncoding,
 autosklearn.pipeline.components.data_preprocessing.one_hot_encoding.one_hot_encoding.OneHotEncoder]
node_i_choices[1]
Out[17]: [Imputation(random_state=None, strategy='median')]
node_i_choices[2]
Out[18]: [VarianceThreshold(random_state=None)]
node_i_choices[3]
Out[19]: 
[autosklearn.pipeline.components.data_preprocessing.rescaling.minmax.MinMaxScalerComponent,
 autosklearn.pipeline.components.data_preprocessing.rescaling.none.NoRescalingComponent,
 autosklearn.pipeline.components.data_preprocessing.rescaling.normalize.NormalizerComponent,
 autosklearn.pipeline.components.data_preprocessing.rescaling.quantile_transformer.QuantileTransformerComponent,
 autosklearn.pipeline.components.data_preprocessing.rescaling.robust_scaler.RobustScalerComponent,
 autosklearn.pipeline.components.data_preprocessing.rescaling.standardize.StandardScalerComponent]

Afterwards, matches_dimensions records the length of each sub-list; it is used to allocate a high-dimensional tensor, matches:

matches_dimensions
Out[20]: [2, 1, 1, 6, 1, 15, 15]
matches = np.ones(matches_dimensions, dtype=int)
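
As a toy illustration of this tensor and of the tuple indexing used next (each coordinate picks one option per node; setting an entry to 0 marks that combination as illegal):

    import numpy as np

    dims = [2, 1, 1, 6, 1, 15, 15]
    matches = np.ones(dims, dtype=int)    # 2 * 6 * 15 * 15 = 2700 instantiations
    matches[(0, 0, 0, 0, 0, 0, 0)] = 0    # rule out one pipeline instantiation
    print(matches.size, matches.sum())    # 2700 2699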

Then we see:

    pipeline_idxs = [range(dim) for dim in matches_dimensions]
    for pipeline_instantiation_idxs in itertools.product(*pipeline_idxs):

This can be read as enumerating every possible instantiation of the Pipeline.
pipeline_instantiation_idxs is the coordinate of one such instantiation inside matches:

pipeline_instantiation_idxs
Out[25]: (0, 0, 0, 0, 0, 0, 0)
            node_input = node.get_properties()['input']
            node_output = node.get_properties()['output']
node_input
Out[26]: (5, 6, 10)
node_output
Out[27]: (8,)

This looks puzzling at first glance, but jumping into get_properties we see:

                'input': (DENSE, SPARSE, UNSIGNED_DATA),
                'output': (PREDICTIONS,)}

These tuples evidently declare which data types a node can accept and produce.
First, sparse/dense compatibility is checked:

            # First check if these two instantiations of this node can work
            # together. Do this in multiple if statements to maintain
            # readability
            if (data_is_sparse and SPARSE not in node_input) or \
                    not data_is_sparse and DENSE not in node_input:
                matches[pipeline_instantiation_idxs] = 0
                break
            # No need to check if the node can handle SIGNED_DATA; this is
            # always assumed to be true
            elif not dataset_is_signed and UNSIGNED_DATA not in node_input:
                matches[pipeline_instantiation_idxs] = 0
                break

The remaining checks follow the same pattern; the whole point is to verify that a pipeline instantiation is legal. The source is fairly intricate, so I skip the details for now.
Finally, matches is returned.

  • Return to: autosklearn.pipeline.base.BasePipeline#_get_base_search_space:293
            if not is_choice:
                cs.add_configuration_space(node_name,
                                           node.get_hyperparameter_search_space(dataset_properties))
            # If the node isn't a choice, we have to figure out which of it's
            #  choices are actually legal choices
            else:
                choices_list = \
                    autosklearn.pipeline.create_searchspace_util.find_active_choices(
                        matches, node, node_idx,
                        dataset_properties,
                        include.get(node_name),
                        exclude.get(node_name)
                    )
                sub_config_space = node.get_hyperparameter_search_space(
                    dataset_properties, include=choices_list)
                cs.add_configuration_space(node_name, sub_config_space)

For a choice node, execution enters the else branch, where choices_list holds all the legal candidates:

choices_list
Out[29]: ['no_encoding', 'one_hot_encoding']

Printing further:

sub_config_space
Out[30]: 
Configuration space object:
  Hyperparameters:
    __choice__, Type: Categorical, Choices: {no_encoding, one_hot_encoding}, Default: one_hot_encoding
    one_hot_encoding:minimum_fraction, Type: UniformFloat, Range: [0.0001, 0.5], Default: 0.01, on log-scale
    one_hot_encoding:use_minimum_fraction, Type: Categorical, Choices: {True, False}, Default: True
  Conditions:
    one_hot_encoding:minimum_fraction | one_hot_encoding:use_minimum_fraction == 'True'
    one_hot_encoding:use_minimum_fraction | __choice__ == 'one_hot_encoding'

Printing the feature-preprocessing part:

Configuration space object:
  Hyperparameters:
    __choice__, Type: Categorical, Choices: {extra_trees_preproc_for_classification, fast_ica, feature_agglomeration, kernel_pca, kitchen_sinks, liblinear_svc_preprocessor, no_preprocessing, nystroem_sampler, pca, polynomial, random_trees_embedding, select_percentile_classification, select_rates}, Default: no_preprocessing
    extra_trees_preproc_for_classification:bootstrap, Type: Categorical, Choices: {True, False}, Default: False
    extra_trees_preproc_for_classification:criterion, Type: Categorical, Choices: {gini, entropy}, Default: gini
    extra_trees_preproc_for_classification:max_depth, Type: Constant, Value: None
    extra_trees_preproc_for_classification:max_features, Type: UniformFloat, Range: [0.0, 1.0], Default: 0.5
    extra_trees_preproc_for_classification:max_leaf_nodes, Type: Constant, Value: None
    extra_trees_preproc_for_classification:min_impurity_decrease, Type: Constant, Value: 0.0
    extra_trees_preproc_for_classification:min_samples_leaf, Type: UniformInteger, Range: [1, 20], Default: 1
    extra_trees_preproc_for_classification:min_samples_split, Type: UniformInteger, Range: [2, 20], Default: 2
    extra_trees_preproc_for_classification:min_weight_fraction_leaf, Type: Constant, Value: 0.0
    extra_trees_preproc_for_classification:n_estimators, Type: Constant, Value: 100
    fast_ica:algorithm, Type: Categorical, Choices: {parallel, deflation}, Default: parallel
    fast_ica:fun, Type: Categorical, Choices: {logcosh, exp, cube}, Default: logcosh
    fast_ica:n_components, Type: UniformInteger, Range: [10, 2000], Default: 100
    fast_ica:whiten, Type: Categorical, Choices: {False, True}, Default: False
    feature_agglomeration:affinity, Type: Categorical, Choices: {euclidean, manhattan, cosine}, Default: euclidean
    feature_agglomeration:linkage, Type: Categorical, Choices: {ward, complete, average}, Default: ward
    feature_agglomeration:n_clusters, Type: UniformInteger, Range: [2, 400], Default: 25
    feature_agglomeration:pooling_func, Type: Categorical, Choices: {mean, median, max}, Default: mean
    kernel_pca:coef0, Type: UniformFloat, Range: [-1.0, 1.0], Default: 0.0
    kernel_pca:degree, Type: UniformInteger, Range: [2, 5], Default: 3
    kernel_pca:gamma, Type: UniformFloat, Range: [3.0517578125e-05, 8.0], Default: 1.0, on log-scale
    kernel_pca:kernel, Type: Categorical, Choices: {poly, rbf, sigmoid, cosine}, Default: rbf
    kernel_pca:n_components, Type: UniformInteger, Range: [10, 2000], Default: 100
    kitchen_sinks:gamma, Type: UniformFloat, Range: [3.0517578125e-05, 8.0], Default: 1.0, on log-scale
    kitchen_sinks:n_components, Type: UniformInteger, Range: [50, 10000], Default: 100, on log-scale
    liblinear_svc_preprocessor:C, Type: UniformFloat, Range: [0.03125, 32768.0], Default: 1.0, on log-scale
    liblinear_svc_preprocessor:dual, Type: Constant, Value: False
    liblinear_svc_preprocessor:fit_intercept, Type: Constant, Value: True
    liblinear_svc_preprocessor:intercept_scaling, Type: Constant, Value: 1
    liblinear_svc_preprocessor:loss, Type: Categorical, Choices: {hinge, squared_hinge}, Default: squared_hinge
    liblinear_svc_preprocessor:multi_class, Type: Constant, Value: ovr
    liblinear_svc_preprocessor:penalty, Type: Constant, Value: l1
    liblinear_svc_preprocessor:tol, Type: UniformFloat, Range: [1e-05, 0.1], Default: 0.0001, on log-scale
    nystroem_sampler:coef0, Type: UniformFloat, Range: [-1.0, 1.0], Default: 0.0
    nystroem_sampler:degree, Type: UniformInteger, Range: [2, 5], Default: 3
    nystroem_sampler:gamma, Type: UniformFloat, Range: [3.0517578125e-05, 8.0], Default: 0.1, on log-scale
    nystroem_sampler:kernel, Type: Categorical, Choices: {poly, rbf, sigmoid, cosine}, Default: rbf
    nystroem_sampler:n_components, Type: UniformInteger, Range: [50, 10000], Default: 100, on log-scale
    pca:keep_variance, Type: UniformFloat, Range: [0.5, 0.9999], Default: 0.9999
    pca:whiten, Type: Categorical, Choices: {False, True}, Default: False
    polynomial:degree, Type: UniformInteger, Range: [2, 3], Default: 2
    polynomial:include_bias, Type: Categorical, Choices: {True, False}, Default: True
    polynomial:interaction_only, Type: Categorical, Choices: {False, True}, Default: False
    random_trees_embedding:bootstrap, Type: Categorical, Choices: {True, False}, Default: True
    random_trees_embedding:max_depth, Type: UniformInteger, Range: [2, 10], Default: 5
    random_trees_embedding:max_leaf_nodes, Type: Constant, Value: None
    random_trees_embedding:min_samples_leaf, Type: UniformInteger, Range: [1, 20], Default: 1
    random_trees_embedding:min_samples_split, Type: UniformInteger, Range: [2, 20], Default: 2
    random_trees_embedding:min_weight_fraction_leaf, Type: Constant, Value: 1.0
    random_trees_embedding:n_estimators, Type: UniformInteger, Range: [10, 100], Default: 10
    select_percentile_classification:percentile, Type: UniformFloat, Range: [1.0, 99.0], Default: 50.0
    select_percentile_classification:score_func, Type: Categorical, Choices: {chi2, f_classif, mutual_info}, Default: chi2
    select_rates:alpha, Type: UniformFloat, Range: [0.01, 0.5], Default: 0.1
    select_rates:mode, Type: Categorical, Choices: {fpr, fdr, fwe}, Default: fpr
    select_rates:score_func, Type: Categorical, Choices: {chi2, f_classif}, Default: chi2
  Conditions:
    extra_trees_preproc_for_classification:bootstrap | __choice__ == 'extra_trees_preproc_for_classification'
    extra_trees_preproc_for_classification:criterion | __choice__ == 'extra_trees_preproc_for_classification'
    extra_trees_preproc_for_classification:max_depth | __choice__ == 'extra_trees_preproc_for_classification'
    extra_trees_preproc_for_classification:max_features | __choice__ == 'extra_trees_preproc_for_classification'
    extra_trees_preproc_for_classification:max_leaf_nodes | __choice__ == 'extra_trees_preproc_for_classification'
    extra_trees_preproc_for_classification:min_impurity_decrease | __choice__ == 'extra_trees_preproc_for_classification'
    extra_trees_preproc_for_classification:min_samples_leaf | __choice__ == 'extra_trees_preproc_for_classification'
    extra_trees_preproc_for_classification:min_samples_split | __choice__ == 'extra_trees_preproc_for_classification'
    extra_trees_preproc_for_classification:min_weight_fraction_leaf | __choice__ == 'extra_trees_preproc_for_classification'
    extra_trees_preproc_for_classification:n_estimators | __choice__ == 'extra_trees_preproc_for_classification'
    fast_ica:algorithm | __choice__ == 'fast_ica'
    fast_ica:fun | __choice__ == 'fast_ica'
    fast_ica:n_components | fast_ica:whiten == 'True'
    fast_ica:whiten | __choice__ == 'fast_ica'
    feature_agglomeration:affinity | __choice__ == 'feature_agglomeration'
    feature_agglomeration:linkage | __choice__ == 'feature_agglomeration'
    feature_agglomeration:n_clusters | __choice__ == 'feature_agglomeration'
    feature_agglomeration:pooling_func | __choice__ == 'feature_agglomeration'
    kernel_pca:degree | kernel_pca:kernel == 'poly'
    kernel_pca:kernel | __choice__ == 'kernel_pca'
    kernel_pca:n_components | __choice__ == 'kernel_pca'
    kitchen_sinks:gamma | __choice__ == 'kitchen_sinks'
    kitchen_sinks:n_components | __choice__ == 'kitchen_sinks'
    liblinear_svc_preprocessor:C | __choice__ == 'liblinear_svc_preprocessor'
    liblinear_svc_preprocessor:dual | __choice__ == 'liblinear_svc_preprocessor'
    liblinear_svc_preprocessor:fit_intercept | __choice__ == 'liblinear_svc_preprocessor'
    liblinear_svc_preprocessor:intercept_scaling | __choice__ == 'liblinear_svc_preprocessor'
    liblinear_svc_preprocessor:loss | __choice__ == 'liblinear_svc_preprocessor'
    liblinear_svc_preprocessor:multi_class | __choice__ == 'liblinear_svc_preprocessor'
    liblinear_svc_preprocessor:penalty | __choice__ == 'liblinear_svc_preprocessor'
    liblinear_svc_preprocessor:tol | __choice__ == 'liblinear_svc_preprocessor'
    nystroem_sampler:degree | nystroem_sampler:kernel == 'poly'
    nystroem_sampler:kernel | __choice__ == 'nystroem_sampler'
    nystroem_sampler:n_components | __choice__ == 'nystroem_sampler'
    pca:keep_variance | __choice__ == 'pca'
    pca:whiten | __choice__ == 'pca'
    polynomial:degree | __choice__ == 'polynomial'
    polynomial:include_bias | __choice__ == 'polynomial'
    polynomial:interaction_only | __choice__ == 'polynomial'
    preprocessor:kernel_pca:coef0 | preprocessor:kernel_pca:kernel in {'poly', 'sigmoid'}
    preprocessor:kernel_pca:gamma | preprocessor:kernel_pca:kernel in {'poly', 'rbf'}
    preprocessor:nystroem_sampler:coef0 | preprocessor:nystroem_sampler:kernel in {'poly', 'sigmoid'}
    preprocessor:nystroem_sampler:gamma | preprocessor:nystroem_sampler:kernel in {'poly', 'rbf', 'sigmoid'}
    random_trees_embedding:bootstrap | __choice__ == 'random_trees_embedding'
    random_trees_embedding:max_depth | __choice__ == 'random_trees_embedding'
    random_trees_embedding:max_leaf_nodes | __choice__ == 'random_trees_embedding'
    random_trees_embedding:min_samples_leaf | __choice__ == 'random_trees_embedding'
    random_trees_embedding:min_samples_split | __choice__ == 'random_trees_embedding'
    random_trees_embedding:min_weight_fraction_leaf | __choice__ == 'random_trees_embedding'
    random_trees_embedding:n_estimators | __choice__ == 'random_trees_embedding'
    select_percentile_classification:percentile | __choice__ == 'select_percentile_classification'
    select_percentile_classification:score_func | __choice__ == 'select_percentile_classification'
    select_rates:alpha | __choice__ == 'select_rates'
    select_rates:mode | __choice__ == 'select_rates'
    select_rates:score_func | __choice__ == 'select_rates'
  Forbidden Clauses:
    (Forbidden: preprocessor:feature_agglomeration:affinity in {'cosine', 'manhattan'} && Forbidden: preprocessor:feature_agglomeration:linkage == 'ward')
    (Forbidden: preprocessor:liblinear_svc_preprocessor:penalty == 'l1' && Forbidden: preprocessor:liblinear_svc_preprocessor:loss == 'hinge')

Printing the classifier hyperparameter part:

Configuration space object:
  Hyperparameters:
    __choice__, Type: Categorical, Choices: {adaboost, bernoulli_nb, decision_tree, extra_trees, gaussian_nb, gradient_boosting, k_nearest_neighbors, lda, liblinear_svc, libsvm_svc, multinomial_nb, passive_aggressive, qda, random_forest, sgd}, Default: random_forest
    adaboost:algorithm, Type: Categorical, Choices: {SAMME.R, SAMME}, Default: SAMME.R
    adaboost:learning_rate, Type: UniformFloat, Range: [0.01, 2.0], Default: 0.1, on log-scale
    adaboost:max_depth, Type: UniformInteger, Range: [1, 10], Default: 1
    adaboost:n_estimators, Type: UniformInteger, Range: [50, 500], Default: 50
    bernoulli_nb:alpha, Type: UniformFloat, Range: [0.01, 100.0], Default: 1.0, on log-scale
    bernoulli_nb:fit_prior, Type: Categorical, Choices: {True, False}, Default: True
    decision_tree:criterion, Type: Categorical, Choices: {gini, entropy}, Default: gini
    decision_tree:max_depth_factor, Type: UniformFloat, Range: [0.0, 2.0], Default: 0.5
    decision_tree:max_features, Type: Constant, Value: 1.0
    decision_tree:max_leaf_nodes, Type: Constant, Value: None
    decision_tree:min_impurity_decrease, Type: Constant, Value: 0.0
    decision_tree:min_samples_leaf, Type: UniformInteger, Range: [1, 20], Default: 1
    decision_tree:min_samples_split, Type: UniformInteger, Range: [2, 20], Default: 2
    decision_tree:min_weight_fraction_leaf, Type: Constant, Value: 0.0
    extra_trees:bootstrap, Type: Categorical, Choices: {True, False}, Default: False
    extra_trees:criterion, Type: Categorical, Choices: {gini, entropy}, Default: gini
    extra_trees:max_depth, Type: Constant, Value: None
    extra_trees:max_features, Type: UniformFloat, Range: [0.0, 1.0], Default: 0.5
    extra_trees:max_leaf_nodes, Type: Constant, Value: None
    extra_trees:min_impurity_decrease, Type: Constant, Value: 0.0
    extra_trees:min_samples_leaf, Type: UniformInteger, Range: [1, 20], Default: 1
    extra_trees:min_samples_split, Type: UniformInteger, Range: [2, 20], Default: 2
    extra_trees:min_weight_fraction_leaf, Type: Constant, Value: 0.0
    extra_trees:n_estimators, Type: Constant, Value: 100
    gradient_boosting:early_stop, Type: Categorical, Choices: {off, train, valid}, Default: off
    gradient_boosting:l2_regularization, Type: UniformFloat, Range: [1e-10, 1.0], Default: 1e-10, on log-scale
    gradient_boosting:learning_rate, Type: UniformFloat, Range: [0.01, 1.0], Default: 0.1, on log-scale
    gradient_boosting:loss, Type: Constant, Value: auto
    gradient_boosting:max_bins, Type: Constant, Value: 256
    gradient_boosting:max_depth, Type: Constant, Value: None
    gradient_boosting:max_iter, Type: UniformInteger, Range: [32, 512], Default: 100
    gradient_boosting:max_leaf_nodes, Type: UniformInteger, Range: [3, 2047], Default: 31, on log-scale
    gradient_boosting:min_samples_leaf, Type: UniformInteger, Range: [1, 200], Default: 20, on log-scale
    gradient_boosting:n_iter_no_change, Type: UniformInteger, Range: [1, 20], Default: 10
    gradient_boosting:scoring, Type: Constant, Value: loss
    gradient_boosting:tol, Type: Constant, Value: 1e-07
    gradient_boosting:validation_fraction, Type: UniformFloat, Range: [0.01, 0.4], Default: 0.1
    k_nearest_neighbors:n_neighbors, Type: UniformInteger, Range: [1, 100], Default: 1, on log-scale
    k_nearest_neighbors:p, Type: Categorical, Choices: {1, 2}, Default: 2
    k_nearest_neighbors:weights, Type: Categorical, Choices: {uniform, distance}, Default: uniform
    lda:n_components, Type: UniformInteger, Range: [1, 250], Default: 10
    lda:shrinkage, Type: Categorical, Choices: {None, auto, manual}, Default: None
    lda:shrinkage_factor, Type: UniformFloat, Range: [0.0, 1.0], Default: 0.5
    lda:tol, Type: UniformFloat, Range: [1e-05, 0.1], Default: 0.0001, on log-scale
    liblinear_svc:C, Type: UniformFloat, Range: [0.03125, 32768.0], Default: 1.0, on log-scale
    liblinear_svc:dual, Type: Constant, Value: False
    liblinear_svc:fit_intercept, Type: Constant, Value: True
    liblinear_svc:intercept_scaling, Type: Constant, Value: 1
    liblinear_svc:loss, Type: Categorical, Choices: {hinge, squared_hinge}, Default: squared_hinge
    liblinear_svc:multi_class, Type: Constant, Value: ovr
    liblinear_svc:penalty, Type: Categorical, Choices: {l1, l2}, Default: l2
    liblinear_svc:tol, Type: UniformFloat, Range: [1e-05, 0.1], Default: 0.0001, on log-scale
    libsvm_svc:C, Type: UniformFloat, Range: [0.03125, 32768.0], Default: 1.0, on log-scale
    libsvm_svc:coef0, Type: UniformFloat, Range: [-1.0, 1.0], Default: 0.0
    libsvm_svc:degree, Type: UniformInteger, Range: [2, 5], Default: 3
    libsvm_svc:gamma, Type: UniformFloat, Range: [3.0517578125e-05, 8.0], Default: 0.1, on log-scale
    libsvm_svc:kernel, Type: Categorical, Choices: {rbf, poly, sigmoid}, Default: rbf
    libsvm_svc:max_iter, Type: Constant, Value: -1
    libsvm_svc:shrinking, Type: Categorical, Choices: {True, False}, Default: True
    libsvm_svc:tol, Type: UniformFloat, Range: [1e-05, 0.1], Default: 0.001, on log-scale
    multinomial_nb:alpha, Type: UniformFloat, Range: [0.01, 100.0], Default: 1.0, on log-scale
    multinomial_nb:fit_prior, Type: Categorical, Choices: {True, False}, Default: True
    passive_aggressive:C, Type: UniformFloat, Range: [1e-05, 10.0], Default: 1.0, on log-scale
    passive_aggressive:average, Type: Categorical, Choices: {False, True}, Default: False
    passive_aggressive:fit_intercept, Type: Constant, Value: True
    passive_aggressive:loss, Type: Categorical, Choices: {hinge, squared_hinge}, Default: hinge
    passive_aggressive:tol, Type: UniformFloat, Range: [1e-05, 0.1], Default: 0.0001, on log-scale
    qda:reg_param, Type: UniformFloat, Range: [0.0, 1.0], Default: 0.0
    random_forest:bootstrap, Type: Categorical, Choices: {True, False}, Default: True
    random_forest:criterion, Type: Categorical, Choices: {gini, entropy}, Default: gini
    random_forest:max_depth, Type: Constant, Value: None
    random_forest:max_features, Type: UniformFloat, Range: [0.0, 1.0], Default: 0.5
    random_forest:max_leaf_nodes, Type: Constant, Value: None
    random_forest:min_impurity_decrease, Type: Constant, Value: 0.0
    random_forest:min_samples_leaf, Type: UniformInteger, Range: [1, 20], Default: 1
    random_forest:min_samples_split, Type: UniformInteger, Range: [2, 20], Default: 2
    random_forest:min_weight_fraction_leaf, Type: Constant, Value: 0.0
    random_forest:n_estimators, Type: Constant, Value: 100
    sgd:alpha, Type: UniformFloat, Range: [1e-07, 0.1], Default: 0.0001, on log-scale
    sgd:average, Type: Categorical, Choices: {False, True}, Default: False
    sgd:epsilon, Type: UniformFloat, Range: [1e-05, 0.1], Default: 0.0001, on log-scale
    sgd:eta0, Type: UniformFloat, Range: [1e-07, 0.1], Default: 0.01, on log-scale
    sgd:fit_intercept, Type: Constant, Value: True
    sgd:l1_ratio, Type: UniformFloat, Range: [1e-09, 1.0], Default: 0.15, on log-scale
    sgd:learning_rate, Type: Categorical, Choices: {optimal, invscaling, constant}, Default: invscaling
    sgd:loss, Type: Categorical, Choices: {hinge, log, modified_huber, squared_hinge, perceptron}, Default: log
    sgd:penalty, Type: Categorical, Choices: {l1, l2, elasticnet}, Default: l2
    sgd:power_t, Type: UniformFloat, Range: [1e-05, 1.0], Default: 0.5
    sgd:tol, Type: UniformFloat, Range: [1e-05, 0.1], Default: 0.0001, on log-scale
  Conditions:
    adaboost:algorithm | __choice__ == 'adaboost'
    adaboost:learning_rate | __choice__ == 'adaboost'
    adaboost:max_depth | __choice__ == 'adaboost'
    adaboost:n_estimators | __choice__ == 'adaboost'
    bernoulli_nb:alpha | __choice__ == 'bernoulli_nb'
    bernoulli_nb:fit_prior | __choice__ == 'bernoulli_nb'
    decision_tree:criterion | __choice__ == 'decision_tree'
    decision_tree:max_depth_factor | __choice__ == 'decision_tree'
    decision_tree:max_features | __choice__ == 'decision_tree'
    decision_tree:max_leaf_nodes | __choice__ == 'decision_tree'
    decision_tree:min_impurity_decrease | __choice__ == 'decision_tree'
    decision_tree:min_samples_leaf | __choice__ == 'decision_tree'
    decision_tree:min_samples_split | __choice__ == 'decision_tree'
    decision_tree:min_weight_fraction_leaf | __choice__ == 'decision_tree'
    extra_trees:bootstrap | __choice__ == 'extra_trees'
    extra_trees:criterion | __choice__ == 'extra_trees'
    extra_trees:max_depth | __choice__ == 'extra_trees'
    extra_trees:max_features | __choice__ == 'extra_trees'
    extra_trees:max_leaf_nodes | __choice__ == 'extra_trees'
    extra_trees:min_impurity_decrease | __choice__ == 'extra_trees'
    extra_trees:min_samples_leaf | __choice__ == 'extra_trees'
    extra_trees:min_samples_split | __choice__ == 'extra_trees'
    extra_trees:min_weight_fraction_leaf | __choice__ == 'extra_trees'
    extra_trees:n_estimators | __choice__ == 'extra_trees'
    gradient_boosting:early_stop | __choice__ == 'gradient_boosting'
    gradient_boosting:l2_regularization | __choice__ == 'gradient_boosting'
    gradient_boosting:learning_rate | __choice__ == 'gradient_boosting'
    gradient_boosting:loss | __choice__ == 'gradient_boosting'
    gradient_boosting:max_bins | __choice__ == 'gradient_boosting'
    gradient_boosting:max_depth | __choice__ == 'gradient_boosting'
    gradient_boosting:max_iter | __choice__ == 'gradient_boosting'
    gradient_boosting:max_leaf_nodes | __choice__ == 'gradient_boosting'
    gradient_boosting:min_samples_leaf | __choice__ == 'gradient_boosting'
    gradient_boosting:n_iter_no_change | gradient_boosting:early_stop in {'valid', 'train'}
    gradient_boosting:scoring | __choice__ == 'gradient_boosting'
    gradient_boosting:tol | __choice__ == 'gradient_boosting'
    gradient_boosting:validation_fraction | gradient_boosting:early_stop == 'valid'
    k_nearest_neighbors:n_neighbors | __choice__ == 'k_nearest_neighbors'
    k_nearest_neighbors:p | __choice__ == 'k_nearest_neighbors'
    k_nearest_neighbors:weights | __choice__ == 'k_nearest_neighbors'
    lda:n_components | __choice__ == 'lda'
    lda:shrinkage | __choice__ == 'lda'
    lda:shrinkage_factor | lda:shrinkage == 'manual'
    lda:tol | __choice__ == 'lda'
    liblinear_svc:C | __choice__ == 'liblinear_svc'
    liblinear_svc:dual | __choice__ == 'liblinear_svc'
    liblinear_svc:fit_intercept | __choice__ == 'liblinear_svc'
    liblinear_svc:intercept_scaling | __choice__ == 'liblinear_svc'
    liblinear_svc:loss | __choice__ == 'liblinear_svc'
    liblinear_svc:multi_class | __choice__ == 'liblinear_svc'
    liblinear_svc:penalty | __choice__ == 'liblinear_svc'
    liblinear_svc:tol | __choice__ == 'liblinear_svc'
    libsvm_svc:C | __choice__ == 'libsvm_svc'
    libsvm_svc:coef0 | libsvm_svc:kernel in {'poly', 'sigmoid'}
    libsvm_svc:degree | libsvm_svc:kernel == 'poly'
    libsvm_svc:gamma | __choice__ == 'libsvm_svc'
    libsvm_svc:kernel | __choice__ == 'libsvm_svc'
    libsvm_svc:max_iter | __choice__ == 'libsvm_svc'
    libsvm_svc:shrinking | __choice__ == 'libsvm_svc'
    libsvm_svc:tol | __choice__ == 'libsvm_svc'
    multinomial_nb:alpha | __choice__ == 'multinomial_nb'
    multinomial_nb:fit_prior | __choice__ == 'multinomial_nb'
    passive_aggressive:C | __choice__ == 'passive_aggressive'
    passive_aggressive:average | __choice__ == 'passive_aggressive'
    passive_aggressive:fit_intercept | __choice__ == 'passive_aggressive'
    passive_aggressive:loss | __choice__ == 'passive_aggressive'
    passive_aggressive:tol | __choice__ == 'passive_aggressive'
    qda:reg_param | __choice__ == 'qda'
    random_forest:bootstrap | __choice__ == 'random_forest'
    random_forest:criterion | __choice__ == 'random_forest'
    random_forest:max_depth | __choice__ == 'random_forest'
    random_forest:max_features | __choice__ == 'random_forest'
    random_forest:max_leaf_nodes | __choice__ == 'random_forest'
    random_forest:min_impurity_decrease | __choice__ == 'random_forest'
    random_forest:min_samples_leaf | __choice__ == 'random_forest'
    random_forest:min_samples_split | __choice__ == 'random_forest'
    random_forest:min_weight_fraction_leaf | __choice__ == 'random_forest'
    random_forest:n_estimators | __choice__ == 'random_forest'
    sgd:alpha | __choice__ == 'sgd'
    sgd:average | __choice__ == 'sgd'
    sgd:epsilon | sgd:loss == 'modified_huber'
    sgd:eta0 | sgd:learning_rate in {'invscaling', 'constant'}
    sgd:fit_intercept | __choice__ == 'sgd'
    sgd:l1_ratio | sgd:penalty == 'elasticnet'
    sgd:learning_rate | __choice__ == 'sgd'
    sgd:loss | __choice__ == 'sgd'
    sgd:penalty | __choice__ == 'sgd'
    sgd:power_t | sgd:learning_rate == 'invscaling'
    sgd:tol | __choice__ == 'sgd'
  Forbidden Clauses:
    (Forbidden: liblinear_svc:penalty == 'l1' && Forbidden: liblinear_svc:loss == 'hinge')
    (Forbidden: liblinear_svc:dual == 'False' && Forbidden: liblinear_svc:penalty == 'l2' && Forbidden: liblinear_svc:loss == 'hinge')
    (Forbidden: liblinear_svc:dual == 'False' && Forbidden: liblinear_svc:penalty == 'l1')

At this point we have essentially worked out how the hyperparameter space is constructed.

metalearn

  • Return to: autosklearn.automl.AutoML#_fit:462

The meta-learning computation happens in the _proc_smac stage: _proc_smac = AutoMLSMBO(... followed by _proc_smac.run_smbo().

self._initial_configurations_via_metalearning
Out[35]: 25
  • Step into: autosklearn.smbo.AutoMLSMBO#run_smbo
metalearning_configurations = self.get_metalearning_suggestions()
  • Step into: autosklearn.smbo.AutoMLSMBO#get_metalearning_suggestions

  • Step into: autosklearn.metalearning.metalearning.meta_base.MetaBase#__init__

  • Step into: autosklearn.metalearning.input.aslib_simple.AlgorithmSelectionProblem#__init__

This constructor reads in the files that meta-learning needs.

The "as" in aslib stands for algorithm selection.

(base) ~/PycharmProjects/automl/auto-sklearn/autosklearn/metalearning/files/accuracy_binary.classification_dense (tqc ✘)✹✭ ᐅ tree
.
├── algorithm_runs.arff
├── configurations.csv
├── description.txt
├── feature_costs.arff
├── feature_runstatus.arff
├── feature_values.arff
└── readme.txt

The following code also reveals the semantics of each file:

        self.read_funcs = {
            "algorithm_runs.arff": self._read_algorithm_runs,
            "feature_values.arff": self._read_feature_values,
            "configurations.csv": self._read_configurations
        }

Note the .arff files: this is the Attribute-Relation File Format (ARFF); see the overview for details.
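
Such files can be loaded with an ARFF reader that yields a dict with "attributes" and "data" keys, matching the arff_dict usage below; a hedged sketch using the liac-arff package:

    import arff  # the liac-arff package

    with open("algorithm_runs.arff") as fh:
        arff_dict = arff.load(fh)       # dict with 'relation', 'attributes', 'data', ...

    print(arff_dict["attributes"][0])   # (name, type) tuple for the first column
    print(arff_dict["data"][0])         # first data row, as a list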

  • Step into: autosklearn.metalearning.input.aslib_simple.AlgorithmSelectionProblem#_read_algorithm_runs

Note:

measure_instance_algorithm_triples = defaultdict(lambda: defaultdict(dict))

This is effectively a dictionary of depth 3 whose first two levels are defaultdicts.
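
A toy illustration of the nesting (measure -> instance -> algorithm -> performance):

    from collections import defaultdict

    d = defaultdict(lambda: defaultdict(dict))
    d["accuracy"]["2120"]["1"] = 0.0783   # both outer levels are created on demand
    print(d["accuracy"]["2120"])          # {'1': 0.0783}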

performance_measures
Out[4]: ['accuracy']

arff_dict["data"]
Out[5]: 
[['2120', 1.0, '1', 0.07826496935407823, 'ok'],
 ['75193', 1.0, '2', 0.03999833101239747, 'ok'],
 ['2117', 1.0, '3', 0.1586523546565738, 'ok'],
 ['75156', 1.0, '4', 0.21584478577202915, 'ok'],
        for data in arff_dict["data"]:
            inst_name = str(data[0])
            repetition = data[1]
            algorithm = str(data[2])
            perf_list = data[3:-1]
            status = data[-1]
                measure_instance_algorithm_triples[performance_measure][
                    inst_name][algorithm] = perf_list[i]

measure_instance_algorithm_triples is a three-level dictionary: the first level is the performance measure (e.g. accuracy), the second is the instance (e.g. a training task on some dataset), and the third is the algorithm invoked. The leaf value is the performance under the combination of those three keys (e.g., under the accuracy measure, the performance of an SVM on dataset A).
Inspecting the variable in the console:

measure_instance_algorithm_triples
Out[13]: 
defaultdict(<function autosklearn.metalearning.input.aslib_simple.AlgorithmSelectionProblem._read_algorithm_runs.<locals>.<lambda>()>,
            {'7': defaultdict(dict, {}),
             'accuracy': defaultdict(dict,
                         {'2120': {'1': 0.07826496935407823},
                          '75193': {'2': 0.03999833101239747},
                          '2117': {'3': 0.1586523546565738},
                          '75156': {'4': 0.21584478577202915},
                          ...
                          '75225': {'111': 0.11244019138755978},
                          '75141': {'112': 0.052904180540140566},
                          '75107': {'113': 0.050303030303030294},
                          '75097': {'114': 0.0602053084250439}})})
pd.DataFrame(measure_instance_algorithm_triples[pm])
Out[16]: 
         2120     75193      2117  ...     75141     75107     75097
1    0.078265       NaN       NaN  ...       NaN       NaN       NaN
2         NaN  0.039998       NaN  ...       NaN       NaN       NaN
3         NaN       NaN  0.158652  ...       NaN       NaN       NaN
4         NaN       NaN       NaN  ...       NaN       NaN       NaN
5         NaN       NaN       NaN  ...       NaN       NaN       NaN

The result is a matrix:

        measure_algorithm_matrices = OrderedDict()
        for pm in performance_measures:
            measure_algorithm_matrices[pm] = pd.DataFrame(
                measure_instance_algorithm_triples[pm]).transpose()

        self.algorithm_runs = measure_algorithm_matrices
  • Step into: autosklearn.metalearning.input.aslib_simple.AlgorithmSelectionProblem#_read_feature_values
filename
Out[19]: '/home/tqc/PycharmProjects/automl/auto-sklearn/autosklearn/metalearning/files/accuracy_binary.classification_dense/feature_values.arff'
arff_dict["data"][0]  # 0: 数据集名称 1: 重复次数  2::metafeature
Out[20]: 
['75249',
 1.0,
 0.3891675492147647,
 0.9236550632911392,
 0.5,
 0.07634493670886076,
 0.4236550632911392,
 0.011471518987341773,
 87.17241379310344,
 416.33571239756776,
 46.57141283642045,
 -3.0,
 ...
        for data in arff_dict["data"]:
            inst_name = data[0]
            repetition = data[1]
            features = data[2:]

Printing metafeatures:

'75239': {'ClassEntropy': 0.9443547030267275,
  'ClassProbabilityMax': 0.6379707916986933,
  'ClassProbabilityMean': 0.5,
  'ClassProbabilityMin': 0.3620292083013067,
  'ClassProbabilitySTD': 0.1379707916986933,
  'DatasetRatio': 0.02536510376633359,
  'InverseDatasetRatio': 39.42424242424242,
  'KurtosisMax': 30.96304702758789,
  'KurtosisMean': 5.229958094222913,
  'KurtosisMin': -1.8830041885375977,
  'KurtosisSTD': 9.159290860687747,
  'Landmark1NN': 0.9854139753376394,
  'LandmarkDecisionNodeLearner': 0.6379741632413387,
  'LandmarkDecisionTree': 1.0,
  'LandmarkLDA': 0.7832472108044627,
  'LandmarkNaiveBayes': 0.9976981796829125,
  'LandmarkRandomNodeLearner': 0.6379741632413387,
  'LogDatasetRatio': -3.674380917046025,
  'LogInverseDatasetRatio': 3.674380917046025,
  'LogNumberOfFeatures': 3.4965075614664802,
  'LogNumberOfInstances': 7.170888478512505,
  'NumberOfCategoricalFeatures': 0.0,
  'NumberOfClasses': 2.0,
  'NumberOfFeatures': 33.0,
  'NumberOfFeaturesWithMissingValues': 0.0,
  'NumberOfInstances': 1301.0,
  'NumberOfInstancesWithMissingValues': 0.0,
  'NumberOfMissingValues': 0.0,
  'NumberOfNumericFeatures': 33.0,
  'PCAFractionOfComponentsFor95PercentVariance': 0.5151515151515151,
  'PCAKurtosisFirstPC': 1.5021543502807617,
  'PCASkewnessFirstPC': 1.460637092590332,
  'PercentageOfFeaturesWithMissingValues': 0.0,
  'PercentageOfInstancesWithMissingValues': 0.0,
  'PercentageOfMissingValues': 0.0,
  'RatioNominalToNumerical': 0.0,
  'RatioNumericalToNominal': 0.0,
  'SkewnessMax': 5.591684341430664,
  'SkewnessMean': 1.5151810998266393,
  'SkewnessMin': -0.8763390183448792,
  'SkewnessSTD': 1.68213778251707,
  'SymbolsMax': 0.0,
  'SymbolsMean': 0.0,
  'SymbolsMin': 0.0,
  'SymbolsSTD': 0.0,
  'SymbolsSum': 0.0},

The last line:

self.metafeatures = pd.DataFrame(metafeatures).transpose()
self.metafeatures
Out[24]: 
       ClassEntropy  ClassProbabilityMax  ...  SymbolsSTD  SymbolsSum
75249      0.389168             0.923655  ...         0.0         0.0
75203      2.447791             0.289707  ...         0.0         0.0
75090      3.316085             0.111876  ...         0.0         0.0
75213      0.758988             0.780645  ...         0.0         0.0

self.metafeatures holds the meta-features of each dataset.

  • Step into: autosklearn.metalearning.input.aslib_simple.AlgorithmSelectionProblem#_read_configurations
filename
Out[27]: '/home/tqc/PycharmProjects/automl/auto-sklearn/autosklearn/metalearning/files/accuracy_binary.classification_dense/configurations.csv'

Printing configurations:

 '114': {'balancing:strategy': 'none',
  'categorical_encoding:__choice__': 'no_encoding',
  'classifier:__choice__': 'k_nearest_neighbors',
  'classifier:k_nearest_neighbors:n_neighbors': 3,
  'classifier:k_nearest_neighbors:p': 1,
  'classifier:k_nearest_neighbors:weights': 'uniform',
  'imputation:strategy': 'mean',
  'preprocessor:__choice__': 'polynomial',
  'preprocessor:polynomial:degree': 2,
  'preprocessor:polynomial:include_bias': 'True',
  'preprocessor:polynomial:interaction_only': 'False',
  'rescaling:__choice__': 'quantile_transformer',
  'rescaling:quantile_transformer:n_quantiles': 498,
  'rescaling:quantile_transformer:output_distribution': 'normal'}}

At this point the picture is clear:

Variable         Meaning                                        File                  Loader
configurations   hyperparameter settings per algorithm id       configurations.csv    _read_configurations
metafeatures     meta-features of each dataset                  feature_values.arff   _read_feature_values
algorithm_runs   per-measure matrix of algorithm performance    algorithm_runs.arff   _read_algorithm_runs

Note: self.algorithm_runs = measure_algorithm_matrices

We now know how the meta-features stored in the autosklearn source package correspond to the models it recommends.

  • Return to: autosklearn.smbo.AutoMLSMBO#get_metalearning_suggestions:601

  • Step into: autosklearn.smbo.AutoMLSMBO#_calculate_metafeatures_encoded

  • Step into: autosklearn.smbo._calculate_metafeatures_encoded

    EXCLUDE_META_FEATURES = EXCLUDE_META_FEATURES_CLASSIFICATION \
        if task in CLASSIFICATION_TASKS else EXCLUDE_META_FEATURES_REGRESSION

Depending on whether the task is regression or classification, some meta-features that should not be computed are excluded.
For instance, the task I am stepping through with the debugger is a classification task:

EXCLUDE_META_FEATURES
Out[28]: 
{'Landmark1NN',
 'LandmarkDecisionNodeLearner',
 'LandmarkDecisionTree',
 'LandmarkLDA',
 'LandmarkNaiveBayes',
 'PCA',
 'PCAFractionOfComponentsFor95PercentVariance',
 'PCAKurtosisFirstPC',
 'PCASkewnessFirstPC'}
    result = calculate_all_metafeatures_encoded_labels(
        x_train, y_train, categorical=[False] * x_train.shape[1],
        dataset_name=basename, dont_calculate=EXCLUDE_META_FEATURES)
  • Step into: autosklearn.metalearning.metafeatures.metafeatures.calculate_all_metafeatures_encoded_labels

The body of this function is:

    calculate = set()
    calculate.update(npy_metafeatures)
    return calculate_all_metafeatures(X, y, categorical, dataset_name,
                                      calculate=calculate,
                                      dont_calculate=dont_calculate)

npy_metafeatures here is a module-level global:

npy_metafeatures = set(["LandmarkLDA",
                        "LandmarkNaiveBayes",
                        "LandmarkDecisionTree",
                        "LandmarkDecisionNodeLearner",
                        "LandmarkRandomNodeLearner",
                        "LandmarkWorstNodeLearner",
                        "Landmark1NN",
                        "PCAFractionOfComponentsFor95PercentVariance",
                        "PCAKurtosisFirstPC",
                        "PCASkewnessFirstPC",
                        "Skewnesses",
                        "SkewnessMin",
                        "SkewnessMax",
                        "SkewnessMean",
                        "SkewnessSTD",
                        "Kurtosisses",
                        "KurtosisMin",
                        "KurtosisMax",
                        "KurtosisMean",
                        "KurtosisSTD"])
  • Step into: autosklearn.metalearning.metafeatures.metafeatures.calculate_all_metafeatures

Note the two module-level globals:

metafeatures = MetafeatureFunctions()
helper_functions = HelperFunctions()

Classes decorated with @metafeatures.define register an instance of themselves into the metafeatures object via __setitem__; metafeatures is a MetafeatureFunctions instance.

An excerpt as an example:

@metafeatures.define("NumberOfInstances")
class NumberOfInstances(MetaFeature):
    def _calculate(self, X, y, categorical):
        return float(X.shape[0])
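
A minimal sketch of how such a define decorator can register instances via __setitem__ (an illustrative reimplementation, not the exact autosklearn code):

    class MetafeatureFunctions:
        def __init__(self):
            self.functions = {}

        def __setitem__(self, name, obj):
            self.functions[name] = obj

        def __getitem__(self, name):
            return self.functions[name]

        def define(self, name):
            def wrapper(cls):
                self[name] = cls()   # store an instance, so metafeatures[name](...) works later
                return cls
            return wrapper

    metafeatures = MetafeatureFunctions()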

If the meta-feature currently being parsed belongs to the npy set (if name in npy_metafeatures:), X is first transformed into X_transformed:

X_transformed = imputer.fit_transform(X_transformed)
X_transformed = standard_scaler.fit_transform(X_transformed)

Finally, the value of the current meta-feature is computed directly:

value = metafeatures[name](X_, y_, categorical_)

metafeatures[name] is an instance followed by parentheses, so presumably the __call__ magic method is at work. Tracing two levels up the class hierarchy leads to autosklearn.metalearning.metafeatures.metafeature.AbstractMetaFeature#__call__:

    def __call__(self, X, y, categorical=None):
        if categorical is None:
            categorical = [False for i in range(X.shape[1])]
        starttime = time.time()

        try:
            if scipy.sparse.issparse(X) and hasattr(self, "_calculate_sparse"):
                value = self._calculate_sparse(X, y, categorical)
            else:
                value = self._calculate(X, y, categorical)
            comment = ""
        except MemoryError as e:
            value = None
            comment = "Memory Error"

        endtime = time.time()
        return MetaFeatureValue(self.__class__.__name__, self.type_,
                                0, 0, value, endtime-starttime, comment=comment)

The return value is a MetaFeatureValue instance, which wraps the meta-feature's value along with some bookkeeping information.

mf_ is a dict whose keys are meta-feature names and whose values are the corresponding MetaFeatureValue instances.

At this point we have worked out how a dataset's meta-features are computed. Next:

  • Return to: autosklearn.smbo.AutoMLSMBO#get_metalearning_suggestions:538
meta_features
Out[2]: 
Metafeatures for dataset breast_cancer
  ClassEntropy: 0.9495480401701638
  SymbolsSum: 0.0
  SymbolsSTD: 0
  SymbolsMean: 0
  SymbolsMax: 0
  SymbolsMin: 0
  ClassProbabilitySTD: 0.13145539906103287
  ClassProbabilityMean: 0.5
  ClassProbabilityMax: 0.6314553990610329
  ClassProbabilityMin: 0.3685446009389671
  InverseDatasetRatio: 14.2
  DatasetRatio: 0.07042253521126761
  RatioNominalToNumerical: 0.0
  RatioNumericalToNominal: 0.0
  NumberOfCategoricalFeatures: 0
  NumberOfNumericFeatures: 30
  NumberOfMissingValues: 0.0
  NumberOfFeaturesWithMissingValues: 0.0
  NumberOfInstancesWithMissingValues: 0.0
  NumberOfFeatures: 30.0
  NumberOfClasses: 2.0
  NumberOfInstances: 426.0
  LogInverseDatasetRatio: 2.653241964607215
  LogDatasetRatio: -2.653241964607215
  PercentageOfMissingValues: 0.0
  PercentageOfFeaturesWithMissingValues: 0.0
  PercentageOfInstancesWithMissingValues: 0.0
  LogNumberOfFeatures: 3.4011973816621555
  LogNumberOfInstances: 6.054439346269371
meta_features_encoded
Out[6]: 
Metafeatures for dataset breast_cancer
  LandmarkRandomNodeLearner: 0.346312292358804
  SkewnessSTD: 1.3079569844668182
  SkewnessMean: 1.719863689861412
  SkewnessMax: 5.497448960200661
  SkewnessMin: 0.3638245081126332
  KurtosisSTD: 13.149786948568595
  KurtosisMean: 7.73073011481775
  KurtosisMax: 54.11573179309323
  KurtosisMin: -0.5734114126286567
                    meta_base.add_dataset(self.dataset_name, meta_features)
                    all_metafeatures = meta_base.get_metafeatures(

At this point all_metafeatures is a matrix that includes the current dataset:

all_metafeatures
Out[7]: 
               ClassEntropy  SymbolsSum  ...  KurtosisMax  KurtosisMin
75249              0.389168         0.0  ...   416.335712    -3.000000
75203              2.447791         0.0  ...  1381.000806    -3.000000
75090              3.316085         0.0  ...  2319.000488    -3.000000
75213              0.758988         0.0  ...    34.615789    -1.998501
75157              0.991535         0.0  ...    11.286622    -0.840580
...                     ...         ...  ...          ...          ...
75198              5.269287         0.0  ...  1690.960100    -3.000000
75156              0.995151         0.0  ...  2508.999535    -3.000000
75114              0.770740         0.0  ...  1028.527714    -1.146804
75230              6.618346         0.0  ...     8.500022     0.019823
breast_cancer      0.949548         0.0  ...    54.115732    -0.573411
[133 rows x 38 columns]

Then comes a crucial line of code:

metalearning_configurations = self.collect_metalearning_suggestions(meta_base)

The return value of the function we are in, get_metalearning_suggestions, is exactly metalearning_configurations. So, given the meta-features of the already-trained datasets and of the current dataset, a distance-based lookup can find similar datasets and recommend their configurations to the current training task.
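
Conceptually, the suggestion step ranks previously seen datasets by L1 distance in meta-feature space (the distance='l1' argument below); a hedged sketch of the idea, ignoring the scaling the real MetaLearningOptimizer applies:

    import pandas as pd

    def rank_similar_datasets(all_metafeatures: pd.DataFrame, target_name: str) -> pd.Series:
        # L1 distance between the target dataset's meta-feature vector and every other row
        target = all_metafeatures.loc[target_name]
        others = all_metafeatures.drop(index=target_name)
        distances = (others - target).abs().sum(axis=1)
        return distances.sort_values()   # smallest distance = most similar dataset

The best-known configurations of the nearest datasets are then suggested first.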

  • Step into: autosklearn.smbo.AutoMLSMBO#collect_metalearning_suggestions

  • Step into: autosklearn.smbo._get_metalearning_configurations

  • Step into: autosklearn.metalearning.mismbo.suggest_via_metalearning

    ml = MetaLearningOptimizer(
        dataset_name=dataset_name,
        configuration_space=meta_base.configuration_space,
        meta_base=meta_base,
        distance='l1',
        seed=1,)
    runs = ml.metalearning_suggest_all(exclude_double_configurations=True)
  • Step into: autosklearn.metalearning.optimizers.metalearn_optimizer.metalearner.MetaLearningOptimizer#metalearning_suggest_all

  • Step into: autosklearn.metalearning.optimizers.metalearn_optimizer.metalearner.MetaLearningOptimizer#_learn

Inside _split_metafeature_array we see:

return dataset_metafeatures, all_other_metafeatures

dataset_metafeatures: the meta-features of the current dataset
all_other_metafeatures: the meta-features of all other datasets

all_other_metafeatures.shape
Out[11]: (132, 46)
dataset_metafeatures.shape
Out[12]: (46,)
        keep = []
        for idx in dataset_metafeatures.index:
            if np.isfinite(dataset_metafeatures.loc[idx]):
                keep.append(idx)

keep: all the meta-features that should enter the computation (those with finite values for the current dataset)

all_other_metafeatures = all_other_metafeatures.fillna(all_other_metafeatures.mean())

Missing values are filled in with the column means.
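
A toy example of the fillna-with-column-mean pattern:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 2.0, 4.0]})
    print(df.fillna(df.mean()))   # NaNs become 2.0 (column a) and 3.0 (column b)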

runs
Out[21]: 
     75249  75203  75090  75213  75157  ...  75112  75198     75156  75114  75230
1      NaN    NaN    NaN    NaN    NaN  ...    NaN    NaN       NaN    NaN    NaN
2      NaN    NaN    NaN    NaN    NaN  ...    NaN    NaN       NaN    NaN    NaN
3      NaN    NaN    NaN    NaN    NaN  ...    NaN    NaN       NaN    NaN    NaN
4      NaN    NaN    NaN    NaN    NaN  ...    NaN    NaN  0.215845    NaN    NaN
5      NaN    NaN    NaN    NaN    NaN  ...    NaN    NaN       NaN    NaN    NaN

Having finished with the meta-learning recommendations, we return to where we started.

  • Return to: autosklearn.smbo.AutoMLSMBO#run_smbo:388

Finally, how are the meta-learned configurations added to SMAC? See autosklearn.smbo.get_smac_object:

        default_config = scenario.cs.get_default_configuration()
        initial_configurations = [default_config] + metalearning_configurations
    return SMAC(
        scenario=scenario,
        rng=seed,
        runhistory2epm=rh2EPM,
        tae_runner=ta,
        initial_configurations=initial_configurations,
        runhistory=runhistory,
        run_id=seed,
    )
