self.steps
0 ['categorical_encoding', <autosklearn.pipeline.components.data_preprocessing.one_hot_encoding.OHEChoice object at 0x7f7a74dfa8d0>]
1 ['imputation', Imputation(random_state=None, strategy='median')]
2 ['variance_threshold', VarianceThreshold(random_state=None)]
3 ['rescaling', <autosklearn.pipeline.components.data_preprocessing.rescaling.RescalingChoice object at 0x7f7a74dfa780>]
4 ['balancing', Balancing(random_state=None, strategy='none')]
5 ['preprocessor', <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7f7a74dc6390>]
6 ['classifier', <autosklearn.pipeline.components.classification.ClassifierChoice object at 0x7f7a74dc6048>]
pipenode class | is choice
---|---
OHEChoice | choice
Imputation | not choice
VarianceThreshold | not choice
RescalingChoice | choice
Balancing | not choice
FeaturePreprocessorChoice | choice
ClassifierChoice | choice
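How does the source tell these two kinds of node apart? In BasePipeline._get_base_search_space the test is essentially an isinstance check against the choice base class; a minimal sketch of reproducing the table above from the same debugger session (assuming AutoSklearnChoice is importable from autosklearn.pipeline.components.base, as in the version explored here):

# Run in the same debugging context as the dump of self.steps above.
from autosklearn.pipeline.components.base import AutoSklearnChoice

for node_name, node in self.steps:
    is_choice = isinstance(node, AutoSklearnChoice)
    print(node_name, "choice" if is_choice else "not choice")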
from collections import OrderedDict
import importlib
import inspect
import pkgutil
import sys
def find_components(package, directory, base_class):
    components = OrderedDict()
    for module_loader, module_name, ispkg in pkgutil.iter_modules([directory]):
        full_module_name = "%s.%s" % (package, module_name)
        if full_module_name not in sys.modules and not ispkg:
            module = importlib.import_module(full_module_name)
            for member_name, obj in inspect.getmembers(module):
                if inspect.isclass(obj) and issubclass(obj, base_class) and \
                        obj != base_class:
                    # TODO test if the obj implements the interface
                    # Keep in mind that this only instantiates the ensemble_wrapper,
                    # but not the real target classifier
                    classifier = obj
                    components[module_name] = classifier
    return components
_classifiers = find_components(__package__,
                               classifier_directory,
                               AutoSklearnClassificationAlgorithm)
autosklearn/ensemble_builder.py:389
self.logger.warning("No models better than random - "
                    "using Dummy Score!")
In the _fit function of automl.py, self._proc_ensemble = self._get_ensemble_process(time_left_for_ensembles) obtains an EnsembleBuilder object (code in ~/ensemble_builder.py). _proc_ensemble inherits from the multiprocessing process class, so a separate process is spawned to run its run method; run in turn calls EnsembleBuilder.main.
Worth studying in depth: autosklearn.ensembles.ensemble_selection.EnsembleSelection#_fast. The _ensemble variable is probably loaded via load_model.
We experimented with different approaches to optimize these weights: stacking [26], gradient-free numerical optimization, and the method ensemble selection [24].
We found both numerical optimization and stacking to overfit to the validation set and to be computationally costly.
In a nutshell, ensemble selection (introduced by Caruana et al. [24]) is a greedy procedure that starts from an empty ensemble and then iteratively adds the model that maximizes ensemble validation performance (with uniform weight, but allowing for repetitions).
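To make the greedy procedure concrete, here is a minimal sketch of ensemble selection as described above (uniform weights, selection with replacement). The names predictions and score are placeholders; this is not auto-sklearn's actual EnsembleSelection implementation.

import numpy as np

def ensemble_selection(predictions, y_valid, score, n_iterations=50):
    """predictions: one (n_samples, n_classes) validation-probability array
    per model; score: a higher-is-better metric."""
    ensemble_sum = np.zeros_like(predictions[0])
    chosen = []
    for _ in range(n_iterations):
        best_idx, best_score = None, -np.inf
        for idx, pred in enumerate(predictions):
            # Average of everything chosen so far plus this candidate.
            candidate = (ensemble_sum + pred) / (len(chosen) + 1)
            s = score(y_valid, candidate)
            if s > best_score:
                best_idx, best_score = idx, s
        ensemble_sum += predictions[best_idx]
        chosen.append(best_idx)
    # Uniform weights with repetitions: weight = times chosen / iterations.
    return np.bincount(chosen, minlength=len(predictions)) / len(chosen)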
We note that SMAC [9] can handle this conditionality natively.
The 14 possible feature preprocessing methods can be categorized into feature selection (2), kernel approximation (2), matrix decomposition (3), embeddings (1), feature clustering (1), polynomial feature expansion (1) and methods that use a classifier for feature selection (2).
The main entry point: autosklearn.automl.AutoMLClassifier#fit. After the subclass sets up some required parameters, it calls the parent class's fit method, i.e. autosklearn.automl.AutoML#fit. There, after loaded_data_manager = XYDataManager(... wraps X and y, it calls return self._fit(...
self.configuration_space, configspace_path = self._create_search_space(
Step into: autosklearn.automl.AutoML#_create_search_space
There we see configuration_space = pipeline.get_configuration_space(
Step into: autosklearn.util.pipeline.get_configuration_space
In this function, after the info dict is assembled, the final block of code is:
if info['task'] in REGRESSION_TASKS:
    return _get_regression_configuration_space(info, include, exclude)
else:
    return _get_classification_configuration_space(info, include, exclude)
autosklearn.util.pipeline._get_classification_configuration_space
Its final block of code:
return SimpleClassificationPipeline(
    dataset_properties=dataset_properties,
    include=include, exclude=exclude).\
    get_hyperparameter_search_space()
autosklearn.pipeline.base.BasePipeline#get_hyperparameter_search_space
Its final block of code:
if not hasattr(self, 'config_space') or self.config_space is None:
    self.config_space = self._get_hyperparameter_search_space(
        include=self.include_, exclude=self.exclude_,
        dataset_properties=self.dataset_properties_)
return self.config_space
autosklearn.pipeline.classification.SimpleClassificationPipeline#_get_hyperparameter_search_space
At this point, after many jumps and stack frames, we have finally reached the region richest in substance. We see the following code:
cs = self._get_base_search_space(
    cs=cs, dataset_properties=dataset_properties,
    exclude=exclude, include=include, pipeline=self.steps)
Note that self.steps here holds all the nodes of the pipeline that auto-sklearn wants to optimize.
autosklearn.pipeline.base.BasePipeline#_get_base_search_space
We see that matches is obtained here, so let's find out where matches comes from:
autosklearn.pipeline.create_searchspace_util.get_match_array
In the for node_name, node in pipeline: loop, a very important variable is constructed: node_i_choices, a 2-D list. In its raw form the first dimension has length 7, one entry per pipeline node, and each sublist holds all the options that node can take. The first four entries serve as a sample:
node_i_choices[0]
Out[16]:
[autosklearn.pipeline.components.data_preprocessing.one_hot_encoding.no_encoding.NoEncoding,
autosklearn.pipeline.components.data_preprocessing.one_hot_encoding.one_hot_encoding.OneHotEncoder]
node_i_choices[1]
Out[17]: [Imputation(random_state=None, strategy='median')]
node_i_choices[2]
Out[18]: [VarianceThreshold(random_state=None)]
node_i_choices[3]
Out[19]:
[autosklearn.pipeline.components.data_preprocessing.rescaling.minmax.MinMaxScalerComponent,
autosklearn.pipeline.components.data_preprocessing.rescaling.none.NoRescalingComponent,
autosklearn.pipeline.components.data_preprocessing.rescaling.normalize.NormalizerComponent,
autosklearn.pipeline.components.data_preprocessing.rescaling.quantile_transformer.QuantileTransformerComponent,
autosklearn.pipeline.components.data_preprocessing.rescaling.robust_scaler.RobustScalerComponent,
autosklearn.pipeline.components.data_preprocessing.rescaling.standardize.StandardScalerComponent]
Next, matches_dimensions records the length of each sublist; it is used to construct a high-dimensional tensor, matches:
matches_dimensions
Out[20]: [2, 1, 1, 6, 1, 15, 15]
matches = np.ones(matches_dimensions, dtype=int)
We then see:
pipeline_idxs = [range(dim) for dim in matches_dimensions]
for pipeline_instantiation_idxs in itertools.product(*pipeline_idxs):
This can be read as iterating over every possible instantiation of the pipeline; pipeline_instantiation_idxs is the coordinate of one such instantiation inside matches:
pipeline_instantiation_idxs
Out[25]: (0, 0, 0, 0, 0, 0, 0)
node_input = node.get_properties()['input']
node_output = node.get_properties()['output']
node_input
Out[26]: (5, 6, 10)
node_output
Out[27]: (8,)
This looks puzzling at first; jumping to the get_properties function we see:
'input': (DENSE, SPARSE, UNSIGNED_DATA),
'output': (PREDICTIONS,)}
So these tuples declare which data formats a node can consume and produce; the printed integers are the numeric values of those constants in autosklearn.pipeline.constants (here (5, 6, 10) corresponds to (DENSE, SPARSE, UNSIGNED_DATA)).
First it checks sparse/dense compatibility:
# First check if these two instantiations of this node can work
# together. Do this in multiple if statements to maintain
# readability
if (data_is_sparse and SPARSE not in node_input) or \
        not data_is_sparse and DENSE not in node_input:
    matches[pipeline_instantiation_idxs] = 0
    break
# No need to check if the node can handle SIGNED_DATA; this is
# always assumed to be true
elif not dataset_is_signed and UNSIGNED_DATA not in node_input:
    matches[pipeline_instantiation_idxs] = 0
    break
The remaining checks follow the same pattern; together they verify whether a pipeline instantiation is feasible. The source here is fairly involved, so I will skip the details for now.
Finally, matches is returned.
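To summarize what get_match_array does, here is a self-contained toy version of the same idea (illustrative names, not the auto-sklearn API): enumerate every combination of per-node options and zero out combinations whose data types do not chain.

import itertools
import numpy as np

DENSE, SPARSE = "dense", "sparse"

# (accepted input types, produced output type) per option, per node.
node_options = [
    [({DENSE, SPARSE}, DENSE), ({DENSE}, DENSE)],              # node 0: 2 options
    [({DENSE}, DENSE), ({SPARSE}, DENSE), ({DENSE}, DENSE)],   # node 1: 3 options
]
matches_dimensions = [len(opts) for opts in node_options]      # -> [2, 3]
matches = np.ones(matches_dimensions, dtype=int)

for idxs in itertools.product(*(range(d) for d in matches_dimensions)):
    data_type = SPARSE                        # assume the raw data is sparse
    for node_idx, option_idx in enumerate(idxs):
        accepted, produced = node_options[node_idx][option_idx]
        if data_type not in accepted:         # this instantiation cannot work
            matches[idxs] = 0
            break
        data_type = produced                  # propagate to the next node

print(matches)  # 1 only where every node in the chain accepts its input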
autosklearn.pipeline.base.BasePipeline#_get_base_search_space:293
if not is_choice:
    cs.add_configuration_space(node_name,
        node.get_hyperparameter_search_space(dataset_properties))
# If the node is a choice, we have to figure out which of its
# choices are actually legal choices
else:
    choices_list = \
        autosklearn.pipeline.create_searchspace_util.find_active_choices(
            matches, node, node_idx,
            dataset_properties,
            include.get(node_name),
            exclude.get(node_name)
        )
    sub_config_space = node.get_hyperparameter_search_space(
        dataset_properties, include=choices_list)
    cs.add_configuration_space(node_name, sub_config_space)
For a choice node we take the else branch; choices_list holds all the candidate options:
choices_list
Out[29]: ['no_encoding', 'one_hot_encoding']
Let's also print:
sub_config_space
Out[30]:
Configuration space object:
Hyperparameters:
__choice__, Type: Categorical, Choices: {no_encoding, one_hot_encoding}, Default: one_hot_encoding
one_hot_encoding:minimum_fraction, Type: UniformFloat, Range: [0.0001, 0.5], Default: 0.01, on log-scale
one_hot_encoding:use_minimum_fraction, Type: Categorical, Choices: {True, False}, Default: True
Conditions:
one_hot_encoding:minimum_fraction | one_hot_encoding:use_minimum_fraction == 'True'
one_hot_encoding:use_minimum_fraction | __choice__ == 'one_hot_encoding'
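This nested space is a plain ConfigSpace object, so the same categorical_encoding sub-space can be rebuilt by hand in isolation. A minimal sketch using the ConfigSpace library directly (my reconstruction, not how auto-sklearn assembles it internally):

from ConfigSpace.configuration_space import ConfigurationSpace
from ConfigSpace.hyperparameters import (CategoricalHyperparameter,
                                         UniformFloatHyperparameter)
from ConfigSpace.conditions import EqualsCondition

cs = ConfigurationSpace()
choice = CategoricalHyperparameter(
    "__choice__", ["no_encoding", "one_hot_encoding"],
    default_value="one_hot_encoding")
use_mf = CategoricalHyperparameter(
    "one_hot_encoding:use_minimum_fraction", ["True", "False"],
    default_value="True")
min_frac = UniformFloatHyperparameter(
    "one_hot_encoding:minimum_fraction", 0.0001, 0.5,
    default_value=0.01, log=True)
cs.add_hyperparameters([choice, use_mf, min_frac])
# The two conditions mirror the dump above.
cs.add_condition(EqualsCondition(use_mf, choice, "one_hot_encoding"))
cs.add_condition(EqualsCondition(min_frac, use_mf, "True"))
print(cs.sample_configuration())  # draw one random, condition-respecting config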
Now print the feature-preprocessing part:
Configuration space object:
Hyperparameters:
__choice__, Type: Categorical, Choices: {extra_trees_preproc_for_classification, fast_ica, feature_agglomeration, kernel_pca, kitchen_sinks, liblinear_svc_preprocessor, no_preprocessing, nystroem_sampler, pca, polynomial, random_trees_embedding, select_percentile_classification, select_rates}, Default: no_preprocessing
extra_trees_preproc_for_classification:bootstrap, Type: Categorical, Choices: {True, False}, Default: False
extra_trees_preproc_for_classification:criterion, Type: Categorical, Choices: {gini, entropy}, Default: gini
extra_trees_preproc_for_classification:max_depth, Type: Constant, Value: None
extra_trees_preproc_for_classification:max_features, Type: UniformFloat, Range: [0.0, 1.0], Default: 0.5
extra_trees_preproc_for_classification:max_leaf_nodes, Type: Constant, Value: None
extra_trees_preproc_for_classification:min_impurity_decrease, Type: Constant, Value: 0.0
extra_trees_preproc_for_classification:min_samples_leaf, Type: UniformInteger, Range: [1, 20], Default: 1
extra_trees_preproc_for_classification:min_samples_split, Type: UniformInteger, Range: [2, 20], Default: 2
extra_trees_preproc_for_classification:min_weight_fraction_leaf, Type: Constant, Value: 0.0
extra_trees_preproc_for_classification:n_estimators, Type: Constant, Value: 100
fast_ica:algorithm, Type: Categorical, Choices: {parallel, deflation}, Default: parallel
fast_ica:fun, Type: Categorical, Choices: {logcosh, exp, cube}, Default: logcosh
fast_ica:n_components, Type: UniformInteger, Range: [10, 2000], Default: 100
fast_ica:whiten, Type: Categorical, Choices: {False, True}, Default: False
feature_agglomeration:affinity, Type: Categorical, Choices: {euclidean, manhattan, cosine}, Default: euclidean
feature_agglomeration:linkage, Type: Categorical, Choices: {ward, complete, average}, Default: ward
feature_agglomeration:n_clusters, Type: UniformInteger, Range: [2, 400], Default: 25
feature_agglomeration:pooling_func, Type: Categorical, Choices: {mean, median, max}, Default: mean
kernel_pca:coef0, Type: UniformFloat, Range: [-1.0, 1.0], Default: 0.0
kernel_pca:degree, Type: UniformInteger, Range: [2, 5], Default: 3
kernel_pca:gamma, Type: UniformFloat, Range: [3.0517578125e-05, 8.0], Default: 1.0, on log-scale
kernel_pca:kernel, Type: Categorical, Choices: {poly, rbf, sigmoid, cosine}, Default: rbf
kernel_pca:n_components, Type: UniformInteger, Range: [10, 2000], Default: 100
kitchen_sinks:gamma, Type: UniformFloat, Range: [3.0517578125e-05, 8.0], Default: 1.0, on log-scale
kitchen_sinks:n_components, Type: UniformInteger, Range: [50, 10000], Default: 100, on log-scale
liblinear_svc_preprocessor:C, Type: UniformFloat, Range: [0.03125, 32768.0], Default: 1.0, on log-scale
liblinear_svc_preprocessor:dual, Type: Constant, Value: False
liblinear_svc_preprocessor:fit_intercept, Type: Constant, Value: True
liblinear_svc_preprocessor:intercept_scaling, Type: Constant, Value: 1
liblinear_svc_preprocessor:loss, Type: Categorical, Choices: {hinge, squared_hinge}, Default: squared_hinge
liblinear_svc_preprocessor:multi_class, Type: Constant, Value: ovr
liblinear_svc_preprocessor:penalty, Type: Constant, Value: l1
liblinear_svc_preprocessor:tol, Type: UniformFloat, Range: [1e-05, 0.1], Default: 0.0001, on log-scale
nystroem_sampler:coef0, Type: UniformFloat, Range: [-1.0, 1.0], Default: 0.0
nystroem_sampler:degree, Type: UniformInteger, Range: [2, 5], Default: 3
nystroem_sampler:gamma, Type: UniformFloat, Range: [3.0517578125e-05, 8.0], Default: 0.1, on log-scale
nystroem_sampler:kernel, Type: Categorical, Choices: {poly, rbf, sigmoid, cosine}, Default: rbf
nystroem_sampler:n_components, Type: UniformInteger, Range: [50, 10000], Default: 100, on log-scale
pca:keep_variance, Type: UniformFloat, Range: [0.5, 0.9999], Default: 0.9999
pca:whiten, Type: Categorical, Choices: {False, True}, Default: False
polynomial:degree, Type: UniformInteger, Range: [2, 3], Default: 2
polynomial:include_bias, Type: Categorical, Choices: {True, False}, Default: True
polynomial:interaction_only, Type: Categorical, Choices: {False, True}, Default: False
random_trees_embedding:bootstrap, Type: Categorical, Choices: {True, False}, Default: True
random_trees_embedding:max_depth, Type: UniformInteger, Range: [2, 10], Default: 5
random_trees_embedding:max_leaf_nodes, Type: Constant, Value: None
random_trees_embedding:min_samples_leaf, Type: UniformInteger, Range: [1, 20], Default: 1
random_trees_embedding:min_samples_split, Type: UniformInteger, Range: [2, 20], Default: 2
random_trees_embedding:min_weight_fraction_leaf, Type: Constant, Value: 1.0
random_trees_embedding:n_estimators, Type: UniformInteger, Range: [10, 100], Default: 10
select_percentile_classification:percentile, Type: UniformFloat, Range: [1.0, 99.0], Default: 50.0
select_percentile_classification:score_func, Type: Categorical, Choices: {chi2, f_classif, mutual_info}, Default: chi2
select_rates:alpha, Type: UniformFloat, Range: [0.01, 0.5], Default: 0.1
select_rates:mode, Type: Categorical, Choices: {fpr, fdr, fwe}, Default: fpr
select_rates:score_func, Type: Categorical, Choices: {chi2, f_classif}, Default: chi2
Conditions:
extra_trees_preproc_for_classification:bootstrap | __choice__ == 'extra_trees_preproc_for_classification'
extra_trees_preproc_for_classification:criterion | __choice__ == 'extra_trees_preproc_for_classification'
extra_trees_preproc_for_classification:max_depth | __choice__ == 'extra_trees_preproc_for_classification'
extra_trees_preproc_for_classification:max_features | __choice__ == 'extra_trees_preproc_for_classification'
extra_trees_preproc_for_classification:max_leaf_nodes | __choice__ == 'extra_trees_preproc_for_classification'
extra_trees_preproc_for_classification:min_impurity_decrease | __choice__ == 'extra_trees_preproc_for_classification'
extra_trees_preproc_for_classification:min_samples_leaf | __choice__ == 'extra_trees_preproc_for_classification'
extra_trees_preproc_for_classification:min_samples_split | __choice__ == 'extra_trees_preproc_for_classification'
extra_trees_preproc_for_classification:min_weight_fraction_leaf | __choice__ == 'extra_trees_preproc_for_classification'
extra_trees_preproc_for_classification:n_estimators | __choice__ == 'extra_trees_preproc_for_classification'
fast_ica:algorithm | __choice__ == 'fast_ica'
fast_ica:fun | __choice__ == 'fast_ica'
fast_ica:n_components | fast_ica:whiten == 'True'
fast_ica:whiten | __choice__ == 'fast_ica'
feature_agglomeration:affinity | __choice__ == 'feature_agglomeration'
feature_agglomeration:linkage | __choice__ == 'feature_agglomeration'
feature_agglomeration:n_clusters | __choice__ == 'feature_agglomeration'
feature_agglomeration:pooling_func | __choice__ == 'feature_agglomeration'
kernel_pca:degree | kernel_pca:kernel == 'poly'
kernel_pca:kernel | __choice__ == 'kernel_pca'
kernel_pca:n_components | __choice__ == 'kernel_pca'
kitchen_sinks:gamma | __choice__ == 'kitchen_sinks'
kitchen_sinks:n_components | __choice__ == 'kitchen_sinks'
liblinear_svc_preprocessor:C | __choice__ == 'liblinear_svc_preprocessor'
liblinear_svc_preprocessor:dual | __choice__ == 'liblinear_svc_preprocessor'
liblinear_svc_preprocessor:fit_intercept | __choice__ == 'liblinear_svc_preprocessor'
liblinear_svc_preprocessor:intercept_scaling | __choice__ == 'liblinear_svc_preprocessor'
liblinear_svc_preprocessor:loss | __choice__ == 'liblinear_svc_preprocessor'
liblinear_svc_preprocessor:multi_class | __choice__ == 'liblinear_svc_preprocessor'
liblinear_svc_preprocessor:penalty | __choice__ == 'liblinear_svc_preprocessor'
liblinear_svc_preprocessor:tol | __choice__ == 'liblinear_svc_preprocessor'
nystroem_sampler:degree | nystroem_sampler:kernel == 'poly'
nystroem_sampler:kernel | __choice__ == 'nystroem_sampler'
nystroem_sampler:n_components | __choice__ == 'nystroem_sampler'
pca:keep_variance | __choice__ == 'pca'
pca:whiten | __choice__ == 'pca'
polynomial:degree | __choice__ == 'polynomial'
polynomial:include_bias | __choice__ == 'polynomial'
polynomial:interaction_only | __choice__ == 'polynomial'
preprocessor:kernel_pca:coef0 | preprocessor:kernel_pca:kernel in {'poly', 'sigmoid'}
preprocessor:kernel_pca:gamma | preprocessor:kernel_pca:kernel in {'poly', 'rbf'}
preprocessor:nystroem_sampler:coef0 | preprocessor:nystroem_sampler:kernel in {'poly', 'sigmoid'}
preprocessor:nystroem_sampler:gamma | preprocessor:nystroem_sampler:kernel in {'poly', 'rbf', 'sigmoid'}
random_trees_embedding:bootstrap | __choice__ == 'random_trees_embedding'
random_trees_embedding:max_depth | __choice__ == 'random_trees_embedding'
random_trees_embedding:max_leaf_nodes | __choice__ == 'random_trees_embedding'
random_trees_embedding:min_samples_leaf | __choice__ == 'random_trees_embedding'
random_trees_embedding:min_samples_split | __choice__ == 'random_trees_embedding'
random_trees_embedding:min_weight_fraction_leaf | __choice__ == 'random_trees_embedding'
random_trees_embedding:n_estimators | __choice__ == 'random_trees_embedding'
select_percentile_classification:percentile | __choice__ == 'select_percentile_classification'
select_percentile_classification:score_func | __choice__ == 'select_percentile_classification'
select_rates:alpha | __choice__ == 'select_rates'
select_rates:mode | __choice__ == 'select_rates'
select_rates:score_func | __choice__ == 'select_rates'
Forbidden Clauses:
(Forbidden: preprocessor:feature_agglomeration:affinity in {'cosine', 'manhattan'} && Forbidden: preprocessor:feature_agglomeration:linkage == 'ward')
(Forbidden: preprocessor:liblinear_svc_preprocessor:penalty == 'l1' && Forbidden: preprocessor:liblinear_svc_preprocessor:loss == 'hinge')
Now print the model hyperparameter part:
Configuration space object:
Hyperparameters:
__choice__, Type: Categorical, Choices: {adaboost, bernoulli_nb, decision_tree, extra_trees, gaussian_nb, gradient_boosting, k_nearest_neighbors, lda, liblinear_svc, libsvm_svc, multinomial_nb, passive_aggressive, qda, random_forest, sgd}, Default: random_forest
adaboost:algorithm, Type: Categorical, Choices: {SAMME.R, SAMME}, Default: SAMME.R
adaboost:learning_rate, Type: UniformFloat, Range: [0.01, 2.0], Default: 0.1, on log-scale
adaboost:max_depth, Type: UniformInteger, Range: [1, 10], Default: 1
adaboost:n_estimators, Type: UniformInteger, Range: [50, 500], Default: 50
bernoulli_nb:alpha, Type: UniformFloat, Range: [0.01, 100.0], Default: 1.0, on log-scale
bernoulli_nb:fit_prior, Type: Categorical, Choices: {True, False}, Default: True
decision_tree:criterion, Type: Categorical, Choices: {gini, entropy}, Default: gini
decision_tree:max_depth_factor, Type: UniformFloat, Range: [0.0, 2.0], Default: 0.5
decision_tree:max_features, Type: Constant, Value: 1.0
decision_tree:max_leaf_nodes, Type: Constant, Value: None
decision_tree:min_impurity_decrease, Type: Constant, Value: 0.0
decision_tree:min_samples_leaf, Type: UniformInteger, Range: [1, 20], Default: 1
decision_tree:min_samples_split, Type: UniformInteger, Range: [2, 20], Default: 2
decision_tree:min_weight_fraction_leaf, Type: Constant, Value: 0.0
extra_trees:bootstrap, Type: Categorical, Choices: {True, False}, Default: False
extra_trees:criterion, Type: Categorical, Choices: {gini, entropy}, Default: gini
extra_trees:max_depth, Type: Constant, Value: None
extra_trees:max_features, Type: UniformFloat, Range: [0.0, 1.0], Default: 0.5
extra_trees:max_leaf_nodes, Type: Constant, Value: None
extra_trees:min_impurity_decrease, Type: Constant, Value: 0.0
extra_trees:min_samples_leaf, Type: UniformInteger, Range: [1, 20], Default: 1
extra_trees:min_samples_split, Type: UniformInteger, Range: [2, 20], Default: 2
extra_trees:min_weight_fraction_leaf, Type: Constant, Value: 0.0
extra_trees:n_estimators, Type: Constant, Value: 100
gradient_boosting:early_stop, Type: Categorical, Choices: {off, train, valid}, Default: off
gradient_boosting:l2_regularization, Type: UniformFloat, Range: [1e-10, 1.0], Default: 1e-10, on log-scale
gradient_boosting:learning_rate, Type: UniformFloat, Range: [0.01, 1.0], Default: 0.1, on log-scale
gradient_boosting:loss, Type: Constant, Value: auto
gradient_boosting:max_bins, Type: Constant, Value: 256
gradient_boosting:max_depth, Type: Constant, Value: None
gradient_boosting:max_iter, Type: UniformInteger, Range: [32, 512], Default: 100
gradient_boosting:max_leaf_nodes, Type: UniformInteger, Range: [3, 2047], Default: 31, on log-scale
gradient_boosting:min_samples_leaf, Type: UniformInteger, Range: [1, 200], Default: 20, on log-scale
gradient_boosting:n_iter_no_change, Type: UniformInteger, Range: [1, 20], Default: 10
gradient_boosting:scoring, Type: Constant, Value: loss
gradient_boosting:tol, Type: Constant, Value: 1e-07
gradient_boosting:validation_fraction, Type: UniformFloat, Range: [0.01, 0.4], Default: 0.1
k_nearest_neighbors:n_neighbors, Type: UniformInteger, Range: [1, 100], Default: 1, on log-scale
k_nearest_neighbors:p, Type: Categorical, Choices: {1, 2}, Default: 2
k_nearest_neighbors:weights, Type: Categorical, Choices: {uniform, distance}, Default: uniform
lda:n_components, Type: UniformInteger, Range: [1, 250], Default: 10
lda:shrinkage, Type: Categorical, Choices: {None, auto, manual}, Default: None
lda:shrinkage_factor, Type: UniformFloat, Range: [0.0, 1.0], Default: 0.5
lda:tol, Type: UniformFloat, Range: [1e-05, 0.1], Default: 0.0001, on log-scale
liblinear_svc:C, Type: UniformFloat, Range: [0.03125, 32768.0], Default: 1.0, on log-scale
liblinear_svc:dual, Type: Constant, Value: False
liblinear_svc:fit_intercept, Type: Constant, Value: True
liblinear_svc:intercept_scaling, Type: Constant, Value: 1
liblinear_svc:loss, Type: Categorical, Choices: {hinge, squared_hinge}, Default: squared_hinge
liblinear_svc:multi_class, Type: Constant, Value: ovr
liblinear_svc:penalty, Type: Categorical, Choices: {l1, l2}, Default: l2
liblinear_svc:tol, Type: UniformFloat, Range: [1e-05, 0.1], Default: 0.0001, on log-scale
libsvm_svc:C, Type: UniformFloat, Range: [0.03125, 32768.0], Default: 1.0, on log-scale
libsvm_svc:coef0, Type: UniformFloat, Range: [-1.0, 1.0], Default: 0.0
libsvm_svc:degree, Type: UniformInteger, Range: [2, 5], Default: 3
libsvm_svc:gamma, Type: UniformFloat, Range: [3.0517578125e-05, 8.0], Default: 0.1, on log-scale
libsvm_svc:kernel, Type: Categorical, Choices: {rbf, poly, sigmoid}, Default: rbf
libsvm_svc:max_iter, Type: Constant, Value: -1
libsvm_svc:shrinking, Type: Categorical, Choices: {True, False}, Default: True
libsvm_svc:tol, Type: UniformFloat, Range: [1e-05, 0.1], Default: 0.001, on log-scale
multinomial_nb:alpha, Type: UniformFloat, Range: [0.01, 100.0], Default: 1.0, on log-scale
multinomial_nb:fit_prior, Type: Categorical, Choices: {True, False}, Default: True
passive_aggressive:C, Type: UniformFloat, Range: [1e-05, 10.0], Default: 1.0, on log-scale
passive_aggressive:average, Type: Categorical, Choices: {False, True}, Default: False
passive_aggressive:fit_intercept, Type: Constant, Value: True
passive_aggressive:loss, Type: Categorical, Choices: {hinge, squared_hinge}, Default: hinge
passive_aggressive:tol, Type: UniformFloat, Range: [1e-05, 0.1], Default: 0.0001, on log-scale
qda:reg_param, Type: UniformFloat, Range: [0.0, 1.0], Default: 0.0
random_forest:bootstrap, Type: Categorical, Choices: {True, False}, Default: True
random_forest:criterion, Type: Categorical, Choices: {gini, entropy}, Default: gini
random_forest:max_depth, Type: Constant, Value: None
random_forest:max_features, Type: UniformFloat, Range: [0.0, 1.0], Default: 0.5
random_forest:max_leaf_nodes, Type: Constant, Value: None
random_forest:min_impurity_decrease, Type: Constant, Value: 0.0
random_forest:min_samples_leaf, Type: UniformInteger, Range: [1, 20], Default: 1
random_forest:min_samples_split, Type: UniformInteger, Range: [2, 20], Default: 2
random_forest:min_weight_fraction_leaf, Type: Constant, Value: 0.0
random_forest:n_estimators, Type: Constant, Value: 100
sgd:alpha, Type: UniformFloat, Range: [1e-07, 0.1], Default: 0.0001, on log-scale
sgd:average, Type: Categorical, Choices: {False, True}, Default: False
sgd:epsilon, Type: UniformFloat, Range: [1e-05, 0.1], Default: 0.0001, on log-scale
sgd:eta0, Type: UniformFloat, Range: [1e-07, 0.1], Default: 0.01, on log-scale
sgd:fit_intercept, Type: Constant, Value: True
sgd:l1_ratio, Type: UniformFloat, Range: [1e-09, 1.0], Default: 0.15, on log-scale
sgd:learning_rate, Type: Categorical, Choices: {optimal, invscaling, constant}, Default: invscaling
sgd:loss, Type: Categorical, Choices: {hinge, log, modified_huber, squared_hinge, perceptron}, Default: log
sgd:penalty, Type: Categorical, Choices: {l1, l2, elasticnet}, Default: l2
sgd:power_t, Type: UniformFloat, Range: [1e-05, 1.0], Default: 0.5
sgd:tol, Type: UniformFloat, Range: [1e-05, 0.1], Default: 0.0001, on log-scale
Conditions:
adaboost:algorithm | __choice__ == 'adaboost'
adaboost:learning_rate | __choice__ == 'adaboost'
adaboost:max_depth | __choice__ == 'adaboost'
adaboost:n_estimators | __choice__ == 'adaboost'
bernoulli_nb:alpha | __choice__ == 'bernoulli_nb'
bernoulli_nb:fit_prior | __choice__ == 'bernoulli_nb'
decision_tree:criterion | __choice__ == 'decision_tree'
decision_tree:max_depth_factor | __choice__ == 'decision_tree'
decision_tree:max_features | __choice__ == 'decision_tree'
decision_tree:max_leaf_nodes | __choice__ == 'decision_tree'
decision_tree:min_impurity_decrease | __choice__ == 'decision_tree'
decision_tree:min_samples_leaf | __choice__ == 'decision_tree'
decision_tree:min_samples_split | __choice__ == 'decision_tree'
decision_tree:min_weight_fraction_leaf | __choice__ == 'decision_tree'
extra_trees:bootstrap | __choice__ == 'extra_trees'
extra_trees:criterion | __choice__ == 'extra_trees'
extra_trees:max_depth | __choice__ == 'extra_trees'
extra_trees:max_features | __choice__ == 'extra_trees'
extra_trees:max_leaf_nodes | __choice__ == 'extra_trees'
extra_trees:min_impurity_decrease | __choice__ == 'extra_trees'
extra_trees:min_samples_leaf | __choice__ == 'extra_trees'
extra_trees:min_samples_split | __choice__ == 'extra_trees'
extra_trees:min_weight_fraction_leaf | __choice__ == 'extra_trees'
extra_trees:n_estimators | __choice__ == 'extra_trees'
gradient_boosting:early_stop | __choice__ == 'gradient_boosting'
gradient_boosting:l2_regularization | __choice__ == 'gradient_boosting'
gradient_boosting:learning_rate | __choice__ == 'gradient_boosting'
gradient_boosting:loss | __choice__ == 'gradient_boosting'
gradient_boosting:max_bins | __choice__ == 'gradient_boosting'
gradient_boosting:max_depth | __choice__ == 'gradient_boosting'
gradient_boosting:max_iter | __choice__ == 'gradient_boosting'
gradient_boosting:max_leaf_nodes | __choice__ == 'gradient_boosting'
gradient_boosting:min_samples_leaf | __choice__ == 'gradient_boosting'
gradient_boosting:n_iter_no_change | gradient_boosting:early_stop in {'valid', 'train'}
gradient_boosting:scoring | __choice__ == 'gradient_boosting'
gradient_boosting:tol | __choice__ == 'gradient_boosting'
gradient_boosting:validation_fraction | gradient_boosting:early_stop == 'valid'
k_nearest_neighbors:n_neighbors | __choice__ == 'k_nearest_neighbors'
k_nearest_neighbors:p | __choice__ == 'k_nearest_neighbors'
k_nearest_neighbors:weights | __choice__ == 'k_nearest_neighbors'
lda:n_components | __choice__ == 'lda'
lda:shrinkage | __choice__ == 'lda'
lda:shrinkage_factor | lda:shrinkage == 'manual'
lda:tol | __choice__ == 'lda'
liblinear_svc:C | __choice__ == 'liblinear_svc'
liblinear_svc:dual | __choice__ == 'liblinear_svc'
liblinear_svc:fit_intercept | __choice__ == 'liblinear_svc'
liblinear_svc:intercept_scaling | __choice__ == 'liblinear_svc'
liblinear_svc:loss | __choice__ == 'liblinear_svc'
liblinear_svc:multi_class | __choice__ == 'liblinear_svc'
liblinear_svc:penalty | __choice__ == 'liblinear_svc'
liblinear_svc:tol | __choice__ == 'liblinear_svc'
libsvm_svc:C | __choice__ == 'libsvm_svc'
libsvm_svc:coef0 | libsvm_svc:kernel in {'poly', 'sigmoid'}
libsvm_svc:degree | libsvm_svc:kernel == 'poly'
libsvm_svc:gamma | __choice__ == 'libsvm_svc'
libsvm_svc:kernel | __choice__ == 'libsvm_svc'
libsvm_svc:max_iter | __choice__ == 'libsvm_svc'
libsvm_svc:shrinking | __choice__ == 'libsvm_svc'
libsvm_svc:tol | __choice__ == 'libsvm_svc'
multinomial_nb:alpha | __choice__ == 'multinomial_nb'
multinomial_nb:fit_prior | __choice__ == 'multinomial_nb'
passive_aggressive:C | __choice__ == 'passive_aggressive'
passive_aggressive:average | __choice__ == 'passive_aggressive'
passive_aggressive:fit_intercept | __choice__ == 'passive_aggressive'
passive_aggressive:loss | __choice__ == 'passive_aggressive'
passive_aggressive:tol | __choice__ == 'passive_aggressive'
qda:reg_param | __choice__ == 'qda'
random_forest:bootstrap | __choice__ == 'random_forest'
random_forest:criterion | __choice__ == 'random_forest'
random_forest:max_depth | __choice__ == 'random_forest'
random_forest:max_features | __choice__ == 'random_forest'
random_forest:max_leaf_nodes | __choice__ == 'random_forest'
random_forest:min_impurity_decrease | __choice__ == 'random_forest'
random_forest:min_samples_leaf | __choice__ == 'random_forest'
random_forest:min_samples_split | __choice__ == 'random_forest'
random_forest:min_weight_fraction_leaf | __choice__ == 'random_forest'
random_forest:n_estimators | __choice__ == 'random_forest'
sgd:alpha | __choice__ == 'sgd'
sgd:average | __choice__ == 'sgd'
sgd:epsilon | sgd:loss == 'modified_huber'
sgd:eta0 | sgd:learning_rate in {'invscaling', 'constant'}
sgd:fit_intercept | __choice__ == 'sgd'
sgd:l1_ratio | sgd:penalty == 'elasticnet'
sgd:learning_rate | __choice__ == 'sgd'
sgd:loss | __choice__ == 'sgd'
sgd:penalty | __choice__ == 'sgd'
sgd:power_t | sgd:learning_rate == 'invscaling'
sgd:tol | __choice__ == 'sgd'
Forbidden Clauses:
(Forbidden: liblinear_svc:penalty == 'l1' && Forbidden: liblinear_svc:loss == 'hinge')
(Forbidden: liblinear_svc:dual == 'False' && Forbidden: liblinear_svc:penalty == 'l2' && Forbidden: liblinear_svc:loss == 'hinge')
(Forbidden: liblinear_svc:dual == 'False' && Forbidden: liblinear_svc:penalty == 'l1')
At this point we have essentially worked out how the hyperparameter search space is constructed.
autosklearn.automl.AutoML#_fit:462
The meta-learning computation happens in the _proc_smac stage: _proc_smac = AutoMLSMBO(...), then _proc_smac.run_smbo().
self._initial_configurations_via_metalearning
Out[35]: 25
autosklearn.smbo.AutoMLSMBO#run_smbo
metalearning_configurations = self.get_metalearning_suggestions()
Step into: autosklearn.smbo.AutoMLSMBO#get_metalearning_suggestions
Step into: autosklearn.metalearning.metalearning.meta_base.MetaBase#__init__
Step into: autosklearn.metalearning.input.aslib_simple.AlgorithmSelectionProblem#__init__
This constructor reads in the files that meta-learning needs:
~/PycharmProjects/automl/auto-sklearn/autosklearn/metalearning/files/accuracy_binary.classification_dense $ tree
.
├── algorithm_runs.arff
├── configurations.csv
├── description.txt
├── feature_costs.arff
├── feature_runstatus.arff
├── feature_values.arff
└── readme.txt
The semantics of each file can also be inferred from this code:
self.read_funcs = {
    "algorithm_runs.arff": self._read_algorithm_runs,
    "feature_values.arff": self._read_feature_values,
    "configurations.csv": self._read_configurations
}
Note the .arff files: this is the Attribute-Relation File Format (ARFF); see the format's overview.
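auto-sklearn parses these files with the liac-arff package (one of its dependencies); a quick way to inspect one yourself, assuming the directory from the tree listing above:

import arff  # the liac-arff package

path = ("autosklearn/metalearning/files/"
        "accuracy_binary.classification_dense/algorithm_runs.arff")
with open(path) as f:
    arff_dict = arff.load(f)

print(arff_dict["relation"])        # relation name
print(arff_dict["attributes"][:3])  # [(attribute name, type), ...]
print(arff_dict["data"][0])         # first data row, as shown below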
autosklearn.metalearning.input.aslib_simple.AlgorithmSelectionProblem#_read_algorithm_runs
Note:
measure_instance_algorithm_triples = defaultdict(lambda: defaultdict(dict))
This is effectively a dict of depth 3, with the first two levels being defaultdicts.
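A two-line illustration of that nesting: intermediate levels are created on first access, so the read loop below can assign without checking whether keys exist.

from collections import defaultdict

d = defaultdict(lambda: defaultdict(dict))
d["accuracy"]["2120"]["1"] = 0.078  # measure -> instance -> algorithm
print(d["accuracy"]["2120"])        # {'1': 0.078}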
performance_measures
Out[4]: ['accuracy']
arff_dict["data"]
Out[5]:
[['2120', 1.0, '1', 0.07826496935407823, 'ok'],
['75193', 1.0, '2', 0.03999833101239747, 'ok'],
['2117', 1.0, '3', 0.1586523546565738, 'ok'],
['75156', 1.0, '4', 0.21584478577202915, 'ok'],
for data in arff_dict["data"]:
    inst_name = str(data[0])
    repetition = data[1]
    algorithm = str(data[2])
    perf_list = data[3:-1]
    status = data[-1]
    # (abridged) the original source loops over all performance measures here
    for i, performance_measure in enumerate(performance_measures):
        measure_instance_algorithm_triples[performance_measure][
            inst_name][algorithm] = perf_list[i]
measure_instance_algorithm_triples is a depth-3 dict: the first level is the performance measure (e.g. accuracy), the second is the instance (e.g. a training task on some dataset), and the third is the algorithm that was run. The leaf value is the performance under those three keys (e.g. the accuracy of SVM on dataset A).
Inspecting the variable in the console:
measure_instance_algorithm_triples
Out[13]:
defaultdict(<function autosklearn.metalearning.input.aslib_simple.AlgorithmSelectionProblem._read_algorithm_runs.<locals>.<lambda>()>,
{'7': defaultdict(dict, {}),
'accuracy': defaultdict(dict,
{'2120': {'1': 0.07826496935407823},
'75193': {'2': 0.03999833101239747},
'2117': {'3': 0.1586523546565738},
'75156': {'4': 0.21584478577202915},
...
'75225': {'111': 0.11244019138755978},
'75141': {'112': 0.052904180540140566},
'75107': {'113': 0.050303030303030294},
'75097': {'114': 0.0602053084250439}})})
pd.DataFrame(measure_instance_algorithm_triples[pm])
Out[16]:
2120 75193 2117 ... 75141 75107 75097
1 0.078265 NaN NaN ... NaN NaN NaN
2 NaN 0.039998 NaN ... NaN NaN NaN
3 NaN NaN 0.158652 ... NaN NaN NaN
4 NaN NaN NaN ... NaN NaN NaN
5 NaN NaN NaN ... NaN NaN NaN
Each performance measure thus yields a matrix:
measure_algorithm_matrices = OrderedDict()
for pm in performance_measures:
    measure_algorithm_matrices[pm] = pd.DataFrame(
        measure_instance_algorithm_triples[pm]).transpose()
self.algorithm_runs = measure_algorithm_matrices
autosklearn.metalearning.input.aslib_simple.AlgorithmSelectionProblem#_read_feature_values
filename
Out[19]: '/home/tqc/PycharmProjects/automl/auto-sklearn/autosklearn/metalearning/files/accuracy_binary.classification_dense/feature_values.arff'
arff_dict["data"][0] # 0: 数据集名称 1: 重复次数 2::metafeature
Out[20]:
['75249',
1.0,
0.3891675492147647,
0.9236550632911392,
0.5,
0.07634493670886076,
0.4236550632911392,
0.011471518987341773,
87.17241379310344,
416.33571239756776,
46.57141283642045,
-3.0,
...
for data in arff_dict["data"]:
    inst_name = data[0]
    repetition = data[1]
    features = data[2:]
Printing metafeatures:
'75239': {'ClassEntropy': 0.9443547030267275,
'ClassProbabilityMax': 0.6379707916986933,
'ClassProbabilityMean': 0.5,
'ClassProbabilityMin': 0.3620292083013067,
'ClassProbabilitySTD': 0.1379707916986933,
'DatasetRatio': 0.02536510376633359,
'InverseDatasetRatio': 39.42424242424242,
'KurtosisMax': 30.96304702758789,
'KurtosisMean': 5.229958094222913,
'KurtosisMin': -1.8830041885375977,
'KurtosisSTD': 9.159290860687747,
'Landmark1NN': 0.9854139753376394,
'LandmarkDecisionNodeLearner': 0.6379741632413387,
'LandmarkDecisionTree': 1.0,
'LandmarkLDA': 0.7832472108044627,
'LandmarkNaiveBayes': 0.9976981796829125,
'LandmarkRandomNodeLearner': 0.6379741632413387,
'LogDatasetRatio': -3.674380917046025,
'LogInverseDatasetRatio': 3.674380917046025,
'LogNumberOfFeatures': 3.4965075614664802,
'LogNumberOfInstances': 7.170888478512505,
'NumberOfCategoricalFeatures': 0.0,
'NumberOfClasses': 2.0,
'NumberOfFeatures': 33.0,
'NumberOfFeaturesWithMissingValues': 0.0,
'NumberOfInstances': 1301.0,
'NumberOfInstancesWithMissingValues': 0.0,
'NumberOfMissingValues': 0.0,
'NumberOfNumericFeatures': 33.0,
'PCAFractionOfComponentsFor95PercentVariance': 0.5151515151515151,
'PCAKurtosisFirstPC': 1.5021543502807617,
'PCASkewnessFirstPC': 1.460637092590332,
'PercentageOfFeaturesWithMissingValues': 0.0,
'PercentageOfInstancesWithMissingValues': 0.0,
'PercentageOfMissingValues': 0.0,
'RatioNominalToNumerical': 0.0,
'RatioNumericalToNominal': 0.0,
'SkewnessMax': 5.591684341430664,
'SkewnessMean': 1.5151810998266393,
'SkewnessMin': -0.8763390183448792,
'SkewnessSTD': 1.68213778251707,
'SymbolsMax': 0.0,
'SymbolsMean': 0.0,
'SymbolsMin': 0.0,
'SymbolsSTD': 0.0,
'SymbolsSum': 0.0},
The last line:
self.metafeatures = pd.DataFrame(metafeatures).transpose()
self.metafeatures
Out[24]:
ClassEntropy ClassProbabilityMax ... SymbolsSTD SymbolsSum
75249 0.389168 0.923655 ... 0.0 0.0
75203 2.447791 0.289707 ... 0.0 0.0
75090 3.316085 0.111876 ... 0.0 0.0
75213 0.758988 0.780645 ... 0.0 0.0
self.metafeatures holds the metafeatures of each dataset.
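A tiny illustration of that last step: in pd.DataFrame(dict_of_dicts) the outer keys become columns, so .transpose() is what puts datasets on the rows (values taken from the dump above):

import pandas as pd

metafeatures = {
    "75249": {"ClassEntropy": 0.389168, "SymbolsSum": 0.0},
    "75203": {"ClassEntropy": 2.447791, "SymbolsSum": 0.0},
}
print(pd.DataFrame(metafeatures).transpose())
# one row per dataset, one column per metafeature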
autosklearn.metalearning.input.aslib_simple.AlgorithmSelectionProblem#_read_configurations
filename
Out[27]: '/home/tqc/PycharmProjects/automl/auto-sklearn/autosklearn/metalearning/files/accuracy_binary.classification_dense/configurations.csv'
Printing configurations:
'114': {'balancing:strategy': 'none',
'categorical_encoding:__choice__': 'no_encoding',
'classifier:__choice__': 'k_nearest_neighbors',
'classifier:k_nearest_neighbors:n_neighbors': 3,
'classifier:k_nearest_neighbors:p': 1,
'classifier:k_nearest_neighbors:weights': 'uniform',
'imputation:strategy': 'mean',
'preprocessor:__choice__': 'polynomial',
'preprocessor:polynomial:degree': 2,
'preprocessor:polynomial:include_bias': 'True',
'preprocessor:polynomial:interaction_only': 'False',
'rescaling:__choice__': 'quantile_transformer',
'rescaling:quantile_transformer:n_quantiles': 498,
'rescaling:quantile_transformer:output_distribution': 'normal'}}
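Downstream, these flat dicts are validated back into ConfigSpace Configuration objects before being handed to SMAC as initial designs. A sketch of that step, with a toy space standing in for the full auto-sklearn one:

from ConfigSpace.configuration_space import Configuration, ConfigurationSpace
from ConfigSpace.hyperparameters import CategoricalHyperparameter

cs = ConfigurationSpace()
cs.add_hyperparameter(CategoricalHyperparameter(
    "balancing:strategy", ["none", "weighting"], default_value="none"))

values = {"balancing:strategy": "none"}    # one entry of a configuration dict
config = Configuration(cs, values=values)  # raises if values violate the space
print(config)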
At this point we understand:

Variable | Meaning | File | Loading function
---|---|---|---
configurations | hyperparameter settings of each algorithm run | configurations.csv | _read_configurations
metafeatures | metafeatures of each dataset | feature_values.arff | _read_feature_values
algorithm_runs | per-measure instance × algorithm performance matrices | algorithm_runs.arff | _read_algorithm_runs

Note: self.algorithm_runs = measure_algorithm_matrices
Now we know how the metafeatures stored inside the auto-sklearn package correspond to the models it recommends.
Back to: autosklearn.smbo.AutoMLSMBO#get_metalearning_suggestions:601
Step into: autosklearn.smbo.AutoMLSMBO#_calculate_metafeatures_encoded
Step into: autosklearn.smbo._calculate_metafeatures_encoded
EXCLUDE_META_FEATURES = EXCLUDE_META_FEATURES_CLASSIFICATION \
    if task in CLASSIFICATION_TASKS else EXCLUDE_META_FEATURES_REGRESSION
Depending on whether the task is regression or classification, metafeatures that should not be computed are excluded. The task I am using for breakpoint debugging, for example, is a classification task:
EXCLUDE_META_FEATURES
Out[28]:
{'Landmark1NN',
'LandmarkDecisionNodeLearner',
'LandmarkDecisionTree',
'LandmarkLDA',
'LandmarkNaiveBayes',
'PCA',
'PCAFractionOfComponentsFor95PercentVariance',
'PCAKurtosisFirstPC',
'PCASkewnessFirstPC'}
result = calculate_all_metafeatures_encoded_labels(
    x_train, y_train, categorical=[False] * x_train.shape[1],
    dataset_name=basename, dont_calculate=EXCLUDE_META_FEATURES)
autosklearn.metalearning.metafeatures.metafeatures.calculate_all_metafeatures_encoded_labels
The code of this function is:
calculate = set()
calculate.update(npy_metafeatures)
return calculate_all_metafeatures(X, y, categorical, dataset_name,
                                  calculate=calculate,
                                  dont_calculate=dont_calculate)
npy_metafeatures here is a module-level global:
npy_metafeatures = set(["LandmarkLDA",
                        "LandmarkNaiveBayes",
                        "LandmarkDecisionTree",
                        "LandmarkDecisionNodeLearner",
                        "LandmarkRandomNodeLearner",
                        "LandmarkWorstNodeLearner",
                        "Landmark1NN",
                        "PCAFractionOfComponentsFor95PercentVariance",
                        "PCAKurtosisFirstPC",
                        "PCASkewnessFirstPC",
                        "Skewnesses",
                        "SkewnessMin",
                        "SkewnessMax",
                        "SkewnessMean",
                        "SkewnessSTD",
                        "Kurtosisses",
                        "KurtosisMin",
                        "KurtosisMax",
                        "KurtosisMean",
                        "KurtosisSTD"])
autosklearn.metalearning.metafeatures.metafeatures.calculate_all_metafeatures
Note two module-level globals:
metafeatures = MetafeatureFunctions()
helper_functions = HelperFunctions()
Note that a class decorated with @metafeatures.define has an instance of itself registered into the metafeatures object via __setitem__; metafeatures is a MetafeatureFunctions instance.
An excerpt as an example:
@metafeatures.define("NumberOfInstances")
class NumberOfInstances(MetaFeature):
    def _calculate(self, X, y, categorical):
        return float(X.shape[0])
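A minimal sketch of what such a decorator registry can look like (illustrative only; the real MetafeatureFunctions also tracks dependencies and helper functions):

class MetafeatureFunctions:
    """Registry mapping a metafeature name to a calculator instance."""

    def __init__(self):
        self.functions = {}

    def __getitem__(self, name):
        return self.functions[name]

    def __setitem__(self, name, instance):
        self.functions[name] = instance

    def define(self, name):
        def decorator(cls):
            self[name] = cls()  # register an instance, not the class
            return cls
        return decorator

metafeatures = MetafeatureFunctions()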
Note that if the metafeature currently being computed belongs to the npy set (if name in npy_metafeatures:), X first undergoes a couple of operations to become X_transformed:
X_transformed = imputer.fit_transform(X_transformed)
X_transformed = standard_scaler.fit_transform(X_transformed)
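These two lines are the usual impute-then-standardize preprocessing; a minimal modern-scikit-learn equivalent (assuming mean imputation, which is what the imputer created earlier in that function appears to use):

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, np.nan],
              [2.0, 0.5],
              [3.0, 1.5]])
X_transformed = SimpleImputer(strategy="mean").fit_transform(X)  # fill NaNs
X_transformed = StandardScaler().fit_transform(X_transformed)    # zero mean, unit variance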
Finally, the value of the current metafeature is computed directly:
value = metafeatures[name](X_, y_, categorical_)
metafeatures[name] is an instance followed by parentheses, so presumably the __call__ magic method is at work; tracing two levels up the class hierarchy we reach autosklearn.metalearning.metafeatures.metafeature.AbstractMetaFeature#__call__:
def __call__(self, X, y, categorical=None):
    if categorical is None:
        categorical = [False for i in range(X.shape[1])]
    starttime = time.time()
    try:
        if scipy.sparse.issparse(X) and hasattr(self, "_calculate_sparse"):
            value = self._calculate_sparse(X, y, categorical)
        else:
            value = self._calculate(X, y, categorical)
        comment = ""
    except MemoryError as e:
        value = None
        comment = "Memory Error"
    endtime = time.time()
    return MetaFeatureValue(self.__class__.__name__, self.type_,
                            0, 0, value, endtime - starttime, comment=comment)
What is returned is a MetaFeatureValue instance wrapping the metafeature's value and some bookkeeping information. mf_ is a dict whose keys are metafeature names and whose values are the corresponding MetaFeatureValue instances.
At this point we have figured out how a dataset's metafeatures are computed. Next, look at
autosklearn.smbo.AutoMLSMBO#get_metalearning_suggestions:538
meta_features
Out[2]:
Metafeatures for dataset breast_cancer
ClassEntropy: 0.9495480401701638
SymbolsSum: 0.0
SymbolsSTD: 0
SymbolsMean: 0
SymbolsMax: 0
SymbolsMin: 0
ClassProbabilitySTD: 0.13145539906103287
ClassProbabilityMean: 0.5
ClassProbabilityMax: 0.6314553990610329
ClassProbabilityMin: 0.3685446009389671
InverseDatasetRatio: 14.2
DatasetRatio: 0.07042253521126761
RatioNominalToNumerical: 0.0
RatioNumericalToNominal: 0.0
NumberOfCategoricalFeatures: 0
NumberOfNumericFeatures: 30
NumberOfMissingValues: 0.0
NumberOfFeaturesWithMissingValues: 0.0
NumberOfInstancesWithMissingValues: 0.0
NumberOfFeatures: 30.0
NumberOfClasses: 2.0
NumberOfInstances: 426.0
LogInverseDatasetRatio: 2.653241964607215
LogDatasetRatio: -2.653241964607215
PercentageOfMissingValues: 0.0
PercentageOfFeaturesWithMissingValues: 0.0
PercentageOfInstancesWithMissingValues: 0.0
LogNumberOfFeatures: 3.4011973816621555
LogNumberOfInstances: 6.054439346269371
meta_features_encoded
Out[6]:
Metafeatures for dataset breast_cancer
LandmarkRandomNodeLearner: 0.346312292358804
SkewnessSTD: 1.3079569844668182
SkewnessMean: 1.719863689861412
SkewnessMax: 5.497448960200661
SkewnessMin: 0.3638245081126332
KurtosisSTD: 13.149786948568595
KurtosisMean: 7.73073011481775
KurtosisMax: 54.11573179309323
KurtosisMin: -0.5734114126286567
meta_base.add_dataset(self.dataset_name, meta_features)
all_metafeatures = meta_base.get_metafeatures(
At this point all_metafeatures is a matrix that includes the current dataset:
all_metafeatures
Out[7]:
ClassEntropy SymbolsSum ... KurtosisMax KurtosisMin
75249 0.389168 0.0 ... 416.335712 -3.000000
75203 2.447791 0.0 ... 1381.000806 -3.000000
75090 3.316085 0.0 ... 2319.000488 -3.000000
75213 0.758988 0.0 ... 34.615789 -1.998501
75157 0.991535 0.0 ... 11.286622 -0.840580
... ... ... ... ... ...
75198 5.269287 0.0 ... 1690.960100 -3.000000
75156 0.995151 0.0 ... 2508.999535 -3.000000
75114 0.770740 0.0 ... 1028.527714 -1.146804
75230 6.618346 0.0 ... 8.500022 0.019823
breast_cancer 0.949548 0.0 ... 54.115732 -0.573411
[133 rows x 38 columns]
We then hit a crucial line of code:
metalearning_configurations = self.collect_metalearning_suggestions(meta_base)
The return value of the function we are in, get_metalearning_suggestions, is exactly this metalearning_configurations. So once we have the pre-computed meta-dataset and the current dataset's metafeatures, we can find similar datasets via a distance-based method and recommend their configurations to the current training task.
Step into: autosklearn.smbo.AutoMLSMBO#collect_metalearning_suggestions
Step into: autosklearn.smbo._get_metalearning_configurations
Step into: autosklearn.metalearning.mismbo.suggest_via_metalearning
ml = MetaLearningOptimizer(
    dataset_name=dataset_name,
    configuration_space=meta_base.configuration_space,
    meta_base=meta_base,
    distance='l1',
    seed=1,)
runs = ml.metalearning_suggest_all(exclude_double_configurations=True)
Step into: autosklearn.metalearning.optimizers.metalearn_optimizer.metalearner.MetaLearningOptimizer#metalearning_suggest_all
Step into: autosklearn.metalearning.optimizers.metalearn_optimizer.metalearner.MetaLearningOptimizer#_learn
In the _split_metafeature_array function:
return dataset_metafeatures, all_other_metafeatures
dataset_metafeatures: the metafeatures of the current dataset.
all_other_metafeatures: the metafeatures of all the other datasets.
all_other_metafeatures.shape
Out[11]: (132, 46)
dataset_metafeatures.shape
Out[12]: (46,)
keep = []
for idx in dataset_metafeatures.index:
    if np.isfinite(dataset_metafeatures.loc[idx]):
        keep.append(idx)
keep: the metafeatures that should actually enter the computation (those with finite values for the current dataset).
all_other_metafeatures = all_other_metafeatures.fillna(all_other_metafeatures.mean())
Missing values in the other datasets' metafeatures are filled with the column means.
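The learner then ranks the other datasets by L1 distance between metafeature vectors (the distance='l1' argument we saw above) and suggests their best-known configurations in that order. A condensed sketch of the ranking idea, with a hypothetical standardization step (the real _learn does its own rescaling and handles more edge cases):

import pandas as pd

def rank_datasets_by_l1(dataset_mf: pd.Series, other_mf: pd.DataFrame) -> pd.Series:
    # Standardize columns so no single metafeature dominates the distance.
    mean, std = other_mf.mean(), other_mf.std().replace(0, 1)
    z_other = (other_mf - mean) / std
    z_self = (dataset_mf - mean) / std
    distances = (z_other - z_self).abs().sum(axis=1)
    return distances.sort_values()  # nearest datasets first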
runs
Out[21]:
75249 75203 75090 75213 75157 ... 75112 75198 75156 75114 75230
1 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN ... NaN NaN 0.215845 NaN NaN
5 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN
Having walked through the meta-learning suggestions, we return to where we started:
autosklearn.smbo.AutoMLSMBO#run_smbo:388
How, finally, do the meta-learned configurations get into SMAC? See autosklearn.smbo.get_smac_object:
default_config = scenario.cs.get_default_configuration()
initial_configurations = [default_config] + metalearning_configurations
return SMAC(
    scenario=scenario,
    rng=seed,
    runhistory2epm=rh2EPM,
    tae_runner=ta,
    initial_configurations=initial_configurations,
    runhistory=runhistory,
    run_id=seed,
)