A Survey of auto-sklearn 2.0

To mitigate this risk, for algorithms that can be trained iteratively (e.g., gradient boosting and linear models trained with stochastic gradient descent), we implemented two measures.

  1. Firstly, we allow a pipeline to stop training based on a heuristic at any time, i.e. early stopping, which prevents overfitting.
  2. Secondly, we make use of intermittent results retrieval, e.g., saving the results at checkpoints spaced at geometrically increasing iteration numbers, thereby ensuring that every evaluation returns a performance and thus yields information for the optimizer (see the sketch below).

intermittent: occurring at intervals rather than continuously
retrieval: reading back stored results
geometrically increasing iteration numbers: checkpoints at iterations that grow by a constant factor, e.g., 2, 4, 8, 16, …
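A minimal sketch of how the two measures can work together, assuming a scikit-learn-style estimator with partial_fit; the function name, checkpoint schedule, and patience heuristic are illustrative assumptions, not auto-sklearn's exact implementation:

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

def fit_with_checkpoints(X_train, y_train, X_val, y_val, max_iter=1024, patience=2):
    # Train an SGD model, recording the validation score at geometrically
    # spaced checkpoints (2, 4, 8, ...) and stopping early once the score
    # has not improved for `patience` consecutive checkpoints.
    model = SGDClassifier()
    classes = np.unique(y_train)
    results = []
    best, stale = -np.inf, 0
    done, checkpoint = 0, 2
    while done < max_iter:
        for _ in range(checkpoint - done):   # epochs up to the next checkpoint
            model.partial_fit(X_train, y_train, classes=classes)
        done = checkpoint
        score = accuracy_score(y_val, model.predict(X_val))
        results.append((done, score))        # intermittent result: always available
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:            # early-stopping heuristic
                break
        checkpoint *= 2                      # geometrically increasing iterations
    return results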

Current hyperparameter optimization algorithms can cope with such spaces given enough time, but in this work we consider a heavily time-bounded setting. Therefore, we reduced our space to 42 hyperparameters, only including iterative models, to benefit from the early stopping and intermittent results retrieval.

This means AS2's full space has 153 hyperparameters, 43 more than AS1's 110. An algorithm like SMAC can certainly optimize these hyperparameters given enough time, but because of the heavily time-bounded setting, the space is cut down to 42 hyperparameters, keeping only iterative algorithms.

Note (open questions): Is AS2 an engineering-grade implementation of PoSH-auto-sklearn? Does it use the BOHB algorithm? Is the number of model starts 42 / 0.15 = 280?

We build the model for BO on the highest available budget where we have observed the performance of $\frac{|\Lambda|}{2}$ pipelines.

This sentence carries a lot of information. "Highest available budget" means the largest budget level for which enough evaluations exist. Key follow-up: trace the origin of the $\frac{|\Lambda|}{2}$ threshold by re-reading the BOHB paper and code.
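One plausible reading of that rule, sketched in code; here |Λ| is taken to be the number of hyperparameters in the search space, and the function name and data layout are assumptions for illustration:

def budget_for_bo_model(observations, n_hyperparameters):
    # Fit the BO model on the highest budget that has accumulated at
    # least |Lambda|/2 evaluated pipelines; fall back to the lowest
    # budget otherwise. `observations` maps budget -> list of
    # (config, loss) pairs.
    threshold = n_hyperparameters / 2
    eligible = [b for b, runs in observations.items() if len(runs) >= threshold]
    return max(eligible) if eligible else min(observations)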

SH potentially provides large speedups, but it could also too aggressively cut away good configurations that need a higher budget to perform best.

Thus, we expect SH to work best for large datasets, for which there is not enough time to train many ML pipelines for the full budget, but for which training a ML pipeline on a small budget already yields a good indication of the generalization error.
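For reference, a minimal sketch of plain successive halving; evaluate(config, budget) is a placeholder for training a pipeline on the given budget and returning a validation loss:

import numpy as np

def successive_halving(configs, evaluate, min_budget=8, max_budget=512, eta=2):
    # Evaluate all configurations on a small budget, keep the best
    # 1/eta fraction, multiply the budget by eta, and repeat.
    budget = min_budget
    while len(configs) > 1 and budget <= max_budget:
        losses = [evaluate(c, budget) for c in configs]
        keep = max(1, len(configs) // eta)
        configs = [configs[i] for i in np.argsort(losses)[:keep]]
        budget *= eta
    return configs[0]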

4Paradigm modified SH: it replaces halving with importance sampling and also draws on samples from other bands.

Therefore, here we propose a meta-feature-free approach which does not warmstart with a set of configurations specific to a new dataset, but which uses a portfolio: a set of complementary configurations that covers as many diverse datasets as possible and minimizes the risk of failure when facing a new task.
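Such a portfolio can be built greedily, Hydra-style (the Hydra papers appear in the reference list below); a minimal sketch, assuming a precomputed table perf[c][d] of validation losses for candidate configuration c on meta-dataset d, with all names my own:

def greedy_portfolio(candidates, perf, size=32):
    # Repeatedly add the candidate that most reduces the summed
    # best-so-far loss of the portfolio across all meta-datasets.
    datasets = list(next(iter(perf.values())))
    best = {d: float("inf") for d in datasets}
    portfolio = []
    for _ in range(size):
        def total_loss_if_added(c):
            return sum(min(best[d], perf[c][d]) for d in datasets)
        pool = [c for c in candidates if c not in portfolio]
        c_star = min(pool, key=total_loss_if_added)
        portfolio.append(c_star)
        for d in datasets:
            best[d] = min(best[d], perf[c_star][d])
    return portfolio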

Training set

experiment_scripts/portfolio/portfolio_util.py


_training_task_ids = [
    232, 236, 241, 245, 253, 254, 256, 258, 260, 262, 267, 271, 273, 275, 279, 288, 336, 340, 2119,
    2120, 2121, 2122, 2123, 2125, 2356, 3044, 3047, 3048, 3049, 3053, 3054, 3055, 75089, 75092,
    75093, 75098, 75100, 75108, 75109, 75112, 75114, 75115, 75116, 75118, 75120, 75121, 75125,
    75126, 75129, 75131, 75133, 75134, 75136, 75139, 75141, 75142, 75143, 75146, 75147, 75148,
    75149, 75153, 75154, 75156, 75157, 75159, 75161, 75163, 75166, 75169, 75171, 75173, 75174,
    75176, 75178, 75179, 75180, 75184, 75185, 75187, 75192, 75195, 75196, 75199, 75210, 75212,
    75213, 75215, 75217, 75219, 75221, 75223, 75225, 75232, 75233, 75234, 75235, 75236, 75237,
    75239, 75250, 126021, 126024, 126028, 126030, 126031, 146574, 146575, 146576, 146577, 146578,
    146583, 146586, 146592, 146593, 146594, 146596, 146597, 146600, 146601, 146602, 146603,
    146679, 166859, 166866, 166872, 166875, 166882, 166897, 166905, 166906, 166913, 166915, 166931,
    166932, 166944, 166950, 166951, 166953, 166956, 166957, 166958, 166959, 166970, 166996, 167085,
    167086, 167087, 167088, 167089, 167090, 167094, 167096, 167097, 167099, 167100, 167101, 167103,
    167105, 167106, 167202, 167203, 167204, 167205, 168785, 168791, 189779, 189786, 189828, 189829,
    189836, 189840, 189841, 189843, 189844, 189845, 189846, 189857, 189858, 189859, 189863, 189864,
    189869, 189870, 189875, 189878, 189880, 189881, 189882, 189883, 189884, 189887, 189890, 189893,
    189894, 189899, 189900, 189902, 190154, 190155, 190156, 190157, 190158, 190159, 211720, 211721,
    211722, 211723, 211724,
]
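These are OpenML task IDs; assuming the openml Python package, any single meta-training task can be fetched for inspection:

import openml

# 232 is the first ID in _training_task_ids above.
task = openml.tasks.get_task(232)
X, y = task.get_X_and_y()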

experiment_scripts/autoauto/run_autoauto.py:163

if task_id in (2121, 189829):

openml_cc18_ids = [
    167149, 167150, 167151, 167152, 167153, 167154, 167155, 167156, 167157, 167158, 167159, 167160, 167161, 167162,
    167163, 167165, 167166, 167167, 167168, 167169, 167170, 167171, 167164, 167173, 167172, 167174, 167175, 167176,
    167177, 167178, 167179, 167180, 167181, 167182, 126025, 167195, 167194, 167190, 167191, 167192, 167193, 167187,
    167188, 126026, 167189, 167185, 167186, 167183, 167184, 167196, 167198, 126029, 167197, 126030, 167199, 126031,
    167201, 167205, 189904, 167106, 167105, 189905, 189906, 189907, 189908, 189909, 167083, 167203, 167204, 189910,
    167202, 167097
]

# 33% Holdout tasks from automl benchmark set: https://www.openml.org/s/218
# Not using did 2 and 5 from this study as data on openml is wrong for the automl benchmark
openml_automl_benchmark = [
    189871, 189872, 189873, 168794, 168792, 168793, 75105, 189906, 189909, 189908, 167185, 189874, 189861, 189866,
    168797, 168796, 189860, 189862, 168798, 189865, 126026, 167104, 167083, 189905, 75127, 167200, 167184, 167201,
    168795, 126025, 75097, 167190, 126029, 167149, 167152, 167168, 167181, 75193, 167161
]

automl_metadata contains exactly the same task IDs as the _training_task_ids list above, so the listing is not repeated here.

# Adult (binary, as in Caruana)
# Australian (binary, as it is the smallest dataset)
# Covertype (multiclass, as in Caruana, is also the largest dataset)
# guillermo (binary, an actual AutoML2 challenge dataset)
# jungle chess complete (multiclass)
# kc1 (binary, appears to allow for a lot of overfitting)
# KDDCup09_appetency (binary, appears to be a nice dataset)
# MiniBooNE (binary, another largish dataset, but we know the origin compared to the challenge datasets)
ensemble_datasets = [
    126025, 167104, 75193, 168796, 189909, 167181, 75105, 168798
]

ensemble_mini_datasets = [
    126025, 168796,
]
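The comments above reference Caruana's greedy ensemble selection, which auto-sklearn uses to build its final model ensemble; a minimal sketch of that procedure (selection with replacement; all names are mine):

import numpy as np

def ensemble_selection(val_preds, y_val, loss, size=50):
    # Greedily add, with replacement, the model whose inclusion most
    # reduces the validation loss of the averaged ensemble prediction.
    # `val_preds[i]` holds model i's predictions on the validation set.
    chosen = []
    running_sum = np.zeros_like(val_preds[0], dtype=float)
    for _ in range(size):
        scores = [loss(y_val, (running_sum + p) / (len(chosen) + 1))
                  for p in val_preds]
        best = int(np.argmin(scores))
        chosen.append(best)
        running_sum += val_preds[best]
    return chosen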
  • References

An Open Source AutoML Benchmark

Oboe: Collaborative Filtering for AutoML Model Selection

Hydra-MIP: Automated Algorithm Configuration and Selection for Mixed Integer Programming

Collaborative hyperparameter tuning

Hydra: Automatically Configuring Algorithms for Portfolio-Based Selection

Sequential Model-Free Hyperparameter Tuning

Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning

Meta-learning for evolutionary parameter optimization of classifiers

Towards Automatically-Tuned Neural Networks

Model Selection: Beyond the Bayesian/Frequentist Divide
