Machine Learning: 9.2 HPO Algorithms

Search Space

  • Specify a range and a sampling distribution for each hyperparameter (a sampling sketch follows this list):

    Hyper-parameter     Range                                      Distribution
    model (backbone)    [mobilenet, resnet, vgg]                   categorical
    learning rate*      [1e-6, 1e-1]                               log-uniform
    batch size*         [8, 16, 32, 64, 128, 256, 512]             categorical
    momentum**          [0.85, 0.95]                               uniform
    weight decay**      [1e-6, 1e-2]                               log-uniform
    detector            [faster-rcnn, ssd, yolo-v3, center-net]    categorical
  • The search space can be exponentially large
    • Need to carefully design the space to improve efficiency
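
A minimal sketch of sampling from the search space above. The `SEARCH_SPACE` dictionary and `sample` helper are illustrative, not the API of any specific HPO library:

```python
import math
import random

# Each entry: name -> (distribution, support), mirroring the table above.
SEARCH_SPACE = {
    "model":         ("categorical", ["mobilenet", "resnet", "vgg"]),
    "learning_rate": ("log-uniform", (1e-6, 1e-1)),
    "batch_size":    ("categorical", [8, 16, 32, 64, 128, 256, 512]),
    "momentum":      ("uniform", (0.85, 0.95)),
    "weight_decay":  ("log-uniform", (1e-6, 1e-2)),
    "detector":      ("categorical", ["faster-rcnn", "ssd", "yolo-v3", "center-net"]),
}

def sample(space):
    """Draw one random configuration from the search space."""
    config = {}
    for name, (dist, support) in space.items():
        if dist == "categorical":
            config[name] = random.choice(support)
        elif dist == "uniform":
            config[name] = random.uniform(*support)
        else:  # log-uniform: sample the exponent uniformly
            lo, hi = support
            config[name] = 10 ** random.uniform(math.log10(lo), math.log10(hi))
    return config

print(sample(SEARCH_SPACE))
```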

HPO algorithms: Black-box or Multi-fidelity

  • Black-box: treats each training job as a black box:

    • Completes the full training process for each trial
  • Multi-fidelity: modifies the training job to speed up the search:

    • Train on subsampled datasets

    • Reduce model size (e.g., fewer layers or channels)

    • Stop bad configurations early

The two most common HPO strategies

  • Grid search
    • All combinations are evaluated
    • Guaranteed to find the best configuration in the space
    • Suffers from the curse of dimensionality: cost grows exponentially with the number of hyperparameters
  • Random search (a minimal sketch follows this list)
    • Random combinations are tried
    • More efficient than grid search, both empirically and in theory (see Bergstra & Bengio, "Random Search for Hyper-Parameter Optimization")
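
A minimal random-search loop, reusing the `sample` helper sketched earlier. `train_and_validate` is a hypothetical stand-in for a real training job that returns a validation score:

```python
def random_search(space, n_trials, train_and_validate):
    """Try n_trials random configurations and keep the best one."""
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = sample(space)              # random combination from the space
        score = train_and_validate(config)  # expensive: runs a full training job
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```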


Bayesian Optimization (BO)

  • BO: iteratively learns a mapping from hyperparameters to the objective function based on previous trials, and selects the next trial using the current estimate (a minimal sketch follows this section).

  • Surrogate model

    • Estimates how the objective function depends on the hyperparameters
    • Probabilistic regression models: Random forest, Gaussian process
  • Acquisition function

    • The acquisition function is high where the predicted objective is good and/or the uncertainty is large.

    • Sample the next trial according to the acquisition function

    • Trade off exploration and exploitation

  • Limitations of BO:

    • In the initial stage, it behaves similarly to random search

    • The optimization process is sequential, which makes it hard to parallelize
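
A minimal BO sketch on a 1-D toy problem, assuming a Gaussian-process surrogate with an RBF kernel and an Upper Confidence Bound (UCB) acquisition. The objective `f` is a toy stand-in for a real training job:

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.2):
    # Squared-exponential kernel between two sets of 1-D points.
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_query, noise=1e-6):
    # Standard GP regression: posterior mean and std at x_query.
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_query)
    K_inv = np.linalg.inv(K)
    mean = K_s.T @ K_inv @ y_train
    var = 1.0 - np.sum(K_s.T @ K_inv * K_s.T, axis=1)
    return mean, np.sqrt(np.maximum(var, 1e-12))

def f(x):  # toy objective standing in for "validation accuracy"
    return np.sin(6 * x) * x

rng = np.random.default_rng(0)
x_obs = rng.uniform(0, 1, size=3)        # a few random initial trials
y_obs = f(x_obs)

for trial in range(10):
    cand = rng.uniform(0, 1, size=500)   # candidate hyperparameter values
    mean, std = gp_posterior(x_obs, y_obs, cand)
    ucb = mean + 2.0 * std               # acquisition: exploit + explore
    x_next = cand[np.argmax(ucb)]        # next trial = acquisition maximizer
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, f(x_next))

print("best x:", x_obs[np.argmax(y_obs)], "best value:", y_obs.max())
```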

Hyperband

  • In Successive Halving (a sketch follows this list):

    • n: number of configurations to try (exploration)

    • m: budget, e.g. epochs, given to each configuration (exploitation)

  • Hyperband runs multiple rounds of Successive Halving; each round decreases n
    and increases m

    • More exploration first, then more exploitation
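
A minimal sketch of Successive Halving with a Hyperband-style outer loop. Here `sample_config` and `evaluate` are hypothetical stand-ins for a real sampler and training job, and the bracket schedule at the bottom is illustrative:

```python
import math
import random

def sample_config():
    return {"lr": 10 ** random.uniform(-6, -1)}

def evaluate(config, budget):
    # Toy proxy: pretend configs with lr near 1e-3 score best, and that
    # scores become more reliable as the training budget grows.
    return -abs(math.log10(config["lr"]) + 3) + random.gauss(0, 1.0 / budget)

def successive_halving(n, min_budget, eta=2):
    # Start with n configs (exploration) at a small budget; repeatedly keep
    # the top 1/eta and multiply the per-config budget by eta (exploitation).
    configs, budget = [sample_config() for _ in range(n)], min_budget
    while len(configs) > 1:
        scores = [evaluate(c, budget) for c in configs]
        ranked = sorted(zip(scores, configs), key=lambda p: p[0], reverse=True)
        configs = [c for _, c in ranked[: max(1, len(configs) // eta)]]
        budget *= eta
    return configs[0]

# Hyperband: several Successive Halving brackets, each with fewer configs
# (less exploration) but a larger starting budget (more exploitation).
for n, min_budget in [(16, 1), (8, 2), (4, 4), (2, 8)]:
    best = successive_halving(n, min_budget)
    print(f"bracket n={n:2d}, min_budget={min_budget}: best lr = {best['lr']:.2e}")
```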

Summary

  • Black-box HPO: grid/random search, Bayesian optimization
  • Multi-fidelity HPO: Successive Halving, Hyperband
  • In practice, start with random search
  • Be aware that a small number of configurations tend to be consistent top performers
    • You can find them by mining your training logs, or from common
      configurations used in papers/code
