Restart techniques are common in gradient-free optimization to deal with multi-modal functions.
In this paper, we propose a simple warm restart technique for stochastic gradient descent to improve its anytime performance when training deep neural networks.
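The authors first note that DNNs (Deep Neural Networks) perform very well on classification, object detection, speech processing, and similar tasks, then raise a problem: despite their strong performance, DNNs are usually trained on large-scale datasets, which often takes several days. How to reduce training time effectively is therefore a question worth exploring.
The authors also stress that, at the time, the best-performing models on large datasets (CIFAR, MS COCO, PASCAL) were not trained with more advanced optimizers such as AdaDelta or Adam, but with the classic SGD optimizer.
The authors then introduce learning rate schedules and explain: "A common learning rate schedule is to use a constant learning rate and divide it by a fixed constant in (approximately) regular intervals."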
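As a point of comparison, the step schedule quoted above can be sketched in a few lines. The function name `step_decay_lr` and the default values (initial rate 0.1, division by 10 every 30 epochs) are illustrative assumptions, not settings taken from the paper.

```python
def step_decay_lr(epoch, base_lr=0.1, factor=10.0, interval=30):
    """Constant learning rate, divided by `factor` every `interval` epochs.

    All default values are hypothetical and chosen only for illustration.
    """
    return base_lr / (factor ** (epoch // interval))


# The rate stays flat within an interval and drops at each boundary.
for epoch in (0, 29, 30, 60):
    print(epoch, step_decay_lr(epoch))
```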
In this paper, we propose to periodically simulate warm restarts of SGD, where in each restart the learning rate is initialized to some value and is scheduled to decrease.
$$\eta_t = \eta^i_{min} + \frac{1}{2}\left(\eta^i_{max} - \eta^i_{min}\right)\left(1 + \cos\left(\frac{T_{cur}}{T_i}\pi\right)\right)$$
Here $\eta_t$ is the current learning rate, $\eta^i_{min}$ and $\eta^i_{max}$ bound the learning rate range of the $i$-th run, $T_i$ is the length of that run in epochs, and $T_{cur}$ counts how many epochs have been performed since the last restart. When $t=0$ and $T_{cur}=0$, the cosine term equals $1$ and the learning rate is at its maximum, $\eta_t = \eta^i_{max}$; when $T_{cur}=T_i$, the cosine term equals $-1$ and the learning rate is at its minimum, $\eta_t = \eta^i_{min}$.
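The formula can be checked numerically with a small sketch. The function name `cosine_annealing_lr` and the default range (0 to 0.1) are assumptions made for illustration; only the expression itself comes from the paper.

```python
import math


def cosine_annealing_lr(t_cur, t_i, eta_min=0.0, eta_max=0.1):
    """Learning rate within one run of the cosine schedule.

    t_cur: epochs since the last restart (may be fractional)
    t_i:   length of the current run in epochs
    eta_min, eta_max: learning rate range (defaults are illustrative)
    """
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))


print(cosine_annealing_lr(0, 10))   # start of a run: eta_max
print(cosine_annealing_lr(5, 10))   # halfway: midpoint of the range
print(cosine_annealing_lr(10, 10))  # end of a run: eta_min
```

The two boundary cases reproduce the statements above: the rate starts at $\eta^i_{max}$ and reaches $\eta^i_{min}$ when $T_{cur} = T_i$.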
$$\begin{cases} T_0 = 1,\ T_{mult} = 2 \\ T_0 = 10,\ T_{mult} = 2 \end{cases}$$
Figure 2: Test errors on CIFAR-10 (left column) and CIFAR-100 (right column) datasets. Note that for SGDR we only plot the recommended solutions. The top and middle rows show the same results on WRN-28-10, with the middle row zooming into the good performance region of low test error. The bottom row shows performance with a wider network, WRN-28-20.
# Imports
from torch import optim
from torch.optim import lr_scheduler
# Define the model (generate_model and opt are defined elsewhere and are not shown here)
model, parameters = generate_model(opt)
# Define the optimizer (PyTorch requires zero dampening when Nesterov momentum is enabled)
if opt.nesterov:
    dampening = 0
else:
    dampening = 0.9
optimizer = optim.SGD(parameters, lr=0.1, momentum=0.9, dampening=dampening, weight_decay=1e-3, nesterov=opt.nesterov)
# Define the warm-restart learning rate schedule
scheduler = lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2, eta_min=0, last_epoch=-1)
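To see what learning rates these settings produce, here is a pure-Python sketch of the schedule. It is not PyTorch's implementation: the function name `sgdr_lr` is hypothetical, and the sketch assumes the schedule advances once per epoch with `T_0=10`, `T_mult=2`, `eta_min=0`, and an initial rate of 0.1, as in the snippet above.

```python
import math


def sgdr_lr(epoch, t_0=10, t_mult=2, eta_min=0.0, eta_max=0.1):
    """Learning rate at integer `epoch` under cosine annealing with warm restarts."""
    t_i, t_cur = t_0, epoch
    # Walk through the runs until `epoch` falls inside the current one.
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= t_mult
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))


# The rate decays within each run and jumps back to eta_max at a restart.
for epoch in (0, 9, 10, 29, 30):
    print(epoch, round(sgdr_lr(epoch), 5))
```

Under these assumptions the rate returns to 0.1 at epochs 10 and 30, and each run is twice as long as the previous one.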
Variable | Restart epoch (general form) |
a | $T_0$ |
b | $a \times T_{mult} + a$ |
c | $b \times T_{mult} + a$ |
d | $c \times T_{mult} + a$ |
e | $d \times T_{mult} + a$ |
… | … |
Variable | Restart epoch ($T_0 = 10$, $T_{mult} = 2$) |
a | $10$ |
b | $10 \times 2 + 10 = 30$ |
c | $30 \times 2 + 10 = 70$ |
d | $70 \times 2 + 10 = 150$ |
e | $150 \times 2 + 10 = 310$ |
… | … |
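The restart epochs in the tables can also be generated by accumulating run lengths, where each run is `T_mult` times longer than the one before it. This is a minimal sketch; the function name `restart_epochs` and its `count` parameter are hypothetical and exist only for this illustration.

```python
def restart_epochs(t_0, t_mult, count):
    """Return the first `count` epochs at which the learning rate restarts."""
    epochs = []
    epoch, t_i = 0, t_0
    for _ in range(count):
        epoch += t_i      # the current run ends here, so a restart happens
        epochs.append(epoch)
        t_i *= t_mult     # the next run is t_mult times longer
    return epochs


print(restart_epochs(10, 2, 5))  # [10, 30, 70, 150, 310]
```

Summing run lengths gives the same values as the recurrence in the tables: each restart epoch equals the previous one times `T_mult`, plus `T_0`.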