For a neural network, we learn the parameters of the network. First, suppose all learnable parameters of a given network are divided into $N$ groups $S_0, S_1, \ldots, S_{N-1}$, each with its own configuration (for example, its own learning rate). Each group $S_n$ can be represented by a dictionary $D_n$:
{'params': Parameters, 'lr': learning_rate}
Here Parameters is a list of the parameters to be learned in this group, and the remaining keys define the settings in which this group differs from the other groups. In practice, many settings are shared across groups; the shared settings that do not need to be specified per group are collected in a dictionary defaults. That is, whatever is not defined in $D_n$ is taken from defaults.
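As a small illustration of how a group dictionary and defaults interact, here is a sketch (the two-part model is a made-up example; only the grouping pattern matters):

import torch.nn as nn
import torch.optim as optim

# A hypothetical two-part model, used only for illustration.
model = nn.Sequential(nn.Linear(10, 20), nn.Linear(20, 2))

optimizer = optim.SGD(
    [{'params': model[0].parameters()},                # D_0: uses defaults only
     {'params': model[1].parameters(), 'lr': 1e-3}],   # D_1: overrides lr
    lr=1e-2, momentum=0.9)                             # defaults

print(optimizer.param_groups[0]['lr'])  # 0.01, filled in from defaults
print(optimizer.param_groups[1]['lr'])  # 0.001, taken from D_1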
Before implementing the different optimizers, let us first consider what a base optimizer class needs. We discuss the basic optimizer implementation first, and then use the concrete implementation of SGD to look at the relevant operations in more detail.
First, how is an optimizer instance constructed? When instantiating an optimizer we need two things:
Parameter Groups: $[D_0, D_1, \ldots, D_{N-1}]$
Defaults: the default parameter settings
As a special case, when there is only one group of parameters, i.e. Parameter Groups contains only $D_0$, we would like a simplified initialization: just pass Parameters and Defaults.
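Continuing the hypothetical model from the sketch above, the single-group case then looks like this:

# Single group: pass the parameter iterable directly; lr and momentum become the defaults.
optimizer = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)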
In PyTorch, the base class of all optimizers is Optimizer:
Optimizer(params, defaults)
Here params is the list of parameter groups $[D_0, \ldots, D_{N-1}]$ (or, in the single-group case, a plain iterable of parameters), and defaults is the dictionary of default settings.
When writing a new subclass of Optimizer, all that is needed is to store the configuration in defaults during initialization, for example:
class SGD(Optimizer):
    def __init__(self, params, lr=required, momentum=0, dampening=0,
                 weight_decay=0, nesterov=False):
        defaults = dict(lr=lr, momentum=momentum, dampening=dampening,
                        weight_decay=weight_decay, nesterov=nesterov)
        if nesterov and (momentum <= 0 or dampening != 0):
            raise ValueError("Nesterov momentum requires a momentum and zero dampening")
        super(SGD, self).__init__(params, defaults)
Below is an example of constructing an optimizer with multiple parameter groups:
optim.SGD([{'params': model.base.parameters()},
{'params': model.classifier.parameters(), 'lr': 1e-3}],
lr=1e-2, momentum=0.9)
If the model is to run on a GPU, call .cuda() before constructing the optimizer, because the Parameters before and after the .cuda() call are different objects.
An optimizer needs two attributes: param_groups and state. param_groups stores the parameter groups, while state stores per-parameter state. For example, in SGD with momentum every parameter keeps a velocity that is used to compute the final update; this velocity lives in the state attribute. param_groups is a list of group dictionaries, and state is a dictionary keyed by parameter.
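A quick way to look at both attributes, continuing the SGD instance from the earlier sketch:

# param_groups: one dict per group, holding the parameters plus their settings.
for group in optimizer.param_groups:
    print(group['lr'], group['momentum'], len(group['params']))

# state: per-parameter state. For momentum SGD it holds the 'momentum_buffer'
# (the velocity), which only appears after the first optimizer.step() call.
for p, s in optimizer.state.items():
    print(p.shape, list(s.keys()))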
First, the special methods related to pickle, __getstate__ and __setstate__, which are used when an optimizer is dumped and loaded:
def __getstate__(self):
    return {
        'state': self.state,
        'param_groups': self.param_groups,
    }

def __setstate__(self, state):
    self.__dict__.update(state)
These two functions define how an instance of the class is pickled and unpickled. __getstate__ returns a dictionary; if __getstate__ is not defined, self.__dict__ is used by default. __setstate__ reconstructs the instance from that dictionary. In addition, we need to save the current state of an optimizer: state_dict and load_state_dict return and load the optimizer state. The actual Parameters in memory are not what we care about here; we only care about the state associated with each Parameter. Therefore state_dict replaces every Parameter with its id in the returned dictionary, which saves space.
def state_dict(self):
    # Save ids instead of Variables
    def pack_group(group):
        packed = {k: v for k, v in group.items() if k != 'params'}
        packed['params'] = [id(p) for p in group['params']]
        return packed
    param_groups = [pack_group(g) for g in self.param_groups]
    # Remap state to use ids as keys
    packed_state = {(id(k) if isinstance(k, Variable) else k): v
                    for k, v in self.state.items()}
    return {
        'state': packed_state,
        'param_groups': param_groups,
    }
Likewise, when loading the state there is no need to read the Parameters themselves, since the real Parameters are already stored as attributes of the current instance; we only have to restore the state associated with each Parameter. In short: state_dict returns {'state': packed_state, 'param_groups': param_groups}, and load_state_dict reads these attributes back in.
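This is what makes optimizer checkpoints work in practice. A minimal sketch (the file name and the model/optimizer variables are assumptions, continuing the earlier example):

import torch

# Save model and optimizer state together in one checkpoint file.
torch.save({'model': model.state_dict(),
            'optimizer': optimizer.state_dict()},
           'checkpoint.pth')

# Later: rebuild the same model and optimizer, then restore both states.
checkpoint = torch.load('checkpoint.pth')
model.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])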
An optimizer updates the parameters through its step method. Some optimization algorithms need an additional closure: a function that performs one iteration (forward and backward pass) and returns the loss.
optimizer.step():

for input, target in dataset:
    optimizer.zero_grad()
    output = model(input)
    loss = loss_fn(output, target)
    loss.backward()
    optimizer.step()
optimizer.step(closure):

for input, target in dataset:
    def closure():
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)
        loss.backward()
        return loss
    optimizer.step(closure)
First, fix the notation with the plain (stochastic) gradient-descent update:

$$\theta := \theta - \eta \nabla J(\theta) = \theta - \eta \sum_{i=1}^{n} \nabla J_i(\theta) / n$$

where $\theta$, $\eta$ and $J(\theta)$ are the parameters, the learning rate and the loss function, respectively.
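To make the notation concrete, here is a hand-written gradient-descent loop on the toy loss $J(\theta) = \theta^2$ (pure illustration, not part of torch.optim):

theta = 5.0   # parameter
eta = 0.1     # learning rate

for _ in range(100):
    grad = 2 * theta            # gradient of J(theta) = theta**2
    theta = theta - eta * grad  # theta := theta - eta * grad J(theta)

print(theta)  # close to 0, the minimizer of J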
class torch.optim.Adadelta(params,
lr=1.0,
rho=0.9,
eps=1e-06,
weight_decay=0)
It has been proposed in ADADELTA: An Adaptive Learning Rate Method.
Parameters:
params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional) – coefficient that scales delta before it is applied to the parameters (default: 1.0)
rho (float, optional) – coefficient used for computing a running average of squared gradients (default: 0.9)
eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-6)
weight_decay (float, optional) – weight decay ($l_2$ penalty) (default: 0)
class torch.optim.Adagrad(params,
lr=0.01,
lr_decay=0,
weight_decay=0)
Implements Adagrad algorithm.
Informally, this increases the learning rate for more sparse parameters and decreases the learning rate for less sparse ones. This strategy often improves convergence performance over standard stochastic gradient descent in settings where data is sparse and sparse parameters are more informative. Examples of such applications include natural language processing and image recognition.
It has been proposed in Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.
$$G = \sum_{\tau=1}^{t} g_\tau g_\tau^{\mathsf{T}}$$

where $g_\tau = \nabla J_i(\theta)$ is the gradient at iteration $\tau$. The diagonal is given by

$$G_{j,j} = \sum_{\tau=1}^{t} g_{\tau,j}^2$$

This vector is updated after every iteration. The formula for an update is now

$$\theta := \theta - \eta \, \mathrm{diag}(G)^{-\frac{1}{2}} \circ g$$

or, written as per-parameter updates,

$$\theta_j := \theta_j - \frac{\eta}{\sqrt{G_{j,j}}} g_j$$
Parameters:
params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional) – learning rate (default: 1e-2)
lr_decay (float, optional) – learning rate decay (default: 0)
weight_decay (float, optional) – weight decay ($l_2$ penalty) (default: 0)
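The diagonal update above is easy to write out directly. A sketch with plain tensors (the quadratic toy gradient is an assumption; this is not the actual torch.optim.Adagrad code):

import torch

theta = torch.randn(3)   # parameters
G = torch.zeros(3)       # running sum of squared gradients, i.e. diag(G)
eta, eps = 0.01, 1e-10

def grad_fn(theta):
    # Hypothetical gradient: for J(theta) = ||theta||^2 / 2, grad J = theta.
    return theta

for _ in range(100):
    g = grad_fn(theta)
    G = G + g * g                                # G_jj accumulates g_j^2
    theta = theta - eta * g / (G + eps).sqrt()   # theta_j -= eta * g_j / sqrt(G_jj)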
class torch.optim.Adam(params,
lr=0.001,
betas=(0.9, 0.999),
eps=1e-08,
weight_decay=0)
Adam (Adaptive Moment Estimation) is an improvement on RMSProp; see Adam: A Method for Stochastic Optimization.
$$m_\theta := \beta_1 m_\theta + (1-\beta_1) \nabla J(\theta)$$

$$v_\theta := \beta_2 v_\theta + (1-\beta_2) (\nabla J(\theta))^2$$

Then

$$\hat{m}_\theta = \frac{m_\theta}{1-\beta_1^t}, \qquad \hat{v}_\theta = \frac{v_\theta}{1-\beta_2^t}$$

$$\theta := \theta - \eta \, \frac{\hat{m}_\theta}{\sqrt{\hat{v}_\theta} + \epsilon}$$

$\epsilon$ prevents division by zero; $\beta_1$ and $\beta_2$ are the forgetting factors for the gradient and for the second moments of the gradient, respectively.
Parameters:
params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional) – learning rate (default: 1e-3)
betas (Tuple[float, float], optional) – coefficients used for computing running averages of the gradient and its square (default: (0.9, 0.999))
eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
weight_decay (float, optional) – weight decay ($l_2$ penalty) (default: 0)
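The update above translates almost line for line into code. A sketch with plain tensors (again with a toy quadratic gradient; not the library implementation):

import torch

theta = torch.randn(3)
m = torch.zeros(3)   # first moment  m_theta
v = torch.zeros(3)   # second moment v_theta
eta, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8

def grad_fn(theta):
    # Hypothetical gradient: for J(theta) = ||theta||^2 / 2, grad J = theta.
    return theta

for t in range(1, 1001):
    g = grad_fn(theta)
    m = beta1 * m + (1 - beta1) * g        # m_theta update
    v = beta2 * v + (1 - beta2) * g * g    # v_theta update
    m_hat = m / (1 - beta1 ** t)           # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (v_hat.sqrt() + eps)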
class torch.optim.SparseAdam(params,
lr=0.001,
betas=(0.9, 0.999),
eps=1e-08)
A version of the Adam algorithm for sparse gradients (for example, gradients produced by nn.Embedding with sparse=True).
Parameters:
params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional) – learning rate (default: 1e-3)
betas (Tuple[float, float], optional) – coefficients used for computing running averages of the gradient and its square (default: (0.9, 0.999))
eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
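A typical use is optimizing a sparse embedding table; a minimal sketch (the embedding sizes and indices are arbitrary):

import torch
import torch.nn as nn
import torch.optim as optim

# An embedding with sparse=True produces sparse gradients, which SparseAdam expects.
embedding = nn.Embedding(10000, 64, sparse=True)
optimizer = optim.SparseAdam(embedding.parameters(), lr=1e-3)

indices = torch.tensor([1, 5, 42])
loss = embedding(indices).sum()
loss.backward()
optimizer.step()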
class torch.optim.Adamax(params,
lr=0.002,
betas=(0.9, 0.999),
eps=1e-08,
weight_decay=0)
Implements Adamax algorithm (a variant of Adam based on infinity norm).
It has been proposed in Adam: A Method for Stochastic Optimization.
Parameters:
params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional) – learning rate (default: 2e-3)
betas (Tuple[float, float], optional) – coefficients used for computing running averages of the gradient and its square (default: (0.9, 0.999))
eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
weight_decay (float, optional) – weight decay ($l_2$ penalty) (default: 0)
class torch.optim.ASGD(params,
lr=0.01,
lambd=0.0001,
alpha=0.75,
t0=1000000.0,
weight_decay=0)
Implements Averaged Stochastic Gradient Descent.
It has been proposed in Acceleration of stochastic approximation by averaging.
Parameters:
params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional) – learning rate (default: 1e-2)
lambd (float, optional) – decay term (default: 1e-4)
alpha (float, optional) – power for the eta update (default: 0.75)
t0 (float, optional) – point at which to start averaging (default: 1e6)
weight_decay (float, optional) – weight decay ($l_2$ penalty) (default: 0)
class torch.optim.LBFGS(params,
lr=1,
max_iter=20,
max_eval=None,
tolerance_grad=1e-05,
tolerance_change=1e-09,
history_size=100,
line_search_fn=None)
L-BFGS algorithm. This optimizer doesn’t support per-parameter options and parameter groups (there can be only one). This is a very memory intensive optimizer (it requires additional param_bytes * (history_size + 1)
bytes). If it doesn’t fit in memory try reducing the history size, or use a different algorithm.
Parameters:
params (iterable) – iterable of parameters to optimize (a single parameter group)
lr (float) – learning rate (default: 1)
max_iter (int) – maximal number of iterations per optimization step (default: 20)
max_eval (int) – maximal number of function evaluations per optimization step (default: max_iter * 1.25)
tolerance_grad (float) – termination tolerance on first-order optimality (default: 1e-5)
tolerance_change (float) – termination tolerance on function value / parameter changes (default: 1e-9)
history_size (int) – update history size (default: 100)
line_search_fn (str, optional) – line search function to use (default: None)
This algorithm requires a closure to be passed in when updating the parameters:
step(closure)
Performs a single optimization step.
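Because L-BFGS may re-evaluate the loss several times within a single step, the closure is mandatory here. A minimal sketch (model, input, target and loss_fn are assumed to exist, as in the earlier closure example):

optimizer = torch.optim.LBFGS(model.parameters(), lr=1, history_size=10)

def closure():
    optimizer.zero_grad()
    output = model(input)
    loss = loss_fn(output, target)
    loss.backward()
    return loss

optimizer.step(closure)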
class torch.optim.RMSprop(params,
lr=0.01,
alpha=0.99,
eps=1e-08,
weight_decay=0,
momentum=0,
centered=False)
RMSprop (Root Mean Square Propagation) was proposed by G. Hinton in his course. The centered version first appears in Generating Sequences With Recurrent Neural Networks.
$$v(\theta, t) := \gamma \cdot v(\theta, t-1) + (1-\gamma)(\nabla J_i(\theta))^2$$

where $\gamma$ is the forgetting factor. And the parameters are updated as

$$\theta := \theta - \frac{\eta}{\sqrt{v(\theta, t)}} \nabla J_i(\theta)$$
RMSProp has shown excellent adaptation of the learning rate in different applications. RMSProp can be seen as a generalization of Rprop and is capable of working with mini-batches as well, as opposed to only full batches.
Parameters:
params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional) – learning rate (default: 1e-2)
alpha (float, optional) – smoothing constant (default: 0.99)
eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
weight_decay (float, optional) – weight decay ($l_2$ penalty) (default: 0)
momentum (float, optional) – momentum factor (default: 0)
centered (bool, optional) – if True, compute the centered RMSProp, in which the gradient is normalized by an estimation of its variance (default: False)
step(closure=None)
Performs a single optimization step. Unlike L-BFGS, the closure is optional here; if given, it re-evaluates the model and returns the loss.
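A minimal usage sketch (model, dataset and loss_fn are assumed; centered=True turns on the variance-normalized variant mentioned above):

optimizer = torch.optim.RMSprop(model.parameters(), lr=0.01, alpha=0.99,
                                momentum=0.9, centered=True)

for input, target in dataset:
    optimizer.zero_grad()
    loss = loss_fn(model(input), target)
    loss.backward()
    optimizer.step()   # no closure needed for RMSprop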
class torch.optim.SGD(params,
lr=<object object>,
momentum=0,
dampening=0,
weight_decay=0,
nesterov=False)
The SGD implementation with momentum:

$$v = \rho \cdot v + g$$

$$p = p - lr \cdot v$$

where $p$, $g$, $v$ and $\rho$ are the parameters, gradient, velocity and momentum, respectively. This differs from the formulation of Sutskever et al. and from the implementations in some other frameworks, which use

$$v = \rho \cdot v + lr \cdot g$$

$$p = p - v$$

The Nesterov version is modified analogously.
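The two formulations only differ in where the learning rate enters. A small numeric sketch (purely illustrative) that runs both with a constant gradient shows they produce identical parameters as long as lr is constant, while the stored velocities differ by a factor of lr:

rho, lr, g = 0.9, 0.1, 1.0   # momentum, learning rate, a constant gradient
p1 = p2 = 0.0
v1 = v2 = 0.0

for _ in range(5):
    # PyTorch-style: the velocity accumulates raw gradients, lr is applied at the update.
    v1 = rho * v1 + g
    p1 = p1 - lr * v1
    # Sutskever-style: lr is folded into the velocity.
    v2 = rho * v2 + lr * g
    p2 = p2 - v2

print(p1, p2)        # identical while lr stays constant
print(v1, v2 / lr)   # the velocities differ exactly by the factor lr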
Parameters:
params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float) – initial learning rate
momentum (float, optional) – momentum factor (default: 0)
weight_decay (float, optional) – weight decay ($l_2$ penalty) (default: 0)
dampening (float, optional) – dampening for momentum (default: 0)
nesterov (bool, optional) – enables Nesterov momentum (default: False)
The SGD implementation does not require an extra closure. The constructor stores the given settings in the defaults dictionary and then calls the base-class constructor:
def __init__(self, params, lr=required, momentum=0, dampening=0,
             weight_decay=0, nesterov=False):
    defaults = dict(lr=lr, momentum=momentum, dampening=dampening,
                    weight_decay=weight_decay, nesterov=nesterov)
    if nesterov and (momentum <= 0 or dampening != 0):
        raise ValueError("Nesterov momentum requires a momentum and zero dampening")
    super(SGD, self).__init__(params, defaults)
In SGD, every parameter has its own velocity, here called 'momentum_buffer'. In other words, the state member of the SGD class stores the velocity of each parameter.
def step(self, closure=None):
    loss = None
    if closure is not None:
        loss = closure()

    for group in self.param_groups:
        weight_decay = group['weight_decay']
        momentum = group['momentum']
        dampening = group['dampening']
        nesterov = group['nesterov']

        for p in group['params']:
            # p is one of the parameters in this group
            if p.grad is None:
                # skip parameters that have no gradient
                continue
            d_p = p.grad.data
            if weight_decay != 0:
                # weight decay: d_p = d_p + weight_decay * p
                d_p.add_(weight_decay, p.data)
            if momentum != 0:
                param_state = self.state[p]
                if 'momentum_buffer' not in param_state:
                    # first update: the buffer starts as a copy of the gradient
                    buf = param_state['momentum_buffer'] = torch.zeros_like(p.data)
                    buf.mul_(momentum).add_(d_p)
                else:
                    # v = momentum * v + (1 - dampening) * d_p
                    buf = param_state['momentum_buffer']
                    buf.mul_(momentum).add_(1 - dampening, d_p)
                if nesterov:
                    d_p = d_p.add(momentum, buf)
                else:
                    d_p = buf
            # p = p - lr * d_p
            p.data.add_(-group['lr'], d_p)
    return loss
torch.optim.lr_scheduler provides several rules for adjusting the learning rate; torch.optim.lr_scheduler.ReduceLROnPlateau additionally supports adjusting the learning rate dynamically based on some measured quantity. _LRScheduler is the base class of all learning-rate scheduler classes. It provides the basic methods and attributes for learning-rate adjustment, although not every attribute is necessarily used by every subclass.
Instantiating a scheduler requires an optimizer and the current epoch. The optimizer carries the learning-rate information of the parameters, and the current epoch determines how the rate is adjusted relative to the initial learning rate. For example, last_epoch=-1 means training starts from the initial learning rate.
class _LRScheduler(optimizer, last_epoch=-1)
The base class stores base_lrs, a list built from each group's optimizer.param_groups[i]['initial_lr'], and last_epoch. Calling step() without an argument advances the epoch to last_epoch + 1 and writes the new learning rates back into the optimizer:

def step(self, epoch=None):
    if epoch is None:
        epoch = self.last_epoch + 1
    self.last_epoch = epoch
    for param_group, lr in zip(self.optimizer.param_groups, self.get_lr()):
        param_group['lr'] = lr
Except for ReduceLROnPlateau, the other schedulers all work within this update framework defined by the base class; base_lrs stores the optimizer's initial learning rates, because the other rules simply derive the new learning rate from the initial one.
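To see how this framework is meant to be extended, here is a sketch of a hypothetical scheduler that halves the learning rate every epoch; only get_lr has to be overridden (the class name HalvingLR is made up for illustration):

from torch.optim.lr_scheduler import _LRScheduler

class HalvingLR(_LRScheduler):
    """Hypothetical example: lr = base_lr * 0.5 ** last_epoch."""

    def __init__(self, optimizer, last_epoch=-1):
        super(HalvingLR, self).__init__(optimizer, last_epoch)

    def get_lr(self):
        # base_lrs holds each group's initial learning rate.
        return [base_lr * 0.5 ** self.last_epoch for base_lr in self.base_lrs]

It is then used like any other scheduler: scheduler = HalvingLR(optimizer), followed by scheduler.step() once per epoch.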
class torch.optim.lr_scheduler.LambdaLR(
optimizer,
lr_lambda,
last_epoch=-1)
Each parameter group is given a lambda function, and the learning rate is updated according to that function:

$$\alpha = \alpha_0 \cdot lambda(epoch)$$

where $lambda(epoch)$ is the function assigned to the group and $\alpha_0$ is the initial learning rate.
def get_lr(self):
    return [base_lr * lmbda(self.last_epoch)
            for lmbda, base_lr in zip(self.lr_lambdas, self.base_lrs)]
Parameters:
optimizer (Optimizer) – the wrapped optimizer
lr_lambda (function or list) – a function that computes a multiplicative factor given the epoch, or a list of such functions, one for each group in optimizer.param_groups
last_epoch (int) – the index of the last epoch (default: -1)
Example:
>>> # Assuming optimizer has two groups.
>>> lambda1 = lambda epoch: epoch // 30
>>> lambda2 = lambda epoch: 0.95 ** epoch
>>> scheduler = LambdaLR(optimizer, lr_lambda=[lambda1, lambda2])
>>> for epoch in range(100):
>>>     scheduler.step()
>>>     train(...)
>>>     validate(...)
class torch.optim.lr_scheduler.StepLR(
optimizer,
step_size,
gamma=0.1,
last_epoch=-1)
The learning rate is multiplied by a fixed factor every step_size epochs:

$$\alpha = \alpha_0 \times \gamma^{\lfloor epoch / step \rfloor}$$
def get_lr(self):
    return [base_lr * self.gamma ** (self.last_epoch // self.step_size)
            for base_lr in self.base_lrs]
Parameters:
optimizer (Optimizer) – the wrapped optimizer
step_size (int) – period of learning rate decay, in epochs
gamma (float) – multiplicative factor of learning rate decay (default: 0.1)
last_epoch (int) – the index of the last epoch (default: -1)
Example
>>> # Assuming optimizer uses lr = 0.5 for all groups
>>> # lr = 0.05 if epoch < 30
>>> # lr = 0.005 if 30 <= epoch < 60
>>> # lr = 0.0005 if 60 <= epoch < 90
>>> # ...
>>> scheduler = StepLR(optimizer, step_size=30, gamma=0.1)
>>> for epoch in range(100):
>>>     scheduler.step()
>>>     train(...)
>>>     validate(...)
class torch.optim.lr_scheduler.MultiStepLR(
optimizer,
milestones,
gamma=0.1,
last_epoch=-1)
MultiStepLR adjusts the learning rate at manually specified milestone epochs $0 = m_0 < m_1 < m_2 < ... < m_N$:

$$\alpha = \alpha_0 \times \gamma^{i} \quad \text{for} \quad m_i \leq epoch < m_{i+1}$$
def get_lr(self):
    return [base_lr * self.gamma ** bisect_right(self.milestones, self.last_epoch)
            for base_lr in self.base_lrs]
Parameters:
optimizer (Optimizer) – the wrapped optimizer
milestones (list) – list of epoch indices; must be increasing
gamma (float) – multiplicative factor of learning rate decay (default: 0.1)
last_epoch (int) – the index of the last epoch (default: -1)
Example:
>>> # Assuming optimizer uses lr = 0.5 for all groups
>>> # lr = 0.05 if epoch < 30
>>> # lr = 0.005 if 30 <= epoch < 80
>>> # lr = 0.0005 if epoch >= 80
>>> scheduler = MultiStepLR(optimizer, milestones=[30,80], gamma=0.1)
>>> for epoch in range(100):
>>>     scheduler.step()
>>>     train(...)
>>>     validate(...)
class torch.optim.lr_scheduler.ExponentialLR(optimizer,
gamma,
last_epoch=-1)
The learning rate of every group decays by gamma every epoch:

$$\alpha = \alpha_0 \times \gamma^{epoch}$$
def get_lr(self):
    return [base_lr * self.gamma ** self.last_epoch
            for base_lr in self.base_lrs]
Parameters:
optimizer (Optimizer) – the wrapped optimizer
gamma (float) – multiplicative factor of learning rate decay
last_epoch (int) – the index of the last epoch (default: -1)
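Example (same pattern as the other schedulers; gamma=0.95 is an arbitrary choice):
>>> # Assuming optimizer uses lr = 0.1 for all groups
>>> scheduler = ExponentialLR(optimizer, gamma=0.95)
>>> for epoch in range(100):
>>>     scheduler.step()
>>>     train(...)
>>>     validate(...)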
class torch.optim.lr_scheduler.ReduceLROnPlateau(
optimizer,
mode='min',
factor=0.1,
patience=10,
verbose=False,
threshold=0.0001,
threshold_mode='rel',
cooldown=0,
min_lr=0,
eps=1e-08)
Reduce the learning rate when a given metric has stopped improving for several consecutive epochs.
Parameters:
optimizer (Optimizer) – the wrapped optimizer
mode (str) – one of 'min', 'max'. In 'min' mode, the learning rate is reduced when the monitored quantity has stopped decreasing; in 'max' mode, when it has stopped increasing. Default: 'min'.
factor (float) – factor by which the learning rate is reduced: new_lr = lr * factor. Default: 0.1.
patience (int) – number of epochs with no improvement after which the learning rate is reduced. Default: 10.
verbose (bool) – if True, prints a message to stdout for each update. Default: False.
threshold (float) – threshold for measuring whether the metric has improved. Default: 1e-4.
threshold_mode (str) – one of 'rel', 'abs'. In 'rel' mode, dynamic_threshold = best * (1 + threshold) in 'max' mode or best * (1 - threshold) in 'min' mode. In 'abs' mode, dynamic_threshold = best + threshold in 'max' mode or best - threshold in 'min' mode. Default: 'rel'.
cooldown (int) – number of epochs to wait after a learning-rate reduction before resuming normal operation. Default: 0.
min_lr (float or list) – lower bound on the learning rate; it cannot be reduced below this. Default: 0.
eps (float) – minimal decay applied to the learning rate; if the difference between the new and old learning rate is smaller than eps, the update is ignored. Default: 1e-8.
Example:
>>> optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
>>> scheduler = ReduceLROnPlateau(optimizer, 'min')
>>> for epoch in range(10):
>>>     train(...)
>>>     val_loss = validate(...)
>>>     # Note that step should be called after validate()
>>>     scheduler.step(val_loss)