Optimizers in PyTorch

Table of Contents

  • Optimizers in PyTorch
    • Parameters of a Neural Network
      • Code Implementation
        • Constructor
        • Attributes and Methods
        • Serialization
        • Updating Parameters
    • All Optimizer Classes
      • Adadelta
      • AdaGrad
      • Adam
      • SparseAdam
      • Adamax
      • ASGD
      • LBFGS
      • RMSprop
      • SGD
        • Concrete Implementation
          • Constructor
          • step
    • How to Adjust the Learning Rate
      • The _LRScheduler Base Class
        • Constructor
        • Attributes
        • Methods
        • Learning Rate Update Flow
      • LambdaLR
      • StepLR
      • MultiStepLR
      • ExponentialLR
      • ReduceLROnPlateau

Parameters of a Neural Network

For a neural network, we learn the parameters of the network. First, assume that all learnable parameters of a given network are divided into $N$ groups: $S_0, S_1, \ldots, S_{N-1}$, where each group has its own configuration, for example a different learning rate. Each group $S_n$ can be represented by a dictionary $D_n$:

{'params': Parameters, 'lr': learning_rate}

Here Parameters is a list that specifies which learnable parameters belong to this group. The other keys define the settings in which this group differs from the other groups. In practice many settings are shared by all groups; these shared settings that do not need to be specified per group are collected in a dictionary called defaults. In other words, any key that is not defined in $D_n$ falls back to the value in defaults.
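
For illustration, here is a minimal, self-contained sketch of this grouping (the tensors and values are hypothetical, not from the original text); the first group omits 'lr' and therefore falls back to the value in defaults:

import torch
import torch.optim as optim

# Two hypothetical parameter tensors, split into two groups.
w_base = torch.randn(10, 10, requires_grad=True)
w_head = torch.randn(10, 2, requires_grad=True)

param_groups = [
    {'params': [w_base]},              # D_0: no 'lr' key, uses the default below
    {'params': [w_head], 'lr': 1e-3},  # D_1: overrides the learning rate
]

optimizer = optim.SGD(param_groups, lr=1e-2, momentum=0.9)

# After construction, every group carries the merged configuration.
for i, group in enumerate(optimizer.param_groups):
    print(i, group['lr'], group['momentum'])  # 0: 0.01 0.9, 1: 0.001 0.9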

Code Implementation

Before implementing the different optimizers, let us first consider what an optimizer base class needs. We first discuss the implementation of the basic Optimizer class, and then use the concrete implementation of SGD to understand the related operations in more detail.

Constructor

First, how is an optimizer instance constructed? When instantiating an optimizer we need two things:

  • Parameter Groups: $[D_0, D_1, \ldots, D_{N-1}]$

  • Defaults: the default parameter settings

As a special case, when there is only a single group, i.e. Parameter Groups contains only $D_0$, we would like a simplified way to initialize the optimizer: just pass the Parameters and the Defaults.

PyTorch defines Optimizer, the base class of all optimizers:

Optimizer(params, defaults)

Its arguments are:

  • params: either all Parameters of a network, or the Parameter Groups described above
  • defaults: the default parameter configuration

When defining a new Optimizer subclass, all you need to do in the constructor is pack the configuration into defaults, for example:

class SGD(Optimizer):
    def __init__(self, params, lr=required, momentum=0, dampening=0,
                 weight_decay=0, nesterov=False):
        defaults = dict(lr=lr, momentum=momentum, dampening=dampening,
                        weight_decay=weight_decay, nesterov=nesterov)
        if nesterov and (momentum <= 0 or dampening != 0):
            raise ValueError("Nesterov momentum requires a momentum and zero dampening")
        super(SGD, self).__init__(params, defaults)

Below is an example of constructing an optimizer with multiple parameter groups:

optim.SGD([{'params': model.base.parameters()},
           {'params': model.classifier.parameters(), 'lr': 1e-3}],
          lr=1e-2, momentum=0.9)

If the model is going to run on a GPU, call .cuda() before constructing the optimizer, because the Parameters before and after the call to .cuda() are different objects.
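
A minimal sketch of this ordering (the model here is just a hypothetical nn.Linear): move the model to the GPU first, then hand its parameters to the optimizer:

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(4, 2)  # hypothetical model

# Move to the GPU first, so the optimizer sees the GPU Parameters.
if torch.cuda.is_available():
    model = model.cuda()

optimizer = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)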

Attributes and Methods

An optimizer needs two attributes: param_groups and state. param_groups stores the parameter groups (a list of dictionaries, one per group), while state stores per-parameter state information in a dictionary keyed by parameter. For example, in SGD with momentum every parameter needs a velocity to compute its final update; this velocity lives in the state attribute.
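
A small sketch (with a single hypothetical tensor as the "model") that shows what the two attributes contain after one momentum step:

import torch
import torch.optim as optim

w = torch.randn(3, requires_grad=True)
optimizer = optim.SGD([w], lr=0.1, momentum=0.9)

w.sum().backward()
optimizer.step()

# param_groups: one dict per group, holding the parameters and their options.
print(list(optimizer.param_groups[0].keys()))
# state: per-parameter information, here the momentum buffer for w.
print(optimizer.state[w].keys())  # dict_keys(['momentum_buffer'])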

Serialization

First, the special methods related to pickle. __getstate__ and __setstate__ correspond to dump and load:

def __getstate__(self):
    return {
        'state': self.state,
        'param_groups': self.param_groups,
    }

def __setstate__(self, state):
    self.__dict__.update(state)

These two methods define how an instance of the class is pickled and unpickled:

  • When dumping, __getstate__ returns a dictionary; if __getstate__ is not defined, self.__dict__ is used by default.
  • Likewise, when loading, __setstate__ reconstructs the instance from that dictionary.

In addition, we need to save the current state of an optimizer. state_dict and load_state_dict return and load the optimizer state. The actual Parameter objects in memory are not what we care about here; we only care about the state associated with each Parameter. Therefore, in the dictionary returned by state_dict, every Parameter is replaced by its id to save space.

def state_dict(self):
    # Save ids instead of Variables
    def pack_group(group):
        packed = {k: v for k, v in group.items() if k != 'params'}
        packed['params'] = [id(p) for p in group['params']]
        return packed
    param_groups = [pack_group(g) for g in self.param_groups]
    # Remap state to use ids as keys
    packed_state = {(id(k) if isinstance(k, Variable) else k): v
                    for k, v in self.state.items()}
    return {
        'state': packed_state,
        'param_groups': param_groups,
    }

Likewise, when loading a state there is no need to read the Parameters themselves, because the real Parameters are already stored as attributes of the current instance; we only need to restore the state associated with each Parameter.

In short:

  • state_dict returns the attributes as { 'state': packed_state, 'param_groups': param_groups}

  • load_state_dict reads those attributes back (a usage sketch follows below).
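
A minimal sketch of saving and restoring the optimizer state together with the model (the file name is hypothetical):

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(4, 2)
optimizer = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

# ... training ...

# Save model and optimizer state together.
torch.save({'model': model.state_dict(),
            'optimizer': optimizer.state_dict()},
           'checkpoint.pth')

# Later: rebuild the objects, then restore their states.
checkpoint = torch.load('checkpoint.pth')
model.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])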

Updating Parameters

An optimizer updates the parameters through its step method. Some optimization algorithms require an additional closure, a function that performs one evaluation pass and returns the loss.

  • optimizer.step():
for input, target in dataset:
    optimizer.zero_grad()
    output = model(input)
    loss = loss_fn(output, target)
    loss.backward()
    optimizer.step()
  • optimizer.step(closure):
for input, target in dataset:
    def closure():
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)
        loss.backward()
        return loss
    optimizer.step(closure)

All Optimizer Classes

First, fix the notation for the basic gradient-descent update:

$$\theta := \theta - \eta \nabla J(\theta) = \theta - \eta \sum_{i=1}^{n} \nabla J_i(\theta) / n$$

where $\theta$, $\eta$, and $J(\theta)$ are the parameters, the learning rate, and the loss function, respectively.

Adadelta

class torch.optim.Adadelta(params,
                           lr=1.0,
                           rho=0.9,
                           eps=1e-06,
                           weight_decay=0)

It has been proposed in ADADELTA: An Adaptive Learning Rate Method.

Parameters:

  • params (iterable) – parameters to optimize, same as the base class
  • rho (float, optional) – coefficient used for computing a running average of squared gradients (default: 0.9)
  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-6)
  • lr (float, optional) – coefficient that scales delta before it is applied to the parameters (default: 1.0)
  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

AdaGrad

class torch.optim.Adagrad(params,
                          lr=0.01,
                          lr_decay=0,
                          weight_decay=0)

Implements Adagrad algorithm.

Informally, this increases the learning rate for more sparse parameters and decreases the learning rate for less sparse ones. This strategy often improves convergence performance over standard stochastic gradient descent in settings where data is sparse and sparse parameters are more informative. Examples of such applications include natural language processing and image recognition.

It has been proposed in Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.

$$G = \sum_{\tau=1}^{t} g_\tau g_\tau^{\mathsf{T}}$$

where $g_\tau = \nabla J_i(\theta)$ is the gradient at iteration $\tau$. The diagonal is given by

$$G_{j,j} = \sum_{\tau=1}^{t} g_{\tau,j}^2$$

This vector is updated after every iteration. The formula for an update is now

$$\theta := \theta - \eta \, \mathrm{diag}(G)^{-\frac{1}{2}} \circ g$$

or, written as per-parameter updates,

$$\theta_j := \theta_j - \frac{\eta}{\sqrt{G_{j,j}}} g_j$$

Parameters:

  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float, optional) – learning rate (default: 1e-2)

  • lr_decay (float, optional) – learning rate decay (default: 0)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

Adam

class torch.optim.Adam(params,
                       lr=0.001,
                       betas=(0.9, 0.999),
                       eps=1e-08,
                       weight_decay=0)

Adam (Adaptive Moment Estimation) is an improved version of RMSProp. For details, see Adam: A Method for Stochastic Optimization.
$$m_\theta := \beta_1 m_\theta + (1-\beta_1) \nabla J(\theta)$$

$$v_\theta := \beta_2 v_\theta + (1-\beta_2) (\nabla J(\theta))^2$$

Then

$$\hat{m}_\theta = \frac{m_\theta}{1-\beta_1^t}$$

$$\hat{v}_\theta = \frac{v_\theta}{1-\beta_2^t}$$

$$\theta := \theta - \eta \frac{\hat{m}_\theta}{\sqrt{\hat{v}_\theta} + \epsilon}$$

$\epsilon$ prevents division by zero, and $\beta_1$ and $\beta_2$ are the forgetting factors for the gradient and for the second moments of the gradient, respectively.

Parameters:

  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
  • lr (float, optional) – learning rate (default: 1e-3)
  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
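
A minimal usage sketch with the defaults listed above (the model is a hypothetical placeholder):

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(4, 2)  # hypothetical model
optimizer = optim.Adam(model.parameters(), lr=1e-3,
                       betas=(0.9, 0.999), eps=1e-8, weight_decay=0)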

SparseAdam

class torch.optim.SparseAdam(params,
                             lr=0.001,
                             betas=(0.9, 0.999),
                             eps=1e-08)

A variant of the Adam algorithm for sparse tensors.

Parameters:

  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
  • lr (float, optional) – learning rate (default: 1e-3)
  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

Adamax

class torch.optim.Adamax(params,
                         lr=0.002,
                         betas=(0.9, 0.999),
                         eps=1e-08,
                         weight_decay=0)

Implements Adamax algorithm (a variant of Adam based on infinity norm).

It has been proposed in Adam: A Method for Stochastic Optimization.

Parameters:

  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
  • lr (float, optional) – learning rate (default: 2e-3)
  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square
  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

ASGD

class torch.optim.ASGD(params,
                       lr=0.01,
                       lambd=0.0001,
                       alpha=0.75,
                       t0=1000000.0,
                       weight_decay=0)

Implements Averaged Stochastic Gradient Descent.

It has been proposed in Acceleration of stochastic approximation by averaging.


Parameters:

  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
  • lr (float, optional) – learning rate (default: 1e-2)
  • lambd (float, optional) – decay term (default: 1e-4)
  • alpha (float, optional) – power for eta update (default: 0.75)
  • t0 (float, optional) – point at which to start averaging (default: 1e6)
  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

LBFGS

class torch.optim.LBFGS(params,
                        lr=1,
                        max_iter=20,
                        max_eval=None,
                        tolerance_grad=1e-05,
                        tolerance_change=1e-09,
                        history_size=100,
                        line_search_fn=None)

L-BFGS algorithm. This optimizer doesn’t support per-parameter options and parameter groups (there can be only one). This is a very memory intensive optimizer (it requires additional param_bytes * (history_size + 1) bytes). If it doesn’t fit in memory try reducing the history size, or use a different algorithm.

Parameters:

  • lr (float) – learning rate (default: 1)
  • max_iter (int) – maximal number of iterations per optimization step (default: 20)
  • max_eval (int) – maximal number of function evaluations per optimization step (default: max_iter * 1.25)
  • tolerance_grad (float) – termination tolerance on first-order optimality (default: 1e-5)
  • tolerance_change (float) – termination tolerance on function value/parameter changes (default: 1e-9)
  • history_size (int) – update history size (default: 100)

This algorithm requires a closure to update the parameters (a usage sketch follows below):

step(closure)

Performs a single optimization step.

  • closure (callable) – A closure that reevaluates the model and returns the loss.
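
A minimal usage sketch with a closure (the model, data, and loss are hypothetical placeholders), following the step(closure) pattern shown earlier:

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(4, 1)
loss_fn = nn.MSELoss()
x, y = torch.randn(16, 4), torch.randn(16, 1)

optimizer = optim.LBFGS(model.parameters(), lr=1, max_iter=20)

def closure():
    # L-BFGS may evaluate the objective several times per step,
    # so the forward/backward pass lives inside the closure.
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    return loss

optimizer.step(closure)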

RMSprop

class torch.optim.RMSprop(params,
                          lr=0.01,
                          alpha=0.99,
                          eps=1e-08,
                          weight_decay=0,
                          momentum=0,
                          centered=False)

RMSprop(for Root Mean Square Propagation) is proposed by G. Hinton in his course. The centered version first appears in Generating Sequences With Recurrent Neural Networks.

$$v(\theta, t) := \gamma \cdot v(\theta, t-1) + (1-\gamma)(\nabla J_i(\theta))^2$$

where $\gamma$ is the forgetting factor.

And the parameters are updated as

$$\theta := \theta - \frac{\eta}{\sqrt{v(\theta, t)}} \nabla J_i(\theta)$$

RMSProp has shown excellent adaptation of the learning rate in different applications. RMSProp can be seen as a generalization of Rprop and is also capable of working with mini-batches, as opposed to only full batches.

Parameters:

  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
  • lr (float, optional) – learning rate (default: 1e-2)
  • momentum (float, optional) – momentum factor (default: 0)
  • alpha (float, optional) – smoothing constant (default: 0.99)
  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
  • centered (bool, optional) – if True, compute the centered RMSProp, where the gradient is normalized by an estimate of its variance
  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

The step method of this optimizer accepts an optional closure:

step(closure)

Performs a single optimization step.

  • closure (callable, optional) – a closure that reevaluates the model and returns the loss.

SGD

class torch.optim.SGD(params,
                      lr=<object object>,
                      momentum=0,
                      dampening=0,
                      weight_decay=0,
                      nesterov=False)

Implementation of SGD with momentum:

$$v = \rho \cdot v + g$$

$$p = p - lr \cdot v$$

where $p$, $g$, $v$, and $\rho$ are the parameters, gradient, velocity, and momentum, respectively. This differs from the formulation of Sutskever et al. and from the implementations in some other frameworks, which use

$$v = \rho \cdot v + lr \cdot g$$

$$p = p - v$$

The Nesterov version is modified analogously.
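
A small numerical sketch of the two formulations in plain Python (the numbers are made up). With a constant learning rate they produce essentially the same trajectory; they differ once the learning rate is changed during training, because the PyTorch form rescales the entire accumulated velocity at update time:

rho, lr, g = 0.9, 0.1, 1.0  # hypothetical momentum, learning rate, constant gradient

v_pt, p_pt = 0.0, 0.0  # PyTorch form:   v = rho * v + g,       p -= lr * v
v_su, p_su = 0.0, 0.0  # Sutskever form: v = rho * v + lr * g,  p -= v
for _ in range(3):
    v_pt = rho * v_pt + g
    p_pt -= lr * v_pt
    v_su = rho * v_su + lr * g
    p_su -= v_su

print(p_pt, p_su)  # approximately equal here; they diverge if lr changes mid-training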

  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float) – initial learning rate

  • momentum (float, optional) – momentum factor (default: 0)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • dampening (float, optional) – dampening for momentum (default: 0)

  • nesterov (bool, optional) – enables Nesterov momentum (default: False)

The SGD implementation does not require a closure.

Concrete Implementation

Constructor

Here the given arguments are packed into the defaults dictionary, and then the base class constructor is called.

def __init__(self, params, lr=required, momentum=0, dampening=0,
             weight_decay=0, nesterov=False):
    defaults = dict(lr=lr, momentum=momentum, dampening=dampening,
                    weight_decay=weight_decay, nesterov=nesterov)
    if nesterov and (momentum <= 0 or dampening != 0):
        raise ValueError("Nesterov momentum requires a momentum and zero dampening")
    super(SGD, self).__init__(params, defaults)

step

In SGD, every parameter has its own velocity, called 'momentum_buffer' here. In other words, the state member of the SGD class stores the velocity of each parameter.

def step(self, closure=None):
    loss = None
    if closure is not None:
        loss = closure()

    for group in self.param_groups:
        weight_decay = group['weight_decay']
        momentum = group['momentum']
        dampening = group['dampening']
        nesterov = group['nesterov']

        for p in group['params']:
            # p is one of the parameters in this group
            if p.grad is None:
                # skip parameters that have no gradient
                continue
            d_p = p.grad.data
            if weight_decay != 0:
                d_p.add_(weight_decay, p.data)
            if momentum != 0:
                param_state = self.state[p]
                if 'momentum_buffer' not in param_state:
                    buf = param_state['momentum_buffer'] = torch.zeros_like(p.data)
                    buf.mul_(momentum).add_(d_p)
                else:
                    buf = param_state['momentum_buffer']
                    buf.mul_(momentum).add_(1 - dampening, d_p)
                if nesterov:
                    d_p = d_p.add(momentum, buf)
                else:
                    d_p = buf

            p.data.add_(-group['lr'], d_p)

    return loss

How to Adjust the Learning Rate

  • torch.optim.lr_scheduler: provides several learning-rate scheduling rules
  • torch.optim.lr_scheduler.ReduceLROnPlateau: adjusts the learning rate dynamically based on some metric

The _LRScheduler Base Class

_LRScheduler is the base class of all learning-rate scheduler classes. It provides the basic methods and attributes for adjusting the learning rate, although not every subclass uses all of them.

Constructor

To instantiate a learning-rate scheduler we need an optimizer and the current epoch. The optimizer carries the learning-rate information of the parameters, and the current epoch determines how the learning rate is adjusted relative to its initial value. For example, last_epoch=-1 means starting from the initial learning rate.

class _LRScheduler(optimizer, last_epoch=-1)

Attributes

  • optimizer: the wrapped optimizer.
  • last_epoch: the previous epoch. When constructing an instance it defaults to -1, i.e. start from the initial learning rate.
  • base_lrs: the initial learning rates, a list built from the 'initial_lr' entry of each group in optimizer.param_groups.

Methods

  • get_lr(): returns the learning rates corresponding to the current state.
  • step(epoch=None): updates the learning rates in the optimizer. epoch specifies the current epoch, and the learning rate is adjusted accordingly. If no epoch is given, the current epoch is set to last_epoch + 1:
def step(self, epoch=None):
    if epoch is None:
        epoch = self.last_epoch + 1
    self.last_epoch = epoch
    for param_group, lr in zip(self.optimizer.param_groups, self.get_lr()):
        param_group['lr'] = lr

Except for ReduceLROnPlateau, the other scheduling rules all follow the update framework defined in the base class. base_lrs stores the optimizer's initial learning rates, because the other rules are simple transformations of the initial learning rate. A sketch of a custom rule built on this framework is given below.
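
To illustrate this framework, here is a hedged sketch (not from the original text) of a custom scheduler: it only overrides get_lr(), and the base class step() applies the returned values to the optimizer's param_groups:

import torch
import torch.optim as optim
from torch.optim.lr_scheduler import _LRScheduler

class HalveEveryTenEpochs(_LRScheduler):
    """Hypothetical rule: multiply each initial lr by 0.5 every 10 epochs."""
    def get_lr(self):
        factor = 0.5 ** (self.last_epoch // 10)
        return [base_lr * factor for base_lr in self.base_lrs]

params = [torch.randn(2, 2, requires_grad=True)]
optimizer = optim.SGD(params, lr=0.1)
scheduler = HalveEveryTenEpochs(optimizer, last_epoch=-1)

for epoch in range(30):
    # train(...) ; validate(...)
    scheduler.step()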

Learning Rate Update Flow

[Flowchart: step(epoch). If epoch is None, set epoch = last_epoch + 1; set last_epoch = epoch; call get_lr(); assign the returned learning rates to the optimizer's param_groups.]

LambdaLR

class torch.optim.lr_scheduler.LambdaLR(
            optimizer,  
            lr_lambda,
            last_epoch=-1)

Each group is assigned a lambda function, and the learning rate is updated according to it:

$$\alpha = \alpha_0 \cdot \mathrm{lambda}(epoch)$$

where $\mathrm{lambda}(epoch)$ is the function specified for each group and $\alpha_0$ is the initial learning rate.

def get_lr(self):
    return [base_lr * lmbda(self.last_epoch)
            for lmbda, base_lr in zip(self.lr_lambdas, self.base_lrs)]

Parameters:

  • lr_lambda – a function that computes a multiplicative factor from the epoch. If a list is given, the number of lambda functions must match the number of groups in optimizer.param_groups.

Example:

>>> # Assuming optimizer has two groups.
>>> lambda1 = lambda epoch: epoch // 30
>>> lambda2 = lambda epoch: 0.95 ** epoch
>>> scheduler = LambdaLR(optimizer, lr_lambda=[lambda1, lambda2])
>>> for epoch in range(100):
>>>     scheduler.step()
>>>     train(...)
>>>     validate(...)

StepLR

class torch.optim.lr_scheduler.StepLR(
            optimizer,
            step_size,
            gamma=0.1,
            last_epoch=-1)

The learning rate is decayed once every fixed number of epochs:

$$\alpha = \alpha_0 \times \gamma^{\lfloor epoch / step \rfloor}$$

def get_lr(self):
    return [base_lr * self.gamma ** (self.last_epoch // self.step_size)
            for base_lr in self.base_lrs]

Parameters:

  • step_size (int) – period of learning rate decay, in epochs
  • gamma (float) – multiplicative factor of learning rate decay

Example

>>> # Assuming optimizer uses lr = 0.05 for all groups
>>> # lr = 0.05     if epoch < 30
>>> # lr = 0.005    if 30 <= epoch < 60
>>> # lr = 0.0005   if 60 <= epoch < 90
>>> # ...
>>> scheduler = StepLR(optimizer, step_size=30, gamma=0.1)
>>> for epoch in range(100):
>>>     scheduler.step()
>>>     train(...)
>>>     validate(...)

MultiStepLR

class torch.optim.lr_scheduler.MultiStepLR(
            optimizer,
            milestones,
            gamma=0.1,
            last_epoch=-1)

$$\alpha = \alpha_0 \times \gamma^{i}$$

where $m_i \leq epoch < m_{i+1}$ and

$$0 = m_0 < m_1 < m_2 < \ldots < m_N$$

MultiStepLR adjusts the learning rate with non-uniform step lengths: the epochs at which the learning rate changes are specified manually.

def get_lr(self):
    return [base_lr * self.gamma ** bisect_right(self.milestones, self.last_epoch)
            for base_lr in self.base_lrs]

Parameters:

  • milestones (list) – increasing list of epochs at which the learning rate is adjusted
  • gamma (float) – multiplicative factor of learning rate decay

Example:

>>> # Assuming optimizer uses lr = 0.05 for all groups
>>> # lr = 0.05     if epoch < 30
>>> # lr = 0.005    if 30 <= epoch < 80
>>> # lr = 0.0005   if epoch >= 80
>>> scheduler = MultiStepLR(optimizer, milestones=[30,80], gamma=0.1)
>>> for epoch in range(100):
>>>     scheduler.step()
>>>     train(...)
>>>     validate(...)

ExponentialLR

class torch.optim.lr_scheduler.ExponentialLR(optimizer,
                                             gamma,
                                             last_epoch=-1)

$$\alpha = \alpha_0 \times \gamma^{epoch}$$

def get_lr(self):
    return [base_lr * self.gamma ** self.last_epoch
            for base_lr in self.base_lrs]

Parameters:

  • gamma (float) – Multiplicative factor of learning rate decay.
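
Example (a minimal, self-contained usage sketch in the same spirit as the examples for the other schedulers):

import torch
import torch.optim as optim
from torch.optim.lr_scheduler import ExponentialLR

params = [torch.randn(2, 2, requires_grad=True)]
optimizer = optim.SGD(params, lr=0.1)
scheduler = ExponentialLR(optimizer, gamma=0.95)

for epoch in range(5):
    # train(...) ; validate(...)
    scheduler.step()
    print(epoch, optimizer.param_groups[0]['lr'])  # decays by a factor of 0.95 per epoch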

ReduceLROnPlateau

class torch.optim.lr_scheduler.ReduceLROnPlateau(
            optimizer,
            mode='min',
            factor=0.1,
            patience=10,
            verbose=False,
            threshold=0.0001,
            threshold_mode='rel',
            cooldown=0,
            min_lr=0,
            eps=1e-08)

[Flowchart: step(metrics, epoch). If epoch is None, set epoch = last_epoch + 1 and set last_epoch = epoch. If metrics improves on best, set best = metrics and reset the bad-epoch counter; otherwise increment it. While in cooldown, decrement the cooldown counter and keep the bad-epoch counter at 0. Once the bad-epoch counter exceeds patience, reduce the learning rate, reset the cooldown counter, and reset the bad-epoch counter.]

Reduces the learning rate when a given metric has stopped improving for several consecutive epochs.

Parameters:

  • optimizer (Optimizer) – the wrapped optimizer

  • mode (str) – one of 'min', 'max'. In 'min' mode the learning rate is reduced when the monitored metric has stopped decreasing; in 'max' mode, when it has stopped increasing. Default: 'min'.

  • factor (float) – factor by which the learning rate is reduced: new_lr = lr * factor. Default: 0.1.

  • patience (int) – number of epochs with no improvement after which the learning rate is reduced. Default: 10.

  • verbose (bool) – if True, prints a message to stdout for each update. Default: False.

  • threshold (float) – threshold for measuring whether the metric is still improving. Default: 1e-4.

  • threshold_mode (str) – one of 'rel', 'abs'. In 'rel' mode, dynamic_threshold = best * (1 + threshold) in 'max' mode or best * (1 - threshold) in 'min' mode. In 'abs' mode, dynamic_threshold = best + threshold in 'max' mode or best - threshold in 'min' mode. Default: 'rel'.

  • cooldown (int) – number of epochs to wait after a learning-rate reduction before resuming normal operation. Default: 0.

  • min_lr (float or list) – a lower bound on the learning rate, below which it will not be reduced. Default: 0.

  • eps (float) – minimal decay applied to the learning rate. If the difference between the new and the old learning rate is smaller than eps, the update is ignored. Default: 1e-8.

Example:

>>> optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
>>> scheduler = ReduceLROnPlateau(optimizer, 'min')
>>> for epoch in range(10):
>>>     train(...)
>>>     val_loss = validate(...)
>>>     # Note that step should be called after validate()
>>>     scheduler.step(val_loss)
