First, a note: for building neural networks, PyTorch offers two styles, object-oriented and functional, living under `torch.nn` and `torch.nn.functional` respectively. In the object-oriented style, state (the model parameters) is stored inside objects; in the functional style, functions are stateless and the user manages the state. For stateless layers (such as activations, normalization, and pooling) the two differ very little. The object-oriented style is the more widely used one and is the focus of this article.
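A minimal sketch of the contrast (the tensor shapes and layer sizes here are arbitrary): a stateless layer like ReLU behaves the same in both styles, while a stateful layer like `nn.Linear` owns its weights, whereas `F.linear` expects them as arguments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(4, 8)

# Stateless layer: the two styles are essentially interchangeable.
relu = nn.ReLU()                 # object-oriented: a module object with no parameters
out1 = relu(x)
out2 = F.relu(x)                 # functional: a plain function call
assert torch.equal(out1, out2)

# Stateful layer: nn.Linear stores weight/bias itself,
# while F.linear expects the caller to pass them in.
linear = nn.Linear(8, 3)
out3 = linear(x)
out4 = F.linear(x, linear.weight, linear.bias)
assert torch.allclose(out3, out4)
```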
torch builds in serialization and deserialization, seamless movement between devices (CPU, GPU, TPU, ...), model pruning, quantization, JIT, and so on. The key to all of it comes down to two things: state and computation (which sounds suspiciously like dynamic programming).
`torch.nn.Module` is the base class of every neural network layer in torch.
A `Module` is composed of:

- submodules: child `Module`s
- `Parameter`: parameters, which track gradients and can be updated by an optimizer
- `Buffer`: part of the module's state that is not a parameter; for example, `running_mean` in `BatchNorm` is not a parameter, yet it is part of the model's state

Public attributes:

- `training`: whether the module is in train mode or eval mode; switch it with `Module.train()` and `Module.eval()` (or assign it directly). A small sketch of parameters, buffers, and this flag follows.
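A minimal illustrative sketch (the `Standardize` module and its numbers are made up for this example) of how a `Parameter`, a registered buffer, and the `training` flag behave:

```python
import torch
import torch.nn as nn

class Standardize(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Parameter: registered automatically, visible to optimizers.
        self.scale = nn.Parameter(torch.ones(dim))
        # Buffer: saved in state_dict(), but not returned by parameters().
        self.register_buffer("running_mean", torch.zeros(dim))

    def forward(self, x):
        if self.training:                      # the public `training` flag
            with torch.no_grad():
                self.running_mean.mul_(0.9).add_(0.1 * x.mean(0))
        return (x - self.running_mean) * self.scale

m = Standardize(4)
print([name for name, _ in m.named_parameters()])  # ['scale']
print([name for name, _ in m.named_buffers()])     # ['running_mean']
m.eval()   # sets m.training to False (recursively)
```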
Public methods (note: unless stated otherwise, `Module` methods operate in place):
Dtype and device conversion:

- `bfloat16()`, `half()`, `float()`, `double()`, `type(dst_type)`, `to(dtype)`
- `cpu()`, `cuda(device=None)`, `to(device)`

Computation and registration:

- `forward`
- `zero_grad`
- `add_module(name, module)`: attaches a child module under the name `name`
- `register_module(name, module)`: alias of `add_module`
- `register_buffer(name, tensor, persistent=True)`: attaches a buffer
Hooks: PyTorch 1.11 has five hook-registration methods in total. They are commonly used to print a tensor's shape or dtype, report progress, log to TensorBoard, and the like (an example follows the list):

- `register_backward_hook(hook)`
- `register_forward_hook(hook)`
- `register_forward_pre_hook(hook)`
- `register_full_backward_hook(hook)`
- `register_load_state_dict_post_hook(hook)`
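For example, a forward hook can log output shapes. The sketch below is illustrative only (the toy model and the `log_shape` function are arbitrary):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

# A forward hook receives (module, inputs, output) after forward() has run.
def log_shape(module, inputs, output):
    print(f"{module.__class__.__name__}: {tuple(output.shape)}")

handles = [m.register_forward_hook(log_shape) for m in model]

model(torch.randn(2, 8))
# Linear: (2, 16)
# ReLU: (2, 16)
# Linear: (2, 4)

for h in handles:
    h.remove()   # each registration returns a handle; remove() detaches the hook
```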
Traversal and iteration (see the sketch after this list):

- `modules()`: returns an iterator that recursively visits every module; a module that appears more than once is yielded only once
- `children()`: returns an iterator over immediate child modules; unlike `modules()`, it does not recurse
- `parameters(recurse=True)`: returns an iterator over module parameters, typically passed to an optimizer; with `recurse=True` it recurses, otherwise only direct members are visited
- `named_children()`, `named_modules()`, `named_parameters()`: named versions of the above
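A small sketch of the difference between these iterators (the toy network is arbitrary):

```python
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 16, 3),
    nn.Sequential(nn.BatchNorm2d(16), nn.ReLU()),
)

# modules() recurses (and includes net itself); children() stops at the first level.
print(len(list(net.modules())))    # 5: net, Conv2d, inner Sequential, BatchNorm2d, ReLU
print(len(list(net.children())))   # 2: Conv2d, inner Sequential

# named_parameters() recurses by default; the names encode the module path.
for name, p in net.named_parameters():
    print(name, tuple(p.shape))
# 0.weight (16, 3, 3, 3)
# 0.bias (16,)
# 1.0.weight (16,)
# 1.0.bias (16,)
```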
State, initialization, and serialization (see the sketch after this list):

- `apply(fn)`: recursively applies `fn` to the module itself and all submodules (`.children()`); commonly used for weight initialization
- `load_state_dict(state_dict, strict=True)`: loads a `state_dict` into the module's own parameters
- `state_dict()`: returns the module's `state_dict`, which can be saved with `torch.save`
- `to(memory_format)`: `memory_format` is covered in the Tensor section
- `extra_repr`
Miscellaneous:

- `train`, `eval`
- `get_submodule`
- `set_extra_state`
- `share_memory()`: moves the underlying storage of parameters and buffers to shared memory (a no-op for CUDA tensors); tensors in shared memory cannot be resized
- `to_empty`: Moves the parameters and buffers to the specified device without copying storage.

A minimal custom module:

```python
import torch.nn as nn
import torch.nn.functional as F


class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))
```
PyTorch already ships high-performance implementations of the modules commonly used in neural networks.
| Class | Description |
|---|---|
| Parameter | A kind of Tensor that is to be considered a module parameter. |
| UninitializedParameter | A parameter that is not initialized. |
| UninitializedBuffer | A buffer that is not initialized. |

| Container | Description |
|---|---|
| Module | Base class for all neural network modules. |
| Sequential | A sequential container. |
| ModuleList | Holds submodules in a list (see the example after this table). |
| ModuleDict | Holds submodules in a dictionary. |
| ParameterList | Holds parameters in a list. |
| ParameterDict | Holds parameters in a dictionary. |
Global hooks (registered once for all modules):
| Function | Description |
|---|---|
| register_module_forward_pre_hook | Registers a forward pre-hook common to all modules. |
| register_module_forward_hook | Registers a global forward hook for all the modules. |
| register_module_backward_hook | Registers a backward hook common to all the modules. |
| register_module_full_backward_hook | Registers a backward hook common to all the modules. |
| Layer | Description |
|---|---|
| nn.Conv1d | Applies a 1D convolution over an input signal composed of several input planes. |
| nn.Conv2d | Applies a 2D convolution over an input signal composed of several input planes. |
| nn.Conv3d | Applies a 3D convolution over an input signal composed of several input planes. |
| nn.ConvTranspose1d | Applies a 1D transposed convolution operator over an input image composed of several input planes. |
| nn.ConvTranspose2d | Applies a 2D transposed convolution operator over an input image composed of several input planes. |
| nn.ConvTranspose3d | Applies a 3D transposed convolution operator over an input image composed of several input planes. |
| nn.LazyConv1d | A torch.nn.Conv1d module with lazy initialization of the in_channels argument, inferred from input.size(1). |
| nn.LazyConv2d | A torch.nn.Conv2d module with lazy initialization of the in_channels argument, inferred from input.size(1). |
| nn.LazyConv3d | A torch.nn.Conv3d module with lazy initialization of the in_channels argument, inferred from input.size(1). |
| nn.LazyConvTranspose1d | A torch.nn.ConvTranspose1d module with lazy initialization of the in_channels argument, inferred from input.size(1). |
| nn.LazyConvTranspose2d | A torch.nn.ConvTranspose2d module with lazy initialization of the in_channels argument, inferred from input.size(1). |
| nn.LazyConvTranspose3d | A torch.nn.ConvTranspose3d module with lazy initialization of the in_channels argument, inferred from input.size(1). |
| nn.Unfold | Extracts sliding local blocks from a batched input tensor. |
| nn.Fold | Combines an array of sliding local blocks into a large containing tensor. |
| Layer | Description |
|---|---|
| nn.MaxPool1d | Applies a 1D max pooling over an input signal composed of several input planes. |
| nn.MaxPool2d | Applies a 2D max pooling over an input signal composed of several input planes. |
| nn.MaxPool3d | Applies a 3D max pooling over an input signal composed of several input planes. |
| nn.MaxUnpool1d | Computes a partial inverse of MaxPool1d. |
| nn.MaxUnpool2d | Computes a partial inverse of MaxPool2d. |
| nn.MaxUnpool3d | Computes a partial inverse of MaxPool3d. |
| nn.AvgPool1d | Applies a 1D average pooling over an input signal composed of several input planes. |
| nn.AvgPool2d | Applies a 2D average pooling over an input signal composed of several input planes. |
| nn.AvgPool3d | Applies a 3D average pooling over an input signal composed of several input planes. |
| nn.FractionalMaxPool2d | Applies a 2D fractional max pooling over an input signal composed of several input planes. |
| nn.FractionalMaxPool3d | Applies a 3D fractional max pooling over an input signal composed of several input planes. |
| nn.LPPool1d | Applies a 1D power-average pooling over an input signal composed of several input planes. |
| nn.LPPool2d | Applies a 2D power-average pooling over an input signal composed of several input planes. |
| nn.AdaptiveMaxPool1d | Applies a 1D adaptive max pooling over an input signal composed of several input planes. |
| nn.AdaptiveMaxPool2d | Applies a 2D adaptive max pooling over an input signal composed of several input planes. |
| nn.AdaptiveMaxPool3d | Applies a 3D adaptive max pooling over an input signal composed of several input planes. |
| nn.AdaptiveAvgPool1d | Applies a 1D adaptive average pooling over an input signal composed of several input planes. |
| nn.AdaptiveAvgPool2d | Applies a 2D adaptive average pooling over an input signal composed of several input planes. |
| nn.AdaptiveAvgPool3d | Applies a 3D adaptive average pooling over an input signal composed of several input planes. |
| Layer | Description |
|---|---|
| nn.ReflectionPad1d | Pads the input tensor using the reflection of the input boundary. |
| nn.ReflectionPad2d | Pads the input tensor using the reflection of the input boundary. |
| nn.ReflectionPad3d | Pads the input tensor using the reflection of the input boundary. |
| nn.ReplicationPad1d | Pads the input tensor using replication of the input boundary. |
| nn.ReplicationPad2d | Pads the input tensor using replication of the input boundary. |
| nn.ReplicationPad3d | Pads the input tensor using replication of the input boundary. |
| nn.ZeroPad2d | Pads the input tensor boundaries with zero. |
| nn.ConstantPad1d | Pads the input tensor boundaries with a constant value. |
| nn.ConstantPad2d | Pads the input tensor boundaries with a constant value. |
| nn.ConstantPad3d | Pads the input tensor boundaries with a constant value. |
| Layer | Description |
|---|---|
| nn.ELU | Applies the Exponential Linear Unit (ELU) function element-wise, as described in "Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)". |
| nn.Hardshrink | Applies the Hard Shrinkage (Hardshrink) function element-wise. |
| nn.Hardsigmoid | Applies the Hardsigmoid function element-wise. |
| nn.Hardtanh | Applies the HardTanh function element-wise. |
| nn.Hardswish | Applies the Hardswish function element-wise. |
| nn.LeakyReLU | Applies the LeakyReLU function element-wise. |
| nn.LogSigmoid | Applies the LogSigmoid function element-wise. |
| nn.MultiheadAttention | Allows the model to jointly attend to information from different representation subspaces, as described in "Attention Is All You Need". |
| nn.PReLU | Applies the PReLU function element-wise. |
| nn.ReLU | Applies the rectified linear unit function element-wise. |
| nn.ReLU6 | Applies the ReLU6 function element-wise. |
| nn.RReLU | Applies the randomized leaky rectified linear unit function element-wise. |
| nn.SELU | Applies the SELU function element-wise. |
| nn.CELU | Applies the CELU function element-wise. |
| nn.GELU | Applies the Gaussian Error Linear Units function. |
| nn.Sigmoid | Applies the Sigmoid function element-wise. |
| nn.SiLU | Applies the Sigmoid Linear Unit (SiLU) function element-wise. |
| nn.Mish | Applies the Mish function element-wise. |
| nn.Softplus | Applies the Softplus function $\text{Softplus}(x) = \frac{1}{\beta}\log(1 + \exp(\beta x))$ element-wise. |
| nn.Softshrink | Applies the soft shrinkage function element-wise. |
| nn.Softsign | Applies the Softsign function element-wise. |
| nn.Tanh | Applies the Hyperbolic Tangent (Tanh) function element-wise. |
| nn.Tanhshrink | Applies the Tanhshrink function element-wise. |
| nn.Threshold | Thresholds each element of the input Tensor. |
| nn.GLU | Applies the gated linear unit function $\text{GLU}(a, b) = a \otimes \sigma(b)$, where $a$ is the first half of the input matrices and $b$ is the second half. |
| Layer | Description |
|---|---|
| nn.Softmin | Applies the Softmin function to an n-dimensional input Tensor, rescaling it so that the elements of the n-dimensional output Tensor lie in the range [0, 1] and sum to 1. |
| nn.Softmax | Applies the Softmax function to an n-dimensional input Tensor, rescaling it so that the elements of the n-dimensional output Tensor lie in the range [0, 1] and sum to 1. |
| nn.Softmax2d | Applies SoftMax over features to each spatial location. |
| nn.LogSoftmax | Applies the $\log(\text{Softmax}(x))$ function to an n-dimensional input Tensor. |
| nn.AdaptiveLogSoftmaxWithLoss | Efficient softmax approximation, as described in "Efficient softmax approximation for GPUs" by Edouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, and Hervé Jégou. |
| Layer | Description |
|---|---|
| nn.BatchNorm1d | Applies Batch Normalization over a 2D or 3D input, as described in "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift". |
| nn.BatchNorm2d | Applies Batch Normalization over a 4D input (a mini-batch of 2D inputs with an additional channel dimension). |
| nn.BatchNorm3d | Applies Batch Normalization over a 5D input (a mini-batch of 3D inputs with an additional channel dimension). |
| nn.LazyBatchNorm1d | A torch.nn.BatchNorm1d module with lazy initialization of the num_features argument, inferred from input.size(1). |
| nn.LazyBatchNorm2d | A torch.nn.BatchNorm2d module with lazy initialization of the num_features argument, inferred from input.size(1). |
| nn.LazyBatchNorm3d | A torch.nn.BatchNorm3d module with lazy initialization of the num_features argument, inferred from input.size(1). |
| nn.GroupNorm | Applies Group Normalization over a mini-batch of inputs, as described in the paper "Group Normalization". |
| nn.SyncBatchNorm | Applies Batch Normalization over an N-dimensional input (a mini-batch of [N-2]D inputs with an additional channel dimension). |
| nn.InstanceNorm1d | Applies Instance Normalization over a 2D (unbatched) or 3D (batched) input, as described in "Instance Normalization: The Missing Ingredient for Fast Stylization". |
| nn.InstanceNorm2d | Applies Instance Normalization over a 4D input (a mini-batch of 2D inputs with an additional channel dimension). |
| nn.InstanceNorm3d | Applies Instance Normalization over a 5D input (a mini-batch of 3D inputs with an additional channel dimension). |
| nn.LazyInstanceNorm1d | A torch.nn.InstanceNorm1d module with lazy initialization of the num_features argument, inferred from input.size(1). |
| nn.LazyInstanceNorm2d | A torch.nn.InstanceNorm2d module with lazy initialization of the num_features argument, inferred from input.size(1). |
| nn.LazyInstanceNorm3d | A torch.nn.InstanceNorm3d module with lazy initialization of the num_features argument, inferred from input.size(1). |
| nn.LayerNorm | Applies Layer Normalization over a mini-batch of inputs, as described in the paper "Layer Normalization". |
| nn.LocalResponseNorm | Applies local response normalization over an input signal composed of several input planes, where channels occupy the second dimension. |
| Layer | Description |
|---|---|
| nn.RNNBase | |
| nn.RNN | Applies a multi-layer Elman RNN with $\tanh$ or $\text{ReLU}$ non-linearity to an input sequence. |
| nn.LSTM | Applies a multi-layer long short-term memory (LSTM) RNN to an input sequence. |
| nn.GRU | Applies a multi-layer gated recurrent unit (GRU) RNN to an input sequence. |
| nn.RNNCell | An Elman RNN cell with tanh or ReLU non-linearity. |
| nn.LSTMCell | A long short-term memory (LSTM) cell. |
| nn.GRUCell | A gated recurrent unit (GRU) cell. |
| Layer | Description |
|---|---|
| nn.Transformer | A transformer model. |
| nn.TransformerEncoder | TransformerEncoder is a stack of N encoder layers. |
| nn.TransformerDecoder | TransformerDecoder is a stack of N decoder layers. |
| nn.TransformerEncoderLayer | TransformerEncoderLayer is made up of self-attention and a feedforward network. |
| nn.TransformerDecoderLayer | TransformerDecoderLayer is made up of self-attention, multi-head attention and a feedforward network. |
| Layer | Description |
|---|---|
| nn.Identity | A placeholder identity operator that is argument-insensitive. |
| nn.Linear | Applies a linear transformation to the incoming data: $y = xA^T + b$. |
| nn.Bilinear | Applies a bilinear transformation to the incoming data: $y = x_1^T A x_2 + b$. |
| nn.LazyLinear | A torch.nn.Linear module where in_features is inferred. |
| Layer | Description |
|---|---|
| nn.Dropout | During training, randomly zeroes some of the elements of the input tensor with probability p using samples from a Bernoulli distribution. |
| nn.Dropout1d | Randomly zeroes out entire channels (a channel is a 1D feature map, e.g. the $j$-th channel of the $i$-th sample in the batched input is a 1D tensor $\text{input}[i, j]$). |
| nn.Dropout2d | Randomly zeroes out entire channels (a channel is a 2D feature map, e.g. the $j$-th channel of the $i$-th sample in the batched input is a 2D tensor $\text{input}[i, j]$). |
| nn.Dropout3d | Randomly zeroes out entire channels (a channel is a 3D feature map, e.g. the $j$-th channel of the $i$-th sample in the batched input is a 3D tensor $\text{input}[i, j]$). |
| nn.AlphaDropout | Applies Alpha Dropout over the input. |
| nn.FeatureAlphaDropout | Randomly masks out entire channels (a channel is a feature map). |
| Layer | Description |
|---|---|
| nn.Embedding | A simple lookup table that stores embeddings of a fixed dictionary and size. |
| nn.EmbeddingBag | Computes sums or means of 'bags' of embeddings, without instantiating the intermediate embeddings. |

| Layer | Description |
|---|---|
| nn.CosineSimilarity | Returns the cosine similarity between $x_1$ and $x_2$, computed along dim. |
| nn.PairwiseDistance | Computes the pairwise distance between vectors $v_1$ and $v_2$ using the p-norm. |
| Loss | Description |
|---|---|
| nn.L1Loss | Creates a criterion that measures the mean absolute error (MAE) between each element in the input $x$ and target $y$. |
| nn.MSELoss | Creates a criterion that measures the mean squared error (squared L2 norm) between each element in the input $x$ and target $y$. |
| nn.CrossEntropyLoss | Computes the cross entropy loss between input and target (see the sketch after this table). |
| nn.CTCLoss | The Connectionist Temporal Classification loss. |
| nn.NLLLoss | The negative log likelihood loss. |
| nn.PoissonNLLLoss | Negative log likelihood loss with Poisson distribution of target. |
| nn.GaussianNLLLoss | Gaussian negative log likelihood loss. |
| nn.KLDivLoss | The Kullback-Leibler divergence loss. |
| nn.BCELoss | Creates a criterion that measures the binary cross entropy between the target and the input probabilities. |
| nn.BCEWithLogitsLoss | Combines a Sigmoid layer and the BCELoss in one single class. |
| nn.MarginRankingLoss | Creates a criterion that measures the loss given inputs $x_1$, $x_2$ (two 1D mini-batch or 0D Tensors) and a label 1D mini-batch or 0D Tensor $y$ (containing 1 or -1). |
| nn.HingeEmbeddingLoss | Measures the loss given an input tensor $x$ and a labels tensor $y$ (containing 1 or -1). |
| nn.MultiLabelMarginLoss | Creates a criterion that optimizes a multi-class multi-classification hinge loss (margin-based loss) between input $x$ (a 2D mini-batch Tensor) and output $y$ (a 2D Tensor of target class indices). |
| nn.HuberLoss | Creates a criterion that uses a squared term if the absolute element-wise error falls below delta and a delta-scaled L1 term otherwise. |
| nn.SmoothL1Loss | Creates a criterion that uses a squared term if the absolute element-wise error falls below beta and an L1 term otherwise. |
| nn.SoftMarginLoss | Creates a criterion that optimizes a two-class classification logistic loss between input tensor $x$ and target tensor $y$ (containing 1 or -1). |
| nn.MultiLabelSoftMarginLoss | Creates a criterion that optimizes a multi-label one-versus-all loss based on max-entropy, between input $x$ and target $y$ of size $(N, C)$. |
| nn.CosineEmbeddingLoss | Creates a criterion that measures the loss given input tensors $x_1$, $x_2$ and a Tensor label $y$ with values 1 or -1. |
| nn.MultiMarginLoss | Creates a criterion that optimizes a multi-class classification hinge loss (margin-based loss) between input $x$ (a 2D mini-batch Tensor) and output $y$ (a 1D tensor of target class indices, $0 \leq y \leq \text{x.size}(1)-1$). |
| nn.TripletMarginLoss | Creates a criterion that measures the triplet loss given input tensors $x_1$, $x_2$, $x_3$ and a margin with a value greater than $0$. |
| nn.TripletMarginWithDistanceLoss | Creates a criterion that measures the triplet loss given input tensors $a$, $p$, and $n$ (anchor, positive, and negative examples) and a nonnegative, real-valued distance function used to compute the anchor-positive and anchor-negative distances. |
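One detail worth calling out for `nn.CrossEntropyLoss` (a minimal sketch with random data): it fuses LogSoftmax and NLLLoss internally, so it expects raw logits of shape (N, C) and integer class indices of shape (N,), not probabilities or one-hot targets.

```python
import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()

logits = torch.randn(4, 10, requires_grad=True)  # (N, C) raw scores, no softmax applied
targets = torch.tensor([1, 0, 9, 3])             # (N,) class indices

loss = loss_fn(logits, targets)
loss.backward()
print(loss.item())
```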
| Layer | Description |
|---|---|
| nn.PixelShuffle | Rearranges elements in a tensor of shape $(*, C \times r^2, H, W)$ to a tensor of shape $(*, C, H \times r, W \times r)$, where $r$ is an upscale factor. |
| nn.PixelUnshuffle | Reverses the PixelShuffle operation by rearranging elements in a tensor of shape $(*, C, H \times r, W \times r)$ to a tensor of shape $(*, C \times r^2, H, W)$, where $r$ is a downscale factor. |
| nn.Upsample | Upsamples a given multi-channel 1D (temporal), 2D (spatial) or 3D (volumetric) data. |
| nn.UpsamplingNearest2d | Applies a 2D nearest neighbor upsampling to an input signal composed of several input channels. |
| nn.UpsamplingBilinear2d | Applies a 2D bilinear upsampling to an input signal composed of several input channels. |

| Layer | Description |
|---|---|
| nn.ChannelShuffle | Divides the channels in a tensor of shape $(*, C, H, W)$ into $g$ groups and rearranges them as $(*, \frac{C}{g}, g, H, W)$, while keeping the original tensor shape. |
| Module | Description |
|---|---|
| nn.DataParallel | Implements data parallelism at the module level. |
| nn.parallel.DistributedDataParallel | Implements distributed data parallelism based on the torch.distributed package, at the module level. |
| Utility | Description |
|---|---|
| nn.utils.rnn.PackedSequence | Holds the data and list of batch_sizes of a packed sequence. |
| nn.utils.rnn.pack_padded_sequence | Packs a Tensor containing padded sequences of variable length. |
| nn.utils.rnn.pad_packed_sequence | Pads a packed batch of variable length sequences. |
| nn.utils.rnn.pad_sequence | Pads a list of variable length Tensors with padding_value. |
| nn.utils.rnn.pack_sequence | Packs a list of variable length Tensors. |
| Layer | Description |
|---|---|
| nn.Flatten | Flattens a contiguous range of dims into a tensor. |
| nn.Unflatten | Unflattens a tensor dim, expanding it to a desired shape. |
There are also model-pruning utilities (under torch.nn.utils.prune), quantization tooling, and so on; they are not covered here.
nn.init
Every `Module` has its own built-in initialization logic, but more advanced initialization schemes can also be applied.
`torch.nn.init.calculate_gain(nonlinearity, param=None)`: Return the recommended gain value for the given nonlinearity function (a usage sketch follows the table).

| Function | Result |
|---|---|
| uniform_(tensor, a=0.0, b=1.0) | $\mathcal{U}(a, b)$ |
| normal_(tensor, mean=0.0, std=1.0) | $\mathcal{N}(\mathrm{mean}, \mathrm{std}^2)$ |
| constant_(tensor, val) | $\mathrm{val}$ |
| ones_(tensor) | $1$ |
| zeros_(tensor) | $0$ |
| eye_(tensor) | $I$ (2D tensors only) |
| dirac_(tensor, groups=1) | Fills the {3, 4, 5}-dimensional input Tensor with the Dirac delta function, preserving the identity of the inputs in convolutional layers (as many input channels as possible are preserved; with groups > 1, each group of channels preserves identity). |
| xavier_uniform_(tensor, gain=1.0) | $\mathcal{U}(-a, a)$, where $a = \mathrm{gain} \times \sqrt{\frac{6}{\mathrm{fan\_in} + \mathrm{fan\_out}}}$ |
| xavier_normal_(tensor, gain=1.0) | $\mathcal{N}(0, \mathrm{std}^2)$, where $\mathrm{std} = \mathrm{gain} \times \sqrt{\frac{2}{\mathrm{fan\_in} + \mathrm{fan\_out}}}$ |
| kaiming_uniform_(tensor, a=0, mode='fan_in', nonlinearity='leaky_relu') | $\mathcal{U}(-\mathrm{bound}, \mathrm{bound})$, where $\mathrm{bound} = \mathrm{gain} \times \sqrt{\frac{3}{\mathrm{fan\_mode}}}$ |
| kaiming_normal_(tensor, a=0, mode='fan_in', nonlinearity='leaky_relu') | $\mathcal{N}(0, \mathrm{std}^2)$, where $\mathrm{std} = \frac{\mathrm{gain}}{\sqrt{\mathrm{fan\_mode}}}$ |
| trunc_normal_(tensor, mean=0.0, std=1.0, a=-2.0, b=2.0) | Values drawn from $\mathcal{N}(\mathrm{mean}, \mathrm{std}^2)$, with values outside $[a, b]$ redrawn until they fall within the bounds. |
| orthogonal_(tensor, gain=1) | Fills the input Tensor with a (semi) orthogonal matrix. |
| sparse_(tensor, sparsity, std=0.01) | Fills the tensor as a sparse matrix with the given sparsity; the non-zero elements follow $\mathcal{N}(0, \mathrm{std}^2)$. |
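A short sketch of how these fit together (the tensor size is arbitrary): `calculate_gain` feeds the `xavier_*` and `orthogonal_` initializers, while the `kaiming_*` initializers take the nonlinearity name directly.

```python
import torch
import torch.nn as nn

w = torch.empty(256, 128)

# calculate_gain() returns the recommended gain for a nonlinearity,
# e.g. sqrt(2) for ReLU; it is typically passed to xavier/orthogonal initializers.
gain = nn.init.calculate_gain("relu")
nn.init.xavier_uniform_(w, gain=gain)

# kaiming_* take the nonlinearity directly instead of a gain.
nn.init.kaiming_normal_(w, mode="fan_in", nonlinearity="relu")
```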
An Optimizer specifies how gradient updates are performed. Most optimizers also carry state, so remember to save it along with the model when writing checkpoints (a usage sketch follows the table below).

- `params`: the `torch.Tensor`s or `dict`s that this optimizer updates.
- `defaults` (dict): a dict containing default values of optimization options (used when a parameter group doesn't specify them).

| Method | Description |
|---|---|
| Optimizer.add_param_group | Add a param group to the Optimizer's param_groups. |
| Optimizer.load_state_dict | Loads the optimizer state. |
| Optimizer.state_dict | Returns the state of the optimizer as a dict. |
| Optimizer.step | Performs a single optimization step (parameter update). |
| Optimizer.zero_grad | Clears the gradients of all optimized torch.Tensors. |
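A sketch of typical optimizer usage with per-parameter-group options (the model, learning rates, and data here are arbitrary):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 2))

# Per-parameter options: each dict is a param group with its own settings;
# anything not given falls back to the defaults (lr=1e-3, momentum=0.9 here).
optimizer = torch.optim.SGD(
    [
        {"params": model[0].parameters()},
        {"params": model[2].parameters(), "lr": 1e-4},
    ],
    lr=1e-3,
    momentum=0.9,
)

x, y = torch.randn(8, 10), torch.randn(8, 2)
loss = nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()   # clear old gradients
loss.backward()         # accumulate new gradients
optimizer.step()        # apply the update

# Optimizers are stateful (momentum buffers, step counts, ...):
# include optimizer.state_dict() in checkpoints alongside the model's.
```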
| Algorithm | Description |
|---|---|
| Adadelta | Implements the Adadelta algorithm. |
| Adagrad | Implements the Adagrad algorithm. |
| Adam | Implements the Adam algorithm. |
| AdamW | Implements the AdamW algorithm. |
| SparseAdam | Implements a lazy version of Adam suitable for sparse tensors. |
| Adamax | Implements the Adamax algorithm (a variant of Adam based on the infinity norm). |
| ASGD | Implements Averaged Stochastic Gradient Descent. |
| LBFGS | Implements the L-BFGS algorithm, heavily inspired by minFunc. |
| NAdam | Implements the NAdam algorithm. |
| RAdam | Implements the RAdam algorithm. |
| RMSprop | Implements the RMSprop algorithm. |
| Rprop | Implements the resilient backpropagation algorithm. |
| SGD | Implements stochastic gradient descent (optionally with momentum). |
The key optimizers are explained below.[^1]
```python
# Vanilla Gradient Descent
while True:
    weights_grad = evaluate_gradient(loss_fun, data, weights)
    weights += - step_size * weights_grad  # perform parameter update
```
This simple loop is at the core of all Neural Network libraries. There are other ways of performing the optimization (e.g. LBFGS), but Gradient Descent is currently by far the most common and established way of optimizing Neural Network loss functions.
In practice the whole dataset cannot be loaded at once to compute the gradient, so Minibatch Gradient Descent is used:
```python
# Vanilla Minibatch Gradient Descent
while True:
    data_batch = sample_training_data(data, 256)  # sample 256 examples
    weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
    weights += - step_size * weights_grad  # perform parameter update
```
SGD is commonly understood as gradient descent performed on randomly sampled mini-batches.
In what follows, `x` denotes the current parameters and `dx` the gradient of `x` computed on the current mini-batch.

SGD has a single hyperparameter, `learning_rate`, a fixed constant:
```python
# Vanilla update
x += - learning_rate * dx
```
The Momentum update is based on the physical notion of momentum: the accumulated `dx` is reflected in a velocity `v`, and `x` is updated through `v`. `v` is internal state, initialized to $0$; it introduces one new hyperparameter, `mu`:
```python
# Momentum update
v = mu * v - learning_rate * dx  # integrate velocity
x += v                           # integrate position
```
One way to read it: "since this direction has been good, continuing along it is probably even better." A benefit is that it can push the update out of local optima.
Nesterov momentum has stronger theoretical backing than plain Momentum and tends to work better in practice. The core of the algorithm: compute the gradient `dx_ahead` at the look-ahead position `x + mu*v`, use `dx_ahead` to update `v`, then use `v` to update `x`.
```python
x_ahead = x + mu * v
# evaluate dx_ahead (the gradient at x_ahead instead of at x)
v = mu * v - learning_rate * dx_ahead
x += v
```
An equivalent implementation drops `x_ahead` and introduces `v_prev` instead:
```python
v_prev = v  # back this up
v = mu * v - learning_rate * dx  # velocity update stays the same
x += -mu * v_prev + (1 + mu) * v  # position update changes form
```
The three methods above use a single global learning rate, whereas Adagrad adapts the learning rate per parameter. It still has only one hyperparameter, `learning_rate`:
```python
# Assume the gradient dx and parameter vector x
cache += dx**2
x += - learning_rate * dx / (np.sqrt(cache) + eps)
```
`cache` accumulates the element-wise squared gradients and is used to normalize the learning rate. The big problem: the base learning rate is fixed while `cache` never decreases, so the effective step size only shrinks; training can stall too early and the model may underfit.
Rprop uses only the sign of the gradient; RMSprop is "a mini-batch version of rprop":
```python
cache = decay_rate * cache + (1 - decay_rate) * dx**2
x += - learning_rate * dx / (np.sqrt(cache) + eps)
```
With this moving-average update, `cache` is no longer monotonically non-decreasing, so the effective step size can recover.
Adam keeps a momentum-like first moment `m` alongside the RMSprop-style second moment `v`:

```python
m = beta1*m + (1-beta1)*dx
v = beta2*v + (1-beta2)*(dx**2)
x += - learning_rate * m / (np.sqrt(v) + eps)
```
Looking only at the last line, it is the same as Adagrad/RMSprop; what differs is `m` and `v`. So Adam can be seen as RMSprop with momentum.
The full Adam algorithm also includes a bias-correction (warm-up) phase:
```python
# t is your iteration counter going from 1 to infinity
m = beta1*m + (1-beta1)*dx
mt = m / (1-beta1**t)
v = beta2*v + (1-beta2)*(dx**2)
vt = v / (1-beta2**t)
x += - learning_rate * mt / (np.sqrt(vt) + eps)
```
Adam paper: https://arxiv.org/pdf/1412.6980.pdf
`torch.optim.lr_scheduler` provides several methods to adjust the learning rate based on the number of epochs.
```python
model = [Parameter(torch.randn(2, 2, requires_grad=True))]
optimizer = SGD(model, 0.1)
scheduler = ExponentialLR(optimizer, gamma=0.9)  # pass the optimizer into the lr_scheduler

for epoch in range(20):
    for input, target in dataset:
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)
        loss.backward()
        optimizer.step()
    scheduler.step()
```
Multiple schedulers can also be used at the same time:
```python
model = [Parameter(torch.randn(2, 2, requires_grad=True))]
optimizer = SGD(model, 0.1)
scheduler1 = ExponentialLR(optimizer, gamma=0.9)
scheduler2 = MultiStepLR(optimizer, milestones=[30, 80], gamma=0.1)

for epoch in range(20):
    for input, target in dataset:
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)
        loss.backward()
        optimizer.step()
    scheduler1.step()
    scheduler2.step()
```
Remember: always call `scheduler.step()` after `optimizer.step()`.
| Scheduler | Description |
|---|---|
| lr_scheduler.LambdaLR | Sets the learning rate of each parameter group to the initial lr times a given function. |
| lr_scheduler.MultiplicativeLR | Multiplies the learning rate of each parameter group by the factor given in the specified function. |
| lr_scheduler.StepLR | Decays the learning rate of each parameter group by gamma every step_size epochs. |
| lr_scheduler.MultiStepLR | Decays the learning rate of each parameter group by gamma once the number of epochs reaches one of the milestones. |
| lr_scheduler.ConstantLR | Decays the learning rate of each parameter group by a small constant factor until the number of epochs reaches a pre-defined milestone: total_iters. |
| lr_scheduler.LinearLR | Decays the learning rate of each parameter group by linearly changing a small multiplicative factor until the number of epochs reaches a pre-defined milestone: total_iters. |
| lr_scheduler.ExponentialLR | Decays the learning rate of each parameter group by gamma every epoch. |
| lr_scheduler.CosineAnnealingLR | Sets the learning rate of each parameter group using a cosine annealing schedule, where $\eta_{max}$ is set to the initial lr and $T_{cur}$ is the number of epochs since the last restart in SGDR. |
| lr_scheduler.ChainedScheduler | Chains a list of learning rate schedulers. |
| lr_scheduler.SequentialLR | Receives a list of schedulers expected to be called sequentially during optimization, plus milestone points that give the exact intervals during which each scheduler is supposed to be active. |
| lr_scheduler.ReduceLROnPlateau | Reduces the learning rate when a metric has stopped improving (see the sketch after this table). |
| lr_scheduler.CyclicLR | Sets the learning rate of each parameter group according to the cyclical learning rate policy (CLR). |
| lr_scheduler.OneCycleLR | Sets the learning rate of each parameter group according to the 1cycle learning rate policy. |
| lr_scheduler.CosineAnnealingWarmRestarts | Sets the learning rate of each parameter group using a cosine annealing schedule, where $\eta_{max}$ is set to the initial lr, $T_{cur}$ is the number of epochs since the last restart, and $T_{i}$ is the number of epochs between two warm restarts in SGDR. |
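As noted in the table above, `ReduceLROnPlateau` is the one scheduler whose `step()` takes the monitored metric. A minimal sketch (with a constant stand-in for the validation loss, and arbitrary factor/patience values):

```python
import torch

param = torch.nn.Parameter(torch.randn(2, 2))
optimizer = torch.optim.SGD([param], lr=0.1)

# step() receives the metric, unlike the other schedulers.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=5
)

for epoch in range(30):
    val_loss = 1.0               # stand-in for a real validation loss
    scheduler.step(val_loss)     # lr is cut by `factor` after `patience` flat epochs

print(optimizer.param_groups[0]["lr"])  # reduced, since the metric never improved
```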
Stochastic Weight Averaging (SWA): see the paper https://arxiv.org/abs/1803.05407
```python
loader, optimizer, model, loss_fn = ...
swa_model = torch.optim.swa_utils.AveragedModel(model)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)
swa_start = 160
swa_scheduler = torch.optim.swa_utils.SWALR(optimizer, swa_lr=0.05)

for epoch in range(300):
    for input, target in loader:
        optimizer.zero_grad()
        loss_fn(model(input), target).backward()
        optimizer.step()
    if epoch > swa_start:
        swa_model.update_parameters(model)
        swa_scheduler.step()
    else:
        scheduler.step()

# Update bn statistics for the swa_model at the end
torch.optim.swa_utils.update_bn(loader, swa_model)
# Use swa_model to make predictions on test data
preds = swa_model(test_input)
```
[^1]: https://cs231n.github.io/neural-networks-3/