别致的SmallSix

Transformer [Attention is All You Need]

(I) Paper notes

Abstract

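(1) The best performing models also connect the encoder and decoder through an attention mechanism. That is, the strongest models link the encoder and the decoder with attention.

(2) the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. The Transformer relies on attention alone and drops recurrence and convolution completely.

(3) while being more: "at the same time being more ..." (in the abstract: more parallelizable).

(4) improving over: "surpassing", "doing better than".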

1 Introduction

(1) have since continued to push: "(the subject) has kept pushing ... ever since".

(2) along: "following", "in the direction of".

(3) This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Because a recurrent model must process positions one after another, it cannot parallelize within a single training example. This hurts most on long sequences, where memory limits how many examples fit into a batch.

(4) in case of the latter: "in the latter case".

(5) Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network. Attention lets a model relate two positions regardless of how far apart they are, but before this paper it was almost always used together with a recurrent network.

(6) Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs. The Transformer avoids recurrence and uses attention alone to capture global dependencies between input and output. It parallelizes far better and sets a new state of the art in translation quality after only twelve hours of training on eight P100 GPUs.

2 Background

(1) In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. In these convolutional models the cost of relating two positions grows with their distance: linearly for ConvS2S and logarithmically for ByteNet.

(2) Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention relates the positions of one sequence to each other to build a representation of that same sequence.

(3) To the best of our knowledge: "as far as we know".

3 Model Architecture

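(1) The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively. The Transformer keeps the overall encoder-decoder architecture and builds both the encoder and the decoder from stacked self-attention layers and point-wise fully connected layers (left and right halves of Figure 1, respectively).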

3.1 Encoder and Decoder Stacks

Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, positionwise fully connected feed-forward network. We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension dmodel = 512.

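Encoder, in short: a stack of N = 6 identical layers. Each layer has two sub-layers, multi-head self-attention followed by a position-wise fully connected feed-forward network. Every sub-layer is wrapped in a residual connection and then layer normalization, so its output is LayerNorm(x + Sublayer(x)). To make the residual additions possible, all sub-layers and the embedding layers produce outputs of dimension d_model = 512.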
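The wrapper LayerNorm(x + Sublayer(x)) can be sketched in a few lines. This is a minimal illustration that uses only the standard library and a single vector; `toy_sublayer` is a hypothetical stand-in for attention or the feed-forward network, and the layer norm here omits the learned gain and bias.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer_connection(x, sublayer):
    """Post-norm residual wrapper: LayerNorm(x + Sublayer(x))."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

# Hypothetical stand-in for attention or the feed-forward network.
def toy_sublayer(x):
    return [0.5 * v for v in x]

out = sublayer_connection([1.0, 2.0, 3.0, 4.0], toy_sublayer)
```

The residual addition only works because the sub-layer returns a vector of the same size as its input, which is why every sub-layer in the paper outputs d_model dimensions.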

Decoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.

Decoder, in short: also N = 6 identical layers, each with a third sub-layer that attends over the encoder output. The decoder self-attention is masked so that no position can see later positions. Together with the one-position offset of the output embeddings, this guarantees that the prediction for position i depends only on outputs at positions before i.

3.2 Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. In other words, the query is compared against every key, and the resulting scores decide how much of each value goes into the output.

3.2.1 Scaled Dot-Product Attention

We call our particular attention "Scaled Dot-Product Attention" (Figure 2).

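While for small values of d_k the two mechanisms perform similarly, additive attention outperforms dot product attention without scaling for larger values of d_k [3]. That is, for small d_k the two behave alike, but the unscaled dot product falls behind additive attention once d_k gets large, which motivates the scaling factor.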

3.2.2 Multi-Head Attention

Instead of performing a single attention function with d_model-dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values h times with different, learned linear projections to d_k, d_k and d_v dimensions, respectively. On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding d_v-dimensional output values. These are concatenated and once again projected, resulting in the final values, as depicted in Figure 2. In short: rather than one attention over d_model-dimensional vectors, the model runs h attentions in parallel on lower-dimensional learned projections, each producing d_v-dimensional outputs, then concatenates the results and projects them once more.

3.2.3 Applications of Attention in our Model

The Transformer uses multi-head attention in three different ways:
• In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence.

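So in encoder-decoder attention the queries come from the previous decoder layer, while the memory keys and values come from the encoder output. Every decoder position can therefore attend over all positions of the input sequence.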

• The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.

So in an encoder self-attention layer the keys, values and queries all come from the same place, the output of the previous encoder layer, and every position can attend to all positions of that layer.
• Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections.

So in decoder self-attention each position may attend only to itself and to earlier positions. To keep the model auto-regressive, the scores of illegal connections (those to later positions) are set to −∞ before the softmax, which gives them zero weight.
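The masking step can be sketched as follows. This is a standard-library illustration rather than the paper's implementation; the function names are hypothetical. Setting a score to −∞ makes its softmax weight exactly zero, so position i spreads its attention only over positions 0..i.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax where disallowed entries are set to -inf first."""
    out = []
    for row, allowed in zip(scores, mask):
        masked = [s if ok else float("-inf") for s, ok in zip(row, allowed)]
        m = max(masked)  # finite, because the diagonal is always allowed
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) == 0.0
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

scores = [[0.1, 0.9, 0.3], [0.4, 0.2, 0.8], [0.5, 0.5, 0.5]]
weights = masked_softmax(scores, causal_mask(3))
```

The first row ends up with all of its weight on position 0, since that is the only position it is allowed to see.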

3.3 Position-wise Feed-Forward Networks

(1) In addition to: "besides", "apart from".

(2) This consists of two linear transformations with a ReLU activation in between. That is, FFN(x) = max(0, xW1 + b1)W2 + b2, applied to each position separately and identically.

The dimensionality of input and output is d_model = 512, and the inner-layer has dimensionality d_ff = 2048.
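A position-wise feed-forward network can be illustrated with toy sizes. The sketch below uses plain Python lists, with d_model = 2 and d_ff = 3 instead of 512 and 2048; the weights are made-up values chosen only to make the arithmetic easy to follow.

```python
def linear(x, weight, bias):
    """Compute x @ weight + bias for one vector x; weight has shape (len(x), len(bias))."""
    return [sum(xi * w_row[j] for xi, w_row in zip(x, weight)) + bias[j]
            for j in range(len(bias))]

def relu(x):
    return [max(0.0, v) for v in x]

def feed_forward(x, w1, b1, w2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to a single position."""
    return linear(relu(linear(x, w1, b1)), w2, b2)

# Made-up toy weights: d_model = 2, d_ff = 3.
w1 = [[1.0, -1.0, 0.5], [0.0, 2.0, -1.0]]  # shape (2, 3)
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # shape (3, 2)
b2 = [0.1, -0.1]

# "Position-wise": the same weights are applied to every position independently.
sequence = [[1.0, 1.0], [2.0, -1.0]]
outputs = [feed_forward(x, w1, b1, w2, b2) for x in sequence]
```

Because no information flows between positions inside this sub-layer, all positions can be processed in parallel.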

3.4 Embeddings and Softmax

we use learned embeddings to convert the input tokens and output tokens to vectors of dimension d_model. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. That is, learned embeddings map tokens to d_model-dimensional vectors, and a learned linear layer plus softmax turns the decoder output into next-token probabilities.

In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to [30]. In the embedding layers, we multiply those weights by √d_model.

3.5 Positional Encoding

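(1) in order for: "so that (something) can".

(2) in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. Since the model has no recurrence or convolution, position information has to be added explicitly.

(3) To this end: "for this purpose". The paper adds "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks.

(4) that is: "in other words".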

4 Why Self-Attention

(1) In this section we compare various aspects of self-attention layers to the recurrent and convolutional layers commonly used for mapping one variable-length sequence of symbol representations (x_1, ..., x_n) to another sequence of equal length (z_1, ..., z_n), with x_i, z_i ∈ R^d, such as a hidden layer in a typical sequence transduction encoder or decoder. Motivating our use of self-attention we consider three desiderata:

• the total computational complexity per layer;

• the amount of computation that can be parallelized, measured by the minimum number of sequential operations required;

• the path length between long-range dependencies in the network.

(2) To improve computational performance for tasks involving very long sequences, self-attention could be restricted to considering only a neighborhood of size r in the input sequence centered around the respective output position. That is, for very long sequences each output position could attend only to a window of r input positions around it.

(3) Convolutional layers are generally more expensive than recurrent layers, by a factor of k.

(4) As side benefit: "as an additional advantage".

(5) yield more interpretable models: "produce models that are easier to interpret".

5 Training

5.1 Training Data and Batching

Sentence pairs were batched together by approximate sequence length.

5.2 Hardware and Schedule

We trained our models on one machine with 8 NVIDIA P100 GPUs. For our base models using the hyperparameters described throughout the paper, each training step took about 0.4 seconds. We trained the base models for a total of 100,000 steps or 12 hours. For our big models,(described on the bottom line of table 3), step time was 1.0 seconds. The big models were trained for 300,000 steps(3.5 days).

5.3 Optimizer

This corresponds to increasing the learning rate linearly for the first warmup_steps training steps, and decreasing it thereafter proportionally to the inverse square root of the step number. We used warmup_steps = 4000. In other words, the learning rate ramps up linearly during warmup and then decays as 1/√step.
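The schedule described here corresponds to the paper's formula lrate = d_model^(-0.5) · min(step^(-0.5), step · warmup_steps^(-1.5)). A small sketch (the function name is hypothetical) makes the two phases visible:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5), for step >= 1."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

peak = transformer_lr(4000)          # the two branches meet at step == warmup_steps
halfway_up = transformer_lr(2000)    # linear warmup: half of the peak
decayed = transformer_lr(16000)      # inverse-sqrt decay: 4x the steps, half the rate
```

With the defaults above the peak is roughly 7e-4, reached exactly at step 4000.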

5.4 Regularization

We employ three types of regularization during training:
Residual Dropout: We apply dropout [33] to the output of each sub-layer, before it is added to the sub-layer input and normalized. In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of P_drop = 0.1.

Label Smoothing: During training, we employed label smoothing of value ε_ls = 0.1 [36]. This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.
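Label smoothing replaces the one-hot training target with a slightly softened distribution. The sketch below assumes the common variant that mixes the one-hot target with a uniform distribution over all classes; implementations differ in detail (for example in how padding tokens are handled), so treat the exact formula as an assumption.

```python
def smooth_labels(target, num_classes, eps=0.1):
    """Mix a one-hot target with a uniform distribution over all classes."""
    uniform = eps / num_classes
    return [(1.0 - eps) + uniform if k == target else uniform
            for k in range(num_classes)]

# With 5 classes and eps = 0.1, the true class keeps most of the mass
# and the remaining mass is spread evenly over the other classes.
dist = smooth_labels(target=2, num_classes=5)
```

The model can no longer drive the loss to zero by predicting the target with probability 1, which is why perplexity gets worse even though accuracy and BLEU improve.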

6 Results

6.1 Machine Translation

(1) The Transformer (big) model trained for English-to-French used dropout rate P_drop = 0.1, instead of 0.3.

(2) For the base models, we used a single model obtained by averaging the last 5 checkpoints, which were written at 10-minute intervals. For the big models, we averaged the last 20 checkpoints. We used beam search with a beam size of 4 and length penalty α = 0.6 [38]. These hyperparameters were chosen after experimentation on the development set. We set the maximum output length during inference to input length + 50, but terminate early when possible [38]. In short: the final model is an average of the last 5 (base) or 20 (big) checkpoints, decoding uses beam search with beam size 4 and length penalty α = 0.6, and output length is capped at input length + 50.
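Checkpoint averaging simply takes the element-wise mean of the saved parameters. The sketch below represents each checkpoint as a hypothetical dictionary from parameter name to a flat list of weights; a real checkpoint would hold tensors, but the arithmetic is the same.

```python
def average_checkpoints(checkpoints):
    """Average parameter values element-wise across checkpoints."""
    n = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        # Group the i-th weight of this parameter from every checkpoint.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(col) / n for col in columns]
    return averaged

# Hypothetical checkpoints: parameter name -> flat list of weights.
ckpts = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [1.0]},
    {"w": [5.0, 6.0], "b": [2.0]},
]
avg = average_checkpoints(ckpts)
```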

(3) We estimate the number of floating point operations used to train a model by multiplying the training time, the number of GPUs used, and an estimate of the sustained single-precision floating-point capacity of each GPU.
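As a worked example of this estimate, the sketch below plugs in the training times quoted in section 5.2 (12 hours for the base model, 3.5 days for the big model, 8 GPUs). The per-GPU throughput of 9.5 TFLOPS for the P100 is an assumed value that does not appear in these notes.

```python
P100_TFLOPS = 9.5  # assumed sustained single-precision throughput per GPU
NUM_GPUS = 8

def training_flops(seconds, num_gpus=NUM_GPUS, tflops=P100_TFLOPS):
    """Training cost estimate: time x number of GPUs x per-GPU throughput."""
    return seconds * num_gpus * tflops * 1e12

base = training_flops(12 * 3600)        # 12 hours
big = training_flops(3.5 * 24 * 3600)   # 3.5 days
```

Under these assumptions the base model comes out at about 3.3e18 floating point operations and the big model at about 2.3e19.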

6.2 Model Variations

(1) While single-head attention is 0.9 BLEU worse than the best setting, quality also drops off with too many heads.

(2) we observe that reducing the attention key size d_k hurts model quality. This suggests that determining compatibility is not easy and that a more sophisticated compatibility function than dot product may be beneficial.

6.3 English Constituency Parsing

the output is subject to strong structural constraints and is significantly longer than the input.

7 Conclusion

In the former task our best model outperforms even all previously reported ensembles.

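(II) Code walkthrough

Code used: GitHub - harvardnlp/annotated-transformer: An annotated implementation of the Transformer paper.

I. Jupyter notebook returns a 500 error

The code is run in Jupyter notebook. After clicking the ipynb file, the page showed: 500 : Internal Server Error.

The error is caused by an incompatibility between nbconvert and pandoc. Reinstalling (upgrading) nbconvert fixes it:

(1) First run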

pip uninstall nbconvert

(2) Then run

pip install nbconvert

Restart Jupyter and the notebook opens normally.

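II. Model architecture

Encoder:
N blocks, each consisting of a self-attention layer + an FFN layer.

Decoder:
N blocks, each consisting of a masked self-attention layer + a cross-attention layer + an FFN layer.

Why does the first self-attention layer need to mask everything to the right of the current position? The decoder produces its output one token at a time: at test time, when an earlier position is being predicted, the later tokens do not exist yet. Training therefore masks them as well, so that each prediction depends only on the tokens before it.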

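1. Attention explained

An attention function maps a query and a set of key-value pairs to an output, where query, key, value and output are all vectors. The output is a weighted sum of the values, and the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
Cross-attention layer: q comes from the decoder; k and v come from the encoder output.

The scores are computed with scaled dot-product attention: Attention(Q, K, V) = softmax(QK^T / √d_k) V.

Attention is abstracted here as a weighting over the tokens of value. The weights are the attention weights, and they are computed from query and key. Put differently, to produce the result for a query from value, query and key together decide which parts of value deserve attention.

Why divide by √d_k? When d_k is large the dot products become large in magnitude, and without scaling dot-product attention performs worse than additive attention. Large scores also push the softmax into regions where its gradient is very small, which hampers backpropagation, so the dot products are rescaled.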
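The effect of the scaling factor can be checked numerically. The sketch below assumes queries and keys whose components are independent with mean 0 and variance 1; under that assumption the dot product has variance d_k, and dividing by √d_k brings it back to about 1. It uses only the standard library and a fixed seed.

```python
import math
import random

random.seed(0)

def dot_variance(d_k, trials=2000, scale=False):
    """Empirical variance of q.k for q, k with i.i.d. N(0, 1) components."""
    values = []
    for _ in range(trials):
        q = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        dot = sum(a * b for a, b in zip(q, k))
        values.append(dot / math.sqrt(d_k) if scale else dot)
    mean = sum(values) / trials
    return sum((v - mean) ** 2 for v in values) / trials

raw = dot_variance(64)                 # close to d_k = 64
scaled = dot_variance(64, scale=True)  # close to 1
```

Scores with a spread of around ±8 saturate the softmax much more easily than scores with a spread of around ±1, which is the gradient problem described above.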

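2. A simple implementation

import math

import torch
import torch.nn.functional as F


def attention(query, key, value, mask=None, dropout=None):
    "Compute 'Scaled Dot Product Attention'"
    d_k = query.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Positions where mask == 0 get a very large negative score
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = F.softmax(scores, dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    # Weighted sum of the values, plus the attention weights themselves
    return torch.matmul(p_attn, value), p_attn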

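This code implements the function attention, which computes "Scaled Dot-Product Attention".

The function takes the following parameters:

query: the query tensor.
key: the key tensor.
value: the value tensor.
mask: a mask tensor that marks the positions that must not receive attention.
dropout: a dropout layer applied to the attention weights.

Inside the function:

First, the size of the last dimension of the query tensor is stored in d_k.
Next, torch.matmul multiplies the query tensor by the transposed key tensor, and the result is divided by math.sqrt(d_k). This gives the attention score tensor scores.
If a mask is given, masked_fill replaces the scores at positions where the mask is 0 with the very large negative value -1e9, so that these positions get almost no attention.
F.softmax is then applied to scores along the last dimension, giving the attention weight tensor p_attn.
If a dropout layer is given, p_attn is passed through it.
Finally, p_attn is multiplied by the value tensor, and the function returns both this result and p_attn.

In summary, the function computes attention weights from the queries and keys and uses them to take a weighted sum of the values, with optional masking and dropout on the weights.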

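Self-attention layer in the encoder:
K, Q and V are obtained by applying three projections to the layer input (for the first layer, input embedding + positional encoding).

In the decoder:

First layer, self-attention:
K, Q and V are obtained by applying three projections to the decoder input (for the first layer, the shifted output embedding + positional encoding).
Second layer, cross-attention:
Q is a projection of the output of the preceding decoder sub-layer; K and V are two different projections of the encoder output.

Multi-head attention runs several of the attention computations above in parallel and concatenates the results: MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V).

3. Positional encoding:
Because of the residual connections, the positional information is carried forward to the later layers.
The paper uses sin/cos positional encodings: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)).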
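The sinusoidal encoding from the paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), can be sketched with the standard library alone. The function name and the small sizes are chosen only for illustration.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table: sin on even indices, cos on odd ones."""
    table = []
    for pos in range(max_len):
        row = []
        # i runs over the even dimension indices, so i plays the role of 2i in the formula.
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))
            row.append(math.cos(angle))
        table.append(row[:d_model])
    return table

pe = positional_encoding(max_len=50, d_model=8)
```

Row pos of the table is added to the embedding of the token at position pos before the first encoder or decoder layer.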

III. Source code

There are five related classes:

Transformer
TransformerEncoder
TransformerDecoder
TransformerEncoderLayer
TransformerDecoderLayer

1. torch.nn.Transformer

import copy
from typing import Optional, Any, Union, Callable

import torch
from torch import Tensor
from .. import functional as F
from .module import Module
from .activation import MultiheadAttention
from .container import ModuleList
from ..init import xavier_uniform_
from .dropout import Dropout
from .linear import Linear
from .normalization import LayerNorm

class Transformer(Module):
    r"""A transformer model. User is able to modify the attributes as needed. The architecture
    is based on the paper "Attention Is All You Need". Ashish Vaswani, Noam Shazeer,
    Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and
    Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information
    Processing Systems, pages 6000-6010. Users can build the BERT(https://arxiv.org/abs/1810.04805)
    model with corresponding parameters.

    Args:
        d_model: the number of expected features in the encoder/decoder inputs (default=512).
        nhead: the number of heads in the multiheadattention models (default=8).
        num_encoder_layers: the number of sub-encoder-layers in the encoder (default=6).
        num_decoder_layers: the number of sub-decoder-layers in the decoder (default=6).
        dim_feedforward: the dimension of the feedforward network model (default=2048).
        dropout: the dropout value (default=0.1).
        activation: the activation function of encoder/decoder intermediate layer, can be a string
            ("relu" or "gelu") or a unary callable. Default: relu
        custom_encoder: custom encoder (default=None).
        custom_decoder: custom decoder (default=None).
        layer_norm_eps: the eps value in layer normalization components (default=1e-5).
        batch_first: If ``True``, then the input and output tensors are provided
            as (batch, seq, feature). Default: ``False`` (seq, batch, feature).
        norm_first: if ``True``, encoder and decoder layers will perform LayerNorms before
            other attention and feedforward operations, otherwise after. Default: ``False`` (after).

    Examples::
        >>> transformer_model = nn.Transformer(nhead=16, num_encoder_layers=12)
        >>> src = torch.rand((10, 32, 512))
        >>> tgt = torch.rand((20, 32, 512))
        >>> out = transformer_model(src, tgt)

    Note: A full example to apply nn.Transformer module for the word language model is available in
    https://github.com/pytorch/examples/tree/master/word_language_model
    """
    def __init__(self, d_model: int = 512, nhead: int = 8, num_encoder_layers: int = 6,
                 num_decoder_layers: int = 6, dim_feedforward: int = 2048, dropout: float = 0.1,
                 activation: Union[str, Callable[[Tensor], Tensor]] = F.relu,
                 custom_encoder: Optional[Any] = None, custom_decoder: Optional[Any] = None,
                 layer_norm_eps: float = 1e-5, batch_first: bool = False, norm_first: bool = False,
                 device=None, dtype=None) -> None:
        pass

    def forward(self, src: Tensor, tgt: Tensor, src_mask: Optional[Tensor] = None, tgt_mask: Optional[Tensor] = None,
                memory_mask: Optional[Tensor] = None, src_key_padding_mask: Optional[Tensor] = None,
                tgt_key_padding_mask: Optional[Tensor] = None, memory_key_padding_mask: Optional[Tensor] = None) -> Tensor:
        pass

    @staticmethod
    def generate_square_subsequent_mask(sz: int) -> Tensor:
        pass

    def _reset_parameters(self):
        pass

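The code above is the skeleton of PyTorch's Transformer class. The Transformer is a sequence model based on self-attention and is widely used in natural language processing and other sequence tasks.

The class has the following main parts:

__init__: initializes the model's parameters and components. The arguments are the feature dimension d_model, the number of heads nhead, the numbers of encoder and decoder layers num_encoder_layers and num_decoder_layers, the hidden dimension of the feed-forward network dim_feedforward, the dropout rate dropout, the activation function activation, optional custom encoder and decoder, the layer-normalization epsilon layer_norm_eps, the batch_first flag, and the device and dtype.
forward: the forward pass. It takes the source sequence src, the target sequence tgt and several optional masks and padding masks, runs the encoder on src and the decoder on tgt, and returns the decoder output.
generate_square_subsequent_mask: generates a square mask for the decoder's self-attention. Entries above the diagonal are -inf (masked) and all other entries are 0, so the decoder cannot rely on later positions when generating a sequence.
_reset_parameters: initializes the learnable parameters of the model.

Taking the code apart piece by piece: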

import copy
from typing import Optional, Any, Union, Callable

import torch
from torch import Tensor
from .. import functional as F
from .module import Module
from .activation import MultiheadAttention
from .container import ModuleList
from ..init import xavier_uniform_
from .dropout import Dropout
from .linear import Linear
from .normalization import LayerNorm

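These are the imports at the top of the PyTorch source file. What each name is for:

copy: the copy module from the Python standard library, used to copy objects.
Optional, Any, Union, Callable: names from the typing module, used for type hints on function and method parameters.
torch: the main PyTorch package, used to create and operate on tensors and for other deep learning functionality.
Tensor: PyTorch's tensor type.
functional: PyTorch's functional module, which provides function-style interfaces.
Module: the PyTorch base class for defining modules.
MultiheadAttention: PyTorch's multi-head attention module.
ModuleList: a PyTorch container that stores and manages several sub-modules.
xavier_uniform_: PyTorch's Xavier uniform initializer, used to initialize the model parameters.
Dropout: PyTorch's dropout module, which randomly zeroes elements during training.
Linear: PyTorch's linear layer, which applies a linear transformation.
LayerNorm: PyTorch's layer normalization module.

None of these are custom to this file; they are standard PyTorch building blocks (imported with relative paths because the file lives inside torch.nn), and together they make up the Transformer implementation.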

class Transformer(Module):
    r"""A transformer model. User is able to modify the attributes as needed. The architecture
    is based on the paper "Attention Is All You Need". Ashish Vaswani, Noam Shazeer,
    Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and
    Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information
    Processing Systems, pages 6000-6010. Users can build the BERT(https://arxiv.org/abs/1810.04805)
    model with corresponding parameters.

    Args:
        d_model: the number of expected features in the encoder/decoder inputs (default=512).
        nhead: the number of heads in the multiheadattention models (default=8).
        num_encoder_layers: the number of sub-encoder-layers in the encoder (default=6).
        num_decoder_layers: the number of sub-decoder-layers in the decoder (default=6).
        dim_feedforward: the dimension of the feedforward network model (default=2048).
        dropout: the dropout value (default=0.1).
        activation: the activation function of encoder/decoder intermediate layer, can be a string
            ("relu" or "gelu") or a unary callable. Default: relu
        custom_encoder: custom encoder (default=None).
        custom_decoder: custom decoder (default=None).
        layer_norm_eps: the eps value in layer normalization components (default=1e-5).
        batch_first: If ``True``, then the input and output tensors are provided
            as (batch, seq, feature). Default: ``False`` (seq, batch, feature).
        norm_first: if ``True``, encoder and decoder layers will perform LayerNorms before
            other attention and feedforward operations, otherwise after. Default: ``False`` (after).

    Examples::
        >>> transformer_model = nn.Transformer(nhead=16, num_encoder_layers=12)
        >>> src = torch.rand((10, 32, 512))
        >>> tgt = torch.rand((20, 32, 512))
        >>> out = transformer_model(src, tgt)

    Note: A full example to apply nn.Transformer module for the word language model is available in
    https://github.com/pytorch/examples/tree/master/word_language_model
    """

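The Transformer is an attention-based neural network for sequence-to-sequence tasks such as machine translation and text generation. Its design follows the paper "Attention Is All You Need".

The class takes the following parameters:

d_model: the number of expected features in the encoder/decoder inputs, default 512.
nhead: the number of heads in the multi-head attention modules, default 8.
num_encoder_layers: the number of layers in the encoder, default 6.
num_decoder_layers: the number of layers in the decoder, default 6.
dim_feedforward: the dimension of the feed-forward network, default 2048.
dropout: the dropout probability, default 0.1.
activation: the activation function of the encoder/decoder intermediate layer; either a string ("relu" or "gelu") or a unary callable, default relu.
custom_encoder: a custom encoder, default None.
custom_decoder: a custom decoder, default None.
layer_norm_eps: the ε value in the layer normalization components, default 1e-5.
batch_first: if True, input and output tensors are laid out as (batch, seq, feature). Default False, i.e. (seq, batch, feature).
norm_first: if True, encoder and decoder layers apply LayerNorm before the attention and feed-forward operations, otherwise after. Default False (after).

Example of using the Transformer model: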

import torch
import torch.nn as nn

transformer_model = nn.Transformer(nhead=16, num_encoder_layers=12)
src = torch.rand((10, 32, 512))
tgt = torch.rand((20, 32, 512))
out = transformer_model(src, tgt)

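The Transformer is widely used in natural language processing, especially for machine translation and text generation. Self-attention together with positional encoding lets it capture global information about the input sequence while computing in parallel.

(1) init

Signature and parameters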

    def __init__(self, d_model: int = 512, nhead: int = 8, num_encoder_layers: int = 6,
                 num_decoder_layers: int = 6, dim_feedforward: int = 2048, dropout: float = 0.1,
                 activation: Union[str, Callable[[Tensor], Tensor]] = F.relu,
                 custom_encoder: Optional[Any] = None, custom_decoder: Optional[Any] = None,
                 layer_norm_eps: float = 1e-5, batch_first: bool = False, norm_first: bool = False,
                 device=None, dtype=None) -> None:
        pass

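__init__ is one of Python's special methods and initializes a new instance of a class. In the Transformer class it sets up the components of the model from the given arguments.

What each argument does:

d_model: the number of features in the encoder/decoder inputs, default 512.
nhead: the number of heads in the multi-head attention modules, default 8.
num_encoder_layers: the number of layers (blocks) in the encoder, default 6.
num_decoder_layers: the number of layers (blocks) in the decoder, default 6.
dim_feedforward: the dimension of the feed-forward network, default 2048.
dropout: the dropout probability, default 0.1.
activation: the activation function of the encoder/decoder intermediate layer; either a string ("relu" or "gelu") or a unary callable, default relu.
custom_encoder: a custom encoder, default None.
custom_decoder: a custom decoder, default None.
layer_norm_eps: the ε value in the layer normalization components, default 1e-5.
batch_first: if True, input and output tensors are laid out as (batch, seq, feature). Default False, i.e. (seq, batch, feature).
norm_first: if True, encoder and decoder layers apply LayerNorm before the attention and feed-forward operations, otherwise after. Default False (after).
device: the device the parameters are created on; None means the default device.
dtype: the data type of the parameters; None means the default data type.

__init__ receives these arguments, builds the encoder and decoder from them, and stores a few of them as instance attributes.

Source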

def __init__(self, d_model: int = 512, nhead: int = 8, num_encoder_layers: int = 6,
                 num_decoder_layers: int = 6, dim_feedforward: int = 2048, dropout: float = 0.1,
                 activation: Union[str, Callable[[Tensor], Tensor]] = F.relu,
                 custom_encoder: Optional[Any] = None, custom_decoder: Optional[Any] = None,
                 layer_norm_eps: float = 1e-5, batch_first: bool = False, norm_first: bool = False,
                 device=None, dtype=None) -> None:
        factory_kwargs = {'device': device, 'dtype': dtype}
        super(Transformer, self).__init__()

        if custom_encoder is not None:  # use the caller-supplied encoder if one is given
            self.encoder = custom_encoder
        else:
            encoder_layer = TransformerEncoderLayer(d_model, nhead, dim_feedforward, dropout,
                                                    activation, layer_norm_eps, batch_first, norm_first,
                                                    **factory_kwargs)
            encoder_norm = LayerNorm(d_model, eps=layer_norm_eps, **factory_kwargs)
            self.encoder = TransformerEncoder(encoder_layer, num_encoder_layers, encoder_norm)

        if custom_decoder is not None:
            self.decoder = custom_decoder
        else:
            decoder_layer = TransformerDecoderLayer(d_model, nhead, dim_feedforward, dropout,
                                                    activation, layer_norm_eps, batch_first, norm_first,
                                                    **factory_kwargs)
            decoder_norm = LayerNorm(d_model, eps=layer_norm_eps, **factory_kwargs)
            self.decoder = TransformerDecoder(decoder_layer, num_decoder_layers, decoder_norm)

        self._reset_parameters()

        self.d_model = d_model
        self.nhead = nhead

        self.batch_first = batch_first

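In __init__, a default TransformerEncoder and TransformerDecoder are created unless a custom encoder or decoder was supplied. The default instances are built from the given number of layers, model dimension, number of attention heads, feed-forward dimension, activation function and so on.

Next, _reset_parameters is called to initialize the model parameters.

Finally, d_model, nhead and batch_first are stored as instance attributes.

Running __init__ therefore creates a fully initialized Transformer instance.

(2) forward

Signature and parameters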

    def forward(self, src: Tensor, tgt: Tensor, src_mask: Optional[Tensor] = None, tgt_mask: Optional[Tensor] = None,
                memory_mask: Optional[Tensor] = None, src_key_padding_mask: Optional[Tensor] = None,
                tgt_key_padding_mask: Optional[Tensor] = None, memory_key_padding_mask: Optional[Tensor] = None) -> Tensor:
        pass

forward is the other key method of the Transformer class and defines the forward pass: the input tensors go through the layers of the encoder and the decoder, and the method returns the output tensor.

What each argument does (S is the source sequence length, T the target sequence length, N the batch size, E the feature number):

src: the source sequence fed to the encoder. Shape (S, N, E) by default, or (N, S, E) if batch_first=True.
tgt: the target sequence fed to the decoder. Shape (T, N, E) by default, or (N, T, E) if batch_first=True.
src_mask: optional mask for the source sequence, shape (S, S). Restricts which source positions may attend to which.
tgt_mask: optional mask for the target sequence, shape (T, T). Typically the causal mask that hides later positions.
memory_mask: optional mask for the encoder output, shape (T, S). Restricts the decoder's attention over the encoder output.
src_key_padding_mask: optional padding mask for the source sequence, shape (N, S). Marks the padding positions in the source.
tgt_key_padding_mask: optional padding mask for the target sequence, shape (N, T). Marks the padding positions in the target.
memory_key_padding_mask: optional padding mask for the encoder output, shape (N, S). Marks the padding positions in the encoder output.

shape:
# S is the source sequence length, T is the target sequence length, N is the batch size, E is the feature number
src: (S, E) for unbatched input, (S, N, E) if batch_first=False or (N, S, E) if batch_first=True.
tgt: (T, E) for unbatched input, (T, N, E) if batch_first=False or (N, T, E) if batch_first=True.
src_mask: (S, S).
tgt_mask: (T, T).
memory_mask: (T, S).
src_key_padding_mask: (S) for unbatched input otherwise (N, S).
tgt_key_padding_mask: (T) for unbatched input otherwise (N, T).
memory_key_padding_mask: (S) for unbatched input otherwise (N, S).

Given these arguments and the instance attributes, forward performs the following steps:

The source sequence is passed through the encoder, which produces the encoded representation.
The target sequence and the encoded representation are passed through the decoder, which produces the decoded representation.
The decoded representation is returned.

Calling a Transformer instance runs forward and returns the model output.

Source

def forward(self, src: Tensor, tgt: Tensor, src_mask: Optional[Tensor] = None, tgt_mask: Optional[Tensor] = None,
                memory_mask: Optional[Tensor] = None, src_key_padding_mask: Optional[Tensor] = None,
                tgt_key_padding_mask: Optional[Tensor] = None, memory_key_padding_mask: Optional[Tensor] = None) -> Tensor:
        r"""Take in and process masked source/target sequences.

        Args:
            src: the sequence to the encoder (required).
            tgt: the sequence to the decoder (required).
            src_mask: the additive mask for the src sequence (optional).
            tgt_mask: the additive mask for the tgt sequence (optional).
            memory_mask: the additive mask for the encoder output (optional).
            src_key_padding_mask: the ByteTensor mask for src keys per batch (optional).
            tgt_key_padding_mask: the ByteTensor mask for tgt keys per batch (optional).
            memory_key_padding_mask: the ByteTensor mask for memory keys per batch (optional).

        Shape:
            - src: :math:`(S, E)` for unbatched input, :math:`(S, N, E)` if `batch_first=False` or
              `(N, S, E)` if `batch_first=True`.
            - tgt: :math:`(T, E)` for unbatched input, :math:`(T, N, E)` if `batch_first=False` or
              `(N, T, E)` if `batch_first=True`.
            - src_mask: :math:`(S, S)`.
            - tgt_mask: :math:`(T, T)`.
            - memory_mask: :math:`(T, S)`.
            - src_key_padding_mask: :math:`(S)` for unbatched input otherwise :math:`(N, S)`.
            - tgt_key_padding_mask: :math:`(T)` for unbatched input otherwise :math:`(N, T)`.
            - memory_key_padding_mask: :math:`(S)` for unbatched input otherwise :math:`(N, S)`.

            Note: [src/tgt/memory]_mask ensures that position i is allowed to attend the unmasked
            positions. If a ByteTensor is provided, the non-zero positions are not allowed to attend
            while the zero positions will be unchanged. If a BoolTensor is provided, positions with ``True``
            are not allowed to attend while ``False`` values will be unchanged. If a FloatTensor
            is provided, it will be added to the attention weight.
            [src/tgt/memory]_key_padding_mask provides specified elements in the key to be ignored by
            the attention. If a ByteTensor is provided, the non-zero positions will be ignored while the zero
            positions will be unchanged. If a BoolTensor is provided, the positions with the
            value of ``True`` will be ignored while the position with the value of ``False`` will be unchanged.

            - output: :math:`(T, E)` for unbatched input, :math:`(T, N, E)` if `batch_first=False` or
              `(N, T, E)` if `batch_first=True`.

            Note: Due to the multi-head attention architecture in the transformer model,
            the output sequence length of a transformer is same as the input sequence
            (i.e. target) length of the decode.

            where S is the source sequence length, T is the target sequence length, N is the
            batch size, E is the feature number

        Examples:
            >>> output = transformer_model(src, tgt, src_mask=src_mask, tgt_mask=tgt_mask)
        """

        is_batched = src.dim() == 3
        if not self.batch_first and src.size(1) != tgt.size(1) and is_batched:
            raise RuntimeError("the batch number of src and tgt must be equal")
        elif self.batch_first and src.size(0) != tgt.size(0) and is_batched:
            raise RuntimeError("the batch number of src and tgt must be equal")

        if src.size(-1) != self.d_model or tgt.size(-1) != self.d_model:
            raise RuntimeError("the feature number of src and tgt must be equal to d_model")

        memory = self.encoder(src, mask=src_mask, src_key_padding_mask=src_key_padding_mask)
        output = self.decoder(tgt, memory, tgt_mask=tgt_mask, memory_mask=memory_mask,
                              tgt_key_padding_mask=tgt_key_padding_mask,
                              memory_key_padding_mask=memory_key_padding_mask)
        return output

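This code is the implementation of the Transformer class's forward method. It takes the source sequence (src), the target sequence (tgt), and several optional masks, passes the sequences through the encoder and then the decoder, and returns the output sequence.

Parameters:

src: the source sequence fed to the encoder (required).
tgt: the target sequence fed to the decoder (required).
src_mask: additive mask for the source sequence (optional).
tgt_mask: additive mask for the target sequence (optional).
memory_mask: additive mask for the encoder output (optional).
src_key_padding_mask: per-batch mask for the source keys (optional). The docstring names a ByteTensor and also accepts a BoolTensor.
tgt_key_padding_mask: per-batch mask for the target keys (optional), same types as above.
memory_key_padding_mask: per-batch mask for the memory keys (optional), same types as above.

The method encodes the source sequence with the encoder and then hands the encoded result to the decoder. The steps are:

Check that the batch dimensions of src and tgt match when the input is batched (3-D): src.size(1) against tgt.size(1) if batch_first is False, src.size(0) against tgt.size(0) if batch_first is True.
Check that the feature dimension of both sequences equals the model's d_model.
Pass src to the encoder to obtain the encoded result memory. During encoding, src_mask restricts which source positions may attend to which, and src_key_padding_mask marks source keys (typically padding) to be ignored.
Pass tgt, memory, and the remaining masks to the decoder to obtain output. During decoding, tgt_mask restricts attention within the target sequence, memory_mask restricts attention from target positions to memory positions, and tgt_key_padding_mask and memory_key_padding_mask mark target keys and memory keys to be ignored.
Return output as the model's output.

Note that the output sequence has the same length as the sequence fed to the decoder (tgt); the docstring attributes this to the multi-head attention architecture.

Calling forward therefore encodes the source sequence, decodes against the target sequence, and yields the model's output sequence.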
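The call pattern described above can be sketched as follows. This is a minimal illustration, not part of the original notes: the sizes (S, T, N, E) and the padding pattern are made-up values, and the causal mask is built by hand with torch.triu rather than through any helper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes, chosen only to keep the example small.
S, T, N, E = 7, 5, 2, 16  # source length, target length, batch size, feature size

model = nn.Transformer(d_model=E, nhead=4, num_encoder_layers=2,
                       num_decoder_layers=2, dim_feedforward=32, dropout=0.0)
model.eval()

src = torch.rand(S, N, E)  # (S, N, E) because batch_first defaults to False
tgt = torch.rand(T, N, E)  # (T, N, E)

# (T, T) float mask: -inf above the diagonal, 0 elsewhere; it is added to the attention scores.
tgt_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)

# (N, S) boolean mask: True marks source keys that attention should ignore.
src_key_padding_mask = torch.zeros(N, S, dtype=torch.bool)
src_key_padding_mask[1, -2:] = True  # pretend the last two tokens of sample 1 are padding

with torch.no_grad():
    output = model(src, tgt,
                   tgt_mask=tgt_mask,
                   src_key_padding_mask=src_key_padding_mask,
                   memory_key_padding_mask=src_key_padding_mask)

print(output.shape)  # the output follows the target length: (T, N, E)
```

The same padding mask is passed twice because memory has the same length and padding layout as src, so the decoder's cross-attention should skip the same positions.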

    @staticmethod
    def generate_square_subsequent_mask(sz: int) -> Tensor:
        pass  # body omitted in this excerpt

    def _reset_parameters(self):
        pass  # body omitted in this excerpt

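generate_square_subsequent_mask is a static method of the Transformer class. It builds a square causal mask of shape (sz, sz) for the decoder, where sz is the sequence length. Entries above the main diagonal (future positions) are masked out, while the lower triangle, including the diagonal, is left open, so each target position can depend only on itself and on earlier positions.

_reset_parameters is a private method of the Transformer class that (re)initializes the model's learnable parameters. It can be called when the model is constructed, or later whenever the parameters need to be initialized again.

The bodies of both methods are omitted in the excerpt above (they are shown as pass), so these descriptions are inferred from the method names and from how the methods are used: generate_square_subsequent_mask produces the causal mask for the decoder, and _reset_parameters resets the model's parameters.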
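Since the bodies are elided, the following is only a sketch of what such helpers could look like. The function names square_subsequent_mask and reset_parameters are hypothetical stand-ins, and the choice of Xavier-uniform initialization is an assumption that should be checked against the PyTorch source for the version in use.

```python
import torch
import torch.nn as nn


def square_subsequent_mask(sz: int) -> torch.Tensor:
    """Hypothetical stand-in: -inf above the main diagonal, 0.0 on and below it."""
    return torch.triu(torch.full((sz, sz), float("-inf")), diagonal=1)


def reset_parameters(module: nn.Module) -> None:
    """Hypothetical stand-in: Xavier-uniform init for every parameter with more than one dimension."""
    for p in module.parameters():
        if p.dim() > 1:  # skip biases and other 1-D parameters
            nn.init.xavier_uniform_(p)


mask = square_subsequent_mask(4)
print(mask)

layer = nn.Linear(8, 8)
reset_parameters(layer)
```

Because the mask is a float tensor that is added to the attention scores, the -inf entries become zero weights after softmax, which is what hides future positions.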

2. torch.nn.TransformerEncoderLayer

class TransformerEncoderLayer(Module):
    r"""TransformerEncoderLayer is made up of self-attn and feedforward network.
    This standard encoder layer is based on the paper "Attention Is All You Need".
    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
    Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in
    Neural Information Processing Systems, pages 6000-6010. Users may modify or implement
    in a different way during application.

    Args:
        d_model: the number of expected features in the input (required).
        nhead: the number of heads in the multiheadattention models (required).
        dim_feedforward: the dimension of the feedforward network model (default=2048).
        dropout: the dropout value (default=0.1).
        activation: the activation function of the intermediate layer, can be a string
            ("relu" or "gelu") or a unary callable. Default: relu
        layer_norm_eps: the eps value in layer normalization components (default=1e-5).
        batch_first: If ``True``, then the input and output tensors are provided
            as (batch, seq, feature). Default: ``False`` (seq, batch, feature).
        norm_first: if ``True``, layer norm is done prior to attention and feedforward
            operations, respectivaly. Otherwise it's done after. Default: ``False`` (after).

    Examples::
        >>> encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
        >>> src = torch.rand(10, 32, 512)
        >>> out = encoder_layer(src)

    Alternatively, when ``batch_first`` is ``True``:
        >>> encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
        >>> src = torch.rand(32, 10, 512)
        >>> out = encoder_layer(src)
    """
    __constants__ = ['batch_first', 'norm_first']

    def __init__(self, d_model: int, nhead: int, dim_feedforward: int = 2048, dropout: float = 0.1,
                 activation: Union[str, Callable[[Tensor], Tensor]] = F.relu,
                 layer_norm_eps: float = 1e-5, batch_first: bool = False, norm_first: bool = False,
                 device=None, dtype=None) -> None:


    def __setstate__(self, state):


    def forward(self, src: Tensor, src_mask: Optional[Tensor] = None, src_key_padding_mask: Optional[Tensor] = None) -> Tensor:

    # self-attention block
    def _sa_block(self, x: Tensor, attn_mask: Optional[Tensor],
                  key_padding_mask: Optional[Tensor]) -> Tensor:

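This code defines the TransformerEncoderLayer class, a single layer of the Transformer encoder. The layer consists of a self-attention mechanism and a feedforward network.

Parameters:

d_model: the number of expected features in the input (required).
nhead: the number of heads in the multi-head attention (required).
dim_feedforward: the dimension of the feedforward network (default 2048).
dropout: the dropout value (default 0.1).
activation: the activation function of the intermediate layer, either a string ("relu" or "gelu") or a unary callable (default relu).
layer_norm_eps: the eps value in the layer normalization components (default 1e-5).
batch_first: if True, input and output tensors have shape (batch, seq, feature). Default is False (seq, batch, feature).
norm_first: if True, layer normalization is applied before the attention and feedforward operations; otherwise it is applied after them. Default is False (after).

The TransformerEncoderLayer class has the following method:

forward: takes the input sequence src and the optional masks src_mask and src_key_padding_mask, and returns the output sequence. It first runs the self-attention block and then the feedforward network.

It also has an internal helper:

_sa_block: runs the self-attention step. It takes the input tensor x and the optional masks attn_mask and key_padding_mask, and returns the processed tensor.

Stacking TransformerEncoderLayer instances builds the layered Transformer encoder, which encodes the input sequence.

Taking the class apart piece by piece: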

class TransformerEncoderLayer(Module):
    r"""TransformerEncoderLayer is made up of self-attn and feedforward network.
    This standard encoder layer is based on the paper "Attention Is All You Need".
    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
    Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in
    Neural Information Processing Systems, pages 6000-6010. Users may modify or implement
    in a different way during application.

    Args:
        d_model: the number of expected features in the input (required).
        nhead: the number of heads in the multiheadattention models (required).
        dim_feedforward: the dimension of the feedforward network model (default=2048).
        dropout: the dropout value (default=0.1).
        activation: the activation function of the intermediate layer, can be a string
            ("relu" or "gelu") or a unary callable. Default: relu
        layer_norm_eps: the eps value in layer normalization components (default=1e-5).
        batch_first: If ``True``, then the input and output tensors are provided
            as (batch, seq, feature). Default: ``False`` (seq, batch, feature).
        norm_first: if ``True``, layer norm is done prior to attention and feedforward
            operations, respectivaly. Otherwise it's done after. Default: ``False`` (after).

    Examples::
        >>> encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
        >>> src = torch.rand(10, 32, 512)
        >>> out = encoder_layer(src)

    Alternatively, when ``batch_first`` is ``True``:
        >>> encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
        >>> src = torch.rand(32, 10, 512)
        >>> out = encoder_layer(src)
    """

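TransformerEncoderLayer is a standard encoder layer based on the paper "Attention Is All You Need" by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. It is made up of self-attention and a feedforward network.

The module is the building block of the encoder half of the Transformer architecture and encodes the input sequence. Self-attention lets every position gather information from the other positions in the sequence, and the feedforward network then applies a nonlinear feature transformation.

The module's arguments are:

d_model: the feature dimension of the input (required).
nhead: the number of attention heads in the multi-head self-attention (required).
dim_feedforward: the dimension of the feedforward network (default 2048).
dropout: the dropout probability (default 0.1).
activation: the activation function of the intermediate layer, either a string ("relu" or "gelu") or a unary callable (default relu).
layer_norm_eps: the eps value in the layer normalization components (default 1e-5).
batch_first: if True, input and output tensors have shape (batch, seq, feature). Default is False (seq, batch, feature).
norm_first: if True, layer normalization is applied before the attention and feedforward operations; otherwise after them. Default is False (after).

The module outputs the encoded feature tensor, which has the same shape as its input.

For example, TransformerEncoderLayer can be used like this: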

encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
src = torch.rand(10, 32, 512)
out = encoder_layer(src)


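With batch_first=True, the input is laid out as (batch, seq, feature) instead:

encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
src = torch.rand(32, 10, 512)
out = encoder_layer(src)

In short, TransformerEncoderLayer provides a standard encoder layer for encoding an input sequence, combining self-attention with a feedforward network.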

    __constants__ = ['batch_first', 'norm_first']

    def __init__(self, d_model: int, nhead: int, dim_feedforward: int = 2048, dropout: float = 0.1,
                 activation: Union[str, Callable[[Tensor], Tensor]] = F.relu,
                 layer_norm_eps: float = 1e-5, batch_first: bool = False, norm_first: bool = False,
                 device=None, dtype=None) -> None:


    def __setstate__(self, state):


    def forward(self, src: Tensor, src_mask: Optional[Tensor] = None, src_key_padding_mask: Optional[Tensor] = None) -> Tensor:

    # self-attention block
    def _sa_block(self, x: Tensor, attn_mask: Optional[Tensor],
                  key_padding_mask: Optional[Tensor]) -> Tensor:

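The members and methods of the TransformerEncoderLayer class are:

__constants__: a list of attribute names that TorchScript treats as constants when the module is scripted.
__init__(...): the constructor, which initializes a TransformerEncoderLayer instance. It accepts the following arguments:
- d_model: the feature dimension of the input.
- nhead: the number of attention heads in the multi-head self-attention.
- dim_feedforward: the dimension of the feedforward network, default 2048.
- dropout: the dropout probability, default 0.1.
- activation: the activation function of the intermediate layer, either a string ("relu" or "gelu") or a unary callable, default relu.
- layer_norm_eps: the eps value in the layer normalization components, default 1e-5.
- batch_first: if True, input and output tensors have shape (batch, seq, feature); default False (seq, batch, feature).
- norm_first: if True, layer normalization is applied before the attention and feedforward operations, otherwise after them; default False (after).
- device: optional, the device on which the parameters are created.
- dtype: optional, the data type of the parameters.
__setstate__(self, state): restores the object's state, for example when a pickled module is loaded.
forward(self, src: Tensor, src_mask: Optional[Tensor] = None, src_key_padding_mask: Optional[Tensor] = None) -> Tensor: the forward pass, which encodes the input. It accepts the following arguments:
- src: the input tensor, i.e. the sequence to encode.
- src_mask: optional attention mask that controls which positions each position may attend to in self-attention.
- src_key_padding_mask: optional padding mask that marks which positions of the input sequence are padding.
_sa_block(self, x: Tensor, attn_mask: Optional[Tensor], key_padding_mask: Optional[Tensor]) -> Tensor: the self-attention block, which applies self-attention to the input tensor. It accepts the following arguments:
- x: the input tensor, i.e. the features to run self-attention over.
- attn_mask: optional attention mask that controls which positions each position may attend to.
- key_padding_mask: optional padding mask that marks which positions of the input sequence are padding.

In short, the TransformerEncoderLayer class encodes an input sequence by combining self-attention with a feedforward network.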

(1) init

Call and parameters

torch.nn.TransformerEncoderLayer(d_model, nhead,
            dim_feedforward=2048, dropout=0.1, activation='relu')

Parameters:
d_model – the number of expected features in the input (required).
nhead – the number of heads in the multiheadattention models (required).
dim_feedforward – the dimension of the feedforward network model (default=2048).
dropout – the dropout value (default=0.1).
activation – the activation function of intermediate layer, relu or gelu (default=relu).

Source code

def __init__(self, d_model: int, nhead: int, dim_feedforward: int = 2048, dropout: float = 0.1,
                 activation: Union[str, Callable[[Tensor], Tensor]] = F.relu,
                 layer_norm_eps: float = 1e-5, batch_first: bool = False, norm_first: bool = False,
                 device=None, dtype=None) -> None:
        factory_kwargs = {'device': device, 'dtype': dtype}
        super(TransformerEncoderLayer, self).__init__()
        self.self_attn = MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=batch_first,
                                            **factory_kwargs)
        # Implementation of Feedforward model
        self.linear1 = Linear(d_model, dim_feedforward, **factory_kwargs)
        self.dropout = Dropout(dropout)
        self.linear2 = Linear(dim_feedforward, d_model, **factory_kwargs)

        self.norm_first = norm_first
        self.norm1 = LayerNorm(d_model, eps=layer_norm_eps, **factory_kwargs)
        self.norm2 = LayerNorm(d_model, eps=layer_norm_eps, **factory_kwargs)
        self.dropout1 = Dropout(dropout)
        self.dropout2 = Dropout(dropout)

        # Legacy string support for activation function.
        if isinstance(activation, str):
            self.activation = _get_activation_fn(activation)
        else:
            self.activation = activation

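This code is the initializer of a Transformer encoder layer. The Transformer is a deep learning model for sequence modeling that is widely used in natural language processing. The function creates the layer's submodules and stores its settings.

The arguments are:

d_model is the input and output feature dimension of the layer.
nhead is the number of heads in the multi-head self-attention.
dim_feedforward is the hidden dimension of the feedforward network, default 2048.
dropout is the dropout probability applied in the layer, default 0.1.
activation is the activation function, given either as a string name or as a callable, default ReLU.
layer_norm_eps is the epsilon value of the LayerNorm layers, default 1e-5.
batch_first specifies whether the input is laid out as (batch, seq_len, feature), default False.
norm_first specifies whether LayerNorm is applied before the self-attention and feedforward blocks rather than after them, default False.
device specifies the device (CPU or GPU) on which the parameters are created.
dtype specifies the data type of the parameters.

Inside the function, the components of the encoder layer are built:

self.self_attn is a MultiheadAttention instance that performs self-attention over the input sequence.
self.linear1 is the first linear layer of the feedforward network, mapping d_model to dim_feedforward.
self.dropout is the dropout applied inside the feedforward network, after the activation.
self.linear2 is the second linear layer of the feedforward network, mapping dim_feedforward back to d_model.
self.norm1 and self.norm2 are LayerNorm layers used with the self-attention block and the feedforward block respectively.
self.dropout1 and self.dropout2 are dropout layers applied to the outputs of the self-attention block and the feedforward block respectively.

Finally, self.activation is set from the activation argument. If activation is a string, _get_activation_fn looks up the corresponding function; otherwise the callable that was passed in is used directly.

In short, the initializer builds the components of a Transformer encoder layer (self-attention, the feedforward network, and layer normalization) and records their settings. These components are the main building blocks of the Transformer model.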
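The components listed above can be inspected on a constructed layer. The snippet below is a small sketch with made-up sizes (d_model=64, dim_feedforward=128); it relies only on the attribute names that appear in the source shown earlier.

```python
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, dim_feedforward=128,
                                   dropout=0.1, activation="gelu")

# Feedforward network: d_model -> dim_feedforward -> d_model.
# nn.Linear stores its weight as (out_features, in_features).
ff_shapes = (tuple(layer.linear1.weight.shape), tuple(layer.linear2.weight.shape))
print(ff_shapes)

# One LayerNorm per sub-block, each normalizing over the last d_model features.
norm_shapes = (tuple(layer.norm1.normalized_shape), tuple(layer.norm2.normalized_shape))
print(norm_shapes)

# All three Dropout modules use the probability passed to the constructor.
dropout_ps = (layer.dropout.p, layer.dropout1.p, layer.dropout2.p)
print(dropout_ps)

# A string activation is resolved to a callable during construction.
activation_is_callable = callable(layer.activation)
print(activation_is_callable)
```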

(2) forward

Call and parameters

forward(src, src_mask=None, src_key_padding_mask=None)

Parameters:
src – the sequence to the encoder layer (required).
src_mask – the mask for the src sequence (optional).
src_key_padding_mask – the mask for the src keys per batch (optional).

Source code

def forward(self, src: Tensor, src_mask: Optional[Tensor] = None, src_key_padding_mask: Optional[Tensor] = None) -> Tensor:
        r"""Pass the input through the encoder layer.

        Args:
            src: the sequence to the encoder layer (required).
            src_mask: the mask for the src sequence (optional).
            src_key_padding_mask: the mask for the src keys per batch (optional).

        Shape:
            see the docs in Transformer class.
        """

        # see Fig. 1 of https://arxiv.org/pdf/2002.04745v1.pdf

        x = src
        if self.norm_first:
            x = x + self._sa_block(self.norm1(x), src_mask, src_key_padding_mask)
            x = x + self._ff_block(self.norm2(x))
        else:
            x = self.norm1(x + self._sa_block(x, src_mask, src_key_padding_mask))  # self-attention
            x = self.norm2(x + self._ff_block(x))  # FFN

        return x

    # self-attention block
    def _sa_block(self, x: Tensor,
                  attn_mask: Optional[Tensor], key_padding_mask: Optional[Tensor]) -> Tensor:
        # self.self_attn is a MultiheadAttention instance
        x = self.self_attn(x, x, x,
                           attn_mask=attn_mask,
                           key_padding_mask=key_padding_mask,
                           need_weights=False)[0]
        return self.dropout1(x)

    # feed forward block
    def _ff_block(self, x: Tensor) -> Tensor:
        x = self.linear2(self.dropout(self.activation(self.linear1(x))))
        return self.dropout2(x)

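This code is the forward pass of a Transformer encoder layer, which encodes the input sequence.

The details are as follows:

forward:

src is the input sequence (required).
src_mask is the attention mask for the input sequence (optional).
src_key_padding_mask is the mask for the keys of the input sequence, typically marking padding (optional).

The function proceeds as follows:

Initialize x to src.
If self.norm_first is True, first apply self.norm1 to x, feed the result to self._sa_block (the self-attention block), and add the block's output to x. Then apply self.norm2, feed the result to self._ff_block (the feedforward block), and add its output to x again. Normalization sits before each sub-block, inside the residual branch; this arrangement is commonly called Pre-LN.
If self.norm_first is False, first feed x to self._sa_block, add the output to x, and then apply self.norm1. Next, feed the normalized result to self._ff_block, add the output to the block's input, and apply self.norm2. Normalization sits after each residual addition; this arrangement is commonly called Post-LN and is the default.
Return the final x.

_sa_block:

x is the input tensor.
attn_mask is the attention mask, which specifies which positions must be hidden from each query position.
key_padding_mask is the key mask, which specifies which keys must be ignored.

The function proceeds as follows:

Pass x as query, key, and value to the multi-head self-attention module self.self_attn. attn_mask and key_padding_mask control which attention weights are allowed.
Apply dropout1 to the attention output and return the result.

_ff_block:

x is the input tensor.

The function proceeds as follows:

Pass x through the feedforward network: self.linear1, the activation self.activation, the inner dropout self.dropout, and then self.linear2.
Apply dropout2 to the result and return it.

Overall, this code implements the forward pass of a Transformer encoder layer, covering the self-attention and feedforward computations. Layer normalization, residual connections, and dropout are applied along the way to stabilize training and improve generalization.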
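The two orderings can be checked against the library by recomputing the forward pass by hand. The sketch below is an illustration under stated assumptions: dropout is set to 0.0 so the dropout modules can be left out, the sizes are made up, and manual_forward is a hypothetical helper that reaches into the layer's submodules by the attribute names shown in the source above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


def manual_forward(layer: nn.TransformerEncoderLayer, x: torch.Tensor) -> torch.Tensor:
    """Recompute the layer's forward pass by hand (valid only when dropout is 0.0)."""
    def sa(t):
        # self-attention: query, key, and value are all the same tensor
        return layer.self_attn(t, t, t, need_weights=False)[0]

    def ff(t):
        # feedforward: linear1 -> activation -> linear2
        return layer.linear2(layer.activation(layer.linear1(t)))

    if layer.norm_first:                 # Pre-LN: normalize inside the residual branch
        x = x + sa(layer.norm1(x))
        x = x + ff(layer.norm2(x))
    else:                                # Post-LN: normalize after the residual addition
        x = layer.norm1(x + sa(x))
        x = layer.norm2(x + ff(x))
    return x


src = torch.rand(5, 2, 16)  # (seq, batch, feature)
diffs = {}
for norm_first in (False, True):
    layer = nn.TransformerEncoderLayer(d_model=16, nhead=4, dim_feedforward=32,
                                       dropout=0.0, norm_first=norm_first)
    layer.eval()
    with torch.no_grad():
        diffs[norm_first] = (layer(src) - manual_forward(layer, src)).abs().max().item()

print(diffs)  # both differences should be at floating-point noise level
```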

3. torch.nn.TransformerEncoder

TransformerEncoder is a stack of N encoder layers

class TransformerEncoder(Module):
    r"""TransformerEncoder is a stack of N encoder layers

    Args:
        encoder_layer: an instance of the TransformerEncoderLayer() class (required).
        num_layers: the number of sub-encoder-layers in the encoder (required).
        norm: the layer normalization component (optional).

    Examples::
        >>> encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
        >>> transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
        >>> src = torch.rand(10, 32, 512)
        >>> out = transformer_encoder(src)
    """
    __constants__ = ['norm']

    def __init__(self, encoder_layer, num_layers, norm=None):


    def forward(self, src: Tensor, mask: Optional[Tensor] = None, src_key_padding_mask: Optional[Tensor] = None) -> Tensor:

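This code defines the Transformer encoder class TransformerEncoder, which is a stack of N encoder layers.

The details are as follows:

The TransformerEncoder class:

encoder_layer is an instance of the TransformerEncoderLayer class that serves as the template for each layer (required).
num_layers is the number of sub-encoder layers in the encoder (required).
norm is an optional layer normalization component.

forward:

src is the input sequence tensor (required).
mask is the attention mask for the input sequence (optional).
src_key_padding_mask is the mask for the keys of the input sequence (optional).

The function proceeds as follows:

Pass the input sequence src to the first encoder layer and take its output.
Feed that output to the next encoder layer, and continue through all encoder layers in order.
Return the output tensor produced after all encoder layers.

In short, this code defines a Transformer encoder made of a stack of encoder layers. It calls each layer's forward pass in turn, encoding the input sequence layer by layer to produce the final output representation.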

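(1) init

Call and parameters

torch.nn.TransformerEncoder(encoder_layer, num_layers, norm=None)

Parameters:
encoder_layer – an instance of the TransformerEncoderLayer class (required).
num_layers – the number of sub-encoder layers in the encoder (required).
norm – the layer normalization component (optional).

Source code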

def __init__(self, encoder_layer, num_layers, norm=None):
        super(TransformerEncoder, self).__init__()
        self.layers = _get_clones(encoder_layer, num_layers)
        self.num_layers = num_layers
        self.norm = norm

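This code is the initializer of the TransformerEncoder class. It sets up the encoder's attributes and submodules.

The details are as follows:

super(TransformerEncoder, self).__init__() calls the parent class initializer so that everything inherited from Module is set up correctly.
self.layers holds the encoder layer instances. _get_clones copies encoder_layer num_layers times and returns the copies as a list-like container of modules.
self.num_layers stores the number of encoder layers, i.e. the value of num_layers.
self.norm is an optional layer normalization component applied to the output of the last encoder layer.

In short, the initializer creates the encoder layer instances and stores them in self.layers, together with the number of layers and the optional layer normalization component. These attributes and submodules are used in the encoder's forward pass.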
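The source of _get_clones is not shown in these notes. A plausible sketch, assuming it deep-copies the template layer into an nn.ModuleList, is given below; get_clones is a hypothetical stand-in name, and the sizes are made up.

```python
import copy

import torch
import torch.nn as nn


def get_clones(module: nn.Module, n: int) -> nn.ModuleList:
    """Hypothetical stand-in for _get_clones: n independent deep copies in a ModuleList."""
    return nn.ModuleList([copy.deepcopy(module) for _ in range(n)])


torch.manual_seed(0)
encoder_layer = nn.TransformerEncoderLayer(d_model=16, nhead=4, dim_feedforward=32)
layers = get_clones(encoder_layer, 3)

w0 = layers[0].linear1.weight
w1 = layers[1].linear1.weight
same_values = torch.equal(w0, w1)                 # copies start from identical weights
shared_storage = w0.data_ptr() == w1.data_ptr()   # but they do not share memory

# Changing one copy leaves the others untouched.
with torch.no_grad():
    w0.add_(1.0)
still_equal = torch.equal(w0, w1)

print(len(layers), same_values, shared_storage, still_equal)
```

One consequence of deep-copying under this assumption is that every layer in the stack starts from the same initial weights as the template; the layers only diverge once they are re-initialized or trained.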

(2) forward

Call and parameters

forward(src, mask=None, src_key_padding_mask=None)

Parameters:
src – the sequence to the encoder (required).
mask – the mask for the src sequence (optional).
src_key_padding_mask – the mask for the src keys per batch (optional).

Source code

def forward(self, src: Tensor, mask: Optional[Tensor] = None, src_key_padding_mask: Optional[Tensor] = None) -> Tensor:
        r"""Pass the input through the encoder layers in turn.

        Args:
            src: the sequence to the encoder (required).
            mask: the mask for the src sequence (optional).
            src_key_padding_mask: the mask for the src keys per batch (optional).

        Shape:
            see the docs in Transformer class.
        """
        output = src

        for mod in self.layers:
            output = mod(output, src_mask=mask, src_key_padding_mask=src_key_padding_mask)

        if self.norm is not None:
            output = self.norm(output)

        return output

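This code is the forward pass of the TransformerEncoder class, which encodes the input sequence layer by layer.

The details are as follows:

forward:

src is the input sequence tensor (required).
mask is the attention mask for the input sequence (optional).
src_key_padding_mask is the mask for the keys of the input sequence (optional).

The function proceeds as follows:

Initialize output to the input sequence src.
Loop over every encoder layer in self.layers and pass output through that layer's forward pass. mask (passed on as src_mask) and src_key_padding_mask control the self-attention inside each layer; they do not affect the feedforward network.
If self.norm is not None, apply the layer normalization component self.norm to the final output.
Return the final output.

In short, this code implements the forward pass of TransformerEncoder. It passes the input sequence through the encoder layers one after another and returns the encoded output sequence, normalizing it at the end if a layer normalization component was supplied.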
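The effect of src_key_padding_mask can be illustrated with a small experiment: if the padded positions are masked, the outputs at the real positions should not change when the content of the padding changes. The sizes and the padding layout below are made up for the illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
S, N, E = 6, 2, 16  # hypothetical sequence length, batch size, feature size

encoder_layer = nn.TransformerEncoderLayer(d_model=E, nhead=4, dim_feedforward=32, dropout=0.0)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2, norm=nn.LayerNorm(E))
encoder.eval()

src = torch.rand(S, N, E)

# (N, S) boolean mask: True marks keys to ignore.
pad = torch.zeros(N, S, dtype=torch.bool)
pad[0, -2:] = True  # the last two positions of sample 0 are treated as padding

# Same input, except that the padded positions hold different values.
src_changed = src.clone()
src_changed[-2:, 0, :] = torch.rand(2, E)

with torch.no_grad():
    out_a = encoder(src, src_key_padding_mask=pad)
    out_b = encoder(src_changed, src_key_padding_mask=pad)

# Outputs at the real (unpadded) positions of sample 0 should match.
real_diff = (out_a[:-2, 0] - out_b[:-2, 0]).abs().max().item()
# Sample 1 has no padding and was not modified at all.
other_diff = (out_a[:, 1] - out_b[:, 1]).abs().max().item()
print(out_a.shape, real_diff, other_diff)
```

The outputs at the padded positions themselves are not protected by the mask, since those positions still act as queries; they are normally discarded downstream.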

4. torch.nn.TransformerDecoderLayer

TransformerDecoderLayer is made up of self-attn, multi-head-attn and feedforward network

class TransformerDecoderLayer(Module):
    r"""TransformerDecoderLayer is made up of self-attn, multi-head-attn and feedforward network.
    This standard decoder layer is based on the paper "Attention Is All You Need".
    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
    Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in
    Neural Information Processing Systems, pages 6000-6010. Users may modify or implement
    in a different way during application.

    Args:
        d_model: the number of expected features in the input (required).
        nhead: the number of heads in the multiheadattention models (required).
        dim_feedforward: the dimension of the feedforward network model (default=2048).
        dropout: the dropout value (default=0.1).
        activation: the activation function of the intermediate layer, can be a string
            ("relu" or "gelu") or a unary callable. Default: relu
        layer_norm_eps: the eps value in layer normalization components (default=1e-5).
        batch_first: If ``True``, then the input and output tensors are provided
            as (batch, seq, feature). Default: ``False`` (seq, batch, feature).
        norm_first: if ``True``, layer norm is done prior to self attention, multihead
            attention and feedforward operations, respectivaly. Otherwise it's done after.
            Default: ``False`` (after).

    Examples::
        >>> decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
        >>> memory = torch.rand(10, 32, 512)
        >>> tgt = torch.rand(20, 32, 512)
        >>> out = decoder_layer(tgt, memory)

    Alternatively, when ``batch_first`` is ``True``:
        >>> decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8, batch_first=True)
        >>> memory = torch.rand(32, 10, 512)
        >>> tgt = torch.rand(32, 20, 512)
        >>> out = decoder_layer(tgt, memory)
    """
    __constants__ = ['batch_first', 'norm_first']

    def __init__(self, d_model: int, nhead: int, dim_feedforward: int = 2048, dropout: float = 0.1,
                 activation: Union[str, Callable[[Tensor], Tensor]] = F.relu,
                 layer_norm_eps: float = 1e-5, batch_first: bool = False, norm_first: bool = False,
                 device=None, dtype=None) -> None:


    def __setstate__(self, state):

    def forward(self, tgt: Tensor, memory: Tensor, tgt_mask: Optional[Tensor] = None, memory_mask: Optional[Tensor] = None,
                tgt_key_padding_mask: Optional[Tensor] = None, memory_key_padding_mask: Optional[Tensor] = None) -> Tensor:



    # self-attention block
    def _sa_block(self, x: Tensor,
                  attn_mask: Optional[Tensor], key_padding_mask: Optional[Tensor]) -> Tensor:

    # multihead attention block
    def _mha_block(self, x: Tensor, mem: Tensor,
                   attn_mask: Optional[Tensor], key_padding_mask: Optional[Tensor]) -> Tensor:

    # feed forward block
    def _ff_block(self, x: Tensor) -> Tensor:

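This code defines the TransformerDecoderLayer class, a standard layer of the Transformer decoder.

The details are as follows:

The TransformerDecoderLayer class:

d_model is the number of features in the input (required).
nhead is the number of heads in the multi-head attention (required).
dim_feedforward is the dimension of the feedforward network (default 2048).
dropout is the dropout value (default 0.1).
activation is the activation function of the intermediate layer, either a string ("relu" or "gelu") or a unary callable (default relu).
layer_norm_eps is the eps value of the layer normalization components (default 1e-5).
batch_first: if True, input and output tensors have shape (batch, seq, feature). Default is False (seq, batch, feature).
norm_first: if True, layer normalization is applied before the self-attention, multi-head attention, and feedforward operations; otherwise it is applied after them. Default is False (after).

forward:

tgt is the target sequence tensor (required).
memory is the memory sequence tensor, i.e. the encoder output for the source sequence (required).
tgt_mask is the attention mask for the target sequence (optional).
memory_mask is the attention mask for the memory sequence (optional).
tgt_key_padding_mask is the mask for the keys of the target sequence (optional).
memory_key_padding_mask is the mask for the keys of the memory sequence (optional).

The methods work as follows:

_sa_block: implements self-attention, computing attention over the target sequence subject to the given attention mask and key padding mask.
_mha_block: implements multi-head cross-attention, in which the target sequence attends to the memory sequence subject to the given attention mask and key padding mask.
_ff_block: implements the feedforward network, applying the linear transformations, the activation function, and dropout.
forward: passes the target sequence (tgt) and the memory sequence (memory) through the self-attention, multi-head attention, and feedforward blocks in order to decode the target sequence.
The function returns the decoded output sequence.

In short, this code defines a standard layer of the Transformer decoder. It contains self-attention, multi-head (cross) attention, and feedforward modules, and calling its forward pass decodes the target sequence.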

(1) init

Call and parameters

torch.nn.TransformerDecoderLayer(d_model, nhead, dim_feedforward=2048, dropout=0.1, activation='relu')

Parameters:
d_model – the number of expected features in the input (required).
nhead – the number of heads in the multiheadattention models (required).
dim_feedforward – the dimension of the feedforward network model (default=2048).
dropout – the dropout value (default=0.1).
activation – the activation function of intermediate layer, relu or gelu (default=relu).

Source code

def __init__(self, d_model: int, nhead: int, dim_feedforward: int = 2048, dropout: float = 0.1,
                 activation: Union[str, Callable[[Tensor], Tensor]] = F.relu,
                 layer_norm_eps: float = 1e-5, batch_first: bool = False, norm_first: bool = False,
                 device=None, dtype=None) -> None:
        factory_kwargs = {'device': device, 'dtype': dtype}
        super(TransformerDecoderLayer, self).__init__()
        
        # The decoder has two MultiheadAttention modules: masked self-attention and cross-attention
        self.self_attn = MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=batch_first,
                                            **factory_kwargs)
        self.multihead_attn = MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=batch_first,
                                                 **factory_kwargs)
        # Implementation of Feedforward model
        self.linear1 = Linear(d_model, dim_feedforward, **factory_kwargs)
        self.dropout = Dropout(dropout)
        self.linear2 = Linear(dim_feedforward, d_model, **factory_kwargs)

        self.norm_first = norm_first
        # There are three sub-blocks, so three norm/dropout pairs are defined here
        self.norm1 = LayerNorm(d_model, eps=layer_norm_eps, **factory_kwargs)
        self.norm2 = LayerNorm(d_model, eps=layer_norm_eps, **factory_kwargs)
        self.norm3 = LayerNorm(d_model, eps=layer_norm_eps, **factory_kwargs)
        self.dropout1 = Dropout(dropout)
        self.dropout2 = Dropout(dropout)
        self.dropout3 = Dropout(dropout)

        # Legacy string support for activation function.
        if isinstance(activation, str):
            self.activation = _get_activation_fn(activation)
        else:
            self.activation = activation

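This code is the initializer of a Transformer decoder layer. Its arguments are:

d_model: the feature dimension of the layer's input and output.
nhead: the number of attention heads in the multi-head attention.
dim_feedforward: the hidden dimension of the feedforward network.
dropout: the dropout rate applied in the layer, used to reduce overfitting.
activation: the activation function, given either as a string name or as a callable.
layer_norm_eps: the epsilon value of the Layer Normalization layers, used for numerical stability.
batch_first: whether the input is laid out as (batch, sequence, feature).
norm_first: whether Layer Normalization is applied before the self-attention, cross-attention, and feedforward blocks rather than after them.
device and dtype: the device and data type used when creating the parameters.

The components are:

self.self_attn: the self-attention module, which attends over the decoder's input sequence. It becomes masked self-attention when a tgt_mask is passed to forward.
self.multihead_attn: the cross-attention module between the decoder's input sequence and the encoder's output sequence.
self.linear1 and self.linear2: the two linear layers of the feedforward network.
self.dropout: the dropout applied inside the feedforward network.
self.norm1, self.norm2, and self.norm3: three Layer Normalization layers, one for each of the three sub-blocks.
self.dropout1, self.dropout2, and self.dropout3: three dropout layers, one applied to the output of each sub-block.
self.activation: the activation function used for the nonlinearity in the feedforward network.

This code initializes the Transformer decoder layer, defining its components and settings in preparation for decoding.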

(2) forward

Call and parameters

forward(tgt, memory, tgt_mask=None, memory_mask=None,
        tgt_key_padding_mask=None, memory_key_padding_mask=None)

Parameters:
tgt – the sequence to the decoder layer (required).
memory – the sequence from the last layer of the encoder (required).
tgt_mask – the mask for the tgt sequence (optional).
memory_mask – the mask for the memory sequence (optional).
tgt_key_padding_mask – the mask for the tgt keys per batch (optional).
memory_key_padding_mask – the mask for the memory keys per batch (optional).

Source code

def forward(self, tgt: Tensor, memory: Tensor, tgt_mask: Optional[Tensor] = None, memory_mask: Optional[Tensor] = None,
                tgt_key_padding_mask: Optional[Tensor] = None, memory_key_padding_mask: Optional[Tensor] = None) -> Tensor:
        r"""Pass the inputs (and mask) through the decoder layer.

        Args:
            tgt: the sequence to the decoder layer (required).
            memory: the sequence from the last layer of the encoder (required).
            tgt_mask: the mask for the tgt sequence (optional).
            memory_mask: the mask for the memory sequence (optional).
            tgt_key_padding_mask: the mask for the tgt keys per batch (optional).
            memory_key_padding_mask: the mask for the memory keys per batch (optional).

        Shape:
            see the docs in Transformer class.
        """
        # see Fig. 1 of https://arxiv.org/pdf/2002.04745v1.pdf

        x = tgt
        if self.norm_first:
            x = x + self._sa_block(self.norm1(x), tgt_mask, tgt_key_padding_mask)
            x = x + self._mha_block(self.norm2(x), memory, memory_mask, memory_key_padding_mask)
            x = x + self._ff_block(self.norm3(x))
        else:  # this branch runs by default (norm_first=False)
            x = self.norm1(x + self._sa_block(x, tgt_mask, tgt_key_padding_mask))
            x = self.norm2(x + self._mha_block(x, memory, memory_mask, memory_key_padding_mask))
            x = self.norm3(x + self._ff_block(x))

        return x


    # self-attention block
    def _sa_block(self, x: Tensor,
                  attn_mask: Optional[Tensor], key_padding_mask: Optional[Tensor]) -> Tensor:
        x = self.self_attn(x, x, x,  # self-attention, so query, key, and value are all x
                           attn_mask=attn_mask,
                           key_padding_mask=key_padding_mask,
                           need_weights=False)[0]
        return self.dropout1(x)

    # multihead attention block
    def _mha_block(self, x: Tensor, mem: Tensor,
                   attn_mask: Optional[Tensor], key_padding_mask: Optional[Tensor]) -> Tensor:
        x = self.multihead_attn(x, mem, mem,  # cross-attention: query is x; key and value are the encoder output
                                attn_mask=attn_mask,
                                key_padding_mask=key_padding_mask,
                                need_weights=False)[0]
        return self.dropout2(x)

    # feed forward block
    def _ff_block(self, x: Tensor) -> Tensor:
        x = self.linear2(self.dropout(self.activation(self.linear1(x))))
        return self.dropout3(x)

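This code implements the forward pass of a Transformer decoder layer. In detail:

forward takes the input data and the corresponding masks and passes them through the decoder layer.
tgt and memory are the decoder's input sequence and the output sequence of the encoder's last layer.
tgt_mask and memory_mask are the attention masks for the decoder input sequence and the encoder output sequence; they indicate which positions must be hidden.
tgt_key_padding_mask and memory_key_padding_mask are the key padding masks for the decoder input sequence and the encoder output sequence; they indicate which keys (typically padding) must be ignored.
Inside the function, the input sequence tgt is first assigned to the variable x.
If self.norm_first is True, processing runs in this order:
- x is normalized by self.norm1 and passed through the self-attention block self._sa_block, and the block's output is added to x.
- The result is normalized by self.norm2 and passed through the cross-attention block self._mha_block, and the block's output is added to the result.
- The result is normalized by self.norm3 and passed through the feedforward block self._ff_block, and the block's output is added to the result.
If self.norm_first is False (the default), processing runs in this order:
- x is passed through the self-attention block self._sa_block, the block's output is added to x, and the sum is normalized by self.norm1.
- The result is passed through the cross-attention block self._mha_block, the block's output is added to the result, and the sum is normalized by self.norm2.
- The result is passed through the feedforward block self._ff_block, the block's output is added to the result, and the sum is normalized by self.norm3.
Finally, the processed result x is returned.
_sa_block implements the self-attention block, using the input x as query, key, and value.
_mha_block implements the cross-attention block, using the input x as query and mem as key and value.
_ff_block implements the feedforward block, consisting of the linear transformations, the activation function, and dropout.

By calling these blocks in turn, the code carries out the self-attention, cross-attention, and feedforward computations of the decoder layer and so completes its forward pass.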
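The role of tgt_mask in the self-attention block can be illustrated with a small experiment: with a causal mask, changing the last target position should leave the outputs at all earlier positions untouched. The sizes are made up, and the mask is built by hand with torch.triu.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
S, T, N, E = 6, 5, 2, 16  # hypothetical memory length, target length, batch size, feature size

decoder_layer = nn.TransformerDecoderLayer(d_model=E, nhead=4, dim_feedforward=32, dropout=0.0)
decoder_layer.eval()

memory = torch.rand(S, N, E)  # stands in for the encoder output
tgt = torch.rand(T, N, E)

# Causal mask: position i may attend only to positions <= i.
tgt_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)

tgt_changed = tgt.clone()
tgt_changed[-1] = torch.rand(N, E)  # alter only the last target position

with torch.no_grad():
    out_a = decoder_layer(tgt, memory, tgt_mask=tgt_mask)
    out_b = decoder_layer(tgt_changed, memory, tgt_mask=tgt_mask)

earlier_diff = (out_a[:-1] - out_b[:-1]).abs().max().item()  # expected to be ~0
last_diff = (out_a[-1] - out_b[-1]).abs().max().item()       # expected to be clearly non-zero
print(out_a.shape, earlier_diff, last_diff)
```

Cross-attention needs no causal mask here, because every target position is allowed to see the whole encoder output.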

5. torch.nn.TransformerDecoder

TransformerDecoder is a stack of N TransformerDecoderLayer instances

class TransformerDecoder(Module):
    r"""TransformerDecoder is a stack of N decoder layers

    Args:
        decoder_layer: an instance of the TransformerDecoderLayer() class (required).
        num_layers: the number of sub-decoder-layers in the decoder (required).
        norm: the layer normalization component (optional).

    Examples::
        >>> decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
        >>> transformer_decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
        >>> memory = torch.rand(10, 32, 512)
        >>> tgt = torch.rand(20, 32, 512)
        >>> out = transformer_decoder(tgt, memory)
    """
    __constants__ = ['norm']

    def __init__(self, decoder_layer, num_layers, norm=None):
        ...  # body shown in the init section below

    def forward(self, tgt: Tensor, memory: Tensor, tgt_mask: Optional[Tensor] = None,
                memory_mask: Optional[Tensor] = None, tgt_key_padding_mask: Optional[Tensor] = None,
                memory_key_padding_mask: Optional[Tensor] = None) -> Tensor:
        ...  # body shown in the forward section below

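This code defines the Transformer decoder class TransformerDecoder, which is built by stacking several decoder layers.

The constructor __init__ takes the following parameters:

decoder_layer: an instance of the TransformerDecoderLayer class, used as the template for every sub-decoder layer.
num_layers: the number of sub-decoder layers in the decoder.
norm: an optional layer normalization component.

Example usage: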

decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
transformer_decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
memory = torch.rand(10, 32, 512)
tgt = torch.rand(20, 32, 512)
out = transformer_decoder(tgt, memory)

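The forward function implements the forward pass of the decoder. It takes the following parameters:

tgt: the input sequence to the decoder.
memory: the output sequence of the last encoder layer.
tgt_mask: the attention mask for the decoder input sequence.
memory_mask: the attention mask for the encoder output sequence.
tgt_key_padding_mask: the key padding mask for the decoder input sequence.
memory_key_padding_mask: the key padding mask for the encoder output sequence.

Inside the function, a loop passes the input sequence tgt and the other arguments through each sub-decoder layer. The output of each layer becomes the input of the next one, while memory and the masks are passed unchanged to every layer. Finally, the decoder output is returned.

In short, the class defines a multi-layer decoder: stacking several decoder layers yields the full Transformer decoder.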

（1）init

Call signature and parameters

torch.nn.TransformerDecoder(decoder_layer, num_layers, norm=None)

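Parameters:
decoder_layer – an instance of the TransformerDecoderLayer() class (required).
num_layers – the number of sub-decoder layers in the decoder (required).
norm – the layer normalization component (optional).

Source code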

def __init__(self, decoder_layer, num_layers, norm=None):
    super(TransformerDecoder, self).__init__()
    self.layers = _get_clones(decoder_layer, num_layers)
    self.num_layers = num_layers
    self.norm = norm

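This code is the constructor of the Transformer decoder class TransformerDecoder.

It first calls super(TransformerDecoder, self).__init__() to run the constructor of the parent class (Module).

It then defines the following attributes:

self.layers: a ModuleList produced by _get_clones, which deep-copies decoder_layer num_layers times. This is what builds the stack of decoder layers.
self.num_layers: the number of decoder layers.
self.norm: the optional layer normalization component.

In other words, the constructor copies the decoder layer several times and stores the copies in a ModuleList. The individual layers can then be accessed by index and are applied in order during the forward pass.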
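
Why deep copies rather than repeated references? The following pure-Python sketch illustrates the point. `ToyLayer` and `get_clones` are hypothetical stand-ins (a plain list replaces ModuleList, and a list of floats replaces real parameters); only `copy.deepcopy` is the same call that _get_clones uses.

```python
import copy


class ToyLayer:
    """Hypothetical stand-in for a decoder layer with one mutable parameter list."""

    def __init__(self) -> None:
        self.weight = [0.0, 0.0]


def get_clones(module, n):
    """Mirror of _get_clones, returning a plain list instead of a ModuleList."""
    return [copy.deepcopy(module) for _ in range(n)]


template = ToyLayer()
layers = get_clones(template, 3)

# Updating one clone leaves the other clones and the template untouched.
layers[0].weight[0] = 1.0

# By contrast, repeating the same object ties all "layers" together.
shared = [template] * 3
shared[0].weight[1] = 5.0

print(layers[1].weight, shared[2].weight)
```

Each clone starts from the same values as the template but is an independent object afterwards, so every layer in the stack can learn its own parameters.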

（2）forward

Call signature and parameters

forward(tgt, memory, tgt_mask=None, memory_mask=None,
        tgt_key_padding_mask=None, memory_key_padding_mask=None)

Parameters:
tgt – the sequence to the decoder (required).
memory – the sequence from the last layer of the encoder (required).
tgt_mask – the mask for the tgt sequence (optional).
memory_mask – the mask for the memory sequence (optional).
tgt_key_padding_mask – the mask for the tgt keys per batch (optional).
memory_key_padding_mask – the mask for the memory keys per batch (optional).
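
As an illustration of what the two kinds of mask usually contain, here is a small pure-Python sketch. The helper names `causal_mask` and `key_padding_mask`, the token ids, and the padding id 0 are all assumptions made for this example; with PyTorch the nested lists would be converted to tensors before being passed in. The sketch assumes the common conventions of an additive float mask (0.0 = allowed, -inf = blocked) for tgt_mask and a boolean mask (True = ignore) for the key padding mask.

```python
from typing import List

NEG_INF = float("-inf")


def causal_mask(size: int) -> List[List[float]]:
    """Additive (size x size) mask: 0.0 where attention is allowed, -inf for future positions."""
    return [[0.0 if j <= i else NEG_INF for j in range(size)] for i in range(size)]


def key_padding_mask(token_ids: List[List[int]], pad_id: int) -> List[List[bool]]:
    """Boolean (batch x length) mask: True marks padding tokens that should be ignored."""
    return [[tok == pad_id for tok in row] for row in token_ids]


tgt_mask = causal_mask(3)
batch = [[5, 7, 0], [9, 0, 0]]  # hypothetical token ids; 0 is the assumed padding id
tgt_key_padding_mask = key_padding_mask(batch, pad_id=0)

for row in tgt_mask:
    print(row)
print(tgt_key_padding_mask)
```

The first mask is shared by the whole batch and depends only on positions; the second differs per batch element because each sequence has its own amount of padding.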

Source code

def forward(self, tgt: Tensor, memory: Tensor, tgt_mask: Optional[Tensor] = None,
                memory_mask: Optional[Tensor] = None, tgt_key_padding_mask: Optional[Tensor] = None,
                memory_key_padding_mask: Optional[Tensor] = None) -> Tensor:
        r"""Pass the inputs (and mask) through the decoder layer in turn.

        Args:
            tgt: the sequence to the decoder (required).
            memory: the sequence from the last layer of the encoder (required).
            tgt_mask: the mask for the tgt sequence (optional).
            memory_mask: the mask for the memory sequence (optional).
            tgt_key_padding_mask: the mask for the tgt keys per batch (optional).
            memory_key_padding_mask: the mask for the memory keys per batch (optional).

        Shape:
            see the docs in Transformer class.
        """
        output = tgt

        for mod in self.layers:
            output = mod(output, memory, tgt_mask=tgt_mask,
                         memory_mask=memory_mask,
                         tgt_key_padding_mask=tgt_key_padding_mask,
                         memory_key_padding_mask=memory_key_padding_mask)

        if self.norm is not None:
            output = self.norm(output)

        return output

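This code is the implementation of forward, the forward pass of the TransformerDecoder class.

The forward function takes the following parameters:

tgt: the input sequence to the decoder.
memory: the output sequence of the last encoder layer.
tgt_mask: the attention mask for the decoder input sequence.
memory_mask: the attention mask for the encoder output sequence.
tgt_key_padding_mask: the key padding mask for the decoder input sequence.
memory_key_padding_mask: the key padding mask for the encoder output sequence.

Inside the function, the decoder input sequence is first assigned to the variable output.

The function then iterates over the decoder layers stored in self.layers and passes output, memory, and the masks to each layer. The output of each decoder layer becomes the input of the next one.

If a layer normalization component self.norm was provided, output is passed through it once the loop has finished.

Finally, the function returns the result of the decoder layers and the optional layer normalization.

In short, this code feeds the input sequence through several decoder layers in turn and returns the decoder output.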
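
The control flow described above can be sketched in a few lines of pure Python. This is only an illustration under simplifying assumptions: `toy_layer` and `center` are hypothetical stand-ins for a decoder layer and for the final LayerNorm, vectors are plain lists of floats, and the masks are left out.

```python
from typing import Callable, List, Optional

Vector = List[float]


def decoder_forward(
    tgt: Vector,
    memory: Vector,
    layers: List[Callable[[Vector, Vector], Vector]],
    norm: Optional[Callable[[Vector], Vector]] = None,
) -> Vector:
    """Feed tgt through every layer in order; each layer sees the same memory."""
    output = tgt
    for layer in layers:
        output = layer(output, memory)
    if norm is not None:
        output = norm(output)
    return output


def toy_layer(x: Vector, memory: Vector) -> Vector:
    """Hypothetical layer: add the mean of memory to every element of x."""
    shift = sum(memory) / len(memory)
    return [v + shift for v in x]


def center(x: Vector) -> Vector:
    """Stand-in for the final LayerNorm: subtract the mean."""
    mean = sum(x) / len(x)
    return [v - mean for v in x]


memory = [1.0, 3.0]
no_norm = decoder_forward([0.0, 2.0], memory, [toy_layer] * 3)
with_norm = decoder_forward([0.0, 2.0], memory, [toy_layer] * 3, norm=center)
print(no_norm, with_norm)
```

Reusing the same `toy_layer` three times is harmless here because it has no state; the real decoder stores independent deep copies of the layer.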

6、Other related functions

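import copy

import torch.nn.functional as F
from torch.nn import ModuleList


def _get_clones(module, N):
    # Deep-copy the module N times so that the clones do not share parameters.
    return ModuleList([copy.deepcopy(module) for i in range(N)])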


def _get_activation_fn(activation):
    if activation == "relu":
        return F.relu
    elif activation == "gelu":
        return F.gelu

    raise RuntimeError("activation should be relu/gelu, not {}".format(activation))

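These two functions are helpers used by the classes discussed above.

_get_clones takes a module and an integer N. It deep-copies the module N times with copy.deepcopy and returns the copies in a ModuleList, producing N identical but independent module instances.

_get_activation_fn takes the name of an activation function and returns the corresponding function object. The supported names are "relu" and "gelu"; any other name raises a RuntimeError.

_get_clones supports the construction of the stacked encoder and decoder, while _get_activation_fn resolves an activation given by name (the layers call the result as self.activation in their feed-forward block).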
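
For readers who want to try the name lookup without PyTorch, here is a scalar sketch of the same pattern. `relu`, `gelu`, and `get_activation_fn` below are hypothetical re-implementations on plain floats, not the torch functions; the GELU uses the exact erf-based definition.

```python
import math


def relu(x: float) -> float:
    """Rectified linear unit on a single float."""
    return max(0.0, x)


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def get_activation_fn(activation: str):
    """Scalar mirror of _get_activation_fn: map a name to a callable or raise."""
    if activation == "relu":
        return relu
    elif activation == "gelu":
        return gelu
    raise RuntimeError("activation should be relu/gelu, not {}".format(activation))


act = get_activation_fn("gelu")

try:
    get_activation_fn("tanh")
    error_message = None
except RuntimeError as exc:
    error_message = str(exc)

print(act(1.0), error_message)
```

Raising immediately on an unknown name means a misspelled activation is reported when the layer is constructed rather than during the first forward pass.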
