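Background: reproducing the backbone of the Transformer architecture taught me a lot. Until now I had only used models derived from it and had never studied the details of the architecture itself, so these few days were a real step forward. This is the third time in two years that I have reproduced a paper, and my main takeaway is that in a technical field you should keep learning from people who are better than you; that is the only way to avoid being left behind.
Paper download:
Personal Baidu Netdisk link
Link: https://pan.baidu.com/s/1p9ZJpgeTTjEQVQmQDobbPA
Extraction code: l980
(Figure taken from the original paper.)
Transformer architecture: at a high level it can be divided into four parts:
- the input part;
- the output part;
- the encoder part (N layers);
- the decoder part (N layers);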
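import copy
import math

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F


class Embedding(nn.Module):
    def __init__(self, d_model, vocab, dropout=0.1):
        """
        :param d_model: embedding dimension
        :param vocab: vocabulary size
        :param dropout: dropout probability
        """
        super(Embedding, self).__init__()
        # Embedding lookup table
        self.embedding = nn.Embedding(vocab, d_model)
        # Keep d_model for the scaling in forward
        self.d_model = d_model
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # x: tensor of token indices, shape (batch, seq_len)
        # The embeddings are scaled by sqrt(d_model)
        return self.dropout(self.embedding(x)) * math.sqrt(self.d_model)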
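As a quick shape check, the sketch below looks up a small batch of token indices and applies the sqrt(d_model) scaling. It uses nn.Embedding directly rather than the class above so that it runs on its own; the vocabulary size, d_model and the index values are made up for illustration.

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, vocab = 8, 11                   # illustrative sizes
lookup = nn.Embedding(vocab, d_model)

# Two sequences of four token indices each: shape (batch=2, seq_len=4)
tokens = torch.tensor([[1, 2, 3, 4], [4, 3, 2, 1]])

raw = lookup(tokens)                     # (2, 4, 8)
scaled = raw * math.sqrt(d_model)        # same shape, larger magnitude

print(raw.shape, scaled.shape)
```

The same token index always maps to the same vector, so raw[0, 0] and raw[1, 3] (both token 1) are identical.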
# Positional encoder. We treat it as a layer too, so it inherits from nn.Module.
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout, max_len=5000):
        """
        :param d_model: embedding dimension
        :param dropout: dropout probability
        :param max_len: maximum sequence length supported
        """
        super(PositionalEncoding, self).__init__()
        # Dropout layer applied to the sum of embedding and positional encoding
        self.dropout = nn.Dropout(p=dropout)
        # Positional encoding matrix, initialised to zeros, of shape max_len x d_model.
        pe = torch.zeros(max_len, d_model)
        # Absolute positions: a token's position is simply its index.
        # arange gives the vector 0..max_len-1, and unsqueeze(1) turns it into
        # a max_len x 1 column.
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        # To expand the max_len x 1 positions into max_len x d_model values we
        # multiply them by a row of scaling factors, div_term, where
        # div_term[i] = 10000^(-2i / d_model). Each factor shrinks the raw
        # position and gives its pair of columns a different wavelength.
        # Only d_model / 2 factors are created, because each one is used twice:
        # once with sine, filling the even columns, and once with cosine,
        # filling the odd columns. Together they form the final matrix.
        div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float) *
                             -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        # pe is still 2-D. To add it to the embedding output (a 3-D tensor)
        # it needs a leading batch dimension.
        pe = pe.unsqueeze(0)
        # Register pe as a buffer: a tensor that belongs to the module's state
        # but is not a parameter, so the optimizer never updates it.
        # Buffers are saved and reloaded together with the parameters.
        self.register_buffer('pe', pe)

    def forward(self, x):
        """x: embedded input of shape (batch, seq_len, d_model)."""
        # Slice pe along the sequence dimension down to x.size(1).
        # max_len=5000 is far longer than almost any sentence, so the table
        # has to be cut to the actual input length.
        # pe is a buffer, so it takes no part in gradient computation.
        x = x + self.pe[:, :x.size(1)]
        # Apply dropout and return the result.
        return self.dropout(x)
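The construction above corresponds to the sinusoidal formulas PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). The sketch below builds a small table the same way and compares one entry against the closed form; the sizes (max_len=10, d_model=6) and the chosen position are arbitrary values picked for illustration.

```python
import math

import torch

max_len, d_model = 10, 6   # small illustrative sizes

position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)   # (10, 1)
div_term = torch.exp(
    torch.arange(0, d_model, 2, dtype=torch.float) * -(math.log(10000.0) / d_model)
)                                                                     # (3,)

pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)   # even columns
pe[:, 1::2] = torch.cos(position * div_term)   # odd columns

# Closed form for position 7 and dimension pair i = 1 (columns 2 and 3)
pos, i = 7, 1
angle = pos / (10000 ** (2 * i / d_model))
expected_sin, expected_cos = math.sin(angle), math.cos(angle)

print(pe[pos, 2 * i].item(), expected_sin)
print(pe[pos, 2 * i + 1].item(), expected_cos)
```

Each printed pair should agree up to float32 rounding.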
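Some values in the attention tensor could be computed with knowledge of future tokens. This happens because during training the whole target sequence is embedded at once, whereas in principle the decoder does not produce its output in a single step; it generates each token from the results so far. Future information could therefore be used ahead of time, so we mask it out.
def subsequent_mask(size):
    """
    :param size: sequence length (size of the last two dimensions of the mask)
    :return: mask tensor of shape (1, size, size)
    """
    # Shape of the mask tensor
    attn_shape = (1, size, size)
    # Fill the shape with ones and keep only the part strictly above the
    # diagonal (k=1). uint8 is used to save space.
    subsequent_mask = np.triu(np.ones(attn_shape), k=1).astype('uint8')
    # Compute 1 - mask, which flips the triangle, and convert to a torch tensor:
    # a 0 becomes 1 (position visible),
    # a 1 becomes 0 (future position hidden).
    return torch.from_numpy(1 - subsequent_mask)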
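To make the pattern concrete, here is a small sketch that builds the same lower-triangular mask for a sequence of length 4. It uses torch.tril as an equivalent construction so that it runs without the function above; the length 4 is arbitrary.

```python
import torch

size = 4  # illustrative sequence length

# Lower-triangular matrix of ones: row i marks the positions that step i may attend to.
mask = torch.tril(torch.ones(1, size, size)).to(torch.uint8)

print(mask[0])
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 0],
#         [1, 1, 1, 1]], dtype=torch.uint8)
```

Row 0 can only see position 0, while the last row can see every position.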
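Attention takes three inputs: Q (query), K (key) and V (value).
How should these be understood? I could not think of a suitable technical term, so here is an example.
Suppose I am reading this paper. The paper as a whole can be thought of as Q (query). Key hints in it, such as the heading "3.2.1 Scaled Dot-Product Attention" shown in the figure below, can be thought of as K (key). V (value) is then what I take away after reading those key points.
Reproduction code:
def attention(query, key, value, mask=None, dropout=None):
    """
    :param query: query tensor
    :param key: key tensor
    :param value: value tensor
    :param mask: mask tensor; positions where mask == 0 are hidden
    :param dropout: dropout layer (optional)
    :return: the attention output and the attention weights
    """
    # d_k is the size of the last dimension of query
    d_k = query.size(-1)
    # Scaled dot product of queries and keys
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    # Masked positions get a large negative score, so softmax gives them ~0 weight
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    # Softmax over the last dimension turns scores into attention weights
    p_attn = F.softmax(scores, dim=-1)
    # Dropout, if given, is applied to the attention weights
    if dropout is not None:
        p_attn = dropout(p_attn)
    # Return the weighted sum of the values and the attention weights
    return torch.matmul(p_attn, value), p_attn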
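The computation is softmax(Q K^T / sqrt(d_k)) V. The sketch below runs it on tiny random tensors with a causal mask, to show that hidden positions end up with zero weight. It inlines the steps instead of calling the function above so that it is self-contained; the tensor sizes are arbitrary.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)

seq_len, d_k = 3, 4                       # illustrative sizes
q = torch.randn(1, seq_len, d_k)
k = torch.randn(1, seq_len, d_k)
v = torch.randn(1, seq_len, d_k)

# Causal mask: position i may only look at positions <= i
mask = torch.tril(torch.ones(1, seq_len, seq_len))

scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
scores = scores.masked_fill(mask == 0, -1e9)   # hide future positions
weights = F.softmax(scores, dim=-1)
output = torch.matmul(weights, v)

print(weights[0])      # entries above the diagonal are 0
print(output.shape)    # torch.Size([1, 3, 4])
```

Because the first position can only attend to itself, its weight row is (1, 0, 0) and its output equals v[0, 0].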
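The following is only my personal understanding, so please be kind:
- There are four linear layers: one each for Q, K and V, and one for the output;
- "Multi-head" simply means that, once the number of heads is fixed, the last (embedding) dimension of Q, and likewise of K and V, is split evenly across the heads. In other words, embedding dimension = number of heads * dimension per head. In code this is just a view call that adds a dimension, going from a 3-D to a 4-D tensor;
- The concat step in the diagram merges the heads and their dimensions back together. In code it amounts to retracing the same steps in reverse;
- Finally the attention output is passed through the last linear layer, and the multi-head attention mechanism is complete.
That is essentially the whole flow.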
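The split and merge steps described above can be sketched with view and transpose alone. This is a shape-only illustration that leaves out the linear layers and the attention itself; the batch size, sequence length and head count are arbitrary.

```python
import torch

batch, seq_len, head, d_k = 2, 5, 8, 64   # illustrative sizes
d_model = head * d_k                       # 512

x = torch.randn(batch, seq_len, d_model)

# Split: (batch, seq_len, d_model) -> (batch, head, seq_len, d_k)
heads = x.view(batch, -1, head, d_k).transpose(1, 2)

# Merge ("concat"): reverse the transpose, then flatten the last two dimensions
merged = heads.transpose(1, 2).contiguous().view(batch, -1, head * d_k)

print(heads.shape)    # torch.Size([2, 8, 5, 64])
print(merged.shape)   # torch.Size([2, 5, 512])
```

Since nothing is computed between the split and the merge here, merged is identical to x, which confirms that the merge really is the split in reverse.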
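My reproduction code for the multi-head attention module:
# Helper that clones a module n times
def clones(model, n):
    """
    :param model: the module to copy
    :param n: number of copies
    :return: nn.ModuleList holding n independent deep copies
    """
    return nn.ModuleList([copy.deepcopy(model) for _ in range(n)])


# Multi-head attention
class MultiHeadedAttention(nn.Module):
    def __init__(self, head, embedding_dim, dropout=0.1):
        """
        :param head: number of heads
        :param embedding_dim: embedding dimension
        :param dropout: dropout probability
        """
        super(MultiHeadedAttention, self).__init__()
        self.embedding_dim = embedding_dim
        self.head = head
        # Dropout layer applied to the attention weights
        self.dropout = nn.Dropout(dropout)
        # embedding_dim must be divisible by head, otherwise the assert fails
        assert embedding_dim % head == 0
        # Dimension assigned to each head
        self.d_k = self.embedding_dim // self.head
        # Four square linear maps: one each for Q, K, V and one for the output
        self.layers = clones(nn.Linear(self.embedding_dim, self.embedding_dim), 4)
        # Attention weights of the most recent forward pass
        self.attn = None

    def forward(self, query, key, value, mask=None):
        # The attention computation is 4-D, so a 3-D mask gets an extra
        # dimension at position 1 that broadcasts across all heads
        if mask is not None:
            mask = mask.unsqueeze(1)
        # Batch size is the first dimension of query
        batch_size = query.size(0)
        # Project Q, K, V, split them into heads, and swap dimensions 1 and 2,
        # giving (batch, head, seq_len, d_k). zip stops after the first three
        # linear layers, which leaves the fourth for the output.
        query, key, value = [model(x).view(batch_size, -1, self.head, self.d_k).transpose(1, 2)
                             for model, x in zip(self.layers, (query, key, value))]
        # Run attention on all heads at once and keep the weights
        x, self.attn = attention(query, key, value, mask=mask, dropout=self.dropout)
        # Merge the heads back: (batch, seq_len, head * d_k)
        x = x.transpose(1, 2).contiguous().view(batch_size, -1, self.head * self.d_k)
        # Output of the last linear layer
        return self.layers[-1](x)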
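A mask with one (seq_len x seq_len) matrix per batch element has to line up with scores of shape (batch, head, seq_len, seq_len). The sketch below shows, under that shape assumption, how inserting a size-1 dimension at position 1 lets masked_fill broadcast the same mask over every head. The sizes are arbitrary.

```python
import torch

batch, head, seq_len = 2, 4, 3             # illustrative sizes

scores = torch.zeros(batch, head, seq_len, seq_len)

# One causal mask per batch element: (batch, seq_len, seq_len)
mask = torch.tril(torch.ones(seq_len, seq_len)).expand(batch, seq_len, seq_len)

# Insert the head dimension: (batch, 1, seq_len, seq_len)
mask4d = mask.unsqueeze(1)

# masked_fill broadcasts the size-1 head dimension across all heads
masked = scores.masked_fill(mask4d == 0, -1e9)

print(mask4d.shape)   # torch.Size([2, 1, 3, 3])
print(masked.shape)   # torch.Size([2, 4, 3, 3])
```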
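Attention alone may not fit complex mappings well enough, so every encoder and decoder layer adds a two-layer feed-forward network to increase the model's capacity.
The reproduction code is fairly simple:
class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        """
        Position-wise feed-forward layer
        :param d_model: embedding dimension, i.e. the input and output dimension
        :param d_ff: output dimension of the first linear layer
        :param dropout: dropout probability
        """
        super(PositionwiseFeedForward, self).__init__()
        # The two linear layers
        self.layer_1 = nn.Linear(d_model, d_ff)
        self.layer_2 = nn.Linear(d_ff, d_model)
        # Dropout layer
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # x: output of the previous layer, used as this layer's input
        return self.layer_2(self.dropout(F.relu(self.layer_1(x))))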
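"Position-wise" means the same two linear layers are applied to every position independently. A minimal sketch of that property, with dropout left out so the comparison is deterministic and with made-up sizes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

d_model, d_ff = 8, 32                      # illustrative sizes
layer_1 = nn.Linear(d_model, d_ff)
layer_2 = nn.Linear(d_ff, d_model)


def feed_forward(x):
    # Dropout omitted so that the comparison below is deterministic
    return layer_2(F.relu(layer_1(x)))


x = torch.randn(2, 5, d_model)             # (batch, seq_len, d_model)

whole = feed_forward(x)                    # all positions at once
single = feed_forward(x[:, 3])             # position 3 on its own

print(whole.shape)                                      # torch.Size([2, 5, 8])
print(torch.allclose(whole[:, 3], single, atol=1e-5))   # True
```

Feeding one position alone gives the same result as slicing that position out of the full output, so no information flows between positions in this sublayer.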
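Normalization layer
There is nothing new about the normalization layer; it is used in the usual way: subtract the mean and divide by the variance or the standard deviation. I chose to divide by the standard deviation.
What problems does normalizing the data solve?
As the network gets deeper, the values produced after many layers of computation may become too large or too small, which can destabilize training and make the model converge very slowly. A normalization layer is therefore inserted after a certain number of layers to keep the feature values in a reasonable range.
class LayerNorm(nn.Module):
    def __init__(self, d_model, eps=1e-6):
        """
        :param d_model: embedding dimension
        :param eps: small constant that keeps the denominator away from zero
        """
        super(LayerNorm, self).__init__()
        # Learnable scale (initialised to ones) and shift (initialised to zeros)
        self.w_1 = nn.Parameter(torch.ones(d_model))
        self.w_2 = nn.Parameter(torch.zeros(d_model))
        # Store eps
        self.eps = eps

    def forward(self, x):
        # Mean and standard deviation over the last (feature) dimension
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        # Normalize, then apply the learnable scale and shift
        return self.w_1 * ((x - mean) / (std + self.eps)) + self.w_2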
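As a sanity check on the formula (x - mean) / (std + eps), the sketch below normalizes deliberately shifted and scaled random features and confirms that each position ends up with mean close to 0 and standard deviation close to 1. The scale and shift parameters are left at their initial values (1 and 0), so they are omitted; the tensor sizes are arbitrary.

```python
import torch

torch.manual_seed(0)

eps = 1e-6
x = torch.randn(2, 5, 16) * 10 + 3         # deliberately off-centre, large-scale features

mean = x.mean(-1, keepdim=True)
std = x.std(-1, keepdim=True)              # unbiased (n - 1) estimate by default
normed = (x - mean) / (std + eps)

print(normed.mean(-1).abs().max().item())  # close to 0
print(normed.std(-1).mean().item())        # close to 1
```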
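Each encoder layer is a building block of the encoder and performs one round of feature extraction on its input.
Encoder layer structure: (figure taken from the original paper)
Reproduction code:
# Encoder layer
class EncoderLayer(nn.Module):
    def __init__(self, d_model, self_attn, feed_forward, dropout=0.1):
        """
        :param d_model: embedding dimension
        :param self_attn: multi-head attention instance
        :param feed_forward: feed-forward instance
        :param dropout: dropout probability
        """
        super(EncoderLayer, self).__init__()
        # Store the arguments
        self.d_model = d_model
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        # Two sublayer connections, one per sublayer
        self.sublayers = clones(SublayerConnection(self.d_model, dropout), 2)

    def forward(self, x, mask):
        """
        :param x: output of the previous layer, used as this layer's input
        :param mask: mask tensor
        :return: output of the encoder layer
        """
        # First sublayer: multi-head self-attention inside a sublayer connection
        x = self.sublayers[0](x, lambda x: self.self_attn(x, x, x, mask))
        # Second sublayer: feed-forward network inside a sublayer connection
        return self.sublayers[1](x, lambda x: self.feed_forward(x))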
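The EncoderLayer above wraps each sublayer in a SublayerConnection, whose definition does not appear in this post; the printed model structure at the end only shows that it holds a norm and a dropout. Below is a minimal sketch of what such a class might look like, assuming the pre-norm residual form x + dropout(sublayer(norm(x))). Both that ordering and the use of nn.LayerNorm as a stand-in for the post's own LayerNorm are assumptions, not the author's confirmed code.

```python
import torch
import torch.nn as nn


class SublayerConnection(nn.Module):
    """Residual connection around a sublayer, with layer norm and dropout.

    Assumption: pre-norm ordering, x + dropout(sublayer(norm(x))).
    nn.LayerNorm stands in for the post's own LayerNorm class.
    """

    def __init__(self, d_model, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # sublayer is any callable mapping (batch, seq_len, d_model) to the same shape
        return x + self.dropout(sublayer(self.norm(x)))


d_model = 8                                  # illustrative size
connection = SublayerConnection(d_model, dropout=0.0)
x = torch.randn(2, 5, d_model)

# With a sublayer that returns zeros, only the residual path remains
out = connection(x, lambda t: torch.zeros_like(t))
print(torch.equal(out, x))                   # True
```

Passing the sublayer as a callable is what allows the lambda expressions in EncoderLayer.forward to close over the mask.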
The encoder performs the feature extraction (the encoding) on the input and is a stack of N encoder layers.
# Encoder class
class Encoder(nn.Module):
    def __init__(self, sublayer, N):
        """
        :param sublayer: the encoder layer to clone
        :param N: number of layers
        """
        super(Encoder, self).__init__()
        self.sublayers = clones(sublayer, N)
        # Final normalization layer, applied after the last encoder layer
        self.norm = LayerNorm(sublayer.d_model)

    def forward(self, x, mask):
        """
        :param x: output of the previous encoder layer
        :param mask: mask tensor
        :return: normalized encoder output
        """
        for sublayer in self.sublayers:
            x = sublayer(x, mask)
        return self.norm(x)
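Each decoder layer is a building block of the decoder. Given its inputs, it extracts features in the direction of the target, which is the decoding process.
# Decoder layer
class DecoderLayer(nn.Module):
    def __init__(self, d_model, self_attn1, self_attn2, feed_forward, dropout=0.1):
        """
        :param d_model: embedding dimension
        :param self_attn1: first attention sublayer (masked self-attention over the target)
        :param self_attn2: second attention sublayer (attention over the encoder output)
        :param feed_forward: feed-forward layer
        :param dropout: dropout probability
        """
        super(DecoderLayer, self).__init__()
        # Store the arguments
        self.d_model = d_model
        self.self_attn1 = self_attn1
        self.self_attn2 = self_attn2
        self.feed_forward = feed_forward
        self.norm = LayerNorm(self.d_model)
        # Three sublayer connections, i.e. the residual blocks
        self.sublayers = clones(SublayerConnection(self.d_model, dropout), 3)

    def forward(self, x, memory, source_mask, target_mask):
        """
        :param x: output of the previous layer, used as this layer's input
        :param memory: encoder output
        :param source_mask: mask tensor for the source data
        :param target_mask: mask tensor for the target data
        :return: output of the decoder layer
        """
        # Masked self-attention over the target sequence
        x = self.sublayers[0](x, lambda x: self.self_attn1(x, x, x, target_mask))
        # Attention over the encoder output: queries from x, keys and values from memory
        x = self.sublayers[1](x, lambda x: self.self_attn2(x, memory, memory, source_mask))
        # Feed-forward network
        return self.sublayers[2](x, lambda x: self.feed_forward(x))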
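The post does not show how source_mask and target_mask are built. One common convention, sketched here purely as an assumption, is that the source mask hides padding tokens while the target mask hides both padding and future positions. The padding index 0 and the toy token values are hypothetical.

```python
import torch

PAD = 0                                      # hypothetical padding index

# Toy batch of token indices; trailing zeros are padding
source = torch.tensor([[5, 7, 9, 0, 0],
                       [3, 4, 6, 8, 2]])
target = torch.tensor([[1, 6, 2, 0],
                       [1, 4, 5, 2]])

# Source mask hides padding only: (batch, 1, src_len)
source_mask = (source != PAD).unsqueeze(-2)

# Target mask hides padding and future positions: (batch, tgt_len, tgt_len)
tgt_len = target.size(1)
causal = torch.tril(torch.ones(1, tgt_len, tgt_len)).bool()
target_mask = (target != PAD).unsqueeze(-2) & causal

print(source_mask.shape)   # torch.Size([2, 1, 5])
print(target_mask.shape)   # torch.Size([2, 4, 4])
print(target_mask[0].int())
```

In the first target sequence the last column is all zeros, because its fourth token is padding and must never be attended to.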
Based on the encoder output and the previous predictions, the decoder builds a feature representation of the next possible value.
# Decoder class
class Decoder(nn.Module):
    def __init__(self, layer, N):
        """
        :param layer: the decoder layer to clone
        :param N: number of layers
        """
        super(Decoder, self).__init__()
        # Clone the layer N times
        self.layers = clones(layer, N)
        # Final normalization layer
        self.norm = LayerNorm(layer.d_model)

    def forward(self, x, memory, source_mask, target_mask):
        """
        :param x: output of the previous layer, used as this layer's input
        :param memory: encoder output
        :param source_mask: mask tensor for the source data
        :param target_mask: mask tensor for the target data
        :return: normalized decoder output
        """
        for layer in self.layers:
            x = layer(x, memory, source_mask, target_mask)
        return self.norm(x)
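A linear transformation of the previous step's output produces an output of the required dimension; its role is to change the dimensionality.
Softmax then scales the numbers in the last dimension into the probability range 0-1 so that they sum to 1.
Reproduction code:
# Output layer
class Generator(nn.Module):
    def __init__(self, d_model, vocab_size):
        """
        :param d_model: embedding dimension
        :param vocab_size: vocabulary size
        """
        super(Generator, self).__init__()
        # Linear projection from d_model to the vocabulary size
        self.liner = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        """
        :param x: normalized decoder output
        :return: log-probabilities over the vocabulary (log_softmax output)
        """
        return F.log_softmax(self.liner(x), dim=-1)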
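The code returns log_softmax rather than softmax, so the values are log-probabilities; exponentiating them recovers probabilities that lie in the 0-1 range and sum to 1. A small sketch with made-up sizes, using a bare nn.Linear in place of the class above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

d_model, vocab_size = 8, 11                  # illustrative sizes
proj = nn.Linear(d_model, vocab_size)

decoder_out = torch.randn(2, 4, d_model)     # (batch, seq_len, d_model)
log_probs = F.log_softmax(proj(decoder_out), dim=-1)

probs = log_probs.exp()                      # back to ordinary probabilities
print(log_probs.shape)                       # torch.Size([2, 4, 11])
print(probs.sum(-1))                         # every entry is 1 up to rounding
```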
# Class that combines the encoder and the decoder
class EncoderDecoder(nn.Module):
    def __init__(self, encoder, decoder, input_embedded, target_embedded, generator):
        """
        :param encoder: encoder object
        :param decoder: decoder object
        :param input_embedded: embedding module for the encoder input
        :param target_embedded: embedding module for the decoder input
        :param generator: output layer object
        """
        super(EncoderDecoder, self).__init__()
        # Store the components as attributes
        self.encoder = encoder
        self.decoder = decoder
        self.input_embedded = input_embedded
        self.target_embedded = target_embedded
        self.generator = generator

    def forward(self, source, target, source_mask, target_mask):
        """
        :param source: initial input on the encoder side
        :param target: initial input on the decoder side
        :param source_mask: mask for the encoder layers
        :param target_mask: mask for the decoder layers
        :return: output of the generator (log-probabilities over the vocabulary)
        """
        return self.generator(self.decode(self.encode(source, source_mask), target, source_mask, target_mask))

    def encode(self, source, source_mask):
        """
        :param source: initial input on the encoder side
        :param source_mask: mask for the encoder layers
        :return: encoder output
        """
        return self.encoder(self.input_embedded(source), source_mask)

    def decode(self, memory, target, source_mask, target_mask):
        """
        :param memory: encoder output
        :param target: initial input on the decoder side
        :param source_mask: mask for the encoder layers
        :param target_mask: mask for the decoder layers
        :return: decoder output
        """
        return self.decoder(self.target_embedded(target), memory, source_mask, target_mask)
# Builder for the core Transformer network
def make_model(source_vocab, target_vocab, N=6,
               d_model=512, d_ff=2048, head=8, dropout=0.1):
    """
    :param source_vocab: vocabulary size on the encoder side
    :param target_vocab: vocabulary size on the decoder side
    :param N: number of encoder and decoder layers
    :param d_model: embedding dimension
    :param d_ff: output dimension of the first linear layer in the feed-forward network
    :param head: number of heads
    :param dropout: dropout probability
    :return: the model object
    """
    # Shorthand for deep copy
    c = copy.deepcopy
    # Multi-head attention layer
    attn = MultiHeadedAttention(head, d_model, dropout)
    # Feed-forward layer
    ff = PositionwiseFeedForward(d_model, d_ff, dropout)
    # Positional encoding
    position = PositionalEncoding(d_model, dropout)
    # Assemble the model; every layer receives its own deep copy
    model = EncoderDecoder(
        encoder=Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N),
        decoder=Decoder(DecoderLayer(d_model, c(attn), c(attn), c(ff), dropout), N),
        input_embedded=nn.Sequential(Embedding(d_model, source_vocab, dropout), c(position)),
        target_embedded=nn.Sequential(Embedding(d_model, target_vocab, dropout), c(position)),
        generator=Generator(d_model, target_vocab)
    )
    # Parameter initialisation
    # With the structure in place, the parameters are initialised, for example
    # the weight matrices of the linear layers. Any parameter with more than
    # one dimension is filled with values from a Xavier uniform distribution.
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
    return model
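The initialisation rule at the end of make_model touches weight matrices but skips one-dimensional tensors such as biases. The sketch below applies the same loop to a small stand-in network, whose sizes are arbitrary, and checks both effects. The biases are zeroed first only to make it visible that the loop leaves them untouched.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small stand-in network; the sizes are illustrative
net = nn.Sequential(nn.Linear(4, 6), nn.ReLU(), nn.Linear(6, 2))

# Zero the biases first so it is easy to see that the loop leaves them alone
for p in net.parameters():
    if p.dim() == 1:
        nn.init.zeros_(p)

# Same rule as in make_model: only tensors with more than one dimension are re-initialised
for p in net.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

# Xavier uniform draws from U(-a, a) with a = sqrt(6 / (fan_in + fan_out))
bound = (6.0 / (4 + 6)) ** 0.5
print(net[0].weight.abs().max().item(), bound)
print(net[0].bias.abs().sum().item())              # 0.0
```

The largest absolute weight in the first layer should not exceed the printed bound.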
The final model structure is shown below. Finished at last, after three days:
EncoderDecoder(
(encoder): Encoder(
(sublayers): ModuleList(
(0): EncoderLayer(
(self_attn): MultiHeadedAttention(
(dropout): Dropout(p=0.1, inplace=False)
(layers): ModuleList(
(0): Linear(in_features=512, out_features=512, bias=True)
(1): Linear(in_features=512, out_features=512, bias=True)
(2): Linear(in_features=512, out_features=512, bias=True)
(3): Linear(in_features=512, out_features=512, bias=True)
)
)
(feed_forward): PositionwiseFeedForward(
(layer_1): Linear(in_features=512, out_features=2048, bias=True)
(layer_2): Linear(in_features=2048, out_features=512, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(sublayers): ModuleList(
(0): SublayerConnection(
(norm): LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(1): SublayerConnection(
(norm): LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(1): EncoderLayer(
(self_attn): MultiHeadedAttention(
(dropout): Dropout(p=0.1, inplace=False)
(layers): ModuleList(
(0): Linear(in_features=512, out_features=512, bias=True)
(1): Linear(in_features=512, out_features=512, bias=True)
(2): Linear(in_features=512, out_features=512, bias=True)
(3): Linear(in_features=512, out_features=512, bias=True)
)
)
(feed_forward): PositionwiseFeedForward(
(layer_1): Linear(in_features=512, out_features=2048, bias=True)
(layer_2): Linear(in_features=2048, out_features=512, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(sublayers): ModuleList(
(0): SublayerConnection(
(norm): LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(1): SublayerConnection(
(norm): LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(2): EncoderLayer(
(self_attn): MultiHeadedAttention(
(dropout): Dropout(p=0.1, inplace=False)
(layers): ModuleList(
(0): Linear(in_features=512, out_features=512, bias=True)
(1): Linear(in_features=512, out_features=512, bias=True)
(2): Linear(in_features=512, out_features=512, bias=True)
(3): Linear(in_features=512, out_features=512, bias=True)
)
)
(feed_forward): PositionwiseFeedForward(
(layer_1): Linear(in_features=512, out_features=2048, bias=True)
(layer_2): Linear(in_features=2048, out_features=512, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(sublayers): ModuleList(
(0): SublayerConnection(
(norm): LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(1): SublayerConnection(
(norm): LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(3): EncoderLayer(
(self_attn): MultiHeadedAttention(
(dropout): Dropout(p=0.1, inplace=False)
(layers): ModuleList(
(0): Linear(in_features=512, out_features=512, bias=True)
(1): Linear(in_features=512, out_features=512, bias=True)
(2): Linear(in_features=512, out_features=512, bias=True)
(3): Linear(in_features=512, out_features=512, bias=True)
)
)
(feed_forward): PositionwiseFeedForward(
(layer_1): Linear(in_features=512, out_features=2048, bias=True)
(layer_2): Linear(in_features=2048, out_features=512, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(sublayers): ModuleList(
(0): SublayerConnection(
(norm): LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(1): SublayerConnection(
(norm): LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(4): EncoderLayer(
(self_attn): MultiHeadedAttention(
(dropout): Dropout(p=0.1, inplace=False)
(layers): ModuleList(
(0): Linear(in_features=512, out_features=512, bias=True)
(1): Linear(in_features=512, out_features=512, bias=True)
(2): Linear(in_features=512, out_features=512, bias=True)
(3): Linear(in_features=512, out_features=512, bias=True)
)
)
(feed_forward): PositionwiseFeedForward(
(layer_1): Linear(in_features=512, out_features=2048, bias=True)
(layer_2): Linear(in_features=2048, out_features=512, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(sublayers): ModuleList(
(0): SublayerConnection(
(norm): LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(1): SublayerConnection(
(norm): LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(5): EncoderLayer(
(self_attn): MultiHeadedAttention(
(dropout): Dropout(p=0.1, inplace=False)
(layers): ModuleList(
(0): Linear(in_features=512, out_features=512, bias=True)
(1): Linear(in_features=512, out_features=512, bias=True)
(2): Linear(in_features=512, out_features=512, bias=True)
(3): Linear(in_features=512, out_features=512, bias=True)
)
)
(feed_forward): PositionwiseFeedForward(
(layer_1): Linear(in_features=512, out_features=2048, bias=True)
(layer_2): Linear(in_features=2048, out_features=512, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(sublayers): ModuleList(
(0): SublayerConnection(
(norm): LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(1): SublayerConnection(
(norm): LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
)
(norm): LayerNorm()
)
(decoder): Decoder(
(layers): ModuleList(
(0): DecoderLayer(
(self_attn1): MultiHeadedAttention(
(dropout): Dropout(p=0.1, inplace=False)
(layers): ModuleList(
(0): Linear(in_features=512, out_features=512, bias=True)
(1): Linear(in_features=512, out_features=512, bias=True)
(2): Linear(in_features=512, out_features=512, bias=True)
(3): Linear(in_features=512, out_features=512, bias=True)
)
)
(self_attn2): MultiHeadedAttention(
(dropout): Dropout(p=0.1, inplace=False)
(layers): ModuleList(
(0): Linear(in_features=512, out_features=512, bias=True)
(1): Linear(in_features=512, out_features=512, bias=True)
(2): Linear(in_features=512, out_features=512, bias=True)
(3): Linear(in_features=512, out_features=512, bias=True)
)
)
(feed_forward): PositionwiseFeedForward(
(layer_1): Linear(in_features=512, out_features=2048, bias=True)
(layer_2): Linear(in_features=2048, out_features=512, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(norm): LayerNorm()
(sublayers): ModuleList(
(0): SublayerConnection(
(norm): LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(1): SublayerConnection(
(norm): LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(2): SublayerConnection(
(norm): LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(1): DecoderLayer(
(self_attn1): MultiHeadedAttention(
(dropout): Dropout(p=0.1, inplace=False)
(layers): ModuleList(
(0): Linear(in_features=512, out_features=512, bias=True)
(1): Linear(in_features=512, out_features=512, bias=True)
(2): Linear(in_features=512, out_features=512, bias=True)
(3): Linear(in_features=512, out_features=512, bias=True)
)
)
(self_attn2): MultiHeadedAttention(
(dropout): Dropout(p=0.1, inplace=False)
(layers): ModuleList(
(0): Linear(in_features=512, out_features=512, bias=True)
(1): Linear(in_features=512, out_features=512, bias=True)
(2): Linear(in_features=512, out_features=512, bias=True)
(3): Linear(in_features=512, out_features=512, bias=True)
)
)
(feed_forward): PositionwiseFeedForward(
(layer_1): Linear(in_features=512, out_features=2048, bias=True)
(layer_2): Linear(in_features=2048, out_features=512, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(norm): LayerNorm()
(sublayers): ModuleList(
(0): SublayerConnection(
(norm): LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(1): SublayerConnection(
(norm): LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(2): SublayerConnection(
(norm): LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(2): DecoderLayer(
(self_attn1): MultiHeadedAttention(
(dropout): Dropout(p=0.1, inplace=False)
(layers): ModuleList(
(0): Linear(in_features=512, out_features=512, bias=True)
(1): Linear(in_features=512, out_features=512, bias=True)
(2): Linear(in_features=512, out_features=512, bias=True)
(3): Linear(in_features=512, out_features=512, bias=True)
)
)
(self_attn2): MultiHeadedAttention(
(dropout): Dropout(p=0.1, inplace=False)
(layers): ModuleList(
(0): Linear(in_features=512, out_features=512, bias=True)
(1): Linear(in_features=512, out_features=512, bias=True)
(2): Linear(in_features=512, out_features=512, bias=True)
(3): Linear(in_features=512, out_features=512, bias=True)
)
)
(feed_forward): PositionwiseFeedForward(
(layer_1): Linear(in_features=512, out_features=2048, bias=True)
(layer_2): Linear(in_features=2048, out_features=512, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(norm): LayerNorm()
(sublayers): ModuleList(
(0): SublayerConnection(
(norm): LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(1): SublayerConnection(
(norm): LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(2): SublayerConnection(
(norm): LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(3): DecoderLayer(
(self_attn1): MultiHeadedAttention(
(dropout): Dropout(p=0.1, inplace=False)
(layers): ModuleList(
(0): Linear(in_features=512, out_features=512, bias=True)
(1): Linear(in_features=512, out_features=512, bias=True)
(2): Linear(in_features=512, out_features=512, bias=True)
(3): Linear(in_features=512, out_features=512, bias=True)
)
)
(self_attn2): MultiHeadedAttention(
(dropout): Dropout(p=0.1, inplace=False)
(layers): ModuleList(
(0): Linear(in_features=512, out_features=512, bias=True)
(1): Linear(in_features=512, out_features=512, bias=True)
(2): Linear(in_features=512, out_features=512, bias=True)
(3): Linear(in_features=512, out_features=512, bias=True)
)
)
(feed_forward): PositionwiseFeedForward(
(layer_1): Linear(in_features=512, out_features=2048, bias=True)
(layer_2): Linear(in_features=2048, out_features=512, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(norm): LayerNorm()
(sublayers): ModuleList(
(0): SublayerConnection(
(norm): LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(1): SublayerConnection(
(norm): LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(2): SublayerConnection(
(norm): LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(4): DecoderLayer(
(self_attn1): MultiHeadedAttention(
(dropout): Dropout(p=0.1, inplace=False)
(layers): ModuleList(
(0): Linear(in_features=512, out_features=512, bias=True)
(1): Linear(in_features=512, out_features=512, bias=True)
(2): Linear(in_features=512, out_features=512, bias=True)
(3): Linear(in_features=512, out_features=512, bias=True)
)
)
(self_attn2): MultiHeadedAttention(
(dropout): Dropout(p=0.1, inplace=False)
(layers): ModuleList(
(0): Linear(in_features=512, out_features=512, bias=True)
(1): Linear(in_features=512, out_features=512, bias=True)
(2): Linear(in_features=512, out_features=512, bias=True)
(3): Linear(in_features=512, out_features=512, bias=True)
)
)
(feed_forward): PositionwiseFeedForward(
(layer_1): Linear(in_features=512, out_features=2048, bias=True)
(layer_2): Linear(in_features=2048, out_features=512, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(norm): LayerNorm()
(sublayers): ModuleList(
(0): SublayerConnection(
(norm): LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(1): SublayerConnection(
(norm): LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(2): SublayerConnection(
(norm): LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(5): DecoderLayer(
(self_attn1): MultiHeadedAttention(
(dropout): Dropout(p=0.1, inplace=False)
(layers): ModuleList(
(0): Linear(in_features=512, out_features=512, bias=True)
(1): Linear(in_features=512, out_features=512, bias=True)
(2): Linear(in_features=512, out_features=512, bias=True)
(3): Linear(in_features=512, out_features=512, bias=True)
)
)
(self_attn2): MultiHeadedAttention(
(dropout): Dropout(p=0.1, inplace=False)
(layers): ModuleList(
(0): Linear(in_features=512, out_features=512, bias=True)
(1): Linear(in_features=512, out_features=512, bias=True)
(2): Linear(in_features=512, out_features=512, bias=True)
(3): Linear(in_features=512, out_features=512, bias=True)
)
)
(feed_forward): PositionwiseFeedForward(
(layer_1): Linear(in_features=512, out_features=2048, bias=True)
(layer_2): Linear(in_features=2048, out_features=512, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(norm): LayerNorm()
(sublayers): ModuleList(
(0): SublayerConnection(
(norm): LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(1): SublayerConnection(
(norm): LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(2): SublayerConnection(
(norm): LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
)
(norm): LayerNorm()
)
(input_embedded): Sequential(
(0): Embedding(
(embedding): Embedding(11, 512)
(dropout): Dropout(p=0.1, inplace=False)
)
(1): PositionalEncoding(
(dropout): Dropout(p=0.1, inplace=False)
)
)
(target_embedded): Sequential(
(0): Embedding(
(embedding): Embedding(11, 512)
(dropout): Dropout(p=0.1, inplace=False)
)
(1): PositionalEncoding(
(dropout): Dropout(p=0.1, inplace=False)
)
)
(generator): Generator(
(liner): Linear(in_features=512, out_features=11, bias=True)
)
)