


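This column summarizes key topics in deep learning. It starts with the major dataset competitions and the winning algorithms from each year, and also covers important fundamentals, including loss functions, optimizers, classic algorithms, and optimization strategies such as Bag of Freebies (BoF).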



Reference: the 2014 paper "Sequence to Sequence Learning with Neural Networks".

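        The most common sequence-to-sequence (seq2seq) models are encoder-decoder models, which typically use a recurrent neural network (RNN) to encode the source (input) sentence into a single vector. In this introduction we call this single vector the context vector. We can think of the context vector as an abstract representation of the whole input sentence. A second RNN then decodes this vector, learning to output the target (output) sentence by generating one word at a time.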


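         The figure above shows an example translation. The input/source sentence "guten morgen" passes through the embedding layer (yellow) and is then fed into the encoder (green). We also append a start-of-sequence token and an end-of-sequence token to the beginning and end of the sentence, respectively. At each time step, the input to the encoder RNN is the embedding of the current word, e(x_t), together with the hidden state from the previous time step, h_{t-1}, and the encoder RNN outputs a new hidden state h_t. We can think of the hidden state as a vector representation of the sentence so far. The RNN can be viewed as a function of e(x_t) and h_{t-1}:

         h_t = \text{EncoderRNN}(e(x_t), h_{t-1})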
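The recurrence described above can be sketched in a few lines of PyTorch. This is only an illustration: the vocabulary size, dimensions, and token ids below are made-up values, and `nn.RNNCell` stands in for whichever recurrent unit the encoder actually uses.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes only; none of these values come from the article.
vocab_size, emb_dim, hid_dim = 10, 8, 16

embedding = nn.Embedding(vocab_size, emb_dim)  # plays the role of e(.)
rnn_cell = nn.RNNCell(emb_dim, hid_dim)        # one step of the encoder RNN

# A toy source sentence as token ids: [<sos>, word, word, <eos>] (ids are hypothetical).
src = torch.tensor([1, 4, 5, 2])

h = torch.zeros(1, hid_dim)  # h_0: all zeros, batch size 1
for x_t in src:
    e_t = embedding(x_t.unsqueeze(0))  # e(x_t), shape [1, emb_dim]
    h = rnn_cell(e_t, h)               # h_t computed from e(x_t) and h_{t-1}

context = h  # the final hidden state serves as the context vector
```

Each loop iteration consumes one token and the previous hidden state, so after the last token `context` summarizes the whole toy sentence in a single vector.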


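        The RNN here could equally be an LSTM (Long Short-Term Memory) or a GRU (Gated Recurrent Unit). We now have X=\{x_{1},x_{2},...,x_{T}\}, where x_{1}=<sos>, x_{2}=…, and an initial hidden state h_0, which usually defaults to zeros, so we can obtain the hidden state s_t: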





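        The decoder always generates words one at a time, one per time step. We always use <sos> as the first input to the decoder, but for subsequent inputs y_{t>1} we sometimes use the actual next word in the sequence, y_t, and sometimes the word predicted by our decoder, \hat{y}_{t-1}. This is called teacher forcing. [*In other words, during training we choose with some probability whether the next input is the ground-truth token or the previous prediction.]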
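The choice between the ground-truth token and the model's own prediction can be sketched in plain Python. The helper name `choose_next_input` and the string placeholders are hypothetical, used only to make the sampling rule concrete.

```python
import random


def choose_next_input(ground_truth, predicted, teacher_forcing_ratio, rng=random):
    """Return the ground-truth token with probability teacher_forcing_ratio,
    otherwise return the model's own prediction."""
    use_teacher = rng.random() < teacher_forcing_ratio
    return ground_truth if use_teacher else predicted


# With a ratio of 0.5, roughly half of the decoder inputs are ground truth.
rng = random.Random(0)
picks = [choose_next_input("gt", "pred", 0.5, rng) for _ in range(1000)]
share_gt = picks.count("gt") / len(picks)
```

A ratio of 1.0 always feeds the ground truth, and a ratio of 0.0 always feeds the prediction, which matches how the decoder behaves at inference time when no target sentence is available.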



3.3.2 Encode







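We can think of c_t as another type of hidden state. Like h_{0}^{l}, c_{0}^{l} is initialized to a tensor of all zeros. In addition, our context vector is now both the final hidden state and the final cell state, i.e. z^{l}=(h_{T}^{l},c_{T}^{l}). Extending our multi-layer equations to the LSTM, we get: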
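One plausible way to write the multi-layer LSTM encoder is sketched below. The function name \text{EncoderLSTM}^l is an assumed notation, and the sketch assumes the first layer reads the word embedding while every higher layer reads the hidden state of the layer beneath it.

```latex
\begin{align*}
(h_t^1, c_t^1) &= \text{EncoderLSTM}^1\big(e(x_t),\ (h_{t-1}^1, c_{t-1}^1)\big) \\
(h_t^l, c_t^l) &= \text{EncoderLSTM}^l\big(h_t^{l-1},\ (h_{t-1}^l, c_{t-1}^l)\big), \quad l > 1 \\
z^l &= (h_T^l, c_T^l)
\end{align*}
```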




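import random

import torch
import torch.nn as nn


class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):
        """
        :param input_dim: size of the one-hot vectors fed to the encoder, i.e. the input (source) vocabulary size
        :param emb_dim: dimensionality of the embedding layer, which converts one-hot vectors into dense vectors of this size
        :param hid_dim: dimensionality of the hidden and cell states
        :param n_layers: number of layers in the RNN
        :param dropout: dropout probability, used to reduce overfitting
        """
        super().__init__()
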
        self.hid_dim = hid_dim
        self.n_layers = n_layers

        self.embedding = nn.Embedding(input_dim, emb_dim)

        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout)

        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        # src = [src len, batch size]

        embedded = self.dropout(self.embedding(src))
        # embedded = [src len, batch size, emb dim]

        outputs, (hidden, cell) = self.rnn(embedded)
        # outputs = [src len, batch size, hid dim * n directions]
        # hidden = [n layers * n directions, batch size, hid dim]
        # cell = [n layers * n directions, batch size, hid dim]

        # outputs are always from the top hidden layer

        return hidden, cell
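To make the shape comments in the encoder concrete, here is a small standalone check that uses `nn.Embedding` and `nn.LSTM` directly instead of the `Encoder` class. All sizes are illustrative assumptions rather than values from the article.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes only.
input_dim, emb_dim, hid_dim, n_layers = 100, 32, 64, 2
src_len, batch_size = 7, 3

embedding = nn.Embedding(input_dim, emb_dim)
rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=0.5)
rnn.eval()  # disable inter-layer dropout so the check below is deterministic

src = torch.randint(0, input_dim, (src_len, batch_size))  # [src len, batch size]
outputs, (hidden, cell) = rnn(embedding(src))

print(outputs.shape)  # torch.Size([7, 3, 64]): top-layer state at every time step
print(hidden.shape)   # torch.Size([2, 3, 64]): final state of every layer
print(cell.shape)     # torch.Size([2, 3, 64])

# The last time step of `outputs` is the final hidden state of the top layer.
top_layer_matches = torch.allclose(outputs[-1], hidden[-1])
```

This is why the encoder can discard `outputs` and return only `hidden` and `cell`: the decoder needs the final state of every layer, not only the top one.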

3.3.3 Decode



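        The decoder outputs a single token per time step. The first layer receives the hidden and cell states from the previous time step, (s_{t-1}^1,c_{t-1}^1), together with the token for the current word, y_t, and produces new hidden and cell states (s_{t}^1,c_{t}^1). Subsequent layers use the hidden state from the layer below, s_{t}^{l-1}, and the previous hidden and cell states of their own layer, (s_{t-1}^l,c_{t-1}^l). The decoder equations are: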
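A sketch of these decoder updates in LaTeX follows. The names \text{DecoderLSTM}^l, d(\cdot) for the decoder embedding, f for the final linear layer, and L for the top layer are assumed notation, and the initial states are taken to be the encoder's context vectors z^l.

```latex
\begin{align*}
(s_0^l, c_0^l) &= z^l = (h_T^l, c_T^l) \\
(s_t^1, c_t^1) &= \text{DecoderLSTM}^1\big(d(y_t),\ (s_{t-1}^1, c_{t-1}^1)\big) \\
(s_t^l, c_t^l) &= \text{DecoderLSTM}^l\big(s_t^{l-1},\ (s_{t-1}^l, c_{t-1}^l)\big), \quad l > 1 \\
\hat{y}_{t+1} &= f(s_t^L)
\end{align*}
```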





class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()

        self.output_dim = output_dim
        self.hid_dim = hid_dim
        self.n_layers = n_layers

        self.embedding = nn.Embedding(output_dim, emb_dim)

        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout)

        self.fc_out = nn.Linear(hid_dim, output_dim)

        self.dropout = nn.Dropout(dropout)

    def forward(self, input, hidden, cell):
        # input = [batch size]
        # hidden = [n layers * n directions, batch size, hid dim]
        # cell = [n layers * n directions, batch size, hid dim]

        # n directions in the decoder will always be 1, therefore:
        # hidden = [n layers, batch size, hid dim]
        # cell = [n layers, batch size, hid dim]

        input = input.unsqueeze(0)

        # input = [1, batch size]

        embedded = self.dropout(self.embedding(input))
        # embedded = [1, batch size, emb dim]

        output, (hidden, cell) = self.rnn(embedded, (hidden, cell))

        # output = [seq len, batch size, hid dim * n directions]
        # hidden = [n layers * n directions, batch size, hid dim]
        # cell = [n layers * n directions, batch size, hid dim]

        # seq len and n directions will always be 1 in the decoder, therefore:
        # output = [1, batch size, hid dim]
        # hidden = [n layers, batch size, hid dim]
        # cell = [n layers, batch size, hid dim]

        prediction = self.fc_out(output.squeeze(0))
        # prediction = [batch size, output dim]

        return prediction, hidden, cell
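A single decoding step can be traced outside the class to see why `unsqueeze(0)` and `squeeze(0)` are needed. The sketch below uses standalone layers with made-up sizes and zero initial states, so it illustrates the shapes rather than the trained model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes only.
output_dim, emb_dim, hid_dim, n_layers, batch_size = 50, 16, 32, 2, 4

embedding = nn.Embedding(output_dim, emb_dim)
rnn = nn.LSTM(emb_dim, hid_dim, n_layers)
fc_out = nn.Linear(hid_dim, output_dim)

tokens = torch.randint(0, output_dim, (batch_size,))  # [batch size]
hidden = torch.zeros(n_layers, batch_size, hid_dim)   # stand-in for the encoder's hidden state
cell = torch.zeros(n_layers, batch_size, hid_dim)     # stand-in for the encoder's cell state

step_input = tokens.unsqueeze(0)                      # [1, batch size]: a sequence of length 1
output, (hidden, cell) = rnn(embedding(step_input), (hidden, cell))

prediction = fc_out(output.squeeze(0))                # [batch size, output dim]
next_tokens = prediction.argmax(1)                    # [batch size]: most likely token ids
```

Because `nn.LSTM` expects a sequence dimension, the single token per example is wrapped into a length-1 sequence and the dimension is removed again before the linear layer.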

3.3.4 Seq2Seq

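        For the final part of the implementation, we build the seq2seq model. It handles:

  • receiving the input/source sentence

  • using the encoder to produce the context vectors

  • using the decoder to produce the predicted output/target sentence

        Our full model will look like this: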



class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()

        self.encoder = encoder
        self.decoder = decoder
        self.device = device

        # the encoder and decoder hidden dimensions must match
        assert encoder.hid_dim == decoder.hid_dim, \
            "Hidden dimensions of encoder and decoder must be equal!"
        # the encoder and decoder must have the same number of layers
        assert encoder.n_layers == decoder.n_layers, \
            "Encoder and decoder must have equal number of layers!"

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        # src = [src len, batch size]
        # trg = [trg len, batch size]
        # teacher_forcing_ratio is probability to use teacher forcing
        # e.g. if teacher_forcing_ratio is 0.75 we use ground-truth inputs 75% of the time

        batch_size = trg.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim

        # tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)

        # last hidden state of the encoder is used as the initial hidden state of the decoder
        hidden, cell = self.encoder(src)

        # first input to the decoder is the <sos> tokens
        input = trg[0, :]

        for t in range(1, trg_len):
            # insert input token embedding, previous hidden and previous cell states
            # receive output tensor (predictions) and new hidden and cell states
            output, hidden, cell = self.decoder(input, hidden, cell)  # output = [batch size, output dim]

            # place predictions in a tensor holding predictions for each token
            outputs[t] = output

            # decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio

            # get the highest predicted token from our predictions
            top1 = output.argmax(1)  # highest-scoring token per example; top1 = [batch size]

            # if teacher forcing, use actual next token as next input
            # if not, use predicted token
            # i.e. teacher forcing feeds the ground-truth token; otherwise the prediction is fed back in
            input = trg[t] if teacher_force else top1

        return outputs
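Note that the loop starts at `t = 1`, so `outputs[0]` is never filled and stays all zeros. A common way to handle this when computing the loss is to drop the first position from both the predictions and the targets. The sketch below uses random tensors in place of real model outputs, and `pad_idx = 0` is a hypothetical padding index, not a value from the article.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes; pad_idx is a hypothetical padding index.
trg_len, batch_size, vocab_size = 6, 4, 20
pad_idx = 0

# Stand-ins for the model output and the target batch.
outputs = torch.randn(trg_len, batch_size, vocab_size)
outputs[0] = 0  # mimic the slot for <sos> that the forward pass never fills
trg = torch.randint(1, vocab_size, (trg_len, batch_size))

criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)

# Skip position 0 and flatten time and batch into one dimension.
logits = outputs[1:].reshape(-1, vocab_size)  # [(trg len - 1) * batch size, vocab size]
targets = trg[1:].reshape(-1)                 # [(trg len - 1) * batch size]
loss = criterion(logits, targets)
```

`nn.CrossEntropyLoss` expects 2D logits and 1D class indices here, which is why both tensors are flattened before the call.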

3.3.5 Full training code

