PyTorch study notes: Seq2Seq with packed padded sequences, masking and inference

PyTorch study notes: torchtext and PyTorch, example 4

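0. About the PyTorch Seq2Seq project

After getting through the basics of torchtext, I found this tutorial, "Understanding and implementing seq2seq models with PyTorch and torchtext".
The project consists of 6 sub-projects:

~~Sequence to Sequence Learning with Neural Networks~~
~~Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation~~
~~Neural Machine Translation by Jointly Learning to Align and Translate~~
Packed padded sequences, masking and inference
Convolutional Seq2Seq
Transformer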

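4. Packed padded sequences, masking and inference

This tutorial extends the previous model with packed padded sequences and masking.

Packed padded sequences let the RNN in the Encoder skip over padding tokens.
Masking explicitly forces the model to ignore certain values, such as attention over padded elements. Both techniques are widely used in NLP.

The tutorial also covers inference with the model: given a sentence, look at its translation and see which words the attention mechanism actually focuses on.

4.1 Imports and data preprocessing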

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
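# Field, BucketIterator and Multi30k.splits are the legacy torchtext API
# (moved to torchtext.legacy in 0.9 and removed in 0.12)
from torchtext.datasets import TranslationDataset, Multi30k
from torchtext.data import Field, BucketIterator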
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import spacy
import random
import math
import time

SEED=1234
random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic=True

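# the 'de'/'en' shortcut names only work in spaCy 2.x; spaCy 3 needs the
# full model names, e.g. 'de_core_news_sm' and 'en_core_web_sm'
spacy_de=spacy.load('de')
spacy_en=spacy.load('en')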
def tokenize_de(text):
    return [tok.text for tok in spacy_de.tokenizer(text)]
def tokenize_en(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]

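When using packed padded sequences, we have to tell PyTorch the real, unpadded length of each source sentence. torchtext's Field takes an include_lengths argument that turns batch.src into a tuple: the first element is the numericalized, padded source batch as a tensor, and the second is the length of each source sentence before padding.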
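
To make the (tensor, lengths) pair concrete, here is a minimal sketch that builds it by hand with torch.nn.utils.rnn.pad_sequence instead of torchtext. The token ids and PAD_IDX = 1 are made up for illustration; the sketch only mimics the layout of batch.src, not torchtext's internals.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX = 1  # assumed padding index, chosen for illustration only

# Three already-numericalized "sentences" of different lengths (made-up ids).
sentences = [
    torch.tensor([2, 5, 6, 7, 3]),
    torch.tensor([2, 8, 9, 3]),
    torch.tensor([2, 4, 3]),
]

# Unpadded lengths, one per sentence.
src_len = torch.tensor([len(s) for s in sentences])

# Pad to the longest sentence; the result is [max src len, batch size].
src = pad_sequence(sentences, padding_value=PAD_IDX)

# Same layout as batch.src when include_lengths=True.
batch_src = (src, src_len)
print(src.shape, src_len.tolist())
```

The lengths tensor is what pack_padded_sequence needs later; the padded tensor alone does not say where each sentence really ends.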

SRC=Field(tokenize=tokenize_de,init_token='<sos>',eos_token='<eos>',lower=True,include_lengths=True)
TRG=Field(tokenize=tokenize_en,init_token='<sos>',eos_token='<eos>',lower=True,include_lengths=True)

train_data,valid_data,test_data=Multi30k.splits(exts=('.de','.en'),fields=(SRC,TRG))
SRC.build_vocab(train_data,min_freq=2)
TRG.build_vocab(train_data,min_freq=2)

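One quirk of packed padded sequences is that all elements in a batch must be sorted by their unpadded length in descending order, i.e. the first sentence in the batch must be the longest. Two iterator arguments handle this: sort_within_batch tells the iterator that the contents of each batch need sorting, and sort_key is the function that tells it how to sort them. Here we sort by the length of the src sentence.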
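
The effect of the sorting requirement can be sketched by hand, without the iterator. The sentences below are made-up ids and the padding value 1 is an assumption; the point is only that lengths go in longest-first and that the packed result records how many sentences are still "alive" at each time step.

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

# Made-up numericalized sentences in arbitrary order.
batch = [
    torch.tensor([2, 4, 3]),
    torch.tensor([2, 5, 6, 7, 3]),
    torch.tensor([2, 8, 9, 3]),
]

# Sort by unpadded length, longest first (the job of sort_within_batch + sort_key).
batch = sorted(batch, key=len, reverse=True)
lengths = torch.tensor([len(s) for s in batch])

padded = pad_sequence(batch, padding_value=1)    # [max len, batch size]
packed = pack_padded_sequence(padded, lengths)   # expects descending lengths

# batch_sizes[t] = number of sentences that still have a real token at step t.
print(lengths.tolist(), packed.batch_sizes.tolist())
# prints [5, 4, 3] [3, 3, 3, 2, 1]
```

More recent PyTorch versions also accept enforce_sorted=False in pack_padded_sequence, which sorts internally; the tutorial sorts in the iterator instead.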

BATCH_SIZE=32
device=torch.device('cuda' if torch.cuda.is_available() else 'cpu')
train_iterator, valid_iterator, test_iterator=BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size=BATCH_SIZE,
    sort_within_batch=True,
    sort_key=lambda x:len(x.src),
    device=device
)

4.2 Building the model

4.2.1 Encoder

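The Encoder changes are in the forward method, which now also receives the lengths of the source sentences.

After the source sentence has been embedded, we can call pack_padded_sequence on it. Feeding the packed data to the RNN gives packed_outputs, a packed tensor containing the hidden states for every position in the sentences. hidden is a standard tensor and is not packed in any way; the only difference is that, because the input was a packed sequence, it comes from the last non-padded element of each sequence.
We then unpack packed_outputs with pad_packed_sequence, which returns the outputs and the length of each sequence; we don't need the lengths.
The first dimension of outputs is the padded sequence length, but thanks to the packed padded sequences the values are all zeros wherever the input was a padding token.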
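
A small sketch of the pack, run, unpack round trip, assuming a toy unidirectional GRU and random "embeddings" (the sizes are arbitrary, not the tutorial's). It checks the two claims above: padded steps come back as zeros, and hidden matches the output at the last real token.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

emb_dim, hid_dim = 4, 3
rnn = nn.GRU(emb_dim, hid_dim)  # unidirectional to keep the check simple

# Fake embedded batch: [max src len = 5, batch size = 2, emb dim].
embedded = torch.randn(5, 2, emb_dim)
src_len = torch.tensor([5, 3])  # the second sentence has 2 padded steps

packed = pack_padded_sequence(embedded, src_len)
packed_outputs, hidden = rnn(packed)
outputs, out_len = pad_packed_sequence(packed_outputs)

# outputs is padded back to [5, 2, hid_dim].
print(outputs.shape, out_len.tolist())

# Rows for the two padded steps of the second sentence are all zeros.
print(outputs[3:, 1])

# hidden holds the state after the last real token of each sentence.
print(torch.allclose(hidden[0, 1], outputs[2, 1]))
```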

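Encoder input arguments:

input_dim: dimension of the one-hot vectors fed into the encoder; this equals the source vocabulary size, i.e. the length of the input dictionary.
emb_dim: dimension of the embedding layer, which turns the one-hot vectors into dense vectors; 256.
Word embeddings in PyTorch only need a call to torch.nn.Embedding(m, n), where m is the total number of words and n is the embedding dimension. It is a form of dimensionality reduction, effectively a big matrix in which each row represents one word.
enc_hid_dim: dimension of the encoder hidden state (the GRU used here has no separate cell state); 512.
dec_hid_dim: dimension of the decoder hidden state; 512.
dropout: the amount of dropout to apply, a regularization parameter that guards against overfitting; 0.5.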
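
The "one row per word" view of torch.nn.Embedding can be checked directly. The sketch below uses toy sizes (6 words, 3 dimensions) rather than the tutorial's, and compares the lookup with an explicit one-hot matrix product.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, emb_dim = 6, 3  # toy sizes, not the tutorial's 256
embedding = nn.Embedding(vocab_size, emb_dim)

token_ids = torch.tensor([4, 0, 4])

# Looking up ids returns the matching rows of the weight matrix.
looked_up = embedding(token_ids)  # [3, emb_dim]

# The same result via explicit one-hot vectors times the weight matrix.
one_hot = F.one_hot(token_ids, num_classes=vocab_size).float()
via_matmul = one_hot @ embedding.weight

print(torch.allclose(looked_up, via_matmul))
```

In practice the one-hot vectors are never materialized; the layer indexes the matrix directly, which is why input_dim only has to match the vocabulary size.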

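Encoder return values:

outputs has size [src len, batch_size, hid_dim * num_directions], where the first hid_dim elements along the third axis are the hidden states of the forward RNN and the last hid_dim elements are those of the backward RNN. You can think of the third axis (hid_dim * num_directions) as the forward and backward hidden states stacked on top of each other, and of outputs as the collection of these stacked encoder hidden states for every source position.
hidden has size [n_layers * num_directions, batch_size, hid_dim], where [-2, :, :] is the top-layer forward RNN hidden state after its final time step (i.e. after it has seen the last word), and [-1, :, :] is the top-layer backward RNN hidden state after its final time step (i.e. after it has seen the first word of the sentence).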
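
The layout of outputs and hidden for a bidirectional GRU can be verified on a toy, unpadded batch. The sizes below are arbitrary assumptions; only the indexing pattern matters.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

emb_dim, hid_dim = 4, 3
rnn = nn.GRU(emb_dim, hid_dim, bidirectional=True)

# Unpadded toy batch: [src len = 5, batch size = 2, emb dim].
embedded = torch.randn(5, 2, emb_dim)
outputs, hidden = rnn(embedded)

print(outputs.shape)  # [5, 2, hid_dim * 2]
print(hidden.shape)   # [n_layers * num_directions = 2, 2, hid_dim]

# Forward state after the last token = first hid_dim features of the last step.
fwd_ok = torch.allclose(hidden[-2], outputs[-1, :, :hid_dim])
# Backward state after the first token = last hid_dim features of the first step.
bwd_ok = torch.allclose(hidden[-1], outputs[0, :, hid_dim:])
print(fwd_ok, bwd_ok)
```

This is why the Encoder concatenates hidden[-2] and hidden[-1] before the linear layer: together they summarize the sentence read in both directions.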

class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
        super(Encoder, self).__init__()
        
        self.input_dim=input_dim
        self.emb_dim=emb_dim
        self.enc_hid_dim=enc_hid_dim
        self.dec_hid_dim=dec_hid_dim
        self.dropout=dropout
        
        self.embedding=nn.Embedding(input_dim,emb_dim)
        self.rnn=nn.GRU(emb_dim, enc_hid_dim, bidirectional=True)
        self.fc=nn.Linear(enc_hid_dim*2,dec_hid_dim)
        self.dropout=nn.Dropout(dropout)
    def forward(self, src,src_len):
        embedded=self.dropout(self.embedding(src))
        # pack the padded batch; recent PyTorch versions require the lengths on the CPU
        packed_embedded=nn.utils.rnn.pack_padded_sequence(embedded,src_len.cpu())
        packed_outputs,hidden=self.rnn(packed_embedded)
        outputs,_=nn.utils.rnn.pad_packed_sequence(packed_outputs)
        hidden=torch.tanh(self.fc(torch.cat((hidden[-2,:,:],hidden[-1,:,:]),dim=1)))
        return outputs,hidden

4.2.2 Attention

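The attention module computes the attention values over the source sentence.
In the previous tutorial we let the module attend to padding tokens in the source sentence; with masking we can force attention onto the non-padded elements only.
The forward function now takes a mask input, a tensor of shape [batch size, source sentence length] that is 1 where the source token is not padding and 0 where it is.

Example:
["hello", "how", "are", "you", "?", <pad>, <pad>] -> [1, 1, 1, 1, 1, 0, 0].

We apply the mask after computing the attention scores but before normalizing them with softmax. It is applied with masked_fill, which fills the tensor at every element where the first argument (mask == 0) is true with the value given by the second argument (-1e10). In other words, it takes the unnormalized attention values and changes those over padded elements to -1e10. Because that number is hugely negative compared with the other values, these entries become zero after the softmax layer, so padding tokens in the source sentence receive no attention.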
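
The masking step can be tried in isolation. In this sketch the source ids, the scores and PAD_IDX = 1 are all made up; the mask is built the same way as described (source not equal to the pad index, transposed to [batch size, src len]).

```python
import torch
import torch.nn.functional as F

PAD_IDX = 1  # assumed padding index for this toy example

# src = [src len = 5, batch size = 2]; the second sentence ends in two pads.
src = torch.tensor([[2, 2],
                    [5, 8],
                    [6, 3],
                    [7, PAD_IDX],
                    [3, PAD_IDX]])

# mask = [batch size, src len]; True for real tokens, False for padding.
mask = (src != PAD_IDX).permute(1, 0)

# Made-up unnormalized attention scores, [batch size, src len].
energy = torch.tensor([[0.1, 0.4, 0.2, 0.9, 0.3],
                       [0.5, 0.2, 0.7, 0.8, 0.6]])

masked = energy.masked_fill(mask == 0, -1e10)
attention = F.softmax(masked, dim=1)

print(attention[1])          # the last two entries are 0
print(attention.sum(dim=1))  # each row still sums to 1
```

Filling with a large negative number before softmax, rather than zeroing after it, keeps each row a proper probability distribution over the real tokens.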

class Attention(nn.Module):
    def __init__(self, enc_hid_dim, dec_hid_dim):
        super(Attention,self).__init__()
        self.enc_hid_dim = enc_hid_dim
        self.dec_hid_dim = dec_hid_dim
        
        self.attn=nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim)
        self.v=nn.Parameter(torch.rand(dec_hid_dim))
        
    def forward(self,hidden,encoder_outputs,mask):
        batch_size=encoder_outputs.shape[1]
        src_len=encoder_outputs.shape[0]
        hidden=hidden.unsqueeze(1).repeat(1,src_len,1)
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        
        energy=torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim = 2))) 
        energy=energy.permute(0,2,1)
        v= self.v.repeat(batch_size, 1).unsqueeze(1)
        attention=torch.bmm(v,energy).squeeze(1)
        attention=attention.masked_fill(mask==0,-1e10)
        
        return F.softmax(attention, dim = 1)

4.2.3 Decoder

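The Decoder needs only a small change: it has to accept the mask over the source sentence and pass it on to the attention module. Since we want to look at the attention values during inference, we also return the attention tensor.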
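
How the attention tensor is consumed inside the Decoder can be sketched on its own: the weights turn the encoder outputs into one weighted context vector per sentence via torch.bmm. The attention weights below are hand-picked for illustration rather than produced by the attention module.

```python
import torch

torch.manual_seed(0)

batch_size, src_len, enc_hid_dim = 2, 4, 3

# encoder_outputs = [src len, batch size, enc hid dim * 2]
encoder_outputs = torch.randn(src_len, batch_size, enc_hid_dim * 2)

# Hand-picked attention weights, [batch size, src len]; each row sums to 1.
a = torch.tensor([[0.0, 1.0, 0.0, 0.0],   # all weight on source position 1
                  [0.5, 0.5, 0.0, 0.0]])  # average of positions 0 and 1

# bmm: [batch, 1, src len] x [batch, src len, feat] -> [batch, 1, feat]
weighted = torch.bmm(a.unsqueeze(1), encoder_outputs.permute(1, 0, 2))

print(weighted.shape)
print(torch.allclose(weighted[0, 0], encoder_outputs[1, 0]))
```

With masking in place, the weights over padded positions are zero, so padded encoder outputs contribute nothing to this sum.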

class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, attention):
        super().__init__()

        self.emb_dim = emb_dim
        self.enc_hid_dim = enc_hid_dim
        self.dec_hid_dim = dec_hid_dim
        self.output_dim = output_dim
        self.dropout = dropout
        self.attention = attention
        
        self.embedding = nn.Embedding(output_dim, emb_dim)
        
        self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim)
        
        self.out = nn.Linear((enc_hid_dim * 2) + dec_hid_dim + emb_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, encoder_outputs, mask):
             
        #input = [batch size]
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [src sent len, batch size, enc hid dim * 2]
        #mask = [batch size, src sent len]
        
        input = input.unsqueeze(0)
        
        #input = [1, batch size]
        
        embedded = self.dropout(self.embedding(input))
        
        #embedded = [1, batch size, emb dim]
        
        a = self.attention(hidden, encoder_outputs, mask)
                
        #a = [batch size, src sent len]
        
        a = a.unsqueeze(1)
        
        #a = [batch size, 1, src sent len]
        
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        
        #encoder_outputs = [batch size, src sent len, enc hid dim * 2]
        
        weighted = torch.bmm(a, encoder_outputs)
        
        #weighted = [batch size, 1, enc hid dim * 2]
        
        weighted = weighted.permute(1, 0, 2)
        
        #weighted = [1, batch size, enc hid dim * 2]
        
        rnn_input = torch.cat((embedded, weighted), dim = 2)
        
        #rnn_input = [1, batch size, (enc hid dim * 2) + emb dim]
            
        output, hidden = self.rnn(rnn_input, hidden.unsqueeze(0))
        
        #output = [sent len, batch size, dec hid dim * n directions]
        #hidden = [n layers * n directions, batch size, dec hid dim]
        
        #sent len, n layers and n directions will always be 1 in this decoder, therefore:
        #output = [1, batch size, dec hid dim]
        #hidden = [1, batch size, dec hid dim]
        #this also means that output == hidden
        assert (output == hidden).all()
        
        embedded = embedded.squeeze(0)
        output = output.squeeze(0)
        weighted = weighted.squeeze(0)
        
        output = self.out(torch.cat((output, weighted, embedded), dim = 1))
        
        #output = [bsz, output dim]
        
        return output, hidden.squeeze(0), a.squeeze(1)

4.2.4 Seq2Seq

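The Seq2Seq model needs to be told the indices of the pad, sos and eos tokens, and the source sentence lengths are passed to the forward method as an extra input.
We use the pad token index to create the mask: a tensor that is 1 wherever the source sentence is not equal to the pad token. This is all done in the create_mask function.
To use the model for inference, we simply pass None as the target sentence trg. This sets inference to True and creates a dummy trg tensor filled with <sos> tokens. It has to be filled with <sos> tokens because the first one is passed to the Decoder to start decoding; the rest are never used, since we assert that the teacher forcing ratio is 0 and the model therefore only uses its own predictions. The dummy target tensor has a maximum length of 100, which is the maximum number of target tokens we will try to output.
We also create an attention tensor to store the attention values for inference.
Inside the Decoder loop, when doing inference, we check whether the decoded token is the <eos> token; if so, we stop decoding immediately and return the translation and attention generated so far.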
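
The inference loop described above boils down to greedy decoding with an early stop. The sketch below replaces the real decoder with a hypothetical toy_decoder_step driven by a made-up transition table, and the special-token indices are assumptions; it is only meant to show the control flow.

```python
import torch

SOS_IDX, EOS_IDX = 2, 3  # assumed special-token indices
MAX_LEN = 100            # same cap as the dummy trg tensor
VOCAB_SIZE = 6

# Made-up deterministic transitions that end in <eos>.
NEXT_TOKEN = {2: 4, 4: 5, 5: 0, 0: 3}

def toy_decoder_step(prev_token):
    """Hypothetical stand-in for the decoder: returns logits over the vocabulary."""
    logits = torch.zeros(VOCAB_SIZE)
    logits[NEXT_TOKEN[prev_token]] = 1.0
    return logits

def greedy_decode():
    token = SOS_IDX  # decoding always starts from <sos>
    generated = []
    for _ in range(1, MAX_LEN):
        logits = toy_decoder_step(token)
        token = logits.argmax().item()  # no teacher forcing: feed back own prediction
        if token == EOS_IDX:            # stop as soon as <eos> is produced
            break
        generated.append(token)
    return generated

print(greedy_decode())
# prints [4, 5, 0]
```

If <eos> never appears, the loop simply runs until the length cap, which is the role of the 100-step dummy target.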

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, pad_idx, sos_idx, eos_idx, device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.pad_idx = pad_idx
        self.sos_idx = sos_idx
        self.eos_idx = eos_idx
        self.device = device
        
    def create_mask(self, src):
        mask = (src != self.pad_idx).permute(1, 0)
        return mask
        
    def forward(self, src, src_len, trg, teacher_forcing_ratio = 0.5):
        
        #src = [src sent len, batch size]
        #src_len = [batch size]
        #trg = [trg sent len, batch size]
        #teacher_forcing_ratio is probability to use teacher forcing
        #e.g. if teacher_forcing_ratio is 0.75 we use teacher forcing 75% of the time
        
        if trg is None:
            assert teacher_forcing_ratio == 0, "Must be zero during inference"
            inference = True
            trg = torch.zeros((100, src.shape[1])).long().fill_(self.sos_idx).to(src.device)
        else:
            inference = False
            
        batch_size = src.shape[1]
        max_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        #tensor to store decoder outputs
        outputs = torch.zeros(max_len, batch_size, trg_vocab_size).to(self.device)
        
        #tensor to store attention
        attentions = torch.zeros(max_len, batch_size, src.shape[0]).to(self.device)
        
        #encoder_outputs is all hidden states of the input sequence, back and forwards
        #hidden is the final forward and backward hidden states, passed through a linear layer
        encoder_outputs, hidden = self.encoder(src, src_len)
                
        #first input to the decoder is the <sos> tokens
        output = trg[0,:]
        
        mask = self.create_mask(src)
                
        #mask = [batch size, src sent len]
                
        for t in range(1, max_len):
            output, hidden, attention = self.decoder(output, hidden, encoder_outputs, mask)
            outputs[t] = output
            attentions[t] = attention
            teacher_force = random.random() < teacher_forcing_ratio
            top1 = output.max(1)[1]
            output = (trg[t] if teacher_force else top1)
            #output.item() only works for a single sentence, so inference assumes batch size 1
            if inference and output.item() == self.eos_idx:
                return outputs[:t], attentions[:t]
            
        return outputs, attentions