How to handle variable-length sequences in PyTorch with pack_padded_sequence and pad_packed_sequence

Suppose one batch contains the following four sentences as input, stored in a file named test:

1.I love China.
2.The dog is running.
3.Be yourself!
4.We are family!

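First, we convert each sentence into a sequence of word ids and pad the batch to equal length (sequences that are too short are filled with 0).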

import numpy as np
import wordfreq
vocab = {}  # dictionary mapping each word to its id
seqs=[]
seq_lengths = []

with open('test', 'r') as f:
    for line in f:
        # tokenize the line; [1:] drops the first token (the sentence number)
        tokens = wordfreq.tokenize(line.strip(), 'en')[1:]
        seq_lengths.append(len(tokens))
        current_seq = []
        for t in tokens:
            if t not in vocab:
                vocab[t] = len(vocab) + 1  # ids start at 1; 0 is reserved for padding
            current_seq.append(vocab[t])
        seqs.append(current_seq)
print(seq_lengths)
#[3, 4, 2, 3]
print(seqs)
#[[1, 2, 3], [4, 5, 6, 7], [8, 9], [10, 11, 12]]

seq_ids = np.zeros((len(seq_lengths), max(seq_lengths)))  # 0 is the padding id
for i, current_seq in enumerate(seqs):
    seq_ids[i, :len(current_seq)] = current_seq  # copy the ids, leaving trailing zeros as padding
print(seq_ids)
#[[ 1.  2.  3.  0.]
#[ 4.  5.  6.  7.]
#[ 8.  9.  0.  0.]
#[10. 11. 12.  0.]]
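The manual padding loop above can also be written with `torch.nn.utils.rnn.pad_sequence`. The following is a minimal sketch, assuming the same four id sequences; it is an alternative to the NumPy version, not what the rest of this walkthrough uses.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Token-id sequences from the example above
seqs = [[1, 2, 3], [4, 5, 6, 7], [8, 9], [10, 11, 12]]

# One 1-D int64 tensor per sentence
seq_tensors = [torch.tensor(s, dtype=torch.long) for s in seqs]

# Pad to the longest sequence; batch_first=True gives shape (batch, max_len)
padded = pad_sequence(seq_tensors, batch_first=True, padding_value=0)
print(padded)
print(padded.shape)
```

Because the result is already an integer tensor, no later conversion with `.long()` would be needed on this path.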

Next, to fit the way PyTorch handles variable-length sequences, we temporarily reorder the sequences by length.

import torch
import torch.nn as nn

seq_ids = torch.from_numpy(seq_ids)
lengths = torch.tensor(seq_lengths, dtype=torch.int64)  # pack_padded_sequence expects integer lengths
_, idx_sort = torch.sort(lengths, dim=0, descending=True)  # indices that order the batch from longest to shortest
_, idx_unsort = torch.sort(idx_sort, dim=0)  # inverse permutation, used later to restore the original order
seq_ids = seq_ids.index_select(0, idx_sort).long()  # reorder the id matrix from longest to shortest
print('id matrix ordered from longest to shortest')
print(seq_ids)
lengths = lengths[idx_sort]
print('lengths after reordering')
print(lengths)
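Why does sorting `idx_sort` give the indices that undo the reordering? The sketch below checks this on the lengths of our example: sorting a permutation yields its inverse, so applying one after the other restores the original order.

```python
import torch

lengths = torch.tensor([3, 4, 2, 3])

# idx_sort[k] = original row index that ends up at sorted position k
sorted_lengths, idx_sort = torch.sort(lengths, descending=True)

# Sorting the permutation itself yields its inverse:
# idx_unsort[i] = sorted position of original row i
_, idx_unsort = torch.sort(idx_sort)

# Applying idx_sort and then idx_unsort restores the original order
restored = sorted_lengths.index_select(0, idx_unsort)
print(torch.equal(restored, lengths))  # True
```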

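So far the token sequences are in the form PyTorch needs as input. However, the input to an RNN, GRU or LSTM at each time step is a word vector, so we look up the word vector for every token id.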

word_embeddings = nn.Embedding(13, 3).weight.data  # vocabulary size is the 12 token ids plus one extra row for the padding id 0
input_features = word_embeddings[seq_ids]
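Indexing `weight.data` directly works for a demonstration, but it bypasses autograd. In a model that is being trained, one would more commonly call the embedding module itself. The sketch below assumes the sorted id matrix from our example and additionally sets `padding_idx=0`, which is my choice here and not part of the original code.

```python
import torch
import torch.nn as nn

# 12 vocabulary ids plus id 0 for padding; 3-dimensional vectors as above
embedding = nn.Embedding(num_embeddings=13, embedding_dim=3, padding_idx=0)

# Id matrix already sorted from longest to shortest
seq_ids = torch.tensor([[4, 5, 6, 7],
                        [1, 2, 3, 0],
                        [10, 11, 12, 0],
                        [8, 9, 0, 0]])

# Calling the module keeps the lookup inside autograd
input_features = embedding(seq_ids)
print(input_features.shape)   # torch.Size([4, 4, 3])
print(input_features[1, 3])   # the padding position maps to a zero vector
```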

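At this point the input is fully prepared. If variable lengths were not a concern, input_features could already be fed into an RNN, GRU or LSTM. In practice, though, the sequences in a batch usually differ in length. To process them as a batch we padded the shorter ones with 0, but we do not want those 0s to take part in the computation. Take the third sentence: its length is 2, yet after padding it becomes [8. 9. 0. 0.]. When it is fed to the network, we want processing to stop at 9, so that the two padding 0s are skipped. PyTorch provides this through two functions, pack_padded_sequence and pad_packed_sequence. pack_padded_sequence can be thought of as compression (packing) and pad_packed_sequence as decompression (unpacking).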

The packing step is:

x_packed = nn.utils.rnn.pack_padded_sequence(input=input_features, lengths=lengths, batch_first=True).float()  # pack the padded batch
print('packed input')
print(x_packed)
#PackedSequence(data=tensor([[-0.9253, -0.9636, -0.3969],
#        [ 0.4371, -1.5157,  0.9859],
#        [ 0.4630, -0.0826, -0.3679],
#        [ 1.4977,  0.6071, -0.9768],
#        [-0.1515, -0.1085,  0.0887],
#        [ 0.5794, -0.2969,  1.6362],
#        [-0.4394,  1.2988,  2.7114],
#        [-0.0197,  0.8890, -0.5184],
#        [ 0.3094, -0.9346,  0.2803],
#        [ 1.3620, -1.0636, -0.2870],
#        [-0.5221,  0.6209, -0.0173],
#        [-2.2576, -0.7985,  0.3763]]), batch_sizes=tensor([4, 4, 3, 1]), sorted_indices=None, unsorted_indices=None)

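The result is a PackedSequence object with two main attributes, data and batch_sizes. data holds the input reorganized by time step: for example, its first four rows are the inputs of the four sequences at the first time step. batch_sizes gives the number of sequences that are still active at each time step, here [4, 4, 3, 1].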
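To make the layout of data and batch_sizes easier to see, here is a small sketch that packs the integer ids themselves instead of the word vectors (a simplification for illustration only), so each packed entry can be traced back to its sentence.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

# Padded ids, already sorted by decreasing length (lengths 4, 3, 3, 2)
seq_ids = torch.tensor([[4, 5, 6, 7],
                        [1, 2, 3, 0],
                        [10, 11, 12, 0],
                        [8, 9, 0, 0]])
lengths = torch.tensor([4, 3, 3, 2])

packed = pack_padded_sequence(seq_ids, lengths, batch_first=True)
print(packed.data)         # tensor([ 4,  1, 10,  8,  5,  2, 11,  9,  6,  3, 12,  7])
print(packed.batch_sizes)  # tensor([4, 4, 3, 1])

# batch_sizes[t] = number of sequences whose length exceeds t
manual = [int((lengths > t).sum()) for t in range(int(lengths.max()))]
print(manual)              # [4, 4, 3, 1]
```

The packed data reads column by column through the padded matrix and simply omits the padding entries, which is why no 0 appears in it.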

The packed object can now be fed into the network. We use a GRU as the example here.

# initialize
my_gru=nn.GRU(input_size=3,hidden_size=3,num_layers=2,batch_first=True)
h0 = torch.randn(2, 4, 3).float()  # initial hidden state: (num_layers, batch, hidden_size)

# forward
out, hn = my_gru(x_packed, h0)
print('output')
print(out)
#PackedSequence(data=tensor([[ 0.3617,  0.5106,  0.4123],
#        [ 0.8191, -1.2952, -0.6726],
#        [ 0.2693, -0.4393, -0.0136],
#        [-0.8634,  0.3315,  0.3762],
#        [ 0.3613,  0.2391,  0.0373],
#        [ 0.4725, -0.8466, -0.1161],
#        [ 0.1367, -0.3053,  0.3233],
#        [-0.5999,  0.0626,  0.5722],
#        [ 0.3956,  0.1379, -0.1179],
#        [ 0.3935, -0.5346, -0.2924],
#        [ 0.0423, -0.2920,  0.2441],
#        [ 0.4255,  0.0270, -0.3083]], grad_fn=), batch_sizes=tensor([4, 4, 3, 1]), sorted_indices=None, unsorted_indices=None)

The output is also a PackedSequence. To obtain the representation of each input sequence, we need to unpack this object.

output_padded = nn.utils.rnn.pad_packed_sequence(out, batch_first=True)  # returns a tuple: (padded output, lengths)
print('after unpacking')
print(output_padded)
#(tensor([[[ 0.3617,  0.5106,  0.4123],
#         [ 0.3613,  0.2391,  0.0373],
#         [ 0.3956,  0.1379, -0.1179],
#         [ 0.4255,  0.0270, -0.3083]],
#
#        [[ 0.8191, -1.2952, -0.6726],
#         [ 0.4725, -0.8466, -0.1161],
#         [ 0.3935, -0.5346, -0.2924],
#         [ 0.0000,  0.0000,  0.0000]],
#
#        [[ 0.2693, -0.4393, -0.0136],
#         [ 0.1367, -0.3053,  0.3233],
#         [ 0.0423, -0.2920,  0.2441],
#         [ 0.0000,  0.0000,  0.0000]],
#
#        [[-0.8634,  0.3315,  0.3762],
#         [-0.5999,  0.0626,  0.5722],
#         [ 0.0000,  0.0000,  0.0000],
#         [ 0.0000,  0.0000,  0.0000]]], grad_fn=), tensor([4, 3, 3, 2]))
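A common follow-up is to take, for each sequence, the output at its last real time step and not at the padded end. The sketch below is one way to do that with `gather`, assuming a random tensor `x` as a stand-in for input_features and a freshly initialized GRU, so its numbers will not match the output printed above.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
x = torch.randn(4, 4, 3)               # (batch, max_len, features), stand-in for input_features
lengths = torch.tensor([4, 3, 3, 2])   # sorted, as in the example
gru = nn.GRU(input_size=3, hidden_size=3, num_layers=2, batch_first=True)

out, hn = gru(pack_padded_sequence(x, lengths, batch_first=True))
padded, out_lengths = pad_packed_sequence(out, batch_first=True)

# Index of the last real time step of each sequence, expanded to (batch, 1, hidden)
last_idx = (out_lengths - 1).view(-1, 1, 1).expand(-1, 1, padded.size(2))
last_out = padded.gather(1, last_idx).squeeze(1)   # (batch, hidden)

# For a unidirectional GRU this should match the top layer's final hidden state
print(torch.allclose(last_out, hn[-1]))
```

This also shows what packing buys us: hn already holds the state at each sequence's true last step, which would not be the case if the padded batch were fed in directly.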

The unpacked output is still in the reordered (sorted) sequence order, so we have to restore the original order.

output = output_padded[0].index_select(0, idx_unsort)
print(output)
#tensor([[[ 0.8191, -1.2952, -0.6726],
#         [ 0.4725, -0.8466, -0.1161],
#         [ 0.3935, -0.5346, -0.2924],
#         [ 0.0000,  0.0000,  0.0000]],
#
#        [[ 0.3617,  0.5106,  0.4123],
#         [ 0.3613,  0.2391,  0.0373],
#         [ 0.3956,  0.1379, -0.1179],
#         [ 0.4255,  0.0270, -0.3083]],
#
#        [[-0.8634,  0.3315,  0.3762],
#         [-0.5999,  0.0626,  0.5722],
#         [ 0.0000,  0.0000,  0.0000],
#         [ 0.0000,  0.0000,  0.0000]],
#
#        [[ 0.2693, -0.4393, -0.0136],
#         [ 0.1367, -0.3053,  0.3233],
#         [ 0.0423, -0.2920,  0.2441],
#         [ 0.0000,  0.0000,  0.0000]]], grad_fn=)

With that, we have the outputs for the sequences in their original order and can move on to the downstream task.
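The printed PackedSequence above carries sorted_indices and unsorted_indices fields, which suggests a PyTorch version where pack_padded_sequence accepts `enforce_sorted=False`. If that argument is available in your installation, the manual sort and unsort steps can be left to PyTorch. The following is a minimal sketch under that assumption, again with a random tensor `x` as a stand-in for the embedded batch.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
lengths = torch.tensor([3, 4, 2, 3])   # original, unsorted order
x = torch.randn(4, 4, 3)               # stand-in for the embedded batch
gru = nn.GRU(input_size=3, hidden_size=3, num_layers=2, batch_first=True)

# enforce_sorted=False lets PyTorch sort internally and remember the permutation
packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
out, hn = gru(packed)

# The padded output comes back in the original batch order
output, out_lengths = pad_packed_sequence(out, batch_first=True)
print(output.shape)      # torch.Size([4, 4, 3])
print(out_lengths)       # tensor([3, 4, 2, 3])
```

The padded positions of the shorter sequences, for example `output[2, 2:]`, come back as zeros, just as in the manual version.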
