Suppose one batch contains the following four sentences as input, stored in a file named test:
1.I love China.
2.The dog is running.
3.Be yourself!
4.We are family!
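Step one: convert each sentence into a sequence of word ids, then pad the batch to equal length (sequences that are too short are filled with 0).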
import numpy as np
import wordfreq
vocab = {}  # maps each word to its id (0 is reserved for padding)
seqs = []
seq_lengths = []
with open('test', 'r') as f:
    for line in f:
        # tokenize the line; [1:] skips the first token (the sentence number)
        tokens = wordfreq.tokenize(line.strip(), 'en')[1:]
        seq_lengths.append(len(tokens))
        current_seq = []
        for t in tokens:
            if t not in vocab:
                vocab[t] = len(vocab) + 1
            current_seq.append(vocab[t])
        seqs.append(current_seq)
print(seq_lengths)
#[3, 4, 2, 3]
print(seqs)
#[[1, 2, 3], [4, 5, 6, 7], [8, 9], [10, 11, 12]]
seq_ids = np.zeros((len(seq_lengths), max(seq_lengths)))  # 0 is the padding id
for i, current_seq in enumerate(seqs):
    for j, token_id in enumerate(current_seq):
        seq_ids[i, j] = token_id
print(seq_ids)
#[[ 1. 2. 3. 0.]
#[ 4. 5. 6. 7.]
#[ 8. 9. 0. 0.]
#[10. 11. 12. 0.]]
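The padding loop above can also be written as a small helper. The following is a minimal sketch in plain Python; the function name `pad_batch` and its `pad_id` parameter are hypothetical, introduced here only for illustration.

```python
def pad_batch(seqs, pad_id=0):
    """Right-pad every id sequence with pad_id up to the longest length in the batch."""
    max_len = max(len(s) for s in seqs)
    return [list(s) + [pad_id] * (max_len - len(s)) for s in seqs]


seqs = [[1, 2, 3], [4, 5, 6, 7], [8, 9], [10, 11, 12]]
padded = pad_batch(seqs)
for row in padded:
    print(row)
```

The result holds the same values as the seq_ids matrix printed above, as nested lists of integers rather than a float array.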
Next, to fit the way PyTorch handles variable-length sequences, we temporarily reorder the sequences by length.
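import torch
import torch.nn as nn
seq_ids = torch.from_numpy(seq_ids)  # no Variable wrapper needed; tensors track gradients themselves
lengths = torch.tensor(seq_lengths)  # int64 lengths, kept on the CPU
_, idx_sort = torch.sort(lengths, dim=0, descending=True)  # sort by length, longest first
_, idx_unsort = torch.sort(idx_sort, dim=0)  # inverse permutation, used later to restore the original order
seq_ids = seq_ids.index_select(0, idx_sort).long()  # reorder the id matrix from longest to shortest
print('id matrix ordered from longest to shortest')
print(seq_ids)
lengths = lengths[idx_sort]
print('lengths after reordering')
print(lengths)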
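Why sorting idx_sort a second time gives the indices that undo the reordering is easier to see on plain lists. The sketch below mimics the two torch.sort calls with Python's sorted; the sentence strings stand in for the rows of the id matrix, and the tie-breaking order for equal lengths is an assumption (torch.sort may order ties differently).

```python
lengths = [3, 4, 2, 3]
batch = ["I love China", "The dog is running", "Be yourself", "We are family"]

# Indices that order the batch from longest to shortest (stable for ties).
idx_sort = sorted(range(len(lengths)), key=lambda i: -lengths[i])
# Argsort of idx_sort: entry k is the position where original row k ended up.
idx_unsort = sorted(range(len(idx_sort)), key=lambda i: idx_sort[i])

sorted_batch = [batch[i] for i in idx_sort]
restored = [sorted_batch[i] for i in idx_unsort]

print(idx_sort)           # [1, 0, 3, 2]
print(idx_unsort)         # [1, 0, 3, 2]
print(restored == batch)  # True
```

Because idx_sort is a permutation, its argsort is the inverse permutation, so indexing the sorted batch with idx_unsort always recovers the original order.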
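At this point the token sequences are in the form PyTorch expects. However, the input to an RNN, GRU, or LSTM at each time step is a word vector, so we look up the word vector for each token: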
word_embeddings = nn.Embedding(13, 3).weight.data  # vocabulary size = the 12 token ids plus one for the padding id 0
input_features = word_embeddings[seq_ids]
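At this point the input is fully prepared. If the sequences were not of variable length, we could feed input_features straight into an RNN, GRU, or LSTM. In practice, though, the sequences in a batch usually differ in length. To process them as a batch we padded every sequence shorter than the maximum length with 0s, but we do not want those 0s to take part in the computation. Take the third sentence: its length is 2, and after padding it becomes [8. 9. 0. 0.]. Once it is fed to the network, we want the network to stop after processing 9; that is, the two padded 0s should not take part in the computation. PyTorch provides this through two functions, pack_padded_sequence and pad_packed_sequence. pack_padded_sequence can be thought of as compression, and pad_packed_sequence as decompression.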
The packing (compression) step is:
x_packed = nn.utils.rnn.pack_padded_sequence(input=input_features, lengths=lengths, batch_first=True).float()  # pack the sequences
print('packed input')
print(x_packed)
#PackedSequence(data=tensor([[-0.9253, -0.9636, -0.3969],
[ 0.4371, -1.5157, 0.9859],
[ 0.4630, -0.0826, -0.3679],
[ 1.4977, 0.6071, -0.9768],
[-0.1515, -0.1085, 0.0887],
[ 0.5794, -0.2969, 1.6362],
[-0.4394, 1.2988, 2.7114],
[-0.0197, 0.8890, -0.5184],
[ 0.3094, -0.9346, 0.2803],
[ 1.3620, -1.0636, -0.2870],
[-0.5221, 0.6209, -0.0173],
[-2.2576, -0.7985, 0.3763]]), batch_sizes=tensor([4, 4, 3, 1]), sorted_indices=None, unsorted_indices=None)
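The result is a PackedSequence object with two main attributes, data and batch_sizes. data holds the inputs reorganized by time step: for example, its first four rows are the inputs for the first time step of the batch. batch_sizes records how many sequences are still active at each time step; here tensor([4, 4, 3, 1]) matches the sorted lengths 4, 3, 3, 2.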
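To make the layout of data and batch_sizes concrete, here is a minimal sketch that reproduces it on the id matrix instead of on embeddings. The function `pack` is a hypothetical illustration written for this walkthrough, not PyTorch's implementation.

```python
def pack(padded, lengths):
    """Flatten a length-sorted padded batch time step by time step, skipping padding.

    padded: rows sorted from longest to shortest; lengths: their true lengths.
    Returns (data, batch_sizes) in the same layout as a PackedSequence.
    """
    data, batch_sizes = [], []
    for t in range(max(lengths)):
        # Rows still active at step t are exactly those with length > t.
        alive = [row[t] for row, n in zip(padded, lengths) if n > t]
        data.extend(alive)
        batch_sizes.append(len(alive))
    return data, batch_sizes


padded = [[4, 5, 6, 7], [1, 2, 3, 0], [10, 11, 12, 0], [8, 9, 0, 0]]
lengths = [4, 3, 3, 2]
data, batch_sizes = pack(padded, lengths)
print(data)         # [4, 1, 10, 8, 5, 2, 11, 9, 6, 3, 12, 7]
print(batch_sizes)  # [4, 4, 3, 1]
```

No padding id appears in data, which is why the network never computes on the padded positions.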
The packed object can now be fed to the network. Here we use a GRU as an example:
# initialize
my_gru = nn.GRU(input_size=3, hidden_size=3, num_layers=2, batch_first=True)
h0 = torch.randn(2, 4, 3).float()  # initial hidden state: (num_layers, batch_size, hidden_size); a GRU has no cell state
# forward
out, hn = my_gru(x_packed, h0)
print('output')
print(out)
#PackedSequence(data=tensor([[ 0.3617, 0.5106, 0.4123],
[ 0.8191, -1.2952, -0.6726],
[ 0.2693, -0.4393, -0.0136],
[-0.8634, 0.3315, 0.3762],
[ 0.3613, 0.2391, 0.0373],
[ 0.4725, -0.8466, -0.1161],
[ 0.1367, -0.3053, 0.3233],
[-0.5999, 0.0626, 0.5722],
[ 0.3956, 0.1379, -0.1179],
[ 0.3935, -0.5346, -0.2924],
[ 0.0423, -0.2920, 0.2441],
[ 0.4255, 0.0270, -0.3083]], grad_fn=), batch_sizes=tensor([4, 4, 3, 1]), sorted_indices=None, unsorted_indices=None)
The output is also a PackedSequence. To get the representation of each input sequence, we need to unpack it:
output_padded = nn.utils.rnn.pad_packed_sequence(out, batch_first=True)  # returns a tuple: (padded outputs, lengths)
print('after unpacking')
print(output_padded)
#(tensor([[[ 0.3617, 0.5106, 0.4123],
[ 0.3613, 0.2391, 0.0373],
[ 0.3956, 0.1379, -0.1179],
[ 0.4255, 0.0270, -0.3083]],
[[ 0.8191, -1.2952, -0.6726],
[ 0.4725, -0.8466, -0.1161],
[ 0.3935, -0.5346, -0.2924],
[ 0.0000, 0.0000, 0.0000]],
[[ 0.2693, -0.4393, -0.0136],
[ 0.1367, -0.3053, 0.3233],
[ 0.0423, -0.2920, 0.2441],
[ 0.0000, 0.0000, 0.0000]],
[[-0.8634, 0.3315, 0.3762],
[-0.5999, 0.0626, 0.5722],
[ 0.0000, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000]]], grad_fn=), tensor([4, 3, 3, 2]))
The unpacked output follows the reordered (length-sorted) sequences, so we still have to restore the original order:
output = output_padded[0].index_select(0, idx_unsort)
print(output)
#tensor([[[ 0.8191, -1.2952, -0.6726],
[ 0.4725, -0.8466, -0.1161],
[ 0.3935, -0.5346, -0.2924],
[ 0.0000, 0.0000, 0.0000]],
[[ 0.3617, 0.5106, 0.4123],
[ 0.3613, 0.2391, 0.0373],
[ 0.3956, 0.1379, -0.1179],
[ 0.4255, 0.0270, -0.3083]],
[[-0.8634, 0.3315, 0.3762],
[-0.5999, 0.0626, 0.5722],
[ 0.0000, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000]],
[[ 0.2693, -0.4393, -0.0136],
[ 0.1367, -0.3053, 0.3233],
[ 0.0423, -0.2920, 0.2441],
[ 0.0000, 0.0000, 0.0000]]], grad_fn=)
At this point we have the outputs for the sequences in their original order, and can move on to the downstream task.
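As an alternative to sorting and unsorting by hand, the sketch below runs the same pipeline with enforce_sorted=False, assuming a PyTorch version whose pack_padded_sequence accepts that argument. It uses an nn.Embedding layer with padding_idx=0 in place of the raw weight lookup, and the variable names (embedding, gru, last_step) are choices made for this illustration, not part of the walkthrough above.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

# Padded id matrix in the ORIGINAL sentence order, with the true lengths.
seq_ids = torch.tensor([[1, 2, 3, 0],
                        [4, 5, 6, 7],
                        [8, 9, 0, 0],
                        [10, 11, 12, 0]])
lengths = torch.tensor([3, 4, 2, 3])  # int64, on the CPU

embedding = nn.Embedding(13, 3, padding_idx=0)
gru = nn.GRU(input_size=3, hidden_size=3, num_layers=2, batch_first=True)

# enforce_sorted=False lets PyTorch sort and unsort the batch internally.
packed = pack_padded_sequence(embedding(seq_ids), lengths,
                              batch_first=True, enforce_sorted=False)
out_packed, hn = gru(packed)  # the initial hidden state defaults to zeros
out, out_lengths = pad_packed_sequence(out_packed, batch_first=True)

# Output at the last real (non-padded) step of each sentence.
last_step = out[torch.arange(out.size(0)), out_lengths - 1]

print(out.shape)    # torch.Size([4, 4, 3])
print(out_lengths)  # tensor([3, 4, 2, 3])
```

Here out is already in the original sentence order, and last_step gives one vector per sentence that can serve as its representation; it matches the top-layer final hidden state hn[-1].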