雷锋网 (Leiphone) published a Chinese translation of a technical blog post (title: Taming LSTMs: Variable-sized mini-batches and why PyTorch is good for your health, author: William Falcon; translated by 赵朋飞, 马力群, 涂世文; edited by MY). The post explains how to feed variable-length sequences into an LSTM in PyTorch. Its example is simple, the walkthrough detailed, and the language accessible, making it one of the best introductions to variable-length sequence handling. For whatever reason, however, the code contains quite a few errors, which greatly weakens its value as a tutorial. The post also never prints the results of several key methods, which makes it harder for beginners to follow. This article therefore takes that translation as its starting point and, combined with my own understanding, presents the corrected code together with the key intermediate results, in the hope of making variable-length sequences easier to understand and handle.
If you have used PyTorch for deep learning research and experiments, you have probably felt that burst of joy and energy, almost like walking in the sunshine and realizing how good life can be. But the moment you try to implement mini-batch RNNs with variable-sized sequences in PyTorch, everything seems to fall apart again.
Don't worry, there is still hope. After reading this article you will get that feeling back: you and PyTorch strolling in the sunshine while your recurrent network hits an accuracy you have only ever read about on arXiv. How exciting!
Here are a few tricks we will share:
1. How to implement an LSTM in PyTorch with variable-sized sequences in a mini-batch.
2. What pack_padded_sequence and pad_packed_sequence do in PyTorch and how they work.
3. How to mask out the padding tokens in backpropagation through time.
Tips: pad the text so all sequences have the same length; apply pack_padded_sequence; run the LSTM; apply pad_packed_sequence; flatten all outputs and labels; mask out the padded outputs; compute the cross-entropy loss.
Why bother? Speed and performance, of course.
Feeding variable-length elements into an LSTM at the same time used to be a thorny technical challenge, but frameworks like PyTorch have largely solved it (TensorFlow also has a good solution, but it looks very, very complicated).
Besides, the documentation does not explain this clearly, and the examples are outdated. Doing it right means using gradients computed from many samples rather than from a single one, which speeds up training and makes gradient descent more accurate.
Even though RNNs are hard to parallelize, because each step depends on the previous one, mini-batching still yields a large speedup.
Let's try a simple sequence labeling task: we will build an LSTM/GRU model to do part-of-speech tagging on Justin Bieber lyrics, for example "is it too late now to say sorry?" (with 'to' and '?' removed).
In practice you would do a lot of preprocessing, but we skip that here for brevity. To keep things simple, let's construct this artificial dataset from sequences of different lengths.
sent_1_x = ['is', 'it', 'too', 'late', 'now', 'say', 'sorry']
sent_1_y = ['VB', 'PRP', 'RB', 'RB', 'RB', 'VB', 'JJ']
sent_2_x = ['ooh', 'ooh']
sent_2_y = ['NNP', 'NNP']
sent_3_x = ['sorry', 'yeah']
sent_3_y = ['JJ', 'NNP']
X = [sent_1_x, sent_2_x, sent_3_x]
Y = [sent_1_y, sent_2_y, sent_3_y]
When we feed each sentence into the embedding layer, every word is mapped to an index, so we need to convert the sentences into lists of integers.
Here we map the sentences to their indices in the vocabulary (V).
# map sentences to vocab
vocab = {'<PAD>': 0, 'is': 1, 'it': 2, 'too': 3, 'late': 4, 'now': 5, 'say': 6, 'sorry': 7, 'ooh': 8, 'yeah': 9}
# fancy nested list comprehension
X = [[vocab[word] for word in sentence] for sentence in X]
print(X)
[[1, 2, 3, 4, 5, 6, 7], [8, 8], [7, 9]]
We do the same for the class labels (POS tags in our case); these will not be embedded.
tags = {'<PAD>': 0, 'VB': 1, 'PRP': 2, 'RB': 3, 'JJ': 4, 'NNP': 5}
# fancy nested list comprehension
Y = [[tags[tag] for tag in sentence] for sentence in Y]
print(Y)
[[1, 2, 3, 3, 3, 1, 4], [5, 5], [4, 5]]
So what is allowed to have different lengths in the model? Certainly not the sequences inside a single mini-batch!
When working with PyTorch, we need to record the length of each sequence before padding. We will use this information to mask out the loss so that it is not computed on the padded elements.
import numpy as np
# get the length of each sentence
X_lengths = [len(sentence) for sentence in X]
# create an empty matrix with padding tokens
pad_token = vocab['<PAD>']
longest_sent = max(X_lengths)
batch_size = len(X)
padded_X = np.ones((batch_size, longest_sent)) * pad_token
# copy over the actual sequences
for i, x_len in enumerate(X_lengths):
    sequence = X[i]
    padded_X[i, 0:x_len] = sequence[:x_len]
print(padded_X)
[[1. 2. 3. 4. 5. 6. 7.]
[8. 8. 0. 0. 0. 0. 0.]
[7. 9. 0. 0. 0. 0. 0.]]
We pad the labels in the same way:
Y_lengths = [len(sentence) for sentence in Y]
# create an empty matrix with padding tokens
pad_token = tags['<PAD>']
longest_sent = max(Y_lengths)
batch_size = len(Y)
padded_Y = np.ones((batch_size, longest_sent)) * pad_token
# copy over the actual sequences
for i, y_len in enumerate(Y_lengths):
    sequence = Y[i]
    padded_Y[i, 0:y_len] = sequence[:y_len]
print(padded_Y)
[[1. 2. 3. 3. 3. 1. 4.]
[5. 5. 0. 0. 0. 0. 0.]
[4. 5. 0. 0. 0. 0. 0.]]
We have converted the words into index sequences and zero-padded every sequence, so each mini-batch now has a uniform length.
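As a side note (not part of the original post), PyTorch's own torch.nn.utils.rnn.pad_sequence can build the same padded matrix directly from a list of variable-length tensors. A minimal sketch, assuming X still holds the index lists from above:
import torch
from torch.nn.utils.rnn import pad_sequence
# build the same (3, 7) padded matrix, using the padding index as the fill value
X_tensors = [torch.tensor(sentence) for sentence in X]
padded_X_alt = pad_sequence(X_tensors, batch_first=True, padding_value=vocab['<PAD>'])
print(padded_X_alt)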
With PyTorch we can build a very simple LSTM network. The model consists of the following layers:
1. Embedding layer
2. LSTM layer
3. Linear (fully connected) layer
4. Softmax layer
import torch
import torch.nn as nn
import torch.nn.functional as F
nb_tags = len(tags) - 1
nb_vocab_words = len(vocab)
batch_size, seq_len = padded_X.shape
embedding_dim = 3
nb_lstm_units = 10
nb_layers = 2
padding_idx = vocab['<PAD>']
word_embedding = nn.Embedding(
num_embeddings=nb_vocab_words,
embedding_dim=embedding_dim,
padding_idx=padding_idx
)
# design LSTM
lstm = nn.LSTM(
input_size=embedding_dim,
hidden_size=nb_lstm_units,
num_layers=nb_layers,
batch_first=True
)
# output layer which projects back to tag space
hidden_to_tag = nn.Linear(nb_lstm_units, nb_tags)
hidden_a = torch.randn(nb_layers, batch_size, nb_lstm_units).float()
hidden_b = torch.randn(nb_layers, batch_size, nb_lstm_units).float()
To repeat: every sequence in the batch we are about to feed in has now been padded to the same length.
In the forward pass we will:
1. Embed the input sequences (word embedding).
2. Use pack_padded_sequence to make sure the LSTM does not process the padding elements.
3. Run the packed batch through the LSTM.
4. Undo the pack_padded_sequence operation with pad_packed_sequence.
5. Reshape the LSTM output so it can be fed into the linear layer.
6. Apply log_softmax along the tag dimension.
7. Reshape back, so the final output has dimensions (batch_size, seq_len, nb_tags).
# 1. embed the input
# Dim transformation: (batch_size, seq_len) -> (batch_size, seq_len, embedding_dim)
X = torch.tensor(padded_X).long()
X = word_embedding(X)
print('embedded', X)
embedded tensor([[[ 1.3190, 0.0872, -0.2742],
[-0.1677, 1.1510, 0.4656],
[-0.8435, -0.5562, 0.9256],
[-0.8396, -0.0076, -1.5482],
[-0.5656, -1.3909, -0.7842],
[-0.5416, 1.7457, 0.4726],
[ 1.1060, 0.8440, -0.5556]],
[[ 0.6334, -1.5088, 1.0840],
[ 0.6334, -1.5088, 1.0840],
[ 0.0000, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000]],
[[ 1.1060, 0.8440, -0.5556],
[ 1.2046, -1.6742, 0.6964],
[ 0.0000, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000]]], grad_fn=<...>)
pack_padded_sequence compresses the padded tensor down to the actual data according to the lengths you pass in, and the result becomes a PackedSequence. It has three main arguments: input, lengths, and batch_first. input is our padded data, lengths holds the actual length of each sequence, and batch_first simply says that the batch dimension comes first.
Tips: But why do we need pack_padded_sequence at all? Can't we feed the padded data straight into the RNN? We certainly can, but look at how the data is actually consumed; here is an example batch.
tensor([[1, 2, 3, 4, 5, 6, 7],
[2, 3, 4, 5, 6, 7, 0]])
The RNN actually consumes this batch one time step at a time, in the order [1, 2], [2, 3], [3, 4], [4, 5], [5, 6], [6, 7], [7, 0]. Notice the last step is [7, 0]: feeding that 0 into the RNN produces nothing useful and just wastes compute, which is why we compress the batch with pack_padded_sequence.
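To make this concrete, here is a small illustration of my own (not from the original post) of what pack_padded_sequence does to this toy batch, assuming the two sequences have real lengths 7 and 6:
import torch
from torch.nn.utils.rnn import pack_padded_sequence
toy_batch = torch.tensor([[1, 2, 3, 4, 5, 6, 7],
                          [2, 3, 4, 5, 6, 7, 0]])
toy_lengths = [7, 6]  # real lengths, sorted from longest to shortest
packed = pack_padded_sequence(toy_batch, toy_lengths, batch_first=True)
print(packed.data)         # tensor([1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7])
print(packed.batch_sizes)  # tensor([2, 2, 2, 2, 2, 2, 1])
The padded 0 at the last time step simply disappears from the packed data, and batch_sizes records how many sequences are still "alive" at each time step.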
# Dim transformation: (batch_size, seq_len, embedding_dim) -> (batch_size, seq_len, nb_lstm_units)
# pack_padded_sequence so that padded items in the sequence won't be shown to the LSTM
X = torch.nn.utils.rnn.pack_padded_sequence(X, X_lengths, batch_first=True)
print('pack_padded:', X)
pack_padded: PackedSequence(data=tensor([[ 1.3190, 0.0872, -0.2742],
[ 0.6334, -1.5088, 1.0840],
[ 1.1060, 0.8440, -0.5556],
[-0.1677, 1.1510, 0.4656],
[ 0.6334, -1.5088, 1.0840],
[ 1.2046, -1.6742, 0.6964],
[-0.8435, -0.5562, 0.9256],
[-0.8396, -0.0076, -1.5482],
[-0.5656, -1.3909, -0.7842],
[-0.5416, 1.7457, 0.4726],
[ 1.1060, 0.8440, -0.5556]], grad_fn=<...>), batch_sizes=tensor([3, 3, 1, 1, 1, 1, 1]), sorted_indices=None, unsorted_indices=None)
Now, run the LSTM.
X, hidden = lstm(X, (hidden_a, hidden_b))
print('lstm output shape in packed seq: ', X[0].size())
print(X)
lstm output shape in packed seq: torch.Size([11, 10])
PackedSequence(data=tensor([[ 1.5307e-01, 6.7684e-02, 6.4468e-02, -2.2887e-01, 2.0291e-01,
1.6192e-02, 9.1459e-03, 1.6604e-01, 2.4689e-01, 2.1277e-01],
[-2.0542e-01, -6.3485e-02, 2.1305e-02, -1.8940e-01, 3.6822e-01,
-4.2697e-04, -1.2188e-02, 1.4914e-01, -1.9662e-01, 4.1007e-03],
[-1.8556e-01, 5.8267e-01, -5.5726e-02, 3.2447e-01, -5.6095e-02,
1.0067e-01, 1.5416e-02, -6.1702e-01, -3.9697e-02, 3.5665e-03],
[ 1.1510e-01, -3.5210e-02, 1.6324e-01, -1.1573e-01, 1.2481e-01,
-1.3048e-01, 1.2843e-02, 2.3278e-03, 5.5453e-02, 1.4491e-01],
[-1.8321e-01, -1.2067e-01, 5.8485e-02, -1.5943e-01, 2.3355e-01,
-1.0162e-01, -2.4926e-02, -7.1134e-02, -1.7803e-01, 3.8604e-02],
[-7.5403e-02, 1.5720e-01, 5.6410e-02, 4.4938e-02, -3.5079e-02,
-1.0616e-02, 6.9829e-02, -3.6775e-01, -4.3459e-02, 7.8495e-02],
[ 6.3267e-03, -1.0364e-01, 1.5995e-01, -1.1714e-01, 8.7210e-02,
-1.7794e-01, -6.6597e-03, -9.0191e-02, 3.0669e-03, 1.1943e-01],
[-4.8220e-02, -1.3315e-01, 1.5506e-01, -1.2272e-01, 7.2997e-02,
-1.8515e-01, -3.2535e-02, -1.3140e-01, -2.9254e-02, 1.0733e-01],
[-7.8605e-02, -1.4908e-01, 1.4581e-01, -1.3867e-01, 6.2283e-02,
-1.9139e-01, -5.0125e-02, -1.5407e-01, -4.2093e-02, 9.7522e-02],
[-9.3640e-02, -1.4058e-01, 1.4463e-01, -1.4144e-01, 7.2176e-02,
-1.7028e-01, -4.0201e-02, -1.7073e-01, -6.1939e-02, 1.0614e-01],
[-1.0190e-01, -1.3188e-01, 1.3591e-01, -1.2254e-01, 8.6571e-02,
-1.6958e-01, -4.1146e-02, -1.7248e-01, -8.0702e-02, 9.7578e-02]],
grad_fn=<...>), batch_sizes=tensor([3, 3, 1, 1, 1, 1, 1]), sorted_indices=None, unsorted_indices=None)
From the output above we can see that, because the LSTM never computed anything for the padding, its output has shape (11, 10), whereas a regular LSTM would output (3, 7, 10). So we need to add the padding back and restore the shape to (3, 7, 10), which is exactly what pad_packed_sequence does.
# undo the packing operation
X, _ = torch.nn.utils.rnn.pad_packed_sequence(X, batch_first=True)
print('un packed:', X.size())
print(X)
un packed: torch.Size([3, 7, 10])
tensor([[[ 1.5307e-01, 6.7684e-02, 6.4468e-02, -2.2887e-01, 2.0291e-01,
1.6192e-02, 9.1459e-03, 1.6604e-01, 2.4689e-01, 2.1277e-01],
[ 1.1510e-01, -3.5210e-02, 1.6324e-01, -1.1573e-01, 1.2481e-01,
-1.3048e-01, 1.2843e-02, 2.3278e-03, 5.5453e-02, 1.4491e-01],
[ 6.3267e-03, -1.0364e-01, 1.5995e-01, -1.1714e-01, 8.7210e-02,
-1.7794e-01, -6.6597e-03, -9.0191e-02, 3.0669e-03, 1.1943e-01],
[-4.8220e-02, -1.3315e-01, 1.5506e-01, -1.2272e-01, 7.2997e-02,
-1.8515e-01, -3.2535e-02, -1.3140e-01, -2.9254e-02, 1.0733e-01],
[-7.8605e-02, -1.4908e-01, 1.4581e-01, -1.3867e-01, 6.2283e-02,
-1.9139e-01, -5.0125e-02, -1.5407e-01, -4.2093e-02, 9.7522e-02],
[-9.3640e-02, -1.4058e-01, 1.4463e-01, -1.4144e-01, 7.2176e-02,
-1.7028e-01, -4.0201e-02, -1.7073e-01, -6.1939e-02, 1.0614e-01],
[-1.0190e-01, -1.3188e-01, 1.3591e-01, -1.2254e-01, 8.6571e-02,
-1.6958e-01, -4.1146e-02, -1.7248e-01, -8.0702e-02, 9.7578e-02]],
[[-2.0542e-01, -6.3485e-02, 2.1305e-02, -1.8940e-01, 3.6822e-01,
-4.2697e-04, -1.2188e-02, 1.4914e-01, -1.9662e-01, 4.1007e-03],
[-1.8321e-01, -1.2067e-01, 5.8485e-02, -1.5943e-01, 2.3355e-01,
-1.0162e-01, -2.4926e-02, -7.1134e-02, -1.7803e-01, 3.8604e-02],
[ 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
[ 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
[ 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
[ 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
[ 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00]],
[[-1.8556e-01, 5.8267e-01, -5.5726e-02, 3.2447e-01, -5.6095e-02,
1.0067e-01, 1.5416e-02, -6.1702e-01, -3.9697e-02, 3.5665e-03],
[-7.5403e-02, 1.5720e-01, 5.6410e-02, 4.4938e-02, -3.5079e-02,
-1.0616e-02, 6.9829e-02, -3.6775e-01, -4.3459e-02, 7.8495e-02],
[ 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
[ 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
[ 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
[ 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
[ 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00]]],
grad_fn=<...>)
Now run the linear layer.
# 3. Project to tag space
# Dim transformation: (batch_size, seq_len, nb_lstm_units) -> (batch_size * seq_len, nb_lstm_units)
# this one is a bit tricky as well. First we need to reshape the data so it goes into the linear layer
X = X.contiguous()
X = X.view(-1, X.shape[2])
# run through actual linear layer
X = hidden_to_tag(X)
print(X)
tensor([[-0.2492, -0.0340, 0.2604, -0.1128, 0.1133],
[-0.2084, 0.0336, 0.1851, -0.1483, 0.1951],
[-0.1816, 0.0132, 0.1213, -0.1773, 0.2254],
[-0.1661, 0.0026, 0.0929, -0.1934, 0.2455],
[-0.1564, -0.0099, 0.0747, -0.2039, 0.2592],
[-0.1581, -0.0089, 0.0738, -0.2018, 0.2586],
[-0.1506, -0.0080, 0.0705, -0.1919, 0.2626],
[-0.0720, 0.0107, 0.0915, -0.0448, 0.2584],
[-0.0969, -0.0102, 0.0507, -0.1165, 0.2872],
[-0.1556, 0.0033, 0.1379, -0.1076, 0.1813],
[-0.1556, 0.0033, 0.1379, -0.1076, 0.1813],
[-0.1556, 0.0033, 0.1379, -0.1076, 0.1813],
[-0.1556, 0.0033, 0.1379, -0.1076, 0.1813],
[-0.1556, 0.0033, 0.1379, -0.1076, 0.1813],
[-0.2086, -0.2195, 0.1842, -0.1434, 0.1055],
[-0.2085, -0.0694, 0.1326, -0.1655, 0.1689],
[-0.1556, 0.0033, 0.1379, -0.1076, 0.1813],
[-0.1556, 0.0033, 0.1379, -0.1076, 0.1813],
[-0.1556, 0.0033, 0.1379, -0.1076, 0.1813],
[-0.1556, 0.0033, 0.1379, -0.1076, 0.1813],
[-0.1556, 0.0033, 0.1379, -0.1076, 0.1813]],
grad_fn=<...>)
Finally, apply log_softmax so we can classify.
Tips: log_softmax(x) is equivalent to log(softmax(x)), and the matching loss function is nn.NLLLoss. nn.NLLLoss takes a vector of log-probabilities and a target label: for each sample it picks out the output value at the position of the label, flips its sign, and averages the results. (See "Pytorch损失函数torch.nn.NLLLoss()详解", https://blog.csdn.net/Jeremy_lf/article/details/102725285.)
# 4. Create softmax activations bc we're doing classification
# Dim transformation: (batch_size * seq_len, nb_lstm_units) -> (batch_size, seq_len, nb_tags)
X = F.log_softmax(X, dim=1)
# I like to reshape for mental sanity so we're back to (batch_size, seq_len, nb_tags)
X = X.view(batch_size, seq_len, nb_tags)
Y_hat = X
print(Y_hat)
tensor([[[-1.8699, -1.6547, -1.3603, -1.7335, -1.5074],
[-1.8429, -1.6009, -1.4494, -1.7828, -1.4395],
[-1.8042, -1.6095, -1.5014, -1.7999, -1.3972],
[-1.7854, -1.6166, -1.5264, -1.8126, -1.3738],
[-1.7727, -1.6262, -1.5416, -1.8202, -1.3571],
[-1.7743, -1.6251, -1.5424, -1.8180, -1.3576],
[-1.7702, -1.6276, -1.5490, -1.8115, -1.3570]],
[[-1.7375, -1.6548, -1.5739, -1.7103, -1.4071],
[-1.7402, -1.6535, -1.5926, -1.7598, -1.3561],
[-1.7856, -1.6267, -1.4921, -1.7376, -1.4487],
[-1.7856, -1.6267, -1.4921, -1.7376, -1.4487],
[-1.7856, -1.6267, -1.4921, -1.7376, -1.4487],
[-1.7856, -1.6267, -1.4921, -1.7376, -1.4487],
[-1.7856, -1.6267, -1.4921, -1.7376, -1.4487]],
[[-1.7761, -1.7870, -1.3833, -1.7109, -1.4620],
[-1.8014, -1.6624, -1.4603, -1.7584, -1.4240],
[-1.7856, -1.6267, -1.4921, -1.7376, -1.4487],
[-1.7856, -1.6267, -1.4921, -1.7376, -1.4487],
[-1.7856, -1.6267, -1.4921, -1.7376, -1.4487],
[-1.7856, -1.6267, -1.4921, -1.7376, -1.4487],
[-1.7856, -1.6267, -1.4921, -1.7376, -1.4487]]],
grad_fn=<...>)
Finally, we are ready to compute the loss. The key point is that we do not want the padding elements to affect the final result.
Tips: The best approach is to flatten all of the network outputs and labels, and then compute the loss on that single long sequence.
The loss function is defined as follows.
def loss(Y_hat, Y, X_lengths):
    # before we calculate the negative log likelihood, we need to mask out the activations:
    # we don't want to take padded items in the output vector into account.
    # the simplest way to think about this is to flatten ALL sequences into a REALLY long
    # sequence and calculate the loss on that.
    # flatten all the labels
    Y = Y.view(-1)
    # flatten all predictions
    Y_hat = Y_hat.view(-1, nb_tags)
    # create a mask that is 1 for real tokens and 0 for padding tokens
    tag_pad_token = tags['<PAD>']
    mask = (Y > tag_pad_token).float()
    print('mask:', mask)
    # count how many real tokens we have
    nb_tokens = int(torch.sum(mask).item())
    print('tokens number:', nb_tokens)
    # the predicted tag is the index of the largest log-probability
    _, ix = torch.topk(Y_hat, 1, dim=1)
    print('Y_hat max', ix.view(-1))
    # shift the labels down by 1 because tag index 0 is the padding tag
    Y = Y - 1  # this line is missing in the original article; added by me
    print('Y', Y)
    # calculate accuracy over the real (non-padded) tokens
    count = 0
    for i in range(len(Y)):
        if (Y.numpy())[i] == (ix.view(-1).numpy())[i]:
            count += 1
    print('accuracy:', str(count / nb_tokens * 100) + '%')
    # pick the log-probability of the true label and zero out the padded positions with the mask
    Y_hat = Y_hat[range(Y_hat.shape[0]), Y] * mask
    # compute the cross-entropy loss, ignoring all padded tokens
    ce_loss = -torch.sum(Y_hat) / nb_tokens
    return ce_loss
Finally, compute the loss.
Y = torch.tensor(padded_Y).long()
loss = loss(Y_hat, Y, X_lengths)
print('loss', loss)
mask: tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0.])
tokens number: 11
Y_hat max tensor([2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 2, 4, 4, 4, 4, 4, 4])
Y tensor([ 0, 1, 2, 2, 2, 0, 3, 4, 4, -1, -1, -1, -1, -1, 3, 4, -1, -1, -1, -1, -1])
accuracy: 27.27272727272727%
loss tensor(1.5931, grad_fn=<...>)
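As an aside, a more compact alternative (not used in the original post) is to let F.nll_loss mask the padding for us via ignore_index. Since Y_hat already contains log-probabilities and tag index 0 is the padding tag, a sketch could look like this; the result should match the hand-masked loss above.
import torch.nn.functional as F
Y_flat = Y.view(-1) - 1        # real tags become 0..nb_tags-1, padding becomes -1
Y_flat[Y_flat < 0] = -100      # -100 is nll_loss's default ignore_index
loss_alt = F.nll_loss(Y_hat.view(-1, nb_tags), Y_flat, ignore_index=-100)
print('loss_alt', loss_alt)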
This is the best practice for handling variable-length batch input to an LSTM in PyTorch:
1. Sort the sequences from longest to shortest.
2. Pad the sequences so they all have the same length.
3. Use pack_padded_sequence so the LSTM does no extra work on the padding in the sequences (the PyTorch team at Facebook should really consider renaming this tongue-twister of an API!).
4. Use pad_packed_sequence to undo step 3.
5. Flatten the outputs and labels into one long vector.
6. Mask out the outputs you do not want.
7. Compute the cross-entropy on what remains.
The code above is the Jupyter notebook version. If you want to train for multiple epochs, you can use the .py version (lstm_pad_pack.py): https://download.csdn.net/download/longmaohu/12561087.
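To tie the seven steps together, here is a minimal end-to-end sketch of the same pipeline wrapped in an nn.Module. The class name LSTMTagger, the default hyper-parameters, and the zero initial hidden state are my own illustrative assumptions rather than code from the original post; it assumes the batch is already padded and sorted from longest to shortest.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSTMTagger(nn.Module):
    def __init__(self, nb_vocab_words, nb_tags, embedding_dim=3,
                 nb_lstm_units=10, nb_layers=2, padding_idx=0):
        super().__init__()
        self.nb_tags = nb_tags
        self.word_embedding = nn.Embedding(nb_vocab_words, embedding_dim,
                                           padding_idx=padding_idx)
        self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=nb_lstm_units,
                            num_layers=nb_layers, batch_first=True)
        self.hidden_to_tag = nn.Linear(nb_lstm_units, nb_tags)

    def forward(self, padded_X, X_lengths):
        batch_size, seq_len = padded_X.shape
        # 1. embed: (batch_size, seq_len) -> (batch_size, seq_len, embedding_dim)
        X = self.word_embedding(padded_X)
        # 2. pack so the LSTM never sees the padding (lengths must be sorted descending)
        X = torch.nn.utils.rnn.pack_padded_sequence(X, X_lengths, batch_first=True)
        # 3. run the LSTM (zero initial hidden state by default)
        X, _ = self.lstm(X)
        # 4. unpack back to (batch_size, seq_len, nb_lstm_units)
        X, _ = torch.nn.utils.rnn.pad_packed_sequence(X, batch_first=True, total_length=seq_len)
        # 5. flatten for the linear layer
        X = self.hidden_to_tag(X.contiguous().view(-1, X.shape[2]))
        # 6. log_softmax over the tag dimension, 7. reshape back
        X = F.log_softmax(X, dim=1)
        return X.view(batch_size, seq_len, self.nb_tags)

# usage on the toy batch from above
model = LSTMTagger(nb_vocab_words=len(vocab), nb_tags=len(tags) - 1)
Y_hat = model(torch.tensor(padded_X).long(), X_lengths)
print(Y_hat.shape)  # torch.Size([3, 7, 5])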