torch.split(tensor, split_size_or_sections, dim=0)
Splits the input tensor into equally-shaped chunks (if divisible).
If split_size_or_sections is an int, the tensor is split into equally-sized chunks; if the size of the tensor along the given dimension is not divisible by split_size_or_sections, the last chunk will be smaller than the others.
If split_size_or_sections is a list, the tensor is split into chunks whose sizes are given by the list.
Parameters:
tensor (Tensor) – the tensor to split.
split_size_or_sections (int or list(int)) – size of a single chunk, or list of sizes for each chunk.
dim (int) – dimension along which to split the tensor.
Pitfall: in earlier versions the second argument of split was called split_size (int); with a newer torch version you will get the error:
TypeError: split() got an unexpected keyword argument 'split_size'
The fix is simply to change split_size to split_size_or_sections.
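A minimal sketch of both call styles (tensor contents chosen arbitrarily):

import torch

x = torch.arange(10)
# int: chunks of size 4; 10 is not divisible by 4, so the last chunk is smaller
print(torch.split(x, 4))          # (tensor([0, 1, 2, 3]), tensor([4, 5, 6, 7]), tensor([8, 9]))
# list: chunk sizes must sum to the size of the tensor along dim
print(torch.split(x, [2, 3, 5]))  # (tensor([0, 1]), tensor([2, 3, 4]), tensor([5, 6, 7, 8, 9]))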
torch.sort(input, dim=None, descending=False, out=None) -> (Tensor, LongTensor)
Sorts the elements of the input tensor input along the given dimension in ascending order. If dim is not given, the last dimension of the input is used. If descending is True, the elements are sorted in descending order.
Returns a tuple (sorted_tensor, sorted_indices), where sorted_indices holds the indices of the elements in the original input.
Parameters:
input (Tensor) – the input tensor.
dim (int, optional) – the dimension to sort along.
descending (bool, optional) – controls the sorting order (ascending or descending).
out (tuple, optional) – the output tuple of (Tensor, LongTensor) that can optionally be given as output buffers.
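A quick sketch of sort and the indices it returns (values chosen arbitrarily):

import torch

x = torch.tensor([3.0, 1.0, 2.0])
sorted_x, idx = torch.sort(x, dim=0, descending=True)
print(sorted_x)  # tensor([3., 2., 1.])
print(idx)       # tensor([0, 2, 1]) -- positions of the sorted values in the original tensor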
I ran into this function when using RNN models to handle variable-length sequences in PyTorch. In a sequence labeling task, the sentences within a batch have different numbers of words, so the short sentences have to be padded to the length of the longest sentence in the batch; feeding that padding into the RNN, however, affects the model. Looking at torch.nn.RNN, the documentation says this about the input:
The input can also be a packed variable length sequence. See torch.nn.utils.rnn.pack_padded_sequence() or torch.nn.utils.rnn.pack_sequence() for details.
So let's start working through the pitfalls; the usage of these two functions is filled in below.
torch.nn.utils.rnn.pack_padded_sequence(input, lengths, batch_first=False, enforce_sorted=True)
Packs a Tensor containing padded sequences of variable length.
input can be of size T x B x * where T is the length of the longest sequence (equal to lengths[0]), B is the batch size, and * is any number of dimensions (including 0). If batch_first is True, B x T x * input is expected.
For unsorted sequences, use enforce_sorted = False. If enforce_sorted is True, the sequences should be sorted by length in a decreasing order, i.e. input[:,0] should be the longest sequence, and input[:,B-1] the shortest one. enforce_sorted = True is only necessary for ONNX export.
NOTE
This function accepts any input that has at least two dimensions. You can apply it to pack the labels, and use the output of the RNN with them to compute the loss directly. A Tensor can be retrieved from a PackedSequence object by accessing its .data attribute.
Parameters
input (Tensor) – padded batch of variable length sequences.
lengths (Tensor) – list of sequence lengths of each batch element.
batch_first (bool, optional) – if True, the input is expected in B x T x * format.
enforce_sorted (bool, optional) – if True, the input is expected to contain sequences sorted by length in a decreasing order. If False, this condition is not checked. Default: True.
Returns
a PackedSequence object
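To illustrate the NOTE above about packing the labels as well, here is a minimal sketch; the shapes, the tag count, and the use of nn.CrossEntropyLoss are my own assumptions, not part of the quoted docs. Because both tensors are packed with the same lengths, the rows of their .data fields line up, and padding never enters the loss.

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

B, T, num_tags = 3, 4, 7                      # assumed sizes, for illustration only
lengths = torch.tensor([4, 3, 2])             # already sorted decreasingly (enforce_sorted=True by default)
logits = torch.randn(B, T, num_tags)          # pretend per-token outputs of an RNN + linear layer
labels = torch.randint(0, num_tags, (B, T))   # padded label matrix

packed_logits = pack_padded_sequence(logits, lengths, batch_first=True)
packed_labels = pack_padded_sequence(labels, lengths, batch_first=True)

# .data keeps only the valid (non-padding) time steps: shapes (4+3+2, num_tags) and (4+3+2,)
loss = nn.CrossEntropyLoss()(packed_logits.data, packed_labels.data)
print(loss)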
The purpose of this function shows up when an RNN model processes variable-length sequences, e.g. in a NER task: the input is a batch of sentences whose lengths inevitably differ, so to keep the dimensions consistent we pad the short sentences up to the length of the long ones, but we do not want these padded values to take part in training. This function is how we tell the RNN model where the padding is. Now to the arguments and the input.
input (T x B x *): T is max_len, the length of the longest sentence in the batch, B is the batch size, and * is usually emb_dim.
lengths is the valid (unpadded) length of each sentence in the batch.
enforce_sorted apparently did not exist before version 1.1.0. It controls whether lengths and input must already be sorted by valid length in decreasing order: with the default True they must be pre-sorted, and with False the function takes care of unsorted input itself. I have never set this argument, so I do the sorting myself:
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

word_emb = torch.randn(3, 4, 5)      # batch_size=3, max_len=4, emb_dim=5
print(word_emb)
lengths = torch.Tensor([2, 3, 4])    # valid length of each sentence
# sort the batch by length in decreasing order, and remember how to undo the sort later
_, idx_sort = torch.sort(lengths, dim=0, descending=True)
_, idx_unsort = torch.sort(idx_sort, dim=0)
word_emb = word_emb.index_select(0, idx_sort)
lengths = list(lengths[idx_sort])    # lengths reordered to [4, 3, 2]
word_emb_after_packed = pack_padded_sequence(word_emb, lengths, batch_first=True)
print(word_emb_after_packed)
'''
word_emb:
tensor([[[-0.8626, -0.3088, -1.3562, -0.2448, -0.2467],
[-0.2388, 1.1539, 0.1325, -0.6413, -0.7694],
[ 0.3657, 0.4548, -1.3313, 1.8620, -1.1079],
[-0.5403, -0.9424, 2.2213, 0.7689, 0.8932]],
[[-0.9333, 1.5342, -1.4053, 0.7799, 0.4838],
[ 0.1614, -0.2331, 1.6667, -1.7032, 0.3099],
[-0.6210, -0.4821, -1.5498, -0.1731, -0.6864],
[ 0.0037, 1.0089, -2.5998, -0.3588, 0.0582]],
[[ 0.1429, -0.6191, 0.1100, -0.6952, 0.7599],
[ 1.0877, -0.6400, 1.9040, -1.6933, -0.7815],
[-1.9465, -0.7313, -0.0445, -1.9152, 1.7431],
[-1.3321, 1.3924, -0.4106, -1.5812, 0.2697]]])
word_emb_after_packed:
PackedSequence(data=tensor([[ 0.1429, -0.6191, 0.1100, -0.6952, 0.7599],
[-0.9333, 1.5342, -1.4053, 0.7799, 0.4838],
[-0.8626, -0.3088, -1.3562, -0.2448, -0.2467],
[ 1.0877, -0.6400, 1.9040, -1.6933, -0.7815],
[ 0.1614, -0.2331, 1.6667, -1.7032, 0.3099],
[-0.2388, 1.1539, 0.1325, -0.6413, -0.7694],
[-1.9465, -0.7313, -0.0445, -1.9152, 1.7431],
[-0.6210, -0.4821, -1.5498, -0.1731, -0.6864],
[-1.3321, 1.3924, -0.4106, -1.5812, 0.2697]]), batch_sizes=tensor([3, 3, 2, 1]), sorted_indices=None, unsorted_indices=None)
'''
Here the input has been rearranged so that the embeddings of the words at the same index position (time step) across the sentences of the batch are laid out one after another. PackedSequence carries batch_sizes=tensor([3, 3, 2, 1]), which means the first four time steps fed into the RNN contain 3, 3, 2 and 1 valid words respectively.
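Continuing the example above, the packed object can be fed straight into an RNN; a minimal sketch (the nn.LSTM and its hidden size are my own choice, not part of the original text):

lstm = nn.LSTM(input_size=5, hidden_size=6, batch_first=True)
packed_out, (h_n, c_n) = lstm(word_emb_after_packed)
print(packed_out.data.shape)   # torch.Size([9, 6]) -- one row per valid (non-padding) word
print(packed_out.batch_sizes)  # tensor([3, 3, 2, 1]), same as the input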
Now that we have the packed RNN input, the model only sees the embeddings of the valid lengths. But how do we restore the packed data to the original data? In other words, the model's predictions are also packed data, so how do we turn them back into the padded form of the original data?
torch.nn.utils.rnn.pad_packed_sequence(sequence, batch_first=False, padding_value=0.0, total_length=None)
Pads a packed batch of variable length sequences.
It is an inverse operation to pack_padded_sequence().
The returned Tensor's data will be of size T x B x *, where T is the length of the longest sequence and B is the batch size. If batch_first is True, the data will be transposed into B x T x * format.
Batch elements will be ordered decreasingly by their length.
Parameters
sequence (PackedSequence) – batch to pad
batch_first (bool, optional) – if True, the output will be in B x T x * format.
padding_value (float, optional) – values for padded elements.
total_length (int, optional) – if not None, the output will be padded to have length total_length. This method will throw ValueError if total_length is less than the max sequence length in sequence.
Returns
Tuple of Tensor containing the padded sequence, and a Tensor containing the list of lengths of each sequence in the batch.
Straight to the code:
word_padded = nn.utils.rnn.pad_packed_sequence(word_emb_after_packed, batch_first=True)
print(word_padded)
# undo the earlier sort so the rows line up with the original batch order
output = word_padded[0].index_select(0, idx_unsort)
print(output)
'''
word_padded:
(tensor([[[ 0.1429, -0.6191, 0.1100, -0.6952, 0.7599],
[ 1.0877, -0.6400, 1.9040, -1.6933, -0.7815],
[-1.9465, -0.7313, -0.0445, -1.9152, 1.7431],
[-1.3321, 1.3924, -0.4106, -1.5812, 0.2697]],
[[-0.9333, 1.5342, -1.4053, 0.7799, 0.4838],
[ 0.1614, -0.2331, 1.6667, -1.7032, 0.3099],
[-0.6210, -0.4821, -1.5498, -0.1731, -0.6864],
[ 0.0000, 0.0000, 0.0000, 0.0000, 0.0000]],
[[-0.8626, -0.3088, -1.3562, -0.2448, -0.2467],
[-0.2388, 1.1539, 0.1325, -0.6413, -0.7694],
[ 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, 0.0000, 0.0000]]]), tensor([4, 3, 2]))
output:
tensor([[[-0.8626, -0.3088, -1.3562, -0.2448, -0.2467],
[-0.2388, 1.1539, 0.1325, -0.6413, -0.7694],
[ 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, 0.0000, 0.0000]],
[[-0.9333, 1.5342, -1.4053, 0.7799, 0.4838],
[ 0.1614, -0.2331, 1.6667, -1.7032, 0.3099],
[-0.6210, -0.4821, -1.5498, -0.1731, -0.6864],
[ 0.0000, 0.0000, 0.0000, 0.0000, 0.0000]],
[[ 0.1429, -0.6191, 0.1100, -0.6952, 0.7599],
[ 1.0877, -0.6400, 1.9040, -1.6933, -0.7815],
[-1.9465, -0.7313, -0.0445, -1.9152, 1.7431],
[-1.3321, 1.3924, -0.4106, -1.5812, 0.2697]]])
'''
Compare word_emb_after_packed, word_padded, and the final output carefully and it is easy to see what is going on.
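Finally, putting the whole pipeline together, here is a minimal end-to-end sketch under my own assumptions (an nn.LSTM with hidden size 6, and enforce_sorted=False, available from PyTorch 1.1.0 on, so the sorting and unsorting are handled internally):

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

word_emb = torch.randn(3, 4, 5)     # batch=3, max_len=4, emb_dim=5
lengths = torch.tensor([2, 3, 4])   # valid lengths, in the original batch order
lstm = nn.LSTM(input_size=5, hidden_size=6, batch_first=True)

packed = pack_padded_sequence(word_emb, lengths, batch_first=True, enforce_sorted=False)
packed_out, _ = lstm(packed)
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

print(out.shape)     # torch.Size([3, 4, 6]); rows come back in the original batch order
print(out_lengths)   # tensor([2, 3, 4])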