This post records a few bugs I ran into while using the torch.nn.Embedding class from the PyTorch library, along with how I handled them.
First, let's look at the traceback:
File "/home/shenfei1/anaconda3/envs/bert/lib/python3.6/site-packages/torch/nn/functional.py", line 1484, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected tensor for argument #1 'indices' to have scalar type Long; but got torch.cuda.DoubleTensor instead (while checking arguments for embedding)
This bug is not hard to understand. As the traceback says, argument #1 (that is, `input`, the indices) should have scalar type Long, but a DoubleTensor was passed in instead.
Checking the latest PyTorch documentation, the torch.nn.Embedding class requires its input to be a LongTensor (see the official docs). So what type does a LongTensor actually correspond to among the types we are familiar with? For a long time I naively assumed that the LongTensor dtype was simply Python's "long"; it turns out I was wrong. The correspondence between torch tensor types and their underlying data types is as follows:
| Data type | CPU tensor | GPU tensor |
|---|---|---|
| 32-bit floating point | torch.FloatTensor | torch.cuda.FloatTensor |
| 64-bit floating point | torch.DoubleTensor | torch.cuda.DoubleTensor |
| 16-bit floating point | N/A | torch.cuda.HalfTensor |
| 8-bit integer (unsigned) | torch.ByteTensor | torch.cuda.ByteTensor |
| 8-bit integer (signed) | torch.CharTensor | torch.cuda.CharTensor |
| 16-bit integer (signed) | torch.ShortTensor | torch.cuda.ShortTensor |
| 32-bit integer (signed) | torch.IntTensor | torch.cuda.IntTensor |
| 64-bit integer (signed) | torch.LongTensor | torch.cuda.LongTensor |
As the table shows, the LongTensor required by the Embedding class is the 64-bit signed integer type, i.e. the input must be int64. So when you hit this bug, it means you stuffed a DoubleTensor, i.e. 64-bit floating-point data, into the nn.Embedding layer. Go check your input!
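A minimal sketch of both the error and the fix (the vocabulary size, embedding dimension, and index values below are placeholders, not taken from my actual code):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(30522, 768)   # hypothetical vocab size / embedding dim

# Reproduces the error: indices stored as float64 (a DoubleTensor)
bad_ids = torch.tensor([[101, 2023, 102]], dtype=torch.float64)
# emb(bad_ids)  # RuntimeError: Expected tensor for argument #1 'indices' to have scalar type Long ...

# Fix: cast the indices to int64 before the lookup
ids = bad_ids.long()             # equivalently: bad_ids.to(torch.int64)
out = emb(ids)
print(out.shape)                 # torch.Size([1, 3, 768])
```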
Now for the second bug. Again, let's look at the traceback first:
```
# ... a long run of nearly identical lines like the following precedes this, omitted here
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo, TensorInfo, TensorInfo, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = -1, SrcDim = -1, IdxDim = -1, IndexIsMajor = true]: block: [137,0,0], thread: [88,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo, TensorInfo, TensorInfo, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = -1, SrcDim = -1, IdxDim = -1, IndexIsMajor = true]: block: [137,0,0], thread: [89,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo, TensorInfo, TensorInfo, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = -1, SrcDim = -1, IdxDim = -1, IndexIsMajor = true]: block: [137,0,0], thread: [90,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo, TensorInfo, TensorInfo, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = -1, SrcDim = -1, IdxDim = -1, IndexIsMajor = true]: block: [137,0,0], thread: [91,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo, TensorInfo, TensorInfo, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = -1, SrcDim = -1, IdxDim = -1, IndexIsMajor = true]: block: [137,0,0], thread: [92,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo, TensorInfo, TensorInfo, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = -1, SrcDim = -1, IdxDim = -1, IndexIsMajor = true]: block: [137,0,0], thread: [93,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo, TensorInfo, TensorInfo, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = -1, SrcDim = -1, IdxDim = -1, IndexIsMajor = true]: block: [137,0,0], thread: [94,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo, TensorInfo, TensorInfo, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = -1, SrcDim = -1, IdxDim = -1, IndexIsMajor = true]: block: [137,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
  File "run_pretraining_layout.py", line 565, in <module>
    main()
  File "run_pretraining_layout.py", line 482, in main
    outputs = model(input_ids=input_ids, bbox=layout, token_type_ids=segment_ids, attention_mask=input_mask)
  File "/home/shenfei1/anaconda3/envs/bert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/DATA/disk1/shenfei/repos/layoutlm/layoutlm/layoutlm/modeling/layoutlm.py", line 328, in forward
    head_mask=head_mask)
  File "/home/shenfei1/anaconda3/envs/bert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/DATA/disk1/shenfei/repos/layoutlm/layoutlm/layoutlm/modeling/layoutlm.py", line 171, in forward
    input_ids, bbox, position_ids=position_ids, token_type_ids=token_type_ids
  File "/home/shenfei1/anaconda3/envs/bert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/DATA/disk1/shenfei/repos/layoutlm/layoutlm/layoutlm/modeling/layoutlm.py", line 79, in forward
    abs(bbox[:, :, 3] - bbox[:, :, 1])
RuntimeError: CUDA error: device-side assert triggered
```
Note the highly repetitive lines at the top of this traceback: Assertion `srcIndex < srcSelectDimSize` failed.
What this means is an index out of range. It is actually an easy bug to anticipate: if you set up nn.Embedding(1024, 768), your input indices cannot go beyond the `1024` categories, i.e. the valid range is [0, 1023]. So the moment you feed the Embedding layer a "1027", you get exactly this error.
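A self-contained illustration (the sizes and index values are made up):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(1024, 768)          # valid indices: 0 .. 1023
ids = torch.tensor([[3, 17, 1027]])    # 1027 exceeds the embedding table

# On the GPU this lookup only surfaces as the opaque
# `srcIndex < srcSelectDimSize` device-side assert shown above;
# on the CPU it raises a readable "index out of range" error (more on that below).
out = emb(ids)
```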
The real question is: how do you find out where the out-of-range index occurs? If you follow the traceback PyTorch gives above and go looking for a bug in line 79 of layoutlm.py, `abs(bbox[:, :, 3] - bbox[:, :, 1])`, you will find that the problem is not there at all. In other words, the location reported by this CUDA error is rather misleading.
Here is a small trick. For this kind of "device-side" CUDA error, simply avoid running the code on the GPU (i.e., drop the .cuda() calls or model.to("cuda")) and run it on the CPU instead; the resulting traceback then becomes crystal clear.
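For instance, assuming the script picks its device in one place (the flag and the toy model below are made up for illustration), flipping everything to the CPU for a debug run can look like this:

```python
import torch
import torch.nn as nn

DEBUG_ON_CPU = True   # flip back to False once the bug is found

device = torch.device(
    "cuda" if torch.cuda.is_available() and not DEBUG_ON_CPU else "cpu"
)

model = nn.Embedding(1024, 768).to(device)     # stand-in for the real model
ids = torch.tensor([[3, 17, 1027]]).to(device)

out = model(ids)   # on the CPU, the traceback points at this exact lookup
```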
Using this approach, I got the following traceback:
```
Traceback (most recent call last):
  File "run_pretraining_layout.py", line 565, in <module>
    main()
  File "run_pretraining_layout.py", line 482, in main
    outputs = model(input_ids=input_ids, bbox=layout, token_type_ids=segment_ids, attention_mask=input_mask)
  File "/home/shenfei1/anaconda3/envs/bert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/DATA/disk1/shenfei/repos/layoutlm/layoutlm/layoutlm/modeling/layoutlm.py", line 328, in forward
    head_mask=head_mask)
  File "/home/shenfei1/anaconda3/envs/bert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/DATA/disk1/shenfei/repos/layoutlm/layoutlm/layoutlm/modeling/layoutlm.py", line 171, in forward
    input_ids, bbox, position_ids=position_ids, token_type_ids=token_type_ids
  File "/home/shenfei1/anaconda3/envs/bert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/DATA/disk1/shenfei/repos/layoutlm/layoutlm/layoutlm/modeling/layoutlm.py", line 74, in forward
    left_position_embeddings = self.x_position_embeddings(bbox[:, :, 0])
  File "/home/shenfei1/anaconda3/envs/bert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/shenfei1/anaconda3/envs/bert/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 114, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/home/shenfei1/anaconda3/envs/bert/lib/python3.6/site-packages/torch/nn/functional.py", line 1484, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: index out of range: Tried to access index 1027 out of table with 1023 rows. at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:418
```
This traceback is perfectly clear. Look at its last line, "RuntimeError: index out of range: Tried to access index 1027 out of table with 1023 rows. at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:418". Combined with the frames above it, it plainly says that in line 74 of layoutlm.py, `left_position_embeddings = self.x_position_embeddings(bbox[:, :, 0])`, an input index of 1027 appeared. But the Embedding layer I configured has only 1024 entries, i.e. valid indices lie in [0, 1023], so 1027 is clearly out of range. Fix that and everything runs smoothly.
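Tracking down where the offending index comes from is a data or configuration question; here is a hedged sketch of a sanity check that surfaces such out-of-range values before they ever reach the embedding lookup (the fake bbox tensor and `num_positions` below are made up for illustration):

```python
import torch

num_positions = 1024                       # size of the position-embedding table
bbox = torch.randint(0, 2000, (2, 8, 4))   # fake bbox tensor, for illustration only

# Report offending values up front instead of letting CUDA die with a device-side assert.
bad = (bbox < 0) | (bbox >= num_positions)
if bad.any():
    print("out-of-range bbox values:", bbox[bad].unique().tolist())
    # Either fix the preprocessing that produced them, or clamp as a quick (lossy) workaround:
    bbox = bbox.clamp(0, num_positions - 1)
```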