Notes on fixing bugs with torch.nn.Embedding

This post records a few bugs I ran into while using the torch.nn.Embedding class from PyTorch, and how I dealt with them.

<Problem 1> RuntimeError: Expected tensor for argument #1 'indices' to have scalar type Long; but got torch.cuda.DoubleTensor instead (while checking arguments for embedding)

First, the traceback:

  File "/home/shenfei1/anaconda3/envs/bert/lib/python3.6/site-packages/torch/nn/functional.py", line 1484, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)

RuntimeError: Expected tensor for argument #1 'indices' to have scalar type Long; but got torch.cuda.DoubleTensor instead (while checking arguments for embedding)

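This bug is not hard to understand. As the traceback says, argument #1 (`indices`, which is the `input` passed to the embedding) should have been of type Long, but a DoubleTensor was passed in.

I checked the latest torch documentation at the time and found that torch.nn.Embedding requires its input to be a LongTensor. So which familiar type does LongTensor actually correspond to? For a long time I naively assumed that LongTensor simply meant Python's "long", and it turned out I was wrong. The torch tensor types and the data types they hold are listed below: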

Data type                | CPU tensor         | GPU tensor
-------------------------|--------------------|------------------------
32-bit floating point    | torch.FloatTensor  | torch.cuda.FloatTensor
64-bit floating point    | torch.DoubleTensor | torch.cuda.DoubleTensor
16-bit floating point    | torch.HalfTensor   | torch.cuda.HalfTensor
8-bit integer (unsigned) | torch.ByteTensor   | torch.cuda.ByteTensor
8-bit integer (signed)   | torch.CharTensor   | torch.cuda.CharTensor
16-bit integer (signed)  | torch.ShortTensor  | torch.cuda.ShortTensor
32-bit integer (signed)  | torch.IntTensor    | torch.cuda.IntTensor
64-bit integer (signed)  | torch.LongTensor   | torch.cuda.LongTensor

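As the table shows, when the Embedding class requires a LongTensor input, it means the input must be int64. So if you hit the error above, you have fed a DoubleTensor, i.e. 64-bit floating-point data, into the nn.Embedding layer. Go check your input!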
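A minimal sketch of the failure and one way around it, assuming the indices really are whole numbers that merely ended up stored as float64. The layer size and the variable names here are made up for illustration, and the exact wording of the error message differs between PyTorch versions.

```python
import torch
import torch.nn as nn

# Small hypothetical layer: 10 rows, 4-dimensional vectors.
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)

# Indices that ended up as float64 (a DoubleTensor), e.g. after some arithmetic.
bad_indices = torch.tensor([1.0, 3.0, 5.0], dtype=torch.float64)

error_message = None
try:
    embedding(bad_indices)
except (RuntimeError, TypeError) as err:
    # The lookup rejects floating-point indices; keep the message for inspection.
    error_message = str(err)

# Cast to int64 (LongTensor) before the lookup.
good_indices = bad_indices.long()
vectors = embedding(good_indices)

print(bad_indices.type(), "->", good_indices.type())
print(vectors.shape)
```

Casting with `.long()` truncates any fractional part, so it is only appropriate when the values are already integers; if they are not, the real bug is further upstream.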

<Problem 2> Assertion `srcIndex < srcSelectDimSize` failed.

RuntimeError: CUDA error: device-side assert triggered

First, the traceback:

# ... a pile of nearly identical traceback lines came before these; omitted here
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo, TensorInfo, TensorInfo, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = -1, SrcDim = -1, IdxDim = -1, IndexIsMajor = true]: block: [137,0,0], thread: [88,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo, TensorInfo, TensorInfo, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = -1, SrcDim = -1, IdxDim = -1, IndexIsMajor = true]: block: [137,0,0], thread: [89,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo, TensorInfo, TensorInfo, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = -1, SrcDim = -1, IdxDim = -1, IndexIsMajor = true]: block: [137,0,0], thread: [90,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo, TensorInfo, TensorInfo, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = -1, SrcDim = -1, IdxDim = -1, IndexIsMajor = true]: block: [137,0,0], thread: [91,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo, TensorInfo, TensorInfo, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = -1, SrcDim = -1, IdxDim = -1, IndexIsMajor = true]: block: [137,0,0], thread: [92,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo, TensorInfo, TensorInfo, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = -1, SrcDim = -1, IdxDim = -1, IndexIsMajor = true]: block: [137,0,0], thread: [93,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo, TensorInfo, TensorInfo, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = -1, SrcDim = -1, IdxDim = -1, IndexIsMajor = true]: block: [137,0,0], thread: [94,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo, TensorInfo, TensorInfo, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = -1, SrcDim = -1, IdxDim = -1, IndexIsMajor = true]: block: [137,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
  File "run_pretraining_layout.py", line 565, in <module>
    main()
  File "run_pretraining_layout.py", line 482, in main
    outputs = model(input_ids=input_ids, bbox=layout, token_type_ids=segment_ids, attention_mask=input_mask)
  File "/home/shenfei1/anaconda3/envs/bert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/DATA/disk1/shenfei/repos/layoutlm/layoutlm/layoutlm/modeling/layoutlm.py", line 328, in forward
    head_mask=head_mask)
  File "/home/shenfei1/anaconda3/envs/bert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/DATA/disk1/shenfei/repos/layoutlm/layoutlm/layoutlm/modeling/layoutlm.py", line 171, in forward
    input_ids, bbox, position_ids=position_ids, token_type_ids=token_type_ids
  File "/home/shenfei1/anaconda3/envs/bert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/DATA/disk1/shenfei/repos/layoutlm/layoutlm/layoutlm/modeling/layoutlm.py", line 79, in forward
    abs(bbox[:, :, 3] - bbox[:, :, 1])
RuntimeError: CUDA error: device-side assert triggered

Note the nearly identical lines at the top of this traceback: Assertion `srcIndex < srcSelectDimSize` failed.

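This means an index is out of bounds. It is actually an easy bug to anticipate: if you define the layer as nn.Embedding(1024, 768), the input can take at most `1024` distinct values, i.e. indices 0 through 1023. So when you feed the Embedding layer an index such as "1027", you get this error.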
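To make the bound concrete, here is a small sketch that lists which index values fall outside the table before any lookup happens. The helper `find_out_of_range` and the sample indices are hypothetical, written only for this illustration.

```python
import torch
import torch.nn as nn


def find_out_of_range(indices, num_embeddings):
    """Return the distinct index values outside [0, num_embeddings - 1]."""
    mask = (indices < 0) | (indices >= num_embeddings)
    return torch.unique(indices[mask]).tolist()


embedding = nn.Embedding(1024, 768)

# Sample batch of indices; 1027 is deliberately too large for a 1024-row table.
indices = torch.tensor([[0, 512, 1023], [7, 1027, 3]], dtype=torch.long)

offending = find_out_of_range(indices, embedding.num_embeddings)
print(offending)
```

Running a check like this on the CPU side of the data pipeline is cheap, and it names the bad values directly instead of leaving them to surface as a device-side assert.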

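But how do you find out where the out-of-bounds index occurred? If you follow the traceback PyTorch gave above and look at line 79 of layoutlm.py, "abs(bbox[:, :, 3] - bbox[:, :, 1])", you will find that the problem is not there at all. CUDA kernels are launched asynchronously, so a device-side assert is often reported at a later, unrelated line, which is why the location in this CUDA error is misleading.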

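Here is a small trick. For this kind of "device-side" CUDA error, just don't run the code on the GPU (i.e., leave out .cuda() or model.to("cuda")). Run it on the CPU instead, and the resulting traceback becomes much clearer.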
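The trick can be sketched as follows. `TinyLayoutEmbeddings` is a hypothetical stand-in for the real model, kept just large enough to reproduce the failure; the point is only that both the module and its inputs stay on the CPU.

```python
import traceback

import torch
import torch.nn as nn


class TinyLayoutEmbeddings(nn.Module):
    """Hypothetical stand-in for the real model, just large enough to fail."""

    def __init__(self):
        super().__init__()
        self.x_position_embeddings = nn.Embedding(1024, 8)

    def forward(self, bbox):
        left = self.x_position_embeddings(bbox[:, :, 0])
        height = torch.abs(bbox[:, :, 3] - bbox[:, :, 1])
        return left, height


# Keep both the model and the inputs on the CPU while debugging.
model = TinyLayoutEmbeddings().to("cpu")
bbox = torch.tensor([[[1027, 0, 1030, 10]]], dtype=torch.long).to("cpu")

error_text = None
try:
    model(bbox)
except (IndexError, RuntimeError):
    # On the CPU the error is raised synchronously, so the Python traceback
    # passes through the embedding call that actually failed.
    error_text = traceback.format_exc()

print(error_text)
```

If the model is too slow to run on the CPU, setting the environment variable CUDA_LAUNCH_BLOCKING=1 before starting the script is another commonly used option, since it makes kernel launches synchronous; I have not compared the two approaches here.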

With this approach, I got the following traceback:

Traceback (most recent call last):
  File "run_pretraining_layout.py", line 565, in <module>
    main()
  File "run_pretraining_layout.py", line 482, in main
    outputs = model(input_ids=input_ids, bbox=layout, token_type_ids=segment_ids, attention_mask=input_mask)
  File "/home/shenfei1/anaconda3/envs/bert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/DATA/disk1/shenfei/repos/layoutlm/layoutlm/layoutlm/modeling/layoutlm.py", line 328, in forward
    head_mask=head_mask)
  File "/home/shenfei1/anaconda3/envs/bert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/DATA/disk1/shenfei/repos/layoutlm/layoutlm/layoutlm/modeling/layoutlm.py", line 171, in forward
    input_ids, bbox, position_ids=position_ids, token_type_ids=token_type_ids
  File "/home/shenfei1/anaconda3/envs/bert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/DATA/disk1/shenfei/repos/layoutlm/layoutlm/layoutlm/modeling/layoutlm.py", line 74, in forward
    left_position_embeddings = self.x_position_embeddings(bbox[:, :, 0])
  File "/home/shenfei1/anaconda3/envs/bert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/shenfei1/anaconda3/envs/bert/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 114, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/home/shenfei1/anaconda3/envs/bert/lib/python3.6/site-packages/torch/nn/functional.py", line 1484, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: index out of range: Tried to access index 1027 out of table with 1023 rows. at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:418

This traceback is clear at a glance. Its last line reads "RuntimeError: index out of range: Tried to access index 1027 out of table with 1023 rows. at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:418". Combined with the preceding lines, it says that line 74 of layoutlm.py, "left_position_embeddings = self.x_position_embeddings(bbox[:, :, 0])", received an input index of 1027. But I had set up that Embedding layer with 1024 entries, so valid indices lie in [0, 1023], and 1027 is clearly out of bounds. Once I fixed that, everything ran smoothly.
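The right fix depends on where the bad coordinates come from. As one possible sketch, assuming the bounding-box coordinates are meant to lie in [0, 1023] and that a few stray values can safely be clipped, they can be clamped before the lookup. `MAX_POSITION` and `clamp_bbox` are hypothetical names, not part of layoutlm.

```python
import torch
import torch.nn as nn

MAX_POSITION = 1024  # assumed table size, matching nn.Embedding(1024, ...)


def clamp_bbox(bbox, max_position=MAX_POSITION):
    """Clamp every coordinate into [0, max_position - 1]."""
    return torch.clamp(bbox, min=0, max=max_position - 1)


x_position_embeddings = nn.Embedding(MAX_POSITION, 8)

# One box whose x coordinates exceed the table size.
bbox = torch.tensor([[[1027, 0, 1030, 10]]], dtype=torch.long)

safe_bbox = clamp_bbox(bbox)
left = x_position_embeddings(safe_bbox[:, :, 0])

print(safe_bbox.tolist(), left.shape)
```

Clamping hides the symptom rather than the cause. If many coordinates are out of range, rescaling the boxes during preprocessing or enlarging the embedding table is probably the better choice.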
