While training a Transformer, PyTorch raised the following error: RuntimeError: cuda runtime error (59) : device-side assert triggered at C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src\THC/THCReduceAll.cuh:327
The full error output is:
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [65,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [66,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [67,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [68,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [69,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [70,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [71,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [72,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [73,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [74,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [75,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [76,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [77,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [78,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [79,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [80,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [81,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [82,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [83,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [84,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [85,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [86,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [87,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [88,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [89,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [90,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [91,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [92,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [93,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [94,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
THCudaCheck FAIL file=C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src\THC/THCReduceAll.cuh line=327 error=59 : device-side assert triggered
Traceback (most recent call last):
File "C:\Users\AppData\Local\conda\conda\envs\yuanbo_pytorch\lib\site-packages\torch\nn\functional.py", line 3105, in multi_head_attention_forward
qkv_same = torch.equal(query, key) and torch.equal(key, value)
RuntimeError: cuda runtime error (59) : device-side assert triggered at C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src\THC/THCReduceAll.cuh:327
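I spent a long time debugging without finding the cause. It turned out that on the GPU the traceback does not point at the place where the exception really occurs: CUDA kernels run asynchronously, so the failed assert only surfaces at a later, unrelated call (here torch.equal). After switching the device to CPU, the real error appeared: RuntimeError: index out of range: Tried to access index 103 out of table with 99 rows. at C:\w\1\s\tmp_conda_3.6_155139\conda\conda-bld\pytorch_1565366019852\work\aten\src\TH/generic/THTensorEvenMoreMath.cpp:237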
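The CPU trick can be reproduced in isolation. The snippet below is a minimal sketch, not the original training code: the embedding layer, its `embedding_dim` of 8, and the `bad_ids` tensor are made up for illustration, and only the numbers 99 and 103 are taken from the error message above.

```python
import torch
import torch.nn as nn

# Hypothetical layer sized like the table in the CPU error message (99 rows).
embedding = nn.Embedding(num_embeddings=99, embedding_dim=8)

# Hypothetical batch of ids; 103 is outside the valid range 0..98.
bad_ids = torch.tensor([[1, 5, 103]])

# Run on the CPU so the failure is raised directly at the lookup.
device = torch.device("cpu")
embedding = embedding.to(device)

caught = None
try:
    embedding(bad_ids.to(device))
except (IndexError, RuntimeError) as err:
    # Newer PyTorch releases raise IndexError here; the build used in
    # this post reported a RuntimeError instead, so both are caught.
    caught = err
    print(type(err).__name__, err)
```

If moving the model to the CPU is impractical, setting the environment variable CUDA_LAUNCH_BLOCKING=1 before starting the script is another commonly used option: it makes CUDA kernel launches synchronous, so the Python traceback should point much closer to the failing operation.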
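So the cause was a bad index. On inspection, the failure happened in the Transformer decoder's position embedding step: incorrect indices in the vocabulary led to "RuntimeError: cuda runtime error (59) : device-side assert triggered". Rebuilding the vocabulary fixed it.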
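A cheap guard against this class of bug is to compare the vocabulary's indices with the embedding table size before training starts. The following is a sketch under assumptions: `find_bad_indices` and the sample `vocab` are hypothetical names and data, and only the table size 99 and the index 103 come from the error message.

```python
def find_bad_indices(vocab, num_embeddings):
    """Return (token, index) pairs whose index is outside 0..num_embeddings-1."""
    return sorted(
        ((tok, idx) for tok, idx in vocab.items()
         if not 0 <= idx < num_embeddings),
        key=lambda pair: pair[1],
    )


# Hypothetical vocabulary; "world" carries an index the table cannot hold.
vocab = {"<pad>": 0, "<sos>": 1, "<eos>": 2, "hello": 3, "world": 103}

bad = find_bad_indices(vocab, num_embeddings=99)
if bad:
    print("indices outside the embedding table:", bad)
```

An empty result means every index fits the table. A non-empty one suggests either rebuilding the vocabulary, as was done here, or enlarging the embedding table to match.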