RuntimeError: CUDA error: device-side assert triggered

While running a classification task, the following error appeared:

/opt/conda/conda-bld/pytorch_1646755861072/work/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [0,0,0], thread: [0,0,0] Assertion `input_val >= zero && input_val <= one` failed.
(... the same assertion repeats for threads [1,0,0] through [31,0,0] ...)


RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,
so the stacktrace below might be incorrect.
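As the message says, CUDA kernels run asynchronously, so the Python stack trace often points at the wrong line. A common first step (this is a general PyTorch/CUDA debugging technique, not something specific to this code) is to make kernel launches synchronous via the `CUDA_LAUNCH_BLOCKING` environment variable, so the traceback lands on the kernel that actually failed; alternatively, running the same batch on CPU usually produces a readable Python error instead of a device-side assert. A minimal sketch:

```python
import os

# Must be set before the first CUDA call (ideally before importing torch);
# it has no effect on an already-initialized CUDA context.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```

The assertion text itself is also a clue: `input_val >= zero && input_val <= one` is a loss kernel checking that its inputs are valid probabilities in [0, 1], and a NaN input fails both comparisons.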

Honestly, this is annoying: it's the second time I've hit this error, and what's worse, I can no longer remember how I fixed it last time! Now I have to work it out all over again. Oh no! So once I find the fix this time, I'm going to write it down properly.


  • I remembered! Last time, the cause was calling tensor.log() on a tensor that contained many negative values. Passing the tensor through softmax first fixed it!
  • But this time the cause hasn't been found yet! It still looks like a data problem. Issues on GitHub say this error can occur when an index operation goes out of bounds, i.e. an index exceeds the size of the array itself, so the next step is to locate which index is out of range.
  • In the end it turned out to be exploding gradients. I printed the tensor values just before the failing line and saw NaNs appear further along. Tracing back, the cause was that I fed the attention-weighted features directly into the downstream task, which eventually made the predictions NaN. I added a residual structure, w*F + F (w is the attention weight, F is the feature), and the problem went away.
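The first fix above can be illustrated without PyTorch: log() of a non-positive value is undefined (NaN on GPU), while softmax maps arbitrary reals, including negatives, into (0, 1), where log is always safe. A minimal sketch in plain Python (the function name is mine, not from the original code):

```python
import math

def log_softmax(xs):
    """Softmax followed by log, computed stably.

    Softmax squashes arbitrary reals (including negatives) into (0, 1),
    so the subsequent log never sees a non-positive input.
    """
    m = max(xs)                            # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [math.log(e / total) for e in exps]

# math.log(-3.0) would raise; after softmax every value is a valid probability:
print(log_softmax([-3.0, -1.0, -2.0]))
```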
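For the out-of-bounds hypothesis in the second bullet, a cheap sanity check is to validate the label indices against the number of classes before they reach the loss. The helper below is illustrative, not from the original code:

```python
def out_of_range_labels(labels, num_classes):
    """Return (position, label) pairs whose label falls outside [0, num_classes)."""
    return [(i, y) for i, y in enumerate(labels) if not (0 <= y < num_classes)]

# Example: for a 5-class problem the valid labels are 0..4,
# so label 5 (and any negative label) would trip the CUDA assert.
print(out_of_range_labels([0, 3, 5, -1, 2], num_classes=5))  # [(2, 5), (3, -1)]
```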
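The residual trick in the last bullet can be sketched element-wise in plain Python. With the weighted features alone, a collapsing attention weight w drives everything toward 0 (and an exploding one toward NaN); the residual form w*F + F, i.e. (1 + w)*F, lets the raw features always pass through:

```python
def attend(weights, features):
    """Attention-weighted features alone: they vanish when the weights collapse to 0."""
    return [w * f for w, f in zip(weights, features)]

def attend_residual(weights, features):
    """The residual fix, w*F + F (equivalently (1 + w)*F)."""
    return [w * f + f for w, f in zip(weights, features)]

F = [1.0, 2.0, 3.0]
w = [0.0, 0.0, 0.0]           # degenerate attention: every weight collapsed to 0
print(attend(w, F))           # [0.0, 0.0, 0.0] -- features wiped out
print(attend_residual(w, F))  # [1.0, 2.0, 3.0] -- features preserved
```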
