I was recently stuck on a bug for a day or two.
The error was:
line 109, in _Similarity
dist_rho[dist_rho < 0] = 0
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
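The bug only showed up after the program had been running for a long time, and with a large amount of data every debugging attempt meant another long wait.
Search results suggested that the number of labels did not match the number of outputs of the network's last layer, which clearly did not apply to my case. In the end I searched for the "index out of bounds" failed assertion message
and found the following passage: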
the error message would give you the failing operation. However, the stack trace might point to the wrong line of code, due to the asynchronous behavior. You could rerun the code with: CUDA_LAUNCH_BLOCKING=1 python script.py args to get the proper stack trace with the offending operation. (https://discuss.pytorch.org/t/runtimeerror-cuda-error-device-side-assert-triggered-index-out-of-bounds-failed/87827)
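Running CUDA_LAUNCH_BLOCKING=1 python main.py makes the stack trace point at the line where the program actually fails, so you can find which access goes out of bounds and then fix it.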
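Since each run takes a long time, it can help to make the failure easier to catch. The sketch below is only an illustration, not the code from my project: it sets CUDA_LAUNCH_BLOCKING from inside the script (an alternative to putting it on the command line) and adds a small bounds check. The helper name check_indices and the example values are hypothetical, and treating negative values as invalid is an assumption that fits lookups such as class labels.

```python
import os

# The variable has to be set before CUDA is initialized, so the safest
# place is the very top of the entry script, before torch is imported.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def check_indices(indices, size, name="index"):
    """Raise IndexError listing entries of `indices` outside [0, size).

    `indices` is any iterable of integers. Negative values are treated
    as invalid here (an assumption that suits class labels and similar).
    """
    bad = [(pos, int(i)) for pos, i in enumerate(indices)
           if not 0 <= int(i) < size]
    if bad:
        raise IndexError(
            f"{name}: {len(bad)} value(s) outside [0, {size}); "
            f"first few (position, value): {bad[:5]}"
        )


# Hypothetical example: labels for a 10-class problem, one of them invalid.
labels = [0, 3, 9, 10]
try:
    check_indices(labels, size=10, name="labels")
except IndexError as err:
    print(err)
```

For a tensor, the values can be passed as tensor.flatten().tolist(). Calling such a check right before the suspicious indexing operation raises an ordinary Python exception on the host instead of a device-side assert, at the cost of copying the indices back to the CPU, so it is best removed once the bug is found.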