Errors: 1. non-finite loss, ending training tensor(nan, device='cuda:0'); 2. Function 'LogSoftmaxBackward' returned nan; 3. Function 'MulBackward0' returned nan



Error 1: WARNING: non-finite loss, ending training tensor(nan, device='cuda:0', grad_

Error 2: Function 'LogSoftmaxBackward' returned nan values in its 0th output.

Error 3: Function 'MulBackward0' returned nan values in its 0th output

Reference 1: If you run into this, try switching datasets. I spent two days on it: I swapped PyTorch versions, tried everything, and kept suspecting the network architecture, tweaking it back and forth with no luck. My loss function was cross-entropy, and I also tried focal loss; both of them use log(), which blows up when a 0 shows up. Switching to another dataset made it work. Looking back at the images in the original dataset, some pictures were nearly white, with values extremely close to 0; that was probably the cause.
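The log-of-zero failure mode described above can also be guarded against inside the loss itself. Below is a minimal, hypothetical sketch (not the author's actual loss code) of a focal-loss-style function that uses the numerically stable log_softmax and clamps probabilities away from 0 before any log() is involved:

```python
import torch
import torch.nn.functional as F

def safe_focal_loss(logits, targets, gamma=2.0, eps=1e-7):
    # log_softmax is numerically stabler than log(softmax(x)), and clamping
    # the probabilities keeps the focal weighting term away from exactly 0 or 1.
    log_p = F.log_softmax(logits, dim=1)
    p = log_p.exp().clamp(min=eps, max=1.0 - eps)
    ce = F.nll_loss(log_p, targets, reduction="none")       # per-sample CE
    pt = p.gather(1, targets.unsqueeze(1)).squeeze(1)       # prob of true class
    return (((1.0 - pt) ** gamma) * ce).mean()

logits = torch.randn(4, 3)
targets = torch.tensor([0, 1, 2, 1])
print(torch.isfinite(safe_focal_loss(logits, targets)))  # tensor(True)
```

Even with extreme logits this stays finite; the clamp only changes values that would otherwise be exactly 0 or 1, so the gradient is essentially unaffected on normal inputs.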

Reference 2: Try adding torch.autograd.set_detect_anomaly(True) at the very top of train.py; the error you get then comes with a much more detailed explanation. In my case, the problem was one specific module: with the module removed the code ran normally, and adding it back reproduced the error. The message means some value got modified in place. I had already tried every fix suggested online, including setting inplace=False. The error even wishes you good luck at the end, haha: RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [16, 480, 14, 14]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
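The set_detect_anomaly switch mentioned above is a standard PyTorch API. A minimal sketch of what it buys you (the toy 0 * log(0) expression here is illustrative, not the author's model):

```python
import torch

# Put this at the very top of train.py: backward() will then keep a trace of
# the forward pass and raise as soon as a backward op produces NaN, naming
# that op instead of just printing a vague "non-finite loss" warning.
torch.autograd.set_detect_anomaly(True)

x = torch.zeros(1, requires_grad=True)
y = (x * torch.log(x)).sum()   # 0 * log(0) = 0 * (-inf) = nan in the forward pass

caught = None
try:
    y.backward()
except RuntimeError as e:      # anomaly mode raises instead of silently yielding nan grads
    caught = str(e)
print(caught is not None)      # True
```

Leave it disabled for normal training runs: anomaly mode re-records the forward graph and makes every iteration noticeably slower.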

After adding torch.autograd.set_detect_anomaly(True), the more detailed error output looks like the log below, and you can go to the reported location to delete or modify the offending code. My specific failure is at the highlighted (originally bolded) lines below, the sk_model.py forward call; after deleting that module, the code ran normally.

C:\Users\dyh.conda\envs\dyh_torch2\python.exe M:/第三个分类/第三个分类/Test8_densenet/train_three_path.py
5232 images were found in the dataset.
4187 images for training.
1045 images for validation.
Using 12 dataloader workers every process
0%|          | 0/262 [00:00<?, ?it/s]
C:\Users\dyh.conda\envs\dyh_torch2\lib\site-packages\torch\autograd\__init__.py:173: UserWarning: Error detected in ReluBackward0. Traceback of forward call that caused the error:
File "M:/第三个分类/第三个分类/Test8_densenet/train_three_path.py", line 127, in <module>
main(opt)
File "M:/第三个分类/第三个分类/Test8_densenet/train_three_path.py", line 84, in main
mean_loss = train_one_epoch(model=model,
File "M:\第三个分类\第三个分类\Test8_densenet\utils.py", line 129, in train_one_epoch
pred = model(images.to(device))
File "C:\Users\dyh.conda\envs\dyh_torch2\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "M:\第三个分类\第三个分类\Test8_densenet\model_three_path.py", line 308, in forward
features_sk3 = self.sk3(left2)
File "C:\Users\dyh.conda\envs\dyh_torch2\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "M:\第三个分类\第三个分类\Test8_densenet\sk_model.py", line 41, in forward
fea = conv(x).unsqueeze(dim=1)
File "C:\Users\dyh.conda\envs\dyh_torch2\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\dyh.conda\envs\dyh_torch2\lib\site-packages\torch\nn\modules\container.py", line 139, in forward
input = module(input)
File "C:\Users\dyh.conda\envs\dyh_torch2\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\dyh.conda\envs\dyh_torch2\lib\site-packages\torch\nn\modules\activation.py", line 98, in forward
return F.relu(input, inplace=self.inplace)
File "C:\Users\dyh.conda\envs\dyh_torch2\lib\site-packages\torch\nn\functional.py", line 1457, in relu
result = torch.relu(input)
(Triggered internally at C:\cb\pytorch_1000000000000\work\torch\csrc\autograd\python_anomaly_mode.cpp:104.)
Variable.execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
0%|          | 0/262 [00:54<?, ?it/s]
Traceback (most recent call last):
File "M:/第三个分类/第三个分类/Test8_densenet/train_three_path.py", line 127, in <module>
main(opt)
File "M:/第三个分类/第三个分类/Test8_densenet/train_three_path.py", line 84, in main
mean_loss = train_one_epoch(model=model,
File "M:\第三个分类\第三个分类\Test8_densenet\utils.py", line 133, in train_one_epoch
loss.backward()
File "C:\Users\dyh.conda\envs\dyh_torch2\lib\site-packages\torch\_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "C:\Users\dyh.conda\envs\dyh_torch2\lib\site-packages\torch\autograd\__init__.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [16, 480, 14, 14]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

Process finished with exit code 1
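The RuntimeError above is PyTorch's generic in-place-modification check rather than anything specific to this model. A minimal, self-contained reproduction (a hypothetical ReLU example, not the author's sk module) alongside the out-of-place fix:

```python
import torch

# ReluBackward0 computes its gradient from the ReLU *output*; mutating that
# output in place bumps its version counter, so backward() later fails with
# "modified by an inplace operation ... output 0 of ReluBackward0".
def run(use_inplace_add):
    x = torch.randn(2, 3, requires_grad=True)
    y = torch.relu(x)
    if use_inplace_add:
        y.add_(1.0)        # in-place: rewrites the tensor ReluBackward0 saved
    else:
        y = y + 1.0        # out-of-place: a new tensor, the graph stays valid
    y.sum().backward()

try:
    run(True)
    broken = False
except RuntimeError:       # the same class of error as in the log above
    broken = True

run(False)                 # the out-of-place version backpropagates fine
print(broken)              # True
```

The practical takeaway: when this error appears, look for `inplace=True` activations or `add_()`/`mul_()`-style calls applied to a tensor whose producer still needs it for backward, and replace them with their out-of-place counterparts.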
