Fixing the PyTorch loss.backward() error: RuntimeError: select(): index 0 out of range for tensor of size [0, 1]

I ran into a very strange bug and finally solved it after trying many approaches, so I am writing it down here for future reference.

Problem description:
I am training a generative adversarial model with a discriminator and a generator, then backpropagating. The basic code structure is as follows:

G = Generator(3, 3, 32, norm='bn').apply(weights_init)
D = MS_Discriminator(input_nc=6).apply(weights_init)
optimizer_G = optim.Adam(G.parameters(), lr=0.0002, betas=(0.5, 0.999))
optimizer_D = optim.Adam(D.parameters(), lr=0.0002, betas=(0.5, 0.999))
for epoch in range(EPOCH):
	train(...)

def train(...):
	...
	loss_D = (loss_D_fake + loss_D_real) / 2
	loss_G = loss_G_GAN + loss_G_ReID + loss_G_ssim

	############## Backward #############
	# update generator weights
	optimizer_G.zero_grad()
	loss_G.backward()
	optimizer_G.step()

	# update discriminator weights
	optimizer_D.zero_grad()
	loss_D.backward()
	optimizer_D.step()

The error is strange: it is raised from loss.backward(), and the message is RuntimeError: select(): index 0 out of range for tensor of size [0, 1] at dimension 0

  File "./xxx/utils/attack_patch/attack_algorithm/MIS_RANKING/MIS_RANKING.py", line 161, in train
    loss_G.backward()
  File "/home/xxx/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/xxx/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/autograd/__init__.py", line 127, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: select(): index 0 out of range for tensor of size [0, 1] at dimension 0
Exception raised from select at /opt/conda/conda-bld/pytorch_1595629427478/work/aten/src/ATen/native/TensorShape.cpp:889 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x2b527cfbb77d in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: at::native::select(at::Tensor const&, long, long) + 0x347 (0x2b5240b88ff7 in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #2:  + 0xfe3789 (0x2b5240f6d789 in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #3:  + 0xfd6a83 (0x2b5240f60a83 in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #4: at::select(at::Tensor const&, long, long) + 0xe0 (0x2b5240e930f0 in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #5:  + 0x2b62186 (0x2b5242aec186 in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #6:  + 0xfd6a83 (0x2b5240f60a83 in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #7: at::Tensor::select(long, long) const + 0xe0 (0x2b524101e240 in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #8:  + 0x2a6d69d (0x2b52429f769d in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #9: torch::autograd::generated::MaxBackward1::apply(std::vector >&&) + 0x188 (0x2b5242a110d8 in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #10:  + 0x30d1017 (0x2b524305b017 in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #11: torch::autograd::Engine::evaluate_function(std::shared_ptr&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr const&) + 0x1400 (0x2b5243056860 in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #12: torch::autograd::Engine::thread_main(std::shared_ptr const&) + 0x451 (0x2b5243057401 in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #13: torch::autograd::Engine::thread_init(int, std::shared_ptr const&, bool) + 0x89 (0x2b524304f579 in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #14: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr const&, bool) + 0x4a (0x2b523efbc99a in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #15:  + 0xc9039 (0x2b523cb4c039 in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #16:  + 0x7e65 (0x2b52126b9e65 in /lib64/libpthread.so.0)
frame #17: clone + 0x6d (0x2b52129cc88d in /lib64/libc.so.6)
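
Taken in isolation, the message just means that an index-0 select was attempted along a dimension of size 0. The following minimal snippet (a toy example of my own, unrelated to the training code) reproduces the same message:

import torch

t = torch.empty(0, 1)  # a tensor of size [0, 1]
t.select(0, 0)         # equivalent to t[0]
# RuntimeError: select(): index 0 out of range for tensor of size [0, 1] at dimension 0

So somewhere during backward (frame #9 points at MaxBackward1), an operation tried to index into a tensor whose first dimension had become empty.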

What puzzled me most is that the error is not raised right away: the code runs for a few epochs and only then crashes. The run log looks like this:

start training epoch 0 for Mis-ranking model G and D
ssim = 0.22805009749679203
ssim = 0.2545814767895862
ssim = 0.2610446151667491
ssim = 0.3381733775738129
ssim = 0.39292161640214335
ssim = 0.4152196025258168
loss_G = 308529.04647636414
loss_D = 184930.34714967012
average ssim in epoch 0 is 0.31499846432581674
successfully save attack model weights of G and D
start training epoch 1 for Mis-ranking model G and D
ssim = 0.41696580195656113
ssim = 0.49091494214729137
ssim = 0.5351251625051616
ssim = 0.5997415484053679
ssim = 0.6003826963236703
ssim = 0.6587150843015662
loss_G = 134753.96519470215
loss_D = 9467.809808760881
average ssim in epoch 1 is 0.5503075392732698
successfully save attack model weights of G and D
start training epoch 2 for Mis-ranking model G and D
ssim = 0.6423501039873163
Traceback (most recent call last):
  File "tools/train_net.py", line 111, in 
    args=(args,),
  File "./fastreid/engine/launch.py", line 71, in launch
    main_func(*args)
    ...
    ...
    (the error message shown above)

In other words, the code does run at first, and every individual call to train() executes correctly; the problem appears somewhere across the loop, suddenly aborting with this error after a few epochs. Searching around, I found a post describing a problem quite similar to mine, but with no solution. There is also a known fix for GAN training that moves optimizer.step() later; its reported error differs from mine, though, so I could not be sure it was the same issue.

Still, the blog post it references did fix my error.
The solution is as follows:

  • Swap the order of the loss_D and loss_G backward() calls, so that D's comes first
  • Change the discriminator D's loss_D.backward() to loss_D.backward(retain_graph=True)
  • Move optimizer_D.step() and optimizer_G.step() to the end, so both parameter updates happen last
def train(...):
	...
	loss_D = (loss_D_fake + loss_D_real) / 2
	loss_G = loss_G_GAN + loss_G_ReID + loss_G_ssim

	############## Backward #############
	# backprop the discriminator first, keeping the graph alive for loss_G
	optimizer_D.zero_grad()
	loss_D.backward(retain_graph=True)

	# then backprop the generator
	optimizer_G.zero_grad()
	loss_G.backward()

	# update both sets of weights at the end
	optimizer_D.step()
	optimizer_G.step()

After this change, the training ran correctly with no errors.
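
For what it is worth, the retain_graph=True part of the fix has a straightforward mechanical reading: by default, backward() frees the intermediate buffers of the graph it traverses, so when two losses share any part of one graph, the second backward() through the shared part fails. A tiny standalone sketch of that mechanism (toy tensors of my own, not the GAN code above):

import torch

x = torch.randn(3, requires_grad=True)
y = x ** 2                           # intermediate shared by both losses
loss_a = y.sum()
loss_b = y.mean()

loss_a.backward(retain_graph=True)   # keep the shared buffers alive
loss_b.backward()                    # OK; without retain_graph=True above,
                                     # this second backward() would raise

Whether this fully explains the select() error in my case, I cannot yet say.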


2022.1.10: my machine-learning understanding is still shallow, and for now I have no solid theoretical explanation for this problem. If it ever clicks for me, I will definitely come back and fill this in.
