I ran into a very strange bug and, after trying many things, finally solved it. I'm recording it here for future reference.
Problem description:
I am training a generative adversarial network (a generator and a discriminator) and backpropagating through both. The basic code structure is as follows:
G = Generator(3, 3, 32, norm='bn').apply(weights_init)
D = MS_Discriminator(input_nc=6).apply(weights_init)
optimizer_G = optim.Adam(G.parameters(), lr=0.0002, betas=(0.5, 0.999))
optimizer_D = optim.Adam(D.parameters(), lr=0.0002, betas=(0.5, 0.999))
for epoch in range(EPOCH):
    train(...)

def train(...):
    ...
    ...
    ...
    loss_D = (loss_D_fake + loss_D_real) / 2
    loss_G = loss_G_GAN + loss_G_ReID + loss_G_ssim
    ############## Backward #############
    # update generator weights
    optimizer_G.zero_grad()
    loss_G.backward()
    #loss_G.backward()
    optimizer_G.step()
    # update discriminator weights
    optimizer_D.zero_grad()
    loss_D.backward()
    optimizer_D.step()
The error was strange. It was raised inside loss_G.backward(), with the message RuntimeError: select(): index 0 out of range for tensor of size [0, 1] at dimension 0. The full traceback:
File "./xxx/utils/attack_patch/attack_algorithm/MIS_RANKING/MIS_RANKING.py", line 161, in train
loss_G.backward()
File "/home/xxx/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/tensor.py", line 185, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/xxx/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/autograd/__init__.py", line 127, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: select(): index 0 out of range for tensor of size [0, 1] at dimension 0
Exception raised from select at /opt/conda/conda-bld/pytorch_1595629427478/work/aten/src/ATen/native/TensorShape.cpp:889 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x2b527cfbb77d in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: at::native::select(at::Tensor const&, long, long) + 0x347 (0x2b5240b88ff7 in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #2: + 0xfe3789 (0x2b5240f6d789 in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #3: + 0xfd6a83 (0x2b5240f60a83 in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #4: at::select(at::Tensor const&, long, long) + 0xe0 (0x2b5240e930f0 in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #5: + 0x2b62186 (0x2b5242aec186 in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #6: + 0xfd6a83 (0x2b5240f60a83 in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #7: at::Tensor::select(long, long) const + 0xe0 (0x2b524101e240 in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #8: + 0x2a6d69d (0x2b52429f769d in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #9: torch::autograd::generated::MaxBackward1::apply(std::vector >&&) + 0x188 (0x2b5242a110d8 in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #10: + 0x30d1017 (0x2b524305b017 in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #11: torch::autograd::Engine::evaluate_function(std::shared_ptr&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr const&) + 0x1400 (0x2b5243056860 in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #12: torch::autograd::Engine::thread_main(std::shared_ptr const&) + 0x451 (0x2b5243057401 in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #13: torch::autograd::Engine::thread_init(int, std::shared_ptr const&, bool) + 0x89 (0x2b524304f579 in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #14: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr const&, bool) + 0x4a (0x2b523efbc99a in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #15: + 0xc9039 (0x2b523cb4c039 in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #16: + 0x7e65 (0x2b52126b9e65 in /lib64/libpthread.so.0)
frame #17: clone + 0x6d (0x2b52129cc88d in /lib64/libc.so.6)
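Read literally, the message just means that somewhere in the backward pass (inside MaxBackward1, per the stack frames) autograd called select() on an empty tensor of shape [0, 1], asking for element 0 along dimension 0. A tiny standalone snippet (my own illustration, unrelated to the training code) produces the same kind of message:

import torch

# Illustration of the error message only, not of the training bug:
# asking for index 0 along a dimension of size 0 is out of range.
t = torch.zeros(0, 1)   # empty tensor of shape [0, 1]
t.select(0, 0)          # raises "select(): index 0 out of range for tensor of size [0, 1] at dimension 0"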
What puzzled me most is that the error did not appear right at the start; it appeared during training, after a few epochs had already run. The training log looked like this:
start training epoch 0 for Mis-ranking model G and D
ssim = 0.22805009749679203
ssim = 0.2545814767895862
ssim = 0.2610446151667491
ssim = 0.3381733775738129
ssim = 0.39292161640214335
ssim = 0.4152196025258168
loss_G = 308529.04647636414
loss_D = 184930.34714967012
average ssim in epoch 0 is 0.31499846432581674
successfully save attack model weights of G and D
start training epoch 1 for Mis-ranking model G and D
ssim = 0.41696580195656113
ssim = 0.49091494214729137
ssim = 0.5351251625051616
ssim = 0.5997415484053679
ssim = 0.6003826963236703
ssim = 0.6587150843015662
loss_G = 134753.96519470215
loss_D = 9467.809808760881
average ssim in epoch 1 is 0.5503075392732698
successfully save attack model weights of G and D
start training epoch 2 for Mis-ranking model G and D
ssim = 0.6423501039873163
Traceback (most recent call last):
File "tools/train_net.py", line 111, in
args=(args,),
File "./fastreid/engine/launch.py", line 71, in launch
main_func(*args)
...
...
(the error traceback shown above)
In other words, the code runs fine at first: each individual call to train() executes correctly, and the problem only shows up partway through the loop, with training suddenly aborting after a few epochs. Searching around, I found a post describing a problem quite similar to mine, but it offered no solution either. There is a known fix for GAN training that moves optimizer.step() to a later point, but the error reported there is different from mine, so I cannot be sure it is the same problem.
Still, the blog post it referenced did resolve my error.
The fix is as follows:
1. Swap the order of the loss_D and loss_G backward() calls, so that loss_D.backward() runs first;
2. Change loss_D.backward() to loss_D.backward(retain_graph=True);
3. Move optimizer_D.step() and optimizer_G.step() to the very end, so both sets of weights are updated together after both backward passes.
The train() function then becomes:
def train(...):
    ...
    ...
    ...
    loss_D = (loss_D_fake + loss_D_real) / 2
    loss_G = loss_G_GAN + loss_G_ReID + loss_G_ssim
    ############## Backward #############
    # update discriminator weights
    optimizer_D.zero_grad()
    loss_D.backward(retain_graph=True)
    # update generator weights
    optimizer_G.zero_grad()
    loss_G.backward()
    #loss_G.backward()
    optimizer_D.step()
    optimizer_G.step()
After these changes, the code ran correctly with no errors.
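As for the retain_graph=True part: my guess is that loss_D and loss_G share part of the same computation graph (for example the generator's forward pass, if the fake images are not detached before being fed to D), so the first backward() call has to keep the shared graph's saved tensors alive for the second backward() to succeed. A minimal toy sketch (my own stand-in models, not the project's G and D) of that mechanic:

import torch
import torch.nn as nn

# Toy stand-ins for the generator and discriminator.
G = nn.Linear(4, 4)
D = nn.Linear(4, 1)

z = torch.randn(2, 4)
fake = G(z)                         # generator forward pass: a subgraph shared by both losses
loss_D = D(fake).mean()             # fake is not detached, so loss_D also depends on G's graph
loss_G = -D(fake).mean()

loss_D.backward(retain_graph=True)  # keep the shared graph's saved tensors alive
loss_G.backward()                   # without retain_graph above, this second backward raises
                                    # "Trying to backward through the graph a second time"

This sketch only shows why the first backward() needs retain_graph=True once the order is swapped; it does not explain the original select() error.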
2022.1.10 - My machine learning skills are still limited, and for now I don't have a sound theoretical explanation for this problem. If I reach some insight later, I will definitely come back and add it.