EOFError when training with PyTorch DDP

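Recently, while training a network with PyTorch DDP (DistributedDataParallel), I used TensorBoard to record the loss and accuracy curves. After training finished, the following error kept appearing and the program could not exit normally.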

Unhandled exception in thread started by <bound method Thread._bootstrap of <Thread(Thread-1, started daemon 140268195714816)>>
Traceback (most recent call last):                                                                                                                                                                                                     
  File "/.../miniconda3/envs/sim/lib/python3.7/threading.py", line 926, in _bootstrap_inner                                                                                                                                       
    self.run()                                                                                                                                                                                                                         
  File "/.../miniconda3/envs/sim/lib/python3.7/threading.py", line 870, in run                                                                                                                                                    
    self._target(*self._args, **self._kwargs)                                                                                                                                                                                          
  File "/.../miniconda3/envs/sim/lib/python3.7/site-packages/torch/utils/data/_utils/pin_memory.py", line 25, in _pin_memory_loop                                                                                                 
    r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)                                                                                                                                                                                 
  File "/.../miniconda3/envs/sim/lib/python3.7/multiprocessing/queues.py", line 113, in get                                                                                                                                       
    return _ForkingPickler.loads(res)                                                                                                                                                                                                  
  File "/.../miniconda3/envs/sim/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 282, in rebuild_storage_fd                                                                                                
    fd = df.detach()
  File "/.../miniconda3/envs/sim/lib/python3.7/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:      
  File "/.../miniconda3/envs/sim/lib/python3.7/multiprocessing/resource_sharer.py", line 87, in get_connection 
    c = Client(address, authkey=process.current_process().authkey)                                                                                                                                                                     
  File "/.../miniconda3/envs/sim/lib/python3.7/multiprocessing/connection.py", line 498, in Client            
    answer_challenge(c, authkey)                                                                                                                                                                                                       
  File "/.../miniconda3/envs/sim/lib/python3.7/multiprocessing/connection.py", line 742, in answer_challenge
    message = connection.recv_bytes(256)         # reject large message                                                                                                                                                                
  File "/.../miniconda3/envs/sim/lib/python3.7/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)                                                                                                                                                                                                  
  File "/.../miniconda3/envs/sim/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)                                                                                            
  File "/.../miniconda3/envs/sim/lib/python3.7/multiprocessing/connection.py", line 383, in _recv
    raise EOFError                                                                                                                                                                                                                     
EOFError

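I searched for a long time without finding the cause. It eventually turned out that, after DDP starts multiple processes with spawn, the TensorBoard writer was not being shut down properly.
The fix is to call writer.close() at the end of the function launched by spawn.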

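import torch.multiprocessing as mp
from torch.utils.tensorboard import SummaryWriter

def main(rank, config):
    # spawn passes the process index (rank) as the first argument
    writer = SummaryWriter(...)
    # train code
    ...
    # close the writer so the worker process can exit cleanly
    writer.close()

if __name__ == "__main__":
    # config is the training configuration dict, loaded elsewhere
    mp.spawn(main, nprocs=config["n_gpu"], args=(config,))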
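
One caveat with putting writer.close() on the last line: if the training code raises an exception, that line is never reached. The sketch below is my own addition, not part of the original fix. It wraps the writer in contextlib.closing so that close() runs whether training succeeds or fails. The names run_training, FakeWriter and failing_train are hypothetical; FakeWriter only stands in for SummaryWriter so the example runs without torch installed.

```python
from contextlib import closing


def run_training(make_writer, train_fn):
    """Create a writer, run train_fn(writer), and always close the writer."""
    with closing(make_writer()) as writer:
        return train_fn(writer)


class FakeWriter:
    """Hypothetical stand-in for SummaryWriter (same add_scalar/close methods)."""

    def __init__(self):
        self.closed = False
        self.scalars = []

    def add_scalar(self, tag, value, step):
        self.scalars.append((tag, value, step))

    def close(self):
        self.closed = True


def failing_train(writer):
    # Log one value, then simulate a crash in the middle of training.
    writer.add_scalar("loss", 1.0, 0)
    raise RuntimeError("simulated training failure")


fake = FakeWriter()
try:
    run_training(lambda: fake, failing_train)
except RuntimeError:
    pass

print(fake.closed)  # True: close() ran even though training raised
```

In a real run, the factory would presumably be something like lambda: SummaryWriter(log_dir), and train_fn would be the body of the function passed to spawn.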
