PyTorch distributed training hangs

While running MoCoGAN-HD in single-node multi-GPU mode on a lab machine with A100s, the job got stuck right after main_worker printed the log below. Interrupting it by hand with Ctrl + C gave:

------------ Options -------------
G_step: 5
batchSize: 4
beta1: 0.5
beta2: 0.999
checkpoints_dir: checkpoints/my_dataset
cross_domain: True
dataroot: /home/itom/data3/my_dataset/train-frames
display_freq: 100
dist_backend: nccl
dist_url: tcp://localhost:10003
gpu: None
h_dim: 384
img_g_weights: pretrained/checkpoint.pkl
isPCA: False
isTrain: True
l_len: 256
latent_dimension: 512
load_pretrain_epoch: -1
load_pretrain_path: pretrained_models
lr: 0.0001
moco_m: 0.999
moco_t: 0.07
multiprocessing_distributed: True
n_frames_G: 16
n_mlp: 8
n_pca: 384
name: my_dataset
nc: 3
norm_D_3d: instance
num_D: 2
print_freq: 5
q_len: 4096
rank: 0
resize_style_gan_size: None
save_epoch_freq: 10
save_latest_freq: 1000
save_pca_path: pca_stats/my_dataset
sg2_ada: False
style_gan_size: [512, 256]
time_step: 5
total_epoch: 500
video_frame_size: 128
w_match: 1.0
w_residual: 0.2
workers: 8
world_size: 1
-------------- End ----------------
Use GPU: 0 for training
Use GPU: 1 for training
^CTraceback (most recent call last):
  File "train_sg2_ada.py", line 243, in <module>
    main()
  File "train_sg2_ada.py", line 48, in main
    args=(ngpus_per_node, args))
  File "/home/itom/miniconda3/envs/py37_pt171/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/itom/miniconda3/envs/py37_pt171/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/home/itom/miniconda3/envs/py37_pt171/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 77, in join
    timeout=timeout,
  File "/home/itom/miniconda3/envs/py37_pt171/lib/python3.7/multiprocessing/connection.py", line 921, in wait
    ready = selector.select(timeout)
  File "/home/itom/miniconda3/envs/py37_pt171/lib/python3.7/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
KeyboardInterrupt


^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/itom/miniconda3/envs/py37_pt171/lib/python3.7/multiprocessing/popen_fork.py", line 28, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt

Adding manual print statements showed that it was stuck inside dist.init_process_group; at that point nvidia-smi did show GPU memory allocated, but only a bit over 2 GB, so initialization clearly had not finished (a less manual way of locating this kind of hang is sketched at the end of the post). A classmate suggested switching to the gloo backend; the first attempt was to change an environment variable in the shell:

PL_TORCH_DISTRIBUTED_BACKEND=gloo \
CUDA_HOME=/usr/local/cuda \
python -W ignore train.py ...

That had no effect; it still hung (PL_TORCH_DISTRIBUTED_BACKEND appears to be a PyTorch Lightning setting, and this training script calls torch.distributed directly). I then noticed that train_options.py has a --dist_backend option whose default is nccl, so I passed it on the command line instead:

CUDA_HOME=/usr/local/cuda \
python -W ignore train.py --dist_backend gloo ...

That worked. In other words, the effective change is:

import torch.distributed as dist

dist.init_process_group(backend="gloo",  # changed from "nccl" to "gloo"
                        init_method=args.dist_url,
                        world_size=args.world_size,
                        rank=args.rank)
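
To check whether init_process_group itself is what hangs, independently of the rest of the training code, a minimal two-process script along these lines can help. This is only a sketch, not part of the MoCoGAN-HD repo: the file name test_dist.py is made up, and the tcp://localhost:10003 address and world size of 2 simply mirror the options printed above.

# test_dist.py -- minimal check of dist.init_process_group with a given backend
import sys

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size, backend):
    # Same rendezvous style as the training script: TCP on localhost.
    dist.init_process_group(backend=backend,
                            init_method="tcp://localhost:10003",
                            world_size=world_size,
                            rank=rank)
    print(f"rank {rank}: init_process_group({backend}) finished")
    # A tiny collective to confirm the process group really works.
    t = torch.ones(1)
    if backend == "nccl":
        t = t.cuda(rank)  # nccl needs one distinct GPU per rank
    dist.all_reduce(t)
    print(f"rank {rank}: all_reduce ok, value = {t.item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    backend = sys.argv[1] if len(sys.argv) > 1 else "gloo"
    world_size = 2  # two processes, one per GPU, as in the run above
    mp.spawn(worker, args=(world_size, backend), nprocs=world_size)

Running python test_dist.py nccl and python test_dist.py gloo side by side shows whether the hang comes from the machine's NCCL setup rather than from anything in the training code.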

MoCoGAN-HD's code is based on rosinality/stylegan2-pytorch, so if that codebase hangs in the same way, this fix may be worth trying there as well.
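
As an aside on the "manual print" debugging above: Python's standard-library faulthandler module can dump each worker's stack traces after a timeout, which points at the hanging call without scattering prints through the code. A sketch, assuming it is placed at the top of the main_worker function that mp.spawn launches (the signature below follows the usual mp.spawn pattern; the 60-second timeout is an arbitrary choice):

import faulthandler

def main_worker(gpu, ngpus_per_node, args):
    # After 60 s, print the stacks of all threads in this spawned worker to
    # stderr without killing it; a hang inside dist.init_process_group then
    # shows up directly in the dumped traceback.
    faulthandler.dump_traceback_later(60)
    ...  # rest of the worker unchanged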
