About subprocess.CalledProcessError: Command 'xxx' returned non-zero exit status 1. -- a PyTorch distributed training problem

1. Problem description

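I ran into this error while trying to run the training source code of a model. I searched online for quite a while but found no solution, and what I did find did not match my situation. The problem itself is related to PyTorch distributed training.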

 Here is what actually happened.

root@node02:~/data/zjx/others/DDPtry# python -m torch.distributed.launch --nproc_per_node 3 tryDDP_1.py 
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Traceback (most recent call last):
  File "tryDDP_1.py", line 92, in <module>
Traceback (most recent call last):
  File "tryDDP_1.py", line 92, in <module>
Traceback (most recent call last):
  File "tryDDP_1.py", line 92, in <module>
    b = c
NameError: name 'c' is not defined
    b = c
    b = c
NameError: name 'c' is not defined
NameError: name 'c' is not defined
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python', '-u', 'tryDDP_1.py', '--local_rank=2']' returned non-zero exit status 1.

2. Solving the problem

When this error appears, the key to solving it is not the error itself but the errors reported before it.

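The launcher reports this error whenever one or more mistakes in the original code prevent distributed training from proceeding, so it says nothing about the actual cause. In the real example above, another error appears before it: the NameError, boxed in the figure below.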

[Figure 1: the terminal output above, with the NameError lines boxed]
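When the output of several worker processes is interleaved, as in the log above, the earlier errors can also be picked out with a small script. The following is only a minimal sketch and not part of the original training code: the function name root_cause_lines and the regular expression are illustrative assumptions, and the sample log is abridged from the output above. It only recognizes exception names ending in Error or Exception, so other exception types would be missed.

```python
import re

# Assumption: the last line of a Python traceback looks like
# "SomeError: message". This pattern is a heuristic, not a full parser.
EXCEPTION_LINE = re.compile(r"^[\w.]+(?:Error|Exception): ")


def root_cause_lines(log_text):
    """Return the distinct exception lines that are NOT the launcher's
    own subprocess.CalledProcessError, in order of first appearance."""
    found = []
    for raw in log_text.splitlines():
        line = raw.strip()
        if not EXCEPTION_LINE.match(line):
            continue  # not an exception summary line
        if line.startswith("subprocess.CalledProcessError"):
            continue  # the launcher's secondary error, not the cause
        if line not in found:
            found.append(line)  # keep one copy per distinct error
    return found


# Abridged from the log in this post (three workers print the same error).
sample_log = """\
Traceback (most recent call last):
    b = c
NameError: name 'c' is not defined
    b = c
NameError: name 'c' is not defined
subprocess.CalledProcessError: Command '['/usr/bin/python', '-u', 'tryDDP_1.py', '--local_rank=2']' returned non-zero exit status 1.
"""

for line in root_cause_lines(sample_log):
    print(line)
```

In practice the log text would come from a file in which the launcher's output was saved; how that file is produced is left open here.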

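 Of course, I introduced this error deliberately, purely to illustrate where the problem comes from: I added one erroneous line to the code, underlined in the figure below.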

[Figure 2: the source code of tryDDP_1.py, with the deliberately added line b = c underlined]

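 It is precisely because of this one error in the code that distributed training cannot proceed, and that is why an error such as

subprocess.CalledProcessError: Command '['/usr/bin/python', '-u', 'tryDDP_1.py', '--local_rank=2']' returned non-zero exit status 1.

is returned. So the key to solving this problem is to fix every error reported before this one; once that is done, distributed training can proceed normally.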
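The relationship between the two errors can be reproduced without PyTorch. The sketch below is an illustration under stated assumptions, not the actual code of torch.distributed.launch: the helper run_worker is hypothetical, and the "worker" is just a one-line script containing the same b = c bug. It shows that the child process prints the real exception, while the parent only learns that the exit status was non-zero.

```python
import subprocess
import sys


def run_worker(code):
    """Run `code` in a child Python interpreter, roughly the way a
    launcher starts a worker, and return (child_stderr, parent_error)."""
    cmd = [sys.executable, "-c", code]
    proc = subprocess.run(cmd, stderr=subprocess.PIPE, universal_newlines=True)
    try:
        # Raises CalledProcessError if the child's exit status is non-zero.
        proc.check_returncode()
    except subprocess.CalledProcessError as err:
        return proc.stderr, err
    return proc.stderr, None


# Hypothetical stand-in for the buggy training script: 'c' is undefined.
child_stderr, launcher_error = run_worker("b = c")

# The child's traceback names the real cause ...
print("child reported:   ", child_stderr.strip().splitlines()[-1])
# ... while the parent's exception only carries the command and exit status.
print("launcher reported:", launcher_error)
```

Once the bug is removed from the worker code, the child exits with status 0 and no CalledProcessError is raised, which matches the behavior described in this post.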

As shown below, the erroneous line b = c has been removed:

[Figure 3: the source code of tryDDP_1.py with the line b = c removed]

 Then run the script again; distributed training now runs normally.
