详细报错:
~/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py in spawn(fn, args, nprocs, join, daemon, start_method)
198 ' torch.multiprocessing.start_process(...)' % start_method)
199 warnings.warn(msg)
--> 200 return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
~/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py in start_processes(fn, args, nprocs, join, daemon, start_method)
156
157 # Loop on join until it returns True or raises an exception.
--> 158 while not context.join():
159 pass
160
~/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py in join(self, timeout)
111 raise Exception(
112 "process %d terminated with exit code %d" %
--> 113 (error_index, exitcode)
114 )
115
Exception: process 0 terminated with exit code 1
看到其他回答,回答是可能是jupyter平台的原因。确实如此,用如下代码在terminal下放在py文件中可以运行,但是在jupyter下无法运行,报错信息如下:
##代码测试用例1
##来源https://discuss.pytorch.org/t/exception-process-0-terminated-with-exit-code-1-error-when-using-torch-multiprocessing-spawn-to-parallelize-over-multiple-gpus/90636
import numpy as np
import torch
from torch.multiprocessing import Pool, set_start_method, spawn
X = np.array([[1, 3, 2, 3], [2, 3, 5, 6], [1, 2, 3, 4]])
X = torch.DoubleTensor(X)
def X_power_func(j):
# print(j)
X_power = X**j
# print(X_power)
return X_power
if __name__ == '__main__':
spawn(X_power_func, nprocs=3)
上面代码参考:Exception: process 0 terminated with exit code 1error when using
torch.multiprocessing.spawn` to parallelize over multiple GPUs
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/opt/conda/lib/python3.7/multiprocessing/spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "/opt/conda/lib/python3.7/multiprocessing/spawn.py", line 115, in _main
self = reduction.pickle.load(from_parent)
AttributeError: Can't get attribute 'X_power_func' on <module '__main__' (built-in)>
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/opt/conda/lib/python3.7/multiprocessing/spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "/opt/conda/lib/python3.7/multiprocessing/spawn.py", line 115, in _main
self = reduction.pickle.load(from_parent)
AttributeError: Can't get attribute 'X_power_func' on <module '__main__' (built-in)>
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/opt/conda/lib/python3.7/multiprocessing/spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "/opt/conda/lib/python3.7/multiprocessing/spawn.py", line 115, in _main
self = reduction.pickle.load(from_parent)
AttributeError: Can't get attribute 'X_power_func' on <module '__main__' (built-in)>
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
/tmp/ipykernel_173/3160031924.py in <module>
15
16 if __name__ == '__main__':
---> 17 spawn(X_power_func, nprocs=3)
18
19
~/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py in spawn(fn, args, nprocs, join, daemon, start_method)
198 ' torch.multiprocessing.start_process(...)' % start_method)
199 warnings.warn(msg)
--> 200 return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
~/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py in start_processes(fn, args, nprocs, join, daemon, start_method)
156
157 # Loop on join until it returns True or raises an exception.
--> 158 while not context.join():
159 pass
160
~/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py in join(self, timeout)
111 raise Exception(
112 "process %d terminated with exit code %d" %
--> 113 (error_index, exitcode)
114 )
115
Exception: process 1 terminated with exit code 1
解决方法: create a separate file for func
参考链接:
但是我决定直接在terminal下运行,不在jupyter下运行,避开jupyter没有create a separate file for func带来的报错,得到如下结果
代码上仍然出现的错误:
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
Traceback (most recent call last):
File "test_main_ddp.py", line 543, in <module>
mp.spawn(main_worker, nprocs=args.ngpus_per_node, args=(args.ngpus_per_node, args))
File "/root/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/root/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/root/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 113, in join
(error_index, exitcode)
Exception: process 0 terminated with exit code 1
参考链接:RuntimeError:An attempt has been made to start a new process before the
Pytorch-lightning: 例外:DDP时,进程0以退出代码1终止
解决方法:根据traceback的报错信息,把line 543中的
mp.spawn(main_worker, nprocs=args.ngpus_per_node, args=(args.ngpus_per_node, args))
改成
if __name__=="__main__":
mp.spawn(main_worker, nprocs=args.ngpus_per_node, args=(args.ngpus_per_node, args))
出错原因:因为多线程程序要放在主函数中训练。这样就不报Exception: process 0 terminated with exit code 1
的错了。