I'm using NAS, and the network is too large to fit on a single GPU, so I tried out multi-GPU training with DDP.
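For context, the Trainer in main.py was set up roughly like the sketch below (the model class name and hyper-parameters are assumptions for illustration, not the original code; this is the older pytorch_lightning API that still takes distributed_backend):

import pytorch_lightning as pl
from model import CrowdCountingModel  # hypothetical LightningModule name

model = CrowdCountingModel()
# No distributed_backend is passed here, so Lightning picks ddp by itself and
# prints the warning seen below; passing distributed_backend='ddp' would make
# the choice explicit.
trainer = pl.Trainer(gpus=3)   # gpus=3 matches CUDA_VISIBLE_DEVICES: [0,1,2]
trainer.fit(model)

Running main.py then gave: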
(py36torch15) xx@cluster:~/wang/FasterCrowdCountingNAS/FBNetBranch$ python main.py
/home//anaconda3/envs/py36torch15/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:23: UserWarning: You requested multiple GPUs but did not specify a backend, e.g. Trainer(distributed_backend=dp) (or ddp, ddp2). Setting distributed_backend=ddp for you.
warnings.warn(*args, **kwargs)
GPU available: True, used: True
No environment variable for node rank defined. Set as 0.
CUDA_VISIBLE_DEVICES: [0,1,2]
Traceback (most recent call last):
File "main.py", line 29, in
File "/home//anaconda3/envs/py36torch15/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 844, in fit
File "/home//anaconda3/envs/py36torch15/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
File "/home//anaconda3/envs/py36torch15/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 149, in start_processes
File "/home//anaconda3/envs/py36torch15/lib/python3.6/multiprocessing/process.py", line 105, in start
File "/home//anaconda3/envs/py36torch15/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
File "/home//anaconda3/envs/py36torch15/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
File "/home//anaconda3/envs/py36torch15/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
File "/home//anaconda3/envs/py36torch15/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
File "/home//anaconda3/envs/py36torch15/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
File "/home//anaconda3/envs/py36torch15/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 333, in reduce_storage
RuntimeError: unable to open shared memory object in read-write mode
At first I thought I had written something wrong somewhere… Digging straight into the spawn code path showed it was actually the open-files limit. ulimit -a on the node gives:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 514771
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 514771
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
An open-files limit of 1024 is far too small here, so I simply raised it with ulimit -SHn 51200 and the problem went away.
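If you would rather raise the limit from inside the training script than in the shell (for example on a cluster where you cannot easily change the login settings), the standard-library resource module can do the equivalent. A minimal sketch, with 51200 mirroring the ulimit command above:

import resource

# Raise the soft open-files limit (RLIMIT_NOFILE); as a non-root user the
# soft value can only go up to the existing hard limit, hence the min().
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (min(51200, hard), hard))

After that, however, a second error appeared: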
File "/home//anaconda3/envs/py36torch15/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
File "/home//anaconda3/envs/py36torch15/lib/python3.6/multiprocessing/spawn.py", line 143, in get_preparation_data
_check_not_importing_main()
File "/home//anaconda3/envs/py36torch15/lib/python3.6/multiprocessing/spawn.py", line 136, in _check_not_importing_main
is not going to be frozen to produce an executable.''')
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
Running with multiprocessing, this second error popped up; it is really about how Python starts its worker processes.
Python has more than one process start method. On Unix the default is fork, which copies the current process; the child can then run a different function from the parent, and because it inherits the parent's data, data flows easily from parent to child. Windows does not support fork and uses spawn instead. spawn also creates a new process, but that process starts a fresh interpreter and re-executes the code in the main module (just like the parent did at startup) before running its target function. Without any guard, every child would therefore run the process-creating code again and keep spawning copies of itself. Python's designers anticipated this: if a spawned process tries to create new processes while it is still bootstrapping, it raises the error above and exits. Note that this applies even on Linux here, because PyTorch Lightning's ddp backend goes through torch.multiprocessing.spawn (see popen_spawn_posix.py in the traceback), which uses the spawn start method. The usual way to mark code that should only run in the main process is the __name__ attribute.
Solution:

import multiprocessing as mp

def v():                        # target function run in the child process
    print('running in child')

if __name__ == '__main__':      # guard: skipped when the module is re-imported by spawn
    p = mp.Process(target=v)
    p.start()
    p.join()
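Applied to this post, the same guard goes around the Lightning entry point in main.py (again just a sketch, reusing the assumed names from the Trainer snippet near the top):

if __name__ == '__main__':
    model = CrowdCountingModel()
    trainer = pl.Trainer(gpus=3, distributed_backend='ddp')
    trainer.fit(model)   # spawned ddp workers can now re-import main.py safely

With trainer.fit() inside the guard, the workers launched by torch.multiprocessing.spawn re-import main.py without immediately trying to launch more workers, and the bootstrapping error goes away.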