Pitfall 3: loading a model with PyTorch inside a Python multiprocessing child process fails with:
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=50 error=3 : initialization error
Process Process-12:
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/data/user1/intergration_test/recog/recognition.py", line 410, in recog_output
seg_net = infer.load_model(net_arc=args.seg_network, pre_model=args.seg_model)
File "/home/data/user1/intergration_test/recog/recognition.py", line 115, in load_model
return torch.nn.DataParallel(net, device_ids=range(args.ngpu)).cuda()
File "/home/user1/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 305, in cuda
return self._apply(lambda t: t.cuda(device))
File "/home/user1/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 202, in _apply
module._apply(fn)
File "/home/user1/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 202, in _apply
module._apply(fn)
File "/home/user1/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 202, in _apply
module._apply(fn)
File "/home/user1/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in _apply
param_applied = fn(param)
File "/home/user1/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 305, in <lambda>
return self._apply(lambda t: t.cuda(device))
File "/home/user1/.local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 193, in _lazy_init
torch._C._cuda_init()
RuntimeError: cuda runtime error (3) : initialization error at /pytorch/aten/src/THC/THCGeneral.cpp:50
Or with this variant:
THCudaCheck FAIL file=../aten/src/THC/THCGeneral.cpp line=51 error=3 : initialization error
Process Process-4:
Traceback (most recent call last):
File "/home/data/conda_envs/thpj/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/home/data/conda_envs/thpj/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/pj/frm_v3/frmwork3_all_q/frmwork3_all_q/recog_v3.py", line 445, in recog_output
seg_net = infer.load_model(net_arc=args.seg_network, pre_model=args.seg_model)
File "/home/pj/frm_v3/frmwork3_all_q/frmwork3_all_q/recog_v3.py", line 124, in load_model
return torch.nn.DataParallel(net, device_ids=range(args.ngpu)).cuda()
File "/home/data/conda_envs/thpj/lib/python3.6/site-packages/torch/nn/modules/module.py", line 265, in cuda
return self._apply(lambda t: t.cuda(device))
File "/home/data/conda_envs/thpj/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
module._apply(fn)
File "/home/data/conda_envs/thpj/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
module._apply(fn)
File "/home/data/conda_envs/thpj/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
module._apply(fn)
File "/home/data/conda_envs/thpj/lib/python3.6/site-packages/torch/nn/modules/module.py", line 199, in _apply
param.data = fn(param.data)
File "/home/data/conda_envs/thpj/lib/python3.6/site-packages/torch/nn/modules/module.py", line 265, in <lambda>
return self._apply(lambda t: t.cuda(device))
File "/home/data/conda_envs/thpj/lib/python3.6/site-packages/torch/cuda/__init__.py", line 163, in _lazy_init
torch._C._cuda_init()
RuntimeError: cuda runtime error (3) : initialization error at ../aten/src/THC/THCGeneral.cpp:51
Cause (my own assessment): Theano and PyTorch cannot run on the same GPU. Either run them on two separate GPUs, or keep one on the GPU and the other on the CPU.
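The GPU/CPU split described above can be sketched as follows. This is a minimal illustration, not the setup used in the project: the helper name `force_theano_cpu` is hypothetical, and it assumes Theano picks its device from the `THEANO_FLAGS` environment variable (comma-separated `key=value` pairs), which it reads when it is imported.

```python
import os


def force_theano_cpu(env=None):
    """Rewrite THEANO_FLAGS so that Theano is pinned to the CPU."""
    env = os.environ if env is None else env
    # Keep every existing flag except a previous device choice.
    kept = [flag for flag in env.get("THEANO_FLAGS", "").split(",")
            if flag and not flag.startswith("device=")]
    env["THEANO_FLAGS"] = ",".join(kept + ["device=cpu"])
    return env["THEANO_FLAGS"]


# Must run before `import theano`, because the variable is read at import time.
force_theano_cpu()
```

With Theano kept on the CPU this way, PyTorch is the only library that touches the GPU in that process. To put the two on separate GPUs instead, the same idea would apply with a different `device=` value for Theano, but I have not tried that variant here.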
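Possible fixes:
1. If enough GPUs are available, running Theano and PyTorch on different GPUs may solve the problem. If the hardware does not allow that, run Theano on the CPU and PyTorch on the GPU.
2. At run time, load the model with PyTorch first and only then load the model with Theano. Keeping this order may solve the problem.
3. For other possible fixes, see GitHub link 2 in the references.
4. This is a bug. It may have been fixed in PyTorch 1.2 and later (unconfirmed); see GitHub link 1 in the references.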
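One more thing that may be worth trying, independent of Theano: PyTorch's multiprocessing notes state that the CUDA runtime does not support the fork start method, and that spawn or forkserver is required to use CUDA in subprocesses. Whether that is what triggers the tracebacks above is an assumption on my part. The sketch below starts the worker with the spawn method and makes the first CUDA call inside the child. The names `load_model_in_child` and `launch` are hypothetical, and the tiny `Linear` layer only stands in for the real network.

```python
import multiprocessing as mp


def load_model_in_child(gpu_id, results):
    """Runs in the child process, so the first CUDA call happens here."""
    try:
        import torch
    except ImportError:
        results.put((gpu_id, "torch not installed"))
        return
    if not torch.cuda.is_available():
        results.put((gpu_id, "no CUDA device visible"))
        return
    # Stand-in for the real network (hypothetical; swap in the actual loader).
    net = torch.nn.Linear(4, 2).cuda(gpu_id)
    results.put((gpu_id, str(next(net.parameters()).device)))


def launch(gpu_id=0, timeout=120):
    """Start the worker in a freshly spawned interpreter and wait for it."""
    ctx = mp.get_context("spawn")  # new interpreter, no forked CUDA state
    results = ctx.Queue()
    proc = ctx.Process(target=load_model_in_child, args=(gpu_id, results))
    proc.start()
    outcome = results.get(timeout=timeout)  # raises queue.Empty on timeout
    proc.join()
    return outcome
```

Call `launch()` from under an `if __name__ == "__main__":` guard in the entry script. The spawn start method re-imports that script in the child, so unguarded top-level code would run a second time there.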
References:
1 https://github.com/pytorch/pytorch/issues/17357
2 https://github.com/pytorch/pytorch/issues/15734