RuntimeError: cuda runtime error (3) : initialization error at /pytorch/aten/src/THC/THCGeneral.cpp:

Third big pitfall: loading a model with PyTorch inside a Python multiprocessing worker fails with the following error:

THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=50 error=3 : initialization error
Process Process-12:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/data/user1/intergration_test/recog/recognition.py", line 410, in recog_output
    seg_net = infer.load_model(net_arc=args.seg_network, pre_model=args.seg_model)
  File "/home/data/user1/intergration_test/recog/recognition.py", line 115, in load_model
    return torch.nn.DataParallel(net, device_ids=range(args.ngpu)).cuda()
  File "/home/user1/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 305, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/home/user1/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 202, in _apply
    module._apply(fn)
  File "/home/user1/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 202, in _apply
    module._apply(fn)
  File "/home/user1/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 202, in _apply
    module._apply(fn)
  File "/home/user1/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in _apply
    param_applied = fn(param)
  File "/home/user1/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 305, in <lambda>
    return self._apply(lambda t: t.cuda(device))
  File "/home/user1/.local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 193, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (3) : initialization error at /pytorch/aten/src/THC/THCGeneral.cpp:50

Or this variant:

THCudaCheck FAIL file=../aten/src/THC/THCGeneral.cpp line=51 error=3 : initialization error
Process Process-4:
Traceback (most recent call last):
  File "/home/data/conda_envs/thpj/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/data/conda_envs/thpj/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/pj/frm_v3/frmwork3_all_q/frmwork3_all_q/recog_v3.py", line 445, in recog_output
    seg_net = infer.load_model(net_arc=args.seg_network, pre_model=args.seg_model)
  File "/home/pj/frm_v3/frmwork3_all_q/frmwork3_all_q/recog_v3.py", line 124, in load_model
    return torch.nn.DataParallel(net, device_ids=range(args.ngpu)).cuda()
  File "/home/data/conda_envs/thpj/lib/python3.6/site-packages/torch/nn/modules/module.py", line 265, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/home/data/conda_envs/thpj/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
    module._apply(fn)
  File "/home/data/conda_envs/thpj/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
    module._apply(fn)
  File "/home/data/conda_envs/thpj/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
    module._apply(fn)
  File "/home/data/conda_envs/thpj/lib/python3.6/site-packages/torch/nn/modules/module.py", line 199, in _apply
    param.data = fn(param.data)
  File "/home/data/conda_envs/thpj/lib/python3.6/site-packages/torch/nn/modules/module.py", line 265, in <lambda>
    return self._apply(lambda t: t.cuda(device))
  File "/home/data/conda_envs/thpj/lib/python3.6/site-packages/torch/cuda/__init__.py", line 163, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (3) : initialization error at ../aten/src/THC/THCGeneral.cpp:51

 

Cause (my own guess): theano and PyTorch cannot run on the same GPU. They can be run on two separate GPUs, or one on the GPU and the other on the CPU.
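One way to keep the two frameworks apart is to give each one its own process with its own environment variables. The sketch below is only an illustration under assumptions: `build_env` is a hypothetical helper, and it relies on the standard `CUDA_VISIBLE_DEVICES` variable and on theano's `THEANO_FLAGS=device=cpu` setting. Both must be in place before the framework initializes CUDA in that process.

```python
import os


def build_env(framework, gpu_index=None, base_env=None):
    """Return an environment dict that pins one framework to one device.

    framework: "theano" or "pytorch" (labels used only by this sketch).
    gpu_index: physical GPU index to expose, or None to force CPU.
    """
    env = dict(os.environ if base_env is None else base_env)
    if gpu_index is None:
        # Hide every GPU from this process.
        env["CUDA_VISIBLE_DEVICES"] = ""
        if framework == "theano":
            # Simplification: this overwrites any existing THEANO_FLAGS.
            env["THEANO_FLAGS"] = "device=cpu"
    else:
        # Expose exactly one GPU; inside the process it appears as device 0.
        env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


if __name__ == "__main__":
    # theano on the CPU, PyTorch on physical GPU 0.
    print(build_env("theano", base_env={}))
    print(build_env("pytorch", gpu_index=0, base_env={}))
```

Each dict could then be passed as `env=` to `subprocess.Popen` when launching the theano part and the PyTorch part as separate scripts. Whether this is enough depends on how the program is structured; it is not a confirmed fix for the error above.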

 

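Possible solutions:

1. If enough GPUs are available, running theano and PyTorch on different GPUs may resolve the problem. If the hardware does not allow this, run theano on the CPU and PyTorch on the GPU.

2. At runtime, load the model with PyTorch first and only then load the model with theano. Keeping this order may resolve the problem.

3. For other possible fixes, see GitHub link 2 in the references.

4. This may be a bug that was fixed in PyTorch 1.2 and later (unconfirmed); see GitHub link 1 in the references.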
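Another workaround worth trying is to change the multiprocessing start method. PyTorch's multiprocessing notes state that the CUDA runtime does not support the `fork` start method (the default on Linux in Python 3.6) and that `spawn` or `forkserver` is required to use CUDA in subprocesses. If the parent process has already initialized CUDA (for example through theano) before forking, the child would hit this situation. The following is a minimal sketch under that assumption; `pick_device`, `worker`, and `run_workers` are hypothetical names, and `worker` only stands in for the real `recog_output`.

```python
import multiprocessing as mp


def pick_device():
    """Return "cuda" if PyTorch sees a GPU in this process, else "cpu"."""
    try:
        import torch  # imported lazily so the parent never touches CUDA here
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def worker(rank, results):
    # Hypothetical stand-in for recog_output: model loading and .cuda()
    # would happen here, inside the child process.
    results.put((rank, pick_device()))


def run_workers(num_workers=2):
    # "spawn" starts each child from a fresh interpreter, so no CUDA state
    # is inherited from the parent.
    ctx = mp.get_context("spawn")
    results = ctx.Queue()
    procs = [ctx.Process(target=worker, args=(rank, results))
             for rank in range(num_workers)]
    for p in procs:
        p.start()
    collected = [results.get() for _ in procs]
    for p in procs:
        p.join()
    return sorted(collected)


if __name__ == "__main__":
    print(run_workers())
```

With `spawn`, the `if __name__ == "__main__":` guard is required, and the target function and its arguments must be picklable.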

 

 

References:

1 https://github.com/pytorch/pytorch/issues/17357

2 https://github.com/pytorch/pytorch/issues/15734

 
