PyTorch -- RuntimeError: CUDA out of memory

Failure:

Two RuntimeErrors appeared together in the traceback:

1. 

RuntimeError: Caught RuntimeError in replica 0 on device 0.

2.

RuntimeError: CUDA out of memory. Tried to allocate 12.00 MiB (GPU 0; 10.73 GiB total capacity; 1.36 GiB already allocated; 11.31 MiB free; 66.94 MiB cached)

Analysis:

Device 0 is running other jobs, so it is short of memory right now. My job is a parallel (DataParallel) job: by default it uses all visible devices and makes device 0 the master. A parallel job allocates roughly the same amount of memory on every device, so when all devices are used, each one only needs a small slice of memory, but even that small slice does not fit on the nearly full device 0. Two ways out: cut the batch_size so the per-device footprint still fits on device 0 and the job stays parallel, or make it a non-parallel job and run it on a single device other than device 0.
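The second option (single idle GPU, no DataParallel) can be sketched as follows. This is a minimal sketch assuming PyTorch is installed on a multi-GPU machine; the torch lines are left as comments since they need real GPUs. The key point is that the environment variable must be set before CUDA is initialized.

```python
import os

# Hide every GPU except physical device 1. Inside this process the
# remaining GPU is renumbered and appears as 'cuda:0'. Set this at the
# very top of the script, before CUDA is initialized.
os.environ['CUDA_VISIBLE_DEVICES'] = '1'

# import torch
# device = torch.device('cuda:0')  # maps to physical GPU 1
# model = model.to(device)         # non-parallel job on a single idle GPU
```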

Solution:

At the beginning of the .py file, before any CUDA initialization, add:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '1,2,3'

By restricting the visible devices to the idle GPUs, my job still runs in parallel, just without touching device 0.
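Putting it together, a minimal sketch (assuming a 4-GPU machine where GPU 0 is busy, and that PyTorch is installed; the torch-specific lines are commented out because they require actual GPUs):

```python
import os

# Expose only the idle GPUs 1-3; the busy physical GPU 0 becomes
# invisible to this process. The visible devices are renumbered
# inside the process as cuda:0, cuda:1, cuda:2.
os.environ['CUDA_VISIBLE_DEVICES'] = '1,2,3'

# import torch
# import torch.nn as nn
# model = nn.DataParallel(model)  # replica 0 now lives on physical GPU 1
# model = model.cuda()
```

Note that CUDA reads this variable only once, when it initializes, which is why the assignment has to come before the first CUDA call (ideally before `import torch`).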

 

 
