Fixing GPU memory that is not released

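Developers often report that their program has stopped but its GPU memory has not been released. When working with TensorFlow in PyCharm, or with PyTorch, we sometimes terminate a running program from the console. Occasionally the program has clearly exited and nvidia-smi no longer lists it, yet the GPU memory is still occupied. Why?
When a PyTorch DataLoader is configured with multiple workers (num_workers), it does not actually use threads. It starts N child processes (usually with consecutive PIDs) to load data in parallel. If the program finishes, or the main process is killed midway, those child processes can be left behind, and the GPU memory they hold is not released until each of them is killed by hand. The steps are as follows: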

1. Observe the symptom

 nvidia-smi
Mon Dec  6 14:26:33 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  TITAN V             Off  | 00000000:04:00.0 Off |                  N/A |
| 34%   42C    P8    26W / 250W |   9575MiB / 12066MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  TITAN V             Off  | 00000000:05:00.0 Off |                  N/A |
| 35%   45C    P8    28W / 250W |   8503MiB / 12066MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  TITAN V             Off  | 00000000:08:00.0 Off |                  N/A |
| 34%   45C    P8    28W / 250W |   8503MiB / 12066MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  TITAN V             Off  | 00000000:09:00.0 Off |                  N/A |
| 36%   46C    P8    28W / 250W |   8503MiB / 12066MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  TITAN V             Off  | 00000000:84:00.0 Off |                  N/A |
| 28%   37C    P8    27W / 250W |      4MiB / 12066MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  TITAN V             Off  | 00000000:85:00.0 Off |                  N/A |
| 28%   34C    P8    25W / 250W |      4MiB / 12066MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  TITAN V             Off  | 00000000:88:00.0 Off |                  N/A |
| 28%   35C    P8    26W / 250W |      4MiB / 12066MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  TITAN V             Off  | 00000000:89:00.0 Off |                  N/A |
| 28%   34C    P8    24W / 250W |      4MiB / 12066MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
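
In the table above, GPUs 0 to 3 hold roughly 8 to 9 GiB each at 0% utilization, which is the pattern to look for. As a rough sketch, the same check can be scripted from the CSV output of nvidia-smi. The helper name idle_but_full and the 1000 MiB default threshold are assumptions made for this example, not part of any tool, and a healthy job can also show 0% utilization for a moment, so treat the result only as a hint.

```shell
# idle_but_full [MIN_MIB]: hypothetical helper for illustration.
# Reads rows of "index, memory.used, utilization.gpu" on stdin and prints
# the index of every GPU that holds at least MIN_MIB MiB while at 0% load.
idle_but_full() {
    awk -F', *' -v min="${1:-1000}" '$2 + 0 >= min && $3 + 0 == 0 { print $1 }'
}

# Typical use on a machine with the NVIDIA driver installed:
#   nvidia-smi --query-gpu=index,memory.used,utilization.gpu \
#              --format=csv,noheader,nounits | idle_but_full 1000
```

The function only reads stdin, so it can be tried on saved output without touching the GPUs.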

2. List the processes

fuser -v /dev/nvidia*

/dev/nvidia5:        i001  11085 F.... nvidia-smi
                     i001  18493 F...m python
                     i001  33238 F...m python
                     i001  33239 F...m python
                     i001  33240 F...m python
                     i001  33251 F...m python
                     i001  33256 F...m python
                     i001  33257 F...m python
                     i001  33258 F...m python
                     i001  33261 F...m python
                     i001  33264 F...m python
                     i001  33265 F...m python
                     i001  33269 F...m python
                     i001  33270 F...m python
                     i001  33271 F...m python
                     i001  33278 F...m python

3. Extract the PIDs

fuser /dev/nvidia* 2>/dev/null | grep -oE '[0-9]+' | sort -u >/tmp/pid.file
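
The PID file built this way contains every process that has a GPU device open, including the nvidia-smi entry seen in step 2. If only the leftover python processes should be targeted, the verbose listing can be filtered instead. The following is a minimal sketch, assuming the listing has the column layout shown in step 2 (user, PID, access flags, command); python_pids is a hypothetical helper name.

```shell
# python_pids: hypothetical helper for illustration.
# Reads a verbose fuser listing on stdin and prints, one per line, the
# unique PIDs of entries whose command column is exactly "python".
python_pids() {
    awk '$NF == "python" {
        # The PID is the first purely numeric field on the line.
        for (i = 1; i < NF; i++)
            if ($i ~ /^[0-9]+$/) { print $i; break }
    }' | sort -un
}

# Example, with the listing from step 2 saved to a file (listing.txt is a placeholder):
#   python_pids < listing.txt > /tmp/pid.file
```

Matching on the command name is a design choice for this example. A process started as python3 would need the comparison adjusted.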

4. Force-kill the processes

while read pid ; do kill -9 $pid; done </tmp/pid.file
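
kill -9 gives a process no chance to clean up. A gentler variant, sketched below, sends SIGTERM first and uses SIGKILL only on PIDs that are still alive after a short wait. The function name stop_pids and the 5 second default grace period are assumptions made for this example. It expects one PID per line in the file.

```shell
# stop_pids FILE [GRACE_SECONDS]: hypothetical helper for illustration.
stop_pids() {
    pid_file=$1
    grace=${2:-5}
    # First pass: SIGTERM, so each process can run its own cleanup.
    while read -r pid; do
        if [ -n "$pid" ]; then
            kill -TERM "$pid" 2>/dev/null || true
        fi
    done < "$pid_file"
    sleep "$grace"
    # Second pass: SIGKILL only for PIDs that still exist.
    while read -r pid; do
        if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
            kill -KILL "$pid" 2>/dev/null || true
        fi
    done < "$pid_file"
    return 0
}

# Example: stop_pids /tmp/pid.file 5
```

After it returns, running nvidia-smi again shows whether the memory has been freed.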
