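Exploring fixes for "CUDA out of memory" when running a PyTorch image classification training script

Today I wrote an image classification training program in Python and PyTorch. After finally hammering out the code, I ran it, and it promptly failed with RuntimeError: CUDA out of memory, much to my disappointment. The full output was: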

/usr/bin/python3.5 /home/xxx/train.py
Step 1: prepare train/test dataset
There are 121 classes
Step 1 has been completed ---------7.801877
Step 2: Begin to train the model
num_ftrs=2048
num_classes=121
Epoch [0/29] ----------
Traceback (most recent call last):
  File "/home/xxx/train.py", line 121, in <module>
    outputs=model(images)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torchvision/models/resnet.py", line 204, in forward
    x = self.layer4(x)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torchvision/models/resnet.py", line 99, in forward
    out = self.bn1(out)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/batchnorm.py", line 81, in forward
    exponential_average_factor, self.eps)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/functional.py", line 1670, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 154.00 MiB (GPU 0; 23.65 GiB total capacity; 22.54 GiB already allocated; 18.00 MiB free; 257.96 MiB cached)

Process finished with exit code 1
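
The last line of the traceback already says how close the card was to its limit. As a rough sketch (the helper name parse_oom_message is made up for illustration, and the regular expressions assume the message format shown above, which may differ in other PyTorch versions), the numbers can be pulled out and compared like this:

```python
import re

# Message copied from the traceback above.
MESSAGE = (
    "CUDA out of memory. Tried to allocate 154.00 MiB "
    "(GPU 0; 23.65 GiB total capacity; 22.54 GiB already allocated; "
    "18.00 MiB free; 257.96 MiB cached)"
)

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def parse_oom_message(message):
    """Return the sizes quoted in the OOM message, converted to MiB.

    Hypothetical helper: it only understands the wording shown above.
    """
    pattern = re.compile(
        r"(\d+(?:\.\d+)?) (MiB|GiB) "
        r"(total capacity|already allocated|free|cached)"
    )
    sizes = {
        label: float(value) * UNIT_TO_MIB[unit]
        for value, unit, label in pattern.findall(message)
    }
    tried = re.search(r"Tried to allocate (\d+(?:\.\d+)?) (MiB|GiB)", message)
    sizes["requested"] = float(tried.group(1)) * UNIT_TO_MIB[tried.group(2)]
    return sizes


sizes = parse_oom_message(MESSAGE)
used_fraction = sizes["already allocated"] / sizes["total capacity"]
print("allocated: {:.1%} of the card".format(used_fraction))
print("requested {} MiB, only {} MiB free".format(sizes["requested"], sizes["free"]))
```

About 95% of the card was already allocated, and the failing request of 154 MiB was far larger than the 18 MiB still free, so the batch norm call inside layer4 was simply the point at which the memory ran out.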

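The network is the resnext101_32x8d model that ships with torchvision, with batch_size=100. Leaving the rest of the code unchanged, I set batch_size=50 and reran the script. Running watch -n 0.1 nvidia-smi on the command line opens a monitoring view, which showed the following: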
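
Halving the batch size by hand can also be automated. The following is a minimal sketch, not the code from train.py: find_workable_batch_size and fake_step are hypothetical names, and fake_step only imitates the error with an arbitrary limit of 60 so that the example runs without a GPU. In a real script the step function would run one forward and backward pass at the given batch size.

```python
def find_workable_batch_size(run_one_step, start, minimum=1):
    """Halve the batch size until run_one_step stops raising an OOM error.

    run_one_step is any callable taking a batch size; it is expected to
    raise RuntimeError containing "out of memory" when the batch is too big.
    """
    batch_size = start
    while batch_size >= minimum:
        try:
            run_one_step(batch_size)
            return batch_size
        except RuntimeError as err:
            # Re-raise anything that is not an out-of-memory failure.
            if "out of memory" not in str(err):
                raise
            batch_size //= 2
    raise RuntimeError("no batch size >= {} fits in memory".format(minimum))


def fake_step(batch_size, limit=60):
    """Hypothetical stand-in for one training step.

    The limit of 60 is arbitrary and chosen only for this demonstration.
    """
    if batch_size > limit:
        raise RuntimeError("CUDA out of memory.")


chosen = find_workable_batch_size(fake_step, start=100)
print("batch size that fits:", chosen)
```

With a real model it may also help to call torch.cuda.empty_cache() between attempts so that cached blocks from the failed attempt are released, although whether that is needed depends on the PyTorch version.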

[Figure 1: nvidia-smi monitoring window]

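The figure shows that although the machine has four GPUs, only one of them is in use, and its memory is already more than half full. With batch_size=100, the model most likely needed more memory than that single card has.

Simply reducing batch_size is enough to resolve this out-of-memory error. It is not the ideal fix, though, and the four cards are still not put to good use. A follow-up post will cover training the model on multiple GPUs.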
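
As a preview of the multi-GPU direction, here is a hedged sketch using torch.nn.DataParallel, which replicates a model on every visible GPU and splits each input batch among them. It is an assumption about how the follow-up might look, not the original code: the tiny nn.Sequential model and the 32x32 random images are stand-ins so the example runs quickly, whereas the real script uses resnext101_32x8d and its own dataset.

```python
import torch
import torch.nn as nn

# Stand-in network with 121 output classes; the post itself uses
# torchvision's resnext101_32x8d.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 121))

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if torch.cuda.device_count() > 1:
    # Each forward pass scatters the batch across the GPUs, so with four
    # cards a batch of 100 becomes roughly 25 images per card.
    model = nn.DataParallel(model)
model = model.to(device)

# Random tensors in place of real images, for illustration only.
images = torch.randn(100, 3, 32, 32, device=device)
outputs = model(images)
print(tuple(outputs.shape))
```

On a machine with zero or one GPU the wrapper is skipped and the model runs unchanged. The PyTorch documentation recommends DistributedDataParallel over DataParallel for serious multi-GPU training, so this should be read as the simplest starting point rather than the best one.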
