Limiting GPU Memory Usage During Training

PyTorch

import torch
# Cap device 0 at 0.5 of its memory, i.e. half the card: on a 12 GB card, 0.5 allows 6 GB.
torch.cuda.set_per_process_memory_fraction(0.5, 0)
torch.cuda.empty_cache()
# Query the device's total memory.
total_memory = torch.cuda.get_device_properties(0).total_memory
# Allocate 0.499 of the total memory:
tmp_tensor = torch.empty(int(total_memory * 0.499), dtype=torch.int8, device='cuda')

# Free that memory:
del tmp_tensor
torch.cuda.empty_cache()
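
To confirm that the release actually took effect, the caching allocator's counters can be inspected; a short sketch using torch.cuda.memory_allocated and torch.cuda.memory_reserved (both report bytes for the given device):

# Verify the release: allocated drops back to 0 after the del, and reserved
# shrinks once empty_cache() returns cached pages to the driver.
print(torch.cuda.memory_allocated(0))
print(torch.cuda.memory_reserved(0))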

# The following line triggers a CUDA OOM error, because it lands exactly on the cap:
torch.empty(total_memory // 2, dtype=torch.int8, device='cuda')

"""
It raises an error as follows: 
RuntimeError: CUDA out of memory. Tried to allocate 5.59 GiB (GPU 0; 11.17 GiB total capacity; 0 bytes already allocated; 10.91 GiB free; 5.59 GiB allowed; 0 bytes reserved in total by PyTorch)
"""
When the cap is exceeded, the error message carries one extra hint compared with running unrestricted: "5.59 GiB allowed;".
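
If crashing is undesirable, the oversized allocation can also be caught; a minimal sketch, continuing the snippet above (CUDA OOM surfaces as a RuntimeError in most PyTorch versions, with torch.cuda.OutOfMemoryError available as a subclass in newer releases):

# Handle the overflow instead of crashing; total_memory and the 0.5 cap
# are assumed to be set as in the snippet above.
try:
    torch.empty(total_memory // 2, dtype=torch.int8, device='cuda')
except RuntimeError as e:
    print(f"Allocation rejected by the per-process cap: {e}")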

TensorFlow 2

import tensorflow as tf
# GPU memory configuration: method 1

# Get the list of all physical GPUs
physical_gpus = tf.config.list_physical_devices("GPU")
tf.config.experimental.set_virtual_device_configuration(
    physical_gpus[0],
    [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=12000)]  # limit in MB
)
logical_gpus = tf.config.list_logical_devices("GPU")
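
Note that memory_limit is specified in MB. Newer TF 2.x releases expose the same configuration without the experimental prefix; a minimal sketch, assuming tf.config.set_logical_device_configuration is available (TF 2.4+):

# Equivalent setup via the non-experimental API:
physical_gpus = tf.config.list_physical_devices("GPU")
tf.config.set_logical_device_configuration(
    physical_gpus[0],
    [tf.config.LogicalDeviceConfiguration(memory_limit=12000)]  # limit in MB
)
logical_gpus = tf.config.list_logical_devices("GPU")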

# GPU memory configuration: method 2

# Get the list of all physical GPUs
physical_gpus = tf.config.list_physical_devices("GPU")
# Enable on-demand allocation (memory growth)
# Only one GPU here; with multiple GPUs, loop over them (see the sketch below)
tf.config.experimental.set_memory_growth(physical_gpus[0], True)
logical_gpus = tf.config.list_logical_devices("GPU")
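
For the multi-GPU case mentioned in the comment above, the same call simply goes in a loop; a minimal sketch (memory growth must be configured before any GPU is initialized, or TensorFlow raises a RuntimeError):

# Enable on-demand growth on every visible GPU:
physical_gpus = tf.config.list_physical_devices("GPU")
for gpu in physical_gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
logical_gpus = tf.config.list_logical_devices("GPU")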
