```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 31.75 GiB total capacity; 24.79 GiB already allocated; 20.94 MiB free; 26.37 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
There are a few possible ways to fix the out of memory error, depending on the available resources and the desired trade-offs. Here are some suggestions:
1. **Reduce the batch size.** Lower the value of `per_device_train_batch_size` in the `TrainingArguments`. For example, try setting it to 2 or 1 instead of 4 (see the sketch after this list).
2. **Use gradient accumulation.** Raise `gradient_accumulation_steps` in the `TrainingArguments`. For example, if the batch size is 4 and the gradient accumulation steps are 2, the effective batch size is 8, but the memory usage per step is the same as with batch size 4. Note that the learning rate may need to be adjusted for the new effective batch size (sketch below).
3. **Use mixed-precision training.** Install the apex library (https://github.com/NVIDIA/apex) and set `fp16` to `True` in the `TrainingArguments`. Alternatively, use native PyTorch AMP (https://pytorch.org/docs/stable/amp.html) by setting `fp16_backend` to `"amp"` in the `TrainingArguments` (AMP sketch below).
4. **Use a smaller model.** Change the `model_name_or_path` argument when loading the tokenizer and the model. For example, try "microsoft/prophetnet-base-uncased" instead of "microsoft/prophetnet-large-uncased" (sketch below).
5. **Use a different distributed backend.** Try `torch.distributed` or `horovod` instead of `accelerate`, or use DeepSpeed (https://github.com/microsoft/DeepSpeed), which offers memory-optimization techniques such as ZeRO. To change the backend, adjust the value of `backend` in the `Accelerator` class (https://huggingface.co/docs/accelerate/api_reference.html#accelerator). A DeepSpeed sketch follows below.
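For suggestions 1 and 2 together, here is a minimal sketch of the `TrainingArguments` change, assuming the rest of the `Trainer` setup already exists; the output directory and the exact numbers are placeholders to adjust:

```python
from transformers import TrainingArguments

# Hypothetical numbers: batch size cut from 4 to 1, with 8 accumulation steps.
# Effective batch size = 1 * 8 = 8, but per-step memory is that of batch size 1.
training_args = TrainingArguments(
    output_dir="./results",          # placeholder path
    per_device_train_batch_size=1,   # was 4; lower this until the OOM disappears
    gradient_accumulation_steps=8,   # recovers a larger effective batch size
    learning_rate=5e-5,              # may need re-tuning for the new effective batch
)
```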
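For suggestion 3, if you use the `Trainer`, setting `fp16=True` in the `TrainingArguments` above is enough. If you run your own training loop instead, native PyTorch AMP looks roughly like this; the linear layer, random batch, and loss are dummy stand-ins for the real model and data:

```python
import torch
from torch.cuda.amp import autocast, GradScaler

model = torch.nn.Linear(512, 512).cuda()    # dummy stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
scaler = GradScaler()                       # rescales the loss so fp16 gradients don't underflow

for step in range(10):
    batch = torch.randn(4, 512, device="cuda")  # dummy batch
    optimizer.zero_grad()
    with autocast():                        # forward pass runs in half precision where safe
        loss = model(batch).pow(2).mean()   # dummy loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)                  # unscales gradients, then takes the step
    scaler.update()
```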
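For suggestion 4, only the checkpoint name changes. Check on the Hugging Face Hub that the smaller checkpoint actually exists for your architecture before relying on it:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Checkpoint names from the suggestion above; verify availability on the Hub.
model_name = "microsoft/prophetnet-base-uncased"  # instead of "microsoft/prophetnet-large-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
```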
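For suggestion 5, one concrete route is the DeepSpeed integration built into the `Trainer`: point `TrainingArguments` at a DeepSpeed config file. The config below is a minimal hypothetical ZeRO stage 2 setup, and the values need tuning for a real run:

```python
import json
from transformers import TrainingArguments

# Minimal hypothetical ZeRO stage 2 config; tune these values for your setup.
ds_config = {
    "zero_optimization": {"stage": 2},        # partition optimizer state and gradients
    "fp16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto", # let the Trainer fill this in
}
with open("ds_config.json", "w") as f:
    json.dump(ds_config, f)

training_args = TrainingArguments(
    output_dir="./results",       # placeholder path
    deepspeed="ds_config.json",   # activates the Trainer's DeepSpeed integration
)
```

Note that the `deepspeed` package must be installed and training is typically launched through the `deepspeed` launcher (or `accelerate launch`) for this to take effect.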