Gradient Accumulation

Due to GPU memory limits, when running some large pretrained models the batch size often has to be set quite small (1-4), otherwise training fails with 'CUDA out of memory'. However, within a certain range, a larger batch size generally makes convergence more stable and gives somewhat better results. This is where gradient accumulation comes in: gradients from several batches are accumulated first, and the parameters are updated only once, which is equivalent to training with a larger batch size. Below is a note on how to use gradient accumulation in PyTorch.

# Excerpt from the training script; model, optim, train_dataloader,
# device and epochs are assumed to be defined elsewhere.
step = 0
accum_step = 10          # number of batches whose gradients are accumulated per update
for epoch in range(epochs):
    print(f"epochs: {epoch}/{epochs}")
    for batch in train_dataloader:
        step += 1
        input_ids = batch['input_ids'].to(device)
        labels = batch['decoder_input_ids'].to(device)
        loss = model(input_ids=input_ids, labels=labels).loss
        # Scale the loss so the accumulated gradient matches what a single
        # batch of size accum_step * batch_size would produce.
        loss = loss / accum_step
        loss.backward()      # gradients accumulate in .grad across calls
        if step % accum_step == 0:
            optim.step()     # update parameters once every accum_step batches
            optim.zero_grad()
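
One caveat with the snippet above: if the total number of batches is not a multiple of accum_step, the gradients of the last few batches are never applied (or leak into the next accumulation window). Below is a minimal, self-contained sketch of one way to flush them at the end of each epoch; the toy model, data, and names here are assumptions for illustration, not part of the original script.

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy setup (hypothetical): 23 samples, batch_size 1, so the batch count
# is deliberately not divisible by accum_step.
model = nn.Linear(16, 1)
optim = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
data = TensorDataset(torch.randn(23, 16), torch.randn(23, 1))
train_dataloader = DataLoader(data, batch_size=1)

accum_step = 10
epochs = 2
for epoch in range(epochs):
    optim.zero_grad()
    for i, (x, y) in enumerate(train_dataloader, start=1):
        loss = loss_fn(model(x), y) / accum_step
        loss.backward()
        # Step either every accum_step batches or on the final batch of the
        # epoch, so leftover accumulated gradients are not dropped.
        if i % accum_step == 0 or i == len(train_dataloader):
            optim.step()
            optim.zero_grad()

Note that the final partial window still divides the loss by accum_step, so the last update corresponds to a smaller "virtual batch"; for most runs this is harmless, but you could also rescale by the actual number of accumulated batches if you want exact equivalence.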
