RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one.

PyTorch multi-GPU (single machine, multiple GPUs) training pitfall: RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one.

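  • Problem description
  • Investigation
    • Returning multiple outputs from forward that do not take part in the loss does not cause the error
    • Defining layers in the model's __init__ method without using them is the real cause of the error
  • Solution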



I ran multi-GPU training with the accelerate.Accelerator class from Hugging Face's open-source accelerate library, and hit the following error:


RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by
making sure all forward function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn’t able to locate the output tensors in the return value of your module’s forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 1: 40 41 42 43 44 45 46 47 48 49 50 51
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error


In other words, DDP expected the gradient reduction of the previous iteration to have finished before a new iteration started, and it had not, because some parameters of the module took no part in producing the loss. The message suggests two remedies: pass find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and make sure every output of forward participates in the loss calculation.
On rank 1, the parameters with indices 40 to 51 received no gradient. Setting the environment variable TORCH_DISTRIBUTED_DEBUG to INFO or DETAIL prints which parameters those are as part of the error.
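The last line of the message points at TORCH_DISTRIBUTED_DEBUG. Below is a minimal sketch of setting it from inside the training script. It assumes the variable must be in place before the process group is initialized, and the helper name enable_ddp_debug is hypothetical. Exporting the variable in the shell before launching should have the same effect.

```python
import os


def enable_ddp_debug(level: str = "DETAIL") -> str:
    """Set TORCH_DISTRIBUTED_DEBUG for this process (hypothetical helper).

    Assumption: this runs before torch.distributed is initialized,
    i.e. before Accelerator() or init_process_group() is called.
    """
    if level not in ("OFF", "INFO", "DETAIL"):
        raise ValueError(f"unsupported debug level: {level}")
    os.environ["TORCH_DISTRIBUTED_DEBUG"] = level
    return level


enable_ddp_debug("DETAIL")
```

With DETAIL enabled, the error should report the names of the parameters that received no gradient rather than only their indices.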

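At first, when I read the sentence "making sure all forward function outputs participate in calculating loss", I suspected that the error came from forward outputs that were not used in the loss. The test script below returns fc3 from forward without using it in the loss, and it does not trigger the error: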

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator
from accelerate.utils import write_basic_config
from collections import namedtuple

def main():

    # write_basic_config()  # Write a config file

    # 1. Initialize the Accelerator
    accelerator = Accelerator()

    # 2. Define a simple model and optimizer
    class SimpleModel(nn.Module):

        def __init__(self):
            super(SimpleModel, self).__init__()
            self.fc1 = nn.Linear(10, 10)
            self.fc2 = nn.Linear(10, 2)

        def forward(self, x):
            fc1 = self.fc1(x)
            fc2 = self.fc2(fc1)
            fc3 = self.fc2(fc1) * 2 + 1
            return_namedtuple = namedtuple('return_namedtuple', ['fc1', 'fc2', 'fc3'])
            return return_namedtuple(fc1=fc1, fc2=fc2, fc3=fc3)

    model = SimpleModel()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    # 3. Define dummy data
    x = torch.randn(10000, 10)
    y = (torch.randn(10000) > 0.5).long()
    dataset = TensorDataset(x, y)
    loader = DataLoader(dataset, batch_size=512, shuffle=True)

    # 4. Prepare the model, optimizer, and data loader with accelerator.prepare
    model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

    # 5. Training loop
    for epoch in range(10000):
        for batch in loader:
            inputs, targets = batch

            loss = 0

            outputnamedtuple = model(inputs)
            loss1 = nn.CrossEntropyLoss()(outputnamedtuple.fc1, targets)
            loss2 = nn.CrossEntropyLoss()(outputnamedtuple.fc2, targets)

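            loss = loss1 + loss2

            # Backward pass and parameter update; DDP reduces gradients during backward
            optimizer.zero_grad()
            accelerator.backward(loss)
            optimizer.step()
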
            print(f"Loss: {loss.item()}")

    print("Training completed!")


if __name__ == "__main__":
    main()

Changing the model to the following version, however, does trigger the error:

class SimpleModel(nn.Module):

    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc1 = nn.Linear(10, 10)
        self.fc2 = nn.Linear(10, 2)
        self.fc3 = nn.Linear(10, 2)

    def forward(self, x):
        fc1 = self.fc1(x)
        fc2 = self.fc2(fc1)
        fc3 = self.fc3(fc1)
        return_namedtuple = namedtuple('return_namedtuple', ['fc1', 'fc2', 'fc3'])
        return return_namedtuple(fc1=fc1, fc2=fc2, fc3=fc3)

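The only difference from the version above is the extra layer self.fc3 = nn.Linear(10, 2). Because the fc3 output computed by this layer does not take part in the loss calculation, the parameters of fc3 receive no gradient, and that triggers the error.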
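One way to confirm which parameters are left out, without launching a distributed job, is to run a single forward and backward pass on one process and list the parameters whose .grad is still None. The sketch below does this for the same layers as the failing model above; the helper name params_without_grad and the batch size are illustrative assumptions.

```python
from collections import namedtuple

import torch
import torch.nn as nn

Outputs = namedtuple("Outputs", ["fc1", "fc2", "fc3"])


class SimpleModel(nn.Module):
    """Same layers as the failing model: fc3 is computed but never reaches the loss."""

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 10)
        self.fc2 = nn.Linear(10, 2)
        self.fc3 = nn.Linear(10, 2)

    def forward(self, x):
        fc1 = self.fc1(x)
        return Outputs(fc1=fc1, fc2=self.fc2(fc1), fc3=self.fc3(fc1))


def params_without_grad(model: nn.Module) -> list:
    """Names of parameters whose .grad is still None after backward()."""
    return [name for name, p in model.named_parameters() if p.grad is None]


torch.manual_seed(0)
model = SimpleModel()
inputs = torch.randn(8, 10)
targets = torch.randint(0, 2, (8,))

out = model(inputs)
criterion = nn.CrossEntropyLoss()
loss = criterion(out.fc1, targets) + criterion(out.fc2, targets)  # fc3 is left out
loss.backward()

unused = params_without_grad(model)
print(unused)  # ['fc3.weight', 'fc3.bias']
```

Any name in this list is a parameter that DDP would wait on during gradient reduction without ever receiving a gradient for it.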


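The same error occurs when the layer is defined in __init__ but never called in forward at all, which is the real cause of the problem:

class SimpleModel(nn.Module):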

    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc1 = nn.Linear(10, 10)
        self.fc2 = nn.Linear(10, 2)
        self.fc3 = nn.Linear(10, 2)

    def forward(self, x):
        fc1 = self.fc1(x)
        fc2 = self.fc2(fc1)
        return_namedtuple = namedtuple('return_namedtuple', ['fc1', 'fc2'])
        return return_namedtuple(fc1=fc1, fc2=fc2)
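A layer that forward never calls still matters because assigning an nn.Module to an attribute of self registers it as a submodule, and its parameters then show up in model.parameters(). The sketch below compares that with a layer held only in a local variable; the class names WithAttribute and WithLocalVariable are hypothetical.

```python
import torch.nn as nn


class WithAttribute(nn.Module):
    """fc3 is assigned to self, so nn.Module registers it as a submodule."""

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 10)
        self.fc2 = nn.Linear(10, 2)
        self.fc3 = nn.Linear(10, 2)  # registered, even though forward never calls it

    def forward(self, x):
        return self.fc2(self.fc1(x))


class WithLocalVariable(nn.Module):
    """fc3 is only a local variable and is discarded when __init__ returns."""

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 10)
        self.fc2 = nn.Linear(10, 2)
        fc3 = nn.Linear(10, 2)  # not registered

    def forward(self, x):
        return self.fc2(self.fc1(x))


attr_names = [name for name, _ in WithAttribute().named_parameters()]
local_names = [name for name, _ in WithLocalVariable().named_parameters()]
print(attr_names)   # includes 'fc3.weight' and 'fc3.bias'
print(local_names)  # only the fc1 and fc2 parameters
```

Only registered parameters are handed to the DDP wrapper, so only the first variant leaves DDP waiting for gradients that never arrive.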


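By contrast, if fc3 is created as a plain local variable in __init__ instead of being assigned to self, it is never registered as a submodule, so DDP does not track its parameters and the error does not occur:

class SimpleModel(nn.Module):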

    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc1 = nn.Linear(10, 10)
        self.fc2 = nn.Linear(10, 2)
        fc3 = nn.Linear(10, 2)

    def forward(self, x):
        fc1 = self.fc1(x)
        fc2 = self.fc2(fc1)
        return_namedtuple = namedtuple('return_namedtuple', ['fc1', 'fc2'])
        return return_namedtuple(fc1=fc1, fc2=fc2)




