OpenMMLab

PyTorch 源码解读之 DP & DDP：模型并行和分布式训练解析

本文介绍 PyTorch 里的数据并行训练，涉及 nn.DataParallel (DP) 和 nn.parallel.DistributedDataParallel (DDP) 两个模块（基于 1.7 版本），涵盖分布式训练的原理以及源码解读（大多以汉字注释，记得仔细读一下 comment ）。内容组织如下：

0 数据并行

1 DP

1.1 使用

1.2 原理

1.3 实现

1.4 分析

2 DDP

2.1 使用

2.2 原理

2.3 实现

0 数据并行

当一张 GPU 可以存储一个模型时，可以采用数据并行得到更准确的梯度或者加速训练，即每个 GPU 复制一份模型，将一批样本分为多份输入各个模型并行计算。因为求导以及加和都是线性的，数据并行在数学上也有效。

假设我们一个 batch 有个样本，一共有个 GPU 每个 GPU 分到个样本。假设样本刚好等分，则有 $m_j = \frac{n}{k}$ 。我们考虑总的损失函数对参数的导数：

$\frac{\partial\ Loss}{\partial w} = \frac{\partial[\frac{1}{n}\sum_{i=1}^{n}l(x_i,y_i)]}{\partial w} = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial l(x_i,y_i)}{\partial w} = \sum_{j=1}^{k} \frac{m_j}{n} \frac{\partial[\frac{1}{m_j}\sum_{i= m_{j-1}}^{m_{j-1}+m_{j}}l(x_i,y_i)]}{\partial w} = \sum_{j=1}^{k} \frac{m_j}{n}\frac{\partial\ loss_{j}}{\partial w} = \frac{1}{k} \sum_{j=1}^{k} \frac{\partial\ loss_{j}}{\partial w}$

那么接下来我们看一下 PyTorch 究竟是怎么实现数据并行的。

1 DP

1.1 使用

DP 的好处是，使用起来非常方便，只需要将原来单卡的 module 用 DP 改成多卡:

model = nn.DataParallel(model)

1.2 原理

DP 基于单机多卡，所有设备都负责计算和训练网络，除此之外， device[0] (并非 GPU 真实标号而是输入参数 device_ids 首位) 还要负责整合梯度，更新参数。图 1 即为 GPU 0 作为 device[0] 的例子。从图中我们可以看出，有三个主要过程：

过程一（图中红色部分）：各卡分别计算损失和梯度
过程二（图中蓝色部分）：所有梯度整合到 device[0]
过程三（图中绿色部分）：device[0] 进行参数更新，其他卡拉取 device[0] 的参数进行更新

所有卡都并行运算（图中红色），将梯度收集到 device[0]（图中浅蓝色）和 device[0] 分享模型参数给其他 GPU（图中绿色）三个主要过程。

PyTorch 源码解读之 DP & DDP：模型并行和分布式训练解析_第1张图片

图 1: GPU 0 as device[0]，原图见 [2]

虽然 DP 只能实现单机训练不能算是严格意义上的分布式训练（多个节点），但是其原理和分布式训练算法里的 Parameter Server 架构很相近，我们借用 PS 的伪代码来说明一下。

PyTorch 源码解读之 DP & DDP：模型并行和分布式训练解析_第2张图片

图 2: PS，原图见 [3]

我们可以看到 PS 的并行梯度下降流程分为四部分：

Task Scheduler：负责加载数据并分发数据至每个 worker 节点，并执行多轮迭代。

在每轮迭代中，worker 负责：

初始化：载入数据并将全部模型参数从 server 节点拉下来（图 1 绿色）
梯度计算：利用该节点的数据计算梯度（图 1 红色）并将梯度更新到 server 节点（图 1 蓝色）

Server 负责：

汇总梯度
更新参数

OK，现在我们已经知道了 DP 使用的算法，接下来我们看一下 PyTorch 是如何实现的。

1.3 实现

这一节主要讨论 DP 的实现，首先先贴上源码（顺便看一下comment 部分）：

class DataParallel(Module):

    def __init__(self, module, device_ids=None, output_device=None, dim=0):
        super(DataParallel, self).__init__()

        # 检查是否有可用的 GPU
        device_type = _get_available_device_type()
        if device_type is None:
            self.module = module
            self.device_ids = []
            return
                # 默认使用所有可见的 GPU
        if device_ids is None:
            device_ids = _get_all_device_indices()

                # 默认 server 是 device_ids 列表上第一个
        if output_device is None:
            output_device = device_ids[0]

        self.dim = dim
        self.module = module
        self.device_ids = list(map(lambda x: _get_device_index(x, True), device_ids))
        self.output_device = _get_device_index(output_device, True)
        self.src_device_obj = torch.device(device_type, self.device_ids[0])

        # 检查负载是否平衡， 不平衡（指内存或者处理器 max/min > 0.75 会有警告）
        _check_balance(self.device_ids)

        # 单卡
        if len(self.device_ids) == 1:
            self.module.to(self.src_device_obj)

    def forward(self, *inputs, **kwargs):

        # 没 GPU 可用
        if not self.device_ids:
            return self.module(*inputs, **kwargs)

                # 运行前 GPU device_ids[0] （即我们的 server ）上必须有 parallelized module 的parameters 和 buffers
        # 因为 DP 保证 GPU device_ids[0] 和 base parallelized module 共享存储
        # 所以在device[0] 上的 in-place 更新也会被保留下来，其他的则不会

        for t in chain(self.module.parameters(), self.module.buffers()):
            if t.device != self.src_device_obj:
                raise RuntimeError("module must have its parameters and buffers "
                                   "on device {} (device_ids[0]) but found one of "
                                   "them on device: {}".format(self.src_device_obj, t.device))

                # nice 现在 device[0] 上已经有了 module 和 input， 接下来我们就要开始 PS 算法了
        # 可以开始看正文了

        inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)

        # 如果仅有单卡可用，直接单卡计算，不用并行
        if len(self.device_ids) == 1:
            return self.module(*inputs[0], **kwargs[0])

        replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
        outputs = self.parallel_apply(replicas, inputs, kwargs)
        return self.gather(outputs, self.output_device)

    def replicate(self, module, device_ids):
        return replicate(module, device_ids, not torch.is_grad_enabled())

    def scatter(self, inputs, kwargs, device_ids):
        return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)

    def parallel_apply(self, replicas, inputs, kwargs):
        return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])

    def gather(self, outputs, output_device):
        return gather(outputs, output_device, dim=self.dim)

从 forward 函数可以看出，关键函数有 scatter, replicate, parallel_apply 和 gather，我们一个一个看一下。

首先是 scatter 函数，即 scatter_kwargs 函数。

def scatter_kwargs(inputs, kwargs, target_gpus, dim=0):
    r"""Scatter with support for kwargs dictionary"""

    # 主要函数
    inputs = scatter(inputs, target_gpus, dim) if inputs else []
    kwargs = scatter(kwargs, target_gpus, dim) if kwargs else []

    # 用空项补全使 inputs 和 kwargs 长度相当
    if len(inputs) < len(kwargs):
        inputs.extend([() for _ in range(len(kwargs) - len(inputs))])
    elif len(kwargs) < len(inputs):
        kwargs.extend([{} for _ in range(len(inputs) - len(kwargs))])
    # 返回 tuple
    inputs = tuple(inputs)
    kwargs = tuple(kwargs)
    return inputs, kwargs

scatter_kwargs 函数中最重要的就是 scatter 函数，负责将 tensor 分成大概相等的块并将他们分给不同的 GPU。对其他的数据类型，则是复制分散给不同的 GPU 。

def scatter(inputs, target_gpus, dim=0):
    r"""
    Slices tensors into approximately equal chunks and
    distributes them across given GPUs. Duplicates
    references to objects that are not tensors.
    """
    def scatter_map(obj):
        if isinstance(obj, torch.Tensor):
            return Scatter.apply(target_gpus, None, dim, obj)
        if is_namedtuple(obj):
            return [type(obj)(*args) for args in zip(*map(scatter_map, obj))]
        if isinstance(obj, tuple) and len(obj) > 0:
            return list(zip(*map(scatter_map, obj)))
        if isinstance(obj, list) and len(obj) > 0:
            return [list(i) for i in zip(*map(scatter_map, obj))]
        if isinstance(obj, dict) and len(obj) > 0:
            return [type(obj)(i) for i in zip(*map(scatter_map, obj.items()))]
        return [obj for targets in target_gpus]

    # After scatter_map is called, a scatter_map cell will exist. This cell
    # has a reference to the actual function scatter_map, which has references
    # to a closure that has a reference to the scatter_map cell (because the
    # fn is recursive). To avoid this reference cycle, we set the function to
    # None, clearing the cell
    try:
        res = scatter_map(inputs)
    finally:
        scatter_map = None
    return res

其中，针对 tensor 的函数，

class Scatter(Function):

    @staticmethod
    def forward(ctx, target_gpus, chunk_sizes, dim, input):
        target_gpus = [_get_device_index(x, True) for x in target_gpus]
        ctx.dim = dim
        ctx.input_device = input.get_device() if input.device.type != "cpu" else -1
        streams = None
        if torch.cuda.is_available() and ctx.input_device == -1:
            # Perform CPU to GPU copies in a background stream

            # 新建 cuda stream
            streams = [_get_stream(device) for device in target_gpus]

        # 真正的操作
        outputs = comm.scatter(input, target_gpus, chunk_sizes, ctx.dim, streams)

        # Synchronize with the copy stream
        if streams is not None:
            for i, output in enumerate(outputs):
                with torch.cuda.device(target_gpus[i]):
                    main_stream = torch.cuda.current_stream()
                    main_stream.wait_stream(streams[i])
                    output.record_stream(main_stream)
        return outputs

    @staticmethod
    def backward(ctx, *grad_output):
        return None, None, None, Gather.apply(ctx.input_device, ctx.dim, *grad_output)

comm.scatter 依赖于 C++，就不介绍了。

回顾 DP 代码块，我们已经运行完 scatter函数，即将一个 batch 近似等分成更小的 batch。接下来我们要看 replicate 函数和 gather 函数（假设我们有不少于两张卡）。

#  DP forward 里的代码
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])

    # 实现
    def replicate(network, devices, detach=False):

        if not _replicatable_module(network):
            raise RuntimeError("Cannot replicate network where python modules are "
                               "childrens of ScriptModule")

        if not devices:
            return []

        # 需要复制到哪些 GPU， 复制多少份
        devices = [_get_device_index(x, True) for x in devices]
        num_replicas = len(devices)

        # 复制 parameters
        params = list(network.parameters())
        param_indices = {param: idx for idx, param in enumerate(params)}

        # 拉到代码块底部看原函数，然后再回来
        param_copies = _broadcast_coalesced_reshape(params, devices, detach)


        # 复制 buffers
        buffers = list(network.buffers())
        buffers_rg = []
        buffers_not_rg = []
        for buf in buffers:
            if buf.requires_grad and not detach:
                buffers_rg.append(buf)
            else:
                buffers_not_rg.append(buf)

                # 记录需要和不需要求导的 buffer 的 index
        buffer_indices_rg = {buf: idx for idx, buf in enumerate(buffers_rg)}
        buffer_indices_not_rg = {buf: idx for idx, buf in enumerate(buffers_not_rg)}

                # 分别拷贝，这个咱们已经会了
        buffer_copies_rg = _broadcast_coalesced_reshape(buffers_rg, devices, detach=detach)
        buffer_copies_not_rg = _broadcast_coalesced_reshape(buffers_not_rg, devices, detach=True)

        # 现在开始拷贝网络
        # 准备过程：将 network.modules() 变成list
        # 然后再为之后复制的模型准备好空的 list 和 indices

        modules = list(network.modules())
        module_copies = [[] for device in devices]
        module_indices = {}
        scriptmodule_skip_attr = {"_parameters", "_buffers", "_modules", "forward", "_c"}

        for i, module in enumerate(modules):
            module_indices[module] = i
            for j in range(num_replicas):
                replica = module._replicate_for_data_parallel()
                # This is a temporary fix for DDP. DDP needs to access the
                # replicated model parameters. It used to do so through
                # `mode.parameters()`. The fix added in #33907 for DP stops the
                # `parameters()` API from exposing the replicated parameters.
                # Hence, we add a `_former_parameters` dict here to support DDP.
                replica._former_parameters = OrderedDict()

                module_copies[j].append(replica)

                # 接下来分别复制 module，param，buffer
        for i, module in enumerate(modules):
            for key, child in module._modules.items():
                if child is None:
                    for j in range(num_replicas):
                        replica = module_copies[j][i]
                        replica._modules[key] = None
                else:
                    module_idx = module_indices[child]
                    for j in range(num_replicas):
                        replica = module_copies[j][i]
                        setattr(replica, key, module_copies[j][module_idx])
            for key, param in module._parameters.items():
                if param is None:
                    for j in range(num_replicas):
                        replica = module_copies[j][i]
                        replica._parameters[key] = None
                else:
                    param_idx = param_indices[param]
                    for j in range(num_replicas):
                        replica = module_copies[j][i]
                        param = param_copies[j][param_idx]
                        # parameters in replicas are no longer leaves,
                        # so setattr them as non-parameter attributes
                        setattr(replica, key, param)
                        # expose the parameter for DDP
                        replica._former_parameters[key] = param
            for key, buf in module._buffers.items():
                if buf is None:
                    for j in range(num_replicas):
                        replica = module_copies[j][i]
                        replica._buffers[key] = None
                else:
                    if buf.requires_grad and not detach:
                        buffer_copies = buffer_copies_rg
                        buffer_idx = buffer_indices_rg[buf]
                    else:
                        buffer_copies = buffer_copies_not_rg
                        buffer_idx = buffer_indices_not_rg[buf]
                    for j in range(num_replicas):
                        replica = module_copies[j][i]
                        setattr(replica, key, buffer_copies[j][buffer_idx])

        return [module_copies[j][0] for j in range(num_replicas)]

    # ！！！从replicate来看这里
    def _broadcast_coalesced_reshape(tensors, devices, detach=False):

      from ._functions import Broadcast

      # 先看 else 的 comment，因为不 detach 也会用到同样的函数
      if detach:
          return comm.broadcast_coalesced(tensors, devices)
      else:
          # Use the autograd function to broadcast if not detach
          if len(tensors) > 0:

            # 下拉看源码
              tensor_copies = Broadcast.apply(devices, *tensors)

              return [tensor_copies[i:i + len(tensors)]
                      for i in range(0, len(tensor_copies), len(tensors))]
          else:
              return []

   #  Broadcast.apply
   class Broadcast(Function):

    @staticmethod
    def forward(ctx, target_gpus, *inputs):
        assert all(i.device.type != 'cpu' for i in inputs), (
            'Broadcast function not implemented for CPU tensors'
        )
        target_gpus = [_get_device_index(x, True) for x in target_gpus]
        ctx.target_gpus = target_gpus
        if len(inputs) == 0:
            return tuple()
        ctx.num_inputs = len(inputs)
        # input 放在 device[0]
        ctx.input_device = inputs[0].get_device()

        # 和 detach 的情况一样
        outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)

        # comm.broadcast_coalesced 的代码
        # tensors 必须在同一个设备，CPU 或者 GPU； devices 即是要拷贝到的设备；buffer_size 则是最大的buffer
        # 这里用到 buffer 将小张量合并到缓冲区以减少同步次数
        # def broadcast_coalesced(tensors, devices, buffer_size=10485760):
        #    devices = [_get_device_index(d) for d in devices]
            #       return torch._C._broadcast_coalesced(tensors, devices, buffer_size)

        non_differentiables = []
        for idx, input_requires_grad in enumerate(ctx.needs_input_grad[1:]):
            if not input_requires_grad:
                for output in outputs:
                    non_differentiables.append(output[idx])
        ctx.mark_non_differentiable(*non_differentiables)
        return tuple([t for tensors in outputs for t in tensors])

    @staticmethod
    def backward(ctx, *grad_outputs):
        return (None,) + ReduceAddCoalesced.apply(ctx.input_device, ctx.num_inputs, *grad_outputs)

下面继续 parallel_apply 部分。⚠️ DP 和 DDP 共用 parallel_apply 代码

# DP 代码
outputs = self.parallel_apply(replicas, inputs, kwargs)

# threading 实现，用前面准备好的 replica 和输入数据，然后
# for 循环启动多线程

# 源码
def parallel_apply(modules, inputs, kwargs_tup=None, devices=None):

        # 每个 GPU 都有模型和输入
    assert len(modules) == len(inputs)

    # 确保每个 GPU 都有相应的数据，如没有就空白补全
    if kwargs_tup is not None:
      # 咱们在 scatter 已经补全了
        assert len(modules) == len(kwargs_tup)
    else:
        kwargs_tup = ({},) * len(modules)

    if devices is not None:
        assert len(modules) == len(devices)
    else:
        devices = [None] * len(modules)

    devices = [_get_device_index(x, True) for x in devices]

    # 多线程实现

    lock = threading.Lock()
    results = {}
    grad_enabled, autocast_enabled = torch.is_grad_enabled(), torch.is_autocast_enabled()

    # 定义 worker
    def _worker(i, module, input, kwargs, device=None):
        torch.set_grad_enabled(grad_enabled)
        if device is None:
            device = get_a_var(input).get_device()
        try:
            with torch.cuda.device(device), autocast(enabled=autocast_enabled):
                # this also avoids accidental slicing of `input` if it is a Tensor
                if not isinstance(input, (list, tuple)):
                    input = (input,)
                output = module(*input, **kwargs)
            with lock:
              # 并行计算得到输出
                results[i] = output
        except Exception:
            with lock:
                results[i] = ExceptionWrapper(
                    where="in replica {} on device {}".format(i, device))

    if len(modules) > 1:

      # 如有一个进程控制多个 GPU ，起多个线程
      # 需要强调一下，虽然 DDP 推荐单卡单进程，即每次调用 DDP device_ids 都只输入一张卡的 id（通常是 args.local_rank），但是如果输入多个 device_id，此时 DDP 就是单进程多线程控制多卡，和 DP 一样，关于 DDP 的解读可以看下文

        threads = [threading.Thread(target=_worker,
                                    args=(i, module, input, kwargs, device))
                   for i, (module, input, kwargs, device) in
                   enumerate(zip(modules, inputs, kwargs_tup, devices))]

        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
    else:
      # 一个 GPU 一个进程 （ DDP 推荐操作）
        _worker(0, modules[0], inputs[0], kwargs_tup[0], devices[0])

    outputs = []
    for i in range(len(inputs)):
        output = results[i]

        # error handle
        if isinstance(output, ExceptionWrapper):
            output.reraise()
        outputs.append(output)
    # 输出 n 个计算结果
    return outputs

现在我们已经得到并行计算的结果了，接下来我们要将结果收集到 device[0]。

# DP 代码
return self.gather(outputs, self.output_device)
# 收集到 devices[0]

# 源码
def gather(outputs, target_device, dim=0):
    r"""
    Gathers tensors from different GPUs on a specified device
      (-1 means the CPU).
    """
    def gather_map(outputs):
        out = outputs[0]
        if isinstance(out, torch.Tensor):
            return Gather.apply(target_device, dim, *outputs)
        if out is None:
            return None
        if isinstance(out, dict):
            if not all((len(out) == len(d) for d in outputs)):
                raise ValueError('All dicts must have the same number of keys')
            return type(out)(((k, gather_map([d[k] for d in outputs]))
                              for k in out))
        return type(out)(map(gather_map, zip(*outputs)))

    # Recursive function calls like this create reference cycles.
    # Setting the function to None clears the refcycle.
    try:
        res = gather_map(outputs)
    finally:
        gather_map = None
    return res

# Gather 源码

class Gather(Function):

    @staticmethod
    def forward(ctx, target_device, dim, *inputs):
        assert all(i.device.type != 'cpu' for i in inputs), (
            'Gather function not implemented for CPU tensors'
        )

        target_device = _get_device_index(target_device, True)

        ctx.target_device = target_device

        ctx.dim = dim
        ctx.input_gpus = tuple(i.get_device() for i in inputs)

        if all(t.dim() == 0 for t in inputs) and dim == 0:
            inputs = tuple(t.view(1) for t in inputs)
            warnings.warn('Was asked to gather along dimension 0, but all '
                          'input tensors were scalars; will instead unsqueeze '
                          'and return a vector.')
            ctx.unsqueezed_scalar = True
        else:
            ctx.unsqueezed_scalar = False
        ctx.input_sizes = tuple(i.size(ctx.dim) for i in inputs)
        return comm.gather(inputs, ctx.dim, ctx.target_device)

    @staticmethod
    def backward(ctx, grad_output):
        scattered_grads = Scatter.apply(ctx.input_gpus, ctx.input_sizes, ctx.dim, grad_output)
        if ctx.unsqueezed_scalar:
            scattered_grads = tuple(g[0] for g in scattered_grads)
        return (None, None) + scattered_grads

# comm.gather 涉及到 C++，具体实现咱也不讲了 ；)  
# Gathers tensors from multiple GPU devices.   
def gather(tensors, dim=0, destination=None, *, out=None):
    tensors = [_handle_complex(t) for t in tensors]
    if out is None:
        if destination == -1:
            warnings.warn(
                'Using -1 to represent CPU tensor is deprecated. Please use a '
                'device object or string instead, e.g., "cpu".')
        destination = _get_device_index(destination, allow_cpu=True, optional=True)
        return torch._C._gather(tensors, dim, destination)
    else:
        if destination is not None:
            raise RuntimeError(
                "'destination' must not be specified when 'out' is specified, but "
                "got destination={}".format(destination))
        return torch._C._gather_out(tensors, out, dim)

因为大量实现都依赖 C++，这篇笔记就不涉及了。

最后，用一张图形象地看一下 DP Module 究竟怎么执行的，具体看第一行和第三行：

前向传播的时候我们会先用 Scatter 函数将数据从 device[0] 分配并复制到不同的卡，之后用 Replicate 函数将模型从 device[0] 复制到不同的卡，之后各个卡都有了同样的模型和不同的数据，分别调用 forward 计算损失和梯度。

反向传播的时候，我们会将梯度收集到 device[0] 然后在 device[0] 更新参数。

PyTorch 源码解读之 DP & DDP：模型并行和分布式训练解析_第3张图片

Fig. 3: DP Module流程图，原图见[4]

1.4 分析

负载不均衡

device[0] 负载大一些

通信开销

假设有个 GPU，完成一次通信需要时间 $\frac{p}{b}$ ，那么使用 PS 算法，总共需要花费时间 $T = 2(k-1)\frac{p}{b}$

单进程

The difference between DistributedDataParallel and DataParallel is: DistributedDataParallel uses multiprocessing where a process is created for each GPU, while DataParallel uses multithreading. By using multiprocessing, each GPU has its dedicated process, this avoids the performance overhead caused by GIL of Python interpreter.

官方文档

Global Interpreter Lock (GIL)全局解释器锁，简单来说就是，一个 Python 进程只能利用一个 CPU kernel，即单核多线程并发时，只能执行一个线程。考虑多核，多核多线程可能出现线程颠簸 (thrashing) 造成资源浪费，所以 Python 想要利用多核最好是多进程。

2 DDP

2.1 使用

import argparse
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument("--save_dir", default='')
parser.add_argument("--local_rank", default=-1)
parser.add_argument("--world_size", default=1)
args = parser.parse_args()

# 初始化后端

# world_size 指的是总的并行进程数目
# 比如16张卡单卡单进程 就是 16
# 但是如果是8卡单进程 就是 1
# 等到连接的进程数等于world_size，程序才会继续运行
torch.distributed.init_process_group(backend='nccl',
                                         world_size=ws,
                                         init_method='env://')

torch.cuda.set_device(args.local_rank)

device = torch.device(f'cuda:{args.local_rank}')

model = nn.Linear(2,3).to(device)

# train dataset
# train_sampler
# train_loader

# 初始化 DDP，这里我们通过规定 device_id 用了单卡单进程
# 实际上根据我们前面对 parallel_apply 的解读，DDP 也支持一个进程控制多个线程利用多卡
model = DDP(model,
            device_ids=[args.local_rank],
            output_device=args.local_rank).to(device)


# 保存模型 
if torch.distributed.get_rank() == 0:
  torch.save(model.module.state_dict(),
             'results/%s/model.pth' % args.save_dir)

2.2 原理

区别
- 多进程
  和 DP 不同， DDP 采用多进程，最推荐的做法是每张卡一个进程从而避免上一节所说单进程带来的影响。前文也提到了 DP 和 DDP 共用一个 parallel_apply 函数，所以 DDP 同样支持单进程多线程多卡操作，自然也支持多进程多线程，不过需要注意一下 world_size。
- 通信效率
  DP 的通信成本随着 GPU 数量线性增长，而 DDP 支持 Ring AllReduce，其通信成本是恒定的，与 GPU 数量无关。
- 同步参数
  DP 通过收集梯度到 device[0]，在device[0] 更新参数，然后其他设备复制 device[0] 的参数实现各个模型同步；
  DDP 通过保证初始状态相同并且改变量也相同（指同步梯度），保证模型同步。
Ring AllReduce

PyTorch 源码解读之 DP & DDP：模型并行和分布式训练解析_第4张图片

Fig. 4: Ring AllReduce 流程图，原图见 [7]

假设我们有个 GPU，传输总量是，为每次通信上限。

首先我们将要传输的梯度等分成份，则每台机器每次需要传输 $\frac{p}{k}$ 。传输次可以收集到一个完整梯度（如动图 5 所示），之后再传输次将梯度分给所有 GPU（如动图 6 所示）。

举个例子，假设现有 5 个 GPU，那么就将梯度分为 5 份，如下图，分别是 , 这里的指的是 GPU 编号。

Scatter Reduce 流程，从 diagonal 的位置开始传，每次传输时 GPU 之间只有一个块在传输，比如，在传播 4 次后 GPU 4 上就有了一个完整的梯度块。

PyTorch 源码解读之 DP & DDP：模型并行和分布式训练解析_第5张图片

图 5: Scatter Reduce 流程图，原图见 [7]， gif 见 [8]

All Gather 的过程也类似，只是将收集到的完整梯度通过通信传播给所有参与的 GPU。

PyTorch 源码解读之 DP & DDP：模型并行和分布式训练解析_第6张图片

图 6: All Gather 流程图，原图见 [7]， gif 见 [8]

这样，通信开销就只有 $2(k-1)\frac{\frac{p}{k}}{b}$ ，和 GPU 数量无关了。

DDP 也是数据并行，所以每张卡都有模型和输入。我们以多进程多线程为例，每起一个进程，该进程的 device[0] 都会从本地复制模型，如果该进程仍有多线程，就像 DP，模型会从 device[0] 复制到其他设备。

DDP 通过 Reducer 来管理梯度同步。为了提高通讯效率， Reducer 会将梯度归到不同的桶里（按照模型参数的 reverse order，因为反向传播需要符合这样的顺序），一次归约一个桶。其中桶的大小为参数 bucket_cap_mb 默认为 25，可根据需要调整。下图即为一个例子。

可以看到每个进程里，模型参数都按照倒序放在桶里，每次归约一个桶。

PyTorch 源码解读之 DP & DDP：模型并行和分布式训练解析_第7张图片

图 7: Gradient Bucketing 示意图，原图见 [10]

DDP 通过在构建时注册 autograd hook 进行梯度同步。反向传播时，当一个梯度计算好后，相应的 hook 会告诉 DDP 可以用来归约。当一个桶里的梯度都可以了，Reducer 就会启动异步 allreduce 去计算所有进程的平均值。allreduce 异步启动使得 DDP 可以边计算边通信，提高效率。当所有桶都可以了，Reducer 会等所有 allreduce 完成，然后将得到的梯度写到 param.grad。

2.3 实现

DDP 主要基于下图所示结构，本节我们会着重讲解 distributed.py 和 reducer.cpp 两个文件。至于 backend，NCCL 已经最优化了，建议直接用 NCCL，不过 NCCL 只支持 GPU Tensor 间通信。

PyTorch 源码解读之 DP & DDP：模型并行和分布式训练解析_第8张图片

图 8: 代码架构，原图见 [10]

终于可以看 DDP 的实现了！！首先我们贴上伪代码！

伪代码

PyTorch 源码解读之 DP & DDP：模型并行和分布式训练解析_第9张图片

图 9: DDP 伪代码，原图见 [11]

从 DDP 的伪代码我们可以看出，DDP 最重要的包括三部分：

constructor
- 负责在构建的时候将 rank 0 的 state_dict() 广播 ➜ 保证所有网络初始状态相同；
- 初始化 buckets 并尽可能按逆序将 parameters 分配进 buckets ➜ 按桶通信提高效率；
- 为每个 parameter 加上 grad_accumulator 以及在 autograd_graph 注册 autograd_hook ➜ 在 backward 时负责梯度同步。
forward
- 正常的 forward 操作；
- 如果 self.find_unused_parameters 设置为 True，DDP 会在 forward 结束时 traverse autograd graph 找到所有没用过的parameters 并标记为 ready ➜ 虽说这一步开销很大，但是有时计算动态图会改变，所以很必要。
autograd_hook
- 这个 hook 是挂在 autograd graph 在 backward 时负责梯度同步的。当一个梯度计算好后，相应的 hook 会告诉 DDP 可以用来归约。当一个桶里的梯度都可以了，Reducer 就会启动异步 allreduce 去计算所有进程的平均值。当所有桶都可以了，Reducer 会等所有 allreduce 完成，然后将得到的梯度写到 param.grad。

好的，但现在为止我们应该对 DDP 有了大致了解了，接下来就一起看一下代码是怎么实现的！

通信

因为 DDP 依赖 c10d 的 ProcessGroup 进行通信，所以开始前我们先要有个 ProcessGroup 实例。这步可以通过 torch.distributed.init_process_group 实现。

构建

我们先贴上 DDP 初始化的源码，最重要的是 _ddp_init_helper 这个函数，负责多线程时复制模型、将 parameters 分组、创建 reducer 以及为 SyncBN 做准备等。这部分代码看 comment 就能懂，我们会重点说一下 dist.Reducer，作为管理器，自然很重要了。

class DistributedDataParallel(Module):       
    def __init__(self, module, device_ids=None,
                 output_device=None, dim=0, broadcast_buffers=True,
                 process_group=None,  
                 bucket_cap_mb=25,       
                 find_unused_parameters=False,       
                 check_reduction=False,      
                 gradient_as_bucket_view=False):

        super(DistributedDataParallel, self).__init__()

        assert any((p.requires_grad for p in module.parameters())), (
            "DistributedDataParallel is not needed when a module "
            "doesn't have any parameter that requires a gradient."
        )

        self.is_multi_device_module = len({p.device for p in module.parameters()}) > 1
        distinct_device_types = {p.device.type for p in module.parameters()}
        assert len(distinct_device_types) == 1, (
            "DistributedDataParallel's input module must be on "
            "the same type of devices, but input module parameters locate in {}."
        ).format(distinct_device_types)
        self.device_type = list(distinct_device_types)[0]

        if self.device_type == "cpu" or self.is_multi_device_module:
            assert not device_ids and not output_device, (
                "DistributedDataParallel device_ids and output_device arguments "
                "only work with single-device GPU modules, but got "
                "device_ids {}, output_device {}, and module parameters {}."
            ).format(device_ids, output_device, {p.device for p in module.parameters()})

            self.device_ids = None
            self.output_device = None
        else:
            # Use all devices by default for single-device GPU modules
            if device_ids is None:
                device_ids = _get_all_device_indices()

            self.device_ids = list(map(lambda x: _get_device_index(x, True), device_ids))

            if output_device is None:
                output_device = device_ids[0]

            self.output_device = _get_device_index(output_device, True)

        if process_group is None:
            self.process_group = _get_default_group()
        else:
            self.process_group = process_group

        self.dim = dim
        self.module = module
        self.device = list(self.module.parameters())[0].device
        self.broadcast_buffers = broadcast_buffers
        self.find_unused_parameters = find_unused_parameters
        self.require_backward_grad_sync = True
        self.require_forward_param_sync = True
        self.ddp_join_enabled = False
        self.gradient_as_bucket_view = gradient_as_bucket_view

        if check_reduction:
            # This argument is no longer used since the reducer
            # will ensure reduction completes even if some parameters
            # do not receive gradients.
            warnings.warn(
                "The `check_reduction` argument in `DistributedDataParallel` "
                "module is deprecated. Please avoid using it."
            )
            pass

        # used for intra-node param sync and inter-node sync as well
        self.broadcast_bucket_size = int(250 * 1024 * 1024)

        #
        # reduction bucket size
        self.bucket_bytes_cap = int(bucket_cap_mb * 1024 * 1024)

        # 保证初始状态一样
        # Sync params and buffers
        self._sync_params_and_buffers(authoritative_rank=0)

        # 下拉看源码
        self._ddp_init_helper()

    def _sync_params_and_buffers(self, authoritative_rank=0):
        module_states = list(self.module.state_dict().values())
        if len(module_states) > 0:
            self._distributed_broadcast_coalesced(
                module_states,
                self.broadcast_bucket_size,
                authoritative_rank)

    def _ddp_init_helper(self):
        """
        Initialization helper function that does the following:

        (1) replicating the module from device[0] to the other devices （前文提到 DDP 也支持一个进程多线程利用多卡，类似 DP ，这时候就会用到第一步）
        (2) bucketing the parameters for reductions （把 parameter 分组，梯度通讯时，先得到梯度的会通讯）
        (3) resetting the bucketing states
        (4) registering the grad hooks （创建管理器）
        (5) passing a handle of DDP to SyncBatchNorm Layer （为 SyncBN 准备）
        """

        def parameters(m, recurse=True):
            def model_parameters(m):
                ps = m._former_parameters.values() \
                    if hasattr(m, "_former_parameters") \
                    else m.parameters(recurse=False)
                for p in ps:
                    yield p

            for m in m.modules() if recurse else [m]:
                for p in model_parameters(m):
                    yield p

        if self.device_ids and len(self.device_ids) > 1:

            warnings.warn(
                "Single-Process Multi-GPU is not the recommended mode for "
                "DDP. In this mode, each DDP instance operates on multiple "
                "devices and creates multiple module replicas within one "
                "process. The overhead of scatter/gather and GIL contention "
                "in every forward pass can slow down training. "
                "Please consider using one DDP instance per device or per "
                "module replica by explicitly setting device_ids or "
                "CUDA_VISIBLE_DEVICES. "
            )

            # only create replicas for single-device CUDA modules
            #
            # TODO: we don't need to replicate params in here. they're always going to
            # be broadcasted using larger blocks in broadcast_coalesced, so it might be
            # better to not pollute the caches with these small blocks
            self._module_copies = replicate(self.module, self.device_ids, detach=True)
            self._module_copies[0] = self.module

            for module_copy in self._module_copies[1:]:
                for param, copy_param in zip(self.module.parameters(), parameters(module_copy)):
                    # Reducer requires param copies have the same strides across replicas.
                    # Fixes up copy_param strides in case replicate didn't match param strides.
                    if param.layout is torch.strided and param.stride() != copy_param.stride():
                        with torch.no_grad():
                            copy_param.set_(copy_param.clone()
                                                      .as_strided(param.size(), param.stride())
                                                      .copy_(copy_param))
                    copy_param.requires_grad = param.requires_grad

        else:
            self._module_copies = [self.module]

        self.modules_params = [list(parameters(m)) for m in self._module_copies]
        self.modules_buffers = [list(m.buffers()) for m in self._module_copies]

        # Build tuple of (module, parameter) for all parameters that require grads.
        modules_and_parameters = [
            [
                (module, parameter)
                for module in replica.modules()
                for parameter in filter(
                    lambda parameter: parameter.requires_grad,
                    parameters(module, recurse=False))
            ] for replica in self._module_copies]

        # Build list of parameters.
        parameters = [
            list(parameter for _, parameter in replica)
            for replica in modules_and_parameters]

        # Checks if a module will produce a sparse gradient.
        def produces_sparse_gradient(module):
            if isinstance(module, torch.nn.Embedding):
                return module.sparse
            if isinstance(module, torch.nn.EmbeddingBag):
                return module.sparse
            return False

        # Build list of booleans indicating whether or not to expect sparse
        # gradients for the corresponding parameters.
        expect_sparse_gradient = [
            list(produces_sparse_gradient(module) for module, _ in replica)
            for replica in modules_and_parameters]

        # The bucket size limit is specified in the constructor.
        # Additionally, we allow for a single small bucket for parameters
        # that are defined first, such that their gradients don't spill into
        # a much larger bucket, adding unnecessary latency after gradient
        # computation finishes. Experiments showed 1MB is a reasonable value.
        bucket_indices = dist._compute_bucket_assignment_by_size(
            parameters[0],
            [dist._DEFAULT_FIRST_BUCKET_BYTES, self.bucket_bytes_cap],
            expect_sparse_gradient[0])

        # Note: reverse list of buckets because we want to approximate the
        # order in which their gradients are produced, and assume they
        # are used in the forward pass in the order they are defined.
        # 管理器
        self.reducer = dist.Reducer(
            parameters,
            list(reversed(bucket_indices)),
            self.process_group,
            expect_sparse_gradient,
            self.bucket_bytes_cap,
            self.find_unused_parameters,
            self.gradient_as_bucket_view)

        # passing a handle to torch.nn.SyncBatchNorm layer
        self._passing_sync_batchnorm_handle(self._module_copies)

每个 DDP 进程都会创建本地 Reducer 在 backward 时管理梯度。

self.reducer = dist.Reducer(
     parameters,
     list(reversed(bucket_indices)),
     self.process_group,
     expect_sparse_gradient,
     self.bucket_bytes_cap,
     self.find_unused_parameters,
     self.gradient_as_bucket_view)

我们看 Reducer.cpp 可以发现，构建 Reducer 时，除了各种初始化，最重要的一步就是

{
  const auto replica_count = replicas_.size();
  grad_accumulators_.resize(replica_count);
  for (size_t replica_index = 0; replica_index < replica_count;
       replica_index++) {         
    const auto variable_count = replicas_[replica_index].size();
    grad_accumulators_[replica_index].resize(variable_count);
    for (size_t variable_index = 0; variable_index < variable_count;
         variable_index++) 
    { 
      auto& variable = replicas_[replica_index][variable_index];  
      const auto index = VariableIndex(replica_index, variable_index);

      // The gradient accumulator function is lazily initialized once.
      // Therefore we can use its presence in the autograd graph as
      // evidence that the parameter has participated in an iteration.
      auto grad_accumulator =
          torch::autograd::impl::grad_accumulator(variable);

#ifndef _WIN32
        using torch::distributed::autograd::ThreadLocalDistAutogradContext;   
#endif
        // grad_accumulator 执行完后，autograd_hook 就会运行
        hooks.emplace_back(
            grad_accumulator->add_post_hook(
                torch::make_unique(
                    [=](const torch::autograd::variable_list& outputs,
                        const torch::autograd::variable_list& /* unused */){   
#ifndef WIN32
                         this->rpc_context.set(
                             ThreadLocalDistAutogradContext::getContextPtr());   
#endif
                         this->autograd_hook(index);
                         return outputs;
                       })),
               grad_accumulator);

          // Map raw function pointer to replica index and parameter index.
          // This is used later on when the autograd graph is traversed
          // to check for parameters for which no gradient is computed.
          func_[grad_accumulator.get()] = index;

          // The gradient accumulator is stored as weak_ptr in the autograd
          // metadata of the variable, so we have to keep it alive here for
          // the raw pointer to be valid.
          grad_accumulators_[replica_index][variable_index] =
              std::move(grad_accumulator);
        }
      }
    }

    // std::unordered_map func_;
    // func_ 存了grad_accumulator & index 的对应，方便我们之后在 autograd graph 寻找 unused parameters

    //  std::vector>>
    //  grad_accumulators_;
    //  grad_accumulators_ 对应的 index 存了相应的 grad_accumulator

    //   std::vector>>
    //   hooks_;

其中，发挥重要功能的 autograd_hook 如下：

void Reducer::autograd_hook(VariableIndex index) {
     std::lock_guard lock(this->mutex_);
     if (find_unused_parameters_) {
       // 在 no_sync 时，只要参数被用过一次，就会被标记为用过
       // Since it gets here, this param has been used for this iteration. We want
       // to mark it in local_used_maps_. During no_sync session, the same var can
       // be set multiple times, which is OK as does not affect correctness. As
       // long as it is used once during no_sync session, it is marked as used.
       local_used_maps_[index.replica_index][index.variable_index] = 1;
     }

    // Ignore if we don't expect to be called.
    // This may be the case if the user wants to accumulate gradients
    // for number of iterations before reducing them.
    if (!expect_autograd_hooks_) {
      return;
    }

    // Rebuild bucket only if 1) it is the first time to rebuild bucket 2)
    // find_unused_parameters_ is false, currently it does not support when there
    // are unused parameters 3) this backward pass needs to run allreduce. Here,
    // we just dump tensors and their parameter indices into rebuilt_params_ and
    // rebuilt_param_indices_ based on gradient arriving order, and then at the
    // end of finalize_backward(), buckets will be rebuilt based on
    // rebuilt_params_ and rebuilt_param_indices_, and then will be broadcasted
    // and initialized. Also we only need to dump tensors and parameter indices of
    // one replica.
    push_rebuilt_params(index);

    // If `find_unused_parameters_` is true there may be model parameters that
    // went unused when computing the model output, they won't be part of the
    // autograd graph, and won't receive gradients. These parameters are
    // discovered in the `prepare_for_backward` function and their indexes stored
    // in the `unused_parameters_` vector.
    if (!has_marked_unused_parameters_ && find_unused_parameters_) {
      has_marked_unused_parameters_ = true;
      for (const auto& unused_index : unused_parameters_) {
        mark_variable_ready(unused_index);
      }
    }

    // Finally mark variable for which this function was originally called.
    mark_variable_ready(index);
}

前向传播

def forward(self, inputs, *kwargs):           if self.ddp_join_enabled:               ones = torch.ones(                   1, device=self.device               )               work = dist.all_reduce(ones, group=self.process_group, async_op=True)               self.reducer._set_forward_pass_work_handle(                   work, self.ddp_join_divide_by_initial_world_size               )
# Calling _rebuild_buckets before forward compuation,
      # It may allocate new buckets before deallocating old buckets
      # inside _rebuild_buckets. To save peak memory usage,
      # call _rebuild_buckets before the peak memory usage increases
      # during forward computation.
      # This should be called only once during whole training period.
      if self.reducer._rebuild_buckets():
          logging.info("Reducer buckets have been rebuilt in this iteration.")

      if self.require_forward_param_sync:
          self._sync_params()

      if self.ddp_join_enabled:
          # Notify joined ranks whether they should sync in backwards pass or not.
          self._check_global_requires_backward_grad_sync(is_joined_rank=False)

      # ！！！
      if self.device_ids:
          inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
          if len(self.device_ids) == 1:
              output = self.module(*inputs[0], **kwargs[0])
          else:
            # 单进程多线程多卡的情况
              outputs = self.parallel_apply(self._module_copies[:len(inputs)], inputs, kwargs)
              output = self.gather(outputs, self.output_device)
      else:
          output = self.module(*inputs, **kwargs)

      if torch.is_grad_enabled() and self.require_backward_grad_sync:
          self.require_forward_param_sync = True
          # We'll return the output object verbatim since it is a freeform
          # object. We need to find any tensors in this object, though,
          # because we need to figure out which parameters were used during
          # this forward pass, to ensure we short circuit reduction for any
          # unused parameters. Only if `find_unused_parameters` is set.
          if self.find_unused_parameters:
          # 当DDP参数 find_unused_parameter 为 true 时，其会在 forward 结束时，启动一个回溯，标记出所有没被用到的 parameter，提前把这些设定为 ready，这样 backward 就可以在一个 subgraph 进行，但这样会牺牲一部分时间。
              self.reducer.prepare_for_backward(list(_find_tensors(output)))
          else:
              self.reducer.prepare_for_backward([])
      else:
          self.require_forward_param_sync = False

      return output

反向传播

那么，DDP 究竟是怎么启动 allreduce 的呢？我们看一下 reducer.cpp 里对桶的定义以及用法，主要是在mark_*_ready。

struct Bucket {       std::vector replicas;
// Global indices of participating variables in the bucket
  std::vector variable_indices;

  // Number of replicas to be marked done before this bucket is ready.
  // 计数
  size_t pending;

  // Keep work handle around when this set of buckets is being reduced.
  std::shared_ptr work;

  // Keep future work handle around if DDP comm hook is registered.
  c10::intrusive_ptr future_work;

  // If this bucket should expect a single sparse gradient.
  // Implies: replicas[i].variables.size() == 1.
  bool expect_sparse_gradient = false;
};

先看 mark_variable_ready，截取片段（指去除报错信息）

void Reducer::mark_variable_ready(VariableIndex index) {     const auto replica_index = index.replica_index;     const auto variable_index = index.variable_index;     TORCH_CHECK(replica_index < replicas_.size(), "Out of range replica index.");     TORCH_CHECK(         variable_index < variable_locators_.size(),         "Out of range variable index.");     backward_stats_[replica_index][variable_index] =         current_time_in_nanos() - backward_stats_base_;
// 每当变量被标记成 ready 了，都要调用一下 finalize
require_finalize_ = true;

const auto& bucket_index = variable_locators_[variable_index];
auto& bucket = buckets_[bucket_index.bucket_index];
auto& replica = bucket.replicas[replica_index];


// If it was scheduled, wait on allreduce in forward pass that tells us
// division factor based on no. of currently participating processes.
if (divFactor_ == kUnsetDivFactor) {
  divFactor_ = process_group_->getSize();
  auto& workHandle = forwardPassWorkHandle_.workHandle;
  if (workHandle && !forwardPassWorkHandle_.useStaticWorldSize) {
    workHandle->wait();
    auto results = workHandle->result();
    // Guard against the results being empty
    TORCH_INTERNAL_ASSERT(results.size() > 0);
    at::Tensor& res = results.front();
    divFactor_ = res.item().to();
  }
}

if (bucket.expect_sparse_gradient) {
  mark_variable_ready_sparse(index);
} else {
  mark_variable_ready_dense(index);
}

// 检查桶里的变量是不是都ready了，如果没有东西 pending，那就是都 ready了
if (--replica.pending == 0) {
  if (--bucket.pending == 0) {
    mark_bucket_ready(bucket_index.bucket_index);
  }
}

// Run finalizer function and kick off reduction for local_used_maps once the
// final bucket was marked ready.
if (next_bucket_ == buckets_.size()) {
  if (find_unused_parameters_) {
    // H2D from local_used_maps_ to local_used_maps_dev_
    for (size_t i = 0; i < local_used_maps_.size(); i++) {
      // We do async H2D to avoid the blocking overhead. The async copy and
      // allreduce respect the current stream, so will be sequenced correctly.
      local_used_maps_dev_[i].copy_(local_used_maps_[i], true);
    }
    local_used_work_ = process_group_->allreduce(local_used_maps_dev_);
  }

  // The autograd engine uses the default stream when running callbacks, so we
  // pass in the current CUDA stream in case it is not the default.
  c10::DeviceType deviceType = replica.contents.device().type();
  const c10::impl::VirtualGuardImpl guard =
      c10::impl::VirtualGuardImpl{deviceType};
  const c10::Stream currentStream =
      guard.getStream(replica.contents.device());
  torch::autograd::Engine::get_default_engine().queue_callback([=] {
    std::lock_guard lock(this->mutex_);
    // Run callback with the current stream
    c10::OptionalStreamGuard currentStreamGuard{currentStream};
    this->finalize_backward();
  });
}
}

void Reducer::mark_bucket_ready(size_t bucket_index) {     TORCH_INTERNAL_ASSERT(bucket_index >= next_bucket_);
// Buckets are reduced in sequence. Ignore this bucket if
// it's not its turn to be reduced.
if (bucket_index > next_bucket_) {
  return;
}

// Keep going, until we either:
// - 所有桶都在 allreduce 那就等着 or
// - 还有桶没好，那也等着.
for (; next_bucket_ < buckets_.size() && buckets_[next_bucket_].pending == 0;
     next_bucket_++) {
  auto& bucket = buckets_[next_bucket_];
  std::vector tensors;
  tensors.reserve(bucket.replicas.size());
  for (const auto& replica : bucket.replicas) {

        // CUDA default stream 都按时序排好了
    tensors.push_back(replica.contents);
  }
  if (comm_hook_ == nullptr) {
    // 如果没注册 comm_hook，直接 allreduce
    bucket.work = process_group_->allreduce(tensors);
  } else {
    // 注册了 comm_hook 那就先跑 hook
    // 需要注意的是，这个comm_hook 只是处理通信的底层hook，如果想在 reduce 前分别进行梯度裁剪，还是需要在 autograph 挂 hook
    bucket.future_work = comm_hook_->runHook(GradBucket(tensors));
  }
}
}

除了正常的前向传播，DDP 还允许在 subgraph 进行反向传播，只需将 self.find_unused_parameters 设置为 True。或许有朋友会问，如果 find_unused_parameters 设置为 True，那每次都要 traverse 计算图，明明开销很大，为什么有时候我们还要将 self.find_unused_parameters 设置为 True？这是因为训练时有可能某次迭代只用到整个模型的一个 subgraph，并且这个 subgraph 迭代时可能会改变，就是说某些参数可能会在训练时被跳过。但因为所有parameters 在一开始就被分好桶了，而我们的 hook 又规定了只有整个桶 ready 了（pending==0）才会通信，如果我们不将 unused parameter 标记为 ready，整个过程会没法进行。我们在这节结束的部分附上一个小实验验证一下。

DDP 通过在构建时注册 autograd hook 进行梯度同步。当一个梯度计算好后，相应的 hook 会告诉 DDP 可以用来归约。当一个桶里的梯度都可以了，Reducer 就会启动异步 allreduce 去计算所有进程的平均值。当所有桶都可以了，Reducer 会等所有 allreduce 完成，然后将得到的梯度写到 param.grad。

optimizer step 独立于 DDP，所有进程的模型能够同步是因为初始状态相同并且改变量也相同。
no_sync

@contextmanager   
def no_sync(self):
        old_require_backward_grad_sync = self.require_backward_grad_sync
        self.require_backward_grad_sync = False
        try:
            yield
        finally:
            self.require_backward_grad_sync = old_require_backward_grad_sync

>>> ddp = torch.nn.DistributedDataParallel(model, pg)
>>> with ddp.no_sync():
>>>   for input in inputs:
>>>     ddp(input).backward()  # 不同步梯度 
>>> ddp(another_input).backward()  # 同步梯度

迭代多次再同步梯度。

实验：find_unused_params

import os   
import torch   
import torch.distributed as dist   
import torch.multiprocessing as mp   
import torch.nn as nn   
import torch.optim as optim   

from torch.nn.parallel import DistributedDataParallel as DDP
from timeit import default_timer as timer

os.environ['MASTER_ADDR'] = 'localhost'   
os.environ['MASTER_PORT'] = '12138'   
# sync   
seed = 0   
torch.manual_seed(seed)   
torch.cuda.manual_seed(seed)   
torch.cuda.manual_seed_all(seed)   
os.environ['PYTHONHASHSEED'] = str(seed)   
torch.backends.cudnn.deterministic = True   
torch.backends.cudnn.benchmark = False

def example(rank, world_size):       
    # create default process group       
    dist.init_process_group("gloo",rank=rank, 
world_size=world_size,init_method='env://')       
    # create local model
    model = nn.Linear(10, 10).to(rank)
    # construct DDP model
    ddp_model = DDP(model, device_ids=[rank])
    # define loss function and optimizer
    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    buf = 0
    tmp = 0
    for i in range(10000):
        start = timer()
        # forward pass
        outputs = ddp_model(torch.randn(20, 10).to(rank))
        end = timer()

        tmp = end-start
        buf+=tmp
        labels = torch.randn(20, 10).to(rank)
        # backward pass
        loss_fn(outputs, labels).backward()
        # update parameters
        optimizer.step()
    print(tmp)
    print(buf)
    print(buf/10000)

def main():
    world_size = 1
    mp.spawn(example,
        args=(world_size,),
        nprocs=world_size,
        join=True)

if __name__=="__main__":
   for i in range(10):
     main()

将 find_unused_params 分别设置成 True 或者 False 跑多次取平均，可以得到：

find_unused_params=True: 0.3367 ms
find_unused_params=False: 0.2993 ms

小结

关于 DP 和 DDP 的分享到这里就结束啦～

参考资料

[1] https://leimao.github.io/blog/Data-Parallelism-vs-Model-Paralelism/

[2] https://d2l.ai/chapter_computational-performance/parameterserver.html

[3] https://kknews.cc/code/y5aejon.html

[4] https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255

[5] https://opensource.com/article/17/4/grok-gil

[6] 谈谈python的GIL、多线程、多进程 - 知乎

[7] https://andrew.gibiansky.com/blog/machine-learning/baidu-allreduce/

[8] 【分布式训练】单机多卡的正确打开方式（一）：理论基础 - 知乎

[9] [原创][深度][PyTorch] DDP系列第二篇：实现原理与源代码解析 - 知乎

[10] https://pytorch.org/docs/stable/notes/ddp.html

[11] http://www.vldb.org/pvldb/vol13/p3

你可能感兴趣的:(技术干货,pytorch,分布式,深度学习)

使用GitHub API进行智能文档加载 fgayif github python
GitHub是一个强大的开发者平台，提供了代码存储、管理和分享的功能。它采用Git软件，增强了分布式版本控制，同时提供了访问控制、错误跟踪、软件功能请求、任务管理、持续集成和项目的wiki等功能。随着AI技术的发展，我们可以利用GitHub的API实现智能文档加载，以便更好地进行代码管理和分析。下面我将介绍如何使用GitHubAPI进行文档加载，并通过实用的代码示例来帮助大家理解。技术背景介绍Gi
高效快速教你DeepSeek如何进行本地部署并且可视化对话大富大贵7 程序员知识储备1 程序员知识储备2 程序员知识储备3 经验分享
科技文章：高效快速教你DeepSeek如何进行本地部署并且可视化对话摘要：随着自然语言处理（NLP）技术的进步，DeepSeek作为一款基于深度学习的语义搜索技术，广泛应用于文本理解、对话系统及信息检索等多个领域。本文将探讨如何高效快速地在本地部署DeepSeek，并结合可视化工具实现对话过程的监控与分析。通过详尽的步骤、案例分析与代码示例，帮助开发者更好地理解和应用DeepSeek技术。同时，本
《AI医疗系统开发实战录》第6期——智能导诊系统实战骆驼_代码狂魔程序员的法宝人工智能 django python neo4j 知识图谱
关注我，后期文章全部免费开放，一起推进AI医疗的发展核心主题：如何构建95%准确率的智能导诊系统？技术突破：结合BERT+知识图谱的混合模型设计一、智能导诊架构设计python基于BERT的意图识别模型（PyTorch）fromtransformersimportBertTokenizer,BertForSequenceClassificationimporttorchclassTriageMod
Java架构师成长之路 hweiyu00 分享 spring 微服务 spring cloud java
概述本教程主要从6个方面，全面讲解Java技术栈的知识。1.性能调优深入理解MySQL底层原理、索引逻辑，数据结构与算法。使用Explain进行优化分析MVCC原理剖析日志机制解析2.框架源码掌握Spring底层原理带你手写一个Spring解析IOC、AOP源码、以及事务原理3.并发编程剖析Java底层锁机制CAS、JUC工具使用、AQS源码分析以及并发的集合类的讲解4.分布式开发剖析分布式中使用
【读点论文】Chain Replication for Supporting High Throughput and Availability 寻雾&启示分布式系统论文阅读
在分布式系统中，强一致性往往和高可用、高吞吐是矛盾的。比如传统的关系型数据库，其保证了强一致性，但往往牺牲了可用性和吞吐量。而像NoSQL数据库，虽然其吞吐量、和扩展性很高，但往往只支持最终一致性，无法保证强一致性。由此ChainReplicationforSupportingHighThroughputandAvailability提出了链式复制协议，旨在保证高吞吐、高可用的同时，支持数据的强一
【自建分布式数据库详细指南】（五）使用：常见API及使用问题大板牙花生分布式
延续前几篇文章，下面着重从一些基本的API讲讲从入门到习惯的常用方法，后续更新。USAGE1节点管理设置主节点，又成为协调节点SELECTcitus_set_coordinator_host('coord.example.com',5432);step1.创建节点select*frommaster_add_node('new-node',12345);step2.删除节点step3.新增节点后重新
Python基于深度学习的动物图片识别技术的研究与实现 Java老徐 Python 毕业设计 python 深度学习开发语言深度学习的动物图片识别技术 Python动物图片识别技术
博主介绍：✌程序员徐师兄、7年大厂程序员经历。全网粉丝12w+、csdn博客专家、掘金/华为云/阿里云/InfoQ等平台优质作者、专注于Java技术领域和毕业项目实战✌文末获取源码联系精彩专栏推荐订阅不然下次找不到哟2022-2024年最全的计算机软件毕业设计选题大全：1000个热门选题推荐✅Java项目精品实战案例《100套》Java微信小程序项目实战《100套》感兴趣的可以先收藏起来，还有大家
【深度学习与大模型基础】第7章-特征分解与奇异值分解 lynn-66 深度学习与大模型基础算法机器学习人工智能
一、特征分解特征分解（EigenDecomposition）是线性代数中的一种重要方法，广泛应用于计算机行业的多个领域，如机器学习、图像处理和数据分析等。特征分解将一个方阵分解为特征值和特征向量的形式，帮助我们理解矩阵的结构和性质。1.特征分解的定义对于一个n×n的方阵A，如果存在一个非零向量v和一个标量λ，使得：则称λ为矩阵A的特征值，v为对应的特征向量。特征分解将矩阵A分解为：其中：Q是由特征
震惊！ “深度学习”都在学习什么扉间798 深度学习学习人工智能
常见的机器学习分类算法俗话说三个臭皮匠胜过诸葛亮这里面集成学习就是将单一的算法弱弱结合算法融合用投票给特征值加权重AdaBoost集成学习算法通过迭代训练一系列弱分类器，给予分类错误样本更高权重，使得后续弱分类器更关注这些样本，然后将这些弱分类器线性组合成强分类器，提高整体分类性能。（一）投票机制投票是一种直观且常用的算法融合策略。在多分类问题中，假设有多个分类器对同一数据进行分类判断。每个分类器
深度学习 | pytorch + torchvision + python 版本对应及环境安装 zfgfdgbhs 深度学习 python pytorch
目录一、版本对应二、安装命令（pip）1.版本（1）v2.5.1~v2.0.0（2）v1.13.1~v1.11.0（3）v1.10.1~v1.7.02.安装全过程（1）选择版本（2）安装结果参考文章一、版本对应下表来自pytorch的github官方文档：pytorch/vision:Datasets,TransformsandModelsspecifictoComputerVisionpytor
PyTorch核心基础知识点 niuTaylor 编程区 pytorch 人工智能 python
PyTorch核心基础知识点，结合最新特性与工业级实践，按优先级和逻辑关系分层解析：▍核心基石：张量编程（TensorProgramming）1.张量创建（8种生产级初始化）#设备自动选择（2024最佳实践）device="cuda"iftorch.cuda.is_available()else"mps"iftorch.backends.mps.is_available()else"cpu"#关键
一文讲清楚深度学习和机器学习平凡而伟大. 机器学习人工智能深度学习机器学习人工智能
目录1.定义机器学习（MachineLearning,ML）深度学习（DeepLearning,DL）2.工作原理机器学习深度学习3.应用场景机器学习深度学习4.主要区别5.为什么选择深度学习？6.总结深度学习和机器学习是人工智能（AI）领域中两个密切相关但有所区别的概念。要清楚地解释它们之间的关系，我们可以从定义、工作原理、应用场景以及两者的主要区别等方面进行探讨。1.定义机器学习（Machin
Rust + 时序数据库 TDengine：打造高性能时序数据处理利器涛思数据（TDengine）时序数据库 rust tdengine
引言：为什么选择TDengine与Rust？TDengine是一款专为物联网、车联网、工业互联网等时序数据场景优化设计的开源时序数据库，支持高并发写入、高效查询及流式计算，通过“一个数据采集点一张表”与“超级表”的概念显著提升性能。Rust作为一门系统级编程语言，近年来在数据库、嵌入式系统、分布式服务等领域迅速崛起，以其内存安全、高性能著称，与TDengine的高效特性天然契合，适合构建高可靠、高
DeepSeek：智能搜索与分析的新纪元 XRC2231 学习
在人工智能浪潮席卷全球的今天，DeepSeek如同一颗璀璨的新星，以其独特的魅力和强大的功能，在AI领域脱颖而出。DeepSeek，这一基于深度学习和数据挖掘技术的智能搜索与分析系统，不仅重新定义了搜索引擎的边界，更以其卓越的性能和广泛的应用场景，为全球用户带来了前所未有的智能体验。本文将从DeepSeek的定义、特点、应用场景、优势等方面进行全面而深入的介绍，带您领略这一新兴技术的独特魅力。一、
Linux部署模型报错OSError: Error no file named pytorch_model.bin, tf_model.h5, model.ckpt.index or flax_mod dkgee linux pytorch 运维
报错内容：OSError:Errornofilenamedpytorch_model.bin,tf_model.h5,model.ckpt.indexorflax_model.msgpackfoundindirectory主要原因是transformer版本不对，需要升级pipinstall--upgradehuggingface_hubpipinstalltransformers[torch]其
LLM之向量数据库Chroma milvus FAISS maxmaxma 数据库 milvus faiss
以下是Chroma、Milvus和FAISS的核心区别，从功能定位、架构设计、性能及应用场景等维度进行对比：一、功能定位Chroma轻量级向量数据库：专注于快速构建中小型语义搜索原型，提供简单易用的API，适合快速集成到现有应用中。特点：支持近似最近邻搜索（ANN）、实时性能优化，但对大规模数据处理能力有限。Milvus分布式向量数据库：专为超大规模向量数据设计，支持云原生架构和高可用性，适合企业
深入解析Flink Kafka Connector的分布式流数据采集架构与底层实现数据与算法架构提升之路 #Flink flink kafka conector 源码
目录1.FlinkKafka连接器的分布式流采集架构1.1架构组成1.2分布式流模型2.数据分区分配策略3.为什么重写序列化和偏移量管理3.1与Flink分布式架构集成3.2与Flink检查点机制集成同时承接多级并行架构3.3OffsetsInitializer与细粒度偏移量控制3.4与Flink的Source接口统一4.版本兼容性管理5.有界流处理支持5.1实现原理5.2API使用示例5.3多种
AI模型技术演进与行业应用图谱智能计算研究中心其他
内容概要当前AI模型技术正经历从基础架构到行业落地的系统性革新。主流深度学习框架如TensorFlow和PyTorch持续优化动态计算图与分布式训练能力，而MXNet凭借高效的异构计算支持在边缘场景崭露头角。与此同时，模型压缩技术通过量化和知识蒸馏将参数量降低60%-80%，联邦学习则通过加密梯度交换实现多机构数据协同训练。在应用层面，医疗诊断模型通过迁移学习在CT影像分类任务中达到98.2%的准
模型优化驱动产业应用创新智能计算研究中心其他
内容概要当前模型优化技术的迭代正沿着多维路径快速演进，其核心驱动力在于突破算法性能与产业需求间的适配瓶颈。以自适应学习机制与迁移学习框架为基础的优化策略，显著提升了模型在跨场景应用中的泛化能力，而超参数自动调优技术则通过PyTorch、TensorFlow等主流框架的接口标准化，降低了复杂模型的开发门槛。在部署层面，边缘计算与联邦学习的协同应用不仅缩短了金融预测、医疗影像分析等场景的响应延迟，更通
Sa-Token v1.20.0 发布，新增临时Token认证
框架介绍Sa-Token是一个轻量级Java权限认证框架，主要解决：登录认证、权限认证、分布式Session会话、单点登录、OAuth2.0等一系列权限相关问题。框架针对踢人下线、自动续签、前后台分离、分布式会话……等常见业务进行N多适配，通过sa-token，你可以以一种极简的方式实现系统的权限认证部分Sa-Tokenv1.20.0版本更新包括以下内容：新增：新增Solon适配插件，感谢大佬@刘
基于roop/insightface将视频中包含指定人脸的视频片段提取并合并成新视频阆遤 python roop pytorch insightface
利用insightface.app.FaceAnalysis提最一个视频中包含指定人脸的视频片段，并将其合并成一个新视频，使用“buffalo_l”模型，模型需安装在代码当前目录下的.\models中。需要roop或其他支持pytorch、insightface、moviepy的环境。pytorch安装请见我其他文章。#cython:language_level=3str#-*-coding:ut
关于pytorch3d的安装诚威_lol_中大努力中人工智能 pytorch 人工智能 python
更新1：2025_2_04今天发现，原来的pytorch3d不见了，在我的aaa1环境中。重新安装，我发现最好用的还是去github下载最新的pytorch3d的zip，unzip之后，进去pipinstall-e.然后安装成功！1、参考文章1：windows安装PyTorch3D详细指南-哔哩哔哩(bilibili.com)这篇文章巨好2、参考文章2：pytorch3d/INSTALL.mdat
SpringBoot分布式架构下字典表设计与实战应用潘多编程 spring boot 分布式架构
在分布式系统中，字典表作为基础数据的核心载体，其设计合理性直接影响系统的扩展性和维护效率。本文将结合具体代码实例，深入讲解分布式环境下字典表的设计方案与实现细节。一、分布式环境下的字典表挑战数据一致性要求：多服务节点间的字典数据同步高并发访问压力：基础数据的频繁读取需求动态更新需求：业务运行时字典数据的热更新能力多级缓存策略：本地缓存与分布式缓存的协同工作二、技术方案设计架构图：[Client]-
【Hive】-- hive 3.1.3 伪分布式部署（单节点） oo寻梦in记 Apache Paimon 大数据服务部署 hive 分布式 hadoop
1、环境准备1.1、版本选择apachehive3.1.3apachehadoop3.1.0oraclejdk1.8mysql8.0.15操作系统：Macos10.151.2、软件下载https://archive.apache.org/dist/hive/https://archive.apache.org/dist/hadoop/1.3、解压tar-zxvfapache-hive-4.0.0-
使用Jupyter Notebook进行深度学习编程 - 深度学习教程 shandianfk_com ChatGPT AI jupyter 深度学习 ide
大家好，今天我们要聊聊如何使用JupyterNotebook进行深度学习编程。深度学习是人工智能领域中的一项重要技术，通过模仿人脑神经网络的方式进行学习和分析。JupyterNotebook作为一个强大的工具，可以帮助我们轻松地进行深度学习编程，尤其适合初学者和研究人员。本文将带领大家一步步了解如何在JupyterNotebook中开展深度学习项目。一、什么是JupyterNotebook？Jup
【Linux】Hadoop-3.4.1的伪分布式集群的初步配置孤独打铁匠Julian Linux linux hadoop ubuntu
配置步骤一、检查环境JDK#目前还是JDK8最适合Hadoopjava-versionecho$JAVA_HOMEHadoophadoopversionecho$HADOOP_HOME二、配置SSH免密登录Hadoop需要通过SSH管理节点（即使在伪分布式模式下）sudoaptinstallopenssh-server#安装SSH服务（如未安装）cd~/.ssh/ssh-keygen-trsa#生
深度学习 Deep Learning 第8章深度学习优化 odoo中国 AI编程人工智能深度学习人工智能优化
深度学习第8章深度学习的优化章节概述本章深入探讨了深度学习中的优化技术，旨在解决模型训练过程中面临的各种挑战。优化是深度学习的核心环节，直接关系到模型的训练效率和最终性能。本章首先介绍了优化在深度学习中的特殊性，然后详细讨论了多种优化算法，包括随机梯度下降（SGD）、动量法、Nesterov动量法、AdaGrad、RMSProp和Adam等。此外，还探讨了参数初始化策略、自适应学习率方法以及二阶优
【零基础入门】一篇弄懂nn.Sequential以及ModuleList的使用（呕心沥血版）十二月的猫 PyTorch深度学习 pytorch 零基础入门
个人主页：十二月的猫-CSDN博客系列专栏：《PyTorch科研加速指南：即插即用式模块开发》CSDN博客十二月的寒冬阻挡不了春天的脚步，十二点的黑夜遮蔽不住黎明的曙光目录1.前言2.Sequential类的使用2.1序列容器简单注入2.2序列容器字典注入2.3序列容器函数注入2.4序列容器修改2.5序列容器删除3.nn.ModuleList()的使用3.1定义模型3.2使用模型4.总结1.前言《
事务回滚核心技术 KBkongbaiKB java
一、事务回滚的数学本质与核心挑战1.1事务状态机模型操作执行持久化完成系统故障事务回滚ActivePartiallyCommittedCommittedFailedAborted1.2核心技术挑战矩阵问题维度单机事务分布式事务原子性保证存储引擎WAL日志二阶段提交协议隔离性实现MVCC多版本控制全局锁调度机制可见性管理事务ID版本链向量时钟同步回滚触发条件SQL执行异常/死锁网络分区/节点故障二、
景联文科技提供高质量文本标注服务，驱动AI技术发展景联文科技科技人工智能
文本标注是指在原始文本数据上添加标签的过程，这些标签可以用来指示特定的实体、关系、事件等信息，以帮助计算机理解和处理这些数据。文本标注是自然语言处理（NLP）领域的一个重要环节，它通过为文本的不同部分提供具体的含义和上下文信息，增强机器学习和深度学习模型对文本内容的理解能力。标注类型情感分析情感极性：确定文本表达的情感倾向，如正面、负面或中立。强度评估：衡量情感的强烈程度，从轻微到极端不等。命名实
枚举的构造函数中抛出异常会怎样 bylijinnan java enum 单例
首先从使用enum实现单例说起。为什么要用enum来实现单例？这篇文章（ http://javarevisited.blogspot.sg/2012/07/why-enum-singleton-are-better-in-java.html）阐述了三个理由： 1.enum单例简单、容易，只需几行代码： public enum Singleton { INSTANCE;
CMake 教程 aigo C++
转自：http://xiang.lf.blog.163.com/blog/static/127733322201481114456136/ CMake是一个跨平台的程序构建工具，比如起自己编写Makefile方便很多。介绍：http://baike.baidu.com/view/1126160.htm 本文件不介绍CMake的基本语法，下面是篇不错的入门教程： http:
cvc-complex-type.2.3: Element 'beans' cannot have character Cb123456 spring Webgis
cvc-complex-type.2.3: Element 'beans' cannot have character Line 33 in XML document from ServletContext resource [/WEB-INF/backend-servlet.xml] is i
jquery实例:随页面滚动条滚动而自动加载内容 120153216 jquery
<script language="javascript"> $(function (){ var i = 4;$(window).bind("scroll", function (event){ //滚动条到网页头部的高度，兼容ie,ff,chrome var top = document.documentElement.s
将数据库中的数据转换成dbs文件何必如此 sql dbs
旗正规则引擎通过数据库配置器（DataBuilder）来管理数据库，无论是Oracle，还是其他主流的数据都支持，操作方式是一样的。旗正规则引擎的数据库配置器是用于编辑数据库结构信息以及管理数据库表数据，并且可以执行SQL 语句，主要功能如下。 1)数据库生成表结构信息：主要生成数据库配置文件(.conf文
在IBATIS中配置SQL语句的IN方式 357029540 ibatis
在使用IBATIS进行SQL语句配置查询时，我们一定会遇到通过IN查询的地方，在使用IN查询时我们可以有两种方式进行配置参数：String和List。具体使用方式如下： 1.String:定义一个String的参数userIds，把这个参数传入IBATIS的sql配置文件，sql语句就可以这样写： <select id="getForms" param
Spring3 MVC 笔记（一） 7454103 spring mvc bean REST JSF
自从 MVC 这个概念提出来之后 struts1.X struts2.X jsf 。。。。。这个view 层的技术一个接一个！都用过！不敢说哪个绝对的强悍！要看业务，和整体的设计！最近公司要求开发个新系统！
Timer与Spring Quartz 定时执行程序 darkranger spring bean 工作 quartz
有时候需要定时触发某一项任务。其实在jdk1.3，java sdk就通过java.util.Timer提供相应的功能。一个简单的例子说明如何使用，很简单： 1、第一步，我们需要建立一项任务，我们的任务需要继承java.util.TimerTask package com.test; import java.text.SimpleDateFormat; import java.util.Date;
大端小端转换，le32_to_cpu 和cpu_to_le32 aijuans C语言相关
大端小端转换，le32_to_cpu 和cpu_to_le32 字节序 http://oss.org.cn/kernel-book/ldd3/ch11s04.html 小心不要假设字节序. PC 存储多字节值是低字节为先(小端为先, 因此是小端), 一些高级的平台以另一种方式(大端)
Nginx负载均衡配置实例详解 avords
[导读] 负载均衡是我们大流量网站要做的一个东西，下面我来给大家介绍在Nginx服务器上进行负载均衡配置方法，希望对有需要的同学有所帮助哦。负载均衡先来简单了解一下什么是负载均衡，单从字面上的意思来理解就可以解负载均衡是我们大流量网站要做的一个东西，下面我来给大家介绍在Nginx服务器上进行负载均衡配置方法，希望对有需要的同学有所帮助哦。负载均衡先来简单了解一下什么是负载均衡
乱说的 houxinyou 框架敏捷开发软件测试
从很久以前，大家就研究框架，开发方法，软件工程，好多！反正我是搞不明白！这两天看好多人研究敏捷模型，瀑布模型！也没太搞明白. 不过感觉和程序开发语言差不多，瀑布就是顺序，敏捷就是循环. 瀑布就是需求、分析、设计、编码、测试一步一步走下来。而敏捷就是按摸块或者说迭代做个循环，第个循环中也一样是需求、分析、设计、编码、测试一步一步走下来。也可以把软件开发理
欣赏的价值——一个小故事 bijian1013 有效辅导欣赏欣赏的价值
　　第一次参加家长会，幼儿园的老师说："您的儿子有多动症，在板凳上连三分钟都坐不了，你最好带他去医院看一看。"　　回家的路上，儿子问她老师都说了些什么，她鼻子一酸，差点流下泪来。因为全班30位小朋友，惟有他表现最差；惟有对他，老师表现出不屑，然而她还在告诉她的儿子："老师表扬你了，说宝宝原来在板凳上坐不了一分钟，现在能坐三分钟。其他妈妈都非常羡慕妈妈，因为全班只有宝宝
包冲突问题的解决方法 bingyingao eclipse maven exclusions 包冲突
包冲突是开发过程中很常见的问题：其表现有： 1.明明在eclipse中能够索引到某个类，运行时却报出找不到类。 2.明明在eclipse中能够索引到某个类的方法，运行时却报出找不到方法。 3.类及方法都有，以正确编译成了.class文件，在本机跑的好好的，发到测试或者正式环境就抛如下异常： java.lang.NoClassDefFoundError: Could not in
【Spark七十五】Spark Streaming整合Flume-NG三之接入log4j bit1129 Stream
先来一段废话：实际工作中，业务系统的日志基本上是使用Log4j写入到日志文件中的，问题的关键之处在于业务日志的格式混乱，这给对日志文件中的日志进行统计分析带来了极大的困难，或者说，基本上无法进行分析，每个人写日志的习惯不同，导致日志行的格式五花八门，最后只能通过grep来查找特定的关键词缩小范围，但是在集群环境下，每个机器去grep一遍，分析一遍，这个效率如何可想之二，大好光阴都浪费在这上面了
sudoku solver in Haskell bookjovi sudoku haskell
这几天没太多的事做，想着用函数式语言来写点实用的程序，像fib和prime之类的就不想提了（就一行代码的事），写什么程序呢？在网上闲逛时发现sudoku游戏，sudoku十几年前就知道了，学生生涯时也想过用C/Java来实现个智能求解，但到最后往往没写成，主要是用C/Java写的话会很麻烦。现在写程序，本人总是有一种思维惯性，总是想把程序写的更紧凑，更精致，代码行数最少，所以现
java apache ftpClient bro_feng java
最近使用apache的ftpclient插件实现ftp下载，遇见几个问题，做如下总结。 1. 上传阻塞，一连串的上传，其中一个就阻塞了，或是用storeFile上传时返回false。查了点资料，说是FTP有主动模式和被动模式。将传出模式修改为被动模式ftp.enterLocalPassiveMode();然后就好了。看了网上相关介绍，对主动模式和被动模式区别还是比较的模糊，不太了解被动模
读《研磨设计模式》-代码笔记-工厂方法模式 bylijinnan java 设计模式
声明：本文只为方便我个人查阅和理解，详细的分析以及源代码请移步原作者的博客http://chjavach.iteye.com/ package design.pattern; /* * 工厂方法模式：使一个类的实例化延迟到子类 * 某次，我在工作不知不觉中就用到了工厂方法模式（称为模板方法模式更恰当。2012-10-29）： * 有很多不同的产品，它
面试记录语 chenyu19891124 招聘
或许真的在一个平台上成长成什么样，都必须靠自己去努力。有了好的平台让自己展示，就该好好努力。今天是自己单独一次去面试别人，感觉有点小紧张，说话有点打结。在面试完后写面试情况表，下笔真的好难，尤其是要对面试人的情况说明真的好难。今天面试的是自己同事的同事，现在的这个同事要离职了，介绍了我现在这位同事以前的同事来面试。今天这位求职者面试的是配置管理，期初看了简历觉得应该很适合做配置管理，但是今天面
Fire Workflow 1.0正式版终于发布了 comsci 工作 workflow Google
Fire Workflow 是国内另外一款开源工作流，作者是著名的非也同志，哈哈.... 官方网站是 http://www.fireflow.org 经过大家努力,Fire Workflow 1.0正式版终于发布了正式版主要变化: 1、增加IWorkItem.jumpToEx(...)方法，取消了当前环节和目标环节必须在同一条执行线的限制，使得自由流更加自由 2、增加IT
Python向脚本传参 daizj python 脚本传参
如果想对python脚本传参数，python中对应的argc, argv(c语言的命令行参数)是什么呢？需要模块：sys 参数个数：len(sys.argv) 脚本名： sys.argv[0] 参数1： sys.argv[1] 参数2： sys.argv[
管理用户分组的命令gpasswd dongwei_6688 passwd
NAME： gpasswd - administer the /etc/group file SYNOPSIS： gpasswd group gpasswd -a user group gpasswd -d user group gpasswd -R group gpasswd -r group gpasswd [-A user,...] [-M user,...] g
郝斌老师数据结构课程笔记 dcj3sjt126com 数据结构与算法
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
yii2 cgridview加上选择框进行操作 dcj3sjt126com GridView
页面代码 <?=Html::beginForm(['controller/bulk'],'post');?> <?=Html::dropDownList('action','',[''=>'Mark selected as: ','c'=>'Confirmed','nc'=>'No Confirmed'],['class'=>'dropdown',])
linux mysql fypop linux
enquiry mysql version in centos linux yum list installed | grep mysql yum -y remove mysql-libs.x86_64 enquiry mysql version in yum repositoryyum list | grep mysql oryum -y list mysql* install mysq
Scramble String hcx2013 String
Given a string s1, we may represent it as a binary tree by partitioning it to two non-empty substrings recursively. Below is one possible representation of s1 = "great":
跟我学Shiro目录贴 jinnianshilongnian 跟我学shiro
历经三个月左右时间，《跟我学Shiro》系列教程已经完结，暂时没有需要补充的内容，因此生成PDF版供大家下载。最近项目比较紧，没有时间解答一些疑问，暂时无法回复一些问题，很抱歉，不过可以加群（334194438/348194195）一起讨论问题。 ----广告-----------------------------------------------------
nginx日志切割并使用flume-ng收集日志 liyonghui160com
nginx的日志文件没有rotate功能。如果你不处理，日志文件将变得越来越大，还好我们可以写一个nginx日志切割脚本来自动切割日志文件。第一步就是重命名日志文件，不用担心重命名后nginx找不到日志文件而丢失日志。在你未重新打开原名字的日志文件前，nginx还是会向你重命名的文件写日志，linux是靠文件描述符而不是文件名定位文件。第二步向nginx主
Oracle死锁解决方法 pda158 oracle
　select p.spid,c.object_name,b.session_id,b.oracle_username,b.os_user_name from v$process p,v$session a, v$locked_object b,all_objects c where p.addr=a.paddr and a.process=b.process and c.object_id=b.
java之List排序 shiguanghui list排序
在Java Collection Framework中定义的List实现有Vector，ArrayList和LinkedList。这些集合提供了对对象组的索引访问。他们提供了元素的添加与删除支持。然而，它们并没有内置的元素排序支持。　　你能够使用java.util.Collections类中的sort()方法对List元素进行排序。你既可以给方法传递
servlet单例多线程 utopialxw 单例多线程 servlet
转自http://www.cnblogs.com/yjhrem/articles/3160864.html 和 http://blog.chinaunix.net/uid-7374279-id-3687149.html Servlet 单例多线程 Servlet如何处理多个请求访问？Servlet容器默认是采用单实例多线程的方式处理多个请求的：1.当web服务器启动的