罗西的思考

[源码解析] PyTorch 分布式(7) ----- DistributedDataParallel 之进程组

[源码解析] PyTorch 分布式(7) ----- DistributedDataParallel 之进程组
- 0x00 摘要
- 0x01 回顾
  - 1.1 基础概念
  - 1.2 初始化进程组
- 0x02 概念与设计
  - 2.1 功能
  - 2.2 本质
- 0x03 使用
- 0x04 构建
  - 4.1 Python 世界
    - 4.1.1 rendezvous
    - 4.1.2 _new_process_group_helper
    - 4.1.3
  - 4.2 C++ 世界
    - 4.2.1 ProcessGroupMPI 定义
    - 4.2.2 初始化
      - 4.2.2.1 initMPIOnce
      - 4.2.2.2 ProcessGroupMPI
    - 4.2.3 运行
      - 4.2.3.1 执行封装
      - 4.2.3.2 allreduce
      - 4.2.3.3 enqueue
      - 4.2.3.4 runLoop
  - 4.4 封装
- 0xFF 参考

0x00 摘要

本文是 PyTorch 分布式系列的第七篇，介绍 DistributedDataParallel 所依赖的进程组概念。

本系列其他文章如下：

深度学习利器之自动微分(1)

深度学习利器之自动微分(2)

[源码解析]深度学习利器之自动微分(3) --- 示例解读

[源码解析]PyTorch如何实现前向传播(1) --- 基础类(上)

[源码解析]PyTorch如何实现前向传播(2) --- 基础类(下)

[源码解析] PyTorch如何实现前向传播(3) --- 具体实现

[源码解析] Pytorch 如何实现后向传播 (1)---- 调用引擎

[源码解析] Pytorch 如何实现后向传播 (2)---- 引擎静态结构

[源码解析] Pytorch 如何实现后向传播 (3)---- 引擎动态逻辑

[源码解析] PyTorch 如何实现后向传播 (4)---- 具体算法

[源码解析] PyTorch 分布式(1)------历史和概述

[源码解析] PyTorch 分布式(2) ----- DataParallel(上)

[源码解析] PyTorch 分布式(3) ----- DataParallel(下)

[源码解析] PyTorch 分布式(4)------分布式应用基础概念

[源码解析] PyTorch分布式(5) ------ DistributedDataParallel 总述&如何使用

[源码解析] PyTorch分布式(6) ---DistributedDataParallel -- 初始化&store

0x01 回顾

1.1 基础概念

关于分布式通信，PyTorch 提供的几个概念是：进程组，后端，初始化，Store。

进程组 ：DDP是真正的分布式训练，可以使用多台机器来组成一次并行运算的任务。为了能够让 DDP 的各个worker之间通信，PyTorch 设置了进程组这个概念。
后端：后端这个概念是一个逻辑上的概念。本质上后端是一种IPC通信机制。对于用户来说，就是采用那种方式来进行集合通信，从代码上看，就是走什么流程（一系列流程），以及后端使用 ProcessGroupMPI 还是 ProcessGroupGloo .....。
初始化 : 虽然有了后端和进程组的概念，但是如何让 worker 在建立进程组之前发现彼此？这就需要一种初始化方法来告诉大家传递一个信息：如何联系到其它机器上的进程？
Store : 可以认为是分布式键值存储，这个存储在组中的进程之间共享信息以及初始化分布式包（通过显式创建存储来作为init_method的替代）。

1.2 初始化进程组

在调用任何 DDP 其他方法之前，需要使用torch.distributed.init_process_group()进行初始化进程组。

from torch.nn.parallel import DistributedDataParallel as DDP
import torch.distributed as dist
import os

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    
    # initialize the process group
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

该方法会初始化默认分布式进程组和分布式包。此方法会阻塞，直到所有进程都加入，函数定义如下：

init_process_group ( backend , 
                       init_method = None , 
                       timeout = default_pg_timeout , 
                       world_size =- 1 , 
                       rank =- 1 , 
                       store = None , 
                       group_name = '' , 
                       pg_options = None )

初始化进程组有两种主要方法：

明确指定 store，rank 和 world_size。
指定 init_method（一个 URL 字符串），它指示在哪里/如何发现对等点。

如果两者都没有指定，init_method则假定为“env://”。

因此大家可以看到，store 和 init_method 是互斥的。

参数具体如下：

后端 – 要使用的后端。有效值包括mpi，gloo，和nccl。该字段应作为小写字符串（例如"gloo"）给出，也可以通过Backend属性（例如Backend.GLOO）访问。如果在nccl后端每台机器上使用多个进程，则每个进程必须对其使用的每个 GPU 具有独占访问权限，因为在进程之间共享 GPU 可能会导致死锁。
init_method – 指定如何初始化进程组的 URL。如果未指定init_method或store指定，则默认为“env://” 。与 store互斥。
world_size – 参与作业的进程数。如果store指定，则 world_size 为必需。
rank – 当前进程的等级（它应该是一个介于 0 和world_size-1之间的数字）。如果store指定，则 rank 为必需。
store – 所有 worker 都可以访问的键/值存储，用于交换连接/地址信息。与init_method 互斥。
timeout – 针对进程组执行的操作超时。默认值等于 30 分钟。这适用于gloo后端。对于nccl，这仅在环境变量NCCL_BLOCKING_WAIT 或NCCL_ASYNC_ERROR_HANDLING设置为 1 时适用。
group_name – 组名。
pg_options ( Process Group Options , optional ) – 进程组选项，指定在构建特定进程组期间需要传入哪些附加选项。

0x02 概念与设计

2.1 功能

默认情况下，集合通信在默认组（也称为world）上运行，并要求所有进程都进入分布式函数调用。但是，一些工作可以从更细粒度的通信中受益。这就是分布式组发挥作用的地方。new_group() 函数可用于创建一个新分布式组，这个新组是所有进程的任意子集。new_group() 返回一个不透明的组句柄，此句柄可以作为group参数提供给所有集合函数（集合函数是分布式函数，用于在某些编程模式中交换信息）。

2.2 本质

抛开概念，从代码看其本质。进程组就是给每一个训练的 process 建立一个通信thread。主线程（计算线程）在前台进行训练，这个通信 thread 在后台做通信。我们以 ProcessGroupMPI 为例，是在通信线程之中另外添加了一个 queue，做buffer 和异步处理。这样，进程组中所有进程都可以组成一个集体在后台进行集合通信操作。

比如下面，左侧worker之中就有两个线程，计算线程负责计算梯度，然后要求通信线程与其它worker进行交换梯度。

+---------------------------------------------------------------+        +--------------+
| Worker Process                                                |        | Other Worker |
|                                                               |        |              |
|  +----------------------+       +-----------------------+     |        | +----------+ |
|  | Computation thread   |       | Communication thread  |     |        | |   Comm   | |
|  |                      |       |                       |     |        | |  thread  | |
|  |                      |       |                       |     |        | |          | |
|  |     Main Thread      |       |    workerThread_      |     |        | |          | |
|  |                      |       |                       |     |        | |          | |
|  |                      |       |                       |     |        | |          | |
|  | Gradient computation |       |                       |     |        | |          | |
|  |          +           |       |                       |     |        | |          | |
|  |          |           |       |                       |   + |    +   | |          | |
|  |          |           |       |                       |  /| |    |\  | |          | |
|  |          v           | /|_|\ |                       | / +-+----+ \ | |          | |
|  |    Does All+Reduce   |/ grad\|   Does communication  |/  Gradient  \| |          | |
|  |                      |\  _  /|                       |\            /| |          | |
|  |                      | \| |/ |                       | \ +-+----+ / | |          | |
|  |                      |       |                       |  \| |    |/  | |          | |
|  |                      |       |                       |   + |    +   | |          | |
|  |                      |       |                       |     |        | |          | |
|  |                      |       |                       |     |        | |          | |
|  +----------------------+       +-----------------------+     |        | +----------+ |
|                                                               |        |              |
+---------------------------------------------------------------+        +--------------+

0x03 使用

既然知道了进程组的本质，我们接下来看看如何使用进程组。

首先，在 _ddp_init_helper 之中会生成 dist.Reducer，进程组会作为 Reducer 的参数之一传入。

def _ddp_init_helper(self, parameters, expect_sparse_gradient, param_to_name_mapping):
    """
    Initialization helper function that does the following:
    (1) bucketing the parameters for reductions
    (2) resetting the bucketing states
    (3) registering the grad hooks
    (4) Logging constructin-time DDP logging data
    (5) passing a handle of DDP to SyncBatchNorm Layer
    """
    self.num_iterations = 0
    # The bucket size limit is specified in the constructor.
    # Additionally, we allow for a single small bucket for parameters
    # that are defined first, such that their gradients don't spill into
    # a much larger bucket, adding unnecessary latency after gradient
    # computation finishes. Experiments showed 1MB is a reasonable value.
    bucket_indices = dist._compute_bucket_assignment_by_size(
        parameters[0],
        [dist._DEFAULT_FIRST_BUCKET_BYTES, self.bucket_bytes_cap],
        expect_sparse_gradient[0],
    )

    # Note: reverse list of buckets because we want to approximate the
    # order in which their gradients are produced, and assume they
    # are used in the forward pass in the order they are defined.
    self.reducer = dist.Reducer(
        parameters,
        list(reversed(bucket_indices)),
        self.process_group, # 这里使用了
        expect_sparse_gradient,
        self.bucket_bytes_cap,
        self.find_unused_parameters,
        self.gradient_as_bucket_view,
        param_to_name_mapping,
    )

其次，在 Reducer 构建函数之中，会把进程组配置给 Reducer 的成员变量 process_group_ 之上。

Reducer::Reducer(
    std::vector> replicas,
    std::vector> bucket_indices,
    c10::intrusive_ptr process_group, 
    std::vector> expect_sparse_gradients,
    int64_t bucket_bytes_cap,
    bool find_unused_parameters,
    bool gradient_as_bucket_view,
    std::unordered_map paramNames)
    : replicas_(std::move(replicas)),
      process_group_(std::move(process_group)), // 在这里

最后，当需要对梯度做 all-reduce 时候，则会调用 process_group_->allreduce(tensors) 进行处理。

现在，我们就知道如何使用进程组了。

void Reducer::all_reduce_bucket(Bucket& bucket) {
  std::vector tensors;
  tensors.reserve(bucket.replicas.size());
  for (const auto& replica : bucket.replicas) {
    tensors.push_back(replica.contents);
  }

  if (comm_hook_ == nullptr) {
    bucket.work = process_group_->allreduce(tensors); // 这里会进行调用
  } else {
    GradBucket grad_bucket(
        next_bucket_,
        tensors[0],
        // Since currently we do not support single-process multiple-device
        // mode, we can assume only one replica in the bucket.
        bucket.replicas[0].offsets,
        bucket.replicas[0].lengths,
        bucket.replicas[0].sizes_vec);
    bucket.future_work = comm_hook_->runHook(grad_bucket);
  }
}

0x04 构建

4.1 Python 世界

4.1.1 rendezvous

从 init_process_group 源码之中看，几种构建实现在细节上有所不同，我们只是看gloo和mpi。

gloo利用 rendezvous 设置了master地址。
MPI不需要 rendezvous，而是利用mpirun启动。

两种方法都生成了一个 ProcessGroup 赋值给 default_pg，然后用 default_pg 设置 GroupMember.WORLD。

def _update_default_pg(pg):
    GroupMember.WORLD = group.WORLD = pg

具体 init_process_group 代码如下：

def init_process_group(backend,
                       init_method=None,
                       timeout=default_pg_timeout,
                       world_size=-1,
                       rank=-1,
                       store=None,
                       group_name='',
                       pg_options=None):
    """
    Initializes the default distributed process group, and this will also
    initialize the distributed package.

    There are 2 main ways to initialize a process group:
        1. Specify ``store``, ``rank``, and ``world_size`` explicitly.
        2. Specify ``init_method`` (a URL string) which indicates where/how
           to discover peers. Optionally specify ``rank`` and ``world_size``,
           or encode all required parameters in the URL and omit them.

    If neither is specified, ``init_method`` is assumed to be "env://".
    """
    
    global _pg_group_ranks
    global _backend
    global _default_pg_init_method

    if store is not None:
        assert world_size > 0, 'world_size must be positive if using store'
        assert rank >= 0, 'rank must be non-negative if using store'
    elif init_method is None:
        init_method = "env://"

    backend = Backend(backend)
    if backend == Backend.MPI:
        default_pg = _new_process_group_helper( # 生成了一个 ProcessGroup 赋值给 default_pg
            -1,
            -1,
            [],
            Backend.MPI,
            None,
            group_name=group_name,
            timeout=timeout)
        _update_default_pg(default_pg) # 用 default_pg 设置 GroupMember.WORLD
    else:
        # backward compatible API
        if store is None:
            rendezvous_iterator = rendezvous( # 先生成一个store
                init_method, rank, world_size, timeout=timeout
            )
            store, rank, world_size = next(rendezvous_iterator)
            store.set_timeout(timeout)

        default_pg = _new_process_group_helper( # 再进行构建 ProcessGroup
            world_size,
            rank,
            [],
            backend,
            store,
            pg_options=pg_options,
            group_name=group_name,
            timeout=timeout)
        _update_default_pg(default_pg) # 用 default_pg 设置 GroupMember.WORLD

    _pg_group_ranks[GroupMember.WORLD] = {i: i for i in range(GroupMember.WORLD.size())}  # type: ignore[attr-defined, index]
    _backend = _pg_map[GroupMember.WORLD][0]  # type: ignore[index]
    _default_pg_init_method = init_method

    # barrier at the end to ensure that once we return from this method, all
    # process groups including global variables are updated correctly on all
    # ranks.
    if backend == Backend.MPI:
        # MPI backend doesn't use store.
        barrier()
    else:
        # Use store based barrier here since barrier() used a bunch of
        # default devices and messes up NCCL internal state.
        _store_based_barrier(rank, store, timeout)
        # Set sequence numbers for gloo and nccl process groups.
        if get_backend(default_pg) in [Backend.GLOO, Backend.NCCL]:
            default_pg._set_sequence_number_for_group()

4.1.2 _new_process_group_helper

各种后端都会使用 _new_process_group_helper 进行具体构建，_new_process_group_helper 其实就是调用了不同的C++实现，比如 ProcessGroupGloo, ProcessGroupMPI, ProcessGroupNCCL。

def _new_process_group_helper(world_size,
                              rank,
                              group_ranks,
                              backend,
                              store,
                              pg_options=None,
                              group_name=None,
                              timeout=default_pg_timeout):
    """
    Create a new distributed process group.

    This function must be called by ALL processes in the global group, even if
    the calling process is not part of the newly created group. In that case,
    this function returns GroupMember.NON_GROUP_MEMBER.

    This function is called with ``group_ranks == []`` for the default group.
    """
    global _pg_map
    global _group_count
    global _pg_names
    if not group_name:
        group_name = str(_group_count)
        _group_count += 1

    # The list of group ranks is empty if we're creating the default group.
    is_default_group = (len(group_ranks) == 0)

    backend = Backend(backend)
    pg: Union[ProcessGroupGloo, ProcessGroupMPI, ProcessGroupNCCL]
    if backend == Backend.MPI:
        pg = ProcessGroupMPI.create(group_ranks) # 构建了 ProcessGroupMPI
        if not pg:
            return GroupMember.NON_GROUP_MEMBER
        _pg_map[pg] = (Backend.MPI, None)
        _pg_names[pg] = group_name
    else:
        # If this is a subgroup (which means group_ranks is specified),
        # we check if the current process is a member of the new group.
        if not is_default_group:
            global_rank = _get_default_group().rank()
            if global_rank not in group_ranks:
                return GroupMember.NON_GROUP_MEMBER

        # Use the group name as prefix in the default store, such that
        # a single store can be reused by multiple groups.
        prefix_store = PrefixStore(group_name, store)

        if backend == Backend.GLOO:
            pg = ProcessGroupGloo( # 构建了 ProcessGroupGloo
                prefix_store,
                rank,
                world_size,
                timeout=timeout)
            _pg_map[pg] = (Backend.GLOO, store)
            _pg_names[pg] = group_name
        elif backend == Backend.NCCL:
            if pg_options is not None:
                assert isinstance(pg_options, ProcessGroupNCCL.Options), \
                    "Expected pg_options argument to be of type ProcessGroupNCCL.Options"
            else:
                # default pg_options for NCCL
                pg_options = ProcessGroupNCCL.Options()
                pg_options.is_high_priority_stream = False
                pg_options._timeout = timeout

            pg = ProcessGroupNCCL( # 构建了 ProcessGroupNCCL
                prefix_store,
                rank,
                world_size,
                pg_options)
            _pg_map[pg] = (Backend.NCCL, store)
            _pg_names[pg] = group_name
        else:
            pg = getattr(Backend, backend.upper())(
                prefix_store,
                rank,
                world_size,
                timeout)
            _pg_map[pg] = (backend, store)
            _pg_names[pg] = group_name

    return pg

目前流程如下：

                                  +
                                  |
                                  |
                                  v
                          init_process_group
                                  +
                                  |
                                  |
                     +------------+-------------+
                     |                          |
                     |                          |
                     v                          v
                Backend.MPI        Backend.GLOO & Backend.NCCL
                     +                          +
                     |                          |
                     |                          |
                     |                          v
                     |                  store = rendezvous()
                     |                          +
                     |                          |
                     |                          |
                     +------------+-------------+
                                  |
                                  |
                                  v

                       _new_process_group_helper
                                  +
                                  |
                                  |
                                  |
       +------------------------------------------------------+
       |                          |                           |
       |                          |                           |
       v                          v                           v

ProcessGroupMPI         ProcessGroupGloo(store)        ProcessGroupNCCL(store)

4.1.3

我们以 ProcessGroupMPI 为例来看看，可以看到ProcessGroupMPI的基类是ProcessGroup。

class ProcessGroupMPI(ProcessGroup):
    def __init__(
        self,
        rank: int,
        size: int,
        pgComm: int,
    ): ...
    @staticmethod
    def create(ranks: List[int]) -> ProcessGroupMPI: ...

ProcessGroup 定义了若干集合通信函数，但是均未实现，不过从其注释之中，我们可以看到派生类会有多种重载实现。

class ProcessGroup(__pybind11_builtins.pybind11_object):
    # no doc
    def allgather(self, *args, **kwargs): # real signature unknown; restored from __doc__
        """
        allgather(*args, **kwargs)
        Overloaded function.
        
        1. allgather(self: torch._C._distributed_c10d.ProcessGroup, output_tensors: List[List[at::Tensor]], input_tensors: List[at::Tensor], opts: torch._C._distributed_c10d.AllgatherOptions = ) -> c10d::ProcessGroup::Work
        
        2. allgather(self: torch._C._distributed_c10d.ProcessGroup, output_tensors: List[at::Tensor], input_tensor: at::Tensor) -> c10d::ProcessGroup::Work
        """
        pass

    def allgather_coalesced(self, output_lists, *args, **kwargs): # real signature unknown; NOTE: unreliably restored from __doc__ 
        """ allgather_coalesced(self: torch._C._distributed_c10d.ProcessGroup, output_lists: List[List[at::Tensor]], input_list: List[at::Tensor], opts: torch._C._distributed_c10d.AllgatherOptions = ) -> c10d::ProcessGroup::Work """
        pass

    def allreduce(self, *args, **kwargs): # real signature unknown; restored from __doc__
        """
        allreduce(*args, **kwargs)
        Overloaded function.
        
        1. allreduce(self: torch._C._distributed_c10d.ProcessGroup, tensors: List[at::Tensor], opts: torch._C._distributed_c10d.AllreduceOptions = ) -> c10d::ProcessGroup::Work
        
        2. allreduce(self: torch._C._distributed_c10d.ProcessGroup, tensors: List[at::Tensor], op: torch._C._distributed_c10d.ReduceOp = ) -> c10d::ProcessGroup::Work
        
        3. allreduce(self: torch._C._distributed_c10d.ProcessGroup, tensor: at::Tensor, op: torch._C._distributed_c10d.ReduceOp = ) -> c10d::ProcessGroup::Work
        """
        pass

而无论哪个ProcessGroup的派生类，都指向了C++世界，比如在 torch/csrc/distributed/c10d/init.cpp 之中有如下代码：

// Define static create function instead of a constructor, because
// this function may return null. This happens if this process is not
// part of a sub group that is to be created.
processGroupMPI.def_static(
    "create",
    [](std::vector ranks) {
      return ::c10d::ProcessGroupMPI::createProcessGroupMPI(ranks);
    },
    py::call_guard());

因此可见，最后调用到的是 createProcessGroupMPI，于是我们直接去C++世界看看。

4.2 C++ 世界

4.2.1 ProcessGroupMPI 定义

ProcessGroupMPI 定义位于 torch/lib/c10d/ProcessGroupMPI.cpp。这里相当于做了一个工作队列，以及异步操作。几个注意点如下：

ProcessGroupMPI 类上的所有函数都应在组中的进程之间以相同的顺序调用。这是我们能够保证跨进程匹配相同调用的唯一方法。
ProcessGroupMPI 类提供的所有MPI函数都在工作线程上异步调度。因此，ProcessGroupMPI 依赖于MPI实现，该实现用于提供 MPI_THREAD_SERIALIZED 的最小线程支持值。也就是说，进程可以是多线程的，多个线程可以进行MPI调用，但一次只能进行一个：MPI调用不是从两个不同的线程同时进行的（所有MPI调用都是序列化的）。但是，如果使用 MPI_THREAD_SERIALIZED，ProcessGroupMPI将只支持单个进程组。换句话说，全局创建的进程组不能超过1个。
如果希望使用多个ProcessGroupMPI，它要求MPI实现的线程支持值为MPI\u thread\u multiple，也就是说，多个线程可以调用MPI，没有任何限制。
还要注意，ProcessGroupMPI只支持单个张量操作。换句话说，输入张量向量的大小应始终为1。
如果使用的MPI是CUDA-aware MPI，则可以支持CUDA tensor，并且ProcessGroupMPI将自动检测此支持。

// ProcessGroupMPI implements MPI bindings for c10d.
//
// All functions on this class are expected to be called in the same
// order across processes in the group. This is the only way that we
// can guarantee to match up the same calls across processes.
//
// All MPI functions provided by this class is asynchronously scheduled on a
// Worker thread. Therefore, ProcessGroupMPI requires the MPI implementation
// that is used to have a minimum thread support value of MPI_THREAD_SERIALIZED.
// That is, The process may be multi-threaded, and multiple threads may make
// MPI calls, but only one at a time: MPI calls are not made concurrently from
// two distinct threads (all MPI calls are serialized). However, with
// MPI_THREAD_SERIALIZED, ProcessGroupMPI will only support a singe process
// group. In other words, no more than 1 process group can be created globally.
//
// If you would like to use multiple ProcessGroupMPI, it requres your MPI
// implemenation to have a thread support value of MPI_THREAD_MULTIPLE, that is,
// multiple threads may call MPI, with no restriction.
//
// Also note that ProcessGroupMPI only supports a single Tensor operation. In
// other words, the size of the input Tensor vector should always be 1.
//
// CUDA tensor can be supported if the MPI used is CUDA-aware MPI, and
// ProcessGroupMPI will automatically detect this support.
class ProcessGroupMPI : public ProcessGroup {
 public:
  class WorkMPI : public ProcessGroup::Work {
   public:
    explicit WorkMPI(
        std::vector outputTensors,
        const char* profilingTitle = nullptr,
        const c10::optional>& inputTensors =
            c10::nullopt)
        : ProcessGroup::Work(-1, OpType::UNKNOWN, profilingTitle, inputTensors),
          outputTensors_(std::move(outputTensors)),
          future_(c10::make_intrusive(
              c10::ListType::create(c10::TensorType::get()))) {}

    std::vector result() override;
    c10::intrusive_ptr getFuture() override;

   protected:
    friend class ProcessGroupMPI;

   private:
    void finishWorkMPI();
    void finishWorkMPIError(std::exception_ptr eptr);
    std::vector outputTensors_;
    c10::intrusive_ptr future_;
  };

  class AsyncWork : public ProcessGroup::Work {
   public:
    AsyncWork(
        MPI_Request request,
        std::vector outputTensors,
        const char* profilingTitle = nullptr,
        const c10::optional>& inputTensors =
            c10::nullopt);

    virtual ~AsyncWork();
    bool isCompleted() override;
    bool isSuccess() const override;
    int sourceRank() const override;
    bool wait(std::chrono::milliseconds timeout = kUnsetTimeout) override;
    void abort() override;
    std::vector result() override;

   protected:
    void populateException();

   private:
    const std::vector outputTensors_;
    MPI_Request request_;
    MPI_Status status_;
  };

  // Constructor will spawn up the worker thread loop
  explicit ProcessGroupMPI(int rank, int size, MPI_Comm pgComm);
  virtual ~ProcessGroupMPI();

 protected:
  using WorkType =
      std::tuple, c10::intrusive_ptr>;
  // Worker thread loop
  void runLoop();
  // Helper function that is called by the destructor
  void destroy();

  c10::intrusive_ptr enqueue(
      std::unique_ptr entry,
      const char* profilingTitle = nullptr,
      const c10::optional>& inputTensors = c10::nullopt);

  bool stop_;
  std::mutex pgMutex_;
  std::thread workerThread_;
  std::deque queue_;
  std::condition_variable queueProduceCV_;
  std::condition_variable queueConsumeCV_;

  // Global states
  static void initMPIOnce();
  static void mpiExit();
  static std::once_flag onceFlagInitMPI;
  static std::mutex pgGlobalMutex_;
  static int mpiThreadSupport_;

  MPI_Comm pgComm_;
};

4.2.2 初始化

createProcessGroupMPI 方法完成了进程组的初始化，其主要是调用了 MPI 编程常见套路，比如initMPIOnce，MPI_Comm_create，MPI_Barrier之类。

c10::intrusive_ptr ProcessGroupMPI::createProcessGroupMPI(
    std::vector ranks) {
  // Once initialization
  initMPIOnce();
  MPI_Comm groupComm = MPI_COMM_WORLD;
  int rank = -1;
  int size = -1;

  {
    std::lock_guard globalLock(pgGlobalMutex_);

    // If no ranks are specified, assume we're creating the root group
    if (!ranks.empty()) {
      MPI_Group worldGroup;
      MPI_Group ranksGroup;
      MPI_CHECK(MPI_Comm_group(MPI_COMM_WORLD, &worldGroup));
      MPI_CHECK(
          MPI_Group_incl(worldGroup, ranks.size(), ranks.data(), &ranksGroup));
      constexpr int kMaxNumRetries = 3;
      bool groupComm_updated = false;
      MPI_Barrier(MPI_COMM_WORLD);
      for (int i = 0; i < kMaxNumRetries; ++i) {
        if (MPI_Comm_create(MPI_COMM_WORLD, ranksGroup, &groupComm)) {
          groupComm_updated = true;
          break;
        }
      }
      MPI_CHECK(groupComm_updated);
      MPI_CHECK(MPI_Group_free(&worldGroup));
      MPI_CHECK(MPI_Group_free(&ranksGroup));
    }

    // Fetch rank and world size for this group (MPI_COMM_WORLD or new)
    if (groupComm != MPI_COMM_NULL) {
      MPI_CHECK(MPI_Comm_rank(groupComm, &rank));
      MPI_CHECK(MPI_Comm_size(groupComm, &size));
    }
  }

  // If this process is not part of the group, we don't construct a
  // process group instance. This is in line with the semantics of the
  // other process group types.
  if (groupComm == MPI_COMM_NULL) {
    return c10::intrusive_ptr(); // 生成
  }

  return c10::make_intrusive(rank, size, groupComm); // 生成
}

4.2.2.1 initMPIOnce

调用了 MPI_Init_thread API 初始化了 MPI 执行环境。

void ProcessGroupMPI::initMPIOnce() {
  // Initialize MPI environment
  std::call_once(onceFlagInitMPI, []() {
    MPI_CHECK(MPI_Init_thread(
        nullptr, nullptr, MPI_THREAD_SERIALIZED, &mpiThreadSupport_));
    if (mpiThreadSupport_ < MPI_THREAD_SERIALIZED) {
      throw std::runtime_error(
          "Used MPI implementation doesn't have the "
          "minimum level of threading support: "
          "MPI_THREAD_SERIALIZED. This is required by "
          "c10d package");
    }
    if (std::atexit(ProcessGroupMPI::mpiExit)) {
      throw std::runtime_error("Fail to register the MPI exit handler");
    }
  });

4.2.2.2 ProcessGroupMPI

ProcessGroupMPI 构建方法之中生成了 workerThread_，其运行 runLoop。

ProcessGroupMPI::ProcessGroupMPI(int rank, int size, MPI_Comm pgComm)
    : ProcessGroup(rank, size), stop_(false), pgComm_(pgComm) {
  if (pgComm_ == MPI_COMM_NULL) {
    throw std::runtime_error("pgComm_ must not be MPI_COMM_NULL");
  }

  // Start the worker thread accepting MPI calls
  workerThread_ = std::thread(&ProcessGroupMPI::runLoop, this);
}

4.2.3 运行

4.2.3.1 执行封装

这里有两个封装，WorkEntry 封装计算执行，WorkMPI封装计算执行结果（因为计算是异步的）。具体如下：

WorkEntry 是执行方法的封装，或者说每次需要执行的集合通信操作，都要封装在这里。

struct WorkEntry {
  explicit WorkEntry(
      std::vector* srcPtr,
      std::vector* dstPtr,
      std::function&)> run)
      : dst(dstPtr ? *dstPtr : std::vector()),
        run(std::move(run)) {
    if (srcPtr) {
      src = *srcPtr;
    }
  }

  // Not copyable
  WorkEntry(const WorkEntry&) = delete;
  // Not copy assignable
  WorkEntry& operator=(const WorkEntry&) = delete;

  // For input and output tensors (in-place), we will always use src
  std::vector src;

  // Copy of user provided outputs.
  const std::vector dst;

  // src rank returned, for recv only
  int* srcRank = nullptr;
  std::function&)> run;
};

WorkMPI 是执行结果的封装。

class WorkMPI : public ProcessGroup::Work {
 public:
  explicit WorkMPI(
      std::vector outputTensors,
      const char* profilingTitle = nullptr,
      const c10::optional>& inputTensors =
          c10::nullopt)
      : ProcessGroup::Work(-1, OpType::UNKNOWN, profilingTitle, inputTensors),
        outputTensors_(std::move(outputTensors)),
        future_(c10::make_intrusive(
            c10::ListType::create(c10::TensorType::get()))) {}

  std::vector result() override;
  c10::intrusive_ptr getFuture() override;

 protected:
  friend class ProcessGroupMPI;

 private:
  void finishWorkMPI();
  void finishWorkMPIError(std::exception_ptr eptr);

  std::vector outputTensors_;
  c10::intrusive_ptr future_;
};

在往工作queue插入时候，实际插入的是二元组(WorkEntry, WorkMPI)，我们后续会讲解如何使用。

4.2.3.2 allreduce

以allreduce 为例，看看如何处理。就是把 MPI_Allreduce 封装到 WorkEntry 之中，然后插入到 queue。

后续 runLoop 之中就是取出 WorkEntry，然后运行 MPI_Allreduce。

c10::intrusive_ptr ProcessGroupMPI::allreduce(
    std::vector& tensors,
    const AllreduceOptions& opts) {
  checkSingleTensor(tensors);

  std::function&)> runFunc =
      [opts, this](std::unique_ptr& entry) {
        auto data = (entry->src)[0];
        c10::DeviceGuard guard(data.device());
        std::unique_lock globalLock(pgGlobalMutex_);
        MPI_CHECK(MPI_Allreduce( // 封装了此函数
            MPI_IN_PLACE,
            data.data_ptr(),
            data.numel(),
            mpiDatatype.at(data.scalar_type()),
            mpiOp.at(opts.reduceOp),
            pgComm_));
      };
  auto entry = std::make_unique(&tensors, &tensors, std::move(runFunc));
  return enqueue(
      std::move(entry),
      "mpi:all_reduce",
      c10::optional>(tensors));
}

4.2.3.3 enqueue

enqueue 方法是往queue插入二元组(WorkEntry, WorkMPI)，里面的 entry->dst 就是计算结果存放到 WorkMPI 之中。

c10::intrusive_ptr ProcessGroupMPI::enqueue(
    std::unique_ptr entry,
    const char* profilingTitle,
    const c10::optional>& inputTensors) {
  // 生成 WorkMPI，把 entry->dst 就是 计算结果存放到 WorkMPI 之中
  auto work = c10::make_intrusive(entry->dst, profilingTitle, inputTensors);
  std::unique_lock lock(pgMutex_);
  // 插入二元组
  queue_.push_back(std::make_tuple(std::move(entry), work));
  lock.unlock();
  queueProduceCV_.notify_one();
  return work;
}

4.2.3.4 runLoop

主循环runLoop方法就是不停的取出entry来处理。

void ProcessGroupMPI::runLoop() {
  std::unique_lock lock(pgMutex_);

  while (!stop_) {
    if (queue_.empty()) {
      queueProduceCV_.wait(lock);
      continue;
    }

    auto workTuple = std::move(queue_.front());

    queue_.pop_front();

    auto& workEntry = std::get<0>(workTuple); // 进行计算
    auto& work = std::get<1>(workTuple); // 拿到WorkMPI

    lock.unlock();
    queueConsumeCV_.notify_one();

    try {
      workEntry->run(workEntry);
      work->finishWorkMPI(); // 会等待WorkMPI的计算结果
    } catch (...) {
      work->finishWorkMPIError(std::current_exception());
    }

    lock.lock();
  }
}

finishWorkMPI 会标示并且进行通知。

void ProcessGroupMPI::WorkMPI::finishWorkMPI() {
  future_->markCompleted(at::IValue(outputTensors_));
  finish();
}

基类代码如下：

void ProcessGroup::Work::finish(std::exception_ptr exception) {
  std::unique_lock lock(mutex_);
  completed_ = true;
  exception_ = exception;
  if (recordFunctionEndCallback_) {
    recordFunctionEndCallback_();
    recordFunctionEndCallback_ = nullptr;
  }
  lock.unlock();
  cv_.notify_all();
}

具体如下图：

                                                                        +
                                                             Worker 1   |   Worker 2
                                                                        |
                                                                        |
                                                                        |
+-----------------+           +--------------------------------------+  |   +------------------------------------+            +---------------+
| Main Thread     |           |  ProcessGroupMPI                     |  |   | ProcessGroupMPI                    |            | Main Thread   |
|                 |           |                                      |  |   |                                    |            |               |
|                 |           |                                      |  |   |                                    |            |               |
|                 |           |                                      |  |   |                                    |            |               |
|                 |           |  +--------------------------------+  |  |   |  +------------------------------+  |            |               |
|                 |           |  |  runLoop        workerThread_  |  |  |   |  | runloop    workerThread_     |  |            |               |
|                 |           |  |                                |  |  |   |  |                              |  |            |               |
|                 |           |  |                                |  |  |   |  |                              |  |            |               |
|  +---------+    |           |  |   +-------------------------+  |  |  |   |  |  +-----------------------+   |  |            |               |
|  |         |    | allreduce |  |   | queue_                  |  |  |  |   |  |  | queue_                |   |  | allreduce  |   +---------+ |
|  | Reducer | +-------------------> |                         |  |  |  |   |  |  |                       | <-------------------+ |         | |
|  |         |    |           |  |   |                         |  |  |  |   |  |  |                       |   |  |            |   | Reducer | |
|  +---------+    |           |  |   |  +-------------------+  |  |  |  |   |  |  |  +-----------------+  |   |  |            |   |         | |
|                 |           |  |   |  |WorkEntry          |  |  |  |  |   |  |  |  | WorkEntry       |  |   |  |            |   +---------+ |
|                 |           |  |   |  |                   |  |  |  |  |   |  |  |  |                 |  |   |  |            |               |
|                 |           |  |   |  |   MPI_Allreduce <-----------------------------> MPI_Allreduce|  |   |  |            |               |
|                 |           |  |   |  |                   |  |  |  |  |   |  |  |  |                 |  |   |  |            |               |
|                 |           |  |   |  +-------------------+  |  |  |  |   |  |  |  +-----------------+  |   |  |            |               |
|                 |           |  |   |                         |  |  |  |   |  |  |                       |   |  |            |               |
|                 |           |  |   |                         |  |  |  |   |  |  |                       |   |  |            |               |
|                 |           |  |   +-------------------------+  |  |  |   |  |  +-----------------------+   |  |            |               |
|                 |           |  |                                |  |  |   |  |                              |  |            |               |
|                 |           |  +--------------------------------+  |  |   |  +------------------------------+  |            |               |
|                 |           |                                      |  |   |                                    |            |               |
|                 |           |                                      |  |   |                                    |            |               |
|                 |           |                                      |  |   |                                    |            |               |
+-----------------+           +--------------------------------------+  |   +------------------------------------+            +---------------+
                                                                        |
                                                                        |
                                                                        +

手机如下：

4.4 封装

PyTorch 对各种 process group 做了封装，这样用户就可以调用 GroupMember.WORLD 来完成各种操作，但是用户是无感的。

def _get_default_group():
    """
    Getting the default process group created by init_process_group
    """
    if not is_initialized():
        raise RuntimeError("Default process group has not been initialized, "
                           "please make sure to call init_process_group.")
    return GroupMember.WORLD

又比如，在 torch/distributed/distributed_c10d.py 之中如下方法可以看到 all_to_all 和 all_gather 之类的函数，其注释有很详细的用法（这里因为篇幅所限略去），大家有兴趣可以自行学习。

def all_to_all(output_tensor_list,
               input_tensor_list,
               group=None,
               async_op=False):
    """
    Each process scatters list of input tensors to all processes in a group and
    return gathered list of tensors in output list.

    Args:
        output_tensor_list (list[Tensor]): List of tensors to be gathered one
            per rank.
        input_tensor_list (list[Tensor]): List of tensors to scatter one per rank.
        group (ProcessGroup, optional): The process group to work on. If None,
            the default process group will be used.
        async_op (bool, optional): Whether this op should be an async op.

    Returns:
        Async work handle, if async_op is set to True.
        None, if not async_op or if not part of the group.
    """
    if _rank_not_in_group(group):
        return

    opts = AllToAllOptions()
    _check_tensor_list(output_tensor_list, "output_tensor_list")
    _check_tensor_list(input_tensor_list, "input_tensor_list")

    if group is None:
        default_pg = _get_default_group()
        work = default_pg.alltoall(output_tensor_list, input_tensor_list, opts)
    else:
        work = group.alltoall(output_tensor_list, input_tensor_list, opts)

    if async_op:
        return work
    else:
        work.wait()

all_gather 代码如下：

def all_gather(tensor_list,
               tensor,
               group=None,
               async_op=False):
    """
    Gathers tensors from the whole group in a list.

    Complex tensors are supported.

    Args:
        tensor_list (list[Tensor]): Output list. It should contain
            correctly-sized tensors to be used for output of the collective.
        tensor (Tensor): Tensor to be broadcast from current process.
        group (ProcessGroup, optional): The process group to work on. If None,
            the default process group will be used.
        async_op (bool, optional): Whether this op should be an async op

    Returns:
        Async work handle, if async_op is set to True.
        None, if not async_op or if not part of the group
    """
    _check_tensor_list(tensor_list, "tensor_list")
    _check_single_tensor(tensor, "tensor")
    if _rank_not_in_group(group):
        return

    tensor_list = [t if not t.is_complex() else torch.view_as_real(t) for t in tensor_list]
    tensor = tensor if not tensor.is_complex() else torch.view_as_real(tensor)

    if group is None:
        default_pg = _get_default_group()
        work = default_pg.allgather([tensor_list], [tensor])
    else:
        work = group.allgather([tensor_list], [tensor])

    if async_op:
        return work
    else:
        work.wait()

至此，进程组介绍完毕，下一篇我们分析DDP的论文，敬请期待。

0xFF 参考

pytorch(分布式)数据并行个人实践总结——DataParallel/DistributedDataParallel

https://www.telesens.co/2019/04/04/distributed-data-parallel-training-using-pytorch-on-aws/

DISTRIBUTED TRAINING WITH UNEVEN INPUTS USING THE JOIN CONTEXT MANAGER

你可能感兴趣的:([源码解析] PyTorch 分布式(7) ----- DistributedDataParallel 之进程组)

ES 压缩包安装思静鱼 #elasticsearch elasticsearch 大数据
以下是Elasticsearch(ES)通过.tar.gz压缩包安装的详细步骤（适用于Linux/macOS系统）：1.准备工作1.1检查系统依赖Java环境：ES需要JDK，推荐OpenJDK11/17（ES7.x/8.x兼容版本）。java-version#检查是否已安装如果未安装，使用以下命令安装（以Ubuntu为例）：sudoaptupdatesudoaptinstallopenjdk-1
第十五届蓝桥杯C++B组国赛题解+复盘总结 Savior｀L 蓝桥杯蓝桥杯 c++
文章目录1、合法密码2、选数概率3、蚂蚁开会4、立定跳远5、最小字符串6、数位翻转7、数星星8、套手镯9、挑石头10、最长回文前后缀总结本人目标是国一，昨天模拟做了一遍，结果如下图，这水平感觉远远不够。现在已经全部补完，发现了许多问题，好多该得到的分数没有得到，总结记录一下。1、合法密码赛时暴力枚举，官方正解是把非字母非数字算作字符，这点与我的想法一致，但是代码就是写错了，在数字处忘记contin
LeetCode 每日一题 2025/6/30-2025/7/6
记录了初步解题思路以及本地实现代码；并不一定为最优也希望大家能一起探讨一起进步目录6/30594.最长和谐子序列7/13330.找到初始输入字符串I7/23333.找到初始输入字符串II7/33304.找出第K个字符I7/43307.找出第K个字符II7/51394.找出数组中的幸运数7/61865.找出和为指定值的下标对6/30594.最长和谐子序列m记录每一个数字出现的次数l记录去重后从小到大
OpenHarmony vs Linux：分布式操作系统的终极对决 109702008 编程操作系统 #linux系统 linux 分布式人工智能
副标题：从架构基因到场景适配，解析两大系统的分布式能力差异与未来演进引言：分布式操作系统的时代命题在万物互联时代，设备协同与算力融合成为刚需。OpenHarmony和Linux作为两大开源操作系统，代表了不同的技术路线：前者是原生分布式设计，后者是生态驱动演进。本文从分布式视角深度对比二者，为开发者提供选型参考。一、架构设计：原生支持vs生态补足能力维度OpenHarmonyLinux内核模型微内
swiftui_SwiftUI中的模糊效果 weixin_26638123 python
swiftuiIfyou’vebeenaroundsmartphoneslongenough,youmayremeberiOS7.ItwasahugeupdatetotheiPhone,bringinginnotonlyatoneofnewfeaturesandfunctionality,butalsoafreshnewdesign.ThisupdatetoiOSwasnothinglesstha
iOS 各种demo链接汇总~其它UI
//联系人:石虎QQ:1224614774昵称:嗡嘛呢叭咪哄一、其他UIAwesomeMenu-最多人用的Path菜单。DCPathButton-Path，4.0的弹出菜单，呼出或者关闭菜单时，多个小图标会分别按照逆时针和顺时针的方向进行滚动。SphereMenu-利用UIDynamicAnimator的有趣的菜单，path类似。KYGooeyMenu-KYGooeyMenu是一个具有GooeyE
Swift 6.2 并发江湖：两大神功破局旧制，代码运行经脉革新（下）大熊猫侯佩 Apple开发入门 Swift 6.2 WWDC 25 并发 async/await nonisolated nonsending concurrent
楔子江湖风云变幻，Swift武林近日再掀波澜。传闻Apple于密室推演三月，终得《Swift6.2并发新篇》，扬言要破解困扰开发者多年的“经脉错乱”之症——那便是异步函数与同步函数运行规则不一、主Actor调用常生冲突之陈年旧疾。想当年，多少英雄好汉折戟于GCD到Swift并发的转型之路：明明是同门函数，同步者循调用者经脉而行，异步者却偏要另辟蹊径，轻则编译器怒目相向，重则数据走火入魔。如今6.2
融云入选「创业邦·2025 中国企业全球化新势力 100 强·引领型」资讯
7月3日-4日，由创业邦主办的“2025DEMOWORLD企业开放式创新创投大会”在上海松江举行。作为全球领先的智能通信云服务商，融云凭借卓越的全球化实践和在“通信+AI”领域的创新探索，成功入选大会重磅发布的“2025中国企业全球化新势力100强·引领型”榜单。本次榜单评选采用内部调研回访+专家评委评审的模式，邀请来自峰瑞资本、嘉御资本、创世伙伴创投等头部机构的10位专家评委，根据专家评委的综合
Java安全之JNI java软件安全
介绍JNI（JavaNativeInterface）是一种允许Java程序与本地代码（如C或C++）互操作的接口技术。通过JNI，Java程序能够调用本地代码，实现性能和功能上的优化，克服Java在某些场景下的内存管理和执行效率瓶颈。它使得开发者可以在Java应用中集成底层操作系统功能或使用已存在的高效本地库，从而提升应用的执行速度或访问硬件资源的能力。JNI基本知识本地库生命周期阶段触发条件关键
涨薪技术|Prometheus之PromQL操作符川石课堂软件测试 prometheus python 数据库 postman 测试工具 appium 功能测试
使用PromQL除了能够方便的按照查询和过滤时间序列以外，PromQL还支持丰富的操作符，用户可以使用这些操作符对进一步的对事件序列进行二次加工。这些操作符包括：数学运算符，逻辑运算符，布尔运算符等等。01数学运算例如，我们可以通过指标node_memory_free_bytes_total获取当前主机可用的内存空间大小，其样本单位为Bytes。这是如果客户端要求使用MB作为单位响应数据，那只需要
Linux基础复习第五天龙利基斯 linux chrome 运维
Linux基础复习第五天1./etc/passwd这个文件有什么作用，记录的内容是什么/etc/passwd是Linux的核心系统文件，用于存储用户账户的基本信息。它是用户身份验证、权限管理和进程控制的基础。尽管文件名包含passwd，但它不存储加密后的密码（现代系统中密码通常存储在/etc/shadow文件中），而是记录用户的其他关键属性。文件中的每一行对应一个用户账户，字段由冒号:分隔，共7个
洛谷-分支结构 pay4fun 刷题算法 c++开发语言
洛谷–分支结构题目来源于洛谷，若有侵权，私信后立刻删除P5709【深基2.习6】ApplesPrologue/苹果和虫子题目描述小B喜欢吃苹果。她现在有mmm（1≤m≤1001\lem\le1001≤m≤100）个苹果，吃完一个苹果需要花费ttt（0≤t≤1000\let\le1000≤t≤100）分钟，吃完一个后立刻开始吃下一个。现在时间过去了sss（1≤s≤100001\les\le10000
人体坐姿检测系统开发实战（YOLOv8+PyTorch+可视化） Loving_enjoy 计算机学科论文创新点人工智能深度学习迁移学习经验分享
本文将手把手教你构建智能坐姿检测系统，结合目标检测与姿态估计技术，实现不良坐姿的实时识别与预警###一、项目背景与价值现代人每天平均坐姿时间超过8小时，不良坐姿会导致：-脊椎压力增加300%-颈椎病发病率提升45%-腰椎间盘突出风险增加60%本系统通过计算机视觉技术实时监测坐姿状态，对驼背、侧倾、前倾等不良姿势进行智能识别和预警。相较于传统传感器方案，我们的视觉方案具有非接触、低成本、易部署的优势
Linux常用命令之静态IP配置 weixin_45766539 linux 运维服务器
Linux常用命令之静态IP配置-来吧，阿笔在线ubuntu24配置静态网络，备用确保NetworkManager已经安装执行命令：sudosystemctlstatusNetworkManager.service如果找不到这个服务就安装：sudoaptinstallnetwork-manager确保这个服务是正常运行再去应用以下配置文件，否则网络错误没有ip就很难ssh访问了###netplan
【经验分享】分布式爬虫的优势与劣势分析电商数据girl 跨境电商API接口电商项目API接口测试电商ERP项目接口经验分享分布式爬虫 java 数据库大数据 python
分布式爬虫通过多节点协同工作实现数据采集，其设计初衷是解决单节点爬虫在大规模数据抓取场景中的性能瓶颈，但同时也因架构复杂度带来了新的挑战。以下从技术特性、应用场景适配性两个维度，系统分析其优势与劣势：一、分布式爬虫的核心优势高效突破大规模数据采集瓶颈并行处理能力：通过将任务拆分到多个节点并行执行，大幅提升数据抓取效率。例如，采集100万条电商商品数据时，单节点爬虫可能需要数天，而由10个节点组成的
面试官问“了解 MySQL 索引失效的场景吗？请说说” —— 深入剖析与避坑指南码里看花‌ mysql 数据库
引言：效率之殇在数据库性能优化的战场上，索引无疑是那把最锋利的武器。它能将全表扫描的“大海捞针”变为精准定位的“探囊取物”。然而，这把利器并非万能，如果使用不当，精心设计的索引可能会瞬间“哑火”，导致查询性能断崖式下跌。当面试官抛出“MySQL索引失效的场景有哪些？”这个问题时，他不仅是在考察你对索引机制的理解深度，更是在检验你的实战排障能力和对数据库底层原理的掌握程度。本文将结合原理与实践，系统
Docker容器如何实现分布式微服务：从0到1的深度解析 cda2024 docker 分布式微服务
在当今云计算和大数据时代，企业面临的最大挑战之一是如何快速、稳定地部署和管理复杂的软件应用。传统的单体架构已难以满足现代互联网应用的需求，而分布式微服务架构成为了解决这一难题的关键。但问题随之而来：如何高效地构建和管理分布式微服务？Docker容器技术的出现为这个问题带来了新的曙光。它不仅简化了应用程序的打包和部署过程，还为微服务架构提供了强大的支持。本文将深入探讨Docker容器如何实现分布式微
深度剖析：向70岁老系统植入通信芯片——MCP注入构建未来级分布式通信 Loving_enjoy 计算机学科论文创新点迁移学习人工智能机器学习深度学习
>如何让老旧系统重获新生？协议注入技术是关键。##一、当遗留系统遇上分布式未来：一场艰难的对话想象一下：你负责维护一套诞生于20年前的单体式银行核心系统，它像一位固执的70岁老人，使用着陈旧的TCP自定义协议。这时业务部门要求实现与云原生风险分析引擎的实时交互。直接改造？风险巨大；推倒重来？成本天文数字。这就是**分布式通信协议断层**带来的典型困境。###传统桥接方案痛点1.**协议转换地狱**
2025年7月-9月广深地区学术会议征稿邀稿 | 2025年7-9月广州学术会议、深圳学术会议参会投稿 | 广深参会 EI 检索会议推荐 | 期待在广东与您相见，共襄学术盛举！
会议名称【点击会议名称查看详情】会议时间会议地点第四届能源与电力系统国际学术会议(ICEEPS2025)2025年7月17-19日广州第七届电子与通信，网络与计算机技术国际学术会议（ECNCT2025）2025年7月18-20日广州2025年人工智能与基础模型国际学术会议（AIFM2025）2025年7月18-20日广州第六届经济管理与大数据应用国际学术会议(ICEMBDA2025)2025年7月
Gemini CLI 智能记忆系统全景解析：从单点存储到分布式记忆网络的架构进化步子哥智能涌现分布式架构人工智能
前言在前面的分析中，我们了解了MemoryTool的基础记忆存储功能。今天，我们将深入探索GeminiCLI记忆系统的完整生态——通过分析memoryDiscovery.ts和memoryImportProcessor.ts，揭示一个更加复杂而精妙的分布式记忆网络¹。这个系统不仅能够存储单点记忆，更能够构建跨文件、跨项目的智能上下文体系。注解1-分布式记忆网络：不同于传统的单文件存储，Gemini
魔都AI医疗哪家强？全景揭秘科技创新与未来钱景！
引言上海作为中国科技创新的先锋城市，正在AI医疗领域崭露头角。根据2024年12月的数据，上海拥有34家专注于AI药物研发的公司，占全国预临床研究的60%和临床试验的47%。这些公司利用深度学习、大语言模型（LLM）和计算机视觉等技术，革新药物发现、医疗影像分析和数据治理，推动医疗行业的智能化转型。从全球首个人工智能医院“AgentHospital”到AI驱动的诊断系统，上海的AI医疗生态正在重塑
黑洞加速器官方android安卓版本,www.a0qmherg.com
DomainName:A0QMHERG.COMRegistryDomainID:2477874593_DOMAIN_COM-VRSNRegistrarWHOISServer:whois.namesilo.comRegistrarURL:http://www.namesilo.comUpdatedDate:2020-01-09T05:00:47ZCreationDate:2020-01-09T04:
20250709荣品RD-RK3588开发板的Android13系统下修改为连续长按10s开机南棱笑笑生杂质杂质
20250709荣品RD-RK3588开发板的Android13系统下修改为连续长按10s开机2025/7/910:11缘起：由于荣品RD-RK3588开发板使用的PMIC是RK806。以前在荣品PRO-RK3566开发板上使用的PMIC是RK809上做过了长按开机的。直接迁移过来了！1、根据RK809的DATASHEET，短按开机【100ms/500ms】/长按关机，长按关机。6s/8s/10s
R语言如何接入实时行情接口
目录1.安装必要的R包2.导入库3.连接WebSocket4.处理连接成功后的操作5.处理接收到的消息6.处理连接关闭和错误7.发送心跳数据8.自动重连机制9.启动连接和重连总结在数据分析和金融研究中，实时行情数据的获取至关重要，但市面上的实时行情接口并不多，本文将一步步教你如何使用R语言接入实时行情接口，获取来自WebSocket的实时数据。1.安装必要的R包首先，确保你已安装了以下R包，用于处
系统架构设计师论文分享-论分布式数据库技术及应用码农卿哥系统架构分布式数据库
我的软考历程摘要2023年2月，我所在的公司通过了研发纱线MES系统的立项，该项目为国内纱线工厂提供SAAS服务，旨在提高纱线工厂的数字化和智能化水平，我在该项目中担任系统架构设计师一职，负责该项目的架构设计工作。本文结合我在该项目中的实践，详细论述了分布式数据技术及其应用。在该项目中，会接入众多纱线工厂的全部设备的生产数据，数据量巨大，如果采用传统的单体关系型数据库，难以支撑起这庞大的数据。基于
APP测试手册
目录一、APP测试流程二、APP测试点1.功能测试2.UI测试3.软件权限4.数据安全性5.安装/卸载6.免登录7.运行8.APP更新9.数据更新10.离线浏览11.前后台切换12.用户体验测试13.图形测试14.交叉事件测试15.时间测试16.定位、照相机服务17.消息、通知测试18.异常测试19.兼容性测试20.适配性测试21.PUSH测试22.硬件环境测试23.网络环境24.性能测试25.安
【云计算解决方案面试整理】3-7主流云计算平台、云计算架构、安全防护不太灵光的程序员阿里云云计算工程师ACP认证云计算云计算面试架构
准备面云计算解决方案的岗位，整理了一些，也请大佬们指点。文档分为云计算基础概念、云计算技术原理、主流云计算平台（以天翼云为例）、云计算架构（弹性设计、高可用设计、高性能设计）、安全防护几个方面。三、主流云计算平台1.阿里云云计算平台强大的计算能力：拥有自主研发的飞天操作系统，可提供高效、稳定的计算服务，能够满足大规模数据处理和高并发业务的需求。例如，在应对双11这样的高并发场景时，飞天系统可以快速
构建分布式高防架构实现业务零中断群联云防护小杜安全问题汇总分布式架构前端安全游戏 tcp/ip 网络
传统方案痛点单一高防IP在遭遇TCP连接耗尽攻击时，仍可能导致合法用户被挤出连接池。创新方案：多节点负载均衡+协议栈优化#Nginx高防配置核心片段（TCP层防护）stream{#启用SYNCookie防护syn_floodon;syn_flood_timeout=30s;#连接速率限制（每个IP每秒最大新建连接数）limit_conn_zone$binary_remote_addrzone=pe
20250708-02-redis通用key操作命令_笔记
一、Redis1.通用键值操作1）键的查看操作keys命令基本功能：查询当前数据库中的所有key，支持精确查询和模糊查询与memcached区别：memcached无法查询所有key，这是Redis特有的功能查询示例：keys*返回所有key（如"age"和"site"）keyssite精确查询指定keykeyss*查询以s开头的key通配符三种通配符：*：匹配任意多个字符（如key
分布式推客系统架构设计：从微服务到高性能计算的实践路径 wx_ywyy6798 推客系统推客小程序推客分销系统推客系统开发推客小程序开发推客分销系统开发分销系统
一、推客系统概述与市场背景分析推客系统（PromoterSystem）作为一种创新的社交化营销工具，近年来在电商、知识付费、本地生活服务等领域展现出强大的市场渗透力。该系统本质上是一种基于社交关系的分布式营销网络，通过激励用户主动分享商品或服务信息，实现裂变式传播效果。根据2023年数字营销行业白皮书显示，采用推客系统的企业平均获客成本比传统广告渠道降低47%，转化率提升3倍以上。在数字化转型浪潮
用MiddleGenIDE工具生成hibernate的POJO（根据数据表生成POJO类） AdyZhang POJO eclipse Hibernate MiddleGenIDE
推荐:MiddlegenIDE插件, 是一个Eclipse 插件. 用它可以直接连接到数据库, 根据表按照一定的HIBERNATE规则作出BEAN和对应的XML ，用完后你可以手动删除它加载的JAR包和XML文件! 今天开始试着使用
.9.png Cb123456 android
“点九”是andriod平台的应用软件开发里的一种特殊的图片形式，文件扩展名为：.9.png 　　智能手机中有自动横屏的功能,同一幅界面会在随着手机(或平板电脑)中的方向传感器的参数不同而改变显示的方向,在界面改变方向后,界面上的图形会因为长宽的变化而产生拉伸,造成图形的失真变形。　　我们都知道android平台有多种不同的分辨率，很多控件的切图文件在被放大拉伸后，边
算法的效率天子之骄算法效率复杂度最坏情况运行时间大O阶平均情况运行时间
算法的效率效率是速度和空间消耗的度量。集中考虑程序的速度，也称运行时间或执行时间，用复杂度的阶(O)这一标准来衡量。空间的消耗或需求也可以用大O表示，而且它总是小于或等于时间需求。以下是我的学习笔记： 1.求值与霍纳法则，即为秦九韶公式。 2.测定运行时间的最可靠方法是计数对运行时间有贡献的基本操作的执行次数。运行时间与这个计数成正比。
java数据结构何必如此 java 数据结构
Java 数据结构 Java工具包提供了强大的数据结构。在Java中的数据结构主要包括以下几种接口和类：枚举（Enumeration）位集合（BitSet）向量（Vector）栈（Stack）字典（Dictionary）哈希表（Hashtable）属性（Properties）以上这些类是传统遗留的，在Java2中引入了一种新的框架-集合框架(Collect
MybatisHelloWorld 3213213333332132
//测试入口TestMyBatis package com.base.helloworld.test; import java.io.IOException; import org.apache.ibatis.io.Resources; import org.apache.ibatis.session.SqlSession; import org.apache.ibat
Java|urlrewrite|URL重写|多个参数 7454103 java xml Web 工作
个人工作经验！如有不当之处，敬请指点 1.0 web -info 目录下建立 urlrewrite.xml 文件类似如下： <?xml version="1.0" encoding="UTF-8" ?> <!DOCTYPE u
达梦数据库+ibatis darkranger sql mysql ibatis SQL Server
--插入数据方面如果您需要数据库自增... 那么在插入的时候不需要指定自增列. 如果想自己指定ID列的值, 那么要设置 set identity_insert 数据库名.模式名.表名; ----然后插入数据; example: create table zhabei.test( id bigint identity(1,1) primary key, nam
XML 解析四种方式 aijuans android
XML现在已经成为一种通用的数据交换格式,平台的无关性使得很多场合都需要用到XML。本文将详细介绍用Java解析XML的四种方法。 XML现在已经成为一种通用的数据交换格式,它的平台无关性,语言无关性,系统无关性,给数据集成与交互带来了极大的方便。对于XML本身的语法知识与技术细节,需要阅读相关的技术文献,这里面包括的内容有DOM(Document Object
spring中配置文件占位符的使用 avords
1.类 <?xml version="1.0" encoding="UTF-8"?><!DOCTYPE beans PUBLIC "-//SPRING//DTD BEAN//EN" "http://www.springframework.o
前端工程化-公共模块的依赖和常用的工作流 bee1314 webpack
题记：一个人的项目，还有工程化的问题嘛？我们在推进模块化和组件化的过程中，肯定会不断的沉淀出我们项目的模块和组件。对于这些沉淀出的模块和组件怎么管理？另外怎么依赖也是个问题？你真的想这样嘛？ var BreadCrumb = require(‘../../../../uikit/breadcrumb’); //真心ugly。
上司说「看你每天准时下班就知道你工作量不饱和」，该如何回应？ bijian1013 项目管理沟通 IT职业规划
问题：上司说「看你每天准时下班就知道你工作量不饱和」，如何回应正常下班时间6点，只要是6点半前下班的，上司都认为没有加班。 Eno-Bea回答，注重感受，不一定是别人的虽然我不知道你具体从事什么工作与职业，但是我大概猜测，你是从事一项不太容易出现阶段性成果的工作
TortoiseSVN，过滤文件征客丶 SVN
环境： TortoiseSVN 1.8 配置：在文件夹空白处右键选择 TortoiseSVN -> Settings 在 Global ignote pattern 中添加要过滤的文件：多类型用英文空格分开 *name ：过滤所有名称为 name 的文件或文件夹 *.name ：过滤所有后缀为 name 的文件或文件夹 --------
【Flume二】HDFS sink细说 bit1129 Flume
1. Flume配置 a1.sources=r1 a1.channels=c1 a1.sinks=k1 ###Flume负责启动44444端口 a1.sources.r1.type=avro a1.sources.r1.bind=0.0.0.0 a1.sources.r1.port=44444 a1.sources.r1.chan
The Eight Myths of Erlang Performance bookjovi erlang
erlang有一篇guide很有意思： http://www.erlang.org/doc/efficiency_guide 里面有个The Eight Myths of Erlang Performance： http://www.erlang.org/doc/efficiency_guide/myths.html Myth: Funs are sl
java多线程网络传输文件(非同步)-2008-08-17 ljy325 java 多线程 socket
利用 Socket 套接字进行面向连接通信的编程。客户端读取本地文件并发送；服务器接收文件并保存到本地文件系统中。使用说明:请将TransferClient, TransferServer, TempFile三个类编译，他们的类包是FileServer. 客户端: 修改TransferClient: serPort, serIP, filePath, blockNum,的值来符合您机器的系
读《研磨设计模式》-代码笔记-模板方法模式 bylijinnan java 设计模式
声明：本文只为方便我个人查阅和理解，详细的分析以及源代码请移步原作者的博客http://chjavach.iteye.com/ import java.sql.Connection; import java.sql.DriverManager; import java.sql.PreparedStatement; import java.sql.ResultSet;
配置心得 chenyu19891124 配置
时间就这样不知不觉的走过了一个春夏秋冬，转眼间来公司已经一年了，感觉时间过的很快，时间老人总是这样不停走，从来没停歇过。作为一名新手的配置管理员，刚开始真的是对配置管理是一点不懂，就只听说咱们公司配置主要是负责升级，而具体该怎么做却一点都不了解。经过老员工的一点点讲解，慢慢的对配置有了初步了解，对自己所在的岗位也慢慢的了解。做了一年的配置管理给自总结下： 1.改变从一个以前对配置毫无
对“带条件选择的并行汇聚路由问题”的再思考 comsci 算法工作软件测试嵌入式领域模型
2008年上半年，我在设计并开发基于”JWFD流程系统“的商业化改进型引擎的时候，由于采用了新的嵌入式公式模块而导致出现“带条件选择的并行汇聚路由问题”(请参考2009-02-27博文)，当时对这个问题的解决办法是采用基于拓扑结构的处理思想，对汇聚点的实际前驱分支节点通过算法预测出来，然后进行处理，简单的说就是找到造成这个汇聚模型的分支起点，对这个起始分支节点实际走的路径数进行计算，然后把这个实际
Oracle 10g 的clusterware 32位下载地址 daizj oracle
Oracle 10g 的clusterware 32位下载地址 http://pan.baidu.com/share/link?shareid=531580&uk=421021908 http://pan.baidu.com/share/link?shareid=137223&uk=321552738 http://pan.baidu.com/share/l
非常好的介绍：Linux定时执行工具cron dongwei_6688 linux
Linux经过十多年的发展，很多用户都很了解Linux了，这里介绍一下Linux下cron的理解，和大家讨论讨论。cron是一个Linux 定时执行工具，可以在无需人工干预的情况下运行作业，本文档不讲cron实现原理，主要讲一下Linux定时执行工具cron的具体使用及简单介绍。新增调度任务推荐使用crontab -e命令添加自定义的任务（编辑的是/var/spool/cron下对应用户的cr
Yii assets目录生成及修改 dcj3sjt126com yii
assets的作用是方便模块化，插件化的，一般来说出于安全原因不允许通过url访问protected下面的文件，但是我们又希望将module单独出来，所以需要使用发布，即将一个目录下的文件复制一份到assets下面方便通过url访问。 assets设置对应的方法位置 \framework\web\CAssetManager.php assets配置方法在m
mac工作软件推荐 dcj3sjt126com mac
mac上的Terminal + bash ＋ screen组合现在已经非常好用了，但是还是经不起iterm＋zsh＋tmux的冲击。在同事的强烈推荐下，趁着升级mac系统的机会，顺便也切换到iterm＋zsh＋tmux的环境下了。我为什么要要iterm2 切换过来也是脑袋一热的冲动，我也调查过一些资料，看了下iterm的一些优点： * 兼容性好，远程服务器 vi 什么的低版本能很好兼
Memcached(三)、封装Memcached和Ehcache frank1234 memcached ehcache spring ioc
本文对Ehcache和Memcached进行了简单的封装，这样对于客户端程序无需了解ehcache和memcached的差异，仅需要配置缓存的Provider类就可以在二者之间进行切换，Provider实现类通过Spring IoC注入。 cache.xml <?xml version="1.0" encoding="UTF-8"?>
Remove Duplicates from Sorted List II hcx2013 remove
Given a sorted linked list, delete all nodes that have duplicate numbers, leaving only distinct numbers from the original list. For example,Given 1->2->3->3->4->4->5,
Spring4新特性——注解、脚本、任务、MVC等其他特性改进 jinnianshilongnian spring4
Spring4新特性——泛型限定式依赖注入 Spring4新特性——核心容器的其他改进 Spring4新特性——Web开发的增强 Spring4新特性——集成Bean Validation 1.1(JSR-349)到SpringMVC Spring4新特性——Groovy Bean定义DSL Spring4新特性——更好的Java泛型操作API Spring4新
MySQL安装文档 liyong0802 mysql
工作中用到的MySQL可能安装在两种操作系统中，即Windows系统和Linux系统。以Linux系统中情况居多。安装在Windows系统时与其它Windows应用程序相同按照安装向导一直下一步就即，这里就不具体介绍，本文档只介绍Linux系统下MySQL的安装步骤。 Linux系统下安装MySQL分为三种：RPM包安装、二进制包安装和源码包安装。二
使用VS2010构建HotSpot工程 p2p2500 HotSpot OpenJDK VS2010
1. 下载OpenJDK7的源码： http://download.java.net/openjdk/jdk7 http://download.java.net/openjdk/ 2. 环境配置 ▶
Oracle实用功能之分组后列合并 seandeng888 oracle 分组实用功能合并
1 实例解析由于业务需求需要对表中的数据进行分组后进行合并的处理，鉴于Oracle10g没有现成的函数实现该功能，且该功能如若用JAVA代码实现会比较复杂，因此，特将SQL语言的实现方式分享出来，希望对大家有所帮助。如下：表test 数据如下： ID,SUBJECTCODE,DIMCODE,VALUE 1&nbs
Java定时任务注解方式实现 tuoni java spring jvm xml jni
Spring 注解的定时任务，有如下两种方式：第一种： <?xml version="1.0" encoding="UTF-8"?> <beans xmlns="http://www.springframework.org/schema/beans" xmlns:xsi="http
11大Java开源中文分词器的使用方法和分词效果对比 yangshangchuan word分词器 ansj分词器 Stanford分词器 FudanNLP分词器 HanLP分词器
本文的目标有两个： 1、学会使用11大Java开源中文分词器 2、对比分析11大Java开源中文分词器的分词效果本文给出了11大Java开源中文分词的使用方法以及分词结果对比代码，至于效果哪个好，那要用的人结合自己的应用场景自己来判断。 11大Java开源中文分词器，不同的分词器有不同的用法，定义的接口也不一样，我们先定义一个统一的接口： /** * 获取文本的所有分词结果, 对比