[Source Code Analysis] PyTorch Distributed (11) ----- DistributedDataParallel: Building the Reducer and the Join Operation

1.2 Parameters
An example of the parameters argument is shown below. parameters[0] holds the parameters of the model on rank 0; note that only element [0] is meaningful, and this [0] itself originally contains 20 elements:

parameters = {list: 1}
0 = {list: 4}
0 = {Parameter: 10} Parameter containing:\ntensor([[-4.0381e-02, 3.8828e-02, 1 )
1 = {Parameter: 10} Parameter containing:\ntensor([-0.0438, -0.2033, 0.2771, 0.0721, )
2 = {Parameter: 5} Parameter containing:\ntensor([[-0.0094, -0.1319, 0.0713, 0.3155, )
3 = {Parameter: 5} Parameter containing:\ntensor([-0.0008, 0.0582, -0.1245, -0.2538, )

20 = {Parameter: 5} Parameter containing:\ntensor([-0.0008, 0.0582, -0.1245, -0.2538, )
len = {int} 20
len = {int} 1
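For context, such a nested list can come from an ordinary model. The snippet below is only a hypothetical sketch (the layer sizes are made up) that shows the shape of the data structure handed to the Reducer; it is not the exact DDP construction code.

import torch

class ToyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net1 = torch.nn.Linear(10, 10)
        self.net2 = torch.nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(torch.relu(self.net1(x)))

model = ToyModel()
# parameters[0] is the parameter list of the single local model replica.
parameters = [list(model.parameters())]
print(len(parameters), len(parameters[0]))        # 1 4
print([tuple(p.shape) for p in parameters[0]])    # [(10, 10), (10,), (5, 10), (5,)]
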
An example of bucket_indices is as follows:

Regarding tensor indices: every tensor is given an index, starting from 0 and increasing up to tensors.size(). If the model's parameters consist of 20 tensors in total, the tensor indices run from 0 to 19 and are partitioned into 6 buckets; across these 6 buckets, every tensor index is unique and appears exactly once.

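The original figure here only sketched that layout; a hypothetical assignment (sizes made up purely for illustration) could look like this:

# A hypothetical assignment of 20 tensor indices (0..19) to 6 buckets.
# Every index appears in exactly one bucket.
bucket_indices = [
    [0, 1, 2],          # bucket 0
    [3, 4, 5, 6],       # bucket 1
    [7, 8],             # bucket 2
    [9, 10, 11, 12],    # bucket 3
    [13, 14, 15],       # bucket 4
    [16, 17, 18, 19],   # bucket 5
]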

Next, let us look at how the Reducer is initialized.

0x02 Reducer Initialization
The code lives in torch/lib/c10d/reducer.h and torch/lib/c10d/reducer.cpp.

2.1 Constructor
The logic is as follows:

Check whether this module is a multi-device module: iterate over the tensors, get each tensor's device, and insert the device into a set; if the set ends up holding more than one device, it is a multi-device module.
If expect_sparse_gradients is not specified, initialize expect_sparse_gradients_ to all false.
Call initialize_buckets to initialize the buckets and, as far as possible, assign parameters to buckets in reverse order, so that communicating per bucket is more efficient. The buckets may be re-initialized again later at runtime.
Add a grad_accumulator to each parameter; these are responsible for gradient synchronization during the backward pass.
Since these variables are leaf tensors of the autograd graph, their grad_fn is set to the gradient accumulation function.
The Reducer keeps pointers to these functions, so it can tell whether they were used in an autograd pass; if not, it marks their gradient tensors (grad tensors) as ready for reduction.
Iterate over the tensors and create a VariableIndex for each of them.
Get the grad_accumulator_ of Variable::AutogradMeta, i.e., the accumulator used to sum gradients into a leaf Variable.
Add the reducer's autograd_hook to every grad_accumulator_, with the variable's index as the hook argument. This hook hangs on the autograd graph and handles gradient synchronization during backward; it runs as soon as the grad_accumulator finishes (a Python-level sketch of this hook registration follows this list).
gradAccToVariableMap_ stores the correspondence between grad_accumulator and index (i.e., between the function pointer and the parameter tensor), which makes it easy to find unused parameters later when traversing the autograd graph.
Initialize backward_stats_.
Call initialize_local_used_map to initialize the various unused-parameter maps.
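Although the Reducer registers these hooks in C++, the same idea can be sketched at the Python level. The snippet below is only an illustration (it mirrors what DDP's early Python implementation did, not the current C++ path): it fetches the AccumulateGrad node of each parameter via the p.expand_as(p) trick and attaches a hook that fires once the gradient has been accumulated. The actual constructor follows.

import torch

model = torch.nn.Linear(4, 2)
grad_accs, handles = [], []

for index, p in enumerate(model.parameters()):
    # expand_as creates a node whose next_functions[0][0] is the
    # AccumulateGrad node of the leaf parameter p.
    grad_acc = p.expand_as(p).grad_fn.next_functions[0][0]
    # The hook runs after AccumulateGrad has written p.grad, which is the
    # moment the C++ Reducer would call autograd_hook(index).
    handles.append(grad_acc.register_hook(
        lambda *unused, idx=index: print(f"parameter {idx} is ready")))
    grad_accs.append(grad_acc)  # keep the node alive, or the hook is lost

model(torch.randn(3, 4)).sum().backward()  # prints one line per parameter
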
// The constructor takes a list of variables for every model replica.
// The bucket assignment for this reducer is specified as a list of
// buckets, each of which is specified as a list of indices into the
// variables list for a single replica (i.e. variables[0]).
Reducer::Reducer(
    std::vector<std::vector<at::Tensor>> replicas, // the tensors
    std::vector<std::vector<size_t>> bucket_indices, // the bucket assignment
    c10::intrusive_ptr<c10d::ProcessGroup> process_group,
    std::vector<std::vector<bool>> expect_sparse_gradients,
    int64_t bucket_bytes_cap,
    bool find_unused_parameters,
    bool gradient_as_bucket_view,
    std::unordered_map<size_t, std::string> paramNames)
    : replicas_(std::move(replicas)),
      process_group_(std::move(process_group)),
      expect_sparse_gradients_(std::move(expect_sparse_gradients)),
      expect_autograd_hooks_(false),
      require_finalize_(false),
      next_bucket_(0),
      has_marked_unused_parameters_(false),
      find_unused_parameters_(find_unused_parameters),
      gradient_as_bucket_view_(gradient_as_bucket_view),
      local_used_maps_reduced_(false),
      num_iterations_(0),
      num_buckets_ready_(0),
      has_rebuilt_bucket_(false),
      bucket_bytes_cap_(bucket_bytes_cap),
      divFactor_(kUnsetDivFactor),
      static_graph_(false),
      comm_hook_(nullptr),
      thread_local_state_(at::ThreadLocalState()),
      ddp_debug_level_(parseDistDebugLevel()),
      param_names_(std::move(paramNames)) {

// Check whether the module is a multi-device module
{
  std::set<int> unique_devices;
  for (const auto& v : replicas_[0]) { // iterate over the tensors
    auto device_idx = int(v.device().index()); // get the tensor's device
    if (unique_devices.find(device_idx) == unique_devices.end()) {
      unique_devices.insert(device_idx); // insert the device into a set
      if (unique_devices.size() > 1) { // more than one device in the set means a multi-device module
        is_multi_device_module_ = true;
        break;
      }
    }
  }
}

// If expect_sparse_gradients is not specified, initialize it such that
// we do not expect sparse gradients for any parameter.
if (expect_sparse_gradients_.empty()) {
  expect_sparse_gradients_ = std::vector<std::vector<bool>>(
      replicas_.size(), std::vector<bool>(replicas_[0].size(), false));
}

// Initialize variable bucketing.
// This can be reinitialized later after capturing runtime information.
{
  std::lock_guard<std::mutex> lock(mutex_);
  initialize_buckets(std::move(bucket_indices)); // initialize the buckets
}

// All variables are expected to have their grad_fn set to the gradient
// accumulation function (since they are leafs in the autograd graph).
// We store pointers to these functions such that we can check if they are
// used in an autograd pass. If they are not, we know their grad tensors
// can be marked as ready for reduction.
{
const auto replica_count = replicas_.size();
grad_accumulators_.resize(replica_count);
for (size_t replica_index = 0; replica_index < replica_count; // only replicas_[0] is meaningful
replica_index++) {
const auto variable_count = replicas_[replica_index].size(); // number of tensors
grad_accumulators_[replica_index].resize(variable_count); // allocate memory for grad_accumulators_

  for (size_t variable_index = 0; variable_index < variable_count;
       variable_index++) { // iterate over the tensors; variable_index is the tensor's index
    auto& variable = replicas_[replica_index][variable_index]; // get the concrete tensor
    const auto index = VariableIndex(replica_index, variable_index); // create a VariableIndex for every tensor

    // The gradient accumulator function is lazily initialized once.
    // Therefore we can use its presence in the autograd graph as
    // evidence that the parameter has participated in an iteration.
    auto grad_accumulator =
        torch::autograd::impl::grad_accumulator(variable); // get Variable::AutogradMeta's grad_accumulator_, i.e. the accumulator that sums gradients into the leaf Variable

#ifndef _WIN32
    using torch::distributed::autograd::ThreadLocalDistAutogradContext;
#endif
    // Hook to execute after the gradient accumulator has executed.
    // The hook hangs on the autograd graph and handles gradient synchronization
    // during backward; autograd_hook runs as soon as grad_accumulator finishes.
    hooks_.emplace_back(
        grad_accumulator->add_post_hook(
            torch::make_unique<torch::autograd::utils::LambdaPostHook>(
                [=](const torch::autograd::variable_list& outputs,
                    const torch::autograd::variable_list& /* unused */) {
#ifndef _WIN32
                  this->rpc_context_.set(
                      ThreadLocalDistAutogradContext::getContextPtr());
#endif
                  this->autograd_hook(index); // call the reducer's autograd_hook with this variable's index
                  return outputs;
                })),
        grad_accumulator);

    // Map raw function pointer to replica index and parameter index.
    // This is used later on when the autograd graph is traversed
    // to check for parameters for which no gradient is computed, if
    // find_unused_parameters=True.
    // Note that the mapping of gradient accumulator to variable should be
    // one to one as we deduplicate shared parameters before constructing
    // Reducer.
      
    // gradAccToVariableMap_ stores the grad_accumulator-to-index mapping (function
    // pointer <-> parameter tensor), which makes it easy to find unused parameters
    // later when traversing the autograd graph.
    if (find_unused_parameters_) {
      gradAccToVariableMap_[grad_accumulator.get()] = index;
    }

    numGradHooksTriggeredMap_[index] = 0;

    // The gradient accumulator is stored as weak_ptr in the autograd
    // metadata of the variable, so we have to keep it alive here for
    // the raw pointer to be valid.
    TORCH_CHECK(
        grad_accumulators_[replica_index][variable_index] == nullptr,
        c10::str(
            "Reducer tried to register duplicate grad accumulator for replica ",
            replica_index,
            " variable ",
            variable_index));
    grad_accumulators_[replica_index][variable_index] =
        std::move(grad_accumulator);
  }
}

}

// Initialize backward stats vector.
{
const auto replica_count = replicas_.size();
backward_stats_.resize(replica_count);
const auto variable_count = replicas_[0].size();
std::for_each(
backward_stats_.begin(),
backward_stats_.end(),
[=](std::vector<int64_t>& v) { v.resize(variable_count); });
}

// See Note [Skip allreducing local_used_maps_dev]
if (find_unused_parameters_) {
initialize_local_used_map();
}
}
Next we analyze each part in detail.

2.2 Initializing the Buckets
The initialize_buckets method initializes the buckets. The overall logic is: for each bucket, add its model replicas, and for each model replica, add its tensor list:

Set rpc_context_ with the distributed autograd context.

If initialize_buckets is called inside the DDP constructor, it does not matter whether the rpc context pointer is null, because the grad will not be mutated.
If initialize_buckets is called during the training loop, e.g. inside rebuild_buckets, the grad may have been mutated and point to a bucket_view, so it has to check whether the rpc context pointer is null.
If the rpc context pointer is null, mutate variable.grad() directly; otherwise mutate the gradient inside the rpc context.
Clear buckets_ and variable_locators_.

Resize variable_locators_ so that every variable has a bucket index.

Obtain the total number of buckets and the number of replicas per bucket: bucket_count = bucket_indices.size(); replica_count = replicas_.size();

Iterate bucket_index from 0 to bucket_count and initialize each Bucket:

Create a Bucket called bucket.
If bucket_indices[bucket_index].size() == 1, the bucket expects a single sparse gradient, so set bucket.expect_sparse_gradient = true.
Iterate replica_index from 0 to replica_count and initialize each BucketReplica:
Create a BucketReplica called replica.
If this bucket expects a single sparse gradient, then:
Take the first element of the vector with bucket_indices[bucket_index].front() and use it as variable_index.
Use variable_index to get the corresponding variable of the replica.
Set the replica's variable list to that single variable: replica.variables = {variable}; this replica contains only one variable.
Otherwise the bucket holds dense gradients, so:
Iterate over the bucket's variables, i.e. get each variable via replicas_[replica_index][variable_index].
Set (and check) the variable's device and data type.
Append the variable to the replica: replica.variables.push_back(variable).
Record some per-variable metadata on the replica; this metadata describes the flat contents, e.g. offsets stores each tensor's offset within the flat bucket contents.
Allocate memory for replica.contents (a small Python sketch of this flattening follows the list).
Call initialize_bucket_views(replica, replica.contents) to initialize the contents and views.
Add the replica to the bucket with bucket.replicas.push_back(std::move(replica)).
Iterate over the variables in the bucket, i.e. bucket_indices[bucket_index]:
Set Reducer.variable_locators_ so that the Reducer knows how to locate a variable within a bucket. bucket_index is the position within the buckets_ list, i.e. a bucket in buckets_; intra_bucket_index is the variable index within the vector field of the bucket replica.
Set the bucket's variables: bucket.variable_indices = std::move(bucket_indices[bucket_index]);
Add the bucket to the Reducer with buckets_.push_back(std::move(bucket)).
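To make the dense-gradient branch concrete, here is a minimal Python sketch (an illustration, not the C++ implementation): it walks the parameters of one bucket, records their offsets and lengths, and allocates a flat one-dimensional contents tensor of the total size, just as BucketReplica does.

import torch

params = [torch.randn(10, 10), torch.randn(10), torch.randn(5, 10), torch.randn(5)]

offsets, lengths, offset = [], [], 0
for p in params:
    offsets.append(offset)      # start of this tensor inside the flat buffer
    lengths.append(p.numel())   # number of elements it occupies
    offset += p.numel()

contents = torch.empty(offset, dtype=params[0].dtype)  # flat 1-D bucket contents
print(offsets, lengths, tuple(contents.shape))  # [0, 100, 110, 160] [100, 10, 50, 5] (165,)
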
The actual C++ code is:

void Reducer::initialize_buckets(
    std::vector<std::vector<size_t>> bucket_indices) {
// If initialize_buckets is called inside DDP constructor, then
// it does not matter rpc context ptr is nullptr or not, as grad
// will not be mutated.
// If initialize_buckets is called during training loop, e.g, inside
// rebuild_buckets(), since grad could be mutated and be pointed to
// bucket_view, then it needs to check rpc context ptr is nullptr or not,
// If rpc context ptr is nullptr, mutate variable.grad(); otherwise,
// mutate grad in rpc context.
#ifndef _WIN32
  using torch::distributed::autograd::ThreadLocalDistAutogradContext;
  this->rpc_context_.set(ThreadLocalDistAutogradContext::getContextPtr());
#endif

// This shouldn’t be called if we’re expecting autograd hooks to fire.
TORCH_CHECK(
!expect_autograd_hooks_,
"initialize_buckets must NOT be called during autograd execution.");

// Clear current bucket assignment.
buckets_.clear();
variable_locators_.clear();

// Ensure we have a bucket index for every variable.
variable_locators_.resize(replicas_[0].size());

// Iterate over buckets.
const auto bucket_count = bucket_indices.size();
const auto replica_count = replicas_.size();
buckets_.reserve(bucket_count);
// iterate bucket_index from 0 to bucket_count
for (size_t bucket_index = 0; bucket_index < bucket_count; bucket_index++) {
Bucket bucket; // create a bucket

// TODO(@pietern): Validate indices.
// Must be non-empty, unique, and unique across buckets.
TORCH_CHECK(
    bucket_indices[bucket_index].size() > 0, "Empty bucket specified.");

// Variables that expect sparse gradients must have their own bucket.
if (bucket_indices[bucket_index].size() == 1) {
  // this bucket expects a single sparse gradient
  const auto variable_index = bucket_indices[bucket_index].front();
  bucket.expect_sparse_gradient =
      expect_sparse_gradients_[0][variable_index];
} else {
  for (const auto variable_index : bucket_indices[bucket_index]) {
    TORCH_CHECK(
        !expect_sparse_gradients_[0][variable_index],
        "Buckets with more than one variable cannot include variables ",
        "that expect a sparse gradient.");
  }
}

// Iterate over model replicas, from 0 to replica_count; every model replica gets the same setup.
for (size_t replica_index = 0; replica_index < replica_count;
     replica_index++) {
  BucketReplica replica; // create a replica

  if (bucket.expect_sparse_gradient) {
    // this bucket expects a single sparse gradient
    const auto variable_index = bucket_indices[bucket_index].front(); // get the tensor's index
    const auto& variable = replicas_[replica_index][variable_index]; // get the tensor
    TORCH_INTERNAL_ASSERT(bucket_indices[bucket_index].size() == 1);
    replica.variables = {variable}; // this replica contains only one variable
  } else {
    at::TensorOptions options;
    // The start index of the variable in the flattened tensor.
    size_t offset = 0;

    // Reserve enough space for the per-variable fields stored in bucket
    // replica for efficiency.
    const size_t num_variables = bucket_indices[bucket_index].size();
    replica.variables.reserve(num_variables); 
    replica.offsets.reserve(num_variables);
    replica.lengths.reserve(num_variables);
    replica.sizes_vec.reserve(num_variables);

    // Iterate over bucket variables.
    for (const auto variable_index : bucket_indices[bucket_index]) { // iterate over the variables in the bucket
      TORCH_CHECK(
          variable_index < replicas_[replica_index].size(),
          "Out of range variable index specified.");
      const auto& variable = replicas_[replica_index][variable_index];
      if (!options.has_device()) {
        options = options.device(variable.device());
      } else {
        TORCH_CHECK(
            variable.device() == options.device(),
            "All parameters in a bucket must be ",
            "placed on the same device.");
      }
      if (!options.has_dtype()) {
        options = options.dtype(variable.dtype());
      } else {
        TORCH_CHECK(
            variable.dtype() == options.dtype(),
            "All parameters in a bucket must have the same dtype.");
      }
      
      const auto length = variable.numel();
      // append the variable to the replica
      replica.variables.push_back(variable); // a new variable is added here, so in the end we know how many variables this bucket holds
      // record per-variable metadata (offsets/lengths/sizes) on the replica
      replica.offsets.push_back(offset);
      replica.lengths.push_back(length);
      replica.sizes_vec.push_back(variable.sizes());
      offset += length;
    }

    // Allocate bucket contents tensor.
    replica.contents = at::empty({static_cast<long>(offset)}, options);

    initialize_bucket_views(replica, replica.contents); // initialize contents and views
  }

  // Add bucket replica to enclosing bucket.
  bucket.replicas.push_back(std::move(replica)); // add a new replica to the bucket's replica list
}

// Map participating variables to this bucket.
// This is identical across replicas so we only need to do this once.
size_t intra_bucket_index = 0;
for (const auto variable_index : bucket_indices[bucket_index]) { // iterate over the variables in the bucket
  TORCH_CHECK(
      variable_index < variable_locators_.size(),
      "Out of range variable index specified.");
  variable_locators_[variable_index] = // so the Reducer knows how to locate a variable inside a bucket
      VariableLocator(bucket_index, intra_bucket_index++);
}
bucket.variable_indices = std::move(bucket_indices[bucket_index]);

buckets_.push_back(std::move(bucket)); // add the bucket to the Reducer

}
}
2.3 Initializing the Views
initialize_bucket_views sets up the replica's contents and views.

// (see Note: "Gradient Layout Contract" in initialize_buckets).
void Reducer::initialize_bucket_views(
    Reducer::BucketReplica& replica,
    at::Tensor& contents) {
  for (size_t i = 0; i < replica.variables.size(); i++) {
    auto& v = replica.variables[i];
    const auto offset = replica.offsets[i];
    const auto length = replica.lengths[i];
    if (v.is_non_overlapping_and_dense()) { // dense tensor
      // If the param's memory is dense, match its layout, anticipating
      // the autograd engine (AccumulateGrad) will also create gradients
      // matching its layout.
      replica.bucket_views_in.push_back( // bucket_views_in holds views
          contents.as_strided(v.sizes(), v.strides(), offset));
    } else { // non-dense tensor
      // Fall back to a C-style contiguous view, again anticipating
      // AccumulateGrad will do the same when stashing grads for non-dense
      // params.
      replica.bucket_views_in.push_back( // bucket_views_in holds views
          contents.narrow(0, offset, length).view(v.sizes()));
    }
    // By default bucket_views_out and bucket_views_in are
    // essentially the same thing.
    replica.bucket_views_out = replica.bucket_views_in; // the out views are views as well

    // If gradient_as_bucket_view_ is set as true, then there are two cases to
    // handle: initialize_bucket_views could be called inside initialize_buckets
    // when rebuild_buckets, if grad has already been defined/calculated in
    // previous iteration, old grad needs to be copied into new bucket_view and
    // let grad point to the new bucket_view, initialize_bucket_views could also
    // be called inside initialize_buckets during construction. Grads are not
    // defined during construction time, in this case, do not let grad point to
    // bucket_view, because grads should be kept as being undefined for globally
    // unused parameters.
    if (gradient_as_bucket_view_) {
      auto& bucket_view = replica.bucket_views_in.back();
      runGradCallbackForVariable(v, [&](auto& grad) {
        if (grad.defined() && !grad.is_alias_of(bucket_view)) {
          bucket_view.copy_(grad);
          grad = bucket_view; // the grad was modified and needs to be written back
          // The grad is modified and needs to be written back.
          return true;
        }
        // The grad is not modified and does not need to be written back.
        return false; // no write-back needed since it was not modified
      });
    }
  }
}
2.3.1 BucketReplica member variables
Let us first recall a few member variables of BucketReplica.

at::Tensor contents: the result of flattening the bucket's contents into one dimension (Flattened, 1-dimensional).
std::vector<at::Tensor> bucket_views_in: provides a way to view each specific gradient inside contents from the input perspective.
std::vector<at::Tensor> bucket_views_out: provides a way to view each specific gradient inside contents from the output perspective.
Some further notes on std::vector<at::Tensor> bucket_views_in and std::vector<at::Tensor> bucket_views_out:

These two variables provide ways to manipulate the specific gradients inside contents; in other words, they provide views that can operate on the gradient of every tensor within contents. The user employs them as the entry points for moving each gradient's data into and out of contents.
In PyTorch, a view is a tensor that shares memory with the original data; it merely presents part of the original data, or presents it rearranged, without copying it.
A couple of PyTorch functions also deserve a word of explanation.

as_strided: creates a view (still a tensor) over an existing tensor with the given sizes and strides; note that the result is a view, so it still shares memory with the original tensor.
narrow: returns a tensor that is a narrowed slice of the original; it too shares memory with the original tensor. (A short demo follows.)
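The following small demo (illustration only) shows that both functions return views sharing storage with the flat buffer, which is exactly why writing a gradient through bucket_views_in also fills contents:

import torch

contents = torch.zeros(6)                        # the flat bucket buffer
view_a = contents.as_strided((2, 2), (2, 1), 0)  # 2x2 view starting at offset 0
view_b = contents.narrow(0, 4, 2)                # view of the last two elements

view_a.copy_(torch.ones(2, 2))   # writing through the views ...
view_b.fill_(5.0)
print(contents)                  # ... mutates the buffer: tensor([1., 1., 1., 1., 5., 5.])
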
The BucketReplica logic is shown in the figure below:

+---------------------------------------------+
| BucketReplica                               |
|                                             |
|  vector<Tensor> bucket_views_in  +-------------------+
|                                             |        |
|  vector<Tensor> bucket_views_out +---------------+   |
|                                             |    |   |
|                                             |    v   v
|  Tensor contents +---------------------------> Flattened (Tensor1, Tensor2, Tensor3)
|                                             |
|  vector<Tensor> variables +-------------------> [Tensor1, Tensor2, Tensor3]
|                                             |
+---------------------------------------------+

2.3.2 When it is called
When is this used? If gradient_as_bucket_view_ is set to true, there are two cases to handle:

initialize_bucket_views can be called inside initialize_buckets from rebuild_buckets; if a grad was already defined/computed in a previous iteration, the old grad has to be copied into the new bucket_view, and the grad then points to that new bucket_view (a small sketch of this case follows the list).
initialize_bucket_views can also be called inside initialize_buckets during construction. Grads are not defined at construction time; in that case, do not let the grad point to the bucket_view, because for globally unused parameters the grad should stay undefined.
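Below is a rough Python-level sketch of the first case (an illustration with assumed shapes, not the C++ code): an already-computed grad is copied into its slot of the flat buffer, and p.grad is then pointed at that view, so gradient and bucket share storage afterwards.

import torch

p = torch.nn.Parameter(torch.randn(4))
p.grad = torch.randn(4)                    # grad computed in a previous iteration

contents = torch.zeros(4)                  # new flat bucket buffer
bucket_view = contents.narrow(0, 0, 4)     # this parameter's slot in the buffer

if p.grad is not None and p.grad.data_ptr() != bucket_view.data_ptr():
    bucket_view.copy_(p.grad)              # copy the old grad into the new view
    p.grad = bucket_view                   # let the grad point at the bucket view

assert p.grad.data_ptr() == contents.data_ptr()  # grad and bucket now share storage
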
2.4 Initializing the locally-used map
initialize_local_used_map initializes local_used_maps_. Recall from the paper that local_used_maps_ is what DDP uses to find globally unused parameters:

The gradient of a globally unused parameter should stay untouched during both the forward and the backward pass. Detecting unused parameters requires global information, because in one DDP process a parameter may be absent from one operation while it participates in training in the same iteration of another process. DDP therefore keeps the locally unused parameter information in a bitmap and launches an additional AllReduce to gather a global bitmap. Since bitmaps are much smaller than tensors, all parameters of the model share one bitmap instead of creating per-bucket bitmaps. The bitmap lives on the CPU, so that a dedicated CUDA kernel does not have to be launched for every update. However, some ProcessGroup backends may not be able to run AllReduce on CPU tensors; ProcessGroupNCCL, for example, only supports CUDA tensors. Moreover, because DDP should work with any custom ProcessGroup backend, it cannot assume that every backend supports CPU tensors. To solve this, DDP maintains another bitmap on the same device as the first model parameter and invokes a non-blocking copy to move the CPU bitmap to that device bitmap for the collective communication. A rough sketch of this idea follows.
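The snippet below is illustrative only; the real logic lives in the Reducer, and the process-group setup is assumed to exist already:

import torch
import torch.distributed as dist

num_params = 20
device = torch.device("cuda", 0) if torch.cuda.is_available() else torch.device("cpu")

# CPU bitmap: one int32 slot per parameter, set to 1 when the parameter is used locally.
local_used = torch.zeros(num_params, dtype=torch.int32)
local_used[3] = 1  # e.g. parameter 3 produced a gradient in this iteration

# Device-side copy for backends (e.g. NCCL) that cannot all-reduce CPU tensors.
local_used_dev = torch.empty(num_params, dtype=torch.int32, device=device)
local_used_dev.copy_(local_used, non_blocking=True)

# dist.all_reduce(local_used_dev)  # after the all-reduce, a zero entry means globally unused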

The concrete code is as follows:

void Reducer::initialize_local_used_map() {
const auto replica_count = replicas_.size();
const auto variable_count = replicas_[0].size();
local_used_maps_.resize(replica_count);
local_used_maps_dev_.resize(replica_count);

for (size_t i = 0; i < replica_count; i++) {
at::TensorOptions options;
options = options.dtype(at::kInt);

// Deliberately don't pin the memory even if local_used_maps_dev_ will
// be cuda. See Note [local_used_maps_ -> local_used_maps_dev copying]
local_used_maps_[i] =
    at::zeros({static_cast<long>(variable_count)}, options);

// This tensor needs to be on the same device as replica because backend
// such as NCCL may not support CPU tensors, and hence it might not work
// if we always put it on CPU.
options = options.device(replicas_[i][0].device());
local_used_maps_dev_[i] =
    at::empty({static_cast<long>(variable_count)}, options);

}
}
The initialization flow is roughly as follows:

                    +
                    |
                    v
  rpc_context_ = ThreadLocalDistAutogradContext
                    +
                    |
                    v
  buckets_ & variable_locators_ (clear & resize)
                    +
                    |
                    v
+---------->  for 0 ~ bucket_count :
|                   +
|                   |
|                   v
|           init Bucket, set bucket_indices
|                   +
|                   |
|                   v
|    +------->  for 0 ~ replica_count :
|    |              +
|    |              |
|    |              v
|    |        init BucketReplica
|    |              +
|    |              |
|    |              v
|    +----+  bucket.replicas.push_back(std::move(replica))
|                   +
|                   |
|                   v
+---------+  buckets_.push_back(std::move(bucket))
                    +
                    |
                    v

The resulting Reducer looks roughly as follows; note that each bucket holds only one BucketReplica:

+---------------------------+      +-------------------------------------+     +--------------------+
| Reducer                   |      | Bucket                              |     | Bucket             |
|                           |      |                                     |     |                    |
|  vector<Bucket> buckets_ +-----> |  variable_indices = [4, 5, 6]       | ... |  indices = [2, 3]  |
|                           |      |                                     |     |                    |
+---------------------------+      |  vector<BucketReplica> replicas     |     |  replicas          |
                                   |                 +                   |     |      +             |
                                   +-------------------------------------+     +--------------------+
                                                     |                                |
                                                     v                                v
                                   +-------------------------------------+     +--------------------+
                                   | BucketReplica                       |     | BucketReplica      |
                                   |                                     |     |                    |
                                   |  vector<Tensor> bucket_views_in     |     |  views_in          |
                                   |  vector<Tensor> bucket_views_out    |     |  views_out         |
                                   |  Tensor contents                    |     |  contents          |
                                   |  vector<Tensor> variables           |     |  variables         |
                                   |                 +                   |     |      +             |
                                   +-------------------------------------+     +--------------------+
                                                     |                                |
                                                     v                                v
                                      [Tensor 4, Tensor 5, Tensor 6]          [Tensor 2, Tensor 3]
0x03 Static Graph
3.1 Motivation
Although PyTorch builds the graph dynamically, the user can explicitly tell DDP that the training graph is static. This can be set when:

The set of used and unused parameters does not change throughout the whole training loop; in that case it does not matter whether the user sets find_unused_parameters to true.

The way the graph is trained does not change during the whole training loop (meaning there is no control flow that depends on the iteration). When the graph is declared static, DDP supports cases that it could not support before, for example:

reentrant backward passes;
activation checkpointing multiple times;
activation checkpointing with find_unused_parameters = true;
not all output tensors being used in the loss computation;
a model parameter that lives outside the forward function;
potentially better performance when find_unused_parameters=true or when there are unused parameters, because DDP no longer searches the graph on every iteration to detect unused parameters.
3.2 Usage
_set_static_graph configures the static graph. This API should be called after the DistributedDataParallel constructor and before the training loop starts; moreover, it should be called in the same way on all ranks. For example:

ddp_model = DistributedDataParallel(model)
ddp_model._set_static_graph()
for i in range(n):
The code of _set_static_graph is:

def _set_static_graph(self):
"""
Users can explicitly let DDP know the trained graph is static,
when 1) the set of used and unused parameters will not change
during the whole training loop; in this case, it does not matter
whether users set find_unsued_parameters = true or not.
2) how the graph is trained will not change during the whole training
loop (meaning there is no control flow depending on iterations).
When graph is set to be static, DDP will support cases that can not
be supported in the past: 1) reentrant backwards
2) activation checkpointing multiple times 3)
activation checkpointing with find_unused_parameters = true.
4) not all output tensors are used in loss calculation.
5) there is model parameter that is outside of forward function.
6) potentially improve performance when find_unsued_parameters = true
or there are unused parameters, as DDP will not search graph in each
iteraton to detect unused parameters when static_graph is set to be True.

This API should be called after DistributedDataParallel construction, and
before training loops starts. Also it should be called in the same way for
all ranks. For example:
    ddp_model = DistributedDataParallel(model)
    ddp_model._set_static_graph()
    for i in range(n):
        .....
"""
self.static_graph = True
self.reducer._set_static_graph() # 调用 Reducer 进行配置
self.logger._set_static_graph()
if self.find_unused_parameters:
    warnings.warn(
        "You passed find_unused_parameters=true to DistributedDataParallel, "
        "`_set_static_graph` will detect unused parameters automatically, so "
        "you do not need to set find_unused_parameters=true, just be sure these "
        "unused parameters will not change during training loop while calling "
        "`_set_static_graph`."
    )

3.3 Reducer
The Reducer can only treat the graph as static after the first iteration; PyTorch is, after all, dynamic, so at least one iteration has to run dynamically anyway.

void Reducer::set_static_graph() {
  std::lock_guard<std::mutex> lock(mutex_);
  TORCH_CHECK(
      num_iterations_ == 0,
      "set_static_graph() should be called before training loop starts "
      "and after DistributedDataParallel is constructed.");
  static_graph_ = true;
  // when static_graph_ is set as true, always initialize_local_used_map
  // and detect the global unused parameters in the first iteration.
  initialize_local_used_map();
}
0x04 Rebuilding Buckets
4.1 Why rebuild
Because PyTorch generates the computation graph dynamically, the buckets need to be rebuilt accordingly. However, buckets are rebuilt at most once, after the first iteration, and only when the graph is static or find_unused_parameters_ is false; if find_unused_parameters_ is set (and no static graph), they are never rebuilt.

// Returns true if we should rebuild buckets, else false. We only rebuild
// buckets once after the first iteration and never rebuild them if
// find_unused_parameters_.
inline bool should_rebuild_buckets() const {
return (static_graph_ || !find_unused_parameters_) && !has_rebuilt_bucket_;
}
4.2 Preparing to rebuild
Let us first look at some preparation work that happens before rebuilding.

push_rebuilt_params simply appends one parameter to the rebuilt-parameter lists.

void Reducer::push_rebuilt_params(const VariableIndex& index) {
rebuilt_params_.push_back(
replicas_[index.replica_index][index.variable_index]);
rebuilt_param_indices_.push_back(index.variable_index);
}
Next, push_rebuilt_params_for_all_indices iterates over each replica and pushes every variable of that replica.

void Reducer::push_rebuilt_params_for_all_indices() {
std::lock_guard<std::mutex> lock(mutex_);
if (!should_rebuild_buckets() || !rebuilt_param_indices_.empty()) {
return;
}
const auto replica_count = replicas_.size();
for (size_t replica_index = 0; replica_index < replica_count;
++replica_index) {
const auto variable_count = replicas_[replica_index].size();
for (size_t variable_index = 0; variable_index < variable_count;
++variable_index) {
const auto index = VariableIndex(replica_index, variable_index);
push_rebuilt_params(index);
}
}
}
4.3 Rebuilding
Now let us look at the rebuilding mechanism.

DDP uses rebuilt_params_ and rebuilt_param_indices_ to rebuild the buckets according to the time at which tensors receive their gradients in the backward pass.

The rebuild_buckets function makes a broadcast communication call and could overlap with the next forward() call, so it can be asynchronous.

Rebuilding buckets with find_unused_parameters=true would be such an asynchronous case, because buckets could be rebuilt more than once: different subgraphs get trained and the order of parameter indices may change more frequently.
For the find_unused_parameters=false case, buckets are rebuilt only once and the performance cost is negligible. rebuild_buckets returns true if the buckets have been rebuilt. A simplified Python sketch of the size-based bucket assignment follows.
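For intuition, here is a simplified Python sketch of what compute_bucket_assignment_by_size does (illustration only; the real C++ version also groups by dtype/device, handles sparse gradients, and validates indices): parameters are grouped, in the given order, into buckets whose total byte size stays under a cap. The actual rebuild_buckets code comes after it.

import torch

def assign_buckets_by_size(tensors, size_limits):
    # size_limits mirrors [kDefaultFirstBucketBytes, bucket_bytes_cap_]: the
    # last limit keeps being reused once the list is exhausted.
    buckets, current, current_bytes = [], [], 0
    limit_index = 0
    for idx, t in enumerate(tensors):
        nbytes = t.numel() * t.element_size()
        if current and current_bytes + nbytes > size_limits[limit_index]:
            buckets.append(current)                      # close the current bucket
            current, current_bytes = [], 0
            limit_index = min(limit_index + 1, len(size_limits) - 1)
        current.append(idx)
        current_bytes += nbytes
    if current:
        buckets.append(current)
    return buckets

params = [torch.randn(256, 256) for _ in range(4)]   # 256 KiB each (fp32)
print(assign_buckets_by_size(params, [512 * 1024, 1024 * 1024]))
# [[0, 1], [2, 3]]: two tensors fit under the 512 KiB first cap, the rest under the 1 MiB cap
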
bool Reducer::rebuild_buckets() {
// Ensure reduction for previous backwards pass is finished. If user's model
// has unused parameters for example, this will raise an error recommending to
// run with find_unused_parameters=True, instead of the size mismatch
// exception below.
std::lock_guard<std::mutex> lock(mutex_);
ensure_prior_reduction_finished();
if (!should_rebuild_buckets() || rebuilt_params_.empty()) {
return false;
}

std::vector<std::vector<size_t>> rebuilt_bucket_indices;
std::vector<size_t> bucket_size_limits;
bucket_size_limits.push_back(kDefaultFirstBucketBytes);
bucket_size_limits.push_back(bucket_bytes_cap_);
rebuilt_bucket_indices = compute_bucket_assignment_by_size(
    rebuilt_params_,
    bucket_size_limits,
    expect_sparse_gradients_[0],
    rebuilt_param_indices_);

// For rebuilt bucket indices, it needs to be synced across all ranks.
// Broadcast the newly rebuilt bucket indices from rank 0 in default.
// After syncing up rebuilt bucket indices, initialize buckets for reducer.
sync_bucket_indices(rebuilt_bucket_indices);

has_rebuilt_bucket_ = true; // rebuild only once
rebuilt_params_.clear();
rebuilt_param_indices_.clear();

initialize_buckets(std::move(rebuilt_bucket_indices));
return true;
}
4.4 When rebuilding is set up
Rebuilding is set up only when the following conditions hold:

it is the first time the buckets are being rebuilt;

static_graph_ is true, or find_unused_parameters_ is false;

this backward pass needs to run allreduce.

Here we simply dump the tensors and their parameter indices into rebuilt_params_ and rebuilt_param_indices_, based on the order in which their gradients arrive. Then, at the end of finalize_backward(), the buckets are rebuilt from rebuilt_params_ and rebuilt_param_indices_, and afterwards broadcast and initialized.

Moreover, we only need to dump the tensors and parameter indices of a single replica.

Take mark_variable_ready as an example: it calls push_rebuilt_params(index) to append to the lists.

void Reducer::mark_variable_ready(VariableIndex index) {
// Rebuild bucket only if 1) it is the first time to rebuild bucket 2)
// static_graph_ is true or find_unused_parameters_ is false,
// 3) this backward pass needs to run allreduce.
// Here, we just dump tensors and their parameter indices into
// rebuilt_params_ and rebuilt_param_indices_ based on gradient arriving
// order, and then at the end of finalize_backward(), buckets will be
// rebuilt based on rebuilt_params_ and rebuilt_param_indices_, and then
// will be broadcasted and initialized. Also we only need to dump tensors
// and parameter indices of one replica.
if (should_rebuild_buckets()) {
  push_rebuilt_params(index); // append to the rebuild lists
}

const auto replica_index = index.replica_index;
const auto variable_index = index.variable_index;

if (replica_index == 0) {
checkAndRaiseMarkedTwiceError(variable_index);
perIterationReadyParams_.insert(variable_index);
}
backward_stats_[replica_index][variable_index] =
current_time_in_nanos() - cpu_timer_.backward_compute_start_time;

// Any time we mark a variable ready (be it in line due to unused parameters,
// or via an autograd hook), we require a call to the finalize function. If
// this doesn’t happen before the next iteration (or call to
// prepare_for_backwards), we know something is wrong.
require_finalize_ = true;

const auto& bucket_index = variable_locators_[variable_index];
auto& bucket = buckets_[bucket_index.bucket_index];
auto& replica = bucket.replicas[replica_index];

set_divide_factor();

if (bucket.expect_sparse_gradient) {
mark_variable_ready_sparse(index);
} else {
mark_variable_ready_dense(index);
}

// TODO(@pietern): Make this work for both CPU/CUDA tensors.
// When using CPU tensors we don’t need to do this.
// // Record event so that we can wait for all of them.
// auto& event = replica.events[bucket_index.intra_bucket_index];
// event.record();

// Check if this was the final gradient for this bucket.
if (--replica.pending == 0) {
  // Kick off reduction if all replicas for this bucket are ready.
  if (--bucket.pending == 0) {
    mark_bucket_ready(bucket_index.bucket_index);
  }
}

// Run finalizer function and kick off reduction for local_used_maps once the
// final bucket was marked ready.
if (next_bucket_ == buckets_.size()) {

if (dynamic_graph_find_unused()) {
  all_reduce_local_used_map();
}

// The autograd engine uses the default stream when running callbacks, so we
// pass in the current CUDA stream in case it is not the default.
const c10::Stream currentStream = get_current_stream();
torch::autograd::Engine::get_default_engine().queue_callback([=] {
  std::lock_guard<std::mutex> lock(this->mutex_);
  // Run callback with the current stream
  c10::OptionalStreamGuard currentStreamGuard{currentStream};
  if (should_collect_runtime_stats()) {
    record_backward_compute_end_time();
  }
  // Check that all buckets were completed and had their work kicked off.
  TORCH_INTERNAL_ASSERT(next_bucket_ == buckets_.size());
  this->finalize_backward();
});

}
}