fb-caffe-exts: Multi-threaded Inference and Memory Optimization for Facebook's Caffe

fb-caffe-exts is a collection of extensions developed at Facebook while using Caffe in (mainly) production scenarios. predictor is a simple C++ library that wraps the common pattern of running a caffe::Net in multiple threads while sharing weights. It also provides a more convenient API for the inference case. The library consists of three main parts: the basic Predictor class, which can be called from multiple threads; the thread-pooled PooledPredictor; and the optimizeMemory function, which implements memory reuse. Although Caffe itself is quite dated, the ideas in predictor apply equally to other frameworks. Example usage:

#include "caffe/predictor/Predictor.h"

// In your setup phase
predictor_ = Predictor::paths(FLAGS_prototxt_path,
                   FLAGS_weights_path);

// When calling in a worker thread
static thread_local caffe::Blob<float> input_blob;
input_blob.set_cpu_data(input_data); // avoid the copy.
const auto& output_blobs = predictor_->forward({&input_blob});
return output_blobs[FLAGS_output_layer_name];

thread_local denotes thread storage duration: each thread holds its own independent copy of the variable.
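As a quick self-contained illustration of thread storage duration (not library code):

#include <iostream>
#include <thread>

thread_local int counter = 0;  // one independent copy per thread

void bump() {
  ++counter;                     // touches only the calling thread's copy
  std::cout << counter << "\n";  // always prints 1
}

int main() {
  std::thread t1(bump), t2(bump);
  t1.join();
  t2.join();
  bump();  // the main thread's copy is still 0 before this call
}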
Of particular note is optimizeMemory, which reduces memory usage by automatically reusing intermediate activations where this is safe. For AlexNet-style models this cuts the memory needed for intermediate activations by roughly 50%; for GoogLeNet-style models, by roughly 75%.

If we draw each set of activations in the net's topological order, assign a unique color to each reused activation buffer, and make the height of each blob proportional to the size of its buffer, then the allocation for an AlexNet-like model looks like:

[Figure 1: activation-buffer allocation for an AlexNet-style model]

The corresponding allocation for GoogLeNet looks like:

[Figure 2: activation-buffer allocation for a GoogLeNet-style model]

The idea is essentially linear-scan register allocation. We:

  • compute a set of “live ranges” for each caffe::SyncedMemory (due to sharing, we can't do this at the caffe::Blob level)
  • compute a set of live intervals, and schedule each caffe::SyncedMemory into a non-overlapping live interval
  • allocate a canonical caffe::SyncedMemory buffer for each live interval
  • update the blob internal pointers to point to the canonical buffer

Depending on the model, buffer reuse can also yield significant performance improvements at inference time.

To enable it, simply pass Predictor::Optimization::MEMORY to the Predictor constructor.
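For example, constructing the basic predictor with memory optimization enabled (a minimal sketch reusing the flags from the setup snippet above; the Optimization enum and paths() signature appear in the header excerpt later in this post):

predictor_ = Predictor::paths(FLAGS_prototxt_path,
                              FLAGS_weights_path,
                              Predictor::Optimization::MEMORY);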

PooledPredictor maintains a thread pool in which each thread holds its own local instances of caffe::Net. A PooledPredictor::forward() call is added to a folly::MPMCQueue, from which the pool threads dequeue jobs for processing. forward() is non-blocking and returns a folly::Future that is fulfilled when the forward-pass job completes. PooledPredictor also supports running multiple models over the same thread pool. That is, if you load two models, each thread in the pool maintains two caffe::Net instances (one per model), and the netId parameter of forward() specifies which model to run. PinnedPooledPredictor is an abstraction over PooledPredictor for use with multiple models: it pins forward() calls to one specific model.

#include "caffe/predictor/PooledPredictor.h"

// In your setup phase
caffe::fb::PooledPredictor::Config config;
config.numThreads_ = 10;
config.optimization_ = caffe::fb::Predictor::Optimization::MEMORY;
config.protoWeightPaths_.emplace_back(FLAGS_prototxt_path,
                                      FLAGS_weights_path);
pooledPredictor_ = caffe::fb::PooledPredictor::makePredictor(config);

// When calling predictor
caffe::fb::PooledPredictor::OutputLayers output_blobs;
pooledPredictor_->forward({&input_blob}, &output_blobs)
  .then([&] {
    const auto& output_blob = output_blobs[FLAGS_output_layer_name];
    // Do something with output_blob
  });

The basic Predictor

Predictor's constructors are private; instances can only be created through the following static functions, which support three ways of loading a model.

 public:
  enum Optimization {
    NONE,
    MEMORY
  };
  static std::unique_ptr<Predictor> strings(
      const std::string& text_prototxt,
      const std::string& binary_weights,
      Optimization optimization = Optimization::NONE,
      const bool flag_disable_blas_threading = true);

  static std::unique_ptr<Predictor> hdf5_paths(
      const std::string& prototxt_path,
      const std::string& hdf5_binary_weights_path,
      Optimization optimization = Optimization::NONE,
      const bool flag_disable_blas_threading = true);

  static std::unique_ptr<Predictor> paths(
      const std::string& prototxt_path,
      const std::string& weights_path,
      Optimization optimization = Optimization::NONE,
      const bool flag_disable_blas_threading = true);

There are three forward overloads.

[Call graph: forward_1p, forward_2p, forward_3p → runForward]
  std::vector<caffe::Blob<float>*> forward(
      const std::vector<caffe::Blob<float>*>& input_blobs,
      const std::vector<std::string>& output_layer_names);

  void forward(
      const std::vector<caffe::Blob<float>*>& input_blobs,
      const std::vector<std::string>& output_layer_names,
      std::vector<caffe::Blob<float>*>* output_blobs);

  std::unordered_map<std::string, caffe::Blob<float>*> forward(
      const std::vector<caffe::Blob<float>*>& input_blobs);

  caffe::Net<float>* canonicalNet() const {
    return net_.get();
  }

  folly::ThreadLocalPtr<caffe::Net<float>>& getThreadLocalPredictors() {
    return predictors_;
  }

The two private constructors handle weights stored as binary protobuf and as HDF5, respectively. predictors_ is a folly::ThreadLocalPtr; the plural name hints that there is one net per thread. It is not initialized in the constructor; instead, each thread creates its own net on first use.

[Call graph: Predictor::Predictor → CopyTrainedLayersFrom; Predictor::Predictor_hdf5 → CopyTrainedLayersFromHDF5]
 private:
  Predictor(const caffe::NetParameter& params,
            const caffe::NetParameter& weights,
            Optimization optimization = Optimization::NONE,
            const bool flag_disable_blas_threading = true);

  Predictor(const caffe::NetParameter& params,
            const std::string& hdf5_binary_weights_path,
            Optimization optimization = Optimization::NONE,
            const bool flag_disable_blas_threading = true);

  void runForward(
    const std::vector<caffe::Blob<float>*>& input_blobs);

  // Shared for forward declaration
  std::shared_ptr<caffe::NetParameter> param_;
  std::shared_ptr<caffe::Net<float>> net_;
  const Optimization optimization_;
  folly::ThreadLocalPtr<caffe::Net<float>> predictors_;

The official documentation is not very detailed; for usage specifics, refer to the test programs.

TEST_P(PredictorTest, ConsistentAcrossThreads)

SetCaffeModeForTest sets the execution mode depending on whether a GPU is available.

  const auto& inputType = std::get<0>(GetParam());
  const auto& optimization = std::get<1>(GetParam());
  const auto& ms = std::get<2>(GetParam());
  Caffe::set_random_seed(1701);
  SetCaffeModeForTest();

The matching factory function is invoked according to the input type.

  std::unique_ptr<Predictor> pp;
  if (inputType == InputType::PATHS) {
    pp = Predictor::paths(ms.prototxt, ms.caffemodel, optimization);
  } else if (inputType == InputType::STRINGS) {
    std::string prototxt_str;
    folly::readFile(ms.prototxt.c_str(), prototxt_str);
    std::string caffemodel_str;
    folly::readFile(ms.caffemodel.c_str(), caffemodel_str);
    pp = Predictor::strings(prototxt_str, caffemodel_str, optimization);
  } else if (inputType == InputType::HDF5_PATHS) {
    pp = Predictor::hdf5_paths(ms.prototxt, ms.caffemodel, optimization);
  }

The test calls forward_2p, forward_3p, and forward_1p in turn (the author's labels for the two-, three-, and one-parameter overloads); the third parameter of forward_3p receives the outputs.
The ModelSpec instances are created in TestSpecs.cpp.

  CHECK(pp);
  auto& p = *pp;
  FillerParameter param;
  param.set_min(-1000);
  param.set_max(1000);
  UniformFiller<float> filler(param);
  Blob<float> blob;
  blob.Reshape(ms.inputDims);
  filler.Fill(&blob);
  auto output_blobs = p.forward({&blob}, {ms.outputLayer});
  // Test output blobs in-place.
  EXPECT_EQ(1, output_blobs.size());
  output_blobs.clear();
  p.forward({&blob}, {ms.outputLayer}, &output_blobs);
  EXPECT_EQ(1, output_blobs.size());
  for (const auto& kv: ms.outputValues) {
    EXPECT_NEAR(
      kv.score,
      output_blobs[0]->cpu_data()[kv.index],
      kv.epsilon);
  }

  auto output_blobs2 = p.forward({&blob});
  for (const auto& kv : ms.outputValues) {
    EXPECT_NEAR(
        kv.score,
        output_blobs2[ms.outputLayer]->cpu_data()[kv.index],
        kv.epsilon);
  }

Then it tests that running across threads does not affect the results.

  // True across threads as well.
  std::vector<std::thread> ts;
  for (auto i = 0; i < 3; ++i) {
    ts.emplace_back([&](){
      auto output_blobs = p.forward({&blob}, {ms.outputLayer});
      EXPECT_EQ(1, output_blobs.size());
      for (const auto& kv: ms.outputValues) {
        EXPECT_NEAR(
          kv.score,
          output_blobs[0]->cpu_data()[kv.index],
          kv.epsilon);
      }
    });
  }
  for (auto& t: ts) {
    t.join();
  }

Predictor::runForward

The constructor does not initialize predictors_; it is initialized on first use. A thread_local variable differs from an ordinary local variable in that every thread keeps its own copy of it. This means each calling thread first constructs its net and then runs it. ShareTrainedLayersWith shares the net_ held by the Predictor object; the caffe::Net constructor merely calls Init(param).

[Call graph: Net::Net → Net::Init]
  if (!predictors_.get()) {
    auto predictor =
        std::make_unique<caffe::Net<float>>(*param_);
    predictor->ShareTrainedLayersWith(net_.get());
    if (optimization_ == Optimization::MEMORY) {
      optimizeMemory(predictor.get());
    }
    predictors_.reset(predictor.release());
  }
  auto* predictor = predictors_.get();
  CHECK(predictor);
  CHECK_EQ(input_blobs.size(), predictor->input_blobs().size());
  for (auto i = 0; i < input_blobs.size(); ++i) {
    auto& input_blob = input_blobs[i];
    CHECK(input_blob);
    predictor->input_blobs()[i]->ReshapeLike(*input_blob);
    // mutable_cpu_data b/c the interface demands it, but logically const.
    predictor->input_blobs()[i]->set_cpu_data(input_blob->mutable_cpu_data());
  }
  predictor->Reshape();
  predictor->ForwardPrefilled();

The thread-pooled PooledPredictor

[Class diagram: PinnedPooledPredictor implements BasePooledPredictor]

All inference-related work is implemented in the PooledPredictor::forward functions. PooledPredictor::makePredictors returns a vector of unique pointers typed as BasePooledPredictor but pointing to PinnedPooledPredictor objects. Each PinnedPooledPredictor's member pointer refers to the same PooledPredictor, and PinnedPooledPredictor in turn delegates back to PooledPredictor's functions.
Assuming two net models and a job queue of length 3, the resulting objects look like this:

  PinnedPooledPredictor0 --+
                           +--> PooledPredictor --> [Job0 | Job1 | Job2]
  PinnedPooledPredictor1 --+
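A minimal sketch of that delegation (member names are assumptions; the rvalue forward() and canonicalNet() overrides are omitted here):

class PinnedPooledPredictor : public BasePooledPredictor {
 public:
  PinnedPooledPredictor(std::shared_ptr<PooledPredictor> predictor,
                        uint32_t netId)
      : predictor_(std::move(predictor)), netId_(netId) {}

  folly::Future<folly::Unit> forward(
      const std::vector<caffe::Blob<float>*>& input_blobs,
      OutputLayers* output) override {
    // Pin every call from this object to one net in the shared pool.
    return predictor_->forward(input_blobs, output, netId_);
  }

 private:
  std::shared_ptr<PooledPredictor> predictor_;  // shared with siblings
  uint32_t netId_;                              // which model to run
};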

PooledPredictorTest

[Class diagram: PooledPredictorTest derives from TestWithParam]

A test fixture class is defined.

Nested inside it, TestCallback overrides the Callback hooks to record the state of the job queue.

 protected:
  class TestCallback : public PooledPredictor::Callback {
   public:
    void onJobEnqueued(ssize_t queueSize, uint64_t enqueueDelayMs) override {
      enqueuedJobs_++;
    }

    void onJobDequeued() override {}

    void onJobProcessed(uint64_t processTimeMs) override {
      processedJobs_++;
    }

    std::atomic<uint32_t> enqueuedJobs_{0};
    std::atomic<uint32_t> processedJobs_{0};
  };

  void SetUp() override {
    inputType_ = std::get<0>(GetParam());
    modelSpec_ = std::get<1>(GetParam());
    numThreads_ = std::get<2>(GetParam());
    optimization_ = std::get<3>(GetParam());
  }

  Config getConfig(bool allowInlineScheduling = false) {
    Caffe::set_random_seed(1701);
    SetCaffeModeForTest();

    Config config;
    config.numThreads_ = numThreads_;
    config.mode_ = Caffe::mode();
    config.optimization_ = optimization_;
    config.allowInlineScheduling_ = allowInlineScheduling;

    if (inputType_ == InputType::PATHS) {
      config.protoWeightPaths_.emplace_back(
          modelSpec_.prototxt, modelSpec_.caffemodel);
    } else if (inputType_ == InputType::STRINGS) {
      config.protoWeightStrings_.resize(1);
      folly::readFile(
          modelSpec_.prototxt.c_str(), config.protoWeightStrings_[0].first);
      folly::readFile(
          modelSpec_.caffemodel.c_str(), config.protoWeightStrings_[0].second);
    } else {
      throw std::runtime_error("Unexpected input type");
    }

    return config;
  }

  InputType inputType_;
  ModelSpec modelSpec_;
  int numThreads_;
  Optimization optimization_;
  TestCallback cob_;

TEST_P(PooledPredictorTest, Correctness)

Create the predictor and the input/output blobs, then run the forward pass. Future::get() blocks until the future is fulfilled, then returns the value (moved out) or throws the stored exception.
The output values are then checked.
pp is a unique pointer to a BasePooledPredictor.

  auto pp = PooledPredictor::makePredictor(getConfig(), &cob_);

  // Create input/output
  auto input_blob = createInputBlob(modelSpec_);
  PooledPredictor::OutputLayers output;
  output[modelSpec_.outputLayer] = std::make_unique<caffe::Blob<float>>();

  // Run forward pass
  pp->forward({input_blob.get()}, &output).get();

  // Check result
  const auto& output_blob = output[modelSpec_.outputLayer];
  for (const auto& v : modelSpec_.outputValues) {
    EXPECT_NEAR(v.score, output_blob->cpu_data()[v.index], v.epsilon);
  }
  EXPECT_EQ(cob_.enqueuedJobs_, 1);
  EXPECT_EQ(cob_.processedJobs_, 1);

Then it checks that the job counters are correct before and after chained jobs run.

  const std::vector<caffe::Blob<float>*>& input_blobs = {input_blob.get()};
  auto future = pp->forward(input_blobs, &output).then([&] {
    EXPECT_EQ(cob_.enqueuedJobs_, 2);
    EXPECT_EQ(cob_.processedJobs_, 2);
    return pp->forward(input_blobs, &output);
  });
  future.get();
  EXPECT_EQ(cob_.enqueuedJobs_, 3);
  EXPECT_EQ(cob_.processedJobs_, 3);

Test that the predictor doesn't blow up when given a nonexistent layer.

  // Test that the predictor doesn't blow up when given a nonexisting layer.
  output.clear();
  output["undefined"] = std::make_unique<caffe::Blob<float>>();
  pp->forward({input_blob.get()}, &output).get();
  EXPECT_EQ(cob_.enqueuedJobs_, 4);
  EXPECT_EQ(cob_.processedJobs_, 4);

TEST_P(PooledPredictorTest, Threading)

The threading test. getConfig creates and returns a Config object.
std::vector::emplace_back appends a new element to the end of the container; the element is constructed through std::allocator_traits::construct, which typically uses placement new to construct it in place at a location provided by the container, forwarding args... to the constructor as std::forward<Args>(args)....

A lambda whose capture list is & implicitly captures the automatic variables it uses by reference.

Each thread runs the net five times.

  auto pp = PooledPredictor::makePredictor(getConfig(), &cob_);

  // Run twice as many threads as the pooled predictor uses
  std::vector<std::thread> threads;
  for (int i = 0; i < numThreads_; i++) {
    threads.emplace_back([&] {
      // Create input/output
      auto input_blob = createInputBlob(modelSpec_);
      PooledPredictor::OutputLayers output;
      output[modelSpec_.outputLayer] = std::make_unique<caffe::Blob<float>>();

      for (int j = 0; j < 5; j++) {
        // Run forward pass
        pp->forward({input_blob.get()}, &output).get();

        // Check result
        const auto& output_blob = output[modelSpec_.outputLayer];
        for (const auto& kv : modelSpec_.outputValues) {
          auto actual = output_blob->cpu_data()[kv.index];
          EXPECT_NEAR(kv.score, actual, kv.epsilon);
        }
      }
    });
  }

  for (auto& thread : threads) {
    thread.join();
  }
  EXPECT_EQ(cob_.enqueuedJobs_, 5 * numThreads_);
  EXPECT_EQ(cob_.processedJobs_, 5 * numThreads_);
}

BasePooledPredictor

BasePooledPredictor has three pure virtual functions.

OutputLayers is a map type. The two forward overloads differ in the type of their input_blobs argument.

 public:
  using OutputLayers =
      std::unordered_map<std::string, std::unique_ptr<caffe::Blob<float>>>;

  virtual ~BasePooledPredictor() {}

  virtual folly::Future<folly::Unit> forward(
      const std::vector<caffe::Blob<float>*>& input_blobs,
      OutputLayers* output) = 0;

  virtual folly::Future<folly::Unit> forward(
      std::vector<caffe::Blob<float>*>&& input_blobs,
      OutputLayers* output) = 0;

  virtual const caffe::Net<float>* canonicalNet() const = 0;

PooledPredictor

PooledPredictor does not inherit from BasePooledPredictor; instead it provides the makePredictors factory. The relationship between the two is aggregation.

Three classes are nested inside it: Callback, Config, and Job.

 public:
  using Optimization = Predictor::Optimization;
  using OutputLayers = BasePooledPredictor::OutputLayers;

Callback is nested inside the class.
onJobEnqueued is invoked once a feed-forward job is added to the queue: queueSize is the estimated queue size once the job is enqueued, and can be negative (see sizeGuess() in folly/MPMCQueue.h for details); enqueueDelayMs is the number of milliseconds spent blocked waiting to enqueue the job when the queue is full.

onJobDequeued is invoked when a job is picked up from the queue for processing.

onJobProcessed is invoked after a feed-forward job has been picked up by a thread, processed, and its Promise fulfilled; processTimeMs is the time taken by the single feed-forward job.

  class Callback {
   public:
    virtual ~Callback() {}

    /**
     * Callback invoked once a feed-forward job is added to the queue.
     * @param queueSize - Estimated size of the queue once the job is enqueue.
     *                    Can be negative - see folly/MPMCQueue.h sizeGuess()
     *                    for details.
     * @param enqueueDelayMs - Number of milliseconds blocked waiting to
     *                         enqueue the job if the queue is full.
     */
    virtual void onJobEnqueued(ssize_t queueSize, uint64_t enqueueDelayMs) = 0;

    /**
     * Callback invoked when a job is picked up from the queue for processing.
     */
    virtual void onJobDequeued() = 0;

    /**
     * Callback invoked after a feed-forward job has been picked up by a
     * thread, processed, and the promise fulfilled.
     * @param processTimeMs - Time elapsed by a single feed-forward job.
     */
    virtual void onJobProcessed(uint64_t processTimeMs) = 0;
  };

The configuration struct.
If allowInlineScheduling_ is set, jobs enqueued inline from a PooledPredictor thread (for example, jobs scheduled from a returned future's then() callback without a thread-pool executor) run immediately instead of being appended to the end of the queue.

For requests that serially chain feed-forward jobs, inline scheduling cuts the total execution time, since each subsequent feed-forward job runs immediately without waiting in the queue.
numThreads_ determines the length of the job queue.
A Config may specify several nets, from which PooledPredictor::makePredictors creates several BasePooledPredictor objects.

  struct Config {
    // Pairs of (prototxt path, weights path)
    std::vector<std::pair<std::string, std::string>> protoWeightPaths_;

    // Pairs of (prototxt string, weights string)
    std::vector<std::pair<std::string, std::string>> protoWeightStrings_;

    caffe::Caffe::Brew mode_{caffe::Caffe::CPU};
    int numThreads_{1};
    bool disableBlasThreading_{true};
    Optimization optimization_{Optimization::NONE};

    // If set, jobs enqueued inline from a PooledPredictor thread, such as
    // those scheduled from the returned future's then() callbacks without
    // a thread-pool executor, will be run immediately without being
    // added to the end of the queue.
    //
    // For requests that serially chain feed-forward jobs, inline scheduling
    // would cut down the total execution time as each subsequent feed-forward
    // job is run immediately without having to wait in the queue.
    bool allowInlineScheduling_{false};
  };

  explicit PooledPredictor(const Config& config, Callback* cob = nullptr);

  ~PooledPredictor();

For each prototxt/weights pair in the config, makePredictors creates a Predictor and returns the vector of created Predictors. All the Predictors share the same underlying PooledPredictor queue and threads.

makePredictor is the single-net equivalent of makePredictors, a helper for the common case of a PooledPredictor with only one net.

  /**
   * For each prototxt/weight in the config, creates a Predictor and returns
   * a vector of the Predictors created. All the Predictors share the same
   * underlying PooledPredictor queue and threads.
   */
  static std::vector<std::unique_ptr<BasePooledPredictor>> makePredictors(
      const Config& config,
      Callback* cob = nullptr);

  /**
   * Single-net equivalent of makePredictors(). Helper for common use-cases
   * of PooledPredictor with only one net.
   */
  static std::unique_ptr<BasePooledPredictor> makePredictor(
      const Config& config,
      Callback* cob = nullptr);

  folly::Future<folly::Unit> forward(
      const std::vector<caffe::Blob<float>*>& input_blobs,
      OutputLayers* output,
      uint32_t netId);

  folly::Future<folly::Unit> forward(
      std::vector<caffe::Blob<float>*>&& input_blobs,
      OutputLayers* output,
      uint32_t netId);

  const caffe::Net<float>* canonicalNet(uint32_t netId) const;

  size_t netCount() const {
    return nets_.size();
  }

The class template std::function is a general-purpose polymorphic function wrapper. Instances of std::function can store, copy, and invoke any Callable target: functions, lambda expressions, bind expressions, or other function objects, as well as pointers to member functions and pointers to data members.

The stored callable object is called the target of the std::function. If a std::function contains no target, it is called empty; invoking the target of an empty std::function throws a std::bad_function_call exception.

std::function satisfies CopyConstructible and CopyAssignable.
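As a self-contained illustration of those semantics (not library code):

#include <functional>
#include <iostream>

int main() {
  std::function<void()> f;  // empty: no target stored
  try {
    f();                    // invoking an empty std::function...
  } catch (const std::bad_function_call&) {
    std::cout << "bad_function_call\n";  // ...throws
  }
  f = [] { std::cout << "hello\n"; };    // now stores a lambda target
  f();                                   // prints "hello"
}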

 private:
  struct Job {
    using Function = std::function<void(caffe::Net<float>*)>;
    using Promise = folly::Promise<folly::Unit>;

   public:
    Job(Function&& f, Promise&& p, int32_t netId)
        : function_(std::move(f)),
          promise_(std::move(p)),
          netId_(netId) {}

    Function function_;
    Promise promise_;
    uint32_t netId_;
  };
  void initNet(
      std::unique_ptr<caffe::NetParameter> param,
      std::unique_ptr<caffe::NetParameter> weights);

  void startPredictorThread();

  void forward(
      const std::vector<caffe::Blob<float>*>& input_blobs,
      OutputLayers* output,
      caffe::Net<float>* predictor);

  folly::Future<folly::Unit> enqueueJob(Job::Function&& fn, uint32_t netId);

  void processJob(std::unique_ptr<Job> job);

  const caffe::Caffe::Brew mode_;
  const int numThreads_;
  Optimization optimization_{Optimization::NONE};

  // In GPU mode
  int gpuDeviceCount_;

  // Variables needed to construct a new net (happens on demand)
  std::vector<std::unique_ptr<caffe::NetParameter>> params_;
  std::vector<std::unique_ptr<caffe::NetParameter>> gpuWeights_;
  std::vector<std::unique_ptr<caffe::Net<float>>> nets_;

  // Default input blob shapes
  std::vector<std::vector<std::vector<int>>> shapes_;

  // One predictor net per model per thread
  folly::ThreadLocal<std::vector<std::unique_ptr<caffe::Net<float>>>>
      predictors_;

  folly::MPMCQueue<std::unique_ptr<Job>> queue_;
  std::vector<std::thread> threads_;
  std::mutex mutex_;
  std::atomic<int> availableThreads_{0};

  // Helps determine if the current thread is a PooledPredictor thread
  // or not. Used for checking if a job should be scheduled inline.
  folly::ThreadLocal<bool> inPredictorThread_;
  bool allowInlineScheduling_;

  Callback* cob_{nullptr};

PooledPredictor::makePredictors

[Call graph: makePredictor → makePredictors → PooledPredictor::PooledPredictor and PinnedPooledPredictor::PinnedPooledPredictor]

pooledPredictor is a shared pointer; a single PooledPredictor is used to construct the array of PinnedPooledPredictor objects.
Each PinnedPooledPredictor holds a shared pointer to the PooledPredictor; the two form an aggregation.

std::vector<std::unique_ptr<BasePooledPredictor>>
PooledPredictor::makePredictors(const Config& config, Callback* cob) {
  auto pooledPredictor = std::make_shared<PooledPredictor>(config, cob);
  std::vector<std::unique_ptr<BasePooledPredictor>> predictors;
  for (auto id = 0; id < pooledPredictor->netCount(); id++) {
    predictors.push_back(
        std::make_unique<PinnedPooledPredictor>(pooledPredictor, id));
  }
  return predictors;
}

std::unique_ptr<BasePooledPredictor> PooledPredictor::makePredictor(
    const Config& config,
    Callback* cob) {
  auto predictors = makePredictors(config, cob);
  CHECK_EQ(predictors.size(), 1)
      << "Did you mean to use PooledPredictor::makePredictors?";
  return std::move(predictors[0]);
}

PooledPredictor::PooledPredictor

[Call graph: PooledPredictor::PooledPredictor → initNet]
PooledPredictor::PooledPredictor(const Config& config, Callback* cob)
    : mode_(config.mode_),
      numThreads_(config.numThreads_),
      optimization_(config.optimization_),
      allowInlineScheduling_(config.allowInlineScheduling_),
      cob_(cob) {

disable_blas_threading boils down to a call to mkl_set_num_threads.

  if (config.disableBlasThreading_) {
    disable_blas_threading();
  }

Load the model definitions and weights.

  CHECK(config.protoWeightPaths_.empty() ^ config.protoWeightStrings_.empty())
      << "Specify exactly one of prototxt/weights paths OR strings";
  if (!config.protoWeightPaths_.empty()) {
    for (const auto& it : config.protoWeightPaths_) {
      auto param = loadNetFromFile(it.first);
      auto weights = loadWeightsFromFile(it.second);
      initNet(std::move(param), std::move(weights));
    }
  } else {
    for (const auto& it : config.protoWeightStrings_) {
      auto param = loadNetFromString(it.first);
      auto weights = loadWeightsFromString(it.second);
      initNet(std::move(param), std::move(weights));
    }
  }
  DCHECK_EQ(params_.size(), nets_.size());
  DCHECK(gpuWeights_.empty() || (gpuWeights_.size() == nets_.size()));
  DCHECK_EQ(shapes_.size(), nets_.size());

MPMCQueue is a high-performance bounded concurrent queue that supports multiple producers, multiple consumers, and optional blocking. The queue has a fixed capacity, for which all memory is allocated up front; most of the work of enqueuing and dequeuing can proceed in parallel.
MPMCQueue is linearizable. This means that if a call to write(A) returns before a call to write(B) begins, A is guaranteed to end up in the queue before B, and if a call to read(X) returns before a call to read(Y) starts, X will come out of the queue earlier than Y. It also means that if a read call returns a value, you can be sure every earlier element of the queue has already been assigned a reader (that reader may not have returned yet, but it exists).
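A minimal usage sketch of the folly::MPMCQueue operations this class relies on:

#include <folly/MPMCQueue.h>

int main() {
  folly::MPMCQueue<int> queue(2);  // bounded; storage preallocated
  queue.blockingWrite(1);          // blocks while the queue is full
  queue.blockingWrite(2);
  int v = 0;
  queue.blockingRead(v);           // blocks while the queue is empty
  return v == 1 ? 0 : 1;           // FIFO: the first write is read first
}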

  // Initialize queue
  queue_ = folly::MPMCQueue<std::unique_ptr<Job>>(numThreads_);

cudaGetDeviceCount returns the number of compute-capable devices.

#ifndef CPU_ONLY
  // Initialize GPU count
  if (mode_ == caffe::Caffe::GPU) {
    CUDA_CHECK(cudaGetDeviceCount(&gpuDeviceCount_));
  }
#endif
}

PooledPredictor::initNet

Check that the net definition and weights are non-empty, and set the phase to TEST.

  // Check that we have some layers - empty strings/files, for
  // example, are forgivingly deserialized.
  CHECK_GT(param->layer().size(), 0);
  CHECK_GT(weights->layer().size(), 0);
  param->mutable_state()->set_phase(caffe::TEST);

Create the canonical net and load its trained weights.

  // Initialize the canonical net
  auto net = std::make_unique<caffe::Net<float>>(*param);
  net->CopyTrainedLayersFrom(*weights);

Store the shapes of the input blobs; PooledPredictor::processJob later uses them to reset each predictor's input blobs.

  // Store default input blob shapes
  shapes_.emplace_back();
  for (const auto& blob : net->input_blobs()) {
    shapes_.back().push_back(blob->shape());
  }

Store the net definition and weights.

  params_.push_back(std::move(param));
  nets_.push_back(std::move(net));
  if (mode_ == caffe::Caffe::GPU) {
    // Stash the weights to be copied to the GPU nets
    gpuWeights_.push_back(std::move(weights));
  }

PooledPredictor::~PooledPredictor()

Null jobs are written into the queue so that the threads exit.

  auto n = threads_.size();

  // Send nullptr's to signal the threads they can exit
  for (int i = 0; i < n; i++) {
    queue_.blockingWrite(nullptr);
  }

  // Wait for all threads to exit
  for (int i = 0; i < n; i++) {
    threads_[i].join();
  }

PooledPredictor::forward

The first two forward overloads differ only in the type of their input_blobs parameter (lvalue vs. rvalue reference). In the header, PooledPredictor::canonicalNet is declared between the three forward functions, which is an awkward arrangement.

These public forward overloads capture the inputs and outputs into the job's closure; PooledPredictor::processJob then only has to supply the net pointer.

folly::Future<folly::Unit> PooledPredictor::forward(
    const std::vector<caffe::Blob<float>*>& input_blobs,
    OutputLayers* output,
    uint32_t netId) {
  auto fn = [=](caffe::Net<float>* predictor) {
    forward(input_blobs, output, predictor);
  };

  return enqueueJob(std::move(fn), netId);
}

folly::Future<folly::Unit> PooledPredictor::forward(
    std::vector<caffe::Blob<float>*>&& input_blobs,
    OutputLayers* output,
    uint32_t netId) {
  auto fn = [ this, in_blobs = std::move(input_blobs), output ](
      caffe::Net<float> * predictor) {
    forward(in_blobs, output, predictor);
  };

  return enqueueJob(std::move(fn), netId);
}

The parameter name predictor is uncomfortably close to the class name. This private PooledPredictor::forward takes three parameters: input_blobs, output, and predictor.

void PooledPredictor::forward(
    const std::vector<caffe::Blob<float>*>& input_blobs,
    OutputLayers* output,
    caffe::Net<float>* predictor) {
  CHECK(predictor);
  CHECK_EQ(input_blobs.size(), predictor->input_blobs().size());
  for (auto i = 0; i < input_blobs.size(); ++i) {
    auto& blob = input_blobs[i];
    CHECK(blob);
    predictor->input_blobs()[i]->ReshapeLike(*blob);
    // mutable_cpu_data b/c the interface demands it, but logically const.
    predictor->input_blobs()[i]->set_cpu_data(blob->mutable_cpu_data());
  }
  predictor->Reshape();
  predictor->ForwardPrefilled();

  if (FLAGS_log_caffe_predictor && optimization_ == Optimization::NONE) {
    auto blob_names = predictor->blob_names();
    for (auto& bname : blob_names) {
      auto& blob = predictor->blob_by_name(bname);
      LOG(INFO) << bname << " " << blob->shape_string();
    }
  }

  for (auto& it : *output) {
    auto predictor_blob = predictor->blob_by_name(it.first);
    auto target_blob = it.second.get();

    if (predictor_blob == nullptr) {
      LOG(WARNING) << "Requested output blob not found: " << it.first;
      continue;
    }

    target_blob->ReshapeLike(*predictor_blob);

    if (mode_ == caffe::Caffe::CPU) {
      caffe_copy(
          predictor_blob->count(),
          predictor_blob->cpu_data(),
          target_blob->mutable_cpu_data());
    } else {
      caffe_copy(
          predictor_blob->count(),
          predictor_blob->gpu_data(),
          target_blob->mutable_cpu_data());
    }
  }
}

PooledPredictor::enqueueJob

[Flowchart: enqueueJob → processJob (inline scheduling) or startPredictorThread → queue]

enqueueJob wraps the function and a fresh Promise into a Job. If inline scheduling is enabled and we are already on a predictor thread, the job is processed immediately:
  folly::Promise<folly::Unit> promise;
  folly::Future<folly::Unit> future = promise.getFuture();
  auto job = std::make_unique<Job>(std::move(fn), std::move(promise), netId);

  if (allowInlineScheduling_ && *inPredictorThread_) {
    // Note: This prevents tail-call optimization, so if lots of subsequent
    // jobs are being chained, disabling inline scheduling would be safer
    // to avoid running out of stack memory.
    processJob(std::move(job));
    return future;
  }

PooledPredictor::enqueueJob only attempts to start a thread; PooledPredictor::startPredictorThread() itself checks whether the thread count has reached the limit.

  // Optimistically try to add a predictor thread if none are available
  if (availableThreads_.load() == 0) {
    startPredictorThread();
  }

  CPUTimer timer;
  timer.Start();
  queue_.blockingWrite(std::move(job));
  timer.Stop();
  if (cob_) {
    cob_->onJobEnqueued(queue_.sizeGuess(), timer.MilliSeconds());
  }

  return future;

PooledPredictor::processJob

[Call graph: processJob → PooledPredictor::forward]

Fetch this thread's net for the job's netId and run it through the job's function, which calls PooledPredictor::forward.

  auto netId = job->netId_;
  auto predictor = predictors_->at(netId).get();

  caffe::CPUTimer timer;
  timer.Start();
  job->function_(predictor);
  timer.Stop();
  if (cob_) {
    cob_->onJobProcessed(timer.MilliSeconds());
  }

  job->promise_.setValue();

When done, the net is restored to its original shape.

  // Restore network to original shape.
  //
  // If the network just processed a large input it will hold
  // on to it until the next job comes along. Without precise
  // memory accounting and input-size-based dispatch, this can
  // cause the process to tip over and OOM. Shrinking the
  // network after every forward pass doesn't eliminate the
  // probability of this happening, it just reduces it.
  //
  // Originally, processing an image in scanning mode would run
  // multiple forward passes and have the last one be the
  // smallest input shape, all against the same predictor
  // instance. This effectively means resizing the network to
  // the smallest input size after processing an image. Since all
  // feed-forwards for a single request are no longer pinned to a
  // single predictor (dispatch happens for every call to the pooled
  // predictor), this implied reshape to a smaller shape no
  // longer happens.
  for (auto i = 0; i < predictor->input_blobs().size(); ++i) {
    predictor->input_blobs()[i]->Reshape(shapes_[netId][i]);
  }
  predictor->Reshape();
}

PooledPredictor::startPredictorThread()

The thread created in PooledPredictor::startPredictorThread() initializes all the nets, and once created it immediately starts reading jobs from the job queue for processing.

The mutex is taken before creating the thread.

  std::lock_guard<std::mutex> lock(mutex_);
  auto threadId = threads_.size();

  // Never exceed capacity
  if (threadId >= numThreads_) {
    return;
  }

All the nets are constructed.

  // Create thread and add to list of threads
  threads_.push_back(std::thread([&, threadId] () {
    if (mode_ == caffe::Caffe::CPU) {
      caffe::Caffe::set_mode(caffe::Caffe::CPU);
    } else {
      caffe::Caffe::set_mode(caffe::Caffe::GPU);
      caffe::Caffe::SetDevice(threadId % gpuDeviceCount_);
    }

    // Setup the predictor nets
    for (int i = 0; i < nets_.size(); i++) {
      auto predictor = std::make_unique<caffe::Net<float>>(*params_[i]);
      if (mode_ == caffe::Caffe::CPU) {
        predictor->ShareTrainedLayersWith(nets_[i].get());
      } else {
        // We tried adding weight sharing between nets on the same GPU,
        // which resulted in sporadic NaN outputs of cuDNN (R5) layers.
        // Removing weight sharing immediately solved this problem.
        predictor->CopyTrainedLayersFrom(*gpuWeights_[i]);
      }

      if (optimization_ == Optimization::MEMORY) {
        optimizeMemory(predictor.get());
      }
      predictors_->push_back(std::move(predictor));
    }

Then it loops, reading jobs from the queue and processing them.

    for (;;) {
      availableThreads_++;
      std::unique_ptr<Job> job;
      queue_.blockingRead(job);
      availableThreads_--;
      if (job == nullptr) {
        return;
      }

      if (cob_) {
        cob_->onJobDequeued();
      }

      *inPredictorThread_ = true;
      processJob(std::move(job));
      *inPredictorThread_ = false;
    }
  }));

The memory-reuse function optimizeMemory

[Flowchart: optimizeMemory → net->Reshape → net->ForwardPrefilled → analyze → assign → logAssignmentMetrics → applyAssignments]

analyze produces liveness information for every SyncedMemory in the net (the layers where it is defined and last used).
assign walks every element of analysis and tries to merge compatible SyncedMemory ranges.

  net->Reshape();
  // If the net does sharing (e.g. SplitLayer), run a forward pass to
  // get the sharing setup so that it is identified when we use the
  // SyncedMemory addresses as identifiers for def/use ranges.
  net->ForwardPrefilled();
  const auto& analysis = analyze(*net);
  const auto& assignments = assign(*net, analysis);
  logAssignmentMetrics(analysis, assignments);
  applyAssignments(net, assignments);

analyze

The liveness analysis is built up by walking the SyncedMemory pointers attached to the blobs in the network.

LiveRange records a piece of memory's live interval.
OrderedAnalysis records live-range information for a set of SyncedMemory objects. Although analysis is an array, findOrInsert ensures each SyncedMemory appears in it only once.
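The snippets below assume roughly the following shapes (a sketch; the sentinel values are hypothetical, and the exact declarations in the library's Optimize.cpp may differ):

#include <cstdint>
#include <utility>
#include <vector>
#include "caffe/syncedmem.hpp"

// Hypothetical sentinels; the real constants live next to the analysis code.
constexpr int64_t kNotDefined = -1;
constexpr int64_t kNotUsed = -1;
constexpr int64_t kAlwaysLive = 1LL << 30;

struct LiveRange {
  int64_t defined{kNotDefined};  // layer index that first produces the memory
  int64_t used{kNotUsed};        // last layer index that consumes it
};

template <typename T>
using OrderedAnalysis = std::vector<std::pair<caffe::SyncedMemory*, T>>;
using SyncedMemoryRange = std::pair<caffe::SyncedMemory*, LiveRange>;

// Returns the entry for `memory`, appending a fresh one if absent.
template <typename T>
T& findOrInsert(OrderedAnalysis<T>* analysis, caffe::SyncedMemory* memory) {
  for (auto& kv : *analysis) {
    if (kv.first == memory) {
      return kv.second;
    }
  }
  analysis->emplace_back(memory, T{});
  return analysis->back().second;
}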

  // Build up the liveness analysis by walking the SyncedMemory
  // pointers attached to the blobs in the network.
  const auto& bottoms = net.bottom_vecs();
  const auto& tops = net.top_vecs();
  OrderedAnalysis<LiveRange> analysis;

Blob indices use int64_t — does it really need to be that wide?
For each layer, findOrInsert looks up each bottom blob's SyncedMemory in analysis, appending a new entry to the back if it is not found; the caller then extends the range's last-use point.

  for (int64_t i = 0; i < bottoms.size(); ++i) {
    for (const auto* bottom : bottoms[i]) {
      auto& range = findOrInsert(&analysis, bottom->data().get());
      if (range.used == kNotUsed) {
        range.used = i;
        continue;
      }
      range.used = std::max(range.used, i);
    }
  }

For each layer's outputs, set the defining point of the underlying SyncedMemory.

  for (int64_t i = 0; i < tops.size(); ++i) {
    for (const auto* top : tops[i]) {
      auto& range = findOrInsert(&analysis, top->data().get());
      if (range.defined == kNotDefined) {
        range.defined = i;
        continue;
      }
      range.defined = std::min(range.defined, i);
    }
  }

The network inputs are marked live for the entire duration.

  for (const auto* input : net.input_blobs()) {
    findOrInsert(&analysis, input->data().get()).defined = -kAlwaysLive;
    findOrInsert(&analysis, input->data().get()).used = kAlwaysLive;
  }
  return analysis;

assign

Computes an assignment of blobs onto non-overlapping blobs.

blobNames builds a map from each SyncedMemory among the net's blobs to its name.
analysis, which records the live ranges of all SyncedMemory in the net, is sorted by last-use point.
Each blob's defining and last-using layers are logged.

  const auto& names = blobNames(net);
  std::stable_sort(analysis.begin(),
                   analysis.end(),
                   [](const SyncedMemoryRange& a, const SyncedMemoryRange& b) {
                     return a.second.used < b.second.used;
                   });
  for (const auto& kv : analysis) {
    LOG(INFO) << names.at(kv.first)
              << folly::format(": {}->{}", kv.second.defined, kv.second.used);
  }

isCompatible decides whether a candidate range is compatible with an existing assignment.
Each element of analysis is appended to the first compatible assignment, or starts a new assignment if none is compatible.
An Assignment essentially describes the lifetime of one reused SyncedMemory buffer.
An Assignment is a vector of SyncedMemoryRange, hence the assignment.back() calls.

  Assignments assignments;
  for (const auto& candidate : analysis) {
    auto assigned = false;
    for (auto& assignment : assignments) {
      if (isCompatible(candidate, assignment)) {
        assignment.push_back(candidate);
        assigned = true;
        break;
      }
    }
    if (assigned) {
      continue;
    }
    assignments.push_back({candidate});
  }
  return assignments;
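A tiny worked example with hypothetical ranges, already sorted by last use — A: 0→2, B: 1→3, C: 3→5: A opens assignment 1; B is defined at layer 1, which is not after A's last use at layer 2, so B opens assignment 2; C is defined at layer 3, strictly after A's last use, so C joins assignment 1 and will share A's buffer.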

isCompatible

Is the candidate range compatible with this assignment?
candidate records one SyncedMemory's live range.
Timing is only considered once both the candidate and the assignment's last member have a known last-use point.

  if (candidate.second.used == kNotUsed ||
      assignment.back().second.used == kNotUsed) {
    return false;
  }

The candidate SyncedMemory must be larger than the configured sharing threshold.

  if (candidate.first->size() <= kMinimumCountForSharing) {
    return false;
  }

An Assignment is a vector of SyncedMemoryRange; the candidate must be defined strictly after the assignment's last use.

  CHECK_GE(assignment.size(), 1);
  return candidate.second.defined > assignment.back().second.used;

logAssignmentMetrics

beforeTotalSize records the total SyncedMemory size before reuse.

  size_t beforeTotalSize = 0;
  for (const auto& kv : analysis) {
    beforeTotalSize += kv.first->size();
  }

afterTotalSize sums the largest buffer in each Assignment, as if buffers were allocated step by step and never freed.

  size_t afterTotalSize = 0;
  for (const auto& assignment : assignments) {
    size_t assignmentMaxSize = 0;
    for (const auto& kv : assignment) {
      assignmentMaxSize = std::max(assignmentMaxSize, kv.first->size());
    }
    LOG(INFO) << "Assignment max size: " << assignmentMaxSize;
    afterTotalSize += assignmentMaxSize;
  }
  LOG(INFO)
      << folly::format("Before: {}, After: {}, Compression: {:.2f}%",
                       beforeTotalSize,
                       afterTotalSize,
                       100.0 * (1.0 - afterTotalSize * 1.0 / beforeTotalSize));
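For instance, with made-up numbers: if beforeTotalSize is 400 MB and the per-assignment maxima sum to 100 MB, the log reports Compression: 75.00%, i.e. 100 × (1 − 100/400), in line with the GoogLeNet-style savings quoted at the top of this post.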

applyAssignments

blobNames returns a dictionary of type Analysis, mapping each SyncedMemory among the net's blobs to the blob's name.
A shared reused blob is created for each assignment, following assignments.

  const auto& names = blobNames(*net);
  Analysis<boost::shared_ptr<Blob<float>>> reusedBlobs;
  for (const auto& assignment : assignments) {
    auto reused = boost::make_shared<Blob<float>>(1, 1, 1, 1);
    // Instantiate so blob->data() is valid.
    reused->cpu_data();
    LOG(INFO) << "Assignment: ";
    for (const auto& kv : assignment) {
      LOG(INFO) << "Blob: " << names.at(kv.first);
      reusedBlobs[kv.first] = reused;
    }
  }

std::unordered_map::at returns a reference to the mapped value of the element with key k.
The reused SyncedMemory is then installed into the net's input and output blobs, each layer's bottom and top vectors, and the intermediate blobs.

  using BV = std::vector<Blob<float>*>;
  using SBV = std::vector<boost::shared_ptr<Blob<float>>>;
  for (auto& blob : const_cast<BV&>(net->input_blobs())) {
    reusedBlobs.at(blob->data().get())->ReshapeLike(*blob);
    blob = reusedBlobs.at(blob->data().get()).get();
  }
  for (auto& blob : const_cast<BV&>(net->output_blobs())) {
    blob = reusedBlobs.at(blob->data().get()).get();
  }
  for (auto& vec : net->top_vecs()) {
    for (auto& blob : const_cast<BV&>(vec)) {
      blob = reusedBlobs.at(blob->data().get()).get();
    }
  }
  for (auto& vec : net->bottom_vecs()) {
    for (auto& blob : const_cast<BV&>(vec)) {
      blob = reusedBlobs.at(blob->data().get()).get();
    }
  }
  for (auto& blob : const_cast<SBV&>(net->blobs())) {
    auto reusedBlob = reusedBlobs.at(blob->data().get());
    blob = reusedBlob;
  }
}
