fb-caffe-exts is a collection of extensions Facebook developed while using Caffe in (mostly) production scenarios. predictor is a simple C++ library that wraps the common pattern of running a caffe::Net on multiple threads while sharing the weights. It also exposes a more convenient API for the inference case. The library consists of three main pieces: the basic Predictor class, which can be called from multiple threads; PooledPredictor, a thread-pool variant; and the optimizeMemory function, which implements memory reuse. Caffe itself is quite dated by now, but the same predictor ideas carry over to other frameworks. A usage example follows:
#include "caffe/predictor/Predictor.h"
// In your setup phase
predictor_ = Predictor::paths(FLAGS_prototxt_path,
FLAGS_weights_path);
// When calling in a worker thread
static thread_local caffe::Blob<float> input_blob;
input_blob.set_cpu_data(input_data); // avoid the copy.
const auto& output_blobs = predictor_->forward({&input_blob});
return output_blobs[FLAGS_output_layer_name];
thread_local denotes thread storage duration.
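As a quick generic illustration (not from the library), each thread sees its own copy of a thread_local object:
static thread_local int counter = 0;
void worker() {
  ++counter;  // touches only the calling thread's copy; no locking needed
}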
optimizeMemory deserves special mention: it optimizes memory usage by automatically reusing intermediate activations when it is safe to do so. This cuts the memory needed for intermediate activations by roughly 50% for AlexNet-style models and roughly 75% for GoogLeNet-style models.
If each set of activations is plotted in the network's topological order, with a unique color for each reused activation buffer and each blob's height proportional to the buffer's size, then in an AlexNet-style model the allocation looks like the first figure, and the corresponding allocation for GoogLeNet looks like the second (both figures omitted here).
The idea is essentially linear scan register allocation: compute a live range for each caffe::SyncedMemory (sharing, e.g. via SplitLayer, happens below the caffe::Blob level), schedule the non-overlapping ranges onto shared buffers, and point the blobs at the canonical buffer of their interval. The analyze/assign/applyAssignments walkthrough later in this article follows exactly these steps.
Depending on the model, buffer reuse can also yield some non-trivial performance improvements at inference time.
To enable it, simply pass Predictor::Optimization::MEMORY to the Predictor constructor.
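For example, reusing the flags from the snippet above:
predictor_ = Predictor::paths(FLAGS_prototxt_path,
                              FLAGS_weights_path,
                              Predictor::Optimization::MEMORY);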
PooledPredictor maintains a thread pool with thread-local instances of caffe::Net. PooledPredictor::forward() calls are added to a folly::MPMCQueue, which the pool threads dequeue and process. forward() is non-blocking and returns a folly::Future that is fulfilled when the forward-pass job completes. PooledPredictor also supports running multiple models over the same thread pool: if you load two models, each thread in the pool maintains two caffe::Net instances (one per model), and the netId argument of forward() specifies which model to run. PinnedPooledPredictor is an abstraction over PooledPredictor for use with multiple models, pinning forward() calls to a specific model.
#include "caffe/predictor/PooledPredictor.h"
// In your setup phase
caffe::fb::PooledPredictor::Config config;
config.numThreads_ = 10;
config.optimization_ = caffe::fb::Predictor::Optimization::MEMORY;
config.protoWeightPaths_.emplace_back(FLAGS_prototxt_path,
FLAGS_weights_path);
pooledPredictor_ = caffe::fb::PooledPredictor::makePredictor(config);
// When calling predictor
caffe::fb::PooledPredictor::OutputLayers output_blobs;
pooledPredictor_->forward({&input_blob}, &output_blobs)
    .then([&] {
      const auto& output_blob = output_blobs[FLAGS_output_layer_name];
      // Do something with output_blob
    });
Predictor's constructor is private; instances are created through the following static factory functions. Three ways of loading a model are supported.
public:
enum Optimization {
NONE,
MEMORY
};
static std::unique_ptr<Predictor> strings(
const std::string& text_prototxt,
const std::string& binary_weights,
Optimization optimization = Optimization::NONE,
const bool flag_disable_blas_threading = true);
static std::unique_ptr<Predictor> hdf5_paths(
const std::string& prototxt_path,
const std::string& hdf5_binary_weights_path,
Optimization optimization = Optimization::NONE,
const bool flag_disable_blas_threading = true);
static std::unique_ptr<Predictor> paths(
const std::string& prototxt_path,
const std::string& weights_path,
Optimization optimization = Optimization::NONE,
const bool flag_disable_blas_threading = true);
There are three forward overloads.
std::vector<caffe::Blob<float>*> forward(
const std::vector<caffe::Blob<float>*>& input_blobs,
const std::vector<std::string>& output_layer_names);
void forward(
const std::vector<caffe::Blob<float>*>& input_blobs,
const std::vector<std::string>& output_layer_names,
std::vector<caffe::Blob<float>*>* output_blobs);
std::unordered_map<std::string, caffe::Blob<float>*> forward(
const std::vector<caffe::Blob<float>*>& input_blobs);
caffe::Net<float>* canonicalNet() const {
return net_.get();
}
folly::ThreadLocalPtr<caffe::Net<float>>& getThreadLocalPredictors() {
return predictors_;
}
Two private constructors handle weights stored as binary protobuf and as HDF5, respectively. predictors_ is a thread-local member (the plural name hints at one net per thread); it is not initialized in the constructor: each thread creates its own net on demand.
private:
Predictor(const caffe::NetParameter& params,
const caffe::NetParameter& weights,
Optimization optimization = Optimization::NONE,
const bool flag_disable_blas_threading = true);
Predictor(const caffe::NetParameter& params,
const std::string& hdf5_binary_weights_path,
Optimization optimization = Optimization::NONE,
const bool flag_disable_blas_threading = true);
void runForward(
const std::vector<caffe::Blob<float>*>& input_blobs);
// Shared for forward declaration
std::shared_ptr<caffe::NetParameter> param_;
std::shared_ptr<caffe::Net<float>> net_;
const Optimization optimization_;
folly::ThreadLocalPtr<caffe::Net<float>> predictors_;
The documentation is rather thin; for usage details, the test programs are the best reference. SetCaffeModeForTest sets the Caffe mode depending on whether a GPU is available.
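A plausible sketch of such a helper (hedged: the real one lives in the library's test utilities; CPU_ONLY is Caffe's usual build flag):
void SetCaffeModeForTest() {
#ifdef CPU_ONLY
  caffe::Caffe::set_mode(caffe::Caffe::CPU);
#else
  int count = 0;
  if (cudaGetDeviceCount(&count) == cudaSuccess && count > 0) {
    caffe::Caffe::set_mode(caffe::Caffe::GPU);  // a usable GPU is present
  } else {
    caffe::Caffe::set_mode(caffe::Caffe::CPU);
  }
#endif
}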
const auto& inputType = std::get<0>(GetParam());
const auto& optimization = std::get<1>(GetParam());
const auto& ms = std::get<2>(GetParam());
Caffe::set_random_seed(1701);
SetCaffeModeForTest();
The appropriate factory is invoked according to the input type.
std::unique_ptr<Predictor> pp;
if (inputType == InputType::PATHS) {
pp = Predictor::paths(ms.prototxt, ms.caffemodel, optimization);
} else if (inputType == InputType::STRINGS) {
std::string prototxt_str;
folly::readFile(ms.prototxt.c_str(), prototxt_str);
std::string caffemodel_str;
folly::readFile(ms.caffemodel.c_str(), caffemodel_str);
pp = Predictor::strings(prototxt_str, caffemodel_str, optimization);
} else if (inputType == InputType::HDF5_PATHS) {
pp = Predictor::hdf5_paths(ms.prototxt, ms.caffemodel, optimization);
}
The test then exercises the two-argument, three-argument, and one-argument forward overloads in turn; in the three-argument overload the third parameter receives the outputs. The ModelSpec instances are created in TestSpecs.cpp.
CHECK(pp);
auto& p = *pp;
FillerParameter param;
param.set_min(-1000);
param.set_max(1000);
UniformFiller<float> filler(param);
Blob<float> blob;
blob.Reshape(ms.inputDims);
filler.Fill(&blob);
auto output_blobs = p.forward({&blob}, {ms.outputLayer});
// Test output blobs in-place.
EXPECT_EQ(1, output_blobs.size());
output_blobs.clear();
p.forward({&blob}, {ms.outputLayer}, &output_blobs);
EXPECT_EQ(1, output_blobs.size());
for (const auto& kv: ms.outputValues) {
EXPECT_NEAR(
kv.score,
output_blobs[0]->cpu_data()[kv.index],
kv.epsilon);
}
auto output_blobs2 = p.forward({&blob});
for (const auto& kv : ms.outputValues) {
EXPECT_NEAR(
kv.score,
output_blobs2[ms.outputLayer]->cpu_data()[kv.index],
kv.epsilon);
}
Next, test that the results are unaffected when running across threads.
// True across threads as well.
std::vector<std::thread> ts;
for (auto i = 0; i < 3; ++i) {
ts.emplace_back([&](){
auto output_blobs = p.forward({&blob}, {ms.outputLayer});
EXPECT_EQ(1, output_blobs.size());
for (const auto& kv: ms.outputValues) {
EXPECT_NEAR(
kv.score,
output_blobs[0]->cpu_data()[kv.index],
kv.epsilon);
}
});
}
for (auto& t: ts) {
t.join();
}
The constructor does not initialize predictors_; it is initialized lazily on first use. A thread_local variable is unlike an ordinary local variable: every thread keeps its own copy of it, so each calling thread first constructs a net and only then runs it. ShareTrainedLayersWith shares the weights held by the object's net_. The caffe::Net constructor merely calls Init(param).
if (!predictors_.get()) {
auto predictor =
std::make_unique<caffe::Net<float>>(*param_);
predictor->ShareTrainedLayersWith(net_.get());
if (optimization_ == Optimization::MEMORY) {
optimizeMemory(predictor.get());
}
predictors_.reset(predictor.release());
}
auto* predictor = predictors_.get();
CHECK(predictor);
CHECK_EQ(input_blobs.size(), predictor->input_blobs().size());
for (auto i = 0; i < input_blobs.size(); ++i) {
auto& input_blob = input_blobs[i];
CHECK(input_blob);
predictor->input_blobs()[i]->ReshapeLike(*input_blob);
// mutable_cpu_data b/c the interface demands it, but logically const.
predictor->input_blobs()[i]->set_cpu_data(input_blob->mutable_cpu_data());
}
predictor->Reshape();
predictor->ForwardPrefilled();
All inference-related work is implemented in PooledPredictor::forward. PooledPredictor::makePredictors returns a vector of unique pointers of type BasePooledPredictor, each pointing to a PinnedPooledPredictor object. Every PinnedPooledPredictor holds a member pointer to the same underlying PooledPredictor and delegates its calls back to it.
Assuming two models and a job queue of capacity 3, the resulting object graph is shown in the figure (omitted here); a usage sketch follows.
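In place of the figure, a hedged sketch of such a two-model setup (the model paths are placeholders):
caffe::fb::PooledPredictor::Config config;
config.numThreads_ = 3;  // also the queue capacity, as shown later
config.protoWeightPaths_.emplace_back("model_a.prototxt", "model_a.caffemodel");
config.protoWeightPaths_.emplace_back("model_b.prototxt", "model_b.caffemodel");
// One pinned predictor per model; all share one MPMCQueue and one thread pool.
auto predictors = caffe::fb::PooledPredictor::makePredictors(config);
// predictors[0] pins forward() calls to model A, predictors[1] to model B.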
The tests define a fixture class. Nested within it, TestCallback overrides the Callback hooks to record the state of the job queue.
protected:
class TestCallback : public PooledPredictor::Callback {
public:
void onJobEnqueued(ssize_t queueSize, uint64_t enqueueDelayMs) override {
enqueuedJobs_++;
}
void onJobDequeued() override {
}
void onJobProcessed(uint64_t processTimeMs) override {
processedJobs_++;
}
std::atomic<uint32_t> enqueuedJobs_{0};
std::atomic<uint32_t> processedJobs_{0};
};
void SetUp() override {
inputType_ = std::get<0>(GetParam());
modelSpec_ = std::get<1>(GetParam());
numThreads_ = std::get<2>(GetParam());
optimization_ = std::get<3>(GetParam());
}
Config getConfig(bool allowInlineScheduling = false) {
Caffe::set_random_seed(1701);
SetCaffeModeForTest();
Config config;
config.numThreads_ = numThreads_;
config.mode_ = Caffe::mode();
config.optimization_ = optimization_;
config.allowInlineScheduling_ = allowInlineScheduling;
if (inputType_ == InputType::PATHS) {
config.protoWeightPaths_.emplace_back(
modelSpec_.prototxt, modelSpec_.caffemodel);
} else if (inputType_ == InputType::STRINGS) {
config.protoWeightStrings_.resize(1);
folly::readFile(
modelSpec_.prototxt.c_str(), config.protoWeightStrings_[0].first);
folly::readFile(
modelSpec_.caffemodel.c_str(), config.protoWeightStrings_[0].second);
} else {
throw std::runtime_error("Unexpected input type");
}
return config;
}
InputType inputType_;
ModelSpec modelSpec_;
int numThreads_;
Optimization optimization_;
TestCallback cob_;
Create the predictor and the input/output blobs, then run a forward pass. Future::get() blocks until the future is fulfilled, then returns the (moved-out) value or throws the stored exception. Check the values of the output; pp is a unique pointer to a BasePooledPredictor.
auto pp = PooledPredictor::makePredictor(getConfig(), &cob_);
// Create input/output
auto input_blob = createInputBlob(modelSpec_);
PooledPredictor::OutputLayers output;
output[modelSpec_.outputLayer] = std::make_unique<caffe::Blob<float>>();
// Run forward pass
pp->forward({input_blob.get()}, &output).get();
// Check result
const auto& output_blob = output[modelSpec_.outputLayer];
for (const auto& v : modelSpec_.outputValues) {
EXPECT_NEAR(v.score, output_blob->cpu_data()[v.index], v.epsilon);
}
EXPECT_EQ(cob_.enqueuedJobs_, 1);
EXPECT_EQ(cob_.processedJobs_, 1);
Next, check that the job counters are correct around a chained forward call.
const std::vector<caffe::Blob<float>*>& input_blobs = {input_blob.get()};
auto future = pp->forward(input_blobs, &output).then([&] {
EXPECT_EQ(cob_.enqueuedJobs_, 2);
EXPECT_EQ(cob_.processedJobs_, 2);
return pp->forward(input_blobs, &output);
});
future.get();
EXPECT_EQ(cob_.enqueuedJobs_, 3);
EXPECT_EQ(cob_.processedJobs_, 3);
Then verify that the predictor doesn't blow up when given a nonexistent layer.
// Test that the predictor doesn't blow up when given a nonexisting layer.
output.clear();
output["undefined"] = std::make_unique<caffe::Blob<float>>();
pp->forward({input_blob.get()}, &output).get();
EXPECT_EQ(cob_.enqueuedJobs_, 4);
EXPECT_EQ(cob_.processedJobs_, 4);
Next, the threading test. getConfig creates and returns a Config object. Two bits of C++ background used here: std::vector::emplace_back appends a new element to the end of the container, constructing it in place via std::allocator_traits::construct (typically placement new at storage the container provides), with args... forwarded through std::forward; and a lambda's & capture-default implicitly captures the automatic variables it uses by reference.
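A toy illustration of both points (generic C++, not from the test code):
int hits = 0;                      // captured by reference via [&]
std::vector<std::thread> ts;
ts.emplace_back([&] { ++hits; });  // std::thread constructed in place
ts.back().join();                  // after joining, hits == 1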
In the test proper, each thread runs the net five times.
auto pp = PooledPredictor::makePredictor(getConfig(), &cob_);
// Run twice as many threads as the pooled predictor uses
std::vector<std::thread> threads;
for (int i = 0; i < numThreads_; i++) {
threads.emplace_back([&] {
// Create input/output
auto input_blob = createInputBlob(modelSpec_);
PooledPredictor::OutputLayers output;
output[modelSpec_.outputLayer] = std::make_unique<caffe::Blob<float>>();
for (int j = 0; j < 5; j++) {
// Run forward pass
pp->forward({input_blob.get()}, &output).get();
// Check result
const auto& output_blob = output[modelSpec_.outputLayer];
for (const auto& kv : modelSpec_.outputValues) {
auto actual = output_blob->cpu_data()[kv.index];
EXPECT_NEAR(kv.score, actual, kv.epsilon);
}
}
});
}
for (auto& thread : threads) {
thread.join();
}
EXPECT_EQ(cob_.enqueuedJobs_, 5 * numThreads_);
EXPECT_EQ(cob_.processedJobs_, 5 * numThreads_);
}
BasePooledPredictor has three pure virtual functions. OutputLayers is a map type. The two forward overloads differ in the type of their input parameter (lvalue versus rvalue vector).
public:
using OutputLayers =
std::unordered_map<std::string, std::unique_ptr<caffe::Blob<float>>>;
virtual ~BasePooledPredictor() {
}
virtual folly::Future<folly::Unit> forward(
const std::vector<caffe::Blob<float>*>& input_blobs,
OutputLayers* output) = 0;
virtual folly::Future<folly::Unit> forward(
std::vector<caffe::Blob<float>*>&& input_blobs,
OutputLayers* output) = 0;
virtual const caffe::Net<float>* canonicalNet() const = 0;
PooledPredictor does not inherit from BasePooledPredictor; instead it exposes the makePredictors factory, so the two appear to be in an aggregation relationship. Three types are nested inside it: Callback, Config, and Job.
public:
using Optimization = Predictor::Optimization;
using OutputLayers = BasePooledPredictor::OutputLayers;
Callback is a nested class. onJobEnqueued is invoked once a feed-forward job has been added to the queue: queueSize is the estimated queue size after the job is enqueued and can be negative (see sizeGuess() for details), and enqueueDelayMs is the number of milliseconds spent blocked waiting to enqueue when the queue is full. onJobDequeued is invoked when a job is picked up from the queue for processing. onJobProcessed is invoked after a thread has picked up a feed-forward job, processed it, and fulfilled the promise; processTimeMs is the time taken by that single feed-forward job.
class Callback {
public:
virtual ~Callback() {
}
/**
* Callback invoked once a feed-forward job is added to the queue.
* @param queueSize - Estimated size of the queue once the job is enqueue.
* Can be negative - see folly/MPMCQueue.h sizeGuess()
* for details.
* @param enqueueDelayMs - Number of milliseconds blocked waiting to
* enqueue the job if the queue is full.
*/
virtual void onJobEnqueued(ssize_t queueSize, uint64_t enqueueDelayMs) = 0;
/**
* Callback invoked when a job is picked up from the queue for processing.
*/
virtual void onJobDequeued() = 0;
/**
* Callback invoked after a feed-forward job has been picked up by a
* thread, processed, and the promise fulfilled.
* @param processTimeMs - Time elapsed by a single feed-forward job.
*/
virtual void onJobProcessed(uint64_t processTimeMs) = 0;
};
Config is the parameter struct. If allowInlineScheduling_ is set, jobs enqueued inline from a PooledPredictor thread (such as jobs scheduled from a returned future's then() callback without a thread-pool executor) run immediately instead of being appended to the end of the queue. For requests that serially chain feed-forward jobs, inline scheduling cuts total execution time, since each subsequent feed-forward job runs at once rather than waiting in the queue. numThreads_ sets the pool size and, as the constructor shows later, also the capacity of the job queue. Several nets can be configured at once, in which case PooledPredictor::makePredictors creates one BasePooledPredictor per net.
struct Config {
// Pairs of (prototxt path, weights path)
std::vector<std::pair<std::string, std::string>> protoWeightPaths_;
// Pairs of (prototxt string, weights string)
std::vector<std::pair<std::string, std::string>> protoWeightStrings_;
caffe::Caffe::Brew mode_{caffe::Caffe::CPU};
int numThreads_{1};
bool disableBlasThreading_{true};
Optimization optimization_{Optimization::NONE};
// If set, jobs enqueued inline from a PooledPredictor thread, such as
// those scheduled from the returned future's then() callbacks without
// a thread-pool executor, will be run immediately without being
// added to the end of the queue.
//
// For requests that serially chain feed-forward jobs, inline scheduling
// would cut down the total execution time as each subsequent feed-forward
// job is run immediately without having to wait in the queue.
bool allowInlineScheduling_{false};
};
explicit PooledPredictor(const Config& config, Callback* cob = nullptr);
~PooledPredictor();
For each prototxt/weights pair in the config, makePredictors creates a Predictor and returns the vector of Predictors created; all of them share the same underlying PooledPredictor queue and threads. makePredictor is the single-net equivalent, a helper for the common case of a PooledPredictor with only one net.
/**
* For each prototxt/weight in the config, creates a Predictor and returns
* a vector of the Predictors created. All the Predictors share the same
* underlying PooledPredictor queue and threads.
*/
static std::vector<std::unique_ptr<BasePooledPredictor>> makePredictors(
const Config& config,
Callback* cob = nullptr);
/**
* Single-net equivalent of makePredictors(). Helper for common use-cases
* of PooledPredictor with only one net.
*/
static std::unique_ptr<BasePooledPredictor> makePredictor(
const Config& config,
Callback* cob = nullptr);
folly::Future<folly::Unit> forward(
const std::vector<caffe::Blob<float>*>& input_blobs,
OutputLayers* output,
uint32_t netId);
folly::Future<folly::Unit> forward(
std::vector<caffe::Blob<float>*>&& input_blobs,
OutputLayers* output,
uint32_t netId);
const caffe::Net<float>* canonicalNet(uint32_t netId) const;
size_t netCount() const {
return nets_.size();
}
The class template std::function is a general-purpose polymorphic function wrapper. Instances of std::function can store, copy, and invoke any Callable target: functions, lambda expressions, bind expressions, or other function objects, as well as pointers to member functions and pointers to data members. The stored callable is called the target of the std::function; a std::function holding no target is empty, and invoking the target of an empty std::function throws std::bad_function_call. std::function satisfies CopyConstructible and CopyAssignable.
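A two-line illustration of these semantics, reusing the signature of Job::Function defined below:
std::function<void(caffe::Net<float>*)> fn;  // empty: invoking it would throw std::bad_function_call
fn = [](caffe::Net<float>* net) { net->ForwardPrefilled(); };  // now holds a lambda target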
private:
struct Job {
using Function = std::function<void(caffe::Net<float>*)>;
using Promise = folly::Promise<folly::Unit>;
public:
Job(Function&& f, Promise&& p, int32_t netId)
: function_(std::move(f)),
promise_(std::move(p)),
netId_(netId) {
}
Function function_;
Promise promise_;
uint32_t netId_;
};
void initNet(
std::unique_ptr<caffe::NetParameter> param,
std::unique_ptr<caffe::NetParameter> weights);
void startPredictorThread();
void forward(
const std::vector<caffe::Blob<float>*>& input_blobs,
OutputLayers* output,
caffe::Net<float>* predictor);
folly::Future<folly::Unit> enqueueJob(Job::Function&& fn, uint32_t netId);
void processJob(std::unique_ptr<Job> job);
const caffe::Caffe::Brew mode_;
const int numThreads_;
Optimization optimization_{Optimization::NONE};
// In GPU mode
int gpuDeviceCount_;
// Variables needed to construct a new net (happens on demand)
std::vector<std::unique_ptr<caffe::NetParameter>> params_;
std::vector<std::unique_ptr<caffe::NetParameter>> gpuWeights_;
std::vector<std::unique_ptr<caffe::Net<float>>> nets_;
// Default input blob shapes
std::vector<std::vector<std::vector<int>>> shapes_;
// One predictor net per model per thread
folly::ThreadLocal<std::vector<std::unique_ptr<caffe::Net<float>>>>
predictors_;
folly::MPMCQueue<std::unique_ptr<Job>> queue_;
std::vector<std::thread> threads_;
std::mutex mutex_;
std::atomic<int> availableThreads_{0};
// Helps determine if the current thread is a PooledPredictor thread
// or not. Used for checking if a job should be scheduled inline.
folly::ThreadLocal<bool> inPredictorThread_;
bool allowInlineScheduling_;
Callback* cob_{nullptr};
pooledPredictor is a shared pointer: a single PooledPredictor backs the whole array of PinnedPooledPredictor objects, each of which holds a shared pointer to that PooledPredictor. The two are in an aggregation relationship.
std::vector<std::unique_ptr<BasePooledPredictor>>
PooledPredictor::makePredictors(const Config& config, Callback* cob) {
auto pooledPredictor = std::make_shared<PooledPredictor>(config, cob);
std::vector<std::unique_ptr<BasePooledPredictor>> predictors;
for (auto id = 0; id < pooledPredictor->netCount(); id++) {
predictors.push_back(
std::make_unique<PinnedPooledPredictor>(pooledPredictor, id));
}
return predictors;
}
std::unique_ptr<BasePooledPredictor> PooledPredictor::makePredictor(
const Config& config,
Callback* cob) {
auto predictors = makePredictors(config, cob);
CHECK_EQ(predictors.size(), 1)
<< "Did you mean to use PooledPredictor::makePredictors?";
return std::move(predictors[0]);
}
PooledPredictor::PooledPredictor(const Config& config, Callback* cob)
: mode_(config.mode_),
numThreads_(config.numThreads_),
optimization_(config.optimization_),
allowInlineScheduling_(config.allowInlineScheduling_),
cob_(cob) {
disable_blas_threading boils down to a call to mkl_set_num_threads.
if (config.disableBlasThreading_) {
disable_blas_threading();
}
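A minimal sketch of what that amounts to, assuming the MKL backend mentioned above:
#include <mkl.h>
void disable_blas_threading() {
  // The predictor pool already saturates the cores; BLAS-internal
  // threading would only oversubscribe them.
  mkl_set_num_threads(1);
}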
Load the model descriptions and weights.
CHECK(config.protoWeightPaths_.empty() ^ config.protoWeightStrings_.empty())
<< "Specify exactly one of prototxt/weights paths OR strings";
if (!config.protoWeightPaths_.empty()) {
for (const auto& it : config.protoWeightPaths_) {
auto param = loadNetFromFile(it.first);
auto weights = loadWeightsFromFile(it.second);
initNet(std::move(param), std::move(weights));
}
} else {
for (const auto& it : config.protoWeightStrings_) {
auto param = loadNetFromString(it.first);
auto weights = loadWeightsFromString(it.second);
initNet(std::move(param), std::move(weights));
}
}
DCHECK_EQ(params_.size(), nets_.size());
DCHECK(gpuWeights_.empty() || (gpuWeights_.size() == nets_.size()));
DCHECK_EQ(shapes_.size(), nets_.size());
MPMCQueue is a high-performance bounded concurrent queue that supports multiple producers, multiple consumers, and optional blocking. The queue has a fixed capacity, for which all memory is allocated up front; most of the work of enqueuing and dequeuing can proceed in parallel. MPMCQueue is linearizable: if a call to write(A) returns before a call to write(B) begins, A is guaranteed to end up in the queue before B, and if a call to read(X) returns before a call to read(Y) starts, X will come out of the queue earlier than Y. It also means that if a read call returns a value, every earlier element of the queue has already been assigned a reader (which may not have returned yet, but exists).
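A minimal folly::MPMCQueue usage sketch illustrating the points above:
folly::MPMCQueue<int> q(2);  // capacity is fixed and fully pre-allocated
q.blockingWrite(1);          // blocks while the queue is full
int v;
q.blockingRead(v);           // blocks while the queue is empty
ssize_t n = q.sizeGuess();   // estimate only; may be negative under contention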
// Initialize queue
queue_ = folly::MPMCQueue<std::unique_ptr<Job>>(numThreads_);
cudaGetDeviceCount returns the number of compute-capable devices.
#ifndef CPU_ONLY
// Initialize GPU count
if (mode_ == caffe::Caffe::GPU) {
CUDA_CHECK(cudaGetDeviceCount(&gpuDeviceCount_));
}
#endif
}
initNet first checks that the net description and weights are non-empty, and sets the phase to TEST.
// Check that we have some layers - empty strings/files, for
// example, are forgivingly deserialized.
CHECK_GT(param->layer().size(), 0);
CHECK_GT(weights->layer().size(), 0);
param->mutable_state()->set_phase(caffe::TEST);
Create the net and initialize its weight parameters.
// Initialize the canonical net
auto net = std::make_unique<caffe::Net<float>>(*param);
net->CopyTrainedLayersFrom(*weights);
Store the default shapes of the input blobs; PooledPredictor::processJob later resets each predictor's input blobs from them.
// Store default input blob shapes
shapes_.emplace_back();
for (const auto& blob : net->input_blobs()) {
shapes_.back().push_back(blob->shape());
}
Save the net description and the weights.
params_.push_back(std::move(param));
nets_.push_back(std::move(net));
if (mode_ == caffe::Caffe::GPU) {
// Stash the weights to be copied to the GPU nets
gpuWeights_.push_back(std::move(weights));
}
The destructor writes one nullptr job per thread into the queue so the threads exit.
auto n = threads_.size();
// Send nullptr's to signal the threads they can exit
for (int i = 0; i < n; i++) {
queue_.blockingWrite(nullptr);
}
// Wait for all threads to exit
for (int i = 0; i < n; i++) {
threads_[i].join();
}
The first two forward overloads differ in the type of their input_blobs parameter (lvalue versus rvalue reference). PooledPredictor::canonicalNet is declared between the three forward functions, an awkward arrangement. The public PooledPredictor::forward overloads capture the already-prepared inputs and outputs in the job closure, so PooledPredictor::processJob only has to supply the net pointer.
folly::Future<folly::Unit> PooledPredictor::forward(
const std::vector<caffe::Blob<float>*>& input_blobs,
OutputLayers* output,
uint32_t netId) {
auto fn = [=](caffe::Net<float>* predictor) {
forward(input_blobs, output, predictor);
};
return enqueueJob(std::move(fn), netId);
}
folly::Future<folly::Unit> PooledPredictor::forward(
std::vector<caffe::Blob<float>*>&& input_blobs,
OutputLayers* output,
uint32_t netId) {
auto fn = [ this, in_blobs = std::move(input_blobs), output ](
caffe::Net<float> * predictor) {
forward(in_blobs, output, predictor);
};
return enqueueJob(std::move(fn), netId);
}
The parameter name predictor is uncomfortably close to the class name. This private PooledPredictor::forward takes three parameters: input_blobs, output, and predictor.
void PooledPredictor::forward(
const std::vector<caffe::Blob<float>*>& input_blobs,
OutputLayers* output,
caffe::Net<float>* predictor) {
CHECK(predictor);
CHECK_EQ(input_blobs.size(), predictor->input_blobs().size());
for (auto i = 0; i < input_blobs.size(); ++i) {
auto& blob = input_blobs[i];
CHECK(blob);
predictor->input_blobs()[i]->ReshapeLike(*blob);
// mutable_cpu_data b/c the interface demands it, but logically const.
predictor->input_blobs()[i]->set_cpu_data(blob->mutable_cpu_data());
}
predictor->Reshape();
predictor->ForwardPrefilled();
if (FLAGS_log_caffe_predictor && optimization_ == Optimization::NONE) {
auto blob_names = predictor->blob_names();
for (auto& bname : blob_names) {
auto& blob = predictor->blob_by_name(bname);
LOG(INFO) << bname << " " << blob->shape_string();
}
}
for (auto& it : *output) {
auto predictor_blob = predictor->blob_by_name(it.first);
auto target_blob = it.second.get();
if (predictor_blob == nullptr) {
LOG(WARNING) << "Requested output blob not found: " << it.first;
continue;
}
target_blob->ReshapeLike(*predictor_blob);
if (mode_ == caffe::Caffe::CPU) {
caffe_copy(
predictor_blob->count(),
predictor_blob->cpu_data(),
target_blob->mutable_cpu_data());
} else {
caffe_copy(
predictor_blob->count(),
predictor_blob->gpu_data(),
target_blob->mutable_cpu_data());
}
}
}
folly::Promise<folly::Unit> promise;
folly::Future<folly::Unit> future = promise.getFuture();
auto job = std::make_unique<Job>(std::move(fn), std::move(promise), netId);
if (allowInlineScheduling_ && *inPredictorThread_) {
// Note: This prevents tail-call optimization, so if lots of subsequent
// jobs are being chained, disabling inline scheduling would be safer
// to avoid running out of stack memory.
processJob(std::move(job));
return future;
}
PooledPredictor::enqueueJob just optimistically tries to start a thread; PooledPredictor::startPredictorThread() itself checks whether the thread count has already reached the limit.
// Optimistically try to add a predictor thread if none are available
if (availableThreads_.load() == 0) {
startPredictorThread();
}
CPUTimer timer;
timer.Start();
queue_.blockingWrite(std::move(job));
timer.Stop();
if (cob_) {
cob_->onJobEnqueued(queue_.sizeGuess(), timer.MilliSeconds());
}
return future;
processJob looks up the inference net and runs the job's function, which invokes the private PooledPredictor::forward.
auto netId = job->netId_;
auto predictor = predictors_->at(netId).get();
caffe::CPUTimer timer;
timer.Start();
job->function_(predictor);
timer.Stop();
if (cob_) {
cob_->onJobProcessed(timer.MilliSeconds());
}
job->promise_.setValue();
When done, restore the network to its original shape.
// Restore network to original shape.
//
// If the network just processed a large input it will hold
// on to it until the next job comes along. Without precise
// memory accounting and input-size-based dispatch, this can
// cause the process to tip over and OOM. Shrinking the
// network after every forward pass doesn't eliminate the
// probability of this happening, it just reduces it.
//
// Originally, processing an image in scanning mode would run
// multiple forward passes and have the last one be the
// smallest input shape, all against the same predictor
// instance. This effectively means resizing the network to
// the smallest input size after processing an image. Since all
// feed-forwards for a single request are no longer pinned to a
// single predictor (dispatch happens for every call to the pooled
// predictor), this implied reshape to a smaller shape no
// longer happens.
for (auto i = 0; i < predictor->input_blobs().size(); ++i) {
predictor->input_blobs()[i]->Reshape(shapes_[netId][i]);
}
predictor->Reshape();
}
The thread created in PooledPredictor::startPredictorThread() initializes all the nets, then immediately starts pulling jobs off the queue. The lock is taken before creating the thread.
std::lock_guard<std::mutex> lock(mutex_);
auto threadId = threads_.size();
// Never exceed capacity
if (threadId >= numThreads_) {
return;
}
Construct all the nets.
// Create thread and add to list of threads
threads_.push_back(std::thread([&, threadId] () {
if (mode_ == caffe::Caffe::CPU) {
caffe::Caffe::set_mode(caffe::Caffe::CPU);
} else {
caffe::Caffe::set_mode(caffe::Caffe::GPU);
caffe::Caffe::SetDevice(threadId % gpuDeviceCount_);
}
// Setup the predictor nets
for (int i = 0; i < nets_.size(); i++) {
auto predictor = std::make_unique<caffe::Net<float>>(*params_[i]);
if (mode_ == caffe::Caffe::CPU) {
predictor->ShareTrainedLayersWith(nets_[i].get());
} else {
// We tried adding weight sharing between nets on the same GPU,
// which resulted in sporadic NaN outputs of cuDNN (R5) layers.
// Removing weight sharing immediately solved this problem.
predictor->CopyTrainedLayersFrom(*gpuWeights_[i]);
}
if (optimization_ == Optimization::MEMORY) {
optimizeMemory(predictor.get());
}
predictors_->push_back(std::move(predictor));
}
Then loop, reading jobs from the queue and processing them.
for (;;) {
availableThreads_++;
std::unique_ptr<Job> job;
queue_.blockingRead(job);
availableThreads_--;
if (job == nullptr) {
return;
}
if (cob_) {
cob_->onJobDequeued();
}
*inPredictorThread_ = true;
processJob(std::move(job));
*inPredictorThread_ = false;
}
}));
analyze produces liveness information (the defining and last-using layer) for every SyncedMemory in the net. assign then iterates over every element of analysis, trying to merge compatible SyncedMemory ranges.
net->Reshape();
// If the net does sharing (e.g. SplitLayer), run a forward pass to
// get the sharing setup so that it is identified when we use the
// SyncedMemory addresses as identifiers for def/use ranges.
net->ForwardPrefilled();
const auto& analysis = analyze(*net);
const auto& assignments = assign(*net, analysis);
logAssignmentMetrics(analysis, assignments);
applyAssignments(net, assignments);
The liveness analysis is built by walking the SyncedMemory pointers attached to the blobs in the network. LiveRange records one memory's liveness interval; OrderedAnalysis records the liveness of a whole set of SyncedMemory objects. analysis is a vector, but as findOrInsert shows, each SyncedMemory appears in it exactly once.
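A hedged reconstruction of the helper types this walkthrough keeps referring to (the authoritative definitions live in the library's Optimize.cpp; treat the exact spellings as assumptions):
struct LiveRange {
  int64_t defined{kNotDefined};  // index of the layer that first writes the memory
  int64_t used{kNotUsed};        // index of the layer that last reads it
};
using SyncedMemoryRange = std::pair<const caffe::SyncedMemory*, LiveRange>;
template <typename T>
using OrderedAnalysis = std::vector<std::pair<const caffe::SyncedMemory*, T>>;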
// Build up the liveness analysis by walking the SyncedMemory
// pointers attached to the blobs in the network.
const auto& bottoms = net.bottom_vecs();
const auto& tops = net.top_vecs();
OrderedAnalysis<LiveRange> analysis;
Blob indices use int64_t, more range than will ever be needed, but harmless. For each layer, findOrInsert looks up the bottom blob's underlying SyncedMemory in analysis, returning the existing live range if present and appending a new entry otherwise; the caller then extends the range's last-use point.
for (int64_t i = 0; i < bottoms.size(); ++i) {
for (const auto* bottom : bottoms[i]) {
auto& range = findOrInsert(&analysis, bottom->data().get());
if (range.used == kNotUsed) {
range.used = i;
continue;
}
range.used = std::max(range.used, i);
}
}
For each layer's outputs, set the live-range start of the underlying SyncedMemory.
for (int64_t i = 0; i < tops.size(); ++i) {
for (const auto* top : tops[i]) {
auto& range = findOrInsert(&analysis, top->data().get());
if (range.defined == kNotDefined) {
range.defined = i;
continue;
}
range.defined = std::min(range.defined, i);
}
}
Network inputs are marked as live for the whole duration.
for (const auto* input : net.input_blobs()) {
findOrInsert(&analysis, input->data().get()).defined = -kAlwaysLive;
findOrInsert(&analysis, input->data().get()).used = kAlwaysLive;
}
return analysis;
assign computes an allocation of blobs onto non-overlapping buffers. blobNames builds a map from each SyncedMemory in the net to its blob name. analysis, which records the liveness of every SyncedMemory in the network, is sorted by last-use point, and the layer at which each blob is defined and last used is logged.
const auto& names = blobNames(net);
std::stable_sort(analysis.begin(),
analysis.end(),
[](const SyncedMemoryRange& a, const SyncedMemoryRange& b) {
return a.second.used < b.second.used;
});
for (const auto& kv : analysis) {
LOG(INFO) << names.at(kv.first)
<< folly::format(": {}->{}", kv.second.defined, kv.second.used);
}
isCompatible decides whether a candidate range is compatible with an existing assignment. The loop walks every element of analysis, appending it to a compatible assignment when one exists and starting a new assignment otherwise. An Assignment is, in essence, the lifetime of one shared SyncedMemory buffer: a vector of SyncedMemoryRange entries, which is why assignment.back() below is meaningful.
Assignments assignments;
for (const auto& candidate : analysis) {
auto assigned = false;
for (auto& assignment : assignments) {
if (isCompatible(candidate, assignment)) {
assignment.push_back(candidate);
assigned = true;
break;
}
}
if (assigned) {
continue;
}
assignments.push_back({candidate});
}
return assignments;
Is the candidate range compatible with this assignment? candidate records one SyncedMemory's lifetime. Timing-wise, the check is only meaningful once both the candidate and the assignment's latest member have a recorded last use.
if (candidate.second.used == kNotUsed ||
assignment.back().second.used == kNotUsed) {
return false;
}
The candidate SyncedMemory must also be larger than the configured minimum size for sharing.
if (candidate.first->size() <= kMinimumCountForSharing) {
return false;
}
An Assignment is a vector of SyncedMemoryRange entries ordered by last use, so the candidate's definition must come after the assignment's last recorded use; e.g. a buffer last read at layer 3 can be reused by one first written at layer 5.
CHECK_GE(assignment.size(), 1);
return candidate.second.defined > assignment.back().second.used;
beforeTotalSize accumulates the total SyncedMemory footprint before reuse.
size_t beforeTotalSize = 0;
for (const auto& kv : analysis) {
beforeTotalSize += kv.first->size();
}
afterTotalSize accumulates the maximum buffer size of each Assignment, as if each shared buffer were allocated once and never released. For example, shrinking a 100 MB footprint to 40 MB is logged as 60% compression.
size_t afterTotalSize = 0;
for (const auto& assignment : assignments) {
size_t assignmentMaxSize = 0;
for (const auto& kv : assignment) {
assignmentMaxSize = std::max(assignmentMaxSize, kv.first->size());
}
LOG(INFO) << "Assignment max size: " << assignmentMaxSize;
afterTotalSize += assignmentMaxSize;
}
LOG(INFO)
<< folly::format("Before: {}, After: {}, Compression: {:.2f}%",
beforeTotalSize,
afterTotalSize,
100.0 * (1.0 - afterTotalSize * 1.0 / beforeTotalSize));
blobNames returns a dictionary of type Analysis, mapping each SyncedMemory among the net's blobs to its blob name. Guided by assignments, applyAssignments then creates one blob per assignment whose SyncedMemory is shared by the whole group.
const auto& names = blobNames(*net);
Analysis<boost::shared_ptr<Blob<float>>> reusedBlobs;
for (const auto& assignment : assignments) {
auto reused = boost::make_shared<Blob<float>>(1, 1, 1, 1);
// Instantiate so blob->data() is valid.
reused->cpu_data();
LOG(INFO) << "Assignment: ";
for (const auto& kv : assignment) {
LOG(INFO) << "Blob: " << names.at(kv.first);
reusedBlobs[kv.first] = reused;
}
}
std::unordered_map::at returns a reference to the mapped value of the element with key k. Finally, the shared SyncedMemory blobs are patched into the net's inputs and outputs, every layer's bottom/top vectors, and the intermediate blobs.
using BV = std::vector<Blob<float>*>;
using SBV = std::vector<boost::shared_ptr<Blob<float>>>;
for (auto& blob : const_cast<BV&>(net->input_blobs())) {
reusedBlobs.at(blob->data().get())->ReshapeLike(*blob);
blob = reusedBlobs.at(blob->data().get()).get();
}
for (auto& blob : const_cast<BV&>(net->output_blobs())) {
blob = reusedBlobs.at(blob->data().get()).get();
}
for (auto& vec : net->top_vecs()) {
for (auto& blob : const_cast<BV&>(vec)) {
blob = reusedBlobs.at(blob->data().get()).get();
}
}
for (auto& vec : net->bottom_vecs()) {
for (auto& blob : const_cast<BV&>(vec)) {
blob = reusedBlobs.at(blob->data().get()).get();
}
}
for (auto& blob : const_cast<SBV&>(net->blobs())) {
auto reusedBlob = reusedBlobs.at(blob->data().get());
blob = reusedBlob;
}
}