TensorRT API Pitfalls

Table of Contents

  • TensorRT Links
  • TensorRT Tools
    • Working with TensorFlow
  • Pitfall APIs
    • nvinfer1::INetworkDefinition
      • addInput
      • addReduce
      • addShuffle
        • dynamic reshape operator
      • addPluginV2
      • LSTM
    • Memory Management
  • Optimization
    • matmul


TensorRT Links

Official API documentation: https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/


TensorRT Tools

trtexec bundles TensorRT's parsers for third-party model formats.

Working with TensorFlow

  1. pb to uff
  • Environment
    cd python
    pip install tensorrt-xxxxx.whl
    cd ../uff
    pip install uff-xxxxx.whl
    cd ../graphsurgeon
    pip install graphsurgeon-xxxxx.whl

  • Command
    convert-to-uff xxxx.pb

  2. pb to onnx
    python -m tf2onnx.convert --graphdef xxxxx.pb --output xxxxx.onnx --inputs input1:0,input2:0 --outputs output1:0,output2:0

  3. trtexec
    trtexec --uff=xxxx.uff --output=xxxx,xxxx --uffInput=input1,C,H,W --uffInput=input2,C,H,W --batch=N
    trtexec --onnx=xxxx.onnx --explicitBatch


Pitfall APIs

nvinfer1::INetworkDefinition

The documentation for the various add-layer methods is, to put it mildly, hard to follow.

Networks come in two flavors:

  1. networks with an implicit batch dimension (e.g. the input is declared as HWC)
  2. networks with explicit dimensions = full dims (e.g. NHWC)

The difference shows up clearly when calling addInput.
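
A minimal sketch of how the two flavors are selected at network-creation time (assuming an existing nvinfer1::IBuilder* named builder):

// Implicit batch: create the network with no flags; inputs are declared without
// a batch dimension and the batch size is supplied at execute() time.
nvinfer1::INetworkDefinition *implicit_net = builder->createNetworkV2(0U);

// Explicit batch (full dims): pass the kEXPLICIT_BATCH flag; inputs carry the
// batch dimension and may use -1 wildcards.
const uint32_t explicit_flag =
    1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
nvinfer1::INetworkDefinition *explicit_net = builder->createNetworkV2(explicit_flag);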

addInput

Official documentation:

For networks with an implicit batch dimension, this volume includes the batch dimension with its length set to the maximum batch size. For networks with all explicit dimensions and with wildcard dimensions, the volume is based on the maxima specified by an IOptimizationProfile. Dimensions are normally non-negative integers. The exception is that in networks with all explicit dimensions, -1 can be used as a wildcard for a dimension to be specified at runtime. Input tensors with such a wildcard must have a corresponding entry in the IOptimizationProfiles indicating the permitted extrema, and the input dimensions must be set by IExecutionContext::setBindingDimensions. Different IExecutionContext instances can have different dimensions. Wildcard dimensions are only supported for EngineCapability::kSTANDARD. They are not supported in safety contexts. DLA does not support Wildcard dimensions.

Taking an NCHW input as an example:

  1. implicit batch dimension network
    Normally you should pass only CHW here and set the batch size at execute time.
    If you pass NCHW instead, N is taken as the maximum batch size. Then at execute time the batch size you set is the one actually used? And what about the memory allocated for the intermediate tensors inside the network — is it sized for the maximum?

  2. explicit dimensions network
    Normally you should set non-negative NCHW dimensions. However, input dimensions may also be unknown (denoted by -1). If the input dims contain -1, building the graph requires an IOptimizationProfile whose setDimensions calls give the allowed range for each -1 dimension, and the actual dims are fixed before execute via IExecutionContext::setBindingDimensions.

// HW is -1 wildcard
auto input = preprocessorNetwork->addInput("input", nvinfer1::DataType::kFLOAT, Dims4{1, 1, -1, -1});

// Create an optimization profile so that we can specify a range of input dimensions.
nvinfer1::IOptimizationProfile* profile = builder->createOptimizationProfile();
// This profile will be valid for all images whose size falls in the range of [(1, 1, 1, 1), (1, 1, 56, 56)]
// but TensorRT will optimize for (1, 1, 28, 28)
// We do not need to check the return of setDimension and addOptimizationProfile here as all dims are explicitly set
profile->setDimensions(input->getName(), OptProfileSelector::kMIN, Dims4{1, 1, 1, 1});
profile->setDimensions(input->getName(), OptProfileSelector::kOPT, Dims4{1, 1, 28, 28});
profile->setDimensions(input->getName(), OptProfileSelector::kMAX, Dims4{1, 1, 56, 56});
preprocessorConfig->addOptimizationProfile(profile);

// Set the input size for the preprocessor
if (!mPreprocessorContext->setBindingDimensions(0, inputDims)) {
    // Invalid binding dimensions.
    return false;
}

// We can only run inference once all dynamic input shapes have been specified.
bool ret = mPreprocessorContext->allInputDimensionsSpecified();

addReduce

Header and documentation comment:

//! \param input The input tensor to the layer.
//! \param operation The reduction operation to perform.
//! \param reduceAxes The reduction dimensions.
//!        The bit in position i of bitmask reduceAxes corresponds to explicit dimension i if result.
//!        E.g., the least significant bit corresponds to the first explicit dimension and the next to least significant bit corresponds to the second explicit dimension.
//!
//! \param keepDimensions The boolean that specifies whether or not to keep the reduced dimensions in the output of the layer.

IReduceLayer* addReduce(ITensor& input, ReduceOperation operation, uint32_t reduceAxes, bool keepDimensions);

A reduction layer: it reduces over the axes selected by reduceAxes, and the reduction can be any of ReduceOperation's kSUM, kPROD, kMAX, kMIN, kAVG.

reduceAxes, reading the docstring literally: bit i of the bitmask selects explicit dimension i for reduction (the "if result" wording in the docstring looks like a typo). The least significant bit corresponds to the first explicit dimension, the next-to-least significant bit to the second, and so on. In code:

reduceAxes |= 1u << axis;

axis index | binary | decimal
    3      |  1000  |    8
    2      |  0100  |    4
    1      |  0010  |    2
    0      |  0001  |    1
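
A minimal sketch (hypothetical input tensor) that reduces an explicit-dims NCHW tensor over H and W with the mean:

// Reduce over explicit dimensions 2 (H) and 3 (W): bitmask 0b1100.
uint32_t reduceAxes = (1u << 2) | (1u << 3);
nvinfer1::IReduceLayer *reduce = network->addReduce(
    *input_tensor, nvinfer1::ReduceOperation::kAVG, reduceAxes, /*keepDimensions=*/true);
// With keepDimensions = true the output shape is N x C x 1 x 1; with false it is N x C.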

addShuffle

Covers many dimension-changing operators, e.g. reshape, flatten, squeeze, unsqueeze, transpose.
For constant target dimensions, setReshapeDimensions is enough.
For a transpose with a constant perm, use setFirstTranspose.
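
A minimal sketch of the constant case (hypothetical input tensor; the perm and target shape are only examples):

nvinfer1::IShuffleLayer *shuffle = network->addShuffle(*input);
// Constant transpose applied before the reshape.
nvinfer1::Permutation perm{{1, 0, 2}};
shuffle->setFirstTranspose(perm);
// Constant reshape; at most one dimension may be -1 and is inferred from the rest.
shuffle->setReshapeDimensions(nvinfer1::Dims3{1, -1, 28});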

dynamic reshape operator

nvinfer1::ITensor comes in two kinds: shape tensors and execution tensors. A shape tensor carries shape information — the output of a shape operator is a shape tensor. An execution tensor is what actual computation runs on. In general, a network's input and output tensors should be execution tensors.

If the reshape target is a constant, just set it with setReshapeDimensions.
If the shape is a variable, the tensor carrying that shape is a shape tensor.
nvinfer1::IShuffleLayer is static by default; setInput(0, xxxx) replaces the tensor to be reshaped.
When setInput(1, xxxx) is given a shape tensor as the second input, the nvinfer1::IShuffleLayer becomes dynamic and the reshape is computed at runtime.
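
A minimal sketch of the dynamic case (hypothetical tensors data and new_shape, where new_shape is a 1-D INT32 shape tensor, e.g. produced by addShape or by concatenating shape components):

nvinfer1::IShuffleLayer *shuffle = network->addShuffle(*data);
// Passing a shape tensor as the second input switches the layer to dynamic mode;
// the target shape is then computed at runtime instead of being fixed at build time.
shuffle->setInput(1, *new_shape);
nvinfer1::ITensor *reshaped = shuffle->getOutput(0);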

addPluginV2

Operators that TensorRT does not support can be implemented yourself as plugins.
Header template:

class EqualPluginCreater : public nvinfer1::IPluginCreator {
 public:
  EqualPluginCreater();

  const char *getPluginName() const noexcept override;

  const char *getPluginVersion() const noexcept override;

  const nvinfer1::PluginFieldCollection *getFieldNames() noexcept override;

  nvinfer1::IPluginV2 *createPlugin(const char *name, const nvinfer1::PluginFieldCollection *fc) noexcept override;

  nvinfer1::IPluginV2 *deserializePlugin(const char *name, const void *serialData,
                                         size_t serialLength) noexcept override;

  void setPluginNamespace(const char *pluginNamespace) noexcept override;

  const char *getPluginNamespace() const noexcept override;

 private:
  static nvinfer1::PluginFieldCollection field_collection_;
  static std::vector<nvinfer1::PluginField> fields_;
  std::string name_space_;
};

class EqualPlugin : public nvinfer1::IPluginV2DynamicExt { // use IPluginV2DynamicExt to support dynamic input shapes
 public:
  explicit EqualPlugin(const std::string name) : layer_name_(name) {}

  // It doesn't make sense to construct an EqualPlugin without arguments,
  // so the default constructor is deleted.
  EqualPlugin() = delete;

  // IPluginV2DynamicExt Methods
  nvinfer1::IPluginV2DynamicExt *clone() const noexcept override;
  // called at build time; returns the dimensions of the output tensor
  nvinfer1::DimsExprs getOutputDimensions(int outputIndex, const nvinfer1::DimsExprs *inputs, int nbInputs,
                                          nvinfer1::IExprBuilder &exprBuilder) noexcept override;
  bool supportsFormatCombination(int pos, const nvinfer1::PluginTensorDesc *tensorsDesc, int nbInputs,
                                 int nbOutputs) noexcept override;
  void configurePlugin(const nvinfer1::DynamicPluginTensorDesc *in, int nbInputs,
                       const nvinfer1::DynamicPluginTensorDesc *out, int nbOutputs) noexcept override;
  size_t getWorkspaceSize(const nvinfer1::PluginTensorDesc *inputs, int nbInputs,
                          const nvinfer1::PluginTensorDesc *outputs, int nbOutputs) const noexcept override;
  // enqueue is where inference actually runs; the inputs and outputs point to CUDA device memory, so CUDA functions and kernels can be used on them directly
  int enqueue(const nvinfer1::PluginTensorDesc *inputDesc, const nvinfer1::PluginTensorDesc *outputDesc,
              const void *const *inputs, void *const *outputs, void *workspace, cudaStream_t stream) noexcept override;

  // IPluginV2Ext Methods
  // called at build time; returns the data type of the output tensor
  nvinfer1::DataType getOutputDataType(int index, const nvinfer1::DataType *inputTypes, int nbInputs) const
    noexcept override;

  // IPluginV2 Methods
  const char *getPluginType() const noexcept override;
  const char *getPluginVersion() const noexcept override;
  int getNbOutputs() const noexcept override;
  int initialize() noexcept override;
  void terminate() noexcept override;
  size_t getSerializationSize() const noexcept override;
  void serialize(void *buffer) const noexcept override;
  void destroy() noexcept override; // delete this;
  void setPluginNamespace(const char *pluginNamespace) noexcept override;
  const char *getPluginNamespace() const noexcept override;

 private:
  const std::string layer_name_;
  std::string name_space_;
};

const char *EQUAL_PLUGIN_VERSION{"1"};
const char *EQUAL_PLUGIN_NAME{"EqualPluginCreater"};
nvinfer1::PluginFieldCollection EqualPluginCreater::field_collection_{};
std::vector<nvinfer1::PluginField> EqualPluginCreater::fields_;
REGISTER_TENSORRT_PLUGIN(EqualPluginCreater);

// usage
nvinfer1::ITensor *inputTensors[] = {trt_tensor_1, trt_tensor_2};
auto plugin = std::make_shared<EqualPlugin>(name);
nvinfer1::IPluginV2Layer *equal_layer = network->addPluginV2(inputTensors, 2, *plugin);
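
As a companion to the template above, a hedged sketch of what an enqueue implementation could look like (EqualKernel is a hypothetical __global__ CUDA kernel, not part of TensorRT):

int EqualPlugin::enqueue(const nvinfer1::PluginTensorDesc *inputDesc, const nvinfer1::PluginTensorDesc *outputDesc,
                         const void *const *inputs, void *const *outputs, void *workspace,
                         cudaStream_t stream) noexcept {
  // inputs/outputs are device pointers, so kernels can be launched on them directly.
  int64_t count = 1;
  for (int i = 0; i < inputDesc[0].dims.nbDims; i++) {
    count *= inputDesc[0].dims.d[i];
  }
  const int threads = 256;
  const int blocks = static_cast<int>((count + threads - 1) / threads);
  // Hypothetical elementwise kernel comparing inputs[0] and inputs[1] into outputs[0].
  EqualKernel<<<blocks, threads, 0, stream>>>(static_cast<const float *>(inputs[0]),
                                              static_cast<const float *>(inputs[1]),
                                              static_cast<float *>(outputs[0]), count);
  return 0;
}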

LSTM

TensorRT offers three ways to implement an LSTM:

  • addRNNv2
  • TensorRT's built-in plugin
    https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#persistent-lstm-plugin
  • addLoop (API introduced in TensorRT 7, officially recommended)

Official guide on replacing RNNs with loops:
https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#replacing-with-loops

// pseudocode
// 1. LSTM op inputs. Note: the device memory allocated for each addInput must be added to tensor_bindings:
nvinfer1::ITensor *hidden_init = network_->addInput(hidden_name, kFLOAT, Dims3(layer_count_ * directional_cnt_, batch_size_, hidden_size_));
nvinfer1::ITensor *cell_init = network_->addInput(cell_name, kFLOAT, Dims3(layer_count_ * directional_cnt_, batch_size_, hidden_size_));
nvinfer1::ITensor *sequence_size_input = network_->addInput(seq_input_name, kINT32, nvinfer1::Dims{});
nvinfer1::ITensor *max_sequence_size = network_->addConstant(nvinfer1::Dims{}, nvinfer1::Weights{kINT32, &sequence_size_, 1})->getOutput(0);

// intermediate state that must be carried between steps
struct LstmState {
  nvinfer1::ITensor *data_{nullptr};
  nvinfer1::ITensor *hidden_{nullptr};
  nvinfer1::ITensor *cell_{nullptr};
}; // input of each compute loop

struct LstmWeights {
  nvinfer1::ITensor *input_weights_{nullptr};
  nvinfer1::ITensor *state_weights_{nullptr};
  nvinfer1::ITensor *input_bias_{nullptr};
  nvinfer1::ITensor *state_bias_{nullptr};
  nvinfer1::ITensor *max_seq_size_{nullptr};
}; // constants of each compute loop

LstmState next_state{input_data_, nullptr, nullptr};

// 2. the LSTM op contains layer_count_ layers
// directional_cnt_ == 1 means forward only; directional_cnt_ == 2 means bidirectional
for (int i = 0; i < layer_count_; i++) {
  LstmState layer_input_states[2];
  LstmWeights layer_weights[2];
  // input_state and weights handling omitted
  nvinfer1::ITensor *forward_output = AddLSTMCalculation(layer_input_states[0], layer_weights[0], &forward_hidden_out, &forward_cell_out, false /* is_backward */);
  if (directional_cnt_ == 2) {
    backward_output = AddLSTMCalculation(layer_input_states[1], layer_weights[1], &backward_hidden_out, &backward_cell_out, true /* is_backward */);
  }
  // concat the forward and backward output, hidden output and cell output; mind the axis. Here the output dims are [0: sequence size, 1: layer * direction, 2: batch size, 3: hidden cnt], so concat along axis 1.
  // the concatenated result is the input of the next layer
  next_state = LstmState{output_tensor, hidden_out, cell_out};
  // stash the current layer's hidden and cell outputs; they are concatenated at the end
  hidden_outputs.push_back(next_state.hidden_);
  cell_outputs.push_back(next_state.cell_);
}
// concat hidden_outputs and cell_outputs to form the op's hidden and cell outputs

AddLSTMCalculation above also has to iterate over the batch, i.e. the input of a single forward computation has dims (sequence_size_, input_size_).

AddLSTMCalculation:
for (int batch_index = 0; batch_index < batch_size_; batch_index++) {
  LstmState one_batch_input_state;
  nvinfer1::ITensor *batch_index_tensor = network_->addConstant(nvinfer1::Dims{}, nvinfer1::Weights{kINT32, &INDICES[batch_index], 1})->getOutput(0);
  one_batch_input_state.data_ = network_->addGather(*input_state.data_, *batch_index_tensor, 0)->getOutput(0);
  one_batch_input_state.hidden_ = network_->addGather(*input_state.hidden_, *batch_index_tensor, 0)->getOutput(0);
  one_batch_input_state.cell_ = network_->addGather(*input_state.cell_, *batch_index_tensor, 0)->getOutput(0);
  // below is a loop that runs sequence-length times
  nvinfer1::ITensor *one_batch_output = AddLSTMOneLoop(one_batch_input_state, lstm_weights, &one_batch_hidden, &one_batch_cell, is_backward);

  all_batch_outputs.push_back(one_batch_output);
  all_batch_hidden.push_back(one_batch_hidden);
  all_batch_cell.push_back(one_batch_cell);
}
// the hidden output and cell output are the concat of all batches' outputs; mind the concat axis of the shapes

The LSTM equations for a single step are described at
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
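
For reference, the standard per-step equations the code below implements (sigma is the sigmoid, juxtaposition is matrix-vector multiplication, * is elementwise multiplication, W/R are the input/recurrent weights, b the biases):

i_t = sigma(W_i x_t + R_i h_{t-1} + b_i)
o_t = sigma(W_o x_t + R_o h_{t-1} + b_o)
f_t = sigma(W_f x_t + R_f h_{t-1} + b_f)
g_t = tanh(W_g x_t + R_g h_{t-1} + b_g)
C_t = f_t * C_{t-1} + i_t * g_t
h_t = o_t * tanh(C_t)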

nvinfer1::ILoop *sequence_loop = network_->addLoop();
sequence_loop->addTripLimit(*sequence_size_input, nvinfer1::TripLimit::kCOUNT);
nvinfer1::ITensor *input = sequence_loop->addIterator(*input_state.data_, 0, is_backward)->getOutput(0); // is_backward sets the forward/backward iteration order
nvinfer1::ILayer *hidden_mid = sequence_loop->addRecurrence(*input_state.hidden_);
nvinfer1::ILayer *cell_mid = sequence_loop->addRecurrence(*input_state.cell_);

// combine the input, hidden state, weights and bias:
// X[t] * W + H[t-1] * R + b
nvinfer1::ITensor *input_matmul = network_->addMatrixMultiply(*input, nvinfer1::MatrixOperation::kVECTOR, *lstm_weights.input_weights_, nvinfer1::MatrixOperation::kTRANSPOSE)->getOutput(0);
nvinfer1::ITensor *hidden_matmul = network_->addMatrixMultiply(*hidden_mid->getOutput(0), nvinfer1::MatrixOperation::kVECTOR, *lstm_weights.state_weights_, nvinfer1::MatrixOperation::kTRANSPOSE)->getOutput(0);
nvinfer1::ITensor *weights_add = network_->addElementWise(*input_matmul, *hidden_matmul, nvinfer1::ElementWiseOperation::kSUM)->getOutput(0);
nvinfer1::ITensor *bias = network_->addElementWise(*lstm_weights.input_bias_, *lstm_weights.state_bias_, nvinfer1::ElementWiseOperation::kSUM)->getOutput(0);
nvinfer1::ITensor *gates_calculate = network_->addElementWise(*weights_add, *bias, nvinfer1::ElementWiseOperation::kSUM)->getOutput(0);

// slice out each gate, following the weight layout order; here it is: input, output, forget, cell
const auto isolateGate = [&](nvinfer1::ITensor &gates, int gateIndex) ->nvinfer1::ITensor * {
  nvinfer1::ISliceLayer *slice = network_->addSlice(gates, nvinfer1::Dims{1, {gateIndex * params_.hidden_size_}}, nvinfer1::Dims{1, {params_.hidden_size_}}, nvinfer1::Dims{1, {1}});
  return Reshape(slice->getOutput(0), nvinfer1::Dims{1, {params_.hidden_size_}});
};
nvinfer1::ITensor *i = network_->addActivation(*isolateGate(*gates_calculate, 0), nvinfer1::ActivationType::kSIGMOID)->getOutput(0);
nvinfer1::ITensor *o = network_->addActivation(*isolateGate(*gates_calculate, 1), nvinfer1::ActivationType::kSIGMOID)->getOutput(0);
nvinfer1::ITensor *f = network_->addActivation(*isolateGate(*gates_calculate, 2), nvinfer1::ActivationType::kSIGMOID)->getOutput(0);
nvinfer1::ITensor *c = network_->addActivation(*isolateGate(*gates_calculate, 3), nvinfer1::ActivationType::kTANH)->getOutput(0);

// compute this step's cell and hidden outputs
nvinfer1::ITensor *C = network_->addElementWise(*network_->addElementWise(*f, *cell_mid->getOutput(0), nvinfer1::ElementWiseOperation::kPROD)->getOutput(0), *network_->addElementWise(*i, *c, nvinfer1::ElementWiseOperation::kPROD)->getOutput(0), nvinfer1::ElementWiseOperation::kSUM)->getOutput(0);
nvinfer1::ITensor *H = network_->addElementWise(*o, *network_->addActivation(*C, nvinfer1::ActivationType::kTANH)->getOutput(0), nvinfer1::ElementWiseOperation::kPROD)->getOutput(0);
// recurrence: this step's outputs feed the next iteration
cell_mid->setInput(1, *C);
hidden_mid->setInput(1, *H);
// output_mode is nvinfer1::LoopOutput::kCONCATENATE for the forward direction, kREVERSE for backward
nvinfer1::ILoopOutputLayer *output_layer = sequence_loop->addLoopOutput(*H, output_mode);
output_layer->setInput(1, *lstm_weights.max_seq_size_);
// the loop outputs' shapes are all -1; set them explicitly so downstream operators are not affected
hidden_out = Reshape(sequence_loop->addLoopOutput(*hidden_mid->getOutput(0), nvinfer1::LoopOutput::kLAST_VALUE)->getOutput(0), nvinfer1::Dims3(1, 1, hidden_size_));
cell_out = Reshape(sequence_loop->addLoopOutput(*cell_mid->getOutput(0), nvinfer1::LoopOutput::kLAST_VALUE)->getOutput(0), nvinfer1::Dims3(1, 1, hidden_size_));
loop_out = Reshape(output_layer->getOutput(0), nvinfer1::Dims4(sequence_size_, 1, 1, hidden_size_));

Memory Management

Device memory: you allocate the memory behind the input tensors yourself.

// a tensor's name is its unique identifier and maps to a binding index
int index = engine_->getBindingIndex(name);
tensor_bindings[index] = device_ptr; // device_ptr is a device address you obtained yourself via cudaMalloc

// at inference time, the device addresses of all input and output tensors must be collected here
trt_context->executeV2(tensor_bindings);
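
A minimal sketch (assuming all bindings are kFLOAT and all dims are known, i.e. no -1 wildcards) of allocating one device buffer per binding:

std::vector<void *> tensor_bindings(engine_->getNbBindings(), nullptr);
for (int i = 0; i < engine_->getNbBindings(); i++) {
  nvinfer1::Dims dims = engine_->getBindingDimensions(i);
  size_t volume = 1;
  for (int d = 0; d < dims.nbDims; d++) {
    volume *= dims.d[d];
  }
  cudaMalloc(&tensor_bindings[i], volume * sizeof(float));  // assuming kFLOAT bindings
}
// copy inputs in with cudaMemcpyAsync, call executeV2(tensor_bindings.data()),
// copy outputs back, then cudaFree each buffer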

Optimization

matmul

  • When building the engine, TensorRT performs tactic selection against the nvinfer1::OptProfileSelector::kOPT dimensions set via nvinfer1::IOptimizationProfile::setDimensions, and picks whichever of cuBLAS and cuDNN is faster (cuDNN tends to be chosen for large workloads, cuBLAS for small ones).
  • If the pattern is matmul + bias add and both the weight and bias are constants, use a fully connected layer instead of matmul; TensorRT will then optimize it into a 1x1 convolution that replaces matmul + bias (see the sketch below).
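
A minimal sketch of the second point (hypothetical weight_data/bias_data constants; assumes the input tensor's trailing dimensions are laid out as CHW, which addFullyConnected expects):

// matmul + bias add with constant weight and bias, expressed as a fully connected layer.
nvinfer1::Weights kernel{nvinfer1::DataType::kFLOAT, weight_data, out_channels * in_channels};
nvinfer1::Weights bias{nvinfer1::DataType::kFLOAT, bias_data, out_channels};
nvinfer1::IFullyConnectedLayer *fc = network->addFullyConnected(*input, out_channels, kernel, bias);
nvinfer1::ITensor *output = fc->getOutput(0);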
