Official API link: https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/
trtexec integrates TensorRT's parsers for third-party formats.
Environment
cd python
pip install tensorrt-xxxxx.whl
cd ../uff
pip install uff-xxxxx.whl
cd ../graphsurgeon
pip install graphsurgeon-xxxxx.whl
Commands
convert-to-uff xxxx.pb
Convert pb to onnx
python -m tf2onnx.convert --graphdef xxxxx.pb --output xxxxx.onnx --inputs input1:0,input2:0 --outputs output1:0,output2:0
trtexec
trtexec --uff=xxxx.uff --output=xxxx,xxxx --uffInput=input1,C,H,W --uffInput=input2,C,H,W --batch=N
trtexec --onnx=xxxx.onnx --explicitBatch
The documentation for the various addXxxLayer APIs is, to put it mildly, hard to follow.
Networks come in two kinds:
The difference shows up clearly when calling addInput.
Note from the official docs:
For networks with an implicit batch dimension, this volume includes the batch dimension with its length set to the maximum batch size. For networks with all explicit dimensions and with wildcard dimensions, the volume is based on the maxima specified by an IOptimizationProfile. Dimensions are normally non-negative integers. The exception is that in networks with all explicit dimensions, -1 can be used as a wildcard for a dimension to be specified at runtime. Input tensors with such a wildcard must have a corresponding entry in the IOptimizationProfiles indicating the permitted extrema, and the input dimensions must be set by IExecutionContext::setBindingDimensions. Different IExecutionContext instances can have different dimensions. Wildcard dimensions are only supported for EngineCapability::kSTANDARD. They are not supported in safety contexts. DLA does not support Wildcard dimensions.
Take an NCHW input as an example.
Networks with an implicit batch dimension
Normally you pass only CHW at addInput and set the batch size at execute time.
If the input is given as NCHW here, N acts as the maximum batch size. Then, at execute time, the batch size you set is what actually applies? And the memory allocated for the tensors connecting the layers inside the network, is it sized to the maximum?
Networks with explicit dimensions
Normally you set non-negative NCHW dimensions, but an input dimension may also be unknown (written as -1). If an input dimension is -1, building the network requires an IOptimizationProfile to specify the allowed range for that dimension, and the concrete value is fixed before execute via IExecutionContext::setBindingDimensions.
// HW is -1 wildcard
auto input = preprocessorNetwork->addInput("input", nvinfer1::DataType::kFLOAT, Dims4{1, 1, -1, -1});
// Create an optimization profile so that we can specify a range of input dimensions.
nvinfer1::IOptimizationProfile* profile = builder->createOptimizationProfile();
// This profile will be valid for all images whose size falls in the range of [(1, 1, 1, 1), (1, 1, 56, 56)]
// but TensorRT will optimize for (1, 1, 28, 28)
// We do not need to check the return of setDimension and addOptimizationProfile here as all dims are explicitly set
profile->setDimensions(input->getName(), OptProfileSelector::kMIN, Dims4{1, 1, 1, 1});
profile->setDimensions(input->getName(), OptProfileSelector::kOPT, Dims4{1, 1, 28, 28});
profile->setDimensions(input->getName(), OptProfileSelector::kMAX, Dims4{1, 1, 56, 56});
preprocessorConfig->addOptimizationProfile(profile);
// Set the input size for the preprocessor
bool status = mPreprocessorContext->setBindingDimensions(0, inputDims); // returns false on invalid binding dimensions
// We can only run inference once all dynamic input shapes have been specified.
bool ret = mPreprocessorContext->allInputDimensionsSpecified();
Header and doc comments:
//! \param input The input tensor to the layer.
//! \param operation The reduction operation to perform.
//! \param reduceAxes The reduction dimensions.
//! The bit in position i of bitmask reduceAxes corresponds to explicit dimension i if result.
//! E.g., the least significant bit corresponds to the first explicit dimension and the next to least significant bit corresponds to the second explicit dimension.
//!
//! \param keepDimensions The boolean that specifies whether or not to keep the reduced dimensions in the output of the layer.
IReduceLayer* addReduce(ITensor& input, ReduceOperation operation, uint32_t reduceAxes, bool keepDimensions);
This is the reduction layer: it reduces along the axes given by reduceAxes, and the reduction can be any of ReduceOperation's kSUM, kPROD, kMAX, kMIN, or kAVG.
reduceAxes, translated literally: bit i of the bitmask corresponds to explicit dimension i of the result. The least significant bit corresponds to the first dimension, and the next least significant bit to the second dimension.
reduceAxis |= 1u << axis_data;
axis index | binary | decimal |
---|---|---|
3 | 1000 | 8 |
2 | 0100 | 4 |
1 | 0010 | 2 |
0 | 0001 | 1 |
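As a hedged sketch (the names network and input are illustrative; input is assumed to be an NCHW execution tensor), reducing over H and W, i.e. axes 2 and 3, with kAVG amounts to global average pooling:
// Bits 2 and 3 of the mask select the H and W axes of an NCHW tensor (binary 1100, decimal 12).
uint32_t reduceAxes = (1u << 2) | (1u << 3);
// keepDimensions = true keeps the reduced axes, so the output shape is N x C x 1 x 1.
nvinfer1::IReduceLayer *reduce = network->addReduce(*input, nvinfer1::ReduceOperation::kAVG, reduceAxes, true);
nvinfer1::ITensor *pooled = reduce->getOutput(0);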
Many operators change dimensions, e.g. reshape, flatten, squeeze, unsqueeze, transpose; all of them are expressed with the shuffle layer.
For constant target dimensions, just set them with setReshapeDimensions.
For a transpose with a constant perm, use setFirstTranspose.
nvinfer1::ITensor
There are two kinds of tensors: shape tensors and execution tensors. A shape tensor carries shape information; the output of a shape operator is a shape tensor. An execution tensor holds the data that is actually computed on. In general, a network's input and output tensors should be execution tensors.
If the shape of a reshape operator is a constant, just set it with setReshapeDimensions.
If the shape is a variable, the tensor holding that shape is a shape tensor.
nvinfer1::IShuffleLayer
By default the layer is static. setInput(0, xxxx) replaces the tensor to be reshaped.
When setInput(1, xxxx) is given a shape tensor as the second input, the nvinfer1::IShuffleLayer becomes dynamic and the reshape is computed at runtime.
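A minimal sketch, assuming network is the INetworkDefinition* being built, data is the tensor to reshape, and ref is another tensor whose runtime shape we want to reuse (all names illustrative):
// Static case: the target shape is a build-time constant.
nvinfer1::IShuffleLayer *static_shuffle = network->addShuffle(*data);
static_shuffle->setReshapeDimensions(nvinfer1::Dims3{1, 28, 28});

// Dynamic case: feed a shape tensor as the second input.
// The output of addShape is a shape tensor holding the runtime shape of ref.
nvinfer1::ITensor *shape = network->addShape(*ref)->getOutput(0);
nvinfer1::IShuffleLayer *dynamic_shuffle = network->addShuffle(*data);
dynamic_shuffle->setInput(1, *shape); // the layer is now dynamic; the reshape happens at runtime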
Operators that TensorRT does not support can be implemented yourself as plugins.
Header template:
class EqualPluginCreater : public nvinfer1::IPluginCreator {
public:
EqualPluginCreater();
const char *getPluginName() const noexcept override;
const char *getPluginVersion() const noexcept override;
const nvinfer1::PluginFieldCollection *getFieldNames() noexcept override;
nvinfer1::IPluginV2 *createPlugin(const char *name, const nvinfer1::PluginFieldCollection *fc) noexcept override;
nvinfer1::IPluginV2 *deserializePlugin(const char *name, const void *serialData,
size_t serialLength) noexcept override;
void setPluginNamespace(const char *pluginNamespace) noexcept override;
const char *getPluginNamespace() const noexcept override;
private:
static nvinfer1::PluginFieldCollection field_collection_;
static std::vector<nvinfer1::PluginField> fields_;
std::string name_space_;
};
class EqualPlugin : public nvinfer1::IPluginV2DynamicExt { // use this base class to support dynamic input shapes
public:
explicit EqualPlugin(const std::string name) : layer_name_(name) {}
// It doesn't make sense to construct an EqualPlugin without arguments, so we delete
// the default constructor.
EqualPlugin() = delete;
// IPluginV2DynamicExt Methods
nvinfer1::IPluginV2DynamicExt *clone() const noexcept override;
// called at build time; returns the output tensor dimensions
nvinfer1::DimsExprs getOutputDimensions(int outputIndex, const nvinfer1::DimsExprs *inputs, int nbInputs,
nvinfer1::IExprBuilder &exprBuilder) noexcept override;
bool supportsFormatCombination(int pos, const nvinfer1::PluginTensorDesc *tensorsDesc, int nbInputs,
int nbOutputs) noexcept override;
void configurePlugin(const nvinfer1::DynamicPluginTensorDesc *in, int nbInputs,
const nvinfer1::DynamicPluginTensorDesc *out, int nbOutputs) noexcept override;
size_t getWorkspaceSize(const nvinfer1::PluginTensorDesc *inputs, int nbInputs,
const nvinfer1::PluginTensorDesc *outputs, int nbOutputs) const noexcept override;
// enqueue is where inference actually runs; the inputs and outputs pointers are CUDA device addresses, so CUDA functions can be called on them directly
int enqueue(const nvinfer1::PluginTensorDesc *inputDesc, const nvinfer1::PluginTensorDesc *outputDesc,
const void *const *inputs, void *const *outputs, void *workspace, cudaStream_t stream) noexcept override;
// IPluginV2Ext Methods
// called at build time; returns the output tensor data type
nvinfer1::DataType getOutputDataType(int index, const nvinfer1::DataType *inputTypes, int nbInputs) const
noexcept override;
// IPluginV2 Methods
const char *getPluginType() const noexcept override;
const char *getPluginVersion() const noexcept override;
int getNbOutputs() const noexcept override;
int initialize() noexcept override;
void terminate() noexcept override;
size_t getSerializationSize() const noexcept override;
void serialize(void *buffer) const noexcept override;
void destroy() noexcept override; // delete this;
void setPluginNamespace(const char *pluginNamespace) noexcept override;
const char *getPluginNamespace() const noexcept override;
private:
const std::string layer_name_;
std::string name_space_;
};
const char *EQUAL_PLUGIN_VERSION{"1"};
const char *EQUAL_PLUGIN_NAME{"EqualPluginCreater"};
nvinfer1::PluginFieldCollection EqualPluginCreater::field_collection_{};
std::vector<nvinfer1::PluginField> EqualPluginCreater::fields_;
REGISTER_TENSORRT_PLUGIN(EqualPluginCreater);
// usage
nvinfer1::ITensor *inputTensors[] = {trt_tensor_1, trt_tensor_2};
auto plugin = std::make_shared<EqualPlugin>(name);
nvinfer1::IPluginV2Layer *equal_layer = network->addPluginV2(inputTensors, 2, *plugin);
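As a hedged sketch of how two of the methods declared above might be filled in for an element-wise plugin (illustrative only, not a full implementation; the real Equal op would still need its output type and enqueue kernel handled):
// Called at build time: for an element-wise op the output shape equals the first input's shape.
nvinfer1::DimsExprs EqualPlugin::getOutputDimensions(int outputIndex, const nvinfer1::DimsExprs *inputs,
                                                     int nbInputs, nvinfer1::IExprBuilder &exprBuilder) noexcept {
  return inputs[0];
}

// Accept only FP32 linear-format tensors, with all inputs/outputs sharing the same type.
bool EqualPlugin::supportsFormatCombination(int pos, const nvinfer1::PluginTensorDesc *tensorsDesc,
                                            int nbInputs, int nbOutputs) noexcept {
  const nvinfer1::PluginTensorDesc &desc = tensorsDesc[pos];
  return desc.format == nvinfer1::TensorFormat::kLINEAR &&
         desc.type == nvinfer1::DataType::kFLOAT &&
         desc.type == tensorsDesc[0].type;
}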
TensorRT has three ways to implement an LSTM:
Official guide on replacing the RNN layer with addLoop
Link: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#replacing-with-loops
// pseudocode
// 1. Inputs of the lstm op. Note: the device memory allocated for each addInput has to be added to tensor_bindings:
nvinfer1::ITensor *hidden_init = network_->addInput(hidden_name, kFLOAT, Dims3(layer_count_ * directional_cnt_, batch_size_, hidden_size_));
nvinfer1::ITensor *cell_init = network_->addInput(cell_name, kFLOAT, Dims3(layer_count_ * directional_cnt_, batch_size_, hidden_size_));
nvinfer1::ITensor *sequence_size_input = network_->addInput(seq_input_name, kINT32, nvinfer1::Dims{});
nvinfer1::ITensor *max_sequence_size = network_->addConstant(nvinfer1::Dims{}, nvinfer1::Weights{kINT32, &sequence_size_, 1})->getOutput(0);
// intermediate results that need to be kept
struct LstmState {
nvinfer1::ITensor *data_{nullptr};
nvinfer1::ITensor *hidden_{nullptr};
nvinfer1::ITensor *cell_{nullptr};
}; // input of each computation loop
struct LstmWeights {
nvinfer1::ITensor *input_weights_{nullptr};
nvinfer1::ITensor *state_weights_{nullptr};
nvinfer1::ITensor *input_bias_{nullptr};
nvinfer1::ITensor *state_bias_{nullptr};
nvinfer1::ITensor *max_seq_size_{nullptr};
}; // constants of each computation loop
LstmState next_state{input_data_, nullptr, nullptr};
// 2. The lstm op contains layer_count_ layers
// directional_cnt_ == 1 means forward only, directional_cnt_ == 2 means bidirectional
for (int i = 0; i < layer_count_; i++) {
LstmState layer_input_states[2];
LstmWeights layer_weights[2];
// input_state and weights handling omitted
nvinfer1::ITensor *forward_output = AddLSTMCalculation(layer_input_states[0], layer_weights[0], &forward_hidden_out, &forward_cell_out, false /* is_backward */);
if (directional_cnt_ == 2) {
backward_output = AddLSTMCalculation(layer_input_states[1], layer_weights[1], &backward_hidden_out, &backward_cell_out, true /* is_backward */);
}
// Concat the forward and backward output, hidden output and cell output; mind the axis. The output dims here are [0: sequence size, 1: layer * direction, 2: batch size, 3: hidden size], concatenated along axis 1.
// The concatenated data is the input of the next layer
next_state = LstmState{output_tensor, hidden_out, cell_out};
// The current layer's hidden and cell outputs; stash them and concat at the end
hidden_outputs.push_back(next_state.hidden_);
cell_outputs.push_back(next_state.cell_);
}
// Concat hidden_outputs and cell_outputs to form the op's hidden and cell outputs
AddLSTMCalculation above also has to handle the batch dimension, i.e. the input of a single forward computation should have dims (sequence_size_, input_size_).
AddLSTMCalculation:
for (int batch_index = 0; batch_index < batch_size_; batch_index++) {
LstmState one_batch_input_state;
nvinfer1::ITensor *batch_index_tensor = network_->addConstant(nvinfer1::Dims{}, nvinfer1::Weights{kINT32, &INDICES[batch_index], 1})->getOutput(0);
one_batch_input_state.data_ = network_->addGather(*input_state.data_, *batch_index_tensor, 0)->getOutput(0);
one_batch_input_state.hidden_ = network_->addGather(*input_state.hidden_, *batch_index_tensor, 0)->getOutput(0);
one_batch_input_state.cell_ = network_->addGather(*input_state.cell_, *batch_index_tensor, 0)->getOutput(0);
// Below: a loop that iterates sequence-length times
nvinfer1::ITensor *one_batch_output = AddLSTMOneLoop(one_batch_input_state, lstm_weights, &one_batch_hidden, &one_batch_cell, is_backward);
all_batch_outputs.push_back(one_batch_output);
all_batch_hidden.push_back(one_batch_hidden);
all_batch_cell.push_back(one_batch_cell);
}
// The hidden output and cell output are the concat of all batches' outputs; mind the axis when concatenating the shapes
The LSTM equations for one iteration are described at http://colah.github.io/posts/2015-08-Understanding-LSTMs/
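For reference, the standard per-step LSTM equations that the loop below implements (gate order here: input, output, forget, cell):
i_t = \sigma(W_i x_t + R_i H_{t-1} + b_i)
o_t = \sigma(W_o x_t + R_o H_{t-1} + b_o)
f_t = \sigma(W_f x_t + R_f H_{t-1} + b_f)
c_t = \tanh(W_c x_t + R_c H_{t-1} + b_c)
C_t = f_t \odot C_{t-1} + i_t \odot c_t
H_t = o_t \odot \tanh(C_t)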
nvinfer1::ILoop *sequence_loop = network_->addLoop();
sequence_loop->addTripLimit(*sequence_size_input, nvinfer1::TripLimit::kCOUNT);
nvinfer1::ITensor *input = sequence_loop->addIterator(*input_state.data_, 0, is_backward)->getOutput(0); // iterator direction: forward vs. backward input order
nvinfer1::ILayer *hidden_mid = sequence_loop->addRecurrence(*input_state.hidden_);
nvinfer1::ILayer *cell_mid = sequence_loop->addRecurrence(*input_state.cell_);
// Combine the input, hidden state, weights and bias in one shot:
// X[t] * W + H[t-1] * R + b
nvinfer1::ITensor *input_matmul = network_->addMatrixMultiply(*input, nvinfer1::MatrixOperation::kVECTOR, *lstm_weights.input_weights_, nvinfer1::MatrixOperation::kTRANSPOSE)->getOutput(0);
nvinfer1::ITensor *hidden_matmul = network_->addMatrixMultiply(*hidden_mid->getOutput(0), nvinfer1::MatrixOperation::kVECTOR, *lstm_weights.state_weights_, nvinfer1::MatrixOperation::kTRANSPOSE)->getOutput(0);
nvinfer1::ITensor *weights_add = network_->addElementWise(*input_matmul, *hidden_matmul, nvinfer1::ElementWiseOperation::kSUM)->getOutput(0);
nvinfer1::ITensor *bias = network_->addElementWise(*lstm_weights.input_bias_, *lstm_weights.state_bias_, nvinfer1::ElementWiseOperation::kSUM)->getOutput(0);
nvinfer1::ITensor *gates_calculate = network_->addElementWise(*weights_add, *bias, nvinfer1::ElementWiseOperation::kSUM)->getOutput(0);
// Slice out each gate following the weight layout; the order here is input, output, forget, cell
const auto isolateGate = [&](nvinfer1::ITensor &gates, int gateIndex) ->nvinfer1::ITensor * {
nvinfer1::ISliceLayer *slice = network_->addSlice(gates, nvinfer1::Dims{1, {gateIndex * params_.hidden_size_}}, nvinfer1::Dims{1, {params_.hidden_size_}}, nvinfer1::Dims{1, {1}});
return Reshape(slice->getOutput(0), nvinfer1::Dims{1, {params_.hidden_size_}});
};
nvinfer1::ITensor *i = network_->addActivation(*isolateGate(*gates_calculate, 0), nvinfer1::ActivationType::kSIGMOID)->getOutput(0);
nvinfer1::ITensor *o = network_->addActivation(*isolateGate(*gates_calculate, 1), nvinfer1::ActivationType::kSIGMOID)->getOutput(0);
nvinfer1::ITensor *f = network_->addActivation(*isolateGate(*gates_calculate, 2), nvinfer1::ActivationType::kSIGMOID)->getOutput(0);
nvinfer1::ITensor *c = network_->addActivation(*isolateGate(*gates_calculate, 3), nvinfer1::ActivationType::kTANH)->getOutput(0);
// Compute the cell and hidden outputs of one iteration
nvinfer1::ITensor *C = network_->addElementWise(
    *network_->addElementWise(*f, *cell_mid->getOutput(0), nvinfer1::ElementWiseOperation::kPROD)->getOutput(0),
    *network_->addElementWise(*i, *c, nvinfer1::ElementWiseOperation::kPROD)->getOutput(0),
    nvinfer1::ElementWiseOperation::kSUM)->getOutput(0);
nvinfer1::ITensor *H = network_->addElementWise(*o, *network_->addActivation(*C, nvinfer1::ActivationType::kTANH)->getOutput(0), nvinfer1::ElementWiseOperation::kPROD)->getOutput(0);
// Recurrence: this iteration's outputs feed the next iteration
cell_mid->setInput(1, *C);
hidden_mid->setInput(1, *H);
// output_mode is nvinfer1::LoopOutput::kCONCATENATE for forward and kREVERSE for backward
nvinfer1::ILoopOutputLayer *output_layer = sequence_loop->addLoopOutput(*H, output_mode);
output_layer->setInput(1, *lstm_weights.max_seq_size_);
// The output shapes all come out as -1; set them explicitly so the downstream operators are not affected
hidden_out = Reshape(sequence_loop->addLoopOutput(*hidden_mid->getOutput(0), nvinfer1::LoopOutput::kLAST_VALUE)->getOutput(0), nvinfer1::Dims3(1, 1, hidden_size_));
cell_out = Reshape(sequence_loop->addLoopOutput(*cell_mid->getOutput(0), nvinfer1::LoopOutput::kLAST_VALUE)->getOutput(0), nvinfer1::Dims3(1, 1, hidden_size_));
loop_out = Reshape(output_layer->getOutput(0), nvinfer1::Dims4(sequence_size_, 1, 1, hidden_size_));
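The Reshape helper used above is not part of the snippet; a minimal sketch based on an IShuffleLayer (assuming network_ is the INetworkDefinition being built) could look like:
nvinfer1::ITensor *Reshape(nvinfer1::ITensor *input, nvinfer1::Dims dims) {
  // Static reshape: wrap the tensor in a shuffle layer and set the constant target dims.
  nvinfer1::IShuffleLayer *shuffle = network_->addShuffle(*input);
  shuffle->setReshapeDimensions(dims);
  return shuffle->getOutput(0);
}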
Device memory: you allocate the input tensors' memory yourself.
// The tensor name is the unique identifier; it is bound to an index
int index = engine_->getBindingIndex(name);
tensor_bindings[index] = device_ptr; // device_ptr is a device address you cudaMalloc'd yourself
// For inference, the memory addresses of all input and output tensors have to be recorded in the bindings
trt_context->executeV2(tensor_bindings);
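A minimal end-to-end sketch of the binding setup (the names engine_, host_input and input_byte_size are assumptions; error handling omitted):
nvinfer1::IExecutionContext *trt_context = engine_->createExecutionContext();
std::vector<void *> tensor_bindings(engine_->getNbBindings(), nullptr);

// Allocate device memory for a binding and register it at the index bound to the tensor name.
int index = engine_->getBindingIndex("input");
void *device_ptr = nullptr;
cudaMalloc(&device_ptr, input_byte_size);                                    // size derived from the binding dims
cudaMemcpy(device_ptr, host_input, input_byte_size, cudaMemcpyHostToDevice); // copy the input data to the device
tensor_bindings[index] = device_ptr;
// ... repeat for the remaining inputs and all outputs ...

trt_context->executeV2(tensor_bindings.data());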