The core of NVIDIA® TensorRT™ is a C++ library that facilitates high-performance inference on NVIDIA graphics processing units (GPUs).
TensorRT takes a trained network, which consists of a network definition and a set of trained parameters, and produces a highly optimized runtime engine that performs inference for that network.
TensorRT provides C++ and Python APIs that let you express deep learning models through the Network Definition API, or load a predefined model through the parsers, so that TensorRT can optimize and run them on an NVIDIA GPU.
TensorRT applies graph optimizations and layer fusion, among other optimizations, while also finding the fastest implementation of the model from a diverse collection of highly optimized kernels.
TensorRT also supplies a runtime that you can use to execute this network on all of NVIDIA’s GPUs from the Kepler generation onwards.
TensorRT also includes optional high-speed mixed-precision capabilities introduced in the Tegra™ X1, and extended with the Pascal™, Volta™, Turing™, and NVIDIA® Ampere GPU architectures.
Speeding Up Deep Learning Inference Using TensorRT
1. Convert the pretrained image segmentation PyTorch model into ONNX.
2. Import the ONNX model into TensorRT.
3. Apply optimizations and generate an engine.
4. Perform inference on the GPU.
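TensorRT dynamic input (dynamic shapes)
With dynamic shapes, the dynamic dimensions are not fixed when the engine is defined. They are written as -1 and resolved automatically at inference time.
TensorRT 6 and later support dynamic input. Each dynamic input must be bound to a profile that specifies its minimum, maximum, and typical (optimum) dimensions. An input outside this range raises an error.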
The optimization profile lets you set the minimum, optimum, and maximum dimensions for each input.
The builder selects the kernel that gives the lowest runtime for the input tensor dimensions and that is valid for all input tensor dimensions in the range between the minimum and maximum dimensions.
It also converts the network object into a TensorRT engine.
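The range rule for an optimization profile can be illustrated without TensorRT. The sketch below is plain Python, not a TensorRT API: `shape_in_profile` is a hypothetical helper, and the minimum and maximum shapes are the 512 and 1600 bounds used in the `set_shape` call in these notes. It only shows which input shapes a profile would accept.

```python
from typing import Sequence


def shape_in_profile(shape: Sequence[int],
                     min_shape: Sequence[int],
                     max_shape: Sequence[int]) -> bool:
    """Return True if every dimension of `shape` lies within [min, max].

    Hypothetical helper for illustration; TensorRT performs its own check.
    """
    if not (len(shape) == len(min_shape) == len(max_shape)):
        return False
    return all(lo <= d <= hi for d, lo, hi in zip(shape, min_shape, max_shape))


# Assumed NCHW profile bounds: 512x512 minimum, 1600x1600 maximum.
MIN_SHAPE = (1, 3, 512, 512)
MAX_SHAPE = (1, 3, 1600, 1600)

print(shape_in_profile((1, 3, 1024, 1024), MIN_SHAPE, MAX_SHAPE))  # True
print(shape_in_profile((1, 3, 2048, 2048), MIN_SHAPE, MAX_SHAPE))  # False
```

A check like this can run before inference to reject an oversized image early instead of waiting for the runtime error.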
The setMaxBatchSize function specifies the maximum batch size that a TensorRT engine expects.
The setMaxWorkspaceSize function allows you to increase the GPU memory footprint during the engine building phase.
Inputs are copied from host (CPU) to device (GPU) within launchInference.
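# For dynamic input: the lines marked "added" are what changes.
# TRT_LOGGER, common, dynamic_input, context and input_shape are defined elsewhere.
import tensorrt as trt

with trt.Builder(TRT_LOGGER) as builder, builder.create_network(common.EXPLICIT_BATCH) as network, \
        trt.OnnxParser(network, TRT_LOGGER) as parser, \
        builder.create_builder_config() as config:
    if dynamic_input:
        # Added: set_shape takes (input name, min, opt, max), in that order.
        profile = builder.create_optimization_profile()
        profile.set_shape(network.get_input(0).name, (1,3,512,512), (1,3,1024,1024), (1,3,1600,1600))
        # Added: the profile only takes effect once it is registered with the config.
        config.add_optimization_profile(profile)

# At inference time
context.active_optimization_profile = 0  # added
origin_inputshape = context.get_binding_shape(0)
# Added: fill in the dynamic H and W with the actual input size.
if origin_inputshape[-1] == -1:
    origin_inputshape[-2], origin_inputshape[-1] = input_shape
    context.set_binding_shape(0, origin_inputshape)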
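CHW: better for the GPU. When CUDA is used for inference or post-processing, CHW lets threads read memory in a coalesced manner, which matters because of how hardware DRAM is accessed. For a detailed performance comparison, see Programming Massively Parallel Processors.
HWC: better for the CPU. When processing on the CPU, the HWC format keeps the data handled by a single thread at contiguous memory addresses. CPU caches exploit spatial locality, so this greatly improves efficiency.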
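The difference between the two layouts comes down to how a (channel, row, column) index maps to a flat memory offset. The sketch below uses two hypothetical helpers, `chw_offset` and `hwc_offset`, on a tiny 3x2x4 tensor. It assumes a GPU kernel where adjacent threads handle adjacent pixels of one channel, and a CPU loop where one thread reads all channels of one pixel.

```python
def chw_offset(c, h, w, C, H, W):
    """Flat offset of element (c, h, w) in a CHW (planar) tensor."""
    return (c * H + h) * W + w


def hwc_offset(c, h, w, C, H, W):
    """Flat offset of element (c, h, w) in an HWC (interleaved) tensor."""
    return (h * W + w) * C + c


C, H, W = 3, 2, 4

# Neighbouring pixels of channel 0: the access pattern assumed for adjacent GPU threads.
chw_row = [chw_offset(0, 0, w, C, H, W) for w in range(W)]
hwc_row = [hwc_offset(0, 0, w, C, H, W) for w in range(W)]

# All channels of pixel (0, 0): the access pattern assumed for a single CPU thread.
chw_pixel = [chw_offset(c, 0, 0, C, H, W) for c in range(C)]
hwc_pixel = [hwc_offset(c, 0, 0, C, H, W) for c in range(C)]

print(chw_row)    # [0, 1, 2, 3]   contiguous, so reads coalesce
print(hwc_row)    # [0, 3, 6, 9]   strided by C
print(chw_pixel)  # [0, 8, 16]     strided by H * W
print(hwc_pixel)  # [0, 1, 2]      contiguous, cache friendly
```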
Advantages of the Python API
The benefit of the Python API is that data preprocessing and postprocessing are easy, because you can use a variety of third-party libraries such as NumPy and SciPy.
Advantages of the C++ API
The C++ API should be used in situations where safety is important, for example, in automotive.
Python API vs. C++ API inference time
Inference time should be nearly identical between the Python API and C++ API.
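One way to check this claim on the Python side is to time the inference call after a few warm-up runs. The sketch below is a generic timing helper, not part of TensorRT: `mean_latency_ms` and `dummy_infer` are hypothetical names. With a real engine, the callable passed in would need to block until the GPU has finished (for example by synchronizing the CUDA stream), otherwise only the launch overhead is measured.

```python
import time


def mean_latency_ms(infer, warmup=10, runs=100):
    """Average wall-clock time of `infer()` in milliseconds, after warm-up calls."""
    for _ in range(warmup):
        infer()
    start = time.perf_counter()
    for _ in range(runs):
        infer()
    return (time.perf_counter() - start) * 1000.0 / runs


def dummy_infer():
    """Stand-in workload; replace with a blocking call into the engine."""
    sum(i * i for i in range(1000))


print(f"{mean_latency_ms(dummy_infer):.3f} ms per call")
```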
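A reformat layer converts between data formats. For example, if the previous layer outputs FP32 and the next layer requires INT8 input, TensorRT inserts a reformat layer between the two to perform the conversion.
With the logger set to VERBOSE, TensorRT's logging system automatically prints information such as the operator type and data precision used by each layer once the engine has been built.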
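To make the FP32 to INT8 case concrete, the sketch below shows a conceptual symmetric quantization in plain Python. It is not TensorRT's reformat kernel: `quantize_int8`, `dequantize_int8`, and the scale of 0.25 are assumptions chosen for illustration. In a real engine the scale would come from calibration or from the model itself.

```python
def quantize_int8(values, scale):
    """Symmetric quantization: q = clamp(round(x / scale), -128, 127)."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def dequantize_int8(q_values, scale):
    """Map INT8 values back to approximate floating-point values."""
    return [q * scale for q in q_values]


fp32_activations = [-100.0, -0.6, 0.0, 0.3, 1.2, 100.0]
scale = 0.25  # hypothetical scale; a power of two keeps the arithmetic exact

q = quantize_int8(fp32_activations, scale)
print(q)                          # [-128, -2, 0, 1, 5, 127]
print(dequantize_int8(q, scale))  # [-32.0, -0.5, 0.0, 0.25, 1.25, 31.75]
```

Values outside the representable range saturate at -128 or 127, so the choice of scale is a trade-off between range and resolution.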