Official site: https://developer.nvidia.com/tensorrt
NVIDIA® TensorRT™ is an SDK for optimizing trained deep learning models to enable high-performance inference. TensorRT contains a deep learning inference optimizer for trained deep learning models, and a runtime for execution. After you have trained your deep learning model in a framework of your choice, TensorRT enables you to run it with higher throughput and lower latency.
From NVIDIA's official introduction, TensorRT is an SDK for already-trained models that enables high-performance inference on NVIDIA devices. So what optimizations does TensorRT actually apply to a trained model? The figure on the TensorRT website, shown below, summarizes them:
To summarize, there are six main points:
- Reduced Precision: quantize the model to INT8 or FP16 (with accuracy preserved or only slightly reduced) to speed up inference.
- Layer and Tensor Fusion: fuse multiple layers (both horizontally and vertically) to make better use of GPU memory and bandwidth.
- Kernel Auto-Tuning: select the best data layers and algorithms for the target GPU platform.
- Dynamic Tensor Memory: minimize memory footprint and reuse tensor memory efficiently.
- Multi-Stream Execution: process multiple input streams in parallel with a scalable design.
- Time Fusion: optimize recurrent neural networks over time steps with dynamically generated kernels.
Installing TensorRT
It is best to follow the official tutorial directly. The latest TensorRT quick start guide:
https://docs.nvidia.com/deeplearning/tensorrt/quick-start-guide/index.html
or the quick start guide for a specific TensorRT version (taking the current latest stable release, 8.2.5, as an example):
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-825/quick-start-guide/index.html
For installing TensorRT, the official documentation lists the three methods below, but I personally prefer the TAR Package installation (for other methods see https://docs.nvidia.com/deeplearning/tensorrt/install-guide/index.html):
Container Installation
Debian Installation
pip Wheel File Installation
If you know how to use Docker, the Container Installation is recommended; this article starts with the pip Wheel File Installation. The official quick start guide for the pip Wheel File Installation (8.2.5) explicitly states that only Python 3.6 to 3.9 and CUDA 11.x are supported, and only the Linux operating system on the x86_64 CPU architecture; the official recommendation is CentOS 7 or Ubuntu 18.04 (or newer).
The pip-installable nvidia-tensorrt Python wheel files only support Python versions 3.6 to 3.9 and CUDA 11.x at this time and will not work with other Python or CUDA versions. Only the Linux operating system and x86_64 CPU architecture is currently supported. These wheel files are expected to work on CentOS 7 or newer and Ubuntu 18.04 or newer.
Beyond the requirements above, also check your GPU driver version: different CUDA versions require different minimum driver versions, and the TensorRT (8.2.5) installed here requires CUDA 11.x, so make sure your driver is new enough. You can check the driver version with the nvidia-smi command, and the mapping between CUDA versions and GPU driver versions is listed on NVIDIA's website:
https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html
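If you prefer to check these constraints from a script rather than by eye, the snippet below is a minimal sketch (it assumes nvidia-smi is installed and on PATH) that prints the local Python version and the installed NVIDIA driver version:
import sys
import subprocess

# Print the local Python version (the 8.2.5 pip wheels require 3.6-3.9).
print("Python:", sys.version.split()[0])

# Query the NVIDIA driver version via nvidia-smi
# (assumes the driver and nvidia-smi are installed and on PATH).
result = subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    capture_output=True, text=True, check=True)
print("NVIDIA driver:", result.stdout.strip())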
Once the GPU driver meets the requirement, it is recommended to first create a new conda virtual environment (so other environments are not affected). Here we create a virtual environment named tensorrt with Python 3.8:
conda create -n tensorrt python=3.8
After creating the virtual environment, activate it:
conda activate tensorrt
Next, install nvidia-pyindex and nvidia-tensorrt. Note that if you do not pin the version of nvidia-tensorrt, the latest release is installed by default; this article uses version 8.2.5, so the currently available 8.2.5.1 is installed here:
pip install nvidia-pyindex
pip install nvidia-tensorrt==8.2.5.1
After installation, follow the official steps to check that it succeeded: enter the Python interpreter and simply print the version number and so on. As long as no error is raised, the installation worked.
import tensorrt
print(tensorrt.__version__)
assert tensorrt.Builder(tensorrt.Logger())
However, when I later followed the official tutorial and tried to convert a model with trtexec, the tool could not be found. I suspect the pip installation only provides the TensorRT runtime and does not ship the trtexec tool.
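A quick way to confirm whether trtexec is visible in the current environment is to look it up on PATH, for example with the small sketch below:
import shutil

# shutil.which returns the full path of the executable if it is on PATH,
# otherwise None. With a pip-only TensorRT install this typically prints None.
print(shutil.which("trtexec"))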
The installation process mainly follows the official guide: https://docs.nvidia.com/deeplearning/tensorrt/install-guide/index.html#installing-tar
Before installing, make sure the prerequisites listed in that guide (CUDA, cuDNN, etc.) are in place.
Go to the official TensorRT download page (login required) and download the matching package. Here I downloaded TensorRT 8.2 GA Update 4 for Linux x86_64 and CUDA 11.0, 11.1, 11.2, 11.3, 11.4 and 11.5 TAR Package:
After the download finishes, extract the archive:
tar -xzvf TensorRT-8.2.5.1.Linux.x86_64-gnu.cuda-11.4.cudnn8.2.tar.gz
Extracting produces a TensorRT-8.2.5.1 folder. Next, add the TensorRT-8.2.5.1/lib folder to the LD_LIBRARY_PATH environment variable. Note that I placed the TensorRT-8.2.5.1 folder under /root, so I set /root/TensorRT-8.2.5.1/lib; adjust the path to wherever you extracted it. Likewise, add the TensorRT-8.2.5.1/bin folder, which contains the trtexec tool needed later, to the PATH environment variable:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/root/TensorRT-8.2.5.1/lib
export PATH=$PATH:/root/TensorRT-8.2.5.1/bin
Next, go into the TensorRT-8.2.5.1/python folder and install the TensorRT wheel file. The folder contains whl files for different Python versions; since the Python version in my virtual environment is 3.8, install the cp38 wheel:
cd TensorRT-8.2.5.1/python
pip install tensorrt-8.2.5.1-cp38-none-linux_x86_64.whl
Then go into the TensorRT-8.2.5.1/graphsurgeon folder and install the graphsurgeon wheel file:
cd TensorRT-8.2.5.1/graphsurgeon
pip install graphsurgeon-0.4.5-py2.py3-none-any.whl
Then go into the TensorRT-8.2.5.1/onnx-graphsurgeon folder and install the onnx-graphsurgeon wheel file:
cd TensorRT-8.2.5.1/onnx-graphsurgeon
pip install onnx_graphsurgeon-0.3.12-py2.py3-none-any.whl
After installation, you can again enter the Python interpreter and simply print the version number and so on; as long as no error is raised, the installation succeeded.
import tensorrt
print(tensorrt.__version__)
assert tensorrt.Builder(tensorrt.Logger())
According to the official documentation, the TensorRT conversion workflow consists of the following five steps:
- Export the Model: export the model
- Select A Batch Size: choose a batch size that fits your project
- Select A Precision: choose a precision, e.g. INT8, FP16, or FP32
- Convert The Model: convert the model
- Deploy The Model: deploy the model
So which model formats can be exported and converted into a TensorRT model? The official documentation mentions three ways:
- using TF-TRT: use TF-TRT (TensorFlow-TensorRT)
- automatic ONNX conversion from .onnx files: convert from the universal ONNX format (note that you have to convert your model to ONNX yourself first)
- manually constructing a network using the TensorRT API (either in C++ or Python): build the network yourself with the TensorRT API (not very beginner-friendly and fairly difficult)
You can also refer to the figure below: for a PyTorch model, we generally convert it to the universal ONNX format first and then to a TensorRT model, and for deployment you can choose either C++ or Python:
Following the above, the usual flow for converting a PyTorch model to TensorRT is: convert to the universal ONNX format first, then convert to TensorRT.
Here we take the ResNet34 provided by PyTorch as an example: instantiate ResNet34 directly from torchvision, load the weights I trained on the flower_photos dataset, and then convert the model to ONNX format. Sample code:
import torch
import torch.onnx
import onnx
import onnxruntime
import numpy as np
from torchvision.models import resnet34

device = torch.device("cpu")


def to_numpy(tensor):
    return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()


def main():
    weights_path = "resNet34(flower).pth"
    onnx_file_name = "resnet34.onnx"
    batch_size = 1
    img_h = 224
    img_w = 224
    img_channel = 3

    # create model and load pretrain weights
    model = resnet34(pretrained=False, num_classes=5)
    model.load_state_dict(torch.load(weights_path, map_location='cpu'))
    model.eval()

    # input to the model
    # [batch, channel, height, width]
    x = torch.rand(batch_size, img_channel, img_h, img_w, requires_grad=True)
    torch_out = model(x)

    # export the model
    torch.onnx.export(model,           # model being run
                      x,               # model input (or a tuple for multiple inputs)
                      onnx_file_name,  # where to save the model (can be a file or file-like object)
                      input_names=["input"],
                      output_names=["output"],
                      verbose=False)

    # check onnx model
    onnx_model = onnx.load(onnx_file_name)
    onnx.checker.check_model(onnx_model)

    ort_session = onnxruntime.InferenceSession(onnx_file_name)
    # compute ONNX Runtime output prediction
    ort_inputs = {ort_session.get_inputs()[0].name: to_numpy(x)}
    ort_outs = ort_session.run(None, ort_inputs)

    # compare ONNX Runtime and Pytorch results
    # assert_allclose: Raises an AssertionError if two objects are not equal up to desired tolerance.
    np.testing.assert_allclose(to_numpy(torch_out), ort_outs[0], rtol=1e-03, atol=1e-05)
    print("Exported model has been tested with ONNXRuntime, and the result looks good!")


if __name__ == '__main__':
    main()
Note that after converting the PyTorch model to ONNX, the exported model is loaded with ONNX Runtime, the same input is fed through it, and the outputs before and after conversion are compared with np.testing.assert_allclose, where rtol is the relative tolerance and atol the absolute tolerance; if the difference exceeds the specified tolerance, an error is raised. After the conversion, a resnet34.onnx file is generated in the current directory.
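For reference, np.testing.assert_allclose considers two elements equal when |actual - desired| <= atol + rtol * |desired|; the toy example below (with made-up numbers) illustrates the check used above:
import numpy as np

desired = np.array([1.0, 100.0])
actual = np.array([1.00002, 100.05])

# Per-element tolerance: atol + rtol * |desired|
#   = 1e-05 + 1e-03 * [1.0, 100.0] = [0.00101, 0.10001]
# |actual - desired| = [0.00002, 0.05], both within tolerance, so this passes.
np.testing.assert_allclose(actual, desired, rtol=1e-03, atol=1e-05)
print("outputs match within tolerance")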
There are several ways to convert ONNX into a TensorRT engine, the simplest of which is the trtexec tool. Section 3.1 above already converted the PyTorch ResNet34 to ONNX format, so we can now use trtexec directly to convert it into the TensorRT engine format:
trtexec --onnx=resnet34.onnx --saveEngine=trt_output/resnet34.trt
where:
- --onnx is the path to the generated ONNX model file
- --saveEngine is the path where the TensorRT engine is saved (one small catch: the target directory must already exist, otherwise trtexec reports an error)
During conversion the terminal prints output like the following:
[06/23/2022-08:08:14] [I] === Model Options ===
[06/23/2022-08:08:14] [I] Format: ONNX
[06/23/2022-08:08:14] [I] Model: /root/project/resnet34.onnx
[06/23/2022-08:08:14] [I] Output:
[06/23/2022-08:08:14] [I] === Build Options ===
[06/23/2022-08:08:14] [I] Max batch: explicit batch
[06/23/2022-08:08:14] [I] Workspace: 16 MiB
[06/23/2022-08:08:14] [I] minTiming: 1
[06/23/2022-08:08:14] [I] avgTiming: 8
[06/23/2022-08:08:14] [I] Precision: FP32
[06/23/2022-08:08:14] [I] Calibration:
[06/23/2022-08:08:14] [I] Refit: Disabled
[06/23/2022-08:08:14] [I] Sparsity: Disabled
[06/23/2022-08:08:14] [I] Safe mode: Disabled
[06/23/2022-08:08:14] [I] DirectIO mode: Disabled
[06/23/2022-08:08:14] [I] Restricted mode: Disabled
[06/23/2022-08:08:14] [I] Save engine: trt_ouput/resnet34.trt
[06/23/2022-08:08:14] [I] Load engine:
[06/23/2022-08:08:14] [I] Profiling verbosity: 0
[06/23/2022-08:08:14] [I] Tactic sources: Using default tactic sources
[06/23/2022-08:08:14] [I] timingCacheMode: local
[06/23/2022-08:08:14] [I] timingCacheFile:
[06/23/2022-08:08:14] [I] Input(s)s format: fp32:CHW
[06/23/2022-08:08:14] [I] Output(s)s format: fp32:CHW
[06/23/2022-08:08:14] [I] Input build shapes: model
[06/23/2022-08:08:14] [I] Input calibration shapes: model
......
[06/23/2022-08:08:41] [I] === Performance summary ===
[06/23/2022-08:08:41] [I] Throughput: 550.406 qps
[06/23/2022-08:08:41] [I] Latency: min = 1.85938 ms, max = 2.23706 ms, mean = 1.87513 ms, median = 1.87372 ms, percentile(99%) = 1.90234 ms
[06/23/2022-08:08:41] [I] End-to-End Host Latency: min = 1.87573 ms, max = 3.56226 ms, mean = 3.38754 ms, median = 3.47742 ms, percentile(99%) = 3.50659 ms
[06/23/2022-08:08:41] [I] Enqueue Time: min = 0.402954 ms, max = 2.53369 ms, mean = 0.68202 ms, median = 0.653564 ms, percentile(99%) = 0.830811 ms
[06/23/2022-08:08:41] [I] H2D Latency: min = 0.0581055 ms, max = 0.0943298 ms, mean = 0.063807 ms, median = 0.0615234 ms, percentile(99%) = 0.0910645 ms
[06/23/2022-08:08:41] [I] GPU Compute Time: min = 1.79099 ms, max = 2.14551 ms, mean = 1.80203 ms, median = 1.80127 ms, percentile(99%) = 1.8125 ms
[06/23/2022-08:08:41] [I] D2H Latency: min = 0.00610352 ms, max = 0.0129395 ms, mean = 0.00928149 ms, median = 0.00949097 ms, percentile(99%) = 0.0119934 ms
[06/23/2022-08:08:41] [I] Total Host Walltime: 3.00324 s
[06/23/2022-08:08:41] [I] Total GPU Compute Time: 2.97876 s
[06/23/2022-08:08:41] [I] Explanations of the performance metrics are printed in the verbose logs.
For details on how to use the trtexec tool, run trtexec --help; for example, to build the model with FP16 precision, simply add the --fp16 flag.
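Since trtexec does not create the output directory for you (the small catch mentioned above), a thin wrapper can be convenient; the sketch below is just one way to do it from Python, assuming trtexec is on PATH and reusing the file names from earlier:
import os
import subprocess


def build_engine(onnx_path: str, engine_path: str, fp16: bool = False):
    # trtexec does not create missing directories, so create the target first.
    out_dir = os.path.dirname(engine_path)
    if out_dir:
        os.makedirs(out_dir, exist_ok=True)

    cmd = ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={engine_path}"]
    if fp16:
        cmd.append("--fp16")  # build the engine with FP16 precision enabled
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    build_engine("resnet34.onnx", "trt_output/resnet34.trt", fp16=False)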
This part mainly follows the official notebook tutorial: https://github.com/NVIDIA/TensorRT/blob/main/quickstart/SemanticSegmentation/tutorial-runtime.ipynb
Below is a sample I wrote based on the official demo; it compares the outputs of ONNX and TensorRT.
import numpy as np
import tensorrt as trt
import onnxruntime
import pycuda.driver as cuda
import pycuda.autoinit


def normalize(image: np.ndarray) -> np.ndarray:
    """
    Normalize the image to the given mean and standard deviation
    """
    image = image.astype(np.float32)
    mean = (0.485, 0.456, 0.406)
    std = (0.229, 0.224, 0.225)
    image /= 255.0
    image -= mean
    image /= std
    return image


def onnx_inference(onnx_path: str, image: np.ndarray):
    # load onnx model
    ort_session = onnxruntime.InferenceSession(onnx_path)
    # compute onnx Runtime output prediction
    ort_inputs = {ort_session.get_inputs()[0].name: image}
    res_onnx = ort_session.run(None, ort_inputs)[0]
    return res_onnx


def trt_inference(trt_path: str, image: np.ndarray):
    # Load the network in Inference Engine
    trt_logger = trt.Logger(trt.Logger.WARNING)
    with open(trt_path, "rb") as f, trt.Runtime(trt_logger) as runtime:
        engine = runtime.deserialize_cuda_engine(f.read())

    with engine.create_execution_context() as context:
        # Set input shape based on image dimensions for inference
        context.set_binding_shape(engine.get_binding_index("input"), (1, 3, image.shape[-2], image.shape[-1]))
        # Allocate host and device buffers
        bindings = []
        for binding in engine:
            binding_idx = engine.get_binding_index(binding)
            size = trt.volume(context.get_binding_shape(binding_idx))
            dtype = trt.nptype(engine.get_binding_dtype(binding))
            if engine.binding_is_input(binding):
                input_buffer = np.ascontiguousarray(image)
                input_memory = cuda.mem_alloc(image.nbytes)
                bindings.append(int(input_memory))
            else:
                output_buffer = cuda.pagelocked_empty(size, dtype)
                output_memory = cuda.mem_alloc(output_buffer.nbytes)
                bindings.append(int(output_memory))

        stream = cuda.Stream()
        # Transfer input data to the GPU.
        cuda.memcpy_htod_async(input_memory, input_buffer, stream)
        # Run inference
        context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
        # Transfer prediction output from the GPU.
        cuda.memcpy_dtoh_async(output_buffer, output_memory, stream)
        # Synchronize the stream
        stream.synchronize()

    res_trt = np.reshape(output_buffer, (1, -1))
    return res_trt


def main():
    image_h = 224
    image_w = 224
    onnx_path = "resnet34.onnx"
    trt_path = "trt_output/resnet34.trt"

    image = np.random.randn(image_h, image_w, 3)
    normalized_image = normalize(image)

    # Convert the resized images to network input shape
    # [h, w, c] -> [c, h, w] -> [1, c, h, w]
    normalized_image = np.expand_dims(np.transpose(normalized_image, (2, 0, 1)), 0)

    onnx_res = onnx_inference(onnx_path, normalized_image)
    ir_res = trt_inference(trt_path, normalized_image)
    np.testing.assert_allclose(onnx_res, ir_res, rtol=1e-03, atol=1e-05)
    print("Exported model has been tested with TensorRT Runtime, and the result looks good!")


if __name__ == '__main__':
    main()
Finally, a word on model quantization. Roughly speaking (not rigorously), quantization falls into two categories: QAT (Quantization Aware Training), where quantization is simulated during training, and PTQ (Post Training Quantization), where the model is quantized after training. Since there are so many deep learning frameworks and runtimes nowadays (TensorFlow's TF-Lite, PyTorch's TorchScript, ONNX, TensorRT, OpenVINO, and so on), there is also a pile of quantization tools. For QAT I recommend NVIDIA's pytorch-quantization toolkit. For PTQ, TensorRT is recommended when deploying on NVIDIA GPUs, and OpenVINO is worth trying when deploying on CPUs.
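As a rough illustration of the QAT route, the sketch below shows the documented entry point of the pytorch-quantization toolkit, which swaps standard PyTorch layers for quantized counterparts; treat it as a starting point only, since calibration, fine-tuning, and ONNX export are omitted:
from torchvision.models import resnet34

# pytorch-quantization monkey-patches common torch.nn layers with quantized
# versions once initialize() has been called.
from pytorch_quantization import quant_modules

quant_modules.initialize()

# Models built after initialize() contain fake-quantization nodes that are
# calibrated/fine-tuned during QAT and later honored by TensorRT in INT8 mode.
model = resnet34(pretrained=False, num_classes=5)
print(type(model.conv1))  # expect a quantized conv module rather than nn.Conv2d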