A while back I wrote an article on model conversion, compilation, and deployment with TVM:
牛先生: Neural Network Compiler TVM: autoTVM Automatic Code Optimization and C++/CUDA Deployment in Practice
as well as a more principle-oriented article on TensorRT:
牛先生: TensorRT Quantization Practice Handbook
Since then I've noticed that more people are actually using TensorRT, so today I'll summarize TensorRT's Python interfaces for model conversion, model inference, and model quantization.
I have run TensorRT both on an Ubuntu desktop and on Jetson. Overall the interfaces are identical and can be used directly; the Jetson TensorRT version is a bit older, so some features may not be supported.
First, get hold of an ONNX model (there are plenty of articles explaining how to export one, so I won't repeat that here). The overall workflow is: export the ONNX model, build a TensorRT engine from it, run inference with the engine, and optionally quantize to INT8.
Let's work through that pipeline!
Assume the platform is an Ubuntu 18.04 x86_64 desktop. The setup is essentially the same as any GPU inference environment: install the driver, install CUDA, and install the TensorRT package. Here is the TensorRT download page:
NVIDIA Developer: developer.nvidia.com/nvidia-tensorrt-8x-download
Download the TAR package, extract it, and configure the relevant environment variables. The official installation guide is here:
NVIDIA Deep Learning TensorRT Documentation: docs.nvidia.com/deeplearning/tensorrt/install-guide/index.html#maclearn-net-repo-install
I used the TAR installation method.
You don't need to install all of the bundled wheel packages; I only installed the TensorRT wheel file.
Also note that the lib directory from the extracted package must be added to the ldconfig search path, followed by running ldconfig to make it take effect.
Once that's done, the development environment on an Ubuntu desktop is pretty much ready.
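Once the environment variables and ldconfig are set, a quick way to sanity-check the Python side (assuming the TensorRT wheel and pycuda are both installed) is simply to import the bindings and print the versions:

import tensorrt as trt
import pycuda.autoinit  # creates a CUDA context on import

print("TensorRT version:", trt.__version__)
print("CUDA device:", pycuda.autoinit.device.name())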
Now suppose the environment is a Jetson embedded platform:
First download the JetPack image for flashing; mine is an NX. Download link:
JetPack SDK: developer.nvidia.com/zh-cn/embedded/jetpack
There are plenty of flashing tutorials online, so I won't cover that here. After flashing, TensorRT, CUDA, and OpenCV are all already in the environment.
To make development more pleasant, install a few helper tools and packages.
First set up a VNC server that works well on Jetson:
Enabling VNC on Jetson Nano: blog.csdn.net/weixin_43181350/article/details/106491056
Then install some additional required software and packages:
sudo apt install python3-pip
sudo apt-get install libjpeg-dev zlib1g-dev
sudo apt-get install libprotobuf-dev protobuf-compiler
pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple --trusted-host pypi.tuna.tsinghua.edu.cn statsmodels cython numpy pycuda pillow
sudo pip3 install jetson-stats
PyTorch can also be installed on Jetson; install it as needed.
pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple --trusted-host pypi.tuna.tsinghua.edu.cn ./torch-1.6.0-cp36-cp36m-linux_aarch64.whl
Other torch versions are listed here:
PyTorch for Jetson - version 1.9.0 now available: forums.developer.nvidia.com/t/pytorch-for-jetson-version-1-9-0-now-available/72048
The official API documentation is here:
NVIDIA TensorRT Standard Python API Documentation 8.0.1: docs.nvidia.com/deeplearning/tensorrt/api/python_api/index.html
As for the IDE, PyCharm plus CLion works fine, and VS Code is also an option. I find code navigation and remote deployment/debugging with PyCharm and CLion quite convenient; give them a try and see whether they suit you.
Alright, now let's walk through the workflow at the code level. This article uses CenterFace as the example model to demonstrate the whole conversion and inference process.
For ONNX inference, I recommend loading the image data with the function below; its compatibility and efficiency are decent across platforms.
cv2.dnn.blobFromImage(img, scalefactor=1.0, size=(self.img_w_new, self.img_h_new), mean=(0, 0, 0), swapRB=True, crop=False)
The key code boils down to two parts: model loading and model inference.
self.net = cv2.dnn.readNetFromONNX('../models/onnx/cface.1k.onnx')
def inference_opencv(self, img, threshold):
    blob = cv2.dnn.blobFromImage(img, scalefactor=1.0, size=(self.img_w_new, self.img_h_new),
                                 mean=(0, 0, 0), swapRB=True, crop=False)
    self.net.setInput(blob)
    begin = datetime.datetime.now()
    if self.landmarks:
        heatmap, scale, offset, lms = self.net.forward(["537", "538", "539", "540"])
    else:
        heatmap, scale, offset = self.net.forward(["535", "536", "537"])
        lms = None
    end = datetime.datetime.now()
    print("cpu times = ", end - begin)
    return self.postprocess(heatmap, lms, offset, scale, threshold)
The complete ONNX inference code is available at:
GitHub - Star-Clouds/CenterFace (face detection): github.com/Star-Clouds/CenterFace
The underlying principles were covered in the earlier article, so here is the conversion function directly. Take it and use it!
import os, sys
import onnx
import pycuda.driver as cuda
import tensorrt as trt
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
def build_engine_onnx(onnx_file_path, engine_file_path):
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network(
            1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)) as network, \
            trt.OnnxParser(network, TRT_LOGGER) as parser:
        config = builder.create_builder_config()
        config.max_workspace_size = 1 << 30  # 1GB
        # builder.max_batch_size = 1
        # builder.fp16_mode = True
        profile = builder.create_optimization_profile()
        profile.set_shape('input.1', (1, 3, 32, 32), (1, 3, 480, 480), (1, 3, 544, 960))
        config.add_optimization_profile(profile)
        # Parse model file
        if not os.path.exists(onnx_file_path):
            print('ONNX file {} not found, please run yolov3_to_onnx.py first to generate it.'.format(onnx_file_path))
            exit(0)
        print('Loading ONNX file from path {}...'.format(onnx_file_path))
        with open(onnx_file_path, 'rb') as model:
            print('Beginning ONNX file parsing')
            if not parser.parse(model.read()):
                print('parsing of ONNX file Failed ')
                for error in range(parser.num_errors):
                    print(parser.get_error(error))
                return None
        print('Completed parsing of ONNX file')
        print('Building an engine from file {}; this may take a while...'.format(onnx_file_path))
        # network.get_input(0).shape = [1, 3, max_H, max_W]  # use while in static input
        engine = builder.build_engine(network, config)
        print("Completed creating Engine")
        if not os.path.exists(os.path.dirname(engine_file_path)):
            os.makedirs(os.path.dirname(engine_file_path))
        with open(engine_file_path, "wb") as f:
            f.write(engine.serialize())
        return engine
The function takes the path to the ONNX model and the path where the TensorRT engine should be saved.
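For reference, a minimal usage sketch (the ONNX and engine paths here match the ones used elsewhere in this article; adjust them to your own layout):

if __name__ == '__main__':
    engine = build_engine_onnx('../models/onnx/cface.1k.onnx',
                               '../models/tensorrt/centerface.trt')
    if engine is None:
        print('Engine build failed')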
Let me briefly walk through the key parts of the code:
with trt.Builder(TRT_LOGGER) as builder, builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)) as network, \
        trt.OnnxParser(network, TRT_LOGGER) as parser:
The code above creates the builder and the network with the EXPLICIT_BATCH flag, making the batch dimension explicit in the network definition. When the model's input shape is dynamic, this flag must be set.
profile = builder.create_optimization_profile()
profile.set_shape('input.1', (1, 3, 32, 32), (1, 3, 480, 480), (1, 3, 544, 960))
config.add_optimization_profile(profile)
The optimization profile only needs to be set when the model's input is dynamic; otherwise, simply setting the model's input shape is enough.
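For the static case, a minimal sketch is just the commented-out line from the function above (the 544x960 resolution is only the example used in this article):

# Static input: skip the optimization profile and fix the network input shape instead.
network.get_input(0).shape = [1, 3, 544, 960]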
This step runs inference with the TensorRT engine we just generated.
There are three core parts: load the engine, allocate the buffers, and run inference, copying the results back from the device.
self.trt_logger = trt.Logger()  # This logger is required to deserialize an engine
with open("../models/tensorrt/centerface.trt", "rb") as f:
    runtime = trt.Runtime(self.trt_logger)
    engine = runtime.deserialize_cuda_engine(f.read())
For this initialization, the default logger is sufficient.
def allocate_buffers(engine):
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()
    # For dynamic shapes the binding shapes are not final yet, so over-allocate
    # using the largest expected feature-map size (544 x 960 here).
    max_feat_map_size = 544 * 960
    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size * max_feat_map_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # Allocate host and device buffers
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer to device bindings.
        bindings.append(int(device_mem))
        # Append to the appropriate list.
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream
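The HostDeviceMem class used above and the do_inference call used further below are small helpers borrowed from NVIDIA's TensorRT Python samples (common.py). If you don't have the samples at hand, a rough sketch of the two (using execute_async_v2, since the network was built in explicit-batch mode) could look like this:

class HostDeviceMem(object):
    # Pairs a pinned host buffer with its device-side counterpart.
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem


def do_inference(context, bindings, inputs, outputs, stream):
    # Copy inputs host -> device, run the engine asynchronously, copy outputs back, then sync.
    for inp in inputs:
        cuda.memcpy_htod_async(inp.device, inp.host, stream)
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    for out in outputs:
        cuda.memcpy_dtoh_async(out.host, out.device, stream)
    stream.synchronize()
    # Return the flat host-side output arrays.
    return [out.host for out in outputs]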
When allocating buffers for a dynamic-shape engine, note that it is enough to allocate memory blocks sized for the largest expected shape.
context = engine.create_execution_context()
# Allocate buffers for input and output
inputs, outputs, bindings, stream = allocate_buffers(engine)  # host/device buffers and binding pointers
# Do inference
shape_of_output = [(1, 1, int(self.img_h_new / 4), int(self.img_w_new / 4)),
                   (1, 2, int(self.img_h_new / 4), int(self.img_w_new / 4)),
                   (1, 2, int(self.img_h_new / 4), int(self.img_w_new / 4)),
                   (1, 10, int(self.img_h_new / 4), int(self.img_w_new / 4))]
# Call set_binding_shape while in dynamic mode.
context.set_binding_shape(0, (1, 3, self.img_h_new, self.img_w_new))
# Load data to the buffer
inputs[0].host = blob.reshape(-1)
begin = datetime.datetime.now()
trt_outputs = do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs,
                           stream=stream)  # numpy data
trt_outputs = [out[0:shape_of_output[i][0] * shape_of_output[i][1]
                   * shape_of_output[i][2] * shape_of_output[i][3]]
               for i, out in enumerate(trt_outputs)]
end = datetime.datetime.now()
print("gpu times = ", end - begin)
heatmap, scale, offset, lms = [output.reshape(shape) for output, shape in zip(trt_outputs, shape_of_output)]
If the input shape is not dynamic, there is no need to call:
context.set_binding_shape(0, (1, 3, self.img_h_new, self.img_w_new))
nor to slice out the valid portion of each output buffer:
trt_outputs = [out[0:shape_of_output[i][0] * shape_of_output[i][1]
                   * shape_of_output[i][2] * shape_of_output[i][3]]
               for i, out in enumerate(trt_outputs)]
Finally, compare these results against the earlier ONNX inference results. For a more thorough check, run a dataset through both and compare the metrics. If this step does not go smoothly, you can also feed all-zero or all-one inputs first and see whether the output tensors differ significantly.
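As a quick sanity check, a comparison sketch along these lines works (heatmap_onnx and heatmap_trt are just placeholder names for the two outputs being compared):

import numpy as np

def compare_outputs(onnx_out, trt_out, name="output"):
    # Print the maximum absolute difference and whether the two tensors are numerically close.
    onnx_out = np.asarray(onnx_out, dtype=np.float32)
    trt_out = np.asarray(trt_out, dtype=np.float32).reshape(onnx_out.shape)
    max_diff = float(np.abs(onnx_out - trt_out).max())
    print("{}: max abs diff = {:.6f}, allclose = {}".format(
        name, max_diff, np.allclose(onnx_out, trt_out, rtol=1e-3, atol=1e-4)))

# Example: compare_outputs(heatmap_onnx, heatmap_trt, "heatmap")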
Doing the same inference in C++ takes a bit more work. First write the CMakeLists.txt; TensorRT ships with sample projects whose CMake files can be adapted directly. Then add the include and link paths from the TensorRT TAR package to the project and you are ready to go. Also note that for C++ inference, TensorRT provides a common directory implementing a number of frequently used helper functions; compiling it into your project saves some time as well.
The core part of the code is as follows:
runtime = createInferRuntime(sample::gLogger.getTRTLogger());
assert(runtime != nullptr);
engine = runtime->deserializeCudaEngine(trtModelStream, size);
assert(engine != nullptr);
context = engine->createExecutionContext();
assert(context != nullptr);
delete[] trtModelStream;
CHECK(cudaMalloc(&buffers[0], MAX_SIZE_INPUT * sizeof(float)));
CHECK(cudaMalloc(&buffers[1], MAX_SIZE_OUTPUT1 * sizeof(float)));
CHECK(cudaMalloc(&buffers[2], MAX_SIZE_OUTPUT2 * sizeof(float)));
CHECK(cudaMalloc(&buffers[3], MAX_SIZE_OUTPUT3 * sizeof(float)));
CHECK(cudaMalloc(&buffers[4], MAX_SIZE_OUTPUT4 * sizeof(float)));
CHECK(cudaStreamCreate(&stream));
// new[] takes an element count (cudaMalloc above takes a byte count).
input_host = new float[MAX_SIZE_INPUT];
output1_host = new float[MAX_SIZE_OUTPUT1];
output2_host = new float[MAX_SIZE_OUTPUT2];
output3_host = new float[MAX_SIZE_OUTPUT3];
output4_host = new float[MAX_SIZE_OUTPUT4];
准备相关环境,如推理context ,设备端和主机端的内存申请。
if (!context->setBindingDimensions(0, Dims4(1, 3, inputBlob.size[2], inputBlob.size[3]))) {
    printf(" SHAPE SET ERROR ");
    exit(-1);
}
CHECK(cudaMemcpyAsync(buffers[0], input_host, input_size, cudaMemcpyHostToDevice, stream));
context->enqueue(1, buffers, stream, nullptr);
CHECK(cudaMemcpyAsync(output1_host, buffers[1], output1_size, cudaMemcpyDeviceToHost, stream));
CHECK(cudaMemcpyAsync(output2_host, buffers[2], output2_size, cudaMemcpyDeviceToHost, stream));
CHECK(cudaMemcpyAsync(output3_host, buffers[3], output3_size, cudaMemcpyDeviceToHost, stream));
CHECK(cudaMemcpyAsync(output4_host, buffers[4], output4_size, cudaMemcpyDeviceToHost, stream));
cudaStreamSynchronize(stream);
Copy the data to the device and run inference; with a dynamic shape, call setBindingDimensions to set the input dimensions first. The output_host buffers copied back through buffers can then be wrapped in a cv::Mat to access the memory directly, with no performance loss and a good deal of coding convenience.
What remains is post-processing; just implement whatever post-processing logic the model requires. Nothing special to add there.
With the above, the FP32 model is pretty much covered. Now let's talk about the INT8 model.
TensorRT's INT8 quantization interface is quite friendly. Compared with building the FP32 engine, you add a few extra configuration options and implement a calibrator, and that's it. There are several calibrator classes to choose from (for example IInt8EntropyCalibrator2 and IInt8MinMaxCalibrator), and the official documentation describes their differences clearly:
NVIDIA TensorRT Standard Python API Documentation: docs.nvidia.com/deeplearning/tensorrt/api/python_api/infer/Int8/pyInt8.html
For the theory behind quantization, see:
牛先生: TensorRT Quantization Practice Handbook
The configuration code is as follows:
def build_engine_onnx_int8(onnx_file_path, engine_file_path, dynamic_shape=False):
    calib = CenterFaceEntropyCalibrator("../calibration_ims", cache_file="calibration_centerface.cache")
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network(
            1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)) as network, \
            trt.OnnxParser(network, TRT_LOGGER) as parser:
        config = builder.create_builder_config()
        config.max_workspace_size = 1 << 30
        # builder.fp16_mode = True
        # Use while generating a quantized model
        config.set_flag(trt.BuilderFlag.INT8)
        config.int8_calibrator = calib
        # Parse model file
        if not os.path.exists(onnx_file_path):
            print('ONNX file {} not found, please run yolov3_to_onnx.py first to generate it.'.format(onnx_file_path))
            exit(0)
        print('Loading ONNX file from path {}...'.format(onnx_file_path))
        with open(onnx_file_path, 'rb') as model:
            print('Beginning ONNX file parsing')
            if not parser.parse(model.read()):
                print('parsing of ONNX file Failed ')
                for error in range(parser.num_errors):
                    print(parser.get_error(error))
                return None
        print('Completed parsing of ONNX file')
        print('Building an engine of INT8 from file {}; this may take a while...'.format(onnx_file_path))
        if dynamic_shape:
            # The optimization dimensions should match the calibration resolution.
            profile = builder.create_optimization_profile()
            profile.set_shape('input.1', (1, 3, 32, 32), (1, 3, 544, 960), (1, 3, 544, 960))
            config.add_optimization_profile(profile)
            config.set_calibration_profile(profile)
        else:
            network.get_input(0).shape = [1, 3, 544, 960]  # use while in static input
        engine = builder.build_engine(network, config)
        print("Completed creating Engine of INT8")
        if not os.path.exists(os.path.dirname(engine_file_path)):
            os.makedirs(os.path.dirname(engine_file_path))
        with open(engine_file_path, "wb") as f:
            f.write(engine.serialize())
        return engine
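A minimal usage sketch; the ONNX path follows the earlier example, the calibration directory is hard-coded inside the function above, and the INT8 engine output path is just a placeholder:

if __name__ == '__main__':
    engine = build_engine_onnx_int8('../models/onnx/cface.1k.onnx',
                                    '../models/tensorrt/centerface_int8.trt',
                                    dynamic_shape=True)
    if engine is None:
        print('INT8 engine build failed')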
A gripe here: my hardware environments were a desktop Ubuntu machine and a Jetson NX, running TensorRT 8.0.1 and 7.1.3 respectively. With 7.1.3 on the Jetson, converting the INT8 model kept failing with:
Assertion Error in assertRegionTightlyFitsTensor: 0 (tensor.region->getDimensions(true) == tensor.extent)
There is a GitHub issue tracking it:
github.com/NVIDIA/TensorRT/issues/1528
Here is a reference calibrator implementation:
import sys
import cv2
import tensorrt as trt
import os
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
class CenterFaceEntropyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, cali_dir, cache_file, batch_size=1):
        # Whenever you specify a custom constructor for a TensorRT class,
        # you MUST call the constructor of the parent explicitly.
        super(CenterFaceEntropyCalibrator, self).__init__()
        self.all_files = []
        for root, dirs, files in os.walk(cali_dir):
            for file in files:
                if os.path.splitext(file)[1] in ['.jpg', '.png']:
                    self.all_files.append(os.path.join(root, file))
        self.batch_size = batch_size
        self.current_index = 0
        self.cache_file = cache_file
        self.whole_len = len(self.all_files)
        # Allocate enough memory for a whole batch.
        self.device_input = cuda.mem_alloc(self.batch_size * 3 * 1920 * 1080 * 4)

    def get_batch_size(self):
        return self.batch_size

    def transform(self, h, w):
        img_h_new, img_w_new = int(np.ceil(h / 32) * 32), int(np.ceil(w / 32) * 32)
        scale_h, scale_w = img_h_new / h, img_w_new / w
        return img_h_new, img_w_new, scale_h, scale_w

    # TensorRT passes along the names of the engine bindings to the get_batch function.
    # You don't necessarily have to use them, but they can be useful to understand the order of
    # the inputs. The bindings list is expected to have the same ordering as 'names'.
    def get_batch(self, names):
        if self.current_index + self.batch_size > self.whole_len:
            print("all calibrated: current_index {} + batch_size {} > whole_len {}\n".format(
                self.current_index, self.batch_size, self.whole_len))
            return None
        current_batch = int(self.current_index / self.batch_size)
        if current_batch % 1 == 0:
            print("Calibrating batch {:}, containing {:} images, whole:{}".format(current_batch, self.batch_size,
                                                                                  len(self.all_files)))
        batch = None
        for i in range(self.current_index, self.current_index + self.batch_size):
            img = cv2.imread(self.all_files[i])
            # Should match the optimization profile used while building the engine.
            # Note: cv2.resize takes (width, height), so this yields a 544x960 (HxW) image.
            img = cv2.resize(img, (960, 544))
            img_h_new, img_w_new, scale_h, scale_w = self.transform(img.shape[0], img.shape[1])
            one_node = cv2.dnn.blobFromImage(img, scalefactor=1.0, size=(img_w_new, img_h_new), mean=(0, 0, 0),
                                             swapRB=True, crop=False)
            if batch is None:
                batch = one_node
            else:
                batch = np.concatenate((batch, one_node), 0)
            # print("batch {}".format(self.current_index))
        sys.stdout.flush()
        cuda.memcpy_htod(self.device_input, batch)
        self.current_index += self.batch_size
        return [self.device_input]

    def read_calibration_cache(self):
        # If there is a cache, use it instead of calibrating again. Otherwise, implicitly return None.
        if os.path.exists(self.cache_file):
            with open(self.cache_file, "rb") as f:
                return f.read()

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)
The main thing to implement is the get_batch function, where the calibration data gets the same preprocessing as at inference time.
Once quantization is done, the Python inference code differs very little from the FP32 version; only the buffer allocation part changes slightly. For dynamic shapes, post-processing again just reads the corresponding memory by address. The main purpose of this step is to verify that the model is correct.
The quantized model will ultimately be loaded through the C++ inference interface as well. Again it's largely the same as FP32; just pay a little attention to the memory allocation for the data. Conveniently, by the time TensorRT hands the results back to the user, the data is already float, so post-processing can be applied directly.
In a comparison on Jetson, the quantized model ran roughly twice as fast as the unquantized one.
If you found this helpful, feel free to follow my WeChat public account: CV老司机.
You can also join the Knowledge Planet (知识星球) group with the same name: CV老司机.
If there's content you'd like to see, contact 牛先生小猪 on WeChat: jishudashou.
That's it for today. Thanks!