TensorRT Workbook

A while ago I wrote an article on TVM model conversion, compilation, and deployment:

牛先生: The TVM neural-network compiler, autoTVM automatic code optimization, and C++/CUDA deployment in practice

as well as a TensorRT article that focuses more on the underlying principles:

牛先生: A hands-on guide to TensorRT quantization

Since then I have noticed that TensorRT is actually used even more widely, so today I will summarize the corresponding Python API workflow for TensorRT: model conversion, model inference, and model quantization.

I have run TensorRT both on an Ubuntu desktop and on Jetson. Overall the APIs are essentially identical and can be used as-is; the TensorRT version on Jetson is a bit older, so some features may not be supported.

First, get hold of an ONNX model; there are plenty of tutorials online on exporting one, so I won't repeat them here. The overall workflow is then:

  1. Install the required environment
  2. Get the Python ONNX inference code working, for later consistency checks
  3. Convert to an FP32 TensorRT model via the Python API
  4. Load the FP32 model via the Python API, run inference, and verify consistency
  5. Load the FP32 model via the C++ API, run inference, and verify consistency
  6. Convert to an INT8 model via the Python API (including calibration)
  7. Load the INT8 model via the Python API, run inference, and verify consistency
  8. Load the INT8 model via the C++ API, run inference, and verify consistency

Let's walk through this pipeline step by step!

Environment setup and documentation

Assume the platform is an Ubuntu 18.04 x86_64 desktop. The setup is basically the same as any ordinary GPU inference environment: install the driver, install CUDA, then install the TensorRT package. Here is the TensorRT download page:

NVIDIA Developer​developer.nvidia.com/nvidia-tensorrt-8x-download

The download is a TAR package; extract it and then configure the relevant environment variables. The official installation guide is here:

NVIDIA Deep Learning TensorRT Documentation​docs.nvidia.com/deeplearning/tensorrt/install-guide/index.html#maclearn-net-repo-install

I used the TAR-package installation method.


You don't need to install every wheel that ships in the package; I only installed the TensorRT wheel itself.

Another thing to note: the lib directory from the extracted package needs to be added to the ldconfig search path; run ldconfig afterwards for it to take effect.

Once that is done, the development environment on an Ubuntu desktop is pretty much ready.

Now suppose the environment is a Jetson embedded platform:

First download the Jetson flashing image package. Mine is an NX; here is the download link:

JetPack SDK​developer.nvidia.com/zh-cn/embedded/jetpack

There are plenty of flashing tutorials online, so I won't go over that here. After flashing, TensorRT, CUDA, and OpenCV are all already included in the image.

To make development more pleasant, install a few helper tools and packages.

First set up VNC, which works well on Jetson:

Enable VNC on Jetson Nano: blog.csdn.net/weixin_43181350/article/details/106491056

Then install some essential extra software and packages:

sudo apt install python3-pip
sudo apt-get install libjpeg-dev zlib1g-dev
sudo apt-get install libprotobuf-dev protobuf-compiler 
pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple --trusted-host pypi.tuna.tsinghua.edu.cn statsmodels cython numpy pycuda pillow
sudo pip3 install jetson-stats

PyTorch can also be installed on Jetson; install it if you need it.

pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple --trusted-host pypi.tuna.tsinghua.edu.cn ./torch-1.6.0-cp36-cp36m-linux_aarch64.whl

Other PyTorch versions for Jetson are listed here:

PyTorch for Jetson - version 1.9.0 now available​forums.developer.nvidia.com/t/pytorch-for-jetson-version-1-9-0-now-available/72048

Link to the official API documentation:

NVIDIA TensorRT Standard Python API Documentation 8.0.1 documentation​docs.nvidia.com/deeplearning/tensorrt/api/python_api/index.html

For an IDE, PyCharm plus CLion works well, and VS Code is fine too. I personally find code navigation, remote deployment, and remote debugging convenient in PyCharm and CLion; give them a try and see if they suit you.

OK, now let's walk through the workflow at the code level. This post uses CenterFace as the example to demonstrate the whole model conversion and inference process.

ONNX inference

For ONNX inference, I recommend loading the input data with the function below; its compatibility and efficiency are good across platforms.

 cv2.dnn.blobFromImage(img, scalefactor=1.0, size=(self.img_w_new, self.img_h_new), mean=(0, 0, 0), swapRB=True, crop=False)
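For reference, blobFromImage returns an NCHW float32 blob (channels swapped to RGB when swapRB=True). A quick standalone check, using an illustrative image path and the 960x544 input size used later in this post:

import cv2

img = cv2.imread("test.jpg")  # illustrative path
blob = cv2.dnn.blobFromImage(img, scalefactor=1.0, size=(960, 544),
                             mean=(0, 0, 0), swapRB=True, crop=False)
print(blob.shape, blob.dtype)  # -> (1, 3, 544, 960) float32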

The key code is really just two parts: loading the model and running inference.

self.net = cv2.dnn.readNetFromONNX('../models/onnx/cface.1k.onnx')

    def inference_opencv(self, img, threshold):
        blob = cv2.dnn.blobFromImage(img, scalefactor=1.0, size=(self.img_w_new, self.img_h_new), mean=(0, 0, 0), swapRB=True, crop=False)
        self.net.setInput(blob)
        begin = datetime.datetime.now()
        if self.landmarks:
            heatmap, scale, offset, lms = self.net.forward(["537", "538", "539", '540'])
        else:
            lms = None  # avoid an undefined name when landmarks are disabled
            heatmap, scale, offset = self.net.forward(["535", "536", "537"])
        end = datetime.datetime.now()
        print("cpu times = ", end - begin)
        return self.postprocess(heatmap, lms, offset, scale, threshold)

The complete ONNX inference code is at:

GitHub - Star-Clouds/CenterFace: face detection​github.com/Star-Clouds/CenterFace

Converting an FP32 TensorRT model with the Python API

The theory was covered in the earlier article, so here is the function directly. Take it and use it as is!

import os, sys

import onnx
import pycuda.driver as cuda
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)


def build_engine_onnx(onnx_file_path, engine_file_path):
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network(
            1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)) \
            as network, trt.OnnxParser(network, TRT_LOGGER) as parser:

        config = builder.create_builder_config()
        config.max_workspace_size = 1 << 30  # 1 GB builder workspace
        # builder.max_batch_size = 1
        # builder.fp16_mode = True
        profile = builder.create_optimization_profile()
        profile.set_shape('input.1', (1, 3, 32, 32), (1, 3, 480, 480), (1, 3, 544, 960))
        config.add_optimization_profile(profile)

        # Parse model file
        if not os.path.exists(onnx_file_path):
            print('ONNX file {} not found.'.format(onnx_file_path))
            exit(0)
        print('Loading ONNX file from path {}...'.format(onnx_file_path))
        with open(onnx_file_path, 'rb') as model:
            print('Beginning ONNX file parsing')
            if parser.parse(model.read()) is False:
                print('parsing of ONNX file Failed ')
                for error in range(parser.num_errors):
                    print(parser.get_error(error))
                return None
        print('Completed parsing of ONNX file')

        print('Building an engine from file {}; this may take a while...'.format(onnx_file_path))
        # network.get_input(0).shape = [1, 3, max_H, max_W] #use while in static input

        engine = builder.build_engine(network, config)
        print("Completed creating Engine")
        if os.path.exists(os.path.dirname(engine_file_path)) is False:
            os.makedirs(os.path.dirname(engine_file_path))
        with open(engine_file_path, "wb") as f:
            f.write(engine.serialize())
        return engine

The function's inputs are the path to the ONNX model and the path where the TensorRT engine should be saved.

A few key spots in the code are worth a quick explanation:

  with trt.Builder(TRT_LOGGER) as builder, builder.create_network(
            1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)) \
            as network, trt.OnnxParser(network, TRT_LOGGER) as parser:

The code above creates the builder and creates the network with the EXPLICIT_BATCH flag, so the batch dimension is explicit in the tensor shapes. This is required when the model input shape is dynamic.

profile = builder.create_optimization_profile()
profile.set_shape('input.1', (1, 3, 32, 32), (1, 3, 480, 480), (1, 3, 544, 960))
config.add_optimization_profile(profile)

The optimization profile is only needed when the model input is dynamic; for a static input you can simply set the model's input shape instead (see the commented-out network.get_input(0).shape line above).
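A minimal usage sketch of the function above, reusing the model paths that appear elsewhere in this post:

onnx_path = "../models/onnx/cface.1k.onnx"
engine_path = "../models/tensorrt/centerface.trt"

engine = build_engine_onnx(onnx_path, engine_path)
if engine is None:
    raise RuntimeError("engine build failed, see the parser errors printed above")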

Loading the FP32 model with the Python API, running inference, and verifying consistency

This step uses the TensorRT engine we just generated to run inference.

The core consists of three parts: load the model, allocate buffers, and run inference while copying the results back from the device to host memory.

self.trt_logger = trt.Logger()  # This logger is required to build an engine
f = open("../models/tensorrt/centerface.trt", "rb")
runtime = trt.Runtime(self.trt_logger)
engine = runtime.deserialize_cuda_engine(f.read())

For this initialization, the default logger is sufficient.


def allocate_buffers(engine):
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()

    # For dynamic shapes the binding dims contain -1, so over-allocate each
    # buffer up to the maximum feature-map size the engine can see.
    max_feat_map_size = 544 * 960
    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size * max_feat_map_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # Allocate host and device buffers
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer to device bindings.
        bindings.append(int(device_mem))
        # Append to the appropriate list.
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream

When allocating buffers for a dynamic-shape engine, the simple approach is to allocate memory blocks sized for the maximum shape directly.
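Note that the HostDeviceMem wrapper and the do_inference call used here are not defined in this post; they follow the helpers in common.py from the official TensorRT samples. A minimal sketch of what they might look like, assuming an explicit-batch engine and the pycuda buffers allocated above:

import pycuda.driver as cuda


class HostDeviceMem:
    """Pairs a pagelocked host buffer with its device allocation."""
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem


def do_inference(context, bindings, inputs, outputs, stream):
    # Copy input data from host to device.
    for inp in inputs:
        cuda.memcpy_htod_async(inp.device, inp.host, stream)
    # Run inference asynchronously on the stream (explicit-batch API).
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    # Copy predictions back from device to host.
    for out in outputs:
        cuda.memcpy_dtoh_async(out.host, out.device, stream)
    # Wait for the stream to finish all queued work.
    stream.synchronize()
    return [out.host for out in outputs]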

context = engine.create_execution_context()

# Allocate buffers for input and output
inputs, outputs, bindings, stream = allocate_buffers(engine)  # input, output: host # bindings

# Do inference
shape_of_output = [(1, 1, int(self.img_h_new / 4), int(self.img_w_new / 4)),
                           (1, 2, int(self.img_h_new / 4), int(self.img_w_new / 4)),
                           (1, 2, int(self.img_h_new / 4), int(self.img_w_new / 4)),
                           (1, 10, int(self.img_h_new / 4), int(self.img_w_new / 4))]
# call set_binding_shape  while in dynamic mode.
context.set_binding_shape(0, (1, 3, self.img_h_new, self.img_w_new))

# Load data to the buffer
inputs[0].host = blob.reshape(-1)
begin = datetime.datetime.now()
trt_outputs = do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs,
                                   stream=stream)  # numpy data
trt_outputs = [out[0:shape_of_output[i][0] * shape_of_output[i][1]
                             * shape_of_output[i][2] * shape_of_output[i][3]]
                       for i, out in enumerate(trt_outputs)]
end = datetime.datetime.now()
print("gpu times = ", end - begin)

heatmap, scale, offset, lms = [output.reshape(shape) for output, shape in zip(trt_outputs, shape_of_output)]

If the input shape is not dynamic, there is no need to call:

context.set_binding_shape(0, (1, 3, self.img_h_new, self.img_w_new))

and no need to slice the output buffers down to the valid region:

trt_outputs = [out[0:shape_of_output[i][0] * shape_of_output[i][1]
                             * shape_of_output[i][2] * shape_of_output[i][3]]
                       for i, out in enumerate(trt_outputs)]

Finally, compare these results with the earlier ONNX inference results. For a more thorough check, run a dataset through both paths and compare metrics. If things don't match up, a quick sanity check is to feed an all-zeros or all-ones tensor and see whether the output tensors differ significantly.
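A minimal sketch of such a consistency check, where onnx_out and trt_out stand for a matching pair of outputs (say, the two heatmaps) and the tolerances are just a starting point:

import numpy as np

def check_consistency(onnx_out, trt_out, rtol=1e-3, atol=1e-3):
    # Largest absolute difference plus an element-wise tolerance check.
    max_abs_diff = np.max(np.abs(onnx_out - trt_out))
    ok = np.allclose(onnx_out, trt_out, rtol=rtol, atol=atol)
    print("max abs diff = {:.6f}, allclose = {}".format(max_abs_diff, ok))
    return ok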

Loading the FP32 model with the C++ API, running inference, and verifying consistency

Doing the inference in C++ takes a little more work. First write the CMakeLists.txt; the TensorRT samples ship one that can be adapted with minor changes. Then add the TensorRT include and link paths from the TAR package to the project and you can get going. Also, the official TensorRT release has a common directory implementing a number of frequently used helper functions; compiling it into your project saves some time.

The core code is as follows:

    runtime = createInferRuntime(sample::gLogger.getTRTLogger());
    assert(runtime != nullptr);
    engine = runtime->deserializeCudaEngine(trtModelStream, size);
    assert(engine != nullptr);
    context = engine->createExecutionContext();
    assert(context != nullptr);
    delete[] trtModelStream;

    CHECK(cudaMalloc(&buffers[0], MAX_SIZE_INPUT * sizeof(float)));
    CHECK(cudaMalloc(&buffers[1], MAX_SIZE_OUTPUT1 * sizeof(float)));
    CHECK(cudaMalloc(&buffers[2], MAX_SIZE_OUTPUT2 * sizeof(float)));
    CHECK(cudaMalloc(&buffers[3], MAX_SIZE_OUTPUT3 * sizeof(float)));
    CHECK(cudaMalloc(&buffers[4], MAX_SIZE_OUTPUT4 * sizeof(float)));

    CHECK(cudaStreamCreate(&stream));

    input_host = new float[MAX_SIZE_INPUT];    // sizes are element counts, not bytes
    output1_host = new float[MAX_SIZE_OUTPUT1];
    output2_host = new float[MAX_SIZE_OUTPUT2];
    output3_host = new float[MAX_SIZE_OUTPUT3];
    output4_host = new float[MAX_SIZE_OUTPUT4];

Prepare the runtime pieces: the inference context, plus the device-side and host-side memory allocations.

    if (!context->setBindingDimensions(0, Dims4(1, 3, inputBlob.size[2], inputBlob.size[3]))) {
        printf(" SHAPE SET ERROR ");
        exit(-1);
    } 
    CHECK(cudaMemcpyAsync(buffers[0], input_host, input_size, cudaMemcpyHostToDevice, stream));
    context->enqueueV2(buffers, stream, nullptr);  // enqueueV2 for explicit-batch / dynamic-shape engines
    CHECK(cudaMemcpyAsync(output1_host, buffers[1], output1_size, cudaMemcpyDeviceToHost, stream));
    CHECK(cudaMemcpyAsync(output2_host, buffers[2], output2_size, cudaMemcpyDeviceToHost, stream));
    CHECK(cudaMemcpyAsync(output3_host, buffers[3], output3_size, cudaMemcpyDeviceToHost, stream));
    CHECK(cudaMemcpyAsync(output4_host, buffers[4], output4_size, cudaMemcpyDeviceToHost, stream));
    cudaStreamSynchronize(stream);

Copy the data to the device and run inference; for dynamic shapes you need to call setBindingDimensions to set the input size. The output_host buffers copied back from buffers can then be wrapped in a cv::Mat to access the memory directly, which costs essentially nothing in performance and makes the coding noticeably more convenient.

What remains is post-processing of the outputs, which simply follows whatever the model's post-processing logic is; nothing special to add.

Converting an INT8 model with the Python API

With the above, the FP32 model is more or less sorted. Now let's talk about the INT8 model.

TensorRT's INT8 quantization interface is quite friendly. Compared with building the FP32 model, you add a few more settings in the builder config and implement a calibrator, and you're done. There are several calibrator classes to choose from:

[Figure: the INT8 calibrator classes exposed by the Python API, e.g. IInt8EntropyCalibrator2, IInt8EntropyCalibrator, IInt8MinMaxCalibrator, IInt8LegacyCalibrator]

Their differences are described fairly clearly in the official documentation:

NVIDIA TensorRT Standard Python API Documentation 8.2.0 documentation​docs.nvidia.com/deeplearning/tensorrt/api/python_api/infer/Int8/pyInt8.html

For the theory behind quantization, see:

牛先生: A hands-on guide to TensorRT quantization

The configuration part of the code is as follows:

def build_engine_onnx_int8(onnx_file_path, engine_file_path, dynamic_shape=False):
    calib = CenterFaceEntropyCalibrator("../calibration_ims", cache_file="calibration_centerface.cache")

    with trt.Builder(TRT_LOGGER) as builder, builder.create_network(
            1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)) \
            as network, trt.OnnxParser(network, TRT_LOGGER) as parser:

        config = builder.create_builder_config()
        config.max_workspace_size = 1 << 30
        # builder.fp16_mode = True
        # used when generating the quantized (INT8) model
        config.set_flag(trt.BuilderFlag.INT8)
        config.int8_calibrator = calib

        # Parse model file
        if not os.path.exists(onnx_file_path):
            print('ONNX file {} not found.'.format(onnx_file_path))
            exit(0)
        print('Loading ONNX file from path {}...'.format(onnx_file_path))
        with open(onnx_file_path, 'rb') as model:
            print('Beginning ONNX file parsing')
            if parser.parse(model.read()) is False:
                print('parsing of ONNX file Failed ')
                for error in range(parser.num_errors):
                    print(parser.get_error(error))
                return None
        print('Completed parsing of ONNX file')

        print('Building an engine of INT8 from file {}; this may take a while...'.format(onnx_file_path))
        if dynamic_shape:
            # optimization dimension should be same as the calibration resolution
            profile = builder.create_optimization_profile()
            profile.set_shape('input.1', (1, 3, 32, 32), (1, 3, 544, 960), (1, 3, 544, 960))
            config.add_optimization_profile(profile)
            config.set_calibration_profile(profile)
        else:
            network.get_input(0).shape = [1, 3, 544, 960]  # use while in static input

        engine = builder.build_engine(network, config)
        print("Completed creating Engine of INT8")
        if os.path.exists(os.path.dirname(engine_file_path)) is False:
            os.makedirs(os.path.dirname(engine_file_path))
        with open(engine_file_path, "wb") as f:
            f.write(engine.serialize())
        return engine

A small gripe here: my hardware was a desktop Ubuntu machine and a Jetson NX, running TensorRT 8.0.1 and 7.1.3 respectively. TensorRT 7.1.3 on the Jetson kept failing when converting the INT8 model, with this error:

Assertion Error in assertRegionTightlyFitsTensor: 0 (tensor.region->getDimensions(true) == tensor.extent)

There is also a GitHub issue tracking it:

https://github.com/NVIDIA/TensorRT/issues/1528

Here is a reference calibrator implementation:

import sys

import cv2
import tensorrt as trt
import os

import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np


class CenterFaceEntropyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, cali_dir, cache_file, batch_size=1):
        # Whenever you specify a custom constructor for a TensorRT class,
        # you MUST call the constructor of the parent explicitly.
        super(CenterFaceEntropyCalibrator, self).__init__()

        self.all_files = []
        for root, dirs, files in os.walk(cali_dir):
            for file in files:
                if os.path.splitext(file)[1] in ['.jpg', '.png']:
                    self.all_files.append(os.path.join(root, file))

        self.batch_size = batch_size
        self.current_index = 0
        self.cache_file = cache_file
        self.whole_len = len(self.all_files)
        # Allocate enough memory for a whole batch.
        self.device_input = cuda.mem_alloc(self.batch_size * 3 * 1920 * 1080 * 4)

    def get_batch_size(self):
        return self.batch_size

    def transform(self, h, w):
        img_h_new, img_w_new = int(np.ceil(h / 32) * 32), int(np.ceil(w / 32) * 32)
        scale_h, scale_w = img_h_new / h, img_w_new / w
        return img_h_new, img_w_new, scale_h, scale_w

    # TensorRT passes along the names of the engine bindings to the get_batch function.
    # You don't necessarily have to use them, but they can be useful to understand the order of
    # the inputs. The bindings list is expected to have the same ordering as 'names'.
    def get_batch(self, names):
        if self.current_index + self.batch_size > self.whole_len:
            print("all calibrated: current_index={}, batch_size={}, whole_len={}\n".format(
                self.current_index, self.batch_size, self.whole_len))
            return None

        current_batch = int(self.current_index / self.batch_size)
        if current_batch % 1 == 0:
            print("Calibrating batch {:}, containing {:} images, whole:{}".format(current_batch, self.batch_size,
                                                                                  len(self.all_files)))

        # batch = self.data[self.current_index:self.current_index + self.batch_size].ravel()
        batch = None
        for i in range(self.current_index, self.current_index + self.batch_size):
            img = cv2.imread(self.all_files[i])  # index with i so each image in the batch is distinct

            # Resolution should match the optimization profile used when building the engine;
            # note that cv2.resize takes dsize as (width, height), i.e. 960x544 -> H=544, W=960.
            img = cv2.resize(img, (960, 544))
            img_h_new, img_w_new, scale_h, scale_w = self.transform(img.shape[0], img.shape[1])
            one_node = cv2.dnn.blobFromImage(img, scalefactor=1.0, size=(img_w_new, img_h_new), mean=(0, 0, 0),
                                             swapRB=True, crop=False)
            if batch is None:
                batch = one_node
            else:
                batch = np.concatenate((batch, one_node), 0)
        # print("batch {}".format(self.current_index))
        sys.stdout.flush()
        cuda.memcpy_htod(self.device_input, batch)
        self.current_index += self.batch_size
        return [self.device_input]

    def read_calibration_cache(self):
        # If there is a cache, use it instead of calibrating again. Otherwise, implicitly return None.
        if os.path.exists(self.cache_file):
            with open(self.cache_file, "rb") as f:
                return f.read()

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)

The main thing to implement is the get_batch function, where the calibration data should get the same pre-processing as at inference time.
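One practical note: read_calibration_cache above will happily reuse an existing cache file and skip calibration entirely, so delete the cache whenever the calibration image set changes. A small usage sketch, with an illustrative INT8 engine path:

import os

cache = "calibration_centerface.cache"
if os.path.exists(cache):
    os.remove(cache)  # force a fresh calibration run

engine = build_engine_onnx_int8("../models/onnx/cface.1k.onnx",
                                "../models/tensorrt/centerface_int8.trt",  # illustrative path
                                dynamic_shape=False)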

Loading the INT8 model with the Python API

After quantization, the Python inference code is largely the same as for FP32; only the buffer-allocation part differs slightly. For dynamic shapes, the post-processing likewise just reads the relevant memory regions by offset as before. The main purpose of this step is to verify that the model is still correct.
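To see exactly what changes on the buffer side, one quick check is to print each binding's shape and dtype for the INT8 engine before allocating; the engine path below is illustrative:

import tensorrt as trt

TRT_LOGGER = trt.Logger()
runtime = trt.Runtime(TRT_LOGGER)
with open("../models/tensorrt/centerface_int8.trt", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

for binding in engine:
    print(binding,
          engine.get_binding_shape(binding),
          trt.nptype(engine.get_binding_dtype(binding)))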

Loading the INT8 model with the C++ API

The quantized model will ultimately be loaded through the C++ inference API as well. As you'd expect, it is again much the same as the FP32 path; the data buffer allocation needs a little attention, but everything else is basically identical. Conveniently, by the time TensorRT exposes the inference results to you, the data is already float, so you can run the usual post-processing directly.

In a performance comparison on Jetson, the quantized model runs roughly twice as fast as before quantization.

If this content was helpful, feel free to follow my WeChat official account: CV老司机.


Or join the Knowledge Planet (知识星球) community, also under the ID: CV老司机.


If there is content you'd like to see covered, contact Mr. Niu's assistant for feedback, WeChat ID: jishudashou.

That's it for today. Thanks!
