Converting a PyTorch Model to FP16 and INT8 Precision with the TensorRT Python API on Xavier

0. Xavier Environment

JetPack 4.6
Python 3.6.9
TensorRT 8.0.1.6
torch 1.9.0 (download the build matching your JetPack version from jetson_zoo)
OpenCV 4.1.1
ROS 2 installed inside Docker
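
A quick sanity check (a minimal sketch) confirms that these are the versions the Python environment actually picks up:

# Print the versions of the main packages used below.
import tensorrt as trt
import torch
import cv2

print("TensorRT:", trt.__version__)    # expect 8.0.1.6
print("PyTorch: ", torch.__version__)  # expect 1.9.0
print("OpenCV:  ", cv2.__version__)    # expect 4.1.1
print("CUDA available:", torch.cuda.is_available())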

1. TensorRT Model Compression

  • Loading a model with TRT and building the TRT engine breaks down into six steps (a minimal code sketch mapping these steps onto the API follows after this list):
    • 1. Create a logger. It is mandatory, though not otherwise important;
    • 2. Create a builder;
    • 3. Create a network, which at this point is just an empty shell;
    • 4. Create a parser. Caffe, ONNX and TF models each have a corresponding parser which, as the name suggests, parses the model file;
    • 5. Build the engine. Layer fusion and the choice of calibration/precision happen here: FP32, FP16 or INT8;
    • 6. Create a context, which is what actually performs inference. It connects upward to the engine and downward to the inference data, hence the name "context".
  • Why FP16 and INT8 are faster
    Through instruction-level or hardware techniques, the number of FP16 or INT8 operations completed per clock cycle is greater than the number of FP32 operations.
  • The FP16 engine file is about half the size of the original ONNX model, and the INT8 engine file is about half the size of the FP16 one. For example, the original PyTorch lane-detection model is 500 MB; converted to ONNX it is 244 MB, the FP16-precision engine file is 122.6 MB, and the INT8-precision engine file is 62.9 MB.
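
  • A minimal sketch of how the six steps map onto the TensorRT 8 Python API (the model path is hypothetical; the complete FP16 and INT8 builders follow in the next sections):

import tensorrt as trt

# 1. Logger: required by almost every TensorRT object.
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# 2. Builder.
builder = trt.Builder(TRT_LOGGER)

# 3. Network: an empty shell until the parser fills it in.
flag = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
network = builder.create_network(flag)

# 4. Parser: the ONNX parser here; Caffe and TF have their own.
parser = trt.OnnxParser(network, TRT_LOGGER)
parser.parse_from_file("model.onnx")  # hypothetical path

# 5. Engine: layer fusion and precision (FP32/FP16/INT8) are decided here.
config = builder.create_builder_config()
engine = builder.build_engine(network, config)

# 6. Context: the object that actually runs inference.
context = engine.create_execution_context()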

2. FP16 Compression

  • How FP32 is compressed to FP16:
    [Figure 1: principle of FP32-to-FP16 conversion]

  • max_batch_size: 1

  • onnx_file_path: path to the ONNX file

  • engine_file_path: path to the output engine file

  • save_engine: whether to save the engine file

import os

import tensorrt as trt


def get_engine(max_batch_size,
               onnx_file_path,
               engine_file_path,
               save_engine=True):
    # 1. Logger.
    TRT_LOGGER = trt.Logger()
    assert not os.path.exists(engine_file_path), "Engine file already exists"
    # Explicit-batch flag, required by the ONNX parser.
    explicit_batch = 1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    # 2-4. Builder, network, parser, plus the builder config.
    with trt.Builder(TRT_LOGGER) as builder, \
            builder.create_network(explicit_batch) as network, \
            trt.OnnxParser(network, TRT_LOGGER) as parser, \
            builder.create_builder_config() as config:

        config.max_workspace_size = 1 << 28  # 256 MiB of build workspace
        builder.max_batch_size = max_batch_size

        # Enable FP16 only if the hardware supports it natively.
        if builder.platform_has_fast_fp16:
            config.set_flag(trt.BuilderFlag.FP16)

        if not os.path.exists(onnx_file_path):
            quit("ONNX file {} not found!".format(onnx_file_path))
        print('loading onnx file from path {} ...'.format(onnx_file_path))
        if not parser.parse_from_file(onnx_file_path):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            quit("Failed to parse the ONNX file")
        print("Completed parsing of onnx file")
        print("Building an engine from file {}; this may take a while...".format(onnx_file_path))
        print(network.get_layer(network.num_layers - 1).get_output(0).shape)
        # 5. Build the engine with the requested precision.
        engine = builder.build_engine(network, config)
        print("Completed creating Engine")
        if save_engine:
            with open(engine_file_path, 'wb') as f:
                f.write(engine.serialize())
        return engine
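
A typical call (the file names are hypothetical):

engine = get_engine(max_batch_size=1,
                    onnx_file_path="lane_detection.onnx",
                    engine_file_path="lane_detection_fp16.engine")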

3. INT8 Compression

  • Quantization method
    [Figure 2: INT8 quantization method]

  • Entropy calibrator class

import os
import time

import cv2
import numpy as np
import pycuda.autoinit  # noqa: F401  (initializes the CUDA context)
import pycuda.driver as cuda
import tensorrt as trt
from PIL import Image

# img_transforms is the same torchvision preprocessing pipeline used at
# training time; it must be defined elsewhere in the script.


class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, training_data, cache_file, batch_size=128):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.cache_file = cache_file
        t1 = time.time()
        self.data = self.load_data(training_data)
        t2 = time.time()
        print('load_data:', 1000 * (t2 - t1), ' ms')
        self.batch_size = batch_size
        self.current_index = 0

        # Device buffer large enough for one batch of preprocessed images.
        self.device_input = cuda.mem_alloc(self.data[0].nbytes * self.batch_size)

    def load_data(self, datapath):
        """Read every calibration image and apply the training-time preprocessing."""
        print("loading image data")
        imgs = os.listdir(datapath)
        dataset = []
        for order, data in enumerate(imgs):
            image_path = os.path.join(datapath, data)
            img = cv2.imread(image_path)
            img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
            img = Image.fromarray(img)
            img = img_transforms(img).numpy()
            dataset.append(img)
            print('calibration image order:', order)
        return np.array(dataset)

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        # Returning None tells TensorRT that the calibration data is exhausted.
        if self.current_index + self.batch_size > self.data.shape[0]:
            return None

        current_batch = int(self.current_index / self.batch_size)
        if current_batch % 10 == 0:
            print("Calibrating batch {:}, containing {:} images".format(current_batch, self.batch_size))

        # Copy the next batch host-to-device and hand TensorRT the device pointer.
        batch = self.data[self.current_index:self.current_index + self.batch_size].ravel()
        cuda.memcpy_htod(self.device_input, batch)
        self.current_index += self.batch_size
        return [self.device_input]

    def read_calibration_cache(self):
        # Reuse an existing calibration cache to skip recalibration.
        if os.path.exists(self.cache_file):
            with open(self.cache_file, "rb") as f:
                return f.read()

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)
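  • Creating the calibrator. The Int8_calibrator object referenced in get_engine() below is an instance of this class; the calibration-image directory, cache file name and batch size here are illustrative:

# Build the calibrator passed to the INT8 engine builder below.
# The directory of calibration images and the cache path are hypothetical.
Int8_calibrator = EntropyCalibrator(training_data="./calib_images",
                                    cache_file="calibration.cache",
                                    batch_size=128)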
  • Engine generation
def get_engine(max_batch_size=1, onnx_file_path="", engine_file_path="", save_engine=True):
    TRT_LOGGER = trt.Logger()
    assert not os.path.exists(engine_file_path), "Engine file already exists"
    explicit_batch = 1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    with trt.Builder(TRT_LOGGER) as builder, \
            builder.create_network(explicit_batch) as network, \
            trt.OnnxParser(network, TRT_LOGGER) as parser, \
            builder.create_builder_config() as config:

        config.max_workspace_size = 1 << 28
        builder.max_batch_size = max_batch_size

        # INT8 requires hardware support and a calibrator (the EntropyCalibrator
        # instance created above).
        assert builder.platform_has_fast_int8, 'INT8 is not supported on this platform'
        config.set_flag(trt.BuilderFlag.INT8)
        config.int8_calibrator = Int8_calibrator

        if not os.path.exists(onnx_file_path):
            quit("ONNX file {} not found!".format(onnx_file_path))
        print('loading onnx file from path {} ...'.format(onnx_file_path))
        if not parser.parse_from_file(onnx_file_path):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            quit("Failed to parse the ONNX file")
        print("Completed parsing of onnx file")
        print("Building an engine from file {}; this may take a while...".format(onnx_file_path))
        print(network.get_layer(network.num_layers - 1).get_output(0).shape)
        engine = builder.build_engine(network, config)
        print("Completed creating Engine")
        if save_engine:
            with open(engine_file_path, 'wb') as f:
                f.write(engine.serialize())
        return engine
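
To run inference later without rebuilding, the saved engine can be deserialized and wrapped in an execution context (step 6 of the overview); a minimal sketch with a hypothetical engine path:

# Deserialize a saved engine and create the context that runs inference.
TRT_LOGGER = trt.Logger()
with open("lane_detection_int8.engine", "rb") as f, \
        trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()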

Quadro P2000 INT8 vs. FP32 Inference Speed Comparison

  • Original PyTorch model: about 11.5 FPS
  • INT8-precision engine model: about 18 FPS
  • The P2000 GPU does not support FP16 precision

Xavier INT8 vs. FP16 Inference Speed Comparison

  • FP16 inference takes about 6-8 ms, for an overall frame rate of about 22 FPS
  • INT8 inference takes about 4-6 ms, for an overall frame rate of about 23 FPS
  • INT8 is not really necessary here: inference is no longer the speed bottleneck, so optimizing image loading, post-processing of the inference results, and visualization yields a bigger payoff

INT8 vs. FP32 Inference Accuracy Comparison

  • Accuracy metrics are still to be added; by visual inspection the results look good. (The figure below shows INT8-precision inference results; calibration used over 2000 images.)
    [Figure 3: INT8-precision inference results]
