Deploying PyTorch Models with TensorRT

Environment (Ubuntu 20.04.2)

Anaconda3/Python        3.8.8
PyTorch                 1.9.0+cu102
CUDA                    11.3.r11.3
ONNX                    1.10.2
ONNX Runtime            1.9.0


Common PyTorch deployment paths

CPU: pytorch->onnx->onnxruntime
GPU: pytorch->onnx->onnx2trt->tensorRT
ARM: pytorch->onnx->ncnn/mace/mnn, etc.

PyTorch GPU deployment workflow

  • Step 1: PyTorch to ONNX

  • Step 2: ONNX to TensorRT engine

  • Step 3: TensorRT inference

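Step 1: PyTorch to ONNX

  • pytorch2onnx

    • Install the onnx, onnxsim, and onnxruntime modules:

      pip install onnx # install onnx
      pip install onnx-simplifier # install onnxsim
      pip install onnxruntime # install onnxruntime
    • Precision can be lowered from FP32 to FP16 at export time (FP16 is only supported on CUDA); alternatively, export normally and quantize to a lower precision later in TensorRT;

    • A model without special operations can be exported directly with torch.onnx.export();

    • If the model contains special operations, they must be rewritten; torch.onnx documents how to convert such operations;

    • Use onnxsim.simplify to simplify the exported ONNX model;

    • After conversion, use the onnx module or ONNX Runtime to check that the conversion result is correct:

      • Option 1: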

      import onnx
      from onnxsim import simplify
      model = onnx.load('xxx.onnx')
      onnx.checker.check_model(model)
      print(onnx.helper.printable_graph(model.graph))
      # simplify onnx model
      model_sim, check = simplify(model)
      assert check, 'simplified ONNX model could not be validated'
      onnx.save(model_sim, 'xxx-sim.onnx')
      • Option 2:

      import onnxruntime as ort
      import numpy as np
      ort_session = ort.InferenceSession('xxx.onnx')
      outputs = ort_session.run(None, {
        'input_name': np.random.randn(10, 3,  224, 224).astype(np.float32)
      })
      print(outputs[0])

    • Verify that the ONNX model and the original PyTorch model produce the same results; if they agree, the conversion is correct;

    import numpy as np
    import onnxruntime as ort
    import torch
    from original_model import OriginalModel  # placeholder: your own model class

    # load pytorch model
    torch_model = OriginalModel()
    # map_location is an argument of torch.load, not of load_state_dict
    torch_model.load_state_dict(torch.load('xxx.pth', map_location='cpu'))
    torch_model.eval()
    img = torch.ones((1, 3, 640, 640)) / 255
    # pytorch model inference
    with torch.no_grad():
        res = torch_model(img)
        print(res[0])

    # load onnx model
    sess = ort.InferenceSession('xxx.onnx')
    # get input name
    input_name = sess.get_inputs()[0].name
    outputs = sess.run(None, {input_name: img.numpy()})
    print(outputs[0])
    # verify that the two results agree, e.g. by comparing them with np.allclose
    • The ONNX model can also be inspected with the visualization tool Netron

    Installing and using Netron:

    # On Linux (including Ubuntu), install with pip
    pip install netron
    # To visualize an ONNX model, run the `netron` command and press the `Enter` key
    netron
    # After pressing `Enter`, Netron shows its interface
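The export and comparison steps of Step 1 can be sketched as follows. This is a minimal sketch under stated assumptions, not the author's script: `build_export_kwargs`, `export_onnx`, and `outputs_match` are hypothetical helper names, opset 11 and the tensor names `input`/`output` are assumptions, and the tolerances are only illustrative. The keyword arguments themselves (`input_names`, `output_names`, `opset_version`, `dynamic_axes`, `do_constant_folding`) belong to `torch.onnx.export`. The comparison applies the same rule as `np.allclose`, written in plain Python so it can be read without extra dependencies.

```python
def build_export_kwargs(input_name="input", output_name="output",
                        opset_version=11, dynamic_batch=False):
    """Collect keyword arguments for torch.onnx.export (names are assumptions)."""
    kwargs = {
        "input_names": [input_name],
        "output_names": [output_name],
        "opset_version": opset_version,
        "do_constant_folding": True,
    }
    if dynamic_batch:
        # Mark axis 0 of the input and output as a variable batch dimension.
        kwargs["dynamic_axes"] = {input_name: {0: "batch"},
                                  output_name: {0: "batch"}}
    return kwargs


def export_onnx(model, input_shape, onnx_path, **overrides):
    """Export an eval-mode model to ONNX using a random dummy input."""
    import torch  # imported lazily so the helpers above work without PyTorch

    model.eval()
    dummy = torch.randn(*input_shape)
    torch.onnx.export(model, dummy, onnx_path, **build_export_kwargs(**overrides))


def flatten(values):
    """Yield the scalars of an arbitrarily nested list/tuple in order."""
    for v in values:
        if isinstance(v, (list, tuple)):
            yield from flatten(v)
        else:
            yield float(v)


def outputs_match(reference, candidate, rtol=1e-3, atol=1e-5):
    """True if |candidate - reference| <= atol + rtol * |reference| elementwise."""
    ref, cand = list(flatten(reference)), list(flatten(candidate))
    if len(ref) != len(cand):
        return False
    return all(abs(c - r) <= atol + rtol * abs(r) for r, c in zip(ref, cand))
```

With the verification script above, and assuming the model returns a single tensor, the check would read `outputs_match(res.numpy().tolist(), outputs[0].tolist())`.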

 

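Step 2: ONNX to TensorRT Engine

  • Installing TensorRT

    Install TensorRT for Python from the .tar package (this method avoids missing-dependency problems)

    • Make sure matching versions of CUDA and PyTorch are installed

    • Install the pycuda module: pip install pycuda

    • Make sure the onnx module is installed

    • Download the TensorRT 8.x .tar file that matches your CUDA version from: NVIDIA TensorRT | NVIDIA Developer

    • Extract the archive, change into its python directory, and install TensorRT, graphsurgeon, and onnx_graphsurgeon

    cd TensorRT-8.x.x/python
    pip install tensorrt-8.x.x-xxx.whl
    cd ../graphsurgeon
    pip install graphsurgeon-x.x.x.whl
    cd ../onnx_graphsurgeon
    pip install onnx_graphsurgeon-x.x.x.whl
    • Add TensorRT/lib to the environment variables (append the export lines to ~/.bashrc, then reload it)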

    export LD_LIBRARY_PATH=/home/YOUR_USERNAME/TensorRT-8.0.3.4/lib:$LD_LIBRARY_PATH  
    export PATH=/home/YOUR_USERNAME/TensorRT-8.0.3.4/bin:$PATH  
    export LD_LIBRARY_PATH=/home/YOUR_USERNAME/TensorRT-8.0.3.4/targets/x86_64-linux-gnu/lib:$LD_LIBRARY_PATH
    source ~/.bashrc
    • Verify that TensorRT is installed correctly

    import tensorrt # a successful import means the installation worked
    print(tensorrt.__version__)
  • Basic TensorRT workflow

    [Figure: TensorRT basic workflow diagram]

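    • First, create a builder from a TensorRT Logger and use it to create the computation graph, an INetworkDefinition;

    • Then use a parser to populate the graph from a model exported by another framework, such as ONNX; the graph can also be built directly with the TensorRT API;

    • Build the CUDA engine from the computation graph;

    • Inference is finally run by the ExecutionContext created from the CUDA engine;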

      engine.create_execution_context()

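  • ONNX to TensorRT Engine

    • Option 1

      • The trtexec tool bundled with TensorRT converts an ONNX model directly into a TensorRT engine; the following command generates the engine:

      trtexec --onnx=your_model.onnx --saveEngine=resnet_engine.trt --explicitBatch

      trtexec accepts the following optional arguments:

      • --explicitBatch specifies that the network input has an explicit batch dimension. This flag cannot be omitted, because the ONNX parser only supports explicit batch mode. For more information, refer to the Working With Dynamic Shapes section in the TensorRT Developer Guide.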

      • --fp16 enables FP16 precision for layers that support it, in addition to FP32. For more information, refer to the Working With Mixed Precision section in the TensorRT Developer Guide.

      • --int8 enables INT8 precision for layers that support it, in addition to FP32. For more information, refer to the Working With Mixed Precision section in the TensorRT Developer Guide.

      • --best enables all supported precisions to achieve the best performance for every layer.

      • --workspace controls the maximum amount of persistent scratch memory available (in MB) for algorithms considered by the builder. This should be set as high as possible for a given platform based on availability; at runtime TensorRT will allocate only what is required, not exceeding the max.

      • --minShapes and --maxShapes specify the range of dimensions for each network input and --optShapes specifies the dimensions that the auto-tuner should use for optimization. For more information, refer to the Optimization Profiles section in the TensorRT Developer Guide.

      • --buildOnly requests that inference performance measurements be skipped.

      • --saveEngine specifies the file into which the serialized engine must be saved.

      • --safe enables building safety certified engines. This switch is used for prototyping automotive safety restricted flows in the TensorRT safe runtime.

      • --tacticSources can be used to add or remove tactics from the default tactic sources (cuDNN, cuBLAS and cuBLASLt).

      • --minTiming and --avgTiming respectively set the minimum and average number of iterations used in tactic selection.

      • --noBuilderCache disables the layer timing cache in the TensorRT builder. The timing cache helps to reduce the time taken in the builder phase by caching the layer profiling information and should work for most cases. Use this switch for the problematic cases. For more information, refer to the Builder Layer Timing Cache section in the TensorRT Developer Guide.

      • --timingCacheFile can be used to save or load the serialized global timing cache.
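The flags above can also be assembled programmatically, which helps when the same conversion is repeated for several models. The sketch below is illustrative only: `shape_arg` and `build_trtexec_cmd` are hypothetical helper names, and the tensor name `input` and the batch sizes 1, 8, and 16 are assumptions. It uses only flags from the list above, plus the `name:AxBxCxD` shape syntax that trtexec expects for `--minShapes`, `--optShapes`, and `--maxShapes`. A shape range is only meaningful if the ONNX model was exported with a dynamic batch axis.

```python
import shlex


def shape_arg(shapes):
    """Format {"input": (1, 3, 224, 224)} as "input:1x3x224x224" for trtexec."""
    return ",".join(f"{name}:{'x'.join(str(d) for d in dims)}"
                    for name, dims in shapes.items())


def build_trtexec_cmd(onnx_path, engine_path, explicit_batch=True, fp16=False,
                      workspace_mb=None, min_shapes=None, opt_shapes=None,
                      max_shapes=None):
    """Assemble a trtexec argument list from the flags described above."""
    cmd = ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={engine_path}"]
    if explicit_batch:
        cmd.append("--explicitBatch")
    if fp16:
        cmd.append("--fp16")
    if workspace_mb is not None:
        cmd.append(f"--workspace={workspace_mb}")
    for flag, shapes in (("--minShapes", min_shapes),
                         ("--optShapes", opt_shapes),
                         ("--maxShapes", max_shapes)):
        if shapes:
            cmd.append(f"{flag}={shape_arg(shapes)}")
    return cmd


# Example: FP16 engine with a batch dimension that may range from 1 to 16.
cmd = build_trtexec_cmd(
    "your_model.onnx", "resnet_engine.trt", fp16=True, workspace_mb=256,
    min_shapes={"input": (1, 3, 224, 224)},
    opt_shapes={"input": (8, 3, 224, 224)},
    max_shapes={"input": (16, 3, 224, 224)},
)
print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`; printing it with `shlex.join` first makes it easy to copy the command into a shell.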

    • Option 2

      • Use the TensorRT Python API with the ONNX parser bundled with TensorRT (the Parser in the workflow diagram above) to build the TensorRT engine

      import os
      import sys
      import tensorrt as trt

      # logger used for TensorRT messages
      TRT_LOGGER = trt.Logger()
      model_path = 'resnet18.onnx'
      engine_file_path = "resnet18.trt"
      # create the network with an explicit batch dimension
      EXPLICIT_BATCH = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)

      # create the builder, network, and ONNX parser
      with trt.Builder(TRT_LOGGER) as builder, builder.create_network(EXPLICIT_BATCH) \
              as network, trt.OnnxParser(network, TRT_LOGGER) as parser:
          # TensorRT 8.x: the workspace limit (256 MiB here) is set on the builder config
          config = builder.create_builder_config()
          config.max_workspace_size = 1 << 28
          builder.max_batch_size = 1
          # check that the ONNX file exists
          if not os.path.exists(model_path):
              print('ONNX file {} not found.'.format(model_path))
              sys.exit(1)
          print('Loading ONNX file from path {}...'.format(model_path))
          with open(model_path, 'rb') as model:
              print('Beginning ONNX file parsing')
              if not parser.parse(model.read()):
                  print('ERROR: Failed to parse the ONNX file.')
                  for error in range(parser.num_errors):
                      print(parser.get_error(error))
                  sys.exit(1)

          network.get_input(0).shape = [1, 3, 32, 32]
          print('Completed parsing of ONNX file')
          # TensorRT 8.x: build_engine(network, config) replaces build_cuda_engine(network)
          engine = builder.build_engine(network, config)
          with open(engine_file_path, "wb") as f:
              f.write(engine.serialize())
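Step 1 mentioned that precision can be lowered later in TensorRT. With the Python API this is done through the builder config, as in the sketch below. It is a minimal sketch assuming the TensorRT 8.x Python API; `workspace_bytes` and `build_engine` are hypothetical helper names, and the 256 MiB default simply mirrors the `1 << 28` used above. Newer TensorRT releases have changed some of these calls, so check the API reference for the installed version.

```python
def workspace_bytes(mebibytes):
    """Convert MiB to bytes; 256 MiB equals the 1 << 28 used above."""
    return mebibytes << 20


def build_engine(onnx_path, engine_path, fp16=False, workspace_mb=256):
    """Parse an ONNX file and serialize a TensorRT engine (TensorRT 8.x API)."""
    import tensorrt as trt  # lazy import: needs a working TensorRT install

    logger = trt.Logger(trt.Logger.WARNING)
    flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    builder = trt.Builder(logger)
    network = builder.create_network(flags)
    parser = trt.OnnxParser(network, logger)

    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
            raise RuntimeError("ONNX parse failed:\n" + "\n".join(errors))

    config = builder.create_builder_config()
    config.max_workspace_size = workspace_bytes(workspace_mb)
    if fp16 and builder.platform_has_fast_fp16:
        # Allow FP16 kernels where supported; other layers stay in FP32.
        config.set_flag(trt.BuilderFlag.FP16)

    engine = builder.build_engine(network, config)
    if engine is None:
        raise RuntimeError("TensorRT failed to build the engine")
    with open(engine_path, "wb") as f:
        f.write(engine.serialize())
    return engine_path
```

A call such as `build_engine("resnet18.onnx", "resnet18-fp16.trt", fp16=True)` would then produce an FP16-enabled engine on GPUs that support it. Raising an exception on a parse or build failure avoids writing an empty engine file.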

 

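Step 3: Inference with the TensorRT Engine

  • The four basic steps of inference

    • 1. Input preprocessing: apply the same preprocessing as in training, so that the input image format matches what the model was trained on;

    • 2. Memory allocation: the allocate_buffers function, which needs no changes;

    • 3. Inference: the do_inference_v2 function, which needs no changes; to run inference on several images at once, use the do_inference function instead;

    • 4. Results: TensorRT returns the inference results as a list;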
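For the fourth step, each entry of the returned list is a flat buffer holding one output of the network. For a classifier, the predicted class is the index of the largest score. The sketch below shows one way to read such a buffer; `softmax` and `top1` are hypothetical helper names, and it assumes the network emits raw scores (logits) rather than probabilities.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a flat sequence of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top1(scores):
    """Return (class_index, probability) of the highest-scoring class."""
    probs = softmax(scores)
    index = max(range(len(probs)), key=probs.__getitem__)
    return index, probs[index]


# trt_outputs holds one flat buffer per output; a two-class stand-in example:
trt_outputs = [[2.0, 0.5]]
label, prob = top1(trt_outputs[0])
print(label, round(prob, 3))
```

For a two-class model this gives the same label as comparing the two scores directly, and it extends unchanged to more classes.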

  • Implementation with TensorRT + PyCUDA

import pycuda.driver as cuda
import pycuda.autoinit
import cv2
import numpy as np
import os
import tensorrt as trt
import time
from  PIL import Image
 
TRT_LOGGER = trt.Logger()
engine_file_path = "resnet18.trt"
 
# Simple helper data class that's a little nicer to use than a 2-tuple.
class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem
 
    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)
 
    def __repr__(self):
        return self.__str__()
 
# Allocates all buffers required for an engine, i.e. host/device inputs/outputs.
def allocate_buffers(engine):
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()
    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # Allocate host and device buffers
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer to device bindings.
        bindings.append(int(device_mem))
        # Append to the appropriate list.
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream
 
# Inference function; can be used as-is
def do_inference_v2(context, bindings, inputs, outputs, stream):
    # Transfer input data to the GPU.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    # Run inference.
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    # Synchronize the stream
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]
 
i = 0
j = 0
with open(engine_file_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime, \
        runtime.deserialize_cuda_engine(f.read()) as engine, engine.create_execution_context() as context:
    inputs, outputs, bindings, stream = allocate_buffers(engine)
    print(inputs,outputs,bindings,stream)
    dir = "val/white/"
 
    for name in os.listdir(dir):
        # preprocessing
        t1 = time.perf_counter()  # time.clock() was removed in Python 3.8
        image_path = os.path.join(dir,name)

        img = Image.open(image_path)
        img = np.array(img)
        # HWC -> WHC -> CHW
        img = img.transpose((1, 0, 2))
        img = img.transpose((2, 1, 0))
        img = img.astype(np.float32) / 255.0

        # add the batch dimension: CHW -> NCHW
        img = img[np.newaxis, :, :].astype(np.float32)
        print(img.shape)
        img = np.ascontiguousarray(img)
        # end of preprocessing

        # run inference
        inputs[0].host = img
        trt_outputs = do_inference_v2(context, bindings=bindings, \
                                      inputs=inputs, outputs=outputs, stream=stream)
        print(trt_outputs)

        # classify the result: class 0 if the first score is larger, else class 1
        if trt_outputs[0][0] > trt_outputs[0][1]:
            print("0")
            i = i +1
        else:
            print("1")
            j = j +1
        print("Time:",time.perf_counter()-t1)
 
m = 0
n = 0
with open(engine_file_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime, \
        runtime.deserialize_cuda_engine(f.read()) as engine, engine.create_execution_context() as context:
    inputs, outputs, bindings, stream = allocate_buffers(engine)
    print(inputs,outputs,bindings,stream)
    dir = "val/yellow/"
 
    for name in os.listdir(dir):
        t1 = time.perf_counter()  # time.clock() was removed in Python 3.8
        image_path = os.path.join(dir,name)

        # read with OpenCV (BGR) and convert to RGB
        image = cv2.imread(image_path)
        img = Image.fromarray(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
        img = np.array(img)
        # HWC -> WHC -> CHW
        img = img.transpose((1, 0, 2))
        img = img.transpose((2, 1, 0))
        img = img.astype(np.float32) / 255.0

        # add the batch dimension: CHW -> NCHW
        img = img[np.newaxis, :, :].astype(np.float32)
        print(img.shape)
        img = np.ascontiguousarray(img)
        inputs[0].host = img
        # run inference
        trt_outputs = do_inference_v2(context, bindings=bindings, \
                                      inputs=inputs, outputs=outputs, stream=stream)
        print(trt_outputs)
        if trt_outputs[0][0] > trt_outputs[0][1]:
            print("0")
            m = m+1
        else:
            print("1")
            n = n +1
        print("Time:",time.perf_counter()-t1)
#
print("i = ",i)
print("j = ",j)
print("m = ",m)
print("n = ",n)
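The two loops above differ only in the folder and in how the image is loaded, so they can be folded into one function that takes the loader and the inference call as arguments. The sketch below is one possible refactoring, not the author's code: `classify_directory`, `load_image`, and `infer` are hypothetical names, and the demo uses a fake loader and a fake model so that it runs without a GPU.

```python
import os
import tempfile
import time


def classify_directory(directory, load_image, infer):
    """Run infer(load_image(path)) on every file and count class 0 / class 1."""
    counts = [0, 0]
    for name in sorted(os.listdir(directory)):
        start = time.perf_counter()
        scores = infer(load_image(os.path.join(directory, name)))
        # Same decision rule as above: class 0 if the first score is larger.
        label = 0 if scores[0] > scores[1] else 1
        counts[label] += 1
        print(name, label, "Time:", time.perf_counter() - start)
    return counts


# Stand-in demo: three empty files, a loader that returns the path,
# and a fake model that scores "a.jpg" as class 0 and everything else as class 1.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.jpg", "b.jpg", "c.jpg"):
        open(os.path.join(tmp, name), "w").close()
    fake_infer = lambda path: [1.0, 0.0] if path.endswith("a.jpg") else [0.0, 1.0]
    demo_counts = classify_directory(tmp, load_image=lambda p: p, infer=fake_infer)
print(demo_counts)
```

In the script above, `infer` would wrap `do_inference_v2` and return `trt_outputs[0]`, so one deserialized engine and execution context could serve both `val/white/` and `val/yellow/`.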
