Anaconda3/Python 3.8.8
Pytorch 1.9.0+cu102
CUDA 11.3.r11.3
Onnx 1.10.2
Onnxruntime 1.9.0
CPU: pytorch->onnx->onnxruntime
GPU: pytorch->onnx->onnx2trt->tensorRT
ARM: pytorch->onnx->ncnn/mace/mnn, etc.
Step 1: pytorch2onnx
Step 2: onnx2TensorRT Engine
Step 3: TensorRT Inference
pytorch2onnx
Install the onnx, onnxsim and onnxruntime modules:
pip install onnx             # install onnx
pip install onnx-simplifier  # install onnxsim
pip install onnxruntime      # install onnxruntime
The precision can be lowered from FP32 to FP16 during export (FP16 is only supported on CUDA); alternatively, export at full precision first and apply precision quantization later in TensorRT.
For a model without special operations, export it directly with torch.onnx.export() (see the sketch below);
If the model contains special operations, they need to be rewritten; torch.onnx documents how such operations are converted;
Use onnxsim.simplify to simplify the exported ONNX model.
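A minimal export sketch, assuming a hypothetical model class MyModel in my_model.py, a checkpoint xxx.pth and a fixed 1x3x224x224 input (adjust the names, shapes and opset to your model):
import torch
from my_model import MyModel          # hypothetical module/class

model = MyModel()
model.load_state_dict(torch.load('xxx.pth', map_location='cpu'))
model.eval()                          # inference mode before export

dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model, dummy_input, 'xxx.onnx',
    input_names=['input'], output_names=['output'],
    opset_version=11,
    # optional: make the batch dimension dynamic
    dynamic_axes={'input': {0: 'batch'}, 'output': {0: 'batch'}},
)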
After converting to ONNX, you can use the onnx module / ONNX Runtime module to verify that the conversion is correct.
Method 1:
import onnx
from onnxsim.onnx_simplifier import simplify
model = onnx.load('xxx.onnx')                     # load the exported model
onnx.checker.check_model(model)                   # structural validity check
print(onnx.helper.printable_graph(model.graph))   # print the graph
# simplify onnx model
model_sim, check = simplify(model)
onnx.save(model_sim, 'xxx-sim.onnx')
Method 2:
import onnxruntime as ort
import numpy as np
ort_session = ort.InferenceSession('xxx.onnx')
# 'input_name' must match the actual input name of your model
outputs = ort_session.run(None, {
    'input_name': np.random.randn(10, 3, 224, 224).astype(np.float32)
})
print(outputs[0])
Verify that the ONNX model produces the same results as the original PyTorch model; if they match, the conversion is correct:
import torch
import original_model
import onnxruntime as ort
import numpy as np
# load pytorch model
torch_model = original_model()
torch_model.load_state_dict(torch.load('xxx.pth', map_location='cpu'))
torch_model.eval()   # switch to inference mode
img = torch.ones((1, 3, 640, 640)) / 255
# pytorch model inference
with torch.no_grad():
    res = torch_model(img)
print(res[0])
# load onnx model
sess = ort.InferenceSession('xxx.onnx')
# get input name
input_name = sess.get_inputs()[0].name
outputs = sess.run(None, {input_name: img.numpy()})
print(outputs[0])
# verify whether the results correspond with each other
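To make the comparison concrete, the two outputs can be checked numerically, e.g. (a sketch assuming the model returns a tuple/list of tensors so that res[0] corresponds to outputs[0]; the tolerances are illustrative):
np.testing.assert_allclose(res[0].cpu().numpy(), outputs[0], rtol=1e-3, atol=1e-5)
print("PyTorch and ONNX Runtime results match")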
You can also use the visualization tool netron to visualize the ONNX model. Installing and using netron:
# For Linux or Ubuntu, install with pip
pip install netron
# To visualize the ONNX model, run the `netron` command and press `Enter`
netron
# Netron then opens its interface; load the .onnx file there
Installing TensorRT
Installing TensorRT for Python (using the .tar archive; this installation method avoids missing-dependency issues)
Confirm that CUDA is installed in a version matching your PyTorch installation;
Install the pycuda module: pip install pycuda;
Confirm the onnx module is installed;
Download the TensorRT 8.x .tar file matching your CUDA version from: NVIDIA TensorRT | NVIDIA Developer;
Extract the archive, change into the python directory, and install the TensorRT, graphsurgeon and onnx_graphsurgeon wheels:
cd TensorRT-8.x.x/python
pip install tensorrt-8.x.x-xxx.whl
cd TensorRT-8.x.x/graphsurgeon
pip install graphsurgeon-x.x.x.whl
cd TensorRT-8.x.x/onnx_graphsurgeon
pip install onnx_graphsurgeon-x.x.x.whl
Add TensorRT/lib to the environment variables (e.g. in ~/.bashrc):
export LD_LIBRARY_PATH=/home/YOUR_USERNAME/TensorRT-8.0.3.4/lib:$LD_LIBRARY_PATH
export PATH=/home/YOUR_USERNAME/TensorRT-8.0.3.4/bin:$PATH
export LD_LIBRARY_PATH=/home/YOUR_USERNAME/TensorRT-8.0.3.4/targets/x86_64-linux-gnu/lib:$LD_LIBRARY_PATH
source ~/.bashrc
Verify that TensorRT is installed correctly:
import tensorrt  # if this imports without errors, the installation succeeded
print(tensorrt.__version__)
The basic TensorRT execution flow
First, with a TensorRT Logger as argument, use a Builder to create the computation graph of type INetworkDefinition;
Then use a Parser to populate the network definition from an ONNX (or other framework) model; the network can also be built directly with the TensorRT API;
From the network definition, build an engine for the CUDA device;
Inference itself is performed by the ExecutionContext created from the CUDA engine:
engine.create_execution_context()
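Condensed into a sketch (names follow the TensorRT Python API used in Method 2 below; treat this as an outline rather than a drop-in script):
import tensorrt as trt

logger = trt.Logger()                                    # 1. logger
builder = trt.Builder(logger)                            # 2. builder
network = builder.create_network(                        # 3. INetworkDefinition (explicit batch)
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)                 # 4. fill the graph from an ONNX model
with open('model.onnx', 'rb') as f:
    parser.parse(f.read())
engine = builder.build_cuda_engine(network)              # 5. build the CUDA engine
context = engine.create_execution_context()              # 6. run inference through the context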
Onnx2TensorRT Engine
Method 1
Use the trtexec tool bundled with TensorRT to convert the ONNX model directly to a TensorRT engine; the following command generates the engine:
trtexec --onnx=your_model.onnx --saveEngine=resnet_engine.trt --explicitBatch
trtexec supports the following optional arguments:
--explicitBatch makes the batch dimension of the network input explicit. This flag must not be omitted, since the ONNX parser only supports explicit-batch mode. For more information, refer to the Working With Dynamic Shapes section in the TensorRT Developer Guide;
--fp16 enables FP16 precision for layers that support it, in addition to FP32. For more information, refer to the Working With Mixed Precision section in the TensorRT Developer Guide.
--int8 enables INT8 precision for layers that support it, in addition to FP32. For more information, refer to the Working With Mixed Precision section in the TensorRT Developer Guide.
--best enables all supported precisions to achieve the best performance for every layer.
--workspace controls the maximum amount of persistent scratch memory available (in MB) for algorithms considered by the builder. This should be set as high as possible for a given platform based on availability; at runtime TensorRT will allocate only what is required, not exceeding the max.
--minShapes and --maxShapes specify the range of dimensions for each network input and --optShapes specifies the dimensions that the auto-tuner should use for optimization (see the example command after this list). For more information, refer to the Optimization Profiles section in the TensorRT Developer Guide.
--buildOnly requests that inference performance measurements be skipped.
--saveEngine specifies the file into which the serialized engine must be saved.
--safe enables building safety certified engines. This switch is used for prototyping automotive safety restricted flows in the TensorRT safe runtime.
--tacticSources can be used to add or remove tactics from the default tactic sources (cuDNN, cuBLAS and cuBLASLt).
--minTiming and --avgTiming respectively set the minimum and average number of iterations used in tactic selection.
--noBuilderCache disables the layer timing cache in the TensorRT builder. The timing cache helps to reduce the time taken in the builder phase by caching the layer profiling information and should work for most cases. Use this switch for the problematic cases. For more information, refer to the Builder Layer Timing Cache section in the TensorRT Developer Guide.
--timingCacheFile can be used to save or load the serialized global timing cache.
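For example, a dynamic-shape build command might look like this (the input tensor name input and the shape ranges are illustrative; use the input name of your own model):
trtexec --onnx=your_model.onnx --saveEngine=your_model.trt \
        --minShapes=input:1x3x224x224 \
        --optShapes=input:8x3x224x224 \
        --maxShapes=input:16x3x224x224 \
        --fp16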
Method 2
Use the TensorRT Python API (the ONNX parser bundled with TensorRT, i.e. the Parser in the execution flow above) to build the TensorRT engine:
import os
import tensorrt as trt
# TensorRT logger
TRT_LOGGER = trt.Logger()
model_path = 'resnet18.onnx'
engine_file_path = "resnet18.trt"
EXPLICIT_BATCH = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)  # explicit batch, batch size 1
# Create the builder, network and ONNX parser
with trt.Builder(TRT_LOGGER) as builder, builder.create_network(EXPLICIT_BATCH) \
        as network, trt.OnnxParser(network, TRT_LOGGER) as parser:
    builder.max_workspace_size = 1 << 28
    builder.max_batch_size = 1
    # Check that the ONNX file exists
    if not os.path.exists(model_path):
        print('ONNX file {} not found.'.format(model_path))
        exit(0)
    print('Loading ONNX file from path {}...'.format(model_path))
    with open(model_path, 'rb') as model:
        print('Beginning ONNX file parsing')
        if not parser.parse(model.read()):
            print('ERROR: Failed to parse the ONNX file.')
            for error in range(parser.num_errors):
                print(parser.get_error(error))
    network.get_input(0).shape = [1, 3, 32, 32]
    print('Completed parsing of ONNX file')
    engine = builder.build_cuda_engine(network)
    with open(engine_file_path, "wb") as f:
        f.write(engine.serialize())
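Note: builder.max_workspace_size and builder.build_cuda_engine were removed in TensorRT 8.x. With a TensorRT 8 wheel, a rough equivalent is to replace the last few lines inside the with block with a builder config, e.g. (a sketch using the same builder/network/parser objects as above):
config = builder.create_builder_config()
config.max_workspace_size = 1 << 28                               # workspace limit
serialized_engine = builder.build_serialized_network(network, config)
with open(engine_file_path, "wb") as f:
    f.write(serialized_engine)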
The four basic steps of inference
1. Input preprocessing: use the same preprocessing as during training, so that the input images match the training format;
2. Memory allocation: the allocate_buffers function, which needs no modification;
3. Inference function: the do_inference_v2 function, which needs no modification; for inference on multiple images at once, do_inference can be used instead;
4. Results: TensorRT returns the inference results as a list;
TensorRT + PyCUDA implementation
import pycuda.driver as cuda
import pycuda.autoinit
import cv2
import numpy as np
import os
import tensorrt as trt
import time
from PIL import Image
TRT_LOGGER = trt.Logger()
engine_file_path = "resnet18.trt"
# Simple helper data class that's a little nicer to use than a 2-tuple.
class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()
# Allocates all buffers required for an engine, i.e. host/device inputs/outputs.
def allocate_buffers(engine):
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()
    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # Allocate host and device buffers
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer to device bindings.
        bindings.append(int(device_mem))
        # Append to the appropriate list.
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream
# Inference function; can be used as-is
def do_inference_v2(context, bindings, inputs, outputs, stream):
    # Transfer input data to the GPU.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    # Run inference.
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    # Synchronize the stream
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]
i = 0
j = 0
with open(engine_file_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime, \
        runtime.deserialize_cuda_engine(f.read()) as engine, engine.create_execution_context() as context:
    inputs, outputs, bindings, stream = allocate_buffers(engine)
    print(inputs, outputs, bindings, stream)
    dir = "val/white/"
    for name in os.listdir(dir):
        # Preprocessing (time.clock() was removed in Python 3.8, so perf_counter is used)
        t1 = time.perf_counter()
        image_path = os.path.join(dir, name)
        img = Image.open(image_path)
        img = np.array(img)
        img = img.transpose((1, 0, 2))
        img = img.transpose((2, 1, 0))    # HWC -> CHW
        img = img.astype(np.float32) / 255.0
        img = img[np.newaxis, :, :].astype(np.float32)   # add batch dimension
        print(img.shape)
        img = np.ascontiguousarray(img)
        # Preprocessing done; start inference
        inputs[0].host = img
        trt_outputs = do_inference_v2(context, bindings=bindings,
                                      inputs=inputs, outputs=outputs, stream=stream)
        print(trt_outputs)
        # Interpret the result
        if trt_outputs[0][0] > trt_outputs[0][1]:
            print("0")
            i = i + 1
        else:
            print("1")
            j = j + 1
        print("Time:", time.perf_counter() - t1)
m = 0
n = 0
with open(engine_file_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime, \
        runtime.deserialize_cuda_engine(f.read()) as engine, engine.create_execution_context() as context:
    inputs, outputs, bindings, stream = allocate_buffers(engine)
    print(inputs, outputs, bindings, stream)
    dir = "val/yellow/"
    for name in os.listdir(dir):
        t1 = time.perf_counter()
        image_path = os.path.join(dir, name)
        image = cv2.imread(image_path)
        img = Image.fromarray(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
        img = np.array(img)
        img = img.transpose((1, 0, 2))
        img = img.transpose((2, 1, 0))    # HWC -> CHW
        img = img.astype(np.float32) / 255.0
        img = img[np.newaxis, :, :].astype(np.float32)   # add batch dimension
        print(img.shape)
        img = np.ascontiguousarray(img)
        inputs[0].host = img
        # Start inference
        trt_outputs = do_inference_v2(context, bindings=bindings,
                                      inputs=inputs, outputs=outputs, stream=stream)
        print(trt_outputs)
        if trt_outputs[0][0] > trt_outputs[0][1]:
            print("0")
            m = m + 1
        else:
            print("1")
            n = n + 1
        print("Time:", time.perf_counter() - t1)
# Print the classification counts for both folders
print("i = ",i)
print("j = ",j)
print("m = ",m)
print("n = ",n)