This write-up targets JetPack 4.5.1. I did not use the newer JetPack 4.6 / 4.6.1 because those releases ship TensorRT >= 8, whose API differs from TensorRT 7 and has far fewer online references to follow, even though TensorRT 8 delivers noticeably better acceleration than TensorRT 7.
TensorRT
TensorRT is a high performance deep learning inference runtime for image classification, segmentation, and object detection neural networks. TensorRT is built on CUDA, NVIDIA’s parallel programming model, and enables you to optimize inference for all deep learning frameworks. It includes a deep learning inference optimizer and runtime that delivers low latency and high-throughput for deep learning inference applications.
JetPack 4.5.1 includes TensorRT 7.1.3
cuDNN
CUDA Deep Neural Network library provides high-performance primitives for deep learning frameworks. It provides highly tuned implementations for standard routines such as forward and backward convolution, pooling, normalization, and activation layers.
JetPack 4.5.1 includes cuDNN 8.0
CUDA
CUDA Toolkit provides a comprehensive development environment for C and C++ developers building GPU-accelerated applications. The toolkit includes a compiler for NVIDIA GPUs, math libraries, and tools for debugging and optimizing the performance of your applications.
JetPack 4.5.1 includes CUDA 10.2
After flashing the board, do not install Archiconda (or at least configure it so that the conda environment is not activated by default); otherwise the system TensorRT and related packages cannot be used.
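As a quick sanity check (a minimal sketch, assuming the JetPack packages were installed normally), you can verify that TensorRT is importable from the system python3 outside of any conda environment:

# Run with the system python3, outside any conda environment.
import tensorrt as trt

print(trt.__version__)  # expect 7.1.x on JetPack 4.5.1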
Exporting the network model to ONNX
import torch
from logs.two_asff.model import YoloBody  # replace with your own network; the weights are loaded below
model_path = 'logs/two_asff/99.09.pth'
torch_model = YoloBody(num_classes=1, phi='shufflenet')  # build the model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
torch_model.load_state_dict(torch.load(model_path, map_location=device))
# set the model to inference mode
torch_model.eval()

batch_size = 1
# Input to the model
x = torch.randn(batch_size, 3, 416, 416, requires_grad=True)
torch_out = torch_model(x)

# Export the model
torch.onnx.export(torch_model,               # model being run
                  x,                         # model input (or a tuple for multiple inputs)
                  "shufflenetv2-yolox.onnx", # where to save the model (can be a file or file-like object)
                  export_params=True,        # store the trained parameter weights inside the model file
                  opset_version=11,          # the ONNX version to export the model to
                  do_constant_folding=True,  # whether to execute constant folding for optimization
                  input_names=['images'],    # the model's input names
                  output_names=['output'],   # the model's output names
                  dynamic_axes={'images': {0: 'batch_size'},   # variable length axes; keys must match input_names/output_names
                                'output': {0: 'batch_size'}})
Pay attention to input_names and output_names: they must match the input/output names used later when building the TensorRT engine.
Note that the exported ONNX graph is generated with (or contains) INT64 weights. When building the TensorRT engine you will see warnings that TensorRT does not natively support INT64 and is casting the INT64 values down to INT32; these warnings can safely be ignored.
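To double-check which input/output names were actually recorded in the exported file, a small optional snippet (using the file name from the export call above) can read them back with the onnx package:

import onnx

m = onnx.load("shufflenetv2-yolox.onnx")
print("inputs :", [i.name for i in m.graph.input])
print("outputs:", [o.name for o in m.graph.output])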
Next, test whether the ONNX model was exported successfully. Note that ONNX Runtime may actually run slower than PyTorch here.
(Be careful when running pip3 install onnx: it may also pull in numpy or other packages, and mismatched versions can cause core dumps that make it unusable. You can also export and test the ONNX model on a PC instead.)
# Check that the ONNX model was exported successfully; this usually works fine.
import onnx
# Load the ONNX model
model = onnx.load("../two_asff.onnx")
# Check that the IR is well formed
print(onnx.checker.check_model(model))
print('-'*50)
# Print a human readable representation of the graph
print('Model :\n\n{}'.format(onnx.helper.printable_graph(model.graph)))
Run the ONNX model once to get a rough speed measurement:
import time
import cv2
import numpy as np
import onnxruntime as rt

def image_process(image_path):
    mean = np.array([[[0.485, 0.456, 0.406]]])  # mean and std used during training
    std = np.array([[[0.229, 0.224, 0.225]]])
    img = cv2.imread(image_path)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    img = cv2.resize(img, (416, 416))              # (416, 416, 3)
    image = img.astype(np.float32) / 255.0
    image = (image - mean) / std
    image = image.transpose((2, 0, 1))             # (3, 416, 416)
    image = image[np.newaxis, :, :, :]             # (1, 3, 416, 416)
    image = np.array(image, dtype=np.float32)
    return image

def onnx_runtime():
    imgdata = image_process('../img/1.jpg')
    p1 = time.time()
    sess = rt.InferenceSession('yolox-tiny.onnx')
    p2 = time.time()
    print("ONNX Runtime session load time: {}".format(p2 - p1))
    input_name = sess.get_inputs()[0].name
    output_name = sess.get_outputs()[0].name
    t1 = time.time()
    pred_onnx = sess.run([output_name], {input_name: imgdata})
    t2 = time.time()
    print("ONNX Runtime inference time: {}".format(t2 - t1))
    # print("outputs:")
    # print(np.array(pred_onnx))

onnx_runtime()
Next, generate the TensorRT engine.
Below is the common.py helper file (NVIDIA's sample helper); just copy it as-is, it will be needed later.
#
# Copyright 1993-2020 NVIDIA Corporation. All rights reserved.
#
# NOTICE TO LICENSEE:
#
# This source code and/or documentation ("Licensed Deliverables") are
# subject to NVIDIA intellectual property rights under U.S. and
# international Copyright laws.
#
# These Licensed Deliverables contained herein is PROPRIETARY and
# CONFIDENTIAL to NVIDIA and is being provided under the terms and
# conditions of a form of NVIDIA software license agreement by and
# between NVIDIA and Licensee ("License Agreement") or electronically
# accepted by Licensee. Notwithstanding any terms or conditions to
# the contrary in the License Agreement, reproduction or disclosure
# of the Licensed Deliverables to any third party without the express
# written consent of NVIDIA is prohibited.
#
# NOTWITHSTANDING ANY TERMS OR CONDITIONS TO THE CONTRARY IN THE
# LICENSE AGREEMENT, NVIDIA MAKES NO REPRESENTATION ABOUT THE
# SUITABILITY OF THESE LICENSED DELIVERABLES FOR ANY PURPOSE. IT IS
# PROVIDED "AS IS" WITHOUT EXPRESS OR IMPLIED WARRANTY OF ANY KIND.
# NVIDIA DISCLAIMS ALL WARRANTIES WITH REGARD TO THESE LICENSED
# DELIVERABLES, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY,
# NONINFRINGEMENT, AND FITNESS FOR A PARTICULAR PURPOSE.
# NOTWITHSTANDING ANY TERMS OR CONDITIONS TO THE CONTRARY IN THE
# LICENSE AGREEMENT, IN NO EVENT SHALL NVIDIA BE LIABLE FOR ANY
# SPECIAL, INDIRECT, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, OR ANY
# DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS,
# WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS
# ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE
# OF THESE LICENSED DELIVERABLES.
#
# U.S. Government End Users. These Licensed Deliverables are a
# "commercial item" as that term is defined at 48 C.F.R. 2.101 (OCT
# 1995), consisting of "commercial computer software" and "commercial
# computer software documentation" as such terms are used in 48
# C.F.R. 12.212 (SEPT 1995) and is provided to the U.S. Government
# only as a commercial end item. Consistent with 48 C.F.R.12.212 and
# 48 C.F.R. 227.7202-1 through 227.7202-4 (JUNE 1995), all
# U.S. Government End Users acquire the Licensed Deliverables with
# only those rights set forth herein.
#
# Any use of the Licensed Deliverables in individual and commercial
# software must include, in the user documentation and internal
# comments to the code, the above Disclaimer and U.S. Government End
# Users Notice.
#
from itertools import chain
import argparse
import os
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
import tensorrt as trt
try:
    # Sometimes python2 does not understand FileNotFoundError
    FileNotFoundError
except NameError:
    FileNotFoundError = IOError

EXPLICIT_BATCH = 1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)

def GiB(val):
    return val * 1 << 30

def add_help(description):
    parser = argparse.ArgumentParser(description=description, formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    args, _ = parser.parse_known_args()
def find_sample_data(description="Runs a TensorRT Python sample", subfolder="", find_files=[]):
    '''
    Parses sample arguments.

    Args:
        description (str): Description of the sample.
        subfolder (str): The subfolder containing data relevant to this sample
        find_files (str): A list of filenames to find. Each filename will be replaced with an absolute path.

    Returns:
        str: Path of data directory.
    '''
    # Standard command-line arguments for all samples.
    kDEFAULT_DATA_ROOT = os.path.join(os.sep, "usr", "src", "tensorrt", "data")
    parser = argparse.ArgumentParser(description=description, formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument("-d", "--datadir", help="Location of the TensorRT sample data directory, and any additional data directories.", action="append", default=[kDEFAULT_DATA_ROOT])
    args, _ = parser.parse_known_args()

    def get_data_path(data_dir):
        # If the subfolder exists, append it to the path, otherwise use the provided path as-is.
        data_path = os.path.join(data_dir, subfolder)
        if not os.path.exists(data_path):
            print("WARNING: " + data_path + " does not exist. Trying " + data_dir + " instead.")
            data_path = data_dir
        # Make sure data directory exists.
        if not (os.path.exists(data_path)):
            print("WARNING: {:} does not exist. Please provide the correct data path with the -d option.".format(data_path))
        return data_path

    data_paths = [get_data_path(data_dir) for data_dir in args.datadir]
    return data_paths, locate_files(data_paths, find_files)
def locate_files(data_paths, filenames):
    """
    Locates the specified files in the specified data directories.
    If a file exists in multiple data directories, the first directory is used.

    Args:
        data_paths (List[str]): The data directories.
        filename (List[str]): The names of the files to find.

    Returns:
        List[str]: The absolute paths of the files.

    Raises:
        FileNotFoundError if a file could not be located.
    """
    found_files = [None] * len(filenames)
    for data_path in data_paths:
        # Find all requested files.
        for index, (found, filename) in enumerate(zip(found_files, filenames)):
            if not found:
                file_path = os.path.abspath(os.path.join(data_path, filename))
                if os.path.exists(file_path):
                    found_files[index] = file_path

    # Check that all files were found
    for f, filename in zip(found_files, filenames):
        if not f or not os.path.exists(f):
            raise FileNotFoundError("Could not find {:}. Searched in data paths: {:}".format(filename, data_paths))
    return found_files
# Simple helper data class that's a little nicer to use than a 2-tuple.
class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()

# Allocates all buffers required for an engine, i.e. host/device inputs/outputs.
def allocate_buffers(engine):
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()
    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # Allocate host and device buffers
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer to device bindings.
        bindings.append(int(device_mem))
        # Append to the appropriate list.
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream
# This function is generalized for multiple inputs/outputs.
# inputs and outputs are expected to be lists of HostDeviceMem objects.
def do_inference(context, bindings, inputs, outputs, stream, batch_size=1):
    # Transfer input data to the GPU.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    # Run inference.
    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    # Synchronize the stream
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]

# This function is generalized for multiple inputs/outputs for full dimension networks.
# inputs and outputs are expected to be lists of HostDeviceMem objects.
def do_inference_v2(context, bindings, inputs, outputs, stream):
    # Transfer input data to the GPU.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    # Run inference.
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    # Synchronize the stream
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]
log.py — a small helper I wrote for timing and printing; it is optional.
import time

def print_div(text, custom='', end='\n\n'):
    print('-'*50, '\n')
    print(f"{text}{custom}", end=end)

class timer():
    def __init__(self, text):
        print_div(text, '...', end=' ')
        self.t_start = time.time()

    def end(self):
        t_cost = time.time() - self.t_start
        print('\033[35m', end='')
        print('Done ({:.5f}s)'.format(t_cost))
        print('\033[0m')

class logger():
    def __init__(self, text):
        print_div(text)
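A quick usage example (the same pattern used in build_engine.py below): create a timer before the step you want to measure and call end() afterwards.

from log import timer, logger

t = timer('Some slow step')
# ... do the work to be timed here ...
t.end()
logger('Some message')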
Generating the TensorRT engine
build_engine.py
import tensorrt as trt

from log import timer, logger

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
trt_runtime = trt.Runtime(TRT_LOGGER)

def build_engine(onnx_path, shape=[1, 416, 416, 3]):
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network(1) as network, trt.OnnxParser(network, TRT_LOGGER) as parser:
        builder.max_workspace_size = (256 << 20)  # 256MiB
        builder.fp16_mode = True                  # fp32_mode -> False
        with open(onnx_path, 'rb') as model:
            parser.parse(model.read())
        engine = builder.build_cuda_engine(network)
        return engine

def save_engine(engine, engine_path):
    buf = engine.serialize()
    with open(engine_path, 'wb') as f:
        f.write(buf)

def load_engine(trt_runtime, engine_path):
    with open(engine_path, 'rb') as f:
        engine_data = f.read()
    engine = trt_runtime.deserialize_cuda_engine(engine_data)
    return engine

if __name__ == "__main__":
    onnx_path = '../two_asff.onnx'
    trt_path = 'two_asff_113.trt'
    input_shape = [1, 416, 416, 3]

    build_trt = timer('Parser ONNX & Build TensorRT Engine')
    engine = build_engine(onnx_path, input_shape)
    build_trt.end()

    save_trt = timer('Save TensorRT Engine')
    save_engine(engine, trt_path)
    save_trt.end()
Output:
--------------------------------------------------
[TensorRT] WARNING: onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[TensorRT] WARNING: onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[TensorRT] WARNING: onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[TensorRT] WARNING: onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[TensorRT] WARNING: onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[TensorRT] WARNING: onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[TensorRT] WARNING: onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[TensorRT] WARNING: onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[TensorRT] WARNING: onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
Parser ONNX & Build TensorRT Engine... Done (358.54099s)
--------------------------------------------------
Save TensorRT Engine... Done (0.35394s)
A second script for generating the TRT engine; using either one of the two is enough.
onnx2trt.py
import numpy as np
import pycuda.driver as cudadriver
import tensorrt as trt
import torch
import os
import time
import common

from PIL import Image
import cv2
import torchvision

def ONNX_build_engine(onnx_file_path, write_engine=True):
    # Build the engine by loading an ONNX file
    # :param onnx_file_path: path of the ONNX file
    # :return: engine
    G_LOGGER = trt.Logger(trt.Logger.WARNING)
    # 1. The explicit-batch flag is mandatory when using dynamic inputs
    explicit_batch = 1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    batch_size = 1  # maximum batch size supported at inference time
    with trt.Builder(G_LOGGER) as builder, builder.create_network(explicit_batch) as network, \
            trt.OnnxParser(network, G_LOGGER) as parser:
        builder.max_batch_size = batch_size
        config = builder.create_builder_config()
        config.max_workspace_size = common.GiB(2)
        config.set_flag(trt.BuilderFlag.FP16)
        print('Loading ONNX file from path {}...'.format(onnx_file_path))
        with open(onnx_file_path, 'rb') as model:
            print('Beginning ONNX file parsing')
            parser.parse(model.read())
        print('Completed parsing of ONNX file')
        print('Building an engine from file {}; this may take a while...'.format(onnx_file_path))
        # Key part: dynamic inputs need an optimization profile giving the
        # minimum, typical (optimal), and maximum input shapes.
        # One profile.set_shape call is needed per input, and the name must match
        # the input name used when exporting the ONNX model.
        # TensorRT 6 and later support dynamic inputs; each dynamic input must be
        # bound to a profile specifying min/opt/max shapes, and shapes outside
        # that range raise an error at runtime.
        profile = builder.create_optimization_profile()
        profile.set_shape("input", (1, 3, 208, 208), (1, 3, 416, 416), (1, 3, 640, 640))
        config.add_optimization_profile(profile)
        engine = builder.build_engine(network, config)
        print("Completed creating Engine")
        # Save the engine file
        if write_engine:
            engine_file_path = 'two_asff_112.trt'
            with open(engine_file_path, "wb") as f:
                f.write(engine.serialize())
        return engine

onnx_file_path = r'../two_asff.onnx'
write_engine = True
engine = ONNX_build_engine(onnx_file_path, write_engine)
Running inference with the TRT engine
trt_inference.py
Inference with this script turns out to be fairly slow; I am not sure what causes it. A faster version follows further below.
import tensorrt as trt
from PIL import Image
import torchvision.transforms as T
import numpy as np

import common
from log import timer, logger

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
trt_runtime = trt.Runtime(TRT_LOGGER)

# Same helper as in build_engine.py, redefined here so the script is self-contained.
def load_engine(trt_runtime, engine_path):
    with open(engine_path, 'rb') as f:
        engine_data = f.read()
    engine = trt_runtime.deserialize_cuda_engine(engine_data)
    return engine

def load_data(path):
    trans = T.Compose([
        T.Resize(208), T.CenterCrop(208), T.ToTensor()
    ])
    img = Image.open(path)
    img_tensor = trans(img).unsqueeze(0)
    return np.array(img_tensor)

# load trt engine
load_trt = timer("Load TRT Engine")
trt_path = 'two_asff_112.trt'
engine = load_engine(trt_runtime, trt_path)
load_trt.end()

# allocate buffers
inputs, outputs, bindings, stream = common.allocate_buffers(engine)

# load data
inputs[0].host = load_data('../img/1.jpg')

# infer_trt = timer("TRT Inference")
with engine.create_execution_context() as context:
    infer_trt = timer("TRT Inference")
    trt_outputs = common.do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
    infer_trt.end()
    preds = trt_outputs[0]
    # infer_trt.end()

# Get Labels
f = open('../model_data/apple_classes.txt')
t = [i.replace('\n', '') for i in f.readlines()]

logger(f"Result : {t[np.argmax(preds)]}")
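One thing to be aware of: two_asff_112.trt was built with a dynamic optimization profile (onnx2trt.py above), and for such engines the execution context needs to be told the concrete input shape before buffers are sized and inference is run; common.allocate_buffers alone cannot size the -1 dimensions. Below is a minimal sketch of that pattern, assuming the input binding is index 0 and a 1x3x416x416 float32 input; it is my own addition, not the author's original code.

import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit
import tensorrt as trt

import common

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def infer_dynamic(engine_path, input_array):
    with open(engine_path, 'rb') as f, trt.Runtime(TRT_LOGGER) as runtime:
        engine = runtime.deserialize_cuda_engine(f.read())
    with engine.create_execution_context() as context:
        # Fix the runtime input shape first; with dynamic axes the engine's
        # binding shapes still contain -1, so buffer sizes must come from the context.
        context.set_binding_shape(0, input_array.shape)
        inputs, outputs, bindings = [], [], []
        stream = cuda.Stream()
        for i in range(engine.num_bindings):
            shape = context.get_binding_shape(i)
            dtype = trt.nptype(engine.get_binding_dtype(i))
            host_mem = cuda.pagelocked_empty(trt.volume(shape), dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            bindings.append(int(device_mem))
            (inputs if engine.binding_is_input(i) else outputs).append(
                common.HostDeviceMem(host_mem, device_mem))
        np.copyto(inputs[0].host, input_array.astype(np.float32).ravel())
        # Explicit-batch engines should be run with execute_async_v2.
        return common.do_inference_v2(context, bindings=bindings,
                                      inputs=inputs, outputs=outputs, stream=stream)

# Example call (hypothetical input):
# outs = infer_dynamic('two_asff_112.trt', np.random.rand(1, 3, 416, 416).astype(np.float32))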
The next script goes directly from ONNX to a TRT engine and times the inference, but it does not save anything. If you want to save the engine and the results, use the scripts above instead and, once the engine has been generated, comment out the engine-building code so that only the inference runs, which is very fast.
I originally wanted to drop my own PyTorch model into this script for a side-by-side comparison, but it took far too long; it is easier to just time the PyTorch model separately in the plain python3 environment.
pAt_complate.py
import pycuda.autoinit
import numpy as np
import pycuda.driver as cuda
import tensorrt as trt
import torch
import os
import time
from PIL import Image
import cv2
import torchvision
# from logs.two_asff.model import YoloBody
filename = '../img/1.jpg'
max_batch_size = 1
onnx_model_path = '../two_asff.onnx'
# model_path = 'logs/two_asff/yolox/99.09.pth'
TRT_LOGGER = trt.Logger() # This logger is required to build an engine
# net = YoloBody(1, shufflenet)
# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# net.load_state_dict(torch.load(model_path, map_location=device))
# net = net.eval()
def get_img_np_nchw(filename):
    image = cv2.imread(filename)
    image_cv = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    image_cv = cv2.resize(image_cv, (224, 224))
    miu = np.array([0.485, 0.456, 0.406])
    std = np.array([0.229, 0.224, 0.225])
    img_np = np.array(image_cv, dtype=float) / 255.
    r = (img_np[:, :, 0] - miu[0]) / std[0]
    g = (img_np[:, :, 1] - miu[1]) / std[1]
    b = (img_np[:, :, 2] - miu[2]) / std[2]
    img_np_t = np.array([r, g, b])
    img_np_nchw = np.expand_dims(img_np_t, axis=0)
    return img_np_nchw
class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        """Within this context, host_mem means the CPU memory and device means the GPU memory"""
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()

def allocate_buffers(engine):
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()
    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # Allocate host and device buffers
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer to device bindings.
        bindings.append(int(device_mem))
        # Append to the appropriate list.
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream
def get_engine(max_batch_size=1, onnx_file_path="", engine_file_path="",
               fp16_mode=False, int8_mode=False, save_engine=False):
    """Attempts to load a serialized engine if available, otherwise builds a new TensorRT engine and saves it."""

    def build_engine(max_batch_size, save_engine):
        """Takes an ONNX file and creates a TensorRT engine to run inference with"""
        explicit_batch = 1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
        with trt.Builder(TRT_LOGGER) as builder, \
                builder.create_network(explicit_batch) as network, \
                trt.OnnxParser(network, TRT_LOGGER) as parser:
            builder.max_workspace_size = 1 << 30  # Your workspace size
            builder.max_batch_size = max_batch_size
            # pdb.set_trace()
            builder.fp16_mode = fp16_mode  # Default: False
            builder.int8_mode = int8_mode  # Default: False
            if int8_mode:
                # To be updated
                raise NotImplementedError

            # Parse model file
            if not os.path.exists(onnx_file_path):
                quit('ONNX file {} not found'.format(onnx_file_path))
            print('Loading ONNX file from path {}...'.format(onnx_file_path))
            with open(onnx_file_path, 'rb') as model:
                print('Beginning ONNX file parsing')
                parser.parse(model.read())
            print('Completed parsing of ONNX file')
            print('Building an engine from file {}; this may take a while...'.format(onnx_file_path))

            # Marking the last layer's output again is likely what triggers the
            # "[TensorRT] ERROR: Tensor ... is already set as an output" message
            # in the log below; the ONNX parser usually marks the outputs already.
            last_layer = network.get_layer(network.num_layers - 1)
            network.mark_output(last_layer.get_output(0))

            engine = builder.build_cuda_engine(network)
            print("Completed creating Engine")
            if save_engine:
                with open(engine_file_path, "wb") as f:
                    f.write(engine.serialize())
            return engine

    if os.path.exists(engine_file_path):
        # If a serialized engine exists, load it instead of building a new one.
        print("Reading engine from file {}".format(engine_file_path))
        with open(engine_file_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
            return runtime.deserialize_cuda_engine(f.read())
    else:
        return build_engine(max_batch_size, save_engine)
def do_inference(context, bindings, inputs, outputs, stream, batch_size=1):
    # Transfer data from CPU to the GPU.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    # Run inference.
    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    # Synchronize the stream
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]

def postprocess_the_outputs(h_outputs, shape_of_output):
    h_outputs = h_outputs.reshape(*shape_of_output)
    return h_outputs
img_np_nchw = get_img_np_nchw(filename)
img_np_nchw = img_np_nchw.astype(dtype=np.float32)
# These two modes are dependent on hardwares
fp16_mode = True
int8_mode = False
trt_engine_path = './model_fp16_{}_int8_{}.trt'.format(fp16_mode, int8_mode)
# Build an engine
engine = get_engine(max_batch_size, onnx_model_path, trt_engine_path, fp16_mode, int8_mode)
# Create the context for this engine
context = engine.create_execution_context()
# Allocate buffers for input and output
inputs, outputs, bindings, stream = allocate_buffers(engine) # input, output: host # bindings
# Do inference
shape_of_output = (max_batch_size, 4056)
# Load data to the buffer
inputs[0].host = img_np_nchw.reshape(-1)
# inputs[1].host = ... for multiple input
t1 = time.time()
trt_outputs = do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream) # numpy data
t2 = time.time()
print("Inference time with the TensorRT engine: {}".format(t2 - t1))
feat = postprocess_the_outputs(trt_outputs[0], shape_of_output)
print('TensorRT ok')
# model = torchvision.models.resnet50(pretrained=True).cuda()
# resnet_model = model.eval()
# input_for_torch = torch.from_numpy(img_np_nchw).cuda()
# t3 = time.time()
# feat_2 = resnet_model(input_for_torch)
# t4 = time.time()
# feat_2 = feat_2.cpu().data.numpy()
# print('Pytorch ok!')
# mse = np.mean((feat - feat_2) ** 2)
print("Inference time with the TensorRT engine: {}".format(t2 - t1))
# print("Inference time with the PyTorch model: {}".format(t4 - t3))
# print('MSE Error = {}'.format(mse))
print('All completed!')
Output:
Loading ONNX file from path ../two_asff.onnx...
Beginning ONNX file parsing
[TensorRT] WARNING: onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[TensorRT] WARNING: onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[TensorRT] WARNING: onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[TensorRT] WARNING: onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[TensorRT] WARNING: onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[TensorRT] WARNING: onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[TensorRT] WARNING: onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[TensorRT] WARNING: onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[TensorRT] WARNING: onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
Completed parsing of ONNX file
Building an engine from file ../two_asff.onnx; this may take a while...
[TensorRT] ERROR: Tensor 790 is already set as an output
Completed creating Engine
Inference time with the TensorRT engine: 0.04271745681762695
TensorRT ok
All completed!
With FP16 precision, inference is much faster than before.
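A single timed call like the one above fluctuates quite a bit on a Jetson, so it can be worth averaging over repeated runs; below is a minimal sketch reusing context, bindings, inputs, outputs and stream from pAt_complate.py (the warm-up call and the run count are my own additions).

# Average the TensorRT inference time over repeated runs.
n_runs = 100
do_inference(context, bindings=bindings, inputs=inputs,
             outputs=outputs, stream=stream)          # warm-up, not timed
t0 = time.time()
for _ in range(n_runs):
    do_inference(context, bindings=bindings, inputs=inputs,
                 outputs=outputs, stream=stream)
t1 = time.time()
print("Average TensorRT inference time: {:.5f}s".format((t1 - t0) / n_runs))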