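This post records the whole process of running PyTorch/MXNet models with TensorRT, along with the pitfalls I ran into.
TensorRT supports accelerated inference for TensorFlow UFF models, ONNX models, and custom-defined networks. For PyTorch there is a third-party project, torch2trt, but it requires the model to be defined in PyTorch and then converted in place, so the model cannot be separated from TensorRT: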
import torch
from torch2trt import torch2trt
from torchvision.models.alexnet import alexnet
# create some regular pytorch model...
model = alexnet(pretrained=True).eval().cuda()
# create example data
x = torch.ones((1, 3, 224, 224)).cuda()
# convert to TensorRT feeding sample data as input
model_trt = torch2trt(model, [x])
Deployment would still depend on the PyTorch environment, so I did not try it.
MXNet has an official interface that converts directly to TensorRT:
arg_params.update(aux_params)
all_params = dict([(k, v.as_in_context(mx.gpu(0))) for k, v in arg_params.items()])
executor = mx.contrib.tensorrt.tensorrt_bind(sym, ctx=mx.gpu(0), all_params=all_params,data=batch_shape, grad_req='null', force_rebind=True)
y_gen = executor.forward(is_train=False, data=input)
y_gen[0].wait_to_read()
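I did not try this either. The main goal is to decouple deployment so that only the TensorRT environment is needed, without installing the whole deep learning stack.
Both PyTorch and MXNet have official interfaces and documentation for converting models to ONNX, and they are simple to use: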
# MXNet to ONNX
import numpy as np
from mxnet.contrib import onnx as onnx_mxnet

sym = './resnet-50-symbol.json'
params = './resnet-50-0000.params'
input_shape = (1, 3, 224, 224)
onnx_file = './resnet-50.onnx'
converted_model_path = onnx_mxnet.export_model(sym, params, [input_shape], np.float32, onnx_file)
# PyTorch to ONNX
import torch
import torchvision
dummy_input = torch.randn(10, 3, 224, 224, device='cuda')
model = torchvision.models.alexnet(pretrained=True).cuda()
# Providing input and output names sets the display names for values
# within the model's graph. Setting these does not change the semantics
# of the graph; it is only for readability.
#
# The inputs to the network consist of the flat list of inputs (i.e.
# the values you would pass to the forward() method) followed by the
# flat list of parameters. You can partially specify names, i.e. provide
# a list here shorter than the number of inputs to the model, and we will
# only set that subset of names, starting from the beginning.
input_names = [ "actual_input_1" ] + [ "learned_%d" % i for i in range(16) ]
output_names = [ "output1" ]
torch.onnx.export(model, dummy_input, "alexnet.onnx", verbose=True, input_names=input_names, output_names=output_names)
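The custom layer SegmentConsensus is not recognized
A custom layer has to be defined specially when converting PyTorch to ONNX, and again when converting ONNX to TensorRT. For layers like this it is better to understand what they do underneath and reimplement them with basic operations. This layer is fairly simple, so I reimplemented it with mean and index_select.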
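As a rough sketch of that idea, assuming the layer only averages over the segment dimension (an "avg" style consensus), the custom function could be swapped for a plain module built on mean. The class name AvgConsensus and the tensor sizes are made up for illustration; this is not the exact replacement used here, which also involved index_select.

```python
import torch


class AvgConsensus(torch.nn.Module):
    """Hypothetical stand-in for SegmentConsensus built only from basic ops."""

    def __init__(self, dim=1):
        super().__init__()
        self.dim = dim  # axis that holds the segments

    def forward(self, x):
        # Average over the segment axis; keepdim keeps the rank unchanged.
        return x.mean(dim=self.dim, keepdim=True)


# Illustrative input laid out as (batch, segments, classes).
scores = torch.arange(24, dtype=torch.float32).view(2, 3, 4)
consensus = AvgConsensus(dim=1)(scores)
```

Because the module uses only standard tensor operations, the exporter can map it to existing ONNX ops without a custom symbolic.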
TracerWarning: There are 2 live references to the data region being modified when tracing in-place operator copy_ (possibly due to an assignment). This might cause the trace to be incorrect, because all other views that also reference this data will not reflect this change in the trace! On the other hand, if all other views use the same memory chunk, but are disjoint (e.g. are outputs of torch.split), this might still be safe
This warning says that the data being modified has two live references, so the trace may be incorrect. The offending code is:
out[:, :-1, :fold] = x[:, 1:, :fold] # shift left
out[:, 1:, fold: 2 * fold] = x[:, :-1, fold: 2 * fold] # shift right
out[:, :, 2 * fold:] = x[:, :, 2 * fold:] # not shift
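From what I could find, the assignment target on the left is one reference and the slice is another, and the tracer cannot handle the two together. So the in-place slice assignment has to be replaced (index_select is one option); the version below builds the output with torch.cat instead: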
left_side = torch.cat((x[:, 1:, :fold], torch.zeros(1, 1, fold, h, w)), dim=1)
middle_side = torch.cat((torch.zeros(1, 1, fold, h, w), x[:, :n_segment - 1, fold: 2 * fold]), dim=1)
out = torch.cat((left_side, middle_side, x[:, :, 2 * fold:]), dim=2)
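To check that the rewrite really is equivalent, the two versions can be compared on random data. The sketch below assumes that out starts as an all-zero tensor with the same shape as x and that x is laid out as (batch, segments, channels, height, width); the function names and sizes are illustrative. Unlike the snippet above, the zero padding here takes its batch size, dtype, and device from x.

```python
import torch


def shift_inplace(x, fold):
    """Reference version: slice assignment into a zero tensor (not trace-friendly)."""
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                    # shift left
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]    # shift right
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]               # not shifted
    return out


def shift_cat(x, fold):
    """Trace-friendly version: pad with zeros and concatenate, no in-place writes."""
    n_batch, n_segment, _, h, w = x.shape
    zeros = torch.zeros(n_batch, 1, fold, h, w, dtype=x.dtype, device=x.device)
    left = torch.cat((x[:, 1:, :fold], zeros), dim=1)
    right = torch.cat((zeros, x[:, :n_segment - 1, fold:2 * fold]), dim=1)
    return torch.cat((left, right, x[:, :, 2 * fold:]), dim=2)


x = torch.randn(2, 8, 16, 4, 4)   # (batch, segments, channels, height, width)
fold = x.shape[2] // 8            # hypothetical fold size: 1/8 of the channels
same = torch.equal(shift_inplace(x, fold), shift_cat(x, fold))
```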
sym, arg_params, aux_params = mx.model.load_checkpoint(pretrained, epoch)
sym = get_output_sym(sym, 'fc1_output')
arg_params.update(aux_params)
onnx_mx.export_model(sym, arg_params, input_shape, onnx_file_path=onnx_file_path, verbose=True)
For PyTorch, you can subclass torch.nn.Module, pass the model in, and customize the forward pass yourself:
class ExtractFeature(torch.nn.Module):
    def __init__(self, cnn, frames=16):
        super().__init__()
        self.model = cnn
        self.num_segments = frames
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    def forward(self, data):
        # st = time.time()
        # print('feature extracting start')
        n = self.model
        pool = torch.nn.MaxPool2d(3, 2)
        with torch.no_grad():
            input = data.view((-1, 3) + data.size()[-2:]).to(self.device)
            x = n.conv1(input)
            x = n.bn1(x)
            x = n.relu(x)
            x = n.maxpool(x)
            x = n.layer1(x)
            x = n.layer2(x)
            x = n.layer3(x)
            x = n.layer4(x)
            x = pool(x)
            x = x.flatten(start_dim=1)
            ndata = x
            data = ndata.view((-1, self.num_segments) + ndata.size()[1:])
        return data
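The model is not called through the default forward
A model subclasses torch.nn.Module, whose __call__ method lets an instance be called like a function; __call__ ultimately dispatches to forward. How do you export to ONNX when the model is used through some other method instead of forward, as in the following?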
out = net.forward_features(x)
Here forward_features is called explicitly. At first I thought of subclassing and having forward delegate to forward_features, but it is simpler to rebind the method directly, like this:
OCR.forward = OCR.forward_ocr
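The rebinding works because the call path looks up forward at call time. The following sketch shows the mechanism with a tiny stand-in for torch.nn.Module; TinyModule and Net are hypothetical classes, and the real __call__ does more work (hooks, tracing) before it reaches forward.

```python
class TinyModule:
    """Stand-in for torch.nn.Module: calling the instance runs self.forward."""

    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)

    def forward(self, x):
        raise NotImplementedError


class Net(TinyModule):
    def forward(self, x):
        return ("forward", x)

    def forward_features(self, x):
        return ("forward_features", x)


net = Net()
before = net(1)

# Rebind on the class so that every instance dispatches to forward_features.
Net.forward = Net.forward_features
after = net(1)
```

Since an exporter that invokes the model through its normal call path ends up in whatever forward points to, the rebinding is enough and no wrapper subclass is needed.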
Exporting the operator GatherElements to ONNX opset version 9 is not supported
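Opset 9 does not support this op. Raise the opset version: the highest available at the time of writing is 12, and higher versions support more ops. opset_version defaults to 9.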
torch.onnx.export(model, dummy_input, "alexnet.onnx", verbose=True, input_names=input_names, output_names=output_names,opset_version=11,)
torch.onnx.export(OCR, dummy_input ,onnx_ocr_forword_ocr_path,
input_names=['input'],
output_names=['segm_pred', 'segm_pred2', 'rbox', 'rbox2', 'angle', 'angle2', 'x'],
opset_version=11,
dynamic_axes={"input": {0: 'batch',2:'h', 3:'w'}})
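I downloaded TensorRT 7.1, the latest version at the time, from the official TensorRT site, installed it, and configured the environment variables. The package contained only .so libraries and some C files, and import tensorrt did not work. The official documentation turned out to say that the TensorRT Python API is not supported on Windows, so the Python interface cannot be used there.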
[TensorRT] ERROR: …/rtSafe/cuda/caskConvolutionRunner.cpp (290) - Cask Error in checkCaskExecError: 7 (Cask Convolution execution)
[TensorRT] ERROR: FAILED_EXECUTION: std::exception
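This happens because the engine is created in one thread and executed in another (the program was multithreaded). The fix is to create and execute the engine in the same thread.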
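One way to keep creation and execution together in a multithreaded program is to give the engine a dedicated worker thread and send it requests through a queue. The sketch below only illustrates that pattern: FakeEngine is a placeholder rather than a TensorRT class, and all names are hypothetical.

```python
import queue
import threading


class FakeEngine:
    """Placeholder for a TensorRT engine plus execution context (hypothetical)."""

    def __init__(self):
        self.owner = threading.get_ident()   # thread that built the engine

    def infer(self, batch):
        # A real engine would run inference here; this only checks the thread.
        assert threading.get_ident() == self.owner, "executed on a foreign thread"
        return [value * 2 for value in batch]


class InferenceWorker(threading.Thread):
    """Builds the engine and runs every request on this one thread."""

    def __init__(self):
        super().__init__(daemon=True)
        self.requests = queue.Queue()

    def run(self):
        engine = FakeEngine()                 # created on the worker thread
        while True:
            item = self.requests.get()
            if item is None:                  # sentinel: shut down
                break
            batch, reply = item
            reply.put(engine.infer(batch))    # executed on the same thread

    def submit(self, batch):
        reply = queue.Queue(maxsize=1)
        self.requests.put((batch, reply))
        return reply.get()                    # block until the worker answers

    def stop(self):
        self.requests.put(None)


worker = InferenceWorker()
worker.start()
result = worker.submit([1, 2, 3])
worker.stop()
worker.join()
```

Other threads only call submit, so they never touch the engine directly.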
[TensorRT] ERROR: …/rtSafe/cuda/cudaConvolutionRunner.cpp (303) - Cudnn Error in execute: 7 (CUDNN_STATUS_MAPPING_ERROR)
[TensorRT] ERROR: FAILED_EXECUTION: std::exception
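After the engine is created, do not call to(device) or cuda() any more. PyTorch and MXNet both need those calls to move the model and data to the GPU, but they have to be removed from the TensorRT inference path.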
[TensorRT] WARNING: Explicit batch network detected and batch size specified, use execute without batch size instead.
[TensorRT] ERROR: Parameter check failed at: engine.cpp::resolveSlots::1024, condition: allInputDimensionsSpecified(routine)
With a dynamic batch size, TensorRT cannot build the engine directly; an optimization profile has to be set up for the build:
# config is assumed to have been created earlier with builder.create_builder_config()
profile = builder.create_optimization_profile()
profile.set_shape(
    ModelData.INPUT_NAME,
    ModelData.MIN_INPUT_SHAPE,
    ModelData.OPT_INPUT_SHAPE,
    ModelData.MAX_INPUT_SHAPE)
config.add_optimization_profile(profile)
engine = builder.build_engine(network, config)
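TensorRT expects the three shapes of a profile to have the same rank, with min <= opt <= max on every axis. A small sanity check before calling set_shape might look like the sketch below; check_profile and the shape values are placeholders, not the ModelData used above.

```python
def check_profile(min_shape, opt_shape, max_shape):
    """Return True when the shapes share a rank and min <= opt <= max per axis."""
    if not (len(min_shape) == len(opt_shape) == len(max_shape)):
        return False
    return all(lo <= opt <= hi
               for lo, opt, hi in zip(min_shape, opt_shape, max_shape))


# Hypothetical profile: only the batch axis varies.
ok = check_profile((1, 3, 224, 224), (8, 3, 224, 224), (32, 3, 224, 224))

# The preferred batch size exceeds the maximum, so this profile is invalid.
bad = check_profile((1, 3, 224, 224), (64, 3, 224, 224), (32, 3, 224, 224))
```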
size = trt.volume(engine.get_binding_shape(binding)) * batch_size
if size < 0:
    size *= -1
dtype = trt.nptype(engine.get_binding_dtype(binding))
# Allocate host and device buffers
host_mem = cuda.pagelocked_empty(size, dtype)
for binding in engine:
    bind = engine.get_binding_shape(binding)
    vol = trt.volume(bind)
    # size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
    # Dynamic axes are reported as -1, so the volume is scaled by the runtime sizes
    if binding == 'input':
        size = trt.volume(engine.get_binding_shape(binding)) * batch_size * h_ * w_
    else:
        size = trt.volume(engine.get_binding_shape(binding)) * batch_size * math.ceil(h_ / 4) * math.ceil(w_ / 4)
    # An odd number of -1 axes makes the product negative, so flip the sign
    if size < 0:
        size *= -1
    dtype = trt.nptype(engine.get_binding_dtype(binding))
    # Allocate host and device buffers
    host_mem = cuda.pagelocked_empty(size, dtype)
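The sign flip above relies on the number of -1 axes being odd. A slightly more explicit alternative, sketched here in plain Python, substitutes the runtime value for each dynamic axis before multiplying. resolve_shape, element_count, and the example shapes are hypothetical helpers for illustration, not part of the TensorRT API.

```python
import math


def resolve_shape(binding_shape, runtime_dims):
    """Replace each dynamic (-1) axis with the matching runtime value, in order."""
    n_dynamic = sum(1 for dim in binding_shape if dim == -1)
    if n_dynamic != len(runtime_dims):
        raise ValueError("expected %d runtime dims, got %d"
                         % (n_dynamic, len(runtime_dims)))
    values = iter(runtime_dims)
    return tuple(next(values) if dim == -1 else dim for dim in binding_shape)


def element_count(binding_shape, runtime_dims):
    """Number of elements a buffer needs once all dynamic axes are fixed."""
    return math.prod(resolve_shape(binding_shape, runtime_dims))


batch_size, h_, w_ = 4, 32, 48

# Input with dynamic batch, height, and width.
input_count = element_count((-1, 3, -1, -1), (batch_size, h_, w_))

# Output at 1/4 resolution; the 7 channels are an assumed value.
output_count = element_count(
    (-1, 7, -1, -1), (batch_size, math.ceil(h_ / 4), math.ceil(w_ / 4)))
```

The resulting counts could then be passed to cuda.pagelocked_empty in place of the sign-corrected size.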
@parse_args('v', 'v', 'v', 'v', 'v', 'i', 'f', 'f', 'i')
def instance_norm(g, input, weight, bias, running_mean, running_var, use_input_stats, momentum, eps,
                  cudnn_enabled):
    axes = [-i for i in range(2, 0, -1)]
    two_cst = g.op("Constant", value_t=torch.tensor(2.))
    eps_cst = g.op("Constant", value_t=torch.tensor(eps))
    mean = g.op("ReduceMean", input, axes_i=axes)
    numerator = sub(g, input, mean)
    # variance = e((x - e(x))^2), and (x - e(x)) is the numerator in the layer_norm formula
    variance = g.op("ReduceMean", pow(g, numerator, two_cst), axes_i=axes)
    denominator = sqrt(g, add(g, variance, eps_cst))
    inst_norm = div(g, numerator, denominator)
    if not (weight is None or weight.node().mustBeNone()):
        inst_norm = mul(g, inst_norm, weight)
    if not (bias is None or bias.node().mustBeNone()):
        inst_norm = add(g, inst_norm, bias)
    return inst_norm
@parse_args('v', 'v', 'v', 'v', 'v', 'i', 'f', 'f', 'i')
def instance_norm(g, input, weight, bias, running_mean, running_var, use_input_stats, momentum, eps,
                  cudnn_enabled):
    axes = [-i for i in range(2, 0, -1)]
    two_cst = g.op("Constant", value_t=torch.tensor(2.))
    eps_cst = g.op("Constant", value_t=torch.tensor(eps))
    mean = g.op("ReduceMean", input, axes_i=axes)
    numerator = sub(g, input, mean)
    # variance = e((x - e(x))^2), and (x - e(x)) is the numerator in the layer_norm formula
    variance = g.op("ReduceMean", pow(g, numerator, two_cst), axes_i=axes)
    denominator = sqrt(g, add(g, variance, eps_cst))
    inst_norm = div(g, numerator, denominator)
    if not (weight is None or weight.node().mustBeNone()):
        # Reshape the per-channel weight from (C,) to (C, 1, 1) so it broadcasts over H and W;
        # done inside the check so a missing weight is never unsqueezed
        weight = g.op("Unsqueeze", weight, axes_i=[-1])
        weight = g.op("Unsqueeze", weight, axes_i=[-1])
        inst_norm = mul(g, inst_norm, weight)
    if not (bias is None or bias.node().mustBeNone()):
        # Same reshape for the per-channel bias
        bias = g.op("Unsqueeze", bias, axes_i=[-1])
        bias = g.op("Unsqueeze", bias, axes_i=[-1])
        inst_norm = add(g, inst_norm, bias)
    return inst_norm
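The two Unsqueeze calls matter because ONNX Mul and Add follow NumPy-style broadcasting, which aligns shapes on their trailing axes. The plain-Python sketch below applies that rule to shapes only; broadcast_shape and the sizes are illustrative assumptions, not part of the exporter.

```python
def broadcast_shape(a, b):
    """Return the NumPy-style broadcast of two shapes, or None if incompatible."""
    # Align the shapes on their trailing axes and pad the shorter one with 1s.
    length = max(len(a), len(b))
    a = (1,) * (length - len(a)) + tuple(a)
    b = (1,) * (length - len(b)) + tuple(b)
    result = []
    for da, db in zip(a, b):
        if da == db or db == 1:
            result.append(da)
        elif da == 1:
            result.append(db)
        else:
            return None
    return tuple(result)


activation = (2, 64, 32, 48)      # (N, C, H, W), hypothetical sizes
weight = (64,)                    # per-channel scale with shape (C,)
weight_unsqueezed = (64, 1, 1)    # after two Unsqueeze ops on the last axis

plain = broadcast_shape(activation, weight)
fixed = broadcast_shape(activation, weight_unsqueezed)
```

With shape (C,) the weight lines up with the width axis and the shapes clash, while (C, 1, 1) lines up with the channel axis as intended.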