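A quick aside: this post grew out of a real project in which I was responsible for deploying various models on a Qt-based system. I went through model deployment with Libtorch, PyTorch, and TensorFlow in turn; my understanding is not deep, but everything runs.
Overview: taking a DenseNet121 trained with TensorFlow to classify CIFAR10 as the example, this post covers TensorRT-accelerated model deployment in a C++ environment.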
Name | Version |
---|---|
TensorRT | TensorRT-7.2.3.4.Windows10.x86_64.cuda-11.1.cudnn8.1 |
tensorflow-gpu | 2.9.1 |
C++ Compiler | MSVC/14.29.30133 |
CUDA | 11.1 |
cuDNN | 8.4.1 |
libtorch | libtorch-1.8.2+cu111 |
pytorch | torch1.12.0+cu113 |
tf2onnx | 1.11.1 |
opencv | opencv-3.4.13 |
keras | 2.9.0 |
h5py | 3.9.0 |
Windows | Windows 10 Home (Chinese edition) 19044.1889 |
The overall model deployment workflow is shown in the figure below:
[Figure: onnx-workflow.png]
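For reference, see: Speeding Up Deep Learning Inference Using TensorFlow, ONNX, and TensorRT
Following this blog post, train a CIFAR10 model based on the DenseNet network from keras.applications and save it in .hdf5 format.
We add a resize layer in front of the classic DenseNet121 network so that it can accept the 32x32x3 data of the CIFAR10 dataset. The code is as follows: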
    import matplotlib.pyplot as plt
    import tensorflow as tf
    import numpy as np
    import keras as K
    from keras import datasets, layers, models


    def preprocess_data(X, Y):
        """Pre-process the images and one-hot encode the labels."""
        X_p = K.applications.densenet.preprocess_input(X)
        Y_p = K.utils.to_categorical(Y, 10)
        return X_p, Y_p


    # Load the dataset
    (trainX, trainy), (testX, testy) = K.datasets.cifar10.load_data()
    x_train, y_train = preprocess_data(trainX, trainy)
    x_test, y_test = preprocess_data(testX, testy)

    # Use DenseNet121: freeze the first 149 layers, fine-tune the rest
    OldModel = K.applications.DenseNet121(include_top=False, input_tensor=None, weights='imagenet')
    for layer in OldModel.layers[:149]:
        layer.trainable = False
    for layer in OldModel.layers[149:]:
        layer.trainable = True

    model = K.models.Sequential()
    # A Lambda layer that scales the 32x32 input up to 224x224.
    # Note: this particular resize is what later breaks the TensorRT conversion (explained below).
    model.add(K.layers.Lambda(lambda x: K.backend.resize_images(
        x, height_factor=7, width_factor=7, data_format='channels_last')))
    model.add(OldModel)
    model.add(K.layers.Flatten())
    model.add(K.layers.BatchNormalization())
    model.add(K.layers.Dense(256, activation='relu'))
    model.add(K.layers.Dropout(0.7))
    model.add(K.layers.BatchNormalization())
    model.add(K.layers.Dense(128, activation='relu'))
    model.add(K.layers.Dropout(0.5))
    model.add(K.layers.BatchNormalization())
    model.add(K.layers.Dense(64, activation='relu'))
    model.add(K.layers.Dropout(0.3))
    model.add(K.layers.Dense(10, activation='softmax'))

    # Callbacks (disabled)
    # cbacks = K.callbacks.CallbackList()
    # cbacks.append(K.callbacks.ModelCheckpoint(filepath='cifar10.h5',monitor='val_accuracy',save_best_only=True))
    # cbacks.append(K.callbacks.EarlyStopping(monitor='val_accuracy',patience=2))
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

    # Train
    model.fit(x=x_train, y=y_train, batch_size=128, epochs=5, validation_data=(x_test, y_test))
    model.summary()
    model.save('cifar10.h5')
In fact, if you run the conversions below on the cifar10.h5 model trained this way, the step that produces the trt engine file fails with:
[07/28/2022-12:54:39] [W] [TRT] onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
ERROR: builtin_op_importers.cpp:2593 In function importResize:
[8] Assertion failed: (mode != "nearest" || nearest_mode == "floor") && "This version of TensorRT only supports floor nearest_mode!"
[07/28/2022-12:54:39] [E] Failed to parse onnx file
[07/28/2022-12:54:39] [E] Parsing model failed
[07/28/2022-12:54:39] [E] Engine creation failed
[07/28/2022-12:54:39] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec
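This is caused by a current TensorRT bug: #974 (comment). The resize_images operation in the model is not supported. NonZero is also unsupported (op is not supported in TRT yet).
The keras.backend.resize_images method used in the training code above produces a Resize with nearest mode + half_pixel + round_prefer_ceil.
There is an identical issue.
Solution: change the Lambda expression to model.add(K.layers.Lambda(lambda x:tf.image.resize(x,[224,224]))). tf.image.resize defaults to bilinear interpolation, so the nearest-mode assertion is no longer triggered.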
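To make the `nearest_mode` in that assertion a bit more concrete, here is a small pure-Python sketch of how a nearest-neighbour resize maps each output index back to an input index. It follows my reading of the ONNX Resize operator description (`half_pixel` + `round_prefer_ceil` versus `asymmetric` + `floor`); the function names are made up for this post, and it is not a description of what tf2onnx or TensorRT do internally.

```python
import math


def nearest_half_pixel_round_prefer_ceil(x_out, scale, in_size):
    """Input index picked for output index x_out (half_pixel + round_prefer_ceil)."""
    x_in = (x_out + 0.5) / scale - 0.5      # half_pixel coordinate transform
    idx = math.floor(x_in + 0.5)            # round to nearest, ties go up
    return min(max(idx, 0), in_size - 1)    # clamp to the valid index range


def nearest_asymmetric_floor(x_out, scale, in_size):
    """Input index picked for output index x_out (asymmetric + floor)."""
    idx = math.floor(x_out / scale)
    return min(max(idx, 0), in_size - 1)


# The CIFAR10 case from this post: 32 pixels scaled by 7 to 224 pixels.
scale, in_size, out_size = 7, 32, 224
a = [nearest_half_pixel_round_prefer_ceil(x, scale, in_size) for x in range(out_size)]
b = [nearest_asymmetric_floor(x, scale, in_size) for x in range(out_size)]
print(a[:8])   # [0, 0, 0, 0, 0, 0, 0, 1]
print(a == b)  # True

# With a non-integer factor the two rules can pick different pixels.
print(nearest_half_pixel_round_prefer_ceil(1, 1.5, 4))  # 1
print(nearest_asymmetric_floor(1, 1.5, 4))              # 0
```

For the integer factor 7 used here both rules select the same source pixel, while for a non-integer factor they can differ.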
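OK, with Keras's Sequential model it is quick to "assemble" your own network, and saving it is convenient too.
The hdf5 model is a dynamic graph that can still be trained; we now freeze it and convert it to a pb file used only for forward computation.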
    import tensorflow as tf
    import keras as K
    from tensorflow.python.framework.convert_to_constants import convert_variables_to_constants_v2


    def convert_h5to_pb():
        model = tf.keras.models.load_model("E:/cifar10.h5", compile=False)
        model.summary()
        # Wrap the model in a tf.function; the argument name "Input" becomes the input tensor name
        full_model = tf.function(lambda Input: model(Input))
        full_model = full_model.get_concrete_function(
            tf.TensorSpec(model.inputs[0].shape, model.inputs[0].dtype))

        # Get frozen ConcreteFunction
        frozen_func = convert_variables_to_constants_v2(full_model)
        frozen_func.graph.as_graph_def()

        layers = [op.name for op in frozen_func.graph.get_operations()]
        print("-" * 50)
        print("Frozen model layers: ")
        for layer in layers:
            print(layer)

        print("-" * 50)
        print("Frozen model inputs: ")
        print(frozen_func.inputs)
        print("Frozen model outputs: ")
        print(frozen_func.outputs)

        # Save frozen graph from frozen ConcreteFunction to hard drive
        tf.io.write_graph(graph_or_graph_def=frozen_func.graph,
                          logdir="E:/",
                          name="cifar10.pb",
                          as_text=False)


    convert_h5to_pb()
#output
--------------------------------------------------
Frozen model inputs:
[<tf.Tensor 'Input:0' shape=(None, 32, 32, 3) dtype=float32>]
Frozen model outputs:
[<tf.Tensor 'Identity:0' shape=(None, 10) dtype=float32>]
Use the tf2onnx.convert command to convert the .pb file into an .onnx file:
python -m tf2onnx.convert --input E:/cifar10.pb --inputs Input:0 --outputs Identity:0 --output E:/cifar10.onnx --opset 11
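--inputs: the name of the model's input layer; --outputs: the name of the model's output layer.
The input and output layer names are printed by the freezing code above.
The generated onnx file can be visualized on the Netron website to inspect the network structure.
At this point Netron shows the onnx model's input tensor shape as **float32[unk__1220,224,224,3]**, in TensorFlow's NHWC layout.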
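Because the layout matters again when passing shapes to trtexec and when filling the input buffer, here is a small pure-Python sketch of how the same element is addressed in a flat buffer under NHWC and NCHW. The helper names are hypothetical and exist only for this illustration.

```python
def offset_nhwc(n, h, w, c, H, W, C):
    """Flat index of element (n, h, w, c) in a contiguous NHWC buffer."""
    return ((n * H + h) * W + w) * C + c


def offset_nchw(n, c, h, w, C, H, W):
    """Flat index of element (n, c, h, w) in a contiguous NCHW buffer."""
    return ((n * C + c) * H + h) * W + w


def nchw_to_nhwc(flat, N, C, H, W):
    """Reorder a flat NCHW list into a flat NHWC list."""
    out = [0] * (N * C * H * W)
    for n in range(N):
        for c in range(C):
            for h in range(H):
                for w in range(W):
                    src = offset_nchw(n, c, h, w, C, H, W)
                    dst = offset_nhwc(n, h, w, c, H, W, C)
                    out[dst] = flat[src]
    return out


# One image, 3 channels, 2x2 pixels.
# In NCHW order: channel 0 holds 0..3, channel 1 holds 4..7, channel 2 holds 8..11.
nchw = list(range(12))
nhwc = nchw_to_nhwc(nchw, N=1, C=3, H=2, W=2)
print(nhwc)  # [0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11]
```

In NHWC the three channel values of one pixel sit next to each other, which is why a shape such as 1x3x224x224 is rejected by an engine whose network expects 1x224x224x3.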
(For trtexec usage, see: the parameter guide for TensorRT's bundled trtexec tool, the official documentation, and a test blog post.)
trtexec --onnx=cifar10.onnx --saveEngine=cifar10.trt --workspace=4096 --minShapes=Input:0:1x32x32x3 --optShapes=Input:0:1x32x32x3 --maxShapes=Input:0:50x32x32x3 --fp16
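The shape flags are easy to mistype (tensor name, colon, then dimensions joined by `x`), so a tiny helper can assemble them. This is only a string-building sketch: `shape_flag` and `build_trtexec_command` are names I made up, and it uses no flags beyond the ones in the command above.

```python
def shape_flag(flag, tensor_name, dims):
    """Format e.g. --minShapes=Input:0:1x32x32x3 from a tensor name and a list of ints."""
    return f"--{flag}={tensor_name}:{'x'.join(str(d) for d in dims)}"


def build_trtexec_command(onnx_path, engine_path, tensor_name,
                          min_dims, opt_dims, max_dims,
                          workspace_mb=4096, fp16=True):
    """Assemble the trtexec command line for one dynamic-shape input."""
    if not len(min_dims) == len(opt_dims) == len(max_dims):
        raise ValueError("min/opt/max shapes must have the same rank")
    for lo, opt, hi in zip(min_dims, opt_dims, max_dims):
        if not lo <= opt <= hi:
            raise ValueError("expected min <= opt <= max for every dimension")
    parts = [
        "trtexec",
        f"--onnx={onnx_path}",
        f"--saveEngine={engine_path}",
        f"--workspace={workspace_mb}",
        shape_flag("minShapes", tensor_name, min_dims),
        shape_flag("optShapes", tensor_name, opt_dims),
        shape_flag("maxShapes", tensor_name, max_dims),
    ]
    if fp16:
        parts.append("--fp16")
    return " ".join(parts)


cmd = build_trtexec_command("cifar10.onnx", "cifar10.trt", "Input:0",
                            [1, 32, 32, 3], [1, 32, 32, 3], [50, 32, 32, 3])
print(cmd)
```

With these arguments the helper reproduces the command shown above; the tensor name must match the node name displayed in Netron, including the `:0` suffix.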
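Our final goal is to run forward inference on data with the engine. By the end of Section 4 we have the final "model", i.e. the serialized engine file. What follows is data preprocessing, i.e. loading the data. (I directly used the CIFAR10 code that this expert adapted from the official MNIST dataset handling code; github link.)
To feed data in dynamic batches, we can use Libtorch's DataLoader class. To define our own dataset for it, we only need to override the get and size methods of torch::data::Dataset.
This article is all you need to teach yourself how to load custom data types: Custom Data Loading using PyTorch C++ API
Assuming the CustomDataset class has already been written, the code that feeds data in batches can look roughly like this: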
    // Make the dataset
    auto test_dataset = CustomDataset(dataset_path, ".txt", class2label)
                            .map(torch::data::transforms::Stack<>());
    // Build the DataLoader (the dataset is moved into the loader)
    auto test_data_loader = torch::data::make_data_loader(
        std::move(test_dataset), INFERENCE_BATCH);
    // const size_t test_dataset_size = test_dataset.size().value();
    for (const auto& batch : *test_data_loader) {
        torch::Tensor inputs_tensor = batch.data;
        torch::Tensor labels_tensor = batch.target;
        ...
    }
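The get/size contract and the batching it enables can be mimicked in plain Python. The sketch below is not the Libtorch API: `ToyDataset` and `iterate_batches` are hypothetical names, and the batches are produced in sequential order, which is an assumption of this example.

```python
class ToyDataset:
    """Stand-in for a custom dataset: only get(index) and size() are required."""

    def __init__(self, samples, labels):
        if len(samples) != len(labels):
            raise ValueError("samples and labels must have the same length")
        self.samples = samples
        self.labels = labels

    def get(self, index):
        return self.samples[index], self.labels[index]

    def size(self):
        return len(self.samples)


def iterate_batches(dataset, batch_size):
    """Yield (data, target) lists; the last batch may be smaller than batch_size."""
    for start in range(0, dataset.size(), batch_size):
        stop = min(start + batch_size, dataset.size())
        examples = [dataset.get(i) for i in range(start, stop)]
        data = [d for d, _ in examples]
        target = [t for _, t in examples]
        yield data, target


dataset = ToyDataset(samples=[[i] * 3 for i in range(10)],
                     labels=[i % 2 for i in range(10)])
sizes = [len(data) for data, _ in iterate_batches(dataset, batch_size=4)]
print(sizes)  # [4, 4, 2]
```

The smaller final batch is worth keeping in mind: with a dynamic-batch engine, the batch dimension set on the context has to match the size of the batch actually being fed.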
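The flow:

1. Read the .trt file into a variable.
2. Create the runtime object with nvinfer1::createInferRuntime.
3. Deserialize the .trt file contents with the deserializeCudaEngine method to obtain the engine object.
4. Obtain the execution context object with IExecutionContext* context = engine->createExecutionContext();.
5. Run model inference through the context's enqueueV2 method.

The first three steps can be bundled into one method named readTRTfile, which returns an engine object.
The reason for returning the engine rather than creating the context and returning that is that we need to call the engine's methods to inspect the model's input and output dimensions.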
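For comparison, the first three steps look roughly like this with the TensorRT Python bindings. This is a sketch under assumptions: `read_plan_bytes` and `read_trt_file` are names I chose to mirror readTRTfile, the TensorRT part is not executed here (it needs the `tensorrt` package, a GPU, and a real engine file), and only step 1 is demonstrated on a dummy file.

```python
import os
import tempfile


def read_plan_bytes(path):
    """Step 1: read the serialized engine (.trt) file into memory."""
    with open(path, "rb") as f:
        return f.read()


def read_trt_file(path):
    """Steps 1-3: file -> runtime -> engine (requires the tensorrt Python package)."""
    import tensorrt as trt  # imported lazily so step 1 can be tried without TensorRT

    logger = trt.Logger(trt.Logger.WARNING)
    runtime = trt.Runtime(logger)                                     # step 2
    engine = runtime.deserialize_cuda_engine(read_plan_bytes(path))   # step 3
    if engine is None:
        raise RuntimeError(f"failed to deserialize engine from {path}")
    return engine


# Step 1 alone, on a dummy 16-byte file instead of a real engine:
with tempfile.TemporaryDirectory() as tmp:
    dummy_path = os.path.join(tmp, "dummy.trt")
    with open(dummy_path, "wb") as f:
        f.write(b"\x00" * 16)
    plan = read_plan_bytes(dummy_path)
print("length:", len(plan))  # length: 16

# Intended usage with a real engine file (not run here):
# engine = read_trt_file("E:/cifar10fix.trt")
# context = engine.create_execution_context()
```

Returning the engine keeps the caller free to query binding dimensions before creating the context, which is the same reasoning as in the C++ version.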
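[Key point] The models we generated earlier (whether pb or pt files) all use a dynamic batch size, so the resulting onnx model has a dynamic input. When converting to trt we specified the range of input shapes for later inference; note that it is only a range. The trt file is deserialized into an engine, and the concrete dimensions must be set before inference is run. If they are not set, or are wrong, this error is reported: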
[E] [TRT] Parameter check failed at: engine.cpp::nvinfer1::rt::ShapeMachineContext::resolveSlots::1318, condition: allInputDimensionsSpecified(routine)
Solution:

    // Inspect the engine's input and output dimensions
    for (int i = 0; i < engine->getNbBindings(); i++) {
        nvinfer1::Dims dims = engine->getBindingDimensions(i);
        printf("index %d, dims: (", i);
        for (int d = 0; d < dims.nbDims; d++) {
            if (d < dims.nbDims - 1) printf("%d,", dims.d[d]);
            else printf("%d", dims.d[d]);
        }
        printf(")\n");
    }

Taking the DenseNet121 trt file as an example, the program above prints:
index 0, dims: (-1,224,224,3)
index 1, dims: (-1,100)
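So we have to pin down the dynamic input dimension. In Python, this call before running inference is enough: context.set_binding_shape(0, (BATCH, 3, INPUT_H, INPUT_W)), with the dimensions given in the layout the engine expects. In C++, call the setBindingDimensions(int bindingIndex, Dims dimensions) method on the IExecutionContext instance.

    // Fix the dynamic dimension
    nvinfer1::Dims dims4;
    dims4.d[0] = 1;  // replace dynamic batch size with 1
    dims4.d[1] = 224;
    dims4.d[2] = 224;
    dims4.d[3] = 3;
    dims4.nbDims = 4;
    context->setBindingDimensions(0, dims4);

After that, inference can be run.
The overall idea: given an engine file whose dimensions are unknown, first read the file contents and deserialize them to obtain the engine.
Then call getBindingDimensions() to inspect the engine's input and output dimensions (skip this if you already know them).
Before calling context->executeV2(), replace every dynamic dimension whose value is -1 with a concrete value and call context->setBindingDimensions(); once the data has been copied into the input buffer, call context->executeV2() to run inference.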
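The "replace -1 with a concrete value" step can be written as a small pure-Python helper. It is only a sketch: `resolve_dims` is a hypothetical name, and the default batch range 1..50 is taken from the min/max shapes passed to trtexec earlier in this post.

```python
def resolve_dims(binding_dims, batch, min_batch=1, max_batch=50):
    """Replace the single dynamic (-1) dimension with a concrete batch size."""
    if not min_batch <= batch <= max_batch:
        raise ValueError(
            f"batch {batch} is outside the profile range [{min_batch}, {max_batch}]")
    if list(binding_dims).count(-1) != 1:
        raise ValueError("expected exactly one dynamic dimension")
    return tuple(batch if d == -1 else d for d in binding_dims)


resolved = resolve_dims((-1, 32, 32, 3), batch=50)
print(resolved)  # (50, 32, 32, 3)
```

The resulting tuple is what would be handed to context.set_binding_shape(0, ...) in Python, or copied into a Dims struct for setBindingDimensions in C++.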
Why V2, and what is the difference between V1 and V2:
execute/enqueue are for implicit batch networks, and executeV2/enqueueV2 are for explicit batch networks. The V2 versions don’t take a batch_size argument since it’s taken from the explicit batch dimension of the network / or from the optimization profile if used.
In TensorRT 7, the ONNX parser requires that you create an explicit batch network, so you’ll have to use V2 methods.
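At this point we have obtained the engine object through the readTRTfile function, obtained the context object from the engine, and fixed the dynamic dimension of the context's input.
Next, write a doinference method that takes the input and output data arrays. Each batch produced by the DataLoader written earlier is a torch::Tensor. Inference roughly breaks down into four steps:

1. Allocate GPU memory with cudaMalloc.
2. Copy the batch data to the GPU with cudaMemcpyAsync.
3. Run inference with context.enqueueV2.
4. Copy the batch results back to the CPU with cudaMemcpyAsync.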
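The sizes passed to cudaMalloc and cudaMemcpyAsync follow directly from the resolved binding dimensions. A minimal sketch of that arithmetic, assuming float32 input and output bindings (4 bytes per element, matching the float32 tensors printed by the freezing step); `buffer_bytes` is a hypothetical helper name:

```python
from functools import reduce
from operator import mul

FLOAT32_BYTES = 4  # assumption: both bindings are float32


def buffer_bytes(dims, elem_bytes=FLOAT32_BYTES):
    """Number of bytes needed for a binding with fully specified dimensions."""
    if any(d < 0 for d in dims):
        raise ValueError("resolve dynamic (-1) dimensions before sizing buffers")
    return reduce(mul, dims, 1) * elem_bytes


batch = 50
input_bytes = buffer_bytes((batch, 32, 32, 3))
output_bytes = buffer_bytes((batch, 10))
print(input_bytes)   # 614400
print(output_bytes)  # 2000
```

Sizing the buffers for the largest batch in the profile once, and copying only the bytes of the current batch, avoids reallocating GPU memory for every batch.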
Program output:
(TrtInfer::testAllSample) test_dataset_size0
loading filename from:E:/cifar10fix.trt
length:47512416
load engine done
deserializing
[08/25/2022-20:37:10] [W] [TRT] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
[08/25/2022-20:37:11] [W] [TRT] TensorRT was linked against cuDNN 8.1.0 but loaded cuDNN 8.0.5
[08/25/2022-20:37:11] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.3.0 but loaded cuBLAS/cuBLAS LT 11.2.1
deserialize done
The engine in TensorRT.cpp is not nullptr
tensorRT engine created successfully.
[08/25/2022-20:37:12] [W] [TRT] TensorRT was linked against cuDNN 8.1.0 but loaded cuDNN 8.0.5
[08/25/2022-20:37:12] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.3.0 but loaded cuBLAS/cuBLAS LT 11.2.1
index 4, dims: (-1,32,32,3)
index 2, dims: (-1,10)
num_running_corrects_NUMS=====2132
num_running_NUMS=====10000
Eval Loss: 2.23657 Eval Acc: 0.2132
test_dataset_size:()
HAPYY ENDING!!!~~~~~ヾ(≧▽≦*)oヾ(≧▽≦*)oヾ(≧▽≦*)o
I will post the code later... these notes have been put off for a long, long time, and I will keep updating them.
[W] Dynamic dimensions required for input: input_1:0, but no shapes were provided. Automatically overriding shape to: 1x224x224x3
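# This happens because the input node name in the Shapes arguments is wrong: it should be input_1:0 rather than input_1. Just keep it identical to the node name shown in Netron.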
[E] [TRT] input_1:0: for dimension number 1 in profile 0 does not match network definition (got min=3, opt=3, max=3), expected min=opt=max=224).
# Change the Shapes arguments from 1x3x224x224 to 1x224x224x3.
ERROR: builtin_op_importers.cpp:2593 In function importResize:
[8] Assertion failed: (mode != "nearest" || nearest_mode == "floor") && "This version of TensorRT only supports floor nearest_mode!"
[07/28/2022-12:54:39] [E] Failed to parse onnx file
# The resize operator (nearest-ceil mode) in the model is not supported.
[E] [TRT] C:\source\rtSafe\cuda\cudaConvolutionRunner.cpp (483) - Cudnn Error in nvinfer1::rt::cuda::CudnnConvolutionRunner::executeConv: 2 (CUDNN_STATUS_ALLOC_FAILED)
# The --workspace value is set too large; reduce it.
【Could not load library cudnn_cnn_infer64_8.dll. Error code 1455.Please make sure cudnn_cnn_infer64_8.dll is in your library path! 】
or 【context null】
Cause: insufficient memory. Restarting VS or the computer fixes it. (Or see this Q&A.)