TensorRT & C++ & TensorFlow & DenseNet & CIFAR10 Model Deployment

A quick aside: this post grew out of a real project in which I was responsible for deploying all kinds of models on a Qt system. I walked through deployment with Libtorch, then PyTorch, then TensorFlow; shallowly in each case, not thorough, but everything ran.

Overview: taking TensorFlow training DenseNet121 to classify CIFAR10 as the example scenario, this post walks through TensorRT-accelerated deployment of the model in a C++ environment.

0. Environment Setup

| Name | Version |
| --- | --- |
| TensorRT | TensorRT-7.2.3.4.Windows10.x86_64.cuda-11.1.cudnn8.1 |
| tensorflow-gpu | 2.9.1 |
| C++ Compiler | MSVC/14.29.30133 |
| CUDA | 11.1 |
| cuDNN | 8.4.1 |
| libtorch | libtorch-1.8.2+cu111 |
| pytorch | torch1.12.0+cu113 |
| tf2onnx | 1.11.1 |
| opencv | opencv-3.4.13 |
| keras | 2.9.0 |
| h5py | 3.9.0 |
| Windows | Windows 10 Home (Chinese edition) 19044.1889 |

The overall deployment workflow is shown in the figure below:

[Figure: ONNX deployment workflow (博客.assets/onnx-workflow.png)]

For reference: Speeding Up Deep Learning Inference Using TensorFlow, ONNX, and TensorRT.

1. Model Training and Saving

Following this blog post, we train a model built on the DenseNet network from keras.applications to handle CIFAR10, and save it in .hdf5 format.

We prepend a resize layer to the classic DenseNet121 network so that it accepts the 32x32x3 images of the CIFAR10 dataset. The code is as follows:

```python
import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np
import keras as K
from keras import datasets, layers, models

def preprocess_data(X, Y):
    """pre-processes the data"""
    X_p = K.applications.densenet.preprocess_input(X)
    """one hot encode target values"""
    Y_p = K.utils.to_categorical(Y, 10)
    return X_p, Y_p

"""load dataset"""
(trainX, trainy), (testX, testy) = K.datasets.cifar10.load_data()
x_train, y_train = preprocess_data(trainX, trainy)
x_test, y_test = preprocess_data(testX, testy)

""" USE DenseNet121"""
OldModel = K.applications.DenseNet121(include_top=False,input_tensor=None,weights='imagenet')
for layer in OldModel.layers[:149]:
    layer.trainable = False
for layer in OldModel.layers[149:]:
    layer.trainable = True

model = K.models.Sequential()

"""a lambda layer that scales up the data to the correct size"""
model.add(K.layers.Lambda(lambda x:K.backend.resize_images(x,height_factor=7,width_factor=7,data_format='channels_last')))

model.add(OldModel)
model.add(K.layers.Flatten())
model.add(K.layers.BatchNormalization())
model.add(K.layers.Dense(256, activation='relu'))
model.add(K.layers.Dropout(0.7))
model.add(K.layers.BatchNormalization())
model.add(K.layers.Dense(128, activation='relu'))
model.add(K.layers.Dropout(0.5))
model.add(K.layers.BatchNormalization())
model.add(K.layers.Dense(64, activation='relu'))
model.add(K.layers.Dropout(0.3))
model.add(K.layers.Dense(10, activation='softmax'))
"""callbacks"""
# cbacks =  K.callbacks.CallbackList()
# cbacks.append(K.callbacks.ModelCheckpoint(filepath='cifar10.h5',monitor='val_accuracy',save_best_only=True))
# cbacks.append(K.callbacks.EarlyStopping(monitor='val_accuracy',patience=2))

model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])
"""train"""
model.fit(x=x_train,y=y_train,batch_size=128,epochs=5,validation_data=(x_test, y_test))
model.summary()

model.save('cifar10.h5')
```

In fact, if you take the cifar10.h5 model trained above through the conversion steps below, the conversion to a trt engine file fails with:

```
[07/28/2022-12:54:39] [W] [TRT] onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
ERROR: builtin_op_importers.cpp:2593 In function importResize:
[8] Assertion failed: (mode != "nearest" || nearest_mode == "floor") && "This version of TensorRT only supports floor nearest_mode!"
[07/28/2022-12:54:39] [E] Failed to parse onnx file
[07/28/2022-12:54:39] [E] Parsing model failed
[07/28/2022-12:54:39] [E] Engine creation failed
[07/28/2022-12:54:39] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec
```

This is a current TensorRT limitation (see issue #974 (comment)): the resize_images operation in the model is not supported. NonZero is likewise unsupported ("op is not supported in TRT yet").

The keras.backend.resize_images call used in the training code above uses nearest mode + half_pixel + round_prefer_ceil, exactly the combination reported in this identical issue.

Solution: change the Lambda line to model.add(K.layers.Lambda(lambda x: tf.image.resize(x, [224, 224]))).

OK. With Keras's Sequential API, "assembling" your own network is quick, and saving it is painless too.

2. Freezing the Model

An hdf5 model is a dynamic graph that can still be trained; we now freeze it into a .pb file used only for forward computation.

```python
import tensorflow as tf
import keras as K
from tensorflow.python.framework.convert_to_constants import convert_variables_to_constants_v2

def convert_h5to_pb():
    model = tf.keras.models.load_model("E:/cifar10.h5",compile=False)
    model.summary()
    full_model = tf.function(lambda Input: model(Input))
    full_model = full_model.get_concrete_function(tf.TensorSpec(model.inputs[0].shape, model.inputs[0].dtype))

    # Get frozen ConcreteFunction
    frozen_func = convert_variables_to_constants_v2(full_model)
    frozen_func.graph.as_graph_def()

    layers = [op.name for op in frozen_func.graph.get_operations()]
    print("-" * 50)
    print("Frozen model layers: ")
    for layer in layers:
        print(layer)

    print("-" * 50)
    print("Frozen model inputs: ")
    print(frozen_func.inputs)
    print("Frozen model outputs: ")
    print(frozen_func.outputs)

    # Save frozen graph from frozen ConcreteFunction to hard drive
    tf.io.write_graph(graph_or_graph_def=frozen_func.graph,
                      logdir="E:/",
                      name="cifar10.pb",
                      as_text=False)
convert_h5to_pb()
```

Output:

```
--------------------------------------------------
Frozen model inputs: 
[<tf.Tensor 'Input:0' shape=(None, 32, 32, 3) dtype=float32>]
Frozen model outputs: 
[<tf.Tensor 'Identity:0' shape=(None, 10) dtype=float32>]
```

3. Converting to ONNX

Use the tf2onnx.convert command to turn the .pb file into a .onnx file:

```bash
python -m tf2onnx.convert --input E:/cifar10.pb --inputs Input:0 --outputs Identity:0 --output E:/cifar10.onnx --opset 11
```

--inputs is the name of the model's input node and --outputs the name of its output node; both names are printed by the freezing script above.

The generated onnx file can be visualized on the Netron website to inspect the network structure.
Netron shows the onnx model's input tensor as **float32[unk__1220,224,224,3]**, i.e. TensorFlow's NHWC layout.

4. Building the Optimized Engine File

(On trtexec usage, see: TensorRT - parameter guide for the bundled trtexec tool, the official documentation, and a benchmarking blog post.)

```bash
trtexec --onnx=cifar10.onnx --saveEngine=cifar10.trt --workspace=4096 --minShapes=Input:0:1x32x32x3 --optShapes=Input:0:1x32x32x3 --maxShapes=Input:0:50x32x32x3 --fp16
```

  • onnx: the input onnx model
  • saveEngine: where the converted TensorRT engine is saved
  • workspace: GPU memory budget in MB; sometimes the default is not enough and it must be raised by hand
  • minShapes: the minimum dimensions for dynamic shapes, given as the input node's name plus a shape in the model's input layout (NHWC for this TF model)
  • optShapes: the shape used for the timing run that trtexec performs
  • maxShapes: the maximum dimensions for dynamic shapes; here only the batch is dynamic, every other dimension is fixed
  • fp16: run inference in float16

5. Data Preprocessing

Our end goal is to run forward inference on data with the engine. By the end of section 4 we hold the final "model", i.e. the serialized engine file; what remains is preprocessing, i.e. loading the data. (I directly used this author's CIFAR10 loader, adapted from the official MNIST data-handling code; github link.)

To feed batches of dynamic size, we can use Libtorch's DataLoader. Defining our own dataset class only requires subclassing torch::data::Dataset and overriding its get and size methods, as sketched below.

This article alone is enough to teach yourself custom dataset loading inside and out: Custom Data Loading using PyTorch C++ API.
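As a minimal sketch, assuming the standard CIFAR-10 binary format (each record is 1 label byte followed by 3072 pixel bytes in 3x32x32 CHW order); the author's actual class from the github link above takes different constructor arguments (a path, a ".txt" extension, and a class2label map), so adapt as needed:

```cpp
#include <torch/torch.h>
#include <fstream>
#include <string>
#include <vector>

// Minimal custom dataset: subclass torch::data::datasets::Dataset and
// override get() and size(); everything else is up to you.
class CustomDataset : public torch::data::datasets::Dataset<CustomDataset> {
    std::vector<torch::Tensor> images_;
    std::vector<int64_t> labels_;
public:
    explicit CustomDataset(const std::string& binPath) {
        std::ifstream file(binPath, std::ios::binary);
        std::vector<char> record(1 + 3 * 32 * 32);   // 1 label byte + 3072 pixel bytes
        while (file.read(record.data(), record.size())) {
            labels_.push_back(static_cast<int64_t>(record[0]));
            auto img = torch::from_blob(record.data() + 1, {3, 32, 32},
                                        torch::kUInt8).clone();  // clone: record is reused
            images_.push_back(img.to(torch::kFloat32));
        }
    }
    // Return one sample as a {data, target} pair.
    torch::data::Example<> get(size_t index) override {
        return {images_[index], torch::tensor(labels_[index])};
    }
    // Report the sample count so the DataLoader can batch correctly.
    torch::optional<size_t> size() const override {
        return images_.size();
    }
};
```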

With the CustomDataset class written, feeding data batch by batch looks roughly like this:

```cpp
// Build the dataset (Stack collates individual samples into one batch tensor)
auto test_dataset = CustomDataset(dataset_path, ".txt", class2label)
    .map(torch::data::transforms::Stack<>());
// Build the DataLoader
auto test_data_loader = torch::data::make_data_loader(
    std::move(test_dataset), INFERENCE_BATCH);
// const size_t test_dataset_size = test_dataset.size().value();
for (const auto& batch : *test_data_loader) {
    torch::Tensor inputs_tensor = batch.data;
    torch::Tensor labels_tensor = batch.target;
    // ...
}
```

6. Loading the Engine File

Procedure:

  1. Read the .trt file into a buffer.
  2. Create a runtime object via nvinfer1::createInferRuntime.
  3. Call the runtime's deserializeCudaEngine method to deserialize the .trt buffer into an engine object.
  4. IExecutionContext* context = engine->createExecutionContext(); yields the execution context object.

Inference itself is then performed through the context's enqueueV2 method. The first three steps can be bundled into a single function, say readTRTfile, which returns an engine object.

The reason we return the engine rather than grabbing the context right away is that we still need the engine's methods to inspect the model's input and output dimensions.
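A minimal sketch of readTRTfile under the TensorRT 7 API (error handling trimmed; the logger parameter is whatever ILogger instance your project defines):

```cpp
#include <NvInfer.h>
#include <fstream>
#include <string>
#include <vector>
#include <iostream>

// Read a serialized engine from disk and deserialize it (steps 1-3 above).
nvinfer1::ICudaEngine* readTRTfile(const std::string& enginePath,
                                   nvinfer1::ILogger& logger) {
    std::ifstream file(enginePath, std::ios::binary | std::ios::ate);
    if (!file) { std::cerr << "cannot open " << enginePath << std::endl; return nullptr; }
    size_t size = file.tellg();               // engine file length
    file.seekg(0);
    std::vector<char> blob(size);
    file.read(blob.data(), size);             // 1. read the .trt file into a buffer

    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger);    // 2.
    nvinfer1::ICudaEngine* engine =
        runtime->deserializeCudaEngine(blob.data(), size, nullptr);        // 3.
    runtime->destroy();  // the engine stays valid; step 4 happens at the call site
    return engine;
}
```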

[Key point] The models produced earlier (whether the pb or the pt file) all have a dynamic batch dimension, which yields an onnx model with a dynamic input. Converting to trt, we only specified the allowed range of inference input shapes (a range, nothing more). After the trt file is deserialized into an engine, concrete dimensions must be specified when the engine is invoked; if they are missing or wrong, you get this error:

```
[E] [TRT] Parameter check failed at: engine.cpp::nvinfer1::rt::ShapeMachineContext::resolveSlots::1318, condition: allInputDimensionsSpecified(routine)
```

The fix:

```cpp
// Inspect the engine's input/output binding dimensions
for (int i = 0; i < engine->getNbBindings(); i++) {
    nvinfer1::Dims dims = engine->getBindingDimensions(i);
    printf("index %d, dims: (", i);
    for (int d = 0; d < dims.nbDims; d++) {
        if (d < dims.nbDims - 1) printf("%d,", dims.d[d]);
        else                     printf("%d", dims.d[d]);
    }
    printf(")\n");
}
```

Taking a DenseNet121 trt file as an example, the code above prints:

```
index 0, dims: (-1,224,224,3)
index 1, dims: (-1,100)
```

So the dynamic input dimensions have to be pinned down to concrete values. In Python, set context.set_binding_shape(0, (BATCH, 3, INPUT_H, INPUT_W)) before invoking the engine; in C++, call the setBindingDimensions(int bindingIndex, Dims dimensions) method on the IExecutionContext instance.

```cpp
// Pin down the dynamic dimensions
nvinfer1::Dims dims4;
dims4.nbDims = 4;
dims4.d[0] = 1;    // replace the dynamic batch size with 1
dims4.d[1] = 224;
dims4.d[2] = 224;
dims4.d[3] = 3;
context->setBindingDimensions(0, dims4);
```

After that, inference runs normally.

The overall idea: given an engine file whose dimensions are unknown, first read the file and deserialize it into an engine.
Then call getBindingDimensions() to inspect the engine's input and output dimensions (skip this if you already know them).
Before running inference with context->executeV2(), replace every dynamic dimension of -1 with a concrete value via context->setBindingDimensions(); once the input buffer is filled, call context->executeV2() to run inference.
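As a sketch, the whole chain looks like this (readTRTfile as defined above; gLogger stands in for your ILogger instance, and the buffer setup is elided):

```cpp
// Deserialize the engine and create the execution context
nvinfer1::ICudaEngine* engine = readTRTfile("E:/cifar10.trt", gLogger);
nvinfer1::IExecutionContext* context = engine->createExecutionContext();

// Replace the -1 batch dimension with a concrete value before inference
nvinfer1::Dims dims = engine->getBindingDimensions(0);
dims.d[0] = 1;
context->setBindingDimensions(0, dims);

// ... allocate buffers and copy the input in, then:
// context->executeV2(buffers);
```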

Why V2, and what is the difference between V1 and V2?

execute/enqueue are for implicit batch networks, and executeV2/enqueueV2 are for explicit batch networks. The V2 versions don’t take a batch_size argument since it’s taken from the explicit batch dimension of the network / or from the optimization profile if used.

In TensorRT 7, the ONNX parser requires that you create an explicit batch network, so you’ll have to use V2 methods.


At this point, we have obtained the engine object through readTRTfile, obtained the context from the engine, and pinned down the context's dynamic input dimensions.

7. Running Inference

Write a doInference method that takes the input and output data arrays. Each batch produced by the DataLoader above is a torch::Tensor, whose contents get copied into the input buffer. The method breaks down into roughly four steps, sketched after this list:

  1. cudaMalloc to allocate GPU memory.
  2. cudaMemcpyAsync to copy the batch to the GPU.
  3. context.enqueueV2 to run inference.
  4. cudaMemcpyAsync to copy the batch results back to the CPU.
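A minimal sketch, assuming a single FP32 input at binding index 0 and a single FP32 output at binding index 1 (your engine's actual indices may differ, as the log below shows); sizes are element counts:

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>

void doInference(nvinfer1::IExecutionContext& context,
                 const float* input, float* output,
                 size_t inputSize, size_t outputSize) {
    void* buffers[2];
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // 1. allocate GPU memory for both bindings
    cudaMalloc(&buffers[0], inputSize * sizeof(float));
    cudaMalloc(&buffers[1], outputSize * sizeof(float));

    // 2. copy the batch to the GPU
    cudaMemcpyAsync(buffers[0], input, inputSize * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    // 3. run inference (explicit-batch network, hence enqueueV2)
    context.enqueueV2(buffers, stream, nullptr);
    // 4. copy the results back to the CPU
    cudaMemcpyAsync(output, buffers[1], outputSize * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(buffers[0]);
    cudaFree(buffers[1]);
}
```

The batch tensor from the DataLoader can be handed over as inputs_tensor.contiguous().data_ptr<float>().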

Program output:

```
(TrtInfer::testAllSample) test_dataset_size0
loading filename from:E:/cifar10fix.trt
length:47512416
load engine done
deserializing
[08/25/2022-20:37:10] [W] [TRT] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
[08/25/2022-20:37:11] [W] [TRT] TensorRT was linked against cuDNN 8.1.0 but loaded cuDNN 8.0.5
[08/25/2022-20:37:11] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.3.0 but loaded cuBLAS/cuBLAS LT 11.2.1
deserialize done
The engine in TensorRT.cpp is not nullptr
tensorRT engine created successfully.
[08/25/2022-20:37:12] [W] [TRT] TensorRT was linked against cuDNN 8.1.0 but loaded cuDNN 8.0.5
[08/25/2022-20:37:12] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.3.0 but loaded cuBLAS/cuBLAS LT 11.2.1
index 4, dims: (-1,32,32,3)
index 2, dims: (-1,10)
num_running_corrects_NUMS=====2132
num_running_NUMS=====10000
 Eval Loss: 2.23657 Eval Acc: 0.2132
test_dataset_size:()
HAPYY ENDING!!!~~~~~ヾ(≧▽≦*)oヾ(≧▽≦*)oヾ(≧▽≦*)o
```

The full code will be posted later… these notes took a very long time to put together; more updates to come.

Errors you may encounter:

Converting onnx to trt:

```
[W] Dynamic dimensions required for input: input_1:0, but no shapes were provided. Automatically overriding shape to: 1x224x224x3
# The input node name in the Shapes arguments is wrong: it must be input_1:0, not input_1. Match the node name shown in Netron exactly.
[E] [TRT] input_1:0: for dimension number 1 in profile 0 does not match network definition (got min=3, opt=3, max=3), expected min=opt=max=224).
# Change the Shapes arguments from 1x3x224x224 to 1x224x224x3.
ERROR: builtin_op_importers.cpp:2593 In function importResize:
[8] Assertion failed: (mode != "nearest" || nearest_mode == "floor") && "This version of TensorRT only supports floor nearest_mode!"
[07/28/2022-12:54:39] [E] Failed to parse onnx file
# The model's resize op (nearest + ceil mode) is not supported.
[E] [TRT] C:\source\rtSafe\cuda\cudaConvolutionRunner.cpp (483) - Cudnn Error in nvinfer1::rt::cuda::CudnnConvolutionRunner::executeConv: 2 (CUDNN_STATUS_ALLOC_FAILED)
# The --workspace value is too large; reduce it.
```

[Could not load library cudnn_cnn_infer64_8.dll. Error code 1455. Please make sure cudnn_cnn_infer64_8.dll is in your library path!]
or [context null]
Cause: insufficient memory; restarting Visual Studio or the machine clears it. (Or see this Q&A.)
