Accelerating YOLOv5 Inference with TensorRT (Part 1)

1. References

Accelerating YOLOv5 Inference with TensorRT (Part 2)
tensorrt_inference
Converting a YOLOv5 PyTorch model to TensorRT
YOLOv5 pruning, distillation, and compression

2. Experimental Environment

System environment

Environment
Operating System + Version: Ubuntu + 16.04
TensorRT Version: 7.1.3.4
GPU Type: GeForce GTX 1650, 4GB
Nvidia Driver Version: 470.63.01
CUDA Version: 10.2.300
CUDNN Version: 7.6.5
Python Version (if applicable): 3.7.3
Anaconda Version: 4.10.3
gcc: 7.5.0
g++: 7.5.0

tensorRT-yolov5.yaml

name: tensorRT-yolov5
channels:
  - >
  - http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
  - http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
  - http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/msys2
dependencies:
  - _libgcc_mutex=0.1=main
  - _openmp_mutex=4.5=1_gnu
  - blas=1.0=mkl
  - bzip2=1.0.8=h7b6447c_0
  - ca-certificates=2021.7.5=h06a4308_1
  - certifi=2021.5.30=py37h06a4308_0
  - cudatoolkit=10.2.89=hfd86e86_1
  - ffmpeg=4.2.2=h20bf706_0
  - freetype=2.10.4=h5ab3b9f_0
  - gmp=6.2.1=h2531618_2
  - gnutls=3.6.15=he1e5248_0
  - jpeg=9b=h024ee3a_2
  - lame=3.100=h7b6447c_0
  - lcms2=2.12=h3be6417_0
  - libedit=3.1.20210714=h7f8727e_0
  - libffi=3.2.1=hf484d3e_1007
  - libgcc-ng=9.3.0=h5101ec6_17
  - libgomp=9.3.0=h5101ec6_17
  - libidn2=2.3.2=h7f8727e_0
  - libopus=1.3.1=h7b6447c_0
  - libpng=1.6.37=hbc83047_0
  - libstdcxx-ng=9.3.0=hd4cf53a_17
  - libtasn1=4.16.0=h27cfd23_0
  - libtiff=4.2.0=h85742a9_0
  - libunistring=0.9.10=h27cfd23_0
  - libuv=1.40.0=h7b6447c_0
  - libvpx=1.7.0=h439df22_0
  - libwebp-base=1.2.0=h27cfd23_0
  - lz4-c=1.9.3=h295c915_1
  - mkl_fft=1.3.0=py37h42c9631_2
  - mkl_random=1.2.2=py37h51133e4_0
  - ncurses=6.2=he6710b0_1
  - nettle=3.7.3=hbbd107a_1
  - ninja=1.10.2=hff7bd54_1
  - numpy-base=1.20.3=py37h74d4b33_0
  - openh264=2.1.0=hd408876_0
  - openjpeg=2.4.0=h3ad879b_0
  - openssl=1.1.1l=h7f8727e_0
  - pip=21.2.2=py37h06a4308_0
  - python=3.7.3=h0371630_0
  - pytorch=1.8.0=py3.7_cuda10.2_cudnn7.6.5_0
  - readline=7.0=h7b6447c_5
  - setuptools=52.0.0=py37h06a4308_0
  - six=1.16.0=pyhd3eb1b0_0
  - sqlite=3.33.0=h62c20be_0
  - tk=8.6.10=hbc83047_0
  - torchvision=0.9.0=py37_cu102
  - typing_extensions=3.10.0.0=pyh06a4308_0
  - wheel=0.37.0=pyhd3eb1b0_0
  - x264=1!157.20191217=h7b6447c_0
  - xz=5.2.5=h7b6447c_0
  - zlib=1.2.11=h7b6447c_3
  - zstd=1.4.9=haebb681_0
  - pip:
    - appdirs==1.4.4
    - charset-normalizer==2.0.4
    - cycler==0.10.0
    - dpcpp-cpp-rt==2021.3.0
    - flatbuffers==2.0
    - graphsurgeon==0.4.5
    - idna==3.2
    - intel-cmplr-lib-rt==2021.3.0
    - intel-cmplr-lic-rt==2021.3.0
    - intel-opencl-rt==2021.3.0
    - intel-openmp==2021.3.0
    - kiwisolver==1.3.1
    - mako==1.1.5
    - markupsafe==2.0.1
    - matplotlib==3.4.3
    - mkl==2021.3.0
    - mkl-fft==1.3.0
    - mkl-service==2.4.0
    - netron==5.1.6
    - numpy==1.21.2
    - olefile==0.46
    - onnx==1.10.1
    - onnx-simplifier==0.3.6
    - onnxoptimizer==0.2.6
    - onnxruntime==1.8.1
    - opencv-python==4.5.3.56
    - pandas==1.3.2
    - pillow==8.3.2
    - protobuf==3.17.3
    - pycuda==2021.1
    - pyparsing==2.4.7
    - python-dateutil==2.8.2
    - pytools==2021.2.8
    - pytz==2021.1
    - pyyaml==5.4.1
    - requests==2.26.0
    - scipy==1.7.1
    - seaborn==0.11.2
    - tbb==2021.3.0
    - tensorrt==7.1.3.4
    - torchsummary==1.5.1
    - tqdm==4.62.2
    - typing-extensions==3.10.0.2
    - uff==0.6.9
    - urllib3==1.26.6
prefix: /home/yichao/miniconda3/envs/tensorRT-yolov5

requirements-gpu.txt

appdirs==1.4.4
certifi==2021.5.30
charset-normalizer==2.0.4
cycler==0.10.0
dpcpp-cpp-rt==2021.3.0
flatbuffers==2.0
graphsurgeon @ file:///home/yichao/360Downloads/TensorRT-7.1.3.4/graphsurgeon/graphsurgeon-0.4.5-py2.py3-none-any.whl
idna==3.2
intel-cmplr-lib-rt==2021.3.0
intel-cmplr-lic-rt==2021.3.0
intel-opencl-rt==2021.3.0
intel-openmp==2021.3.0
kiwisolver==1.3.1
Mako==1.1.5
MarkupSafe==2.0.1
matplotlib==3.4.3
mkl==2021.3.0
mkl-fft==1.3.0
mkl-random @ file:///tmp/build/80754af9/mkl_random_1626179032232/work
mkl-service==2.4.0
netron==5.1.6
numpy==1.21.2
olefile==0.46
onnx==1.10.1
onnx-simplifier==0.3.6
onnxoptimizer==0.2.6
onnxruntime==1.8.1
opencv-python==4.5.3.56
pandas==1.3.2
Pillow==8.3.2
protobuf==3.17.3
pycuda==2021.1
pyparsing==2.4.7
python-dateutil==2.8.2
pytools==2021.2.8
pytz==2021.1
PyYAML==5.4.1
requests==2.26.0
scipy==1.7.1
seaborn==0.11.2
six @ file:///tmp/build/80754af9/six_1623709665295/work
tbb==2021.3.0
tensorrt @ file:///home/yichao/360Downloads/TensorRT-7.1.3.4/python/tensorrt-7.1.3.4-cp37-none-linux_x86_64.whl
torch==1.8.0
torchsummary==1.5.1
torchvision==0.9.0
tqdm==4.62.2
typing-extensions==3.10.0.2
uff @ file:///home/yichao/360Downloads/TensorRT-7.1.3.4/uff/uff-0.6.9-py2.py3-none-any.whl
urllib3==1.26.6

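3. Background

3.1 Important notes

  1. Serializing yolov5s.engine is slow: roughly 6-8 minutes for the FP16 and INT8 builds (the FP32 build took about 31 s, see the comparison table below).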
time ./yolov5 -s yolov5s.wts yolov5s.engine s

Output
real	7m29.211s
user	5m10.066s
sys	0m42.794s
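  2. yolov5s.trt and yolov5s.engine are the same thing; only the file extension differs.
  3. C++ inference and Python API inference of yolov5s run at about the same speed, but their GPU memory usage differs considerably.
  4. Performance comparison of no TensorRT, TensorRT FP32, TensorRT FP16, and TensorRT INT8. The test images come from the COCO dataset. "~" marks entries that are not applicable or were not measured.

                        no TensorRT   TensorRT FP32   TensorRT FP16   TensorRT INT8
engine size             ~             38.3MB          21.5MB          10.8MB
latency per image       12ms          11ms            7ms             5ms
FPS                     83fps         90fps           142fps          200fps
engine build time       ~             31s             7m12s           7m27s
C++ API GPU memory      ~             752MiB          544MiB          526MiB
Python API GPU memory   1133MiB       2285MiB         2075MiB         2057MiB
accuracy                ~             ~               ~               no boxes detected
mAP                     ~             ~               ~               ~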
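  • The default setting in yolov5.cpp is USE_FP16. In a CNN, going from USE_FP32 to USE_FP16 essentially only truncates a few digits after the decimal point. Only USE_INT8 needs a calibration dataset for calibrated quantization.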
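
To make the FP32 to FP16 remark concrete, here is a minimal sketch that round-trips a few values through IEEE 754 half precision using Python's standard struct module (format code e). The sample values are arbitrary illustrations, not weights from yolov5s, and the helper name to_fp16 is made up for this example. It only shows the size of the rounding error, not how TensorRT selects its kernels.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Arbitrary sample values (assumption: chosen for illustration only)
samples = [3.14159265, 0.1, 1234.5678]
for value in samples:
    half = to_fp16(value)
    print(f"{value!r} -> {half!r} (abs error {abs(value - half):.6g})")
```

Small values keep three or four significant digits, while larger values lose their fractional part, which is why FP16 usually needs no calibration data but can still shift results slightly.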

4. Key Steps

4.1 Preparation

  1. Download the YOLOv5 pretrained model: download link

4.2 Checking and validating the model

import onnx

# Path to your ONNX model (placeholder; replace with a real file)
filename = "yourONNXmodel.onnx"
model = onnx.load(filename)
onnx.checker.check_model(model)  # raises an exception if the model is invalid

import onnx
import numpy as np
import onnxruntime as rt
import cv2


# Example model with three image inputs of size 448x448
model_path = '/home/oldpan/code/models/Resnet34_3inputs_448x448_20200609.onnx'

# Validate the model
onnx_model = onnx.load(model_path)
onnx.checker.check_model(onnx_model)

# Read the image and reshape it to the input dimensions (1, 3, 448, 448)
image = cv2.imread("data/images/person.png")
image = cv2.resize(image, (448,448))
image = image.transpose(2,0,1)
image = np.array(image)[np.newaxis, :, :, :].astype(np.float32)

# Create the inference session and look up the input names
sess = rt.InferenceSession(model_path)
input_name1 = sess.get_inputs()[0].name
input_name2 = sess.get_inputs()[1].name
input_name3 = sess.get_inputs()[2].name
# Feed the same image to all three inputs and run every output
output = sess.run(None, {input_name1: image, input_name2: image, input_name3: image})
print(output)

4.3 cmake

This step is the same for every precision mode: cmake generates the build configuration files.

(tensorRT-yolov5) yichao@yichao:~/MyDocuments/tensorrtx/yolov5/build$ time cmake ..
CMake Deprecation Warning at CMakeLists.txt:1 (cmake_minimum_required):
  Compatibility with CMake < 2.8.12 will be removed from a future version of
  CMake.

  Update the VERSION argument <min> value or use a ...<max> suffix to tell
  CMake that the project does not need compatibility with older versions.


-- The C compiler identification is GNU 7.5.0
-- The CXX compiler identification is GNU 7.5.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found CUDA: /usr/local/cuda (found version "10.2") 
-- Found OpenCV: /usr/local/opencv3.3.0 (found version "3.3.0") 
-- Configuring done
-- Generating done
-- Build files have been written to: /home/yichao/MyDocuments/tensorrtx/yolov5/build

real	0m0.241s
user	0m0.201s
sys	0m0.042s

4.4 CMakeLists.txt

/home/yichao/MyDocuments/tensorrtx/yolov5/CMakeLists.txt
cmake_minimum_required(VERSION 2.6)

project(yolov5)

add_definitions(-std=c++11)
add_definitions(-DAPI_EXPORTS)
option(CUDA_USE_STATIC_CUDA_RUNTIME OFF)
set(CMAKE_CXX_STANDARD 11)
set(CMAKE_BUILD_TYPE Debug)

find_package(CUDA REQUIRED)

if(WIN32)
enable_language(CUDA)
endif(WIN32)

include_directories(${PROJECT_SOURCE_DIR}/include)
# include and link dirs of cuda and tensorrt, you need adapt them if yours are different
# cuda
# adjust this path for your system
include_directories(/usr/local/cuda/include)
link_directories(/usr/local/cuda/lib64)
# tensorrt
# adjust this path for your system
include_directories(/home/yichao/360Downloads/TensorRT-7.1.3.4/include/)
link_directories(/home/yichao/360Downloads/TensorRT-7.1.3.4/lib/)

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11 -Wall -Ofast -Wfatal-errors -D_MWAITXINTRIN_H_INCLUDED")

cuda_add_library(myplugins SHARED ${PROJECT_SOURCE_DIR}/yololayer.cu)
target_link_libraries(myplugins nvinfer cudart)

find_package(OpenCV)
include_directories(${OpenCV_INCLUDE_DIRS})

add_executable(yolov5 ${PROJECT_SOURCE_DIR}/calibrator.cpp ${PROJECT_SOURCE_DIR}/yolov5.cpp)
target_link_libraries(yolov5 nvinfer)
target_link_libraries(yolov5 cudart)
target_link_libraries(yolov5 myplugins)
target_link_libraries(yolov5 ${OpenCV_LIBS})

if(UNIX)
add_definitions(-O2 -pthread)
endif(UNIX)

4.5 Building with make

(tensorRT-yolov5) yichao@yichao:~/MyDocuments/tensorrtx/yolov5/build$ time make -j6
[ 20%] Building NVCC (Device) object 
...
[100%] Linking CXX executable yolov5
[100%] Built target yolov5

real	0m4.723s
user	0m5.887s
sys	0m0.421s

5. TensorRT FP32 Inference

5.1 Modify the configuration

Edit the file
/home/yichao/MyDocuments/tensorrtx/yolov5/yolov5.cpp

#define USE_FP16  // set USE_INT8 or USE_FP16 or USE_FP32
change to
#define USE_FP32  // set USE_INT8 or USE_FP16 or USE_FP32

5.2 Running cmake

cd /home/yichao/MyDocuments/tensorrtx/yolov5
mkdir build
cd build
cp {ultralytics}/yolov5/yolov5s.wts {tensorrtx}/yolov5/build
cmake ..
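
The yolov5s.wts file copied in this step is a plain-text weight dump. As a rough sketch, assuming the layout written by the gen_wts.py script in tensorrtx (first line: number of tensors; each following line: tensor name, element count, then one big-endian float32 per element as 8 hex digits), the hypothetical helper below parses such a file. The function name read_wts and the tiny sample file are illustrations only, not part of the project.

```python
import struct
import tempfile
from pathlib import Path

def read_wts(path):
    """Parse a .wts text file into {tensor_name: [float, ...]}.

    Assumed layout: line 1 holds the tensor count; every other line is
    '<name> <count> <hex> <hex> ...' with each <hex> a big-endian float32.
    """
    weights = {}
    with open(path, "r") as f:
        expected = int(f.readline())
        for line in f:
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            name, count = parts[0], int(parts[1])
            values = [struct.unpack(">f", bytes.fromhex(h))[0] for h in parts[2:]]
            if len(values) != count:
                raise ValueError(f"{name}: expected {count} values, got {len(values)}")
            weights[name] = values
    if len(weights) != expected:
        raise ValueError(f"expected {expected} tensors, got {len(weights)}")
    return weights

# Tiny self-made sample (not real yolov5s weights): one tensor holding 1.0 and -2.0
sample_path = Path(tempfile.mkdtemp()) / "demo.wts"
sample_path.write_text("1\ndemo.weight 2 3f800000 c0000000\n")
parsed = read_wts(sample_path)
print(parsed)
```

Running a check like this on the real file before building can catch a truncated or mismatched .wts early, since the engine build itself takes minutes.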

5.3 Building with make

(yolov5-pytorch) yichao@yichao:~/MyDocuments/tensorrtx/yolov5/build$ time make -j6
[ 20%] Building NVCC (Device) object 
...
[100%] Linking CXX executable yolov5
[100%] Built target yolov5

real	0m4.702s
user	0m5.841s
sys	0m0.406s

5.4 Serializing the engine

(yolov5-pytorch) yichao@yichao:~/MyDocuments/tensorrtx/yolov5/build$ time ./yolov5 -s yolov5s.wts yolov5s.engine s
Loading weights: yolov5s.wts
Building engine, please wait for a while...
Build engine successfully!

real	0m31.284s
user	0m24.642s
sys	0m1.750s

yolov5s.engine: 38.3MB

GPU memory usage:

Thu Sep  9 14:23:23 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 27%   40C    P0    24W /  75W |    829MiB /  3903MiB |     24%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1658      G   /usr/lib/xorg/Xorg                206MiB |
|    0   N/A  N/A     13920      C   ./yolov5                          619MiB |
+-----------------------------------------------------------------------------+

5.5 Deserializing the engine and running inference

(yolov5-pytorch) yichao@yichao:~/MyDocuments/tensorrtx/yolov5/build$ time ./yolov5 -d yolov5s.engine ../samples
375ms
13ms
12ms
13ms
14ms
12ms
12ms
...
10ms
10ms
10ms
11ms

real	0m41.621s
user	0m29.085s
sys	0m3.601s
1000 images at a resolution of 640x640
Average: 11 ms per image, i.e. about 90 fps
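
The FPS figures in this article follow directly from the per-image latency (fps = 1000 / latency in ms, rounded down). The short sketch below reproduces them and, for contrast, computes the end-to-end throughput from the wall-clock time of the FP32 run. It assumes the printed per-image times cover inference only, whereas the real time reported by the time command spans the whole process, including loading the engine and reading and writing images. The function name is hypothetical.

```python
def fps_from_latency_ms(latency_ms: float) -> float:
    """Frames per second for a given per-image latency in milliseconds."""
    return 1000.0 / latency_ms

# Per-image inference latencies reported in this article (ms)
latencies = {"FP32": 11, "FP16": 7, "INT8": 5}
for mode, ms in latencies.items():
    print(f"{mode}: {ms} ms/image -> {int(fps_from_latency_ms(ms))} fps")

# End-to-end throughput from the wall-clock time of the FP32 C++ run:
# 41.621 s for 1000 images, measured over the entire process.
wall_fps = 1000 / 41.621
print(f"FP32 end-to-end: {wall_fps:.1f} images/s")
```

The gap between the two numbers suggests that, in this setup, most of the wall-clock time is spent outside the network forward pass.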

GPU memory usage:

Thu Sep  9 14:25:15 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 27%   42C    P0    35W /  75W |    962MiB /  3903MiB |     43%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1658      G   /usr/lib/xorg/Xorg                206MiB |
|    0   N/A  N/A     13988      C   ./yolov5                          752MiB |
+-----------------------------------------------------------------------------+
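
The per-process memory figures used in the comparison table are read from the Processes section of nvidia-smi output like the one above. As a small sketch, the following hypothetical helper extracts them from pasted text with a regular expression. It assumes the column layout shown in this article (driver 470.63.01) and process names without spaces; other driver versions may format the table differently.

```python
import re

# Matches process rows of the nvidia-smi table, e.g.
# |    0   N/A  N/A     13988      C   ./yolov5      752MiB |
PROCESS_ROW = re.compile(
    r"^\|\s+(?P<gpu>\d+)\s+\S+\s+\S+\s+(?P<pid>\d+)\s+(?P<type>\S+)\s+"
    r"(?P<name>\S+)\s+(?P<mem>\d+)MiB\s+\|$"
)

def process_memory(smi_text: str) -> dict:
    """Return {process_name: memory_in_MiB} for each process row found.

    If two processes share a name, the later row overwrites the earlier one.
    """
    usage = {}
    for line in smi_text.splitlines():
        match = PROCESS_ROW.match(line.strip())
        if match:
            usage[match.group("name")] = int(match.group("mem"))
    return usage

# Two rows copied from the FP32 inference run in this article
sample = """\
|    0   N/A  N/A      1658      G   /usr/lib/xorg/Xorg                206MiB |
|    0   N/A  N/A     13988      C   ./yolov5                          752MiB |
"""
print(process_memory(sample))
```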

5.6 Inference results

# Path of the output images
/home/yichao/MyDocuments/tensorrtx/yolov5/build
  1. Calling the unquantized (FP32) yolov5s.engine from the Python API
// install python-tensorrt, pycuda, etc.
// ensure the yolov5s.engine and libmyplugins.so have been built
python yolov5_trt.py
(tensorRT-yolov5) yichao@yichao:~/MyDocuments/tensorrtx/yolov5$ time python yolov5_trt.py 
----------- True
bingding: data (3, 640, 640)
bingding: prob (6001, 1, 1)
batch size is 1
warm_up->(640, 640, 3), time->416.93ms
warm_up->(640, 640, 3), time->11.84ms
warm_up->(640, 640, 3), time->13.25ms
warm_up->(640, 640, 3), time->12.98ms
warm_up->(640, 640, 3), time->12.79ms
warm_up->(640, 640, 3), time->12.70ms
warm_up->(640, 640, 3), time->11.82ms
warm_up->(640, 640, 3), time->11.90ms
warm_up->(640, 640, 3), time->13.13ms
warm_up->(640, 640, 3), time->11.89ms
input->['samples/COCO_train2014_000000421903.jpg'], time->10.30ms, saving into output/
input->['samples/COCO_train2014_000000145736.jpg'], time->11.23ms, saving into output/
input->['samples/COCO_train2014_000000482834.jpg'], time->11.26ms, saving into output/
...
input->['samples/COCO_train2014_000000221565.jpg'], time->10.94ms, saving into output/
input->['samples/COCO_train2014_000000366274.jpg'], time->10.30ms, saving into output/
input->['samples/COCO_train2014_000000048824.jpg'], time->10.77ms, saving into output/

real	1m14.491s
user	0m53.540s
sys	0m8.307s
1000 images at a resolution of 640x640
Average: 11 ms per image, i.e. about 90 fps

GPU memory usage:

Thu Sep  9 14:35:54 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 28%   43C    P0    27W /  75W |   2495MiB /  3903MiB |     33%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1658      G   /usr/lib/xorg/Xorg                206MiB |
|    0   N/A  N/A     15510      C   python                           2285MiB |
+-----------------------------------------------------------------------------+

6. TensorRT FP16 Quantized Inference

6.1 Modify the configuration

FP16 is the default
/home/yichao/MyDocuments/tensorrtx/yolov5/yolov5.cpp

#define USE_FP16  // set USE_INT8 or USE_FP16 or USE_FP32

6.2 Running cmake

cd /home/yichao/MyDocuments/tensorrtx/yolov5
mkdir build
cd build
cp {ultralytics}/yolov5/yolov5s.wts {tensorrtx}/yolov5/build
cmake ..

6.3 Building with make

(tensorRT-yolov5) yichao@yichao:~/MyDocuments/tensorrtx/yolov5/build$ time make -j6
[ 20%] Building NVCC (Device) object 
...
[100%] Linking CXX executable yolov5
[100%] Built target yolov5

real	0m4.723s
user	0m5.887s
sys	0m0.421s

6.4 Serializing the engine

(tensorRT-yolov5) yichao@yichao:~/MyDocuments/tensorrtx/yolov5/build$ time ./yolov5 -s yolov5s.wts yolov5s.engine s
Loading weights: yolov5s.wts
Building engine, please wait for a while...
Build engine successfully!

real	7m11.939s
user	4m43.104s
sys	0m39.300s

yolov5s.engine: 21.5MB

GPU memory usage:

Thu Sep  9 15:20:15 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 29%   44C    P0    24W /  75W |    843MiB /  3903MiB |     16%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1658      G   /usr/lib/xorg/Xorg                216MiB |
|    0   N/A  N/A     17616      C   ./yolov5                          623MiB |
+-----------------------------------------------------------------------------+

6.5 Deserializing the engine and running inference

# Download the coco_calib images
[GoogleDrive](https://drive.google.com/drive/folders/1s7jE9DtOngZMzJC1uL307J2MiaGwdRSI?usp=sharing)

[BaiduPan](https://pan.baidu.com/s/1GOm_-JobpyLMAqZWCDUhKg) pwd: a9wh

# Image path
# /home/yichao/MyDocuments/tensorrtx/yolov5/build-int8/coco_calib

# Create a symlink
ln -s /home/yichao/MyDocuments/tensorrtx/yolov5/build-int8/coco_calib /home/yichao/MyDocuments/tensorrtx/yolov5/samples
(tensorRT-yolov5) yichao@yichao:~/MyDocuments/tensorrtx/yolov5/build$ time ./yolov5 -d yolov5s.engine ../samples
7ms
8ms
7ms
7ms
7ms
7ms
8ms
8ms
7ms
7ms
7ms
...
7ms

real	0m37.748s
user	0m27.568s
sys	0m2.609s
1000 images at a resolution of 640x640
Average: 7 ms per image, i.e. about 142 fps

GPU memory usage:

Wed Sep  8 14:35:53 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 27%   39C    P0    18W /  75W |    790MiB /  3903MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1469      G   /usr/lib/xorg/Xorg                242MiB |
|    0   N/A  N/A      8440      C   ./yolov5                          544MiB |
+-----------------------------------------------------------------------------+

6.6 Inference results

# Path of the output images
/home/yichao/MyDocuments/tensorrtx/yolov5/build
  1. Calling the FP16 yolov5s.engine from the Python API
// install python-tensorrt, pycuda, etc.
// ensure the yolov5s.engine and libmyplugins.so have been built
python yolov5_trt.py
(tensorRT-yolov5) yichao@yichao:~/MyDocuments/tensorrtx/yolov5$ time python yolov5_trt.py 
----------- True
bingding: data (3, 640, 640)
bingding: prob (6001, 1, 1)
batch size is 1
warm_up->(640, 640, 3), time->7.03ms
warm_up->(640, 640, 3), time->6.38ms
warm_up->(640, 640, 3), time->6.99ms
warm_up->(640, 640, 3), time->6.42ms
warm_up->(640, 640, 3), time->6.42ms
warm_up->(640, 640, 3), time->6.42ms
warm_up->(640, 640, 3), time->6.99ms
warm_up->(640, 640, 3), time->7.30ms
warm_up->(640, 640, 3), time->6.98ms
warm_up->(640, 640, 3), time->7.28ms
input->['samples/COCO_train2014_000000421903.jpg'], time->7.25ms, saving into output/
input->['samples/COCO_train2014_000000145736.jpg'], time->6.71ms, saving into output/
input->['samples/COCO_train2014_000000482834.jpg'], time->6.70ms, saving into output/
input->['samples/COCO_train2014_000000393241.jpg'], time->6.79ms, saving into output/
...
input->['samples/COCO_train2014_000000221565.jpg'], time->6.70ms, saving into output/
input->['samples/COCO_train2014_000000366274.jpg'], time->6.61ms, saving into output/
input->['samples/COCO_train2014_000000048824.jpg'], time->6.70ms, saving into output/

real	0m51.729s
user	0m44.069s
sys	0m5.622s
1000 images at a resolution of 640x640
Average: 7 ms per image, i.e. about 142 fps

GPU memory usage:

Wed Sep  8 15:52:45 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 27%   39C    P0    22W /  75W |   2321MiB /  3903MiB |     29%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1469      G   /usr/lib/xorg/Xorg                242MiB |
|    0   N/A  N/A     11220      C   python                           2075MiB |
+-----------------------------------------------------------------------------+

7. TensorRT INT8 Quantized Inference

7.1 Download the coco_calib calibration dataset

Prepare calibration images; you can randomly select about 1000 images from your training set.

For COCO, you can also download my calibration images coco_calib from GoogleDrive or BaiduPan (pwd: a9wh).
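
If you build the calibration set from your own training images instead of downloading coco_calib, the random selection can be scripted. The sketch below is one possible way to do it, not part of tensorrtx: the function name, the fixed seed, the .jpg filter, and the directory names train2014 and build/coco_calib are all assumptions to adapt to your setup.

```python
import random
import shutil
from pathlib import Path

def sample_calibration_images(src_dir, dst_dir, count=1000, seed=0):
    """Copy `count` randomly chosen .jpg files from src_dir into dst_dir."""
    src, dst = Path(src_dir), Path(dst_dir)
    images = sorted(src.glob("*.jpg"))  # sorted so a given seed is reproducible
    if len(images) < count:
        raise ValueError(f"only {len(images)} images found, need {count}")
    dst.mkdir(parents=True, exist_ok=True)
    chosen = random.Random(seed).sample(images, count)
    for image in chosen:
        shutil.copy(image, dst / image.name)
    return chosen

if __name__ == "__main__":
    # Hypothetical paths; point these at your own training images and build dir.
    train_dir = Path("train2014")
    if train_dir.is_dir():
        picked = sample_calibration_images(train_dir, "build/coco_calib")
        print(f"copied {len(picked)} images")
```

Fixing the seed keeps the calibration set, and therefore the resulting calibration table, reproducible between builds.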

7.2 Extract the calibration dataset into yolov5/build

/home/yichao/MyDocuments/tensorrtx/yolov5/build/coco_calib

7.3 Set USE_INT8

Edit the file
/home/yichao/MyDocuments/tensorrtx/yolov5/yolov5.cpp

#define USE_FP16  // set USE_INT8 or USE_FP16 or USE_FP32
change to
#define USE_INT8  // set USE_INT8 or USE_FP16 or USE_FP32
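
For intuition about why INT8 needs calibration images while FP16 does not: INT8 maps each tensor onto a small set of integer levels, so a scale has to be chosen from representative data. The sketch below shows plain symmetric quantization with the scale taken from the largest absolute value in a sample. This max-based rule is a simplification chosen for illustration; TensorRT's calibrators may pick the range differently (for example with entropy-based methods). All names and values here are hypothetical.

```python
def quantize_int8(values, scale):
    """Symmetric quantization: round(x / scale), clamped to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]

def dequantize_int8(quantized, scale):
    """Map INT8 codes back to approximate real values."""
    return [q * scale for q in quantized]

# Hypothetical activation sample standing in for calibration data
calibration_sample = [0.02, -1.5, 0.75, 3.2, -0.4]
scale = max(abs(v) for v in calibration_sample) / 127  # max-based range

codes = quantize_int8(calibration_sample, scale)
restored = dequantize_int8(codes, scale)
for original, code, approx in zip(calibration_sample, codes, restored):
    print(f"{original:+.3f} -> {code:+4d} -> {approx:+.4f}")
```

Values far below the chosen range collapse onto very few levels, which is one way a poorly calibrated INT8 engine can lose detections.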

7.4 Running cmake

cd /home/yichao/MyDocuments/tensorrtx/yolov5
mkdir build
cd build
cp {ultralytics}/yolov5/yolov5s.wts {tensorrtx}/yolov5/build
cmake ..

7.5 Building with make

# If you have already built once, clean the previous build first
make clean

make -j6
(tensorRT-yolov5) yichao@yichao:~/MyDocuments/tensorrtx/yolov5/build$ time make -j6
[ 20%] Building NVCC (Device) object 
...
[100%] Linking CXX executable yolov5
[100%] Built target yolov5

real	0m4.709s
user	0m5.902s
sys	0m0.373s

7.6 Serializing the engine

sudo ./yolov5 -s [.wts] [.engine] [s/m/l/x/s6/m6/l6/x6 or c/c6 gd gw]  // serialize model to plan file

// For example yolov5s
sudo ./yolov5 -s yolov5s.wts yolov5s.engine s
(tensorRT-yolov5) yichao@yichao:~/MyDocuments/tensorrtx/yolov5/build$ time ./yolov5 -s yolov5s.wts yolov5s.engine s
Loading weights: yolov5s.wts
Your platform support int8: true
Building engine, please wait for a while...
reading calib cache: int8calib.table
COCO_train2014_000000421903.jpg  0
COCO_train2014_000000145736.jpg  1
...
COCO_train2014_000000048824.jpg  999
reading calib cache: int8calib.table
writing calib cache: int8calib.table size: 13506
Build engine successfully!

real	7m27.392s
user	6m58.768s
sys	0m38.621s

yolov5s.engine: 10.8MB

GPU memory usage:

Wed Sep  8 15:20:47 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 33%   46C    P0    18W /  75W |    920MiB /  3903MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1469      G   /usr/lib/xorg/Xorg                242MiB |
|    0   N/A  N/A      9326      C   ./yolov5                          674MiB |
+-----------------------------------------------------------------------------+

7.7 Deserializing the engine and running inference

(tensorRT-yolov5) yichao@yichao:~/MyDocuments/tensorrtx/yolov5/build$ time ./yolov5 -d yolov5s.engine ../samples
5ms
6ms
5ms
5ms
...
5ms
6ms
5ms

real	0m24.968s
user	0m23.439s
sys	0m1.660s
1000 images at a resolution of 640x640
Average: 5 ms per image, i.e. about 200 fps

GPU memory usage:

Wed Sep  8 15:23:56 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 30%   43C    P0    24W /  75W |    772MiB /  3903MiB |     38%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1469      G   /usr/lib/xorg/Xorg                242MiB |
|    0   N/A  N/A      9573      C   ./yolov5                          526MiB |
+-----------------------------------------------------------------------------+

7.8 Inference results

# Path of the output images
/home/yichao/MyDocuments/tensorrtx/yolov5/build

7.9 Calling the INT8-quantized yolov5s.engine from the Python API

// install python-tensorrt, pycuda, etc.
// ensure the yolov5s.engine and libmyplugins.so have been built
python yolov5_trt.py
(tensorRT-yolov5) yichao@yichao:~/MyDocuments/tensorrtx/yolov5$ python yolov5_trt.py 
----------- True
bingding: data (3, 640, 640)
bingding: prob (6001, 1, 1)
batch size is 1
warm_up->(640, 640, 3), time->7.82ms
warm_up->(640, 640, 3), time->4.51ms
warm_up->(640, 640, 3), time->4.55ms
warm_up->(640, 640, 3), time->4.61ms
warm_up->(640, 640, 3), time->5.11ms
warm_up->(640, 640, 3), time->4.81ms
warm_up->(640, 640, 3), time->4.56ms
warm_up->(640, 640, 3), time->4.75ms
warm_up->(640, 640, 3), time->4.52ms
warm_up->(640, 640, 3), time->4.91ms
input->['samples/COCO_train2014_000000421903.jpg'], time->4.57ms, saving into output/
input->['samples/COCO_train2014_000000145736.jpg'], time->5.38ms, saving into output/
input->['samples/COCO_train2014_000000482834.jpg'], time->4.66ms, saving into output/
input->['samples/COCO_train2014_000000393241.jpg'], time->5.01ms, saving into output/
input->['samples/COCO_train2014_000000548377.jpg'], time->5.28ms, saving into output/
input->['samples/COCO_train2014_000000329954.jpg'], time->4.93ms, saving into output/
...
input->['samples/COCO_train2014_000000141181.jpg'], time->5.19ms, saving into output/
input->['samples/COCO_train2014_000000221565.jpg'], time->5.15ms, saving into output/
input->['samples/COCO_train2014_000000366274.jpg'], time->4.76ms, saving into output/
input->['samples/COCO_train2014_000000048824.jpg'], time->4.80ms, saving into output/
1000 images at a resolution of 640x640
Average: 5 ms per image, i.e. about 200 fps

GPU memory usage:

Wed Sep  8 15:39:14 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 27%   40C    P0    20W /  75W |   2303MiB /  3903MiB |     23%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1469      G   /usr/lib/xorg/Xorg                242MiB |
|    0   N/A  N/A      9979      C   python                           2057MiB |
+-----------------------------------------------------------------------------+

8. mAP Evaluation Metric

References

README_mAP.md
tensorrt_demo

9. Performance Comparison

Three TensorRT acceleration approaches for YOLOv5 and benchmark results on an RTX 3090 (C++ and Python torchtrt versions)
