YOLOv5 Jetson部署 TensorRT

用TensorRT部署YOLOv5,用的最多的方式就是通过:
https://github.com/wang-xinyu/tensorrtx/tree/master/yolov5

步骤和教程都已经很详细了,我只是记录一下遇到的问题。

我的环境:
Jetson Xavier NX
Jetpack 4.5
aarch64 Ubuntu18.04
python 3.6.9
cuda 10.2
cudnn 8.0
tensorrt 7.1.3.0
opencv-python 4.5.5.64
torch 1.8.0
torchvision 0.9.0

在make的时候出现如下报错:
YOLOv5 Jetson部署 TensorRT_第1张图片
查了半天没找到原因,后来突然想起一篇部署文章中说好多错误是忘记加sudo导致的,我试了一下,真是这个原因……

在生成engine的时候出现如下报错:
YOLOv5 Jetson部署 TensorRT_第2张图片
如果yololayer.h中CLASS_NUM改过了, 那就是模型不对。因为我很早就开始使用YOLOv5,模型版本有点乱,下载对应版本的模型就好了。
YOLOv5 Jetson部署 TensorRT_第3张图片
生成engine之后,就可以用Python脚本yolov5_trt.py推理图片了。

这个脚本是推理一个文件夹中的所有图片,是个入门级的demo,对我使用来说有以下问题:

  1. 无法推理视频。
  2. 每张图的框颜色是随机的,没有根据类别固定,且不支持中文标签。
  3. BATCH_SIZE>1的时候,比如BATCH_SIZE=2,那么测试的图片数必须是2的倍数,否则会报错。
  4. 没明白为啥要不断创建新线程,我用不到多线程,去掉线程之后速度会更快。

我的需求是要能实时推理视频,以实现算法功能,所以画框和保存图片在实际使用中没有必要,反而拖慢速度。

我根据自己的需求改写了代码,用来推理视频或视频流。代码如下,有需要的同学自取:

"""
An example that uses TensorRT's Python api to make inferences.
"""
import ctypes
import os
import shutil
import random
import sys
import time
import cv2
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt
import torch

CONF_THRESH = 0.5
IOU_THRESHOLD = 0.4


class YoLov5TRT(object):
    """
    description: A YOLOv5 class that warps TensorRT ops, preprocess and postprocess ops.
    """

    def __init__(self, engine_file_path):
        # Create a Context on this device,
        self.ctx = cuda.Device(0).make_context()
        stream = cuda.Stream()
        TRT_LOGGER = trt.Logger(trt.Logger.INFO)
        runtime = trt.Runtime(TRT_LOGGER)

        # Deserialize the engine from file
        with open(engine_file_path, "rb") as f:
            engine = runtime.deserialize_cuda_engine(f.read())
        context = engine.create_execution_context()

        host_inputs = []
        cuda_inputs = []
        host_outputs = []
        cuda_outputs = []
        bindings = []

        for binding in engine:
            print('bingding:', binding, engine.get_binding_shape(binding))
            size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
            dtype = trt.nptype(engine.get_binding_dtype(binding))
            # Allocate host and device buffers
            host_mem = cuda.pagelocked_empty(size, dtype)
            cuda_mem = cuda.mem_alloc(host_mem.nbytes)
            # Append the device buffer to device bindings.
            bindings.append(int(cuda_mem))
            # Append to the appropriate list.
            if engine.binding_is_input(binding):
                self.input_w = engine.get_binding_shape(binding)[-1]
                self.input_h = engine.get_binding_shape(binding)[-2]
                host_inputs.append(host_mem)
                cuda_inputs.append(cuda_mem)
            else:
                host_outputs.append(host_mem)
                cuda_outputs.append(cuda_mem)

        # Store
        self.stream = stream
        self.context = context
        self.engine = engine
        self.host_inputs = host_inputs
        self.cuda_inputs = cuda_inputs
        self.host_outputs = host_outputs
        self.cuda_outputs = cuda_outputs
        self.bindings = bindings
        self.batch_size = engine.max_batch_size

    def infer(self, raw_image_generator):
        # Make self the active context, pushing it on top of the context stack.
        self.ctx.push()
        # Restore
        stream = self.stream
        context = self.context
        engine = self.engine
        host_inputs = self.host_inputs
        cuda_inputs = self.cuda_inputs
        host_outputs = self.host_outputs
        cuda_outputs = self.cuda_outputs
        bindings = self.bindings
        # Do image preprocess
        batch_image_raw = []
        batch_origin_h = []
        batch_origin_w = []
        batch_input_image = np.empty(shape=[self.batch_size, 3, self.input_h, self.input_w])
        for i, image_raw in enumerate(raw_image_generator):
            input_image, image_raw, origin_h, origin_w = self.preprocess_image(image_raw)
            batch_image_raw.append(image_raw)
            batch_origin_h.append(origin_h)
            batch_origin_w.append(origin_w)
            np.copyto(batch_input_image[i], input_image)
        batch_input_image = np.ascontiguousarray(batch_input_image)

        # Copy input image to host buffer
        np.copyto(host_inputs[0], batch_input_image.ravel())
        start = time.time()
        # Transfer input data  to the GPU.
        cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)
        # Run inference.
        context.execute_async(batch_size=self.batch_size, bindings=bindings, stream_handle=stream.handle)
        # Transfer predictions back from the GPU.
        cuda.memcpy_dtoh_async(host_outputs[0], cuda_outputs[0], stream)
        # Synchronize the stream
        stream.synchronize()
        end = time.time()
        # Remove any context from the top of the context stack, deactivating it.
        self.ctx.pop()
        # Here we use the first row of output in that batch_size = 1
        output = host_outputs[0]
        result_boxes, result_scores, result_classid = None, None, None
        # Do postprocess
        for i in range(self.batch_size):
            result_boxes, result_scores, result_classid = self.post_process(
                output[i * 6001: (i + 1) * 6001], batch_origin_h[i], batch_origin_w[i]
            )
        return result_boxes, result_scores, result_classid, end - start

    def destroy(self):
        # Remove any context from the top of the context stack, deactivating it.
        self.ctx.pop()

    def get_raw_image_zeros(self):
        """
        description: Ready data for warmup
        """
        for _ in range(self.batch_size):
            yield np.zeros([self.input_h, self.input_w, 3], dtype=np.uint8)

    def preprocess_image(self, raw_bgr_image):
        """
        description: Convert BGR image to RGB,
                     resize and pad it to target size, normalize to [0,1],
                     transform to NCHW format.
        param:
            input_image_path: str, image path
        return:
            image:  the processed image
            image_raw: the original image
            h: original height
            w: original width
        """
        image_raw = raw_bgr_image
        h, w, c = image_raw.shape
        image = cv2.cvtColor(image_raw, cv2.COLOR_BGR2RGB)
        # Calculate widht and height and paddings
        r_w = self.input_w / w
        r_h = self.input_h / h
        if r_h > r_w:
            tw = self.input_w
            th = int(r_w * h)
            tx1 = tx2 = 0
            ty1 = int((self.input_h - th) / 2)
            ty2 = self.input_h - th - ty1
        else:
            tw = int(r_h * w)
            th = self.input_h
            tx1 = int((self.input_w - tw) / 2)
            tx2 = self.input_w - tw - tx1
            ty1 = ty2 = 0
        # Resize the image with long side while maintaining ratio
        image = cv2.resize(image, (tw, th))
        # Pad the short side with (128,128,128)
        image = cv2.copyMakeBorder(
            image, ty1, ty2, tx1, tx2, cv2.BORDER_CONSTANT, None, (128, 128, 128)
        )
        image = image.astype(np.float32)
        # Normalize to [0,1]
        image /= 255.0
        # HWC to CHW format:
        image = np.transpose(image, [2, 0, 1])
        # CHW to NCHW format
        image = np.expand_dims(image, axis=0)
        # Convert the image to row-major order, also known as "C order":
        image = np.ascontiguousarray(image)
        return image, image_raw, h, w

    def xywh2xyxy(self, origin_h, origin_w, x):
        """
        description:    Convert nx4 boxes from [x, y, w, h] to [x1, y1, x2, y2] where xy1=top-left, xy2=bottom-right
        param:
            origin_h:   height of original image
            origin_w:   width of original image
            x:          A boxes numpy, each row is a box [center_x, center_y, w, h]
        return:
            y:          A boxes numpy, each row is a box [x1, y1, x2, y2]
        """
        y = np.zeros_like(x)
        r_w = self.input_w / origin_w
        r_h = self.input_h / origin_h
        if r_h > r_w:
            y[:, 0] = x[:, 0] - x[:, 2] / 2
            y[:, 2] = x[:, 0] + x[:, 2] / 2
            y[:, 1] = x[:, 1] - x[:, 3] / 2 - (self.input_h - r_w * origin_h) / 2
            y[:, 3] = x[:, 1] + x[:, 3] / 2 - (self.input_h - r_w * origin_h) / 2
            y /= r_w
        else:
            y[:, 0] = x[:, 0] - x[:, 2] / 2 - (self.input_w - r_h * origin_w) / 2
            y[:, 2] = x[:, 0] + x[:, 2] / 2 - (self.input_w - r_h * origin_w) / 2
            y[:, 1] = x[:, 1] - x[:, 3] / 2
            y[:, 3] = x[:, 1] + x[:, 3] / 2
            y /= r_h

        return y

    def post_process(self, output, origin_h, origin_w):
        """
        description: postprocess the prediction
        param:
            output:     A numpy likes [num_boxes,cx,cy,w,h,conf,cls_id, cx,cy,w,h,conf,cls_id, ...] 
            origin_h:   height of original image
            origin_w:   width of original image
        return:
            result_boxes: finally boxes, a boxes numpy, each row is a box [x1, y1, x2, y2]
            result_scores: finally scores, a numpy, each element is the score correspoing to box
            result_classid: finally classid, a numpy, each element is the classid correspoing to box
        """
        # Get the num of boxes detected
        num = int(output[0])
        # Reshape to a two dimentional ndarray
        pred = np.reshape(output[1:], (-1, 6))[:num, :]
        # Do nms
        boxes = self.non_max_suppression(pred, origin_h, origin_w, conf_thres=CONF_THRESH, nms_thres=IOU_THRESHOLD)
        result_boxes = boxes[:, :4] if len(boxes) else np.array([])
        result_scores = boxes[:, 4] if len(boxes) else np.array([])
        result_classid = boxes[:, 5] if len(boxes) else np.array([])
        return result_boxes, result_scores, result_classid

    def bbox_iou(self, box1, box2, x1y1x2y2=True):
        """
        description: compute the IoU of two bounding boxes
        param:
            box1: A box coordinate (can be (x1, y1, x2, y2) or (x, y, w, h))
            box2: A box coordinate (can be (x1, y1, x2, y2) or (x, y, w, h))            
            x1y1x2y2: select the coordinate format
        return:
            iou: computed iou
        """
        if not x1y1x2y2:
            # Transform from center and width to exact coordinates
            b1_x1, b1_x2 = box1[:, 0] - box1[:, 2] / 2, box1[:, 0] + box1[:, 2] / 2
            b1_y1, b1_y2 = box1[:, 1] - box1[:, 3] / 2, box1[:, 1] + box1[:, 3] / 2
            b2_x1, b2_x2 = box2[:, 0] - box2[:, 2] / 2, box2[:, 0] + box2[:, 2] / 2
            b2_y1, b2_y2 = box2[:, 1] - box2[:, 3] / 2, box2[:, 1] + box2[:, 3] / 2
        else:
            # Get the coordinates of bounding boxes
            b1_x1, b1_y1, b1_x2, b1_y2 = box1[:, 0], box1[:, 1], box1[:, 2], box1[:, 3]
            b2_x1, b2_y1, b2_x2, b2_y2 = box2[:, 0], box2[:, 1], box2[:, 2], box2[:, 3]

        # Get the coordinates of the intersection rectangle
        inter_rect_x1 = np.maximum(b1_x1, b2_x1)
        inter_rect_y1 = np.maximum(b1_y1, b2_y1)
        inter_rect_x2 = np.minimum(b1_x2, b2_x2)
        inter_rect_y2 = np.minimum(b1_y2, b2_y2)
        # Intersection area
        inter_area = np.clip(inter_rect_x2 - inter_rect_x1 + 1, 0, None) * \
                     np.clip(inter_rect_y2 - inter_rect_y1 + 1, 0, None)
        # Union Area
        b1_area = (b1_x2 - b1_x1 + 1) * (b1_y2 - b1_y1 + 1)
        b2_area = (b2_x2 - b2_x1 + 1) * (b2_y2 - b2_y1 + 1)

        iou = inter_area / (b1_area + b2_area - inter_area + 1e-16)

        return iou

    def non_max_suppression(self, prediction, origin_h, origin_w, conf_thres=0.5, nms_thres=0.4):
        """
        description: Removes detections with lower object confidence score than 'conf_thres' and performs
        Non-Maximum Suppression to further filter detections.
        param:
            prediction: detections, (x1, y1, x2, y2, conf, cls_id)
            origin_h: original image height
            origin_w: original image width
            conf_thres: a confidence threshold to filter detections
            nms_thres: a iou threshold to filter detections
        return:
            boxes: output after nms with the shape (x1, y1, x2, y2, conf, cls_id)
        """
        # Get the boxes that score > CONF_THRESH
        boxes = prediction[prediction[:, 4] >= conf_thres]
        # Trandform bbox from [center_x, center_y, w, h] to [x1, y1, x2, y2]
        boxes[:, :4] = self.xywh2xyxy(origin_h, origin_w, boxes[:, :4])
        # clip the coordinates
        boxes[:, 0] = np.clip(boxes[:, 0], 0, origin_w - 1)
        boxes[:, 2] = np.clip(boxes[:, 2], 0, origin_w - 1)
        boxes[:, 1] = np.clip(boxes[:, 1], 0, origin_h - 1)
        boxes[:, 3] = np.clip(boxes[:, 3], 0, origin_h - 1)
        # Object confidence
        confs = boxes[:, 4]
        # Sort by the confs
        boxes = boxes[np.argsort(-confs)]
        # Perform non-maximum suppression
        keep_boxes = []
        while boxes.shape[0]:
            large_overlap = self.bbox_iou(np.expand_dims(boxes[0, :4], 0), boxes[:, :4]) > nms_thres
            label_match = boxes[0, -1] == boxes[:, -1]
            # Indices of boxes with lower confidence scores, large IOUs and matching labels
            invalid = large_overlap & label_match
            keep_boxes += [boxes[0]]
            boxes = boxes[~invalid]
        boxes = np.stack(keep_boxes, 0) if len(keep_boxes) else np.array([])
        return boxes


if __name__ == "__main__":
    # load custom plugin and engine
    engine_file_path = "build/device_17_5765_s.engine"
    PLUGIN_LIBRARY = "build/libmyplugins.so"
    source = 'samples/t1.mp4'

    if len(sys.argv) > 1:
        engine_file_path = sys.argv[1]
    if len(sys.argv) > 2:
        PLUGIN_LIBRARY = sys.argv[2]
    if len(sys.argv) > 3:
        source = sys.argv[3]
    assert source.lower().endswith('mp4') or source.lower().startswith(('rtsp://', 'rtmp://'))

    ctypes.CDLL(PLUGIN_LIBRARY)

    # load coco labels
    categories = ["person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat",
                  "traffic light", "fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat", "dog", "horse",
                  "sheep", "cow", "elephant", "bear", "zebra", "giraffe", "backpack", "umbrella", "handbag", "tie",
                  "suitcase", "frisbee", "skis", "snowboard", "sports ball", "kite", "baseball bat", "baseball glove",
                  "skateboard", "surfboard", "tennis racket", "bottle", "wine glass", "cup", "fork", "knife", "spoon"]

    # a YoLov5TRT instance
    yolov5_wrapper = YoLov5TRT(engine_file_path)

    try:
        print('batch size is', yolov5_wrapper.batch_size)
        for i in range(10):
            result_boxes, result_scores, result_classid, use_time = yolov5_wrapper.infer(yolov5_wrapper.get_raw_image_zeros())
            print('input->({},{}), time->{:.2f}ms'.format(yolov5_wrapper.input_h, yolov5_wrapper.input_w, use_time * 1000))

        raw_image_generator = []
        cap = cv2.VideoCapture(source)
        t1 = time.time()
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break
            if yolov5_wrapper.batch_size == 1:
                raw_image_generator = [frame]
                result_boxes, result_scores, result_classid, use_time = yolov5_wrapper.infer(raw_image_generator)
                print('input->{}\ttime->{:.2f}ms'.format(cap.get(cv2.CAP_PROP_POS_FRAMES), use_time * 1000))
            else:
                if cap.get(cv2.CAP_PROP_POS_FRAMES) % yolov5_wrapper.batch_size == 0:
                    raw_image_generator.append(frame)
                    result_boxes, result_scores, result_classid, use_time = yolov5_wrapper.infer(raw_image_generator)
                    print('input->{}\ttime->{:.2f}ms'.format(cap.get(cv2.CAP_PROP_POS_FRAMES), use_time * 1000))
                    raw_image_generator = []
                else:
                    raw_image_generator.append(frame)
        print('FPS:', cap.get(cv2.CAP_PROP_FRAME_COUNT) / (time.time() - t1))
        cap.release()

    finally:
        # destroy the instance
        yolov5_wrapper.destroy()

新增了第三个命令行参数source作为视频路径或视频流地址,类别名categories根据自己的改,因为我反正不画图不展示也不保存,所以这个类别名只是用来打印结果的,写中文还是英文没区别。画图和保存的代码我都不要了,大家可以根据自己需求保留。

我用yolov5s.pt生成的engine,跑测试视频,没有画框和保存的情况下,帧率刚好超过25,基本实时了。
YOLOv5 Jetson部署 TensorRT_第4张图片
为了做比较,我这个代码也支持BATCH_SIZE>1的情况,即将每几帧图片一起推理,当然这个engine要重新生成。
我测试了一下当BATCH_SIZE=2的时候:
YOLOv5 Jetson部署 TensorRT_第5张图片
虽然看起来2帧的推理时间没有到原来的2倍,但整体速度反而变慢了。

------------------------------------------------------------------我是一条慵懒的分割线------------------------------------------------------------------------------------

YOLOv5最新的6.1版本增加了TensorRT的导出和推理,记录一下自己的使用过程。

我这台NX的cuda和tensorrt是内置的,主要是python、torch和torchvision要匹配且符合YOLOv5要求。

还是用yolov5s.pt训练得到的模型best.pt,yolov5-6.1下,使用命令:

python3 export.py  --device 0 --weights best.pt --include engine

导出TensorRT的模型,会先转成onnx再转成engine。

第一次转的时候,内存不足直接退出了,我的NX内存是8G。
YOLOv5 Jetson部署 TensorRT_第6张图片
应该是模型太大了?我把export.py中的export_engine()的workspace参数从4改成2,还是失败,用了最小的yolov5n还是内存爆炸。没找到办法,我想是不是因为先转onnx已经占了一部分内存,如果把这两步拆开内存可能就够用了。于是我把代码中将onnx转engine的部分给简化了出来:

import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
config = builder.create_builder_config()
config.max_workspace_size = 2 << 30
flag = (1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
network = builder.create_network(flag)
parser = trt.OnnxParser(network, logger)
parser.parse_from_file('best.onnx')
with builder.build_engine(network, config) as engine, open('best.engine', 'wb') as t:
    t.write(engine.serialize())

其中config.max_workspace_size = 2 << 30中的2就是workspace参数,如果内存还是不够用,可以改成1。onnx和engine的名字按自己的改就好了。先生成onnx,再用上述代码生成engine,这样终于成功了。

我用视频做推理测试
pt的推理速度为:

Speed: 1.3ms pre-process, 44.3ms inference, 3.5ms NMS per image at shape (1, 3, 640, 640)

engine的推理速度为:

Speed: 1.5ms pre-process, 31.9ms inference, 3.8ms NMS per image at shape (1, 3, 640, 640)

速度提升不多,这种通过onnx路径转engine的方式,和第一种直接用TensorRT的C++API复现模型的方式,效果还是有挺大差别的。生成的模型也不一样,YOLOv5生成的engine会大得多。

你可能感兴趣的:(AI,深度学习,人工智能,pytorch)