Tom Hardy

YOLOv5-Lite in Real Time on Raspberry Pi | Fewer Parameters, Higher Accuracy, Faster Detection (C++ Deployment Walkthrough)...


Author | ChaucerG

Source | 集智书童

1 YOLOv5-Lite

1. Backbone and Head


The YOLOv5-Lite backbone is built mainly from Shuffle blocks, which include a channel shuffle operation;

the detection head is still the YOLOv5 head, but in a simplified form.

The Shuffle block is shown below:

[Figure: Shuffle block diagram]
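To make the channel shuffle inside a Shuffle block concrete, here is a minimal pure-Python sketch. It is not the YOLOv5-Lite code: the function name `channel_shuffle` and the use of a flat list of channel indices are assumptions for illustration. It only shows the usual reshape, transpose, flatten trick that interleaves channels from different groups.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups (reshape -> transpose -> flatten)."""
    n = len(channels)
    if n % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    per_group = n // groups
    # View the flat list as a (groups, per_group) grid ...
    grid = [channels[g * per_group:(g + 1) * per_group] for g in range(groups)]
    # ... then read it column by column, i.e. transpose and flatten.
    return [grid[g][i] for i in range(per_group) for g in range(groups)]


shuffled = channel_shuffle(list(range(8)), groups=2)
print(shuffled)  # [0, 4, 1, 5, 2, 6, 3, 7]
```

After the shuffle, neighbouring channels come from different groups, so the next grouped operation sees information from both halves.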

YOLOv5 backbone: in the original Ultralytics YOLOv5 backbone, the author builds the Focus layer at the top of the feature extractor out of 4 slice operations

[Figure: YOLOv5 backbone]

YOLOv5 head:

[Figure: YOLOv5 head]

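2. Focus

Before discussing what Focus does, two concepts are worth defining:

Parameter count (params): determines model size, usually expressed in M (millions). Parameters are normally stored as float32, so the model size in bytes is about 4 times the parameter count.

Computation (FLOPs): the number of floating-point operations, used to measure the complexity of an algorithm or model, which in turn relates to its speed. Large models are usually quoted in G, small ones in M. Normally only multiply-add operations are counted, and only for parameterized layers such as Conv and FC; BN, PReLU and the like are ignored. Pure additions inside Conv and FC layers, such as the bias add and the shortcut (residual) add, are usually ignored as well, and a convolution followed by BN can drop its bias altogether.

Params formula:

Kh × Kw × Cin × Cout

FLOPs formula:

Kh × Kw × Cin × Cout × H × W = params × H × W, that is, the filter size of the current layer multiplied by the size of the output feature map

As is well known, the most visible effect of passing an image through the Focus module is downsampling, but it is not quite the same as the usual strided-convolution downsampling. The computation of the two can be compared:

In the yolov5s network structure, the Focus module uses a 3 × 3 kernel with 32 output channels: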

[Figure: yolov5s layer summary showing the Focus module and its parameter count]

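The comparison:

Ordinary downsampling: a 640×640×3 image goes through a 3×3 convolution with stride 2 and 32 output channels, producing a 320 × 320 × 32 feature map. The theoretical cost of this convolution is:

FLOPs(conv) = 3×3×3×32×320×320 = 88473600 (ignoring bias)
params(conv) = 3×3×3×32+32+32 = 928 (the two trailing 32s are the bias and BN parameters)

Focus: a 640×640×3 image goes into the Focus structure, is sliced into a 320×320×12 feature map, and then passes through a 3×3 convolution with 32 output channels to become a 320×320×32 feature map. The theoretical cost of Focus is:

FLOPs(Focus) = 3×3×12×32×320×320 = 353894400 (ignoring bias)
params(Focus) = 3×3×12×32+32+32 = 3520 (to match the parameter count printed in the figure above, the two trailing 32s for bias and BN are included; they are usually a small share and can be ignored)

Compared layer for layer, Focus clearly costs more than an ordinary convolution: 4 times the FLOPs and roughly 4 times the parameters (3520 vs. 928). In exchange, no information is lost during downsampling.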
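The arithmetic above is easy to double-check with a few lines of Python. This is only a sketch of the two formulas as the text uses them; the helper names `conv_params` and `conv_flops` are made up for illustration, and the bias/BN terms are counted the same way as in the text (one extra term per output channel each).

```python
def conv_params(k_h, k_w, c_in, c_out, bias=True, bn=True):
    """Weights, plus one bias term and one BN term per output channel (as counted in the text)."""
    extra = (c_out if bias else 0) + (c_out if bn else 0)
    return k_h * k_w * c_in * c_out + extra


def conv_flops(k_h, k_w, c_in, c_out, out_h, out_w):
    """Multiply-add count of a convolution, ignoring bias."""
    return k_h * k_w * c_in * c_out * out_h * out_w


# Ordinary stride-2 conv on a 3-channel image vs. Focus (12 channels after slicing)
plain = {"params": conv_params(3, 3, 3, 32), "flops": conv_flops(3, 3, 3, 32, 320, 320)}
focus = {"params": conv_params(3, 3, 12, 32), "flops": conv_flops(3, 3, 12, 32, 320, 320)}

print(plain)  # {'params': 928, 'flops': 88473600}
print(focus)  # {'params': 3520, 'flops': 353894400}
print(focus["flops"] / plain["flops"])  # 4.0
```

The FLOPs ratio is exactly 4 because only the input channel count changes (3 to 12); the parameter ratio is slightly below 4 because the bias and BN terms do not grow.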


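In practice the comparison involves three layers rather than one, so after the change the parameter count does go down and inference does get faster, again with no information lost during downsampling.

The Focus layer takes every group of 4 adjacent pixels in a 2×2 square and produces a feature map with 4 times as many channels. This is similar to downsampling the previous layer four times and concatenating the results. Its main purpose is to cut parameters and speed up the model without weakening its feature extraction.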


The Python implementation of Focus is:

import torch
import torch.nn as nn

# Conv (convolution + BN + activation) is defined in YOLOv5's models/common.py

class Focus(nn.Module):
    # Focus wh information into c-space
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):  # ch_in, ch_out, kernel, stride, padding, groups
        super(Focus, self).__init__()
        self.conv = Conv(c1 * 4, c2, k, s, p, g, act)      # input channels become 4x after slicing

    def forward(self, x):  # x(b,c,w,h) -> y(b,4c,w/2,h/2)
        return self.conv(torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]], 1))
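To see what the four slices in `forward` actually pick out, here is a small dependency-free sketch on a single 4×4 channel. The helper `focus_slice` is hypothetical and works on nested lists instead of tensors; it only mirrors the slice order used in the `torch.cat` call above.

```python
def focus_slice(image):
    """Space-to-depth on a 2-D grid: four stride-2 slices, one per position in each 2x2 block."""
    # Same order as the torch.cat call: (even, even), (odd, even), (even, odd), (odd, odd)
    offsets = [(0, 0), (1, 0), (0, 1), (1, 1)]  # (row offset, column offset)
    return [[row[dx::2] for row in image[dy::2]] for dy, dx in offsets]


image = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 grid holding 0..15
planes = focus_slice(image)
for plane in planes:
    print(plane)
# [[0, 2], [8, 10]]
# [[4, 6], [12, 14]]
# [[1, 3], [9, 11]]
# [[5, 7], [13, 15]]
```

Each of the 16 input values appears exactly once in the four 2×2 planes, which is why this kind of downsampling loses no information: the resolution is halved and the data moves into the channel dimension.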

The parameter count and timing profile are as follows:

cuda _CudaDeviceProperties(name='Tesla T4', major=7, minor=5, total_memory=15079MB, multi_processor_count=40)

      Params       FLOPS    forward (ms)   backward (ms)                   input                  output
        7040       23.07           62.89           87.79       (16, 3, 640, 640)      (16, 64, 320, 320)
        7040       23.07           15.52           48.69       (16, 3, 640, 640)      (16, 64, 320, 320)
cuda _CudaDeviceProperties(name='Tesla T4', major=7, minor=5, total_memory=15079MB, multi_processor_count=40)

      Params       FLOPS    forward (ms)   backward (ms)                   input                  output
        7040       23.07           11.61           79.72       (16, 3, 640, 640)      (16, 64, 320, 320)
        7040       23.07           12.54           42.94       (16, 3, 640, 640)      (16, 64, 320, 320)

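As the output above shows, the Focus layer does speed the model up while reducing parameters.

But this speedup comes with a precondition: it only shows up on a GPU. In cloud deployments the GPU does not need to worry much about cache usage, and its fetch-and-process style makes the Focus layer work very well on GPU devices.

On chips, especially those without GPU or NPU acceleration, frequent slice operations only drive up cache usage and add to the processing load. On top of that, converting the Focus layer during on-chip deployment is extremely unfriendly to beginners.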

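2 The Lightweight Design Philosophy

The design philosophy of shufflenetv2 matters a great deal at the edge, where compute is limited. It puts forward 4 guidelines for lightweight models:

G1: Equal input and output channel widths minimize memory access cost (MAC)
G2: Excessive use of group convolution increases MAC
G3: An overly fragmented network (especially with multiple paths) reduces parallelism
G4: Element-wise operations (such as shortcut and Add) cannot be ignored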
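The first guideline can be checked numerically. For a 1×1 convolution on an h×w feature map, the ShuffleNetV2 paper models the memory access cost as MAC = hw(c_in + c_out) + c_in·c_out, while the FLOPs are hw·c_in·c_out. The sketch below keeps the FLOPs fixed and varies the channel split; the feature map size and channel pairs are illustrative assumptions, and `pointwise_mac` is a made-up helper name.

```python
def pointwise_mac(h, w, c_in, c_out):
    """Memory access cost of a 1x1 conv: read the input, write the output, read the weights."""
    return h * w * (c_in + c_out) + c_in * c_out


h = w = 56
# Every pair has c_in * c_out == 16384, so all four layers cost the same FLOPs.
pairs = [(128, 128), (64, 256), (32, 512), (16, 1024)]
macs = {pair: pointwise_mac(h, w, *pair) for pair in pairs}

for pair, mac in macs.items():
    print(pair, mac)

best = min(macs, key=macs.get)
print(best)  # (128, 128)
```

At equal FLOPs, the balanced layer touches the least memory, and the cost grows as the input and output widths drift apart. On memory-bound edge devices that difference shows up directly as latency.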

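3 YOLOv5-Lite Design Philosophy

Remove the Focus layer to avoid repeated slice operations
Avoid using C3 Layers repeatedly, and avoid high-channel C3 Layers (the C3 Layer is an improved CSPBottleneck proposed by the YOLOv5 author; it is simpler, faster and lighter, and achieves better results at nearly the same loss. But the C3 Layer uses multi-branch split convolutions, and tests show that frequent use of C3 Layers, or C3 Layers with many channels, takes up more cache and slows inference down)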

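class C3(nn.Module):
    # CSP Bottleneck with 3 convolutions (Conv and Bottleneck come from YOLOv5's models/common.py)
    def __init__(self, c1, c2, n=1, shortcut=True, g=1, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super(C3, self).__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c1, c_, 1, 1)
        self.cv3 = Conv(2 * c_, c2, 1)
        self.m = nn.Sequential(*[Bottleneck(c_, c_, shortcut, g, e=1.0) for _ in range(n)])
        # self.m = nn.Sequential(*[CrossConv(c_, c_, 3, 1, g, 1.0, shortcut) for _ in range(n)])

    def forward(self, x):
        # as in upstream YOLOv5: one branch goes through the bottlenecks, is concatenated with the other, then fused
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))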

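Apply channel pruning to the yolov5 head; for the pruning rules, see G1
Remove the 1024 conv and the 5×5 pooling from the shufflenetv2 backbone

These modules were designed for the ImageNet leaderboard. Real-world scenarios rarely have that many classes, so they can reasonably be removed: accuracy is barely affected while speed improves considerably, which the ablation experiments also confirm.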

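4 Deploying YOLOv5-Lite with Tengine

The core Tengine APIs are called in the following order:

1. init_tengine

Initializes Tengine. This function only needs to be called once per program.

2. create_graph

Creates the Tengine computation graph.

3. prerun_graph

Pre-run step that prepares the resources the graph needs for inference. Big/little core selection, number of cores, core affinity and data precision are all set here.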

struct options
{
  int num_thread;    // number of cores to use
  int cluster;       // big/little core selection: TENGINE_CLUSTER_[ALL, BIG, MEDIUM, LITTLE]
  int precision;     // precision: TENGINE_MODE_[FP32, FP16, HYBRID_INT8, UINT8, INT8]
  uint64_t affinity; // core affinity mask, binds to specific cores
};

4. run_graph

Runs inference on the Tengine computation graph.

5. postrun_graph

Stops running the graph and releases the resources it holds.

6. destroy_graph

Destroys the graph.


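1. Adaptive image scaling

During training the network input might be, say, 608×608, but my images come in all sizes. The usual approach is to scale them uniformly to the standard size and then pad with black borders, as shown below: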

[Figure: image scaled to the standard size and padded with black borders]

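If there is a lot of padding, though, it carries redundant information and slows inference. At inference time Yolov5 speeds things up by shrinking the black borders: the letterbox function in datasets.py was modified to add the minimum border adaptively. For example: “My 1000×800 image is not scaled straight to 608×608. Instead, compute 608/1000=0.608 and scale to 608×486, then compute 608-486=122, take np.mod(122, 32) to get a remainder of 26, split that into 13 for each end of the image height, and end up with 608×512.”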


import cv2
import numpy as np

def letterbox(img, new_shape=(640, 640), color=(114, 114, 114), auto=True, scaleFill=False, scaleup=True, stride=32):
    # Resize and pad image while meeting stride-multiple constraints
    shape = img.shape[:2]  # current shape [height, width]
    if isinstance(new_shape, int):
        new_shape = (new_shape, new_shape)

    # Scale ratio (new / old)
    r = min(new_shape[0] / shape[0], new_shape[1] / shape[1])
    if not scaleup:  # only scale down, do not scale up (for better test mAP)
        r = min(r, 1.0)

    # Compute padding
    ratio = r, r  # width, height ratios
    new_unpad = int(round(shape[1] * r)), int(round(shape[0] * r))
    dw, dh = new_shape[1] - new_unpad[0], new_shape[0] - new_unpad[1]  # wh padding
    if auto:  # minimum rectangle
        dw, dh = np.mod(dw, stride), np.mod(dh, stride)  # wh padding
    elif scaleFill:  # stretch
        dw, dh = 0.0, 0.0
        new_unpad = (new_shape[1], new_shape[0])
        ratio = new_shape[1] / shape[1], new_shape[0] / shape[0]  # width, height ratios

    dw /= 2  # divide padding into 2 sides
    dh /= 2

    if shape[::-1] != new_unpad:  # resize
        img = cv2.resize(img, new_unpad, interpolation=cv2.INTER_LINEAR)
    top, bottom = int(round(dh - 0.1)), int(round(dh + 0.1))
    left, right = int(round(dw - 0.1)), int(round(dw + 0.1))
    img = cv2.copyMakeBorder(img, top, bottom, left, right, cv2.BORDER_CONSTANT, value=color)  # add border
    return img, ratio, (dw, dh)
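The 1000×800 example from the text can be reproduced without OpenCV by keeping only the size arithmetic of `letterbox`. This is a sketch, not part of YOLOv5: `letterbox_shape` is a hypothetical helper, and it only covers the `auto=True` path with a square target.

```python
def letterbox_shape(height, width, target=608, stride=32):
    """Return (resized_h, resized_w, padded_h, padded_w) for minimal-rectangle padding."""
    r = min(target / height, target / width)       # scale ratio (new / old)
    new_h, new_w = int(round(height * r)), int(round(width * r))
    pad_h = (target - new_h) % stride               # total padding along the height
    pad_w = (target - new_w) % stride               # total padding along the width
    return new_h, new_w, new_h + pad_h, new_w + pad_w


# Image 1000 wide and 800 high, as in the example above
print(letterbox_shape(800, 1000))  # (486, 608, 512, 608)
```

The height is resized to 486 and padded by 26 pixels in total (13 per side) to reach 512, while the width needs no padding, giving the 608×512 result quoted in the text.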

The C++ version:

#include <opencv2/opencv.hpp>

void get_input_data_focus(const char* image_file, float* input_data, int img_h, int img_w, const float* mean, const float* scale)
{
    cv::Mat sample = cv::imread(image_file, 1);
    cv::Mat img;

    const int target_size = 640;
    // read the size from the loaded image (img is still empty at this point)
    int imge_w = sample.cols;
    int imge_h = sample.rows;
    int w = imge_w;
    int h = imge_h;
    float scale_im = 1.f;

    // scale the longer side to target_size, keeping the aspect ratio
    if (w > h)
    {
        scale_im = (float)target_size / w;
        w = target_size;
        h = h * scale_im;
    }
    else
    {
        scale_im = (float)target_size / h;
        h = target_size;
        w = w * scale_im;
    }

    cv::cvtColor(sample, img, cv::COLOR_BGR2RGB);
    cv::resize(img, img, cv::Size(w, h));
    // pad each side up to the next multiple of 32
    int wpad = (w + 31) / 32 * 32 - w;
    int hpad = (h + 31) / 32 * 32 - h;

    cv::Mat in_pad;
    cv::copyMakeBorder(img, in_pad, hpad / 2, hpad - hpad / 2, wpad / 2, wpad - wpad / 2, cv::BORDER_CONSTANT, cv::Scalar(114, 114, 114));
    in_pad.convertTo(in_pad, CV_32FC3);

    // read from the padded image; img_h and img_w must equal in_pad.rows and in_pad.cols
    float* img_data = (float*)in_pad.data;

    /* nhwc to nchw */
    for (int h = 0; h < img_h; h++)
    {
        for (int w = 0; w < img_w; w++)
        {
            for (int c = 0; c < 3; c++)
            {
                int in_index = h * img_w * 3 + w * 3 + c;
                int out_index = c * img_h * img_w + h * img_w + w;
                input_data[out_index] = (img_data[in_index] - mean[c]) * scale[c];
            }
        }
    }
}
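The index arithmetic in the nhwc-to-nchw loop is easy to get wrong, so here is a tiny Python sketch of the same mapping on a flat buffer. The function `hwc_to_chw` is a hypothetical helper for illustration; it leaves out the mean/scale normalization and only shows the reordering.

```python
def hwc_to_chw(pixels, height, width, channels=3):
    """Reorder a flat interleaved (HWC) buffer into planar (CHW) order."""
    out = [None] * (height * width * channels)
    for h in range(height):
        for w in range(width):
            for c in range(channels):
                in_index = h * width * channels + w * channels + c   # interleaved layout
                out_index = c * height * width + h * width + w       # planar layout
                out[out_index] = pixels[in_index]
    return out


# A 1x2 image: two RGB pixels stored as R0 G0 B0 R1 G1 B1
flat = ["R0", "G0", "B0", "R1", "G1", "B1"]
print(hwc_to_chw(flat, height=1, width=2))  # ['R0', 'R1', 'G0', 'G1', 'B0', 'B1']
```

OpenCV stores pixels interleaved, while the network input expects one full plane per channel, which is why the loop computes two different indices for the same element.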


2. Model loading and inference

/* set runtime options */
    struct options opt;
    opt.num_thread = num_thread;
    opt.cluster = TENGINE_CLUSTER_ALL;
    opt.precision = TENGINE_MODE_FP32;
    opt.affinity = 0;

    /* initialize tengine */
    if (init_tengine() != 0)
    {
        fprintf(stderr, "Initial tengine failed.\n");
        return -1;
    }
    fprintf(stderr, "tengine-lite library version: %s\n", get_tengine_version());

    /* create graph, load tengine model xxx.tmfile */
    graph_t graph = create_graph(nullptr, "tengine", model_file);
    if (graph == nullptr)
    {
        fprintf(stderr, "Create graph failed.\n");
        return -1;
    }

3. Getting the inference results

/* yolov5 postprocess */
    // 0: 1, 3, 20, 20, 85
    // 1: 1, 3, 40, 40, 85
    // 2: 1, 3, 80, 80, 85
    tensor_t p8_output = get_graph_output_tensor(graph, 0, 0);
    tensor_t p16_output = get_graph_output_tensor(graph, 1, 0);
    tensor_t p32_output = get_graph_output_tensor(graph, 2, 0);

    float* p8_data = (float*)get_tensor_buffer(p8_output);
    float* p16_data = (float*)get_tensor_buffer(p16_output);
    float* p32_data = (float*)get_tensor_buffer(p32_output);

    /* postprocess */
    const float prob_threshold = 0.55;
    const float nms_threshold = 0.5;

    std::vector