YOLOv4 Paper Translation

 

Abstract

There are a huge number of features which are said to improve Convolutional Neural Network (CNN) accuracy. Practical testing of combinations of such features on large datasets, and theoretical justification of the result, is required. Some features operate on certain models exclusively and for certain problems exclusively, or only for small-scale datasets; while some features, such as batch-normalization and residual-connections, are applicable to the majority of models, tasks, and datasets. We assume that such universal features include Weighted Residual-Connections (WRC), Cross-Stage-Partial-connections (CSP), Cross mini-Batch Normalization (CmBN), Self-adversarial-training (SAT) and Mish-activation. We use new features: WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, DropBlock regularization, and CIoU loss, and combine some of them to achieve state-of-the-art results: 43.5% AP (65.7% AP50) for the MS COCO dataset at a real-time speed of ∼65 FPS on Tesla V100. Source code is at https://github.com/AlexeyAB/darknet.


 

1. Introduction

The majority of CNN-based object detectors are largely applicable only for recommendation systems. For example, searching for free parking spaces via urban video cameras is executed by slow accurate models, whereas car collision warning is related to fast inaccurate models. Improving the real-time object detector accuracy enables using them not only for hint generating recommendation systems, but also for stand-alone process management and human input reduction. Real-time object detector operation on conventional Graphics Processing Units (GPU) allows their mass usage at an affordable price. The most accurate modern neural networks do not operate in real time and require a large number of GPUs for training with a large mini-batch-size. We address such problems through creating a CNN that operates in real-time on a conventional GPU, and for which training requires only one conventional GPU.


[Figure 1]

The main goal of this work is designing a fast operating speed of an object detector in production systems and optimization for parallel computations, rather than the low computation volume theoretical indicator (BFLOP). We hope that the designed object detector can be easily trained and used. For example, anyone who uses a conventional GPU to train and test can achieve real-time, high-quality, and convincing object detection results, as the YOLOv4 results shown in Figure 1. Our contributions are summarized as follows:

1. We develop an efficient and powerful object detection model. It makes it possible for everyone to use a 1080 Ti or 2080 Ti GPU to train a super fast and accurate object detector.

2. We verify the influence of state-of-the-art Bag-of-Freebies and Bag-of-Specials methods of object detection during the detector training.

3. We modify state-of-the-art methods and make them more efficient and suitable for single GPU training, including CBN [89], PAN [49], SAM [85], etc.



2. Related work

2.1. Object detection models

A modern detector is usually composed of two parts, a backbone which is pre-trained on ImageNet and a head which is used to predict classes and bounding boxes of objects. For those detectors running on GPU platform, their backbone could be VGG [68], ResNet [26], ResNeXt [86], or DenseNet [30]. For those detectors running on CPU platform, their backbone could be SqueezeNet [31], MobileNet [28, 66, 27, 74], or ShuffleNet [97, 53]. As to the head part, it is usually categorized into two kinds, i.e., one-stage object detector and two-stage object detector. The most representative two-stage object detector is the R-CNN [19] series, including fast R-CNN [18], faster R-CNN [64], R-FCN [9], and Libra R-CNN [58]. It is also possible to make a two-stage object detector an anchor-free object detector, such as RepPoints [87]. As for one-stage object detector, the most representative models are YOLO [61, 62, 63], SSD [50], and RetinaNet [45]. In recent years, anchor-free one-stage object detectors have been developed. The detectors of this sort are CenterNet [13], CornerNet [37, 38], FCOS [78], etc. Object detectors developed in recent years often insert some layers between backbone and head, and these layers are usually used to collect feature maps from different stages. We can call it the neck of an object detector. Usually, a neck is composed of several bottom-up paths and several top-down paths. Networks equipped with this mechanism include Feature Pyramid Network (FPN) [44], Path Aggregation Network (PAN) [49], BiFPN [77], and NAS-FPN [17]. In addition to the above models, some researchers put their emphasis on directly building a new backbone (DetNet [43], DetNAS [7]) or a new whole model (SpineNet [12], HitDetector [20]) for object detection.


To sum up, an ordinary object detector is composed of several parts:

• Input: Image, Patches, Image Pyramid

• Backbones: VGG16 [68], ResNet-50 [26], SpineNet [12], EfficientNet-B0/B7 [75], CSPResNeXt50 [81], CSPDarknet53 [81]

• Neck:

    ◦ Additional blocks: SPP [25], ASPP [5], RFB [47], SAM [85]

    ◦ Path-aggregation blocks: FPN [44], PAN [49], NAS-FPN [17], Fully-connected FPN, BiFPN [77], ASFF [48], SFAM [98]

• Heads:

    ◦ Dense Prediction (one-stage):

        ▪ RPN [64], SSD [50], YOLO [61], RetinaNet [45] (anchor based)

        ▪ CornerNet [37], CenterNet [13], MatrixNet [60], FCOS [78] (anchor free)

    ◦ Sparse Prediction (two-stage):

        ▪ Faster R-CNN [64], R-FCN [9], Mask R-CNN [23] (anchor based)

        ▪ RepPoints [87] (anchor free)

 


2.2. Bag of freebies

Usually, a conventional object detector is trained offline. Therefore, researchers always like to take this advantage and develop better training methods which can make the object detector receive better accuracy without increasing the inference cost. We call these methods that only change the training strategy or only increase the training cost as "bag of freebies." What is often adopted by object detection methods and meets the definition of bag of freebies is data augmentation. The purpose of data augmentation is to increase the variability of the input images, so that the designed object detection model has higher robustness to the images obtained from different environments. For example, photometric distortions and geometric distortions are two commonly used data augmentation methods and they definitely benefit the object detection task. In dealing with photometric distortion, we adjust the brightness, contrast, hue, saturation, and noise of an image. For geometric distortion, we add random scaling, cropping, flipping, and rotating.


The data augmentation methods mentioned above are all pixel-wise adjustments, and all original pixel information in the adjusted area is retained. In addition, some researchers engaged in data augmentation put their emphasis on simulating object occlusion issues. They have achieved good results in image classification and object detection. For example, random erase [100] and CutOut [11] can randomly select the rectangle region in an image and fill in a random or complementary value of zero. As for hide-and-seek [69] and grid mask [6], they randomly or evenly select multiple rectangle regions in an image and replace them with all zeros. If similar concepts are applied to feature maps, there are DropOut [71], DropConnect [80], and DropBlock [16] methods. In addition, some researchers have proposed the methods of using multiple images together to perform data augmentation. For example, MixUp [92] uses two images to multiply and superimpose with different coefficient ratios, and then adjusts the label with these superimposed ratios. As for CutMix [91], it is to cover the cropped image to rectangle region of other images, and adjusts the label according to the size of the mix area. In addition to the above mentioned methods, style transfer GAN [15] is also used for data augmentation, and such usage can effectively reduce the texture bias learned by CNN.
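To make the label-adjustment step concrete, here is a minimal numpy sketch of MixUp; the Beta parameter `alpha` and float32 image arrays are illustrative assumptions, not settings taken from the cited papers:

```python
import numpy as np

def mixup(img1, lab1, img2, lab2, alpha=0.2):
    """Blend two images and their one-hot labels with one Beta-sampled ratio."""
    lam = np.random.beta(alpha, alpha)       # mixing coefficient in (0, 1)
    image = lam * img1 + (1.0 - lam) * img2  # superimpose the two images
    label = lam * lab1 + (1.0 - lam) * lab2  # adjust the label with the same ratio
    return image, label
```

CutMix differs only in that a rectangular crop of one image is pasted over the other, and the mixing ratio is set to the pasted-area fraction.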


Different from the various approaches proposed above, some other bag of freebies methods are dedicated to solving the problem that the semantic distribution in the dataset may have bias. In dealing with the problem of semantic distribution bias, a very important issue is that there is a problem of data imbalance between different classes, and this problem is often solved by hard negative example mining [72] or online hard example mining [67] in two-stage object detector. But the example mining method is not applicable to one-stage object detector, because this kind of detector belongs to the dense prediction architecture. Therefore Lin et al. [45] proposed focal loss to deal with the problem of data imbalance existing between various classes. Another very important issue is that it is difficult to express the relationship of the degree of association between different categories with the one-hot hard representation. This representation scheme is often used when executing labeling. The label smoothing proposed in [73] is to convert hard label into soft label for training, which can make the model more robust. In order to obtain a better soft label, Islam et al. [33] introduced the concept of knowledge distillation to design the label refinement network.
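The hard-to-soft conversion is a one-line transformation; a sketch assuming one-hot numpy labels and the common uniform-smoothing formulation (the exact scheme in [73] may differ):

```python
import numpy as np

def smooth_labels(onehot, eps=0.1):
    """Uniform label smoothing: each 1 becomes 1 - eps + eps/K, each 0 becomes eps/K."""
    k = onehot.shape[-1]                  # K = number of classes
    return onehot * (1.0 - eps) + eps / k
```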


The last bag of freebies is the objective function of Bounding Box (BBox) regression. The traditional object detector usually uses Mean Square Error (MSE) to directly perform regression on the center point coordinates and height and width of the BBox, i.e., {x_center, y_center, w, h}, or the upper left point and the lower right point, i.e., {x_top_left, y_top_left, x_bottom_right, y_bottom_right}. As for anchor-based method, it is to estimate the corresponding offset, for example {x_center_offset, y_center_offset, w_offset, h_offset} and {x_top_left_offset, y_top_left_offset, x_bottom_right_offset, y_bottom_right_offset}. However, to directly estimate the coordinate values of each point of the BBox is to treat these points as independent variables, but in fact does not consider the integrity of the object itself. In order to make this issue processed better, some researchers recently proposed IoU loss [90], which puts the coverage of predicted BBox area and ground truth BBox area into consideration. The IoU loss computing process will trigger the calculation of the four coordinate points of the BBox by executing IoU with the ground truth, and then connecting the generated results into a whole code. Because IoU is a scale invariant representation, it can solve the problem that when traditional methods calculate the l1 or l2 loss of {x, y, w, h}, the loss will increase with the scale. Recently, some researchers have continued to improve IoU loss. For example, GIoU loss [65] is to include the shape and orientation of object in addition to the coverage area. They proposed to find the smallest area BBox that can simultaneously cover the predicted BBox and ground truth BBox, and use this BBox as the denominator to replace the denominator originally used in IoU loss. As for DIoU loss [99], it additionally considers the distance of the center of an object, and CIoU loss [99], on the other hand simultaneously considers the overlapping area, the distance between center points, and the aspect ratio. CIoU can achieve better convergence speed and accuracy on the BBox regression problem.
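To fix notation, the following sketch computes IoU and GIoU for two axis-aligned boxes; the corresponding losses are then 1 − IoU and 1 − GIoU (the [x1, y1, x2, y2] box format is our assumption):

```python
import numpy as np

def iou_and_giou(a, b):
    """a, b: boxes as [x1, y1, x2, y2]. Returns (IoU, GIoU)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    iou = inter / union
    # smallest enclosing box C, used by GIoU as a penalty term
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    area_c = (cx2 - cx1) * (cy2 - cy1)
    return iou, iou - (area_c - union) / area_c
```

DIoU further subtracts a normalized center-point distance term, and CIoU adds an aspect-ratio consistency term on top of DIoU.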


 

2.3. Bag of specials

For those plugin modules and post-processing methods that only increase the inference cost by a small amount but can significantly improve the accuracy of object detection, we call them “bag of specials”. Generally speaking, these plugin modules are for enhancing certain attributes in a model, such as enlarging receptive field, introducing attention mechanism, or strengthening feature integration capability, etc., and post-processing is a method for screening model prediction results.


Common modules that can be used to enhance receptive field are SPP [25], ASPP [5], and RFB [47]. The SPP module originated from Spatial Pyramid Matching (SPM) [39], and SPM's original method was to split the feature map into several d × d equal blocks, where d can be {1, 2, 3, ...}, thus forming a spatial pyramid, and then extracting bag-of-word features. SPP integrates SPM into CNN and uses max-pooling operation instead of bag-of-word operation. Since the SPP module proposed by He et al. [25] will output a one-dimensional feature vector, it is infeasible to be applied in Fully Convolutional Network (FCN). Thus in the design of YOLOv3 [63], Redmon and Farhadi improve the SPP module to the concatenation of max-pooling outputs with kernel size k × k, where k = {1, 5, 9, 13}, and stride equals to 1. Under this design, a relatively large k × k max-pooling effectively increases the receptive field of the backbone feature. After adding the improved version of the SPP module, YOLOv3-608 upgrades AP50 by 2.7% on the MS COCO object detection task at the cost of 0.5% extra computation. The difference in operation between the ASPP [5] module and the improved SPP module is mainly from the original k × k kernel size max-pooling of stride equals to 1 to several 3 × 3 kernel size, dilated ratio equals to k, and stride equals to 1 in dilated convolution operation. The RFB module is to use several dilated convolutions of k × k kernel, dilated ratio equals to k, and stride equals to 1 to obtain a more comprehensive spatial coverage than ASPP. RFB [47] only costs 7% extra inference time to increase the AP50 of SSD on MS COCO by 5.7%.
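As we read the improved SPP design above, it can be sketched as the following PyTorch module; since a 1 × 1 max-pooling with stride 1 is the identity, it is represented by the input itself:

```python
import torch
import torch.nn as nn

class YoloSPP(nn.Module):
    """Concatenate the input with max-poolings of kernel 5, 9, 13 (stride 1).
    Padding k // 2 keeps the spatial size, so all branches can be concatenated."""
    def __init__(self, kernels=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernels)

    def forward(self, x):
        return torch.cat([x] + [p(x) for p in self.pools], dim=1)
```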


The attention module that is often used in object detection is mainly divided into channel-wise attention and point-wise attention; the representatives of these two attention models are Squeeze-and-Excitation (SE) [29] and Spatial Attention Module (SAM) [85], respectively. Although the SE module can improve ResNet50's top-1 accuracy by 1% on the ImageNet image classification task at the cost of only increasing the computational effort by 2%, on a GPU it will usually increase the inference time by about 10%, so it is more appropriate to be used in mobile devices. But for SAM, it only needs to pay 0.1% extra calculation and it can improve ResNet50-SE by 0.5% top-1 accuracy on the ImageNet image classification task. Best of all, it does not affect the speed of inference on the GPU at all.
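For reference, a standard SE block looks roughly like this (the reduction ratio 16 is the value commonly used in the SE paper, not something specified here):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: global average pool, bottleneck MLP, sigmoid gate."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        n, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))   # squeeze to (N, C), excite to channel gates
        return x * w.view(n, c, 1, 1)     # channel-wise re-weighting
```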


In terms of feature integration, the early practice is to use skip connection [51] or hyper-column [22] to integrate low-level physical features to high-level semantic features. Since multi-scale prediction methods such as FPN have become popular, many lightweight modules that integrate different feature pyramids have been proposed. The modules of this sort include SFAM [98], ASFF [48], and BiFPN [77]. The main idea of SFAM is to use SE module to execute channel-wise level re-weighting on multi-scale concatenated feature maps. As for ASFF, it uses softmax as point-wise level re-weighting and then adds feature maps of different scales. In BiFPN, the multi-input weighted residual connections are proposed to execute scale-wise level re-weighting, and then add feature maps of different scales.
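A much-simplified sketch of the ASFF-style point-wise re-weighting described above; real ASFF also resizes each level to a common resolution and compresses channels, which this toy version assumes has already been done:

```python
import torch
import torch.nn as nn

class SoftmaxFusion(nn.Module):
    """Predict one weight map per input level, softmax across levels, sum.
    Inputs are assumed already resized to a common shape (N, C, H, W)."""
    def __init__(self, channels, levels=3):
        super().__init__()
        self.weight_convs = nn.ModuleList(
            nn.Conv2d(channels, 1, kernel_size=1) for _ in range(levels))

    def forward(self, feats):                 # list of (N, C, H, W) tensors
        logits = torch.cat([conv(f) for conv, f in zip(self.weight_convs, feats)], dim=1)
        w = torch.softmax(logits, dim=1)      # (N, levels, H, W), point-wise weights
        return sum(w[:, i:i + 1] * f for i, f in enumerate(feats))
```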


In the research of deep learning, some people put their focus on searching for good activation function. A good activation function can make the gradient more efficiently propagated, and at the same time it will not cause too much extra computational cost. In 2010, Nair and Hinton [56] propose ReLU to substantially solve the gradient vanish problem which is frequently encountered in traditional tanh and sigmoid activation function. Subsequently, LReLU [54], PReLU [24], ReLU6 [28], Scaled Exponential Linear Unit (SELU) [35], Swish [59], hard-Swish [27], and Mish [55], etc., which are also used to solve the gradient vanish problem, have been proposed. The main purpose of LReLU and PReLU is to solve the problem that the gradient of ReLU is zero when the output is less than zero. As for ReLU6 and hard-Swish, they are specially designed for quantization networks. For self-normalizing a neural network, the SELU activation function is proposed to satisfy the goal. One thing to be noted is that both Swish and Mish are continuously differentiable activation function.
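Of the activations listed, Mish has a particularly compact definition, Mish(x) = x · tanh(softplus(x)); a PyTorch one-liner:

```python
import torch
import torch.nn.functional as F

def mish(x):
    """Mish(x) = x * tanh(softplus(x)); smooth and continuously differentiable."""
    return x * torch.tanh(F.softplus(x))
```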


The post-processing method commonly used in deep-learning-based object detection is NMS, which can be used to filter those BBoxes that badly predict the same object, and only retain the candidate BBoxes with higher response. The way NMS tries to improve is consistent with the method of optimizing an objective function. The original method proposed by NMS does not consider the context information, so Girshick et al. [19] added classification confidence score in R-CNN as a reference, and according to the order of confidence score, greedy NMS was performed in the order of high score to low score. As for soft NMS [1], it considers the problem that the occlusion of an object may cause the degradation of confidence score in greedy NMS with IoU score. The DIoU NMS [99] developers' way of thinking is to add the information of the center point distance to the BBox screening process on the basis of soft NMS. It is worth mentioning that, since none of the above post-processing methods directly refer to the captured image features, post-processing is no longer required in the subsequent development of an anchor-free method.
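A plain greedy NMS sketch for reference (numpy, boxes as [x1, y1, x2, y2]); DIoU-NMS keeps the same loop but subtracts a normalized center-distance term from the IoU before thresholding:

```python
import numpy as np

def greedy_nms(boxes, scores, iou_thresh=0.5):
    """Keep the highest-scoring box, drop boxes overlapping it above
    iou_thresh, repeat. boxes: (N, 4) array, scores: (N,) array."""
    order = scores.argsort()[::-1]      # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]   # discard suppressed boxes
    return keep
```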



3. Methodology

The basic aim is fast operating speed of the neural network in production systems and optimization for parallel computations, rather than the low computation volume theoretical indicator (BFLOP). We present two options of real-time neural networks:

• For GPU we use a small number of groups (1 - 8) in convolutional layers: CSPResNeXt50 / CSPDarknet53

• For VPU - we use grouped-convolution, but we refrain from using Squeeze-and-Excitation (SE) blocks - specifically this includes the following models: EfficientNet-lite / MixNet [76] / GhostNet [21] / MobileNetV3


3.1. Selection of architecture

 

Our objective is to find the optimal balance among the input network resolution, the convolutional layer number, the parameter number (filter_size² × filters × channels / groups), and the number of layer outputs (filters). For instance, our numerous studies demonstrate that the CSPResNext50 is considerably better compared to CSPDarknet53 in terms of object classification on the ILSVRC2012 (ImageNet) dataset [10]. However, conversely, the CSPDarknet53 is better compared to CSPResNext50 in terms of detecting objects on the MS COCO dataset [46].
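As a worked instance of the parameter formula (filter_size² × filters × channels / groups) with hypothetical layer sizes:

```python
def conv_params(filter_size, filters, channels, groups=1):
    """Weight count of a (grouped) convolution, ignoring biases; with groups,
    each filter only sees channels / groups input channels."""
    return filter_size ** 2 * filters * channels // groups

print(conv_params(3, 512, 256))             # 1,179,648 for a plain 3x3 conv
print(conv_params(3, 512, 256, groups=32))  # 36,864 with 32 groups
```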


The next objective is to select additional blocks for increasing the receptive field and the best method of parameter aggregation from different backbone levels for different detector levels: e.g. FPN, PAN, ASFF, BiFPN.


A reference model which is optimal for classification is not always optimal for a detector. In contrast to the classifier, the detector requires the following:

• Higher input network size (resolution) – for detecting multiple small-sized objects

• More layers – for a higher receptive field to cover the increased size of input network

• More parameters – for greater capacity of a model to detect multiple objects of different sizes in a single image


Hypothetically speaking, we can assume that a model with a larger receptive field size (with a larger number of convolutional layers 3 × 3) and a larger number of parameters should be selected as the backbone. Table 1 shows the information of CSPResNeXt50, CSPDarknet53, and EfficientNet B3. The CSPResNext50 contains only 16 convolutional layers 3 × 3, a 425 × 425 receptive field and 20.6M parameters, while CSPDarknet53 contains 29 convolutional layers 3 × 3, a 725 × 725 receptive field and 27.6M parameters. This theoretical justification, together with our numerous experiments, show that CSPDarknet53 neural network is the optimal model of the two as the backbone for a detector.


The influence of the receptive field with different sizes is summarized as follows:

• Up to the object size - allows viewing the entire object

• Up to the network size - allows viewing the context around the object

• Exceeding the network size - increases the number of connections between the image point and the final activation


We add the SPP block over the CSPDarknet53, since it significantly increases the receptive field, separates out the most significant context features and causes almost no reduction of the network operation speed. We use PANet as the method of parameter aggregation from different backbone levels for different detector levels, instead of the FPN used in YOLOv3.


Finally, we choose CSPDarknet53 backbone, SPP additional module, PANet path-aggregation neck, and YOLOv3 (anchor based) head as the architecture of YOLOv4.


In the future we plan to expand significantly the content of Bag of Freebies (BoF) for the detector, which theoretically can address some problems and increase the detector accuracy, and sequentially check the influence of each feature in an experimental fashion.


We do not use Cross-GPU Batch Normalization (CGBN or SyncBN) or expensive specialized devices. This allows anyone to reproduce our state-of-the-art outcomes on a conventional graphic processor e.g. GTX 1080Ti or RTX2080Ti.


3.2. Selection of BoF and BoS

For improving the object detection training, a CNN usually uses the following:

• Activations: ReLU, leaky-ReLU, parametric-ReLU, ReLU6, SELU, Swish, or Mish

• Bounding box regression loss: MSE, IoU, GIoU, CIoU, DIoU

• Data augmentation: CutOut, MixUp, CutMix

• Regularization method: DropOut, DropPath [36], Spatial DropOut [79], or DropBlock

• Normalization of the network activations by their mean and variance: Batch Normalization (BN) [32], Cross-GPU Batch Normalization (CGBN or SyncBN) [93], Filter Response Normalization (FRN) [70], or Cross-Iteration Batch Normalization (CBN) [89]

• Skip-connections: Residual connections, Weighted residual connections, Multi-input weighted residual connections, or Cross stage partial connections (CSP)


As for training activation function, since PReLU and SELU are more difficult to train, and ReLU6 is specifically designed for quantization network, we therefore remove the above activation functions from the candidate list. In the method of regularization, the people who published DropBlock have compared their method with other methods in detail, and their regularization method has won a lot. Therefore, we did not hesitate to choose DropBlock as our regularization method. As for the selection of normalization method, since we focus on a training strategy that uses only one GPU, syncBN is not considered.


3.3. Additional improvements

In order to make the designed detector more suitable for training on single GPU, we made additional design and improvement as follows:

• We introduce a new method of data augmentation Mosaic, and Self-Adversarial Training (SAT)

• We select optimal hyper-parameters while applying genetic algorithms

• We modify some existing methods to make our design suitable for efficient training and detection - modified SAM, modified PAN, and Cross mini-Batch Normalization (CmBN)


Mosaic represents a new data augmentation method that mixes 4 training images. Thus 4 different contexts are mixed, while CutMix mixes only 2 input images. This allows detection of objects outside their normal context. In addition, batch normalization calculates activation statistics from 4 different images on each layer. This significantly reduces the need for a large mini-batch size.
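A rough numpy sketch of the 4-image mosaic idea; it assumes each input image is at least out_size × out_size, and a real implementation would also resize the crops and remap each image's bounding boxes into the new canvas:

```python
import numpy as np

def mosaic(imgs, out_size=608):
    """Paste 4 images into the quadrants around a random split point."""
    assert len(imgs) == 4
    cx = np.random.randint(out_size // 4, 3 * out_size // 4)  # split column
    cy = np.random.randint(out_size // 4, 3 * out_size // 4)  # split row
    canvas = np.zeros((out_size, out_size, 3), dtype=imgs[0].dtype)
    regions = [(slice(0, cy), slice(0, cx)),                # top-left
               (slice(0, cy), slice(cx, out_size)),         # top-right
               (slice(cy, out_size), slice(0, cx)),         # bottom-left
               (slice(cy, out_size), slice(cx, out_size))]  # bottom-right
    for img, (rs, cs) in zip(imgs, regions):
        h, w = rs.stop - rs.start, cs.stop - cs.start
        canvas[rs, cs] = img[:h, :w]   # crude crop; real code resizes/crops properly
    return canvas
```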


Self-Adversarial Training (SAT) also represents a new data augmentation technique that operates in 2 forward backward stages. In the 1st stage the neural network alters the original image instead of the network weights. In this way the neural network executes an adversarial attack on itself, altering the original image to create the deception that there is no desired object on the image. In the 2nd stage, the neural network is trained to detect an object on this modified image in the normal way.
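The paper does not spell out the exact attack, so the following PyTorch sketch is only one plausible reading: a single FGSM-style step that alters the image, not the weights (stage 1), followed by an ordinary training step on the altered image (stage 2). `epsilon` and the sign-gradient attack are our assumptions:

```python
import torch

def sat_step(model, images, targets, loss_fn, epsilon=0.03):
    """Hypothetical sketch of one Self-Adversarial Training iteration."""
    images = images.clone().detach().requires_grad_(True)
    loss_fn(model(images), targets).backward()       # gradients w.r.t. the image
    with torch.no_grad():
        adv_images = images + epsilon * images.grad.sign()  # stage 1: perturb image
    model.zero_grad()                                # discard stage-1 weight grads
    loss = loss_fn(model(adv_images), targets)       # stage 2: train as usual
    loss.backward()                                  # caller applies optimizer.step()
    return loss
```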


[Figure 4]

CmBN represents a CBN modified version, as shown in Figure 4, defined as Cross mini-Batch Normalization(CmBN). This collects statistics only between mini-batches within a single batch.
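Read literally, the sentence above suggests something like the following toy normalization, where statistics are aggregated across the mini-batches of one batch instead of taken per mini-batch; the actual CmBN update (Figure 4) also compensates statistics across iterations, which this sketch ignores:

```python
import numpy as np

def cmbn_like_normalize(minibatches, eps=1e-5):
    """Toy illustration: normalize every mini-batch with mean/variance
    accumulated over all mini-batches inside the same batch."""
    whole = np.concatenate(minibatches, axis=0)    # whole batch, (N_total, C)
    mean, var = whole.mean(axis=0), whole.var(axis=0)
    return [(mb - mean) / np.sqrt(var + eps) for mb in minibatches]
```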


We modify SAM from spatial-wise attention to point-wise attention, and replace shortcut connection of PAN to concatenation, as shown in Figure 5 and Figure 6, respectively.


[Figure 5]

[Figure 6]

3.4. YOLOv4

In this section, we shall elaborate the details of YOLOv4.

YOLOv4 consists of:

    • Backbone: CSPDarknet53 [81]
    • Neck: SPP [25], PAN [49]
    • Head: YOLOv3 [63]

YOLOv4 uses:

    • Bag of Freebies (BoF) for backbone: CutMix and Mosaic data augmentation, DropBlock regularization, Class label smoothing
    • Bag of Specials (BoS) for backbone: Mish activation, Cross-stage partial connections (CSP), Multi-input weighted residual connections (MiWRC)
    • Bag of Freebies (BoF) for detector: CIoU-loss, CmBN, DropBlock regularization, Mosaic data augmentation, Self-Adversarial Training, Eliminate grid sensitivity, Using multiple anchors for a single ground truth, Cosine annealing scheduler [52], Optimal hyperparameters, Random training shapes
    • Bag of Specials (BoS) for detector: Mish activation, SPP-block, SAM-block, PAN path-aggregation block, DIoU-NMS


4. Experiments

We test the influence of different training improvement techniques on accuracy of the classifier on ImageNet(ILSVRC 2012 val) dataset, and then on the accuracy of the detector on MS COCO (test-dev 2017) dataset.


4.1. Experimental setup

In ImageNet image classification experiments, the default hyper-parameters are as follows: the training steps is 8,000,000; the batch size and the mini-batch size are 128 and 32, respectively; the polynomial decay learning rate scheduling strategy is adopted with initial learning rate 0.1; the warm-up steps is 1000; the momentum and weight decay are respectively set as 0.9 and 0.005. All of our BoS experiments use the same hyper-parameter as the default setting, and in the BoF experiments, we add an additional 50% training steps. In the BoF experiments, we verify MixUp, CutMix, Mosaic, blurring data augmentation, and label smoothing regularization methods. In the BoS experiments, we compared the effects of LReLU, Swish, and Mish activation function. All experiments are trained with a 1080 Ti or 2080 Ti GPU.


In MS COCO object detection experiments, the default hyper-parameters are as follows: the training steps is 500,500; the step decay learning rate scheduling strategy is adopted with initial learning rate 0.01, multiplied by a factor of 0.1 at the 400,000 steps and the 450,000 steps, respectively; the momentum and weight decay are respectively set as 0.9 and 0.0005. All architectures use a single GPU to execute multi-scale training in the batch size of 64 while mini-batch size is 8 or 4 depending on the architectures and GPU memory limitation. Except for using genetic algorithm for hyper-parameter search experiments, all other experiments use the default settings. Genetic algorithm used YOLOv3-SPP to train with GIoU loss and search 300 epochs for min-val 5k sets. We adopt searched learning rate 0.00261, momentum 0.949, IoU threshold for assigning ground truth 0.213, and loss normalizer 0.07 for genetic algorithm experiments. We have verified a large number of BoF, including grid sensitivity elimination, mosaic data augmentation, IoU threshold, genetic algorithm, class label smoothing, cross mini-batch normalization, self-adversarial training, cosine annealing scheduler, dynamic mini-batch size, DropBlock, Optimized Anchors, different kinds of IoU losses. We also conduct experiments on various BoS, including Mish, SPP, SAM, RFB, BiFPN, and Gaussian YOLO [8]. For all experiments, we only use one GPU for training, so techniques such as syncBN that optimizes multiple GPUs are not used.
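The step-decay schedule described above is easy to state as a function (milestones and factor taken from the text; everything else is generic):

```python
def coco_lr(step, base_lr=0.01, milestones=(400_000, 450_000), factor=0.1):
    """Initial LR 0.01, multiplied by 0.1 at 400k and again at 450k steps."""
    lr = base_lr
    for m in milestones:
        if step >= m:
            lr *= factor
    return lr

print(coco_lr(0), coco_lr(410_000), coco_lr(460_000))  # 0.01, 0.001, 0.0001 (approx.)
```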


4.2. Influence of different features on Classifier training

First, we study the influence of different features on classifier training; specifically, the influence of Class label smoothing, the influence of different data augmentation techniques, bilateral blurring, MixUp, CutMix and Mosaic, as shown in Figure 7, and the influence of different activations, such as Leaky-ReLU (by default), Swish, and Mish.


In our experiments, as illustrated in Table 2, the classifier’s accuracy is improved by introducing the features such as: CutMix and Mosaic data augmentation, Class label smoothing, and Mish activation. As a result, our BoF backbone (Bag of Freebies) for classifier training includes the following: CutMix and Mosaic data augmentation and Class label smoothing. In addition we use Mish activation as a complementary option, as shown in Table 2 and Table 3.

[Table 2]

[Table 3]


4.3. Influence of different features on Detector training

Further study concerns the influence of different Bag-of Freebies (BoF-detector) on the detector training accuracy, as shown in Table 4. We significantly expand the BoF list through studying different features that increase the detector accuracy without affecting FPS:

[Table 4]


• S: Eliminate grid sensitivity. The equation b_x = σ(t_x) + c_x, b_y = σ(t_y) + c_y, where c_x and c_y are always whole numbers, is used in YOLOv3 for evaluating the object coordinates; therefore, extremely high absolute t_x values are required for the b_x value approaching the c_x or c_x + 1 values. We solve this problem through multiplying the sigmoid by a factor exceeding 1.0, so eliminating the effect of grid on which the object is undetectable (see the decode sketch after this list)

• M: Mosaic data augmentation - using the 4-image mosaic during training instead of a single image

• IT: IoU threshold - using multiple anchors for a single ground truth if IoU(truth, anchor) > IoU threshold

• GA: Genetic algorithms - using genetic algorithms for selecting the optimal hyperparameters during network training on the first 10% of time periods

• LS: Class label smoothing - using class label smoothing for sigmoid activation

• CBN: CmBN - using Cross mini-Batch Normalization for collecting statistics inside the entire batch, instead of collecting statistics inside a single mini-batch

• CA: Cosine annealing scheduler - altering the learning rate during sinusoid training

• DM: Dynamic mini-batch size - automatic increase of mini-batch size during small resolution training by using Random training shapes

• OA: Optimized Anchors - using the optimized anchors for training with the 512x512 network resolution

• GIoU, CIoU, DIoU, MSE - using different loss algorithms for bounding box regression
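As referenced in the S item above, a sketch of the scaled-sigmoid decode; the factor 1.1 and the re-centering term are illustrative choices (the text only says "a factor exceeding 1.0"):

```python
import torch

def decode_center(t, c, scale=1.1):
    """b = scale * sigmoid(t) - (scale - 1) / 2 + c: the scaled sigmoid spans
    slightly past [0, 1], so b can reach the cell borders c and c + 1 without
    extreme values of t."""
    return scale * torch.sigmoid(t) - (scale - 1.0) / 2.0 + c

# example: even a moderate t now reaches the cell border
t = torch.tensor([-4.0, 0.0, 4.0])
print(decode_center(t, c=0.0))   # ~[-0.03, 0.50, 1.03]
```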


Further study concerns the influence of different Bag-of-Specials (BoS-detector) on the detector training accuracy, including PAN, RFB, SAM, Gaussian YOLO (G), and ASFF, as shown in Table 5. In our experiments, the detector gets best performance when using SPP, PAN, and SAM.

[Table 5]


4.4. Influence of different backbones and pretrained weightings on Detector training

Further on we study the influence of different backbone models on the detector accuracy, as shown in Table 6. We notice that the model characterized with the best classification accuracy is not always the best in terms of the detector accuracy.

[Table 6]


Second, using BoF and Mish for the CSPResNeXt50 classifier training increases its classification accuracy, but further application of these pre-trained weightings for detector training reduces the detector accuracy. However, using BoF and Mish for the CSPDarknet53 classifier training increases the accuracy of both the classifier and the detector which uses this classifier pre-trained weightings. The net result is that backbone CSPDarknet53 is more suitable for the detector than for CSPResNeXt50.


We observe that the CSPDarknet53 model demonstrates a greater ability to increase the detector accuracy owing to various improvements.


4.5. Influence of different mini-batch size on Detector training

Finally, we analyze the results obtained with models trained with different mini-batch sizes, and the results are shown in Table 7. From the results shown in Table 7, we found that after adding BoF and BoS training strategies, the mini-batch size has almost no effect on the detector’s performance. This result shows that after the introduction of BoF and BoS, it is no longer necessary to use expensive GPUs for training. In other words, anyone can use only a conventional GPU to train an excellent detector.


[Table 7]

[Figure 8]

Figure 8: Comparison of the speed and accuracy of different object detectors. (Some articles stated the FPS of their detectors for only one of the GPUs: Maxwell/Pascal/Volta)

5. Results

Comparison of the results obtained with other state-of-the-art object detectors is shown in Figure 8. Our YOLOv4 is located on the Pareto optimality curve and is superior to the fastest and most accurate detectors in terms of both speed and accuracy.


Since different methods use GPUs of different architectures for inference time verification, we operate YOLOv4 on commonly adopted GPUs of Maxwell, Pascal, and Volta architectures, and compare them with other state-of-the-art methods. Table 8 lists the frame rate comparison results of using Maxwell GPU, and it can be GTX Titan X (Maxwell) or Tesla M40 GPU. Table 9 lists the frame rate comparison results of using Pascal GPU, and it can be Titan X (Pascal), Titan Xp, GTX 1080 Ti, or Tesla P100 GPU. As for Table 10, it lists the frame rate comparison results of using Volta GPU, and it can be Titan Volta or Tesla V100 GPU.


6. Conclusions

We offer a state-of-the-art detector which is faster (FPS) and more accurate (MS COCO AP50...95 and AP50) than all available alternative detectors. The detector described can be trained and used on a conventional GPU with 8–16 GB VRAM, which makes its broad use possible. The original concept of one-stage anchor-based detectors has proven its viability. We have verified a large number of features, and selected for use such of them for improving the accuracy of both the classifier and the detector. These features can be used as best-practice for future studies and developments.


7. Acknowledgements

The authors wish to thank Glenn Jocher for the ideas of Mosaic data augmentation, the selection of hyper-parameters by using genetic algorithms and solving the grid sensitivity problem https://github.com/ultralytics/yolov3.

