Translation of "YOLOv4: Optimal Speed and Accuracy of Object Detection"

YOLOv4 has been out for quite a while now. Today I'm reading through the paper to see what it improves and innovates compared to YOLOv3, mainly as a learning exercise. Without further ado, let's begin.


Abstract


There are a huge number of features which are said to improve Convolutional Neural Network (CNN) accuracy. Practical testing of combinations of such features on large datasets, and theoretical justification of the result, is required. Some features operate on certain models exclusively and for certain problems exclusively, or only for small-scale datasets; while some features, such as batch-normalization and residual-connections, are applicable to the majority of models, tasks, and datasets. We assume that such universal features include Weighted-Residual-Connections (WRC), Cross-Stage-Partial-connections (CSP), Cross mini-Batch Normalization (CmBN), Self-adversarial-training (SAT) and Mish-activation. We use new features: WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, DropBlock regularization, and CIoU loss, and combine some of them to achieve state-of-the-art results: 43.5% AP (65.7% AP50) for the MS COCO dataset at a real-time speed of ∼65 FPS on Tesla V100. Source code is at https://github.com/AlexeyAB/darknet.


1. Introduction


The majority of CNN-based object detectors are largely applicable only for recommendation systems. For example, searching for free parking spaces via urban video cameras is executed by slow accurate models, whereas car collision warning is related to fast inaccurate models. Improving the real-time object detector accuracy enables using them not only for hint-generating recommendation systems, but also for stand-alone process management and human input reduction. Real-time object detector operation on conventional Graphics Processing Units (GPU) allows their mass usage at an affordable price. The most accurate modern neural networks do not operate in real time and require a large number of GPUs for training with a large mini-batch size. We address such problems through creating a CNN that operates in real time on a conventional GPU, and for which training requires only one conventional GPU.

[Figure 1: comparison of YOLOv4 with other state-of-the-art object detectors in speed and accuracy]

The main goal of this work is to design an object detector with fast operating speed in production systems, optimized for parallel computation, rather than for a low theoretical computation volume indicator (BFLOP). We hope that the designed object detector can be easily trained and used. For example, anyone who uses a conventional GPU to train and test can achieve real-time, high-quality, and convincing object detection results, as the YOLOv4 results shown in Figure 1. Our contributions are summarized as follows:
 
1. We develop an efficient and powerful object detection model. It allows anyone to use a 1080 Ti or 2080 Ti GPU to train a super fast and accurate object detector.

2. We verify the influence of state-of-the-art Bag-of-Freebies and Bag-of-Specials methods of object detection during the detector training.

3. We modify state-of-the-art methods and make them more efficient and suitable for single GPU training, including CBN [89], PAN [49], SAM [85], etc.


2. Related work

2.1. Object detection models


A modern detector is usually composed of two parts, a backbone which is pre-trained on ImageNet and a head which is used to predict classes and bounding boxes of objects. For those detectors running on GPU platform, their backbone could be VGG [68], ResNet [26], ResNeXt [86], or DenseNet [30]. For those detectors running on CPU platform, their backbone could be SqueezeNet [31], MobileNet [28, 66, 27, 74], or ShuffleNet [97, 53]. As to the head part, it is usually categorized into two kinds, i.e., one-stage object detector and two-stage object detector. The most representative two-stage object detector is the R-CNN [19] series, including fast R-CNN [18], faster R-CNN [64], R-FCN [9], and Libra R-CNN [58]. It is also possible to make a two-stage object detector an anchor-free object detector, such as RepPoints [87]. As for one-stage object detectors, the most representative models are YOLO [61, 62, 63], SSD [50], and RetinaNet [45]. In recent years, anchor-free one-stage object detectors have been developed, such as CenterNet [13], CornerNet [37, 38], and FCOS [78]. Object detectors developed in recent years often insert some layers between backbone and head, and these layers are usually used to collect feature maps from different stages. We can call it the neck of an object detector. Usually, a neck is composed of several bottom-up paths and several top-down paths. Networks equipped with this mechanism include Feature Pyramid Network (FPN) [44], Path Aggregation Network (PAN) [49], BiFPN [77], and NAS-FPN [17]. In addition to the above models, some researchers put their emphasis on directly building a new backbone (DetNet [43], DetNAS [7]) or a new whole model (SpineNet [12], HitDetector [20]) for object detection.
 

To sum up, an ordinary object detector is composed of several parts:

• Input: Image, Patches, Image Pyramid
• Backbones: VGG16 [68], ResNet-50 [26], SpineNet [12], EfficientNet-B0/B7 [75], CSPResNeXt50 [81], CSPDarknet53 [81]
• Neck:
  • Additional blocks: SPP [25], ASPP [5], RFB [47], SAM [85]
  • Path-aggregation blocks: FPN [44], PAN [49], NAS-FPN [17], Fully-connected FPN, BiFPN [77], ASFF [48], SFAM [98]
• Heads:
  • Dense Prediction (one-stage):
    • RPN [64], SSD [50], YOLO [61], RetinaNet [45] (anchor based)
    • CornerNet [37], CenterNet [13], MatrixNet [60], FCOS [78] (anchor free)
  • Sparse Prediction (two-stage):
    • Faster R-CNN [64], R-FCN [9], Mask R-CNN [23] (anchor based)
    • RepPoints [87] (anchor free)


2.2. Bag of freebies


Usually, a conventional object detector is trained offline. Therefore, researchers always like to take this advantage and develop better training methods which can make the object detector receive better accuracy without increasing the inference cost. We call these methods that only change the training strategy or only increase the training cost "bag of freebies." What is often adopted by object detection methods and meets the definition of bag of freebies is data augmentation. The purpose of data augmentation is to increase the variability of the input images, so that the designed object detection model has higher robustness to the images obtained from different environments. For example, photometric distortions and geometric distortions are two commonly used data augmentation methods and they definitely benefit the object detection task. In dealing with photometric distortion, we adjust the brightness, contrast, hue, saturation, and noise of an image. For geometric distortion, we add random scaling, cropping, flipping, and rotating.
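These two distortion families map directly onto standard augmentation primitives. Below is a minimal torchvision sketch with illustrative parameter values (not the paper's settings); noise injection is omitted, and for detection the geometric transforms would also have to be applied to the box labels:

```python
import torchvision.transforms as T

augment = T.Compose([
    # Photometric distortions: brightness, contrast, saturation, hue
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    # Geometric distortions: random scaling/cropping, flipping, rotating
    T.RandomResizedCrop(size=416, scale=(0.5, 1.0)),
    T.RandomHorizontalFlip(p=0.5),
    T.RandomRotation(degrees=10),
    T.ToTensor(),
])
```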

The data augmentation methods mentioned above are all pixel-wise adjustments, and all original pixel information in the adjusted area is retained. In addition, some researchers engaged in data augmentation put their emphasis on simulating object occlusion issues. They have achieved good results in image classification and object detection. For example, random erase [100] and CutOut [11] can randomly select a rectangle region in an image and fill in a random or complementary value of zero. As for hide-and-seek [69] and grid mask [6], they randomly or evenly select multiple rectangle regions in an image and replace them with all zeros. If similar concepts are applied to feature maps, there are the DropOut [71], DropConnect [80], and DropBlock [16] methods. In addition, some researchers have proposed methods of using multiple images together to perform data augmentation. For example, MixUp [92] uses two images to multiply and superimpose with different coefficient ratios, and then adjusts the label with these superimposed ratios. As for CutMix [91], it covers the cropped image onto a rectangle region of other images, and adjusts the label according to the size of the mixed area. In addition to the above mentioned methods, style transfer GAN [15] is also used for data augmentation, and such usage can effectively reduce the texture bias learned by CNN.
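A minimal NumPy sketch of the MixUp idea described above; sampling the ratio from Beta(α, α) follows the common MixUp implementation, and α = 0.2 is an illustrative value:

```python
import numpy as np

def mixup(img_a, lab_a, img_b, lab_b, alpha=0.2):
    """Blend two images and their one-hot labels with a Beta-sampled ratio."""
    lam = np.random.beta(alpha, alpha)
    image = lam * img_a + (1.0 - lam) * img_b   # pixel-wise superposition
    label = lam * lab_a + (1.0 - lam) * lab_b   # labels mixed by the same ratio
    return image, label
```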

Different from the various approaches proposed above, some other bag of freebies methods are dedicated to solving the problem that the semantic distribution in the dataset may have bias. In dealing with the problem of semantic distribution bias, a very important issue is that there is a problem of data imbalance between different classes, and this problem is often solved by hard negative example mining [72] or online hard example mining [67] in two-stage object detectors. But the example mining method is not applicable to one-stage object detectors, because this kind of detector belongs to the dense prediction architecture. Therefore Lin et al. [45] proposed focal loss to deal with the problem of data imbalance existing between various classes. Another very important issue is that it is difficult to express the relationship of the degree of association between different categories with the one-hot hard representation. This representation scheme is often used when executing labeling. The label smoothing proposed in [73] is to convert a hard label into a soft label for training, which can make the model more robust. In order to obtain a better soft label, Islam et al. [33] introduced the concept of knowledge distillation to design the label refinement network.
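As a concrete illustration of converting hard labels to soft labels, here is a minimal NumPy sketch of label smoothing (eps = 0.1 is a typical, not prescribed, value):

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """The true class keeps 1 - eps of the probability mass;
    eps is spread uniformly over all classes."""
    n_classes = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / n_classes
```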

The last bag of freebies is the objective function of Bounding Box (BBox) regression. The traditional object detector usually uses Mean Square Error (MSE) to directly perform regression on the center point coordinates and height and width of the BBox, i.e., {x_center, y_center, w, h}, or the upper left point and the lower right point, i.e., {x_top_left, y_top_left, x_bottom_right, y_bottom_right}. As for the anchor-based method, it is to estimate the corresponding offset, for example {x_center_offset, y_center_offset, w_offset, h_offset} and {x_top_left_offset, y_top_left_offset, x_bottom_right_offset, y_bottom_right_offset}. However, to directly estimate the coordinate values of each point of the BBox is to treat these points as independent variables, which in fact does not consider the integrity of the object itself. In order to make this issue processed better, some researchers recently proposed IoU loss [90], which puts the coverage of the predicted BBox area and the ground truth BBox area into consideration. The IoU loss computing process will trigger the calculation of the four coordinate points of the BBox by executing IoU with the ground truth, and then connecting the generated results into a whole code. Because IoU is a scale invariant representation, it can solve the problem that when traditional methods calculate the l1 or l2 loss of {x, y, w, h}, the loss will increase with the scale. Recently, some researchers have continued to improve IoU loss. For example, GIoU loss [65] includes the shape and orientation of the object in addition to the coverage area. They proposed to find the smallest area BBox that can simultaneously cover the predicted BBox and the ground truth BBox, and use this BBox as the denominator to replace the denominator originally used in IoU loss. As for DIoU loss [99], it additionally considers the distance of the center of an object, while CIoU loss [99] simultaneously considers the overlapping area, the distance between center points, and the aspect ratio. CIoU can achieve better convergence speed and accuracy on the BBox regression problem.
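The GIoU description above translates directly into code. A minimal NumPy sketch for axis-aligned boxes in (x1, y1, x2, y2) form, where the training loss would be 1 - GIoU:

```python
import numpy as np

def giou(box_p, box_g):
    """GIoU for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection
    ix1, iy1 = np.maximum(box_p[:2], box_g[:2])
    ix2, iy2 = np.minimum(box_p[2:], box_g[2:])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # Union
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    union = area_p + area_g - inter
    iou = inter / union
    # Smallest enclosing box C (the term that replaces the IoU denominator)
    cx1, cy1 = np.minimum(box_p[:2], box_g[:2])
    cx2, cy2 = np.maximum(box_p[2:], box_g[2:])
    c_area = (cx2 - cx1) * (cy2 - cy1)
    return iou - (c_area - union) / c_area
```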


2.3. Bag of specials


For those plugin modules and post-processing methods that only increase the inference cost by a small amount but can significantly improve the accuracy of object detection, we call them "bag of specials". Generally speaking, these plugin modules are for enhancing certain attributes in a model, such as enlarging the receptive field, introducing an attention mechanism, or strengthening feature integration capability, etc., and post-processing is a method for screening model prediction results.

Common modules that can be used to enhance the receptive field are SPP [25], ASPP [5], and RFB [47]. The SPP module originated from Spatial Pyramid Matching (SPM) [39], and SPM's original method was to split the feature map into several d × d equal blocks, where d can be {1, 2, 3, ...}, thus forming a spatial pyramid, and then extracting bag-of-words features. SPP integrates SPM into CNN and uses a max-pooling operation instead of the bag-of-words operation. Since the SPP module proposed by He et al. [25] outputs a one-dimensional feature vector, it is infeasible to apply in a Fully Convolutional Network (FCN). Thus in the design of YOLOv3 [63], Redmon and Farhadi improved the SPP module to the concatenation of max-pooling outputs with kernel size k × k, where k = {1, 5, 9, 13}, and stride equal to 1. Under this design, a relatively large k × k max-pooling effectively increases the receptive field of the backbone feature. After adding the improved version of the SPP module, YOLOv3-608 upgrades AP50 by 2.7% on the MS COCO object detection task at the cost of 0.5% extra computation. The difference in operation between the ASPP [5] module and the improved SPP module is mainly from the original k × k kernel size, max-pooling of stride equal to 1, to several 3 × 3 kernel sizes, dilated ratio equal to k, and stride equal to 1 in the dilated convolution operation. The RFB module uses several dilated convolutions of k × k kernel, dilated ratio equal to k, and stride equal to 1 to obtain a more comprehensive spatial coverage than ASPP. RFB [47] only costs 7% extra inference time to increase the AP50 of SSD on MS COCO by 5.7%.
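To make the k = {1, 5, 9, 13} design concrete, here is a minimal PyTorch sketch of such an SPP block (class and parameter names are mine):

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """YOLOv3-style SPP: concatenate max-poolings with k = {1, 5, 9, 13},
    stride 1, and padding k//2 so the spatial size is preserved.
    Note that k = 1 with stride 1 is the identity, i.e. the original
    feature map passes through unchanged."""
    def __init__(self, kernel_sizes=(1, 5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
            for k in kernel_sizes)

    def forward(self, x):
        # Output channels = in_channels * len(kernel_sizes)
        return torch.cat([p(x) for p in self.pools], dim=1)
```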

The attention module that is often used in object detection is mainly divided into channel-wise attention and point-wise attention, and the representatives of these two attention models are Squeeze-and-Excitation (SE) [29] and Spatial Attention Module (SAM) [85], respectively. Although the SE module can improve the power of ResNet50 in the ImageNet image classification task by 1% top-1 accuracy at the cost of only increasing the computational effort by 2%, on a GPU it will usually increase the inference time by about 10%, so it is more appropriate to be used in mobile devices. But for SAM, it only needs to pay 0.1% extra calculation and it can improve ResNet50-SE by 0.5% top-1 accuracy on the ImageNet image classification task. Best of all, it does not affect the speed of inference on the GPU at all.

In terms of feature integration, the early practice is to use skip connection [51] or hyper-column [22] to integrate low-level physical features into high-level semantic features. Since multi-scale prediction methods such as FPN have become popular, many lightweight modules that integrate different feature pyramids have been proposed. The modules of this sort include SFAM [98], ASFF [48], and BiFPN [77]. The main idea of SFAM is to use SE modules to execute channel-wise level re-weighting on multi-scale concatenated feature maps. As for ASFF, it uses softmax as point-wise level re-weighting and then adds feature maps of different scales. In BiFPN, the multi-input weighted residual connections are proposed to execute scale-wise level re-weighting, and then add feature maps of different scales.
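For concreteness, a minimal PyTorch sketch of an SE block as described above (the reduction ratio of 16 is the value commonly used with SE, not something this paper specifies):

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: channel-wise re-weighting via a small
    bottleneck MLP on globally pooled features."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)              # squeeze: B x C x 1 x 1
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid())                                # per-channel weights in (0, 1)

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                     # channel-wise attention
```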

In the research of deep learning, some people put their focus on searching for good activation functions. A good activation function can make the gradient more efficiently propagated, and at the same time it will not cause too much extra computational cost. In 2010, Nair and Hinton [56] proposed ReLU to substantially solve the gradient vanishing problem which is frequently encountered in the traditional tanh and sigmoid activation functions. Subsequently, LReLU [54], PReLU [24], ReLU6 [28], Scaled Exponential Linear Unit (SELU) [35], Swish [59], hard-Swish [27], and Mish [55], etc., which are also used to solve the gradient vanishing problem, have been proposed. The main purpose of LReLU and PReLU is to solve the problem that the gradient of ReLU is zero when the output is less than zero. As for ReLU6 and hard-Swish, they are specially designed for quantization networks. For self-normalizing a neural network, the SELU activation function is proposed to satisfy that goal. One thing to be noted is that both Swish and Mish are continuously differentiable activation functions.
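Mish has a simple closed form, x · tanh(softplus(x)), which is what makes it smooth and continuously differentiable; a one-line PyTorch sketch:

```python
import torch
import torch.nn.functional as F

def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    return x * torch.tanh(F.softplus(x))
```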

The post-processing method commonly used in deep-learning-based object detection is NMS, which can be used to filter those BBoxes that badly predict the same object, and only retain the candidate BBoxes with higher response. The way NMS tries to improve is consistent with the method of optimizing an objective function. The original method proposed by NMS does not consider the context information, so Girshick et al. [19] added the classification confidence score in R-CNN as a reference, and according to the order of confidence scores, greedy NMS was performed in the order of high score to low score. As for soft NMS [1], it considers the problem that the occlusion of an object may cause the degradation of the confidence score in greedy NMS with an IoU score. The DIoU NMS [99] developers' way of thinking is to add the information of the center point distance to the BBox screening process on the basis of soft NMS. It is worth mentioning that, since none of the above post-processing methods directly refer to the captured image features, post-processing is no longer required in the subsequent development of anchor-free methods.
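A minimal NumPy sketch of the DIoU-NMS idea, where the usual IoU suppression criterion is replaced by IoU minus a normalized center-distance penalty (function name and threshold value are illustrative):

```python
import numpy as np

def diou_nms(boxes, scores, threshold=0.5):
    """Greedy NMS with DIoU = IoU - d2/c2 as the suppression criterion:
    d2 is the squared center distance, c2 the squared diagonal of the
    smallest box enclosing both candidates. Boxes are (x1, y1, x2, y2)."""
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    centers = (boxes[:, :2] + boxes[:, 2:]) / 2
    order = scores.argsort()[::-1]            # indices, highest score first
    keep = []
    while order.size > 0:
        i, rest = order[0], order[1:]
        keep.append(i)
        # IoU of the winner against the remaining candidates
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter)
        # Center-distance penalty
        d2 = ((centers[i] - centers[rest]) ** 2).sum(axis=1)
        ex1y1 = np.minimum(boxes[i, :2], boxes[rest, :2])
        ex2y2 = np.maximum(boxes[i, 2:], boxes[rest, 2:])
        c2 = ((ex2y2 - ex1y1) ** 2).sum(axis=1)
        # Keep candidates whose DIoU with the winner is below the threshold
        order = rest[iou - d2 / c2 <= threshold]
    return keep
```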


3. Methodology


The basic aim is fast operating speed of a neural network in production systems and optimization for parallel computations, rather than the low computation volume theoretical indicator (BFLOP). We present two options of real-time neural networks:

• For GPU we use a small number of groups (1 - 8) in convolutional layers: CSPResNeXt50 / CSPDarknet53
• For VPU we use grouped convolution, but we refrain from using Squeeze-and-Excitation (SE) blocks; specifically this includes the following models: EfficientNet-lite / MixNet [76] / GhostNet [21] / MobileNetV3


3.1. Selection of architecture


Our objective is to find the optimal balance among the input network resolution, the convolutional layer number, the parameter number (filter_size² × filters × channels / groups), and the number of layer outputs (filters). For instance, our numerous studies demonstrate that the CSPResNeXt50 is considerably better compared to CSPDarknet53 in terms of object classification on the ILSVRC2012 (ImageNet) dataset [10]. However, conversely, the CSPDarknet53 is better compared to CSPResNeXt50 in terms of detecting objects on the MS COCO dataset [46].

The next objective is to select additional blocks for increasing the receptive field and the best method of parameter aggregation from different backbone levels for different detector levels: e.g. FPN, PAN, ASFF, BiFPN.

A reference model which is optimal for classification is not always optimal for a detector. In contrast to the classifier, the detector requires the following:

• Higher input network size (resolution) – for detecting multiple small-sized objects
• More layers – for a higher receptive field to cover the increased size of the input network
• More parameters – for greater capacity of a model to detect multiple objects of different sizes in a single image

Hypothetically speaking, we can assume that a model with a larger receptive field size (with a larger number of 3 × 3 convolutional layers) and a larger number of parameters should be selected as the backbone. Table 1 shows the information of CSPResNeXt50, CSPDarknet53, and EfficientNet-B3. The CSPResNeXt50 contains only 16 convolutional layers 3 × 3, a 425 × 425 receptive field and 20.6 M parameters, while CSPDarknet53 contains 29 convolutional layers 3 × 3, a 725 × 725 receptive field and 27.6 M parameters. This theoretical justification, together with our numerous experiments, shows that the CSPDarknet53 neural network is the optimal model of the two as the backbone for a detector.

The influence of the receptive field with different sizes is summarized as follows:

• Up to the object size – allows viewing the entire object
• Up to the network size – allows viewing the context around the object
• Exceeding the network size – increases the number of connections between the image point and the final activation

We add the SPP block over the CSPDarknet53, since it significantly increases the receptive field, separates out the most significant context features and causes almost no reduction of the network operation speed. We use PANet as the method of parameter aggregation from different backbone levels for different detector levels, instead of the FPN used in YOLOv3.

Finally, we choose the CSPDarknet53 backbone, SPP additional module, PANet path-aggregation neck, and YOLOv3 (anchor based) head as the architecture of YOLOv4.

In the future we plan to expand significantly the content of Bag of Freebies (BoF) for the detector, which theoretically can address some problems and increase the detector accuracy, and sequentially check the influence of each feature in an experimental fashion.

We do not use Cross-GPU Batch Normalization (CGBN or SyncBN) or expensive specialized devices. This allows anyone to reproduce our state-of-the-art outcomes on a conventional graphics processor, e.g. GTX 1080 Ti or RTX 2080 Ti.


3.2. Selection of BoF and BoS


For improving the object detection training, a CNN usually uses the following:

• Activations: ReLU, leaky-ReLU, parametric-ReLU, ReLU6, SELU, Swish, or Mish
• Bounding box regression loss: MSE, IoU, GIoU, CIoU, DIoU
• Data augmentation: CutOut, MixUp, CutMix
• Regularization method: DropOut, DropPath [36], Spatial DropOut [79], or DropBlock
• Normalization of the network activations by their mean and variance: Batch Normalization (BN) [32], Cross-GPU Batch Normalization (CGBN or SyncBN) [93], Filter Response Normalization (FRN) [70], or Cross-Iteration Batch Normalization (CBN) [89]
• Skip-connections: Residual connections, Weighted residual connections, Multi-input weighted residual connections, or Cross stage partial connections (CSP)

As for the training activation function, since PReLU and SELU are more difficult to train, and ReLU6 is specifically designed for quantization networks, we therefore remove the above activation functions from the candidate list. In the method of regularization, the people who published DropBlock have compared their method with other methods in detail, and their regularization method has won a lot. Therefore, we did not hesitate to choose DropBlock as our regularization method. As for the selection of normalization method, since we focus on a training strategy that uses only one GPU, SyncBN is not considered.
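Since DropBlock is the regularizer chosen here, a minimal PyTorch sketch of the idea may be useful (parameter values are illustrative, not the paper's):

```python
import torch
import torch.nn.functional as F

def drop_block(x, block_size=7, drop_prob=0.1, training=True):
    """DropBlock on a B x C x H x W feature map: zero out contiguous
    block_size x block_size regions instead of independent units,
    then rescale the surviving activations."""
    if not training or drop_prob == 0.0:
        return x
    _, _, h, w = x.shape
    # Seed probability chosen so the expected dropped fraction ~ drop_prob
    gamma = (drop_prob / block_size ** 2) * (h * w) \
            / ((h - block_size + 1) * (w - block_size + 1))
    seeds = (torch.rand_like(x) < gamma).float()
    # Grow each seed into a block_size x block_size zero region
    block_mask = 1.0 - F.max_pool2d(seeds, kernel_size=block_size,
                                    stride=1, padding=block_size // 2)
    return x * block_mask * block_mask.numel() / block_mask.sum()
```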


3.3. Additional improvements


In order to make the designed detector more suitable for training on a single GPU, we made additional design and improvements as follows:

• We introduce a new method of data augmentation, Mosaic, and Self-Adversarial Training (SAT)
• We select optimal hyper-parameters while applying genetic algorithms
• We modify some existing methods to make our design suitable for efficient training and detection – modified SAM, modified PAN, and Cross mini-Batch Normalization (CmBN)

Mosaic represents a new data augmentation method that mixes 4 training images. Thus 4 different contexts are mixed, while CutMix mixes only 2 input images. This allows detection of objects outside their normal context. In addition, batch normalization calculates activation statistics from 4 different images on each layer. This significantly reduces the need for a large mini-batch size.
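A minimal NumPy sketch of the 4-image tiling idea behind Mosaic; the canvas size, the random-center range, and the way labels are handled are simplified assumptions (a real pipeline also shifts and clips the box labels onto the canvas):

```python
import numpy as np

def mosaic(images, size=608):
    """Tile 4 training images into one size x size canvas split at a random
    center point. Assumes each input image is at least size x size."""
    cx, cy = np.random.randint(size // 4, 3 * size // 4, size=2)
    canvas = np.zeros((size, size, 3), dtype=images[0].dtype)
    regions = [(0, cy, 0, cx),        # top-left quadrant
               (0, cy, cx, size),     # top-right
               (cy, size, 0, cx),     # bottom-left
               (cy, size, cx, size)]  # bottom-right
    for img, (y1, y2, x1, x2) in zip(images, regions):
        canvas[y1:y2, x1:x2] = img[:y2 - y1, :x2 - x1]
    return canvas
```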


Self-Adversarial Training (SAT) also represents a new data augmentation technique that operates in 2 forward-backward stages. In the 1st stage the neural network alters the original image instead of the network weights. In this way the neural network executes an adversarial attack on itself, altering the original image to create the deception that there is no desired object in the image. In the 2nd stage, the neural network is trained to detect an object on this modified image in the normal way.
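The paper does not spell out how the image is altered in the first stage; the following PyTorch sketch assumes an FGSM-style perturbation purely for illustration, with eps an assumed perturbation size:

```python
import torch

def sat_step(model, loss_fn, image, target, eps=0.05):
    """Sketch of SAT's two forward-backward stages."""
    # Stage 1: the network alters the image, not its weights
    image = image.clone().requires_grad_(True)
    loss_fn(model(image), target).backward()
    with torch.no_grad():
        adv_image = image + eps * image.grad.sign()  # hide objects from itself
    model.zero_grad()                                # discard stage-1 weight grads
    # Stage 2: train on the altered image in the normal way
    loss = loss_fn(model(adv_image), target)
    loss.backward()
    return loss
```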


CmBN represents a modified version of CBN, as shown in Figure 4, defined as Cross mini-Batch Normalization (CmBN). It collects statistics only between mini-batches within a single batch.

We modify SAM from spatial-wise attention to point-wise attention, and replace the shortcut connection of PAN with concatenation, as shown in Figure 5 and Figure 6, respectively.
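A minimal PyTorch sketch of the modified, point-wise SAM as described: a convolution followed by a sigmoid produces a per-position attention map that multiplies the input feature map (class name is mine):

```python
import torch.nn as nn

class PointwiseSAM(nn.Module):
    """Modified SAM: point-wise attention without the pooling step
    of the original spatial attention module."""
    def __init__(self, channels):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid())

    def forward(self, x):
        return x * self.attn(x)   # per-position (point-wise) re-weighting
```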



3.4. YOLOv4


In this section, we shall elaborate the details of YOLOv4.

YOLOv4 consists of:

• Backbone: CSPDarknet53 [81]
• Neck: SPP [25], PAN [49]
• Head: YOLOv3 [63]

YOLOv4 uses:

• Bag of Freebies (BoF) for backbone: CutMix and Mosaic data augmentation, DropBlock regularization, Class label smoothing
• Bag of Specials (BoS) for backbone: Mish activation, Cross-stage partial connections (CSP), Multi-input weighted residual connections (MiWRC)
• Bag of Freebies (BoF) for detector: CIoU-loss, CmBN, DropBlock regularization, Mosaic data augmentation, Self-Adversarial Training, Eliminate grid sensitivity, Using multiple anchors for a single ground truth, Cosine annealing scheduler [52] (a sketch follows this list), Optimal hyper-parameters, Random training shapes
• Bag of Specials (BoS) for detector: Mish activation, SPP-block, SAM-block, PAN path-aggregation block, DIoU-NMS
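As referenced in the BoF list above, a minimal example of a cosine annealing learning-rate schedule in PyTorch (the stand-in model, optimizer, and hyper-parameter values are illustrative, not the paper's training configuration):

```python
import torch

model = torch.nn.Conv2d(3, 32, 3)   # stand-in model
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=100)

for epoch in range(100):
    # ... the actual training steps would go here ...
    opt.step()
    sched.step()   # learning rate follows a cosine curve toward its minimum
```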


4. Experiments


Table 2: Influence of BoF and Mish on the CSPResNeXt-50 classifier accuracy.

Table 3: Influence of BoF and Mish on the CSPDarknet-53 classifier accuracy.

Table 4: Ablation Studies of Bag-of-Freebies (CSPResNeXt50-PANet-SPP, 512x512).

Table 5: Ablation Studies of Bag-of-Specials (Size 512x512).

Table 6: Using different classifier pre-trained weightings for detector training (all other training parameters are similar in all models).

Table 7: Using different mini-batch size for detector training.

Figure 8: Comparison of the speed and accuracy of different object detectors. (Some articles stated the FPS of their detectors for only one of the GPUs: Maxwell/Pascal/Volta.)

Table 8: Comparison of the speed and accuracy of different object detectors on the MS COCO dataset (test-dev 2017). (Real-time detectors with FPS 30 or higher are highlighted here. We compare the results with batch=1 without using TensorRT.)

 
