Paper: Translation and Commentary on "YOLOv4: Optimal Speed and Accuracy of Object Detection"
Contents
Assessment of YOLOv4
1. Four improvements and one innovation
YOLOv4: Optimal Speed and Accuracy of Object Detection
Abstract
1. Introduction
2. Related work
2.1. Object detection models
2.2. Bag of freebies
2.3. Bag of specials
3. Methodology
3.1. Selection of architecture
3.2. Selection of BoF and BoS
3.3. Additional improvements
3.4. YOLOv4
4. Experiments
4.1. Experimental setup
4.2. Influence of different features on Classifier training
4.3. Influence of different features on Detector training
4.4. Influence of different backbones and pretrained weightings on Detector training
4.5. Influence of different mini-batch size on Detector training
5. Results
6. Conclusions
7. Acknowledgements
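The paper makes essentially four improvements and one innovation, while combining roughly 20 tricks from recent deep learning and object detection work. It does contain novelty and improvements, but most of them are incremental. Even so, comparing more than 20 recently published tricks must have taken a great deal of time and effort; the workload is enormous.
Source (Zhihu): https://zhuanlan.zhihu.com/p/135980432
Paper: https://arxiv.org/abs/2004.10934
Code: https://github.com/AlexeyAB/darknet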
There are a huge number of features which are said to improve Convolutional Neural Network (CNN) accuracy. Practical testing of combinations of such features on large datasets, and theoretical justification of the result, is required. Some features operate on certain models exclusively and for certain problems exclusively, or only for small-scale datasets; while some features, such as batch-normalization and residual-connections, are applicable to the majority of models, tasks, and datasets. We assume that such universal features include Weighted-Residual-Connections (WRC), Cross-Stage-Partial-connections (CSP), Cross mini-Batch Normalization (CmBN), Self-adversarial-training (SAT) and Mish-activation. We use new features: WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, CmBN, DropBlock regularization, and CIoU loss, and combine some of them to achieve state-of-the-art results: 43.5% AP (65.7% AP50) for the MS COCO dataset at a real-time speed of ~65 FPS on Tesla V100. Source code is at https://github.com/AlexeyAB/darknet |
有大量的特征被认为可以提高卷积神经网络(CNN)的精度。需要在大型数据集上对这些特性的组合进行实际测试,并对结果进行理论验证。有些特性只对某些模型起作用,只对某些问题起作用,或者只对小规模数据集起作用;而一些特性,如批量归一化和残差连接,则适用于大多数模型、任务和数据集。我们假设这些通用特性包括加权残差连接(WRC)、跨阶段部分连接(CSP)、跨小批量标准化(CmBN)、自对抗训练(SAT)和Mish激活。我们使用新特性:WRC,CSP,CmBN,SAT,Mish激活,马赛克数据增强,CmBN,DropBlock正则化,CIoU损失,并结合其中一些实现了最先进的结果:基于MS COCO数据集,可以得到43.5%的AP(65.7% AP50),Tesla V100上的实时速度为 ~ 65 FPS。 |
总结来说:就是什么技巧最先进,我都拿来用用,组合成一个更完美的结果!
A modern detector is usually composed of two parts, a backbone which is pre-trained on ImageNet and a head which is used to predict classes and bounding boxes of objects. For those detectors running on GPU platform, their backbone could be VGG [68], ResNet [26], ResNeXt [86], or DenseNet [30]. For those detectors running on CPU platform, their backbone could be SqueezeNet [31], MobileNet [28, 66, 27, 74], or ShuffleNet [97, 53]. As to the head part, it is usually categorized into two kinds, i.e., one-stage object detector and two-stage object detector. The most representative two-stage object detector is the R-CNN [19] series, including fast R-CNN [18], faster R-CNN [64], R-FCN [9], and Libra R-CNN [58]. It is also possible to make a two-stage object detector an anchor-free object detector, such as RepPoints [87]. As for one-stage object detector, the most representative models are YOLO [61, 62, 63], SSD [50], and RetinaNet [45]. In recent years, anchor-free one-stage object detectors have been developed. The detectors of this sort are CenterNet [13], CornerNet [37, 38], FCOS [78], etc. Object detectors developed in recent years often insert some layers between backbone and head, and these layers are usually used to collect feature maps from different stages. We can call it the neck of an object detector. Usually, a neck is composed of several bottom-up paths and several top-down paths. Networks equipped with this mechanism include Feature Pyramid Network (FPN) [44], Path Aggregation Network (PAN) [49], BiFPN [77], and NAS-FPN [17]. In addition to the above models, some researchers put their emphasis on directly building a new backbone (DetNet [43], DetNAS [7]) or a new whole model (SpineNet [12], HitDetector [20]) for object detection.
| 现代探测器通常由两个部分组成,一个是在ImageNet上预先训练的主干,另一个是用来预测物体的类和边界盒的头部。对于那些在GPU平台上运行的检测器,它们的主干可以是VGG[68]、ResNet[26]、ResNeXt[86]或DenseNet[30]。对于那些在CPU平台上运行的检测器,它们的主干可以是SqueezeNet[31]、MobileNet[28, 66, 27, 74]或ShuffleNet[97, 53]。至于头部,通常分为两类,即单级目标探测器和两级目标探测器。最具代表性的两级目标探测器是R-CNN[19]系列,包括fast R-CNN [18], faster R-CNN [64], R-FCN [9], Libra R-CNN[58]。也可以使两级对象检测器成为无锚对象检测器,如RepPoints[87]。单级目标探测器最具代表性的型号有YOLO[61, 62, 63]、SSD[50]、RetinaNet[45]。近年来,无锚单级目标探测器得到了发展。这类探测器有CenterNet[13]、CornerNet[37, 38]、FCOS[78]等。近年来发展起来的物体探测器常常在主干和头部之间插入一些层,这些层通常用来收集不同阶段的特征图。我们可以称之为物体探测器的颈部。通常,颈部由几个自下而上的路径和几个自上而下的路径组成。具有该机制的网络包括特征金字塔网络(Feature Pyramid Network, FPN)[44]、路径聚合网络(Path Aggregation Network, PAN)[49]、BiFPN[77]和NAS-FPN[17]。除了上述模型外,一些研究者还着重于直接构建一个新的主干(DetNet [43], DetNAS[7])或一个新的整体模型(SpineNet [12], HitDetector[20])用于对象检测。 |
To sum up, an ordinary object detector is composed of several parts:
|
综上所述,一个普通的物体探测器由以下几个部分组成:
|
Usually, a conventional object detector is trained offline. Therefore, researchers always like to take this advantage and develop better training methods which can make the object detector receive better accuracy without increasing the inference cost. We call these methods that only change the training strategy or only increase the training cost as “bag of freebies.” What is often adopted by object detection methods and meets the definition of bag of freebies is data augmentation. The purpose of data augmentation is to increase the variability of the input images, so that the designed object detection model has higher robustness to the images obtained from different environments. For example, photometric distortions and geometric distortions are two commonly used data augmentation methods and they definitely benefit the object detection task. In dealing with photometric distortion, we adjust the brightness, contrast, hue, saturation, and noise of an image. For geometric distortion, we add random scaling, cropping, flipping, and rotating. | 通常,传统的目标探测器是离线训练的。因此,研究人员总是希望利用这一优势,开发出更好的训练方法,使目标探测器在不增加推理成本的情况下获得更好的精度。我们把这些只改变训练策略或只增加训练成本的方法称为“免费包”(bag of freebies)。目标检测方法经常采用的、符合免费包定义的是数据扩充。数据扩充的目的是增加输入图像的可变性,使所设计的目标检测模型对不同环境下获得的图像具有更高的鲁棒性。例如,光度畸变和几何畸变是两种常用的数据增强方法,它们对目标检测任务无疑是有益的。在处理光度失真时,我们调整图像的亮度、对比度、色调、饱和度和噪声。对于几何畸变,我们添加了随机缩放、剪切、翻转和旋转。 |
The data augmentation methods mentioned above are all pixel-wise adjustments, and all original pixel information in the adjusted area is retained. In addition, some researchers engaged in data augmentation put their emphasis on simulating object occlusion issues. They have achieved good results in image classification and object detection. For example, random erase [100] and CutOut [11] can randomly select the rectangle region in an image and fill in a random or complementary value of zero. As for hide-and-seek [69] and grid mask [6], they randomly or evenly select multiple rectangle regions in an image and replace them to all zeros. If similar concepts are applied to feature maps, there are DropOut [71], DropConnect [80], and DropBlock [16] methods. In addition, some researchers have proposed the methods of using multiple images together to perform data augmentation. For example, MixUp [92] uses two images to multiply and superimpose with different coefficient ratios, and then adjusts the label with these superimposed ratios. As for CutMix [91], it is to cover the cropped image to rectangle region of other images, and adjusts the label according to the size of the mix area. In addition to the above mentioned methods, style transfer GAN [15] is also used for data augmentation, and such usage can effectively reduce the texture bias learned by CNN. | 上述数据增强方法均为像素级调整,并保留调整区域内的所有原始像素信息。此外,一些从事数据扩充的研究人员将重点放在模拟物体遮挡问题上。在图像分类和目标检测方面取得了较好的效果。例如,随机擦除[100]和CutOut[11]可以随机选择图像中的矩形区域,并填充一个随机的或互补的零值。对于捉迷藏[69]和网格掩码[6],它们随机或均匀地选择图像中的多个矩形区域,并将其全部替换为零。如果将类似的概念应用于特征图,则有DropOut[71]、DropConnect[80]和DropBlock[16]方法。此外,一些研究者提出了将多幅图像结合在一起进行数据增强的方法。例如,MixUp[92]使用两张图像以不同的系数比率进行相乘和叠加,然后用这些叠加比率调整标签。CutMix[91]是将裁剪后的图像覆盖到其他图像的矩形区域,并根据混合区域的大小调整标签。除了上述方法外,还使用style transfer GAN[15]进行数据扩充,这样可以有效减少CNN学习到的纹理偏差。
|
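The two multi-image augmentations described above can be sketched in a few lines. This is a minimal illustration in plain Python, assuming single-channel images stored as nested lists and one-hot label vectors; the function names `mixup` and `cutmix`, the mixing coefficient `lam`, and the `box` layout are hypothetical choices for this sketch, not the paper's or darknet's implementation.

```python
def mixup(img_a, img_b, label_a, label_b, lam):
    """Blend two equally sized images and their labels with weight lam."""
    mixed_img = [[lam * pa + (1 - lam) * pb for pa, pb in zip(row_a, row_b)]
                 for row_a, row_b in zip(img_a, img_b)]
    mixed_label = [lam * la + (1 - lam) * lb
                   for la, lb in zip(label_a, label_b)]
    return mixed_img, mixed_label


def cutmix(img_a, img_b, label_a, label_b, box):
    """Paste the box region of img_b onto img_a.

    box is (top, left, bottom, right), end-exclusive. The label is
    weighted by the fraction of the image area that was replaced.
    """
    top, left, bottom, right = box
    height, width = len(img_a), len(img_a[0])
    out = [row[:] for row in img_a]  # copy so img_a is left untouched
    for y in range(top, bottom):
        for x in range(left, right):
            out[y][x] = img_b[y][x]
    ratio_b = (bottom - top) * (right - left) / (height * width)
    label = [(1 - ratio_b) * la + ratio_b * lb
             for la, lb in zip(label_a, label_b)]
    return out, label
```

In practice `lam` and `box` would be drawn at random for every training sample; they are passed in explicitly here so the behaviour is easy to check.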
Different from the various approaches proposed above, some other bag of freebies methods are dedicated to solving the problem that the semantic distribution in the dataset may have bias. In dealing with the problem of semantic distribution bias, a very important issue is that there is a problem of data imbalance between different classes, and this problem is often solved by hard negative example mining [72] or online hard example mining [67] in two-stage object detector. But the example mining method is not applicable to one-stage object detector, because this kind of detector belongs to the dense prediction architecture. Therefore Lin et al. [45] proposed focal loss to deal with the problem of data imbalance existing between various classes. Another very important issue is that it is difficult to express the relationship of the degree of association between different categories with the one-hot hard representation. This representation scheme is often used when executing labeling. The label smoothing proposed in [73] is to convert hard label into soft label for training, which can make model more robust. In order to obtain a better soft label, Islam et al. [33] introduced the concept of knowledge distillation to design the label refinement network. | 与上面提出的各种方法不同,其他一些免费包方法致力于解决数据集中的语义分布可能存在偏差的问题。在处理语义分布偏差问题时,一个非常重要的问题是不同类之间存在数据不平衡的问题,这个问题往往通过两阶段对象检测器中的硬反例挖掘[72]或在线硬例挖掘[67]来解决。但由于该方法属于稠密预测结构,因此不适用于单级目标检测。因此,Lin等人提出了焦损来处理各个类之间存在的数据不平衡问题。另一个非常重要的问题是,很难用一个热硬表示法来表达不同类别之间关联程度的关系。这种表示法常用于执行标记。文献[73]提出的标签平滑是将硬标签转换为软标签进行训练,使模型更加稳健。为了获得更好的软标签,Islam等人引入了知识蒸馏的概念来设计标签细化网络。 |
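Label smoothing and focal loss are both short formulas, so a small sketch may help. It uses the commonly cited forms: smoothing mixes the one-hot label with a uniform distribution, and the binary focal loss scales cross-entropy by `(1 - p_t) ** gamma`. The defaults `eps=0.1`, `gamma=2.0` and `alpha=0.25` are typical values from the literature and are assumptions here, not settings taken from this paper.

```python
import math


def smooth_labels(one_hot, eps=0.1):
    """Turn a hard one-hot label into a soft label that still sums to 1."""
    k = len(one_hot)
    return [y * (1.0 - eps) + eps / k for y in one_hot]


def focal_loss(p, target, gamma=2.0, alpha=0.25):
    """Binary focal loss for a predicted probability p and a 0/1 target.

    p_t is the probability assigned to the true class, so well-classified
    examples (p_t close to 1) are down-weighted by (1 - p_t) ** gamma.
    """
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With `gamma=0` the focal term disappears and the function reduces to class-weighted cross-entropy, which is a quick sanity check on the implementation.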
The last bag of freebies is the objective function of Bounding Box (BBox) regression. The traditional object detector usually uses Mean Square Error (MSE) to directly perform regression on the center point coordinates and height and width of the BBox, i.e., $\{x_{center}, y_{center}, w, h\}$, or the upper left point and the lower right point, i.e., $\{x_{top\_left}, y_{top\_left}, x_{bottom\_right}, y_{bottom\_right}\}$. As for anchor-based method, it is to estimate the corresponding offset, for example $\{x_{center\_offset}, y_{center\_offset}, w_{offset}, h_{offset}\}$ and $\{x_{top\_left\_offset}, y_{top\_left\_offset}, x_{bottom\_right\_offset}, y_{bottom\_right\_offset}\}$. However, to directly estimate the coordinate values of each point of the BBox is to treat these points as independent variables, but in fact does not consider the integrity of the object itself. In order to make this issue processed better, some researchers recently proposed IoU loss [90], which puts the coverage of predicted BBox area and ground truth BBox area into consideration. The IoU loss computing process will trigger the calculation of the four coordinate points of the BBox by executing IoU with the ground truth, and then connecting the generated results into a whole code. Because IoU is a scale invariant representation, it can solve the problem that when traditional methods calculate the $\ell_1$ or $\ell_2$ loss of $\{x, y, w, h\}$, the loss will increase with the scale. Recently, some researchers have continued to improve IoU loss. For example, GIoU loss [65] is to include the shape and orientation of object in addition to the coverage area. They proposed to find the smallest area BBox that can simultaneously cover the predicted BBox and ground truth BBox, and use this BBox as the denominator to replace the denominator originally used in IoU loss.
As for DIoU loss [99], it additionally considers the distance of the center of an object, and CIoU loss [99], on the other hand, simultaneously considers the overlapping area, the distance between center points, and the aspect ratio. CIoU can achieve better convergence speed and accuracy on the BBox regression problem. | 最后一个免费包是边界盒(BBox)回归的目标函数。传统的目标检测器通常使用均方误差(MSE)直接对BBox的中心点坐标和高度、宽度进行回归,即 $\{x_{center}, y_{center}, w, h\}$,或左上点和右下点,即 $\{x_{top\_left}, y_{top\_left}, x_{bottom\_right}, y_{bottom\_right}\}$。基于锚的方法是估计相应的偏移量,例如 $\{x_{center\_offset}, y_{center\_offset}, w_{offset}, h_{offset}\}$ 和 $\{x_{top\_left\_offset}, y_{top\_left\_offset}, x_{bottom\_right\_offset}, y_{bottom\_right\_offset}\}$。但是,直接估计BBox中每个点的坐标值,就是把这些点当作独立变量,而实际上并不考虑对象本身的完整性。为了更好地处理这一问题,一些研究者最近提出了IoU损失[90],将预测BBox区域的覆盖范围和地面真实BBox区域考虑在内。IoU损失计算过程通过执行IoU和ground truth,触发BBox四个坐标点的计算,然后将生成的结果连接成一个完整的代码。由于IoU是尺度不变的表示,可以解决传统方法在计算 $\{x, y, w, h\}$ 的 $\ell_1$ 或 $\ell_2$ 损失时,损失会随着尺度的增大而增大的问题。最近,一些研究人员继续改进IoU损失。例如,GIoU loss[65]除了覆盖区域外,还包括了物体的形状和方向。他们提出寻找能够同时覆盖预测BBox和地面真实BBox的最小面积BBox,并以此BBox作为分母来代替IoU损失中原来使用的分母。对于DIoU loss[99],它额外考虑了物体中心的距离,而CIoU loss[99]则同时考虑了重叠区域、中心点之间的距离和纵横比。对于BBox回归问题,CIoU具有更好的收敛速度和精度。 |
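To make the differences between the IoU variants concrete, here is a small sketch that computes all four scores for a pair of boxes. It follows the formulas as commonly stated in the GIoU and DIoU/CIoU papers (for GIoU, the penalty is the share of the enclosing box not covered by the union). The box format `(x1, y1, x2, y2)` and the function name `iou_family` are assumptions made for this example, and boxes are assumed to have positive width and height. The corresponding loss in each case is `1 - score`.

```python
import math


def iou_family(a, b):
    """Return (IoU, GIoU, DIoU, CIoU) for boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b

    # Overlap and union areas.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union

    # Smallest box enclosing both inputs.
    cx1, cy1 = min(ax1, bx1), min(ay1, by1)
    cx2, cy2 = max(ax2, bx2), max(ay2, by2)
    enclose = (cx2 - cx1) * (cy2 - cy1)
    giou = iou - (enclose - union) / enclose

    # Squared center distance, normalized by the squared enclosing diagonal.
    rho2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
         + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2
    diag2 = (cx2 - cx1) ** 2 + (cy2 - cy1) ** 2
    diou = iou - rho2 / diag2

    # Aspect-ratio consistency term used by CIoU.
    v = (4 / math.pi ** 2) * (math.atan((bx2 - bx1) / (by2 - by1))
                              - math.atan((ax2 - ax1) / (ay2 - ay1))) ** 2
    alpha = v / (1 - iou + v) if v > 0 else 0.0
    ciou = diou - alpha * v
    return iou, giou, diou, ciou
```

For two boxes with the same aspect ratio the term `v` is zero, so CIoU and DIoU coincide; they differ only when the shapes disagree.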
The basic aim is fast operating speed of neural network, in production systems and optimization for parallel computations, rather than the low computation volume theoretical indicator (BFLOP). We present two options of real-time neural networks:
|
其基本目标是在生产系统和并行计算优化中提高神经网络的运行速度,而不是低计算量理论指标(BFLOP)。我们提出了实时神经网络的两种选择: 对于GPU,我们在卷积层中使用少量的组(1 - 8):CSPResNeXt50 / CSPDarknet53 |
Our objective is to find the optimal balance among the input network resolution, the convolutional layer number, the parameter number (filter_size² * filters * channel / groups), and the number of layer outputs (filters). For instance, our numerous studies demonstrate that the CSPResNeXt50 is considerably better compared to CSPDarknet53 in terms of object classification on the ILSVRC2012 (ImageNet) dataset [10]. However, conversely, the CSPDarknet53 is better compared to CSPResNeXt50 in terms of detecting objects on the MS COCO dataset [46]. | 我们的目标是在输入网络分辨率、卷积层数、参数数(filter_size² * filters * channel / groups)和层输出数(filters)之间找到最佳平衡。例如,我们的大量研究表明,在ILSVRC2012 (ImageNet)数据集[10]上的对象分类方面,CSPResNeXt50要比CSPDarknet53好得多。然而,相反地,在MS COCO数据集[46]上检测对象方面,CSPDarknet53比CSPResNeXt50更好。 |
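The parameter-count expression in the paragraph above is easy to evaluate directly. The helper below is a hypothetical one-liner written for this post; it counts weights only (no bias terms) and assumes `channels` is divisible by `groups`.

```python
def conv_params(filter_size, filters, channels, groups=1):
    """Weights in one conv layer: filter_size^2 * filters * channels / groups."""
    return filter_size ** 2 * filters * channels // groups


# Example: a 3x3 convolution from 32 to 64 channels.
dense = conv_params(3, 64, 32)               # ordinary convolution
grouped = conv_params(3, 64, 32, groups=8)   # grouped convolution, 8 groups
```

The comparison shows why the number of groups matters for the trade-off discussed above: with 8 groups the same layer holds one eighth of the weights.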
The next objective is to select additional blocks for increasing the receptive field and the best method of parameter aggregation from different backbone levels for different detector levels: e.g. FPN, PAN, ASFF, BiFPN. A reference model which is optimal for classification is not always optimal for a detector. In contrast to the classifier, the detector requires the following:
|
下一个目标是选择额外的块来增加感受野,并从不同的主干水平对不同的检测器水平(如FPN、PAN、ASFF、BiFPN)进行参数聚合的最佳方法。 对于分类来说是最优的参考模型对于检测器来说并不总是最优的。与分类器相比,检测器需要满足以下条件:
|
Hypothetically speaking, we can assume that a model with a larger receptive field size (with a larger number of convolutional layers 3 × 3) and a larger number of parameters should be selected as the backbone. Table 1 shows the information of CSPResNeXt50, CSPDarknet53, and EfficientNet B3. The CSPResNeXt50 contains only 16 convolutional layers 3 × 3, a 425 × 425 receptive field and 20.6 M parameters, while CSPDarknet53 contains 29 convolutional layers 3 × 3, a 725 × 725 receptive field and 27.6 M parameters. This theoretical justification, together with our numerous experiments, shows that CSPDarknet53 neural network is the optimal model of the two as the backbone for a detector. The influence of the receptive field with different sizes is summarized as follows:
|
从假设上讲,我们可以认为应当选择具有更大感受野(包含更多的3×3卷积层)和更多参数的模型作为主干。表1显示了CSPResNeXt50、CSPDarknet53和EfficientNet B3的信息。CSPResNeXt50只包含16个3×3卷积层,一个425×425的感受野和20.6 M的参数,而CSPDarknet53包含29个3×3卷积层,一个725×725的感受野和27.6 M的参数。这一理论依据,以及我们的大量实验,表明CSPDarknet53神经网络是两者中作为检测器主干的最佳模型。不同大小的感受野的影响总结如下:
|
We add the SPP block over the CSPDarknet53, since it significantly increases the receptive field, separates out the most significant context features and causes almost no reduction of the network operation speed. We use PANet as the method of parameter aggregation from different backbone levels for different detector levels, instead of the FPN used in YOLOv3. Finally, we choose CSPDarknet53 backbone, SPP additional module, PANet path-aggregation neck, and YOLOv3 (anchor based) head as the architecture of YOLOv4. In the future we plan to expand significantly the content of Bag of Freebies (BoF) for the detector, which theoretically can address some problems and increase the detector accuracy, and sequentially check the influence of each feature in an experimental fashion. We do not use Cross-GPU Batch Normalization (CGBN or SyncBN) or expensive specialized devices. This allows anyone to reproduce our state-of-the-art outcomes on a conventional graphic processor e.g. GTX 1080Ti or RTX 2080Ti. |
我们在CSPDarknet53上添加了SPP块,因为它显著增加了接受域,分离出了最重要的上下文特性,并且几乎不会降低网络运行速度。我们使用PANet作为不同检测层的不同主干层的参数聚合方法,而不是YOLOv3中使用的FPN。 最后,我们选择CSPDarknet53主干、SPP附加模块、PANet路径聚合颈和基于锚的YOLOv3头部作为YOLOv4的架构。 在未来,我们计划大幅扩展探测器的免费赠品包(BoF)的内容,这在理论上可以解决一些问题,提高探测器的精度,并以实验的方式依次检查每个特征的影响。 我们不使用跨gpu批处理标准化(CGBN或SyncBN)或昂贵的专用设备。这使得任何人都可以在传统的图形处理器(如GTX 1080Ti或RTX 2080Ti)上重现我们最先进的结果。 |
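The idea behind the SPP block can be sketched as follows: the same feature map is max-pooled at several kernel sizes with stride 1 and "same" padding, and the results are concatenated with the input along the channel axis, so the spatial size is unchanged while the receptive field grows. The sketch uses plain Python lists; the kernel sizes `(5, 9, 13)` are an assumption of this example (the text above does not state them), and `max_pool_same` and `spp_block` are hypothetical names.

```python
def max_pool_same(fmap, k):
    """Stride-1 max pooling; border windows are clipped so size is preserved."""
    height, width = len(fmap), len(fmap[0])
    r = k // 2
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [fmap[j][i]
                      for j in range(max(0, y - r), min(height, y + r + 1))
                      for i in range(max(0, x - r), min(width, x + r + 1))]
            row.append(max(window))
        out.append(row)
    return out


def spp_block(channels, kernel_sizes=(5, 9, 13)):
    """channels: list of 2D maps. Returns the inputs followed by pooled copies."""
    out = list(channels)
    for k in kernel_sizes:
        out.extend(max_pool_same(c, k) for c in channels)
    return out
```

With three kernel sizes the block outputs four times as many channels as it receives, which is why a 1 × 1 convolution usually follows it in real networks.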
For improving the object detection training, a CNN usually uses the following:
|
为了提高目标检测训练,CNN通常使用以下方法:
|
As for training activation function, since PReLU and SELU are more difficult to train, and ReLU6 is specifically designed for quantization network, we therefore remove the above activation functions from the candidate list. In the method of regularization, the people who published DropBlock have compared their method with other methods in detail, and their regularization method has won a lot. Therefore, we did not hesitate to choose DropBlock as our regularization method. As for the selection of normalization method, since we focus on a training strategy that uses only one GPU, syncBN is not considered. | 对于训练激活函数,由于PReLU和SELU更难训练,而ReLU6是专门为量化网络设计的,因此我们将上述激活函数从候选列表中删除。在正则化方法中,发表了DropBlock的人将他们的方法与其他方法进行了详细的比较,并且他们的正则化方法优势明显。因此,我们毫不犹豫地选择了DropBlock作为我们的正则化方法。对于归一化方法的选择,由于我们关注的是只使用一个GPU的训练策略,所以没有考虑syncBN。 |
In this section, we shall elaborate the details of YOLOv4. YOLOv4 consists of:
YOLO v4 uses:
|
在本节中,我们将详细介绍YOLOv4。YOLOv4包括:
YOLO v4使用:
|
We test the influence of different training improvement techniques on accuracy of the classifier on ImageNet (ILSVRC 2012 val) dataset, and then on the accuracy of the detector on MS COCO (test-dev 2017) dataset. | 我们在ImageNet (ILSVRC 2012 val)数据集上测试了不同的训练改进技术对分类器精度的影响,然后在MS COCO (test-dev 2017)数据集上测试了检测器的精度。 |
In ImageNet image classification experiments, the default hyper-parameters are as follows: the training steps is 8,000,000; the batch size and the mini-batch size are 128 and 32, respectively; the polynomial decay learning rate scheduling strategy is adopted with initial learning rate 0.1; the warm-up steps is 1000; the momentum and weight decay are respectively set as 0.9 and 0.005. All of our BoS experiments use the same hyper-parameter as the default setting, and in the BoF experiments, we add an additional 50% training steps. In the BoF experiments, we verify MixUp, CutMix, Mosaic, Blurring data augmentation, and label smoothing regularization methods. In the BoS experiments, we compared the effects of LReLU, Swish, and Mish activation function. All experiments are trained with a 1080 Ti or 2080 Ti GPU. | 在ImageNet图像分类实验中,默认超参数为:训练步数为8,000,000;批大小和小批大小分别为128和32;采用多项式衰减学习率调度策略,初始学习率为0.1;热身步数为1000;动量和权重衰减分别设置为0.9和0.005。我们所有的BoS实验都使用与默认设置相同的超参数,在BoF实验中,我们添加了额外的50%的训练步骤。在BoF实验中,我们验证了MixUp、CutMix、Mosaic、模糊(Blurring)数据增强和标签平滑正则化等方法。在BoS实验中,我们比较了LReLU、Swish和Mish激活函数的作用。所有实验均使用1080 Ti或2080 Ti GPU进行训练。 |
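A polynomial-decay schedule with warm-up can be written in a few lines. The step count, initial learning rate and warm-up length below come from the paragraph above; the exponent `power=4.0` and the polynomial shape of the warm-up ramp are assumptions of this sketch, since the text does not specify them.

```python
def poly_lr(step, base_lr=0.1, max_steps=8_000_000,
            warmup_steps=1000, power=4.0):
    """Learning rate at a given step: warm-up ramp, then polynomial decay."""
    if step < warmup_steps:
        # Ramp up from 0 towards base_lr during warm-up (assumed shape).
        return base_lr * (step / warmup_steps) ** power
    # Decay polynomially so the rate reaches 0 at max_steps.
    return base_lr * (1.0 - step / max_steps) ** power
```

Plotting `poly_lr` over the full run gives a short ramp over the first 1000 steps followed by a smooth decline to zero at step 8,000,000.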
In MS COCO object detection experiments, the default hyper-parameters are as follows: the training steps is 500,500; the step decay learning rate scheduling strategy is adopted with initial learning rate 0.01, multiplied by a factor of 0.1 at the 400,000 steps and the 450,000 steps, respectively; the momentum and weight decay are respectively set as 0.9 and 0.0005. All architectures use a single GPU to execute multi-scale training in the batch size of 64 while mini-batch size is 8 or 4 depending on the architectures and GPU memory limitation. Except for using genetic algorithm for hyper-parameter search experiments, all other experiments use default setting. Genetic algorithm used YOLOv3-SPP to train with GIoU loss and search 300 epochs for min-val 5k sets. We adopt searched learning rate 0.00261, momentum 0.949, IoU threshold for assigning ground truth 0.213, and loss normalizer 0.07 for genetic algorithm experiments. We have verified a large number of BoF, including grid sensitivity elimination, mosaic data augmentation, IoU threshold, genetic algorithm, class label smoothing, cross mini-batch normalization, self-adversarial training, cosine annealing scheduler, dynamic mini-batch size, DropBlock, Optimized Anchors, different kind of IoU losses. We also conduct experiments on various BoS, including Mish, SPP, SAM, RFB, BiFPN, and Gaussian YOLO [8]. For all experiments, we only use one GPU for training, so techniques such as syncBN that optimizes multiple GPUs are not used.
| 在MS COCO对象检测实验中,默认的超参数为:训练步骤为500500;采用步进衰减学习率调度策略,初始学习率为0.01,在400,000步和450,000步分别乘以因子0.1;动量衰减为0.9,重量衰减为0.0005。所有的架构都使用一个GPU来执行批处理大小为64的多尺度训练,而小批处理大小为8或4取决于架构和GPU内存限制。除超参数搜索实验采用遗传算法外,其他实验均采用默认设置。遗传算法利用YOLOv3-SPP进行带GIoU损失的训练,搜索最小值5k集的300个epoch。遗传算法实验采用搜索学习率0.00261、动量0.949、IoU阈值分配ground truth 0.213和损失正态值0.07。我们已经验证了大量的BoF,包括网格敏感性消除、马赛克数据增强、IoU阈值、遗传算法、类标记平滑、交叉小批量标准化、自对抗训练、余弦退火调度器、动态小批量大小、DropBlock、优化锚,不同类型的IoU损失。我们还对各种BoS进行了实验,包括Mish、SPP、SAM、RFB、BiFPN和Gaussian YOLO[8]。对于所有的实验,我们只使用一个GPU进行训练,所以像syncBN这样的优化多个GPU的技术并没有被使用。 |
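The step-decay schedule for the detector follows directly from the numbers in the paragraph above: start at 0.01 and multiply by 0.1 at steps 400,000 and 450,000. The function name `step_lr` is a hypothetical choice for this sketch; the last two lines show how a batch of 64 with a mini-batch of 8 maps to gradient accumulation, assuming the batch is simply split into equal mini-batches.

```python
def step_lr(step, base_lr=0.01, milestones=(400_000, 450_000), factor=0.1):
    """Learning rate after applying `factor` once per milestone already passed."""
    lr = base_lr
    for milestone in milestones:
        if step >= milestone:
            lr *= factor
    return lr


# A batch of 64 processed in mini-batches of 8 means the weights are
# updated once every 8 forward/backward passes.
batch_size, mini_batch_size = 64, 8
accumulation_steps = batch_size // mini_batch_size
```

Over the 500,500 training steps this gives three plateaus: 0.01, then 0.001, then 0.0001.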
First, we study the influence of different features on classifier training; specifically, the influence of Class label smoothing, the influence of different data augmentation techniques, bilateral blurring, MixUp, CutMix and Mosaic, as shown in Figure 7, and the influence of different activations, such as Leaky-ReLU (by default), Swish, and Mish. | 首先,研究了不同特征对分类器训练的影响;具体来说,类标签平滑的影响,不同数据增强技术(双边模糊、MixUp、CutMix和Mosaic)的影响,如图7所示,以及不同激活函数的影响,如Leaky-ReLU(默认)、Swish和Mish。 |
Figure 7: Various method of data augmentation 图7:各种数据增强方法 |
|
In our experiments, as illustrated in Table 2, the classifier’s accuracy is improved by introducing the features such as: CutMix and Mosaic data augmentation, Class label smoothing, and Mish activation. As a result, our BoF-backbone (Bag of Freebies) for classifier training includes the following: CutMix and Mosaic data augmentation and Class label smoothing. In addition we use Mish activation as a complementary option, as shown in Table 2 and Table 3. | 在我们的实验中,如表2所示,通过引入特征如:CutMix和Mosaic数据增强、类标签平滑、Mish激活等,提高了分类器的准确率。因此,我们用于分类器训练的BoF-backbone(免费包)包括以下内容:CutMix和Mosaic数据增强和类标签平滑。此外,我们使用Mish激活作为补充选项,如表2和表3所示。 |
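For reference, the three activation functions compared in these experiments can be written out as scalar functions. The definitions are the commonly used ones: Swish is `x * sigmoid(x)` and Mish is `x * tanh(softplus(x))`. The Leaky-ReLU slope of 0.1 is an assumption of this sketch, as the text does not state the value used.

```python
import math


def leaky_relu(x, slope=0.1):
    """Identity for positive inputs, small linear slope for negative ones."""
    return x if x > 0 else slope * x


def swish(x):
    """x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def softplus(x):
    """log(1 + exp(x)), computed in a numerically stable way."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)


def mish(x):
    """x * tanh(softplus(x)); smooth and non-monotonic near zero."""
    return x * math.tanh(softplus(x))
```

All three pass through the origin and behave like the identity for large positive inputs; they differ mainly in how they treat small negative values.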
Further study concerns the influence of different Bag-of-Freebies (BoF-detector) on the detector training accuracy, as shown in Table 4. We significantly expand the BoF list through studying different features that increase the detector accuracy without affecting FPS:
|
进一步的研究关注不同的Bag-of-Freebies (BoF-detector)对检测器训练精度的影响,如表4所示。我们通过研究在不影响FPS的情况下提高检测精度的不同特征,显著扩展了BoF列表:
|
Further study concerns the influence of different Bag-of-Specials (BoS-detector) on the detector training accuracy, including PAN, RFB, SAM, Gaussian YOLO (G), and ASFF, as shown in Table 5. In our experiments, the detector gets best performance when using SPP, PAN, and SAM. |
进一步研究不同的Bag-of-Specials (BoS-detector)对检测器训练精度的影响,包括PAN、RFB、SAM、Gaussian YOLO (G)、ASFF,如表5所示。在我们的实验中,当使用SPP、PAN和SAM时,检测器的性能最佳。 |
Further on we study the influence of different backbone models on the detector accuracy, as shown in Table 6. We notice that the model characterized with the best classification accuracy is not always the best in terms of the detector accuracy. First, although classification accuracy of CSPResNeXt50 models trained with different features is higher compared to CSPDarknet53 models, the CSPDarknet53 model shows higher accuracy in terms of object detection. |
进一步研究不同骨架模型对检测精度的影响,如表6所示。我们注意到,在检测器精度方面,具有最佳分类精度的模型并不总是最佳的。 首先,虽然不同特征训练的CSPResNeXt50模型的分类精度要高于CSPDarknet53模型,但CSPDarknet53模型在目标检测方面具有更高的精度。 |
Second, using BoF and Mish for the CSPResNeXt50 classifier training increases its classification accuracy, but further application of these pre-trained weightings for detector training reduces the detector accuracy. However, using BoF and Mish for the CSPDarknet53 classifier training increases the accuracy of both the classifier and the detector which uses this classifier pre-trained weightings. The net result is that backbone CSPDarknet53 is more suitable for the detector than CSPResNeXt50. We observe that the CSPDarknet53 model demonstrates a greater ability to increase the detector accuracy owing to various improvements. |
其次,在CSPResNeXt50分类器训练中使用BoF和Mish可以提高分类精度,但是在检测器训练中进一步使用这些预训练权重会降低检测器的精度。然而,在CSPDarknet53分类器训练中使用BoF和Mish提高了分类器和使用预先训练权重的检测器的准确性。结果表明,基干CSPDarknet53比CSPResNeXt50更适合于检测器。 我们观察到,CSPDarknet53模型由于各种改进而显示出更大的提高检测器准确度的能力。 |
Comparison of the results obtained with other state-of-the-art object detectors is shown in Figure 8. Our YOLOv4 models are located on the Pareto optimality curve and are superior to the fastest and most accurate detectors in terms of both speed and accuracy. |
得到的结果与其他最先进的物体探测器的比较如图8所示。我们的YOLOv4位于Pareto最优曲线上,无论是速度还是精度都优于最快最准确的检测器。 |
Since different methods use GPUs of different architectures for inference time verification, we operate YOLOv4 on commonly adopted GPUs of Maxwell, Pascal, and Volta architectures, and compare them with other state-of-the-art methods. Table 8 lists the frame rate comparison results of using Maxwell GPU, and it can be GTX Titan X (Maxwell) or Tesla M40 GPU. Table 9 lists the frame rate comparison results of using Pascal GPU, and it can be Titan X (Pascal), Titan Xp, GTX 1080 Ti, or Tesla P100 GPU. As for Table 10, it lists the frame rate comparison results of using Volta GPU, and it can be Titan Volta or Tesla V100 GPU. | 由于不同的方法使用不同架构的gpu进行推理时间验证,我们在Maxwell架构、Pascal架构和Volta架构常用的gpu上运行YOLOv4,并与其他最先进的方法进行比较。表8列出了使用Maxwell GPU的帧速率比较结果,可以是GTX Titan X (Maxwell)或者Tesla M40 GPU。表9列出了使用Pascal GPU的帧率比较结果,可以是Titan X (Pascal)、Titan Xp、GTX 1080 Ti或Tesla P100 GPU。表10列出了使用Volta GPU的帧速率比较结果,可以是Titan Volta或者Tesla V100 GPU。 |
We offer a state-of-the-art detector which is faster (FPS) and more accurate (MS COCO AP50...95 and AP50) than all available alternative detectors. The detector described can be trained and used on a conventional GPU with 8-16 GB of VRAM, which makes its broad use possible. The original concept of one-stage anchor-based detectors has proven its viability. We have verified a large number of features, and selected those that improve the accuracy of both the classifier and the detector. These features can be used as best practice for future studies and developments. |
|
The authors wish to thank Glenn Jocher for the ideas of Mosaic data augmentation, the selection of hyper-parameters by using genetic algorithms and solving the grid sensitivity problem https://github.com/ultralytics/yolov3. | 作者感谢Glenn Jocher提出的Mosaic数据增强、利用遗传算法选择超参数以及解决网格敏感性问题的思路:https://github.com/ultralytics/yolov3。 |