YOLOv1 (Paper Translation)

You Only Look Once: Unified, Real-Time Object Detection


http://pjreddie.com/yolo/

Abstract

We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.

Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background. Finally, YOLO learns very general representations of objects. It outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.

1. Introduction

Humans glance at an image and instantly know what objects are in the image, where they are, and how they interact. The human visual system is fast and accurate, allowing us to perform complex tasks like driving with little conscious thought. Fast, accurate algorithms for object detection would allow computers to drive cars without specialized sensors, enable assistive devices to convey real-time scene information to human users, and unlock the potential for general purpose, responsive robotic systems.

Current detection systems repurpose classifiers to perform detection. To detect an object, these systems take a classifier for that object and evaluate it at various locations and scales in a test image. Systems like deformable parts models (DPM) use a sliding window approach where the classifier is run at evenly spaced locations over the entire image [10].

More recent approaches like R-CNN use region proposal methods to first generate potential bounding boxes in an image and then run a classifier on these proposed boxes. After classification, post-processing is used to refine the bounding boxes, eliminate duplicate detections, and rescore the boxes based on other objects in the scene [13]. These complex pipelines are slow and hard to optimize because each individual component must be trained separately.

We reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities. Using our system, you only look once (YOLO) at an image to predict what objects are present and where they are.

YOLO is refreshingly simple: see Figure 1. A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes. YOLO trains on full images and directly optimizes detection performance. This unified model has several benefits over traditional methods of object detection.

Figure 1: The YOLO Detection System. Processing images with YOLO is simple and straightforward. Our system (1) resizes the input image to 448 × 448, (2) runs a single convolutional network on the image, and (3) thresholds the resulting detections by the model’s confidence.

First, YOLO is extremely fast. Since we frame detection as a regression problem we don’t need a complex pipeline. We simply run our neural network on a new image at test time to predict detections. Our base network runs at 45 frames per second with no batch processing on a Titan X GPU and a fast version runs at more than 150 fps. This means we can process streaming video in real-time with less than 25 milliseconds of latency. Furthermore, YOLO achieves more than twice the mean average precision of other real-time systems. For a demo of our system running in real-time on a webcam please see our project webpage: http://pjreddie.com/yolo/.

Second, YOLO reasons globally about the image when making predictions. Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time so it implicitly encodes contextual information about classes as well as their appearance. Fast R-CNN, a top detection method [14], mistakes background patches in an image for objects because it can’t see the larger context. YOLO makes less than half the number of background errors compared to Fast R-CNN.

Third, YOLO learns generalizable representations of objects. When trained on natural images and tested on artwork, YOLO outperforms top detection methods like DPM and R-CNN by a wide margin. Since YOLO is highly generalizable it is less likely to break down when applied to new domains or unexpected inputs.

YOLO still lags behind state-of-the-art detection systems in accuracy. While it can quickly identify objects in images it struggles to precisely localize some objects, especially small ones. We examine these tradeoffs further in our experiments.

All of our training and testing code is open source. A variety of pretrained models are also available to download.

2. Unified Detection

We unify the separate components of object detection into a single neural network. Our network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously. This means our network reasons globally about the full image and all the objects in the image. The YOLO design enables end-to-end training and real-time speeds while maintaining high average precision.

Our system divides the input image into an S × S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.

Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object and also how accurate it thinks the box is that it predicts. Formally we define confidence as Pr(Object) ∗ IOU^truth_pred. If no object exists in that cell, the confidence scores should be zero. Otherwise we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.

Each bounding box consists of 5 predictions: x, y, w, h, and confidence. The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell. The width and height are predicted relative to the whole image. Finally the confidence prediction represents the IOU^truth_pred between the predicted box and any ground truth box.
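
The IOU^truth_pred term used throughout the paper is plain intersection over union. The helper below is a minimal sketch of our own (not code from the paper) for boxes given as center coordinates plus width and height:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, w, h),
    where (x, y) is the box center."""
    # Convert from center format to corner coordinates.
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    # Intersection rectangle; zero area if the boxes do not overlap.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0
```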

Each grid cell also predicts C conditional class probabilities, Pr(Class_i | Object). These probabilities are conditioned on the grid cell containing an object. We only predict one set of class probabilities per grid cell, regardless of the number of boxes B.

At test time we multiply the conditional class probabilities and the individual box confidence predictions,

Pr(Class_i | Object) ∗ Pr(Object) ∗ IOU^truth_pred = Pr(Class_i) ∗ IOU^truth_pred    (1)

which gives us class-specific confidence scores for each box. These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.

Figure 2: The Model. Our system models detection as a regression problem. It divides the image into an S × S grid and for each grid cell predicts B bounding boxes, confidence for those boxes, and C class probabilities. These predictions are encoded as an S × S × (B ∗ 5 + C) tensor.

For evaluating YOLO on PASCAL VOC, we use S = 7, B = 2. PASCAL VOC has 20 labelled classes so C = 20. Our final prediction is a 7 × 7 × 30 tensor.
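
To make this encoding concrete, the sketch below (our illustration, not the paper's code) decodes such a prediction tensor into class-specific confidence scores via Equation 1. The per-cell layout assumed here, B boxes of 5 values followed by C class probabilities, is a common convention rather than something the paper specifies:

```python
import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (PASCAL VOC)

def decode(pred):
    """pred: (S, S, B*5 + C) array. Returns a list of
    (cell_i, cell_j, box_index, class_index, class_specific_confidence)."""
    detections = []
    for i in range(S):
        for j in range(S):
            cell = pred[i, j]
            class_probs = cell[B * 5:]       # Pr(Class_i | Object)
            for b in range(B):
                x, y, w, h, conf = cell[b * 5: b * 5 + 5]
                scores = conf * class_probs  # Eq. 1: Pr(Class_i) * IOU
                c = int(np.argmax(scores))
                detections.append((i, j, b, c, float(scores[c])))
    return detections

dets = decode(np.random.rand(S, S, B * 5 + C))  # dummy tensor for illustration
```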

2.1. Network Design

We implement this model as a convolutional neural network and evaluate it on the PASCAL VOC detection dataset[9]. The initial convolutional layers of the network extract features from the image while the fully connected layers predict the output probabilities and coordinates.

Our network architecture is inspired by the GoogLeNet model for image classification [34]. Our network has 24 convolutional layers followed by 2 fully connected layers. Instead of the inception modules used by GoogLeNet, we simply use 1×1 reduction layers followed by 3×3 convolutional layers, similar to Lin et al [22]. The full network is shown in Figure 3.

Figure 3: The Architecture. Our detection network has 24 convolutional layers followed by 2 fully connected layers. Alternating 1 × 1 convolutional layers reduce the features space from preceding layers. We pretrain the convolutional layers on the ImageNet classification task at half the resolution (224 × 224 input image) and then double the resolution for detection.

We also train a fast version of YOLO designed to push the boundaries of fast object detection. Fast YOLO uses a neural network with fewer convolutional layers (9 instead of 24) and fewer filters in those layers. Other than the size of the network, all training and testing parameters are the same between YOLO and Fast YOLO.

The final output of our network is the 7 × 7 × 30 tensor of predictions.

2.2. Training

We pretrain our convolutional layers on the ImageNet 1000-class competition dataset [30]. For pretraining we use the first 20 convolutional layers from Figure 3 followed by an average-pooling layer and a fully connected layer. We train this network for approximately a week and achieve a single crop top-5 accuracy of 88% on the ImageNet 2012 validation set, comparable to the GoogLeNet models in Caffe’s Model Zoo [24]. We use the Darknet framework for all training and inference [26].

We then convert the model to perform detection. Ren et al. show that adding both convolutional and connected layers to pretrained networks can improve performance [29]. Following their example, we add four convolutional layers and two fully connected layers with randomly initialized weights. Detection often requires fine-grained visual information so we increase the input resolution of the network from 224 × 224 to 448 × 448.

Our final layer predicts both class probabilities and bounding box coordinates. We normalize the bounding box width and height by the image width and height so that they fall between 0 and 1. We parametrize the bounding box x and y coordinates to be offsets of a particular grid cell location so they are also bounded between 0 and 1.

We use a linear activation function for the final layer and all other layers use the following leaky rectified linear activation:

φ(x) = x, if x > 0; 0.1x, otherwise    (2)
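
As a one-line NumPy sketch of this activation:

```python
import numpy as np

def leaky_relu(x):
    # phi(x) = x for x > 0, and 0.1 * x otherwise.
    return np.where(x > 0, x, 0.1 * x)
```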

We optimize for sum-squared error in the output of our model. We use sum-squared error because it is easy to optimize, however it does not perfectly align with our goal of maximizing average precision. It weights localization error equally with classification error which may not be ideal. Also, in every image many grid cells do not contain any object. This pushes the “confidence” scores of those cells towards zero, often overpowering the gradient from cells that do contain objects. This can lead to model instability, causing training to diverge early on.

To remedy this, we increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don’t contain objects. We use two parameters, λ_coord and λ_noobj, to accomplish this. We set λ_coord = 5 and λ_noobj = .5.

Sum-squared error also equally weights errors in large boxes and small boxes. Our error metric should reflect that small deviations in large boxes matter less than in small boxes. To partially address this we predict the square root of the bounding box width and height instead of the width and height directly.

YOLO predicts multiple bounding boxes per grid cell. At training time we only want one bounding box predictor to be responsible for each object. We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth. This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall.

During training we optimize the following, multi-part loss function:

λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_ij^obj [ (x_i - x̂_i)² + (y_i - ŷ_i)² ]
+ λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_ij^obj [ (√w_i - √ŵ_i)² + (√h_i - √ĥ_i)² ]
+ Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_ij^obj (C_i - Ĉ_i)²
+ λ_noobj Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_ij^noobj (C_i - Ĉ_i)²
+ Σ_{i=0}^{S²} 1_i^obj Σ_{c ∈ classes} (p_i(c) - p̂_i(c))²    (3)

where 1_i^obj denotes if object appears in cell i and 1_ij^obj denotes that the jth bounding box predictor in cell i is “responsible” for that prediction.

Note that the loss function only penalizes classification error if an object is present in that grid cell (hence the conditional class probability discussed earlier). It also only penalizes bounding box coordinate error if that predictor is “responsible” for the ground truth box (i.e. has the highest IOU of any predictor in that grid cell).
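
The sketch below restates the full loss in NumPy (our illustration, not the Darknet implementation). It assumes the responsible-predictor assignment has already been made, with obj_mask marking the highest-IOU box in each object-containing cell, and that predicted widths and heights are non-negative:

```python
import numpy as np

LAMBDA_COORD, LAMBDA_NOOBJ = 5.0, 0.5

def yolo_loss(pred_box, true_box, pred_conf, true_conf,
              pred_cls, true_cls, obj_mask, cell_mask):
    """pred_box/true_box: (S, S, B, 4) arrays of (x, y, w, h);
    pred_conf/true_conf: (S, S, B); pred_cls/true_cls: (S, S, C);
    obj_mask: (S, S, B), 1 for the responsible predictor;
    cell_mask: (S, S), 1 if an object's center falls in the cell."""
    m = obj_mask[..., None]
    # Coordinate terms; sqrt on w/h so errors in large boxes weigh less.
    xy = np.sum(m * (pred_box[..., :2] - true_box[..., :2]) ** 2)
    wh = np.sum(m * (np.sqrt(pred_box[..., 2:]) - np.sqrt(true_box[..., 2:])) ** 2)
    # Confidence terms; boxes not responsible for any object are down-weighted.
    c_obj = np.sum(obj_mask * (pred_conf - true_conf) ** 2)
    c_noobj = np.sum((1 - obj_mask) * (pred_conf - true_conf) ** 2)
    # Classification term, only for cells that contain an object.
    cls = np.sum(cell_mask[..., None] * (pred_cls - true_cls) ** 2)
    return LAMBDA_COORD * (xy + wh) + c_obj + LAMBDA_NOOBJ * c_noobj + cls
```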

We train the network for about 135 epochs on the training and validation data sets from PASCAL VOC 2007 and 2012. When testing on 2012 we also include the VOC 2007 test data for training. Throughout training we use a batch size of 64, a momentum of 0.9 and a decay of 0.0005.

Our learning rate schedule is as follows: For the first epochs we slowly raise the learning rate from 10⁻³ to 10⁻². If we start at a high learning rate our model often diverges due to unstable gradients. We continue training with 10⁻² for 75 epochs, then 10⁻³ for 30 epochs, and finally 10⁻⁴ for 30 epochs.
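
As a sketch, the schedule reduces to a lookup by epoch. The linear warm-up over the first epoch here is our assumption, since the paper only says the rate is raised slowly:

```python
def learning_rate(epoch, warmup_epochs=1):
    # Warm up from 1e-3 to 1e-2, then step down at fixed points.
    if epoch < warmup_epochs:
        return 1e-3 + (1e-2 - 1e-3) * epoch / warmup_epochs
    if epoch < warmup_epochs + 75:
        return 1e-2
    if epoch < warmup_epochs + 105:
        return 1e-3
    return 1e-4
```
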
To avoid overfitting we use dropout and extensive data augmentation. A dropout layer with rate = .5 after the first connected layer prevents co-adaptation between layers [18]. For data augmentation we introduce random scaling and translations of up to 20% of the original image size. We also randomly adjust the exposure and saturation of the image by up to a factor of 1.5 in the HSV color space.
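
A hedged sketch of this augmentation using OpenCV (our illustration; the exact jitter sampling in the released Darknet code may differ, and box labels would need the same geometric transform applied):

```python
import random
import numpy as np
import cv2

def augment(image):
    h, w = image.shape[:2]
    # Random scaling and translation of up to 20% of the image size.
    scale = random.uniform(0.8, 1.2)
    tx = random.uniform(-0.2, 0.2) * w
    ty = random.uniform(-0.2, 0.2) * h
    m = np.float32([[scale, 0, tx], [0, scale, ty]])
    image = cv2.warpAffine(image, m, (w, h))
    # Random exposure (V) and saturation (S) by up to a factor of 1.5 in HSV.
    hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV).astype(np.float32)
    for ch in (1, 2):  # S and V channels
        hsv[..., ch] = np.clip(hsv[..., ch] * random.uniform(1 / 1.5, 1.5), 0, 255)
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)
```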

2.3. Inference

Just like in training, predicting detections for a test image only requires one network evaluation. On PASCAL VOC the network predicts 98 bounding boxes per image and class probabilities for each box. YOLO is extremely fast at test time since it only requires a single network evaluation, unlike classifier-based methods.

The grid design enforces spatial diversity in the bounding box predictions. Often it is clear which grid cell an object falls in to and the network only predicts one box for each object. However, some large objects or objects near the border of multiple cells can be well localized by multiple cells. Non-maximal suppression can be used to fix these multiple detections. While not critical to performance as it is for R-CNN or DPM, non-maximal suppression adds 2-3% in mAP.
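
A minimal greedy NMS sketch (ours, reusing the iou helper from Section 2): keep the highest-scoring box, suppress boxes of the same class that overlap it too much, and repeat:

```python
def nms(detections, iou_threshold=0.5):
    """detections: list of (box, score) for a single class,
    with box = (x, y, w, h); run once per class."""
    detections = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    for box, score in detections:
        if all(iou(box, kept_box) < iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept
```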

2.4. Limitations of YOLO

YOLO imposes strong spatial constraints on bounding box predictions since each grid cell only predicts two boxes and can only have one class. This spatial constraint limits the number of nearby objects that our model can predict. Our model struggles with small objects that appear in groups, such as flocks of birds.

Since our model learns to predict bounding boxes from data, it struggles to generalize to objects in new or unusual aspect ratios or configurations. Our model also uses relatively coarse features for predicting bounding boxes since our architecture has multiple downsampling layers from the input image.

Finally, while we train on a loss function that approximates detection performance, our loss function treats errors the same in small bounding boxes versus large bounding boxes. A small error in a large box is generally benign but a small error in a small box has a much greater effect on IOU. Our main source of error is incorrect localizations.

3. Comparison to Other Detection Systems

Object detection is a core problem in computer vision. Detection pipelines generally start by extracting a set of robust features from input images (Haar [25], SIFT [23], HOG [4], convolutional features [6]). Then, classifiers [36, 21, 13, 10] or localizers [1, 32] are used to identify objects in the feature space. These classifiers or localizers are run either in sliding window fashion over the whole image or on some subset of regions in the image [35, 15, 39]. We compare the YOLO detection system to several top detection frameworks, highlighting key similarities and differences.

Deformable parts models. Deformable parts models (DPM) use a sliding window approach to object detection [10]. DPM uses a disjoint pipeline to extract static features, classify regions, predict bounding boxes for high scoring regions, etc. Our system replaces all of these disparate parts with a single convolutional neural network. The network performs feature extraction, bounding box prediction, non-maximal suppression, and contextual reasoning all concurrently. Instead of static features, the network trains the features in-line and optimizes them for the detection task. Our unified architecture leads to a faster, more accurate model than DPM.

R-CNN. R-CNN and its variants use region proposals instead of sliding windows to find objects in images. Selective Search [35] generates potential bounding boxes, a convolutional network extracts features, an SVM scores the boxes, a linear model adjusts the bounding boxes, and non-max suppression eliminates duplicate detections. Each stage of this complex pipeline must be precisely tuned independently and the resulting system is very slow, taking more than 40 seconds per image at test time [14].

YOLO shares some similarities with R-CNN. Each grid cell proposes potential bounding boxes and scores those boxes using convolutional features. However, our system puts spatial constraints on the grid cell proposals which helps mitigate multiple detections of the same object. Our system also proposes far fewer bounding boxes, only 98 per image compared to about 2000 from Selective Search. Finally, our system combines these individual components into a single, jointly optimized model.

Other Fast Detectors. Fast and Faster R-CNN focus on speeding up the R-CNN framework by sharing computation and using neural networks to propose regions instead of Selective Search [14] [28]. While they offer speed and accuracy improvements over R-CNN, both still fall short of real-time performance.

Many research efforts focus on speeding up the DPM pipeline [31] [38] [5]. They speed up HOG computation, use cascades, and push computation to GPUs. However, only 30Hz DPM [31] actually runs in real-time.

Instead of trying to optimize individual components of a large detection pipeline, YOLO throws out the pipeline entirely and is fast by design.

Detectors for single classes like faces or people can be highly optimized since they have to deal with much less variation [37]. YOLO is a general purpose detector that learns to detect a variety of objects simultaneously.

Deep MultiBox. Unlike R-CNN, Szegedy et al. train a convolutional neural network to predict regions of interest [8] instead of using Selective Search. MultiBox can also perform single object detection by replacing the confidence prediction with a single class prediction. However, MultiBox cannot perform general object detection and is still just a piece in a larger detection pipeline, requiring further image patch classification. Both YOLO and MultiBox use a convolutional network to predict bounding boxes in an image but YOLO is a complete detection system.

OverFeat. Sermanet et al. train a convolutional neural network to perform localization and adapt that localizer to perform detection [32]. OverFeat efficiently performs sliding window detection but it is still a disjoint system. OverFeat optimizes for localization, not detection performance. Like DPM, the localizer only sees local information when making a prediction. OverFeat cannot reason about global context and thus requires significant post-processing to produce coherent detections.

MultiGrasp. Our work is similar in design to work on grasp detection by Redmon et al [27]. Our grid approach to bounding box prediction is based on the MultiGrasp system for regression to grasps. However, grasp detection is a much simpler task than object detection. MultiGrasp only needs to predict a single graspable region for an image containing one object. It doesn’t have to estimate the size, location, or boundaries of the object or predict its class, only find a region suitable for grasping. YOLO predicts both bounding boxes and class probabilities for multiple objects of multiple classes in an image.

4. Experiments

First we compare YOLO with other real-time detection systems on PASCAL VOC 2007. To understand the differences between YOLO and R-CNN variants we explore the errors on VOC 2007 made by YOLO and Fast R-CNN, one of the highest performing versions of R-CNN [14]. Based on the different error profiles we show that YOLO can be used to rescore Fast R-CNN detections and reduce the errors from background false positives, giving a significant performance boost. We also present VOC 2012 results and compare mAP to current state-of-the-art methods. Finally, we show that YOLO generalizes to new domains better than other detectors on two artwork datasets.

4.1. Comparison to Other Real-Time Systems

Many research efforts in object detection focus on making standard detection pipelines fast. [5] [38] [31] [14] [17] [28] However, only Sadeghi et al. actually produce a detection system that runs in real-time (30 frames per second or better) [31]. We compare YOLO to their GPU implementation of DPM which runs either at 30Hz or 100Hz. While the other efforts don’t reach the real-time milestone we also compare their relative mAP and speed to examine the accuracy-performance tradeoffs available in object detection systems.

Fast YOLO is the fastest object detection method on PASCAL; as far as we know, it is the fastest extant object detector. With 52.7% mAP, it is more than twice as accurate as prior work on real-time detection. YOLO pushes mAP to 63.4% while still maintaining real-time performance.

We also train YOLO using VGG-16. This model is more accurate but also significantly slower than YOLO. It is useful for comparison to other detection systems that rely on VGG-16 but since it is slower than real-time the rest of the paper focuses on our faster models.

Fastest DPM effectively speeds up DPM without sacrificing much mAP but it still misses real-time performance by a factor of 2 [38]. It also is limited by DPM’s relatively low accuracy on detection compared to neural network approaches.

R-CNN minus R replaces Selective Search with static bounding box proposals [20]. While it is much faster than R-CNN, it still falls short of real-time and takes a significant accuracy hit from not having good proposals.

Table 1: Real-Time Systems on PASCAL VOC 2007. Comparing the performance and speed of fast detectors. Fast YOLO is the fastest detector on record for PASCAL VOC detection and is still twice as accurate as any other real-time detector. YOLO is 10 mAP more accurate than the fast version while still well above real-time in speed.

Fast R-CNN speeds up the classification stage of R-CNN but it still relies on selective search which can take around 2 seconds per image to generate bounding box proposals. Thus it has high mAP but at 0.5 fps it is still far from real-time.

The recent Faster R-CNN replaces selective search with a neural network to propose bounding boxes, similar to Szegedy et al. [8] In our tests, their most accurate model achieves 7 fps while a smaller, less accurate one runs at 18 fps. The VGG-16 version of Faster R-CNN is 10 mAP higher but is also 6 times slower than YOLO. The Zeiler-Fergus Faster R-CNN is only 2.5 times slower than YOLO but is also less accurate.

4.2. VOC 2007 Error Analysis

To further examine the differences between YOLO and state-of-the-art detectors, we look at a detailed breakdown of results on VOC 2007. We compare YOLO to Fast R-CNN since Fast R-CNN is one of the highest performing detectors on PASCAL and its detections are publicly available.

We use the methodology and tools of Hoiem et al. [19] For each category at test time we look at the top N predictions for that category. Each prediction is either correct or it is classified based on the type of error (see the sketch after the list):

• Correct: correct class and IOU > .5
• Localization: correct class, .1 < IOU < .5
• Similar: class is similar, IOU > .1
• Other: class is wrong, IOU > .1
• Background: IOU < .1 for any object
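
In code, the bucketing looks like the sketch below (our illustration of the methodology, not Hoiem et al.'s released tools; similar_classes is an assumed lookup of visually similar categories):

```python
def error_bucket(pred_class, true_class, best_iou, similar_classes):
    """Assign one diagnostic bucket to a single detection."""
    if pred_class == true_class and best_iou > 0.5:
        return "correct"
    if pred_class == true_class and 0.1 < best_iou < 0.5:
        return "localization"
    if best_iou > 0.1 and true_class in similar_classes.get(pred_class, ()):
        return "similar"
    if best_iou > 0.1:
        return "other"
    return "background"  # IOU < .1 for any object
```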

Figure 4: Error Analysis: Fast R-CNN vs. YOLO These charts show the percentage of localization and background errors in the top N detections for various categories (N = # objects in that category).

Figure 4 shows the breakdown of each error type averaged across all 20 classes.

YOLO struggles to localize objects correctly. Localization errors account for more of YOLO’s errors than all other sources combined. Fast R-CNN makes much fewer localization errors but far more background errors. 13.6% of its top detections are false positives that don’t contain any objects. Fast R-CNN is almost 3x more likely to predict background detections than YOLO.

4.3. Combining Fast R-CNN and YOLO

YOLO makes far fewer background mistakes than Fast R-CNN. By using YOLO to eliminate background detections from Fast R-CNN we get a significant boost in performance. For every bounding box that R-CNN predicts we check to see if YOLO predicts a similar box. If it does, we give that prediction a boost based on the probability predicted by YOLO and the overlap between the two boxes.
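
The procedure can be sketched as below (our reconstruction; the paper does not spell out the exact boost formula, so combining YOLO's probability with the box overlap is an assumption on our part):

```python
def rescore(frcnn_dets, yolo_dets, iou_threshold=0.5):
    """frcnn_dets, yolo_dets: lists of (box, class_id, score).
    Boost Fast R-CNN detections that YOLO also predicts."""
    boosted = []
    for box, cls, score in frcnn_dets:
        boost = 0.0
        for ybox, ycls, yscore in yolo_dets:
            overlap = iou(box, ybox)
            if ycls == cls and overlap > iou_threshold:
                # Boost grows with YOLO's confidence and the overlap.
                boost = max(boost, yscore * overlap)
        boosted.append((box, cls, score + boost))
    return boosted
```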

The best Fast R-CNN model achieves a mAP of 71.8% on the VOC 2007 test set. When combined with YOLO, its mAP increases by 3.2% to 75.0%. We also tried combining the top Fast R-CNN model with several other versions of Fast R-CNN. Those ensembles produced small increases in mAP between .3 and .6%, see Table 2 for details.

Table 2: Model combination experiments on VOC 2007. We examine the effect of combining various models with the best version of Fast R-CNN. Other versions of Fast R-CNN provide only a small benefit while YOLO provides a significant performance boost.

The boost from YOLO is not simply a byproduct of model ensembling since there is little benefit from combining different versions of Fast R-CNN. Rather, it is precisely because YOLO makes different kinds of mistakes at test time that it is so effective at boosting Fast R-CNN’s performance.

Unfortunately, this combination doesn’t benefit from the speed of YOLO since we run each model separately and then combine the results. However, since YOLO is so fast it doesn’t add any significant computational time compared to Fast R-CNN.

Table 3: PASCAL VOC 2012 Leaderboard. YOLO compared with the full comp4 (outside data allowed) public leaderboard as of November 6th, 2015. Mean average precision and per-class average precision are shown for a variety of detection methods. YOLO is the only real-time detector. Fast R-CNN + YOLO is the fourth highest scoring method, with a 2.3% boost over Fast R-CNN.

4.4. VOC 2012 Results

On the VOC 2012 test set, YOLO scores 57.9% mAP. This is lower than the current state of the art, closer to the original R-CNN using VGG-16, see Table 3. Our system struggles with small objects compared to its closest competitors. On categories like bottle, sheep, and tv/monitor YOLO scores 8-10% lower than R-CNN or Feature Edit. However, on other categories like cat and train YOLO achieves higher performance.

Our combined Fast R-CNN + YOLO model is one of the highest performing detection methods. Fast R-CNN gets a 2.3% improvement from the combination with YOLO, boosting it 5 spots up on the public leaderboard.

4.5. Generalizability: Person Detection in Artwork

Academic datasets for object detection draw the training and testing data from the same distribution. In real-world applications it is hard to predict all possible use cases and the test data can diverge from what the system has seen before [3]. We compare YOLO to other detection systems on the Picasso Dataset [12] and the People-Art Dataset [3], two datasets for testing person detection on artwork.

Figure 5: Generalization results on Picasso and People-Art datasets. (a) Precision-recall curves on the Picasso Dataset. (b) Quantitative results on VOC 2007, Picasso, and People-Art. The Picasso Dataset evaluates on both AP and best F1 score.

Figure 6: Qualitative Results. YOLO running on sample artwork and natural images from the internet. It is mostly accurate although it does think one person is an airplane.

Figure 5 shows comparative performance between YOLO and other detection methods. For reference, we give VOC 2007 detection AP on person where all models are trained only on VOC 2007 data. On Picasso models are trained on VOC 2012 while on People-Art they are trained on VOC 2010.

R-CNN has high AP on VOC 2007. However, R-CNN drops off considerably when applied to artwork. R-CNN uses Selective Search for bounding box proposals which is tuned for natural images. The classifier step in R-CNN only sees small regions and needs good proposals.

DPM maintains its AP well when applied to artwork. Prior work theorizes that DPM performs well because it has strong spatial models of the shape and layout of objects. Though DPM doesn’t degrade as much as R-CNN, it starts from a lower AP.

YOLO has good performance on VOC 2007 and its AP degrades less than other methods when applied to artwork. Like DPM, YOLO models the size and shape of objects, as well as relationships between objects and where objects commonly appear. Artwork and natural images are very different on a pixel level but they are similar in terms of the size and shape of objects, thus YOLO can still predict good bounding boxes and detections.

5. Real-Time Detection In The Wild

YOLO is a fast, accurate object detector, making it ideal for computer vision applications. We connect YOLO to a webcam and verify that it maintains real-time performance,including the time to fetch images from the camera and display the detections.

The resulting system is interactive and engaging. While YOLO processes images individually, when attached to a webcam it functions like a tracking system, detecting objects as they move around and change in appearance. A demo of the system and the source code can be found on our project website: http://pjreddie.com/yolo/.

6. Conclusion

We introduce YOLO, a unified model for object detection. Our model is simple to construct and can be trained directly on full images. Unlike classifier-based approaches, YOLO is trained on a loss function that directly corresponds to detection performance and the entire model is trained jointly.

Fast YOLO is the fastest general-purpose object detector in the literature and YOLO pushes the state-of-the-art in real-time object detection. YOLO also generalizes well to new domains making it ideal for applications that rely on fast, robust object detection.

Acknowledgements: This work is partially supported by ONR N00014-13-1-0720, NSF IIS- 1338054, and The Allen Distinguished Investigator Award.
