You Only Look Once: Unified, Real-Time Object Detection

Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi

University of Washington
Allen Institute for AI
Facebook AI Research

https://pjreddie.com/darknet/yolov1/
https://pjreddie.com/darknet/yolov2/
https://pjreddie.com/darknet/yolo/


arXiv (archive - the X represents the Greek letter chi [χ]) is a repository of electronic preprints approved for posting after moderation, but not full peer review.

Abstract

We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.

Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background. Finally, YOLO learns very general representations of objects. It outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.

YOLO is a regression-based framework for localizing and recognizing objects, with strong generalization ability. YOLO struggles to localize objects correctly (it makes comparatively many localization errors), but it makes far fewer background errors (false positives that don't contain any objects).

YOLO uses neither region proposals nor sliding windows; its whole detection pipeline is a single network.

1. Introduction

Humans glance at an image and instantly know what objects are in the image, where they are, and how they interact. The human visual system is fast and accurate, allowing us to perform complex tasks like driving with little conscious thought. Fast, accurate algorithms for object detection would allow computers to drive cars without specialized sensors, enable assistive devices to convey real-time scene information to human users, and unlock the potential for general purpose, responsive robotic systems.

Current detection systems repurpose classifiers to perform detection. To detect an object, these systems take a classifier for that object and evaluate it at various locations and scales in a test image. Systems like deformable parts models (DPM) use a sliding window approach where the classifier is run at evenly spaced locations over the entire image [10].

Deformable parts models (DPM) use a sliding window approach. In such pipelines, the window or region proposal supplies the location information, while the classifier supplies the class information.

More recent approaches like R-CNN use region proposal methods to first generate potential bounding boxes in an image and then run a classifier on these proposed boxes. After classification, post-processing is used to refine the bounding boxes, eliminate duplicate detections, and rescore the boxes based on other objects in the scene [13]. These complex pipelines are slow and hard to optimize because each individual component must be trained separately.

R-CNN uses a region proposal method.
Object detection historically followed a very intuitive recipe: cut the image into small patches, feed each patch to an algorithm, and when the algorithm decides an object is present in a patch, the detection is done and the object's location is recovered. This is the early detection paradigm, e.g., R-CNN. Fast R-CNN and Faster R-CNN no longer push patches through the CNN one by one; instead the whole image goes through the CNN once to produce a feature map, which is then processed further. The pipeline is still split into region proposal and object classification stages (two-stage), which preserves accuracy but is slow.

R-CNN / Fast R-CNN obtain candidate boxes (rectangular regions likely to contain objects) with a module that is separate from the network (Selective Search), and training is likewise split into several stages. Faster R-CNN replaces the Selective Search module with a Region Proposal Network (RPN) and integrates the RPN into the Fast R-CNN detection network, giving one unified detection network. Although the RPN shares its core convolutional layers with Fast R-CNN, training still alternates between the RPN and the Fast R-CNN branches (the shared convolutional layers have tied parameters).


We reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities. Using our system, you only look once (YOLO) at an image to predict what objects are present and where they are.

YOLO treats object detection as a regression problem, predicting bounding boxes and class probabilities directly from pixel values.

YOLO is refreshingly simple: see Figure 1. A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes. YOLO trains on full images and directly optimizes detection performance. This unified model has several benefits over traditional methods of object detection.

Figure 1: The YOLO Detection System. Processing images with YOLO is simple and straightforward. Our system (1) resizes the input image to 448 × 448, (2) runs a single convolutional network on the image, and (3) thresholds the resulting detections by the model's confidence.

First, YOLO is extremely fast. Since we frame detection as a regression problem we don’t need a complex pipeline. We simply run our neural network on a new image at test time to predict detections. Our base network runs at 45 frames per second with no batch processing on a Titan X GPU and a fast version runs at more than 150 fps. This means we can process streaming video in real-time with less than 25 milliseconds of latency. Furthermore, YOLO achieves more than twice the mean average precision of other real-time systems. For a demo of our system running in real-time on a webcam please see our project webpage: https://pjreddie.com/darknet/yolo/.

The regression-based framework is simple and needs no extra, complex pipeline.

Second, YOLO reasons globally about the image when making predictions. Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time so it implicitly encodes contextual information about classes as well as their appearance. Fast R-CNN, a top detection method [14], mistakes background patches in an image for objects because it can’t see the larger context. YOLO makes less than half the number of background errors compared to Fast R-CNN.

Fast R-CNN produces these false positives partly because it cannot see the larger context. Because YOLO sees the entire image during training and test time, it implicitly encodes contextual information about classes as well as their appearance, and therefore makes far fewer background errors (mistaking background for objects).

Third, YOLO learns generalizable representations of objects. When trained on natural images and tested on artwork, YOLO outperforms top detection methods like DPM and R-CNN by a wide margin. Since YOLO is highly generalizable it is less likely to break down when applied to new domains or unexpected inputs.

YOLO still lags behind state-of-the-art detection systems in accuracy. While it can quickly identify objects in images it struggles to precisely localize some objects, especially small ones. We examine these tradeoffs further in our experiments.

All of our training and testing code is open source. A variety of pretrained models are also available to download.

2. Unified Detection

We unify the separate components of object detection into a single neural network. Our network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously. This means our network reasons globally about the full image and all the objects in the image. The YOLO design enables end-to-end training and realtime speeds while maintaining high average precision.

Our system divides the input image into an $S \times S$ grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.

The input image is divided into $S \times S$ grid cells; the cell containing a target's center is responsible for detecting and recognizing that target.

Each grid cell predicts $B$ bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object and also how accurate it thinks the box is that it predicts. Formally we define confidence as $Pr(Object) \ast IOU^{truth}_{pred}$. If no object exists in that cell, the confidence scores should be zero. Otherwise we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.

Each grid cell predicts $B$ bounding boxes and a confidence score for each box, confidence $= Pr(Object) \ast IOU^{truth}_{pred}$. If the cell contains no target ($Pr(Object) = 0$), the confidence score is 0; if it does contain a target ($Pr(Object) = 1$), the confidence score equals the IOU (intersection over union) between the predicted box and the ground-truth box.
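For reference, a minimal IOU helper (our own sketch, not Darknet code) that computes the overlap used in this confidence target might look like:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2) corners."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# The confidence target of a responsible predictor is
# Pr(Object) * IOU = 1 * iou(predicted_box, ground_truth_box).
```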

Each bounding box consists of 5 predictions: $x, y, w, h$, and confidence. The $(x, y)$ coordinates represent the center of the box relative to the bounds of the grid cell. The width and height are predicted relative to the whole image. Finally the confidence prediction represents the IOU between the predicted box and any ground truth box.

The $(x, y)$ coordinates give the target's center as an offset from the top-left corner of its grid cell, so they lie in $[0, 1]$ (the cell's top-left corner is $(0, 0)$ and its bottom-right corner is $(1, 1)$); $w, h$ are relative to the whole image and also lie in $[0, 1]$.
All of $x, y, w, h$ are normalized to $[0, 1]$. Regression outputs are normalized because otherwise the output dimensions would have very different ranges: the dimensions with large values would dominate the loss, and the network would focus on shrinking their error while neglecting the others.
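To make the parametrization concrete, here is a small sketch (hypothetical helper, S = 7 assumed) that converts an absolute ground-truth box into the cell-relative (x, y) and image-relative (w, h) targets described above:

```python
def encode_box(cx, cy, bw, bh, img_w, img_h, S=7):
    """Convert an absolute box center (cx, cy) and size (bw, bh), in pixels,
    into YOLO targets: the responsible cell plus normalized x, y, w, h."""
    col = min(int(cx / img_w * S), S - 1)   # grid column containing the center
    row = min(int(cy / img_h * S), S - 1)   # grid row containing the center
    x = cx / img_w * S - col                # center offset inside the cell, in [0, 1]
    y = cy / img_h * S - row
    w = bw / img_w                          # size relative to the whole image, in [0, 1]
    h = bh / img_h
    return row, col, x, y, w, h

# Example: a 100 x 80 pixel object centered at (224, 224) in a 448 x 448 image
# falls in cell (row=3, col=3) with x = y = 0.5, w ~ 0.223, h ~ 0.179.
```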

Each grid cell also predicts $C$ conditional class probabilities, $Pr(Class_i | Object)$. These probabilities are conditioned on the grid cell containing an object. We only predict one set of class probabilities per grid cell, regardless of the number of boxes $B$.

At test time we multiply the conditional class probabilities and the individual box confidence predictions,

$$Pr(Class_i | Object) \ast Pr(Object) \ast IOU^{truth}_{pred} = Pr(Class_i) \ast IOU^{truth}_{pred} \tag{1}$$

which gives us class-specific confidence scores for each box. These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.

The confidence score carries two pieces of information: how likely the predicted box is to contain an object, and how accurate the predicted box location is.

At test time, the conditional class probabilities $Pr(Class_i | Object)$ are multiplied by the individual box confidences $Pr(Object) \ast IOU^{truth}_{pred}$, giving class-specific confidence scores for each box, $Pr(Class_i) \ast IOU^{truth}_{pred}$.

Because the output layer is fully connected, at detection time the trained YOLO model only supports the same input resolution as the training images.

Although each grid cell predicts $B$ bounding boxes, only the box with the highest class-specific score $Pr(Class_i) \ast IOU^{truth}_{pred}$ is kept as the detection output, so each grid cell can detect at most one object. When objects are small relative to the image, for example a flock of birds, a cell may contain several objects but can only detect one of them.
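A tiny numpy illustration of this product for one grid cell (the shapes are what matter; the values here are random placeholders):

```python
import numpy as np

class_probs = np.random.rand(20)    # Pr(Class_i | Object), one set per grid cell
box_conf = np.random.rand(2)        # Pr(Object) * IOU, one value per box (B = 2)

# Class-specific confidence scores, Eq. (1): shape (2, 20).
class_scores = box_conf[:, None] * class_probs[None, :]

# Both boxes share the same class ranking (scaling by a positive confidence
# does not change the argmax), which is why a cell outputs a single class.
assert class_scores[0].argmax() == class_scores[1].argmax()
```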

Training and inference differ: during training, the annotations determine which grid cell each object falls into and what its position is; at inference time, the trained model simply predicts.


For evaluating YOLO on PASCAL VOC, we use $S = 7$, $B = 2$. PASCAL VOC has 20 labelled classes so $C = 20$. Our final prediction is a $7 \times 7 \times 30$ tensor.

Each grid cell's output is 30-dimensional: 8 (4 + 4) dimensions are the bounding box coordinates $x, y, w, h$ of the two boxes, 2 (1 + 1) dimensions are their confidence scores, and 20 dimensions are the conditional class probabilities.
The 49 ($7 \times 7$) grid cells each give 20 class probabilities, so one image yields 980 ($7 \times 7 \times 20$) predicted probabilities. Since most of these probabilities are zero, training can become unstable and diverge early on.

YOLOv1 does not use anchor boxes; each grid cell directly predicts B boxes. Because there is no anchor-box prior, the ground-truth boxes in the training set can be used directly, without resizing them relative to anchors. After YOLOv2 introduced the anchor-box mechanism, anchors must be initialized, and the regression outputs become offsets and scalings of the anchor boxes.

In the $7 \times 7 \times 30$ tensor, each $1 \times 1 \times 30$ slice corresponds to one of the $7 \times 7$ grid cells of the original image and contains that cell's class prediction and box coordinate predictions. The grid cell carries the class information; the bounding boxes mainly carry the coordinate information (and partly class information, since the confidence can be seen as part of it).

Each cell (one $1 \times 1 \times 30$ slice, corresponding to one grid cell of the original image) predicts 2 bounding boxes. Predicting $B$ boxes per cell lets the model adapt to objects of different scales and aspect ratios and improves localization accuracy.
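A sketch of how the 1470-value prediction could be reshaped and sliced following the 8 + 2 + 20 layout described above (the actual ordering inside Darknet's output may differ; this is only illustrative):

```python
import numpy as np

S, B, C = 7, 2, 20
pred = np.random.rand(S * S * (B * 5 + C))     # raw network output, 1470 values
pred = pred.reshape(S, S, B * 5 + C)           # 7 x 7 x 30

boxes       = pred[..., :B * 4].reshape(S, S, B, 4)   # x, y, w, h for each of the 2 boxes
confidences = pred[..., B * 4:B * 5]                  # 2 confidence scores per cell
class_probs = pred[..., B * 5:]                       # 20 shared class probabilities

print(boxes.shape, confidences.shape, class_probs.shape)
# (7, 7, 2, 4) (7, 7, 2) (7, 7, 20)
```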

Figure 2: The Model. Our system models detection as a regression problem. It divides the image into an $S \times S$ grid and for each grid cell predicts $B$ bounding boxes, confidence for those boxes, and $C$ class probabilities. These predictions are encoded as an $S \times S \times (B \ast 5 + C)$ tensor.

Class information is per grid cell, while confidence is per bounding box; all $B$ boxes predicted by one grid cell share the same class scores.

Object detection is usually cast as a classification problem: first generate many potential bounding boxes that might contain objects (e.g., with region proposal methods), then run a classifier on each box to decide whether it contains an object and with what class probability or confidence, as in R-CNN.


The input image is divided into $7 \times 7$ cells and each cell detects targets independently. During inference the grid only partitions the possible locations of object centers; the image is not sliced into patches, and the cells are never considered in isolation from the whole image. Box positions, sizes and object classes are all predicted by the CNN.

2.1. Network Design

We implement this model as a convolutional neural network and evaluate it on the PASCAL VOC detection dataset [9]. The initial convolutional layers of the network extract features from the image while the fully connected layers predict the output probabilities and coordinates.

Our network architecture is inspired by the GoogLeNet model for image classification [34]. Our network has 24 convolutional layers followed by 2 fully connected layers. Instead of the inception modules used by GoogLeNet, we simply use 1 × 1 reduction layers followed by 3 × 3 convolutional layers, similar to Lin et al [22]. The full network is shown in Figure 3.

Because the output layer is fully connected, at detection time the trained YOLO model only supports the same input resolution as the training images.

We also train a fast version of YOLO designed to push the boundaries of fast object detection. Fast YOLO uses a neural network with fewer convolutional layers (9 instead of 24) and fewer filters in those layers. Other than the size of the network, all training and testing parameters are the same between YOLO and Fast YOLO.

Figure 3: The Architecture. Our detection network has 24 convolutional layers followed by 2 fully connected layers. Alternating 1 × 1 convolutional layers reduce the features space from preceding layers. We pretrain the convolutional layers on the ImageNet classification task at half the resolution (224 × 224 input image) and then double the resolution for detection.

With a $448 \times 448$ input image and six stride-2 reductions, $448 / 2^6 = 448 / 64 = 7$. The output has $S \times S \times (B \ast 5 + C) = 7 \times 7 \times (2 \ast 5 + 20) = 7 \times 7 \times 30 = 1470$ values.

Each image yields $S \times S \times B = 7 \times 7 \times 2 = 98$ predicted boxes. During training each grid cell predicts several boxes but only one class, so the class predictions are held to a higher standard and tend to be accurate; YOLO's errors come mainly from localization. Because of the fully connected layers, the input image size cannot be changed.

The final output of our network is the $7 \times 7 \times 30$ tensor of predictions.

Computing $Pr(Class_i | Object) \ast Pr(Object) \ast IOU^{truth}_{pred} = Pr(Class_i) \ast IOU^{truth}_{pred}$ gives each box a score for every class, i.e., a $20 \times 98$ score matrix (20 classes, 98 boxes). The following is then done class by class (row by row): set every score below a threshold of 0.2 to 0 and sort the remaining scores from high to low, then run NMS to drop heavily overlapping boxes (for the class in question, take the highest-scoring box, compute its IOU with every other box, and set to 0 the score of any box whose IOU exceeds 0.5; boxes with IOU at most 0.5 keep their scores; then move on to the next highest remaining score and repeat until done). Finally, for each box take the maximum of its 20 scores: if it is greater than 0, the box is output with that class (that row of the matrix); if it is less than or equal to 0, the box contains no object and is skipped.

Two boxes from the same grid cell can belong to the same class, but a single cell can never output boxes of two different classes.
For the B = 2 boxes in one cell, the ranking of the 20 class probabilities $Pr(Class_i | Object)$ is fixed. Multiplying the shared class probabilities by each box's own confidence $Pr(Object) \ast IOU^{truth}_{pred}$ does not change that ranking (it is a linear scaling), and the IOU between the two boxes is the same for every class, since their positions do not change (for example, if the 20 classes include car and cat, box 1 might score 0.8 and 0.6 for them and box 2 might score 0.4 and 0.3). During per-class NMS, if the IOU between the two boxes exceeds 0.5, box 2's scores (0.4 and 0.3) are set to 0 and box 2 is discarded; if the IOU is below 0.5, box 2 survives. Each surviving box finally outputs its highest-scoring class, and because of the shared ranking that class is the same for both boxes (if car has the highest probability, both boxes predict car). Hence a cell never outputs two different classes.
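A simplified sketch of this test-time procedure (score threshold 0.2, NMS IOU threshold 0.5; the box format and helper names are our assumptions, not Darknet code):

```python
import numpy as np

def iou_xyxy(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def yolo_postprocess(scores, boxes, score_thresh=0.2, nms_thresh=0.5):
    """scores: (20, 98) class-specific confidences; boxes: (98, 4) in x1, y1, x2, y2."""
    scores = scores.copy()
    scores[scores < score_thresh] = 0.0
    for c in range(scores.shape[0]):                 # NMS, one class at a time
        order = np.argsort(-scores[c])               # high to low
        for i_idx, i in enumerate(order):
            if scores[c, i] == 0:
                continue
            for j in order[i_idx + 1:]:
                if scores[c, j] > 0 and iou_xyxy(boxes[i], boxes[j]) > nms_thresh:
                    scores[c, j] = 0.0               # suppress the overlapping box
    detections = []
    for b in range(boxes.shape[0]):                  # keep each box's best class
        c = int(scores[:, b].argmax())
        if scores[c, b] > 0:
            detections.append((boxes[b], c, float(scores[c, b])))
    return detections
```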


2.2. Training

We pretrain our convolutional layers on the ImageNet 1000-class competition dataset [30]. For pretraining we use the first 20 convolutional layers from Figure 3 followed by an average-pooling layer and a fully connected layer. We train this network for approximately a week and achieve a single crop top-5 accuracy of 88% on the ImageNet 2012 validation set, comparable to the GoogLeNet models in Caffe's Model Zoo [24]. We use the Darknet framework for all training and inference [26].

We then convert the model to perform detection. Ren et al. show that adding both convolutional and connected layers to pretrained networks can improve performance [29]. Following their example, we add four convolutional layers and two fully connected layers with randomly initialized weights. Detection often requires fine-grained visual information so we increase the input resolution of the network from 224 × 224 to 448 × 448.

Our final layer predicts both class probabilities and bounding box coordinates. We normalize the bounding box width and height by the image width and height so that they fall between 0 and 1. We parametrize the bounding box $x$ and $y$ coordinates to be offsets of a particular grid cell location so they are also bounded between 0 and 1.

We use a linear activation function for the final layer and all other layers use the following leaky rectified linear activation:

$$\phi(x) = \begin{cases} x, & \text{if } x > 0 \\ 0.1x, & \text{otherwise} \end{cases} \tag{2}$$
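For reference, Eq. (2) written as a small numpy function (a generic Leaky ReLU with slope 0.1; not tied to any particular framework):

```python
import numpy as np

def leaky_relu(x, alpha=0.1):
    # phi(x) = x if x > 0, else 0.1 * x  (Eq. 2)
    return np.where(x > 0, x, alpha * x)
```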


We optimize for sum-squared error in the output of our model. We use sum-squared error because it is easy to optimize, however it does not perfectly align with our goal of maximizing average precision. It weights localization error equally with classification error which may not be ideal. Also, in every image many grid cells do not contain any object. This pushes the “confidence” scores of those cells towards zero, often overpowering the gradient from cells that do contain objects. This can lead to model instability, causing training to diverge early on.

Problems with sum-squared error:
(1) It weights localization error equally with classification error, which may not be ideal.
(2) In every image many grid cells do not contain any object, which pushes their confidence scores towards zero, often overpowering the gradient from cells that do contain objects.

Remedies:
(1) $\lambda_{coord} = 5$: give the bounding box coordinate loss a larger weight.
(2) $\lambda_{noobj} = 0.5$: give the confidence loss of boxes that contain no object a smaller weight.


To remedy this, we increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don't contain objects. We use two parameters, $\lambda_{coord}$ and $\lambda_{noobj}$ to accomplish this. We set $\lambda_{coord} = 5$ and $\lambda_{noobj} = .5$.

Sum-squared error also equally weights errors in large boxes and small boxes. Our error metric should reflect that small deviations in large boxes matter less than in small boxes. To partially address this we predict the square root of the bounding box width and height instead of the width and height directly.

y = sqrt(x)


The square roots of the box width and height are predicted in place of the raw width and height. A small box sits at a small value on the x-axis of y = sqrt(x), where the curve is steep, so the same offset produces a larger change on the y-axis than it does for a large box.
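A quick numeric check of the square-root trick (illustrative values chosen by us): the same 5-pixel width error costs the same for a large and a small box under plain sum-squared error, but costs the small box far more once square roots are used.

```python
import math

def sq_err(w_true, w_pred):
    return (w_true - w_pred) ** 2

def sqrt_err(w_true, w_pred):
    return (math.sqrt(w_true) - math.sqrt(w_pred)) ** 2

# Large box 100 -> 95 and small box 10 -> 5: the same absolute error of 5.
print(sq_err(100, 95), sq_err(10, 5))       # 25 vs 25: identical penalty
print(sqrt_err(100, 95), sqrt_err(10, 5))   # ~0.064 vs ~0.858: small box penalized more
```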

YOLO predicts multiple bounding boxes per grid cell. At training time we only want one bounding box predictor to be responsible for each object. We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth. This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall.

During training we optimize the following, multi-part loss function:

$$
\begin{aligned}
&\lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{I}^{obj}_{ij} \left[ \left( x_i - \hat{x}_i \right)^2 + \left( y_i - \hat{y}_i \right)^2 \right] \\
&+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{I}^{obj}_{ij} \left[ \left( \sqrt{w_i} - \sqrt{\hat{w}_i} \right)^2 + \left( \sqrt{h_i} - \sqrt{\hat{h}_i} \right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{I}^{obj}_{ij} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{I}^{noobj}_{ij} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{I}^{obj}_{i} \sum_{c \in classes} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned} \tag{3}
$$

Backpropagation of this loss flows through the entire network, so the model is trained end-to-end.

where $\mathbb{I}^{obj}_{i}$ denotes if object appears in cell $i$ and $\mathbb{I}^{obj}_{ij}$ denotes that the $j$th bounding box predictor in cell $i$ is "responsible" for that prediction.

In the loss function, the first two rows are the bounding box errors, the third and fourth rows are the confidence errors (for boxes that contain an object and for boxes that do not), and the fifth row is the classification error. The first row is the center location error; the second row is the box scale error, computed on square roots.

The fifth row, the classification error, is computed per grid cell, not per box. $\mathbb{I}^{obj}_{i}$ is 1 only when cell $i$ is responsible for some object, i.e., the center of that object's ground-truth box falls inside the cell. For example, in an image with three objects, only three of the 7 × 7 = 49 cells contribute to the classification error; the rest are ignored.

The first four rows are computed per box. $\mathbb{I}^{obj}_{ij}$ is 1 only when the $j$th box of cell $i$ is responsible for an object (among the B = 2 boxes of a responsible cell, the one with the highest IOU against the ground truth is chosen). In the same three-object example, only three boxes contribute to the first three rows, while the remaining 7 × 7 × 2 - 3 = 95 boxes contribute only to the fourth row; the terms are clearly unbalanced, which is why the weighting factors are introduced.


In YOLOv1 each box can only be assigned one class, whereas in YOLOv2 each box predicts probabilities for all classes.

In the YOLO loss (coordError + iouError + classError), the IOU error of a large object and that of a small object contribute roughly equally to the training loss (the square-root trick helps, but does not fundamentally solve the problem).
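Putting the pieces together, here is a simplified, unvectorized sketch of the loss in Eq. (3) (our own reformulation for readability; the data layout is an assumption, the responsible-box assignment is assumed to be done beforehand, and implementations differ slightly in how non-responsible boxes inside object cells are handled):

```python
LAMBDA_COORD, LAMBDA_NOOBJ = 5.0, 0.5

def yolo_loss(pred, target):
    """Sum-squared YOLO loss, Eq. (3), over one image.

    pred['box'][(i, j)]  = (x, y, w, h, C) for box j of cell i (all in [0, 1])
    pred['class'][i]     = list of 20 class probabilities for cell i
    target mirrors pred and additionally provides:
    target['obj_ij']     = set of (i, j) pairs responsible for an object
    target['obj_i']      = set of cells i that contain an object
    """
    loss = 0.0
    for (i, j), (x, y, w, h, C) in pred['box'].items():
        tx, ty, tw, th, tC = target['box'][(i, j)]
        if (i, j) in target['obj_ij']:                      # responsible predictor
            loss += LAMBDA_COORD * ((x - tx) ** 2 + (y - ty) ** 2)
            loss += LAMBDA_COORD * ((w ** 0.5 - tw ** 0.5) ** 2 +
                                    (h ** 0.5 - th ** 0.5) ** 2)
            loss += (C - tC) ** 2                           # confidence, object present
        else:
            loss += LAMBDA_NOOBJ * (C - tC) ** 2            # confidence, no object
    for i, p in pred['class'].items():                      # classification term, per cell
        if i in target['obj_i']:
            t = target['class'][i]
            loss += sum((p[c] - t[c]) ** 2 for c in range(len(p)))
    return loss
```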

Note that the loss function only penalizes classification error if an object is present in that grid cell (hence the conditional class probability discussed earlier). It also only penalizes bounding box coordinate error if that predictor is “responsible” for the ground truth box (i.e. has the highest IOU of any predictor in that grid cell).

We train the network for about 135 epochs on the training and validation data sets from PASCAL VOC 2007 and 2012. When testing on 2012 we also include the VOC 2007 test data for training. Throughout training we use a batch size of 64, a momentum of 0.9 and a decay of 0.0005.

Our learning rate schedule is as follows: For the first epochs we slowly raise the learning rate from $10^{-3}$ to $10^{-2}$. If we start at a high learning rate our model often diverges due to unstable gradients. We continue training with $10^{-2}$ for 75 epochs, then $10^{-3}$ for 30 epochs, and finally $10^{-4}$ for 30 epochs.
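The schedule can be written as a simple lookup (the exact length of the initial warm-up is not stated precisely in the paper, so it is an assumption here):

```python
def learning_rate(epoch, warmup_epochs=5):
    """Learning rate for a given training epoch, following the schedule above."""
    if epoch < warmup_epochs:                         # slowly raise 1e-3 -> 1e-2
        return 1e-3 + (1e-2 - 1e-3) * epoch / warmup_epochs
    if epoch < warmup_epochs + 75:                    # 75 epochs at 1e-2
        return 1e-2
    if epoch < warmup_epochs + 75 + 30:               # 30 epochs at 1e-3
        return 1e-3
    return 1e-4                                       # final 30 epochs at 1e-4
```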

To avoid overfitting we use dropout and extensive data augmentation. A dropout layer with rate = .5 after the first connected layer prevents co-adaptation between layers [18]. For data augmentation we introduce random scaling and translations of up to 20% of the original image size. We also randomly adjust the exposure and saturation of the image by up to a factor of 1.5 in the HSV color space.

2.3. Inference

Just like in training, predicting detections for a test image only requires one network evaluation. On PASCAL VOC the network predicts 98 bounding boxes per image and class probabilities for each box. YOLO is extremely fast at test time since it only requires a single network evaluation, unlike classifier-based methods.

The grid design enforces spatial diversity in the bounding box predictions. Often it is clear which grid cell an object falls in to and the network only predicts one box for each object. However, some large objects or objects near the border of multiple cells can be well localized by multiple cells. Non-maximal suppression can be used to fix these multiple detections. While not critical to performance as it is for R-CNN or DPM, non-maximal suppression adds 2-3% in mAP.

2.4. Limitations of YOLO

YOLO imposes strong spatial constraints on bounding box predictions since each grid cell only predicts two boxes and can only have one class. This spatial constraint limits the number of nearby objects that our model can predict. Our model struggles with small objects that appear in groups, such as flocks of birds.

Each grid cell predicts two boxes but only one class. If the centers of several different objects (or several instances of the same class) fall into the same grid cell, the cell's boxes can only output a single class, so some of those objects are missed.

Since our model learns to predict bounding boxes from data, it struggles to generalize to objects in new or unusual aspect ratios or configurations. Our model also uses relatively coarse features for predicting bounding boxes since our architecture has multiple downsampling layers from the input image.

Because the YOLO model uses multiple downsampling layers, the features it learns are relatively coarse.

Finally, while we train on a loss function that approximates detection performance, our loss function treats errors the same in small bounding boxes versus large bounding boxes. A small error in a large box is generally benign but a small error in a small box has a much greater effect on IOU. Our main source of error is incorrect localizations.

The YOLO loss function treats errors the same in small bounding boxes versus large bounding boxes: large-object and small-object IOU errors contribute similar amounts to the training loss (the square-root trick mitigates this but does not fundamentally solve it).


3. Comparison to Other Detection Systems

Object detection is a core problem in computer vision. Detection pipelines generally start by extracting a set of robust features from input images (Haar [25], SIFT [23], HOG [4], convolutional features [6]). Then, classifiers [36, 21, 13, 10] or localizers [1, 32] are used to identify objects in the feature space. These classifiers or localizers are run either in sliding window fashion over the whole image or on some subset of regions in the image [35, 15, 39]. We compare the YOLO detection system to several top detection frameworks, highlighting key similarities and differences.

Deformable parts models. Deformable parts models (DPM) use a sliding window approach to object detection [10]. DPM uses a disjoint pipeline to extract static features, classify regions, predict bounding boxes for high scoring regions, etc. Our system replaces all of these disparate parts with a single convolutional neural network. The network performs feature extraction, bounding box prediction, nonmaximal suppression, and contextual reasoning all concurrently. Instead of static features, the network trains the features in-line and optimizes them for the detection task. Our unified architecture leads to a faster, more accurate model than DPM.

R-CNN. R-CNN and its variants use region proposals instead of sliding windows to find objects in images. Selective Search [35] generates potential bounding boxes, a convolutional network extracts features, an SVM scores the boxes, a linear model adjusts the bounding boxes, and non-max suppression eliminates duplicate detections. Each stage of this complex pipeline must be precisely tuned independently and the resulting system is very slow, taking more than 40 seconds per image at test time [14].

YOLO shares some similarities with R-CNN. Each grid cell proposes potential bounding boxes and scores those boxes using convolutional features. However, our system puts spatial constraints on the grid cell proposals which helps mitigate multiple detections of the same object. Our system also proposes far fewer bounding boxes, only 98 per image compared to about 2000 from Selective Search. Finally, our system combines these individual components into a single, jointly optimized model.

Other Fast Detectors. Fast and Faster R-CNN focus on speeding up the R-CNN framework by sharing computation and using neural networks to propose regions instead of Selective Search [14] [28]. While they offer speed and accuracy improvements over R-CNN, both still fall short of real-time performance.

Many research efforts focus on speeding up the DPM pipeline [31] [38] [5]. They speed up HOG computation, use cascades, and push computation to GPUs. However, only 30Hz DPM [31] actually runs in real-time.

Instead of trying to optimize individual components of a large detection pipeline, YOLO throws out the pipeline entirely and is fast by design.

Detectors for single classes like faces or people can be highly optimized since they have to deal with much less variation [37]. YOLO is a general purpose detector that learns to detect a variety of objects simultaneously.

Deep MultiBox. Unlike R-CNN, Szegedy et al. train a convolutional neural network to predict regions of interest [8] instead of using Selective Search. MultiBox can also perform single object detection by replacing the confidence prediction with a single class prediction. However, MultiBox cannot perform general object detection and is still just a piece in a larger detection pipeline, requiring further image patch classification. Both YOLO and MultiBox use a convolutional network to predict bounding boxes in an image but YOLO is a complete detection system.

OverFeat. Sermanet et al. train a convolutional neural network to perform localization and adapt that localizer to perform detection [32]. OverFeat efficiently performs sliding window detection but it is still a disjoint system. OverFeat optimizes for localization, not detection performance. Like DPM, the localizer only sees local information when making a prediction. OverFeat cannot reason about global context and thus requires significant post-processing to produce coherent detections.

MultiGrasp. Our work is similar in design to work on grasp detection by Redmon et al [27]. Our grid approach to bounding box prediction is based on the MultiGrasp system for regression to grasps. However, grasp detection is a much simpler task than object detection. MultiGrasp only needs to predict a single graspable region for an image containing one object. It doesn't have to estimate the size, location, or boundaries of the object or predict its class, only find a region suitable for grasping. YOLO predicts both bounding boxes and class probabilities for multiple objects of multiple classes in an image.

4. Experiments

First we compare YOLO with other real-time detection systems on PASCAL VOC 2007. To understand the differences between YOLO and R-CNN variants we explore the errors on VOC 2007 made by YOLO and Fast R-CNN, one of the highest performing versions of R-CNN [14]. Based on the different error profiles we show that YOLO can be used to rescore Fast R-CNN detections and reduce the errors from background false positives, giving a significant performance boost. We also present VOC 2012 results and compare mAP to current state-of-the-art methods. Finally, we show that YOLO generalizes to new domains better than other detectors on two artwork datasets.

4.1. Comparison to Other Real-Time Systems

Many research efforts in object detection focus on making standard detection pipelines fast. [5] [38] [31] [14] [17] [28] However, only Sadeghi et al. actually produce a detection system that runs in real-time (30 frames per second or better) [31]. We compare YOLO to their GPU implementation of DPM which runs either at 30Hz or 100Hz. While the other efforts don’t reach the real-time milestone we also compare their relative mAP and speed to examine the accuracy-performance tradeoffs available in object detection systems.

Fast YOLO is the fastest object detection method on PASCAL; as far as we know, it is the fastest extant object detector. With 52.7% mAP, it is more than twice as accurate as prior work on real-time detection. YOLO pushes mAP to 63.4% while still maintaining real-time performance.

We also train YOLO using VGG-16. This model is more accurate but also significantly slower than YOLO. It is useful for comparison to other detection systems that rely on VGG-16 but since it is slower than real-time the rest of the paper focuses on our faster models.

Fastest DPM effectively speeds up DPM without sacrificing much mAP but it still misses real-time performance by a factor of 2 [38]. It also is limited by DPM’s relatively low accuracy on detection compared to neural network approaches.

R-CNN minus R replaces Selective Search with static bounding box proposals [20]. While it is much faster than R-CNN, it still falls short of real-time and takes a significant accuracy hit from not having good proposals.

Fast R-CNN speeds up the classification stage of R-CNN but it still relies on selective search which can take around 2 seconds per image to generate bounding box proposals. Thus it has high mAP but at 0.5 fps it is still far from realtime.

The recent Faster R-CNN replaces selective search with a neural network to propose bounding boxes, similar to Szegedy et al. [8]. In our tests, their most accurate model achieves 7 fps while a smaller, less accurate one runs at 18 fps. The VGG-16 version of Faster R-CNN is 10 mAP higher but is also 6 times slower than YOLO. The Zeiler-Fergus Faster R-CNN is only 2.5 times slower than YOLO but is also less accurate.

YOLO discards region proposals and performs bounding box regression and classification on a fixed grid of cells, and it predicts far fewer candidate boxes, which makes it fast. (It divides the image into an $S \times S$ grid and for each grid cell predicts $B$ bounding boxes, confidence for those boxes, and $C$ class probabilities.)

Table 1: Real-Time Systems on PASCAL VOC 2007. Comparing the performance and speed of fast detectors. Fast YOLO is the fastest detector on record for PASCAL VOC detection and is still twice as accurate as any other real-time detector. YOLO is 10 mAP more accurate than the fast version while still well above real-time in speed.

4.2. VOC 2007 Error Analysis

To further examine the differences between YOLO and state-of-the-art detectors, we look at a detailed breakdown of results on VOC 2007. We compare YOLO to Fast R-CNN since Fast R-CNN is one of the highest performing detectors on PASCAL and its detections are publicly available.

We use the methodology and tools of Hoiem et al. [19]. For each category at test time we look at the top N predictions for that category. Each prediction is either correct or it is classified based on the type of error:

  • Correct: correct class and IOU > .5
  • Localization: correct class, .1 < IOU < .5
  • Similar: class is similar, IOU > .1
  • Other: class is wrong, IOU > .1
  • Background: IOU < .1 for any object
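A small sketch that assigns one of these error types to a single prediction (following the taxonomy above; is_similar_class stands in for the dataset's class-similarity definition):

```python
def classify_prediction(pred_class, true_class, iou, is_similar_class):
    """Assign one detection to an error type, given its best-matching ground truth."""
    if pred_class == true_class and iou > 0.5:
        return "Correct"
    if pred_class == true_class and 0.1 < iou < 0.5:
        return "Localization"
    if iou > 0.1 and is_similar_class(pred_class, true_class):
        return "Similar"
    if iou > 0.1:
        return "Other"
    return "Background"   # IOU < .1 for any object
```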

Figure 4 shows the breakdown of each error type averaged across all 20 classes.

YOLO struggles to localize objects correctly. Localization errors account for more of YOLO's errors than all other sources combined. Fast R-CNN makes much fewer localization errors but far more background errors. 13.6% of its top detections are false positives that don't contain any objects. Fast R-CNN is almost 3x more likely to predict background detections than YOLO.

Figure 4: Error Analysis: Fast R-CNN vs. YOLO These charts show the percentage of localization and background errors in the top N detections for various categories (N = # objects in that category).

4.3. Combining Fast R-CNN and YOLO

YOLO makes far fewer background mistakes than Fast R-CNN. By using YOLO to eliminate background detections from Fast R-CNN we get a significant boost in performance. For every bounding box that R-CNN predicts we check to see if YOLO predicts a similar box. If it does, we give that prediction a boost based on the probability predicted by YOLO and the overlap between the two boxes.
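The combination scheme described above can be sketched as follows (the exact boost formula is not given in the paper, so the reweighting term below is only an illustrative assumption):

```python
def rescore_with_yolo(rcnn_dets, yolo_dets, iou_fn, iou_thresh=0.5):
    """rcnn_dets / yolo_dets: lists of (box, class_id, score).

    For each Fast R-CNN detection, look for a YOLO box of the same class with
    sufficient overlap and, if found, boost the R-CNN score using YOLO's
    predicted probability and the overlap (the boost term is an assumption)."""
    rescored = []
    for box, cls, score in rcnn_dets:
        boost = 0.0
        for ybox, ycls, yscore in yolo_dets:
            overlap = iou_fn(box, ybox)
            if ycls == cls and overlap > iou_thresh:
                boost = max(boost, yscore * overlap)
        rescored.append((box, cls, score + boost))
    return rescored
```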

The best Fast R-CNN model achieves a mAP of 71.8% on the VOC 2007 test set. When combined with YOLO, its mAP increases by 3.2% to 75.0%. We also tried combining the top Fast R-CNN model with several other versions of Fast R-CNN. Those ensembles produced small increases in mAP between .3 and .6%, see Table 2 for details.

The boost from YOLO is not simply a byproduct of model ensembling since there is little benefit from combining different versions of Fast R-CNN. Rather, it is precisely because YOLO makes different kinds of mistakes at test time that it is so effective at boosting Fast R-CNN’s performance.

Unfortunately, this combination doesn’t benefit from the speed of YOLO since we run each model separately and then combine the results. However, since YOLO is so fast it doesn’t add any significant computational time compared to Fast R-CNN.

Table 2: Model combination experiments on VOC 2007. We examine the effect of combining various models with the best version of Fast R-CNN. Other versions of Fast R-CNN provide only a small benefit while YOLO provides a significant performance boost.

4.4. VOC 2012 Results

On the VOC 2012 test set, YOLO scores 57.9% mAP. This is lower than the current state of the art, closer to the original R-CNN using VGG-16, see Table 3. Our system struggles with small objects compared to its closest competitors. On categories like bottle, sheep, and tv/monitor YOLO scores 8-10% lower than R-CNN or Feature Edit. However, on other categories like cat and train YOLO achieves higher performance.

Our combined Fast R-CNN + YOLO model is one of the highest performing detection methods. Fast R-CNN gets a 2.3% improvement from the combination with YOLO, boosting it 5 spots up on the public leaderboard.

Table 3: PASCAL VOC 2012 Leaderboard. YOLO compared with the full comp4 (outside data allowed) public leaderboard as of November 6th, 2015. Mean average precision and per-class average precision are shown for a variety of detection methods. YOLO is the only real-time detector. Fast R-CNN + YOLO is the fourth highest scoring method, with a 2.3% boost over Fast R-CNN.

4.5. Generalizability: Person Detection in Artwork

Academic datasets for object detection draw the training and testing data from the same distribution. In real-world applications it is hard to predict all possible use cases and the test data can diverge from what the system has seen before [3]. We compare YOLO to other detection systems on the Picasso Dataset [12] and the People-Art Dataset [3], two datasets for testing person detection on artwork.

(a) Picasso Dataset precision-recall curves.

(b) Quantitative results on the VOC 2007, Picasso, and People-Art Datasets. The Picasso Dataset evaluates on both AP and best $F_1$ score.
Figure 5: Generalization results on Picasso and People-Art datasets.


Figure 5 shows comparative performance between YOLO and other detection methods. For reference, we give VOC 2007 detection AP on person where all models are trained only on VOC 2007 data. On Picasso models are trained on VOC 2012 while on People-Art they are trained on VOC 2010.

R-CNN has high AP on VOC 2007. However, R-CNN drops off considerably when applied to artwork. R-CNN uses Selective Search for bounding box proposals which is tuned for natural images. The classifier step in R-CNN only sees small regions and needs good proposals.

DPM maintains its AP well when applied to artwork. Prior work theorizes that DPM performs well because it has strong spatial models of the shape and layout of objects. Though DPM doesn’t degrade as much as R-CNN, it starts from a lower AP.

YOLO has good performance on VOC 2007 and its AP degrades less than other methods when applied to artwork. Like DPM, YOLO models the size and shape of objects, as well as relationships between objects and where objects commonly appear. Artwork and natural images are very different on a pixel level but they are similar in terms of the size and shape of objects, thus YOLO can still predict good bounding boxes and detections.

5. Real-Time Detection In The Wild

YOLO is a fast, accurate object detector, making it ideal for computer vision applications. We connect YOLO to a webcam and verify that it maintains real-time performance, including the time to fetch images from the camera and display the detections.

The resulting system is interactive and engaging. While YOLO processes images individually, when attached to a webcam it functions like a tracking system, detecting objects as they move around and change in appearance. A demo of the system and the source code can be found on our project website: https://pjreddie.com/darknet/yolo/.

Figure 6: Qualitative Results. YOLO running on sample artwork and natural images from the internet. It is mostly accurate although it does think one person is an airplane.

6. Conclusion

We introduce YOLO, a unified model for object detection. Our model is simple to construct and can be trained directly on full images. Unlike classifier-based approaches, YOLO is trained on a loss function that directly corresponds to detection performance and the entire model is trained jointly.

Fast YOLO is the fastest general-purpose object detector in the literature and YOLO pushes the state-of-the-art in real-time object detection. YOLO also generalizes well to new domains making it ideal for applications that rely on fast, robust object detection.

Acknowledgements: This work is partially supported by ONR N00014-13-1-0720, NSF IIS-1338054, and The Allen Distinguished Investigator Award.

Office of Naval Research (ONR)
National Science Foundation (NSF)
Allen Distinguished Investigator Award

References

[22] Network In Network
[29] Object Detection Networks on Convolutional Feature Maps
[34] Going Deeper with Convolutions


KEY POINTS

https://pjreddie.com/publications/yolo/
https://pjreddie.com/publications/
https://deepsystems.ai/reviews
