YOLOv3论文中英文对照翻译

YOLOv3论文名称: YOLOv3: An Incremental Improvement

YOLOv3论文下载地址:https://arxiv.org/pdf/1804.02767.pdf

声明:论文翻译仅用来学习,转载请注明出处

YOLOv3: An Incremental Improvement

YOLOv3:一个渐进式的改进

Abstract

We present some updates to YOLO! We made a bunch of little design changes to make it better. We also trained this new network that’s pretty swell. It’s a little bigger than last time but more accurate. It’s still fast though, don’t worry. At 320 × 320 YOLOv3 runs in 22 ms at 28.2 mAP, as accurate as SSD but three times faster. When we look at the old .5 IOU mAP detection metric YOLOv3 is quite good. It achieves 57.9 AP50 in 51 ms on a Titan X, compared to 57.5 AP50 in 198 ms by RetinaNet, similar performance but 3.8× faster. As always, all the code is online at https://pjreddie.com/yolo/.

摘要

我们对YOLO做了一些更新！我们做了一些小的设计改动，使其变得更好。我们还训练了一个相当不错的新网络。它比上一版大了一点，但更准确。不过它仍然很快，不用担心。在320×320的输入下，YOLOv3推理耗时22毫秒，达到28.2 mAP，与SSD一样准确，但速度快三倍。如果看旧的0.5 IOU mAP检测指标，YOLOv3是相当好的：它在Titan X上以51毫秒达到57.9 AP50，而RetinaNet需要198毫秒才达到57.5 AP50，性能相近但YOLOv3快3.8倍。像往常一样，所有代码都在网上：https://pjreddie.com/yolo/。

1. Introduction

Sometimes you just kinda phone it in for a year, you know? I didn’t do a whole lot of research this year. Spent a lot of time on Twitter. Played around with GANs a little. I had a little momentum left over from last year [12] [1]; I managed to make some improvements to YOLO. But, honestly, nothing like super interesting, just a bunch of small changes that make it better. I also helped out with other people’s research a little.

Actually, that’s what brings us here today. We have a camera-ready deadline [4] and we need to cite some of the random updates I made to YOLO but we don’t have a source. So get ready for a TECH REPORT!

The great thing about tech reports is that they don’t need intros, y’all know why we’re here. So the end of this introduction will signpost for the rest of the paper. First we’ll tell you what the deal is with YOLOv3. Then we’ll tell you how we do. We’ll also tell you about some things we tried that didn’t work. Finally we’ll contemplate what this all means.

1、介绍

有时候，你一整年都只是在敷衍了事，你懂的。今年我没有做太多研究，在Twitter上花了很多时间，玩了一下GAN。我还有一点去年留下的余劲[12] [1]，于是设法对YOLO做了一些改进。但说实话，没有什么特别有意思的东西，只是一些让它变得更好的小改动。我也给其他人的研究帮了点忙。

实际上，这正是我们今天来到这里的原因。我们有一个论文终稿（camera-ready）的截止日期[4]，需要引用我对YOLO做的一些零散更新，但我们没有可引用的来源。所以，准备好迎接一篇技术报告吧！

技术报告的好处是不需要引言，你们都知道我们为什么在这里。因此，这段引言的结尾将为本文其余部分做个导览：首先我们会告诉你YOLOv3是怎么回事，然后告诉你我们做得怎么样，还会告诉你一些我们尝试过但没有奏效的东西，最后我们会思考这一切意味着什么。

2. The Deal

So here’s the deal with YOLOv3: We mostly took good ideas from other people. We also trained a new classifier network that’s better than the other ones. We’ll just take you through the whole system from scratch so you can understand it all.

2、方案

YOLOv3的方案是这样的：我们主要借鉴了其他人的好想法，还训练了一个比其他分类器更好的新分类器网络。我们会从头把整个系统过一遍，让你能理解全部内容。


Figure 1. We adapt this figure from the Focal Loss paper [9]. YOLOv3 runs significantly faster than other detection methods with comparable performance. Times from either an M40 or Titan X, they are basically the same GPU.

图1. 本图改编自Focal Loss论文[9]。在性能相当的检测方法中，YOLOv3的运行速度明显更快。时间在M40或Titan X上测得，这两款基本上是相同的GPU。

2.1. Bounding Box Prediction

Following YOLO9000 our system predicts bounding boxes using dimension clusters as anchor boxes [15]. The network predicts 4 coordinates for each bounding box, tx, ty, tw, th. If the cell is offset from the top left corner of the image by (cx, cy) and the bounding box prior has width and height pw, ph, then the predictions correspond to:
\begin{equation}
\begin{aligned}
b_{x} &= \sigma(t_{x}) + c_{x} \\
b_{y} &= \sigma(t_{y}) + c_{y} \\
b_{w} &= p_{w} e^{t_{w}} \\
b_{h} &= p_{h} e^{t_{h}}
\end{aligned}
\end{equation}
During training we use sum of squared error loss. If the ground truth for some coordinate prediction is t̂* our gradient is the ground truth value (computed from the ground truth box) minus our prediction: t̂* − t*. This ground truth value can be easily computed by inverting the equations above.

2.1. 边界框预测

按照YOLO9000的做法，我们的系统使用维度聚类得到的先验框作为锚框来预测边界框[15]。网络为每个边界框预测4个坐标：tx、ty、tw、th。如果该单元格相对图像左上角的偏移为(cx, cy)，且先验框的宽和高为pw、ph，那么预测对应于：
\begin{equation}
\begin{aligned}
b_{x} &= \sigma(t_{x}) + c_{x} \\
b_{y} &= \sigma(t_{y}) + c_{y} \\
b_{w} &= p_{w} e^{t_{w}} \\
b_{h} &= p_{h} e^{t_{h}}
\end{aligned}
\end{equation}
在训练期间，我们使用误差平方和损失。如果某个坐标预测对应的真实值是t̂*，那么我们的梯度就是真实值（由真实框计算得到）减去我们的预测值：t̂* − t*。这个真实值可以通过对上面的方程求逆来轻松计算。
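下面给出一个最小的numpy示意代码（并非论文官方实现；其中的网格偏移(cx, cy)、先验框宽高(pw, ph)以及示例数值均为演示而假设），展示上述公式如何把网络输出(tx, ty, tw, th)解码为边界框，以及如何对公式求逆得到训练用的真实值t̂*：

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(t, cell_xy, prior_wh):
    """按论文公式把网络输出 t = (tx, ty, tw, th) 解码为 (bx, by, bw, bh)。"""
    tx, ty, tw, th = t
    cx, cy = cell_xy          # 该单元格相对图像左上角的偏移（以网格为单位）
    pw, ph = prior_wh         # 先验框（锚框）的宽和高
    bx = sigmoid(tx) + cx
    by = sigmoid(ty) + cy
    bw = pw * np.exp(tw)
    bh = ph * np.exp(th)
    return np.array([bx, by, bw, bh])

def encode_box(b, cell_xy, prior_wh):
    """对公式求逆，由真实框 (bx, by, bw, bh) 计算训练目标 t̂*。"""
    bx, by, bw, bh = b
    cx, cy = cell_xy
    pw, ph = prior_wh
    tx_hat = np.log((bx - cx) / (1.0 - (bx - cx)))   # sigmoid 的反函数（logit）
    ty_hat = np.log((by - cy) / (1.0 - (by - cy)))
    tw_hat = np.log(bw / pw)
    th_hat = np.log(bh / ph)
    return np.array([tx_hat, ty_hat, tw_hat, th_hat])

# 用法示例（数值为虚构）：先编码再解码应能还原原始框
gt = np.array([3.4, 5.7, 2.0, 3.5])                  # 以网格单元为单位的真实框
t_hat = encode_box(gt, cell_xy=(3, 5), prior_wh=(1.5, 2.8))
print(decode_box(t_hat, cell_xy=(3, 5), prior_wh=(1.5, 2.8)))  # ≈ [3.4 5.7 2.0 3.5]
```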


Figure 2. Bounding boxes with dimension priors and location prediction. We predict the width and height of the box as offsets from cluster centroids. We predict the center coordinates of the box relative to the location of filter application using a sigmoid function. This figure blatantly self-plagiarized from [15].

图2. 带有维度先验和位置预测的边界框。我们把框的宽和高预测为相对于聚类中心的偏移量，并用sigmoid函数预测框的中心坐标相对于滤波器作用位置（网格单元）的偏移。本图公然“自我抄袭”自[15]。

YOLOv3 predicts an objectness score for each bounding box using logistic regression. This should be 1 if the bounding box prior overlaps a ground truth object by more than any other bounding box prior. If the bounding box prior is not the best but does overlap a ground truth object by more than some threshold we ignore the prediction, following [17]. We use the threshold of .5. Unlike [17] our system only assigns one bounding box prior for each ground truth object. If a bounding box prior is not assigned to a ground truth object it incurs no loss for coordinate or class predictions, only objectness.

YOLOv3使用逻辑回归为每个边界框预测一个目标置信度（objectness）分数。如果某个先验框与真实目标框的重叠程度高于任何其他先验框，该分数应为1。如果某个先验框不是最好的，但与真实目标框的重叠超过了某个阈值，我们就忽略这个预测，这一点遵循[17]，阈值取0.5。与[17]不同，我们的系统只为每个真实目标分配一个先验框。如果一个先验框没有被分配给任何真实目标，它就不会对坐标或类别预测产生损失，只产生目标置信度的损失。
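下面是一个带假设的numpy小示例，演示上述先验框分配规则：与真实框重叠最大的先验框作为正样本，重叠超过0.5阈值的其余先验框被忽略，剩下的只计算目标置信度损失。这里用“中心对齐、只比较宽高”的IOU来做匹配，这是常见实现中的一种做法，并非论文明确给出的细节：

```python
import numpy as np

def iou_wh(wh1, wh2):
    """只比较宽高（假设中心对齐）的IOU，常用于先验框与真实框的匹配。"""
    inter = np.minimum(wh1[0], wh2[0]) * np.minimum(wh1[1], wh2[1])
    union = wh1[0] * wh1[1] + wh2[0] * wh2[1] - inter
    return inter / union

def assign_priors(gt_wh, priors_wh, ignore_thresh=0.5):
    """返回每个先验框的角色：'positive'（唯一最佳）、'ignore'（IOU超过阈值）或'negative'。"""
    ious = np.array([iou_wh(gt_wh, p) for p in priors_wh])
    roles = np.where(ious > ignore_thresh, "ignore", "negative").astype(object)
    roles[int(np.argmax(ious))] = "positive"   # 只有重叠最大的先验框作为正样本
    return roles

# 用法示例（先验框取论文中COCO聚类的前3个，真实框宽高为虚构值）
priors = [(10, 13), (16, 30), (33, 23)]
print(assign_priors(gt_wh=(20, 28), priors_wh=priors))
# ['negative' 'positive' 'ignore']
```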

2.2. Class Prediction

Each box predicts the classes the bounding box may contain using multilabel classification. We do not use a softmax as we have found it is unnecessary for good performance, instead we simply use independent logistic classifiers. During training we use binary cross-entropy loss for the class predictions.

This formulation helps when we move to more complex domains like the Open Images Dataset [7]. In this dataset there are many overlapping labels (i.e. Woman and Person).

Using a softmax imposes the assumption that each box has exactly one class which is often not the case. A multilabel approach better models the data.

2.2. 类别预测

每个框都使用多标签分类法预测边界框可能包含的类别。我们不使用softmax,因为我们发现它对于良好的性能是不必要的,相反,我们只是使用独立的逻辑分类器。在训练过程中,我们使用二元交叉熵损失来进行分类预测。

当我们转向更复杂的领域，例如Open Images数据集[7]时，这种做法会有所帮助。在这个数据集中，有许多相互重叠的标签（例如“Woman”和“Person”）。

使用softmax等于假设每个框恰好只属于一个类别，但实际情况往往并非如此。多标签方法能更好地对数据建模。
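下面用一个简短的numpy示例说明“独立逻辑分类器 + 二元交叉熵”的多标签做法（类别名称和数值均为虚构，仅作示意）。注意与softmax不同，多个类别的概率可以同时接近1：

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def multilabel_class_loss(logits, targets):
    """对每个类别用独立的sigmoid + 二元交叉熵，而不是softmax。"""
    p = sigmoid(logits)
    eps = 1e-7
    bce = -(targets * np.log(p + eps) + (1 - targets) * np.log(1 - p + eps))
    return p, bce.sum()

# 用法示例（3个虚构类别：person、woman、car），“woman”与“person”可以同时为正
logits = np.array([2.0, 1.5, -3.0])
targets = np.array([1.0, 1.0, 0.0])
probs, loss = multilabel_class_loss(logits, targets)
print(probs, loss)
```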

2.3. Predictions Across Scales

YOLOv3 predicts boxes at 3 different scales. Our system extracts features from those scales using a similar concept to feature pyramid networks [8]. From our base feature extractor we add several convolutional layers. The last of these predicts a 3-d tensor encoding bounding box, objectness, and class predictions. In our experiments with COCO [10] we predict 3 boxes at each scale so the tensor is N × N × [3 ∗ (4 + 1 + 80)] for the 4 bounding box offsets, 1 objectness prediction, and 80 class predictions.

2.3. 跨尺度的预测

YOLOv3在3个不同的尺度上预测边界框。我们的系统借鉴了类似特征金字塔网络[8]的思想，从这些尺度上提取特征。我们在基础特征提取器之后添加了几个卷积层，其中最后一层预测一个三维张量，编码了边界框、目标置信度和类别预测。在我们基于COCO[10]的实验中，每个尺度上预测3个框，因此张量为N×N×[3∗(4+1+80)]，对应4个边界框偏移量、1个目标置信度预测和80个类别预测。
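下面的numpy示例演示这个输出张量的组织方式（输入尺寸416、网格边长13等数值只是为了说明而假设，并非论文规定）：把N×N×255的输出重排为N×N×3×85，再按先验框拆出坐标、目标置信度和类别三部分：

```python
import numpy as np

NUM_ANCHORS = 3          # 每个尺度3个先验框
NUM_CLASSES = 80         # COCO的80个类别
N = 13                   # 网格边长，假设输入416×416、下采样32倍时为13（示意值）

# 假设这是该尺度上网络输出的原始特征图：N × N × [3 * (4 + 1 + 80)] = 13 × 13 × 255
raw = np.random.randn(N, N, NUM_ANCHORS * (4 + 1 + NUM_CLASSES))

# 重排为 N × N × 3 × 85，便于按先验框拆分各个预测分量
pred = raw.reshape(N, N, NUM_ANCHORS, 4 + 1 + NUM_CLASSES)
box_t       = pred[..., 0:4]    # tx, ty, tw, th
objectness  = pred[..., 4:5]    # 目标置信度（经sigmoid后使用）
class_logit = pred[..., 5:]     # 80个独立的类别logit

print(pred.shape, box_t.shape, objectness.shape, class_logit.shape)
# (13, 13, 3, 85) (13, 13, 3, 4) (13, 13, 3, 1) (13, 13, 3, 80)
```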

Next we take the feature map from 2 layers previous and upsample it by 2×. We also take a feature map from earlier in the network and merge it with our upsampled features using concatenation. This method allows us to get more meaningful semantic information from the upsampled features and finer-grained information from the earlier feature map. We then add a few more convolutional layers to process this combined feature map, and eventually predict a similar tensor, although now twice the size.

We perform the same design one more time to predict boxes for the final scale. Thus our predictions for the 3rd scale benefit from all the prior computation as well as finegrained features from early on in the network.

We still use k-means clustering to determine our bounding box priors. We just sort of chose 9 clusters and 3 scales arbitrarily and then divide up the clusters evenly across scales. On the COCO dataset the 9 clusters were: (10 × 13), (16 × 30), (33 × 23), (30 × 61), (62 × 45), (59 × 119), (116 × 90), (156 × 198), (373 × 326).

接下来，我们取出往前数两层的特征图，并将其上采样2倍。我们还从网络更早的位置取一个特征图，通过拼接（concatenation）将其与上采样后的特征合并。这种方法使我们能够从上采样的特征中获得更有意义的语义信息，并从较早的特征图中获得更细粒度的信息。然后，我们再添加几个卷积层来处理这个合并后的特征图，最终预测出一个类似的张量，只是尺寸变为原来的两倍。

我们再重复一次同样的设计，为最后一个尺度预测边界框。因此，我们对第三个尺度的预测既受益于之前的全部计算，也受益于网络早期的细粒度特征。

我们仍然使用k-means聚类来确定边界框先验。我们只是比较随意地选择了9个聚类和3个尺度，然后把这些聚类均匀地分配到各个尺度上。在COCO数据集上，这9个聚类是：(10 × 13)、(16 × 30)、(33 × 23)、(30 × 61)、(62 × 45)、(59 × 119)、(116 × 90)、(156 × 198)、(373 × 326)。
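论文只说沿用了k-means聚类，下面给出一个最小示意（假设按YOLO9000 [15]的做法用1−IOU作为距离度量；框的宽高数据为随机生成，聚类结果仅演示流程，不代表论文中的9个聚类）：

```python
import numpy as np

def iou_wh(box, clusters):
    """宽高IOU（假设中心对齐），作为聚类的相似度度量。"""
    inter = np.minimum(box[0], clusters[:, 0]) * np.minimum(box[1], clusters[:, 1])
    return inter / (box[0] * box[1] + clusters[:, 0] * clusters[:, 1] - inter)

def kmeans_anchors(boxes_wh, k=9, iters=100, seed=0):
    """用1−IOU作为距离的k-means，返回k个先验框的宽高。"""
    rng = np.random.default_rng(seed)
    clusters = boxes_wh[rng.choice(len(boxes_wh), k, replace=False)]
    for _ in range(iters):
        # 每个真实框归入IOU最大（即1−IOU最小）的聚类
        assign = np.array([np.argmax(iou_wh(b, clusters)) for b in boxes_wh])
        new = np.array([boxes_wh[assign == i].mean(axis=0) if np.any(assign == i)
                        else clusters[i] for i in range(k)])
        if np.allclose(new, clusters):
            break
        clusters = new
    return clusters[np.argsort(clusters[:, 0] * clusters[:, 1])]   # 按面积排序

# 用法示例：随机生成一些框的宽高，聚成9个先验框，再按面积均分给3个尺度
boxes = np.abs(np.random.default_rng(1).normal(100, 60, size=(500, 2))) + 5
anchors = kmeans_anchors(boxes, k=9)
scales = [anchors[0:3], anchors[3:6], anchors[6:9]]   # 小/中/大目标各3个
print(np.round(anchors, 1))
```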

2.4. Feature Extractor

We use a new network for performing feature extraction.

Our new network is a hybrid approach between the network used in YOLOv2, Darknet-19, and that newfangled residual network stuff. Our network uses successive 3 × 3 and 1 × 1 convolutional layers but now has some shortcut connections as well and is significantly larger. It has 53 convolutional layers so we call it… wait for it… Darknet-53!

2.4. 特征提取器

我们使用一个新的网络来进行特征提取。

我们的新网络是YOLOv2中所用网络（Darknet-19）与新潮的残差网络那一套之间的混合体。我们的网络使用连续的3×3和1×1卷积层，但现在也加入了一些快捷连接（shortcut connection），而且规模明显更大。它有53个卷积层，所以我们把它叫做…… 请听好…… Darknet-53！

[表1. Darknet-53的网络结构]

This new network is much more powerful than Darknet-19 but still more efficient than ResNet-101 or ResNet-152. Here are some ImageNet results:

这个新网络比Darknet-19强大得多，但仍然比ResNet-101或ResNet-152更高效。下面是一些ImageNet的结果：


Table 2. Comparison of backbones. Accuracy, billions of operations, billion floating point operations per second, and FPS for various networks.

表2. 骨干网络的比较。列出了各网络的准确率、运算量（十亿次操作）、每秒十亿次浮点运算数以及FPS。

Each network is trained with identical settings and tested at 256 × 256, single crop accuracy. Run times are measured on a Titan X at 256 × 256. Thus Darknet-53 performs on par with state-of-the-art classifiers but with fewer floating point operations and more speed. Darknet-53 is better than ResNet-101 and 1.5× faster. Darknet-53 has similar performance to ResNet-152 and is 2× faster.

Darknet-53 also achieves the highest measured floating point operations per second. This means the network structure better utilizes the GPU, making it more efficient to evaluate and thus faster. That’s mostly because ResNets have just way too many layers and aren’t very efficient.

每个网络都以相同的设置进行训练，并在256×256的输入下测试单裁剪（single crop）精度。运行时间在Titan X上以256×256的输入测得。因此，Darknet-53的性能与最先进的分类器相当，但浮点运算更少、速度更快。Darknet-53比ResNet-101更好，且快1.5倍；与ResNet-152性能相近，且快2倍。

Darknet-53还实现了最高的实测每秒浮点运算次数。这意味着该网络结构能更好地利用GPU，评估效率更高，因此速度更快。这主要是因为ResNet的层数实在太多，效率不高。
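下面是用PyTorch写的一个残差构建块示意，对应文中“连续的3×3和1×1卷积加快捷连接”的描述（这不是官方实现；批归一化和LeakyReLU的位置、通道减半等细节是按Darknet常见配置做的假设）：

```python
import torch
import torch.nn as nn

class DarknetResidual(nn.Module):
    """Darknet-53风格的残差块：1×1降维 → 3×3卷积 → 与输入相加（快捷连接）。"""
    def __init__(self, channels: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels // 2, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels // 2),
            nn.LeakyReLU(0.1),
            nn.Conv2d(channels // 2, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.block(x)      # 快捷连接：输出与输入逐元素相加

# 用法示例：一个256通道、52×52的特征图经过残差块，形状保持不变
x = torch.randn(1, 256, 52, 52)
print(DarknetResidual(256)(x).shape)   # torch.Size([1, 256, 52, 52])
```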

2.5. Training

We still train on full images with no hard negative mining or any of that stuff. We use multi-scale training, lots of data augmentation, batch normalization, all the standard stuff. We use the Darknet neural network framework for training and testing [14].

2.5. 训练

我们仍然在完整图像上训练，没有使用难负样本挖掘（hard negative mining）之类的东西。我们使用多尺度训练、大量数据增强、批归一化等所有标准做法。我们使用Darknet神经网络框架进行训练和测试[14]。
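下面用一个极简的Python片段示意多尺度训练的思路（论文没有给出具体细节，这里假设沿用YOLOv2的做法：每隔10个batch从320到608、步长32的尺寸中随机选一个输入分辨率；train_one_batch等名称均为虚构占位）：

```python
import random

# 多尺度训练的简化示意：每隔若干个batch随机换一个输入尺寸（尺寸集合为假设值）
SIZES = list(range(320, 608 + 1, 32))        # [320, 352, ..., 608]
RESIZE_EVERY = 10                            # 假设每10个batch换一次尺寸

def pick_input_size(batch_idx: int, current: int) -> int:
    if batch_idx % RESIZE_EVERY == 0:
        return random.choice(SIZES)
    return current

size = 416
for batch_idx in range(1, 31):
    size = pick_input_size(batch_idx, size)
    # train_one_batch(resize_images(batch, size), ...)   # 占位：实际训练代码略
    if batch_idx % RESIZE_EVERY == 0:
        print(f"batch {batch_idx}: input size -> {size}x{size}")
```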

3. How We Do

YOLOv3 is pretty good! See table 3. In terms of COCOs weird average mean AP metric it is on par with the SSD variants but is 3× faster. It is still quite a bit behind other models like RetinaNet in this metric though.

3、我们做得怎么样

YOLOv3相当不错！见表3。就COCO那个奇怪的平均mAP指标而言，它与SSD的各种变体相当，但速度快3倍。不过在这个指标上，它仍然明显落后于RetinaNet等其他模型。

However, when we look at the “old” detection metric of mAP at IOU= .5 (or AP50 in the chart) YOLOv3 is very strong. It is almost on par with RetinaNet and far above the SSD variants. This indicates that YOLOv3 is a very strong detector that excels at producing decent boxes for objects. However, performance drops significantly as the IOU threshold increases indicating YOLOv3 struggles to get the boxes perfectly aligned with the object.

In the past YOLO struggled with small objects. However, now we see a reversal in that trend. With the new multi-scale predictions we see YOLOv3 has relatively high APS performance. However, it has comparatively worse performance on medium and larger size objects. More investigation is needed to get to the bottom of this.

When we plot accuracy vs speed on the AP50 metric (see figure 5) we see YOLOv3 has significant benefits over other detection systems. Namely, it’s faster and better.

然而，当我们看旧的检测指标，即IOU=0.5时的mAP（图表中的AP50）时，YOLOv3非常强。它几乎与RetinaNet持平，远高于SSD的各种变体。这表明YOLOv3是一个非常强大的检测器，擅长为目标生成不错的框。然而，随着IOU阈值的提高，性能明显下降，这表明YOLOv3难以使预测框与目标完美对齐。

在过去,YOLO在处理小物体时很吃力。然而,现在我们看到这一趋势发生了逆转。通过新的多尺度预测,我们看到YOLOv3具有相对较高的APS性能。然而,它在中等和较大尺寸物体上的性能相对较差。需要进行更多的调查来了解这个问题的真相。

当我们在AP50指标上绘制准确度与速度的关系时(见图5),我们看到YOLOv3比其他检测系统有明显的优势。也就是说,它更快、更好。

4. Things We Tried That Didn’t Work

We tried lots of stuff while we were working on YOLOv3. A lot of it didn’t work. Here’s the stuff we can remember.

Anchor box x, y offset predictions. We tried using the normal anchor box prediction mechanism where you predict the x, y offset as a multiple of the box width or height using a linear activation. We found this formulation decreased model stability and didn’t work very well.

Linear x, y predictions instead of logistic. We tried using a linear activation to directly predict the x, y offset instead of the logistic activation. This led to a couple point drop in mAP .

Focal loss. We tried using focal loss. It dropped our mAP about 2 points. YOLOv3 may already be robust to the problem focal loss is trying to solve because it has separate objectness predictions and conditional class predictions. Thus for most examples there is no loss from the class predictions? Or something? We aren’t totally sure.

Dual IOU thresholds and truth assignment. Faster R-CNN uses two IOU thresholds during training. If a prediction overlaps the ground truth by .7 it is a positive example, by [.3 − .7] it is ignored, less than .3 for all ground truth objects it is a negative example. We tried a similar strategy but couldn’t get good results.

We quite like our current formulation, it seems to be at a local optima at least. It is possible that some of these techniques could eventually produce good results, perhaps they just need some tuning to stabilize the training.

4、我们尝试过但没有奏效的东西

我们在做YOLOv3的时候尝试了很多东西。很多东西都没有成功。以下是我们能记住的东西。

锚框的x、y偏移量预测。我们尝试了常规的锚框预测机制，即使用线性激活把x、y偏移量预测为框宽或框高的倍数。我们发现这种做法降低了模型的稳定性，效果不太好。

线性的x、y预测，而不是逻辑预测。我们尝试用线性激活来直接预测x、y偏移量，而不是用逻辑激活。这导致mAP下降了几个点。

Focal loss（焦点损失）。我们尝试使用focal loss。它使我们的mAP下降了大约2个点。YOLOv3可能已经对focal loss试图解决的问题具有鲁棒性，因为它有独立的目标置信度预测和条件类别预测。因此对大多数样本来说，类别预测没有产生损失？或者是别的原因？我们并不完全确定。

双重IOU阈值和真值分配。Faster R-CNN在训练中使用两个IOU阈值。如果一个预测与真实框的重叠达到0.7，它就是正样本；重叠在[0.3, 0.7]之间则被忽略；与所有真实目标的重叠都小于0.3，则是负样本。我们尝试了类似的策略，但没有得到好的结果。
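下面用几行Python示意上文描述的Faster R-CNN式双阈值分配规则（0.7和0.3取自文中描述，IOU数值为虚构示例）：

```python
def assign_by_dual_threshold(ious_with_gt, pos_thresh=0.7, neg_thresh=0.3):
    """按文中描述的Faster R-CNN式规则给一个预测框分配标签。
    ious_with_gt: 该预测框与所有真实框的IOU列表。"""
    best = max(ious_with_gt) if ious_with_gt else 0.0
    if best >= pos_thresh:
        return "positive"                 # 与某个真实框重叠达到0.7
    if best < neg_thresh:
        return "negative"                 # 与所有真实框的重叠都小于0.3
    return "ignored"                      # 介于[0.3, 0.7]之间，忽略

# 用法示例（IOU数值为虚构）
print(assign_by_dual_threshold([0.72, 0.10]))   # positive
print(assign_by_dual_threshold([0.45, 0.20]))   # ignored
print(assign_by_dual_threshold([0.12, 0.05]))   # negative
```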

我们相当喜欢目前的方案，它似乎至少处于一个局部最优点。这些技术中的一些最终也许能产生好的结果，或许它们只是需要一些调参来稳定训练。


Table 3. I’m seriously just stealing all these tables from [9] they take soooo long to make from scratch. Ok, YOLOv3 is doing alright. Keep in mind that RetinaNet has like 3.8× longer to process an image. YOLOv3 is much better than SSD variants and comparable to state-of-the-art models on the AP50 metric.

表3:说真的,我只是从[9]那里偷来了所有这些表格,它们从头开始做需要花费太多时间。好吧,YOLOv3做得不错。请记住,RetinaNet处理一张图片的时间要长3.8倍。YOLOv3比SSD的变体好得多,在AP50指标上与最先进的模型相当。


Figure 3. Again adapted from the [9], this time displaying speed/accuracy tradeoff on the mAP at .5 IOU metric. You can tell YOLOv3 is good because it’s very high and far to the left. Can you cite your own paper? Guess who’s going to try, this guy → [16]. Oh, I forgot, we also fix a data loading bug in YOLOv2, that helped by like 2 mAP. Just sneaking this in here to not throw off layout.

图3. 再次改编自[9]，这次展示的是在0.5 IOU指标下mAP的速度/精度权衡。你可以看出YOLOv3很不错，因为它在图中位置很高而且很靠左。你能引用自己的论文吗？猜猜谁要试试，就是这家伙→[16]。哦，我忘了，我们还修复了YOLOv2中的一个数据加载bug，这带来了大约2个mAP的提升。只是在这里偷偷提一下，以免打乱排版。

5. What This All Means

YOLOv3 is a good detector. It’s fast, it’s accurate. It’s not as great on the COCO average AP between .5 and .95 IOU metric. But it’s very good on the old detection metric of .5 IOU.

Why did we switch metrics anyway? The original COCO paper just has this cryptic sentence: “A full discussion of evaluation metrics will be added once the evaluation server is complete”. Russakovsky et al report that that humans have a hard time distinguishing an IOU of .3 from .5! “Training humans to visually inspect a bounding box with IOU of 0.3 and distinguish it from one with IOU 0.5 is surprisingly difficult.” [18] If humans have a hard time telling the difference, how much does it matter? But maybe a better question is: “What are we going to do with these detectors now that we have them?” A lot of the people doing this research are at Google and Facebook.

I guess at least we know the technology is in good hands and definitely won’t be used to harvest your personal information and sell it to… wait, you’re saying that’s exactly what it will be used for?? Oh.

Well the other people heavily funding vision research are the military and they’ve never done anything horrible like killing lots of people with new technology oh wait…1 I have a lot of hope that most of the people using computer vision are just doing happy, good stuff with it, like counting the number of zebras in a national park [13], or tracking their cat as it wanders around their house [19]. But computer vision is already being put to questionable use and as researchers we have a responsibility to at least consider the harm our work might be doing and think of ways to mitigate it. We owe the world that much.

In closing, do not @ me. (Because I finally quit Twitter).

5、这一切意味着什么

YOLOv3是一个很好的检测器。它速度快，也准确。它在COCO的0.5到0.95 IOU平均AP指标上表现不算出色，但在旧的0.5 IOU检测指标上非常好。

我们到底为什么要换指标？最初的COCO论文里只有这样一句含糊的话：“一旦评估服务器完成，将补充关于评估指标的完整讨论。” Russakovsky等人报告说，人类很难区分0.3和0.5的IOU！“训练人类目测一个IOU为0.3的边界框并将其与IOU为0.5的边界框区分开来，是出奇地困难的。”[18] 如果连人类都很难分辨，那它又有多重要呢？但也许一个更好的问题是：“既然我们有了这些检测器，我们要用它们来做什么？”很多做这项研究的人都在谷歌和Facebook。

我想至少我们知道这项技术掌握在好人手里，绝对不会被用来收集你的个人信息然后卖给……等等，你是说它正是会被用来干这个？？哦。

好吧，另一批大力资助视觉研究的人是军方，他们可从来没有做过用新技术杀死很多人这种可怕的事情，哦等等……我很希望大多数使用计算机视觉的人只是在用它做快乐、有益的事情，比如统计国家公园里斑马的数量[13]，或者跟踪自家猫在房子里闲逛[19]。但计算机视觉已经被用于一些值得质疑的用途，作为研究人员，我们有责任至少考虑自己的工作可能造成的危害，并想办法减轻它。这是我们欠这个世界的。

最后,请不要@我。(因为我终于退出了Twitter)。

References

[1] Analogy. Wikipedia, Mar 2018.

[2] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.

[3] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg. DSSD: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659, 2017.

[4] D. Gordon, A. Kembhavi, M. Rastegari, J. Redmon, D. Fox, and A. Farhadi. IQA: Visual question answering in interactive environments. arXiv preprint arXiv:1712.03316, 2017.

[5] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[6] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors.

[7] I. Krasin, T. Duerig, N. Alldrin, V. Ferrari, S. Abu-El-Haija, A. Kuznetsova, H. Rom, J. Uijlings, S. Popov, A. Veit, S. Belongie, V. Gomes, A. Gupta, C. Sun, G. Chechik, D. Cai, Z. Feng, D. Narayanan, and K. Murphy. OpenImages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github.com/openimages, 2017.

[8] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.

[9] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. arXiv preprint arXiv:1708.02002, 2017.

[10] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

[11] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.

[12] I. Newton. Philosophiae naturalis principia mathematica. William Dawson & Sons Ltd., London, 1687.

[13] J. Parham, J. Crall, C. Stewart, T. Berger-Wolf, and D. Rubenstein. Animal population censusing at scale with citizen science and photographic identification. 2017.

[14] J. Redmon. Darknet: Open source neural networks in C. http://pjreddie.com/darknet/, 2013–2016.

[15] J. Redmon and A. Farhadi. YOLO9000: Better, faster, stronger. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 6517–6525. IEEE, 2017.

[16] J. Redmon and A. Farhadi. YOLOv3: An incremental improvement. arXiv, 2018.

[17] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497, 2015.

[18] O. Russakovsky, L.-J. Li, and L. Fei-Fei. Best of both worlds: Human-machine collaboration for object annotation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2121–2131, 2015.

[19] M. Scott. Smart camera gimbal bot scanlime:027, Dec 2017.

[20] A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta. Beyond skip connections: Top-down modulation for object detection. arXiv preprint arXiv:1612.06851, 2016.

[21] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. 2017.
