Paper Reading Notes (35): R-FCN-3000 at 30fps: Decoupling Detection and Classification

We present R-FCN-3000, a large-scale real-time object detector in which objectness detection and classification are decoupled. To obtain the detection score for an RoI, we multiply the objectness score with the fine-grained classification score. Our approach is a modification of the R-FCN architecture in which position-sensitive filters are shared across different object classes for performing localization. For fine-grained classification, these position-sensitive filters are not needed. R-FCN-3000 obtains an mAP of 34.9% on the ImageNet detection dataset and outperforms YOLO9000 by 18% while processing 30 images per second. We also show that the objectness learned by R-FCN-3000 generalizes to novel classes, and that performance increases with the number of training object classes, supporting the hypothesis that it is possible to learn a universal objectness detector.

With the advent of deep CNNs [16, 20], object detection has witnessed a quantum leap in performance on benchmark datasets, owing to the powerful feature-learning capabilities of deep CNN architectures. Within the last five years, the mAP scores on PASCAL [9] and COCO [24] have improved from 33% to 88% and from 37% to 73% (at 50% overlap), respectively. While there have been massive improvements on standard benchmarks with tens of classes [13, 12, 31, 6, 14], little progress has been made towards real-life object detection that requires real-time detection of thousands of classes. Some recent efforts [30, 17] in this direction have led to large-scale detection systems, but at the cost of accuracy. We propose R-FCN-3000, a solution to the large-scale object-detection problem that outperforms YOLO9000 [30] by 18% and can process 30 images per second while detecting 3000 classes.

R-FCN-3000 is the result of systematic modifications to some of the recent object-detection architectures [6, 5, 23, 25, 29] to afford real-time large-scale object detection. The recently proposed fully convolutional class of detectors [6, 5, 23, 25, 29] computes a per-class objectness score for a given image and has shown impressive accuracy within limited computational budgets. Although fully convolutional representations provide an efficient [19] solution for tasks like object detection [6], instance segmentation [22], tracking [10], relationship detection [41], etc., they require a class-specific set of filters for each class, which prohibits their application to a large number of classes. For example, R-FCN [5] / Deformable-R-FCN [6] requires 49/197 position-specific filters for each class, and RetinaNet [23] requires 9 filters for each class for each convolutional feature map. Therefore, such architectures would need hundreds of thousands of filters for detecting 3000 classes, which would make them extremely slow for practical purposes.

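To make the filter-count argument concrete, here is a back-of-the-envelope tally in Python, using the per-class figures quoted above (the RetinaNet total also depends on the number of feature-map levels, which is an assumption here):

```python
# Rough count of final-layer classification filters at 3000 classes,
# using the per-class figures quoted above (illustrative only).
num_classes = 3000

r_fcn_per_class = 49          # 7x7 position-sensitive grid in R-FCN
deform_r_fcn_per_class = 197  # figure quoted for Deformable-R-FCN
retina_per_class = 9          # anchors per location, per feature map
retina_feature_maps = 5       # assumption: a typical FPN has ~5 levels

print("R-FCN:           ", num_classes * r_fcn_per_class)                       # 147,000
print("Deformable-R-FCN:", num_classes * deform_r_fcn_per_class)                # 591,000
print("RetinaNet:       ", num_classes * retina_per_class * retina_feature_maps)  # 135,000
```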

The key insight behind the proposed R-FCN-3000 architecture is to decouple objectness detection and classification of the detected object, so that the computational requirements for localization remain constant as the number of classes increases; see Fig. 1. We leverage the fact that many object categories are visually similar and share parts. For example, different breeds of dogs all have common body parts; therefore, learning a different set of filters for detecting each breed is overkill. So, R-FCN-3000 performs object detection (with position-sensitive filters) for a fixed number of super-classes, followed by fine-grained classification (without position-sensitive filters) within each super-class. The super-classes are obtained by clustering the deep semantic features of images (2048-dimensional features of ResNet-101 in this case); therefore, we do not require a semantic hierarchy. The fine-grained class probability at a given location is obtained by multiplying the super-class probability with the classification probability of the fine-grained category within the super-class.

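As a rough illustration of how such super-classes might be obtained, the sketch below clusters 2048-dimensional deep features with k-means; the `features` array is a stand-in for real ResNet-101 features, and the per-class feature-averaging scheme is an assumption, not the paper's exact recipe:

```python
import numpy as np
from sklearn.cluster import KMeans

# Assumption: one 2048-d ResNet-101 feature vector per fine-grained class,
# e.g. the mean deep feature over that class's training images.
num_fine_classes = 3000
features = np.random.randn(num_fine_classes, 2048)  # stand-in for real features

K = 100  # number of super-classes (the paper varies this from 1 to 100)
kmeans = KMeans(n_clusters=K, random_state=0).fit(features)

# super_class[c] is the super-class index of fine-grained class c;
# no semantic hierarchy (e.g. WordNet) is needed.
super_class = kmeans.labels_
```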

In order to study the effect of using super-classes instead of individual object categories, we varied the number of super-classes from 1 to 100 and evaluated the performance on the ImageNet detection dataset. Surprisingly, the detector performs well even with one super-class! This observation indicates that position-sensitive filters can potentially learn to detect universal objectness. It also reaffirms a well-researched concept from the past [1, 2, 39] that objectness is a generic concept and a universal objectness detector can be learned. Thus, for performing object detection, it suffices to multiply the objectness score of an RoI with the classification probability for a given class. This results in a fast detector for thousands of classes, as per-class position-sensitive filters are no longer needed. On the PASCAL-VOC dataset, with only our objectness-based detector, we observe a 1.5% drop in mAP compared to the deformable R-FCN [6] detector with class-specific filters for all 20 object classes. R-FCN-3000, trained for 3000 classes, obtains an 18% improvement in mAP over the current state-of-the-art large-scale object detector (YOLO-9000) on the ImageNet detection dataset. Finally, we also evaluate the generalizability of our objectness detector on unseen classes (a zero-shot setting for localization) and observe that the generalization error decreases as we train the objectness detector on larger numbers of classes.

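A minimal sketch of this scoring rule for the one-super-class (pure objectness) case, with illustrative array names:

```python
import numpy as np

def detection_scores(objectness, class_logits):
    """Detection score per class = objectness * softmax(class logits).

    objectness:   scalar P(object) for one RoI, from the shared
                  position-sensitive objectness branch.
    class_logits: (num_classes,) fine-grained classification logits
                  for the same RoI.
    """
    exp = np.exp(class_logits - class_logits.max())  # numerically stable softmax
    class_probs = exp / exp.sum()
    return objectness * class_probs

# Example: a confident RoI whose classifier favors the first class.
scores = detection_scores(0.9, np.array([2.0, 0.5, -1.0]))
```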

Large-scale localization using deep convolutional networks was first performed in [33, 35], which used regression for predicting the location of bounding boxes. Later, RPN [31] was used for localization in ImageNet classification [15]. However, no evaluations were performed to determine whether these networks generalize when applied to detection datasets without being specifically trained on them. Weakly-supervised detection has been a major focus over the past few years for solving large-scale object detection. In [17], knowledge of detectors trained with bounding boxes was transferred to classes for which no bounding boxes are available. The assumption is that it is possible to train object detectors on a fixed number of classes; for a class for which supervision is not available, transformations are learned to adapt the classifier into a detector. Multiple-instance learning based approaches have also been proposed which can leverage weakly supervised data for adapting classifiers to detectors [18]. Recently, YOLO-9000 [30] jointly trains on classification and detection data. When it sees a classification image, the classification loss is back-propagated on the bounding box which has the highest probability. It assumes that the predicted box is the ground-truth box and uses the difference between other anchors and the predicted box as the objectness loss. YOLO-9000 is fast, as it uses a lightweight network and only 3 filters per class for performing localization; for good localization, however, just 3 priors are not sufficient.

For classifying and localizing a large number of classes, some methods leverage the fact that parts can be shared across object categories [27, 32, 37, 28]. Sharing filters for object parts reduces model complexity and also reduces the amount of training data required for learning part-based filters. Even in traditional methods, it has been shown that shared filters are more generic [37]. However, current detectors like Deformable-R-FCN [6], R-FCN [5], and RetinaNet [23] do not share filters (in the final classification layer) across object categories; because of this, inference is slow when they are applied to thousands of categories. Taking motivation from prior work on sharing filters across object categories, we propose an architecture in which filters can be shared across some object categories for large-scale object detection.

The extreme version of sharing parts is objectness, where we assume that all objects have something in common. Early in this decade (if not before), it was proposed that objectness is a generic concept, and it was demonstrated that only a very few category-agnostic proposals were sufficient to obtain high recall [39, 3, 2, 1]. With a bag-of-words feature representation [21] for these proposals, better performance was shown compared to a sliding-window part-based model [11] for object detection. R-CNN [13] used the same proposals for object detection but also applied per-class bounding-box regression to refine the location of these proposals. Subsequently, it was observed that per-class regression was not necessary and that a class-agnostic regression step is sufficient to refine the proposal position [5]. Therefore, if the regression step is class-agnostic and it is possible to obtain a reasonable objectness score, a simple classification layer should be sufficient to perform detection. We can simply multiply the objectness probability with the classification probability to make a detector! Therefore, in the extreme case, we set the number of super-classes to one and show that we can train a detector which obtains an mAP very close to state-of-the-art object-detection architectures [5].

This section provides a brief introduction to Deformable R-FCN [6], which is used in R-FCN-3000. In R-FCN [5], atrous convolution [4] is used in the conv5 layer to increase the resolution of the feature map while still utilizing the pre-trained weights from the ImageNet classification network. In Deformable-R-FCN [6], the atrous convolution is replaced by a deformable convolution structure in which a separate branch predicts offsets for each pixel in the feature map, and the convolution kernel is applied after the offsets have been applied to the feature map. A region proposal network (RPN), a two-layer CNN on top of the conv4 features, is used for generating object proposals. Efficiently implemented local convolutions, referred to as position-sensitive filters, are used to classify these proposals.

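For intuition about the position-sensitive filters, here is a naive, unvectorized sketch of position-sensitive RoI pooling for a single class; production implementations (e.g. the fused kernels behind R-FCN) are far more efficient, and the integer RoI coordinates here are a simplification:

```python
import numpy as np

def ps_roi_pool(score_maps, roi, k=7):
    """Naive position-sensitive RoI pooling.

    score_maps: (k*k, H, W) array, one score map per spatial bin
                (for one class; R-FCN stacks k*k such maps per class).
    roi:        (x0, y0, x1, y1) box in feature-map coordinates (ints).
    Returns a (k, k) grid; averaging it yields the score for this class.
    """
    x0, y0, x1, y1 = roi
    bin_w = (x1 - x0) / k
    bin_h = (y1 - y0) / k
    out = np.zeros((k, k))
    for i in range(k):        # vertical bin index
        for j in range(k):    # horizontal bin index
            y_lo = int(y0 + i * bin_h)
            x_lo = int(x0 + j * bin_w)
            ys = slice(y_lo, max(int(y0 + (i + 1) * bin_h), y_lo + 1))
            xs = slice(x_lo, max(int(x0 + (j + 1) * bin_w), x_lo + 1))
            # The (i, j) bin reads ONLY its own score map: this is what
            # makes the pooling position-sensitive.
            out[i, j] = score_maps[i * k + j, ys, xs].mean()
    return out
```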

We demonstrate that it is possible to predict a universal objectness score by using only one set of filters for object vs. background detection. This objectness score can simply be multiplied with the classification score for detecting objects, with only a marginal drop in performance. Finally, we show that the learned objectness generalizes to unseen classes and that performance increases with the number of training object classes, which bolsters the hypothesis of the universality of objectness.

This paper presents significant improvements for large-scale object detection, but many questions remain unanswered. Some promising research questions are: How can we accelerate the classification stage of R-FCN-3000 for detecting 100,000 classes? A typical image contains only a limited number of object categories; how can this prior be used to accelerate inference? What changes are needed in this architecture if we also need to detect objects and their parts? Since it is expensive to label each object instance with all valid classes in every image, can we learn robust object detectors if some objects in the dataset are not labelled?

Figure 2. R-FCN-3000 first generates region proposals, which are provided as input to a super-class detection branch (like R-FCN) that jointly predicts the detection scores for each super-class (sc). A class-agnostic bounding-box regression step refines the position of each RoI (not shown). To obtain the semantic class, we do not use position-sensitive filters but predict per-class scores in a fully convolutional fashion. Finally, we average-pool the per-class scores inside the RoI to get the classification probability. The classification probability is multiplied with the super-class detection probability for detecting 3000 classes. When K is 1, the super-class detector predicts objectness.

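Putting the branches of Fig. 2 together, the sketch below scores a single RoI under stated assumptions: the tensor names are illustrative, and the softmax over the pooled per-class scores is a simplification of the paper's exact normalization:

```python
import numpy as np

def roi_detection_scores(sc_probs, cls_score_map, roi, super_class):
    """Combine the two branches of Fig. 2 for a single RoI.

    sc_probs:      (K,) super-class detection probabilities for this RoI
                   (K = 1 reduces to plain objectness).
    cls_score_map: (num_classes, H, W) fully convolutional per-class scores.
    roi:           (x0, y0, x1, y1) in feature-map coordinates (ints).
    super_class:   (num_classes,) super-class index of each fine class.
    """
    x0, y0, x1, y1 = roi
    # Average-pool the per-class scores inside the RoI ...
    pooled = cls_score_map[:, y0:y1, x0:x1].mean(axis=(1, 2))
    exp = np.exp(pooled - pooled.max())
    cls_probs = exp / exp.sum()          # ... then normalize over classes.
    # Multiply each class probability with its super-class detection
    # probability to get the final per-class detection scores.
    return sc_probs[super_class] * cls_probs
```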
