Frustratingly Simple Few-Shot Object Detection 简单小样本目标检测
论文地址:https://arxiv.org/pdf/2003.06957.pdf
代码地址:GitHub - ucbdrive/few-shot-object-detection: Implementations of few-shot object detection benchmarks
Detecting rare objects from a few examples is an emerging problem. Prior works show metalearning is a promising approach. But, finetuning techniques have drawn scant attention. We find that fine-tuning only the last layer of existing detectors on rare classes is crucial to the few-shot object detection task. Such a simple approach outperforms the meta-learning methods by roughly 2∼20 points on current benchmarks and sometimes even doubles the accuracy of the prior methods. However, the high variance in the few samples often leads to the unreliability of existing benchmarks. We revise the evaluation protocols by sampling multiple groups of training examples to obtain stable comparisons and build new benchmarks based on three datasets: PASCAL VOC, COCO and LVIS. Again, our fine-tuning approach establishes a new state of the art on the revised benchmarks. The code as well as the pretrained models are available at https://github.com/ucbdrive/ few-shot-object-detection.
从几个例子中检测稀有物体是一个新出现的问题。先前的工作表明MetalLearning是一种很有前途的方法。但是,微调技术很少受到关注。我们发现,仅微调稀有类上现有检测器的最后一层对于小样本目标检测任务至关重要。这样一种简单的方法比元学习方法比目前的基准大约高出上2-20%个点,有时甚至比以前的方法精度高出一倍。然而,小数样本的高方差往往导致现有基准的不可靠性。我们通过抽样多组训练样本来修改评估协议,以获得稳定的比较,并基于三个数据集:PASCAL VOC、COCO和L VIS建立新的基准。同样,我们的微调方法在修订的基准上建立了新的技术水平。代码以及预训练模型在https://github.com/ucbdrive/ few-shot-object-detection
Machine perception systems have witnessed significant progress in the past years. Yet, our ability to train models that generalize to novel concepts without abundant labeled data is still far from satisfactory when compared to human visual systems. Even a toddler can easily recognize a new concept with very little instruction (Landau et al., 1988; Samuelson & Smith, 2005; Smith et al., 2002).
机器感知系统在过去几年中取得了重大进展。然而,与人类视觉系统相比,我们在没有大量标记数据的情况下训练概括为新概念的模型的能力仍然远远不能令人满意。即使是蹒跚学步的孩子也能在很少的指导下很容易地识别出一个新概念(Landau等人,1988年;Samuelson& Smith,2005年;Smith等人,2002年)。
The ability to generalize from only a few examples (so called few-shot learning) has become a key area of interest in the machine learning community. Many (Vinyals et al., 2016; Snell et al., 2017; Finn et al., 2017; Hariharan & Girshick, 2017; Gidaris & Komodakis, 2018; Wang et al., 2019a) have explored techniques to transfer knowledge from the data-abundant base classes to the data-scarce novel classes through meta-learning. They use simulated few-shot tasks by sampling from base classes during training to learn to learn from the few examples in the novel classes.
仅从少数示例(所谓的小样本学习)进行概括的能力已成为机器学习社区感兴趣的一个关键领域。许多人(Vinyals等人,2016年;Snell等人,2017年;Finn等人,2017年;Hariharan和Girshick,2017年;Gidaris和Komodakis,2018年;Wang等人,2019a)探索了通过元学习将知识从数据丰富的基类转移到数据稀缺的新类的技术。他们在训练过程中通过从基类中取样,使用模拟的小样本任务,学习从新类中的少量示例中学习。
However, much of this work has focused on basic image classification tasks. In contrast, few-shot object detection has received far less attention. Unlike image classification, object detection requires the model to not only recognize the object types but also localize the targets among millions of potential regions. This additional subtask substantially raises the overall complexity. Several (Kang et al., 2019; Yan et al., 2019; Wang et al., 2019b) have attempted to tackle the under-explored few-shot object detection task, where only a few labeled bounding boxes are available for novel classes. These methods attach meta learners to existing object detection networks, following the meta-learning methods for classification. But, current evaluation protocols suffer from statistical unreliability, and the accuracy of baseline methods, especially simple fine-tuning, on few-object detection are not consistent in the literature.
然而,这项工作的大部分都集中在基本的图像分类任务上。相比之下,小样本目标检测受到的关注要少得多。与图像分类不同,目标检测不仅需要识别目标类型,还需要在数百万个潜在区域中定位目标。这个额外的子任务大大提高了总体复杂性。一些(Kang等人,2019年;Yan等人,2019年;Wang等人,2019b)试图解决探索不足的小样本目标检测任务,其中只有少量标记的边界框可用于新类。这些方法将元学习器附加到现有的目标检测网络,遵循分类的元学习方法。但是,目前的评估协议存在统计不可靠性,并且基线方法的准确性,特别是简单的微调,在小样本目标检测方面,在文献中并不一致。
In this work, we propose improved methods to evaluate few-shot object detection. We carefully examine finetuning based approaches, which are considered to be underperforming in the previous works (Kang et al., 2019; Yan et al., 2019; Wang et al., 2019b). We focus on the training schedule and the instance-level feature normalization of the object detectors in model design and training based on fine-tuning
在这项工作中,我们提出了改进的方法来评估小样本目标检测。我们仔细检查了基于微调的方法,这些方法在之前的工作中被认为表现不佳(Kang等人,2019年;Yan等人,2019年;Wang等人,2019b)。在基于微调的模型设计和训练中,我们重点研究了目标检测器的训练计划和实例级特征规范化。
We adopt a two-stage training scheme for fine-tuning as shown in Figure 1. We first train the entire object detector, such as Faster R-CNN (Ren et al., 2015), on the dataabundant base classes, and then only fine-tune the last layers of the detector on a small balanced training set consisting of both base and novel classes while freezing the other parameters of the model. During the fine-tuning stage, we introduce instance-level feature normalization to the box classifier inspired by Gidaris & Komodakis (2018); Qi et al. (2018); Chen et al. (2019).
我们采用两阶段训练方案进行微调,如图1所示。我们首先在数据丰富的基类上训练整个目标检测器,例如Faster R-CNN(Ren et al.,2015),然后在一个由基类和新类组成的小型平衡训练集上微调检测器的最后一层,同时冻结模型的其他参数。在微调阶段,我们根据Gidaris& Komodakis(2018)的启发,将实例级特征规范化引入边界框分类器;齐等(2018);Chen等人(2019年)。
Figure 1. Illustration of our two-stage fine-tuning approach (TFA). In the base training stage, the entire object detector, including both the feature extractor F and the box predictor, are jointly trained on the base classes. In the few-shot fine-tuning stage, the feature extractor components are fixed and only the box predictor is fine-tuned on a balanced subset consisting of both the base and novel classes. 说明我们的两阶段微调方法(TFA)。在基本训练阶段,整个目标检测器(包括特征提取器F和框预测器)在基类上联合训练。在小样本微调阶段,特征提取器组件是固定的,只有框预测器在由基类和新类组成的平衡子集上进行微调。
We find that this two-stage fine-tuning approach (TFA) outperforms all previous state-of-the-art meta-learning based methods by 2∼20 points on the existing PASCAL VOC (Everingham et al., 2007) and COCO (Lin et al., 2014) benchmarks. When training on a single novel example (one-shot learning), our method can achieve twice the accuracy of prior sophisticated state-of-the-art approaches.
我们发现,这种两阶段微调方法(TFA)在现有PASCAL VOC(Everingham等人,2007年)和COCO(Lin等人,2014年)基准数据集上,比以前所有最先进的基于元学习的方法高出2-20%。当在单个新示例(一次性学习)上进行训练时,我们的方法可以达到先前先进的最新方法的两倍精度。
Several issues with the existing evaluation protocols prevent consistent model comparisons. The accuracy measurements have high variance, making published comparisons unreliable. Also, the previous evaluations only report the detection accuracy on the novel classes, and fail to evaluate knowledge retention on the base classes.
现有评估协议的几个问题妨碍了一致的模型比较。准确度测量值方差较大,使得公布的比较结果不可靠。此外,以前的评估仅报告新类的检测精度,而没有评估基类的知识保留。
To resolve these issues, we build new benchmarks on three datasets: PASCAL VOC, COCO and LVIS (Gupta et al., 2019). We sample different groups of few-shot training examples for multiple runs of the experiments to obtain a stable accuracy estimation and quantitatively analyze the variances of different evaluation metrics. The new evaluation reports the average precision (AP) on both the base classes and novel classes as well as the mean AP on all classes, referred to as the generalized few-shot learning setting in the few-shot classification literature (Hariharan & Girshick, 2017; Wang et al., 2019a).
为了解决这些问题,我们在三个数据集上建立了新的基准:PASCAL VOC、COCO和LVIS(Gupta et al.,2019)。我们对不同的few-shot训练样本进行多次实验,以获得稳定的准确度估计,并定量分析不同评估指标的方差。新的评估报告了基本类和新类的平均精度(AP)以及所有类的平均AP,在小样本分类文献中称为广义小样本学习设置(Hariharan&Girshick,2017;Wang等人,2019a)。
Our fine-tuning approach establishes new states of the art on the benchmarks. On the challenging LVIS dataset, our two-stage training scheme improves the average detection precision of rare classes (<10 images) by ∼4 points and common classes (10-100 images) by ∼2 points with negligible precision loss for the frequent classes (>100 images)
我们的微调方法在基准上建立了新的技术状态。在具有挑战性的LVIS数据集上,我们的两阶段训练方案将稀有类(<10幅图像)的平均检测精度提高了4个点,普通类(10-100张图像)的平均检测精度提高了2个点,可忽略常见类的精度损失(>100张图像)
Our work is related to the rich literature on few-shot image classification, which uses various meta-learning based or metric-learning based methods. We also draw connections between our work and the existing meta-learning based fewshot object detection methods. To the best of our knowledge, we are the first to conduct a systematic analysis of finetuning based approaches on few-shot object detection.
我们的工作与丰富的小样本图像分类文献相关,该分类使用各种基于元学习或基于度量学习的方法。我们还将我们的工作与现有的基于元学习的小样本目标检测方法联系起来。据我们所知,我们是第一个对基于微调的小样本目标检测方法进行系统分析的人。
Meta-learning. The goal of meta-learning is to acquire task-level meta knowledge that can help the model quickly adapt to new tasks and environments with very few labeled examples. Some (Finn et al., 2017; Rusu et al., 2018; Nichol et al., 2018) learn to fine-tune and aim to obtain a good parameter initialization that can adapt to new tasks with a few scholastic gradient updates. Another popular line of research on meta-learning is to use parameter generation during adaptation to novel tasks. Gidaris & Komodakis (2018) propose an attention-based weight generator to generate the classifier weights for the novel classes. Wang et al. (2019a) construct task-aware feature embeddings by generating parameters for the feature layers. These approaches have only been used for few-shot image classification and not on more challenging tasks like object detection.
元学习。元学习的目标是获取任务级的元知识,这些元知识可以帮助模型快速适应新任务和环境,而标记的示例很少。一些人(Finn等人,2017年;Rusu等人,2018年;Nichol等人,2018年)学会了微调,目的是获得一个良好的参数初始化,能够通过一些学术梯度更新来适应新任务。元学习的另一个流行研究方向是在适应新任务的过程中使用参数生成。Gidaris和Komodakis(2018)提出了一种基于注意力机制的权重生成器,用于生成新类的分类器权重。Wang等人(2019a)通过生成特征层的参数来构建任务感知特征嵌入。这些方法仅用于小样本图像分类,而不用于更具挑战性的任务,如目标检测。
However, some (Chen et al., 2019) raise concerns about the reliability of the results given that a consistent comparison of different approaches is missing. Some simple fine-tuning based approaches, which draw little attention in the community, turn out to be more favorable than many prior works that use meta-learning on few-shot image classification (Chen et al., 2019; Dhillon et al., 2019). As for the emerging few-shot object detection task, there is neither consensus on the evaluation benchmarks nor a consistent comparison of different approaches due to the increased network complexity, obscure implementation details, and variances in evaluation protocols.
然而,一些人(Chen等人,2019年)对结果的可靠性提出了担忧,因为缺少对不同方法的一致性比较。一些简单的基于微调的方法在社区中很少引起关注,结果证明比许多以前在小样本头图像分类中使用元学习的工作更有利(Chen等人,2019年;Dhillon等人,2019年)。对于新出现的小样本目标检测任务,由于网络复杂性增加、实现细节模糊以及评估协议的差异,既没有对评估基准达成共识,也没有对不同方法进行一致的比较。
Metric-learning. Another line of work (Koch, 2015; Snell et al., 2017; Vinyals et al., 2016) focuses on learning to compare or metric-learning. Intuitively, if the model can construct distance metrics to estimate the similarity between two input images, it may generalize to novel categories with few labeled instances. More recently, several (Chen et al., 2019; Gidaris & Komodakis, 2018; Qi et al., 2018) adopt a cosine similarity based classifier to reduce the intraclass variance on the few-shot classification task, which leads to favorable performance compared to many metalearning based approaches. Our method also adopts a cosine similarity classifier to classify the categories of the region proposals. However, we focus on the instance-level distance measurement rather than on the image level.
度量学习。另一项工作(Koch,2015;Snell等人,2017;Vinyals等人,2016)侧重于学习比较或度量学习。直观地说,如果该模型能够构造距离度量来估计两幅输入图像之间的相似性,它可以推广到具有少量标记实例的新类别。最近,几个团队(Chen等人,2019年;Gidaris& Komodakis,2018年;Qi等人,2018年)采用了基于余弦相似性的分类器,以减少小样本分类任务的组内方差,与许多基于MetalLearning的方法相比,这带来了良好的性能。我们的方法还采用了一个余弦相似性分类器来分类区域建议的类别。然而,我们关注的是实例级距离测量,而不是图像级距离测量。
Few-shot object detection. There are several early attempts at few-shot object detection using meta-learning. Kang et al. (2019) and Yan et al. (2019) apply feature re-weighting schemes to a single-stage object detector (YOLOv2) and a two-stage object detector (Faster R-CNN), with the help of a meta learner that takes the support images (i.e., a small number of labeled images of the novel/base classes) as well as the bounding box annotations as inputs. Wang et al. (2019b) propose a weight prediction meta-model to predict parameters of category-specific components from the few examples while learning the category-agnostic components from base class examples.
小样本目标检测。有几次早期尝试使用元学习进行小样本目标检测。Kang等人(2019年)和Yan等人(2019年)在获取支持图像的元学习者的帮助下,将特征重加权方案应用于单级目标检测器(YOLOv2)和两级目标检测器(Faster R-CNN)(即,少量新颖/基类的标记图像)以及边界框注释作为输入。Wang et al.(2019b)提出了一个权重预测元模型,从少数示例中预测类别特定组件的参数,同时从基类示例中学习类别不可知组件。
In all these works, fine-tuning based approaches are considered as baselines with worse performance than metalearning based approaches. They consider jointly finetuning, where base classes and novel classes are trained together, and fine-tuning the entire model, where the detector is first trained on the base classes only and then fine-tuned on a balanced set with both base and novel classes. In contrast, we find that fine-tuning only the last layer of the object detector on the balanced subset and keeping the rest of model fixed can substantially improve the detection accuracy, outperforming all the prior meta-learning based approaches. This indicates that feature representations learned from the base classes might be able to transfer to the novel classes and simple adjustments to the box predictor can provide strong performance gain (Dhillon et al., 2019).
在所有这些工作中,基于微调的方法被认为是性能比基于元学习的方法差。他们考虑联合微调,其中基类和新类被一起训练,并对整个模型进行微调,其中检测器首先只对基类进行训练,然后对具有基础和新颖类的平衡集进行微调。相比之下,我们发现仅微调平衡子集上的目标检测器的最后一层并保持模型的其余部分固定,可以显著提高检测精度,优于所有先前基于元学习的方法。这表明从基类学习到的特征表示可能能够转移到新类,对框预测器的简单调整可以提供强大的性能增益(Dhillon et al.,2019)。
In this section, we start with the preliminaries on the fewshot object detection setting. Then, we talk about our twostage fine-tuning approach in Section 3.1. Section 3.2 summarizes the previous meta-learning approaches.
在本节中,我们将从小样本目标检测设置的初步介绍开始。然后,我们将在第3.1节讨论我们的两阶段微调方法。第3.2节总结了以前的元学习方法。
We follow the few-shot object detection settings introduced in Kang et al. (2019). There are a set of base classes Cb that have many instances and a set of novel classes Cn that have only K (usually less than 10) instances per category. For an object detection dataset D = {(x, y), x ∈ X, y ∈ Y}, where x is the input image and y = {(ci,li), i = 1, ..., N} denotes the categories c ∈ Cb∪ Cnand bounding box coordinates l of the N object instances in the image x. For synthetic few-shot datasets using PASCAL VOC and COCO, the novel set for training is balanced and each class has the same number of annotated objects (i.e., K-shot). The recent L VIS dataset has a natural long-tail distribution, which does not have the manual K-shot split. The classes in LVIS are divided into frequent classes (appearing in more than 100 images), common classes (10-100 images), and rare classes (less than 10 images). We consider both synthetic and natural datasets in our work and follow the naming convention of k-shot for simplicity.
我们遵循Kang等人(2019)中介绍的小样本目标检测设置。有一组基类Cb有许多实例,还有一组新类Cn每个类别只有K个(通常少于10个)实例。对于目标检测数据集D={(x,y),x∈ X, y∈ Y} ,其中x是输入图像,y={(ci,li),i=1,…,N}表示类别c∈ Cb∪ Cn和图像x中N个目标实例的边界框坐标l。对于使用PASCAL VOC和COCO的合成小样本数据集,新的训练集是平衡的,每个类具有相同数量的标注目标(即K-shot)。最近的L-VIS数据集具有自然的长尾分布,没有手动K-shot分割。LVIS中的类分为常见类(出现在100多张图像中)、普通类(10-100张图像)和罕见类(少于10张图像)。在我们的工作中考虑合成和自然数据集,为简单起见命名为K-shot。
The few-shot object detector is evaluated on a test set of both the base classes and the novel classes. The goal is to optimize the detection accuracy measured by average precision (AP) of the novel classes as well as the base classes. This setting is different from the N-way-K-shot setting (Finn et al., 2017; Vinyals et al., 2016; Snell et al., 2017) commonly used in few-shot classification.
在基类和新类的测试集上对小样本目标检测器进行评估。目标是优化通过新类和基类的平均精度(AP)测量的检测精度。该设置不同于小样本分类中常用的N-way K-shot设置(Finn等人,2017;Vinyals等人,2016;Snell等人,2017)。
We describe our two-stage fine-tuning approach (TFA) for few-shot object detection in this section. We adopt the widely used Faster R-CNN (Ren et al., 2015), a two-stage object detector, as our base detection model. As shown in Figure 1, the feature learning components, referred to as F, of a Faster R-CNN model include the backbone (e.g., ResNet (He et al., 2016), VGG16 (Simonyan & Zisserman, 2014)), the region proposal network (RPN), as well as a twolayer fully-connected (FC) sub-network as a proposal-level feature extractor. There is also a box predictor composed of a box classifier C to classify the object categories and a box regressor R to predict the bounding box coordinates. Intuitively, the backbone features as well as the RPN features are class-agnostic. Therefore, features learned from the base classes are likely to transfer to the novel classes without further parameter updates. The key component of our method is to separate the feature representation learning and the box predictor learning into two stages.
在本节中,我们将介绍用于小样本目标检测的两阶段微调方法(TFA)。我们采用了广泛使用的Faster R-CNN(Ren等人,2015),一种两级目标检测器,作为我们的基本检测模型。如图1所示,Faster R-CNN模型的特征学习比分(称为F)包括主干网(如ResNet(He et al.,2016)、VGG16(Simonyan&Zisserman,2014))、区域建议网络(RPN)以及作为建议级特征提取器的两层完全连接(FC)子网络。还有一个框预测器,由框分类器C和框回归器R组成,框分类器C用于分类目标类别,框回归器R用于预测边界框坐标。直观地说,主干特性和RPN特性都是与类无关的。因此,从基类学习到的特征可能会转移到新类,而无需进一步的参数更新。该方法的关键部分是将特征表示学习和框预测学习分为两个阶段。
Base model training. In the first stage, we train the feature extractor and the box predictor only on the base classes Cb, with the same loss function used in Ren et al. (2015). The joint loss is,
基础模型训练。在第一阶段,我们仅在基类Cb上训练特征提取器和框预测器,使用Ren等人(2015)中使用的相同损失函数。共同损失是,
where Lrpnis applied to the output of the RPN to distinguish foreground from backgrounds and refine the anchors, Lcls is a cross-entropy loss for the box classifier C, and Llocis a smoothed L1loss for the box regressor R.
其中,LRPN应用于RPN的输出以区分前景和背景并细化锚框,Lcls是框分类器C的交叉熵损失,Lloc是框回归器R的平滑L1损失。
Few-shot fine-tuning. In the second stage, we create a small balanced training set with K shots per class, containing both base and novel classes. We assign randomly initialized weights to the box prediction networks for the novel classes and fine-tune only the box classification and regression networks, namely the last layers of the detection model, while keeping the entire feature extractor F fixed. We use the same loss function in Equation 1 and a smaller learning rate. The learning rate is reduced by 20 from the first stage in all our experiments.
小样本微调。在第二阶段中,我们创建了一个小的平衡训练集,每个类有K shots,包含基本类和新类。我们为新类别的框预测网络分配随机初始化的权重,并仅微调框分类和回归网络,即检测模型的最后一层,同时保持整个特征提取器F固定。我们在方程1中使用相同的损失函数和较小的学习率。在我们所有的实验中,学习率比第一阶段降低了20%。
Cosine similarity for box classifier. We consider using a classifier based on cosine similarity in the second fine-tuning stage, inspired by Gidaris & Komodakis (2018); Qi et al. (2018); Chen et al. (2019). The weight matrix W ∈ Rd×cof the box classifier C can be written as [w1, w2, ..., wc], where wc∈ Rd is the per-class weight vector. The output of C is scaled similarity scores S of the input feature F(x) and the weight vectors of different classes. The entries in S are
框分类器的余弦相似性。在GIDARIS和KOMODAKIS(2018)的启发下,我们考虑使用基于余弦相似性的分类器在第二微调阶段;齐等(2018);Chen等人(2019年)。框分类器C的权重矩阵可以写成[w1,w2,…,wc],其中wc∈ Rd是每类权重向量。C的输出是输入特征F(x)和不同类别的权重向量的缩放相似性分数S。S为
where si,j is the similarity score between the i-th object proposal of the input x and the weight vector of class j. α is the scaling factor. We use a fixed α of 20 in our experiments. We find empirically that the instance-level feature normalization used in the cosine similarity based classifier helps reduce the intra-class variance and improves the detection accuracy of novel classes with less decrease in detection accuracy of base classes when compared to a FC-based classifier, especially when the number of training examples is small.
其中,si,j是输入x的第i个目标建议框和类j的权重向量之间的相似性得分。α是比例因子。在我们的实验中,我们固定α为20。从经验上发现,与基于FC的分类器相比,基于余弦相似性的分类器中使用的实例级特征归一化有助于减少类内方差,提高新类的检测精度,同时减少基类检测精度的降低,特别是当训练示例数量较少时。
We describe the existing meta-learning based few-shot object detection networks, including FSRW (Kang et al.,2019), Meta R-CNN (Yan et al., 2019) and MetaDet (Wang et al., 2019b), in this section to draw comparisons with our approach. Figure 2 illustrates the structures of these networks. In meta-learning approaches, in addition to the base object detection model that is either single-stage or two-stage, a meta-learner is introduced to acquire class-level meta knowledge and help the model generalize to novel classes through feature re-weighting, such as FSRW and Meta R-CNN, or class-specific weight generation, such as MetaDet. The input to the meta learner is a small set of support images with the bounding box annotations of the target objects.
在本节中,我们描述了现有的基于元学习的小样本目标检测网络,包括FSRW(Kang等人,2019)、Meta R-CNN(Yan等人,2019)和MetaDet(Wang等人,2019b),以与我们的方法进行比较。图2说明了这些网络的结构。在元学习方法中,除了单阶段或两阶段的基本目标检测模型外,还引入了元学习器来获取类级元知识,并通过特征重加权(如FSRW和meta R-CNN)或类特定权重生成(如MetaDet)帮助模型推广到新类。元学习者的输入是一组小的支持图像,带有目标目标的边界框注释。
Figure 2. Abstraction of the meta-learning based few-shot object detectors. A meta-learner is introduced to acquire task-level meta information and help the model generalize to novel classes through feature re-weighting (e.g., FSRW and Meta R-CNN) or weight generation (e.g., MetaDet). A two-stage training approach (meta-training and meta fine-tuning) with episodic learning is commonly adopted. 基于元学习的小样本目标检测器的抽象。引入元学习者获取任务级元信息,并通过特征重加权(如FSRW和meta R-CNN)或权重生成(如MetaDet)帮助模型推广到新类。通常采用两阶段训练法(元训练和元微调)和情景学习。
The base object detector and the meta-learner are often jointly trained using episodic training (Vinyals et al., 2016). Each episode is composed of a supporting set of N objects and a set of query images. In FSRW and Meta R-CNN, the support images and the binary masks of the annotated objects are used as input to the meta-learner, which generates class reweighting vectors that modulate the feature representation of the query images. As shown in Figure 2, the training procedure is also split into a meta-training stage, where the model is only trained on the data of the base classes, and a meta fine-tuning stage, where the support set includes the few examples of the novel classes and a subset of examples from the base classes.
基本目标检测器和元学习者通常使用情景训练进行联合训练(Vinyals等人,2016)。每集由一组支持的N个目标和一组查询图像组成。在FSRW和Meta R-CNN中,支持图像和注释目标的二进制掩码被用作元学习器的输入,元学习器生成类重加权向量,用于调节查询图像的特征表示。如图2所示,训练过程也分为元训练阶段(其中模型仅在基类数据上训练)和元微调阶段(其中支持集包括新类的少数示例和基类示例的子集)。
Both the meta-learning approaches and our approach have a two-stage training scheme. However, we find that the episodic learning used in meta-learning approaches can be very memory inefficient as the number of classes in the supporting set increases. Our fine-tuning method only finetunes the last layers of the network with a normal batch training scheme, which is much more memory efficient.
元学习方法和我们的方法都有两个阶段的训练计划。然而,我们发现在元学习方法中使用的情景学习随着支持集中类的数量的增加而变得非常低效。我们的微调方法仅使用正常的批处理训练方案微调网络的最后一层,这更节省内存。
In this section, we conduct extensive comparisons with previous methods on the existing few-shot object detection benchmarks using PASCAL VOC and COCO, where our approach can obtain about 2∼20 points improvement in all settings (Section 4.1). We then introduce a new benchmark on three datasets (PASCAL VOC, COCO and LVIS) with revised evaluation protocols to address the unreliability of previous benchmarks (Section 4.2). We also provide various ablation studies and visualizations in Section 4.3.
在本节中,我们使用PASCAL VOC和COCO对现有的小样本目标检测基准与以前的方法进行了广泛的比较,我们的方法在所有设置中可以获得大约2∼20点的提高(第4.1节)。然后,我们在三个数据集(PASCAL VOC、COCO和LVIS)上引入了一个新的基准,并修订了评估协议,以解决以前基准的不可靠性(第4.2节)。我们还在第4.3节中提供了各种消融研究和可视化。
Implementation details. We use Faster R-CNN (Ren et al., 2015) as our base detector and Resnet-101 (He et al., 2016) with a Feature Pyramid Network (Lin et al., 2017) as the backbone. All models are trained using SGD with a minibatch size of 16, momentum of 0.9, and weight decay of 0.0001. A learning rate of 0.02 is used during base training and 0.001 during few-shot fine-tuning. For more details, the code is available at https://github.com/ ucbdrive/few-shot-object-detection.
实施细节。我们使用 Faster R-CNN(Ren等人,2015年)作为基本检测器,使用具有特征金字塔网络(Lin等人,2017年)的Resnet-101(He等人,2016年)作为主干。所有模型都使用SGD进行训练,最小批量为16,动量为0.9,重量衰减为0.0001。在基础训练期间使用0.02的学习率,在少数镜头微调期间使用0.001的学习率。有关更多详细信息,请访问https://github.com/ ucbdrive/few-shot-object-detection
Existing benchmarks. Following the previous work (Kang et al., 2019; Yan et al., 2019; Wang et al., 2019b), we first evaluate our approach on PASCAL VOC 2007+2012 and COCO, using the same data splits and training examples provided by Kang et al. (2019). For the few-shot PASCAL VOC dataset, the 20 classes are randomly divided into 15 base classes and 5 novel classes, where the novel classes have K = 1,2,3,5,10 objects per class sampled from the combination of the trainval sets of the 2007 and 2012 versions for training. Three random split groups are considered in this work. PASCAL VOC 2007 test set is used for evaluation. For COCO, the 60 categories disjoint with PASCAL VOC are used as base classes while the remaining 20 classes are used as novel classes with K = 10,30. For evaluation metrics, AP50 (matching threshold is 0.5) of the novel classes is used on PASCAL VOC and the COCO-style AP of the novel classes is used on COCO.
现有基准。在之前的工作(Kang等人,2019年;Yan等人,2019年;Wang等人,2019b)之后,我们首先使用Kang等人(2019年)提供的相同数据分割和训练示例,评估我们对PASCAL VOC 2007 2012年和COCO的方法。对于少量的PASCAL VOC数据集,将20个类随机分为15个基类和5个新类,其中新类从2007和2012版本的trainval集合的组合中采样,每个类的K=1,2,3,5,10个对象,用于训练。在这项工作中考虑了三个随机分组。PASCAL VOC 2007测试集用于评估。对于COCO,与PASCAL VOC不相交的60个类别用作基类,而剩余的20个类别用作K=10,30的新类。对于评估指标,新类的AP50(匹配阈值为0.5)用于PASCAL。
Baselines. We compare our approach with the metalearning approaches FSRW, Meta-RCNN and MetaDet together with the fine-tuning based approaches: jointly training, denoted by FRCN/YOLO+joint, where the base and novel class examples are jointly trained in one stage, and fine-tuning the entire model, denoted by FRCN/YOLO+ft-full, where both the feature extractor F and the box predictor (C and R) are jointly fine-tuned until convergence in the second fine-tuning stage. FRCN is Faster R-CNN for short. Fine-tuning with less iterations, denoted by FRCN/YOLO+ft, are reported in Kang et al. (2019) and Y an et al. (2019).
基线。我们将我们的方法与元学习方法FSRW、Meta-RCNN和MetaDet以及基于微调的方法进行比较:联合训练,由FRCN/YOLO+联合表示,其中基础和新类示例在一个阶段联合训练,并微调整个模型,由FRCN/YOLO+ft full表示,其中,特征提取器F和盒预测器(C和R)共同微调,直到在第二微调阶段收敛。FRCN是简洁版的Faster R-CNN。Kang等人(2019年)和Y-an等人(2019年)报告了用FRCN/YOLO+ft表示的迭代次数较少的微调。
Results on PASCAL VOC. We provide the average AP50 of the novel classes on PASCAL VOC with three random splits in Table 1. Our approach uses ResNet-101 as the backbone, similar to Meta R-CNN. We implement FRCN+ft-full in our framework, which roughly matches the results reported in Y an et al. (2019). MetaDet uses VGG16 as the backbone, but the performance is similar and sometimes worse compared to Meta R-CNN.
PASCAL VOC结果。我们在表1中提供了PASCAL VOC上三个随机拆分的新类的平均AP50。我们的方法使用ResNet-101作为主干,类似于Meta R-CNN。我们在我们的框架中实现了FRCN+ft full,这与Yan等人(2019)报告的结果大致相符。MetaDet使用VGG16作为主干,但性能与MetaR-CNN类似,有时甚至更差。
In all different data splits and different numbers of training shots, our approach (the last row) is able to outperform the previous methods by a large margin. It even doubles the performance of the previous approaches in the one-shot cases. The improvements, up to 20 points, is much larger than the gap among the previous meta-learning based approaches, indicating the effectiveness of our approach. We also compare the cosine similarity based box classifier (TFA +w/cos) with a normal FC-based classifier (TFA +w/fc) and find that TFA +w/cos is better than TFA +w/fc on extremely low shots (e.g., 1-shot), but the two are roughly similar when there are more training shots, e.g., 10-shot.
在所有不同的数据分割和不同数量的训练中,我们的方法(最后一行)能够大大优于以前的方法。在一次性案例中,它甚至将以前方法的性能提高了一倍。这些改进达到了20点,远远超过了以前基于元学习的方法之间的差距,表明了我们方法的有效性。我们还将基于余弦相似性的框分类器(TFA+w/cos)与基于普通FC的分类器(TFA+w/FC)进行了比较,发现TFA+w/cos在极少样本(例如1样本)上优于TFA+w/FC,但在训练类别较多(例如10shots)时,两者大致相似。
For more detailed comparisons, we cite the numbers from Yan et al. (2019) of their model performance on the base classes in Table 2. We find that our model has a much higher average AP on the base classes than Meta R-CNN with a gap of about 10 to 15 points. To eliminate the differences in implementation details, we also report our re-implementation of FRCN+ft-full and training base only, where the base classes should have the highest accuracy as the model is only trained on the base classes examples. We find that our model has a small decrease in performance, less than 2 points, on the base classes.
为了进行更详细的比较,我们在表2中引用了Yan等人(2019)关于其模型在基类上性能的数据。我们发现我们的模型在基类上的平均AP比Meta R-CNN高得多,差距约为10到15分。为了消除实现细节上的差异,我们还报告了FRCN+ft full和training base only的重新实现,其中基类应具有最高的准确性,因为模型仅在基类示例上进行训练。我们发现我们的模型在基类上的性能有一个小的下降,不到2点。
Results on COCO. Similarly, we report the average AP and AP75 of the 20 novel classes on COCO in Table 3. AP75 means matching threshold is 0.75, a more strict metric than AP50. Again, we consistently outperform previous methods across all shots on both novel AP and novel AP75. We achieve around 1 point improvement in AP over the best performing baseline and around 2.5 points improvement in AP75.
COCO的结果。同样,我们在表3中报告了20个COCO新类别的平均AP和AP75。AP75表示匹配阈值为0.75,比AP50更严格的指标。同样,我们在新AP和新AP75的所有样本中始终优于以前的方法。我们在AP方面比表现最好的基线提高了约1点,比目前提高了约2点。AP75提高了5点。
Revised evaluation protocols. We find several issues with existing benchmarks. First, previous evaluation protocols focus only on the performance on novel classes. This ignores the potential performance drop in base classes and thus the overall performance of the network. Second, the sample variance is large due to the few samples that are used for training. This makes it difficult to draw conclusions from comparisons against other methods, as differences in performance could be insignificant.
修订的评价方案。我们发现现有基准存在一些问题。首先,以前的评估协议只关注新类的性能。这忽略了基类中潜在的性能下降,从而忽略了网络的整体性能。其次,由于用于训练的样本很少,因此样本方差很大。这使得很难从与其他方法的比较中得出结论,因为性能差异可能微不足道。
To address these issues, we first revise the evaluation protocol to include evaluation on base classes. On our benchmark, we report AP on base classes (bAP) and the overall AP in addition to AP on the novel classes (nAP). This allows us to observe trends in performance on both base and novel classes, and the overall performance of the network.
为了解决这些问题,我们首先修改评估协议,以包括对基类的评估。在我们的基准测试中,我们报告了基类上的AP(bAP),以及除了新类上的AP(nAP)之外的总体AP。这使我们能够观察基本类和新类的性能趋势,以及网络的总体性能。
Additionally, we train our models for multiple runs on different random samples of training shots to obtain averages and confidence intervals. In Figure 3, we show the cumulative mean and 95% confidence interval across 40 repeated runs with K = 1,3,5,10 on the first split of PASCAL VOC. Although the performance is high on the first random sample, the average decreases significantly as more samples are used. Additionally, the confidence intervals across the first few runs are large, especially in the low-shot scenario. When we use more repeated runs, the averages stabilizes and the confidence intervals become small, which allows for better comparisons.
此外,我们在不同的随机训练样本上训练我们的模型,以获得平均值和置信区间。在图3中,我们显示了PASCAL VOC第一次拆分时,在K=1,3,5,10的40次重复运行中的累积平均值和95%置信区间。虽然第一个随机样本的性能很高,但随着使用更多样本,平均值会显著降低。此外,前几次运行的置信区间很大,特别是在少样本场景中。当我们使用更多的重复运行时,平均值趋于稳定,置信区间变小,这样可以进行更好的比较。
Figure 3. Cumulative means with 95% confidence intervals across 40 repeated runs, computed on the novel classes of the first split of PASCAL VOC. The means and variances become stable after around 30 runs. 40次重复运行中95%置信区间的累积平均值,根据PASCAL VOC第一次拆分的新类别计算。平均值和方差在大约30次运行后变得稳定。
Results on LVIS. We evaluate our approach on the recently introduced LVIS dataset (Gupta et al., 2019). The number of images in each category in LVIS has a natural long-tail distribution. We treat the frequent and common classes as base classes, and the rare classes as novel classes. The base training is the same as before. During few-shot fine-tuning, we artificially create a balanced subset of the entire dataset by sampling up to 10 instances for each class and fine-tune on this subset.
LVIS的结果。我们在最近引入的LVIS数据集上评估了我们的方法(Gupta等人,2019年)。LVIS中每个类别中的图像数量具有自然的长尾分布。我们将频繁类和公共类视为基类,将稀有类视为新类。基地训练和以前一样。在小样本微调过程中,我们人为地创建整个数据集的一个平衡子集,为每个类采样多达10个实例,并对该子集进行微调。
We show evaluation results on L VIS in Table 4. Compared to the methods in Gupta et al. (2019), our approach is able to achieve better performance of ∼1-1.5 points in overall AP and ∼2-4 points in AP for rare and common classes. We also demonstrate results without using repeated sampling, which is a weighted sampling scheme that is used in Gupta et al. (2019) to address the data imbalance issue. In this setting, the baseline methods can only achieve ∼2-3 points in AP for rare classes. On the other hand, our approach is able to greatly outperform the baseline and increase the AP on rare classes by around 13 points and on common classes by around 1 point. Our two-stage fine-tuning scheme is able to address the severe data imbalance issue without needing repeated sampling.
我们在表4中显示了LVIS的评估结果。与Gupta等人(2019)的方法相比,我们的方法能够实现更好的性能AP总得分为1-1.5分,且稀有类和普通类的AP分数为2-4分。我们还展示了不使用重复抽样的结果,这是Gupta等人(2019)为解决数据不平衡问题而使用的加权抽样方案。在此设置中,基线方法只能实现稀有类别的AP分数为2-3分。另一方面,我们的方法能够大大优于基线,稀有类的AP增加约13个点,普通类的AP增加约1个点。我们的两阶段微调方案能够解决严重的数据不平衡问题,而无需重复采样。
Results on PASCAL VOC and COCO. We show evaluation results on generalized PASCAL VOC in Figure 4 and COCO in Figure 5. On both datasets, we evaluate on the base classes and the novel classes and report AP scores for each. On PASCAL VOC, we evaluate our models over 30 repeated runs and report the average and the 95% confidence interval. On COCO, we provide results on 1, 2, 3, and 5 shots in addition to the 10 and 30 shots used by the existing benchmark for a better picture of performance trends in the low-shot regime. For the full quantitative results of other metrics (e.g., AP50 and AP75), more details are available in the appendix.
PASCAL VOC和COCO的结果。我们在图4和图5中分别显示了广义PASCAL VOC和COCO的评估结果。在这两个数据集上,我们对基础类和新类进行评估,并报告每个类的AP分数。在PASCAL VOC上,我们通过30次重复运行评估了我们的模型,并报告了平均值和95%置信区间。在COCO上,除了现有基准使用的10次和30次测试外,我们还提供了1次、2次、3次和5次测试的结果,以便更好地了解低测试制度下的性能趋势。关于其他指标(如AP50和AP75)的完整定量结果,更多详情见附录。
Weight initialization. We explore two different ways of initializing the weights of the novel classifier before few-shot fine-tuning: (1) random initialization and (2) fine-tuning a predictor on the novel set and using the classifier’s weights as initialization. We compare both methods on K = 1,3,10 on split 3 of PASCAL VOC and COCO and show the results in Table 5. On PASCAL VOC, simple random initialization can outperform initialization using fine-tuned novel weights. On COCO, using the novel weights can improve the performance over random initialization. This is probably due to the increased complexity and number of classes of COCO compared to PASCAL VOC. We use random initialization for all PASCAL VOC experiments and novel initialization for all COCO and L VIS experiments.
权重初始化。我们探索了两种不同的方法来初始化新分类器的权重,然后进行少量微调:(1)随机初始化,(2)微调新集合上的预测值,并使用分类器的权重作为初始化。我们在PASCAL VOC和COCO的第3部分中对K=1,3,10的两种方法进行了比较,结果如表5所示。在PASCAL VOC上,简单的随机初始化可以优于使用微调新权重的初始化。在COCO上,使用新的权重可以提高随机初始化的性能。这可能是因为与PASCAL VOC相比,COCO的复杂性和种类增加了。我们对所有PASCAL VOC实验使用随机初始化,对所有COCO和LVIS实验使用新的初始化。
Scaling factor of cosine similarity. We explore the effect of different scaling factors for computing cosine similarity. We compare three different factors, α = 10,20,50. We use the same evaluation setting as the previous ablation study and report the results in Table 6. On PASCAL VOC, α = 20 outperforms the other scale factors in both base AP and novel AP . On COCO, α = 20 achieves better novel AP at the cost of worse base AP . Since it has the best performance on novel classes across both datasets, we use α = 20 in all of our experiments with cosine similarity.
余弦相似性的比例因子。我们探讨了计算余弦相似性的不同比例因子的影响。我们比较了三个不同的因素,α=10,20,50。我们使用与先前消融研究相同的评估设置,并在表6中报告结果。在PASCAL VOC中,α=20在基础AP和新AP中均优于其他标度因子。在COCO上,α=20以较差的基础AP为代价获得更好的新AP。由于它在两个数据集中的新类上都有最好的性能,我们在所有实验中都使用α=20的余弦相似性。
Detection results. We provide qualitative visualizations of the detected novel objects on PASCAL VOC and COCO in Figure 6. We show both success (green boxes) and failure cases (red boxes) when detecting novel objects for each dataset to help analyze the possible error types. On the first split of PASCAL VOC, we visualize the results of our 10-shot TFA w/ cos model. On COCO, we visualize the results of the 30-shot TFA w/cos model. The failure cases include misclassifying novel objects as similar base objects, e.g., row 2 columns 1, 2, 3, and 4, mislocalizing the objects, e.g., row 2 column 5, and missing detections, e.g., row 4 columns 1 and 5.
检测结果。我们在图6中提供了PASCAL VOC和COCO上检测到的新目标的定性可视化。我们在为每个数据集检测新目标时显示成功(绿色框)和失败(红色框),以帮助分析可能的错误类型。在PASCAL VOC的第一次拆分中,我们可视化了我们的10shotTFA w/cos模型的结果。在COCO上,我们可视化了30shot TFA w/cos模型的结果。故障情况包括将新目标错误分类为类似的基本目标,例如第2行第1列、第2列、第3列和第4列,错误定位目标,例如第2行第5列,以及缺失检测,例如第4行第1列和第5列。
We proposed a simple two-stage fine-tuning approach for few-shot object detection. Our method outperformed the previous meta-learning methods by a large margin on the current benchmarks. In addition, we built more reliable benchmarks with revised evaluation protocols. On the new benchmarks, our models achieved new states of the arts, and on the LVIS dataset our models improved the AP of rare classes by 4 points with negligible reduction of the AP of frequent classes.
我们提出了一种简单的两阶段微调方法,用于小样本目标检测。在当前的基准测试中,我们的方法大大优于以前的元学习方法。此外,我们通过修订评估协议建立了更可靠的基准。在新的基准上,我们的模型达到了新的技术水平,在LVIS数据集上,我们的模型将稀有类的AP提高了4个点,而频繁类的AP的降低可以忽略不计。