[paper] https://arxiv.org/pdf/2005.12872.pdf
[github] https://github.com/facebookresearch/detr
We present a new method that views object detection as a direct set prediction problem.
What the paper does: it proposes a new method that treats object detection as a direct set prediction problem.
Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components like a non-maximum suppression procedure or anchor generation that explicitly encode our prior knowledge about the task.
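Key contribution: the detection pipeline is simplified, removing the need for many hand-designed components, such as non-maximum suppression or anchor generation, that explicitly encode prior knowledge about the task.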
The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture. Given a fixed small set of learned object queries, DETR reasons about the relations of the objects and the global image context to directly output the final set of predictions in parallel. The new model is conceptually simple and does not require a specialized library, unlike many other modern detectors.
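The method in detail:
The new framework is called DEtection TRansformer, or DETR.
DETR has two main ingredients:
1. Loss function: a set-based global loss that forces unique predictions via bipartite matching.
2. Network architecture: a transformer encoder-decoder.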
DETR demonstrates accuracy and run-time performance on par with the well-established and highly-optimized Faster RCNN baseline on the challenging COCO object detection dataset. Moreover, DETR can be easily generalized to produce panoptic segmentation in a unified manner. We show that it significantly outperforms competitive baselines.
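How it predicts: given a small fixed set of learned object queries, DETR reasons about the relations between the objects and the global image context and directly outputs the final set of predictions in parallel. Unlike many other modern detectors, the new model is conceptually simple and does not require a specialized library.
Experimental results: on the COCO object detection dataset, DETR matches Faster R-CNN in accuracy and run time. DETR also generalizes easily to panoptic segmentation in a unified manner.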
The goal of object detection is to predict a set of bounding boxes and category labels for each object of interest. Modern detectors address this set prediction task in an indirect way, by defining surrogate regression and classification problems on a large set of proposals [37,5], anchors [23], or window centers [53,46]. Their performances are significantly influenced by post-processing steps to collapse near-duplicate predictions, by the design of the anchor sets and by the heuristics that assign target boxes to anchors [52]. To simplify these pipelines, we propose a direct set prediction approach to bypass the surrogate tasks. This end-to-end philosophy has led to significant advances in complex structured prediction tasks such as machine translation or speech recognition, but not yet in object detection: previous attempts [43,16,4,39] either add other forms of prior knowledge, or have not proven to be competitive with strong baselines on challenging benchmarks. This paper aims to bridge this gap.
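Motivation:
Earlier deep-learning detectors are not end-to-end. Their performance depends on post-processing that collapses near-duplicate predictions, on the design of the anchor sets, and on the heuristics that assign target boxes to anchors. Earlier attempts at end-to-end detection [43,16,4,39] either need other forms of prior knowledge or are not competitive with strong baselines.
The goal of this paper is an end-to-end detector with competitive performance.
The next three paragraphs describe the characteristics of DETR.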
We streamline the training pipeline by viewing object detection as a direct set prediction problem. We adopt an encoder-decoder architecture based on transformers [47], a popular architecture for sequence prediction. The self-attention mechanisms of transformers, which explicitly model all pairwise interactions between elements in a sequence, make these architectures particularly suitable for specific constraints of set prediction such as removing duplicate predictions.
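Architecture: treating detection as direct set prediction streamlines the training pipeline. DETR adopts a transformer-based encoder-decoder, a popular architecture for sequence prediction. Self-attention explicitly models all pairwise interactions between the elements of a sequence, which makes it well suited to set-prediction constraints such as removing duplicate predictions.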
Our DEtection TRansformer (DETR, see Figure 1) predicts all objects at once, and is trained end-to-end with a set loss function which performs bipartite matching between predicted and ground-truth objects. DETR simplifies the detection pipeline by dropping multiple hand-designed components that encode prior knowledge, like spatial anchors or non-maximal suppression. Unlike most existing detection methods, DETR doesn’t require any customized layers, and thus can be reproduced easily in any framework that contains standard CNN and transformer classes.
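Loss function: DETR predicts all objects at once and is trained end-to-end with a set loss function that performs bipartite matching between predicted and ground-truth (GT) objects. Dropping hand-designed components that encode prior knowledge, such as spatial anchors or non-maximum suppression, simplifies the detection pipeline. Unlike most existing detectors, DETR needs no customized layers, so it can be reproduced easily in any framework that provides standard CNN and transformer classes.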
Compared to most previous work on direct set prediction, the main features of DETR are the conjunction of the bipartite matching loss and transformers with (non-autoregressive) parallel decoding [29,12,10,8]. In contrast, previous work focused on autoregressive decoding with RNNs [43,41,30,36,42]. Our matching loss function uniquely assigns a prediction to a ground truth object, and is invariant to a permutation of predicted objects, so we can emit them in parallel.
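Comparison with earlier set-prediction work: the main feature of DETR is the combination of the bipartite matching loss with transformers that use (non-autoregressive) parallel decoding, whereas earlier work focused on autoregressive decoding with RNNs. The matching loss assigns exactly one prediction to each ground-truth object and is invariant to permutations of the predictions, so the predictions can be emitted in parallel.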
We evaluate DETR on one of the most popular object detection datasets, COCO [24], against a very competitive Faster R-CNN baseline [37]. Faster RCNN has undergone many design iterations and its performance was greatly improved since the original publication. Our experiments show that our new model achieves comparable performances. More precisely, DETR demonstrates significantly better performance on large objects, a result likely enabled by the non-local computations of the transformer. It obtains, however, lower performances on small objects. We expect that future work will improve this aspect in the same way the development of FPN [22] did for Faster R-CNN.
Experimental results on the COCO dataset.
Training settings for DETR differ from standard object detectors in multiple ways. The new model requires an extra-long training schedule and benefits from auxiliary decoding losses in the transformer. We thoroughly explore what components are crucial for the demonstrated performance.
Limitations of the method: the new model needs an extra-long training schedule, and it relies on auxiliary decoding losses in the transformer. The paper studies which components are crucial to the reported performance.
The design ethos of DETR easily extends to more complex tasks. In our experiments, we show that a simple segmentation head trained on top of a pretrained DETR outperforms competitive baselines on Panoptic Segmentation [19], a challenging pixel-level recognition task that has recently gained popularity.
Strengths of the method.
There is no canonical deep learning model to directly predict sets. The basic set prediction task is multilabel classification (see e.g., [40,33] for references in the context of computer vision) for which the baseline approach, one-vs-rest, does not apply to problems such as detection where there is an underlying structure between elements (i.e., near-identical boxes). The first difficulty in these tasks is to avoid near-duplicates. Most current detectors use postprocessings such as non-maximal suppression to address this issue, but direct set prediction methods are postprocessing-free. They need global inference schemes that model interactions between all predicted elements to avoid redundancy. For constant-size set prediction, dense fully connected networks [9] are sufficient but costly. A general approach is to use auto-regressive sequence models such as recurrent neural networks [48]. In all cases, the loss function should be invariant to a permutation of the predictions. The usual solution is to design a loss based on the Hungarian algorithm [20], to find a bipartite matching between ground-truth and prediction. This enforces permutation-invariance, and guarantees that each target element has a unique match. We follow the bipartite matching loss approach. In contrast to most prior work however, we step away from autoregressive models and use transformers with parallel decoding, which we describe below.
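No canonical deep learning model exists yet for predicting sets directly.
The basic set prediction task is multilabel classification (see [40,33] for computer-vision references), but its baseline approach, one-vs-rest, does not carry over to problems such as detection, where the elements have an underlying structure (i.e., near-identical boxes).
The first difficulty in these tasks is avoiding near-duplicates. Most current detectors handle this with post-processing such as non-maximum suppression, whereas direct set prediction has no post-processing. It therefore needs a global inference scheme that models the interactions between all predicted elements. For fixed-size set prediction, dense fully connected networks work but are costly. A general approach is an autoregressive sequence model such as a recurrent neural network [48]. In all cases the loss must be invariant to permutations of the predictions. The usual solution is a loss based on the Hungarian algorithm [20], which finds a bipartite matching between the GT and the predictions; this enforces permutation invariance and guarantees that each target element has a unique match. DETR follows the bipartite matching loss approach but, unlike most prior work, steps away from autoregressive models and uses transformers with parallel decoding.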
Two ingredients are essential for direct set predictions in detection: (1) a set prediction loss that forces unique matching between predicted and ground truth boxes; (2) an architecture that predicts (in a single pass) a set of objects and models their relation. We describe our architecture in detail in Figure 2.
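Two ingredients are essential for direct set prediction in detection:
(1) a set prediction loss that forces a unique matching between predicted boxes and ground-truth (GT) boxes;
(2) a network architecture that predicts a set of objects in a single pass and models the relations between them.
Figure 2 describes the architecture in detail.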
DETR infers a fixed-size set of $N$ predictions, in a single pass through the decoder, where $N$ is set to be significantly larger than the typical number of objects in an image. One of the main difficulties of training is to score predicted objects (class, position, size) with respect to the ground truth. Our loss produces an optimal bipartite matching between predicted and ground truth objects, and then optimizes object-specific (bounding box) losses.
In short: the number of output slots $N$ is fixed and chosen well above the typical object count per image. Training has two steps: first compute the optimal bipartite matching between predicted and GT objects, then optimize the object-specific (bounding box) losses on the matched pairs.
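Let us denote by $y$ the ground truth set of objects, and $\hat{y} = \{\hat{y}_i\}_{i=1}^{N}$ the set of $N$ predictions. Assuming $N$ is larger than the number of objects in the image, we consider $y$ also as a set of size $N$ padded with ∅ (no object). To find a bipartite matching between these two sets we search for a permutation of $N$ elements $\sigma \in \mathfrak{S}_N$ with the lowest cost:
$$\hat{\sigma} = \arg\min_{\sigma \in \mathfrak{S}_N} \sum_{i}^{N} \mathcal{L}_{\text{match}}(y_i, \hat{y}_{\sigma(i)}) \qquad (1)$$
where $\mathcal{L}_{\text{match}}(y_i, \hat{y}_{\sigma(i)})$ is a pair-wise matching cost between ground truth $y_i$ and a prediction with index $\sigma(i)$. This optimal assignment is computed efficiently with the Hungarian algorithm, following prior work (e.g. [43]).
Explanation of the loss function:
$y$ is the GT set of objects and $\hat{y}$ is the set of $N$ predictions. Because $N$ exceeds the number of objects in the image, $y$ is padded with ∅ (no object) up to size $N$. The bipartite matching between the two sets is the permutation $\sigma$ of the $N$ elements that has the lowest total cost.
The Hungarian algorithm computes this optimal assignment efficiently [43].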
[43] CVPR 2015 : End-to-end people detection in crowded scenes
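The matching cost takes into account both the class prediction and the similarity of predicted and ground truth boxes. Each element $i$ of the ground truth set can be seen as a $y_i = (c_i, b_i)$ where $c_i$ is the target class label (which may be ∅) and $b_i \in [0, 1]^4$ is a vector that defines ground truth box center coordinates and its height and width relative to the image size. For the prediction with index $\sigma(i)$ we define probability of class $c_i$ as $\hat{p}_{\sigma(i)}(c_i)$ and the predicted box as $\hat{b}_{\sigma(i)}$. With these notations we define $\mathcal{L}_{\text{match}}(y_i, \hat{y}_{\sigma(i)})$ as $-\mathbb{1}_{\{c_i \neq \varnothing\}} \hat{p}_{\sigma(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}} \mathcal{L}_{\text{box}}(b_i, \hat{b}_{\sigma(i)})$.
The matching cost accounts for both the class prediction and the similarity between predicted and GT boxes.
$c_i$ is the GT class; $b_i$ holds the GT box center coordinates plus its height and width.
$\hat{p}_{\sigma(i)}(c_i)$ is the predicted probability of class $c_i$; $\hat{b}_{\sigma(i)}$ holds the predicted box center coordinates plus its height and width.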
This procedure of finding matching plays the same role as the heuristic assignment rules used to match proposal [37] or anchors [22] to ground truth objects in modern detectors. The main difference is that we need to find one-to-one matching for direct set prediction without duplicates.
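Note: this matching takes the place of the heuristic rules that modern detectors use to assign proposals [37] or anchors [22] to GT objects. The difference is that direct set prediction needs a one-to-one matching, with no duplicates.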
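The matching step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: boxes are normalized (cx, cy, w, h) tuples, a plain ℓ1 distance stands in for the box term, and the best permutation is found by brute force over all permutations instead of the Hungarian algorithm, which is only practical for very small N. The names `match_cost` and `best_assignment` are hypothetical.

```python
from itertools import permutations

NO_OBJECT = None  # stands for the padding symbol ∅


def l1_distance(box_a, box_b):
    """Sum of absolute coordinate differences (stand-in for the box loss)."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


def match_cost(target, pred):
    """Pair-wise matching cost: minus class probability plus box distance.

    target = (class_index or NO_OBJECT, box); pred = (class_probs, box).
    Padded ∅ targets cost a constant 0, whatever the prediction is.
    """
    cls, box = target
    if cls is NO_OBJECT:
        return 0.0
    probs, pred_box = pred
    return -probs[cls] + l1_distance(box, pred_box)


def best_assignment(targets, preds):
    """Return sigma as a list: target i is matched to prediction sigma[i]."""
    n = len(preds)
    padded = list(targets) + [(NO_OBJECT, None)] * (n - len(targets))

    def total_cost(sigma):
        return sum(match_cost(padded[i], preds[sigma[i]]) for i in range(n))

    # Brute force over all N! permutations (assumption: N is tiny).
    return list(min(permutations(range(n)), key=total_cost))


# Two ground-truth objects, three prediction slots.
targets = [(0, (0.2, 0.2, 0.1, 0.1)), (1, (0.7, 0.7, 0.2, 0.2))]
preds = [
    ([0.10, 0.80], (0.7, 0.7, 0.2, 0.2)),  # matches the second target
    ([0.30, 0.30], (0.5, 0.5, 0.5, 0.5)),  # matches nothing well
    ([0.90, 0.05], (0.2, 0.2, 0.1, 0.1)),  # matches the first target
]
sigma = best_assignment(targets, preds)
```

Here the first target is paired with the third slot, the second target with the first slot, and the leftover slot is paired with the ∅ padding. A real implementation would replace the permutation search with a Hungarian solver.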
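The second step is to compute the loss function, the Hungarian loss for all pairs matched in the previous step. We define the loss similarly to the losses of common object detectors, i.e. a linear combination of a negative log-likelihood for class prediction and a box loss defined later:
$$\mathcal{L}_{\text{Hungarian}}(y, \hat{y}) = \sum_{i=1}^{N} \left[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}} \mathcal{L}_{\text{box}}(b_i, \hat{b}_{\hat{\sigma}(i)}) \right] \qquad (2)$$
where $\hat{\sigma}$ is the optimal assignment computed in the first step (1). In practice, we down-weight the log-probability term when $c_i = \varnothing$ by a factor 10 to account for class imbalance. This is analogous to how Faster R-CNN training procedure balances positive/negative proposals by subsampling [37]. Notice that the matching cost between an object and ∅ doesn’t depend on the prediction, which means that in that case the cost is a constant. In the matching cost we use probabilities $\hat{p}_{\hat{\sigma}(i)}(c_i)$ instead of log-probabilities. This makes the class prediction term commensurable to $\mathcal{L}_{\text{box}}(\cdot, \cdot)$ (described below), and we observed better empirical performances.
The second step computes the Hungarian loss over all pairs matched in the first step. Like the losses of common detectors, it is a linear combination of a negative log-likelihood for the class prediction and a box loss defined later.
Notes: $\hat{\sigma}$ is the optimal assignment from step (1). When $c_i = \varnothing$, the log-probability term is down-weighted by a factor of 10 to offset class imbalance, much as Faster R-CNN balances positive and negative proposals by subsampling [37]. The matching cost between an object and ∅ does not depend on the prediction, so it is a constant. The matching cost uses the probability $\hat{p}_{\hat{\sigma}(i)}(c_i)$ rather than the log-probability, which makes the class term commensurable with $\mathcal{L}_{\text{box}}$.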
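Once an assignment is fixed, the Hungarian loss is a plain sum over slots. The sketch below is illustrative only and makes several assumptions that are not in the text: the ∅ class is stored at an explicit index of each probability list, targets are already padded to the number of slots, the box loss is passed in as a function, and the factor-10 down-weighting is expressed as a weight of 0.1. The function name `hungarian_loss` is hypothetical, and the normalization by the number of objects is left out.

```python
import math


def hungarian_loss(targets, preds, sigma, box_loss, no_object_class,
                   no_object_weight=0.1):
    """Sum of class NLL and box loss over the matched pairs.

    targets: list of (class_index, box), padded with (no_object_class, None).
    preds:   list of (class_probs, box), one entry per slot.
    sigma:   sigma[i] is the slot matched to target i.
    """
    total = 0.0
    for i, (cls, box) in enumerate(targets):
        probs, pred_box = preds[sigma[i]]
        nll = -math.log(probs[cls])
        if cls == no_object_class:
            # ∅ slots: down-weighted class term, no box term.
            total += no_object_weight * nll
        else:
            total += nll + box_loss(box, pred_box)
    return total


def l1_distance(box_a, box_b):
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


# One real object (class 0) and one padded ∅ target (class index 2).
example_targets = [(0, (0.5, 0.5, 0.2, 0.2)), (2, None)]
example_preds = [
    ([0.1, 0.1, 0.8], (0.0, 0.0, 0.0, 0.0)),  # slot 0 predicts ∅
    ([0.7, 0.2, 0.1], (0.5, 0.5, 0.2, 0.3)),  # slot 1 predicts class 0
]
example_loss = hungarian_loss(example_targets, example_preds, sigma=[1, 0],
                              box_loss=l1_distance, no_object_class=2)
```

The example loss is the class term and box term of the matched object plus one tenth of the class term of the ∅ slot.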
- Bounding box loss.
The second part of the matching cost and the Hungarian loss is $\mathcal{L}_{\text{box}}(\cdot)$ that scores the bounding boxes. Unlike many detectors that do box predictions as a $\Delta$ w.r.t. some initial guesses, we make box predictions directly. While such an approach simplifies the implementation, it poses an issue with relative scaling of the loss. The most commonly-used $\ell_1$ loss will have different scales for small and large boxes even if their relative errors are similar. To mitigate this issue we use a linear combination of the $\ell_1$ loss and the generalized IoU loss [38] $\mathcal{L}_{\text{iou}}(\cdot, \cdot)$ that is scale-invariant. Overall, our box loss $\mathcal{L}_{\text{box}}(b_i, \hat{b}_{\sigma(i)})$ is defined as $\lambda_{\text{iou}} \mathcal{L}_{\text{iou}}(b_i, \hat{b}_{\sigma(i)}) + \lambda_{\text{L1}} \lVert b_i - \hat{b}_{\sigma(i)} \rVert_1$ where $\lambda_{\text{iou}}, \lambda_{\text{L1}} \in \mathbb{R}$ are hyperparameters. These two losses are normalized by the number of objects inside the batch.
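The scale argument is easy to check numerically. The following is a rough sketch, assuming boxes are (cx, cy, w, h) tuples with positive area and taking the generalized IoU loss to be one minus the generalized IoU; the weights default to 1.0 purely as placeholders, not as the values used in the paper. Function names are hypothetical.

```python
def to_corners(box):
    """(cx, cy, w, h) -> (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


def generalized_iou(box_a, box_b):
    """IoU minus the fraction of the enclosing box not covered by the union."""
    ax0, ay0, ax1, ay1 = to_corners(box_a)
    bx0, by0, bx1, by1 = to_corners(box_b)
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    hull = (max(ax1, bx1) - min(ax0, bx0)) * (max(ay1, by1) - min(ay0, by0))
    return inter / union - (hull - union) / hull


def l1_distance(box_a, box_b):
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


def box_loss(target, pred, lambda_iou=1.0, lambda_l1=1.0):
    """Linear combination of the GIoU loss and the l1 loss (placeholder weights)."""
    return (lambda_iou * (1.0 - generalized_iou(target, pred))
            + lambda_l1 * l1_distance(target, pred))


# The same relative error at two scales.
big_target, big_pred = (0.5, 0.5, 0.4, 0.4), (0.6, 0.5, 0.4, 0.4)
small_target, small_pred = (0.25, 0.25, 0.2, 0.2), (0.3, 0.25, 0.2, 0.2)

giou_big = generalized_iou(big_target, big_pred)
giou_small = generalized_iou(small_target, small_pred)
l1_big = l1_distance(big_target, big_pred)
l1_small = l1_distance(small_target, small_pred)
```

Halving both boxes halves the ℓ1 term but leaves the generalized IoU unchanged, which is the motivation for combining the two.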
The overall DETR architecture is surprisingly simple and depicted in Figure 2. It contains three main components, which we describe below: a CNN backbone to extract a compact feature representation, an encoder-decoder transformer, and a simple feed forward network (FFN) that makes the final detection prediction. Unlike many modern detectors, DETR can be implemented in any deep learning framework that provides a common CNN backbone and a transformer architecture implementation with just a few hundred lines. Inference code for DETR can be implemented in less than 50 lines in PyTorch [32]. We hope that the simplicity of our method will attract new researchers to the detection community.
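The overall DETR architecture is very simple (Figure 2). It has three main components:
1. a CNN backbone that extracts a compact feature representation;
2. an encoder-decoder transformer;
3. a feed-forward network (FFN) that makes the final detection prediction.
DETR can be implemented in any deep learning framework that provides a common CNN backbone and a transformer implementation; in PyTorch, the inference code fits in fewer than 50 lines.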
- Backbone.
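Starting from the initial image $x_{\text{img}} \in \mathbb{R}^{3 \times H_0 \times W_0}$ (with 3 color channels), a conventional CNN backbone generates a lower-resolution activation map $f \in \mathbb{R}^{C \times H \times W}$. Typical values we use are $C = 2048$ and $H, W = \frac{H_0}{32}, \frac{W_0}{32}$.
The backbone is a CNN:
it converts the input color image of size $3 \times H_0 \times W_0$ into a feature map of size $C \times H \times W$.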
- Transformer encoder.
First, a 1x1 convolution reduces the channel dimension of the high-level activation map $f$ from $C$ to a smaller dimension $d$, creating a new feature map $z_0 \in \mathbb{R}^{d \times H \times W}$. The encoder expects a sequence as input, hence we collapse the spatial dimensions of $z_0$ into one dimension, resulting in a $d \times HW$ feature map. Each encoder layer has a standard architecture and consists of a multi-head self-attention module and a feed forward network (FFN). Since the transformer architecture is permutation-invariant, we supplement it with fixed positional encodings [31,3] that are added to the input of each attention layer. We defer to the supplementary material the detailed definition of the architecture, which follows the one described in [47] (Attention Is All You Need).
Transformer encoder:
1. Convolution: a 1x1 convolution reduces the 2048 input channels to $d$, giving the feature map $z_0$. The spatial dimensions of $z_0$ are then flattened into one, giving a $d \times HW$ sequence.
2. Each encoder layer has the standard architecture: a multi-head self-attention module followed by a feed-forward network (FFN).
3. Because the transformer is permutation-invariant, fixed positional encodings are added to the input of every attention layer.
See the Appendix at the end of the paper for more details.
[3] 2019 ICCV : Attention augmented convolutional networks
[31] 2018 ICML : Image transformer
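The shape bookkeeping from backbone output to encoder input can be traced with a toy example. This is a sketch in plain Python lists rather than a tensor library; it assumes a bias-free 1x1 convolution (which is just a per-pixel linear map over channels), row-major flattening, and image sides divisible by the stride. All function names are hypothetical.

```python
def conv1x1(feature_map, weight):
    """Apply a bias-free 1x1 convolution.

    feature_map: C x H x W nested lists; weight: d x C nested lists.
    Returns a d x H x W nested list.
    """
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    return [
        [
            [sum(weight[k][c] * feature_map[c][y][x] for c in range(channels))
             for x in range(width)]
            for y in range(height)
        ]
        for k in range(len(weight))
    ]


def flatten_spatial(z):
    """Collapse d x H x W into d x (H*W), row by row."""
    return [[value for row in channel for value in row] for channel in z]


def encoder_input_shape(h0, w0, d, stride=32):
    """Shape (d, H*W) of the encoder input for an h0 x w0 image."""
    return d, (h0 // stride) * (w0 // stride)


# Toy activation map with C=3, H=2, W=2, reduced to d=2 channels.
f = [
    [[1, 2], [3, 4]],
    [[10, 20], [30, 40]],
    [[100, 200], [300, 400]],
]
w = [[1, 0, 0], [0, 1, 1]]
z0 = conv1x1(f, w)
sequence = flatten_spatial(z0)
```

For an 800 x 1216 image and an assumed `d` of 256, `encoder_input_shape(800, 1216, 256)` gives a sequence of 950 positions, each a 256-dimensional vector.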
- Transformer decoder.
The decoder follows the standard architecture of the transformer, transforming $N$ embeddings of size $d$ using multi-headed self- and encoder-decoder attention mechanisms. The difference with the original transformer is that our model decodes the $N$ objects in parallel at each decoder layer, while Vaswani et al. [47] use an autoregressive model that predicts the output sequence one element at a time. We refer the reader unfamiliar with the concepts to the supplementary material. Since the decoder is also permutation-invariant, the $N$ input embeddings must be different to produce different results. These input embeddings are learnt positional encodings that we refer to as object queries, and similarly to the encoder, we add them to the input of each attention layer. The $N$ object queries are transformed into an output embedding by the decoder. They are then independently decoded into box coordinates and class labels by a feed forward network (described in the next subsection), resulting in $N$ final predictions. Using self- and encoder-decoder attention over these embeddings, the model globally reasons about all objects together using pair-wise relations between them, while being able to use the whole image as context.
Summary: the decoder has the standard transformer architecture and transforms $N$ embeddings of size $d$ with multi-head self-attention and encoder-decoder attention. Unlike the original transformer [47], which is autoregressive and predicts one output element at a time, it decodes all $N$ objects in parallel at every decoder layer. Because the decoder is permutation-invariant, the $N$ input embeddings must differ to give different results. These inputs are learned positional encodings, called object queries, and they are added to the input of every attention layer, as in the encoder. The decoder turns the $N$ object queries into output embeddings, and an FFN (next subsection) decodes each one independently into box coordinates and a class label, giving $N$ final predictions. Attention over these embeddings lets the model reason about all objects jointly through their pair-wise relations while using the whole image as context.
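Why the input embeddings must differ can be seen with a stripped-down attention function. The sketch below is a deliberate simplification and an assumption on my part: single-head dot-product attention from the queries to the encoder memory, with no learned projections, no self-attention between queries, and no FFN. Every slot goes through the same function, so identical queries can only produce identical outputs.

```python
import math


def attend(queries, memory):
    """Single-head dot-product attention; memory serves as keys and values."""
    d = len(memory[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d)
                  for key in memory]
        peak = max(scores)  # subtract the max for a stable softmax
        weights = [math.exp(s - peak) for s in scores]
        norm = sum(weights)
        outputs.append([
            sum(wgt * value[j] for wgt, value in zip(weights, memory)) / norm
            for j in range(d)
        ])
    return outputs


memory = [[1.0, 0.0], [0.0, 1.0]]  # two encoder positions, d = 2

same_queries = [[0.0, 0.0], [0.0, 0.0]]
distinct_queries = [[4.0, 0.0], [0.0, 4.0]]

out_same = attend(same_queries, memory)
out_distinct = attend(distinct_queries, memory)
```

With identical queries both slots return the same vector, so they would yield duplicate predictions; with distinct queries each slot attends to a different part of the memory.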
Appendix
Detailed architecture
The detailed description of the transformer used in DETR, with positional encodings passed at every attention layer, is given in Fig. 10. Image features from the CNN backbone are passed through the transformer encoder, together with spatial positional encodings that are added to queries and keys at every multihead self-attention layer. Then, the decoder receives queries (initially set to zero), output positional encoding (object queries), and encoder memory, and produces the final set of predicted class labels and bounding boxes through multiple multihead self-attention and decoder-encoder attention. The first self-attention layer in the first decoder layer can be skipped.
Summary: Fig. 10 shows the transformer used in DETR, with positional encodings passed to every attention layer. In the encoder, the spatial positional encodings are added to the queries (Q) and keys (K) of each multi-head self-attention layer. The decoder takes the queries (initially zero), the output positional encodings (object queries), and the encoder memory, and produces the final class labels and bounding boxes through repeated multi-head self-attention and decoder-encoder attention.
- Prediction feed-forward networks (FFNs).
The final prediction is computed by a 3-layer perceptron with ReLU activation function and hidden dimension $d$, and a linear projection layer. The FFN predicts the normalized center coordinates, height and width of the box w.r.t. the input image, and the linear layer predicts the class label using a softmax function. Since we predict a fixed-size set of $N$ bounding boxes, where $N$ is usually much larger than the actual number of objects of interest in an image, an additional special class label ∅ is used to represent that no object is detected within a slot. This class plays a similar role to the “background” class in the standard object detection approaches.
FFN structure: a 3-layer perceptron with ReLU, which predicts the box;
a linear projection layer with softmax, which predicts the class.
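Turning the per-slot outputs into detections is mostly bookkeeping. The sketch below assumes that class probabilities have already been passed through the softmax, that the ∅ class occupies the last index of each probability list (a layout chosen here for illustration), and that boxes are normalized (cx, cy, w, h). The function name `decode_predictions` is hypothetical.

```python
def decode_predictions(class_probs, boxes, image_w, image_h):
    """Keep the slots whose most likely class is not ∅.

    Returns a list of (label, score, (x0, y0, x1, y1)) with pixel coordinates.
    """
    detections = []
    for probs, (cx, cy, w, h) in zip(class_probs, boxes):
        label = max(range(len(probs)), key=probs.__getitem__)
        if label == len(probs) - 1:
            continue  # slot predicts ∅: no object detected here
        corners = (
            (cx - w / 2) * image_w,
            (cy - h / 2) * image_h,
            (cx + w / 2) * image_w,
            (cy + h / 2) * image_h,
        )
        detections.append((label, probs[label], corners))
    return detections


# Two slots, two real classes plus ∅ at index 2.
slot_probs = [[0.7, 0.2, 0.1], [0.05, 0.05, 0.9]]
slot_boxes = [(0.5, 0.5, 0.5, 0.5), (0.1, 0.1, 0.1, 0.1)]
detections = decode_predictions(slot_probs, slot_boxes, image_w=640, image_h=480)
```

Only the first slot survives, since the second one assigns most of its probability to ∅. No non-maximum suppression step is involved.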
- Auxiliary decoding losses.
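We found it helpful to use auxiliary losses [1] in the decoder during training, especially to help the model output the correct number of objects of each class. We add prediction FFNs and Hungarian loss after each decoder layer. All prediction FFNs share their parameters. We use an additional shared layer-norm to normalize the input to the prediction FFNs from different decoder layers.
Auxiliary losses in the decoder help during training, especially in getting the model to output the correct number of objects of each class.
Prediction FFNs and a Hungarian loss are added after every decoder layer. All prediction FFNs share their parameters.
An additional shared layer norm normalizes the inputs that the prediction FFNs receive from the different decoder layers.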
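The structure of the auxiliary losses can be outlined as follows. This is a schematic only, with several assumptions: the per-layer losses are added with equal weight, the layer norm has no learned scale or shift, and `shared_head` and `set_loss` are hypothetical callables standing in for the prediction FFNs and the Hungarian loss.

```python
import math


def layer_norm(vector, eps=1e-5):
    """Normalize one embedding to zero mean and unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(variance + eps) for v in vector]


def loss_with_auxiliary(layer_outputs, shared_head, set_loss, targets):
    """Apply the same norm, head, and loss to every decoder layer's output.

    layer_outputs: one list of slot embeddings per decoder layer.
    """
    total = 0.0
    for slots in layer_outputs:
        normed = [layer_norm(slot) for slot in slots]
        predictions = [shared_head(slot) for slot in normed]  # shared weights
        total += set_loss(predictions, targets)
    return total


# Toy check: 3 decoder layers, 2 slots each, embeddings of size 2.
head_calls = []


def toy_head(slot):
    head_calls.append(slot)
    return slot[0]


def toy_set_loss(predictions, targets):
    return 1.0  # constant stand-in for the Hungarian loss


toy_layers = [[[1.0, 3.0], [2.0, 6.0]] for _ in range(3)]
toy_total = loss_with_auxiliary(toy_layers, toy_head, toy_set_loss, targets=None)
```

The same head is called once per slot per layer, and the total is the sum of one loss term per decoder layer.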