Transformers are widely known for their accomplishments in the field of NLP. Recent investigations show that transformers have an inherent ability to generalize and fit many tasks; this capability of the architecture has become the primary reason to adopt transformers in the vision field.
What is DETR? Why DETR?
Well, DETR stands for "DEtection TRansformer". It was proposed by FAIR, and its benchmarks are slightly better than Faster R-CNN's.
Existing object detection frameworks are carefully crafted with different design principles. Two-stage detectors (Faster R-CNN) predict boxes w.r.t. proposals, whereas single-stage methods (YOLO) make predictions w.r.t. anchors or a grid of possible object centers. The performance of these systems is influenced by the post-processing steps used to collapse near-duplicate predictions (NMS) and by the exact way these initial guesses are set (anchor boxes). DETR proposes an end-to-end architecture that eliminates such customized layers and predicts bounding boxes directly w.r.t. the input image (absolute box prediction).
DETR architecture
DETR mainly comprises four blocks, as depicted in the diagram below:
- Backbone
- Transformer Encoder
- Transformer Decoder
- Prediction heads
Backbone.
Starting from the initial image, a CNN backbone generates a lower-resolution activation map. The input images are batched together, applying 0-padding as needed to ensure they all have the same dimensions (H, W) as the largest image in the batch.
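As a rough illustration, here is a minimal PyTorch sketch of this step, assuming a torchvision ResNet-50 with its pooling and classification layers removed (the real DETR backbone additionally tracks padding masks for the batched images):

```python
# Minimal sketch: CNN backbone producing a lower-resolution activation map.
import torch
import torchvision

# ResNet-50 up to (but not including) the avg-pool and fc layers.
resnet = torchvision.models.resnet50(weights=None)
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])

# A batch of zero-padded images, all sized to the largest image in the batch.
images = torch.randn(2, 3, 800, 1066)

features = backbone(images)
print(features.shape)  # torch.Size([2, 2048, 25, 34]) -- roughly H/32, W/32
```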
Transformer Encoder
The transformer encoder expects a sequence as input, hence we flatten the spatial dimensions of the feature map from the backbone into a sequence of feature vectors. Since the transformer architecture has no notion of order for the sequence (why do we need an order at all?), we supplement the input with fixed positional encodings that are added to it before it is passed into the multi-head attention module. I am not explaining transformers in detail since there are many great tutorials, but at a high level the attention mechanism is key to the success of this architecture. The latent features extracted by the CNN backbone are passed to the multi-head attention module, where each feature is allowed to interact with every other feature through the query, key, and value mechanism. This helps the network figure out which features belong to a single object and distinguish it from other objects lying very close by, even objects of the same class. That is why transformers are so powerful at generalizing.
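A minimal sketch of the flattening and positional-encoding step, assuming the 2048-channel feature map from the backbone sketch above. DETR uses a fixed sine/cosine encoding; a random stand-in tensor is used here for brevity:

```python
# Minimal sketch: flatten the 2-D feature map into a sequence and encode it.
import torch
import torch.nn as nn

d_model = 256
proj = nn.Conv2d(2048, d_model, kernel_size=1)    # reduce channel dimension

features = torch.randn(2, 2048, 25, 34)           # backbone output
src = proj(features).flatten(2).permute(2, 0, 1)  # (HW, batch, d) = (850, 2, 256)

# Stand-in for DETR's fixed sine/cosine positional encoding: one vector
# per spatial location, broadcast over the batch.
pos_embed = torch.randn(25 * 34, 1, d_model)

encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

memory = encoder(src + pos_embed)                 # (850, 2, 256)
```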
Transformer Decoder
A standard transformer decoder expects three inputs: queries, keys, and values. Outputs from the encoder are passed as keys and values to the decoder, but what are the queries to the decoder? Well, the formulation of the decoder part is fairly cool: the queries are simply numbers ranging from 1 to N, passed as learned embeddings, and the number of objects the model can detect is equal to the value N. So what do these queries do? Each query effectively asks what object lies at a specific position; e.g., the embedding for value 1 might ask for objects present in the bottom-left corner, as shown in the diagram below.
This sort of explains the need for positional embeddings in the encoder block.
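Below is a minimal sketch of this idea, with the object queries as a learned `nn.Embedding` of size N; the `memory` tensor stands in for the encoder output. The actual DETR decoder is more elaborate (it feeds zeros as the target sequence and re-adds the query embeddings at every layer), so treat this as an approximation:

```python
# Minimal sketch: decoder with N learned object queries. DETR uses N = 100.
import torch
import torch.nn as nn

d_model, N, batch = 256, 100, 2

query_embed = nn.Embedding(N, d_model)            # one learned embedding per slot
tgt = query_embed.weight.unsqueeze(1).repeat(1, batch, 1)  # (N, batch, d_model)

decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

memory = torch.randn(850, batch, d_model)         # encoder output (stand-in)
hs = decoder(tgt, memory)                         # (N, batch, d_model): one vector per slot
```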
Prediction heads.
The FFN predicts the normalized center coordinates, height, and width of the box w.r.t. the input image, and a linear layer predicts the class label using a softmax function.
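A minimal sketch of the two heads, assuming the decoder output `hs` from the sketch above; `num_classes = 91` (COCO) is an assumption here, and the extra class index is "no object":

```python
# Minimal sketch: class head (linear) and box head (small FFN), applied
# independently to each of the N decoder output vectors.
import torch
import torch.nn as nn

d_model, num_classes = 256, 91

class_head = nn.Linear(d_model, num_classes + 1)   # +1 for the "no object" class
bbox_head = nn.Sequential(                         # 3-layer FFN, as in the paper
    nn.Linear(d_model, d_model), nn.ReLU(),
    nn.Linear(d_model, d_model), nn.ReLU(),
    nn.Linear(d_model, 4),
)

hs = torch.randn(100, 2, d_model)                  # decoder output (stand-in)
logits = class_head(hs)                            # (100, 2, num_classes + 1)
boxes = bbox_head(hs).sigmoid()                    # (cx, cy, w, h), normalized to [0, 1]
probs = logits.softmax(-1)                         # per-slot class probabilities
```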
Set predictions and set-based loss
Set prediction is nothing new: the network predicts a tuple of the class label and the coordinates of the object w.r.t. the input image. Given a predefined number N, the network predicts N such tuples; if there is no object in a proposed tuple, the network defaults it to the "no object" class (∅).
The network infers a fixed-size set of N predictions in a single forward pass, where N is set to be significantly larger than the typical number of objects in an image. Now consider a single prediction in a batch: since N is higher than the actual number of ground-truth objects in the image, the ground truth is padded with the "no object" class to account for any extra predictions made by the network.
Each prediction is assigned to the ground-truth object that lies closest to it, and the left-over predictions are mapped to the "no object" class; in the paper, this is termed "bipartite matching".
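Here is a minimal sketch of the matching for one image, using `scipy.optimize.linear_sum_assignment` (the Hungarian algorithm) with hypothetical targets. DETR's actual matching cost also includes a generalized-IoU term, omitted here for brevity:

```python
# Minimal sketch: bipartite matching between N prediction slots and the
# ground-truth objects of a single image.
import torch
from scipy.optimize import linear_sum_assignment

N, num_classes = 100, 91
pred_probs = torch.randn(N, num_classes + 1).softmax(-1)  # per-slot class probs
pred_boxes = torch.rand(N, 4)                             # normalized (cx, cy, w, h)

tgt_labels = torch.tensor([17, 56])                       # 2 ground-truth objects
tgt_boxes = torch.rand(2, 4)

# Matching cost: negative probability of the true class + L1 box distance.
cost_class = -pred_probs[:, tgt_labels]                   # (N, 2)
cost_bbox = torch.cdist(pred_boxes, tgt_boxes, p=1)       # (N, 2)
cost = cost_class + cost_bbox

row_ind, col_ind = linear_sum_assignment(cost.numpy())    # Hungarian algorithm
# row_ind: prediction slots matched to ground truth; every other slot -> ∅.
```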
The above function describes how the loss is computed: for each assigned pair it computes the cross-entropy loss and the bounding-box loss (the bounding-box loss is not computed if the actual class is "no object"). The loss for the "no object" class is also downscaled by a factor of 10 to avoid the class imbalance caused by the large value of N. The bounding-box loss is a weighted sum of the IoU loss and the L1 loss between predicted and actual boxes, normalized by the number of ground-truth objects.
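Continuing the sketch above with hypothetical matched indices, the set-based loss could look like this; only the L1 part of the box loss is shown (DETR adds a generalized-IoU term):

```python
# Minimal sketch: set-based loss for one image, given a bipartite matching.
import torch
import torch.nn.functional as F

N, num_classes = 100, 91
no_object = num_classes                               # index of the ∅ class

pred_logits = torch.randn(N, num_classes + 1)
pred_boxes = torch.rand(N, 4)
tgt_labels = torch.tensor([17, 56])                   # 2 ground-truth objects
tgt_boxes = torch.rand(2, 4)
row_ind = torch.tensor([3, 42])                       # matched prediction slots
col_ind = torch.tensor([0, 1])                        # matched ground-truth indices

# Pad the ground truth: every unmatched slot is labeled "no object".
target_classes = torch.full((N,), no_object, dtype=torch.long)
target_classes[row_ind] = tgt_labels[col_ind]

weights = torch.ones(num_classes + 1)
weights[no_object] = 0.1                              # 10x downscale for ∅
loss_ce = F.cross_entropy(pred_logits, target_classes, weight=weights)

# Box loss only on matched pairs, normalized by the number of GT objects.
loss_bbox = F.l1_loss(pred_boxes[row_ind], tgt_boxes[col_ind], reduction="sum")
loss = loss_ce + loss_bbox / len(tgt_labels)
```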
Conclusion and Results
The approach achieves results comparable to an optimized Faster R-CNN baseline on the COCO dataset. In addition, DETR achieves significantly better performance on large objects than Faster R-CNN, but lags a bit on smaller objects.
Translated from: https://medium.com/swlh/transformers-for-vision-detr-24006addce01