Paper Reading: DetectoRS: Detecting Objects with Recursive Feature Pyramid and Switchable Atrous Convolution

Table of Contents

      • 1. Paper Overview
      • 2. Implementation Details of the RFP Module
      • 3. Implementation Details of the SAC Module
      • 4. Ablation Studies

1. Paper Overview

DetectoRS, the object detector proposed in this paper, is currently the best performer on the COCO dataset (54.7 mAP), and it also does well on instance segmentation and panoptic segmentation. The main reason is that its improvements target the backbone and the FPN, so they transfer to many vision tasks; other strong models such as ResNeSt and CBNet are likewise backbone-level improvements. Perhaps the trend now is that the overall architecture of detection networks is largely settled (the anchor-free family aside). Some papers have also analyzed this and found that much of the remaining detection error comes from the classification side, whose performance is hard to push further, so most current improvements focus on the backbone and the FPN; BiFPN is another example.
The paper makes two main contributions. At the macro level, it proposes the Recursive Feature Pyramid (RFP): the FPN outputs are fed back into the bottom-up backbone as extra inputs for a second pass, and the FPN outputs of that pass are then fused with those of the first pass to form the final output. At the micro level, it proposes Switchable Atrous Convolution (SAC).

Note: the authors plug the two proposed modules into the HTC network, so HTC is the baseline. The authors are also very generous: the paper is rich in practical detail, down to how the weights are initialized in the experiments, so reading the original paper is recommended; a few blog posts alone cannot fully convey these details.
[Figure 1 of the paper: (a) the Recursive Feature Pyramid, (b) Switchable Atrous Convolution]

At the macro level, our proposed Recursive Feature Pyramid (RFP) builds on top of the Feature Pyramid Networks (FPN) [44] by incorporating extra feedback connections from the FPN layers into the bottom-up backbone layers, as illustrated in Fig. 1a. Unrolling the recursive structure to a sequential implementation, we obtain a backbone for object detector that looks at the images twice or more. Similar to the cascaded detector heads in Cascade R-CNN trained with more selective examples, our RFP recursively enhances FPN to generate increasingly powerful representations.
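Written out (following the notation of the paper's method section, with B_i the i-th bottom-up stage, F_i the i-th FPN operation, R the transformation applied to the feedback, and t the unrolled step; reproduced here from the paper, so check Sec. 3.1 for the exact form), the recursion is:

```latex
f_i^t = F_i^t\!\left(f_{i+1}^t,\; x_i^t\right), \qquad
x_i^t = B_i^t\!\left(x_{i-1}^t,\; R\!\left(f_i^{t-1}\right)\right), \qquad f_i^0 = 0
```

So T = 2 corresponds to the "looks at the images twice" backbone, and plain FPN is recovered when the feedback term R(f_i^{t-1}) is dropped.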

At the micro level, we propose Switchable Atrous Convolution (SAC), which convolves the same input feature with different atrous rates [11, 30, 53] and gathers the results using switch functions. Fig. 1b shows an illustration of the concept of SAC. The switch functions are spatially dependent, i.e., each location of the feature map might have different switches to control the outputs of SAC. To use SAC in the detector, we convert all the standard 3x3 convolutional layers in the bottom-up backbone to SAC, which improves the detector performance by a large margin. Some previous methods adopt conditional convolution, e.g., [39, 74], which also combines results of different convolutions as a single output. Unlike those methods whose architecture requires to be trained from scratch, SAC provides a mechanism to easily convert pretrained standard convolutional networks (e.g., ImageNet-pretrained [59] checkpoints). Moreover, a new weight locking mechanism is used in SAC where the weights of different atrous convolutions are the same except for a trainable difference.
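Compactly, with Conv(x, w, r) denoting a convolution with weight w and atrous rate r, and S(x) the spatially-dependent switch (per the paper, implemented as a 5x5 average pooling followed by a 1x1 convolution), SAC replaces a standard 3x3 convolution by:

```latex
\mathrm{Conv}(x,\, w,\, 1) \;\to\;
S(x)\cdot \mathrm{Conv}(x,\, w,\, 1) \;+\; \bigl(1 - S(x)\bigr)\cdot \mathrm{Conv}(x,\, w + \Delta w,\, r)
```

Locking the two weights to w and w + Δw is what allows conversion from a pretrained checkpoint, as the weight-locking quote in section 3 below explains.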
Combining the proposed RFP and SAC results in our DetectoRS. To demonstrate its effectiveness, we incorporate DetectoRS into the state-of-the-art HTC [7] on the challenging COCO dataset [47].

2. Implementation Details of the RFP Module

[Figure 2 of the paper: the overall RFP architecture]
[Figure 3 of the paper: how the RFP feature is added into the ResNet stages]

[Figure 5 of the paper: the fusion module combining the outputs of the two unrolled steps]

Figure 2 shows the overall flow of the RFP module. The feature maps from the first FPN pass are transformed by an ASPP module into the "RFP Features" shown in Figure 3. The top of Figure 3 shows the original bottom-up structure based on ResNet; the RFP Features produced by ASPP are added into it, which is exactly how the RFP module is fused into ResNet. Finally, the feature maps from the second pass are fused with those from the first pass using the operation shown in Figure 5, which the authors say borrows from the update scheme of RNNs (see the sketch below).
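A minimal PyTorch sketch of that Figure-5 fusion step, assuming (from my reading of the figure) that a 1x1 convolution plus Sigmoid on the second-pass feature produces the attention map used for the weighted sum; `RFPFusion` and the variable names are hypothetical, not from the official code:

```python
import torch
import torch.nn as nn

class RFPFusion(nn.Module):
    """Fuse the FPN output of step t with the output of step t+1 (cf. Fig. 5)."""
    def __init__(self, channels: int):
        super().__init__()
        # 1x1 conv + Sigmoid produces a per-location attention map from the new feature.
        self.attn = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, f_prev: torch.Tensor, f_new: torch.Tensor) -> torch.Tensor:
        a = self.attn(f_new)                    # attention weights in [0, 1]
        return a * f_new + (1.0 - a) * f_prev   # convex combination, RNN-style state update
```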

We make changes to the ResNet [28] backbone B to allow it to take both x and R(f) as its input. ResNet has four stages, each of which is composed of several similar blocks. We only make changes to the first block of each stage, as shown in Fig. 3. This block computes a 3-layer feature and adds it to a feature computed by a shortcut. To use the feature R(f), we add another convolutional layer with the kernel size set to 1. The weight of this layer is initialized with 0 to make sure it does not have any real effect when we load the weights from a pretrained checkpoint.
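The following PyTorch sketch mirrors that description: wrap the (pretrained) first block of a stage and add a zero-initialized 1x1 convolution on R(f), so that with a freshly loaded checkpoint the feedback path is a no-op. `RFPBottleneckStub` and its arguments are hypothetical names for illustration:

```python
import torch
import torch.nn as nn

class RFPBottleneckStub(nn.Module):
    """First block of a ResNet stage, extended to also take the RFP feature R(f)."""
    def __init__(self, block: nn.Module, rfp_channels: int, out_channels: int):
        super().__init__()
        self.block = block  # the original (pretrained) first block of the stage
        self.rfp_conv = nn.Conv2d(rfp_channels, out_channels, kernel_size=1)
        # Zero init: with pretrained weights loaded, this path initially contributes nothing.
        nn.init.zeros_(self.rfp_conv.weight)
        nn.init.zeros_(self.rfp_conv.bias)

    def forward(self, x: torch.Tensor, rfp_feat: torch.Tensor) -> torch.Tensor:
        # Add the transformed RFP feature to the block's output (residual + shortcut).
        return self.block(x) + self.rfp_conv(rfp_feat)
```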

We use Atrous Spatial Pyramid Pooling (ASPP) [12] to implement the connecting module R, which takes a feature f_i^t as its input and transforms it to the RFP feature used in Fig. 3. In this module, there are four parallel branches that take f_i^t as their inputs, the outputs of which are then concatenated together along the channel dimension to form the final output of R. Three branches of them use a convolutional layer followed by a ReLU layer; the number of the output channels is 1/4 the number of the input channels. The last branch uses a global average pooling layer to compress the feature, followed by a 1x1 convolutional layer and a ReLU layer to transform the compressed feature to a 1/4-size (channel-wise) feature. Finally, it is resized and concatenated with the features from the other three branches.
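A runnable sketch of the connecting module R under the quoted constraints (four branches, 1/4 channels each, global-pooling branch resized and concatenated). The kernel sizes and atrous rates of the three conv branches are illustrative assumptions; the quote does not fix them:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPPConnector(nn.Module):
    """Connecting module R: four parallel branches, concatenated channel-wise."""
    def __init__(self, in_ch: int):
        super().__init__()
        out_ch = in_ch // 4  # each branch emits 1/4 of the input channels
        # Three conv+ReLU branches; kernel sizes and dilations here are guesses.
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, out_ch, 1), nn.ReLU(inplace=True))
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=3, dilation=3),
                                nn.ReLU(inplace=True))
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=6, dilation=6),
                                nn.ReLU(inplace=True))
        # Fourth branch: global average pooling -> 1x1 conv -> ReLU, resized back later.
        self.b4 = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                nn.Conv2d(in_ch, out_ch, 1), nn.ReLU(inplace=True))

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        h, w = f.shape[-2:]
        pooled = F.interpolate(self.b4(f), size=(h, w),
                               mode="bilinear", align_corners=False)
        return torch.cat([self.b1(f), self.b2(f), self.b3(f), pooled], dim=1)
```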

3. Implementation Details of the SAC Module

[Figure 4 of the paper: SAC, with the pre- and post-global context modules and the lock between the two convolution weights]
Below is the paper's explanation of the lock shown in the figure above:

We propose a locking mechanism by setting one weight as w and the other as w + ∆w for the following reasons. Object detectors usually use pretrained checkpoints to initialize the weights. However, for an SAC layer converted from a standard convolutional layer, the weight for the larger atrous rate is missing. Since objects at different scales can be roughly detected by the same weight with different atrous rates, it is natural to initialize the missing weights with those in the pretrained model. Our implementation uses w + ∆w for the missing weight where w is from the pretrained checkpoint and ∆w is initialized with 0. When fixing ∆w = 0, we observe a drop of 0.1% AP. But ∆w alone without the locking mechanism degrades AP a lot.

This locking mechanism is how the authors combine an ImageNet-pretrained model with the SAC module, so the backbone does not have to be trained from scratch and the pretrained weights can be reused. The newly added convolutions with atrous rate greater than 1 borrow their weights from the pretrained model, but each also gets a ∆w so that it can keep learning. (How can training learn only an offset? ∆w is simply a separate zero-initialized trainable tensor: the large-rate branch recomputes its effective weight as w + ∆w on every forward pass, so gradients flow to both the shared w and the branch-specific ∆w.)
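To make the mechanics concrete, here is a minimal PyTorch sketch of an SAC layer with the locking mechanism. The 5x5 average pooling + 1x1 convolution switch follows the paper's description; the sigmoid squashing, the default large rate of 3, and all names (`SAC`, `large_rate`, ...) are my assumptions, not the official implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SAC(nn.Module):
    """Switchable Atrous Convolution with the weight-locking mechanism."""
    def __init__(self, in_ch: int, out_ch: int, large_rate: int = 3):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_ch, in_ch, 3, 3))
        nn.init.kaiming_normal_(self.weight)  # in practice: copied from a pretrained 3x3 conv
        self.delta = nn.Parameter(torch.zeros_like(self.weight))  # Δw, zero-initialized
        self.rate = large_rate
        # Spatially-dependent switch: 5x5 average pooling followed by a 1x1 conv.
        self.switch = nn.Conv2d(in_ch, 1, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sigmoid keeps the switch in [0, 1]; the paper only specifies avg-pool + 1x1 conv.
        s = torch.sigmoid(self.switch(F.avg_pool2d(x, 5, stride=1, padding=2)))
        y1 = F.conv2d(x, self.weight, padding=1, dilation=1)
        # Locked weights: the large-rate branch uses w + Δw, so only the offset is free.
        y2 = F.conv2d(x, self.weight + self.delta, padding=self.rate, dilation=self.rate)
        return s * y1 + (1.0 - s) * y2
```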

The authors' explanation of the Pre-Global Context module in Figure 4:

As shown in Fig. 4, we insert two global context modules before and after the main component of SAC. These two modules are light-weighted as the input features are first compressed by a global average pooling layer. The global context modules are similar to SENet [31] except for two major differences: (1) we only have one convolutional layer without any non-linearity layers, and (2) the output is added back to the main stream instead of multiplying the input by a re-calibrating value computed by Sigmoid. Experimentally, we found that adding the global context information before the SAC component (i.e., adding global information to the switch function) has a positive effect on the detection performance. We speculate that this is because S can make more stable switching predictions when global information is available. We then move the global information outside the switch function and place it before and after the major body so that both Conv and S can benefit from it. We did not adopt the original SENet formulation as we found no improvement on the final model AP. In the ablation study in Sec. 5, we show the performances of SAC with and without the global context modules.
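Read literally, each global context module is just global average pooling, one convolution with no non-linearity, and an additive merge back into the main stream. A minimal sketch of that reading (module name mine):

```python
import torch
import torch.nn as nn

class GlobalContext(nn.Module):
    """SENet-like module per the paper: one conv, no non-linearity, additive fusion."""
    def __init__(self, channels: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # compress to a 1x1 global descriptor
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)  # single conv, no ReLU

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Added back to the main stream (broadcast over H and W), not multiplied in.
        return x + self.conv(self.pool(x))
```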

4. Ablation Studies

[Ablation table from the paper: overall contributions of RFP and SAC on top of the HTC baseline]

[Ablation table from the paper: RFP design variants, including 'RFP + sharing']

For RFP, we show 'RFP + sharing', where B_i^1 and B_i^2 share their weights.

[Further ablation/results table from the paper]
