Notes on Salient Object Detection: Global Context-Aware Progressive Aggregation Network for Salient Object Detection

Global Context-Aware Progressive Aggregation Network for Salient Object Detection


AAAI 2020


This paper has a great deal in common with F3Net. Both argue that "the previous works mainly adopted multiple-level feature integration yet ignored the gap between different features."

The other observation, that "there also exists a dilution process of high-level features as they passed on the top-down pathway," is in fact borrowed from the earlier PoolNet.


  1. Due to the gap between different level features, the simple combination of semantic information and appearance information is insufficient and lacks consideration of the different contribution of different features for salient object detection;
  2. Most of the previous works ignored the global context information, which benefits for deducing the relationship among multiple salient regions and producing more complete saliency result.


  • For the first problem: Feature Interweaved Aggregation (FIA) module fully integrates the high-level semantic features, low-level detail features, and global context features, which is expected to suppress the noises but recover more structural and detail information.
  • General improvements:
    • Head Attention (HA) module is used to reduce information redundancy and enhance the top layers features by leveraging the spatial and channel-wise attention
    • Self Refinement (SR) module is utilized to further refine and heighten the input features
  • For the second problem: Global Context Flow (GCF) module generates the global context information at different stages, which aims to learn the relationship among different salient regions and alleviate the dilution effect of high-level features


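Multiplication is used in several places here: "The multiplication operation can strengthen the response of salient objects, meanwhile suppress the background noises." The figure gives a fairly intuitive view of the overall computation. Note that the middle and right branches take $\tilde f_l$ as input rather than $f_l$, i.e., the intermediate feature of the left branch.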


  1. the high-level features from the output of the previous layer
  2. the low-level features from the corresponding bottom layer
  3. the global context feature generated by the GCF module
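
To illustrate why element-wise multiplication strengthens salient responses while suppressing background noise, here is a minimal, dependency-free sketch. It is not the FIA module itself: the tiny 1-D "feature maps", the `gate` helper, and all numeric values are hypothetical, chosen only to show the gating effect of a high-level mask on low-level features.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate(low_level, high_level_logits):
    """Multiply low-level responses by a [0, 1] mask derived from high-level logits."""
    mask = [sigmoid(z) for z in high_level_logits]
    return [f * m for f, m in zip(low_level, mask)]


# Hypothetical low-level responses: indices 1-2 lie on the object,
# indices 0 and 3 are noisy background that also responds strongly.
low_level = [0.6, 0.9, 0.8, 0.5]
# Hypothetical high-level logits: "object" in the middle, "background" at the ends.
high_level_logits = [-4.0, 4.0, 4.0, -4.0]

gated = gate(low_level, high_level_logits)
print([round(v, 3) for v in gated])
```

After gating, the background positions are pushed close to zero while the object positions stay almost unchanged, which is the behaviour the quoted sentence describes.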


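For example, predicted salient objects sometimes contain holes, which are caused by contradictory responses from different layers. The authors therefore develop an SR module that further refines and enhances the feature maps after they pass through the HA and FIA modules, using multiplication and addition operations.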
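
A minimal sketch of such a multiply-then-add refinement follows. It assumes that a per-position scale `w` and shift `b` are predicted from the feature itself and applied as `ReLU(w * f + b)`. The text above only states that multiplication and addition are used, so this exact form, the helper name `self_refine`, and the numbers are illustrative assumptions.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def self_refine(feature, scale, shift):
    """Refine each response as ReLU(scale * feature + shift) (assumed form)."""
    return [relu(w * f + b) for f, w, b in zip(feature, scale, shift)]


# Hypothetical responses along one row of a salient object, with a "hole" at index 2.
feature = [0.9, 0.8, 0.1, 0.85]
# Hypothetical scale and shift values; in a real network a convolution would predict them.
scale = [1.0, 1.0, 1.0, 1.0]
shift = [0.0, 0.0, 0.7, 0.0]

refined = self_refine(feature, scale, shift)
print([round(v, 3) for v in refined])
```

The additive term lets the module raise a response that multiplication alone could never recover, since a near-zero value stays near zero under scaling.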



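Because the top-layer features of the encoder are usually redundant for salient object detection, the authors attach an HA module after the top layer, which uses spatial and channel-wise attention to learn more selective and representative features.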

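Given the input feature map $F$, its channels are first reduced to 256 to obtain $\tilde F$, and a simple convolutional structure then produces the first-stage feature $F_1$.

Next, global average pooling turns $F$ into a channel-wise feature vector $f$, which is followed by two fully connected layers with ReLU and Sigmoid activations respectively, yielding the weight vector $y$.

The final output is $F_1 \odot y$, i.e., $y$ is used to weight $F_1$ channel-wise.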
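
The channel-weighting path described above (global average pooling, two fully connected layers with ReLU and Sigmoid, then $F_1 \odot y$) can be sketched without any framework. This is an illustration under assumptions, not the authors' code: feature maps are plain nested lists, the weights are random, the channel counts are tiny stand-ins for the real ones, and the hidden width of the first fully connected layer is assumed equal to the output width.

```python
import math
import random


def global_avg_pool(feature):
    """Reduce a C x H x W feature map (nested lists) to a length-C vector."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]


def linear(weights, bias, x):
    """Fully connected layer; weights has shape out_dim x in_dim."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def channel_attention(f1, f, fc1, fc2):
    """Weight each channel of f1 by y = Sigmoid(FC2(ReLU(FC1(GAP(f)))))."""
    pooled = global_avg_pool(f)
    hidden = [max(0.0, h) for h in linear(fc1[0], fc1[1], pooled)]
    y = [1.0 / (1.0 + math.exp(-z)) for z in linear(fc2[0], fc2[1], hidden)]
    out = [[[v * w for v in row] for row in ch] for ch, w in zip(f1, y)]
    return out, y


def rand_map(channels, height, width):
    return [[[random.random() for _ in range(width)] for _ in range(height)]
            for _ in range(channels)]


def rand_fc(out_dim, in_dim):
    weights = [[random.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim


random.seed(0)
C_IN, C_OUT, H, W = 6, 4, 3, 3   # tiny stand-ins for the real channel counts
f = rand_map(C_IN, H, W)         # input feature F
f1 = rand_map(C_OUT, H, W)       # first-stage feature F_1 (stand-in values)
out, y = channel_attention(f1, f, rand_fc(C_OUT, C_IN), rand_fc(C_OUT, C_OUT))
print([round(w, 3) for w in y])
```

Each entry of `y` lies strictly between 0 and 1, so the multiplication can only attenuate a channel, which matches the stated goal of reducing redundancy in the top-layer features.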



To facilitate the optimization of the proposed network, we add auxiliary loss at three decoder stages. Specifically, a 3×3 convolution operation is applied for each stage to squeeze the channel of the output feature maps to 1. Then these maps are up-sampled to the same size as the ground truth via bilinear interpolation and sigmoid function is used to normalize the predicted values into [0,1].

The auxiliary loss branches exist only during training and are discarded at inference.
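
The way the three auxiliary terms enter the training objective can be sketched as follows. The text above does not name the loss function or any per-stage weights, so binary cross-entropy and the `aux_weight` parameter are assumptions. The logits are hypothetical and are taken to be already squeezed to one channel, up-sampled to ground-truth size, and flattened.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce(logits, target, eps=1e-7):
    """Mean binary cross-entropy between sigmoid(logits) and a 0/1 target."""
    total = 0.0
    for z, t in zip(logits, target):
        p = min(max(sigmoid(z), eps), 1.0 - eps)  # clamp for numerical safety
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(target)


def training_loss(main_logits, aux_logits, target, aux_weight=1.0):
    """Main loss plus one auxiliary term per decoder stage (training only)."""
    loss = bce(main_logits, target)
    for stage_logits in aux_logits:
        loss += aux_weight * bce(stage_logits, target)
    return loss


# Hypothetical ground truth and predictions for a 4-pixel image.
target = [1, 1, 0, 0]
main_logits = [3.0, 2.5, -3.0, -2.0]
aux_logits = [
    [1.0, 0.5, -0.5, -1.0],   # decoder stage 1
    [2.0, 1.0, -1.0, -1.5],   # decoder stage 2
    [2.5, 2.0, -2.0, -2.0],   # decoder stage 3
]

loss = training_loss(main_logits, aux_logits, target)
main_only = bce(main_logits, target)
print(round(loss, 4), round(main_only, 4))
```

At inference only `main_logits` would be computed, so the auxiliary heads add no cost to the reported runtime.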


  • We adopt ResNet-50 (He et al. 2016) pretrained on ImageNet (Deng et al. 2009) as our network backbone.
  • In the training stage, we resize each image to 320×320 with random horizontal flipping, then randomly crop a patch with the size of 288 × 288 for training.
  • During the inference stage, images are simply **resized to 320 × 320** then fed into the network to obtain prediction without any other post-processing (e.g., CRF).
  • We use PyTorch (Paszke et al. 2017) to implement our model.
  • Mini-batch Stochastic gradient descent (SGD) is used to optimize the whole network with the batch size of 32, the momentum of 0.9, and the weight decay of 5e-4.
  • We use the **warm-up and linear decay strategies** with the maximum learning rate 5e-3 for the backbone and 0.05 for other parts to train our model and stop training after 30 epochs.
  • The inference of a 320×320 image takes about 0.02s (over 50 fps) with the acceleration of one NVIDIA Titan-Xp GPU card.
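
The warm-up and linear decay schedule with two peak learning rates can be sketched as below. The peak values (5e-3 for the backbone, 0.05 for the other parts) and the 30 epochs come from the settings above. The number of iterations per epoch, the warm-up length, and decaying all the way to zero are hypothetical choices, since the settings do not specify them.

```python
def warmup_linear_decay(step, total_steps, max_lr, warmup_steps):
    """Linear warm-up from 0 to max_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return max_lr * remaining / (total_steps - warmup_steps)


EPOCHS = 30
ITERS_PER_EPOCH = 100                # hypothetical; depends on dataset and batch size
TOTAL_STEPS = EPOCHS * ITERS_PER_EPOCH
WARMUP_STEPS = 2 * ITERS_PER_EPOCH   # hypothetical warm-up length

BACKBONE_MAX_LR = 5e-3
HEAD_MAX_LR = 0.05

for step in (0, WARMUP_STEPS // 2, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    lr_backbone = warmup_linear_decay(step, TOTAL_STEPS, BACKBONE_MAX_LR, WARMUP_STEPS)
    lr_head = warmup_linear_decay(step, TOTAL_STEPS, HEAD_MAX_LR, WARMUP_STEPS)
    print(step, lr_backbone, lr_head)
```

Both parameter groups follow the same shape and differ only in their peak, so the pretrained backbone is always updated ten times more gently than the newly initialized modules.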

  • Paper: https://arxiv.org/pdf/2003.00651.pdf
  • Code: https://github.com/JosephChenHub/GCPANet
