Salient Object Detection with F3Net: Fusion, Feedback and Focus for Salient Object Detection

F3Net: Fusion, Feedback and Focus for Salient Object Detection


  • F3Net: Fusion, Feedback and Focus for Salient Object Detection
    • Preface
    • Main contributions
    • Main structure
      • Loss function
    • Experimental details
    • Related links



AAAI 2020






  • reduce the impact of inconsistency between features of different levels
  • assign larger weights to those truly important pixels


  • CFM&CFD:
    • First, to mitigate the discrepancy between features, we design the cross feature module (CFM), which fuses features of different levels by element-wise multiplication. Different from addition and concatenation, CFM takes a selective fusion strategy, where redundant information will be suppressed to avoid contamination between features and important features will complement each other. Compared with traditional fusion methods, CFM is able to remove background noise and sharpen boundaries.
    • Second, due to downsampling, high level features may suffer from information loss and distortion, which cannot be solved by CFM. Therefore, we develop the cascaded feedback decoder (CFD) to refine these features iteratively. CFD contains multiple sub-decoders, each of which contains both bottom-up and top-down processes.
      • For the bottom-up process, multi-level features are aggregated by CFM gradually.
      • For the top-down process, aggregated features are fed back into previous features to refine them.
  • PPA:
    • We propose the pixel position aware loss (PPA) to improve the commonly used binary cross entropy loss which treats all pixels equally. In fact, pixels located at boundaries or elongated areas are more difficult and discriminating. Paying more attention to these hard pixels can further enhance model generalization. PPA loss assigns different weights to different pixels, which extends binary cross entropy. The weight of each pixel is determined by its surrounding pixels. Hard pixels will get larger weights and easy pixels will get smaller ones.


[Image 2: structure diagram]

The figure makes the structure easy to see. With the CFM structure, the authors aim for the following: By multiple feature crossings, fl and fh will gradually absorb useful information from each other to complement themselves, i.e., the noise of fl will be suppressed and the boundaries of fh will be sharpened.
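The fusion idea above can be sketched in PyTorch. This is a minimal sketch, not the authors' implementation: the class name `CrossFeatureSketch`, the helper `conv_bn_relu`, the single conv layer per branch, and the 64-channel default are assumptions made for illustration. It only shows the core of CFM as quoted above: the two feature maps are multiplied element-wise, and each branch adds the shared response back to its own features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_bn_relu(channels: int) -> nn.Sequential:
    """3x3 conv + batch norm + ReLU that keeps channel count and spatial size."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        nn.BatchNorm2d(channels),
        nn.ReLU(),
    )


class CrossFeatureSketch(nn.Module):
    """Hypothetical, simplified CFM: fuse f_l and f_h by element-wise multiplication."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.pre_l = conv_bn_relu(channels)
        self.pre_h = conv_bn_relu(channels)
        self.post_l = conv_bn_relu(channels)
        self.post_h = conv_bn_relu(channels)

    def forward(self, f_l: torch.Tensor, f_h: torch.Tensor):
        # Bring the coarser high-level map to the low-level resolution.
        if f_h.shape[2:] != f_l.shape[2:]:
            f_h = F.interpolate(f_h, size=f_l.shape[2:], mode="bilinear",
                                align_corners=False)
        low = self.pre_l(f_l)
        high = self.pre_h(f_h)
        # Selective fusion: only responses present in both maps stay strong.
        shared = low * high
        # Each branch absorbs the shared response and keeps its own features.
        out_l = self.post_l(shared) + low
        out_h = self.post_h(shared) + high
        return out_l, out_h


if __name__ == "__main__":
    cfm = CrossFeatureSketch(channels=8).eval()
    with torch.no_grad():
        out_l, out_h = cfm(torch.randn(2, 8, 16, 16), torch.randn(2, 8, 8, 8))
    print(out_l.shape, out_h.shape)
```

Both outputs come back at the resolution of the low-level input, so the module can be chained from coarse levels to fine levels.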


  • On cascading:

The cascaded feedback decoder (CFD) is built upon CFM; it refines the multi-level features and generates saliency maps iteratively.

  • On the bidirectional design:

In fact, features of different levels may have missing or redundant parts because of downsampling and noise. Even with CFM, these parts are still difficult to identify and restore, which may hurt the final performance.
Considering the output saliency map is relatively complete and approximate to ground truth, we propose to propagate the features of the last convolution layer back to features of previous layers to correct and refine them.
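A rough sketch of the cascade and feedback idea, under simplifying assumptions: `fuse` is a parameter-free stand-in for CFM, all levels are assumed to share the same channel count, and the function names and the choice of two sub-decoders are hypothetical. Each sub-decoder aggregates the levels from coarse to fine, and the aggregated feature is then added back to every level before the next sub-decoder runs.

```python
import torch
import torch.nn.functional as F


def resize_to(x: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
    """Bilinearly resize x to the spatial size of ref."""
    return F.interpolate(x, size=ref.shape[2:], mode="bilinear", align_corners=False)


def fuse(f_l: torch.Tensor, f_h: torch.Tensor):
    """Stand-in for CFM (assumption): element-wise product plus residuals."""
    f_h = resize_to(f_h, f_l)
    shared = f_l * f_h
    return f_l + shared, f_h + shared


def sub_decoder(feats, feedback=None):
    """One pass over feats, ordered from low level (fine) to high level (coarse)."""
    if feedback is not None:
        # Feedback step: add the previously aggregated feature to every level.
        feats = [f + resize_to(feedback, f) for f in feats]
    refined = list(feats)
    carry = refined[-1]
    # Aggregation step: walk from the coarsest level towards the finest one.
    for i in range(len(refined) - 2, -1, -1):
        refined[i], carry = fuse(refined[i], carry)
    # carry is the aggregated feature at the finest resolution.
    return refined, carry


def cascaded_feedback_decoder(feats, num_decoders: int = 2):
    """Run several sub-decoders; each one receives the previous aggregate as feedback."""
    feedback = None
    outputs = []
    for _ in range(num_decoders):
        feats, feedback = sub_decoder(feats, feedback)
        outputs.append(feedback)
    return outputs


if __name__ == "__main__":
    levels = [torch.randn(1, 4, 32, 32), torch.randn(1, 4, 16, 16), torch.randn(1, 4, 8, 8)]
    for out in cascaded_feedback_decoder(levels):
        print(out.shape)
```

In a real model each aggregate would pass through a prediction layer to produce a saliency map; that part is left out here.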



  1. Pixel-wise loss: First, binary cross entropy (BCE) calculates the loss for each pixel independently and ignores the global structure of the image.
  2. Easily dominated by large regions: Second, in pictures where the background is dominant, the loss of foreground pixels will be diluted.
  3. Treats every pixel equally: Third, BCE treats all pixels equally. In fact, pixels located in cluttered or elongated areas (e.g., pole and horn) are prone to wrong predictions and deserve more attention, while pixels located in areas like sky and grass deserve less attention.
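The idea of giving hard pixels larger weights can be sketched as follows. The concrete choices are assumptions, since the notes above only say that a pixel's weight is determined by its surrounding pixels: here the weight is 1 plus `gamma` times the absolute difference between the local average of the mask and the mask itself, with a 31 x 31 window and `gamma = 5`, and a weighted IoU term is added to the weighted BCE term so that global structure is also taken into account. The function names `ppa_weight` and `ppa_loss_sketch` are hypothetical.

```python
import torch
import torch.nn.functional as F


def ppa_weight(mask: torch.Tensor, window: int = 31, gamma: float = 5.0) -> torch.Tensor:
    """Per-pixel weight; mask is (N, 1, H, W) with values in {0, 1}, window is odd."""
    # A pixel whose neighbourhood average differs from its own label is "hard".
    local_mean = F.avg_pool2d(mask, kernel_size=window, stride=1, padding=window // 2)
    return 1.0 + gamma * torch.abs(local_mean - mask)


def ppa_loss_sketch(logits: torch.Tensor, mask: torch.Tensor,
                    window: int = 31, gamma: float = 5.0) -> torch.Tensor:
    """Weighted BCE plus weighted IoU, averaged over the batch."""
    weight = ppa_weight(mask, window, gamma)

    # Weighted BCE: per-pixel loss, normalised by the total weight of each image.
    bce = F.binary_cross_entropy_with_logits(logits, mask, reduction="none")
    wbce = (weight * bce).sum(dim=(2, 3)) / weight.sum(dim=(2, 3))

    # Weighted IoU: a region-level term computed on probabilities.
    prob = torch.sigmoid(logits)
    inter = (prob * mask * weight).sum(dim=(2, 3))
    union = ((prob + mask) * weight).sum(dim=(2, 3))
    wiou = 1.0 - (inter + 1.0) / (union - inter + 1.0)

    return (wbce + wiou).mean()


if __name__ == "__main__":
    mask = torch.zeros(1, 1, 32, 32)
    mask[:, :, 8:24, 8:24] = 1.0
    logits = torch.randn(1, 1, 32, 32)
    print(ppa_loss_sketch(logits, mask).item())
```

With a small window, pixels deep inside the object or the background get a weight close to 1, while pixels near the boundary get a larger weight, which matches the intent described above.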










  • DUTS-TR is used to train F3Net and other above mentioned datasets are used to evaluate F3Net.
  • For data augmentation, we use horizontal flip, random crop and multi-scale input images.
  • ResNet-50 (He et al. 2016), pre-trained on ImageNet, is used as the backbone network. Maximum learning rate is set to 0.005 for ResNet-50 backbone and 0.05 for other parts.
  • Warm-up and linear decay strategies are used to adjust the learning rate.
  • The whole network is trained end-to-end, using stochastic gradient descent (SGD). Momentum and weight decay are set to 0.9 and 0.0005, respectively.
  • Batch size is set to 32 and the maximum epoch is set to 32.
  • We use PyTorch 1.3 to implement our model. An RTX 2080Ti GPU is used for acceleration.
  • During testing, we resize each image to 352 x 352 and then feed it to F3Net to predict saliency maps without any post-processing.
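The warm-up and linear decay strategy can be illustrated with a small schedule function. This is a sketch under assumptions: the notes do not say where the peak lies, so the `warmup_fraction` parameter (default 0.5, a symmetric ramp) and the function name are hypothetical. The two maximum learning rates and the epoch count are the values listed above.

```python
def warmup_linear_decay(step: float, total_steps: float, max_lr: float,
                        warmup_fraction: float = 0.5) -> float:
    """Ramp linearly from 0 to max_lr, then decay linearly back to 0."""
    if not 0.0 < warmup_fraction < 1.0:
        raise ValueError("warmup_fraction must be strictly between 0 and 1")
    peak = warmup_fraction * total_steps
    if step <= peak:
        return max_lr * step / peak
    return max_lr * (total_steps - step) / (total_steps - peak)


BACKBONE_MAX_LR = 0.005  # ResNet-50 backbone
HEAD_MAX_LR = 0.05       # all other parts
MAX_EPOCH = 32

if __name__ == "__main__":
    for epoch in (0, 8, 16, 24, 32):
        backbone_lr = warmup_linear_decay(epoch, MAX_EPOCH, BACKBONE_MAX_LR)
        head_lr = warmup_linear_decay(epoch, MAX_EPOCH, HEAD_MAX_LR)
        print(epoch, backbone_lr, head_lr)
```

In PyTorch, the two values could be written into `optimizer.param_groups[0]["lr"]` and `optimizer.param_groups[1]["lr"]` at each step, with the backbone and the remaining layers placed in separate parameter groups of the SGD optimizer.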



  • Paper: https://arxiv.org/pdf/1911.11445.pdf
  • Code: https://github.com/weijun88/F3Net
