Abstracts of selected classic papers on salient object detection (SOD)

(CVPR’15) Visual Saliency Based on Multiscale Deep Features

Visual saliency is a fundamental problem in both cognitive and computational sciences, including computer vision. In this paper, we discover that a high-quality visual saliency model can be learned from multiscale features extracted using deep convolutional neural networks (CNNs), which have had many successes in visual recognition tasks. For learning such saliency models, we introduce a neural network architecture, which has fully connected layers on top of CNNs responsible for feature extraction at three different scales. We then propose a refinement method to enhance the spatial coherence of our saliency results. Finally, aggregating multiple saliency maps computed for different levels of image segmentation can further boost the performance, yielding saliency maps better than those generated from a single segmentation. To promote further research and evaluation of visual saliency models, we also construct a new large database of 4447 challenging images and their pixelwise saliency annotations. Experimental results demonstrate that our proposed method is capable of achieving state-of-the-art performance on all public benchmarks, improving the F-Measure by 5.0% and 13.2% respectively on the MSRA-B dataset and our new dataset (HKU-IS), and lowering the mean absolute error by 5.7% and 35.1% respectively on these two datasets.

视觉显著性是认知科学和计算科学(包括计算机视觉)的一个基本问题。在本文中,我们发现高质量的视觉显著性模型可以从使用深度卷积神经网络(CNN)提取的多尺度特征中学习,该网络在许多视觉识别任务中已经取得了成功。为了学习这样的显著性模型,我们引入了一个神经网络架构,它在多个CNN上有全连接层,负责三个不同尺度的特征提取。然后,我们提出了一种细化方法,以增强我们显著性结果的空间一致性。最后,将不同级别的图像分割计算出的多个显著图汇总起来,可以进一步提高性能,产生比单一分割产生的显著图更好的结果。为了促进对视觉显著性模型的进一步研究和评估,我们还构建了一个新的大型数据库,其中包括4447张具有挑战性的图像及其像素级的显著性标注。实验结果表明,我们提出的方法能够在所有公开基准测试上取得SOTA,在MSRA-B数据集和我们的新数据集(HKU-IS)上,F-Measure分别提高了5.0%和13.2%,在这两个数据集上,平均绝对误差(MAE)分别降低了5.7%和35.1%。
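The two metrics quoted in this abstract, F-measure and mean absolute error (MAE), can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: the fixed threshold of 0.5 and the weighting `beta_sq=0.3` are assumptions (the latter is a common convention in the SOD literature), and the function names are hypothetical.

```python
def mean_absolute_error(saliency, ground_truth):
    """Average per-pixel absolute difference; both maps hold values in [0, 1]."""
    total = 0.0
    count = 0
    for pred_row, gt_row in zip(saliency, ground_truth):
        for p, g in zip(pred_row, gt_row):
            total += abs(p - g)
            count += 1
    return total / count


def f_measure(saliency, ground_truth, threshold=0.5, beta_sq=0.3):
    """F-measure of the binarized map. threshold and beta_sq are assumed defaults."""
    tp = fp = fn = 0
    for pred_row, gt_row in zip(saliency, ground_truth):
        for p, g in zip(pred_row, gt_row):
            predicted = p >= threshold
            actual = g >= 0.5
            if predicted and actual:
                tp += 1
            elif predicted:
                fp += 1
            elif actual:
                fn += 1
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)
```

A lower MAE and a higher F-measure both indicate a saliency map closer to the annotation.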


(CVPR’15) Deep Networks for Saliency Detection via Local Estimation and Global Search

This paper presents a saliency detection algorithm by integrating both local estimation and global search. In the local estimation stage, we detect local saliency by using a deep neural network (DNN-L) which learns local patch features to determine the saliency value of each pixel. The estimated local saliency maps are further refined by exploring the high level object concepts. In the global search stage, the local saliency map together with global contrast and geometric information are used as global features to describe a set of object candidate regions. Another deep neural network (DNN-G) is trained to predict the saliency score of each object region based on the global features. The final saliency map is generated by a weighted sum of salient object regions. Our method presents two interesting insights. First, local features learned by a supervised scheme can effectively capture local contrast, texture and shape information for saliency detection. Second, the complex relationship between different global saliency cues can be captured by deep networks and exploited principally rather than heuristically. Quantitative and qualitative experiments on several benchmark data sets demonstrate that our algorithm performs favorably against the state-of-the-art methods.

本文通过融合局部估计和全局搜索提出了一种显著性检测算法。在局部估计阶段,我们通过使用深度神经网络(DNN-L)来检测局部显著性,该网络学习局部块特征来确定每个像素的显著值。通过探索高级的对象概念,进一步完善估计的局部显著图。在全局搜索阶段,局部显著图与全局对比度和几何信息一起被用作全局特征来描述一组对象候选区域。另一个深度神经网络(DNN-G)被训练来预测基于全局特征的每个对象区域的显著性分数。最终的显著图是由显著性对象区域的加权和产生的。我们的方法提出了两个有趣的见解。首先,通过监督方案学习的局部特征可以有效地捕捉局部对比度、纹理和形状信息,用于显著性检测。第二,不同的全局显著性线索之间的复杂关系可以被深度网络所捕捉并被利用,而非启发式地使用。在几个基准数据集上进行的定量和定性实验表明,我们的算法与SOTA相比表现良好。
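The final step described above, a weighted sum of salient object regions, might look roughly like the sketch below. It assumes each candidate region is given as a set of pixel coordinates together with a score (standing in for the DNN-G prediction), and that the result is rescaled to [0, 1]; both the data layout and the normalization are illustrative assumptions, not details taken from the paper.

```python
def fuse_regions(height, width, regions):
    """Accumulate scored region masks into one saliency map.

    regions: list of (score, mask) pairs, where mask is a set of (row, col)
    pixels and score is a hypothetical per-region saliency score.
    """
    saliency = [[0.0] * width for _ in range(height)]
    for score, mask in regions:
        for r, c in mask:
            saliency[r][c] += score  # weighted sum over overlapping regions
    peak = max(max(row) for row in saliency)
    if peak > 0:
        # Rescale so the most salient pixel has value 1 (assumed normalization).
        saliency = [[v / peak for v in row] for row in saliency]
    return saliency
```

Pixels covered by several high-scoring candidates end up brightest, which is the intuition behind summing region masks rather than picking a single best region.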


(CVPR’16) Deep Contrast Learning for Salient Object Detection

Salient object detection has recently witnessed substantial progress due to powerful features extracted using deep convolutional neural networks (CNNs). However, existing CNN-based methods operate at the patch level instead of the pixel level. Resulting saliency maps are typically blurry, especially near the boundary of salient objects. Furthermore, image patches are treated as independent samples even when they are overlapping, giving rise to significant redundancy in computation and storage. In this paper, we propose an end-to-end deep contrast network to overcome the aforementioned limitations. Our deep network consists of two complementary components, a pixel-level fully convolutional stream and a segment-wise spatial pooling stream. The first stream directly produces a saliency map with pixel-level accuracy from an input image. The second stream extracts segment-wise features very efficiently, and better models saliency discontinuities along object boundaries. Finally, a fully connected CRF model can be optionally incorporated to improve spatial coherence and contour localization in the fused result from these two streams. Experimental results demonstrate that our deep model significantly improves the state of the art.

由于使用深度卷积神经网络(CNN)提取的强大特征,显著目标检测最近取得了实质性进展。然而,现有的基于CNN的方法是在图像块级而不是像素级进行操作。由此产生的显著图通常是模糊的,尤其是在显著性对象的边界附近。此外,图像块被视为独立的样本,即使它们是重叠的,也会在计算和存储中产生大量的冗余。在本文中,我们提出了一个端到端的深度对比网络来克服上述的局限。我们的深度网络由两个互补的部分组成,一个像素级的完全卷积流和一个分段的空间池流。第一个流直接从输入图像中产生一个具有像素级精度的显著图。第二个流有效地提取分段特征,并更好地建模对象边界上的显著性不连续现象。最后,一个全连接的CRF模型可以被选择性地引入,以改善这两个流的融合结果中的空间一致性和轮廓定位。实验结果表明,我们的深度模型明显提升了SOTA。


(CVPR’16) DHSNet: Deep Hierarchical Saliency Network for Salient Object Detection

Traditional salient object detection models often use hand-crafted features to formulate contrast and various prior knowledge, and then combine them artificially. In this work, we propose a novel end-to-end deep hierarchical saliency network (DHSNet) based on convolutional neural networks for detecting salient objects. DHSNet first makes a coarse global prediction by automatically learning various global structured saliency cues, including global contrast, objectness, compactness, and their optimal combination. Then a novel hierarchical recurrent convolutional neural network (HRCNN) is adopted to further hierarchically and progressively refine the details of saliency maps step by step via integrating local context information. The whole architecture works in a global to local and coarse to fine manner. DHSNet is directly trained using whole images and corresponding ground truth saliency masks. When testing, saliency maps can be generated by directly and efficiently feedforwarding testing images through the network, without relying on any other techniques. Evaluations on four benchmark datasets and comparisons with other 11 state-of-the-art algorithms demonstrate that DHSNet not only shows its significant superiority in terms of performance, but also achieves a real-time speed of 23 FPS on modern GPUs.

传统的显著目标检测模型通常使用手工制作的特征来制定对比度和各种先验知识,然后人为地将它们结合起来。在这项工作中,我们提出了一个新颖的基于卷积神经网络的端到端深度分层显著性网络(DHSNet),用于检测显著对象。DHSNet首先通过自动学习各种全局结构化的显著性线索,包括全局对比度、对象性、紧凑性以及它们的最佳组合,进行粗略的全局预测。然后,采用新型的分层递归卷积神经网络(HRCNN),通过整合局部上下文信息,进一步分层逐步细化显著图的细节。整个架构以全局到局部和从粗到细的方式工作。DHSNet直接使用整幅图像和相应的真值的显著性mask进行训练。在测试时,可以通过网络直接有效地前馈测试图像来生成显著图,而不需要依赖任何其他技术。对四个基准数据集的评估以及与其他11种SOTA的比较表明,DHSNet不仅在性能上显示出明显的优势,而且在现代GPU上达到了23FPS的实时速度。


(CVPR’16) Deep Saliency with Encoded Low level Distance Map and High Level Features

Recent advances in saliency detection have utilized deep learning to obtain high level features to detect salient regions in a scene. These advances have demonstrated superior results over previous works that utilize hand-crafted low level features for saliency detection. In this paper, we demonstrate that hand-crafted features can provide complementary information to enhance performance of saliency detection that utilizes only high level features. Our method utilizes both high level and low level features for saliency detection under a unified deep learning framework. The high level features are extracted using the VGG-net, and the low level features are compared with other parts of an image to form a low level distance map. The low level distance map is then encoded using a convolutional neural network(CNN) with multiple 1 × 1 convolutional and ReLU layers. We concatenate the encoded low level distance map and the high level features, and connect them to a fully connected neural network classifier to evaluate the saliency of a query region. Our experiments show that our method can further improve the performance of state-of-the-art deep learning-based saliency detection methods.

显著性检测的最新进展是利用深度学习获得高级特征来检测场景中的显著区域。这些进展显示了比以前利用手工制作的低级特征进行显著性检测工作更好的结果。在本文中,我们证明了手工制作的特征可以提供补充信息,以提高只利用高级特征的显著性检测的性能。我们的方法在一个统一的深度学习框架下利用高级和低级的特征进行显著性检测。高级特征是用VGG网络提取的,而低层特征是与图像的其他部分进行比较以形成低级距离图。然后,低级距离图用一个具有多个1×1卷积层和ReLU层的卷积神经网络(CNN)进行编码。我们将编码后的低级距离图和高层特征连接起来,并将它们连接到一个全连接的神经网络分类器,以评估查询区域的显著性。实验表明,我们的方法可以进一步提高基于深度学习的显著性检测SOTA的性能。
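As a rough illustration of the two steps named above, the sketch below builds a distance map by comparing a query region's low-level feature vector with those of other regions, then encodes it with a single 1 × 1 convolution followed by ReLU. The feature layout, the absolute-difference distance, and the function names are assumptions made for this example; the paper's actual features and distance measures may differ.

```python
def low_level_distance_map(query_feature, region_features):
    """Row i holds the element-wise absolute distance between the query and region i."""
    return [[abs(q - f) for q, f in zip(query_feature, feats)]
            for feats in region_features]


def encode_1x1(distance_map, weights, bias):
    """One 1x1 convolution + ReLU: the same linear map is applied at every location.

    weights: one weight vector per output channel; bias: one value per output channel.
    """
    encoded = []
    for row in distance_map:
        out = []
        for w_vec, b in zip(weights, bias):
            activation = sum(w * x for w, x in zip(w_vec, row)) + b
            out.append(max(0.0, activation))  # ReLU
        encoded.append(out)
    return encoded
```

Because a 1 × 1 convolution never mixes neighbouring locations, each region's distances are encoded independently with shared weights.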


(CVPR’17) Non-Local Deep Features for Salient Object Detection

Saliency detection aims to highlight the most relevant objects in an image. Methods using conventional models struggle whenever salient objects are pictured on top of a cluttered background while deep neural nets suffer from excess complexity and slow evaluation speeds. In this paper, we propose a simplified convolutional neural network which combines local and global information through a multiresolution 4 × 5 grid structure. Instead of enforcing spatial coherence with a CRF or superpixels as is usually the case, we implemented a loss function inspired by the Mumford-Shah functional which penalizes errors on the boundary. We trained our model on the MSRA-B dataset, and tested it on six different saliency benchmark datasets. Results show that our method is on par with the state-of-the-art while reducing computation time by a factor of 18 to 100, enabling near real-time, high performance saliency detection.

显著性检测的目的是突出图像中最相关的物体。每当显著物体出现在杂乱的背景上时,使用传统模型的方法就会陷入困境,而深层神经网络则受到过于复杂和缓慢推理速度的影响。在本文中,我们提出了一个简化的卷积神经网络,它通过一个多分辨率的4×5网格结构将局部和全局信息结合起来。我们没有像通常那样用CRF或超像素来强制执行空间一致性,而是实现了一个受Mumford-Shah泛函启发的损失函数,对边界上的错误进行惩罚。我们在MSRA-B数据集上训练了我们的模型,并在六个不同的显著性基准数据集上对其进行了测试。结果表明,我们的方法与SOTA相当,同时将计算时间减少到原来的1/18至1/100,从而实现了近乎实时的高性能显著性检测。
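One way to penalize boundary errors is an overlap-based loss computed on boundary maps. The sketch below uses a Dice-style overlap term as a stand-in; this is an assumption for illustration only, and the abstract does not specify the exact form of the paper's loss. Inputs are flattened boundary maps with values in [0, 1].

```python
def boundary_overlap_loss(pred_boundary, true_boundary, eps=1e-8):
    """1 - 2|P ∩ T| / (|P| + |T|) on flattened boundary maps (assumed form)."""
    intersection = sum(p * t for p, t in zip(pred_boundary, true_boundary))
    total = sum(pred_boundary) + sum(true_boundary)
    return 1.0 - 2.0 * intersection / (total + eps)
```

The loss approaches 0 when predicted and true boundaries coincide and reaches 1 when they do not overlap at all.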


(CVPR’17) Deeply Supervised Salient Object Detection with Short Connections

Recent progress on salient object detection is substantial, benefiting mostly from the explosive development of Convolutional Neural Networks (CNNs). Semantic segmentation and salient object detection algorithms developed lately have been mostly based on Fully Convolutional Neural Networks (FCNs). There is still a large room for improvement over the generic FCN models that do not explicitly deal with the scale-space problem. Holistically-Nested Edge Detector (HED) provides a skip-layer structure with deep supervision for edge and boundary detection, but the performance gain of HED on saliency detection is not obvious. In this paper, we propose a new salient object detection method by introducing short connections to the skip-layer structures within the HED architecture. Our framework takes full advantage of multi-level and multi-scale features extracted from FCNs, providing more advanced representations at each layer, a property that is critically needed to perform segment detection. Our method produces state-of-the-art results on 5 widely tested salient object detection benchmarks, with advantages in terms of efficiency (0.08 seconds per image), effectiveness, and simplicity over the existing algorithms. Beyond that, we conduct an exhaustive analysis on the role of training data on performance. Our experimental results provide a more reasonable and powerful training set for future research and fair comparisons.

最近在显著目标检测方面取得了重大进展,主要受益于卷积神经网络(CNN)的爆炸性发展。最近开发的语义分割和显著目标检测算法大多是基于全卷积神经网络(FCN)。与没有明确处理尺度空间问题的通用FCN模型相比,仍有很大的改进空间。整体嵌套边缘检测器(HED)为边缘和边界检测提供了一个具有深度监督的跳层结构,但HED在显著性检测上的性能提升并不明显。在本文中,我们通过在HED架构内部的跳层结构中引入短连接,提出了一种新的显著性物体检测方法。我们的框架充分利用了从FCN中提取的多层次和多尺度的特征,在每一层都提供了更高级的表征,而这一特征是进行分割和检测所迫切需要的。我们的方法在5个广泛测试的显著目标检测基准上产生了SOTA,在效率(每幅图像0.08秒)、有效性和简单性方面比现有算法更有优势。除此之外,我们对训练数据对性能的作用进行了详尽的分析。我们的实验结果为未来的研究和公平的比较提供了一个更合理和强大的训练集。


(ICCV’17) A Stagewise Refinement Model for Detecting Salient Objects in Images

Deep convolutional neural networks (CNNs) have been successfully applied to a wide variety of problems in computer vision, including salient object detection. To detect and segment salient objects accurately, it is necessary to extract and combine high-level semantic features with low-level fine details simultaneously. This happens to be a challenge for CNNs as repeated subsampling operations such as pooling and convolution lead to a significant decrease in the initial image resolution, which results in loss of spatial details and finer structures. To remedy this problem, here we propose to augment feedforward neural networks with a novel pyramid pooling module and a multi-stage refinement mechanism for saliency detection. First, our deep feedforward net is used to generate a coarse prediction map with many detailed structures lost. Then, refinement nets are integrated with local context information to refine the preceding saliency maps generated in the master branch in a stagewise manner. Further, a pyramid pooling module is applied for different-region-based global context aggregation. Empirical evaluations over six benchmark datasets show that our proposed method compares favorably against the state-of-the-art approaches.

深度卷积神经网络(CNN)已经成功地应用于计算机视觉中的各种问题,包括显著目标检测。为了准确地检测和分割显著物体,有必要同时提取和结合高级语义特征和低级精细细节。这对CNN来说是一个挑战,因为重复的下采样操作,如池化和卷积,会导致初始图像分辨率大幅下降,从而导致空间细节和精细结构的损失。为了弥补这个问题,我们在这里提出用一个新颖的金字塔池化模块和一个多阶段细化机制来增强前馈神经网络的显著性检测。首先,我们的深度前馈网络被用来生成一个粗略的预测图,其中的细节结构会丢失。然后,细化网与局部上下文信息相结合,以分阶段的方式细化在主分支中产生的上述显著图。此外,一个金字塔池化模块被应用于基于不同区域的全局环境聚合。对六个基准数据集的实证评估表明,我们提出的方法与SOTA相比更有优势。
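Pyramid pooling, as used here for region-based global context aggregation, amounts to average-pooling the same feature map into grids of several sizes. The sketch below is a minimal single-channel version; the bin sizes `(1, 2, 4)` and the function names are assumptions, and the map is assumed to be at least as large as the largest bin size.

```python
def adaptive_avg_pool(feature, bins):
    """Average-pool a 2D map into a bins x bins grid of sub-region means."""
    h, w = len(feature), len(feature[0])
    pooled = []
    for i in range(bins):
        r0, r1 = i * h // bins, (i + 1) * h // bins
        row = []
        for j in range(bins):
            c0, c1 = j * w // bins, (j + 1) * w // bins
            cells = [feature[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


def pyramid_pool(feature, bin_sizes=(1, 2, 4)):
    """Return one pooled map per pyramid level, keyed by its bin size."""
    return {b: adaptive_avg_pool(feature, b) for b in bin_sizes}
```

The 1 × 1 level summarizes the whole image, while finer levels keep coarse spatial layout; in a full network these maps would be upsampled and concatenated with the original features.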


(ICCV’17) Amulet: Aggregating Multi-level Convolutional Features for Salient Object Detection

Fully convolutional neural networks (FCNs) have shown outstanding performance in many dense labeling problems. One key pillar of these successes is mining relevant information from features in convolutional layers. However, how to better aggregate multi-level convolutional feature maps for salient object detection is underexplored. In this work, we present Amulet, a generic aggregating multi-level convolutional feature framework for salient object detection. Our framework first integrates multi-level feature maps into multiple resolutions, which simultaneously incorporate coarse semantics and fine details. Then it adaptively learns to combine these feature maps at each resolution and predict saliency maps with the combined features. Finally, the predicted results are efficiently fused to generate the final saliency map. In addition, to achieve accurate boundary inference and semantic enhancement, edge-aware feature maps in low-level layers and the predicted results of low resolution features are recursively embedded into the learning framework. By aggregating multi-level convolutional features in this efficient and flexible manner, the proposed saliency model provides accurate salient object labeling. Comprehensive experiments demonstrate that our method performs favorably against state-of-the-art approaches in terms of nearly all compared evaluation metrics.

全卷积神经网络(FCN)在许多密集标签问题上表现出了出色的性能。这些成功的一个关键支柱是从卷积层的特征中挖掘相关信息。然而,如何更好地融合多级卷积特征图以进行显著目标检测还没有得到充分的探索。在这项工作中,我们提出了Amulet,一个用于显著目标检测的通用融合多级卷积特征框架。我们的框架首先将多级特征图融合到多个分辨率中,这些分辨率同时包含粗略的语义和精细的细节。然后,它自适应地学习在每个分辨率下结合这些特征图,并通过结合的特征预测显著图。最后,预测的结果被有效地融合以生成最终的显著图。此外,为了实现准确的边界推理和语义增强,低层的边缘感知特征图和低分辨率特征的预测结果被递归地嵌入到学习框架中。通过以这种高效和灵活的方式融合多级卷积特征,所提出的显著性模型提供了准确的显著性物体标签。综合实验表明,我们的方法在几乎所有比较的评价指标方面都比SOTA表现更好。


(ICCV’17) Learning Uncertain Convolutional Features for Accurate Saliency Detection

Deep convolutional neural networks (CNNs) have delivered superior performance in many computer vision tasks. In this paper, we propose a novel deep fully convolutional network model for accurate salient object detection. The key contribution of this work is to learn deep uncertain convolutional features (UCF), which encourage the robustness and accuracy of saliency detection. We achieve this via introducing a reformulated dropout (R-dropout) after specific convolutional layers to construct an uncertain ensemble of internal feature units. In addition, we propose an effective hybrid upsampling method to reduce the checkerboard artifacts of deconvolution operators in our decoder network. The proposed methods can also be applied to other deep convolutional networks. Compared with existing saliency detection methods, the proposed UCF model is able to incorporate uncertainties for more accurate object boundary inference. Extensive experiments demonstrate that our proposed saliency model performs favorably against state-of-the-art approaches. The uncertain feature learning mechanism as well as the upsampling method can significantly improve performance on other pixel-wise vision tasks.

深度卷积神经网络(CNN)在许多计算机视觉任务中都有出色的表现。在本文中,我们提出了一个新的深度全卷积网络模型,用于准确的显著目标检测。这项工作的主要贡献是学习深度不确定卷积特征(UCF),它提升了显著性检测的鲁棒性和准确性。我们通过在特定的卷积层之后引入一个reformulated dropout(R-dropout)来构建内部特征单元的不确定集合来实现这一目标。此外,我们提出了一种有效的混合上采样方法,以减少我们解码器网络中反卷积算子的棋盘伪像。所提出的方法也可以应用于其他深度卷积网络。与现有的显著性检测方法相比,所提出的UCF模型能够纳入不确定因素,以获得更准确的物体边界推断。广泛的实验表明,我们提出的显著性模型与SOTA相比表现良好。不确定的特征学习机制以及上采样方法可以显著提高其他像素级视觉任务的性能。


(CVPR’18) Detect Globally, Refine Locally: A Novel Approach to Saliency Detection

Effective integration of contextual information is crucial for salient object detection. To achieve this, most existing methods based on 'skip' architecture mainly focus on how to integrate hierarchical features of Convolutional Neural Networks (CNNs). They simply apply concatenation or element-wise operation to incorporate high-level semantic cues and low-level detailed information. However, this can degrade the quality of predictions because cluttered and noisy information can also be passed through. To address this problem, we propose a global Recurrent Localization Network (RLN) which exploits contextual information by the weighted response map in order to localize salient objects more accurately. Particularly, a recurrent module is employed to progressively refine the inner structure of the CNN over multiple time steps. Moreover, to effectively recover object boundaries, we propose a local Boundary Refinement Network (BRN) to adaptively learn the local contextual information for each spatial position. The learned propagation coefficients can be used to optimally capture relations between each pixel and its neighbors. Experiments on five challenging datasets show that our approach performs favorably against all existing methods in terms of the popular evaluation metrics.

有效整合上下文信息对于显著目标检测至关重要。为了实现这一点,大多数现有的基于"跳过"架构的方法主要集中在如何整合卷积神经网络(CNN)的分层特征。他们只是简单地应用串联或逐元操作来纳入高层次的语义线索和低层次的细节信息。然而,这可能会降低预测的质量,因为杂乱和噪声的信息也会被传递出去。为了解决这个问题,我们提出了一个全局性的循环定位网络(RLN),它通过加权响应图来利用上下文信息,以便更准确地定位显著物体。具体来说,一个递归模块被用来在多个时间步骤中逐步完善CNN的内部结构。此外,为了有效地恢复物体的边界,我们提出了一个局部的边界细化网络(BRN)来自适应地学习每个空间位置的局部上下文信息。学习到的传播系数可以用来最佳地捕捉每个像素和其相邻之间的关系。在五个具有挑战性的数据集上的实验表明,我们的方法在流行的评估指标方面比所有现有的方法都表现得好。


(CVPR’18) PiCANet: Learning Pixel-wise Contextual Attention for Saliency Detection

Contexts play an important role in the saliency detection task. However, given a context region, not all contextual information is helpful for the final task. In this paper, we propose a novel pixel-wise contextual attention network, i.e., the PiCANet, to learn to selectively attend to informative context locations for each pixel. Specifically, for each pixel, it can generate an attention map in which each attention weight corresponds to the contextual relevance at each context location. An attended contextual feature can then be constructed by selectively aggregating the contextual information. We formulate the proposed PiCANet in both global and local forms to attend to global and local contexts, respectively. Both models are fully differentiable and can be embedded into CNNs for joint training. We also incorporate the proposed models with the U-Net architecture to detect salient objects. Extensive experiments show that the proposed PiCANets can consistently improve saliency detection performance. The global and local PiCANets facilitate learning global contrast and homogeneousness, respectively. As a result, our saliency model can detect salient objects more accurately and uniformly, thus performing favorably against the state-of-the-art methods.

上下文在显著性检测任务中起着重要作用。然而,给定一个上下文区域,并非所有的上下文信息都对最终的任务有帮助。在本文中,我们提出了一个新的像素级上下文注意网络,即PiCANet,以学习有选择地关注每个像素的有信息的上下文位置。具体来说,对于每个像素,它可以生成一个注意图,其中每个注意力权重对应于每个上下文位置的上下文相关性。然后,通过有选择地汇总上下文信息,可以构建一个被关注的上下文特征。我们提出的PiCANet有全局和局部两种形式,分别用于关注全局和局部上下文。这两种模型都是完全可微的,可以嵌入到CNN中进行联合训练。我们还将提出的模型与U-Net架构结合起来,以检测显著物体。广泛的实验表明,所提出的PiCANet可以持续改善显著性检测性能。全局和局部PiCANet分别有助于学习全局对比度和同质性。因此,我们的显著性模型可以更准确、更统一地检测出显著性物体,从而在与SOTA相比时表现出优势。
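The core idea above, one attention weight per context location followed by a weighted aggregation, can be sketched for a single pixel as follows. The relevance scores would come from a learned sub-network in practice; here they are plain inputs, and the softmax normalization and function names are assumptions for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of relevance scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(context_features, scores):
    """Aggregate context feature vectors using per-location attention weights.

    context_features: one feature vector per context location.
    scores: one unnormalized relevance score per context location.
    """
    weights = softmax(scores)
    dim = len(context_features[0])
    return [sum(w * feat[d] for w, feat in zip(weights, context_features))
            for d in range(dim)]
```

With equal scores the result is the plain average of the context; as one score dominates, the attended feature converges to that single location's feature, which is how uninformative context gets suppressed.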


(CVPR’18) A Bi-directional Message Passing Model for Salient Object Detection

Recent progress on salient object detection has benefited from Fully Convolutional Neural Networks (FCNs). The saliency cues contained in multi-level convolutional features are complementary for detecting salient objects. How to integrate multi-level features becomes an open problem in saliency detection. In this paper, we propose a novel bi-directional message passing model to integrate multi-level features for salient object detection. First, we adopt a Multi-scale Context-aware Feature Extraction Module (MCFEM) for multi-level feature maps to capture rich context information. Then a bi-directional structure is designed to pass messages between multi-level features, and a gate function is exploited to control the message passing rate. We use the features after message passing, which simultaneously encode semantic information and spatial details, to predict saliency maps. Finally, the predicted results are efficiently combined to generate the final saliency map. Quantitative and qualitative experiments on five benchmark datasets demonstrate that our proposed model performs favorably against the state-of-the-art methods under different evaluation metrics.

显著目标检测的最新进展得益于全卷积神经网络(FCN)。多级卷积特征中包含的显著性线索对于检测显著性物体是互补的。如何整合多级特征成为显著性检测的一个开放性问题。在本文中,我们提出了一个新颖的双向信息传递模型,以整合多级特征来进行显著目标检测。首先,我们采用多尺度上下文感知特征提取模块(MCFEM),用于多级特征图来捕获丰富的上下文信息。然后,我们设计了一个双向结构,在多级特征之间传递信息,并利用一个门函数来控制信息传递率。我们使用信息传递后的特征,同时编码语义信息和空间细节,来预测显著图。最后,预测的结果被有效地结合起来,生成最终的显著图。在五个基准数据集上进行的定量和定性实验表明,我们提出的模型在不同的评估指标下与SOTA相比表现良好。
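Gated bi-directional message passing can be illustrated with a toy version in which every level's feature is a short vector of the same length and a single scalar gate controls how much of a neighbouring level's message is let through. The additive update, the shared scalar gate, and the concatenation of the two directions are simplifying assumptions for this sketch, not the paper's exact formulation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def pass_message(own, incoming, gate_logit):
    """Add a gated copy of the incoming features to a level's own features."""
    gate = sigmoid(gate_logit)  # in (0, 1): the message passing rate
    return [o + gate * m for o, m in zip(own, incoming)]


def bidirectional_pass(levels, gate_logit):
    """Pass messages shallow-to-deep and deep-to-shallow, then concatenate.

    levels: feature vectors ordered from shallowest to deepest, all assumed
    to have been resized to the same length beforehand.
    """
    downward = [levels[0]]
    for feat in levels[1:]:
        downward.append(pass_message(feat, downward[-1], gate_logit))
    upward = [levels[-1]]
    for feat in reversed(levels[:-1]):
        upward.insert(0, pass_message(feat, upward[0], gate_logit))
    return [d + u for d, u in zip(downward, upward)]  # list concatenation
```

After both passes, every level carries information from shallower layers (spatial detail) and deeper layers (semantics), with the gate limiting how much is transferred at each hop.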


(CVPR’18) Progressive Attention Guided Recurrent Network for Salient Object Detection

Effective convolutional features play an important role in saliency estimation but how to learn powerful features for saliency is still a challenging task. FCN-based methods directly apply multi-level convolutional features without distinction, which leads to sub-optimal results due to the distraction from redundant details. In this paper, we propose a novel attention guided network which selectively integrates multi-level contextual information in a progressive manner. Attentive features generated by our network can alleviate distraction of background thus achieve better performance. On the other hand, it is observed that most of existing algorithms conduct salient object detection by exploiting side-output features of the backbone feature extraction network. However, shallower layers of backbone network lack the ability to obtain global semantic information, which limits the effective feature learning. To address the problem, we introduce multi-path recurrent feedback to enhance our proposed progressive attention driven framework. Through multi-path recurrent connections, global semantic information from the top convolutional layer is transferred to shallower layers, which intrinsically refines the entire network. Experimental results on six benchmark datasets demonstrate that our algorithm performs favorably against the state-of-the-art approaches.

有效的卷积特征在显著性估计中起着重要作用,但如何学习强大的显著性特征仍然是一项具有挑战性的任务。基于FCN的方法不加区分地直接应用多级卷积特征,由于受到冗余细节的干扰,导致了次优的结果。在本文中,我们提出了一种新的注意力引导网络,它以渐进的方式选择性地融合多级上下文信息。由我们的网络产生的注意力特征可以减轻背景的干扰,从而达到更好的性能。另一方面,我们发现大多数现有的算法都是通过利用主干特征提取网络的侧面输出特征来进行显著目标检测。然而,主干网络的较浅层缺乏获得全局语义信息的能力,这限制了有效的特征学习。为了解决这个问题,我们引入了多路径递归反馈来加强我们提出的渐进式注意力驱动框架。通过多路递归连接,来自高层卷积的全局语义信息被转移到较浅的层,这在本质上完善了整个网络。在六个基准数据集上的实验结果表明,我们的算法与SOTA相比表现良好。

(ECCV’18) Contour Knowledge Transfer for Salient Object Detection

In recent years, deep Convolutional Neural Networks (CNNs) have broken all records in salient object detection. However, training such a deep model requires a large amount of manual annotations. Our goal is to overcome this limitation by automatically converting an existing deep contour detection model into a salient object detection model without using any manual salient object masks. For this purpose, we have created a deep network architecture, namely Contour-to-Saliency Network (C2SNet), by grafting a new branch onto a well-trained contour detection network. Therefore, our C2S-Net has two branches for performing two different tasks: (1) predicting contours with the original contour branch, and (2) estimating per-pixel saliency score of each image with the newly added saliency branch. To bridge the gap between these two tasks, we further propose a contour-to-saliency transferring method to automatically generate salient object masks which can be used to train the saliency branch from outputs of the contour branch. Finally, we introduce a novel alternating training pipeline to gradually update the network parameters. In this scheme, the contour branch generates saliency masks for training the saliency branch, while the saliency branch, in turn, feeds back saliency knowledge in the form of saliency-aware contour labels, for fine-tuning the contour branch. The proposed method achieves state-of-the-art performance on five well-known benchmarks, outperforming existing fully supervised methods while also maintaining high efficiency.

近年来,深度卷积神经网络(CNN)已经打破了显著目标检测的所有记录。然而,训练这样一个深度模型需要大量的人工标注。我们的目标是克服这一限制,将现有的深度轮廓检测模型自动转换为显著目标检测模型,而不使用任何人工显著物体mask。为此,我们创建了一个深度网络架构,即轮廓到显著性网络(C2SNet),将一个新的分支嫁接到一个训练好的轮廓检测网络上。因此,我们的C2S-Net有两个分支来执行两个不同的任务:(1)用原来的轮廓分支预测轮廓;(2)用新增加的显著性分支估计每个图像的逐像素显著性分数。为了弥补这两项任务之间的差距,我们进一步提出了一种轮廓到显著性的迁移方法,以自动生成显著性物体mask,这些mask可用于从轮廓分支的输出中训练显著性分支。最后,我们引入了一个新颖的交替训练流水线来逐步更新网络参数。在这个方案中,轮廓分支产生的显著性mask用于训练显著性分支,而显著性分支则以显著性感知的轮廓标签的形式反馈显著性知识,用于微调轮廓分支。所提出的方法在五个著名的基准上取得了最先进的性能,超过了现有的完全监督的方法,同时也保持了高效率。


(ECCV’18) Reverse Attention for Salient Object Detection

Benefiting from the rapid development of deep learning techniques, salient object detection has achieved remarkable progress recently. However, there still exist two major challenges that hinder its application in embedded devices: low resolution output and heavy model weight. To this end, this paper presents an accurate yet compact deep network for efficient salient object detection. More specifically, given a coarse saliency prediction in the deepest layer, we first employ residual learning to learn side-output residual features for saliency refinement, which can be achieved with very limited convolutional parameters while keeping accuracy. Secondly, we further propose reverse attention to guide such side-output residual learning in a top-down manner. By erasing the current predicted salient regions from side-output features, the network can eventually explore the missing object parts and details, which results in high resolution and accuracy. Experiments on six benchmark datasets demonstrate that the proposed approach compares favorably against state-of-the-art methods, with advantages in terms of simplicity, efficiency (45 FPS) and model size (81 MB).

受益于深度学习技术的快速发展,显著目标检测最近取得了可观的进展。然而,仍然存在以下两大挑战,阻碍了其在嵌入式设备中的应用,即低分辨率输出和庞大的模型参数。为此,本文提出了一个准确而紧凑的深度网络,用于高效的显著目标检测。更具体地说,在最深层给定一个粗略的显著性预测,我们首先采用残差学习来学习侧面输出的残差特征,以实现显著性的细化,这可以用非常有限的卷积参数同时保持准确性。其次,我们进一步提出反向注意力,以自上而下的方式指导这种侧输出的残差学习。通过从侧输出特征中抹去当前预测的显著区域,网络最终可以探索缺失的物体部分和细节,从而获得高的分辨率和准确性。在六个基准数据集上的实验表明,所提出的方法与SOTA相比更有优势,并且在简单性、效率(45 FPS)和模型大小(81 MB)方面更优。
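The "erasing" step can be sketched as multiplying side-output features by one minus the current saliency probability, so that already-detected regions are suppressed and the remaining regions stay available for residual learning. The sketch assumes the coarse prediction is given as logits already upsampled to the feature resolution, and handles a single feature channel; names are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def reverse_attention(side_features, coarse_logits):
    """Suppress features where the coarse prediction is already confident.

    side_features and coarse_logits are 2D lists of identical shape.
    """
    attended = []
    for feat_row, logit_row in zip(side_features, coarse_logits):
        attended.append([f * (1.0 - sigmoid(z))
                         for f, z in zip(feat_row, logit_row)])
    return attended
```

Where the logit is strongly positive the weight is close to 0, so the refinement branch is pushed to look at regions the coarse prediction missed.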


(CVPR’19) Attentive Feedback Network for Boundary-Aware Salient Object Detection

Recent deep learning based salient object detection methods achieve gratifying performance built upon Fully Convolutional Neural Networks (FCNs). However, most of them have suffered from the boundary challenge. The state-of-the-art methods employ feature aggregation techniques and can precisely find out where the salient object is, but they often fail to segment out the entire object with fine boundaries, especially those with raised narrow stripes. So there is still a large room for improvement over the FCN based models. In this paper, we design the Attentive Feedback Modules (AFMs) to better explore the structure of objects. A Boundary-Enhanced Loss (BEL) is further employed for learning exquisite boundaries. Our proposed deep model produces satisfying results on the object boundaries and achieves state-of-the-art performance on five widely tested salient object detection benchmarks. The network is in a fully convolutional fashion running at a speed of 26 FPS and does not need any post-processing.

最近基于深度学习的显著目标检测方法在全卷积神经网络(FCN)的基础上取得了令人满意的性能。然而,它们中的大多数都面临边界方面的挑战。最先进的方法采用了特征融合技术,可以精确地找到显著物体的位置,但它们往往不能以精细的边界分割出整个物体,特别是那些带有凸起窄条纹的物体。因此,基于FCN的模型仍有很大的改进空间。在本文中,我们设计了注意力反馈模块(Attentive Feedback Module,AFM)来更好地探索物体的结构。边界增强损失(Boundary-Enhanced Loss,BEL)被进一步用于学习精细的边界。我们提出的深度模型在物体边界上产生了令人满意的结果,并在五个广泛测试的显著目标检测基准上实现了SOTA。该网络以完全卷积的方式运行,速度为26FPS,不需要任何后处理。


(CVPR’19) Salient Object Detection With Pyramid Attention and Salient Edges

This paper presents a new method for detecting salient objects in images using convolutional neural networks (CNNs). The proposed network, named PAGE-Net, offers two key contributions. The first is the exploitation of an essential pyramid attention structure for salient object detection. This enables the network to concentrate more on salient regions while considering multi-scale saliency information. Such a stacked attention design provides a powerful tool to efficiently improve the representation ability of the corresponding network layer with an enlarged receptive field. The second contribution lies in the emphasis on the importance of salient edges. Salient edge information offers a strong cue to better segment salient objects and refine object boundaries. To this end, our model is equipped with a salient edge detection module, which is learned for precise salient boundary estimation. This encourages better edge-preserving salient object segmentation. Exhaustive experiments confirm that the proposed pyramid attention and salient edges are effective for salient object detection. We show that our deep saliency model outperforms state-of-the-art approaches for several benchmarks with a fast processing speed (25fps on one GPU).

本文介绍了一种使用卷积神经网络(CNN)检测图像中显著物体的新方法。所提出的网络,名为PAGE-Net,提供了两个关键的贡献。首先是利用一个基础的金字塔注意力结构来检测显著的物体。这使得网络在考虑多尺度的显著性信息的同时,能够更多地集中在显著性区域。这样的堆叠式注意力设计提供了一个强有力的工具,可以有效地提高相应网络层的表征能力,并扩大了感受野。第二个贡献在于强调了显著边缘的重要性。显著的边缘信息为更好地分割显著物体和完善物体的边界提供了强有力的线索。为此,我们的模型配备了一个显著边缘检测模块,该模块是为精确的显著边界估计而学习的。这鼓励了更好的边缘保留的显著对象分割。详尽的实验证实,所提出的金字塔注意力和显著边缘对显著目标检测是有效的。我们表明,我们的深度显著性模型在几个基准测试中以快速的处理速度(单个GPU上为25fps)胜过SOTA。


(CVPR’19) Pyramid Feature Attention Network for Saliency detection

Saliency detection is one of the basic challenges in computer vision. How to extract effective features is a critical point for saliency detection. Recent methods mainly adopt integrating multi-scale convolutional features indiscriminately. However, not all features are useful for saliency detection and some even cause interferences. To solve this problem, we propose Pyramid Feature Attention network to focus on effective high-level context features and low-level spatial structural features. First, we design Context-aware Pyramid Feature Extraction (CPFE) module for multi-scale high-level feature maps to capture rich context features. Second, we adopt channel-wise attention (CA) after CPFE feature maps and spatial attention (SA) after low-level feature maps, then fuse outputs of CA & SA together. Finally, we propose an edge preservation loss to guide network to learn more detailed information in boundary localization. Extensive evaluations on five benchmark datasets demonstrate that the proposed method outperforms the state-of-the-art approaches under different evaluation metrics.

显著性检测是计算机视觉的基本挑战之一。如何提取有效的特征是显著性检测的一个关键点。最近的方法主要是不加区分地融合多尺度卷积特征。然而,并非所有的特征都对显著性检测有用,有些甚至会造成干扰。为了解决这个问题,我们提出了金字塔特征注意力网络,以关注有效的高层上下文特征和低层空间结构特征。首先,我们设计了上下文感知的金字塔特征提取(CPFE)模块,用于多尺度高层特征图,以捕获丰富的上下文特征。其次,我们在CPFE特征图后采用通道注意力(CA),在低层特征图后采用空间注意力(SA),然后将CA和SA的输出融合在一起。最后,我们提出了一个边缘保留损失,以指导网络在边界定位中学习更多的细节信息。在五个基准数据集上进行的广泛评估表明,所提出的方法在不同的评估指标下优于SOTA。
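
The channel-wise and spatial attention steps described above can be illustrated with a parameter-free toy version. This is only a sketch under stated assumptions: the paper's CA and SA modules use learned layers, whereas the hypothetical helpers `channel_attention` and `spatial_attention` below simply gate features with the sigmoid of an average.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(feats):
    """feats: list of C channels, each an H x W list of lists.

    Assumption: each channel is rescaled by the sigmoid of its global
    average. A real CA module would learn this weighting.
    """
    out = []
    for ch in feats:
        count = sum(len(row) for row in ch)
        mean = sum(sum(row) for row in ch) / count
        weight = sigmoid(mean)
        out.append([[v * weight for v in row] for row in ch])
    return out


def spatial_attention(feats):
    """Rescale every location by the sigmoid of its cross-channel mean."""
    c, h, w = len(feats), len(feats[0]), len(feats[0][0])
    mask = [[sigmoid(sum(feats[k][i][j] for k in range(c)) / c)
             for j in range(w)] for i in range(h)]
    return [[[feats[k][i][j] * mask[i][j] for j in range(w)]
             for i in range(h)] for k in range(c)]
```

Channel gating decides which feature maps matter, while spatial gating decides which locations matter; the abstract applies the former to high-level features and the latter to low-level ones.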


(CVPR’19) BASNet: Boundary-Aware Salient Object Detection

Deep Convolutional Neural Networks have been adopted for salient object detection and achieved the state-of-the-art performance. Most of the previous works however focus on region accuracy but not on the boundary quality. In this paper, we propose a predict-refine architecture, BASNet, and a new hybrid loss for Boundary-Aware Salient object detection. Specifically, the architecture is composed of a densely supervised Encoder-Decoder network and a residual refinement module, which are respectively in charge of saliency prediction and saliency map refinement. The hybrid loss guides the network to learn the transformation between the input image and the ground truth in a three-level hierarchy – pixel-, patch- and map- level – by fusing Binary Cross Entropy (BCE), Structural SIMilarity (SSIM) and Intersection-over-Union (IoU) losses. Equipped with the hybrid loss, the proposed predict-refine architecture is able to effectively segment the salient object regions and accurately predict the fine structures with clear boundaries. Experimental results on six public datasets show that our method outperforms the state-of-the-art methods both in terms of regional and boundary evaluation measures. Our method runs at over 25 fps on a single GPU. The code is available at: https://github.com/NathanUA/BASNet.

深度卷积神经网络已被用于显著目标检测,并取得了SOTA。然而,以前的工作大多集中在区域精度上,而不是在边界质量上。在本文中,我们提出了一个预测-细化架构BASNet和一个新的混合损失,用于边界感知显著目标检测。具体来说,该架构由一个密集监督的编码器-解码器网络和一个残差细化模块组成,它们分别负责显著性预测和显著图的细化。混合损失通过融合二元交叉熵(BCE)、结构相似度(SSIM)和交并比(IoU)损失,指导网络在像素级、块级和图级三个层次上学习输入图像和GT之间的转换。在混合损失的帮助下,所提出的预测-细化架构能够有效地分割显著的物体区域,并准确地预测具有清晰边界的精细结构。在六个公共数据集上的实验结果表明,我们的方法在区域和边界评估指标方面都优于SOTA。我们的方法在单个GPU上的运行速度超过25fps。代码可见:https://github.com/NathanUA/BASNet。
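
To make the hybrid loss concrete, here is a minimal sketch that sums BCE, SSIM and IoU terms over flattened saliency maps. It is not the authors' implementation: the function names are hypothetical, the three terms are weighted equally by assumption, and SSIM is computed over the whole map as a single window rather than over sliding patches.

```python
import math


def bce_loss(pred, gt, eps=1e-7):
    """Pixel-level term: mean binary cross entropy."""
    total = 0.0
    for p, g in zip(pred, gt):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= g * math.log(p) + (1.0 - g) * math.log(1.0 - p)
    return total / len(pred)


def ssim_loss(pred, gt, c1=0.01 ** 2, c2=0.03 ** 2):
    """Structural term: 1 - SSIM, using one global window (a simplification)."""
    n = len(pred)
    mu_p, mu_g = sum(pred) / n, sum(gt) / n
    var_p = sum((p - mu_p) ** 2 for p in pred) / n
    var_g = sum((g - mu_g) ** 2 for g in gt) / n
    cov = sum((p - mu_p) * (g - mu_g) for p, g in zip(pred, gt)) / n
    ssim = ((2 * mu_p * mu_g + c1) * (2 * cov + c2)) / (
        (mu_p ** 2 + mu_g ** 2 + c1) * (var_p + var_g + c2))
    return 1.0 - ssim


def iou_loss(pred, gt, eps=1e-7):
    """Map-level term: 1 - soft intersection over union."""
    inter = sum(p * g for p, g in zip(pred, gt))
    union = sum(p + g - p * g for p, g in zip(pred, gt))
    return 1.0 - inter / (union + eps)


def hybrid_loss(pred, gt):
    # Equal weighting of the three terms is an assumption of this sketch.
    return bce_loss(pred, gt) + ssim_loss(pred, gt) + iou_loss(pred, gt)
```

A prediction that matches the ground truth closely should score lower on all three terms than one that inverts it, which is a quick sanity check for any reimplementation.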


(CVPR’19) Cascaded Partial Decoder for Fast and Accurate Salient Object Detection

Existing state-of-the-art salient object detection networks rely on aggregating multi-level features of pre-trained convolutional neural networks (CNNs). Compared to high-level features, low-level features contribute less to performance but cost more computations because of their larger spatial resolutions. In this paper, we propose a novel Cascaded Partial Decoder (CPD) framework for fast and accurate salient object detection. On the one hand, the framework constructs partial decoder which discards larger resolution features of shallower layers for acceleration. On the other hand, we observe that integrating features of deeper layers obtain relatively precise saliency map. Therefore we directly utilize generated saliency map to refine the features of backbone network. This strategy efficiently suppresses distractors in the features and significantly improves their representation ability. Experiments conducted on five benchmark datasets exhibit that the proposed model not only achieves state-of-the-art performance but also runs much faster than existing models. Besides, the proposed framework is further applied to improve existing multi-level feature aggregation models and significantly improve their efficiency and accuracy.

现有的最先进的显著目标检测网络依赖于融合预训练的卷积神经网络(CNN)的多级特征。与高层特征相比,低层特征对性能的贡献较小,但由于其空间分辨率较大,因此计算成本较高。在本文中,我们提出了一个新颖的级联部分解码器(CPD)框架,用于快速和准确的显著目标检测。一方面,该框架构建了部分解码器,放弃了较浅层的较大分辨率特征,以达到加速的目的。另一方面,我们观察到,整合较深层的特征可以获得相对精确的显著图。因此,我们直接利用生成的显著图来完善主干网络的特征。这一策略有效地抑制了特征中的干扰因素,并极大地提高了其表征能力。在五个基准数据集上进行的实验表明,所提出的模型不仅达到了SOTA,而且比现有的模型运行得更快。此外,所提出的框架还被进一步应用于改进现有的多级特征融合模型,并显著提高其效率和准确性。
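
The idea of using a generated saliency map to refine backbone features can be sketched as a simple attention multiplication. The helpers below are hypothetical and only loosely modeled on the abstract: a 3×3 box blur stands in for whatever smoothing the real model applies, and the element-wise maximum with the original map is an assumption that keeps confident pixels at full strength.

```python
def box_blur(sal):
    """3x3 mean filter; border pixels average over the neighbours that exist."""
    h, w = len(sal), len(sal[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            vals = [sal[y][x]
                    for y in range(max(0, i - 1), min(h, i + 2))
                    for x in range(max(0, j - 1), min(w, j + 2))]
            out[i][j] = sum(vals) / len(vals)
    return out


def refine_features(feature, sal):
    """Weight one feature channel by an enlarged copy of the saliency map."""
    blurred = box_blur(sal)
    attention = [[max(b, s) for b, s in zip(b_row, s_row)]
                 for b_row, s_row in zip(blurred, sal)]
    return [[f * a for f, a in zip(f_row, a_row)]
            for f_row, a_row in zip(feature, attention)]
```

Locations far from any salient pixel are suppressed to zero, which is the "distractor suppression" effect the abstract refers to.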


(CVPR’19) A Simple Pooling-Based Design for Real-Time Salient Object Detection

We solve the problem of salient object detection by investigating how to expand the role of pooling in convolutional neural networks. Based on the U-shape architecture, we first build a global guidance module (GGM) upon the bottom-up pathway, aiming at providing layers at different feature levels the location information of potential salient objects. We further design a feature aggregation module (FAM) to make the coarse-level semantic information well fused with the fine-level features from the top-down pathway. By adding FAMs after the fusion operations in the top-down pathway, coarse-level features from the GGM can be seamlessly merged with features at various scales. These two pooling-based modules allow the high-level semantic features to be progressively refined, yielding detail enriched saliency maps. Experiment results show that our proposed approach can more accurately locate the salient objects with sharpened details and hence substantially improve the performance compared to the previous state-of-the-arts. Our approach is fast as well and can run at a speed of more than 30 FPS when processing a 300×400 image. Code can be found at http://mmcheng.net/poolnet/.

我们通过研究如何扩大卷积神经网络中池化的作用来解决显著目标检测的问题。基于U型结构,我们首先在自下而上的路径上建立了一个全局引导模块(GGM),旨在为不同特征层次的层提供潜在显著对象的位置信息。我们进一步设计了一个特征聚合模块(FAM),使粗略层次的语义信息与自上而下路径的精细层次的特征很好地融合。通过在自上而下途径的融合操作之后添加FAM,来自GGM的粗糙特征可以与各种尺度的特征无缝融合。这两个基于池化的模块允许高层次的语义特征被逐步细化,产生细节丰富的显著图。实验结果表明,我们提出的方法可以更准确地定位具有精细细节的显著对象,因此与以前的SOTA相比,性能得到了大幅提高。我们的方法也很快速,在处理300×400的图像时可以以超过30FPS的速度运行。代码可以在http://mmcheng.net/poolnet/找到。
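
A pooling-based aggregation step of the kind described above might look like the following sketch. It assumes square average pooling at a few rates, nearest-neighbour upsampling, and a plain sum with the input. The convolutions a real FAM would contain are omitted, and the names `avg_pool`, `upsample` and `feature_aggregation` are illustrative only.

```python
def avg_pool(x, k):
    """Non-overlapping k x k average pooling (height and width divisible by k)."""
    h, w = len(x), len(x[0])
    return [[sum(x[i + di][j + dj] for di in range(k) for dj in range(k)) / (k * k)
             for j in range(0, w, k)]
            for i in range(0, h, k)]


def upsample(x, k):
    """Nearest-neighbour upsampling by an integer factor k."""
    return [[x[i // k][j // k] for j in range(len(x[0]) * k)]
            for i in range(len(x) * k)]


def feature_aggregation(x, rates=(2, 4)):
    """Sum the input with pooled-and-upsampled copies of itself."""
    h, w = len(x), len(x[0])
    out = [row[:] for row in x]
    for k in rates:
        branch = upsample(avg_pool(x, k), k)
        for i in range(h):
            for j in range(w):
                out[i][j] += branch[i][j]
    return out
```

Each pooled branch spreads a local response over a wider area, so the summed output mixes fine detail with coarser context.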


(ICCV’19) Stacked Cross Refinement Network for Edge-Aware Salient Object Detection

Salient object detection is a fundamental computer vision task. The majority of existing algorithms focus on aggregating multi-level features of pre-trained convolutional neural networks. Moreover, some researchers attempt to utilize edge information for auxiliary training. However, existing edge-aware models design unidirectional frameworks which only use edge features to improve the segmentation features. Motivated by the logical interrelations between binary segmentation and edge maps, we propose a novel Stacked Cross Refinement Network (SCRN) for salient object detection in this paper. Our framework aims to simultaneously refine multi-level features of salient object detection and edge detection by stacking Cross Refinement Unit (CRU). According to the logical interrelations, the CRU designs two direction-specific integration operations, and bidirectionally passes messages between the two tasks. Incorporating the refined edge-preserving features with the typical U-Net, our model detects salient objects accurately. Extensive experiments conducted on six benchmark datasets demonstrate that our method outperforms existing state-of-the-art algorithms in both accuracy and efficiency. Besides, the attribute-based performance on the SOC dataset show that the proposed model ranks first in the majority of challenging scenes. Code can be found at https://github.com/wuzhe71/SCAN.

显著目标检测是一项基本的计算机视觉任务。现有的大多数算法都集中在融合预先训练好的卷积神经网络的多层次特征上。此外,一些研究人员试图利用边缘信息进行辅助训练。然而,现有的边缘感知模型设计的是单向框架,只利用边缘特征来改进分割特征。受二元分割和边缘图之间逻辑关系的启发,我们在本文中提出了一个新颖的堆叠交叉细化网络(SCRN)用于显著目标检测。我们的框架旨在通过堆叠交叉细化单元(CRU)同时细化显著目标检测和边缘检测的多层次特征。根据逻辑上的相互关系,CRU设计了两个特定方向的融合操作,并在两个任务之间双向传递信息。将细化的边缘保留特征与典型的U-Net相结合,我们的模型能够准确地检测出显著对象。在六个基准数据集上进行的广泛实验表明,我们的方法在准确性和效率方面都优于SOTA。此外,在SOC数据集上的基于属性的表现表明,所提出的模型在大多数具有挑战性的场景中排名第一。代码可以在https://github.com/wuzhe71/SCAN中找到。


(ICCV’19) Selectivity or Invariance: Boundary-aware Salient Object Detection

Typically, a salient object detection (SOD) model faces opposite requirements in processing object interiors and boundaries. The features of interiors should be invariant to strong appearance change so as to pop-out the salient object as a whole, while the features of boundaries should be selective to slight appearance change to distinguish salient objects and background. To address this selectivity-invariance dilemma, we propose a novel boundary-aware network with successive dilation for image-based SOD. In this network, the feature selectivity at boundaries is enhanced by incorporating a boundary localization stream, while the feature invariance at interiors is guaranteed with a complex interior perception stream. Moreover, a transition compensation stream is adopted to amend the probable failures in transitional regions between interiors and boundaries. In particular, an integrated successive dilation module is proposed to enhance the feature invariance at interiors and transitional regions. Extensive experiments on six datasets show that the proposed approach outperforms 16 state-of-the-art methods.

通常情况下,显著目标检测(SOD)模型在处理物体内部和边界时面临相反的要求。内部的特征应该对强烈的外观变化保持不变,以便将显著的物体作为一个整体凸显出来,而边界的特征应该对微小的外观变化具有选择性,以区分显著的对象和背景。为了解决这种选择性-不变性的困境,我们提出了一种用于基于图像的SOD的新型边界感知网络,该网络带有连续膨胀。在这个网络中,通过加入边界定位流来提高边界处的特征选择性,而通过一个复杂的内部感知流来保证内部的特征不变性。此外,还采用了一个过渡补偿流来修正内部和边界之间的过渡区域可能出现的错误。特别是,提出了一个集成的连续膨胀模块,以提高内部和过渡区域的特征不变性。在六个数据集上进行的广泛实验表明,所提出的方法优于16种SOTA方法。


(ICCV’19) EGNet: Edge Guidance Network for Salient Object Detection

Fully convolutional neural networks (FCNs) have shown their advantages in the salient object detection task. However, most existing FCNs-based methods still suffer from coarse object boundaries. In this paper, to solve this problem, we focus on the complementarity between salient edge information and salient object information. Accordingly, we present an edge guidance network (EGNet) for salient object detection with three steps to simultaneously model these two kinds of complementary information in a single network. In the first step, we extract the salient object features by a progressive fusion way. In the second step, we integrate the local edge information and global location information to obtain the salient edge features. Finally, to sufficiently leverage these complementary features, we couple the same salient edge features with salient object features at various resolutions. Benefiting from the rich edge information and location information in salient edge features, the fused features can help locate salient objects, especially their boundaries more accurately. Experimental results demonstrate that the proposed method performs favorably against the state-of-the-art methods on six widely used datasets without any pre-processing and post-processing. The source code is available at http://mmcheng.net/egnet/.

全卷积神经网络(FCN)在显著目标检测任务中显示了其优势。然而,大多数现有的基于FCN的方法仍然受到粗糙的对象边界的影响。在本文中,为了解决这个问题,我们把重点放在显著边缘信息和显著目标信息之间的互补性上。因此,我们提出了一个用于显著目标检测的边缘引导网络(EGNet),通过三个步骤在一个网络中同时建模这两种互补的信息。在第一步中,我们通过渐进式融合的方式提取显著目标特征。第二步,我们融合局部边缘信息和全局位置信息以获得显著边缘特征。最后,为了充分地利用这些互补的特征,我们将相同的显著边缘特征与不同分辨率的显著目标特征相结合。受益于显著边缘特征中丰富的边缘信息和位置信息,融合后的特征有助于更准确地定位显著目标,尤其是它们的边界。实验结果表明,所提出的方法在六个广泛使用的数据集上的表现优于SOTA,不需要任何预处理和后处理。源代码可在http://mmcheng.net/egnet/找到。


(ICCV’19) Employing Deep Part-Object Relationships for Salient Object Detection

Despite Convolutional Neural Networks (CNNs) based methods have been successful in detecting salient objects, their underlying mechanism that decides the salient intensity of each image part separately cannot avoid inconsistency of parts within the same salient object. This would ultimately result in an incomplete shape of the detected salient object. To solve this problem, we dig into part-object relationships and take the unprecedented attempt to employ these relationships endowed by the Capsule Network (CapsNet) for salient object detection. The entire salient object detection system is built directly on a Two-Stream Part-Object Assignment Network (TSPOANet) consisting of three algorithmic steps. In the first step, the learned deep feature maps of the input image are transformed to a group of primary capsules. In the second step, we feed the primary capsules into two identical streams, within each of which low-level capsules (parts) will be assigned to their familiar high-level capsules (object) via a locally connected routing. In the final step, the two streams are integrated in the form of a fully connected layer, where the relevant parts can be clustered together to form a complete salient object. Experimental results demonstrate the superiority of the proposed salient object detection network over the state-of-the-art methods.

尽管基于卷积神经网络(CNN)的方法在检测显著目标方面取得了成功,但其分别决定每个图像部分显著程度的基本机制无法避免同一显著目标中各部分的不一致情况。这最终会导致检测到的显著对象的形状不完整。为了解决这个问题,我们挖掘了部分与对象之间的关系,并史无前例地尝试采用胶囊网络(CapsNet)所赋予的这些关系进行显著目标检测。整个显著目标检测系统直接建立在双流部分对象分配网络(TSPOANet)上,包括三个算法步骤。在第一步中,学习到的输入图像的深度特征图被转换为一组初级胶囊。在第二步中,我们将初级胶囊送入两个相同的流中,在每个流中,低级胶囊(部分)将通过局部连接的路由分配给它们熟悉的高级胶囊(对象)。在最后一步,这两个流以全连接层的形式被融合,其中相关的部分可以被聚集在一起,形成一个完整的显著对象。实验结果表明,所提出的显著目标检测网络比SOTA更有优势。


(CVPR’20) Interactive Two-Stream Decoder for Accurate and Fast Saliency Detection

Recently, contour information largely improves the performance of saliency detection. However, the discussion on the correlation between saliency and contour remains scarce. In this paper, we first analyze such correlation and then propose an interactive two-stream decoder to explore multiple cues, including saliency, contour and their correlation. Specifically, our decoder consists of two branches, a saliency branch and a contour branch. Each branch is assigned to learn distinctive features for predicting the corresponding map. Meanwhile, the intermediate connections are forced to learn the correlation by interactively transmitting the features from each branch to the other one. In addition, we develop an adaptive contour loss to automatically discriminate hard examples during learning process. Extensive experiments on six benchmarks well demonstrate that our network achieves competitive performance with a fast speed around 50 FPS. Moreover, our VGG-based model only contains 17.08 million parameters, which is significantly smaller than other VGG-based approaches. Code has been made available at: https://github.com/moothes/ITSD-pytorch.

最近,轮廓信息在很大程度上提高了显著性检测的性能。然而,关于显著性和轮廓之间的相关性的讨论仍然很少。在本文中,我们首先分析了这种相关性,然后提出了一个交互式双流解码器来探索多种线索,包括显著性、轮廓和它们的相关性。具体来说,我们的解码器由两个分支组成,一个是显著性分支,一个是轮廓分支。每个分支都被指定学习独特的特征来预测相应的图。同时,中间连接通过交互式地将每个分支的特征传递给另一个分支,被迫学习二者之间的相关性。此外,我们还开发了一个自适应的轮廓损失,以便在学习过程中自动分辨出困难的样本。在六个基准上进行的广泛实验表明,我们的网络以50FPS左右的速度实现了有竞争力的性能。此外,我们基于VGG的模型只包含1708万个参数,这比其他基于VGG的方法小得多。代码可在https://github.com/moothes/ITSD-pytorch找到。


(CVPR’20) Multi-scale Interactive Network for Salient Object Detection

Deep-learning based salient object detection methods achieve great progress. However, the variable scale and unknown category of salient objects are great challenges all the time. These are closely related to the utilization of multi-level and multi-scale features. In this paper, we propose the aggregate interaction modules to integrate the features from adjacent levels, in which less noise is introduced because of only using small up-/down-sampling rates. To obtain more efficient multi-scale features from the integrated features, the self-interaction modules are embedded in each decoder unit. Besides, the class imbalance issue caused by the scale variation weakens the effect of the binary cross entropy loss and results in the spatial inconsistency of the predictions. Therefore, we exploit the consistency-enhanced loss to highlight the fore-/back-ground difference and preserve the intra-class consistency. Experimental results on five benchmark datasets demonstrate that the proposed method without any post-processing performs favorably against 23 state-of-the-art approaches. The source code will be publicly available at https://github.com/lartpang/MINet.

基于深度学习的显著目标检测方法取得了很大进展。然而,显著目标的可变尺度和未知类别一直是巨大的挑战。这些都与多层次、多尺度特征的利用密切相关。在本文中,我们提出了融合交互模块来整合相邻层次的特征,由于只使用小的上/下采样率,所以引入的噪声较少。为了从融合的特征中获得更有效的多尺度特征,自交互模块被嵌入到每个解码器单元中。此外,由尺度变化引起的类不平衡问题削弱了二元交叉熵损失的效果,导致预测的空间不一致。因此,我们利用一致性增强的损失来突出前/后景的差异并保持类内的一致性。在五个基准数据集上的实验结果表明,所提出的方法在没有任何后处理的情况下,与23种SOTA相比表现良好。源代码将在以下网址公开:https://github.com/lartpang/MINet。
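
One way to read the consistency-enhanced loss is as a region-level ratio of mismatched mass to total mass, which, unlike per-pixel BCE, does not depend on how many background pixels the image contains. The sketch below follows that reading. The exact formula is an assumption here rather than something the abstract states, and `consistency_enhanced_loss` is an illustrative name.

```python
def consistency_enhanced_loss(pred, gt, eps=1e-6):
    """pred, gt: flattened saliency maps with values in [0, 1].

    Assumed form: (false positives + false negatives) divided by
    (sum of prediction + sum of ground truth).
    """
    inter = sum(p * g for p, g in zip(pred, gt))
    mismatch = (sum(pred) - inter) + (sum(gt) - inter)
    total = sum(pred) + sum(gt)
    return mismatch / (total + eps)
```

The value is 0 for a perfect prediction and approaches 1 when the prediction and the ground truth do not overlap at all.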


(CVPR’20) Label Decoupling Framework for Salient Object Detection

To get more accurate saliency maps, recent methods mainly focus on aggregating multi-level features from fully convolutional network (FCN) and introducing edge information as auxiliary supervision. Though remarkable progress has been achieved, we observe that the closer the pixel is to the edge, the more difficult it is to be predicted, because edge pixels have a very imbalance distribution. To address this problem, we propose a label decoupling framework (LDF) which consists of a label decoupling (LD) procedure and a feature interaction network (FIN). LD explicitly decomposes the original saliency map into body map and detail map, where body map concentrates on center areas of objects and detail map focuses on regions around edges. Detail map works better because it involves much more pixels than traditional edge supervision. Different from saliency map, body map discards edge pixels and only pays attention to center areas. This successfully avoids the distraction from edge pixels during training. Therefore, we employ two branches in FIN to deal with body map and detail map respectively. Feature interaction (FI) is designed to fuse the two complementary branches to predict the saliency map, which is then used to refine the two branches again. This iterative refinement is helpful for learning better representations and more precise saliency maps. Comprehensive experiments on six benchmark datasets demonstrate that LDF outperforms state-of-the-art approaches on different evaluation metrics.

为了得到更准确的显著图,最近的方法主要集中在从全卷积网络(FCN)中融合多级特征,并引入边缘信息作为辅助监督。虽然已经取得了显著的进展,但我们观察到,越是靠近边缘的像素,越是难以被预测,因为边缘像素的分布非常不平衡。为了解决这个问题,我们提出了一个标签解耦框架(LDF),它由一个标签解耦(LD)过程和一个特征交互网络(FIN)组成。标签解耦明确地将原始显著图分解为主体图和细节图,其中主体图集中在物体的中心区域,细节图集中在边缘周围的区域。细节图的效果更好,因为它涉及的像素比传统的边缘监督多得多。与显著图不同的是,主体图抛弃了边缘像素,只关注中心区域。这成功地避免了训练过程中边缘像素的干扰。因此,我们在FIN中采用了两个分支,分别处理主体图和细节图。特征交互(FI)的设计是为了融合这两个互补的分支来预测显著图,然后该显著图又被用来再次细化这两个分支。这种迭代式的细化有助于学习更好的表征和更精确的显著图。在六个基准数据集上进行的综合实验表明,LDF在不同的评价指标上优于SOTA。
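
The label decoupling step can be illustrated with a brute-force distance transform. This is a sketch under assumptions: the body map is taken to be the normalized distance of each foreground pixel to the nearest background pixel, and the detail map is the remainder of the original label. The name `decouple_label` is hypothetical, and the quadratic-time search is only suitable for tiny examples.

```python
import math


def decouple_label(mask):
    """mask: H x W list of lists with values 0 (background) or 1 (salient).

    Returns (body, detail) such that body + detail == mask at every pixel.
    """
    h, w = len(mask), len(mask[0])
    background = [(i, j) for i in range(h) for j in range(w) if mask[i][j] == 0]
    dist = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            if mask[i][j] == 1 and background:
                # Euclidean distance to the nearest background pixel
                dist[i][j] = min(math.hypot(i - y, j - x) for y, x in background)
    peak = max(max(row) for row in dist)
    body = [[d / peak if peak > 0 else 0.0 for d in row] for row in dist]
    detail = [[mask[i][j] - body[i][j] for j in range(w)] for i in range(h)]
    return body, detail
```

Pixels deep inside an object receive body values near 1, while pixels next to the boundary keep most of their weight in the detail map, matching the description that the body map focuses on centers and the detail map on regions around edges.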


(ECCV’20) Suppress and Balance: A Simple Gated Network for Salient Object Detection

Most salient object detection approaches use U-Net or feature pyramid networks (FPN) as their basic structures. These methods ignore two key problems when the encoder exchanges information with the decoder: one is the lack of interference control between them, the other is without considering the disparity of the contributions of different encoder blocks. In this work, we propose a simple gated network (GateNet) to solve both issues at once. With the help of multilevel gate units, the valuable context information from the encoder can be optimally transmitted to the decoder. We design a novel gated dual branch structure to build the cooperation among different levels of features and improve the discriminability of the whole network. Through the dual branch design, more details of the saliency map can be further restored. In addition, we adopt the atrous spatial pyramid pooling based on the proposed “Fold” operation (Fold-ASPP) to accurately localize salient objects of various scales. Extensive experiments on five challenging datasets demonstrate that the proposed model performs favorably against most state-of-the-art methods under different evaluation metrics.

大多数显著目标检测方法使用U-Net或特征金字塔网络(FPN)作为其基本结构。这些方法忽略了编码器与解码器交换信息时的两个关键问题:一个是它们之间缺乏干扰控制,另一个是没有考虑不同编码器块贡献的差异。在这项工作中,我们提出一个简单的门控网络(GateNet)来同时解决这两个问题。在多级门控单元的帮助下,来自编码器的有价值的上下文信息可以最优化地传输给解码器。我们设计了一个新颖的门控双分支结构,以建立不同层次特征之间的合作,提高整个网络的可辨别性。通过双分支的设计,可以进一步恢复显著图的更多细节。此外,我们还采用了基于所提出的"折叠"操作的空洞空间金字塔池化(Fold-ASPP),以准确定位不同尺度的显著对象。在五个具有挑战性的数据集上进行的广泛实验表明,所提出的模型在不同的评估指标下与大多数SOTA相比表现良好。
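
A gate unit of the kind described above can be sketched as a sigmoid computed from both the encoder and decoder features, then used to scale what the encoder passes on. The scalar weights `w_enc`, `w_dec` and `bias` are hypothetical stand-ins for the learned convolution a real gate would use.

```python
import math


def gate_unit(enc, dec, w_enc=1.0, w_dec=1.0, bias=0.0):
    """enc, dec: flattened encoder and decoder features of equal length.

    Returns the encoder features scaled by a gate value in (0, 1).
    """
    gated = []
    for e, d in zip(enc, dec):
        gate = 1.0 / (1.0 + math.exp(-(w_enc * e + w_dec * d + bias)))
        gated.append(e * gate)
    return gated
```

A gate near 0 suppresses an encoder block's contribution and a gate near 1 passes it through, which is how such a unit can both control interference and balance the contributions of different encoder blocks.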


(ECCV’20) Highly Efficient Salient Object Detection with 100K Parameters

Salient object detection models often demand a considerable amount of computation cost to make precise prediction for each pixel, making them hardly applicable on low-power devices. In this paper, we aim to relieve the contradiction between computation cost and model performance by improving the network efficiency to a higher degree. We propose a flexible convolutional module, namely generalized OctConv (gOctConv), to efficiently utilize both in-stage and cross-stages multi-scale features, while reducing the representation redundancy by a novel dynamic weight decay scheme. The effective dynamic weight decay scheme stably boosts the sparsity of parameters during training, supports learnable number of channels for each scale in gOctConv, allowing 80% of parameters reduce with negligible performance drop. Utilizing gOctConv, we build an extremely light-weighted model, namely CSNet, which achieves comparable performance with about 0.2% parameters (100k) of large models on popular salient object detection benchmarks.

显著目标检测模型通常需要相当多的计算成本来对每个像素进行精确预测,这使得它们很难在低功耗设备上适用。在本文中,我们旨在通过更大程度地提高网络效率来缓解计算成本和模型性能之间的矛盾。我们提出了一个灵活的卷积模块,即广义的OctConv(gOctConv),以有效地利用阶段内和跨阶段的多尺度特征,同时通过一个新的动态权重衰减方案减少表征的冗余。有效的动态权重衰减方案在训练过程中稳定地提高了参数的稀疏性,支持gOctConv中每个尺度的通道数可学习,使80%的参数得以削减而性能下降可以忽略不计。利用gOctConv,我们建立了一个极其轻量的模型,即CSNet,它在流行的显著目标检测基准上以大约0.2%的参数(100k)实现了与大型模型相当的性能。
