Ming-Ming Cheng1,4 Niloy J. Mitra2 Xiaolei Huang3 Philip H. S. Torr4 Shi-Min Hu1
1TNList, Tsinghua University 2UCL/KAUST 3Lehigh University 4Oxford Brookes University
Figure. Given input images (top), a global contrast analysis is used to compute high resolution saliency maps (middle), which can be used to produce masks (bottom) around regions of interest.
Automatic estimation of salient object regions across images, without any prior assumption or knowledge of the contents of the corresponding scenes, enhances many computer vision and computer graphics applications. We introduce a regional contrast based salient object extraction algorithm, which simultaneously evaluates global contrast differences and spatial weighted coherence scores. The proposed algorithm is simple, efficient, naturally multi-scale, and produces full-resolution, high-quality saliency maps. These saliency maps are further used to initialize a novel iterative version of GrabCut for high quality salient object segmentation. We extensively evaluated our algorithm using traditional salient object detection datasets, as well as a more challenging Internet image dataset. Our experimental results demonstrate that our algorithm consistently outperforms existing salient object detection and segmentation methods, yielding higher precision and better recall rates. We also show that our algorithm can be used to efficiently extract salient object masks from Internet images, enabling effective sketch-based image retrieval (SBIR) via simple shape comparisons. Despite such noisy internet images, where the saliency regions are ambiguous, our saliency guided image retrieval achieves a superior retrieval rate compared with state-of-the-art SBIR methods, and additionally provides important target object region information.
Figure. Statistical comparison results of (a) different saliency region detection methods, (b) their variants, and (c) object of interest region segmentation methods, using (i) the largest publicly available dataset and (ii) our THUS10000 dataset (to be made publicly available). We compare our HC and RC methods with 15 state-of-the-art methods, including FT [1], AIM [2], MSS [3], SEG [4], SeR [5], SUN [6], SWD [7], IM [8], IT [9], GB [10], SR [11], CA [12], LC [13], AC [14], and CB [15]. We also take a simple variable-size Gaussian model 'Gau' and the GrabCut method as baselines. (Please see our paper for detailed explanations.)
Figure. Comparison of average Fβ for different saliency segmentation methods: FT [1], SEG [4], and ours, on the THUR15000 dataset, which is composed of non-selected Internet images.
Table. Average time taken to compute a saliency map for images in the THUS10000 database. (Note that we use the authors' original implementations for MSS and FT, which are not well optimized.)
Table. Comparison of average time for different saliency segmentation methods.
Figure. Saliency maps computed by different state-of-the-art methods (b-p), and with our proposed HC (q) and RC (r) methods. Most results highlight edges, or are of low resolution. See also the shared data for saliency detection results for the whole THUS10000 dataset.
Figure. Sketch based image comparison. In each group, from left to right: the first column shows images downloaded from Flickr using the corresponding keyword; the second column shows our retrieval results, obtained by comparing the user input sketch with the SaliencyCut result using the shape context measure [41]; the third column shows the corresponding sketch based retrieval results using SHoG [42].
1. Data
The THUS10000 benchmark dataset comprises 10,000 images (181 MB), each of which has an unambiguous salient object whose region is accurately annotated with pixel-wise ground-truth labeling (13.1M). We provide saliency maps (5.3 GB, containing 170,000 images) for our methods as well as for 15 other state-of-the-art methods, including FT [1], AIM [2], MSS [3], SEG [4], SeR [5], SUN [6], SWD [7], IM [8], IT [9], GB [10], SR [11], CA [12], LC [13], AC [14], and CB [15]. Saliency segmentation results (71.3 MB) for FT [1], SEG [4], and CB [15] are also available.
2. Windows executable
We supply a Windows MSI installer for our prototype software, which includes our implementations of FT [1], SR [11], and LC [13], as well as our HC, RC, and SaliencyCut methods.
3. C++ source code
C++ implementations of our method as well as of several other state-of-the-art methods.
4. Supplemental material
Supplemental materials (647 MB), including comparisons with 15 other state-of-the-art algorithms, are now available.
To date, more than 1000 readers (according to email records) have requested the source code for this project. Some of them have questions about using the code. Here are some frequently asked questions for new users to refer to:
Q1: I’m confused by this sentence in the paper: “In our experiments, the threshold is chosen empirically to be the threshold that gives 95% recall rate in our fixed thresholding experiments”. In most cases, people do not have the ground truth, so they cannot compute the recall rate. When I use your Cut application, I have to guess a threshold value to get a good cut image.
A: The recall rate is only used to evaluate the algorithm. When you apply the algorithm, you typically do not need to evaluate it very often. The sentence explains what the fixed threshold we use typically means. In practice, when initialized using RC saliency maps, this threshold is 70, with saliency values normalized to [0, 255]. This does not mean that the threshold corresponds to a recall rate of 95% for every image; rather, it empirically corresponds to a recall rate of 95% over a large number of images. So simply using the suggested threshold of 70 is fine.
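As an illustration of the fixed thresholding described in this answer, the following minimal Python sketch binarizes a saliency map whose values are normalized to [0, 255], using the suggested threshold of 70. The function name `binarize_saliency`, the nested-list image representation, and the choice to treat values equal to the threshold as foreground are assumptions made for this example; the code is not taken from the released C++ implementation.

```python
from typing import List

# Threshold suggested above for RC saliency maps normalized to [0, 255].
SUGGESTED_THRESHOLD = 70


def binarize_saliency(saliency: List[List[int]],
                      threshold: int = SUGGESTED_THRESHOLD) -> List[List[int]]:
    """Return a binary mask: 1 where saliency >= threshold, else 0.

    Counting values equal to the threshold as foreground is an assumption.
    """
    return [[1 if value >= threshold else 0 for value in row]
            for row in saliency]


# Toy 3x4 saliency map with hypothetical values (not real algorithm output).
toy_map = [
    [10, 40, 69, 70],
    [0, 120, 200, 255],
    [5, 71, 30, 90],
]
mask = binarize_saliency(toy_map)
for row in mask:
    print(row)
```

A mask of this kind would only serve as the starting point for the iterative GrabCut-based segmentation; it is not the final cut.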
Q2: I used your code to get results on the same database you used, but the results differ slightly from yours.
A: It seems that the cvtColor function in OpenCV 1.x behaves differently from the one in OpenCV 2.x. I suggest using a recent version. In addition, the segmentation method I use sometimes generates strange results, which leads to strange saliency maps. This happens rarely. When it does, I rerun the exe and the results become OK. I don't know why, but it does happen when I run the exe for the first time after compiling (very strange, maybe because of some default initializations). If you find the bug, please report it to me.
Q3: Does your algorithm only get good results for images with a single salient object?
A: Mostly yes. As described in our paper, our method is suitable for images with an unambiguous salient object. Saliency detection methods typically have no prior knowledge about the target object, which makes the problem very difficult. Much recent research focuses on images with a single salient object. Even in this simple case, state-of-the-art algorithms may fail. This is understandable, since supervised object detection, which uses a large amount of training data and prior knowledge, also fails in many cases.
However, the value of saliency detection methods lies in their applications in many fields. Because they do not need large-scale human annotation for learning and are typically much faster than object detection methods, it is possible to automatically process a large number of images at low cost. Although many of the saliency detection results may be wrong (up to 60% for noisy Internet images) because salient objects are ambiguous or even missing, we can still use efficient algorithms to select the good results and use them in many interesting applications like (Note: all of the following projects use our saliency source code, with an initial version of SaliencyCut used in our own Sketch2Photo project):
Q4: I'm confused about the definition of saliency. Why are the annotation formats (isolated points, binary mask regions, and bounding boxes) in different benchmarks for evaluating saliency detection methods so different?
A: There are three different saliency detection directions: i) fixation prediction, ii) salient object detection, and iii) objectness estimation. They have very different research targets and very different applications. Personally, I’m mainly interested in the last two problems and will discuss them in a bit more detail.
Eye fixation models aim at predicting where humans look, i.e. a small set of fixation points. The most famous method in this area is Itti’s work in PAMI 1998. The MIT benchmark is designed for evaluating such methods.
Salient object detection, as done in this work, aims at finding the most salient object in a scene and segmenting the whole extent of that object. The output is typically a single saliency map (or figure-ground segmentation). The advantages and disadvantages are described in detail in Q3. High precision is a major focus of our work, as we can use shape matching based techniques to effectively select good segmentations and build robust applications on top of them. The most widely used benchmark for evaluating this problem is MSRA1000, which provides precise segmentations of 1000 salient objects in MSRA images. Our method achieves 93% precision and 90% recall on MSRA1000 (previous best reported results: 75% precision and 83% recall). Since our results on MSRA1000 are mostly comparable to the ground truth annotations, we need more challenging benchmarks. THUS10000 and THUR15000 are built for this purpose.
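Precision and recall figures of the kind quoted for MSRA1000 come from comparing a binary segmentation against a pixel-wise ground truth mask. The sketch below shows one way to compute them, together with a weighted F-measure. The function names, the flattened 0/1 mask representation, and the default weighting β² = 0.3 are assumptions for this example (this page does not state the value of β), so it should be read as an illustration rather than as the evaluation code used for the reported numbers.

```python
from typing import Sequence, Tuple


def precision_recall(pred: Sequence[int],
                     truth: Sequence[int]) -> Tuple[float, float]:
    """Pixel-wise precision and recall for two flattened binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of pixels")
    true_pos = sum(1 for p, t in zip(pred, truth) if p and t)
    pred_pos = sum(1 for p in pred if p)
    truth_pos = sum(1 for t in truth if t)
    # Return 0.0 when a ratio is undefined (empty prediction or ground truth).
    precision = true_pos / pred_pos if pred_pos else 0.0
    recall = true_pos / truth_pos if truth_pos else 0.0
    return precision, recall


def f_measure(precision: float, recall: float, beta_sq: float = 0.3) -> float:
    """Weighted harmonic mean of precision and recall.

    beta_sq = 0.3 is an assumed default that weights precision more heavily.
    """
    denom = beta_sq * precision + recall
    return (1.0 + beta_sq) * precision * recall / denom if denom else 0.0


# Toy masks with hypothetical values: 3 true positives, 4 predicted, 5 in truth.
pred_mask = [1, 1, 1, 0, 0, 0, 1, 0]
truth_mask = [1, 1, 0, 0, 0, 1, 1, 1]
precision, recall = precision_recall(pred_mask, truth_mask)
f_beta = f_measure(precision, recall)
print(precision, recall, f_beta)
```

Averaging these per-image values over a dataset gives benchmark-level numbers; whether to average per image or to pool pixels across images is a further choice that this sketch leaves open.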
Objectness estimation is another attractive direction. These methods aim at proposing a small set (typically 1000) of bounding boxes to improve the efficiency of the classical sliding window pipeline. High recall with a small set of bounding box proposals is the major target. PASCAL VOC is a standard dataset for evaluating this problem. Using purely bottom-up, data-driven methods to produce a single saliency map, as is done in most salient object detection models, is less likely to succeed on this very challenging dataset. State-of-the-art objectness proposal methods (PAMI12, IJCV13) achieve 90+% recall on the challenging PASCAL VOC dataset given a relatively small number (e.g. 1000) of bounding boxes, while being computationally efficient (4 seconds per image). This is especially useful for speeding up multi-class object detection, as each classifier only needs to examine a much smaller number of image windows (e.g. 1,000,000 -> 1,000).
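Unlike the pixel-wise recall used for salient object segmentation, the recall targeted by objectness methods is measured over bounding boxes. A minimal sketch of such a measure follows, assuming that a ground truth box counts as found when some proposal overlaps it with an intersection over union of at least 0.5 (the overlap criterion commonly associated with PASCAL VOC; the exact protocol of the cited methods is not given on this page). The box format `(x_min, y_min, x_max, y_max)` and the function names are hypothetical.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def proposal_recall(truth_boxes: Sequence[Box],
                    proposals: Sequence[Box],
                    min_iou: float = 0.5) -> float:
    """Fraction of ground truth boxes matched by at least one proposal."""
    if not truth_boxes:
        return 0.0
    found = sum(1 for t in truth_boxes
                if any(iou(t, p) >= min_iou for p in proposals))
    return found / len(truth_boxes)


# Toy example with hypothetical boxes: only the first object is covered.
truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
candidates = [(1, 1, 10, 10), (50, 50, 60, 60)]
toy_recall = proposal_recall(truth, candidates)
print(toy_recall)
```

With the figures quoted above, a classifier that scores 1,000 proposals instead of 1,000,000 sliding windows examines 1000 times fewer windows per image.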
FT | [1] R. Achanta, S. Hemami, F. Estrada, and S. Süsstrunk, “Frequency-tuned salient region detection,” in IEEE CVPR, 2009, pp. 1597–1604. |
AIM | [2] N. Bruce and J. Tsotsos, “Saliency, attention, and visual search: An information theoretic approach,” Journal of Vision, vol. 9, no. 3, pp. 5:1–24, 2009. |
MSS | [3] R. Achanta and S. Süsstrunk, “Saliency detection using maximum symmetric surround,” in IEEE ICIP, 2010, pp. 2653–2656. |
SEG | [4] E. Rahtu, J. Kannala, M. Salo, and J. Heikkila, “Segmenting salient objects from images and videos,” in ECCV, 2010, pp. 366–379. |
SeR | [5] H. Seo and P. Milanfar, “Static and space-time visual saliency detection by self-resemblance,” Journal of Vision, vol. 9, no. 12, pp. 15:1–27, 2009. |
SUN | [6] L. Zhang, M. Tong, T. Marks, H. Shan, and G. Cottrell, “SUN: A bayesian framework for saliency using natural statistics,” Journal of Vision, vol. 8, no. 7, pp. 32:1–20, 2008. |
SWD | [7] L. Duan, C. Wu, J. Miao, L. Qing, and Y. Fu, “Visual saliency detection by spatially weighted dissimilarity,” in IEEE CVPR, 2011, pp. 473–480. |
IM | [8] N. Murray, M. Vanrell, X. Otazu, and C. A. Parraga, “Saliency estimation using a non-parametric low-level vision model,” in IEEE CVPR, 2011, pp. 433–440. |
IT | [9] L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visual attention for rapid scene analysis,” IEEE TPAMI, vol. 20, no. 11, pp. 1254–1259, 1998. |
GB | [10] J. Harel, C. Koch, and P. Perona, “Graph-based visual saliency,” in NIPS, 2007, pp. 545–552. |
SR | [11] X. Hou and L. Zhang, “Saliency detection: A spectral residual approach,” in IEEE CVPR, 2007, pp. 1–8. |
CA | [12] S. Goferman, L. Zelnik-Manor, and A. Tal, “Context-aware saliency detection,” in IEEE CVPR, 2010, pp. 2376–2383. |
LC | [13] Y. Zhai and M. Shah, “Visual attention detection in video sequences using spatiotemporal cues,” in ACM Multimedia, 2006, pp. 815–824. |
AC | [14] R. Achanta, F. Estrada, P. Wils, and S. Süsstrunk, “Salient region detection and segmentation,” in IEEE ICVS, 2008, pp. 66–75. |
CB | [15] H. Jiang, J. Wang, Z. Yuan, T. Liu, N. Zheng, and S. Li, “Automatic salient object segmentation based on context and shape prior,” in British Machine Vision Conference, 2011, pp. 1–12. |
LP | [16] T. Judd, K. Ehinger, F. Durand, and A. Torralba, “Learning to predict where humans look,” in ICCV, 2009. |