Department of Information Management, National Central University, Jhongli 32001, Taiwan
Received 26 August 2012; Accepted 19 September 2012
Academic Editors: F. Camastra, J. A. Hernandez, P. Kokol, J. Wang, and S. Zhu
Copyright © 2012 Chih-Fong Tsai. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Content-based image retrieval (CBIR) systems require users to query images by their low-level visual content; this not only makes it hard for users to formulate queries, but can also lead to unsatisfactory retrieval results. To this end, image annotation was proposed. The aim of image annotation is to automatically assign keywords to images, so that image retrieval users are able to query images by keywords. Image annotation can be regarded as an image classification problem: images are represented by some low-level features, and supervised learning techniques are used to learn the mapping between low-level features and high-level concepts (i.e., class labels). One of the most widely used feature representation methods is bag-of-words (BoW). This paper reviews related works based on the issues of improving and/or applying BoW for image annotation. Moreover, many recent works (from 2006 to 2012) are compared in terms of the methodology of BoW feature generation and experimental design. In addition, several different issues in using BoW are discussed, and some important issues for future research are identified.
Advances in computer and multimedia technologies allow for the production of digital images and large repositories for image storage with little cost. This has led to the rapid increase in the size of image collections, including digital libraries, medical imaging, art and museum, journalism, advertising and home photo archives, and so forth. As a result, it is necessary to design image retrieval systems which can operate on a large scale. The main goal is to create, manage, and query image databases in an efficient and effective (that is, accurate) manner.
Content-based image retrieval (CBIR), which was proposed in the early 1990s, is a technique to automatically index images by extracting their (low-level) visual features, such as color, texture, and shape, and the retrieval of images is based solely upon the indexed image features [1–3]. Therefore, it is hypothesized that relevant images can be retrieved by calculating the similarity between the low-level image contents through browsing, navigation, query-by-example, and so forth. Typically, images are represented as points in a high dimensional feature space. Then, a metric is used to measure similarity or dissimilarity between images on this space. Thus, images close to the query are similar to the query and retrieved. Although CBIR introduced automated image feature extraction and indexation, it does not overcome the so-called semantic gap described below.
The semantic gap is the gap between the low-level features extracted and indexed by computers and the high-level concepts (or semantics) of users' queries. That is, the automated CBIR systems cannot be readily matched to the users' requests. The notion of similarity in the user's mind is typically based on high-level abstractions, such as activities, entities/objects, events, or some evoked emotions, among others. Therefore, retrieval by similarity using low-level features like color or shape will not be very effective. In other words, human similarity judgments do not obey the requirements of the similarity metric used in CBIR systems. In addition, general users usually find it difficult to search or query images by using color, texture, and/or shape features directly. They usually prefer textual or keyword-based queries, since they are easier and more intuitive for representing their information needs [4–6]. However, it is very challenging to make computers capable of understanding or extracting high-level concepts from images as humans do.
Consequently, the semantic gap problem has been approached by automatic image annotation. In automatic image annotation, computers are able to learn which low-level features correspond to which high-level concepts. Specifically, the aim of image annotation is to make the computers extract meanings from the low-level features by a learning process based on a given set of training data which includes pairs of low-level features and their corresponding concepts. Then, the computers can assign the learned keywords to images automatically. For the review of image annotation, please refer to Tsai and Hung [7], Hanbury [8], and Zhang et al. [9].
Image annotation can be defined as the process of automatically assigning keywords to images. It can be regarded as an automatic classification of images by labeling images into one of a number of predefined classes or categories, where classes have assigned keywords or labels which can describe the conceptual content of images in that class. Therefore, the image annotation problem can be thought of as image classification or categorization.
More specifically, image classification can be divided into object categorization [10] and scene classification. For example, object categorization focuses on classifying images into “concrete” categories, such as “agate”, “car”, “dog”, and so on. On the other hand, scene classification can be regarded as abstract keyword based image annotation [11, 12], where scene categories are such as “harbor”, “building”, and “sunset”, which can be regarded as an assemblage of multiple physical or entity objects as a single entity. The difference between object recognition/categorization and scene classification was defined by Quelhas et al. [13].
However, image annotation performance is heavily dependent on image feature representation. Recently, the bag-of-words (BoW) or bag-of-visual-words model, a well-known and popular feature representation method for document representation in information retrieval, was first applied to the field of image and video retrieval by Sivic and Zisserman [14]. Moreover, BoW has generally shown promising performance for image annotation and retrieval tasks [15–22].
The BoW feature is usually based on tokenizing keypoint-based features, for example, scale-invariant feature transform (SIFT) [23], to generate a visual-word vocabulary (or codebook). Then, the visual-word vector of an image records the occurrence of each visual word in the image, either as simple presence or absence or, more commonly, as the number of keypoints assigned to the corresponding cluster, that is, visual word.
Since 2003, BoW has been used extensively in image annotation, but there has not as yet been any comprehensive review of this topic. Therefore, the aim of this paper is to review the work of using BoW for image annotation from 2006 to 2012.
The rest of this paper is organized as follows. Section 2 describes the process of extracting the BoW feature for image representation and annotation. Section 3 discusses some important extension studies of BoW, including the improvement of BoW per se and its application to other related research problems. Section 4 provides some comparisons of related work in terms of the methodology of constructing the BoW feature (including the detection method, the clustering algorithm, the number of visual words, and so forth) and the experimental setup (including the datasets used, the number of object or scene categories, and so forth). Finally, Section 5 concludes the paper.
The bag-of-words (BoW) methodology was first proposed in the text retrieval domain for text document analysis, and it was further adapted for computer vision applications [24]. For image analysis, a visual analogue of a word is used in the BoW model, which is based on the vector quantization process by clustering low-level visual features of local regions or points, such as color, texture, and so forth.
Extracting the BoW feature from images involves the following steps: (i) automatically detect regions/points of interest, (ii) compute local descriptors over those regions/points, (iii) quantize the descriptors into words to form the visual vocabulary, and (iv) find the occurrences in the image of each specific word in the vocabulary for constructing the BoW feature (or a histogram of word frequencies) [24]. Figure 1 describes these four steps to extract the BoW feature from images.
The BoW model can be defined as follows. Given a training dataset $D$ containing $N$ images represented by $D = \{d_1, d_2, \ldots, d_N\}$, where $d_j$ denotes the extracted visual features of the $j$th image, a specific unsupervised learning algorithm, such as $k$-means, is used to group $D$ based on a fixed number of visual words (or categories) represented by $W = \{w_1, w_2, \ldots, w_V\}$, where $V$ is the cluster number. Then, we can summarize the data in a $V \times N$ cooccurrence table of counts $N_{ij} = n(w_i, d_j)$, where $n(w_i, d_j)$ denotes how often the word $w_i$ occurred in an image $d_j$.
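The definition above can be illustrated with a minimal sketch in plain Python. It assumes toy two-dimensional descriptors instead of real 128-dimensional SIFT vectors, and the helper names (`kmeans`, `nearest`, `cooccurrence_table`) are hypothetical, chosen only for this illustration. A basic Lloyd-style k-means builds the vocabulary, and the table entry at row i and column j counts how often visual word i occurs in image j.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centers):
    """Index of the cluster center (visual word) closest to point."""
    return min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))

def kmeans(points, k, iters=20, seed=0):
    """Basic Lloyd-style k-means; returns k cluster centers (the visual words)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        for i, g in enumerate(groups):
            if g:  # keep the previous center if a cluster becomes empty
                centers[i] = [sum(col) / len(g) for col in zip(*g)]
    return centers

def cooccurrence_table(images, centers):
    """table[i][j] = how often visual word i occurs in image j."""
    table = [[0] * len(images) for _ in centers]
    for j, descriptors in enumerate(images):
        for desc in descriptors:
            table[nearest(desc, centers)][j] += 1
    return table

# Toy data: three "images", each a list of 2-D "descriptors" (illustrative only).
images = [
    [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1)],
    [(5.1, 5.0), (4.9, 5.0)],
    [(0.0, 0.0), (0.1, 0.1), (0.05, 0.0), (5.0, 5.0)],
]
all_descriptors = [d for img in images for d in img]
vocabulary = kmeans(all_descriptors, k=2)
table = cooccurrence_table(images, vocabulary)
```

Each column of `table` is the BoW histogram of one image, so its entries sum to the number of descriptors extracted from that image.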
The first step of the BoW methodology is to detect local interest regions or points. The features of interest points (or keypoints) are computed at predefined locations and scales. Several well-known region detectors that have been described in the literature are discussed below [25, 26].
(i) Harris-Laplace regions are detected by the scale-adapted Harris function and selected in scale-space by the Laplacian-of-Gaussian operator. Harris-Laplace detects corner-like structures.
(ii) DoG regions are localized at local scale-space maxima of the difference-of-Gaussian. This detector is suitable for finding blob-like structures. In addition, the DoG point detector has previously been shown to perform well, and it is also faster and more compact (fewer feature points per image) than other detectors.
(iii) Hessian-Laplace regions are localized in space at the local maxima of the Hessian determinant and in scale at the local maxima of the Laplacian-of-Gaussian.
(iv) Salient regions are detected in scale-space at local maxima of the entropy. The entropy of pixel intensity histograms is measured for circular regions of various sizes at each image position.
(v) Maximally stable extremal regions (MSERs) are components of connected pixels in a thresholded image. A watershed-like segmentation algorithm is applied to image intensities, and the segment boundaries that are stable over a wide range of thresholds define the region.
Mikolajczyk et al. [27] compare six types of well-known detectors, namely, detectors based on affine normalization around Harris and Hessian points, MSER, an edge-based region detector, a detector based on intensity extrema, and a detector of salient regions. They conclude that the Hessian-Affine detector performs best.
On the other hand, according to Hörster and Lienhart [21], interest points can be detected by the sparse or dense approach. For sparse features, interest points are detected at local extrema in the difference-of-Gaussian pyramid [23]. A position and scale are automatically assigned to each point and thus the extracted regions are invariant to these properties. For dense features, on the other hand, interest points are defined at evenly sampled grid points. Feature vectors are then computed based on three different neighborhood sizes, that is, at different scales, around each interest point.
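The dense approach can be sketched in a few lines of Python. The grid step and the three neighborhood sizes used here are assumptions for illustration, not values taken from [21], and `dense_keypoints` is a hypothetical helper name.

```python
def dense_keypoints(width, height, step=8, patch_sizes=(8, 16, 24)):
    """Return (x, y, size) triples: every grid point paired with every patch size.

    The grid is kept far enough from the border that the largest patch
    still fits inside the image.
    """
    margin = max(patch_sizes) // 2
    points = [(x, y)
              for y in range(margin, height - margin + 1, step)
              for x in range(margin, width - margin + 1, step)]
    return [(x, y, size) for (x, y) in points for size in patch_sizes]

# A 48 x 48 image sampled every 8 pixels at three scales.
kps = dense_keypoints(width=48, height=48)
```

Unlike a sparse detector, the number of keypoints here depends only on the image size and the grid step, so uniform regions such as sky or water are sampled as densely as textured ones.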
Some authors believe that a very precise segmentation of an image is not required for the scene classification problem [28], and some studies have shown that coarse segmentation is very suitable for scene recognition. In particular, Bosch et al. [29] compare four dense descriptors with the widely used sparse descriptor (i.e., the Harris detector) [14, 15] and show that the best results are obtained with the dense descriptors. This is because there is more information on scene images, and intuitively a dense image description is necessary to capture uniform regions such as sky, calm water, or road surface in many natural scenes. Similarly, Jurie and Triggs [30] show that the sampling of many patches on a regular dense grid (or a fixed number of patches) outperforms the use of interest points. In addition, Fei-Fei and Perona [31], and Bosch et al. [29] show that dense descriptors outperform the sparse ones.
In most studies, some single local descriptors are extracted, in which the Scale Invariant Feature Transform (SIFT) descriptor is the most widely extracted [23]. It combines a scale invariant region detector and a descriptor based on the gradient distribution in the detected regions. The descriptor is represented by a 3D histogram of gradient locations and orientations. The dimensionality of the SIFT descriptor is 128.
In order to reduce the dimensionality of the SIFT descriptor, which is usually 128 dimensions per keypoint, principal component analysis (PCA) can be used, which can increase image retrieval accuracy and speed up matching [32]. Specifically, Uijlings et al. [33] show that retrieval performance can be increased by using PCA for the removal of redundancy in the dimensions.
Among the local descriptors compared in the literature, SIFT was found to work best [13, 25, 34, 35]. Specifically, Mikolajczyk and Schmid [34] compared 10 different descriptors extracted by the Harris-Affine detector, which are SIFT, gradient location and orientation histograms (GLOH) (i.e., an extension of SIFT), shape context, PCA-SIFT, spin images, steerable filters, differential invariants, complex filters, moment invariants, and cross-correlation of sampled pixel values. They show that the SIFT-based descriptors perform best.
In addition, Quelhas et al. [13] confirm in practice that DoG + SIFT constitutes a reasonable choice. Very few studies consider the extraction of different descriptors. For example, Li et al. [36] combine or fuse the SIFT descriptor and the concatenation of block and blob based HSV histogram and local binary patterns to generate the BoW.
When the keypoints are detected and their features are extracted, such as with the SIFT descriptor, the final step of extracting the BoW feature from images is based on vector quantization. In general, the $k$-means clustering algorithm is used for this task, and the number of visual words generated is based on the number of clusters (i.e., $k$). Jiang et al. [17] conducted a comprehensive study on the representation choices of BoW, including vocabulary size, weighting scheme, such as binary, term frequency (TF) and term frequency-inverse document frequency (TF-IDF), stop word removal, feature selection, and so forth for video and image annotation.
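The three weighting schemes named above can be sketched as follows. This is a minimal illustration, assuming the common IDF variant log(N / df); other variants exist, and the function names are hypothetical rather than taken from [17].

```python
import math

def binary_weights(counts):
    """1 if the visual word occurs in the image at all, else 0."""
    return [1 if c > 0 else 0 for c in counts]

def tf_weights(counts):
    """Term frequency: word count normalized by the total count in the image."""
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]

def idf(histograms):
    """Inverse document frequency per visual word, assuming log(N / df)."""
    n_images = len(histograms)
    values = []
    for w in range(len(histograms[0])):
        df = sum(1 for h in histograms if h[w] > 0)
        values.append(math.log(n_images / df) if df else 0.0)
    return values

def tfidf_weights(counts, idf_values):
    """TF-IDF: term frequency scaled by the word's inverse document frequency."""
    return [tf * i for tf, i in zip(tf_weights(counts), idf_values)]

# Toy count histograms for three images over a three-word vocabulary.
histograms = [[2, 0, 1], [1, 1, 0], [3, 0, 0]]
idf_values = idf(histograms)
weighted = [tfidf_weights(h, idf_values) for h in histograms]
```

In this toy collection the first visual word occurs in every image, so its IDF is zero and it contributes nothing under TF-IDF, which is the same effect that stop word removal aims for.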
To generate visual words, many studies focus on capturing spatial information in order to overcome the limitations of the conventional BoW model, such as Yang et al. [37], Zhang et al. [38], Chen et al. [39], S. Kim and D. Kim [40], Lu and Ip [41], Lu and Ip [42], Uijlings et al. [43], Cao and Fei-Fei [44], Philbin et al. [45], Wu et al. [46], Agarwal and Triggs [47], Lazebnik et al. [48], Marszałek and Schmid [49], and Monay et al. [50], in which spatial pyramid matching introduced by Lazebnik et al. [48] has been widely compared as one of the baselines.
However, Van de Sande et al. [51] have shown that a severe drawback of the bag-of-words model is its high computational cost in the quantization step. In other words, the most expensive part in a state-of-the-art setup of the bag-of-words model is the vector quantization step, that is, finding the closest cluster for each data point in the $k$-means algorithm.
Uijlings et al. [33] compare $k$-means and random forests for the word assignment task in terms of computational efficiency. By using different descriptors with different grid sizes, random forests are significantly faster than $k$-means. In addition, using random forests to generate BoW can provide a slightly better Mean Average Precision (MAP) than $k$-means does. They also recommend two BoW pipelines, one for when the focus is on accuracy and one for when the focus is on speed.
In the seminal work of Philbin et al. [45], approximate $k$-means, hierarchical $k$-means, and (exact) $k$-means are compared in terms of precision and computational cost, and approximate $k$-means works best (see Section 4.3 for further discussion).
Chum et al. [52] observe that feature detection and quantization are noisy processes and this can result in variation in the particular visual words that appear in different images of the same object, leading to missed results.
After the BoW feature is extracted from images, it is entered into a classifier for training or testing. Besides constructing the discriminative models as classifiers for image annotation, some Bayesian text models based on Latent Semantic Analysis [53], such as probabilistic Latent Semantic Analysis (pLSA) [54] and Latent Dirichlet Allocation (LDA) [55], can be adapted to model object and scene categories.
The construction of discriminative models for image annotation is based on the supervised machine learning principle for pattern recognition. Supervised learning can be thought of as learning by examples or learning with a teacher [56]. The teacher has knowledge of the environment which is represented by a set of input-output examples. In order to classify unknown patterns, a certain number of training samples are available for each class, and they are used to train the classifier [57].
The learning task is to compute a classifier or model that approximates the mapping between the input-output examples and correctly labels the training set with some level of accuracy. This can be called the training or model generation stage. After the model is generated or trained, it is able to classify an unknown instance into one of the class labels learned from the training set. More specifically, the classifier calculates the similarity of the instance to all trained classes and assigns the unlabeled instance to the class with the highest similarity measure. In the literature, the most widely used classifier is based on support vector machines (SVM) [58].
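The train-then-classify cycle described above can be sketched with a deliberately simple stand-in. Note that this is a nearest-class-mean classifier using cosine similarity, not an SVM; it is swapped in only because it makes the "assign to the class with the highest similarity" step explicit in a few lines. The data and function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two BoW histograms."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def train(histograms, labels):
    """Training stage: store one mean BoW histogram per class label."""
    grouped = {}
    for h, y in zip(histograms, labels):
        grouped.setdefault(y, []).append(h)
    return {y: [sum(col) / len(hs) for col in zip(*hs)]
            for y, hs in grouped.items()}

def classify(model, histogram):
    """Assign the unlabeled histogram to the most similar trained class."""
    return max(model, key=lambda y: cosine(model[y], histogram))

# Toy labeled BoW histograms over a three-word vocabulary.
train_x = [[5, 0, 1], [4, 1, 0], [0, 6, 2], [1, 5, 3]]
train_y = ["car", "car", "dog", "dog"]
model = train(train_x, train_y)
prediction = classify(model, [3, 0, 1])
```

An SVM would replace the class means with a learned decision boundary, but the surrounding workflow (fit on labeled histograms, then predict a label for an unseen histogram) stays the same.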
In text analysis, pLSA and LDA are used to discover topics in a document using the BoW document representation. For image annotation, documents and discovered topics are thought of as images and object categories, respectively. Therefore, an image containing instances of several objects is modeled as a mixture of topics. This topic distribution over the images is used to classify an image as belonging to a certain scene. For example, if an image contains “water with waves”, “sky with clouds”, and “sand”, it will be classified into the “coast” scene class [24].
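Following the previous definition of BoW, in pLSA there is a latent variable model for cooccurrence data which associates an unobserved class variable $z \in Z = \{z_1, \ldots, z_K\}$ with each observation. A joint probability model $P(w, d) = P(d) P(w \mid d)$ over the words and images is defined by the mixture:
$$P(w_i \mid d_j) = \sum_{z \in Z} P(w_i \mid z) \, P(z \mid d_j),$$
where $P(w_i \mid z)$ are the topic specific distributions and each image is modeled as a mixture of topics, $P(z \mid d_j)$.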
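To make the mixture concrete, the following is a minimal sketch of fitting pLSA with the standard EM updates on a small table of word-image counts. It models the probability of a word given an image as a sum over topics of P(word | topic) times P(topic | image). The toy counts, the number of topics, and all function names are assumptions for illustration only.

```python
import random

def normalize(v):
    """Scale a nonnegative vector to sum to 1 (uniform if it sums to 0)."""
    s = sum(v)
    return [x / s for x in v] if s else [1.0 / len(v)] * len(v)

def plsa(counts, n_topics, iters=50, seed=0):
    """counts[i][j] = occurrences of word i in image j.

    Returns (p_w_z, p_z_d) with p_w_z[z][i] = P(word i | topic z)
    and p_z_d[j][z] = P(topic z | image j).
    """
    rng = random.Random(seed)
    n_words, n_images = len(counts), len(counts[0])
    p_w_z = [normalize([rng.random() + 1e-3 for _ in range(n_words)])
             for _ in range(n_topics)]
    p_z_d = [normalize([rng.random() + 1e-3 for _ in range(n_topics)])
             for _ in range(n_images)]
    for _ in range(iters):
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        new_z_d = [[0.0] * n_topics for _ in range(n_images)]
        for i in range(n_words):
            for j in range(n_images):
                if counts[i][j] == 0:
                    continue
                # E-step: posterior over topics for this (word, image) pair.
                post = normalize([p_w_z[z][i] * p_z_d[j][z]
                                  for z in range(n_topics)])
                # Accumulate expected counts for the M-step.
                for z in range(n_topics):
                    new_w_z[z][i] += counts[i][j] * post[z]
                    new_z_d[j][z] += counts[i][j] * post[z]
        # M-step: renormalize the expected counts into distributions.
        p_w_z = [normalize(row) for row in new_w_z]
        p_z_d = [normalize(row) for row in new_z_d]
    return p_w_z, p_z_d

def p_word_given_image(p_w_z, p_z_d, i, j):
    """The pLSA mixture: sum over topics of P(word | topic) * P(topic | image)."""
    return sum(p_w_z[z][i] * p_z_d[j][z] for z in range(len(p_w_z)))

# Toy table: 4 visual words (rows) by 3 images (columns).
counts = [[4, 0, 3], [3, 0, 4], [0, 5, 1], [0, 4, 0]]
p_w_z, p_z_d = plsa(counts, n_topics=2)
```

The per-image topic weights in `p_z_d` are what a scene classifier would consume as a compact image representation in place of the raw BoW histogram.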
On the other hand, LDA treats the multinomial weights over topics as latent random variables. In particular, the pLSA model is extended by sampling those weights from a Dirichlet distribution. This extension allows the model to assign probabilities to data outside the training corpus and uses fewer parameters, which can reduce the overfitting problem.
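The goal of LDA is to maximize the following likelihood:
$$p(w \mid \alpha, \beta) = \int \!\! \int \sum_{z} p(w \mid z, \phi) \, p(z \mid \theta) \, p(\theta \mid \alpha) \, p(\phi \mid \beta) \, d\theta \, d\phi,$$
where $\theta$ and $\phi$ are multinomial parameters over the topics and words, respectively, and $p(\theta \mid \alpha)$ and $p(\phi \mid \beta)$ are Dirichlet distributions parameterized by the hyperparameters $\alpha$ and $\beta$.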
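The generative view behind LDA can be sketched as a sampling procedure: topic weights for an image are drawn from a Dirichlet distribution, and each visual word is then generated by first picking a topic and then picking a word from that topic. The sketch below shows only this generative direction, not inference or likelihood maximization; the symmetric hyperparameter values, the sizes, and the function names are assumptions for illustration.

```python
import random

def sample_dirichlet(rng, concentration, dim):
    """Draw from a symmetric Dirichlet via normalized Gamma variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]

def sample_index(rng, probs):
    """Draw an index from a categorical (single-trial multinomial) distribution."""
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1  # guard against floating-point round-off

def generate_image(rng, phi, alpha, n_words):
    """Generate the visual words of one image under the LDA generative model."""
    theta = sample_dirichlet(rng, alpha, len(phi))  # topic weights of this image
    words = []
    for _ in range(n_words):
        z = sample_index(rng, theta)             # pick a topic
        words.append(sample_index(rng, phi[z]))  # pick a visual word from it
    return theta, words

rng = random.Random(0)
n_topics, vocab_size, alpha, beta = 3, 10, 0.5, 0.1
# One word distribution per topic, each drawn from a Dirichlet with parameter beta.
phi = [sample_dirichlet(rng, beta, vocab_size) for _ in range(n_topics)]
theta, words = generate_image(rng, phi, alpha, n_words=20)
```

Because the topic weights are sampled rather than fitted per training image, the same procedure applies to images outside the training corpus, which is the property that distinguishes LDA from pLSA.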
Bosch et al. [24] compare BoW + pLSA with different semantic modeling approaches, such as the traditional global feature representation and the block-based feature representation [59], with the $k$-nearest neighbor classifier. They show that BoW + pLSA performs best. Specifically, the HIS histogram + cooccurrence matrices + edge direction histogram are used as the image descriptors.
However, it is interesting that Lu and Ip [41] and Quelhas et al. [60] show that pLSA does not perform better than BoW + SVM over the Corel dataset, where the former uses block-based HSV and Gabor texture features and the latter uses keypoint-based SIFT features.
This section reviews the literature regarding using BoW for some related problems. They are divided into five categories, namely, feature representation, vector quantization, visual vocabulary construction, image segmentation, and others.
Since the annotation accuracy is heavily dependent on feature representation, using different region/point descriptors and/or the BoW feature representation will provide different levels of discriminative power for annotation. For example, Mikolajczyk and Schmid [34] compare 10 different local descriptors for object recognition. Jiang et al. [17] examine the classification accuracy of the BoW features using different numbers of visual words and different weighting schemes.
Due to the drawbacks that vector quantization may reduce the discriminative power of images and the BoW methodology ignores geometric relationships among visual words, Zhong et al. [61] present a novel scheme where SIFT features are bundled into local groups. These bundled features are repeatable and are much more discriminative than an individual SIFT feature. In other words, a bundled feature provides a flexible representation that allows us to partially match two groups of SIFT features.
On the other hand, since the image feature generally carries mixed information of the entire image which may contain multiple objects and background, the annotation accuracy can be degraded by such noisy (or diluted) feature representations. Chen et al. [62] propose a novel feature representation, pseudo-objects. It is based on a subset of proximate feature points with its own feature vector to represent a local area to approximate candidate objects in images.
Gehler and Nowozin [63] focus on feature combination, which is to combine multiple complementary features based on different aspects such as shape, color, or texture. They study several models that aim at learning the correct weighting of different features from training data. They provide insight into when combination methods can be expected to work and how the benefit of complementary features can be exploited most efficiently.
Qin and Yung [64] use localized maximum-margin learning to fuse different types of features during the BoW modeling. Particularly, the region of interest is described by a linear combination of the dominant feature and other features extracted from each patch at different scales, respectively. Then, dominant feature clustering is performed to create contextual visual words, and each image in the training set is evaluated against the codebook using the localized maximum-margin learning method to fuse other features, in order to select a list of contextual visual words that best represents the patches of the image.
As there is a relation between the composition of a photograph and its subject, similar subjects are typically photographed in a similar style. Van Gemert [65] exploits the assumption that images within a category share a similar style, such as colorfulness, lighting, depth of field, viewpoints and saliency. They use the photographic style for category-level image classification. In particular, where the spatial pyramid groups features spatially [48], they focus on more general feature grouping, including these photographic style attributes.
In Rasiwasia and Vasconcelos [66], they introduce an intermediate space, based on a low dimensional semantic “theme” image representation, which is learned with weak supervision from casual image annotations. Each theme induces a probability density on the space of low-level features, and images are represented as vectors of posterior theme probabilities.
In order to reduce the quantization noise, Jégou et al. [67] construct short codes using quantization. The goal is to estimate distances using vector-to-centroid distances, that is, the query vector is not quantized, codes are assigned to the database vectors only. In other words, the feature space is decomposed into a Cartesian product of low-dimensional subspaces, and then each subspace is quantized separately. In particular, a vector is represented by a short code composed of its subspace quantization indices.
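The idea of quantizing each subspace separately and comparing an unquantized query against coded database vectors can be sketched as follows. The tiny hand-written codebooks are hypothetical (in practice they would be learned, for example by k-means on each subspace), and the function names are chosen only for this illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def split(vector, m):
    """Cut a vector into m contiguous subvectors of equal length."""
    size = len(vector) // m
    return [vector[i * size:(i + 1) * size] for i in range(m)]

def encode(vector, codebooks):
    """Short code: the index of the nearest centroid in each subspace."""
    subs = split(vector, len(codebooks))
    return [min(range(len(cb)), key=lambda c: sq_dist(sub, cb[c]))
            for sub, cb in zip(subs, codebooks)]

def asymmetric_distance(query, code, codebooks):
    """Squared distance from an unquantized query to a coded database vector."""
    subs = split(query, len(codebooks))
    return sum(sq_dist(sub, cb[idx])
               for sub, cb, idx in zip(subs, codebooks, code))

# Hypothetical codebooks: 2 subspaces of dimension 2, with 2 centroids each.
codebooks = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(0.0, 1.0), (1.0, 0.0)],
]
database_vector = (0.9, 1.1, 0.1, 0.8)
code = encode(database_vector, codebooks)      # stored instead of the vector
dist = asymmetric_distance((1.0, 1.0, 0.0, 1.0), code, codebooks)
```

With m subspaces of c centroids each, the code can address c to the power m distinct reconstruction points while storing only m small indices per database vector.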
As abrupt quantization into discrete bins does cause some aliasing, Agarwal and Triggs [47] focus on soft vector quantization, that is, softly voting into the cluster centers that lie close to the patch, for example, with Gaussian weights. They show that diagonal-covariance Gaussian mixtures fitted using expectation-maximization performs better than hard vector quantization.
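Soft voting with Gaussian weights can be sketched as below. This is a simplified illustration that assumes fixed cluster centers and a single shared bandwidth `sigma`; it does not fit a Gaussian mixture by expectation-maximization as in [47], and the function names are hypothetical.

```python
import math

def soft_assign(descriptor, centers, sigma=1.0):
    """Gaussian-weighted votes of one descriptor for every cluster center."""
    weights = [math.exp(-sum((x - c) ** 2 for x, c in zip(descriptor, center))
                        / (2.0 * sigma ** 2))
               for center in centers]
    total = sum(weights)
    if total == 0.0:  # descriptor is far from every center (underflow)
        return [1.0 / len(centers)] * len(centers)
    return [w / total for w in weights]

def soft_histogram(descriptors, centers, sigma=1.0):
    """BoW histogram in which each descriptor spreads one vote over the words."""
    hist = [0.0] * len(centers)
    for d in descriptors:
        for i, w in enumerate(soft_assign(d, centers, sigma)):
            hist[i] += w
    return hist

centers = [(0.0, 0.0), (2.0, 0.0)]
votes = soft_assign((1.0, 0.0), centers)  # equidistant from both centers
hist = soft_histogram([(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)], centers)
```

A descriptor lying midway between two centers splits its vote evenly, whereas hard quantization would assign it entirely to one bin, so a small perturbation of the descriptor changes the histogram only slightly.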
Similarly, Fernando et al. [68] propose a supervised learning algorithm based on a Gaussian mixture model, which not only generalizes the $k$-means by allowing “soft assignments”, but also exploits supervised information to improve the discriminative power of the clusters. In their approach, an EM-based approach is used to optimize a convex combination of two criteria, in which the first one is unsupervised and based on the likelihood of the training data, and the second is supervised and takes into account the purity of the clusters.
On the other hand, Wu et al. [69] propose a Semantics-Preserving Bag-of-Words (SPBoW) model, which considers the distance between the semantically identical features as a measurement of the semantic gap and tries to learn a codebook by minimizing this semantic gap. That is, the codebook generation task is formulated as a distance metric learning problem. In addition, one visual feature can be assigned to multiple visual words in different object categories.
In de Campos et al. [70], images are modeled as orderless sets of weighted visual features, where each visual feature is associated with a weight factor that may indicate its relevance. In this approach, visual saliency maps are used to determine the relevance weight of a feature.
Zheng et al. [71] argue that for the BoW model used in information retrieval and document categorization, the textual word possesses semantics itself and the documents are well-structured data regulated by grammatical, linguistic, and lexical rules. However, there appear to be no well-defined rules in the visual word composition of images. For instance, the objects of the same class might have arbitrarily different shapes and visual appearances, while objects of different classes might share similar local appearances. To this end, a higher-level visual representation, the visual synset, is presented for object recognition. First, an intermediate visual descriptor, the delta visual phrase, is constructed from a frequently co-occurring visual word-set with similar spatial context. Second, the delta visual phrases are clustered into a visual synset based on their probabilistic “semantics”, that is, class probability distribution.
Besides reducing the vector quantization noise, another severe drawback of the BoW model is its high computational cost. To address this problem, Moosmann et al. [72] introduce extremely randomized clustering forests based on ensembles of randomly created clustering trees and show that more accurate results can be obtained as well as much faster training and testing.
Recently, Van de Sande et al. [51] proposed two algorithms to combine GPU hardware and a parallel programming model to accelerate the quantization and classification components of the visual categorization architecture.
On the other hand, Hare et al. [73] show that the intensity inversion characteristics of the SIFT descriptor and local interest region detectors can be exploited to decrease the time it takes to create vocabularies of visual terms. In particular, they show that clustering inverted and noninverted (or minimum and maximum) features separately results in the same retrieval performance as clustering all the features as a single set (with the same overall vocabulary size).
Since related studies, such as Jegou et al. [74], Marszałek and Schmid [49], Sivic and Zisserman [14], and Winn et al. [75], have shown that the commonly generated visual words are still not as expressive as text words, in Zhang et al. [76], images are represented as visual documents composed of repeatable and distinctive visual elements, which are comparable to text words. They propose descriptive visual words (DVWs) and descriptive visual phrases (DVPs) as the visual correspondences to text words and phrases, where visual phrases refer to the frequently co-occurring visual word pairs.
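The notion of a visual phrase as a frequently co-occurring pair of visual words can be illustrated with a simple counting sketch. It only counts in how many images two words appear together; it ignores the spatial context and the descriptiveness ranking used in [76], and the threshold and function name are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_word_pairs(images, min_images=2):
    """Return unordered visual-word pairs that co-occur in >= min_images images.

    images: list of lists of visual-word indices, one list per image.
    """
    pair_counts = Counter()
    for words in images:
        # Count each pair at most once per image.
        for pair in combinations(sorted(set(words)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_images}

# Toy collection: each image is the list of visual words detected in it.
images = [[3, 7, 7, 9], [7, 3, 1], [9, 1], [3, 7]]
phrases = frequent_word_pairs(images)
```

Here only words 3 and 7 appear together in several images, so they form the single candidate phrase.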
Gavves et al. [77] focus on identifying pairs of independent, distant words—the visual synonyms—that are likely to host image patches of similar visual reality. Specifically, landmark images are considered, where the image geometry guides the detection of synonym pairs. Image geometry is used to find those image features that lie in a nearly identical physical location, yet are assigned to different words of the visual vocabulary.
On the other hand, López-Sastre et al. [78] present a novel method for constructing a visual vocabulary that takes into account the class labels of images. It consists of two stages: Cluster Precision Maximisation (CPM) and Adaptive Refinement. In the first stage, a Reciprocal Nearest Neighbours (RNN) clustering algorithm is guided towards class representative visual words by maximizing a new cluster precision criterion. Next, an adaptive threshold refinement scheme is proposed with the aim of increasing vocabulary compactness, while at the same time improving the recognition rate and further increasing the representativeness of the visual words for category-level object recognition. In other words, this is a correlation clustering based approach, which works as a kind of metaclustering and optimizes the cut-off threshold for each cluster separately.
Constructing visual codebook ensembles is another approach to improve image annotation accuracy. In Luo et al. [18], three methods for constructing visual codebook ensembles are presented. The first one is based on diverse individual visual codebooks built by randomly choosing interest points. The second one uses a random subset of the training images with random interest points. The third one directly utilizes different patch information for constructing an ensemble with high diversity. Consequently, different types of image representations are obtained. Then, a classification ensemble is learned from the differently represented datasets derived from the same training set.
Bae and Juang [79] apply the idea of linguistic parsing to generate the BoW feature for image annotation. That is, images are represented by a number of variable-size patches by a multidimensional incremental parsing algorithm. Then, the occurrence pattern of these parsed visual patches is fed into the LSA framework.
Since one major challenge in object categorization is to find class models that are “invariant” enough to incorporate naturally-occurring intraclass variations and yet “discriminative” enough to distinguish between different classes, Winn et al. [75] proposed a supervised learning algorithm, which automatically finds such models. In particular, it classifies a region according to the proportions of different visual words. The specific visual words and the typical proportions in each object are learned from a segmented training set.
Kesorn and Poslad [80] propose a framework to enhance the visual word quality. First of all, visual words from representative keypoints are constructed by reducing similar keypoints. Second, domain specific noninformative visual words are detected, which are useless for representing the content of visual data but which can degrade the categorization capability. A noninformative visual word is defined as having a high document frequency and a small statistical association with all the concepts in the image collection. Third, the vector space model of visual words is restructured with respect to a structural ontology model in order to solve visual synonym and polysemy problems.
Tirilly et al. [81] present a new image representation called visual sentences that allows us to “read” visual words in a certain order, as in the case of text. In particular, simple spatial relations between visual words are considered. In addition, pLSA is used to eliminate the noisiest visual words.
Effective image segmentation can be an important factor affecting BoW feature generation. Uijlings et al. [43] study the role of context in the BoW approach. They observe that using the precise localization of object patches based on image segmentation is likely to yield better performance than the dense sampling strategy, which samples patches of 8 × 8 pixels at every 4th pixel.
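The dense sampling strategy mentioned above can be sketched as follows. This is a minimal illustration only: it assumes an image stored as a list of pixel rows, and the function names are hypothetical; the patch size of 8 and the stride of 4 follow the description in the text.

```python
def dense_patch_origins(height, width, patch=8, stride=4):
    """Top-left corners of all patch x patch windows on a regular grid."""
    return [
        (y, x)
        for y in range(0, height - patch + 1, stride)
        for x in range(0, width - patch + 1, stride)
    ]


def extract_patches(image, patch=8, stride=4):
    """Cut an image (a list of pixel rows) into overlapping square patches."""
    height, width = len(image), len(image[0])
    return [
        [row[x:x + patch] for row in image[y:y + patch]]
        for y, x in dense_patch_origins(height, width, patch, stride)
    ]
```

For a 16 × 16 image, this grid yields three positions per axis and therefore nine overlapping patches, each of which would then be passed to a local descriptor.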
Besides point detection, an image can be segmented into several or a fixed number of regions or blocks. However, very few studies have compared the effect of image segmentation on generating the BoW feature. In Cheng and Wang [82], 20–50 regions per image are segmented, and each region is represented by an HSV histogram and cooccurrence texture features. By using contextual Bayesian networks to model the spatial relationship between local regions and integrating multiple attributes to infer the high-level semantics of an image, this approach performs better than or comparably to a number of works using SIFT descriptors and pLSA for image annotation.
Similarly, Wu et al. [46] extract a texture histogram from the 8 × 8 blocks/patches per image based on their proposed visual language modeling method, which utilizes the spatial correlation of visual words. This representation is compared with the BoW model, including pLSA and LDA, using the SIFT descriptor. Because neither image segmentation nor interest point detection is used in the visual language modeling method, they show that it is not only very efficient but also very effective on the Caltech 7 dataset.
In addition to using the BoW feature for image annotation, Larlus et al. [83] combine BoW with random fields and some generative models, such as Dirichlet processes, for more effective object segmentation.
Although the BoW model has been extensively studied for general object and scene categorization, it has also been considered in some domain specific applications, such as human action recognition [84], facial expression recognition [85], medical images [86], robot, sport image analysis [80], 3D image retrieval and classification [87, 88], image quality assessment [89], and so forth.
Farhadi et al. [90] propose shifting the goal of recognition from naming to describing. That is, they focus on describing objects by their attributes, which is not only to name familiar objects, but also to report unusual aspects of a familiar object, such as “spotty dog”, not just “dog”, and to say something about unfamiliar objects, such as “hairy and four-legged”, not just “unknown”.
On the other hand, Sudderth et al. [91] develop hierarchical, probabilistic models for objects, the parts composing them, and the visual scenes surrounding them. These models share information between object categories in three distinct ways. First, parts define distributions over a common low-level feature vocabulary. Second, objects are defined using a common set of parts. Finally, object appearance information is shared between the many scenes in which that object is found.
Chum et al. [52] adopt the BoW architecture with spatial information for query expansion, which has proven successful in achieving high precision at low recall. On the other hand, Philbin et al. [92] quantize a keypoint to the k-nearest visual words as a form of query expansion.
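Quantizing a keypoint to its k nearest visual words, rather than to a single word, can be sketched as below. This is an illustration under stated assumptions: Euclidean distance is used, and each keypoint spreads an equal vote of 1/k over its k nearest words. The equal weighting and the function names are hypothetical simplifications and are not the scheme reported in [92].

```python
import math


def k_nearest_words(descriptor, codebook, k=3):
    """Indices of the k codebook entries closest to a descriptor."""
    ranked = sorted(
        (math.dist(descriptor, center), index)
        for index, center in enumerate(codebook)
    )
    return [index for _, index in ranked[:k]]


def soft_bow_histogram(descriptors, codebook, k=3):
    """BoW histogram in which each keypoint votes 1/k for each of its
    k nearest visual words (equal weighting is an assumption)."""
    hist = [0.0] * len(codebook)
    for descriptor in descriptors:
        for index in k_nearest_words(descriptor, codebook, k):
            hist[index] += 1.0 / k
    return hist
```

With k = 1 the sketch reduces to the usual hard assignment, so the effect of multiple assignment can be examined by varying k alone.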
Based on the BoW feature representation, Jegou et al. [74] introduce a contextual dissimilarity measure (CDM), which is iteratively obtained by regularizing the average distance of each point to its neighborhood. In addition, CDM is learned in an unsupervised manner, which does not need to learn the distance measure from a set of training images.
Since the aim of image annotation is to support very large scale keyword-based image search, such as web image retrieval, it is very critical to assess existing approaches over some large scale dataset(s). Chum et al. [52], Hörster and Lienhart [21], and Lienhart and Slaney [93] used datasets composed of 100000 to 250000 images belonging to 12 categories, which were downloaded from Flickr.
Moreover, Philbin et al. [45] use over 1000000 images from Flickr for experiments and Zhang et al. [94] use about 370000 images collected from Google belonging to 1506 object or scene categories.
On the other hand, Torralba and Efros [95] study some bias issues of object recognition datasets. They provide some suggestions for creating a new and high quality dataset to minimize the selection bias, capture bias, and negative set bias. Furthermore, they claim that in the state of today’s datasets there are virtually no studies demonstrating cross-dataset generalization, for example, training on ImageNet, while testing on PASCAL VOC. This could be considered as an additional experimental setup for future works.
Although modeling the spatial relationship between visual words can improve the recognition performance, the spatial features are expensive to compute. Liu et al. [96] propose a method that simultaneously performs feature selection and (spatial) feature extraction based on higher-order spatial features for speed and storage improvements.
For the dimensionality reduction purpose, Elfiky et al. [97] present a novel framework for obtaining a compact pyramid representation. In particular, the divisive information theoretic feature clustering (DITC) algorithm is used to create a compact pyramid representation.
Bosch et al. [98] investigate whether dimensionality reduction using a latent generative model is beneficial for the task of weakly supervised scene classification. In their approach, latent “topics” using pLSA are first of all discovered, and a generative model is then applied to the BoW representation for each image.
In contrast to reducing the dimensionality of the feature representation, selecting more discriminative features (e.g., SIFT descriptors) from a given set of training images has been considered. Shang and Xiao [99] introduce a pairwise image matching scheme to select the discriminative features. Specifically, the feature weights are updated by the labeled information from the training set. As a result, the selected features corresponding to the foreground content of the images can highlight the information category of the images.
Simultaneously learning object/scene category models and performing segmentation on the detected objects were studied in Cao and Fei-Fei [44]. They propose a spatially coherent latent topic model (Spatial-LTM), which represents an image containing objects in a hierarchical way by oversegmented image regions of homogeneous appearances and the salient image patches within the regions. It can provide a unified representation for spatially coherent BoW topic models and can simultaneously segment and classify objects.
On the other hand, Tong et al. [100] propose a statistical framework for large-scale near duplicate image retrieval which unifies the step of generating a BoW representation and the step of image retrieval. In this approach, each image is represented by a kernel density function, and the similarity between the query image and a database image is then estimated as the query likelihood.
Shotton et al. [101] utilize semantic texton forests, which are ensembles of decision trees that act directly on image pixels, where the nodes in the trees provide an implicit hierarchical clustering into semantic textons and an explicit local classification estimate. In addition, the bag of semantic textons combines a histogram of semantic textons over an image region with a region prior category distribution, and the bag of semantic textons is computed over the whole image for categorization and over local rectangular regions for segmentation.
Romberg et al. [102] extend the standard single-layer pLSA to multiple layers, where the multiple layers handle multiple modalities and a hierarchy of abstractions. In particular, the multilayer multimodal pLSA (mm-pLSA) model is based on two leaf-pLSAs and a single top-level pLSA node merging the two leaf-pLSAs. In addition, SIFT features and image annotations (tags) as well as the combination of SIFT and HOG features are considered as two pairs of different modalities.
In their study, Lee and Grauman [103] discover new categories by exploiting known categories. That is, previously learned categories are used to discover their familiarity in unsegmented, unlabeled images. In their approach, two variants of a novel object-graph descriptor are introduced to encode the 2D and 3D spatial layout of object-level cooccurrence patterns relative to an unfamiliar region, and they are used to model the interaction between an image’s known and unknown objects for detecting new visual categories.
Since interest point detection is an important step for extracting the BoW feature, Stottinger et al. [104] propose color interest points for sparse image representation. Particularly, light-invariant interest points are introduced to reduce the sensitivity to varying imaging conditions. Color statistics based on occurrence probability lead to color boosted points, which are obtained through saliency-based feature selection.
This section compares related work in terms of how the BoW feature is generated and how the experimental setup is structured. These comparisons allow us to identify the most suitable interest point detector(s), clustering algorithm(s), and so forth used to extract the BoW feature from images. In addition, we are able to identify the most widely used dataset(s) and experimental settings for image annotation by BoW.
Table 1 compares related work for the methodology of extracting the BoW feature. Note that we leave a blank if the information in our comparisons is not clearly described in these related works.
From Table 1 we can observe that the most widely used interest point detector for generating the BoW feature is DoG, and the second and third most popular detectors are Harris-Laplace and Hessian-Laplace, respectively. Besides extracting sparse BoW features, many related studies have focused on dense BoW features.
On the other hand, several studies used some region segmentation algorithms, such as NCuts [116] and Mean-shift [117], to segment an image into several regions to represent keypoints.
For the local feature descriptor used to describe interest points, most studies used the 128-dimensional SIFT feature. Some of them used PCA to reduce the dimensionality of SIFT, whereas others “fused” color features with SIFT, resulting in features of higher dimensionality than SIFT. Apart from SIFT related features, some studies considered conventional color and texture features to represent local regions or points.
Regarding vector quantization, we can see that k-means is the most widely used clustering algorithm to generate the codebook or visual vocabularies. However, in order to address the limitations of k-means, for example, clustering accuracy and computational cost, some studies used hierarchical k-means, approximate k-means, accelerated k-means, and so forth.
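The vector quantization step can be sketched with plain Lloyd-style k-means. This is a minimal illustration, assuming descriptors are tuples of numbers compared by Euclidean distance; the function names, the random initialization, and the fixed iteration cap are assumptions, and none of the hierarchical, approximate, or accelerated variants is shown.

```python
import math
import random


def nearest(point, centers):
    """Index of the center closest to a point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd iterations; returns k cluster centers (the codebook)."""
    rng = random.Random(seed)
    centers = [tuple(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # An empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def bow_histogram(descriptors, centers):
    """Count how many descriptors fall on each visual word."""
    hist = [0] * len(centers)
    for d in descriptors:
        hist[nearest(d, centers)] += 1
    return hist
```

Here the codebook is the list returned by `kmeans`, and `bow_histogram` maps the local descriptors of one image to its visual word counts. Every iteration compares each descriptor with every center, which is the computational cost that the variants listed above aim to reduce.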
For the number of visual words, related works have considered various amounts of clusters during vector quantization. This may be because the datasets used in these works are different. In Jiang et al. [17], different numbers of visual words were studied, and their results show that 1000 is a reasonable choice. Some related studies also used similar numbers of visual words to generate their BoW features.
On the other hand, the most and second most widely used weighting schemes are TF and TF-IDF. This is consistent with Jiang et al. [17], who concluded that these two weighting schemes perform better than the other weighting schemes.
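The two weighting schemes can be sketched as follows. This is an illustration only: it assumes each image is a list of raw visual word counts, uses count normalization for TF and the logarithm of the inverse document frequency for IDF, and the function names are hypothetical. The reviewed studies may use other variants of these formulas.

```python
import math


def tf(hist):
    """Term frequency: visual word counts normalized by the total count."""
    total = sum(hist)
    return [count / total if total else 0.0 for count in hist]


def tf_idf(histograms):
    """TF-IDF: down-weight visual words that occur in many images."""
    n_images = len(histograms)
    vocab_size = len(histograms[0])
    df = [sum(1 for h in histograms if h[w] > 0) for w in range(vocab_size)]
    idf = [math.log(n_images / d) if d else 0.0 for d in df]
    return [[t * w for t, w in zip(tf(h), idf)] for h in histograms]
```

Under this formulation, a visual word that occurs in every image receives an IDF of zero and therefore contributes nothing to the weighted vector.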
Finally, SVM is no doubt the most popular classification technique as the learning model for image annotation. In particular, one of the most widely used kernel functions for constructing the SVM classifier is the Gaussian radial basis function. However, some other SVM classifiers, such as linear SVM and SVM with a polynomial kernel have also been considered in the literature.
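For reference, the Gaussian radial basis function kernel between two BoW vectors can be sketched as below. The width parameter `gamma` and its default value are assumptions that would normally be tuned, and the function names are hypothetical. The sketch only computes kernel values and does not train an SVM.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def gram_matrix(vectors, gamma=0.5):
    """Pairwise kernel values, as consumed by a kernel-based classifier."""
    return [[rbf_kernel(a, b, gamma) for b in vectors] for a in vectors]
```

The kernel value is 1 for identical vectors and decays toward 0 as they move apart, with `gamma` controlling how quickly.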
Table 2 compares related work for the experimental design. That is, the chosen dataset(s) and baseline(s) are examined.
According to Table 2, most studies considered more than one dataset for their experiments, and many of the datasets contained both object and scene categories. This is very important for image annotation, since the annotated keywords should be broad enough for users to perform keyword-based queries for image retrieval.
Specifically, the PASCAL, Caltech, and Corel datasets are the three most widely used benchmarks for image classification. However, the datasets used in most studies usually contain a small number of categories and images, except for the studies focusing on retrieval rather than classification. In those studies, similarity-based queries are used to retrieve relevant images instead of training a learning model to classify unknown images into one specific category.
For the chosen baselines, most studies compared against BoW and/or spatial pyramid matching based BoW, since their aims were to propose novel approaches to improve these two feature representations. Specifically, the spatial pyramid matching based BoW proposed by Lazebnik et al. [48] is the most popular baseline.
Besides improving the feature representation per se, some studies focused on improving the performance of the LDA and/or pLSA generative learning models. Another popular baseline is that of Fei-Fei and Perona [31], who proposed a Bayesian hierarchical model to represent each region as part of a “theme.”
The above comparisons indicate several issues that were not examined in the literature. Local features can be represented using object-based regions obtained by region segmentation [143, 144] or point-based regions obtained by point detection (cf. Section 2.1). For a BoW feature built by tokenizing such local features, it is unknown which of the two is more appropriate for large scale image annotation. (Here, large scale image annotation means that the number of annotated keywords is large and their meanings are very broad, covering both object and scene concepts.)
In addition, although the local feature descriptor is a key component of successful image annotation, the number of visual words (i.e., clusters) is another factor affecting image annotation performance. Jiang et al. [17] conducted a comprehensive study using various numbers of visual words, but they only used one dataset, that is, TRECVID, containing 20 concepts. Therefore, one important issue is to provide guidelines for determining the number of visual words over different kinds of image datasets having different image contents.
The learning techniques can be divided into generative and discriminative models, but very few studies assess their annotation performance over different kinds of image datasets, which is necessary in order to fully understand the value of these two kinds of learning models. On the other hand, a combination of generative and discriminative learning techniques [145] or hybrid models can also be considered for the image annotation task.
For the experimental setup, the target of most studies was not image retrieval. In other words, the performance evaluation was usually for small scale problems based on datasets containing a small number of categories, say 10. However, image retrieval users will not be satisfied with a system providing only 10 keyword-based queries to search relevant images. Some benchmarks are much more suitable for larger scale image annotation, such as the Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) by ImageNet (http://www.image-net.org/challenges/LSVRC/2012/index) and Photo Annotation and Retrieval 2012 by ImageCLEF (http://www.imageclef.org/2012/photo). In particular, the ImageNet dataset contains over 10000 categories and 10000000 labeled images and ImageCLEF uses a subset of the MIRFLICKR collection (http://press.liacs.nl/mirflickr/), which contains 25 thousand images and 94 concepts.
However, it is also possible that some smaller scale datasets composed of a relatively small number of images and/or categories can be combined into larger datasets. For example, the combination of Caltech 256 and Corel could be regarded as a benchmark that is closer to the real world problem.
In this paper, a number of recent related works using BoW for image annotation are reviewed. We can observe that this topic has been extensively studied recently. For example, many studies aim to improve the discriminative power of BoW feature representations by such techniques as image segmentation, vector quantization, and visual vocabulary construction. In addition, there are other directions for integrating the BoW feature into different applications, such as face detection, medical image analysis, 3D image retrieval, and so forth.
From the comparisons of related work, we can identify the most widely used methodology for extracting the BoW feature, which can be regarded as a baseline for future research. That is, DoG is used as the keypoint detector and each keypoint is represented by the SIFT feature. The vector quantization step is based on the k-means clustering algorithm with 1000 visual words. However, the number of visual words (i.e., the value of k) depends on the dataset used. Finally, the weighting scheme can be either TF or TF-IDF.
On the other hand, regarding the choice of datasets in the experimental design, which can affect the contribution and final conclusion, the PASCAL, Caltech, and/or Corel datasets can be used for an initial study.
According to the comparative results, there are some future research directions. First, the local feature descriptor used for vector quantization, usually the point-based SIFT feature, can be compared with other descriptors, such as a region-based feature or a combination of different features. Second, a guideline for determining the number of visual words for different kinds of datasets should be provided. Third, the performance of generative and discriminative learning models should be assessed over different kinds of datasets, such as different dataset sizes and different image contents, for example, a single object per image versus multiple objects per image. Finally, it is worth examining the scalability of the BoW feature representation for large scale image annotation.