1. Introduction
The problem of clustering has been studied widely in the database and statistics literature in the context of a wide variety of data mining tasks [50, 54]. The clustering problem is defined to be that of finding groups of similar objects in the data. The similarity between the objects is measured with the use of a similarity function. The problem of clustering can be very useful in the text domain, where the objects to be clustered can be of different granularities such as documents, paragraphs, sentences or terms. Clustering is especially useful for organizing documents to improve retrieval and support browsing [11, 26].
The study of the clustering problem precedes its applicability to the text domain. Traditional methods for clustering have generally focused on the case of quantitative data [44, 71, 50, 54, 108], in which the attributes of the data are numeric. The problem has also been studied for the case of categorical data [10, 41, 43], in which the attributes may take on nominal values. A broad overview of clustering (as it relates to generic numerical and categorical data) may be found in [50, 54]. A number of implementations of common text clustering algorithms, as applied to text data, may be found in several toolkits such as Lemur [114] and the BOW toolkit [64]. The problem of clustering finds applicability for a number of tasks:
- Document Organization and Browsing: The hierarchical organization of documents into coherent categories can be very useful for systematic browsing of the document collection. A classical example of this is the Scatter/Gather method [25], which provides a systematic browsing technique with the use of clustered organization of the document collection.
- Corpus Summarization: Clustering techniques provide a coherent summary of the collection in the form of cluster-digests [83] or word-clusters [17, 18], which can be used in order to provide summary insights into the overall content of the underlying corpus. Variants of such methods, especially sentence clustering, can also be used for document summarization, a topic discussed in detail in Chapter 3. The problem of clustering is also closely related to that of dimensionality reduction and topic modeling. Such dimensionality reduction methods are all different ways of summarizing a corpus of documents, and are covered in Chapter 5.
- Document Classification: While clustering is inherently an unsupervised learning method, it can be leveraged in order to improve the quality of the results in its supervised variant. In particular, word-clusters [17, 18] and co-training methods [72] can be used in order to improve the classification accuracy of supervised applications with the use of clustering techniques.
We note that many classes of algorithms, such as the k-means algorithm or hierarchical algorithms, are general-purpose methods which can be extended to any kind of data, including text data. A text document can be represented in the form of binary data, when we use the presence or absence of a word in the document in order to create a binary vector. In such cases, it is possible to directly use a variety of categorical data clustering algorithms [10, 41, 43] on the binary representation. A more enhanced representation would include refined weighting methods based on the frequencies of the individual words in the document as well as frequencies of words in an entire collection (e.g., TF-IDF weighting [82]). Quantitative data clustering algorithms [44, 71, 108] can be used in conjunction with these frequencies in order to determine the most relevant groups of objects in the data.
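To make the two representations concrete, the following is a minimal sketch (assuming scikit-learn and a toy three-document corpus, neither of which is part of the original text) that builds both a binary presence/absence matrix and a TF-IDF weighted matrix for the same collection; the former can be fed to categorical clustering algorithms and the latter to quantitative ones:

```python
# Minimal sketch (assumes scikit-learn is available): the same corpus represented
# as binary term-incidence vectors vs. TF-IDF weighted frequency vectors.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "clustering groups similar documents together",
    "document clustering supports browsing and retrieval",
    "matrix factorization reveals latent topics",
]

# Binary representation: presence/absence of each word (suitable for
# categorical-data clustering algorithms).
binary_vectors = CountVectorizer(binary=True).fit_transform(docs)

# Frequency-based representation with TF-IDF weighting (suitable for
# quantitative-data clustering algorithms).
tfidf_vectors = TfidfVectorizer().fit_transform(docs)

print(binary_vectors.toarray())
print(tfidf_vectors.toarray().round(2))
```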
However, such naive techniques do not typically work well for clustering text data. This is because text data has a number of unique properties which necessitate the design of specialized algorithms for the task. The distinguishing characteristics of the text representation are as follows:
- The dimensionality of the text representation is very large, but the underlying data is sparse. In other words, the lexicon from which the documents are drawn may be of the order of $10^5$, but a given document may contain only a few hundred words. This problem is even more serious when the documents to be clustered are very short (e.g., when clustering sentences or tweets).
- While the lexicon of a given corpus of documents may be large, the words are typically correlated with one another. This means that the number of concepts (or principal components) in the data is much smaller than the feature space. This necessitates the careful design of algorithms which can account for word correlations in the clustering process.
- The number of words (or non-zero entries) in the different documents may vary widely. Therefore, it is important to normalize the document representations appropriately during the clustering task.
The sparse and high-dimensional representation of the different documents necessitates the design of text-specific algorithms for document representation and processing, a topic heavily studied in the information retrieval literature, where many techniques have been proposed to optimize document representation for improving the accuracy of matching a document with a query [82, 13]. Most of these techniques can also be used to improve document representation for clustering.
In order to enable an effective clustering process, the word frequencies need to be normalized in terms of their relative frequency of presence in the document and over the entire collection. In general, a common representation used for text processing is the vector-space based TF-IDF representation [81]. In the TF-IDF representation, the term frequency for each word is normalized by the inverse document frequency, or IDF. The inverse document frequency normalization reduces the weight of terms which occur more frequently in the collection. This reduces the importance of common terms in the collection, ensuring that the matching of documents is more influenced by more discriminative words which have relatively low frequencies in the collection. In addition, a sub-linear transformation function is often applied to the term frequencies in order to avoid the undesirable dominating effect of any single term that might be very frequent in a document. The work on document normalization is itself a vast area of research, and a variety of other techniques which discuss different normalization methods may be found in [86, 82].
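As a concrete illustration of the normalization just described, the following is a small NumPy sketch (the toy term-frequency matrix is an assumption for illustration) that applies a logarithmic sub-linear damping to the term frequencies, multiplies by the inverse document frequency, and length-normalizes each document vector:

```python
import numpy as np

# term-frequency matrix: rows = documents, columns = terms (toy values)
tf = np.array([[3, 0, 1, 0],
               [1, 2, 0, 0],
               [0, 1, 0, 4]], dtype=float)

n_docs = tf.shape[0]
df = (tf > 0).sum(axis=0)                 # document frequency of each term
idf = np.log(n_docs / df)                 # inverse document frequency

damped_tf = np.zeros_like(tf)             # sub-linear (logarithmic) damping of TF
mask = tf > 0
damped_tf[mask] = 1.0 + np.log(tf[mask])

tfidf = damped_tf * idf                   # down-weights collection-wide common terms
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)   # document length normalization
print(tfidf.round(3))
```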
Text clustering algorithms are divided into a wide variety of different types such as agglomerative clustering algorithms, partitioning algorithms, and standard parametric modeling based methods such as the EM-algorithm. Furthermore, text representations may also be treated as strings (rather than bags of words). These different representations necessitate the design of different classes of clustering algorithms. Different clustering algorithms have different tradeoffs in terms of effectiveness and efficiency. An experimental comparison of different clustering algorithms may be found in [90, 111]. In this chapter we will discuss a wide variety of algorithms which are commonly used for text clustering. We will also discuss text clustering algorithms for related scenarios such as dynamic data, network-based text data and semi-supervised scenarios.
This chapter is organized as follows. In section 2, we will present feature selection and transformation methods for text clustering. Section 3 describes a number of common algorithms which are used for distance-based clustering of text documents. Section 4 contains the description of methods for clustering with the use of word patterns and phrases. Methods for clustering text streams are described in section 5. Section 6 describes methods for probabilistic clustering of text data. Section 7 contains a description of methods for clustering text which naturally occurs in the context of social or web-based networks. Section 8 discusses methods for semi-supervised clustering. Section 9 presents the conclusions and summary.
2. Feature Selection and Transformation Methods for Text Clustering
The quality of any data mining method such as classification and clustering is highly dependent on the noisiness of the features that are used for the clustering process. For example, commonly used words such as “the”, may not be very useful in improving the clustering quality. Therefore, it is critical to select the features effectively, so that the noisy words in the corpus are removed before the clustering. In addition to feature selection, a number of feature transformation methods such as Latent Semantic Indexing (LSI), Probabilistic Latent Semantic Analysis (PLSA), and Non-negative Matrix Factorization (NMF) are available to improve the quality of the document representation and make it more amenable to clustering. In these techniques (often called dimension reduction), the correlations among the words in the lexicon are leveraged in order to create features, which correspond to the concepts or principal components in the data. In this section, we will discuss both classes of methods. A more in-depth discussion of dimension reduction can be found in Chapter 5.
2.1 Feature Selection Methods
Feature selection is more common and easy to apply in the problem of text categorization [99] in which supervision is available for the feature selection process. However, a number of simple unsupervised methods can also be used for feature selection in text clustering. Some examples of such methods are discussed below.
2.1.1 Document Frequency-based Selection. The simplest possible method for feature selection in document clustering is that of the use of document frequency to filter out irrelevant features. While the use of inverse document frequencies reduces the importance of such words, this may not alone be sufficient to reduce the noise effects of very frequent words. In other words, words which are too frequent in the corpus can be removed because they are typically common words such as “a”, “an”, “the”, or “of” which are not discriminative from a clustering perspective. Such words are also referred to as stop words. A variety of methods are commonly available in the literature [76] for stop-word removal.
Typically, commonly available stop word lists of about 300 to 400 words are used for the retrieval process. In addition, words which occur extremely infrequently can also be removed from the collection. This is because such words do not add anything to the similarity computations which are used in most clustering methods. In some cases, such words may be misspellings or typographical errors in documents. Noisy text collections which are derived from the web, blogs or social networks are more likely to contain such terms. We note that some lines of research define document frequency based selection purely on the basis of very infrequent terms, because these terms contribute the least to the similarity calculations. However, it should be emphasized that very frequent words should also be removed, especially if they are not discriminative between clusters. Note that the TF-IDF weighting method can also naturally filter out very common words in a “soft” way. Clearly, the standard set of stop words provides a valid set of words to prune. Nevertheless, we would like a way of quantifying the importance of a term directly to the clustering process, which is essential for more aggressive pruning. We will discuss a number of such methods below.
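Before turning to those methods, the pruning step described in this subsection can be sketched as follows (a NumPy illustration; the thresholds min_df and max_df_fraction and the toy matrix are assumptions, not values prescribed by the text):

```python
# A small sketch: prune terms that are too frequent or too rare in the collection
# before clustering, based purely on document frequency.
import numpy as np

def prune_by_document_frequency(tf, min_df=2, max_df_fraction=0.5):
    """Keep only terms whose document frequency lies in [min_df, max_df_fraction * n]."""
    n_docs = tf.shape[0]
    df = (tf > 0).sum(axis=0)
    keep = (df >= min_df) & (df <= max_df_fraction * n_docs)
    return tf[:, keep], keep

tf = np.array([[5, 1, 0, 1],
               [3, 0, 1, 1],
               [4, 0, 0, 1],
               [6, 1, 0, 1]], dtype=float)
pruned, kept_mask = prune_by_document_frequency(tf)
print(kept_mask)   # drops the terms occurring in every document and the singleton term
```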
2.1.2 Term Strength. A much more aggressive technique for stop-word removal is proposed in [94]. The core idea of this approach is to extend techniques which are used in supervised learning to the unsupervised case. The term strength is essentially used to measure how informative a word is for identifying two related documents. For example, for two related documents x and y, the term strength s(t) of term t is defined in terms of the following probability:
$$ s(t) = P\big(t \in y \mid t \in x\big) $$

In practice, s(t) is estimated as the fraction of related pairs whose first document contains t for which the second document also contains t.
Here, the first document of the pair may simply be picked randomly. In order to prune features, the term strength may be compared to the expected strength of a term which is randomly distributed in the training documents with the same frequency. If the term strength of t is not at least two standard deviations greater than that of the random word, then it is removed from the collection.
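A rough sketch of this pruning rule is given below (NumPy). The way the related pairs are obtained and the closed-form baseline for the expected strength of a random term are simplifying assumptions rather than the exact procedure of [94]:

```python
import numpy as np

def term_strength(binary, related_pairs):
    """s(t) = P(t in second doc | t in first doc), estimated over related pairs."""
    strengths = np.zeros(binary.shape[1])
    for t in range(binary.shape[1]):
        in_first = [(x, y) for (x, y) in related_pairs if binary[x, t]]
        if in_first:
            strengths[t] = np.mean([binary[y, t] for (x, y) in in_first])
    return strengths

# binary incidence matrix: rows = documents, columns = terms (toy values)
binary = np.array([[1, 1, 0, 1],
                   [1, 0, 1, 1],
                   [0, 1, 1, 0],
                   [1, 1, 1, 0]])
# pairs of documents assumed to be related (e.g., highly similar pairs)
related_pairs = [(0, 1), (2, 3), (0, 3)]

s = term_strength(binary, related_pairs)

# Crude baseline: a random term with the same document frequency p is expected to
# occur in the second document with probability roughly p; keep t only if s(t) is
# at least two standard deviations above this expectation.
p = binary.mean(axis=0)
n_pairs = np.array([sum(binary[x, t] for (x, _) in related_pairs)
                    for t in range(binary.shape[1])])
std = np.sqrt(p * (1 - p) / np.maximum(n_pairs, 1))
keep = s >= p + 2 * std
print(s.round(2), keep)
```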
One advantage of this approach is that it requires no initial supervision or training data for the feature selection, which is a key requirement in the unsupervised scenario. Of course, the approach can also be used for feature selection in either supervised clustering [4] or categorization [100], when such training data is indeed available. One observation about this approach to feature selection is that it is particularly suited to similarity-based clustering because the discriminative nature of the underlying features is defined on the basis of similarities in the documents themselves.
2.1.3 Entropy-based Ranking. The entropy-based ranking approach was proposed in [27]. In this case, the quality of the term is measured by the entropy reduction when it is removed. Here the entropy E(t) of the term t in a collection of n documents is defined as follows:
$$ E(t) = -\sum_{i=1}^{n}\sum_{j=1}^{n} \Big( S_{ij} \cdot \log(S_{ij}) + (1 - S_{ij}) \cdot \log(1 - S_{ij}) \Big) $$
Here $S_{ij}$ is the similarity between the ith and jth document in the collection, after the term t is removed, and is defined as follows:
$$ S_{ij} = 2^{-\mathrm{dist}(i,j)/\overline{\mathrm{dist}}} $$
Here dist(i,j) is the distance between the documents i and j after the term t is removed, and $\overline{\mathrm{dist}}$ is the average distance between the documents after the term t is removed. We note that the computation of E(t) for each term t requires O(n^2) operations. This is impractical for a very large corpus containing many terms. It has been shown in [27] how this method may be made much more efficient with the use of sampling methods.
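The following sketch (NumPy and SciPy, with a toy random term-frequency matrix and Euclidean distances as illustrative assumptions) computes E(t) for each term by removing the term, recomputing pairwise document distances, and applying the definition above:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def entropy_without_term(tf, t):
    reduced = np.delete(tf, t, axis=1)            # drop term t
    dist = squareform(pdist(reduced))             # pairwise document distances
    avg = dist[np.triu_indices_from(dist, k=1)].mean()
    s = 2.0 ** (-dist / avg)                      # similarities S_ij in (0, 1]
    s = np.clip(s, 1e-12, 1 - 1e-12)              # avoid log(0)
    return -np.sum(s * np.log(s) + (1 - s) * np.log(1 - s))

tf = np.random.RandomState(0).poisson(1.0, size=(6, 10)).astype(float)
scores = [entropy_without_term(tf, t) for t in range(tf.shape[1])]
# Terms whose removal yields low entropy (i.e., removal improves the cluster
# structure) are candidates for pruning; [27] uses sampling to reduce the
# O(n^2)-per-term cost on large corpora.
print(np.round(scores, 2))
```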
2.1.4 Term Contribution. The concept of term contribution [62] is based on the fact that the results of text clustering are highly dependent on document similarity. Therefore, the contribution of a term can be viewed as its contribution to document similarity. For example, in the case of dot-product based similarity, the similarity between two documents is defined as the dot product of their normalized frequencies. Therefore, the contribution of a term to the similarity of two documents is the product of its normalized frequencies in the two documents. This needs to be summed over all pairs of documents in order to determine the term contribution. As in the previous case, this method requires O(n^2) time for each term, and therefore sampling methods may be required to speed up the computation. A major criticism of this method is that it tends to favor highly frequent words without regard to the specific discriminative power within a clustering process.
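For plain dot-product similarity, the pairwise sum defining the term contribution collapses into a closed form, as in the following NumPy sketch (the random normalized matrix is an assumption for illustration):

```python
import numpy as np

def term_contribution(X):
    """X: documents x terms matrix of normalized weights (e.g., L2-normalized TF-IDF).
    Returns, for each term t, the sum over document pairs i<j of X[i,t]*X[j,t]."""
    col_sum = X.sum(axis=0)
    col_sq = (X ** 2).sum(axis=0)
    return (col_sum ** 2 - col_sq) / 2.0

X = np.random.RandomState(1).rand(5, 8)
X /= np.linalg.norm(X, axis=1, keepdims=True)
print(term_contribution(X).round(3))
# Note the bias discussed above: frequent terms with large weights dominate the score.
```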
In most of these methods, the optimization of term selection is based on some pre-assumed similarity function (e.g., cosine). While this strategy makes these methods unsupervised, there is a concern that the term selection might be biased due to the potential bias of the assumed similarity function. That is, if a different similarity function is assumed, we may end up having different results for term selection. Thus the choice of an appropriate similarity function may be important for these methods.
2.2 LSI-based Methods
In feature selection, we attempt to explicitly select out features from the original data set. Feature transformation is a different method in which the new features are defined as a functional representation of the features in the original data set. The most common class of methods is that of dimensionality reduction [53] in which the documents are transformed to a new feature space of smaller dimensionality in which the features are typically a linear combination of the features in the original data. Methods such as Latent Semantic Indexing (LSI) [28] are based on this common principle. The overall effect is to remove a lot of dimensions in the data which are noisy for similarity based applications such as clustering. The removal of such dimensions also helps magnify the semantic effects in the underlying data.
Since LSI is closely related to the problem of Principal Component Analysis (PCA) or Singular Value Decomposition (SVD), we will first discuss this method, and its relationship to LSI. For a d-dimensional data set, PCA constructs the symmetric d × d covariance matrix C of the data, in which the (i, j)th entry is the covariance between dimensions i and j. This matrix is positive semi-definite, and can be diagonalized as follows:
$$ C = P \cdot D \cdot P^{T} $$
Here P is a matrix whose columns contain the orthonormal eigenvectors of C, and D is a diagonal matrix containing the corresponding eigenvalues. We note that the eigenvectors represent a new orthonormal basis system along which the data can be represented. In this context, the eigenvalues correspond to the variance when the data is projected along this basis system. This basis system is also one in which the second order covariances of the data are removed, and most of the variance in the data is captured by preserving the eigenvectors with the largest eigenvalues. Therefore, in order to reduce the dimensionality of the data, a common approach is to represent the data in this new basis system, which is further truncated by ignoring those eigenvectors for which the corresponding eigenvalues are small. This is because the variances along those dimensions are small, and the relative behavior of the data points is not significantly affected by removing them from consideration. In fact, it can be shown that the Euclidean distances between data points are not significantly affected by this transformation and corresponding truncation. The method of PCA is commonly used for similarity search in database retrieval applications.
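The PCA procedure described above can be summarized in a few lines of NumPy (toy data assumed): construct the covariance matrix, diagonalize it, and keep only the eigenvectors with the largest eigenvalues:

```python
import numpy as np

X = np.random.RandomState(0).rand(20, 6)          # 20 points in d = 6 dimensions
C = np.cov(X, rowvar=False)                       # d x d covariance matrix
eigvals, P = np.linalg.eigh(C)                    # eigh: C is symmetric PSD
order = np.argsort(eigvals)[::-1]                 # sort by decreasing variance
k = 3
P_k = P[:, order[:k]]                             # truncated basis
X_reduced = (X - X.mean(axis=0)) @ P_k            # project onto the top-k eigenvectors
print(X_reduced.shape)                            # (20, 3)
```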
LSI is quite similar to PCA, except that we use an approximation of the covariance matrix C which is quite appropriate for the sparse and high-dimensional nature of text data. Specifically, let A be the n × d term-document matrix in which the (i,j)th entry is the normalized frequency for term j in document i. Then, A^T · A is a d × d matrix which is a close (scaled) approximation of the covariance matrix, in which the means have not been subtracted out. In other words, the value of A^T · A would be the same as a scaled version (by factor n) of the covariance matrix, if the data is mean-centered. While text-representations are not mean-centered, the sparsity of text ensures that the use of A^T · A is quite a good approximation of the (scaled) covariances. As in the case of numerical data, we use the eigenvectors of A^T · A with the largest variance in order to represent the text. In typical collections, only about 300 to 400 eigenvectors are required for the representation. One excellent characteristic of LSI [28] is that the truncation of the dimensions removes the noise effects of synonymy and polysemy, and the similarity computations are more closely affected by the semantic concepts in the data. This is particularly useful for a semantic application such as text clustering. However, if finer granularity clustering is needed, such low-dimensional space representation of text may not be sufficiently discriminative; in information retrieval, this problem is often solved by mixing the low-dimensional representation with the original high-dimensional word-based representation (see, e.g., [105]).
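A minimal LSI sketch is shown below, assuming scikit-learn and a toy corpus; TruncatedSVD works directly on the sparse term-document matrix without mean-centering, which matches the approximation discussed above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the cat sat on the mat",
    "a feline rested on a rug",
    "stock markets fell sharply today",
    "equity prices dropped in trading",
]
A = TfidfVectorizer().fit_transform(docs)   # n x d term-document matrix
lsi = TruncatedSVD(n_components=2)          # in practice, ~300-400 components are typical
A_latent = lsi.fit_transform(A)             # documents in the latent concept space
print(A_latent.round(2))
```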
A similar technique to LSI, but based on probabilistic modeling is Probabilistic Latent Semantic Analysis (PLSA) [49]. The similarity and equivalence of PLSA and LSI are discussed in [49].
2.2.1 Concept Decomposition using Clustering. One interesting observation is that while feature transformation is often used as a pre-processing technique for clustering, the clustering itself can be used for a novel dimensionality reduction technique known as concept decomposition [2, 29]. This of course leads to the issue of circularity in the use of this technique for clustering, especially if clustering is required in order to perform the dimensionality reduction. Nevertheless, it is still possible to use this technique effectively for pre-processing with the use of two separate phases of clustering.
The technique of concept decomposition uses any standard clustering technique [2, 29] on the original representation of the documents. The frequent terms in the centroids of these clusters are used as basis vectors which are almost orthogonal to one another. The documents can then be represented in a much more concise way in terms of these basis vectors. We note that this condensed conceptual representation allows for enhanced clustering as well as classification. Therefore, a second phase of clustering can be applied on this reduced representation in order to cluster the documents much more effectively. Such a method has also been tested in [87] by using word-clusters in order to represent documents. We will describe this method in more detail later in this chapter.
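A possible two-phase sketch of concept decomposition is shown below (scikit-learn and NumPy; the toy corpus and the choice of k-means for both phases are assumptions): the centroids of a first clustering round serve as concept vectors, documents are re-expressed in terms of these concepts, and a second clustering round is run on the condensed representation:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "cats and dogs are popular pets",
    "dogs chase cats in the yard",
    "stocks and bonds form a portfolio",
    "bond yields moved with stock prices",
]
A = TfidfVectorizer().fit_transform(docs).toarray()

# Phase 1: cluster the original representation; centroids act as concept vectors.
concepts = KMeans(n_clusters=2, n_init=10, random_state=0).fit(A).cluster_centers_

# Concise representation: each document expressed in terms of the concept vectors.
A_concept = A @ concepts.T

# Phase 2: cluster the condensed conceptual representation.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(A_concept)
print(labels)
```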
2.3 Non-negative Matrix Factorization
The non-negative matrix factorization (NMF) technique is a latent-space method, and is particularly suitable to clustering [97]. As in the case of LSI, the NMF scheme represents the documents in a new axis-system which is based on an analysis of the term-document matrix. However, the NMF method has a number of critical differences from the LSI scheme from a conceptual point of view. In particular, the NMF scheme is a feature transformation method which is particularly suited to clustering. The main conceptual characteristics of the NMF scheme, which are very different from LSI, are as follows:
- In LSI, the new basis system consists of a set of orthonormal vectors. This is not the case for NMF.
- In NMF, the vectors in the basis system directly correspond to cluster topics. Therefore, the cluster membership for a document may be determined by examining the largest component of the document along any of the vectors. The coordinate of any document along a vector is always non-negative. The expression of each document as an additive combination of the underlying semantics makes a lot of sense from an intuitive perspective. Therefore, the NMF transformation is particularly suited to clustering, and it also provides an intuitive understanding of the basis system in terms of the clusters.
Let A be the n × d term document matrix. Let us assume that we wish to create k clusters from the underlying document corpus. Then, the non-negative matrix factorization method attempts to determine the matrices U and V which minimize the following objective function:
$$ J = \frac{1}{2}\, \| A - U \cdot V^{T} \| $$
Here || · || represents the sum of the squares of all the elements in the matrix, U is an n×k non-negative matrix, and V is a d×k non-negative matrix. We note that the columns of V provide the k basis vectors which correspond to the k different clusters.
What is the significance of the above optimization problem? Note that by minimizing J, we are attempting to factorize A approximately as:
$$ A \approx U \cdot V^{T} $$
For each row a of A (document vector), we can rewrite the above equation as:
$$ a \approx u \cdot V^{T} $$
Here u is the corresponding row of U. Therefore, the document vector a can be rewritten as an approximate linear (non-negative) combination of the basis vectors which correspond to the k columns of V. If the value of k is relatively small compared to the corpus, this can only be done if the column vectors of V discover the latent structure in the data. Furthermore, the non-negativity of the matrices U and V ensures that the documents are expressed as a non-negative combination of the key concepts (or clustered) regions in the term-based feature space.
Next, we will discuss how the optimization problem for J above is actually solved. The squared norm of any matrix Q can be expressed as the trace of the matrix Q · Q^T. Therefore, we can express the objective function above as follows:

$$ J = \frac{1}{2}\, tr\!\left[ (A - U \cdot V^{T}) \cdot (A - U \cdot V^{T})^{T} \right] = \frac{1}{2}\left[ tr(A \cdot A^{T}) - 2\, tr(A \cdot V \cdot U^{T}) + tr(U \cdot V^{T} \cdot V \cdot U^{T}) \right] $$
Thus, we have an optimization problem with respect to the matrices U and V, the entries u_ij and v_ij of which are the variables with respect to which we need to optimize this problem. In addition, since the matrices are non-negative, we have the constraints that u_ij ≥ 0 and v_ij ≥ 0. This is a typical constrained non-linear optimization problem, and can be solved using the Lagrange method. Let α = [α_ij] and β = [β_ij] be matrices with the same dimensions as U and V respectively. The elements of the matrices α and β are the corresponding Lagrange multipliers for the non-negativity conditions on the different elements of U and V respectively. We note that tr(α · U) is simply equal to Σ_{i,j} α_ij · u_ij and tr(β · V) is simply equal to Σ_{i,j} β_ij · v_ij. These correspond to the Lagrange expressions for the non-negativity constraints. Then, we can express the Lagrangian optimization problem as follows:

$$ L = J + tr(\alpha \cdot U) + tr(\beta \cdot V) $$
Then, we can express the partial derivative of L with respect to U and V as follows, and set them to 0:
$$ \frac{\partial L}{\partial U} = -A \cdot V + U \cdot V^{T} \cdot V + \alpha = 0 $$

$$ \frac{\partial L}{\partial V} = -A^{T} \cdot U + V \cdot U^{T} \cdot U + \beta = 0 $$
We can then multiply the (i,j)th entry of the above (two matrices of) conditions with u_ij and v_ij respectively. Using the Kuhn-Tucker conditions α_ij · u_ij = 0 and β_ij · v_ij = 0, we get the following:

$$ (A \cdot V)_{ij} \cdot u_{ij} - (U \cdot V^{T} \cdot V)_{ij} \cdot u_{ij} = 0 $$

$$ (A^{T} \cdot U)_{ij} \cdot v_{ij} - (V \cdot U^{T} \cdot U)_{ij} \cdot v_{ij} = 0 $$
We note that these conditions are independent of α and β. This leads to the following iterative updating rules for u_ij and v_ij:

$$ u_{ij} \leftarrow u_{ij} \cdot \frac{(A \cdot V)_{ij}}{(U \cdot V^{T} \cdot V)_{ij}} \qquad v_{ij} \leftarrow v_{ij} \cdot \frac{(A^{T} \cdot U)_{ij}}{(V \cdot U^{T} \cdot U)_{ij}} $$
It has been shown in [58] that the objective function continuously improves under these update rules, and converges to an optimal solution.
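The update rules above translate directly into a short NumPy sketch (the toy matrix, the random initialization, and the small constant added to the denominators for numerical stability are assumptions for illustration):

```python
import numpy as np

def nmf_cluster(A, k, n_iter=200, eps=1e-9):
    """A: n x d term-document matrix; returns U (n x k), V (d x k) and cluster labels."""
    rng = np.random.RandomState(0)
    n, d = A.shape
    U = rng.rand(n, k)
    V = rng.rand(d, k)
    for _ in range(n_iter):
        U *= (A @ V) / (U @ V.T @ V + eps)        # u_ij <- u_ij (A V)_ij / (U V^T V)_ij
        V *= (A.T @ U) / (V @ U.T @ U + eps)      # v_ij <- v_ij (A^T U)_ij / (V U^T U)_ij
    return U, V, U.argmax(axis=1)                 # document cluster = largest component of U's row

A = np.abs(np.random.RandomState(1).rand(6, 10))
U, V, labels = nmf_cluster(A, k=2)
print(labels, np.linalg.norm(A - U @ V.T))
```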
One interesting observation about the matrix factorization technique is that it can also be used to determine word-clusters instead of document clusters. Just as the columns of V provide a basis which can be used to discover document clusters, we can use the columns of U to discover a basis which corresponds to word clusters. As we will see later, document clusters and word clusters are closely related, and it is often useful to discover both simultaneously, as in frameworks such as co-clustering [30, 31, 75]. Matrix-factorization provides a natural way of achieving this goal. It has also been shown both theoretically and experimentally [33, 93] that the matrix-factorization technique is equivalent to another graph-structure based document clustering technique known as spectral clustering. An analogous technique called concept factorization was proposed in [98], which can also be applied to data points with negative values in them.
3. Distance-based Clustering Algorithms
Distance-based clustering algorithms are designed by using a similarity function to measure the closeness between the text objects. The most well known similarity function which is used commonly in the text domain is the cosine similarity function. Let $(f(u_1), \ldots, f(u_d))$ and $(f(v_1), \ldots, f(v_d))$ be the damped and normalized frequency term vectors of two different documents U and V. The values $u_i$ and $v_i$ represent the (normalized) term frequencies, and the function f(·) represents the damping function. Typical damping functions for f(·) could represent either the square-root or the logarithm [25]. Then, the cosine similarity between the two documents is defined as follows:

$$ \text{cosine}(U, V) = \frac{\sum_{i=1}^{d} f(u_i) \cdot f(v_i)}{\sqrt{\sum_{i=1}^{d} f(u_i)^{2}} \cdot \sqrt{\sum_{i=1}^{d} f(v_i)^{2}}} $$
Computation of text similarity is a fundamental problem in information retrieval. Although most of the work in information retrieval has focused on how to assess the similarity of a keyword query and a text document, rather than the similarity between two documents, many weighting heuristics and similarity functions can also be applied to optimize the similarity function for clustering. Effective information retrieval models generally capture three heuristics, i.e., TF weighting, IDF weighting, and document length normalization [36]. One effective way to assign weights to terms when representing a document as a weighted term vector is the BM25 term weighting method [78], where the normalized TF not only addresses length normalization, but also has an upper bound which improves robustness, as it avoids overly rewarding the matching of any particular term. A document can also be represented with a probability distribution over words (i.e., unigram language models), and the similarity can then be measured based on an information-theoretic measure such as cross entropy or Kullback-Leibler divergence [105]. For clustering, symmetric variants of such a similarity function may be more appropriate.
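The damped cosine measure defined above can be sketched as follows (NumPy; the square-root damping function and toy frequency vectors are illustrative choices):

```python
import numpy as np

def damped_cosine(u, v, f=np.sqrt):
    """Cosine similarity between two term-frequency vectors after damping with f."""
    fu, fv = f(u), f(v)
    return fu @ fv / (np.linalg.norm(fu) * np.linalg.norm(fv))

u = np.array([3.0, 0.0, 1.0, 2.0])   # term frequencies of document U
v = np.array([1.0, 1.0, 0.0, 4.0])   # term frequencies of document V
print(round(damped_cosine(u, v), 3))
```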