Latent Semantic Analysis (LSA) Tutorial, Part 7

WangBen 20110916 Beijing


Advantages, Disadvantages, and Applications of LSA


Latent Semantic Analysis has many nice properties that make it widely applicable to many problems.

1.    First, the documents and words end up being mapped to the same concept space. In this space we can cluster documents, cluster words, and most importantly, see how these clusters coincide so we can retrieve documents based on words and vice versa.

2.    Second, the concept space has vastly fewer dimensions compared to the original matrix. Not only that, but these dimensions have been chosen specifically because they contain the most information and least noise. This makes the new concept space ideal for running further algorithms, such as testing different clustering algorithms.

3.    Last, LSA is an inherently global algorithm that looks at trends and patterns across all documents and all words, so it can find things that may not be apparent to a more locally based algorithm. It can also be usefully combined with a more local algorithm such as nearest neighbors, becoming more useful than either algorithm by itself.
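The shared concept space in point 1 can be sketched with a truncated SVD. This is a minimal illustration using a made-up 5-term, 4-document count matrix (the term names and document themes are hypothetical, invented for this sketch):

```python
import numpy as np

# Hypothetical term-document count matrix (rows = terms, columns = documents).
# Terms: "cat", "dog", "pet", "stock", "market"; docs 0-1 are about pets,
# docs 2-3 about finance. A real corpus would be far larger and sparser.
A = np.array([
    [2, 1, 0, 0],   # cat
    [1, 2, 0, 0],   # dog
    [1, 1, 0, 0],   # pet
    [0, 0, 2, 1],   # stock
    [0, 0, 1, 2],   # market
], dtype=float)

# Truncated SVD: keep the k strongest singular triplets as "concept" axes.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
term_coords = U[:, :k] * s[:k]      # each term as a point in concept space
doc_coords = Vt[:k, :].T * s[:k]    # each document as a point in the SAME space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The two pet documents land close together and far from the finance documents.
print(cosine(doc_coords[0], doc_coords[1]))   # high
print(cosine(doc_coords[0], doc_coords[2]))   # near zero
```

Because terms and documents live in one space, the same `cosine` comparison works term-to-term, document-to-document, and term-to-document, which is what makes the "retrieve documents based on words and vice versa" trick possible.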


There are a few limitations that must be considered when deciding whether to use LSA. Some of these are:

1.    LSA's least-squares objective (minimizing the Frobenius norm of the reconstruction error) implicitly assumes Gaussian noise, which may not fit all problems. For example, word counts in documents follow something closer to a Poisson distribution than a Gaussian distribution.

2.    LSA cannot handle polysemy (words with multiple meanings) effectively. It assumes that the same word always means the same concept, which causes problems for words like "bank" whose meaning depends on the context in which they appear.

3.    LSA depends heavily on the SVD, which is computationally intensive and hard to update as new documents appear. However, recent work has led to an efficient algorithm that can update the SVD for new documents in a theoretically exact sense.
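In practice, the update problem in point 3 is often side-stepped with the classic "fold-in" approximation, which projects a new document into the existing concept space without recomputing the SVD. Note this is the cheap approximation, not the exact update algorithm the paragraph refers to. A minimal sketch with a made-up term-document matrix:

```python
import numpy as np

# Hypothetical 5-term, 4-document count matrix (terms x documents).
A = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, sk = U[:, :k], s[:k]

# Fold-in: d_hat = d^T U_k Sigma_k^{-1} places a new document's term-count
# vector into the existing concept space. The factors U, s themselves are
# NOT updated, so quality degrades as many new documents accumulate.
new_doc = np.array([1, 1, 1, 0, 0], dtype=float)  # a new pet-themed document
d_hat = new_doc @ Uk / sk
print(d_hat.shape)   # (2,)
```

Here the new document's strongest coordinate ends up on the "pet" concept axis, as expected, since it only uses pet-related terms.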


In spite of these limitations, LSA is widely used for finding and organizing search results, grouping documents into clusters, spam filtering, speech recognition, patent searches, automated essay evaluation, and more.


As an example, iMetaSearch uses LSA to map search results and words to a "concept" space. Users can then find which results are closest to which words and vice versa. The LSA results are also used to cluster search results together, so that you save time when looking for related results.
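The "which results are closest to which words" idea can be sketched in the same concept-space setup. iMetaSearch's actual pipeline is not public, so this toy matrix only illustrates the principle:

```python
import numpy as np

# Hypothetical term-document matrix; rows are the terms
# "cat", "dog", "pet", "stock", "market"; columns are four documents.
A = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
terms = U[:, :k] * s[:k]
docs = Vt[:k, :].T * s[:k]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank documents by closeness to the word "stock" (term row 3): the two
# finance documents (indices 2 and 3) score far above the pet documents.
scores = [cosine(terms[3], d) for d in docs]
```

Sorting `scores` gives the word-to-result ranking; running any off-the-shelf clustering on `docs` gives the grouped search results.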



