Latent Semantic Analysis (LSA) Tutorial

Source (reposted from): http://www.puffinwarellc.com/index.php/news-and-articles/articles/33-latent-semantic-analysis-tutorial.html


LSA, also known as Latent Semantic Indexing (LSI), can be used to analyze the meaning, or concepts, underlying a collection of documents.


If every word mapped to exactly one concept, and every concept were described by exactly one word, then LSI would be easy: we could simply build a one-to-one mapping between words and concepts, as in the figure below:

Unfortunately, in reality the mapping between words and concepts is not one-to-one but many-to-many, as in the figure below:


How does LSI work?

LSI originated as a way to solve the following problem: how to use search words to find relevant documents. When we compare words to find relevant documents, what we really want to compare is the meaning behind the words, not just their surface forms. LSI solves this by mapping both words and documents into a common concept space and doing the comparison in that space.


Because authors have many words to choose from when writing, different authors may pick different words for the same concept, which can leave the concepts blurred. This somewhat arbitrary choice of words introduces noise into the concept-word mapping. LSI filters out some of this noise and also tries to find the smallest set of concepts that spans all of the documents.


To do this, LSI makes the following simplifications:

1. Documents are represented as "bags of words": the order of the words in a document does not matter, only how many times each word appears (see the small sketch after this list).

2. A concept is represented as a group of words that frequently appear together in documents. For example, "leash", "treat", and "obey" might usually appear in documents about dog training.

3. Each word is assumed to have only one meaning.
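To make the bag-of-words idea concrete, here is a minimal sketch (my own illustration, not part of the original tutorial) showing that two made-up titles with the same words in a different order produce exactly the same bag of words:

from collections import Counter

# Two hypothetical titles containing the same words in a different order.
title_a = "value investing value book"
title_b = "book value value investing"

bag_a = Counter(title_a.lower().split())
bag_b = Counter(title_b.lower().split())

print(bag_a)           # Counter({'value': 2, 'investing': 1, 'book': 1})
print(bag_a == bag_b)  # True -- only word frequencies matter, not order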


A Small Example

For this example, I searched for books on Amazon.com using the word "investing" and took the top 10 results as test data. One title was dropped because it shared only one index word with the other titles, leaving 9 titles. An index word is a word that:

1. appears in two or more book titles, and

2. is not a stop word such as "and" or "the".


In this example, the stop words we remove are: "and", "edition", "for", "in", "little", "of", "the", "to".


Here are the remaining 9 titles; the index words in each title are the ones that satisfy the two rules above:

  1. The Neatest Little Guide to Stock Market Investing
  2. Investing For Dummies, 4th Edition
  3. The Little Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns
  4. The Little Book of Value Investing
  5. Value Investing: From Graham to Buffett and Beyond
  6. Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!
  7. Investing in Real Estate, 5th Edition
  8. Stock Investing For Dummies
  9. Rich Dad's Advisors: The ABC's of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss


After analyzing these titles with LSI, we can plot the index words and titles on an XY graph and identify the clusters they belong to. The 9 titles are plotted as blue circles and the 11 index words as red squares. Not only can we pick out the clusters of titles, we can also label them, because the index words are plotted right alongside the titles. In the figure below, one cluster, containing titles T7 and T9, is about real estate; another, containing T2, T4, T5, and T8, is about value investing; a third, containing T1 and T3, is about the stock market. Title T6 is an outlier.



The following parts walk through the steps of LSI one at a time.


Part 1 - Create the Count Matrix

The first step is to create the word-by-title matrix: each index word is a row and each title is a column, and each cell holds the number of times that word appears in that title. In general this matrix is very large but very sparse; most cells are zero, and zeros are left blank in the table below.


Index Words   T1   T2   T3   T4   T5   T6   T7   T8   T9
book                     1    1
dads                                    1              1
dummies             1                             1
estate                                       1         1
guide          1                        1
investing      1    1    1    1    1    1    1    1    1
market         1         1
real                                         1         1
rich                                    2              1
stock          1         1                        1
value                         1    1

Python Implementation and Walkthrough

Python - Getting Started

Download the python code here.

Throughout this article, we'll give Python code that implements all the steps necessary for doing Latent Semantic Analysis. We'll go through the code section by section and explain everything. The Python code used in this article can be downloaded here and then run in Python. You need to have already installed the Python NumPy and SciPy libraries.

Python - Import Functions

First we need to import a few functions from Python libraries to handle some of the math we need to do. NumPy is the Python numerical library, and we'll import zeros, a function that creates a matrix of zeros that we use when building our words by titles matrix. From the linear algebra part of the scientific package (scipy.linalg) we import the svd function that actually does the singular value decomposition, which is the heart of LSA.

from numpy import zeros
from scipy.linalg import svd


Python - Define Data

Next, we define the data that we are using. Titles holds the 9 book titles that we have gathered, stopwords holds the 8 common words that we are going to ignore when we count the words in each title, and ignorechars has all the punctuation characters that we will remove from words. We use Python's triple quoted strings, so there are actually only 4 punctuation symbols we are removing: comma (,), colon (:), apostrophe ('), and exclamation point (!).

titles = [
    "The Neatest Little Guide to Stock Market Investing",
    "Investing For Dummies, 4th Edition",
    "The Little Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns",
    "The Little Book of Value Investing",
    "Value Investing: From Graham to Buffett and Beyond",
    "Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!",
    "Investing in Real Estate, 5th Edition",
    "Stock Investing For Dummies",
    "Rich Dad's Advisors: The ABC's of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss"
]
stopwords = ['and', 'edition', 'for', 'in', 'little', 'of', 'the', 'to']
ignorechars = ''',:'!'''


Python - Define LSA Class

The LSA class has methods for initialization, parsing documents, building the matrix of word counts, and calculating. The first method is the __init__ method, which is called whenever an instance of the LSA class is created. It stores the stopwords and ignorechars so they can be used later, and then initializes the word dictionary and the document count variables.

class LSA(object):
    def __init__(self, stopwords, ignorechars):
        self.stopwords = stopwords
        self.ignorechars = ignorechars
        self.wdict = {}
        self.dcount = 0


Python - Parse Documents

The parse method takes a document, splits it into words, removes the ignored characters and turns everything into lowercase so the words can be compared to the stop words. If the word is a stop word, it is ignored and we move on to the next word. If it is not a stop word, we put the word in the dictionary, and also append the current document number to keep track of which documents the word appears in.

The documents that each word appears in are kept in a list associated with that word in the dictionary. For example, the word book appears in titles 3 and 4; since the document count starts at 0, we would have self.wdict['book'] = [2, 3] after all titles are parsed.

After processing all words from the current document, we increase the document count in preparation for the next document to be parsed.

    def parse(self, doc):
        words = doc.split()
        for w in words:
            w = w.lower().translate(None, self.ignorechars)
            if w in self.stopwords:
                continue
            elif w in self.wdict:
                self.wdict[w].append(self.dcount)
            else:
                self.wdict[w] = [self.dcount]
        # all words in this document have been processed; move on to the next document
        self.dcount += 1


Python - Build the Count Matrix

Once all documents are parsed, all the words (dictionary keys) that are in more than 1 document are extracted and sorted, and a matrix is built with the number of rows equal to the number of words (keys), and the number of columns equal to the document count. Finally, for each word (key) and document pair the corresponding matrix cell is incremented.

    def build(self):
        self.keys = [k for k in self.wdict.keys() if len(self.wdict[k]) > 1]
        self.keys.sort()
        self.A = zeros([len(self.keys), self.dcount])
        for i, k in enumerate(self.keys):
            for d in self.wdict[k]:
                self.A[i,d] += 1


Python - Print the Count Matrix

The printA() method is very simple: it just prints out the matrix that we have built so it can be checked.

    def printA(self):
        print self.A


Python - Test the LSA Class

After defining the LSA class, it's time to try it out on our 9 book titles. First we create an instance of LSA, called mylsa, and pass it the stopwords and ignorechars that we defined. During creation, the __init__ method is called which stores the stopwords and ignorechars and initializes the word dictionary and document count.

Next, we call the parse method on each title. This method extracts the words in each title, strips out punctuation characters, converts each word to lower case, throws out stop words, and stores remaining words in a dictionary along with what title number they came from.

Finally we call the build() method to create the matrix of word by title counts. This extracts all the words we have seen so far, throws out words that occur in less than 2 titles, sorts them, builds a zero matrix of the right size, and then increments the proper cell whenever a word appears in a title.

mylsa = LSA(stopwords, ignorechars)
for t in titles:
    mylsa.parse(t)
mylsa.build()
mylsa.printA()

Here is the raw output produced by printA(). As you can see, it's the same as the matrix that we showed earlier.

[[ 0. 0. 1. 1. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 1. 0. 0. 1.]
[ 0. 1. 0. 0. 0. 0. 0. 1. 0.]
[ 0. 0. 0. 0. 0. 0. 1. 0. 1.]
[ 1. 0. 0. 0. 0. 1. 0. 0. 0.]
[ 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[ 1. 0. 1. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 1. 0. 1.]
[ 0. 0. 0. 0. 0. 2. 0. 0. 1.]
[ 1. 0. 1. 0. 0. 0. 0. 1. 0.]
[ 0. 0. 0. 1. 1. 0. 0. 0. 0.]]


Part 2 - Modify the Counts with TFIDF

In sophisticated Latent Semantic Analysis systems, the raw matrix counts are usually modified so that rare words are weighted more heavily than common words. For example, a word that occurs in only 5% of the documents should probably be weighted more heavily than a word that occurs in 90% of the documents. The most popular weighting is TFIDF (Term Frequency - Inverse Document Frequency). Under this method, the count in each cell is replaced by the following formula.

TFIDF[i,j] = ( N[i,j] / N[*,j] ) * log( D / D[i] ) where

  • N[i,j] = the number of times word i appears in document j (the original cell count).

  • N[*,j] = the total number of words in document j (just add the counts in column j).

  • D = the number of documents (the number of columns).

  • D[i] = the number of documents in which word i appears (the number of non-zero cells in row i).

In this formula, words that are concentrated in certain documents are emphasized (by the N[i,j] / N[*,j] ratio) and words that only appear in a few documents are also emphasized (by the log( D / D[i] ) term).
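As a quick worked example (my own, using the counts from the matrix above): the word "rich" appears twice in title 6, title 6 contains 5 index-word occurrences in total, there are 9 titles, and "rich" appears in 2 of them. Assuming the natural logarithm:

from math import log

# Worked example for the cell ("rich", T6), using the formula above.
n_ij = 2.0   # "rich" appears twice in title 6
n_j  = 5.0   # title 6 contains 5 index words in total (its column sum)
D    = 9.0   # 9 titles in the collection
d_i  = 2.0   # "rich" appears in 2 titles (T6 and T9)

print((n_ij / n_j) * log(D / d_i))  # about 0.60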

Since we have such a small example, we will skip this step and move on to the heart of LSA: doing the singular value decomposition of our matrix of counts. However, if we did want to add TFIDF to our LSA class, we could add the following two lines at the beginning of our Python file to import the log, asarray, and sum functions.

from math import log
from numpy import asarray, sum

Then we would add the following TFIDF method to our LSA class. WordsPerDoc (N[*,j]) just holds the sum of each column, which is the total number of index words in each document. DocsPerWord (D[i]) uses asarray to create an array of True and False values, depending on whether each cell value is greater than 0 or not; the 'i' argument turns these into 1's and 0's instead. Each row is then summed up, which tells us how many documents each word appears in. Finally, we just step through each cell and apply the formula. We do have to change cols (which is the number of documents) into a float to prevent integer division.

    def TFIDF(self):
        WordsPerDoc = sum(self.A, axis=0)
        DocsPerWord = sum(asarray(self.A > 0, 'i'), axis=1)
        rows, cols = self.A.shape
        for i in range(rows):
            for j in range(cols):
                self.A[i,j] = (self.A[i,j] / WordsPerDoc[j]) * log(float(cols) / DocsPerWord[i])


Part 3 - Using the Singular Value Decomposition

Once we have built our (words by titles) matrix, we call upon a powerful but little known technique called Singular Value Decomposition or SVD to analyze the matrix for us. The "Singular Value Decomposition Tutorial" is a gentle introduction for readers that want to learn more about this powerful and useful algorithm.

The reason SVD is useful is that it finds a reduced dimensional representation of our matrix that emphasizes the strongest relationships and throws away the noise. In other words, it makes the best possible reconstruction of the matrix with the least possible information. To do this, it throws out noise, which does not help, and emphasizes strong patterns and trends, which do help. The trick in using SVD is in figuring out how many dimensions or "concepts" to use when approximating the matrix. Too few dimensions and important patterns are left out, too many and noise caused by random word choices will creep back in.
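The idea of "the best possible reconstruction with the least possible information" can be made concrete with a small sketch (my own illustration, not part of the original code): keep only the k largest singular values and rebuild an approximation of the matrix from them.

import numpy as np

# A minimal sketch of a rank-k approximation, assuming A is a word-by-title
# count matrix such as the one built by the LSA class (e.g. mylsa.A).
def rank_k_approximation(A, k):
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep the k largest singular values and the matching columns of U / rows of Vt.
    return np.dot(U[:, :k], np.dot(np.diag(S[:k]), Vt[:k, :]))

# For example, a rank-3 approximation keeps the strongest patterns and drops the rest:
# A3 = rank_k_approximation(mylsa.A, 3)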

The SVD algorithm is a little involved, but fortunately Python has a library function that makes it simple to use. By adding the one line method below to our LSA class, we can factor our matrix into 3 other matrices. The U matrix gives us the coordinates of each word on our "concept" space, the Vt matrix gives us the coordinates of each document in our "concept" space, and the S matrix of singular values gives us a clue as to how many dimensions or "concepts" we need to include.

    def calc(self):
        self.U, self.S, self.Vt = svd(self.A)

In order to choose the right number of dimensions to use, we can make a histogram of the square of the singular values. This graphs the importance each singular value contributes to approximating our matrix. Here is the histogram in our example.



[Figure: Singular Value Importance]
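Here is one way such a histogram could be produced (my own sketch, assuming matplotlib is installed and mylsa.calc() has been called so that mylsa.S holds the singular values, largest first):

import matplotlib.pyplot as plt

# The squared singular values show how much each dimension contributes
# to approximating the count matrix.
importance = mylsa.S ** 2
plt.bar(range(1, len(importance) + 1), importance)
plt.xlabel("Singular value number")
plt.ylabel("Importance (squared singular value)")
plt.title("Singular Value Importance")
plt.show()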

For a large collection of documents, 100 to 500 dimensions are typically used. In our small example, since we want to be able to draw a clear picture, we use only 3 dimensions, throw away the first dimension, and plot the second and third dimensions.

Why do we throw away the first dimension? For documents, the first dimension correlates with the length of the document; for words, it correlates with the number of times the word appears across all documents. If we centered the matrix, by subtracting the average value of each column from that column, then we could use the first dimension.

However, centering is usually not done for LSI, because it turns a sparse matrix into a dense one, which greatly increases memory use and computation time. It is more efficient not to center the matrix and simply discard the first dimension.
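For completeness, if you did want to center the matrix (which we skip here), a minimal sketch might look like the following; note that the result is a dense matrix:

from numpy import mean

# A hypothetical helper, not used in this tutorial: subtract each column's
# average from that column. After centering, the first SVD dimension becomes
# meaningful, but the matrix is no longer sparse.
def center_columns(A):
    return A - mean(A, axis=0)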

Here is the complete 3-dimensional Singular Value Decomposition of our matrix. Each word has three numbers associated with it, one for each dimension. The first number for a word tends to correspond to the number of times that word appears in all the titles, so it is not as informative as the second and third dimensions. Similarly, each title has three numbers associated with it, one for each dimension. Here too, the first dimension corresponds roughly to the number of index words in the title, i.e. its length, so it is also discarded.

book 0.15 -0.27 0.04
dads 0.24 0.38 -0.09
dummies 0.13 -0.17 0.07
estate 0.18 0.19 0.45
guide 0.22 0.09 -0.46
investing 0.74 -0.21 0.21
market 0.18 -0.30 -0.28
real 0.18 0.19 0.45
rich 0.36 0.59 -0.34
stock 0.25 -0.42 -0.28
value 0.12 -0.14 0.23
*

3.91 0 0
0 2.61 0
0 0 2.00
*

T1 T2 T3 T4 T5 T6 T7 T8 T9
0.35 0.22 0.34 0.26 0.22 0.49 0.28 0.29 0.44
-0.32 -0.15 -0.46 -0.24 -0.14 0.55 0.07 -0.31 0.44
-0.41 0.14 -0.16 0.25 0.22 -0.51 0.55 0.00 0.34


Part 4 - Clustering by Color

We can convert these numbers into colors: blue for negative values, red for positive values, and white for values close to zero:

[Figure: Top 3 Dimensions of Book Titles]

We can use these colors to cluster the titles. We ignore the first dimension for clustering because all titles are red. In the second dimension, we have the following result.

Dim2 Titles
red 6-7, 9
blue 1-5, 8

Using the third dimension, we can split each of these groups again the same way. For example, looking at the third dimension, title 6 is blue, but title 7 and title 9 are still red. Doing this for both groups, we end up with these 4 groups.

Dim2 Dim3 Titles
red red 7, 9
red blue 6
blue red 2, 4-5, 8
blue blue 1, 3
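This grouping can also be reproduced programmatically from the signs of the second and third rows of Vt (a small sketch of my own, assuming mylsa.calc() has been run and that the signs come out as in the tables above; SVD sign conventions can flip an entire row):

# Group titles by the sign of their 2nd and 3rd dimension values.
# Positive corresponds to "red" and negative to "blue" in the tables above.
groups = {}
for j in range(mylsa.Vt.shape[1]):
    key = ('red' if mylsa.Vt[1, j] >= 0 else 'blue',
           'red' if mylsa.Vt[2, j] >= 0 else 'blue')
    groups.setdefault(key, []).append(j + 1)  # +1 so titles are numbered T1..T9
print(groups)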

It’s interesting to compare this table with what we get when we graph the results in the next section.

Part 5 - Clustering by Value

Leaving out the first dimension, as we discussed, let's graph the second and third dimensions using an XY graph. We'll put the second dimension on the X axis and the third dimension on the Y axis and graph each word and title. It's interesting to compare the XY graph with the table we just created that clusters the documents.

In the graph below, words are represented by red squares and titles are represented by blue circles. For example the word "book" has dimension values (0.15, -0.27, 0.04). We ignore the first dimension value 0.15 and graph "book" to position (x = -0.27, y = 0.04) as can be seen in the graph. Titles are similarly graphed.
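Here is a sketch of how such a graph could be drawn with matplotlib (my own code, assuming mylsa.calc() has been called; the original article only shows the finished figure):

import matplotlib.pyplot as plt

# Words: columns 1 and 2 of U hold the 2nd and 3rd dimensions (column 0 is ignored).
plt.scatter(mylsa.U[:, 1], mylsa.U[:, 2], marker='s', color='red')
for i, word in enumerate(mylsa.keys):
    plt.annotate(word, (mylsa.U[i, 1], mylsa.U[i, 2]))

# Titles: rows 1 and 2 of Vt hold the 2nd and 3rd dimensions.
plt.scatter(mylsa.Vt[1, :], mylsa.Vt[2, :], marker='o', color='blue')
for j in range(mylsa.Vt.shape[1]):
    plt.annotate('T%d' % (j + 1), (mylsa.Vt[1, j], mylsa.Vt[2, j]))

plt.xlabel("Dimension 2")
plt.ylabel("Dimension 3")
plt.show()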

[Figure: XY graph of the words and titles in the second and third dimensions]

One advantage of this technique is that both words and titles are placed on the same graph. Not only can we identify clusters of titles, but we can also label the clusters by looking at what words are also in the cluster. For example, the lower left cluster has titles 1 and 3 which are both about stock market investing. The words "stock" and "market" are conveniently located in the cluster, making it easy to see what the cluster is about. Another example is the middle cluster which has titles 2, 4, 5, and, to a somewhat lesser extent, title 8. Titles 2, 4, and 5 are close to the words "value" and "investing" which summarizes those titles quite well.


Advantages, Disadvantages, and Applications of LSI

Advantages:

1. Documents and words are mapped into the same concept space. In this space we can cluster documents, cluster words, and, more importantly, search for documents given a set of words and vice versa (see the sketch after this list).

2. The concept space has far fewer dimensions than the original matrix, and those dimensions carry the most important information with the least noise, so the concept space is a good input for other algorithms, for example for trying out different clustering algorithms.

3. LSI is a global algorithm: it looks for trends and patterns across all the words and all the documents, so it can find information that local algorithms may miss. It can also be combined with local algorithms, such as nearest neighbours, to become even more useful.
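For instance, one simple way to use point 1 above (a sketch of my own, not covered in the original article) is to compare a word and a title by the cosine similarity of their coordinates in the concept space, in the same spirit as the XY graph above:

import numpy as np

# Assumes mylsa.calc() has been run, so U, S, and Vt are available.
def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

word_vec  = mylsa.U[mylsa.keys.index('stock'), :3]   # the word "stock" in concept space
title_vec = mylsa.Vt[:3, 0]                          # title T1 in concept space
print(cosine(word_vec, title_vec))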


Disadvantages:

1. LSI assumes a Gaussian distribution and the Frobenius norm, which may not fit all data. For example, word counts in documents follow a Poisson distribution rather than a Gaussian one.

2. LSI assumes that each word has a single meaning, so it cannot handle polysemy (words with multiple meanings).

3. LSI depends on SVD, which is computationally expensive and hard to update as new documents arrive.


Despite these drawbacks, LSI is widely used, for example for finding and organizing search results, clustering documents, spam filtering, speech recognition, patent search, and automated essay grading.


