Latent Semantic Analysis (LSA) Tutorial: An Introduction to LSA (3)

Part 1 - Creating the Count Matrix

The first step in Latent Semantic Analysis is to create the word by title (or document) matrix. In this matrix, each index word is a row and each title is a column. Each cell contains the number of times that word occurs in that title. For example, the word "book" appears one time in title T3 and one time in title T4, whereas "investing" appears one time in every title. In general, the matrices built during LSA tend to be very large, but also very sparse (most cells contain 0). That is because each title or document usually contains only a small fraction of all the possible words. More sophisticated LSA implementations can take advantage of this sparseness to save both memory and time.
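As a rough illustration of how sparseness can save memory, the sketch below stores only the nonzero cells of a small word by title matrix in a dictionary keyed by (row, column). The helper names to_sparse and get_count and the 3-row sample matrix are assumptions made for this example, not part of the tutorial's code; a real implementation would more likely rely on a dedicated sparse matrix library.

```python
def to_sparse(dense):
    """Keep only the nonzero cells of a dense list-of-lists matrix."""
    return {(i, j): v
            for i, row in enumerate(dense)
            for j, v in enumerate(row)
            if v != 0}


def get_count(sparse, i, j):
    """Look up a cell; anything not stored is an implicit zero."""
    return sparse.get((i, j), 0)


# Hypothetical sample: three index words counted over 9 titles.
dense = [
    [0, 0, 1, 1, 0, 0, 0, 0, 0],  # book
    [1, 1, 1, 1, 1, 1, 1, 1, 1],  # investing
    [0, 0, 0, 1, 1, 0, 0, 0, 0],  # value
]
sparse = to_sparse(dense)
print(len(sparse), "stored cells instead of", len(dense) * len(dense[0]))
```

The saving grows with the matrix: the stored size depends on the number of nonzero cells rather than on rows times columns.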

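In the following matrix, we have left out the 0's to reduce clutter.

| Index Words | T1 | T2 | T3 | T4 | T5 | T6 | T7 | T8 | T9 |
|-------------|----|----|----|----|----|----|----|----|----|
| book        |    |    | 1  | 1  |    |    |    |    |    |
| dads        |    |    |    |    |    | 1  |    |    | 1  |
| dummies     |    | 1  |    |    |    |    |    | 1  |    |
| estate      |    |    |    |    |    |    | 1  |    | 1  |
| guide       | 1  |    |    |    |    | 1  |    |    |    |
| investing   | 1  | 1  | 1  | 1  | 1  | 1  | 1  | 1  | 1  |
| market      | 1  |    | 1  |    |    |    |    |    |    |
| real        |    |    |    |    |    |    | 1  |    | 1  |
| rich        |    |    |    |    |    | 2  |    |    | 1  |
| stock       | 1  |    | 1  |    |    |    |    | 1  |    |
| value       |    |    |    | 1  | 1  |    |    |    |    |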

Python - Getting Started

Download the python code here.

Throughout this article, we'll give Python code that implements all the steps necessary for doing Latent Semantic Analysis. We'll go through the code section by section and explain everything. The Python code used in this article can be downloaded here and then run in Python. You need to have already installed the Python NumPy and SciPy libraries.

Python - Import Functions

First we need to import a few functions from Python libraries to handle some of the math we need to do. NumPy is the Python numerical library, and we'll import zeros, a function that creates a matrix of zeros that we use when building our words by titles matrix. From the linear algebra part of the scientific package (scipy.linalg) we import the svd function that actually does the singular value decomposition, which is the heart of LSA.
```python
from numpy import zeros
from scipy.linalg import svd
```
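To make the two imports concrete, here is a minimal sketch of what they do on a tiny matrix. The 3 by 4 matrix and the variable names are made up for illustration, and the full_matrices=False argument is a choice made here to get the compact factors; the tutorial itself may call svd differently.

```python
import numpy as np
from numpy import zeros
from scipy.linalg import svd

# zeros takes a shape and returns an array filled with 0.0
A = zeros([3, 4])
A[0, 1] = 1
A[1, 1] = 2
A[2, 3] = 1

# full_matrices=False returns the compact ("economy") factors
U, s, Vh = svd(A, full_matrices=False)

# For a 3x4 input: U is 3x3, s holds 3 singular values, Vh is 3x4
print(U.shape, s.shape, Vh.shape)

# Multiplying the factors back together recovers A up to rounding
A_rebuilt = U @ np.diag(s) @ Vh
print(np.allclose(A, A_rebuilt))
```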

 

Python - Define Data

Next, we define the data that we are using. titles holds the 9 book titles that we have gathered, stopwords holds the 8 common words that we are going to ignore when we count the words in each title, and ignorechars has all the punctuation characters that we will remove from words. We use Python's triple quoted strings, so there are actually only 4 punctuation symbols we are removing: comma (,), colon (:), apostrophe ('), and exclamation point (!).

```python
# The 9 book titles used as documents
titles = [
    "The Neatest Little Guide to Stock Market Investing",
    "Investing For Dummies, 4th Edition",
    "The Little Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns",
    "The Little Book of Value Investing",
    "Value Investing: From Graham to Buffett and Beyond",
    "Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!",
    "Investing in Real Estate, 5th Edition",
    "Stock Investing For Dummies",
    "Rich Dad's Advisors: The ABC's of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss"
]

# Common words ignored when counting
stopwords = ['and', 'edition', 'for', 'in', 'little', 'of', 'the', 'to']
# Punctuation stripped from words: comma, colon, apostrophe, exclamation point
ignorechars = ''',:'!'''
```
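The triple quoted string can be hard to read at a glance, so the following short check spells out what it contains. It only restates the ignorechars definition from the text; nothing else is assumed.

```python
ignorechars = ''',:'!'''

# The outer ''' pairs are delimiters; only the characters between them count.
print(len(ignorechars))    # number of characters in the string
print(list(ignorechars))   # the individual characters

# The same value written with ordinary double quotes
same = ",:'!"
print(ignorechars == same)
```

Triple quotes are used here only so that the apostrophe can appear inside the string without escaping.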


 

Python - Define LSA Class

The LSA class has methods for initialization, parsing documents, building the matrix of word counts, and calculating. The first method is the __init__ method, which is called whenever an instance of the LSA class is created. It stores the stopwords and ignorechars so they can be used later, and then initializes the word dictionary (wdict) and the document count (dcount), which also serves as the number of the document currently being parsed.

```python
class LSA(object):

    def __init__(self, stopwords, ignorechars):
        self.stopwords = stopwords      # words to skip when counting
        self.ignorechars = ignorechars  # punctuation to strip from words
        self.wdict = {}                 # word -> list of document numbers
        self.dcount = 0                 # number of documents parsed so far
```


 

Python - Parse Documents

The parse method takes a document, splits it into words, removes the ignored characters and turns everything into lowercase so the words can be compared to the stop words. If the word is a stop word, it is ignored and we move on to the next word. If it is not a stop word, we put the word in the dictionary, and also append the current document number to keep track of which documents the word appears in.

The documents that each word appears in are kept in a list associated with that word in the dictionary. Document numbers start at 0, so since the word book appears in titles T3 and T4, we would have self.wdict['book'] = [2, 3] after all titles are parsed. In effect, this builds an inverted index.

After processing all words from the current document, we increase the document count in preparation for the next document to be parsed.

Usage of translate: http://blog.csdn.net/maoersong/article/details/22381797

```python
    # Method of the LSA class
    def parse(self, doc):
        words = doc.split()
        # Python 3: build a table that deletes the ignored characters
        # (the Python 2 form was translate(None, self.ignorechars))
        table = str.maketrans('', '', self.ignorechars)
        for w in words:
            w = w.lower().translate(table)
            if w in self.stopwords:
                continue
            elif w in self.wdict:
                self.wdict[w].append(self.dcount)
            else:
                self.wdict[w] = [self.dcount]
        self.dcount += 1
```
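To see the cleaning steps in isolation, the sketch below applies them to single titles outside the class. The function name clean_words is hypothetical and exists only for this example; the stop words and ignored characters are the ones defined earlier, and the Python 3 form of translate is assumed.

```python
def clean_words(doc, stopwords, ignorechars):
    """Lowercase each word, strip ignored characters, drop stop words."""
    table = str.maketrans('', '', ignorechars)
    kept = []
    for w in doc.split():
        w = w.lower().translate(table)
        if w not in stopwords:
            kept.append(w)
    return kept


stopwords = ['and', 'edition', 'for', 'in', 'little', 'of', 'the', 'to']
ignorechars = ",:'!"

print(clean_words("Investing For Dummies, 4th Edition", stopwords, ignorechars))
# → ['investing', 'dummies', '4th']
print(clean_words("Rich Dad's Guide to Investing:", stopwords, ignorechars))
# → ['rich', 'dads', 'guide', 'investing']
```

Note how "Dad's" becomes "dads" once the apostrophe is removed, which is why "dads" is one of the index words in the count matrix.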


 

Python - Build the Count Matrix

Once all documents are parsed, all the words (dictionary keys) that are in more than 1 document are extracted and sorted, and a matrix is built with the number of rows equal to the number of words (keys), and the number of columns equal to the document count. Finally, for each word (key) and document pair the corresponding matrix cell is incremented.

Usage of enumerate: http://blog.csdn.net/maoersong/article/details/22378001

```python
    # Method of the LSA class
    def build(self):
        # Keep only words recorded more than once, in sorted order
        self.keys = [k for k in self.wdict.keys() if len(self.wdict[k]) > 1]
        self.keys.sort()
        # One row per kept word, one column per document
        self.A = zeros([len(self.keys), self.dcount])
        for i, k in enumerate(self.keys):
            for d in self.wdict[k]:
                self.A[i, d] += 1
```
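The same steps can be traced by hand on a toy index. The sketch below mirrors the build logic with plain Python lists instead of a NumPy array; the function name build_counts and the three-word index over 4 documents are made up for illustration.

```python
def build_counts(wdict, dcount):
    """Turn a word -> document-number list index into a count matrix."""
    # Keep only words recorded more than once, in sorted order
    keys = sorted(k for k in wdict if len(wdict[k]) > 1)
    # One row per kept word, one column per document, all zeros
    A = [[0] * dcount for _ in keys]
    for i, k in enumerate(keys):
        for d in wdict[k]:
            A[i][d] += 1
    return keys, A


# Hypothetical index over 4 documents
wdict = {
    'stock': [0, 2],
    'rich': [1, 1, 3],  # appears twice in document 1
    'guide': [0],       # recorded only once, so it is dropped
}
keys, A = build_counts(wdict, 4)
print(keys)
for row in A:
    print(row)
```

One detail worth noticing: the filter tests the length of the list, so a word recorded twice in a single document would also pass it. Testing len(set(wdict[k])) > 1 instead would count distinct documents; that is a possible variation, not what the code above does.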


 

Python - Print the Count Matrix

The printA() method is very simple: it just prints out the matrix that we have built so it can be checked.

```python
    # Method of the LSA class
    def printA(self):
        print(self.A)
```


 

Python - Test the LSA Class

After defining the LSA class, it's time to try it out on our 9 book titles. First we create an instance of LSA, called mylsa, and pass it the stopwords and ignorechars that we defined. During creation, the __init__ method is called, which stores the stopwords and ignorechars and initializes the word dictionary and document count.

Next, we call the parse method on each title. This method extracts the words in each title, strips out punctuation characters, converts each word to lower case, throws out stop words, and stores remaining words in a dictionary along with what title number they came from.

Finally we call the build() method to create the matrix of word by title counts. This extracts all the words we have seen so far, throws out words that occur in fewer than 2 titles, sorts them, builds a zero matrix of the right size, and then increments the proper cell whenever a word appears in a title.

```python
mylsa = LSA(stopwords, ignorechars)
for t in titles:
    mylsa.parse(t)
mylsa.build()
mylsa.printA()
```


Here is the raw output produced by printA(). As you can see, it's the same as the matrix that we showed earlier.
```
[[ 0. 0. 1. 1. 0. 0. 0. 0. 0.]
 [ 0. 0. 0. 0. 0. 1. 0. 0. 1.]
 [ 0. 1. 0. 0. 0. 0. 0. 1. 0.]
 [ 0. 0. 0. 0. 0. 0. 1. 0. 1.]
 [ 1. 0. 0. 0. 0. 1. 0. 0. 0.]
 [ 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [ 1. 0. 1. 0. 0. 0. 0. 0. 0.]
 [ 0. 0. 0. 0. 0. 0. 1. 0. 1.]
 [ 0. 0. 0. 0. 0. 2. 0. 0. 1.]
 [ 1. 0. 1. 0. 0. 0. 0. 1. 0.]
 [ 0. 0. 0. 1. 1. 0. 0. 0. 0.]]
```
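The raw output has no row or column labels, so it can help to map it back to words and titles. The sketch below does that with plain lists: the rows are copied from the output above, the row labels are the sorted index words from the earlier table, and the helper name titles_for is an assumption made for this example.

```python
# Row labels in sorted order, as in the word by title table
keys = ['book', 'dads', 'dummies', 'estate', 'guide', 'investing',
        'market', 'real', 'rich', 'stock', 'value']

# Rows copied from the printed matrix
A = [
    [0, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 0, 1],
    [0, 1, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 2, 0, 0, 1],
    [1, 0, 1, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 1, 0, 0, 0, 0],
]


def titles_for(word):
    """List the titles (T1..T9) whose column is nonzero for this word."""
    row = A[keys.index(word)]
    return ['T%d' % (j + 1) for j, count in enumerate(row) if count > 0]


print(titles_for('book'))   # → ['T3', 'T4']
print(titles_for('rich'))   # → ['T6', 'T9']
```

Column 0 of the matrix corresponds to title T1, which is why the helper adds 1 when building the label.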

Reposted from: http://blog.csdn.net/yihucha166/article/details/6795112

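4. Computing TF-IDF Instead of Simple Counts

5. Using Singular Value Decomposition

6. Clustering by Color

7. Advantages, Disadvantages, and Applications of LSA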

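The claim made there that LSA cannot handle polysemy is somewhat questionable. This article gives a rather good account of LSA: "搜索背后的奥秘——浅谈语义主题计算" (The Secrets Behind Search: A Brief Look at Semantic Topic Computation).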
