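Latent Semantic Analysis (LSA), also known as Latent Semantic Indexing (LSI), literally means analyzing documents to discover the latent meanings and concepts in them. If every word denoted exactly one concept, and every concept were described by exactly one word, LSA would be very simple: there would be a straightforward mapping from words to concepts.
Latent Semantic Analysis arose from the problem of how to find relevant documents from a search query. When we try to find relevant text by comparing words, we run into a fundamental limitation: what we actually want to compare in a search is not the words themselves but the meanings and concepts behind them. LSA tries to solve this problem by mapping both words and documents into a "concept" space and doing the comparison in that space (note: in other words, it is a dimensionality-reduction technique).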
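To make "comparing in a concept space" concrete, here is a minimal sketch that uses a truncated SVD, one common way of building such a space. The toy matrix, its word labels, and the helper `cosine` are made up for illustration and are not part of the original tutorial.

```python
import numpy as np

# Hypothetical toy term-document matrix (rows: words, columns: documents).
# d0 = "car engine", d1 = "automobile engine",
# d2 = "flower garden", d3 = "flower"
A = np.array([
    [1, 0, 0, 0],  # car
    [0, 1, 0, 0],  # automobile
    [1, 1, 0, 0],  # engine
    [0, 0, 1, 1],  # flower
    [0, 0, 1, 0],  # garden
], dtype=float)


def cosine(x, y):
    """Cosine similarity between two vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))


# Factor A = U * diag(s) * Vt and keep only the k strongest dimensions.
u, s, vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_coords = (np.diag(s[:k]) @ vt[:k]).T  # one k-dimensional row per document

raw_sim = cosine(A[:, 0], A[:, 1])                  # compare raw word counts
concept_sim = cosine(doc_coords[0], doc_coords[1])  # compare in concept space
print(raw_sim, concept_sim)
```

In this toy case d0 and d1 each contain one word the other lacks ("car" versus "automobile"), so their raw word-count cosine is only 0.5, while in the two-dimensional space they point in the same direction.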
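When authors write, they have a very wide choice of words. Different authors prefer different words, and this leads to a blurring of concepts. This effectively random choice of words introduces noise into the word-concept relationship. LSA filters out some of this noise and also tries to find the smallest set of concepts that spans all the documents (why the smallest?).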
1. Documents are represented as "bags of words": the position of a word in a document does not matter, only the number of times it occurs.
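The bag-of-words idea can be sketched in a few lines. This is only an illustration; the helper name `bag_of_words` and the sample strings are assumptions, not part of the original tutorial.

```python
from collections import Counter


def bag_of_words(text):
    """Return word counts for a text, ignoring word order."""
    return Counter(text.lower().split())


a = bag_of_words("stock market investing")
b = bag_of_words("investing market stock")
print(a == b)  # same words and counts, different order
print(bag_of_words("the rich and the poor")["the"])
```

Two texts with the same words in a different order produce the same bag, which is exactly the information the representation throws away.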
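Next, let's look at a small LSA example.
A simple example
As a small example, I searched amazon.com for "investing" and took the book titles of the top 10 results. One of them was discarded because it had only one index word in common with the other titles. An index word is any word that meets both of the following conditions:
1. It appears in 2 or more titles, and
2. It is not a very common word such as "and" or "the" (a stop word). These words are left out because they carry little meaning on their own.
In this example, we removed the following stop words: "and", "edition", "for", "in", "little", "of", "the", "to".
The nine remaining titles are: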
1. The Neatest Little Guide to Stock Market Investing
2. Investing For Dummies, 4th Edition
3. The Little Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns
4. The Little Book of Value Investing
5. Value Investing: From Graham to Buffett and Beyond
6. Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!
7. Investing in Real Estate, 5th Edition
8. Stock Investing For Dummies
9. Rich Dad's Advisors: The ABC's of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss
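The two index-word conditions can be checked directly against these nine titles. The following is a minimal sketch under stated assumptions: the helper `tokenize` and the use of `string.punctuation` to strip punctuation are illustrative choices, not the tutorial's own code.

```python
import string
from collections import Counter

titles = [
    "The Neatest Little Guide to Stock Market Investing",
    "Investing For Dummies, 4th Edition",
    "The Little Book of Common Sense Investing: The Only Way to Guarantee "
    "Your Fair Share of Stock Market Returns",
    "The Little Book of Value Investing",
    "Value Investing: From Graham to Buffett and Beyond",
    "Rich Dad's Guide to Investing: What the Rich Invest in, "
    "That the Poor and the Middle Class Do Not!",
    "Investing in Real Estate, 5th Edition",
    "Stock Investing For Dummies",
    "Rich Dad's Advisors: The ABC's of Real Estate Investing: "
    "The Secrets of Finding Hidden Profits Most Investors Miss",
]
stopwords = {"and", "edition", "for", "in", "little", "of", "the", "to"}


def tokenize(title):
    """Lowercase a title, strip punctuation, and drop stop words."""
    table = str.maketrans("", "", string.punctuation)
    words = title.lower().translate(table).split()
    return [w for w in words if w not in stopwords]


# Count in how many distinct titles each word appears.
title_counts = Counter()
for title in titles:
    title_counts.update(set(tokenize(title)))

# Condition 1: keep words that appear in 2 or more titles.
index_words = sorted(w for w, n in title_counts.items() if n >= 2)
print(index_words)
```

For these titles the sketch yields eleven index words, running alphabetically from "book" to "value".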
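In this article we implement every step of LSA in Python and walk through all of the code. The Python code can be downloaded here (see above). It requires the NumPy and SciPy libraries; the plotting calls in the listing also assume matplotlib.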
# -*- coding: utf-8 -*-
"""
Created on Wed Jun 11 17:02:39 2014

@author: modified by zhouxu, add plot
"""
import matplotlib.pyplot as plt
from numpy import zeros
from scipy.linalg import svd

titles = [
    "The Neatest Little Guide to Stock Market Investing",
    "Investing For Dummies, 4th Edition",
    "The Little Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns",
    "The Little Book of Value Investing",
    "Value Investing: From Graham to Buffett and Beyond",
    "Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!",
    "Investing in Real Estate, 5th Edition",
    "Stock Investing For Dummies",
    "Rich Dad's Advisors: The ABC's of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss",
]
stopwords = ['and', 'edition', 'for', 'in', 'little', 'of', 'the', 'to']
ignorechars = ",:'!"


class LSA(object):
    def __init__(self, stopwords, ignorechars):
        self.stopwords = stopwords
        self.ignorechars = ignorechars
        self.wdict = {}   # word -> indices of the titles it occurs in
        self.dcount = 0   # number of titles parsed so far

    def parse(self, doc):
        # Record, for every non-stop word, the index of the current title.
        strip = str.maketrans('', '', self.ignorechars)
        for w in doc.split():
            w = w.lower().translate(strip)
            if w in self.stopwords:
                continue
            elif w in self.wdict:
                self.wdict[w].append(self.dcount)
            else:
                self.wdict[w] = [self.dcount]
        self.dcount += 1

    def build(self):
        # Keep words that occur more than once and fill the count matrix A
        # (rows: index words, columns: titles).
        self.keys = sorted(k for k in self.wdict if len(self.wdict[k]) > 1)
        print(self.keys)
        self.A = zeros([len(self.keys), self.dcount])
        for i, k in enumerate(self.keys):
            for d in self.wdict[k]:
                self.A[i, d] += 1

    def printA(self):
        print(self.A)
        u, s, vt = svd(self.A)
        print()
        print(u)
        print()
        print(s)
        print()
        print(vt)
        print()
        plt.title("LSA")
        plt.xlabel('dimension2')
        plt.ylabel('dimension3')
        # Titles: use the 2nd and 3rd rows of vt as coordinates.
        labels = ['T1', 'T2', 'T3', 'T4', 'T5', 'T6', 'T7', 'T8', 'T9']
        vdimension2 = vt[1]
        vdimension3 = vt[2]
        for j in range(len(vdimension2)):
            plt.text(vdimension2[j], vdimension3[j], labels[j])
        plt.plot(vdimension2, vdimension3, '.')
        # Index words: use the 2nd and 3rd columns of u as coordinates.
        ut = u.T
        dimension2 = ut[1]
        dimension3 = ut[2]
        for i in range(len(dimension2)):
            plt.text(dimension2[i], dimension3[i], self.keys[i])
        plt.plot(dimension2, dimension3, '.')
        plt.show()


mylsa = LSA(stopwords, ignorechars)
for t in titles:
    mylsa.parse(t)
mylsa.build()
mylsa.printA()

Program output: