gensim in Practice

Following the previous post on the basic principles behind these models, this post covers how to actually use gensim. It mainly follows the official gensim: Tutorials, and only records some brief notes.

There are three main parts. First, how to represent documents as sparse vectors in a vector space model (VSM); then, how to use models (called topics and transformations in gensim) to convert the bag-of-words (BoW) representation into each model's own representation; and finally, how to persist the results and run document similarity queries.

As an aside, installing gensim is very simple; just run this in a terminal:

pip install --upgrade gensim

Corpora and Vector Spaces

This section discusses how to represent documents, in particular how to turn them into BoW form, with a dictionary collecting the whole vocabulary.

By convention, a document has three representations here (a small sketch follows the list):

  • document, a single string holding one article
  • text, the tokenized form, with stop words and low-frequency words removed
  • corpus, the BoW representation of the documents, written as (token id, token count) pairs; the dictionary records the mapping between tokens and ids
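
For concreteness, this is what the three forms look like for one short document (the token ids are only illustrative; actual ids depend on the dictionary):

document = "Human machine interface"          # document: one raw string
text = ['human', 'machine', 'interface']      # text: tokenized, stop words removed
bow = [(0, 1), (1, 1), (2, 1)]                # corpus entry: (token id, count) pairs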

We use a very short document collection as the example, where each document is a single sentence:

from gensim import corpora
from pprint import pprint

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
            "The EPS user interface management system",
            "System and human system engineering testing of EPS",
            "Relation of user perceived response time to error measurement",
            "The generation of random binary unordered trees",
            "The intersection graph of paths in trees",
            "Graph minors IV Widths of trees and well quasi ordering",
            "Graph minors A survey"]

print 'got', len(documents), 'documents'    # got 9 documents
pprint(documents)

Build an iterable class so the whole collection is never loaded into memory at once:

class MyTexts(object):
    def __init__(self):
        # stop word list
        self.stoplist = set('for a of the and to in'.split())

    def __iter__(self):
        for doc in documents:
            # tokenize and remove stop words
            yield [word for word in doc.lower().split() if word not in self.stoplist]

texts = MyTexts()
for text in texts:
    print(text)

Build the dictionary, taking care to drop the low-frequency words:

def get_dictionary(texts, min_count=1):
    dictionary = corpora.Dictionary(texts)
    lowfreq_ids = [tokenid for tokenid, docfreq in dictionary.dfs.items()
                   if docfreq < min_count]
    # remove low-frequency words (stop words were already removed above)
    dictionary.filter_tokens(lowfreq_ids)
    # remove gaps in id sequence after words that were removed
    dictionary.compactify()
    return dictionary

dictionary = get_dictionary(texts, min_count=1)
print(dictionary)

# save and load dictionary
dictionary.save('a.dict')
dictionary = corpora.Dictionary.load('a.dict')
print(dictionary)
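
To see the token-to-id mapping that the dictionary records, inspect its token2id attribute (the ids shown in the comment are only an example):

print(dictionary.token2id)   # e.g. {'human': 0, 'interface': 1, 'computer': 2, ...}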

Build the corpus. Note that since this turns the whole collection into one list, memory usage can be large; a streaming alternative is sketched after the code below.

corpus = [dictionary.doc2bow(text) for text in texts]
pprint(corpus)

# save corpus
corpora.MmCorpus.serialize('corpus.mm', corpus)

# load corpus
corpus = corpora.MmCorpus('corpus.mm')
print(corpus)
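
To avoid materializing that list, the corpus itself can also be streamed; a minimal sketch (CorpusStream is just an illustrative name):

class CorpusStream(object):
    def __iter__(self):
        # yield one BoW vector at a time instead of building a full list
        for text in texts:
            yield dictionary.doc2bow(text)

corpora.MmCorpus.serialize('corpus_stream.mm', CorpusStream())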

Topics and Transformations

1. TF-IDF

When converting a BoW corpus into its TF-IDF representation, gensim only creates a wrapper and computes the idf statistics; the actual conversion happens on the fly, while iterating over corpus_tfidf.

from gensim import models

tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus] # wrap the old corpus to tfidf

You can print things to inspect the results:

print(tfidf, '\n') # TfidfModel(num_docs=9, num_nnz=51)
print(corpus_tfidf, '\n')
print(tfidf[corpus[0]], '\n') # convert first doc from bow to tfidf

for doc in corpus_tfidf: # convert the whole corpus on the fly
    print(doc)
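
For reference, the default TfidfModel weighting is roughly weight(t, d) = tf(t, d) * log2(N / df(t)), with each document vector then L2-normalized; a quick manual check of one unnormalized weight (N, df and tf below are filled in by hand for this toy corpus):

import math

N = 9     # total number of documents
df = 2    # number of documents containing 'human'
tf = 1    # count of 'human' in the first document
print(tf * math.log(N / float(df), 2))   # unnormalized weight, ~2.17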

2. LSI

The usage is similar to TF-IDF, except that the LSI model is built on top of the TF-IDF corpus.

# initialize a fold-in LSI transformation
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2) 

# create a double wrapper over the original corpus:bow->tfidf->fold-in-lsi
corpus_lsi = lsi[corpus_tfidf] 

# save model
lsi.save('model.lsi')

# load model
lsi = models.LsiModel.load('model.lsi')

Above, the number of topics was set to num_topics = 2; printing the two topics shows that each one is a weighted combination of different words.

print(lsi, '\n')
print(corpus_lsi, '\n')
for ids, topic in lsi.print_topics():
    print('topic id = ', ids, '\n', topic, '\n')

'''
LsiModel(num_terms=35, num_topics=2, decay=1.0, chunksize=20000)

topic id =  0
-0.400*"system" + -0.318*"survey" + -0.290*"user" + -0.274*"eps" + -0.236*"management" + -0.236*"opinion" + -0.235*"time" + -0.235*"response" + -0.224*"interface" + -0.224*"computer"

topic id =  1
-0.421*"minors" + -0.420*"graph" + -0.293*"survey" + -0.239*"trees" + -0.226*"intersection" + -0.226*"paths" + 0.204*"system" + 0.196*"eps" + -0.189*"ordering" + -0.189*"quasi"
'''
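
LSI training is also incremental, so an existing model can keep absorbing new documents via add_documents; a small sketch with a made-up extra document:

# fold one more (tokenized) document into the trained LSI model
more_texts = [['graph', 'trees', 'survey']]
more_corpus = tfidf[[dictionary.doc2bow(t) for t in more_texts]]
lsi.add_documents(more_corpus)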

3. More Models

Other models are used in the same way; a few are listed below, followed by a short LDA example:

# RP, Random Projections
model = models.RpModel(corpus_tfidf, num_topics=500)

# LDA, Latent Dirichlet Allocation
model = models.LdaModel(corpus, id2word=dictionary, num_topics=100)

# HDP, Hierarchical Dirichlet Process
model = models.HdpModel(corpus, id2word=dictionary)
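
Each of these wraps the corpus just like the LSI example above. For instance, after training the LDA model you can inspect a few of its learned topics (showing 5 of the 100 topics here is an arbitrary choice):

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=100)
for topic_id, topic in lda.print_topics(5):
    print('topic id =', topic_id, '\n', topic)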

Similarity Queries

The models above turn the texts into vectors, which makes document similarity queries straightforward.

from gensim import corpora, models, similarities

dictionary = corpora.Dictionary.load('a.dict')
corpus = corpora.MmCorpus('corpus.mm')
lsi = models.LsiModel.load('model.lsi')
print(dictionary)
print(corpus)
print(lsi, '\n')

# transform corpus to LSI space and index it
# index = similarities.MatrixSimilarity(lsi[corpus])
# num_features is the dimensionality of the indexed vectors (here, the LSI topic count)
index = similarities.Similarity('./', lsi[corpus], num_features=lsi.num_topics)
print(index)

# save and load
index.save('a.index')
index = similarities.Similarity.load('a.index')

Querying returns cosine similarities:

# convert query to lsi vector
query_doc = "Human computer interaction"
query_bow = dictionary.doc2bow(query_doc.lower().split())
query_lsi = lsi[query_bow]
print(query_lsi, '\n')

# performing queries
sims = index[query_lsi]
print(len(sims))
print(sims)
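
The scores come back in corpus order, one per document; to rank the documents by relevance, sort them:

sims_ranked = sorted(enumerate(sims), key=lambda item: -item[1])
for doc_id, score in sims_ranked:
    print(doc_id, score)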
