Extracting Topics with the LDA (Latent Dirichlet Allocation) Algorithm in gensim

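Latent Dirichlet Allocation (LDA) is one of the most popular topic-modeling algorithms in use today. LDA turns a collection of texts into a set of topics, where each topic is a list of words weighted by probability. Note that LDA handles vocabulary purely statistically: it is a bag-of-words method, so word order is ignored. For example: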

Paragraph 1: "Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. It is altogether fitting and proper that we should do this."
Paragraph 2: "Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal."
Paragraph 3: "We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that nation might live."

With five topics requested, LDA returns each topic as an id followed by its four highest-weighted words:


(0, u'0.032*"conceive" + 0.032*"dedicate" + 0.032*"nation" + 0.032*"life"')
(1, u'0.059*"conceive" + 0.059*"score" + 0.059*"seven" + 0.059*"proposition"')
(2, u'0.103*"nation" + 0.071*"dedicate" + 0.071*"great" + 0.071*"field"')
(3, u'0.032*"conceive" + 0.032*"nation" + 0.032*"dedicate" + 0.032*"rest"')
(4, u'0.032*"conceive" + 0.032*"nation" + 0.032*"dedicate" + 0.032*"battle"')
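
Each line of this output is a tuple: a topic id, then a string of weight*"word" terms joined by " + ". As a minimal sketch of how to read it, the helper below (parse_topic is a hypothetical name, not part of gensim) splits such a string into (word, weight) pairs using only the standard library:

```python
import re

def parse_topic(topic_string):
    """Split '0.103*"nation" + 0.071*"dedicate"' into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

# Topic 2 copied from the output above
topic = (2, '0.103*"nation" + 0.071*"dedicate" + 0.071*"great" + 0.071*"field"')
topic_id, terms = topic
print(topic_id, parse_topic(terms))
# 2 [('nation', 0.103), ('dedicate', 0.071), ('great', 0.071), ('field', 0.071)]
```

A higher weight means the word contributes more to that topic, so "nation" is the strongest word of topic 2.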

This article briefly shows how to use the Python packages nltk, spacy, and gensim to run LDA, including the preprocessing steps.

1. Preprocessing

1.1 Tokenization

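# python -m spacy download en
import spacy
from spacy.lang.en import English

parser = English()  # English pipeline, used here only for its tokenizer

def tokenize(text):
    """Split text into lowercase tokens, skipping whitespace tokens."""
    lda_tokens = []
    tokens = parser(text)
    for token in tokens:
        if token.orth_.isspace():
            continue  # drop pure-whitespace tokens
        elif token.like_url:
            lda_tokens.append('URL')  # assumed placeholder for links
        elif token.orth_.startswith('@'):
            lda_tokens.append('SCREEN_NAME')  # assumed placeholder for @mentions
        else:
            lda_tokens.append(token.lower_)
    return lda_tokens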

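1.2 Lemmatization

Lemmatization restores an inflected word to its dictionary form: "dictionaries" -> "dictionary".
Stemming strips affixes to leave a word stem, which need not be a real word: "dictionaries" -> "dict" (the exact stem depends on the stemmer).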

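import nltk
# nltk.download('wordnet')  # one-time download of the WordNet data
from nltk.corpus import wordnet as wn

def get_lemma(word):
    """Return the WordNet base form of word, or word itself if none is found."""
    lemma = wn.morphy(word)
    if lemma is None:
        return word
    return lemma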

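1.3 Stop-word removal with the English stop-word list from nltk

# nltk.download('stopwords')  # one-time download of the stop-word lists
en_stop = set(nltk.corpus.stopwords.words('english'))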

1.4 Preprocessing pipeline


def prepare_text_for_lda(text):
    tokens = tokenize(text)
    tokens = [token for token in tokens if len(token) > 4]
    tokens = [token for token in tokens if token not in en_stop]
    tokens = [get_lemma(token) for token in tokens]
    return tokens
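
The order of the steps matters: the length filter runs before lemmatization, so a word is kept or dropped based on its inflected form. The sketch below walks through that with stand-ins so it runs without nltk or spacy: STOP is a tiny substitute for the nltk stop-word list and LEMMAS is a hand-written substitute for the WordNet lookup, both assumptions made for illustration only.

```python
# Stand-ins for illustration; the real pipeline uses en_stop and get_lemma.
STOP = {"those", "their"}      # tiny substitute for nltk's English stop words
LEMMAS = {"lives": "life"}     # hand-written substitute for wn.morphy

def prepare(tokens):
    tokens = [t for t in tokens if len(t) > 4]    # keep words of 5+ characters
    tokens = [t for t in tokens if t not in STOP]  # drop stop words
    return [LEMMAS.get(t, t) for t in tokens]      # lemmatize what is left

words = ["those", "who", "here", "gave", "their", "lives",
         "that", "nation", "might", "live"]
print(prepare(words))
# ['life', 'nation', 'might']
```

"lives" has five characters, so it survives the filter and becomes "life", while the four-letter "live" is dropped before lemmatization is ever applied.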

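2. The LDA algorithm

2.1 Preprocessing the text collection

Load the texts through the preprocessing function. Note that gensim.models.ldamodel works on a collection of documents, not on a single document, so its input must be a nested list of the form [[...], ..., [...]] rather than a flat list [...].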

    text_1 = u"Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. It is altogether fitting and proper that we should do this."
    text_2 = u'Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.'
    text_3 = u"We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that nation might live. "
    text_data_1 = prepare_text_for_lda(text_1)
    text_data_2 = prepare_text_for_lda(text_2)
    text_data_3 = prepare_text_for_lda(text_3)
    text_data = [text_data_1, text_data_2, text_data_3]  # one token list per document
    print(text_data)


[[u'engage', u'great', u'civil', u'testing', u'whether', u'nation', u'nation', u'conceive', u'dedicate', u'endure', u'altogether', u'fitting', u'proper'], [u'score', u'seven', u'years', u'father', u'bring', u'forth', u'continent', u'nation', u'conceive', u'liberty', u'dedicate', u'proposition', u'create', u'equal'], [u'great', u'battle', u'field', u'dedicate', u'portion', u'field', u'final', u'rest', u'place', u'life', u'nation', u'might']]

2.2 Extracting topics with LDA


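    import gensim
    from gensim import corpora

    dictionary = corpora.Dictionary(text_data)  # maps each distinct token to an integer id
    corpus = [dictionary.doc2bow(text) for text in text_data]  # (token_id, count) pairs per document

    NUM_TOPICS = 5  # number of topics to extract
    ldamodel = gensim.models.ldamodel.LdaModel(corpus,
                                               num_topics=NUM_TOPICS,
                                               id2word=dictionary)  # id2word lets topics print as words
    topics = ldamodel.print_topics(num_words=4)
    for topic in topics:
        print(topic)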
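
The dictionary and doc2bow steps are where the bag-of-words idea shows up: each document is reduced to (token_id, count) pairs and word order is discarded. The following is a pure-Python sketch of that idea, not gensim's implementation; build_vocabulary and to_bow are hypothetical names, and the id assignment (first-seen order) is a simplification that may differ from the ids gensim assigns.

```python
from collections import Counter

def build_vocabulary(documents):
    """Assign an integer id to each distinct token, in first-seen order."""
    token2id = {}
    for document in documents:
        for token in document:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(document, token2id):
    """Turn a token list into sorted (token_id, count) pairs."""
    counts = Counter(token2id[token] for token in document)
    return sorted(counts.items())

docs = [["nation", "great", "nation"], ["great", "field"]]
vocab = build_vocabulary(docs)
bows = [to_bow(doc, vocab) for doc in docs]
print(vocab)  # {'nation': 0, 'great': 1, 'field': 2}
print(bows)   # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Reordering the tokens inside a document leaves its bag-of-words vector unchanged, which is why LDA cannot see word order.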

3. Appendix: problems encountered and fixes

3.1 An error from spacy

import spacy
Traceback (most recent call last):
  File "", line 13, in 
  File "C:\Python27\lib\site-packages\spacy\", line 15, in load
    return util.load_model(name, **overrides)
  File "C:\Python27\lib\site-packages\spacy\", line 119, in load_model
    raise IOError(Errors.E050.format(name=name))
IOError: [E050] Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.



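Fix: download the English model from the command line first (python -m spacy download en, as noted in section 1.1), then import and load it again.

import spacy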

3.2 An error from dictionary


C:\Python27\lib\site-packages\gensim\ UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
Traceback (most recent call last):
  File "", line 122, in
    dictionary = corpora.Dictionary(text_data_1)
  File "C:\Python27\lib\site-packages\gensim\corpora\", line 81, in __init__
    self.add_documents(documents, prune_at=prune_at)
  File "C:\Python27\lib\site-packages\gensim\corpora\", line 198, in add_documents
    self.doc2bow(document, allow_update=True)  # ignore the result, here we only care about updating token ids
  File "C:\Python27\lib\site-packages\gensim\corpora\", line 236, in doc2bow
    raise TypeError("doc2bow expects an array of unicode tokens on input, not a single string")
TypeError: doc2bow expects an array of unicode tokens on input, not a single string
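
The traceback shows corpora.Dictionary being called with text_data_1, which is one tokenized document (a flat list of strings). Dictionary expects a collection of documents, so it hands each string to doc2bow as if it were a whole document, and doc2bow rejects it. Passing the nested text_data from section 2.1, or wrapping the single document as [text_data_1], avoids the error. The sketch below reproduces the same kind of check in plain Python; check_documents is a hypothetical helper for illustration, not gensim code.

```python
def check_documents(documents):
    """Require every document to be a list of tokens, not a bare string."""
    for document in documents:
        if isinstance(document, str):
            raise TypeError(
                "expected a list of tokens, got a single string: %r" % document)
    return True

flat = ["engage", "great", "civil"]   # one tokenized document
nested = [flat]                       # a collection holding that one document

print(check_documents(nested))        # True
try:
    check_documents(flat)             # each "document" is now a bare string
except TypeError as error:
    print("TypeError:", error)
```

Running a check like this before building the dictionary makes the shape problem obvious at the call site instead of deep inside the library.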
