Latent Dirichlet Allocation (LDA) is currently the most popular and widely used topic-model algorithm. LDA maps a collection of documents onto a set of topics, each weighted by a probability. Note that LDA handles topic words purely statistically: it is a bag-of-words method that ignores word order. For example:
Input:
Paragraph 1: "Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. It is altogether fitting and proper that we should do this."
Paragraph 2: "Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal."
Paragraph 3: "We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that nation might live."
Output:
(0, u'0.032*"conceive" + 0.032*"dedicate" + 0.032*"nation" + 0.032*"life"')
(1, u'0.059*"conceive" + 0.059*"score" + 0.059*"seven" + 0.059*"proposition"')
(2, u'0.103*"nation" + 0.071*"dedicate" + 0.071*"great" + 0.071*"field"')
(3, u'0.032*"conceive" + 0.032*"nation" + 0.032*"dedicate" + 0.032*"rest"')
(4, u'0.032*"conceive" + 0.032*"nation" + 0.032*"dedicate" + 0.032*"battle"')
This post walks through implementing LDA in Python with the nltk, spacy, and gensim packages, including the full preprocessing pipeline.
# On first use, download spaCy's small English model first:
# python -m spacy download en_core_web_sm
import spacy
spacy.load('en_core_web_sm')
from spacy.lang.en import English
parser = English()
# Clean the text and lowercase every token
def tokenize(text):
    lda_tokens = []
    tokens = parser(text)
    for token in tokens:
        if token.orth_.isspace():
            # skip whitespace tokens
            continue
        elif token.like_url:
            # collapse URLs to a single placeholder
            lda_tokens.append('URL')
        elif token.orth_.startswith('@'):
            # collapse @-mentions to a single placeholder
            lda_tokens.append('SCREEN_NAME')
        else:
            lda_tokens.append(token.lower_)
    return lda_tokens
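A quick sanity check of the tokenizer on a made-up sentence (the URL placeholder is the point here; the exact tokenization can vary across spaCy versions):
print(tokenize(u"Check https://example.com for details"))
# expected output along the lines of: [u'check', u'URL', u'for', u'details']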
Lemmatization (lemma) and stemming (stem) are two common word-normalization steps in NLP:
lemma restores an inflected word to its dictionary form: "dictionaries" -> "dictionary"
stem strips a word down to its root: "dictionaries" -> "dict"
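nltk exposes both operations directly; a small comparison sketch (note that a stem need not be a real word: the Porter stemmer yields the truncated root 'dictionari' rather than the 'dict' shorthand above):
from nltk.stem import PorterStemmer, WordNetLemmatizer
print(PorterStemmer().stem('dictionaries'))           # 'dictionari', a crude root
print(WordNetLemmatizer().lemmatize('dictionaries'))  # 'dictionary', a valid dictionary form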
# WordNet: a lexical database of synonyms, antonyms, and related words
import nltk
# On first use, download the wordnet corpus:
# nltk.download('wordnet')
from nltk.corpus import wordnet as wn
def get_lemma(word):
    # dogs -> dog
    # aardwolves -> aardwolf
    # sichuan -> sichuan
    lemma = wn.morphy(word)
    if lemma is None:
        # morphy() returns None for words WordNet does not know; keep the word as-is
        return word
    else:
        return lemma
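Checking the three cases from the comments above:
print(get_lemma('dogs'))        # 'dog'
print(get_lemma('aardwolves'))  # 'aardwolf'
print(get_lemma('sichuan'))     # 'sichuan' (morphy() returns None, so the word passes through unchanged)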
# On first use, download the stopword list:
# nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))
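A quick check that the stopword list behaves as expected:
print('the' in en_stop)     # True, a function word
print('nation' in en_stop)  # False, a content word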
Preprocessing chains together the steps above: tokenization, lemmatization, and stopword removal.
# Define the preprocessing function
def prepare_text_for_lda(text):
    # tokenize
    tokens = tokenize(text)
    # keep only tokens longer than 4 characters
    tokens = [token for token in tokens if len(token) > 4]
    # drop stopwords
    tokens = [token for token in tokens if token not in en_stop]
    # lemmatize each remaining token
    tokens = [get_lemma(token) for token in tokens]
    return tokens
Load the texts through the preprocessing function. Note that gensim.models.ldamodel operates on a collection of documents, not a single document: its input must be a list of token lists, [[...], ..., [...]], rather than a flat list [].
text_1 = u"Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. It is altogether fitting and proper that we should do this."
text_2 = u'Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.'
text_3 = u"We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that nation might live. "
text_data_1 = prepare_text_for_lda(text_1)
text_data_2 = prepare_text_for_lda(text_2)
text_data_3 = prepare_text_for_lda(text_3)
text_data =[]
text_data.append(text_data_1)
text_data.append(text_data_2)
text_data.append(text_data_3)
print "text_data :",text_data
Preprocessing the three strings and combining them into one list yields the following data:
[[u'engage', u'great', u'civil', u'testing', u'whether', u'nation', u'nation', u'conceive', u'dedicate', u'endure', u'altogether', u'fitting', u'proper'], [u'score', u'seven', u'years', u'father', u'bring', u'forth', u'continent', u'nation', u'conceive', u'liberty', u'dedicate', u'proposition', u'create', u'equal'], [u'great', u'battle', u'field', u'dedicate', u'portion', u'field', u'final', u'rest', u'place', u'life', u'nation', u'might']]
Note that gensim.models.ldamodel.LdaModel(), which runs the LDA algorithm below, is tightly coupled to the corpus and dictionary generated here.
# Load gensim
import gensim
from gensim import corpora
# Build a bag-of-words dictionary from text_data with gensim's corpora.Dictionary
dictionary = corpora.Dictionary(text_data)
# Encode each document as a sparse vector of (token_id, count) pairs
corpus = [dictionary.doc2bow(text) for text in text_data]
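To inspect the encoding, you can print the id mapping and the first document's vector (the exact ids depend on discovery order, so the output below is only illustrative):
print(dictionary.token2id)  # token -> integer id mapping
print(corpus[0])            # the first document as (token_id, count) pairs, e.g. [(0, 1), (1, 1), ...]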
# Run LDA to extract five topics;
# the fitted topics are also used later to label new documents
NUM_TOPICS = 5  # number of topics to learn
ldamodel = gensim.models.ldamodel.LdaModel(corpus,
                                           num_topics=NUM_TOPICS,
                                           id2word=dictionary,
                                           passes=15)
ldamodel.save('model5.gensim')
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
print(topic)
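Because the model was saved as 'model5.gensim' above, it can be reloaded later and, together with the same dictionary, used to label a new document with a topic mixture. A minimal sketch (the new sentence is just an illustration):
# reload the trained model and infer the topic distribution of an unseen document
loaded_model = gensim.models.ldamodel.LdaModel.load('model5.gensim')
new_doc = u"The nation was conceived in liberty and dedicated to equality."
new_bow = dictionary.doc2bow(prepare_text_for_lda(new_doc))
print(loaded_model.get_document_topics(new_bow))  # list of (topic_id, probability) pairs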
Two errors commonly come up when running the code above. The first is triggered by loading the spaCy model through its old shortcut name:
import spacy
spacy.load('en')
Traceback (most recent call last):
  File "topial_LDA.py", line 13, in <module>
    spacy.load('en')
  File "C:\Python27\lib\site-packages\spacy\__init__.py", line 15, in load
    return util.load_model(name, **overrides)
  File "C:\Python27\lib\site-packages\spacy\util.py", line 119, in load_model
    raise IOError(Errors.E050.format(name=name))
IOError: [E050] Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
This error occurs because we never told spaCy which concrete English model package to load; spaCy ships several English models (for example en_core_web_sm, en_core_web_md, and en_core_web_lg). Fix the code by naming the model explicitly:
import spacy
spacy.load('en_core_web_sm')
The second error refers back to the input-structure note in 2.1:
C:\Python27\lib\site-packages\gensim\utils.py:1209: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
Traceback (most recent call last):
  File "topial_LDA.py", line 122, in <module>
    dictionary = corpora.Dictionary(text_data_1)
  File "C:\Python27\lib\site-packages\gensim\corpora\dictionary.py", line 81, in __init__
    self.add_documents(documents, prune_at=prune_at)
  File "C:\Python27\lib\site-packages\gensim\corpora\dictionary.py", line 198, in add_documents
    self.doc2bow(document, allow_update=True)  # ignore the result, here we only care about updating token ids
  File "C:\Python27\lib\site-packages\gensim\corpora\dictionary.py", line 236, in doc2bow
    raise TypeError("doc2bow expects an array of unicode tokens on input, not a single string")
TypeError: doc2bow expects an array of unicode tokens on input, not a single string
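The traceback shows that corpora.Dictionary() was handed a single document (one flat list of tokens), so doc2bow ended up iterating over bare strings. Wrapping the document in an outer list, as noted in 2.1, fixes it:
# wrong: a single token list, so Dictionary treats each string as a "document"
# dictionary = corpora.Dictionary(text_data_1)
# right: a list of token lists, even when there is only one document
dictionary = corpora.Dictionary([text_data_1])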