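Notes:
Original article: http://chrisstrelioff.ws/sandbox/2014/11/13/getting_started_with_latent_dirichlet_allocation_in_python.html
This post covers the main content of that article.
About LDA: LDA漫游指南 (a guide to LDA, in Chinese)
The Python library lda used here comes from https://github.com/ariddell/lda .
The gensim library also provides LDA functionality.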
$ pip install lda --user
from __future__ import division, print_function
import numpy as np
import lda
import lda.datasets
# document-term matrix
X = lda.datasets.load_reuters()
print("type(X): {}".format(type(X)))
print("shape: {}\n".format(X.shape))
print(X[:5, :5])
'''
Output:
type(X): <type 'numpy.ndarray'>
shape: (395L, 4258L)

[[ 1  0  1  0  0]
 [ 7  0  2  0  0]
 [ 0  0  0  1 10]
 [ 6  0  1  0  0]
 [ 0  0  0  2 14]]
'''
X is a 395 x 4258 matrix: 395 documents over a vocabulary of 4258 words. Each entry is the number of times a word occurs in a document.
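To make the layout of such a document-term matrix concrete, here is a minimal sketch that builds one from a tiny made-up corpus. The names `toy_docs`, `toy_vocab`, and `toy_X` are hypothetical and are not part of the lda package or the Reuters dataset; this is not how the dataset itself was prepared.

```python
from collections import Counter

import numpy as np

# Hypothetical toy corpus (not part of the Reuters dataset).
toy_docs = [
    "church pope church",
    "pope years",
    "years years people",
]

# Vocabulary: one column per distinct word, in sorted order.
toy_vocab = sorted({word for doc in toy_docs for word in doc.split()})
column = {word: j for j, word in enumerate(toy_vocab)}

# One row per document, one column per vocabulary word.
toy_X = np.zeros((len(toy_docs), len(toy_vocab)), dtype=np.int64)
for i, doc in enumerate(toy_docs):
    for word, count in Counter(doc.split()).items():
        toy_X[i, column[word]] = count

print(toy_vocab)
print(toy_X)
```

Row i, column j holds the number of times word j occurs in document i, which is the same convention as the X loaded above.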
Let's see which words these are:
# the vocab
vocab = lda.datasets.load_reuters_vocab()
print("type(vocab): {}".format(type(vocab)))
print("len(vocab): {}\n".format(len(vocab)))
print(vocab[:6])
'''
Output:
type(vocab): <type 'tuple'>
len(vocab): 4258

('church', 'pope', 'years', 'people', 'mother', 'last')
'''
Column 0 of X corresponds to the word church, and column 1 to the word pope.
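The column-to-word correspondence can be used in both directions. The sketch below uses a hypothetical six-word vocabulary (the first six words shown above) and a made-up count row in place of the real `vocab` and a row of `X`.

```python
import numpy as np

# Hypothetical stand-ins for `vocab` and one row of `X`.
toy_vocab = ("church", "pope", "years", "people", "mother", "last")
toy_row = np.array([1, 0, 1, 0, 0, 3])

# From word to column: find the column index, then read the count.
j = toy_vocab.index("years")
print(j, toy_row[j])

# From column to word: list the words that occur in this document.
present = np.nonzero(toy_row)[0]
word_counts = [(toy_vocab[k], int(toy_row[k])) for k in present]
print(word_counts)
```

With the real data the same pattern would read `vocab.index(...)` and `X[doc_index]`.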
Next, look at the article titles:
# titles for each story
titles = lda.datasets.load_reuters_titles()
print("type(titles): {}".format(type(titles)))
print("len(titles): {}\n".format(len(titles)))
print(titles[:2])  # titles of the first two articles
'''
Output:
type(titles): <type 'tuple'>
len(titles): 395

('0 UK: Prince Charles spearheads British royal revolution. LONDON 1996-08-20', '1 GERMANY: Historic Dresden church rising from WW2 ashes. DRESDEN, Germany 1996-08-21')
'''
Fit the model, specifying 20 topics and 500 iterations:
model = lda.LDA(n_topics=20, n_iter=500, random_state=1)
model.fit(X)
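The lda package describes itself as an implementation of LDA using collapsed Gibbs sampling. To give a rough idea of what that kind of fitting involves, here is a toy sampler written with NumPy only. It is a sketch for illustration and not the library's code: the function `toy_gibbs_lda`, its parameter values, and the small matrix `toy_X` are all hypothetical.

```python
import numpy as np


def toy_gibbs_lda(X, n_topics, n_iter=50, alpha=0.1, eta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA on a small count matrix X."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = X.shape
    # Expand the count matrix into one (document, word) pair per token.
    doc_ids, word_ids = [], []
    for d, w in zip(*np.nonzero(X)):
        doc_ids += [d] * int(X[d, w])
        word_ids += [w] * int(X[d, w])
    # Random initial topic for every token, plus the count tables.
    z = rng.integers(n_topics, size=len(doc_ids))
    ndz = np.zeros((n_docs, n_topics))   # tokens per (document, topic)
    nzw = np.zeros((n_topics, n_words))  # tokens per (topic, word)
    nz = np.zeros(n_topics)              # tokens per topic
    for d, w, k in zip(doc_ids, word_ids, z):
        ndz[d, k] += 1
        nzw[k, w] += 1
        nz[k] += 1
    for _ in range(n_iter):
        for i, (d, w) in enumerate(zip(doc_ids, word_ids)):
            # Remove the token's current assignment from the counts.
            k = z[i]
            ndz[d, k] -= 1
            nzw[k, w] -= 1
            nz[k] -= 1
            # Conditional weight of each topic for this token, then resample.
            p = (ndz[d] + alpha) * (nzw[:, w] + eta) / (nz + n_words * eta)
            k = rng.choice(n_topics, p=p / p.sum())
            z[i] = k
            ndz[d, k] += 1
            nzw[k, w] += 1
            nz[k] += 1
    # Smoothed, row-normalised estimates of the two distributions.
    topic_word = (nzw + eta) / (nzw + eta).sum(axis=1, keepdims=True)
    doc_topic = (ndz + alpha) / (ndz + alpha).sum(axis=1, keepdims=True)
    return topic_word, doc_topic


# Hypothetical 4-document, 4-word count matrix.
toy_X = np.array([[4, 3, 0, 0],
                  [3, 5, 0, 0],
                  [0, 0, 4, 4],
                  [0, 1, 5, 3]])
toy_topic_word, toy_doc_topic = toy_gibbs_lda(toy_X, n_topics=2)
print(toy_topic_word.shape, toy_doc_topic.shape)
```

The two returned arrays play the same roles as `model.topic_word_` and `model.doc_topic_`; for real work the library's `fit` should be used rather than this sketch.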
Topic-word distribution:
topic_word = model.topic_word_
print("type(topic_word): {}".format(type(topic_word)))
print("shape: {}".format(topic_word.shape))
'''
Output:
type(topic_word): <type 'numpy.ndarray'>
shape: (20L, 4258L)
'''
Each row of topic_word corresponds to one topic, and each row sums to 1. Let's look at the weights of the three words 'church', 'pope', and 'years' in each topic:
print(topic_word[:, :3])
'''
Output:
[[ 2.72436509e-06 2.72436509e-06 2.72708945e-03]
 [ 2.29518860e-02 1.08771556e-06 7.83263973e-03]
 [ 3.97404221e-03 4.96135108e-06 2.98177200e-03]
 [ 3.27374625e-03 2.72585033e-06 2.72585033e-06]
 [ 8.26262882e-03 8.56893407e-02 1.61980569e-06]
 [ 1.30107788e-02 2.95632328e-06 2.95632328e-06]
 [ 2.80145003e-06 2.80145003e-06 2.80145003e-06]
 [ 2.42858077e-02 4.66944966e-06 4.66944966e-06]
 [ 6.84655429e-03 1.90129250e-06 6.84655429e-03]
 [ 3.48361655e-06 3.48361655e-06 3.48361655e-06]
 [ 2.98781661e-03 3.31611166e-06 3.31611166e-06]
 [ 4.27062069e-06 4.27062069e-06 4.27062069e-06]
 [ 1.50994982e-02 1.64107142e-06 1.64107142e-06]
 [ 7.73480150e-07 7.73480150e-07 1.70946848e-02]
 [ 2.82280146e-06 2.82280146e-06 2.82280146e-06]
 [ 5.15309856e-06 5.15309856e-06 4.64294180e-03]
 [ 3.41695768e-06 3.41695768e-06 3.41695768e-06]
 [ 3.90980357e-02 1.70316633e-03 4.42279319e-03]
 [ 2.39373034e-06 2.39373034e-06 2.39373034e-06]
 [ 3.32493234e-06 3.32493234e-06 3.32493234e-06]]
'''
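The claim that each topic row sums to 1 is easy to check numerically. The sketch below does so on a hypothetical 2-topic, 4-word count table, normalised with a small smoothing constant. Both `toy_counts` and the value of `eta` are assumptions chosen for illustration. Smoothing of this kind is one plausible reading of why many entries in the output above are tiny, repeated, nonzero values rather than exact zeros.

```python
import numpy as np

# Hypothetical word counts for 2 topics over a 4-word vocabulary.
toy_counts = np.array([[8.0, 0.0, 2.0, 0.0],
                       [0.0, 5.0, 0.0, 5.0]])
eta = 0.01  # assumed smoothing constant, chosen for illustration

# Add the smoothing constant, then normalise each row to sum to 1.
smoothed = toy_counts + eta
toy_topic_word = smoothed / smoothed.sum(axis=1, keepdims=True)

# Every row is a probability distribution over the vocabulary.
rows_sum_to_one = np.allclose(toy_topic_word.sum(axis=1), 1.0)
print(rows_sum_to_one)

# Words with zero count still get a small, identical, nonzero weight.
print(toy_topic_word[0])
```

On the fitted model the same check would be `np.allclose(topic_word.sum(axis=1), 1.0)`.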
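Get the 5 highest-weighted words for each topic:
n = 5
for i, topic_dist in enumerate(topic_word):
    # sort the vocabulary by ascending weight, then take the last n words in reverse order
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n+1):-1]
    print('*Topic {}\n- {}'.format(i, ' '.join(topic_words)))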
'''
Output:
*Topic 0
- government british minister west group
*Topic 1
- church first during people political
*Topic 2
- elvis king wright fans presley
*Topic 3
- yeltsin russian russia president kremlin
*Topic 4
- pope vatican paul surgery pontiff
*Topic 5
- family police miami versace cunanan
*Topic 6
- south simpson born york white
*Topic 7
- order church mother successor since
*Topic 8
- charles prince diana royal queen
*Topic 9
- film france french against actor
*Topic 10
- germany german war nazi christian
*Topic 11
- east prize peace timor quebec
*Topic 12
- n't told life people church
*Topic 13
- years world time year last
*Topic 14
- mother teresa heart charity calcutta
*Topic 15
- city salonika exhibition buddhist byzantine
*Topic 16
- music first people tour including
*Topic 17
- church catholic bernardin cardinal bishop
*Topic 18
- harriman clinton u.s churchill paris
*Topic 19
- century art million museum city
'''
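The slice `[:-(n+1):-1]` used to pick the top words is compact but not obvious. The sketch below applies it to a hypothetical six-word distribution (`toy_vocab` and `toy_dist` are made up) and compares it with a two-step spelling that may be easier to read.

```python
import numpy as np

# Hypothetical vocabulary and one topic's word weights (they sum to 1).
toy_vocab = np.array(["church", "pope", "years", "people", "mother", "last"])
toy_dist = np.array([0.07, 0.40, 0.10, 0.25, 0.15, 0.03])
n = 3

# argsort returns indices in ascending order of weight; the slice
# [:-(n+1):-1] walks backwards from the end and stops after n items.
top_a = toy_vocab[np.argsort(toy_dist)][:-(n + 1):-1].tolist()

# Equivalent: reverse the sorted indices, then keep the first n.
top_b = toy_vocab[np.argsort(toy_dist)[::-1][:n]].tolist()

print(top_a)  # ['pope', 'people', 'mother']
print(top_b)  # ['pope', 'people', 'mother']
```

Both forms return the words in descending order of weight; which one to use is a matter of taste.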
Document-topic distribution:
doc_topic = model.doc_topic_
print("type(doc_topic): {}".format(type(doc_topic)))
print("shape: {}".format(doc_topic.shape))
'''
Output:
type(doc_topic): <type 'numpy.ndarray'>
shape: (395, 20)
'''
Each row corresponds to one article, and each row sums to 1.
Print the most probable topic for each of the first 10 articles:
for n in range(10):
    topic_most_pr = doc_topic[n].argmax()
    print("doc: {} topic: {}".format(n, topic_most_pr))
'''
Output:
doc: 0 topic: 8
doc: 1 topic: 1
doc: 2 topic: 14
doc: 3 topic: 8
doc: 4 topic: 14
doc: 5 topic: 14
doc: 6 topic: 14
doc: 7 topic: 14
doc: 8 topic: 14
doc: 9 topic: 8
'''
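The per-document loop can also be written as a single vectorised call, and it can be useful to print the probability of the winning topic alongside it, since argmax alone hides near-ties. The sketch below uses a hypothetical 3-document, 4-topic matrix `toy_doc_topic` in place of the real `doc_topic`.

```python
import numpy as np

# Hypothetical document-topic matrix: 3 documents, 4 topics, rows sum to 1.
toy_doc_topic = np.array([
    [0.10, 0.70, 0.10, 0.10],
    [0.25, 0.25, 0.30, 0.20],
    [0.05, 0.05, 0.05, 0.85],
])

# Most probable topic for every document at once.
best_topic = toy_doc_topic.argmax(axis=1)
# Probability of that topic; a low value means the choice is not clear-cut.
best_prob = toy_doc_topic.max(axis=1)

for doc, (topic, prob) in enumerate(zip(best_topic, best_prob)):
    print("doc: {} topic: {} prob: {:.2f}".format(doc, topic, prob))
```

In this made-up example the second document is assigned topic 2 with a probability of only 0.30, barely ahead of the other topics, so its label deserves less trust than the other two.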