Recommended places to get an LDA implementation are listed below; the first three are the most commonly used.
gensim (ldamodel documentation): https://radimrehurek.com/gensim/models/ldamodel.html
lda package (pip install lda): https://github.com/ariddell/lda
scikit-learn official documentation: LatentDirichletAllocation
For a scikit-learn code example, refer to the following article:
Topic extraction with NMF and Latent Dirichlet Allocation
Part of its output is shown below, listing the topic words for each topic:
Loading dataset...
Fitting LDA models with tf features, n_samples=2000 and n_features=1000...
done in 0.733s.

Topics in LDA model:
Topic #0: 000 war list people sure civil lot wonder say religion america accepted punishment bobby add liberty person kill concept wrong
Topic #1: just reliable gods consider required didn war makes little seen faith default various civil motto sense currency knowledge belief god
Topic #2: god omnipotence power mean rules omnipotent deletion policy non nature suppose definition given able goal nation add place powerful leaders
...
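If you want to reproduce output of this shape yourself, the sketch below follows the general pattern of that official example: build a term-frequency matrix with CountVectorizer, fit LatentDirichletAllocation, and print each topic's top words. The corpus slice, parameter values, and topic count here are illustrative assumptions rather than the exact settings of the official script, and note that older scikit-learn versions call the n_components parameter n_topics and expose get_feature_names instead of get_feature_names_out.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Load a slice of the 20 Newsgroups corpus (illustrative size)
docs = fetch_20newsgroups(remove=('headers', 'footers', 'quotes')).data[:2000]

# Build the term-frequency (tf) matrix that LDA expects
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=1000, stop_words='english')
tf = tf_vectorizer.fit_transform(docs)

# Fit LDA and print the top 20 words of each topic
lda_model = LatentDirichletAllocation(n_components=10, max_iter=10,
                                      learning_method='online', random_state=0)
lda_model.fit(tf)
feature_names = tf_vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda_model.components_):
    top = topic.argsort()[:-21:-1]
    print("Topic #{}: {}".format(topic_idx, " ".join(feature_names[i] for i in top)))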
model_parameter.dat: saves the parameters chosen when training the model
wordidmap.dat: saves the word-to-id mapping, used mainly for topN lookups
model_twords.dat: outputs the topN high-frequency words for each topic
model_tassgin.dat: outputs the topic assignment for each word in every document, in the format "word id: topic id"
model_theta.dat: outputs the document-topic distribution; each line represents one document, and "prob1 prob2 ..." gives the probability that the document belongs to each topic
model_phi.dat: outputs the word-topic distribution as a K*M matrix, where K is the configured number of topics and M is the total number of words across all documents
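As a hypothetical illustration of the layout described above (the file name and format are taken from the list; this code is not part of the original project), here is how model_theta.dat could be parsed to find each document's most probable topic:

# Hypothetical sketch: parse model_theta.dat, where each line holds one
# document's topic probabilities separated by whitespace (layout as
# described above; not code from the original project).
with open('model_theta.dat') as f:
    theta = [[float(p) for p in line.split()] for line in f if line.strip()]

for doc_id, probs in enumerate(theta[:5]):
    best = max(range(len(probs)), key=probs.__getitem__)
    print("doc {}: most likely topic {} (p={:.4f})".format(doc_id, best, probs[best]))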
It works acceptably for short texts, but when run on a large amount of text, the output document-topic distributions contain many nearly identical rows, so that code may still have bugs.

pip install lda
Reference: [python] Installing numpy+scipy+matplotlib+scikit-learn and resolving problems
This part mainly draws on the links below, which I strongly recommend reading and studying:
Official documentation: https://github.com/ariddell/lda
lda: Topic modeling with latent Dirichlet Allocation
Getting started with Latent Dirichlet Allocation in Python - sandbox
[Translation] Using LDA to process text in Python - letiantian
Practical text analysis with TF-IDF/LDA/Word2vec - vs412237401
import numpy as np
import lda
import lda.datasets

# document-term matrix
X = lda.datasets.load_reuters()
print("type(X): {}".format(type(X)))
print("shape: {}\n".format(X.shape))
print(X[:5, :5])

# the vocab
vocab = lda.datasets.load_reuters_vocab()
print("type(vocab): {}".format(type(vocab)))
print("len(vocab): {}\n".format(len(vocab)))
print(vocab[:5])

# titles for each story
titles = lda.datasets.load_reuters_titles()
print("type(titles): {}".format(type(titles)))
print("len(titles): {}\n".format(len(titles)))
print(titles[:5])

After loading the lda package's built-in dataset, the output is as follows:
type(X): <type 'numpy.ndarray'>
shape: (395L, 4258L)

[[ 1  0  1  0  0]
 [ 7  0  2  0  0]
 [ 0  0  0  1 10]
 [ 6  0  1  0  0]
 [ 0  0  0  2 14]]
type(vocab): <type 'tuple'>
len(vocab): 4258

('church', 'pope', 'years', 'people', 'mother')
type(titles): <type 'tuple'>
len(titles): 395

('0 UK: Prince Charles spearheads British royal revolution. LONDON 1996-08-20',
 '1 GERMANY: Historic Dresden church rising from WW2 ashes. DRESDEN, Germany 1996-08-21',
 "2 INDIA: Mother Teresa's condition said still unstable. CALCUTTA 1996-08-23",
 '3 UK: Palace warns British weekly over Charles pictures. LONDON 1996-08-25',
 '4 INDIA: Mother Teresa, slightly stronger, blesses nuns. CALCUTTA 1996-08-25')

From the above we can see that there are 395 news items (documents) and a vocabulary of size 4258. The document-term matrix, X, has a count of the number of occurrences of each of the 4258 vocabulary words for each of the 395 documents.
# X[0,3117] is the number of times that word 3117 occurs in document 0
doc_id = 0
word_id = 3117
print("doc id: {} word id: {}".format(doc_id, word_id))
print("-- count: {}".format(X[doc_id, word_id]))
print("-- word : {}".format(vocab[word_id]))
print("-- doc : {}".format(titles[doc_id]))

'''Output:
doc id: 0 word id: 3117
-- count: 2
-- word : heir-to-the-throne
-- doc : 0 UK: Prince Charles spearheads British royal revolution. LONDON 1996-08-20
'''
Next, fit an LDA model with 20 topics and 500 sampling iterations:

model = lda.LDA(n_topics=20, n_iter=500, random_state=1)
model.fit(X)  # model.fit_transform(X) is also available
topic_word = model.topic_word_
print("type(topic_word): {}".format(type(topic_word)))
print("shape: {}".format(topic_word.shape))
print(vocab[:3])
print(topic_word[:, :3])

for n in range(5):
    sum_pr = sum(topic_word[n, :])
    print("topic: {} sum: {}".format(n, sum_pr))

The output is as follows:
type(topic_word): <type 'numpy.ndarray'>
shape: (20L, 4258L)
('church', 'pope', 'years')
[[  2.72436509e-06   2.72436509e-06   2.72708945e-03]
 [  2.29518860e-02   1.08771556e-06   7.83263973e-03]
 [  3.97404221e-03   4.96135108e-06   2.98177200e-03]
 [  3.27374625e-03   2.72585033e-06   2.72585033e-06]
 [  8.26262882e-03   8.56893407e-02   1.61980569e-06]
 [  1.30107788e-02   2.95632328e-06   2.95632328e-06]
 [  2.80145003e-06   2.80145003e-06   2.80145003e-06]
 [  2.42858077e-02   4.66944966e-06   4.66944966e-06]
 [  6.84655429e-03   1.90129250e-06   6.84655429e-03]
 [  3.48361655e-06   3.48361655e-06   3.48361655e-06]
 [  2.98781661e-03   3.31611166e-06   3.31611166e-06]
 [  4.27062069e-06   4.27062069e-06   4.27062069e-06]
 [  1.50994982e-02   1.64107142e-06   1.64107142e-06]
 [  7.73480150e-07   7.73480150e-07   1.70946848e-02]
 [  2.82280146e-06   2.82280146e-06   2.82280146e-06]
 [  5.15309856e-06   5.15309856e-06   4.64294180e-03]
 [  3.41695768e-06   3.41695768e-06   3.41695768e-06]
 [  3.90980357e-02   1.70316633e-03   4.42279319e-03]
 [  2.39373034e-06   2.39373034e-06   2.39373034e-06]
 [  3.32493234e-06   3.32493234e-06   3.32493234e-06]]
topic: 0 sum: 1.0
topic: 1 sum: 1.0
topic: 2 sum: 1.0
topic: 3 sum: 1.0
topic: 4 sum: 1.0
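Since each row of topic_word_ is a probability distribution over the full vocabulary, every row should sum to (numerically close to) 1. A quick check over all 20 topics at once, allowing for floating-point round-off:

# Verify that every topic's word distribution sums to 1 (up to round-off)
import numpy as np
assert np.allclose(topic_word.sum(axis=1), 1.0)
print("all {} topic distributions sum to 1".format(topic_word.shape[0]))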
n = 5
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n+1):-1]
    print('*Topic {}\n- {}'.format(i, ' '.join(topic_words)))

The output is as follows:
*Topic 0
- government british minister west group
*Topic 1
- church first during people political
*Topic 2
- elvis king wright fans presley
*Topic 3
- yeltsin russian russia president kremlin
*Topic 4
- pope vatican paul surgery pontiff
*Topic 5
- family police miami versace cunanan
*Topic 6
- south simpson born york white
*Topic 7
- order church mother successor since
*Topic 8
- charles prince diana royal queen
*Topic 9
- film france french against actor
*Topic 10
- germany german war nazi christian
*Topic 11
- east prize peace timor quebec
*Topic 12
- n't told life people church
*Topic 13
- years world time year last
*Topic 14
- mother teresa heart charity calcutta
*Topic 15
- city salonika exhibition buddhist byzantine
*Topic 16
- music first people tour including
*Topic 17
- church catholic bernardin cardinal bishop
*Topic 18
- harriman clinton u.s churchill paris
*Topic 19
- century art million museum city
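The topic_word matrix can also be read column-wise: each column holds one vocabulary word's probability under every topic, so you can ask which topic favors a given word most. A small illustration using the variables already defined above:

# Column-wise query: which topic gives the highest probability to a word?
word_id = vocab.index('church')
best_topic = topic_word[:, word_id].argmax()
print("'church' has its highest probability under topic {}".format(best_topic))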
doc_topic = model.doc_topic_
print("type(doc_topic): {}".format(type(doc_topic)))
print("shape: {}".format(doc_topic.shape))

for n in range(10):
    topic_most_pr = doc_topic[n].argmax()
    print("doc: {} topic: {}".format(n, topic_most_pr))

The output is as follows:
type(doc_topic): <type 'numpy.ndarray'>
shape: (395L, 20L)
doc: 0 topic: 8
doc: 1 topic: 1
doc: 2 topic: 14
doc: 3 topic: 8
doc: 4 topic: 14
doc: 5 topic: 14
doc: 6 topic: 14
doc: 7 topic: 14
doc: 8 topic: 14
doc: 9 topic: 8
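To eyeball whether these assignments are reasonable, you can print each document's title next to the top words of its most likely topic. This small sanity check is built entirely from the variables defined above:

# Sanity check: pair each title with the top words of its dominant topic
for n in range(5):
    k = doc_topic[n].argmax()
    top_words = np.array(vocab)[np.argsort(topic_word[k])][:-6:-1]
    print("{}\n  -> topic {}: {}".format(titles[n], k, ' '.join(top_words)))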
import matplotlib.pyplot as plt

f, ax = plt.subplots(5, 1, figsize=(8, 6), sharex=True)
for i, k in enumerate([0, 5, 9, 14, 19]):
    ax[i].stem(topic_word[k, :], linefmt='b-', markerfmt='bo', basefmt='w-')
    ax[i].set_xlim(-50, 4350)
    ax[i].set_ylim(0, 0.08)
    ax[i].set_ylabel("Prob")
    ax[i].set_title("topic {}".format(k))
ax[4].set_xlabel("word")
plt.tight_layout()
plt.show()

The output is shown in the figure below:
import matplotlib.pyplot as plt

f, ax = plt.subplots(5, 1, figsize=(8, 6), sharex=True)
for i, k in enumerate([1, 3, 4, 8, 9]):
    ax[i].stem(doc_topic[k, :], linefmt='r-', markerfmt='ro', basefmt='w-')
    ax[i].set_xlim(-1, 21)
    ax[i].set_ylim(0, 1)
    ax[i].set_ylabel("Prob")
    ax[i].set_title("Document {}".format(k))
ax[4].set_xlabel("Topic")
plt.tight_layout()
plt.show()
The output is shown in the figure below:
This article has been an introductory look at using LDA in Python. The next article will apply it to concrete txt files, covering word segmentation and computing document-topic distributions, and will also touch on computing term frequency (TF) and TF-IDF in Python.
Because fit() kept raising the error "TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'", I later switched to scikit-learn's method for computing term frequencies (TF):
http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction
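The root cause is that lda.LDA.fit() expects a matrix of integer word counts, while TF-IDF features are floats. Below is a minimal sketch of computing term frequencies with scikit-learn's CountVectorizer; the toy corpus is an illustration only, and older scikit-learn versions expose get_feature_names instead of get_feature_names_out.

from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus for illustration only
corpus = ["the quick brown fox", "the lazy dog sleeps", "the quick dog barks"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)   # sparse matrix of integer counts

print(vectorizer.get_feature_names_out())
print(counts.toarray())                     # dense integer array, safe for lda.LDA.fit()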
In short, I hope this article helps you, especially readers who are new to machine learning, scikit-learn, or LDA. I'm really just a layman myself and have never studied machine learning systematically, so I understand exactly what it feels like to not know how to use an algorithm; only once you can use it fluently does it start to seem simple. Getting you past that point is the goal of this article.
Finally, many thanks to the authors of the articles linked above for sharing their work. Please bear with any shortcomings~
(By: Eastmount, 2016-03-17, 3:30 a.m. http://blog.csdn.net/eastmount/ )