Topic extraction with LDA

This walkthrough trains the model on the fetch_20newsgroups dataset.
Reference:
https://blog.csdn.net/scotfield_msn/article/details/72904651

import gensim
from sklearn.datasets import fetch_20newsgroups
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim.corpora import Dictionary
import os
from pprint import pprint

Prepare the data for training and testing

news_dataset = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
documents = news_dataset.data
print("In the dataset there are", len(documents), "textual documents")
print("And this is the first one:\n", documents[0])

Output:

In the dataset there are 18846 textual documents
And this is the first one:


I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!

Usage notes for this dataset:
https://blog.csdn.net/panghaomingme/article/details/53486252

Preprocessing

1. Tokenize and remove stop words

def tokenize(text):
    return [token for token in simple_preprocess(text) if token not in STOPWORDS]   
print("After the tokenizer, the previous document becomes:\n", tokenize(documents[0]))
processed_docs = [tokenize(doc) for doc in documents]

Output:

 ['sure', 'bashers', 'pens', 'fans', 'pretty', 'confused', 'lack', 'kind', 'posts', 'recent', 'pens', 'massacre', 'devils', 'actually', 'bit', 'puzzled', 'bit', 'relieved', 'going', 'end', 'non', 'pittsburghers', 'relief', 'bit', 'praise', 'pens', 'man', 'killing', 'devils', 'worse', 'thought', 'jagr', 'showed', 'better', 'regular', 'season', 'stats', 'lot', 'fo', 'fun', 'watch', 'playoffs', 'bowman', 'let', 'jagr', 'lot', 'fun', 'couple', 'games', 'pens', 'going', 'beat', 'pulp', 'jersey', 'disappointed', 'islanders', 'lose', 'final', 'regular', 'season', 'game', 'pens', 'rule']
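
To make the tokenization step concrete, here is a minimal pure-Python sketch of what tokenize roughly does. It is not gensim's implementation: toy_tokenize and TOY_STOPWORDS are hypothetical names, the stop-word list is a tiny stand-in for gensim's STOPWORDS, and the regular expression only approximates simple_preprocess (lowercasing, then keeping tokens between 2 and 15 characters long).

```python
import re

# Hypothetical miniature stop-word list; gensim's STOPWORDS is far larger.
TOY_STOPWORDS = frozenset({"i", "am", "to", "a", "of", "the", "with", "for", "an", "put"})

def toy_tokenize(text, min_len=2, max_len=15):
    # Lowercase, split on anything that is not a letter,
    # then drop tokens that are too short, too long, or stop words.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens
            if min_len <= len(t) <= max_len and t not in TOY_STOPWORDS]

sample = "I am going to put an end to non-PIttsburghers' relief with a bit of praise for the Pens."
print(toy_tokenize(sample))
```

On this sentence the sketch yields going, end, non, pittsburghers, relief, bit, praise, pens, the same run of tokens that appears in the output above.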

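2. Build the dictionary

# Associate each word with a unique ID
word_count_dict = Dictionary(processed_docs)
print("In the corpus there are", len(word_count_dict), "unique tokens")
print("\n", word_count_dict)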

Output:

In the corpus there are 95507 unique tokens

 Dictionary(95507 unique tokens: ['jet', 'transfer', 'stratus', 'moicheia', 'multiplies']...)

3. Filter the dictionary to remove very frequent and very rare words

For details on usage, see
https://blog.csdn.net/u014595019/article/details/52218249

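dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=100000)
1. Remove tokens that appear in fewer than no_below documents (an absolute count).
2. Remove tokens that appear in more than no_above of the documents. Note that this is a fraction of the corpus size, not an absolute count.
3. From what remains after 1 and 2, keep only the keep_n most frequent tokens.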

word_count_dict.filter_extremes(no_below=5, no_above=0.05)  
print("After filtering, in the corpus there are only", len(word_count_dict), "unique tokens")

Output:

After filtering, in the corpus there are only 21757 unique tokens

Filtering removed many words: the dictionary shrank from 95507 tokens to 21757.
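
The three filtering rules in this step can be illustrated with a small pure-Python sketch. This is a simplified re-implementation for illustration only, not gensim's code: toy_filter_extremes and toy_docs are hypothetical names, and the thresholds are assumed to apply to document frequency (the number of documents a token appears in).

```python
from collections import Counter

def toy_filter_extremes(docs, no_below=5, no_above=0.5, keep_n=100000):
    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter(token for doc in docs for token in set(doc))
    # no_above is a fraction of the corpus; convert it to a document count.
    max_docs = no_above * len(docs)
    kept = [tok for tok, df in doc_freq.items() if no_below <= df <= max_docs]
    # Keep the keep_n tokens with the highest document frequency.
    kept.sort(key=lambda tok: (-doc_freq[tok], tok))
    return set(kept[:keep_n])

toy_docs = [
    ["pens", "devils", "game"],
    ["pens", "game", "season"],
    ["game", "video", "card"],
    ["game", "card", "monitor"],
]
print(sorted(toy_filter_extremes(toy_docs, no_below=2, no_above=0.5)))
```

With no_below=2 and no_above=0.5, "game" is dropped because it occurs in every document, the tokens seen in only one document are dropped as too rare, and only "card" and "pens" survive.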

4. Represent the documents as bag-of-words vectors

bag_of_words_corpus = [word_count_dict.doc2bow(pdoc) for pdoc in processed_docs]
print(bag_of_words_corpus[0])

Output:

[(104, 1), (941, 1), (1016, 1), (1366, 1), (1585, 1), (1699, 1), (1730, 2), (2122, 1), (2359, 1), (3465, 1), (3811, 1), (4121, 1), (4570, 1), (5336, 2), (5739, 1), (5885, 1), (6533, 1), (6856, 1), (8175, 1), (8519, 1), (8707, 1), (8834, 1), (9126, 2), (9746, 1), (9807, 1), (11553, 1), (11775, 1), (11930, 1), (12398, 1), (12855, 1), (13529, 5), (13958, 1), (14521, 1), (14740, 1), (14928, 2), (15185, 1), (15415, 1), (18229, 1), (18361, 1), (18915, 2), (20936, 1)]
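In each tuple of the list, the first element is the word's ID in the dictionary and the second is the number of times that word occurs in this document.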
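
The (ID, count) representation can be reproduced by hand on a toy vocabulary. The sketch below is an illustration only: toy_doc2bow and toy_token2id are hypothetical names, and it assumes that words missing from the dictionary are skipped and that the pairs come back sorted by ID.

```python
from collections import Counter

def toy_doc2bow(tokens, token2id):
    # Count only tokens that are in the vocabulary, keyed by their ID.
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    # Return (token_id, count) pairs ordered by token ID.
    return sorted(counts.items())

toy_token2id = {"bit": 0, "devils": 1, "pens": 2}
toy_tokens = ["pens", "devils", "bit", "pens", "bit", "jagr", "bit"]
print(toy_doc2bow(toy_tokens, toy_token2id))
```

Here "jagr" is not in the toy vocabulary, so it contributes nothing, just as words removed by filter_extremes no longer appear in the real bag-of-words vectors.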

LDA topic model

For details, see
https://blog.csdn.net/u014595019/article/details/52218249

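1. Train the LDA model

# The first argument is the bag-of-words corpus, num_topics is the number of topics, and id2word (optional) is the dictionary to use. The model is saved so it does not have to be retrained next time.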
model_name = "./model.lda"  
if os.path.exists(model_name):
    lda_model = gensim.models.LdaModel.load(model_name)  
    print("loaded from old")
else:
    # preprocess()  
    lda_model = gensim.models.LdaModel(bag_of_words_corpus, num_topics=80, id2word=word_count_dict)  # num_topics: the number of latent topics to extract
    lda_model.save(model_name)  
    print("loaded from new")

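Inspect the word distributions of the topics in the trained LDA model

Print the word distributions of 20 topics (LDA topics have no natural order, so these are not necessarily the topics with IDs 0 to 19)
lda_model.print_topics(20)
Print the word distribution of the topic whose ID is 20
lda_model.print_topic(20)

# Print 5 topics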
print(lda_model.print_topics(5))

Output:

[(20, '0.021*"yeah" + 0.020*"sc" + 0.019*"darren" + 0.018*"dream" + 0.017*"homosexuality" + 0.015*"skin" + 0.015*"car" + 0.015*"gays" + 0.013*"weather" + 0.012*"greatly"'), (14, '0.041*"ax" + 0.009*"ub" + 0.009*"mm" + 0.008*"mk" + 0.008*"cx" + 0.008*"pl" + 0.007*"yx" + 0.006*"mp" + 0.005*"mr" + 0.005*"max"'), (12, '0.046*"sleeve" + 0.024*"ss" + 0.021*"picture" + 0.020*"dave" + 0.018*"ed" + 0.015*"plutonium" + 0.014*"cox" + 0.014*"netcom" + 0.014*"boys" + 0.011*"frank"'), (11, '0.010*"cyl" + 0.010*"heads" + 0.008*"mfm" + 0.007*"fame" + 0.007*"exclusive" + 0.007*"club" + 0.007*"phillies" + 0.007*"hall" + 0.007*"eric" + 0.006*"mask"'), (6, '0.042*"card" + 0.032*"video" + 0.024*"cd" + 0.020*"monitor" + 0.018*"vga" + 0.014*"cards" + 0.012*"disks" + 0.011*"nec" + 0.011*"includes" + 0.010*"sale"')]

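The trained lda_model holds num_topics topic IDs, each with its own word distribution. Five topic IDs appear here: 20, 14, 12, 11, and 6.
The word distributions show that some words that are not real topic words, such as interjections, are still present, for example yeah and mm.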
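
To work with these word distributions programmatically, the printed topic strings can be split back into (word, weight) pairs. The following is a small sketch that relies only on the string format shown in the output above; parse_topic is a hypothetical helper, not part of gensim.

```python
import re

def parse_topic(topic_string):
    # Each term looks like 0.042*"card"; capture the weight and the word.
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

# The first four terms of topic 6 from the output above.
topic_6 = '0.042*"card" + 0.032*"video" + 0.024*"cd" + 0.020*"monitor"'
print(parse_topic(topic_6))
```

Once the pairs are available as a list, it is straightforward to check which low-information words (such as yeah or mm) carry weight in a topic.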

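2. Testing

Test the trained LDA model on a document taken from the test split. Note that the model above was trained with subset='all', which includes the test split, so this document is not strictly unseen.
1. Tokenize the document and represent it as a bag-of-words vector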

test_dataset=fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))
test_documents=test_dataset.data[0]
test=word_count_dict.doc2bow(tokenize(test_documents))

2. Run the trained LDA model on the document

result = lda_model[test]
print(result)
for topic in result:
    # print_topic(x, y): x is the topic ID, y is how many of the topic's top words to print; words are sorted by weight
    print(lda_model.print_topic(topic[0], 2))

Output:

# In result, the first element of each pair is the topic ID and the second is its weight
[(0, 0.055915479531474528), (1, 0.043632309200769978), (4, 0.039731045163527434), (9, 0.080996565212587676), (17, 0.10949904163764483), (28, 0.052307996592365257), (37, 0.041910355481773388), (41, 0.042178320776285916), (48, 0.054036399139048161), (53, 0.12185345365187586), (55, 0.12552429387152431), (56, 0.037461188219746117), (60, 0.070012626709205633), (63, 0.042189064087661758), (76, 0.048897694057843777)]
Below are the word distributions of the document's topics
0.025*"scsi" + 0.007*"director"
0.031*"keyboard" + 0.030*"key"
0.013*"usr" + 0.011*"apr"
0.016*"captain" + 0.014*"myers"
0.009*"went" + 0.007*"thought"
0.015*"government" + 0.007*"clinton"
0.011*"colors" + 0.010*"shell"
0.045*"cover" + 0.030*"marriage"
0.035*"disk" + 0.018*"jumper"
0.016*"nuclear" + 0.014*"water"
0.015*"server" + 0.014*"software"
0.055*"apple" + 0.025*"modem"
0.009*"israel" + 0.007*"israeli"
0.055*"greek" + 0.038*"greece"
0.012*"mhz" + 0.012*"speed"

The topics above are not sorted by weight. The next step prints only the top n topics by weight, together with their word distributions.

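## test
result = lda_model[test]
# Sort by the second element (the weight) in descending order
result_sort = sorted(result, key=lambda tup: tup[1], reverse=True)
print("Below are the word distributions of the document's topics")
# Number of top topics to print for this document
topic_num = 5
for topic in result_sort[:topic_num]:
    print(lda_model.print_topic(topic[0], 2), topic[1])
# The first value is the topic's word distribution, the second is the topic's weight
Below are the word distributions of the document's topics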
0.009*"faq" + 0.007*"random" 0.152984684444
0.012*"mhz" + 0.012*"speed" 0.101706532617
0.008*"probe" + 0.007*"earth" 0.088976368815
0.019*"lib" + 0.018*"memory" 0.0841998334088
0.016*"israel" + 0.013*"iran" 0.0832577504835
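
As an alternative to sorting the whole list, the top n topics can be picked with heapq.nlargest from the standard library. The sketch below runs on a hand-copied subset of the unsorted (topic ID, weight) pairs printed earlier, with weights rounded to four decimals; top_topics is a hypothetical helper name.

```python
import heapq

# Abridged and rounded (topic_id, weight) pairs from the unsorted result above.
result = [(0, 0.0559), (9, 0.0810), (17, 0.1095), (53, 0.1219), (55, 0.1255), (60, 0.0700)]

def top_topics(pairs, n):
    # Return the n pairs with the largest weight, heaviest first.
    return heapq.nlargest(n, pairs, key=lambda pair: pair[1])

print(top_topics(result, 3))
```

For these pairs the three heaviest topics are 55, 53, and 17, in that order.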
