NLP Topic Extraction: a Topic LDA Learning Case
For reference material on the data preparation, see: https://blog.csdn.net/xiaoql520/article/details/79883409
Further references are listed at the end of the article, after the code.
# -*- coding: UTF-8 -*-
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
import os
from pprint import pprint
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim.corpora import Dictionary
from sklearn.datasets import fetch_20newsgroups

# Prepare the data
news_dataset = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))  # fetch and cache the dataset
documents = news_dataset.data
print("In the dataset there are", len(documents), "textual documents")
"""
In the dataset there are 18846 textual documents
"""
print("And this is the first one:\n", documents[0])
"""And this is the first one:
I am sure some bashers of Pens fans are pretty confused about the lack of any kind of posts about the recent Pens massacre of the Devils. Actually, I am bit puzzled too and a bit relieved. However, I am going to put an end to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they are killing those Devils worse than I thought. Jagr just showed you why he is much better than his regular season stats. He is also a lot fo fun to watch in the playoffs. Bowman should let JAgr have a lot of fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final regular season game. PENS RULE!!!
"""
print("In the dataset, the filenames are as follows:\n", news_dataset.filenames)
"""
In the dataset, the filenames are as follows:
['C:\\Users\\xiaoQ\\scikit_learn_data\\20news_home\\20news-bydate-test\\rec.sport.hockey\\54367'
 'C:\\Users\\xiaoQ\\scikit_learn_data\\20news_home\\20news-bydate-train\\comp.sys.ibm.pc.hardware\\60215'
 'C:\\Users\\xiaoQ\\scikit_learn_data\\20news_home\\20news-bydate-train\\talk.politics.mideast\\76120'
 ...
 'C:\\Users\\xiaoQ\\scikit_learn_data\\20news_home\\20news-bydate-train\\comp.sys.ibm.pc.hardware\\60695'
 'C:\\Users\\xiaoQ\\scikit_learn_data\\20news_home\\20news-bydate-train\\comp.graphics\\38319'
 'C:\\Users\\xiaoQ\\scikit_learn_data\\20news_home\\20news-bydate-test\\rec.autos\\103195']
"""
print("In the dataset, the target is as follows:\n", news_dataset.target)
"""
In the dataset, the target is as follows:
[10  3 17 ...  3  1  7]
"""

# Tokenize the sentences (word segmentation, stop-word removal, etc.) so that each one can later be represented as a bag-of-words vector
def tokenize(text):
    """
    Tokenize `text` and remove stop words. STOPWORDS is the stop-word collection of Stone, Denis & Kwantes (2010).
    :param text: the text to process
    :return: the sequence of tokens with stop words removed
    """
    return [token for token in simple_preprocess(text) if token not in STOPWORDS]

print("The first document after tokenization and stop-word removal:\n", tokenize(documents[0]))
"""
The first document after tokenization and stop-word removal:
['sure', 'bashers', 'pens', 'fans', 'pretty', 'confused', 'lack', 'kind', 'posts', 'recent', 'pens', 'massacre', 'devils', 'actually', 'bit', 'puzzled', 'bit', 'relieved', 'going', 'end', 'non', 'pittsburghers', 'relief', 'bit', 'praise', 'pens', 'man', 'killing', 'devils', 'worse', 'thought', 'jagr', 'showed', 'better', 'regular', 'season', 'stats', 'lot', 'fo', 'fun', 'watch', 'playoffs', 'bowman', 'let', 'jagr', 'lot', 'fun', 'couple', 'games', 'pens', 'going', 'beat', 'pulp', 'jersey', 'disappointed', 'islanders', 'lose', 'final', 'regular', 'season', 'game', 'pens', 'rule']
"""

# Tokenize every document in the collection and remove stop words, yielding the corresponding token sequences.
processed_docs = [tokenize(doc) for doc in documents]

# Dictionary encapsulates the mapping between normalized words and their integer ids, i.e. every word that appears
# in the corpus is assigned a unique integer id. It also collects word counts and other related statistics.
word_count_dict = Dictionary(processed_docs)
print("word_count_dict:", word_count_dict)
"""
word_count_dict: Dictionary(95507 unique tokens: ['actually', 'bashers', 'beat', 'better', 'bit']...)
"""

# Check the vocabulary size: the number of distinct tokens in the dictionary
print("The current corpus contains", len(word_count_dict), "distinct tokens")
"""
The current corpus contains 95507 distinct tokens
"""

# Keep only tokens that appear in at least 20 documents but in no more than 10% of all documents
# (note: because the vocabulary shrinks, some token ids may change)
word_count_dict.filter_extremes(no_below=20, no_above=0.1)
print("After filtering, the corpus contains only", len(word_count_dict), "distinct tokens")
"""
After filtering, the corpus contains only 8121 distinct tokens
"""

# Bag-of-words representation of every document in the corpus
bag_of_words_corpus = [word_count_dict.doc2bow(pdoc) for pdoc in processed_docs]
print("bag_of_words_corpus:\n", bag_of_words_corpus)
"""
(token id, token count)
... (4081, 1), (4372, 2), (4684, 1), (4743, 1), (5112, 1), (5117, 1), (5200, 1), (5250, 1), (5443, 1), (5586, 1), (5782, 1), (6212, 1), (6483, 1), (6636, 1), (6707, 1), (7237, 1), (7453, 1)], [(85, 1), (255, 1), (354, 1), (624, 1), (2855, 1), (3449, 1), (3626, 1), (3774, 1), (3847, 1), (3917, 1), (7613, 1)], [(24, 4), (28, 4), ...
"""

# Fit the LDA model
# If a saved model already exists, load it; otherwise the constructor estimates the Dirichlet model from the training corpus:
model_name = "./modle/model.lda"
if os.path.exists(model_name):
    lda_model = gensim.models.LdaModel.load(model_name)
    print("Loaded an existing model")
else:
    # num_topics: the maximum number of topics the model can provide
    lda_model = gensim.models.LdaModel(bag_of_words_corpus, num_topics=100, id2word=word_count_dict, passes=5)
    # save the model (create the target directory first so that save() does not fail)
    os.makedirs(os.path.dirname(model_name), exist_ok=True)
    lda_model.save(model_name)
    print("Loaded the newly created model")

# Check how well topics can be extracted for unseen sentences or documents
print("1. Without specifying a document")
# Get the most significant topics: 30 topics with 6 words each
print("Selecting 30 topics:")
pprint(lda_model.print_topics(30, 6))
"""
Selecting 30 topics:
[(46, '0.749*"max" + 0.109*"ax" + 0.021*"rx" + 0.016*"mb" + 0.015*"pl" + 0.015*"um"'),
 (33, '0.051*"et" + 0.033*"location" + 0.032*"title" + 0.027*"physics" + 0.025*"msg" + 0.023*"ne"'),
 (4, '0.042*"page" + 0.039*"sun" + 0.034*"language" + 0.032*"cancer" + 0.025*"pro" + 0.023*"ati"'),
 (83, '0.102*"game" + 0.037*"play" + 0.034*"games" + 0.026*"fan" + 0.022*"watch" + 0.019*"night"'),
 (45, '0.028*"version" + 0.027*"number" + 0.027*"tech" + 0.026*"contact" + 0.025*"quality" + 0.021*"phone"'),
 (58, '0.145*"thanks" + 0.048*"advance" + 0.037*"hi" + 0.035*"microsoft" + 0.033*"anybody" + 0.030*"mail"'),
 (65, '0.061*"attack" + 0.042*"bank" + 0.040*"islamic" + 0.029*"islam" + 0.026*"sea" + 0.025*"land"'),
 (96, '0.075*"book" + 0.024*"read" + 0.018*"quote" + 0.016*"physical" + 0.015*"dan" + 0.014*"edge"'),
 (9, '0.038*"san" + 0.029*"city" + 0.025*"york" + 0.022*"division" + 0.019*"boston" + 0.019*"california"'),
 (77, '0.188*"db" + 0.077*"tcp" + 0.073*"ip" + 0.038*"pts" + 0.032*"substance" + 0.027*"rw"'),
 (34, '0.028*"government" + 0.019*"rights" + 0.014*"country" + 0.012*"states" + 0.012*"state" + 0.010*"united"'),
 (1, '0.042*"ago" + 0.031*"tv" + 0.030*"face" + 0.025*"wasn" + 0.024*"heard" + 0.021*"seen"'),
 (18, '0.024*"going" + 0.024*"hell" + 0.019*"guys" + 0.019*"guy" + 0.016*"lot" + 0.015*"ll"'),
 (27, '0.051*"state" + 0.049*"law" + 0.042*"laws" + 0.039*"shall" + 0.028*"ra" + 0.026*"amendment"'),
 (80, '0.110*"data" + 0.032*"message" + 0.027*"number" + 0.024*"users" + 0.021*"access" + 0.016*"block"'),
 (51, '0.048*"technology" + 0.034*"national" + 0.030*"law" + 0.026*"administration" + 0.024*"enforcement" + 0.021*"agencies"'),
 (67, '0.036*"section" + 0.030*"output" + 0.029*"code" + 0.029*"program" + 0.028*"input" + 0.027*"line"'),
 (66, '0.107*"window" + 0.070*"application" + 0.037*"manager" + 0.021*"create" + 0.020*"user" + 0.019*"program"'),
 (15, '0.082*"god" + 0.031*"bible" + 0.026*"religion" + 0.026*"christian" + 0.017*"christians" + 0.016*"religious"'),
 (13, '0.039*"fbi" + 0.031*"koresh" + 0.024*"guns" + 0.021*"batf" + 0.019*"children" + 0.018*"gun"'),
 (90, '0.044*"systems" + 0.027*"analysis" + 0.026*"applications" + 0.023*"processing" + 0.019*"font" + 0.018*"programming"'),
 (43, '0.021*"evidence" + 0.010*"point" + 0.009*"case" + 0.009*"claim" + 0.009*"science" + 0.007*"mind"'),
 (74, '0.032*"myers" + 0.018*"guide" + 0.017*"book" + 0.017*"children" + 0.015*"verse" + 0.015*"considered"'),
 (2, '0.040*"unit" + 0.038*"vision" + 0.035*"tom" + 0.030*"length" + 0.029*"phil" + 0.029*"instructions"'),
 (91, '0.068*"price" + 0.058*"board" + 0.042*"tape" + 0.037*"pin" + 0.032*"old" + 0.031*"cable"'),
 (12, '0.093*"key" + 0.038*"chip" + 0.036*"encryption" + 0.033*"keys" + 0.028*"public" + 0.027*"clipper"'),
 (87, '0.024*"problem" + 0.019*"ll" + 0.016*"got" + 0.016*"little" + 0.016*"thing" + 0.015*"better"'),
 (36, '0.045*"team" + 0.031*"year" + 0.027*"season" + 0.021*"players" + 0.019*"hockey" + 0.019*"league"'),
 (47, '0.109*"space" + 0.024*"shuttle" + 0.021*"design" + 0.020*"station" + 0.018*"nasa" + 0.016*"flight"'),
 (25, '0.128*"edu" + 0.040*"cs" + 0.031*"uk" + 0.025*"apr" + 0.025*"ca" + 0.024*"ac"')]
"""

print("\n")
# Get the most significant topics: 10 topics with 10 words each
print("Selecting 10 topics:")
pprint(lda_model.print_topics(10))
"""
Selecting 10 topics:
[(70, '0.122*"car" + 0.041*"cars" + 0.025*"road" + 0.018*"dog" + 0.017*"auto" + 0.016*"driving" + 0.016*"speed" + 0.015*"xfree" + 0.015*"ford" + 0.015*"automatic"'),
 (76, '0.047*"games" + 0.044*"runs" + 0.039*"win" + 0.029*"mike" + 0.026*"game" + 0.023*"year" + 0.020*"run" + 0.018*"pitcher" + 0.018*"smith" + 0.017*"pitching"'),
 (26, '0.156*"drive" + 0.078*"hard" + 0.077*"disk" + 0.072*"mac" + 0.056*"apple" + 0.032*"floppy" + 0.027*"internal" + 0.026*"external" + 0.015*"installed" + 0.014*"software"'),
 (33, '0.051*"et" + 0.033*"location" + 0.032*"title" + 0.027*"physics" + 0.025*"msg" + 0.023*"ne" + 0.018*"theory" + 0.018*"mercury" + 0.017*"kt" + 0.016*"map"'),
 (24, '0.025*"unix" + 0.023*"os" + 0.023*"multi" + 0.020*"support" + 0.018*"sec" + 0.017*"features" + 0.016*"built" + 0.015*"vendor" + 0.013*"installation" + 0.013*"product"'),
 (93, '0.026*"dc" + 0.026*"reported" + 0.026*"study" + 0.024*"volume" + 0.021*"newsletter" + 0.019*"washington" + 0.019*"increased" + 0.018*"vehicle" + 0.016*"news" + 0.014*"reports"'),
 (46, '0.749*"max" + 0.109*"ax" + 0.021*"rx" + 0.016*"mb" + 0.015*"pl" + 0.015*"um" + 0.011*"au" + 0.010*"dm" + 0.009*"eq" + 0.008*"shifts"'),
 (97, '0.051*"card" + 0.044*"mb" + 0.040*"scsi" + 0.028*"drives" + 0.025*"bus" + 0.025*"mhz" + 0.025*"bit" + 0.023*"controller" + 0.020*"speed" + 0.019*"cpu"'),
 (86, '0.060*"sale" + 0.039*"model" + 0.038*"offer" + 0.037*"condition" + 0.034*"shipping" + 0.032*"asking" + 0.031*"manual" + 0.025*"box" + 0.024*"included" + 0.023*"sell"'),
 (3, '0.062*"power" + 0.024*"drug" + 0.023*"low" + 0.023*"ground" + 0.022*"drugs" + 0.021*"high" + 0.020*"rate" + 0.019*"wire" + 0.019*"supply" + 0.018*"current"')]
"""

print("2. Using an unseen document")
unseen_document = "In my spare time I either play badmington or drive my car"
print("The unseen document is:", unseen_document)
print()
bow_vector = word_count_dict.doc2bow(tokenize(unseen_document))
for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1 * tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 7)))
"""
The unseen document is: In my spare time I either play badmington or drive my car

Score: 0.23174265027046204	 Topic: 0.156*"drive" + 0.078*"hard" + 0.077*"disk" + 0.072*"mac" + 0.056*"apple" + 0.032*"floppy" + 0.027*"internal"
Score: 0.21374230086803436	 Topic: 0.122*"car" + 0.041*"cars" + 0.025*"road" + 0.018*"dog" + 0.017*"auto" + 0.016*"driving" + 0.016*"speed"
Score: 0.2019999921321869	 Topic: 0.102*"game" + 0.037*"play" + 0.034*"games" + 0.026*"fan" + 0.022*"watch" + 0.019*"night" + 0.019*"espn"
Score: 0.16051504015922546	 Topic: 0.024*"problem" + 0.019*"ll" + 0.016*"got" + 0.016*"little" + 0.016*"thing" + 0.015*"better" + 0.015*"probably"
"""
The gensim APIs used above are summarized below (adapted from the gensim documentation; see reference 2 at the end).

print_topics(num_topics=20, num_words=10)
Get the most significant topics (an alias for the show_topics() method).
Returns: a sequence of (topic_id, [(word, value), …]) pairs.
Return type: list of (int, list of (str, float)).
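The minimal sketch below shows how print_topics() is typically called. The toy documents and parameter values are invented purely for illustration; as in the output earlier in this article, each returned entry pairs a topic id with a formatted "weight*word + …" string.

# Hedged sketch: the toy documents and hyperparameters below are illustrative only.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

toy_docs = [["cat", "dog", "pet"], ["dog", "bone", "pet"],
            ["stock", "market", "price"], ["market", "price", "trade"]]
toy_dict = Dictionary(toy_docs)
toy_corpus = [toy_dict.doc2bow(d) for d in toy_docs]
toy_lda = LdaModel(toy_corpus, num_topics=2, id2word=toy_dict, passes=10)

# Each entry pairs a topic id with a formatted string of its top words.
for topic_id, topic_str in toy_lda.print_topics(num_topics=2, num_words=3):
    print(topic_id, topic_str)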
gensim.models.ldamodel.LdaModel(corpus=None, num_topics=100, id2word=None, distributed=False, chunksize=2000, passes=1, update_every=1, alpha='symmetric', eta=None, decay=0.5, offset=1.0, eval_every=10, iterations=50, gamma_threshold=0.001, minimum_probability=0.01, random_state=None, ns_conf=None, minimum_phi_value=0.01, per_word_topics=False, callbacks=None, dtype=<class 'numpy.float32'>)
The constructor estimates the Dirichlet model parameters from a training corpus.
Model persistence is handled through the load()/save() methods.
corpus: if given, training starts immediately on the iterable corpus. If not given, the model is left untrained (presumably because you want to call update() manually).
num_topics: the number of latent topics to extract from the training corpus.
id2word: a mapping from word ids (integers) to words (strings). It is used to determine the vocabulary size, as well as for debugging and topic printing.
alpha and eta: hyperparameters that affect the sparsity of the document-topic (theta) and topic-word (lambda) distributions. Both default to a symmetric 1.0/num_topics prior.
alpha can be set to an explicit array of prior values of your own choosing. It also supports the special values 'asymmetric' and 'auto': the former uses a fixed normalized asymmetric 1.0/topicno prior, the latter learns an asymmetric prior directly from your data.
eta: can be a scalar for a symmetric prior over topic/word distributions, or a vector of length num_words that imposes a (user-defined) asymmetric prior over the word distribution. It also supports the special value 'auto', which learns an asymmetric prior over words directly from your data. eta can also be a num_topics x num_words matrix, imposing a separate asymmetric word-distribution prior for each topic (these cannot be learned from data).
eval_every: log a perplexity estimate from the latest model update every eval_every updates (setting this to 1 slows down training by roughly 2x). The default is 10, for better performance. Set it to None to disable perplexity estimation.
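As a rough illustration of the parameters described above, the sketch below trains a tiny model with alpha and eta learned from the data. The toy corpus and all argument values are assumptions chosen for demonstration, not part of the original script.

# Illustrative sketch only; the toy corpus and argument values are made up.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["apple", "fruit", "juice"], ["fruit", "banana", "juice"],
        ["engine", "car", "wheel"], ["car", "wheel", "road"]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

lda = LdaModel(
    corpus=corpus,        # train immediately on this iterable corpus
    id2word=dictionary,   # word-id -> word mapping (vocabulary size, topic printing)
    num_topics=2,         # number of latent topics to extract
    passes=10,            # number of full passes over the corpus
    alpha='auto',         # learn an asymmetric document-topic prior from the data
    eta='auto',           # learn an asymmetric topic-word prior from the data
    eval_every=None,      # disable per-update perplexity logging
    random_state=42)      # fix the seed so repeated runs are comparable

print(lda.print_topics(num_topics=2, num_words=3))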
load(fname, *args, **kwargs)
Load an object previously saved to file with save().
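A short sketch of the load() side of the save()/load() round trip; the path simply reuses the "./modle/model.lda" location from the script above.

# Sketch: restore a model previously persisted with lda_model.save(model_path).
import os
import gensim

model_path = "./modle/model.lda"
if os.path.exists(model_path):
    lda_model = gensim.models.LdaModel.load(model_path)
    print(lda_model.print_topics(5))
else:
    print("No saved model found; train one first and persist it with lda_model.save(model_path)")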
doc2bow(document, allow_update=False, return_missing=False)
Convert a document into the bag-of-words (BoW) format, i.e. a list of (token_id, token_count) tuples.
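To make the (token_id, token_count) format concrete, here is a tiny sketch; the example tokens and dictionary contents are invented, so the exact ids will differ.

# Hedged sketch of doc2bow(); actual token ids depend on the dictionary.
from gensim.corpora import Dictionary

dictionary = Dictionary([["car", "drive", "road"], ["car", "engine"]])
bow, missing = dictionary.doc2bow(["car", "car", "road", "bicycle"], return_missing=True)
print(bow)      # known tokens as (token_id, token_count), e.g. "car" with count 2
print(missing)  # with return_missing=True, unknown tokens come back separately, e.g. {'bicycle': 1}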
filter_extremes(no_below=5, no_above=0.5, keep_n=100000, keep_tokens=None)
Filter tokens in the dictionary by frequency.
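A small sketch of how these thresholds behave; the toy documents and threshold values are arbitrary, chosen only to mirror the filter_extremes(no_below=20, no_above=0.1) call in the script above.

# Illustrative sketch; documents and thresholds are arbitrary.
from gensim.corpora import Dictionary

docs = [["common", "rare", "word"], ["common", "word"], ["common", "other"]]
dictionary = Dictionary(docs)
print("before filtering:", len(dictionary), "tokens")
# Keep tokens appearing in at least 2 documents and in at most 90% of documents,
# retaining at most the 100000 most frequent tokens; remaining token ids are reassigned.
dictionary.filter_extremes(no_below=2, no_above=0.9, keep_n=100000)
print("after filtering:", len(dictionary), "tokens")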
gensim.utils.simple_preprocess(doc, deacc=False, min_len=2, max_len=15)
Convert a document into a list of tokens (lowercased, with optional de-accenting), using tokenize().
Returns: the tokens extracted from the text.
Return type: list of str.
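A quick sketch of what simple_preprocess() produces; the sample sentence is invented, but it shows the lowercasing, punctuation stripping, and the min_len/max_len token-length filter.

# Hedged sketch; the sample sentence is made up.
from gensim.utils import simple_preprocess

text = "Gensim makes topic modelling easy: preprocess, build a dictionary, then train an LDA model!"
print(simple_preprocess(text, deacc=False, min_len=2, max_len=15))
# Expected to yield lowercase tokens with punctuation stripped and
# one-character tokens (like "a") dropped, e.g. ['gensim', 'makes', 'topic', ...]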
References for this article:
1. https://blog.csdn.net/scotfield_msn/article/details/72904651
2. https://radimrehurek.com/gensim/apiref.html