NLP Topic Extraction: an LDA Learning Case (Part 1)


For reference material on the data-preparation step, see: https://blog.csdn.net/xiaoql520/article/details/79883409

Further references are listed at the end of the article.

# -*- coding: UTF-8 -*-

import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')

import gensim
from sklearn.datasets import fetch_20newsgroups
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim.corpora import Dictionary
import os
from pprint import pprint


# Prepare the data
news_dataset = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))  # download and cache the dataset
documents = news_dataset.data
print("In the dataset there are", len(documents), "textual documents")
"""
In the dataset there are 18846 textual documents
"""
print("And this is the first one:\n", documents[0])
"""And this is the first one:
I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!
"""
print("In the dataset ,the filenames are as follow:\n",news_dataset.filenames)
"""
In the dataset, the filenames are as follows:
 ['C:\\Users\\xiaoQ\\scikit_learn_data\\20news_home\\20news-bydate-test\\rec.sport.hockey\\54367'
 'C:\\Users\\xiaoQ\\scikit_learn_data\\20news_home\\20news-bydate-train\\comp.sys.ibm.pc.hardware\\60215'
 'C:\\Users\\xiaoQ\\scikit_learn_data\\20news_home\\20news-bydate-train\\talk.politics.mideast\\76120'
 ...
 'C:\\Users\\xiaoQ\\scikit_learn_data\\20news_home\\20news-bydate-train\\comp.sys.ibm.pc.hardware\\60695'
 'C:\\Users\\xiaoQ\\scikit_learn_data\\20news_home\\20news-bydate-train\\comp.graphics\\38319'
 'C:\\Users\\xiaoQ\\scikit_learn_data\\20news_home\\20news-bydate-test\\rec.autos\\103195']
"""
print("In the dataset ,the target is as follow:\n",news_dataset.target)
"""
In the dataset, the targets are as follows:
 [10  3 17 ...  3  1  7]
"""


# Tokenize a sentence (word segmentation, stopword removal, etc.); documents are later represented as bag-of-words vectors
def tokenize(text):
    """
    Tokenize `text` and remove stopwords. STOPWORDS is the Stone, Denis & Kwantes (2010) stopword collection.
    :param text: the text to process
    :return: the token sequence with stopwords removed
    """
    return [token for token in simple_preprocess(text) if token not in STOPWORDS]
print("The first document after tokenization and stopword removal:\n", tokenize(documents[0]))
"""
The first document after tokenization and stopword removal:
 ['sure', 'bashers', 'pens', 'fans', 'pretty', 'confused', 'lack', 'kind', 'posts', 'recent', 
 'pens', 'massacre', 'devils', 'actually', 'bit', 'puzzled', 'bit', 'relieved', 'going', 'end',
  'non', 'pittsburghers', 'relief', 'bit', 'praise', 'pens', 'man', 'killing', 'devils', 'worse',
   'thought', 'jagr', 'showed', 'better', 'regular', 'season', 'stats', 'lot', 'fo', 'fun', 
   'watch', 'playoffs', 'bowman', 'let', 'jagr', 'lot', 'fun', 'couple', 'games', 'pens', 'going', 
   'beat', 'pulp', 'jersey', 'disappointed', 'islanders', 'lose', 'final', 'regular', 'season', 
   'game', 'pens', 'rule']
"""
# Tokenize and remove stopwords for every document in the collection, producing the corresponding token sequences.
processed_docs = [tokenize(doc) for doc in documents]
# Dictionary encapsulates the mapping between normalized words and their integer ids,
# i.e. every word occurring in the corpus is assigned a unique integer id.
# It also collects word counts and other relevant statistics.
word_count_dict = Dictionary(processed_docs)
print("word_count_dict:", word_count_dict)
"""
word_count_dict: Dictionary(95507 unique tokens: ['actually', 'bashers', 'beat', 'better', 'bit']...)
"""
# Number of distinct tokens in the dictionary
print("The corpus contains", len(word_count_dict), "distinct tokens")
"""
The corpus contains 95507 distinct tokens
"""
# Keep tokens that appear in at least 20 documents but in no more than 10% of all documents
# (note: because the dictionary shrinks, some token ids may change)
word_count_dict.filter_extremes(no_below=20, no_above=0.1)
print("After filtering, the corpus contains only", len(word_count_dict), "distinct tokens")
"""
After filtering, the corpus contains only 8121 distinct tokens
"""
# Bag-of-words representation of every document in the corpus
bag_of_words_corpus = [word_count_dict.doc2bow(pdoc) for pdoc in processed_docs]
print("bag_of_words_corpus:\n", bag_of_words_corpus)
"""
(token id, term frequency)
...
(4081, 1), (4372, 2), (4684, 1), (4743, 1), (5112, 1), (5117, 1), (5200, 1), (5250, 1), (5443, 1), (5586, 1), 
(5782, 1), (6212, 1), (6483, 1), (6636, 1), (6707, 1), (7237, 1), (7453, 1)], [(85, 1), (255, 1), (354, 1), 
(624, 1), (2855, 1), (3449, 1), (3626, 1), (3774, 1), (3847, 1), (3917, 1), (7613, 1)], [(24, 4), (28, 4), 
...
"""


# Fit the LDA model

# If a saved model already exists, load it; otherwise the constructor estimates the LDA model from the training corpus:
model_name = "./modle/model.lda"
if os.path.exists(model_name):
    lda_model = gensim.models.LdaModel.load(model_name)
    print("Loaded the existing model")
else:
    # num_topics: the number of latent topics to extract from the corpus
    lda_model = gensim.models.LdaModel(bag_of_words_corpus, num_topics=100, id2word=word_count_dict, passes=5)
    # Save the model (make sure the target directory exists first)
    os.makedirs(os.path.dirname(model_name), exist_ok=True)
    lda_model.save(model_name)
    print("Created and saved a new model")

# Check how well the model extracts topics from unseen sentences or documents
print("1. No document specified")
# Get the most significant topics: 30 topics, 6 words per topic
print("Showing 30 topics:")
pprint(lda_model.print_topics(30, 6))
"""
Showing 30 topics:
[(46,
  '0.749*"max" + 0.109*"ax" + 0.021*"rx" + 0.016*"mb" + 0.015*"pl" + '
  '0.015*"um"'),
 (33,
  '0.051*"et" + 0.033*"location" + 0.032*"title" + 0.027*"physics" + '
  '0.025*"msg" + 0.023*"ne"'),
 (4,
  '0.042*"page" + 0.039*"sun" + 0.034*"language" + 0.032*"cancer" + '
  '0.025*"pro" + 0.023*"ati"'),
 (83,
  '0.102*"game" + 0.037*"play" + 0.034*"games" + 0.026*"fan" + 0.022*"watch" + '
  '0.019*"night"'),
 (45,
  '0.028*"version" + 0.027*"number" + 0.027*"tech" + 0.026*"contact" + '
  '0.025*"quality" + 0.021*"phone"'),
 (58,
  '0.145*"thanks" + 0.048*"advance" + 0.037*"hi" + 0.035*"microsoft" + '
  '0.033*"anybody" + 0.030*"mail"'),
 (65,
  '0.061*"attack" + 0.042*"bank" + 0.040*"islamic" + 0.029*"islam" + '
  '0.026*"sea" + 0.025*"land"'),
 (96,
  '0.075*"book" + 0.024*"read" + 0.018*"quote" + 0.016*"physical" + '
  '0.015*"dan" + 0.014*"edge"'),
 (9,
  '0.038*"san" + 0.029*"city" + 0.025*"york" + 0.022*"division" + '
  '0.019*"boston" + 0.019*"california"'),
 (77,
  '0.188*"db" + 0.077*"tcp" + 0.073*"ip" + 0.038*"pts" + 0.032*"substance" + '
  '0.027*"rw"'),
 (34,
  '0.028*"government" + 0.019*"rights" + 0.014*"country" + 0.012*"states" + '
  '0.012*"state" + 0.010*"united"'),
 (1,
  '0.042*"ago" + 0.031*"tv" + 0.030*"face" + 0.025*"wasn" + 0.024*"heard" + '
  '0.021*"seen"'),
 (18,
  '0.024*"going" + 0.024*"hell" + 0.019*"guys" + 0.019*"guy" + 0.016*"lot" + '
  '0.015*"ll"'),
 (27,
  '0.051*"state" + 0.049*"law" + 0.042*"laws" + 0.039*"shall" + 0.028*"ra" + '
  '0.026*"amendment"'),
 (80,
  '0.110*"data" + 0.032*"message" + 0.027*"number" + 0.024*"users" + '
  '0.021*"access" + 0.016*"block"'),
 (51,
  '0.048*"technology" + 0.034*"national" + 0.030*"law" + '
  '0.026*"administration" + 0.024*"enforcement" + 0.021*"agencies"'),
 (67,
  '0.036*"section" + 0.030*"output" + 0.029*"code" + 0.029*"program" + '
  '0.028*"input" + 0.027*"line"'),
 (66,
  '0.107*"window" + 0.070*"application" + 0.037*"manager" + 0.021*"create" + '
  '0.020*"user" + 0.019*"program"'),
 (15,
  '0.082*"god" + 0.031*"bible" + 0.026*"religion" + 0.026*"christian" + '
  '0.017*"christians" + 0.016*"religious"'),
 (13,
  '0.039*"fbi" + 0.031*"koresh" + 0.024*"guns" + 0.021*"batf" + '
  '0.019*"children" + 0.018*"gun"'),
 (90,
  '0.044*"systems" + 0.027*"analysis" + 0.026*"applications" + '
  '0.023*"processing" + 0.019*"font" + 0.018*"programming"'),
 (43,
  '0.021*"evidence" + 0.010*"point" + 0.009*"case" + 0.009*"claim" + '
  '0.009*"science" + 0.007*"mind"'),
 (74,
  '0.032*"myers" + 0.018*"guide" + 0.017*"book" + 0.017*"children" + '
  '0.015*"verse" + 0.015*"considered"'),
 (2,
  '0.040*"unit" + 0.038*"vision" + 0.035*"tom" + 0.030*"length" + 0.029*"phil" '
  '+ 0.029*"instructions"'),
 (91,
  '0.068*"price" + 0.058*"board" + 0.042*"tape" + 0.037*"pin" + 0.032*"old" + '
  '0.031*"cable"'),
 (12,
  '0.093*"key" + 0.038*"chip" + 0.036*"encryption" + 0.033*"keys" + '
  '0.028*"public" + 0.027*"clipper"'),
 (87,
  '0.024*"problem" + 0.019*"ll" + 0.016*"got" + 0.016*"little" + 0.016*"thing" '
  '+ 0.015*"better"'),
 (36,
  '0.045*"team" + 0.031*"year" + 0.027*"season" + 0.021*"players" + '
  '0.019*"hockey" + 0.019*"league"'),
 (47,
  '0.109*"space" + 0.024*"shuttle" + 0.021*"design" + 0.020*"station" + '
  '0.018*"nasa" + 0.016*"flight"'),
 (25,
  '0.128*"edu" + 0.040*"cs" + 0.031*"uk" + 0.025*"apr" + 0.025*"ca" + '
  '0.024*"ac"')]
"""

print ("\n" )
#获取最重要的主题:选择10个主题,每个主题包含10个词
print("选择10个主题:")
pprint(lda_model.print_topics(10))
"""
Showing 10 topics:
[(70,
  '0.122*"car" + 0.041*"cars" + 0.025*"road" + 0.018*"dog" + 0.017*"auto" + '
  '0.016*"driving" + 0.016*"speed" + 0.015*"xfree" + 0.015*"ford" + '
  '0.015*"automatic"'),
 (76,
  '0.047*"games" + 0.044*"runs" + 0.039*"win" + 0.029*"mike" + 0.026*"game" + '
  '0.023*"year" + 0.020*"run" + 0.018*"pitcher" + 0.018*"smith" + '
  '0.017*"pitching"'),
 (26,
  '0.156*"drive" + 0.078*"hard" + 0.077*"disk" + 0.072*"mac" + 0.056*"apple" + '
  '0.032*"floppy" + 0.027*"internal" + 0.026*"external" + 0.015*"installed" + '
  '0.014*"software"'),
 (33,
  '0.051*"et" + 0.033*"location" + 0.032*"title" + 0.027*"physics" + '
  '0.025*"msg" + 0.023*"ne" + 0.018*"theory" + 0.018*"mercury" + 0.017*"kt" + '
  '0.016*"map"'),
 (24,
  '0.025*"unix" + 0.023*"os" + 0.023*"multi" + 0.020*"support" + 0.018*"sec" + '
  '0.017*"features" + 0.016*"built" + 0.015*"vendor" + 0.013*"installation" + '
  '0.013*"product"'),
 (93,
  '0.026*"dc" + 0.026*"reported" + 0.026*"study" + 0.024*"volume" + '
  '0.021*"newsletter" + 0.019*"washington" + 0.019*"increased" + '
  '0.018*"vehicle" + 0.016*"news" + 0.014*"reports"'),
 (46,
  '0.749*"max" + 0.109*"ax" + 0.021*"rx" + 0.016*"mb" + 0.015*"pl" + '
  '0.015*"um" + 0.011*"au" + 0.010*"dm" + 0.009*"eq" + 0.008*"shifts"'),
 (97,
  '0.051*"card" + 0.044*"mb" + 0.040*"scsi" + 0.028*"drives" + 0.025*"bus" + '
  '0.025*"mhz" + 0.025*"bit" + 0.023*"controller" + 0.020*"speed" + '
  '0.019*"cpu"'),
 (86,
  '0.060*"sale" + 0.039*"model" + 0.038*"offer" + 0.037*"condition" + '
  '0.034*"shipping" + 0.032*"asking" + 0.031*"manual" + 0.025*"box" + '
  '0.024*"included" + 0.023*"sell"'),
 (3,
  '0.062*"power" + 0.024*"drug" + 0.023*"low" + 0.023*"ground" + 0.022*"drugs" '
  '+ 0.021*"high" + 0.020*"rate" + 0.019*"wire" + 0.019*"supply" + '
  '0.018*"current"')]
"""

print("2. 使用unseed 文档")
unseen_document = "In my spare time I either play badmington or drive my car"
print("unseen document的内容如下:", unseen_document )
print()
bow_vector = word_count_dict.doc2bow(tokenize(unseen_document))
for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1 * tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 7)))

"""
The unseen document reads: In my spare time I either play badminton or drive my car

Score: 0.23174265027046204  Topic: 0.156*"drive" + 0.078*"hard" + 0.077*"disk" + 0.072*"mac" + 0.056*"apple" + 0.032*"floppy" + 0.027*"internal"
Score: 0.21374230086803436  Topic: 0.122*"car" + 0.041*"cars" + 0.025*"road" + 0.018*"dog" + 0.017*"auto" + 0.016*"driving" + 0.016*"speed"
Score: 0.2019999921321869   Topic: 0.102*"game" + 0.037*"play" + 0.034*"games" + 0.026*"fan" + 0.022*"watch" + 0.019*"night" + 0.019*"espn"
Score: 0.16051504015922546  Topic: 0.024*"problem" + 0.019*"ll" + 0.016*"got" + 0.016*"little" + 0.016*"thing" + 0.015*"better" + 0.015*"probably"

"""

print_topics(num_topics=20, num_words=10)

Get the most significant topics (an alias for the show_topics() method).

Parameters:
  • num_topics (int, optional) – number of topics to return; if -1, all topics are included in the result (sorted by significance).
  • num_words (int, optional) – number of words per topic (sorted by significance).
Returns:

Sequence of (topic_id, [(word, value), … ]).

Return type:

list of (int, list of (str, float))
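
For illustration, here is a minimal sketch that reuses the lda_model trained in the script above (the num_topics/num_words values are arbitrary); print_topics() returns formatted strings, while show_topics(formatted=False) returns the raw (word, probability) pairs:

topics = lda_model.print_topics(num_topics=5, num_words=3)
for topic_id, topic_str in topics:
    # topic_str looks like '0.102*"game" + 0.037*"play" + 0.034*"games"'
    print(topic_id, topic_str)

raw = lda_model.show_topics(num_topics=5, num_words=3, formatted=False)
for topic_id, word_probs in raw:
    print(topic_id, word_probs)  # e.g. [('game', 0.102), ('play', 0.037), ('games', 0.034)]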


gensim.models.ldamodel.LdaModel(corpus=None, num_topics=100, id2word=None, distributed=False, chunksize=2000, passes=1, update_every=1, alpha='symmetric', eta=None, decay=0.5, offset=1.0, eval_every=10, iterations=50, gamma_threshold=0.001, minimum_probability=0.01, random_state=None, ns_conf=None, minimum_phi_value=0.01, per_word_topics=False, callbacks=None, dtype=<class 'numpy.float32'>)

The constructor estimates the LDA model parameters from the training corpus.

Model persistence is handled through the load()/save() methods.

corpus: if given, training starts immediately on this iterable corpus. If not given, the model is left untrained (presumably because you want to call update() manually).

num_topics: the number of latent topics to be extracted from the training corpus.

id2word: a mapping from word ids (integers) to words (strings). It is used to determine the vocabulary size, as well as for debugging and topic printing.

alpha and eta: hyperparameters that affect the sparsity of the document-topic (theta) and topic-word (lambda) distributions. Both default to a symmetric 1.0/num_topics prior.

alpha can be set to an explicit array (a prior of your choosing). It also supports the special values 'asymmetric' and 'auto': the former uses a fixed normalized asymmetric 1.0/topic_no prior, the latter learns an asymmetric prior directly from your data.

eta can be a scalar for a symmetric prior over the topic/word distributions, or a vector of length num_words, which can be used to impose a (user-defined) asymmetric prior over the word distribution. It also supports the special value 'auto', which learns an asymmetric prior over words directly from your data. eta can also be a num_topics x num_words matrix, imposing a different word-distribution prior on each topic (this one cannot be learned from data).

eval_every: log perplexity is estimated every that many model updates (setting this to 1 slows down training by ~2x). The default is 10 for better performance. Set it to None to disable perplexity estimation.
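
The following is a hedged sketch of how these hyperparameters might be combined; it reuses bag_of_words_corpus and word_count_dict from the script above, and the specific values (50 topics, random_state=42) are illustrative choices rather than recommendations:

tuned_lda = gensim.models.LdaModel(
    corpus=bag_of_words_corpus,   # start training immediately on the BoW corpus
    id2word=word_count_dict,      # id -> word mapping, used for vocabulary size and topic printing
    num_topics=50,                # number of latent topics to extract (illustrative)
    alpha='auto',                 # learn an asymmetric document-topic prior from the data
    eta='auto',                   # learn an asymmetric topic-word prior from the data
    passes=5,                     # number of full passes over the corpus
    eval_every=None,              # disable perplexity estimation for speed
    random_state=42,              # fixed seed for reproducibility (illustrative)
)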

load(fname, *args, **kwargs)

Load an object previously saved to file with save().

Parameters:

  • fname (str) – path to the file containing the needed object.
  • mmap (str, optional) – memory-map option. If the object's large arrays were stored separately, you can use mmap='r' to load those arrays read-only. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap must be set to None.
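
For example, a minimal sketch using the model path from the script above (mmap='r' only takes effect when large arrays were stored in separate, uncompressed files):

lda_model.save("./modle/model.lda")
reloaded = gensim.models.LdaModel.load("./modle/model.lda", mmap='r')  # memory-map large arrays read-only
print(reloaded.num_topics)
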
doc2bow(document, allow_update=False, return_missing=False)

Convert a document into the bag-of-words (BoW) format = list of (token_id, token_count).

Parameters:
  • document (list of str) – the input document.
  • allow_update (bool, optional) – if True, update the dictionary during processing (adding new tokens and updating frequencies).
  • return_missing (bool, optional) – whether to also return the missing tokens (tokens not contained in the current dictionary).
Returns:

  • list of (int, int) – the bag-of-words representation of the document.
  • list of (int, int), dict of (str, int) – if return_missing is True, the document's BoW plus the missing tokens and their frequencies.
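
A small sketch of the return_missing=True case, reusing the word_count_dict and tokenize() defined above (whether a given token such as 'badminton' actually ends up in the missing dict depends on the filtered dictionary):

tokens = tokenize("I drive my car to the badminton game")
bow, missing = word_count_dict.doc2bow(tokens, return_missing=True)
print(bow)      # [(token_id, count), ...] for tokens present in the dictionary
print(missing)  # e.g. {'badminton': 1} - tokens not found in the dictionary, with their counts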


filter_extremes(no_below=5, no_above=0.5, keep_n=100000, keep_tokens=None)

Filter the tokens in the dictionary by frequency.

Parameters:
  • no_below (int, optional) – keep tokens that appear in at least no_below documents.
  • no_above (float, optional) – keep tokens that appear in no more than no_above of the documents (a fraction of the total corpus size, not an absolute number).
  • keep_n (int, optional) – after the above filters, keep only the keep_n most frequent tokens.
  • keep_tokens (iterable of str) – iterable of tokens that must stay in the dictionary after filtering.
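
A hedged sketch of these parameters, applied to a fresh Dictionary so that the ids of the already-filtered word_count_dict above are left untouched (the keep_n and keep_tokens values are illustrative):

demo_dict = Dictionary(processed_docs)
demo_dict.filter_extremes(
    no_below=20,          # keep tokens appearing in at least 20 documents
    no_above=0.1,         # ...and in at most 10% of all documents
    keep_n=50000,         # then keep only the 50,000 most frequent tokens (illustrative)
    keep_tokens=['car'],  # never drop 'car', even if the thresholds would remove it (illustrative)
)
print(len(demo_dict))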

gensim.utils.simple_preprocess(doc, deacc=False, min_len=2, max_len=15)

Convert a document into a list of lowercase tokens (optionally removing accents), using tokenize() internally.

Parameters:
  • doc (str) – the input document.
  • deacc (bool, optional) – remove accent marks from tokens using deaccent().
  • min_len (int, optional) – minimum length of tokens kept in the result.
  • max_len (int, optional) – maximum length of tokens kept in the result.
Returns:

The tokens extracted from the text.

Return type:

list of str
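
For instance, a small sketch with a made-up sample string:

sample = "Čeština and PYTHON-3 tokens, e.g. LDA!"
print(simple_preprocess(sample))                         # lowercase tokens of length 2..15; digits and punctuation are dropped
print(simple_preprocess(sample, deacc=True))             # additionally strips accents: 'čeština' -> 'cestina'
print(simple_preprocess(sample, min_len=3, max_len=10))  # drops tokens shorter than 3 or longer than 10 characters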






References for this article:

1. https://blog.csdn.net/scotfield_msn/article/details/72904651

2. https://radimrehurek.com/gensim/apiref.html
