Implementing an LDA model easily with Python's gensim.
Introduction to gensim
gensim is a free Python library that efficiently and automatically extracts semantic topics from documents. The algorithms in gensim, including LSA (Latent Semantic Analysis), LDA (Latent Dirichlet Allocation), and RP (Random Projections), discover the semantic structure of documents by examining statistical co-occurrence patterns of words within a corpus of training documents. These algorithms are unsupervised and can work on raw, unstructured text ("plain text").
Gensim is a fairly specialized topic-modeling toolkit for Python. In text processing, for example when mining product reviews, you sometimes want to know how similar each review is to the product description, as a way of gauging the review's objectivity. The higher the similarity between a review and the product description, the more formal the review's wording tends to be: it carries little emotional coloring and focuses on describing the product's attributes and characteristics, i.e., it takes a more objective perspective.
Gensim: implementation language: Python. Implemented models: LDA, Dynamic Topic Model, Dynamic Influence Model, HDP, LSI, Random Projections, and the deep-learning-based word2vec and paragraph2vec.
gensim features
Memory independence: the entire training corpus never needs to reside fully in RAM at any one time.
Efficient implementations of several popular vector space algorithms, including tf-idf, distributed LSA, distributed LDA, and RP; adding new algorithms is easy.
I/O wrappers and converters for several popular data formats.
Similarity queries for documents in their semantic representation.
gensim was created because of the lack of a simple (the Java options are complex) and scalable software framework for topic modeling.
gensim design principles
Straightforward interfaces with a low learning curve, convenient for prototyping.
Memory independence with respect to the size of the input corpus: the algorithms operate in a streaming fashion, accessing one document at a time (see the sketch below).
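For example, here is a minimal sketch of the streaming idea. It assumes a hypothetical file mycorpus.txt with one document per line (the file name and the StreamedCorpus class are illustrative, not part of gensim); only one document is held in RAM at any time:

from gensim import corpora

# build the dictionary by streaming over the file once; the documents are never all in RAM
dictionary = corpora.Dictionary(line.lower().split() for line in open('mycorpus.txt'))

class StreamedCorpus:
    # yields one sparse bag-of-words vector per document, lazily
    def __iter__(self):
        for line in open('mycorpus.txt'):
            yield dictionary.doc2bow(line.lower().split())

corpus = StreamedCorpus()  # no list is materialized; iterating re-reads the file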
gensim core concepts
The gensim package revolves around three concepts: corpus, vector, and model.
Corpus
A collection of documents used to automatically infer the documents' structure, their topics, and so on; it is also called the training corpus.
Vector
In the vector space model (VSM), each document is represented as an array of features. For example, a single feature can be thought of as a question-answer pair:
[1]. How many times does the word "splonge" appear in the document? Zero.
[2]. How many sentences does the document consist of? Two.
[3]. How many fonts does the document use? Five.
The questions can be represented by integer ids (such as 1, 2, 3), so this document can be represented as (1, 0.0), (2, 2.0), (3, 5.0). If we know all the questions in advance, we can leave them implicit and simply write (0.0, 2.0, 5.0). This sequence of answers can be thought of as a vector (here 3-dimensional). For practical purposes, only questions whose answer is (or can be converted to) a single real number are allowed.
The questions are the same for every document, so that, given two vectors (representing two documents), we hope to be able to conclude: "if the numbers in the two vectors are similar, then the original documents are similar, too". Of course, whether such a conclusion holds depends on how well we chose our questions.
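As a toy illustration of that intuition (plain Python, not gensim API; the second document's numbers are made up), the cosine similarity between two such dense answer vectors is close to 1 when the answers are close:

import math

def cosine(v1, v2):
    # cosine similarity between two equal-length dense vectors
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm

doc_a = (0.0, 2.0, 5.0)  # the example document above
doc_b = (0.0, 2.0, 4.0)  # a hypothetical, similar document
print(cosine(doc_a, doc_b))  # about 0.997, so the two documents are judged similar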
Sparse vector
Typically, the answers to most questions are 0.0. To save space, we omit them from the document's representation and write only (2, 2.0), (3, 5.0) (note that (1, 0.0) is omitted). Since the full set of questions is known in advance, all missing features in a sparse representation of a document can unambiguously be taken to be 0.0.
What is special about gensim is that it does not prescribe any particular corpus format; a corpus can be anything that, when iterated over, yields these sparse vectors. For example, the collection [[(2, 2.0), (3, 5.0)], [(0, -1.0), (3, -1.0)]] is a corpus of two documents, each with two non-zero (feature id, value) pairs.
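A minimal sketch of this: any Python iterable that yields such lists of (feature id, value) pairs already counts as a gensim corpus, no special class required:

# the two-document corpus from the text, written literally as sparse vectors
corpus = [[(2, 2.0), (3, 5.0)],
          [(0, -1.0), (3, -1.0)]]

for doc in corpus:  # gensim only requires that the corpus be iterable
    print(doc)      # each document is a list of (feature id, value) pairs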
Model
To us, a model is a transformation that turns one document representation into another. Both the initial and the target representations are vectors; they differ only in what the questions and answers are. The transformation is learned automatically from the training corpus, without human supervision, and the resulting document representation is more compact and more useful: similar documents get similar representations. [Gensim: topic modelling with Python]
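tf-idf is one such transformation: it is "trained" by collecting document-frequency statistics from a bag-of-words corpus, and it then maps any bag-of-words vector into tf-idf coordinates. A minimal sketch with a made-up two-document corpus:

from gensim import models

# a toy bag-of-words corpus: two documents over a three-word vocabulary (values are made up)
corpus = [[(0, 1.0), (1, 2.0)],
          [(1, 1.0), (2, 3.0)]]

tfidf = models.TfidfModel(corpus)  # "training" = collecting document-frequency statistics
print(tfidf[corpus[0]])            # the first document, now represented in tf-idf space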
Installing gensim
gensim depends on NumPy and SciPy, the two main Python scientific-computing packages, which must be installed first.
Then install gensim itself: pip install gensim
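A quick sanity check that the installation worked:

import gensim
print(gensim.__version__)  # prints the installed gensim version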
Official gensim tutorials
Quickly implementing LDA with gensim
Vector representation of documents: Corpora and Vector Spaces
Convert documents represented as strings into document vectors represented by ids:
documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]
"""#use StemmedCountVectorizer to get stemmed without stop words corpusVectorizer = StemmedCountVectorizer# Vectorizer = CountVectorizervectorizer = Vectorizer(stop_words='english')vectorizer.fit_transform(documents)texts = vectorizer.get_feature_names()# print(texts)"""texts =[doc.lower().split() fordoc indocuments]
# print(texts)dict =corpora.Dictionary(texts) #自建词典# print dict, dict.token2id#通过dict将用字符串表示的文档转换为用id表示的文档向量corpus =[dict.doc2bow(text) fortext intexts]
print(corpus)
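Continuing from the dictionary and corpus built above, training an LDA model and printing its topics takes one line each. This is only a sketch: num_topics=2 is an arbitrary choice for this toy corpus, and corpus_tfidf is assumed to be the tf-idf-transformed corpus that the query below refers to:

from gensim import models

lda_model = models.LdaModel(corpus, id2word=dict, num_topics=2)  # num_topics=2 is arbitrary here
print(lda_model.print_topics(num_topics=2, num_words=5))         # top words of each topic

# the corpus_tfidf used in the query below: the same corpus after a tf-idf transformation
corpus_tfidf = models.TfidfModel(corpus)[corpus]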
Querying a document's topics
That is, finding the topics associated with a given document, together with their probabilities.
topics = [lda_model[c] for c in corpus_tfidf]  # not recommended for many queries: too slow, only suitable for small collections
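For single documents it is usually more convenient to query the model one bag-of-words vector at a time; get_document_topics additionally lets you set a probability cutoff. A small sketch, reusing the dictionary and model from above (the query sentence is an arbitrary example):

doc_bow = dict.doc2bow("human computer interaction".lower().split())  # any new document as bag-of-words
print(lda_model[doc_bow])                                             # [(topic_id, probability), ...]
print(lda_model.get_document_topics(doc_bow, minimum_probability=0.01))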
Complete example
Using the gensim Python package
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
__title__ = 'topic model - build lda - 20news dataset'
__author__ = 'pi'
__mtime__ = '12/26/2014-026'
# code is far away from bugs with the god animal protecting
I love animals. They taste delicious.
┏┓ ┏┓
┏┛┻━━━┛┻┓
┃ ☃ ┃
┃ ┳┛ ┗┳ ┃
┃ ┻ ┃
┗━┓ ┏━┛
┃ ┗━━━┓
┃ Divine beast, ┣┓
┃ no bugs ever! ┏┛
┗┓┓┏━┳┓┏┛
┃┫┫ ┃┫┫
┗┻┛ ┗┻┛
"""
from Colors import *
from collections import defaultdict
import re
import datetime
from sklearn import datasets
import nltk
from gensim import corpora
from gensim import models
import numpy as np
from scipy import spatial
from CorePyPro.Fun.TimeStump import totalTime
def load_texts(dataset_type='train', groups=None):
"""
load datasets to bytes list
:return:train_dataset_bunch.data bytes list
"""
if groups == 'small':
        groups = ['comp.graphics', 'comp.os.ms-windows.misc']  # only for quick tests on small data, #1368 docs
elif groups == 'medium':
        groups = ['comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
                  'comp.windows.x', 'sci.space']  # for medium-sized data, #3414 docs
train_dataset_bunch = datasets.load_mlcomp('20news-18828', dataset_type, mlcomp_root='./datasets',
categories=groups) # 13180
return train_dataset_bunch.data
def preprocess_texts(texts, test_doc_id=1):
"""
texts preprocessing
:param texts: bytes list
    :return: list of stemmed token lists
"""
texts = [t.decode(errors='ignore') for t in texts] # bytes2str
# print(REDH, 'original texts[%d]: ' % test_doc_id, DEFAULT, '\n',texts[test_doc_id])
# split_texts = [t.lower().split() for t in texts]
# print(REDH, 'split texts[%d]: #%d' % (test_doc_id, len(split_texts)), DEFAULT, '\n',split_texts[test_doc_id])
# lower str & split str 2 word list with sep=... & delete None
    SEPS = r'[\s()-/,:.?!]\s*'  # separator pattern: whitespace and common punctuation
texts = [re.split(SEPS, t.lower()) for t in texts]
for t in texts:
while '' in t:
t.remove('')
# print(REDH, 'texts[%d] lower & split(seps= %s) & delete None: #%d' % (test_doc_id, SEPS, len(texts[test_doc_id])), DEFAULT, '\n',texts[test_doc_id])
# nltk.download() #then choose the corpus.stopwords
stopwords = set(nltk.corpus.stopwords.words('english')) # #127
stopwords.update(['from', 'subject', 'writes']) # #129
word_usage = defaultdict(int)
for t in texts:
for w in t:
word_usage[w] += 1
COMMON_LINE = len(texts) / 10
    too_common_words = [w for w in word_usage if word_usage[w] > COMMON_LINE]  # set(too_common_words)
# print('too_common_words: #', len(too_common_words), '\n', too_common_words) #68
stopwords.update(too_common_words)
# print('stopwords: #', len(stopwords), '\n', stopwords) # #147
english_stemmer = nltk.SnowballStemmer('english')
MIN_WORD_LEN = 3 # 4
texts = [[english_stemmer.stem(w) for w in t if
not set(w) & set('@+>0123456789*') and w not in stopwords and len(w) >= MIN_WORD_LEN] for t in
texts] # set('+-.?!()>@0123456789*/')
    # print(REDH, 'texts[%d] after removing non-alphanumeric words, stopwords and short words: #%d' % (test_doc_id, len(texts[test_doc_id])), DEFAULT, '\n', texts[test_doc_id])
return texts
def build_corpus(texts):
"""
build corpora
    :param texts: list of token lists
:return: corpus DirectTextCorpus(corpora.TextCorpus)
"""
class DirectTextCorpus(corpora.TextCorpus):
def get_texts(self):
return self.input
def __len__(self):
return len(self.input)
corpus = DirectTextCorpus(texts)
return corpus
def build_id2word(corpus):
"""
from corpus build id2word=dict
:param corpus:
:return:dict = corpus.dictionary
"""
dict = corpus.dictionary # gensim.corpora.dictionary.Dictionary
# print(dict.id2token)
    try:
        # accessing any key forces the Dictionary to build its id2token mapping
        dict['anything']
    except KeyError:
        pass
# print("dict.id2token is not {} now")
# print(dict.id2token)
return dict
def save_corpus_dict(dict, corpus, dictDir='./LDA/id_word.dict', corpusDir='./LDA/corpus.mm'):
dict.save(dictDir)
print(GREENL, 'dict saved into %s successfully ...' % dictDir, DEFAULT)
corpora.MmCorpus.serialize(corpusDir, corpus)
print(GREENL, 'corpus saved into %s successfully ...' % corpusDir, DEFAULT)
# corpus.save(fname='./LDA/corpus.mm') # stores only the (tiny) iteration object
def load_ldamodel(modelDir='./lda.pkl'):
model = models.LdaModel.load(fname=modelDir)
print(GREENL, 'ldamodel load from %s successfully ...' % modelDir, DEFAULT)
return model
def load_corpus_dict(dictDir='./LDA/id_word.dict', corpusDir='./LDA/corpus.mm'):
dict = corpora.Dictionary.load(fname=dictDir)
print(GREENL, 'dict load from %s successfully ...' % dictDir, DEFAULT)
# dict = corpora.Dictionary.load_from_text('./id_word.txt')
corpus = corpora.MmCorpus(corpusDir) # corpora.mmcorpus.MmCorpus
print(GREENL, 'corpus load from %s successfully ...' % corpusDir, DEFAULT)
return dict, corpus
def build_doc_word_mat(corpus, model, num_topics):
"""
build doc_word_mat in topic space
:param corpus:
:param model:
:param num_topics: int
:return:doc_word_mat np.array (len(topics) * num_topics)
"""
    topics = [model[c] for c in corpus]  # per document: a list of (topic_id, weight) pairs
    doc_word_mat = np.zeros((len(topics), num_topics))
    for doc, topic in enumerate(topics):
        for topic_id, weight in topic:
            doc_word_mat[doc, topic_id] += weight
return doc_word_mat
def compute_pairwise_dist(doc_word_mat):
"""
compute pairwise dist
:param doc_word_mat: np.array (len(topics) * num_topics)
:return:pairwise_dist
"""
pairwise_dist = spatial.distance.squareform(spatial.distance.pdist(doc_word_mat))
max_weight = pairwise_dist.max() + 1
for i in list(range(len(pairwise_dist))):
pairwise_dist[i, i] = max_weight
return pairwise_dist
def closest_texts(corpus, model, num_topics, original_texts, test_doc_id=1, topn=5):
    """
    find the closest_doc_ids for doc[test_doc_id]
    :param corpus:
    :param model:
    :param num_topics:
    :param original_texts: the raw documents, used only for display
    :param test_doc_id:
    :param topn:
:return:
"""
doc_word_mat = build_doc_word_mat(corpus, model, num_topics)
pairwise_dist = compute_pairwise_dist(doc_word_mat)
# print(REDH, 'original texts[%d]: ' % test_doc_id, DEFAULT, '\n', original_texts[test_doc_id])
closest_doc_ids = pairwise_dist[test_doc_id].argsort()
# return closest_doc_ids[:topn]
for closest_doc_id in closest_doc_ids[:topn]:
print(RED, 'closest doc[%d]' % closest_doc_id, DEFAULT, '\n', original_texts[closest_doc_id])
def evaluate_model(model):
"""
    compute the model's log perplexity on the test data
:param model:
:return:model.log_perplexity float
"""
test_texts = load_texts(dataset_type='test', groups='small')
test_texts = preprocess_texts(test_texts)
test_corpus = build_corpus(test_texts)
return model.log_perplexity(test_corpus)
def test_num_topics():
dict, corpus = load_corpus_dict()
print("#corpus_items:", len(corpus))
for num_topics in [3, 5, 10, 30, 50, 100, 150, 200, 300]:
start_time = datetime.datetime.now()
model = models.LdaModel(corpus, num_topics=num_topics, id2word=dict)
end_time = datetime.datetime.now()
print("total running time = ", end_time - start_time)
print(REDL, 'model.log_perplexity for test_texts with num_topics=%d : ' % num_topics, evaluate_model(model),
DEFAULT)
def test():
texts = load_texts(dataset_type='train', groups='small')
original_texts = texts
test_doc_id = 1
# texts = preprocess_texts(texts, test_doc_id=test_doc_id)
# corpus = build_corpus(texts=texts) # corpus DirectTextCorpus(corpora.TextCorpus)
# dict = build_id2word(corpus)
# save_corpus_dict(dict, corpus)
dict, corpus = load_corpus_dict()
# print(len(corpus))
num_topics = 100
    model = models.LdaModel(corpus, num_topics=num_topics, id2word=dict)  # results differ between runs (random initialization)
model.show_topic(0)
# model.save(fname='./lda.pkl')
# model = load_ldamodel()
    # closest_texts(corpus, model, num_topics, original_texts, test_doc_id=1, topn=3)
print(REDL, 'model.log_perplexity for test_texts', evaluate_model(model), DEFAULT)
if __name__ == '__main__':
test()
# test_num_topics()