A plain-language walkthrough of the author-topic model
from gensim.models import AuthorTopicModel

# Train several models with different random seeds and record each model's
# topic coherence (the sum of the per-topic coherence scores).
model_list = []
for i in range(5):
    model = AuthorTopicModel(corpus=corpus, num_topics=10, id2word=dictionary.id2token,
                             author2doc=author2doc, chunksize=2000, passes=100,
                             gamma_threshold=1e-10, eval_every=0, iterations=1, random_state=i)
    top_topics = model.top_topics(corpus)
    tc = sum([t[1] for t in top_topics])
    model_list.append((model, tc))
By varying the random_state parameter (i.e. the random seed), several models are trained and the one with the highest topic coherence is kept.
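Picking the winner out of model_list is then a one-liner; a minimal sketch:

# Keep the (model, coherence) pair with the highest coherence score.
model, tc = max(model_list, key=lambda pair: pair[1])
print('topic coherence of the selected model: %.4f' % tc)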
model.show_topics(num_topics=10, num_words=10, log=False, formatted=True) — the output looks like this:
model.show_topics(num_topics=10)
>>> [(0,
'0.014*"action" + 0.014*"control" + 0.010*"policy" + 0.009*"q" + 0.009*"reinforcement" + 0.007*"optimal" + 0.006*"robot" + 0.005*"controller" + 0.005*"dynamic" + 0.005*"environment"'),
(1,
'0.020*"image" + 0.008*"face" + 0.007*"cluster" + 0.006*"signal" + 0.005*"source" + 0.005*"matrix" + 0.005*"filter" + 0.005*"search" + 0.004*"distance" + 0.004*"o_o"')]
model.get_topic_terms(topicid, topn=10) takes a topic id and returns the most significant word ids together with their probabilities:
model.get_topic_terms(1, topn=10)
>>> [(774, 0.019700538013351386),
(3215, 0.0075965808303036916),
(3094, 0.0067132528809042526),
(514, 0.0063925849599646822),
(2739, 0.0054527647598129206),
(341, 0.004987335769043616),
(752, 0.0046566448210636699),
(1218, 0.0046234352422933724),
(186, 0.0042132891022475458),
(829, 0.0041800479706789939)]
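Because get_topic_terms returns word ids rather than words, they can be mapped back through the dictionary that was used for training; a minimal sketch:

# Translate the (word_id, probability) pairs of topic 1 into readable tokens.
[(dictionary.id2token[wid], prob) for wid, prob in model.get_topic_terms(1, topn=10)]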
model.get_topics() returns the topic-word matrix of shape (num_topics, num_words); here 10 topics by 7674 words:
model.get_topics()
>>> array([[ 9.57974777e-05, 6.17130780e-07, 6.34938224e-07, ...,
6.17080048e-07, 6.19691132e-07, 6.17090716e-07],
[ 9.81065671e-05, 3.12945042e-05, 2.80837858e-04, ...,
7.86879291e-07, 7.86479617e-07, 7.86592758e-07],
[ 4.57734625e-05, 1.33555568e-05, 2.55108081e-05, ...,
5.31796854e-07, 5.32000122e-07, 5.31934336e-07],
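Each row of this matrix is one topic's probability distribution over the whole vocabulary; a quick sanity check might look like this:

topics = model.get_topics()
print(topics.shape)        # (10, 7674): num_topics x vocabulary size
print(topics.sum(axis=1))  # every row sums to (approximately) 1.0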
The output of model.print_topics(num_topics=20, num_words=10):
model.print_topics(num_topics=20, num_words=10)
[(0,
'0.008*"gaussian" + 0.007*"mixture" + 0.006*"density" + 0.006*"matrix" + 0.006*"likelihood" + 0.005*"noise" + 0.005*"component" + 0.005*"prior" + 0.005*"estimate" + 0.004*"log"'),
(1,
'0.025*"image" + 0.010*"object" + 0.008*"distance" + 0.007*"recognition" + 0.005*"pixel" + 0.004*"cluster" + 0.004*"class" + 0.004*"transformation" + 0.004*"constraint" + 0.004*"map"'),
(2,
'0.011*"visual" + 0.010*"cell" + 0.009*"response" + 0.008*"field" + 0.008*"motion" + 0.007*"stimulus" + 0.007*"direction" + 0.005*"orientation" + 0.005*"eye" + 0.005*"frequency"')]
☆ model['name']
☆ model.get_author_topics('name')
model['GeoffreyE.Hinton']
model.get_author_topics('GeoffreyE.Hinton')
>>> [(6, 0.76808063951144978), (7, 0.23181972762044473)]
☆ model.id2author.values(): the list of author names known to the model
model.id2author.values()  # list of authors
>>> dict_values(['A.A.Handzel', 'A.Afghan', 'A.B.Bonds', 'A.B.Kirillov', 'A.Blake', 'A.C.C.Coolen', 'A.C.Tsoi', 'A.Cichocki', 'A.D.Back', 'A.D.Rexlish', 'A.Dembo', 'A.Drees', 'A.During', 'A.E.Friedman', 'A.F.Murray', 'A.Ferguson', 'A.G.Barto', 'A.G.U.Perera']
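Combining id2author with get_author_topics makes it easy to scan over all authors; a small sketch that lists authors whose strongest topic is topic 6 (the topic id and the 0.5 threshold are arbitrary choices for illustration):

# Authors whose single strongest topic is topic 6 with probability above 0.5.
for author in model.id2author.values():
    author_topics = model.get_author_topics(author)
    if author_topics:
        top_topic, top_prob = max(author_topics, key=lambda t: t[1])
        if top_topic == 6 and top_prob > 0.5:
            print(author, round(top_prob, 3))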
get_document_topics(bow, minimum_probability=None) exists for LDA but is not supported by the author-topic model; calling it raises: Method "get_document_topics" is not valid for the author-topic model.
Instead, get_term_topics(word_id, minimum_probability=None) returns the topics that a given dictionary word is most likely to belong to:
[(model.get_term_topics(i),dictionary.id2token[i]) for i in range(100)]
>>> [([], 'acknowledgements'),
([], 'acknowledgements_-PRON-'),
([], 'acquire'),
([(0, 0.013787687427660612)], 'action'),
([], 'action_potential'),
([], 'active')]
Here the word 'action' leans most strongly towards topic 0 ('Circuits'), with a probability of roughly 0.014.
The official example demonstrates two similarity measures: cosine similarity and Hellinger distance.
The built-in, standard cosine similarity
from gensim.similarities import MatrixSimilarity

# Generate a similarity object for the transformed corpus.
index = MatrixSimilarity(model[list(model.id2author.values())])

# Get similarities to some author.
author_name = 'YannLeCun'
sims = index[model[author_name]]
sims
>>> array([ 0.20777275, 0.8723157 , 0. , ..., 0.16174853,
0.07938839, 0. ], dtype=float32)
The sims array holds the cosine similarity between 'YannLeCun' and every other author's topic-preference vector.
In model[list(model.id2author.values())], model.id2author maps internal ids to author names, and indexing the model with that list of names yields each author's topic-preference vector. These vectors are sparse and of variable length: topics an author has essentially no association with are simply left out.
Topic-preference vectors look like this:
[(4, 0.88636616132828339), (8, 0.083545096138703312)],
[(4, 0.27129746825443646), (8, 0.71594003971848896)],
[(0, 0.07188868711639794), (1, 0.069390116586544176),
(3, 0.035190795872695843), (4, 0.011718365474455844),
(5, 0.058831820905365088), (6, 0.68542799691757561),
(9, 0.041390087371285786)]
For example, [(4, 0.88636616132828339), (8, 0.083545096138703312)] means the author in question is associated mainly with topic 4 and, to a lesser degree, topic 8.
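To map the raw sims scores back to author names, zip them with the author list that was used to build the index; a minimal sketch:

# Rank all authors by cosine similarity to 'YannLeCun'.
authors = list(model.id2author.values())
most_similar = sorted(zip(authors, sims), key=lambda pair: pair[1], reverse=True)[:10]
for name, score in most_similar:
    print('%-25s %.3f' % (name, score))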
Hellinger distance
Helper functions from the official example
from gensim import matutils
import pandas as pd

# Make a list of all the author-topic distributions (author-topic preference vectors).
author_vecs = [model.get_author_topics(author) for author in model.id2author.values()]

def similarity(vec1, vec2):
    '''Get similarity between two vectors.'''
    dist = matutils.hellinger(matutils.sparse2full(vec1, model.num_topics),
                              matutils.sparse2full(vec2, model.num_topics))
    sim = 1.0 / (1.0 + dist)
    return sim

def get_sims(vec):
    '''Get similarity of vector to all authors.'''
    sims = [similarity(vec, vec2) for vec2 in author_vecs]
    return sims

def get_table(name, top_n=10, smallest_author=1):
    '''
    Get table with similarities, author names, and author sizes.
    Return `top_n` authors as a dataframe.
    '''
    # Get similarities.
    sims = get_sims(model.get_author_topics(name))

    # Arrange author names, similarities, and author sizes in a list of tuples.
    table = []
    for author_id, sim in enumerate(sims):
        author_name = model.id2author[author_id]
        author_size = len(model.author2doc[author_name])
        if author_size >= smallest_author:
            table.append((author_name, sim, author_size))

    # Make dataframe and retrieve top authors.
    df = pd.DataFrame(table, columns=['Author', 'Score', 'Size'])
    df = df.sort_values('Score', ascending=False)[:top_n]
    return df
Calling get_table('YannLeCun', top_n=10, smallest_author=3) then returns, for the author 'YannLeCun', the ten authors with the highest Hellinger-based similarity score, restricted to authors with at least 3 documents.
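The similarity helper defined above can also be applied to any pair of authors directly; a small sketch reusing two author names that appear earlier in this post:

# Hellinger-based similarity between two specific authors.
sim = similarity(model.get_author_topics('YannLeCun'),
                 model.get_author_topics('GeoffreyE.Hinton'))
print(round(sim, 3))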
A problem you may run into
C:\Python27\lib\site-packages\gensim\utils.py:855: UserWarning: detected Windows; aliasing chunkize to chunkize_serial warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
I am unable to import word2vec from gensim.models due to this warning.
You can suppress the message with this code before importing gensim:
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
import gensim
This is not really a problem: gensim is just letting you know that it aliases chunkize to a different implementation because of the operating system you are on.
See the relevant code in gensim.utils:
if os.name == 'nt':
    logger.info("detected Windows; aliasing chunkize to chunkize_serial")

    def chunkize(corpus, chunksize, maxsize=0, as_numpy=False):
        for chunk in chunkize_serial(corpus, chunksize, as_numpy=as_numpy):
            yield chunk
else:
    def chunkize(corpus, chunksize, maxsize=0, as_numpy=False):
        """
        Split a stream of values into smaller chunks.
        Each chunk is of length `chunksize`, except the last one which may be smaller.
        A once-only input stream (`corpus` from a generator) is ok, chunking is done
        efficiently via itertools.

        If `maxsize > 1`, don't wait idly in between successive chunk `yields`, but
        rather keep filling a short queue (of size at most `maxsize`) with forthcoming
        chunks in advance. This is realized by starting a separate process, and is
        meant to reduce I/O delays, which can be significant when `corpus` comes
        from a slow medium (like harddisk).

        If `maxsize==0`, don't fool around with parallelism and simply yield the chunksize
        via `chunkize_serial()` (no I/O optimizations).

        >>> for chunk in chunkize(range(10), 4): print(chunk)
        [0, 1, 2, 3]
        [4, 5, 6, 7]
        [8, 9]
        """
        assert chunksize > 0

        if maxsize > 0:
            q = multiprocessing.Queue(maxsize=maxsize)
            worker = InputQueue(q, corpus, chunksize, maxsize=maxsize, as_numpy=as_numpy)
            worker.daemon = True
            worker.start()
            while True:
                chunk = [q.get(block=True)]
                if chunk[0] is None:
                    break
                yield chunk.pop()
        else:
            for chunk in chunkize_serial(corpus, chunksize, as_numpy=as_numpy):
                yield chunk
Building the dictionary and removing low-frequency words
from gensim import corpora

def get_dictionary(documents, min_count=1):
    dictionary = corpora.Dictionary(documents)
    lowfreq_ids = [tokenid for tokenid, docfreq in dictionary.dfs.items()
                   if docfreq < min_count]
    # remove low-frequency words
    dictionary.filter_tokens(lowfreq_ids)
    # remove gaps in id sequence after words that were removed
    dictionary.compactify()
    return dictionary

dictionary = get_dictionary(documents, min_count=1)
Other uses of the Dictionary class
The Dictionary class has several other useful methods; a few are listed below, followed by a short combined example.
dictionary.filter_n_most_frequent(N)
Filters out the N most frequently occurring tokens.
dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=100000)
1. Drop tokens that appear in fewer than no_below documents.
2. Drop tokens that appear in more than no_above of the documents; note that this is a fraction of all documents, not an absolute count.
3. On top of 1 and 2, keep only the keep_n most frequent tokens.
dictionary.filter_tokens(bad_ids=None, good_ids=None)
Two usages: either remove the tokens whose ids are listed in bad_ids, or keep only the tokens whose ids are listed in good_ids and drop everything else. Both bad_ids and good_ids are lists of token ids.
dictionary.compactify()
The filtering operations above can leave gaps in the token-id sequence; this method reassigns the ids so that the gaps disappear.
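A short sketch that ties these calls together on a toy corpus (the texts and thresholds are purely illustrative):

from gensim import corpora

toy_texts = [['human', 'computer', 'interface'],
             ['computer', 'user', 'survey'],
             ['user', 'interface', 'system']]
toy_dict = corpora.Dictionary(toy_texts)

toy_dict.filter_n_most_frequent(1)                                # drop the single most frequent token
toy_dict.filter_extremes(no_below=1, no_above=0.8, keep_n=100000)
toy_dict.filter_tokens(bad_ids=[0])                               # drop the token with id 0, if it still exists
toy_dict.compactify()                                             # close the gaps left by the removals
print(toy_dict.token2id)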
Sometimes the number of topics is chosen fairly arbitrarily. If topic extraction is only an intermediate step in a larger pipeline, then for most users the exact number of topics has little effect on the final outcome: as long as enough topics are extracted, the end results barely differ.
However (author's note), the number of topics reflects the complexity of the model. If it is too small, the model's ability to describe the data is limited; once it exceeds a certain threshold the model is already expressive enough for the data, and adding further topics brings no benefit while increasing training time.
The author therefore suggests cross-validation: train with different numbers of topics, test how sensitive the results are to this parameter, and choose the topic count based on the accuracy of the final results.
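Short of a full cross-validation, a simple sweep over candidate topic counts, scored with the same coherence measure used earlier, already gives a feel for this sensitivity; a rough sketch with illustrative values:

# Sweep over candidate numbers of topics and keep the most coherent model.
# Candidate values and training parameters are illustrative only.
results = []
for k in (5, 10, 20, 40):
    m = AuthorTopicModel(corpus=corpus, num_topics=k, id2word=dictionary.id2token,
                         author2doc=author2doc, chunksize=2000, passes=10,
                         eval_every=0, random_state=0)
    mean_tc = sum(t[1] for t in m.top_topics(corpus)) / float(k)  # average per-topic coherence
    results.append((k, m, mean_tc))
best_k, best_model, best_tc = max(results, key=lambda r: r[2])
print('best number of topics: %d (mean coherence %.4f)' % (best_k, best_tc))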