A supervised-flavored author-topic model in gensim

A plain-language introduction to the author-topic model

from gensim.models import AuthorTopicModel

# Train several models with different random seeds and record each
# model's topic coherence.
model_list = []
for i in range(5):
    model = AuthorTopicModel(corpus=corpus, num_topics=10, id2word=dictionary.id2token,
                             author2doc=author2doc, chunksize=2000, passes=100,
                             gamma_threshold=1e-10, eval_every=0, iterations=1,
                             random_state=i)
    top_topics = model.top_topics(corpus)
    tc = sum(t[1] for t in top_topics)
    model_list.append((model, tc))

By setting the random_state parameter to different random seeds, we train several models and keep the one with the highest topic coherence.
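A minimal sketch of the selection step, assuming the model_list built above:

# Keep the run with the highest topic coherence.
model, tc = max(model_list, key=lambda x: x[1])
print('Best topic coherence: %.4f' % tc)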

model.show_topics(num_topics=10, num_words=10, log=False, formatted=True) returns the topics as formatted strings:
model.show_topics(num_topics=10)
>>> [(0,
  '0.014*"action" + 0.014*"control" + 0.010*"policy" + 0.009*"q" + 0.009*"reinforcement" + 0.007*"optimal" + 0.006*"robot" + 0.005*"controller" + 0.005*"dynamic" + 0.005*"environment"'),
 (1,
  '0.020*"image" + 0.008*"face" + 0.007*"cluster" + 0.006*"signal" + 0.005*"source" + 0.005*"matrix" + 0.005*"filter" + 0.005*"search" + 0.004*"distance" + 0.004*"o_o"')]

model.get_topic_terms(topicid, topn=10) takes a topic id and returns that topic's most significant word ids with their probabilities:
model.get_topic_terms(1, topn=10)
>>> [(774, 0.019700538013351386),
 (3215, 0.0075965808303036916),
 (3094, 0.0067132528809042526),
 (514, 0.0063925849599646822),
 (2739, 0.0054527647598129206),
 (341, 0.004987335769043616),
 (752, 0.0046566448210636699),
 (1218, 0.0046234352422933724),
 (186, 0.0042132891022475458),
 (829, 0.0041800479706789939)]
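The word ids map back to words through the training dictionary; a small sketch reusing dictionary from above:

[(dictionary.id2token[wid], prob) for wid, prob in model.get_topic_terms(1, topn=10)]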

model.get_topics() returns the topic-term matrix, with one entry per (topic, word) pair; here 10 topics × 7,674 words:
model.get_topics()
>>> array([[  9.57974777e-05,   6.17130780e-07,   6.34938224e-07, ...,
          6.17080048e-07,   6.19691132e-07,   6.17090716e-07],
       [  9.81065671e-05,   3.12945042e-05,   2.80837858e-04, ...,
          7.86879291e-07,   7.86479617e-07,   7.86592758e-07],
       [  4.57734625e-05,   1.33555568e-05,   2.55108081e-05, ...,
          5.31796854e-07,   5.32000122e-07,   5.31934336e-07],
       ...])
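A quick sanity check (a sketch): the matrix has one row per topic and one column per vocabulary term, and each row is a probability distribution:

topics = model.get_topics()
print(topics.shape)        # (10, 7674) for this model
print(topics.sum(axis=1))  # each row sums to approximately 1.0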


model.print_topics(num_topics=20, num_words=10) gives:
model.print_topics(num_topics=20, num_words=10)
[(0,
  '0.008*"gaussian" + 0.007*"mixture" + 0.006*"density" + 0.006*"matrix" + 0.006*"likelihood" + 0.005*"noise" + 0.005*"component" + 0.005*"prior" + 0.005*"estimate" + 0.004*"log"'),
 (1,
  '0.025*"image" + 0.010*"object" + 0.008*"distance" + 0.007*"recognition" + 0.005*"pixel" + 0.004*"cluster" + 0.004*"class" + 0.004*"transformation" + 0.004*"constraint" + 0.004*"map"'),
 (2,
  '0.011*"visual" + 0.010*"cell" + 0.009*"response" + 0.008*"field" + 0.008*"motion" + 0.007*"stimulus" + 0.007*"direction" + 0.005*"orientation" + 0.005*"eye" + 0.005*"frequency"')]
  • Author topic-preference functions
☆model['name']
☆model.get_author_topics('name')
model['GeoffreyE.Hinton']
model.get_author_topics('GeoffreyE.Hinton')
>>> [(6, 0.76808063951144978), (7, 0.23181972762044473)]

☆model.id2author.values() — the model's list of author names
model.id2author.values()  # author list
>>> dict_values(['A.A.Handzel', 'A.Afghan', 'A.B.Bonds', 'A.B.Kirillov', 'A.Blake', 'A.C.C.Coolen', 'A.C.Tsoi', 'A.Cichocki', 'A.D.Back', 'A.D.Rexlish', 'A.Dembo', 'A.Drees', 'A.During', 'A.E.Friedman', 'A.F.Murray', 'A.Ferguson', 'A.G.Barto', 'A.G.U.Perera'])
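Building on this list, a small sketch that prints the strongest topic for the first few authors:

for name in list(model.id2author.values())[:5]:
    topics = model.get_author_topics(name)
    if topics:  # some authors may have no topic above the probability threshold
        top_topic, prob = max(topics, key=lambda t: t[1])
        print(name, top_topic, round(prob, 3))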

get_document_topics(word_id, minimum_probability=None) belongs to LDA and does not apply to the ATM; calling it fails with "Method get_document_topics is not valid for the author-topic model."
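A sketch of what that failure looks like (the ATM raises instead of returning document topics):

try:
    model.get_document_topics(0)
except NotImplementedError as e:
    print(e)  # "Method get_document_topics is not valid for the author-topic model."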
  • Word topic-preference function

Returns the topics a given dictionary word is most strongly associated with.

[(model.get_term_topics(i),dictionary.id2token[i]) for i in range(100)]
>>>  [([], 'acknowledgements'),
 ([], 'acknowledgements_-PRON-'),
 ([], 'acquire'),
 ([(0, 0.013787687427660612)], 'action'),
 ([], 'action_potential'),
 ([], 'active')]

Here the word 'action' leans toward topic 0 ('Circuits'), with probability about 0.013.
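Many words receive no topic assignment at all; a small sketch that keeps only the entries that do:

term_topics = [(dictionary.id2token[i], model.get_term_topics(i)) for i in range(100)]
assigned = [(word, tt) for word, tt in term_topics if tt]  # drop words with empty results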

  • Recommending similar authors

The official example covers two similarity measures: cosine distance and Hellinger distance.
The built-in, standard cosine similarity:

from gensim.similarities import MatrixSimilarity

# Generate a similarity object for the transformed corpus.
index = MatrixSimilarity(model[list(model.id2author.values())])

# Get similarities to some author.
author_name = 'YannLeCun'
sims = index[model[author_name]]
sims 
>>> array([ 0.20777275,  0.8723157 ,  0.        , ...,  0.16174853,
        0.07938839,  0.        ], dtype=float32)
sims holds the cosine similarity between 'YannLeCun' and every other author's topic-preference vector.
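To make the raw array readable, the similarities can be paired with author names and sorted (a sketch; the order of sims follows model.id2author.values(), the same order used to build the index):

author_names = list(model.id2author.values())
most_similar = sorted(zip(author_names, sims), key=lambda x: x[1], reverse=True)
print(most_similar[:10])  # the ten authors closest to 'YannLeCun'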

In model[list(model.id2author.values())], model.id2author maps ids to author names, and indexing the model with that name list yields each author's topic-preference vector. The vectors are of varying length: topics an author has no association with are simply missing.
Example topic-preference vectors:

 [(4, 0.88636616132828339), (8, 0.083545096138703312)],
 [(4, 0.27129746825443646), (8, 0.71594003971848896)],
 [(0, 0.07188868711639794),  (1, 0.069390116586544176),
  (3, 0.035190795872695843),  (4, 0.011718365474455844),
  (5, 0.058831820905365088),  (6, 0.68542799691757561),
  (9, 0.041390087371285786)]

For example, [(4, 0.88636616132828339), (8, 0.083545096138703312)] means that author leans toward topics 4 and 8.
Hellinger distance
The custom function from the official example:

from gensim import matutils
import pandas as pd

# Make a list of all the author-topic distributions.
author_vecs = [model.get_author_topics(author) for author in model.id2author.values()]

def similarity(vec1, vec2):
    '''Get similarity between two vectors'''
    dist = matutils.hellinger(matutils.sparse2full(vec1, model.num_topics), \
                              matutils.sparse2full(vec2, model.num_topics))
    sim = 1.0 / (1.0 + dist)
    return sim

def get_sims(vec):
    '''Get similarity of vector to all authors.'''
    sims = [similarity(vec, vec2) for vec2 in author_vecs]
    return sims

def get_table(name, top_n=10, smallest_author=1):
    '''
    Get table with similarities, author names, and author sizes.
    Return `top_n` authors as a dataframe.

    '''

    # Get similarities.
    sims = get_sims(model.get_author_topics(name))

    # Arrange author names, similarities, and author sizes in a list of tuples.
    table = []
    for author_id, sim in enumerate(sims):
        author_name = model.id2author[author_id]
        author_size = len(model.author2doc[author_name])
        if author_size >= smallest_author:
            table.append((author_name, sim, author_size))

    # Make dataframe and retrieve top authors.
    df = pd.DataFrame(table, columns=['Author', 'Score', 'Size'])
    df = df.sort_values('Score', ascending=False)[:top_n]

    return df

Calling get_table('YannLeCun', top_n=10, smallest_author=3) returns, for 'YannLeCun', the most similar authors under the Hellinger-based score (note that similarity() converts the Hellinger distance into a similarity via 1/(1+dist)).
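For example:

get_table('YannLeCun', top_n=10, smallest_author=3)
# -> a DataFrame with columns ['Author', 'Score', 'Size'],
#    sorted by Score in descending order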

A common problem

C:\Python27\lib\site-packages\gensim\utils.py:855: UserWarning: detected Windows; aliasing chunkize to chunkize_serial warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")

I am unable to import word2vec from gensim.models due to this warning.


You can suppress the message with this code before importing gensim:

import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')

import gensim

This is not a big problem: gensim is just letting you know that it aliases chunkize to a different function because of the operating system you are using.

Check the relevant code in gensim.utils:

if os.name == 'nt':
    logger.info("detected Windows; aliasing chunkize to chunkize_serial")

    def chunkize(corpus, chunksize, maxsize=0, as_numpy=False):
        for chunk in chunkize_serial(corpus, chunksize, as_numpy=as_numpy):
            yield chunk
else:
    def chunkize(corpus, chunksize, maxsize=0, as_numpy=False):
        """
        Split a stream of values into smaller chunks.
        Each chunk is of length `chunksize`, except the last one which may be smaller.
        A once-only input stream (`corpus` from a generator) is ok, chunking is done
        efficiently via itertools.

        If `maxsize > 1`, don't wait idly in between successive chunk `yields`, but
        rather keep filling a short queue (of size at most `maxsize`) with forthcoming
        chunks in advance. This is realized by starting a separate process, and is
        meant to reduce I/O delays, which can be significant when `corpus` comes
        from a slow medium (like harddisk).

        If `maxsize==0`, don't fool around with parallelism and simply yield the chunksize
        via `chunkize_serial()` (no I/O optimizations).

        >>> for chunk in chunkize(range(10), 4): print(chunk)
        [0, 1, 2, 3]
        [4, 5, 6, 7]
        [8, 9]

        """
        assert chunksize > 0

        if maxsize > 0:
            q = multiprocessing.Queue(maxsize=maxsize)
            worker = InputQueue(q, corpus, chunksize, maxsize=maxsize, as_numpy=as_numpy)
            worker.daemon = True
            worker.start()
            while True:
                chunk = [q.get(block=True)]
                if chunk[0] is None:
                    break
                yield chunk.pop()
        else:
            for chunk in chunkize_serial(corpus, chunksize, as_numpy=as_numpy):
                yield chunk

Building a dictionary and removing low-frequency words

from gensim import corpora

def get_dictionary(documents, min_count=1):
    dictionary = corpora.Dictionary(documents)
    lowfreq_ids = [tokenid for tokenid, docfreq in dictionary.dfs.items()
                   if docfreq < min_count]
    # remove low-frequency words
    dictionary.filter_tokens(lowfreq_ids)
    # remove gaps in id sequence after words that were removed
    dictionary.compactify()
    return dictionary

dictionary = get_dictionary(documents, min_count=1)
Other uses of dictionary

The Dictionary class offers several other methods; a few are listed below.

dictionary.filter_n_most_frequent(N)
Removes the N most frequent tokens.

dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=100000)
1. Removes tokens that appear in fewer than no_below documents.
2. Removes tokens that appear in more than no_above of the documents; note this is a fraction, not an absolute count.
3. After steps 1 and 2, keeps only the keep_n most frequent tokens.

dictionary.filter_tokens(bad_ids=None, good_ids=None)
Two usages: remove the tokens listed in bad_ids, or keep only the tokens listed in good_ids and remove everything else. Both bad_ids and good_ids are lists of token ids.

dictionary.compactify()
The filtering operations above can leave gaps in the token-id sequence; this method reassigns the ids to remove those gaps.
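A toy sketch of these filters in action (the corpus here is made up for illustration):

from gensim import corpora

texts = [['human', 'computer', 'interface'],
         ['computer', 'system', 'human'],
         ['system', 'graph', 'trees']]
d = corpora.Dictionary(texts)
d.filter_extremes(no_below=2, no_above=0.9, keep_n=100000)  # keep tokens in >= 2 docs and <= 90% of docs
d.compactify()  # close any id gaps left by filtering
print(d.token2id)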

The number of topics is often chosen fairly arbitrarily. If topic extraction is only an intermediate step in a pipeline, then for most users the topic count has little effect on the final result; in other words, as long as you extract enough topics, the end result is much the same.

(Author's note) That said, the number of topics reflects the model's complexity: with too few topics, the model's capacity to describe the data is limited, while past a certain threshold the model is already expressive enough, and extra topics add nothing except longer training time.

(Author's note) A practical suggestion is cross-validation: train with different numbers of topics, test the sensitivity of this parameter, and pick the count that yields the best accuracy on the final task.
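A sketch of such a sweep, reusing corpus, dictionary, and author2doc from the training code above; the candidate topic counts and passes value are illustrative:

coherence_by_k = {}
for k in [5, 10, 15, 20]:
    m = AuthorTopicModel(corpus=corpus, num_topics=k, id2word=dictionary.id2token,
                         author2doc=author2doc, passes=10, random_state=0)
    # mean coherence across topics, so scores stay comparable across different k
    coherence_by_k[k] = sum(tc for _, tc in m.top_topics(corpus)) / k
print(coherence_by_k)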
