Introduction to gensim

1. Introduction

A Python NLP library that provides the tf-idf model, word2vec, doc2vec, and more.
Official site: https://radimrehurek.com/gensim/

2. word2vec

Official tutorial: models.word2vec – Deep learning with word2vec

2.1 Classes and methods

  • gensim.models.word2vec.Word2Vec(utils.SaveLoad)
    Class for training, using, and evaluating word2vec models (see the sketch after this list).

    __init__(self, sentences=None, size=100, alpha=0.025, window=5, min_count=5, ...)
    sentences: a list of sentences, where each sentence is itself a list of tokens, [word1, word2, …, word_n].
    size: the dimensionality of the feature vectors.
    window: the maximum distance between the current and predicted word within a sentence.
    alpha: the initial learning rate.
    seed: seed for the random number generator.
    min_count: ignore all words with total frequency lower than this.

  • save(self, *args, **kwargs)
    Persists the model, e.g. model.save('/tmp/mymodel').
  • @classmethod load(cls, *args, **kwargs)
    Deserializes a persisted model, e.g. new_model = gensim.models.Word2Vec.load('/tmp/mymodel').
  • model[word]
    E.g. model['computer'] returns that word's vector as a NumPy array.
  • model.wv.similar_by_word(self, word, topn=10, …)
    Returns a word's k nearest neighbors, ranked by cosine similarity.
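A minimal end-to-end sketch tying these methods together (the toy sentences are made up here; parameter names follow the pre-4.0 gensim API used in this post, where size was later renamed vector_size):

from gensim.models import Word2Vec

# Toy corpus: a list of sentences, each a list of tokens.
sentences = [['human', 'interface', 'computer'],
             ['survey', 'user', 'computer', 'system', 'response', 'time'],
             ['graph', 'minors', 'survey']]

# min_count=1 so the tiny corpus is not filtered away entirely.
model = Word2Vec(sentences, size=100, window=5, min_count=1)

model.save('/tmp/mymodel')                  # persist the model
new_model = Word2Vec.load('/tmp/mymodel')   # deserialize it again

vec = new_model.wv['computer']              # NumPy vector for a word (model['computer'] in older code)
print(new_model.wv.similar_by_word('computer', topn=3))  # k nearest neighbors by cosine similarity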

2.2 Examples

model.wv.most_similar_cosmul(positive=['woman', 'king'], negative=['man'])
# [('queen', 0.71382287), ...]

model.wv.doesnt_match("breakfast cereal dinner lunch".split())
# 'cereal'

model.wv.similarity('woman', 'man')
# 0.73723527
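These analogy and similarity queries only give meaningful results for a model trained on a large corpus; with the toy model above they are essentially noise. A common option is to load a pretrained vector set, e.g. (a sketch; the file path is hypothetical, e.g. the published GoogleNews vectors):

from gensim.models import KeyedVectors

# Load pretrained vectors in word2vec's binary format (path is hypothetical).
wv = KeyedVectors.load_word2vec_format('/path/to/GoogleNews-vectors-negative300.bin', binary=True)
print(wv.most_similar_cosmul(positive=['woman', 'king'], negative=['man']))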

3. doc2vec

Official tutorial: models.doc2vec – Deep learning with paragraph2vec

In word2vec the corpus vocabulary is typically on the order of hundreds of thousands of words, so a new sentence rarely contains out-of-vocabulary words. In doc2vec, by contrast, a new document is itself unseen, so gensim provides
gensim.models.doc2vec.Doc2Vec#infer_vector(self, doc_words, alpha=0.1, min_alpha=0.0001, steps=5)
which, once a model has been trained, infers a vector representation for a new document.
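A minimal sketch of training a model and inferring a vector for an unseen document (toy documents made up here; this follows the gensim 3.x API, where the constructor takes vector_size and infer_vector still accepts steps, renamed epochs in 4.0):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each training document is a TaggedDocument: a token list plus one or more tags.
docs = [TaggedDocument(words=['human', 'interface', 'computer'], tags=['doc0']),
        TaggedDocument(words=['graph', 'minors', 'survey'], tags=['doc1'])]

model = Doc2Vec(docs, vector_size=50, min_count=1)

# Infer a vector representation for a new, unseen document.
vec = model.infer_vector(['computer', 'survey'], alpha=0.1, min_alpha=0.0001, steps=5)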

Common classes and methods

  • gensim.similarities.docsim.SparseMatrixSimilarity(interfaces.SimilarityABC)
    Class that computes document similarity using cosine similarity; section 4 below shows it in use.

4. tf-idf model

import logging
#logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

from gensim import corpora, models, similarities

# First, create a small corpus of 9 documents and 12 features,
# represented as a list of lists of (feature_id, weight) tuples.
# see: https://radimrehurek.com/gensim/tut1.html
corpus = [[(0, 1.0), (1, 1.0), (2, 1.0)],
          [(2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)],
          [(1, 1.0), (3, 1.0), (4, 1.0), (7, 1.0)],
          [(0, 1.0), (4, 2.0), (7, 1.0)],
          [(3, 1.0), (5, 1.0), (6, 1.0)],
          [(9, 1.0)],
          [(9, 1.0), (10, 1.0)],
          [(9, 1.0), (10, 1.0), (11, 1.0)],
          [(8, 1.0), (10, 1.0), (11, 1.0)]]

tfidf = models.TfidfModel(corpus)

# A query document in the same bag-of-words format; tfidf[vec] converts it to tf-idf weights.
vec = [(0, 1), (4, 1)]
print(tfidf[vec])
# Build the similarity index over the tf-idf corpus: 9 documents x 12 features
index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=12)
sims = index[tfidf[vec]]
print(list(enumerate(sims)))
"""
[(0, 0.8075244024440723), (4, 0.5898341626740045)]

# Document number zero (the first document) has a similarity score of 0.466=46.6%, the second document has a similarity score of 19.1% etc.
[(0, 0.4662244), (1, 0.19139354), (2, 0.24600551), (3, 0.82094586), (4, 0.0), (5, 0.0), (6, 0.0), (7, 0.0), (8, 0.0)]
"""
