A Python NLP library. It includes the TF-IDF model, word2vec, doc2vec, and more.
Official website
Official tutorial: models.word2vec – Deep learning with word2vec
gensim.models.word2vec.Word2Vec(utils.SaveLoad)
A class for training, using, and evaluating word2vec models.
__init__(self, sentences=None, size=100, alpha=0.025, window=5, min_count=5, ...)
sentences
: a list of sentences. Each sentence is itself a list of the form [word1, word2, …, word_n].
size
: the dimensionality of the feature vectors.
window
: the maximum distance between the current and predicted word within a sentence.
alpha
: the initial learning rate.
seed
: for the random number generator
min_count
: ignore all words with total frequency lower than this.
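As a minimal sketch of the `sentences` format and of what `min_count` does, the snippet below tokenizes a few raw strings into a list of token lists and checks which words would survive a frequency threshold. The sample texts, the whitespace tokenizer, and the helper name `build_sentences` are assumptions made for illustration; none of them are part of gensim, whose own vocabulary building is more involved.

```python
from collections import Counter

def build_sentences(raw_texts):
    """Turn raw strings into the list-of-token-lists shape described above."""
    # Whitespace tokenization is an assumption made for this illustration.
    return [text.lower().split() for text in raw_texts]

raw_texts = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a cat chased the dog",
]
sentences = build_sentences(raw_texts)

# Total frequency of every word across all sentences.
freq = Counter(word for sentence in sentences for word in sentence)

# Words whose total frequency is lower than min_count would be ignored.
min_count = 2
kept = sorted(word for word, count in freq.items() if count >= min_count)

print(sentences[0])
print(kept)
```

A list shaped like `sentences` is what the `sentences` parameter expects; with the default `min_count=5`, a corpus this small would leave almost no vocabulary, which is why the threshold is lowered here.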
save(self, *args, **kwargs)
model.save('/tmp/mymodel')
@classmethod load(cls, *args, **kwargs)
new_model = gensim.models.Word2Vec.load('/tmp/mymodel')
model[word]
: returns the vector for word. Newer gensim versions expose this as model.wv[word].
model.wv.most_similar_cosmul(positive=['woman', 'king'], negative=['man'])
# returns [('queen', 0.71382287), ...]
model.wv.doesnt_match("breakfast cereal dinner lunch".split())
# 'cereal'
model.wv.similarity('woman', 'man')
# 0.73723527
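To make the three queries above less of a black box, here is a pure-Python sketch of the arithmetic behind them, to the best of my understanding: `similarity` is a cosine, `most_similar_cosmul` multiplies cosines shifted into [0, 1] for the positive words and divides by those of the negative words, and `doesnt_match` picks the word farthest from the mean of the unit vectors. The 3-d vectors and all function bodies are hypothetical stand-ins written for this note, not gensim code.

```python
import math

# Hypothetical 3-d "embeddings", hand-picked for illustration only.
vectors = {
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man":   [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "boy":   [0.0, 1.0, 0.1],
}

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine similarity, the quantity behind similarity()."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))

def cosmul_score(candidate, positive, negative, eps=1e-6):
    """Multiplicative score: shifted cosines of positives over negatives."""
    numerator = 1.0
    for word in positive:
        numerator *= (1 + cosine(vectors[candidate], vectors[word])) / 2
    denominator = 1.0
    for word in negative:
        denominator *= (1 + cosine(vectors[candidate], vectors[word])) / 2
    return numerator / (denominator + eps)

def most_similar_cosmul(positive, negative):
    """Best-scoring word that is not one of the query words."""
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosmul_score(w, positive, negative))

def doesnt_match(words):
    """Word whose unit vector is least aligned with the mean unit vector."""
    units = [unit(vectors[w]) for w in words]
    mean = unit([sum(col) / len(units) for col in zip(*units)])
    return min(words, key=lambda w: sum(a * b for a, b in zip(unit(vectors[w]), mean)))

best = most_similar_cosmul(positive=["woman", "king"], negative=["man"])
odd = doesnt_match(["man", "boy", "king", "woman"])
print(best, odd, cosine(vectors["woman"], vectors["man"]))
```

With real embeddings the scores differ, of course; the point is only that the multiplicative form rewards a candidate for being close to every positive word at once while penalizing closeness to the negative ones.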
Official tutorial: models.doc2vec – Deep learning with paragraph2vec
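In word2vec, the corpus vocabulary usually runs to a hundred thousand words or more, so the words in a new sentence are rarely out-of-vocabulary.
In doc2vec, however, a new document is by definition unseen. For this case gensim provides the function
gensim.models.doc2vec.Doc2Vec#infer_vector(self, doc_words, alpha=0.1, min_alpha=0.0001, steps=5)
which, once the model has been trained, infers the vector representation of a new document.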
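The idea behind inferring a vector for an unseen document can be sketched as follows: keep the trained weights frozen and run a few gradient steps on a fresh document vector only, with the learning rate decaying from `alpha` to `min_alpha`. The code below is a simplified pure-Python illustration under stated assumptions: a PV-DBOW-style objective with a full softmax over a four-word vocabulary, hand-written output weights, and a linear decay schedule. It is not gensim's implementation, and `infer_vector_sketch` is a hypothetical name.

```python
import math
import random

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def word_probs(doc_vec, out_weights):
    """p(word | document) for every vocabulary word."""
    return softmax([sum(w * d for w, d in zip(row, doc_vec)) for row in out_weights])

def log_likelihood(doc_vec, word_ids, out_weights):
    probs = word_probs(doc_vec, out_weights)
    return sum(math.log(probs[i]) for i in word_ids)

def infer_vector_sketch(doc_words, vocab, out_weights,
                        alpha=0.1, min_alpha=0.0001, steps=5, seed=0):
    """Fit a vector for one new document; out_weights are never modified."""
    rng = random.Random(seed)
    dim = len(out_weights[0])
    # Words missing from the trained vocabulary are skipped.
    word_ids = [vocab[w] for w in doc_words if w in vocab]
    doc_vec = [(rng.random() - 0.5) / dim for _ in range(dim)]
    for step in range(steps):
        # Linear decay from alpha to min_alpha (an assumption of this sketch).
        lr = alpha - (alpha - min_alpha) * step / max(steps - 1, 1)
        for target in word_ids:
            probs = word_probs(doc_vec, out_weights)
            for k in range(dim):
                # Gradient of log p(target | doc): W[target] - E_p[W].
                expected = sum(p * row[k] for p, row in zip(probs, out_weights))
                doc_vec[k] += lr * (out_weights[target][k] - expected)
    return doc_vec

# Hypothetical "trained" model: 4-word vocabulary, 2-d output weights.
vocab = {"cat": 0, "dog": 1, "fish": 2, "bird": 3}
out_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
new_doc = ["cat", "cat", "dog", "unicorn"]  # "unicorn" is out-of-vocabulary
word_ids = [vocab[w] for w in new_doc if w in vocab]

start = infer_vector_sketch(new_doc, vocab, out_weights, steps=0)
inferred = infer_vector_sketch(new_doc, vocab, out_weights)
print(log_likelihood(start, word_ids, out_weights))
print(log_likelihood(inferred, word_ids, out_weights))
```

The second printed log-likelihood should be higher than the first, which is all "inference" means here: the new document gets a vector that explains its words better, while the rest of the model stays untouched.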
gensim.similarities.docsim.SparseMatrixSimilarity(interfaces.SimilarityABC)
import logging
#logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
from gensim import corpora, models, similarities
# First, create a small corpus of 9 documents and 12 features
# a list of list of tuples
# see: https://radimrehurek.com/gensim/tut1.html
corpus = [[(0, 1.0), (1, 1.0), (2, 1.0)],
          [(2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)],
          [(1, 1.0), (3, 1.0), (4, 1.0), (7, 1.0)],
          [(0, 1.0), (4, 2.0), (7, 1.0)],
          [(3, 1.0), (5, 1.0), (6, 1.0)],
          [(9, 1.0)],
          [(9, 1.0), (10, 1.0)],
          [(9, 1.0), (10, 1.0), (11, 1.0)],
          [(8, 1.0), (10, 1.0), (11, 1.0)]]
tfidf = models.TfidfModel(corpus)
vec = [(0, 1), (4, 1)]
print(tfidf[vec])
# shape=9*12
index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=12)
sims = index[tfidf[vec]]
print(list(enumerate(sims)))
"""
Output of print(tfidf[vec]):
[(0, 0.8075244024440723), (4, 0.5898341626740045)]

Output of print(list(enumerate(sims))):
Document number zero (the first document) has a similarity score of 0.466 = 46.6%, the second document has a similarity score of 19.1%, etc.
[(0, 0.4662244), (1, 0.19139354), (2, 0.24600551), (3, 0.82094586), (4, 0.0), (5, 0.0), (6, 0.0), (7, 0.0), (8, 0.0)]
"""
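To see where these numbers come from, the following pure-Python sketch recomputes them by hand. It assumes what I understand to be gensim's default TF-IDF weighting: raw term frequency times log2(N / document frequency), followed by L2 normalization, with cosine similarity for the comparison. The helper name `tfidf_vector` is hypothetical.

```python
import math
from collections import Counter

corpus = [[(0, 1.0), (1, 1.0), (2, 1.0)],
          [(2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)],
          [(1, 1.0), (3, 1.0), (4, 1.0), (7, 1.0)],
          [(0, 1.0), (4, 2.0), (7, 1.0)],
          [(3, 1.0), (5, 1.0), (6, 1.0)],
          [(9, 1.0)],
          [(9, 1.0), (10, 1.0)],
          [(9, 1.0), (10, 1.0), (11, 1.0)],
          [(8, 1.0), (10, 1.0), (11, 1.0)]]

# Document frequency: in how many documents each term id occurs.
doc_freq = Counter(term for doc in corpus for term, _ in doc)
num_docs = len(corpus)
idf = {term: math.log2(num_docs / df) for term, df in doc_freq.items()}

def tfidf_vector(doc, idf):
    """Weight each term by tf * idf, then scale the vector to unit length."""
    weights = {term: count * idf[term] for term, count in doc if term in idf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else {}

query = tfidf_vector([(0, 1), (4, 1)], idf)

# Both sides are unit vectors, so the dot product is the cosine similarity.
sims = [sum(query.get(term, 0.0) * w for term, w in tfidf_vector(doc, idf).items())
        for doc in corpus]

print(sorted(query.items()))
print(list(enumerate(sims)))
```

Documents 4 through 8 share no term with the query, so their similarity is exactly 0; document 3 scores highest because it contains both query terms, with term 4 occurring twice.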