Computing Document Similarity in NLP with doc2vec

import gensim

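# Raw strings keep the backslashes in Windows paths from being read as escape sequences
outp1 = r'D:\python_noweightpathway\TIA\docmodel'  # where the trained model is saved
# Input corpus: one document per line, words separated by whitespace
file = open(r'D:\python_noweightpathway\TIA\TIAxmmc.txt', encoding='utf-8')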
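# Disabled alternative: tag each line with an ID read from TIA.txt instead of its line number.
# LabeledSentence is the old gensim name; later releases use TaggedDocument(words=..., tags=[...]).
# fileghdjid = open(u'D:\python_noweightpathway\TIA\TIA.txt', encoding='utf-8')
# ghdjids = []
# for ghdjid in fileghdjid:
#     ghdjids.append(ghdjid)
# i = 0
# for line in file:
#     LabeledSentence(words=line.split(), labels=['SENT_%s' % ghdjids[i]])
#     i = i + 1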
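# TaggedLineDocument tags every line with its zero-based line number
documents = gensim.models.doc2vec.TaggedLineDocument(file)
# vector_size replaces the older `size` argument, which gensim 4.0 removed;
# min_count=100 ignores words that occur fewer than 100 times in the corpus
model = gensim.models.Doc2Vec(documents, vector_size=100, window=8, min_count=100, workers=8)
model.save(outp1)
file.close()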
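
The commented-out lines above hint at tagging each document with its own ID rather than its line number. A minimal sketch of that idea follows, assuming the ID file has one ID per line in the same order as the text file. The helper names `pair_lines_with_ids` and `train_with_custom_tags` and the sample lines are hypothetical placeholders, not part of the original script. gensim is imported inside the training helper so the pairing step can be checked on its own.

```python
def pair_lines_with_ids(text_lines, id_lines):
    """Yield (words, tag) pairs; line i of the text gets the tag 'SENT_<id i>'."""
    for line, doc_id in zip(text_lines, id_lines):
        words = line.split()
        if words:  # skip blank lines; zip still consumes the ID, so alignment holds
            yield words, "SENT_%s" % doc_id.strip()


def train_with_custom_tags(pairs, **doc2vec_kwargs):
    """Train Doc2Vec on (words, tag) pairs. Requires gensim to be installed."""
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    corpus = [TaggedDocument(words=words, tags=[tag]) for words, tag in pairs]
    return Doc2Vec(corpus, **doc2vec_kwargs)


# Placeholder stand-ins for the two input files (made-up content).
sample_text = ["w1 w2 w3\n", "w4 w5\n"]
sample_ids = ["1001\n", "1002\n"]
pairs = list(pair_lines_with_ids(sample_text, sample_ids))
print(pairs)
```

With real files, the two lists would be the open file objects, and the result could be passed on as, for example, `train_with_custom_tags(pairs, vector_size=100, window=8, min_count=100, workers=8)`. Documents can then be looked up by their `SENT_...` tag instead of a line number.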

Loading the model

import gensim

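model = gensim.models.Doc2Vec.load(r"D:\python_noweightpathway\TIA\docmodel")
# model.docvecs is the gensim 3.x name; gensim 4.x exposes the same vectors as model.dv
print(model.docvecs.most_similar(4))    # documents most similar to document 4
print(model.docvecs.similarity(2, 12))  # cosine similarity between documents 2 and 12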
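
The similarity score printed above is a cosine similarity between two document vectors. As a rough illustration of what that number means, here is a plain-Python sketch; the function name `cosine_similarity` and the three sample vectors are made up for the example, and this is not gensim's own implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Illustrative vectors only; real document vectors have vector_size entries.
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]  # same direction as vec_a
vec_c = [0.0, 0.0, 3.0]  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))  # close to 1: same direction
print(cosine_similarity(vec_a, vec_c))  # 0.0: nothing in common
```

Only the direction of the vectors matters, not their length, which is why `vec_a` and `vec_b` score as nearly identical.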

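A document vector is trained with essentially the same model as a word vector. The only difference is that during training the document ID is also treated as a word, so it takes part in every prediction made inside that document. The document ID therefore absorbs information from every word in the document, and the vector learned for it is the document vector.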
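
To make the "document ID as an extra word" idea concrete, the sketch below lists the training examples that the distributed-memory variant of doc2vec (PV-DM, which gensim selects by default with `dm=1`) would conceptually build: the document tag joins the context words in predicting each target word. This is a simplified illustration with a hypothetical function name and toy data. It leaves out the neural network, the random window shrinking, and the sampling tricks used by real implementations.

```python
def pv_dm_examples(tagged_docs, window=2):
    """Yield (doc_tag, context_words, target_word) for every word position."""
    for tag, words in tagged_docs:
        for i, target in enumerate(words):
            left = words[max(0, i - window):i]
            right = words[i + 1:i + 1 + window]
            # The tag is present in every example drawn from its document,
            # which is how it accumulates information about all the words.
            yield tag, left + right, target


# Toy corpus: one document tagged "DOC_0".
docs = [("DOC_0", ["the", "cat", "sat"])]
examples = list(pv_dm_examples(docs, window=1))
for example in examples:
    print(example)
```

Every example from a document carries the same tag, while the context words change with the position, so the tag's vector is updated by all of the document's words.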
