The pipeline: segment the text with the jieba package under Python 3 and arrange the tokens into the format gensim expects; convert the documents to be compared into sparse vectors with doc2bow; weight the corpus with the TF-IDF model from gensim's models module; build an index with sparse-matrix similarity over the feature count; and finally read off the similarity scores.
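For reference, doc2bow turns a token list into a sparse vector of (token_id, token_count) pairs, e.g. (a minimal sketch, unrelated to the files used below):

from gensim import corpora

d = corpora.Dictionary([['a', 'b', 'b']])
print(d.doc2bow(['a', 'b', 'b']))  # e.g. [(0, 1), (1, 2)]: token 0 appears once, token 1 twice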
Here is the code:
# -*- coding: utf-8 -*-
"""
@author: zjp
Python3.6.6
"""
import jieba
from gensim import corpora, models, similarities
from collections import defaultdict
text1 = open(r'F://data_temp/test1.txt', encoding='utf-8').read()
text2 = open(r'F://data_temp/test1.txt', encoding='utf-8').read()
text3 = open(r'F://data_temp/test1.txt', encoding='utf-8').read()  # load the documents to compare (all three deliberately point at the same file for this test)
textCut1 = '/'.join(jieba.cut(text1, cut_all=False))  # precise-mode segmentation, tokens joined with '/'
textCut2 = '/'.join(jieba.cut(text2, cut_all=False))
textCut3 = '/'.join(jieba.cut(text3, cut_all=False))
documents1 = [textCut1, textCut2]  # text1 and text2 form the corpus; text3 is the query
texts1 = [doc.split('/') for doc in documents1]  # back to token lists
frequency = defaultdict(int)
for text in texts1:
    for token in text:
        frequency[token] += 1
texts2 = [[word for word in text if frequency[word] > 2] for text in texts1]  # note: texts2 is built but never used below; the dictionary is built from texts1
dict1 = corpora.Dictionary(texts1)  # build the vocabulary (all tokens) over the whole corpus
corpus1 = [dict1.doc2bow(text) for text in texts1]  # convert each document to a bag-of-words sparse vector
tfidf1 = models.TfidfModel(corpus1)  # fit a TF-IDF model on the bag-of-words corpus
corpusTfidf1 = tfidf1[corpus1]  # apply the model to get TF-IDF-weighted vectors
featureNum1 = len(dict1.token2id)  # number of features = vocabulary size
similarity1 = similarities.Similarity('Similarity-tfidf-index', corpusTfidf1, num_features=featureNum1)  # build a sparse-matrix similarity index
new_xs = dict1.doc2bow(textCut3.split('/'))  # convert the query document to a sparse vector via doc2bow
sim = similarity1[tfidf1[new_xs]]  # similarity of the query against each indexed document
print(sim)
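To see where the zeros come from, it also helps to print the intermediate TF-IDF vectors and the model's idf table (a small diagnostic using the variables above; idfs is TfidfModel's dict of per-token-id idf values):

print(list(corpusTfidf1))  # the weighted corpus vectors; entries with weight 0 are dropped
print(tfidf1.idfs)         # idf value for each token id in dict1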
It runs without any errors, but the similarity comes back as [0., 0.]. At first I assumed the texts were simply not similar, but even after pointing every file at the same document it is still [0., 0.]. Does anyone know why? The full output:
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\AP\AppData\Local\Temp\jieba.cache
Loading model cost 3.835 seconds.
Prefix dict has been built succesfully.
[0. 0.]
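A likely cause, assuming gensim's default global weight idf(t) = log2(N / df(t)): the corpus here contains only two documents, and they are identical, so every term has df = N = 2 and therefore idf = 0. Both TF-IDF vectors collapse to all zeros, and the cosine similarity against a zero vector is 0. A minimal sketch of the effect:

from gensim import corpora, models

docs = [['hello', 'world'], ['hello', 'world']]  # two identical corpus documents
d = corpora.Dictionary(docs)
bow = [d.doc2bow(doc) for doc in docs]
tfidf = models.TfidfModel(bow)
print(tfidf.idfs)               # {0: 0.0, 1: 0.0} -- every term appears in every document
print([tfidf[v] for v in bow])  # [[], []] -- all weights are 0 and get filtered out

With corpus documents that actually differ, or a larger corpus where terms do not all appear in every document, the same pipeline should return non-zero similarities.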