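For the calculation details, see the Zhihu article “sklearn-TfidfVectorizer彻底说清楚”.
1. Compute the TF-IDF values from the training corpus.
2. Compute the TF-IDF value of each word in the test sentence (a word in the test sentence gets a TF-IDF value only if it is in the dictionary built from the training corpus).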
import jieba
from gensim import corpora, similarities, models

sentences = ['我爱你', '我喜欢他和他喜欢我', '他说今天空气很清新']
test_sent = '我爱你们,我喜欢他'

# 1. Tokenize each sentence
text = [[word for word in jieba.cut(sentence)] for sentence in sentences]

# 2. Assign an index to every token to get the dictionary
dictionary = corpora.Dictionary(text)
print('dictionary=', dictionary)
for idx, word in dictionary.items():
    print(idx, word, end="\t")
print()

# 3. Count term frequencies in each sentence to get the bag-of-words corpus
corpus = [dictionary.doc2bow(word_list) for word_list in text]
print("[dictionary.doc2bow(word_list) for word_list in text]")
for word_list in text:
    print('\t', word_list, end="\t")
    print(dictionary.doc2bow(word_list))

# 4. Fit the TF-IDF model on the corpus; the model holds the statistics used to weight each token
model = models.TfidfModel(corpus)
tfidf = model[corpus]  # TF-IDF value of every token in every sentence
print('tfidf=', tfidf)
for ele in tfidf:
    print('\t', ele)

# Similarity index; its input is the TF-IDF corpus
similarity = similarities.MatrixSimilarity(tfidf)
print('similarity=', similarity)
for ele in similarity:
    print('\t', ele)

test_word_list = [word for word in jieba.cut(test_sent)]
print('test_word_list=', test_word_list)
# The dictionary comes from the training data, so only some test tokens are found in it
test_word_freq_count = dictionary.doc2bow(test_word_list)
print('test_word_freq_count=', test_word_freq_count)
test_tfidf = model[test_word_freq_count]
print('test_tfidf=', test_tfidf)

# Similarity to every training sentence; there are three, so len(sim) == 3
sim = similarity[test_tfidf]
print("sim=", sim, sim.dtype)
max_sim = max(sim)
print('max_sim=', max_sim, end='\t')
max_index = list(sim).index(max_sim)
print('max_index=', max_index)
# Output
dictionary= Dictionary(10 unique tokens: ['我爱你', '他', '和', '喜欢', '我']...)
0 我爱你 1 他 2 和 3 喜欢 4 我 5 今天 6 很 7 清新 8 空气 9 说
[dictionary.doc2bow(word_list) for word_list in text]
	 ['我爱你'] [(0, 1)]
	 ['我', '喜欢', '他', '和', '他', '喜欢', '我'] [(1, 2), (2, 1), (3, 2), (4, 2)]
	 ['他', '说', '今天', '空气', '很', '清新'] [(1, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1)]
tfidf=
	 [(0, 1.0)]
	 [(1, 0.23892106670040594), (2, 0.323679663983242), (3, 0.647359327966484), (4, 0.647359327966484)]
	 [(1, 0.16284991207632715), (5, 0.44124367556640004), (6, 0.44124367556640004), (7, 0.44124367556640004), (8, 0.44124367556640004), (9, 0.44124367556640004)]
similarity= MatrixSimilarity<3 docs, 10 features>
	 [1. 0. 0.]
	 [0. 0.99999994 0.03890828]
	 [0. 0.03890828 1. ]
test_word_list= ['我', '爱', '你们', ',', '我', '喜欢', '他']
test_word_freq_count= [(1, 1), (3, 1), (4, 2)]
test_tfidf= [(1, 0.16284991207632712), (3, 0.44124367556640004), (4, 0.8824873511328001)]
sim= [0. 0.8958379 0.0265201] float32
max_sim= 0.8958379 max_index= 1
As the output shows, the test sentence is most similar to the sentence at index 1 of the training corpus.
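The numbers in the output can be checked by hand. The sketch below assumes gensim's default weighting, tf * log2(N / df) followed by L2 normalization, and hard-codes the token lists from the output so that it does not need jieba. The helper names `tfidf_vector` and `cosine` are illustrative only and are not part of gensim.

```python
import math
from collections import Counter

# Tokenized training sentences and test sentence, copied from the output.
train_docs = [
    ['我爱你'],
    ['我', '喜欢', '他', '和', '他', '喜欢', '我'],
    ['他', '说', '今天', '空气', '很', '清新'],
]
test_doc = ['我', '爱', '你们', ',', '我', '喜欢', '他']

# Document frequency: number of training sentences that contain each word.
doc_freq = Counter(word for doc in train_docs for word in set(doc))
num_docs = len(train_docs)


def tfidf_vector(tokens):
    """Return {word: weight}, weight = tf * log2(N / df), L2-normalized."""
    # Words missing from the training vocabulary are dropped.
    counts = Counter(t for t in tokens if t in doc_freq)
    weights = {w: tf * math.log2(num_docs / doc_freq[w]) for w, tf in counts.items()}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    if norm == 0:
        return {}
    return {w: v / norm for w, v in weights.items() if v > 0}


def cosine(a, b):
    """Dot product of two unit-length sparse vectors."""
    return sum(v * b.get(w, 0.0) for w, v in a.items())


train_vecs = [tfidf_vector(doc) for doc in train_docs]
test_vec = tfidf_vector(test_doc)
sims = [cosine(test_vec, vec) for vec in train_vecs]
best = max(range(len(sims)), key=sims.__getitem__)
print(test_vec)
print(sims, best)
```

Under these assumptions the weights for '他', '喜欢' and '我' should agree with test_tfidf, and the best match should again be index 1, up to float32 rounding in gensim's similarity values.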
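How TF-IDF represents a sentence:
Suppose a sentence has n distinct words. Each word gets one TF-IDF value, i.e. each word is represented by a scalar, so the sentence has dimension 1*n.
With an embedding representation, each word is an m-dimensional vector, so the sentence has dimension m*n.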
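A small sketch of the two shapes, using the test_tfidf values from the output. The `to_dense` helper, the embedding size m = 4, and the all-zero embedding values are assumptions made only for illustration; no trained embeddings are involved. It also shows that comparing sentences requires expanding the n sparse entries to a fixed length equal to the dictionary size.

```python
# Sparse TF-IDF vector of the test sentence, copied from the output:
# (word id, weight) pairs for its 3 distinct in-vocabulary words.
test_tfidf = [
    (1, 0.16284991207632712),
    (3, 0.44124367556640004),
    (4, 0.8824873511328001),
]
vocab_size = 10  # the dictionary in the example holds 10 unique tokens


def to_dense(sparse_vec, size):
    """Expand (id, weight) pairs into a fixed-length list, zeros elsewhere."""
    dense = [0.0] * size
    for word_id, weight in sparse_vec:
        dense[word_id] = weight
    return dense


dense = to_dense(test_tfidf, vocab_size)
print(len(test_tfidf), len(dense))  # n sparse entries vs. vocabulary-sized vector

# Embedding view: one m-dimensional vector per word, stored here as n rows of m values.
# The zeros are placeholders, not real embeddings; m = 4 is arbitrary.
m = 4
tokens = ['我', '喜欢', '他']
embedded = [[0.0] * m for _ in tokens]
print(len(embedded), len(embedded[0]))
```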
Saving and loading the model:
Save the dictionary:
dictionary.save(DICT_PATH)
Save the TF-IDF model:
model.save(MODEL_PATH)
Save the similarity index:
similarity.save(SIMILARITY_PATH)
Load the dictionary:
dictionary = corpora.Dictionary.load('require_files/dictionary.dict')
Load the model:
tfidf = models.TfidfModel.load("require_files/my_model.tfidf")
Load the similarity index:
index = similarities.MatrixSimilarity.load('require_files/similarities.0')
————————————————
Reference: https://blog.csdn.net/qq_33908388/article/details/94554309