'''
Methods for computing text similarity
'''
# 1. Edit distance
'''
The edit distance between two strings is the minimum number of single-character
edits needed to turn one string into the other; the larger the distance, the
more different the two strings are. The permitted edit operations are replacing
one character with another, inserting a character, and deleting a character.
'''
import distance
print(distance.levenshtein('string', 'setting'))  # 2
# Step 1: insert the character 'e' between 's' and 't' ('string' -> 'setring')
# Step 2: replace 'r' with 't' ('setring' -> 'setting'), so the result is 2
'''
With edit distance, we can set a threshold to pick out similar texts.
'''
def edit_distance(s1, s2):
    return distance.levenshtein(s1, s2)
strings = [
    '你在干什么',   # "What are you doing?"
    '你在干啥子',   # "What are ya doing?" (dialect)
    '你在做什么',   # "What are you doing?" (different verb)
    '你好啊',       # "Hello"
    '我喜欢吃香蕉'  # "I like eating bananas"
]
target = '你在干啥'  # "What are you doing?"
results = list(filter(lambda x:edit_distance(x, target)<=2, strings))
print(results)
#['你在干什么', '你在干啥子']
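If the third-party distance package is not available, the same number can be computed with a short dynamic-programming routine (a minimal sketch I am adding here, not part of the original code):
# Minimal two-row DP implementation of Levenshtein distance (pure Python).
def levenshtein(a, b):
    prev = list(range(len(b) + 1))  # distances from '' to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to ''
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute ca -> cb
        prev = cur
    return prev[-1]
# print(levenshtein('string', 'setting'))  # 2, same as distance.levenshtein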
# Note: the methods below measure similarity differently, so their scores differ.
# 2. Jaccard coefficient
'''
The Jaccard coefficient measures the similarity and diversity of sample sets;
the larger the coefficient, the more similar the samples. It is simple to
compute: the size of the intersection of the two samples divided by the size
of their union. The result is 1 when the two samples are identical and 0 when
they share nothing.
'''
# (1) Set-based implementation
s1 = '你在干嘛呢'    # "What are you doing?"
s2 = '你在干什么呢'  # "What are you doing?" (slightly different phrasing)
def jacc_distance(s1, s2):
    # Despite the name, this returns a similarity, not a distance
    first = set(s1).intersection(set(s2))  # characters shared by both strings
    second = set(s1).union(set(s2))        # characters in either string
    return len(first) / len(second)
x = set(s1)
y = set(s2)
print(jacc_distance(x, y))  # 0.5714285714285714 (= 4/7)
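As a quick sanity check of that value (the counts follow directly from the two sentences): the sets share the four characters 你, 在, 干, 呢, and their union holds seven characters in total, so the coefficient is 4/7.
a, b = set(s1), set(s2)
# len(a & b) == 4, len(a | b) == 7
# print(len(a & b) / len(a | b))  # 0.5714285714285714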
# (2) CountVectorizer-based implementation
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
def jacc_similarity(s1, s2):
    def add_space(s):
        # Insert spaces between characters so each character is one token
        return ' '.join(list(s))
    s1, s2 = add_space(s1), add_space(s2)
    # Convert the two sentences to a term-frequency (TF) matrix
    cv = CountVectorizer(tokenizer=lambda s: s.split())
    corpus = [s1, s2]
    vectors = cv.fit_transform(corpus).toarray()
    # Intersection: column-wise minimum of the two count vectors
    numerator = np.sum(np.min(vectors, axis=0))
    # Union: column-wise maximum of the two count vectors
    denominator = np.sum(np.max(vectors, axis=0))
    # Jaccard coefficient
    return 1.0 * numerator / denominator
print(jacc_similarity(s1, s2))  # 0.5714285714285714
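To see why the column-wise min and max implement intersection and union, it helps to print the TF matrix itself (a quick sketch I am adding; the column order sklearn picks may vary between versions):
# cv = CountVectorizer(tokenizer=lambda s: s.split())
# print(cv.fit_transform([' '.join(s1), ' '.join(s2)]).toarray())
# Each column holds one character's count in each sentence: np.min over a
# column is nonzero only if the character appears in both sentences
# (intersection), while np.max counts it if it appears in either (union).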
# 3. TF cosine similarity
'''
The third method compares the two TF vectors directly by computing the cosine
of the angle between them, i.e. the dot product divided by the product of the
vectors' norms:
cos(theta) = (a . b) / (|a| * |b|)
'''
from scipy.linalg import norm
def tf_similarity(s1, s2):
    def add_space(s):
        # Insert spaces between characters so each character is one token
        return ' '.join(list(s))
    s1, s2 = add_space(s1), add_space(s2)
    # Convert the two sentences to a TF matrix
    cv = CountVectorizer(tokenizer=lambda s: s.split())
    corpus = [s1, s2]
    vectors = cv.fit_transform(corpus).toarray()
    # Cosine similarity of the two TF vectors
    return np.dot(vectors[0], vectors[1]) / (norm(vectors[0]) * norm(vectors[1]))
print(tf_similarity(s1, s2))  # 0.7302967433402214
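scikit-learn also ships a ready-made cosine helper, so the manual dot/norm computation can be cross-checked against it (a sketch I am adding, not part of the original post):
from sklearn.metrics.pairwise import cosine_similarity
def tf_similarity_sklearn(s1, s2):
    cv = CountVectorizer(tokenizer=lambda s: s.split())
    vectors = cv.fit_transform([' '.join(s1), ' '.join(s2)])
    # cosine_similarity returns a 2x2 matrix; entry [0][1] compares s1 with s2
    return cosine_similarity(vectors)[0][1]
# print(tf_similarity_sklearn(s1, s2))  # expected to match 0.7302967433402214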
There is one more method, based on word2vec. It needs a pre-trained word2vec model; the one used here is news_12g_baidubaike_20g_novel_90g_embedding_64.bin. Download it first, then put the file in the same directory as your script. The code is as follows:
# 4. word2vec method
import gensim
import jieba
model_file = 'news_12g_baidubaike_20g_novel_90g_embedding_64.bin'
model = gensim.models.KeyedVectors.load_word2vec_format(model_file, binary=True)
def vector_similarity(s1, s2):
    def sentence_vector(s):
        # Segment the sentence and average its 64-dimensional word vectors
        words = jieba.lcut(s)
        v = np.zeros(64)
        for word in words:
            v += model[word]
        v /= len(words)
        return v
    v1, v2 = sentence_vector(s1), sentence_vector(s2)
    # Cosine similarity of the two sentence vectors
    return np.dot(v1, v2) / (norm(v1) * norm(v2))
print(vector_similarity(s1, s2))
The output of all the code above is:

2
['你在干什么', '你在干啥子']
0.5714285714285714
0.5714285714285714
0.7302967433402214
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\swt\AppData\Local\Temp\jieba.cache
Loading model cost 0.581 seconds.
Prefix dict has been built succesfully.
0.9918127888048456

As you can see, the word2vec method scores this sentence pair the highest, but it is slow to run: loading the model alone takes a few minutes.
While using the word2vec method I also picked up some general word2vec usage along the way. The examples below use the GoogleNews-vectors-negative300.bin model; again, download it first and put it next to your script.
import gensim
import jieba
model_file = 'GoogleNews-vectors-negative300.bin'
model = gensim.models.KeyedVectors.load_word2vec_format(model_file, binary=True)
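Loading the full GoogleNews model is slow and memory-hungry. If that is a problem, load_word2vec_format accepts a limit argument that reads only the most frequent vectors (the 500000 cutoff below is an arbitrary value I chose for illustration):
# model = gensim.models.KeyedVectors.load_word2vec_format(
#     model_file, binary=True, limit=500000)  # top 500,000 words only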
# Inspect a word vector
#print(model['word'].shape)  # (300,)
# Similarity between two words
#y1 = model.similarity('woman', 'man')
#print(y1)  # 0.76640123
# The most related words for a given word
# y2 = model.most_similar('good', topn=20)  # 20 most similar words
# for item in y2:
#     print(item[0], item[1])
'''
great 0.7291510105133057
bad 0.7190051078796387
terrific 0.6889115571975708
decent 0.6837348341941833
nice 0.6836092472076416
excellent 0.644292950630188
fantastic 0.6407778859138489
better 0.6120728850364685
solid 0.5806034803390503
lousy 0.5764203071594238
wonderful 0.5726118087768555
terrible 0.5602041482925415
Good 0.5586155652999878
tough 0.558531641960144
best 0.5467195510864258
alright 0.5417848825454712
perfect 0.5401014089584351
strong 0.539231538772583
pretty_darn 0.5339667797088623
really 0.5286920070648193
'''
# Word analogies: find the corresponding relation
# print(' "boy" is to "father" as "girl" is to ...? ')
# y3 = model.most_similar(['girl', 'father'], ['boy'], topn=3)
# for item in y3:
#     print(item[0], item[1])
'''
"boy" is to "father" as "girl" is to ...?
mother 0.831214427947998
daughter 0.8000643253326416
husband 0.769158124923706
'''
# Find the word that doesn't belong
# y4 = model.doesnt_match('breakfast cereal dinner lunch'.split())
# print('Odd one out:', y4)
# Odd one out: cereal
When I swapped the two pretrained models and reran the code above, both runs failed with
KeyError: "word 'word' not in vocabulary". Does that mean the word is simply not in the corpus?
Presumably each model only knows the vocabulary of its own training data (the Baidu Baike model holds Chinese tokens, GoogleNews holds English ones), but I have not fully verified this; pointers from anyone who knows would be appreciated.
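Whatever the root cause, lookups can be kept from crashing by skipping out-of-vocabulary tokens before averaging (a defensive sketch I am adding; sentence_vector_safe is my own name, not from the original):
def sentence_vector_safe(s):
    # Keep only the words the currently loaded model actually contains
    words = [w for w in jieba.lcut(s) if w in model]
    if not words:
        return np.zeros(model.vector_size)  # no known words: return a zero vector
    return np.mean([model[w] for w in words], axis=0)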
Reference blogs:
https://blog.csdn.net/baidu_36535885/article/details/79592755
https://cuiqingcai.com/6101.html