The Jaccard coefficient $J(A,B)$ measures how similar two finite sample sets are:
$$J(A,B)=\frac{|A \cap B|}{|A \cup B|}=\frac{|A \cap B|}{|A|+|B|-|A \cap B|}$$
Jaccard distance, where $A \,\Delta\, B$ is the symmetric difference of the two sets:
$$d_j(A,B)=1-J(A,B)=\frac{|A \cup B|-|A \cap B|}{|A \cup B|}=\frac{|A \,\Delta\, B|}{|A \cup B|}$$
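When $A=B$, the Jaccard similarity is 1; when $|A \cap B|=0$, it is 0.
The Jaccard similarity takes values in $[0,1]$, and a larger value means the two sets are more similar. The code is as follows: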
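import jieba

def Jaccard(words1, words2):
    # Segment each string with jieba and keep the distinct tokens.
    words1_cut, words2_cut = set(jieba.cut(words1)), set(jieba.cut(words2))
    # Number of tokens the two sets share, |A ∩ B|.
    interNum = len(words1_cut & words2_cut)
    # |A ∪ B| = |A| + |B| - |A ∩ B|
    unionNum = len(words1_cut) + len(words2_cut) - interNum
    # Two empty inputs give an empty union; 0.0 is a convention chosen here.
    return interNum / unionNum if unionNum else 0.0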
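To check the two Jaccard formulas by hand, here is a minimal sketch that works on already-tokenized sets, so it runs without jieba. The helper names `jaccard_similarity` and `jaccard_distance` and the sample sets are illustrative assumptions, not part of the original code, and both helpers assume at least one of the sets is non-empty.

```python
def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B| for two token sets (union must be non-empty)."""
    return len(a & b) / len(a | b)


def jaccard_distance(a, b):
    """|A Δ B| / |A ∪ B|, using the symmetric-difference form."""
    return len(a ^ b) / len(a | b)


# Hand-tokenized example sets (assumed data, not produced by jieba).
A = {"a", "b", "c"}
B = {"b", "c", "d"}

sim = jaccard_similarity(A, B)   # |{b, c}| / |{a, b, c, d}| = 2 / 4
dist = jaccard_distance(A, B)    # |{a, d}| / |{a, b, c, d}| = 2 / 4
print(sim, dist)
```

Here the similarity and the distance both come out to 0.5 and sum to 1, as the identity $d_j(A,B)=1-J(A,B)$ requires.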
Cosine similarity:
$$\cos(X,Y)=\frac{X \cdot Y}{|X|\,|Y|}$$
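The text gives no code for the cosine formula, so the following is only a sketch under one assumption: that $X$ and $Y$ are term-frequency vectors built from token lists. The function name `cosine_similarity` and the sample tokens are hypothetical, and returning 0.0 for an empty input is a convention chosen for this example.

```python
import math
from collections import Counter


def cosine_similarity(tokens_x, tokens_y):
    """Cosine of the angle between two term-frequency vectors."""
    x, y = Counter(tokens_x), Counter(tokens_y)
    # Dot product: only tokens present in both vectors contribute.
    dot = sum(x[t] * y[t] for t in x.keys() & y.keys())
    # Euclidean norms of the two count vectors.
    norm_x = math.sqrt(sum(c * c for c in x.values()))
    norm_y = math.sqrt(sum(c * c for c in y.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention chosen here for empty input
    return dot / (norm_x * norm_y)


# X counts: a=1, b=2; Y counts: b=1, c=1; so cos = 2 / (sqrt(5) * sqrt(2)).
score = cosine_similarity(["a", "b", "b"], ["b", "c"])
print(round(score, 3))
```

Unlike the set-based measures, this version is sensitive to how often a token repeats, because the vectors hold counts rather than membership.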
Dice coefficient:
$$s=\frac{2\,|A \cap B|}{|A|+|B|}$$
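import jieba

def Dice(words1, words2):
    # Segment each string with jieba and keep the distinct tokens.
    words1_cut, words2_cut = set(jieba.cut(words1)), set(jieba.cut(words2))
    # Number of tokens the two sets share, |A ∩ B|.
    interNum = len(words1_cut & words2_cut)
    # |A| + |B|
    totalNum = len(words1_cut) + len(words2_cut)
    # Two empty inputs give a zero denominator; 0.0 is a convention chosen here.
    return 2 * interNum / totalNum if totalNum else 0.0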
Overlap coefficient:
$$\mathrm{overlap}(X,Y)=\frac{|X \cap Y|}{\min(|X|,|Y|)}$$
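import jieba

def overlap(words1, words2):
    # Segment each string with jieba and keep the distinct tokens.
    words1_cut, words2_cut = set(jieba.cut(words1)), set(jieba.cut(words2))
    # Number of tokens the two sets share, |X ∩ Y|.
    interNum = len(words1_cut & words2_cut)
    # Size of the smaller set, min(|X|, |Y|).
    minNum = min(len(words1_cut), len(words2_cut))
    # |X ∩ Y| / min(|X|, |Y|); 0.0 if either input is empty (a convention chosen here).
    return interNum / minNum if minNum else 0.0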
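To see how the three set-based measures relate, the sketch below evaluates the Jaccard, Dice, and overlap formulas on the same pair of sets. The helper `coefficients` and the sample sets are illustrative assumptions. It works on already-tokenized, non-empty sets, so jieba is not needed.

```python
def coefficients(a, b):
    """Return (Jaccard, Dice, overlap) for two non-empty token sets."""
    inter = len(a & b)
    jaccard = inter / len(a | b)
    dice = 2 * inter / (len(a) + len(b))
    overlap = inter / min(len(a), len(b))
    return jaccard, dice, overlap


# Hand-tokenized example sets (assumed data).
A = {"a", "b", "c"}
B = {"b", "c", "d", "e"}

# Intersection is {b, c}: Jaccard = 2/5, Dice = 4/7, overlap = 2/3.
j, d, o = coefficients(A, B)
print(j, d, o)
```

For any two non-empty sets the values are ordered Jaccard ≤ Dice ≤ overlap, and Dice can be recovered from Jaccard as $s = 2J/(1+J)$, so the three measures rank pairs similarly but penalize size differences to different degrees.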