Text Similarity Metrics: Lexical (Word-Based) Similarity Measures

Contents

    • Jaccard similarity
    • Cosine similarity
    • Dice coefficient
    • Overlap coefficient

Jaccard similarity

$J(A,B)$ measures how similar two finite sample sets are:

$$J(A,B)=\frac{|A\cap B|}{|A\cup B|}=\frac{|A\cap B|}{|A|+|B|-|A\cap B|}$$

Jaccard distance:

$$d_j(A,B)=1-J(A,B)=\frac{|A\cup B|-|A\cap B|}{|A\cup B|}=\frac{|A\,\Delta\,B|}{|A\cup B|}$$

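When $A=B$, the Jaccard similarity is 1; when $|A\cap B|=0$, it is 0.

The Jaccard similarity takes values in [0,1], and a larger value means the two texts are more similar. The code is as follows: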

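import jieba

def Jaccard(words1, words2):
    # Tokenize both texts with jieba and keep only the unique tokens.
    words1_cut, words2_cut = set(jieba.cut(words1)), set(jieba.cut(words2))
    # |A ∩ B|: number of tokens shared by both texts.
    interNum = len(words1_cut & words2_cut)
    # |A ∪ B| = |A| + |B| - |A ∩ B|
    unionNum = len(words1_cut) + len(words2_cut) - interNum
    # Note: raises ZeroDivisionError if neither text yields any tokens.
    return interNum / unionNum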
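
The Jaccard distance defined earlier can be computed in much the same way. The following is a minimal sketch that assumes the texts have already been tokenized (for example with jieba.cut); the function name jaccard_distance and the sample tokens are hypothetical, chosen only for illustration.

```python
def jaccard_distance(tokens1, tokens2):
    """Jaccard distance |A Δ B| / |A ∪ B| between two token collections."""
    a, b = set(tokens1), set(tokens2)
    union = a | b
    if not union:
        # Assumption: two empty token sets are treated as identical (distance 0).
        return 0.0
    # a ^ b is the symmetric difference A Δ B.
    return len(a ^ b) / len(union)


# Hypothetical, pre-tokenized sample sentences.
s1 = ["i", "like", "natural", "language", "processing"]
s2 = ["i", "like", "machine", "learning"]
print(jaccard_distance(s1, s2))
```

Here the two samples share 2 of 7 distinct tokens, so the distance is 5/7, which equals 1 minus the Jaccard similarity of 2/7.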

Cosine similarity

$$\cos(X,Y)=\frac{X\cdot Y}{|X|\,|Y|}$$
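
Only the formula is given for this measure. As a minimal sketch, assuming X and Y are word-count (bag-of-words) vectors built from pre-tokenized texts, the cosine similarity could be computed as follows. The function name cosine_similarity and the count-vector representation are assumptions made for illustration; other weightings, such as TF-IDF, are equally possible.

```python
import math
from collections import Counter


def cosine_similarity(tokens1, tokens2):
    """Cosine of the angle between the word-count vectors of two token lists."""
    x, y = Counter(tokens1), Counter(tokens2)
    # Dot product X·Y: only words present in both texts contribute.
    dot = sum(x[w] * y[w] for w in x.keys() & y.keys())
    # Euclidean norms |X| and |Y|.
    norm_x = math.sqrt(sum(c * c for c in x.values()))
    norm_y = math.sqrt(sum(c * c for c in y.values()))
    if norm_x == 0 or norm_y == 0:
        # Assumption: an empty text has similarity 0 with everything.
        return 0.0
    return dot / (norm_x * norm_y)


print(cosine_similarity(["a", "b", "b"], ["b", "c"]))
```

Unlike the set-based measures in this article, this version takes word frequency into account, because repeated words increase the corresponding vector component.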

Dice coefficient

$$s=\frac{2|A\cap B|}{|A|+|B|}$$

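import jieba

def Dice(words1, words2):
    # Tokenize both texts with jieba and keep only the unique tokens.
    words1_cut, words2_cut = set(jieba.cut(words1)), set(jieba.cut(words2))
    # |A ∩ B|: number of tokens shared by both texts.
    interNum = len(words1_cut & words2_cut)
    # Note: raises ZeroDivisionError if neither text yields any tokens.
    return 2 * interNum / (len(words1_cut) + len(words2_cut))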

Overlap coefficient

$$\mathrm{overlap}(X,Y)=\frac{|X\cap Y|}{\min(|X|,|Y|)}$$

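import jieba

def overlap(words1, words2):
    # Tokenize both texts with jieba and keep only the unique tokens.
    words1_cut, words2_cut = set(jieba.cut(words1)), set(jieba.cut(words2))
    # |X ∩ Y|: number of tokens shared by both texts.
    interNum = len(words1_cut & words2_cut)
    # |X ∩ Y| / min(|X|, |Y|)
    # Note: raises ZeroDivisionError if either text yields no tokens.
    return interNum / min(len(words1_cut), len(words2_cut))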
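
To see how the three set-based measures differ, the sketch below evaluates them on the same pair of token sets. It assumes pre-tokenized, non-empty input; the helper name set_similarities and the sample tokens are hypothetical.

```python
def set_similarities(tokens1, tokens2):
    """Return (Jaccard, Dice, overlap) for two non-empty token collections."""
    a, b = set(tokens1), set(tokens2)
    inter = len(a & b)
    jaccard = inter / len(a | b)
    dice = 2 * inter / (len(a) + len(b))
    overlap = inter / min(len(a), len(b))
    return jaccard, dice, overlap


# The second token set is a subset of the first.
j, d, o = set_similarities(["text", "similarity", "metric", "nlp"],
                           ["text", "similarity"])
print(j, d, o)  # 0.5 0.6666666666666666 1.0
```

Because the second set is contained in the first, the overlap coefficient reaches 1 while Jaccard and Dice stay below 1. That is the main practical difference between the measures.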
