[Practice] Methods for Computing the Similarity of Two Sentences

Contents

Introduction

Method 1: spaCy

Method 2: Sentence Transformers

Method 3: scipy

Method 4: torch

Method 5: TFHub Universal Sentence Encoder

References


Introduction

Most of the libraries below are good choices for semantic similarity comparison. You can use their pretrained models to generate word or sentence vectors and compare those vectors instead of comparing the words directly.
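
All of the methods below ultimately score similarity as the cosine of the angle between two vectors. As a reference point, here is a minimal sketch of that computation in plain NumPy; the toy vectors are made up purely for illustration:

import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (||u|| * ||v||)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# toy vectors, purely for illustration
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
c = np.array([-1.0, 0.0, 1.0])

print(cosine_similarity(a, b))  # 1.0: same direction
print(cosine_similarity(a, c))  # ~0.38: partially aligned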

Method 1: spaCy

Reference

Linguistic Features · spaCy Usage Documentation

Download a model

To use en_core_web_md, download it with python -m spacy download en_core_web_md; to use en_core_web_lg, run python -m spacy download en_core_web_lg. The large model is around 830 MB and slow, so the medium model is a good choice.

python -m spacy download en_core_web_lg

Code

import spacy

# load the large English model downloaded above
nlp = spacy.load("en_core_web_lg")

doc1 = nlp(u'the person wear red T-shirt')
doc2 = nlp(u'this person is walking')
doc3 = nlp(u'the boy wear red T-shirt')

# Doc.similarity is the cosine similarity of the averaged word vectors
print(doc1.similarity(doc2))
print(doc1.similarity(doc3))
print(doc2.similarity(doc3))

Results

0.7003971105290047
0.9671912343259517
0.6121211244876517
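
As noted above, the medium model is often a reasonable trade-off. A minimal sketch of the same kind of comparison with en_core_web_md, assuming it has been downloaded; the scores will differ somewhat from the en_core_web_lg numbers above:

import spacy

# assumes: python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

doc1 = nlp(u'the person wear red T-shirt')
doc3 = nlp(u'the boy wear red T-shirt')

# same API as with the large model; only the vectors change
print(doc1.similarity(doc3))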

Method 2: Sentence Transformers

GitHub - UKPLab/sentence-transformers: Multilingual Sentence & Image Embeddings with BERT

Semantic Textual Similarity — Sentence-Transformers documentation

Code

First install the sentence-transformers package (the pretrained model itself is downloaded automatically the first time it is loaded):

pip install -U sentence-transformers

from sentence_transformers import SentenceTransformer

# DistilBERT fine-tuned on NLI data, with mean pooling over token embeddings
model = SentenceTransformer('distilbert-base-nli-mean-tokens')

sentences = [
    'the person wear red T-shirt',
    'this person is walking',
    'the boy wear red T-shirt'
    ]
sentence_embeddings = model.encode(sentences)

for sentence, embedding in zip(sentences, sentence_embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

Output

Sentence: the person wear red T-shirt
Embedding: [ 1.31643847e-01 -4.20616418e-01 ... 8.13076794e-01 -4.64620918e-01]

Sentence: this person is walking
Embedding: [-3.52878094e-01 -5.04286848e-02 ... -2.36091137e-01 -6.77282438e-02]

Sentence: the boy wear red T-shirt
Embedding: [-2.36365378e-01 -8.49713564e-01 ... 1.06414437e+00 -2.70157874e-01]
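
The sentence-transformers library also ships its own helper for scoring embeddings, so the similarity step from the next sections can be done without leaving the library. A minimal sketch using util.cos_sim (named util.pytorch_cos_sim in older releases), reusing the sentences above:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('distilbert-base-nli-mean-tokens')
sentences = [
    'the person wear red T-shirt',
    'this person is walking',
    'the boy wear red T-shirt'
    ]

# encode directly to a torch tensor and score all pairs at once
embeddings = model.encode(sentences, convert_to_tensor=True)
scores = util.cos_sim(embeddings, embeddings)  # 3x3 similarity matrix
print(scores)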

Method 3: scipy

Code

from scipy.spatial import distance

# scipy's cosine() returns the cosine *distance*, so similarity = 1 - distance
# (this reuses sentence_embeddings computed in Method 2)
print(1 - distance.cosine(sentence_embeddings[0], sentence_embeddings[1]))
print(1 - distance.cosine(sentence_embeddings[0], sentence_embeddings[2]))
print(1 - distance.cosine(sentence_embeddings[1], sentence_embeddings[2]))

Output

0.4643629193305969
0.9069876074790955
0.3275738060474396
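
With scipy the whole pairwise matrix can also be computed in one call via cdist; a minimal sketch, again assuming sentence_embeddings from Method 2:

from scipy.spatial import distance

# cdist with metric='cosine' gives a 3x3 matrix of cosine distances;
# subtracting from 1 turns them into similarities
sim_matrix = 1 - distance.cdist(sentence_embeddings, sentence_embeddings, metric='cosine')
print(sim_matrix)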

Method 4: torch

Code

import torch

# cosine similarity along dim=0, i.e. over the embedding dimension of a single vector
cos = torch.nn.CosineSimilarity(dim=0, eps=1e-6)

# reuse the numpy embeddings from Method 2 as torch tensors
b = torch.from_numpy(sentence_embeddings)
print(cos(b[0], b[1]))
print(cos(b[0], b[2]))
print(cos(b[1], b[2]))

Output

tensor(0.4644)
tensor(0.9070)
tensor(0.3276)
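
torch.nn.functional.cosine_similarity can score one sentence against a whole batch in a single call; a minimal sketch, assuming the tensor b from above:

import torch
import torch.nn.functional as F

b = torch.from_numpy(sentence_embeddings)

# compare sentence 0 against all three sentences at once (dim=1: per row)
query = b[0].unsqueeze(0).expand_as(b)
print(F.cosine_similarity(query, b, dim=1))  # tensor of 3 scores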

Method 5: TFHub Universal Sentence Encoder

https://tfhub.dev/google/universal-sentence-encoder/4

https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/semantic_similarity_with_tf_hub_universal_encoder.ipynb

This model is quite large (roughly 1 GB) and seems slower than the others. It also produces sentence embeddings.

Code

import tensorflow_hub as hub

# downloads and caches the Universal Sentence Encoder (~1 GB) on first use
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
embeddings = embed([
    "the person wear red T-shirt",
    "this person is walking",
    "the boy wear red T-shirt"
    ])

print(embeddings)

from scipy.spatial import distance
print(1 - distance.cosine(embeddings[0], embeddings[1]))
print(1 - distance.cosine(embeddings[0], embeddings[2]))
print(1 - distance.cosine(embeddings[1], embeddings[2]))

Output

tf.Tensor(
[[ 0.063188    0.07063895 -0.05998802 ... -0.01409875  0.01863449
   0.01505797]
 [-0.06786212  0.01993554  0.03236153 ...  0.05772103  0.01787272
   0.01740014]
 [ 0.05379306  0.07613157 -0.05256693 ... -0.01256405  0.0213196
  -0.00262441]], shape=(3, 512), dtype=float32)

0.15320375561714172
0.8592830896377563
0.09080004692077637
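
The Universal Sentence Encoder's vectors are approximately unit-length, so a plain inner product is roughly the cosine similarity; the official Colab notebook linked above scores sentences this way. A minimal sketch, assuming the embeddings tensor from the code above:

import numpy as np

# np.inner converts the TF tensor to numpy and returns the 3x3 similarity matrix
sim_matrix = np.inner(embeddings, embeddings)
print(sim_matrix)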

Other embeddings

https://github.com/facebookresearch/InferSent

GitHub - Tiiiger/bert_score: BERT score for text generation

References

How to compute the similarity between two text documents?

https://en.wikipedia.org/wiki/Cosine_similarity#Angular_distance_and_similarity

https://towardsdatascience.com/word-distance-between-word-embeddings-cc3e9cf1d632

scipy.spatial.distance.cosine — SciPy v0.14.0 Reference Guide

https://www.tensorflow.org/api_docs/python/tf/keras/losses/CosineSimilarity

deep learning - is there a way to check similarity between two full sentences in python? - Stack Overflow

NLP Town
