Calculating Document Similarity Using BERT and Other Models

Getting Started

Introduction

Document similarity is one of the most crucial problems in NLP. Finding similarity across documents is used in several domains, such as recommending similar books and articles, identifying plagiarised documents, matching legal documents, etc.

We can call two documents similar if they are semantically similar and define the same concept or if they are duplicates.

To make machines figure out the similarity between documents, we need to define a way to measure the similarity mathematically, and it should be comparable so that a machine can tell us which documents are most similar and which are least similar. We also need to represent the text of the documents in a quantifiable form (a mathematical object, usually a vector), so that we can perform similarity calculations on top of it.

So, converting a document into a mathematical object and defining a similarity measure are the two main steps required to make machines perform this exercise. We will look into different ways of doing this.

Similarity Function

Some of the most common and effective ways of calculating similarity are:

Cosine Distance/Similarity - This is the cosine of the angle between two vectors, which gives us the angular distance between them. The formula for the cosine similarity between two vectors A and B is:

[Figure: cosine similarity formula, cos(θ) = (A · B) / (‖A‖ ‖B‖)]

In two-dimensional space it looks like this:

[Figure: angle between two vectors A and B in 2-dimensional space (Image by author)]

You can easily work out the math and prove this formula using the law of cosines.

Cosine is 1 at theta = 0 and -1 at theta = 180, which means cosine is highest for two overlapping vectors and lowest for two exactly opposite vectors. For this reason it is called a similarity. You can use 1 - cosine as a distance.

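To make the formula concrete, here is a minimal NumPy sketch of cosine similarity and the corresponding distance; the vectors a and b are made-up toy examples, not from the corpus used later.

import numpy as np

# toy vectors; any two equal-length vectors will do
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

# cosine similarity = dot(A, B) / (||A|| * ||B||)
cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine_sim)        # 1.0, because b is just a scaled copy of a
print(1 - cosine_sim)    # cosine distance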

Euclidean Distance - This is the Minkowski distance with p = 2. It is defined as follows:

[Figure: Euclidean distance formula, d(A, B) = √( Σᵢ (Aᵢ − Bᵢ)² )]

In two-dimensional space, Euclidean distance will look like this,

[Figure: Euclidean distance between two vectors A and B in 2-dimensional space (Image by author)]
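
As a quick check, a minimal sketch of the same calculation from the definition and via sklearn (the toy vectors are made up):

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

# straight from the definition
print(np.sqrt(np.sum((a - b) ** 2)))

# same value via sklearn, which we also use later for document vectors
print(euclidean_distances(a.reshape(1, -1), b.reshape(1, -1))[0][0])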

Jaccard Distance - The Jaccard index is used to calculate the similarity between two finite sets. The Jaccard distance can be taken as 1 - Jaccard index.

[Figure: Jaccard index formula, J(A, B) = |A ∩ B| / |A ∪ B|]

We can use cosine or Euclidean distance if we can represent documents in a vector space. Jaccard distance can be used if we treat our documents simply as sets of words, without any semantic meaning.

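For illustration, a small sketch of the Jaccard index on two documents treated as plain sets of words (the sentences are made up):

# Jaccard index on bag-of-words sets: |A ∩ B| / |A ∪ B|
def jaccard_similarity(doc_a, doc_b):
    set_a, set_b = set(doc_a.lower().split()), set(doc_b.lower().split())
    return len(set_a & set_b) / len(set_a | set_b)

doc_a = "machine learning builds models from data"
doc_b = "machine learning learns models from training data"
print(jaccard_similarity(doc_a, doc_b))        # Jaccard index
print(1 - jaccard_similarity(doc_a, doc_b))    # Jaccard distance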

Cosine and Euclidean distance are the most widely used measures and we will use these two in our examples below.

Embeddings

Embeddings are vector representations of text in which words or sentences with similar meaning or context have similar representations.

[Figure: vector representation of words in 3-D (Image by author)]

Following are some of the algorithms for calculating document embeddings, with examples:

Tf-idf - Tf-idf is a combination of term frequency and inverse document frequency. It assigns a weight to every word in the document, calculated from the frequency of that word in the document and the frequency of documents containing that word in the entire corpus. For more details on tf-idf please refer to this story.

Let’s define the following as the corpus (collection) of documents for which we want to calculate similarities:

[Figure: document corpus (Image by author)]

We will perform basic text cleaning: removing special characters, removing stop words, and converting everything to lower case. Then we will convert the documents to their tf-idf vectors and calculate pairwise similarities using cosine similarity and Euclidean distance.

Pairwise cosine similarity is just the dot product of the tf-idf vectors, because tf-idf vectors from sklearn are already normalised and their L2 norm is 1. So the denominator of the cosine similarity formula is 1 in this case.

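As a quick sanity check (a sketch on a tiny made-up corpus), you can verify that TfidfVectorizer L2-normalises each row by default, so the plain dot product matches cosine_similarity:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

toy_corpus = ["machine learning with data", "software engineering with code"]
toy_vectors = TfidfVectorizer().fit_transform(toy_corpus).toarray()

print(np.linalg.norm(toy_vectors, axis=1))   # each row has L2 norm 1
print(np.dot(toy_vectors, toy_vectors.T))    # plain dot products
print(cosine_similarity(toy_vectors))        # identical to the dot products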

import pandas as pd
import numpy as np
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')
import re
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import euclidean_distances


# Sample corpus
documents = ['Machine learning is the study of computer algorithms that improve automatically through experience.\
Machine learning algorithms build a mathematical model based on sample data, known as training data.\
The discipline of machine learning employs various approaches to teach computers to accomplish tasks \
where no fully satisfactory algorithm is available.',
'Machine learning is closely related to computational statistics, which focuses on making predictions using computers.\
The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning.',
'Machine learning involves computers discovering how they can perform tasks without being explicitly programmed to do so. \
It involves computers learning from data provided so that they carry out certain tasks.',
'Machine learning approaches are traditionally divided into three broad categories, depending on the nature of the "signal"\
or "feedback" available to the learning system: Supervised, Unsupervised and Reinforcement',
'Software engineering is the systematic application of engineering approaches to the development of software.\
Software engineering is a computing discipline.',
'A software engineer creates programs based on logic for the computer to execute. A software engineer has to be more concerned\
about the correctness of the program in all the cases. Meanwhile, a data scientist is comfortable with uncertainty and variability.\
Developing a machine learning application is more iterative and explorative process than software engineering.'
]


documents_df=pd.DataFrame(documents,columns=['documents'])


# removing special characters and stop words from the text
stop_words_l=stopwords.words('english')
documents_df['documents_cleaned']=documents_df.documents.apply(lambda x: " ".join(re.sub(r'[^a-zA-Z]',' ',w).lower() for w in x.split() if re.sub(r'[^a-zA-Z]',' ',w).lower() not in stop_words_l) )


tfidfvectoriser=TfidfVectorizer()
tfidfvectoriser.fit(documents_df.documents_cleaned)
tfidf_vectors=tfidfvectoriser.transform(documents_df.documents_cleaned)


# the tf-idf rows are L2-normalised, so the dot product directly gives the cosine similarity
pairwise_similarities=tfidf_vectors.dot(tfidf_vectors.T).toarray()
pairwise_differences=euclidean_distances(tfidf_vectors)


def most_similar(doc_id,similarity_matrix,matrix):
    # prints the documents most similar to documents_df.iloc[doc_id], ranked by the given measure
    print (f'Document: {documents_df.iloc[doc_id]["documents"]}')
    print ('\n')
    print ('Similar Documents:')
    if matrix=='Cosine Similarity':
        similar_ix=np.argsort(similarity_matrix[doc_id])[::-1]   # higher similarity first
    elif matrix=='Euclidean Distance':
        similar_ix=np.argsort(similarity_matrix[doc_id])         # lower distance first
    for ix in similar_ix:
        if ix==doc_id:
            continue
        print('\n')
        print (f'Document: {documents_df.iloc[ix]["documents"]}')
        print (f'{matrix} : {similarity_matrix[doc_id][ix]}')


# inspect the tf-idf representation and the similarity matrix
print (tfidf_vectors[0].toarray())
print (pairwise_similarities.shape)
print (pairwise_similarities[0][:])

[Figure: tf-idf vector of the first document and its pairwise similarities (Image by author)]

# documents similar to the first document in the corpus
most_similar(0,pairwise_similarities,'Cosine Similarity')
most_similar(0,pairwise_differences,'Euclidean Distance')

[Figure: documents similar to the first document based on cosine similarity and Euclidean distance (Image by author)]

Word2vec - As the name suggests, word2vec embeds words into a vector space. Word2vec takes a text corpus as input and produces word embeddings as output. There are two main learning algorithms in word2vec: continuous bag-of-words and continuous skip-gram.

[Figure: word2vec model architectures (source: https://arxiv.org/pdf/1301.3781.pdf)]

We can train our own embeddings if we have enough data and computation available, or we can use pre-trained embeddings. Here we will use pre-trained embeddings provided by Google (a short sketch for training your own follows below).

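If you do want to train your own embeddings instead, a minimal gensim sketch could look like the following. It trains on our tiny corpus purely for illustration (the resulting vectors will be poor), and the parameter names assume a recent gensim release (vector_size; older releases call it size):

from gensim.models import Word2Vec

# tokenised corpus: a list of token lists, reusing documents_df from the tf-idf section
tokenized_corpus = [doc.split() for doc in documents_df.documents_cleaned]

# sg=0 -> continuous bag-of-words, sg=1 -> skip-gram
own_w2v = Word2Vec(sentences=tokenized_corpus, vector_size=100, window=5, min_count=1, sg=1)
print(own_w2v.wv['machine'].shape)   # (100,)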

We will start with tokenising and padding each document to make them all of the same size.

# tokenize and pad every document to make them of the same size
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tokenizer=Tokenizer()
tokenizer.fit_on_texts(documents_df.documents_cleaned)
tokenized_documents=tokenizer.texts_to_sequences(documents_df.documents_cleaned)
tokenized_paded_documents=pad_sequences(tokenized_documents,maxlen=64,padding='post')
vocab_size=len(tokenizer.word_index)+1

print (tokenized_paded_documents[0])

[Figure: tokenised document (Image by author)]

Let’s load the pre-trained embeddings. Each word is represented as a 300-dimensional vector.

# loading pre-trained embeddings, each word is represented as a 300-dimensional vector
import gensim

W2V_PATH="GoogleNews-vectors-negative300.bin.gz"
model_w2v = gensim.models.KeyedVectors.load_word2vec_format(W2V_PATH, binary=True)

Using this embedding we can convert every word of our document corpus into a 300-dimensional vector. As we have 6 documents and have padded each document to a maximum length of 64, the vector representation of the corpus has shape 6x64x300.

# creating embedding matrix, every row is a vector representation from the vocabulary indexed by the tokenizer index
embedding_matrix=np.zeros((vocab_size,300))
for word,i in tokenizer.word_index.items():
    if word in model_w2v:
        embedding_matrix[i]=model_w2v[word]

# creating document-word embeddings
document_word_embeddings=np.zeros((len(tokenized_paded_documents),64,300))

for i in range(len(tokenized_paded_documents)):
    for j in range(len(tokenized_paded_documents[0])):
        document_word_embeddings[i][j]=embedding_matrix[tokenized_paded_documents[i][j]]

print (document_word_embeddings.shape)

[Figure: document-word embedding shape (Image by author)]

Now we have to represent every document as a single vector. We can either average or sum over the word vectors, converting each 64x300 representation into a single 300-dimensional one (a simple sketch of this follows below). But averaging or summing over all the words loses the semantic and contextual meaning of the documents, and differing document lengths also have an adverse effect on such operations.

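For reference, the naive unweighted version mentioned above is a one-liner sketch over the document-word embeddings we just built (padding positions are all-zero rows, which is part of why this is only a rough approximation):

# naive document vectors: unweighted mean over the 64 word positions
naive_document_embeddings = document_word_embeddings.mean(axis=1)
print(naive_document_embeddings.shape)   # (6, 300)

naive_similarities = cosine_similarity(naive_document_embeddings)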

A better way of doing this is to take a weighted average of the word vectors using the tf-idf weights. This handles the variable-length problem to a certain extent, but still cannot keep the semantic and contextual meaning of words. After doing that, we can use pairwise distances to find similar documents, just as we did with the tf-idf model.

# calculating average of word vectors of a document weighted by tf-idf
document_embeddings=np.zeros((len(tokenized_paded_documents),300))
words=tfidfvectoriser.get_feature_names()

for i in range(len(document_word_embeddings)):
    for j in range(len(words)):
        document_embeddings[i]+=embedding_matrix[tokenizer.word_index[words[j]]]*tfidf_vectors[i,j]

print (document_embeddings.shape)

pairwise_similarities=cosine_similarity(document_embeddings)
pairwise_differences=euclidean_distances(document_embeddings)

most_similar(0,pairwise_similarities,'Cosine Similarity')
most_similar(0,pairwise_differences,'Euclidean Distance')
[Figure: documents similar to the first document based on cosine similarity and Euclidean distance (Image by author)]

GloVe - Global Vectors for Word Representation (GloVe) is an unsupervised learning algorithm for producing vector representations of words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

We will use the pre-trained GloVe embeddings from Stanford. All the steps remain the same as for the word2vec embeddings; the only difference is that we use the GloVe pre-trained model. We are using 100-dimensional GloVe embeddings because of the large size of the embedding files; you can use higher dimensions as well.

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences


tokenizer=Tokenizer()
tokenizer.fit_on_texts(documents_df.documents_cleaned)
tokenized_documents=tokenizer.texts_to_sequences(documents_df.documents_cleaned)
tokenized_paded_documents=pad_sequences(tokenized_documents,maxlen=64,padding='post')
vocab_size=len(tokenizer.word_index)+1


# reading Glove word embeddings into a dictionary with "word" as key and values as word vectors
embeddings_index = dict()


with open('glove.6B.100d.txt') as file:
    for line in file:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
    
# creating embedding matrix, every row is a vector representation from the vocabulary indexed by the tokenizer index. 
embedding_matrix=np.zeros((vocab_size,100))


for word,i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
        
# calculating average of word vectors of a document weighted by tf-idf
document_embeddings=np.zeros((len(tokenized_paded_documents),100))
words=tfidfvectoriser.get_feature_names()


# instead of creating document-word embeddings, directly creating document embeddings
for i in range(documents_df.shape[0]):
    for j in range(len(words)):
        document_embeddings[i]+=embedding_matrix[tokenizer.word_index[words[j]]]*tfidf_vectors[i,j]


pairwise_similarities=cosine_similarity(document_embeddings)
pairwise_differences=euclidean_distances(document_embeddings)

most_similar(0,pairwise_similarities,'Cosine Similarity')
most_similar(0,pairwise_differences,'Euclidean Distance')
[Figure: documents similar to the first document based on cosine similarity and Euclidean distance (Image by author)]

Doc2Vec - Doc2vec is an unsupervised learning algorithm for producing vector representations of sentences/paragraphs/documents. It is an adaptation of word2vec. Doc2vec can represent an entire document as a single vector, so we don't have to average word vectors to create the document vector.

[Figure: Doc2vec model architecture (source: https://arxiv.org/pdf/1405.4053.pdf)]

We will use gensim to train a Doc2vec model on our corpus and create vector representations of documents.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')


tagged_data = [TaggedDocument(words=word_tokenize(doc), tags=[i]) for i, doc in enumerate(documents_df.documents_cleaned)]
model_d2v = Doc2Vec(vector_size=100,alpha=0.025, min_count=1)
  
model_d2v.build_vocab(tagged_data)


# a single call to train() with the desired number of epochs is enough
model_d2v.train(tagged_data,
                total_examples=model_d2v.corpus_count,
                epochs=100)
    
document_embeddings=np.zeros((documents_df.shape[0],100))


for i in range(len(document_embeddings)):
    document_embeddings[i]=model_d2v.docvecs[i]
    
    
pairwise_similarities=cosine_similarity(document_embeddings)
pairwise_differences=euclidean_distances(document_embeddings)


most_similar(0,pairwise_similarities,'Cosine Similarity')
most_similar(0,pairwise_differences,'Euclidean Distance')
[Figure: documents similar to the first document based on cosine similarity and Euclidean distance (Image by author)]
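
A useful follow-up: a trained Doc2vec model can also embed unseen text via infer_vector. A small sketch, with a made-up query sentence:

# embedding a document that was not part of training
new_doc = word_tokenize("machine learning models are trained on data")
new_vector = model_d2v.infer_vector(new_doc)

# compare the new document against the corpus document vectors
print(cosine_similarity(new_vector.reshape(1, -1), document_embeddings))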

BERT - Bidirectional Encoder Representations from Transformers (BERT) is a state-of-the-art technique for natural language processing pre-training developed by Google. BERT is trained on unlabelled text, including Wikipedia and the BookCorpus. BERT uses the transformer architecture, an attention-based model, to learn embeddings for words.

BERT consists of two pre-training tasks: Masked Language Modelling (MLM) and Next Sentence Prediction (NSP). In BERT, training text is represented using the sum of three embeddings: token embeddings + segment embeddings + position embeddings (the token and segment inputs are illustrated in the sketch after the figures below).

[Figures from the BERT paper (source: https://arxiv.org/pdf/1810.04805.pdf)]
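
To see the token and segment inputs concretely, here is a small sketch using the Hugging Face transformers tokenizer (this assumes the transformers library and the bert-base-uncased checkpoint are available; position embeddings are added internally by the model from the token positions):

from transformers import BertTokenizer

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# encode a sentence pair; [CLS] and [SEP] tokens are added automatically
encoded = bert_tokenizer('machine learning uses data', 'software engineering uses code')
print(encoded['input_ids'])        # token ids -> token embeddings
print(encoded['token_type_ids'])   # 0 for the first sentence, 1 for the second -> segment embeddings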

We will use a pre-trained BERT model from Hugging Face (via the sentence-transformers library) to embed our corpus. We are loading a BERT-base model, which has 12 layers (transformer blocks), 12 attention heads, 110 million parameters, and a hidden size of 768.

from sentence_transformers import SentenceTransformer
sbert_model = SentenceTransformer('bert-base-nli-mean-tokens')


document_embeddings = sbert_model.encode(documents_df['documents_cleaned'])


pairwise_similarities=cosine_similarity(document_embeddings)
pairwise_differences=euclidean_distances(document_embeddings)


most_similar(0,pairwise_similarities,'Cosine Similarity')
most_similar(0,pairwise_differences,'Euclidean Distance')
[Figure: documents similar to the first document based on cosine similarity and Euclidean distance (Image by author)]
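
As a usage example, the same model can embed a new query and rank the corpus documents against it; a sketch with a made-up query:

# embed a new query and find the most similar document in the corpus
query = "how do machines learn from training data"
query_embedding = sbert_model.encode([query])

query_similarities = cosine_similarity(query_embedding, document_embeddings)[0]
print(documents_df.iloc[np.argmax(query_similarities)]['documents'])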

So you have seen multiple ways of representing documents in vector form and of measuring similarity. You can customise them for your own problem and see what works best for you.

The complete code for this story is available here - https://github.com/varun21290/medium/blob/master/Document_Similarities.ipynb

Translated from: https://towardsdatascience.com/calculating-document-similarities-using-bert-and-other-models-b2c1a29c9630
