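For similarity tasks, a simple way to use BERT is to recast the task as classification. Deciding whether two sentences match, for instance, is a binary classification problem. The approach is simple, but it can cost accuracy: BERT accepts at most 512 tokens, and the two sentences have to be concatenated into a single input, "[CLS]" + seq1 + "[SEP]" + seq2 + "[SEP]", so long pairs have to be truncated to fit. The vector at the [CLS] position is then passed through a softmax layer to produce the classification.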
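To make the input format concrete, here is a minimal sketch of how a sentence pair could be packed into one sequence. The function name `build_pair_input`, the pre-split token lists, and the "trim the longer sentence first" truncation rule are assumptions chosen for illustration; a real tokenizer handles this internally and may truncate differently.

```python
from typing import List, Tuple


def build_pair_input(tokens_a: List[str], tokens_b: List[str],
                     max_len: int = 512) -> Tuple[List[str], List[int]]:
    """Join two token lists as [CLS] a [SEP] b [SEP], truncated to max_len."""
    if max_len < 3:
        raise ValueError("max_len must leave room for [CLS] and two [SEP] tokens")
    a, b = list(tokens_a), list(tokens_b)
    # Three positions are reserved for [CLS] and the two [SEP] markers.
    budget = max_len - 3
    while len(a) + len(b) > budget:
        # Drop a token from the end of whichever sequence is currently longer.
        longer = a if len(a) > len(b) else b
        longer.pop()
    tokens = ["[CLS]"] + a + ["[SEP]"] + b + ["[SEP]"]
    # Segment ids: 0 for [CLS], sentence A and its [SEP]; 1 for sentence B and its [SEP].
    segment_ids = [0] * (len(a) + 2) + [1] * (len(b) + 1)
    return tokens, segment_ids


tokens, segment_ids = build_pair_input(
    ["a", "man", "eats"], ["he", "is", "eating", "pasta"], max_len=8)
print(tokens)
print(segment_ids)
```

With `max_len=8` the second sentence loses its last two tokens, which is the kind of information loss the 512-token limit causes for long pairs.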
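The paper Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks points out that this classification-style use of BERT performs poorly on similarity and clustering tasks, so its authors ran a series of experiments and reported the results in the paper. A Python implementation is also available as the sentence-transformers library. To install it: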
pip install sentence-transformers
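Note that the library is built on PyTorch and does not support Python 2.7.
This post is a set of study notes in two parts. The first part runs the examples that ship with the library and checks the results; the second part walks through the code those examples use and how it works internally.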
The first example is clustering. The code is as follows:
"""
This is a simple application for sentence embeddings: clustering
Sentences are mapped to sentence embeddings and then k-means clustering is applied.
"""
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Local helper module; its implementation is given later in this post
from utils import show_cluster_image

if __name__ == '__main__':
    embedder = SentenceTransformer('bert-base-nli-mean-tokens')

    # Corpus with example sentences
    corpus = ['A man is eating food.',
              'A man is eating a piece of bread.',
              'A man is eating pasta.',
              'The girl is carrying a baby.',
              'The baby is carried by the woman',
              'A man is riding a horse.',
              'A man is riding a white horse on an enclosed ground.',
              'A monkey is playing drums.',
              'Someone in a gorilla costume is playing a set of drums.',
              'A cheetah is running behind its prey.',
              'A cheetah chases prey on across a field.'
              ]
    corpus_embeddings = embedder.encode(corpus)

    # Perform k-means clustering
    num_clusters = 5
    clustering_model = KMeans(n_clusters=num_clusters)
    clustering_model.fit(corpus_embeddings)
    cluster_assignment = clustering_model.labels_

    # Group the sentences by the cluster they were assigned to
    clustered_sentences = [[] for _ in range(num_clusters)]
    for sentence_id, cluster_id in enumerate(cluster_assignment):
        clustered_sentences[cluster_id].append(corpus[sentence_id])

    for i, cluster in enumerate(clustered_sentences):
        print("Cluster ", i)
        print(cluster)

    # Visualize the clustering result
    show_cluster_image(corpus_embeddings, cluster_assignment)
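This example first creates an encoder from "bert-base-nli-mean-tokens" (this pretrained model is discussed below), feeds the sentence corpus to the encoder, and obtains an embedding for each sentence. It then runs k-means clustering with 5 clusters on these vectors, prints the sentences in each cluster, and visualizes the result (the original example does not include the visualization).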
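For readers who want to see what the clustering step does to the embeddings, the following is a minimal pure-Python sketch of the k-means iteration on toy 2-D vectors. It is not scikit-learn's implementation: the `kmeans` function, the fixed starting centroids, and the toy data are assumptions for illustration, whereas `KMeans` picks its own starting centroids (k-means++ by default).

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Vector], centroids: List[Vector], n_iter: int = 10) -> List[int]:
    """Plain Lloyd iterations from the given centroids; returns one label per point."""
    centroids = [list(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda k: squared_distance(p, centroids[k]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


# Toy stand-ins for sentence embeddings: two obvious groups.
toy_embeddings = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9)]
labels = kmeans(toy_embeddings, centroids=[toy_embeddings[0], toy_embeddings[3]])
print(labels)
```

The cluster ids only reflect which starting centroid a group ended up with, so the numbering carries no meaning of its own.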
Next is the implementation of show_cluster_image:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA


def show_cluster_image(vectors, labels, excepted_labels=None):
    """
    Draw a scatter plot of the clusters (at most 8 clusters for now).
    :param vectors: the vectors to plot
    :param labels: the cluster each point belongs to
    :param excepted_labels: labels to exclude; points with these labels are not drawn
    :return:
    """
    # Reduce the vectors to 2 dimensions
    estimator = PCA(n_components=2)
    data_set = estimator.fit_transform(vectors)

    # Group the reduced points by cluster label
    clusters = {}
    for datum, label in zip(data_set, labels):
        # Excluded labels (e.g. outliers) are not shown for now
        if excepted_labels and label in excepted_labels:
            continue
        clusters.setdefault(label, []).append(datum)

    # Draw each cluster as its own scatter series
    for label, array in clusters.items():
        matrix = np.array(array)
        plt.scatter(matrix[:, 0], matrix[:, 1], label='cluster%d' % label)

    plt.legend(loc='upper right')
    plt.show()
show_cluster_image first reduces the vectors to 2 dimensions with PCA and then displays them as a scatter plot.
The output is shown below (the clusters were manually reordered for readability):
Cluster 1
['A man is eating food.', 'A man is eating a piece of bread.', 'A man is eating pasta.']
Cluster 0
['The girl is carrying a baby.', 'The baby is carried by the woman']
Cluster 3
['A man is riding a horse.', 'A man is riding a white horse on an enclosed ground.']
Cluster 4
['A monkey is playing drums.', 'Someone in a gorilla costume is playing a set of drums.']
Cluster 2
['A cheetah is running behind its prey.', 'A cheetah chases prey on across a field.']
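Visualization: [figure: scatter plot of the clusters after PCA reduction to two dimensions]
Start with cluster 0, whose sentences are ['The girl is carrying a baby.', 'The baby is carried by the woman']. To a human reader, these two sentences do look quite similar.
Setting aside the small amount of data, the printed clusters and the plot together suggest that the sentence-transformers library clusters reasonably well.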
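First, we need to understand how the pretrained model we used, "bert-base-nli-mean-tokens", was obtained:
The model is obtained by fine-tuning BERT (bert-base-uncased, as released by Google) on natural language inference (NLI) data. Given two sentences, the model classifies the pair as entailment, contradiction, or neutral. To do this, both sentences are passed through the transformer model to produce fixed-size sentence embeddings, and these embeddings are then fed to a softmax classifier that outputs the final label (entailment, contradiction, or neutral). The sentence embeddings produced this way also turn out to be useful for other tasks, such as clustering or semantic textual similarity.
In short, it is a three-class classification task on NLI datasets.
Unlike the standard BERT classification setup, bert-base-nli-mean-tokens is obtained as follows:
For classification, SBERT uses a Siamese network: the two branches share the same encoder weights.
Two sentences, SentenceA and SentenceB, are each passed through BERT to obtain feature vectors u and v that represent them. u and v are then concatenated as (u, v, |u-v|), giving a tensor of shape (batch_size, 3 * 768). This is followed by a fully connected layer of shape (3 * 768, 3) with bias, and a softmax produces the predicted class.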
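The classification head described above can be sketched in plain Python. This is only an illustration under stated assumptions: the vectors have 2 dimensions instead of 768, the weights are made up, and the helper names `concat_features`, `linear`, and `softmax` are hypothetical rather than taken from the library.

```python
import math
from typing import List


def concat_features(u: List[float], v: List[float]) -> List[float]:
    """Build (u, v, |u - v|) as one flat feature vector of length 3 * len(u)."""
    return u + v + [abs(a - b) for a, b in zip(u, v)]


def linear(x: List[float], weight: List[List[float]], bias: List[float]) -> List[float]:
    """weight holds one row per output class; returns one logit per class."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy sentence vectors with 2 dimensions (the real model uses 768).
u = [1.0, 0.0]
v = [0.5, 0.5]
features = concat_features(u, v)          # length 3 * 2 = 6

# Made-up weights for the 3 NLI classes; a trained model learns these.
weight = [[0.1] * 6, [0.0] * 6, [-0.1] * 6]
bias = [0.0, 0.0, 0.0]
probs = softmax(linear(features, weight, bias))
print(features)
print(probs)
```

With 768-dimensional vectors the same concatenation yields 3 * 768 = 2304 features per pair, which matches the input size of the fully connected layer.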
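With the network structure and the data in place, BERT can be fine-tuned to yield the "bert-base-nli-mean-tokens" model. The fine-tuned model is then evaluated for similarity on the STS data, where its similarity scores correlate with the human ratings at a little over 0.7 (Pearson and Spearman).
What is interesting is that a model trained for classification with this architecture performs well on similarity and clustering.
The full code is in examples/training_nli_bert.py. The rest of this part sorts out the classes that appear in that example and how they relate to each other.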
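The example starts from a pretrained BERT model:
model_name = 'bert-base-uncased'
# Use BERT to map tokens to embeddings
word_embedding_model = models.BERT(model_name)
It then creates a pooling layer:
# Use mean pooling
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True,
                               pooling_mode_cls_token=False,
                               pooling_mode_max_tokens=False)
Next it creates a SentenceTransformer, a class that inherits from nn.Sequential:
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
One of the things SentenceTransformer does is take a text as input and return its embedding. The detailed flow is as follows:
The text is converted into the input format BERT expects and passed to BERT; the token embeddings from the last layer are then averaged (the source code also excludes the embeddings of padding tokens, which are added only to bring the texts in a batch to the same length). The data flows as follows:
(batch_size, texts) => BERT => Pooling => (batch_size, 768)
The output of Pooling is the sentence embedding.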
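The masked averaging described above can be sketched as follows. The function `mean_pool` and the toy numbers are assumptions for illustration; the library performs the equivalent operation on batched tensors.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask is 1; padded positions (mask 0) are ignored."""
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    return [sum(dim) / len(kept) for dim in zip(*kept)]


# Three token vectors of dimension 2; the last one belongs to a padding token.
token_embeddings = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]
sentence_embedding = mean_pool(token_embeddings, attention_mask)
print(sentence_embedding)
```

Without the mask, the padding vector would pull the average toward values that have nothing to do with the sentence.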
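After that, another Module is created:
train_loss = losses.SoftmaxLoss(model=model,
                                sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
                                num_labels=train_num_labels)
SoftmaxLoss inherits from nn.Module and uses softmax together with the cross-entropy loss internally.
The network structure of this class is shown in Figure 3. Given two texts, it uses model to obtain their sentence embeddings u and v, concatenates them as (u, v, |u-v|), applies a fully connected layer, and finally computes and returns the cross-entropy loss: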
def forward(self, sentence_features: Iterable[Dict[str, Tensor]], labels: Tensor):
    reps = [self.model(sentence_feature)['sentence_embedding'] for sentence_feature in sentence_features]
    rep_a, rep_b = reps

    vectors_concat = []
    if self.concatenation_sent_rep:
        vectors_concat.append(rep_a)
        vectors_concat.append(rep_b)

    if self.concatenation_sent_difference:
        vectors_concat.append(torch.abs(rep_a - rep_b))

    if self.concatenation_sent_multiplication:
        vectors_concat.append(rep_a * rep_b)

    features = torch.cat(vectors_concat, 1)

    output = self.classifier(features)
    loss_fct = nn.CrossEntropyLoss()

    if labels is not None:
        loss = loss_fct(output, labels.view(-1))
        return loss
    else:
        return reps, output
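The train_loss object is called inside model.fit, which backpropagates the loss and fine-tunes the weights with the Adam optimizer.
The slightly confusing part is that model and train_loss call each other: the forward method of train_loss uses model to obtain the sentence embeddings, while the fit method of model calls train_loss to obtain the loss.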
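The mutual dependency can be shown with a toy sketch that only records who calls whom. Everything here is hypothetical: `ToyEncoder` and `ToyLoss` are stand-ins invented for this illustration, the "embedding" is just the text length, and no gradients are computed.

```python
class ToyEncoder:
    """Hypothetical stand-in for SentenceTransformer: encodes text and owns the training loop."""

    def __init__(self):
        self.calls = []  # records the order of calls for inspection

    def __call__(self, text):
        self.calls.append("encoder.__call__")
        return [float(len(text))]  # placeholder "embedding": just the text length

    def fit(self, batches, loss_module):
        losses = []
        for pair, label in batches:
            self.calls.append("encoder.fit -> loss_module")
            # The loss module calls back into this encoder to get u and v.
            losses.append(loss_module(pair, label))
            # A real loop would backpropagate the loss and step the optimizer here.
        return losses


class ToyLoss:
    """Hypothetical stand-in for the loss module; it holds a reference to the encoder."""

    def __init__(self, model):
        self.model = model

    def __call__(self, pair, label):
        u, v = (self.model(text) for text in pair)
        return abs(abs(u[0] - v[0]) - label)  # placeholder loss, not SoftmaxLoss


encoder = ToyEncoder()
loss_module = ToyLoss(model=encoder)
losses = encoder.fit([(("abc", "abcde"), 2.0)], loss_module)
print(encoder.calls)
```

The recorded call order shows the loop: fit hands the batch to the loss module, and the loss module calls the encoder once per sentence before returning the loss.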
Once the model has been fine-tuned, the result can be evaluated. The official example evaluates on the sts-dev split of the STS benchmark (about 1,500 sentence pairs). Each STS record consists of two sentences, s1 and s2, and a score between 0 and 5 that says how similar they are.
Here the officially fine-tuned model "bert-base-nli-mean-tokens" is evaluated directly, on the sts-test.csv split:
"""
Evaluate the official nli-mean model
"""
from torch.utils.data import DataLoader
from sentence_transformers import SentencesDataset, LoggingHandler, SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from sentence_transformers.readers import STSDataReader
import logging

# Just some code to print debug information to stdout
logging.basicConfig(format='%(asctime)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S',
                    level=logging.INFO,
                    handlers=[LoggingHandler()])
# /print debug information to stdout

if __name__ == '__main__':
    batch_size = 16
    model = SentenceTransformer('bert-base-nli-mean-tokens')

    # STS benchmark test split, used for evaluation
    sts_reader = STSDataReader('datasets/stsbenchmark')
    test_data = SentencesDataset(examples=sts_reader.get_examples("sts-test.csv"), model=model)
    test_dataloader = DataLoader(test_data, shuffle=False, batch_size=batch_size)

    # Evaluate
    evaluator = EmbeddingSimilarityEvaluator(test_dataloader)
    model.evaluate(evaluator)
EmbeddingSimilarityEvaluator scores each sentence pair with cosine similarity, Manhattan distance, Euclidean distance, and dot product, and reports the Pearson and Spearman correlation of each measure with the gold scores. The results are as follows:
2020-03-31 11:06:33 - Cosine-Similarity : Pearson: 0.7415 Spearman: 0.7698
2020-03-31 11:06:33 - Manhattan-Distance: Pearson: 0.7730 Spearman: 0.7712
2020-03-31 11:06:33 - Euclidean-Distance: Pearson: 0.7713 Spearman: 0.7707
2020-03-31 11:06:33 - Dot-Product-Similarity: Pearson: 0.7273 Spearman: 0.7270
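To clarify what the Pearson and Spearman numbers in this log measure, here is a minimal pure-Python sketch of both correlations. The helper names and the toy scores are assumptions for illustration; this is not the evaluator's own code.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Linear correlation between two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Rank correlation: Pearson computed on the ranks of the values."""
    return pearson(ranks(x), ranks(y))


# Toy data: gold STS-style scores in [0, 5] and made-up model similarities.
gold_scores = [0.0, 1.0, 2.5, 4.0, 5.0]
predicted = [0.10, 0.15, 0.30, 0.60, 0.99]
pearson_r = pearson(gold_scores, predicted)
spearman_r = spearman(gold_scores, predicted)
print(pearson_r, spearman_r)
```

In this toy data the predictions preserve the gold ordering exactly, so the Spearman value is 1 even though the relationship is not linear and the Pearson value is lower.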
Besides clustering, another application is similarity search:
from sentence_transformers import SentenceTransformer
import scipy.spatial

if __name__ == '__main__':
    embedder = SentenceTransformer('bert-base-nli-mean-tokens')

    # Example corpus
    corpus = ['A man is eating food.',
              'A man is eating a piece of bread.',
              'The girl is carrying a baby.',
              'A man is riding a horse.',
              'A woman is playing violin.',
              'Two men pushed carts through the woods.',
              'A man is riding a white horse on an enclosed ground.',
              'A monkey is playing drums.',
              'A cheetah is running behind its prey.'
              ]
    corpus_embeddings = embedder.encode(corpus)

    # Query sentences
    queries = ['A man is eating pasta.',
               'Someone in a gorilla costume is playing a set of drums.',
               'A cheetah chases prey on across a field.']
    query_embeddings = embedder.encode(queries)

    # For each query, find the 5 closest corpus sentences by cosine similarity
    closest_n = 5
    for query, query_embedding in zip(queries, query_embeddings):
        distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]

        # Sort by ascending cosine distance, so the most similar sentence comes first
        results = zip(range(len(distances)), distances)
        results = sorted(results, key=lambda x: x[1])

        print("======================")
        print("Query:", query)
        print("Result:Top 5 most similar sentences in corpus:")

        for idx, distance in results[0:closest_n]:
            # Cosine similarity = 1 - cosine distance
            print(corpus[idx].strip(), "(Score: %.4f)" % (1 - distance))
This example is similar to the clustering one: it also generates sentence embeddings, and then uses cosine similarity to retrieve the closest sentences. The results are as follows:
======================
Query: A man is eating pasta.
Top 5 most similar sentences in corpus:
A man is eating a piece of bread. (Score: 0.8480)
A man is eating food. (Score: 0.7759)
Two men pushed carts through the woods. (Score: 0.2095)
A monkey is playing drums. (Score: 0.1945)
A man is riding a white horse on an enclosed ground. (Score: 0.1586)
======================
Query: Someone in a gorilla costume is playing a set of drums.
Top 5 most similar sentences in corpus:
A monkey is playing drums. (Score: 0.7985)
A cheetah is running behind its prey. (Score: 0.2860)
The girl is carrying a baby. (Score: 0.2351)
A man is riding a horse. (Score: 0.2023)
A man is riding a white horse on an enclosed ground. (Score: 0.1963)
======================
Query: A cheetah chases prey on across a field.
Top 5 most similar sentences in corpus:
A cheetah is running behind its prey. (Score: 0.9007)
Two men pushed carts through the woods. (Score: 0.3662)
A monkey is playing drums. (Score: 0.3061)
A man is riding a horse. (Score: 0.2930)
A man is riding a white horse on an enclosed ground. (Score: 0.2718)
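The scores above are cosine similarities, obtained as 1 minus the cosine distance that SciPy returns. As a minimal sketch of the same ranking without SciPy, assuming toy 2-D vectors and the hypothetical helpers `cosine_similarity` and `top_n`:

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_n(query: Sequence[float], corpus_vectors: List[Sequence[float]],
          n: int) -> List[Tuple[int, float]]:
    """Return (corpus index, similarity) pairs, most similar first."""
    scored = [(idx, cosine_similarity(query, vec))
              for idx, vec in enumerate(corpus_vectors)]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:n]


# Toy stand-ins for sentence embeddings.
corpus_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = top_n([2.0, 0.0], corpus_vectors, n=2)
print(hits)
```

Sorting by descending similarity gives the same order as sorting by ascending cosine distance, and the length of the query vector does not affect the score.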
As the results show, the cosine similarity computed from the fine-tuned BERT model's embeddings is a meaningful signal.
文本匹配利器:从Siamse孪生网络到Sentence-BERT综述
sentence-transformers
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks