3. Methods for Generating Word Embeddings with BERT


3.1 Method 1: Using the transformers library



        1. Install the required library: to generate word embeddings with BERT, you need to install the transformers library.

!pip install transformers

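        2. Load the BERT model: once the library is installed, you can use transformers to load a pre-trained BERT model and its matching tokenizer. BERT is available in several versions, so choose the one that best fits your needs.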

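from transformers import BertModel, BertTokenizer
# 'bert-base-uncased' is one common choice; swap in whichever checkpoint suits your needs
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')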


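        3. Tokenize the text: before generating word embeddings, you need to use the BERT tokenizer to split the text into individual words or subwords. This converts your text into a form that can be fed to the BERT model.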

text = "This is an example sentence."
tokens = tokenizer.tokenize(text)
print(tokens)  # ['this', 'is', 'an', 'example', 'sentence', '.']
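
To make the "words or subwords" idea concrete, here is a minimal sketch of greedy longest-match-first subword splitting, the general scheme WordPiece-style tokenizers follow. The function name `wordpiece_tokenize` and the tiny vocabulary are hypothetical and exist only for illustration; the real BERT vocabulary has roughly 30,000 entries and may split these words differently.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into subwords."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        current = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                current = candidate
                break
            end -= 1
        if current is None:
            return [unk_token]  # nothing matched, so the whole word is unknown
        pieces.append(current)
        start = end
    return pieces


# Hypothetical toy vocabulary, NOT the real BERT vocabulary.
toy_vocab = {"embed", "##ding", "##s", "sentence", "example"}

print(wordpiece_tokenize("embeddings", toy_vocab))
print(wordpiece_tokenize("sentence", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

With this toy vocabulary, a word that is present is kept whole, a longer word is broken into a stem plus "##" continuation pieces, and a word with no matching pieces collapses to the unknown token.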

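        4. Convert tokens to input IDs: after tokenizing the text, you need to convert the tokens to input IDs, the numeric representation of each token that the BERT model accepts.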

input_ids = tokenizer.convert_tokens_to_ids(tokens)
print(input_ids)  # [2023, 2003, 2019, 2742, 6251, 1012]
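
Conceptually, this conversion is a dictionary lookup with a fallback to the unknown token. The sketch below assumes a toy excerpt of the vocabulary built from the IDs shown above; the helper name `tokens_to_ids` is hypothetical and is not the library's implementation.

```python
# Toy excerpt of a vocabulary, using the IDs printed above plus an unknown token.
toy_vocab = {
    "[UNK]": 100,
    "this": 2023,
    "is": 2003,
    "an": 2019,
    "example": 2742,
    "sentence": 6251,
    ".": 1012,
}


def tokens_to_ids(tokens, vocab, unk_token="[UNK]"):
    """Map each token to its ID, falling back to the unknown-token ID."""
    unk_id = vocab[unk_token]
    return [vocab.get(tok, unk_id) for tok in tokens]


known_ids = tokens_to_ids(["this", "is", "an", "example", "sentence", "."], toy_vocab)
unknown_ids = tokens_to_ids(["this", "is", "zzz"], toy_vocab)
print(known_ids)
print(unknown_ids)
```

The second call shows why the tokenization step matters: a token that is missing from the vocabulary loses its identity and becomes the unknown ID.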

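        5. Generate the word embeddings: finally, you can generate a word embedding for each token by feeding the input IDs to the BERT model. The model returns a tensor containing one embedding per token in the text (768 values per token for the base model).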

import torch
input_ids = torch.tensor(input_ids).unsqueeze(0)
with torch.no_grad():
    outputs = model(input_ids)
    embeddings = outputs.last_hidden_state[0]


3.2 Method 2: Using TensorFlow



        1. Install the required libraries: to generate word embeddings with BERT and TensorFlow, you need to install TensorFlow and TensorFlow Hub.

!pip install tensorflow tensorflow_hub

        2. Load the BERT model: once the libraries are installed, you can load a pre-trained BERT model from TensorFlow Hub.

import tensorflow as tf
import tensorflow_hub as hub
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4", trainable=False)

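        3. Tokenize the text: before generating word embeddings, you need to split the text into individual words or subwords with a BERT tokenizer built from the vocabulary file that ships with the TensorFlow Hub model. Note that the bert.tokenization import below is not part of tensorflow or tensorflow_hub, so it assumes a separately installed package that provides the bert module.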

from bert.tokenization import FullTokenizer
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = FullTokenizer(vocab_file, do_lower_case)
text = "This is an example sentence."
tokens = tokenizer.tokenize(text)

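        4. Convert tokens to input IDs: after tokenizing the text, you need to convert the tokens to input IDs, the numeric representation of each token that the BERT model accepts.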

input_ids = tokenizer.convert_tokens_to_ids(tokens)

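        5. Generate the word embeddings: finally, you can generate a word embedding for each token by feeding the input IDs to the BERT model. The model returns a tensor containing one embedding per token in the text.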

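input_word_ids = tf.constant([input_ids], dtype=tf.int32)
# this model version expects a dict of three int32 tensors, not a single tensor
inputs = {"input_word_ids": input_word_ids, "input_mask": tf.ones_like(input_word_ids), "input_type_ids": tf.zeros_like(input_word_ids)}
outputs = bert_layer(inputs)
embeddings = outputs["sequence_output"][0]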
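
When more than one sentence is fed at once, the ID lists have to be padded to a common length and accompanied by a mask that marks the real tokens. The sketch below builds the three parallel inputs in plain Python. The key names follow what recent TF Hub BERT encoders expect as far as I know, and both the helper name `build_bert_inputs` and the padding ID of 0 are assumptions made for illustration.

```python
def build_bert_inputs(batch_ids, pad_id=0):
    """Pad ID lists to equal length and build the matching mask and segment IDs."""
    max_len = max(len(ids) for ids in batch_ids)
    word_ids, mask, type_ids = [], [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        word_ids.append(ids + [pad_id] * n_pad)       # real IDs, then padding
        mask.append([1] * len(ids) + [0] * n_pad)     # 1 = real token, 0 = padding
        type_ids.append([0] * max_len)                # single-sentence input: all segment 0
    return {
        "input_word_ids": word_ids,
        "input_mask": mask,
        "input_type_ids": type_ids,
    }


# Two toy sequences of different lengths (IDs chosen only for illustration).
toy_batch = build_bert_inputs([[101, 2023, 102], [101, 102]])
print(toy_batch["input_word_ids"])
print(toy_batch["input_mask"])
print(toy_batch["input_type_ids"])
```

Each nested list can then be turned into an int32 tensor before calling the model; embeddings at padded positions should be ignored.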

3.3 Contextualized word embeddings with BERT


        1. Setup

import pandas as pd
import numpy as np
import torch

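Next, we install the transformers package from Hugging Face, which gives us a PyTorch interface to BERT. We chose the PyTorch interface because it strikes a good balance between the high-level APIs (easy to use, but offering little insight into how things work) and TensorFlow code (full of detail, but prone to sidetracking us into lessons about TensorFlow when the subject here is BERT).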

!pip install transformers


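from transformers import BertModel, BertTokenizer
# output_hidden_states=True makes the model return the hidden states of every layer
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')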


4. Creating Contextual Embeddings

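        We must put the input text into the specific format that BERT can read. Mainly, we add [CLS] to the beginning of the input and [SEP] to the end. We then convert the tokenized BERT input to tensors.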

def bert_text_preparation(text, tokenizer):
    """Preprocesses text input in a way that BERT can interpret."""
    marked_text = "[CLS] " + text + " [SEP]"
    tokenized_text = tokenizer.tokenize(marked_text)
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
    segments_ids = [1]*len(indexed_tokens)
    # convert inputs to tensors
    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensor = torch.tensor([segments_ids])
    return tokenized_text, tokens_tensor, segments_tensor



def get_bert_embeddings(tokens_tensor, segments_tensor, model):
    """Obtains BERT embeddings for tokens, in context of the given sentence."""
    # gradient calculation is disabled
    with torch.no_grad():
        # obtain hidden states; BertModel's second positional argument is
        # attention_mask, so the all-ones tensor is passed under that name
        outputs = model(tokens_tensor, attention_mask=segments_tensor)
        # third output element: hidden states of all layers
        hidden_states = outputs[2]
    # use "stack" to create a new "layers" dimension in the tensor
    token_embeddings = torch.stack(hidden_states, dim=0)
    # remove dimension 1, the "batches"
    token_embeddings = torch.squeeze(token_embeddings, dim=1)
    # swap dimensions 0 and 1 so we can loop over tokens
    token_embeddings = token_embeddings.permute(1,0,2)
    # initialize list to store embeddings
    token_vecs_sum = []
    # "token_embeddings" is a [Y x 13 x 768] tensor (embedding layer + 12 encoder layers)
    # where Y is the number of tokens in the sentence
    # loop over tokens in sentence
    for token in token_embeddings:
        # "token" is a [13 x 768] tensor
        # sum the vectors from the last four layers
        sum_vec = torch.sum(token[-4:], dim=0)
        token_vecs_sum.append(sum_vec)
    return token_vecs_sum
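
The layer-summing step can be checked by hand on a tiny example. The sketch below reproduces it with nested Python lists instead of tensors; the helper name `sum_last_four_layers` and the toy sizes (5 layers, 2 tokens, 3 dimensions) are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def sum_last_four_layers(hidden_states):
    """hidden_states[layer][token] is a vector; return one summed vector per token."""
    num_tokens = len(hidden_states[0])
    dim = len(hidden_states[0][0])
    token_vectors = []
    for t in range(num_tokens):
        vec = [0.0] * dim
        # Add up this token's vector from each of the last four layers.
        for layer in hidden_states[-4:]:
            for d in range(dim):
                vec[d] += layer[t][d]
        token_vectors.append(vec)
    return token_vectors


# Toy data: layer k is filled with the value k, for k = 0..4.
toy_hidden_states = [[[float(k)] * 3 for _ in range(2)] for k in range(5)]
summed = sum_last_four_layers(toy_hidden_states)
print(summed)  # layers 1..4 contribute 1 + 2 + 3 + 4 = 10 per component
```

Summing the last four layers is one of several pooling choices; it keeps the vector size at the hidden dimension, whereas concatenating the same layers would quadruple it.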


sentences = ["bank",
         "he eventually sold the shares back to the bank at a premium.",
         "the bank strongly resisted cutting interest rates.",
         "the bank will supply and buy back foreign currency.",
         "the bank is pressing us for repayment of the loan.",
         "the bank left its lending rates unchanged.",
         "the river flowed over the bank.",
         "tall, luxuriant plants grew along the river bank.",
         "his soldiers were arrayed along the river bank.",
         "wild flowers adorned the river bank.",
         "two fox cubs romped playfully on the river bank.",
         "the jewels were kept in a bank vault.",
         "you can stow your jewellery away in the bank.",
         "most of the money was in storage in bank vaults.",
         "the diamonds are shut away in a bank vault somewhere.",
         "thieves broke into the bank vault.",
         "can I bank on your support?",
         "you can bank on him to hand you a reasonable bill for your services.",
         "don't bank on your friends to help you out of trouble.",
         "you can bank on me when you need money.",
         "i bank on your help."]
from collections import OrderedDict
context_embeddings = []
context_tokens = []
for sentence in sentences:
  tokenized_text, tokens_tensor, segments_tensors = bert_text_preparation(sentence, tokenizer)
  list_token_embeddings = get_bert_embeddings(tokens_tensor, segments_tensors, model)
  # make ordered dictionary to keep track of the position of each word
  tokens = OrderedDict()
  # loop over tokens in the sentence, skipping [CLS] and [SEP]
  for token in tokenized_text[1:-1]:
    # keep track of position of word and whether it occurs multiple times
    if token in tokens:
      tokens[token] += 1
    else:
      tokens[token] = 1
    # compute the position of the current token
    token_indices = [i for i, t in enumerate(tokenized_text) if t == token]
    current_index = token_indices[tokens[token]-1]
    # get the corresponding embedding
    token_vec = list_token_embeddings[current_index]
    # save values
    context_tokens.append(token)
    context_embeddings.append(token_vec)
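
The occurrence counter in the loop above exists so that a token appearing more than once in a sentence is matched to the right position each time. Here is a minimal sketch of that bookkeeping in isolation; the helper name `token_positions` and the toy token list are hypothetical.

```python
def token_positions(tokenized_text):
    """Pair each token with its index, resolving repeats to successive positions."""
    seen = {}
    positions = []
    for token in tokenized_text[1:-1]:  # skip [CLS] and [SEP]
        seen[token] = seen.get(token, 0) + 1
        # All indices where this token occurs in the full sequence.
        indices = [i for i, t in enumerate(tokenized_text) if t == token]
        # The n-th time we see the token, take its n-th occurrence.
        positions.append((token, indices[seen[token] - 1]))
    return positions


toy_tokens = ["[CLS]", "the", "bank", "by", "the", "river", "[SEP]"]
toy_positions = token_positions(toy_tokens)
print(toy_positions)
```

Without the counter, both occurrences of "the" would be mapped to index 1, and the second one would receive the wrong contextual embedding.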



from scipy.spatial.distance import cosine
# embeddings for the word 'bank'
token = 'bank'
indices = [i for i, t in enumerate(context_tokens) if t == token]
token_embeddings = [context_embeddings[i] for i in indices]
# compare 'bank' across different contexts
# (pairing with zip assumes 'bank' occurs exactly once in every sentence)
list_of_distances = []
for sentence_1, embed1 in zip(sentences, token_embeddings):
    for sentence_2, embed2 in zip(sentences, token_embeddings):
        # scipy's cosine() is a distance, so 1 - cosine() is the cosine similarity:
        # higher values in the 'distance' column mean more similar usages
        cos_dist = 1 - cosine(embed1, embed2)
        list_of_distances.append([sentence_1, sentence_2, cos_dist])
distances_df = pd.DataFrame(list_of_distances, columns=['sentence_1', 'sentence_2', 'distance'])
distances_df[distances_df.sentence_1 == "bank"]


distances_df[distances_df.sentence_1 == "he eventually sold the shares back to the bank at a premium."]
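
For readers who want to see what the comparison measures, the following is a minimal sketch of cosine similarity written out by hand, which is the quantity obtained from 1 minus scipy's cosine distance. The helper name `cosine_similarity` and the two-dimensional toy vectors are illustrative only; real BERT vectors have 768 components.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
print(same_direction, orthogonal, opposite)
```

Values near 1 indicate vectors pointing the same way, values near 0 indicate unrelated directions, and values near -1 indicate opposite directions, so two uses of "bank" in the same sense would be expected to score higher than uses in different senses.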


