BERT Word Embeddings

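These notes are adapted from the original article: BERT Word Embeddings Tutorial · Chris McCormick (mccormickml.com),
with some of my own understanding added.
BERT word embeddings are introduced here through an example: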

from transformers import BertTokenizer

# Load the pre-trained tokenizer (vocabulary) once.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text = "embeddings After stealing money from the bank vault, the bank robber was seen fishing on the Mississippi river bank."
# BERT expects [CLS] at the start of the input and [SEP] at the end.
marked_text = "[CLS] " + text + " [SEP]"
tokenized_text = tokenizer.tokenize(marked_text)
print(tokenized_text)
>>
['[CLS]', 'em', '##bed', '##ding', '##s', 'after', 'stealing', 'money', 'from', 'the', 'bank', 'vault', ',', 'the', 'bank', 'robber', 'was', 'seen', 'fishing', 'on', 'the', 'mississippi', 'river', 'bank', '.', '[SEP]']

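Notice how the word "embeddings" is represented after BERT tokenization:
['em', '##bed', '##ding', '##s']
The original word has been split into smaller subwords and characters. The two hash signs in front of some of these subwords are simply the tokenizer's way of marking that the subword or character is part of a larger word and is preceded by another subword. So, for example, the '##bed' token is distinct from the 'bed' token: the first is used when the subword "bed" occurs inside a larger word, and the second is used for the standalone token, the "thing you sleep on".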
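To make the role of the "##" marker concrete, here is a minimal sketch that glues WordPiece tokens back into whole words. The helper name `merge_wordpieces` is my own and is not part of any library; it only illustrates the convention described above.

```python
def merge_wordpieces(tokens):
    """Rebuild whole words by attaching '##' pieces to the token before them."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]   # continuation piece: drop the '##' marker and attach
        else:
            words.append(token)      # word-initial piece or standalone token
    return words


pieces = ['em', '##bed', '##ding', '##s', 'after', 'stealing']
merged = merge_wordpieces(pieces)
print(merged)
```

A standalone 'bed' token passes through unchanged, while '##bed' can only ever extend the word before it.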
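The BERT tokenizer was created with a WordPiece model. This model greedily builds a fixed-size vocabulary of the individual characters, subwords, and words that best fit the language data. Because the vocabulary size of this BERT tokenizer is limited to about 30,000, the WordPiece model produces a vocabulary that contains all English characters plus the ~30,000 most common words and subwords found in the English corpus the model was trained on. The vocabulary contains four kinds of entries: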

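Whole words
Subwords that occur at the front of a word or on their own ("em" as in "embeddings" is assigned the same vector as the standalone character sequence "em" in "go get em")
Subwords that are not at the front of a word, which are prefixed with "##" to mark this
Individual characters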
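The roughly 30,000 entries are laid out in the following order (positions are 1-indexed, so a token's ID is its position minus 1):
(1) The first 999 positions are mostly reserved entries of the form [unused975], plus the special tokens:
1 - [PAD]
101 - [UNK]
102 - [CLS]
103 - [SEP]
104 - [MASK]
(2) Positions 1000-1996 are individual characters, not sorted by frequency.
(3) The first word, "the", is at position 1997, and from there the entries are sorted by frequency. The first 18 or so are whole words; entry 2016 is ##s, the most common subword. The last whole word is at position 29612: "necessitated".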
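The positions in this list do not match the token IDs printed later in these notes (101 for [CLS], 102 for [SEP], 1996 for "the"). A plausible reading, which I am assuming here, is that the list counts vocabulary rows starting from 1 while token IDs start from 0. The small sketch below only checks that arithmetic.

```python
# 1-indexed vocabulary positions as quoted in the list above.
positions = {"[PAD]": 1, "[UNK]": 101, "[CLS]": 102, "[SEP]": 103, "[MASK]": 104, "the": 1997}

# Assumption: a token ID is the 0-indexed row number, i.e. position minus one.
token_ids = {token: pos - 1 for token, pos in positions.items()}
print(token_ids)
```

Under that assumption the IDs agree with the `indexed_tokens` output shown in the input-preparation step.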

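To tokenize a word under this model, the tokenizer first checks whether the whole word is in the vocabulary. If it is not, the tokenizer tries to break the word into the largest possible subwords contained in the vocabulary, and as a last resort decomposes the word into individual characters. Note that because of this, we can always represent a word as, at the very least, the collection of its individual characters.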
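The procedure just described can be sketched as a greedy longest-match-first loop. This is a simplified illustration, not the tokenizer's actual code: the function name and the tiny `toy_vocab` are made up, and the sketch falls back to [UNK] if some position cannot be matched by any piece at all.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into WordPiece tokens."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # non-initial pieces carry the '##' prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # nothing matched here, not even a single character
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical miniature vocabulary, just large enough for the examples below.
toy_vocab = {"em", "##bed", "##ding", "##s", "bed", "bank", "e", "##m", "##b", "##e", "##d"}

print(wordpiece_tokenize("embeddings", toy_vocab))
print(wordpiece_tokenize("bank", toy_vocab))
```

With this toy vocabulary, "embeddings" splits into the same four pieces shown earlier, while "bank" is found whole and stays a single token.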
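As a result, rather than assigning out-of-vocabulary words to a catch-all token such as "OOV" or "UNK", words that are not in the vocabulary are decomposed into subwords and characters, for which we can then generate embeddings.
So, rather than assigning "embeddings" and every other out-of-vocabulary word to an overloaded unknown-vocabulary token, we split it into the subword tokens ['em', '##bed', '##ding', '##s'], which retain some of the contextual meaning of the original word. We can even average these subword embedding vectors to generate an approximate vector for the original word.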
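As a rough illustration of that last idea, the sketch below averages subword vectors element-wise. The three-dimensional vectors are invented placeholders; real BERT-base vectors have 768 dimensions and would come from the model.

```python
def average_vectors(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


# Hypothetical embeddings for the four pieces of "embeddings".
subword_vecs = {
    "em":     [1.0, 0.0, 2.0],
    "##bed":  [3.0, 2.0, 0.0],
    "##ding": [0.0, 2.0, 2.0],
    "##s":    [0.0, 0.0, 0.0],
}

word_vec = average_vectors(list(subword_vecs.values()))
print(word_vec)
```

The result is a single vector of the same size as each subword vector, which can stand in for the whole word.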

The following records the inputs and outputs of the pre-trained BERT model:

1. Import the model

import torch
from transformers import BertTokenizer, BertModel
import matplotlib.pyplot as plt
# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

2. Prepare the input data

text = "After stealing money from the bank vault, the bank robber was seen fishing on the Mississippi river bank."
marked_text = "[CLS] " + text + " [SEP]"  # BERT expects these special tokens around the input sentence
# Tokenize our sentence with the BERT tokenizer.
tokenized_text = tokenizer.tokenize(marked_text)
print (tokenized_text)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
print(indexed_tokens)
segments_ids = [1] * len(tokenized_text)
print (segments_ids)
>>
['[CLS]', 'after', 'stealing', 'money', 'from', 'the', 'bank', 'vault', ',', 'the', 'bank', 'robber', 'was', 'seen', 'fishing', 'on', 'the', 'mississippi', 'river', 'bank', '.', '[SEP]']
[101, 2044, 11065, 2769, 2013, 1996, 2924, 11632, 1010, 1996, 2924, 27307, 2001, 2464, 5645, 2006, 1996, 5900, 2314, 2924, 1012, 102]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

3. Convert to tensors

tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])

The data preparation can also be done directly with tokenizer.encode_plus():

# Tokenize all of the sentences and map the tokens to their word IDs.
# Assumes `sentences` (a list of strings) and `labels` (a list of ints) are already loaded.
input_ids = []
attention_masks = []
# For every sentence...
for sent in sentences:
    # `encode_plus` will:
    #   (1) Tokenize the sentence.
    #   (2) Prepend the `[CLS]` token to the start.
    #   (3) Append the `[SEP]` token to the end.
    #   (4) Map tokens to their IDs.
    #   (5) Pad or truncate the sentence to `max_length`
    #   (6) Create attention masks for [PAD] tokens.
    encoded_dict = tokenizer.encode_plus(
                        sent,                      # Sentence to encode.
                        add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                        max_length = 64,           # Pad & truncate all sentences.
                        padding = 'max_length',    # Replaces the deprecated `pad_to_max_length = True`.
                        truncation = True,
                        return_attention_mask = True,   # Construct attn. masks.
                        return_tensors = 'pt',     # Return pytorch tensors.
                   )
    
    # Add the encoded sentence to the list.    
    input_ids.append(encoded_dict['input_ids'])
    # And its attention mask (simply differentiates padding from non-padding).
    attention_masks.append(encoded_dict['attention_mask'])
# Convert the lists into tensors.
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
labels = torch.tensor(labels)
# Print sentence 0, now as a list of IDs.
print('Original: ', sentences[0])
print('Token IDs:', input_ids[0])
>>
Original:  Our friends won't buy this analysis, let alone the next one we propose.
Token IDs: tensor([  101,  2256,  2814,  2180,  1005,  1056,  4965,  2023,  4106,  1010,
         2292,  2894,  1996,  2279,  2028,  2057, 16599,  1012,   102,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0])
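To see what the padding and the attention mask amount to, here is a plain-Python sketch of steps (2) to (6) for a single sentence that has already been mapped to IDs. It is a conceptual illustration only, not the library's implementation; the function name is made up, and the special-token IDs (101, 102, 0) are the ones visible in the output above.

```python
def pad_or_truncate(content_ids, max_length, cls_id=101, sep_id=102, pad_id=0):
    """Add [CLS]/[SEP], cut to max_length, pad with [PAD], and build the attention mask."""
    content_ids = content_ids[: max_length - 2]   # leave room for [CLS] and [SEP]
    ids = [cls_id] + content_ids + [sep_id]
    attention_mask = [1] * len(ids)               # 1 marks a real token
    n_pad = max_length - len(ids)
    ids = ids + [pad_id] * n_pad
    attention_mask = attention_mask + [0] * n_pad  # 0 marks padding
    return ids, attention_mask


# The word-piece IDs of the example sentence above, without [CLS] and [SEP].
content = [2256, 2814, 2180, 1005, 1056, 4965, 2023, 4106, 1010,
           2292, 2894, 1996, 2279, 2028, 2057, 16599, 1012]

ids, mask = pad_or_truncate(content, max_length=64)
print(len(ids), sum(mask))
```

The mask has a 1 for each of the 19 real tokens and a 0 for every [PAD] position, which is what lets the model ignore the padding.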

4. Run the model

# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states = True, # Whether the model returns all hidden-states.
                                  )

# Put the model in "evaluation" mode, meaning feed-forward operation.
model.eval()
# Run the text through BERT, and collect all of the hidden states produced
# from all 12 layers. 
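with torch.no_grad():  # Forward pass only: no gradients are tracked, which saves memory and time.
    # Note: in `transformers`, the second positional argument of BertModel is
    # `attention_mask`; use `token_type_ids=segments_tensors` to pass segment IDs by name.
    outputs = model(tokens_tensor, segments_tensors)
    # Evaluating the model will return a different number of objects based on 
    # how it's configured in the `from_pretrained` call earlier. In this case, 
    # because we set `output_hidden_states = True`, the third item will be the 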
    # hidden states from all layers. See the documentation for more details:
    # https://huggingface.co/transformers/model_doc/bert.html#bertmodel
    hidden_states = outputs[2]

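All of the model's hidden states are stored in hidden_states. This object has four dimensions, in the following order:
1. Layers: 13, i.e. the input token embeddings plus BERT's 12 hidden layers
2. Batches: 1, since only one sentence was processed
3. Tokens: 22
4. Hidden units / features: 768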

print ("Number of layers:", len(hidden_states), "  (initial embeddings + 12 BERT layers)")
layer_i = 0

print ("Number of batches:", len(hidden_states[layer_i]))
batch_i = 0

print ("Number of tokens:", len(hidden_states[layer_i][batch_i]))
token_i = 0

print ("Number of hidden units:", len(hidden_states[layer_i][batch_i][token_i]))
print(hidden_states[0].shape)
>>
Number of layers: 13   (initial embeddings + 12 BERT layers)
Number of batches: 1
Number of tokens: 22
Number of hidden units: 768
torch.Size([1, 22, 768])

5. Regroup the hidden states by token to get word embeddings

Current dimensions:
[# layers, # batches, # tokens, # features]
Desired dimensions:
[# tokens, # layers, # features]

# Concatenate the tensors for all layers. We use `stack` here to
# create a new dimension in the tensor.
token_embeddings = torch.stack(hidden_states, dim=0)
print(token_embeddings.size())

# Remove dimension 1, the "batches".
token_embeddings = torch.squeeze(token_embeddings, dim=1)
print(token_embeddings.size())
# Swap dimensions 0 and 1.
token_embeddings = token_embeddings.permute(1,0,2)
print( token_embeddings.size())
>>
torch.Size([13, 1, 22, 768])
torch.Size([13, 22, 768])
torch.Size([22, 13, 768])  # Final result: for each token, one embedding vector per layer; pick and combine the layers you need.
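If the stack / squeeze / permute sequence feels opaque, the same regrouping can be written with nested Python lists. This is only a toy illustration with made-up sizes (3 layers, 2 tokens, 4 features instead of 13, 22, 768) and fake values; it does not use the model.

```python
n_layers, n_tokens, n_features = 3, 2, 4   # toy sizes; the real ones are 13, 22, 768

# hidden_states[layer][batch][token] is a feature vector; the batch holds one sentence.
hidden_states = [
    [[[float(layer * 10 + token)] * n_features for token in range(n_tokens)]]
    for layer in range(n_layers)
]

# Drop the batch dimension and regroup so that the token index comes first.
token_embeddings = [
    [hidden_states[layer][0][token] for layer in range(n_layers)]
    for token in range(n_tokens)
]

print(len(token_embeddings), len(token_embeddings[0]), len(token_embeddings[0][0]))
```

After regrouping, `token_embeddings[t]` is the list of per-layer vectors for token `t`, which is the layout the next step iterates over.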

6. Word vectors

Combine the features of the last four hidden layers into a word vector:

# Stores the token vectors, with shape [22 x 768]
token_vecs_sum = []
# `token_embeddings` is a [22 x 13 x 768] tensor.
# For each token in the sentence...
for token in token_embeddings:
    # `token` is a [13 x 768] tensor
    # Sum the vectors from the last four layers.
    sum_vec = torch.sum(token[-4:], dim=0)
    # Use `sum_vec` to represent `token`.
    token_vecs_sum.append(sum_vec)
print ((len(token_vecs_sum), len(token_vecs_sum[0])))
>>(22, 768)

For a sentence vector, simply take the mean over the token vectors:

# `hidden_states` has shape [13 x 1 x 22 x 768]

# `token_vecs` is a tensor with shape [22 x 768]
token_vecs = hidden_states[-2][0]  # Second-to-last layer; [0] selects the only sentence in the batch (batch size 1), giving shape [22, 768]
# Calculate the average of all 22 token vectors.
sentence_embedding = torch.mean(token_vecs, dim=0)
print ("Our final sentence embedding vector of shape:", sentence_embedding.size())
>>Our final sentence embedding vector of shape: torch.Size([768])

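7. BERT word embeddings carry contextual information

An example shows that embeddings produced by BERT carry contextual information, because the same word gets a different representation at different positions. The original sentence is:
"After stealing money from the bank vault, the bank robber was seen fishing on the Mississippi river bank."
The word "bank" in this example carries different meanings: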

print('First 5 vector values for each instance of "bank".')
print('')
print("bank vault   ", str(token_vecs_sum[6][:5]))  # The index selects one occurrence of "bank" in the sentence.
print("bank robber  ", str(token_vecs_sum[10][:5]))
print("river bank   ", str(token_vecs_sum[19][:5]))
>>
First 5 vector values for each instance of "bank".

bank vault    tensor([ 3.3596, -2.9805, -1.5421,  0.7065,  2.0031])
bank robber   tensor([ 2.7359, -2.5577, -1.3094,  0.6797,  1.6633])
river bank    tensor([ 1.5266, -0.8895, -0.5152, -0.9298,  2.8334])

Next, we check this with cosine similarity, since two of the "bank" occurrences mean nearly the same thing:

from scipy.spatial.distance import cosine
# Calculate the cosine similarity between the word bank 
# in "bank robber" vs "river bank" (different meanings).
diff_bank = 1 - cosine(token_vecs_sum[10], token_vecs_sum[19])
# Calculate the cosine similarity between the word bank
# in "bank robber" vs "bank vault" (same meaning).
same_bank = 1 - cosine(token_vecs_sum[10], token_vecs_sum[6])
print('Vector similarity for  *similar*  meanings:  %.2f' % same_bank)
print('Vector similarity for *different* meanings:  %.2f' % diff_bank)
>>
Vector similarity for  *similar*  meanings:  0.94
Vector similarity for *different* meanings:  0.69

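The gap is clear, which confirms that BERT encodes word vectors dynamically. To some extent this overcomes a shortcoming of static word vectors such as word2vec, where each word ends up with a single fixed vector representation.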
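For reference, scipy's `cosine` returns a cosine distance, which is why the code above computes `1 - cosine(...)`. The similarity itself is the dot product divided by the product of the vector norms, as in the sketch below. It uses only the five printed components of each "bank" vector, so the values it produces are not the 768-dimensional results reported above; it only illustrates the formula.

```python
import math


def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| * |v|)"""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# First five components of each "bank" vector, copied from the printout above.
bank_vault = [3.3596, -2.9805, -1.5421, 0.7065, 2.0031]
bank_robber = [2.7359, -2.5577, -1.3094, 0.6797, 1.6633]
river_bank = [1.5266, -0.8895, -0.5152, -0.9298, 2.8334]

same_bank_5d = cosine_similarity(bank_robber, bank_vault)
diff_bank_5d = cosine_similarity(bank_robber, river_bank)
print(same_bank_5d > diff_bank_5d)
```

Even on this five-component slice, the two financial uses of "bank" are closer to each other than either is to the river use.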

8. Pooling Strategy & Layer Choice

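The BERT authors tested word embeddings taken from different layers as input to a BiLSTM on an NER task and found that concatenating the last four layers gave the best F1 score. They also point out, however, that the best way to combine embeddings varies considerably from task to task, so it is generally advisable to test which embedding source works best for your own task.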
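To make the options concrete, the sketch below applies three pooling strategies that appear in these notes (summing the last four layers, concatenating the last four layers, and taking the second-to-last layer) to one toy token. The function names and the 3-feature vectors are placeholders of my own; with BERT-base each layer vector would have 768 features.

```python
def sum_last_four(layers):
    """Element-wise sum of the last four layer vectors (same size as one layer)."""
    dim = len(layers[0])
    return [sum(layer[i] for layer in layers[-4:]) for i in range(dim)]


def concat_last_four(layers):
    """Concatenation of the last four layer vectors (four times the size of one layer)."""
    return [x for layer in layers[-4:] for x in layer]


def second_to_last(layers):
    """The second-to-last layer vector on its own."""
    return list(layers[-2])


# Toy token: 13 layer vectors (embeddings + 12 layers) with 3 features each.
token_layers = [[float(layer)] * 3 for layer in range(13)]

print(len(sum_last_four(token_layers)))     # same length as one layer
print(len(concat_last_four(token_layers)))  # four times as long
print(second_to_last(token_layers))
```

Summing keeps the vector at the layer size, while concatenation makes it four times larger (3072 features for BERT-base), which is worth keeping in mind when choosing between them for a downstream model.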

bert-as-service

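Dr. Han Xiao open-sourced bert-as-service on GitHub, a service that makes it easy to get embeddings for text with BERT. He experimented with several ways of combining the hidden layers and came to these conclusions:

The first layer's embeddings carry no contextual information
As the layers get deeper, each layer captures more contextual information
By the last layer, the embeddings start to pick up information specific to BERT's two pre-training tasks (MLM & NSP)
The second-to-last layer's embeddings are a reasonable choice for downstream tasks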

Installation:
Windows 10 + Python 3.5 + tf 1.10.0; bert-as-service does not currently support tf 2.0.0.
I used Anaconda to create a new Python 3.5 virtual environment:

conda create -n tf python=3.5

After activating the virtual environment, install tensorflow==1.10.0:

pip install tensorflow==1.10.0

Then, still inside the virtual environment, install the bert-as-service packages:

pip install bert-serving-server # server
pip install bert-serving-client # client

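Download one of the pre-trained models released by Google or other open-source providers.
I downloaded the Chinese pre-trained model: https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip
Unzip it and place it at a path of your choice.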

Start the service:
In a console, go to the folder that contains bert-serving-start.exe and run:

bert-serving-start.exe -cpu -max_seq_len NONE -max_batch_size 16 -model_dir D:\project\data\chinese_L-12_H-768_A-12 -num_worker=1

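(-model_dir is the path to the pre-trained Chinese model, and num_worker can be set as you see fit.) If the service starts successfully, the server's startup screen appears.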

Then test it (still inside the virtual environment):

from bert_serving.client import BertClient
bc = BertClient()
print(bc.encode(['中国', '美国']))
>>
[[-0.03909698  0.3139335  -0.27065182 ...  0.03900041  0.20890802
  -0.6030004 ]
 [-0.24806723  0.3853286  -0.52268845 ... -0.1046574   0.17402393
  -0.149189  ]]

The word vectors come back successfully!

9. Sentence similarity

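It is worth noting that word-level similarity comparisons are not well suited to BERT embeddings, because the embeddings are context-dependent: a word's vector changes with the sentence it appears in. This enables wonderful things like polysemy, so that, for example, your representation encodes river "bank" rather than financial-institution "bank", but it makes direct word-to-word similarity comparisons less valuable. Sentence-embedding similarity comparisons, however, remain valid, so you can query a dataset of sentences with a single sentence to find the most similar ones. Depending on the similarity metric used, the resulting similarity values will be less informative than the relative ranking of the similarity outputs, because many similarity metrics make assumptions about the vector space (equally weighted dimensions, for example) that do not hold for the 768-dimensional vector space.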
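Since it is the ranking rather than the raw score that matters, a sentence query can be sketched as "score every candidate, then sort". The example below uses invented 3-dimensional vectors and placeholder sentence names in place of real 768-dimensional sentence embeddings; `rank_by_similarity` is a hypothetical helper, not a library function.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def rank_by_similarity(query_vec, corpus):
    """Return (score, sentence) pairs sorted from most to least similar."""
    scored = [(cosine_similarity(query_vec, vec), sent) for sent, vec in corpus.items()]
    return sorted(scored, reverse=True)


# Hypothetical sentence embeddings (real ones would come from BERT).
corpus = {
    "sentence A": [1.0, 0.0, 0.0],
    "sentence B": [0.8, 0.6, 0.0],
    "sentence C": [0.0, 0.0, 1.0],
}
query = [0.9, 0.1, 0.0]

ranking = [sent for _, sent in rank_by_similarity(query, corpus)]
print(ranking)
```

Only the order of the candidates is used here; the absolute similarity values are deliberately not interpreted.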

Getting BERT's outputs:

import torch
from transformers import BertTokenizer, BertModel

# Run on the GPU when one is available, otherwise on the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Local model directory; assumed to hold files that `from_pretrained` can load.
tokenizer=BertTokenizer.from_pretrained('chinese_L-12_H-768_A-12')
# Both inputs are the same sentence ("The weather is nice today").
input1=tokenizer('今天天气不错', return_tensors="pt", padding='max_length', max_length=30, truncation=True)
input2=tokenizer('今天天气不错', return_tensors="pt", padding='max_length', max_length=30, truncation=True)
input_ids, token_type_ids, attention_mask=[],[],[]
input_ids.append(input1['input_ids'])
token_type_ids.append(input1['token_type_ids'])
attention_mask.append(input1['attention_mask'])

input_ids.append(input2['input_ids'])
token_type_ids.append(input2['token_type_ids'])
attention_mask.append(input2['attention_mask'])

# Concatenate along the batch dimension to form a batch of two sentences.
input_ids = torch.cat(input_ids, dim=0)
token_type_ids = torch.cat(token_type_ids, dim=0)
attention_mask = torch.cat(attention_mask, dim=0)

print(input_ids.shape)  # torch.Size([2, 30])
bert=BertModel.from_pretrained('chinese_L-12_H-768_A-12').to(device)
out=bert(input_ids=input_ids.to(device),
         token_type_ids=token_type_ids.to(device),
         attention_mask=attention_mask.to(device),
         output_hidden_states=True,
         return_dict=True)

print(type(out.last_hidden_state),type(out.pooler_output),type(out.hidden_states))
print(out.last_hidden_state.shape)
print(len(out.hidden_states))  # 13 layers
print(len(out)) 
print(out.last_hidden_state)
print(out.hidden_states[0].shape,out.hidden_states[0])
>>
  
len(out.hidden_states) = 13: it contains the output of the embedding layer plus the outputs of the 12 BertLayer encoders.
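Because the batch above is padded to max_length=30, averaging over all 30 positions would mix [PAD] vectors into the sentence vector. One common workaround, which is my own addition and not part of the original tutorial, is a masked mean that uses the attention mask to skip padding. The sketch below shows the idea on a toy "last hidden state" for one sentence.

```python
def masked_mean(token_vectors, attention_mask):
    """Average only the vectors whose mask entry is 1, ignoring padded positions."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy hidden states: 4 positions x 2 features; the last two positions are padding.
last_hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0], [100.0, 100.0]]
mask = [1, 1, 0, 0]

sentence_vec = masked_mean(last_hidden, mask)
print(sentence_vec)
```

The padded positions do not affect the result, so two sentences of different lengths can be compared on equal footing.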

References:

A Guide to Word Vectors in BERT - CSDN blog
BERT Word Embeddings Tutorial · Chris McCormick (mccormickml.com)