Deep Learning in Practice (4): How to Add Tokens and New Special Placeholder Tokens to the BERT Vocabulary

Adding tokens to the BERT vocabulary

  • Problem statement
  • Adding special tokens with add_special_tokens
  • Other special-token attributes
  • Errors and fixes

Problem statement

In practical applications and in research we often need to add special placeholder tokens while still using BERT to produce the embeddings. If you want to browse BERT's built-in vocabularies, the corresponding vocab files can be found at the links below:

PRETRAINED_VOCAB_FILES_MAP = {
    "vocab_file": {
        "bert-base-uncased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt",
        "bert-large-uncased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt",
        "bert-base-cased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt",
        "bert-large-cased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt",
        "bert-base-multilingual-uncased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased-vocab.txt",
        "bert-base-multilingual-cased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-vocab.txt",
        "bert-base-chinese": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-vocab.txt",
        "bert-base-german-cased": "https://int-deepset-models-bert.s3.eu-central-1.amazonaws.com/pytorch/bert-base-german-cased-vocab.txt",
        "bert-large-uncased-whole-word-masking": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-vocab.txt",
        "bert-large-cased-whole-word-masking": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-vocab.txt",
        "bert-large-uncased-whole-word-masking-finetuned-squad": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-finetuned-squad-vocab.txt",
        "bert-large-cased-whole-word-masking-finetuned-squad": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-finetuned-squad-vocab.txt",
        "bert-base-cased-finetuned-mrpc": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-finetuned-mrpc-vocab.txt",
        "bert-base-german-dbmdz-cased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-cased-vocab.txt",
        "bert-base-german-dbmdz-uncased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-uncased-vocab.txt",
        "TurkuNLP/bert-base-finnish-cased-v1": "https://s3.amazonaws.com/models.huggingface.co/bert/TurkuNLP/bert-base-finnish-cased-v1/vocab.txt",
        "TurkuNLP/bert-base-finnish-uncased-v1": "https://s3.amazonaws.com/models.huggingface.co/bert/TurkuNLP/bert-base-finnish-uncased-v1/vocab.txt",
        "wietsedv/bert-base-dutch-cased": "https://s3.amazonaws.com/models.huggingface.co/bert/wietsedv/bert-base-dutch-cased/vocab.txt",
    }
}
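If you only want to check the size or contents of one of these vocabularies, you do not need to open the vocab.txt file by hand. A minimal sketch (assuming the transformers library is installed and the checkpoint can be downloaded):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
print(len(tokenizer))                     # vocabulary size (28996 for bert-base-cased)
print(list(tokenizer.vocab.items())[:5])  # a few (token, id) pairs from the WordPiece vocabulary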

At first I tried the naive approach of editing the vocab.txt file linked above and appending the special tokens I wanted, but BERT's tokenizer still split my token into three pieces. It turns out that BERT's tokenizer already ships a dedicated interface for registering special placeholder tokens.

Adding special tokens with add_special_tokens

The method that registers new special tokens with the tokenizer is add_special_tokens. Using an illustrative token '<e>' (any custom string works), the call looks like this:

tokenizer.add_special_tokens({'additional_special_tokens': ['<e>']})  # '<e>' is an example token; substitute your own

Here we add the token under the additional_special_tokens category. A quick experiment shows how tokenization behaves before and after the token is registered:

# Load the model and tokenizer
from transformers import BertTokenizer, BertModel
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')  # BERT tokenizer
bertmodel = BertModel.from_pretrained('bert-base-cased', from_tf=True)  # load the TF checkpoint into PyTorch (requires TensorFlow; drop from_tf=True to use the native PyTorch weights)
text = " I love <e> ! "  # '<e>' is the example custom token
# For a single sentence, wrap it with [CLS] and [SEP]
text = "[CLS] " + text + " [SEP]"
# Tokenize before the special token is registered
tokenized_text1 = tokenizer.tokenize(text)
print(tokenized_text1)
indexed_tokens1 = tokenizer.convert_tokens_to_ids(tokenized_text1)
# Build the tensors that the BERT model expects
segments_ids1 = [0] * len(tokenized_text1)  # token_type_ids: a single sentence is segment 0
tokens_tensor1 = torch.tensor([indexed_tokens1])  # convert the list to a tensor
segments_tensors1 = torch.tensor([segments_ids1])
# Per-token embeddings (last hidden states)
word_vectors1 = bertmodel(tokens_tensor1, token_type_ids=segments_tensors1)[0]
# Sentence embedding (pooler output)
sentence_vector1 = bertmodel(tokens_tensor1, token_type_ids=segments_tensors1)[1]
# Register the custom token, then tokenize the same text again
tokenizer.add_special_tokens({'additional_special_tokens': ['<e>']})
print(tokenizer.additional_special_tokens)      # the tokens in this category
print(tokenizer.additional_special_tokens_ids)  # and their ids
tokenized_text1 = tokenizer.tokenize(text)
print(tokenized_text1)

The output looks like this:
(screenshot: tokenizer output before and after the token is registered)
You can see clearly that once the token has been registered as a special token, the tokenizer no longer splits it apart!
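One more detail worth checking (a small sketch continuing the code above; '<e>' is the illustrative token): newly registered tokens are appended to the end of the vocabulary, so their ids start at the original vocabulary size. This is exactly why the model's embedding matrix has to be resized later (see the last section).

print(len(tokenizer))                            # original vocab size + number of added tokens
print(tokenizer.convert_tokens_to_ids(['<e>']))  # the new id sits past the original vocab size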

Other special-token attributes

Besides the additional_special_tokens category used above, transformers exposes the following special-token attributes:

    SPECIAL_TOKENS_ATTRIBUTES = [
        "bos_token",
        "eos_token",
        "unk_token",
        "sep_token",
        "pad_token",
        "cls_token",
        "mask_token",
        "additional_special_tokens",
    ]

You can list them with the following code:

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')  # BERT tokenizer
print(tokenizer.SPECIAL_TOKENS_ATTRIBUTES)
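Each of these attributes can be set through the same add_special_tokens call. A short sketch (the token strings [BOS], [EOS], <e1> and <e2> are only illustrative, not part of the original example):

num_added = tokenizer.add_special_tokens({
    'bos_token': '[BOS]',  # each token is only added if it is not already in the vocabulary
    'eos_token': '[EOS]',
    'additional_special_tokens': ['<e1>', '<e2>'],
})
print(num_added)                     # how many tokens were actually added
print(tokenizer.special_tokens_map)  # the current mapping of every special-token attribute

Whichever attribute you use, the model's embedding matrix still has to be resized afterwards, as described in the next section.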

Errors and fixes

After adding special tokens, getting embeddings from a RoBERTa or BERT model fails with: Cuda error during evaluation - CUBLAS_STATUS_NOT_INITIALIZED

This happens because the new tokens were added after the model had already been built and put into eval mode, so their ids point past the end of the model's token embedding matrix. After adding new tokens you must run the following code:

robertamodel.resize_token_embeddings(len(tokenizer))
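Putting everything together, here is a minimal end-to-end sketch of the safe order of operations with BERT (the token '<e>' is again only an example):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
bertmodel = BertModel.from_pretrained('bert-base-cased')

# 1. register the new special token(s) first
tokenizer.add_special_tokens({'additional_special_tokens': ['<e>']})
# 2. then resize the embedding matrix so every new id has a row
bertmodel.resize_token_embeddings(len(tokenizer))
bertmodel.eval()

# 3. only now tokenize the text and run the model
text = "[CLS] I love <e> ! [SEP]"
tokens = tokenizer.tokenize(text)
ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
with torch.no_grad():
    word_vectors = bertmodel(ids)[0]  # per-token embeddings
print(word_vectors.shape)             # (1, sequence_length, 768)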
