In real-world applications and in academic research we often need to add special placeholder tokens while still using BERT to produce the embeddings. If you are curious about BERT's own vocabulary, the vocab files can be inspected at the corresponding links below:
PRETRAINED_VOCAB_FILES_MAP = {
"vocab_file": {
"bert-base-uncased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt",
"bert-large-uncased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt",
"bert-base-cased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt",
"bert-large-cased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt",
"bert-base-multilingual-uncased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased-vocab.txt",
"bert-base-multilingual-cased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-vocab.txt",
"bert-base-chinese": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-vocab.txt",
"bert-base-german-cased": "https://int-deepset-models-bert.s3.eu-central-1.amazonaws.com/pytorch/bert-base-german-cased-vocab.txt",
"bert-large-uncased-whole-word-masking": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-vocab.txt",
"bert-large-cased-whole-word-masking": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-vocab.txt",
"bert-large-uncased-whole-word-masking-finetuned-squad": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-finetuned-squad-vocab.txt",
"bert-large-cased-whole-word-masking-finetuned-squad": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-finetuned-squad-vocab.txt",
"bert-base-cased-finetuned-mrpc": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-finetuned-mrpc-vocab.txt",
"bert-base-german-dbmdz-cased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-cased-vocab.txt",
"bert-base-german-dbmdz-uncased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-uncased-vocab.txt",
"TurkuNLP/bert-base-finnish-cased-v1": "https://s3.amazonaws.com/models.huggingface.co/bert/TurkuNLP/bert-base-finnish-cased-v1/vocab.txt",
"TurkuNLP/bert-base-finnish-uncased-v1": "https://s3.amazonaws.com/models.huggingface.co/bert/TurkuNLP/bert-base-finnish-uncased-v1/vocab.txt",
"wietsedv/bert-base-dutch-cased": "https://s3.amazonaws.com/models.huggingface.co/bert/wietsedv/bert-base-dutch-cased/vocab.txt",
}
}
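If you only want to check whether a candidate placeholder is already covered by the vocabulary, a minimal sketch such as the following avoids downloading the vocab file by hand (the placeholder name [SPT] used throughout this post is just an illustrative example):
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
print(len(tokenizer))               # size of the vocabulary
print('[SPT]' in tokenizer.vocab)   # False: the placeholder is not a known word piece
print(tokenizer.tokenize('[SPT]'))  # so the tokenizer splits it into several pieces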
At first I tried to simply edit the vocab.txt from the links above and append the special placeholder we wanted, but BERT's tokenizer still split it into three pieces. Eventually I found that the transformers library thoughtfully provides an interface for adding special placeholder tokens: the tokenizer method add_special_tokens. In code it looks like this (here I write the placeholder as [SPT]; substitute whatever name you need):
tokenizer.add_special_tokens({'additional_special_tokens': ["[SPT]"]})
Here we add the placeholder to the additional_special_tokens category of special tokens. We can run a small experiment and compare how the text is tokenized before and after the placeholder is added:
# import the model and the tokenizer
from transformers import BertTokenizer, BertModel, RobertaTokenizer, RobertaModel
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')  # BERT's tokenizer
bertmodel = BertModel.from_pretrained('bert-base-cased', from_tf=True)  # load the TF weights for PyTorch
text = " I love [SPT] ! "  # [SPT] is our example placeholder
# for a single sentence, add [CLS] at the start and [SEP] at the end
text = "[CLS] " + text + " [SEP]"
# then tokenize
tokenized_text1 = tokenizer.tokenize(text)
print(tokenized_text1)
indexed_tokens1 = tokenizer.convert_tokens_to_ids(tokenized_text1)
# after tokenization, build the tensors that the BERT model expects
segments_ids1 = [1] * len(tokenized_text1)
tokens_tensor1 = torch.tensor([indexed_tokens1])  # turn the list into a tensor
segments_tensors1 = torch.tensor([segments_ids1])
# get the embeddings of all tokens
word_vectors1 = bertmodel(tokens_tensor1, token_type_ids=segments_tensors1)[0]
# get the sentence embedding (pooled output)
sentence_vector1 = bertmodel(tokens_tensor1, token_type_ids=segments_tensors1)[1]
tokenizer.add_special_tokens({'additional_special_tokens': ["[SPT]"]})
print(tokenizer.additional_special_tokens)      # list the tokens in this category
print(tokenizer.additional_special_tokens_ids)  # and their ids
tokenized_text1 = tokenizer.tokenize(text)
print(tokenized_text1)
Running this, the printed output shows the effect clearly: before being registered, the placeholder is chopped into several word pieces, but once it has been added as a special token it is no longer split!
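As a quick sanity check (again with the illustrative [SPT] name), you can also confirm that the placeholder now maps to a single id appended to the end of the vocabulary:
print(tokenizer.convert_tokens_to_ids('[SPT]'))  # one new id, beyond the original vocabulary
print(len(tokenizer))                            # the tokenizer has grown by one entry
This larger vocabulary is exactly why the model's embedding matrix has to be resized, as discussed at the end of this post.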
Besides the additional_special_tokens slot used above, transformers also exposes the following special-token slots:
SPECIAL_TOKENS_ATTRIBUTES = [
"bos_token",
"eos_token",
"unk_token",
"sep_token",
"pad_token",
"cls_token",
"mask_token",
"additional_special_tokens",
]
You can list them with the following code:
from transformers import BertTokenizer,BertModel,RobertaTokenizer, RobertaModel
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')  # BERT's tokenizer
print(tokenizer.SPECIAL_TOKENS_ATTRIBUTES)
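Each of these slots can be filled through the same add_special_tokens call; as a minimal sketch (the token name [BOS] is again only an example):
tokenizer.add_special_tokens({'bos_token': '[BOS]'})  # assign the bos_token slot
print(tokenizer.bos_token, tokenizer.bos_token_id)    # every slot has matching *_token / *_token_id attributes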
After adding special tokens, getting embeddings from a RoBERTa or BERT model fails with: Cuda error during evaluation - CUBLAS_STATUS_NOT_INITIALIZED
This happens because the new tokens were added after the model was built, so their ids fall outside the model's original embedding matrix. Once you have finished adding the new tokens, you must run:
robertamodel.resize_token_embeddings(len(tokenizer))
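As a minimal end-to-end sketch of the safe order of operations (the model name, the GPU usage and the [SPT] placeholder are assumptions for illustration): load the model, add the tokens, resize the embeddings, and only then move to eval mode on the GPU.
from transformers import RobertaTokenizer, RobertaModel

robertatokenizer = RobertaTokenizer.from_pretrained('roberta-base')
robertamodel = RobertaModel.from_pretrained('roberta-base')
robertatokenizer.add_special_tokens({'additional_special_tokens': ['[SPT]']})  # example placeholder
robertamodel.resize_token_embeddings(len(robertatokenizer))  # grow the embedding matrix to cover the new ids
robertamodel = robertamodel.cuda().eval()  # now it is safe to evaluate on the GPU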