BERT explained in detail (traditional-Chinese write-up, nicely formatted)
A collection of BERT-related resources: theory
Learn to use the PyTorch version of BERT in one article (English example)
import torch
import numpy as np
from transformers import BertTokenizer, BertForMaskedLM # imports added; assumes the transformers library with tuple outputs (pre-4.x or return_dict=False), since outputs is indexed as a tuple below

model_name = 'bert-base-chinese' # the pre-trained model weights to download
# Task 1: masked language modeling
# In pre-training, BERT uses [CLS] to mark the beginning of the input and [SEP] to mark sentence boundaries
samples = ['[CLS] 中国的首都是哪里? [SEP] 北京是 [MASK] 国的首都。 [SEP]'] # the sentence fed to the model
tokenizer = BertTokenizer.from_pretrained(model_name)
tokenized_text = [tokenizer.tokenize(i) for i in samples] # split each sentence into tokens, i.e. individual characters plus the special tokens
input_ids = [tokenizer.convert_tokens_to_ids(i) for i in tokenized_text] # map each token to its vocabulary index
input_ids = torch.LongTensor(input_ids)
# load the pre-trained model
model = BertForMaskedLM.from_pretrained(model_name, cache_dir="E:/transformer_file/")
model.eval()
outputs = model(input_ids)
#---------------------------------------------------------------------------------
prediction_scores = outputs[0] #prediction_scores.shape=torch.Size([1, 21, 21128])
# for BertForMaskedLM, outputs[0] holds the prediction scores (logits over the vocabulary), not the last hidden state
sample = prediction_scores[0].detach().numpy() # sample.shape = (21, 21128)
pred = np.argmax(sample, axis=1) # 21 is the sequence length; pred holds the highest-scoring token index at each position
print(tokenizer.convert_ids_to_tokens(pred)[14]) # the [MASK] token sits at position 14
#------------------------------------------------------------------------------
masked_idx = 14 # position of the [MASK] token in the sequence
pred_score = outputs[0][0][masked_idx] # scores for every word in the vocabulary at the masked position
pred_score = pred_score.to("cpu").numpy() # move to CPU and convert from tensor to numpy array
#------------------------------------------------------------------------------------------------
Using the PyTorch version of BERT works as follows:
(1)First prepare a tokenized input with BertTokenizer
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM
# load the pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Tokenized input
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = tokenizer.tokenize(text)
# Mask a token that we will try to predict back with `BertForMaskedLM`
masked_index = 8
tokenized_text[masked_index] = '[MASK]'
assert tokenized_text == ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']
# convert tokens to their vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
# define the segment ids for sentence A and sentence B
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
# convert the inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
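The segment ids above are typed out by hand; a tiny helper like the one below (a hypothetical convenience of mine, not part of the original example) derives them from the [SEP] positions instead:

def make_segment_ids(tokens):
    seg, current = [], 0
    for t in tokens:
        seg.append(current)
        if t == '[SEP]':
            current = 1 # tokens after the first [SEP] belong to sentence B
    return seg

assert make_segment_ids(tokenized_text) == segments_ids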
(2)Use BertModel to get the hidden states
# load the pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()
# GPU & put everything on cuda
tokens_tensor = tokens_tensor.to('cuda')
segments_tensors = segments_tensors.to('cuda')
model.to('cuda')
# get the hidden states of every layer
with torch.no_grad():
    encoded_layers, _ = model(tokens_tensor, segments_tensors)
# bert-base-uncased has 12 layers, so there are 12 sets of hidden states
assert len(encoded_layers) == 12
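To make the shapes concrete, a short check (my own addition, relying on the variables from step (2)) confirms that each layer's output has the (batch_size, sequence_length, hidden_size) shape:

for layer in encoded_layers:
    assert layer.shape == (1, len(tokenized_text), 768) # hidden_size is 768 for bert-base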
(3)Use BertForMaskedLM to predict the masked token
# load the pre-trained model (weights)
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()
# cuda
tokens_tensor = tokens_tensor.to('cuda')
segments_tensors = segments_tensors.to('cuda')
model.to('cuda')
# Predict all tokens
with torch.no_grad():
    predictions = model(tokens_tensor, segments_tensors)
# confirm we were able to predict 'henson'
predicted_index = torch.argmax(predictions[0, masked_index]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
assert predicted_token == 'henson'
Return values of BertModel (quoted from the transformers docstring):
:obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:
last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):
Sequence of hidden-states at the output of the last layer of the model.
pooler_output (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, hidden_size)`):
Last layer hidden-state of the first token of the sequence (classification token)
further processed by a Linear layer and a Tanh activation function. The Linear
layer weights are trained from the next sentence prediction (classification)
objective during pre-training.
This output is usually *not* a good summary of the semantic content of the input;
you're often better off averaging or pooling the sequence of hidden-states
for the whole input sequence.
hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
of shape :obj:`(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
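For the transformers API that this docstring comes from, a minimal sketch (assuming transformers 3.1+ so that return_dict=False is accepted, and the bert-base-uncased weights) shows how the fields above map onto the returned tuple:

import torch
from transformers import BertModel, BertTokenizer

tok = BertTokenizer.from_pretrained('bert-base-uncased')
mdl = BertModel.from_pretrained('bert-base-uncased',
                                output_hidden_states=True,
                                output_attentions=True)
mdl.eval()
ids = torch.tensor([tok.convert_tokens_to_ids(tok.tokenize('[CLS] who was jim henson ? [SEP]'))])
with torch.no_grad():
    last_hidden_state, pooler_output, hidden_states, attentions = mdl(ids, return_dict=False)
print(last_hidden_state.shape)   # (batch_size, sequence_length, hidden_size)
print(pooler_output.shape)       # (batch_size, hidden_size)
print(len(hidden_states))        # embedding output + 12 layers = 13 for bert-base
print(attentions[0].shape)       # (batch_size, num_heads, sequence_length, sequence_length)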
Parameters involved when using BERT
BERT source code
BertForMaskedLM outputs, at the masked position, a score for every word in the vocabulary, i.e. how likely each word is to appear there.
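To read those scores as probabilities, a short sketch (my own addition, reusing prediction_scores, masked_idx and tokenizer from the Chinese example above) applies a softmax over the vocabulary dimension and lists the five most likely words for the masked slot:

probs = torch.softmax(prediction_scores[0, masked_idx].detach(), dim=-1) # distribution over the 21128-word vocabulary
top_probs, top_ids = probs.topk(5)
for p, i in zip(top_probs.tolist(), top_ids.tolist()):
    print(tokenizer.convert_ids_to_tokens([i])[0], round(p, 4)) # candidate word and its probability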