The output of BertModel is a tuple; with hidden states and attentions enabled it contains 4 elements:
Return:
:obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:
last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):
Sequence of hidden-states at the output of the last layer of the model.
pooler_output (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, hidden_size)`):
Last layer hidden-state of the first token of the sequence (classification token)
further processed by a Linear layer and a Tanh activation function. The Linear
layer weights are trained from the next sentence prediction (classification)
objective during pre-training.
This output is usually *not* a good summary
of the semantic content of the input; you're often better off averaging or pooling
the sequence of hidden-states over the whole input sequence.
hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):
Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
of shape :obj:`(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
Attention weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
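As the docstring notes, pooler_output is often a poor summary of the input; a common alternative is to average last_hidden_state over the tokens. The snippet below is only a minimal sketch of that idea (it loads bert-base-uncased itself; with padded batches you would additionally mask out padding tokens before averaging):
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

input_ids = torch.tensor([tokenizer.encode("Let's see all hidden-states and attentions on this text")])
with torch.no_grad():
    last_hidden_state = model(input_ids)[0]      # (batch_size, sequence_length, hidden_size)

# Average over the sequence dimension to get one vector per input sequence
sentence_embedding = last_hidden_state.mean(dim=1)
print(sentence_embedding.shape)                  # torch.Size([1, 768])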
When ``config.output_hidden_states=True``, hidden_states is returned as a tuple whose first element is the embedding output and whose remaining elements are the outputs of each layer; every element has shape (batch_size, sequence_length, hidden_size). When ``config.output_attentions=True``, attentions is returned as a tuple with one element per layer, holding the attention weights used to compute the weighted average in the self-attention heads. Here is the example code:
import torch
from transformers import BertModel, BertTokenizer

# Each architecture is provided with several classes for fine-tuning on down-stream tasks, e.g.
BERT_MODEL_CLASSES = [BertModel]

# All the classes for an architecture can be instantiated from pretrained weights for this architecture.
# Note that additional weights added for fine-tuning are only initialized
# and need to be trained on the down-stream task.
pretrained_weights = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(pretrained_weights)

for model_class in BERT_MODEL_CLASSES:
    # Load pretrained model/tokenizer
    model = model_class.from_pretrained(pretrained_weights)
    # Models can return the full list of hidden-states & attention weights at each layer
    model = model_class.from_pretrained(pretrained_weights,
                                        output_hidden_states=True,
                                        output_attentions=True)
    input_ids = torch.tensor([tokenizer.encode("Let's see all hidden-states and attentions on this text")])
    # In older transformers versions (< v4) the forward pass returns a plain tuple;
    # in v4+ pass return_dict=False to get the same unpacking behavior.
    last_hidden_state, pooler_output, all_hidden_states, all_attentions = model(input_ids)
    print('input_ids:', input_ids)
    print('last_hidden_state.shape:', last_hidden_state.shape)
    print('pooler_output.shape:', pooler_output.shape)
    print('len(all_hidden_states):', len(all_hidden_states))
    print('len(all_attentions):', len(all_attentions))
    print('all_hidden_states[-2]:', all_hidden_states[-2])
    print('all_hidden_states[-2].shape:', all_hidden_states[-2].shape)
Output:
input_ids: tensor([[ 101, 2292, 1005, 1055, 2156, 2035, 5023, 1011, 2163, 1998, 3086, 2015, 2006, 2023, 3793, 102]])
last_hidden_state.shape: torch.Size([1, 16, 768])
pooler_output.shape: torch.Size([1, 768])
len(all_hidden_states): 13
len(all_attentions): 12
all_hidden_states[-2]: tensor([[[ 0.3522, -0.6508, 0.4068, …, -0.5943, -0.1012, 0.3161],
[ 0.9840, -0.2480, 0.0171, …, -0.0287, 1.1418, -0.4333],
[ 0.0406, 0.0278, -0.0156, …, -0.0117, -0.0351, 0.0244],
…,
[-0.4968, 0.1059, 0.1520, …, -1.0849, 0.3682, 0.6323],
[-0.0365, -0.2779, -0.3252, …, -0.0088, 0.0322, -0.4090],
[ 0.0271, 0.0178, -0.0082, …, 0.0126, -0.0168, 0.0107]]],
grad_fn=<NativeLayerNormBackward>)
all_hidden_states[-2].shape: torch.Size([1, 16, 768])
As shown, len(all_hidden_states) is 13 while len(all_attentions) is 12: all_hidden_states has one extra element because it also contains the embedding-layer output.
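For completeness, each element of all_attentions from the run above has the shape (batch_size, num_heads, sequence_length, sequence_length) described in the docstring. Below is a minimal sketch of inspecting it, reusing the variables from the example; bert-base-uncased has 12 attention heads and the input above has 16 tokens, so the expected shape is (1, 12, 16, 16):
# Reusing all_attentions from the example above: one tensor per layer.
first_layer_attention = all_attentions[0]    # (batch_size, num_heads, seq_len, seq_len)
print(first_layer_attention.shape)           # torch.Size([1, 12, 16, 16])

# Attention distribution of head 0 in the last layer for the [CLS] token.
# Each row sums to 1 because the weights come out of a softmax.
cls_attention = all_attentions[-1][0, 0, 0]
print(cls_attention.sum())                   # ≈ 1.0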