First, find the content of the Trainer.train() procedure. The Trainer class used here is exported in __init__.py:

from .trainer import Trainer, set_seed, torch_distributed_zero_first, EvalPrediction
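As a reminder of how this entry point is usually reached, here is a minimal, hypothetical sketch of driving masked-LM training through Trainer (the checkpoint name, the toy dataset and the argument values are illustrative assumptions, not the original setup):

import torch
from torch.utils.data import Dataset
from transformers import BertForMaskedLM, Trainer, TrainingArguments

class ToyMLMDataset(Dataset):
    """Hypothetical dataset: each item already contains masked input_ids and their labels."""
    def __init__(self, input_ids, labels):
        self.input_ids, self.labels = input_ids, labels
    def __len__(self):
        return len(self.input_ids)
    def __getitem__(self, idx):
        return {"input_ids": torch.tensor(self.input_ids[idx]),
                "labels": torch.tensor(self.labels[idx])}

# two copies of the example pair shown just below, only to form a batch
input_ids = [[2, 193, 194, 8982, 23, 4, 15, 1073, 3, 418, 43, 13, 319, 8981,
              4622, 258, 4937, 4, 36, 864, 339, 1162, 3]] * 2
labels = [[-100, -100, -100, -100, -100, 453, -100, -100, -100, -100, -100,
           -100, -100, -100, -100, -100, -100, 83, -100, -100, -100, -100, -100]] * 2

model = BertForMaskedLM.from_pretrained("bert-base-chinese")   # assumed checkpoint
args = TrainingArguments(output_dir="./mlm_out", num_train_epochs=1,
                         per_device_train_batch_size=2)
Trainer(model=model, args=args, train_dataset=ToyMLMDataset(input_ids, labels)).train()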
The input_ids and labels passed in look like this:
input_ids =
[2, 193, 194, 8982, 23, 4, 15, 1073, 3, 418, 43, 13, 319, 8981,
4622, 258, 4937, 4, 36, 864, 339, 1162, 3]
labels =
[-100, -100, -100, -100, -100, 453, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, 83, -100, -100, -100, -100, -100]
In input_ids, positions 5 and 17 (counting from 0) are the ones selected for prediction; labels records their original token ids, 453 and 83, while in input_ids those positions have already been replaced (both hold 4 here, presumably the [MASK] id). Masking is done with n-grams of length [1, 2, 3] sampled with probabilities [0.7, 0.2, 0.1], and for every token a random value is drawn: if it falls in (0, 0.15×0.8) the token is replaced by [MASK] and predicts itself, if it falls in (0.15×0.8, 0.15×0.9) the token is kept as itself and still predicts itself, and if it falls in (0.15×0.9, 1) the token is left unchanged and is not predicted.
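To make those probability bands concrete, here is a minimal sketch of the masking rule for single tokens (the MASK_ID value and the helper name are assumptions for illustration; the n-gram variant would select spans of length 1-3 with probabilities [0.7, 0.2, 0.1] instead of single positions):

import random

MASK_ID = 4          # assumed id of [MASK] in this vocabulary (hypothetical)
IGNORE_INDEX = -100  # positions carrying this label are skipped by the loss

def mask_tokens(token_ids):
    """Apply the masking bands described above to a list of token ids."""
    input_ids, labels = [], []
    for tok in token_ids:
        p = random.random()
        if p < 0.15 * 0.8:        # replace with [MASK], predict the original token
            input_ids.append(MASK_ID)
            labels.append(tok)
        elif p < 0.15 * 0.9:      # keep the token but still predict it
            input_ids.append(tok)
            labels.append(tok)
        else:                     # leave unchanged, nothing to predict
            input_ids.append(tok)
            labels.append(IGNORE_INDEX)
    return input_ids, labels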
On top of the plain BertModel output, the masked-LM model adds a corresponding head:
(cls): BertOnlyMLMHead(
  (predictions): BertLMPredictionHead(
    (transform): BertPredictionHeadTransform(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    )
    (decoder): Linear(in_features=768, out_features=21128, bias=True)
  )
)
Finally, the loss is the cross-entropy between the predictions computed from input_ids and the labels:
sequence_output = outputs[0]
prediction_scores = self.cls(sequence_output)

masked_lm_loss = None
if labels is not None:
    loss_fct = CrossEntropyLoss()  # -100 index = padding token
    masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))
Here the shapes are
prediction_scores.size() = torch.Size([10, 13, 9448])
labels.size() = torch.Size([10, 13])
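Both tensors are flattened before the loss: the logits become (batch_size * seq_len, vocab_size) and the labels become (batch_size * seq_len,), and every position labelled -100 is ignored (CrossEntropyLoss's ignore_index defaults to -100). A small self-contained sketch with stand-in values matching the shapes above:

import torch
from torch.nn import CrossEntropyLoss

batch_size, seq_len, vocab_size = 10, 13, 9448
prediction_scores = torch.randn(batch_size, seq_len, vocab_size)  # stand-in for self.cls(sequence_output)
labels = torch.full((batch_size, seq_len), -100)                  # -100 everywhere = ignored by the loss
labels[:, 5] = 453   # arbitrary example positions: only masked tokens get a real target id
labels[:, 7] = 83

loss_fct = CrossEntropyLoss()  # ignore_index defaults to -100
masked_lm_loss = loss_fct(prediction_scores.view(-1, vocab_size), labels.view(-1))
print(masked_lm_loss)  # scalar, averaged over the non-ignored positions only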
Next, let's look at next-sentence prediction during pretraining.
Next-sentence prediction is implemented by the BertForNextSentencePrediction class in transformers.
A single next-sentence example can only contain two sentences.
For instance, with a maximum length of 50, the input might be:
[CLS]谷歌和[MASK][MASK]都是不存在的。[SEP]同时,[MASK]也是不存在的。[SEP]
Here batch_size = 1 (to score several sentence pairs at once, just build a batch with more rows), so the input ids form a (1, 50) tensor. BertModel maps this to a (1, 50, 768) matrix; the pooler takes the position-0 ([CLS]) vector, giving (1, 768), and passes it through a dense layer with a tanh activation; a (hidden_size, 2) linear layer then produces the logits [[-3.0729, 5.9056]]. The loss is the cross-entropy between these logits and the label [[1]] (marked as being the next sentence).
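As a rough sketch of how such a pair could be scored with a recent version of the transformers API (the checkpoint name, the un-masked sentences and the label value are illustrative assumptions, not taken from the original run; check the class docstring for which label index means "sentence B really follows sentence A"):

import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")   # assumed checkpoint
model = BertForNextSentencePrediction.from_pretrained("bert-base-chinese")

sent_a = "谷歌和微软都是不存在的。"
sent_b = "同时,苹果也是不存在的。"
inputs = tokenizer(sent_a, sent_b, padding="max_length", max_length=50, return_tensors="pt")

label = torch.LongTensor([1])            # illustrative label, mirroring the [[1]] above
outputs = model(**inputs, labels=label)
print(outputs.loss, outputs.logits)      # logits have shape (1, 2), like [[-3.0729, 5.9056]]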
Finally, note how the parameters are initialized before pretraining:
class BertPreTrainedModel(PreTrainedModel):
    """
    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
    models.
    """

    config_class = BertConfig
    load_tf_weights = load_tf_weights_in_bert
    base_model_prefix = "bert"
    _keys_to_ignore_on_load_missing = [r"position_ids"]

    def _init_weights(self, module):
        """ Initialize the weights """
        if isinstance(module, (nn.Linear, nn.Embedding)):
            # Slightly different from the TF version which uses truncated_normal for initialization
            # cf https://github.com/pytorch/pytorch/pull/5617
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
        elif isinstance(module, nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)
        if isinstance(module, nn.Linear) and module.bias is not None:
            module.bias.data.zero_()
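In PreTrainedModel this hook is applied recursively to every submodule (via nn.Module.apply) when the weights are initialized, so each Linear, Embedding and LayerNorm gets the scheme above. A tiny standalone sketch of the same mechanism, using an assumed initializer_range of 0.02 (BertConfig's default) and a toy module instead of the real model:

import torch.nn as nn

initializer_range = 0.02  # assumed value; BertConfig's default initializer_range

def init_weights(module):
    """Same scheme as _init_weights above, written as a free function."""
    if isinstance(module, (nn.Linear, nn.Embedding)):
        module.weight.data.normal_(mean=0.0, std=initializer_range)
    elif isinstance(module, nn.LayerNorm):
        module.bias.data.zero_()
        module.weight.data.fill_(1.0)
    if isinstance(module, nn.Linear) and module.bias is not None:
        module.bias.data.zero_()

toy = nn.Sequential(nn.Embedding(100, 768), nn.LayerNorm(768), nn.Linear(768, 2))
toy.apply(init_weights)  # apply() visits every submodule, just like the real initialization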