BERT pretraining process

First, locate the Trainer.train() routine that drives training.
The Trainer used here is exported in __init__.py:

from .trainer import Trainer, set_seed, torch_distributed_zero_first, EvalPrediction
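For context, here is a minimal sketch of how Trainer is typically wired up for MLM pretraining (the checkpoint name, the corpus file corpus.txt, and the hyperparameters are illustrative assumptions; the stock DataCollatorForLanguageModeling does single-token 15% / 80-10-10 masking, whereas the run described below uses n-gram masking):

from datasets import load_dataset
from transformers import (
    BertForMaskedLM, BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer, TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")

# Illustrative corpus: one sentence per line in a local text file.
dataset = load_dataset("text", data_files={"train": "corpus.txt"})["train"]
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

# Masks 15% of tokens (80% [MASK], 10% random, 10% unchanged) and sets
# labels to -100 everywhere else.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(output_dir="mlm-out", num_train_epochs=1,
                         per_device_train_batch_size=8)
Trainer(model=model, args=args, train_dataset=dataset,
        data_collator=collator).train()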

The input_ids and labels fed into the model look like this:

input_ids = 
[2, 193, 194, 8982, 23, 4, 15, 1073, 3, 418, 43, 13, 319, 8981, 
 4622, 258, 4937, 4, 36, 864, 339, 1162, 3]
labels = 
[-100, -100, -100, -100, -100, 453, -100, -100, -100, -100, -100, 
 -100, -100, -100, -100, -100, -100, 83, -100, -100, -100, -100, -100]

In labels, positions 5 and 17 (0-indexed) hold the original token ids of the masked positions, i.e. 453 and 83; every other position is -100 and is ignored by the loss. N-gram masking is configured with ngram = [1, 2, 3] and probabilities [0.7, 0.2, 0.1]. input_ids holds the tokens after replacement (at positions 5 and 17 the value is 4, which is the [MASK] id here): for each position, a uniform draw in (0, 0.15*0.8) replaces the token with [MASK] and predicts the original, a draw in (0.15*0.8, 0.15*0.9) keeps the token but still predicts it, and a draw in (0.15*0.9, 1) leaves the token unchanged with no prediction.
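A minimal sketch of this masking rule (function and variable names are assumptions, not the actual pretraining code): a single uniform draw per position decides its fate, and labels keeps the original id only where a prediction is required.

import random

MASK_ID = 4  # assumed [MASK] id, matching the example input_ids above

def mask_tokens(input_ids, special_positions=()):
    """Single-token version of the masking rule described above."""
    masked, labels = list(input_ids), [-100] * len(input_ids)
    for i, tok in enumerate(input_ids):
        if i in special_positions:      # never mask [CLS]/[SEP]
            continue
        u = random.random()
        if u < 0.15 * 0.8:              # (0, 0.15*0.8): replace with [MASK], predict original
            masked[i] = MASK_ID
            labels[i] = tok
        elif u < 0.15 * 0.9:            # (0.15*0.8, 0.15*0.9): keep the token, still predict it
            labels[i] = tok
        # else (0.15*0.9, 1): keep the token, no prediction (label stays -100)
    return masked, labels

The ngram = [1, 2, 3] setting extends this single-token version by first drawing a span length (1, 2 or 3 with probabilities 0.7/0.2/0.1) and applying the same rule to the whole span.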
On top of the original BertModel, an extra head is added to produce the output:

  (cls): BertOnlyMLMHead(
    (predictions): BertLMPredictionHead(
      (transform): BertPredictionHeadTransform(
        (dense): Linear(in_features=768, out_features=768, bias=True)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      )
      (decoder): Linear(in_features=768, out_features=21128, bias=True)
    )
  )

Finally, the loss is the cross-entropy between the prediction scores over input_ids and the labels:

sequence_output = outputs[0]
prediction_scores = self.cls(sequence_output)
masked_lm_loss = None
if labels is not None:
    loss_fct = CrossEntropyLoss()  # -100 index = padding token
    masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))

Here prediction_scores.size() is torch.Size([10, 13, 9448]) and labels.size() is torch.Size([10, 13]): a batch of 10, sequence length 13, and a vocabulary of 9448.
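A small self-contained check of this loss with the shapes quoted above (random tensors, purely illustrative): CrossEntropyLoss ignores the -100 positions, so only the masked tokens contribute to the loss.

import torch
from torch.nn import CrossEntropyLoss

batch, seq_len, vocab_size = 10, 13, 9448
prediction_scores = torch.randn(batch, seq_len, vocab_size)

# Only one position per row carries a real target; the rest are -100 and ignored.
labels = torch.full((batch, seq_len), -100)
labels[:, 3] = torch.randint(0, vocab_size, (batch,))

loss_fct = CrossEntropyLoss()  # ignore_index defaults to -100
masked_lm_loss = loss_fct(prediction_scores.view(-1, vocab_size), labels.view(-1))
print(masked_lm_loss)  # scalar averaged over the non-ignored positions only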
Next, let's walk through next-sentence prediction (NSP) in pretraining.
NSP is implemented in the BertForNextSentencePrediction class in transformers.
Each NSP example contains exactly two sentences.
For instance, with a maximum length of 50, the input looks like this (roughly: "Google and [MASK][MASK] do not exist. Meanwhile, [MASK] does not exist either."):

[CLS]谷歌和[MASK][MASK]都是不存在的。[SEP]同时,[MASK]也是不存在的。[SEP]

Here batch_size = 1 (to score several sentence pairs at once, simply build a larger batch), so the input has shape (1, 50). BertModel produces a (1, 50, 768) sequence output; the pooler takes the hidden state of the first token ([CLS]), applies a dense layer and a tanh activation, and yields a (1, 768) vector. A (hidden_size, 2) linear layer then produces the logits [[-3.0729, 5.9056]], and finally the cross-entropy between these logits and the label ([[1]], i.e. the second sentence is the next sentence) is computed.
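A minimal sketch of driving this NSP head through the public API (the checkpoint name bert-base-chinese is an assumption; the sentence pair is the one from the example above):

import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForNextSentencePrediction.from_pretrained("bert-base-chinese")

sent_a = "谷歌和[MASK][MASK]都是不存在的。"
sent_b = "同时,[MASK]也是不存在的。"

# Builds [CLS] A [SEP] B [SEP] and pads to length 50, giving shape (1, 50).
inputs = tokenizer(sent_a, sent_b, padding="max_length",
                   max_length=50, return_tensors="pt")

labels = torch.LongTensor([1])       # the label used in the walk-through above
outputs = model(**inputs, labels=labels)
print(outputs.loss, outputs.logits)  # logits shape: (1, 2)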
Also note the weight-initialization procedure that runs before pretraining:

class BertPreTrainedModel(PreTrainedModel):
    """
    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
    models.
    """
    config_class = BertConfig
    load_tf_weights = load_tf_weights_in_bert
    base_model_prefix = "bert"
    _keys_to_ignore_on_load_missing = [r"position_ids"]

    def _init_weights(self, module):
        """ Initialize the weights """
        if isinstance(module, (nn.Linear, nn.Embedding)):
            # Slightly different from the TF version which uses truncated_normal for initialization
            # cf https://github.com/pytorch/pytorch/pull/5617
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
        elif isinstance(module, nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)
        if isinstance(module, nn.Linear) and module.bias is not None:
            module.bias.data.zero_()
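A tiny check of this initialization on a from-scratch model (a sketch, assuming the standard transformers attribute paths; the small config is only to keep it fast):

from transformers import BertConfig, BertForMaskedLM

config = BertConfig(vocab_size=21128, num_hidden_layers=2, initializer_range=0.02)
model = BertForMaskedLM(config)  # weights drawn as in _init_weights above

query = model.bert.encoder.layer[0].attention.self.query
print(query.weight.std())                               # roughly 0.02
print(model.bert.embeddings.LayerNorm.weight.unique())  # all ones
print(model.bert.embeddings.LayerNorm.bias.unique())    # all zeros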
