Instead of keeping the pretrained weights for every layer, we re-initialize a specified number of layers with the original Transformer initialization. Re-initializing a layer discards the pretrained knowledge stored in that block. The lower pretrained layers learn broad, general-purpose features, while the higher layers near the output are more specialized toward the pretraining task. Re-initializing the higher layers and retraining them therefore lets the network adapt better to the task at hand. The example below re-initializes the last two layers of roberta.
import torch.nn as nn
from transformers import AutoConfig
from transformers import AutoModelForSequenceClassification

reinit_layers = 2
_model_type = 'roberta'
_pretrained_model = 'roberta-base'

config = AutoConfig.from_pretrained(_pretrained_model)
model = AutoModelForSequenceClassification.from_pretrained(_pretrained_model)

if reinit_layers > 0:
    print(f'Reinitializing Last {reinit_layers} Layers ...')
    # Grab the backbone (model.roberta) and walk its last `reinit_layers` encoder blocks
    encoder_temp = getattr(model, _model_type)
    for layer in encoder_temp.encoder.layer[-reinit_layers:]:
        for module in layer.modules():
            if isinstance(module, nn.Linear):
                # Same normal initialization the Transformer uses at pretraining time
                module.weight.data.normal_(mean=0.0, std=config.initializer_range)
                if module.bias is not None:
                    module.bias.data.zero_()
            elif isinstance(module, nn.Embedding):
                module.weight.data.normal_(mean=0.0, std=config.initializer_range)
                if module.padding_idx is not None:
                    module.weight.data[module.padding_idx].zero_()
            elif isinstance(module, nn.LayerNorm):
                module.bias.data.zero_()
                module.weight.data.fill_(1.0)
    print('Done.')
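As an illustrative sanity check (not part of the original snippet), you can compare the modified model against a freshly loaded copy of roberta-base: the last reinit_layers encoder blocks should now differ, while the remaining blocks stay identical.

# Illustrative sanity check: only the re-initialized blocks should differ
# from a fresh pretrained copy.
import torch

reference = AutoModelForSequenceClassification.from_pretrained(_pretrained_model)
for i, (new_layer, ref_layer) in enumerate(zip(
        getattr(model, _model_type).encoder.layer,
        getattr(reference, _model_type).encoder.layer)):
    same = all(torch.equal(p_new, p_ref)
               for p_new, p_ref in zip(new_layer.parameters(), ref_layer.parameters()))
    print(f'layer {i}: {"unchanged" if same else "re-initialized"}')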
LLRD (Layer-wise Learning Rate Decay) applies a higher learning rate to the top layers and a lower one to the bottom layers. It works by setting the learning rate for the top layer and then reducing it layer by layer, from top to bottom, with a multiplicative decay factor. The rationale is that the lower layers of the network generally capture broad, general-purpose information, so their pretrained weights are already in good shape, whereas the higher layers carry task-specific weights and need a larger learning rate to update faster. The implementation is shown below.
def get_optimizer_grouped_parameters(
    model, model_type,
    learning_rate, weight_decay,
    layerwise_learning_rate_decay
):
    no_decay = ["bias", "LayerNorm.weight"]
    # initialize lr for the task-specific head (classifier / pooler)
    optimizer_grouped_parameters = [
        {
            "params": [p for n, p in model.named_parameters() if "classifier" in n or "pooler" in n],
            "weight_decay": 0.0,
            "lr": learning_rate,
        },
    ]
    # initialize lrs for every layer, from the top encoder block down to the embeddings
    num_layers = model.config.num_hidden_layers
    layers = [getattr(model, model_type).embeddings] + list(getattr(model, model_type).encoder.layer)
    layers.reverse()
    lr = learning_rate
    for layer in layers:
        optimizer_grouped_parameters += [
            {
                "params": [p for n, p in layer.named_parameters() if not any(nd in n for nd in no_decay)],
                "weight_decay": weight_decay,
                "lr": lr,
            },
            {
                "params": [p for n, p in layer.named_parameters() if any(nd in n for nd in no_decay)],
                "weight_decay": 0.0,
                "lr": lr,
            },
        ]
        # multiplicative decay: the next (lower) layer gets a smaller learning rate
        lr *= layerwise_learning_rate_decay
    return optimizer_grouped_parameters
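To see how the multiplicative decay plays out, here is a quick back-of-the-envelope sketch; the 2e-5 top learning rate and 0.9 decay factor are assumed values for illustration, not prescribed by the method.

# Illustrative: learning rate seen by each block of a 12-layer encoder,
# assuming a top learning rate of 2e-5 and a decay factor of 0.9.
top_lr, decay, n_encoder_layers = 2e-5, 0.9, 12
for depth in range(n_encoder_layers + 1):  # +1 for the embeddings at the bottom
    name = 'embeddings' if depth == n_encoder_layers else f'encoder layer {n_encoder_layers - 1 - depth}'
    print(f'{name}: lr = {top_lr * decay ** depth:.2e}')
# The top block gets 2.00e-05, the embeddings roughly 2e-5 * 0.9**12 ≈ 5.65e-06.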
Then pass the grouped parameters into the optimizer:
from torch.optim import AdamW

# Example hyperparameter values; tune these for your task
learning_rate = 2e-5
weight_decay = 0.01
layerwise_learning_rate_decay = 0.9

grouped_optimizer_params = get_optimizer_grouped_parameters(
    model, _model_type,
    learning_rate, weight_decay,
    layerwise_learning_rate_decay
)
optimizer = AdamW(
    grouped_optimizer_params,
    lr=learning_rate,
)
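One detail worth noting: PyTorch optimizers use per-group settings over the defaults, so the lr passed to AdamW only acts as a fallback while each group keeps the decayed value assigned above. A quick check:

# The per-group "lr" set by get_optimizer_grouped_parameters takes precedence
# over the default lr passed to AdamW; verify by inspecting the groups.
for i, group in enumerate(optimizer.param_groups):
    print(f'group {i}: lr = {group["lr"]:.2e}, weight_decay = {group["weight_decay"]}')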