The ultimate purpose of weight decay is to prevent overfitting. In the loss function, weight decay is the coefficient placed in front of the regularization term; the regularization term generally reflects model complexity, so weight decay controls how much model complexity contributes to the loss. If weight decay is large, a complex model will also incur a large loss value.
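A minimal, framework-agnostic sketch of this idea (the function name and the 0.01 default are illustrative only):
import torch

def loss_with_weight_decay(data_loss, parameters, weight_decay=0.01):
    # L2 regularization term: sum of squared parameter values, scaled by weight_decay
    l2 = sum((p ** 2).sum() for p in parameters)
    return data_loss + weight_decay * l2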
momentum is a commonly used acceleration technique in gradient descent. For plain SGD the update is
$x \leftarrow x - \alpha \nabla f(x)$,
i.e., $x$ descends along the negative gradient. SGD with a momentum term is instead written as
$v \leftarrow \beta v - \alpha \nabla f(x), \qquad x \leftarrow x + v,$
where $\beta$ is the momentum coefficient. Intuitively, if the previous momentum (i.e., $v$) points in the same direction as the current negative gradient, this step becomes larger, which is why momentum accelerates convergence.
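A minimal sketch of the momentum update above (the names lr and beta are illustrative, not taken from any library):
def sgd_momentum_step(x, grad, v, lr=0.01, beta=0.9):
    # keep a fraction beta of the previous velocity, then take a gradient step
    v = beta * v - lr * grad
    # when v and the negative gradient agree in direction, the effective step grows
    x = x + v
    return x, v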
{"placeholder": ...} : seems to be used to hold the input sentence (sent).
{"meta": ...} : seems to be used to hold certain specific entities, e.g., entity, title, etc.
{"soft": ..., "duplicate": ...} : soft tokens, i.e., the parameters to be optimized. If a word is given, the soft token is (as I understand it) initialized from that token's embedding; otherwise it is randomly initialized. The duplicate parameter sets the number of soft tokens, e.g., 50 soft tokens.
{"mask"} : marks where the model's output is produced.
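For illustration only (the key syntax follows the OpenPrompt template docs linked below; the "entity" meta field is a made-up example), a template mixing these keys could look like:
template_text = '{"placeholder": "text_a"} {"meta": "entity"} is {"mask"}.'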
Official docs: https://thunlp.github.io/OpenPrompt/notes/template.html?highlight=duplicate
Inspecting the template parameters: [n for n, p in prompt_model.template.named_parameters()]
(n is the parameter name, p the parameter tensor)
Inspecting the LM (PLM) parameters: [n for n, p in prompt_model.plm.named_parameters()]
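For reference, the plm, tokenizer, and WrapperClass objects used throughout these notes typically come from OpenPrompt's load_plm; a sketch (the "t5-base" checkpoint and the wrapper settings are example choices, not required ones):
from openprompt.plms import load_plm
plm, tokenizer, model_config, WrapperClass = load_plm("t5", "t5-base")
# wrapper tokenizer used below for manual tokenization of wrapped examples
wrapped_t5tokenizer = WrapperClass(max_seq_length=128, decoder_max_length=3, tokenizer=tokenizer, truncate_method="head")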
# if you want to define, say, 10000 soft tokens, use the key "duplicate"
template_text = '{"placeholder":"text_a"} {"soft": "question", "duplicate": 50} {"placeholder":"text_b"} {"soft": "yes", "duplicate": 16} {"soft": "no", "duplicate": 16} {"soft": "maybe", "duplicate": 16} {"mask"}.'
mytemplate = MixedTemplate(model=plm, tokenizer=tokenizer, text=template_text)
# To better understand how the template wraps an example, we visualize one instance.
wrapped_example = mytemplate.wrap_one_example(dataset['train'][0])
wrapped_example
Here, dataset['train'][0] has the following format:
{
    "guid": 0,
    "label": 0,
    "meta": {},
    "text_a": "It was a complex language. Not written down but handed down. One might say it was peeled down.",
    "text_b": "the language was peeled down",
    "tgt_text": null
}
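Such records are usually represented as OpenPrompt InputExample objects; a sketch built from the record above:
from openprompt.data_utils import InputExample
example = InputExample(
    guid=0,
    label=0,
    text_a="It was a complex language. Not written down but handed down. One might say it was peeled down.",
    text_b="the language was peeled down",
)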
Load the data into dictionary form:
model_inputs = {}
for split in ['train', 'validation', 'test']:
    model_inputs[split] = []
    for sample in dataset[split]:
        tokenized_example = wrapped_t5tokenizer.tokenize_one_example(mytemplate.wrap_one_example(sample), teacher_forcing=False)
        model_inputs[split].append(tokenized_example)
from openprompt import PromptDataLoader
train_dataloader = PromptDataLoader(dataset=dataset["train"], template=mytemplate, tokenizer=tokenizer,
                                    tokenizer_wrapper_class=WrapperClass, max_seq_length=256, decoder_max_length=3,
                                    batch_size=4, shuffle=True, teacher_forcing=False, predict_eos_token=False,
                                    truncate_method="head")
# tokenizing: 250it [00:00, 624.06it/s]  (the 250 here is the number of training examples)
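A validation loader can be built the same way, just without shuffling (the settings below simply mirror the train loader and are only a sketch):
validation_dataloader = PromptDataLoader(dataset=dataset["validation"], template=mytemplate, tokenizer=tokenizer,
                                         tokenizer_wrapper_class=WrapperClass, max_seq_length=256, decoder_max_length=3,
                                         batch_size=4, shuffle=False, teacher_forcing=False, predict_eos_token=False,
                                         truncate_method="head")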
Define which parameters should be updated, e.g., which part of the LM parameters and which part of the template (prompt) parameters.
from openprompt import PromptForClassification
use_cuda = torch.cuda.is_available()
print("GPU enabled? {}".format(use_cuda))
# freeze_plm=False: the PLM parameters are fine-tuned together with the template
prompt_model = PromptForClassification(plm=plm, template=mytemplate, verbalizer=myverbalizer, freeze_plm=False)
if use_cuda:
prompt_model= prompt_model.cuda()
from transformers import AdamW, get_linear_schedule_with_warmup
loss_func = torch.nn.CrossEntropyLoss()
no_decay = ['bias', 'LayerNorm.weight']
# it's always good practice to apply no weight decay to bias and LayerNorm parameters
optimizer_grouped_parameters = [
    {'params': [p for n, p in prompt_model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},  # weight_decay: coefficient of the weight decay term, a knob against overfitting
    {'params': [p for n, p in prompt_model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
optimizer = AdamW(optimizer_grouped_parameters, lr=1e-4)
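Since get_linear_schedule_with_warmup is imported above, a learning-rate schedule can optionally be attached; a sketch (5 epochs matches the loop below, the zero warmup steps are an arbitrary choice):
num_training_steps = 5 * len(train_dataloader)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)
# if used, call scheduler.step() right after optimizer.step() in the training loop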
for epoch in range(5):
    tot_loss = 0
    for step, inputs in enumerate(train_dataloader):
        if use_cuda:
            inputs = inputs.cuda()
        logits = prompt_model(inputs)
        labels = inputs['label']
        loss = loss_func(logits, labels)
        loss.backward()
        tot_loss += loss.item()
        optimizer.step()
        optimizer.zero_grad()
        if step % 100 == 1:
            print("Epoch {}, average loss: {}".format(epoch, tot_loss/(step+1)), flush=True)
Manually built: ManualVerbalizer, whose label words are ordinary words, e.g. [["great", "wonderful"], ["bad"]] or {"World": "politics", "Tech": "technology"}.
SoftVerbalizer
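Hedged sketches of constructing both verbalizers (num_classes=3 and the yes/no/maybe label words simply match the template above; adjust for your task):
from openprompt.prompts import ManualVerbalizer, SoftVerbalizer
# manual verbalizer: each class is mapped to a fixed list of label words
myverbalizer = ManualVerbalizer(tokenizer, num_classes=3, label_words=[["yes"], ["no"], ["maybe"]])
# soft verbalizer: the label-word representations are themselves trainable
# myverbalizer = SoftVerbalizer(tokenizer, plm, num_classes=3)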
OpenPrompt
OpenPrompt demo link: https://colab.research.google.com/drive/10syott1zXaQkjnlxOiSXKDFGy68SWR0y?usp=sharing#scrollTo=MHZc0szQ8tkY
opendelta
OpenDelta demo link: https://colab.research.google.com/drive/1uAhgAdc8Qr42UKYDlgUv0f7W1-gAFwGo?usp=sharing