Huggingface Transformers Concise Tutorial (Part 3)

9 Fine-tuning a pretrained model

The official documentation covers fine-tuning from three angles:
1 Fine-tune a pretrained model with Transformers Trainer.
2 Fine-tune a pretrained model in TensorFlow with Keras.
3 Fine-tune a pretrained model in native PyTorch.

Since I mainly work with the PyTorch framework, I will only cover the first and third approaches here; for Keras, please see the official documentation.

Before fine-tuning a pretrained model, download a dataset and get it ready for training. (Here we download the Yelp Reviews dataset, which has five labels.)

from datasets import load_dataset

dataset = load_dataset("yelp_review_full")
dataset["train"][100]

Output:

{'label': 0,
 'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. I\'ve worked at more than one location. I expect bad days, bad moods, and the occasional mistake. But I have yet to have a decent experience at this store. It will remain a place I avoid unless someone in my party needs to avoid illness from low blood sugar. Perhaps I should go back to the racially biased service of Steak n Shake instead!'}
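
As a quick sanity check, you can print the DatasetDict to see the available splits and columns. A minimal sketch; the row counts in the comment are what the datasets library reports for this dataset and may vary with its version:

print(dataset)
# DatasetDict({
#     train: Dataset({features: ['label', 'text'], num_rows: 650000})
#     test: Dataset({features: ['label', 'text'], num_rows: 50000})
# })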

We now need a tokenizer to process the text, including a padding and truncation strategy to handle variable sequence lengths. To process the dataset in one step, use the Datasets map method to apply a preprocessing function over the entire dataset:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenized_datasets = dataset.map(tokenize_function, batched=True)
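
After mapping, each example keeps its original fields and gains the tokenizer outputs. A quick check, assuming the standard bert-base-cased tokenizer (which returns input_ids, token_type_ids, and attention_mask):

print(tokenized_datasets["train"].column_names)
# ['label', 'text', 'input_ids', 'token_type_ids', 'attention_mask']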

Alternatively, you can fine-tune on a smaller subset of the dataset to cut down the training time:

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

With the dataset prepared, we can now start fine-tuning.

1 Fine-tune with Trainer

① Load the model

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

② Training hyperparameters
Next, create a TrainingArguments class, which contains all the hyperparameters you can tune as well as flags for activating different training options. For this tutorial you can start with the default training hyperparameters, but feel free to experiment with them to find the optimal settings.

from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="test_trainer")
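
If you do want to adjust things, the same class exposes the usual knobs. A minimal sketch, with arbitrary example values rather than recommended ones:

training_args = TrainingArguments(
    output_dir="test_trainer",
    num_train_epochs=3,               # number of passes over the training set
    per_device_train_batch_size=8,    # training batch size per device
    per_device_eval_batch_size=8,     # evaluation batch size per device
    learning_rate=5e-5,               # initial learning rate
    evaluation_strategy="epoch",      # run evaluation at the end of every epoch
)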

③ Metrics
Trainer does not pick a metric on its own, so we need to specify one ourselves:

import numpy as np
from datasets import load_metric

metric = load_metric("accuracy")
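
Note that newer releases have deprecated datasets.load_metric in favor of the standalone evaluate library; if your installation complains, the equivalent call (assuming evaluate is installed) is:

import evaluate

metric = evaluate.load("accuracy")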

Call compute on the metric to calculate the accuracy of the predictions. Before passing the predictions to compute, we need to convert the logits to predictions (remember, all Transformers models return logits):

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

④ Trainer()

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

Then start fine-tuning with:

trainer.train()
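
After training you can run the evaluation loop on the held-out set; with the compute_metrics function defined above, the returned dictionary includes the accuracy alongside the evaluation loss. A usage sketch, not output from an actual run:

results = trainer.evaluate()
print(results)  # e.g. {'eval_loss': ..., 'eval_accuracy': ..., ...}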

2 Fine-tune in native PyTorch

You can free some memory first (the official docs also delete pytorch_model, but that variable only exists if you ran the Keras section, which we skipped):

import torch

del model
del trainer
torch.cuda.empty_cache()

Next, manually post-process the tokenized dataset to get it ready for training:

# remove the raw text column (the model only needs the tokenized inputs)
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
# rename "label" to "labels", the argument name the model expects
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
# return PyTorch tensors instead of Python lists
tokenized_datasets.set_format("torch")

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

from torch.utils.data import DataLoader

train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8)
eval_dataloader = DataLoader(small_eval_dataset, batch_size=8)
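
To confirm the dataloaders produce what the model expects, inspect one batch; because we padded every example to max_length, each input tensor should have shape [8, 512] for bert-base-cased. A quick sanity check, assuming the preprocessing above:

batch = next(iter(train_dataloader))
print({k: v.shape for k, v in batch.items()})
# {'labels': torch.Size([8]), 'input_ids': torch.Size([8, 512]),
#  'token_type_ids': torch.Size([8, 512]), 'attention_mask': torch.Size([8, 512])}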

from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

from torch.optim import AdamW
optimizer = AdamW(model.parameters(), lr=5e-5)

from transformers import get_scheduler
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
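# with the 1,000-example subset and batch_size=8, len(train_dataloader) is 125,
# so num_training_steps works out to 3 * 125 = 375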
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
from tqdm.auto import tqdm
progress_bar = tqdm(range(num_training_steps))
model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        # move the batch to the same device as the model
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        # update the parameters, advance the LR schedule, and reset the gradients
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

metric = load_metric("accuracy")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()
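
Finally, once you are satisfied with the fine-tuned model, you can save it together with the tokenizer so it can be reloaded later with from_pretrained. A minimal sketch; the directory name yelp_bert_finetuned is just a placeholder:

save_dir = "yelp_bert_finetuned"  # hypothetical output directory
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)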
