Question answering tasks return an answer given a question. There are two common forms of question answering: extractive, where the answer is extracted from a given context, and abstractive, where an answer is generated from the context.
This guide will show you how to fine-tune DistilBERT on the SQuAD dataset for extractive question answering.
See the question answering task page for more information about other forms of question answering and their associated models, datasets, and metrics.
Load the SQuAD dataset from the Datasets library:
>>> from datasets import load_dataset
>>> squad = load_dataset("squad")
Then take a look at an example:
>>> squad["train"][0]
{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
'context': 'Architecturally, ...and the Gold Dome), is a simple, modern stone statue of Mary.',
'id': '5733be284776f41900661182',
'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
'title': 'University_of_Notre_Dame'
}
The answers field is a dictionary containing the starting character position of the answer and the text of the answer.
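Since answer_start is a character offset into context, a quick sanity check (a minimal sketch, not part of the original recipe) is to slice the context with it and compare against the answer text:

>>> example = squad["train"][0]
>>> start = example["answers"]["answer_start"][0]
>>> text = example["answers"]["text"][0]
>>> # Slicing the context with the character offsets recovers the answer text
>>> example["context"][start : start + len(text)] == text
True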
Load the DistilBERT tokenizer to process the question and context fields:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
There are a few preprocessing steps particular to question answering that you should be aware of:

1. Some examples in the dataset have a very long context that exceeds the maximum input length of the model. Truncate only the context by setting truncation="only_second".
2. Next, map the start and end character positions of the answer to the original context by setting return_offsets_mapping=True.
3. With the mapping in hand, you can find the start and end tokens of the answer, using the sequence_ids method to tell which part of the offsets corresponds to the question and which to the context.
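To see what these pieces look like before writing the full function, here is a minimal sketch (using the tokenizer loaded above) that encodes the single example from earlier:

>>> example = squad["train"][0]
>>> encoded = tokenizer(
    example["question"],
    example["context"],
    max_length=384,
    truncation="only_second",
    return_offsets_mapping=True,
)
>>> # Each entry of offset_mapping is a (start_char, end_char) pair into the
>>> # original text; special tokens such as [CLS] map to (0, 0)
>>> offsets = encoded["offset_mapping"]
>>> # sequence_ids() labels each token: None for special tokens, 0 for the
>>> # question, and 1 for the context
>>> sequence_ids = encoded.sequence_ids()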
Here is how you can create a function to truncate and map the start and end tokens of the answer to the context:
def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)
            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
Use the Datasets map function to apply the preprocessing function over the entire dataset. You can speed up the map function by setting batched=True to process multiple elements of the dataset at once. Remove the columns you don't need:
>>> tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)
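As a sanity check (a sketch, not part of the original recipe), you can decode the tokens between the labeled start and end positions and compare them with the reference answer:

>>> sample = tokenized_squad["train"][0]
>>> start, end = sample["start_positions"], sample["end_positions"]
>>> # Decoding the labeled token span should recover the reference answer
>>> # (modulo casing, since the tokenizer is uncased)
>>> tokenizer.decode(sample["input_ids"][start : end + 1])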
Use DefaultDataCollator to create a batch of examples. Unlike other data collators in Transformers, the DefaultDataCollator does not apply any additional preprocessing such as padding.
PyTorch
>>> from transformers import DefaultDataCollator
>>> data_collator = DefaultDataCollator()
TensorFlow
>>> from transformers import DefaultDataCollator
>>> data_collator = DefaultDataCollator(return_tensors="tf")
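Because the tokenizer already padded every feature to max_length, the collator only has to stack the features into tensors; here is a minimal sketch of what it produces:

>>> features = [tokenized_squad["train"][i] for i in range(2)]
>>> batch = data_collator(features)
>>> # Each value is now a single batched tensor of shape (2, 384)
>>> batch["input_ids"].shape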
PyTorch
Load DistilBERT with AutoModelForQuestionAnswering:
>>> from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer
>>> model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")
If you aren't familiar with fine-tuning a model with the Trainer, take a look at the basic tutorial here!
At this point, only three steps remain:

1. Define your training hyperparameters in TrainingArguments.
2. Pass the training arguments to Trainer along with the model, dataset, tokenizer, and data collator.
3. Call train() to fine-tune the model.
>>> training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)
>>> trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_squad["train"],
    eval_dataset=tokenized_squad["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)
>>> trainer.train()
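Once training finishes, you can try the fine-tuned model on a question with the question-answering pipeline; this is a minimal sketch, and the save path is only an example:

>>> from transformers import pipeline
>>> trainer.save_model("./my_qa_model")  # example path, not from the original guide
>>> question_answerer = pipeline("question-answering", model="./my_qa_model")
>>> question_answerer(
    question="To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?",
    context=squad["train"][0]["context"],
)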
TensorFlow
To fine-tune a model in TensorFlow, start by converting your datasets to the tf.data.Dataset format with to_tf_dataset. Specify the inputs and the start and end positions of the answer in columns, whether to shuffle the dataset order, the batch size, and the data collator:
>>> tf_train_set = tokenized_squad["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "start_positions", "end_positions"],
    dummy_labels=True,
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)
>>> tf_validation_set = tokenized_squad["validation"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "start_positions", "end_positions"],
    dummy_labels=True,
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)
If you aren't familiar with fine-tuning a model with Keras, take a look at the basic tutorial here!
Set up an optimizer function, learning rate schedule, and some training hyperparameters:
>>> from transformers import create_optimizer
>>> batch_size = 16
>>> num_epochs = 2
>>> total_train_steps = (len(tokenized_squad["train"]) // batch_size) * num_epochs
>>> optimizer, schedule = create_optimizer(
    init_lr=2e-5,
    num_warmup_steps=0,
    num_train_steps=total_train_steps,
)
Load DistilBERT with TFAutoModelForQuestionAnswering:
>>> from transformers import TFAutoModelForQuestionAnswering
>>> model = TFAutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")
Configure the model for training with compile:
>>> import tensorflow as tf
>>> model.compile(optimizer=optimizer)
Call [fit](https://keras.io/api/models/model_training_apis/#fit-method) to fine-tune the model:
>>> model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=num_epochs)
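After fitting, a minimal sketch of extracting an answer is to run a forward pass and take the argmax of the start and end logits:

>>> example = squad["train"][0]
>>> encoded = tokenizer(example["question"], example["context"], return_tensors="tf")
>>> outputs = model(**encoded)
>>> # Pick the most likely start and end token positions
>>> start = int(tf.argmax(outputs.start_logits, axis=-1)[0])
>>> end = int(tf.argmax(outputs.end_logits, axis=-1)[0])
>>> tokenizer.decode(encoded["input_ids"][0][start : end + 1])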
For a more in-depth example of how to fine-tune a model for question answering, take a look at the corresponding PyTorch notebook or TensorFlow notebook.