These are my notes from the Datawhale August 2021 group study "Introduction to NLP with Transformers".
Original course materials: https://github.com/datawhalechina/learn-nlp-with-transformers
Task: multiple choice
Dataset: SWAG. Given four options, pick the most plausible continuation of a sentence; essentially a reading-comprehension task.
Each example in the dataset has a context made up of a first sentence (field sent1) and the beginning of a second sentence (field sent2). Four possible endings are then given (fields ending0, ending1, ending2 and ending3), and the model has to pick the correct one (indicated by the field label).
What an example looks like:
{'ending0': 'passes by walking down the street playing their instruments.',
'ending1': 'has heard approaching them.',
'ending2': "arrives and they're outside dancing and asleep.",
'ending3': 'turns the lead singer watches the performance.',
'fold-ind': '3416',
'gold-source': 'gold',
'label': 0,
'sent1': 'Members of the procession walk down the street holding small horn brass instruments.',
'sent2': 'A drum line',
'startphrase': 'Members of the procession walk down the street holding small horn brass instruments. A drum line',
'video-id': 'anetv_jkn6uvmqwh4'}
sent2 is the beginning of the next sentence, so it is concatenated with each ending and fed to the model together; see the sketch below.
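In other words, each raw example turns into four sentence pairs: sent1 paired with sent2 joined to each ending. A minimal sketch, assuming example holds one record like the sample above:
first = example["sent1"]
candidates = [f"{example['sent2']} {example['ending' + str(i)]}" for i in range(4)]
# the model scores the four (first, candidate) pairs; the correct index is example["label"]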
As usual there are three steps: loading the data, preprocessing, and training/evaluation.
model_checkpoint = "bert-base-uncased"
batch_size = 16
Download the data from the link given in the course materials and unzip it, copy the three extracted csv files into the docs/篇章4-使用Transformers解决NLP任务/datasets/swag directory, and then load them with the code below.
The course offers a way to read the csv data directly, but for me it never finished running and simply hung; I haven't looked into why. The code is:
from datasets import load_dataset, load_metric
import os
data_path = './datasets/swag/'
cache_dir = os.path.join(data_path, 'cache')
data_files = {'train': os.path.join(data_path, 'train.csv'), 'val': os.path.join(data_path, 'val.csv'), 'test': os.path.join(data_path, 'test.csv')}
datasets = load_dataset(data_path, 'regular', data_files=data_files, cache_dir=cache_dir)
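If you want to try the csv route anyway, one option (untested here, so treat it as a sketch) is to use the built-in 'csv' loading script instead of pointing load_dataset at the directory; naming the split 'validation' also matches what the Trainer expects later:
from datasets import load_dataset
import os
data_path = './datasets/swag/'
data_files = {
    'train': os.path.join(data_path, 'train.csv'),
    'validation': os.path.join(data_path, 'val.csv'),
    'test': os.path.join(data_path, 'test.csv'),
}
datasets = load_dataset('csv', data_files=data_files)  # generic csv loader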
So, as last time, I downloaded the dataset on Colab and then loaded it locally with load_from_disk.
from datasets import load_from_disk
datasets = load_from_disk("./datasets/swag")
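A quick look at what was loaded (the exact split names and sizes depend on how the dataset was saved on Colab):
print(datasets)
# expected: a DatasetDict with 'train', 'validation' and 'test' splits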
Same routine as before: define a tokenizer.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
The data preprocessing that follows is different from the sequence-labeling task; adjusting the input format is one of the main differences when applying BERT to different tasks.
ending_names = ["ending0", "ending1", "ending2", "ending3"]

def preprocess_function(examples):
    # Repeat each first sentence four times to go with the four possibilities of second sentences.
    first_sentences = [[context] * 4 for context in examples["sent1"]]
    # Grab all second sentences possible for each context.
    question_headers = examples["sent2"]
    # Nested comprehension: the outer loop runs over the examples in the batch (index i),
    # the inner loop pairs each example with its four endings.
    second_sentences = [[f"{header} {examples[end][i]}" for end in ending_names] for i, header in enumerate(question_headers)]
    # Flatten everything
    # e.g. 5 lists of 4 sentences each become one flat list of 20 sentences
    first_sentences = sum(first_sentences, [])
    second_sentences = sum(second_sentences, [])
    # Tokenize; truncation=True truncates to the model's maximum length
    tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True)
    # Un-flatten: regroup into chunks of 4, one chunk per example
    return {k: [v[i:i+4] for i in range(0, len(v), 4)] for k, v in tokenized_examples.items()}
Look at the last line: it simply regroups the tokenized results four at a time before returning them.
for k, v in tokenized_examples.items():
    if k != "input_ids":
        break  # without this it would also iterate over token_type_ids and attention_mask
    for i in range(0, len(v), 4):
        w = k
        w2 = v[i:i + 4]  # one group of four tokenized sentences
This function works with one or several examples. When given several examples, the tokenizer returns, for each key, a list of lists: a list over all examples (of length 5 here), each containing a list over the four choices, each of which is a list of input ids (of varying lengths, since we haven't applied any padding yet).
Decoding the inputs (here features is preprocess_function applied to the first five training examples):
examples = datasets["train"][:5]
features = preprocess_function(examples)
idx = 3
[tokenizer.decode(features["input_ids"][idx][i]) for i in range(4)]
['[CLS] a drum line passes by walking down the street playing their instruments. [SEP] members of the procession are playing ping pong and celebrating one left each in quick. [SEP]',
'[CLS] a drum line passes by walking down the street playing their instruments. [SEP] members of the procession wait slowly towards the cadets. [SEP]',
'[CLS] a drum line passes by walking down the street playing their instruments. [SEP] members of the procession makes a square call and ends by jumping down into snowy streets where fans begin to take their positions. [SEP]',
'[CLS] a drum line passes by walking down the street playing their instruments. [SEP] members of the procession play and go back and forth hitting the drums while the audience claps for them. [SEP]']
Now compare with the raw example:
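The show_one helper isn't defined in these notes; a minimal sketch that reproduces the output below, using the field names from the raw example shown earlier, could be:
def show_one(example):
    print(f"Context: {example['sent1']}")
    print(f"A - {example['sent2']} {example['ending0']}")
    print(f"B - {example['sent2']} {example['ending1']}")
    print(f"C - {example['sent2']} {example['ending2']}")
    print(f"D - {example['sent2']} {example['ending3']}")
    print(f"Ground truth: option {['A', 'B', 'C', 'D'][example['label']]}")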
show_one(datasets["train"][3])
Context: A drum line passes by walking down the street playing their instruments.
A - Members of the procession are playing ping pong and celebrating one left each in quick.
B - Members of the procession wait slowly towards the cadets.
C - Members of the procession makes a square call and ends by jumping down into snowy streets where fans begin to take their positions.
D - Members of the procession play and go back and forth hitting the drums while the audience claps for them.
Ground truth: option D
When the tokenizer is given two sentences, it automatically inserts a [SEP] between them (and appends another at the end).
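A quick way to see this (a throwaway sentence pair, not from the dataset):
pair = tokenizer("a drum line passes by.", "members of the procession play drums.")
print(tokenizer.decode(pair["input_ids"]))
# roughly: [CLS] a drum line passes by. [SEP] members of the procession play drums. [SEP]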
Having checked that the preprocessing looks right, we apply it to the whole dataset with the map function.
encoded_datasets = datasets.map(preprocess_function, batched=True)
Even better, the results are automatically cached by the Datasets library, so this step won't cost any time the next time you run the code. The Datasets library is normally smart enough to detect when the function passed to map changes (in which case the cached data is no longer used); for example, it will detect if you change the task in the first cell and re-run the notebook. When Datasets falls back on cached files it prints a warning; you can pass load_from_cache_file=False to map to ignore the cache and force the preprocessing to run again.
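For example, to ignore the cache and force the preprocessing to run again:
encoded_datasets = datasets.map(preprocess_function, batched=True, load_from_cache_file=False)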
Note that we passed batched=True to encode the texts in batches. This takes full advantage of the fast tokenizer we loaded earlier, which processes the texts in a batch concurrently with multiple threads.
The token IDs shown above (input_ids) generally vary from one pretrained model to another, because each model fixed its own tokenization rules during pretraining. But as long as the tokenizer and the model are loaded from the same checkpoint name, the tokenizer's output format will match what the model expects.
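For instance, you can check the special tokens this tokenizer uses (for bert-base-uncased, [CLS], [SEP] and [PAD] map to ids 101, 102 and 0):
print(tokenizer.cls_token, tokenizer.sep_token, tokenizer.pad_token)
print(tokenizer.cls_token_id, tokenizer.sep_token_id, tokenizer.pad_token_id)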
from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer
model = AutoModelForMultipleChoice.from_pretrained(model_checkpoint)
args = TrainingArguments(
    "test-glue",  # output directory
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)
This time we also have to write our own data collator.
We need to tell the Trainer how to build batches from the preprocessed inputs. We haven't done any padding yet, because we will pad each batch to the maximum length inside that batch (rather than to the maximum length over the whole dataset). That is the job of the data collator: it takes a list of examples and turns them into a batch (in our case, by applying padding). Since the library has no data collator for this particular problem, we adapt one ourselves from DataCollatorWithPadding:
For the earlier text-classification and sequence-labeling tasks an off-the-shelf collator took care of the padding, so we didn't have to write one.
from dataclasses import dataclass
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union
import torch

@dataclass
class DataCollatorForMultipleChoice:
    """
    Data collator that dynamically pads the inputs for multiple choice.
    """
    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        label_name = "label" if "label" in features[0].keys() else "labels"
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]["input_ids"])
        # Flatten: turn every example's four choices into four separate records
        flattened_features = [[{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features]
        flattened_features = sum(flattened_features, [])
        # Pad all flattened records to the same length within this batch
        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )
        # Un-flatten back to (batch_size, num_choices, seq_len)
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        # Add back labels
        batch["labels"] = torch.tensor(labels, dtype=torch.int64)
        return batch
An example call:
accepted_keys = ["input_ids", "attention_mask", "label"]
features = [{k: v for k, v in encoded_datasets["train"][i].items() if k in accepted_keys} for i in range(10)]
# pass in 10 examples at once, so the batch size here is 10
batch = DataCollatorForMultipleChoice(tokenizer)(features)
Let's look at the features:
for i in range(10):
    for k, v in encoded_datasets["train"][i].items():
        if k in accepted_keys:
            w = k
            w1 = v
Each feature looks like this:
{'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'input_ids': [[101, 2372, 1997, 1996, 14385, 3328, 2091, 1996, 2395, 3173, 2235, 7109, 8782, 5693, 1012, 102, 1037, 6943, 2240, 5235, 2011, 3788, 2091, 1996, 2395, 2652, 2037, 5693, 1012, 102], [101, 2372, 1997, 1996, 14385, 3328, 2091, 1996, 2395, 3173, 2235, 7109, 8782, 5693, 1012, 102, 1037, 6943, 2240, 2038, 2657, 8455, 2068, 1012, 102], [101, 2372, 1997, 1996, 14385, 3328, 2091, 1996, 2395, 3173, 2235, 7109, 8782, 5693, 1012, 102, 1037, 6943, 2240, 8480, 1998, 2027, 1005, 2128, 2648, 5613, 1998, 6680, 1012, 102], [101, 2372, 1997, 1996, 14385, 3328, 2091, 1996, 2395, 3173, 2235, 7109, 8782, 5693, 1012, 102, 1037, 6943, 2240, 4332, 1996, 2599, 3220, 12197, 1996, 2836, 1012, 102]], 'label': 0}
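Back to the collated batch: a quick shape check (10 examples, 4 choices each, padded to the longest sequence in this batch):
print(batch["input_ids"].shape)  # torch.Size([10, 4, <longest length in the batch>])
print(batch["labels"])           # tensor with the 10 gold labels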
Define the metric function:
import numpy as np

def compute_metrics(eval_predictions):
    predictions, label_ids = eval_predictions
    preds = np.argmax(predictions, axis=1)
    return {"accuracy": (preds == label_ids).astype(np.float32).mean().item()}
Pass everything to the Trainer:
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_datasets["train"],
    eval_dataset=encoded_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForMultipleChoice(tokenizer),
    compute_metrics=compute_metrics,
)
Train:
trainer.train()
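After training you would typically evaluate on the validation split, and optionally save the fine-tuned model; these are standard Trainer calls rather than part of the original notes (the output path is just an example):
trainer.evaluate()
trainer.save_model("./swag-bert-finetuned")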