Datawhale Study Group NLP: BERT Multiple-Choice Learning Notes

These are my notes from the Datawhale August 2021 study group "Introduction to NLP with Transformers".
Original course material: https://github.com/datawhalechina/learn-nlp-with-transformers

Task: multiple choice
Dataset: SWAG — pick the most plausible continuation out of four options, essentially a reading-comprehension task.
Each example in the dataset has a context built from a first sentence (field sent1) and the beginning of a second sentence (field sent2). Four possible endings are then given (fields ending0, ending1, ending2 and ending3), and the model has to pick the correct one (indicated by the field label).

Here is what one example looks like:

{'ending0': 'passes by walking down the street playing their instruments.',
 'ending1': 'has heard approaching them.',
 'ending2': "arrives and they're outside dancing and asleep.",
 'ending3': 'turns the lead singer watches the performance.',
 'fold-ind': '3416',
 'gold-source': 'gold',
 'label': 0,
 'sent1': 'Members of the procession walk down the street holding small horn brass instruments.',
 'sent2': 'A drum line',
 'startphrase': 'Members of the procession walk down the street holding small horn brass instruments. A drum line',
 'video-id': 'anetv_jkn6uvmqwh4'}

sent2 is the beginning of the next sentence, so it is concatenated with each ending before being fed to the model.

As usual there are three steps: loading the data, preprocessing, and training/evaluation.

model_checkpoint = "bert-base-uncased"
batch_size = 16

1 Loading the data

Download the data from the provided link and unzip it, copy the three extracted csv files into the docs/篇章4-使用Transformers解决NLP任务/datasets/swag directory, then load them with the code below.

The tutorial shows a way to read the csv data directly, but for me it never finished running and got stuck somewhere; I have not investigated the exact cause yet. The code is as follows:

from datasets import load_dataset, load_metric
import os

data_path = './datasets/swag/'
cache_dir = os.path.join(data_path, 'cache')
data_files = {'train': os.path.join(data_path, 'train.csv'), 'val': os.path.join(data_path, 'val.csv'), 'test': os.path.join(data_path, 'test.csv')}
datasets = load_dataset(data_path, 'regular', data_files=data_files, cache_dir=cache_dir)
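If the call above hangs, one alternative worth trying (an untested sketch on my side; it assumes the three csv files carry the standard SWAG columns) is to point the generic csv builder at the files instead of the directory:

from datasets import load_dataset

# Hypothetical alternative: let the built-in "csv" loader parse the files directly.
data_files = {"train": "./datasets/swag/train.csv",
              "validation": "./datasets/swag/val.csv",
              "test": "./datasets/swag/test.csv"}
datasets = load_dataset("csv", data_files=data_files, cache_dir="./datasets/swag/cache")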

So I fell back to the same approach as last time: download the dataset on Colab, then load it locally with load_from_disk.

from datasets import load_from_disk
datasets = load_from_disk("./datasets/swag")
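
For reference, the Colab side is roughly this (my assumption of how the download was done; the tutorial uses SWAG's "regular" configuration from the Hub):

from datasets import load_dataset

# On Colab: download SWAG from the Hub and save it to disk,
# then copy the ./datasets/swag folder to the local machine.
datasets = load_dataset("swag", "regular")
datasets.save_to_disk("./datasets/swag")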

2 Data preprocessing

Same routine as before: define a tokenizer.

from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

The preprocessing that follows differs from the earlier token-classification task: adapting the input format is one of the main differences when applying BERT to different tasks.

# The four ending fields; preprocess_function relies on this definition
# (it appears earlier in the original tutorial).
ending_names = ["ending0", "ending1", "ending2", "ending3"]

def preprocess_function(examples):
    # Repeat each first sentence four times to go with the four possibilities of second sentences.
    first_sentences = [[context] * 4 for context in examples["sent1"]]
    # Grab all second sentences possible for each context.
    question_headers = examples["sent2"]
    # Outer loop over the examples in the batch (index i), inner loop over the four endings:
    # every example is combined with each of its four endings.
    second_sentences = [[f"{header} {examples[end][i]}" for end in ending_names] for i, header in enumerate(question_headers)]
    
    # Flatten everything
    # e.g. 5 lists of 4 sentences each are flattened into a single list of 20 sentences.
    first_sentences = sum(first_sentences, [])
    second_sentences = sum(second_sentences, [])
    
    # Tokenize
    # truncation=True truncates to the model's maximum length.
    tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True)
    # Un-flatten
    return {k: [v[i:i+4] for i in range(0, len(v), 4)] for k, v in tokenized_examples.items()}

Look at the last line of the function: it simply regroups the flattened results back into groups of four. Written out as loops:

    for k, v in tokenized_examples.items():
        if k != "input_ids":
            break  # stop after input_ids; the other keys (token_type_ids, attention_mask) have the same structure
        for i in range(0, len(v), 4):
            w = k
            w2 = v[i:i + 4]

This function works on one or more examples. When several examples are passed in, the tokenizer returns, for each key, a list of lists: a list over all the examples (here of length 5), then a list over the four choices, then the list of input ids (of varying length, since we have not applied any padding).
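
A quick sanity check of this structure, following the original tutorial (it also defines the features variable used in the decoding step below):

examples = datasets["train"][:5]
features = preprocess_function(examples)
# 5 examples, 4 choices each, and input ids of varying length
print(len(features["input_ids"]), len(features["input_ids"][0]), [len(x) for x in features["input_ids"][0]])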

Decode the inputs to check them:

idx = 3
[tokenizer.decode(features["input_ids"][idx][i]) for i in range(4)]
['[CLS] a drum line passes by walking down the street playing their instruments. [SEP] members of the procession are playing ping pong and celebrating one left each in quick. [SEP]',
 '[CLS] a drum line passes by walking down the street playing their instruments. [SEP] members of the procession wait slowly towards the cadets. [SEP]',
 '[CLS] a drum line passes by walking down the street playing their instruments. [SEP] members of the procession makes a square call and ends by jumping down into snowy streets where fans begin to take their positions. [SEP]',
 '[CLS] a drum line passes by walking down the street playing their instruments. [SEP] members of the procession play and go back and forth hitting the drums while the audience claps for them. [SEP]']

Now compare with the original data:
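
show_one is a small helper from the original tutorial that prints one raw example in a readable form; its implementation is roughly the following:

def show_one(example):
    print(f"Context: {example['sent1']}")
    print(f"  A - {example['sent2']} {example['ending0']}")
    print(f"  B - {example['sent2']} {example['ending1']}")
    print(f"  C - {example['sent2']} {example['ending2']}")
    print(f"  D - {example['sent2']} {example['ending3']}")
    print(f"\nGround truth: option {['A', 'B', 'C', 'D'][example['label']]}")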

show_one(datasets["train"][3])
Context: A drum line passes by walking down the street playing their instruments.
  A - Members of the procession are playing ping pong and celebrating one left each in quick.
  B - Members of the procession wait slowly towards the cadets.
  C - Members of the procession makes a square call and ends by jumping down into snowy streets where fans begin to take their positions.
  D - Members of the procession play and go back and forth hitting the drums while the audience claps for them.

Ground truth: option D

When the tokenizer is given a pair of sentences, it automatically inserts a [SEP] token between them.
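
A minimal illustration of this (not from the tutorial, just a quick check with the bert-base-uncased tokenizer):

enc = tokenizer("A drum line", "passes by walking down the street.")
print(tokenizer.decode(enc["input_ids"]))
# expected, roughly: [CLS] a drum line [SEP] passes by walking down the street. [SEP]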

Having confirmed the preprocessing looks right, we apply it to the whole dataset with map:

encoded_datasets = datasets.map(preprocess_function, batched=True)

Even better, the results are automatically cached by the Datasets library, so this step is not repeated the next time you run the notebook. The library is usually smart enough to detect when the function passed to map has changed (and then stops using the cached data); for example, it will notice if you change the task in the first cell and rerun the notebook. When Datasets falls back to cached files it prints a warning; you can pass load_from_cache_file=False to map to ignore the cache and force the preprocessing to run again.
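
For example, to force the preprocessing to run again instead of reading the cache:

encoded_datasets = datasets.map(preprocess_function, batched=True, load_from_cache_file=False)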

Note that we passed batched=True to encode the texts in batches. This takes full advantage of the fast tokenizer we loaded earlier, which uses multithreading to process the texts of a batch concurrently.

The token IDs shown above, i.e. the input_ids, generally differ from one pretrained model to another, because each model fixed its own rules during pretraining. But as long as the tokenizer and the model share the same checkpoint name, the tokenizer's output format will match what the model expects.

3 Fine-tuning the model

from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer

model = AutoModelForMultipleChoice.from_pretrained(model_checkpoint)
args = TrainingArguments(
    "test-glue",
    evaluation_strategy = "epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

Wow, here we even have to write our own DataCollator.

We need to tell our Trainer how to build batches from the preprocessed inputs. We have not done any padding yet, because we will pad each batch to the maximum length within that batch (rather than to the maximum length of the whole dataset). That is the data collator's job: it takes a list of examples and turns them into a batch (in our case, by applying padding). Since the library has no data collator for this particular problem, we adapt one ourselves from DataCollatorWithPadding:

(In the earlier token-classification and text-classification tasks we did not have to write this padding logic ourselves; the library's existing collators handled it.)

from dataclasses import dataclass
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union
import torch

@dataclass
class DataCollatorForMultipleChoice:
    """
    Data collator that will dynamically pad the inputs for multiple choice received.
    """

    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        label_name = "label" if "label" in features[0].keys() else "labels"
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]["input_ids"])
        flattened_features = [[{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features]
        flattened_features = sum(flattened_features, [])
        
        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )
        
        # Un-flatten
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        # Add back labels
        batch["labels"] = torch.tensor(labels, dtype=torch.int64)
        return batch

Here is an example of calling it:

accepted_keys = ["input_ids", "attention_mask", "label"]
features = [{k: v for k, v in encoded_datasets["train"][i].items() if k in accepted_keys} for i in range(10)]
# pass in 10 features at once, i.e. a batch size of 10
batch = DataCollatorForMultipleChoice(tokenizer)(features)

Have a look at the features being passed in:

for i in range(10):
    for k, v in encoded_datasets["train"][i].items():
        if k in accepted_keys:
            w = k
            w1 = v

Each feature looks like this:

{'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'input_ids': [[101, 2372, 1997, 1996, 14385, 3328, 2091, 1996, 2395, 3173, 2235, 7109, 8782, 5693, 1012, 102, 1037, 6943, 2240, 5235, 2011, 3788, 2091, 1996, 2395, 2652, 2037, 5693, 1012, 102], [101, 2372, 1997, 1996, 14385, 3328, 2091, 1996, 2395, 3173, 2235, 7109, 8782, 5693, 1012, 102, 1037, 6943, 2240, 2038, 2657, 8455, 2068, 1012, 102], [101, 2372, 1997, 1996, 14385, 3328, 2091, 1996, 2395, 3173, 2235, 7109, 8782, 5693, 1012, 102, 1037, 6943, 2240, 8480, 1998, 2027, 1005, 2128, 2648, 5613, 1998, 6680, 1012, 102], [101, 2372, 1997, 1996, 14385, 3328, 2091, 1996, 2395, 3173, 2235, 7109, 8782, 5693, 1012, 102, 1037, 6943, 2240, 4332, 1996, 2599, 3220, 12197, 1996, 2836, 1012, 102]], 'label': 0}
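
To double-check the collator against the raw data, the tutorial decodes one element of the padded batch and compares it with show_one; along these lines:

# Decode the four padded choices of the 9th element in the batch (index 8)...
print([tokenizer.decode(batch["input_ids"][8][i].tolist()) for i in range(4)])
# ...and compare with the corresponding raw example.
show_one(datasets["train"][8])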

Define the evaluation metric:

import numpy as np

def compute_metrics(eval_predictions):
    predictions, label_ids = eval_predictions
    preds = np.argmax(predictions, axis=1)
    return {"accuracy": (preds == label_ids).astype(np.float32).mean().item()}

Pass everything to the Trainer:

trainer = Trainer(
    model,
    args,
    train_dataset=encoded_datasets["train"],
    eval_dataset=encoded_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForMultipleChoice(tokenizer),
    compute_metrics=compute_metrics,
)

Train:

trainer.train()
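
After training finishes, evaluating on the validation set and saving the fine-tuned model are standard Trainer calls (I have not included my run's numbers here; the output directory name below is just a placeholder):

trainer.evaluate()                         # reports eval_loss and accuracy on the validation set
trainer.save_model("bert-finetuned-swag")  # placeholder directory name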
