Datawhale Study Group NLP: Notes on Extractive Question Answering with BERT

These are my notes from the Datawhale August 2021 study group "Introduction to NLP with Transformers".
Original course material: https://github.com/datawhalechina/learn-nlp-with-transformers

Task: extractive question answering
Dataset: squad
Each example has three keys: "context", "question", and "answers"

# Show the first example of the training set
datasets["train"][0]
{'id': '5733be284776f41900661182',
 'title': 'University_of_Notre_Dame',
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}}

The answers field stores the starting character position of the answer together with the full answer text.
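As a quick sanity check (a minimal sketch; it uses the datasets object loaded in section 1 below), we can verify that answer_start indexes into context at the character level:

example = datasets["train"][0]
start = example["answers"]["answer_start"][0]  # 515
text = example["answers"]["text"][0]           # 'Saint Bernadette Soubirous'
# Slicing the context with the character offset recovers the answer span
assert example["context"][start : start + len(text)] == text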

1 Loading the data

As in the previous tasks, the dataset was downloaded via Colab and is read from local disk:

from datasets import load_from_disk

datasets = load_from_disk("E:/jupyter_notebook/0_learn-nlp-with-transformers-main/docs/篇章4-使用Transformers解决NLP任务/datasets/squad")
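If your network allows, the same dataset can instead be fetched directly with the standard datasets API:

from datasets import load_dataset

datasets = load_dataset("squad")  # downloads SQuAD v1.1 from the Hugging Face Hub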

2 Data preprocessing

Definitions:

squad_v2 = False
model_checkpoint = "distilbert-base-uncased"
batch_size = 16

from transformers import AutoTokenizer
import transformers

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)   

Preprocessing data for extractive QA has a few tricky points (the constants used below are defined right after this list):

First, the context can be very long; how do we handle text that exceeds max_length?
Second, after tokenization the start and end labels have to be re-located at the token level.
Third, once the text is split into chunks, an answer may no longer lie inside a given chunk, so the labels need extra care.
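The function below also references max_length, doc_stride, and pad_on_right, which this note never defines; following the original tutorial, they can be set as:

max_length = 384   # maximum length of a feature (question + context)
doc_stride = 128   # overlap between two consecutive chunks of a long context
pad_on_right = tokenizer.padding_side == "right"  # True when the tokenizer pads on the right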

def prepare_train_features(examples):
    # We need truncation and padding, but we must not lose any information,
    # so long examples are split into chunks: one over-long example becomes
    # several inputs, and adjacent inputs share an overlap.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",  # when the context follows the question it is the second text, hence only_second
        max_length=max_length,
        stride=doc_stride,  # doc_stride controls the overlap between consecutive chunks
        return_overflowing_tokens=True,
        return_offsets_mapping=True,  # gives each token's character span in the original context
        padding="max_length",
    )
    # Note the use of pop here!
    # overflow_to_sample_mapping maps each chunk id back to its original example id.
    # E.g. if 2 examples are split into 4 chunks, the mapping is [0, 0, 1, 1]:
    # the first two chunks come from the first example.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # offset_mapping likewise has one entry per chunk.
    # It maps tokens back to positions in the original input; since answers are annotated
    # on the original input, it lets us find the answer's start and end positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Re-label the data
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):  # i indexes the chunk; offsets lists each token's character span in the original context
        # Process one chunk at a time.
        # Samples without an answer are labeled on the CLS token.
        input_ids = tokenized_examples["input_ids"][i]  # token ids of this chunk
        cls_index = input_ids.index(tokenizer.cls_token_id)  # position of CLS (token id 101), which is 0

        # Distinguish the question from the context.
        sequence_ids = tokenized_examples.sequence_ids(i)  # None, 0, 1 mark special tokens, the first text, and the second text

        # Get the index of the original example.
        sample_index = sample_mapping[i]  # which original example the i-th chunk comes from
        answers = examples["answers"][sample_index]  # that example's answer annotation
        # If there is no answer, use the CLS position as the answer.
        if len(answers["answer_start"]) == 0:  # this depends on the dataset's annotation scheme; here an unanswerable example stores nothing in answer_start
            tokenized_examples["start_positions"].append(cls_index)  # no answer: both start and end point at CLS
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Character-level start/end positions of the answer.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Find the token-level start index.
            token_start_index = 0
            # sequence_ids is the list of 0/1/None that distinguishes the two texts
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1

            # Find the token-level end index.
            token_end_index = len(input_ids) - 1
            # After tokenization every input is padded to max_length (384); padding tokens
            # have sequence_id None, so scanning from the end skips over the padding.
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1

            # If the answer falls outside this chunk, also label it with the CLS index.
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                # start_char and end_char are the answer's character positions;
                # offsets[token_start_index][0] is the starting character of the first context token,
                # offsets[token_end_index][1] is the ending character of the last context token
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise, locate the answer's start and end token positions.
                # Note: we could go after the last offset if the answer is the last word (edge case).
                # Walk the indices inward until they pin down the answer span
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples

Apply the preprocessing to the whole dataset:

tokenized_datasets = datasets.map(prepare_train_features, batched=True, remove_columns=datasets["train"].column_names)
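Because long contexts are split into chunks, the processed training set contains more features than the raw set contains examples; a quick illustrative check:

# Each over-long example contributes several features, so these counts differ
print(len(datasets["train"]), "examples ->", len(tokenized_datasets["train"]), "features")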

3 Training and evaluation

Training follows the usual recipe:

from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer
from transformers import default_data_collator

model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

args = TrainingArguments(
    f"test-squad",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,  # learning rate
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=1,  # number of training epochs
    weight_decay=0.01,
)

data_collator = default_data_collator

trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

trainer.train()
trainer.save_model("test-squad-trained")

Evaluation is more involved, because the two output score vectors may tell us that no answer can be found: for example, the predicted start index may be larger than the end index, or the start and end may point into the question. In that case we score every candidate pair (in practice, the sum of the start logit and the end logit), check that each candidate is valid, and finally sort them and pick the highest-scoring one.
In short: add the start and end scores, check validity, then sort.
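The snippet below uses output and n_best_size, which this note never defines. Following the original tutorial, they can be obtained roughly like this (a sketch, assuming the trainer from above):

import torch

n_best_size = 20  # keep the 20 best start/end candidates

# Run the model on one batch from the evaluation dataloader
for batch in trainer.get_eval_dataloader():
    break
batch = {k: v.to(trainer.args.device) for k, v in batch.items()}
with torch.no_grad():
    output = trainer.model(**batch)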

import numpy as np

start_logits = output.start_logits[0].cpu().numpy()
end_logits = output.end_logits[0].cpu().numpy()
# Gather the positions of the best start and end logits:
# note the [-1 : -n_best_size - 1 : -1] slice!
# argsort() sorts the elements in ascending order and returns their indices
start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
valid_answers = []
for start_index in start_indexes:
    for end_index in end_indexes:
        if start_index <= end_index: # the pair is only plausible if start comes no later than end
            valid_answers.append(
                {
                    "score": start_logits[start_index] + end_logits[end_index],
                    "text": "" # 后续需要根据token的下标将答案找出来
                }
            )

The above implements the score-summing and the validity check. Next we sort valid_answers by score and take the best one. One step remains: checking that the text between the start and end positions lies in the context rather than in the question.

To do this, we need to add the following two pieces of information to the validation features (chunks):

  • 1 The ID of the example that produced the chunk. Since one example can produce several chunks, each chunk needs to know which example it came from.
  • 2 The offset mapping: the map from each chunk's token positions back to character-level positions in the original text.

So preparing the validation data differs from preparing the training data:

def prepare_validation_features(examples):
    # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that overlaps a bit the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We keep the example_id that gave us this feature and we will store the offset mappings.
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # Grab the sequence ids of that feature (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)  # None, 0, 1 mark special tokens, question, and context
        context_index = 1 if pad_on_right else 0

        # Step 1: record the example id.
        # One example can give several spans; this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]  # which original example the i-th chunk comes from
        tokenized_examples["example_id"].append(examples["id"][sample_index])  # store which example this chunk belongs to

        # Step 2: adjust the offset_mapping.
        # Set to None the offset_mapping entries that are not part of the context, so it is easy
        # to determine whether a token position is part of the context or not.
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)  # keep the offset for context tokens, set None otherwise
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples

Process the validation data:

validation_features = datasets["validation"].map(
    prepare_validation_features,
    batched=True,
    remove_columns=datasets["validation"].column_names
)

Get all the predictions:

raw_predictions = trainer.predict(validation_features)

The Trainer hides the columns that were not used during model training (here example_id and offset_mapping, which we need for post-processing), so we have to restore them:

validation_features.set_format(type=validation_features.format["type"], columns=list(validation_features.features.keys()))

When a token position belongs to the question, prepare_validation_features sets its offset mapping to None, so the offset mapping makes it easy to tell whether a token lies in the context. We likewise throw away answers that are overly long.
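As a small illustration (assuming the features prepared above), the None markers are easy to see by printing a feature's first few offsets:

# CLS and question tokens map to None; context tokens keep their (start, end) character spans
print(validation_features[0]["offset_mapping"][:10])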

Putting the steps together, we still process one example, but in the reverse order from before: first preprocess the data, which yields each chunk's example id and the corrected offset_mapping values, and then use this preprocessed data for the checks and the scoring, covering all the corner cases.

max_answer_length = 30
start_logits = output.start_logits[0].cpu().numpy()
end_logits = output.end_logits[0].cpu().numpy()
offset_mapping = validation_features[0]["offset_mapping"]
# The first feature comes from the first example. For the more general case, we will need to
# match the example_id to an example index
context = datasets["validation"][0]["context"]

# Gather the indices of the best start/end logits:
start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
valid_answers = []
for start_index in start_indexes:
    for end_index in end_indexes:
        # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
        # to part of the input_ids that are not in the context.
        if (
            start_index >= len(offset_mapping)
            or end_index >= len(offset_mapping)
            or offset_mapping[start_index] is None
            or offset_mapping[end_index] is None
        ):
            continue
        # Don't consider answers with a length that is either < 0 or > max_answer_length.
        if end_index < start_index or end_index - start_index + 1 > max_answer_length:
            continue
        if start_index <= end_index: # We need to refine that test to check the answer is inside the context
            start_char = offset_mapping[start_index][0]
            end_char = offset_mapping[end_index][1]
            valid_answers.append(
                {
                    "score": start_logits[start_index] + end_logits[end_index],
                    "text": context[start_char: end_char]
                }
            )

valid_answers = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[:n_best_size]
valid_answers

The candidates come back in descending score order:

[{'score': 14.21919, 'text': 'Denver Broncos'},
 {'score': 12.823972,
  'text': 'Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers'},
 {'score': 11.04076, 'text': 'Carolina Panthers'},
 {'score': 9.815048,
  'text': 'American Football Conference (AFC) champion Denver Broncos'},
 {'score': 9.213307,
  'text': 'Denver Broncos defeated the National Football Conference'},
 {'score': 9.160126, 'text': 'Denver'},
 {'score': 8.942065, 'text': 'Broncos'},
 {'score': 8.683533,
  'text': 'The American Football Conference (AFC) champion Denver Broncos'},
 {'score': 8.41983,
  'text': 'American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers'},
 {'score': 8.371606,
  'text': 'Denver Broncos defeated the National Football Conference (NFC)'},
 {'score': 8.318468,
  'text': 'Denver Broncos defeated the National Football Conference (NFC'},
 {'score': 7.5468473,
  'text': 'Broncos defeated the National Football Conference (NFC) champion Carolina Panthers'},
 {'score': 7.4054294,
  'text': 'Denver Broncos defeated the National Football Conference (NFC) champion Carolina'},
 {'score': 7.288315,
  'text': 'The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers'},
 {'score': 6.8056736, 'text': 'champion Denver Broncos'},
 {'score': 6.4467125,
  'text': 'Denver Broncos defeated the National Football Conference (NFC) champion'},
 {'score': 6.364973,
  'text': 'Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10'},
 {'score': 6.362278,
  'text': 'National Football Conference (NFC) champion Carolina Panthers'},
 {'score': 6.132183, 'text': 'Panthers'},
 {'score': 5.8664913, 'text': 'American Football Conference'}]

Let's print the ground-truth answer for comparison:

datasets["validation"][0]["answers"]
# output
{'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'],
 'answer_start': [177, 177, 177]}

The top score above matches this output, so the prediction is correct. Impressive!

Since the first feature necessarily comes from the first example, that case is easy. For the other features we need a map between features and examples. Likewise, since one example may be split into several features, we also have to collect the answers from all of its features. The following code builds the mapping between example indices and feature indices. (A "feature" here means a chunk.)

Tidying up the above workflow gives:

import collections
from tqdm.auto import tqdm

def postprocess_qa_predictions(examples, features, raw_predictions, n_best_size = 20, max_answer_length = 30):
    all_start_logits, all_end_logits = raw_predictions
    # Build a map example to its corresponding features.
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    # The dictionaries we have to fill.
    predictions = collections.OrderedDict()

    # Logging.
    print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")

    # Let's loop over all the examples!
    for example_index, example in enumerate(tqdm(examples)):
        # Those are the indices of the features associated to the current example.
        feature_indices = features_per_example[example_index]

        min_null_score = None # Only used if squad_v2 is True.
        valid_answers = []
        
        context = example["context"]
        # Looping through all the features associated to the current example.
        for feature_index in feature_indices:
            # We grab the predictions of the model for this feature.
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            # This is what will allow us to map some the positions in our logits to span of texts in the original
            # context.
            offset_mapping = features[feature_index]["offset_mapping"]

            # Update minimum null prediction.
            cls_index = features[feature_index]["input_ids"].index(tokenizer.cls_token_id)
            feature_null_score = start_logits[cls_index] + end_logits[cls_index]
            if min_null_score is None or feature_null_score < min_null_score:
                min_null_score = feature_null_score

            # Go through all possibilities for the `n_best_size` greater start and end logits.
            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
                    # to part of the input_ids that are not in the context.
                    if (
                        start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or offset_mapping[start_index] is None
                        or offset_mapping[end_index] is None
                    ):
                        continue
                    # Don't consider answers with a length that is either < 0 or > max_answer_length.
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue

                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char: end_char]
                        }
                    )
        
        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
            # In the very rare edge case we have not a single non-null prediction, we create a fake prediction to avoid
            # failure.
            best_answer = {"text": "", "score": 0.0}
        
        # Let's pick our final answer: the best one or the null answer (only for squad_v2)
        if not squad_v2:
            predictions[example["id"]] = best_answer["text"]
        else:
            answer = best_answer["text"] if best_answer["score"] > min_null_score else ""
            predictions[example["id"]] = answer

    return predictions

Get the final predictions:

max_answer_length = 30

final_predictions = postprocess_qa_predictions(datasets["validation"], validation_features, raw_predictions.predictions)
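As an illustrative check (assuming the objects above), the prediction for the first validation example can be looked up by its id:

# final_predictions is keyed by example id
first_id = datasets["validation"][0]["id"]
print(final_predictions[first_id])  # for the example shown earlier this should be 'Denver Broncos'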

Load the evaluation metric; if the network is unreliable, download it manually from GitHub and load it from disk:

from datasets import load_metric

metric_path = 'E:/jupyter_notebook/0_learn-nlp-with-transformers-main/docs/篇章4-使用Transformers解决NLP任务/datasets/squad'
metric = load_metric(metric_path)

Evaluate:

if squad_v2:
    formatted_predictions = [{"id": k, "prediction_text": v, "no_answer_probability": 0.0} for k, v in final_predictions.items()]
else:
    formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]
references = [{"id": ex["id"], "answers": ex["answers"]} for ex in datasets["validation"]]

print(metric.compute(predictions=formatted_predictions, references=references))
