Extractive Question Answering with BERT

This article was written for the SegmentFault essay call for the "Baidu Search Technology Innovation Challenge"; readers are welcome to take part as well.

Dataset

CoQA (Conversational Question Answering) is a large-scale dataset released by Stanford NLP in 2019 for building conversational question answering systems. It is designed to measure a machine's ability to understand a text passage and answer a series of interconnected questions that appear in a conversation. The dataset is unique in that each conversation is collected by pairing two crowd workers who discuss a passage in question-and-answer form, so the questions are genuinely conversational.


The JSON data contains many fields. For our purposes, we will use the "story" field together with the "input_text" fields of "questions" and "answers" to form our data.
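
For orientation, the rough shape of a single record is sketched below; the field names are the ones accessed in the cleaning code further down, and the example values are taken from the passage used in the demo at the end.

#illustrative shape of one CoQA record (values taken from the later demo)
record = {
    "story": "The Vatican Apostolic Library (), more commonly called the Vatican Library ...",
    "questions": [{"input_text": "When was the Vat formally opened?"}],
    "answers": [{"input_text": "1475"}],
}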

Installing transformers

!pip install transformers

Importing libraries

import pandas as pd
import numpy as np
import torch
from transformers import BertForQuestionAnswering
from transformers import BertTokenizer

Loading the data

coqa = pd.read_json('http://downloads.cs.stanford.edu/nlp/data/coqa/coqa-train-v1.0.json')
coqa.head()

Data cleaning

For each question-answer pair, we attach the corresponding context (the story).

#required columns in our dataframe
cols = ["text","question","answer"]
#list of lists to create our dataframe
comp_list = []
for index, row in coqa.iterrows():
    for i in range(len(row["data"]["questions"])):
        temp_list = []
        temp_list.append(row["data"]["story"])
        temp_list.append(row["data"]["questions"][i]["input_text"])
        temp_list.append(row["data"]["answers"][i]["input_text"])
        comp_list.append(temp_list)
new_df = pd.DataFrame(comp_list, columns=cols) 
#saving the dataframe to csv file for further loading
new_df.to_csv("CoQA_data.csv", index=False)
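
As a quick sanity check (an extra step, not part of the snippet above), you can print how many question-answer pairs the flattened DataFrame contains:

#extra sanity check: total number of question-answer pairs
print("Number of question-answer pairs:", len(new_df))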

Loading the DataFrame

data = pd.read_csv("CoQA_data.csv")
data.head()

Building the model

model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
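
As an optional check (not part of the original walkthrough), you can confirm how large the fine-tuned checkpoint is; this is a BERT-large model with roughly 340 million parameters, so the first download and load take a while.

#optional: report the parameter count of the loaded model
print("Parameters: {:,}".format(model.num_parameters()))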

BERT uses wordpiece tokenization. In BERT, rare words are broken down into subword units (pieces), and wordpiece tokenization uses ## to mark a piece that was split off. For example, "Karin" is a common word, so wordpiece does not split it; "Karingu", however, is a rare word, so wordpiece splits it into "Karin" and "##gu". Note the ## prepended to "gu", indicating that it is the second part of a split word.

The idea behind wordpiece tokenization is to reduce the vocabulary size and thereby improve training performance. Consider the words run, running, runner. Without wordpiece tokenization, the model would have to store and learn the meaning of all three words independently. With wordpiece tokenization, each of them can be split into "run" plus an associated "##SUFFIX" where one exists (e.g. "run", "##ning", "##ner"). The model then learns the context of the word "run", while the remaining meaning is encoded in the suffix, which is in turn learned from other words carrying a similar suffix.
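
The snippet below is a minimal sketch of this behaviour using the tokenizer loaded above; the exact pieces you get depend on the model's vocabulary, so the commented outputs are only indicative.

#wordpiece splitting in action; the actual pieces depend on the vocabulary
print(tokenizer.tokenize("Karin"))    #a common word, e.g. ['karin']
print(tokenizer.tokenize("Karingu"))  #a rare word, split into pieces, e.g. ['karin', '##gu']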

def question_answer(question, text):
    
    #tokenize question and text as a pair
    input_ids = tokenizer.encode(question, text)
    
    #string version of tokenized ids
    tokens = tokenizer.convert_ids_to_tokens(input_ids)
    
    #segment IDs
    #first occurrence of [SEP] token
    sep_idx = input_ids.index(tokenizer.sep_token_id)
    #number of tokens in segment A (question)
    num_seg_a = sep_idx+1
    #number of tokens in segment B (text)
    num_seg_b = len(input_ids) - num_seg_a
    
    #list of 0s and 1s for segment embeddings
    segment_ids = [0]*num_seg_a + [1]*num_seg_b
    assert len(segment_ids) == len(input_ids)
    
    #model output using input_ids and segment_ids
    output = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([segment_ids]))
    
    #reconstructing the answer
    answer_start = torch.argmax(output.start_logits)
    answer_end = torch.argmax(output.end_logits)
    if answer_end >= answer_start:
        answer = tokens[answer_start]
        for i in range(answer_start+1, answer_end+1):
            if tokens[i][0:2] == "##":
                answer += tokens[i][2:]
            else:
                answer += " " + tokens[i]
    else:
        #no valid span predicted; fall back to the "no answer" message below
        answer = "[CLS]"

    if answer.startswith("[CLS]"):
        answer = "Unable to find the answer to your question."
    
    print("\nPredicted answer:\n{}".format(answer.capitalize()))

Results

Please enter your text: 
The Vatican Apostolic Library (), more commonly called the Vatican Library or simply the Vat, is the library of the Holy See, located in Vatican City. Formally established in 1475, although it is much older, it is one of the oldest libraries in the world and contains one of the most significant collections of historical texts. It has 75,000 codices from throughout history, as well as 1.1 million printed books, which include some 8,500 incunabula.   The Vatican Library is a research library for history, law, philosophy, science and theology. The Vatican Library is open to anyone who can document their qualifications and research needs. Photocopies for private study of pages from books published between 1801 and 1990 can be requested in person or by mail.   In March 2014, the Vatican Library began an initial four-year project of digitising its collection of manuscripts, to be made available online.   The Vatican Secret Archives were separated from the library at the beginning of the 17th century; they contain another 150,000 items.   Scholars have traditionally divided the history of the library into five periods, Pre-Lateran, Lateran, Avignon, Pre-Vatican and Vatican.   The Pre-Lateran period, comprising the initial days of the library, dated from the earliest days of the Church. Only a handful of volumes survive from this period, though some are very significant.

Please enter your question: 
When was the Vat formally opened?

Answer:
1475

Do you want to ask another question based on this text (Y/N)? Y

Please enter your question: 
what is the library for?

Answer:
Research library for history , law , philosophy , science and theology

Do you want to ask another question based on this text (Y/N)? Y

Please enter your question: 
for what subjects?

Answer:
History , law , philosophy , science and theology
Do you want to ask another question based on this text (Y/N)? N

Bye!

Reference

https://towardsdatascience.co...
