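Reference: "Question Answering System Case Study: Knowledge Base QA with BERT"
Knowledge base question answering (KBQA), also called knowledge graph question answering, is a composite task in which a model combines a knowledge graph with reasoning and querying over the input question to arrive at the correct answer.
KBQA methods fall into two broad categories:
Information retrieval based methods
Semantic parsing based methods
Information retrieval methods produce no intermediate representation and obtain the answer directly. They are simple, but their ability to handle complex questions is limited.
Semantic parsing methods first parse the natural language question into a semantic representation and then reason over it, which lets them handle complex questions.
This tutorial follows the information retrieval approach.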
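To make the information retrieval idea concrete, here is a minimal sketch under simplifying assumptions: candidate sentences are scored directly against the question and the top-scoring one is returned, with no intermediate logical form. Plain word overlap stands in for the learned BERT matcher used later in this tutorial, and `overlap_score` and `retrieve_answer` are hypothetical helper names.

```python
def overlap_score(question, candidate):
    """Fraction of question words that also appear in the candidate (toy scorer)."""
    q_words = set(question.lower().split())
    c_words = set(candidate.lower().split())
    if not q_words:
        return 0.0
    return len(q_words & c_words) / len(q_words)


def retrieve_answer(question, candidates):
    """Return the highest-scoring candidate; ties go to the earliest one."""
    return max(candidates, key=lambda c: overlap_score(question, c))


question = "how are glacier caves formed ?"
candidates = [
    "The ice facade is approximately 60 m high",
    "A glacier cave is a cave formed within the ice of a glacier .",
]
best = retrieve_answer(question, candidates)
print(best)
```

Replacing the toy scorer with a trained sentence-pair classifier gives the pipeline built in the rest of this tutorial.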
conda install -c huggingface transformers
conda install -c huggingface -c conda-forge datasets
# If conda keeps timing out, you can run pip install datasets instead
The relevant resources are listed below:
Resource | URL |
---|---|
Library GitHub repository | https://github.com/huggingface/transformers |
Official documentation | https://huggingface.co/docs/transformers/index |
Pretrained model hub | https://huggingface.co/models |
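We use the open-domain question answering dataset WikiQA [Kaggle download link].
WikiQA uses Bing query logs as its source of questions. Each question is linked to a Wikipedia page that may contain the answer; the summary section of that page carries the key information about the question, and WikiQA uses its sentences as candidate answers. The dataset contains 3,047 questions and 29,258 sentences in total.
The WikiQA dataset can be used to train question answering systems. It stores the text of each question, the knowledge base data associated with it, and the corresponding answers.
The knowledge base in this dataset is the document summary retrieved for each question, and every sentence in the summary serves as a candidate answer. The QA problem can therefore be recast as a matching problem between two sentences. To prepare for model training, we load the data as (question, answer, label) triples.
If answer is a correct answer to question, label is 1; otherwise it is 0.
Each triple is stored as a dictionary.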
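As a small illustration of this structure, the sketch below turns one question and its candidate sentences into labeled dictionaries. `make_triples` and its `correct_indices` argument are hypothetical names used only for this example; they are not part of the WikiQA tooling.

```python
def make_triples(question, candidates, correct_indices):
    """Build one {'question', 'answer', 'label'} dict per candidate sentence."""
    correct = set(correct_indices)
    return [
        {"question": question, "answer": sentence, "label": 1 if i in correct else 0}
        for i, sentence in enumerate(candidates)
    ]


triples = make_triples(
    "how are glacier caves formed ?",
    [
        "The ice facade is approximately 60 m high",
        "A glacier cave is a cave formed within the ice of a glacier .",
    ],
    correct_indices=[1],
)
for triple in triples:
    print(triple)
```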
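Define the load function. It reads each file with pandas, passing '\t' as the separator so the columns are split automatically, then walks through the rows and builds the data structure described above.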
# Shared imports
import pandas as pd
import csv
import transformers
import torch
from transformers import BertPreTrainedModel, BertModel, BertTokenizer
from torch import nn
import numpy as np
import os
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

def load(filename):
    # Each row of the tab-separated file holds: question, answer, label
    df = pd.read_csv(filename, sep="\t", header=None)
    result = []
    for _, row in df.iterrows():
        # Skip rows with a missing field
        if pd.isna(row[0]) or pd.isna(row[1]) or pd.isna(row[2]):
            continue
        result.append({'question': str(row[0]),
                       'answer': str(row[1]),
                       'label': int(row[2])})
    return result

train_file = load('data/WikiQA-train.txt')
valid_file = load('data/WikiQA-dev.txt')
test_file = load('data/WikiQA-test.txt')
Output:
[{'question': 'how are glacier caves formed ?',
'answer': 'A partly submerged glacier cave on Perito Moreno Glacier .',
'label': 0},
{'question': 'how are glacier caves formed ?',
'answer': 'The ice facade is approximately 60 m high',
'label': 0},
{'question': 'how are glacier caves formed ?',
'answer': 'Ice formations in the Titlis glacier cave',
'label': 0},
{'question': 'how are glacier caves formed ?',
'answer': 'A glacier cave is a cave formed within the ice of a glacier .',
'label': 1},
{'question': 'how are glacier caves formed ?',
'answer': 'Glacier caves are often called ice caves , but this term is properly used to describe bedrock caves that contain year-round ice .',
'label': 0},
{'question': 'How are the directions of the velocity and force vectors related in a circular motion',
'answer': 'In physics , circular motion is a movement of an object along the circumference of a circle or rotation along a circular path .',
'label': 0},
{'question': 'How are the directions of the velocity and force vectors related in a circular motion',
'answer': 'It can be uniform , with constant angular rate of rotation ( and constant speed ) , or non-uniform with a changing rate of rotation .',
'label': 0},
{'question': 'How are the directions of the velocity and force vectors related in a circular motion',
'answer': 'The rotation around a fixed axis of a three-dimensional body involves circular motion of its parts .',
'label': 0},
...]
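The same rows can also be parsed with the standard-library csv module, using '\t' as the delimiter. The sketch below is one possible variant, not the loader used above: `load_with_csv` is a hypothetical name, and the choice of `csv.QUOTE_NONE` (so that quotation marks inside sentences are kept as ordinary characters) is an assumption that may yield slightly different rows than pandas' default quoting on lines containing quotes.

```python
import csv
import io


def load_with_csv(fileobj):
    """Parse tab-separated question/answer/label rows from an open text file."""
    result = []
    # QUOTE_NONE treats quotation marks inside sentences as ordinary characters
    reader = csv.reader(fileobj, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 3 or not row[0] or not row[1]:
            continue  # skip malformed or incomplete rows
        result.append({"question": row[0], "answer": row[1], "label": int(row[2])})
    return result


# In-memory sample built from two of the rows shown above
sample = io.StringIO(
    "how are glacier caves formed ?\tThe ice facade is approximately 60 m high\t0\n"
    "how are glacier caves formed ?\t"
    "A glacier cave is a cave formed within the ice of a glacier .\t1\n"
)
rows = load_with_csv(sample)
print(rows)
```

For a real file, the function would be called inside `with open(path, newline="", encoding="utf-8") as f:`.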
print(len(train_file), len(valid_file), len(test_file))
Output:
20360 2733 6165
Next, the data is converted into BERT's standard input format. A BERT input consists mainly of three parts: input_ids, attention_mask, and token_type_ids.
tokenize = BertTokenizer.from_pretrained("bert-base-uncased")
Usage:
tokenize.encode('how are glacier caves formed ?')
# [101, 2129, 2024, 10046, 10614, 2719, 1029, 102]
tokenize.encode_plus('how are glacier caves formed ?')
# {'input_ids': [101, 2129, 2024, 10046, 10614, 2719, 1029, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
tokenize.encode_plus(text='how are glacier caves formed ?',
text_pair='Glacier caves are often called ice caves , but this term is properly used to describe bedrock caves that contain year-round ice .',
max_length=64,
truncation=True,
add_special_tokens=True,
padding="max_length")
# {'input_ids': [101, 2129, 2024, 10046, 10614, 2719, 1029, 102, 10046, 10614, 2024, 2411, 2170, 3256, 10614, 1010, 2021, 2023, 2744, 2003, 7919, 2109, 2000, 6235, 28272, 10614, 2008, 5383, 2095, 1011, 2461, 3256, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
# 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
#'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
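Note:
attention_mask is 1 for real tokens and 0 for padding, so padded positions are masked out. token_type_ids is 0 for the question segment (including [CLS] and the first [SEP]) and 1 for the answer segment (including the final [SEP]); padding positions are 0 as well.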
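To show how the three fields line up, here is a simplified, tokenizer-free sketch that assembles them from token id lists. The special ids 101 ([CLS]), 102 ([SEP]) and 0 (padding) are the ones visible in the encoded outputs above; `build_pair_inputs` is a hypothetical helper, and trimming only the answer is a simplification that does not necessarily match the tokenizer's own truncation strategy.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids visible in the encoded outputs above


def build_pair_inputs(question_ids, answer_ids, max_length):
    """Assemble [CLS] question [SEP] answer [SEP] and pad to max_length."""
    first = [CLS_ID] + question_ids + [SEP_ID]
    room = max_length - len(first) - 1  # space left for answer tokens
    if room < 0:
        raise ValueError("max_length is too small for the question")
    # Simplification: trim only the answer so the sequence fits
    second = answer_ids[:room] + [SEP_ID]
    n_pad = max_length - len(first) - len(second)
    return {
        "input_ids": first + second + [PAD_ID] * n_pad,
        "token_type_ids": [0] * len(first) + [1] * len(second) + [0] * n_pad,
        "attention_mask": [1] * (len(first) + len(second)) + [0] * n_pad,
    }


question_ids = [2129, 2024, 10046, 10614, 2719, 1029]  # 'how are glacier caves formed ?'
answer_ids = [10046, 10614, 2024]  # first three answer tokens, kept short for brevity
enc = build_pair_inputs(question_ids, answer_ids, max_length=16)
for name, values in enc.items():
    print(name, values)
```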
def get_data(name_file):
    input_ids = []
    attention_mask = []
    token_type_ids = []
    labels = []
    for example in name_file:
        # Encode the (question, answer) pair, padded or truncated to 64 tokens
        output = tokenize.encode_plus(text=example["question"],
                                      text_pair=example["answer"],
                                      max_length=64,
                                      truncation=True,
                                      add_special_tokens=True,
                                      padding="max_length")
        input_ids.append(output["input_ids"])
        attention_mask.append(output["attention_mask"])
        token_type_ids.append(output["token_type_ids"])
        labels.append(example["label"])
    return {"input_ids": torch.tensor(input_ids),
            "attention_mask": torch.tensor(attention_mask),
            "token_type_ids": torch.tensor(token_type_ids),
            "labels": torch.tensor(labels)}
from datasets import Dataset
transformers.logging.set_verbosity_error()  # suppress non-essential warnings
train_dataset = Dataset.from_dict(get_data(train_file))
eval_dataset = Dataset.from_dict(get_data(valid_file))
test_dataset = Dataset.from_dict(get_data(test_file))
from transformers import BertPreTrainedModel, BertModel, BertConfig
config = BertConfig.from_pretrained("bert-base-uncased")

class BertQA(BertPreTrainedModel):
    def __init__(self, config, freeze=True):
        super(BertQA, self).__init__(config)
        self.num_labels = config.num_labels
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # Freeze the BERT parameters so that only the layers after it are trained
        if freeze:
            for p in self.bert.parameters():
                p.requires_grad = False
        self.qa_outputs = nn.Linear(config.hidden_size, 2)
        self.loss_fn = nn.CrossEntropyLoss()
        self.init_weights()

    def forward(self, input_ids, attention_mask=None, token_type_ids=None, labels=None):
        outputs = self.bert(input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids)
        # outputs[1] is the pooled feature; the linear layer maps it to a
        # 2-dimensional vector of scores for labels 0 and 1
        logits = self.qa_outputs(outputs[1])
        predicted_labels = torch.softmax(logits, dim=-1)
        # If gold labels are provided (training), also compute the loss
        if labels is not None:
            # CrossEntropyLoss applies log-softmax itself, so it takes the raw logits
            loss = self.loss_fn(logits, labels)
            return {"loss": loss, "predicted_labels": predicted_labels}
        # Otherwise (inference), return only the predictions
        return {"predicted_labels": predicted_labels}

model = BertQA(config)
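The head maps BERT's pooled feature to two scores, one per label. The framework-free sketch below shows how such scores relate to probabilities and to the cross-entropy loss; the logits are made-up numbers for one hypothetical (question, answer) pair. One detail worth keeping in mind is that PyTorch's nn.CrossEntropyLoss applies log-softmax internally and therefore expects raw logits. Feeding it probabilities applies softmax twice, which for two classes keeps the loss above ln(1 + e^-1), roughly 0.313, even for a perfect prediction.

```python
import math


def softmax(scores):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability of the true label, computed from raw logits."""
    return -math.log(softmax(logits)[label])


logits = [2.0, -1.0]  # hypothetical scores for labels 0 and 1
probs = softmax(logits)
loss_from_logits = cross_entropy(logits, label=0)
loss_from_probs = cross_entropy(probs, label=0)  # softmax effectively applied twice
print("probabilities:", probs)
print("loss from logits:", loss_from_logits)
print("loss from probabilities:", loss_from_probs)
```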
from transformers import Trainer, TrainingArguments

# Custom evaluation metric
def compute_metrics(pred):
    # pred carries two fields: label_ids and predictions
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    acc = (labels == preds).sum() / len(labels)
    return {"acc": acc}

args = TrainingArguments(output_dir="./result",
                         gradient_accumulation_steps=10,
                         learning_rate=1e-3,
                         logging_dir="./logging",
                         num_train_epochs=2,
                         logging_steps=100,
                         evaluation_strategy="epoch",  # named eval_strategy in recent transformers releases
                         per_device_eval_batch_size=8)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset, eval_dataset=eval_dataset,
                  compute_metrics=compute_metrics)
trainer.train()
Training took roughly 10 minutes. Output:
TrainOutput(global_step=50,
training_loss=0.3679719543457031,
metrics={'train_runtime': 719.1163,
'train_samples_per_second': 56.625,
'train_steps_per_second': 0.07,
'total_flos': 1327395274291200.0,
'train_loss': 0.3679719543457031,
'epoch': 1.98})
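The step count in this output follows from the batch size, the gradient accumulation setting, and the number of devices. The sketch below is a rough estimate only: `optimizer_steps` is a hypothetical helper, it assumes the Trainer default of 8 training examples per device, and it uses floor division for the accumulation step, which may differ from the rounding in other Trainer versions. The device count is not stated in the source; 10 devices would be consistent with the reported global_step of 50.

```python
import math


def optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_devices):
    """Estimate the number of optimizer updates in a run (floor-rounding variant)."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs


# 20360 training pairs, accumulation of 10, 2 epochs, as configured above
single = optimizer_steps(20360, 8, 10, 2, num_devices=1)
multi = optimizer_steps(20360, 8, 10, 2, num_devices=10)  # device count is an assumption
print(single, multi)
```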
Testing
# Evaluate on the held-out test set
trainer.evaluate(test_dataset)
Output:
{'eval_loss': 0.3607916831970215,
'eval_acc': 0.9524736415247365,
'eval_runtime': 75.7807,
'eval_samples_per_second': 81.353,
'eval_steps_per_second': 1.029,
'epoch': 1.98}
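Because most candidate sentences carry label 0 (only one of the eight training rows printed earlier is labeled 1), accuracy can look high even for a model that never predicts 1. A quick sanity check is to compare the reported accuracy with a majority-class baseline. The sketch below uses a hypothetical helper, `majority_baseline_accuracy`, on the labels of those eight rows.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


# Labels of the eight training rows printed earlier in this tutorial
sample_labels = [0, 0, 0, 1, 0, 0, 0, 0]
baseline = majority_baseline_accuracy(sample_labels)
print(baseline)  # 0.875
```

On the real data the same check would be `majority_baseline_accuracy([d["label"] for d in test_file])`; a trained model is only informative to the extent that it beats this number.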
Alternatively, the model can be trained with a conventional PyTorch loop: define your own optimizer and lr_scheduler, write the per-epoch training code, and so on.
Dataset
train_dataset = Dataset.from_dict(get_data(train_file))
eval_dataset = Dataset.from_dict(get_data(valid_file))
test_dataset = Dataset.from_dict(get_data(test_file))
train_dataset.set_format("torch")
eval_dataset.set_format("torch")
test_dataset.set_format("torch")
DataLoader
from torch.utils.data import DataLoader
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=8)
eval_dataloader = DataLoader(eval_dataset, batch_size=8)
test_dataloader = DataLoader(test_dataset, batch_size=8)
optimizer
from torch.optim import AdamW
optimizer = AdamW(model.parameters(), lr=5e-5)
lr_scheduler
from transformers import get_scheduler
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)
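The "linear" schedule raises the learning rate linearly during warmup and then decays it linearly to zero at the last training step. The sketch below reproduces that shape in plain Python so the effect of num_warmup_steps and num_training_steps is easy to see; `linear_schedule_lr` is a hypothetical helper, and the total of 1000 steps is an illustrative value rather than the real step count.

```python
def linear_schedule_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_span)


total_steps = 1000  # illustrative value
for step in (0, 500, 1000):
    print(step, linear_schedule_lr(step, 5e-5, 0, total_steps))
```

With num_warmup_steps=0, as configured above, the rate starts at its full value of 5e-5 and shrinks on every step.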
model
import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
metrics
import evaluate

# Named evaluate_accuracy to avoid shadowing Python's built-in eval()
def evaluate_accuracy(dataloader):
    metric = evaluate.load("accuracy")
    model.eval()
    for batch in dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.no_grad():
            outputs = model(**batch)
        predictions = torch.argmax(outputs["predicted_labels"], dim=-1)
        metric.add_batch(predictions=predictions, references=batch["labels"])
    return metric.compute()
train
from tqdm.auto import tqdm
progress_bar = tqdm(range(num_training_steps))
for epoch in range(num_epochs):
    model.train()
    for batch in train_dataloader:
        # batch keys: 'input_ids', 'attention_mask', 'token_type_ids', 'labels'
        # move every tensor in the batch to the training device
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs['loss']
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
    # Accuracy on the training set after each epoch
    print(evaluate_accuracy(train_dataloader))
test
print(evaluate_accuracy(eval_dataloader))
Output:
{'accuracy': 0.9487742407610684}
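Accuracy is computed per (question, answer) pair, whereas a question answering system ultimately has to return one answer per question. The sketch below shows one way to do that, assuming the probability of label 1 (the second column of predicted_labels) is used as the matching score. `best_answer_per_question` is a hypothetical helper, and the probabilities are made-up values, not model outputs.

```python
def best_answer_per_question(examples, positive_probs):
    """For each question, keep the candidate with the highest probability of label 1."""
    best = {}
    for example, prob in zip(examples, positive_probs):
        question = example["question"]
        if question not in best or prob > best[question][1]:
            best[question] = (example["answer"], prob)
    return {question: answer for question, (answer, _) in best.items()}


examples = [
    {"question": "how are glacier caves formed ?",
     "answer": "The ice facade is approximately 60 m high"},
    {"question": "how are glacier caves formed ?",
     "answer": "A glacier cave is a cave formed within the ice of a glacier ."},
]
positive_probs = [0.10, 0.85]  # made-up scores for illustration
answers = best_answer_per_question(examples, positive_probs)
print(answers)
```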
The complete script for the second training method is consolidated below.
Data processing
import numpy as np
import pandas as pd
import os
import warnings
import torch
import evaluate
import transformers
from transformers import BertPreTrainedModel, BertModel, BertTokenizer, BertConfig, get_scheduler
from datasets import Dataset
from torch import nn
from torch.utils.data import DataLoader
from torch.optim import AdamW
from tqdm import tqdm
model_name = "bert-base-uncased"
warnings.filterwarnings('ignore')
transformers.logging.set_verbosity_error()
tokenize = BertTokenizer.from_pretrained(model_name)
config = BertConfig.from_pretrained("bert-base-uncased")
def load(filename):
    # Each row of the tab-separated file holds: question, answer, label
    df = pd.read_csv(filename, sep="\t", header=None)
    result = []
    for _, row in df.iterrows():
        # Skip rows with a missing field
        if pd.isna(row[0]) or pd.isna(row[1]) or pd.isna(row[2]):
            continue
        result.append({'question': str(row[0]),
                       'answer': str(row[1]),
                       'label': int(row[2])})
    return result
train_file = load('data/WikiQA-train.txt')
valid_file = load('data/WikiQA-dev.txt')
test_file = load('data/WikiQA-test.txt')
def get_data(name_file):
    input_ids = []
    attention_mask = []
    token_type_ids = []
    labels = []
    for example in name_file:
        # Encode the (question, answer) pair, padded or truncated to 64 tokens
        output = tokenize.encode_plus(text=example["question"],
                                      text_pair=example["answer"],
                                      max_length=64,
                                      truncation=True,
                                      add_special_tokens=True,
                                      padding="max_length")
        input_ids.append(output["input_ids"])
        attention_mask.append(output["attention_mask"])
        token_type_ids.append(output["token_type_ids"])
        labels.append(example["label"])
    return {"input_ids": input_ids,
            "attention_mask": attention_mask,
            "token_type_ids": token_type_ids,
            "labels": labels}
train_dataset = Dataset.from_dict(get_data(train_file))
eval_dataset = Dataset.from_dict(get_data(valid_file))
test_dataset = Dataset.from_dict(get_data(test_file))
train_dataset.set_format("torch")
eval_dataset.set_format("torch")
test_dataset.set_format("torch")
Training
# Hyperparameters
batch_size = 8
lr = 5e-5
num_epochs = 3
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=batch_size)
eval_dataloader = DataLoader(eval_dataset, batch_size=batch_size)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size)
class BertQA(BertPreTrainedModel):
    def __init__(self, config, freeze=True):
        super(BertQA, self).__init__(config)
        self.num_labels = config.num_labels
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # Freeze the BERT parameters so that only the layers after it are trained
        if freeze:
            for p in self.bert.parameters():
                p.requires_grad = False
        self.qa_outputs = nn.Linear(config.hidden_size, 2)
        self.loss_fn = nn.CrossEntropyLoss()
        self.init_weights()

    def forward(self, input_ids, attention_mask=None, token_type_ids=None, labels=None):
        outputs = self.bert(input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids)
        # outputs[1] is the pooled feature; the linear layer maps it to a
        # 2-dimensional vector of scores for labels 0 and 1
        logits = self.qa_outputs(outputs[1])
        predicted_labels = torch.softmax(logits, dim=-1)
        # If gold labels are provided (training), also compute the loss
        if labels is not None:
            # CrossEntropyLoss applies log-softmax itself, so it takes the raw logits
            loss = self.loss_fn(logits, labels)
            return {"loss": loss, "predicted_labels": predicted_labels}
        # Otherwise (inference), return only the predictions
        return {"predicted_labels": predicted_labels}

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model = BertQA(config)
model.to(device)
optimizer = AdamW(model.parameters(), lr=lr)
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)
# Named evaluate_accuracy to avoid shadowing Python's built-in eval()
def evaluate_accuracy(dataloader):
    metric = evaluate.load("accuracy")
    model.eval()
    for batch in dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.no_grad():
            outputs = model(**batch)
        predictions = torch.argmax(outputs["predicted_labels"], dim=-1)
        metric.add_batch(predictions=predictions, references=batch["labels"])
    return metric.compute()

progress_bar = tqdm(range(num_training_steps))
for epoch in range(num_epochs):
    model.train()
    for batch in train_dataloader:
        # batch keys: 'input_ids', 'attention_mask', 'token_type_ids', 'labels'
        # move every tensor in the batch to the training device
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs['loss']
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
    # Accuracy on the training and validation sets after each epoch
    print('train:', evaluate_accuracy(train_dataloader))
    print('eval:', evaluate_accuracy(eval_dataloader))