This article adapts Chris McCormick's BERT fine-tuning tutorial to the Quora Question Pairs task of deciding whether two questions ask the same thing (most of the prose is translated from the original post).
Original blog post: https://mccormickml.com/2019/07/22/BERT-fine-tuning/
Original Colab notebook: https://colab.research.google.com/drive/1pTuQhug6Dhl9XalKB0zUGf4FIdYFlpcX
Companion repository for this article: https://github.com/yxf975/pretraining_models_learning
This article drops much of the introductory material from the original English post and focuses on implementing a basic BERT fine-tuning pipeline. The resulting solution differs from Chris McCormick's in a few respects, mainly in that it handles sentence pairs on the Quora data rather than single sentences.
I will cover how pretrained models such as BERT actually work in a separate post, so let's get started!
Check the GPU
For torch to use the GPU, we need to identify and specify the GPU as the device. Later, in the training loop, we will load data onto that device.
import torch

# If there's a GPU available...
if torch.cuda.is_available():
    # Tell PyTorch to use the GPU.
    device = torch.device("cuda")
    n_gpu = torch.cuda.device_count()
    print('There are %d GPU(s) available.' % n_gpu)
    print('We will use the GPU:', [torch.cuda.get_device_name(i) for i in range(n_gpu)])
# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")
    n_gpu = 0  # needed later when deciding whether to wrap the model in DataParallel
Install the Transformers library
At the moment, Hugging Face's Transformers library seems to be the most widely adopted and most capable PyTorch interface for working with BERT. Besides supporting a variety of pretrained transformer models, the library also includes pre-built variants of those models tailored to specific tasks; in this tutorial we will use BertForSequenceClassification.
The library also provides task-specific classes for token classification, question answering, next-sentence prediction, and so on. Using these pre-built classes simplifies the process of adapting BERT to your own purpose.
!pip install transformers
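As a quick illustration of those task-specific classes (my addition, not used in the rest of this tutorial), they are all loaded through the same from_pretrained interface:

# Illustration only: other pre-built task heads in transformers use the same API.
from transformers import BertForTokenClassification, BertForQuestionAnswering

token_cls_model = BertForTokenClassification.from_pretrained("bert-base-uncased", num_labels=9)
qa_model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")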
The dataset can be downloaded from Kaggle after registering and logging in: https://www.kaggle.com/c/quora-question-pairs . I have also shared the dataset on Google Drive: https://drive.google.com/drive/folders/1kFkte0Kt2xLe6Ykl4O4_TrL2iCzorOYk
About the Quora Question Pairs dataset
This dataset comes from the Quora platform, where many people post similarly worded questions. Multiple questions with the same intent make searchers spend more time finding the best answer, and make writers feel they have to answer several versions of the same question.
The task is to classify whether a pair of questions are duplicates of each other. Solving it makes it easier to find high-quality answers, which means a better experience for Quora's writers, searchers, and readers.
Load the data with pandas
import pandas as pd
import numpy as np
# Load the dataset into a pandas dataframe.
train_data = pd.read_csv("./train.csv", index_col="id",nrows=10000)
train_data.head(6)
I show six rows here because the first positive example does not appear until the sixth row.
id | qid1 | qid2 | question1 | question2 | is_duplicate |
---|---|---|---|---|---|
0 | 1 | 2 | What is the step by step guide to invest in share market in india? | What is the step by step guide to invest in share market? | 0 |
1 | 3 | 4 | What is the story of Kohinoor (Koh-i-Noor) Diamond? | What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back? | 0 |
2 | 5 | 6 | How can I increase the speed of my internet connection while using a VPN? | How can Internet speed be increased by hacking through DNS? | 0 |
3 | 7 | 8 | Why am I mentally very lonely? How can I solve it? | Find the remainder when [math]23^{24}[/math] is divided by 24,23? | 0 |
4 | 9 | 10 | Which one dissolve in water quikly sugar, salt, methane and carbon di oxide? | Which fish would survive in salt water? | 0 |
5 | 11 | 12 | Astrology: I am a Capricorn Sun Cap moon and cap rising…what does that say about me? | I’m a triple Capricorn (Sun, Moon and ascendant in Capricorn) What does this say about me? | 1 |
The three fields we actually care about are "question1", "question2", and their label "is_duplicate" (0 = not duplicate, 1 = duplicate).
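As a quick sanity check (my addition, not in the original post), you can look at how these labels are distributed in the sample we loaded:

# Fraction of duplicate vs. non-duplicate pairs in the loaded sample.
print(train_data["is_duplicate"].value_counts(normalize=True))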
Train/validation split
Split the training data into 80% for training and 20% for validation.
from sklearn.model_selection import train_test_split
# train_validation data split
X_train, X_val, y_train, y_val = train_test_split(train_data[["question1", "question2"]], train_data["is_duplicate"], test_size=0.2, random_state=405633)
BERT Tokenizer
from transformers import BertTokenizer
# load bert tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
Find the maximum sentence length in the data
# calculate the maximum sentence length
max_len = 0
for _, row in train_data.iterrows():
    max_len = max(max_len, len(tokenizer(row['question1'], row['question2'])["input_ids"]))
print("max token length of the input:", max_len)

# set the maximum token length
max_length = pow(2, int(np.log2(max_len) + 1))
print("max token length for BERT:", max_length)
Convert to BERT input
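Before converting the whole dataset, it may help to look at what the tokenizer produces for a single question pair (a minimal sketch of my own; the two questions below are made up, and max_length=32 is chosen just for display):

# Illustration only: encode one hypothetical question pair.
example = tokenizer.encode_plus("How do I learn Python?",                 # hypothetical question1
                                "What is the best way to learn Python?",  # hypothetical question2
                                max_length=32, pad_to_max_length=True,
                                return_attention_mask=True, truncation=True)
print(example["input_ids"])       # [CLS] question1 tokens [SEP] question2 tokens [SEP] + padding
print(example["token_type_ids"])  # 0 for the question1 segment, 1 for the question2 segment
print(example["attention_mask"])  # 1 for real tokens, 0 for padding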
from torch.utils.data import TensorDataset
from tqdm import tqdm

# func to convert data to bert input
def convert_to_dataset_torch(data: pd.DataFrame, labels=pd.Series(data=None)) -> TensorDataset:
    input_ids = []
    attention_masks = []
    token_type_ids = []

    for _, row in tqdm(data.iterrows(), total=data.shape[0]):
        encoded_dict = tokenizer.encode_plus(row["question1"], row["question2"], max_length=max_length,
                                             pad_to_max_length=True, return_attention_mask=True,
                                             return_tensors='pt', truncation=True)
        # Add the encoded sentences to the list.
        input_ids.append(encoded_dict['input_ids'])
        token_type_ids.append(encoded_dict["token_type_ids"])
        # And its attention mask (simply differentiates padding from non-padding).
        attention_masks.append(encoded_dict['attention_mask'])

    # Convert the lists into tensors.
    input_ids = torch.cat(input_ids, dim=0)
    token_type_ids = torch.cat(token_type_ids, dim=0)
    attention_masks = torch.cat(attention_masks, dim=0)

    if labels.empty:
        return TensorDataset(input_ids, attention_masks, token_type_ids)
    else:
        labels = torch.tensor(labels.values)
        return TensorDataset(input_ids, attention_masks, token_type_ids, labels)
train = convert_to_dataset_torch(X_train, y_train)
validation = convert_to_dataset_torch(X_val, y_val)
Put the data into a DataLoader
We will also use the torch DataLoader class to create an iterator over our datasets. This helps save memory during training because, unlike a for loop over in-memory data, an iterator does not require the entire dataset to be loaded into memory at once.
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

# set batch size for DataLoader (options from the paper: 16 or 32)
batch_size = 32

# Create the DataLoaders for training and validation sets
train_dataloader = DataLoader(
    train,
    sampler=RandomSampler(train),  # Select batches randomly
    batch_size=batch_size
)

# For validation
validation_dataloader = DataLoader(
    validation,
    sampler=SequentialSampler(validation),  # Pull out batches sequentially.
    batch_size=batch_size
)
Load the pretrained model BertForSequenceClassification
We will use BertForSequenceClassification. This is the plain BERT model with a single linear classification layer added on top, which we will use as a sentence-pair classifier. As we feed in our data, the entire pretrained BERT model and the additional, untrained classification layer are trained together on our specific task.
from transformers import BertForSequenceClassification, AdamW, BertConfig

# Load BertForSequenceClassification, the pretrained BERT model with a single
# linear classification layer on top.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",         # Use the 12-layer BERT model, with an uncased vocab.
    num_labels=2,                # The number of output labels--2 for binary classification.
                                 # You can increase this for multi-class tasks.
    output_attentions=False,     # Whether the model returns attentions weights.
    output_hidden_states=False,  # Whether the model returns all hidden-states.
)

# Tell pytorch to run this model on the GPU.
model.cuda()
if n_gpu > 1:
    model = torch.nn.DataParallel(model)
Of course, you could also modify BERT's network structure to better fit the task; here I simply use the model as-is.
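For reference, here is a minimal sketch (my own, not part of the original tutorial) of what a custom classification head on top of the bare BertModel could look like; the dropout rate and the use of the pooled [CLS] output are arbitrary choices:

import torch.nn as nn
from transformers import BertModel

class CustomBertPairClassifier(nn.Module):
    """Hypothetical example: BERT encoder with a hand-written classification head."""
    def __init__(self, num_labels=2, dropout=0.3):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        outputs = self.bert(input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids)
        pooled = outputs[1]  # pooled [CLS] representation
        return self.classifier(self.dropout(pooled))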
优化器 & 学习率调度器
为了微调的目的,BERT论文的作者建议从以下数值中选择(来自BERT论文的附录A.3)。
- batch大小: 16,32。(在Dataloader里设置)
- 学习率(Adam): 5e-5、3e-5、2e-5。
- epoch数: 2、3、4。
from transformers import get_linear_schedule_with_warmup

optimizer = AdamW(model.parameters(),
                  lr=2e-5,   # args.learning_rate
                  eps=1e-8   # args.adam_epsilon
                  )

# Number of training epochs
epochs = 2

# Total number of training steps is [number of batches] x [number of epochs].
total_steps = len(train_dataloader) * epochs

# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=0,  # Default value in run_glue.py
                                            num_training_steps=total_steps)
Time formatting function
import time
import datetime

# Helper function for formatting elapsed times as hh:mm:ss
def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string hh:mm:ss
    '''
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))
    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))
Fit function
from tqdm import tqdm

def fit_batch(dataloader, model, optimizer, epoch):
    total_train_loss = 0

    for batch in tqdm(dataloader, desc=f"Training epoch:{epoch+1}", unit="batch"):
        # Unpack batch from dataloader.
        input_ids = batch[0].to(device)
        attention_masks = batch[1].to(device)
        token_type_ids = batch[2].to(device)
        labels = batch[3].to(device)
        # clear any previously calculated gradients before performing a backward pass.
        model.zero_grad()
        # Perform a forward pass (evaluate the model on this training batch).
        outputs = model(input_ids,
                        token_type_ids=token_type_ids,
                        attention_mask=attention_masks,
                        labels=labels)
        loss = outputs[0]
        if n_gpu > 1:
            loss = loss.mean()  # average the per-GPU losses when using DataParallel
        total_train_loss += loss.item()
        # Perform a backward pass to calculate the gradients.
        loss.backward()
        # Clip the gradient norm to 1.0 to avoid exploding gradients.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        # Update parameters and take a step using the computed gradient.
        optimizer.step()
        # Update the learning rate.
        scheduler.step()
    return total_train_loss
Evaluation function
from sklearn.metrics import accuracy_score

def eval_batch(dataloader, model, metric=accuracy_score):
    total_eval_accuracy = 0
    total_eval_loss = 0
    predictions, predicted_labels = [], []

    for batch in tqdm(dataloader, desc="Evaluating", unit="batch"):
        # Unpack batch from dataloader.
        input_ids = batch[0].to(device)
        attention_masks = batch[1].to(device)
        token_type_ids = batch[2].to(device)
        labels = batch[3].to(device)
        # Tell pytorch not to bother with constructing the compute graph during
        # the forward pass, since this is only needed for backprop (training).
        with torch.no_grad():
            # Forward pass, calculate logit predictions.
            outputs = model(input_ids,
                            token_type_ids=token_type_ids,
                            attention_mask=attention_masks,
                            labels=labels)
        loss = outputs[0]
        if n_gpu > 1:
            loss = loss.mean()  # average the per-GPU losses when using DataParallel
        logits = outputs[1]
        total_eval_loss += loss.item()
        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = labels.to('cpu').numpy()
        # Calculate the accuracy for this batch of validation sentences, and
        # accumulate it over all batches.
        y_pred = np.argmax(logits, axis=1).flatten()
        total_eval_accuracy += metric(label_ids, y_pred)
        predictions.extend(logits.tolist())
        predicted_labels.extend(y_pred.tolist())
    return total_eval_accuracy, total_eval_loss, predictions, predicted_labels
Training function
def train(train_dataloader, validation_dataloader, model, optimizer, epochs):
    # list to store a number of quantities such as
    # training and validation loss, validation accuracy, and timings.
    training_stats = []
    # Measure the total training time for the whole run.
    total_t0 = time.time()

    for epoch in range(0, epochs):
        # Measure how long the training epoch takes.
        t0 = time.time()
        # Put the model into training mode.
        model.train()
        total_train_loss = fit_batch(train_dataloader, model, optimizer, epoch)
        # Calculate the average loss over all of the batches.
        avg_train_loss = total_train_loss / len(train_dataloader)
        # Measure how long this epoch took.
        training_time = format_time(time.time() - t0)

        t0 = time.time()
        # Put the model in evaluation mode--the dropout layers behave differently
        # during evaluation.
        model.eval()
        total_eval_accuracy, total_eval_loss, _, _ = eval_batch(validation_dataloader, model)
        # Report the final accuracy for this validation run.
        avg_val_accuracy = total_eval_accuracy / len(validation_dataloader)
        print("\n")
        print(f"score: {avg_val_accuracy}")
        # Calculate the average loss over all of the batches.
        avg_val_loss = total_eval_loss / len(validation_dataloader)
        # Measure how long the validation run took.
        validation_time = format_time(time.time() - t0)
        print(f"Validation Loss: {avg_val_loss}")
        print("\n")
        # Record all statistics from this epoch.
        training_stats.append(
            {
                'epoch': epoch,
                'Training Loss': avg_train_loss,
                'Valid. Loss': avg_val_loss,
                'Valid. score.': avg_val_accuracy,
                'Training Time': training_time,
                'Validation Time': validation_time
            }
        )

    print("")
    print("Training complete!")
    print(f"Total training took {format_time(time.time()-total_t0)}")
    return training_stats
Start training
import random

# Set the seed value all over the place to make this reproducible.
seed_val = 2020

random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
if n_gpu > 0:
    torch.cuda.manual_seed_all(seed_val)

training_stats = train(train_dataloader, validation_dataloader, model, optimizer, epochs)
View the evaluation statistics from training
df_stats = pd.DataFrame(training_stats).set_index('epoch')
df_stats
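For a quick visual check (my addition, not in the original code), the same DataFrame can be plotted, for example with matplotlib:

import matplotlib.pyplot as plt

# Plot training vs. validation loss per epoch from the stats collected above.
plt.plot(df_stats["Training Loss"], marker="o", label="Training loss")
plt.plot(df_stats["Valid. Loss"], marker="o", label="Validation loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.show()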
Prediction function
def predict(dataloader, model):
    prediction = list()
    # Make sure dropout is disabled for prediction.
    model.eval()

    for batch in tqdm(dataloader, desc="predicting", unit="batch"):
        # Unpack batch from dataloader.
        input_ids = batch[0].to(device)
        attention_masks = batch[1].to(device)
        token_type_ids = batch[2].to(device)
        # Tell pytorch not to bother with constructing the compute graph during
        # the forward pass, since this is only needed for backprop (training).
        with torch.no_grad():
            # Forward pass, calculate logit predictions.
            outputs = model(input_ids,
                            token_type_ids=token_type_ids,
                            attention_mask=attention_masks)
        logits = outputs[0]
        # Move logits to CPU
        logits = logits.detach().cpu().numpy()
        prediction.append(logits)

    pred_logits = np.concatenate(prediction, axis=0)
    pred_label = np.argmax(pred_logits, axis=1).flatten()
    print("done")
    return (pred_label, pred_logits)
Create a DataLoader for the test set
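The original notebook does not show how test_data is loaded; here is a minimal sketch, assuming the competition's test.csv (columns test_id, question1, question2) sits next to train.csv and that we again only take a slice of it:

# Assumption: test.csv from the same Kaggle download; it has no labels.
test_data = pd.read_csv("./test.csv", index_col="test_id", nrows=10000)
# Guard against any missing questions so the tokenizer always receives strings.
test_data = test_data.fillna("")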
# Create the DataLoader for test data.
prediction_data = convert_to_dataset_torch(test_data)
prediction_sampler = SequentialSampler(prediction_data)
prediction_dataloader = DataLoader(prediction_data, sampler=prediction_sampler, batch_size=batch_size)
Predict on the test set
You can also use softmax to convert the logits into the corresponding probabilities.
y_pred, logits = predict(prediction_dataloader, model)

# get the corresponding probabilities
prob = torch.nn.functional.softmax(torch.tensor(logits), dim=1)
This post has shown how to fine-tune a pretrained BERT model for the Quora question-pair task; a similar fine-tuning recipe can be applied to other text classification problems.
Of course, for more accurate predictions you may need a better or more suitable pretrained model, a network architecture modified for the task, or additional techniques such as adversarial training.