In this tutorial, I will show you how to fine-tune a pretrained XLNet model with the Hugging Face PyTorch library to quickly build a classifier for text classification.
This post comes in two forms: a blog post and a Colab notebook. The content is the same in both.
In Colab, a GPU can be enabled from the menu:
Edit -> Notebook Settings -> Add accelerator (GPU)
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
    raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))
## Found GPU at: /device:GPU:0
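At the moment, the Hugging Face library seems to be the most widely used and most powerful PyTorch interface for working with transfer learning models. Besides supporting a variety of pretrained language models (and future ones: within a few months of their release, both BERT and XLNet had already been outperformed by newer models!), the library also includes prebuilt modifications of these models for specific tasks. For example, in this tutorial we will use XLNet for sequence classification, but the library also includes model modifications designed for token classification, question answering, next sentence prediction, and more. Using these prebuilt classes simplifies the process of modifying a transfer learning model.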
!pip install pytorch-transformers
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from pytorch_transformers import XLNetModel, XLNetTokenizer, XLNetForSequenceClassification
from pytorch_transformers import AdamW
from tqdm import tqdm, trange
import pandas as pd
import io
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
torch.cuda.get_device_name(0)
## 'Tesla T4'
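We will use the Corpus of Linguistic Acceptability (CoLA) for single-sentence classification. It is a set of sentences labeled as grammatically correct or incorrect.
Both a tokenized and a raw version of the data are available. We will use the raw version, because we need the XLNet tokenizer to break the text into the tokens and chunks that the model will recognize.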
from google.colab import files
uploaded = files.upload()
df = pd.read_csv("in_domain_train.tsv", delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])
df.shape
## (8551, 4)
# Create sentence and label lists
sentences = df.sentence.values
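To make the file layout concrete, here is a minimal sketch that parses the same four-column, header-less TSV format with only the standard library. The two sample rows and the helper name `read_cola_rows` are made up for illustration; they are not taken from the dataset.

```python
import csv
import io

# Two made-up rows in the four-column, header-less layout (illustrative only).
SAMPLE_TSV = (
    "src1\t1\t\tThe cat sat on the mat.\n"
    "src2\t0\t*\tCat the mat on sat the.\n"
)

COLUMNS = ["sentence_source", "label", "label_notes", "sentence"]


def read_cola_rows(handle):
    """Return one dict per row, with the label converted to int."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for fields in reader:
        row = dict(zip(COLUMNS, fields))
        row["label"] = int(row["label"])
        rows.append(row)
    return rows


demo_rows = read_cola_rows(io.StringIO(SAMPLE_TSV))
demo_sentences = [row["sentence"] for row in demo_rows]
demo_labels = [row["label"] for row in demo_rows]
```

Only the `sentence` and `label` columns are needed for the rest of the tutorial.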
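For BERT, the special token pattern looks like this:
[CLS] + Sentence_A + [SEP] + Sentence_B + [SEP]
XLNet places the special tokens at the end of the sequence instead:
Sentence_A + [SEP] + Sentence_B + [SEP] + [CLS]
# Append the separator and classification markers to the end of every sentence.
# Caveat: XLNet's own vocabulary spells its special tokens <sep> and <cls>; the bracketed
# strings used here are tokenized as ordinary word pieces.
sentences = [sentence + " [SEP] [CLS]" for sentence in sentences]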
labels = df.label.values
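The difference between the two layouts can be sketched on plain token lists. The helper names `bert_pattern` and `xlnet_pattern` are hypothetical and exist only for this illustration; the marker strings are the ones used in the patterns above.

```python
def bert_pattern(sentence_a, sentence_b=None):
    """BERT layout: classification marker first, a separator after each sentence."""
    tokens = ["[CLS]"] + sentence_a + ["[SEP]"]
    if sentence_b is not None:
        tokens += sentence_b + ["[SEP]"]
    return tokens


def xlnet_pattern(sentence_a, sentence_b=None):
    """XLNet layout: a separator after each sentence, classification marker last."""
    tokens = sentence_a + ["[SEP]"]
    if sentence_b is not None:
        tokens += sentence_b + ["[SEP]"]
    return tokens + ["[CLS]"]


single = xlnet_pattern(["our", "friends"])
pair = bert_pattern(["a"], ["b"])
```

For a single sentence, as in CoLA, the XLNet layout reduces to the sentence followed by the two markers.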
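Next, import the XLNet tokenizer, which converts our text into tokens that correspond to XLNet's vocabulary.
# do_lower_case=True lowercases the text before tokenizing, even though this checkpoint is cased
tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased', do_lower_case=True)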
tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]
print ("Tokenize the first sentence:")
print (tokenized_texts[0])
Tokenize the first sentence:
['▁our', '▁friends', '▁won', "'", 't', '▁buy', '▁this', '▁analysis', ',', '▁let', '▁alone', '▁the', '▁next', '▁one', '▁we', '▁propose', '.', '▁[', 's', 'ep', ']', '▁[', 'cl', 's', ']']
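XLNet expects its input in a specific format. For each tokenized input sentence we need to create the token IDs padded to a fixed length, an attention mask that separates real tokens from padding, and a label.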
# Set the maximum sequence length. The longest sequence in our training set is 47, but we'll leave room on the end anyway.
MAX_LEN = 128
# Use the XLNet tokenizer to convert the tokens to their index numbers in the XLNet vocabulary
input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]
# Pad our input tokens
input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")
# Create attention masks
attention_masks = []
# Create a mask of 1s for each token followed by 0s for padding
for seq in input_ids:
    seq_mask = [float(i>0) for i in seq]
    attention_masks.append(seq_mask)
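The padding and masking step can be sketched for a single sequence in plain Python. This is an illustration under stated assumptions, not the Keras implementation: the helper `pad_and_mask` is hypothetical, and it builds the mask from the sequence length rather than from `i > 0`, so a genuine token whose ID happens to be 0 would still be attended to.

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    """Truncate or right-pad one ID sequence and build its attention mask."""
    clipped = token_ids[:max_len]            # same effect as truncating="post"
    n_pad = max_len - len(clipped)
    padded = clipped + [pad_id] * n_pad      # same effect as padding="post"
    mask = [1.0] * len(clipped) + [0.0] * n_pad
    return padded, mask


short_ids, short_mask = pad_and_mask([17, 4, 9], max_len=5)
long_ids, long_mask = pad_and_mask([1, 2, 3, 4], max_len=2)
```

The mask has a 1 for every real token and a 0 for every padding position, so the model can ignore the padding.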
# Use train_test_split to split our data into train and validation sets for training
train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(input_ids, labels, random_state=2018, test_size=0.1)
train_masks, validation_masks, _, _ = train_test_split(attention_masks, input_ids, random_state=2018,test_size=0.1)
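The two calls above share the same `random_state` so that the inputs and the masks are shuffled identically and stay paired. A minimal standard-library sketch of that idea follows; `split_indices` and `take` are hypothetical stand-ins, not scikit-learn functions, and the shuffle order will differ from scikit-learn's.

```python
import random


def split_indices(n_items, test_size, seed):
    """Shuffle the indices with a fixed seed and cut off a held-out slice."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_items * test_size))
    return indices[n_test:], indices[:n_test]


def take(items, indices):
    """Select the items at the given positions, in that order."""
    return [items[i] for i in indices]


demo_inputs = ["sent%d" % i for i in range(10)]
demo_masks = ["mask%d" % i for i in range(10)]

# Splitting twice with the same seed yields the same permutation both times.
train_idx_a, val_idx_a = split_indices(len(demo_inputs), test_size=0.1, seed=2018)
train_idx_b, val_idx_b = split_indices(len(demo_masks), test_size=0.1, seed=2018)

train_inputs_demo = take(demo_inputs, train_idx_a)
train_masks_demo = take(demo_masks, train_idx_b)
```

If the seeds differed, row `k` of the inputs would no longer correspond to row `k` of the masks.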
# Convert all of our data into torch tensors, the required datatype for our model
train_inputs = torch.tensor(train_inputs)
validation_inputs = torch.tensor(validation_inputs)
train_labels = torch.tensor(train_labels)
validation_labels = torch.tensor(validation_labels)
train_masks = torch.tensor(train_masks)
validation_masks = torch.tensor(validation_masks)
# Select a batch size for training. For fine-tuning with XLNet, the authors recommend a batch size of 32, 48, or 128. We will use 32 here to avoid memory issues.
batch_size = 32
# Create an iterator of our data with torch DataLoader. This helps save on memory during training because, unlike a for loop,
# with an iterator the entire dataset does not need to be loaded into memory
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)
validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)
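Conceptually, a sequential loader hands out consecutive slices of the dataset, with a shorter final slice when the size is not a multiple of the batch size. The generator below is a plain-Python sketch of that behavior; `iter_batches` is a hypothetical name, and it does not reproduce the shuffling, collation, or worker handling of `DataLoader`.

```python
def iter_batches(rows, batch_size):
    """Yield consecutive slices of `rows`, each at most `batch_size` long."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


demo_batches = list(iter_batches(list(range(7)), batch_size=3))
```

Because batches are produced one at a time, only the current batch has to be moved to the GPU.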
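For this task, we first want to modify the pretrained model so that it produces classification outputs, and then continue training it on our dataset until the entire model, end to end, is well suited to our task. Thankfully, the Hugging Face PyTorch implementation includes a set of interfaces designed for a variety of NLP tasks. Although these interfaces are all built on top of the same pretrained model, each has different top layers and output types to fit its particular NLP task.
Because the pretrained layers already encode a great deal of information about language, training the classifier is relatively inexpensive. Rather than training every layer of a large model from scratch, it is as if 95% of the training has already been done, and we only need to tune the top layer for the specific downstream task.
Practitioners sometimes choose to "freeze" certain layers when fine-tuning, or to apply different learning rates, decaying learning rates, and so on, all in an effort to preserve the good-quality weights in the network and to speed up training (often considerably). In fact, recent research on transfer learning models such as BERT has shown that freezing the majority of the weights leads to only a minimal drop in accuracy, but there are exceptions, and the broader rules of transfer learning should also be considered. For example, if your task and fine-tuning dataset are very different from the dataset used to train the transfer learning model, freezing the weights may not be a good idea. We will cover transfer learning in NLP more broadly in a future post.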
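As a rough illustration of what freezing means in PyTorch, the sketch below turns off gradients for every layer of a toy network except the last one. The toy `nn.Sequential` model and the helper `freeze_all_but_last` are assumptions made for this example; this tutorial itself fine-tunes all XLNet weights and does not freeze anything.

```python
import torch.nn as nn


def freeze_all_but_last(model):
    """Disable gradient updates for every child module except the final one."""
    layers = list(model.children())
    for layer in layers[:-1]:
        for param in layer.parameters():
            param.requires_grad = False


# Toy stand-in for "pretrained body + classification head" (not XLNet).
toy = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))
freeze_all_but_last(toy)

# Names of the parameters that an optimizer would still update.
trainable = [name for name, param in toy.named_parameters() if param.requires_grad]
```

Parameters with `requires_grad = False` receive no gradients, so an optimizer step leaves them unchanged.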
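OK, let's load XLNet! Several pretrained XLNet models are available. "xlnet-base-cased" is the version that keeps both uppercase and lowercase letters ("cased") and is the smaller of the two sizes ("base" rather than "large").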
# Load XLNetForSequenceClassification, the pretrained XLNet model with a single linear classification layer on top.
model = XLNetForSequenceClassification.from_pretrained("xlnet-base-cased", num_labels=2)
# Move the model to the same device the batches will be sent to
model.to(device)
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'gamma', 'beta']
# AdamW reads the per-group decay from the 'weight_decay' key
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay': 0.0}
]
# This variable contains all of the hyperparameter information our training loop needs
# NOTE: the remaining arguments were cut off in the source; lr=2e-5 is an assumed value
optimizer = AdamW(optimizer_grouped_parameters, lr=2e-5)
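The grouping logic can be checked in isolation, without loading a model. In this sketch the parameter names are hypothetical and short strings stand in for the tensors; `group_for_weight_decay` is an illustrative helper, not part of the library.

```python
def group_for_weight_decay(named_params, no_decay, decay=0.01):
    """Split (name, param) pairs into a decayed group and a non-decayed group."""
    with_decay, without_decay = [], []
    for name, param in named_params:
        # A parameter is exempt if any marker appears anywhere in its name.
        if any(marker in name for marker in no_decay):
            without_decay.append(param)
        else:
            with_decay.append(param)
    return [
        {"params": with_decay, "weight_decay": decay},
        {"params": without_decay, "weight_decay": 0.0},
    ]


# Hypothetical parameter names; the strings stand in for tensors.
toy_params = [
    ("layer.0.ff.weight", "W0"),
    ("layer.0.ff.bias", "b0"),
    ("classifier.weight", "W1"),
    ("classifier.bias", "b1"),
]
groups = group_for_weight_decay(toy_params, no_decay=["bias", "gamma", "beta"])
```

Matching is by substring, so it is worth printing the real parameter names of a model to confirm which ones each marker actually catches.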
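Below is our training loop. There is a lot going on, but fundamentally each pass through the loop has a training phase and a validation phase. At each pass we need to:
Training loop: put the model in training mode, move each batch to the GPU, clear the gradients from the previous step, run a forward pass, run a backward pass, update the parameters with the optimizer, and track the loss.
Evaluation loop: put the model in evaluation mode, move each batch to the GPU, run a forward pass without computing gradients, and compute the accuracy on the validation data.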
# Function to calculate the accuracy of our predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)
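A small worked example may help show what the accuracy function computes: take the index of the largest logit in each row as the prediction, then count how often it equals the label. The version below uses plain Python lists and made-up logits; `argmax` and `accuracy_from_logits` are illustrative helpers, not the NumPy code above.

```python
def argmax(row):
    """Index of the largest value in a list (first one wins on ties)."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Made-up logits for four sentences and two classes.
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.9]]
toy_labels = [0, 1, 0, 0]
toy_accuracy = accuracy_from_logits(toy_logits, toy_labels)  # 3 of 4 correct
```

The predicted classes here are 0, 1, 1, 0, so only the third sentence is misclassified.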
# Store our loss and accuracy for plotting
train_loss_set = []
# Number of training epochs (authors recommend between 2 and 4)
epochs = 4
# trange is a tqdm wrapper around the normal python range
for _ in trange(epochs, desc="Epoch"):
    # Training
    # Set our model to training mode (as opposed to evaluation mode)
    model.train()
    # Tracking variables
    tr_loss = 0
    nb_tr_examples, nb_tr_steps = 0, 0
    # Train the data for one epoch
    for step, batch in enumerate(train_dataloader):
        # Add batch to GPU
        batch = tuple(t.to(device) for t in batch)
        # Unpack the inputs from our dataloader
        b_input_ids, b_input_mask, b_labels = batch
        # Clear out the gradients (by default they accumulate)
        optimizer.zero_grad()
        # Forward pass
        outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
        loss = outputs[0]
        logits = outputs[1]
        # Record the per-batch loss for plotting
        train_loss_set.append(loss.item())
        # Backward pass
        loss.backward()
        # Update parameters and take a step using the computed gradient
        optimizer.step()
        # Update tracking variables
        tr_loss += loss.item()
        nb_tr_examples += b_input_ids.size(0)
        nb_tr_steps += 1
    print("Train loss: {}".format(tr_loss/nb_tr_steps))
    # Validation
    # Put model in evaluation mode to evaluate loss on the validation set
    model.eval()
    # Tracking variables
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0
    # Evaluate data for one epoch
    for batch in validation_dataloader:
        # Add batch to GPU
        batch = tuple(t.to(device) for t in batch)
        # Unpack the inputs from our dataloader
        b_input_ids, b_input_mask, b_labels = batch
        # Telling the model not to compute or store gradients, saving memory and speeding up validation
        with torch.no_grad():
            # Forward pass, calculate logit predictions
            output = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
            logits = output[0]
        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        tmp_eval_accuracy = flat_accuracy(logits, label_ids)
        eval_accuracy += tmp_eval_accuracy
        nb_eval_steps += 1
    print("Validation Accuracy: {}".format(eval_accuracy/nb_eval_steps))
Let's take a look at our training loss over all batches:
plt.plot(train_loss_set)
plt.title("Training loss")
# Upload the test file from your local drive
from google.colab import files
uploaded = files.upload()
df = pd.read_csv("out_of_domain_dev.tsv", delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])
# Create sentence and label lists
sentences = df.sentence.values
# Append the same end-of-sequence markers that were used for the training data
sentences = [sentence + " [SEP] [CLS]" for sentence in sentences]
labels = df.label.values
tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]
MAX_LEN = 128
# Use the XLNet tokenizer to convert the tokens to their index numbers in the XLNet vocabulary
input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]
# Pad our input tokens
input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")
# Create attention masks
attention_masks = []
# Create a mask of 1s for each token followed by 0s for padding
for seq in input_ids:
    seq_mask = [float(i>0) for i in seq]
    attention_masks.append(seq_mask)
prediction_inputs = torch.tensor(input_ids)
prediction_masks = torch.tensor(attention_masks)
prediction_labels = torch.tensor(labels)
batch_size = 32
prediction_data = TensorDataset(prediction_inputs, prediction_masks, prediction_labels)
prediction_sampler = SequentialSampler(prediction_data)
prediction_dataloader = DataLoader(prediction_data, sampler=prediction_sampler, batch_size=batch_size)
# Prediction on test set
# Put model in evaluation mode
model.eval()
# Tracking variables
predictions, true_labels = [], []
# Predict
for batch in prediction_dataloader:
    # Add batch to GPU
    batch = tuple(t.to(device) for t in batch)
    # Unpack the inputs from our dataloader
    b_input_ids, b_input_mask, b_labels = batch
    # Telling the model not to compute or store gradients, saving memory and speeding up prediction
    with torch.no_grad():
        # Forward pass, calculate logit predictions
        outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
        logits = outputs[0]
    # Move logits and labels to CPU
    logits = logits.detach().cpu().numpy()
    label_ids = b_labels.to('cpu').numpy()
    # Store predictions and true labels
    predictions.append(logits)
    true_labels.append(label_ids)
from sklearn.metrics import matthews_corrcoef
matthews_set = []
# Compute the Matthews correlation coefficient for each batch
for i in range(len(true_labels)):
    matthews = matthews_corrcoef(true_labels[i],
                                 np.argmax(predictions[i], axis=1).flatten())
    matthews_set.append(matthews)
# Flatten the predictions and true values for aggregate Matthews evaluation on the whole dataset
flat_predictions = [item for sublist in predictions for item in sublist]
flat_predictions = np.argmax(flat_predictions, axis=1).flatten()
flat_true_labels = [item for sublist in true_labels for item in sublist]
matthews_corrcoef(flat_true_labels, flat_predictions)
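For reference, the Matthews correlation coefficient for binary labels can be computed directly from the confusion-matrix counts. The function below is a from-scratch sketch with made-up labels, not the scikit-learn implementation; `matthews_corrcoef_binary` is an illustrative name, and returning 0.0 when the denominator is zero is a convention assumed here.

```python
import math


def matthews_corrcoef_binary(y_true, y_pred):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumed convention: an undefined score is reported as 0.0.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Made-up labels: TP=1, TN=2, FP=0, FN=1, so MCC = 2 / sqrt(12).
toy_mcc = matthews_corrcoef_binary([1, 1, 0, 0], [1, 0, 0, 0])
```

The score ranges from -1 to +1, where +1 is a perfect prediction and 0 is no better than chance, which makes it more informative than accuracy when the classes are unbalanced.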