Implementing BERT in PyTorch
At the time of writing, state-of-the-art results on NLP and NLU tasks are obtained with Transformer models. Performance tends to improve as models become deeper and larger; GPT-3 comes to mind. Training even small versions of such models from scratch takes a significant amount of time, even with a GPU. Pre-training addresses this problem: a model is first trained on a large text corpus using a high-performance cluster, and can later be fine-tuned for a specific task in a much shorter time. During the fine-tuning stage, additional layers can be added to the model for specific tasks, which can differ from those the model was initially trained on. This technique is related to transfer learning, a concept applied in areas of machine learning beyond NLP.
In this post, I would like to share my experience of fine-tuning BERT and RoBERTa, available from the transformers library by Hugging Face, for a document classification task. Both models are based on the Transformer architecture, which in its original form consists of two distinct blocks, an encoder and a decoder, each made of multiple layers built around the attention mechanism. BERT and RoBERTa use only the encoder stack. The encoder processes the input token sequence into a hidden state, a vector of floating-point numbers; in the original architecture it is picked up by the decoder, whereas here it is passed to a classifier. It is the hidden state that encompasses the information content of the input sequence. This makes it possible to represent an entire sequence of tokens with a single dense vector of floating-point numbers. Two texts or documents with similar meaning are represented by closely aligned vectors. Comparing vectors using a metric of choice, for example cosine similarity, makes it possible to quantify the similarity of the original pieces of text.
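To make the idea of comparing document vectors concrete, here is a minimal plain-Python sketch of cosine similarity. The three vectors are made up for illustration and are not real model outputs; real hidden states of the base models have 768 dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 4-dimensional "document vectors" (illustration only).
doc_a = [0.9, 0.1, 0.3, 0.0]
doc_b = [0.8, 0.2, 0.4, 0.1]    # points in nearly the same direction as doc_a
doc_c = [-0.7, 0.6, -0.2, 0.5]  # points in a very different direction

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
print(f"a vs b: {sim_ab:.3f}, a vs c: {sim_ac:.3f}")
```

Values close to 1 indicate closely aligned vectors, while values near 0 or below indicate unrelated or opposing directions.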
Bolts and Nuts
While researching this topic, I found an article on fine-tuning BERT to classify the Fake News dataset, available on Kaggle. Running the code from the article as is yields an F1 score considerably lower than the claimed 96.99%. After a weekend of reading and adding a few things, I managed to squeeze out a 99.05% F1 score on the test set for a RoBERTa model with two additional Linear layers.
Classification Report:
              precision    recall  f1-score   support
           1     0.9844    0.9968    0.9906       634
           0     0.9968    0.9840    0.9904       626
    accuracy                         0.9905      1260
   macro avg     0.9906    0.9904    0.9905      1260
weighted avg     0.9906    0.9905    0.9905      1260
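As a quick sanity check on the report above, the F1 score of each class is the harmonic mean of its precision and recall. The sketch below recomputes it from the rounded values in the report, so the results agree with the reported F1 scores only up to rounding.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall per class, copied from the classification report.
f1_class_1 = f1_score(0.9844, 0.9968)
f1_class_0 = f1_score(0.9968, 0.9840)
macro_f1 = (f1_class_1 + f1_class_0) / 2  # unweighted mean over the two classes

print(f"class 1: {f1_class_1:.4f}, class 0: {f1_class_0:.4f}, macro: {macro_f1:.4f}")
```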
To start, let's have a brief look at the Fake News dataset. It consists of 6299 items with text exceeding 5 words: 3128 fake and 3171 real. The plot below shows a histogram of the text length distribution, cropped at 5000 tokens. Documents with token counts as high as 20000 are present in the dataset.
Batch size and sequence length trade-off. Both BERT and RoBERTa are limited to sequences of 512 tokens in their base configuration. GPU memory limitations can further reduce the maximum sequence length. It is possible to trade batch size for sequence length. In the “Language Models are Few-Shot Learners” paper, the authors mention the benefit of a higher batch size at later stages of training. As fine-tuning picks up where pre-training ends, a higher batch size tends to lead to better results and somewhat reduces over-fitting.
Sequences in a training batch can have different lengths. This requires appending padding tokens to each sequence to bring them to a common length. It can be done using a dedicated tokenizer, generously supplied by Hugging Face with the corresponding models:
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
The Torchtext library provides several easy-to-use, Swiss-army-knife iterators. Among other things, they can group sequences of similar length into batches and pad them, split a dataset into train, validation and test sets (stratified if necessary), and shuffle the data after every epoch.
# Field, TabularDataset, BucketIterator and Iterator belong to the legacy torchtext API
# (torchtext.data in older releases, torchtext.legacy.data in versions 0.9 to 0.11).
from torchtext.data import Field, TabularDataset, BucketIterator, Iterator

# Set tokenizer hyperparameters.
MAX_SEQ_LEN = 256
BATCH_SIZE = 16
PAD_INDEX = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
UNK_INDEX = tokenizer.convert_tokens_to_ids(tokenizer.unk_token)

# Define columns to read.
label_field = Field(sequential=False, use_vocab=False, batch_first=True)
text_field = Field(use_vocab=False,
                   tokenize=tokenizer.encode,
                   include_lengths=False,
                   batch_first=True,
                   fix_length=MAX_SEQ_LEN,
                   pad_token=PAD_INDEX,
                   unk_token=UNK_INDEX)
fields = {'titletext' : ('titletext', text_field), 'label' : ('label', label_field)}

# Read preprocessed CSV into TabularDataset and split it into train, test and valid.
# split_ratio lists the train, test and valid fractions; split() returns the
# datasets in train, valid, test order. data_path and device are defined elsewhere.
train_data, valid_data, test_data = TabularDataset(path=f"{data_path}/prep_news.csv",
                                                   format='CSV',
                                                   fields=fields,
                                                   skip_header=False).split(split_ratio=[0.70, 0.2, 0.1],
                                                                            stratified=True,
                                                                            strata_field='label')

# Create train and validation iterators.
train_iter, valid_iter = BucketIterator.splits((train_data, valid_data),
                                               batch_size=BATCH_SIZE,
                                               device=device,
                                               shuffle=True,
                                               sort_key=lambda x: len(x.titletext),
                                               sort=True,
                                               sort_within_batch=False)

# Test iterator, no shuffling or sorting required.
test_iter = Iterator(test_data, batch_size=BATCH_SIZE, device=device, train=False, shuffle=False, sort=False)
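The benefit of grouping sequences of similar length can be illustrated with a small plain-Python sketch. It assumes that each batch is padded only up to its own longest sequence, and the document lengths are randomly generated rather than taken from the dataset. Note that the text field above sets fix_length, so in that configuration every sequence is padded or truncated to MAX_SEQ_LEN regardless of how batches are formed.

```python
import random

def padding_needed(batches):
    """Total number of pad tokens when each batch is padded to its longest sequence."""
    return sum(max(batch) * len(batch) - sum(batch) for batch in batches)

def make_batches(lengths, batch_size):
    """Cut a list of sequence lengths into consecutive batches."""
    return [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]

random.seed(0)
# Hypothetical token counts for 64 documents, capped at 256 tokens.
lengths = [random.randint(20, 256) for _ in range(64)]

unsorted_pad = padding_needed(make_batches(lengths, 16))
bucketed_pad = padding_needed(make_batches(sorted(lengths), 16))
print(f"pad tokens, arbitrary order: {unsorted_pad}; grouped by length: {bucketed_pad}")
```

Sorting before batching never increases the amount of padding, because each batch then contains neighbours in length order.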
Attention mask for batch training. A padded training batch is passed to RoBERTa, which outputs a batch of hidden state vectors, one per sequence in the batch. Padding indexes do not carry any useful information: the end of each sequence in a batch is already denoted by a special end-of-sequence token, and a batch of size 1 does not require any padding at all. The padding indexes should therefore be excluded from the attention weight calculation. This is achieved with the help of an attention mask tensor:
for (source, target), _ in train_iter:
    mask = (source != PAD_INDEX).type(torch.uint8)
    y_pred = model(input_ids=source,
                   attention_mask=mask)
The mask tensor has values of 0 (False) for padding tokens and 1 (True) for all other tokens. Inside each attention layer, the mask is used to suppress the attention scores of padding positions before the softmax, which reduces the contributions from padding tokens to effectively zero. The attention mask should also be applied during validation and testing if they are performed in batches.
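The effect of the mask on attention weights can be sketched in plain Python. This is an illustration of one common approach, adding a large negative bias to masked scores before the softmax; it is not a copy of the library's internals, and the scores are made up.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over attention scores, ignoring positions where mask is 0."""
    # Adding a large negative number makes exp() of a masked score vanish.
    biased = [s if m == 1 else s - 1e9 for s, m in zip(scores, mask)]
    peak = max(biased)  # subtract the maximum for numerical stability
    exps = [math.exp(b - peak) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw attention scores of one query against 5 key positions;
# the last two positions are padding.
scores = [2.0, 1.0, 0.5, 3.0, 3.0]
mask = [1, 1, 1, 0, 0]

weights = masked_softmax(scores, mask)
print([round(w, 3) for w in weights])
```

Even though the padding positions carry the highest raw scores here, they receive zero weight, and the remaining weights still sum to 1.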
Pre-fine-tuning and learning rate. During training, the output of RoBERTa is a batch of hidden states, which is passed to the classifier layers:
import torch
from transformers import RobertaModel

# Model with classifier layers on top of RoBERTa
class ROBERTAClassifier(torch.nn.Module):
    def __init__(self, dropout_rate=0.3):
        super().__init__()
        self.roberta = RobertaModel.from_pretrained('roberta-base')
        self.d1 = torch.nn.Dropout(dropout_rate)
        self.l1 = torch.nn.Linear(768, 64)
        self.bn1 = torch.nn.LayerNorm(64)
        self.d2 = torch.nn.Dropout(dropout_rate)
        self.l2 = torch.nn.Linear(64, 2)

    def forward(self, input_ids, attention_mask):
        # Tuple unpacking relies on tuple outputs: the default in transformers < 4,
        # or with return_dict=False in later versions. x is the pooled output.
        _, x = self.roberta(input_ids=input_ids, attention_mask=attention_mask)
        x = self.d1(x)
        x = self.l1(x)
        x = self.bn1(x)
        x = torch.tanh(x)
        x = self.d2(x)
        x = self.l2(x)
        return x
When the above model is initialised, RoBERTa is assigned pre-trained parameters. For this reason, fine-tuning should be performed with a small learning rate, of the order of 1e-5. The classifier layers, however, are initialised with random, untrained parameter values. I therefore first ran a few training epochs with the RoBERTa parameters frozen and a higher learning rate of 1e-4, adjusting only the classifier layer parameters. Next, the whole model is trained with all parameters updated at the same time.
In this final step of the training, a linear learning rate scheduler, which updates the optimiser learning rate at each training step, turned out to be quite beneficial. During the first two epochs the optimiser is warming up: the learning rate increases to its maximum value of 2e-6, which enables the model to explore the local parameter space. In the following epochs, the learning rate is gradually reduced to zero.
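The shape of such a schedule can be sketched as a plain-Python function. The peak learning rate, the two warm-up epochs, the training set size and the batch size come from the text; the total of 12 epochs is an assumption made only for illustration, as is the function name.

```python
def linear_warmup_decay(step, warmup_steps, total_steps, peak_lr):
    """Learning rate at a given step: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)

# 4410 training items in batches of 16; 12 epochs in total is a hypothetical value.
steps_per_epoch = -(-4410 // 16)  # ceiling division
warmup_steps = 2 * steps_per_epoch
total_steps = 12 * steps_per_epoch
peak_lr = 2e-6

for epoch in (0, 1, 2, 7, 12):
    lr = linear_warmup_decay(epoch * steps_per_epoch, warmup_steps, total_steps, peak_lr)
    print(f"start of epoch {epoch}: lr = {lr:.2e}")
```

In practice the same shape would be obtained from a scheduler object attached to the optimiser and stepped once per batch; the function above only shows the arithmetic.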
Results summary
The Hugging Face library provides out-of-the-box sequence classifiers. These models have names ending with “ForSequenceClassification”, which speaks for itself. It is the same model as above, but with a single Linear layer preceded by a Dropout. I trained my own models on top of BERT and RoBERTa, as well as two out-of-the-box models, “BertForSequenceClassification” and “RobertaForSequenceClassification”. For all BERT models, the cased configuration was used. The table below shows a brief summary of results on the test set.
The training set contained 70% of the data (4410 items), the validation set 10% (629 items) and the test set 20% (1260 items). It may seem that RoBERTa performed only marginally better. However, “ROBERTAClassifier” was wrong on about 1% of the test samples, almost 3 times less often than “BERTClassifier”, which got it wrong almost 3% of the time.
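The set sizes quoted above can be reproduced from the class counts with a few lines of arithmetic. The sketch assumes that a stratified split handles each class separately, rounds the train and test sizes to the nearest integer, and assigns the remainder to validation.

```python
class_counts = {"fake": 3128, "real": 3171}
train_ratio, test_ratio = 0.7, 0.2  # the remainder goes to validation

totals = {"train": 0, "valid": 0, "test": 0}
for label, n in class_counts.items():
    n_train = round(train_ratio * n)
    n_test = round(test_ratio * n)
    n_valid = n - n_train - n_test
    totals["train"] += n_train
    totals["valid"] += n_valid
    totals["test"] += n_test
    print(f"{label}: train={n_train}, valid={n_valid}, test={n_test}")

print(totals)  # {'train': 4410, 'valid': 629, 'test': 1260}
```

Under these assumptions, the per-class test sizes come out as 626 and 634, which matches the support column of the classification report.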
In summary, exceptionally good accuracy for text classification, 99% in this example, can be achieved by fine-tuning state-of-the-art models. For the latter, a shout-out goes to the Hugging Face team!
Further improvement
As is, all models read only the first 256 tokens. To paraphrase an old proverb, it may not be very accurate to judge a news piece by its first few hundred tokens, title included. An obvious way to improve the result is to get the model to read more of the text. One way to overcome the text size limitation is to split a text into chunks of manageable length. Encoding several chunks with RoBERTa yields an array of hidden states, which together contain more information about the text than the first chunk alone. To combine the hidden states into a single vector, one can use a range of techniques, such as simple averaging or an RNN cell. The resulting aggregated vector can be passed to subsequent layers. Such a model can potentially make a more informed decision on whether the news piece is fake or real.
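The chunk-and-average idea can be sketched in plain Python as follows. The encoder here is a toy stand-in, not RoBERTa: it maps each chunk to a 3-dimensional vector so that the example runs without a model. The function names and the 600-token document are likewise hypothetical.

```python
def split_into_chunks(token_ids, chunk_len):
    """Cut a long token sequence into consecutive chunks of at most chunk_len tokens."""
    return [token_ids[i:i + chunk_len] for i in range(0, len(token_ids), chunk_len)]

def mean_pool(vectors):
    """Element-wise average of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def toy_encoder(chunk):
    """Hypothetical stand-in for RoBERTa: maps a chunk to a 3-dimensional vector."""
    return [len(chunk), min(chunk), max(chunk)]

token_ids = list(range(1, 601))  # a made-up document of 600 tokens
chunks = split_into_chunks(token_ids, chunk_len=256)
hidden_states = [toy_encoder(chunk) for chunk in chunks]
document_vector = mean_pool(hidden_states)

print(len(chunks), document_vector)
```

With a real model, each chunk would also need its own start and end tokens, and the averaged vector would replace the single hidden state passed to the classifier layers.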
Translated from: https://towardsdatascience.com/fine-tuning-bert-and-roberta-for-high-accuracy-text-classification-in-pytorch-c9e63cf64646