BERT is a pretrained language model released by the Google AI research team in October 2018. At the time, it achieved state-of-the-art (SOTA) results on 11 different NLP tasks.
The second formal competition I took part in was the Tencent Advertising Algorithm Competition. That competition should have been a great opportunity to learn to use BERT, but since LSTMs outperformed transformers there, the final model ended up being an LSTM and I missed the chance. A few days ago I signed up for an NLP competition to try out BERT's power and step on some of the pitfalls early, as preparation for the upcoming Tencent Advertising Algorithm Competition. Because TensorFlow gave me a rather unpleasant experience in a recently finished competition, this time I also started learning PyTorch from scratch. This post records some of the pitfalls I hit while pretraining BERT from scratch.
BERT pretraining generally uses two objectives: masked language modeling (MLM) and next sentence prediction (NSP). In this competition we mainly need MLM. Below is the code for pretraining with MLM on our own corpus (the code in this post is adapted from this reference).
First we read the data and write the corpus out to a txt file, and we also define the tokenizer. Since the text provided by the competition is anonymized (pure digits), any tokenizer will do; here I chose 'bert-base-chinese'.
import numpy as np
import pandas as pd
from tqdm import tqdm
path_to_file = None  # path to the competition csv
df = pd.read_csv(path_to_file)

# Write the corpus to a plain-text file, one sentence per line
with open('text.txt', 'w') as f:
    for sentence in df.text.values:
        f.write(sentence + '\n')
# Load a pretrained tokenizer (any will do for the anonymized corpus)
import tokenizers
from transformers import BertTokenizer, LineByLineTextDataset

tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
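A quick sanity check that the tokenizer turns a line of the anonymized corpus into ids (the sample line below is hypothetical, just to illustrate the space-separated-digits format):

# Hypothetical sample line in the anonymized-corpus format
sample = '3417 52 882 12693'
print(tokenizer.tokenize(sample))   # word pieces
print(tokenizer.encode(sample))     # ids, wrapped in [CLS] ... [SEP]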
The parameters chosen here are the standard BERT-base parameters, using the PyTorch-based transformers API: 12 layers, 12 attention heads, a hidden size of 768, and a maximum sequence length of 512. The pretraining objective is MLM; the DataCollatorForLanguageModeling API makes it easy to build the masked-input generator over your own corpus.
from transformers import BertConfig, BertForMaskedLM, DataCollatorForLanguageModeling
config = BertConfig(
    vocab_size=50000,          # must be at least as large as the tokenizer vocabulary
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    max_position_embeddings=512
)

# Train from scratch with the config above
# (use BertForMaskedLM.from_pretrained(...) instead to continue from an existing checkpoint)
model = BertForMaskedLM(config)
print('No of parameters: ', model.num_parameters())

# MLM data collator: randomly masks 15% of the input tokens
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
The parameters that can be set here are, on the input side, the batch size, the corpus file and the tokenizer, and on the training side the number of epochs, the batch size and the save frequency. With these few steps we have a trained MLM-based BERT model (a loss of around 0.5 is good enough). The MLM model also comes with an interface for masked-token prediction, although all we actually need here are the BERT weights.
from transformers import Trainer, TrainingArguments

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path='./text.txt',
    block_size=64  # maximum sequence length
)
print('No. of lines: ', len(dataset))  # number of lines in the dataset

training_args = TrainingArguments(
    output_dir='./',
    overwrite_output_dir=True,
    num_train_epochs=30,
    per_device_train_batch_size=64,
    save_steps=10000,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

trainer.train()
trainer.save_model('./')
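As mentioned above, the saved MLM model can also be used directly for masked-token prediction. A minimal sketch (assuming the model was saved to './' as above; the sample sentence is hypothetical):

import torch
from transformers import BertTokenizer, BertForMaskedLM

mlm_tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
mlm_model = BertForMaskedLM.from_pretrained('./')
mlm_model.eval()

# Hypothetical anonymized sentence with one position replaced by [MASK]
text = '3417 52 [MASK] 12693'
input_ids = mlm_tokenizer.encode(text, return_tensors='pt')
with torch.no_grad():
    logits = mlm_model(input_ids)[0]                      # (1, seq_len, vocab_size)
mask_pos = (input_ids[0] == mlm_tokenizer.mask_token_id).nonzero()[0].item()
top_ids = logits[0, mask_pos].topk(5)[1].tolist()         # 5 most likely fill-ins
print(mlm_tokenizer.convert_ids_to_tokens(top_ids))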
The model we need is a classifier (binary classification), so we define it as follows. Here the [CLS] and [SEP] tokens come into play: the output at [CLS] is used to represent the whole sentence, while [SEP] separates sentences. For classification we can either fine-tune on the [CLS] output vector or take the full sequence of output vectors and process it further.
import torch
import torch.nn as nn
from pytorch_pretrained_bert import BertModel, BertTokenizer


class Model(nn.Module):
    def __init__(self, pretrain_model_path='./mybert/', hidden_size=768):
        super(Model, self).__init__()
        self.pretrain_model_path = pretrain_model_path
        # Load the pretrained BERT weights and fine-tune all of them
        self.bert = BertModel.from_pretrained(pretrain_model_path)
        for param in self.bert.parameters():
            param.requires_grad = True
        self.dropout = nn.Dropout(0.1)
        self.embed_size = hidden_size
        # Binary classification head on top of the [CLS] vector
        self.cls = nn.Linear(self.embed_size, 2)

    def forward(self, ids, attention_mask, labels=None, training=True):
        loss_fct = nn.CrossEntropyLoss()
        context = ids
        types = None  # token_type_ids are not passed; BertModel defaults them to zeros
        mask = attention_mask
        # pytorch_pretrained_bert returns (sequence output, pooled [CLS] output)
        sequence_out, cls_out = self.bert(context, token_type_ids=types,
                                          attention_mask=mask,
                                          output_all_encoded_layers=False)
        cls_out = self.dropout(cls_out)
        logits = self.cls(cls_out)
        if training:
            loss = loss_fct(logits.view(-1, 2), labels.view(-1))
            return loss, nn.Softmax(dim=-1)(logits)
        else:
            return logits
pretrain_model_path = 'bert-base-chinese'
tokenizer = BertTokenizer.from_pretrained(pretrain_model_path)
'''
The params below can be adjusted as needed
'''
CLS_TOKEN = '[CLS]'
SEP_TOKEN = '[SEP]'
seq_length = 64

# pretrain_model_path in Model() should point at the directory where the pretrained weights were saved
mybertmodel = Model()
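As noted above, instead of fine-tuning on the [CLS] vector we could also work with the full sequence output, for example by mean-pooling it over the non-padding positions. The class below is a hypothetical sketch of such an alternative head (not part of the original code), reusing the same pytorch_pretrained_bert BertModel:

class MeanPoolModel(nn.Module):
    """Hypothetical variant: classify from the mean of the sequence output."""
    def __init__(self, pretrain_model_path='./mybert/', hidden_size=768):
        super(MeanPoolModel, self).__init__()
        self.bert = BertModel.from_pretrained(pretrain_model_path)
        self.dropout = nn.Dropout(0.1)
        self.cls = nn.Linear(hidden_size, 2)

    def forward(self, ids, attention_mask):
        sequence_out, _ = self.bert(ids, token_type_ids=None,
                                    attention_mask=attention_mask,
                                    output_all_encoded_layers=False)
        # Zero out padding positions, then average over the sequence dimension
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (sequence_out * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        return self.cls(self.dropout(pooled))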
The task we're handling here is judging how similar two sentences are. Much like tf.data.Dataset, the Dataset here is responsible for producing inputs in the format the model expects, and any preprocessing can be added here as well; it works like a more capable generator.
import torch
import random
import torch.nn as nn
import torch.nn.functional as F
from sklearn.metrics import roc_auc_score

MAX_LEN = 64
SEP_TOKEN_ID = 102


def compute_auc(y_true, y_pred):
    # roc_auc_score raises ValueError when only one class is present in y_true
    try:
        return roc_auc_score(y_true, y_pred)
    except ValueError:
        return np.nan
class QuestDataset(torch.utils.data.Dataset):
    def __init__(self, df, train_mode=True, labeled=True):
        self.df = df
        self.train_mode = train_mode
        self.labeled = labeled
        self.tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')

    def __getitem__(self, index):
        row = self.df.iloc[index]
        token_ids, seg_ids = self.get_token_ids(row)
        if self.labeled:
            labels = self.get_label(row)
            return {'input_ids': token_ids, 'attention_mask': seg_ids, 'labels': labels}
        else:
            return {'input_ids': token_ids, 'attention_mask': seg_ids}

    def __len__(self):
        return len(self.df)

    def select_tokens(self, tokens, max_num):
        # Randomly drop a contiguous chunk at training time; keep head + tail otherwise
        if len(tokens) <= max_num:
            return tokens
        if self.train_mode:
            num_remove = len(tokens) - max_num
            remove_start = random.randint(0, len(tokens) - num_remove - 1)
            return tokens[:remove_start] + tokens[remove_start + num_remove:]
        else:
            return tokens[:max_num // 2] + tokens[-(max_num - max_num // 2):]

    def trim_input(self, title, question, max_sequence_length=MAX_LEN,
                   t_max_len=30, q_max_len=30):
        # Trim the two sentences so that, together with [CLS] and two [SEP]s,
        # they fit within max_sequence_length
        t = self.tokenizer.tokenize(title)
        q = self.tokenizer.tokenize(question)
        t_len = len(t)
        q_len = len(q)
        if t_len + q_len + 3 > max_sequence_length:
            if t_max_len > t_len:
                t_new_len = t_len
                q_max_len = q_max_len + (t_max_len - t_len)
            else:
                t_new_len = t_max_len
            if q_max_len > q_len:
                q_new_len = q_len
            else:
                q_new_len = q_max_len
            t = t[:t_new_len]
            q = q[:q_new_len]
        return t, q

    def get_token_ids(self, row):
        t_tokens, q_tokens = self.trim_input(row.q1, row.q2)
        tokens = ['[CLS]'] + t_tokens + ['[SEP]'] + q_tokens + ['[SEP]']
        token_ids = self.tokenizer.convert_tokens_to_ids(tokens)
        # seg_ids is really the attention mask: 1 for real tokens, 0 for padding
        seg_ids = torch.tensor([1] * len(token_ids) + [0] * (MAX_LEN - len(token_ids)))
        if len(token_ids) < MAX_LEN:
            token_ids += [0] * (MAX_LEN - len(token_ids))
        ids = torch.tensor(token_ids)
        return ids, seg_ids

    def get_label(self, row):
        return torch.tensor(row['label'].astype(int))

    def collate_fn(self, batch):
        # Not passed to the DataLoaders below; the default collate handles the dicts from __getitem__
        token_ids = torch.stack([x[0] for x in batch])
        seg_ids = torch.stack([x[1] for x in batch])
        if self.labeled:
            labels = torch.stack([x[2] for x in batch])
            return token_ids, seg_ids, labels
        else:
            return token_ids, seg_ids
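A quick way to check that the Dataset produces what the model expects (assuming df_train, the labeled training dataframe with q1/q2/label columns, has already been loaded):

sample_ds = QuestDataset(df_train, train_mode=True, labeled=True)
sample = sample_ds[0]
print(sample['input_ids'].shape)       # torch.Size([64])
print(sample['attention_mask'].shape)  # torch.Size([64])
print(sample['labels'])                # tensor(0) or tensor(1)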
Since it isn't convenient to report AUC through the Trainer API (honestly, the documentation felt too sprawling and I couldn't find ready-made code), I adapted code shared online and filled in the missing pieces. Using tqdm, the result is a training progress bar essentially like the one in Keras. Note that the model's inputs are the ids, the attention mask and the labels; having never used BERT before, I didn't appreciate how important the attention mask is (it tells BERT which positions are real tokens and which are padding), and it took quite a few attempts to get the code running. The code here is adapted from this reference.
from transformers import DistilBertForSequenceClassification, AdamW
import time
from datetime import datetime
from torch.utils.data import DataLoader
from tqdm import tqdm, trange
from sklearn.metrics import roc_auc_score

# df_train / df_val: the train / validation splits of the competition data
train_dataset = QuestDataset(df_train, train_mode=True, labeled=True)
eval_dataset = QuestDataset(df_val, train_mode=True, labeled=True)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True, num_workers=2)
eval_loader = torch.utils.data.DataLoader(eval_dataset, batch_size=64, shuffle=False, num_workers=2)

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
mybertmodel.to(device)

optim = AdamW(mybertmodel.parameters(), lr=5e-5)
for epoch in range(4):
    # ----- training -----
    mybertmodel.train()
    with tqdm(iterable=train_loader,
              bar_format='{desc} {n_fmt:>4s}/{total_fmt:<4s} {percentage:3.0f}%|{bar}| {postfix}',
              ) as t:
        start_time = datetime.now()
        loss_list = []
        label_list = []
        pred_list = []
        for count, batch in enumerate(train_loader):
            t.set_description_str(f"\33[36m【Epoch {epoch + 1:04d}】")
            optim.zero_grad()
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            label_list.append(batch['labels'].numpy())
            outputs = mybertmodel(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs[0]
            # outputs[1] is already the softmax output; keep the positive-class probability
            pred_list.append(outputs[1].cpu().detach().numpy()[:, 1])
            loss.backward()
            optim.step()
            loss_list.append(loss.item())  # .item() avoids keeping the graph alive
            cur_time = datetime.now()
            delta_time = cur_time - start_time
            t.set_postfix_str(f"train_loss={sum(loss_list) / len(loss_list):.6f}, elapsed: {delta_time}\33[0m")
            t.update()
        train_auc = roc_auc_score(np.concatenate(label_list), np.concatenate(pred_list))
        t.set_postfix_str(f"train_auc={train_auc:.6f}\33[0m")

    # ----- validation -----
    mybertmodel.eval()
    with tqdm(iterable=eval_loader, bar_format='{desc} {postfix}',) as t:
        label_list = []
        pred_list = []
        with torch.no_grad():
            for count, batch in enumerate(eval_loader):
                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['labels'].to(device)
                label_list.append(batch['labels'].numpy())
                outputs = mybertmodel(input_ids, attention_mask=attention_mask, training=False)
                # The model returns raw logits at eval time; apply a softmax manually
                pred_tmp = outputs.cpu().detach().numpy()
                pred_tmp = np.exp(pred_tmp)
                pred_tmp = pred_tmp[:, 1] / np.sum(pred_tmp, axis=-1)
                pred_list.append(pred_tmp)
        test_auc = roc_auc_score(np.concatenate(label_list), np.concatenate(pred_list))
        t.set_description_str(f"\33[35m【Validation】")
        t.set_postfix_str(f"test_auc={test_auc:.6f}\33[0m")
        t.update()
The output is a Keras-style progress bar showing the running training loss, with the train and validation AUC reported at the end of each epoch.
This post recorded some of the pitfalls I hit while learning to pretrain BERT from scratch. Pretraining BERT through the PyTorch-based APIs is very convenient; although a few of the API interfaces were awkward to fit together, modifying the code eventually produced the expected results.
The prediction/inference part of the code is not included in this post and is left to the reader; it should be entirely possible to implement it from the code given above.
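As a starting point, here is one hedged sketch of what that prediction loop could look like, assuming an unlabeled dataframe df_test with the same q1/q2 columns (df_test and the output filename are placeholders, not from the original post):

test_dataset = QuestDataset(df_test, train_mode=False, labeled=False)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=64, shuffle=False, num_workers=2)

mybertmodel.eval()
preds = []
with torch.no_grad():
    for batch in test_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        logits = mybertmodel(input_ids, attention_mask=attention_mask, training=False)
        probs = torch.softmax(logits, dim=-1)[:, 1]  # probability of the positive class
        preds.append(probs.cpu().numpy())

df_test['pred'] = np.concatenate(preds)
df_test[['pred']].to_csv('submission.csv', index=False)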