This post is based on:
Transformers for Sentiment Analysis.ipynb
Some details may differ slightly from the original, and the code comments are entirely my own understanding. The post is meant as a personal learning record rather than a full translation of the tutorial, so it may not suit every beginner, but it can still serve as a mutual reference.
Continuing from the previous post: using torchtext for a single-label multi-class task on TREC
This is the sixth post, and the last one in this introductory series.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchtext
from torchtext import data
from torchtext import datasets
import spacy
import random
import math
import numpy as np
use_cuda=torch.cuda.is_available()
device=torch.device("cuda" if use_cuda else "cpu")
SEED=1234
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if use_cuda:
    torch.cuda.manual_seed(SEED)
    torch.backends.cudnn.deterministic = True
from transformers import BertTokenizer
tokenizer=BertTokenizer.from_pretrained('bert-base-uncased')
Let's take a look at the vocabulary:
len(tokenizer.vocab)
30522
To tokenize a sentence, just call the tokenizer's tokenize method:
token=tokenizer.tokenize("Don't Make sUch a fuss,get StUFf!")
print(token)
['don', "'", 't', 'make', 'such', 'a', 'fuss', ',', 'get', 'stuff', '!']
To numericalize the tokens, call the tokenizer's convert_tokens_to_ids method:
indexes=tokenizer.convert_tokens_to_ids(token)
print(indexes)
[2123, 1005, 1056, 2191, 2107, 1037, 28554, 1010, 2131, 4933, 999]
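As an aside, recent versions of the transformers library also expose tokenizer.encode, which, as far as I know, bundles tokenization, id conversion and, optionally, the special tokens discussed next into a single call. A minimal sketch:
# Sketch: encode() tokenizes, maps to ids and (optionally) adds the special tokens in one step.
ids = tokenizer.encode("Don't Make sUch a fuss,get StUFf!", add_special_tokens=False)
print(ids)  # should match the convert_tokens_to_ids output above

ids_special = tokenizer.encode("Don't Make sUch a fuss,get StUFf!", add_special_tokens=True)
print(ids_special)  # same ids wrapped by the ids of [CLS] and [SEP]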
During pre-training, BERT learns sentence-level features by taking sentence pairs as input. For these pairs BERT uses two special tokens: [CLS] and [SEP]. The latter marks the end of a sentence, while the former is prepended to the whole pair; its top hidden state acts as an aggregate representation of the input and carries the sentence-level information. If BERT and pre-trained models are new to you, see the article 最火的几个全网络预训练模型梳理整合(BERT、ALBERT、XLNet详解).
Therefore, the default special tokens in the Field must be replaced with BERT's:
cls_token=tokenizer.cls_token
sep_token=tokenizer.sep_token
pad_token=tokenizer.pad_token
unk_token=tokenizer.unk_token
print(cls_token,sep_token,pad_token,unk_token)
[CLS] [SEP] [PAD] [UNK]
You can also get their indices directly:
cls_token_id=tokenizer.cls_token_id
sep_token_id=tokenizer.sep_token_id
pad_token_id=tokenizer.pad_token_id
unk_token_id=tokenizer.unk_token_id
print(cls_token_id,sep_token_id, pad_token_id, unk_token_id)
101 102 0 100
Different BERT models may have different limits on the maximum input length, which can be looked up via max_model_input_sizes; the Field also has to enforce this maximum length.
max_model_input_sizes is a dictionary containing the length limit of each BERT variant:
type(tokenizer.max_model_input_sizes)
dict
len(tokenizer.max_model_input_sizes)
18
Here we are using bert-base-uncased:
max_input_length = tokenizer.max_model_input_sizes['bert-base-uncased']
print(max_input_length)
512
So for every sentence in our training data, we have to make sure it is cut before being fed to BERT (whenever its length exceeds 512):
def tokenize_and_cut(sentence):
    tokens = tokenizer.tokenize(sentence)
    tokens = tokens[:max_input_length-2]  # leave two slots so the Field can add the start/end markers ([CLS], [SEP])
    return tokens
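A quick sanity check that the cut actually happens (the absurdly long "review" below is made up purely for illustration):
# Hypothetical 1000-word review, far beyond BERT's 512-token limit.
long_review = " ".join(["great"] * 1000)
print(len(tokenize_and_cut(long_review)))  # 510, leaving two slots for [CLS] and [SEP]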
To summarize, the Field needs the following adjustments:
1) Change tokenize. We no longer use spacy; instead we pass tokenize_and_cut.
2) Turn off the vocab. Since we already have a tokenizer, we set use_vocab = False, i.e. we do not want the Field to numericalize the tokenized input itself.
3) Use preprocessing for numericalization. Preprocessing runs before numericalization. Because the vocab is turned off, the Field can no longer numericalize tokens automatically, so we set preprocessing = tokenizer.convert_tokens_to_ids; it is then applied right after tokenization and replaces the Field's own numericalization.
4) Specify the special tokens. Four tokens are needed: PAD, UNK, INIT and EOS. We already left two slots for INIT/EOS, so we can safely set init_token = cls_token_id and eos_token = sep_token_id, and the Field will prepend/append them automatically. (Remember that there is no numericalization step anymore; preprocessing has taken its place. The special tokens are added between preprocessing and the original numericalization step, so the values we pass here must already be token ids, otherwise preprocessing has no chance to numericalize them.)
5) Set batch_first = True. BERT expects the first dimension of the input to be the batch.
TEXT=data.Field(batch_first=True,use_vocab=False,tokenize=tokenize_and_cut,
preprocessing=tokenizer.convert_tokens_to_ids,
init_token=cls_token_id,
eos_token=sep_token_id,
pad_token=pad_token_id,
unk_token=unk_token_id)
LABEL=data.LabelField(dtype=torch.float)
train_data, test_data =datasets.IMDB.splits(TEXT,LABEL)
train_data,valid_data=train_data.split(split_ratio=0.7)
print(f"Number of training examples: {len(train_data)}")
print(f"Number of validation examples: {len(valid_data)}")
print(f"Number of testing examples: {len(test_data)}")
Number of training examples: 17500
Number of validation examples: 7500
Number of testing examples: 25000
As you can see, after this Field processing the IMDB data is numericalized into sentences that BERT "recognizes":
print(vars(train_data.examples[0]))
{'text': [2023, 3185, 2038, 2288, 2000, 2022, 2028, 1997, 1996, 5409, 1045, 2031, 2412, 2464, 2191, 2009, 2000, 4966, 999, 999, 999, 1996, 2466, 2240, 2453, 2031, 13886, 2065, 1996, 2143, 2018, 2062, 4804, 1998, 4898, 2008, 2052, 2031, 3013, 1996, 14652, 1998, 5305, 2135, 5019, 2008, 1045, 3811, 14046, 3008, 2006, 1012, 1012, 1012, 1012, 2021, 1996, 2466, 2240, 2003, 2066, 1037, 6065, 8854, 1012, 2065, 2045, 2001, 2107, 1037, 2518, 2004, 1037, 3298, 27046, 3185, 9338, 1011, 2023, 2028, 2052, 2031, 22057, 2013, 2008, 1012, 2009, 6966, 2033, 1037, 2843, 1997, 1996, 4248, 2666, 3152, 2008, 2020, 2404, 2041, 1999, 1996, 3624, 1005, 1055, 1010, 3532, 5896, 3015, 1998, 7467, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1996, 2069, 21082, 3494, 1999, 1996, 2878, 3185, 2001, 1996, 15812, 1998, 13570, 1012, 1996, 2717, 1997, 1996, 2143, 1010, 2071, 2031, 4089, 2042, 2081, 2011, 2690, 2082, 2336, 1012, 1045, 2507, 2023, 2143, 1037, 5790, 1997, 1015, 2004, 2009, 2003, 5621, 9643, 1998, 2187, 2026, 2972, 2155, 2007, 1037, 3168, 1997, 2108, 22673, 1012, 2026, 6040, 1011, 2123, 1005, 1056, 3422, 2009, 999, 999, 999], 'label': 'neg'}
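If you want to double-check what BERT will actually see, you can map a few of these ids back to wordpieces with convert_ids_to_tokens (the decoded words in the comment are roughly what this particular example should give):
example_ids = vars(train_data.examples[0])['text']
print(tokenizer.convert_ids_to_tokens(example_ids[:10]))
# e.g. ['this', 'movie', 'has', 'got', 'to', 'be', 'one', 'of', 'the', 'worst']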
We only need to build a vocab for LABEL:
LABEL.build_vocab(train_data)
BATCH_SIZE = 128
train_iterator, valid_iterator, test_iterator=data.BucketIterator.splits(
(train_data,valid_data,test_data),
batch_size=BATCH_SIZE,
device=device)
This will download the pre-trained BERT model, a bit under 500 MB:
from transformers import BertModel
bert=BertModel.from_pretrained('bert-base-uncased')
A BertModel forward pass returns four parts (a small inspection sketch follows this list):
1. last_hidden_state: (batch, seq, hidden_size)
The last-layer hidden state of every token of the input sentence. This is what we use here, as a replacement for an embedding layer.
2. pooler_output: (batch, hidden_size)
The top hidden state of the first token of the input, i.e. the final sentence-pair-level representation extracted via the [CLS] token. During BERT pre-training it is fed to the Next Sentence Prediction head. We will not use it here, because it does not work particularly well for sentiment analysis:
This output is usually not a good summary of the semantic content of the input, you're often better with averaging or pooling the sequence of hidden-states for the whole input sequence.
3. hidden_states:
A tuple of FloatTensors of shape (batch, seq, hidden_size): the initial embedding output plus the hidden states of each layer.
4. attentions:
A tuple of FloatTensors of shape (batch, num_heads, seq, seq), the self-attention scores of each layer.
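Here is a small sketch for inspecting these outputs. Note that whether you get a plain tuple or an output object, and whether hidden_states/attentions are included at all, depends on your transformers version and config flags; the index-style access below is what the rest of this post assumes:
# Run a toy input through BERT and look at the first two outputs.
toy = torch.tensor([tokenizer.encode("a tiny test sentence", add_special_tokens=True)])
with torch.no_grad():
    outputs = bert(toy)
last_hidden_state, pooler_output = outputs[0], outputs[1]
print(last_hidden_state.shape)  # (1, seq_len, 768)
print(pooler_output.shape)      # (1, 768)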
class BERTGRUSentiment(nn.Module):
    def __init__(self, bert: BertModel,
                 hidden_dim: int,
                 output_dim: int,
                 n_layers: int,
                 bidirectional: bool,
                 dropout: float):
        super(BERTGRUSentiment, self).__init__()
        self.bert = bert
        embedding_dim = bert.config.to_dict()['hidden_size']
        self.rnn = nn.GRU(embedding_dim, hidden_dim, n_layers,
                          bidirectional=bidirectional,
                          batch_first=True,
                          dropout=0 if n_layers < 2 else dropout)
        self.fc = nn.Linear(hidden_dim*2 if bidirectional else hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):
        with torch.no_grad():
            embedding = self.bert(text)[0]
        # embedding: (batch, seq, embedding_dim)
        _, hidden = self.rnn(embedding)
        # hidden: (num_directions*num_layers, batch, hidden_size)
        hidden = self.dropout(torch.cat((hidden[-1,:,:], hidden[-2,:,:]), dim=1) if self.rnn.bidirectional else hidden[-1,:,:])
        return self.fc(hidden)
        # (batch, output_dim)
HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.25
model = BERTGRUSentiment(bert,
HIDDEN_DIM,
OUTPUT_DIM,
N_LAYERS,
BIDIRECTIONAL,
DROPOUT)
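Before wiring up the training loop, a quick shape check with a dummy batch can catch mistakes early. This is purely illustrative (random ids, run while the model is still on the CPU):
# Dummy batch: 2 "sentences" of 50 random token ids each, just to confirm the output shape.
dummy = torch.randint(0, len(tokenizer.vocab), (2, 50))
with torch.no_grad():
    print(model(dummy).shape)  # torch.Size([2, 1])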
def freeze_parameters(model: nn.Module, demand: str):
    for name, parameters in model.named_parameters():
        if name.startswith(demand):
            parameters.requires_grad = False
Freeze the BERT parameters:
freeze_parameters(model,'bert')
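We can verify the freeze worked by listing the parameter names that still require gradients; only the GRU and the final linear layer should show up:
# Names of the parameters that will actually be updated during training.
for name, param in model.named_parameters():
    if param.requires_grad:
        print(name)
# rnn.weight_ih_l0, rnn.weight_hh_l0, ..., fc.weight, fc.bias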
Let's count the parameters that actually take part in training:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'The model has {count_parameters(model):,} trainable parameters')
The model has 2,759,169 trainable parameters
With BERT's parameters frozen, the overall number of trainable parameters is fairly small; effectively only the GRU and the fully-connected layer are trained.
criterion=nn.BCEWithLogitsLoss()
criterion=criterion.to(device)
model=model.to(device)
optimizer=optim.Adam(model.parameters())
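Passing model.parameters() is fine because the frozen parameters (requires_grad=False) never receive gradients and therefore never get updated. If you prefer to be explicit, an optional variant is to hand the optimizer only the trainable parameters:
# Optional variant: give Adam only the parameters that require gradients.
optimizer = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()))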
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """
    # round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = rounded_preds.eq(y).float()  # convert into float for division
    acc = correct.sum() / len(correct)
    return acc
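A tiny worked example with made-up logits and labels:
# sigmoid(2.0)≈0.88 -> 1, sigmoid(-1.0)≈0.27 -> 0, sigmoid(0.2)≈0.55 -> 1
preds = torch.tensor([2.0, -1.0, 0.2])
labels = torch.tensor([1.0, 0.0, 0.0])
print(binary_accuracy(preds, labels))  # tensor(0.6667), i.e. 2 out of 3 correct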
def train(model: nn.Module, iterator: data.BucketIterator,
          optimizer: optim.Adam, criterion: nn.BCEWithLogitsLoss):
    epoch_loss = 0.
    epoch_acc = 0.
    model.train()
    for batch in iterator:
        preds = model(batch.text).squeeze(-1)
        loss = criterion(preds, batch.label)
        acc = binary_accuracy(preds, batch.label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        epoch_acc += acc.item()
    return epoch_loss/len(iterator), epoch_acc/len(iterator)
def evaluate(model: nn.Module, iterator: data.BucketIterator,
             criterion: nn.BCEWithLogitsLoss):
    epoch_loss = 0.
    epoch_acc = 0.
    model.eval()
    with torch.no_grad():
        for batch in iterator:
            preds = model(batch.text).squeeze(-1)
            loss = criterion(preds, batch.label)
            acc = binary_accuracy(preds, batch.label)
            epoch_loss += loss.item()
            epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
import time
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs
Note: if your machine does not have enough GPU memory, the code below will most likely crash. We moved the model onto the GPU earlier, and BERT has an outrageous number of parameters, so roughly 4-5 GB of memory is occupied before training even starts; the training iterations then push memory usage beyond what the card can hold.
You can try running it on Colab instead; if you own a beefy GPU, ignore this.
N_EPOCHS = 5
best_valid_loss = float('inf')
for epoch in range(N_EPOCHS):
    start_time = time.time()
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    end_time = time.time()
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut6-model.pt')
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')
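Once training is done, you can load the best checkpoint and run the model on raw text. The helper below is a minimal inference sketch in the spirit of the referenced tutorial (predict_sentiment is my own name for it); which label corresponds to 0 and which to 1 depends on LABEL.vocab.stoi, so check that first:
def predict_sentiment(model, tokenizer, sentence):
    # Tokenize, truncate, add the [CLS]/[SEP] ids by hand, then run the model.
    model.eval()
    tokens = tokenizer.tokenize(sentence)[:max_input_length - 2]
    indexed = [cls_token_id] + tokenizer.convert_tokens_to_ids(tokens) + [sep_token_id]
    tensor = torch.LongTensor(indexed).unsqueeze(0).to(device)  # (1, seq)
    with torch.no_grad():
        prediction = torch.sigmoid(model(tensor))
    return prediction.item()

model.load_state_dict(torch.load('tut6-model.pt', map_location=device))
print(LABEL.vocab.stoi)  # check which label maps to 0 and which to 1
print(predict_sentiment(model, tokenizer, "This film is terrible"))
print(predict_sentiment(model, tokenizer, "This film is great"))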
Next up, a supplementary post: torchtext supplement -- using torchtext to load your own dataset