From the paper: https://arxiv.org/pdf/1810.04805.pdf
BERT says: "I'll use the Transformer's encoders."
Ernie scoffs: "Heh, then you can't read a passage in both directions the way a Bi-LSTM does."
BERT replies confidently: "We'll use masks."
What does "mask" mean here?
A language model predicts a word from the words that come before it, but with self-attention every position can also attend to itself. If the word to be predicted were visible, the model could trivially "predict" it with 100% accuracy, which teaches it nothing. BERT therefore masks out the words it wants to predict and trains the model to recover them from the surrounding context.
BERT's input representation is illustrated in a figure in the original post (not reproduced here).
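To make the masking idea concrete, here is a minimal sketch (not from the original post; the bert-base-uncased checkpoint and the sentence are only illustrative) that masks one word and lets a pre-trained BERT recover it from context:

import torch
from transformers import BertTokenizer, BertForMaskedLM

# Illustrative checkpoint and sentence, used only to demonstrate masked prediction
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

text = "The capital of France is [MASK]."
inputs = tokenizer.encode_plus(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs)[0]                       # (1, seq_len, vocab_size)

# Locate the [MASK] position and take the most likely vocabulary entry
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
pred_id = logits[0, mask_pos].argmax(-1).item()
print(tokenizer.convert_ids_to_tokens([pred_id]))     # should print something like ['paris']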
The BERT paper also describes several kinds of NLP tasks BERT can be fine-tuned for: sentence-pair classification (e.g. MNLI), single-sentence classification (e.g. SST-2), question answering (SQuAD), and token-level tagging (e.g. NER).
Fine-tuning is not the only way to use BERT. Just as with ELMo, you can use the pre-trained BERT to create contextualized word embeddings and feed those embeddings into an existing model.
Which vectors work best as contextualized embeddings? It depends on the task. The post examines six choices, such as taking the last hidden layer or combining the last four layers (scored against the fine-tuned model, which reaches 96.4).
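As a rough sketch of this feature-extraction route (assuming the bert-base-uncased checkpoint; the sentence is made up), the hidden states of every encoder layer can be exposed and combined, which is exactly what those six choices vary:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# output_hidden_states=True makes the model return the embedding layer plus all 12 encoder layers
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

inputs = tokenizer.encode_plus("a visually stunning rumination on love", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
hidden_states = outputs[-1]        # tuple of 13 tensors: the embeddings, then layers 1..12

# Two possible choices: the last hidden layer, or the sum of the last four layers
last_layer = hidden_states[-1]                          # (1, seq_len, 768)
sum_last_four = torch.stack(hidden_states[-4:]).sum(0)  # (1, seq_len, 768)
print(last_layer.shape, sum_last_four.shape)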
BERT is a stack of Transformer encoder layers (its position-wise feed-forward sublayer can be viewed as a CNN with filter size 1). It works best on inputs of up to a few hundred tokens and supports at most 512 positions, so very long documents have to be truncated or split. Next we take the transformers library from GitHub and walk through loading a BERT pre-trained model and building a text-classification example.
https://github.com/huggingface/transformers
The key files to pay attention to:
pytorch_model.bin: the pre-trained model weights
vocab.txt: the vocabulary file
config.json: the BERT configuration file, holding the model's hyperparameters
English pre-trained model:
https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-pytorch_model.bin
https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json
Chinese pre-trained model:
https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-pytorch_model.bin
https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-vocab.txt
https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-config.json
First, place these files in the bert_pretrain directory:
bert_pretrain$ tree -a
├── config.json
├── pytorch_model.bin
└── vocab.txt
import os

from transformers import BertTokenizer, BertForSequenceClassification

bert_path = './bert_pretrain/'
# Load the local vocabulary and the pre-trained weights
tokenizer = BertTokenizer.from_pretrained( os.path.join(bert_path, 'vocab.txt') )
model = BertForSequenceClassification.from_pretrained(
    os.path.join(bert_path, 'pytorch_model.bin'),
    config=os.path.join(bert_path, 'config.json'))
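A quick sanity check of the objects loaded above (the example sentence is made up). Because BERT has only 512 position embeddings, longer inputs must be truncated to max_length before they reach the model:

# Encode one sentence and run it through the classification model loaded above
enc = tokenizer.encode_plus("this movie was surprisingly good",
                            max_length=128,        # BERT accepts at most 512 positions
                            return_tensors="pt")

model.eval()
with torch.no_grad():
    logits = model(input_ids=enc["input_ids"],
                   attention_mask=enc["attention_mask"],
                   token_type_ids=enc["token_type_ids"])[0]
print(logits.shape)   # (1, num_labels); 2 labels with the stock config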
For the implementation details, see the PyTorch implementation of BERT (https://github.com/huggingface/transformers). The main modules are listed below, followed by a short sketch for inspecting them.
modeling: https://github.com/huggingface/transformers/blob/master/src/transformers/modeling_bert.py
BertEmbedding: wordpiece embedding + position embedding + token type embedding
BertSelfAttention: the query, key and value projections and scaled dot-product attention
BertSelfOutput: output projection, dropout, residual connection and LayerNorm after self-attention
BertIntermediate: the feed-forward expansion (hidden_size -> intermediate_size, with GELU)
BertOutput: the feed-forward projection back to hidden_size, plus residual connection and LayerNorm
BertForSequenceClassification: the BERT encoder plus dropout and a classification head on the pooled output
configuration: https://github.com/huggingface/transformers/blob/master/transformers/configuration_bert.py
tokenization: https://github.com/huggingface/transformers/blob/master/transformers/tokenization_bert.py
DataProcessor: https://github.com/huggingface/transformers/blob/master/transformers/data/processors/glue.py#L194
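To see these modules inside an actual model object, one quick sketch (assuming the bert-base-uncased checkpoint is available) is to load BertModel and print a couple of its submodules:

from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# The embedding block: word-piece + position + token-type embeddings, followed by LayerNorm and dropout
print(model.embeddings)

# One encoder layer, containing BertSelfAttention, BertSelfOutput, BertIntermediate and BertOutput
print(model.encoder.layer[0])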
RoBERTa paper: https://arxiv.org/pdf/1907.11692.pdf
ALBERT paper: https://arxiv.org/pdf/1909.11942.pdf
DistilBERT paper: https://arxiv.org/pdf/1910.01108.pdf
Patient Knowledge Distillation (BERT-PKD) paper: https://arxiv.org/abs/1908.09355
We now run sentiment analysis on the Stanford movie-review dataset (SST-2) to verify how well BERT performs on this task.
import argparse
import logging
import os
import random
import time
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader
from transformers import BertTokenizer, BertForSequenceClassification, AdamW
from transformers.data.processors.glue import glue_convert_examples_to_features as convert_examples_to_features
from transformers.data.processors.utils import DataProcessor, InputExample
logger = logging.getLogger(__name__)
# Initialize the arguments
parser = argparse.ArgumentParser()
args = parser.parse_args(args=[])  # pass an empty list so this also runs inside a Jupyter notebook
args.data_dir = "./data/"
args.model_type = "bert"
args.task_name = "sst-2"
args.output_dir = "./outputs2/"
args.max_seq_length = 128
args.do_train = True
args.do_eval = True
args.warmup_steps = 0
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
args.device = device
args.seed = 1234
args.batch_size = 48
args.n_epochs=3
args.lr = 5e-5
print('args: ', args)
def set_seed(args):
random.seed(args.seed)
np.random.seed(args.seed)
torch.manual_seed(args.seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(args.seed)
set_seed(args)  # set all the random seeds here for reproducibility
class Sst2Processor(DataProcessor):
"""Processor for the SST-2 data set (GLUE version)."""
def get_example_from_tensor_dict(self, tensor_dict):
"""See base class."""
return InputExample(
tensor_dict["idx"].numpy(),
tensor_dict["sentence"].numpy().decode("utf-8"),
None,
str(tensor_dict["label"].numpy()),
)
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
def get_test_examples(self, data_dir):
"""See base class."""
return self._create_examples(self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")
def get_labels(self):
"""See base class."""
return ["0", "1"]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev/test sets."""
examples = []
for (i, line) in enumerate(lines):
if i == 0:
continue
guid = "%s-%s" % (set_type, i)
text_a = line[0]
label = line[1]
examples.append(InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
return examples
def load_and_cache_examples(args, processor, tokenizer, set_type):
# Load data features from cache or dataset file
print("Creating features from dataset file at {}".format(args.data_dir))
label_list = processor.get_labels()
if set_type == 'train':
examples = (
processor.get_train_examples(args.data_dir)
)
if set_type == 'dev':
examples = (
processor.get_dev_examples(args.data_dir)
)
if set_type == 'test':
examples = (
processor.get_test_examples(args.data_dir)
)
    features = convert_examples_to_features(
        examples,                        # the raw InputExample records
        tokenizer,                       # the BERT tokenizer loaded above
        label_list=label_list,
        max_length=args.max_seq_length,  # maximum sequence length per example
        output_mode='classification',    # classification task
        pad_on_left=False,               # pad on the right
        pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0],
        pad_token_segment_id=0,          # segment id used for padding (0 for BERT)
        mask_padding_with_zero=True      # real tokens get attention mask 1, padding gets 0
    )
# Convert to Tensors and build dataset
all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
all_attention_mask = torch.tensor([f.attention_mask for f in features], dtype=torch.long)
all_token_type_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long)
all_labels = torch.tensor([f.label for f in features], dtype=torch.long)
dataset = TensorDataset(all_input_ids, all_attention_mask, all_token_type_ids, all_labels)
return dataset
def epoch_time(start_time, end_time):
elapsed_time = end_time - start_time
elapsed_mins = int(elapsed_time / 60)
elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
return elapsed_mins, elapsed_secs
def train(model, data_loader, criterion, optimizer):
epoch_acc = 0.
epoch_loss = 0.
total_batch = 0
model.train()
for batch in data_loader:
#
batch = tuple(t.to(device) for t in batch)
input_ids, attention_mask, token_type_ids, labels = batch
        # forward pass
outputs = model(input_ids=input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids)[0]
        # compute loss and accuracy for this batch
loss = criterion(outputs, labels)
_, y = torch.max(outputs, dim=1)
acc = (y == labels).float().mean()
#
if total_batch % 100==0:
print('Iter_batch[{}/{}]:'.format(total_batch, len(data_loader)),
'Train Loss: ', "%.3f" % loss.item(), 'Train Acc:', "%.3f" % acc.item())
        # accumulate accuracy and loss over the epoch
        epoch_acc += acc.item()    # accuracy of the current batch
        epoch_loss += loss.item()  # loss of the current batch
        # backward pass and parameter update
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
#scheduler.step() # Update learning rate schedule
model.zero_grad()
total_batch += 1
# break
return epoch_acc / len(data_loader), epoch_loss / len(data_loader)
def evaluate(model, data_loader, criterion):
epoch_acc = 0.
epoch_loss = 0.
model.eval()
with torch.no_grad():
for batch in data_loader:
#
batch = tuple(t.to(device) for t in batch)
input_ids, attention_mask, token_type_ids, labels = batch
            # forward pass
outputs = model(input_ids=input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids)[0]
            # compute loss and accuracy for this batch
loss = criterion(outputs, labels)
_, y = torch.max(outputs, dim=1)
acc = (y == labels).float().mean()
            # accumulate accuracy and loss over the epoch
            epoch_acc += acc.item()    # accuracy of the current batch
            epoch_loss += loss.item()  # loss of the current batch
return epoch_acc / len(data_loader), epoch_loss / len(data_loader)
def main():
    # 1. Define the data processor
processor = Sst2Processor()
label_list = processor.get_labels()
num_labels = len(label_list)
print('label_list: ', label_list)
print('num_label: ', num_labels)
print('*' * 60)
    # 2. Load the datasets
train_examples = processor.get_train_examples(args.data_dir)
    dev_examples = processor.get_dev_examples(args.data_dir)    # each record is wrapped in an InputExample instance
    test_examples = processor.get_test_examples(args.data_dir)  # each record is wrapped in an InputExample instance
    print('Number of training examples:', len(train_examples))
    print('Number of dev examples:', len(dev_examples))
    print('Number of test examples:', len(test_examples))
    print('Sample training example:\n', train_examples[0])
    print('Sample dev example:\n', dev_examples[0])
    print('Sample test example:\n', test_examples[0])
    # 3. Load the local vocabulary and the pre-trained model
bert_path = './bert_pretrain/'
tokenizer = BertTokenizer.from_pretrained( os.path.join(bert_path, 'vocab.txt') )
train_dataset = load_and_cache_examples(args, processor, tokenizer, 'train')
dev_dataset = load_and_cache_examples(args, processor, tokenizer, 'dev')
test_dataset = load_and_cache_examples(args, processor, tokenizer, 'test')
train_dataloader = DataLoader(train_dataset, batch_size=args.batch_size, shuffle=True)
dev_dataloader = DataLoader(dev_dataset, batch_size=args.batch_size, shuffle=False)
test_dataloader = DataLoader(test_dataset, batch_size=args.batch_size, shuffle=False)
    # 4. Define the model
model = BertForSequenceClassification.from_pretrained(
os.path.join(bert_path, 'pytorch_model.bin'),
config=os.path.join(bert_path, 'config.json'))
model.to(device)
N_EPOCHS = args.n_epochs
    # Optimizer: AdamW
    t_total = len(train_dataloader) * N_EPOCHS  # total number of training steps (used if a scheduler is added)
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
{
"params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
"weight_decay": 0.0,
},
{"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
]
optimizer = AdamW(optimizer_grouped_parameters, lr=args.lr,
correct_bias=False)
# loss_func
criterion = nn.CrossEntropyLoss()
    # 5. Train the model
    print('Training started:')
logger.info("***** Running training *****")
logger.info(" train num examples = %d", len(train_dataloader))
logger.info(" dev num examples = %d", len(dev_dataloader))
logger.info(" test num examples = %d", len(test_dataloader))
logger.info(" Num Epochs = %d", args.n_epochs)
best_valid_loss = float('inf')
for epoch in range(N_EPOCHS):
start_time = time.time()
train_acc, train_loss = train(model, train_dataloader, criterion, optimizer)
val_acc, val_loss = evaluate(model, dev_dataloader, criterion)
end_time = time.time()
epoch_mins, epoch_secs = epoch_time(start_time, end_time)
if val_loss < best_valid_loss:
            print('Validation loss improved, saving model ->')
best_valid_loss = val_loss
torch.save(model.state_dict(), 'bert-model.pt')
print(f'Epoch: {epoch + 1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc * 100:.3f}%')
print(f'\t Val. Loss: {val_loss:.3f} | Val. Acc: {val_acc * 100:.3f}%')
# evaluate
model.load_state_dict(torch.load('bert-model.pt'))
test_acc, test_loss = evaluate(model, test_dataloader, criterion)
print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc * 100:.2f}%')
if __name__ == '__main__':
main()
python3 main.py
Validation loss improved, saving model ->
Epoch: 01 | Epoch Time: 13m 38s
Train Loss: 0.293 | Train Acc: 88.791%
Val. Loss: 0.307 | Val. Acc: 88.600%
Epoch: 02 | Epoch Time: 13m 40s
Train Loss: 0.160 | Train Acc: 94.905%
Val. Loss: 0.336 | Val. Acc: 88.664%
Epoch: 03 | Epoch Time: 13m 40s
Train Loss: 0.127 | Train Acc: 96.367%
Val. Loss: 0.471 | Val. Acc: 88.935%
Test Loss: 0.289 | Test Acc: 89.57%
Next we classify news titles, a Chinese short-text classification task. We use the pre-trained Chinese BERT model through the HuggingFace transformers library, and the final results are quite good.
https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-pytorch_model.bin
https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-vocab.txt
https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-config.json
pytorch_model.bin: the pre-trained model weights
vocab.txt: the vocabulary file
config.json: the BERT configuration file, holding the model's hyperparameters (shown below, with a num_labels entry added; see the note after it)
{
"architectures": [
"BertForSequenceClassification"
],
"attention_probs_dropout_prob": 0.1,
"directionality": "bidi",
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pooler_fc_size": 768,
"pooler_num_attention_heads": 12,
"pooler_num_fc_layers": 3,
"pooler_size_per_head": 128,
"pooler_type": "first_token_transform",
"type_vocab_size": 2,
"vocab_size": 21128,
"num_labels": 10
}
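Note the num_labels entry set to 10 at the end of config.json for the ten news categories. Instead of editing the JSON by hand, the same override can be done in code; here is a small sketch using BertConfig from the transformers library:

import os
from transformers import BertConfig, BertForSequenceClassification

bert_path = './bert_pretrain/'

# Load the stock Chinese config and override the number of output classes
config = BertConfig.from_pretrained(os.path.join(bert_path, 'config.json'), num_labels=10)
model = BertForSequenceClassification.from_pretrained(
    os.path.join(bert_path, 'pytorch_model.bin'), config=config)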
# Directory holding the vocabulary and the pre-trained BERT model
bert_path = './bert_pretrain/'
# Load the vocabulary: by default this looks for vocab.txt in the given directory
tokenizer = BertTokenizer.from_pretrained(bert_path)
# Initialise the pre-trained BERT classification model
model = BertForSequenceClassification.from_pretrained(
    os.path.join(bert_path, 'pytorch_model.bin'),   # model weights
    config=os.path.join(bert_path, 'config.json')   # BERT configuration file
)
The data processor below is adapted from the rule_bert.py implementation in the BERT code:
class NewsProcessor(DataProcessor):
"""Processor for the News-2 data set (GLUE version)."""
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
def get_test_examples(self, data_dir):
"""See base class."""
return self._create_examples(self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")
def get_labels(self):
"""See base class."""
return ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev/test sets."""
examples = []
for (i, line) in enumerate(lines):
guid = "%s-%s" % (set_type, i)
text_a = line[0]
label = line[1]
examples.append(InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
return examples
def load_and_cache_examples(args, processor, tokenizer, set_type):
# Load data features from cache or dataset file
print("Creating features from dataset file at {}".format(args.data_dir))
if set_type == 'train':
examples = (
processor.get_train_examples(args.data_dir)
)
if set_type == 'dev':
examples = (
processor.get_dev_examples(args.data_dir)
)
if set_type == 'test':
examples = (
processor.get_test_examples(args.data_dir)
)
    features = convert_examples_to_features(
        examples,                           # the raw InputExample records
        tokenizer,                          # the BERT tokenizer loaded above
        label_list=processor.get_labels(),  # the ten news categories
        max_length=args.max_seq_length,     # maximum sequence length per example
        output_mode='classification',       # classification task
        pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0],
        pad_token_segment_id=0,             # segment id used for padding (0 for BERT)
        mask_padding_with_zero=True         # real tokens get attention mask 1, padding gets 0
    )
# Convert to Tensors and build dataset
all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
all_attention_mask = torch.tensor([f.attention_mask for f in features], dtype=torch.long)
all_token_type_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long)
all_labels = torch.tensor([f.label for f in features], dtype=torch.long)
dataset = TensorDataset(all_input_ids, all_attention_mask, all_token_type_ids, all_labels)
return dataset
def train(model, data_loader, criterion, optimizer):
epoch_acc = 0.
epoch_loss = 0.
total_batch = 0
model.train()
for batch in data_loader:
#
batch = tuple(t.to(device) for t in batch)
input_ids, attention_mask, token_type_ids, labels = batch
        # forward pass
outputs = model(input_ids=input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids)[0]
        # compute loss and accuracy for this batch
loss = criterion(outputs, labels)
_, y = torch.max(outputs, dim=1)
acc = (y == labels).float().mean()
#
if total_batch % 100 == 0:
print('Iter_batch[{}/{}]:'.format(total_batch, len(data_loader)),
'Train Loss: ', "%.3f" % loss.item(), 'Train Acc:', "%.3f" % acc.item())
        # accumulate accuracy and loss over the epoch
        epoch_acc += acc.item()    # accuracy of the current batch
        epoch_loss += loss.item()  # loss of the current batch
        # backward pass and parameter update
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
# scheduler.step() # Update learning rate schedule
model.zero_grad()
total_batch += 1
# break
return epoch_acc / len(data_loader), epoch_loss / len(data_loader)
def evaluate(model, data_loader, criterion):
epoch_acc = 0.
epoch_loss = 0.
model.eval()
with torch.no_grad():
for batch in data_loader:
#
batch = tuple(t.to(device) for t in batch)
input_ids, attention_mask, token_type_ids, labels = batch
            # forward pass
outputs = model(input_ids=input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids)[0]
            # compute loss and accuracy for this batch
loss = criterion(outputs, labels)
_, y = torch.max(outputs, dim=1)
acc = (y == labels).float().mean()
            # accumulate accuracy and loss over the epoch
            epoch_acc += acc.item()    # accuracy of the current batch
            epoch_loss += loss.item()  # loss of the current batch
return epoch_acc / len(data_loader), epoch_loss / len(data_loader)
Once the core code above is defined, we only need to create the optimizer, load the data and run the training loop; a condensed sketch is given below rather than the full script.
After that, let's look directly at the training run and the final results.
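The wiring mirrors the SST-2 script above almost line for line. Here is a condensed sketch, assuming the args, NewsProcessor, load_and_cache_examples, train and evaluate defined earlier; bert-news-model.pt is just an illustrative checkpoint name, and the weight-decay parameter grouping can be added exactly as in the SST-2 script:

processor = NewsProcessor()
tokenizer = BertTokenizer.from_pretrained(bert_path)

train_loader = DataLoader(load_and_cache_examples(args, processor, tokenizer, 'train'),
                          batch_size=args.batch_size, shuffle=True)
dev_loader = DataLoader(load_and_cache_examples(args, processor, tokenizer, 'dev'),
                        batch_size=args.batch_size, shuffle=False)

model = BertForSequenceClassification.from_pretrained(
    os.path.join(bert_path, 'pytorch_model.bin'),
    config=os.path.join(bert_path, 'config.json')).to(device)

optimizer = AdamW(model.parameters(), lr=args.lr, correct_bias=False)
criterion = nn.CrossEntropyLoss()

best_valid_loss = float('inf')
for epoch in range(args.n_epochs):
    train_acc, train_loss = train(model, train_loader, criterion, optimizer)
    val_acc, val_loss = evaluate(model, dev_loader, criterion)
    if val_loss < best_valid_loss:   # keep the checkpoint with the lowest dev loss
        print('Validation loss improved, saving model ->')
        best_valid_loss = val_loss
        torch.save(model.state_dict(), 'bert-news-model.pt')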
python3 main.py
The news titles are short texts, typically around 24 characters. We fine-tune for 3 epochs, and the results are quite good.
Sample training example:
InputExample(guid='train-0', text_a='中华女子学院:本科层次仅1专业招男生', text_b=None, label='3')
Sample dev example:
InputExample(guid='dev-0', text_a='体验2D巅峰 倚天屠龙记十大创新概览', text_b=None, label='8')
Sample test example:
InputExample(guid='test-0', text_a='词汇阅读是关键 08年考研暑期英语复习全指南', text_b=None, label='3')
Validation loss improved, saving model ->
Epoch: 01 | Epoch Time: 7m 0s
Train Loss: 0.288 | Train Acc: 91.121%
Val. Loss: 0.206 | Val. Acc: 93.097%
Validation loss improved, saving model ->
Epoch: 02 | Epoch Time: 7m 4s
Train Loss: 0.153 | Train Acc: 95.030%
Val. Loss: 0.200 | Val. Acc: 93.612%
Epoch: 03 | Epoch Time: 7m 5s
Train Loss: 0.109 | Train Acc: 96.385%
Val. Loss: 0.205 | Val. Acc: 93.661%
Test Loss: 0.201 | Test Acc: 93.72%
with open('data/class.txt') as f:
items = [item.strip() for item in f.readlines()]
def predict(model, text):
    # Encode a single sentence into input_ids / attention_mask / token_type_ids
    tokens_pt2 = tokenizer.encode_plus(text, return_tensors="pt")
    print(type(tokens_pt2))  # debug output: the encoder returns a dict-like object
    input_ids = tokens_pt2['input_ids']
    attention_mask = tokens_pt2['attention_mask']
    token_type_ids = tokens_pt2['token_type_ids']
with torch.no_grad():
outputs = model(input_ids=input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids)[0]
_,y=torch.max(outputs,dim=1)
return y.item()
text = "安徽2009年成人高考成绩查询系统"
y_pred = predict(model,text)
print('*'*60)
print('Online prediction:')
print("{}->{}->{}".format( text,y_pred,items[y_pred]))
Output:
安徽2009年成人高考成绩查询系统->3->教育