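Google released BERT a while ago, but I only recently used it in a real project, a text classification task, so I am writing down how it went. Before that, a brief walkthrough of the BERT code.
BERT source code
First, clone the official BERT repository and look at its directory layout: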
.
├── CONTRIBUTING.md
├── create_pretraining_data.py # builds the pretraining data
├── extract_features.py
├── __init__.py
├── LICENSE
├── modeling.py # model architecture
├── modeling_test.py
├── multilingual.md
├── optimization.py # optimizer choice, learning rate and related settings
├── optimization_test.py
├── predicting_movie_reviews_with_bert_on_tf_hub.ipynb
├── README.md
├── requirements.txt
├── run_classifier.py # fine-tuning script (the one we customize)
├── run_classifier_with_tfhub.py
├── run_pretraining.py # pretraining script
├── run_squad.py
├── sample_text.txt
├── tokenization.py # tokenizer
└── tokenization_test.py
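Only two scripts matter for us: run_classifier.py and run_pretraining.py.
run_pretraining
run_pretraining.py is the pretraining script. Realistically, just use the models Google has already trained, since few of us have the resources to pretrain one from scratch. Find the Simplified Chinese model (chinese_L-12_H-768_A-12); after downloading and unzipping it, the directory looks like this: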
├── bert_config.json # BERT hyperparameter configuration
├── bert_model.ckpt.data-00000-of-00001 # pretrained weights
├── bert_model.ckpt.index
├── bert_model.ckpt.meta
└── vocab.txt # vocabulary
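All later NLP tasks can start from this model. What I actually used is the Chinese BERT-wwm model pretrained by HIT (Harbin Institute of Technology). It was pretrained with Whole Word Masking and is reported to perform better than Google's official Chinese pretrained model; see the BERT-wwm project page for details.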
run_classifier
Fine-tuning is the core part, and the key code is your own Processor. The source already ships example Processors for four NLP tasks: XnliProcessor, MnliProcessor, MrpcProcessor and ColaProcessor. Every Processor implements the functions below; taking MnliProcessor as the example:
get_train_examples: returns the training data; expects a "train.tsv" file in the data directory
  def get_train_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
get_dev_examples: returns the validation data; expects a "dev_matched.tsv" file in the data directory
  def get_dev_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "dev_matched.tsv")),
        "dev_matched")
get_test_examples: returns the test data; expects a "test_matched.tsv" file in the data directory
  def get_test_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "test_matched.tsv")), "test")
get_labels: returns the list of class labels
  def get_labels(self):
    """See base class."""
    return ["contradiction", "entailment", "neutral"]
_create_examples: builds the examples from the rows that were read
  def _create_examples(self, lines, set_type):
    """Creates examples for the training and dev sets."""
    examples = []
    for (i, line) in enumerate(lines):
      if i == 0:
        continue
      guid = "%s-%s" % (set_type, tokenization.convert_to_unicode(line[0]))
      text_a = tokenization.convert_to_unicode(line[8])
      text_b = tokenization.convert_to_unicode(line[9])
      if set_type == "test":
        label = "contradiction"
      else:
        label = tokenization.convert_to_unicode(line[-1])
      examples.append(
          InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
    return examples
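To make the data flow concrete, here is a minimal standalone sketch of what `_read_tsv` plus `_create_examples` do for the MNLI layout above: read tab-separated rows, skip the header, and pick columns 0, 8, 9 and the last one. `Example`, `read_tsv` and `create_mnli_examples` are hypothetical stand-ins, not names from the BERT repo, and the sample row is made up.

```python
import csv
import io
from collections import namedtuple

# Hypothetical stand-in for run_classifier.InputExample.
Example = namedtuple("Example", ["guid", "text_a", "text_b", "label"])


def read_tsv(handle):
    """Read tab-separated rows without any quote handling."""
    return list(csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))


def create_mnli_examples(lines, set_type):
    """Same column logic as the MnliProcessor method shown above."""
    examples = []
    for i, line in enumerate(lines):
        if i == 0:  # the first row is the header
            continue
        guid = "%s-%s" % (set_type, line[0])
        # The test split has no gold labels, so a placeholder is used.
        label = "contradiction" if set_type == "test" else line[-1]
        examples.append(Example(guid, line[8], line[9], label))
    return examples


# Made-up data: a header plus one row with 11 columns.
header = "\t".join("col%d" % i for i in range(11))
row = "\t".join(["7"] + [""] * 7 + ["A man sleeps.", "A man is awake.", "contradiction"])
rows = read_tsv(io.StringIO(header + "\n" + row + "\n"))
examples = create_mnli_examples(rows, "train")
print(examples[0])
```

The point to take away is that a Processor is only a mapping from columns to `guid`, `text_a`, `text_b` and `label`; everything else is inherited.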
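Getting started
First back up run_classifier.py, then edit the file directly. The task here is multi-class text classification (screening text for pornographic or politically sensitive content). Start by writing your own Processor: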
class TextProcessor(DataProcessor):
  """Processor for the text classification task."""

  def get_train_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

  def get_dev_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "dev_matched.tsv")), "dev_matched")

  def get_test_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "test_matched.tsv")), "test")

  def get_labels(self):
    """
    0: normal text
    1: pornographic text
    2: politically sensitive text
    """
    return ["0", "1", "2"]

  def _create_examples(self, lines, set_type):
    """Creates examples for the training and dev sets."""
    examples = []
    for (i, line) in enumerate(lines):
      guid = "%s-%s" % (set_type, i)
      # Mind your own sample format: here the label is in column 0 and the text in column 1
      text_a = tokenization.convert_to_unicode(line[1])
      label = tokenization.convert_to_unicode(line[0])
      examples.append(
          InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
    return examples
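Because TextProcessor reads the label from column 0 and the text from column 1 and does not skip a header row, a malformed line only shows up later as an `IndexError`, or as a `KeyError` when the label is looked up in the label map. A minimal pre-flight check, assuming the two-column tab-separated layout described in the comment above; `find_bad_rows` is a hypothetical helper, not part of run_classifier.py:

```python
import csv
import io

VALID_LABELS = {"0", "1", "2"}  # must match TextProcessor.get_labels()


def find_bad_rows(handle, valid_labels=VALID_LABELS):
    """Return (line_number, reason) for every row the Processor would choke on."""
    problems = []
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_no, row in enumerate(reader, start=1):
        if len(row) < 2:
            problems.append((line_no, "expected label<TAB>text"))
        elif row[0] not in valid_labels:
            problems.append((line_no, "unknown label %r" % row[0]))
        elif not row[1].strip():
            problems.append((line_no, "empty text"))
    return problems


# Made-up sample: one good row followed by three bad ones.
sample = "0\tsome normal text\n3\tlabel out of range\nno tab here\n1\t \n"
print(find_bad_rows(io.StringIO(sample)))
```

Running this over train.tsv, dev_matched.tsv and test_matched.tsv before launching training is cheaper than discovering a bad row partway through feature conversion.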
Then register the new Processor in the processors dict inside main. The key text is the custom task name, which is needed when launching the script:
def main(_):
  tf.logging.set_verbosity(tf.logging.INFO)
  processors = {
      "cola": ColaProcessor,
      "mnli": MnliProcessor,
      "mrpc": MrpcProcessor,
      "xnli": XnliProcessor,
      "text": TextProcessor,  # text classification
  }
  ...
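The registration works because main lower-cases the `--task_name` flag, looks it up in this dict, and instantiates whatever class it finds. A standalone sketch of that lookup; `pick_processor` is a hypothetical name, and the class here is a trimmed stand-in for the real TextProcessor:

```python
class TextProcessor(object):
    """Trimmed-down stand-in for the TextProcessor defined above."""

    def get_labels(self):
        return ["0", "1", "2"]


processors = {"text": TextProcessor}


def pick_processor(task_name):
    """Lower-case the task name, then instantiate the registered class."""
    key = task_name.lower()
    if key not in processors:
        raise ValueError("Task not found: %s" % key)
    return processors[key]()


print(pick_processor("TEXT").get_labels())
```

This is why the key must be lower-case: a key such as "Text" in the dict could never be matched.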
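In addition, to prepare for deployment later, add a few lines to convert_single_example that write the label-to-id mapping to a file. If you forget this step it does no harm; you can create the file yourself afterwards: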
import pickle
...
def convert_single_example(ex_index, example, label_list, max_seq_length,
                           tokenizer):
  """Converts a single `InputExample` into a single `InputFeatures`."""
  if isinstance(example, PaddingInputExample):
    return InputFeatures(
        input_ids=[0] * max_seq_length,
        input_mask=[0] * max_seq_length,
        segment_ids=[0] * max_seq_length,
        label_id=0,
        is_real_example=False)

  label_map = {}
  for (i, label) in enumerate(label_list):
    label_map[label] = i

  ########################## added section ##############################
  output_label2id_file = os.path.join(FLAGS.output_dir, "label2id.pkl")
  if not os.path.exists(output_label2id_file):
    with open(output_label2id_file, 'wb') as w:
      pickle.dump(label_map, w)
  ########################## added section ##############################

  tokens_a = tokenizer.tokenize(example.text_a)
  tokens_b = None
  if example.text_b:
    tokens_b = tokenizer.tokenize(example.text_b)

  if tokens_b:
    # Modifies `tokens_a` and `tokens_b` in place so that the total
    # length is less than the specified length.
    ...
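If the file was not written during training, it can be recreated by hand, since it is only a pickled dict from label string to integer id in the order returned by get_labels. A minimal sketch; the temporary directory stands in for FLAGS.output_dir and is an assumption of this example:

```python
import os
import pickle
import tempfile

label_list = ["0", "1", "2"]  # same order as TextProcessor.get_labels()
label_map = {label: i for i, label in enumerate(label_list)}

output_dir = tempfile.mkdtemp()  # stands in for FLAGS.output_dir
path = os.path.join(output_dir, "label2id.pkl")
with open(path, "wb") as w:
    pickle.dump(label_map, w)

# Reading it back and inverting it gives the id -> label lookup
# that is needed when decoding predictions.
with open(path, "rb") as r:
    label2id = pickle.load(r)
id2label = {i: label for label, i in label2id.items()}
print(label2id, id2label)
```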
That completes the changes to run_classifier.py.
Data preparation
Create a data folder:
.
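├── dev_matched.tsv # validation set
├── test_matched.tsv # test set
└── train.tsv # training set
Each line holds one sample, with the label in the first column and the text in the second, separated by a tab (which is what _read_tsv splits on):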
0 丝袜被撕开,一片雪白的肌肤乍现...
0 保姆接下来快速地在我那根手指上面揉搓,顿时感到疼痛...
1 她再也忍不住,本能占据了她的矜持,伸手探向了唐枫的身...
2 北京天安门...
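One way to produce the three files from a single list of labeled samples is a shuffled split. This is a sketch under assumptions: the 80/10/10 ratio, the fixed seed and the helper name `write_splits` are all illustrative choices, not something the original workflow prescribes.

```python
import os
import random
import tempfile


def write_splits(samples, out_dir, dev_ratio=0.1, test_ratio=0.1, seed=42):
    """Shuffle (label, text) pairs and write train/dev/test TSV files."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n_dev = int(len(samples) * dev_ratio)
    n_test = int(len(samples) * test_ratio)
    splits = {
        "dev_matched.tsv": samples[:n_dev],
        "test_matched.tsv": samples[n_dev:n_dev + n_test],
        "train.tsv": samples[n_dev + n_test:],
    }
    for name, rows in splits.items():
        with open(os.path.join(out_dir, name), "w", encoding="utf-8") as f:
            for label, text in rows:
                # Tabs or newlines inside the text would break the TSV layout.
                clean = text.replace("\t", " ").replace("\n", " ")
                f.write("%s\t%s\n" % (label, clean))
    return {name: len(rows) for name, rows in splits.items()}


# Made-up samples: 20 pairs cycling through the three labels.
samples = [(str(i % 3), "sample text %d" % i) for i in range(20)]
out_dir = tempfile.mkdtemp()
sizes = write_splits(samples, out_dir)
print(sizes)
```

Note that TextProcessor reads a label from column 0 for the test split as well, so test_matched.tsv needs the same two-column layout as the other files.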
Training
For convenience, put the launch command in a script:
export DATA_DIR="absolute path to the data folder"
export BERT_BASE_DIR="path to the pretrained BERT model"
export OUTPUT_DIR="model output path"
# task_name: the key registered in the processors dict above
# do_train / do_eval / do_predict: need train.tsv / dev_matched.tsv / test_matched.tsv under data
# max_seq_length: maximum sequence length
# train_batch_size: too large may exhaust GPU memory and fail; too small may fit less well
# learning_rate: 2e-5 is the value used in the BERT README examples (the flag's built-in default is 5e-5)
# num_train_epochs: number of training epochs
python run_classifier.py \
  --task_name=text \
  --do_train=true \
  --do_eval=true \
  --do_predict=true \
  --data_dir=${DATA_DIR}/ \
  --vocab_file=${BERT_BASE_DIR}/vocab.txt \
  --bert_config_file=${BERT_BASE_DIR}/bert_config.json \
  --init_checkpoint=${BERT_BASE_DIR}/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=16 \
  --learning_rate=2e-5 \
  --num_train_epochs=20 \
  --output_dir=${OUTPUT_DIR}
Now launch the script to start training. You will see logs like:
WARNING:tensorflow:Estimator's model_fn (.model_fn at 0x7fd4b2c01488>) includes params argument, but params are not passed to Estimator.
INFO:tensorflow:Using config: {'_model_dir': '/home/SanJunipero/rd/dujihan/book_content_review/model', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 1000, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: truegraph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': , '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=1000, num_shards=8, num_cores_per_replica=None, per_host_input_for_training=3, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None), '_cluster': None} INFO:tensorflow:_TPUContext: eval_on_tpu TrueWARNING:tensorflow:eval_on_tpu ignored because use_tpu is False.
INFO:tensorflow:Writing example 0 of 70000
...
...
...
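Errors I ran into:
- Version problems: for the GPU build, try a 1.x release of TensorFlow; for the CPU build, check whether the module paths of the APIs used need to be changed.
- Batch size: I first set train_batch_size=32 and got an out-of-memory error on the GPU; changing it to 16 fixed it.
The output folder also contains files starting with events, which record how training metrics change over time. If tensorboard is installed you can inspect training in a browser. Start tensorboard from the command line:
tensorboard --logdir=./output_path
Output like the following means it started successfully: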
W1031 17:09:47.128739 140197393516288 plugin_event_accumulator.py:294] Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events. Overwriting the graph with the newest event.
W1031 17:09:47.147893 140197393516288 plugin_event_accumulator.py:302] Found more than one metagraph event per run. Overwriting the metagraph with the newest event.
TensorBoard 1.13.1 at http://127.0.0.1:6006 (Press CTRL+C to quit)
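Open that address in a browser to see how training is going.
Training finished
After training completes, the output directory contains the following files: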
.
├── checkpoint
├── eval # present when training was run with --do_eval=true
│ └── events.out.tfevents.1572485638.localhost
├── eval_results.txt # performance on the validation set
├── eval.tf_record
├── events.out.tfevents.localhost
├── graph.pbtxt
├── label2id.pkl # label-to-id mapping
├── model.ckpt-84000.data-00000-of-00001 # trained checkpoints; the latest 5 are kept by default
├── model.ckpt-84000.index
├── model.ckpt-84000.meta
├── model.ckpt-85000.data-00000-of-00001
├── model.ckpt-85000.index
├── model.ckpt-85000.meta
├── model.ckpt-86000.data-00000-of-00001
├── model.ckpt-86000.index
├── model.ckpt-86000.meta
├── model.ckpt-87000.data-00000-of-00001
├── model.ckpt-87000.index
├── model.ckpt-87000.meta
├── model.ckpt-87500.data-00000-of-00001
├── model.ckpt-87500.index
├── model.ckpt-87500.meta
├── predict.tf_record
├── test_results.tsv # predictions on the test set
└── train.tf_record
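test_results.tsv contains no labels, only numbers. The sketch below assumes each line holds one tab-separated probability per class in label-id order, which is how run_classifier.py writes its predictions; `read_predictions` is a hypothetical helper and the two rows are made up.

```python
import io

id2label = {0: "0", 1: "1", 2: "2"}  # inverse of the mapping in label2id.pkl


def read_predictions(handle, id2label):
    """Turn each row of class probabilities into (label, probability)."""
    results = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        probs = [float(p) for p in line.split("\t")]
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append((id2label[best], probs[best]))
    return results


# Made-up rows in the same layout as test_results.tsv.
fake_file = io.StringIO("0.9\t0.06\t0.04\n0.1\t0.2\t0.7\n")
print(read_predictions(fake_file, id2label))
```

Rows come out in the same order as the lines of test_matched.tsv, so the two files can be zipped together for error analysis.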
Open eval_results.txt to see the final result:
eval_accuracy = 0.8573
eval_loss = 1.4312192
global_step = 87500
loss = 1.4312192
That completes model training.
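The step count is easy to sanity-check. run_classifier.py derives the number of training steps from the number of examples, the batch size and the epoch count, and the warmup steps from the warmup proportion (0.1 unless overridden). Plugging in the 70000 examples from the training log, batch size 16 and 20 epochs reproduces the global_step of 87500 reported in eval_results.txt. `training_steps` is a hypothetical helper name; the two formulas follow the script.

```python
def training_steps(num_examples, batch_size, epochs, warmup_proportion=0.1):
    """Total and warmup step counts, computed the way run_classifier.py does."""
    num_train_steps = int(num_examples / batch_size * epochs)
    num_warmup_steps = int(num_train_steps * warmup_proportion)
    return num_train_steps, num_warmup_steps


total, warmup = training_steps(num_examples=70000, batch_size=16, epochs=20)
print(total, warmup)
```

It also explains the checkpoint names in the output directory: with checkpoints saved every 1000 steps (`_save_checkpoints_steps` in the log), the last regular one is model.ckpt-87000 and the final one is model.ckpt-87500.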
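Deployment
Before deploying, we freeze the model with a script written by someone else, freeze_graph.py, which converts the checkpoint into a single .pb file (put the script in the same directory as run_classifier.py). The full code is below; it can also be downloaded from my GitHub: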
import json
import os
from enum import Enum
from termcolor import colored
import sys
import modeling
import logging
import pickle
import tensorflow as tf
import argparse
def set_logger(context, verbose=False):
    if os.name == 'nt':  # for Windows
        return NTLogger(context, verbose)
    logger = logging.getLogger(context)
    logger.setLevel(logging.DEBUG if verbose else logging.INFO)
    formatter = logging.Formatter(
        '%(levelname)-.1s:' + context + ':[%(filename).3s:%(funcName).3s:%(lineno)3d]:%(message)s', datefmt=
        '%m-%d %H:%M:%S')
    console_handler = logging.StreamHandler()
    console_handler.setLevel(logging.DEBUG if verbose else logging.INFO)
    console_handler.setFormatter(formatter)
    logger.handlers = []
    logger.addHandler(console_handler)
    return logger
class NTLogger:
    def __init__(self, context, verbose):
        self.context = context
        self.verbose = verbose

    def info(self, msg, **kwargs):
        print('I:%s:%s' % (self.context, msg), flush=True)

    def debug(self, msg, **kwargs):
        if self.verbose:
            print('D:%s:%s' % (self.context, msg), flush=True)

    def error(self, msg, **kwargs):
        print('E:%s:%s' % (self.context, msg), flush=True)

    def warning(self, msg, **kwargs):
        print('W:%s:%s' % (self.context, msg), flush=True)
def create_classification_model(bert_config, is_training, input_ids, input_mask, segment_ids, labels, num_labels):
    # Build the BERT representation of the input
    model = modeling.BertModel(
        config=bert_config,
        is_training=is_training,
        input_ids=input_ids,
        input_mask=input_mask,
        token_type_ids=segment_ids,
    )
    embedding_layer = model.get_sequence_output()
    output_layer = model.get_pooled_output()
    hidden_size = output_layer.shape[-1].value
    output_weights = tf.get_variable(
        "output_weights", [num_labels, hidden_size],
        initializer=tf.truncated_normal_initializer(stddev=0.02))
    output_bias = tf.get_variable(
        "output_bias", [num_labels], initializer=tf.zeros_initializer())
    with tf.variable_scope("loss"):
        if is_training:
            # I.e., 0.1 dropout
            output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)
        logits = tf.matmul(output_layer, output_weights, transpose_b=True)
        logits = tf.nn.bias_add(logits, output_bias)
        probabilities = tf.nn.softmax(logits, axis=-1)
        log_probs = tf.nn.log_softmax(logits, axis=-1)
        if labels is not None:
            one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
            per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
            loss = tf.reduce_mean(per_example_loss)
        else:
            loss, per_example_loss = None, None
    return (loss, per_example_loss, logits, probabilities)
def init_predict_var(path):
    label2id_file = os.path.join(path, 'label2id.pkl')
    if os.path.exists(label2id_file):
        with open(label2id_file, 'rb') as rf:
            label2id = pickle.load(rf)
            id2label = {value: key for key, value in label2id.items()}
            num_labels = len(label2id.items())
    return num_labels, label2id, id2label
def optimize_class_model(args, logger=None):
    if not logger:
        logger = set_logger(colored('CLASSIFICATION_MODEL, Loading...', 'cyan'), args.verbose)
    try:
        # If the PB file already exists, return its path; otherwise convert
        # the model to a PB file and return the path it was written to
        if args.model_pb_dir is None:
            tmp_file = args.model_dir
        else:
            tmp_file = args.model_pb_dir
        pb_file = os.path.join(tmp_file, 'classification_model.pb')
        if os.path.exists(pb_file):
            print('pb_file exits', pb_file)
            return pb_file
        # Added 2019/4/17: read num_labels from label2id.pkl so the
        # num_labels argument can be omitted
        if not args.num_labels:
            # label2id.pkl is written to the training output directory
            num_labels, label2id, id2label = init_predict_var(args.model_dir)
        else:
            num_labels = args.num_labels
        graph = tf.Graph()
        with graph.as_default():
            with tf.Session() as sess:
                input_ids = tf.placeholder(tf.int32, (None, args.max_seq_len), 'input_ids')
                input_mask = tf.placeholder(tf.int32, (None, args.max_seq_len), 'input_mask')
                bert_config = modeling.BertConfig.from_json_file(os.path.join(args.bert_model_dir, 'bert_config.json'))
                loss, per_example_loss, logits, probabilities = create_classification_model(
                    bert_config=bert_config, is_training=False,
                    input_ids=input_ids, input_mask=input_mask, segment_ids=None, labels=None, num_labels=num_labels)
                # pred_ids = tf.argmax(probabilities, axis=-1, output_type=tf.int32, name='pred_ids')
                # pred_ids = tf.identity(pred_ids, 'pred_ids')
                probabilities = tf.identity(probabilities, 'pred_prob')
                saver = tf.train.Saver()
            with tf.Session() as sess:
                sess.run(tf.global_variables_initializer())
                latest_checkpoint = tf.train.latest_checkpoint(args.model_dir)
                logger.info('loading... %s ' % latest_checkpoint)
                saver.restore(sess, latest_checkpoint)
                logger.info('freeze...')
                from tensorflow.python.framework import graph_util
                tmp_g = graph_util.convert_variables_to_constants(sess, graph.as_graph_def(), ['pred_prob'])
                logger.info('predict cut finished !!!')
        # Write the binary model to a file
        logger.info('write graph to a tmp file: %s' % pb_file)
        with tf.gfile.GFile(pb_file, 'wb') as f:
            f.write(tmp_g.SerializeToString())
        return pb_file
    except Exception as e:
        logger.error('fail to optimize the graph! %s' % e, exc_info=True)
if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Trans ckpt file to .pb file')
    parser.add_argument('-bert_model_dir', type=str, required=True,
                        help='chinese google bert model path')
    parser.add_argument('-model_dir', type=str, required=True,
                        help='directory of the fine-tuned model (the training output directory)')
    parser.add_argument('-model_pb_dir', type=str, default=None,
                        help='directory to write the .pb file to, default = model_dir')
    parser.add_argument('-max_seq_len', type=int, default=128,
                        help='maximum length of a sequence,default:128')
    parser.add_argument('-num_labels', type=int, default=None,
                        help='number of labels; read from label2id.pkl when omitted')
    parser.add_argument('-verbose', action='store_true', default=False,
                        help='turn on tensorflow logging for debug')
    args = parser.parse_args()
    optimize_class_model(args, logger=None)
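The classification head in create_classification_model is small: a linear layer over the pooled output, a softmax for `probabilities`, and a cross-entropy loss written as `-sum(one_hot * log_probs)`. The pure-Python sketch below reproduces only that arithmetic for a single example so the numbers can be checked by hand; the logits are made up and no TensorFlow is involved.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label_id):
    """-sum(one_hot * log_softmax), which reduces to -log p[label_id]."""
    return -math.log(softmax(logits)[label_id])


logits = [2.0, 0.5, -1.0]  # made-up logits for the 3 classes
probs = softmax(logits)
print(probs, cross_entropy(logits, 0))
```

At inference time only `probabilities` matters, which is why the freeze script exports the single node `pred_prob` and passes `labels=None`.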
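Run the command:
# -max_seq_len: sequence length; must equal the max_seq_length used in training
# -num_labels: number of labels
python freeze_graph.py \
  -bert_model_dir="path to the pretrained BERT model" \
  -model_dir="model output path (the same as the training output path)" \
  -max_seq_len=128 \
  -num_labels=3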
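After it runs, the output folder gains a classification_model.pb file, which is the frozen model. Now deployment can begin.
Serving relies on an open-source framework written by someone else, BERT-BiLSTM-CRF-NER. Despite the NER in its name, it can also serve BERT classification models. Install it with: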
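The frozen graph takes two inputs, `input_ids` and `input_mask`, both fixed at max_seq_len columns, which is why the length must match training. The sketch below shows how one sentence becomes such a pair: add [CLS] and [SEP], truncate, map tokens to ids, then pad with zeros. It is a simplification under stated assumptions: `toy_vocab` is made up (the real ids come from vocab.txt), and text is split per character rather than with the tokenizer in tokenization.py.

```python
def encode(text, vocab, max_seq_len):
    """Return (input_ids, input_mask), both padded to max_seq_len."""
    tokens = list(text)[: max_seq_len - 2]  # leave room for [CLS] and [SEP]
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    input_ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
    input_mask = [1] * len(input_ids)  # 1 marks real tokens
    padding = [0] * (max_seq_len - len(input_ids))  # 0 marks padding
    return input_ids + padding, input_mask + padding


# Hypothetical vocabulary, for illustration only.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "你": 4, "好": 5}
ids, mask = encode("你好吗", toy_vocab, max_seq_len=8)
print(ids)
print(mask)
```

The serving framework introduced next performs this conversion itself, so the client only sends raw strings.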
pip install bert-base==0.0.7 -i https://pypi.python.org/simple
Again for convenience, put the launch command in a script:
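# -model_pb_dir: directory containing classification_model.pb
# -mode: we are doing classification, so use CLASS
# -max_seq_len: keep consistent with the value used above
# -port / -port_out: port numbers; must not clash with other programs
bert-base-serving-start \
  -model_dir "path to the trained model" \
  -bert_model_dir "path to the pretrained BERT model" \
  -model_pb_dir "path to the directory holding classification_model.pb" \
  -mode CLASS \
  -max_seq_len 128 \
  -port 7006 \
  -port_out 7007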
For installation, usage and the meaning of each parameter, see BERT-BiLSTM-CRF-NER and bert-as-service.
After the service starts, you will see log output like this:
usage: xxxx/bin/bert-base-serving-start -model_dir xxxx/model -bert_model_dir xxxx/bert_model -model_pb_dir xxxxx/model -mode CLASS -max_seq_len 128 -port 7006 -port_out 7007
ARG VALUE
__________________________________________________
bert_model_dir = xxxx
ckpt_name = bert_model.ckpt
config_name = bert_config.json
cors = *
cpu = False
device_map = []
fp16 = False
gpu_memory_fraction = 0.5
http_max_connect = 10
http_port = None
mask_cls_sep = False
max_batch_size = 1024
max_seq_len = 128
mode = CLASS
model_dir = xxxxx
model_pb_dir = xxxxx
num_worker = 1
pooling_layer = [-2]
pooling_strategy = REDUCE_MEAN
port = 7006
port_out = 7007
prefetch_size = 10
priority_batch_size = 16
tuned_model_dir = None
verbose = False
xla = False
I:VENTILATOR:[__i:__i:104]:lodding classification predict, could take a while...
I:VENTILATOR:[__i:__i:111]:contain 0 labels:dict_values(['0', '1', '2'])
pb_file exits xxxx/model/classification_model.pb
I:VENTILATOR:[__i:__i:114]:optimized graph is stored at: xxxxx/model/classification_model.pb
I:VENTILATOR:[__i:_ru:148]:bind all sockets
I:VENTILATOR:[__i:_ru:153]:open 8 ventilator-worker sockets, ipc://tmp0cZQ9R/socket,ipc://tmp6uxbcD/socket,ipc://tmpu7Xxeo/socket,ipc://tmpsF2Ug9/socket,ipc://tmpMJTkjU/socket,ipc://tmpkvoLlF/socket,ipc://tmpefSdoq/socket,ipc://tmpW60Iqb/socket
I:VENTILATOR:[__i:_ru:157]:start the sink
I:VENTILATOR:[__i:_ge:239]:get devices
I:SINK:[__i:_ru:317]:ready
I:VENTILATOR:[__i:_ge:271]:device map:
worker 0 -> gpu 0
I:WORKER-0:[__i:_ru:497]:use device gpu: 0, load graph from xxxx/model/classification_model.pb
I:WORKER-0:[__i:gen:537]:ready and listening!
The BERT service is now deployed.
Usage example
In [1]: from bert_base.client import BertClient
In [2]: str1="我爱北京天安门"
In [3]: str2 = "哈哈哈哈"
In [4]: with BertClient(show_server_config=False, check_version=False, check_length=False,
...: mode="CLASS", port=7006, port_out=7007) as bc:
...: res = bc.encode([str1, str2])
...:
In [5]: print(res)
[{'pred_label': ['2', '1'], 'score': [0.9999899864196777, 0.9999299049377441]}]
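The response is a list with one dict that carries parallel `pred_label` and `score` lists, in the same order as the input texts. A small sketch of turning that into per-text records; `pair_results`, the 0.5 threshold and the readable label names are assumptions added for illustration, and `res` is the response printed above.

```python
LABEL_NAMES = {"0": "normal", "1": "pornographic", "2": "political"}


def pair_results(texts, response, min_score=0.5):
    """Zip each input text with its predicted label and score."""
    first = response[0]
    rows = []
    for text, label, score in zip(texts, first["pred_label"], first["score"]):
        rows.append({
            "text": text,
            "label": LABEL_NAMES.get(label, label),
            "score": score,
            "confident": score >= min_score,
        })
    return rows


res = [{'pred_label': ['2', '1'], 'score': [0.9999899864196777, 0.9999299049377441]}]
for row in pair_results(["我爱北京天安门", "哈哈哈哈"], res):
    print(row)
```

A threshold like this is a convenient hook for routing low-confidence predictions to manual review instead of acting on them automatically.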
References:
BERT源码注释(run_classifier.py) - 全网最详细
干货 | BERT fine-tune 终极实践教程
NLP之BERT分类模型部署提供服务