BERT-BiLSTM-CRF-NER (GitHub project link)
Recommended environment:
python == 3.x
tensorflow == 1.13.2
tensorflow-gpu == 1.13.2
TensorFlow 1.14 raises errors with this project, and if the GPU build is too new it may fail to use the GPU at all.
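One way to set up such an environment (illustrative; the environment name bertner and the choice of Python 3.6 are my own):
conda create -n bertner python=3.6
conda activate bertner
pip install tensorflow-gpu==1.13.2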
Install via pip:
pip install bert-base==0.0.9 -i https://pypi.python.org/simple
Or install directly from GitHub:
git clone https://github.com/macanv/BERT-BiLSTM-CRF-NER
cd BERT-BiLSTM-CRF-NER/
python setup.py install
Check the installed TensorFlow version with:
pip list | grep tensor
BERT-Base, Chinese
In the downloaded model, bert_config.json is the configuration file; bert_model.ckpt.data-00000-of-00001 (together with the matching .index and .meta files) is the initialization checkpoint, which this framework refers to by the prefix bert_model.ckpt; and vocab.txt is the vocabulary file.
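After downloading and unpacking, the model directory contains:
chinese_L-12_H-768_A-12/
    bert_config.json
    bert_model.ckpt.data-00000-of-00001
    bert_model.ckpt.index
    bert_model.ckpt.meta
    vocab.txt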
Download: ChineseNER
We mainly use the three data files under data/. When using them with this framework, copy them and rename them to train.txt (20,864 examples), dev.txt (2,318 examples), and test.txt (4,636 examples).
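These files use the character-per-line BIO format: each line holds one character and its tag separated by a space, with a blank line between sentences. An illustrative fragment:
海 O
钓 O
比 O
赛 O
地 O
点 O
在 O
厦 B-LOC
门 I-LOC
与 O
金 B-LOC
门 I-LOC
之 O
间 O
的 O
海 O
域 O
。 O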
Download: CLUENER2020
This dataset was built on top of THUCTC, the text-classification dataset open-sourced by Tsinghua University, by selecting a subset for fine-grained named-entity annotation; the original data comes from Sina News RSS.
Examples:
{"text": "浙商银行企业信贷部叶老桂博士则从另一个角度对五道门槛进行了解读。叶老桂认为,对目前国内商业银行而言,", "label": {"name": {"叶老桂": [[9, 11]]}, "company": {"浙商银行": [[0, 3]]}}}
{"text": "生生不息CSOL生化狂潮让你填弹狂扫", "label": {"game": {"CSOL": [[4, 7]]}}}
Note that spans are inclusive character indices: "叶老桂" occupies positions 9 through 11 of the text.
Per-label statistics over the training set are as follows (note: every entity occurring in an example is annotated, so if an example contains two address entities, it is counted twice in the address category):
[Train] label distribution:
address: 2829
book: 1131
company: 2897
game: 2325
government: 1797
movie: 1109
name: 3661
organization: 3075
position: 3052
scene: 1462
[Dev] label distribution:
address: 364
book: 152
company: 366
game: 287
government: 244
movie: 150
name: 451
organization: 344
position: 425
scene: 199
In this dataset the test split contains only sentences, with no gold labels, so we first split the train data into train + test (0.85 / 0.15) and keep dev as dev, then convert the JSON data into the character-per-line txt input format described above; a conversion sketch follows.
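A minimal conversion sketch (my own helper, not part of the project; the file layout and the names json2bio / split_json are hypothetical). It assumes CLUENER span ends are inclusive, as in the examples above:

# convert_cluener.py: turn CLUENER2020 json lines into char-per-line BIO txt.
import json
import random

def json2bio(in_path, out_path):
    with open(in_path, encoding='utf-8') as fin, \
         open(out_path, 'w', encoding='utf-8') as fout:
        for line in fin:
            item = json.loads(line)
            text = item['text']
            tags = ['O'] * len(text)
            for label, entities in item.get('label', {}).items():
                for spans in entities.values():
                    for start, end in spans:  # end index is inclusive
                        tags[start] = 'B-' + label
                        for i in range(start + 1, end + 1):
                            tags[i] = 'I-' + label
            for ch, tag in zip(text, tags):
                fout.write(ch + ' ' + tag + '\n')
            fout.write('\n')  # blank line separates sentences

def split_json(in_path, train_out, test_out, ratio=0.85, seed=42):
    # shuffle the original train json lines and split them 0.85 / 0.15
    with open(in_path, encoding='utf-8') as f:
        lines = f.readlines()
    random.seed(seed)
    random.shuffle(lines)
    cut = int(len(lines) * ratio)
    with open(train_out, 'w', encoding='utf-8') as f:
        f.writelines(lines[:cut])
    with open(test_out, 'w', encoding='utf-8') as f:
        f.writelines(lines[cut:])

The trainer's command-line arguments are listed below: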
[-data_dir DATA_DIR]
# Path to the training data. The train, dev, and test files must be named train.txt, dev.txt, and test.txt; any other naming raises an error.
[-bert_config_file BERT_CONFIG_FILE]
# bert_config.json from the Google BERT model directory
[-output_dir OUTPUT_DIR]
# Output path for training; the model checkpoints and label mapping tables are stored here. When serving, this path can be passed as -ner_model_dir.
[-init_checkpoint INIT_CHECKPOINT]
# The initialization checkpoint from the Google BERT model directory, bert_model.ckpt; keep this naming or an error is raised.
[-vocab_file VOCAB_FILE]
# vocab.txt from the Google BERT model directory
[-max_seq_length MAX_SEQ_LENGTH]
[-do_train DO_TRAIN]
[-do_eval DO_EVAL]
[-do_predict DO_PREDICT]
[-batch_size BATCH_SIZE]
# Defaults to 64; adjust for your hardware, since too large a batch size can exhaust memory and abort training.
[-learning_rate LEARNING_RATE]
[-num_train_epochs NUM_TRAIN_EPOCHS]
[-dropout_rate DROPOUT_RATE]
[-clip CLIP]
[-warmup_proportion WARMUP_PROPORTION]
[-lstm_size LSTM_SIZE]
[-num_layers NUM_LAYERS]
[-cell CELL]
[-save_checkpoints_steps SAVE_CHECKPOINTS_STEPS]
[-save_summary_steps SAVE_SUMMARY_STEPS]
[-filter_adam_var FILTER_ADAM_VAR]
[-do_lower_case DO_LOWER_CASE]
[-clean CLEAN]
[-device_map DEVICE_MAP]
# Which GPU to use; defaults to device 0. If that device is already busy you may get errors, so switch to an idle GPU.
[-label_list LABEL_LIST]
# Supply custom labels
[-verbose]
[-ner NER]
[-version]
The actual command used:
bert-base-ner-train \
-data_dir /home/username/pyprojects/BertNer/ChineseNER/data/ChineseNER/ChineseNER-master/data \
-output_dir /home/username/pyprojects/BertNer/ChineseNER/output/CNER \
-init_checkpoint /home/username/pyprojects/BertNer/ChineseNER/googleBERT/chinese_L-12_H-768_A-12/bert_model.ckpt \
-bert_config_file /home/username/pyprojects/BertNer/ChineseNER/googleBERT/chinese_L-12_H-768_A-12/bert_config.json \
-vocab_file /home/username/pyprojects/BertNer/ChineseNER/googleBERT/chinese_L-12_H-768_A-12/vocab.txt \
-device_map 2 \
-batch_size 32
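Training takes a while, so it can be convenient to run it in the background and keep a log (a generic shell pattern, not project-specific; train.log is an arbitrary name and the ... stands for the arguments above):
nohup bert-base-ner-train ... > train.log 2>&1 &
tail -f train.log
After training finishes, the output directory contains the following files: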
checkpoint
eval/
eval.tf_record
events.out.tfevents.1603167974.tianyu_248
graph.pbtxt
label2id.pkl
label_list.pkl
label_test.txt
model.ckpt-5000.data-00000-of-00001
model.ckpt-5000.index
model.ckpt-5000.meta
model.ckpt-5500.data-00000-of-00001
model.ckpt-5500.index
model.ckpt-5500.meta
model.ckpt-6000.data-00000-of-00001
model.ckpt-6000.index
model.ckpt-6000.meta
model.ckpt-6500.data-00000-of-00001
model.ckpt-6500.index
model.ckpt-6500.meta
model.ckpt-6520.data-00000-of-00001
model.ckpt-6520.index
model.ckpt-6520.meta
predict_score.txt
predict.tf_record
token_test.txt
train.tf_record
Of these, label2id.pkl and label_list.pkl are used later when applying the model's results.
The .pb file has to be produced from these output files by a conversion step; see "Getting the pb file from output" below.
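A minimal sketch of recovering the id-to-label mapping from that pickle (assuming label2id.pkl stores a dict from label string to integer id, which is how this project writes it):

import pickle

with open('label2id.pkl', 'rb') as f:
    label2id = pickle.load(f)                    # e.g. {'B-LOC': 2, 'I-LOC': 3, ...}
id2label = {v: k for k, v in label2id.items()}   # invert for decoding predictions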
Training parameter settings:
ARG VALUE
__________________________________________________
batch_size = 32
bert_config_file = /home/username/pyprojects/BertNer/ChineseNER/googleBERT/chinese_L-12_H-768_A-12/bert_config.json
cell = lstm
clean = True
clip = 0.5
data_dir = /home/username/pyprojects/BertNer/ChineseNER/data/ChineseNER/ChineseNER-master/data
device_map = 2
do_eval = True
do_lower_case = True
do_predict = True
do_train = True
dropout_rate = 0.5
filter_adam_var = False
init_checkpoint = /home/username/pyprojects/BertNer/ChineseNER/googleBERT/chinese_L-12_H-768_A-12/bert_model.ckpt
label_list = None
learning_rate = 1e-05
lstm_size = 128
max_seq_length = 128
ner = ner
num_layers = 1
num_train_epochs = 10
output_dir = /home/username/pyprojects/BertNer/ChineseNER/output/CNER
save_checkpoints_steps = 500
save_summary_steps = 500
verbose = False
vocab_file = /home/username/pyprojects/BertNer/ChineseNER/googleBERT/chinese_L-12_H-768_A-12/vocab.txt
warmup_proportion = 0.1
Results on this dataset (training took about 47 minutes):
processed 214541 tokens with 7450 phrases; found: 7744 phrases; correct: 7029.
accuracy: 99.28%; precision: 90.77%; recall: 94.35%; FB1: 92.52
LOC: precision: 93.16%; recall: 94.80%; FB1: 93.98 3525
ORG: precision: 82.44%; recall: 91.27%; FB1: 86.63 2398
PER: precision: 97.09%; recall: 97.14%; FB1: 97.12 1821
Note:
A custom label file needs to be set up for this dataset; mind the label_list parameter.
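For reference, an illustrative label list covering CLUENER2020's ten categories (this is my assumption about the file's contents; check the project's argument help for whether -label_list expects a comma-separated string or one label per line):
B-address,I-address,B-book,I-book,B-company,I-company,B-game,I-game,B-government,I-government,B-movie,I-movie,B-name,I-name,B-organization,I-organization,B-position,I-position,B-scene,I-scene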
Training parameter settings:
ARG VALUE
__________________________________________________
batch_size = 32
bert_config_file = /home/username/pyprojects/BertNer/NERtests/googleBERT/chinese_L-12_H-768_A-12/bert_config.json
cell = lstm
clean = True
clip = 0.5
data_dir = /home/username/pyprojects/BertNer/NERtests/data/cluener2020
device_map = 2
do_eval = True
do_lower_case = True
do_predict = True
do_train = True
dropout_rate = 0.5
filter_adam_var = False
init_checkpoint = /home/username/pyprojects/BertNer/NERtests/googleBERT/chinese_L-12_H-768_A-12/bert_model.ckpt
label_list = /home/username/pyprojects/BertNer/NERtests/data/cluener2020/labels.txt
learning_rate = 1e-05
lstm_size = 128
max_seq_length = 128
ner = ner
num_layers = 1
num_train_epochs = 10
output_dir = /home/username/pyprojects/BertNer/NERtests/output/CLUENER2020
save_checkpoints_steps = 500
save_summary_steps = 500
verbose = False
vocab_file = /home/username/pyprojects/BertNer/NERtests/googleBERT/chinese_L-12_H-768_A-12/vocab.txt
warmup_proportion = 0.1
Results on this dataset (training took about 23 minutes):
processed 60286 tokens with 2698 phrases; found: 4433 phrases; correct: 1630.
accuracy: 91.91%; precision: 36.77%; recall: 60.42%; FB1: 45.72
address: precision: 0.40%; recall: 0.60%; FB1: 0.48 497
book: precision: 37.29%; recall: 72.13%; FB1: 49.16 236
company: precision: 33.64%; recall: 65.58%; FB1: 44.47 657
game: precision: 62.33%; recall: 81.60%; FB1: 70.68 377
government: precision: 54.64%; recall: 70.98%; FB1: 61.75 291
movie: precision: 63.33%; recall: 71.70%; FB1: 67.26 120
name: precision: 47.65%; recall: 83.33%; FB1: 60.63 808
organization: precision: 32.59%; recall: 65.07%; FB1: 43.43 583
position: precision: 39.06%; recall: 68.08%; FB1: 49.64 699
scene: precision: 0.61%; recall: 0.76%; FB1: 0.67 165
Because the two datasets use different label sets, the models cannot be cross-evaluated on each other's data.
Use output2pb.py to obtain ner_model.pb from the output files. It creates a predict_optimizer folder under the current directory (here /home/username/pyprojects/BertNer/ChineseNER/output/CNER), and ner_model.pb is written into that folder.
Note:
When running this step, the built-in conversion had trouble using the server GPU and failed with a "resource exhausted: OOM" error; I'll track down the cause when I have time. For now I wrote my own script, output2pb.py, which does the conversion on the CPU.
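A minimal sketch of such a CPU-side conversion (my own reconstruction, not the project's script: the output node name pred_ids is an assumption, so check graph.pbtxt for the real one, and note the project's own conversion additionally optimizes the frozen graph):

# output2pb.py: freeze the latest checkpoint into ner_model.pb on the CPU.
import os
os.environ['CUDA_VISIBLE_DEVICES'] = ''   # hide all GPUs to dodge the OOM

import tensorflow as tf
from tensorflow.python.framework import graph_util

OUTPUT_DIR = '/home/username/pyprojects/BertNer/ChineseNER/output/CNER'
PB_DIR = os.path.join(OUTPUT_DIR, 'predict_optimizer')
OUTPUT_NODE = 'pred_ids'                  # hypothetical: look it up in graph.pbtxt

ckpt = tf.train.latest_checkpoint(OUTPUT_DIR)   # e.g. model.ckpt-6520
with tf.Session(graph=tf.Graph()) as sess:
    saver = tf.train.import_meta_graph(ckpt + '.meta', clear_devices=True)
    saver.restore(sess, ckpt)
    frozen = graph_util.convert_variables_to_constants(
        sess, sess.graph.as_graph_def(), [OUTPUT_NODE])
    os.makedirs(PB_DIR, exist_ok=True)
    with tf.gfile.GFile(os.path.join(PB_DIR, 'ner_model.pb'), 'wb') as f:
        f.write(frozen.SerializeToString())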
bert-base-serving-start \
-model_dir /home/username/pyprojects/BertNer/ChineseNER/output/CNER \
-bert_model_dir /home/username/pyprojects/BertNer/ChineseNER/googleBERT/chinese_L-12_H-768_A-12 \
-model_pb_dir /home/username/pyprojects/BertNer/ChineseNER/output/CNER/predict_optimizer \
-mode NER
The default service ports are 5555 and 5556; they are also configurable parameters, so adjust the port numbers as needed.
import time
from bert_base.client import BertClient

def ner_test():
    with BertClient(show_server_config=False, check_version=False, check_length=False, mode='NER') as bc:
        start_t = time.perf_counter()
        str1 = '新华社对外发布了中央对雄安新区的指导意见,洋洋洒洒1.2万多字,17次提到北京,4次提到天津,信息量很大,其实也回答了人们关心的很多问题。'
        # rst = bc.encode([list(str1)], is_tokenized=True)
        # str1 = list(str1)
        rst = bc.encode([str1], is_tokenized=True)
        print('rst:', rst)
        print(len(rst[0]))
        print(time.perf_counter() - start_t)

if __name__ == '__main__':
    # class_test()
    ner_test()
Returned result:
[['B-ORG' 'I-ORG' 'I-ORG' 'O' 'O' 'O' 'O' 'O' 'O' 'O'
'O' 'B-LOC' 'I-LOC' 'I-LOC' 'I-LOC' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O'
'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'B-LOC' 'I-LOC' 'O' 'O'
'O' 'O' 'O' 'B-LOC' 'I-LOC' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O'
'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O']]
This output is not very readable, so I tweaked the client script a little:
import time
from bert_base.client import BertClient

def ner_test(str1):
    with BertClient(show_server_config=False, check_version=False, check_length=False, mode='NER') as bc:
        rst = bc.encode([str1])
        return rst

def token_parser(token_res, user_str):
    # Collect consecutive non-O tags into [label, entity_string] pairs.
    # Indices refer to user_str, which must be the cleaned (Chinese-only)
    # string that was sent to the server.
    res_list = []
    count_seq = 0
    curr_seq = []
    curr_label = None
    for ids, item in enumerate(token_res[0]):
        if item == 'O':
            if count_seq != 0:
                res_list.append([curr_label, ''.join(curr_seq)])
                count_seq = 0
                curr_seq = []
        elif '-' in item:
            curr_label = item.split('-')[1]
            curr_seq.append(user_str[ids])
            count_seq += 1
    if count_seq != 0:  # flush an entity that runs to the end of the string
        res_list.append([curr_label, ''.join(curr_seq)])
    return res_list

def is_chinese(uchar):
    return '\u4e00' <= uchar <= '\u9fa5'

def reserve_chinese(content):
    # keep only Chinese characters, dropping punctuation and digits
    return ''.join(i for i in content if is_chinese(i))

if __name__ == '__main__':
    userstr = '新华社对外发布了中央对雄安新区的指导意见,洋洋洒洒1.2万多字,17次提到北京,4次提到天津,信息量很大,其实也回答了人们关心的很多问题。'
    user_strip = reserve_chinese(userstr)
    tokenres = ner_test(user_strip)
    parseres = token_parser(tokenres, user_strip)
    # group the extracted entities by label, e.g. "LOC:雄安新区 北京 天津"
    grouped = {}
    for label, ent in parseres:
        grouped.setdefault(label, []).append(ent)
    for label, ents in grouped.items():
        print('%s:%s' % (label, ' '.join(ents)))
Returned result:
ORG:新华社
LOC:雄安新区 北京 天津
The server apparently strips punctuation and digits before running Chinese NER, so indexing into the original string with the returned positions can go wrong. Hence the adjustment above: the input is reduced to its Chinese characters before being sent.
nvidia-smi
Since this usually runs on a shared server, some devices are often already in use. The project's default device number is 0, and leaving it unconfigured can lead to OOM errors.
So check GPU usage beforehand and train on an idle device.
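A compact way to spot idle devices (standard nvidia-smi query options):
nvidia-smi --query-gpu=index,memory.used,utilization.gpu --format=csv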