PaddleHub 2.0: Extracting Express Waybill Information with the Dynamic-Graph Pretrained Model ERNIE

# Part 1: train

# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import paddlehub as hub
import paddle
from paddlehub.datasets.base_nlp_dataset import SeqLabelingDataset

label_list = ["B-P", "I-P", "B-T", "I-T", "B-A1", "I-A1", "B-A2", "I-A2", "B-A3", "I-A3", "B-A4", "I-A4", "O"]
label_map = {idx: label for idx, label in enumerate(label_list)}

class ExpressNER(SeqLabelingDataset):
    # Directory where the dataset is stored
    base_path = './express_ner'
    # Label list of the dataset
    label_list = label_list
    label_map = {idx: label for idx, label in enumerate(label_list)}
    # Separator used between tokens in the data files
    split_char = '\002'

    def __init__(self, tokenizer, max_seq_len: int = 128, mode: str = 'train'):
        if mode == 'train':
            data_file = 'train.txt'
        elif mode == 'test':
            data_file = 'test.txt'
        else:
            data_file = 'dev.txt'
        super().__init__(
            base_path=self.base_path,
            tokenizer=tokenizer,
            max_seq_len=max_seq_len,
            mode=mode,
            data_file=data_file,
            label_file=None,
            label_list=self.label_list,
            split_char=self.split_char,
            is_file_with_header=True)


if __name__ == '__main__':
    # Select the model
    model = hub.Module(name='ernie_tiny', version='2.0.1', task='token-cls', label_map=label_map)
    tokenizer = model.get_tokenizer()
    # Build the datasets
    train_dataset = ExpressNER(tokenizer=tokenizer, max_seq_len=64, mode='train')
    dev_dataset = ExpressNER(tokenizer=tokenizer, max_seq_len=64, mode='dev')
    test_dataset = ExpressNER(tokenizer=tokenizer, max_seq_len=64, mode='test')
    optimizer = paddle.optimizer.Adam(learning_rate=5e-5, parameters=model.parameters())  # choose and configure the optimizer
    trainer = hub.Trainer(model, optimizer, checkpoint_dir='./ckpt', use_gpu=False)  # runs the fine-tuning task
    trainer.train(train_dataset, epochs=10, batch_size=32, eval_dataset=dev_dataset, save_interval=5)  # configure training, start it, and specify the validation set
    result = trainer.evaluate(test_dataset, batch_size=32)  # evaluate the trained model on the test set


# Part 2: predict

import paddlehub as hub
from paddlehub.datasets.base_nlp_dataset import SeqLabelingDataset

label_list = ["B-P", "I-P", "B-T", "I-T", "B-A1", "I-A1", "B-A2", "I-A2", "B-A3", "I-A3", "B-A4", "I-A4", "O"]

class ExpressNER(SeqLabelingDataset):
    # Directory where the dataset is stored
    base_path = './express_ner'
    # Label list of the dataset
    label_list = label_list
    label_map = {idx: label for idx, label in enumerate(label_list)}
    # Separator used between tokens in the data files
    split_char = '\002'

    def __init__(self, tokenizer, max_seq_len: int = 128, mode: str = 'train'):
        if mode == 'train':
            data_file = 'train.txt'
        elif mode == 'test':
            data_file = 'test.txt'
        else:
            data_file = 'dev.txt'
        super().__init__(
            base_path=self.base_path,
            tokenizer=tokenizer,
            max_seq_len=max_seq_len,
            mode=mode,
            data_file=data_file,
            label_file=None,
            label_list=self.label_list,
            split_char=self.split_char,
            is_file_with_header=True)

def parse_chunk_labels(text, labels):
    labels = labels[1:len(text) + 1]

    chunk_indices = []
    chunk_types = []
    for idx, label in enumerate(labels):
        if label.startswith('B-'):
            chunk_indices.append(idx)
            chunk_types.append(label[2:])

    ret = ''
    idx = 0
    for left, right in zip(chunk_indices, chunk_indices[1:] + [len(text)]):
        ret += text[left:right] + '/' + chunk_types[idx] + ' '
        idx += 1
    return ret

split_char = '\002'
texts = [
    '广西壮族自治区桂林市雁山区雁山镇西龙村老年活动中心17610348888羊卓卫',
    '廖梓琪18514743222湖北省宜昌市长阳土家族自治县贺家坪镇贺家坪村一组临河1号',
    '15712917351宗忆珍山东省青岛市崂山区秦岭路18号'
]
data = [
    [split_char.join(text)] for text in texts
]

model = hub.Module(
    name='ernie_tiny',
    version='2.0.1',
    task='token-cls',
    load_checkpoint='./ckpt/best_model/model.pdparams',
    label_map=ExpressNER.label_map
)

results = model.predict(data, max_seq_len=64, batch_size=32, use_gpu=False)
for idx, text in enumerate(texts):
    labels = results[idx]
    print(parse_chunk_labels(text, labels))

 

# Part 3: walkthrough

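This project demonstrates how to use ERNIE, a pretrained semantic model from PaddleHub, to extract the name, phone number, province, city, district, and detailed address from an express waybill and turn them into structured information. This helps people working in logistics pull out the useful information and lowers the cost of filling in waybills for customers.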

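Be sure to use a GPU environment: the notebook cells in this part pass use_gpu=True. (The standalone scripts in Parts 1 and 2 set use_gpu=False and run on CPU.)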


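I. Introduction
Before 2017, NLP text processing in both industry and academia relied on sequence models, chiefly the recurrent neural network (RNN).

In recent years, as deep learning has advanced, the number of model parameters has grown rapidly, and training them requires larger datasets to avoid overfitting. For most NLP tasks, however, building a large labeled dataset is prohibitively expensive, especially for syntactic and semantic tasks. Large unlabeled corpora, by contrast, are relatively easy to build. Recent research shows that pretrained models (PTMs) trained on large unlabeled corpora learn general-purpose language representations, and that fine-tuning a pretrained model on a downstream task yields strong performance. Pretrained models also avoid training a model from scratch.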

II. Preparation
First install and import the required Python packages.

In [1]
!pip install paddlehub==2.0.0rc0 -i https://pypi.tuna.tsinghua.edu.cn/simple
In [2]
import paddlehub as hub
import paddle
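III. Code steps
Fine-tuning with the PaddleHub Fine-tune API takes four steps:

1. Select the model
2. Load the custom dataset
3. Choose the optimization strategy and run configuration
4. Run fine-tuning and evaluate the model

Step 1: Select the model
NOTE: In named entity recognition, different datasets mark entities with different labels, and the evaluation method differs accordingly. Before initializing the model, first determine the actual label format. The label_list below holds the label classes used by the dataset in this example. If your entity recognition dataset uses a different labeling scheme, define the labels according to your own dataset.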

In [3]
label_list = ["B-P", "I-P", "B-T", "I-T", "B-A1", "I-A1", "B-A2", "I-A2", "B-A3", "I-A3", "B-A4", "I-A4", "O"]
label_map = {idx: label for idx, label in enumerate(label_list)}
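The label set above follows a B-/I- (begin/inside) scheme with one outside tag, O. As a minimal sketch, the same list can be generated from the entity types; the names ENTITY_TYPES, build_bio_labels and label_to_id are illustrative and not part of PaddleHub.

```python
# Entity types that appear in this dataset's tags (taken from label_list above).
ENTITY_TYPES = ("P", "T", "A1", "A2", "A3", "A4")


def build_bio_labels(entity_types):
    """Return a B-/I- tag pair for every entity type, followed by the outside tag "O"."""
    labels = []
    for entity in entity_types:
        labels.append("B-" + entity)  # first character of an entity
        labels.append("I-" + entity)  # any following character of the same entity
    labels.append("O")  # characters that belong to no entity
    return labels


label_list = build_bio_labels(ENTITY_TYPES)
label_map = {idx: label for idx, label in enumerate(label_list)}
# Inverse lookup, handy when converting tag strings back to class ids.
label_to_id = {label: idx for idx, label in label_map.items()}

print(len(label_list))   # 13
print(label_map[0])      # B-P
print(label_to_id["O"])  # 12
```

For a dataset with different entity types, only ENTITY_TYPES would need to change in this sketch.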
In [4]
# Select the model
model = hub.Module(name='ernie', task='token-cls', label_map=label_map)
 

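PaddleHub also offers BERT and other models to choose from. The models that currently support the text classification task, with their loading calls, are:

Model    PaddleHub Module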
ERNIE, Chinese    hub.Module(name='ernie')
ERNIE tiny, Chinese    hub.Module(name='ernie_tiny')
ERNIE 2.0 Base, English    hub.Module(name='ernie_v2_eng_base')
ERNIE 2.0 Large, English    hub.Module(name='ernie_v2_eng_large')
BERT-Base, English Cased    hub.Module(name='bert-base-cased')
BERT-Base, English Uncased    hub.Module(name='bert-base-uncased')
BERT-Large, English Cased    hub.Module(name='bert-large-cased')
BERT-Large, English Uncased    hub.Module(name='bert-large-uncased')
BERT-Base, Multilingual Cased    hub.Module(name='bert-base-multilingual-cased')
BERT-Base, Multilingual Uncased    hub.Module(name='bert-base-multilingual-uncased')
BERT-Base, Chinese    hub.Module(name='bert-base-chinese')
BERT-wwm, Chinese    hub.Module(name='chinese-bert-wwm')
BERT-wwm-ext, Chinese    hub.Module(name='chinese-bert-wwm-ext')
RoBERTa-wwm-ext, Chinese    hub.Module(name='roberta-wwm-ext')
RoBERTa-wwm-ext-large, Chinese    hub.Module(name='roberta-wwm-ext-large')
RBT3, Chinese    hub.Module(name='rbt3')
RBTL3, Chinese    hub.Module(name='rbtl3')
ELECTRA-Small, English    hub.Module(name='electra-small')
ELECTRA-Base, English    hub.Module(name='electra-base')
ELECTRA-Large, English    hub.Module(name='electra-large')
ELECTRA-Base, Chinese    hub.Module(name='chinese-electra-base')
ELECTRA-Small, Chinese    hub.Module(name='chinese-electra-small')
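With the single line of code above, model is initialized as a model for sequence labeling: the pretrained ERNIE model (ERNIE Tiny in the scripts of Parts 1 and 2) followed by a fully connected layer that is shared across all output tokens.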


The figure above is from: https://arxiv.org/pdf/1810.04805.pdf
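To make "a fully connected layer shared across output tokens" concrete, here is a toy sketch in plain Python. It is not PaddleHub's implementation: the function names, the tiny hidden size, and every number are made up for illustration. The point is only that the same weights score every token position.

```python
def linear(vec, weights, bias):
    """Score one token vector against every label: weights[k] . vec + bias[k]."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]


def tag_tokens(token_states, weights, bias, labels):
    """Apply the same linear layer to each token state and pick the highest-scoring label."""
    tags = []
    for vec in token_states:
        logits = linear(vec, weights, bias)
        best = max(range(len(logits)), key=logits.__getitem__)
        tags.append(labels[best])
    return tags


# Toy setup: 3 tokens, hidden size 2, 3 labels (all values are hypothetical).
toy_labels = ["B-P", "I-P", "O"]
toy_states = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
toy_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # one weight row per label
toy_bias = [0.0, 0.0, 0.5]

print(tag_tokens(toy_states, toy_weights, toy_bias, toy_labels))
# ['B-P', 'I-P', 'O']
```

In the real model the token states would come from ERNIE and the label set would be the 13-entry label_list.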

Step 2: Load the custom dataset
In [5]
# Extract the dataset
!tar -zxvf /home/aistudio/data/data16246/express_ner.tar.gz 
express_ner/
express_ner/test.txt
express_ner/dev.txt
express_ner/train.txt
In [6]
# Inspect the data used for prediction
!head -n 3 /home/aistudio/express_ner/test.txt
text_a    label
黑龙江省双鸭山市尖山区八马路与东平行路交叉口北40米韦业涛18600009172    B-A1I-A1I-A1I-A1B-A2I-A2I-A2I-A2B-A3I-A3I-A3B-A4I-A4I-A4I-A4I-A4I-A4I-A4I-A4I-A4I-A4I-A4I-A4I-A4I-A4I-A4B-PI-PI-PB-TI-TI-TI-TI-TI-TI-TI-TI-TI-TI-T
广西壮族自治区桂林市雁山区雁山镇西龙村老年活动中心17610348888羊卓卫    B-A1I-A1I-A1I-A1I-A1I-A1I-A1B-A2I-A2I-A2B-A3I-A3I-A3B-A4I-A4I-A4I-A4I-A4I-A4I-A4I-A4I-A4I-A4I-A4I-A4B-TI-TI-TI-TI-TI-TI-TI-TI-TI-TI-TB-PI-PI-P
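To load a custom sequence labeling dataset, you only need to subclass SeqLabelingDataset. The code below shows how to load a custom dataset into PaddleHub.
For details, see the documentation on loading custom datasets.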
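The sample lines printed above look like unbroken strings because the separator between characters is the non-printing character '\002'. The sketch below shows how such a line could be built and parsed. It assumes the two columns are separated by a tab; the source shows the columns but does not state the column separator. make_line and parse_line are illustrative helpers, not PaddleHub functions.

```python
SPLIT_CHAR = "\002"  # same value as ExpressNER.split_char
COLUMN_SEP = "\t"    # assumption: text and label columns are tab-separated


def make_line(text, labels):
    """Join characters and labels with SPLIT_CHAR and put them in two columns."""
    return SPLIT_CHAR.join(text) + COLUMN_SEP + SPLIT_CHAR.join(labels)


def parse_line(line):
    """Split one data line back into a list of characters and a list of labels."""
    text_field, label_field = line.rstrip("\n").split(COLUMN_SEP)
    chars = text_field.split(SPLIT_CHAR)
    labels = label_field.split(SPLIT_CHAR)
    if len(chars) != len(labels):
        raise ValueError("each character needs exactly one label")
    return chars, labels


# Tail of the first test sample shown above: a name followed by a phone number.
text = "韦业涛18600009172"
labels = ["B-P", "I-P", "I-P", "B-T"] + ["I-T"] * 10

chars, parsed_labels = parse_line(make_line(text, labels))
print(len(chars), parsed_labels[:4])
```

Because is_file_with_header=True is passed to the dataset, the first line of each file (text_a / label) is a header rather than a sample.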

In [7]
from paddlehub.datasets.base_nlp_dataset import SeqLabelingDataset

class ExpressNER(SeqLabelingDataset):
    # Directory where the dataset is stored
    base_path = '/home/aistudio/express_ner'
    # Label list of the dataset
    label_list = label_list
    label_map = {idx: label for idx, label in enumerate(label_list)}
    # Separator used between tokens in the data files
    split_char = '\002'
    
    def __init__(self, tokenizer, max_seq_len: int = 128, mode: str = 'train'):
        if mode == 'train':
            data_file = 'train.txt'
        elif mode == 'test':
            data_file = 'test.txt'
        else:
            data_file = 'dev.txt'
        super().__init__(
                    base_path=self.base_path,
                    tokenizer=tokenizer,
                    max_seq_len=max_seq_len,
                    mode=mode,
                    data_file=data_file,
                    label_file=None,
                    label_list=self.label_list,
                    split_char=self.split_char,
                    is_file_with_header=True)
In [8]
tokenizer = model.get_tokenizer()
# Build the datasets
train_dataset = ExpressNER(tokenizer=tokenizer, max_seq_len=64, mode='train')
dev_dataset = ExpressNER(tokenizer=tokenizer, max_seq_len=64, mode='dev')
test_dataset = ExpressNER(tokenizer=tokenizer, max_seq_len=64, mode='test')
[2021-01-18 19:38:06,835] [    INFO] - Found /home/aistudio/.paddlenlp/models/ernie-1.0/vocab.txt
NOTE: The maximum sequence length max_seq_len is tunable. The recommended value is 128; adjust it to the text length of your task, but do not exceed 512.
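A quick way to sanity-check a chosen max_seq_len is to compare it with the text lengths. The sketch below assumes one token per character and two special tokens per sequence ([CLS] and [SEP]). The source does not spell this out, although parse_chunk_labels skipping the first label is consistent with a leading special token. max_chars and fits are hypothetical helper names.

```python
NUM_SPECIAL_TOKENS = 2  # assumption: one [CLS] at the start and one [SEP] at the end


def max_chars(max_seq_len):
    """Number of text characters that fit once the special tokens are counted."""
    return max_seq_len - NUM_SPECIAL_TOKENS


def fits(text, max_seq_len):
    """True if the text would not be truncated under the assumptions above."""
    return len(text) <= max_chars(max_seq_len)


texts = [
    '广西壮族自治区桂林市雁山区雁山镇西龙村老年活动中心17610348888羊卓卫',
    '廖梓琪18514743222湖北省宜昌市长阳土家族自治县贺家坪镇贺家坪村一组临河1号',
    '15712917351宗忆珍山东省青岛市崂山区秦岭路18号',
]

for text in texts:
    print(len(text), fits(text, max_seq_len=64))
```

If many samples fail this check, raising max_seq_len (up to the 512 limit) is the obvious adjustment, at the cost of more memory per batch.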

Step 3: Choose the optimization strategy and run configuration
In [9]
optimizer = paddle.optimizer.Adam(learning_rate=5e-5, parameters=model.parameters())  # choose and configure the optimizer
trainer = hub.Trainer(model, optimizer, checkpoint_dir='./ckpt', use_gpu=True)        # runs the fine-tuning task
[2021-01-18 19:38:07,669] [ WARNING] - PaddleHub model checkpoint not found, start from scratch...
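Optimization strategy
Paddle 2.0-rc offers a range of optimizers, such as SGD, Adam, and Adamax; see the optimization strategy documentation for details.

This tutorial uses the Adam optimizer, with these parameters:

learning_rate: the global learning rate, 1e-3 by default;
parameters: the model parameters to optimize.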
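To show what learning_rate controls, here is a sketch of a single textbook Adam update for one scalar parameter. It illustrates the published algorithm, not Paddle's implementation, which may differ in details. adam_step and its argument names are chosen for this example; beta1, beta2 and epsilon use the commonly cited defaults.

```python
import math


def adam_step(param, grad, m, v, t, learning_rate=1e-3,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad         # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    param = param - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return param, m, v


# First step with the learning rate used in this tutorial.
param, m, v = adam_step(param=1.0, grad=2.0, m=0.0, v=0.0, t=1, learning_rate=5e-5)
print(1.0 - param)  # close to 5e-5
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by about learning_rate regardless of the gradient's size.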
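Run configuration
Trainer controls the training of the fine-tuning task and is what launches it. It accepts these parameters:

model: the model to optimize;
optimizer: the optimizer to use;
use_gpu: whether to train on GPU;
use_vdl: whether to visualize the training process with vdl (VisualDL);
checkpoint_dir: where model parameters are saved;
compare_metrics: the metric used to decide which model is saved as the best one.
Step 4: Run fine-tuning and evaluate the model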
In [10]
trainer.train(train_dataset, epochs=10, batch_size=32, eval_dataset=dev_dataset, save_interval=5)   # configure training, start it, and specify the validation set
[2021-01-18 19:38:09,224] [   TRAIN] - Epoch=1/10, Step=10/50 loss=0.9344 f1_score=0.0703 lr=0.000050 step/sec=6.46 | ETA 00:01:17
[2021-01-18 19:38:10,538] [   TRAIN] - Epoch=1/10, Step=20/50 loss=0.3404 f1_score=0.5655 lr=0.000050 step/sec=7.61 | ETA 00:01:11
[2021-01-18 19:38:11,851] [   TRAIN] - Epoch=1/10, Step=30/50 loss=0.1145 f1_score=0.8948 lr=0.000050 step/sec=7.61 | ETA 00:01:09
[2021-01-18 19:38:13,161] [   TRAIN] - Epoch=1/10, Step=40/50 loss=0.0449 f1_score=0.9309 lr=0.000050 step/sec=7.63 | ETA 00:01:08
[2021-01-18 19:38:14,466] [   TRAIN] - Epoch=1/10, Step=50/50 loss=0.0240 f1_score=0.9511 lr=0.000050 step/sec=7.66 | ETA 00:01:07
[2021-01-18 19:38:15,816] [   TRAIN] - Epoch=2/10, Step=10/50 loss=0.0122 f1_score=0.9773 lr=0.000050 step/sec=7.41 | ETA 00:01:07
[2021-01-18 19:38:17,128] [   TRAIN] - Epoch=2/10, Step=20/50 loss=0.0107 f1_score=0.9810 lr=0.000050 step/sec=7.62 | ETA 00:01:07
[2021-01-18 19:38:18,446] [   TRAIN] - Epoch=2/10, Step=30/50 loss=0.0071 f1_score=0.9857 lr=0.000050 step/sec=7.59 | ETA 00:01:07
[2021-01-18 19:38:19,763] [   TRAIN] - Epoch=2/10, Step=40/50 loss=0.0087 f1_score=0.9844 lr=0.000050 step/sec=7.59 | ETA 00:01:07
[2021-01-18 19:38:21,078] [   TRAIN] - Epoch=2/10, Step=50/50 loss=0.0126 f1_score=0.9808 lr=0.000050 step/sec=7.60 | ETA 00:01:07
[2021-01-18 19:38:22,435] [   TRAIN] - Epoch=3/10, Step=10/50 loss=0.0028 f1_score=0.9955 lr=0.000050 step/sec=7.37 | ETA 00:01:07
[2021-01-18 19:38:23,750] [   TRAIN] - Epoch=3/10, Step=20/50 loss=0.0038 f1_score=0.9932 lr=0.000050 step/sec=7.61 | ETA 00:01:06
[2021-01-18 19:38:25,062] [   TRAIN] - Epoch=3/10, Step=30/50 loss=0.0056 f1_score=0.9932 lr=0.000050 step/sec=7.62 | ETA 00:01:06
[2021-01-18 19:38:26,378] [   TRAIN] - Epoch=3/10, Step=40/50 loss=0.0045 f1_score=0.9922 lr=0.000050 step/sec=7.60 | ETA 00:01:06
[2021-01-18 19:38:27,693] [   TRAIN] - Epoch=3/10, Step=50/50 loss=0.0049 f1_score=0.9890 lr=0.000050 step/sec=7.60 | ETA 00:01:06
[2021-01-18 19:38:29,051] [   TRAIN] - Epoch=4/10, Step=10/50 loss=0.0030 f1_score=0.9938 lr=0.000050 step/sec=7.37 | ETA 00:01:06
[2021-01-18 19:38:30,365] [   TRAIN] - Epoch=4/10, Step=20/50 loss=0.0051 f1_score=0.9922 lr=0.000050 step/sec=7.61 | ETA 00:01:06
[2021-01-18 19:38:31,678] [   TRAIN] - Epoch=4/10, Step=30/50 loss=0.0029 f1_score=0.9953 lr=0.000050 step/sec=7.62 | ETA 00:01:06
[2021-01-18 19:38:32,992] [   TRAIN] - Epoch=4/10, Step=40/50 loss=0.0032 f1_score=0.9953 lr=0.000050 step/sec=7.61 | ETA 00:01:06
[2021-01-18 19:38:34,301] [   TRAIN] - Epoch=4/10, Step=50/50 loss=0.0040 f1_score=0.9904 lr=0.000050 step/sec=7.64 | ETA 00:01:06
[2021-01-18 19:38:35,653] [   TRAIN] - Epoch=5/10, Step=10/50 loss=0.0024 f1_score=0.9953 lr=0.000050 step/sec=7.40 | ETA 00:01:06
[2021-01-18 19:38:36,965] [   TRAIN] - Epoch=5/10, Step=20/50 loss=0.0026 f1_score=0.9930 lr=0.000050 step/sec=7.62 | ETA 00:01:06
[2021-01-18 19:38:38,286] [   TRAIN] - Epoch=5/10, Step=30/50 loss=0.0034 f1_score=0.9935 lr=0.000050 step/sec=7.57 | ETA 00:01:06
[2021-01-18 19:38:39,606] [   TRAIN] - Epoch=5/10, Step=40/50 loss=0.0032 f1_score=0.9948 lr=0.000050 step/sec=7.57 | ETA 00:01:06
[2021-01-18 19:38:40,920] [   TRAIN] - Epoch=5/10, Step=50/50 loss=0.0029 f1_score=0.9961 lr=0.000050 step/sec=7.61 | ETA 00:01:06
[2021-01-18 19:38:41,309] [    EVAL] - [Evaluation result] avg_f1_score=0.9899
[2021-01-18 19:38:53,283] [    EVAL] - Saving best model to ./ckpt/best_model [best f1_score=0.9899]
[2021-01-18 19:38:53,285] [    INFO] - Saving model checkpoint to ./ckpt/epoch_5
[2021-01-18 19:39:06,531] [   TRAIN] - Epoch=6/10, Step=10/50 loss=0.0039 f1_score=0.9950 lr=0.000050 step/sec=0.39 | ETA 00:01:53
[2021-01-18 19:39:07,844] [   TRAIN] - Epoch=6/10, Step=20/50 loss=0.0013 f1_score=0.9966 lr=0.000050 step/sec=7.62 | ETA 00:01:51
[2021-01-18 19:39:09,162] [   TRAIN] - Epoch=6/10, Step=30/50 loss=0.0017 f1_score=0.9966 lr=0.000050 step/sec=7.59 | ETA 00:01:49
[2021-01-18 19:39:10,479] [   TRAIN] - Epoch=6/10, Step=40/50 loss=0.0014 f1_score=0.9971 lr=0.000050 step/sec=7.59 | ETA 00:01:48
[2021-01-18 19:39:11,786] [   TRAIN] - Epoch=6/10, Step=50/50 loss=0.0007 f1_score=0.9984 lr=0.000050 step/sec=7.65 | ETA 00:01:46
[2021-01-18 19:39:13,148] [   TRAIN] - Epoch=7/10, Step=10/50 loss=0.0008 f1_score=0.9984 lr=0.000050 step/sec=7.34 | ETA 00:01:45
[2021-01-18 19:39:14,457] [   TRAIN] - Epoch=7/10, Step=20/50 loss=0.0019 f1_score=0.9963 lr=0.000050 step/sec=7.64 | ETA 00:01:44
[2021-01-18 19:39:15,768] [   TRAIN] - Epoch=7/10, Step=30/50 loss=0.0026 f1_score=0.9956 lr=0.000050 step/sec=7.62 | ETA 00:01:43
[2021-01-18 19:39:17,079] [   TRAIN] - Epoch=7/10, Step=40/50 loss=0.0023 f1_score=0.9984 lr=0.000050 step/sec=7.63 | ETA 00:01:42
[2021-01-18 19:39:18,406] [   TRAIN] - Epoch=7/10, Step=50/50 loss=0.0012 f1_score=0.9958 lr=0.000050 step/sec=7.54 | ETA 00:01:41
[2021-01-18 19:39:19,767] [   TRAIN] - Epoch=8/10, Step=10/50 loss=0.0012 f1_score=0.9982 lr=0.000050 step/sec=7.35 | ETA 00:01:40
[2021-01-18 19:39:21,091] [   TRAIN] - Epoch=8/10, Step=20/50 loss=0.0011 f1_score=0.9974 lr=0.000050 step/sec=7.55 | ETA 00:01:39
[2021-01-18 19:39:22,406] [   TRAIN] - Epoch=8/10, Step=30/50 loss=0.0020 f1_score=0.9961 lr=0.000050 step/sec=7.60 | ETA 00:01:38
[2021-01-18 19:39:23,718] [   TRAIN] - Epoch=8/10, Step=40/50 loss=0.0020 f1_score=0.9990 lr=0.000050 step/sec=7.62 | ETA 00:01:37
[2021-01-18 19:39:25,029] [   TRAIN] - Epoch=8/10, Step=50/50 loss=0.0010 f1_score=0.9984 lr=0.000050 step/sec=7.63 | ETA 00:01:36
[2021-01-18 19:39:26,386] [   TRAIN] - Epoch=9/10, Step=10/50 loss=0.0017 f1_score=0.9971 lr=0.000050 step/sec=7.37 | ETA 00:01:35
[2021-01-18 19:39:27,708] [   TRAIN] - Epoch=9/10, Step=20/50 loss=0.0006 f1_score=0.9990 lr=0.000050 step/sec=7.56 | ETA 00:01:35
[2021-01-18 19:39:29,035] [   TRAIN] - Epoch=9/10, Step=30/50 loss=0.0017 f1_score=0.9984 lr=0.000050 step/sec=7.53 | ETA 00:01:34
[2021-01-18 19:39:30,356] [   TRAIN] - Epoch=9/10, Step=40/50 loss=0.0004 f1_score=0.9992 lr=0.000050 step/sec=7.58 | ETA 00:01:33
[2021-01-18 19:39:31,673] [   TRAIN] - Epoch=9/10, Step=50/50 loss=0.0008 f1_score=0.9979 lr=0.000050 step/sec=7.59 | ETA 00:01:33
[2021-01-18 19:39:33,028] [   TRAIN] - Epoch=10/10, Step=10/50 loss=0.0004 f1_score=0.9992 lr=0.000050 step/sec=7.38 | ETA 00:01:32
[2021-01-18 19:39:34,343] [   TRAIN] - Epoch=10/10, Step=20/50 loss=0.0005 f1_score=0.9992 lr=0.000050 step/sec=7.60 | ETA 00:01:32
[2021-01-18 19:39:35,659] [   TRAIN] - Epoch=10/10, Step=30/50 loss=0.0005 f1_score=0.9992 lr=0.000050 step/sec=7.60 | ETA 00:01:31
[2021-01-18 19:39:36,971] [   TRAIN] - Epoch=10/10, Step=40/50 loss=0.0010 f1_score=0.9963 lr=0.000050 step/sec=7.62 | ETA 00:01:31
[2021-01-18 19:39:38,289] [   TRAIN] - Epoch=10/10, Step=50/50 loss=0.0026 f1_score=0.9945 lr=0.000050 step/sec=7.59 | ETA 00:01:30
[2021-01-18 19:39:38,675] [    EVAL] - [Evaluation result] avg_f1_score=0.9958
[2021-01-18 19:39:51,511] [    EVAL] - Saving best model to ./ckpt/best_model [best f1_score=0.9958]
[2021-01-18 19:39:51,514] [    INFO] - Saving model checkpoint to ./ckpt/epoch_10
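trainer.train controls the training process itself and accepts these parameters:

train_dataset: the dataset used for training;
epochs: number of training epochs;
batch_size: training batch size; when using a GPU, adjust batch_size to fit your hardware;
num_workers: number of workers, 0 by default;
eval_dataset: the validation set;
log_interval: logging interval, in training batches (steps);
save_interval: model-saving interval, in training epochs.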
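The log above can be read against these parameters. The sketch below reproduces the schedule visible in that log: 50 steps per epoch logged every 10 steps, and evaluation and saving at epochs 5 and 10. It mirrors the observed behavior only. The helper names are invented, the log interval of 10 is inferred from the output, and none of this is Trainer's actual code.

```python
def save_epochs(epochs, save_interval):
    """Epochs at which evaluation and checkpoint saving happen."""
    return [e for e in range(1, epochs + 1) if e % save_interval == 0]


def log_steps(steps_per_epoch, log_interval):
    """Steps within an epoch at which a TRAIN line is printed."""
    return [s for s in range(1, steps_per_epoch + 1) if s % log_interval == 0]


def track_best(eval_scores):
    """Return the best score and the epochs at which a new best model is saved."""
    best, saved_at = None, []
    for epoch, score in eval_scores:
        if best is None or score > best:
            best = score
            saved_at.append(epoch)
    return best, saved_at


print(save_epochs(epochs=10, save_interval=5))          # [5, 10]
print(log_steps(steps_per_epoch=50, log_interval=10))   # [10, 20, 30, 40, 50]
# Validation F1 values taken from the log above.
print(track_best([(5, 0.9899), (10, 0.9958)]))          # (0.9958, [5, 10])
```

Both evaluations improved on the previous best, which is why the log shows "Saving best model" twice.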
In [11]
result = trainer.evaluate(test_dataset, batch_size=32)    # evaluate the trained model on the test set
[2021-01-18 19:40:03,768] [    EVAL] - [Evaluation result] avg_f1_score=0.9833
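The log reports f1_score without saying how it is computed. For named entity recognition, F1 is commonly measured over whole entities (chunks) rather than single tags. The sketch below shows that common definition, on the assumption that the reported metric is of this kind. extract_chunks and chunk_f1 are illustrative names, and the example tags are hypothetical.

```python
def extract_chunks(labels):
    """Return the set of (start, end, type) spans in a BIO label sequence."""
    chunks = set()
    start, kind = None, None
    for idx, label in enumerate(list(labels) + ["O"]):  # sentinel closes the last chunk
        if label.startswith("I-") and label[2:] == kind:
            continue  # still inside the current chunk
        if kind is not None:
            chunks.add((start, idx, kind))
        if label.startswith("B-"):
            start, kind = idx, label[2:]
        else:
            start, kind = None, None
    return chunks


def chunk_f1(gold, pred):
    """Entity-level F1: a predicted chunk counts only if span and type both match."""
    gold_chunks, pred_chunks = extract_chunks(gold), extract_chunks(pred)
    correct = len(gold_chunks & pred_chunks)
    precision = correct / len(pred_chunks) if pred_chunks else 0.0
    recall = correct / len(gold_chunks) if gold_chunks else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = ["B-P", "I-P", "I-P", "B-T", "I-T", "O"]
pred = ["B-P", "I-P", "I-P", "B-T", "O", "O"]  # phone number cut one character short
print(chunk_f1(gold, pred))  # 0.5
```

Under this definition a partially correct entity earns no credit, so a single wrong boundary costs a full chunk.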
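IV. Predicting with the model
Once fine-tuning is complete, we load the best model saved during training and use it for prediction.

First add a helper, parse_chunk_labels, that parses the entity labels so the predictions are easier to read.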

In [12]
def parse_chunk_labels(text, labels):
    labels = labels[1:len(text)+1]
    
    chunk_indices = []
    chunk_types = []
    for idx, label in enumerate(labels):
        if label.startswith('B-'):
            chunk_indices.append(idx)
            chunk_types.append(label[2:])

    ret = ''
    idx = 0
    for left, right in zip(chunk_indices, chunk_indices[1:]+[len(text)]):
        ret += text[left:right] + '/' + chunk_types[idx] + ' '
        idx += 1
    return ret
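parse_chunk_labels cuts the text only at B- tags, so any characters tagged O are attached to the preceding chunk. That works for the waybill data, where every character belongs to some entity. For inputs that do contain O characters, a stricter variant could look like the sketch below. parse_chunks_strict is a hypothetical helper written for this example, and the sample input is made up.

```python
def parse_chunks_strict(text, labels):
    """Return (substring, type) pairs, ending each chunk at the first non-matching tag."""
    labels = labels[1:len(text) + 1]  # skip the label of the leading special token
    chunks = []
    start, kind = None, None
    for idx, label in enumerate(list(labels) + ["O"]):  # sentinel closes the last chunk
        if label.startswith("I-") and label[2:] == kind:
            continue  # still inside the current chunk
        if kind is not None:
            chunks.append((text[start:idx], kind))
        if label.startswith("B-"):
            start, kind = idx, label[2:]
        else:
            start, kind = None, None
    return chunks


# Hypothetical input: a comma tagged O sits between the name and the phone number.
text = "宗忆珍,15712917351"
labels = ["O", "B-P", "I-P", "I-P", "O", "B-T"] + ["I-T"] * 10 + ["O"]

print(parse_chunks_strict(text, labels))
# [('宗忆珍', 'P'), ('15712917351', 'T')]
```

On the same input, parse_chunk_labels would return '宗忆珍,/P 15712917351/T ', with the comma folded into the name.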
The code below loads the model, runs prediction on samples from the test set, and prints the results. The complete prediction code is:

In [14]
split_char = '\002'
texts = [
    '广西壮族自治区桂林市雁山区雁山镇西龙村老年活动中心17610348888羊卓卫',
    '廖梓琪18514743222湖北省宜昌市长阳土家族自治县贺家坪镇贺家坪村一组临河1号',
    '15712917351宗忆珍山东省青岛市崂山区秦岭路18号'
]
data = [
    [split_char.join(text)] for text in texts
]

model = hub.Module(
    name='ernie', 
    task='token-cls', 
    load_checkpoint='./ckpt/best_model/model.pdparams', 
    label_map=ExpressNER.label_map
)

results = model.predict(data, max_seq_len=64, batch_size=32, use_gpu=True)
for idx, text in enumerate(texts):
    labels = results[idx]
    print(parse_chunk_labels(text, labels))
[2021-01-18 19:40:03,834] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-1.0/ernie_v1_chn_base.pdparams
[2021-01-18 19:40:08,584] [    INFO] - Loaded parameters from /home/aistudio/ckpt/best_model/model.pdparams
[2021-01-18 19:40:08,683] [    INFO] - Found /home/aistudio/.paddlenlp/models/ernie-1.0/vocab.txt
广西壮族自治区/A1 桂林市/A2 雁山区/A3 雁山镇西龙村老年活动中心/A4 17610348888/T 羊卓卫/P 
廖梓琪/P 18514743222/T 湖北省/A1 宜昌市/A2 长阳土家族自治县/A3 贺家坪镇贺家坪村一组临河1号/A4 
15712917351/T 宗忆珍/P 山东省/A1 青岛市/A2 崂山区/A3 秦岭路18号/A4 
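Since the goal is structured information, the printed strings can be turned into records. The sketch below assumes the tag meanings P = name, T = phone, A1 = province, A2 = city, A3 = district, A4 = detailed address. This mapping is inferred from the predictions above and is not stated in the source. FIELD_NAMES and to_record are illustrative names.

```python
# Assumed meaning of each tag, inferred from the printed predictions.
FIELD_NAMES = {
    "P": "name",
    "T": "phone",
    "A1": "province",
    "A2": "city",
    "A3": "district",
    "A4": "address",
}


def to_record(parsed):
    """Convert 'chunk/TAG chunk/TAG ...' output into a field-name -> value dict."""
    record = {}
    for item in parsed.split():  # assumes chunks contain no whitespace
        chunk, _, tag = item.rpartition("/")
        record[FIELD_NAMES.get(tag, tag)] = chunk
    return record


line = '15712917351/T 宗忆珍/P 山东省/A1 青岛市/A2 崂山区/A3 秦岭路18号/A4 '
record = to_record(line)
print(record["name"], record["phone"], record["city"])
# 宗忆珍 15712917351 青岛市
```

If the same tag appears twice in one line, this sketch keeps only the last value; a production version would need to decide how to handle that case.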
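V. New tutorial series for the PaddleHub 2.0 dynamic-graph version
PaddleHub has been upgraded to the 2.0 dynamic-graph version, and the tutorials have been updated to match. Tutorials for other CV and NLP tasks are listed below:

CV
PaddleHub 2.0: image classification training and prediction
PaddleHub 2.0: image style transfer training and prediction
PaddleHub 2.0: fine-tuning the pretrained colorization model

NLP

Text classification:

- PaddleHub 2.0: text classification with the dynamic-graph pretrained model ERNIE
- PaddleHub 2.0: text classification with the dynamic-graph pretrained model ERNIE Tiny
- PaddleHub 2.0: news text classification with the dynamic-graph pretrained model ERNIE

Sequence labeling:

- PaddleHub 2.0: sequence labeling with the dynamic-graph pretrained model ERNIE
- PaddleHub 2.0: express waybill information extraction with the dynamic-graph pretrained model ERNIE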

 
