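Couplets (对联) are part of traditional Han Chinese culture: paired, parallel lines written on paper or cloth, or carved into bamboo, wood, or pillars. The two lines mirror each other word for word and follow complementary tonal patterns. The form depends on the one-character-one-syllable nature of Chinese and is a treasure of traditional Chinese culture.
Here we automatically write the lower line (下联) of a couplet given its upper line (上联). This is a typical sequence-to-sequence (sequence2sequence, seq2seq) modeling task. The encoder-decoder (Encoder-Decoder) framework is the classic approach to seq2seq problems: it converts a source sequence of arbitrary length into a target sequence of arbitrary length. The encoding stage encodes the whole source sequence into a vector, and the decoding stage decodes the whole target sequence from it by maximizing the probability of the predicted sequence. Both stages are usually implemented with RNNs.
Here the Encoder is an LSTM, and the Decoder is an LSTM with an attention mechanism.
We train the model with the upper line of each couplet as the Encoder input. The lower line is fed to the Decoder, shifted by one position, and also serves as the target that the Decoder learns to output.
The AI Studio platform will later install PaddleNLP by default; until then, it can be installed with the following command.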
# !pip install --upgrade paddlenlp>=2.0.0b -i https://pypi.org/simple
!pip install --upgrade paddlenlp
Looking in indexes: https://mirror.baidu.com/pypi/simple/
Requirement already up-to-date: paddlenlp in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (2.0.0rc1)
import paddlenlp
paddlenlp.__version__
'2.0.0rc1'
import io
import os
from functools import partial
import numpy as np
import paddle
import paddle.nn as nn
import paddle.nn.functional as F
from paddlenlp.data import Vocab, Pad
from paddlenlp.metrics import Perplexity
from paddlenlp.datasets import CoupletDataset
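We use the open-source couplet dataset couplet-clean-dataset, which filters vulgar and sensitive content out of couplet-dataset.
The dataset contains more than 700,000 training samples, 1,000 validation samples, and about 1,000 test samples (999 in the version loaded below).
Some example couplets from the training set:
Upper line: 晚风摇树树还挺   Lower line: 晨露润花花更红
Upper line: 愿景天成无墨迹   Lower line: 万方乐奏有于阗
Upper line: 丹枫江冷人初去   Lower line: 绿柳堤新燕复来
Upper line: 闲来野钓人稀处   Lower line: 兴起高歌酒醉中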
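paddlenlp.datasets has many common datasets built in, including the couplet dataset CoupletDataset used here.
Every dataset in paddlenlp.datasets inherits from paddle.io.Dataset and supports all of its functionality:
- len(): returns the length of the dataset, i.e. the number of samples.
In addition, the datasets in paddlenlp.datasets support the following operations:
- get_datasets(): takes a list or a string and returns the corresponding train_dataset, development_dataset, test_dataset, and so on. train is the training set, used to fit the model; development is the development set, also called the validation set (validation_dataset), used to tune model parameters; test is the test set, used to evaluate performance, and the model or its parameters are not adjusted further based on test results.
- apply(): applies a given operation to the dataset.
CoupletDataset inherits from TranslationDataset, which is also part of paddlenlp.datasets. Beyond the general usage above, it has some specifics of its own:
- The CoupletDataset class also defines a transform function, which adds a start token (CoupletDataset.BOS_TOKEN) before and an end token (CoupletDataset.EOS_TOKEN) after each sentence and maps the raw data to a sequence of ids.
train_ds, dev_ds, test_ds = CoupletDataset.get_datasets(['train', 'dev', 'test'])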
100%|██████████| 21421/21421 [00:00<00:00, 23224.53it/s]
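As a rough illustration of what such a transform does, the sketch below wraps a sentence in start and end markers and maps each character to an id. The toy vocabulary, the marker strings, and the helper name toy_transform are assumptions made for this example; they are not the actual CoupletDataset implementation or its vocabulary.

```python
# Hypothetical toy vocabulary; the real one comes from CoupletDataset.get_vocab().
BOS, EOS, UNK = "<s>", "</s>", "<unk>"
toy_vocab = {UNK: 0, BOS: 1, EOS: 2, "晚": 3, "风": 4, "摇": 5, "树": 6}

def toy_transform(sentence, vocab):
    """Add start/end markers and map every character to its id."""
    tokens = [BOS] + list(sentence) + [EOS]
    # Characters missing from the vocabulary fall back to the unknown id.
    return [vocab.get(tok, vocab[UNK]) for tok in tokens]

ids = toy_transform("晚风摇树树", toy_vocab)
print(ids)  # [1, 3, 4, 5, 6, 6, 2]
```

The leading 1 and trailing 2 play the same role as the start and end ids visible in the dataset samples printed further down.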
Let's see how large the dataset is and what the samples look like:
print(len(train_ds), len(test_ds), len(dev_ds))
for i in range(5):
    print(train_ds[i])
print('\n')
for i in range(5):
    print(test_ds[i])
702594 999 1000
([1, 447, 3, 509, 153, 153, 279, 1517, 2], [1, 816, 294, 378, 9, 9, 142, 32, 2])
([1, 594, 185, 10, 71, 18, 158, 912, 2], [1, 14, 105, 107, 835, 20, 268, 3855, 2])
([1, 335, 830, 68, 425, 4, 482, 246, 2], [1, 94, 51, 1115, 23, 141, 761, 17, 2])
([1, 126, 17, 217, 802, 4, 1103, 118, 2], [1, 125, 205, 47, 55, 57, 78, 15, 2])
([1, 1203, 228, 390, 10, 1921, 827, 474, 2], [1, 1699, 89, 426, 317, 314, 43, 374, 2])
([1, 6, 201, 350, 54, 1156, 2], [1, 64, 522, 305, 543, 102, 2])
([1, 168, 1402, 61, 270, 11, 195, 253, 2], [1, 435, 782, 1046, 36, 188, 1016, 56, 2])
([1, 744, 185, 744, 6, 18, 452, 16, 1410, 2], [1, 286, 102, 286, 74, 20, 669, 280, 261, 2])
([1, 2577, 496, 1133, 60, 107, 2], [1, 1533, 318, 625, 1401, 172, 2])
([1, 163, 261, 6, 64, 116, 350, 253, 2], [1, 96, 579, 13, 463, 16, 774, 586, 2])
vocab, _ = CoupletDataset.get_vocab()
trg_idx2word = vocab.idx_to_token
vocab_size = len(vocab)
# Padding reuses the EOS token id, so pad_id and eos_id are the same value.
pad_id = vocab[CoupletDataset.EOS_TOKEN]
bos_id = vocab[CoupletDataset.BOS_TOKEN]
eos_id = vocab[CoupletDataset.EOS_TOKEN]
print (pad_id, bos_id, eos_id)
2 1 2
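We use paddle.io.DataLoader to create the DataLoader objects needed for training and prediction.
paddle.io.DataLoader returns an iterator that yields data from the dataset in the order specified by batch_sampler. It supports loading data with a single process or with multiple processes, which keeps loading fast.
It takes the following important parameters:
- batch_sampler: a batch sampler instance that iteratively yields, inside paddle.io.DataLoader, the array of sample indices for each mini-batch; the array length equals batch_size.
- collate_fn: specifies how a list of samples is combined into mini-batch data. It must be a callable that implements the batch-assembly logic and returns the data for each batch. Here we pass the prepare_input function, which pads the data and returns the actual lengths, among other things.
PaddleNLP provides many APIs for data processing and batch assembly in NLP tasks.
API | Description |
---|---|
paddlenlp.data.Stack | Stacks N inputs with the same shape to build a batch |
paddlenlp.data.Pad | Pads sentences of different lengths to a common length, the maximum length among the N inputs |
paddlenlp.data.Tuple | Wraps several batchify functions together |
For more data processing operations, see: https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/data.md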
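To make the padding step concrete, here is a minimal pure-Python sketch of what a Pad-style batchify function does when it is also asked for the true lengths. pad_batch is a hypothetical helper written for this illustration; it is not the paddlenlp.data.Pad implementation, which returns arrays rather than lists.

```python
def pad_batch(sequences, pad_val):
    """Right-pad every sequence to the longest one and return the true lengths."""
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_val] * (max_len - len(seq)) for seq in sequences]
    lengths = [len(seq) for seq in sequences]
    return padded, lengths

batch = [[1, 6, 201, 2], [1, 168, 1402, 61, 270, 2]]
padded, lengths = pad_batch(batch, pad_val=2)
print(padded)   # [[1, 6, 201, 2, 2, 2], [1, 168, 1402, 61, 270, 2]]
print(lengths)  # [4, 6]
```

Returning the lengths alongside the padded batch is what lets an RNN ignore the padded positions later on.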
def create_data_loader(dataset):
    # batch_size is the global hyperparameter defined below.
    data_loader = paddle.io.DataLoader(
        dataset,
        batch_sampler=None,
        batch_size=batch_size,
        collate_fn=partial(prepare_input, pad_id=pad_id))
    return data_loader

def prepare_input(insts, pad_id):
    # Pad source and target to the longest sequence in the batch.
    src, src_length = Pad(pad_val=pad_id, ret_length=True)([inst[0] for inst in insts])
    tgt, tgt_length = Pad(pad_val=pad_id, ret_length=True)([inst[1] for inst in insts])
    # 1.0 for real tokens and 0.0 for padding in the decoder input.
    tgt_mask = (tgt[:, :-1] != pad_id).astype(paddle.get_default_dtype())
    # The decoder input drops the last token; the label drops the first one.
    return src, src_length, tgt[:, :-1], tgt[:, 1:, np.newaxis], tgt_mask
use_gpu = False
device = paddle.set_device("gpu" if use_gpu else "cpu")
batch_size = 128
num_layers = 2
dropout = 0.2
hidden_size =256
max_grad_norm = 5.0
learning_rate = 0.001
max_epoch = 20
model_path = './couplet_models'
log_freq = 200
# Define dataloader
train_loader = create_data_loader(train_ds)
test_loader = create_data_loader(test_ds)
print(len(train_ds), len(train_loader), batch_size)
# 702594 5490 128: 5490 batches in total
for i in train_loader:
    print(len(i))
    for ind, each in enumerate(i):
        print(ind, each.shape)
    break
702594 5490 128
5
0 [128, 18]
1 [128]
2 [128, 17]
3 [128, 17, 1]
4 [128, 17]
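The five arrays printed above are, in order, the source ids, the source lengths, the decoder input, the label, and the target mask. The decoder input and the label are the same padded target shifted by one position, which is why their time dimension is one step shorter than the padded sequence. The sketch below reproduces that shift on plain lists; shift_and_mask and the toy ids are assumptions for illustration, not part of the notebook's code.

```python
PAD_ID = 2  # assumption for this sketch: padding reuses the end-token id

def shift_and_mask(padded_tgt, pad_id):
    """Split padded targets into decoder inputs, labels, and a loss mask."""
    dec_in = [row[:-1] for row in padded_tgt]  # drop the last token
    label = [row[1:] for row in padded_tgt]    # drop the first token
    # The mask is computed on the decoder input, as in prepare_input.
    mask = [[1.0 if tok != pad_id else 0.0 for tok in row] for row in dec_in]
    return dec_in, label, mask

padded_tgt = [[1, 64, 522, 2, 2, 2], [1, 435, 782, 1046, 36, 2]]
dec_in, label, mask = shift_and_mask(padded_tgt, PAD_ID)
print(dec_in[0])  # [1, 64, 522, 2, 2]
print(label[0])   # [64, 522, 2, 2, 2]
print(mask[0])    # [1.0, 1.0, 1.0, 0.0, 0.0]
```

In the first row the three unmasked positions are trained to predict 64, 522, and the end token, so the model still learns where to stop even though padding shares the same id.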
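[Figure: structure of the Seq2Seq model with attention.] Below we define each part of the network in turn and then assemble the main Seq2Seq network.
The Encoder is very simple: it can use nn.LSTM directly, from the RNN family of APIs provided by PaddlePaddle 2.0.
- nn.Embedding: builds a callable Embedding object. Given the size (vocab_size, embedding_dim), it constructs a two-dimensional embedding matrix used for table lookup, where each input id selects the corresponding row of the matrix.
- nn.LSTM: given a sequence, returns encoder_output and encoder_state. See https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/nn/layer/rnn/LSTM_cn.html
Outputs:
- outputs (Tensor): the output, formed by concatenating the outputs of the forward and backward cells. If time_major is True, the shape is [time_steps, batch_size, num_directions * hidden_size]; if time_major is False, the shape is [batch_size, time_steps, num_directions * hidden_size]. num_directions is 2 when direction is set to bidirectional and 1 otherwise.
- final_states (tuple): the final states, a tuple containing h and c. Each has shape [num_layers * num_directions, batch_size, hidden_size], where num_directions is 2 when direction is set to bidirectional and 1 otherwise.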
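The shape rules above can be checked with a few lines of arithmetic. The helper below is a hypothetical function written for this illustration (it does not call PaddlePaddle); it simply encodes the rules as stated and evaluates them for the settings used in this tutorial.

```python
def lstm_shapes(batch_size, time_steps, hidden_size, num_layers,
                bidirectional=False, time_major=False):
    """Return (outputs_shape, state_shape) following the description above."""
    num_directions = 2 if bidirectional else 1
    if time_major:
        outputs = [time_steps, batch_size, num_directions * hidden_size]
    else:
        outputs = [batch_size, time_steps, num_directions * hidden_size]
    # h and c each have this shape.
    state = [num_layers * num_directions, batch_size, hidden_size]
    return outputs, state

print(lstm_shapes(128, 18, 256, num_layers=2))
# ([128, 18, 256], [2, 128, 256])
```

These are the same shapes that appear in the comments of the Encoder code for a batch of 128 sequences padded to 18 steps.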
class Seq2SeqEncoder(nn.Layer):
    def __init__(self, vocab_size, embed_dim, hidden_size, num_layers):
        super(Seq2SeqEncoder, self).__init__()
        self.embedder = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(
            input_size=embed_dim,
            hidden_size=hidden_size,
            num_layers=num_layers,
            dropout=0.2 if num_layers > 1 else 0.)

    def forward(self, sequence, sequence_length):
        inputs = self.embedder(sequence)
        encoder_output, encoder_state = self.lstm(
            inputs, sequence_length=sequence_length)
        # encoder_output: [128, 18, 256], i.e. [batch_size, time_steps, hidden_size]
        # encoder_state: tuple (h, c) of final states, each [2, 128, 256],
        # i.e. [num_layers * num_directions, batch_size, hidden_size]
        return encoder_output, encoder_state
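- nn.Linear: a linear transformation layer; it takes two arguments, the input and output feature sizes.
- paddle.matmul: computes the product of two Tensors. It follows the full broadcasting rules, and its behavior matches numpy.matmul.
- paddle.unsqueeze: inserts dimensions of size 1 at one or more positions (axis) in the shape of the input Tensor.
- paddle.add: element-wise addition. Input x and input y are added element by element, and the result for each position is stored in the returned Tensor. x and y must be broadcastable to the same shape.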
class AttentionLayer(nn.Layer):
    def __init__(self, hidden_size):
        super(AttentionLayer, self).__init__()
        self.input_proj = nn.Linear(hidden_size, hidden_size)
        self.output_proj = nn.Linear(hidden_size + hidden_size, hidden_size)

    def forward(self, hidden, encoder_output, encoder_padding_mask):
        encoder_output = self.input_proj(encoder_output)
        attn_scores = paddle.matmul(
            paddle.unsqueeze(hidden, [1]), encoder_output, transpose_y=True)
        # print('attention score', attn_scores.shape)  # [128, 1, 18]
        if encoder_padding_mask is not None:
            attn_scores = paddle.add(attn_scores, encoder_padding_mask)
        attn_scores = F.softmax(attn_scores)
        attn_out = paddle.squeeze(
            paddle.matmul(attn_scores, encoder_output), [1])
        # print('1 attn_out', attn_out.shape)  # [128, 256]
        attn_out = paddle.concat([attn_out, hidden], 1)
        # print('2 attn_out', attn_out.shape)  # [128, 512]
        attn_out = self.output_proj(attn_out)
        # print('3 attn_out', attn_out.shape)  # [128, 256]
        return attn_out
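Stripped of its two learned projections, the core of this layer is dot-product attention with an additive mask. The sketch below works on plain Python lists for a single sample and a single decoder step; the function names and toy numbers are assumptions for illustration, and it is a simplified view rather than a replacement for AttentionLayer.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(hidden, encoder_output, padding_mask):
    """Dot-product attention for one decoder step and one sample.

    hidden: [hidden_size]; encoder_output: [time_steps][hidden_size];
    padding_mask: [time_steps], 0.0 for real tokens and -1e9 for padding.
    """
    scores = [sum(h * e for h, e in zip(hidden, enc)) + m
              for enc, m in zip(encoder_output, padding_mask)]
    weights = softmax(scores)
    # The context vector is the attention-weighted sum of encoder outputs.
    context = [sum(w * enc[d] for w, enc in zip(weights, encoder_output))
               for d in range(len(hidden))]
    return weights, context

hidden = [1.0, 0.0]
encoder_output = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]  # last step is padding
weights, context = attend(hidden, encoder_output, [0.0, 0.0, -1e9])
print([round(w, 3) for w in weights])  # [0.731, 0.269, 0.0]
```

Adding a large negative number before the softmax drives the padded position's weight to zero, even though its raw score would otherwise have been the largest.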
Because the Decoder is an LSTM with attention, we cannot reuse nn.LSTM and need to define a Seq2SeqDecoderCell.
- nn.LayerList: holds a list of sublayers; the sublayers it contains are properly registered and added, and they can be indexed like a regular Python list. Here it holds num_layers=2 LSTM cells.
class Seq2SeqDecoderCell(nn.RNNCellBase):
    def __init__(self, num_layers, input_size, hidden_size):
        super(Seq2SeqDecoderCell, self).__init__()
        self.dropout = nn.Dropout(0.2)
        # The first cell also receives the previous attention output (input feeding).
        self.lstm_cells = nn.LayerList([
            nn.LSTMCell(
                input_size=input_size + hidden_size if i == 0 else hidden_size,
                hidden_size=hidden_size) for i in range(num_layers)
        ])
        self.attention_layer = AttentionLayer(hidden_size)

    def forward(self,
                step_input,
                states,
                encoder_output,
                encoder_padding_mask=None):
        lstm_states, input_feed = states
        new_lstm_states = []
        step_input = paddle.concat([step_input, input_feed], 1)
        for i, lstm_cell in enumerate(self.lstm_cells):
            out, new_lstm_state = lstm_cell(step_input, lstm_states[i])
            step_input = self.dropout(out)
            new_lstm_states.append(new_lstm_state)
        out = self.attention_layer(step_input, encoder_output,
                                   encoder_padding_mask)
        return out, [new_lstm_states, out]
With Seq2SeqDecoderCell in place, we can build Seq2SeqDecoder.
- paddle.nn.RNN: a wrapper for recurrent neural networks (RNN) that turns the given cell into a recurrent network. It calls cell.forward() repeatedly until every Tensor in the input has been traversed.
class Seq2SeqDecoder(nn.Layer):
    def __init__(self, vocab_size, embed_dim, hidden_size, num_layers):
        super(Seq2SeqDecoder, self).__init__()
        self.embedder = nn.Embedding(vocab_size, embed_dim)
        self.lstm_attention = nn.RNN(
            Seq2SeqDecoderCell(num_layers, embed_dim, hidden_size))
        self.output_layer = nn.Linear(hidden_size, vocab_size)

    def forward(self, trg, decoder_initial_states, encoder_output,
                encoder_padding_mask):
        inputs = self.embedder(trg)
        decoder_output, _ = self.lstm_attention(
            inputs,
            initial_states=decoder_initial_states,
            encoder_output=encoder_output,
            encoder_padding_mask=encoder_padding_mask)
        predict = self.output_layer(decoder_output)
        return predict
Once the Encoder and Decoder are defined, the network can be assembled.
class Seq2SeqAttnModel(nn.Layer):
    def __init__(self, vocab_size, embed_dim, hidden_size, num_layers,
                 eos_id=1):
        super(Seq2SeqAttnModel, self).__init__()
        self.hidden_size = hidden_size
        self.eos_id = eos_id
        self.num_layers = num_layers
        self.INF = 1e9
        self.encoder = Seq2SeqEncoder(vocab_size, embed_dim, hidden_size,
                                      num_layers)
        self.decoder = Seq2SeqDecoder(vocab_size, embed_dim, hidden_size,
                                      num_layers)

    def forward(self, src, src_length, trg):
        # encoder_output: the output h at every time step
        # encoder_final_state: the final output h and the cell state c
        encoder_output, encoder_final_state = self.encoder(src, src_length)
        print('encoder_output shape', encoder_output.shape)  # [128, 18, 256], i.e. [batch_size, time_steps, hidden_size]
        print('encoder_final_states shape', encoder_final_state[0].shape, encoder_final_state[1].shape)  # [2, 128, 256] [2, 128, 256], i.e. [num_layers * num_directions, batch_size, hidden_size]

        # Regroup the final states into a list of num_layers (h, c) pairs,
        # each tensor of shape [batch_size, hidden_size]
        encoder_final_states = [
            (encoder_final_state[0][i], encoder_final_state[1][i])
            for i in range(self.num_layers)
        ]
        print('encoder_final_states shape', encoder_final_states[0][0].shape, encoder_final_states[0][1].shape)  # [128, 256] [128, 256]

        # Construct decoder initial states: use input_feed and the shape is
        # [[h,c] * num_layers, input_feed], consistent with Seq2SeqDecoderCell.states
        decoder_initial_states = [
            encoder_final_states,
            self.decoder.lstm_attention.cell.get_initial_states(
                batch_ref=encoder_output, shape=[self.hidden_size])
        ]

        # Build attention mask to avoid paying attention to paddings
        src_mask = (src != self.eos_id).astype(paddle.get_default_dtype())
        print('src_mask shape', src_mask.shape)  # [128, 18]
        print(src_mask[0, :])
        encoder_padding_mask = (src_mask - 1.0) * self.INF
        print('encoder_padding_mask', encoder_padding_mask.shape)  # [128, 18]
        print(encoder_padding_mask[0, :])
        encoder_padding_mask = paddle.unsqueeze(encoder_padding_mask, [1])
        print('encoder_padding_mask', encoder_padding_mask.shape)  # [128, 1, 18]

        predict = self.decoder(trg, decoder_initial_states, encoder_output,
                               encoder_padding_mask)
        print('predict', predict.shape)  # [128, 17, 7931]
        return predict
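The mask arithmetic in forward turns a 0/1 indicator into an additive bias: kept positions become 0 and masked positions become -INF. A small list-based sketch of that step follows; build_padding_mask and the toy row are assumptions for illustration only.

```python
INF = 1e9

def build_padding_mask(src_row, pad_id):
    """0.0 where attention is allowed, -INF where the token equals pad_id."""
    keep = [1.0 if tok != pad_id else 0.0 for tok in src_row]
    return [(k - 1.0) * INF for k in keep]

# Toy row: start token, three characters, end token, two padding slots.
src_row = [1, 447, 3, 509, 2, 2, 2]
mask = build_padding_mask(src_row, pad_id=2)
print(mask)
```

Because this tutorial uses the same id for padding and for the end token, and the model is constructed with pad_id as its eos_id, the real end token of each source sentence is masked along with the padding.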
We use the cross-entropy loss here and need to set the loss at padding positions to 0, so the loss function needs a trg_mask argument. The paddle.nn.CrossEntropyLoss provided by the PaddlePaddle framework does not accept a trg_mask argument, so we define our own:
class CrossEntropyCriterion(nn.Layer):
    def __init__(self):
        super(CrossEntropyCriterion, self).__init__()

    def forward(self, predict, label, trg_mask):
        cost = F.softmax_with_cross_entropy(
            logits=predict, label=label, soft_label=False)
        cost = paddle.squeeze(cost, axis=[2])
        # Zero out the loss at padding positions.
        masked_cost = cost * trg_mask
        # Average over the batch, then sum over time steps.
        batch_mean_cost = paddle.mean(masked_cost, axis=[0])
        seq_cost = paddle.sum(batch_mean_cost)
        return seq_cost
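The same reduction can be traced by hand on a tiny example. The sketch below recomputes the criterion with plain Python lists: per-token cross entropy, multiplied by the mask, averaged over the batch axis, then summed over time. masked_seq_loss and the toy logits are assumptions made for this illustration.

```python
import math

def masked_seq_loss(logits, labels, mask):
    """logits: [batch][time][vocab]; labels and mask: [batch][time].

    Returns the sum over time of the batch-mean masked cross entropy.
    """
    batch_size, time_steps = len(logits), len(logits[0])
    total = 0.0
    for t in range(time_steps):
        step_cost = 0.0
        for b in range(batch_size):
            row = logits[b][t]
            top = max(row)
            # log of the softmax normaliser, computed stably
            log_z = top + math.log(sum(math.exp(x - top) for x in row))
            step_cost += (log_z - row[labels[b][t]]) * mask[b][t]
        total += step_cost / batch_size  # mean over the batch axis
    return total

logits = [[[0.0, 0.0], [2.0, 0.0]],
          [[0.0, 0.0], [0.0, 0.0]]]
labels = [[0, 0], [1, 1]]
mask = [[1.0, 1.0], [1.0, 0.0]]  # the last step of the second sample is padding
loss = masked_seq_loss(logits, labels, mask)
print(round(loss, 4))  # 0.7566
```

Note that the batch mean divides by the full batch size, so a padded position contributes a zero rather than being excluded from the average.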
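Training with the high-level API requires calling the prepare and fit functions.
In prepare, we configure the optimizer, the loss function, and the evaluation metric. The metric used here is perplexity, computed with the PaddleNLP API paddlenlp.metrics.Perplexity.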
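For intuition, perplexity is the exponential of the average negative log-likelihood per real token. The sketch below shows that definition on plain lists. It is a conceptual illustration with a hypothetical helper name, not the internals of paddlenlp.metrics.Perplexity, whose accumulation details are not shown in this tutorial.

```python
import math

def perplexity(token_nlls, mask):
    """exp of the mean negative log-likelihood over unmasked tokens."""
    total = sum(nll * m for nll, m in zip(token_nlls, mask))
    count = sum(mask)
    return math.exp(total / count)

# A model that assigns probability 1/4 to every real token has perplexity 4.
nlls = [math.log(4.0)] * 3 + [9.9]  # the last entry is padding and is ignored
ppl = perplexity(nlls, [1.0, 1.0, 1.0, 0.0])
print(round(ppl, 6))  # 4.0
```

Lower is better: a perplexity of 4 means the model is, on average, as uncertain as if it were choosing uniformly among four tokens.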
If you have VisualDL installed, you can pass a callbacks argument to fit and use VisualDL to monitor the training process, as follows:
model.fit(train_data=train_loader,
          epochs=max_epoch,
          eval_freq=1,
          save_freq=1,
          save_dir=model_path,
          log_freq=log_freq,
          callbacks=[paddle.callbacks.VisualDL('./log')])
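Because the couplet generation task has no well-defined evaluation metric, the best model can be chosen from the saved checkpoints by manually judging the generated results.
In this project, for ease of demonstration, trained model parameters have already been loaded into the model and the training process is omitted. When experimenting on your own, you can try changing the hyperparameters and calling the fit function below to retrain.
If you want a good model in less time, you can use pretrained models. For example, the project 《预训练模型ERNIE-GEN自动写诗》 (automatic poetry writing with the pretrained model ERNIE-GEN) shows how to train starting from a pretrained generation model.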
model = paddle.Model(
    Seq2SeqAttnModel(vocab_size, hidden_size, hidden_size,
                     num_layers, pad_id))
optimizer = paddle.optimizer.Adam(
    learning_rate=learning_rate, parameters=model.parameters())
ppl_metric = Perplexity()
model.prepare(optimizer, CrossEntropyCriterion(), ppl_metric)
model.fit(train_data=train_loader,
          epochs=max_epoch,
          eval_freq=1,
          save_freq=1,
          save_dir=model_path,
          log_freq=log_freq)
predict [128, 17, 7931]
The inference network inherits from the main network Seq2SeqAttnModel above and is defined as the subclass Seq2SeqAttnInferModel.
class Seq2SeqAttnInferModel(Seq2SeqAttnModel):
    def __init__(self,
                 vocab_size,
                 embed_dim,
                 hidden_size,
                 num_layers,
                 bos_id=0,
                 eos_id=1,
                 beam_size=4,
                 max_out_len=256):
        self.bos_id = bos_id
        self.beam_size = beam_size
        self.max_out_len = max_out_len
        self.num_layers = num_layers
        super(Seq2SeqAttnInferModel, self).__init__(
            vocab_size, embed_dim, hidden_size, num_layers, eos_id)

        # Dynamic decoder for inference
        self.beam_search_decoder = nn.BeamSearchDecoder(
            self.decoder.lstm_attention.cell,
            start_token=bos_id,
            end_token=eos_id,
            beam_size=beam_size,
            embedding_fn=self.decoder.embedder,
            output_fn=self.decoder.output_layer)

    def forward(self, src, src_length):
        encoder_output, encoder_final_state = self.encoder(src, src_length)
        encoder_final_state = [
            (encoder_final_state[0][i], encoder_final_state[1][i])
            for i in range(self.num_layers)
        ]

        # Initial decoder initial states
        decoder_initial_states = [
            encoder_final_state,
            self.decoder.lstm_attention.cell.get_initial_states(
                batch_ref=encoder_output, shape=[self.hidden_size])
        ]
        # Build attention mask to avoid paying attention to paddings
        src_mask = (src != self.eos_id).astype(paddle.get_default_dtype())
        encoder_padding_mask = (src_mask - 1.0) * self.INF
        encoder_padding_mask = paddle.unsqueeze(encoder_padding_mask, [1])

        # Tile the batch dimension with beam_size
        encoder_output = nn.BeamSearchDecoder.tile_beam_merge_with_batch(
            encoder_output, self.beam_size)
        encoder_padding_mask = nn.BeamSearchDecoder.tile_beam_merge_with_batch(
            encoder_padding_mask, self.beam_size)

        # Dynamic decoding with beam search
        seq_output, _ = nn.dynamic_decode(
            decoder=self.beam_search_decoder,
            inits=decoder_initial_states,
            max_step_num=self.max_out_len,
            encoder_output=encoder_output,
            encoder_padding_mask=encoder_padding_mask)
        return seq_output
Next, we choose beam search as the decoding method for this task; beam_size can be set to 10.
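Beam search keeps the beam_size highest-scoring partial sequences at every step instead of committing to the single best next token. The toy sketch below illustrates the idea on a hand-written probability table. The function names and the table are assumptions for illustration; it omits batching and length handling and is not how nn.BeamSearchDecoder is implemented.

```python
import math

def beam_search(step_log_probs, bos_id, eos_id, beam_size, max_len):
    """Toy beam search.

    step_log_probs(prefix) returns {token_id: log_prob} for the next token.
    Returns hypotheses as (score, tokens), best first.
    """
    beams = [(0.0, [bos_id])]
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == eos_id:  # finished hypotheses are carried over
                candidates.append((score, seq))
                continue
            for tok, logp in step_log_probs(seq).items():
                candidates.append((score + logp, seq + [tok]))
        # Keep only the beam_size highest-scoring hypotheses.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
        if all(seq[-1] == eos_id for _, seq in beams):
            break
    return beams

def toy_model(prefix):
    # Hypothetical distribution: token 3 looks best right after BOS,
    # but the path through token 4 has the higher total probability.
    table = {
        (1,): {3: math.log(0.6), 4: math.log(0.4)},
        (1, 3): {2: math.log(0.55), 5: math.log(0.45)},
        (1, 4): {2: math.log(0.9), 5: math.log(0.1)},
    }
    return table.get(tuple(prefix), {2: 0.0})  # otherwise end the sequence

greedy = beam_search(toy_model, bos_id=1, eos_id=2, beam_size=1, max_len=5)
wide = beam_search(toy_model, bos_id=1, eos_id=2, beam_size=2, max_len=5)
print(greedy[0][1])  # [1, 3, 2]
print(wide[0][1])    # [1, 4, 2]
```

With a beam of 1 the search is greedy and ends with total probability 0.33; with a beam of 2 it recovers the better sequence with probability 0.36. A larger beam explores more alternatives at a higher decoding cost.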
def post_process_seq(seq, bos_idx, eos_idx, output_bos=False, output_eos=False):
    """
    Post-process the decoded sequence.
    """
    eos_pos = len(seq) - 1
    for i, idx in enumerate(seq):
        if idx == eos_idx:
            eos_pos = i
            break
    seq = [
        idx for idx in seq[:eos_pos + 1]
        if (output_bos or idx != bos_idx) and (output_eos or idx != eos_idx)
    ]
    return seq

beam_size = 10
# init_from_ckpt = './couplet_models/0' # for test
# infer_output_file = './infer_output.txt'
# test_loader, vocab_size, pad_id, bos_id, eos_id = create_data_loader(test_ds, batch_size)
# vocab, _ = CoupletDataset.get_vocab()
# trg_idx2word = vocab.idx_to_token
model = paddle.Model(
    Seq2SeqAttnInferModel(
        vocab_size,
        hidden_size,
        hidden_size,
        num_layers,
        bos_id=bos_id,
        eos_id=eos_id,
        beam_size=beam_size,
        max_out_len=256))
model.prepare()
Before predicting, we need to load the trained model parameters into the inference network. After that, we can generate the lower line of a couplet from its upper line!
model.load('couplet_models/model_18')
test_ds = CoupletDataset.get_datasets(['test'])
idx = 0
for data in test_loader():
    inputs = data[:2]
    finished_seq = model.predict_batch(inputs=list(inputs))[0]
    finished_seq = finished_seq[:, :, np.newaxis] if len(
        finished_seq.shape) == 2 else finished_seq
    finished_seq = np.transpose(finished_seq, [0, 2, 1])
    for ins in finished_seq:
        for beam in ins:
            id_list = post_process_seq(beam, bos_id, eos_id)
            word_list_l = [trg_idx2word[id] for id in test_ds[idx][0]][1:-1]
            word_list_r = [trg_idx2word[id] for id in id_list]
            sequence = "上联: "+" ".join(word_list_l)+"\t下联: "+" ".join(word_list_r) + "\n"
            print(sequence)
            idx += 1
            # Only the first (top-scoring) beam is printed for each input.
            break