This project is implemented with the PaddlePaddle framework. Project link: English-to-Spanish translation with a Transformer.
In this project we build a sequence-to-sequence Transformer model and train it on an English-to-Spanish machine translation task.
This architecture is known as the encoder-decoder architecture, shown below:
![encoder-decoder architecture](https://ai-studio-static-online.cdn.bcebos.com/a7cadcd296924fec8f50146faa7d0c1db5c39f0712474be6a3dfe1b44d3270c9)
![encoder-decoder architecture](https://ai-studio-static-online.cdn.bcebos.com/ca0931928b4a41af97a318f66957767c9ffda9b248634f4182cfda80e6a3923a)
This approach is called neural machine translation (NMT), a term used to distinguish it from earlier statistical translation models. The main dependencies we need are:
import paddle
import paddlenlp
from paddle.io import Dataset
from paddlenlp.data import Vocab
import numpy as np
import string
import random
import matplotlib.pyplot as plt
from functools import partial
from collections import Counter
We use the English-to-Spanish translation dataset provided by Anki, available at: English-to-Spanish dataset. The dataset contains 118,964 (English, Spanish) sentence pairs in total. For example:
| English | Spanish |
|---|---|
| Go on home. | Vete a casa. |
| I can jump. | Puedo saltar. |
The dataset can be loaded in two ways:
Download it from the link above; this can be slow and inconvenient, so it is not recommended.
Use the copy we have already uploaded to the AI Studio platform, which is convenient and strongly recommended. Our dataset: 【NLP】English-Spanish
# Option 1: download the dataset
# from paddle.utils.download import get_path_from_url
# URL = "http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip"
# get_path_from_url(URL, "./data")
# text_file = 'data/spa-eng/spa.txt'  # dataset path
# Option 2: use the dataset we provide
text_file = 'data/data173968/spa.txt'  # dataset path
Add a [start] token and an [end] token to each sentence of the target language, Spanish:
with open(text_file) as f:
    lines = f.read().split("\n")[:-1]
text_pairs = []
for line in lines:
    eng, spa = line.split("\t")
    spa = "[start] " + spa + " [end]"
    text_pairs.append((eng, spa))
for _ in range(5):
    print(random.choice(text_pairs))
('Are you seriously thinking about getting a divorce?', '[start] ¿Estás pensando seriamente en divorciarte? [end]')
('I bought it.', '[start] Lo he comprado. [end]')
('He was my student. Now he teaches my children.', '[start] Era alumno mío, ahora enseña a mis hijos. [end]')
('It seems Tom knows Mary.', '[start] Parece que Tom conoce a Mary. [end]')
('Tom has done a magnificent job.', '[start] Tom ha hecho un excelente trabajo. [end]')
random.shuffle(text_pairs)
num_val_samples = int(0.15 * len(text_pairs))
num_train_samples = len(text_pairs) - 2 * num_val_samples
train_pairs = text_pairs[:num_train_samples]
val_pairs = text_pairs[num_train_samples : num_train_samples + num_val_samples]
test_pairs = text_pairs[num_train_samples + num_val_samples :]
print(f"{len(text_pairs)} total pairs")
print(f"{len(train_pairs)} training pairs")
print(f"{len(val_pairs)} validation pairs")
print(f"{len(test_pairs)} test pairs")
118964 total pairs
83276 training pairs
17844 validation pairs
17844 test pairs
train_eng_texts = [pair[0] for pair in train_pairs]
train_spa_texts = [pair[1] for pair in train_pairs]
val_eng_texts = [pair[0] for pair in val_pairs]
val_spa_texts = [pair[1] for pair in val_pairs]
test_eng_texts = [pair[0] for pair in test_pairs]
test_spa_texts = [pair[1] for pair in test_pairs]
def pre_process(datas, save_punctuation=False):
    dataset = []
    # define the set of punctuation characters
    strip_chars = string.punctuation + "¿¡"
    strip_chars = strip_chars.replace("[", "")
    strip_chars = strip_chars.replace("]", "")
    for i in range(len(datas)):
        lowercase = datas[i].lower()  # convert everything to lowercase
        out = ""
        if save_punctuation:
            # insert a space around each punctuation mark; note the special case below
            for low in lowercase:
                if low in strip_chars:
                    if low == "¿" or low == "¡":  # Spanish inverted question/exclamation marks open a sentence, so the space goes after them
                        out += low + " "
                    else:
                        out += " " + low
                else:
                    out += low
        else:
            # alternatively, simply delete all punctuation
            for low in lowercase:
                if low not in strip_chars:
                    out += low
        dataset.append(out)
    return dataset
train_eng_texts_pre=pre_process(train_eng_texts)
train_spa_texts_pre=pre_process(train_spa_texts)
val_eng_texts_pre=pre_process(val_eng_texts)
val_spa_texts_pre=pre_process(val_spa_texts)
test_eng_texts_pre=pre_process(test_eng_texts)
test_spa_texts_pre=pre_process(test_spa_texts)
print("预处理结果展示:")
print("英语:标准化处理之前:",train_eng_texts[0])
print("英语:标准化处理之后:",train_eng_texts_pre[0])
print("西班牙语:标准化处理之前:",train_spa_texts[0])
print("西班牙语:标准化处理之后:",train_spa_texts_pre[0])
Preprocessing results:
English, before normalization: Tom wants to stay single.
English, after normalization: tom wants to stay single
Spanish, before normalization: [start] Tom quiere seguir soltero. [end]
Spanish, after normalization: [start] tom quiere seguir soltero [end]
# count sentence-length frequencies (Counter is imported above)
dicta = Counter(len(text.split()) for text in train_eng_texts_pre)
lita = sorted(dicta.items(), key=lambda x: x[0], reverse=True)
x=[l[0] for l in lita]
y=[l[1] for l in lita]
plt.bar(x, y)
plt.xlabel('English sentences length')
plt.ylabel('nums')
plt.title('Information on the length of English sentences')
plt.show()
![Bar chart: distribution of English sentence lengths](main_files/main_20_0.png)
dicta = Counter(len(text.split()) for text in train_spa_texts_pre)
lita = sorted(dicta.items(), key=lambda x: x[0], reverse=True)
x=[l[0] for l in lita]
y=[l[1] for l in lita]
plt.bar(x, y)
plt.xlabel('Spanish sentences length')
plt.ylabel('nums')
plt.title('Information on the length of Spanish sentences')
plt.show()
![Bar chart: distribution of Spanish sentence lengths](main_files/main_22_0.png)
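Based on these histograms, the quick check below (a sketch using the preprocessed lists defined above) estimates how much of the training data a fixed cutoff of 20 tokens keeps intact; this motivates the sequence_length chosen later.
# Fraction of training sentences no longer than a fixed cutoff (sketch)
max_len = 20
eng_cover = sum(len(t.split()) <= max_len for t in train_eng_texts_pre) / len(train_eng_texts_pre)
spa_cover = sum(len(t.split()) <= max_len for t in train_spa_texts_pre) / len(train_spa_texts_pre)
print(f"English sentences with <= {max_len} tokens: {eng_cover:.2%}")
print(f"Spanish sentences with <= {max_len} tokens: {spa_cover:.2%}")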
def build_corpus(data):
    # flatten all sentences into one list of tokens
    corpus = []
    for i in range(len(data)):
        cr = data[i].strip().lower()
        cr = cr.split()
        corpus.extend(cr)
    return corpus
eng_corpus = build_corpus(train_eng_texts_pre)
spa_corpus = build_corpus(train_spa_texts_pre)
print(eng_corpus[:3])
print(spa_corpus[:3])
['tom', 'wants', 'to']
['[start]', 'tom', 'quiere']
# Build the vocabulary by word frequency, up to the given vocabulary size
def build_dict(corpus, vocab_size):
    # first count the frequency (number of occurrences) of each distinct word
    word_freq_dict = dict()
    for word in corpus:
        if word not in word_freq_dict:
            word_freq_dict[word] = 0
        word_freq_dict[word] += 1
    # sort the words by frequency, most frequent first
    word_freq_dict = sorted(word_freq_dict.items(), key=lambda x: x[1], reverse=True)
    # build two dictionaries:
    # word2id_dict maps each word to its id
    # id2word_dict maps each id back to its word
    # id 0 is reserved for padding (<pad>), id 1 for out-of-vocabulary words (<unk>)
    word2id_dict = {'<pad>': 0, '<unk>': 1}
    id2word_dict = {0: '<pad>', 1: '<unk>'}
    # walk through the words from most to least frequent and give each a unique id
    i = 2
    for word, freq in word_freq_dict:
        if i < vocab_size:
            word2id_dict[word] = i
            id2word_dict[i] = word
            i += 1
        else:  # beyond the vocabulary size: map the word to <unk>
            word2id_dict[word] = 1
    return word2id_dict, id2word_dict
vocab_size = 15000  # vocabulary size; English and Spanish could be set separately, but we share one size to reduce parameters
eng2id_dict, id2eng_dict = build_dict(eng_corpus, vocab_size)
spa2id_dict, id2spa_dict = build_dict(spa_corpus, vocab_size)
print("我们设置的英语总词汇为:",vocab_size,'\t我们设置的英语总词汇为:',vocab_size)
print("总的英语词汇量为:",len(eng2id_dict),"\t\t我们实际使用的英语词汇量为",len(id2eng_dict))
print("总的西班牙语词汇量为:",len(spa2id_dict),"\t我们实际使用的西班牙语词汇量为",len(id2spa_dict))
Vocabulary size setting (English): 15000 	Vocabulary size setting (Spanish): 15000
Total distinct English tokens: 12092 		English vocabulary actually used: 12092
Total distinct Spanish tokens: 22445 	Spanish vocabulary actually used: 15000
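A quick sanity check (a sketch; 'tom' comes from the samples above, and the misspelled word is deliberately out of vocabulary): in-vocabulary words round-trip through the two dictionaries, while unseen words fall back to <unk> (id 1).
tid = eng2id_dict['tom']
print(tid, id2eng_dict[tid])             # some id below vocab_size, and back to 'tom'
print(eng2id_dict.get('qwertyuiop', 1))  # unseen at lookup time -> treated as <unk>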
def build_tensor(data, dicta, maxlen):
    tensor = []
    for i in range(len(data)):
        subtensor = []
        lista = data[i].split()
        for j in range(len(lista)):
            index = dicta.get(lista[j])
            # the validation and test sets may contain words missing from the vocabulary,
            # in which case dict.get returns None and we fall back to <unk> (id 1)
            if index is None:
                index = 1
            subtensor.append(index)
        if len(subtensor) < maxlen:
            subtensor += [0] * (maxlen - len(subtensor))
        else:
            subtensor = subtensor[:maxlen]
        tensor.append(subtensor)
    return np.array(tensor)
sequence_length = 20  # a uniform sentence length; it could be set per language based on the statistics in section 3.5
train_eng_tensor = build_tensor(train_eng_texts_pre, eng2id_dict, sequence_length)
val_eng_tensor = build_tensor(val_eng_texts_pre, eng2id_dict, sequence_length)
test_eng_tensor = build_tensor(test_eng_texts_pre, eng2id_dict, sequence_length)
# target sequences get one extra position so they can later be shifted into decoder input and labels
train_spa_tensor = build_tensor(train_spa_texts_pre, spa2id_dict, sequence_length + 1)
val_spa_tensor = build_tensor(val_spa_texts_pre, spa2id_dict, sequence_length + 1)
test_spa_tensor = build_tensor(test_spa_texts_pre, spa2id_dict, sequence_length + 1)
print(val_eng_texts_pre[0])
print(val_eng_tensor[0])
print(val_spa_texts_pre[0])
print(val_spa_tensor[0])
tom is an intelligent person
[ 6 8 67 1244 285 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0]
[start] tom es una persona inteligente [end]
[ 2 8 12 19 243 676 3 0 0 0 0 0 0 0 0 0 0 0
0 0 0]
class MyDataset(Dataset):
    """
    Step 1: inherit from paddle.io.Dataset
    """
    def __init__(self, eng, spa):
        """
        Step 2: implement the constructor and store the data
        """
        super(MyDataset, self).__init__()
        self.eng = eng
        self.spa = spa

    def __getitem__(self, index):
        """
        Step 3: implement __getitem__, returning the (source, target) pair at the given index
        """
        return self.eng[index], self.spa[index]

    def __len__(self):
        """
        Step 4: implement __len__, returning the total size of the dataset
        """
        return self.eng.shape[0]
def prepare_input(inputs, padid=0):
    # collate a batch of (source, target) pairs into arrays
    src = np.array([inputsub[0] for inputsub in inputs])
    trg = np.array([inputsub[1] for inputsub in inputs])
    # mask out padding positions of the decoder input when computing the loss
    trg_mask = (trg[:, :-1] != padid).astype(paddle.get_default_dtype())
    # teacher forcing: the decoder input drops the last token, the label drops the first
    return src, trg[:, :-1], trg[:, 1:, np.newaxis], trg_mask
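To make the teacher-forcing shift concrete, here is a toy example (made-up token ids) run through prepare_input: at every position the model is asked to predict the next word.
# toy batch of one (source, target) pair; ids are made up
toy = [(np.array([5, 6, 0, 0]),         # source sentence, padded with 0
        np.array([2, 9, 7, 3, 0]))]     # [start] w1 w2 [end] + padding
src, dec_in, label, mask = prepare_input(toy)
print(dec_in)         # [[2 9 7 3]]      decoder input
print(label[..., 0])  # [[9 7 3 0]]      next-word labels
print(mask)           # [[1. 1. 1. 1.]]  loss mask over decoder-input positions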
# wrap the datasets into DataLoaders
BATCH_SIZE=64
train_dataset = MyDataset(train_eng_tensor,train_spa_tensor)
train_loader = paddle.io.DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True,drop_last=True,collate_fn=partial(prepare_input))
val_dataset=MyDataset(val_eng_tensor,val_spa_tensor)
val_loader=paddle.io.DataLoader(val_dataset,batch_size=BATCH_SIZE,shuffle=True,drop_last=True,collate_fn=partial(prepare_input))
for i, data in enumerate(val_loader):
    for d in data:
        print(d.shape)
    break
[64, 20]
[64, 20]
[64, 20, 1]
[64, 20]
# Define some hyperparameters up front to make debugging the network easier
embed_dim = 256    # dimension of the word embeddings
latent_dim = 2048  # hidden size of the feed-forward network
num_heads = 8      # number of heads in the multi-head attention
The encoder consists of multi-head attention, layer normalization, and a feed-forward network. We use paddle.nn.MultiHeadAttention to implement the attention; note that its attn_mask expects the shape [batch_size, num_heads, sequence_length, sequence_length].
class TransformerEncoder(paddle.nn.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads):
        super(TransformerEncoder, self).__init__()
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention = paddle.nn.MultiHeadAttention(num_heads=num_heads, embed_dim=embed_dim, dropout=0.1)
        self.dense_proj = paddle.nn.Sequential(
            paddle.nn.Linear(embed_dim, dense_dim),
            paddle.nn.ReLU(),
            paddle.nn.Linear(dense_dim, embed_dim))
        self.layernorm_1 = paddle.nn.LayerNorm(embed_dim)
        self.layernorm_2 = paddle.nn.LayerNorm(embed_dim)
        self.supports_masking = True

    def forward(self, inputs, mask=None):
        padding_mask = None
        if mask is not None:
            # [batch_size, seq_len] -> [batch_size, 1, 1, seq_len]; broadcasts over heads and query positions
            padding_mask = paddle.cast(mask[:, np.newaxis, np.newaxis, :], dtype="int32")
        attention_output = self.attention(query=inputs, value=inputs, key=inputs, attn_mask=padding_mask)
        # residual connection + layer norm, then the feed-forward block
        proj_input = self.layernorm_1(inputs + attention_output)
        proj_output = self.dense_proj(proj_input)
        return self.layernorm_2(proj_input + proj_output)
# pencoder=TransformerEncoder(embed_dim, latent_dim, num_heads)
# print(pencoder)
# inputs=paddle.rand([64,20,256])
# print("inputs.shape:",inputs.shape)
# out=pencoder(inputs)
# print("out.shape:",out.shape)
The Transformer contains no recurrence or convolution, so the model adds positional embeddings to provide information about the relative positions of tokens in a sentence. We implement them with paddle.nn.Embedding, setting num_embeddings=sequence_length.
class PositionalEmbedding(paddle.nn.Layer):
    def __init__(self, sequence_length, vocab_size, embed_dim):
        super(PositionalEmbedding, self).__init__()
        self.token_embeddings = paddle.nn.Embedding(num_embeddings=vocab_size, embedding_dim=embed_dim)
        self.position_embeddings = paddle.nn.Embedding(num_embeddings=sequence_length, embedding_dim=embed_dim)
        self.sequence_length = sequence_length
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim

    def forward(self, inputs):
        length = inputs.shape[-1]
        positions = paddle.arange(start=0, end=length, step=1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        # token embedding + learned position embedding
        return embedded_tokens + embedded_positions

    def compute_mask(self, inputs, mask=None):
        # padding positions (id 0) are masked out
        return paddle.not_equal(inputs, 0)
# ps=PositionalEmbedding(20,15000,256)
# print(ps)
# inputs=paddle.randint(0,15000,[64,20])
# print("inputs.shape:",inputs.shape)
# out=ps(inputs)
# print("out.shape:",out.shape)
The decoder contains two multi-head attention blocks: the first performs self-attention over the Spanish input (with a causal mask), and the second attends over the encoder output together with the output of the first attention block.
class TransformerDecoder(paddle.nn.Layer):
    def __init__(self, embed_dim, latent_dim, num_heads):
        super(TransformerDecoder, self).__init__()
        self.embed_dim = embed_dim
        self.latent_dim = latent_dim
        self.num_heads = num_heads
        self.attention_1 = paddle.nn.MultiHeadAttention(num_heads=num_heads, embed_dim=embed_dim)
        self.attention_2 = paddle.nn.MultiHeadAttention(num_heads=num_heads, embed_dim=embed_dim)
        self.dense_proj = paddle.nn.Sequential(
            paddle.nn.Linear(embed_dim, latent_dim),
            paddle.nn.ReLU(),
            paddle.nn.Linear(latent_dim, embed_dim))
        self.layernorm_1 = paddle.nn.LayerNorm(embed_dim)
        self.layernorm_2 = paddle.nn.LayerNorm(embed_dim)
        self.layernorm_3 = paddle.nn.LayerNorm(embed_dim)
        self.supports_masking = True

    def forward(self, inputs, encoder_outputs, mask=None):
        causal_mask = self.get_causal_attention_mask(inputs)  # [batch_size, 1, sequence_length, sequence_length]
        padding_mask = None
        if mask is not None:
            padding_mask = paddle.cast(mask[:, np.newaxis, :], dtype="int32")
            padding_mask = paddle.minimum(padding_mask, causal_mask)
        # attn_mask broadcasts to [batch_size, num_heads, sequence_length, sequence_length]
        attention_output_1 = self.attention_1(query=inputs, value=inputs, key=inputs, attn_mask=causal_mask)
        out_1 = self.layernorm_1(inputs + attention_output_1)
        attention_output_2 = self.attention_2(
            query=out_1,
            value=encoder_outputs,
            key=encoder_outputs,
            attn_mask=padding_mask,
        )
        out_2 = self.layernorm_2(out_1 + attention_output_2)
        proj_output = self.dense_proj(out_2)
        return self.layernorm_3(out_2 + proj_output)

    def get_causal_attention_mask(self, inputs):
        input_shape = inputs.shape
        batch_size, sequence_length = input_shape[0], input_shape[1]
        i = paddle.arange(sequence_length)[:, np.newaxis]
        j = paddle.arange(sequence_length)
        mask = paddle.cast(i >= j, dtype="int32")  # [sequence_length, sequence_length]
        mask = paddle.reshape(mask, (1, 1, sequence_length, sequence_length))  # [1, 1, sequence_length, sequence_length]
        # tile over the batch dimension (the original hardcoded a batch size of 64 here)
        return paddle.tile(mask, [batch_size, 1, 1, 1])  # [batch_size, 1, sequence_length, sequence_length]
# decoder=TransformerDecoder(embed_dim, latent_dim, num_heads)
# print(decoder)
# inputs=paddle.rand([64,20,256])
# enout=paddle.rand([64,20,256])
# out=decoder(inputs,enout)
# print("out.shape:",out.shape)
class Transformer(paddle.nn.Layer):
    def __init__(self, embed_dim, latent_dim, num_heads, sequence_length, vocab_size):
        super(Transformer, self).__init__()
        self.ps1 = PositionalEmbedding(sequence_length, vocab_size, embed_dim)
        self.encoder = TransformerEncoder(embed_dim, latent_dim, num_heads)
        self.ps2 = PositionalEmbedding(sequence_length, vocab_size, embed_dim)
        self.decoder = TransformerDecoder(embed_dim, latent_dim, num_heads)
        self.drop = paddle.nn.Dropout(p=0.5)
        self.lastLinear = paddle.nn.Linear(embed_dim, vocab_size)
        self.softmax = paddle.nn.Softmax()

    def forward(self, encoder_inputs, decoder_inputs):
        # encoder
        encoder_emb = self.ps1(encoder_inputs)
        encoder_outputs = self.encoder(encoder_emb)
        # decoder
        decoder_emb = self.ps2(decoder_inputs)
        decoder_outputs = self.decoder(decoder_emb, encoder_outputs)
        # dropout
        out = self.drop(decoder_outputs)
        # final projection to the vocabulary; softmax is folded into the loss, so we return logits
        out = self.lastLinear(out)
        # out = self.softmax(self.lastLinear(out))
        return out
trans=Transformer(embed_dim, latent_dim, num_heads,sequence_length, vocab_size)
encoder_inputs=paddle.randint(0,15000,[64,20])
decoder_inputs=paddle.randint(0,15000,[64,20])
out=trans(encoder_inputs,decoder_inputs)
print("out.shape:",out.shape)
paddle.summary(trans,input_size=[(64,20),(64,20)],dtypes='int32')
W1031 20:31:59.264416 14770 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W1031 20:31:59.268676 14770 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
out.shape: [64, 20, 15000]
-------------------------------------------------------------------------------------------
Layer (type) Input Shape Output Shape Param #
===========================================================================================
Embedding-1 [[64, 20]] [64, 20, 256] 3,840,000
Embedding-2 [[20]] [20, 256] 5,120
PositionalEmbedding-1 [[64, 20]] [64, 20, 256] 0
Linear-1 [[64, 20, 256]] [64, 20, 256] 65,792
Linear-2 [[64, 20, 256]] [64, 20, 256] 65,792
Linear-3 [[64, 20, 256]] [64, 20, 256] 65,792
Linear-4 [[64, 20, 256]] [64, 20, 256] 65,792
MultiHeadAttention-1 [] [64, 20, 256] 0
LayerNorm-1 [[64, 20, 256]] [64, 20, 256] 512
Linear-5 [[64, 20, 256]] [64, 20, 2048] 526,336
ReLU-1 [[64, 20, 2048]] [64, 20, 2048] 0
Linear-6 [[64, 20, 2048]] [64, 20, 256] 524,544
LayerNorm-2 [[64, 20, 256]] [64, 20, 256] 512
TransformerEncoder-1 [[64, 20, 256]] [64, 20, 256] 0
Embedding-3 [[64, 20]] [64, 20, 256] 3,840,000
Embedding-4 [[20]] [20, 256] 5,120
PositionalEmbedding-2 [[64, 20]] [64, 20, 256] 0
Linear-7 [[64, 20, 256]] [64, 20, 256] 65,792
Linear-8 [[64, 20, 256]] [64, 20, 256] 65,792
Linear-9 [[64, 20, 256]] [64, 20, 256] 65,792
Linear-10 [[64, 20, 256]] [64, 20, 256] 65,792
MultiHeadAttention-2 [] [64, 20, 256] 0
LayerNorm-3 [[64, 20, 256]] [64, 20, 256] 512
Linear-11 [[64, 20, 256]] [64, 20, 256] 65,792
Linear-12 [[64, 20, 256]] [64, 20, 256] 65,792
Linear-13 [[64, 20, 256]] [64, 20, 256] 65,792
Linear-14 [[64, 20, 256]] [64, 20, 256] 65,792
MultiHeadAttention-3 [] [64, 20, 256] 0
LayerNorm-4 [[64, 20, 256]] [64, 20, 256] 512
Linear-15 [[64, 20, 256]] [64, 20, 2048] 526,336
ReLU-2 [[64, 20, 2048]] [64, 20, 2048] 0
Linear-16 [[64, 20, 2048]] [64, 20, 256] 524,544
LayerNorm-5 [[64, 20, 256]] [64, 20, 256] 512
TransformerDecoder-1 [[64, 20, 256], [64, 20, 256]] [64, 20, 256] 0
Dropout-1 [[64, 20, 256]] [64, 20, 256] 0
Linear-17 [[64, 20, 256]] [64, 20, 15000] 3,855,000
===========================================================================================
Total params: 14,439,064
Trainable params: 14,439,064
Non-trainable params: 0
-------------------------------------------------------------------------------------------
Input size (MB): 0.01
Forward/backward pass size (MB): 299.06
Params size (MB): 55.08
Estimated Total Size (MB): 354.15
-------------------------------------------------------------------------------------------
{'total_params': 14439064, 'trainable_params': 14439064}
class CrossEntropy(paddle.nn.Layer):
    def __init__(self):
        super(CrossEntropy, self).__init__()

    def forward(self, pre, real, trg_mask):
        # softmax_with_cross_entropy applies softmax over the last axis of the logits
        # pre: [batch_size, sequence_len, vocab_size]; real: [batch_size, sequence_len, 1]
        # with the default soft_label=False, real holds the index of the true word
        cost = paddle.nn.functional.softmax_with_cross_entropy(logits=pre, label=real)
        # drop the trailing dimension of size 1 -> [batch_size, sequence_len]
        cost = paddle.squeeze(cost, axis=[2])
        # trg_mask has shape [batch_size, sequence_len]; element-wise multiplication
        # zeroes out the loss at padding positions
        masked_cost = cost * trg_mask
        # average over the batch (axis 0) -> [sequence_len], then sum to a scalar
        return paddle.sum(paddle.mean(masked_cost, axis=[0]))
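A quick sanity check (a sketch with random logits) confirms that positions zeroed out by trg_mask do not contribute to the loss:
loss_fn = CrossEntropy()
pre = paddle.rand([2, 4, 10])            # [batch, seq_len, vocab]
real = paddle.randint(0, 10, [2, 4, 1])  # true word indices
full = paddle.ones([2, 4])               # every position counts
half = paddle.concat([paddle.ones([2, 2]), paddle.zeros([2, 2])], axis=1)
print(loss_fn(pre, real, full))  # loss over all positions
print(loss_fn(pre, real, half))  # the last two positions are ignored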
epochs = 10
trans=Transformer(embed_dim, latent_dim, num_heads,sequence_length, vocab_size)
model=paddle.Model(trans)
model.prepare(optimizer=paddle.optimizer.Adam(learning_rate=0.001,parameters=model.parameters()),
loss=CrossEntropy(),
metrics=paddle.metric.Accuracy())
model.fit(train_data=train_loader,
epochs=epochs,
eval_data= val_loader,
verbose =2,
log_freq =100,
callbacks=[paddle.callbacks.VisualDL('./log')])
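Once training finishes, the weights can be persisted for later inference (a sketch; the checkpoint path is arbitrary):
# paddle.Model.save writes the parameters (and optimizer state) under the given prefix
model.save('./checkpoints/transformer')
# later: model.load('./checkpoints/transformer')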
Validation loss and accuracy curves over 10 epochs:
![validation curves over 10 epochs](https://ai-studio-static-online.cdn.bcebos.com/2d3e2fb0fb0c46c7999da12ca96af25e55d5ddad045b4f85b3419dad0774af87)
![validation curves over 10 epochs](https://ai-studio-static-online.cdn.bcebos.com/4b9492bedc8b47a58af3474236f2c70d96e6a7e219774e53aeae719759efa559)
def evaluate(eng):
    # greedy decoding: feed the source sentence and grow the target one token at a time
    encoder_input = paddle.unsqueeze(eng, axis=0)
    decoded_sentence = "[start]"
    for i in range(sequence_length):
        decoder_input = paddle.to_tensor(build_tensor([decoded_sentence], spa2id_dict, sequence_length))
        pre = trans(encoder_input, decoder_input)
        # pick the most probable word at position i
        sampled_token_index = np.argmax(pre[0, i, :])
        sampled_token = id2spa_dict.get(sampled_token_index)
        decoded_sentence += " " + sampled_token
        if sampled_token == "[end]":
            break
    return decoded_sentence
def translate():
    with open('result.txt', 'w+') as re:
        # for i in tqdm(range(len(test_eng_tensor))):
        for i in range(5):
            result = evaluate(paddle.to_tensor(test_eng_tensor[i]))
            re.write(result + '\n')
            # print(result)
translate()
with open('result.txt', 'r') as re:
    pre = re.readlines()
for i in range(5):
    print('English:           ', test_eng_texts[i])
    print('Reference Spanish: ', test_spa_texts_pre[i])
    print('Predicted Spanish: ', pre[i])
English:            It's been more than a month.
Reference Spanish:  [start] ha pasado más de un mes [end]
Predicted Spanish:  [start] ha sido más de un mes [end]
English:            My parents picked me up from school.
Reference Spanish:  [start] mis padres me recogieron de la escuela [end]
Predicted Spanish:  [start] mis padres me del colegio [end]
English:            If I don't fail, I will get my driving license before New Year.
Reference Spanish:  [start] si no suspendo conseguiré mi carné de conducir antes de año nuevo [end]
Predicted Spanish:  [start] si no mi licencia de conducir hasta mi nuevo año [end]
English:            Tom said he's Canadian.
Reference Spanish:  [start] tom dijo que es canadiense [end]
Predicted Spanish:  [start] tom dijo que es canadiense [end]
English:            They had fun with us.
Reference Spanish:  [start] ellas se entretuvieron con nosotras [end]
Predicted Spanish:  [start] ellos se con nosotros [end]
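For a rough, self-contained quality estimate on these few samples (a sketch; a proper evaluation would use BLEU on the full test set), we can measure clipped token overlap between each prediction and its reference, reusing the Counter import from the top of the script:
def token_overlap(pred, ref):
    # clipped unigram precision, a crude stand-in for BLEU-1
    pred_t, ref_t = pred.split(), ref.split()
    hits = sum((Counter(pred_t) & Counter(ref_t)).values())
    return hits / max(len(pred_t), 1)

for i in range(5):
    print(round(token_overlap(pre[i].strip(), test_spa_texts_pre[i]), 3))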