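Machine translation (MT): automatically translating a piece of text from one language into another. Solving this problem with neural networks is usually called neural machine translation (NMT).
Key characteristics: the output is a sequence of words rather than a single word, and the output sequence may differ in length from the source sequence.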
import os
os.listdir('/home/kesci/input/')
['fraeng6506', 'd2l9528']
import sys
sys.path.append('/home/kesci/input/d2l9528/')
import collections
import d2l
import zipfile
from d2l.data.base import Vocab
import time
import torch
import torch.nn as nn
import torch.nn.functional as F  # functional versions of the nn layers
from torch.utils import data
from torch import optim
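Data preprocessing: clean the dataset and convert it into minibatches that can be fed to the neural network.
with open('/home/kesci/input/fraeng6506/fra.txt', 'r', encoding='utf-8') as f:
    raw_text = f.read()
print(raw_text[0:1000])  # the English-French text file
print(len(raw_text))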
Go. Va ! CC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #1158250 (Wittydev)
Hi. Salut ! CC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #509819 (Aiji)
Hi. Salut. CC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #4320462 (gillux)
Run! Cours ! CC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #906331 (sacredceltic)
Run! Courez ! CC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #906332 (sacredceltic)
Who? Qui ? CC-BY 2.0 (France) Attribution: tatoeba.org #2083030 (CK) & #4366796 (gillux)
Wow! Ça alors ! CC-BY 2.0 (France) Attribution: tatoeba.org #52027 (Zifre) & #374631 (zmoo)
Fire! Au feu ! CC-BY 2.0 (France) Attribution: tatoeba.org #1829639 (Spamster) & #4627939 (sacredceltic)
Help! À l'aide ! CC-BY 2.0 (France) Attribution: tatoeba.org #435084 (lukaszpp) & #128430 (sysko)
Jump. Saute. CC-BY 2.0 (France) Attribution: tatoeba.org #631038 (Shishir) & #2416938 (Phoenix)
Stop! Ça suffit ! CC-BY 2.0 (France) Attribution: tato
25666314
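def preprocess_raw(text):
    # \xa0 is the non-breaking space and \u202f is the narrow no-break space;
    # both are replaced with an ordinary space.
    text = text.replace('\u202f', ' ').replace('\xa0', ' ')
    out = ''
    for i, char in enumerate(text.lower()):
        # Insert a space before ',', '!' and '.' unless one is already there.
        # '?' is not in the tuple, so 'who?' stays unsplit in the output below.
        if char in (',', '!', '.') and i > 0 and text[i-1] != ' ':
            out += ' '
        out += char
    return out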
text = preprocess_raw(raw_text)
print(text[0:1000])
go . va ! cc-by 2 .0 (france) attribution: tatoeba .org #2877272 (cm) & #1158250 (wittydev)
hi . salut ! cc-by 2 .0 (france) attribution: tatoeba .org #538123 (cm) & #509819 (aiji)
hi . salut . cc-by 2 .0 (france) attribution: tatoeba .org #538123 (cm) & #4320462 (gillux)
run ! cours ! cc-by 2 .0 (france) attribution: tatoeba .org #906328 (papabear) & #906331 (sacredceltic)
run ! courez ! cc-by 2 .0 (france) attribution: tatoeba .org #906328 (papabear) & #906332 (sacredceltic)
who? qui ? cc-by 2 .0 (france) attribution: tatoeba .org #2083030 (ck) & #4366796 (gillux)
wow ! ça alors ! cc-by 2 .0 (france) attribution: tatoeba .org #52027 (zifre) & #374631 (zmoo)
fire ! au feu ! cc-by 2 .0 (france) attribution: tatoeba .org #1829639 (spamster) & #4627939 (sacredceltic)
help ! à l'aide ! cc-by 2 .0 (france) attribution: tatoeba .org #435084 (lukaszpp) & #128430 (sysko)
jump . saute . cc-by 2 .0 (france) attribution: tatoeba .org #631038 (shishir) & #2416938 (phoenix)
stop ! ça suffit ! cc-b
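Characters are stored in a computer as encoded values. The ordinary space is \x20, which lies within the printable ASCII range 0x20~0x7e.
\xa0, by contrast, belongs to the extended character set of latin1 (ISO/IEC 8859-1) and represents the non-breaking space, nbsp. It falls outside the GBK encoding range, so it is a special character that has to be removed. Cleaning the data is therefore the first step of preprocessing.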
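The cleaning step can be checked on a tiny string. The sketch below uses only the standard library; `describe`, `normalize_spaces` and the sample string are illustrative names that do not appear in the notebook.

```python
import unicodedata

def describe(ch):
    """Return the code point and official Unicode name of a character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

def normalize_spaces(text):
    """Replace the two no-break space variants with an ordinary space."""
    return text.replace('\u202f', ' ').replace('\xa0', ' ')

# Made-up sample containing both special spaces
sample = 'Va\u202f!\tSalut\xa0!'

for ch in (' ', '\xa0', '\u202f'):
    print(describe(ch))
print(repr(normalize_spaces(sample)))
```

All three characters look identical on screen, which is why they are replaced explicitly before the text is split on `' '`.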
From strings to lists of words
num_examples = 50000
source, target = [], []
for i, line in enumerate(text.split('\n')):
    if i > num_examples:
        break
    parts = line.split('\t')
    if len(parts) >= 2:
        # Split each sentence into words on single spaces
        source.append(parts[0].split(' '))
        target.append(parts[1].split(' '))
# source holds the English sentences, target the French ones
source[0:3], target[0:3]
([['go', '.'], ['hi', '.'], ['hi', '.']],
[['va', '!'], ['salut', '!'], ['salut', '.']])
# Distribution of sentence lengths
d2l.set_figsize()
d2l.plt.hist([[len(l) for l in source], [len(l) for l in target]], label=['source', 'target'])
# The vertical axis shows the frequency (number of sentences)
d2l.plt.legend(loc='upper right');
From lists of words to lists of word ids
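def build_vocab(tokens):
    # Flatten the list of sentences into one list of tokens (not counted yet)
    tokens = [token for line in tokens for token in line]
    # Vocab is a helper class from the d2l package.
    # tokens: all words in a single list
    # min_freq=3: words seen fewer than 3 times are left out of the vocabulary
    #   (assuming d2l's Vocab keeps tokens with freq >= min_freq)
    # use_special_tokens: whether to add the special tokens such as pad, bos and eos
    return d2l.data.base.Vocab(tokens, min_freq=3, use_special_tokens=True)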
src_vocab = build_vocab(source)
len(src_vocab)
print(src_vocab['.'])
print(src_vocab.pad)
4
0
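To make the role of `min_freq` and the special tokens concrete, here is a minimal sketch of a vocabulary class. `TinyVocab` is a hypothetical stand-in written for illustration, not the d2l implementation; the id layout (pad=0, bos=1, eos=2, unk=3) is an assumption chosen to be consistent with the ids printed in this section.

```python
import collections

class TinyVocab:
    """Hypothetical stand-in for d2l's Vocab, for illustration only."""
    def __init__(self, tokens, min_freq=0):
        # Assumed layout of the special tokens
        self.pad, self.bos, self.eos, self.unk = 0, 1, 2, 3
        self.idx_to_token = ['<pad>', '<bos>', '<eos>', '<unk>']
        counter = collections.Counter(tokens)
        # Most frequent tokens first; rare tokens are dropped
        for token, freq in sorted(counter.items(), key=lambda kv: -kv[1]):
            if freq >= min_freq:
                self.idx_to_token.append(token)
        self.token_to_idx = {t: i for i, t in enumerate(self.idx_to_token)}

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, tokens):
        # Accept a single token or a list of tokens; unknown words map to unk
        if isinstance(tokens, (list, tuple)):
            return [self[t] for t in tokens]
        return self.token_to_idx.get(tokens, self.unk)

sentences = [['go', '.'], ['hi', '.'], ['hi', '.'], ['run', '!']]
vocab = TinyVocab([t for s in sentences for t in s], min_freq=2)
print(len(vocab))                 # 6: four special tokens plus '.' and 'hi'
print(vocab[['hi', '.', 'run']])  # [5, 4, 3]: 'run' is too rare and maps to unk
```

Dropping rare words keeps the output layer of the decoder small, at the cost of mapping those words to the unknown token.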
def pad(line, max_len, padding_token):
    if len(line) > max_len:
        return line[:max_len]
    return line + [padding_token] * (max_len - len(line))
print(source[0])
# src_vocab[source[0]] maps the tokens to ids;
# pad() truncates sequences longer than max_len and pads shorter ones
pad(src_vocab[source[0]], 10, src_vocab.pad)
['go', '.']
[38, 4, 0, 0, 0, 0, 0, 0, 0, 0]
def build_array(lines, vocab, max_len, is_source):
    lines = [vocab[line] for line in lines]  # map every sentence to a list of ids
    if not is_source:
        # Target (French) sentences are wrapped in <bos> ... <eos>
        lines = [[vocab.bos] + line + [vocab.eos] for line in lines]
    # Pad or truncate every sentence to max_len
    array = torch.tensor([pad(line, max_len, vocab.pad) for line in lines])
    # Valid length: number of non-padding tokens in each row (sum over dim 1)
    valid_len = (array != vocab.pad).sum(1)
    return array, valid_len
# The valid lengths let the loss ignore the padded positions later on
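One consequence of wrapping before truncating is easy to miss: when a target sentence is longer than `max_len - 2`, the `<eos>` id is cut off. The sketch below replays the same steps on plain lists; `encode_target` and the id values are illustrative assumptions, not part of the notebook.

```python
PAD, BOS, EOS = 0, 1, 2  # ids assumed to match the special tokens above

def pad(line, max_len, padding_token):
    if len(line) > max_len:
        return line[:max_len]
    return line + [padding_token] * (max_len - len(line))

def encode_target(ids, max_len):
    """Wrap in <bos>/<eos>, pad or truncate, and count the non-padding ids."""
    row = pad([BOS] + ids + [EOS], max_len, PAD)
    valid_len = sum(1 for i in row if i != PAD)
    return row, valid_len

short_row, short_len = encode_target([50, 12, 4], max_len=8)
long_row, long_len = encode_target([50, 12, 4, 7, 9, 11, 13], max_len=8)
print(short_row, short_len)  # [1, 50, 12, 4, 2, 0, 0, 0] 5
print(long_row, long_len)    # [1, 50, 12, 4, 7, 9, 11, 13] 8 (no <eos> left)
```

For the short sentences used here this rarely matters, but with a small `max_len` the model sees some targets that never end in `<eos>`.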
def load_data_nmt(batch_size, max_len): # This function is saved in d2l.
    # Build the vocabularies for the source and the target
    src_vocab, tgt_vocab = build_vocab(source), build_vocab(target)
    # Index arrays plus the number of valid tokens in each sentence
    src_array, src_valid_len = build_array(source, src_vocab, max_len, True)
    tgt_array, tgt_valid_len = build_array(target, tgt_vocab, max_len, False)
    # TensorDataset accepts any number of tensors and checks that they all
    # have the same size along dimension 0, i.e. one entry per sentence pair
    train_data = data.TensorDataset(src_array, src_valid_len, tgt_array, tgt_valid_len)
    train_iter = data.DataLoader(train_data, batch_size, shuffle=True)
    return src_vocab, tgt_vocab, train_iter
src_vocab, tgt_vocab, train_iter = load_data_nmt(batch_size=2, max_len=8)
# Each iteration yields one minibatch; print the first one
for X, X_valid_len, Y, Y_valid_len, in train_iter:
    print('X =', X.type(torch.int32), '\nValid lengths for X =', X_valid_len,
          '\nY =', Y.type(torch.int32), '\nValid lengths for Y =', Y_valid_len)
    break
X = tensor([[ 62, 57, 100, 10, 111, 4, 0, 0],
[ 31, 83, 7, 382, 4, 0, 0, 0]], dtype=torch.int32)
Valid lengths for X = tensor([6, 5])
Y = tensor([[ 1, 1586, 139, 82, 155, 13, 2, 0],
[ 1, 50, 1119, 12, 322, 4, 2, 0]], dtype=torch.int32)
Valid lengths for Y = tensor([7, 7])
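Encoder: maps the input to a hidden state.
Decoder: maps the hidden state to the output.
The encoder exploits the ability of an RNN to compress a sequence into a representation. First, each word of the source sentence is represented as a vector whose dimension equals the vocabulary size V. Exactly one entry is 1 and all others are 0, and the position of the 1 is the word's index in the vocabulary. Such a vector is usually called a one-hot vector, or 1-of-K coding. It corresponds one-to-one with the vocabulary entries and identifies a word uniquely, but it is impractical because:
the dimension is usually very large, which easily leads to the curse of dimensionality;
it cannot capture relationships between words (for example semantic similarity), so it expresses meaning poorly.
The next step is therefore to map each word into a low-dimensional semantic space, where every word is represented by a dense vector of fixed dimension (also called a distributed representation), i.e. a word vector. The dimension K of the word vectors is typically between 100 and 500. The word vectors are updated during the training of the whole translation system and gradually become more "meaningful".
Word vectors carry linguistic meaning that cannot be interpreted directly, but the distance between vectors measures the similarity of words to some extent. Because they usually have at least 100 dimensions, they cannot be drawn on paper or shown on screen directly; PCA, t-SNE and similar methods can project them into a low-dimensional space. The compressed vector can indeed preserve the semantic information of the source sentence: the closer two sentences are in meaning, the closer they lie in the space. The traditional bag-of-words model cannot capture this. For example, two sentences that differ only in swapped subject and object are recognized as semantically different.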
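The difference between the two representations can be seen with cosine similarity. In the sketch below the three-word vocabulary and the 2-dimensional embedding values are hand-picked assumptions for illustration; real embeddings are learned during training.

```python
import math

vocab = ['go', 'run', 'stop']

def one_hot(word):
    """Vector of size len(vocab) with a single 1 at the word's index."""
    return [1.0 if w == word else 0.0 for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical dense vectors, chosen by hand for this example
embedding = {'go': [0.9, 0.1], 'run': [0.8, 0.3], 'stop': [-0.7, 0.2]}

sim_one_hot = cosine(one_hot('go'), one_hot('run'))
sim_close = cosine(embedding['go'], embedding['run'])
sim_far = cosine(embedding['go'], embedding['stop'])
print(sim_one_hot)              # 0.0: distinct one-hot vectors are always orthogonal
print(round(sim_close, 2))      # close to 1 for the two motion verbs
print(round(sim_far, 2))        # negative for the opposite meaning
```

Any two different one-hot vectors have similarity 0, so the encoding carries no notion of relatedness, whereas dense vectors can place related words near each other.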
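The decoder is likewise implemented with an RNN.
The encoder-decoder model is a classic, but it has severe limitations. The biggest one is that the only link between encoding and decoding is a fixed-length semantic vector C; the encoder has to compress the information of the entire sequence into that one vector. This has two drawbacks: the semantic vector cannot fully represent the whole sequence, and the information carried by earlier inputs is diluted, or overwritten, by later inputs. The longer the input sequence, the worse this becomes. As a result the decoder does not receive enough information about the input sequence from the very start, and decoding accuracy suffers accordingly.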
class Encoder(nn.Module):
    def __init__(self, **kwargs):
        super(Encoder, self).__init__(**kwargs)
    def forward(self, X, *args):
        raise NotImplementedError
class Decoder(nn.Module):
    def __init__(self, **kwargs):
        super(Decoder, self).__init__(**kwargs)
    def init_state(self, enc_outputs, *args):
        raise NotImplementedError
    def forward(self, X, state):
        raise NotImplementedError
class EncoderDecoder(nn.Module):
    def __init__(self, encoder, decoder, **kwargs):
        super(EncoderDecoder, self).__init__(**kwargs)
        self.encoder = encoder
        self.decoder = decoder
    def forward(self, enc_X, dec_X, *args):
        enc_outputs = self.encoder(enc_X, *args)
        dec_state = self.decoder.init_state(enc_outputs, *args)
        return self.decoder(dec_X, dec_state)
The encoder-decoder architecture can be applied to dialogue systems and other generative tasks.
Training
Prediction
class Seq2SeqEncoder(d2l.Encoder):
    # Encoder
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                 dropout=0, **kwargs):
        super(Seq2SeqEncoder, self).__init__(**kwargs)
        self.num_hiddens = num_hiddens  # number of hidden units
        self.num_layers = num_layers  # number of layers
        # nn.Embedding is a lookup table of fixed size that stores the word embeddings:
        # the input is a tensor of indices, the output the matching embedding vectors.
        # num_embeddings (int) - size of the vocabulary
        # embedding_dim (int) - size of each embedding vector
        self.embedding = nn.Embedding(vocab_size, embed_size)  # a 10 x 8 table in the example below
        self.rnn = nn.LSTM(embed_size, num_hiddens, num_layers, dropout=dropout)  # deep (multi-layer) LSTM
    def begin_state(self, batch_size, device):
        # Initial hidden state and memory cell, each of shape
        # (num_layers, batch_size, num_hiddens)
        return [torch.zeros(size=(self.num_layers, batch_size, self.num_hiddens), device=device),
                torch.zeros(size=(self.num_layers, batch_size, self.num_hiddens), device=device)]
    def forward(self, X, *args):
        # Input X: (batch_size, seq_len), 4 x 7 in the example below
        X = self.embedding(X) # X shape: (batch_size, seq_len, embed_size)
        # The input is processed one time step at a time, so swap the batch and time axes
        X = X.transpose(0, 1) # RNN needs first axes to be time
        # No initial state is passed, so nn.LSTM starts from zeros
        out, state = self.rnn(X)
        # The shape of out is (seq_len, batch_size, num_hiddens), 7 x 4 x 16 below.
        # state contains the hidden state and the memory cell
        # of the last time step, the shape is (num_layers, batch_size, num_hiddens)
        return out, state
encoder = Seq2SeqEncoder(vocab_size=10, embed_size=8,num_hiddens=16, num_layers=2)
# Vocabulary of 10 tokens, 8-dimensional embeddings, 16 hidden units, 2 layers
X = torch.zeros((4, 7),dtype=torch.long)
output, state = encoder(X)
output.shape, len(state), state[0].shape, state[1].shape
(torch.Size([7, 4, 16]), 2, torch.Size([2, 4, 16]), torch.Size([2, 4, 16]))
class Seq2SeqDecoder(d2l.Decoder):
    # Decoder
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                 dropout=0, **kwargs):
        super(Seq2SeqDecoder, self).__init__(**kwargs)
        # Lookup table: vocabulary size x embedding dimension
        # num_embeddings (int) - size of the vocabulary
        # embedding_dim (int) - size of each embedding vector
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.LSTM(embed_size, num_hiddens, num_layers, dropout=dropout)
        # Fully connected layer applied at every time step; it produces
        # one score per vocabulary entry (10 in the example below)
        self.dense = nn.Linear(num_hiddens, vocab_size)
    def init_state(self, enc_outputs, *args):
        # Use the encoder's final hidden state and memory cell as the initial state
        return enc_outputs[1]
    def forward(self, X, state):
        # Embed the input and move time to the first axis, as in the encoder
        X = self.embedding(X).transpose(0, 1)
        out, state = self.rnn(X, state)
        # Make the batch to be the first dimension to simplify loss computation.
        # out: (batch_size, seq_len, vocab_size), 4 x 7 x 10 below: each of the
        # 4 x 7 positions gets one score for each of the 10 candidate words
        out = self.dense(out).transpose(0, 1)
        return out, state
decoder = Seq2SeqDecoder(vocab_size=10, embed_size=8,num_hiddens=16, num_layers=2)
state = decoder.init_state(encoder(X))  # reuse the encoder's final state directly
out, state = decoder(X, state)
out.shape, len(state), state[0].shape, state[1].shape
(torch.Size([4, 7, 10]), 2, torch.Size([2, 4, 16]), torch.Size([2, 4, 16]))
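# Positions that belong to padding must not contribute to the loss
def SequenceMask(X, X_len, value=0):
    '''
    X: the tensor to mask, shape (batch_size, seq_len, ...)
    X_len: valid length of each sequence in X, obtained from the target
    Note that X is modified in place.
    '''
    maxlen = X.size(1)
    # torch.arange creates the indices on the CPU, so move them to X_len's device
    mask = torch.arange(maxlen)[None, :].to(X_len.device) < X_len[:, None]
    X[~mask] = value
    return X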
X = torch.tensor([[1,2,3], [4,5,6]])
print(torch.tensor([1,2])[:, None] )
print(torch.arange(3)[None, :].to(torch.tensor([1,2]).device))
SequenceMask(X,torch.tensor([1,2]))
# Mask out the padded positions
tensor([[1],
[2]])
tensor([[0, 1, 2]])
tensor([[1, 0, 0],
[4, 5, 0]])
X = torch.ones((2,3, 4))
SequenceMask(X, torch.tensor([1,2]),value=-1)
tensor([[[ 1., 1., 1., 1.],
[-1., -1., -1., -1.],
[-1., -1., -1., -1.]],
[[ 1., 1., 1., 1.],
[ 1., 1., 1., 1.],
[-1., -1., -1., -1.]]])
class MaskedSoftmaxCELoss(nn.CrossEntropyLoss):
    # pred shape: (batch_size, seq_len, vocab_size), the scores produced by the decoder
    # label shape: (batch_size, seq_len)
    # valid_length shape: (batch_size, )
    def forward(self, pred, label, valid_length):
        # the sample weights shape should be (batch_size, seq_len)
        weights = torch.ones_like(label)  # start with all weights equal to 1
        weights = SequenceMask(weights, valid_length).float()  # set the padded positions to 0
        self.reduction = 'none'
        # CrossEntropyLoss expects the class axis at dim 1, hence the transpose
        output = super(MaskedSoftmaxCELoss, self).forward(pred.transpose(1, 2), label)
        return (output * weights).mean(dim=1)
loss = MaskedSoftmaxCELoss()
loss(torch.ones((3, 4, 10)), torch.ones((3,4),dtype=torch.long), torch.tensor([4,3,0]))
tensor([2.3026, 1.7269, 0.0000])
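The three numbers in this output can be reproduced by hand. With identical scores for all 10 classes the softmax is uniform, so each position costs -log(1/10); the masked values are then averaged over the full sequence length of 4. The sketch below is plain Python arithmetic, and `masked_mean` is an illustrative helper rather than part of the notebook.

```python
import math

vocab_size, seq_len = 10, 4

# Equal scores give a uniform softmax, so every position costs -log(1/10)
per_token = -math.log(1.0 / vocab_size)

def masked_mean(valid_length):
    """Zero the loss of padded positions, then average over all seq_len positions."""
    weights = [1.0 if t < valid_length else 0.0 for t in range(seq_len)]
    return sum(per_token * w for w in weights) / seq_len

results = [round(masked_mean(n), 4) for n in (4, 3, 0)]
print(results)  # [2.3026, 1.7269, 0.0]
```

Because the mean divides by `seq_len` rather than by the valid length, a sentence with 3 valid tokens out of 4 contributes 3/4 of the per-token loss.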
def train_ch7(model, data_iter, lr, num_epochs, device): # Saved in d2l
    # Move the model, and later every batch, to the device
    model.to(device)
    optimizer = optim.Adam(model.parameters(), lr=lr)
    loss = MaskedSoftmaxCELoss()
    tic = time.time()
    for epoch in range(1, num_epochs+1):
        l_sum, num_tokens_sum = 0.0, 0.0
        for batch in data_iter:
            optimizer.zero_grad()
            X, X_vlen, Y, Y_vlen = [x.to(device) for x in batch]
            # Y       = <bos> words <eos>
            # Y_input = <bos> words         (decoder input)
            # Y_label =       words <eos>   (prediction target)
            # Y_vlen - 1: valid length once one token has been dropped
            Y_input, Y_label, Y_vlen = Y[:,:-1], Y[:,1:], Y_vlen-1
            Y_hat, _ = model(X, Y_input, X_vlen, Y_vlen)
            l = loss(Y_hat, Y_label, Y_vlen).sum()
            l.backward()
            with torch.no_grad():
                # Clip the gradients (threshold 5) to guard against exploding gradients
                d2l.grad_clipping_nn(model, 5, device)
            num_tokens = Y_vlen.sum().item()
            optimizer.step()
            l_sum += l.sum().item()
            num_tokens_sum += num_tokens
        if epoch % 50 == 0:
            print("epoch {0:4d},loss {1:.3f}, time {2:.1f} sec".format(
                epoch, (l_sum/num_tokens_sum), time.time()-tic))
            tic = time.time()
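The slicing `Y[:,:-1]`, `Y[:,1:]`, `Y_vlen-1` in the training loop implements teacher forcing: at every step the decoder receives the correct previous token and is asked to predict the next one. The sketch below replays that shift on one row of the batch printed earlier, using plain lists; the special-token ids are assumed to be pad=0, bos=1, eos=2.

```python
PAD, BOS, EOS = 0, 1, 2  # assumed ids of the special tokens

# One target row from the minibatch printed earlier
Y = [BOS, 50, 1119, 12, 322, 4, EOS, PAD]
Y_vlen = sum(1 for t in Y if t != PAD)  # 7

Y_input = Y[:-1]          # what the decoder is fed
Y_label = Y[1:]           # what the decoder should predict
label_vlen = Y_vlen - 1   # number of label positions that count

for step, (inp, lab) in enumerate(zip(Y_input, Y_label)):
    status = 'counted' if step < label_vlen else 'masked'
    print(step, inp, '->', lab, status)
```

The last counted pair is `4 -> 2`, so the model learns to emit `<eos>`; the final pair `2 -> 0` only involves padding and is masked out of the loss.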
embed_size, num_hiddens, num_layers, dropout = 32, 32, 2, 0.0
# dropout = 0.0: no units are dropped
batch_size, num_examples, max_len = 64, 1e3, 10
lr, num_epochs, ctx = 0.005, 300, d2l.try_gpu()
# Load the data
src_vocab, tgt_vocab, train_iter = d2l.load_data_nmt(
    batch_size, max_len,num_examples)
# Initialize the encoder and the decoder
encoder = Seq2SeqEncoder(
    len(src_vocab), embed_size, num_hiddens, num_layers, dropout)
decoder = Seq2SeqDecoder(
    len(tgt_vocab), embed_size, num_hiddens, num_layers, dropout)
model = d2l.EncoderDecoder(encoder, decoder)
train_ch7(model, train_iter, lr, num_epochs, ctx)
epoch 50,loss 0.093, time 27.4 sec
epoch 100,loss 0.042, time 27.7 sec
epoch 150,loss 0.030, time 27.6 sec
epoch 200,loss 0.026, time 27.5 sec
epoch 250,loss 0.025, time 27.6 sec
epoch 300,loss 0.024, time 27.5 sec
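The training loop above calls `d2l.grad_clipping_nn(model, 5, device)` before each optimizer step. Its source is not shown here; the sketch below illustrates clipping by global L2 norm, which is the usual technique for RNNs and is only assumed to be what that helper does. `clip_by_global_norm` is a hypothetical name and operates on a flat list of numbers instead of model parameters.

```python
import math

def clip_by_global_norm(grads, theta):
    """Rescale grads so that their L2 norm is at most theta."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > theta:
        scale = theta / norm
        grads = [g * scale for g in grads]
    return grads, norm

clipped, norm = clip_by_global_norm([6.0, 8.0], theta=5.0)
print(norm)     # 10.0
print(clipped)  # [3.0, 4.0]: same direction, norm reduced to 5

small, small_norm = clip_by_global_norm([0.3, 0.4], theta=5.0)
print(small)    # [0.3, 0.4]: already within the threshold, left unchanged
```

Rescaling keeps the direction of the update and only limits its size, which is why it is preferred over clipping each component independently.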
def translate_ch7(model, src_sentence, src_vocab, tgt_vocab, max_len, device):
    src_tokens = src_vocab[src_sentence.lower().split(' ')]  # list of token ids
    src_len = len(src_tokens)
    if src_len < max_len:
        src_tokens += [src_vocab.pad] * (max_len - src_len)
    # enc_X holds the padded source ids,
    # enc_valid_length the number of real (non-padding) tokens
    enc_X = torch.tensor(src_tokens, device=device)
    enc_valid_length = torch.tensor([src_len], device=device)
    # use unsqueeze to add the batch_size dimension.
    enc_outputs = model.encoder(enc_X.unsqueeze(dim=0), enc_valid_length)
    dec_state = model.decoder.init_state(enc_outputs, enc_valid_length)
    dec_X = torch.tensor([tgt_vocab.bos], device=device).unsqueeze(dim=0)
    predict_tokens = []
    for _ in range(max_len):
        Y, dec_state = model.decoder(dec_X, dec_state)
        # The token with highest score is used as the next time step input.
        dec_X = Y.argmax(dim=2)
        py = dec_X.squeeze(dim=0).int().item()
        if py == tgt_vocab.eos:
            break
        predict_tokens.append(py)
    return ' '.join(tgt_vocab.to_tokens(predict_tokens))
for sentence in ['Go .', 'Wow !', "I'm OK .", 'I won !']:
    print(sentence + ' => ' + translate_ch7(
        model, sentence, src_vocab, tgt_vocab, max_len, ctx))
Go . => va !
Wow ! => !
I'm OK . => ça va .
I won ! => j'ai gagné !
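Simple greedy search: only considers the locally optimal choice at each step.
Viterbi algorithm: selects the sentence with the highest overall score, but the search space is too large.
Beam search: keeps only the k highest-scoring partial sequences at each step, a compromise between the two.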
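The difference between the search strategies shows up even on a two-step toy problem. In the sketch below, `NEXT` is a hypothetical table of next-token probabilities that stands in for the decoder; the numbers are made up so that the locally best first token does not lead to the best sentence. Greedy search is simply beam search with a beam size of 1.

```python
import math

# Hypothetical toy model: next-token probabilities given the prefix so far
NEXT = {
    (): {'a': 0.6, 'b': 0.4},
    ('a',): {'c': 0.55, 'd': 0.45},
    ('b',): {'c': 0.9, 'd': 0.1},
}

def beam_search(beam_size, steps=2):
    """Keep the beam_size best prefixes (by log-probability) at every step."""
    beams = [((), 0.0)]  # (prefix, log-probability)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for token, p in NEXT[prefix].items():
                candidates.append((prefix + (token,), score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams[0]

greedy_seq, greedy_lp = beam_search(beam_size=1)
beam_seq, beam_lp = beam_search(beam_size=2)
print(greedy_seq, round(math.exp(greedy_lp), 2))  # ('a', 'c') 0.33
print(beam_seq, round(math.exp(beam_lp), 2))      # ('b', 'c') 0.36
```

Greedy search commits to 'a' because 0.6 > 0.4 and ends with probability 0.33, while a beam of size 2 keeps 'b' alive and finds the better sentence with probability 0.36. Scoring every possible sentence would also find it, but the number of candidates grows exponentially with the sentence length.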
Data preprocessing
Building the Seq2Seq model
Loss function
Testing