Seq2Seq models are used for sequence-to-sequence problems in NLP. Seq2Seq is a common encoder-decoder architecture that is built on RNNs while removing one of the plain RNN's limitations (the input and output must have the same length). For the architecture itself you can refer to the article Seq2Seq详解, or read the original paper, Sequence to Sequence Learning with Neural Networks. This post focuses on how to implement a Seq2Seq model in PyTorch.
The dataset used here is deliberately tiny, because the goal is simply to get hands-on practice with Seq2Seq and thereby better understand how NLP models are built and trained.
First, build the vocabulary
Build an alphabet (actually two lookup structures: char_list maps an index to a character, and char_dic maps a character back to its index, so characters can later be retrieved by index).
char_list = [c for c in 'SEPabcdefghijklmnopqrstuvwxyz']   # 'S': decoder start, 'E': end of sequence, 'P': padding
char_dic = {n: i for i, n in enumerate(char_list)}          # character -> index
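As a quick sanity check (a throwaway snippet, not part of the model code), the vocabulary contains 29 characters: the three special symbols plus the 26 lowercase letters:
print(len(char_list))                                 # 29
print(char_dic['S'], char_dic['E'], char_dic['P'])    # 0 1 2
print(char_dic['m'], char_list[15])                   # 15 m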
Manually create the dataset
seq_data = [['man', 'women'], ['black', 'white'], ['king', 'queen'], ['girl', 'boy'], ['up', 'down'], ['high', 'low']]
The dataset contains only six word pairs; with a proper dataset the model would of course train better.
word embedding
The encoding used here is one-hot. For each pair in the dataset, the first word is fed to the encoder as its input, the second word is fed to the decoder as its input, and the same second word is also used as the target when computing the loss.
Note that, taking the encoder input as an example, all of the input vectors in the dataset eventually have to be stacked into one big tensor, so the word vector fed in at each time step must have the same dimension for every sample; the same holds for output and target. Since the words do not all have the same length, we set a maximum word length seq_len and pad every word up to that length with the uppercase letter 'P'.
def make_batch(seq_data):
    batch_size = len(seq_data)
    input_batch, output_batch, target_batch = [], [], []
    for seq in seq_data:
        for i in range(2):
            seq[i] += 'P' * (seq_len - len(seq[i]))         # pad both words to seq_len with 'P'
        input = [char_dic[n] for n in seq[0]]               # encoder input: the first word
        output = [char_dic[n] for n in ('S' + seq[1])]      # decoder input: the second word prefixed with 'S'
        target = [char_dic[n] for n in (seq[1] + 'E')]      # loss target: the second word followed by 'E'
        input_batch.append(np.eye(n_class)[input])          # one-hot encode via rows of an identity matrix
        output_batch.append(np.eye(n_class)[output])
        target_batch.append(target)                         # targets stay as class indices
    return Variable(torch.Tensor(input_batch)), Variable(torch.Tensor(output_batch)), Variable(torch.LongTensor(target_batch))
The resulting tensors have the following shapes: input_batch is (batch_size, seq_len, n_class), i.e. (number of training pairs, maximum word length, vocabulary size); output_batch is (batch_size, seq_len + 1, n_class), with one extra step for the leading 'S'; and target_batch is (batch_size, seq_len + 1), holding class indices rather than one-hot vectors.
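A quick way to confirm those shapes (a hypothetical check, assuming seq_len = 8 and n_class = len(char_list) = 29 as in the complete listing at the end of the post):
input_batch, output_batch, target_batch = make_batch(seq_data)
print(input_batch.shape)    # torch.Size([6, 8, 29])  -> (batch_size, seq_len, n_class)
print(output_batch.shape)   # torch.Size([6, 9, 29])  -> one extra step for the leading 'S'
print(target_batch.shape)   # torch.Size([6, 9])      -> class indices, not one-hot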
A Seq2Seq model consists of an encoder and a decoder. The encoder turns the inputs of all time steps into a single vector C, the context (semantic) vector, which summarizes the information of the whole input word. The decoder is responsible for decoding C into the output sequence.
The encoder receives the input tensor together with an initial hidden state created beforehand (here an all-zero tensor);
the decoder receives the context vector C produced by the encoder (as its initial hidden state) together with the decoder input sequence, i.e. the output word prefixed with 'S'.
When feeding the input into the encoder, the first and second dimensions of the tensor have to be swapped: by default, an RNN in PyTorch expects input of shape (seq_len, batch_size, n_class), while the tensors we built above have shape (batch_size, seq_len, n_class). The input/output dimensions of PyTorch RNNs are explained in more detail in this article.
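For reference, here is a minimal illustration of that shape convention with a standalone nn.RNN (the numbers simply mirror our setting of n_class = 29, n_hidden = 128 and batch_size = 6):
rnn = nn.RNN(input_size=29, hidden_size=128)   # batch_first=False by default
x = torch.randn(8, 6, 29)                      # (seq_len, batch_size, input_size)
h0 = torch.zeros(1, 6, 128)                    # (num_layers, batch_size, hidden_size)
out, hn = rnn(x, h0)
print(out.shape, hn.shape)                     # torch.Size([8, 6, 128]) torch.Size([1, 6, 128])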
class Seq2Seq(nn.Module):
    def __init__(self):
        super(Seq2Seq, self).__init__()
        self.encoder = nn.RNN(input_size=n_class, hidden_size=n_hidden)   # encoder RNN
        self.decoder = nn.RNN(input_size=n_class, hidden_size=n_hidden)   # decoder RNN
        self.fc = nn.Linear(n_hidden, n_class)                            # project hidden states to vocabulary logits
    def forward(self, enc_input, enc_hidden, dec_input):
        enc_input = enc_input.transpose(0, 1)   # swap the first two dims: (batch, seq_len, n_class) -> (seq_len, batch, n_class)
        dec_input = dec_input.transpose(0, 1)   # (batch, seq_len+1, n_class) -> (seq_len+1, batch, n_class)
        _, h_states = self.encoder(enc_input, enc_hidden)   # h_states is the context vector C
        outputs, _ = self.decoder(dec_input, h_states)      # decode with C as the initial hidden state
        outputs = self.fc(outputs)                          # (seq_len+1, batch, n_class)
        return outputs
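A quick shape check of the forward pass (a throwaway snippet; it assumes the batches built by make_batch above and the hyperparameters from the complete listing):
model = Seq2Seq()
hidden = torch.zeros(1, batch_size, n_hidden)       # (num_layers, batch_size, n_hidden)
out = model(input_batch, hidden, output_batch)
print(out.shape)                                    # torch.Size([9, 6, 29]) -> (seq_len + 1, batch_size, n_class)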
Before training, define the loss function and the optimizer; the learning rate is set to 0.001. Also note that an initial hidden state has to be created and passed to the encoder's RNN; its shape is (1, batch_size, n_hidden).
model = Seq2Seq()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
for epoch in range(5001):
    hidden = Variable(torch.zeros(1, batch_size, n_hidden))   # initial hidden state for the encoder
    optimizer.zero_grad()
    outputs = model(input_batch, hidden, output_batch)
    outputs = outputs.transpose(0, 1)                         # back to (batch, seq_len+1, n_class)
    loss = 0
    for i in range(batch_size):
        loss += criterion(outputs[i], target_batch[i])        # per-sample cross entropy, summed over the batch
    if (epoch % 500) == 0:
        print('epoch:{},loss:{}'.format(epoch, loss))
    loss.backward()
    optimizer.step()
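The per-sample loop is there because nn.CrossEntropyLoss expects 2-D logits of shape (N, n_class) against 1-D integer targets of shape (N,). An equivalent (hypothetical) alternative is to flatten the batch and time dimensions into one call; note that this averages over all positions instead of summing per-sample means, so the printed loss values will differ by a constant factor:
loss = criterion(outputs.reshape(-1, n_class), target_batch.reshape(-1))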
Once the model is trained, we can test it. As before, the input word has to be converted into tensors, and the generated output has to be converted back into characters.
def translated(word):
    # pair the word with a dummy all-'P' "target" so that make_batch can build the decoder input
    input_batch, output_batch, _ = make_batch([[word, 'P' * len(word)]])
    hidden = Variable(torch.zeros(1, 1, n_hidden))        # batch size is 1 at test time
    outputs = model(input_batch, hidden, output_batch)    # (seq_len+1, 1, n_class)
    predict = outputs.data.max(2, keepdim=True)[1]        # most likely character index at every step
    decode = [char_list[int(i)] for i in predict]
    end = decode.index('P')                               # cut the prediction at the first padding character
    translated = ''.join(decode[:end])
    print(translated)
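With the model trained on the six pairs above, the helper can be called directly (the exact predictions depend on the run, but with such a tiny dataset the model normally memorizes the training pairs):
translated('man')     # expected to print: women
translated('black')   # expected to print: white
translated('up')      # expected to print: down
For reference, the complete code, from the imports to the test helper, is collected below.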
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable
dtype = torch.FloatTensor
char_list = [c for c in 'SEPabcdefghijklmnopqrstuvwxyz']
char_dic = {n:i for i,n in enumerate(char_list)}
seq_data = [['man', 'women'], ['black', 'white'], ['king', 'queen'], ['girl', 'boy'], ['up', 'down'], ['high', 'low']]
seq_len = 8
n_hidden = 128
n_class = len(char_list)
batch_size = len(seq_data)
def make_batch(seq_data):
    batch_size = len(seq_data)
    input_batch, output_batch, target_batch = [], [], []
    for seq in seq_data:
        for i in range(2):
            seq[i] += 'P' * (seq_len - len(seq[i]))
        input = [char_dic[n] for n in seq[0]]
        output = [char_dic[n] for n in ('S' + seq[1])]
        target = [char_dic[n] for n in (seq[1] + 'E')]
        input_batch.append(np.eye(n_class)[input])
        output_batch.append(np.eye(n_class)[output])
        target_batch.append(target)
    return Variable(torch.Tensor(input_batch)), Variable(torch.Tensor(output_batch)), Variable(torch.LongTensor(target_batch))
input_batch,output_batch,target_batch = make_batch(seq_data)
class Seq2Seq(nn.Module):
    def __init__(self):
        super(Seq2Seq, self).__init__()
        self.encoder = nn.RNN(input_size=n_class, hidden_size=n_hidden)
        self.decoder = nn.RNN(input_size=n_class, hidden_size=n_hidden)
        self.fc = nn.Linear(n_hidden, n_class)
    def forward(self, enc_input, enc_hidden, dec_input):
        enc_input = enc_input.transpose(0, 1)
        dec_input = dec_input.transpose(0, 1)
        _, h_states = self.encoder(enc_input, enc_hidden)
        outputs, _ = self.decoder(dec_input, h_states)
        outputs = self.fc(outputs)
        return outputs
model = Seq2Seq()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
for epoch in range(5001):
    hidden = Variable(torch.zeros(1, batch_size, n_hidden))
    optimizer.zero_grad()
    outputs = model(input_batch, hidden, output_batch)
    outputs = outputs.transpose(0, 1)
    loss = 0
    for i in range(batch_size):
        loss += criterion(outputs[i], target_batch[i])
    if (epoch % 500) == 0:
        print('epoch:{},loss:{}'.format(epoch, loss))
    loss.backward()
    optimizer.step()
def translated(word):
    input_batch, output_batch, _ = make_batch([[word, 'P' * len(word)]])
    hidden = Variable(torch.zeros(1, 1, n_hidden))
    outputs = model(input_batch, hidden, output_batch)
    predict = outputs.data.max(2, keepdim=True)[1]
    decode = [char_list[int(i)] for i in predict]
    end = decode.index('P')
    translated = ''.join(decode[:end])
    print(translated)
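To actually exercise the trained model, a few test calls can be appended at the end of the script (a hypothetical test block; the comments show the answers the model is expected to reproduce from the training pairs):
print('test')
translated('man')     # women
translated('king')    # queen
translated('girl')    # boy
translated('up')      # down
translated('high')    # low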