【译】Keras中实现带有attention机制的encoder-decoder rnn 结构

最近正好要实现带有attention机制的encoder-decoder结构,发现Jason大神https://machinelearningmastery.com/encoder-decoder-attention-sequence-to-sequence-prediction-keras/有一篇博客在讲这个,所以翻译一下。

目录

带有注意力机制的编码器-解码器(Encoder-Decoder with Attention)

一个测试注意力机制的问题(Test Problem for Attention)

没有注意力机制的编码-解码(Encoder-Decoder Without Attention)

自定义Keras中的Attention层(Custom Keras Attention Layer)

带有注意力机制的编码器-解码器(Encoder-Decoder With Attention)

模型比较(Comparison of Models)


事实证明,递归神经网络的编码器-解码器体系结构在自然语言处理领域的一系列序列到序列预测问题上非常强大。
注意是一种机制,它解决了编码器-解码器体系结构在长序列上的局限性,并且通常加快了学习速度,提高了序列到序列预测问题模型的技能。

在本教程中,您将了解如何在Keras中实现一个带有注意力机制的编码器-解码器卷积神经网络。

完成本教程后,您将知道:

  • 如何设计一个小而可配置的问题来评估有无注意机制的编码器-解码器循环神经网络。
  • 针对序列预测问题,如何设计和评价一个有无注意机制的编码器-解码器网络。
  • 如何在有无注意机制的情况下,鲁棒比较编码器-解码器网络的性能。

带有注意力机制的编码器-解码器(Encoder-Decoder with Attention)

循环神经网络中的编码器-解码器模型是一种序列到序列预测问题的体系结构。
它由两个子模型组成,顾名思义:

  • 编码器:编码器负责单步执行输入时间步骤,并将整个序列编码为一个固定长度的向量,称为上下文向量(context vector)。
  • 解码器:解码器负责逐步执行输出时间步骤,并从上下文向量读取信息。

该体系结构的一个问题是,长输入或输出序列的性能较差。原因被认为是编码器使用了固定尺寸的上下文向量进行内部表示。
注意力机制是对解决这个限制的体系结构的扩展。它首先提供从编码器到解码器的更丰富的上下文,以及一种学习机制,在该机制中,解码器可以在预测输出序列中的每个时间步的输出时,在更丰富的编码中学习应注意的位置。
有关编码器-解码器体系结构的更多注意事项,请参阅以下文章:

  • Attention in Long Short-Term Memory Recurrent Neural Networks
  • How Does Attention Work in Encoder-Decoder Recurrent Neural Networks

一个测试注意力机制的问题(Test Problem for Attention)

在我们开发带有注意力机制的模型之前,我们首先定义一个人为的可伸缩的测试问题,我们可以用它来确定注意力机制是否带来了任何好处。
在这个问题中,我们将生成随机整数序列作为输入并匹配输出序列,输出序列由输入序列中整数的子集组成。
例如,输入序列可以是[1,6,2,7,3],预期的输出序列可以是序列[1,6]中的前两个随机整数。
我们将定义问题,使输入和输出序列的长度相同,并根据需要用“0”值填充输出序列。
首先,我们需要一个函数来生成随机整数序列。我们将使用python的randint()函数生成介于0和最大值之间的随机整数,并将此范围用作问题的基数(例如,特征数)。

下面的generate_sequence()函数将生成一个具有固定长度和指定基数的随机整数序列。

from random import randint

# generate a sequence of random integers
def generate_sequence(length, n_unique):
	return [randint(0, n_unique-1) for _ in range(length)]

# generate random sequence
sequence = generate_sequence(5, 50)
print(sequence)

运行此示例将生成一个具有5个时间步的序列,其中序列中的每个值都是0到49之间的随机整数。

[43, 3, 28, 34, 33]

接下来,我们需要一个函数将离散整数值独热编码为二进制向量。
如果使用的基数为50,则每个整数将由0和指定整数值索引处的1表示的50-元素的向量。
下面的one_hot_encode()函数将对给定的整数序列进行一次独热编码。

# one hot encode sequence
def one_hot_encode(sequence, n_unique):
	encoding = list()
	for value in sequence:
		vector = [0 for _ in range(n_unique)]
		vector[value] = 1
		encoding.append(vector)
	return array(encoding)

我们还需要能够解码独热编码后的序列。这将需要把从模型的预测或独热编码的预期序列转换回我们可以读取和计算的整数序列。

下面的one_hot_decode()函数将把一个独热编码序列解码回一个整数序列。

# decode a one hot encoded string
def one_hot_decode(encoded_seq):
	return [argmax(vector) for vector in encoded_seq]

我们可以在下面的示例中测试这些操作。

from random import randint
from numpy import array
from numpy import argmax

# generate a sequence of random integers
def generate_sequence(length, n_unique):
	return [randint(0, n_unique-1) for _ in range(length)]

# one hot encode sequence
def one_hot_encode(sequence, n_unique):
	encoding = list()
	for value in sequence:
		vector = [0 for _ in range(n_unique)]
		vector[value] = 1
		encoding.append(vector)
	return array(encoding)

# decode a one hot encoded string
def one_hot_decode(encoded_seq):
	return [argmax(vector) for vector in encoded_seq]  
# argmax()为返回沿轴axis最大值的索引,也就是独热编码后1所在的索引,即原整数数值

# generate random sequence
sequence = generate_sequence(5, 50)
print(sequence)
# one hot encode
encoded = one_hot_encode(sequence, 50)
print(encoded)
# decode
decoded = one_hot_decode(encoded)
print(decoded)

运行这个示例首先打印一个随机生成的序列,然后打印一个独热编码的版本,最后再次打印解码的序列。

[3, 18, 32, 11, 36]
[[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]]
[3, 18, 32, 11, 36]

最后,我们需要一个可以创建输入和输出序列对的函数来训练和评估模型。
下面名为get_pair()的函数将返回一个给定了输入长度、输出长度和基数的,输入-输出序列对。输入序列和输出序列的长度相同,即等于输入序列的长度,但输出序列将表示为输入序列的前N个字符,并用零值填充到所需的长度。
然后对整数序列进行独热编码,然后将其reshape为循环神经网络所需的三维格式,其尺寸为:样本、时间步和特征。在这种情况下,样本总是1,因为我们只生成一个输入-输出对,时间步数是输入序列长度,特征是每个时间步数的基数。

# prepare data for the LSTM
def get_pair(n_in, n_out, n_unique):
	# generate random sequence
	sequence_in = generate_sequence(n_in, n_unique)
	sequence_out = sequence_in[:n_out] + [0 for _ in range(n_in-n_out)]
	# one hot encode
	X = one_hot_encode(sequence_in, n_unique)
	y = one_hot_encode(sequence_out, n_unique)
	# reshape as 3D
	X = X.reshape((1, X.shape[0], X.shape[1]))
	y = y.reshape((1, y.shape[0], y.shape[1]))
	return X,y

我们可以把这些放在一起并演示数据准备部分的代码。

from random import randint
from numpy import array
from numpy import argmax

# generate a sequence of random integers
def generate_sequence(length, n_unique):
	return [randint(0, n_unique-1) for _ in range(length)]

# one hot encode sequence
def one_hot_encode(sequence, n_unique):
	encoding = list()
	for value in sequence:
		vector = [0 for _ in range(n_unique)]
		vector[value] = 1
		encoding.append(vector)
	return array(encoding)

# decode a one hot encoded string
def one_hot_decode(encoded_seq):
	return [argmax(vector) for vector in encoded_seq]

# prepare data for the LSTM
def get_pair(n_in, n_out, n_unique):
	# generate random sequence
	sequence_in = generate_sequence(n_in, n_unique)
	sequence_out = sequence_in[:n_out] + [0 for _ in range(n_in-n_out)]
	# one hot encode
	X = one_hot_encode(sequence_in, n_unique)
	y = one_hot_encode(sequence_out, n_unique)
	# reshape as 3D
	X = X.reshape((1, X.shape[0], X.shape[1]))
	y = y.reshape((1, y.shape[0], y.shape[1]))
	return X,y

# generate random sequence
X, y = get_pair(5, 2, 50)
print(X.shape, y.shape)
print('X=%s, y=%s' % (one_hot_decode(X[0]), one_hot_decode(y[0])))

运行该示例将生成一个输入输出对,并打印两个数组的形状。
然后以解码的形式打印生成的对,我们可以看到序列的前两个整数在输出序列中被复制,后面是零值的填充。

(1, 5, 50) (1, 5, 50)
X=[12, 20, 36, 40, 12], y=[12, 20, 0, 0, 0]

没有注意力机制的编码-解码(Encoder-Decoder Without Attention)

在这一部分中,我们将开发一个解决上述问题的编码器-解码器模型作为评估的基准。
首先,我们将重复问题的定义:5个时间步长的输入和输出序列,输出序列中的值为输入序列的前2个元素,以50为基数(特征的个数)。

# configure problem
n_features = 50
n_timesteps_in = 5
n_timesteps_out = 2

我们可以通过从编码器lstm模型中获取输出编码向量,并按照输出序列中的时间步数重复输入编码向量n次,然后使用解码器预测输出序列,从而在keras中开发一个简单的编码器-解码器模型。
有关如何在Keras中定义编码器-解码器体系结构的详细信息,请参阅以下文章:

  • Encoder-Decoder Long Short-Term Memory Networks

我们将使用相同数量的单元(units,代表循环神经网络中隐藏层的单元个数,也代表了循环神经网络的输出维度)配置编码器和解码器,在本例中为150。我们将利用Adam有效实现梯度下降,并优化分类交叉熵损失函数,前提是该问题在技术上是一个多分类问题。
模型的配置是经过一点尝试和错误后发现的,决不是最优化的。
下面列出了keras中编码器-解码器体系结构的代码。

# define model
model = Sequential()
model.add(LSTM(150, input_shape=(n_timesteps_in, n_features)))
model.add(RepeatVector(n_timesteps_in))
model.add(LSTM(150, return_sequences=True))
model.add(TimeDistributed(Dense(n_features, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])

我们将在5000个整数序列的随机输入输出对上训练模型。

# train LSTM
for epoch in range(5000):
	# generate new random sequence
	X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
	# fit model for one epoch on this sequence
	model.fit(X, y, epochs=1, verbose=2)

经过训练后,我们将在100个新的随机生成的整数序列上评估模型,并且仅在整个输出序列与预期值匹配时标记预测正确。

# evaluate LSTM
total, correct = 100, 0
for _ in range(total):
	X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
	yhat = model.predict(X, verbose=0)
	if array_equal(one_hot_decode(y[0]), one_hot_decode(yhat[0])):
		correct += 1
print('Accuracy: %.2f%%' % (float(correct)/float(total)*100.0))

最后,我们将打印10个期望输出序列和模型预测序列的例子。
将所有这些放在一起,下面列出了完整的示例。

from random import randint
from numpy import array
from numpy import argmax
from numpy import array_equal
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import TimeDistributed
from keras.layers import RepeatVector

# generate a sequence of random integers
def generate_sequence(length, n_unique):
	return [randint(0, n_unique-1) for _ in range(length)]

# one hot encode sequence
def one_hot_encode(sequence, n_unique):
	encoding = list()
	for value in sequence:
		vector = [0 for _ in range(n_unique)]
		vector[value] = 1
		encoding.append(vector)
	return array(encoding)

# decode a one hot encoded string
def one_hot_decode(encoded_seq):
	return [argmax(vector) for vector in encoded_seq]

# prepare data for the LSTM
def get_pair(n_in, n_out, cardinality):
	# generate random sequence
	sequence_in = generate_sequence(n_in, cardinality)
	sequence_out = sequence_in[:n_out] + [0 for _ in range(n_in-n_out)]
	# one hot encode
	X = one_hot_encode(sequence_in, cardinality)
	y = one_hot_encode(sequence_out, cardinality)
	# reshape as 3D
	X = X.reshape((1, X.shape[0], X.shape[1]))
	y = y.reshape((1, y.shape[0], y.shape[1]))
	return X,y

# configure problem
n_features = 50
n_timesteps_in = 5
n_timesteps_out = 2
# define model
model = Sequential()
model.add(LSTM(150, input_shape=(n_timesteps_in, n_features)))
model.add(RepeatVector(n_timesteps_in))
model.add(LSTM(150, return_sequences=True))
model.add(TimeDistributed(Dense(n_features, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
# train LSTM
for epoch in range(5000):
	# generate new random sequence
	X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
	# fit model for one epoch on this sequence
	model.fit(X, y, epochs=1, verbose=2)
# evaluate LSTM
total, correct = 100, 0
for _ in range(total):
	X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
	yhat = model.predict(X, verbose=0)
	if array_equal(one_hot_decode(y[0]), one_hot_decode(yhat[0])):
		correct += 1
print('Accuracy: %.2f%%' % (float(correct)/float(total)*100.0))
# spot check some examples
for _ in range(10):
	X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
	yhat = model.predict(X, verbose=0)
	print('Expected:', one_hot_decode(y[0]), 'Predicted', one_hot_decode(yhat[0]))

在CPU上运行这个例子不会花费很长时间,可能几分钟,不需要GPU。
根据实验该模型的准确度略低于20%。考虑到神经网络的随机性,您的结果会有所不同;可以运行示例几次并取平均值。

Accuracy: 19.00%

我们可以从样本输出中看到,模型在输出序列中得到的第一个数字对于大多数或所有情况都是正确的,并且只与第二个数字发生冲突。所有零填充值都是正确预测的。

Expected: [47, 0, 0, 0, 0] Predicted [47, 47, 0, 0, 0]
Expected: [43, 31, 0, 0, 0] Predicted [43, 31, 0, 0, 0]
Expected: [14, 22, 0, 0, 0] Predicted [14, 14, 0, 0, 0]
Expected: [39, 31, 0, 0, 0] Predicted [39, 39, 0, 0, 0]
Expected: [6, 4, 0, 0, 0] Predicted [6, 4, 0, 0, 0]
Expected: [47, 0, 0, 0, 0] Predicted [47, 47, 0, 0, 0]
Expected: [39, 33, 0, 0, 0] Predicted [39, 39, 0, 0, 0]
Expected: [23, 2, 0, 0, 0] Predicted [23, 23, 0, 0, 0]
Expected: [19, 28, 0, 0, 0] Predicted [19, 3, 0, 0, 0]
Expected: [32, 33, 0, 0, 0] Predicted [32, 32, 0, 0, 0]

自定义Keras中的Attention层(Custom Keras Attention Layer)

现在我们需要将注意力机制引入编码器-解码器模型。
在Keras发布正式的注意力机制之前(查Keras的官网发现了一段模糊的话,没看到专门的介绍attention的文档,如果已经有官方实现了,欢迎留言告诉我),我们可以开发自己的实施方案,也可以使用现有的第三方实施方案。


为了加快速度,让我们使用现有的第三方实现。
Zafarali Ahmed ,一位在数据日志网站上的实习生,在2017年发表的题为“How to Visualize Your Recurrent Neural Network with Attention in Keras”的文章和名为“Keras attention”的Github项目中,为Keras开发了一个定制层,为attention提供支持。
自定义关注层称为AttentionDecoder,可在github项目中的custom_recurrents.py文件中找到。我们可以在项目的GNU affero通用公共许可v3.0许可证下重用此代码。
为了完整起见,下面列出了自定义层的副本。复制并粘贴到当前工作目录中名为“attention_decoder.py”的新的单独文件中。

import tensorflow as tf
from keras import backend as K
from keras import regularizers, constraints, initializers, activations
from keras.layers.recurrent import Recurrent, _time_distributed_dense
from keras.engine import InputSpec

tfPrint = lambda d, T: tf.Print(input_=T, data=[T, tf.shape(T)], message=d)

class AttentionDecoder(Recurrent):

    def __init__(self, units, output_dim,
                 activation='tanh',
                 return_probabilities=False,
                 name='AttentionDecoder',
                 kernel_initializer='glorot_uniform',
                 recurrent_initializer='orthogonal',
                 bias_initializer='zeros',
                 kernel_regularizer=None,
                 bias_regularizer=None,
                 activity_regularizer=None,
                 kernel_constraint=None,
                 bias_constraint=None,
                 **kwargs):
        """
        Implements an AttentionDecoder that takes in a sequence encoded by an
        encoder and outputs the decoded states
        :param units: dimension of the hidden state and the attention matrices
        :param output_dim: the number of labels in the output space

        references:
            Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio.
            "Neural machine translation by jointly learning to align and translate."
            arXiv preprint arXiv:1409.0473 (2014).
        """
        self.units = units
        self.output_dim = output_dim
        self.return_probabilities = return_probabilities
        self.activation = activations.get(activation)
        self.kernel_initializer = initializers.get(kernel_initializer)
        self.recurrent_initializer = initializers.get(recurrent_initializer)
        self.bias_initializer = initializers.get(bias_initializer)

        self.kernel_regularizer = regularizers.get(kernel_regularizer)
        self.recurrent_regularizer = regularizers.get(kernel_regularizer)
        self.bias_regularizer = regularizers.get(bias_regularizer)
        self.activity_regularizer = regularizers.get(activity_regularizer)

        self.kernel_constraint = constraints.get(kernel_constraint)
        self.recurrent_constraint = constraints.get(kernel_constraint)
        self.bias_constraint = constraints.get(bias_constraint)

        super(AttentionDecoder, self).__init__(**kwargs)
        self.name = name
        self.return_sequences = True  # must return sequences

    def build(self, input_shape):
        """
          See Appendix 2 of Bahdanau 2014, arXiv:1409.0473
          for model details that correspond to the matrices here.
        """

        self.batch_size, self.timesteps, self.input_dim = input_shape

        if self.stateful:
            super(AttentionDecoder, self).reset_states()

        self.states = [None, None]  # y, s

        """
            Matrices for creating the context vector
        """

        self.V_a = self.add_weight(shape=(self.units,),
                                   name='V_a',
                                   initializer=self.kernel_initializer,
                                   regularizer=self.kernel_regularizer,
                                   constraint=self.kernel_constraint)
        self.W_a = self.add_weight(shape=(self.units, self.units),
                                   name='W_a',
                                   initializer=self.kernel_initializer,
                                   regularizer=self.kernel_regularizer,
                                   constraint=self.kernel_constraint)
        self.U_a = self.add_weight(shape=(self.input_dim, self.units),
                                   name='U_a',
                                   initializer=self.kernel_initializer,
                                   regularizer=self.kernel_regularizer,
                                   constraint=self.kernel_constraint)
        self.b_a = self.add_weight(shape=(self.units,),
                                   name='b_a',
                                   initializer=self.bias_initializer,
                                   regularizer=self.bias_regularizer,
                                   constraint=self.bias_constraint)
        """
            Matrices for the r (reset) gate
        """
        self.C_r = self.add_weight(shape=(self.input_dim, self.units),
                                   name='C_r',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.U_r = self.add_weight(shape=(self.units, self.units),
                                   name='U_r',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.W_r = self.add_weight(shape=(self.output_dim, self.units),
                                   name='W_r',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.b_r = self.add_weight(shape=(self.units, ),
                                   name='b_r',
                                   initializer=self.bias_initializer,
                                   regularizer=self.bias_regularizer,
                                   constraint=self.bias_constraint)

        """
            Matrices for the z (update) gate
        """
        self.C_z = self.add_weight(shape=(self.input_dim, self.units),
                                   name='C_z',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.U_z = self.add_weight(shape=(self.units, self.units),
                                   name='U_z',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.W_z = self.add_weight(shape=(self.output_dim, self.units),
                                   name='W_z',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.b_z = self.add_weight(shape=(self.units, ),
                                   name='b_z',
                                   initializer=self.bias_initializer,
                                   regularizer=self.bias_regularizer,
                                   constraint=self.bias_constraint)
        """
            Matrices for the proposal
        """
        self.C_p = self.add_weight(shape=(self.input_dim, self.units),
                                   name='C_p',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.U_p = self.add_weight(shape=(self.units, self.units),
                                   name='U_p',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.W_p = self.add_weight(shape=(self.output_dim, self.units),
                                   name='W_p',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.b_p = self.add_weight(shape=(self.units, ),
                                   name='b_p',
                                   initializer=self.bias_initializer,
                                   regularizer=self.bias_regularizer,
                                   constraint=self.bias_constraint)
        """
            Matrices for making the final prediction vector
        """
        self.C_o = self.add_weight(shape=(self.input_dim, self.output_dim),
                                   name='C_o',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.U_o = self.add_weight(shape=(self.units, self.output_dim),
                                   name='U_o',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.W_o = self.add_weight(shape=(self.output_dim, self.output_dim),
                                   name='W_o',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.b_o = self.add_weight(shape=(self.output_dim, ),
                                   name='b_o',
                                   initializer=self.bias_initializer,
                                   regularizer=self.bias_regularizer,
                                   constraint=self.bias_constraint)

        # For creating the initial state:
        self.W_s = self.add_weight(shape=(self.input_dim, self.units),
                                   name='W_s',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)

        self.input_spec = [
            InputSpec(shape=(self.batch_size, self.timesteps, self.input_dim))]
        self.built = True

    def call(self, x):
        # store the whole sequence so we can "attend" to it at each timestep
        self.x_seq = x

        # apply the a dense layer over the time dimension of the sequence
        # do it here because it doesn't depend on any previous steps
        # thefore we can save computation time:
        self._uxpb = _time_distributed_dense(self.x_seq, self.U_a, b=self.b_a,
                                             input_dim=self.input_dim,
                                             timesteps=self.timesteps,
                                             output_dim=self.units)

        return super(AttentionDecoder, self).call(x)

    def get_initial_state(self, inputs):
        # apply the matrix on the first time step to get the initial s0.
        s0 = activations.tanh(K.dot(inputs[:, 0], self.W_s))

        # from keras.layers.recurrent to initialize a vector of (batchsize,
        # output_dim)
        y0 = K.zeros_like(inputs)  # (samples, timesteps, input_dims)
        y0 = K.sum(y0, axis=(1, 2))  # (samples, )
        y0 = K.expand_dims(y0)  # (samples, 1)
        y0 = K.tile(y0, [1, self.output_dim])

        return [y0, s0]

    def step(self, x, states):

        ytm, stm = states

        # repeat the hidden state to the length of the sequence
        _stm = K.repeat(stm, self.timesteps)

        # now multiplty the weight matrix with the repeated hidden state
        _Wxstm = K.dot(_stm, self.W_a)

        # calculate the attention probabilities
        # this relates how much other timesteps contributed to this one.
        et = K.dot(activations.tanh(_Wxstm + self._uxpb),
                   K.expand_dims(self.V_a))
        at = K.exp(et)
        at_sum = K.sum(at, axis=1)
        at_sum_repeated = K.repeat(at_sum, self.timesteps)
        at /= at_sum_repeated  # vector of size (batchsize, timesteps, 1)

        # calculate the context vector
        context = K.squeeze(K.batch_dot(at, self.x_seq, axes=1), axis=1)
        # ~~~> calculate new hidden state
        # first calculate the "r" gate:

        rt = activations.sigmoid(
            K.dot(ytm, self.W_r)
            + K.dot(stm, self.U_r)
            + K.dot(context, self.C_r)
            + self.b_r)

        # now calculate the "z" gate
        zt = activations.sigmoid(
            K.dot(ytm, self.W_z)
            + K.dot(stm, self.U_z)
            + K.dot(context, self.C_z)
            + self.b_z)

        # calculate the proposal hidden state:
        s_tp = activations.tanh(
            K.dot(ytm, self.W_p)
            + K.dot((rt * stm), self.U_p)
            + K.dot(context, self.C_p)
            + self.b_p)

        # new hidden state:
        st = (1-zt)*stm + zt * s_tp

        yt = activations.softmax(
            K.dot(ytm, self.W_o)
            + K.dot(stm, self.U_o)
            + K.dot(context, self.C_o)
            + self.b_o)

        if self.return_probabilities:
            return at, [yt, st]
        else:
            return yt, [yt, st]

    def compute_output_shape(self, input_shape):
        """
            For Keras internal compatability checking
        """
        if self.return_probabilities:
            return (None, self.timesteps, self.timesteps)
        else:
            return (None, self.timesteps, self.output_dim)

    def get_config(self):
        """
            For rebuilding models on load time.
        """
        config = {
            'output_dim': self.output_dim,
            'units': self.units,
            'return_probabilities': self.return_probabilities
        }
        base_config = super(AttentionDecoder, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))

我们可以在项目中使用这个自定义层,通过用如下的方式导入它:

from attention_decoder import AttentionDecoder

该层实现了Bahdanau等人在他们的“Neural Machine Translation by Jointly Learning to Align and Translate.”论文中所描述的注意力机制。
这段代码在文章中解释得很好,并且与LSTM和注意力方程式都有关联。
这种实现的一个限制是,它必须输出与输入序列长度相同的序列,而这是编码器-解码器体系结构设计要克服的特定限制。(确实是这样,如果输入和输出序列长度不一样,可以采用补0措施,将短的序列补为和长的一样的;或者在得到这层输出之后,再通过一层GRU改变序列长度)
重要的是,该层不但避免了第二个LSTM层作为解码器的代码重复,还在这个编码器-解码器模型中通过不加注意机制的Dense层得到SoftMax后的输出。这大大简化了模型的代码。
需要注意的是,自定义层是建立在Keras中的循环层(继承了Recurrent)之上的,在编写时,它被标记为遗留代码,并且可能会在某个时刻从项目中删除。


带有注意力机制的编码器-解码器(Encoder-Decoder With Attention)

既然我们已经有了可以使用的注意力实现,那么我们就可以为我们设计的序列预测问题开发一个有注意力的编码器-解码器模型。
下面定义了具有注意层的模型。我们可以看到,该层能处理编码器-解码器模型本身的一些机制,使得定义模型更简单。

# define model
model = Sequential()
model.add(LSTM(150, input_shape=(n_timesteps_in, n_features), return_sequences=True))
model.add(AttentionDecoder(150, n_features))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])

就是这样。其余的例子是相同的。
下面列出了完整的示例。

from random import randint
from numpy import array
from numpy import argmax
from numpy import array_equal
from keras.models import Sequential
from keras.layers import LSTM
from attention_decoder import AttentionDecoder

# generate a sequence of random integers
def generate_sequence(length, n_unique):
	return [randint(0, n_unique-1) for _ in range(length)]

# one hot encode sequence
def one_hot_encode(sequence, n_unique):
	encoding = list()
	for value in sequence:
		vector = [0 for _ in range(n_unique)]
		vector[value] = 1
		encoding.append(vector)
	return array(encoding)

# decode a one hot encoded string
def one_hot_decode(encoded_seq):
	return [argmax(vector) for vector in encoded_seq]

# prepare data for the LSTM
def get_pair(n_in, n_out, cardinality):
	# generate random sequence
	sequence_in = generate_sequence(n_in, cardinality)
	sequence_out = sequence_in[:n_out] + [0 for _ in range(n_in-n_out)]
	# one hot encode
	X = one_hot_encode(sequence_in, cardinality)
	y = one_hot_encode(sequence_out, cardinality)
	# reshape as 3D
	X = X.reshape((1, X.shape[0], X.shape[1]))
	y = y.reshape((1, y.shape[0], y.shape[1]))
	return X,y

# configure problem
n_features = 50
n_timesteps_in = 5
n_timesteps_out = 2

# define model
model = Sequential()
model.add(LSTM(150, input_shape=(n_timesteps_in, n_features), return_sequences=True))
model.add(AttentionDecoder(150, n_features))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
# train LSTM
for epoch in range(5000):
	# generate new random sequence
	X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
	# fit model for one epoch on this sequence
	model.fit(X, y, epochs=1, verbose=2)
# evaluate LSTM
total, correct = 100, 0
for _ in range(total):
	X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
	yhat = model.predict(X, verbose=0)
	if array_equal(one_hot_decode(y[0]), one_hot_decode(yhat[0])):
		correct += 1
print('Accuracy: %.2f%%' % (float(correct)/float(total)*100.0))
# spot check some examples
for _ in range(10):
	X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
	yhat = model.predict(X, verbose=0)
	print('Expected:', one_hot_decode(y[0]), 'Predicted', one_hot_decode(yhat[0]))

运行这个示例可以在100个随机生成的输入输出对进行训练来评估模型的性能。在相同的资源和相同的训练量下,具有注意力的模型表现得更好。
考虑到神经网络的随机性,你的结果可能会有所不同。尝试运行该示例几次。

Accuracy: 95.00%

抽查一些样本输出和预测序列,我们可以看到即使在前两个元素中有零值的情况下,都很少有错误。

Expected: [48, 47, 0, 0, 0] Predicted [48, 47, 0, 0, 0]
Expected: [7, 46, 0, 0, 0] Predicted [7, 46, 0, 0, 0]
Expected: [32, 30, 0, 0, 0] Predicted [32, 2, 0, 0, 0]
Expected: [3, 25, 0, 0, 0] Predicted [3, 25, 0, 0, 0]
Expected: [45, 4, 0, 0, 0] Predicted [45, 4, 0, 0, 0]
Expected: [49, 9, 0, 0, 0] Predicted [49, 9, 0, 0, 0]
Expected: [22, 23, 0, 0, 0] Predicted [22, 23, 0, 0, 0]
Expected: [29, 36, 0, 0, 0] Predicted [29, 36, 0, 0, 0]
Expected: [0, 29, 0, 0, 0] Predicted [0, 29, 0, 0, 0]
Expected: [11, 26, 0, 0, 0] Predicted [11, 26, 0, 0, 0]

模型比较(Comparison of Models)

尽管我们在有注意力机制的情况下从模型中得到了更好的结果,但是那只是从模型的一次运行中得到的结果。
在这种情况下,我们通过多次重复对每个模型的评估,并报告这些运行的平均性能,来寻求更可靠的模型评估结果。有关评估神经网络模型的这种稳健方法的更多信息,请参阅以下文章:

  • How to Evaluate the Skill of Deep Learning Models

我们可以定义一个函数来创建每种类型的模型,如下所示。

# define the encoder-decoder model
def baseline_model(n_timesteps_in, n_features):
	model = Sequential()
	model.add(LSTM(150, input_shape=(n_timesteps_in, n_features)))
	model.add(RepeatVector(n_timesteps_in))
	model.add(LSTM(150, return_sequences=True))
	model.add(TimeDistributed(Dense(n_features, activation='softmax')))
	model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
	return model

# define the encoder-decoder with attention model
def attention_model(n_timesteps_in, n_features):
	model = Sequential()
	model.add(LSTM(150, input_shape=(n_timesteps_in, n_features), return_sequences=True))
	model.add(AttentionDecoder(150, n_features))
	model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
	return model

然后我们可以定义一个函数来拟合和评估模型的精度,并返回精度分数。

# train and evaluate a model, return accuracy
def train_evaluate_model(model, n_timesteps_in, n_timesteps_out, n_features):
	# train LSTM
	for epoch in range(5000):
		# generate new random sequence
		X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
		# fit model for one epoch on this sequence
		model.fit(X, y, epochs=1, verbose=0)
	# evaluate LSTM
	total, correct = 100, 0
	for _ in range(total):
		X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
		yhat = model.predict(X, verbose=0)
		if array_equal(one_hot_decode(y[0]), one_hot_decode(yhat[0])):
			correct += 1
	return float(correct)/float(total)*100.0

综合起来,我们可以多次重复创建、训练和评估每种模型的过程,并报告多次评估的平均准确性。为了减少运行时间,我们将重复评估每个模型10次,如果您有资源,您可以将其增加到30或100次。
下面列出了完整的示例。

from random import randint
from numpy import array
from numpy import argmax
from numpy import array_equal
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import TimeDistributed
from keras.layers import RepeatVector
from attention_decoder import AttentionDecoder

# generate a sequence of random integers
def generate_sequence(length, n_unique):
	return [randint(0, n_unique-1) for _ in range(length)]

# one hot encode sequence
def one_hot_encode(sequence, n_unique):
	encoding = list()
	for value in sequence:
		vector = [0 for _ in range(n_unique)]
		vector[value] = 1
		encoding.append(vector)
	return array(encoding)

# decode a one hot encoded string
def one_hot_decode(encoded_seq):
	return [argmax(vector) for vector in encoded_seq]

# prepare data for the LSTM
def get_pair(n_in, n_out, cardinality):
	# generate random sequence
	sequence_in = generate_sequence(n_in, cardinality)
	sequence_out = sequence_in[:n_out] + [0 for _ in range(n_in-n_out)]
	# one hot encode
	X = one_hot_encode(sequence_in, cardinality)
	y = one_hot_encode(sequence_out, cardinality)
	# reshape as 3D
	X = X.reshape((1, X.shape[0], X.shape[1]))
	y = y.reshape((1, y.shape[0], y.shape[1]))
	return X,y

# define the encoder-decoder model
def baseline_model(n_timesteps_in, n_features):
	model = Sequential()
	model.add(LSTM(150, input_shape=(n_timesteps_in, n_features)))
	model.add(RepeatVector(n_timesteps_in))
	model.add(LSTM(150, return_sequences=True))
	model.add(TimeDistributed(Dense(n_features, activation='softmax')))
	model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
	return model

# define the encoder-decoder with attention model
def attention_model(n_timesteps_in, n_features):
	model = Sequential()
	model.add(LSTM(150, input_shape=(n_timesteps_in, n_features), return_sequences=True))
	model.add(AttentionDecoder(150, n_features))
	model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
	return model

# train and evaluate a model, return accuracy
def train_evaluate_model(model, n_timesteps_in, n_timesteps_out, n_features):
	# train LSTM
	for epoch in range(5000):
		# generate new random sequence
		X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
		# fit model for one epoch on this sequence
		model.fit(X, y, epochs=1, verbose=0)
	# evaluate LSTM
	total, correct = 100, 0
	for _ in range(total):
		X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
		yhat = model.predict(X, verbose=0)
		if array_equal(one_hot_decode(y[0]), one_hot_decode(yhat[0])):
			correct += 1
	return float(correct)/float(total)*100.0

# configure problem
n_features = 50
n_timesteps_in = 5
n_timesteps_out = 2
n_repeats = 10
# evaluate encoder-decoder model
print('Encoder-Decoder Model')
results = list()
for _ in range(n_repeats):
	model = baseline_model(n_timesteps_in, n_features)
	accuracy = train_evaluate_model(model, n_timesteps_in, n_timesteps_out, n_features)
	results.append(accuracy)
	print(accuracy)
print('Mean Accuracy: %.2f%%' % (sum(results)/float(n_repeats)))
# evaluate encoder-decoder with attention model
print('Encoder-Decoder With Attention Model')
results = list()
for _ in range(n_repeats):
	model = attention_model(n_timesteps_in, n_features)
	accuracy = train_evaluate_model(model, n_timesteps_in, n_timesteps_out, n_features)
	results.append(accuracy)
	print(accuracy)
print('Mean Accuracy: %.2f%%' % (sum(results)/float(n_repeats)))

运行这个示例可以打印每个模型重复的精度,从而让您了解运行的进度。

Encoder-Decoder Model
20.0
23.0
23.0
18.0
28.000000000000004
28.999999999999996
23.0
26.0
21.0
20.0
Mean Accuracy: 23.10%

Encoder-Decoder With Attention Model
98.0
91.0
94.0
93.0
96.0
99.0
97.0
94.0
99.0
96.0
Mean Accuracy: 95.70%

我们可以看到,即使运行10次,注意力模型仍然比没有注意力的编码器-解码器模型表现出更好的性能。
对这种评估的一个很好的扩展是捕获每个模型的每个epoch的模型损失,取平均值,并比较有或无注意力机制的模型中损失随时间的变化情况。我希望这个跟踪会显示出注意力比非注意力模型更快地获得更好的技能,进一步突出了该方法的好处。

 

你可能感兴趣的:(深度学习)