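Recurrent neural networks (RNNs) are an important class of neural network models, particularly well suited to sequence labeling problems. Beginners are often confused by the many look-alike diagrams and get stuck on the same questions: how does each kind of cell work, how are cells combined, what does the input data actually look like, how is it processed by a single cell, how does it flow between cells, how is all of this implemented in code, and how complex is it? This article tries to answer these questions from several angles, moving from the basics to the more advanced parts. It divides the components of an RNN-based model into three dimensions (RNN cell types, encoder/decoder structures, and Seq2Seq models) and explains the components along each dimension as well as how they are combined. The article draws on parts of Stanford's CS224D course. Each part explains the principle, compares sibling structures and how they evolved, and shows the corresponding code, in the hope of giving beginners both the details and the big picture.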
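The figure below summarizes the components of the network along these three dimensions:
A single RNN cell:
The RNN cell is the most basic unit of a recurrent neural network and plays the role of a basic neuron. Besides the usual input X(t), the cell takes an extra input H(t-1) that represents the memory of the previous step. It can be called the memory, or the hidden state of the previous step.
As shown in the figure below:
For the most basic RNN cell (BasicRNNCell in rnn_cell_impl.py), H(t-1) and Y(t-1) are the same. That's right, identical. See the source code below: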
class BasicRNNCell(RNNCell):

  def __init__(self, num_units, activation=None, reuse=None):
    super(BasicRNNCell, self).__init__(_reuse=reuse)
    self._num_units = num_units
    self._activation = activation or math_ops.tanh
    self._linear = None

  @property
  def state_size(self):
    return self._num_units

  @property
  def output_size(self):
    return self._num_units

  def call(self, inputs, state):
    """Most basic RNN: output = new_state = act(W * input + U * state + B)."""
    if self._linear is None:
      self._linear = _Linear([inputs, state], self._num_units, True)

    output = self._activation(self._linear([inputs, state]))
    return output, output
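The two values returned by call are the cell's output and its hidden state. The code returns the same tensor for both, so output_size equals state_size, and both equal num_units. The code also shows that the BasicRNNCell constructor has only one required parameter, num_units, which is the output dimension of the fully connected layer inside the cell. The input dimension is arbitrary: it is determined by the size of the inputs passed to call, so the constructor only needs the output dimension.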
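To make the role of num_units concrete, here is a small sketch that counts the trainable parameters of such a cell. It assumes, as the call `_Linear([inputs, state], self._num_units, True)` suggests, a single kernel applied to the concatenation of inputs and state plus one bias vector. The function name and the sizes used are illustrative choices, not part of TensorFlow.

```python
def basic_rnn_param_count(input_size, num_units):
    """Number of trainable scalars in a basic RNN cell.

    Assumes one kernel applied to the concatenation [inputs, state]
    plus one bias vector, as in output = act(W * input + U * state + B).
    """
    kernel = (input_size + num_units) * num_units  # W and U stacked together
    bias = num_units
    return kernel + bias


# A word-embedding size of 4 and an arbitrary num_units of 3.
small = basic_rnn_param_count(input_size=4, num_units=3)
# A more realistic, still hypothetical, configuration.
large = basic_rnn_param_count(input_size=256, num_units=128)
print(small, large)
```

Only num_units is fixed when the cell is constructed; the input size enters the count only once the first input is seen.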
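Next, a simple example shows how X(t) and H(t-1) produce the output Y(t), which is also the hidden state. Suppose we are building a translation model from Chinese to English. The training data consists of one million examples, each a Chinese sentence paired with its English translation. We set batch_size=100, so 100 training examples are processed at a time. Each sentence is first segmented into words, and each word is converted to its word-embedding representation. To keep the diagram simple, assume an embedding dimension of 4, that is, each word is represented by a vector of length 4 (in practice the embedding dimension is much larger, for example 256).
This is a minimal BasicRNNCell computation, and understanding it thoroughly is essential for understanding the whole RNN. A few points are worth emphasizing. First, it is a fully connected layer: X(t) is mapped to x_output by matrix W, and H(t-1) is mapped to state_output by matrix U. Second, x_output and state_output have the same dimension, which is what allows them to be added. Third, after x_output and state_output are summed, the bias is added and the activation (tanh by default) is applied. One copy of the result becomes Y(t) and another becomes H(t), so for BasicRNNCell, Y(t)=H(t).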
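The three points above can be sketched in plain Python for a single example (a batch of 100 simply repeats the same computation for each row). This is a minimal sketch, not the TensorFlow implementation: the helper names, the random weights and num_units=3 are assumptions made for illustration.

```python
import math
import random


def matvec(matrix, vec):
    """Multiply a matrix, stored as a list of rows, by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def basic_rnn_step(x_t, h_prev, W, U, b):
    """One step of a basic RNN cell for a single example.

    x_output = W x_t and state_output = U h_prev have the same length,
    so they can be added; the bias and tanh follow.
    """
    x_output = matvec(W, x_t)
    state_output = matvec(U, h_prev)
    h_t = [math.tanh(xo + so + bi)
           for xo, so, bi in zip(x_output, state_output, b)]
    return h_t, h_t  # output Y(t) and hidden state H(t) are the same values


random.seed(0)
embedding_dim, num_units = 4, 3  # num_units=3 is an arbitrary choice
W = [[random.uniform(-0.5, 0.5) for _ in range(embedding_dim)]
     for _ in range(num_units)]
U = [[random.uniform(-0.5, 0.5) for _ in range(num_units)]
     for _ in range(num_units)]
b = [0.0] * num_units

x_t = [0.1, -0.2, 0.3, 0.4]   # one word embedding of length 4
h_prev = [0.0] * num_units    # H(0)
y_t, h_t = basic_rnn_step(x_t, h_prev, W, U, b)
print(len(y_t), y_t == h_t)
```

Whatever the input dimension is, the output has length num_units, which is why the constructor only needs that one size.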
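So training this network means learning the matrices W and U and the bias. The initial state H(0) is also part of the model, although it is commonly just set to zeros rather than learned. With these concepts in place, the picture of an RNN unrolled over time becomes easy to understand: the same cell, with the same weights, is applied at every time step, and H(t) is passed from one step to the next.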
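Unrolling over time can be sketched as a simple loop. The example below uses a scalar cell to keep it short; the weights and inputs are arbitrary illustrative values.

```python
import math


def rnn_step(x_t, h_prev, w, u, b):
    """Scalar basic RNN step: h_t = tanh(w * x_t + u * h_prev + b)."""
    return math.tanh(w * x_t + u * h_prev + b)


def unroll(inputs, h0, w, u, b):
    """Apply the same cell (same w, u, b) at every time step."""
    h = h0
    outputs = []
    for x_t in inputs:
        h = rnn_step(x_t, h, w, u, b)
        outputs.append(h)  # Y(t) equals H(t) for the basic cell
    return outputs, h      # all per-step outputs, plus the final state


outputs, final_state = unroll([1.0, 0.5, -0.5], h0=0.0, w=0.8, u=0.5, b=0.0)
print(len(outputs), outputs[-1] == final_state)
```

The loop returns one output per time step plus the final state, which is the same pair of results that a dynamic RNN routine hands back to an encoder.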
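BasicRNNCell is the most basic form of RNN cell. Many more sophisticated cells have been built on top of it, such as BasicLSTMCell, LSTMCell and GRUCell, along with MultiRNNCell, a wrapper that stacks several cells. Why improve on BasicRNNCell? Because the basic cell is prone to vanishing and/or exploding gradients when gradients are propagated over long distances. The proof is beyond the scope of this article; see the Stanford lecture notes listed in the references, which give a detailed derivation. GRU and LSTM cells were introduced to alleviate this problem.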
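The effect can be illustrated numerically, without the full proof, using a scalar RNN. By the chain rule, the gradient of the last hidden state with respect to the first is a product of one factor per time step, so it shrinks or grows geometrically. The weights and inputs below are arbitrary values chosen to show both regimes.

```python
import math


def grad_through_time(u, xs, w=1.0, h0=0.0):
    """d h_T / d h_0 for the scalar RNN h_t = tanh(w * x_t + u * h_{t-1}).

    By the chain rule this is the product over t of u * (1 - h_t ** 2).
    """
    h, grad = h0, 1.0
    for x_t in xs:
        h = math.tanh(w * x_t + u * h)
        grad *= u * (1.0 - h * h)
    return grad


steps = 50
# |u| < 1: every factor is below 0.5 in magnitude, so the product vanishes.
vanishing = grad_through_time(u=0.5, xs=[1.0] * steps)
# u > 1 with zero inputs: h stays at 0, so every factor is exactly 1.5.
exploding = grad_through_time(u=1.5, xs=[0.0] * steps)
print(abs(vanishing) < 1e-14, exploding > 1e8)
```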
The GRUCell formulas and diagram are as follows:
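In equation form, the update implemented by the call method of the GRUCell source quoted in this section can be written as below. This is a reconstruction from that code, not a copy of the original figure: the names W_g, b_g, W_c, b_c for the two linear layers are chosen here for illustration, [a, b] denotes concatenation, and tanh stands for the default activation.

```latex
\begin{aligned}
\left(r_t,\ u_t\right) &= \sigma\left(W_g\,[x_t,\ h_{t-1}] + b_g\right) \\
c_t &= \tanh\left(W_c\,[x_t,\ r_t \odot h_{t-1}] + b_c\right) \\
h_t &= u_t \odot h_{t-1} + \left(1 - u_t\right) \odot c_t
\end{aligned}
```

Here r_t is the reset gate, u_t the update gate, c_t the candidate (new) memory and h_t the state passed to the next step.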
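The most distinctive feature of GRUCell is its two gates, the reset gate and the update gate. BasicRNNCell simply maps X(t) and H(t-1) through W and U and adds the results, whereas a GRU uses two gate functions to decide what to keep. The reset gate decides how much of the previous memory H(t-1) is combined with the current input (the part in the dashed box); this yields a new candidate hidden state, that is, the new memory. The update gate decides in what proportion the new memory and the old memory are passed on to the next step. The subtle point is that the update gate does not act only on the new memory computed in the dashed box. It acts on both the new memory and the old memory H(t-1), deciding how much to take from each, and their weighted sum becomes H(t). One way to see it: if the model considers the input at time t important for the final result, the new memory is added into H(t) with a large weight; otherwise it gets a small weight, and the old memory receives the complementary weight.
As with BasicRNNCell, the output and the hidden state of a GRU at each time step t are the same, both H(t), so its state_size and output_size are equal. See the code: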
class GRUCell(RNNCell):

  def __init__(self,
               num_units,
               activation=None,
               reuse=None,
               kernel_initializer=None,
               bias_initializer=None):
    super(GRUCell, self).__init__(_reuse=reuse)
    self._num_units = num_units
    self._activation = activation or math_ops.tanh
    self._kernel_initializer = kernel_initializer
    self._bias_initializer = bias_initializer
    self._gate_linear = None
    self._candidate_linear = None

  @property
  def state_size(self):
    return self._num_units

  @property
  def output_size(self):
    return self._num_units

  def call(self, inputs, state):
    """Gated recurrent unit (GRU) with nunits cells."""
    if self._gate_linear is None:
      bias_ones = self._bias_initializer
      if self._bias_initializer is None:
        bias_ones = init_ops.constant_initializer(1.0, dtype=inputs.dtype)
      with vs.variable_scope("gates"):  # Reset gate and update gate.
        self._gate_linear = _Linear(
            [inputs, state],
            2 * self._num_units,
            True,
            bias_initializer=bias_ones,
            kernel_initializer=self._kernel_initializer)

    value = math_ops.sigmoid(self._gate_linear([inputs, state]))
    r, u = array_ops.split(value=value, num_or_size_splits=2, axis=1)

    r_state = r * state
    if self._candidate_linear is None:
      with vs.variable_scope("candidate"):
        self._candidate_linear = _Linear(
            [inputs, r_state],
            self._num_units,
            True,
            bias_initializer=self._bias_initializer,
            kernel_initializer=self._kernel_initializer)
    c = self._activation(self._candidate_linear([inputs, r_state]))
    new_h = u * state + (1 - u) * c
    return new_h, new_h
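The same computation can be traced with plain Python lists for a single example. This sketch follows the structure of the call method above but is not the TensorFlow implementation: it uses separate weight matrices for the reset and update gates (equivalent to one matrix with 2 * num_units outputs that is then split), and the weights and biases are toy values picked so that the effect of the update gate is easy to see.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, vec):
    """Compute weights @ vec + bias with plain lists."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def gru_step(x, h, W_r, b_r, W_u, b_u, W_c, b_c):
    """One GRU step for a single example; each W_* acts on a concatenation."""
    r = [sigmoid(z) for z in affine(W_r, b_r, x + h)]          # reset gate
    u = [sigmoid(z) for z in affine(W_u, b_u, x + h)]          # update gate
    r_state = [ri * hi for ri, hi in zip(r, h)]
    c = [math.tanh(z) for z in affine(W_c, b_c, x + r_state)]  # candidate
    new_h = [ui * hi + (1.0 - ui) * ci for ui, hi, ci in zip(u, h, c)]
    return new_h, new_h


# Toy sizes: 2 inputs, 1 hidden unit. All kernels are zero, so only the
# biases decide the gate values and the candidate.
zeros = [[0.0, 0.0, 0.0]]
x, h = [1.0, -1.0], [0.5]
keep_old, _ = gru_step(x, h, zeros, [0.0], zeros, [20.0], zeros, [1.0])
take_new, _ = gru_step(x, h, zeros, [0.0], zeros, [-20.0], zeros, [1.0])
print(keep_old[0], take_new[0])
```

With the update gate close to 1 the old state 0.5 passes through almost unchanged; with the gate close to 0 the new state is almost exactly the candidate tanh(1.0).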
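With GRUCell understood, here is a brief introduction to LSTMCell.
LSTMCell is somewhat more complex than GRUCell and adds more gates. It has three gates: the input gate, the forget gate and the output gate. Conceptually it works like GRUCell, only with finer-grained control. Finer control does not necessarily give better results, though; in practice, try both, compare, and use whichever works better. The LSTMCell formulas and diagram are as follows: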
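In equation form, the default path through the LSTMCell source quoted in this section (no peepholes, no projection, no clipping) can be written as below. This is a reconstruction from that code, not a copy of the original figure: W and b name the single linear layer, b_forget stands for the forget_bias constructor argument, [a, b] denotes concatenation, and tanh stands for the default activation.

```latex
\begin{aligned}
\left(i_t,\ j_t,\ f_t,\ o_t\right) &= W\,[x_t,\ m_{t-1}] + b \\
c_t &= \sigma\left(f_t + b_{\text{forget}}\right) \odot c_{t-1}
      + \sigma\left(i_t\right) \odot \tanh\left(j_t\right) \\
m_t &= \sigma\left(o_t\right) \odot \tanh\left(c_t\right)
\end{aligned}
```

Here i, f and o feed the input, forget and output gates, j is the new input, c_t is the cell memory and m_t is the output.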
The LSTMCell source code is as follows:
class LSTMCell(RNNCell):

  def __init__(self, num_units,
               use_peepholes=False, cell_clip=None,
               initializer=None, num_proj=None, proj_clip=None,
               num_unit_shards=None, num_proj_shards=None,
               forget_bias=1.0, state_is_tuple=True,
               activation=None, reuse=None):
    super(LSTMCell, self).__init__(_reuse=reuse)
    if not state_is_tuple:
      logging.warn("%s: Using a concatenated state is slower and will soon be "
                   "deprecated. Use state_is_tuple=True.", self)
    if num_unit_shards is not None or num_proj_shards is not None:
      logging.warn(
          "%s: The num_unit_shards and proj_unit_shards parameters are "
          "deprecated and will be removed in Jan 2017. "
          "Use a variable scope with a partitioner instead.", self)

    self._num_units = num_units
    self._use_peepholes = use_peepholes
    self._cell_clip = cell_clip
    self._initializer = initializer
    self._num_proj = num_proj
    self._proj_clip = proj_clip
    self._num_unit_shards = num_unit_shards
    self._num_proj_shards = num_proj_shards
    self._forget_bias = forget_bias
    self._state_is_tuple = state_is_tuple
    self._activation = activation or math_ops.tanh

    if num_proj:
      self._state_size = (
          LSTMStateTuple(num_units, num_proj)
          if state_is_tuple else num_units + num_proj)
      self._output_size = num_proj
    else:
      self._state_size = (
          LSTMStateTuple(num_units, num_units)
          if state_is_tuple else 2 * num_units)
      self._output_size = num_units
    self._linear1 = None
    self._linear2 = None
    if self._use_peepholes:
      self._w_f_diag = None
      self._w_i_diag = None
      self._w_o_diag = None

  @property
  def state_size(self):
    return self._state_size

  @property
  def output_size(self):
    return self._output_size

  def call(self, inputs, state):
    num_proj = self._num_units if self._num_proj is None else self._num_proj
    sigmoid = math_ops.sigmoid

    if self._state_is_tuple:
      (c_prev, m_prev) = state
    else:
      c_prev = array_ops.slice(state, [0, 0], [-1, self._num_units])
      m_prev = array_ops.slice(state, [0, self._num_units], [-1, num_proj])

    dtype = inputs.dtype
    input_size = inputs.get_shape().with_rank(2)[1]
    if input_size.value is None:
      raise ValueError("Could not infer input size from inputs.get_shape()[-1]")
    if self._linear1 is None:
      scope = vs.get_variable_scope()
      with vs.variable_scope(
          scope, initializer=self._initializer) as unit_scope:
        if self._num_unit_shards is not None:
          unit_scope.set_partitioner(
              partitioned_variables.fixed_size_partitioner(
                  self._num_unit_shards))
        self._linear1 = _Linear([inputs, m_prev], 4 * self._num_units, True)

    # i = input_gate, j = new_input, f = forget_gate, o = output_gate
    lstm_matrix = self._linear1([inputs, m_prev])
    i, j, f, o = array_ops.split(
        value=lstm_matrix, num_or_size_splits=4, axis=1)
    # Diagonal connections
    if self._use_peepholes and not self._w_f_diag:
      scope = vs.get_variable_scope()
      with vs.variable_scope(
          scope, initializer=self._initializer) as unit_scope:
        with vs.variable_scope(unit_scope):
          self._w_f_diag = vs.get_variable(
              "w_f_diag", shape=[self._num_units], dtype=dtype)
          self._w_i_diag = vs.get_variable(
              "w_i_diag", shape=[self._num_units], dtype=dtype)
          self._w_o_diag = vs.get_variable(
              "w_o_diag", shape=[self._num_units], dtype=dtype)

    if self._use_peepholes:
      c = (sigmoid(f + self._forget_bias + self._w_f_diag * c_prev) * c_prev +
           sigmoid(i + self._w_i_diag * c_prev) * self._activation(j))
    else:
      c = (sigmoid(f + self._forget_bias) * c_prev + sigmoid(i) *
           self._activation(j))

    if self._cell_clip is not None:
      # pylint: disable=invalid-unary-operand-type
      c = clip_ops.clip_by_value(c, -self._cell_clip, self._cell_clip)
      # pylint: enable=invalid-unary-operand-type
    if self._use_peepholes:
      m = sigmoid(o + self._w_o_diag * c) * self._activation(c)
    else:
      m = sigmoid(o) * self._activation(c)

    if self._num_proj is not None:
      if self._linear2 is None:
        scope = vs.get_variable_scope()
        with vs.variable_scope(scope, initializer=self._initializer):
          with vs.variable_scope("projection") as proj_scope:
            if self._num_proj_shards is not None:
              proj_scope.set_partitioner(
                  partitioned_variables.fixed_size_partitioner(
                      self._num_proj_shards))
            self._linear2 = _Linear(m, self._num_proj, False)
      m = self._linear2(m)

      if self._proj_clip is not None:
        # pylint: disable=invalid-unary-operand-type
        m = clip_ops.clip_by_value(m, -self._proj_clip, self._proj_clip)
        # pylint: enable=invalid-unary-operand-type

    new_state = (LSTMStateTuple(c, m) if self._state_is_tuple else
                 array_ops.concat([c, m], 1))
    return m, new_state
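LSTMCell's call returns two values, m and new_state. m is the hidden state (and the cell's output), and c is the cell memory (the final memory). new_state simply packs c and m together: as an LSTMStateTuple(c, m) when state_is_tuple is True, and as a single concatenated tensor otherwise.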
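The relationship between m, c and new_state can be traced with a small pure-Python sketch of the default path (no peepholes, projection or clipping). It is an illustration under those assumptions rather than the TensorFlow code: StateTuple is a stand-in for LSTMStateTuple, and the zero weights are chosen only so the result is easy to check by hand.

```python
import math
from collections import namedtuple

# Stand-in for TensorFlow's LSTMStateTuple, used here only for illustration.
StateTuple = namedtuple("StateTuple", ["c", "h"])


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, state, W, b, forget_bias=1.0):
    """One LSTM step for a single example (no peepholes, no projection).

    W has 4 * num_units rows and acts on the concatenation [x, m_prev];
    the result is split into i, j, f, o in that order.
    """
    c_prev, m_prev = state
    n = len(c_prev)
    concat = x + m_prev
    z = [sum(w * v for w, v in zip(row, concat)) + bi
         for row, bi in zip(W, b)]
    i, j, f, o = z[0:n], z[n:2 * n], z[2 * n:3 * n], z[3 * n:4 * n]
    c = [sigmoid(fk + forget_bias) * ck + sigmoid(ik) * math.tanh(jk)
         for fk, ck, ik, jk in zip(f, c_prev, i, j)]
    m = [sigmoid(ok) * math.tanh(ck) for ok, ck in zip(o, c)]
    return m, StateTuple(c, m)


# Toy check with zero weights: 1 input, 1 unit, so W has 4 rows of length 2.
W = [[0.0, 0.0] for _ in range(4)]
b = [0.0, 0.0, 0.0, 0.0]
output, new_state = lstm_step([1.0], StateTuple(c=[2.0], h=[0.0]), W, b)
print(output == new_state.h, new_state.c)
```

With all weights at zero, the new memory is just the old memory scaled by sigmoid(forget_bias), which shows why a forget bias of 1.0 makes the cell retain most of its memory at the start of training.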
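Encoders and decoders
Depending on the use case, encoders can be divided into rnn_encoder for text, image_encoder for images, and conv_encoder/pooling_encoder, which use convolution or pooling. We focus on rnn_encoder for text. Depending on direction, RNN encoders are either UnidirectionalRNNEncoder or BidirectionalRNNEncoder. They only provide the framework logic; the cell used inside can be any of the RNN cells. UnidirectionalRNNEncoder and BidirectionalRNNEncoder call tf.nn.dynamic_rnn and tf.nn.bidirectional_dynamic_rnn respectively, both provided in python2.7/site-packages/tensorflow/python/ops/rnn.py, which implement the core logic of unidirectional and bidirectional RNNs.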
class UnidirectionalRNNEncoder(Encoder):
  """
  A unidirectional RNN encoder. Stacking should be performed as
  part of the cell.

  Args:
    cell: An instance of tf.contrib.rnn.RNNCell
    name: A name for the encoder
  """

  def __init__(self, params, mode, name="forward_rnn_encoder"):
    super(UnidirectionalRNNEncoder, self).__init__(params, mode, name)
    self.params["rnn_cell"] = _toggle_dropout(self.params["rnn_cell"], mode)

  @staticmethod
  def default_params():
    return {
        "rnn_cell": _default_rnn_cell_params(),
        "init_scale": 0.04,
    }

  def encode(self, inputs, sequence_length, **kwargs):
    scope = tf.get_variable_scope()
    scope.set_initializer(tf.random_uniform_initializer(
        -self.params["init_scale"],
        self.params["init_scale"]))
    cell = training_utils.get_rnn_cell(**self.params["rnn_cell"])
    outputs, state = tf.nn.dynamic_rnn(
        cell=cell,
        inputs=inputs,
        sequence_length=sequence_length,
        dtype=tf.float32,
        **kwargs)
    return EncoderOutput(
        outputs=outputs,
        final_state=state,
        attention_values=outputs,
        attention_values_length=sequence_length)


class BidirectionalRNNEncoder(Encoder):
  """
  A bidirectional RNN encoder. Uses the same cell for both the
  forward and backward RNN. Stacking should be performed as part of
  the cell.

  Args:
    cell: An instance of tf.contrib.rnn.RNNCell
    name: A name for the encoder
  """

  def __init__(self, params, mode, name="bidi_rnn_encoder"):
    super(BidirectionalRNNEncoder, self).__init__(params, mode, name)
    self.params["rnn_cell"] = _toggle_dropout(self.params["rnn_cell"], mode)

  @staticmethod
  def default_params():
    return {
        "rnn_cell": _default_rnn_cell_params(),
        "init_scale": 0.04,
    }

  def encode(self, inputs, sequence_length, **kwargs):
    scope = tf.get_variable_scope()
    scope.set_initializer(tf.random_uniform_initializer(
        -self.params["init_scale"],
        self.params["init_scale"]))
    cell_fw = training_utils.get_rnn_cell(**self.params["rnn_cell"])
    cell_bw = training_utils.get_rnn_cell(**self.params["rnn_cell"])
    outputs, states = tf.nn.bidirectional_dynamic_rnn(
        cell_fw=cell_fw,
        cell_bw=cell_bw,
        inputs=inputs,
        sequence_length=sequence_length,
        dtype=tf.float32,
        **kwargs)

    # Concatenate outputs and states of the forward and backward RNNs
    outputs_concat = tf.concat(outputs, 2)

    return EncoderOutput(
        outputs=outputs_concat,
        final_state=states,
        attention_values=outputs_concat,
        attention_values_length=sequence_length)
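What the bidirectional encoder adds on top of the unidirectional one can be sketched without TensorFlow: run one cell forward, run another over the reversed sequence, realign the backward outputs, and concatenate the two outputs at each time step. This is a simplified illustration, not the logic of tf.nn.bidirectional_dynamic_rnn itself: it handles a single sequence, ignores sequence_length, and uses a hypothetical one-unit tanh cell.

```python
import math


def run_rnn(inputs, step, h0):
    """Run `step` over the sequence; return (per-step outputs, final state)."""
    h, outputs = h0, []
    for x_t in inputs:
        h = step(x_t, h)
        outputs.append(h)
    return outputs, h


def bidirectional_encode(inputs, step_fw, step_bw, h0):
    """Forward pass, backward pass on the reversed input, then concatenate."""
    out_fw, state_fw = run_rnn(inputs, step_fw, h0)
    out_bw_rev, state_bw = run_rnn(inputs[::-1], step_bw, h0)
    out_bw = out_bw_rev[::-1]  # realign with the original time order
    outputs = [fw + bw for fw, bw in zip(out_fw, out_bw)]  # feature concat
    return outputs, (state_fw, state_bw)


def make_step(w, u):
    """Hypothetical one-unit tanh cell; the state is a list of length 1."""
    return lambda x_t, h: [math.tanh(w * x_t + u * h[0])]


sequence = [0.5, -1.0, 2.0]
outputs, (state_fw, state_bw) = bidirectional_encode(
    sequence, make_step(0.7, 0.3), make_step(-0.4, 0.6), h0=[0.0])
print(len(outputs), len(outputs[0]))
```

Each time step now carries twice as many features, and the final state is a pair: the forward state after the last token and the backward state after the first token.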
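Decoders come in three kinds: BasicDecoder, AttentionDecoder and BeamSearchDecoder. At each step, BasicDecoder passes inputs and state into the cell, producing cell_output and cell_state, which feed the next step. It also converts cell_output into scores (logits) over all output targets through a fully connected layer (in compute_output). self.helper.sample then selects a token id for the current step from these logits, and self.helper.next_inputs decides what is fed to the next step. Note that this is token selection, not negative sampling: compute_output still produces logits over the entire vocabulary (num_outputs=self.vocab_size).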
class BasicDecoder(RNNDecoder):
  """Simple RNN decoder that performed a softmax operations on the cell output.
  """

  def __init__(self, params, mode, vocab_size, name="basic_decoder"):
    super(BasicDecoder, self).__init__(params, mode, name)
    self.vocab_size = vocab_size

  def compute_output(self, cell_output):
    """Computes the decoder outputs."""
    return tf.contrib.layers.fully_connected(
        inputs=cell_output, num_outputs=self.vocab_size, activation_fn=None)

  @property
  def output_size(self):
    return DecoderOutput(
        logits=self.vocab_size,
        predicted_ids=tf.TensorShape([]),
        cell_output=self.cell.output_size)

  @property
  def output_dtype(self):
    return DecoderOutput(
        logits=tf.float32, predicted_ids=tf.int32, cell_output=tf.float32)

  def initialize(self, name=None):
    finished, first_inputs = self.helper.initialize()
    return finished, first_inputs, self.initial_state

  def step(self, time_, inputs, state, name=None):
    cell_output, cell_state = self.cell(inputs, state)
    logits = self.compute_output(cell_output)
    sample_ids = self.helper.sample(
        time=time_, outputs=logits, state=cell_state)
    outputs = DecoderOutput(
        logits=logits, predicted_ids=sample_ids, cell_output=cell_output)
    finished, next_inputs, next_state = self.helper.next_inputs(
        time=time_, outputs=outputs, state=cell_state, sample_ids=sample_ids)
    return (outputs, next_state, next_inputs, finished)
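The step method above is called once per time step by an outer decoding loop. A minimal sketch of such a loop in greedy mode is shown below. The function names, the toy model and the stopping rule are all hypothetical; in the real decoder the helper object decides how ids are chosen and what the next inputs are.

```python
def argmax(values):
    return max(range(len(values)), key=lambda k: values[k])


def greedy_decode(cell, project, embed, state, start_id, end_id, max_steps):
    """Repeat: cell step -> logits -> pick the arg-max id -> feed it back."""
    token, predicted = start_id, []
    for _ in range(max_steps):
        cell_output, state = cell(embed(token), state)
        logits = project(cell_output)  # one score per vocabulary entry
        token = argmax(logits)         # greedy choice of the next token id
        predicted.append(token)
        if token == end_id:
            break
    return predicted


# Hypothetical toy model with vocabulary ids 0..3, where 3 is end-of-sequence.
# The "state" is just a step counter.
def toy_cell(inputs, state):
    new_state = state + 1
    return new_state, new_state


def toy_embed(token_id):
    return float(token_id)


def toy_project(cell_output):
    # The highest score goes to the id equal to the number of steps so far.
    target = min(cell_output, 3)
    return [1.0 if k == target else 0.0 for k in range(4)]


result = greedy_decode(toy_cell, toy_project, toy_embed, state=0,
                       start_id=0, end_id=3, max_steps=10)
print(result)
```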
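Compared with BasicDecoder, AttentionDecoder adds an attention mechanism. BeamSearchDecoder keeps the several best candidates at each step (whereas the other two commit to the single most probable result at each step) and prunes the rest. It is still a greedy search with pruning, but it widens the set of candidates considered and raises the chance of finding the best overall sequence. The code is not reproduced here.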
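The difference between committing to one result per step and keeping several candidates can be shown on a toy example. The sketch below is a generic beam search over a hand-written probability table, not the BeamSearchDecoder implementation: the table is hypothetical, sequences have a fixed length, and there is no length penalty.

```python
import math

# Hypothetical next-token probabilities, keyed by the prefix generated so far.
NEXT_TOKEN_PROBS = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"a": 0.55, "b": 0.45},
    ("b",): {"a": 0.9, "b": 0.1},
}


def beam_search(beam_width, num_steps):
    """Keep the `beam_width` best prefixes at every step (log-prob scores)."""
    beams = [((), 0.0)]
    for _ in range(num_steps):
        candidates = [
            (prefix + (token,), score + math.log(prob))
            for prefix, score in beams
            for token, prob in NEXT_TOKEN_PROBS[prefix].items()
        ]
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]  # pruning: the rest is discarded
    return beams[0]


greedy_seq, greedy_score = beam_search(beam_width=1, num_steps=2)
beam_seq, beam_score = beam_search(beam_width=2, num_steps=2)
print(greedy_seq, beam_seq)
```

With a beam width of 1 the search commits to "a" first and ends with probability 0.33; with a width of 2 it keeps "b" alive and finds the sequence with probability 0.36.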
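Seq2Seq models
A Seq2Seq model is built on top of an encoder and a decoder and combines the two; the encoder and the decoder can each use a different kind of RNN cell.
A Seq2Seq model can be configured to specify which encoder, which decoder, which attention mechanism and so on to use: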
# Seq2SeqModel.default_params()
def default_params():
  params = ModelBase.default_params()
  params.update({
      "source.max_seq_len": 50,
      "source.reverse": True,
      "target.max_seq_len": 50,
      "embedding.dim": 100,
      "embedding.init_scale": 0.04,
      "embedding.share": False,
      "inference.beam_search.beam_width": 0,
      "inference.beam_search.length_penalty_weight": 0.0,
      "inference.beam_search.choose_successors_fn": "choose_top_k",
      "optimizer.clip_embed_gradients": 0.1,
      "vocab_source": "",
      "vocab_target": "",
  })
  return params

# BasicSeq2Seq.default_params()
def default_params():
  params = Seq2SeqModel.default_params().copy()
  params.update({
      "bridge.class": "seq2seq.models.bridges.InitialStateBridge",
      "bridge.params": {},
      "encoder.class": "seq2seq.encoders.UnidirectionalRNNEncoder",
      "encoder.params": {},  # Arbitrary parameters for the encoder
      "decoder.class": "seq2seq.decoders.BasicDecoder",
      "decoder.params": {}  # Arbitrary parameters for the decoder
  })
  return params

# AttentionSeq2Seq.default_params()
def default_params():
  params = BasicSeq2Seq.default_params().copy()
  params.update({
      "attention.class": "AttentionLayerBahdanau",
      "attention.params": {},  # Arbitrary attention layer parameters
      "bridge.class": "seq2seq.models.bridges.ZeroBridge",
      "encoder.class": "seq2seq.encoders.BidirectionalRNNEncoder",
      "encoder.params": {},  # Arbitrary parameters for the encoder
      "decoder.class": "seq2seq.decoders.AttentionDecoder",
      "decoder.params": {}  # Arbitrary parameters for the decoder
  })
  return params
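The layering in these default_params functions (copy the parent's defaults, then override some keys) can be reproduced with plain dictionaries. The sketch below also shows one way a dotted class name could be turned into a class at run time, using pydoc.locate from the standard library; whether the seq2seq package resolves its "*.class" entries in exactly this way is an assumption. The constant and function names are illustrative.

```python
from pydoc import locate

BASIC_DEFAULTS = {
    "encoder.class": "seq2seq.encoders.UnidirectionalRNNEncoder",
    "decoder.class": "seq2seq.decoders.BasicDecoder",
}

ATTENTION_OVERRIDES = {
    "encoder.class": "seq2seq.encoders.BidirectionalRNNEncoder",
    "decoder.class": "seq2seq.decoders.AttentionDecoder",
}


def layered_params(*layers):
    """Merge parameter dictionaries; later layers override earlier ones."""
    params = {}
    for layer in layers:
        params.update(layer)
    return params


# A user-level override on top of the attention defaults: keep the attention
# decoder but swap the encoder back to the unidirectional one.
user_overrides = {"encoder.class": "seq2seq.encoders.UnidirectionalRNNEncoder"}
params = layered_params(BASIC_DEFAULTS, ATTENTION_OVERRIDES, user_overrides)
print(params["encoder.class"], params["decoder.class"])

# Resolving a dotted name to a class. A standard-library class is used here
# so the example runs without the seq2seq package installed.
resolved = locate("collections.OrderedDict")
print(resolved.__name__)
```

Because the encoder and decoder are selected by name, trying a different combination is a configuration change, not a code change.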
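Customizing the encoder and decoder
As an alternative to calling a ready-made seq2seq model directly, you can build your own encoder structure, combine it freely with a decoder, and verify experimentally which combination works better. If the ready-made encoder-decoder models in seq2seq.py do not meet your requirements, you can also customize the decoding process.
Summary: this article is meant to help beginners cut through the many confusing descriptions and understand, at a higher level, how cells, chained structures and seq2seq models relate to one another. Hopefully it offers some useful insight.
References: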
https://github.com/tensorflow/nmt
https://cs224d.stanford.edu/lecture_notes/LectureNotes4.pdf
https://tensorflowkorea.files.wordpress.com/2017/03/cs224n-2017winter-notes-all.pdf