Deep Learning Notes: GRU

In earlier deep-learning posts I reposted material on FC, CNN, RNN, and LSTM, but the GRU, a variant of the LSTM, was only mentioned in passing. This post pulls together some additional material to walk through the gating logic of the GRU unit in detail, and then draws on the paper "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling" to compare the similarities and differences between LSTM and GRU.

GRU (Gated Recurrent Unit)

The GRU is a variant of the LSTM: it replaces the LSTM's forget gate, input gate, and output gate with an update gate and a reset gate, as shown in the figure below.
[Figure 1: structure and gating of a GRU unit]
Update gate: $z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$

Reset gate: $r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$

Candidate (current) memory content: $h_t' = \tanh(W \cdot [r_t \circ h_{t-1}, x_t])$

Output: $h_t = (1 - z_t) \circ h_t' + z_t \circ h_{t-1}$

Here $W_z$, $W_r$, and $W$ are weight matrices; $W_z \cdot [h_{t-1}, x_t]$ is shorthand for $[W_{zh}, W_{zx}] \cdot \begin{bmatrix} h_{t-1} \\ x_t \end{bmatrix}$, and $\circ$ denotes element-wise multiplication. These four equations are the complete gating logic of a GRU unit.
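To make the gating logic concrete, here is a minimal NumPy sketch of a single GRU step that follows the four equations above. The function and variable names (`gru_cell`, `W_z`, `W_r`, `W`) are my own, and bias terms are omitted to match the equations:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, W_z, W_r, W):
    """One GRU step. W_z, W_r, W have shape (d_h, d_h + d_x) and act on
    the concatenated vector [h_{t-1}, x_t]; biases are omitted."""
    concat = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    z_t = sigmoid(W_z @ concat)                     # update gate
    r_t = sigmoid(W_r @ concat)                     # reset gate
    concat_r = np.concatenate([r_t * h_prev, x_t])  # [r_t ∘ h_{t-1}, x_t]
    h_cand = np.tanh(W @ concat_r)                  # candidate content h_t'
    return (1.0 - z_t) * h_cand + z_t * h_prev      # interpolate old and new

# Toy usage: scan the cell over a short random sequence.
d_x, d_h = 4, 3
rng = np.random.default_rng(0)
W_z, W_r, W = [rng.standard_normal((d_h, d_h + d_x)) for _ in range(3)]
h = np.zeros(d_h)
for x in rng.standard_normal((5, d_x)):
    h = gru_cell(x, h, W_z, W_r, W)
```

Note that the last line of `gru_cell` is exactly the output equation: the new hidden state is a gated interpolation between the previous state and the candidate, not a fresh overwrite.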

Comparing LSTM and GRU

The figure below shows simplified diagrams of the LSTM unit and the GRU.
[Figure 2: simplified diagrams of the LSTM unit and the GRU]
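As a concrete counterpart to the figure, here is an LSTM step written in the same style as the GRU sketch above (standard LSTM formulation with biases omitted; the names are again my own). The separate forget/input gates and the output gate are exactly the points the quoted discussion below turns on:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x_t, h_prev, c_prev, W_f, W_i, W_o, W_c):
    """One LSTM step. All weight matrices have shape (d_h, d_h + d_x)
    and act on the concatenated vector [h_{t-1}, x_t]; biases omitted."""
    concat = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ concat)        # forget gate
    i_t = sigmoid(W_i @ concat)        # input gate
    o_t = sigmoid(W_o @ concat)        # output gate
    c_cand = np.tanh(W_c @ concat)     # candidate cell content
    c_t = f_t * c_prev + i_t * c_cand  # additive cell-state update
    h_t = o_t * np.tanh(c_t)           # exposure controlled by the output gate
    return h_t, c_t
```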
Below is the paper's discussion of what the two units have in common and where they differ, quoted verbatim. I considered translating it, but the original wording conveys the authors' points more faithfully, so I quote it here and give my own summary afterwards.

The most prominent feature shared between these units is the additive component of their update from t to t + 1, which is lacking in the traditional recurrent unit. The traditional recurrent unit always replaces the activation, or the content of a unit with a new value computed from the current input and the previous hidden state. On the other hand, both LSTM unit and GRU keep the existing content and add the new content on top of it (see Eqs. (4) and (5)).

This additive nature has two advantages. First, it is easy for each unit to remember the existence of a specific feature in the input stream for a long series of steps. Any important feature, decided by either the forget gate of the LSTM unit or the update gate of the GRU, will not be overwritten but be maintained as it is.

Second, and perhaps more importantly, this addition effectively creates shortcut paths that bypass multiple temporal steps. These shortcuts allow the error to be back-propagated easily without too quickly vanishing (if the gating unit is nearly saturated at 1) as a result of passing through multiple, bounded nonlinearities, thus reducing the difficulty due to vanishing gradients [Hochreiter, 1991, Bengio et al., 1994].

These two units however have a number of differences as well. One feature of the LSTM unit that is missing from the GRU is the controlled exposure of the memory content. In the LSTM unit, the amount of the memory content that is seen, or used by other units in the network is controlled by the output gate. On the other hand the GRU exposes its full content without any control.

Another difference is in the location of the input gate, or the corresponding reset gate. The LSTM unit computes the new memory content without any separate control of the amount of information flowing from the previous time step. Rather, the LSTM unit controls the amount of the new memory content being added to the memory cell independently from the forget gate. On the other hand, the GRU controls the information flow from the previous activation when computing the new, candidate activation, but does not independently control the amount of the candidate activation being added (the control is tied via the update gate).

From these similarities and differences alone, it is difficult to conclude which types of gating units would perform better in general. Although Bahdanau et al. [2014] reported that these two units performed comparably to each other according to their preliminary experiments on machine translation, it is unclear whether this applies as well to tasks other than machine translation. This motivates us to conduct more thorough empirical comparison between the LSTM unit and the GRU in this paper.

To summarize: both LSTM and GRU update their state additively, keeping the existing "memory" and adding new content on top of it. This brings two advantages: (1) over a long input sequence it is easy for a unit to keep a specific feature around, since any feature judged important by the LSTM's forget gate or the GRU's update gate is maintained rather than overwritten; (2) the additive updates create shortcut paths for back-propagation, which alleviates the vanishing-gradient problem. The differences: (1) in the LSTM, how much of the memory content is seen or used by the rest of the network is controlled by the output gate, whereas the GRU exposes its full state without any such control; (2) the LSTM computes its new memory content without separately gating the information flowing in from the previous time step, and then controls, independently of the forget gate, how much of that new content is added to the memory cell; the GRU, by contrast, gates the previous activation (via the reset gate) when computing the candidate, but cannot independently control how much of the candidate is added, since that amount is tied to the update gate.
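To make the shared additive structure explicit, the two state updates can be written side by side. This is my own restatement, using this post's notation for the GRU and the standard LSTM cell-state update, not a quotation from the paper:

$$c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t \quad \text{(LSTM cell state)}, \qquad h_t = z_t \circ h_{t-1} + (1 - z_t) \circ h_t' \quad \text{(GRU hidden state)}$$

In both, the previous state enters through a gated identity term instead of being pushed through a fresh nonlinearity; this is what the paper means by keeping the existing content and adding new content on top of it. In the LSTM the two coefficients $f_t$ and $i_t$ are independent gates and the exposed output is further gated as $h_t = o_t \circ \tanh(c_t)$; in the GRU the two coefficients are tied to the single update gate $z_t$ and the state is exposed directly.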

The paper's conclusion, on the datasets it evaluates, is that both LSTM and GRU clearly outperform the traditional RNN unit, but neither gated unit is consistently better than the other.

Finally, how do LSTM and GRU address the vanishing-gradient problem of plain RNNs?

The root cause lies in back-propagation through time: the derivative of the tanh (or sigmoid) activation is bounded by 1 (by 1/4 for the sigmoid), so as the network gets deeper or the sequence gets longer these factors are multiplied together and the gradient decays toward zero. Two common remedies are: (1) use ReLU instead of sigmoid/tanh as the activation function; (2) use LSTM or GRU, whose gated additive updates make the per-step gradient factor along the memory path approximately 0 or approximately 1 when the gate saturates, so the error signal is either cut off outright or passed backwards with essentially no decay, much as with ReLU.
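A rough sketch of this argument, in my own words rather than the paper's: for a plain RNN with $h_t = \tanh(W h_{t-1} + U x_t)$,

$$\frac{\partial h_t}{\partial h_{t-1}} = \operatorname{diag}\!\left(1 - h_t^{\,2}\right) W,$$

and the factor $1 - h_t^2 \in (0, 1]$ appears once per step, so the product of these Jacobians over a long span tends to shrink toward zero (or explode if $W$ has large singular values). For the LSTM cell state $c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$,

$$\frac{\partial c_t}{\partial c_{t-1}} \approx \operatorname{diag}(f_t) \;+\; \text{(terms flowing through the gates)},$$

so when the forget gate saturates near 1 the dominant factor is close to the identity and the gradient survives many steps with little decay, and when it saturates near 0 the path is simply cut. The GRU's update $h_t = (1 - z_t) \circ h_t' + z_t \circ h_{t-1}$ gives the analogous factor $z_t$. This is the sense in which the gates behave like ReLU's pass-through-or-block derivative.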
