Training Very Deep Networks: Paper Notes

Abstract
Theoretical and empirical evidence indicates that the depth of neural networks is crucial for their success. However, training becomes more difficult as depth increases, and training of very deep networks remains an open problem. Here we introduce a new architecture designed to overcome this. Our so-called highway networks allow unimpeded information flow across many layers on information highways. They are inspired by Long Short-Term Memory recurrent networks and use adaptive gating units to regulate the information flow. Even with hundreds of layers, highway networks can be trained directly through simple gradient descent. This enables the study of extremely deep and efficient architectures.

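Summary
Theoretical and empirical evidence shows that the depth of a neural network is crucial to its performance. However, training becomes harder as depth increases, and training very deep networks remains an open problem. Here we introduce a new architecture designed to overcome this. We call it the highway network; it lets information flow unimpeded across its many layers. The network is inspired by LSTM and uses adaptive gating units to regulate the information flow. Even with hundreds of layers, a highway network can be trained directly with simple gradient descent. This makes it possible to study extremely deep and efficient architectures.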

2 Highway Networks
Notation. We use boldface letters for vectors and matrices, and italicized capital letters to denote transformation functions. 0 and 1 denote vectors of zeros and ones respectively, and I denotes an identity matrix. The function $\sigma(x)$ is defined as $\sigma(x)=\frac{1}{1+e^{-x}}$, $x \in \mathbb{R}$. The dot operator (·) is used to denote element-wise multiplication.
A plain feedforward neural network typically consists of $L$ layers where the $l^{th}$ layer ($l \in \{1,2,\dots,L\}$) applies a non-linear transformation $H$ (parameterized by $W_{H,l}$) on its input $x_{l}$ to produce its output $y_{l}$. Thus, $x_{1}$ is the input to the network and $y_{L}$ is the network's output. Omitting the layer index and biases for clarity,
$$y=H(x,W_{H}) \tag{1}$$
H is usually an affine transform followed by a non-linear activation function, but in general it may take other forms, possibly convolutional or recurrent. For a highway network, we additionally define two non-linear transforms $T(x,W_{T})$ and $C(x,W_{C})$ such that
$$y=H(x,W_{H})\cdot T(x,W_{T})+x\cdot C(x,W_{C}) \tag{2}$$
We refer to T as the transform gate and C as the carry gate, since they express how much of the output is produced by transforming the input and carrying it, respectively. For simplicity, in this paper we set C = 1 − T, giving
$$y=H(x,W_{H})\cdot T(x,W_{T})+x\cdot\left(1-T(x,W_{T})\right) \tag{3}$$
The dimensionality of $x$, $y$, $H(x,W_{H})$ and $T(x,W_{T})$ must be the same for Equation 3 to be valid.
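A minimal sketch of Equation 3 in plain Python, for illustration only. It assumes H is an affine map followed by tanh and T is an affine map followed by the sigmoid defined in the notation paragraph. The bias terms `b_H` and `b_T`, the helper names, and the example weights are hypothetical; the equations here omit biases.

```python
import math

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, x, b):
    """Return W x + b for a row-major matrix W (a list of rows)."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def highway_layer(x, W_H, b_H, W_T, b_T):
    """Equation 3: y = H(x) . T(x) + x . (1 - T(x)), element-wise."""
    n = len(x)
    # Equation 3 needs x, H(x) and T(x) to share one dimensionality,
    # so both weight matrices must be n x n and both biases length n.
    for W in (W_H, W_T):
        if len(W) != n or any(len(row) != n for row in W):
            raise ValueError("W_H and W_T must be square with side len(x)")
    if len(b_H) != n or len(b_T) != n:
        raise ValueError("b_H and b_T must have length len(x)")
    h = [math.tanh(a) for a in affine(W_H, x, b_H)]  # H(x); tanh is an assumption
    t = [sigmoid(a) for a in affine(W_T, x, b_T)]    # transform gate T(x)
    return [h_i * t_i + x_i * (1.0 - t_i) for h_i, t_i, x_i in zip(h, t, x)]

# Hypothetical example values.
x = [0.5, -1.0, 2.0]
W_H = [[0.1, 0.2, 0.0], [0.0, -0.3, 0.1], [0.2, 0.0, 0.4]]
W_T = [[0.0] * 3 for _ in range(3)]
y = highway_layer(x, W_H, [0.0] * 3, W_T, [0.0] * 3)
print(y)
```

With `W_T` and `b_T` set to zero, every gate equals 0.5, so each output is the midpoint of the corresponding entries of H(x) and x. Passing an input whose length does not match the weight matrices raises `ValueError`, which mirrors the dimensionality requirement.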
Note that this layer transformation is much more flexible than Equation 1. In particular, observe that for particular values of T,
$$y=\begin{cases} x, & \text{if } T(x,W_{T})=0\\ H(x,W_{H}), & \text{if } T(x,W_{T})=1 \end{cases} \tag{4}$$
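The two cases of Equation 4 can be checked numerically by supplying the gate values directly. The sketch below is illustrative: `highway_combine` is a hypothetical helper, and `h` holds stand-in numbers for H(x, W_H) rather than the output of a real transform.

```python
def highway_combine(h, x, t):
    """Element-wise y = h * t + x * (1 - t) for equal-length lists."""
    if not (len(h) == len(x) == len(t)):
        raise ValueError("h, x and t must have the same length")
    return [h_i * t_i + x_i * (1.0 - t_i) for h_i, x_i, t_i in zip(h, x, t)]

x = [1.0, -2.0, 3.0]    # layer input
h = [0.5, 0.25, -1.0]   # stand-in values for H(x, W_H)

carry = highway_combine(h, x, [0.0, 0.0, 0.0])      # T = 0: y equals x
transform = highway_combine(h, x, [1.0, 1.0, 1.0])  # T = 1: y equals H(x)
mixed = highway_combine(h, x, [0.0, 1.0, 0.5])      # a different gate per unit

print(carry)      # [1.0, -2.0, 3.0]
print(transform)  # [0.5, 0.25, -1.0]
print(mixed)      # [1.0, 0.25, 1.0]
```

The `mixed` case shows that each unit is gated independently: the first carries its input, the second outputs its transform, and the third blends the two equally.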

Similarly, for the Jacobian of the layer transform,
$$\frac{dy}{dx}=\begin{cases} I, & \text{if } T(x,W_{T})=0\\ H'(x,W_{H}), & \text{if } T(x,W_{T})=1 \end{cases} \tag{5}$$
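Equation 5 can be recovered by differentiating Equation 3 term by term. The following is a sketch of that step, not taken from the paper: it assumes $H$ and $T$ are differentiable, writes $T'$ for the Jacobian of the gate, and uses $\operatorname{diag}(v)$ for the diagonal matrix built from a vector $v$. Arguments are dropped for brevity, so $H$ stands for $H(x,W_{H})$ and $T$ for $T(x,W_{T})$.

$$\frac{dy}{dx}=\operatorname{diag}(T)\,H'+\operatorname{diag}(H-x)\,T'+I-\operatorname{diag}(T)$$

If the gate is a sigmoid of an affine map (an assumption here), $T'$ tends to zero as the gate saturates towards 0 or 1, so the middle term drops out. The remaining expression reduces to $I$ when $T=0$ and to $H'$ when $T=1$, which matches the two cases of Equation 5. A sigmoid never reaches 0 or 1 exactly, so both cases are best read as limits.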
Thus, depending on the output of the transform gates, a highway layer can smoothly vary its behavior between that of H and that of a layer which simply passes its inputs through. Just as a plain layer consists of multiple computing units such that the $i^{th}$ unit computes $y_{i}=H_{i}(x)$, a highway network consists of multiple blocks such that the $i^{th}$ block computes a block state $H_{i}(x)$ and transform gate output $T_{i}(x)$. Finally, it produces the block output $y_{i}=H_{i}(x)\ast T_{i}(x)+x_{i}\ast(1-T_{i}(x))$, which is connected to the next layer.
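To see what "simply passes its inputs through" means over many layers, the toy sketch below pushes a single scalar through 100 plain units and through 100 highway units. Everything in it is an assumption made for illustration: one unit per layer, tanh as H, shared weights `w` and `w_t`, and a gate bias `b_t` of -10 chosen so the transform gate stays almost closed. It looks only at the forward signal and says nothing about training behavior.

```python
import math

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def plain_unit(x, w):
    """One plain unit: y = H(x), with H(x) = tanh(w * x)."""
    return math.tanh(w * x)

def highway_unit(x, w, w_t, b_t):
    """One highway unit: y = H(x) * T(x) + x * (1 - T(x))."""
    h = math.tanh(w * x)        # block state H(x)
    t = sigmoid(w_t * x + b_t)  # transform gate T(x)
    return h * t + x * (1.0 - t)

def run_stack(x, depth, layer):
    """Apply the same layer function `depth` times in sequence."""
    for _ in range(depth):
        x = layer(x)
    return x

x0, depth = 0.8, 100
plain_out = run_stack(x0, depth, lambda v: plain_unit(v, 0.5))
highway_out = run_stack(x0, depth, lambda v: highway_unit(v, 0.5, 0.1, -10.0))
print(plain_out, highway_out)
```

With these hypothetical settings the plain stack shrinks the signal towards zero, while the highway stack returns a value close to `x0` because each unit mostly carries its input forward.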
