In this section, we focus on the last two topics for handling the vanishing/exploding gradients problem.
Weight initialization is very important in deep learning. I think one of the reasons early networks did not work well is that people did not pay enough attention to it.
Initializing all the weights to 0 is a bad idea, since all the neurons learn the same thing. In practice, a popular choice is to initialize the weights from N(0, 0.01²) or a uniform distribution and the biases to a constant 0. But this does not work when training a very deep network from scratch: it leads to extremely large or diminishing outputs/gradients. Large weights lead to divergence, while small weights do not allow the network to learn.
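As a rough illustration of this common recipe, here is a minimal NumPy sketch (the layer sizes and helper name are made up for the example):

```python
import numpy as np

def gaussian_init(fan_in, fan_out, std=0.01):
    """Sample weights from N(0, std^2) and set biases to 0."""
    W = np.random.normal(loc=0.0, scale=std, size=(fan_in, fan_out))
    b = np.zeros(fan_out)
    return W, b

# Example: a 784 -> 256 fully connected layer.
W, b = gaussian_init(784, 256)
```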
[Glorot and Bengio. 2010] proposed Xavier initialization to keep the variance of each neuron the same across layers, under the assumption that there is no non-linearity between layers. Layers with many inputs get smaller weights, and layers with fewer inputs get larger weights. But Xavier initialization breaks down with the ReLU non-linearity: ReLU basically kills half of the distribution, so the output variance is halved. [He et al. 2015] extended Xavier initialization to the ReLU non-linearity by doubling the variance of the weights. [Sussillo and Abbott. 2014] kept the norm of the backpropagated errors constant. [Saxe et al. 2013] showed that orthonormal matrix initialization works better than Gaussian noise for linear networks, and it also works for networks with non-linearities. [Krähenbühl et al. 2015] and [Mishkin and Matas. 2015] did not give a closed-form formula for initialization; instead, they proposed data-driven methods that iteratively rescale the weights so that each neuron has roughly unit variance.
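A minimal sketch of the two variance rules (Gaussian versions; the helper names are mine, not from the papers):

```python
import numpy as np

def xavier_init(fan_in, fan_out):
    # Glorot and Bengio 2010: Var(W) = 2 / (fan_in + fan_out),
    # keeping the activation/gradient variance roughly constant across layers.
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return np.random.normal(0.0, std, size=(fan_in, fan_out))

def he_init(fan_in, fan_out):
    # He et al. 2015: double the variance to Var(W) = 2 / fan_in,
    # compensating for ReLU zeroing out half of the activations.
    std = np.sqrt(2.0 / fan_in)
    return np.random.normal(0.0, std, size=(fan_in, fan_out))
```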
[Ioffe and Szegedy. 2015] inserted batch normalization layers to make the output of each neuron roughly unit Gaussian, which greatly reduces the strong dependence on initialization. They also added learnable scale and shift parameters to preserve the representational capacity of the network.
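A minimal sketch of the training-time forward pass (omitting the running statistics used at test time):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift.

    x: (batch_size, num_features); gamma, beta: learnable (num_features,) vectors.
    """
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # roughly unit Gaussian per feature
    return gamma * x_hat + beta               # scale and shift preserve capacity
```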
My notes on the papers mentioned above can be found at the following links (in order):
[Deep Learning Paper Notes][Weight Initialization] Understanding the difficulty of training deep feedforward neural networks
[Deep Learning Paper Notes][Weight Initialization] Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification
[Deep Learning Paper Notes][Weight Initialization] Random walk initialization for training very deep feedforward networks
[Deep Learning Paper Notes][Weight Initialization] Exact solutions to the nonlinear dynamics of learning in deep linear neural networks
[Deep Learning Paper Notes][Weight Initialization] Data-dependent Initializations of Convolutional Neural Networks
[Deep Learning Paper Notes][Weight Initialization] All you need is a good init
[Deep Learning Paper Notes][Weight Initialization] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift