1. Background
Three things accelerate the training of deep neural networks: initialization, normalization, and residual networks. One of the most famous initialization methods is Xavier initialization, introduced in the paper "Understanding the difficulty of training deep feedforward neural networks" (Glorot & Bengio, 2010).
2. Motivation
The motivation comes from how information flows through the network. The figure below shows the distribution of each layer's outputs in a dense network with 5 hidden layers under different initializations.
The key idea of Xavier initialization is to keep this flow of information neither too small nor too large as it passes through the layers.
3. Prerequisites
* If two variables $X$ and $Y$ are independent, the variance of their product is given by
$$\mathrm{Var}(XY) = E[X]^{2}\,\mathrm{Var}(Y) + E[Y]^{2}\,\mathrm{Var}(X) + \mathrm{Var}(X)\,\mathrm{Var}(Y).$$
* If two variables $X$ and $Y$ are independent, the variance of their sum is given by
$$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y).$$
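As a quick sanity check (not part of the original derivation), both identities can be verified numerically; the distributions and moments in the sketch below are arbitrary choices.

```python
import numpy as np

# Monte-Carlo check of the two variance identities above.
# The moments chosen here (E[X]=1, Var[X]=4, E[Y]=3, Var[Y]=1) are arbitrary.
rng = np.random.default_rng(0)
x = rng.normal(1.0, 2.0, size=1_000_000)
y = rng.normal(3.0, 1.0, size=1_000_000)

# Var(XY) = E[X]^2 Var(Y) + E[Y]^2 Var(X) + Var(X) Var(Y) = 1 + 36 + 4 = 41
print((x * y).var(), 1.0**2 * 1.0 + 3.0**2 * 4.0 + 4.0 * 1.0)

# Var(X + Y) = Var(X) + Var(Y) = 5
print((x + y).var(), 4.0 + 1.0)
```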
4. Derivation
4.1 Assumption
For a dense artificial neural network using a symmetric activation function $f$ with unit derivative at 0 (i.e. $f'(0) = 1$), if we write $z^{i}$ (post-activation) for the activation vector of layer $i$ and $s^{i}$ (pre-activation) for the argument vector of the activation function at layer $i$, we have
$$s^{i} = z^{i} W^{i} + b^{i}, \qquad z^{i+1} = f(s^{i}).$$
Note: For an activation function such as $\tanh$ or softsign satisfying $f(x) \approx x$ near zero, we can approximate how the variance of the outputs depends on the variance of the weights and of the inputs as we move forward and backward through the network.
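A minimal numerical check of this approximation (assuming nothing beyond the definitions of tanh and softsign):

```python
import numpy as np

# Near zero, tanh(x) and softsign(x) = x / (1 + |x|) are close to the identity
# and have derivative 1 at 0, which is what the approximation f(x) ≈ x relies on.
x = np.linspace(-0.1, 0.1, 201)
softsign = x / (1.0 + np.abs(x))
print(np.max(np.abs(np.tanh(x) - x)))   # small on this range (~3e-4)
print(np.max(np.abs(softsign - x)))     # small on this range (~9e-3)
```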
4.2 Derivation
Consider the hypothesis that we are in a linear regime at initialization, that the weights are initialized independently, and that the input features have the same variance, written $\mathrm{Var}[x]$. Then we can say that, with $n_{i}$ the size of layer $i$ and $x$ the network input, we have
$$f'(s^{i}_{k}) \approx 1.$$
4.2.1 Forward Direction
Assume the weights $W^{i}$ and the activations $z^{i}$ have zero mean and are independent of each other; then, in the linear regime $z^{i+1}_{k} = f(s^{i}_{k}) \approx s^{i}_{k}$ and the variance identities above give
$$\mathrm{Var}[z^{i+1}_{k}] = \mathrm{Var}\!\Big[\sum_{j=1}^{n_{i}} W^{i}_{j,k}\, z^{i}_{j}\Big] = n_{i}\,\mathrm{Var}[W^{i}]\,\mathrm{Var}[z^{i}].$$
Thus, we have
$$\mathrm{Var}[z^{i}] = \mathrm{Var}[x]\,\prod_{i'=0}^{i-1} n_{i'}\,\mathrm{Var}[W^{i'}].$$
Remark: We assume the weights in each layer are i.i.d. random variables. Thus, we write $\mathrm{Var}[W^{i'}]$ for the shared scalar variance of all weights at layer $i'$, and similarly $\mathrm{Var}[z^{i}]$ for the shared variance of the activations at layer $i$.
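The forward formula can be illustrated with a small simulation; the layer widths, weight standard deviation, and batch size below are arbitrary choices, and the activation is omitted entirely to stay in the linear regime.

```python
import numpy as np

# Sketch of the forward-variance formula
#   Var[z^i] = Var[x] * prod_{i' < i} n_{i'} Var[W^{i'}]
# under the linear-regime assumption f(s) ≈ s.
rng = np.random.default_rng(0)
sizes = [256, 256, 256, 256, 256]      # n_0, ..., n_4 (arbitrary)
w_std = 0.05                           # shared weight std, Var[W] = w_std**2

z = rng.normal(0.0, 1.0, size=(10_000, sizes[0]))   # inputs with Var[x] = 1
predicted = z.var()
for n_in, n_out in zip(sizes[:-1], sizes[1:]):
    W = rng.normal(0.0, w_std, size=(n_in, n_out))
    z = z @ W                          # linear regime: no activation applied
    predicted *= n_in * w_std**2       # multiply by n_{i'} Var[W^{i'}]
    print(f"empirical Var[z] = {z.var():.4f}, predicted = {predicted:.4f}")
```

With `n_i * Var[W] < 1`, as in this example, the activation variance shrinks layer by layer; with `n_i * Var[W] > 1` it grows instead.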
4.2.2 Backward Direction
From the back-propagation, we have
$$\frac{\partial \mathrm{Cost}}{\partial s^{i}_{k}} = f'(s^{i}_{k})\,W^{i+1}_{k,\bullet}\,\frac{\partial \mathrm{Cost}}{\partial s^{i+1}}, \qquad \frac{\partial \mathrm{Cost}}{\partial w^{i}_{l,k}} = z^{i}_{l}\,\frac{\partial \mathrm{Cost}}{\partial s^{i}_{k}}.$$
Thus, using the linear-regime approximation $f'(s^{i}_{k}) \approx 1$ and the variance identities from the prerequisites, each back-propagation step through the weights $W^{i'}$ multiplies the gradient variance by a factor $n_{i'+1}\,\mathrm{Var}[W^{i'}]$ (the fan-out of that layer).
Then, for a $d$-layer neural network, we have
$$\mathrm{Var}\!\left[\frac{\partial \mathrm{Cost}}{\partial s^{i}}\right] = \mathrm{Var}\!\left[\frac{\partial \mathrm{Cost}}{\partial s^{d}}\right]\prod_{i'=i}^{d} n_{i'+1}\,\mathrm{Var}[W^{i'}].$$
Moreover,
$$\mathrm{Var}\!\left[\frac{\partial \mathrm{Cost}}{\partial w^{i}}\right] = \prod_{i'=0}^{i-1} n_{i'}\,\mathrm{Var}[W^{i'}]\;\prod_{i'=i}^{d-1} n_{i'+1}\,\mathrm{Var}[W^{i'}]\;\times\;\mathrm{Var}[x]\,\mathrm{Var}\!\left[\frac{\partial \mathrm{Cost}}{\partial s^{d}}\right].$$
Thus, if we further assume that all layers have the same width $n$ and all weights share the same variance $\mathrm{Var}[W]$, we have
$$\mathrm{Var}\!\left[\frac{\partial \mathrm{Cost}}{\partial s^{i}}\right] = \big[n\,\mathrm{Var}[W]\big]^{d-i}\,\mathrm{Var}\!\left[\frac{\partial \mathrm{Cost}}{\partial s^{d}}\right], \qquad \mathrm{Var}\!\left[\frac{\partial \mathrm{Cost}}{\partial w^{i}}\right] = \big[n\,\mathrm{Var}[W]\big]^{d}\,\mathrm{Var}[x]\,\mathrm{Var}\!\left[\frac{\partial \mathrm{Cost}}{\partial s^{d}}\right],$$
so the variances shrink or explode exponentially with depth unless $n\,\mathrm{Var}[W]$ stays close to one.
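The backward recursion can be checked with the same kind of simulation; here the gradient arriving at the top pre-activation is just a random vector with unit variance, and back-propagation through each layer reduces to a multiplication by the transposed weight matrix since $f' \approx 1$. All sizes are again arbitrary.

```python
import numpy as np

# Sketch of the backward-variance recursion: each step through weights W^{i'}
# multiplies the gradient variance by n_{i'+1} Var[W^{i'}] (the fan-out).
rng = np.random.default_rng(0)
sizes = [256, 256, 256, 256, 256]      # layer widths (arbitrary)
w_std = 0.05

grad = rng.normal(0.0, 1.0, size=(10_000, sizes[-1]))   # gradient at the top, variance 1
predicted = grad.var()
for n_in, n_out in zip(reversed(sizes[:-1]), reversed(sizes[1:])):
    W = rng.normal(0.0, w_std, size=(n_in, n_out))
    grad = grad @ W.T                  # back-propagate through a linear layer
    predicted *= n_out * w_std**2      # multiply by the fan-out times Var[W]
    print(f"empirical Var[grad] = {grad.var():.4f}, predicted = {predicted:.4f}")
```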
4.2.3 Requirement
To keep information flowing, we would like the variance of the activations to be the same across layers in the forward direction, and the variance of the back-propagated gradients to be the same across layers in the backward direction:
$$\forall (i, i'),\quad \mathrm{Var}[z^{i}] = \mathrm{Var}[z^{i'}], \qquad \mathrm{Var}\!\left[\frac{\partial \mathrm{Cost}}{\partial s^{i}}\right] = \mathrm{Var}\!\left[\frac{\partial \mathrm{Cost}}{\partial s^{i'}}\right].$$
This corresponds to
$$\forall i,\quad n_{i}\,\mathrm{Var}[W^{i}] = 1 \quad\text{(forward)},$$
$$\forall i,\quad n_{i+1}\,\mathrm{Var}[W^{i}] = 1 \quad\text{(backward)}.$$
As a compromise between these two constraints, we want
$$\mathrm{Var}[W^{i}] = \frac{2}{n_{i} + n_{i+1}}.$$
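A small helper (hypothetical, not from the paper) that turns this compromise into concrete sampling parameters, using the fact that a uniform distribution on $[-a, a]$ has variance $a^{2}/3$:

```python
import math

# Convert the compromise Var[W] = 2 / (n_in + n_out) into the std of a
# zero-mean normal draw and the bound of a uniform draw on [-a, a].
def xavier_params(n_in: int, n_out: int):
    var = 2.0 / (n_in + n_out)
    std = math.sqrt(var)               # normal: W ~ N(0, std^2)
    bound = math.sqrt(3.0 * var)       # uniform: W ~ U[-bound, bound], bound = sqrt(6 / (n_in + n_out))
    return std, bound

print(xavier_params(256, 256))         # (0.0625, ~0.1083)
```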
Remark: From the derivation, we see that Xavier initialization only requires mean 0 and the variance given above, so the exact shape of the distribution does not matter. Accordingly, PyTorch provides two kinds of Xavier initialization, one sampling from a normal distribution and one from a uniform distribution.
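A minimal PyTorch sketch of that remark: `nn.init.xavier_uniform_` and `nn.init.xavier_normal_` target the same mean and variance, and only the sampling distribution differs (the tensor shape below is an arbitrary example).

```python
import torch
import torch.nn as nn

# Both initializers aim for mean 0 and Var[W] = 2 / (fan_in + fan_out).
fan_in, fan_out = 256, 128
target_var = 2.0 / (fan_in + fan_out)

w_uniform = torch.empty(fan_out, fan_in)
nn.init.xavier_uniform_(w_uniform)     # uniform on [-sqrt(6/(fan_in+fan_out)), +sqrt(6/(fan_in+fan_out))]

w_normal = torch.empty(fan_out, fan_in)
nn.init.xavier_normal_(w_normal)       # normal with std sqrt(2/(fan_in+fan_out))

print(target_var, w_uniform.var().item(), w_normal.var().item())   # all approximately equal
```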
References:
https://blog.csdn.net/freeyy1314/article/details/85029599
https://intoli.com/blog/neural-network-initialization/
Glorot & Bengio (2010), Understanding the difficulty of training deep feedforward neural networks: http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf