Xavier Initialization

1. Background

Three techniques are commonly credited with accelerating the training of deep networks: initialization, normalization, and residual connections. One of the most famous initialization methods is Xavier initialization, published by Glorot and Bengio in "Understanding the difficulty of training deep feedforward neural networks" (see the reference list).

2. Motivation

The motivation comes from how information flows through the network. The figure below shows the distribution of each layer's outputs in a dense network with 5 hidden layers under different initializations.


[Figure: activations of each hidden layer under different weight initializations]

The key idea of Xavier initialization is to keep the signal flowing through the network from becoming either too small (vanishing) or too large (exploding), in both the forward and the backward pass.
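To make this concrete, here is a minimal sketch (not the paper's exact experiment; the layer width of 256, the batch of 1000 samples, and the three weight scales are arbitrary choices) that propagates random inputs through a 5-hidden-layer tanh network and prints the standard deviation of each layer's output:

```python
# Minimal sketch: propagate random inputs through a 5-hidden-layer tanh network
# and print the std of each layer's output for three weight scales:
# too small, too large, and Xavier.
import numpy as np

rng = np.random.default_rng(0)
n = 256                                  # width of every layer (arbitrary choice)
x = rng.standard_normal((1000, n))

for name, scale in [("too small", 0.01),
                    ("too large", 1.0),
                    ("xavier", np.sqrt(2.0 / (n + n)))]:
    z = x
    stds = []
    for _ in range(5):
        W = rng.standard_normal((n, n)) * scale  # weights ~ N(0, scale^2)
        z = np.tanh(z @ W)
        stds.append(z.std())
    print(name, [f"{s:.3f}" for s in stds])
```

With too-small weights the activations shrink towards 0 layer by layer; with too-large weights tanh saturates; with the Xavier scale the spread stays roughly constant.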

3. Prerequisites

*If two variables $X$ and $Y$ are independent, the variance of their product is $\text{Var}[XY] = \text{Var}[X]\text{Var}[Y] + \mathbb{E}[X]^2\text{Var}[Y] + \mathbb{E}[Y]^2\text{Var}[X]$, which reduces to $\text{Var}[X]\text{Var}[Y]$ when both variables have mean 0.

*If two variables $X$ and $Y$ are independent, the variance of their sum is $\text{Var}[X+Y] = \text{Var}[X] + \text{Var}[Y]$. (Both identities are checked numerically below.)
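A quick Monte Carlo sanity check of the two identities, using arbitrary zero-mean normal variables:

```python
# Quick numerical check of the two variance identities above (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=2.0, size=1_000_000)   # mean 0, var 4
y = rng.normal(loc=0.0, scale=3.0, size=1_000_000)   # mean 0, var 9

print(np.var(x * y), np.var(x) * np.var(y))          # both ~36 (zero-mean case)
print(np.var(x + y), np.var(x) + np.var(y))          # both ~13
```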

4. Derivation

4.1 Assumption

For a dense artificial neural network using a symmetric activation function $f$ with unit derivative at 0 (i.e. $f'(0) = 1$), if we write $z^i$ (post-activation) for the activation vector of layer $i$ and $s^i$ (pre-activation) for the argument vector of the activation function at layer $i$, we have
\begin{align*} s^i = z^i W^i + b^i, \qquad z^{i+1} = f(s^i) \end{align*}

Note: For an activation function such as $\tanh$ or softsign satisfying $f(x) \approx x$ near zero, we can approximate how the variance of the outputs depends on the variance of the weights and of the inputs as we move forward and backward through the network.

4.2 Derivation

Consider the hypothesis that we are in a linear regime at initialization, that the weights are initialized independently, and that the input features all have the same variance $\text{Var}[x]$. Then we can say that, with $n_i$ the size of layer $i$ and $x$ the network input, the relations below hold.

4.2.1 Forward Direction

Assume $f'(s^i_k) \approx 1$ and that the biases are initialized to 0; then we have
\begin{align*} \text{Var}\big[z^{i+1}_k\big] &= \text{Var}\big[f(s^i_k)\big] \approx \text{Var}\big[s^i_k\big] = \text{Var}\bigg[\sum_{j=1}^{n_i} z^i_j\, W^i_{j,k} + b^i_k\bigg]\\ &= \sum_{j=1}^{n_i}\text{Var}\big[z^i_j\big]\,\text{Var}\big[W^i_{j,k}\big] = n_i\,\text{Var}\big[W^i\big]\,\text{Var}\big[z^i\big] \end{align*}

Thus, we have
\begin{align*} \text{Var}\big[z^{i}\big] = \text{Var}[x]\prod_{i'=0}^{i-1} n_{i'}\,\text{Var}\big[W^{i'}\big] \end{align*}

Remark: We assume the weights in each layer are i.i.d. zero-mean random variables. Thus, we write $\text{Var}[W^i]$ for the shared scalar variance of all weights at layer $i$, and similarly $\text{Var}[z^i]$ for the shared variance of all activations at layer $i$.
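As a sanity check of the single-step relation $\text{Var}[z^{i+1}] = n_i\,\text{Var}[W^i]\,\text{Var}[z^i]$, here is a small sketch in the linear regime (the layer widths and weight variance are arbitrary choices):

```python
# Check the forward recursion Var[z^{i+1}] = n_i * Var[W^i] * Var[z^i]
# in the linear regime (activation dropped, bias zero).
import numpy as np

rng = np.random.default_rng(1)
n_in, n_out = 400, 300
sigma2_w = 0.002                                   # chosen weight variance

z = rng.standard_normal((5000, n_in))              # Var[z] ~= 1
W = rng.normal(0.0, np.sqrt(sigma2_w), (n_in, n_out))
z_next = z @ W                                     # linear regime: f(s) ~= s

print(z_next.var())                                # empirical Var[z^{i+1}]
print(n_in * sigma2_w * z.var())                   # predicted n_i * Var[W] * Var[z]
```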

4.2.2 Backward Direction

From the back-propagation, we have

\begin{align*} \frac{\partial \text{Cost}}{\partial s^{i}_k} &= \sum_{l=1}^{n_{i+1}}\frac{\partial \text{Cost}}{\partial s^{i+1}_l}\cdot \frac{\partial s^{i+1}_l}{\partial s^i_k} \\ &=\sum_{l=1}^{n_{i+1}}\frac{\partial \text{Cost}}{\partial s^{i+1}_l}\cdot \frac{\partial \big(\sum_{j=1}^{n_i} f(s^i_j)\cdot W^{i+1}_{j,l}+b^{i+1}_l\big)}{\partial s^i_k} \\ &=\sum_{l=1}^{n_{i+1}}\frac{\partial \text{Cost}}{\partial s^{i+1}_l}\cdot f'(s^i_k)\cdot W^{i+1}_{k,l}\\ &=f'(s_k^i)\cdot W^{i+1}_{k,\cdot}\cdot \frac{\partial \text{Cost}}{\partial s^{i+1}} \end{align*}
Thus
\begin{align*} \text{Var}\bigg[\frac{\partial \text{Cost}}{\partial s^i_k}\bigg] &= \sum_{l=1}^{n_{i+1}}\text{Var} \bigg[\frac{\partial \text{Cost}}{\partial s^{i+1}_l}\bigg]\text{Var}\bigg[W^{i+1}_{k,l}\bigg]\\ & = n_{i+1}\,\text{Var}\bigg[W^{i+1}\bigg]\cdot\text{Var} \bigg[\frac{\partial \text{Cost}}{\partial s^{i+1}}\bigg] \end{align*}
Then for a $d$-layer neural network, we have
\begin{align*} \text{Var}\bigg[\frac{\partial \text{Cost}}{\partial s^{i}}\bigg] = \text{Var} \bigg[\frac{\partial \text{Cost}}{\partial s^{d}}\bigg]\prod_{i'=i}^{d-1} n_{i'+1}\,\text{Var}\big[W^{i'}\big] \end{align*}

Moreover,
\begin{align*} \frac{\partial \text{Cost}}{\partial W^{i}_{l,k}} &= \frac{\partial \text{Cost}}{\partial s^{i}_{k}}\cdot \frac{\partial s^i_k}{\partial W^i_{l,k}}\\ &=\frac{\partial \text{Cost}}{\partial s^{i}_{k}}\cdot\frac{\partial z^i_lW^i_{l,k}}{\partial W^i_{l,k}}\\ &=z^i_l\cdot \frac{\partial \text{Cost}}{\partial s^{i}_{k}} \end{align*}

Thus, we have
\begin{align*} \text{Var}\bigg[\frac{\partial \text{Cost}}{\partial W^{i}}\bigg]&= \text{Var}[z^i] \text{Var}\bigg[ \frac{\partial \text{Cost}}{\partial s^{i}}\bigg]\\ &=\text{Var}[x]\prod_{i'= 0}^{i-1}n_{i'}\text{Var}[W^{i'}] \times \text{Var} \bigg[\frac{\partial \text{Cost}}{\partial s^{d}}\bigg]\prod_{i'=i}^{d-1} n_{i'+1}\text{Var}\bigg[W^{i'}\bigg] \end{align*}
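The backward recursion can be illustrated in the same way. The sketch below propagates a random "gradient" backwards through a few linear layers (the widths and weight variance are arbitrary choices) and compares the empirical variance with the predicted factor $n_{i+1}\,\text{Var}[W^{i+1}]$ per step:

```python
# Sketch of the backward recursion in the linear regime (f' ~= 1): the gradient
# w.r.t. s^i is grad^i = grad^{i+1} @ (W^{i+1})^T, so each backward step should
# multiply its variance by n_{i+1} * Var[W^{i+1}].
import numpy as np

rng = np.random.default_rng(2)
widths = [300, 250, 200, 150]          # n_0, ..., n_3 (hypothetical layer sizes)
sigma2_w = 0.004

# W[i] connects layer i (width widths[i]) to layer i+1 (width widths[i+1]).
W = [rng.normal(0.0, np.sqrt(sigma2_w), (widths[i], widths[i + 1]))
     for i in range(len(widths) - 1)]

grad = rng.standard_normal((5000, widths[-1]))   # gradient at the last layer, Var ~= 1
predicted = grad.var()
for i in reversed(range(len(W))):                # propagate back to layer 0
    grad = grad @ W[i].T
    predicted *= widths[i + 1] * sigma2_w
    print(f"layer {i}: empirical {grad.var():.4f}  predicted {predicted:.4f}")
```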

4.2.3 Requirement

From the forward-propagation point of view, to keep information flowing we would like the activation variance to be preserved across layers, i.e. $\forall (i,i'),\ \text{Var}[z^i] = \text{Var}[z^{i'}]$; from the back-propagation point of view, we would likewise want $\forall (i,i'),\ \text{Var}\big[\frac{\partial \text{Cost}}{\partial s^i}\big] = \text{Var}\big[\frac{\partial \text{Cost}}{\partial s^{i'}}\big]$. Which corresponds to

\begin{align*} \forall i,\quad n_i\,\text{Var}\big[W^i\big] = 1 \qquad &\text{(forward)}\\ \forall i,\quad n_{i+1}\,\text{Var}\big[W^i\big] = 1 \qquad &\text{(backward)} \end{align*}

  • As a compromise between these two constraints, we want

\begin{align*} \forall i,\quad \text{Var}\big[W^i\big] = \frac{2}{n_i + n_{i+1}} \end{align*}

Remark: From the derivation, we see that Xavier initialization only requires the weights to have mean 0 and the variance given above; the particular distribution family does not matter. This is why PyTorch offers two variants of Xavier initialization, one drawing from a normal distribution and one from a uniform distribution.
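For example, a minimal PyTorch sketch (the layer sizes are arbitrary):

```python
# Apply Xavier initialization to a linear layer. For the uniform variant,
# weights are drawn from U(-a, a) with a = sqrt(6 / (n_in + n_out)), which has
# the same variance 2 / (n_in + n_out) as the normal variant.
import torch.nn as nn

layer = nn.Linear(in_features=512, out_features=256)

nn.init.xavier_normal_(layer.weight)    # N(0, 2 / (n_in + n_out))
# or:
nn.init.xavier_uniform_(layer.weight)   # U(-a, a), a = sqrt(6 / (n_in + n_out))

# The empirical variance should be close to 2 / (n_in + n_out).
n_in, n_out = layer.in_features, layer.out_features
print(layer.weight.var().item(), 2.0 / (n_in + n_out))
```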

References:

https://blog.csdn.net/freeyy1314/article/details/85029599
https://intoli.com/blog/neural-network-initialization/
http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf
