If you work through the Caffe MNIST tutorial, you’ll come across this curious line
weight_filler { type: "xavier" }
and the accompanying explanation
For the weight filler, we will use the xavier algorithm that automatically determines the scale of initialization based on the number of input and output neurons.
Unfortunately, at the time of writing, Google hadn't heard much about “the xavier algorithm”. To work out what it is, you need to poke around the Caffe source until you find the right docstring and then read the referenced paper, Xavier Glorot & Yoshua Bengio’s Understanding the difficulty of training deep feedforward neural networks.
In short, it helps signals reach deep into the network.
Xavier initialization makes sure the weights are ‘just right’, keeping the signal in a reasonable range of values through many layers.
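To see what “keeping the signal in a reasonable range” looks like in practice, here is a small numpy sketch (not from the tutorial; the layer width, depth and scales are arbitrary choices of mine) that pushes a unit-variance signal through a stack of purely linear layers, once with a fixed-scale Gaussian init and once with Xavier-style scaling:

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 512, 10                      # layer width and number of layers (arbitrary)
x = rng.standard_normal(n)              # unit-variance input signal

for name, std in [("fixed std=0.01", 0.01),
                  ("xavier std=sqrt(1/n)", np.sqrt(1.0 / n))]:
    h = x.copy()
    print(name)
    for layer in range(depth):
        W = rng.normal(0.0, std, size=(n, n))   # zero-mean weights
        h = W @ h                               # linear layer, no nonlinearity
        print(f"  layer {layer + 1}: std of activations = {h.std():.3g}")
```

With the fixed scale the signal collapses towards zero within a few layers; with the Xavier-style scale its standard deviation stays close to 1 all the way down.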
To go any further than this, you’re going to need a small amount of statistics - specifically you need to know about random distributions and their variance.
In Caffe, it’s initializing the weights in your network by drawing them from a distribution with zero mean and a specific variance,

$$\mathrm{Var}(W) = \frac{1}{n_\text{in}}$$

where $W$ is the initialization distribution for the neuron in question, and $n_\text{in}$ is the number of neurons feeding into it. The distribution used is typically Gaussian or uniform.
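As a rough numpy sketch of a filler in this spirit (the function name and shapes are mine, not Caffe’s): a uniform distribution on $[-a, a]$ has variance $a^2/3$, so choosing $a = \sqrt{3/n_\text{in}}$ gives exactly the variance above.

```python
import numpy as np

def xavier_fill(n_out, n_in, rng=np.random.default_rng(0)):
    """Zero-mean weights with variance 1/n_in, drawn uniformly from [-a, a];
    a uniform on [-a, a] has variance a^2/3, so a = sqrt(3/n_in)."""
    a = np.sqrt(3.0 / n_in)
    return rng.uniform(-a, a, size=(n_out, n_in))

W = xavier_fill(n_out=128, n_in=256)
print(W.var(), 1.0 / 256)   # the two numbers should be close
```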
It’s worth mentioning that Glorot & Bengio’s paper originally recommended using

$$\mathrm{Var}(W) = \frac{2}{n_\text{in} + n_\text{out}}$$

where $n_\text{out}$ is the number of neurons the result is fed to. We’ll come to why Caffe’s scheme might be different in a bit.
Suppose we have an input $X$ with $n$ components and a linear neuron with random weights $W$ that spits out a number $Y$. What’s the variance of $Y$? Well, we can write

$$Y = W_1 X_1 + W_2 X_2 + \dots + W_n X_n$$

And from Wikipedia we can work out that $W_i X_i$ is going to have variance

$$\mathrm{Var}(W_i X_i) = E[X_i]^2 \mathrm{Var}(W_i) + E[W_i]^2 \mathrm{Var}(X_i) + \mathrm{Var}(W_i)\mathrm{Var}(X_i)$$

Now if our inputs and weights both have mean $0$, that simplifies to

$$\mathrm{Var}(W_i X_i) = \mathrm{Var}(W_i)\mathrm{Var}(X_i)$$

Then if we make a further assumption that the $X_i$ and $W_i$ are all independent and identically distributed, we can work out that the variance of $Y$ is

$$\mathrm{Var}(Y) = \mathrm{Var}(W_1 X_1 + W_2 X_2 + \dots + W_n X_n) = n \,\mathrm{Var}(W_i)\,\mathrm{Var}(X_i)$$

Or in words: the variance of the output is the variance of the input, but scaled by $n\,\mathrm{Var}(W_i)$. So if we want the variance of the input and output to be the same, that means $n\,\mathrm{Var}(W_i)$ should be 1. Which means the variance of the weights should be

$$\mathrm{Var}(W_i) = \frac{1}{n} = \frac{1}{n_\text{in}}$$
Voila. There’s your Caffe-style Xavier initialization.
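If you want a quick numerical sanity check of that derivation, here is a sketch ($n$ and the input variance are arbitrary) that estimates $\mathrm{Var}(Y)$ over many random draws and compares it with $n\,\mathrm{Var}(W)\,\mathrm{Var}(X)$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300                                 # number of inputs to the neuron (arbitrary)
var_w, var_x = 1.0 / n, 2.0             # weight variance 1/n, arbitrary input variance

# 20,000 independent draws of (W, X), each giving one output Y = sum_i W_i X_i
W = rng.normal(0.0, np.sqrt(var_w), size=(20_000, n))
X = rng.normal(0.0, np.sqrt(var_x), size=(20_000, n))
Y = (W * X).sum(axis=1)

print("Var(Y)          :", Y.var())
print("n Var(W) Var(X) :", n * var_w * var_x)   # should match Var(Y)
print("Var(X)          :", var_x)               # and equal Var(X), since n Var(W) = 1
```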
Glorot & Bengio’s formula needs a tiny bit more work. If you go through the same steps for the backpropagated signal, you find that you need

$$\mathrm{Var}(W_i) = \frac{1}{n_\text{out}}$$

to keep the variance of the input gradient and the output gradient the same. These two constraints can only be satisfied simultaneously if $n_\text{in} = n_\text{out}$, so as a compromise, Glorot & Bengio take the average of the two:

$$\mathrm{Var}(W_i) = \frac{2}{n_\text{in} + n_\text{out}}$$
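The backward constraint can be checked the same way: for a linear layer, the gradient with respect to the input is $W^\top$ times the gradient with respect to the output, so its variance picks up a factor of $n_\text{out}\,\mathrm{Var}(W)$. A small sketch, with arbitrary sizes:

```python
import numpy as np

rng = np.random.default_rng(2)
n_in, n_out = 400, 100
var_w = 1.0 / n_out                      # the backward-preserving choice

W = rng.normal(0.0, np.sqrt(var_w), size=(n_out, n_in))
g_out = rng.standard_normal((20_000, n_out))   # unit-variance gradients w.r.t. the layer output
g_in = g_out @ W                               # backprop through y = Wx: dL/dx = W^T dL/dy

print("Var of input gradient:", g_in.var())    # ~ n_out * Var(W) * 1
print("n_out * Var(W)       :", n_out * var_w)
```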
I’m not sure why the Caffe authors used the $n_\text{in}$-only variant. Perhaps preserving the variance of the forward-propagated signal was judged more important than preserving that of the back-propagated one, or perhaps $n_\text{in}$ is simply more convenient to get hold of when a layer’s weights are being filled.
Isn’t this derivation resting on some fairly strong assumptions? It is. But it works. Xavier initialization was one of the big enablers of the move away from per-layer generative pre-training.
The assumption most worth talking about is the “linear neuron” bit. This is justified in Glorot & Bengio’s paper because immediately after initialization, the parts of the traditional nonlinearities (tanh, sigmoid) that are being explored are the bits close to zero, where the gradient is close to 1. For the more recent rectifying nonlinearities that doesn’t hold, and in a more recent paper by He, Zhang, Ren and Sun they build on Glorot & Bengio and suggest using

$$\mathrm{Var}(W_i) = \frac{2}{n_\text{in}}$$

instead. Which makes sense: a rectifying linear unit is zero for half of its input, so you need to double the weight variance to keep the signal’s variance constant.
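Here is the same kind of sketch for the rectified case (widths and depth again arbitrary): with $\mathrm{Var}(W) = 1/n$ the pre-activation variance shrinks by roughly half at every ReLU layer, while with $2/n$ it stays flat with depth.

```python
import numpy as np

rng = np.random.default_rng(3)
n, depth = 512, 10                        # layer width and depth (arbitrary)

for name, var_w in [("xavier 1/n", 1.0 / n), ("He et al. 2/n", 2.0 / n)]:
    h = rng.standard_normal(n)
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(var_w), size=(n, n))
        z = W @ h                         # pre-activation
        h = np.maximum(z, 0.0)            # ReLU throws away the negative half
    print(f"{name}: pre-activation variance at layer {depth} = {z.var():.3g}")
```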