These are notes extracted from http://saban.wang/2016/03/28/Normalizing-All-Layers%EF%BC%9A-Back-Propagation/
They mainly cover how to normalize layers during back-propagation.
In the last post, we discussed how to make all neurons of a neural network have a standard Gaussian distribution. However, as the Conclusion section noted, we had not yet considered the back-propagation procedure. In fact, when we talk about the gradient vanishing or exploding problem, we usually refer to the gradients flowing during back-propagation. For this reason, the correct approach seems to be to normalize the backward gradients of the neurons rather than the forward values.
In this post, we will discuss how to normalize all the gradients using a philosophy similar to that of the last post: for a given gradient dy ~ N(0, I), normalize the layer so that dx is expected to have zero mean and unit standard deviation.
Consider the back-propagation formula of the Convolution and InnerProduct layers, dx = W^T dy. It has the same form as the forward formula y = W x, but with W replaced by its transpose, so the variance analysis from the last post carries over with the fan-in replaced by the fan-out.
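Under this analysis, initializing W with standard deviation 1/sqrt(n_out) should keep dx at unit standard deviation when dy ~ N(0, I). Below is a minimal NumPy sketch that checks this numerically; the layer sizes and variable names are illustrative and not taken from the original post.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, batch = 256, 1024, 10_000   # illustrative sizes

# Backward-normalized initialization: std = 1 / sqrt(n_out) (fan-out),
# so that dx = W^T dy has roughly unit variance when dy ~ N(0, I).
W = rng.normal(0.0, 1.0 / np.sqrt(n_out), size=(n_out, n_in))

dy = rng.normal(size=(batch, n_out))     # incoming gradients, ~ N(0, 1)
dx = dy @ W                              # backward pass of y = W x

print("mean of dx:", dx.mean())          # close to 0
print("std  of dx:", dx.std())           # close to 1
```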
One problem that cannot be avoided when deriving the formulas for activations is that we must make assumptions not only about the distribution of the gradients but also about the forward input of the activation, because the gradient of an activation usually depends on its input. Here we assume that both the input x and the gradient dy follow the standard Gaussian distribution N(0, I) and that they are independent of each other.
The backward gradient of the Sigmoid activation is dx = dy * sigmoid(x) * (1 - sigmoid(x)), where sigmoid(x) = 1 / (1 + e^(-x)).
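Since sigmoid'(x) is at most 1/4, the backward gradient is strongly shrunk under the assumptions above. A short Monte Carlo sketch (sample size and variable names are my own) estimates the resulting standard deviation and the rescaling factor that would restore unit std:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Assumed distributions from the text: x and dy are independent N(0, 1).
x  = rng.normal(size=n)
dy = rng.normal(size=n)

sig = 1.0 / (1.0 + np.exp(-x))
dx  = dy * sig * (1.0 - sig)        # backward pass of y = sigmoid(x)

print("mean of dx:", dx.mean())     # ~0, since dy has zero mean
print("std  of dx:", dx.std())      # well below 1: the gradient is shrunk
print("rescale factor for unit std:", 1.0 / dx.std())
```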
The backward formula for the Dropout layer is almost the same as the forward one: we should still divide the preserved values by sqrt(q) (where q is the keep probability) to achieve unit std in both the forward and the backward procedure.
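The following sketch illustrates this scaling under the same zero-mean assumptions; the keep probability q = 0.8 and the sample size are illustrative values, not from the original post.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 1_000_000, 0.8              # q = keep probability (illustrative)

x    = rng.normal(size=n)          # forward input, assumed N(0, 1)
dy   = rng.normal(size=n)          # backward gradient, assumed N(0, 1)
mask = rng.random(n) < q           # the same mask is applied in both passes

# Dividing by sqrt(q) (rather than the usual q of inverted dropout) keeps
# the standard deviation at 1 for zero-mean signals.
y  = x  * mask / np.sqrt(q)        # forward
dx = dy * mask / np.sqrt(q)        # backward

print("std of y :", y.std())       # ~1.0
print("std of dx:", dx.std())      # ~1.0
```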
In this post, we have discussed a normalization strategy that serves the gradient flow of backward propagation. The standard deviations of the gradients in modern CNNs are recorded in the original post. However, when we normalize by the std of the backward gradients, the scale of the forward values is no longer well controlled. Inhomogeneous activations, such as sigmoid and tanh, are not suitable for this method because their inputs may then fail to cover a sufficiently non-linear part of the activation.
So perhaps a good choice is to use separate scaling for the forward and backward propagation? This idea conflicts with the back-propagation algorithm, so we should still examine it carefully through experiments.