Motivation
2015年的论文《Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift》阐述了BN算法,这个算法目前已经被大量应用,很多论文都会引用这个算法,进行网络训练,可见其强大之处非同一般。
论文作者认为:
网络训练过程中参数不断改变导致后续每一层输入的分布也发生变化,而学习的过程又要使每一层适应输入的分布,因此我们不得不降低学习率、小心地初始化。
论文作者将分布发生变化称之为 internal covariate shift。
大家应知道,一般在训练网络的时会将输入减去均值,还有些人甚至会对输入做白化等操作,目的是为了加快训练。为什么减均值、白化可以加快训练呢,这里做一个简单地说明:
首先,图像数据是高度相关的,假设其分布如下图a所示(简化为2维)。由于初始化的时候,我们的参数一般都是0均值的,因此开始的拟合y=Wx+b,基本过原点附近,如图b红色虚线。因此,网络需要经过多次学习才能逐步达到如紫色实线的拟合,即收敛的比较慢。如果我们对输入数据先作减均值操作,如图c,显然可以加快学习。更进一步的,我们对数据再进行去相关操作,使得数据更加容易区分,这样又会加快训练,如图d。
白化的方式有好几种,常用的有PCA白化:即对数据进行PCA操作之后,在进行方差归一化。这样数据基本满足0均值、单位方差、弱相关性。作者首先考虑,对每一层数据都使用白化操作,但分析认为这是不可取的。因为白化需要计算协方差矩阵、求逆等操作,计算量很大,此外,反向传播时,白化操作不一定可导。于是,作者采用下面的Normalization方法。
在深度学习中,随机梯度下降已经成为主要的训练方法。尽管随机梯度下降法对于训练深度网络简单高效,但需要人为的选择一些参数,比如学习率、参数初始化、权重衰减系数、dropout比例等。这些参数的选择对训练结果至关重要,以至于我们很多时间都浪费在这些的调参上。那么学完这篇文献之后,你可以不需要那么刻意的慢慢调整参数。BN算法的强大之处表现在:
# coding=UTF-8
# neural network structure for this sample:
#
# · · · · · · · · · · (input data, 1-deep) X [batch, 28, 28, 1]
# @ @ @ @ @ @ @ @ @ @ conv. layer +BN 5x5x1=> 24 stride 1 W1 [ 5, 5, 1, 24] B1 [24]
# Y1 [batch, 28, 28, 6]
# @ @ @ @ @ @ @ @ conv. layer +BN 5x5x6=> 48 stride 2 W2 [ 5, 5, 6, 48] B2 [48]
# Y2 [batch, 14, 14, 12]
# @ @ @ @ @ @ conv. layer +BN 4x4x12=> 64 stride 2 W3 [ 4, 4, 12, 64] B3 [64]
# Y3 [batch, 7, 7, 24]
# => reshaped to YY [batch, 7*7*24]
# \x/x\x\x/ fully connected layer (relu+dropout+BN) W4 [7*7*24, 200] B4 [200]
# · · · · Y4 [batch, 200]
# \x/x\x/ fully connected layer (softmax) W5 [200, 10] B5 [10]
# · · · Y [batch, 10]
import tensorflow as tf
import math
from tensorflow.examples.tutorials.mnist import input_data as mnist_data
tf.set_random_seed(0.0)
# Download images and labels into mnist.test (10K images+labels)
# and mnist.train (60K images+labels)
mnist = mnist_data.read_data_sets("data", one_hot=True, reshape=False, validation_size=0)
# Ylogits - input data that need tobe batch normalised. For convolutional
# layer, it's a 4-D tensor. For fully connected layer, it's a 2-D tensor
# is_test - flag, is_test = False for train
# is_test = True for test
# offset - beta
# gamma(scaling) is not useful for relu
def batchnormForRelu(Ylogits, is_test, Iteration, offset, convolutional=False):
# adding the iteration prevents from averaging across non-existing iterations
exp_moving_avg = tf.train.ExponentialMovingAverage(0.999, Iteration)
bnepsilon = 1e-5
if convolutional:
mean, variance = tf.nn.moments(Ylogits, [0, 1, 2])
else:
mean, variance = tf.nn.moments(Ylogits, [0])
update_moving_averages = exp_moving_avg.apply([mean, variance])
m = tf.cond(is_test, lambda: exp_moving_avg.average(mean), lambda: mean)
v = tf.cond(is_test, lambda: exp_moving_avg.average(variance), lambda: variance)
Ybn = tf.nn.batch_normalization(Ylogits, m, v, offset, None, bnepsilon)
return Ybn, update_moving_averages
def compatible_convolutional_noise_shape(Y):
noiseshape = tf.shape(Y)
noiseshape = noiseshape * tf.constant([1,0,0,1]) + tf.constant([0,1,1,0])
return noiseshape
# input X: 28x28 grayscale images
X = tf.placeholder(tf.float32, [None, 28, 28, 1])
# correct answers will go here
Y_ = tf.placeholder(tf.float32, [None, 10])
# variable learning rate
lr = tf.placeholder(tf.float32)
# test flag for batch norm
tst = tf.placeholder(tf.bool)
Iter = tf.placeholder(tf.int32)
# dropout probability
pkeep = tf.placeholder(tf.float32)
pkeep_conv = tf.placeholder(tf.float32)
# three convolutional layers with their channel counts, and a
# fully connected layer (tha last layer has 10 softmax neurons)
K = 24 # 1st convolutional layer output depth
L = 48 # 2nd convolutional layer output depth
M = 64 # 3rd convolutional layer
N = 200 # 4th fully connected layer
W1 = tf.Variable(tf.truncated_normal([6, 6, 1, K], stddev=0.1)) # 6x6 patch, 1 input channel, K output channels
B1 = tf.Variable(tf.constant(0.1, tf.float32, [K]))
W2 = tf.Variable(tf.truncated_normal([5, 5, K, L], stddev=0.1))
B2 = tf.Variable(tf.constant(0.1, tf.float32, [L]))
W3 = tf.Variable(tf.truncated_normal([4, 4, L, M], stddev=0.1))
B3 = tf.Variable(tf.constant(0.1, tf.float32, [M]))
W4 = tf.Variable(tf.truncated_normal([7 * 7 * M, N], stddev=0.1))
B4 = tf.Variable(tf.constant(0.1, tf.float32, [N]))
W5 = tf.Variable(tf.truncated_normal([N, 10], stddev=0.1))
B5 = tf.Variable(tf.constant(0.1, tf.float32, [10]))
# The model
# batch norm scaling is not useful with relus
# batch norm offsets are used instead of biases
Y1l = tf.nn.conv2d(X, W1, strides=[1, 1, 1, 1], padding='SAME')
Y1bn, update_ema1 = batchnormForRelu(Y1l, tst, Iter, B1, convolutional=True)
Y1r = tf.nn.relu(Y1bn)
Y1 = tf.nn.dropout(Y1r, pkeep_conv, compatible_convolutional_noise_shape(Y1r))
stride = 2 # output is 14x14
Y2l = tf.nn.conv2d(Y1, W2, strides=[1, stride, stride, 1], padding='SAME')
Y2bn, update_ema2 = batchnormForRelu(Y2l, tst, Iter, B2, convolutional=True)
Y2r = tf.nn.relu(Y2bn)
Y2 = tf.nn.dropout(Y2r, pkeep_conv, compatible_convolutional_noise_shape(Y2r))
stride = 2 # output is 7x7
Y3l = tf.nn.conv2d(Y2, W3, strides=[1, stride, stride, 1], padding='SAME')
Y3bn, update_ema3 = batchnormForRelu(Y3l, tst, Iter, B3, convolutional=True)
Y3r = tf.nn.relu(Y3bn)
Y3 = tf.nn.dropout(Y3r, pkeep_conv, compatible_convolutional_noise_shape(Y3r))
# reshape
YY = tf.reshape(Y3, shape=[-1, 7 * 7 * M])
Y4l = tf.matmul(YY, W4)
Y4bn, update_ema4 = batchnormForRelu(Y4l, tst, Iter, B4)
Y4r = tf.nn.relu(Y4bn)
Y4 = tf.nn.dropout(Y4r, pkeep)
Ylogits = tf.matmul(Y4, W5) + B5
Y = tf.nn.softmax(Ylogits)
update_ema = tf.group(update_ema1, update_ema2, update_ema3, update_ema4)
# cross-entropy loss function (= -sum(Y_i * log(Yi)) ), normalised for batches of 100 images
# TensorFlow provides the softmax_cross_entropy_with_logits function to avoid numerical stability
# problems with log(0) which is NaN
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits=Ylogits, labels=Y_)
cross_entropy = tf.reduce_mean(cross_entropy)*100
# accuracy of the trained model, between 0 (worst) and 1 (best)
correct_prediction = tf.equal(tf.argmax(Y, 1), tf.argmax(Y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
# training step, the learning rate is a placeholder
train_step = tf.train.AdamOptimizer(lr).minimize(cross_entropy)
# init
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
def training_step(i, update_test_data, update_train_data):
# training on batches of 100 images with 100 labels
batch_X, batch_Y = mnist.train.next_batch(100)
# learning rate decay
max_learning_rate = 0.02
min_learning_rate = 0.0001
decay_speed = 1600
learning_rate = min_learning_rate + (max_learning_rate - min_learning_rate) * math.exp(-i/decay_speed)
# compute training values for visualisation
if update_train_data:
a, c = sess.run([accuracy, cross_entropy], \
{X: batch_X, Y_: batch_Y, tst: False, pkeep: 1.0, pkeep_conv: 1.0})
print(str(i) + ": accuracy:" + str(a) + " loss: " + str(c) + " (lr:" + str(learning_rate) + ")")
# compute test values for visualisation
if update_test_data:
a, c = sess.run([accuracy, cross_entropy], \
{X: mnist.test.images, Y_: mnist.test.labels, tst: True, pkeep: 1.0, pkeep_conv: 1.0})
print(str(i) + ": ********* epoch " + str(i*100//mnist.train.images.shape[0]+1) + " ********* test accuracy:" + str(a) + " test loss: " + str(c))
# the backpropagation training step
sess.run(train_step, {X: batch_X, Y_: batch_Y, lr: learning_rate, tst: False, pkeep: 0.75, pkeep_conv: 1.0})
sess.run(update_ema, {X: batch_X, Y_: batch_Y, tst: False, Iter: i, pkeep: 1.0, pkeep_conv: 1.0})
if __name__ == "__main__":
for i in range(0, 1000):
training_step(i, True, True)