BatchNormlization.ipynb
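Machine learning methods tend to work well when the input data consists of uncorrelated features with zero mean and unit variance. When training a deep neural network, however, even if we preprocess the data so that the input follows such a distribution, the successive layers change that original distribution. Worse, as the weights keep being updated, the distribution of the features entering each layer keeps drifting.
The authors of Recommended Reading 1 therefore hypothesize that this drift in the input feature distribution makes deep networks harder to train, and they propose inserting a batch normalization (BN) layer to address it.
During training, we use a mini-batch to estimate the mean and variance of each feature dimension, and use them to normalize the incoming mini-batch so that every feature has zero mean and unit variance. We also maintain a running average of the mean and variance over the training set, which is used to normalize data at test time.
However, such a BN layer might reduce the network's representational power by changing the distribution of the input features: for some layers, features with non-zero mean or non-unit variance may work better. So for each BN layer we learn a shift parameter and a scale parameter for every feature dimension. They let the layer partially undo the normalization, so the output need not follow the standard distribution exactly and the network keeps its expressiveness.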
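The transform described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the assignment code; the names `x`, `gamma`, `beta` and `eps` are assumptions chosen to match the function signature used later, and the data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
# A mini-batch of 200 samples with 3 features, far from zero mean / unit variance
x = rng.normal(loc=5.0, scale=3.0, size=(200, 3))

eps = 1e-5
mu = x.mean(axis=0)   # per-feature mean, shape (3,)
var = x.var(axis=0)   # per-feature (biased) variance, shape (3,)
x_hat = (x - mu) / np.sqrt(var + eps)

# Learnable scale and shift. With gamma = sqrt(var + eps) and beta = mu the
# layer reproduces its input, so normalization costs no representational power.
gamma = np.sqrt(var + eps)
beta = mu
y = gamma * x_hat + beta

print(x_hat.mean(axis=0), x_hat.std(axis=0))
print(np.allclose(y, x))
```

In a real network `gamma` and `beta` are learned by gradient descent; the identity choice here only shows that the original distribution remains reachable.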
In the file cs231n/layers.py, implement the batch normalization forward pass in the function batchnorm_forward. Once you have done so, run the following to test your implementation.
def batchnorm_forward(x, gamma, beta, bn_param):
    """
    Forward pass for batch normalization.

    During training the sample mean and (uncorrected) sample variance are
    computed from minibatch statistics and used to normalize the incoming data.
    During training we also keep an exponentially decaying running mean of the
    mean and variance of each feature, and these averages are used to normalize
    data at test-time.

    At each timestep we update the running averages for mean and variance using
    an exponential decay based on the momentum parameter:

    running_mean = momentum * running_mean + (1 - momentum) * sample_mean
    running_var = momentum * running_var + (1 - momentum) * sample_var

    Note that the batch normalization paper suggests a different test-time
    behavior: they compute sample mean and variance for each feature using a
    large number of training images rather than using a running average. For
    this implementation we have chosen to use running averages instead since
    they do not require an additional estimation step; the torch7
    implementation of batch normalization also uses running averages.

    Input:
    - x: Data of shape (N, D)
    - gamma: Scale parameter of shape (D,)
    - beta: Shift parameter of shape (D,)
    - bn_param: Dictionary with the following keys:
      - mode: 'train' or 'test'; required
      - eps: Constant for numeric stability
      - momentum: Constant for running mean / variance.
      - running_mean: Array of shape (D,) giving running mean of features
      - running_var: Array of shape (D,) giving running variance of features

    Returns a tuple of:
    - out: of shape (N, D)
    - cache: A tuple of values needed in the backward pass
    """
    mode = bn_param['mode']
    eps = bn_param.get('eps', 1e-5)
    momentum = bn_param.get('momentum', 0.9)

    N, D = x.shape
    running_mean = bn_param.get('running_mean', np.zeros(D, dtype=x.dtype))
    running_var = bn_param.get('running_var', np.zeros(D, dtype=x.dtype))

    out, cache = None, None
    if mode == 'train':
        batch_mean = np.mean(x, axis=0)
        batch_var = np.var(x, axis=0)
        # Update the running mean and variance used at test time
        running_mean = momentum * running_mean + (1 - momentum) * batch_mean
        running_var = momentum * running_var + (1 - momentum) * batch_var
        # eps goes inside the square root so that the forward pass matches
        # batchnorm_backward, which uses sqrt(batch_var + eps)
        x_std = (x - batch_mean) / np.sqrt(batch_var + eps)
        out = gamma * x_std + beta
        # Cache exactly the values that batchnorm_backward unpacks
        cache = (gamma, x_std, beta, x, batch_mean, batch_var, eps)
    elif mode == 'test':
        #######################################################################
        # TODO: Implement the test-time forward pass for batch normalization. #
        # Use the running mean and variance to normalize the incoming data,   #
        # then scale and shift the normalized data using gamma and beta.      #
        # Store the result in the out variable.                               #
        #######################################################################
        # Use the local defaults so a missing running_mean does not raise KeyError
        x_std = (x - running_mean) / np.sqrt(running_var + eps)
        out = gamma * x_std + beta
    else:
        raise ValueError('Invalid forward batchnorm mode "%s"' % mode)

    # Store the updated running means back into bn_param
    bn_param['running_mean'] = running_mean
    bn_param['running_var'] = running_var

    return out, cache
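To see how the running average in the docstring behaves, here is a small standalone sketch. It does not call `batchnorm_forward`; the scalar feature, the true mean of 4.0 and the batch size of 64 are assumptions made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
momentum = 0.9
running_mean = 0.0   # starts at zero, like the default in bn_param
true_mean = 4.0      # hypothetical population mean of one feature

history = []
for step in range(200):
    batch = rng.normal(loc=true_mean, scale=1.0, size=64)
    # Same update rule as in the docstring above
    running_mean = momentum * running_mean + (1 - momentum) * batch.mean()
    history.append(running_mean)

print(history[0], history[-1])
```

The first value is still close to the zero initialization, while after a few dozen updates the estimate settles near the population mean. This is why test-time outputs can be poor if a model is evaluated after only a handful of training steps.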
Now implement the backward pass in the function batchnorm_backward. Some intermediates may have multiple outgoing branches; make sure to sum gradients across these branches in the backward pass.
def batchnorm_backward(dout, cache):
    """
    Backward pass for batch normalization.

    For this implementation, you should write out a computation graph for
    batch normalization on paper and propagate gradients backward through
    intermediate nodes.

    Inputs:
    - dout: Upstream derivatives, of shape (N, D)
    - cache: Variable of intermediates from batchnorm_forward.

    Returns a tuple of:
    - dx: Gradient with respect to inputs x, of shape (N, D)
    - dgamma: Gradient with respect to scale parameter gamma, of shape (D,)
    - dbeta: Gradient with respect to shift parameter beta, of shape (D,)
    """
    gamma, x_std, beta, x, batch_mean, batch_var, eps = cache
    N = x.shape[0]
    # out = gamma * x_std + beta
    dbeta = np.sum(dout, axis=0)
    dgamma = np.sum(dout * x_std, axis=0)
    dx_std = dout * gamma
    # x_std = (x - mean) / std
    # Note that x feeds several nodes: the normalized output directly, the
    # variance, and the mean, so several edges of the graph flow back into x
    a = np.sqrt(batch_var + eps)
    # Gradient through the variance first
    dvar = np.sum(-0.5 * (x - batch_mean) * dx_std / a ** 3, axis=0)
    dmean = np.sum(-dx_std / a, axis=0) + dvar * np.sum(-2 * (x - batch_mean), axis=0) / N
    # Sum the three branches that reach x
    dx = dx_std / a + dmean / N + 2 * dvar * (x - batch_mean) / N
    return dx, dgamma, dbeta
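A numerical gradient check is a common way to validate a backward pass like the one above. The sketch below is self-contained: `bn_forward`, `bn_backward` and `numeric_grad` are hypothetical compact stand-ins written for this example (training mode only, no running averages), not the assignment's own helpers.

```python
import numpy as np

def bn_forward(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm without running statistics."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, (gamma, x_hat, x, mu, var, eps)

def bn_backward(dout, cache):
    """Staged backward pass: sum the three branches that reach x."""
    gamma, x_hat, x, mu, var, eps = cache
    N = x.shape[0]
    std = np.sqrt(var + eps)
    dx_hat = dout * gamma
    dvar = np.sum(-0.5 * dx_hat * (x - mu) / std ** 3, axis=0)
    dmean = np.sum(-dx_hat / std, axis=0) + dvar * np.mean(-2 * (x - mu), axis=0)
    dx = dx_hat / std + 2 * dvar * (x - mu) / N + dmean / N
    return dx, np.sum(dout * x_hat, axis=0), np.sum(dout, axis=0)

def numeric_grad(f, x, dout, h=1e-5):
    """Centered finite differences of sum(f(x) * dout) with respect to x."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + h
        pos = f(x)
        x[idx] = old - h
        neg = f(x)
        x[idx] = old  # restore the entry before moving on
        grad[idx] = np.sum((pos - neg) * dout) / (2 * h)
    return grad

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 5)) * 5 + 12
gamma = rng.normal(size=5)
beta = rng.normal(size=5)
dout = rng.normal(size=(4, 5))

out, cache = bn_forward(x, gamma, beta)
dx, dgamma, dbeta = bn_backward(dout, cache)
dx_num = numeric_grad(lambda v: bn_forward(v, gamma, beta)[0], x, dout)
print('max abs difference:', np.max(np.abs(dx - dx_num)))
```

Using float64 inputs matters here; with float32 the finite-difference noise can hide a real bug.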
def batchnorm_backward_alt(dout, cache):
    """
    Alternative backward pass for batch normalization.

    For this implementation you should work out the derivatives for the batch
    normalization backward pass on paper and simplify as much as possible. You
    should be able to derive a simple expression for the backward pass.
    See the jupyter notebook for more hints.

    Note: This implementation should expect to receive the same cache variable
    as batchnorm_backward, but might not use all of the values in the cache.

    Inputs / outputs: Same as batchnorm_backward
    """
    gamma, x_std, beta, x, batch_mean, batch_var, eps = cache
    N = x.shape[0]
    # Gradients of the scale and shift parameters first; they are the easy part
    dgamma = np.sum(dout * x_std, axis=0)
    dbeta = np.sum(dout, axis=0)
    # Then the gradient with respect to x
    a = 1 / np.sqrt(batch_var + eps)
    dx_hat = dout * gamma
    dvar = np.sum(dx_hat * (x - batch_mean) * (-0.5) * (a ** 3), axis=0)
    # The omitted term dvar * (-2 / N) * np.sum(x - batch_mean, axis=0) is zero,
    # because deviations from the batch mean sum to zero
    dmean = np.sum(-dx_hat * a, axis=0)
    dx = dx_hat * a + dvar * 2 * (x - batch_mean) / N + dmean / N
    return dx, dgamma, dbeta
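For reference, the staged gradients above collapse into a single expression once $x_i - \mu = \hat{x}_i \sqrt{\sigma^2 + \epsilon}$ is substituted. The following is a sketch of that algebra; the symbols $\mu$, $\sigma^2$, $\hat{x}$ and $\partial L / \partial y$ are notation assumed here for `batch_mean`, `batch_var`, `x_std` and `dout`, and $\partial L / \partial \mu$ already omits the term through the variance, which vanishes.

```latex
\begin{aligned}
\frac{\partial L}{\partial \hat{x}_i} &= \gamma \, \frac{\partial L}{\partial y_i}, \\
\frac{\partial L}{\partial \sigma^2} &= -\frac{1}{2} (\sigma^2 + \epsilon)^{-3/2}
    \sum_{j=1}^{N} \frac{\partial L}{\partial \hat{x}_j} (x_j - \mu), \\
\frac{\partial L}{\partial \mu} &= -\frac{1}{\sqrt{\sigma^2 + \epsilon}}
    \sum_{j=1}^{N} \frac{\partial L}{\partial \hat{x}_j}, \\
\frac{\partial L}{\partial x_i} &= \frac{1}{\sqrt{\sigma^2 + \epsilon}} \frac{\partial L}{\partial \hat{x}_i}
    + \frac{2 (x_i - \mu)}{N} \frac{\partial L}{\partial \sigma^2}
    + \frac{1}{N} \frac{\partial L}{\partial \mu} \\
&= \frac{1}{N \sqrt{\sigma^2 + \epsilon}} \left( N \frac{\partial L}{\partial \hat{x}_i}
    - \sum_{j=1}^{N} \frac{\partial L}{\partial \hat{x}_j}
    - \hat{x}_i \sum_{j=1}^{N} \frac{\partial L}{\partial \hat{x}_j} \hat{x}_j \right).
\end{aligned}
```

The last line needs only `dout`, `gamma`, `x_std` and the inverse standard deviation, so an implementation built on it can skip the intermediate `dvar` and `dmean` arrays.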
Go back to FullyConnectedNet in the file cs231n/classifiers/fc_net.py. Modify your implementation to add batch normalization. It is useful to define an additional helper layer similar to those in the file cs231n/layer_utils.py.
As a first step, define a new affine->bn->relu layer in layer_utils.py:
def affine_bn_relu_forward(x, w, b, gamma, beta, bn_params):
    """
    Convenience layer that performs an affine transform with batch
    normalization, followed by a ReLU.

    Inputs:
    - x: Input to the affine layer
    - w, b: Weights for the affine layer
    - gamma: Scale parameter of shape (D,)
    - beta: Shift parameter of shape (D,)
    - bn_params: Dictionary with the following keys:
      - mode: 'train' or 'test'; required
      - eps: Constant for numeric stability
      - momentum: Constant for running mean / variance.
      - running_mean: Array of shape (D,) giving running mean of features
      - running_var: Array of shape (D,) giving running variance of features

    Returns a tuple of:
    - out: Output from the ReLU
    - cache: Object to give to the backward pass
    """
    a_fc, fc_cache = affine_forward(x, w, b)
    a_bn, bn_cache = batchnorm_forward(a_fc, gamma, beta, bn_params)
    out, relu_cache = relu_forward(a_bn)
    cache = (fc_cache, bn_cache, relu_cache)
    return out, cache
def affine_bn_relu_backward(dout, cache):
    """
    Backward pass for the affine-bn-relu convenience layer
    """
    fc_cache, bn_cache, relu_cache = cache
    da = relu_backward(dout, relu_cache)
    da_bn, dgamma, dbeta = batchnorm_backward(da, bn_cache)
    dx, dw, db = affine_backward(da_bn, fc_cache)
    return dx, dw, db, dgamma, dbeta
fc_net.py
from builtins import range
from builtins import object
import numpy as np
from cs231n.layers import *
from cs231n.layer_utils import *
class TwoLayerNet(object):
    """
    A two-layer fully-connected neural network with ReLU nonlinearity and
    softmax loss that uses a modular layer design. We assume an input dimension
    of D, a hidden dimension of H, and perform classification over C classes.

    The architecture should be affine - relu - affine - softmax.

    Note that this class does not implement gradient descent; instead, it
    will interact with a separate Solver object that is responsible for running
    optimization.

    The learnable parameters of the model are stored in the dictionary
    self.params that maps parameter names to numpy arrays.
    """

    def __init__(self, input_dim=3*32*32, hidden_dim=100, num_classes=10,
                 weight_scale=1e-3, reg=0.0):
        """
        Initialize a new network.

        Inputs:
        - input_dim: An integer giving the size of the input
        - hidden_dim: An integer giving the size of the hidden layer
        - num_classes: An integer giving the number of classes to classify
        - weight_scale: Scalar giving the standard deviation for random
          initialization of the weights.
        - reg: Scalar giving L2 regularization strength.
        """
        self.params = {}
        self.reg = reg
        # TODO: Initialize the weights and biases of the two-layer net. Weights
        # should be initialized from a Gaussian centered at 0.0 with
        # standard deviation equal to weight_scale, and biases should be
        # initialized to zero.
        self.params["W1"] = np.random.randn(input_dim, hidden_dim) * weight_scale
        self.params["b1"] = np.zeros(hidden_dim)
        self.params["W2"] = np.random.randn(hidden_dim, num_classes) * weight_scale
        self.params["b2"] = np.zeros(num_classes)

    def loss(self, X, y=None):
        """
        Compute loss and gradient for a minibatch of data.

        Inputs:
        - X: Array of input data of shape (N, d_1, ..., d_k)
        - y: Array of labels, of shape (N,). y[i] gives the label for X[i].

        Returns:
        If y is None, then run a test-time forward pass of the model and return:
        - scores: Array of shape (N, C) giving classification scores, where
          scores[i, c] is the classification score for X[i] and class c.

        If y is not None, then run a training-time forward and backward pass and
        return a tuple of:
        - loss: Scalar value giving the loss
        - grads: Dictionary with the same keys as self.params, mapping parameter
          names to gradients of the loss with respect to those parameters.
        """
        ############################################################################
        # TODO: Implement the forward pass for the two-layer net, computing the    #
        # class scores for X and storing them in the scores variable.              #
        ############################################################################
        H, cache_layer1 = affine_relu_forward(X, self.params["W1"], self.params["b1"])
        scores, cache_layer2 = affine_forward(H, self.params["W2"], self.params["b2"])

        # If y is None then we are in test mode so just return scores
        if y is None:
            return scores

        loss, grads = 0, {}
        ############################################################################
        # TODO: Implement the backward pass for the two-layer net. Store the loss  #
        # in the loss variable and gradients in the grads dictionary. Compute data #
        # loss using softmax, and make sure that grads[k] holds the gradients for  #
        # self.params[k]. Don't forget to add L2 regularization!                   #
        #                                                                          #
        # NOTE: To ensure that your implementation matches ours and you pass the   #
        # automated tests, make sure that your L2 regularization includes a factor #
        # of 0.5 to simplify the expression for the gradient.                      #
        ############################################################################
        loss, dS = softmax_loss(scores, y)
        # Add the L2 regularization terms to the loss
        loss += 0.5 * self.reg * np.sum(self.params["W1"] * self.params["W1"])
        loss += 0.5 * self.reg * np.sum(self.params["W2"] * self.params["W2"])
        dH, dW2, grads["b2"] = affine_backward(dS, cache_layer2)
        dx, dW1, grads["b1"] = affine_relu_backward(dH, cache_layer1)
        # Add the gradient of the regularization terms
        grads["W1"] = dW1 + self.reg * self.params["W1"]
        grads["W2"] = dW2 + self.reg * self.params["W2"]
        return loss, grads
class FullyConnectedNet(object):
    """
    A fully-connected neural network with an arbitrary number of hidden layers,
    ReLU nonlinearities, and a softmax loss function. This will also implement
    dropout and batch/layer normalization as options. For a network with L layers,
    the architecture will be

    {affine - [batch/layer norm] - relu - [dropout]} x (L - 1) - affine - softmax

    where batch/layer normalization and dropout are optional, and the {...} block is
    repeated L - 1 times.

    Similar to the TwoLayerNet above, learnable parameters are stored in the
    self.params dictionary and will be learned using the Solver class.
    """

    def __init__(self, hidden_dims, input_dim=3*32*32, num_classes=10,
                 dropout=1, normalization=None, reg=0.0,
                 weight_scale=1e-2, dtype=np.float32, seed=None):
        """
        Initialize a new FullyConnectedNet.

        Inputs:
        - hidden_dims: A list of integers giving the size of each hidden layer.
        - input_dim: An integer giving the size of the input.
        - num_classes: An integer giving the number of classes to classify.
        - dropout: Scalar between 0 and 1 giving dropout strength. If dropout=1 then
          the network should not use dropout at all.
        - normalization: What type of normalization the network should use. Valid values
          are "batchnorm", "layernorm", or None for no normalization (the default).
        - reg: Scalar giving L2 regularization strength.
        - weight_scale: Scalar giving the standard deviation for random
          initialization of the weights.
        - dtype: A numpy datatype object; all computations will be performed using
          this datatype. float32 is faster but less accurate, so you should use
          float64 for numeric gradient checking.
        - seed: If not None, then pass this random seed to the dropout layers. This
          will make the dropout layers deterministic so we can gradient check the
          model.
        """
        self.normalization = normalization
        self.use_dropout = dropout != 1
        self.reg = reg
        self.num_layers = 1 + len(hidden_dims)
        self.dtype = dtype
        self.params = {}

        ############################################################################
        # TODO: Initialize the parameters of the network, storing all values in    #
        # the self.params dictionary. Store weights and biases for the first layer #
        # in W1 and b1; for the second layer use W2 and b2, etc.                   #
        # When using batch normalization, store scale and shift parameters for the #
        # first layer in gamma1 and beta1; for the second layer use gamma2 and     #
        # beta2, etc. Scale parameters should be initialized to ones and shift     #
        # parameters should be initialized to zeros.                               #
        ############################################################################
        input_size = input_dim
        for i in range(len(hidden_dims)):
            output_size = hidden_dims[i]
            self.params['W' + str(i+1)] = np.random.randn(input_size, output_size) * weight_scale
            self.params['b' + str(i+1)] = np.zeros(output_size)
            if self.normalization == 'batchnorm':
                self.params['gamma' + str(i+1)] = np.ones(output_size)
                self.params['beta' + str(i+1)] = np.zeros(output_size)
            input_size = output_size  # input size of the next layer
        # Output layer; no BN is applied here
        self.params['W' + str(self.num_layers)] = np.random.randn(input_size, num_classes) * weight_scale
        self.params['b' + str(self.num_layers)] = np.zeros(num_classes)

        # When using dropout we need to pass a dropout_param dictionary to each
        # dropout layer so that the layer knows the dropout probability and the mode
        # (train / test). You can pass the same dropout_param to each dropout layer.
        self.dropout_param = {}
        if self.use_dropout:
            self.dropout_param = {'mode': 'train', 'p': dropout}
            if seed is not None:
                self.dropout_param['seed'] = seed

        # With batch normalization we need to keep track of running means and
        # variances, so we need to pass a special bn_param object to each batch
        # normalization layer. You should pass self.bn_params[0] to the forward pass
        # of the first batch normalization layer, self.bn_params[1] to the forward
        # pass of the second batch normalization layer, etc.
        self.bn_params = []
        if self.normalization == 'batchnorm':
            self.bn_params = [{'mode': 'train'} for i in range(self.num_layers - 1)]
        if self.normalization == 'layernorm':
            self.bn_params = [{} for i in range(self.num_layers - 1)]

        # Cast all parameters to the correct datatype
        for k, v in self.params.items():
            self.params[k] = v.astype(dtype)
    def loss(self, X, y=None):
        """
        Compute loss and gradient for the fully-connected net.

        Input / output: Same as TwoLayerNet above.
        """
        X = X.astype(self.dtype)
        mode = 'test' if y is None else 'train'

        # Set train/test mode for batchnorm params and dropout param since they
        # behave differently during training and testing.
        if self.use_dropout:
            self.dropout_param['mode'] = mode
        if self.normalization == 'batchnorm':
            for bn_param in self.bn_params:
                bn_param['mode'] = mode

        ############################################################################
        # TODO: Implement the forward pass for the fully-connected net, computing  #
        # the class scores for X and storing them in the scores variable.          #
        #                                                                          #
        # When using dropout, you'll need to pass self.dropout_param to each       #
        # dropout forward pass.                                                    #
        #                                                                          #
        # When using batch normalization, you'll need to pass self.bn_params[0] to #
        # the forward pass for the first batch normalization layer, pass           #
        # self.bn_params[1] to the forward pass for the second batch normalization #
        # layer, etc.                                                              #
        ############################################################################
        cache = {}  # intermediates needed for the backward pass
        hidden = X
        for i in range(self.num_layers - 1):
            # Only batchnorm is wired in here, matching the backward pass below;
            # layernorm would need an analogous affine-ln-relu helper
            if self.normalization == 'batchnorm':
                hidden, cache[i+1] = affine_bn_relu_forward(hidden,
                                                            self.params['W' + str(i+1)],
                                                            self.params['b' + str(i+1)],
                                                            self.params['gamma' + str(i+1)],
                                                            self.params['beta' + str(i+1)],
                                                            self.bn_params[i])
            else:
                hidden, cache[i+1] = affine_relu_forward(hidden, self.params['W' + str(i+1)],
                                                         self.params['b' + str(i+1)])
            if self.use_dropout:
                pass  # dropout is not implemented in this listing
        # The last layer has no activation
        scores, cache[self.num_layers] = affine_forward(hidden, self.params['W' + str(self.num_layers)],
                                                        self.params['b' + str(self.num_layers)])

        # If test mode return early
        if mode == 'test':
            return scores

        ############################################################################
        # TODO: Implement the backward pass for the fully-connected net. Store the #
        # loss in the loss variable and gradients in the grads dictionary. Compute #
        # data loss using softmax, and make sure that grads[k] holds the gradients #
        # for self.params[k]. Don't forget to add L2 regularization!               #
        ############################################################################
        loss, grads = 0.0, {}
        loss, dS = softmax_loss(scores, y)
        # The last layer has no ReLU
        dhidden, grads['W' + str(self.num_layers)], grads['b' + str(self.num_layers)] \
            = affine_backward(dS, cache[self.num_layers])
        loss += 0.5 * self.reg * np.sum(self.params['W' + str(self.num_layers)] * self.params['W' + str(self.num_layers)])
        grads['W' + str(self.num_layers)] += self.reg * self.params['W' + str(self.num_layers)]

        # Walk the hidden layers in reverse order
        for i in range(self.num_layers - 1, 0, -1):
            loss += 0.5 * self.reg * np.sum(self.params["W" + str(i)] * self.params["W" + str(i)])
            if self.use_dropout:
                pass  # dropout is not implemented in this listing
            if self.normalization == 'batchnorm':
                dhidden, dw, db, dgamma, dbeta = affine_bn_relu_backward(dhidden, cache[i])
                grads['gamma' + str(i)] = dgamma
                grads['beta' + str(i)] = dbeta
            else:
                dhidden, dw, db = affine_relu_backward(dhidden, cache[i])
            grads['W' + str(i)] = dw + self.reg * self.params['W' + str(i)]
            grads['b' + str(i)] = db
        return loss, grads
We train a six-layer network on 1000 training samples and compare the results with and without BN layers:
np.random.seed(231)
# Try training a very deep net with batchnorm
hidden_dims = [100, 100, 100, 100, 100]
num_train = 1000
small_data = {
    'X_train': data['X_train'][:num_train],
    'y_train': data['y_train'][:num_train],
    'X_val': data['X_val'],
    'y_val': data['y_val'],
}
epochs = 10
weight_scale = 2e-2
bn_model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization='batchnorm')
model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization=None)

bn_solver = Solver(bn_model, small_data,
                   num_epochs=epochs, batch_size=50,
                   update_rule='adam',
                   optim_config={
                       'learning_rate': 1e-3,
                   },
                   verbose=True, print_every=20)
bn_solver.train()

solver = Solver(model, small_data,
                num_epochs=epochs, batch_size=50,
                update_rule='adam',
                optim_config={
                    'learning_rate': 1e-3,
                },
                verbose=True, print_every=20)
solver.train()
def plot_training_history(title, label, baseline, bn_solvers, plot_fn, bl_marker='.', bn_marker='.', labels=None):
    """utility function for plotting training history"""
    plt.title(title)
    plt.xlabel(label)
    bn_plots = [plot_fn(bn_solver) for bn_solver in bn_solvers]
    bl_plot = plot_fn(baseline)
    num_bn = len(bn_plots)
    for i in range(num_bn):
        label = 'with_norm'
        if labels is not None:
            label += str(labels[i])
        plt.plot(bn_plots[i], bn_marker, label=label)
    label = 'baseline'
    if labels is not None:
        label += str(labels[0])
    plt.plot(bl_plot, bl_marker, label=label)
    plt.legend(loc='lower center', ncol=num_bn+1)

plt.subplot(3, 1, 1)
plot_training_history('Training loss','Iteration', solver, [bn_solver], \
lambda x: x.loss_history, bl_marker='o', bn_marker='o')
plt.subplot(3, 1, 2)
plot_training_history('Training accuracy','Epoch', solver, [bn_solver], \
lambda x: x.train_acc_history, bl_marker='-o', bn_marker='-o')
plt.subplot(3, 1, 3)
plot_training_history('Validation accuracy','Epoch', solver, [bn_solver], \
lambda x: x.val_acc_history, bl_marker='-o', bn_marker='-o')
plt.gcf().set_size_inches(15, 15)
plt.show()
In this section we look at how BN interacts with weight initialization.
np.random.seed(231)
# Try training a very deep net with batchnorm
hidden_dims = [50, 50, 50, 50, 50, 50, 50]
num_train = 1000
small_data = {
    'X_train': data['X_train'][:num_train],
    'y_train': data['y_train'][:num_train],
    'X_val': data['X_val'],
    'y_val': data['y_val'],
}

bn_solvers_ws = {}
solvers_ws = {}
weight_scales = np.logspace(-4, 0, num=20)
for i, weight_scale in enumerate(weight_scales):
    print('Running weight scale %d / %d' % (i + 1, len(weight_scales)))
    bn_model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization='batchnorm')
    model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization=None)

    bn_solver = Solver(bn_model, small_data,
                       num_epochs=10, batch_size=50,
                       update_rule='adam',
                       optim_config={
                           'learning_rate': 1e-3,
                       },
                       verbose=False, print_every=200)
    bn_solver.train()
    bn_solvers_ws[weight_scale] = bn_solver

    solver = Solver(model, small_data,
                    num_epochs=10, batch_size=50,
                    update_rule='adam',
                    optim_config={
                        'learning_rate': 1e-3,
                    },
                    verbose=False, print_every=200)
    solver.train()
    solvers_ws[weight_scale] = solver
# Plot results of weight scale experiment
best_train_accs, bn_best_train_accs = [], []
best_val_accs, bn_best_val_accs = [], []
final_train_loss, bn_final_train_loss = [], []
for ws in weight_scales:
    best_train_accs.append(max(solvers_ws[ws].train_acc_history))
    bn_best_train_accs.append(max(bn_solvers_ws[ws].train_acc_history))

    best_val_accs.append(max(solvers_ws[ws].val_acc_history))
    bn_best_val_accs.append(max(bn_solvers_ws[ws].val_acc_history))

    final_train_loss.append(np.mean(solvers_ws[ws].loss_history[-100:]))
    bn_final_train_loss.append(np.mean(bn_solvers_ws[ws].loss_history[-100:]))

plt.subplot(3, 1, 1)
plt.title('Best val accuracy vs weight initialization scale')
plt.xlabel('Weight initialization scale')
plt.ylabel('Best val accuracy')
plt.semilogx(weight_scales, best_val_accs, '-o', label='baseline')
plt.semilogx(weight_scales, bn_best_val_accs, '-o', label='batchnorm')
plt.legend(ncol=2, loc='lower right')
plt.subplot(3, 1, 2)
plt.title('Best train accuracy vs weight initialization scale')
plt.xlabel('Weight initialization scale')
plt.ylabel('Best training accuracy')
plt.semilogx(weight_scales, best_train_accs, '-o', label='baseline')
plt.semilogx(weight_scales, bn_best_train_accs, '-o', label='batchnorm')
plt.legend()
plt.subplot(3, 1, 3)
plt.title('Final training loss vs weight initialization scale')
plt.xlabel('Weight initialization scale')
plt.ylabel('Final training loss')
plt.semilogx(weight_scales, final_train_loss, '-o', label='baseline')
plt.semilogx(weight_scales, bn_final_train_loss, '-o', label='batchnorm')
plt.legend()
plt.gca().set_ylim(1.0, 3.5)
plt.gcf().set_size_inches(15, 15)
plt.show()
Q: Describe the results of this experiment. How does the scale of weight initialization affect models with/without batch normalization differently, and why?
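A: The plots above show that BN makes training far less sensitive to how the weights are initialized. Without BN, if the weights are initialized too small, the activations collapse toward 0, so the backpropagated gradients, once multiplied by the weights, become very small. If the weights are initialized too large, the activations are pushed toward the extremes and the units saturate.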
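The effect described in this answer can be reproduced without any training. The sketch below pushes random data through a stack of random affine + ReLU layers and reports the spread of the final activations, with and without per-feature normalization. It is an illustration under assumptions: `final_activation_std` is a hypothetical helper, the normalization uses batch statistics only (no gamma, beta or running averages), and the depth and width simply mirror `hidden_dims = [50] * 7` above.

```python
import numpy as np

def final_activation_std(weight_scale, normalize, depth=7, width=50, seed=0):
    """Std of the activations after `depth` random affine + ReLU layers."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(200, width))
    for _ in range(depth):
        w = rng.normal(size=(width, width)) * weight_scale
        h = h @ w
        if normalize:
            # Per-feature standardization over the batch, as BN does in training
            h = (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + 1e-5)
        h = np.maximum(h, 0)  # ReLU
    return h.std()

results = {}
for ws in (1e-2, 1e-1, 1.0):
    results[ws] = (final_activation_std(ws, False), final_activation_std(ws, True))
    print(ws, results[ws])
```

Without normalization the activation scale shrinks or grows geometrically with depth, while the normalized variant stays at roughly the same scale for every `weight_scale`. For extremely small scales the `eps` term starts to dominate the variance, so even the normalized network eventually degrades.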
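In this section we compare how BN behaves with different batch sizes. Our first intuition is that the larger the batch, the more accurate the estimated mean and variance.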
def run_batchsize_experiments(normalization_mode):
    np.random.seed(231)
    # Try training a very deep net with batchnorm
    hidden_dims = [100, 100, 100, 100, 100]
    num_train = 1000
    small_data = {
        'X_train': data['X_train'][:num_train],
        'y_train': data['y_train'][:num_train],
        'X_val': data['X_val'],
        'y_val': data['y_val'],
    }
    n_epochs = 10
    weight_scale = 2e-2
    batch_sizes = [5, 10, 50]
    lr = 10**(-3.5)
    solver_bsize = batch_sizes[0]

    print('No normalization: batch size = ', solver_bsize)
    model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization=None)
    solver = Solver(model, small_data,
                    num_epochs=n_epochs, batch_size=solver_bsize,
                    update_rule='adam',
                    optim_config={
                        'learning_rate': lr,
                    },
                    verbose=False)
    solver.train()

    bn_solvers = []
    for i in range(len(batch_sizes)):
        b_size = batch_sizes[i]
        print('Normalization: batch size = ', b_size)
        bn_model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization=normalization_mode)
        bn_solver = Solver(bn_model, small_data,
                           num_epochs=n_epochs, batch_size=b_size,
                           update_rule='adam',
                           optim_config={
                               'learning_rate': lr,
                           },
                           verbose=False)
        bn_solver.train()
        bn_solvers.append(bn_solver)

    return bn_solvers, solver, batch_sizes

batch_sizes = [5,10,50]
bn_solvers_bsize, solver_bsize, batch_sizes = run_batchsize_experiments('batchnorm')
plt.subplot(2, 1, 1)
plot_training_history('Training accuracy (Batch Normalization)','Epoch', solver_bsize, bn_solvers_bsize, \
lambda x: x.train_acc_history, bl_marker='-^', bn_marker='-o', labels=batch_sizes)
plt.subplot(2, 1, 2)
plot_training_history('Validation accuracy (Batch Normalization)','Epoch', solver_bsize, bn_solvers_bsize, \
lambda x: x.val_acc_history, bl_marker='-^', bn_marker='-o', labels=batch_sizes)
plt.gcf().set_size_inches(15, 10)
plt.show()
Because batch normalization depends on the batch size, it suffers when limited hardware prevents us from choosing a suitable batch size.
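The dependence on batch size can be made concrete by measuring how noisy the batch mean is for the batch sizes used in the experiment. This is a standalone sketch on synthetic data; the population parameters and the helper `batch_mean_spread` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for one feature across the whole training set
population = rng.normal(loc=2.0, scale=3.0, size=100_000)

def batch_mean_spread(batch_size, trials=2000):
    """Std of the batch-mean estimate across many randomly drawn batches."""
    idx = rng.integers(0, population.size, size=(trials, batch_size))
    return population[idx].mean(axis=1).std()

spread = {b: batch_mean_spread(b) for b in (5, 10, 50)}
for b, s in spread.items():
    print(b, round(s, 3), 'theory:', round(3.0 / np.sqrt(b), 3))
```

The spread falls roughly as one over the square root of the batch size, so a batch of 5 yields statistics about three times noisier than a batch of 50.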
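Recommended Reading 2 describes Layer Normalization (LN): each feature vector corresponding to a single datapoint is normalized based on the sum of all terms within that feature vector.
Answer:
Options 1 and 2 are analogous to LN; option 3 is analogous to BN.
Because LN normalizes each row while BN normalizes each column, we can, for convenience, obtain an LN implementation by lightly modifying the BN code written earlier.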
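The row-versus-column relationship can be checked directly. In the sketch below, `normalize_columns` is a hypothetical helper standing in for the normalization step of BN at training time (no gamma or beta); applying it to the transposed input gives the LN result.

```python
import numpy as np

def normalize_columns(x, eps=1e-5):
    """Standardize each column of x, i.e. the BN normalization step."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 6))  # N=4 samples, D=6 features

bn_out = normalize_columns(x)       # statistics taken over the batch axis
ln_out = normalize_columns(x.T).T   # statistics taken over the feature axis

# Direct row-wise computation for comparison
mu = x.mean(axis=1, keepdims=True)
var = x.var(axis=1, keepdims=True)
ln_direct = (x - mu) / np.sqrt(var + 1e-5)

print(np.allclose(ln_out, ln_direct))
```

Note that gamma and beta still have shape (D,) in LN, so they must be applied after transposing back to (N, D).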
In cs231n/layers.py, implement the forward pass for layer normalization in the function layernorm_forward.
def layernorm_forward(x, gamma, beta, ln_param):
    """
    Forward pass for layer normalization.

    During both training and test-time, the incoming data is normalized per data-point,
    before being scaled by gamma and beta parameters identical to that of batch normalization.

    Note that in contrast to batch normalization, the behavior during train and test-time for
    layer normalization are identical, and we do not need to keep track of running averages
    of any sort.

    Input:
    - x: Data of shape (N, D)
    - gamma: Scale parameter of shape (D,)
    - beta: Shift parameter of shape (D,)
    - ln_param: Dictionary with the following keys:
      - eps: Constant for numeric stability

    Returns a tuple of:
    - out: of shape (N, D)
    - cache: A tuple of values needed in the backward pass
    """
    out, cache = None, None
    eps = ln_param.get('eps', 1e-5)
    # HINT: this can be done by slightly modifying your training-time
    # implementation of batch normalization, and inserting a line or two of
    # well-placed code. In particular, can you think of any matrix
    # transformations you could perform, that would enable you to copy over
    # the batch norm code and leave it almost unchanged?
    # Transposing the input makes this almost identical to the BN training pass
    x = x.T  # (D, N)
    _mean = np.mean(x, axis=0)  # (N,)
    _var = np.var(x, axis=0)  # (N,)
    x_hat = (x - _mean) / np.sqrt(_var + eps)  # (D, N)
    x_hat = x_hat.T  # (N, D)
    out = x_hat * gamma + beta
    # x_hat is cached as (N, D), while x is cached transposed, as (D, N)
    cache = (gamma, x_hat, x, _mean, _var, eps)
    return out, cache
In cs231n/layers.py, implement the backward pass for layer normalization in the function layernorm_backward.
def layernorm_backward(dout, cache):
    """
    Backward pass for layer normalization.

    For this implementation, you can heavily rely on the work you've done already
    for batch normalization.

    Inputs:
    - dout: Upstream derivatives, of shape (N, D)
    - cache: Variable of intermediates from layernorm_forward.

    Returns a tuple of:
    - dx: Gradient with respect to inputs x, of shape (N, D)
    - dgamma: Gradient with respect to scale parameter gamma, of shape (D,)
    - dbeta: Gradient with respect to shift parameter beta, of shape (D,)
    """
    ###########################################################################
    # TODO: Implement the backward pass for layer norm.                       #
    #                                                                         #
    # HINT: this can be done by slightly modifying your training-time         #
    # implementation of batch normalization. The hints to the forward pass    #
    # still apply!                                                            #
    ###########################################################################
    gamma, x_hat, x, _mean, _var, eps = cache
    # x is stored transposed, (D, N): each sample was normalized over its D features
    D = x_hat.shape[1]
    dgamma = np.sum(dout * x_hat, axis=0)
    dbeta = np.sum(dout, axis=0)
    dx_hat = (dout * gamma).T  # (D, N)
    # Same staged gradients as batchnorm_backward, with D in place of the batch size
    a = np.sqrt(_var + eps)
    # Gradient through the variance first
    dvar = np.sum(-0.5 * (x - _mean) * dx_hat / a ** 3, axis=0)
    dmean = np.sum(-dx_hat / a, axis=0) + dvar * np.sum(-2 * (x - _mean), axis=0) / D
    dx = dx_hat / a + dmean / D + 2 * dvar * (x - _mean) / D
    dx = dx.T  # back to (N, D)
    return dx, dgamma, dbeta
Modify cs231n/classifiers/fc_net.py to add layer normalization to the FullyConnectedNet. When the normalization flag is set to "layernorm" in the constructor, you should insert a layer normalization layer before each ReLU nonlinearity.
In fc_net.py, repeat for LN the definitions already made for BN. The steps are much the same, so only the results are shown here:
ln_solvers_bsize, solver_bsize, batch_sizes = run_batchsize_experiments('layernorm')
plt.subplot(2, 1, 1)
plot_training_history('Training accuracy (Layer Normalization)','Epoch', solver_bsize, ln_solvers_bsize, \
lambda x: x.train_acc_history, bl_marker='-^', bn_marker='-o', labels=batch_sizes)
plt.subplot(2, 1, 2)
plot_training_history('Validation accuracy (Layer Normalization)','Epoch', solver_bsize, ln_solvers_bsize, \
lambda x: x.val_acc_history, bl_marker='-^', bn_marker='-o', labels=batch_sizes)
plt.gcf().set_size_inches(15, 10)
plt.show()
The plots show that batch size affects LN less than it affects BN.
Q:
When is layer normalization likely to not work well, and why?
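A: Because LN normalizes across the neurons of a single layer, its result can depend on how many neurons that layer has, so option 2 has the larger effect.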
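An extreme case makes the dependence on layer width visible. With only two features per sample, LN maps every row to approximately (-1, 1) or (1, -1) regardless of the actual values, so almost all information except the sign of the difference is lost. The sketch below uses a hypothetical `layer_norm` helper (no gamma or beta) and hand-picked inputs.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row of x over its features."""
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

# Four samples with D = 2 features each, on very different scales
x_small = np.array([[1.0, 5.0],
                    [-3.0, 2.0],
                    [10.0, 0.5],
                    [0.2, 0.9]])

out_small = layer_norm(x_small)
print(np.round(out_small, 3))
```

Every entry of the output has magnitude close to 1, which shows why LN on a layer with very few neurons is likely to work poorly.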