BatchNormalization pulls each layer's activations back to a normal distribution with zero mean and unit variance. To preserve the features the layer has learned, the normalized data is then scaled and shifted by learnable parameters γ and β. Assume the input x has shape (N, D); the algorithm is as follows:
The forward pass of batchnorm at training time is shown above. At test time, however, the batch size may be 1 (i.e. m = 1), so the mean and variance used are instead the exponential moving averages of the statistics accumulated during training. The code is below.
import numpy as np

def batchnorm_forward(x, gamma, beta, bn_param):
    """
    Input:
    - x: Data of shape (N, D)
    - gamma: Scale parameter of shape (D,)
    - beta: Shift parameter of shape (D,)
    - bn_param: Dictionary with the following keys:
      - mode: 'train' or 'test'; required
      - eps: Constant for numeric stability
      - momentum: Constant for running mean / variance.
      - running_mean: Array of shape (D,) giving running mean of features
      - running_var: Array of shape (D,) giving running variance of features

    Returns a tuple of:
    - out: of shape (N, D)
    - cache: A tuple of values needed in the backward pass
    """
    mode = bn_param['mode']
    eps = bn_param.get('eps', 1e-5)
    momentum = bn_param.get('momentum', 0.9)
    N, D = x.shape
    running_mean = bn_param.get('running_mean', np.zeros(D, dtype=x.dtype))
    running_var = bn_param.get('running_var', np.zeros(D, dtype=x.dtype))

    out, cache = None, None
    if mode == 'train':
        # Per-feature statistics over the batch; shape (D,), matching the
        # running_mean / running_var arrays in the docstring.
        sample_mean = np.mean(x, axis=0)
        sample_var = np.var(x, axis=0)
        sample_sqrtvar = np.sqrt(sample_var + eps)
        x_norm = (x - sample_mean) / sample_sqrtvar
        out = x_norm * gamma + beta
        cache = (x, x_norm, gamma, beta, eps, sample_mean, sample_var, sample_sqrtvar)
        running_mean = momentum * running_mean + (1 - momentum) * sample_mean
        running_var = momentum * running_var + (1 - momentum) * sample_var
    elif mode == 'test':
        x_norm = (x - running_mean) / np.sqrt(running_var + eps)
        out = x_norm * gamma + beta

    # Save or update the running mean and running variance
    bn_param['running_mean'] = running_mean
    bn_param['running_var'] = running_var

    return out, cache
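As a quick sanity check, train-mode batch normalization should leave every feature column with roughly zero mean and unit variance before γ and β are applied. A standalone sketch (not tied to the function above; the shapes and eps are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(100, 4))  # batch of 100, 4 features
eps = 1e-5

# Train-mode statistics, computed per feature over the batch
mean = x.mean(axis=0)
var = x.var(axis=0)
x_norm = (x - mean) / np.sqrt(var + eps)

print(np.allclose(x_norm.mean(axis=0), 0.0, atol=1e-8))  # each column ~ zero mean
print(np.allclose(x_norm.var(axis=0), 1.0, atol=1e-3))   # each column ~ unit variance
```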
In a convolutional network the input x may have shape (N, W, H, C), so computing the mean and variance becomes
sample_mean = np.mean(x,axis=(0,1,2), keepdims=True)
sample_var = np.var(x,axis=(0,1,2), keepdims=True)
Another point, and one of the harder parts of cs231n, is the backward pass of batchnorm. Below is batchnorm's computation graph. We need to compute ∂L/∂γ, ∂L/∂β, and ∂L/∂x.

First the easier two, ∂L/∂γ and ∂L/∂β:

    ∂L/∂β = Σᵢ ∂L/∂yᵢ        ∂L/∂γ = Σᵢ (∂L/∂yᵢ) · x̂ᵢ

Then ∂L/∂x by the chain rule, going through x̂, the batch variance σ², and the batch mean μ:

    ∂L/∂x̂ᵢ = (∂L/∂yᵢ) · γ
    ∂L/∂σ² = Σᵢ ∂L/∂x̂ᵢ · (xᵢ − μ) · (−1/2)(σ² + ε)^(−3/2)
    ∂L/∂μ  = Σᵢ (−∂L/∂x̂ᵢ)(σ² + ε)^(−1/2) + ∂L/∂σ² · (1/N) Σᵢ (−2)(xᵢ − μ)
    ∂L/∂xᵢ = ∂L/∂x̂ᵢ · (σ² + ε)^(−1/2) + ∂L/∂σ² · 2(xᵢ − μ)/N + ∂L/∂μ · (1/N)
The code for the bn layer's backward pass:
def batchnorm_backward_alt(dout, cache):
    dx, dgamma, dbeta = None, None, None
    N, D = dout.shape
    x, x_norm, gamma, beta, eps, sample_mean, sample_var, sample_sqrtvar = cache

    dbeta = np.sum(dout, axis=0)
    dgamma = np.sum(dout * x_norm, axis=0)

    dx_norm = dout * gamma
    # gradient w.r.t. the batch variance, shape (D,)
    dvar = np.sum(dx_norm * (x - sample_mean) * (-0.5) * np.power(sample_var + eps, -1.5), axis=0)
    # gradient w.r.t. the batch mean, shape (D,)
    dmean = np.sum(dx_norm * (-1) * np.power(sample_var + eps, -0.5), axis=0)
    dmean += dvar * np.sum(-2 * (x - sample_mean), axis=0) / N
    dx = dx_norm * np.power(sample_var + eps, -0.5) + dvar * 2 * (x - sample_mean) / N + dmean / N

    return dx, dgamma, dbeta
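One way to convince yourself the derivation is right is a centered-difference gradient check. The sketch below re-implements the forward/backward pair in miniature and compares analytic to numeric gradients (the names bn_forward, bn_backward, and num_grad are mine, not from the assignment):

```python
import numpy as np

def bn_forward(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_norm = (x - mean) / np.sqrt(var + eps)
    out = x_norm * gamma + beta
    return out, (x, x_norm, gamma, eps, mean, var)

def bn_backward(dout, cache):
    x, x_norm, gamma, eps, mean, var = cache
    N = x.shape[0]
    dbeta = dout.sum(axis=0)
    dgamma = (dout * x_norm).sum(axis=0)
    dx_norm = dout * gamma
    dvar = np.sum(dx_norm * (x - mean) * (-0.5) * (var + eps) ** -1.5, axis=0)
    dmean = np.sum(-dx_norm * (var + eps) ** -0.5, axis=0) \
            + dvar * np.sum(-2 * (x - mean), axis=0) / N
    dx = dx_norm * (var + eps) ** -0.5 + dvar * 2 * (x - mean) / N + dmean / N
    return dx, dgamma, dbeta

def num_grad(f, a, h=1e-5):
    # Centered finite differences, one coordinate at a time
    g = np.zeros_like(a)
    it = np.nditer(a, flags=['multi_index'])
    while not it.finished:
        i = it.multi_index
        old = a[i]
        a[i] = old + h; fp = f()
        a[i] = old - h; fm = f()
        a[i] = old
        g[i] = (fp - fm) / (2 * h)
        it.iternext()
    return g

rng = np.random.default_rng(1)
x = rng.normal(size=(5, 3))
gamma = rng.normal(size=3)
beta = rng.normal(size=3)
dout = rng.normal(size=(5, 3))  # stands in for upstream dL/dy

out, cache = bn_forward(x, gamma, beta)
dx, dgamma, dbeta = bn_backward(dout, cache)

# Scalar surrogate loss L = sum(out * dout), whose gradient w.r.t. out is dout
loss = lambda: np.sum(bn_forward(x, gamma, beta)[0] * dout)
print(np.allclose(dx, num_grad(loss, x), rtol=1e-4, atol=1e-6))
print(np.allclose(dgamma, num_grad(loss, gamma), rtol=1e-4, atol=1e-6))
print(np.allclose(dbeta, num_grad(loss, beta), rtol=1e-4, atol=1e-6))
```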
The main difference between LayerNormalization and BatchNormalization is which dimensions the mean is taken over. Assume the input X has shape (N, W, H, C); the difference between ln and bn is shown in the figure. bn averages over the batch (and spatial) dimensions, yielding one statistic per channel (size C), while ln averages over the feature dimensions, yielding one statistic per sample (size N). Therefore, unlike bn, ln still works when batchsize = 1.
The main code difference between ln and bn is as follows:
def layernorm_forward(x, gamma, beta, ln_param):
    """
    Input:
    - x: Data of shape (N, D)
    - gamma: Scale parameter of shape (D,)
    - beta: Shift parameter of shape (D,)
    - ln_param: Dictionary with the following keys:
      - eps: Constant for numeric stability

    Returns a tuple of:
    - out: of shape (N, D)
    - cache: A tuple of values needed in the backward pass
    """
    out, cache = None, None
    eps = ln_param.get('eps', 1e-5)

    # Per-sample statistics: axis=1 with keepdims so shapes broadcast to (N, D)
    sample_mean = np.mean(x, axis=1, keepdims=True)
    sample_var = np.var(x, axis=1, keepdims=True)
    sample_sqrtvar = np.sqrt(sample_var + eps)
    x_norm = (x - sample_mean) / sample_sqrtvar
    out = x_norm * gamma + beta
    cache = (x, x_norm, gamma, beta, eps, sample_mean, sample_var, sample_sqrtvar)

    return out, cache
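Analogously to the bn check, layer normalization should standardize each row (sample) rather than each column. A standalone sketch with illustrative shapes:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=4.0, size=(4, 6))  # 4 samples, 6 features
eps = 1e-5

# Per-sample statistics: reduce over the feature axis
mean = x.mean(axis=1, keepdims=True)
var = x.var(axis=1, keepdims=True)
x_norm = (x - mean) / np.sqrt(var + eps)

print(np.allclose(x_norm.mean(axis=1), 0.0, atol=1e-8))  # each row ~ zero mean
print(np.allclose(x_norm.var(axis=1), 1.0, atol=1e-3))   # each row ~ unit variance
```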
In a convolutional network the input x may have shape (N, W, H, C), so computing the mean and variance becomes
sample_mean = np.mean(x,axis=(1,2,3), keepdims=True)
sample_var = np.var(x,axis=(1,2,3), keepdims=True)
InstanceNormalization averages over the two spatial dimensions H and W, emphasizing normalization of each image instance individually; it is very useful in image stylization. Its code is as follows:
def instance_forward(x, gamma, beta, ln_param):
    out, cache = None, None
    eps = ln_param.get('eps', 1e-5)

    # One statistic per (sample, channel): reduce over H and W only,
    # giving shapes (N, 1, 1, C) that broadcast against x
    sample_mean = np.mean(x, axis=(1, 2), keepdims=True)
    sample_var = np.var(x, axis=(1, 2), keepdims=True)
    sample_sqrtvar = np.sqrt(sample_var + eps)
    x_norm = (x - sample_mean) / sample_sqrtvar
    out = x_norm * gamma + beta
    cache = (x, x_norm, gamma, beta, eps, sample_mean, sample_var, sample_sqrtvar)

    return out, cache
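The shapes make the per-(sample, channel) normalization concrete. A standalone sketch, assuming the same NHWC layout as above:

```python
import numpy as np

N, H, W, C = 2, 4, 4, 3
x = np.random.default_rng(2).normal(size=(N, H, W, C))
eps = 1e-5

# One mean/variance per (sample, channel): reduce over H and W only
mean = x.mean(axis=(1, 2), keepdims=True)   # shape (N, 1, 1, C)
var = x.var(axis=(1, 2), keepdims=True)
x_norm = (x - mean) / np.sqrt(var + eps)

print(mean.shape)  # (2, 1, 1, 3)
print(np.allclose(x_norm.mean(axis=(1, 2)), 0.0, atol=1e-8))
```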
As mentioned above, bn performs poorly with small batch sizes, so gn splits the channel dimension C into several groups and normalizes within each group. The code is as follows:
def group_forward(x, gamma, beta, G, ln_param):
    out, cache = None, None
    eps = ln_param.get('eps', 1e-5)
    N, H, W, C = x.shape

    # Split the channel dimension into G groups (C must be divisible by G);
    # integer division so the reshape gets an int, not a float
    x_group = np.reshape(x, (N, H, W, G, C // G))
    sample_mean = np.mean(x_group, axis=(1, 2, 4), keepdims=True)
    sample_var = np.var(x_group, axis=(1, 2, 4), keepdims=True)
    sample_sqrtvar = np.sqrt(sample_var + eps)
    x_norm = (x_group - sample_mean) / sample_sqrtvar
    # Merge the groups back before scaling and shifting
    x_norm = np.reshape(x_norm, (N, H, W, C))
    out = x_norm * gamma + beta
    cache = (x, x_norm, gamma, beta, eps, sample_mean, sample_var, sample_sqrtvar)

    return out, cache
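The group reshape trick can be checked standalone (NHWC layout and the values of G, N, H, W, C are illustrative): with G groups, each statistic covers H, W, and C//G channels.

```python
import numpy as np

N, H, W, C, G = 2, 4, 4, 8, 2
x = np.random.default_rng(3).normal(size=(N, H, W, C))
eps = 1e-5

xg = x.reshape(N, H, W, G, C // G)             # split channels into G groups
mean = xg.mean(axis=(1, 2, 4), keepdims=True)  # one statistic per (sample, group)
var = xg.var(axis=(1, 2, 4), keepdims=True)
x_norm = ((xg - mean) / np.sqrt(var + eps)).reshape(N, H, W, C)

print(mean.shape)    # (2, 1, 1, 2, 1)
print(x_norm.shape)  # (2, 4, 4, 8)
```

Note that G = 1 reduces to layer normalization over (H, W, C), and G = C reduces to instance normalization.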
References
https://arxiv.org/pdf/1502.03167.pdf
https://blog.csdn.net/qq_25737169/article/details/79048516
https://blog.csdn.net/liuxiao214/article/details/81037416
https://zhuanlan.zhihu.com/p/33173246
https://blog.csdn.net/xiaojiajia007/article/details/54924959 (a translation; the original could not be found)