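Preface
This article is a set of study notes. If you repost it, please cite the source: https://www.jianshu.com/p/59aaaaab746a
Normalization
Normalization is a way of processing the features inside a network so that they keep a well-behaved distribution throughout training. It is usually applied just before the activation function. The main normalization methods in use today are the following: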
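Input: a feature map of shape (N, C, H, W), where N is the batch size, C the number of channels, and H, W the spatial size.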
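1. Batch Normalization
For the theory behind BN, see the blog post "Batch Normalization". BN normalizes a batch of feature maps channel by channel: the mean and variance are computed over the batch and spatial dimensions, so they have shape 1×C×1×1 (C is the number of channels). The normalized result is then multiplied by a scale factor gamma and shifted by an offset beta.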
import numpy as np

def batchnorm_forward(x, gamma, beta, bn_param):
    """
    Input:
    - x: Data of shape (N, D)
    - gamma: Scale parameter of shape (D,)
    - beta: Shift parameter of shape (D,)
    - bn_param: Dictionary with the following keys:
      - mode: 'train' or 'test'; required
      - eps: Constant for numeric stability
      - momentum: Constant for running mean / variance.
      - running_mean: Array of shape (D,) giving running mean of features
      - running_var: Array of shape (D,) giving running variance of features
    Returns a tuple of:
    - out: of shape (N, D)
    - cache: A tuple of values needed in the backward pass (None in test mode)
    """
    mode = bn_param['mode']
    eps = bn_param.get('eps', 1e-5)
    momentum = bn_param.get('momentum', 0.9)
    N, D = x.shape
    running_mean = bn_param.get('running_mean', np.zeros(D, dtype=x.dtype))
    running_var = bn_param.get('running_var', np.zeros(D, dtype=x.dtype))
    out, cache = None, None
    if mode == 'train':
        sample_mean = np.mean(x, axis=0)  # mean of each column (feature)
        sample_var = np.var(x, axis=0)    # variance of each column (feature)
        x_hat = (x - sample_mean) / np.sqrt(sample_var + eps)
        out = gamma * x_hat + beta
        cache = (x, sample_mean, sample_var, x_hat, eps, gamma, beta)
        # Exponential moving averages, used later at test time
        running_mean = momentum * running_mean + (1 - momentum) * sample_mean
        running_var = momentum * running_var + (1 - momentum) * sample_var
    elif mode == 'test':
        # Normalize with the running statistics instead of batch statistics
        out = gamma * (x - running_mean) / np.sqrt(running_var + eps) + beta
    else:
        raise ValueError('Invalid forward batchnorm mode "%s"' % mode)
    # Store the updated running means back into bn_param
    bn_param['running_mean'] = running_mean
    bn_param['running_var'] = running_var
    return out, cache
def batchnorm_backward(dout, cache):
    """
    Inputs:
    - dout: Upstream derivatives, of shape (N, D)
    - cache: Variable of intermediates from batchnorm_forward.
    Returns a tuple of:
    - dx: Gradient with respect to inputs x, of shape (N, D)
    - dgamma: Gradient with respect to scale parameter gamma, of shape (D,)
    - dbeta: Gradient with respect to shift parameter beta, of shape (D,)
    """
    dx, dgamma, dbeta = None, None, None
    N = dout.shape[0]
    x, sample_mean, sample_var, x_hat, eps, gamma, beta = cache
    dgamma = np.sum(dout * x_hat, axis=0)
    dbeta = np.sum(dout, axis=0)
    # Gradient with respect to the normalized input x_hat
    dhat = dout * gamma
    # Direct path: x -> x_hat
    dx_1 = dhat / np.sqrt(sample_var + eps)
    # Paths through the batch variance and the batch mean
    dvar = np.sum(dhat * (x - sample_mean), axis=0) * (-0.5) * ((sample_var + eps) ** (-1.5))
    dmean = np.sum(-dhat, axis=0) / np.sqrt(sample_var + eps) + dvar * np.mean(2 * sample_mean - 2 * x, axis=0)
    dx_var = dvar * 2.0 * (x - sample_mean) / N
    dx_mean = dmean * 1.0 / N
    dx = dx_1 + dx_var + dx_mean
    return dx, dgamma, dbeta
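A backward pass like the one above is easy to get subtly wrong, so it helps to compare it against finite differences. The following is a minimal sketch, not part of the original code: `bn_forward`, `bn_backward_compact`, and `numerical_grad` are illustrative names, and the compact formula is the staged backward pass above with the three `dx` terms collected algebraically.

```python
import numpy as np

def bn_forward(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm on x of shape (N, D) (illustrative helper)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta, (x_hat, var, eps, gamma)

def bn_backward_compact(dout, cache):
    """Closed-form backward pass: the three dx terms merged into one expression."""
    x_hat, var, eps, gamma = cache
    N = dout.shape[0]
    dgamma = np.sum(dout * x_hat, axis=0)
    dbeta = np.sum(dout, axis=0)
    dx = gamma / (N * np.sqrt(var + eps)) * (N * dout - dbeta - x_hat * dgamma)
    return dx, dgamma, dbeta

def numerical_grad(f, x, dout, h=1e-5):
    """Centered finite differences of sum(f(x) * dout) with respect to x."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + h
        pos = np.sum(f(x) * dout)
        x[idx] = old - h
        neg = np.sum(f(x) * dout)
        x[idx] = old  # restore the original value
        grad[idx] = (pos - neg) / (2 * h)
    return grad

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5))
gamma, beta = rng.normal(size=5), rng.normal(size=5)
dout = rng.normal(size=(8, 5))

_, cache = bn_forward(x, gamma, beta)
dx, dgamma, dbeta = bn_backward_compact(dout, cache)
dx_num = numerical_grad(lambda v: bn_forward(v, gamma, beta)[0], x, dout)
print("max abs difference:", np.max(np.abs(dx - dx_num)))
```

If the analytic and numerical gradients disagree by more than a small tolerance, the backward pass most likely drops one of the paths through the mean or the variance.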
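The code above implements BN for fully connected layers. To use BN in a convolutional network, reshape the feature map produced by the conv layer from (N, C, H, W) to (N*H*W, C), so that each channel is treated as one feature. Note also that during training BN computes the mean and variance of every batch at every step, whereas at test time it uses the running averages of those statistics accumulated over training.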
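The reshape trick for convolutional feature maps can be sketched as follows. This is a minimal training-mode illustration only: `spatial_batchnorm_train` is a hypothetical name, and running averages are left out for brevity.

```python
import numpy as np

def spatial_batchnorm_train(x, gamma, beta, eps=1e-5):
    """Training-mode BN for x of shape (N, C, H, W); gamma, beta of shape (C,)."""
    N, C, H, W = x.shape
    # Move channels last, then flatten so that each row is one pixel
    # and each column is one channel: (N, C, H, W) -> (N*H*W, C)
    x_flat = x.transpose(0, 2, 3, 1).reshape(-1, C)
    mean = x_flat.mean(axis=0)
    var = x_flat.var(axis=0)
    out_flat = gamma * (x_flat - mean) / np.sqrt(var + eps) + beta
    # Undo the flattening: (N*H*W, C) -> (N, C, H, W)
    return out_flat.reshape(N, H, W, C).transpose(0, 3, 1, 2)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 3, 5, 5))
out = spatial_batchnorm_train(x, np.ones(3), np.zeros(3))
per_channel_mean = out.mean(axis=(0, 2, 3))
per_channel_std = out.std(axis=(0, 2, 3))
print(per_channel_mean, per_channel_std)
```

With gamma set to one and beta to zero, each channel of the output should have a mean close to 0 and a standard deviation close to 1. The transpose before the reshape matters: reshaping (N, C, H, W) directly to (-1, C) would mix values from different channels.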
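Advantages of BN:
- It tolerates larger learning rates;
- It is less sensitive to poor weight initialization;
- It has a regularizing effect.
Drawbacks of BN:
The mean and variance are computed over the batch. If the batch size is too small, these statistics do not represent the whole data distribution. If it is too large, the batch may not fit in GPU memory, and training is slow because the parameters are updated less often. Typical choices are 32, 64, or 128.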
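The small-batch problem can be made concrete with a quick experiment. The sketch below uses synthetic standard-normal data and a hypothetical helper `batch_mean_spread`; it measures how much the per-batch mean fluctuates from batch to batch for two batch sizes.

```python
import numpy as np

def batch_mean_spread(data, batch_size, rng, n_batches=1000):
    """Standard deviation of the per-batch mean over many random batches."""
    idx = rng.integers(0, data.size, size=(n_batches, batch_size))
    return data[idx].mean(axis=1).std()

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=100_000)  # synthetic "features"

spread_small = batch_mean_spread(data, batch_size=2, rng=rng)
spread_large = batch_mean_spread(data, batch_size=128, rng=rng)
print("batch size 2:  ", spread_small)
print("batch size 128:", spread_large)
```

For independent samples the spread of the batch mean shrinks roughly as 1/sqrt(batch size), so the statistics BN sees with a batch of 2 are far noisier than with a batch of 128.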
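2. Layer Normalization
LN normalizes each of the N samples separately, over all of that sample's features, so the mean and variance have shape N×1×1×1. Put simply, one mean and one variance are computed per sample. The code is therefore identical at training and test time, and no running averages are needed.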
def layernorm_forward(x, gamma, beta, ln_param):
    """
    Input:
    - x: Data of shape (N, D)
    - gamma: Scale parameter of shape (D,)
    - beta: Shift parameter of shape (D,)
    - ln_param: Dictionary with the following keys:
      - eps: Constant for numeric stability
    Returns a tuple of:
    - out: of shape (N, D)
    - cache: A tuple of values needed in the backward pass
    """
    out, cache = None, None
    eps = ln_param.get('eps', 1e-5)
    # Transpose to (D, N) so that axis 0 runs over the features of one sample
    x_T = x.T
    sample_mean = np.mean(x_T, axis=0)  # shape (N,): one mean per sample
    sample_var = np.var(x_T, axis=0)    # shape (N,): one variance per sample
    x_norm_T = (x_T - sample_mean) / np.sqrt(sample_var + eps)
    x_norm = x_norm_T.T
    out = x_norm * gamma + beta
    cache = (x, sample_mean, sample_var, x_norm, eps, gamma, beta)
    return out, cache
def layernorm_backward(dout, cache):
    dx, dgamma, dbeta = None, None, None
    x, sample_mean, sample_var, x_norm, eps, gamma, beta = cache
    # gamma and beta are shared across samples, so sum over the batch axis
    dgamma = np.sum(dout * x_norm, axis=0)
    dbeta = np.sum(dout, axis=0)
    # Work in the transposed (D, N) layout, mirroring the forward pass
    dout = dout.T
    D = dout.shape[0]  # number of features each statistic is averaged over
    dhat = dout * gamma[:, np.newaxis]
    dx_1 = dhat / np.sqrt(sample_var + eps)
    x = x.T
    dvar = np.sum(dhat * (x - sample_mean), axis=0) * (-0.5) * ((sample_var + eps) ** (-1.5))
    dmean = np.sum(-dhat, axis=0) / np.sqrt(sample_var + eps) + dvar * np.mean(2 * sample_mean - 2 * x, axis=0)
    dx_var = dvar * 2.0 * (x - sample_mean) / D
    dx_mean = dmean * 1.0 / D
    dx = dx_1 + dx_var + dx_mean
    dx = dx.T  # back to (N, D)
    return dx, dgamma, dbeta
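Advantage of LN: it does not depend on the batch, because normalization happens within a single sample. It can therefore be used in networks with a batch size of 1 and in RNNs. As a rule of thumb, BN suits CNNs better than LN does, while LN suits RNNs better than BN does.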
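The batch independence of LN can be checked directly. In the sketch below, `layer_norm` is a compact illustrative forward pass, not the function above; it normalizes a sample once as part of a batch of 8 and once on its own.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer norm on x of shape (N, D): statistics are taken per row (sample)."""
    mean = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
gamma, beta = rng.normal(size=16), rng.normal(size=16)

out_batch = layer_norm(x, gamma, beta)       # sample 0 inside a batch of 8
out_single = layer_norm(x[:1], gamma, beta)  # sample 0 alone (batch size 1)
same = np.allclose(out_batch[:1], out_single)
print("same output regardless of batch:", same)
```

Because no statistic crosses the sample boundary, the result for a sample does not change with the other samples in the batch. The same check would fail for training-mode BN.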
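3. Instance Normalization
IN was proposed mainly for style transfer networks. IN normalizes each feature map over its pixels, separately for every sample and every channel, so the mean and variance have shape N×C×1×1. In image style transfer the generated result depends mainly on one particular image instance, so normalizing across channels or across the batch is not appropriate; each instance and each channel has to be kept independent.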
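The original gives no code for IN, so here is a minimal forward-pass sketch under the assumptions that the input has shape (N, C, H, W) and that gamma and beta have shape (1, C, 1, 1). The name `instance_norm` is illustrative.

```python
import numpy as np

def instance_norm(x, gamma, beta, eps=1e-5):
    """Instance norm: one mean/variance per (sample, channel) pair."""
    mean = x.mean(axis=(2, 3), keepdims=True)  # shape (N, C, 1, 1)
    var = x.var(axis=(2, 3), keepdims=True)    # shape (N, C, 1, 1)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(2, 3, 4, 4))
gamma = np.ones((1, 3, 1, 1))
beta = np.zeros((1, 3, 1, 1))

out = instance_norm(x, gamma, beta)
plane_means = out.mean(axis=(2, 3))  # one value per (sample, channel)
print(plane_means.shape)
```

As with LN, the statistics never involve other samples, so the same code can be used for training and for inference.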
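4. Group Normalization
GN was proposed mainly to address the accuracy drop BN suffers at small batch sizes, where the batch statistics are poor estimates of the statistics of the whole data set.
GN splits the channels of the input into G groups and normalizes within each group, so the computation does not depend on the batch size. The mean and variance have shape N×G×1×1.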
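def spatial_groupnorm_forward(x, gamma, beta, G, gn_param):
    """
    Input:
    - x: Data of shape (N, C, H, W)
    - gamma: Scale parameter of shape (1, C, 1, 1)
    - beta: Shift parameter of shape (1, C, 1, 1)
    - G: Number of groups; C must be divisible by G
    - gn_param: Dictionary with the following keys:
      - eps: Constant for numeric stability
    Returns a tuple of:
    - out: of shape (N, C, H, W)
    - cache: A tuple of values needed in the backward pass
    """
    out, cache = None, None
    eps = gn_param.get('eps', 1e-5)
    N, C, H, W = x.shape
    # Split the channel axis into G groups of C // G channels each
    x_group = np.reshape(x, (N, G, C // G, H, W))
    mean = np.mean(x_group, axis=(2, 3, 4), keepdims=True)
    var = np.var(x_group, axis=(2, 3, 4), keepdims=True)
    x_groupnorm = (x_group - mean) / np.sqrt(var + eps)
    x_norm = np.reshape(x_groupnorm, (N, C, H, W))
    out = x_norm * gamma + beta
    cache = (G, x, x_norm, mean, var, gamma, beta, eps)
    return out, cache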
def spatial_groupnorm_backward(dout, cache):
    dx, dgamma, dbeta = None, None, None
    G, x, x_norm, mean, var, gamma, beta, eps = cache
    N, C, H, W = dout.shape
    # gamma and beta are per channel, so sum over batch and spatial axes
    dbeta = np.sum(dout, axis=(0, 2, 3), keepdims=True)
    dgamma = np.sum(dout * x_norm, axis=(0, 2, 3), keepdims=True)
    dx_norm = dout * gamma
    dx_groupnorm = dx_norm.reshape((N, G, C // G, H, W))
    x_group = x.reshape((N, G, C // G, H, W))
    dvar = np.sum(dx_groupnorm * -1.0 / 2 * (x_group - mean) * (var + eps) ** (-1.5), axis=(2, 3, 4), keepdims=True)
    N_group = C // G * H * W  # number of elements each statistic is averaged over
    dmean1 = np.sum(dx_groupnorm * -1.0 / np.sqrt(var + eps), axis=(2, 3, 4), keepdims=True)
    dmean2 = dvar * -2.0 / N_group * np.sum(x_group - mean, axis=(2, 3, 4), keepdims=True)
    dmean = dmean1 + dmean2
    dx_group1 = dx_groupnorm * 1.0 / np.sqrt(var + eps)
    dx_group2 = dmean * 1.0 / N_group
    dx_group3 = dvar * 2.0 / N_group * (x_group - mean)
    dx_groups = dx_group1 + dx_group2 + dx_group3
    dx = dx_groups.reshape((N, C, H, W))
    return dx, dgamma, dbeta
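GN sits between LN and IN: with a single group it normalizes over all of (C, H, W) as LN does on a feature map, and with one group per channel it reduces to IN. The sketch below checks both limits numerically. The helpers `group_norm` and `normalize` are illustrative and omit gamma and beta.

```python
import numpy as np

def group_norm(x, G, eps=1e-5):
    """Group norm without the affine step, for x of shape (N, C, H, W)."""
    N, C, H, W = x.shape
    xg = x.reshape(N, G, C // G, H, W)
    mean = xg.mean(axis=(2, 3, 4), keepdims=True)
    var = xg.var(axis=(2, 3, 4), keepdims=True)
    return ((xg - mean) / np.sqrt(var + eps)).reshape(N, C, H, W)

def normalize(x, axes, eps=1e-5):
    """Zero-mean, unit-variance normalization over the given axes."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 6, 4, 4))

# G = 1: a single group covering all channels, same as LN over (C, H, W)
gn_one_group_is_ln = np.allclose(group_norm(x, G=1), normalize(x, (1, 2, 3)))
# G = C: one channel per group, same as IN over (H, W)
gn_c_groups_is_in = np.allclose(group_norm(x, G=6), normalize(x, (2, 3)))
print(gn_one_group_is_ln, gn_c_groups_is_in)
```

Choosing G between 1 and C trades off between these two extremes.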
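Summary
- BN normalizes over the batch and keeps the channel dimension C; LN normalizes over the channels and keeps the sample dimension N; IN normalizes over each image plane and keeps N and C; GN splits the channels into groups and normalizes within each group, keeping N and G (the number of groups).
- BN and GN are better suited to CNNs; LN is better suited to RNNs; IN is mainly used for style transfer.
- BN uses different code for training and testing, because testing relies on the running averages. The momentum of the running average can be tuned to obtain more accurate estimates of the mean and variance.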
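The axes each method reduces over can be summarized in a few lines of NumPy. This sketch assumes an input of shape (N, C, H, W) = (8, 6, 4, 4) and G = 3 groups, both chosen only for illustration, and prints the shape of the statistics each method keeps.

```python
import numpy as np

x = np.random.default_rng(0).normal(size=(8, 6, 4, 4))  # (N, C, H, W)
N, C, H, W = x.shape
G = 3  # number of groups for GN (illustrative choice)

stat_shapes = {
    # BN: reduce over batch and spatial axes, keep C
    "BN": x.mean(axis=(0, 2, 3), keepdims=True).shape,
    # LN: reduce over channel and spatial axes, keep N
    "LN": x.mean(axis=(1, 2, 3), keepdims=True).shape,
    # IN: reduce over spatial axes only, keep N and C
    "IN": x.mean(axis=(2, 3), keepdims=True).shape,
    # GN: reduce within each group of C // G channels, keep N and G
    "GN": x.reshape(N, G, C // G, H, W).mean(axis=(2, 3, 4)).shape,
}

for name, shape in stat_shapes.items():
    print(name, shape)
```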
References
- Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
- Layer Normalization
- Instance Normalization: The Missing Ingredient for Fast Stylization
- Group Normalization
- CS231n course materials