自从看了Transformer以后,里面提到了layer normalization。想到之前的也看过相关的normalization资料,很可惜没有记录下来。于是在这篇文章中记录normalization相关的知识,最常见的normalization是batch normalization,后面出现了layer normalization、instance normalization和group normalization。本篇博客主要参考知乎大神Juliuszh的详解深度学习中的Normalization,BN/LN/WN。
吴恩达大致说过深度学习=80%数据+20%模型。训练得到的模型算法要好,莫过于数据分布能够“独立同分布”independent and identically distributed
,简称为i.i.d.
。独立同分布的数据可以简化常规机器学习模型的训练、提升机器学习模型的预测能力。因此喂入数据之前需要“白化”:
normalization主要解决深度学习里面的Internal Covariate Shift
。深度神经网络涉及到很多层的叠加,而每一层的参数更新会导致上层的输入数据分布发生变化,通过层层叠加,高层的输入分布变化会非常剧烈,这就使得高层需要不断去重新适应底层的参数更新。关于Internal Covariate Shift
的解释如下图所示,来自知乎魏秀参在一个回答中的解释。说到底就是为了解决梯度弥散的问题,提高训练速度。值得注意的是RNN无法使用Batch Normalization,因为它是一个动态的网络结构,同一个batch中训练实例有长有短,导致每一个时间步长必须维持各自的统计量,这使得BN并不能正确的使用。在RNN中,对bn进行改进也非常的困难。
具体的Normalization方法如下:
Internal Covariate Shift
现象是:将每一层的输入作为一个分布看待,由于底层的参数随着训练更新,导致相同的输入分布得到的输出分布改变了。而机器学习中有个很重要的假设:IID独立同分布假设,就是假设训练数据和测试数据是满足相同分布的,这是通过训练数据获得的模型能够在测试集获得好的效果的一个基本保障。那么,细化到神经网络的每一层间,每轮训练时分布都是不一致,那么相对的训练效果就得不到保障,所以称为层间的covariate shift。
normalization的标准公式是 h = f ( g ⋅ x − μ σ + b ) h=f(g\cdot\frac{x-\mu}{\sigma}+b) h=f(g⋅σx−μ+b)
套用标准公式, μ i = 1 M ∑ x i , σ i = 1 M ∑ ( x i − μ i ) 2 + ϵ \mu_i=\frac{1}{M}\sum x_i, \sigma_i=\sqrt{\frac{1}{M}\sum(x_i-\mu_i)^2+\epsilon} μi=M1∑xi,σi=M1∑(xi−μi)2+ϵ其中 M M M是mini-batch的大小。Btatch Normalization适用场景是mini batch比较大,数据分布比较接近。在训练之前需要充分的shuffle。
Batch Normalization跟参数batch size有关,BN有两个前提要求:
对于第一个问题,存在batchsize过小的场景,如对于微小的缺陷检测、高精度的关键点检测或小物体的目标检测等任务,我们一般不太想粗暴降低输入图片的分辨率,这样违背了我们使用高分辨率相机的初衷,也可能导致丢失有用特征。在算力有限的情况下,batchsize就无法设置太大,甚至只能为1或2。小的batchsize会带来很多训练上的问题,其中BN问题就是最突出的。虽然大batchsize训练是一个共识,但是现实中可能无法具有充足的资源。既然batchsize太小的情况下,无法保证当前minibatch收集到的数据和整体数据同分布。那么能否多收集几个batch的数据进行统计呢?以下两篇论文主要介绍:
BatchNormalization总结:
μ i = ∑ x i , σ i = ∑ ( x i − μ i ) 2 + ϵ \mu_i=\sum x_i, \sigma_i=\sqrt{\sum(x_i-\mu_i)^2+\epsilon} μi=∑xi,σi=∑(xi−μi)2+ϵ其中 i i i枚举了该层所有的神经元。layer normalization可以避免BN中受mini batch的影响,适用于小mini batch 场景,动态网络场景和RNN场景。
Instance Normalization适用于生成模型中,比如图片风格迁移。因为图片生成的结果主要依赖于某个图像实例,所以对整个Batch进行Normalization操作并不适合图像风格化的任务,在风格迁移中适用IN不仅可以加速模型收敛,并且可以保持每个图像实例之间的独立性。还适用于图像超分辨率等。计算公式如下:
看前沿的图也比较容易理解,极端情况下可等价于LN和IN。
normalization除了上述方法外,还有Cosine Normalization和Weight Normalization,具体可以参考:https://zhuanlan.zhihu.com/p/33173246的文章。
from keras.layers import Dense, Dropout, Activation, Flatten, BatchNormalization
## Layer Normalization
```python
def GroupNorm(x, gamma, beta, eps=1e−5):
# x: input features with shape [N,C,H,W]
# gamma, beta: scale and offset, with shape [1,C,1,1]
# G: number of groups for GN
N, C, H, W = x.shape
x = tf.reshape(x, [N, G, C, H, W])
mean, var = tf.nn.moments(x, [2, 3, 4], keep dims=True)
x = (x − mean) / tf.sqrt(var + eps)
x = tf.reshape(x, [N, C, H, W])
return x ∗ gamma + beta
``
def GroupNorm(x, gamma, beta, G, eps=1e−5):
# x: input features with shape [N,C,H,W]
# gamma, beta: scale and offset, with shape [1,C,1,1]
# G: number of groups for GN
N, C, H, W = x.shape
x = tf.reshape(x, [N, G, C // G, H, W])
mean, var = tf.nn.moments(x, [2, 3, 4], keep dims=True)
x = (x − mean) / tf.sqrt(var + eps)
x = tf.reshape(x, [N, C, H, W])
return x ∗ gamma + beta
``
## Instance Normalization
```python
from keras.layers import Layer, InputSpec
from keras import initializers, regularizers, constraints
from keras import backend as K
class InstanceNormalization(Layer):
"""Instance normalization layer.
Normalize the activations of the previous layer at each step,
i.e. applies a transformation that maintains the mean activation
close to 0 and the activation standard deviation close to 1.
# Arguments
axis: Integer, the axis that should be normalized
(typically the features axis).
For instance, after a `Conv2D` layer with
`data_format="channels_first"`,
set `axis=1` in `InstanceNormalization`.
Setting `axis=None` will normalize all values in each
instance of the batch.
Axis 0 is the batch dimension. `axis` cannot be set to 0 to avoid errors.
epsilon: Small float added to variance to avoid dividing by zero.
center: If True, add offset of `beta` to normalized tensor.
If False, `beta` is ignored.
scale: If True, multiply by `gamma`.
If False, `gamma` is not used.
When the next layer is linear (also e.g. `nn.relu`),
this can be disabled since the scaling
will be done by the next layer.
beta_initializer: Initializer for the beta weight.
gamma_initializer: Initializer for the gamma weight.
beta_regularizer: Optional regularizer for the beta weight.
gamma_regularizer: Optional regularizer for the gamma weight.
beta_constraint: Optional constraint for the beta weight.
gamma_constraint: Optional constraint for the gamma weight.
# Input shape
Arbitrary. Use the keyword argument `input_shape`
(tuple of integers, does not include the samples axis)
when using this layer as the first layer in a Sequential model.
# Output shape
Same shape as input.
# References
- [Layer Normalization](https://arxiv.org/abs/1607.06450)
- [Instance Normalization: The Missing Ingredient for Fast Stylization](
https://arxiv.org/abs/1607.08022)
"""
def __init__(self,
axis=None,
epsilon=1e-3,
center=True,
scale=True,
beta_initializer='zeros',
gamma_initializer='ones',
beta_regularizer=None,
gamma_regularizer=None,
beta_constraint=None,
gamma_constraint=None,
**kwargs):
super(InstanceNormalization, self).__init__(**kwargs)
self.supports_masking = True
self.axis = axis
self.epsilon = epsilon
self.center = center
self.scale = scale
self.beta_initializer = initializers.get(beta_initializer)
self.gamma_initializer = initializers.get(gamma_initializer)
self.beta_regularizer = regularizers.get(beta_regularizer)
self.gamma_regularizer = regularizers.get(gamma_regularizer)
self.beta_constraint = constraints.get(beta_constraint)
self.gamma_constraint = constraints.get(gamma_constraint)
def build(self, input_shape):
ndim = len(input_shape)
if self.axis == 0:
raise ValueError('Axis cannot be zero')
if (self.axis is not None) and (ndim == 2):
raise ValueError('Cannot specify axis for rank 1 tensor')
self.input_spec = InputSpec(ndim=ndim)
if self.axis is None:
shape = (1,)
else:
shape = (input_shape[self.axis],)
if self.scale:
self.gamma = self.add_weight(shape=shape,
name='gamma',
initializer=self.gamma_initializer,
regularizer=self.gamma_regularizer,
constraint=self.gamma_constraint)
else:
self.gamma = None
if self.center:
self.beta = self.add_weight(shape=shape,
name='beta',
initializer=self.beta_initializer,
regularizer=self.beta_regularizer,
constraint=self.beta_constraint)
else:
self.beta = None
self.built = True
def call(self, inputs, training=None):
input_shape = K.int_shape(inputs)
reduction_axes = list(range(0, len(input_shape)))
if self.axis is not None:
del reduction_axes[self.axis]
del reduction_axes[0]
mean = K.mean(inputs, reduction_axes, keepdims=True)
stddev = K.std(inputs, reduction_axes, keepdims=True) + self.epsilon
normed = (inputs - mean) / stddev
broadcast_shape = [1] * len(input_shape)
if self.axis is not None:
broadcast_shape[self.axis] = input_shape[self.axis]
if self.scale:
broadcast_gamma = K.reshape(self.gamma, broadcast_shape)
normed = normed * broadcast_gamma
if self.center:
broadcast_beta = K.reshape(self.beta, broadcast_shape)
normed = normed + broadcast_beta
return normed
def get_config(self):
config = {
'axis': self.axis,
'epsilon': self.epsilon,
'center': self.center,
'scale': self.scale,
'beta_initializer': initializers.serialize(self.beta_initializer),
'gamma_initializer': initializers.serialize(self.gamma_initializer),
'beta_regularizer': regularizers.serialize(self.beta_regularizer),
'gamma_regularizer': regularizers.serialize(self.gamma_regularizer),
'beta_constraint': constraints.serialize(self.beta_constraint),
'gamma_constraint': constraints.serialize(self.gamma_constraint)
}
base_config = super(InstanceNormalization, self).get_config()
return dict(list(base_config.items()) + list(config.items()))