Normalization in Deep Learning

  • Preface
  • Normalization
    • Why: why do we need normalization?
    • What: which normalization methods exist?
      • Batch Normalization
      • Layer Normalization
      • Instance Normalization
      • Group Normalization
      • Others
    • How: how to implement normalization
      • Batch Normalization
      • Layer Normalization
      • Group Normalization
      • Instance Normalization

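Preface

The Transformer paper uses layer normalization, and reading it reminded me that I had gone through material on normalization before but, regrettably, never wrote any of it down. This post records what I know about normalization. The most common variant is batch normalization; layer normalization, instance normalization, and group normalization appeared later. This post mainly draws on the Zhihu article "详解深度学习中的Normalization,BN/LN/WN" (a detailed explanation of normalization in deep learning: BN/LN/WN) by Juliuszh.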

[Figure 1: overview figure, referred to again in the Group Normalization section; image not available]

Normalization

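Andrew Ng has said, roughly, that deep learning is 80% data plus 20% model. A trained model performs best when its data are independent and identically distributed (i.i.d.). I.i.d. data simplify the training of conventional machine learning models and improve their predictive power. Data are therefore usually "whitened" before being fed to a model:

  1. Remove the correlation between features (the "independent" part)
  2. Give all features the same mean and variance (the "identically distributed" part)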
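
The two whitening steps above can be sketched with PCA whitening in NumPy. This is only a minimal illustration: the function name `pca_whiten`, the `eps` value, and the toy data are my own assumptions and not taken from the referenced article.

```python
import numpy as np

def pca_whiten(X, eps=1e-5):
    """Whiten X of shape [num_samples, num_features] (illustrative helper)."""
    # Step 0: center every feature at zero mean.
    Xc = X - X.mean(axis=0, keepdims=True)
    # Covariance matrix of the features.
    cov = Xc.T @ Xc / Xc.shape[0]
    # Eigen-decomposition of the symmetric covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Step 1: rotate onto the eigenvectors, which decorrelates the features.
    # Step 2: divide by the standard deviation along each new axis,
    # so every feature ends up with (almost exactly) unit variance.
    return (Xc @ eigvecs) / np.sqrt(eigvals + eps)

rng = np.random.default_rng(0)
a = rng.normal(size=(1000, 1))
# Two strongly correlated features with different means and scales.
X = np.hstack([3.0 * a + rng.normal(size=(1000, 1)), a + 5.0])

Xw = pca_whiten(X)
cov_w = Xw.T @ Xw / Xw.shape[0]  # close to the 2x2 identity matrix
```

After whitening, the covariance matrix is close to the identity: the off-diagonal entries (correlation) vanish and the diagonal entries (variance) are close to 1.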

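Why: why do we need normalization?

Normalization mainly addresses Internal Covariate Shift in deep learning. A deep neural network stacks many layers, and every parameter update in a lower layer changes the distribution of the inputs to the layers above it. As these changes accumulate through the stack, the input distribution of the higher layers can shift drastically, so the higher layers have to keep re-adapting to parameter updates in the lower layers. The figure below, taken from an answer by Wei Xiushen (魏秀参) on Zhihu, illustrates Internal Covariate Shift. Ultimately, the goal is to mitigate vanishing gradients and speed up training. Note that Batch Normalization does not fit RNNs: an RNN is a dynamic network structure, and the training instances in one batch have different lengths, so every time step has to maintain its own statistics, which prevents BN from being applied correctly. Adapting BN to RNNs is also very difficult.
[Figure 2: illustration of Internal Covariate Shift, from Wei Xiushen's Zhihu answer; image not available]
The specific normalization methods are as follows:
[Figure 3: overview of the normalization methods; image not available]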

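The Internal Covariate Shift phenomenon is this: treat the input of each layer as a distribution; because the parameters of the lower layers are updated during training, the same input distribution produces a different output distribution. Machine learning relies on an important assumption, the i.i.d. assumption: training data and test data follow the same distribution, which is the basic guarantee that a model fitted on training data performs well on the test set. Inside a neural network, the distribution seen by each layer differs from one training iteration to the next, so the training effect for that layer is not guaranteed. This is why it is called covariate shift between layers.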

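What: which normalization methods exist?

The standard normalization formula is $h=f\left(g\cdot\frac{x-\mu}{\sigma}+b\right)$, where $\mu$ and $\sigma$ are the mean and standard deviation used to normalize $x$, $g$ and $b$ are the gain and bias applied afterwards, and $f$ is the activation function.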
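
As a rough illustration of the standard formula, here is a small NumPy sketch. The function name `normalize`, the `axes` argument, the `eps` term, and the choice of ReLU for $f$ are assumptions made for this example; the methods below differ only in which axes the statistics are computed over.

```python
import numpy as np

def normalize(x, g, b, axes, eps=1e-5, f=lambda z: np.maximum(z, 0.0)):
    """Compute h = f(g * (x - mu) / sigma + b), with statistics over `axes`."""
    mu = x.mean(axis=axes, keepdims=True)
    # eps keeps the division stable when the variance is close to zero.
    sigma = np.sqrt(x.var(axis=axes, keepdims=True) + eps)
    return f(g * (x - mu) / sigma + b)

x = np.array([[1.0, 2.0, 3.0],
              [4.0, 6.0, 8.0]])
# Normalize every row on its own, then apply ReLU.
h = normalize(x, g=1.0, b=0.0, axes=1)
```

Both rows differ only by a shift and a scale, so after normalization they produce nearly identical outputs.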

Batch Normalization

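[Figure 4: Batch Normalization; image not available]
Plugging into the standard formula: $\mu_i=\frac{1}{M}\sum x_i,\quad \sigma_i=\sqrt{\frac{1}{M}\sum(x_i-\mu_i)^2+\epsilon}$, where $M$ is the mini-batch size and the sums run over the samples in the mini-batch. Batch Normalization suits settings where the mini-batch is fairly large and the mini-batches have similar distributions. The data should be thoroughly shuffled before training.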
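
A minimal NumPy sketch of the training-time computation, assuming a 2-D input of shape `[M, D]` (mini-batch size by features). The function name `batch_norm_train` and the toy data are illustrative, and the activation $f$ is left out.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Batch-normalize x of shape [M, D] with mini-batch statistics."""
    # One mean and one standard deviation per feature,
    # both computed across the M samples of the mini-batch.
    mu = x.mean(axis=0)
    sigma = np.sqrt(x.var(axis=0) + eps)
    return gamma * (x - mu) / sigma + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=4.0, size=(64, 3))
y = batch_norm_train(x, gamma=np.ones(3), beta=np.zeros(3))
# Each column of y now has mean close to 0 and standard deviation close to 1.
```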

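Batch Normalization depends on the batch size. BN has two prerequisites:

  • The batch size must not be too small, which conflicts with limited training resources
  • The distribution of every batch should be as close as possible to that of the whole dataset

Regarding the first prerequisite, there are scenarios where the batch size ends up too small. For tasks such as detecting tiny defects, high-precision keypoint detection, or small-object detection, we usually do not want to crudely reduce the input image resolution, since that defeats the purpose of using a high-resolution camera and may discard useful features. With limited compute, the batch size then cannot be set very large and may even have to be 1 or 2. A small batch size causes many training problems, of which the BN problem is the most prominent. Although training with a large batch size is the consensus, in practice the resources may simply not be available. Since a too-small batch size cannot guarantee that the data in the current mini-batch follow the same distribution as the whole dataset, can we gather statistics over several batches instead? The following two papers address this: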

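  • BRN: Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models
    Core idea: during training, because the batch size is small, the mean and variance computed on the current mini-batch differ from those of the full data, so the current mean and variance are corrected. The correction mainly uses the global mean and standard deviation collected with a moving average.
  • CBN: Cross-Iteration Batch Normalization; see the paper for details.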
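
To make the BRN correction more concrete, here is a rough NumPy sketch of the training-time forward pass as I understand it from the paper. The function name `batch_renorm_train` and the default values of `r_max` and `d_max` are assumptions for illustration; the learnable scale and offset, the moving-average updates, and the fact that `r` and `d` are treated as constants during backpropagation are all left out.

```python
import numpy as np

def batch_renorm_train(x, moving_mean, moving_std, r_max=3.0, d_max=5.0, eps=1e-5):
    """Batch Renormalization forward pass for x of shape [M, D] (sketch)."""
    # Statistics of the current (possibly very small) mini-batch.
    mu_b = x.mean(axis=0)
    sigma_b = np.sqrt(x.var(axis=0) + eps)
    # Correction terms that pull the batch statistics towards the
    # moving-average statistics; clipping limits the size of the correction.
    r = np.clip(sigma_b / moving_std, 1.0 / r_max, r_max)
    d = np.clip((mu_b - moving_mean) / moving_std, -d_max, d_max)
    return (x - mu_b) / sigma_b * r + d

# A tiny batch whose statistics differ from the moving averages.
x = np.array([[0.5], [1.5]])
moving_mean = np.array([0.0])
moving_std = np.array([1.0])
y = batch_renorm_train(x, moving_mean, moving_std)
# While r and d stay inside their clipping ranges, y equals
# (x - moving_mean) / moving_std, i.e. normalization with the global statistics.
```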

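Batch Normalization summary:

  • Advantages:
  1. Faster training convergence
  2. Less sensitivity to the initial weights
  3. More robust hyperparameters
  4. Needs less data to generalize
  • Disadvantages:
  1. Unstable with small batch sizes
  2. Increases training time
  3. Different results in the training and inference phases
  4. Poorly suited to online learning
  5. Poorly suited to recurrent neural networks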
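
The third disadvantage (different results in training and inference) comes from BN using mini-batch statistics during training but moving-average statistics at inference time. The toy class below is only a sketch of that mechanism; the name `BatchNorm1D` and the `momentum` value are assumptions, not the API of any particular library.

```python
import numpy as np

class BatchNorm1D:
    """Toy batch normalization for inputs of shape [M, D] (sketch)."""

    def __init__(self, num_features, momentum=0.9, eps=1e-5):
        self.gamma = np.ones(num_features)
        self.beta = np.zeros(num_features)
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum = momentum
        self.eps = eps

    def __call__(self, x, training):
        if training:
            # Training: use the statistics of the current mini-batch
            # and fold them into the moving averages.
            mean, var = x.mean(axis=0), x.var(axis=0)
            m = self.momentum
            self.running_mean = m * self.running_mean + (1 - m) * mean
            self.running_var = m * self.running_var + (1 - m) * var
        else:
            # Inference: use the moving averages collected during training.
            mean, var = self.running_mean, self.running_var
        return self.gamma * (x - mean) / np.sqrt(var + self.eps) + self.beta

rng = np.random.default_rng(0)
bn = BatchNorm1D(num_features=2)
batch = rng.normal(loc=5.0, scale=2.0, size=(4, 2))
train_out = bn(batch, training=True)   # normalized with batch statistics
infer_out = bn(batch, training=False)  # normalized with moving averages
```

The same input gives different outputs in the two modes until the moving averages have converged to the data statistics.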

Layer Normalization

[Figure 5: Layer Normalization; image not available]
$\mu=\frac{1}{H}\sum_i x_i,\quad \sigma=\sqrt{\frac{1}{H}\sum_i(x_i-\mu)^2+\epsilon}$, where $i$ enumerates all $H$ neurons of the layer. Layer normalization avoids BN's dependence on the mini-batch, so it suits small mini-batches, dynamic network structures, and RNNs.
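
A minimal NumPy sketch of layer normalization, assuming an input of shape `[N, H]` (samples by neurons). The function name `layer_norm` and the toy data are illustrative. The check at the end shows the property that matters here: the result for a sample does not depend on the other samples in the batch.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer-normalize x of shape [N, H] over the H neurons of each sample."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = np.sqrt(x.var(axis=-1, keepdims=True) + eps)
    return gamma * (x - mu) / sigma + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 8))
gamma, beta = np.ones(8), np.zeros(8)

full = layer_norm(x, gamma, beta)        # batch of 4 samples
single = layer_norm(x[:1], gamma, beta)  # batch of 1 sample
# The first sample is normalized identically in both cases.
batch_independent = np.allclose(full[:1], single)
```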

Instance Normalization

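Instance Normalization is suited to generative models, such as image style transfer. Because the generated result depends mainly on a single image instance, normalizing over the whole batch is not appropriate for image stylization. Using IN in style transfer not only speeds up convergence but also keeps the image instances independent of each other. It is also suited to tasks such as image super-resolution. The formula is as follows:
[Figure: Instance Normalization formula; image not available]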
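
Since the formula is only given as an image, here is a small NumPy sketch of the usual definition of instance normalization as a stand-in: the mean and variance are computed per sample and per channel over the spatial dimensions. The function name `instance_norm`, the `[N, C, H, W]` layout, and the omission of the learnable scale and offset are assumptions made for this example.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Instance-normalize x of shape [N, C, H, W] (no scale/offset)."""
    # One mean and variance for every (sample, channel) pair,
    # computed over the spatial positions H x W only.
    mu = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=(2, 3, 4, 4))
y = instance_norm(x)
# Each image is normalized independently of the other images in the batch.
independent = np.allclose(instance_norm(x[:1]), y[:1])
```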

Group Normalization

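The figure in the Preface makes this easy to understand: the channels are split into groups and the statistics are computed within each group. In the extreme cases Group Normalization is equivalent to LN (a single group containing all channels) or to IN (one group per channel).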
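
The equivalence in the extreme cases can be checked numerically. The NumPy sketch below is an illustration only: `group_norm` leaves out the learnable scale and offset, and the tensor shape and `eps` are arbitrary choices of mine.

```python
import numpy as np

def group_norm(x, G, eps=1e-5):
    """Group-normalize x of shape [N, C, H, W] with G groups (no scale/offset)."""
    N, C, H, W = x.shape
    # Split the C channels into G groups of C // G channels each.
    xg = x.reshape(N, G, C // G, H, W)
    mu = xg.mean(axis=(2, 3, 4), keepdims=True)
    var = xg.var(axis=(2, 3, 4), keepdims=True)
    return ((xg - mu) / np.sqrt(var + eps)).reshape(N, C, H, W)

def reference_norm(x, axes, eps=1e-5):
    """Normalize x with statistics over the given axes."""
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 3, 3))

# G = 1: a single group with all channels, i.e. layer normalization over (C, H, W).
same_as_ln = np.allclose(group_norm(x, G=1), reference_norm(x, (1, 2, 3)))
# G = C: one channel per group, i.e. instance normalization over (H, W).
same_as_in = np.allclose(group_norm(x, G=4), reference_norm(x, (2, 3)))
```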

Others

Besides the methods above, there are also Cosine Normalization and Weight Normalization; for details, see the article at https://zhuanlan.zhihu.com/p/33173246.

How: how to implement normalization

Batch Normalization

```python
from keras.layers import Dense, Dropout, Activation, Flatten, BatchNormalization
```

Layer Normalization

```python
import tensorflow as tf

def LayerNorm(x, gamma, beta, eps=1e-5):
    # x: input features with shape [N, C, H, W]
    # gamma, beta: scale and offset, with shape [1, C, 1, 1]
    # Per-sample statistics over all of C, H, W (same as GroupNorm with G = 1)
    # `keepdims` is the TF 2.x spelling; older TF 1.x releases used `keep_dims`
    mean, var = tf.nn.moments(x, [1, 2, 3], keepdims=True)
    x = (x - mean) / tf.sqrt(var + eps)
    return x * gamma + beta
```

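Group Normalization

```python
import tensorflow as tf

def GroupNorm(x, gamma, beta, G, eps=1e-5):
    # x: input features with shape [N, C, H, W]
    # gamma, beta: scale and offset, with shape [1, C, 1, 1]
    # G: number of groups for GN (C must be divisible by G)
    N, C, H, W = x.shape
    # Split the channels into G groups of C // G channels each
    x = tf.reshape(x, [N, G, C // G, H, W])
    # Per-sample, per-group statistics
    # `keepdims` is the TF 2.x spelling; older TF 1.x releases used `keep_dims`
    mean, var = tf.nn.moments(x, [2, 3, 4], keepdims=True)
    x = (x - mean) / tf.sqrt(var + eps)
    x = tf.reshape(x, [N, C, H, W])
    return x * gamma + beta
```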

Instance Normalization

```python
from keras.layers import Layer, InputSpec
from keras import initializers, regularizers, constraints
from keras import backend as K
 
 
class InstanceNormalization(Layer):
    """Instance normalization layer.
    Normalize the activations of the previous layer at each step,
    i.e. applies a transformation that maintains the mean activation
    close to 0 and the activation standard deviation close to 1.
    # Arguments
        axis: Integer, the axis that should be normalized
            (typically the features axis).
            For instance, after a `Conv2D` layer with
            `data_format="channels_first"`,
            set `axis=1` in `InstanceNormalization`.
            Setting `axis=None` will normalize all values in each
            instance of the batch.
            Axis 0 is the batch dimension. `axis` cannot be set to 0 to avoid errors.
        epsilon: Small float added to variance to avoid dividing by zero.
        center: If True, add offset of `beta` to normalized tensor.
            If False, `beta` is ignored.
        scale: If True, multiply by `gamma`.
            If False, `gamma` is not used.
            When the next layer is linear (also e.g. `nn.relu`),
            this can be disabled since the scaling
            will be done by the next layer.
        beta_initializer: Initializer for the beta weight.
        gamma_initializer: Initializer for the gamma weight.
        beta_regularizer: Optional regularizer for the beta weight.
        gamma_regularizer: Optional regularizer for the gamma weight.
        beta_constraint: Optional constraint for the beta weight.
        gamma_constraint: Optional constraint for the gamma weight.
    # Input shape
        Arbitrary. Use the keyword argument `input_shape`
        (tuple of integers, does not include the samples axis)
        when using this layer as the first layer in a Sequential model.
    # Output shape
        Same shape as input.
    # References
        - [Layer Normalization](https://arxiv.org/abs/1607.06450)
        - [Instance Normalization: The Missing Ingredient for Fast Stylization](
        https://arxiv.org/abs/1607.08022)
    """
    def __init__(self,
                 axis=None,
                 epsilon=1e-3,
                 center=True,
                 scale=True,
                 beta_initializer='zeros',
                 gamma_initializer='ones',
                 beta_regularizer=None,
                 gamma_regularizer=None,
                 beta_constraint=None,
                 gamma_constraint=None,
                 **kwargs):
        super(InstanceNormalization, self).__init__(**kwargs)
        self.supports_masking = True
        self.axis = axis
        self.epsilon = epsilon
        self.center = center
        self.scale = scale
        self.beta_initializer = initializers.get(beta_initializer)
        self.gamma_initializer = initializers.get(gamma_initializer)
        self.beta_regularizer = regularizers.get(beta_regularizer)
        self.gamma_regularizer = regularizers.get(gamma_regularizer)
        self.beta_constraint = constraints.get(beta_constraint)
        self.gamma_constraint = constraints.get(gamma_constraint)
 
    def build(self, input_shape):
        ndim = len(input_shape)
        if self.axis == 0:
            raise ValueError('Axis cannot be zero')
 
        if (self.axis is not None) and (ndim == 2):
            raise ValueError('Cannot specify axis for rank 1 tensor')
 
        self.input_spec = InputSpec(ndim=ndim)
 
        if self.axis is None:
            shape = (1,)
        else:
            shape = (input_shape[self.axis],)
 
        if self.scale:
            self.gamma = self.add_weight(shape=shape,
                                         name='gamma',
                                         initializer=self.gamma_initializer,
                                         regularizer=self.gamma_regularizer,
                                         constraint=self.gamma_constraint)
        else:
            self.gamma = None
        if self.center:
            self.beta = self.add_weight(shape=shape,
                                        name='beta',
                                        initializer=self.beta_initializer,
                                        regularizer=self.beta_regularizer,
                                        constraint=self.beta_constraint)
        else:
            self.beta = None
        self.built = True
 
    def call(self, inputs, training=None):
        input_shape = K.int_shape(inputs)
        reduction_axes = list(range(0, len(input_shape)))
 
        if self.axis is not None:
            del reduction_axes[self.axis]
 
        del reduction_axes[0]
 
        mean = K.mean(inputs, reduction_axes, keepdims=True)
        stddev = K.std(inputs, reduction_axes, keepdims=True) + self.epsilon
        normed = (inputs - mean) / stddev
 
        broadcast_shape = [1] * len(input_shape)
        if self.axis is not None:
            broadcast_shape[self.axis] = input_shape[self.axis]
 
        if self.scale:
            broadcast_gamma = K.reshape(self.gamma, broadcast_shape)
            normed = normed * broadcast_gamma
        if self.center:
            broadcast_beta = K.reshape(self.beta, broadcast_shape)
            normed = normed + broadcast_beta
        return normed
 
    def get_config(self):
        config = {
            'axis': self.axis,
            'epsilon': self.epsilon,
            'center': self.center,
            'scale': self.scale,
            'beta_initializer': initializers.serialize(self.beta_initializer),
            'gamma_initializer': initializers.serialize(self.gamma_initializer),
            'beta_regularizer': regularizers.serialize(self.beta_regularizer),
            'gamma_regularizer': regularizers.serialize(self.gamma_regularizer),
            'beta_constraint': constraints.serialize(self.beta_constraint),
            'gamma_constraint': constraints.serialize(self.gamma_constraint)
        }
        base_config = super(InstanceNormalization, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))
```
