[Kaiming]Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

Table of Contents

    • Main Content
      • PReLU
      • Kaiming Initialization
        • Forward case
        • Backward case

He K, Zhang X, Ren S, et al. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification[C]. International Conference on Computer Vision (ICCV), 2015: 1026-1034.

@inproceedings{he2015delving,
  title={Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification},
  author={He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian},
  booktitle={Proceedings of the IEEE International Conference on Computer Vision (ICCV)},
  pages={1026--1034},
  year={2015}}

This paper introduces the PReLU activation function and the Kaiming parameter initialization method.

Main Content

PReLU


$$f(y_i) = \begin{cases} y_i, & y_i > 0, \\ a_i y_i, & y_i \le 0, \end{cases}$$

where $a_i$ is trained as a parameter of the network. This is equivalent to

$$f(y_i) = \max(0, y_i) + a_i \min(0, y_i).$$

In particular, all the units in one layer can share a single $a$ (the channel-shared variant).
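
A minimal PyTorch sketch of this activation (PyTorch also ships `torch.nn.PReLU`; here the formula is written out explicitly, with the slope initialized to 0.25 as in the paper):

```python
import torch
import torch.nn as nn

class PReLU(nn.Module):
    """f(y) = max(0, y) + a * min(0, y), with a learned.
    num_parameters=1 gives the channel-shared variant; set it to the number
    of channels for a channel-wise slope a_i."""
    def __init__(self, num_parameters=1, init=0.25):
        super().__init__()
        self.a = nn.Parameter(torch.full((num_parameters,), float(init)))

    def forward(self, y):
        if self.a.numel() > 1:
            # reshape a to (1, C, 1, ..., 1) so it broadcasts over the channel dim
            a = self.a.view((1, -1) + (1,) * (y.dim() - 2))
        else:
            a = self.a
        return torch.clamp(y, min=0) + a * torch.clamp(y, max=0)
```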

Kaiming Initialization

Forward case

$$\mathbf{y}_l = W_l \mathbf{x}_l + \mathbf{b}_l,$$

For a convolutional layer, $\mathbf{x}_l$ is an unrolled $k \times k \times c$ patch, so $\mathbf{x}_l \in \mathbb{R}^{k^2 c}$, while $\mathbf{y}_l \in \mathbb{R}^{d}$ and $W_l \in \mathbb{R}^{d \times k^2 c}$ (each row can be viewed as one kernel); write $n = k^2 c$. Also,

$$\mathbf{x}_l = f(\mathbf{y}_{l-1}),$$

$$c_l = d_{l-1}.$$
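
The unrolled-patch view above can be checked numerically; a small sketch assuming PyTorch (the sizes `N, c, H, W, d, k` are made up for illustration):

```python
import torch
import torch.nn.functional as F

# A convolution as a matrix product over unrolled k*k*c patches.
N, c, H, W, d, k = 1, 3, 8, 8, 16, 3
x = torch.randn(N, c, H, W)
conv = torch.nn.Conv2d(c, d, k, bias=False)

patches = F.unfold(x, k)                    # (N, k^2*c, L): each column is one x_l
W_mat = conv.weight.view(d, -1)             # (d, k^2*c): each row is one kernel
y = W_mat @ patches                         # (N, d, L): each column is one y_l

y_ref = conv(x).view(N, d, -1)
print(torch.allclose(y, y_ref, atol=1e-5))  # True
```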

Assume that $w_l$ and $x_l$ (not bold, denoting individual elements of $\mathbf{w}_l$ and $\mathbf{x}_l$) are mutually independent, and that $w_l$ is sampled from a symmetric distribution with zero mean. Then


$$Var[y_l] = n_l Var[w_l x_l] = n_l Var[w_l] E[x_l^2].$$

If $E[x_l] = 0$, this would reduce to $Var[y_l] = n_l Var[w_l] Var[x_l]$, but for ReLU or PReLU $E[x_l] \ne 0$, so that simplification does not hold.

If we set $b_{l-1} = 0$, it is easy to show that

$$E[x_l^2] = \frac{1}{2} Var[y_{l-1}]$$

when $f$ is ReLU (since $y_{l-1}$ is then symmetric about zero, $E[\max(0, y_{l-1})^2] = \frac{1}{2} E[y_{l-1}^2] = \frac{1}{2} Var[y_{l-1}]$). If $f$ is PReLU,

$$E[x_l^2] = \frac{1+a^2}{2} Var[y_{l-1}].$$
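
A quick Monte Carlo check of the ReLU identity above, as a sketch assuming PyTorch:

```python
import torch

# Check E[x^2] = 1/2 * Var[y] for x = ReLU(y), with y symmetric about zero
y = 3.0 * torch.randn(1_000_000)   # zero-mean symmetric y, Var[y] = 9
x = torch.clamp(y, min=0)          # x = max(0, y)
print((x ** 2).mean().item())      # ≈ 4.5
print(0.5 * y.var().item())        # ≈ 4.5
```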
The derivation below uses ReLU; the PReLU case is analogous.


$$Var[y_l] = \frac{1}{2} n_l Var[w_l] Var[y_{l-1}].$$

Naturally, we want the variance to be preserved across layers:

$$Var[y_i] = Var[y_j] \quad \Rightarrow \quad \frac{1}{2} n_l Var[w_l] = 1, \quad \forall l.$$
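
The forward condition gives $Var[w_l] = 2/n_l$ with $n_l = k^2 c$, i.e. a zero-mean Gaussian with standard deviation $\sqrt{2/n_l}$. A sketch of this initialization, assuming PyTorch (the helper name `kaiming_init_forward` and the PReLU slope argument `a` are mine):

```python
import math
import torch.nn as nn

def kaiming_init_forward(conv: nn.Conv2d, a: float = 0.0):
    """Initialize so that (1 + a^2)/2 * n_l * Var[w_l] = 1, with n_l = k^2 * c_in.
    a is the PReLU slope (a = 0 recovers the ReLU case)."""
    k_h, k_w = conv.kernel_size
    n_l = k_h * k_w * conv.in_channels          # fan-in, n = k^2 * c
    std = math.sqrt(2.0 / ((1.0 + a ** 2) * n_l))
    nn.init.normal_(conv.weight, mean=0.0, std=std)
    if conv.bias is not None:
        nn.init.zeros_(conv.bias)               # b_l = 0, as assumed in the derivation
```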

Backward case

$$\Delta \mathbf{x}_l = \hat{W}_l \Delta \mathbf{y}_l, \tag{13}$$

where $\Delta \mathbf{x}_l$ denotes the gradient of the loss with respect to $\mathbf{x}_l$. The $\mathbf{y}_l$ here differs slightly from the $\mathbf{y}_l$ used above: this expression comes from backpropagating the gradient through a convolution, which is hard to spell out in a few words, and $\hat{W}_l$ is a rearrangement of $W_l$.

Since $\mathbf{x}_{l+1} = f(\mathbf{y}_l)$ (the relation $\mathbf{x}_l = f(\mathbf{y}_{l-1})$ shifted by one layer), the chain rule gives

$$\Delta y_l = f'(y_l)\, \Delta x_{l+1}.$$

Assume that $f'(y_l)$ and $\Delta x_{l+1}$ are mutually independent; then

$$E[\Delta y_l] = E[f'(y_l)]\, E[\Delta x_{l+1}] = 0$$

(since $\Delta x_{l+1}$ has zero mean when $w_{l+1}$ is drawn from a zero-mean symmetric distribution). When $f$ is ReLU:

$$E[(\Delta y_l)^2] = Var[\Delta y_l] = \frac{1}{2} Var[\Delta x_{l+1}].$$

When $f$ is PReLU:

$$E[(\Delta y_l)^2] = Var[\Delta y_l] = \frac{1+a^2}{2} Var[\Delta x_{l+1}].$$

The following takes $f$ to be ReLU; the PReLU case is analogous.

$$Var[\Delta x_l] = \hat{n}_l Var[w_l] Var[\Delta y_l] = \frac{1}{2} \hat{n}_l Var[w_l] Var[\Delta x_{l+1}],$$

where $\hat{n}_l = k^2 d$ is the length of $\Delta \mathbf{y}_l$ as reshaped in Eq. (13); note that in general $\hat{n}_l \ne n_l = k^2 c$.

As in the forward case, we want $Var[\Delta x_l]$ to be the same for every layer, which requires

$$\frac{1}{2} \hat{n}_l Var[w_l] = 1, \quad \forall l.$$

In practice, we can use either the forward or the backward condition (because the mismatch between them does not accumulate across layers).
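
In PyTorch, these two conditions correspond to the `fan_in` and `fan_out` modes of the built-in initializer; a usage sketch (the layer sizes are made up):

```python
import torch.nn as nn

conv = nn.Conv2d(64, 128, kernel_size=3)

# Forward condition: 1/2 * n_l * Var[w] = 1, with n_l = k^2 * c (fan-in)
nn.init.kaiming_normal_(conv.weight, mode='fan_in', nonlinearity='relu')

# Backward condition: 1/2 * n_hat_l * Var[w] = 1, with n_hat_l = k^2 * d (fan-out)
nn.init.kaiming_normal_(conv.weight, mode='fan_out', nonlinearity='relu')

# For PReLU with slope a, the factor becomes (1 + a^2)/2; pass a via 'leaky_relu'
nn.init.kaiming_normal_(conv.weight, a=0.25, mode='fan_in', nonlinearity='leaky_relu')
```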
