I have recently been reading the Caffe source code. The official MNIST example uses the LeNet model, and in the layer named conv1 the weights are initialized with xavier. This post looks at what Xavier initialization does, why it works, and how it is implemented in Caffe.
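For reference, the conv1 definition in the LeNet example prototxt looks roughly like the following (trimmed; see examples/mnist/lenet_train_test.prototxt in the Caffe repository for the exact file). The `weight_filler` block is where xavier is selected:

```protobuf
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  convolution_param {
    num_output: 20
    kernel_size: 5
    stride: 1
    weight_filler {
      type: "xavier"    # Xavier initialization for the conv1 weights
    }
    bias_filler {
      type: "constant"  # biases start at 0, matching the derivation below
    }
  }
}
```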
To keep the derivation simple we use a small three-layer network, and we fix the following notation:
| Symbol | Meaning |
|---|---|
| $N^l$ | number of units in layer $l$ |
| $A^l$ | activation vector of layer $l$, with shape $[N^l \times 1]$ |
| $a^l_j$ | element of $A^l$: the activation of unit $j$ in layer $l$ |
| $W^l$ | weight matrix of layer $l$, with shape $[N^{l-1} \times N^l]$ |
| $w^l_{jk}$ | element of $W^l$: the weight connecting unit $j$ of layer $l-1$ to unit $k$ of layer $l$ |
| $B^l$ | bias vector of layer $l$, with shape $[N^l \times 1]$ |
| $b^l_k$ | element of $B^l$: the bias of unit $k$ in layer $l$ |
| $Z^l$ | vector of weighted inputs to the activation function in layer $l$, i.e. $Z^l = (W^l)^T A^{l-1} + B^l$ |
| $z^l_j$ | element of $Z^l$: the weighted input of unit $j$ in layer $l$ |
| $C$ | the loss function being optimized; here the squared loss $\frac{1}{2}(Y-O)^2$ |
| $\sigma$ | the activation function, applied element-wise to its input vector, so $A^l = \sigma(Z^l)$ |
| $X$ | the input vector of the network |
| $\delta^l$ | $\delta^l=\frac{\partial C}{\partial Z^l}$, the gradient of the loss with respect to the weighted inputs of layer $l$, also called the error term |
| $\delta^l_k$ | $\delta^l_k=\frac{\partial C}{\partial z^l_k}$, the gradient of the loss with respect to the weighted input of unit $k$ in layer $l$ |
| $M$ | total number of layers in the network |
Assume the input to layer $l$ is the activation vector $A^{l-1}$ of layer $l-1$, with shape $[N^{l-1} \times 1]$.
Taking layer $l = 2$ as an example:
| Inputs | Weighted input | Activation | Output |
|---|---|---|---|
| $a^1_1, a^1_2$ | $z^2_1 = w^2_{11} a^1_1 + w^2_{21} a^1_2 + b^2_1$ | $a^2_1 = \sigma(z^2_1)$ | $a^2_1$ |
| $a^1_1, a^1_2$ | $z^2_2 = w^2_{12} a^1_1 + w^2_{22} a^1_2 + b^2_2$ | $a^2_2 = \sigma(z^2_2)$ | $a^2_2$ |
| $a^1_1, a^1_2$ | $z^2_3 = w^2_{13} a^1_1 + w^2_{23} a^1_2 + b^2_3$ | $a^2_3 = \sigma(z^2_3)$ | $a^2_3$ |
So the $k$-th component of the layer-$l$ output is:
$$a^l_k = \sigma\Big(\sum_{j=1}^{N^{l-1}} w^l_{jk}\,a^{l-1}_j + b^l_k\Big),\qquad k = 1,\dots,N^l \tag{1}$$
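As a concrete illustration of the table above and equation (1), here is a tiny stand-alone sketch (all sizes and values are made up) that computes the three weighted inputs and activations of layer 2 from two layer-1 activations, using the same $w_{jk}$ indexing (input $j$, output $k$) and tanh as $\sigma$:

```cpp
#include <cmath>
#include <cstdio>

int main() {
  const int N1 = 2, N2 = 3;                 // units in layer 1 and layer 2
  double a1[N1] = {0.3, -0.7};              // layer-1 activations (made up)
  double w2[N1][N2] = {{0.1, -0.2, 0.4},    // w2[j][k]: input j -> output k
                       {0.5,  0.3, -0.1}};
  double b2[N2] = {0.0, 0.0, 0.0};          // biases initialized to zero

  for (int k = 0; k < N2; ++k) {
    double z = b2[k];
    for (int j = 0; j < N1; ++j)            // z^2_k = sum_j w^2_{jk} * a^1_j + b^2_k
      z += w2[j][k] * a1[j];
    double a = std::tanh(z);                // a^2_k = sigma(z^2_k)
    std::printf("z2_%d = %+.4f  a2_%d = %+.4f\n", k + 1, z, k + 1, a);
  }
  return 0;
}
```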
Consider the variance of the weighted input $z^l_k$ of unit $k$ in layer $l$:
$$D(z_{k}^{l}) = D\Big(\sum_{j=1}^{N^{l-1}}w_{jk}^{l}a^{l-1}_j+b^l_k\Big)\tag{2}$$
Initializing the bias terms to $0$, this becomes:
$$D(z_{k}^{l}) = D\Big(\sum_{j=1}^{N^{l-1}}w_{jk}^{l}a^{l-1}_j\Big)\tag{3}$$
Since the variance of a sum of independent random variables equals the sum of their variances, we get:
$$D(z_{k}^{l}) = \sum_{j=1}^{N^{l-1}}D\big(w_{jk}^{l}a^{l-1}_j\big)\tag{4}$$
Assume the weights and the layer-$(l-1)$ activations are mutually independent. The variance of a product of two independent random variables can be derived as follows.
From the definition of variance:
$$\begin{aligned} D(xw)&=E\big((xw - E(xw))^2\big) \\ &= E\big(x^2w^2-2xwE(xw)+(E(xw))^2\big) \\ &=E(x^2w^2)-2(E(xw))^2+(E(xw))^2 \\ &=E(x^2w^2)-(E(xw))^2 \end{aligned}\tag{5-1}$$
Here $D(\cdot)$ denotes variance and $E(\cdot)$ denotes expectation (in practice, the mean).
Assuming $x$ and $w$ are independent, we have
$$E(x^2w^2) = E(x^2)E(w^2)\tag{5-2}$$
$$E(xw) = E(x)E(w)\tag{5-3}$$
Combining (5-2) and (5-3):
$$D(xw)=E(x^2)E(w^2)-(E(xw))^2\tag{5-4}$$
Likewise, from the definition of variance:
$$\begin{aligned} D(x)&=E\big((x-E(x))^2\big) \\ &=E\big(x^2-2xE(x)+(E(x))^2\big) \\ &=E(x^2)-2(E(x))^2+(E(x))^2 \\ &=E(x^2)-(E(x))^2 \end{aligned}\tag{5-5}$$
so we get
$$E(x^2)=D(x)+(E(x))^2\tag{5-6}$$
and similarly
$$E(w^2)=D(w)+(E(w))^2\tag{5-7}$$
Substituting (5-6) and (5-7) into (5-4) gives
$$\begin{aligned} D(xw)=&D(x)D(w)+(E(x))^2D(w)+ D(x)(E(w))^2+(E(x))^2(E(w))^2-(E(xw))^2 \\ =&D(x)D(w)+(E(x))^2D(w)+D(x)(E(w))^2 \end{aligned}\tag{5-8}$$
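Equation (5-8) is easy to sanity-check numerically. The sketch below (the distribution parameters are arbitrary) draws independent samples of $x$ and $w$ and compares the sample variance of $xw$ against $D(x)D(w)+E(x)^2D(w)+D(x)E(w)^2$:

```cpp
#include <cstdio>
#include <random>

int main() {
  std::mt19937 rng(0);
  std::normal_distribution<double> px(1.0, 2.0);   // E(x) = 1,    D(x) = 4
  std::normal_distribution<double> pw(-0.5, 0.3);  // E(w) = -0.5, D(w) = 0.09
  const int n = 1000000;

  double sum = 0.0, sumsq = 0.0;
  for (int i = 0; i < n; ++i) {
    double v = px(rng) * pw(rng);                  // product of independent draws
    sum += v;
    sumsq += v * v;
  }
  double mean = sum / n;
  double var  = sumsq / n - mean * mean;           // sample variance of x*w

  // D(x)D(w) + E(x)^2 D(w) + D(x) E(w)^2 = 0.36 + 0.09 + 1.0 = 1.45
  double predicted = 4.0 * 0.09 + 1.0 * 0.09 + 4.0 * 0.25;
  std::printf("sample D(xw) = %.4f, predicted = %.4f\n", var, predicted);
  return 0;
}
```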
This completes the derivation. Substituting it into (4), and assuming the initialized weights and the activations are drawn from zero-mean distributions that are identical within each layer (so $D(W^l)$ and $D(A^{l-1})$ below denote the common per-element variances), we get:
$$\begin{aligned} D(z_{k}^{l}) &=\sum_{j=1}^{N^{l-1}}D(w_{jk}^{l})\,D(a^{l-1}_j) \\ &=N^{l-1}D(W^{l})\,D(A^{l-1}) \end{aligned}\tag{6}$$
Summing over the $N^l$ units of layer $l$ and writing $D(Z^l)$ for the common per-unit variance:
$$\begin{aligned} N^{l}D(Z^{l})&= \sum_{k=1}^{N^{l}}D(z_{k}^{l})\\ &= \sum_{k=1}^{N^{l}}N^{l-1}D(W^{l})D(A^{l-1})\\ &=N^lN^{l-1}D(W^{l})D(A^{l-1}) \end{aligned}\tag{7}$$
which gives the variance of the weighted inputs of layer $l$:
$$D(Z^{l})=N^{l-1}D(W^{l})D(A^{l-1})\tag{8}$$
We make three assumptions about the activation function $\sigma$, often called the Glorot activation-function assumptions: $\sigma(0)=0$, $\sigma'(0)\approx 1$, and at initialization the weighted inputs fall in the region around zero where $\sigma$ is approximately linear (tanh satisfies all three). Under these assumptions, the activation function acts approximately as the identity at initialization, so the weighted input and the activation of each layer are nearly equal: $a^l \approx z^l$.
Then, unrolling the recursion down to the input (taking $A^1 = X$):
$$\begin{aligned} D(Z^{l})&=N^{l-1}D(W^{l})D(A^{l-1}) \\ &=N^{l-1}D(W^{l})D(Z^{l-1}) \\ &=N^{l-1}D(W^{l})\,N^{l-2}D(W^{l-1})\,D(Z^{l-2}) \\ &=D(X)\prod_{i=1}^{l-1}N^iD(W^{i+1}) \end{aligned}\tag{9}$$
We want the weighted-input variance $D(Z^l)$ to be the same in every layer. From (9) it suffices that each factor $N^{l-1}D(W^{l})$ equals $1$, i.e.:
$$D(W^{l}) = \frac{1}{N^{l-1}}\tag{10}$$
Here $N^{l-1}$ is the number of inputs (fan-in) of layer $l$.
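To see why this condition matters, here is a quick back-of-the-envelope check with made-up layer sizes. Take $N^i = 256$ in every layer; then (9) gives

$$D(W^i)=10^{-2}:\ D(Z^l)=D(X)\,\big(256\cdot 10^{-2}\big)^{l-1}=D(X)\cdot 2.56^{\,l-1},\qquad D(W^i)=10^{-3}:\ D(Z^l)=D(X)\cdot 0.256^{\,l-1},\qquad D(W^i)=\tfrac{1}{256}:\ D(Z^l)=D(X).$$

The first choice inflates the variance by roughly a factor of ten every two to three layers, the second shrinks it just as quickly, and only the fan-in rule (10) keeps it constant through the whole network.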
(Model diagram of the example network omitted.)
In the same spirit, take layer $1$ as an example.
Consider the gradient of the loss with respect to the weighted input of the first unit of layer $1$:
$$\begin{aligned} δ^1_1 &=\frac{\partial C}{\partial z^1_1}\\ &=\frac{\partial C}{\partial a^1_1}\cdot\frac{\partial a^1_1}{\partial z^1_1}\\ &=\Big(\frac {\partial C}{\partial z^2_1}\frac {\partial z^2_1}{\partial a^1_1} + \frac {\partial C}{\partial z^2_2}\frac {\partial z^2_2}{\partial a^1_1} + \frac {\partial C}{\partial z^2_3}\frac {\partial z^2_3}{\partial a^1_1}\Big)\,σ'(z^1_1)\\ &=\Big(\frac {\partial C}{\partial z^2_1}w^{2}_{11} + \frac {\partial C}{\partial z^2_2}w^{2}_{12} + \frac {\partial C}{\partial z^2_3}w^{2}_{13}\Big)\,σ'(z^1_1)\\ &=\big(δ^2_1w^{2}_{11} +δ^2_2w^{2}_{12} +δ^2_3w^{2}_{13}\big)\,σ'(z^1_1) \end{aligned}\tag{11}$$
More generally, the gradient for unit $j$ in layer $l$ is $δ^l_j = \big(\sum_{k=1}^{N^{l+1}} w^{l+1}_{jk}\,δ^{l+1}_k\big)\,σ'(z^l_j)$, so its variance is:
$$D(δ^l_j) =D\Big(\big(\sum_{k=1}^{N^{l+1}}w^{l+1}_{jk}δ^{l+1}_k\big)\,σ'(z^l_j)\Big)$$
By the activation-function assumptions of the previous section, the weighted inputs $z^l$ have zero mean and the derivative $σ'(z^l_j)$ is approximately $1$ at initialization, so:
$$\begin{aligned} D(δ^l_j)&=D\Big(\sum_{k=1}^{N^{l+1}}w^{l+1}_{jk}δ^{l+1}_k\Big)\\ &=\sum_{k=1}^{N^{l+1}}D\big(w^{l+1}_{jk}δ^{l+1}_k\big)\\ &=\sum_{k=1}^{N^{l+1}}D\big(w^{l+1}_{jk}\big)D\big(δ^{l+1}_k\big)\\ &=N^{l+1}D(W^{l+1})D(δ^{l+1}) \end{aligned}\tag{12}$$
Summing over the $N^l$ units of layer $l$:
$$\begin{aligned} N^{l}D(δ^{l})&=\sum^{N^l}_{j=1}D(δ^l_j)\\ &=\sum^{N^l}_{j=1}N^{l+1}D(W^{l+1})D(δ^{l+1})\\ &=N^lN^{l+1}D(W^{l+1})D(δ^{l+1}) \end{aligned}\tag{13}$$
which gives the variance of the gradient with respect to the weighted inputs of layer $l$:
$$\begin{aligned} D(δ^{l})&=N^{l+1}D(W^{l+1})D(δ^{l+1})\\ &=N^{l+1}D(W^{l+1})\,N^{l+2}D(W^{l+2})\,D(δ^{l+2})\\ &=D(δ^{M})\prod^{M}_{i=l+1}N^iD(W^i) \end{aligned}\tag{14}$$
where $δ^{M}$ is the gradient of the loss with respect to the weighted inputs of the last layer.
Therefore, to keep the gradient variance $D(δ^{l})$ the same for all layers during backpropagation, each factor in (14) must equal $1$, which requires:
$$D(W^l) = \frac{1}{N^l}\tag{15}$$
Here $N^l$ is the number of outputs (fan-out) of layer $l$.
In general a layer's fan-in and fan-out differ, so conditions (10) and (15) cannot both hold exactly. Glorot and Bengio therefore suggest a compromise that uses the average of the two counts to set each layer's weight variance:
$$D(W^l) = \frac{2}{N^l + N^{l-1}}\tag{16}$$
Given this target variance, we can find a uniform distribution that realizes it.
For a uniform distribution $W \sim U(a, b)$ the variance is $\frac{(b-a)^2}{12}$, so the zero-centered uniform distribution with variance (16) is:
$$W \sim U\left(-\frac{\sqrt{6}}{\sqrt{N^l + N^{l-1}}},\ \frac{\sqrt{6}}{\sqrt{N^l + N^{l-1}}}\right)\tag{17}$$
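A minimal sketch of this sampling rule outside of Caffe (layer sizes are made up; this mirrors the formula, not Caffe's code): draw weights from $U(-a, a)$ with $a=\sqrt{6/(N^{l-1}+N^l)}$ and check that their sample variance matches the target $2/(N^{l-1}+N^l)$:

```cpp
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

int main() {
  const int fan_in = 800, fan_out = 500;                 // made-up layer sizes
  const double a = std::sqrt(6.0 / (fan_in + fan_out));  // bound from (17)
  std::mt19937 rng(0);
  std::uniform_real_distribution<double> uni(-a, a);

  std::vector<double> w(fan_in * fan_out);
  double sum = 0.0, sumsq = 0.0;
  for (double& v : w) {
    v = uni(rng);
    sum += v;
    sumsq += v * v;
  }
  double mean = sum / w.size();
  double var  = sumsq / w.size() - mean * mean;          // sample variance

  std::printf("a = %.6f, sample var = %.6f, target = %.6f\n",
              a, var, 2.0 / (fan_in + fan_out));
  return 0;
}
```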
```cpp
/**
 * @brief Fills a Blob with values @f$ x \sim U(-a, +a) @f$ where @f$ a @f$ is
 *        set inversely proportional to number of incoming nodes, outgoing
 *        nodes, or their average.
 *
 * A Filler based on the paper [Bengio and Glorot 2010]: Understanding
 * the difficulty of training deep feedforward neural networks.
 *
 * It fills the incoming matrix by randomly sampling uniform data from [-scale,
 * scale] where scale = sqrt(3 / n) where n is the fan_in, fan_out, or their
 * average, depending on the variance_norm option. You should make sure the
 * input blob has shape (num, a, b, c) where a * b * c = fan_in and num * b * c
 * = fan_out. Note that this is currently not the case for inner product layers.
 *
 * TODO(dox): make notation in above comment consistent with rest & use LaTeX.
 */
template <typename Dtype>
class XavierFiller : public Filler<Dtype> {
 public:
  explicit XavierFiller(const FillerParameter& param)
      : Filler<Dtype>(param) {}
  virtual void Fill(Blob<Dtype>* blob) {
    /* count() is the total number of weights in the blob;
       a weight blob is laid out as [fan_out, fan_in(, ...)] */
    CHECK(blob->count());
    /* fan_in: number of inputs feeding each output unit */
    int fan_in = blob->count() / blob->shape(0);
    // Compatibility with ND blobs
    int fan_out = blob->num_axes() > 1 ?
        blob->count() / blob->shape(1) :
        blob->count();  /* for a 1-D blob, fall back to count() as fan_out */
    /* By default only the fan-in is used;
     * with FillerParameter_VarianceNorm_AVERAGE, the mean of fan-in and fan-out is used;
     * with FillerParameter_VarianceNorm_FAN_OUT, only the fan-out is used.
     */
    Dtype n = fan_in;  // default to fan_in
    if (this->filler_param_.variance_norm() ==
        FillerParameter_VarianceNorm_AVERAGE) {
      n = (fan_in + fan_out) / Dtype(2);
    } else if (this->filler_param_.variance_norm() ==
        FillerParameter_VarianceNorm_FAN_OUT) {
      n = fan_out;
    }
    Dtype scale = sqrt(Dtype(3) / n);
    /* draw blob->count() samples from U(-scale, scale) */
    caffe_rng_uniform<Dtype>(blob->count(), -scale, scale,
        blob->mutable_cpu_data());
    CHECK_EQ(this->filler_param_.sparse(), -1)
        << "Sparsity not supported by this Filler.";
  }
};
```
The details are explained in the code comments above; the file lives at /caffe/include/caffe/filler.hpp. Note that the default FAN_IN setting gives scale = sqrt(3 / fan_in), which corresponds to condition (10), while the AVERAGE setting gives scale = sqrt(6 / (fan_in + fan_out)), exactly the bound in (17).
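If you want the averaged form of (16)/(17) rather than the default fan-in rule, the filler can be configured per layer in the prototxt. A minimal sketch (the variance_norm field and its AVERAGE value come from FillerParameter in caffe.proto; double-check against your Caffe version):

```protobuf
weight_filler {
  type: "xavier"
  # selects n in scale = sqrt(3 / n): FAN_IN (default), FAN_OUT, or AVERAGE
  variance_norm: AVERAGE
}
```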
Link: Xavier Initialization.