Let $n$ be the number of features and $K$ the number of target classes. Then, given
an $n\times 1$ input sample vector
$$x=(x_1,\cdots,x_i,\cdots,x_n)^T$$
a $K\times n$ weight matrix $W$ (each row corresponds to one class),
and a $K\times 1$ bias vector $b$,
let $z=Wx+b$.
Then, writing $W_j$ for the $j$-th row of $W$,
$$z_j=W_j x+b_j=\sum_{i=1}^{n} W_{j,i}x_i + b_j$$
Let $y$ be the supervised label and $\hat{y}$ the prediction; both are $K\times 1$ vectors. $\hat{y}_j$ is the probability that sample $x$ is assigned to class $j$:
$$\hat{y}_j=\mathrm{softmax}(z)_j=\frac{e^{z_j}}{\sum_{k=1}^{K}e^{z_k}}$$
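The forward pass can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function names `affine` and `softmax` and all sample values are assumptions introduced here. The maximum logit is subtracted before exponentiating, a common trick that avoids overflow and leaves the softmax result unchanged.

```python
import math

def affine(W, x, b):
    """Compute z = W x + b for a K x n matrix W (list of rows), an n-vector x, a K-vector b."""
    return [sum(w_ji * x_i for w_ji, x_i in zip(row, x)) + b_j
            for row, b_j in zip(W, b)]

def softmax(z):
    """Return softmax(z); subtracting max(z) avoids overflow without changing the result."""
    m = max(z)
    exps = [math.exp(z_k - m) for z_k in z]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical example with n = 2 features and K = 3 classes.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x = [1.0, 2.0]
b = [0.0, 0.0, -1.0]

z = affine(W, x, b)    # [1.0, 2.0, 2.0]
y_hat = softmax(z)     # three probabilities that sum to 1
print(z, y_hat)
```

Because $z_2=z_3$ in this example, the last two classes receive the same probability, and the first class receives a smaller one.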
The multi-class cross-entropy loss is:
$$L_{\text{cross-entropy}}(\hat{y},y)=-\sum_{j=1}^{K} y_j\log(\hat{y}_j)$$
Assume exactly one class $t$ is correct, so that $y$ is a one-hot vector with $y_t=1$ and $y_j=0$ for $j\neq t$. The loss then reduces to:
$$L_{\text{cross-entropy}}(\hat{y},y)=-\log(\hat{y}_t)$$
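As a quick check that the general sum and the one-hot shortcut agree, here is a small sketch. The helper names `cross_entropy` and `one_hot` and the probabilities are hypothetical, and the class index is 0-based as is usual in Python.

```python
import math

def cross_entropy(y_hat, y):
    """General form: -sum_j y_j * log(y_hat_j); terms with y_j == 0 are skipped."""
    return -sum(y_j * math.log(p_j) for y_j, p_j in zip(y, y_hat) if y_j != 0)

def one_hot(t, K):
    """One-hot vector of length K with a 1 at position t (0-based here)."""
    return [1.0 if j == t else 0.0 for j in range(K)]

y_hat = [0.2, 0.7, 0.1]   # hypothetical predicted probabilities
t = 1                     # hypothetical correct class (0-based index)
y = one_hot(t, len(y_hat))

loss_general = cross_entropy(y_hat, y)   # full sum over all classes
loss_one_hot = -math.log(y_hat[t])       # shortcut: only the correct class matters
print(loss_general, loss_one_hot)
```

Both values equal $-\log(0.7)$, since every term with $y_j=0$ drops out of the sum.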
Differentiating the loss with respect to $\hat{y}_t$:
$$\dfrac{\partial L}{\partial \hat{y}_t}=-\dfrac{1}{\hat{y}_t}$$
For the softmax, differentiating $\hat{y}_t$ with respect to its own input $z_t$ gives:
$$\begin{aligned} \dfrac{\partial \hat{y}_t}{\partial z_t}&=\dfrac{\partial \left(\frac{e^{z_t}}{\sum_k e^{z_k}}\right)}{\partial z_t} \\ &=\dfrac{e^{z_t}\sum_k e^{z_k}-e^{z_t}e^{z_t}}{\left(\sum_k e^{z_k}\right)^2} \\ &=\hat{y}_t(1-\hat{y}_t) \end{aligned}$$
When $j\ne t$,
$$\begin{aligned} \dfrac{\partial \hat{y}_t}{\partial z_j}&=\dfrac{\partial \left(\frac{e^{z_t}}{\sum_k e^{z_k}}\right)}{\partial z_j} \\ &=e^{z_t}\dfrac{\partial \left(\frac{1}{\sum_k e^{z_k}}\right)}{\partial z_j} \\ &=e^{z_t}\dfrac{-e^{z_j}}{\left(\sum_k e^{z_k}\right)^2} \\ &=-\hat{y}_t \hat{y}_j \end{aligned}$$
Then, by the chain rule,
$$\begin{aligned} \dfrac{\partial L}{\partial z_t}&=\dfrac{\partial L}{\partial \hat{y}_t}\dfrac{\partial \hat{y}_t}{\partial z_t} \\ &=-\dfrac{1}{\hat{y}_t}\hat{y}_t(1-\hat{y}_t) \\ &=\hat{y}_t-1 \end{aligned}$$
When $j\ne t$,
$$\begin{aligned} \dfrac{\partial L}{\partial z_j}&=\dfrac{\partial L}{\partial \hat{y}_t}\dfrac{\partial \hat{y}_t}{\partial z_j} \\ &=-\dfrac{1}{\hat{y}_t}(-\hat{y}_t \hat{y}_j) \\ &=\hat{y}_j \end{aligned}$$
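The two cases say that the gradient of the loss with respect to $z$ is $\hat{y}$ with 1 subtracted at the correct class. One way to gain confidence in that result is a finite-difference check, sketched below. The names `analytic_grad` and `numeric_grad`, the logits, and the step size `eps` are assumptions made for this illustration; the class index is 0-based.

```python
import math

def softmax(z):
    m = max(z)                               # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def loss(z, t):
    """One-hot cross-entropy: -log(softmax(z)[t])."""
    return -math.log(softmax(z)[t])

def analytic_grad(z, t):
    """dL/dz_j = y_hat_j - 1{j == t}, as derived in the text."""
    y_hat = softmax(z)
    return [p - (1.0 if j == t else 0.0) for j, p in enumerate(y_hat)]

def numeric_grad(z, t, eps=1e-6):
    """Central finite differences, one coordinate of z at a time."""
    grad = []
    for j in range(len(z)):
        z_plus = list(z)
        z_plus[j] += eps
        z_minus = list(z)
        z_minus[j] -= eps
        grad.append((loss(z_plus, t) - loss(z_minus, t)) / (2 * eps))
    return grad

z = [0.5, -1.2, 2.0]   # hypothetical logits
t = 0                  # hypothetical correct class (0-based)
ga = analytic_grad(z, t)
gn = numeric_grad(z, t)
print(ga)
print(gn)
```

The two gradients should agree to several decimal places. The analytic gradient also sums to zero, because the entries of $\hat{y}$ sum to 1 and exactly one 1 is subtracted.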
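Consider a 3-layer neural network with one hidden layer.
Suppose the output is $\hat{y}=(0.1, 0.2, 0.3)$ and the correct class for input $x$ is $t=1$ (counting classes from 0, so the second entry). The gradient flowing back toward the hidden layer, that is $\partial L/\partial z$, is then $(0.1, 0.2-1, 0.3)=(0.1, -0.8, 0.3)$. These values are purely illustrative; an actual softmax output sums to 1.
Backpropagation then continues from here: compute the gradient with respect to the layer input $x$, and update the hidden-layer weights with stochastic gradient descent.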
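The step from $\partial L/\partial z$ to the gradient with respect to the layer input follows from $z=Wx+b$, which gives $\partial L/\partial x_i=\sum_j \frac{\partial L}{\partial z_j}W_{j,i}$. The sketch below reuses the numbers from the example; the weight matrix `W` is a hypothetical choice, since the text does not specify one, and `delta` is simply a name chosen here for $\partial L/\partial z$.

```python
y_hat = [0.1, 0.2, 0.3]   # output values from the example in the text
t = 1                     # correct class, 0-based

# dL/dz: subtract 1 at the correct class.
delta = [p - (1.0 if j == t else 0.0) for j, p in enumerate(y_hat)]

# Hypothetical 3 x 2 output-layer weights (K = 3 classes, n = 2 inputs).
W = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]

# dL/dx_i = sum_j delta_j * W[j][i], i.e. the product W^T delta.
grad_x = [sum(W[j][i] * delta[j] for j in range(len(W)))
          for i in range(len(W[0]))]
print(delta, grad_x)
```

With this choice of `W`, `delta` is approximately `(0.1, -0.8, 0.3)` and `grad_x` is approximately `(0.4, -0.5)`, up to floating-point rounding.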
If we differentiate directly with respect to $W$:
$$\begin{aligned} \dfrac{\partial L}{\partial W_{t,i}}&=\dfrac{\partial L}{\partial z_t}\dfrac{\partial z_t}{\partial W_{t,i}} \\ &=(\hat{y}_t-1)x_i \end{aligned}$$
When $j\ne t$:
$$\begin{aligned} \dfrac{\partial L}{\partial W_{j,i}}&=\dfrac{\partial L}{\partial z_j}\dfrac{\partial z_j}{\partial W_{j,i}} \\ &=\hat{y}_j x_i \end{aligned}$$
Combining the two cases:
$$\dfrac{\partial L}{\partial W_{j,i}}=(\hat{y}_j-1\{j=t\})x_i$$
where
$$1\{j=t\}=\begin{cases} 1, &j=t \\ 0,&j\ne t \end{cases}$$
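The unified formula amounts to an outer product between $\hat{y}-y$ and $x$. Below is a minimal sketch of that gradient followed by a single gradient-descent step. The names `forward`, `grad_W`, and `sgd_step`, the learning rate, and the sample data are assumptions for illustration; the bias is held fixed to keep the example short, and the class index is 0-based.

```python
import math

def softmax(z):
    m = max(z)                               # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def forward(W, x, b):
    """y_hat = softmax(W x + b)."""
    z = [sum(w * xi for w, xi in zip(row, x)) + bj for row, bj in zip(W, b)]
    return softmax(z)

def grad_W(W, x, b, t):
    """dL/dW[j][i] = (y_hat[j] - 1{j == t}) * x[i]."""
    y_hat = forward(W, x, b)
    return [[(y_hat[j] - (1.0 if j == t else 0.0)) * xi for xi in x]
            for j in range(len(W))]

def sgd_step(W, dW, lr):
    """Plain gradient-descent update: W <- W - lr * dW."""
    return [[w - lr * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(W, dW)]

# Hypothetical data: K = 3 classes, n = 2 features, correct class t = 2.
W = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
x = [1.0, 2.0]
b = [0.0, 0.0, 0.0]
t = 2

loss_before = -math.log(forward(W, x, b)[t])
dW = grad_W(W, x, b, t)
W_new = sgd_step(W, dW, lr=0.1)
loss_after = -math.log(forward(W_new, x, b)[t])
print(loss_before, loss_after)
```

Starting from all-zero weights, every class has probability $1/3$, so the initial loss is $\log 3$. After one step the row of $W$ for the correct class moves along $x$, the other rows move against it, and the loss on this sample decreases.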