Differentiating the Cross-Entropy Loss

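Cross entropy is a common loss function for classification problems. Earlier articles covered the derivation of binary cross entropy and what cross entropy is used for; this one works through the derivative of the cross-entropy loss.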
First, denote the outputs of the neurons in the model's last layer as $f_{0}, \dots, f_{C-1}$, where $C$ is the number of classes.
After softmax activation these outputs become $p_{0}, \dots, p_{C-1}$, where:
$$p_{i} = \frac{e^{f_{i}}}{\sum_{k=0}^{C-1} e^{f_{k}}}$$
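As a quick illustration of the softmax formula, here is a minimal pure-Python sketch. The function name `softmax` and the sample values in `f` are assumptions made for this example. Subtracting the maximum before exponentiating is a common numerical-stability trick that is not part of the derivation; it cancels between numerator and denominator, so the result is unchanged.

```python
import math

def softmax(f):
    """Map raw outputs f_0..f_{C-1} to probabilities p_0..p_{C-1}."""
    # Shifting by max(f) cancels in the ratio, so p is mathematically
    # unchanged, but exp() can no longer overflow for large inputs.
    m = max(f)
    exps = [math.exp(x - m) for x in f]
    total = sum(exps)
    return [e / total for e in exps]

f = [2.0, 1.0, 0.1]  # hypothetical last-layer outputs for C = 3 classes
p = softmax(f)
print(p)       # three positive values, largest for the largest f_i
print(sum(p))  # the probabilities sum to 1 up to rounding
```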
Denote the true class labels as $y_{0}, \dots, y_{C-1}$. The cross-entropy loss $L$ is then:
$$L = -\sum_{i=0}^{C-1} y_{i} \log p_{i}$$
Here $\log$ is shorthand. To keep the differentiation simple, we take its base to be $e$, that is, $\log = \ln$.
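The loss can be evaluated directly from its definition. The sketch below is illustrative only: `cross_entropy` is a hypothetical helper, and the label and probability vectors are made-up values. With a one-hot label, the sum collapses to the negative natural log of the probability assigned to the true class.

```python
import math

def cross_entropy(y, p):
    """L = -sum_i y_i * ln(p_i), using the natural logarithm."""
    # Terms with y_i == 0 contribute nothing, so they are skipped.
    return -sum(y_i * math.log(p_i) for y_i, p_i in zip(y, p) if y_i != 0)

y = [0.0, 1.0, 0.0]  # hypothetical one-hot label: the true class is 1
p = [0.2, 0.7, 0.1]  # hypothetical softmax output
loss = cross_entropy(y, p)
print(loss)          # equals -ln(0.7) for this label
```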
Now take the partial derivative of $L$ with respect to the output $f_{i}$ of the $i$-th neuron, $\frac{\partial L}{\partial f_{i}}$.
By the chain rule:
$$\frac{\partial L}{\partial f_{i}} = \sum_{j=0}^{C-1} \frac{\partial L_{j}}{\partial p_{j}}\frac{\partial p_{j}}{\partial f_{i}}$$
where $L_{j} = -y_{j}\log p_{j}$ is the $j$-th term of $L$.
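A note on the indices: the softmax definition uses the subscripts $i$ and $k$, and the cross-entropy definition uses $i$, but these two $i$'s are not the same index. The softmax denominator contains the output $f$ of every neuron, so the partial derivative of every activated $p$ with respect to any $f_{i}$ is nonzero, and $L$ in turn contains every $p$. To avoid the clash we introduce a new subscript $j$ for $p$, where $j$ runs over the $C$ values $0, \dots, C-1$.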
Differentiate each factor in turn. Since $\log$ is taken to be the natural logarithm $\ln$:

$$\frac{\partial L_{j}}{\partial p_{j}} = \frac{\partial (-y_{j}\log p_{j})}{\partial p_{j}} = -\frac{y_{j}}{p_{j}}$$

Next we need $\frac{\partial p_{j}}{\partial f_{i}}$. Every $p_{j}$ contains $f_{i}$ through the softmax denominator, so none of these derivatives is zero. The result differs between $j = i$ and $j \neq i$, however, so the two cases are handled separately:

  • First, when $j = i$, apply the quotient rule, using $(e^{f_{i}})' = e^{f_{i}}$ and $\left(\sum_{k=0}^{C-1} e^{f_{k}}\right)' = e^{f_{i}}$ (primes denote derivatives with respect to $f_{i}$):
    $$\frac{\partial p_{j}}{\partial f_{i}} = \frac{\partial p_{i}}{\partial f_{i}} = \frac{\partial \frac{e^{f_{i}}}{\sum_{k=0}^{C-1} e^{f_{k}}}}{\partial f_{i}}$$
    $$= \frac{(e^{f_{i}})' \sum_{k=0}^{C-1} e^{f_{k}} - e^{f_{i}} \left(\sum_{k=0}^{C-1} e^{f_{k}}\right)'}{\left(\sum_{k=0}^{C-1} e^{f_{k}}\right)^{2}}$$
    $$= \frac{e^{f_{i}} \sum_{k=0}^{C-1} e^{f_{k}} - (e^{f_{i}})^{2}}{\left(\sum_{k=0}^{C-1} e^{f_{k}}\right)^{2}} = \frac{e^{f_{i}}}{\sum_{k=0}^{C-1} e^{f_{k}}} - \left(\frac{e^{f_{i}}}{\sum_{k=0}^{C-1} e^{f_{k}}}\right)^{2}$$
    $$= p_{i} - (p_{i})^{2} = p_{i}(1 - p_{i})$$

  • Then, when $j \neq i$, the numerator $e^{f_{j}}$ does not depend on $f_{i}$, so $(e^{f_{j}})' = 0$, while $\left(\sum_{k=0}^{C-1} e^{f_{k}}\right)' = e^{f_{i}}$ as before:
    $$\frac{\partial p_{j}}{\partial f_{i}} = \frac{\partial \frac{e^{f_{j}}}{\sum_{k=0}^{C-1} e^{f_{k}}}}{\partial f_{i}}$$
    $$= \frac{(e^{f_{j}})' \sum_{k=0}^{C-1} e^{f_{k}} - e^{f_{j}} \left(\sum_{k=0}^{C-1} e^{f_{k}}\right)'}{\left(\sum_{k=0}^{C-1} e^{f_{k}}\right)^{2}}$$
    $$= \frac{-e^{f_{i}} e^{f_{j}}}{\left(\sum_{k=0}^{C-1} e^{f_{k}}\right)^{2}} = -\frac{e^{f_{i}}}{\sum_{k=0}^{C-1} e^{f_{k}}} \cdot \frac{e^{f_{j}}}{\sum_{k=0}^{C-1} e^{f_{k}}}$$
    $$= -p_{i}p_{j}$$
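The two cases can be checked numerically. The sketch below compares the closed-form softmax derivatives, $p_{i}(1-p_{i})$ on the diagonal and $-p_{i}p_{j}$ off it, against a central finite-difference estimate. The helper names, the sample input `f`, and the step size `eps` are all assumptions chosen for illustration.

```python
import math

def softmax(f):
    m = max(f)  # shift for numerical stability; does not change the result
    exps = [math.exp(x - m) for x in f]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(p):
    """J[j][i] = dp_j/df_i: p_i(1 - p_i) if j == i, else -p_i * p_j."""
    C = len(p)
    return [[p[i] * (1 - p[i]) if j == i else -p[i] * p[j]
             for i in range(C)] for j in range(C)]

def numeric_jacobian(f, eps=1e-6):
    """Central-difference estimate of dp_j/df_i."""
    C = len(f)
    J = [[0.0] * C for _ in range(C)]
    for i in range(C):
        up = list(f); up[i] += eps
        dn = list(f); dn[i] -= eps
        p_up, p_dn = softmax(up), softmax(dn)
        for j in range(C):
            J[j][i] = (p_up[j] - p_dn[j]) / (2 * eps)
    return J

f = [2.0, 1.0, 0.1]  # hypothetical last-layer outputs
J_analytic = softmax_jacobian(softmax(f))
J_numeric = numeric_jacobian(f)
max_err = max(abs(a - b)
              for row_a, row_n in zip(J_analytic, J_numeric)
              for a, b in zip(row_a, row_n))
print(max_err)  # should be tiny if the two formulas are right
```

Each column of the Jacobian sums to zero, which reflects the fact that the $p_{j}$ always sum to 1 no matter how $f_{i}$ changes.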

For the final partial derivative, add the two parts together, the single $j = i$ term and the sum over all $j \neq i$:
$$\frac{\partial L}{\partial f_{i}} = \frac{\partial L_{i}}{\partial p_{i}}\frac{\partial p_{i}}{\partial f_{i}} + \sum_{j \neq i} \frac{\partial L_{j}}{\partial p_{j}}\frac{\partial p_{j}}{\partial f_{i}}$$
$$= -\frac{y_{i}}{p_{i}}\, p_{i}(1 - p_{i}) + \sum_{j \neq i} (-p_{i}p_{j})\left(-\frac{y_{j}}{p_{j}}\right)$$
$$= -y_{i}(1 - p_{i}) + \sum_{j \neq i} p_{i}y_{j}$$
$$= y_{i}p_{i} - y_{i} + \sum_{j \neq i} p_{i}y_{j}$$

In the expression above, the sum over $j \neq i$ is missing exactly the $j = i$ term, and $y_{i}p_{i}$ supplies it, so the result can be rewritten as:
$$\frac{\partial L}{\partial f_{i}} = \sum_{j=0}^{C-1} p_{i}y_{j} - y_{i}$$
$$= p_{i}\sum_{j=0}^{C-1} y_{j} - y_{i}$$
The labels sum to one, $\sum_{j=0}^{C-1} y_{j} = 1$ (as with one-hot labels, for example), so:
$$\frac{\partial L}{\partial f_{i}} = p_{i}\sum_{j=0}^{C-1} y_{j} - y_{i} = p_{i} - y_{i}$$
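The final result $p_{i} - y_{i}$ is easy to sanity-check with a numerical gradient. This is a minimal sketch under assumed inputs: the helper names, the vectors `f` and `y`, and the step size `eps` are illustrative choices, not part of the derivation.

```python
import math

def softmax(f):
    m = max(f)  # shift for numerical stability; does not change the result
    exps = [math.exp(x - m) for x in f]
    total = sum(exps)
    return [e / total for e in exps]

def loss(f, y):
    """Cross entropy of softmax(f) against labels y, natural log."""
    p = softmax(f)
    return -sum(y_j * math.log(p_j) for y_j, p_j in zip(y, p) if y_j != 0)

def analytic_grad(f, y):
    """dL/df_i = p_i - y_i, valid when the labels sum to 1."""
    return [p_i - y_i for p_i, y_i in zip(softmax(f), y)]

def numeric_grad(f, y, eps=1e-6):
    """Central-difference estimate of dL/df_i."""
    grad = []
    for i in range(len(f)):
        up = list(f); up[i] += eps
        dn = list(f); dn[i] -= eps
        grad.append((loss(up, y) - loss(dn, y)) / (2 * eps))
    return grad

f = [2.0, 1.0, 0.1]  # hypothetical last-layer outputs
y = [0.0, 1.0, 0.0]  # hypothetical one-hot label
g_analytic = analytic_grad(f, y)
g_numeric = numeric_grad(f, y)
max_err = max(abs(a - b) for a, b in zip(g_analytic, g_numeric))
print(g_analytic)
print(max_err)  # should be tiny if p - y is the correct gradient
```

The gradient is negative only for the true class and positive for the others, so a gradient-descent step raises $f_{i}$ for the true class and lowers it elsewhere.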
