Multi-label softmax + cross-entropy loss function: detailed explanation and gradient derivation for backpropagation

Abstract

This article derives the gradient of softmax + cross-entropy for backpropagation.

Related

For the companion code, see the article:

Python and PyTorch side-by-side implementation of the multi-label softmax + cross-entropy loss and its backpropagation

For a detailed introduction to softmax, see:

The softmax function explained and its gradient derivation in backpropagation

For a detailed introduction to cross-entropy, see:

The cross-entropy loss function explained through worked examples

Series index:
https://blog.csdn.net/oBrightLamp/article/details/85067981

Main Text

In most tutorials, softmax and cross-entropy always appear together, and their gradients are also derived together.
The individual gradients of softmax and of cross-entropy have already been given in the two articles above.

1. Problem

Consider an input vector x. Normalizing it with the softmax function gives the vector s, the predicted probability distribution. Given the true probability distribution y, the cross-entropy function produces the error value (a scalar e). Find the gradient of e with respect to x.
$$
x = (x_1, x_2, x_3, \cdots, x_k)\\
s = softmax(x)\\
s_{i} = \frac{e^{x_{i}}}{\sum_{t=1}^{k} e^{x_{t}}}\\
e = crossEntropy(s, y) = -\sum_{i=1}^{k} y_i \log(s_i)
$$
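
As a quick sanity check of these definitions, here is a minimal NumPy sketch of the forward pass; the values of x and y below are made-up examples, and y is not required to sum to 1 in the multi-label setting.

```python
import numpy as np

def softmax(x):
    # Shift by the max for numerical stability; the softmax value is unchanged.
    exp_x = np.exp(x - np.max(x))
    return exp_x / exp_x.sum()

def cross_entropy(s, y):
    # e = -sum_i y_i * log(s_i)
    return -np.sum(y * np.log(s))

x = np.array([1.0, 2.0, 3.0, 4.0])   # example input scores
y = np.array([0.0, 0.3, 0.7, 0.0])   # example target distribution

s = softmax(x)
e = cross_entropy(s, y)
print(s, e)
```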

Given:
$$
\nabla e_{(s)} = \frac{\partial e}{\partial s}
= \left(\frac{\partial e}{\partial s_1}, \frac{\partial e}{\partial s_2}, \cdots, \frac{\partial e}{\partial s_k}\right)
= \left(-\frac{y_1}{s_1}, -\frac{y_2}{s_2}, \cdots, -\frac{y_k}{s_k}\right)
$$

$$
\nabla s_{(x)} = \frac{\partial s}{\partial x} =
\begin{pmatrix}
\partial s_1/\partial x_1 & \partial s_1/\partial x_2 & \cdots & \partial s_1/\partial x_k\\
\partial s_2/\partial x_1 & \partial s_2/\partial x_2 & \cdots & \partial s_2/\partial x_k\\
\vdots & \vdots & \ddots & \vdots\\
\partial s_k/\partial x_1 & \partial s_k/\partial x_2 & \cdots & \partial s_k/\partial x_k
\end{pmatrix}
=
\begin{pmatrix}
-s_1 s_1 + s_1 & -s_1 s_2 & \cdots & -s_1 s_k\\
-s_2 s_1 & -s_2 s_2 + s_2 & \cdots & -s_2 s_k\\
\vdots & \vdots & \ddots & \vdots\\
-s_k s_1 & -s_k s_2 & \cdots & -s_k s_k + s_k
\end{pmatrix}
$$
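
Continuing the example, these two known gradients can be written as a couple of small helpers; the compact form diag(s) − s sᵀ is just the matrix on the right written element-wise.

```python
def grad_e_wrt_s(s, y):
    # d e / d s_i = -y_i / s_i
    return -y / s

def softmax_jacobian(s):
    # d s_i / d x_j = -s_i * s_j + s_i * [i == j], i.e. diag(s) - s s^T
    return np.diag(s) - np.outer(s, s)

de_ds = grad_e_wrt_s(s, y)     # shape (k,)
ds_dx = softmax_jacobian(s)    # shape (k, k)
```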

2. Solution

$$
\frac{\partial e}{\partial x_i}
= \frac{\partial e}{\partial s_1}\frac{\partial s_1}{\partial x_i}
+ \frac{\partial e}{\partial s_2}\frac{\partial s_2}{\partial x_i}
+ \frac{\partial e}{\partial s_3}\frac{\partial s_3}{\partial x_i}
+ \cdots
+ \frac{\partial e}{\partial s_k}\frac{\partial s_k}{\partial x_i}
$$
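
Component by component, this chain rule is just a double loop over the intermediate index; a sketch continuing the example above (the explicit loop is only for illustration):

```python
k = len(x)
de_dx_loop = np.zeros(k)
for i in range(k):
    for j in range(k):
        # Sum the contributions of every path e -> s_j -> x_i.
        de_dx_loop[i] += de_ds[j] * ds_dx[j, i]
```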

Expanding $\partial e/\partial x_i$ over all components $i$ gives the gradient vector of $e$ with respect to $x$:
$$
\nabla e_{(x)} = \left(\frac{\partial e}{\partial s_1}, \frac{\partial e}{\partial s_2}, \frac{\partial e}{\partial s_3}, \cdots, \frac{\partial e}{\partial s_k}\right)
\begin{pmatrix}
\partial s_1/\partial x_1 & \partial s_1/\partial x_2 & \cdots & \partial s_1/\partial x_k\\
\partial s_2/\partial x_1 & \partial s_2/\partial x_2 & \cdots & \partial s_2/\partial x_k\\
\vdots & \vdots & \ddots & \vdots\\
\partial s_k/\partial x_1 & \partial s_k/\partial x_2 & \cdots & \partial s_k/\partial x_k
\end{pmatrix}
$$

$$
\nabla e_{(x)} = \nabla e_{(s)}\, \nabla s_{(x)}
$$
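
In vectorized form this is a single row-vector times Jacobian product, which matches the explicit loop above:

```python
# Row vector (k,) times Jacobian (k, k) gives the gradient of e w.r.t. x.
de_dx_vjp = de_ds @ ds_dx
assert np.allclose(de_dx_vjp, de_dx_loop)
```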

Since:
$$
\nabla e_{(s)} = \left(-\frac{y_1}{s_1}, -\frac{y_2}{s_2}, \cdots, -\frac{y_k}{s_k}\right)
$$

$$
\nabla s_{(x)} =
\begin{pmatrix}
-s_1 s_1 + s_1 & -s_1 s_2 & \cdots & -s_1 s_k\\
-s_2 s_1 & -s_2 s_2 + s_2 & \cdots & -s_2 s_k\\
\vdots & \vdots & \ddots & \vdots\\
-s_k s_1 & -s_k s_2 & \cdots & -s_k s_k + s_k
\end{pmatrix}
$$

We obtain:
$$
\nabla e_{(x)} = \left(s_1\sum_{t=1}^{k} y_t - y_1,\; s_2\sum_{t=1}^{k} y_t - y_2,\; \cdots,\; s_k\sum_{t=1}^{k} y_t - y_k\right)
$$

$$
\frac{\partial e}{\partial x_i} = s_i\sum_{t=1}^{k} y_t - y_i
$$
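
The closed form can be checked numerically against the vector-Jacobian product from the previous step:

```python
# Closed form: d e / d x_i = s_i * sum_t y_t - y_i
de_dx_closed = s * y.sum() - y
assert np.allclose(de_dx_closed, de_dx_vjp)
print(de_dx_closed)
```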

Conclusion:
Using softmax and cross-entropy together greatly reduces the computation needed for the gradient: instead of materializing the k×k softmax Jacobian, the gradient collapses to the closed form $s_i\sum_{t=1}^{k} y_t - y_i$. In particular, when the target distribution y sums to 1, this is simply $s_i - y_i$.
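
As a final hedge, here is a minimal PyTorch autograd comparison (the companion article referenced above treats this in more detail); the tensors reuse the same example values:

```python
import torch

xt = torch.tensor([1.0, 2.0, 3.0, 4.0], requires_grad=True)
yt = torch.tensor([0.0, 0.3, 0.7, 0.0])

st = torch.softmax(xt, dim=0)
et = -(yt * torch.log(st)).sum()
et.backward()

# xt.grad should match the closed form s * y.sum() - y derived above.
print(xt.grad)
```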
