Softmax Cross Entropy Gradient Derivation

  • Gradient of Softmax
  • Gradient of Softmax Cross Entropy
  • References

Gradient of Softmax

Softmax is defined as:
$$p_i = \frac{e^{a_i}}{\sum_{k=1}^N e^{a_k}}, \quad i = 1, \dots, N$$
The numerically stable form of Softmax:
$$\frac{e^{a_i}}{\sum_{k=1}^N e^{a_k}} = \frac{C e^{a_i}}{C \sum_{k=1}^N e^{a_k}} = \frac{e^{a_i + \log C}}{\sum_{k=1}^N e^{a_k + \log C}}, \quad i = 1, \dots, N$$
where $\log C = -\max(\bm{a})$.
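As a quick illustration, here is a minimal NumPy sketch of the stable form (the function name and the example logits are placeholders, not from the original post):

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax: shift the logits by -max(a) before exponentiating."""
    shifted = a - np.max(a)          # corresponds to adding log C = -max(a)
    exp_a = np.exp(shifted)
    return exp_a / np.sum(exp_a)

# With large logits a naive exp() would overflow, but the shifted version is fine.
print(softmax(np.array([1000.0, 1001.0, 1002.0])))  # ~[0.090, 0.245, 0.665]
```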

For the vector-valued Softmax function, the partial derivative of the $i$-th output with respect to the $j$-th input is:
$$\frac{\partial p_i}{\partial a_j} = \frac{\partial \frac{e^{a_i}}{\sum_{k=1}^N e^{a_k}}}{\partial a_j}$$
By the quotient rule:
$$f(x) = \frac{g(x)}{h(x)}, \quad f'(x) = \frac{g'(x)h(x) - h'(x)g(x)}{(h(x))^2}$$
Here, $g_i = e^{a_i}$ and $h_i = \sum_{k=1}^N e^{a_k}$.
Therefore:
$$\frac{\partial g_i}{\partial a_j} = \frac{\partial e^{a_i}}{\partial a_j} = \begin{cases} 0, & i \neq j \\ e^{a_i}, & i = j \end{cases}$$
$$\frac{\partial h_i}{\partial a_j} = \frac{\partial \sum_{k=1}^N e^{a_k}}{\partial a_j} = e^{a_j}$$
So, when $i \neq j$:
$$\begin{aligned} \frac{\partial p_i}{\partial a_j} &= \frac{0 \cdot \sum_{k=1}^N e^{a_k} - e^{a_j} e^{a_i}}{\left(\sum_{k=1}^N e^{a_k}\right)^2} \\ &= -\frac{e^{a_j}}{\sum_{k=1}^N e^{a_k}} \times \frac{e^{a_i}}{\sum_{k=1}^N e^{a_k}} \\ &= -p_j p_i \end{aligned}$$
And when $i = j$:
$$\begin{aligned} \frac{\partial p_i}{\partial a_j} &= \frac{e^{a_i} \sum_{k=1}^N e^{a_k} - e^{a_j} e^{a_i}}{\left(\sum_{k=1}^N e^{a_k}\right)^2} \\ &= \frac{e^{a_i}}{\sum_{k=1}^N e^{a_k}} \times \frac{\sum_{k=1}^N e^{a_k} - e^{a_j}}{\sum_{k=1}^N e^{a_k}} \\ &= p_i (1 - p_j) \end{aligned}$$
In summary:
$$\frac{\partial p_i}{\partial a_j} = \begin{cases} -p_j p_i, & i \neq j \\ p_i (1 - p_j), & i = j \end{cases}$$
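The two cases can be written compactly as $\frac{\partial p_i}{\partial a_j} = p_i(\delta_{ij} - p_j)$, i.e. the Jacobian matrix $\operatorname{diag}(p) - p p^\top$. Below is a minimal NumPy sketch that checks this Jacobian against central finite differences (the test logits are arbitrary, chosen only for illustration):

```python
import numpy as np

def softmax(a):
    exp_a = np.exp(a - np.max(a))
    return exp_a / np.sum(exp_a)

def softmax_jacobian(a):
    """Jacobian J[i, j] = dp_i/da_j = p_i * (delta_ij - p_j) = diag(p) - p p^T."""
    p = softmax(a)
    return np.diag(p) - np.outer(p, p)

# Compare against central finite differences (arbitrary test logits).
a = np.array([0.3, -1.2, 2.0, 0.5])
J = softmax_jacobian(a)
eps = 1e-6
J_num = np.zeros_like(J)
for j in range(len(a)):
    d = np.zeros_like(a)
    d[j] = eps
    J_num[:, j] = (softmax(a + d) - softmax(a - d)) / (2 * eps)
print(np.max(np.abs(J - J_num)))  # expect something around 1e-10
```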

Gradient of Softmax Cross Entropy

Cross entropy is defined as:
$$H(p, q) = -\sum_x p(x) \log q(x)$$
Here, with a one-hot label vector $y$ and Softmax outputs $p$, it can be rewritten as:
$$H(y, p) = -\sum_{i=1}^N y_i \log(p_i) = -\log(p_k)$$
where $k$ is the index of the true class, i.e. $y_k = 1$ and all other entries of $y$ are 0.
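A tiny sketch of this collapse, using made-up probabilities and a one-hot target whose true class is $k = 2$ (0-indexed):

```python
import numpy as np

p = np.array([0.1, 0.2, 0.6, 0.1])   # softmax outputs (made up)
y = np.array([0.0, 0.0, 1.0, 0.0])   # one-hot label, true class k = 2
full_sum = -np.sum(y * np.log(p))    # -sum_i y_i * log(p_i)
single_term = -np.log(p[2])          # -log(p_k)
print(np.isclose(full_sum, single_term))  # True
```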
The partial derivative of the loss $L = H(y, p)$ with respect to the $j$-th logit $a_j$ is:
$$\begin{aligned} \frac{\partial L}{\partial a_j} &= -\frac{\partial \sum_{i=1}^N y_i \log(p_i)}{\partial a_j} \\ &= -\sum_{i=1}^N y_i \frac{\partial \log p_i}{\partial p_i} \frac{\partial p_i}{\partial a_j} \\ &= -\sum_{i=1}^N y_i \frac{1}{p_i} \frac{\partial p_i}{\partial a_j} \end{aligned}$$
Now, substitute the Softmax gradient derived above for $\frac{\partial p_i}{\partial a_j}$:
$$\begin{aligned} \frac{\partial L}{\partial a_j} &= -\sum_{i=1}^N y_i \frac{1}{p_i} \frac{\partial p_i}{\partial a_j} \\ &= -y_j \frac{1}{p_j} \left( p_j (1 - p_j) \right) - \sum_{i \neq j}^N y_i \frac{1}{p_i} (-p_j p_i) \\ &= -y_j (1 - p_j) + \sum_{i \neq j}^N y_i p_j \\ &= y_j p_j + \sum_{i \neq j}^N y_i p_j - y_j \\ &= \sum_{i=1}^N y_i p_j - y_j \\ &= p_j - y_j \quad \left(\text{since } \sum_{i=1}^N y_i = 1\right) \end{aligned}$$
This completes the derivation: the gradient of the Softmax cross-entropy loss with respect to each logit is simply $p_j - y_j$.
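As a sanity check, the closed-form gradient $p_j - y_j$ can be compared against a numerical gradient of the composed softmax + cross-entropy loss. A minimal sketch, with arbitrary test logits and target:

```python
import numpy as np

def softmax(a):
    exp_a = np.exp(a - np.max(a))
    return exp_a / np.sum(exp_a)

def cross_entropy_loss(a, y):
    """Cross entropy of softmax(a) against a one-hot target y."""
    return -np.sum(y * np.log(softmax(a)))

a = np.array([0.5, -0.3, 1.7])        # logits (arbitrary)
y = np.array([0.0, 1.0, 0.0])         # one-hot target (arbitrary)
grad_analytic = softmax(a) - y        # the p - y result derived above

eps = 1e-6
grad_numeric = np.zeros_like(a)
for j in range(len(a)):
    d = np.zeros_like(a)
    d[j] = eps
    grad_numeric[j] = (cross_entropy_loss(a + d, y) - cross_entropy_loss(a - d, y)) / (2 * eps)

print(np.max(np.abs(grad_analytic - grad_numeric)))  # expect something around 1e-10
```

The simplicity of $p_j - y_j$ is also why deep learning frameworks typically offer a fused softmax + cross-entropy operation: composing the two analytically avoids materializing the full Softmax Jacobian and is more numerically stable than computing $\log(\text{softmax}(a))$ in two steps.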

References

[1] softmax-crossentropy
[2] the-softmax-function-and-its-derivative
