Derivatives of the Softmax Function, Cross-Entropy, and KL Divergence

Reference: https://blog.csdn.net/qian99/article/details/78046329

Cross-entropy

Consider a classification network $f$ with output $z = f(x;\theta)$, $z = [z_0, z_1, \cdots, z_{C-1}]$, where $z$ are the logits, $C$ is the number of classes, and $y$ is the one-hot label of the input $x$. Softmax normalization turns the logits into probabilities:
$$p_i = \frac{\exp z_i}{\sum_j \exp z_j}$$
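As a quick sanity check, here is a minimal NumPy sketch of this softmax; the max-subtraction is a standard numerical-stability trick added here, not part of the formula itself:

```python
import numpy as np

def softmax(z):
    """p_i = exp(z_i) / sum_j exp(z_j); shifting by max(z) avoids overflow in exp()."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(p, p.sum())  # a valid probability vector: nonnegative, sums to 1
```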
The cross-entropy loss is:
$$\mathcal{L} = -\sum_i y_i \log p_i$$
The gradient of the loss with respect to each probability is:
$$\frac{\partial \mathcal{L}}{\partial p_i} = -\frac{y_i}{p_i}$$
Next, compute $\frac{\partial p_i}{\partial z_k}$ for $k = 0, 1, \ldots, C-1$:
(1) When $k = i$:
$$\frac{\partial p_i}{\partial z_i} = \frac{\partial}{\partial z_i}\left(\frac{\exp z_i}{\sum_j \exp z_j}\right) = \frac{\exp z_i \sum_j \exp z_j - (\exp z_i)^2}{\left(\sum_j \exp z_j\right)^2} = \left(\frac{\exp z_i}{\sum_j \exp z_j}\right)\left(1 - \frac{\exp z_i}{\sum_j \exp z_j}\right) = p_i(1 - p_i)$$

(2) When $k \neq i$:
$$\frac{\partial p_i}{\partial z_k} = \frac{\partial}{\partial z_k}\left(\frac{\exp z_i}{\sum_j \exp z_j}\right) = \frac{-\exp z_i \exp z_k}{\left(\sum_j \exp z_j\right)^2} = -p_i p_k$$
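Taken together, the two cases give the softmax Jacobian $\frac{\partial p_i}{\partial z_k} = p_i(\delta_{ik} - p_k)$, i.e. $J = \mathrm{diag}(p) - pp^\top$. A finite-difference sketch to verify this (the `softmax` helper is repeated so the snippet runs on its own):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # max-shift for numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
J_analytic = np.diag(p) - np.outer(p, p)  # J[i, k] = p_i * (delta_ik - p_k)

# Central differences: column k approximates dp / dz_k.
eps = 1e-6
J_numeric = np.stack(
    [(softmax(z + eps * np.eye(3)[k]) - softmax(z - eps * np.eye(3)[k])) / (2 * eps)
     for k in range(3)], axis=1)
print(np.allclose(J_analytic, J_numeric, atol=1e-8))  # True
```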
By the chain rule:
$$\begin{aligned}
\frac{\partial \mathcal{L}}{\partial z_k} &= \sum_j \frac{\partial \mathcal{L}}{\partial p_j}\frac{\partial p_j}{\partial z_k} = \sum_{j \neq k} \frac{\partial \mathcal{L}}{\partial p_j}\frac{\partial p_j}{\partial z_k} + \frac{\partial \mathcal{L}}{\partial p_k}\frac{\partial p_k}{\partial z_k} \\
&= \sum_{j \neq k} \left(-\frac{y_j}{p_j}\right)(-p_j p_k) + \left(-\frac{y_k}{p_k}\right) p_k (1 - p_k) \\
&= \sum_{j \neq k} y_j p_k - y_k + y_k p_k = p_k \sum_j y_j - y_k
\end{aligned}$$
Since $y$ is one-hot, $\sum_j y_j = 1$, i.e.,
$$\frac{\partial \mathcal{L}}{\partial z_k} = p_k - y_k$$
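This compact $p_k - y_k$ form is why frameworks typically fuse softmax and cross-entropy into a single op. A minimal sketch checking the derived gradient against central differences:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # max-shift for numerical stability
    return e / e.sum()

def cross_entropy(z, y):
    """L = -sum_i y_i * log(p_i) with p = softmax(z)."""
    return -np.sum(y * np.log(softmax(z)))

z = np.array([2.0, 1.0, 0.1])
y = np.array([0.0, 1.0, 0.0])   # one-hot label
grad_analytic = softmax(z) - y  # the p - y result derived above

eps = 1e-6
grad_numeric = np.array(
    [(cross_entropy(z + eps * np.eye(3)[k], y)
      - cross_entropy(z - eps * np.eye(3)[k], y)) / (2 * eps)
     for k in range(3)])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-8))  # True
```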

Relative entropy (KL divergence)

Let $p$ be the predicted probability distribution and $q$ the true distribution. The KL divergence is:
$$\mathcal{L} = \mathrm{KL}(q \,\|\, p) = \sum_k q_k \log \frac{q_k}{p_k}$$
The gradient with respect to the probability $p_k$ is:
$$\frac{\partial \mathcal{L}}{\partial p_k} = -\frac{q_k}{p_k}$$
The gradient with respect to the logit $z_k$ is:
$$\begin{aligned}
\frac{\partial \mathcal{L}}{\partial z_k} &= \sum_j \frac{\partial \mathcal{L}}{\partial p_j}\frac{\partial p_j}{\partial z_k} = \sum_{j \neq k} \frac{\partial \mathcal{L}}{\partial p_j}\frac{\partial p_j}{\partial z_k} + \frac{\partial \mathcal{L}}{\partial p_k}\frac{\partial p_k}{\partial z_k} \\
&= \sum_{j \neq k} \left(-\frac{q_j}{p_j}\right)(-p_j p_k) + \left(-\frac{q_k}{p_k}\right) p_k (1 - p_k) \\
&= \sum_{j \neq k} q_j p_k + q_k p_k - q_k = p_k \sum_j q_j - q_k
\end{aligned}$$

Since $q$ is a probability distribution, $\sum_j q_j = 1$, so $\frac{\partial \mathcal{L}}{\partial z_k} = p_k - q_k$: the same form as the cross-entropy gradient, with the soft target $q$ in place of the one-hot $y$.
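The same kind of numerical check for the KL case; the `q` below is an arbitrary soft target chosen for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # max-shift for numerical stability
    return e / e.sum()

def kl_div(z, q):
    """KL(q || p) = sum_k q_k * log(q_k / p_k) with p = softmax(z)."""
    return np.sum(q * np.log(q / softmax(z)))

z = np.array([2.0, 1.0, 0.1])
q = np.array([0.7, 0.2, 0.1])    # true distribution (sums to 1)
grad_analytic = softmax(z) - q   # p_k - q_k from the derivation above

eps = 1e-6
grad_numeric = np.array(
    [(kl_div(z + eps * np.eye(3)[k], q)
      - kl_div(z - eps * np.eye(3)[k], q)) / (2 * eps)
     for k in range(3)])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-8))  # True
```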
