In multi-class classification problems in machine learning, the dataset generally takes the following form. The features and label of the $i$-th sample are written as
$$x^i=(x_1^i,x_2^i,x_3^i,\ldots,x_d^i)^T \in \mathbb{R}^d,\qquad y^i=(y_1^i,y_2^i,y_3^i,\ldots,y_m^i)^T \in \mathbb{R}^m$$
where $d$ is the dimension of the input features and $m$ is the number of output classes, with the labels one-hot encoded. The complete dataset can then be written as
$$X=[x^1,x^2,\ldots,x^n]=\begin{bmatrix} x^1_1 & x^2_1 & \ldots & x^n_1 \\ x^1_2 & x^2_2 & \ldots & x^n_2 \\ \vdots & \vdots & \ddots & \vdots \\ x^1_d & x^2_d & \ldots & x^n_d \end{bmatrix} \in \mathbb{R}^{d\times n},\qquad Y=[y^1,y^2,\ldots,y^n] \in \mathbb{R}^{m\times n}$$
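As a concrete sketch of this layout (NumPy, with hypothetical sizes $n=4$, $d=3$, $m=2$), samples are stacked as columns of $X$ and labels are one-hot encoded as columns of $Y$:

```python
import numpy as np

# Hypothetical toy dataset: n = 4 samples, d = 3 features, m = 2 classes.
n, d, m = 4, 3, 2
rng = np.random.default_rng(0)

# Stack the feature vectors x^i as columns, giving X with shape (d, n).
X = rng.standard_normal((d, n))

# Integer class labels for the n samples.
labels = np.array([0, 1, 1, 0])

# One-hot encode into Y with shape (m, n): column i is y^i.
Y = np.eye(m)[labels].T

print(X.shape)   # (3, 4)
print(Y.shape)   # (2, 4)
print(Y[:, 1])   # one-hot column for the second sample: [0. 1.]
```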
For a single sample,
$$o = Wx+b = \begin{bmatrix} w_{11} & w_{12} & \ldots & w_{1d} \\ w_{21} & w_{22} & \ldots & w_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ w_{m1} & w_{m2} & \ldots & w_{md} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix} + \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix} = \begin{bmatrix} w_{11}x_1+ w_{12}x_2 + \ldots + w_{1d}x_d + b_1 \\ w_{21}x_1 + w_{22}x_2 + \ldots + w_{2d}x_d + b_2 \\ \vdots \\ w_{m1}x_1 + w_{m2}x_2 + \ldots + w_{md}x_d + b_m \end{bmatrix}$$
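A minimal check (hypothetical sizes, random values) that the matrix form $o = Wx + b$ and the element-wise expansion agree:

```python
import numpy as np

d, m = 3, 2  # hypothetical sizes: 3 input features, 2 classes
rng = np.random.default_rng(1)

W = rng.standard_normal((m, d))  # weight matrix, shape (m, d)
b = rng.standard_normal(m)       # bias vector, shape (m,)
x = rng.standard_normal(d)       # a single sample, shape (d,)

o = W @ x + b  # affine output, shape (m,)

# Each o_j equals the dot product of row j of W with x, plus b_j.
for j in range(m):
    assert np.isclose(o[j], W[j] @ x + b[j])
```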
The $\mathrm{softmax}$ function converts the outputs into class probabilities, enabling multi-class classification; it is defined as follows. Given
$$o=\begin{bmatrix} o_1 \\ o_2 \\ \vdots \\ o_m \end{bmatrix}$$
$$\hat{y}=\begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_m \end{bmatrix} = \mathrm{softmax}(o)=\begin{bmatrix} \frac{e^{o_1}}{\sum_{i=1}^{m}e^{o_i}} \\ \frac{e^{o_2}}{\sum_{i=1}^{m}e^{o_i}} \\ \vdots \\ \frac{e^{o_m}}{\sum_{i=1}^{m}e^{o_i}} \end{bmatrix}$$
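A small softmax sketch. Subtracting $\max(o)$ before exponentiating is a standard numerical-stability trick (an implementation detail, not part of the formula): it avoids overflow without changing the result, since softmax is invariant to shifting all logits by a constant.

```python
import numpy as np

def softmax(o):
    """softmax(o)_j = exp(o_j) / sum_i exp(o_i), computed stably."""
    z = np.exp(o - np.max(o))  # shift by max(o) to avoid overflow
    return z / z.sum()

o = np.array([1.0, 2.0, 3.0])
y_hat = softmax(o)
print(y_hat)        # entries in (0, 1), largest where o is largest
print(y_hat.sum())  # sums to 1: a probability distribution
```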
For multi-class classification, the negative log-likelihood is minimized as the loss function, expressed as
$$L=-\log P(Y|X)= \sum_{i=1}^{n}-\log P(y^i|x^i)=\sum_{i=1}^{n}l(y^i,\hat{y}^i)$$
Here $l(y,\hat{y})$ is defined for a single sample, specifically
$$l(y,\hat{y})=-\sum_{j=1}^{m}y_j\log\hat{y}_j$$
where $y_j$ is the label value, $\hat{y}_j$ is the predicted value, and $m$ is the length of the one-hot vector, i.e. the number of classes.
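A per-sample cross-entropy sketch. The small `eps` guard against $\log 0$ is an implementation detail, not part of the formula; with a one-hot $y$, the sum reduces to $-\log$ of the predicted probability of the true class.

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    """l(y, y_hat) = -sum_j y_j * log(y_hat_j); eps guards against log(0)."""
    return -np.sum(y * np.log(y_hat + eps))

# Hypothetical example: true class is 1, predicted with probability 0.7.
y = np.array([0.0, 1.0, 0.0])
y_hat = np.array([0.2, 0.7, 0.1])
print(cross_entropy(y, y_hat))  # -log(0.7) ≈ 0.3567
```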
Computing the derivatives with respect to $w_{jk}$ and $b_j$ requires the chain rule:
$$\frac{\partial l}{\partial w_{jk}} = \frac{\partial l}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial o_j} \frac{\partial o_j}{\partial w_{jk}},\qquad \frac{\partial l}{\partial b_j} = \frac{\partial l}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial o_j} \frac{\partial o_j}{\partial b_j}$$
Since
$$l(y,\hat{y})=-\sum_{j=1}^{m}y_j\log\hat{y}_j,\qquad \hat{y}_j=\frac{e^{o_j}}{\sum_{i=1}^{m}e^{o_i}}$$
we can express $l$ in terms of $o_j$:
$$l(y,\hat{y})=-\sum_{j=1}^{m}y_j\log\frac{e^{o_j}}{\sum_{i=1}^{m}e^{o_i}} = \sum_{j=1}^{m} y_j\log\sum_{i=1}^{m}e^{o_i}-\sum_{j=1}^{m}y_j\log e^{o_j} = \log\sum_{i=1}^{m}e^{o_i}-\sum_{j=1}^{m}y_j o_j$$
where the last step uses $\sum_{j=1}^{m}y_j=1$, which holds because $y$ is one-hot.
Therefore
$$\frac{\partial l}{\partial o_j}= \frac{e^{o_j}}{\sum_{i=1}^{m}e^{o_i}}-y_j = \hat{y}_j - y_j$$
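This gradient can be sanity-checked numerically against the expression $l = \log\sum_i e^{o_i} - \sum_j y_j o_j$ derived above, using central finite differences (hypothetical values):

```python
import numpy as np

def softmax(o):
    z = np.exp(o - o.max())
    return z / z.sum()

def loss(o, y):
    # l(y, softmax(o)) = log sum_i e^{o_i} - sum_j y_j o_j,
    # with the log-sum-exp computed stably by factoring out max(o).
    m = o.max()
    return np.log(np.exp(o - m).sum()) + m - y @ o

o = np.array([0.5, -1.0, 2.0])
y = np.array([0.0, 0.0, 1.0])  # one-hot label

analytic = softmax(o) - y  # claimed gradient: y_hat_j - y_j

# Central finite differences as an independent check.
eps = 1e-6
numeric = np.zeros_like(o)
for j in range(len(o)):
    e = np.zeros_like(o)
    e[j] = eps
    numeric[j] = (loss(o + e, y) - loss(o - e, y)) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-5)
```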
The expression for $o_j$ in terms of $w_{jk}$ can be written as
$$o_j=w_{j1}x_1+ w_{j2}x_2 + \ldots + w_{jd}x_d + b_j$$
so
$$\frac{\partial o_j}{\partial w_{jk}}=x_k$$
Since $o_j$ depends on $b_j$ only through the additive bias term in the same expression, we obtain
$$\frac{\partial o_j}{\partial b_j}=1$$
Putting the pieces together:
$$\frac{\partial l}{\partial w_{jk}} = \frac{\partial l}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial o_j} \frac{\partial o_j}{\partial w_{jk}} = \left(\frac{e^{o_j}}{\sum_{i=1}^{m}e^{o_i}}-y_j\right)x_k$$
$$\frac{\partial l}{\partial b_j} = \frac{\partial l}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial o_j} \frac{\partial o_j}{\partial b_j} = \frac{e^{o_j}}{\sum_{i=1}^{m}e^{o_i}}-y_j$$
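In matrix form these element-wise gradients collect into $\nabla_W l = (\hat{y}-y)\,x^T$ and $\nabla_b l = \hat{y}-y$; a sketch with hypothetical sizes:

```python
import numpy as np

def softmax(o):
    z = np.exp(o - o.max())
    return z / z.sum()

d, m = 4, 3  # hypothetical sizes
rng = np.random.default_rng(2)
W = rng.standard_normal((m, d))
b = rng.standard_normal(m)
x = rng.standard_normal(d)
y = np.eye(m)[1]  # one-hot label, true class 1

y_hat = softmax(W @ x + b)
delta = y_hat - y            # dl/do, shape (m,)

grad_W = np.outer(delta, x)  # dl/dw_jk = (y_hat_j - y_j) * x_k, shape (m, d)
grad_b = delta               # dl/db_j  =  y_hat_j - y_j,        shape (m,)

# Spot-check one entry against the element-wise formula.
assert np.isclose(grad_W[2, 1], delta[2] * x[1])
```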
Since
$$\frac{\partial l}{\partial w_{jk}} = \left(\frac{e^{o_j}}{\sum_{i=1}^{m}e^{o_i}}-y_j\right)x_k$$
the gradient update rule is
$$w_{jk} = w_{jk}-\eta\frac{\partial l}{\partial w_{jk}} = w_{jk}-\eta\left(\frac{e^{o_j}}{\sum_{i=1}^{m}e^{o_i}}-y_j\right)x_k$$
Since
$$\frac{\partial l}{\partial b_j} = \frac{e^{o_j}}{\sum_{i=1}^{m}e^{o_i}}-y_j$$
the gradient update rule is
$$b_j = b_j-\eta\frac{\partial l}{\partial b_j} = b_j-\eta\left(\frac{e^{o_j}}{\sum_{i=1}^{m}e^{o_i}}-y_j\right)$$
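The full SGD loop that these update rules describe can be sketched on a hypothetical toy problem (two linearly separable classes):

```python
import numpy as np

def softmax(o):
    z = np.exp(o - o.max())
    return z / z.sum()

rng = np.random.default_rng(3)
d, m, n = 2, 2, 100
# Hypothetical separable data: class = 1 when x_1 + x_2 > 0, else 0.
X = rng.standard_normal((d, n))
labels = (X.sum(axis=0) > 0).astype(int)
Y = np.eye(m)[labels].T  # one-hot labels as columns, shape (m, n)

W = np.zeros((m, d))
b = np.zeros(m)
eta = 0.1  # learning rate

for epoch in range(50):
    for i in range(n):
        x, y = X[:, i], Y[:, i]
        delta = softmax(W @ x + b) - y     # dl/do for this sample
        W -= eta * np.outer(delta, x)      # w_jk -= eta * (y_hat_j - y_j) * x_k
        b -= eta * delta                   # b_j  -= eta * (y_hat_j - y_j)

# Predict by the largest logit (softmax is monotone, so argmax matches).
preds = np.argmax(W @ X + b[:, None], axis=0)
print((preds == labels).mean())  # training accuracy
```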
Reference: Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola, *Dive into Deep Learning*.