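In machine learning, a dataset is typically laid out as follows.
The features and label of the $i$-th sample are written as below, where the superscript $i$ indexes the sample (it is not an exponent) and the subscript indexes the feature:
$$
x^i=(x_1^i,x_2^i,x_3^i,\ldots,x_d^i)^T \in \mathbb{R}^d, \qquad y^i \in \mathbb{R}
$$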
The full dataset can then be written as:
$$
X=[x^1,x^2,\ldots,x^n]
= \begin{bmatrix}
x^1_1 & x^2_1 & \ldots & x^n_1 \\
x^1_2 & x^2_2 & \ldots & x^n_2 \\
\vdots & \vdots & \ddots & \vdots \\
x^1_d & x^2_d & \ldots & x^n_d
\end{bmatrix} \in \mathbb{R}^{d \times n}, \qquad
y=[y^1,y^2,\ldots,y^n] \in \mathbb{R}^n
$$
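To make the column layout concrete, here is a minimal sketch in plain Python. The sample values and the helper name `to_column_layout` are made up for illustration; the only point is that each sample becomes one column of $X$, so $X$ has $d$ rows and $n$ columns.

```python
# Three hypothetical samples with d = 2 features each (illustrative values only).
samples = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
labels = [0, 1, 1]

def to_column_layout(samples):
    """Return X with d rows and n columns, so X[j][i] is feature j of sample i."""
    return [list(row) for row in zip(*samples)]

X = to_column_layout(samples)
d, n = len(X), len(X[0])
print(f"X has d={d} rows and n={n} columns")
```

Note that Python indices start at 0, whereas the formulas above count from 1.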
For a single sample $x$, the linear score is
$$
z = w^Tx+b = w_1x_1+ w_2x_2+\ldots+w_dx_d+b
$$
The sigmoid function squashes the output into the interval $(0, 1)$, which is what makes binary classification possible. It is defined as
$$
\sigma(z) = \frac{1}{1+e^{-z}} = \frac{e^z}{1+e^z}
$$
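The forward pass described so far can be sketched in a few lines of plain Python. The function names and the example numbers are assumptions made for illustration. Switching between the two equivalent forms of $\sigma(z)$ depending on the sign of $z$ is a common trick to keep `math.exp` from overflowing; it is not something the derivation itself requires.

```python
import math

def linear_score(w, x, b):
    """z = w^T x + b for one sample, with w and x given as plain lists."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def sigmoid(z):
    """Sigmoid evaluated with whichever of its two forms avoids a large exponent."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))   # 1 / (1 + e^{-z})
    ez = math.exp(z)
    return ez / (1.0 + ez)                  # e^{z} / (1 + e^{z})

# Hypothetical weights and sample, chosen so that z = 0.
w, b = [1.0, -1.0], 0.0
x = [2.0, 2.0]
p = sigmoid(linear_score(w, x, b))
print(p)
```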
Cross-entropy is used as the loss function. For binary classification, the loss on the $i$-th sample is
$$
g(z^i)=-y^i \log{(\sigma(z^i))}-(1-y^i) \log{(1-\sigma(z^i))}
$$
so the loss over the whole dataset can be written as
$$
L=\frac{1}{n} \sum_{i=1}^{n} g(z^i)=\frac{1}{n} \sum_{i=1}^{n}\left(-y^i \log{(\sigma(z^i))}-(1-y^i) \log{(1-\sigma(z^i))}\right)
$$
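As a rough illustration of the loss formula, the sketch below averages the per-sample cross-entropy over a list of predicted probabilities. The function name and the `eps` clipping are assumptions added here: clipping only guards against `log(0)` when a prediction is exactly 0 or 1, and is not part of the formula above.

```python
import math

def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean of -y*log(p) - (1-y)*log(1-p) over all samples."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # assumption: clip to avoid log(0)
        total += -y * math.log(p) - (1 - y) * math.log(1.0 - p)
    return total / len(probs)

# A model that always outputs 0.5 pays log(2) per sample, whatever the labels are.
uninformed = binary_cross_entropy([0.5, 0.5], [0, 1])
print(uninformed)
```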
Finding the derivatives with respect to $w$ and $b$ requires the chain rule. In the formulas below, the subscript $i$ in $w_i$ indexes the feature, not the sample:
$$
\begin{aligned}
\frac{\partial L}{\partial w_i} &= \frac{\partial L}{\partial g} \frac{\partial g}{\partial \sigma} \frac{\partial \sigma}{\partial z} \frac{\partial z}{\partial w_i} \\
\frac{\partial L}{\partial b} &= \frac{\partial L}{\partial g} \frac{\partial g}{\partial \sigma} \frac{\partial \sigma}{\partial z} \frac{\partial z}{\partial b}
\end{aligned}
$$
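Consider a single sample ($n=1$) and drop the sample superscript. Then $L$ as a function of $g$ is
$$
L=\frac{1}{n} \sum_{i=1}^{n} g(z^i)=g \qquad (n=1)
$$
and therefore
$$
\frac{\partial L}{\partial g}=1
$$
For $n$ samples, each per-sample term carries a factor $\frac{1}{n}$, so the full gradient is simply the average of the per-sample gradients derived here.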
As a function of $\sigma$, $g$ can be written as
$$
g=-y \log{(\sigma)}-(1-y) \log{(1-\sigma)}
$$
which gives
$$
\begin{aligned}
\frac{\partial g}{\partial \sigma} &= \frac{\partial \left(-y \log{(\sigma)}-(1-y) \log{(1-\sigma)}\right)}{\partial \sigma} \\
&= -y \frac{\partial \log{(\sigma)}}{\partial \sigma} -(1-y) \frac{\partial \log{(1-\sigma)}}{\partial \sigma} \\
&= -\frac{y}{\sigma} + \frac{1-y}{1-\sigma}
\end{aligned}
$$
As a function of $z$, $\sigma$ can be written as
$$
\sigma(z) = \frac{1}{1+e^{-z}} = \frac{e^z}{1+e^z}
$$
so
$$
\begin{aligned}
\frac{\partial \sigma}{\partial z} &= \frac{\partial}{\partial z}\left(\frac{1}{1+e^{-z}}\right) \\
&= -\frac{1}{(1+e^{-z})^2}\times e^{-z} \times (-1) \\
&= \frac{e^{-z}}{(1+e^{-z})^2} \\
&= \frac{1}{1+e^{-z}} \cdot \frac{e^{-z}}{1+e^{-z}} = \sigma(1-\sigma)
\end{aligned}
$$
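One quick way to convince yourself of the identity $\frac{\partial \sigma}{\partial z}=\sigma(1-\sigma)$ is to compare it with a central finite difference. The sketch below does this at a handful of arbitrarily chosen points; the helper names and the step size `h` are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Closed form from the derivation: sigma * (1 - sigma)."""
    s = sigmoid(z)
    return s * (1.0 - s)

def central_difference(f, z, h=1e-5):
    """Numerical estimate of f'(z)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

points = [-3.0, -0.5, 0.0, 0.5, 3.0]
max_gap = max(abs(sigmoid_grad(z) - central_difference(sigmoid, z)) for z in points)
print(f"largest gap between closed form and finite difference: {max_gap:.2e}")
```

The derivative peaks at $z=0$, where $\sigma=0.5$ and $\sigma(1-\sigma)=0.25$.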
As a function of $w$, $z = w^Tx+b$, which gives
$$
\frac{\partial z}{\partial w_i}=x_i,\quad i=1,2,\ldots,d
$$
As a function of $b$, $z = w^Tx+b$, which gives
$$
\frac{\partial z}{\partial b}=1
$$
Multiplying the four factors together gives
$$
\begin{aligned}
\frac{\partial L}{\partial w_i} &= \frac{\partial L}{\partial g} \frac{\partial g}{\partial \sigma} \frac{\partial \sigma}{\partial z} \frac{\partial z}{\partial w_i} \\
&= 1\times\left(-\frac{y}{\sigma} + \frac{1-y}{1-\sigma}\right) \times \sigma(1-\sigma) \times x_i \\
&= x_i\left(-y(1-\sigma)+\sigma(1-y)\right) \\
&= x_i(\sigma-y) \\
\frac{\partial L}{\partial b} &= \frac{\partial L}{\partial g} \frac{\partial g}{\partial \sigma} \frac{\partial \sigma}{\partial z} \frac{\partial z}{\partial b} \\
&= 1\times\left(-\frac{y}{\sigma} + \frac{1-y}{1-\sigma}\right) \times \sigma(1-\sigma) \times 1 \\
&= -y(1-\sigma)+\sigma(1-y) \\
&= \sigma-y
\end{aligned}
$$
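The closed-form gradients $x_i(\sigma-y)$ and $\sigma-y$ can be sanity-checked against finite differences of the single-sample loss. The following is a minimal sketch under the single-sample assumption used in the derivation; the weights, sample values, and function names are all made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """Single-sample cross-entropy g for weights w, bias b, features x, label y."""
    s = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    return -y * math.log(s) - (1 - y) * math.log(1 - s)

def analytic_grad(w, b, x, y):
    """Gradients from the derivation: dL/dw_i = x_i (sigma - y), dL/db = sigma - y."""
    err = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) - y
    return [xj * err for xj in x], err

def numeric_grad(w, b, x, y, h=1e-6):
    """Central finite differences, perturbing one parameter at a time."""
    grad_w = []
    for j in range(len(w)):
        up = w[:j] + [w[j] + h] + w[j + 1:]
        down = w[:j] + [w[j] - h] + w[j + 1:]
        grad_w.append((loss(up, b, x, y) - loss(down, b, x, y)) / (2 * h))
    grad_b = (loss(w, b + h, x, y) - loss(w, b - h, x, y)) / (2 * h)
    return grad_w, grad_b

# Hypothetical parameters and one hypothetical sample with label 1.
w, b, x, y = [0.3, -0.2], 0.1, [1.5, -2.0], 1
gw_a, gb_a = analytic_grad(w, b, x, y)
gw_n, gb_n = numeric_grad(w, b, x, y)
print(gw_a, gw_n)
print(gb_a, gb_n)
```

If the two pairs of numbers agree to several decimal places, the chain-rule result is very likely correct.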
Since
$$
\frac{\partial L}{\partial w_i} =x_i(\sigma-y)
$$
the gradient descent update, with learning rate $\eta$, is
$$
w_i=w_i-\eta\frac{\partial L}{\partial w_i} = w_i-\eta x_i(\sigma-y)
$$
Stacking the $d$ components gives
$$
\begin{bmatrix} w_1\\ w_2\\ \vdots \\ w_d \end{bmatrix}=\begin{bmatrix} w_1\\ w_2\\ \vdots \\ w_d \end{bmatrix}-\eta(\sigma-y)\begin{bmatrix} x_1\\ x_2\\ \vdots \\ x_d \end{bmatrix}
$$
that is,
$$
w=w-\eta(\sigma-y)x
$$
Likewise, since
$$
\frac{\partial L}{\partial b} =\sigma-y
$$
the gradient descent update for $b$ is
$$
b=b-\eta\frac{\partial L}{\partial b} = b-\eta(\sigma-y)
$$
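Putting the two update rules together, a per-sample gradient descent loop might look like the sketch below. Everything specific here is an assumption for illustration: the toy dataset, the zero initialization, the learning rate `eta=0.5`, the number of epochs, and visiting samples in a fixed order. It applies $w=w-\eta(\sigma-y)x$ and $b=b-\eta(\sigma-y)$ once per sample.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict(w, b, x):
    """sigma(w^T x + b) for one sample."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def mean_loss(w, b, samples, labels):
    """Average cross-entropy over the dataset."""
    total = 0.0
    for x, y in zip(samples, labels):
        p = predict(w, b, x)
        total += -y * math.log(p) - (1 - y) * math.log(1 - p)
    return total / len(samples)

def train(samples, labels, eta=0.5, epochs=50):
    """Per-sample gradient descent using the update rules derived above."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            err = predict(w, b, x) - y                          # sigma - y
            w = [wj - eta * err * xj for wj, xj in zip(w, x)]   # w = w - eta (sigma - y) x
            b = b - eta * err                                   # b = b - eta (sigma - y)
    return w, b

# Toy, linearly separable data (illustrative values only).
samples = [[-2.0, -1.0], [-1.0, -2.0], [1.0, 2.0], [2.0, 1.0]]
labels = [0, 0, 1, 1]

initial_loss = mean_loss([0.0, 0.0], 0.0, samples, labels)
w, b = train(samples, labels)
final_loss = mean_loss(w, b, samples, labels)
predicted = [int(predict(w, b, x) >= 0.5) for x in samples]
print(initial_loss, final_loss, predicted)
```

For a full-batch variant, one would instead average `err * xj` and `err` over all samples before each update, matching the $\frac{1}{n}$ factor in $L$.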