Deriving the loss function and the gradient descent update for the parameter w

Following Section 6.1 of Chapter 6 of 《统计学习方法》 (Statistical Learning Methods), the loss function and the gradient descent update for the parameter $w$ are derived below.
The Sigmoid function is:
$$g(z)=\frac{1}{1+e^{-z}}$$
Given a sample $x$, a linear function combines its features (taking $x_0=1$ so that $w_0$ acts as the bias term):
$$z=w_0+w_1x_1+w_2x_2+\dots+w_nx_n=\sum_{i=0}^{n}w_ix_i=w^TX$$
Applying the Sigmoid function to $z$ gives the prediction function:
$$h_w(x)=g(w^TX)=\frac{1}{1+e^{-w^TX}}$$
Its output is interpreted as the conditional probability of the positive class:
$$P(Y=1|X)=h_w(x)$$
$$P(Y=0|X)=1-h_w(x)$$
For $y\in\{0,1\}$, the two cases combine into a single expression:
$$P(Y|X)=h_w(x)^y(1-h_w(x))^{1-y}$$
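As a minimal sketch of the prediction function above, the following pure-Python snippet evaluates $g(z)$ and $h_w(x)$. The function names `sigmoid` and `predict_proba`, and the weights and feature values, are illustrative assumptions and do not come from the book. The branch on the sign of `z` is only there to avoid overflow in `math.exp` for large negative inputs.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Logistic function g(z) = 1 / (1 + e^{-z}), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z < 0, so exp(z) cannot overflow
    return ez / (1.0 + ez)


def predict_proba(w: Sequence[float], x: Sequence[float]) -> float:
    """h_w(x) = g(w^T x); x is expected to start with the constant x_0 = 1."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return sigmoid(z)


# Illustrative values only (assumptions, not taken from the source).
w = [0.5, -1.0, 2.0]  # w_0 is the bias
x = [1.0, 3.0, 1.5]   # x_0 = 1, followed by two feature values

p1 = predict_proba(w, x)  # P(Y=1 | X)
p0 = 1.0 - p1             # P(Y=0 | X)
print(p1, p0)
```

The two probabilities sum to one by construction, which is what allows the single expression $h_w(x)^y(1-h_w(x))^{1-y}$ to cover both classes.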

Likelihood function over the $m$ training samples $(x_i, y_i)$:
$$L(w)=\prod_{i=1}^m h_w(x_i)^{y_i}(1-h_w(x_i))^{1-y_i}$$
Taking the logarithm gives the log-likelihood:
$$\log L(w)=\sum_{i=1}^m \log\left[h_w(x_i)^{y_i}(1-h_w(x_i))^{1-y_i}\right]=\sum_{i=1}^m\left[y_i\log h_w(x_i)+(1-y_i)\log(1-h_w(x_i))\right]$$
Loss function. It is the negative mean log-likelihood, so maximizing $L(w)$ is equivalent to minimizing $J(w)$. Here $wx_i$ is shorthand for $w^Tx_i$:
$$\begin{aligned}
J(w)&=-\frac{1}{m}\sum_{i=1}^m\left[y_i\cdot\log h_w(x_i)+(1-y_i)\log(1-h_w(x_i))\right]\\
&=-\frac{1}{m}\sum_{i=1}^m\left[y_i\cdot\ln\frac{1}{1+e^{-wx_i}}+(1-y_i)\cdot\ln\frac{e^{-wx_i}}{1+e^{-wx_i}}\right]\\
&=-\frac{1}{m}\sum_{i=1}^m\left[\ln\frac{1}{1+e^{wx_i}}+y_i\cdot\ln\frac{1}{e^{-wx_i}}\right]\\
&=\frac{1}{m}\sum_{i=1}^{m}\left[-wx_iy_i+\ln(1+e^{wx_i})\right]
\end{aligned}$$
The third line uses $\frac{e^{-wx_i}}{1+e^{-wx_i}}=\frac{1}{1+e^{wx_i}}$ and collects the terms multiplied by $y_i$.
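A quick numerical check can confirm that the first and last forms of $J(w)$ agree. The sketch below is illustrative only: the function names, the tiny dataset, and the weights are assumptions made for the example, and `math.log1p(math.exp(z))` is adequate only for moderate values of `z`.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Logistic function g(z) = 1 / (1 + e^{-z})."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def dot(w: Sequence[float], x: Sequence[float]) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))


def loss_cross_entropy(w, X, y) -> float:
    """J(w) = -(1/m) * sum[y_i log h + (1 - y_i) log(1 - h)]."""
    total = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(dot(w, xi))
        total += yi * math.log(h) + (1 - yi) * math.log(1.0 - h)
    return -total / len(X)


def loss_simplified(w, X, y) -> float:
    """J(w) = (1/m) * sum[-w x_i y_i + ln(1 + e^{w x_i})]."""
    total = 0.0
    for xi, yi in zip(X, y):
        z = dot(w, xi)
        total += -z * yi + math.log1p(math.exp(z))
    return total / len(X)


# Hypothetical data: each row starts with x_0 = 1.
X = [[1.0, 0.5, 1.2], [1.0, -1.0, 0.3], [1.0, 2.0, -0.7]]
y = [1, 0, 1]
w = [0.1, -0.4, 0.8]

j_full = loss_cross_entropy(w, X, y)
j_short = loss_simplified(w, X, y)
print(j_full, j_short)
```

At $w=0$ every prediction is $0.5$, so both forms reduce to $\ln 2$, which makes a convenient sanity check.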
For gradient descent, the gradient of $J(w)$ with respect to the component $w_j$, where $x_{i,j}$ denotes the $j$-th feature of sample $x_i$, is:
$$\begin{aligned}
\frac{\partial J(w)}{\partial w_j}&=\frac{1}{m}\sum_{i=1}^m\left[-x_{i,j}y_i+\frac{x_{i,j}\cdot e^{wx_i}}{1+e^{wx_i}}\right]\\
&=\frac{1}{m}\sum_{i=1}^m x_{i,j}\left(\frac{1}{1+e^{-wx_i}}-y_i\right)\\
&=\frac{1}{m}\sum_{i=1}^m\left[h_w(x_i)-y_i\right]x_{i,j}
\end{aligned}$$
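One way to gain confidence in the gradient formula is to compare it against a central finite-difference approximation of $J(w)$. The following is a sketch under stated assumptions: the helper names, the step size `eps`, and the sample data are all hypothetical choices for illustration.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Logistic function g(z) = 1 / (1 + e^{-z})."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def dot(w: Sequence[float], x: Sequence[float]) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))


def loss(w, X, y) -> float:
    """J(w) = (1/m) * sum[-w x_i y_i + ln(1 + e^{w x_i})]."""
    total = 0.0
    for xi, yi in zip(X, y):
        z = dot(w, xi)
        total += -z * yi + math.log1p(math.exp(z))
    return total / len(X)


def gradient(w, X, y) -> List[float]:
    """Analytic gradient: dJ/dw_j = (1/m) * sum[(h_w(x_i) - y_i) * x_{i,j}]."""
    grad = [0.0] * len(w)
    for xi, yi in zip(X, y):
        err = sigmoid(dot(w, xi)) - yi
        for j, xij in enumerate(xi):
            grad[j] += err * xij
    return [g / len(X) for g in grad]


def numeric_gradient(w, X, y, eps: float = 1e-6) -> List[float]:
    """Central-difference approximation of the same gradient."""
    grad = []
    for j in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[j] += eps
        w_minus[j] -= eps
        grad.append((loss(w_plus, X, y) - loss(w_minus, X, y)) / (2.0 * eps))
    return grad


# Hypothetical data: each row starts with x_0 = 1.
X = [[1.0, 0.5, 1.2], [1.0, -1.0, 0.3], [1.0, 2.0, -0.7]]
y = [1, 0, 1]
w = [0.1, -0.4, 0.8]

analytic = gradient(w, X, y)
numeric = numeric_gradient(w, X, y)
max_diff = max(abs(a - b) for a, b in zip(analytic, numeric))
print(analytic, numeric, max_diff)
```

If the derivation is right, `max_diff` should be tiny compared with the gradient entries themselves.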

The resulting update rule for each component $w_j$, with learning rate $\alpha$ (the constant factor $\frac{1}{m}$ is absorbed into $\alpha$), is:
$$w_j\leftarrow w_j-\alpha\sum_{i=1}^m\left[h_w(x_i)-y_i\right]x_{i,j}$$
For stochastic gradient descent, which updates on a single sample $(x, y)$ at a time, the update rule is:
$$w_j\leftarrow w_j-\alpha\left[h_w(x)-y\right]x_j$$
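The two update rules can be sketched side by side as follows. This is a minimal illustration, not a tuned implementation: the function names `batch_gd` and `sgd`, the learning rate, the iteration counts, the random seed, and the one-feature toy dataset are all assumptions. Because the toy data is linearly separable, the weights keep growing with more iterations, so only the decrease in loss is meaningful here.

```python
import math
import random
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Logistic function g(z) = 1 / (1 + e^{-z})."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def dot(w: Sequence[float], x: Sequence[float]) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))


def loss(w, X, y) -> float:
    """J(w) = (1/m) * sum[-w x_i y_i + ln(1 + e^{w x_i})]."""
    total = 0.0
    for xi, yi in zip(X, y):
        z = dot(w, xi)
        total += -z * yi + math.log1p(math.exp(z))
    return total / len(X)


def batch_gd(X, y, alpha: float, iters: int) -> List[float]:
    """Batch rule: w_j <- w_j - alpha * sum_i (h_w(x_i) - y_i) * x_{i,j}."""
    w = [0.0] * len(X[0])
    for _ in range(iters):
        # Compute all errors first so every w_j is updated from the same w.
        errors = [sigmoid(dot(w, xi)) - yi for xi, yi in zip(X, y)]
        w = [
            wj - alpha * sum(e * xi[j] for e, xi in zip(errors, X))
            for j, wj in enumerate(w)
        ]
    return w


def sgd(X, y, alpha: float, epochs: int, seed: int = 0) -> List[float]:
    """Stochastic rule: w_j <- w_j - alpha * (h_w(x) - y) * x_j, one sample at a time."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    w = [0.0] * len(X[0])
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            err = sigmoid(dot(w, X[i])) - y[i]
            w = [wj - alpha * err * xj for wj, xj in zip(w, X[i])]
    return w


# Hypothetical toy data: x_0 = 1 plus a single feature.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, -0.5], [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 0, 1, 1, 1]

initial_loss = loss([0.0, 0.0], X, y)
w_batch = batch_gd(X, y, alpha=0.1, iters=200)
w_sgd = sgd(X, y, alpha=0.1, epochs=50)
print(initial_loss, loss(w_batch, X, y), loss(w_sgd, X, y))
```

The batch version touches all $m$ samples per step, while the stochastic version performs $m$ cheaper updates per pass over the data; which one is preferable depends on dataset size and how noisy an update can be tolerated.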
