Gradient Descent for Logistic Regression and Softmax Regression with $L_2$ Regularization

  • Gradient descent for logistic regression with $L_2$ regularization
  • Gradient descent for softmax regression with $L_2$ regularization

Gradient Descent for Logistic Regression with $L_2$ Regularization

Logistic regression is a machine learning algorithm for binary classification problems. With $L_2$ regularization, its objective function is:

$$
\begin{aligned}
J(w) &= -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(h_w(x^{(i)})\right) + \left(1-y^{(i)}\right)\log\left(1-h_w(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n} w_j^2 \\
&= -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(\sigma(w^Tx^{(i)})\right) + \left(1-y^{(i)}\right)\log\left(1-\sigma(w^Tx^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n} w_j^2
\end{aligned}
$$
Here, $m$ is the number of training examples, $n$ is the number of features, $y^{(i)}$ is the label of the $i$-th example (0 or 1), $x^{(i)}$ is the feature vector of the $i$-th example, $w$ is the model's parameter vector, $\lambda$ is the regularization coefficient, and $\sigma(z)$ is the logistic (sigmoid) function, defined as:
$$
\sigma(z) = \frac{1}{1+e^{-z}}
$$
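As a practical aside, evaluating $e^{-z}$ directly can overflow for large $|z|$; a common workaround is to keep the exponent non-positive. The NumPy snippet below is only a sketch of that idea (the function name and the use of NumPy are assumptions made here for illustration):

```python
import numpy as np

def sigmoid(z):
    """Numerically stable logistic function sigma(z) = 1 / (1 + exp(-z))."""
    z = np.asarray(z, dtype=float)
    e = np.exp(-np.abs(z))  # exponent is always <= 0, so exp() cannot overflow
    # For z >= 0: 1/(1 + e^{-z}); for z < 0: e^{z}/(1 + e^{z}). Both equal sigma(z).
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))
```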
Gradient descent is used to minimize the objective function $J(w)$; the update rule is:
$$
w_j \leftarrow w_j - \alpha \frac{\partial J(w)}{\partial w_j}
$$
where $\alpha$ is the learning rate and $\frac{\partial J(w)}{\partial w_j}$ is the partial derivative of the objective function $J(w)$ with respect to the parameter $w_j$.
Taking the partial derivative of the objective function $J(w)$:
$$
\begin{aligned}
\frac{\partial J(w)}{\partial w_j} &= -\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)}-\sigma(w^Tx^{(i)})\right)x_j^{(i)} + \frac{\lambda}{m}w_j \\
&= -\frac{1}{m}\sum_{i=1}^{m}x_j^{(i)}\left(y^{(i)}-\sigma(w^Tx^{(i)})\right) + \frac{\lambda}{m}w_j
\end{aligned}
$$
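This gradient follows from the chain rule together with the identity $\sigma'(z) = \sigma(z)\bigl(1-\sigma(z)\bigr)$. For a single example (ignoring the regularization term for the moment):

$$
\frac{\partial}{\partial w_j}\Bigl[-y\log\sigma(w^Tx) - (1-y)\log\bigl(1-\sigma(w^Tx)\bigr)\Bigr]
= -\left(\frac{y}{\sigma(w^Tx)} - \frac{1-y}{1-\sigma(w^Tx)}\right)\sigma(w^Tx)\bigl(1-\sigma(w^Tx)\bigr)x_j
= \bigl(\sigma(w^Tx) - y\bigr)x_j
$$

Averaging this over the $m$ examples and adding the derivative $\frac{\lambda}{m}w_j$ of the regularization term gives the expression above.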
Therefore, the gradient descent update rule for logistic regression with $L_2$ regularization is:
$$
w_j \leftarrow w_j - \alpha\left(-\frac{1}{m}\sum_{i=1}^{m}x_j^{(i)}\left(y^{(i)}-\sigma(w^Tx^{(i)})\right) + \frac{\lambda}{m}w_j\right)
$$
Simplifying, we get:
$$
w_j \leftarrow \left(1-\alpha\frac{\lambda}{m}\right)w_j + \alpha\frac{1}{m}\sum_{i=1}^{m}x_j^{(i)}\left(y^{(i)}-\sigma(w^Tx^{(i)})\right)
$$
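As an illustration, this update maps directly onto a few lines of NumPy. The following minimal sketch (the function name `logistic_gd_l2`, zero initialization, and the absence of a separate intercept term are assumptions made here for illustration) reuses the `sigmoid` helper from the earlier snippet:

```python
import numpy as np

def logistic_gd_l2(X, y, alpha=0.1, lam=1.0, n_iters=1000):
    """Batch gradient descent for L2-regularized logistic regression.

    X : (m, n) feature matrix; y : (m,) labels in {0, 1}.
    Each step applies
        w_j <- (1 - alpha*lam/m) * w_j
               + (alpha/m) * sum_i x_j^(i) * (y^(i) - sigma(w^T x^(i))).
    """
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(n_iters):
        p = sigmoid(X @ w)                            # sigma(w^T x^(i)) for all i, shape (m,)
        grad = -(X.T @ (y - p)) / m + (lam / m) * w   # gradient of J(w)
        w -= alpha * grad                             # equivalent to the simplified update above
    return w
```

Note that, like the formulas above, this sketch regularizes every component of $w$; in practice an intercept term is often left unregularized.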

Gradient Descent for Softmax Regression with $L_2$ Regularization

Softmax regression is a machine learning algorithm for multi-class classification problems. With $L_2$ regularization, its objective function is:
$$
\begin{aligned}
J(W) &= -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{k}y_{j}^{(i)}\log\left(\frac{e^{w_j^Tx^{(i)}}}{\sum_{c=1}^{k}e^{w_c^Tx^{(i)}}}\right) + \frac{\lambda}{2m}\sum_{j=1}^{k}\sum_{l=1}^{n}W_{jl}^2 \\
&= -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{k}y_{j}^{(i)}\left(w_j^Tx^{(i)} - \log\sum_{c=1}^{k}e^{w_c^Tx^{(i)}}\right) + \frac{\lambda}{2m}\sum_{j=1}^{k}\sum_{l=1}^{n}W_{jl}^2
\end{aligned}
$$
Here, $m$ is the number of training examples, $n$ is the number of features, $k$ is the number of classes, $y_{j}^{(i)}$ indicates whether the $i$-th example belongs to the $j$-th class ($y_{j}^{(i)} = 1$ if it does, $y_{j}^{(i)} = 0$ otherwise), $x^{(i)}$ is the feature vector of the $i$-th example, $W$ is the model's parameter matrix whose $j$-th row is $w_j$, and $\lambda$ is the regularization coefficient.

Gradient descent is used to minimize the objective function $J(W)$; the update rule is:
$$
W_{jl} \leftarrow W_{jl} - \alpha \frac{\partial J(W)}{\partial W_{jl}}
$$
where $\alpha$ is the learning rate and $\frac{\partial J(W)}{\partial W_{jl}}$ is the partial derivative of the objective function $J(W)$ with respect to the parameter $W_{jl}$.

Taking the partial derivative of the objective function $J(W)$:
$$
\begin{aligned}
\frac{\partial J(W)}{\partial W_{jl}} &= -\frac{1}{m}\sum_{i=1}^{m}\left(y_{j}^{(i)} - \frac{e^{w_j^Tx^{(i)}}}{\sum_{c=1}^{k}e^{w_c^Tx^{(i)}}}\right)x_{l}^{(i)} + \frac{\lambda}{m}W_{jl} \\
&= -\frac{1}{m}\sum_{i=1}^{m}x_{l}^{(i)}\left(y_{j}^{(i)} - \frac{e^{w_j^Tx^{(i)}}}{\sum_{c=1}^{k}e^{w_c^Tx^{(i)}}}\right) + \frac{\lambda}{m}W_{jl}
\end{aligned}
$$
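To see where this comes from, note that the one-hot labels satisfy $\sum_{j=1}^{k} y_j^{(i)} = 1$, so for a single example (dropping the superscript $(i)$) the loss reduces to $-\sum_{j'=1}^{k} y_{j'}\, w_{j'}^T x + \log\sum_{c=1}^{k} e^{w_c^T x}$, and differentiating with respect to $W_{jl}$ gives:

$$
\frac{\partial}{\partial W_{jl}}\left[-\sum_{j'=1}^{k} y_{j'}\, w_{j'}^T x + \log\sum_{c=1}^{k} e^{w_c^T x}\right]
= -y_j x_l + \frac{e^{w_j^T x}}{\sum_{c=1}^{k} e^{w_c^T x}}\, x_l
$$

Averaging over the $m$ examples and adding $\frac{\lambda}{m}W_{jl}$ from the regularization term recovers the expression above.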
Substituting this partial derivative into the gradient descent update rule gives:
$$
W_{jl} \leftarrow W_{jl} - \alpha\left(-\frac{1}{m}\sum_{i=1}^{m}x_{l}^{(i)}\left(y_{j}^{(i)} - \frac{e^{w_j^Tx^{(i)}}}{\sum_{c=1}^{k}e^{w_c^Tx^{(i)}}}\right) + \frac{\lambda}{m}W_{jl}\right)
$$
Simplifying, we get:
$$
W_{jl} \leftarrow \left(1-\alpha\frac{\lambda}{m}\right)W_{jl} - \frac{\alpha}{m}\sum_{i=1}^{m}x_{l}^{(i)}\left(\frac{e^{w_j^Tx^{(i)}}}{\sum_{c=1}^{k}e^{w_c^Tx^{(i)}}} - y_{j}^{(i)}\right)
$$
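A matching NumPy sketch for the softmax update is shown below; the function name `softmax_gd_l2`, the one-hot label matrix `Y`, zero initialization, and the max-subtraction inside the softmax are assumptions made here for illustration:

```python
import numpy as np

def softmax_gd_l2(X, Y, alpha=0.1, lam=1.0, n_iters=1000):
    """Batch gradient descent for L2-regularized softmax regression.

    X : (m, n) feature matrix; Y : (m, k) one-hot label matrix.
    W has shape (k, n); row j of W is w_j. Each step applies
        W_jl <- (1 - alpha*lam/m) * W_jl
                - (alpha/m) * sum_i x_l^(i) * (p_j^(i) - y_j^(i)),
    where p_j^(i) is the softmax probability of class j for example i.
    """
    m, n = X.shape
    k = Y.shape[1]
    W = np.zeros((k, n))
    for _ in range(n_iters):
        scores = X @ W.T                             # (m, k) logits w_j^T x^(i)
        scores -= scores.max(axis=1, keepdims=True)  # shift logits; softmax is unchanged
        expz = np.exp(scores)
        P = expz / expz.sum(axis=1, keepdims=True)   # (m, k) softmax probabilities
        grad = (P - Y).T @ X / m + (lam / m) * W     # (k, n) gradient of J(W)
        W -= alpha * grad
    return W
```

For labels given as integer class indices `t` in `{0, ..., k-1}`, the one-hot matrix can be built with `Y = np.eye(k)[t]`.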
