Gradient Descent Derivation: Linear Regression

Contents

  • Dataset format
  • Linear regression model
  • General update rule
  • Gradient descent for the weights $w$
  • Gradient descent for the bias $b$
  • References

Dataset format

In machine learning, a dataset is typically laid out as follows. The features and label of the $i$-th sample are written

$$x^i=(x_1^i,x_2^i,\ldots,x_d^i)^T \in \mathbb{R}^d, \qquad y^i \in \mathbb{R}.$$

The full dataset can then be written

$$X=[x^1,x^2,\ldots,x^n]=\begin{bmatrix} x^1_1 & x^2_1 & \ldots & x^n_1 \\ x^1_2 & x^2_2 & \ldots & x^n_2 \\ \vdots & \vdots & \ddots & \vdots \\ x^1_d & x^2_d & \ldots & x^n_d \end{bmatrix} \in \mathbb{R}^{d\times n}, \qquad y=[y^1,y^2,\ldots,y^n] \in \mathbb{R}^n.$$
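As a quick sketch, the column-wise layout above can be set up with NumPy (the sizes `d = 3`, `n = 5` and the random data are purely illustrative):

```python
import numpy as np

d, n = 3, 5  # illustrative: 3 features, 5 samples
rng = np.random.default_rng(0)

# Each column of X is one sample x^i in R^d, so X has shape (d, n);
# y holds one scalar label per sample.
X = rng.normal(size=(d, n))
y = rng.normal(size=n)

print(X.shape, y.shape)  # (3, 5) (5,)
```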

Linear regression model

The weight vector $w$ of linear regression is written

$$w=[w_1,w_2,\ldots,w_d]^T \in \mathbb{R}^d.$$

For the $i$-th sample, the linear regression model is

$$f(x^i) = w^T x^i + b = w_1 x_1^i + w_2 x_2^i + \ldots + w_d x_d^i + b.$$

For the feature matrix $X$, the predictions $\hat{Y}$ can be expressed as the matrix–vector product

$$\hat{Y} = w^T X + b.$$

Using the mean squared error (MSE) as the loss function, the loss (also called the cost function) is

$$L =\frac{1}{n} \sum_{i=1}^{n}\bigl(f(x^i)-y^i\bigr)^2 =\frac{1}{n} \sum_{i=1}^{n}\bigl(w_1x_1^i+ w_2x_2^i+\ldots+w_dx_d^i+b-y^i\bigr)^2.$$
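A minimal sketch of the model and the MSE loss in NumPy (the data is random and purely illustrative): since the columns of `X` are the samples, `w @ X` computes $w^T X$ and yields all $n$ predictions at once.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 5
X = rng.normal(size=(d, n))  # columns are the samples x^i
y = rng.normal(size=n)
w = rng.normal(size=d)
b = 0.5

# f(x^i) = w^T x^i + b for all samples at once: w @ X has shape (n,)
y_hat = w @ X + b

# L = (1/n) * sum_i (f(x^i) - y^i)^2
loss = np.mean((y_hat - y) ** 2)
```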

General update rule

Introducing the gradient operator $\nabla$,

$$\nabla L(w)=\begin{bmatrix} \frac{\partial L}{\partial w_1} \\ \frac{\partial L}{\partial w_2} \\ \vdots \\ \frac{\partial L}{\partial w_d} \end{bmatrix}, \qquad \nabla L(b) = \Bigl[\frac{\partial L}{\partial b}\Bigr] = \frac{\partial L}{\partial b}.$$

The general gradient descent update rule can therefore be written

$$w \leftarrow w - \eta \nabla L(w), \qquad b \leftarrow b - \eta \nabla L(b) = b - \eta \frac{\partial L}{\partial b},$$

where $\eta$ is the learning rate.
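As a toy illustration of the generic rule (not specific to linear regression), consider minimizing the one-dimensional loss $L(w) = (w-3)^2$, whose gradient is $2(w-3)$; the learning rate and starting point below are arbitrary illustrative choices:

```python
# Toy example of w <- w - eta * dL/dw for L(w) = (w - 3)^2.
eta = 0.1   # learning rate (illustrative choice)
w = 0.0     # arbitrary starting point

for _ in range(100):
    grad = 2 * (w - 3)  # dL/dw
    w = w - eta * grad

# w has converged very close to the minimizer w = 3
print(round(w, 6))  # 3.0
```

Each step contracts the distance to the minimizer by a factor of $1 - 2\eta = 0.8$, which is why 100 iterations suffice here.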

Gradient descent for the weights $w$

For the weight $w_j$, the gradient is

$$\frac{\partial L}{\partial w_j}=\frac{2}{n} \sum_{i=1}^{n}\bigl(w_1x_1^i+ w_2x_2^i+\ldots+w_dx_d^i+b-y^i\bigr)x_j^i = \frac{2}{n} \sum_{i=1}^{n}\bigl(f(x^i)-y^i\bigr)x_j^i.$$

The gradient descent update for $w_j$ is

$$w_j \leftarrow w_j-\eta\frac{\partial L}{\partial w_j} = w_j -\eta\frac{2}{n} \sum_{i=1}^{n}\bigl(f(x^i)-y^i\bigr)x_j^i.$$

Likewise, the updates for $w_1, w_2, \ldots, w_d$ are

$$\begin{aligned}
w_1 &\leftarrow w_1-\eta\frac{\partial L}{\partial w_1} = w_1 -\eta \frac{2}{n} \sum_{i=1}^{n}\bigl(f(x^i)-y^i\bigr)x_1^i \\
w_2 &\leftarrow w_2-\eta\frac{\partial L}{\partial w_2} = w_2 -\eta \frac{2}{n} \sum_{i=1}^{n}\bigl(f(x^i)-y^i\bigr)x_2^i \\
&\;\;\vdots \\
w_d &\leftarrow w_d-\eta\frac{\partial L}{\partial w_d} = w_d -\eta \frac{2}{n} \sum_{i=1}^{n}\bigl(f(x^i)-y^i\bigr)x_d^i
\end{aligned}$$

Observe that $(f(x^i)-y^i)$ is a scalar, while

$$w=[w_1,w_2,\ldots,w_d]^T \in \mathbb{R}^d, \qquad x^i=(x_1^i,x_2^i,\ldots,x_d^i)^T,$$

so the gradient descent update for $w$ can be written in vector form:

$$\begin{bmatrix} w_1\\ w_2\\ \vdots \\ w_d \end{bmatrix} \leftarrow \begin{bmatrix} w_1\\ w_2\\ \vdots \\ w_d \end{bmatrix} - \eta \begin{bmatrix} \frac{2}{n} \sum_{i=1}^{n}(f(x^i)-y^i)x_1^i\\ \frac{2}{n} \sum_{i=1}^{n}(f(x^i)-y^i)x_2^i\\ \vdots\\ \frac{2}{n} \sum_{i=1}^{n}(f(x^i)-y^i)x_d^i \end{bmatrix} = \begin{bmatrix} w_1\\ w_2\\ \vdots \\ w_d \end{bmatrix} - \eta\frac{2}{n} \sum_{i=1}^{n}\bigl(f(x^i)-y^i\bigr) \begin{bmatrix} x_1^i\\ x_2^i\\ \vdots\\ x_d^i \end{bmatrix}$$

Hence the update can be written compactly as

$$w \leftarrow w-\eta \frac{2}{n} \sum_{i=1}^{n}\bigl(f(x^i)-y^i\bigr)\, x^i,$$

that is,

$$w \leftarrow w-\eta \frac{2}{n} \sum_{i=1}^{n}\bigl(w^T x^i+b-y^i\bigr)\, x^i.$$
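Since the columns of $X$ are the samples $x^i$, the sum $\sum_i (f(x^i)-y^i)\,x^i$ is exactly the matrix–vector product of $X$ with the residual vector. A sketch of one vectorized update step in NumPy (the random data and the values of `b` and `eta` are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 8
X = rng.normal(size=(d, n))  # columns are the samples x^i
y = rng.normal(size=n)
w = rng.normal(size=d)
b, eta = 0.0, 0.05

residual = w @ X + b - y             # (f(x^i) - y^i) for all i, shape (n,)
grad_w = (2.0 / n) * (X @ residual)  # (2/n) * sum_i (f(x^i)-y^i) * x^i
w_new = w - eta * grad_w             # the vectorized update for w
```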

Gradient descent for the bias $b$

Similarly, the gradient with respect to $b$ is

$$\frac{\partial L}{\partial b}=\frac{2}{n} \sum_{i=1}^{n}\bigl(w_1x_1^i+ w_2x_2^i+\ldots+w_dx_d^i+b-y^i\bigr) = \frac{2}{n} \sum_{i=1}^{n}\bigl(f(x^i)-y^i\bigr),$$

so the update rule is

$$b \leftarrow b -\eta \frac{\partial L}{\partial b} = b - \eta \frac{2}{n} \sum_{i=1}^{n}\bigl(f(x^i)-y^i\bigr),$$

that is,

$$b \leftarrow b - \eta \frac{2}{n} \sum_{i=1}^{n}\bigl(w^T x^i+b-y^i\bigr).$$
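Putting the two update rules together gives a complete batch gradient descent loop. This sketch fits noiseless synthetic data, so the learned parameters should recover the true values; `w_true`, `b_true`, and the hyperparameters are illustrative choices, not part of the derivation:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 3, 200
w_true = np.array([1.0, -2.0, 0.5])  # illustrative ground truth
b_true = 0.7
X = rng.normal(size=(d, n))
y = w_true @ X + b_true  # noiseless labels, so an exact fit is possible

w = np.zeros(d)
b = 0.0
eta = 0.05
for _ in range(2000):
    residual = w @ X + b - y               # f(x^i) - y^i for all i
    w -= eta * (2.0 / n) * (X @ residual)  # update rule for w
    b -= eta * (2.0 / n) * residual.sum()  # update rule for b
```

After training, `w` and `b` should be very close to `w_true` and `b_true`, since the loss is convex and the data is noiseless.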

References

Zhang, Aston; Lipton, Zachary C.; Li, Mu; Smola, Alexander J. *Dive into Deep Learning*.
