In machine learning, a dataset is usually organized as follows. The features and label of the $i$-th sample are written as:
$$x^i=(x_1^i,x_2^i,x_3^i,\ldots,x_d^i)^T \in R^d, \qquad y^i \in R$$
The full dataset can then be written as:
$$X=[x^1,x^2,\ldots,x^n]=\begin{bmatrix} x^1_1 & x^2_1 & \ldots & x^n_1 \\ x^1_2 & x^2_2 & \ldots & x^n_2 \\ \vdots & \vdots & \ddots & \vdots \\ x^1_d & x^2_d & \ldots & x^n_d \end{bmatrix} \in R^{d\times n}, \qquad y=[y^1,y^2,\ldots,y^n] \in R^n$$
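As a concrete illustration of this layout, the sketch below builds a small random dataset with NumPy. The sizes and values are invented; the only thing it assumes is the column-per-sample convention $X \in R^{d\times n}$ above.

```python
import numpy as np

# Toy dataset: d = 3 features, n = 4 samples (hypothetical numbers).
# Each column X[:, i] is one sample x^i, matching X in R^(d x n) above.
rng = np.random.default_rng(0)
d, n = 3, 4
X = rng.standard_normal((d, n))  # feature matrix, shape (d, n)
y = rng.standard_normal(n)       # label vector, shape (n,)

print(X.shape, y.shape)          # (3, 4) (4,)
```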
The weight vector $w$ of linear regression is by default written as
$$w=[w_1,w_2,\ldots,w_d]^T \in R^d$$
For the $i$-th sample, the linear regression model can be written as:

$$f(x^i) = w^T x^i + b = w_1x_1^i + w_2x_2^i + \ldots + w_dx_d^i + b$$
For the feature matrix $X$, the vector of predictions $\hat{y}$ can be expressed with a matrix-vector product ($b$ is added to every entry):

$$\hat{y} = w^T X + b$$
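This vectorized prediction maps directly onto NumPy's `@` operator. A minimal sketch with made-up numbers, checking that it matches the per-sample form $f(x^i)=w^Tx^i+b$:

```python
import numpy as np

# Hypothetical weights and data; columns of X are samples, as above.
d, n = 3, 4
X = np.arange(d * n, dtype=float).reshape(d, n)
w = np.ones(d)
b = 0.5

y_hat = w @ X + b   # one prediction per sample, shape (n,)
print(y_hat)        # → [12.5 15.5 18.5 21.5]

# Equivalent per-sample loop, for comparison:
y_loop = np.array([w @ X[:, i] + b for i in range(n)])
assert np.allclose(y_hat, y_loop)
```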
Taking the mean squared error (MSE) as the loss function, the loss (also called the cost function) can be written as
$$L =\frac{1}{n} \sum_{i=1}^{n}\left(f(x^i)-y^i\right)^2 =\frac{1}{n} \sum_{i=1}^{n}\left(w_1x_1^i+ w_2x_2^i+\ldots+w_dx_d^i+b-y^i\right)^2$$
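A quick numerical check of this formula, on invented numbers: the vectorized mean of squared residuals equals the explicit per-sample sum.

```python
import numpy as np

# Made-up data: d = 2 features, n = 3 samples.
X = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 0.0]])
y = np.array([1.0, 3.0, 2.0])
w = np.array([0.5, -1.0])
b = 0.1
n = X.shape[1]

residuals = w @ X + b - y     # f(x^i) - y^i for every i at once
L = np.mean(residuals ** 2)   # vectorized MSE

# Same loss, written as the explicit sum from the formula:
L_loop = sum((w @ X[:, i] + b - y[i]) ** 2 for i in range(n)) / n
assert np.isclose(L, L_loop)
```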
Introducing the gradient symbol $\nabla$, we have
$$\nabla L(w)=\begin{bmatrix} \frac{\partial L}{\partial w_1} \\ \frac{\partial L}{\partial w_2} \\ \vdots \\ \frac{\partial L}{\partial w_d} \end{bmatrix}, \qquad \nabla L(b) = \left[\frac{\partial L}{\partial b}\right] =\frac{\partial L}{\partial b}$$
The general gradient-descent update rule can therefore be written as
$$w \leftarrow w - \eta \nabla L(w), \qquad b \leftarrow b - \eta \nabla L(b) = b - \eta \frac{\partial L}{\partial b}$$
where $\eta$ is the learning rate.
For the weight $w_j$, taking the partial derivative of $L$ gives
$$\frac{\partial L}{\partial w_j}=\frac{2}{n} \sum_{i=1}^{n}\left(w_1x_1^i+ w_2x_2^i+\ldots+w_dx_d^i+b-y^i\right)x_j^i = \frac{2}{n} \sum_{i=1}^{n}\left(f(x^i)-y^i\right)x_j^i$$
The gradient-descent update for $w_j$ can then be written as
$$w_j=w_j-\eta\frac{\partial L}{\partial w_j} = w_j -\eta\frac{2}{n} \sum_{i=1}^{n}\left(f(x^i)-y^i\right)x_j^i$$
Likewise, the updates for $w_1, w_2, \ldots, w_d$ can each be written as
$$w_1=w_1-\eta\frac{\partial L}{\partial w_1} = w_1 -\eta \frac{2}{n} \sum_{i=1}^{n}(f(x^i)-y^i)x_1^i \\ w_2=w_2-\eta\frac{\partial L}{\partial w_2} = w_2 -\eta \frac{2}{n} \sum_{i=1}^{n}(f(x^i)-y^i)x_2^i \\ \vdots \\ w_d=w_d-\eta\frac{\partial L}{\partial w_d} = w_d -\eta \frac{2}{n} \sum_{i=1}^{n}(f(x^i)-y^i)x_d^i$$
Inspecting these expressions, note that $(f(x^i)-y^i)$ is a scalar, while
$$w=[w_1,w_2,\ldots,w_d]^T \in R^d, \qquad x^i=(x_1^i,x_2^i,x_3^i,\ldots,x_d^i)^T \in R^d$$
so the gradient-descent update for the weight vector $w$ can be written as
$$\begin{bmatrix} w_1\\ w_2\\ \vdots \\ w_d \end{bmatrix} = \begin{bmatrix} w_1\\ w_2\\ \vdots \\ w_d \end{bmatrix} - \eta \begin{bmatrix} \frac{2}{n} \sum_{i=1}^{n}(f(x^i)-y^i)x_1^i\\ \frac{2}{n} \sum_{i=1}^{n}(f(x^i)-y^i)x_2^i\\ \vdots\\ \frac{2}{n} \sum_{i=1}^{n}(f(x^i)-y^i)x_d^i \end{bmatrix} = \begin{bmatrix} w_1\\ w_2\\ \vdots \\ w_d \end{bmatrix} - \eta\frac{2}{n} \sum_{i=1}^{n}(f(x^i)-y^i) \begin{bmatrix} x_1^i\\ x_2^i\\ \vdots\\ x_d^i \end{bmatrix}$$
The expression above therefore collapses to
$$w=w-\eta \frac{2}{n} \sum_{i=1}^{n}\left(f(x^i)-y^i\right) x^i$$
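The stacked column form above is exactly a matrix-vector product: with samples as columns of $X$, the sum $\sum_i (f(x^i)-y^i)\,x^i$ is $X$ applied to the residual vector. A sketch on invented numbers, checking this against the per-coordinate sums from the derivation:

```python
import numpy as np

# Toy values for illustration only.
rng = np.random.default_rng(1)
d, n = 3, 5
X = rng.standard_normal((d, n))
y = rng.standard_normal(n)
w = rng.standard_normal(d)
b = 0.0

err = w @ X + b - y              # f(x^i) - y^i, shape (n,)
grad_w = (2.0 / n) * (X @ err)   # vectorized gradient, shape (d,)

# Component by component, exactly as in the derivation:
grad_loop = np.array([(2.0 / n) * np.sum(err * X[j]) for j in range(d)])
assert np.allclose(grad_w, grad_loop)
```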
that is,
$$w=w-\eta \frac{2}{n} \sum_{i=1}^{n}\left(w^Tx^i+b-y^i\right) x^i$$
The gradient with respect to the bias $b$ follows in the same way:
$$\frac{\partial L}{\partial b}=\frac{2}{n} \sum_{i=1}^{n}\left(w_1x_1^i+ w_2x_2^i+\ldots+w_dx_d^i+b-y^i\right) = \frac{2}{n} \sum_{i=1}^{n}\left(f(x^i)-y^i\right)$$
and the corresponding update rule is
$$b = b -\eta \frac{\partial L}{\partial b} = b - \eta \frac{2}{n} \sum_{i=1}^{n}\left(f(x^i)-y^i\right)$$
that is:

$$b = b - \eta \frac{2}{n} \sum_{i=1}^{n}\left(f(x^i)-y^i\right) = b - \eta \frac{2}{n} \sum_{i=1}^{n}\left(w^Tx^i+b-y^i\right)$$
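Putting the two update rules together gives a complete gradient-descent loop. The sketch below fits noiseless synthetic data; the values of `w_true`, `b_true`, the learning rate, and the iteration count are all illustrative assumptions, not from the text.

```python
import numpy as np

# Synthetic data generated from known (made-up) parameters.
rng = np.random.default_rng(42)
d, n = 2, 200
X = rng.standard_normal((d, n))
w_true, b_true = np.array([2.0, -3.0]), 0.7
y = w_true @ X + b_true

w = np.zeros(d)
b = 0.0
eta = 0.1
for _ in range(500):
    err = w @ X + b - y               # f(x^i) - y^i, shape (n,)
    w -= eta * (2.0 / n) * (X @ err)  # w ← w - η ∇L(w)
    b -= eta * (2.0 / n) * err.sum()  # b ← b - η ∂L/∂b

# On noiseless data the loop should recover w_true and b_true.
print(w, b)
```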
Reference: Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola, *Dive into Deep Learning*.