【ML】- 001 Linear Regression 01 - Fundamentals of Gradient Descent

Basic Theory: Mathematical Derivation

  • Dataset
    Given a dataset $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(m)}, y^{(m)})\}$, where $x^{(i)} = \{x_1^{(i)}, x_2^{(i)}, \cdots, x_n^{(i)}\}$ and $y^{(i)} \in \mathbb{R}$.
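As a concrete illustration (not from the original post), here is a minimal NumPy sketch that generates a synthetic dataset of this form; the sample count $m$, feature dimension $n$, and the "true" parameters are arbitrary choices for the example:

```python
import numpy as np

# Synthetic data of the form above; m, n and the "true" parameters are arbitrary choices
m, n = 100, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(m, n))                    # row i is x^(i) = (x_1^(i), ..., x_n^(i))
true_theta = np.array([2.0, -1.0, 0.5, 3.0])   # [theta_0, theta_1, ..., theta_n]
y = true_theta[0] + X @ true_theta[1:] + 0.1 * rng.normal(size=m)   # y^(i) in R
```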

  • Function to fit (a linear function), with the usual convention $x_0^{(i)} = 1$ so that the bias $\theta_0$ is absorbed into the sum:

$$
\begin{aligned}
h_{\theta}(x^{(i)}) &= \theta_0 + \theta_1 x_1^{(i)} + \theta_2 x_2^{(i)} + \cdots + \theta_n x_n^{(i)} \\
&= \sum_{j=0}^{n} \theta_j x_j^{(i)} \\
&= \theta^T x^{(i)}
\end{aligned}
$$
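A small sketch of how $h_{\theta}$ can be evaluated in code, assuming the bias-column convention above (the function name is just for this example):

```python
import numpy as np

def hypothesis(theta, X):
    """h_theta(x^(i)) = theta^T x^(i), with x_0^(i) = 1 prepended as a bias column."""
    X_b = np.hstack([np.ones((X.shape[0], 1)), X])   # add the constant feature x_0 = 1
    return X_b @ theta                               # one prediction per sample, shape (m,)
```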

  • Loss function (objective function)

$$
J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right)^2, \qquad \min_{\theta} J(\theta)
$$
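The loss translates directly into code; a minimal sketch reusing the same bias-column convention:

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = 1/2 * sum_i (h_theta(x^(i)) - y^(i))^2."""
    X_b = np.hstack([np.ones((X.shape[0], 1)), X])   # same bias-column convention as above
    residuals = X_b @ theta - y
    return 0.5 * np.sum(residuals ** 2)
```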

  • Solving with gradient descent

  • Batch Gradient Descent ($m$ is the number of samples)

$$
\begin{aligned}
\theta_j &:= \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \\
&= \theta_j - \alpha \frac{\partial}{\partial \theta_j} \frac{1}{2} \sum_{i=1}^{m} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right)^2 \\
&= \theta_j - \alpha \sum_{i=1}^{m} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) \frac{\partial}{\partial \theta_j} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) \\
&= \theta_j - \alpha \sum_{i=1}^{m} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
\end{aligned}
$$
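A sketch of the batch update, using the vectorized form $X^T(X\theta - y)$ of the summed gradient; the learning rate and iteration count are illustrative defaults, not tuned values:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.001, n_iters=1000):
    """Batch GD: every update uses all m samples."""
    X_b = np.hstack([np.ones((X.shape[0], 1)), X])
    theta = np.zeros(X_b.shape[1])
    for _ in range(n_iters):
        grad = X_b.T @ (X_b @ theta - y)   # sum_i (h(x^(i)) - y^(i)) x_j^(i), for all j at once
        theta -= alpha * grad
    return theta
```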

  • Stochastic Gradient Descent (one sample chosen at random per update)

$$
\begin{aligned}
\theta_j &:= \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \\
&= \theta_j - \alpha \frac{\partial}{\partial \theta_j} \frac{1}{2} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right)^2 \\
&= \theta_j - \alpha \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) \frac{\partial}{\partial \theta_j} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) \\
&= \theta_j - \alpha \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
\end{aligned}
$$
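A corresponding sketch of the single-sample update; the epoch count and the per-epoch shuffling are choices for the example, not part of the derivation:

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, n_epochs=50, seed=0):
    """SGD: each update uses one randomly chosen sample."""
    X_b = np.hstack([np.ones((X.shape[0], 1)), X])
    m = X_b.shape[0]
    theta = np.zeros(X_b.shape[1])
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs):
        for i in rng.permutation(m):
            error = X_b[i] @ theta - y[i]      # h(x^(i)) - y^(i), a scalar
            theta -= alpha * error * X_b[i]    # update from a single sample
    return theta
```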

  • Mini-batch Stochastic Gradient Descent ($b$ is the batch size)

$$
\begin{aligned}
\theta_j &:= \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \\
&= \theta_j - \alpha \frac{\partial}{\partial \theta_j} \frac{1}{2} \sum_{i=1}^{b} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right)^2 \\
&= \theta_j - \alpha \sum_{i=1}^{b} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) \frac{\partial}{\partial \theta_j} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) \\
&= \theta_j - \alpha \sum_{i=1}^{b} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
\end{aligned}
$$
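A sketch of the mini-batch variant; `batch_size` (the $b$ above) and the other hyperparameters are again only illustrative:

```python
import numpy as np

def minibatch_gradient_descent(X, y, alpha=0.005, batch_size=16, n_epochs=50, seed=0):
    """Mini-batch SGD: each update uses b = batch_size samples."""
    X_b = np.hstack([np.ones((X.shape[0], 1)), X])
    m = X_b.shape[0]
    theta = np.zeros(X_b.shape[1])
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs):
        idx = rng.permutation(m)
        for start in range(0, m, batch_size):
            batch = idx[start:start + batch_size]
            grad = X_b[batch].T @ (X_b[batch] @ theta - y[batch])   # summed gradient over the batch
            theta -= alpha * grad
    return theta
```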

  • Comparison of the three gradient descent variants (a quick comparison run is sketched after this list):

  1. BGD: converges to the global optimum (the least-squares objective is convex), and with a suitably small learning rate every update decreases the loss; training is slow when the number of samples is large.
  2. SGD: trains fast; individual updates are noisy, so a given iteration is not guaranteed to move toward the overall optimum, and more iterations are needed.
  3. Mini-batch SGD: a compromise between the two.
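Assuming the sketches above and the synthetic `X`, `y` from the dataset example, a quick (hypothetical) comparison run could look like this:

```python
# Run all three variants on the same synthetic data; defaults are the illustrative ones above.
theta_bgd = batch_gradient_descent(X, y)
theta_sgd = stochastic_gradient_descent(X, y)
theta_mb  = minibatch_gradient_descent(X, y)
print(theta_bgd, theta_sgd, theta_mb)   # each should land near the "true" parameters
```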
