Optimizing Gradient Descent Algorithms

Table of Contents

    • Optimization problem
      • Normalizing inputs
      • vanishing/exploding gradients
      • weight initialization
      • gradient check
        • Numerical approximation
        • grad check
    • Optimization algorithms
      • mini-batch gradient descent
        • mini-batch size
      • exponentially weighted averages
        • Bias correction
        • Momentum
      • RMSprop
      • Adam algorithm
      • Learning rate decay
      • Local optima

Optimization problem

Techniques that speed up the training of your neural network.

Normalizing inputs

  1. subtract mean

$$
\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)},\qquad x := x - \mu
$$

  2. normalize variance

$$
\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} \left(x^{(i)}\right)^2,\qquad x \mathrel{/}= \sigma
$$
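The two steps above can be sketched in NumPy (a minimal example with a toy design matrix whose columns are examples; the array values are made up for illustration):

```python
import numpy as np

# Toy design matrix: m = 4 examples, 2 features (rows = features, columns = examples).
x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0]])
m = x.shape[1]

# Step 1: subtract the per-feature mean.
mu = np.sum(x, axis=1, keepdims=True) / m
x = x - mu

# Step 2: divide by the per-feature standard deviation.
sigma2 = np.sum(x ** 2, axis=1, keepdims=True) / m
x = x / np.sqrt(sigma2)

print(x.mean(axis=1))        # each feature now has mean ~0
print((x ** 2).mean(axis=1)) # and variance ~1
```

Remember to normalize the test set with the *same* μ and σ computed on the training set.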

vanishing/exploding gradients

$$
y = w^{[L]}w^{[L-1]}\cdots w^{[2]}w^{[1]}x
$$
$$
w^{[l]} > I \Rightarrow (w^{[l]})^{L} \to \infty,\qquad
w^{[l]} < I \Rightarrow (w^{[l]})^{L} \to 0
$$

weight initialization

$$
\mathrm{var}(w) = \frac{1}{n^{(l-1)}},\qquad
w^{[l]} = \texttt{np.random.randn(shape)} \cdot \sqrt{\frac{1}{n^{(l-1)}}}
$$
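A minimal sketch of this initialization for a whole network; the layer sizes and the helper name `init_weights` are illustrative, not from the notes (for ReLU layers a variance of $2/n^{(l-1)}$, He initialization, is often used instead):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_weights(layer_dims):
    """layer_dims, e.g. [n_x, n_h, n_y]; scales each w[l] by sqrt(1 / n^(l-1))."""
    params = {}
    for l in range(1, len(layer_dims)):
        n_prev = layer_dims[l - 1]
        params[f"w{l}"] = rng.standard_normal((layer_dims[l], n_prev)) * np.sqrt(1.0 / n_prev)
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return params

params = init_weights([1000, 4, 1])
print(params["w1"].shape)  # (4, 1000)
print(params["w1"].var())  # empirical variance close to 1/1000
```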

gradient check

Numerical approximation

$$
f(\theta) = \theta^3,\qquad
f'(\theta) \approx \frac{f(\theta+\varepsilon) - f(\theta-\varepsilon)}{2\varepsilon}
$$
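For $f(\theta)=\theta^3$ we know $f'(\theta)=3\theta^2$, so the two-sided difference can be checked directly (the error of the two-sided formula is $O(\varepsilon^2)$, much better than the $O(\varepsilon)$ one-sided version):

```python
# Two-sided difference approximation of f'(θ) for f(θ) = θ³.
def f(theta):
    return theta ** 3

eps = 1e-2
theta = 1.0
approx = (f(theta + eps) - f(theta - eps)) / (2 * eps)
exact = 3 * theta ** 2
print(approx, exact)  # 3.0001 vs 3.0 — error on the order of eps² = 1e-4
```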

grad check

$$
d\theta_{\mathrm{approx}}[i] = \frac{J(\theta_1,\ldots,\theta_i+\varepsilon,\ldots) - J(\theta_1,\ldots,\theta_i-\varepsilon,\ldots)}{2\varepsilon} \approx d\theta[i]
$$
$$
\text{check: }\ \frac{\Vert d\theta_{\mathrm{approx}} - d\theta \Vert_2}{\Vert d\theta_{\mathrm{approx}} \Vert_2 + \Vert d\theta \Vert_2} < 10^{-7}
$$
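A minimal sketch of the check, using a toy cost $J(\theta)=\tfrac{1}{2}\Vert\theta\Vert^2$ whose true gradient is simply $\theta$ (the function name `grad_check` is illustrative):

```python
import numpy as np

def grad_check(J, grad, theta, eps=1e-7):
    """Compare an analytic gradient to the two-sided numerical approximation."""
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        approx[i] = (J(plus) - J(minus)) / (2 * eps)
    d = grad(theta)
    return np.linalg.norm(approx - d) / (np.linalg.norm(approx) + np.linalg.norm(d))

theta = np.array([1.0, -2.0, 3.0])
ratio = grad_check(lambda t: 0.5 * np.sum(t ** 2), lambda t: t, theta)
print(ratio)  # well below the 1e-7 threshold, so the "backprop" passes
```

In practice the check runs over all parameters flattened into one vector, only during debugging (it is far too slow for training), and with regularization included in $J$.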

Optimization algorithms

mini-batch gradient descent

Split the $m$ training examples into mini-batches of size 1000, giving $m/1000$ batches:

$$
[x^{(1)} \ldots x^{(m)}] \rightarrow [X^{\{1\}} \ldots X^{\{m/1000\}}]
$$

One epoch is a single pass over all mini-batches. For each $t$, forward prop on $X^{\{t\}}$:

$$
z^{[l]} = w^{[l]}X^{\{t\}} + b^{[l]},\qquad A^{[l]} = g^{[l]}(z^{[l]})
$$
$$
J^{\{t\}} = \frac{1}{1000}\sum_{i=1}^{1000}\mathcal{L}(\hat y^{(i)}, y^{(i)}) + \frac{\lambda}{2\cdot 1000}\sum_l \Vert w^{[l]} \Vert_F^2
$$

then backward prop on $J^{\{t\}}$ and update the parameters.

mini-batch size

size = m -> batch gradient descent (fine when the training set is small, m < 2000)

size = 1 -> stochastic gradient descent (loses the speed-up from vectorization)

typical mini-batch sizes are powers of two: 64, 128, 256, 512
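The shuffle-and-partition step can be sketched as follows (a minimal example; the helper name `make_mini_batches` and the toy shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mini_batches(X, Y, size=128):
    """Shuffle the m columns of X, Y together, then split into batches of `size`."""
    m = X.shape[1]
    perm = rng.permutation(m)
    X, Y = X[:, perm], Y[:, perm]
    return [(X[:, t:t + size], Y[:, t:t + size]) for t in range(0, m, size)]

X = rng.standard_normal((3, 1000))
Y = rng.integers(0, 2, (1, 1000))
batches = make_mini_batches(X, Y, size=128)
print(len(batches), batches[0][0].shape, batches[-1][0].shape)
# 8 batches: 7 full batches of 128 columns plus a final partial batch of 104
```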

exponentially weighted averages

$$
v_0 = 0\\
v_t := \beta v_{t-1} + (1-\beta)\theta_t
$$

Bias correction

$v_t$ averages over roughly the last $\frac{1}{1-\beta}$ values; to remove the bias toward 0 in the early iterations, use

$$
\frac{v_t}{1-\beta^t}
$$

in place of $v_t$.
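A tiny worked example of the running average with bias correction, feeding in a constant series $\theta_t = 10$ (all values illustrative): without correction $v_t$ starts near 0 and climbs slowly; with correction it recovers 10 immediately.

```python
# Exponentially weighted average of θ_t = 10 with β = 0.9.
beta = 0.9
theta = [10.0] * 5
v = 0.0
for t, th in enumerate(theta, start=1):
    v = beta * v + (1 - beta) * th
    v_corrected = v / (1 - beta ** t)  # bias correction for the warm-up phase
print(round(v, 4), round(v_corrected, 4))  # 4.0951 vs 10.0 after 5 steps
```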

Momentum

$$
V_{dw} = \beta V_{dw} + (1-\beta)\,dw\\
V_{db} = \beta V_{db} + (1-\beta)\,db\\
w := w - \alpha V_{dw},\qquad b := b - \alpha V_{db}
$$
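One momentum update can be sketched as below (the function name `momentum_step` and the scalar toy values are illustrative; on the first step the velocity is only $(1-\beta)$ of the raw gradient):

```python
import numpy as np

def momentum_step(w, b, dw, db, v_dw, v_db, beta=0.9, alpha=0.01):
    """One momentum update: smooth the gradients, then descend along the average."""
    v_dw = beta * v_dw + (1 - beta) * dw
    v_db = beta * v_db + (1 - beta) * db
    w = w - alpha * v_dw
    b = b - alpha * v_db
    return w, b, v_dw, v_db

w, b = np.array([1.0]), np.array([0.5])
v_dw, v_db = np.zeros(1), np.zeros(1)
w, b, v_dw, v_db = momentum_step(w, b, np.array([2.0]), np.array([1.0]), v_dw, v_db)
print(w, b)  # first step: v_dw = 0.1·2 = 0.2, so w = 1 - 0.01·0.2 = 0.998
```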

RMSprop

$$
S_{dw} = \beta_2 S_{dw} + (1-\beta_2)\,dw^2\\
S_{db} = \beta_2 S_{db} + (1-\beta_2)\,db^2\\
w := w - \alpha \frac{dw}{\sqrt{S_{dw}} + \varepsilon}
$$
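A minimal sketch of one RMSprop step (toy values; `rmsprop_step` is an illustrative name). The point of dividing by $\sqrt{S_{dw}}$ is that a very large and a very small gradient component end up taking nearly the same effective step:

```python
import numpy as np

def rmsprop_step(w, dw, s_dw, beta2=0.999, alpha=0.01, eps=1e-8):
    """RMSprop: divide each gradient by the root of its running squared average."""
    s_dw = beta2 * s_dw + (1 - beta2) * dw ** 2
    w = w - alpha * dw / (np.sqrt(s_dw) + eps)
    return w, s_dw

w = np.array([1.0, 1.0])
s = np.zeros(2)
# One huge and one tiny gradient component:
w, s = rmsprop_step(w, np.array([100.0, 0.001]), s)
print(w)  # both components move by nearly the same amount
```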

Adam algorithm

$$
V_{dw} = 0,\ S_{dw} = 0,\ V_{db} = 0,\ S_{db} = 0\\
V_{dw} = \beta_1 V_{dw} + (1-\beta_1)\,dw,\qquad V_{db} = \beta_1 V_{db} + (1-\beta_1)\,db\\
S_{dw} = \beta_2 S_{dw} + (1-\beta_2)\,dw^2,\qquad S_{db} = \beta_2 S_{db} + (1-\beta_2)\,db^2\\
V_{dw}^{\mathrm{corrected}} = \frac{V_{dw}}{1-\beta_1^t},\qquad S_{dw}^{\mathrm{corrected}} = \frac{S_{dw}}{1-\beta_2^t}\\
W := W - \alpha \frac{V_{dw}^{\mathrm{corrected}}}{\sqrt{S_{dw}^{\mathrm{corrected}}} + \varepsilon}
$$

Typical hyperparameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\varepsilon = 10^{-8}$.
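The whole update (momentum $V$, RMSprop $S$, both bias-corrected) can be sketched as one step for the $w$ parameters; the function name `adam_step` and toy values are illustrative. A useful sanity check: on the very first step the bias corrections cancel the $(1-\beta)$ factors, so the step size is roughly $\alpha$ regardless of the gradient's scale.

```python
import numpy as np

def adam_step(w, dw, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum (v) plus RMSprop (s), both bias-corrected at step t."""
    v = beta1 * v + (1 - beta1) * dw
    s = beta2 * s + (1 - beta2) * dw ** 2
    v_hat = v / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    w = w - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s

w = np.array([1.0])
v, s = np.zeros(1), np.zeros(1)
w, v, s = adam_step(w, np.array([2.0]), v, s, t=1)
print(w)  # first step ≈ alpha: 1 - 0.001 = 0.999
```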

Learning rate decay

$$
\alpha = \frac{1}{1 + \mathrm{decayRate} \cdot \mathrm{epochNumber}}\,\alpha_0,\qquad
\alpha = \frac{k}{\sqrt{\mathrm{epochNum}}}\,\alpha_0
$$
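The two schedules as code (the function names and the values $\alpha_0 = 0.2$, $k = 1$ are illustrative):

```python
# Two learning-rate decay schedules; alpha0 is the initial learning rate.
def lr_inverse(epoch, alpha0=0.2, decay_rate=1.0):
    return alpha0 / (1 + decay_rate * epoch)

def lr_sqrt(epoch, alpha0=0.2, k=1.0):
    return k / (epoch ** 0.5) * alpha0

print([round(lr_inverse(e), 3) for e in range(1, 5)])  # [0.1, 0.067, 0.05, 0.04]
```

Other common choices include exponential decay and simple manual step decay.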

Local optima

