Deep learning optimization methods (unfinished, to be edited)

SGD(Stochastic Gradient Descent)

ASGD(Averaged Stochastic Gradient Descent)

CG(Conjugate Gradient)

LBFGS(Limited-memory Broyden-Fletcher-Goldfarb-Shanno)

SGD: stochastic gradient descent


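SGD addresses two problems of gradient descent: slow convergence and getting trapped in local optima. The modification is a slightly different weight-update rule.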

Stochastic gradient descent (often shortened to SGD) is a stochastic approximation of the gradient descent optimization method for minimizing an objective function that is written as a sum of differentiable functions.
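
In symbols, a sketch of that setup, assuming the objective has n terms Q_i and writing \eta for the step size: batch gradient descent steps along the full gradient, while SGD replaces it with the gradient of a single term at each step.

  • Q(w) = \sum_{i=1}^n Q_i(w)

  • \text{batch gradient descent:} \quad w := w - \eta \nabla Q(w) = w - \eta \sum_{i=1}^n \nabla Q_i(w)

  • \text{stochastic gradient descent:} \quad w := w - \eta \nabla Q_i(w)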


  • Choose an initial vector of parameters w and learning rate \eta.

  • Repeat until an approximate minimum is obtained:

    • Randomly shuffle examples in the training set.

    • For i = 1, 2, ..., n, do:

      • w := w - \eta \nabla Q_i(w).


Let's suppose we want to fit a straight line y = w_1 + w_2 x to a training set of two-dimensional points (x_1, y_1), \ldots, (x_n, y_n) using least squares. The objective function to be minimized is:

Q(w) = \sum_{i=1}^n Q_i(w) = \sum_{i=1}^n \left(w_1 + w_2 x_i - y_i\right)^2.

The last line in the above pseudocode for this specific problem will become:

\begin{bmatrix} w_1 \\ w_2 \end{bmatrix} :=     \begin{bmatrix} w_1 \\ w_2 \end{bmatrix}     -  \eta  \begin{bmatrix} 2 (w_1 + w_2 x_i - y_i) \\ 2 x_i(w_1 + w_2 x_i - y_i) \end{bmatrix}.
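
The update above can be tried out directly. The following is a minimal sketch, not a reference implementation: the function name `sgd_line_fit`, the step size, the fixed epoch count (standing in for "until an approximate minimum is obtained"), and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def sgd_line_fit(x, y, eta=0.05, epochs=100, seed=0):
    """Fit y = w1 + w2 * x by SGD on the squared-error terms Q_i(w)."""
    rng = np.random.default_rng(seed)
    w1, w2 = 0.0, 0.0                      # initial parameter vector w
    n = len(x)
    for _ in range(epochs):                # fixed number of passes (assumed stopping rule)
        for i in rng.permutation(n):       # visit the examples in a random order
            err = w1 + w2 * x[i] - y[i]    # residual of example i at the current w
            w1 -= eta * 2.0 * err          # dQ_i/dw1 = 2 (w1 + w2 x_i - y_i)
            w2 -= eta * 2.0 * x[i] * err   # dQ_i/dw2 = 2 x_i (w1 + w2 x_i - y_i)
    return w1, w2

# Noise-free points on the line y = 1 + 2x (synthetic data, assumed for the demo)
x = np.linspace(-1.0, 1.0, 20)
y = 1.0 + 2.0 * x
w1, w2 = sgd_line_fit(x, y)
print(w1, w2)
```

Because the data lie exactly on a line, every Q_i is minimized at the same w, so the iterates settle near w_1 = 1, w_2 = 2. With noisy data a constant step size would leave the iterates fluctuating around the least-squares solution.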







[Figure: gradient descent vs. stochastic gradient descent]

ASGD: averaged stochastic gradient descent
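
ASGD is commonly described as plain SGD plus a running average of the iterates (Polyak-Ruppert averaging), with the average reported as the result. The sketch below works under that assumption and reuses the line-fitting problem from the SGD section. The function name `asgd_line_fit`, the step size, and the epoch count are hypothetical choices.

```python
import numpy as np

def asgd_line_fit(x, y, eta=0.05, epochs=200, seed=0):
    """Run SGD on y = w1 + w2 * x and also track the running mean of the iterates."""
    rng = np.random.default_rng(seed)
    w = np.zeros(2)          # current SGD iterate (w1, w2)
    w_avg = np.zeros(2)      # running average of all iterates so far
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(x)):
            err = w[0] + w[1] * x[i] - y[i]
            w = w - eta * 2.0 * err * np.array([1.0, x[i]])   # ordinary SGD step
            t += 1
            w_avg += (w - w_avg) / t                          # incremental mean update
    return w, w_avg

x = np.linspace(-1.0, 1.0, 20)
y = 1.0 + 2.0 * x            # noise-free synthetic data (assumed)
w, w_avg = asgd_line_fit(x, y)
print(w, w_avg)
```

On noise-free data like this, the average lags the last iterate because it still contains the early, inaccurate iterates. Averaging is usually motivated by noisy gradients, where it smooths out the fluctuations of the raw iterates.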





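CG: conjugate gradient

The conjugate gradient method solves the linear system Ax = b, where A is a symmetric positive-definite n × n matrix, by expanding the solution x in a set of mutually conjugate vectors p_k. If we choose the conjugate vectors p_k carefully, we may not need all of them to obtain a good approximation to the solution x. We therefore regard the conjugate gradient method as an iterative method. This also allows us to approximately solve systems where n is so large that the direct method would take too much time.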

We denote the initial guess for x by x_0. We can assume without loss of generality that x_0 = 0 (otherwise, consider the system Az = b − Ax_0 instead). Starting from x_0, we search for the solution, and in each iteration we need a metric that tells us whether we are closer to the solution x, which is unknown to us. This metric comes from the fact that the solution x is also the unique minimizer of the following quadratic function, so if f(x) becomes smaller in an iteration, we are closer to x (measured in the norm induced by A).

  • f(\mathbf{x}) = \tfrac12 \mathbf{x}^\mathsf{T} \mathbf{A}\mathbf{x} - \mathbf{x}^\mathsf{T} \mathbf{b}, \qquad \mathbf{x}\in\mathbf{R}^n.
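
A quick numerical check of this claim, as a sketch: the 2 × 2 matrix A, the vector b, and the random perturbations below are made-up values, chosen only so that A is symmetric positive definite.

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])       # symmetric positive definite (hypothetical example)
b = np.array([1.0, 2.0])

def f(x):
    """Quadratic form f(x) = 1/2 x^T A x - x^T b."""
    return 0.5 * x @ A @ x - x @ b

x_star = np.linalg.solve(A, b)   # exact solution of A x = b

# Evaluate f at a few random points around the solution
rng = np.random.default_rng(0)
perturbed = [x_star + rng.normal(size=2) for _ in range(5)]
values = [f(p) for p in perturbed]
print(f(x_star), min(values))
```

Every perturbed point gives a larger value than f(x_star), since f(x_star + d) − f(x_star) = ½ dᵀAd, which is positive for any nonzero d when A is positive definite.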

This suggests taking the first basis vector p_0 to be the negative of the gradient of f at x = x_0. The gradient of f equals Ax − b. Starting with a guessed solution x_0 (we can always take x_0 = 0 if we have no reason to guess anything else), this means we take p_0 = b − Ax_0. The other vectors in the basis will be conjugate to the gradient, hence the name conjugate gradient method.
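
For completeness, a short derivation of that gradient, assuming A is symmetric (as the conjugate gradient method requires). Setting the gradient to zero recovers Ax = b, which is why the minimizer of f is the solution of the linear system.

  • \nabla f(\mathbf{x}) = \tfrac12 (\mathbf{A} + \mathbf{A}^\mathsf{T}) \mathbf{x} - \mathbf{b} = \mathbf{A}\mathbf{x} - \mathbf{b}

  • \mathbf{p}_0 = -\nabla f(\mathbf{x}_0) = \mathbf{b} - \mathbf{A}\mathbf{x}_0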

Let r_k be the residual at the k-th step:

  • \mathbf{r}_k = \mathbf{b} - \mathbf{Ax}_k.

Note that r_k is the negative gradient of f at x = x_k, so the gradient descent method would move in the direction r_k. Here, we insist instead that the directions p_k be conjugate to each other. We also require that the next search direction be built out of the current residual and all previous search directions, which is reasonable enough in practice.

The conjugation constraint is an orthonormal-type constraint and hence the algorithm bears resemblance to Gram-Schmidt orthonormalization.

This gives the following expression:

  • \mathbf{p}_{k} = \mathbf{r}_{k} - \sum_{i < k}\frac{\mathbf{p}_i^\mathsf{T} \mathbf{A} \mathbf{r}_{k}}{\mathbf{p}_i^\mathsf{T}\mathbf{A} \mathbf{p}_i} \mathbf{p}_i
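
The Gram-Schmidt-like character of this formula can be seen by applying it to any set of linearly independent vectors, not only residuals. The sketch below does that for the standard basis. The helper name `conjugate_directions` and the 3 × 3 matrix are assumptions for illustration.

```python
import numpy as np

def conjugate_directions(A, vectors):
    """Make `vectors` mutually A-conjugate using the expression for p_k."""
    directions = []
    for r in vectors:
        p = r.astype(float).copy()
        for p_i in directions:
            # subtract the component of r along p_i, measured in the A-inner product
            p -= (p_i @ A @ r) / (p_i @ A @ p_i) * p_i
        directions.append(p)
    return directions

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])          # symmetric positive definite (hypothetical example)
P = np.array(conjugate_directions(A, np.eye(3)))
gram = P @ A @ P.T                        # entry (i, j) is p_i^T A p_j
print(np.round(gram, 12))
```

The off-diagonal entries of `gram` vanish up to rounding, which is exactly the conjugacy condition p_i^T A p_j = 0 for i ≠ j.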

Following this direction, the next optimal location is given by

  • \mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \mathbf{p}_k

with

  • \alpha_{k} = \frac{\mathbf{p}_k^\mathsf{T} \mathbf{b}}{\mathbf{p}_k^\mathsf{T} \mathbf{A} \mathbf{p}_k} = \frac{\mathbf{p}_k^\mathsf{T} (\mathbf{r}_{k}+\mathbf{Ax}_{k})}{\mathbf{p}_{k}^\mathsf{T} \mathbf{A} \mathbf{p}_{k}} = \frac{\mathbf{p}_{k}^\mathsf{T} \mathbf{r}_{k}}{\mathbf{p}_{k}^\mathsf{T} \mathbf{A} \mathbf{p}_{k}},

where the last equality holds because p_k and x_k are conjugate: with x_0 = 0, x_k is a linear combination of p_0, ..., p_{k-1}, so p_k^T A x_k = 0.
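
Putting the pieces together, the iteration described in this section can be sketched as follows. This is a direct transcription under stated assumptions, not a production solver: the function name `conjugate_gradient`, the tolerance, and the test system are hypothetical, x_0 = 0 is assumed, and the step size is computed from the current residual, which coincides with p_k^T b / (p_k^T A p_k) when x_0 = 0.

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10):
    """Solve A x = b for symmetric positive-definite A, starting from x_0 = 0."""
    n = len(b)
    x = np.zeros(n)
    directions = []                          # all previous search directions p_i
    for _ in range(n):                       # at most n steps in exact arithmetic
        r = b - A @ x                        # residual = negative gradient of f at x
        if np.linalg.norm(r) < tol:
            break
        p = r.copy()
        for p_i in directions:               # make p conjugate to earlier directions
            p -= (p_i @ A @ r) / (p_i @ A @ p_i) * p_i
        alpha = (p @ r) / (p @ A @ p)        # step length along p
        x = x + alpha * p
        directions.append(p)
    return x

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])              # symmetric positive definite (hypothetical example)
b = np.array([1.0, 2.0, 3.0])
x = conjugate_gradient(A, b)
print(x, np.linalg.solve(A, b))
```

This version stores every previous direction and evaluates the full sum, mirroring the text. Practical implementations use a short recurrence that needs only the most recent direction and residual.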


