Numerical Optimization: Understanding L-BFGS


Numerical optimization is at the core of much of machine learning. Once you have chosen a model and a dataset, estimating the model's parameters typically amounts to minimizing a multivariate function f(x) with a numerical optimization method:

x* = arg min_x f(x)


The solution x* of this optimization problem is the optimal set of model parameters.

In this post, I focus on how L-BFGS solves unconstrained optimization problems; it is one of the most common methods for optimization in machine learning, with stochastic gradient descent being the other popular family. At the end, I'll also touch on AdaDelta, which I'm rather fond of.

Note: Throughout the post, I’ll assume you remember multivariable calculus. So if you don’t recall what a gradient or Hessian is, you’ll want to bone up first.


Newton's Method

Most numerical optimization procedures are iterative algorithms which consider a sequence of 'guesses' x_n that ultimately converge to x*, the true global minimizer of f. Suppose we have an estimate x_n and we want our next estimate x_{n+1} to have the property that f(x_{n+1}) < f(x_n).

Newton's method is centered around a quadratic approximation of f for points near x_n. Assuming that f is twice-differentiable, we can form a quadratic approximation of f for points 'near' a fixed point x using a Taylor expansion:

f(x + Δx) ≈ f(x) + Δx^T ∇f(x) + (1/2) Δx^T (∇²f(x)) Δx

where ∇f(x) and ∇²f(x) are the gradient and Hessian of f at the point x. This approximation holds in the limit as ‖Δx‖ → 0. It is a generalization of the single-dimensional Taylor polynomial expansion you might remember from calculus.

In order to simplify much of the notation, we're going to think of our iterative algorithm as producing a sequence of such quadratic approximations h_n. Without loss of generality, we can write x_{n+1} = x_n + Δx and rewrite the above equation as

h_n(Δx) = f(x_n) + Δx^T g_n + (1/2) Δx^T H_n Δx

where g_n and H_n represent the gradient and Hessian of f at x_n.

We want to choose Δx to minimize this local quadratic approximation of f at x_n. Differentiating the above with respect to Δx yields:

∂h_n(Δx)/∂Δx = g_n + H_n Δx

Recall that any Δx for which ∂h_n(Δx)/∂Δx = 0 is a local extremum of h_n(⋅). If we assume that H_n is positive definite (psd), then we know this Δx is also a global minimum of h_n(⋅). Solving for Δx: [2]

Δx = −H_n^{-1} g_n

This suggests −H_n^{-1} g_n as a good direction in which to move x_n. In practice, we set x_{n+1} = x_n − α (H_n^{-1} g_n) for a value of α where f(x_{n+1}) is 'sufficiently' smaller than f(x_n).
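As a quick worked example (my own aside, not part of the original derivation): for a strictly convex quadratic, a single full Newton step (α = 1) lands exactly on the minimizer. Take

f(x) = (1/2) x^T A x − b^T x,   so   ∇f(x) = A x − b   and   ∇²f(x) = A,   with A positive definite.

Starting from any x_0, the Newton step gives

x_1 = x_0 − A^{-1}(A x_0 − b) = A^{-1} b,

which is exactly the point where ∇f vanishes, i.e. x* itself. For non-quadratic f the quadratic model is only accurate locally, which is why the step must be repeated and damped by α.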

The iterative algorithm

The above suggests an iterative algorithm:

NewtonRaphson(f, x_0):
  For n = 0, 1, … (until converged):
    Compute g_n and H_n^{-1} for x_n
    d = H_n^{-1} g_n
    α ← arg min_{α ≥ 0} f(x_n − α d)
    x_{n+1} ← x_n − α d

The computation of the α step-size can use any number of line search algorithms. The simplest of these is backtracking line search, where you simply try smaller and smaller values of α until the function value is 'small enough'.
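A minimal sketch of backtracking line search in Java (the class and method names here are my own, not part of the post's interfaces; a production implementation would also check an Armijo sufficient-decrease condition):

public final class BacktrackingLineSearch {
  // f: function value at a point; x: current point; d: descent direction (we step x - alpha*d)
  public static double findStepSize(java.util.function.Function<double[], Double> f,
                                    double[] x, double[] d) {
    double alpha = 1.0;
    final double shrink = 0.5;        // halve alpha on each failure
    final double fx = f.apply(x);     // baseline value f(x)
    for (int iter = 0; iter < 50; iter++) {
      double[] trial = new double[x.length];
      for (int i = 0; i < x.length; i++) {
        trial[i] = x[i] - alpha * d[i];
      }
      if (f.apply(trial) < fx) {      // accept the first alpha that decreases f
        return alpha;
      }
      alpha *= shrink;
    }
    return alpha; // give up and return the last (tiny) alpha
  }
}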

In terms of software engineering, we can treat NewtonRaphson as a black box for any twice-differentiable function which satisfies the Java interface:

public interface TwiceDifferentiableFunction {
  // compute f(x)
  public double valueAt(double[] x);

  // compute grad f(x)
  public double[] gradientAt(double[] x);

  // compute inverse hessian H^-1
  public double[][] inverseHessian(double[] x);
}
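For concreteness, here is a toy implementation of this interface (the class name and the particular quadratic are my own example): f(x) = (1/2) Σ_i a_i x_i², whose Hessian is diag(a) and whose inverse Hessian is therefore diag(1/a).

// Toy example: f(x) = 0.5 * sum_i a_i * x_i^2, a separable convex quadratic.
// Gradient: (a_i * x_i)_i.  Hessian: diag(a), so H^{-1} = diag(1/a).
public class DiagonalQuadratic implements TwiceDifferentiableFunction {
  private final double[] a; // per-coordinate curvatures, all assumed > 0

  public DiagonalQuadratic(double[] a) { this.a = a; }

  public double valueAt(double[] x) {
    double v = 0.0;
    for (int i = 0; i < x.length; i++) v += 0.5 * a[i] * x[i] * x[i];
    return v;
  }

  public double[] gradientAt(double[] x) {
    double[] g = new double[x.length];
    for (int i = 0; i < x.length; i++) g[i] = a[i] * x[i];
    return g;
  }

  public double[][] inverseHessian(double[] x) {
    double[][] inv = new double[x.length][x.length];
    for (int i = 0; i < x.length; i++) inv[i][i] = 1.0 / a[i];
    return inv;
  }
}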

With quite a bit of tedious math, you can prove that for a convex function the above procedure will converge to a unique global minimizer x*, [1] regardless of the choice of x_0. For the non-convex functions that arise in ML (almost all latent variable models or deep nets), the procedure still works but is only guaranteed to converge to a local minimum. In practice, for non-convex optimization, users need to pay more attention to initialization and other algorithmic details.

Huge Hessians

The central issue with NewtonRaphson is that we need to be able to compute the inverse Hessian matrix. [3] Note that for ML applications, the dimensionality of the input to f typically corresponds to the number of model parameters. It's not unusual to have hundreds of millions of parameters, or in some vision applications even billions. For these reasons, computing the Hessian or its inverse is often impractical. For many functions, the Hessian may not even be analytically computable, let alone representable.

Because of this, NewtonRaphson is rarely used in practice to optimize functions corresponding to large problems. Luckily, the above algorithm can still work even if H_n^{-1} doesn't correspond to the exact inverse Hessian at x_n, but is instead a good approximation of it.

Quasi-Newton

Suppose that instead of requiring H_n^{-1} to be the inverse Hessian at x_n, we think of it as an approximation of this information. We can generalize NewtonRaphson to take a QuasiUpdate policy which is responsible for producing a sequence of H_n^{-1} estimates.

QuasiNewton(f, x_0, H_0^{-1}, QuasiUpdate):
  For n = 0, 1, … (until converged):
    // Compute search direction and step-size
    d = H_n^{-1} g_n
    α ← arg min_{α ≥ 0} f(x_n − α d)
    x_{n+1} ← x_n − α d
    // Store the input and gradient deltas
    g_{n+1} ← ∇f(x_{n+1})
    s_{n+1} ← x_{n+1} − x_n
    y_{n+1} ← g_{n+1} − g_n
    // Update the inverse Hessian estimate
    H_{n+1}^{-1} ← QuasiUpdate(H_n^{-1}, s_{n+1}, y_{n+1})

We've assumed that QuasiUpdate only requires the previous inverse Hessian estimate as well as the input and gradient differences (s_n and y_n respectively). Note that if QuasiUpdate just returns the exact inverse Hessian at x_{n+1}, we recover exact NewtonRaphson.

In terms of software, we can black-box optimize an arbitrary differentiable function (with no need to compute second derivatives) using QuasiNewton, assuming we have a quasi-Newton approximation update policy. In Java this might look like:

public interface DifferentiableFunction {
  // compute f(x)
  public double valueAt(double[] x);

  // compute grad f(x)
  public double[] gradientAt(double[] x);
}

public interface QuasiNewtonApproximation {
  // update the H^{-1} estimate (using x_{n+1}-x_n and grad_{n+1}-grad_n)
  public void update(double[] deltaX, double[] deltaGrad);

  // compute H^{-1} * direction using the current H^{-1} estimate
  public double[] inverseHessianMultiply(double[] direction);
}

Note that the only use we have of the Hessian is via its product with the gradient direction. This will become useful for the L-BFGS algorithm described below, since we don't need to represent the Hessian approximation in memory. If you want to see these abstractions in action, here's a link to a Java 8 and golang implementation I've written.
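To make the control flow concrete, here is a rough sketch of a driver for the QuasiNewton loop through these two interfaces (my own sketch; it reuses the hypothetical BacktrackingLineSearch helper above, and convergence tests and step-size safeguards are elided):

// Sketch of the QuasiNewton driver using the interfaces above.
public class QuasiNewtonMinimizer {
  public static double[] minimize(DifferentiableFunction f,
                                  QuasiNewtonApproximation approx,
                                  double[] x0, int maxIters) {
    double[] x = x0.clone();
    double[] grad = f.gradientAt(x);
    for (int n = 0; n < maxIters; n++) {
      // Search direction d = H^{-1} g, using only a matrix-vector product.
      double[] d = approx.inverseHessianMultiply(grad);
      // Step size via (hypothetical) backtracking line search.
      double alpha = BacktrackingLineSearch.findStepSize(f::valueAt, x, d);
      // Take the step x_{n+1} = x_n - alpha * d and record the deltas.
      double[] xNext = new double[x.length];
      for (int i = 0; i < x.length; i++) xNext[i] = x[i] - alpha * d[i];
      double[] gradNext = f.gradientAt(xNext);
      double[] deltaX = new double[x.length];
      double[] deltaGrad = new double[x.length];
      for (int i = 0; i < x.length; i++) {
        deltaX[i] = xNext[i] - x[i];
        deltaGrad[i] = gradNext[i] - grad[i];
      }
      // Let the approximation absorb the new curvature information.
      approx.update(deltaX, deltaGrad);
      x = xNext;
      grad = gradNext;
    }
    return x;
  }
}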

Behave like a Hessian

What form should QuasiUpdate take? Well, if we have QuasiUpdate always return the identity matrix (ignoring its inputs), then this corresponds to simple gradient descent, since the search direction is always ∇f_n. While this actually yields a valid procedure which will converge to x* for convex f, intuitively this choice of QuasiUpdate isn't attempting to capture second-order information about f.
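In terms of the QuasiNewtonApproximation interface above, this degenerate choice is just the following (a small sketch of mine):

// A QuasiNewtonApproximation that ignores its updates and always acts as the
// identity matrix: QuasiNewton with this policy is plain gradient descent.
public class IdentityApproximation implements QuasiNewtonApproximation {
  public void update(double[] deltaX, double[] deltaGrad) {
    // Ignore all curvature information.
  }

  public double[] inverseHessianMultiply(double[] direction) {
    return direction.clone(); // H^{-1} = I, so H^{-1} d = d
  }
}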

Let's think about our choice of H_n as an approximation for f near x_n:

h_n(d) = f(x_n) + d^T g_n + (1/2) d^T H_n d

Secant Condition

A good property for h_n(d) is that its gradient agrees with f at x_n and x_{n−1}. In other words, we'd like to ensure:

∇h_n(x_n) = g_n
∇h_n(x_{n−1}) = g_{n−1}

Using both of the equations above:

∇h_n(x_n) − ∇h_n(x_{n−1}) = g_n − g_{n−1}

Using the gradient of h_n(⋅) and canceling terms, we get

H_n (x_n − x_{n−1}) = g_n − g_{n−1}

This yields the so-called "secant condition", which ensures that H_n behaves like the Hessian at least for the difference (x_n − x_{n−1}). Assuming H_n is invertible (which is true if it is psd), multiplying both sides by H_n^{-1} yields

H_n^{-1} y_n = s_n

where y_n is the difference in gradients and s_n is the difference in inputs.
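To see why this is called a secant condition, it may help to note (my own aside) that in one dimension it reduces to

H_n = (g_n − g_{n−1}) / (x_n − x_{n−1}),

i.e. the approximate Hessian matches the familiar secant (finite-difference) estimate of the second derivative between the last two iterates.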

Symmetric

Recall that the Hessian represents the matrix of second-order partial derivatives: H^{(i,j)} = ∂²f / ∂x_i ∂x_j. The Hessian is symmetric since the order of differentiation doesn't matter.

The BFGS Update

Intuitively, we want H_n to satisfy the two conditions above:

  • The secant condition holds for s_n and y_n
  • H_n is symmetric

Given the two conditions above, we'd like to take the most conservative change relative to H_n^{-1}. This is reminiscent of the MIRA update, where we have conditions on any good solution but, all other things being equal, want the 'smallest' change.

min_{H^{-1}}  ‖H^{-1} − H_n^{-1}‖²
s.t.  H^{-1} y_n = s_n
      H^{-1} is symmetric

The norm used here, ‖⋅‖, is the weighted Frobenius norm. [4] The solution to this optimization problem is given by

H_{n+1}^{-1} = (I − ρ_n s_n y_n^T) H_n^{-1} (I − ρ_n y_n s_n^T) + ρ_n s_n s_n^T

where ρ_n = (y_n^T s_n)^{-1}. Proving this is relatively involved and mostly symbol crunching. I don't know of any intuitive way to derive it, unfortunately.
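As an illustration (my own sketch, not code from the original post), the update can be applied directly to an explicitly stored inverse-Hessian estimate; this is only sensible when the dimension is small, but it shows exactly what the formula does:

// One dense BFGS update of an explicit inverse-Hessian estimate (small dimensions only):
//   H_new^{-1} = (I - rho * s * y^T) * H^{-1} * (I - rho * y * s^T) + rho * s * s^T
public final class DenseBfgs {
  public static double[][] update(double[][] hInv, double[] s, double[] y) {
    int n = s.length;
    double ys = 0.0;
    for (int i = 0; i < n; i++) ys += y[i] * s[i];
    double rho = 1.0 / ys;            // rho = 1 / (y^T s)

    // a = (I - rho * s * y^T); its transpose is (I - rho * y * s^T)
    double[][] a = new double[n][n];
    for (int i = 0; i < n; i++)
      for (int j = 0; j < n; j++)
        a[i][j] = (i == j ? 1.0 : 0.0) - rho * s[i] * y[j];

    // hNew = a * hInv * a^T + rho * s * s^T
    double[][] hNew = new double[n][n];
    for (int i = 0; i < n; i++)
      for (int j = 0; j < n; j++) {
        double sum = 0.0;
        for (int k = 0; k < n; k++)
          for (int l = 0; l < n; l++)
            sum += a[i][k] * hInv[k][l] * a[j][l];
        hNew[i][j] = sum + rho * s[i] * s[j];
      }
    return hNew;
  }
}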


This update is known as the Broyden–Fletcher–Goldfarb–Shanno (BFGS) update, named after the original authors. Some things worth noting about this update:

  • H_{n+1}^{-1} is positive definite (psd) when H_n^{-1} is. Assuming our initial guess H_0^{-1} is psd, it follows by induction that each inverse Hessian estimate is as well. Since we can choose any H_0^{-1} we want, including the identity matrix I, this is easy to ensure.

  • The above also specifies a recurrence relationship between H_{n+1}^{-1} and H_n^{-1}. We only need the history of the s_n and y_n vectors to reconstruct H_n^{-1}.

The last point is significant since it will yield a procedural algorithm for computing H_n^{-1} d, for a direction d, without ever forming the H_n^{-1} matrix. Repeatedly applying the recurrence above, we have

BFGSMultiply(H_0^{-1}, {s_k}, {y_k}, d):
  r ← d
  // Compute right product
  for i = n, …, 1:
    α_i ← ρ_i s_i^T r
    r ← r − α_i y_i
  // Compute center
  r ← H_0^{-1} r
  // Compute left product
  for i = 1, …, n:
    β ← ρ_i y_i^T r
    r ← r + (α_i − β) s_i
  return r

Since the only use for H_n^{-1} is via the product H_n^{-1} g_n, we only need the above procedure to use the BFGS approximation in QuasiNewton.
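Here is a direct Java transcription of this two-loop recursion (my own sketch; sList and yList are assumed to hold s_1 … s_n and y_1 … y_n in order, and H_0^{-1} is taken to be the identity for simplicity):

import java.util.List;

// Two-loop recursion: computes r = H_n^{-1} * d from the stored (s_k, y_k) pairs.
public class BfgsTwoLoop {
  public static double[] multiply(List<double[]> sList, List<double[]> yList, double[] d) {
    int n = sList.size();
    double[] r = d.clone();
    double[] alpha = new double[n];
    double[] rho = new double[n];
    for (int i = 0; i < n; i++) {
      rho[i] = 1.0 / dot(yList.get(i), sList.get(i)); // rho_i = 1 / (y_i^T s_i)
    }
    // Right product: for i = n, ..., 1
    for (int i = n - 1; i >= 0; i--) {
      alpha[i] = rho[i] * dot(sList.get(i), r);
      for (int j = 0; j < r.length; j++) r[j] -= alpha[i] * yList.get(i)[j];
    }
    // Center: r <- H_0^{-1} r (identity here; a scaled identity is common in practice)
    // Left product: for i = 1, ..., n
    for (int i = 0; i < n; i++) {
      double beta = rho[i] * dot(yList.get(i), r);
      for (int j = 0; j < r.length; j++) r[j] += (alpha[i] - beta) * sList.get(i)[j];
    }
    return r;
  }

  private static double dot(double[] a, double[] b) {
    double v = 0.0;
    for (int i = 0; i < a.length; i++) v += a[i] * b[i];
    return v;
  }
}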

L-BFGS: BFGS on a memory budget

The BFGS quasi-Newton approximation has the benefit of not requiring us to be able to analytically compute the Hessian of a function. However, we still must maintain a history of the s_n and y_n vectors for each iteration. Since one of the core concerns with the NewtonRaphson algorithm was the memory required to maintain a Hessian, the BFGS quasi-Newton algorithm doesn't fully address that, since our memory use can grow without bound.

The L-BFGS algorithm, named for limited-memory BFGS, simply truncates the BFGSMultiply update to use the last m input differences and gradient differences. This means we only need to store s_n, s_{n−1}, …, s_{n−m+1} and y_n, y_{n−1}, …, y_{n−m+1} to compute the update. The center product can still use any symmetric psd matrix H_0^{-1}, which can also depend on any {s_k} or {y_k}.
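The memory bound itself is just a sliding window over the (s, y) history; a minimal sketch (my own, with hypothetical names) of that bookkeeping:

import java.util.ArrayDeque;
import java.util.Deque;

// Keeps only the last m correction pairs (s, y) for the L-BFGS two-loop recursion.
public class LbfgsHistory {
  private final int m;
  private final Deque<double[]> sHistory = new ArrayDeque<>();
  private final Deque<double[]> yHistory = new ArrayDeque<>();

  public LbfgsHistory(int m) { this.m = m; }

  public void add(double[] s, double[] y) {
    if (sHistory.size() == m) {      // evict the oldest pair once we exceed the budget
      sHistory.removeFirst();
      yHistory.removeFirst();
    }
    sHistory.addLast(s);
    yHistory.addLast(y);
  }

  // Oldest-to-newest views, in the order expected by the two-loop recursion sketch above.
  public java.util.List<double[]> sList() { return new java.util.ArrayList<>(sHistory); }
  public java.util.List<double[]> yList() { return new java.util.ArrayList<>(yHistory); }
}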

L-BFGS variants

There are lots of variants of L-BFGS which get used in practice. For non-differentiable functions, there is an orthant-wise variant which is suitable for training L1-regularized losses.

One of the main reasons not to use L-BFGS is in very large data settings, where an online approach can converge faster. There are in fact online variants of L-BFGS, but to my knowledge, none have consistently outperformed SGD variants (including AdaGrad or AdaDelta) for sufficiently large data sets.

  1. This assumes there is a unique global minimizer for f. In practice, unless f is convex, the parameters used are whatever pops out the other side of an iterative algorithm. ↩

  2. We know −H^{-1}∇f is a local extremum since the gradient of the quadratic model is zero there; since the Hessian has positive curvature, we know it's in fact a local minimum. If f is convex, the Hessian is always positive definite and there is a single unique global minimum. ↩

  3. As we'll see, we only really require being able to compute the product H^{-1}d for a direction d. ↩

  4. I've intentionally left out the weighting matrix W used to weight the norm, since you get the same solution under many choices. In particular, for any positive-definite W such that W s_n = y_n, we get the same solution. ↩
