Why is the subgradient method not a descent method?

Question:

I am reading this nice document about the subgradient method, which defines the iteration as follows.

\begin{align*}
x_{k+1} = x_k - \alpha_k g^k
\end{align*}

for a $g$ such that

\begin{align*}
f(y) \geq f(x) + g^T (y - x)
\end{align*}

If $f$ is differentiable then $g$ is its gradient. This seems to suggest that for any valid $g$, we are ensured to increase (not strictly) the value of $f$. However, the same document states the following.

The subgradient method is not a descent method; the function value can (and often does) increase

and proposes to keep track of $f$ at each iteration, in order to record the best value found so far.

This seems to contradict the first statement about the choice of $g$: if we are choosing a $g$ such that $f$ increases, how can the next $f$ not be the best so far?


Answer:

I believe your confusion comes from misreading the direction of the inequality. If $g^k$ is a sub-gradient at $x^k$, then taking $y = x^{k+1}$ and $x = x^k$ in the sub-gradient inequality gives:
\begin{align*}
f(x^{k+1}) &= f(x^k - \alpha g^k) \\
&\geq f(x^k) + (g^k)^T(-\alpha g^k) \\
&= f(x^k) - \alpha \|g^k\|_2^2
\end{align*}

This means that $f(x^{k+1})$ can be any value above $f(x^k) - \alpha \|g^k\|_2^2$, and in particular, any value above $f(x^k)$ itself. The sub-gradient inequality is only a lower bound on the new function value, so it does not ensure that the method is a descent method.
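To make this concrete, here is a minimal numerical sketch (my own illustration, not from the cited document) using the classic example $f(x) = |x|$ with a fixed step size: a single subgradient step can jump past the minimizer and increase $f$, which is exactly why one keeps track of the best value seen so far.

```python
# Minimal sketch: the subgradient method on f(x) = |x|, which is convex
# but not differentiable at 0. sign(x) is a valid subgradient for x != 0,
# and any g in [-1, 1] is a subgradient at x = 0 (we pick g = 0 there).

def f(x):
    return abs(x)

def subgradient(x):
    return 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)

x = 0.4        # start near the minimizer x* = 0
alpha = 1.0    # fixed step size, deliberately large to show the overshoot
f_best = f(x)  # track the best value so far, as the document proposes

for k in range(4):
    x = x - alpha * subgradient(x)  # x_{k+1} = x_k - alpha * g^k
    f_best = min(f_best, f(x))      # keep the best iterate, not the last one
    print(f"k={k}: x={x:+.1f}  f(x)={f(x):.1f}  f_best={f_best:.1f}")

# k=0: x=-0.6  f(x)=0.6  f_best=0.4   <- the function value went UP
# k=1: x=+0.4  f(x)=0.4  f_best=0.4
# ...the iterates then oscillate between +0.4 and -0.6 forever.
```

With a fixed step size the iterates never settle; the standard fix is a diminishing step size (e.g. $\alpha_k = 1/k$, with $\sum_k \alpha_k = \infty$), under which $f_{\text{best}}$ converges to the minimum for convex $f$ with bounded subgradients.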

In contrast to the sub-gradient method, when $f$ is differentiable and $\nabla f$ is Lipschitz continuous with constant $L$, you have the Descent Lemma:
\begin{align*}
f(y) \leq f(x) + \nabla f(x)^T (y - x) + \frac{L}{2} \|y - x\|_2^2
\end{align*}

The Descent Lemma is the property that ensures descent, and it does not necessarily hold when you replace the gradient with a sub-gradient.
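To spell out the step the answer alludes to (a standard consequence of the Descent Lemma, written out here for completeness): plugging the gradient step $y = x - \alpha \nabla f(x)$ into the lemma gives

\begin{align*}
f(y) \leq f(x) - \alpha \left(1 - \frac{L\alpha}{2}\right) \|\nabla f(x)\|_2^2,
\end{align*}

which is at most $f(x)$ whenever $0 < \alpha \leq 2/L$. So gradient descent with a small enough fixed step size is a genuine descent method, while nothing analogous holds for sub-gradient steps.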


References:

  1. Subgradient Methods - MIT
  2. Why is the subgradient not a descent method?
  3. The subgradient method is not a true descent method
  4. Descent lemma of gradient descent algorithm
