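Gradient descent is the most basic, and one of the most important, families of optimization methods. The idea is simple:
$$f(x) = f(x_0) + f'(x_0)(x - x_0) + \cdots$$
The above is the first-order Taylor expansion of $f(x)$ around $x_0$. If we set
$$x_{k+1} = x_k - \alpha f'(x_k)$$
with a sufficiently small step size $\alpha > 0$, then expanding around $x_k$ gives $f(x_{k+1}) \approx f(x_k) - \alpha f'(x_k)^2 < f(x_k)$ whenever $f'(x_k) \neq 0$. In other words, each iteration moves in a direction that decreases the function value, and under suitable conditions the iterates converge to a (local) minimum.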
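As a minimal sketch of the update rule, the loop below runs gradient descent on a one-dimensional quadratic. The function name `gradient_descent`, the step size `lr=0.1`, and the test function $f(x) = (x-3)^2$ are all assumptions chosen for illustration, not part of the original post.

```python
def gradient_descent(grad, x0, lr=0.1, n_iter=100, tol=1e-8):
    """Minimize a 1-D function given its derivative `grad` (illustrative sketch)."""
    x = x0
    for _ in range(n_iter):
        step = lr * grad(x)   # lr plays the role of the step size
        if abs(step) < tol:   # stop when the update becomes negligible
            break
        x = x - step          # move against the derivative
    return x


# Hypothetical test function: f(x) = (x - 3)^2, so f'(x) = 2 * (x - 3)
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
print("approximate minimizer:", x_min)
```

With this step size the distance to the minimizer shrinks by a constant factor each iteration, which is the linear convergence typical of first-order methods.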
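First, note the difference between Newton iteration and Newton's method.
Newton's method is a second-order optimization method, whereas Newton iteration is a method for approximately solving equations over the real and complex numbers: it uses the first few terms of the Taylor series of $f(x)$ to find a root of $f(x) = 0$.
Setting the first-order Taylor expansion of $f(x)$ around $x_n$ to zero, $f(x_n) + f'(x_n)(x - x_n) = 0$, and solving for $x$ gives the Newton iteration formula:
$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$$
Iterating this formula from a reasonable starting point, as long as $f'(x_n) \neq 0$, yields a root of $f(x) = 0$.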
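The iteration formula can be sketched as a small generic root finder. The helper name `newton_root`, its tolerance and iteration cap, and the test equation $\cos x = x$ are assumptions made for this illustration.

```python
import math


def newton_root(f, fprime, x0, tol=1e-10, max_iter=50):
    """Approximate a root of f by Newton iteration (illustrative sketch)."""
    x = x0
    for _ in range(max_iter):
        d = fprime(x)
        if d == 0:
            raise ZeroDivisionError("derivative vanished; Newton step undefined")
        x_new = x - f(x) / d          # x_{n+1} = x_n - f(x_n) / f'(x_n)
        if abs(x_new - x) < tol:      # stop when the iterate no longer moves
            return x_new
        x = x_new
    return x


# Hypothetical example: solve cos(x) - x = 0 starting from x0 = 1.0
root = newton_root(lambda x: math.cos(x) - x,
                   lambda x: -math.sin(x) - 1.0,
                   x0=1.0)
print("root:", root, "residual:", math.cos(root) - root)
```

Note that the sketch does not guard against divergence from a poor starting point; it only caps the number of iterations.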
Whereas gradient descent is based on the first-order Taylor expansion of the function, Newton's method uses the second-order Taylor expansion:
$$f(x) \approx f(x_0) + f'(x_0)(x-x_0) + \frac{f''(x_0)}{2!}(x-x_0)^2$$
If the variable $x$ is a vector, this becomes
$$f(x) \approx f(x_0) + \nabla f(x_0)^T (x-x_0) + \frac{1}{2}(x-x_0)^T \nabla^2 f(x_0) (x-x_0)$$
We denote $\nabla f(x_0)$ by $g$ and $\nabla^2 f(x_0)$ by $H$.
To find an extremum, we differentiate with respect to $x$ and set $\nabla f(x) = 0$, which gives
$$\nabla f(x_0) + \nabla^2 f(x_0)(x-x_0) = 0$$
So the update for $x$ is
$$x = x_0 - \left[\nabla^2 f(x_0)\right]^{-1} \nabla f(x_0)$$
which can also be written as
$$x_{k+1} = x_k - H_k^{-1} \cdot g_k$$
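To make the update concrete, here is a small sketch applying a single Newton step to a quadratic function, where the second-order expansion is exact and one step lands on the minimizer. The matrix `A`, vector `b`, and starting point are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical quadratic: f(x) = 0.5 * x^T A x - b^T x, with A symmetric positive definite
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])


def grad(x):
    # Gradient g = A x - b
    return A @ x - b


def hess(x):
    # Hessian H = A (constant for a quadratic)
    return A


x0 = np.array([10.0, -7.0])
# Newton step: solve H d = g instead of forming H^{-1} explicitly
x1 = x0 - np.linalg.solve(hess(x0), grad(x0))
print("after one Newton step:", x1)
print("gradient at x1:", grad(x1))
```

For a non-quadratic function the expansion is only an approximation, so the step has to be repeated, as in the iteration formula.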
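Advantages
It uses second-derivative information, so near a minimum it typically converges in far fewer iterations than first-order methods.
Disadvantages
1. Computing the second derivatives is expensive.
2. Solving for the step can easily lead to an ill-conditioned system.
3. The Hessian matrix $H$ is not necessarily positive definite, so the Newton step is not guaranteed to be a descent direction.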
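Points 2 and 3 can be checked numerically on the Rosenbrock function used in the example later in this post. The sketch below is only an illustration: the helper name `rosenbrock_hessian` is an assumption, and the two evaluation points are the example's starting point and the known minimizer $(1, 1)$.

```python
import numpy as np


def rosenbrock_hessian(x, y):
    # Hessian of f(x, y) = (1 - x)^2 + 100 * (y - x^2)^2
    return np.array([[1200 * x * x - 400 * y + 2, -400 * x],
                     [-400 * x, 200.0]])


# At the point (-0.3, 0.4) the Hessian has one negative eigenvalue
H0 = rosenbrock_hessian(-0.3, 0.4)
eigvals = np.linalg.eigvalsh(H0)      # eigenvalues in ascending order
is_pd = bool(np.all(eigvals > 0))
print("eigenvalues at (-0.3, 0.4):", eigvals, "positive definite:", is_pd)

# At the minimizer (1, 1) the Hessian is positive definite but badly conditioned
cond_at_min = np.linalg.cond(rosenbrock_hessian(1.0, 1.0))
print("condition number at (1, 1):", cond_at_min)
```

An indefinite Hessian means the pure Newton step may increase the function value, which is one reason practical implementations add a line search or modify $H$.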
To overcome these drawbacks, quasi-Newton methods avoid the Hessian itself and instead construct a symmetric positive definite matrix that approximates the Hessian (or its inverse), then optimize the objective under the "quasi-Newton" condition.
First, take the second-order Taylor expansion of $f(x)$ at $x_{k+1}$:
$$f(x) \approx f(x_{k+1}) + \nabla f(x_{k+1})^T (x-x_{k+1}) + \frac{1}{2}(x-x_{k+1})^T \nabla^2 f(x_{k+1}) (x-x_{k+1})$$
Differentiating both sides:
$$\nabla f(x) \approx \nabla f(x_{k+1}) + \nabla^2 f(x_{k+1})(x-x_{k+1})$$
Setting $x = x_k$:
$$g_k \approx g_{k+1} + H_{k+1}(x_k-x_{k+1})$$
Now let
$$s_k = x_{k+1}-x_k, \quad y_k = g_{k+1} - g_k$$
Then:
$$y_k \approx H_{k+1} \cdot s_k$$
or
$$s_k \approx H^{-1}_{k+1} \cdot y_k$$
Writing $B_{k+1}$ for the approximation of the Hessian and $D_{k+1}$ for the approximation of its inverse, the quasi-Newton condition requires
$$y_k = B_{k+1} \cdot s_k$$
or:
$$s_k = D_{k+1} \cdot y_k$$
Common quasi-Newton methods include DFP, BFGS, and L-BFGS. There is plenty of material on them online for further reading.
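As one illustration of how such an approximation can be built, the sketch below applies the standard BFGS update to an inverse-Hessian approximation $D$ and checks that the result satisfies $s_k = D_{k+1} \cdot y_k$. The function name `bfgs_update_inverse` and the vectors `s` and `y` are made-up values for illustration; the post itself does not derive this update.

```python
import numpy as np


def bfgs_update_inverse(D, s, y):
    """BFGS update of the inverse-Hessian approximation D (illustrative sketch).

    D_new = (I - rho * s y^T) D (I - rho * y s^T) + rho * s s^T, rho = 1 / (y^T s)
    """
    rho = 1.0 / (y @ s)               # requires the curvature condition y^T s > 0
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ D @ V.T + rho * np.outer(s, s)


D = np.eye(2)                          # initial approximation: identity
s = np.array([0.5, -0.2])              # hypothetical step x_{k+1} - x_k
y = np.array([1.0, 0.3])               # hypothetical gradient change, y @ s = 0.44 > 0
D_new = bfgs_update_inverse(D, s, y)

print("D_new @ y:", D_new @ y)         # should reproduce s
print("eigenvalues of D_new:", np.linalg.eigvalsh(D_new))
```

The update only uses gradient differences, so no second derivatives are computed, and with $y_k^T s_k > 0$ the updated matrix stays symmetric positive definite.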
Let's look at an example that uses Newton iteration to compute a square root.
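The update used in the code that follows can be derived from the Newton iteration formula. This is a short worked derivation; the symbols $t$ and $a$ are introduced here for clarity and correspond to `num` and `x` in the code.

$$
\begin{aligned}
f(t) &= t^2 - a, \qquad f'(t) = 2t \\
t_{n+1} &= t_n - \frac{t_n^2 - a}{2 t_n} = \frac{1}{2}\left(t_n + \frac{a}{t_n}\right)
\end{aligned}
$$

For $a = 4$ and $t_0 = 1$, the first iterates are $2.5$, $2.05$, $\approx 2.00061$, and $\approx 2.0000001$, so the error roughly squares at each step.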
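def newton_sqrt():
    x = 4        # number whose square root we want
    num = 1.0    # initial guess
    eps = 1e-4   # tolerance on |x - num^2|
    n = 100      # maximum number of iterations
    for i in range(n):
        # Newton update for f(num) = num^2 - x
        num = 0.5 * (num + x / num)
        pred_x = num * num
        if abs(x - pred_x) < eps:
            break
    print("sqrt num is: ", num)


newton_sqrt()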
import numpy as np


def f(x, y):
    # Rosenbrock function
    return (1 - x) ** 2 + 100 * (y - x * x) ** 2


def grad(x, y):
    # Gradient of f
    return np.array([2 * x - 2 + 400 * x * (x * x - y),
                     200 * (y - x * x)])


def H(x, y):
    # Hessian of f
    return np.array([[1200 * x * x - 400 * y + 2, -400 * x],
                     [-400 * x, 200]])


def delta_newton(x, y):
    alpha = 1.0  # step size; 1.0 gives the pure Newton step
    inverse_H = np.linalg.inv(H(x, y))
    delta = alpha * np.dot(inverse_H, grad(x, y))  # Newton step H^{-1} g
    return delta


def solution():
    x = np.array([-0.3, 0.4])  # starting point
    tol = 0.00001
    for i in range(100):
        delta = delta_newton(x[0], x[1])
        # stop once both components of the step are below tol
        if abs(delta[0]) < tol and abs(delta[1]) < tol:
            break
        x = x - delta
        print("i is: ", i, ", x is: ", x)


solution()
The output is
i is: 0 , x is: [-0.32131148 0.10278689]
i is: 1 , x is: [ 0.88997209 -0.67515756]
i is: 2 , x is: [0.89034578 0.79271546]
i is: 3 , x is: [0.99999694 0.9879705 ]
i is: 4 , x is: [0.99999784 0.99999567]
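The example above finds the optimum of the Rosenbrock function, which has the form
$$f(x, y) = (1-x)^2 + 100(y-x^2)^2$$
As the example shows, Newton's method does converge quickly here: after 5 iterations the iterate is already within about $10^{-5}$ of the global minimum at $(1, 1)$.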
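For a rough point of comparison, the sketch below takes the same number of plain gradient descent steps from the same starting point. The fixed step size `lr = 1e-3` is an arbitrary assumption; a line search or a tuned step size would behave differently, so this is only an illustration, not a benchmark.

```python
import numpy as np


def grad(x, y):
    # Gradient of the Rosenbrock function
    return np.array([2 * x - 2 + 400 * x * (x * x - y),
                     200 * (y - x * x)])


x = np.array([-0.3, 0.4])   # same starting point as the Newton example
lr = 1e-3                   # hypothetical fixed step size
for i in range(5):
    x = x - lr * grad(x[0], x[1])

dist = np.linalg.norm(x - np.array([1.0, 1.0]))
print("after 5 gradient steps:", x, "distance to (1, 1):", dist)
```

After five steps the iterate has barely moved from the starting point, whereas the Newton iterates above are already close to $(1, 1)$.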