Newton's method and quasi-Newton methods are commonly used methods for unconstrained optimization, and both enjoy fast convergence. Newton's method is an iterative algorithm in which every step requires the inverse of the Hessian matrix of the objective function, which is expensive to compute. Quasi-Newton methods approximate the inverse Hessian by a positive-definite matrix, which simplifies this step.
Consider the unconstrained optimization problem
$$\min_{x\in R^n} f(x)$$
where $x^*$ denotes the minimizer of the objective.
Assume $f(x)$ has continuous second-order partial derivatives, and let $x^{(k)}$ denote the value at the $k$-th iteration. Expanding $f(x)$ to second order around $x^{(k)}$ (scalar case first):
$$\begin{aligned} f(x) &= f(x^{(k)}) + f'(x^{(k)})\Delta x + \frac{1}{2} f''(x^{(k)})\Delta x^2\\ &= f(x^{(k)}) + f'(x^{(k)})(x - x^{(k)}) + \frac{1}{2} f''(x^{(k)})(x - x^{(k)})^2 \end{aligned}$$
When $x$ is high-dimensional (a vector), the expansion becomes
$$f(x) = f(x^{(k)}) + g_k^T(x-x^{(k)}) + \frac{1}{2}(x-x^{(k)})^T H(x^{(k)})(x-x^{(k)})$$
where $g_k = g(x^{(k)}) = \nabla f(x^{(k)})$ is the gradient vector of $f(x)$ evaluated at $x^{(k)}$, and $H(x^{(k)})$ is the Hessian matrix of $f(x)$ evaluated at $x^{(k)}$.
The Hessian matrix is the matrix of second-order partial derivatives:
$$H(x) = \Big[\frac{\partial^2 f}{\partial x_i\,\partial x_j}\Big]_{n\times n}$$
A necessary condition for $f(x)$ to have an extremum is that the first derivative is zero at the extremum point, i.e. the gradient vector vanishes there. When $H(x^{(k)})$ is a positive-definite matrix, the extremum of $f(x)$ is a minimum.
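As a quick numerical illustration of the gradient and Hessian used above, here is a minimal finite-difference sketch in Python (the test function, step size `h`, and helper names are illustrative choices):

```python
import numpy as np

def num_gradient(f, x, h=1e-5):
    """Central-difference approximation of the gradient of f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def num_hessian(f, x, h=1e-5):
    """Central-difference approximation of the Hessian [d^2 f / dx_i dx_j]."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        e = np.zeros(n); e[i] = h
        # Each row is the finite difference of the gradient along x_i.
        H[i] = (num_gradient(f, x + e, h) - num_gradient(f, x - e, h)) / (2 * h)
    return H

# Example: f(x) = x0^2 + 2*x1^2 has Hessian diag(2, 4), which is positive definite.
f = lambda x: x[0]**2 + 2 * x[1]**2
x = np.array([1.0, -1.0])
print(num_gradient(f, x))   # approximately [ 2., -4.]
print(num_hessian(f, x))    # approximately [[2., 0.], [0., 4.]]
```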
Assuming $\nabla f(x^{(k+1)}) = 0$, and differentiating the quadratic approximation above, we have
$$\nabla f(x) = f'(x^{(k)}) + f''(x^{(k)})(x - x^{(k)})$$
and in the vector case
$$\nabla f(x) = g_k + H_k(x-x^{(k)})\tag{1.1}$$
Therefore
$$g_k + H_k(x-x^{(k)}) = 0$$
$$x^{(k+1)} = x^{(k)} - H_k^{-1}g_k$$
If we define $p_k$ by
$$H_k p_k = -g_k$$
then the update can be written as
$$x^{(k+1)} = x^{(k)} + p_k$$
Input: objective function $f(x)$, gradient $g(x) = \nabla f(x)$, Hessian matrix $H(x)$, accuracy requirement $\epsilon$.
Output: the minimizer $x^*$ of $f(x)$.
Starting from an initial point $x^{(0)}$, Newton's method repeats the update $x^{(k+1)} = x^{(k)} - H_k^{-1} g_k$ until $\|g_k\| < \epsilon$.
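A minimal NumPy sketch of this Newton iteration, assuming user-supplied gradient and Hessian callables (`grad` and `hess` are illustrative names) and that the Hessian remains invertible along the way:

```python
import numpy as np

def newton_method(grad, hess, x0, eps=1e-6, max_iter=100):
    """Newton iteration: x_{k+1} = x_k - H_k^{-1} g_k, stopped when ||g_k|| < eps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < eps:
            break
        # Solve H_k p_k = -g_k instead of explicitly inverting the Hessian.
        p = np.linalg.solve(hess(x), -g)
        x = x + p
    return x

# Example: minimize the convex quadratic f(x) = x0^2 + 2*x1^2 + x0*x1.
grad = lambda x: np.array([2 * x[0] + x[1], 4 * x[1] + x[0]])
hess = lambda x: np.array([[2.0, 1.0], [1.0, 4.0]])
print(newton_method(grad, hess, x0=[3.0, -2.0]))  # converges to [0, 0] in one step
```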
Basic idea of the quasi-Newton method:
Computing the inverse of the Hessian is difficult, so consider using an $n\times n$ matrix $G_k = G(x^{(k)})$ as an approximate replacement for $H_k^{-1}$.
Substituting $x = x^{(k+1)}$ into $(1.1)$ gives:
$$\nabla f(x^{(k+1)}) = g_k + H_k(x^{(k+1)}-x^{(k)})$$
$$g_{k+1} - g_k = H_k(x^{(k+1)} - x^{(k)})$$
Writing $y_k = g_{k+1} - g_k$ and $\delta_k = x^{(k+1)} - x^{(k)}$, we have
$$y_k = H_k\delta_k\tag{2.1}$$
$$H_k^{-1}y_k = \delta_k$$
Equation $(2.1)$ is called the quasi-Newton condition.
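For a quadratic objective the Hessian is constant and the quasi-Newton condition holds exactly between any two points, which makes it easy to check numerically. A small illustrative sketch (the matrix `A` and the two points are arbitrary choices):

```python
import numpy as np

# Quadratic f(x) = 0.5 * x^T A x, so grad f(x) = A x and the Hessian is A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad = lambda x: A @ x

x_k, x_k1 = np.array([1.0, -1.0]), np.array([0.5, 0.2])
delta_k = x_k1 - x_k                  # delta_k = x^{(k+1)} - x^{(k)}
y_k = grad(x_k1) - grad(x_k)          # y_k = g_{k+1} - g_k

# Quasi-Newton condition y_k = H_k delta_k holds exactly for a quadratic.
print(np.allclose(y_k, A @ delta_k))  # True
```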
If $H_k$ is positive definite, the Newton search direction $p_k$ is guaranteed to be a descent direction:
the search direction is $p_k = -H_k^{-1} g_k$, so the iterate can be written as
$$x = x^{(k)} + \lambda p_k = x^{(k)} - \lambda H_k^{-1}g_k$$
and the Taylor expansion of $f(x)$ at $x^{(k)}$ can be approximated by
$$f(x) = f(x^{(k)}) - \lambda g_k^T H_k^{-1} g_k$$
Since $H_k^{-1}$ is positive definite, we have $g_k^T H_k^{-1} g_k > 0$. Hence, for a sufficiently small positive $\lambda$, $f(x) < f(x^{(k)})$, which means $p_k$ is a descent direction.
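A quick numerical sanity check of this descent-direction argument on a convex quadratic (the matrix `H`, the point, and the step size are arbitrary illustrative choices):

```python
import numpy as np

# f(x) = 0.5 * x^T H x with a positive-definite H, so grad f(x) = H x.
H = np.array([[4.0, 1.0], [1.0, 3.0]])
f = lambda x: 0.5 * x @ H @ x
grad = lambda x: H @ x

x_k = np.array([2.0, -1.0])
g_k = grad(x_k)
p_k = -np.linalg.solve(H, g_k)             # Newton direction p_k = -H^{-1} g_k

print(g_k @ np.linalg.solve(H, g_k) > 0)   # True: g_k^T H^{-1} g_k > 0
lam = 0.1
print(f(x_k + lam * p_k) < f(x_k))         # True: a small step along p_k decreases f
```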
A quasi-Newton method uses $G_k$ as an approximation of $H_k^{-1}$ and requires $G_k$ to satisfy the same conditions: $G_k$ must be positive definite at every iteration, and $G_k$ must satisfy the quasi-Newton condition:
$$G_{k+1} y_k = \delta_k$$
An algorithm that chooses $G_k$ as an approximation of $H_k^{-1}$, or $B_k$ as an approximation of $H_k$, according to the quasi-Newton condition is called a quasi-Newton method.
Following the quasi-Newton condition, the matrix can be updated at every iteration:
$$G_{k+1} = G_k + \Delta G_k$$
There are several concrete ways to choose this update; DFP and BFGS, described below, are two of them.
DFP (Davidon-Fletcher-Powell) chooses $G_{k+1}$ as follows: assume that at each iteration $G_{k+1}$ is obtained from $G_k$ by adding two correction terms, denoted $Q_k$ and $P_k$. Then
$$G_{k+1} = G_k + Q_k + P_k$$
$$G_{k+1}y_k = G_ky_k + Q_ky_k + P_ky_k$$
To make $G_{k+1}$ satisfy the quasi-Newton condition, it suffices to require
$$P_ky_k = \delta_k$$
$$Q_ky_k = -G_ky_k$$
Take
$$P_k = \frac{\delta_k\delta_k^T}{\delta_k^Ty_k}$$
$$Q_k = -\frac{G_ky_ky_k^TG_k}{y_k^TG_ky_k}$$
which yields the DFP iteration formula for $G_{k+1}$:
$$G_{k+1} = G_k - \frac{G_ky_ky_k^TG_k}{y_k^TG_ky_k} + \frac{\delta_k\delta_k^T}{\delta_k^Ty_k}\tag{2.2}$$
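It is straightforward to check that the update $(2.2)$ indeed satisfies the quasi-Newton condition $G_{k+1} y_k = \delta_k$:
$$G_{k+1} y_k = G_k y_k - \frac{G_k y_k\,(y_k^T G_k y_k)}{y_k^T G_k y_k} + \frac{\delta_k\,(\delta_k^T y_k)}{\delta_k^T y_k} = G_k y_k - G_k y_k + \delta_k = \delta_k$$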
The DFP algorithm proceeds as follows:
Input: $f(x)$, gradient $g(x) = \nabla f(x)$, accuracy requirement $\epsilon$.
Output: the minimizer $x^*$ of $f(x)$.
(1) Choose an initial point $x^{(0)}$, take $G_0$ to be a positive-definite symmetric matrix, and set $k=0$.
(2) Compute $g_k$. If $\|g_k\| < \epsilon$, stop and return the approximate solution $x^* = x^{(k)}$.
(3) Set $p_k = -G_k g_k$.
(4) One-dimensional search: find $\lambda_k$ such that $f(x^{(k)} + \lambda_k p_k) = \min_{\lambda\geq 0} f(x^{(k)}+\lambda p_k)$.
(5) Set $x^{(k+1)} = x^{(k)} + \lambda_k p_k$.
(6) Compute $g_{k+1}$. If $\|g_{k+1}\| < \epsilon$, stop and return the approximate solution $x^* = x^{(k+1)}$; otherwise compute $G_{k+1}$ by $(2.2)$.
(7) Set $k = k+1$ and return to step (3).
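A compact NumPy sketch of this DFP procedure, assuming user-supplied callables `f` and `grad` and using a simple backtracking (Armijo) line search in place of the exact one-dimensional search; all names and constants are illustrative:

```python
import numpy as np

def backtracking(f, x, p, g, alpha=1.0, rho=0.5, c=1e-4):
    """Simple backtracking line search standing in for the exact 1-D search."""
    while f(x + alpha * p) > f(x) + c * alpha * (g @ p):
        alpha *= rho
    return alpha

def dfp(f, grad, x0, eps=1e-6, max_iter=200):
    x = np.asarray(x0, dtype=float)
    G = np.eye(len(x))                    # G_0: positive-definite symmetric matrix
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < eps:
            break
        p = -G @ g                        # step (3): p_k = -G_k g_k
        lam = backtracking(f, x, p, g)
        x_new = x + lam * p
        g_new = grad(x_new)
        delta, y = x_new - x, g_new - g   # delta_k and y_k
        # DFP update (2.2): G_{k+1} = G_k - G y y^T G / (y^T G y) + delta delta^T / (delta^T y)
        Gy = G @ y
        G = G - np.outer(Gy, Gy) / (y @ Gy) + np.outer(delta, delta) / (delta @ y)
        x, g = x_new, g_new
    return x

# Example: a simple quadratic test function f(x) = (x0 - 1)^2 + 10*(x1 + 2)^2.
f = lambda x: (x[0] - 1)**2 + 10 * (x[1] + 2)**2
grad = lambda x: np.array([2 * (x[0] - 1), 20 * (x[1] + 2)])
print(dfp(f, grad, x0=[0.0, 0.0]))        # approaches [1, -2]
```

The backtracking search is only a stand-in; with an exact or Wolfe line search, $\delta_k^T y_k > 0$ is guaranteed and $G_k$ stays positive definite.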
BFGS (Broyden-Fletcher-Goldfarb-Shanno) is the most popular quasi-Newton algorithm.
Basic idea:
Consider approximating the Hessian $H$ directly by a matrix $B_k$. The quasi-Newton condition then becomes:
$$B_{k+1}\delta_k = y_k$$
Let
$$B_{k+1} = B_k + P_k + Q_k$$
$$B_{k+1}\delta_k = B_k\delta_k + P_k\delta_k + Q_k\delta_k$$
and require $P_k$ and $Q_k$ to satisfy:
$$P_k\delta_k = y_k$$
$$Q_k\delta_k = -B_k\delta_k$$
Take
$$P_k = \frac{y_ky_k^T}{y_k^T\delta_k}$$
$$Q_k = -\frac{B_k\delta_k\delta_k^TB_k}{\delta_k^TB_k\delta_k}$$
which gives the BFGS iteration formula for $B_{k+1}$:
$$B_{k+1} = B_k - \frac{B_k\delta_k\delta_k^TB_k}{\delta_k^TB_k\delta_k} + \frac{y_ky_k^T}{y_k^T\delta_k}\tag{2.3}$$
It can be shown that if the initial matrix $B_0$ is positive definite, then every matrix $B_k$ produced during the iteration is positive definite.
The BFGS algorithm proceeds as follows:
Input: $f(x)$, gradient $g(x) = \nabla f(x)$, accuracy requirement $\epsilon$.
Output: the minimizer $x^*$ of $f(x)$.
(1) Choose an initial point $x^{(0)}$, take $B_0$ to be a positive-definite symmetric matrix, and set $k=0$.
(2) Compute $g_k$. If $\|g_k\| < \epsilon$, stop and return the approximate solution $x^* = x^{(k)}$.
(3) Solve $B_k p_k = -g_k$ to obtain $p_k$.
(4) One-dimensional search: find $\lambda_k$ such that $f(x^{(k)} + \lambda_k p_k) = \min_{\lambda\geq 0} f(x^{(k)}+\lambda p_k)$.
(5) Set $x^{(k+1)} = x^{(k)} + \lambda_k p_k$.
(6) Compute $g_{k+1}$. If $\|g_{k+1}\| < \epsilon$, stop and return the approximate solution $x^* = x^{(k+1)}$; otherwise compute $B_{k+1}$ by $(2.3)$.
(7) Set $k = k+1$ and return to step (3).
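A matching NumPy sketch of this BFGS procedure under the same assumptions as the DFP sketch above (user-supplied `f` and `grad`, backtracking instead of an exact line search; names are illustrative). For practical use, a library routine such as `scipy.optimize.minimize(..., method='BFGS')` is usually preferable to a hand-rolled loop:

```python
import numpy as np

def bfgs(f, grad, x0, eps=1e-6, max_iter=200):
    x = np.asarray(x0, dtype=float)
    B = np.eye(len(x))                      # B_0: positive-definite symmetric matrix
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < eps:
            break
        p = np.linalg.solve(B, -g)          # step (3): solve B_k p_k = -g_k
        lam, rho, c = 1.0, 0.5, 1e-4        # backtracking (Armijo) line search
        while f(x + lam * p) > f(x) + c * lam * (g @ p):
            lam *= rho
        x_new = x + lam * p
        g_new = grad(x_new)
        delta, y = x_new - x, g_new - g
        # BFGS update (2.3) applied to B_k.
        Bd = B @ delta
        B = B - np.outer(Bd, Bd) / (delta @ Bd) + np.outer(y, y) / (y @ delta)
        x, g = x_new, g_new
    return x

f = lambda x: (x[0] - 1)**2 + 10 * (x[1] + 2)**2
grad = lambda x: np.array([2 * (x[0] - 1), 20 * (x[1] + 2)])
print(bfgs(f, grad, x0=[0.0, 0.0]))         # approaches [1, -2]
```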
An update formula for $G_k$ can also be obtained from the $B_k$ matrices of the BFGS algorithm. Writing $G_k = B_k^{-1}$ and $G_{k+1} = B_{k+1}^{-1}$, and applying the Sherman-Morrison formula twice to $(2.3)$, we get
$$G_{k+1} = \Big(I - \frac{\delta_ky_k^T}{\delta_k^Ty_k}\Big)G_k\Big(I - \frac{\delta_ky_k^T}{\delta_k^Ty_k}\Big)^T + \frac{\delta_k\delta_k^T}{\delta_k^Ty_k}$$
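This inverse form is the one typically implemented, because the search direction then becomes simply $p_k = -G_k g_k$ with no linear system to solve. A minimal sketch of the update, assuming `G`, `delta`, and `y` are NumPy arrays with `delta @ y > 0`:

```python
import numpy as np

def bfgs_inverse_update(G, delta, y):
    """BFGS update applied directly to G_k ≈ H_k^{-1} (Sherman-Morrison form)."""
    n = len(delta)
    rho = 1.0 / (delta @ y)                    # 1 / (delta_k^T y_k)
    V = np.eye(n) - rho * np.outer(delta, y)   # I - delta_k y_k^T / (delta_k^T y_k)
    return V @ G @ V.T + rho * np.outer(delta, delta)
```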
Reference: 李航, 《统计学习方法》 (Li Hang, Statistical Learning Methods).