Let's start from a simple problem:
$$\min_x \frac{1}{2}\|f(x)\|_2^2$$
where $x \in \mathbb{R}^n$ and $f$ is an arbitrary function.
When $f$ is simple, we can solve
$$\frac{\mathrm{d}f}{\mathrm{d}x} = 0$$
directly to obtain the extrema or saddle points.
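As a tiny illustration of the simple case (my own toy example, not from the text): for $f(x) = x - 3$ the cost is $\frac{1}{2}(x-3)^2$, its derivative is $x - 3$, and setting the derivative to zero gives the minimizer $x = 3$ in closed form.

```python
def cost(x):
    # (1/2) * ||f(x)||^2 for the toy residual f(x) = x - 3
    return 0.5 * (x - 3.0) ** 2

def dcost(x):
    # derivative of the cost above; its root is the minimizer
    return x - 3.0

x_star = 3.0            # root of dcost(x) = 0
print(dcost(x_star))    # 0.0
print(cost(x_star))     # 0.0
```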
When $f$ is complicated, $\dfrac{\partial f}{\partial x}$ is hard to compute, or its extrema and saddle points are hard to solve for; in that case an iterative approach is needed. Expanding the squared norm to second order:
$$\|f(x_k + \Delta x_k)\|_2^2 \approx \|f(x_k)\|_2^2 + J(x)\Delta x + \frac{1}{2}\Delta x^T H \Delta x$$
where $J$ and $H$ are the Jacobian and Hessian matrices, respectively. Keeping only the first-order term:
$$\min_{\Delta x} \|f(x_k)\|_2^2 + J(x)\Delta x$$
The increment direction is $\Delta x^* = -J^T(x)$; in practice a step length along this direction must also be chosen.
This is known as the steepest descent method.
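A minimal steepest-descent sketch, using a toy linear residual $f(x) = Ax - b$ of my own choosing (so the Jacobian is just $A$), with a fixed step length standing in for a proper line search:

```python
import numpy as np

A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([4.0, 1.0])

def f(x):
    return A @ x - b              # toy residual; its Jacobian is A

x = np.zeros(2)
alpha = 0.1                       # fixed step length instead of a line search
for _ in range(500):
    x = x - alpha * (A.T @ f(x))  # move along the increment direction -J^T f
# x converges to the least-squares solution [2, 1]
```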
If the second-order term is retained instead:
$$\Delta x^* = \arg\min_{\Delta x} \|f(x_k)\|_2^2 + J(x)\Delta x + \frac{1}{2}\Delta x^T H \Delta x$$
then setting the derivative of the above with respect to $\Delta x$ to zero gives: $H\Delta x = -J^T$
Retaining the second-order derivative in this way is Newton's method.
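For contrast, a one-dimensional Newton sketch on a quadratic cost of my own (gradient $J = x - 5$, Hessian $H = 1$), where solving $H\Delta x = -J^T$ lands on the minimizer in a single step:

```python
def grad(x):
    return x - 5.0    # gradient of the toy cost (1/2)*(x - 5)**2

def hess(x):
    return 1.0        # its (constant) Hessian

x = 0.0
dx = -grad(x) / hess(x)  # solve H * dx = -J in one dimension
x = x + dx
print(x)  # 5.0 -- exact minimizer after a single Newton step
```

On a quadratic cost Newton converges in one step, whereas steepest descent needs many small steps; the price is forming and inverting $H$.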
Steepest descent and Newton's method each have drawbacks: steepest descent is too greedy and tends to zigzag, converging slowly, while Newton's method requires the full Hessian $H$, which is expensive to compute for large problems.
Can we keep second-order accuracy while simplifying the Hessian computation? Expand $f$ itself to first order, $f(x + \Delta x) \approx f(x) + J(x)\Delta x$, and minimize $\frac{1}{2}\|f(x) + J(x)\Delta x\|_2^2$. Setting its derivative with respect to $\Delta x$ to zero:
$$2J(x)^T f(x) + 2J(x)^T J(x)\Delta x = 0$$
$$J(x)^T J(x)\Delta x = -J(x)^T f(x)$$
$$H\Delta x = g, \quad \text{with } H \triangleq J^T J,\ g \triangleq -J^T f$$
In Gauss-Newton (G-N), an expression in $J$ approximates $H$.
G-N is simple and practical, but in $\Delta x = H^{-1}g$ there is no guarantee that $J^T J$ is invertible (and the second-order approximation may be unreliable).
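A Gauss-Newton sketch on the same exponential model $a e^{bx}$ that the code below fits, but with noise-free data of my own so convergence is clean. Each iteration solves the normal equations $J^T J\,\Delta x = J^T r$; note that `np.linalg.solve` raises when $J^T J$ is singular, which is exactly the caveat above.

```python
import numpy as np

x = np.linspace(0, 1, 50)
a_true, b_true = 2.0, 1.5
y = a_true * np.exp(b_true * x)        # noise-free observations

p = np.array([1.0, 1.0])               # initial guess for [a, b]
for _ in range(20):
    model = p[0] * np.exp(p[1] * x)
    r = y - model                      # residual
    J = np.column_stack([
        np.exp(p[1] * x),              # d(model)/da
        p[0] * x * np.exp(p[1] * x),   # d(model)/db
    ])
    p = p + np.linalg.solve(J.T @ J, J.T @ r)  # G-N step: J^T J dx = J^T r
# p approaches [2.0, 1.5]
```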
The Levenberg-Marquardt method improves on G-N to some extent.
G-N is a line-search method: first find a direction, then determine a step length.
L-M is a trust-region method: the approximation is considered reliable only inside a region. The quality of the approximation is measured by
$$\rho = \frac{f(x + \Delta x) - f(x)}{J(x)\Delta x}$$
the ratio of actual decrease to approximate decrease: a value close to 1 means the approximation is reliable; otherwise it is not.
If $\rho$ is too small, the actual decrease falls short of the prediction, so shrink the approximation region;
if $\rho$ is large enough, the approximation is trustworthy, so enlarge the region.
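The radius update can be sketched with the usual thresholds (the constants 0.25/0.75 and factors 0.5/2 are common textbook choices of mine, not taken from this post):

```python
def update_radius(rho, radius):
    # rho = actual decrease / approximate decrease
    if rho < 0.25:            # approximation is poor -> trust it less
        return 0.5 * radius   # shrink the region
    elif rho > 0.75:          # approximation is good -> trust it more
        return 2.0 * radius   # enlarge the region
    return radius             # otherwise keep the radius unchanged

print(update_radius(0.9, 1.0))   # 2.0
print(update_radius(0.1, 1.0))   # 0.5
```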
import numpy as np
import matplotlib.pyplot as plt
# input data, whose shape is (num_data,1)
# data_input=np.array([[0.25, 0.5, 1, 1.5, 2, 3, 4, 6, 8]]).T
# data_output=np.array([[19.21, 18.15, 15.36, 14.10, 12.89, 9.32, 7.45, 5.24, 3.01]]).T
tao = 10 ** -3
threshold_stop = 10 ** -15
threshold_step = 10 ** -15
threshold_residual = 10 ** -15
residual_memory = []
# construct a user function
def my_Func(params, input_data):
    a = params[0, 0]
    b = params[1, 0]
    # c = params[2,0]
    # d = params[3,0]
    return a * np.exp(b * input_data)
    # return a*np.sin(b*input_data[:,0])+c*np.cos(d*input_data[:,1])
# generating the input_data and output_data, whose shape both is (num_data,1)
def generate_data(params, num_data):
    x = np.array(np.linspace(0, 10, num_data)).reshape(num_data, 1)  # generate noisy data
    mid, sigma = 0, 5
    y = my_Func(params, x) + np.random.normal(mid, sigma, num_data).reshape(num_data, 1)
    return x, y
# calculating the derivative of the pointed parameter, whose shape is (num_data,1)
def cal_deriv(params, input_data, param_index):
    # central difference: (f(p + h) - f(p - h)) / 2h with h = 1e-6
    params1 = params.copy()
    params2 = params.copy()
    params1[param_index, 0] += 0.000001
    params2[param_index, 0] -= 0.000001
    data_est_output1 = my_Func(params1, input_data)
    data_est_output2 = my_Func(params2, input_data)
    return (data_est_output1 - data_est_output2) / 0.000002
# calculating Jacobian matrix, whose shape is (num_data,num_params)
def cal_Jacobian(params, input_data):
    num_params = np.shape(params)[0]
    num_data = np.shape(input_data)[0]
    J = np.zeros((num_data, num_params))
    for i in range(0, num_params):
        # cal_deriv returns shape (num_data,1); flatten it to fit the column
        J[:, i] = cal_deriv(params, input_data, i).flatten()
    return J
# calculating residual, whose shape is (num_data,1)
def cal_residual(params, input_data, output_data):
    data_est_output = my_Func(params, input_data)
    residual = output_data - data_est_output
    return residual
'''
# calculating Hessian matrix, whose shape is (num_params,num_params)
def cal_Hessian_LM(Jacobian, u, num_params):
    H = Jacobian.T.dot(Jacobian) + u * np.eye(num_params)
    return H

# calculating g, whose shape is (num_params,1)
def cal_g(Jacobian, residual):
    g = Jacobian.T.dot(residual)
    return g

# calculating s, whose shape is (num_params,1)
def cal_step(Hessian_LM, g):
    s = np.linalg.inv(Hessian_LM).dot(g)  # ndarray has no .I attribute
    return s
'''
# get the init u, using equation u = tao * max(Aii)
def get_init_u(A, tao):
    m = np.shape(A)[0]
    Aii = []
    for i in range(0, m):
        Aii.append(A[i, i])
    u = tao * max(Aii)
    return u
# LM algorithm
def LM(num_iter, params, input_data, output_data):
    num_params = np.shape(params)[0]  # the number of params
    k = 0  # set the init iter count to 0
    # calculating the init residual
    residual = cal_residual(params, input_data, output_data)
    # calculating the init Jacobian matrix
    Jacobian = cal_Jacobian(params, input_data)
    A = Jacobian.T.dot(Jacobian)  # calculating the init A
    g = Jacobian.T.dot(residual)  # calculating the init gradient g
    stop = (np.linalg.norm(g, ord=np.inf) <= threshold_stop)  # set the init stop
    u = get_init_u(A, tao)  # set the init u
    v = 2  # set the init v = 2
    while (not stop) and (k < num_iter):
        k += 1
        while True:
            Hessian_LM = A + u * np.eye(num_params)  # calculating Hessian matrix in LM
            step = np.linalg.inv(Hessian_LM).dot(g)  # calculating the update step
            if np.linalg.norm(step) <= threshold_step:
                stop = True
            else:
                new_params = params + step  # update params using step
                new_residual = cal_residual(new_params, input_data, output_data)  # get new residual using new params
                rou = (np.linalg.norm(residual) ** 2 - np.linalg.norm(new_residual) ** 2) / (step.T.dot(u * step + g))
                if rou > 0:
                    params = new_params
                    residual = new_residual
                    residual_memory.append(np.linalg.norm(residual) ** 2)
                    # print(np.linalg.norm(new_residual)**2)
                    Jacobian = cal_Jacobian(params, input_data)  # recalculating Jacobian matrix with new params
                    A = Jacobian.T.dot(Jacobian)  # recalculating A
                    g = Jacobian.T.dot(residual)  # recalculating gradient g
                    stop = (np.linalg.norm(g, ord=np.inf) <= threshold_stop) or (
                        np.linalg.norm(residual) ** 2 <= threshold_residual)
                    u = u * max(1 / 3, 1 - (2 * rou - 1) ** 3)
                    v = 2
                else:
                    u = u * v
                    v = 2 * v
            if stop or rou > 0:  # check stop first: rou is undefined when the step was already tiny
                break
    return params
def main():
    # set the true params for the generate_data() function
    params = np.zeros((2, 1))
    params[0, 0] = 10.0
    params[1, 0] = 0.8
    num_data = 100  # set the data number
    data_input, data_output = generate_data(params, num_data)  # generate data as requested
    # set the init params for the LM algorithm
    params[0, 0] = 6.0
    params[1, 0] = 0.3
    # using the LM algorithm to estimate params
    num_iter = 100  # the number of iterations
    est_params = LM(num_iter, params, data_input, data_output)
    print(est_params)
    a_est = est_params[0, 0]
    b_est = est_params[1, 0]
    # plot the fit to see how it looks
    plt.scatter(data_input, data_output, color='b')
    x = np.arange(0, 100) * 0.1  # 100 points covering 0-10 with spacing 0.1
    plt.plot(x, a_est * np.exp(b_est * x), 'r', lw=1.0)
    plt.xlabel("2018.06.13")
    plt.savefig("result_LM.png")
    plt.show()
    plt.plot(residual_memory)
    plt.xlabel("2018.06.13")
    plt.savefig("error-iter.png")
    plt.show()

if __name__ == '__main__':
    main()
$$\min_{\Delta x} \frac{1}{2}\|f(x_k) + J(x)\Delta x\|_2^2 + \frac{\lambda}{2}\|D\Delta x\|_2^2$$
Still expanding as in G-N, the increment equation becomes:
$$(H + \lambda D^T D)\Delta x = g$$
In the L-M method, taking $D = I$ gives:
$$(H + \lambda I)\Delta x = g$$
My understanding is that L-M nicely combines the steepest descent and Gauss-Newton methods. Steepest descent alone takes too long and performs poorly on harder problems; Gauss-Newton alone suffers from a possibly non-invertible $J^T J$, and computing a true Hessian is unfriendly to the computer. Combining the two yields L-M: $(H + \lambda I)\Delta x = g$. When $\lambda$ is large, the $\lambda I$ term dominates and the update behaves like a first-order gradient step; when $\lambda$ is zero, it reduces to Gauss-Newton. The initial value of $\lambda$ is therefore very important; see the code:
def get_init_u(A, tao):
    m = np.shape(A)[0]
    Aii = []
    for i in range(0, m):
        Aii.append(A[i, i])
    u = tao * max(Aii)
    return u
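The limiting behavior of $\lambda$ described above can be checked numerically on made-up values of $H$ and $g$ (my own toy numbers): as $\lambda \to 0$ the step recovers the Gauss-Newton solution $H^{-1}g$, and for large $\lambda$ it shrinks toward $g/\lambda$, a short step along the gradient direction.

```python
import numpy as np

H = np.array([[4.0, 1.0], [1.0, 3.0]])   # made-up positive-definite H
g = np.array([1.0, 2.0])                 # made-up gradient vector

def lm_step(lam):
    # solve (H + lambda*I) dx = g, the L-M increment equation
    return np.linalg.solve(H + lam * np.eye(2), g)

# lambda -> 0: recovers the Gauss-Newton step H^{-1} g
print(np.allclose(lm_step(1e-9), np.linalg.solve(H, g)))  # True
# lambda large: dx ~ g / lambda, a small gradient-direction step
print(np.allclose(lm_step(1e9), g / 1e9))                 # True
```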
Non-convex problems are very sensitive to the initial value. What is a non-convex problem? See the figure:
In that situation the iteration gets trapped in a local minimum, while what we want is the global minimum (don't mock my handwriting, bear with it!). At my previous company I built the pharmacokinetics module myself, and as I see it the whole module's selling point was providing customers with a very good set of initial values. Given good initial values, the L-M algorithm performs very well; given poor ones, it can do worse than the simplex algorithm (which I consider a very middle-of-the-road method).
Code reference: https://its304.com/article/wolfcsharp/89674973 . I have tried this code and it works. One caveat when using it: since I minimize during the update, the direction of the gradient update needs to be flipped.
A highly recommended paper: http://www2.imm.dtu.dk/pubdb/edoc/imm3215.pdf
If you prefer video, you are welcome to follow my Bilibili walkthrough: https://www.bilibili.com/video/BV1134y1k7gv/
Note: I made slides when training my company on this topic, but did not take them with me when I left. Most of the formulas in this post come from Prof. Gao Xiang's *14 Lectures on Visual SLAM*, which should be acknowledged.