This post introduces the Levenberg-Marquardt algorithm, which is used frequently in SLAM. It is a trust-region method. Like line-search techniques, trust-region methods are an important device in optimization for guaranteeing global convergence: both compute the displacement of each iteration and thereby determine the next iterate. The difference is that a line search first produces a displacement direction (the search direction) and then determines the displacement length (the step size), whereas a trust-region method determines the displacement directly and produces the new iterate.
The main application of the Levenberg-Marquardt algorithm is the least-squares curve-fitting problem: given a set of $n$ empirical pairs of independent and dependent variables $(x_i, y_i)$, find the parameters $\boldsymbol\beta$ of the model curve $f(x,\boldsymbol\beta)$ that minimize the sum of squared deviations $S(\boldsymbol\beta)$:
$$\hat{\boldsymbol\beta}=\arg\min_{\boldsymbol\beta}S(\boldsymbol\beta)=\arg\min_{\boldsymbol\beta}\sum_{i=1}^n{\parallel y_i-f(x_i,\boldsymbol\beta)\parallel}_2^2 \tag{1}$$
For example, suppose the model is
$$f(x)=e^{ax^2+bx+c} \tag{2}$$
Then the parameter vector is
$$\boldsymbol\beta=[a,b,c] \tag{3}$$
A first-order Taylor expansion then gives:
$$f(x_i,\boldsymbol\beta+\boldsymbol\delta)\approx f(x_i,\boldsymbol\beta)+J_i\boldsymbol\delta \tag{4}$$
where the Jacobian row $J_i$ is the vector of partial derivatives:
$$J_i=\frac{\partial f(x_i,\boldsymbol\beta)}{\partial\boldsymbol\beta} \tag{5}$$
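For instance, for the model in (2) the Jacobian row for data point $x_i$ works out to

$$J_i=\left[\frac{\partial f}{\partial a},\ \frac{\partial f}{\partial b},\ \frac{\partial f}{\partial c}\right]=\left[x_i^2\,e^{ax_i^2+bx_i+c},\ \ x_i\,e^{ax_i^2+bx_i+c},\ \ e^{ax_i^2+bx_i+c}\right]$$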
The error function $S$, being a sum of squared errors, is minimized where its gradient is zero:
$$S(\boldsymbol\beta+\boldsymbol\delta)\approx\sum_{i=1}^n{\parallel y_i-f(x_i,\boldsymbol\beta)-J_i\boldsymbol\delta\parallel}_2^2 \tag{6}$$
$$S(\boldsymbol\beta+\boldsymbol\delta)\approx{\parallel\boldsymbol y-f(\boldsymbol\beta)-J\boldsymbol\delta\parallel}_2^2=[\boldsymbol y-f(\boldsymbol\beta)-J\boldsymbol\delta]^T[\boldsymbol y-f(\boldsymbol\beta)-J\boldsymbol\delta] \tag{7}$$
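For reference, setting the derivative of (7) with respect to $\boldsymbol\delta$ to zero yields the Gauss-Newton step (this is equation (8) below):

$$\frac{\partial S(\boldsymbol\beta+\boldsymbol\delta)}{\partial\boldsymbol\delta}\approx-2J^T[\boldsymbol y-f(\boldsymbol\beta)-J\boldsymbol\delta]=0\ \ \Longrightarrow\ \ (J^TJ)\boldsymbol\delta=J^T[\boldsymbol y-f(\boldsymbol\beta)]$$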
The (approximately second-order) Taylor expansion used by Gauss-Newton is accurate only near the expansion point, so it is natural to constrain $\boldsymbol\delta$ to a trust region: the step must not be so large that the approximation becomes inaccurate, nor so small that many iterations are wasted.
Taylor expansion:
$$f(\boldsymbol\beta+\boldsymbol\delta)\approx f(\boldsymbol\beta)+J(\boldsymbol\beta)\boldsymbol\delta$$
We can rearrange this into a ratio $\rho$ that measures how well the linear model predicts the actual change: when $\rho\approx 1$ the approximation is good, and when $\rho$ is small the range of $\boldsymbol\delta$ should be reduced as much as possible.
We can therefore compute $\rho$ as:
$$\rho=\frac{f(\boldsymbol\beta+\boldsymbol\delta)-f(\boldsymbol\beta)}{J(\boldsymbol\beta)\boldsymbol\delta}$$
The main modification introduced by the Levenberg-Marquardt algorithm is the damping factor $\lambda$, which turns the Gauss-Newton step (8) into the slightly different damped step (9):
$$(J^TJ)\boldsymbol\delta=J^T[\boldsymbol y-f(\boldsymbol\beta)] \tag{8}$$
$$(J^TJ+\lambda I)\boldsymbol\delta=J^T[\boldsymbol y-f(\boldsymbol\beta)] \tag{9}$$
Here $I$ is the identity matrix and $\boldsymbol\delta$ is the increment of the estimated parameter vector $\boldsymbol\beta$.
The non-negative damping factor $\lambda$ is updated at every iteration. When $S$ decreases quickly, $\lambda$ is kept small and the method behaves much like Gauss-Newton; if an iteration fails to reduce the residual sufficiently, $\lambda$ is increased, pushing the step closer to the gradient-descent direction. For large values of $\lambda$ the step is therefore taken roughly along the gradient. The iteration stops when the length of the computed step $\boldsymbol\delta$, or the reduction of the sum of squares obtained from the updated parameter vector $\boldsymbol\beta+\boldsymbol\delta$, falls below a preset limit; the final $\boldsymbol\beta$ is then taken as the solution. The damping has several useful properties (a minimal sketch of one damped step follows the list below):
a. Since the damping factor $\lambda>0$, the coefficient matrix $J^TJ+\lambda I$ is always positive definite.
b. When $\lambda$ is large, the $\lambda I$ term dominates the diagonal and the solution of (9) is approximately $\frac{1}{\lambda}J^T[\boldsymbol y-f(\boldsymbol\beta)]$, i.e., a small step along the steepest-descent (negative gradient) direction.
c. When $\lambda$ is small, (9) reduces to the Gauss-Newton form, which suits the final stage of the iteration when the estimate is very close to the optimum and avoids the oscillation of steepest descent.
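As a rough sketch of how one damped step (9) might look with OpenCV (this is not the full program given at the end of the post; the function name lmStep and the inputs J, r = y - f(beta) and lambda are illustrative assumptions):

// One Levenberg-Marquardt step for equation (9): solve (J^T J + lambda*I) delta = J^T r.
// J (m x p), r = y - f(beta) (m x 1) and lambda are assumed to be computed elsewhere.
#include <opencv2/opencv.hpp>
using namespace cv;
Mat lmStep(const Mat &J, const Mat &r, double lambda)
{
    int p = J.cols;
    Mat A = J.t()*J + lambda*Mat::eye(p, p, CV_64F); // damped normal matrix
    Mat g = J.t()*r;                                 // right-hand side J^T [y - f(beta)]
    Mat delta;
    solve(A, g, delta, DECOMP_CHOLESKY);             // A is symmetric positive definite
    return delta;                                    // new estimate would be beta + delta
}

Solving the linear system with cv::solve and a Cholesky factorization is usually preferable numerically to forming an explicit inverse, as the full program below does with .inv().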
The basic idea of trust-region methods is as follows. First, a "trust-region radius" is given as an upper bound on the displacement length, and a closed ball centered at the current iterate with this radius, called the "trust region", is defined. A "candidate displacement" is then obtained by solving the "trust-region subproblem" (a quadratic approximation of the objective) over this region. If the candidate displacement decreases the objective sufficiently, it is accepted as the new displacement, the trust-region radius is kept or enlarged, and a new iteration begins. Otherwise, the quadratic model does not approximate the objective well enough, so the radius is shrunk and a new candidate displacement is obtained by solving the subproblem over the smaller region. This is repeated until the termination criterion is met.
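To make this loop concrete, here is a bare-bones one-dimensional illustration of the accept/shrink logic (purely illustrative and separate from the LM code in this post; the objective F(x) = x^4 and the thresholds 0.25 and 0.75 are arbitrary choices):

// Minimal 1-D trust-region iteration: quadratic model, gain ratio, radius update.
#include <cstdio>
#include <cmath>
#include <algorithm>
int main()
{
    double x = 2.0, radius = 1.0;                              // start point and trust-region radius
    for (int k = 0; k < 50 && radius > 1e-10; ++k) {
        double F = std::pow(x, 4);
        double g = 4*std::pow(x, 3);                           // F'(x)
        double h = 12*x*x;                                     // F''(x)
        double step = std::max(-radius, std::min(radius, -g/h)); // model minimizer, clipped to the region
        double predicted = -(g*step + 0.5*h*step*step);        // model reduction L(0) - L(step)
        double actual = F - std::pow(x + step, 4);             // true reduction of F
        double rho = actual/predicted;                         // gain ratio
        if (rho > 0.75) radius *= 2.0;                         // good agreement: enlarge the region
        else if (rho < 0.25) radius *= 0.5;                    // poor agreement: shrink the region
        if (rho > 0) x += step;                                // accept the step only if F decreased
        std::printf("k=%d  x=%g  F=%g  radius=%g\n", k, x, std::pow(x, 4), radius);
    }
    return 0;
}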
Define $L(\boldsymbol\delta)$ as the approximation of $f(\boldsymbol\beta+\boldsymbol\delta)$, let $x_k$ be the $k$-th iterate, and write $f_k=f(x_k)$. The trust-region subproblem is
$$\min_{\boldsymbol\delta} L_k(\boldsymbol\delta)=c_k^T\boldsymbol\delta+\frac{1}{2}\boldsymbol\delta^TB_k\boldsymbol\delta\qquad s.t.\ \parallel\boldsymbol\delta\parallel<\Delta_k$$
The actual reduction is:
$$\Delta F_k=F_k-F(x_k+\boldsymbol\delta_k)$$
The predicted reduction is:
$$\Delta L_k=L_k(0)-L_k(\boldsymbol\delta_k)$$
Their ratio is the gain ratio, which is used to monitor the quality of the step:
$$q_k=\frac{\Delta F_k}{\Delta L_k}=\frac{S(\boldsymbol\beta)-S(\boldsymbol\beta+\boldsymbol\delta)}{\frac{1}{2}\boldsymbol\delta^T(\lambda\boldsymbol\delta+J^T[\boldsymbol y-f(\boldsymbol\beta)])}$$
In the code below (where r = y - f(beta) and hlm is the step delta) this is computed as
q = (mse - mse_temp)/(0.5*hlm.t()*(u*hlm + Jf.t()*r));
If q is large, $L$ is very close to $F(x_k+\boldsymbol\delta_k)$, so we can decrease the damping factor $\lambda$ so that the next iteration behaves more like Gauss-Newton. If q is small or negative, the approximation is poor, so we increase the damping factor and reduce the step size, making the algorithm behave more like steepest descent. Specifically:
a. When q is greater than 0, the iteration is accepted:
$$\lambda=\lambda\cdot\max\left\{\tfrac{1}{3},\,1-(2q-1)^3\right\},\qquad v=2$$
The corresponding code is:
if(q_value>0)
{
double s = 1.0/3.0;
v = 2;
mse = mse_temp;
params = params_tmp;
double temp = 1 - pow(2*q_value-1,3);
if(s>temp)
{
u = u * s;
}
else
{
u = u * temp;
}
}
b. When q is less than or equal to 0, the iteration is rejected:
$$\lambda=\lambda\cdot v,\qquad v=2v$$
The corresponding code is:
else
{
// step rejected: keep the previous parameters and increase the damping
u = u*v;
v = 2*v;
}
The complete example program is as follows:

//
// Created by wpr on 18-12-17.
//
#include <iostream>
#include <cstdio>
#include <cmath>
#include <opencv2/opencv.hpp>
using namespace std;
using namespace cv;
const double DERIV_STEP = 1e-5;
const int MAX_ITER = 100;
void LM(double(*Func)(const Mat &input, const Mat params), // function pointer
const Mat &inputs, const Mat &outputs, Mat& params);
double Deriv(double(*Func)(const Mat &input, const Mat params), // function pointer
const Mat &input, const Mat params, int n);
// The user defines their function here
double Func(const Mat &input, const Mat params);
int main()
{
// For this demo we're going to try and fit to the function
// F = A*exp(t*B)
// There are 2 parameters: A B
int num_params = 2;
// Generate random data using these parameters
int total_data = 8;
Mat inputs(total_data, 1, CV_64F);
Mat outputs(total_data, 1, CV_64F);
//load observation data
for(int i=0; i < total_data; i++) {
inputs.at<double>(i,0) = i+1; //load year
}
//load America population
outputs.at<double>(0,0)= 8.3;
outputs.at<double>(1,0)= 11.0;
outputs.at<double>(2,0)= 14.7;
outputs.at<double>(3,0)= 19.7;
outputs.at<double>(4,0)= 26.7;
outputs.at<double>(5,0)= 35.2;
outputs.at<double>(6,0)= 44.4;
outputs.at<double>(7,0)= 55.9;
// Guess the parameters, it should be close to the true value, else it can fail for very sensitive functions!
Mat params(num_params, 1, CV_64F);
//init guess
params.at<double>(0,0) = 5;
params.at<double>(1,0) = 0.2;
LM(Func, inputs, outputs, params);
printf("Parameters from LM: %lf %lf\n", params.at<double>(0,0), params.at<double>(1,0));
return 0;
}
double Func(const Mat &input, const Mat params)
{
// Assumes input is a single row matrix
// Assumes params is a column matrix
double A = params.at<double>(0,0);
double B = params.at<double>(1,0);
double x = input.at<double>(0,0);
return A*exp(x*B);
}
//calc the n-th params' partial derivation , the params are our final target
double Deriv(double(*Func)(const Mat &input, const Mat params), const Mat &input, const Mat params, int n)
{
// Assumes input is a single row matrix
// Returns the derivative of the nth parameter
Mat params1 = params.clone();
Mat params2 = params.clone();
// Use central difference to get derivative
params1.at<double>(n,0) -= DERIV_STEP;
params2.at<double>(n,0) += DERIV_STEP;
double p1 = Func(input, params1);
double p2 = Func(input, params2);
double d = (p2 - p1) / (2*DERIV_STEP);
return d;
}
void LM(double(*Func)(const Mat &input, const Mat params),
const Mat &inputs, const Mat &outputs, Mat& params)
{
int m = inputs.rows;
int n = inputs.cols;
int num_params = params.rows;
Mat r(m, 1, CV_64F); // residual matrix
Mat r_tmp(m, 1, CV_64F);
Mat Jf(m, num_params, CV_64F); // Jacobian of Func()
Mat input(1, n, CV_64F); // single row input
Mat params_tmp = params.clone();
double last_mse = 0;
double u = 1, v = 2; // u is the damping factor lambda, v controls how fast it grows
Mat I = Mat::eye(num_params, num_params, CV_64F);//construct identity matrix
int i =0;
for(i=0; i < MAX_ITER; i++)
{
double mse = 0;
double mse_temp = 0;
for(int j=0; j < m; j++)
{
for(int k=0; k < n; k++)
{//copy Independent variable vector, the year
input.at<double>(0,k) = inputs.at<double>(j,k);
}
r.at<double>(j,0) = outputs.at<double>(j,0) - Func(input, params);//diff between previous estimate and observation population
mse += r.at<double>(j,0)*r.at<double>(j,0);
for(int k=0; k < num_params; k++) {
Jf.at<double>(j,k) = Deriv(Func, input, params, k); //construct jacobian matrix
}
}
mse /= m;
params_tmp = params.clone();
Mat hlm = (Jf.t()*Jf + u*I).inv()*Jf.t()*r; //calculate deta
params_tmp += hlm; //update value
for(int j=0; j < m; j++) {
for(int k=0; k < n; k++)
{//copy independent variable vector before evaluating the trial parameters
input.at<double>(0,k) = inputs.at<double>(j,k);
}
r_tmp.at<double>(j,0) = outputs.at<double>(j,0) - Func(input, params_tmp);//diff between current estimate and observation population
mse_temp += r_tmp.at<double>(j,0)*r_tmp.at<double>(j,0);//diff square sum
}
mse_temp /= m;//mean of diff square sum
Mat q(1,1,CV_64F);
q = (mse - mse_temp)/(0.5*hlm.t()*(u*hlm + Jf.t()*r)); //gain ratio: actual reduction over predicted reduction
double q_value = q.at<double>(0,0);
if(q_value>0)
{
double s = 1.0/3.0;
v = 2;
mse = mse_temp;
params = params_tmp;
double temp = 1 - pow(2*q_value-1,3);
if(s>temp)
{
u = u * s;
}
else
{
u = u * temp;
}
}
else
{
// step rejected: keep the previous parameters, increase the damping and retry
u = u*v;
v = 2*v;
}
// An accepted step no longer reduces the mse noticeably, so quit
if(q_value > 0 && fabs(mse - last_mse) < 1e-8) {
break;
}
//printf("%d: mse=%f\n", i, mse);
printf("%d %lf\n", i, mse);
last_mse = mse;
}
}