VIO系列——从滤波和优化的角度研究视觉惯性里程计

VIO系列——从滤波和优化的角度研究视觉惯性里程计

A review of visual inertial odometry from filtering and optimisation perspectives

摘要:

Visual inertial odometry (VIO) is a technique to estimate the change of a mobile platform in position and orientation overtime using the measurements from on-board cameras and IMU sensor. Recently, VIO attracts significant attentions from large number of researchers and is gaining the popularity in various potential applications due to the miniaturisation in size and low cost in price of two sensing modularities. However, it is very challenging in both of technical development and engineering implementation when accuracy, real-time performance, robustness and operation scale are taken into consideration. This survey is to report the state of the art VIO techniques from the perspectives of filtering and optimisation-based approaches, which are two dominated approaches adopted in the research area. To do so, various representations of 3D rigid motion body are illustrated. Then filtering-based approaches are reviewed, and followed by optimisation-based approaches. The links between these two approaches will be clarified via a framework of the Bayesian Maximum A Posterior. Other features, such as observability and self calibration, will be discussed.
视觉惯性里程计 (VIO) 是一种使用车载摄像头和 IMU 传感器的测量值估计移动平台随时间推移的位置和方向变化的技术。最近,VIO引起了大量研究人员的极大关注,并且由于两类传感模块的尺寸小型化和低成本的价格,在各种潜在应用中越来越受欢迎。然而,在技术开发和工程实施中,当考虑到准确性、实时性能、鲁棒性和操作规模时,这是非常具有挑战性的。本文将从研究领域采用的两种主要方法,即滤波和基于优化的方法的角度报告最先进的VIO技术。为此,列举了 3D 刚性运动体的各种表示形式。而后调研基于滤波的方法和基于优化的方法。这两种方法之间的联系将通过贝叶斯最大值A后验框架来阐明。其他特点也将被讨论,例如可观测性和自校准。

  1. Introduction

Localisation and mapping are two fundamental problems in the research area of robot navigation and control. This is evidenced by the fact that simultaneous localisation and mapping (SLAM) techniques and structure from motion (SFM)
techniques [1] have been one of major topics in robotic and computer vision research communities, respectively, for many years. With the increased interest in applying these techniques to small-sized platforms, such as small-sized unmanned aerial vehicles (UAVs) or handhold mobile devices, the research focus of localisation and mapping is shifted towards the use of cameras and inertial measurement unit (IMU) sensor. These sensors are made available nowadays with high accuracy, miniaturised size, and low cost because of the fast-developing manufacturing of chips and micro-electromechanical systems (MEMS) devices. And they are complimentary with one another in a way which would be able to compensate for the errors made by each of them via the redundant information they provided. Furthermore, the evidence from biological studies shows that the navigation of human beings and some of animals is partly depending on various forms of the combination between motion sensing modalities and vision.[2–4]
定位和建图是机器人导航与控制研究领域的两个基本问题。多年来,同时定位与建图(SLAM)技术和运动结构恢复(SFM)技术[1]一直是机器人和计算机视觉研究界的主要课题之一。随着人们越来越关注将这些技术应用于小型平台,例如小型无人机(UAV)或手持移动设备,定位和建图的研究重点转向使用相机和惯性测量单元(IMU)传感器。由于芯片和微机电系统(MEMS)器件的快速发展,这些传感器如今具有高精度、小型化尺寸和低成本。它们彼此互补,能够通过他们提供的冗余信息来弥补它们各自的误差。此外,生物学研究的证据表明,人类和一些动物的导航部分取决于运动传感模式和视觉之间的各种形式的组合。[2–4]

Localisation and mapping problem is a state estimation problem in robotic community. Stochastic estimation algorithms for dynamic systems given noisy measurements, such as extended Kalman filter (EKF) or particle filters(PF), are the main stream tools being used. Some proprioceptive sensors provide the measurements on change in pose overtime and are formulated as a data-driven dynamic model of mobile platforms. The examples include optical encoders in ground mobile robots and IMUs in flying robots. They are well known for the accumulated errors and biased measurements. Some of exteroceptive sensors, such as cameras or laser ranging finders, are able to provide angular or range measurements. Based on triangulation or trilateration methods, they are able to estimate the position or orientation of mobile platforms. The estimation reliability heavily depends on environment conditions and the results are sensitive to the noise of measurements. However, they can be formulated as the measurement models in state estimation problems. Then, EKF or PF is able to propagate the estimated distribution from previous step to current step by fusing the distribution in dynamic model and the distribution in measurement models. In most cases, EKF is used instead of PF due to the large number of states to be estimated. Then, the means and covariances of Gaussian distributions are estimated.
定位和建图问题是机器人社区中的状态估计问题。给定噪声测量的动态系统的随机估计算法,例如扩展卡尔曼滤波器(EKF)或粒子滤波器(PF),是正在使用的主流工具。一些本体感受传感器提供姿态随时间变化的测量,并被表述为移动平台的数据驱动动态模型。这些示例包括地面移动机器人中的光学编码器和飞行机器人中的IMU。它们的累积误差和偏置测量众所周知。一些外部传感器,如相机或激光测距仪,能够提供角度或距离测量。基于三角测量或三边测量方法,他们能够估计移动平台的位置或方向。估计可靠性在很大程度上取决于环境条件,并且结果对测量噪声很敏感。但是,它们可以表述为状态估计问题中的测量模型。然后,EKF 或 PF 能够通过融合动态模型中的分布和测量模型中的分布,将估计分布从上一步传播到当前步骤。在大多数情况下,由于需要估计的状态数量很多,因此使用 EKF 代替 PF。然后,估计高斯分布的均值和协方差。

Motion estimation from vision is an optimisation problem in computer vision community.[5] The change in camera pose is iteratively computed via image alignment.[6,7] Normally, features are extracted from one image and reprojected to another image captured from a different view. Then, the error cost between features and reprojected features is minimised to find the pose change of camera and the structures of environment. The optimisation-based algorithms are mainly gradient based, such as Newton’s method or Gauss–Newton’s method. In most cases, the sparse structure of matrix is exploited to increase the computational efficiency. Only the values of estimated states are provided while the information on estimation distributions is not available.
视觉运动估计是计算机视觉社区中的一个优化问题。[5] 相机姿态的变化是通过图像对齐迭代计算的。[6,7] 通常,特征是从一张图像中提取的,然后重新投影到从不同视角捕获的另一张图像上。然后,最小化特征和重新投影特征之间的误差,以找到相机的姿势变化和环境结构。基于优化的算法主要是基于梯度的,例如牛顿方法或高斯-牛顿方法。在大多数情况下,利用矩阵的稀疏结构来提高计算效率。仅提供估计状态的值,而没有有关估计分布的信息。

The link between filtering and optimisation-based approaches can be built up within the framework of Bayesian inference. The optimisation-based approaches are viewed as a maximum likelihood (ML) formula where the state for which the total probability of measurements is highest is iteratively found. The filtering-based approaches are viewed as a Maximum A Posterior (MAP) formula, where the prior distribution of platform pose is constructed from the measurements of proprioceptive sensors and the likelihood distribution is built up with the measurements of exteroceptive sensors. For non-linear dynamic model and/or non-linear measurement model, iterated EKF is equivalent to the optimisation-based approaches where iterative updates on each single step in EKF are conducted. The optimisation-based approaches could be reformulated as a MAP problem from a ML problem by adding a regularisation term or prior term from the measurement of pro-prioceptive sensors or other sources. These are on-line or ‘causal’algorithms where the current estimation depends on current and previous measurements. The off-line or ‘non-casual’ algorithms are batch processing procedures where the current estimation depends on full data-set or it can be said they depend on not only current and previous measurements but also future measurements. In optimisation-based approaches, the batch processing is melt down to solving a group of linear algebraic equations. In filtering-based approaches, the Kalman smoother is able to find the posterior Gaussian distribution via a forward pass and a backward pass.
滤波和基于优化的方法之间的联系可以在贝叶斯推理的框架内建立起来。基于优化的方法被视为最大似然 (ML) 公式,其中迭代找到测量总概率最高的状态。基于滤波的方法被视为最大后验(MAP)公式,其中平台姿势的先验分布是根据本体感觉传感器的测量值构建的,可能性分布是通过外感受传感器的测量值建立的。对于非线性动态模型和/或非线性测量模型,迭代 EKF 等效于基于优化的方法,其中对 EKF 中的每个步骤进行迭代更新。基于优化的方法可以通过添加正则化项或来自亲预感受传感器或其他来源测量的先前项,从 ML 问题重新表述为 MAP 问题。这些是在线或“因果”算法,其中电流估计取决于当前和以前的测量。离线或“非随意”算法是批处理过程,其中当前的估计取决于完整的数据集,或者可以说它们不仅取决于当前和以前的测量,还取决于未来的测量。在基于优化的方法中,批处理被熔化为求解一组线性代数方程。在基于滤波的方法中,卡尔曼平滑器能够通过正向传递和向后传递找到后高斯分布。

There is a long list of state variables and parameters in a VIO problem, which needs to be estimated by given measurements from IMU and camera. Some of them are not observable or identifiable, leading to an error growing performance. The analysis of observability and identifiability of states and parameters provides a clear understanding of the estimation results, and potentially provides a guide to sufficiently exciting the platform via specialised motion patterns.
VIO 问题中有一长串状态变量和参数,需要通过 IMU 和相机的给定测量值来估计。其中一些是不可观察或识别的,导致误差增长表现。对状态和参数的可观测性和可识别性的分析提供了对估计结果的清晰理解,并可能为通过专门的运动模式充分激发平台提供指导。

The popular SLAM techniques are able to provide the estimated results on pose of platform and structure of environment. A process called loop closure detection is required to bound the accumulated errors caused by estimation algorithms.[8–10] However, the loop closure is not explicitly pursued in VIO techniques. Therefore, the estimated errors in VIO would be accumulated and could not be bounded. Due to no loop closure detection, long time and large-scale localisation become possible within small-sized device. Then, the research question is shifted towards how to exploit the techniques that could slow down the error increasing.
流行的SLAM技术能够提供平台姿态和环境结构的估计结果。需要一个称为闭环检测的过程来绑定估计算法引起的累积误差。[8–10]但是,在 VIO 技术中并未明确追求闭环。因此,VIO中的估计误差将累积并且无法限制。由于没有闭环检测,因此可以在小型设备中进行长时间和大规模定位。然后,研究问题转向如何利用可以减缓误差增加的技术。

This paper will present a survey of recent developments and advances of VIO techniques. It is organised from both filtering and optimisation perspectives, illustrating their fundamental models, algorithms and recent results. The survey will also highlight the links between two approaches and contribute an in-depth view of VIO techniques. The discussion on state observability and parameter identifiability will be provided. Our vision on this research area is summarised. In the following, filtering-based approaches are introduced in Section 2, followed by optimisation-based approaches in Section 3. The links between them are summarised in Section 4. The analysis of observability and identifiability will be addressed in Section 5. The conclusions and future work are briefly provided in Section 6.
本文将介绍VIO技术的最新发展和进展。它从过滤和优化的角度进行组织,说明了它们的基本模型、算法和最近的结果。该调查还将强调两种方法之间的联系,并深入探讨VIO技术。将提供关于状态可观测性和参数可识别性的讨论。总结了我们对这一研究领域的愿景。在下文中,第 2 节介绍了基于过滤的方法,然后在第 3 节中介绍了基于优化的方法。第4节总结了它们之间的联系。可观测性和可识别性的分析将在第5节中讨论。结论和今后的工作在第6节中简要介绍。

  1. Filtering-based approaches

An EKF framework generally consists of a prediction step and an updating step. For a filtering-based VIO approach, inertial sensors are able to provide acceleration and rotational velocity measurements in three axes, which serve as the data-driven dynamic model or prior distribution for a 3D rigid body motion and make the motion prediction in prediction step. Cameras are able to provide the angular and ranging measurements between features and the mobile platform, which serve as the measurement model or likelihood distribution and update the prediction results in updating step.
EKF 框架通常由预测步骤和更新步骤组成。对于基于滤波的VIO方法,惯性传感器能够在三个轴上提供加速度和旋转速度测量,作为数据驱动的动态模型或3D刚体运动的先验分布,并在预测步骤中进行运动预测。相机能够提供要素和移动平台之间的角度和测距测量值,作为测量模型或似然分布,并在更新步骤中更新预测结果。

We assume a mobile platform, which is only equipped with a camera and an IMU, moves in an unknown environment. The spatial relationship between the camera and IMU is fixed and can be expressed as known position and attitude. The aim of an EKF based VIO algorithm is to provide the information of position and orientation of the platform using inertial measurements and visual observations of unknown environment. Next, we will present a full description of the EKF framework based on the work in [11], which includes the state presentation, IMU data-driven dynamics and the visual observations. This will pave a way for further analysis and discussion in the following sections.
我们假设一个仅配备摄像头和IMU的移动平台在未知环境中移动。相机和IMU之间的空间关系是固定的,可以表示为已知的位置和姿态。基于EKF的VIO算法的目的是使用惯性测量和对未知环境的视觉观察来提供平台的位置和方向信息。接下来,我们将基于 [11] 中的工作对 EKF 框架进行全面描述,其中包括状态呈现、IMU 数据驱动的动态和视觉观察。这将为以下各节的进一步分析和讨论铺平道路。

2.1. IMU data driven dynamic model

An IMU state vector of 3D rigid body at any time instant can be defined by a 16 × 1 vector, where IW q̄ is the unit quaternion describing the rotation from world frame {W} to IMU frame {I}, W pI and W v I are the position and velocity with respect to {W}, bg and ba are 3×1 vectors that describe the biases affecting the gyroscope and accelerometer measurements, respectively. The spatial relations between frames are shown in Figure 1.
3D 刚体在任何时间时刻的 IMU 状态向量都可以由 16 × 1 向量定义,其中 IW q̄ 是描述从世界帧 {W} 到 IMU 帧 {I} 的旋转的单位四元数,W pI 和 W v I 是相对于 {W} 的位置和速度,bg 和 ba 是 3×1 个向量,用于描述影响陀螺仪和加速度计测量的偏差, 分别。帧之间的空间关系如图 1 所示。

Assuming that the inertial measurements contain noises with zero-mean Gaussian models, denoted as ng and na , the real angular velocity ω and the real acceleration a are related with gyroscope and accelerometer measurements in the following form:
假设惯性测量包含零均值高斯模型的噪声,表示为ng和na,实际角速度ω和实际加速度a与陀螺仪和加速度计测量相关,形式如下:

The data-driven dynamic model is a combination of 3D rigid body dynamics and the above IMU measurements, which can be represented by the following form:
数据驱动的动态模型是3D刚体动力学和上述IMU测量的组合,可以用以下形式表示:

where CIW q̄ denotes a rotational matrix described by IW q̄ and g as the gravity vector in world frame {W}. ω = ωx ω y ωz
is the angular velocityexpressed in IMU frame {I}, while is the quaternion kinematic matrix 0−ωT with ω× representing the skew-symmetric matrix. The IMU biases are modelled as random walk process, driven by the Gaussian noise, nwg and nwa . Let unit quaternion be q̄ := q0 , q T and its corresponding rotational matrix be Cq̄ . Two orientation representations can be linked via the equation below:
其中 CIW q̄ 表示由 IW q̄ 和 g 描述为世界帧 {W} 中的重力矢量的旋转矩阵。ω = ωx ω y ωz
是用 IMU 框架 {I} 表示的角速度,而四元数运动矩阵 0−ωT 与 ω× 表示斜对称矩阵。IMU偏差被建模为随机游走过程,由高斯噪声,nwg和nwa驱动。设单位四元数为 q̄ := q0 , q T 及其对应的旋转矩阵为 Cq̄ 。两个方向表示可以通过以下等式链接:

Apart from the current IMU state x I mentioned above, a widely used state vector for filtering-based approaches includes T scale λ (slow drift) and current camera a spatial pose WC q̄ T W pTC . Here, we present a system state only containing a camera pose and it can be expressed as a 24-element vector. However, in various methods, the whole system state may include more than one past camera pose and keeps a moving window to limit them during state updating.
除了上面提到的当前IMU状态x之外,用于基于滤波方法的广泛使用的状态向量包括T尺度λ(慢漂移)和当前相机的空间姿态WC q̄ T W pTC。在这里,我们呈现一个仅包含相机姿势的系统状态,它可以表示为 24 个元素向量。但是,在各种方法中,整个系统状态可能包括多个过去的相机姿势,并在状态更新期间保持移动窗口以限制它们。

with other three differential equations,
与其他三个微分方程,

Applying the expectation operator in above equations, we obtain the prediction results using the IMU data-driven dynamic model:
应用上述方程中的期望运算符,我们使用IMU数据驱动的动态模型获得预测结果:

This can be abstracted as a non-linear system function with camera measurement z and a process noise term n p ∼ N (0, Q),
这可以抽象为具有相机测量z和过程噪声项n p ∼ N (0, Q)的非线性系统函数,

2.2. Error state representation and updating

For the position, velocity and bias state variables, the arithmetic difference can be applied (i.e. the error in the estimate x̂ of a quantity x is defined as x̃ = x − x̂), but the error quaternion should be defined under the assumption as local minimal situation.[12] If q̄ˆ is the estimated value of quaternion q̄, then the orientation error is described by the error quaternion δ q̄, which is defined by the relation q̄ = δ q̄ ⊗ q̄ˆ ⇒ δ q̄ = q̄ ⊗ q̄ˆ −1 . In this expression, the symbol ⊗? denotes quaternion multiplication. Intuitively, the quaternion δ q̄ describes a small rotation that causes the true and estimated attitude to coincide. Since attitude corresponds to three Degree of Freedom (DoF), δθ is used to describe the attitude errors, which is a minimal represen- tation. The error quaternion δ q̄ can be written as
对于位置、速度和偏置状态变量,可以应用算术差(即数量 x 的估计 x̂ 中的误差定义为 x̃ = x − x̂),但误差四元数应在假设下定义为局部最小情况。[12] 如果 q̄ˆ 是四元数 q̄ 的估计值,则取向误差由误差四元数 δ q̄ 描述,该误差由关系 q̄ = δ q̄ ⊗ q̄ˆ ⇒ δ q̄ = q̄ ⊗ q̄ˆ −1 定义。在这个表达式中,符号⊗?表示四元数乘法。直观地说,四元数 δ q̄ 描述了一个小的旋转,导致真实和估计的姿态重合。由于姿态对应于三个自由度(DoF),因此δθ用于描述姿态误差,这是一个最小的反驳。q̄ δ误差四元数可以写为

Thus, the error state vector containing 22 elements is
因此,包含 22 个元素的错误状态向量为

The differential equations for the continuous time error state are
连续时间误差状态的微分方程为

with ω̂ = ωm − b̂g and â = am − b̂a . By stacking the differential equations for error state, the linearised continuous time error state equation can be formed,
ω̂ = ωm − b̂g 和 â = am − b̂a .通过将误差状态的微分方程堆叠起来,可以形成线性化的连续时间误差状态方程,

with the noise vector n = naT , nTwa , nTg , nTwg . And the covariance matrix of n depends on the noise characteristics of IMU, Q c = diag σ 2na , σ 2nwa , σ 2ng , σ 2nwg . In practical, the inertial measurements for state propagation are obtained from IMU in discrete form, thus we assume the signals from gyroscopes and accelerometers are sampled with time interval t, and the state estimate is propagated using numerical integration like Runge–Kutta methods. Moreover, the covariance matrix of EKF filter is defined as
噪声向量 n = naT , nTwa , nTg , nTwg .而n的协方差矩阵取决于IMU的噪声特性,Q c = 诊断σ 2na,σ 2nwa,σ 2ng,σ 2nwg。实际上,状态传播的惯性测量值是以离散形式从IMU获得的,因此我们假设来自陀螺仪和加速度计的信号以时间间隔t采样,并且状态估计是使用Runge-Kutta方法等数值积分传播的。此外,EKF滤波器的协方差矩阵定义为

where P I I k|k is the covariance matrix of the IMU state, P C C k|k is the 6N × 6N covariance matrix of the camera pose estimates and P I C k|k is the correlation between errors in IMU state and camera pose estimates. With this notation, the covariance matrix can be propagated by
其中 P I |k 是 IMU 状态的协方差矩阵,P C C k|k 是相机姿态估计值的 6N × 6N 协方差矩阵,P I C k|k 是 IMU 状态误差与相机姿态估计值之间的相关性。使用此表示法,协方差矩阵可以通过以下方式传播:

where the state-transition matrix can be calculated by assuming F c and G c to be constant over the integration time interval between two consecutive steps,
其中,状态转移矩阵可以通过假设F c和G c在两个连续步骤之间的积分时间间隔内恒定来计算,

and the discrete-time covariance matrix Q d can also be derived through numerical integration,
离散时间协方差矩阵Q d也可以通过数值积分推导出,

Thus, the mean and covariance propagation process as of the EKF-based VIO framework is summarized as follows,
因此,基于 EKF 的 VIO 框架的均值和协方差传播过程总结如下:

(a) when IMU data, ωm and am , in a certain sample frequency, are available to the filter, the state vector is propagated using numerical integration on Equation (1).
(b) calculate F d and Q d according to (4) and (5) respectively.
© the propagated state covariance matrix is com-puted from (3).
(a) 当特定采样频率的 IMU 数据 ωm 和 am 可供滤波器使用时,状态矢量使用公式 (1) 上的数值积分传播。
(b) 分别根据(4)和(5)计算F d和Q d。
(c) 传播状态协方差矩阵由 (3) 计算而成。

2.3. Visual measurement model and updating

Due to the biases and noises in IMU data, the prediction results from prediction step become worse and worse with time increasing. The measurements from visual sensors would be able to provide key information to bound the increased errors. To do so in an EKF framework, key information extracted from images should be cast into measurement equations. There are various methods to build a measurement model. For example, the loosely coupled methods where image alignment is used to directly obtain the position and orientation changes fuse two estimated results together via an EKF framework.[13] The so-called tightly coupled methods advocate the use of key information extracted from images. The key information could be the features extracted from images via feature detectors,[14] direct light intensity with depth information (point cloud),[15,16] or semi-direct light intensity with depth information.[17] The key information is modelled as the measurement equation so that an analytic relationship between the key information and the state variables is provided. In general, a non-linear algebraic equation can be viewed as the measurement equation:
由于IMU数据中的偏差和噪声,预测步骤的预测结果随着时间的增加变得越来越差。来自视觉传感器的测量将能够提供关键信息来限制增加的误差。要在EKF框架中做到这一点,从图像中提取的关键信息应转换为测量方程。有多种方法可以构建测量模型。例如,使用图像对齐直接获取位置和方向变化的松散耦合方法通过EKF框架将两个估计结果融合在一起。[13]所谓的紧密耦合方法提倡使用从图像中提取的关键信息。关键信息可能是通过特征检测器从图像中提取的特征,[14]具有深度信息的直接光强度(点云)[15,16]或具有深度信息的半直射光强度。[17] 关键信息被建模为测量方程,以便提供关键信息和状态变量之间的分析关系。通常,非线性代数方程可以视为测量方程:

where nm models the Gaussian noise with zero mean and covariance R in visual measurement. After linearisation, the measurement error is expressed in a linear form:
其中nm在视觉测量中对均值为零和协方差R的高斯噪声进行建模。线性化后,测量误差以线性形式表示:

where H is the Jacobian matrix, and the noise term is Gaussian distributed and uncorrelated to the state error. ẑ and z represent the prediction and real measurement, respectively.
其中 H 是雅可比矩阵,噪声项是高斯分布,与状态误差无关。ẑ 和 z 分别表示预测和实际测量。

For using features as the visual information, the most straightforward way is to exploit the pinhole model to build up the measurement equation in camera frame {C}. A point u = [u v]T in an image is extracted to represent the key information. First, the spatial relationship between a point u in the camera frame and a point W pu in the world frame {W} is established as following equation:
对于使用特征作为视觉信息,最直接的方法是利用针孔模型在相机帧{C}中建立测量方程。提取图像中的点u = [u v]T以表示关键信息。首先,相机帧中的点 u 与世界帧 {W} 中的点 W pu 之间的空间关系建立为以下等式:

Strictly speaking, the visual world {V} is different to the real world {W}, but they are linked by initialising fixed translation W pV and rotation W q̄ V as shown in Figure 1. Here, for simplicity, we only use the world frame {W}.
严格来说,视觉世界 {V} 与现实世界 {W} 不同,但它们通过初始化固定平移 W pV 和旋转 W q̄ V 进行链接,如图 1 所示。在这里,为简单起见,我们只使用世界框架 {W}。

The position in the image frame is linked to the state variables W pC and W q̄ C in the world frame via the camera pinhole model, described in Equation (10). The measurement model thus can be expressed briefly as
图像帧中的位置通过相机针孔模型链接到世界帧中的状态变量 W pC 和 W q̄ C,如公式 (10) 中所述。因此,测量模型可以简单地表示为

And the estimation of this point in the world frame W p̂u can be calculated by iterative minimisation methods beforehand.[18]
而世界坐标系W p̂u中该点的估计可以通过迭代最小化方法预先计算出来。[18]

For other key visual information, such as pixel intensity or gradient, the measurement model should build up the link between a measurement and the state variables, then the non-linear measurement equation is linearised to obtain the Jacobian matrix.
对于其他关键的视觉信息,如像素强度或梯度,测量模型应该在测量和状态变量之间建立联系,然后线性化非线性测量方程以获得雅可比矩阵。

At this point, the updating step is ready for updating the prediction results made in prediction step. The Kalman gain is calculated as
此时,更新步骤已准备好更新在预测步骤中所做的预测结果。卡尔曼增益计算公式为

and final correction x̃ k+1 = K · z̃. After the correction, we can get the updated state estimate x k . Lastly, the error state covariance is updated as
和最终校正 x̃ k+1 = K ·z̃。修正后,我们可以得到更新的状态估计 x k 。最后,误差状态协方差更新为

The full update process is summarised as follows:
完整的更新过程总结如下:

(a) when visual data, normally raw image, in a certain sample frequency are available, some image processing procedures are adopted to extract the key information.
(b) calculate the residual as (7).
© compute the Kalman gain as (8).
(d) update the state by adding the correction.
(e) update the error state covariance as (9).
(a)当具有一定采样频率的视觉数据(通常是原始图像)可用时,采用一些图像处理程序来提取关键信息。
(b) 将残差计算为 (7)。
(c) 计算卡尔曼增益为 (8)。
(d) 通过添加更正来更新状态。
(e) 将误差状态协方差更新为 (9)。

  1. Optimisation-based approaches

The optimisation-based approaches mainly rely on the techniques of image processing for feature extraction and optimisation for image alignment, while inertial measurement is treated as prior, regularisation terms or totally ignored. In most cases, there are two stages in an optimisation-based approach: mapping and tracking.[15] In mapping stage, features, such as corners, edges or other landmarks in 3D space, are extracted from an image via various features detectors. Then a reprojected error is defined between two images for all the features detected. The error is used as a cost function to be optimised in order to find the coordinates of features or landmarks.[7] In tracking stage, the coordinates of features and landmarks in the map are used to define a reprojected error between two images, and an optimisation algorithm is applied again to find the changes in position and orientation of the mobile platform. The idea of separating the estimation problem into two stages is to obtain a fast tracking result while the mapping processing is time consuming.[14,15] Simultaneously optimising a cost defined using reprojected error between two images against coordinates of 3D features and pose changes of the mobile platform is possible [19] while using the concept of keyframes is able to marginalise old states to maintain a bounded optimisation window for real-time operation.
基于优化的方法主要依赖于图像处理技术进行特征提取和优化图像对齐,而惯性测量则被视为先验、正则化项或完全忽略。在大多数情况下,基于优化的方法有两个阶段:映射和跟踪。[15] 在映射阶段,通过各种特征检测器从图像中提取特征,例如 3D 空间中的角落、边缘或其他地标。然后,针对检测到的所有特征在两个图像之间定义重新投影误差。该误差用作要优化的成本函数,以便查找要素或地标的坐标。[7] 在跟踪阶段,地图中特征和地标的坐标用于定义两个图像之间的重新投影误差,并再次应用优化算法来查找移动平台的位置和方向的变化。将估计问题分为两个阶段的想法是在映射处理耗时的情况下获得快速跟踪结果。[14,15] 同时优化使用两个图像之间根据移动平台的3D特征坐标和姿势变化的重新投影误差定义的成本是可能的[19],而使用关键帧的概念能够边缘化旧状态,以保持实时操作的有限优化窗口。

3.1. Feature alignment

Iterative non-linear optimisation is formulated to find the camera pose changes and/or feature coordinates by minimising a reprojection error of observed regions in images. Normally, a map consists of features identified in a number of keyframes in which significant features are found. The map is presented by a series of 3D coordinate vectors of features. When a new image is obtained, a decision should be made about whether or not it is a keyframe. If so, the coordinates of new features found in this new image is computed via an image alignment algorithm and added to the map together with current camera pose. Otherwise, the map keeps unchanged.
迭代非线性优化旨在通过最小化图像中观察到区域的重投影误差来查找相机姿势变化和/或特征坐标。通常,地图由在多个关键帧中标识的要素组成,在这些关键帧中可以找到重要要素。该地图由一系列要素的 3D 坐标矢量表示。获取新图像时,应决定它是否为关键帧。如果是这样,则通过图像对齐算法计算在此新图像中找到的新要素的坐标,并与当前相机姿势一起添加到地图中。否则,地图将保持不变。

As shown in Figure 2, a 3D rigid body transformation T ∈ S E(3) denotes rotation and translation in 3D:
如图 2 所示,3D 刚体变换 T ∈ S E(3) 表示 3D 中的旋转和平移:

The optimisation purpose in image alignment is to find the transformation T in each time step, i.e. T is regarded as the camera pose. A minimal representation for camera pose is better for optimisation purpose. The Lie algebra se(3) corresponding to the tangent space of S E(3) at the identity is used as the minimal representation. The algebra element is called twisted coordinates ξ = [ωT υ T ]T ∈ R6 . The map from Lie algebra se(3) to Lie group S E(3) is the exponential map T (ξ ) = exp(ψ (ξ )) and its inverse map is the logarithm map ψ (ξ ) = log T (ξ ), where ψ (ξ ) is the wedge operator,
图像对齐的优化目的是在每个时间步长中找到变换T,即T被视为相机姿势。相机姿势的最小表示形式更适合优化目的。对应于恒等式 S E(3) 的切空间的李代数 se(3) 用作最小表示。代数元素称为扭曲坐标 ξ = [ωT υ T ]T ∈ R6 。从李代数 se(3) 到李群 S E(3) 的映射是指数映射 T (ξ) = exp(ψ (ξ )),它的逆映射是对数映射 ψ (ξ) = log T (ξ ),其中 ψ (ξ) 是楔形算子,

A 3D point with homogeneous vector C pu in the camera frame maps to the image coordinate u via the pinhole camera projection model:
相机帧中具有均匀矢量 C pu 的 3D 点通过针孔相机投影模型映射到图像坐标 u:

where u 0 , v0 and f u , f v are the principal point and focal length, respectively, representing camera intrinsic parameters which can be calibrated beforehand. Given a depth information du for a point u, the 3D point in the camera frame can be recovered from an image coordinate:
其中U 0、V0和F U、F V分别是主点和焦距,代表可以事先校准的相机内在参数。给定点 u 的深度信息 du,可以从图像坐标中恢复相机帧中的 3D 点:

The mapping from a point W pu in the world frame to a point in the camera frame is C pu = exp(ψ (ξ ))W pu . For monocular camera, the visual measurement model becomes z = [u v]T = u(ξ ,W pu ). The depth information of a pixel is obtained by triangulating two consecutive keyframes using a set of independent filter.[14,20] For stereo cameras, z = [u l vl u r ]T , which is also a function of ξ and W pu . The depth information is obtained via stereo vision techniques. The reprojection error is defined as the difference between a measurement z and its estimate ẑ(ξ̂ ,W p̂u ):
从世界帧中的点 W pu 到相机帧中的点的映射为 C pu = exp(ψ (ξ ))W pu 。对于单目相机,视觉测量模型变为 z = [u v]T = u(ξ ,W pu )。像素的深度信息是通过使用一组独立的滤波器对两个连续的关键帧进行三角测量而获得的。[14,20] 对于立体相机,z = [u l vl you are ]T ,这也是 ξ 和 W pu 的函数。深度信息是通过立体视觉技术获得的。重投影误差定义为测量值 z 与其估计值 ẑ(ξ̂ ,W p̂u ) 之间的差值:

The cost function η ξ̂ ,W p̂u is the sum of all squared errors z̃ with a weighting matrix W :
成本函数 η ξ̂ ,W p̂u 是所有平方误差 z̃ 与加权矩阵 W 的总和:

where j from 1 to m is the index of points within a frame, and i is the number of frames indexing a set with size n. The optimisation problem is defined as
其中 j 从 1 到 m 是帧内点的索引,i 是索引大小为 n 的集合的帧数。优化问题定义为

The method used to solve the above problem is called Bundle Adjustment.[7]
用于解决上述问题的方法称为束调整。[7]

In PTAM,[14] the optimisation process is separated into two parallel threads: mapping and tracking. Given the tracking results ξ̂ , the optimisation problem in the mapping thread is:
在PTAM中,[14]优化过程分为两个并行线程:映射和跟踪。给定跟踪结果 ξ̂ ,映射线程中的优化问题是:

Given the mapping results W p̂u , the optimisation problem in the tracking thread is:
给定映射结果 W p̂u ,跟踪线程中的优化问题是:

In Levenberg–Marquardt (LM) algorithm, the key is to find an increment δ in each iterative step, then update the optimised state variables. The solution to δ is found by:
在Levenberg-Marquardt(LM)算法中,关键是在每个迭代步骤中找到增量δ,然后更新优化的状态变量。δ的解决方案如下:

where H is the Jacobian of z and α is the LM damping parameter.
其中 H 是 z 的雅可比矩阵,α 是 LM 阻尼参数。

In most cases, only the features in sparse keyframes are maintained in the computer in order to limit the optimisation complexity as the same scale as filtering-based approaches.[19]
在大多数情况下,计算机中仅保留稀疏关键帧中的特征,以限制优化复杂性,与基于过滤的方法相同。[19]

3.2. Dense alignment

The reprojection error is defined using the coordinates of features or landmarks within images in the above discussion. However, it requires a feature detection process, which ignores most parts of an image as the features are only extracted sparsely. The feature extraction process is often badly conditioned, noisy and not robust therefore relying on higher level robust estimation techniques. Since all these estimation steps are not on the level of raw image measurements (intensities), they systematically propagate feature extraction errors and accumulate drifts.
重投影误差是使用上述讨论中影像中要素或地标的坐标定义的。但是,它需要一个特征检测过程,该过程忽略图像的大部分部分,因为特征只是稀疏地提取。特征提取过程通常条件恶劣、噪声大且不鲁棒,因此依赖于更高水平的鲁棒估计技术。由于所有这些估计步骤都不在原始图像测量(强度)的水平上,因此它们系统地传播特征提取误差并累积漂移。

Appearance and optical flow-based techniques, on the other hand, are image-based and minimise an error directly based on raw image measurements,[21] i.e. the photometric(brightness or intensity) function is used, and therefore are called direct or dense methods. Dense methods aim at using the whole image for alignments. Non-linear optimisation techniques are used to find the transformation between two scenes. It is increasingly clear that it is possible to get more complete, accurate and robust results using dense methods for both mapping and tracking. The work in [22,23] minimises the photometric error via image alignment for visual odometry using Kinect RGB-D cameras. The work in [24] includes both of photometric error and depth error to minimise.
另一方面,基于外观和光流的技术是基于图像的,并且直接基于原始图像测量来最小化误差,[21]即使用光度(亮度或强度)功能,因此称为直接或密集方法。密集方法旨在使用整个图像进行对齐。非线性优化技术用于查找两个场景之间的转换。越来越清楚的是,使用密集的映射和跟踪方法可以获得更完整、更准确和更可靠的结果。[22,23] 中的工作通过使用 Kinect RGB-D 相机进行视觉测程的图像对齐,最大限度地减少了光度误差。[24]中的工作包括光度误差和深度误差,以尽量减少。

With RGB-D camera available, such as Kinect sensors, the depth information of a pixel makes the alignment of multiple scans possible by minimising distance measures between all of the data in each image, rather than limited number of features or landmarks. Such dense scan is able to reconstruct surface scenes in the environment and track the pose. Given two successive dense depth measurements, surfaces or 3D point clouds of the same static scene observed from different viewpoints, one can find the change in pose of the camera by obtaining the rigid transformation that best maps one point cloud onto the other.
借助可用的 RGB-D 相机(如 Kinect 传感器),像素的深度信息通过最小化每个图像中所有数据之间的距离测量值(而不是限制数量的要素或地标),使多次扫描的对齐成为可能。这种密集扫描能够重建环境中的表面场景并跟踪姿势。给定从不同视点观察的同一静态场景的两个连续密集深度测量、表面或 3D 点云,可以通过获得将一个点云最好地映射到另一个点云的刚性变换来找到相机姿势的变化。

The Iterative Closest Point (ICP) is popular algorithm to match the scans through optimising the rigid transformation.[25] The ICP works in this way: given two corresponding point sets: β = {β1 , · · · , βn } and γ = {γ1 , · · · , γn } and the translation and rotation between them are t and R, respectively, t and R are found by minimising the sum of squared errors:
迭代最近点 (ICP) 是一种流行的算法,通过优化刚性变换来匹配扫描。[25] ICP 以这种方式工作:给定两个相应的点集:β = {β1 , · · · , βn } 和 γ = {γ1 , · · · , γn } 并且它们之间的平移和旋转分别为 t 和 R,t 和 R 是通过最小化平方误差的总和找到的:

If the correct correspondences are known, the rotation and translation can be calculated in a closed form. If correct correspondences are unknown as in most cases, it is generally impossible to find the optimal relative rotation and translation in one step. The iterating to find the alignment [25] is sought via finding the closest points’correspondence. The ICP converges if starting positions are close enough. One work using ICP for visual odometry is reported in [26].
如果已知正确的对应关系,则可以以封闭形式计算旋转和平移。如果在大多数情况下正确对应是未知的,通常不可能一步找到最佳的相对旋转和平移。通过查找最近点的对应关系来寻求查找对齐 [25] 的迭代。如果起始位置足够接近,则 ICP 会收敛。[26]报道了1项使用ICP进行视觉里程计的工作。

The dense alignment can also be done by matching current image against a scene model. KinectFusion in [27] is the one which provides a real-time solution for dense alignment via separated mapping and tracking threads in GPU where a truncated signed distance function is employed for scene model. However, performing ICP on the full point cloud is computationally expensive, and does not provide a real-time solution on a general PC.
也可以通过将当前图像与场景模型进行匹配来完成密集对齐。[27] 中的 KinectFusion 通过 GPU 中的分离映射和跟踪线程为密集对齐提供了实时解决方案,其中场景模型采用截断的有符号距离函数。但是,在整个点云上执行ICP的计算成本很高,并且无法在普通PC上提供实时解决方案。

For a monocular camera, the depth information is not available. An inverse depth map is estimated in [15] by minimising a photometric error via non-convex optimisation algorithm in the mapping thread. In the tracking thread, the image alignment is used. Estimating the depth information in a monocular camera is also conducted via a Bayesian filter followed by an optimisation process to smooth the depth map in [28].
对于单眼相机,深度信息不可用。在 [15] 中,通过在映射线程中通过非凸优化算法最小化光度误差来估计反深度图。在跟踪线程中,使用图像对齐。单目相机中的深度信息估计也通过贝叶斯滤波器进行,然后进行优化过程以平滑[28]中的深度图。

To further simplify the computational complexity and maintain the accuracy of dense methods, semi-dense or semi-direct methods have been proposed recently [29,30] for monocular camera. A semi-dense depth map covering all image regions with non-negligible gradients. The inverse depth map is estimated using the Bayesian filter while the tracking is obtained by directly minimising the dense photometric error. As stated in [30], semi-direct or semi-dense methods use hundreds of small patches to increase the robustness and allow for neglecting the patch normals.
为了进一步简化计算复杂度并保持密集方法的准确性,最近提出了用于单目相机的半密集或半直接方法[29,30]。覆盖所有图像区域的半密集深度图,梯度不可忽略。使用贝叶斯滤波器估计反深度图,同时通过直接最小化密集的光度误差来获得跟踪。如[30]所述,半直接或半密集方法使用数百个小斑块来提高鲁棒性并允许忽略贴片法线。

3.3. Inertial measurement term

The measurement from an IMU sensor is the data source to the data-driven 3D rigid motion dynamics in filtering-based approaches, and is fused with the measurement from cameras via Kalman filter. This is tightly coupled as the cross variances between two parts are taken into consideration.[11] The loosely coupled fusion is to maintain a constant processing time by fusing the already estimated pose from visual sensor with the predicted pose from IMU via an EKF.[31] In optimisation-based approaches, the fusion between two parts can also be made tightly, i.e. no explicit pose estimates from camera is required. The predicted result from IMU driven dynamics in Equation (1) is viewed as a Gaussian distribution. The error between predicted results and true state is cast as the square error weighted by the covariance, then added to the cost function as a regularisation term.[19] In term of the Bayesian inference, the regularisation term is viewed as the prior while the image alignment term as the likelihood. The optimised result is a posteriori distribution, which is the one resulted from the ML estimation result of pure image alignment smoothed by the IMU prior.
IMU传感器的测量是基于滤波的方法中数据驱动的3D刚性运动动力学的数据源,并通过卡尔曼滤波器与相机的测量融合在一起。这是紧密耦合的,因为考虑了两个零件之间的交叉方差。[11] 松散耦合融合是通过 EKF 将视觉传感器的已估计姿势与来自 IMU 的预测姿势融合来保持恒定的处理时间。[31] 在基于优化的方法中,两个部件之间的融合也可以紧密进行,即不需要从相机进行明确的姿态估计。公式(1)中IMU驱动动力学的预测结果被视为高斯分布。预测结果与真实状态之间的误差被转换为由协方差加权的平方误差,然后作为正则化项添加到成本函数中。[19] 就贝叶斯推理而言,正则化项被视为先验项,而图像对齐项被视为可能性。优化结果是后验分布,这是由 IMU 先验平滑的纯图像对齐的 ML 估计结果得出的结果。

  1. Links between filtering and optimisation based approaches

Both of filtering-based and optimisation-based approaches can be formed under the Bayesian inference. When the succession of approximation linearisation is necessary, their link can be made explicitly via the iterated EKF. When the approximation linearisation is just a single step, the smoother-based approaches which include a forward pass and a backward pass are equivalent to the optimisation-based approaches which are solved via the Cholesky decomposition of information matrix of least square problems. To reduce the computational complexity, reducing the state variables to be estimated is implemented by maintaining only keyframes or a sliding window. In particular, the sliding window scheme or moving horizon estimation divides the cost into two parts, and one popular way to implement two parts is optimising the result in one part and marginalising out oldest states in another part with an EKF.
基于过滤和基于优化的方法都可以在贝叶斯推理下形成。当需要连续的近似线性化时,可以通过迭代的EKF显式建立它们的链接。当近似线性化只是一个步骤时,基于平滑的方法(包括正向传递和反向传递)等效于通过最小二乘问题的信息矩阵的Cholesky分解求解的基于优化的方法。为了降低计算复杂度,通过仅维护关键帧或滑动窗口来实现减少要估计的状态变量。特别是,滑动窗口方案或移动地平线估计将成本分为两部分,实现两部分的一种流行方法是优化一部分的结果,并使用 EKF 边缘化另一部分的旧状态。

4.1. Iterated EKF update
The core of filtering based approaches is the Kalman filter and the core of optimisation-based approaches is the Gauss–Newton method. The link between them is the iterated EKF (IEKF).[32] An EKF has two steps: prediction and update. Let the result of prediction step is x̂ k ∼ N (x k , P k ) at current time k. The difference between EKF and IEKF is that there is an iterative loop in the update step of IEKF while a single loop is executed in the update step of EKF. It is the iterative loop of IEKF which is able to drive the error caused by model linearisation as close as possible to the counterpart in optimisation-based approaches.
基于滤波的方法的核心是卡尔曼滤波器,基于优化的方法的核心是高斯-牛顿方法。它们之间的链接是迭代的EKF(IEKF)。[32] EKF有两个步骤:预测和更新。设预测步长的结果是当前时间 k 的 x̂ k ∼ N (x k, P k)。EKF 和 IEKF 的区别在于,在 IEKF 的更新步骤中有一个迭代循环,而在 EKF 的更新步骤中执行单个循环。IEKF的迭代循环能够在基于优化的方法中尽可能接近模型线性化引起的误差。

In the following, we will show the equivalence between the iterative loop in update step of IEKF and the Gauss–Newton method of optimisation approaches from maximising likelihood (ML).
在下文中,我们将展示IEKF更新步骤中的迭代循环与最大化似然(ML)优化方法的高斯-牛顿方法之间的等价性。

At time k, the IEKF has x k , x̂ k|k−1 and z k as the current state, the current state estimate and the measurement, respectively. The measurement model is the same as Equation (6) and x̂ k|k−1 ∼ N (x k , P k|k−1 ).
在时间 k 处,IEKF 分别将 x k、x̂ k|k−1 和 z k 作为当前状态、当前状态估计和测量值。测量模型与公式(6)和x̂ k|k−1 ∼ N (x k, P k|k−1)相同。

Define an error vector as in quadratic cost function with a free variable μ.
定义一个误差向量,如在二次成本函数中,使用自由变量μ。

……

In summary, the above iterative loop of Gauss–Newton method is viewed as the update step of IEKF, i.e.
综上所述,上述高斯-牛顿方法的迭代循环被视为IEKF的更新步骤,即

(a)Initialization: i = 0, μ(0) = x̂ k|k−1 and P k|k−1 = P k|k−1 ;
(b)Loop calculation from Equations (11) to (13); (i+1)
©Final updating: x̂ k|k = μ(i+1) and P k|k = P k|k−1 .
(一)初始化: i = 0, μ(0) = x̂ k|k−1 和 P k|k−1 = P k|k−1 ;
(二)从公式(11)到(13)的循环计算;(I+1)
(三)最终更新:x̂ k|k = μ(i+1) 和 p k|k = P k|k−1 。

When only one iterative loop is executed, the above is the update step of EKF. Their relationship can be viewed clearly in Figure 3.
当只执行一个迭代循环时,以上是 EKF 的更新步骤。它们之间的关系可以在图3中清楚地看到。

4.2. Smoothing-based approaches
The filtering-based approaches are a sequence iterative process where the current estimated results depend on the current measurements and the past estimated results (the past measurements are condensed into the past estimated results), i.e. future measurements do not make any contributions to the current estimated results. If this is not the case, i.e. future measurements do make contributions to the current estimated results, the smoothing-based approaches should be applied. This is more like a batch algorithm where the estimation is made when all the measurements are available.[33]
基于过滤的方法是一个序列迭代过程,其中当前估计结果取决于当前测量值和过去的估计结果(过去的测量值被压缩为过去的估计结果),即未来的测量值对当前估计结果没有任何贡献。如果不是这种情况,即未来的测量确实对当前的估计结果有所贡献,则应应用基于平滑的方法。这更像是一种批处理算法,当所有测量值都可用时进行估计。[33]

The IMU data-driven dynamics (2) and the camera measurement Equation (6) are still adopted. The full probability is
仍然采用IMU数据驱动的动态(2)和相机测量公式(6)。完全概率是

The MAP problem is
MAP问题是

Linearising f (·) and h(·), the above minimisation problem is rewritten as a group of linear equations in the general form
直线化f(·)和h(·),上述最小化问题被改写为一组一般形式的线性方程

This can be solved by Cholesky decomposition of AT A.[33] This is the perspective of optimisation-based approaches to the batch processing.
这可以通过AT A的Cholesky分解来解决[33]这是基于优化的批处理方法的观点。

On the other hand, the batch processing can be interpreted from the perspective of filtering-based approaches, which is based on stochastic treatment.
另一方面,批处理可以从基于随机处理的基于过滤的方法的角度进行解释。

The Kalman smoother experiences forward and backward passes, as shown in Figure 3. The forward pass computes p(x k |z 0:k ) given the data available so far. The backward pass computes p(x k |z 0 , · · · , z n ) given all the data.
卡尔曼平滑器可体验向前和向后传递,如图 3 所示。前向传递根据目前可用的数据计算 p(x k |z 0:k)。向后传递计算给定所有数据的 p(x k |z 0 , · · · , z n)。

Traditionally, the Kalman filter is expressed as
传统上,卡尔曼滤波器表示为

After linearising, it becomes:
线性化后,它变成:

The p(x k |z 0:k ) only takes into account the past information relative to x k . If we incorporate it with the future observations, more refined state estimates can be found. Estimators that take into account both past and future are often called smoother, which can be written as following.
p(x k |z 0:k) 只考虑相对于 x k 的过去信息。如果我们将其与未来的观测相结合,可以找到更精细的状态估计。同时考虑过去和未来的估计器通常称为更平滑,可以写成如下。

As an extension to Kalman smoother, it is possible to use an EM algorithm to learn the parameters of the system. The parameters include the noise covariances Q, R and the F, H.
作为卡尔曼平滑的扩展,可以使用EM算法来学习系统的参数。参数包括噪声协方差 Q、R 和 F、H。

Although the above algorithm goes through two passes, there is no attempt to solve a succession of linear approximation to the non-linear problem. If there is no good linearisation point, the bad result would be expected. Also as the trajectory goes longer and longer, the scale of the linearised equations is unmanaged.
尽管上述算法经历了两次传递,但没有尝试求解非线性问题的一系列线性近似。如果没有好的线性化点,则会出现不好的结果。此外,随着轨迹越来越长,线性方程的尺度是不受管理的。

4.3. Marginalisation to keyframes

If all the state variables and features encountered during the course of operation are maintained, like the above scenario, the computational complexity becomes larger and larger with the increasing of exploring trajectory. However, not all the sensory data are useful. There are some of them which contain key information on tracking and mapping, while others could be redundant or contain trivial information which could be ignored. There are two ways to select useful information from the past historical data. One is a fixed window scheme in which only recent N data measurements are kept while all the data before N measurements are simply marginalised out. Another one is to select N data measurements from all the data-set according to some criteria, which is illustrated in Figure 4. The most popular criteria are the critical information contained in the data so that those keyframes are maintained during the entire processing.[19,34]
如果像上述场景一样,保持运行过程中遇到的所有状态变量和特征,计算复杂度随着探索轨迹的增加而变得越来越大。然而,并非所有的感官数据都是有用的。其中一些包含有关跟踪和映射的关键信息,而另一些可能是多余的或包含可以忽略的琐碎信息。有两种方法可以从过去的历史数据中选择有用的信息。一种是固定窗口方案,其中仅保留最近的N个数据测量值,而N个测量值之前的所有数据都被边缘化。另一种方法是根据某些标准从所有数据集中选择 N 个数据测量值,如图 4 所示。最常用的标准是数据中包含的关键信息,以便在整个处理过程中维护这些关键帧。[19,34]

Marginalising out states is equivalent to applying the Schur compliment to the least squared problem. For example,
边缘化状态等同于将舒尔的赞美应用于最小二乘问题。例如

Changing to
更改为

After this forward substitution step, the smaller system is
在此前向替换步骤之后,较小的系统是

Marginalising out the state x 1 will induce dependencies between other states that are dependent on x 1 like A11 and A12 . And marginalising out the oldest pose from the full solution may cause fill-in in three places [35]:
边缘化状态 x 1 将导致依赖于 x 1 的其他状态之间的依赖关系,例如 A11 和 A12 .从完整解决方案中边缘化最古老的姿势可能会导致三个地方的填充[35]:

• between any landmarks that are visible from that pose;
• between the states of the next oldest pose;
• between the next oldest pose and all landmarks seen by the removed pose.
• 从该姿势可见的任何地标之间;
•在下一个最古老姿势的状态之间;
•在下一个最旧的姿势和移除的姿势看到的所有地标之间。

The matrices in the problem solution would become dense and affect the computational efficiency.
问题解中的矩阵会变得密集并影响计算效率。

4.4. Moving horizon estimation
Moving horizon estimation (MHE) approaches separates the MAP cost into two parts. Only recent N terms from step k to κ = k − N + 1 are optimised each step, while the oldest terms are summarised into an arrival cost, which is approximated by an EKF, as shown in Figure 4.
移动地平线估计 (MHE) 方法将 MAP 成本分为两部分。每一步仅优化从步骤 k 到 κ = k − N + 1 的最近 N 项,而最旧的项汇总为到达成本,该成本由 EKF 近似,如图 4 所示。

Let the time intervals be [0, k − N ] and [κ, k]. The MAP problem is
设时间间隔为 [0, k − N ] 和 [κ, k]。MAP问题是

With the sum of first N terms is an arrival cost, the MHE problem is
前 N 项的总和是到达成本,MHE 问题是

where κ refers to the starting time of MHE window, φκ is the arrival cost at time τ ∈ [κ, k]. The arrival cost can be approximated by an EKF:
其中 κ 是指 MHE 窗口的开始时间,φκ 是时间 τ ∈ [κ, k] 的到达成本。到达费用可以用EKF近似计算:

where the covariance P is propagated by
其中协方差 P 传播者为

  1. State observability and parameter identifiability

The problem which VIO is tackling with is to recover the platform trajectory in the global frame on-line given the measurements from inertial and visual sensors. However, this problem is solved not straightforward in terms of practical implementation. Apart from the complexity of filteringor optimisation-based approaches, some parameters play a crucial role in the success of dynamic state estimation representing the platform trajectory. These parameters include:
VIO正在解决的问题是根据惯性和视觉传感器的测量结果,在线恢复全局框架中的平台轨迹。但是,就实际实施而言,这个问题的解决并不简单。除了基于过滤或优化的方法的复杂性之外,一些参数在表示平台轨迹的动态状态估计的成功中起着至关重要的作用。这些参数包括:

• Camera intrinsic parameters: focal length, principal points and lens distortion.
• IMU parameters: acceleration and gyroscope biases.
• Spatial parameters: the transform between IMU and camera.
• Temporal parameter: the time delay between IMU and camera measurements
• 相机固有参数:焦距、主点和镜头畸变。
• IMU 参数:加速度和陀螺仪偏置。
• 空间参数:IMU 和相机之间的转换。
• 时间参数:IMU 和相机测量之间的时间延迟

They are treated as time-invariant variables except the IMU biases and called parameters of the system. In contrast, the platform trajectory is represented by time-variant variables: states.
它们被视为除IMU偏差之外的时不变变量,并称为系统参数。相比之下,平台轨迹由时变变量表示:状态。

Given only the measurements from IMU and camera, the question whether these states and parameters can be recovered can be answered from the analysis of observability and identifiability. Traditionally, states are dynamic variables and their estimation problem is analysed using observability while parameters are static variables and their calibration is analysed using identifiability. Modelling the parameters with random walk processes is able to analyse them using observability, like the IMU biases.
仅给定来自IMU和相机的测量值,是否可以从可观测性和可识别性分析中回答是否可以恢复这些状态和参数的问题。传统上,状态是动态变量,它们的估计问题使用可观测性进行分析,而参数是静态变量,它们的校准使用可识别性进行分析。使用随机游走过程对参数进行建模能够使用可观测性(如 IMU 偏差)对其进行分析。

5.1. State observability

Observability is a fundamental property which reflects the possibility of estimating states on the basis of input/output data. Let y (t, t0 , x 0 , z(t)) denotes the output trajectory from an initial state x 0 , initial time t0 and measurement reading z(t) for the continuous time system. Two initial states x 10 and 2 x 0 are defined to be indistinguishable if y t, t0 , x 10 , z(t) = y t, t0 , x 20 , z(t) for t0 < t < t N , and all admissible measurement z(t). If states are distinguishable, they can be estimated from the outputs and the known measurements. If states are indistinguishable, they are called unobservable and their corresponding variance will grow without bound and blow up.
可观测性是一个基本属性,它反映了根据输入/输出数据估计状态的可能性。设 y (t, t0, x 0, z(t)) 表示连续时间系统从初始状态 x 0 、初始时间 t0 和测量读数 z(t) 开始的输出轨迹。Two initial states x 10 and 2 x 0 are defined to be indistinguishable if y t, t0 , x 10 , z(t) = y t, t0 , x 20 , z(t) for t0 < t < t N , and all admissible measurement z(t).如果状态是可区分的,则可以根据输出和已知测量值来估计它们。如果状态是不可区分的,它们被称为不可观察的,它们相应的方差将不受限制地增长并爆炸。

Observability of a linear system is a global property that can be determined either from the rank of the observability matrix or from the rank of Gramian matrix. However, observability of a non-linear system is determined locally about a given state.[36] The local observability is stronger than the observability. The local observability distinguishes states from their neighbours. The local weak observability instantaneously distinguishes states from their neighbours without large excursions.[37]
线性系统的可观测性是一个全局属性,可以从可观测性矩阵的秩或格莱米矩阵的秩确定。但是,非线性系统的可观测性是围绕给定状态局部确定的。[36] 局部可观测性强于可观测性。本地可观测性将国家与其邻国区分开来。局部较弱的可观测性立即将国家与其邻国区分开来,而无需进行大型偏移。[37]

The advantage of local weak observability is the availability of Lie derivative algebraic test. For a non-linear system composed by Equations (2) and (6). The Lie derivative of h(·) with respect to f (·) is
局部弱可观测性的优点是李导数代数检验的可用性。对于由方程 (2) 和 (6) 组成的非线性系统。h(·) 关于 f(·) 的李导数是

The Lie derivatives can be defined recursively,
李导数可以递归定义,

The zero-order Lie derivative of any function is the function itself L 0 h(x) = h(x).
任何函数的零阶李导数是函数本身 L 0 h(x) = h(x)。

The observability matrix O is formed by stacking the Lie derivatives of h(x) as its rows.
可观测性矩阵 O 是通过将 h(x) 的李导数堆叠为其行而形成的。

The system is locally weakly observable at x 0 if O has full column rank at x 0 . As the state and measurement equations are infinitely smooth, the observability matrix can have infinite number of rows. If sufficient number of rows which are linearly independent can be found, O is full rank and the system is locally weakly observable.
如果 O 在 x 0 处具有完整的列秩,则系统在 x 0 处局部弱可观测。由于状态和测量方程是无限平滑的,因此可观测性矩阵可以有无限行数。如果可以找到足够数量的线性独立行,则 O 是全秩,系统是局部弱可观测的。

Based on the observability rank condition, there are a number of publications which show that the platform pose in global frame is unobservable and the rotation around gravity vector (yaw) is unobservable. Some other states are observable, which include the platform pose in the initial camera frame, the gravity vector in the initial camera frame and the features in the camera frame. The intuitive interpretation is that the visual camera is a bearing only sensor, and the IMU is only a double integrator of pose, which are not able to provide the pose and yaw information in the global frame.
基于可观测性秩条件,有许多出版物表明平台在全局框架中的姿态是不可观测的,围绕重力矢量(yaw)的旋转是不可观测的。其他一些状态是可观察到的,其中包括初始相机帧中的平台姿势、初始相机帧中的重力矢量以及相机帧中的特征。直观的解释是,视觉相机只是一个方位传感器,而IMU只是姿势的双重积分器,它们无法提供全局帧中的姿态和偏航信息。

The observability analysis allows us to find ways to sufficiently excite the system, such as using three non co-linear features with given world coordinates, or pseudo-measurement equations to make some unobservables observable.[1] Further, it has the potential to provide clues on how to make the unobservables blow-up as slowly as possible.
可观测性分析允许我们找到充分激励系统的方法,例如使用具有给定世界坐标的三个非线性特征,或伪测量方程使一些不可观测值可观测。[1]此外,它有可能提供有关如何使不可观察的东西尽可能缓慢地爆炸的线索。

For filtering-based approaches, the linearised system is employed in EKF. However, the observability of linearised system is different to the corresponding non-linear system. More specifically, the yaw is observable in the linearised systems. This unexpected observable DOF leads to an in-consistent result: the estimated covariance is better than actual measurement result. One direction for improving the consistency is to impose a constraint in the erroneously observable direction due to model linearisation.[38]
对于基于滤波的方法,EKF中采用线性化系统。然而,线性化系统的可观测性与相应的非线性系统不同。更具体地说,偏航是线性化系统中可以观察到的。这种意外的可观察自由度会导致不一致的结果:估计的协方差优于实际测量结果。提高一致性的一个方向是由于模型线性化而在错误可观察的方向上施加约束。[38]

The observability was also analysed using a concept of continuous symmetries, which is able to find the analytical derivation of observable state and identifiable parameters.[39]
还使用连续对称性的概念分析了可观测性,该概念能够找到可观测状态和可识别参数的分析推导。[39]

5.2. Parameter identifiability

In most of visual related localisation techniques, the camera intrinsic parameters are known in advance. They can also be calibrated on-line, such as the methods introduced in [40].
在大多数视觉相关的定位技术中,相机的内在参数是事先知道的。它们也可以在线校准,例如[40]中介绍的方法。

The IMU biases vary with time, and are modelled as time-variant states in most cases. The observability rank condition shows they are observable.
IMU 偏差随时间变化,在大多数情况下被建模为时变状态。可观测性排名条件显示它们是可观测的。

The spatial parameters between IMU and camera are six DOF transformation and must be known precisely to stabilise the estimation results. They are modelled as random walk processes and treated as time-variant states. The analysis of observability rank condition shows that six DOF camera IMU transformation, along with IMU biases, gravity vector and the metric scene structure are all observable. Full observability requires sufficient excitation, i.e. the platform at least undergoes rotation and acceleration along two IMU axes.[41–43]
IMU和相机之间的空间参数是六自由度变换,必须精确地知道以稳定估计结果。它们被建模为随机游走过程,并被视为时变状态。可观测性等级条件分析表明,6自由度相机IMU变换、IMU偏差、重力矢量和度量场景结构均可观测。完全可观测性需要足够的激励,即平台至少沿两个 IMU 轴进行旋转和加速。[41–43]

The temporal parameter between inertial and visual sensor measurements is the time delay. The local identifiability of time delay is established by constructing constraint equations which involve the time delay and other quantities. The rank of the Jacobian matrices of the constraints is checked for local identifiability. The result shows that the time delay is locally identifiable in general trajectories. The critical trajectories that cause loss of identifiability are characterised.[44] The time delay between inertial and visual sensor measurements is also estimated by a variant of ICP algorithm in [45].
惯性和视觉传感器测量之间的时间参数是时间延迟。通过构建涉及时间延迟和其他量的约束方程来建立时间延迟的局部可识别性。检查约束的雅可比矩阵的秩的局部可识别性。结果表明,在一般轨迹中,时间延迟是局部可识别的。描述了导致可识别性丧失的关键轨迹。[44] 惯性和视觉传感器测量之间的时间延迟也是通过 [45] 中 ICP 算法的变体估计的。

  1. Conclusions

This paper has presented an overview of the state-of-the art visual inertial odometry methods. It was presented from two perspectives: filtering-based and optimisation-based approaches. It was also described by handling visual images with two ways: feature based and dense based. The broad knowledge of EKF and image alignment has been unified into the same framework. The links between two approaches have been characterised by iterated EKF, smoother, and marginalisation, and a deep insight into two approaches has been unveiled. The state observability and parameter identifiability are analysed. The paper has also summarised a range of techniques used in this research area, including EKF, MAP, IEKF, BA, ICP, MHE, etc. In the future, exploring the computational complexity of these variants would provide more intuitive guide to the practitioners in this area. More implementation details should be provided as an integrated part of this work.
本文概述了最先进的视觉惯性里程计方法。它从两个角度提出:基于过滤和基于优化的方法。它还通过两种方式处理视觉图像来描述:基于特征和基于密集。EKF和图像对齐的广泛知识已统一到同一框架中。两种方法之间的联系的特点是迭代的EKF,更平滑和边缘化,并且已经揭示了对两种方法的深刻见解。分析了状态可观测性和参数可识别性。该论文还总结了该研究领域使用的一系列技术,包括EKF,MAP,IEKF,BA,ICP,MHE等。未来,探索这些变体的计算复杂性将为该领域的从业者提供更直观的指导。应提供更多的实施细节,作为这项工作的一个组成部分。

你可能感兴趣的:(SLAM,计算机视觉)