[Paper Reading] 28-Visual SLAM and Structure from Motion in Dynamic Environments: A Survey



1、 Overview

2、A.Robust Visual SLAM

2.1、Motion Segmentation

2.1.1、Background/Foreground Initialization

2.1.2、Geometric Constraints (corresponds to flip detection)

2.1.3、Optical Flow

2.1.4、Ego-Motion Constraints

2.1.5、Deep Learning

2.2、Localization and 3D Reconstruction

2.2.1、 Feature Based

2.2.2、Deep Learning

3、 B.Dynamic Object Segmentation and 3D Tracking

3.1、Dynamic Object Segmentation

3.1.1、Statistical Model Selection (corresponds to flipped foreground && foreground tracking)

3.1.2、Subspace Clustering

3.1.3、Geometry

3.1.4、Deep Learning

3.2、3D Tracking of Dynamic Objects

3.2.1、 Trajectory Triangulation

3.2.2、Particle Filter

4、Joint Motion Segmentation and Reconstruction

4.1、 Factorization

4.1.1、Multibody Structure from Motion (MBSfM)

4.1.2、 Nonrigid Structure from Motion (NRSfM) (omitted)


Saputra, Muhamad R. U., Markham, et al. Visual SLAM and Structure from Motion in Dynamic Environments: A Survey.

Published in 2018. Among the papers I have found so far on handling dynamic scenes (SLAM && 3D reconstruction), this one is relatively highly cited.

1、 Overview

  1. Overall framework

[Figure 1]

2. Evaluation tables of related methods

[Figure 2]

[Figure 3]

[Figure 4]

3. Note: markers used in these notes: key points of a method, advantages of a method, disadvantages of a method

2、A.Robust Visual SLAM

[Figure 5]

2.1、Motion Segmentation

  • This approach will work well if the static features are in the majority. When the dynamic objects in front of the camera are dominant or the captured scene is occluded by a large moving object, these types of approaches may fail

  • inertial measurement unit (IMU): provides the ego-motion.

2.1.1、Background/Foreground Initialization

  • requires prior knowledge about the background or the foreground (B/F)

  • real time (NO): moving objects are matched exhaustively

  • F: tracking-by-detection [14, 89]

  • B: background subtraction; initialization is done without foreground objects

  • temporarily stationary objects: tracked again once they move again

  • degenerate motion

2.1.2、Geometric Constraints (corresponds to flip detection)

[Figure 6]

  1. Four types of geometric constraints:

  • Epipolar line: a static feature should lie on the corresponding epipolar line

  • Geometric model (F): consistency with the estimated model

  • Triangulation: with noise, the back-projected rays from the tracked features do not meet (related to removing the background via stereo matching)

  • Reprojection error: distance (in pixels), appearance differences

2. Degenerate motion: the object moves along the epipolar line; addressed with the Flow Vector Bound (FVB)

3. there is no additional computational burden in performing the segmentation, and thus real-time implementation is common

4. handle temporary stopping (NO): relies only on motion

5. the approach cannot tell whether a residual error is caused by a moving object or by a false correspondence (outlier), since both conditions result in high geometric errors

6. degenerate motion (NO)
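As an illustration of the epipolar-line constraint and its degenerate case, here is a minimal sketch. It assumes a toy setup that is not from the survey: identity intrinsics, no rotation, and a camera translating along x, so F = [t]_x and every epipolar line is a horizontal image row. The threshold `THRESH` is a hypothetical value.

```python
import numpy as np

def epipolar_distance(F, x1, x2):
    """Distance from x2 to the epipolar line F @ x1 (homogeneous 3-vectors)."""
    line = F @ x1                          # line coefficients (a, b, c) in image 2
    return abs(line @ x2) / np.hypot(line[0], line[1])

def skew(t):
    """Cross-product matrix [t]_x."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

# Assumed toy geometry: K = I, R = I, t along x, hence F = [t]_x
F = skew(np.array([1.0, 0.0, 0.0]))

x1 = np.array([0.2, 0.1, 1.0])
candidates = {
    "static": np.array([0.5, 0.1, 1.0]),      # shifted only by camera parallax
    "moving": np.array([0.5, 0.4, 1.0]),      # object also moved vertically
    "degenerate": np.array([0.9, 0.1, 1.0]),  # object moved along the epipolar line
}

THRESH = 0.05  # hypothetical threshold, in normalized image units
labels = {name: bool(epipolar_distance(F, x1, x2) > THRESH)
          for name, x2 in candidates.items()}
```

The "degenerate" point is labelled static even though the object moved, which is the failure mode noted above for motion along the epipolar line.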

2.1.3、Optical Flow

  • consecutive images

  • motion metric computed from the optical flow

  • The graph-cut algorithm is utilized to segment the moving objects based on the motion metric

  • scene flow (3D version of optical flow): Mahalanobis distance of the residual (a low residual means a static object)

  • real time

  • brightness constancy assumption, which is sensitive to changes in lighting conditions [62].

  • sensitive to a large pixel movement

  • degenerate motion(NO)
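The Mahalanobis-distance test on a scene-flow residual can be sketched as below. This is only an illustration under assumptions: the residual is taken to be the measured scene flow minus the flow predicted from ego-motion, and the covariance, the residual values, and the gate `THRESH` are all made up for the example.

```python
import numpy as np

def mahalanobis(residual, cov):
    """Mahalanobis distance of a residual vector under covariance cov."""
    return float(np.sqrt(residual @ np.linalg.solve(cov, residual)))

# Assumed measurement covariance of the 3D residual (larger uncertainty in depth)
cov = np.diag([0.04, 0.04, 0.25])

static_res = np.array([0.2, 0.0, 0.0])   # hypothetical residual of a static point
moving_res = np.array([0.0, 0.0, 2.0])   # hypothetical residual of a moving point

THRESH = 3.0  # assumed gate on the distance
is_moving = [mahalanobis(r, cov) > THRESH for r in (static_res, moving_res)]
```

Normalizing by the covariance is what lets a single threshold be used even when the depth direction is much noisier than the image directions.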

2.1.4、Ego-Motion Constraints

  • assumes that the camera moves according to a particular parameterization (planar and circular)

  • temporary stopping(NO)

  • classifying static features can be done by fitting feature points that match with the camera motion constraints

  • real time.

  • degenerate motion(OK)

2.1.5、Deep Learning

  • estimating optical flow

  • scene flow estimation--stereo images

  • geometric features: spatiotemporal features cannot be learned

2.2、Localization and 3D Reconstruction

2.2.1、 Feature Based

  1. only static features resulting from the techniques described in Section 3.1 of the survey (Section 2.1 of these notes) are employed. All dynamic features are regarded as outliers and excluded from the computation

  2. matching:

  • short baselines: optical flow-based techniques

  • long baselines: e.g., SIFT [99], SURF [8], BRIEF [17], BRISK [91], etc.

  • outliers: e.g., RANSAC [37], PROSAC [22], MLESAC [158], etc.

3. Optimization and solving:

  • midpoint method [9] or least-squares-based method: suffers from the drifting problem

  • BA: Gauss-Newton method


  • local bundle adjustment

  • PTAM: choosing key frames, different threads

  • binary descriptors

  • metric topological mapping such that large-scale mapping can operate in real time

  • ORB-SLAM [113] (parallel computing, ORB features [131], statistical model selection [155], loop closures based on bag-of-words place recognition [26, 41], local bundle adjustment [111], and graph optimization [81])
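A minimal sketch of the midpoint method mentioned above, assuming two back-projected rays given by camera centres and ray directions (the function name and the toy values are mine, not from the survey). It returns the midpoint of the shortest segment between the rays, together with the gap between them.

```python
import numpy as np

def triangulate_midpoint(c1, d1, c2, d2):
    """Midpoint of the shortest segment between rays c1 + s*d1 and c2 + t*d2."""
    # Least-squares solution of s*d1 - t*d2 = c2 - c1
    A = np.stack([d1, -d2], axis=1)                      # 3 x 2
    (s, t), *_ = np.linalg.lstsq(A, c2 - c1, rcond=None)
    p1, p2 = c1 + s * d1, c2 + t * d2
    return 0.5 * (p1 + p2), float(np.linalg.norm(p1 - p2))

# Toy example: two camera centres observing the same static 3D point
X = np.array([1.0, 2.0, 5.0])
c1, c2 = np.zeros(3), np.array([1.0, 0.0, 0.0])
point, gap = triangulate_midpoint(c1, X - c1, c2, X - c2)
```

For a noise-free static point the rays meet and `gap` is zero. A large gap means the rays do not meet, which is the cue the triangulation constraint in Section 2.1.2 relies on.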

2.2.2、Deep Learning

  1. End-to-end: pose estimation(OK)

(1)Supervised Learning

  • classification problem over the discretized space of translation and rotation of the camera

  • regression network

  • optical flow-based networks

(2)Unsupervised Learning

  • Instead, the network learns to predict the camera pose by minimizing the photometric error similar

  2. End-to-end: 3D reconstruction (NO); depth (OK)

3、 B.Dynamic Object Segmentation and 3D Tracking

  • clusters feature correspondences into different groups based on their motion and tracks their trajectories in 3D

[Figure 7]

3.1、Dynamic Object Segmentation

  • dynamic object segmentation (also known as multibody motion segmentation [73, 132, 153] or eorumotion segmentation [133]) clusters all feature correspondences into n different object motions

  • In order to estimate the motion of the object, the features should be clustered first; on the other hand, the motion models for all moving objects are required to cluster the features. The problem is compounded by the presence of noise, outliers, or missing feature correspondences due to occlusion, motion blur, or losing tracked features

  • degenerate motion

3.1.1、Statistical Model Selection (corresponds to flipped foreground && foreground tracking)

[Figure 8]

  1. consecutive images

  2. Motion models can be based on one of the following categories: fundamental matrix (F), affine fundamental matrix (FA), essential matrix (E), homography/projectivity (H), or affinity (A).

  3. Sample, fit (keep the best model), then sample again on the remaining data (repeat)

(1) Sample iteration: RANSAC [37] or Monte Carlo sampling

(2) Find the Best Model:

  • Akaike's Information Criterion (AIC): trades off likelihood && number of parameters

  • Bayes Information Criterion (BIC)

  • Others:

Minimum Description Length (MDL)

Geometric Information Criterion (G-AIC, or in some literature called GIC)

Geometrically Robust Information Criterion (GRIC)

4. several perspective images (vs. two-image sequences)

temporal coherence is enforced by connecting only essential matrices with similar inlier sets

5. degenerate motion(OK)

6. the number of moving objects is automatically captured when the whole data is described by n different motion models

7. noise and outliers are automatically tackled

8. real-time(NO)

9. fitting a motion model from randomly sampled data is computationally expensive

10. Finally, dependent motion remains a challenging problem for statistical model selection since a group of features can be part of two different motion models.
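To make the "find the best model" step concrete, here is a small sketch that scores two candidate motion models with BIC. It is a toy stand-in under assumptions: the candidates are a 2D translation (2 parameters) and a 2D affinity (6 parameters) rather than the F/E/H models listed above, the residuals are assumed i.i.d. Gaussian, and all data are synthetic.

```python
import numpy as np

def fit_translation(p, q):
    """Least-squares 2D translation (2 parameters); returns the residuals."""
    t = (q - p).mean(axis=0)
    return q - (p + t)

def fit_affinity(p, q):
    """Least-squares 2D affinity (6 parameters); returns the residuals."""
    A = np.hstack([p, np.ones((len(p), 1))])            # n x 3
    params, *_ = np.linalg.lstsq(A, q, rcond=None)      # 3 x 2
    return q - A @ params

def bic(residuals, k):
    """BIC = k ln(N) - 2 ln(L), Gaussian residuals with estimated variance."""
    n = residuals.size
    sigma2 = np.sum(residuals ** 2) / n
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return k * np.log(n) - 2 * log_lik

def select_model(p, q):
    """Pick the candidate with the lowest BIC."""
    scores = {"translation": bic(fit_translation(p, q), 2),
              "affinity": bic(fit_affinity(p, q), 6)}
    return min(scores, key=scores.get)

# Synthetic correspondences (assumed data, fixed seed)
rng = np.random.default_rng(0)
p = rng.uniform(-1, 1, (200, 2))
noise = 0.01 * rng.standard_normal((200, 2))
t = np.array([0.3, -0.1])
c, s = np.cos(0.35), np.sin(0.35)
R = np.array([[c, -s], [s, c]])

choice_simple = select_model(p, p + t + noise)          # data from a translation
choice_rich = select_model(p, p @ R.T + t + noise)      # data with rotation
```

Replacing the `k * np.log(n)` penalty with `2 * k` would give AIC instead. The penalty term is what stops the richer model from always winning on likelihood alone.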

3.1.2、Subspace Clustering

[Figure 9]

  • estimating the subspace parameters and clustering the data into different subspaces should be done simultaneously

  • each independent rigid-body motion lies in a linear subspace; using the rank constraint, each linear subspace can be recovered.

  • similar to model selection, but a subspace is fitted instead of a motion model; AIC is used to decide on merging two subspaces into one group.

  • multiple linear subspaces: Generalized Principal Component Analysis (GPCA)

a single linear subspace: PCA

  • fit, find the normal vectors of each subspace, then segment (similarity between normal vectors, spectral clustering)

  • independent, articulated, rigid, nonrigid, degenerate, and nondegenerate motions can all be handled by Local Subspace Affinity (LSA)

  • Projection into a lower-dimensional subspace is also carried out before the subspace is estimated

  • cheaper in computation

  • dependent motions

  • cannot run sequentially (except [161, 185]) or in real time since they need the whole sequence to be available before processing (batch mode)

  • requires information about the number of motions in the scene or the dimension of the subspaces

  • affine camera model (perspective effects: NO, since a motion might then lie in a nonlinear manifold)

  • noise [35, 177], outliers [35, 185], and missing data
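The rank constraint behind subspace clustering can be checked numerically. The sketch below assumes an affine camera and uses random 2x4 matrices as stand-ins for the per-frame affine motions (they are not real rigid motions, but the rank bound is the same). Sizes and names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N_FRAMES, P1, P2 = 10, 15, 12   # assumed sizes

def trajectories(n_points):
    """2F x P trajectory matrix of one body under an affine camera."""
    shape = np.vstack([rng.standard_normal((3, n_points)),
                       np.ones((1, n_points))])          # 4 x P homogeneous shape
    motion = rng.standard_normal((2 * N_FRAMES, 4))      # stacked 2 x 4 cameras
    return motion @ shape

W1, W2 = trajectories(P1), trajectories(P2)
W = np.hstack([W1, W2])          # columns of both bodies side by side

rank_single = int(np.linalg.matrix_rank(W1))
rank_joint = int(np.linalg.matrix_rank(W))
```

One body gives a trajectory matrix of rank at most 4, and two independent bodies give rank at most 8, so the columns of `W` lie in a union of two 4-dimensional subspaces.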

3.1.3、Geometry


  1. there is a set of fundamental matrices {Fi}, one associated with each moving object, such that the following multibody epipolar constraint is satisfied

  2. multibody epipolar constraints


[Figure 10]

(1) converted into a bilinear problem again by mapping the polynomial equation into a vector containing Mn monomials using the Veronese map

(2) If n is known, then after reordering, the multibody fundamental matrix can be estimated by least squares

(3) Subsequently, the motion segmentation of dynamic features can be done by assigning each feature correspondence to the correct fundamental matrix [167]

3. extended the multibody SfM formulation from two views into three views by introducing the multibody trilinear constraint and multibody trifocal tensor (little difference; F becomes a tensor)

4. perspective camera model

5. handle degenerate motion (NO)

6. the number of required image pairs grows exponentially with respect to the number of motions
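The multibody epipolar constraint states that the product of the per-object epipolar constraints, prod_i (x2^T Fi x1), is zero for a correspondence belonging to any of the n motions. The sketch below checks this and shows the Veronese map for n = 2. It assumes a toy setup with identity intrinsics and pure translations, so that Fi = [ti]_x; the numbers are made up.

```python
import numpy as np
from itertools import combinations_with_replacement

def veronese(x, n):
    """Degree-n Veronese map: all degree-n monomials of the entries of x."""
    return np.array([np.prod(x[list(idx)])
                     for idx in combinations_with_replacement(range(len(x)), n)])

def skew(t):
    """Cross-product matrix [t]_x."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def project(X):
    return X / X[2]

# Two objects with different relative translations (assumed: K = I, R = I)
t1, t2 = np.array([1.0, 0.0, 0.2]), np.array([0.0, 1.0, -0.1])
F1, F2 = skew(t1), skew(t2)

X = np.array([0.5, -0.3, 4.0])             # a 3D point on object 1
x1, x2 = project(X), project(X + t1)

own = x2 @ F1 @ x1                         # constraint of its own motion: zero
other = x2 @ F2 @ x1                       # constraint of the other motion: nonzero
multibody = own * other                    # product over all motions: zero

n_monomials = len(veronese(x1, 2))         # M_n = (n + 1)(n + 2) / 2 for 3-vectors
```

The product is a degree-n polynomial in each of x1 and x2. Writing it in terms of `veronese(x1, n)` and `veronese(x2, n)` is what turns it back into a bilinear form.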

3.1.4、Deep Learning

  • predefined number of rigid body motions

  • produce dense object masks

3.2、3D Tracking of Dynamic Objects

3.2.1、 Trajectory Triangulation

  • the object trajectory is known or satisfies a parametric form (e.g. an unknown 3D line): find a 3D line that intersects the projected rays from t views

[Figure 11]

  • handle outliers and missing data (NO)

  • rigid body motion

  • Other trajectory assumptions: a conic section, a curve

  • Prior knowledge about the camera motion is not needed, although some approaches [5, 6] assume that the camera pose is available
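A sketch of the simplest parametric case, under assumptions that go beyond the notes: the point moves on a 3D line with constant velocity, X(t) = X0 + V t, the time stamps are known, and the camera projection matrices are available (the variant where the camera pose is assumed known). Each view then gives two linear equations in the six unknowns. The function name and all data are hypothetical.

```python
import numpy as np

def triangulate_linear_trajectory(Ps, times, obs):
    """Least-squares X0, V of X(t) = X0 + V*t from one image point per view."""
    rows, rhs = [], []
    for P, t, (u, v) in zip(Ps, times, obs):
        # (u*P[2] - P[0]) . [X; 1] = 0 and (v*P[2] - P[1]) . [X; 1] = 0
        for a in (u * P[2] - P[0], v * P[2] - P[1]):
            rows.append(np.concatenate([a[:3], t * a[:3]]))
            rhs.append(-a[3])
    theta, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return theta[:3], theta[3:]

# Assumed toy data: K = I, R = I, camera centres that do NOT move with
# constant velocity (otherwise the scale of the trajectory is unobservable)
centres = np.array([[0, 0, 0], [1, 0.5, 0], [2, -0.3, 0.2],
                    [3, 0.8, -0.1], [4, 0, 0.3], [5, -0.6, 0]], dtype=float)
Ps = [np.hstack([np.eye(3), -c[:, None]]) for c in centres]
times = np.arange(6.0)
X0_true, V_true = np.array([0.5, 1.0, 10.0]), np.array([0.2, -0.1, 0.3])

obs = []
for P, t in zip(Ps, times):
    x = P @ np.append(X0_true + V_true * t, 1.0)
    obs.append((x[0] / x[2], x[1] / x[2]))

X0_est, V_est = triangulate_linear_trajectory(Ps, times, obs)
```

Plain least squares has no protection against outliers or missing observations, which matches the limitation listed above.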

3.2.2、Particle Filter

  • The particles are spread along the ray of projection and are constrained by the estimated/predefined ground plane and maximum/minimum allowed depth value.


[Figure 12]

  • Bearing-only-Tracking (BOT) problem (monocular camera)

  • once the moving object is segmented in a single view, particles are spread uniformly along the ray of projection

  • the weight of the particle is updated by projecting each particle into the current frame and computing the projection error compared to the actual feature position.

  • Particle filters are probably the only technique for doing 3D reconstruction and tracking of dynamic objects that can work in real time so far

  • object trajectory is not needed

  • nonrigid or articulated reconstruction(NO)
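The weight update described above can be sketched for a single feature as follows. This is a simplified illustration: it only covers spreading the particles between assumed minimum and maximum depths and one reweighting step, and it leaves out the ground-plane constraint, the motion (prediction) step, and resampling. The camera setup, `sigma`, and the depth range are assumptions.

```python
import numpy as np

def project(P, X):
    """Project a 3D point with a 3x4 camera matrix."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def update_weights(depths, ray, P_cur, observed, sigma=0.01):
    """Reweight depth particles by their projection error in the current frame."""
    points = depths[:, None] * ray[None, :]      # particles along the ray of view 1
    errors = np.array([np.linalg.norm(project(P_cur, X) - observed)
                       for X in points])
    weights = np.exp(-0.5 * (errors / sigma) ** 2)
    return weights / weights.sum()

# Ray of projection from the first view (camera 1 at the origin, K = I)
ray = np.array([0.1, -0.05, 1.0])
depths = np.linspace(1.0, 20.0, 191)             # uniform spread, min/max depth

# Current frame: camera shifted by 0.5 along x; the feature truly sits at depth 5
P_cur = np.hstack([np.eye(3), -np.array([[0.5], [0.0], [0.0]])])
observed = project(P_cur, 5.0 * ray)

weights = update_weights(depths, ray, P_cur, observed)
best_depth = float(depths[np.argmax(weights)])
```

Each update is a fixed amount of work per particle, which is consistent with the real-time claim made for this family of methods.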

4、Joint Motion Segmentation and Reconstruction

[Figure 13]

4.1、 Factorization

  • factorization can do both simultaneously

  • for short sequences of static scenes, the measurement matrix (the matrix containing all tracked feature points through all frames) is at most of rank four (or rank three if using the orthographic projection model under Euclidean coordinates)

W: feature (measurement) matrix, M: motion, S: shape (3D structure)

[Figure 14]

(SVD decomposition, O(fp^2) complexity)

[Figure 15]

(W contains n motions. Without noise, each Wi, where i = {1, 2, ... , n}, lies in a subspace of at most rank four [25]. Each Wi can then be factorized into a motion and a shape matrix)

[Figure 16]

  • camera motion is not needed

  • orthographic or affine camera model (perspective effects: NO)

  • real time (NO)

  • requires prior knowledge, such as the number of moving objects in the scene, the rank of the measurement matrix, or the dimension of the object

  • sensitive to noise and outliers

  • missing data
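A minimal sketch of the factorization W = M S via truncated SVD, assuming a noise-free rank-four measurement matrix built from random affine cameras (synthetic data, hypothetical names). The factors are only determined up to an invertible 4x4 transformation; the metric upgrade is not shown.

```python
import numpy as np

def factorize(W, rank=4):
    """Rank-constrained factorization W ~ M @ S via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    root = np.sqrt(s[:rank])
    M = U[:, :rank] * root             # 2F x rank motion matrix
    S = root[:, None] * Vt[:rank]      # rank x P shape matrix
    return M, S

# Synthetic measurement matrix: 10 frames, 15 points (assumed sizes)
rng = np.random.default_rng(0)
shape_true = np.vstack([rng.standard_normal((3, 15)), np.ones((1, 15))])
motion_true = rng.standard_normal((20, 4))
W = motion_true @ shape_true

M, S = factorize(W)
reconstruction_error = float(np.abs(M @ S - W).max())
```

With noise, outliers, or missing entries, W is no longer exactly rank four and the truncated SVD degrades, which is the sensitivity listed above.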

4.1.1、Multibody Structure from Motion (MBSfM)

  • Same as above (factorization of W)

  • cluster the structure by maximizing the sum-of-squares entries of a block diagonal subject to the constraint that each block represents a physical object

  • projective cameras: trickier

[Figure 17]
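The block-diagonal structure mentioned above can be seen in the shape interaction matrix Q = V V^T, where V holds the leading right singular vectors of W. The sketch below assumes two independent motions, noise-free data, points already sorted by object, and random affine cameras as stand-ins for real motions. In practice the points are unsorted and finding the permutation is the actual clustering problem.

```python
import numpy as np

rng = np.random.default_rng(1)

def rigid_block(n_frames, n_points):
    """Trajectory matrix of one body under an affine camera (synthetic)."""
    shape = np.vstack([rng.standard_normal((3, n_points)),
                       np.ones((1, n_points))])
    return rng.standard_normal((2 * n_frames, 4)) @ shape

P1, P2 = 10, 8
W = np.hstack([rigid_block(12, P1), rigid_block(12, P2)])

_, _, Vt = np.linalg.svd(W, full_matrices=False)
V = Vt[:8].T                 # rank 8: four dimensions per independent motion
Q = V @ V.T                  # shape interaction matrix (P x P)

cross_block = float(np.abs(Q[:P1, P1:]).max())   # entries between the two objects
same_block = float(np.abs(Q[:P1, :P1]).max())    # entries within object 1
```

Entries of `Q` between points of different objects vanish, so grouping points that interact recovers the objects without estimating the motions first.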

4.1.2、 Nonrigid Structure from Motion (NRSfM) (omitted)
