RGB-Depth SLAM Review 2018
Simultaneous Localization and Mapping (SLAM) has made real-time dense reconstruction possible, improving the prospects for navigation, tracking, and augmented reality. Several breakthroughs have been achieved over the past few decades, and notable work is ongoing. This paper presents an overview of the SLAM approaches developed to date. The KinectFusion algorithm, its variants, and later approaches built on it are discussed in detail. The algorithms are compared for their effectiveness in tracking and mapping on the basis of root mean square error over datasets available online.
With regard to an optimal account of phenomenal consciousness, theories based on mental representation rest on models from the information processing paradigm [1]. These correspond closely to neurobiological or functional theories, and at this point we are confronted with several arguments based on inverted or absent qualia [2]. Such arguments follow a common pattern: they assume complete knowledge of the neural and functional states that subserve the occurrence of phenomenal consciousness. These can still be conceived as neural states, which are also defined as states with similar causal roles or with similar representational function [3], [4].
These states occur either with no phenomenal content at all or accompanied by phenomenal content that varies widely from the usual. By definition, visual information processing comprises the visual cognitive skills that allow us to process and interpret the meaning of visual information obtained through eyesight. Visual perception therefore plays a vital role in cognitive and intellectual skills such as spelling, mathematics, and reading [5]. Conversely, visual perceptual deficits can cause difficulties in learning, recognizing, and remembering letters and words, confusion between similar items or items with minor differences, and difficulty distinguishing the main idea from insignificant details.
Visual perceptual processing can be subdivided into visual discrimination, figure-ground perception, closure, memory, sequential memory, constancy, spatial relations, and visual-motor integration. Perception is the active process of locating and extracting information from the environment, while learning is the process of acquiring information through experience and storing it. Thought, in turn, is the manipulation of information to solve problems [6]. The easier it is to extract information (perception), the easier thought processes become. It is generally accepted that human vision is an extremely powerful form of information processing that facilitates our interaction with the world around us. However, despite extensive research across multiple fields, the underlying fundamentals and operational principles of visual information processing remain largely unknown.
We are still unable to ascertain the origin and extent of the route from the eyes to the sensory input area known as the cortex. It is in this area that the input is converted into a meaningful object representation under conscious manipulation by the brain [7]. Nearly half of the human cerebral cortex is devoted to processing visual information, yet despite extensive research the conundrum persists. Current theories hold that human visual information processing is an interplay of two oppositely directed processing streams.
Past research has shown that distance and physical environment are among the factors that impair information processing, although it remains unknown whether the impairment affects all levels of processing or only the early stages rather than the later ones. Those faced with the condition of mapping algorithms suffer from attention deficits that impair the ability to selectively process incoming visual information. The early levels of information processing are described as those involving the detection of, and response to, simple stimuli. One task that assesses this function is inspection time, which has previously been shown to be sensitive to pharmacological agents.
It is also regarded as the most reliable, validated, and culturally fair of the information-processing measures of cognitive ability [5]. Past assessments have also reported the effect of nicotine on information processing, generally taken as a measure of speed in the early levels of processing. These include the speed of visual encoding, that is, the ability to make an observation or inspection of the sensory input on which a discrimination of relative magnitude rests. This contrasts with tasks such as reaction time, which involve more response-oriented measures of complete decision-making time covering the whole of information processing.
Although no research has examined the effect of administering 3D scene construction on a similar response, a limited number of studies have examined the effect of 3D scene construction on the early stages of information processing using other tasks [6]. Using visual tracking tasks, it was found that detection speed was impaired by 3D scene construction and that these effects were greater in dual-task settings than in single-task settings. Such outcomes have been described as deleterious effects of 3D scene construction on central processing capacity and on the availability of information processing capacity over time.
Further investigations of early information processing have examined the mismatch negativity component of the auditory event-related potential, and have reported that low doses of 3D scene construction attenuate the event-related potential signal. In this case, suppression of the mismatch negativity component was robust when the stimulus deviation was small, which is indicative of a relatively low blood 3D scene construction concentration. The detection of minimal deviations, such as that required in the inspection time task, is especially likely to be hampered, and similar outcomes have been found in simple reaction time tasks with two levels of stimulus intensity. These studies reported an increase in response time as well as impaired stimulus detection, which suggests an influence on sensory-perceptual processes and on measures of attention [7].
Current findings in visual information processing reflect the elementary principles of vision as well as the use of visual information in cognition. Such work is on the verge of significant development, and there are grounds for optimism in several sophisticated computational theories that incorporate neurobiological and behavioral data. These theories are flourishing through the skillful exploitation of neuroimaging and computational simulation technologies, which permit answers to subtle questions about the component subsystems of vision.
The Graph SLAM algorithm produces sparse information matrices by generating a graph of observed interdependencies: two observations are connected if they contain information about the same landmark. Graph SLAM makes it possible to construct a map of an environment while simultaneously localizing within that map, enabling navigation in unknown settings when external reference systems such as GPS are absent. This intuitive approach uses a graph whose nodes correspond to the robot poses at different points in time and whose edges represent constraints between the poses. The constraints are obtained from observations of the environment or from movement actions performed by the robot. Once the graph is constructed, the map can be computed by finding the spatial configuration of the nodes that is most consistent with the measurements modeled by the edges.
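To make the idea of "finding the node configuration most consistent with the edges" concrete, the following is a minimal sketch for a one-dimensional pose graph. It is an illustration under stated assumptions, not the formulation of any particular Graph SLAM implementation: poses are scalars, each edge is a relative offset with a scalar information weight, and the function name `solve_pose_graph_1d` and the edge tuple layout are hypothetical.

```python
import numpy as np


def solve_pose_graph_1d(num_poses, edges, anchor=0.0):
    """Least-squares solution of a 1D pose graph (illustrative sketch).

    edges: list of (i, j, offset, information) meaning x_j - x_i ~ offset,
    weighted by the scalar `information` (inverse variance). Hypothetical layout.
    """
    H = np.zeros((num_poses, num_poses))  # information matrix (sparse in practice)
    b = np.zeros(num_poses)               # information vector
    for i, j, z, w in edges:
        # Each edge only touches the entries of the two poses it connects,
        # which is why the information matrix stays sparse.
        H[i, i] += w
        H[j, j] += w
        H[i, j] -= w
        H[j, i] -= w
        b[i] -= w * z
        b[j] += w * z
    # A prior on pose 0 removes the free global translation (gauge freedom).
    H[0, 0] += 1.0
    b[0] += anchor
    return np.linalg.solve(H, b)


# Two odometry edges of 1.0 each and one loop-closure edge saying 1.8 overall.
edges = [(0, 1, 1.0, 1.0), (1, 2, 1.0, 1.0), (0, 2, 1.8, 1.0)]
poses = solve_pose_graph_1d(3, edges)
print(poses)
```

With these three equally weighted edges the disagreement of 0.2 between odometry and the loop closure is spread over the graph, giving poses of approximately 0, 0.933, and 1.867. Real systems work with 2D or 3D poses, where the problem is nonlinear and is solved iteratively.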
As the figure above shows, each node in the graph corresponds to a pose of the robot. Nearby poses are linked by edges that model spatial constraints between robot poses, derived from odometry measurements between consecutive poses. The other edges represent spatial constraints arising from multiple observations of the same part of the environment. The graph-based SLAM method yields a simpler estimation problem by abstracting away the raw sensor readings: these readings are replaced by the graph edges, which can be viewed as "virtual measurements". In more detail, an edge between two nodes is labeled with a probability distribution over the relative locations of the two poses, conditioned on their mutual measurements.
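If each edge distribution is assumed to be Gaussian, the estimation problem described above can be written as a weighted least-squares objective. The notation below is an assumption introduced for illustration and does not appear in the original text: $x_i$ is the pose of node $i$, $z_{ij}$ is the mean of the virtual measurement on edge $(i, j)$, $\Omega_{ij}$ is its information matrix, $\hat{z}_{ij}(x_i, x_j)$ is the relative pose predicted from the current node configuration, and $\mathcal{C}$ is the set of edges.

```latex
e_{ij}(x_i, x_j) = z_{ij} - \hat{z}_{ij}(x_i, x_j)

x^{*} = \operatorname*{arg\,min}_{x} \sum_{(i,j) \in \mathcal{C}}
        e_{ij}(x_i, x_j)^{\top} \, \Omega_{ij} \, e_{ij}(x_i, x_j)
```

Under this assumption, minimizing the sum is equivalent to maximizing the joint likelihood of all virtual measurements.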
Real-time visual tracking in both known and unknown scenes is a critical and indispensable component of vision-based AR applications, and many algorithms have been contributed over the years. Here we introduce RGB-D Camera-Based Parallel Tracking and Meshing as an adaptation and update of the algorithms used to estimate camera motion for AR, scaled to the computational capacity available to the end user, which yields impressive tracking results in limited AR workspaces. Estimating camera motion by tracking the environment while building a feature-based sparse map in parallel has become possible partly because of the spread of multi-core processors in desktop and laptop computers. More recently, it has been shown that a single standard hand-held video camera connected to a powerful computer can exploit the computational power of the Graphics Processing Unit (GPU). This makes it possible to obtain a dense representation of a desktop setting and of highly textured scenes while tracking with RGB-D Camera-Based Parallel Tracking and Meshing. The density of the map created online can be increased with dense stereo matching and GPU-based implementations. The GPU can also effectively replace the global bundle adjustment component of optimization-based SLAM systems, such as RGB-D Camera-Based Parallel Tracking and Meshing, with an inherently parallel refinement step based on Monte Carlo simulation, thereby freeing CPU resources for other tasks.
LSD-SLAM is a novel direct monocular SLAM method that operates on image intensities directly, rather than on keypoints, for both tracking and mapping. Camera tracking uses direct image alignment, while geometry is estimated in the form of semi-dense depth maps obtained by filtering over many pixel-wise stereo comparisons. A Sim(3) pose graph of keyframes is then built, which allows scale-drift correction and large-scale maps that include loop closures. LSD-SLAM runs in real time on a CPU, and even on smartphones. The algorithm has three core components: tracking, depth map estimation, and map optimization. The tracking component continuously tracks new camera images by estimating their rigid-body pose with respect to the current keyframe, using the pose of the previous frame as initialization. The depth map estimation component uses tracked frames to either refine or replace the current keyframe. Depth is refined by filtering over many per-pixel, small-baseline stereo comparisons, interleaved with spatial regularization. If the camera has moved too far, a new keyframe is initialized by projecting points from existing nearby keyframes into it. Once a keyframe has been replaced as the tracking reference, its depth map is no longer refined; instead it is incorporated into the global map by the map optimization component. To detect loop closures and scale drift, a similarity transform to nearby existing keyframes (including the direct predecessor) is estimated using scale-aware direct image alignment.
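The depth refinement step, filtering over many small-baseline stereo comparisons, can be illustrated with a minimal per-pixel sketch. This is an assumption-laden simplification, not LSD-SLAM's actual filter: it treats the current inverse-depth estimate and each new stereo observation as independent Gaussians and fuses them, ignoring outlier handling and spatial regularization. The function name `fuse_inverse_depth` is hypothetical.

```python
def fuse_inverse_depth(mean_prior, var_prior, mean_obs, var_obs):
    """Fuse two Gaussian inverse-depth estimates for one pixel (sketch).

    Assumes both estimates are independent Gaussians; the result is their
    normalized product, so the fused variance is always smaller than either input.
    """
    total = var_prior + var_obs
    fused_mean = (var_obs * mean_prior + var_prior * mean_obs) / total
    fused_var = (var_prior * var_obs) / total
    return fused_mean, fused_var


# Fold a sequence of (hypothetical) stereo observations into one pixel's estimate.
mean, var = 0.50, 0.04
for obs_mean, obs_var in [(0.60, 0.04), (0.56, 0.02)]:
    mean, var = fuse_inverse_depth(mean, var, obs_mean, obs_var)
print(mean, var)
```

Each additional observation tightens the estimate, which is why many weak small-baseline comparisons can together produce a usable semi-dense depth map.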
The algorithm computes the camera trajectory in real time by heavily exploiting the parallel nature of the SLAM problem, separating the time-constrained pose estimation from less urgent tasks such as map building and refinement. In addition, the stereo setting allows a metric 3D map to be reconstructed for each pair of stereo images, which improves the accuracy of the mapping process relative to monocular SLAM and avoids the well-known bootstrapping problem. Furthermore, the true scale of the environment is critical for a robot that must interact with its surrounding workspace. To navigate and carry out tasks autonomously, a mobile robot must know its pose (position and orientation) and hold a representation of the environment (a map). Where the robot has neither a prior map nor external information about its pose, both tasks must be carried out simultaneously. The problem of localizing the robot and constructing the map of the environment at the same time is known as SLAM. To address this problem with stereo vision, we introduce the S-PTAM (Stereo Parallel Tracking and Mapping) algorithm, an approach intended to run in real time over long trajectories and extended durations, estimating the pose accurately while building a sparse map of the environment in a global coordinate system [8], [9]. For performance, the algorithm decouples the localization and mapping tasks of SLAM into two independent threads, which allows it to take advantage of multi-core processors. In addition to the localization and mapping modules, a loop closure module recognizes previously visited places. The detected loops are then used to refine the map and the trajectory estimate, effectively reducing the accumulated error of the method. S-PTAM operates on visual features extracted from the images provided by the stereo camera.
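The reason a stereo pair yields a metric map can be sketched with the standard rectified-stereo relation, depth = focal length × baseline / disparity. The snippet below is a minimal illustration under assumptions, not S-PTAM's own triangulation code: the images are assumed rectified, and the function name and all camera parameters (`fx`, `fy`, `cx`, `cy`, `baseline`) are hypothetical example values.

```python
import numpy as np


def triangulate_rectified(u_left, v_left, u_right, fx, fy, cx, cy, baseline):
    """Triangulate one feature matched across a rectified stereo pair (sketch).

    Returns the 3D point in the left camera frame, in the units of `baseline`.
    """
    disparity = u_left - u_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the rig")
    z = fx * baseline / disparity      # metric depth from the known baseline
    x = (u_left - cx) * z / fx
    y = (v_left - cy) * z / fy
    return np.array([x, y, z])


# Hypothetical intrinsics and a 10 cm baseline; disparity of 25 pixels.
point = triangulate_rectified(370.0, 240.0, 345.0,
                              fx=500.0, fy=500.0, cx=320.0, cy=240.0,
                              baseline=0.1)
print(point)  # approximately [0.2, 0.0, 2.0] metres
```

Because the baseline is known in metres, the reconstructed point carries true scale, which a single moving camera cannot recover on its own.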
3D scene reconstruction and mapping has been a crucial task in mobile robotics, since it is needed for techniques including path planning, semantic mapping, localization, navigation, and telepresence. There are two major approaches to 3D reconstruction: offline multi-view stereo (MVS) reconstruction and live incremental dense scene reconstruction. Many compelling results have been produced in the past few years with multi-view stereo (MVS) and structure from motion (SfM) techniques. Multi-view stereo has been used extensively in photogrammetry for dense surface reconstruction [10], while SfM algorithms have addressed accurate camera tracking together with sparse reconstruction from large datasets of unordered images [11]. Although groundbreaking results have been achieved, most SfM and MVS approaches are not designed for live operation. Simultaneous Localization and Mapping (SLAM), unlike SfM and MVS, provides live motion tracking and reconstruction using input from a single commodity sensor, but requires a sequentially ordered set of images. The various 3D mapping techniques offer different functionality, but almost all follow the same pipeline: first spatially aligning consecutive data frames, then detecting loop closures, and finally aligning the complete data sequence in a globally consistent manner. Although the systems developed provide satisfactory accuracy with point clouds and color cameras, most of them are computationally expensive and inaccurate for dense depth reconstruction, especially in dark environments or scenes with sparse texture [12]. Based on the sensor used, 3D reconstruction can be achieved in three ways: multi-view stereo, laser scanning, or depth cameras. Multi-view stereo is the traditional photogrammetric technique in which overlapping views of an object are captured to estimate relative camera pose, and the scene is reconstructed from selected control points whose 3D coordinates are obtained by space intersection. Laser scanners work on the time-of-flight principle: the scene is measured with high accuracy from transmitted laser pulses that are received back by the scanner.
The most recent and popular approaches construct 3D scenes with RGB-D cameras, which, working on the time-of-flight principle, measure per-pixel depth along with per-pixel color. Early work on SLAM for 3D reconstruction over the past few decades includes a range of approaches and their extensions. 3D reconstruction has been explored extensively with point cloud models and real-time tracking; MonoSLAM [13] was the first successful system for real-time tracking and active 3D mapping with a single camera. It motivated many other works on online reconstruction with freely moving hand-held cameras based on probabilistic models, sparse but fine and accurate [14]. Later research focused on performing tracking and mapping in parallel instead of adopting probabilistic models. The Parallel Tracking and Mapping system (PTAM) performs live tracking through feature optimization over spatially distributed keyframes with n-point pose estimation, and expands the map through bundle adjustment and global pose optimization [15]. Although monocular SLAM approaches set the benchmark for real-time 3D mapping and produced robust camera tracking systems, augmented reality (AR) and other live mapping and robot navigation applications cannot rely on the sparse point clouds these systems generate. This prompted work on generating live dense maps from scene depth obtained by multi-view stereo, combined with PTAM for live camera tracking and robust pose estimation [16]. The availability of depth cameras has made the task easier still, and current approaches focus on large-scale 3D mapping with commodity depth sensors.
Given the importance of SLAM approaches and their applications in robotics, this paper gives a general account of the development of SLAM approaches for real-time dense surface mapping and reconstruction using depth cameras as commodity sensors. The Kinect sensor is introduced, along with its use in depth mapping and reconstruction for augmented reality (AR) applications. The focus is on the KinectFusion algorithm and the results achieved with it or through its integration with other tracking and mapping algorithms [17].
Depth cameras, which measure the distance of objects from the camera (by time of flight or active stereo) in addition to RGB, have opened up a new wave of techniques in SLAM and augmented reality (AR). RGB-D cameras allow SLAM to combine range sensing with visual data to handle issues such as data association and loop closure in visual odometry and visual SLAM [11]. Among RGB-D cameras, the Kinect sensor is the most notable depth device used in the approaches being developed for real-time tracking and surface mapping.
The Kinect sensor, a low-cost commodity platform designed mainly to detect human gestures in gaming and other entertainment applications, has shown unprecedented potential for simultaneous localization and mapping. It uses an internal ASIC to generate a 640x480 depth map with 11 bits per pixel at 30 Hz. Although depth map quality suffers from certain technical limitations (such as motion blur at higher speeds), the information is rich enough to be used by real-time 3D reconstruction algorithms. Algorithms are also available to improve sensor accuracy [18], [19], depending on the sensor's use or the system's requirements.
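As an illustration of how a 640x480 depth map becomes input for 3D reconstruction, the sketch below back-projects depth pixels into a point cloud with a pinhole camera model. It rests on assumptions not stated in the text: the depth values are taken to be already converted to metres (the raw 11-bit values need a sensor-specific conversion first), zero marks an invalid pixel, and the intrinsics (`fx`, `fy`, `cx`, `cy`) used in the example are placeholder values rather than calibrated Kinect parameters.

```python
import numpy as np


def depth_to_points(depth_m, fx, fy, cx, cy):
    """Back-project a depth image (metres) to an N x 3 point cloud (sketch).

    Pixels with depth 0 are treated as invalid and skipped (an assumption).
    """
    height, width = depth_m.shape
    u, v = np.meshgrid(np.arange(width), np.arange(height))  # pixel coordinates
    valid = depth_m > 0
    z = depth_m[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=1)


# A synthetic 640x480 frame with two valid measurements.
depth = np.zeros((480, 640))
depth[240, 320] = 1.5
depth[240, 420] = 2.0
points = depth_to_points(depth, fx=525.0, fy=525.0, cx=320.0, cy=240.0)
print(points.shape)  # (2, 3)
```

At 30 Hz, a full frame produces up to 307,200 such points every 33 ms, which is why the dense methods discussed in this review rely heavily on GPU processing.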
Developed by [16], the KinectFusion algorithm was the first to achieve real-time volumetric reconstruction of a scene under variable lighting conditions. It takes the data from the Kinect sensor as input and uses a coarse-to-fine iterative closest point (ICP) algorithm to simultaneously track the camera pose and build a medium-sized 3D model in real time, by tracking each live depth frame relative to the global model built so far. At a given time k, the 6-DOF camera pose is described by the rigid-body transformation matrix in Equation 1, which maps the camera coordinate frame to the global frame g:
$$T_{g,k} = \begin{bmatrix} R_{g,k} & t_{g,k} \\ 0^{\top} & 1 \end{bmatrix} \in SE_3 \qquad (1)$$
In Equation 1, $SE_3 := \{R, t \mid R \in SO_3, t \in \mathbb{R}^3\}$. Thus any point $P_k \in \mathbb{R}^3$ in the camera frame is mapped to the global coordinate frame by the transformation $P_g = T_{g,k} P_k$. The algorithm performs real-time volumetric reconstruction in four steps: surface measurement, surface reconstruction update, surface prediction, and sensor pose estimation (Figure 2).
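The mapping $P_g = T_{g,k} P_k$ can be sketched in a few lines. This is only an illustration of the rigid-body transform itself, not of KinectFusion's tracking; the helper names `make_se3` and `to_global` are hypothetical, and the point is lifted to homogeneous coordinates so that the 4x4 matrix can be applied directly.

```python
import numpy as np


def make_se3(R, t):
    """Assemble a 4x4 rigid-body transform from a 3x3 rotation and a translation."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T


def to_global(T_gk, p_k):
    """Map a 3D point from the camera frame at time k to the global frame."""
    p_h = np.append(p_k, 1.0)   # homogeneous coordinates
    return (T_gk @ p_h)[:3]


# Camera rotated 90 degrees about the z axis and shifted 1 unit along x.
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
t = np.array([1.0, 0.0, 0.0])
T_gk = make_se3(R, t)
print(to_global(T_gk, np.array([1.0, 0.0, 0.0])))  # [1. 1. 0.]
```

The same operation, applied to every back-projected depth pixel, places a live frame into the global model's coordinate system.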
Tracking drift occurs when the sensor faces large planar scenes, which is a shortcoming of the system, but KinectFusion provides a powerful basis for large-scale volumetric reconstruction and dense modeling, with various approaches proposed by [16]. The Point Cloud Library developed by Rusu and Cousins [20] implements the KinectFusion algorithm as Kinfu, an open-source implementation, alongside other methods for point cloud manipulation and 3D reconstruction. Another extension of Kinfu was recently developed by Korn and Pauli, with an alternative ICP algorithm for an enlarged voxel grid, which improves the scanning of dynamic scenes [21]. Kinfu uses the voxel grid data to create vertex and normal maps that are registered with the maps obtained from the sensor, but in doing so a considerable amount of information is lost. To address this, Korn and Pauli suggest matching the maps obtained from the sensor directly against the voxel grid model. Their ICP algorithm also differs from the one adopted in Kinfu: they remove the normal threshold and, for the point-to-plane error metric, use normals computed from the depth maps instead of normals from the voxel grid, which has shown improved robustness of pose estimation with moving objects.
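The point-to-plane error metric mentioned above can be sketched as follows. This is a generic illustration of the metric, not the code of Kinfu or of Korn and Pauli: correspondences are assumed to be given already, and the function name and argument layout are hypothetical. The `normals` argument is where the choice discussed in the text enters, since the normals may come from the depth maps or from the voxel grid.

```python
import numpy as np


def point_to_plane_error(src, dst, normals, R, t):
    """Sum of squared point-to-plane distances for matched points (sketch).

    src, dst: N x 3 arrays of corresponding points.
    normals:  N x 3 unit normals at the dst points.
    R, t:     candidate rotation (3x3) and translation (3,) applied to src.
    """
    transformed = src @ R.T + t
    # Only the displacement along the surface normal is penalized.
    residuals = np.einsum("ij,ij->i", transformed - dst, normals)
    return float(np.sum(residuals ** 2))


src = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 1.0]])
dst = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
normals = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])

print(point_to_plane_error(src, dst, normals, np.eye(3), np.zeros(3)))                # 2.0
print(point_to_plane_error(src, dst, normals, np.eye(3), np.array([5.0, 0.0, -1.0])))  # 0.0
```

The second call shows that sliding along the plane costs nothing under this metric. This fits the observation above that large planar scenes leave the pose poorly constrained and allow tracking to drift.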
While the KinectFusion algorithm provides consistent and accurate volumetric reconstruction of smaller indoor scenes, dense 3D mapping of large indoor environments is addressed by the RGB-D Mapping algorithm of [22], a framework that uses an RGB-D camera to generate dense 3D models even of dark and featureless planar indoor environments. Frames are aligned by a joint optimization over shape and appearance matching (RGB features), combined with sparse feature extraction and matching using RANSAC. Loop closures are detected by matching each data frame against a subset of previously collected frames, and finally an improved, globally consistent alignment is obtained either by sparse bundle adjustment (SBA) or by a more efficient pose graph optimization, TORO in this case. At the core of RGB-D Mapping is its novel ICP algorithm [23], RGB-D ICP (Figure 4), which uses the visual information to identify sparse feature points in each camera frame. These point features then support the RANSAC optimization. A source RGB-D frame $P_s$ is input to the RGB-D ICP algorithm together with a target frame $P_t$. For a given rotation $R$ and translation $t$, the rigid transform is $T(p) = Rp + t$. RANSAC then finds the best rigid transform $T^{*}$ to obtain the best alignment, as shown in Equation 2.
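A rigid transform $T(p) = Rp + t$ can be estimated from matched 3D feature points in closed form. The sketch below uses the SVD-based Kabsch method, which is a common choice for the model-fitting step inside a RANSAC loop; it is offered as an assumption about how such a step could look, not as the RGB-D Mapping authors' implementation, and the RANSAC sampling and inlier counting are omitted. The function name `estimate_rigid_transform` is hypothetical.

```python
import numpy as np


def estimate_rigid_transform(src, dst):
    """Least-squares R, t such that dst ~ src @ R.T + t (Kabsch method, sketch).

    src, dst: N x 3 arrays of matched 3D points, N >= 3 and not collinear.
    """
    src_centroid = src.mean(axis=0)
    dst_centroid = dst.mean(axis=0)
    # Cross-covariance of the centred point sets.
    H = (src - src_centroid).T @ (dst - dst_centroid)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection (determinant of -1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_centroid - R @ src_centroid
    return R, t


# Synthetic check: rotate four points by 30 degrees about z, then shift them.
theta = np.pi / 6
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([0.5, -0.2, 1.0])
src = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
dst = src @ R_true.T + t_true
R_est, t_est = estimate_rigid_transform(src, dst)
print(np.allclose(R_est, R_true), np.allclose(t_est, t_true))  # True True
```

In a RANSAC setting, this solver would be run on small random subsets of the feature matches, keeping the transform that agrees with the largest number of matches.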