2018_Semantic SLAM Based on Object Detection and Improved Octomap_note

注释

(2022/4/15 上午9:14:24)

“ABSTRACT” (Zhang 等。, 2018, p. 1) (pdf)

提出了什么:

  • “In this paper we propose a Semantic SLAM system which builds the semantic maps with object-level entities, and it is integrated into the RGB-D SLAM framework.”
    • 提出一个语义SLAM系统,构建对象级别实体的语义地图,并集成到RGB-DSLAM框架。

系统详细信息:

  • “The system combines object detection module that is realized by the deep-learning method, and localization module with RGB-D SLAM seamlessly.”

    • 该系统将采用深度学习方法实现的目标检测模块与**RGB-D SLAM定位模块**无缝结合。
  • “The two modules are integrated together to obtain the semantic maps of the environment”

    • 将这两个模块集成在一起,得到环境的语义图
  • “to improve the computational efficiency of the framework, an improved Octomap based on Fast Line Rasterization Algorithm is constructed.”

    • 为了提高框架的计算效率,构造了一种基于快速线光栅化算法的改进Octomap
  • “for the sake of accuracy and robustness of the semantic map, Conditional Random Field (CRF) is employed to do the optimization.”

    • 为了提高语义地图的准确性和鲁棒性,采用条件随机场(CRF:Conditional Random Field)进行优化。

评估与结果:

  • “we evaluate our Semantic SLAM through three different tasks, i.e. Localization, Object Detection and Mapping.”

    • 通过三个不同的任务来评估语义SLAM:定位、目标检测和映射。
    • 与ORB-SLAM2和原始的RGB-D SLAM系统相比,在TUM数据集上的动态环境评估中,作者的系统定位精度和建图速度分别提高了72.9%和91.2%
  • “With the improved Octomap, the proposed Semantic SLAM is 66.5% faster than the original RGB-D SLAM.”

    • 在改进的·下,提出的语义SLAM比原来的·提高了66.5%。

“I. INTRODUCTION”

当前研究现状与局限性

  • “most recent researches in the field of SLAM only focus on the geometric mapping, instead of both geometric and semantic mapping.”

    • 目前在SLAM领域的研究主要集中在几何建图上,而没有同时考虑几何建图和语义建图。
  • “maps built by SLAM only can tell us where obstacles are and cannot supply semantic meaning.”

    • SLAM构建的地图只能告诉我们障碍物在哪里,不能提供语义。
  • “In such conditions, it is difficult to let the robot do some high level tasks”

    • 在这种情况下,很难让机器人完成一些高水平的任务
  • “the map created by traditional SLAM can only be useful in simple missions, such as navigation and path planning. Obviously, it cannot meet our expected intelligent demands.”

    • 传统的SLAM创建的地图只能在简单的任务中使用,如导航和路径规划。显然,它不能满足我们预期的智能需求。

作者提出的系统

  • “present aRGB-Dsemantic SLAM framework, which not only construct the semantic maps based on the geometric SLAM, but also improve the localization accuracy according to the semantic maps.”

    • 提出的RGB-D语义SLAM框架,该框架在几何SLAM的基础上构造语义映射,并根据语义地图提高定位精度。
  • “one module is the RGB-D SLAM based on sparse feature, which provides information about the locations of objects and also builds the 3D map.”

    • 其中一个模块是基于稀疏特征的RGB-D SLAM,它提供了物体的位置信息,并建立了三维地图。
  • “The other one is object detection realized by deep learning method.”

    • 另一个是用深度学习方法实现的目标检测模块。
  • “According to results of these two modules, we design the integrated RGB-Dsemantic framework, which provides the semantic map, and improves the localization accuracy.”

    • 根据这两个模块的结果,我们设计了一个集成的RGB-D 语义框架,提供了语义图,提高了定位精度。
  • “our system creates a point clouds map of an environment with semantic meanings, which contains separate object models with semantic and geometric information.”

    • 该系统创建了一个具有语义的环境点云地图,其中包含了具有语义和几何信息的独立对象模型。
  • “Our map not only maintains 3D point clouds by projecting semantic messages to 3D models, but also separates object entities independently.”

    • 我们的地图不仅通过将语义消息投射到三维模型来维护三维点云,而且独立地分离对象实体。
  • “it can provide more advanced understanding of environment”

    • 可以提供对环境更高级的理解
  • “The proposed system can detect and classify 80-200 object classes using deep-learning based detection algorithm, while the existing semantic mapping systems [42]-[43] can only detect less than 20 classes.”

    • 利用基于深度学习的检测算法,该系统可以检测和分类80-200个对象类,而现有的语义映射系统[42]-[43]只能检测不到20个类。
  • “when our system builds maps, it can create 3D object models without requiring a-priori known 3D models.”

    • 当我们的系统建立地图时,它可以创建三维对象模型,而不需要先验已知的三维模型。
  • “Because 3D object entities of a semantic class, such as cup, have many kinds of shapes, it can limit the environment understanding as the robot needs to know the 3D object model of an object before identification.”

    • **局限:**由于一个语义类的三维物体实体,如cup,具有多种形状,机器人在识别前需要知道物体的三维模型,从而限制了对环境的理解。
  • “our system mainly focuses on object-level entities, however, some other semantic segmentation methods, such as [1][2], focus only on pixel-level entities. Maps generated by such methods are less usable, because in this condition objects are modeled offline and maintained all the time.”

    • 我们的系统主要关注对象级实体(改进:速度快些),而其他一些语义分割方法,如[1][2]只关注像素级实体。这种方法生成的映射的可用性较低(注:速度太慢),因为在这种情况下,对象是离线建模的,并且一直在维护。

“II. Related Work”

  • “Nowadays, SLAM has reached a level of maturity where maps can be built nearly in real time.”

    • 如今,SLAM已经达到了可以几乎实时构建地图的成熟程度。

    • 语义可以使机器人更加智能,完成更加高层次的任务。

A. SLAM(综述部分)

  • gmapping[3] is a typical SLAM algorithm based on Rao-Blackwellized Particle Filters.”
    • 是一种典型的基于Rao-Blackwellized粒子滤波的SLAM算法。
  • Google’s Cartographer[4] is the newest SLAM algorithm based on Lidar input, which provides a good loop closure detection.”
    • 谷歌的Cartographer[4]是最新的基于激光雷达输入的SLAM算法,提供了很好的环路闭合检测。

“FEATURE-BASED METHODS”

基于特征的方法:传统方法,从图像中提取稀疏点,匹配相邻帧且恢复相机位姿。

  • “KinectFusion[5] uses RGBD camera to generate dense point clouds, recovers camera poses and scene structure with ICP algorithm, and accelerates tracking by CUDA.”

    • KinectFusion[5]利用RGBD摄像机生成密集点云,用ICP算法恢复摄像机姿态和场景结构,用CUDA(高性能计算上,越来越多的被使用)加速跟踪。
  • RGBD_SLAM[6] can also generate dense point clouds, it tracks ORB features and optimizes camera poses by G2O algorithm. However, the algorithm in RGBD_SLAM uses every frames to optimize camera poses instead of KeyFrames, and therefore it is computationally inefficient.”

    • RGBD_SLAM[6]还可以生成密集点云,跟踪ORB特征**(注:在这里就有人使用ORB 特征了)**,并利用G2O算法优化摄像机姿态。然而,RGBD_SLAM中的算法利用每帧来优化相机姿态,而不是关键帧,因此计算效率较低
  • PTAM[7] utilizes KeyFrames to optimize camera poses, it therefore works fast and stable, but it lacks loop closure detection, relocalization and auto initialization, and it can only generate sparse point clouds.”

    • PTAM[7]利用关键帧对摄像机姿态进行优化,速度快、稳定性好,但缺乏闭环检测、重定位和自动初始化,只能生成稀疏点云。(注:人家只是一个里程计,要那么多要求干什么?这可是划时代的进步)
  • “ORB-SLAM[8], which is the state of the art in this field, not only supports RGBD camera, stereo camera and mono camera, but also contains loop-closing, relocalization, and auto initialization. It can work well both in small and large scale environments.”

    • ORB-SLAM[8]是该领域的最新研究成果,它不仅支持RGBD摄像机、立体摄像机和单目摄像机,而且还包含回环、重定位和自动初始化功能。它可以很好地工作在小型和大型环境中。(注:特征点法的巅峰之作)

“DIRECT METHODS”

相机位姿直接从图像的强度(灰度不变假设)估计得到

  • DSO_SLAM[9] uses Direct Methods to estimate poses and maintains 5 to 7 keyframes through sliding windows, but it lacks loop closure, which leads to more errors over time.”

    • DSO_SLAM[9]采用直接方法估计姿态,并通过滑动窗口保持5到7个关键帧,但缺乏回环,随着时间的推移会导致更多的错误。
  • LSD_SLAM[10] can generate semi-dense depth image, and it is used to match next frame in order to estimate camera poses. However, it is sensitive to light change.”

    • LSD_SLAM[10]可以生成半稠密的深度图像,并用于匹配下一帧以估计摄像机姿态。但是,它对光的变化很敏感。
  • SVO_SLAM[11] belongs to half Direct Methods, because only sparse model-based image alignment uses Direct Methods, while pose estimation and bundle adjustment depend on features matching.”

    • SVO_SLAM[11]属于半直接方法,因为只有基于稀疏模型的图像对准使用直接方法,而姿态估计和束调整依赖于特征匹配。

“B. Object Detection and Semantic Segmentation”

传统的语义分割

  • “Some of the methods can only detect few objects through features stored in databases with traditional computer vision algorithms [34].”
    • 传统的计算机视觉算法只能通过存储在数据库中的特征来检测少数目标[34]。(注:在深度学习之前就已经可以进行语义分割了)

基于深度学习的语义分割

  • “However, some researches based on deep learning can detect many objects, even the objects belonging to the same class but having different shapes.”

    • 而一些基于深度学习的研究可以检测出很多物体,甚至是属于同一类但形状不同的物体。(注:深度学习出现后简单粗暴,准确率还高,就是速度不怎么样,特别是对实时性要求很高SLAM领域)
  • “we are interested in the task of object detection”

    • 我们对目标检测的任务感兴趣**(注:主要是为了区分动态物体)**
  • “This network extracts features through AlexNext[38] and realizes classification by SVM[36], but it takes several seconds to process one image.”

    • 该网络通过AlexNext[38]提取特征,通过SVM[36]实现分类,但处理一幅图像需要几秒钟的时间。注:速度慢吧
  • “In order to improve R-CNN, Fast R-CNN[14] maps feature map to feature vector, and it is used as an input to fully connected layer by ROI-pooling, and replaces SVM with softmax.”

    • 为了改进R-CNN,Fast R-CNN[14]将特征映射到特征向量,通过ROI池化将其作为全连通层的输入,并用Softmax代替SVM。
  • 贴一下链接,后边在仔细看看:

    (ROI Pooling:ROI Pooling(感兴趣区域池化) - 刘下的文章 - 知乎 https://zhuanlan.zhihu.com/p/65423423
    SVM:【机器学习】支持向量机 SVM(非常详细) - 阿泽的文章 - 知乎 https://zhuanlan.zhihu.com/p/77750026
    AlexNet:深度学习卷积神经网络-AlexNet - Adia的文章 - 知乎 https://zhuanlan.zhihu.com/p/42914388
    softmax: https://blog.csdn.net/bitcarmanlee/article/details/82320853

  • “It is therefore faster than R-CNN, but it is still too slow for a real-time requirement in SLAM. Faster R-CNN[35] utilizes the Region Proposal Network (RPN) to generate object proposals and adds anchor and shared features to promote the speed of detection, its speed can reach up to 5fps.”

    • 比R-CNN更快,但对于SLAM中的实时要求来说,它仍然太慢了。Faster R-CNN[35]利用区域建议网络(RPN)生成目标建议,并增加锚点和共享特征来提高检测速度,其速度可达5fps。(注:还是太慢了)
  • “Yolo” 快速物体检测算法,用S X S 个网格替换物体提议,并对这些网格的分类实现最终检测。注:使用YOLO来检测动态物体,后边把动态物体上的特征点直接去除

“C. Semantic SLAM”

在前边就重复了好多遍的语义SLAM

  • “Semantic SLAM is used to calculate the motion and position, and object detection and semantic segmentation are utilized to generate semantic map. Semantic SLAM can be categorized into two types based on the object detection methods.”

    • 利用语义SLAM计算运动和位置,利用目标检测和语义分割生成语义地图。根据对象检测方法的不同,语义SLAM可以分为两类。
  • “The first type uses traditional methods to detect object. Real-time Monocular Object SLAM[17] is the most common one, which employs Bags of Binary Words and a database with 500 3D object models to provide a real-time detection. But it limits a lot because 3D object entities of a semantic class like cup having many different kinds of shapes.”

    • 第一类采用传统方法检测目标。实时单目对象SLAM[17]是最常见的一种,它使用二进制词袋和一个拥有500个3D对象模型的数据库来提供实时检测。但是它限制了很多,因为像cup这样的语义类的3D对象实体有很多不同的形状。
  • “[18] generates object proposals through multi-view images, then extracts dense SIFT descriptors from these proposals and predicts their classes. [19] employs DPM[12], in which Hog feature is used to describe the object.”

    • [18]通过多视图图像生成对象建议,然后从这些建议中提取密集的SIFT描述符并预测它们的类别。[19]使用DPM[12],其中Hog特征用于描述对象。注:这些特征精度很高,但是太慢
  • “The other kind of SLAM is using deep-learning methods to do the object recognition, such as method proposed in [20], however, the semantic information is built based on pixels instead of object entities.”

    • 另一种是采用深度学习的方法进行目标识别,如文献[20]提出的方法,但是语义信息是基于像素而不是基于目标实体来建立的。
  • “In fact, this approach is too complex and not practical due to two reasons: (1) robot wants to understand the major semantic meaning of the environment in mission execution, which means it does not care about every pixel’s semantic information, (2) computational speed is not sufficient to perform pixel level semantic classification in robot SLAM system.”

    • 由于以下两个原因,这种方法过于复杂和不实用:(1)机器人在执行任务时希望理解环境的主要语义,这意味着它不关心每个像素的语义信息;(2)在机器人SLAM系统中,计算速度不足以进行像素级语义分类。注:鱼与熊掌我都想要

“D. Mapping”

  • “Several approaches have been proposed to build 3D environments, and 3D maps can be represented with point clouds, elevation maps[21], multi-level surface maps[22] and so on.”
    • 建立三维环境的方法很多,3D地图可以用点云、高程图[21]、多层曲面图[22]等表示。注:不同地图侧重领域不同,建图速度也相差很大

对不同地图的批判

  • “Point clouds store large number of points and consume a lot of memory.”

    • 点云存储大量的点,消耗大量的内存。
  • “it cannot easily differentiate between cluttered and free spaces.”

    • 它不能轻易区分杂乱和自由的空间。注:地图混叠吗?
  • “Elevation maps and multi-level surface maps cannot represent unmapped areas, although they are efficient. More importantly, these methods can not represent arbitrary 3D environments.”

    • 高程图和多级表面图不能表示未映射的区域,尽管它们是高效的。更重要的是,这些方法不能表示任意的三维环境。

吹爆Octmap

  • “Octomap[23] is adopted which is used widely in the field of mapping. OctoMap has advantages of taking measurement uncertainty into account, being space efficient and implicitly representing free and occupied space.”

    • 采用了在绘图领域中应用广泛的Octomap[23], OctoMap具有考虑测量不确定度、空间效率高、隐式表示自由空间和占用空间等优点。
  • “it still takes too much time to build the maps.”

    • 它仍然需要太多的时间来建立地图。注:建图时间还是太多了,别急,后边作者提出了个加速算法,我推导过

III. System Overview

A. SLAM分析

开始过渡了,当前SLAM在动态环境方面的不足

  • “the geometric aspect of the SLAM problem is well understood, and has reached a level of maturity where city level maps can be built precisely and even in real time.”

    • SLAM问题的几何方面已经得到了很好的理解,并且已经达到了一个成熟的水平,可以精确甚至实时地构建城市级地图。
  • “But they can only work well in static environments or the one with small dynamic objects.”

    • 如今的SLAM只能在静态环境或具有小型动态对象的环境中工作。
  • “In the scene with small dynamic objects, as only few feature points are situated at dynamic objects, the SLAM can therefore still work well.”

    • 在具有小动态目标的场景中,由于只有少量的特征点位于动态目标上,所以SLAM仍然可以很好地工作。
  • “Feature-Based SLAM is easy to be effected by large moving objects.”

    • 基于特征的SLAM容易受到大型运动目标的影响。
  • “most Feature-Based SLAM systems are built based on a strong assumption that the number of features on moving objects is much smaller than those on static objects.”

    • 大多数基于特征的SLAM系统都建立在一个强假设基础上,即运动物体上的特征数量比静态物体上的特征数量少得多

建立在动态物体的特征远少于静态物体的上的特征呗

问题:动态物体与静态物体的区分标准是什么?很经典的一幕就是巨大而移动缓慢的动态物体经常被识别为静态物体,但实际上人家不是,所以作者后边对ORB-SLAM2作了改进:不用参考关键帧了,直接使用当前关键帧(牺牲建图精度,提高动态物体的识别准确度) 终究是要取舍的

  • “Octomap is based on OcTree structure which is good for searching and building, while point clouds only store each points without any structures. Octomap can carry not only the RGB and position information but also the semantic messages.”

    • Octomap是基于八叉树结构,Octmap不仅可以携带RGB和位置信息,还可以携带语义信息。
  • “the point clouds only store the original messages from RGB-D camera.”

    • 点云只存储来自RGB-D摄像机的原始信息。点云携带的信息少了

B. ORB-SLAM 分析

欢迎来到ORB-SLAM2

  • “Tracking thread is in charge of localizing the camera with every frame in real time and deciding when to insert a keyframe. In tracking thread, it performs an initial feature matching with the previous frame and optimizes the pose by Bundle Adjustment (BA) algorithm. If tracking is lost, it performs a global re-localization with Bag of Word, then searches map points by re-projection and optimizes the pose with local map points. Finally, the tracking thread can decide if a new keyframe can be generated.”

    • 跟踪线程负责实时定位相机的每一帧,并决定何时插入关键帧。在跟踪线程时,对前一帧进行初始特征匹配,并通过束调整(BA)算法对姿态进行优化。如果跟踪丢失,则利用Word包进行全局重新定位,然后通过重新投影搜索地图点,并利用局部地图点优化姿态。最后,跟踪线程可以决定是否可以生成新的关键帧。
  • “the Local Mapping thread will triangulate new map points through its relative keyframes. Then, it optimizes the pose of relative keyframes and map points with BA. Finally, redundant keyframes and low quality map points are removed.”

    • 局部建图线程将通过其相对关键帧对新的地图点进行三角化。然后,利用BA优化相关关键帧和地图点的姿态。最后去除冗余关键帧和低质量地图点。
  • “The Loop Closing thread responses to loop closure with every keyframe. If a loop is detected, the similarity transformation is computed which represents the drift accumulated in the loop.”

    • 回环检测线程响应每个关键帧的回环。如果检测到回环,则计算表示回环中积累的漂移的相似变换。然后对齐两个回环边的重复点融合。最后在相似度约束下进行位姿图优化,以实现全局一致性。
  • “Although ORB-SLAM2 is a very practical algorithm, it still faces some questions, such as how to work well in dynamic environments, how to supply semantic information and maps and so on.”

    • 虽然ORB-SLAM2是一个非常实用的算法,但它仍然面临着一些问题,如如何在动态环境中工作,如何提供语义信息和地图等。注:对ORB-SLAM2说了那么多就是为了这一句,批判它,批判之后要解决问题啊,引出下边

“C. Overview of Semantic SLAM System”

注:本论文的主角登场

  • “In the proposed Semantic SLAM system, ORB-SLAM2 is in charge of camera localization and mapping with every RGB-D frames.”

    • 在所提出的语义SLAM系统中,ORB-SLAM2负责摄像机的定位和每个RGB-D帧的建图。
  • “Tracking thread is responsible for tracking by keyframes instead of reference frames, in order to decrease the effect of moving objects.”

    • 跟踪线程负责通过关键帧而不是参考帧进行跟踪,以减少运动目标的影响。 注:这就前边说的牺牲定位精度(稍微牺牲了一丁点,结果部分来看基本没啥牺牲,然后显著降低了运动目标的影响,很值)
  • “Local Mapping thread adds a few keyframes to create semantic messages, because semantic messages extraction cannot fulfill the requirement of real-time performance.”

    • 由于语义消息提取不能满足实时性的要求,局部建图线程增加了一些关键帧来创建语义消息注:对采用语义SLAM后出现实时性下降问题的解决方案
  • “After getting keyframes from ORB-SLAM2, YOLO[15] is used to detect objects in each keyframe to get semantic message. In our implementation, we use the tiny-weight version to detect objects, because this version is trained on MS-COCO Dataset, which contains 80 different kinds of objects.”

    • 从ORB-SLAM2中获取关键帧后,利用YOLO[15]对每个关键帧中的对象进行检测,得到语义消息。在我们的实现中,我们使用微小权重版本来检测对象,因为这个版本是在包含80种不同类型对象的MS-COCO数据集上训练的(类型多嘛)。
  • “object regularization based on CRF is used to correct the probabilities of each object computed by YOLO.”

    • 采用基于CRF的对象正则化方法,对YOLO算法计算出的每个对象的概率进行修正。
    • CRF(固定速率系数)
  • “constraints between objects are computed according to the statistics of MS-COCO Dataset, and it is then used to optimize the object probabilities computed by YOLO detection.”

    • 根据MS-COCO数据集的统计量计算对象间的约束条件,并利用该约束条件对YOLO检测计算的对象概率进行优化。
  • “When accurate labels of each object are captured, filter process is used to provide more stable features and remove the unstable features which are always locate on the moving objects. At the same time, the temporary objects are created, which contain point clouds produced by projection.”

    • 当捕捉到每个目标的准确标记时,通过滤波处理提供更稳定的特征,去除运动目标上的不稳定特征。同时创建临时对象,其中包含投影产生的点云。
  • “we use data association module to decide either to create a new object or associate it with existing object in the map according to the matching score.”

    • 利用数据关联模块,根据匹配结果决定创建新的对象或将其与地图中已有的对象进行关联
  • “in order to find correspondence between existed objects and temporary objects, we first build relationship between keyframes and objects.”

    • 为了找到已有对象与临时对象之间的对应关系,我们首先建立关键帧与对象之间的关系。
  • “Kd-Tree structure is used to accelerate the computation of matching score.”

    • 采用Kd-Tree结构加速匹配分数的计算。
  • “When the existing objects can be combined with the temporary objects, the former can be updated with the new detection by a recursive Bayesian process.”

    • 当现有对象与临时对象可以结合时,可以通过递归贝叶斯过程用新的检测更新前者。注:对象更新方法
  • “Map Generation uses point clouds stored in objects to generate map based on Octomap, which is accelerated by multi-threads realization and Fast Line Rasterization algorithm.”

    • 地图生成是利用存储在对象中的点云来生成基于Octomap的地图,通过多线程实现和快速线程光栅化算法来加速地图生成注:地图加速方法,现在多线程都是常规操作了,
  • “In order to integrate the concept of semantic into the framework of ORB_SLAM2, we construct relationship between keyframes and objects by referring to the implementation method between keyframes and map points, which has existed in ORB-SLAM2.”

    • 为了将语义的概念融入到ORB-SLAM2框架中,借鉴了ORB-SLAM2中已有的关键帧与地图点之间的实现方法,构建了关键帧与对象之间的关系。
  • “In ORB_SLAM2, each keyframe stores map points that it has observed in the frame image, at the same time, each map point records the keyframes which have observed the map point sequentially.”

    • 在ORB_SLAM2中,每个关键帧存储其在帧图像中观察到的地图点**(每个关键帧存储它自己观测到的地图点),同时每个地图点记录观察到该地图点的关键帧(你记得我,我也记得你)**。
  • “we can build relationship between keyframes and perform some optimization, such as analyzing whether a keyframe is redundant or deciding whether a map point has high quality.”

    • 我们可以建立关键帧之间的关系,并进行一些优化,如分析一个关键帧是否冗余或判断一个地图点是否具有高质量。这里没有说明怎样判断关键帧是否冗余,怎样判断地图点是否质量高,像ORB-SLAM2那样判断吗?
  • “we build the relationship between keyframe and each object as followings.”

    • 建立关键帧和每个对象之间的关系,如下所示。

“In our realization, each object ܱ O i O_i Oi contains :

  1. Word coordinates of each point cloud that are located on the object.
  2. A fixed number of class labels and the corresponding confidence score which is calculated through a recursive Bayesian update.
  3. Keyframes which can observe this object.
  4. Kd-tree structure generated through the object’s point clouds, which is used for fast search.
  5. The class label which this object belongs to.
  6. The number of observations.”

在我们的实现中,每个对象 O i O_i Oi包含:

  1. 位于该对象上的每个点云的字坐标;
  2. 固定数量的类标签和相应的置信度得分,通过递归贝叶斯更新计算;
  3. 可以观察该对象的关键帧;
  4. 通过该对象的点云生成Kd-tree结构,用于快速搜索;
  5. 该对象所属的类标签;
  6. 观察次数。
  • “Each keyframe K i K_i Ki should store :

    1. The corresponding RGB image which is used to detect objects.
    2. The corresponding depth image which is used to generate point clouds.
    3. Objects that have been observed in this keyframe.”
  • 每个关键帧应该存储:

    1. 对应的RGB图像,用于检测目标;
    2. 对应的深度图像,用于生成点云;
    3. 在该关键帧中观察到的目标。
  • “we create an object database, in which all the detected objects are stored.

    • 自己创建一个对象数据库,里边存储了所有检测到的对象。

“Semantic Mapping”

基于ORB-SLAM2的语义SLAM

“A. Improved SLAM” A.改进SLAM

  • “In ORB_SLAM2, Tracking thread localizes the camera with every frame through four steps. First, ORB features are extracted from RGB images. Second, ORB features are used to perform feature matching with the reference frame, preliminarily calculate the camera pose and return the number of matched map points. Third, the camera pose is optimized again with the matched locale map points which are searched through the relative keyframes. Finally, tracking thread decides whether a new keyframe is inserted based on some principles.”

    • 在ORB_SLAM2中,跟踪线程通过四个步骤对每个帧的相机进行定位。
    • 首先,从RGB图像中提取ORB特征
      其次,利用ORB特征与参考帧进行特征匹配,初步计算摄像机位姿并返回匹配的地图点数。
      再次,通过关键帧搜索匹配的场景地图点,对相机姿态进行优化
      最后,跟踪线程根据一些原则判断是否插入新的关键帧
  • “tracking thread module is modified in the following three ways.”

    • 跟踪线程模块通过以下三种方式进行修改

改用关键帧的论述

  • “In order to reduce the effect of dynamic objects, the second step in tracking thread is changed to track by keyframes instead of reference frames.”

    • 为了减少动态对象对线程跟踪的影响,将跟踪线程的第二步由参考帧改为关键帧跟踪

    注:将参考帧改为直接用关键帧,略微降低定位建图精度,提高对动态物体的适应性

  • “If SLAM tracks by reference frame, the camera pose calculation can easily be effected by large moving objects.”

    • 给出了原因:降低大型运动物体的影响
  • “This is because, when a large moving object passes by, original SLAM will track features on the moving object, which affects the tracking accuracy.”

    • 这是因为当一个大的运动物体经过时,原有的SLAM会对运动物体上的特征进行跟踪(注:误将大型运动物体认为是静态的环境物体),从而影响动态环境中的跟踪精度。
  • “if SLAM tracks feature by keyframes, it can still calculate the correct camera pose before the new keyframe insertion.”

    • 如果SLAM通过关键帧跟踪特征,它仍然可以在新的关键帧插入之前计算正确的摄像机姿态。
  • “The essential reason is that old keyframe doesn’t contain features of the moving object.”

    • 其根本原因是**旧的关键帧(注:相邻的前几帧,若有,地图更鲁棒,但是会包含前几帧的信息)**不包含运动对象的特征。
  • “choose the Levenberg-Marquardt method, from G2O[50] which contains several optimization algorithm to optimize the pose of the current frame.”

    • G2O[50]中选择Levenberg-Marquardt方法,其中包含多个优化算法来优化当前帧的姿态。
  • “This method needs a good initial estimated value for optimization. Therefore we use the constant velocity motion model to predict the position of the current frames as a G2O initial value before optimization.”

    • 该方法需要一个良好的初始估计值进行优化。因此,我们采用等速运动模型预测当前帧的位置,作为优化前的G2O初值。
  • “The third step in tracking process is modified to compare the number of matched inliers with the result of the second step to judge whether the tracked current frame is lost.”

    • 对跟踪过程中的第三步进行了改进,将匹配的内点数与第二步的结果进行比较,判断被跟踪的当前帧是否丢失。 注:这样好像更容易丢啊,人家ORB-SLAM2用的参考帧和前边的信息有关联,更加鲁棒,这直接使用关键帧,妥妥的马尔科夫链啊
  • “In ORB-SLAM2, matched inliers are compared with a constant value in the third step. It is easy to lose tracking when the camera moves fast, because the ORB feature points of current frame may only match with the map points observed by the last frame.”

    • 在ORB-SLAM2中,第三步将匹配的inlier与恒定值进行比较。当摄像机快速移动时,由于当前帧的ORB特征点可能仅与上一帧观测到的地图点相匹配,容易丢失跟踪(注:对快速运动场景不适应了又)。
    • 注:还有一种解释:ORB-SLAM2关键帧上的特征点更少一些,但是人家用的特征点都是参考帧,与前几帧的信息相关,有约束不好匹配。语义SLAM的关键帧特征点多一些,但是这个语义SLAM的匹配精度没有ORB-SLAM2高
  • “In this case, the number of the observed map points will be less than the constant value. Therefore, the third step should compare the number of matched inliers with the number of matched inliers computed by the last frame.”

    • 在这种情况下,观察到的地图点数将小于常数值。因此,第三步应该将匹配Inlier的数量与上一帧计算的匹配Inlier的数量进行比较。
  • “The second step function computes the number of matched map points between the last keyframe and the current frame, therefore we should compare the number of matched inliers with the result of the second step.”

    • 第二步函数计算上一个关键帧和当前帧之间匹配的映射点的数量,因此我们应该将匹配的inlier的数量与第二步的结果进行比较。
  • “Tracking thread is changed to create more keyframes.”*

    • 跟踪线程被改进以创建更多的关键帧。ORB-SLAM2中,**当相机移动时观测到的特征越来越少,跟踪容易丢失。**因此,需要更多的关键帧来创建更多的地图点。

“B. Object Detection”

  • “In semantic SLAM, individual objects are important entities which not only can supply semantic information for the map, but also can enhance the localization accuracy”

    • 语义信息对SLAM的帮助:在语义SLAM中,单个目标是重要的实体,它们不仅可以为地图提供语义信息,而且可以提高定位精度
  • “In deeplearning area, DeepLab[39] and FCN[40] can provide pixel level semantic segmentation, however, RCNN, Fast-RCNN and Faster-RCNN can supply the object level’s bounding box detection. Furthermore, they may generate too many object proposals which cause detection of same region multiple times.”

    • 在deeplearning区域,DeepLab[39]和FCN[40]可以提供像素级的语义分割,而RCNN、Fast-RCNN和Faster-RCNN可以提供对象级的边界盒检测。此外,它们可能会产生太多的目标建议,导致同一区域多次检测。
  • “Thus, many of them are too slow, and they cannot satisfy the real-time requirement when they are integrated with SLAM.”

    • 因此,在与SLAM集成时,很多算法速度太慢,不能满足实时性的要求。
  • “YOLO can process 45 images per second, therefore we choose YOLO as the object detection method to generate a number of object proposals in the form of bounding boxes for every keyframe in our proposed semantic SLAM system.”

    • YOLO每秒可以处理45幅图像,因此我们选择YOLO作为对象检测方法,在我们提出的语义SLAM系统中,以包围盒的形式为每个关键帧生成大量的对象建议。(速度还可以)
  • “We use the network trained on the COCO dataset instead of PASCAL VOC, because COCO dataset contains 80 types of objects, while PASCAL VOC dataset only has 20 kinds of objects.”

    • 由于COCO数据集包含80种对象,而PASCAL VOC数据集只有20种对象,所以我们使用了在COCO数据集上训练的网络来代替PASCAL VOC。
  • “After some experiments, we find that normal YOLO still takes about 0.04s per images which cannot fulfill realtime requirement. However, Tiny YOLO weights can fulfill all these requirements.”

    • 经过一些实验,我们发现正常的YOLO仍然需要0.04s左右的每幅图像,不能满足实时性的要求。然而,微小的YOLO权重可以满足所有这些要求。
  • “we use Tiny YOLO weights instead of normal YOLO weights. In the implementation, YOLO detects the new keyframe in Local Mapping module after the new keyframe is added and wrong map points are culled sequentially.”

    • 我们使用微小的YOLO权重而不是正常的YOLO权重。在实现中,YOLO在添加新的关键帧后,在本地映射模块中检测新的关键帧,并依次剔除错误的映射点。

“C. Object Regularization” C.对象正则化

  • “Although we can get the semantic label for each object through YOLO algorithm, semantic context among objects has not been explicitly incorporated into the object categorization models.”

    • 虽然我们可以通过YOLO算法得到每个对象的语义标签,但对象之间的语义上下文并没有显式地纳入到对象分类模型中。
  • “Some researchers have found that context information is good for semantic segmentation, however, object context requires access to the referential meaning of the object [25].”

    • 一些研究者发现语境信息有利于语义切分,然而,对象语境需要获取对象的参考意义[25]。
  • “In other words, when performing the task of object categorization, objects’ category label must be assigned with respect to other objects in the scene, assuming there is more than one object present.”

    • 换句话说,在执行对象分类任务时,假设场景中存在不止一个对象,就必须为场景中的其他对象分配对象的类别标签。
  • “In the task of image segmentation, context information is used to optimize the final result with CRF (Conditional Random Filed) algorithm.”

    • 在图像分割中,利用上下文信息,采用条件随机域算法对最终分割结果进行优化。
  • “The main problem that CRF can solve effectively is how to model the class scores calculated by some classifiers and local information of images simultaneously.”

    • ​ CRF能够有效解决的主要问题是如何将分类器计算的分类分数和图像的局部信息同时建模。
  • “Then this problem can be treated as a problem of maximizing a posteriori.”

    • 那么这个问题可以看作是一个最大后验概率的问题。
  • “We can define unary potentials to model the probabilities that each pixel or patches belongs to each category, and pairwise potentials to model relation between two pixels or patches.”

    • 我们可以定义一元势函数来模拟每个像素或块属于每个类别的概率,定义点对势函数来模拟两个像素或块之间的关系。
      语义分割之DeepLabv1 - SpartacusIn21的文章 - 知乎 https://zhuanlan.zhihu.com/p/53421692
  • “The frequently used CRF models contain unary potentials, pairwise potentials and some weighted parameters. The pairwise potentials are modeled in 4 or 8 neighbors, like [44]-[46-51], therefore this structure is limited in modeling long distances on the image.”

    • 常用的CRF模型包括一元势函数、点对势函数和一些加权参数。点对势函数是在4或8个邻居中建模的,因此这种结构在图像上的长距离建模中是有限的。
  • “Toyoda et al. [32] proposed a fully connected CRF to integrate local information and global information jointly which sets up pairwise potentials on all pairs of pixels or patches in semantic image labeling task. By modeling long range interactions, dense CRF provides a more detailed labeling compared to its sparse version. The dense CRF with Gaussian kernel potentials has emerged as a popular framework for semantic image segmentation tasks”

    • 丰田章男等人。[32]提出了一种将局部信息和全局信息联合集成的全连通CRF算法,在语义图像标注任务中,在所有对像素或片上建立成点对势函数。通过对远程交互进行建模,稠密CRF提供了比稀疏版本更详细的标记。具有高斯核结合势函数的稠密CRF已经成为一种流行的语义图像分割框架
  • “we construct a probabilistic object-based dense CRF. Compared with CRF based on pixel level, our proposed model can reduce the computational complexity significantly.”

    • 我们构造了一个基于对象对象的稠密CRF。与基于像素级的CRF相比,该模型可以显著降低计算复杂度。
  • 构建了一个基于对象概率的稠密CRF

  • 给出基于对象模型的Gibbs energy function:

  • P ( x ) = 1 Z exp ⁡ ( − E ( x ) ) P(x)=\frac{1}{Z} \exp (-E(x)) P(x)=Z1exp(E(x))

  • E ( x ) = ∑ i ψ u ( x i ) + ∑ i < j ψ p ( x i , x j ) E(x)=\sum_{i} \psi_{u}\left(x_{i}\right)+\sum_{iE(x)=iψu(xi)+i<jψp(xi,xj)

    • x x x: 分配给对象特征的标签
    • i , j i, j i,j的范围从 1 − k ( k 是 地 图 中 对 象 的 数 量 ) 1-k(k是地图中对象的数量) 1k(k)
    • Z Z Z是归一化因子
    • 在基于对象的CRF模型中,目标是最小化 E ( x ) E(x) E(x)来获得最终的分配标签
  • 一元势函数 Ψ u \Psi_{u} Ψu建模CRF中每个vertex

  • 点对势函数 Ψ p \Psi_{p} Ψp 建模CRFvertices之间的联系

    • (G2O?)
  • 一元势函数和点对势函数的一般表达形式:

  • ψ u ( x i ) = − log ⁡ P ( x i ) \psi_{u}\left(x_{i}\right)=-\log P\left(x_{i}\right) ψu(xi)=logP(xi)

  • ψ u ( x i , x j ) = μ ( x i , x j ) ∑ m = 1 K ω m exp ⁡ ( − a m ( f i , j ) 2 ) \psi_{u}\left(x_{i}, x_{j}\right)=\mu\left(x_{i}, x_{j}\right) \sum_{m=1}^{K} \omega_{m} \exp \left(-a_{m}\left(f_{i, j}\right)^{2}\right) ψu(xi,xj)=μ(xi,xj)m=1Kωmexp(am(fi,j)2)

    • P ( x i ) P(x_{i}) P(xi): YOLO算法检测到的第 i i^{} i对象的标签概率分布

    • ω \omega ω: 线性组合的权重

    • μ \mu μ: 标签兼容函数, 描述两个不同类同时出现在相邻位置的可能性

    • Potts 模型是最简单的标签兼容函数,并在本系统中使用:

    • μ ( x i , x j ) = { 0 ,  if  x i = x j 1 ,  otherwise  } \mu\left(x_{i}, x_{j}\right)=\left\{\begin{array}{cc} 0, & \text { if } x_{i}=x_{j} \\ 1, & \text { otherwise } \end{array}\right\} μ(xi,xj)={0,1, if xi=xj otherwise }

    • f i , j f_{i,j} fi,j: 第 i i i 个对象和第 j j j 个对象的约束

    • f i , j f_{i, j} fi,j​ 计算如下:

    • f i , j = 1 p i , j f i , j = 1 p i , j f_{i, j}=\frac{1}{p_{i, j}}f_{i, j}=\frac{1}{p_{i, j}} fi,j=pi,j1fi,j=pi,j1

      • p i , j p_{i,j} pi,j : 第 i i i 个对象和第 j j j 个对象出现在同一位置的概率,通过COCO数据集进行统计获得
  • 每个图像都有不同种类的对象在同时显示,因此,根据这些图像计算标签共视关系计数矩阵

  • 矩阵中的元素 i , j i, j i,j 表示标签为 i i i 的对象在标签为 j j j 的对象的训练图像中图像的次数

  • 对角线元素表示对象在训练集中的频率,标签共视关系计数矩阵的部分,及混淆矩阵如图所示:

    • [外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-gfEIqQmS-1650186682956)(2018_Semantic%20SLAM%20Based%20on%20Object%20Detection%20and%20Improved%20Octomap_math.assets/image-20220414144210932.png)]
  • p i , j p_{i,j} pi,j​ 的计算:

  • p i , j = n i , j n j p_{i, j}=\frac{n_{i, j}}{n_{j}} pi,j=njni,j

    • n j n_j nj : 表示对象$j $ 的频率
    • n i , j n_{i,j} ni,j : 对象 i i i 和对象 j j j 同时出现的频率
  • “After we get the unary and pairwise potentials, we can use mean fields method to optimize the CRF model. Based on this method, we not only utilize the YOLO object detection, but also integrate the object context information to refine the final object confidence score”

    • 在得到一元势和两两势后,我们可以用平均场法对CRF模型进行优化。在此基础上,我们不仅利用了YOLO目标检测方法,还结合了目标的上下文信息对最终的目标置信度得分进行了精化

“D. Temporary Objects Generation”D.临时对象生成

  • “After getting the accurate semantic label, we can filter features and map points according to the semantic label type, thus the effects of dynamic objects can be reduced to the maximum extent.”
    • 在得到准确的语义标签后,根据语义标签类型滤除特征和地图点,从而最大限度地减少动态对象的影响。

“feature filter” 特征滤波器:

首先,将对象分成静态与动态两种类型。然后从上面的算法计算标签和边界框,我们排除了属于动态对象的ORB特征、地图点和DBoW特征,保留了静态对象上的特征

  • “An original image is shown in Fig.5(a); Fig.5(b) shows ORB features extracted from the original image; Fig.5© shows the semantic messages extracted from the original image; Fig.5(d) shows the result of the features filter.”

  • 原始图像如图5(a)所示;图5(b)示出了从原始图像中提取的球体特征;图5©示出了从原始图像提取的语义消息;图5(d)示出了特征滤波器的结果。
    利用语义信息去除动态位于动态兑现上的点,但是从图5(d)中左下角人的裤子上的特征点还是没能去除

  • “After the feature filter process, we generate some temporary objects which contain object size, object type, object confidence scores, and the corresponding point clouds.”

    • 在特征过滤过程之后,我们生成一些临时对象,这些对象包含对象大小、对象类型、对象置信度分数以及相应的点云。
  • “point clouds generated by the RGB-D camera contain some noises.”

    • 然而 由RGB-D摄像机生成的点云包含一些噪声。

    如何去除噪声

  • “In order to remove these noises, we apply statistical calculation to point clouds. If the points deviate from the average, they may be noises, and can therefore be removed.”

    • 为了去除这些噪声,我们对点云应用统计计算。如果这些点偏离平均值,它们可能是噪声,因此可以消除。
  • “in order to save memory, point clouds are down-sampled with 5 mm resolution. When getting the robust temporary objects and point clouds, we use data association to decide whether those temporary objects are new objects or already exist in the map.”

    • 为了节省内存,点云以5毫米分辨率下采样。在获取鲁棒的临时对象和点云时,利用数据关联来判断这些临时对象是新的还是已经存在于地图中。

“E. Data Association” E.数据关联

  • “Data association is very important for robust SLAM. In our semantic SLAM system, data association is used to judge the detected objects. In the proposed method, there are two steps for data association.”
    • 数据关联对于健壮的SLAM至关重要。在我们的语义SLAM系统中,利用数据关联来判断检测到的对象。在所提出的方法中,数据关联分为两个步骤。

数据关联

First

  • “First, we need to find the candidate objects for each temporary objects.”

    • 首先,我们需要为每个临时对象找到候选对象。
  • “Through the relationship between keyframes and map points, we can easily find keyframes which are relative to the current keyframe. These keyframes not only are close to the current keyframe, but also are more likely to contain same objects because they have enough shared map points.”

    • 通过关键帧与映射点之间的关系,可以方便地找到与当前关键帧相对应的关键帧。这些关键帧不仅与当前关键帧接近,而且更有可能包含相同的对象,因为它们有足够的共享地图点
  • “With the relationship between keyframes and objects, the objects seen by these relative keyframes are considered as the candidate objects for every temporary object.”

    • 利用关键帧与对象之间的关系,将这些关键帧所看到的对象作为每个临时对象的候选对象
  • “When KeyFrame4 is inserted, the system detects object3, object4 and object5, which are treated as three temporary objects.”

    • 当插入KeyFrame4时,系统检测到object3、object4和object5,(关键帧4能观测到的对象)它们被视为三个临时对象
      每个关键帧对应的临时对象:
      将当前关键帧能够看到的对象作为临时对象
  • “In order to find the candidate objects for such three temporary objects, we take the following two steps.”

    • 为了找到这三个临时对象的候选对象,我们采取以下两个步骤。
  • “First, we search the relative keyframes for KeyFrame4.”

    • 首先,我们搜索相关的关键帧来查找KeyFrame4。关键帧2和关键帧3都有共同的地图点 ,一个为2个另一个为3个。论文中却只说了关键帧3和关键帧4是相对关键帧 相对关键帧的选取规则不明
      问题:相对关键帧的选取规则好像有点问题,看前面的话,好像是具有共视关系数目最多的作为相对关键帧,但是,根据图7,对象2明显不能被关键帧3和4观测到。若相对关键帧的选取只需要有共视关系的就是的化,那关键帧2也应该是关键帧
  • 其次,推断对象2、对象3和对象4可以被关键帧3和关键帧4观测**(问题:从图中看出对象2不能被关键帧3和4观测)** 要说具有相同地图点的关键帧为相对关键帧(关键帧2、关键帧3、关键帧4为相对关键帧,而不是只有3和4),然后相对关键帧所观测到的对象(2、3、4)都被推断成能被 相应关键帧组 观测到

此处有疑问,感觉作者写的有点问题哈

  • “therefore Object2, Object3 and Object4 are regarded as candidate objects for temporary objects observed by KeyFrame4.”

    • 因此,Object2、Object3和Object4被视为由KeyFrame4观察的临时对象的候选对象。

    [外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-6oR7NIT9-1650186682956)(2018_Semantic%20SLAM%20Based%20on%20Object%20Detection%20and%20Improved%20Octomap_note.assets/image-20220417161153514.png)]

  • “With the relationship between keyframes and objects, we can avoid some undesirable situation caused by moving objects.”

    • 利用关键帧与对象之间的关系,可以避免由于对象的移动而引起的一些不良情况。
  • “In the third keyframe, it generates a temporary object whose label is TV monitor.”

    • 在第三个关键帧中,它生成一个标签为电视监视器的临时对象。
    • [外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-1UxhYgTY-1650186682957)(2018_Semantic%20SLAM%20Based%20on%20Object%20Detection%20and%20Improved%20Octomap_note.assets/image-20220417161345794.png)]
  • “According to the relationship between keyframes and objects, we can easily find that the TV monitor on the first keyframe is a candidate object for the temporary object observed in the third keyframe.”

    • 根据关键帧与对象的关系,我们可以很容易地发现第一个关键帧上的电视监视器是第三个关键帧中观察到的临时对象的候选对象。
  • “The first keyframe and the third keyframe have a lot of same map points, which reveals that the first keyframe is one of the relative keyframes for the third keyframe.”

    • 第一关键帧和第三关键帧具有许多相同的映射点,这表明第一关键帧是第三关键帧的相对关键帧之一。
  • “If we use the object seen by the last keyframe (it is the second keyframe in this condition) as candidate objects, the temporary object (TV monitor in the third keyframe) will be regarded as a new object by data association because the last keyframe does not contain the white TV monitor.”

    • 如果我们使用上一个关键帧看到的对象(在这种情况下是第二个关键帧)作为候选对象,由于上一个关键帧不包含白色电视监视器,所以临时对象(第三个关键帧中的电视监视器)将通过数据关联被视为新对象。注:插入新关键帧的规则

“Second”

  • “among the candidate objects, we need to select which one is most similar to the temporary object.”

    • 多个候选对象选最相似的那一个。
  • “we perform a nearest neighbor search between 3D points in candidate and temporary objects, and calculate the Euclidean Distance between the matched point pairs.”

    • 我们在候选对象和临时对象中的三维点之间进行最近邻搜索,并计算匹配点对之间的欧几里得距离。
    • 注:选取规则
  • “k-d tree is used to accelerate the matching process.”

    • 采用k-d树加速匹配过程。
  • “According to the matched point pairs, scores between candidate and temporary objects can be calculated.”

    • 根据匹配的点对,可以计算候选对象和临时对象之间的分数。

    • 候选对象和临时对象间的打分公式:

    • S = M N \mathbf{S} = \frac{M}{N} S=NM

      • M : distance ≤ 2 c m M:\text{distance} \leq 2cm Mdistance2cm的匹配点数
      • N N N:来自临时对象的点数
  • “A candidate object with the highest score which is also higher than the threshold, is selected as the associated object.”

    • 选择具有最高分且也高于阈值的候选对象作为关联对象。
  • “If all the objects do not fulfill real-time requirements, the temporary object is considered as a new object which can be inserted into the SLAM system.”

    • 如果所有的对象都不能满足实时性要求,则将临时对象作为新的对象插入到SLAM系统中。
      理解:一个对象没有临时对象和候选对象的化,就将其作为新对象插入SLAM系统中注:新关键帧插入规则

“F. Object Model Update”F.对象模型更新

  • “When we find the correspondence between candidate and temporary objects, the point clouds and confidence scores associated with them should be fused together.”

    • 当我们发现候选对象和临时对象之间的对应关系时,将与它们相关的点云和置信度分数融合在一起。
  • YOLO中的对象检测中,输出RGB图像到RCNN框架,给定第 k k k张图像的数据 I k I_k IkYOLO的输出可以用一种简化的方式解释为类标签上对象的独立概率分布

  • 例如 P ( O u ) = l i ∣ I k P(O_{u}) = l_i|I_k P(Ou)=liIk u u u表示检测到的对象, l i l_i li表示第 i i i 个类标签

  • 这使得我们能够通过递归贝叶斯更新方式更新可见集 V k ∈ M V_k \in M VkM​ 中的所有对象和相应的概率分布:

  • P ( l i ∣ I 1 , … , k ) = 1 Z P ( l i ∣ I 1 , … , k − 1 ) P ( O u = l i ∣ I k ) P\left(l_{i} \mid I_{1, \ldots, k}\right)=\frac{1}{Z} P\left(l_{i} \mid I_{1, \ldots, k-1}\right) P\left(O_{u}=l_{i} \mid I_{k}\right) P(liI1,,k)=Z1P(liI1,,k1)P(Ou=liIk)

  • 方程 ( 9 ) (9) (9) 应用于每个对象的所有标签概率,且最后用常数 Z Z Z进行归一化得到适当的分布

G. Map Generation

又比较了一番点云地图和Octomap

  • “In our system, 3D point clouds are stored in every keyframe, and the segmented 3D point clouds are also stored in the corresponding object.”

    • 在我们的系统中,三维点云存储在每个关键帧中,分割后的三维点云也存储在相应的对象中。
  • “the map based on point clouds can be generated by projecting the stored 3D points according to the associated poses.”

    • 基于点云的地图可以通过将存储的三维点根据相关联的姿态进行投影来生成。
  • “However, the map based on point clouds is useless for advanced mission such as path planning or grasp point selection,”

    • 但基于点云的地图对于路径规划或抓取点选择等高级任务毫无用处,注:前边说了好多次了
  • “because point clouds do not use any structures to store each points, which is bad for searching, and each point has no volume information, which makes collision detection and 2D maps generation easily fail.”

    • 由于点云不使用任何结构来存储每个点,不利于搜索,并且每个点没有体信息,容易导致碰撞检测和二维地图生成失败。
      没有体信息,容易导致碰撞检测和二位图像生成失败,注:我的理解是不能确定物体的边界,生成的地图混叠
  • “point clouds cannot distinguish the unknown area, the empty area, and cannot eliminate noise.”

    • 点云不能区分未知区域、空区,不能消除噪声。
  • “Octomap” Octmap 使用Octree来存储点云,当插入一个新点后,它能够区分未知和空区域,还能降低噪点。

  • Octomap 是一个基于八叉树的概率3D建图框架,实现中,八叉树的根节点代表整个空间,八个子节点代表八个小空间。八叉树的叶节点代表空间中最小的分辨率体素

  • Octomap在实现过程中,移动物体和距离测量中的误差会导致地图中出现大量噪声。Octomap使用概率模型来解决这个问题,每个叶节点都存储其被占用或空闲的概率。当插入一个新的3D点时,其对应的叶节点以以下方式更新其概率:

  • L ( n ∣ z 1 : t ) = L ( n ∣ z 1 : t − 1 ) + L ( n ∣ z t ) L\left(n \mid z_{1: t}\right)=L\left(n \mid z_{1: t-1}\right)+L\left(n \mid z_{t}\right) L(nz1:t)=L(nz1:t1)+L(nzt)

  • L ( n ) = log ⁡ [ P ( n ) 1 − P ( n ) ] L(n)=\log \left[\frac{P(n)}{1-P(n)}\right] L(n)=log[1P(n)P(n)]

    • n n n表示叶子节点 n n n
    • Z t Z_t Zt 表示测量值 Z t Z_t Zt
    • P ( n ∣ Z t ) P(n|Z_t) P(nZt) 表示体素 n n n 在给定测量值 Z t Z_t Zt 下被占用的概率 注:前文中没有出现这个公式,是不是写错了? L ( n ∣ z t ) L(n|z_t) L(nzt) ?

这里有个疑问,P(n|Z_t) 前文中并没有出现该公式,写错了吗? L ( n ∣ z t ) ? L(n|z_t)? L(nzt)

  • Octomap可以生成基于体素或像素的3D或2D地图Octomap中的每个体术不仅可以存储其被占用或空闲的概率,还可以存储固定数量的类标签和置信度。

  • 最初的Octomap通过两步来生成地图
    第一步:
    计算空体素从相机到被占体素的位置。问题: 计算空体素太费时
    作者对此过程进行了优化

  • 最初的Octomap通过两步来生成地图
    第一步
    计算空体素从相机到被占体素的位置。问题: 计算空体素太费时
    作者对此过程进行了优化: 考虑二维平面的例子 注:加速算法

  • [外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-C4Bac57W-1650186682957)(2018_Semantic%20SLAM%20Based%20on%20Object%20Detection%20and%20Improved%20Octomap_math.assets/image-20220414211058962.png)]

  • 在图10中,有一个 ( X s , Y s ) (X_s,Y_s) (Xs,Ys) 指向 ( X e , Y e ) (X_e, Y_e) (Xe,Ye)​的直线,斜率:

  • K = Y e − Y s X e − X s K=\frac{Y_{e}-Y_{s}}{X_{e}-X_{s}} K=XeXsYeYs

  • 网格 ( X , Y ) (X,Y) (X,Y)​被定义为: ( X , Y ) (X,Y) (X,Y)网格数坐标

  • X = X S V , Y = Y S V X=\frac{X_{S}}{V}, Y=\frac{Y_{S}}{V} X=VXS,Y=VYS

    • V V V 是网格的尺寸
  • 网格中的偏移量被定义为:偏离格线距离

  • X 0 = X S % V , Y 0 = Y S % V X_{0}=X_{S} \% V, Y_{0}=Y_{S} \% V X0=XS%V,Y0=YS%V

  • 最后可以通过下式计算出 D D D: 通过斜率计算得出

  • D = K × ( V − X 0 ) + Y 0 D=K \times\left(V-X_{0}\right)+Y_{0} D=K×(VX0)+Y0

  • 如果 D < V D < V D<V, 意味着 n < N n < N n<N. 因此下一个空体素是 ( X + 1 , Y ) (X+1, Y) (X+1,Y) 注:已验证过

  • 此外还有三种情况,如图11:[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-bjSUg8Jq-1650186682958)(2018_Semantic%20SLAM%20Based%20on%20Object%20Detection%20and%20Improved%20Octomap_math.assets/image-20220414215123548.png)]

    • ( a ) (a) (a):可以交换起始点和结束点同上边的计算得出空体素的位置

    • ( b ) (b) (b): 可以通过图12的方式将 L 1 L1 L1转换到 L 2 L2 L2来计算空体素的位置 L L L对称 线 L L L通过式16确定(中点)

      • Y = Y s + Y e 2 Y=\frac{Y_{s}+Y_{e}}{2} Y=2Ys+Ye

      • [外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-N233XuPP-1650186682958)(2018_Semantic%20SLAM%20Based%20on%20Object%20Detection%20and%20Improved%20Octomap_math.assets/image-20220414215925713.png)]

    • ( c ) (c) (c): 交换 ( b ) (b) (b)中两点可得

  • 第二步

    将所有新体素插入到一个Octomap中,Octomap通过Octomap结构为每个新体素定位到相应的叶子节点

  • “probability of corresponding leaf node is updated with the new voxel.”

    • 用新的体素更新对应叶节点的概率。
  • “However, both of these two steps use a single thread to process. Although the first step can use OpenMP to create multi threads to compute the empty voxels, we find that it runs slower than the one with a single thread because too many threads are created. In order to accelerate mapping with multi-threads, three strategies are proposed.”

    • 但是,这两个步骤都使用单个线程进行处理。虽然第一步可以使用**OpenMP(注:我没用过这个库)创建多个线程来计算空体素,但我们发现它比单线程运行得慢,因为创建的线程太多了。(注:线程数是否超过核心数?若线程数超过核心数,系统会自动调度,速度不会慢的,只有当线程数等与核心数,根据木桶原理,速度取决于最慢的那个线程,但这样还是应该快于单线程啊????疑惑?? 作者也没给出原因)**为了加速多线程映射,提出了三种策略。
  • “First, we use a thread Pool to take the place of OpenMP. The second step is modified to use eight threads to insert voxels into an Octomap. And the third is that the whole architecture can be accelerated by using a producer-consumer model.”

    • 首先,我们使用线程池来代替OpenMP。第二步被修改为使用八个线程将体素插入到一个八分图中。第三,整个体系结构可以通过使用生产者-消费者模型来加速。注:TBB挺不错

“V. Evaluation” V.评价

  • “we give detailed evaluation of the proposed semantic SLAM system in three different ways. First, we use TUM dataset to evaluate the accuracy of tracking. Second, PASCAL dataset is used to evaluate the accuracy of detection. Third, we compare the efficiency between improved and original Octomap.”
    • 从三个不同的角度对所提出的语义SLAM系统进行了详细的评价。
    • 首先,我们使用TUM数据集来评估跟踪的准确性。
    • 其次,利用PASCAL数据集对检测的准确性进行评估。
    • 最后,比较了改进后的Octomap与原Octomap的效率。

“Tracking”跟踪

写了一堆关于这个数据集的东西

  • “TUM dataset is an excellent dataset to evaluate the accuracy of camera localization as it provides accurate ground truth for the sequences.” TUM数据集为摄像机定位提供了准确的地面真值,是评价摄像机定位精度的一种很好的数据集。

  • “It contains seven kinds of sequences recorded by a RGB-D camera at 30fps and a resolution of 640 x 480.” 它由RGB-D摄像机以30fps、640×480的分辨率记录了7种序列。

  • “We only use Handheld SLAM sequence, Robot SLAM sequence, Structure vs. Texture sequence and Dynamic Objects sequence among the seven sequences because these four sequences represent most of the scenes of every day.” 在这七个序列中,我们只使用手持SLAM序列、机器人SLAM序列、结构纹理序列和动态对象序列,因为这四个序列代表了每天的大部分场景。

  • “these sequences contain different kinds of objects, which can ensure more semantic information than other kinds of sequences.” 这些序列包含不同种类的对象,可以保证比其他种类的序列更多的语义信息。

  • “Handheld SLAM sequence is recorded by hands, and therefore it has complex and unstable trajectories, while Robot SLAM sequence is recorded by real robots, hence it has stable and simple trajectories.” 手持SLAM序列由手记录,轨迹复杂且不稳定,而机器人SLAM序列由真实机器人记录,轨迹稳定且简单。

  • “But all of them are experimented in the scene without dynamic objects.” 但它们都是在没有动态物体的场景中进行实验的。

  • “Furthermore, Dynamic Objects sequence is recorded by hands, however when the recordings have some dynamic objects, the trajectory is found to be unstable in such scenes.” 此外,动态目标序列是由人工记录的,但当记录中有一些动态目标时,在这种场景中会发现轨迹不稳定。

  • “For comparison, we use different RGB-D SLAM in the benchmark.” 为了进行比较,我们在基准测试中使用了不同的RGB-D SLAM。

  • “Each sequence is processed 5 times, and we use RMSE to judge its localization Accuracy” 对每个序列进行5次处理,利用RMSE判断其定位精度

  • 使用均方根误差 R M S E RMSE RMSE​ :

  • R M S E = ∑ i = 1 n ( X o b s , i − X model,  i ) 2 n R M S E=\sqrt{\frac{\sum_{i=1}^{n}\left(X_{o b s, i}-X_{\text {model, } i}\right)^{2}}{n}} RMSE=ni=1n(Xobs,iXmodel, i)2

    • n n n 表示观测次数
    • i i i 表示第 i i i 次观测
    • X o b s , i X_{obs,i} Xobs,i 是第 i i i 次观测的地面真值
    • X m o d e l , i X_{model,i} Xmodel,i 是第 i i i 次观测值的计算结果

“B. Localization Accuracy with Improved SLAM”B.改进的SLAM定位精度

  • “First, we verify the localization accuracy with Improved SLAM.” 首先,我们用改进的SLAM来验证定位精度。

  • “For comparison we have executed ORB-SLAM2 with a RGB-D camera in the benchmark.” 为了进行比较,我们在基准测试中使用RGB-D相机执行了ORB-SLAM2。

  • “Table I shows the median RMSE error of the benchmark.”表一显示了基准的中值RMSE误差。

  • “It can be seen that the RMSE error of Improved SLAM is little worse than the error of ORB-SLAM2 in static environment, which means the accuracy of our SLAM is close to ORB-SLAM2.” 可以看出,在静态环境下,改进后的SLAM的RMSE误差比ORB-SLAM2的误差大一点,表明改进后的SLAM的精度接近ORB-SLAM2。注:相差不大

  • “The reason of worse RMSE error is that when tracking reference frames, the number of common features that the last reference frame and the current keyframe have is larger than that the last keyframe and the current keyframe have.” **注分析原因:**RMSE误差较大的原因是在跟踪参考帧时,最后一个参考帧和当前关键帧所具有的共同特征数大于最后一个关键帧和当前关键帧所具有的共同特征数。

  • “Because of the number of common features, SLAM tracking by reference frames can provide more accurate localization in the static environment.” 在静态环境下,基于参考帧的SLAM跟踪可以提供更精确的定位,因为它具有大量的公共特征。
    是ORB-SLAM2使用了参考关键帧, 能提供更精确的信息。

  • “However, in the scene with dynamic objects, our SLAM is better than ORB-SLAM2. This is because our SLAM tracks by keyframes instead of reference frames, when a big moving object passes through the camera, the trajectory is easy to follow the moving object if SLAM tracks by reference frames.”但是,在有动态物体的场景中,我们的SLAM比ORB-SLAM2要好。这是因为我们的SLAM是通过关键帧而不是参考帧跟踪的,当一个大的运动物体通过摄像机时,如果SLAM是通过参考帧跟踪的,那么轨迹就容易跟随运动物体。

  • “Since a big moving object contains a lot of features, and SLAM are based on the assumption that the number of features in moving objects is much smaller than the number of features in static objects, so in this case, original ORB-SLAM2 considers the big moving object to be static.” 由于一个大的运动物体包含大量的特征,而SLAM是基于运动物体中的特征数目远小于静态物体中的特征数目的假设,因此在这种情况下,原始的ORB-SLAM2将大的运动物体视为静态的。
    原因是较大物体的上的参考关键帧数目会大于静态环境,使得ORB-SLAM2误认为较大移动物体是静态物体,其实它不是
    注:只使用关键帧和使用参考关键帧的不同方法有点鱼与熊掌的感觉,目前的理解是 针对不同环境的需求使用不同的方案

  • [外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-pWPtmZNJ-1650186682959)(2018_Semantic%20SLAM%20Based%20on%20Object%20Detection%20and%20Improved%20Octomap_note.assets/image-20220417164632108.png)]

“C. Localization Accuracy with Semantics” C.语义定位精度

  • “we verify the improvements of our semantic SLAM. For comparison we have also executed RGB-D SLAM[6] in the benchmark.”我们验证了语义SLAM的改进。为了进行比较,我们还在基准测试中执行了RGB-D SLAM[6]。
    评估语义SLAM ,用RGB-D SLAM作对照组

  • “Fig.13 shows the comparisons of trajectories calculated by semantic SLAM, with the trajectories calculated by ORB-SLAM2 and RGBD-SLAM. As we observe, our trajectories are similar to the trajectories calculated by ORB-SLAM2 in the scenes without dynamic objects.”图13示出了由语义SLAM计算的轨迹与由ORB-SLAM2和rgbd-slam计算的轨迹的比较。正如我们所观察到的,我们的轨迹与ORB-SLAM2在没有动态物体的场景中计算的轨迹相似。

  • 基线的中值均方根误差:
    静态环境中,我们的语义SLAM的RMSE比ORB-SLAM差一点,但是很接近了
    引起RMSE比ORB-SLAM差一点的原因是:局部建图线程需要更多时间来检测对象,需要更多时间来处理关键帧,因此一些由跟踪线程生成的关键帧可能会被丢弃。(为什么?有点不理解,他们是独立的线程啊?跟线程有什么关系?应该是没有使用参考帧的缘故吧

动态环境中,语义SLAM比ORB-SLAM2好很多
主要原因:
一、语义SLAM保留了静态对象上的特征,为每个关键帧剔除了可能属于动态物体的动态特征。
二、由于每一帧都是按关键帧追踪,可以提供更稳定的特征,因此在有动态物体的场景中,定位可以更准确(第二个原因好像有点问题

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-53pMEoBl-1650186682959)(2018_Semantic%20SLAM%20Based%20on%20Object%20Detection%20and%20Improved%20Octomap_note.assets/image-20220415084730463.png)]

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-2LltWagF-1650186682959)(2018_Semantic%20SLAM%20Based%20on%20Object%20Detection%20and%20Improved%20Octomap_note.assets/image-20220415084751642.png)]

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-GoC4WVC2-1650186682960)(2018_Semantic%20SLAM%20Based%20on%20Object%20Detection%20and%20Improved%20Octomap_note.assets/image-20220415084812189.png)]

D.Detection

  • 为展示我们面向对象的语义建图系统的能力,使用室内RGB-D系列来生成每个环境的全局地图,然后比较每个类别的地图中记录的对象数量。结果如图14, 能够识别场景中的大多数对象,但还是有些对象会缺失,两台显示器,只检测到了一台
    原因:它们靠的太近了。在一些场景中,它会将一台显示器和一个键盘识别为一台笔记本电脑,因为它们在场景中靠得很近。

  • 还有一个问题是:YOLO是以边界框的形式给出结果,所以每个对象都会包含背景的一部分

  • [外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-pEAvC5o9-1650186682960)(2018_Semantic%20SLAM%20Based%20on%20Object%20Detection%20and%20Improved%20Octomap_note.assets/image-20220415085755988-16501860768701.png)]

E. Mapping

  • 将检测到的对象转化为Octomap的结果如图15所示

  • [外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-T1JKMcIz-1650186682961)(2018_Semantic%20SLAM%20Based%20on%20Object%20Detection%20and%20Improved%20Octomap_note.assets/image-20220415090530233.png)]

  • 在语义SLAM中结合RGB和深度图像来生成点云,并获得每个点云的位置,因此点云及其位置可用于通过Octomap创建地图

  • [外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-0uzZfYNG-1650186682961)(2018_Semantic%20SLAM%20Based%20on%20Object%20Detection%20and%20Improved%20Octomap_note.assets/image-20220415091219506.png)]

  • 比较改进后的Octomap和原始的Octomap 选用TUM数据集构建多线程地图,结果如图16
    计算空体素的时间如图17所示

  • [外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-fcpE4MHQ-1650186682961)(2018_Semantic%20SLAM%20Based%20on%20Object%20Detection%20and%20Improved%20Octomap_note.assets/image-20220415091338324-16501862346123.png)]

你可能感兴趣的:(SLAM,机器学习,深度学习,其他)