(2022/4/15, 9:14:24 AM)
An RGB-D SLAM framework. “The system combines object detection module that is realized by the deep-learning method, and localization module with RGB-D SLAM seamlessly.” The deep-learning object-detection module and the RGB-D SLAM localization module are combined seamlessly. “The two modules are integrated together to obtain the semantic maps of the environment”
“to improve the computational efficiency of the framework, an improved Octomap based on Fast Line Rasterization Algorithm is constructed.” An improved Octomap. “for the sake of accuracy and robustness of the semantic map, Conditional Random Field (CRF) is employed to do the optimization.” A Conditional Random Field (CRF) is used for the optimization. “we evaluate our Semantic SLAM through three different tasks, i.e. Localization, Object Detection and Mapping.”
“With the improved Octomap, the proposed Semantic SLAM is 66.5% faster than the original RGB-D SLAM.”
“most recent researches in the field of SLAM only focus on the geometric mapping, instead of both geometric and semantic mapping.”
“maps built by SLAM only can tell us where obstacles are and cannot supply semantic meaning.”
“In such conditions, it is difficult to let the robot do some high level tasks”
“the map created by traditional SLAM can only be useful in simple missions, such as navigation and path planning. Obviously, it cannot meet our expected intelligent demands.”
“present a RGB-D semantic SLAM framework, which not only construct the semantic maps based on the geometric SLAM, but also improve the localization accuracy according to the semantic maps.” An RGB-D semantic SLAM framework that builds semantic maps on top of geometric SLAM and uses the semantic maps to improve localization accuracy. “one module is the RGB-D SLAM based on sparse feature, which provides information about the locations of objects and also builds the 3D map.”
A sparse-feature RGB-D SLAM, which provides object locations and builds the 3D map. “The other one is object detection realized by deep learning method.”
“According to results of these two modules, we design the integrated RGB-D semantic framework, which provides the semantic map, and improves the localization accuracy.” An integrated RGB-D semantic framework that provides the semantic map and improves localization accuracy. “our system creates a point clouds map of an environment with semantic meanings, which contains separate object models with semantic and geometric information.”
“Our map not only maintains 3D point clouds by projecting semantic messages to 3D models, but also separates object entities independently.”
“it can provide more advanced understanding of environment”
“The proposed system can detect and classify 80-200 object classes using deep-learning based detection algorithm, while the existing semantic mapping systems [42]-[43] can only detect less than 20 classes.”
“when our system builds maps, it can create 3D object models without requiring a-priori known 3D models.”
“Because 3D object entities of a semantic class, such as cup, have many kinds of shapes, it can limit the environment understanding as the robot needs to know the 3D object model of an object before identification.”
“our system mainly focuses on object-level entities, however, some other semantic segmentation methods, such as [1][2], focus only on pixel-level entities. Maps generated by such methods are less usable, because in this condition objects are modeled offline and maintained all the time.”
“Nowadays, SLAM has reached a level of maturity where maps can be built nearly in real time.”
Today, SLAM has matured to the point where maps can be built almost in real time.
Semantics can make robots more intelligent and capable of higher-level tasks.
Feature-based methods: the traditional approach, which extracts sparse points from images, matches adjacent frames, and recovers the camera pose.
“KinectFusion[5] uses RGBD camera to generate dense point clouds, recovers camera poses and scene structure with ICP algorithm, and accelerates tracking by CUDA.”
It recovers camera pose and scene structure with the ICP algorithm and accelerates tracking with CUDA (increasingly common in high-performance computing). “RGBD_SLAM[6] can also generate dense point clouds, it tracks ORB features and optimizes camera poses by G2O algorithm. However, the algorithm in RGBD_SLAM uses every frames to optimize camera poses instead of KeyFrames, and therefore it is computationally inefficient.”
It tracks ORB features (note: ORB features were already being used here) and optimizes camera poses with G2O. However, RGBD_SLAM optimizes the camera pose with every frame instead of keyframes, so it is computationally inefficient. “PTAM[7] utilizes KeyFrames to optimize camera poses, it therefore works fast and stable, but it lacks loop closure detection, relocalization and auto initialization, and it can only generate sparse point clouds.”
“ORB-SLAM[8], which is the state of the art in this field, not only supports RGBD camera, stereo camera and mono camera, but also contains loop-closing, relocalization, and auto initialization. It can work well both in small and large scale environments.”
Direct methods: the camera pose is estimated directly from image intensities (under the brightness-constancy assumption).
“DSO_SLAM[9] uses Direct Methods to estimate poses and maintains 5 to 7 keyframes through sliding windows, but it lacks loop closure, which leads to more errors over time.”
“LSD_SLAM[10] can generate semi-dense depth image, and it is used to match next frame in order to estimate camera poses. However, it is sensitive to light change.”
“SVO_SLAM[11] belongs to half Direct Methods, because only sparse model-based image alignment uses Direct Methods, while pose estimation and bundle adjustment depend on features matching.”
“However, some researches based on deep learning can detect many objects, even the objects belonging to the same class but having different shapes.”
“we are interested in the task of object detection”
“This network extracts features through AlexNet[38] and realizes classification by SVM[36], but it takes several seconds to process one image.”
“In order to improve R-CNN, Fast R-CNN[14] maps feature map to feature vector, and it is used as an input to fully connected layer by ROI-pooling, and replaces SVM with softmax.”
Links to read more carefully later:
- ROI Pooling: “ROI Pooling (region-of-interest pooling)”, an article by 刘下 on Zhihu — https://zhuanlan.zhihu.com/p/65423423
- SVM: “[Machine Learning] Support Vector Machines, in great detail”, an article by 阿泽 on Zhihu — https://zhuanlan.zhihu.com/p/77750026
- AlexNet: “Deep-learning CNNs — AlexNet”, an article by Adia on Zhihu — https://zhuanlan.zhihu.com/p/42914388
- softmax: https://blog.csdn.net/bitcarmanlee/article/details/82320853
“It is therefore faster than R-CNN, but it is still too slow for a real-time requirement in SLAM. Faster R-CNN[35] utilizes the Region Proposal Network (RPN) to generate object proposals and adds anchor and shared features to promote the speed of detection, its speed can reach up to 5fps.”
“Yolo”: a fast object-detection algorithm that replaces object proposals with an S x S grid and obtains the final detections from the classification of these grid cells. Note: YOLO is used here to detect dynamic objects; feature points lying on dynamic objects are removed directly later on.
“Semantic SLAM is used to calculate the motion and position, and object detection and semantic segmentation are utilized to generate semantic map. Semantic SLAM can be categorized into two types based on the object detection methods.”
“The first type uses traditional methods to detect object. Real-time Monocular Object SLAM[17] is the most common one, which employs Bags of Binary Words and a database with 500 3D object models to provide a real-time detection. But it limits a lot because 3D object entities of a semantic class like cup having many different kinds of shapes.”
“[18] generates object proposals through multi-view images, then extracts dense SIFT descriptors from these proposals and predicts their classes. [19] employs DPM[12], in which Hog feature is used to describe the object.”
“The other kind of SLAM is using deep-learning methods to do the object recognition, such as method proposed in [20], however, the semantic information is built based on pixels instead of object entities.”
“In fact, this approach is too complex and not practical due to two reasons: (1) robot wants to understand the major semantic meaning of the environment in mission execution, which means it does not care about every pixel’s semantic information, (2) computational speed is not sufficient to perform pixel level semantic classification in robot SLAM system.”
“Point clouds store large number of points and consume a lot of memory.”
“it cannot easily differentiate between cluttered and free spaces.”
“Elevation maps and multi-level surface maps cannot represent unmapped areas, although they are efficient. More importantly, these methods can not represent arbitrary 3D environments.”
“Octomap[23] is adopted which is used widely in the field of mapping. OctoMap has advantages of taking measurement uncertainty into account, being space efficient and implicitly representing free and occupied space.”
“it still takes too much time to build the maps.”
The transition begins: the shortcomings of current SLAM in dynamic environments.
“the geometric aspect of the SLAM problem is well understood, and has reached a level of maturity where city level maps can be built precisely and even in real time.”
“But they can only work well in static environments or the one with small dynamic objects.”
“In the scene with small dynamic objects, as only few feature points are situated at dynamic objects, the SLAM can therefore still work well.”
“Feature-Based SLAM is easy to be effected by large moving objects.”
“most Feature-Based SLAM systems are built based on a strong assumption that the number of features on moving objects is much smaller than those on static objects.”
I.e., built on the assumption that dynamic objects carry far fewer features than static ones.
Question: what is the criterion separating dynamic from static objects? A classic case is a huge but slowly moving object that is frequently classified as static, when in fact it is not. This is why the authors later modify ORB-SLAM2: instead of tracking against the reference frame, they track directly against the current keyframe (sacrificing some mapping accuracy for better recognition of dynamic objects). A trade-off in the end.
“Octomap is based on OcTree structure which is good for searching and building, while point clouds only store each points without any structures. Octomap can carry not only the RGB and position information but also the semantic messages.”
“the point clouds only store the original messages from RGB-D camera.”
Welcome to ORB-SLAM2.
“Tracking thread is in charge of localizing the camera with every frame in real time and deciding when to insert a keyframe. In tracking thread, it performs an initial feature matching with the previous frame and optimizes the pose by Bundle Adjustment (BA) algorithm. If tracking is lost, it performs a global re-localization with Bag of Word, then searches map points by re-projection and optimizes the pose with local map points. Finally, the tracking thread can decide if a new keyframe can be generated.”
“the Local Mapping thread will triangulate new map points through its relative keyframes. Then, it optimizes the pose of relative keyframes and map points with BA. Finally, redundant keyframes and low quality map points are removed.”
“The Loop Closing thread responses to loop closure with every keyframe. If a loop is detected, the similarity transformation is computed which represents the drift accumulated in the loop.”
“Although ORB-SLAM2 is a very practical algorithm, it still faces some questions, such as how to work well in dynamic environments, how to supply semantic information and maps and so on.”
Note: the protagonist of this paper makes its entrance.
“In the proposed Semantic SLAM system, ORB-SLAM2 is in charge of camera localization and mapping with every RGB-D frames.”
“Tracking thread is responsible for tracking by keyframes instead of reference frames, in order to decrease the effect of moving objects.”
“Local Mapping thread adds a few keyframes to create semantic messages, because semantic messages extraction cannot fulfill the requirement of real-time performance.”
“After getting keyframes from ORB-SLAM2, YOLO[15] is used to detect objects in each keyframe to get semantic message. In our implementation, we use the tiny-weight version to detect objects, because this version is trained on MS-COCO Dataset, which contains 80 different kinds of objects.”
“object regularization based on CRF is used to correct the probabilities of each object computed by YOLO.”
“constraints between objects are computed according to the statistics of MS-COCO Dataset, and it is then used to optimize the object probabilities computed by YOLO detection.”
“When accurate labels of each object are captured, filter process is used to provide more stable features and remove the unstable features which are always locate on the moving objects. At the same time, the temporary objects are created, which contain point clouds produced by projection.”
“we use data association module to decide either to create a new object or associate it with existing object in the map according to the matching score.”
“in order to find correspondence between existed objects and temporary objects, we first build relationship between keyframes and objects.”
“Kd-Tree structure is used to accelerate the computation of matching score.”
“When the existing objects can be combined with the temporary objects, the former can be updated with the new detection by a recursive Bayesian process.”
“Map Generation uses point clouds stored in objects to generate map based on Octomap, which is accelerated by multi-threads realization and Fast Line Rasterization algorithm.”
“In order to integrate the concept of semantic into the framework of ORB_SLAM2, we construct relationship between keyframes and objects by referring to the implementation method between keyframes and map points, which has existed in ORB-SLAM2.”
“In ORB_SLAM2, each keyframe stores map points that it has observed in the frame image, at the same time, each map point records the keyframes which have observed the map point sequentially.”
“we can build relationship between keyframes and perform some optimization, such as analyzing whether a keyframe is redundant or deciding whether a map point has high quality.”
“we build the relationship between keyframe and each object as followings.”
“In our realization, each object $O_i$ contains:
- World coordinates of each point cloud that are located on the object.
- A fixed number of class labels and the corresponding confidence score which is calculated through a recursive Bayesian update.
- Keyframes which can observe this object.
- Kd-tree structure generated through the object’s point clouds, which is used for fast search.
- The class label which this object belongs to.
- The number of observations.”
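As a concrete reading of that list, here is a minimal sketch of the per-object record in Python; the field names are mine, not the paper's:

```python
from dataclasses import dataclass, field

import numpy as np
from scipy.spatial import cKDTree


@dataclass
class MapObject:
    """Sketch of one object O_i in the object database (field names are mine)."""
    points_world: np.ndarray                     # world coordinates of the object's point cloud (Nx3)
    class_scores: np.ndarray                     # fixed number of label confidences, updated by recursive Bayes
    keyframes: set = field(default_factory=set)  # ids of keyframes that observe this object
    label: int = -1                              # the class label the object currently belongs to
    n_observations: int = 0                      # how many times the object has been observed
    kdtree: cKDTree = None                       # k-d tree over points_world, used for fast search

    def rebuild_tree(self):
        # Rebuild after the point cloud changes so matching stays fast.
        self.kdtree = cKDTree(self.points_world)
```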
“Each keyframe $K_i$ should store:”
“we create an object database, in which all the detected objects are stored.”
Semantic SLAM based on ORB-SLAM2
“In ORB_SLAM2, Tracking thread localizes the camera with every frame through four steps. First, ORB features are extracted from RGB images. Second, ORB features are used to perform feature matching with the reference frame, preliminarily calculate the camera pose and return the number of matched map points. Third, the camera pose is optimized again with the matched locale map points which are searched through the relative keyframes. Finally, tracking thread decides whether a new keyframe is inserted based on some principles.”
“tracking thread module is modified in the following three ways.”
“In order to reduce the effect of dynamic objects, the second step in tracking thread is changed to track by keyframes instead of reference frames.”
Note: switching from reference frames to keyframes slightly lowers localization and mapping accuracy, but improves adaptability to dynamic objects.
“If SLAM tracks by reference frame, the camera pose calculation can easily be effected by large moving objects.”
“This is because, when a large moving object passes by, original SLAM will track features on the moving object, which affects the tracking accuracy.”
“if SLAM tracks feature by keyframes, it can still calculate the correct camera pose before the new keyframe insertion.”
“The essential reason is that old keyframe doesn’t contain features of the moving object.”
“choose the Levenberg-Marquardt method, from G2O[50] which contains several optimization algorithm to optimize the pose of the current frame.”
“This method needs a good initial estimated value for optimization. Therefore we use the constant velocity motion model to predict the position of the current frames as a G2O initial value before optimization.”
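A minimal sketch of the constant-velocity prediction used to seed the G2O optimization; the 4x4 matrices are assumed to be camera poses in a consistent convention:

```python
import numpy as np


def predict_pose(T_prev: np.ndarray, T_last: np.ndarray) -> np.ndarray:
    """Constant-velocity motion model: assume the motion between the two
    previous frames repeats, and reuse it to predict the current pose.
    The prediction is only an initial value for the Levenberg-Marquardt
    optimization in G2O, which refines it afterwards."""
    velocity = T_last @ np.linalg.inv(T_prev)  # motion from previous to last frame
    return velocity @ T_last                   # apply the same motion once more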
“The third step in tracking process is modified to compare the number of matched inliers with the result of the second step to judge whether the tracked current frame is lost.”
“In ORB-SLAM2, matched inliers are compared with a constant value in the third step. It is easy to lose tracking when the camera moves fast, because the ORB feature points of current frame may only match with the map points observed by the last frame.”
“In this case, the number of the observed map points will be less than the constant value. Therefore, the third step should compare the number of matched inliers with the number of matched inliers computed by the last frame.”
“The second step function computes the number of matched map points between the last keyframe and the current frame, therefore we should compare the number of matched inliers with the result of the second step.”
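The modified test might look like the following sketch; the exact comparison (strict, or with a ratio) is not spelled out in the notes, so the ratio here is an assumption:

```python
def is_tracking_lost(inliers_step3: int, inliers_step2: int, ratio: float = 0.5) -> bool:
    """Compare the inliers of step 3 against the matches found in step 2
    (last keyframe vs. current frame) instead of against a fixed constant,
    so that fast camera motion does not immediately count as lost."""
    return inliers_step3 < ratio * inliers_step2
```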
“Tracking thread is changed to create more keyframes.”
“In semantic SLAM, individual objects are important entities which not only can supply semantic information for the map, but also can enhance the localization accuracy”
“In deeplearning area, DeepLab[39] and FCN[40] can provide pixel level semantic segmentation, however, RCNN, Fast-RCNN and Faster-RCNN can supply the object level’s bounding box detection. Furthermore, they may generate too many object proposals which cause detection of same region multiple times.”
“Thus, many of them are too slow, and they cannot satisfy the real-time requirement when they are integrated with SLAM.”
“YOLO can process 45 images per second, therefore we choose YOLO as the object detection method to generate a number of object proposals in the form of bounding boxes for every keyframe in our proposed semantic SLAM system.”
“We use the network trained on the COCO dataset instead of PASCAL VOC, because COCO dataset contains 80 types of objects, while PASCAL VOC dataset only has 20 kinds of objects.”
“After some experiments, we find that normal YOLO still takes about 0.04s per images which cannot fulfill realtime requirement. However, Tiny YOLO weights can fulfill all these requirements.”
“we use Tiny YOLO weights instead of normal YOLO weights. In the implementation, YOLO detects the new keyframe in Local Mapping module after the new keyframe is added and wrong map points are culled sequentially.”
“Although we can get the semantic label for each object through YOLO algorithm, semantic context among objects has not been explicitly incorporated into the object categorization models.”
“Some researchers have found that context information is good for semantic segmentation, however, object context requires access to the referential meaning of the object [25].”
“In other words, when performing the task of object categorization, objects’ category label must be assigned with respect to other objects in the scene, assuming there is more than one object present.”
“In the task of image segmentation, context information is used to optimize the final result with CRF (Conditional Random Filed) algorithm.”
“The main problem that CRF can solve effectively is how to model the class scores calculated by some classifiers and local information of images simultaneously.”
“Then this problem can be treated as a problem of maximizing a posteriori.”
“We can define unary potentials to model the probabilities that each pixel or patches belongs to each category, and pairwise potentials to model relation between two pixels or patches.”
“The frequently used CRF models contain unary potentials, pairwise potentials and some weighted parameters. The pairwise potentials are modeled in 4 or 8 neighbors, like [44]-[46-51], therefore this structure is limited in modeling long distances on the image.”
“Toyoda et al. [32] proposed a fully connected CRF to integrate local information and global information jointly which sets up pairwise potentials on all pairs of pixels or patches in semantic image labeling task. By modeling long range interactions, dense CRF provides a more detailed labeling compared to its sparse version. The dense CRF with Gaussian kernel potentials has emerged as a popular framework for semantic image segmentation tasks”
“we construct a probabilistic object-based dense CRF. Compared with CRF based on pixel level, our proposed model can reduce the computational complexity significantly.”
A dense CRF built on object-level probabilities is constructed.
The Gibbs energy function of the object-based model:
$$P(x)=\frac{1}{Z} \exp (-E(x))$$

$$E(x)=\sum_{i} \psi_{u}\left(x_{i}\right)+\sum_{i<j} \psi_{p}\left(x_{i}, x_{j}\right)$$
In the CRF model, the goal is to minimize $E(x)$ to obtain the final label assignment. The unary potential $\psi_{u}$ models each vertex of the CRF; the pairwise potential $\psi_{p}$ models the relations between vertices. (G2O?) The general forms of the unary and pairwise potentials:
$$\psi_{u}\left(x_{i}\right)=-\log P\left(x_{i}\right)$$

$$\psi_{p}\left(x_{i}, x_{j}\right)=\mu\left(x_{i}, x_{j}\right) \sum_{m=1}^{K} \omega_{m} \exp \left(-a_{m}\left(f_{i, j}\right)^{2}\right)$$
$P(x_{i})$: the label probability distribution of the $i$-th object detected by YOLO
$\omega$: the weights of the linear combination
$\mu$: the label compatibility function, describing how likely two different classes are to appear in adjacent positions
The Potts model is the simplest label compatibility function and is the one used in this system:

$$\mu\left(x_{i}, x_{j}\right)=\begin{cases} 0, & \text{if } x_{i}=x_{j} \\ 1, & \text{otherwise} \end{cases}$$
$f_{i,j}$: the constraint between the $i$-th and the $j$-th objects, computed as:

$$f_{i, j}=\frac{1}{p_{i, j}}$$
Statistics are gathered over the COCO dataset: each image shows several kinds of objects at the same time, so a label co-occurrence count matrix is computed from these images. Element $(i, j)$ of the matrix is the number of training images in which an object with label $i$ appears together with an object with label $j$; the diagonal elements are the frequencies of the objects in the training set. Part of the label co-occurrence count matrix, i.e. the confusion matrix, is shown in the figure.
$p_{i,j}$ is computed as:

$$p_{i, j}=\frac{n_{i, j}}{n_{j}}$$
“After we get the unary and pairwise potentials, we can use mean fields method to optimize the CRF model. Based on this method, we not only utilize the YOLO object detection, but also integrate the object context information to refine the final object confidence score”
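Putting the pieces together, a small sketch of mean-field inference for this object-level dense CRF with the Potts compatibility; the weights `w`, `a` and the iteration count are illustrative, not the paper's values:

```python
import numpy as np


def mean_field_crf(unary_probs: np.ndarray, f: np.ndarray,
                   w: float = 1.0, a: float = 1.0, n_iters: int = 10) -> np.ndarray:
    """Mean-field inference for the object-level dense CRF sketched above.

    unary_probs: (N, L) YOLO label distributions P(x_i) for N objects, L labels.
    f:           (N, N) pairwise constraints f_{i,j} = 1 / p_{i,j} from the
                 COCO co-occurrence statistics.
    """
    unary = -np.log(np.clip(unary_probs, 1e-9, 1.0))  # psi_u(x_i) = -log P(x_i)
    kernel = w * np.exp(-a * f ** 2)                  # Gaussian kernel on f_{i,j}
    np.fill_diagonal(kernel, 0.0)                     # no self-interaction
    Q = unary_probs.copy()
    for _ in range(n_iters):
        # Potts model: a pair only pays a penalty when labels differ, so the
        # message for label l sums kernel weights times the probability that
        # the other object is NOT labeled l.
        pairwise = kernel @ (1.0 - Q)
        Q = np.exp(-(unary + pairwise))
        Q /= Q.sum(axis=1, keepdims=True)             # normalize per object
    return Q
```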
“feature filter”:
First, objects are divided into static and dynamic types. Then, using the labels and bounding boxes computed by the algorithm above, the ORB features, map points, and DBoW features belonging to dynamic objects are excluded, and the features on static objects are kept.
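A sketch of that filtering step; the set of classes treated as dynamic is my guess (the notes do not list it), and the boxes come from the YOLO detections:

```python
# Labels treated as dynamic; illustrative choices from the COCO vocabulary.
DYNAMIC_CLASSES = {"person", "car", "bicycle", "dog", "cat"}


def filter_features(keypoints, labels, boxes):
    """Keep only the keypoints that fall outside every dynamic-object box.

    keypoints: list of (u, v) pixel coordinates
    labels:    detected class names, one per bounding box
    boxes:     (x_min, y_min, x_max, y_max) per detection
    """
    dyn_boxes = [b for b, l in zip(boxes, labels) if l in DYNAMIC_CLASSES]

    def is_static(u, v):
        return all(not (x0 <= u <= x1 and y0 <= v <= y1)
                   for x0, y0, x1, y1 in dyn_boxes)

    return [kp for kp in keypoints if is_static(*kp)]
```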
“An original image is shown in Fig.5(a); Fig.5(b) shows ORB features extracted from the original image; Fig.5(c) shows the semantic messages extracted from the original image; Fig.5(d) shows the result of the features filter.”
Semantic information is used to remove points lying on dynamic objects, but in Fig. 5(d) the feature points on the trousers of the person in the lower-left corner still could not be removed.
“After the feature filter process, we generate some temporary objects which contain object size, object type, object confidence scores, and the corresponding point clouds.”
“point clouds generated by the RGB-D camera contain some noises.”
How to remove the noise:
“In order to remove these noises, we apply statistical calculation to point clouds. If the points deviate from the average, they may be noises, and can therefore be removed.”
“in order to save memory, point clouds are down-sampled with 5 mm resolution. When getting the robust temporary objects and point clouds, we use data association to decide whether those temporary objects are new objects or already exist in the map.”
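As a sketch of these two operations, using Open3D as a stand-in (the paper does not name its point-cloud library); the outlier parameters are illustrative, while the 5 mm resolution comes from the text:

```python
import numpy as np
import open3d as o3d


def clean_object_cloud(points: np.ndarray) -> o3d.geometry.PointCloud:
    """Statistical outlier removal followed by 5 mm voxel down-sampling."""
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points)
    # Points whose mean distance to their neighbors deviates too far from the
    # average are treated as noise; 20 neighbors / 2.0 sigma are illustrative.
    pcd, _ = pcd.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)
    return pcd.voxel_down_sample(voxel_size=0.005)  # 5 mm resolution
```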
Data association
“First, we need to find the candidate objects for each temporary objects.”
“Through the relationship between keyframes and map points, we can easily find keyframes which are relative to the current keyframe. These keyframes not only are close to the current keyframe, but also are more likely to contain same objects because they have enough shared map points.”
“With the relationship between keyframes and objects, the objects seen by these relative keyframes are considered as the candidate objects for every temporary object.”
“When KeyFrame4 is inserted, the system detects object3, object4 and object5, which are treated as three temporary objects.”
“In order to find the candidate objects for such three temporary objects, we take the following two steps.”
“First, we search the relative keyframes for KeyFrame4.”
Second, it is inferred that Object2, Object3 and Object4 can be observed by KeyFrame3 and KeyFrame4 (question: from the figure, Object2 cannot be observed by KeyFrame3 or KeyFrame4). What it should say is that keyframes sharing enough map points are relative keyframes (KeyFrame2, KeyFrame3 and KeyFrame4 are all relative keyframes, not only 3 and 4), and the objects observed by the relative keyframes (Object2, 3 and 4) are then inferred to be observable by the corresponding keyframe group.
There is some doubt here; I feel the authors' writing is slightly off.
“therefore Object2, Object3 and Object4 are regarded as candidate objects for temporary objects observed by KeyFrame4.”
“With the relationship between keyframes and objects, we can avoid some undesirable situation caused by moving objects.”
“In the third keyframe, it generates a temporary object whose label is TV monitor.”
“According to the relationship between keyframes and objects, we can easily find that the TV monitor on the first keyframe is a candidate object for the temporary object observed in the third keyframe.”
“The first keyframe and the third keyframe have a lot of same map points, which reveals that the first keyframe is one of the relative keyframes for the third keyframe.”
“If we use the object seen by the last keyframe (it is the second keyframe in this condition) as candidate objects, the temporary object (TV monitor in the third keyframe) will be regarded as a new object by data association because the last keyframe does not contain the white TV monitor.”
“among the candidate objects, we need to select which one is most similar to the temporary object.”
“we perform a nearest neighbor search between 3D points in candidate and temporary objects, and calculate the Euclidean Distance between the matched point pairs.”
“k-d tree is used to accelerate the matching process.”
“According to the matched point pairs, scores between candidate and temporary objects can be calculated.”
The scoring formula between candidate and temporary objects:

$$S = \frac{M}{N}$$
“A candidate object with the highest score which is also higher than the threshold, is selected as the associated object.”
“If all the objects do not fulfill real-time requirements, the temporary object is considered as a new object which can be inserted into the SLAM system.”
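Reading $S = M/N$ as matched pairs over the temporary object's point count (my interpretation; the notes do not define $M$ and $N$), the k-d-tree-accelerated score might look like this:

```python
import numpy as np
from scipy.spatial import cKDTree


def matching_score(candidate_pts: np.ndarray, temp_pts: np.ndarray,
                   dist_thresh: float = 0.02) -> float:
    """S = M / N: the fraction of temporary-object points that have a
    candidate-object neighbor within dist_thresh (2 cm is my guess, not
    the paper's value)."""
    tree = cKDTree(candidate_pts)          # k-d tree over the candidate object
    dists, _ = tree.query(temp_pts, k=1)   # nearest neighbor per temporary point
    M = int((dists < dist_thresh).sum())   # matched point pairs
    N = len(temp_pts)                      # all points of the temporary object
    return M / N if N else 0.0
```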
“When we find the correspondence between candidate and temporary objects, the point clouds and confidence scores associated with them should be fused together.”
For the object detection with YOLO, RGB images are fed into the detection framework; given the data $I_k$ of the $k$-th image, the output of YOLO can be interpreted in a simplified way as an independent probability distribution of each object over the class labels, e.g. $P(O_{u} = l_i \mid I_k)$, where $u$ denotes a detected object and $l_i$ the $i$-th class label.
This lets us update all objects in the visible set $V_k \subseteq M$ and the corresponding probability distributions via a recursive Bayesian update:

$$P\left(l_{i} \mid I_{1, \ldots, k}\right)=\frac{1}{Z} P\left(l_{i} \mid I_{1, \ldots, k-1}\right) P\left(O_{u}=l_{i} \mid I_{k}\right)$$

Equation (9) is applied to all label probabilities of each object, and the constant $Z$ finally normalizes the result into a proper distribution.
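Eq. (9) in code is essentially a one-liner; a sketch:

```python
import numpy as np


def bayes_update(prior: np.ndarray, yolo_probs: np.ndarray) -> np.ndarray:
    """Recursive Bayesian update of an object's label distribution:
    P(l_i | I_{1..k}) = (1/Z) * P(l_i | I_{1..k-1}) * P(O_u = l_i | I_k)."""
    posterior = prior * yolo_probs
    return posterior / posterior.sum()  # division by the constant Z
```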
Another round of comparison between point-cloud maps and Octomap.
“In our system, 3D point clouds are stored in every keyframe, and the segmented 3D point clouds are also stored in the corresponding object.”
“the map based on point clouds can be generated by projecting the stored 3D points according to the associated poses.”
“However, the map based on point clouds is useless for advanced mission such as path planning or grasp point selection,”
“because point clouds do not use any structures to store each points, which is bad for searching, and each point has no volume information, which makes collision detection and 2D maps generation easily fail.”
“point clouds cannot distinguish the unknown area, the empty area, and cannot eliminate noise.”
“Octomap”: Octomap uses an octree to store point clouds; after a new point is inserted it can distinguish unknown from free space, and it can also reduce noise.
Octomap is a probabilistic 3D mapping framework based on octrees. In the implementation, the root node of the octree represents the whole space and its eight children represent eight sub-spaces; the leaf nodes of the octree represent the smallest-resolution voxels in the space.
In practice, moving objects and range-measurement errors introduce a lot of noise into the map. Octomap handles this with a probabilistic model: each leaf node stores the probability that it is occupied or free. When a new 3D point is inserted, the corresponding leaf node updates its probability as follows:
$$L\left(n \mid z_{1: t}\right)=L\left(n \mid z_{1: t-1}\right)+L\left(n \mid z_{t}\right)$$

$$L(n)=\log \left[\frac{P(n)}{1-P(n)}\right]$$
A question here: $P(n \mid z_t)$ never appears earlier in the text. Is it written wrongly, i.e. should it be $L(n \mid z_t)$?
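A sketch of the log-odds update for one leaf node; `p_hit = 0.7` mirrors a common OctoMap default and is not taken from the paper:

```python
import numpy as np


def logit(p: float) -> float:
    """L(n) = log(P(n) / (1 - P(n)))."""
    return float(np.log(p / (1.0 - p)))


def update_leaf(L_prev: float, p_hit: float = 0.7) -> float:
    """L(n | z_{1:t}) = L(n | z_{1:t-1}) + L(n | z_t), where L(n | z_t)
    is the log-odds of the sensor model for the new measurement."""
    return L_prev + logit(p_hit)


# Three consecutive hits on a leaf that starts at the uniform prior P(n) = 0.5:
L = logit(0.5)                   # = 0
for _ in range(3):
    L = update_leaf(L)
P = 1.0 / (1.0 + np.exp(-L))     # invert the log-odds back to a probability
```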
Octomap can generate voxel-based 3D maps or pixel-based 2D maps. Each voxel in Octomap can store not only its probability of being occupied or free, but also a fixed number of class labels and confidence scores.
The original Octomap generates the map in two steps.
Step 1: compute the positions of the empty voxels from the camera to each occupied voxel. Problem: computing the empty voxels is too time-consuming.
The authors optimize this process; consider a 2D example. (Note: the acceleration algorithm.)
[Fig. 10: a line across a 2D grid from $(X_s, Y_s)$ to $(X_e, Y_e)$]
In Fig. 10 there is a line from $(X_s, Y_s)$ to $(X_e, Y_e)$, with slope:

$$K=\frac{Y_{e}-Y_{s}}{X_{e}-X_{s}}$$
The grid $(X, Y)$ (the grid-index coordinates, with $V$ the size of a grid cell) is defined as:

$$X=\frac{X_{S}}{V}, \quad Y=\frac{Y_{S}}{V}$$
The offset within the grid (the distance from the grid line) is defined as:

$$X_{0}=X_{S} \% V, \quad Y_{0}=Y_{S} \% V$$
Finally, $D$ can be computed from the slope as:

$$D=K \times\left(V-X_{0}\right)+Y_{0}$$
If $D < V$, it means $n < N$, and therefore the next empty voxel is $(X+1, Y)$. (Note: verified.)
There are three further cases, shown in Fig. 11. [Fig. 11: the remaining line configurations]
(a): swap the start and end points, then compute the empty-voxel positions with the same procedure as above.
(b): the empty-voxel positions can be computed by transforming $L1$ into $L2$ as in Fig. 12, i.e. reflecting about the line $L$; the line $L$ is determined by Eq. (16) (the midpoint):

$$Y=\frac{Y_{s}+Y_{e}}{2}$$
[Fig. 12: reflecting $L1$ onto $L2$ about the mid-line $L$]
(c): obtained by swapping the two points in (b).
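A sketch of the base case (slope $0 \le K \le 1$, left to right); the cases of Fig. 11 reduce to it via the swaps and the reflection above. This is my reading of the scheme, not the authors' code:

```python
def empty_voxels(xs: float, ys: float, xe: float, ye: float, v: float):
    """Grid cells crossed by the ray from (xs, ys) to (xe, ye), cell size v,
    assuming 0 <= K <= 1 and xe > xs (the base case of Fig. 10)."""
    k = (ye - ys) / (xe - xs)          # slope K
    x, y = int(xs // v), int(ys // v)  # grid coordinates of the start cell
    x0, y0 = xs % v, ys % v            # offsets inside the start cell
    d = k * (v - x0) + y0              # height at which the line leaves the cell
    cells = []
    while x < int(xe // v):
        if d >= v:       # the line crosses the top edge first,
            y += 1       # so the next empty voxel is (X, Y+1)
            d -= v
        else:            # D < V: the line exits through the right edge,
            x += 1       # so the next empty voxel is (X+1, Y)
            d += k * v
        cells.append((x, y))
    return cells
```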
Step 2: insert all the new voxels into the Octomap, which locates the corresponding leaf node for each new voxel through the octree structure.
“probability of corresponding leaf node is updated with the new voxel.”
“However, both of these two steps use a single thread to process. Although the first step can use OpenMP to create multi threads to compute the empty voxels, we find that it runs slower than the one with a single thread because too many threads are created. In order to accelerate mapping with multi-threads, three strategies are proposed.”
“First, we use a thread Pool to take the place of OpenMP. The second step is modified to use eight threads to insert voxels into an Octomap. And the third is that the whole architecture can be accelerated by using a producer-consumer model.”
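A toy producer-consumer sketch of this strategy; `Counter` stands in for the real octree and `compute_voxels` for the ray-casting step, both hypothetical:

```python
import queue
import threading
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

voxel_queue: queue.Queue = queue.Queue(maxsize=64)
octree = Counter()               # stand-in for the octree
octree_lock = threading.Lock()


def compute_voxels(keyframe):
    return [(keyframe, i) for i in range(100)]  # dummy voxel keys


def consumer():
    while (batch := voxel_queue.get()) is not None:
        with octree_lock:        # one global lock here; a real octree would
            for voxel in batch:  # shard the work by subtree instead
                octree[voxel] += 1


with ThreadPoolExecutor(max_workers=8) as pool:  # the thread pool replacing OpenMP
    workers = [pool.submit(consumer) for _ in range(8)]
    for kf in range(20):                         # producer: feed voxel batches
        voxel_queue.put(compute_voxels(kf))
    for _ in workers:
        voxel_queue.put(None)                    # poison pills stop the consumers
```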
A pile of material about this dataset follows.
“TUM dataset is an excellent dataset to evaluate the accuracy of camera localization as it provides accurate ground truth for the sequences.”
“It contains seven kinds of sequences recorded by a RGB-D camera at 30fps and a resolution of 640 x 480.”
“We only use Handheld SLAM sequence, Robot SLAM sequence, Structure vs. Texture sequence and Dynamic Objects sequence among the seven sequences because these four sequences represent most of the scenes of every day.”
“these sequences contain different kinds of objects, which can ensure more semantic information than other kinds of sequences.”
“Handheld SLAM sequence is recorded by hands, and therefore it has complex and unstable trajectories, while Robot SLAM sequence is recorded by real robots, hence it has stable and simple trajectories.”
“But all of them are experimented in the scene without dynamic objects.”
“Furthermore, Dynamic Objects sequence is recorded by hands, however when the recordings have some dynamic objects, the trajectory is found to be unstable in such scenes.”
“For comparison, we use different RGB-D SLAM in the benchmark.”
“Each sequence is processed 5 times, and we use RMSE to judge its localization Accuracy”
The root-mean-square error (RMSE) is used:

$$RMSE=\sqrt{\frac{\sum_{i=1}^{n}\left(X_{obs, i}-X_{\text{model}, i}\right)^{2}}{n}}$$
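For reference, the RMSE of an estimated trajectory against ground truth in two lines:

```python
import numpy as np


def rmse(x_obs: np.ndarray, x_model: np.ndarray) -> float:
    """Root-mean-square error between observed (estimated) and ground-truth values."""
    return float(np.sqrt(np.mean((x_obs - x_model) ** 2)))
```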
“First, we verify the localization accuracy with Improved SLAM.”
“For comparison we have executed ORB-SLAM2 with a RGB-D camera in the benchmark.”
“Table I shows the median RMSE error of the benchmark.”
“It can be seen that the RMSE error of Improved SLAM is little worse than the error of ORB-SLAM2 in static environment, which means the accuracy of our SLAM is close to ORB-SLAM2.” (Note: the difference is small.)
“The reason of worse RMSE error is that when tracking reference frames, the number of common features that the last reference frame and the current keyframe have is larger than that the last keyframe and the current keyframe have.” (Note: the analysis of the cause.)
“Because of the number of common features, SLAM tracking by reference frames can provide more accurate localization in the static environment.”
That is, ORB-SLAM2 uses reference keyframes, which can provide more accurate information.
“However, in the scene with dynamic objects, our SLAM is better than ORB-SLAM2. This is because our SLAM tracks by keyframes instead of reference frames, when a big moving object passes through the camera, the trajectory is easy to follow the moving object if SLAM tracks by reference frames.”
“Since a big moving object contains a lot of features, and SLAM are based on the assumption that the number of features in moving objects is much smaller than the number of features in static objects, so in this case, original ORB-SLAM2 considers the big moving object to be static.”
The reason is that a large moving object carries a large number of features, more than the static background does, so ORB-SLAM2 mistakes the large moving object for a static one when in fact it is not.
Note: tracking only by keyframes versus tracking by reference keyframes feels like a fish-or-bear's-paw trade-off; my current understanding is that different schemes suit the demands of different environments.
“Fig.13 shows the comparisons of trajectories calculated by semantic SLAM, with the trajectories calculated by ORB-SLAM2 and RGBD-SLAM. As we observe, our trajectories are similar to the trajectories calculated by ORB-SLAM2 in the scenes without dynamic objects.”
Median RMSE of the benchmark:
In static environments, the RMSE of our semantic SLAM is slightly worse than ORB-SLAM2's, but very close.
The cause of the slightly worse RMSE: the Local Mapping thread needs more time to detect objects and to process keyframes, so some keyframes generated by the Tracking thread may be dropped. (Why? I don't quite understand: they are independent threads, so what does this have to do with the threads? It is probably because reference frames are no longer used.)
In dynamic environments, the semantic SLAM is much better than ORB-SLAM2.
The main reasons:
1. The semantic SLAM keeps the features on static objects and culls, for every keyframe, the dynamic features that may belong to moving objects.
2. Since every frame is tracked against keyframes, more stable features are available, so localization can be more accurate in scenes with dynamic objects. (The second reason seems a bit questionable.)
To show the capability of our object-oriented semantic mapping system, indoor RGB-D sequences are used to generate a global map of each environment, and the number of objects recorded in the map is compared per class. The results are shown in Fig. 14: most objects in the scenes can be recognized, but some objects are still missing; of two monitors, only one was detected.
Reason: they are placed too close together. In some scenes the system recognizes a monitor and a keyboard as one laptop, because they are very close to each other in the scene.
Another problem: YOLO gives its results as bounding boxes, so every object also contains part of the background.
[Fig. 14: the number of objects recorded in the maps, per class]
The result of converting the detected objects into an Octomap is shown in Fig. 15.
[Fig. 15: the detected objects converted into an Octomap]
In the semantic SLAM, RGB and depth images are combined to generate point clouds and the pose of each point cloud is obtained, so the point clouds and their poses can be used to create the map through Octomap.
Comparing the improved Octomap with the original Octomap: the TUM dataset is used to build maps with multiple threads, and the results are shown in Fig. 16.
The time taken to compute the empty voxels is shown in Fig. 17.
[Fig. 17: time to compute the empty voxels]