RGB: an ordinary color image.
RGB-D: a color image plus a depth channel, where the depth is the distance from the camera to each point in the scene.
Monocular RGB-D: a single camera (plus depth sensor) captures the RGB-D data.
Binocular (stereo) RGB-D: two cameras capture RGB-D data together, similar to a pair of eyes, which makes it easier to infer position.
In this work, we study 3D object detection from RGB-D data in both indoor and outdoor scenes. (This is an object detection task.)
While previous methods focus on images or 3D voxels, often obscuring natural 3D patterns and invariances of 3D data, we directly operate on raw point clouds by popping up RGB-D scans. (Like PointNet, the method works directly on the point cloud.)
However, a key challenge of this approach is how to efficiently localize objects in point clouds of large-scale scenes (region proposal). (That is, localizing an object within a large-scale scene is the hard part.)
Instead of solely relying on 3D proposals, our method leverages both mature 2D object detectors and advanced 3D deep learning for object localization, achieving efficiency as well as high recall for even small objects. (If you are not familiar with recall, see any reference on precision and recall.)
Benefited from learning directly in raw point clouds, our method is also able to precisely estimate 3D bounding boxes even under strong occlusion or with very sparse points. (Roughly, the method is robust to such interference. The 3D bounding box here is analogous to the small box we draw around an object in 2D recognition.)
Evaluated on KITTI and SUN RGB-D 3D detection benchmarks, our method outperforms the state of the art by remarkable margins while having real-time capability.
A paragraph-by-paragraph summary of the introduction follows.
Paragraph 1 (the demand for 3D data processing is large, but relatively little has been done; this paper addresses only part of it):
Recently, great progress has been made on 2D image understanding tasks, such as object detection [10] and instance segmentation [11].
However, beyond getting 2D bounding boxes or pixel masks, 3D understanding is eagerly in demand in many applications such as autonomous driving and augmented reality (AR).
With the popularity of 3D sensors deployed on mobile devices and autonomous vehicles, more and more 3D data is captured and processed.
In this work, we study one of the most important 3D perception tasks – 3D object detection, which classifies the object category and estimates oriented 3D bounding boxes of physical objects from 3D sensor data.
Paragraph 2 (raises the question of how to process sets of points, which leads into point clouds):
While 3D sensor data is often in the form of point clouds, how to represent point cloud and what deep net architectures to use for 3D object detection remains an open problem.
Most existing works convert 3D point clouds to images by projection [30, 21] or to volumetric grids by quantization [33, 18, 21] and then apply convolutional networks. (PointNet is not mentioned here, perhaps to highlight the authors' own novelty, though it is brought up later.)
This data representation transformation, however, may obscure natural 3D patterns and invariances of the data. (A familiar problem.)
Recently, a number of papers have proposed to process point clouds directly without converting them to other formats.
For example, [20, 22] proposed new types of deep net architectures, called PointNets, which have shown superior performance and efficiency in several 3D understanding tasks such as object classification and semantic segmentation.
Paragraph 3 (PointNets have only been used for classification and semantic segmentation; object detection is still missing, so this paper tackles it):
While PointNets are capable of classifying a whole point cloud or predicting a semantic class for each point in a point cloud, it is unclear how this architecture can be used for instance-level 3D object detection.
Towards this goal, we have to address one key challenge: how to efficiently propose possible locations of 3D objects in a 3D space.
Imitating the practice in image detection, it is straightforward to enumerate candidate 3D boxes by sliding windows [7] or by 3D region proposal networks such as [27].
However, the computational complexity of 3D search typically grows cubically with respect to resolution and becomes too expensive for large scenes or real-time applications such as autonomous driving.
Paragraph 4 (a brief introduction to the method used):
Instead, in this work, we reduce the search space following the dimension reduction principle: we take the advantage of mature 2D object detectors (Fig. 1).
First, we extract the 3D bounding frustum of an object by extruding 2D bounding boxes from image detectors. (Roughly, the 2D box from the image detector is extruded along the viewing direction into a 3D frustum.)
Then, within the 3D space trimmed by each of the 3D frustums, we consecutively perform 3D object instance segmentation and amodal 3D bounding box regression using two variants of PointNet. (Working only inside this reduced space saves a lot of computation.)
The segmentation network predicts the 3D mask of the object of interest (i.e. instance segmentation);
and the regression network estimates the amodal 3D bounding box (covering the entire object even if only part of it is visible).
Paragraph 5 (3D data calls for a better, more 3D-centric approach):
In contrast to previous work that treats RGB-D data as 2D maps for CNNs, our method is more 3D-centric as we lift depth maps to 3D point clouds and process them using 3D tools.
This 3D-centric view enables new capabilities for exploring 3D data in a more effective manner.
First, in our pipeline, a few transformations are applied successively on 3D coordinates, which align point clouds into a sequence of more constrained and canonical frames.
These alignments factor out pose variations in data, and thus make 3D geometry patterns more evident, leading to an easier job for 3D learners.
Second, learning in 3D space can better exploit the geometric and topological structure of 3D space.
In principle, all objects live in 3D space; therefore, we believe that many geometric structures, such as repetition, planarity, and symmetry, are more naturally parameterized and captured by learners that directly operate in 3D space. (3D space is closer to the world we live in, so naturally there will be more applications there.)
The usefulness of this 3D-centric network design philosophy has been supported by much recent experimental evidence.
Paragraph 6 (the network performs well):
Our method achieves leading positions on the KITTI 3D object detection [1] and bird's eye view detection [2] benchmarks.
Compared with the previous state of the art [5], our method is 8.04% better on 3D car AP with high efficiency (running at 5 fps).
Our method also fits well to indoor RGB-D data, where we have achieved 8.9% and 6.4% better 3D mAP than [13] and [24] on SUN RGB-D while running one to three orders of magnitude faster.
Paragraph 7 – the authors' own summary of contributions:
The key contributions of our work are as follows:
• We propose a novel framework for RGB-D data based 3D object detection called Frustum PointNets.
• We show how we can train 3D object detectors under our framework and achieve state-of-the-art performance on standard 3D object detection benchmarks.
• We provide extensive quantitative evaluations to validate our design choices as well as rich qualitative results for understanding the strengths and limitations of our method.
To summarize the introduction:
1. Applications of 3D perception are very important.
2. PointNet works well on 3D data, but it has not addressed instance-level 3D object detection.
3. This paper therefore tries to fill that gap.
4. Because search in 3D is computationally expensive, a 2D detector is first used to narrow the search space into a 3D bounding frustum, and the remaining processing is done inside that reduced space.
5. The method achieves very good results.
3D Object Detection from RGB-D Data:
Researchers have approached the 3D detection problem by taking various ways to represent RGB-D data.
Front view image based methods:
[3, 19, 34] take monocular RGB images and shape priors or occlusion patterns to infer 3D bounding boxes.
[15, 6] represent depth data as 2D maps and apply CNNs to localize objects in 2D image.
In comparison, we represent depth as a point cloud and use advanced 3D deep networks (PointNets) that can exploit 3D geometry more effectively.
Bird's eye view based methods: MV3D [5] projects LiDAR point cloud to bird's eye view and trains a region proposal network (RPN [23]) for 3D bounding box proposal. (A bird's eye view may look spatial, but it is ultimately still a 2D image, which does not fully meet our needs.)
However, the method lags behind in detecting small objects, such as pedestrians and cyclists, and cannot easily adapt to scenes with multiple objects in the vertical direction.
3D based methods: [31, 28] train 3D object classifiers by SVMs on hand-designed geometry features extracted from point cloud and then localize objects using sliding-window search.
[7] extends [31] by replacing SVM with 3D CNN on voxelized 3D grids. [24] designs new geometric features for 3D object detection in a point cloud.
[29, 14] convert a point cloud of the entire scene into a volumetric grid and use 3D volumetric CNN for object proposal and classification.
Computation cost for those methods is usually quite high due to the expensive cost of 3D convolutions and the large 3D search space.
Recently, [13] proposes a 2D-driven 3D object detection method that is similar to ours in spirit.
However, they use hand-crafted features (based on histogram of point coordinates) with simple fully connected networks to regress 3D box location and pose, which is sub-optimal in both speed and performance.
In contrast, we propose a more flexible and effective solution with deep 3D feature learning (PointNets).
Deep Learning on Point Clouds:
Most existing works convert point clouds to images or volumetric forms before feature learning.
[33, 18, 21] voxelize point clouds into volumetric grids and generalize image CNNs to 3D CNNs. (That is, the point cloud is converted into a 3D "image".)
[16, 25, 32, 7] design more efficient 3D CNN or neural network architectures that exploit sparsity in point cloud.
However, these CNN based methods still require quantization of point clouds with certain voxel resolution.
Recently, a few works [20, 22] propose a novel type of network architectures (PointNets) that directly consumes raw point clouds without converting them to other formats.
While PointNets have been applied to single object classification and semantic segmentation, our work explores how to extend the architecture for the purpose of 3D object detection.
To summarize the related work:
1. Previous researchers tried various ways to convert 3D data into 2D representations, but all of these lose some information, so this paper proposes a new approach.
2. PointNet consumes point clouds directly, avoiding the information loss caused by such conversions.
Given RGB-D data as input, our goal is to classify and localize objects in 3D space.
The depth data, obtained from LiDAR or indoor depth sensors, is represented as a point cloud in RGB camera coordinates. The projection matrix is also known, so that we can get a 3D frustum from a 2D image region.
Each object is represented by a class (one among k predefined classes) and an amodal 3D bounding box.
The amodal box bounds the complete object even if part of the object is occluded or truncated.
The 3D box is parameterized by its size h, w, l, center cx, cy, cz, and orientation θ, φ, ψ relative to a predefined canonical pose for each category.
In our implementation, we only consider the heading angle θ around the up-axis for orientation.
To summarize the problem definition:
1. It explains what RGB-D data is.
2. It explains what the (amodal) bounding box is.
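To make the box parameterization above concrete, here is a minimal numpy sketch (the helper name box3d_corners and the y-up axis ordering are assumptions made for illustration, not necessarily the paper's exact convention) that maps center, size and heading angle to the 8 corner coordinates of a box:

import numpy as np

def box3d_corners(center, size, heading):
    """Return the 8 corners (8x3) of a 3D box given center (cx, cy, cz),
    size (h, w, l) and a heading angle around the up-axis."""
    cx, cy, cz = center
    h, w, l = size
    # Corners in the box's local frame (x: length, y: height, z: width).
    x = np.array([ l/2,  l/2, -l/2, -l/2,  l/2,  l/2, -l/2, -l/2])
    y = np.array([ h/2,  h/2,  h/2,  h/2, -h/2, -h/2, -h/2, -h/2])
    z = np.array([ w/2, -w/2, -w/2,  w/2,  w/2, -w/2, -w/2,  w/2])
    corners = np.stack([x, y, z], axis=1)            # (8, 3)
    c, s = np.cos(heading), np.sin(heading)
    # Rotation around the up-axis (assumed here to be y).
    rot = np.array([[ c, 0, s],
                    [ 0, 1, 0],
                    [-s, 0, c]])
    return corners @ rot.T + np.array([cx, cy, cz])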
As shown in Fig. 2, our system for 3D object detection consists of three modules: frustum proposal, 3D instance segmentation, and 3D amodal bounding box estimation.
We will introduce each module in the following subsections.
We will focus on the pipeline and functionality of each module, and refer readers to the supplementary for specific architectures of the deep networks involved.
(In short: the network is divided into three parts – frustum proposal, 3D instance segmentation, and 3D amodal bounding box estimation.)
The resolution of data produced by most 3D sensors, especially real-time depth sensors, is still lower than RGB images from commodity cameras.
Therefore, we leverage mature 2D object detectors to propose 2D object regions in RGB images as well as to classify objects. (These 2D proposals are prepared here for use by the later stages.)
With a known camera projection matrix, a 2D bounding box can be lifted to a frustum (with near and far planes specified by depth sensor range) that defines a 3D search space for the object.
We then collect all points within the frustum to form a frustum point cloud.
As shown in Fig. 4 (a), frustums may orient towards many different directions, which results in large variation in the placement of point clouds.
We therefore normalize the frustums by rotating them toward a center view such that the center axis of the frustum is orthogonal to the image plane.
This normalization helps improve the rotation-invariance of the algorithm.
We call this entire procedure for extracting frustum point clouds from RGB-D data frustum proposal generation.
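The following is a minimal sketch (with assumed helper names and a simplified pinhole camera model, not the paper's implementation) of how a 2D box plus the camera projection can select a frustum point cloud, and how that cloud can then be rotated toward the center view:

import numpy as np

def frustum_point_cloud(points, box2d, P):
    """Select points whose image projection falls inside a 2D box.
    points: (N, 3) in camera coordinates; box2d: (xmin, ymin, xmax, ymax);
    P: 3x4 camera projection matrix."""
    pts_h = np.hstack([points, np.ones((points.shape[0], 1))])   # (N, 4)
    uvw = pts_h @ P.T                                            # (N, 3)
    u, v = uvw[:, 0] / uvw[:, 2], uvw[:, 1] / uvw[:, 2]
    xmin, ymin, xmax, ymax = box2d
    # Assume z points forward, so keep only points in front of the camera.
    mask = (u >= xmin) & (u <= xmax) & (v >= ymin) & (v <= ymax) & (points[:, 2] > 0)
    return points[mask]

def rotate_to_center_view(points, box2d, K):
    """Rotate frustum points so the frustum's center axis aligns with the z-axis.
    K: 3x3 camera intrinsics (pinhole model assumed)."""
    u_c = (box2d[0] + box2d[2]) / 2.0
    v_c = (box2d[1] + box2d[3]) / 2.0
    # Back-project the 2D box center to a viewing ray, then rotate that ray onto +z.
    ray = np.linalg.inv(K) @ np.array([u_c, v_c, 1.0])
    angle_y = np.arctan2(ray[0], ray[2])            # rotation around the up (y) axis
    c, s = np.cos(-angle_y), np.sin(-angle_y)
    rot = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    return points @ rot.T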
While our 3D detection framework is agnostic to the exact method for 2D region proposal, we adopt an FPN [17] based model.
We pre-train the model weights on ImageNet classification and COCO object detection datasets and further fine-tune it on a KITTI 2D object detection dataset to classify and predict amodal 2D boxes. (A general-purpose dataset is used for pre-training, and a domain-specific dataset for fine-tuning.)
More details of the 2D detector training are provided in the supplementary.
Given a 2D image region (and its corresponding 3D frustum), several methods might be used to obtain the 3D location of the object: one straightforward solution is to directly regress 3D object locations (e.g., by 3D bounding box) from a depth map using 2D CNNs.
However, this problem is not easy, as occluding objects and background clutter are common in natural scenes (as in Fig. 3), which may severely distract the 3D localization task.
Because objects are naturally separated in physical space, segmentation in a 3D point cloud is much more natural and easier than that in images, where pixels from distant objects can be near-by to each other.
Having observed this fact, we propose to segment instances in 3D point cloud instead of in 2D image or depth map.
Similar to Mask-RCNN [11], which achieves instance segmentation by binary classification of pixels in image regions, we realize 3D instance segmentation using a PointNet-based network on point clouds in frustums.
Based on 3D instance segmentation, we are able to achieve residual based 3D localization.
That is, rather than regressing the absolute 3D location of the object, whose offset from the sensor may vary in large ranges (e.g. from 5m to beyond 50m in KITTI data),
we predict the 3D bounding box center in a local coordinate system – 3D mask coordinates as shown in Fig. 4 (c).
3D Instance Segmentation PointNet.
The network takes a point cloud in frustum and predicts a probability score for each point that indicates how likely the point belongs to the object of interest.
Note that each frustum contains exactly one object of interest.
Here those "other" points could be points of non-relevant areas (such as ground, vegetation) or other instances that occlude or are behind the object of interest.
Similar to the case in 2D instance segmentation, depending on the position of the frustum, object points in one frustum may become cluttered or occlude points in another.
Therefore, our segmentation PointNet is learning the occlusion and clutter patterns as well as recognizing the geometry for the object of a certain category.
In a multi-class detection case, we also leverage the semantics from a 2D detector for better instance segmentation.
For example, if we know the object of interest is a pedestrian, then the segmentation network can use this prior to find geometries that look like a person.
Specifically, in our architecture we encode the semantic category as a one-hot class vector (k dimensional for the pre-defined k categories) and concatenate the one-hot vector to the intermediate point cloud features. More details of the specific architectures are described in the supplementary.
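A minimal shape-level sketch (assumed names, not the paper's exact architecture) of how a k-dimensional one-hot class vector can be tiled and concatenated to intermediate per-point features before the segmentation head:

import numpy as np

def concat_one_hot(point_feats, class_id, num_classes):
    """point_feats: (N, C) per-point features from earlier PointNet layers.
    Returns (N, C + num_classes) features carrying the 2D detector's class prior."""
    n = point_feats.shape[0]
    one_hot = np.zeros((1, num_classes), dtype=point_feats.dtype)
    one_hot[0, class_id] = 1.0
    one_hot_tiled = np.repeat(one_hot, n, axis=0)     # tile the class vector to every point
    return np.concatenate([point_feats, one_hot_tiled], axis=1)

# Example: 1024 frustum points, 64-D intermediate features, 3 classes, class 0 = "car"
feats = np.random.randn(1024, 64).astype(np.float32)
augmented = concat_one_hot(feats, class_id=0, num_classes=3)
print(augmented.shape)   # (1024, 67)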
After 3D instance segmentation, points that are classified as the object of interest are extracted ("masking" in Fig. 2).
Having obtained these segmented object points, we further normalize their coordinates to boost the translational invariance of the algorithm, following the same rationale as in the frustum proposal step.
In our implementation, we transform the point cloud into a local coordinate by subtracting XYZ values by its centroid.
This is illustrated in Fig. 4 (c). Note that we intentionally do not scale the point cloud, because the bounding sphere size of a partial point cloud can be greatly affected by viewpoints and the real size of the point cloud helps the box size estimation.
In our experiments, we find that coordinate transformations such as the one above and the previous frustum rotation are critical for the 3D detection result, as shown in Tab. 8.
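The mask-coordinate transform described above amounts to a centroid subtraction; a one-line sketch (assuming an (N, 3) numpy array of segmented points):

import numpy as np

def to_mask_coordinates(object_points):
    """Translate segmented object points so their centroid is the origin;
    note the cloud is intentionally not scaled."""
    centroid = object_points.mean(axis=0)
    return object_points - centroid, centroid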
Given the segmented object points (in 3D mask coordinates), this module estimates the object's amodal oriented 3D bounding box by using a box regression PointNet together with a preprocessing transformer network.
Learning-based 3D Alignment by T-Net:
Even though we have aligned segmented object points according to their centroid position, we find that the origin of the mask coordinate frame (Fig. 4 (c)) may still be quite far from the amodal box center.
We therefore propose to use a light-weight regression PointNet (T-Net) to estimate the true center of the complete object and then transform the coordinates such that the predicted center becomes the origin (Fig. 4 (d)).
(T-Net: similar to the T-Net in PointNet, except that what is learned here is not a rotation matrix but a residual for the object's center, and it is trained with supervision. In effect, it corrects the center of the frustum/mask coordinate frame.)
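A minimal PyTorch-style sketch of such a light-weight regression PointNet (layer sizes and the class name CenterTNet are assumptions, not the paper's exact architecture): shared per-point MLP, max-pooling to a global feature, then a small head that regresses the 3-D center residual.

import torch
import torch.nn as nn

class CenterTNet(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()
        # Shared per-point MLP implemented with 1x1 convolutions.
        self.point_mlp = nn.Sequential(
            nn.Conv1d(3, 128, 1), nn.ReLU(),
            nn.Conv1d(128, 256, 1), nn.ReLU(),
        )
        # Regression head on the pooled global feature plus the one-hot class vector.
        self.head = nn.Sequential(
            nn.Linear(256 + num_classes, 128), nn.ReLU(),
            nn.Linear(128, 3),                      # Δcenter (x, y, z)
        )

    def forward(self, points, one_hot):
        # points: (B, N, 3) in mask coordinates; one_hot: (B, num_classes)
        feats = self.point_mlp(points.transpose(1, 2))        # (B, 256, N)
        global_feat = feats.max(dim=2).values                 # (B, 256)
        return self.head(torch.cat([global_feat, one_hot], dim=1))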
Amodal 3D Box Estimation PointNet
The box estimation network predicts amodal bounding boxes (for the entire object even if part of it is unseen) for objects, given an object point cloud in 3D object coordinates (Fig. 4 (d)).
The network architecture is similar to that for object classification [20, 22], however the output is no longer object class scores but parameters for a 3D bounding box.
As stated in Sec. 3, we parameterize a 3D bounding box by its center (cx, cy, cz), size (h, w, l) and heading angle θ (along up-axis).
We take a "residual" approach for box center estimation. The center residual predicted by the box estimation network is combined with the previous center residual from the T-Net and the masked points' centroid to recover an absolute center (Eq. 1).
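The absolute center recovery (Eq. 1 in the paper) then has roughly this form:

C_pred = C_mask + ΔC_t-net + ΔC_box-net

where C_mask is the centroid of the masked object points, ΔC_t-net is the residual predicted by the T-Net, and ΔC_box-net is the residual predicted by the box estimation network.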
For box size and heading angle, we follow previous works [23, 19] and use a hybrid of classification and regression formulations.
Specifically, we pre-define NS size templates and NH equally split angle bins.
Our model will both classify size/heading (NS scores for size, NH scores for heading) to those pre-defined categories as well as predict residual numbers for each category (3×NS residual dimensions for height, width, length, NH residual angles for heading). In the end the net outputs 3 + 4×NS + 2×NH numbers in total.
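Counting these outputs under the parameterization above: 3 numbers for the center residual, NS size scores plus 3×NS size residuals (4×NS in total), and NH heading scores plus NH heading residuals (2×NH in total), which gives the stated 3 + 4×NS + 2×NH. For example, with NS = 8 size templates and NH = 12 heading bins (values assumed here only for illustration), the network would output 3 + 32 + 24 = 59 numbers.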
We simultaneously optimize the three nets involved (3D instance segmentation PointNet, T-Net and amodal box estimation PointNet) with multi-task losses (as in Eq. 2).
(That is, although the pipeline runs end to end as one flow, many intermediate outputs have their own labels and losses, which differs from a plain single-output supervised setup.)
(For the exact formula, see the original paper; essentially the authors define their own multi-task loss function.)
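For reference, the multi-task loss (Eq. 2) has roughly the following form (reconstructed from the paper; the notation may differ slightly):

L_multi-task = L_seg + λ (L_c1-reg + L_c2-reg + L_h-cls + L_h-reg + L_s-cls + L_s-reg + γ L_corner)

where L_seg is the 3D instance segmentation loss, L_c1-reg and L_c2-reg are the center regression losses for the T-Net and the box estimation network, the h and s terms are the heading and size classification/regression losses, and L_corner is the corner loss described below.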
Corner Loss for Joint Optimization of Box Parameters (the point is that center, size and heading must be good jointly, so simple separate loss terms are not enough)
While our 3D bounding box parameterization is compact and complete, learning is not optimized for final 3D box accuracy – center, size and heading have separate loss terms.
Imagine cases where center and size are accurately predicted but the heading angle is off – the 3D IoU with the ground truth box will then be dominated by the angle error. Ideally all three terms (center, size, heading) should be jointly optimized for best 3D box estimation (under the IoU metric).
To resolve this problem we propose a novel regularization loss, the corner loss:
Paragraph 3 (the corner loss works well):
In essence, the corner loss is the sum of the distances between the eight corners of a predicted box and a ground truth box.
Since corner positions are jointly determined by center, size and heading, the corner loss is able to regularize the multi-task training for those parameters.
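For reference, the corner loss has roughly the following form (reconstructed from the paper):

L_corner = Σ_{i=1..NS} Σ_{j=1..NH} δ_ij · min( Σ_{k=1..8} ||P_k^{ij} − P_k^*|| , Σ_{k=1..8} ||P_k^{ij} − P_k^{**}|| )

where P_k^{ij} are the corners of the box predicted for size class i and heading class j, P_k^* are the ground-truth corners, P_k^{**} are the ground-truth corners with the heading flipped by 180° (so a flipped heading is not over-penalized), and δ_ij selects the ground-truth size/heading class.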