前言:
搜了一些 PVRCNN 相关的博客,对于网络结构的有些地方还是不是很理解,就仔细阅读了以下原文,发现原论文讲的很透彻,一下子就理解了,因此对原论文做了翻译,也算是加强自己的记忆。
论文地址:https://arxiv.org/pdf/1912.13192.pdf
We present a novel and high-performance 3D object detection framework, named PointVoxel-RCNN (PV-RCNN), for accurate 3D object detection from point clouds. Our proposed method deeply integrates both 3D voxel Convolutional Neural Network (CNN) and PointNet-based set abstraction to learn more discriminative point cloud features. It takes advantages of efficient learning and high-quality proposals of the 3D voxel CNN and the flexible receptive fields of the PointNet-based networks. Specifically, the proposed framework summarizes the 3D scene with a 3D voxel CNN into a small set of keypoints via a novel voxel set abstraction module to save follow-up computations and also to encode representative scene features. Given the highquality 3D proposals generated by the voxel CNN, the RoI-grid pooling is proposed to abstract proposal-specific features from the keypoints to the RoI-grid points via keypoint set abstraction with multiple receptive fields. Compared with conventional pooling operations, the RoI-grid feature points encode much richer context information for accurately estimating object confidences and locations. Extensive experiments on both the KITTI dataset and the Waymo Open dataset show that our proposed PV-RCNN surpasses state-of-the-art 3D detection methods with remarkable margins by using only point clouds.
Code is available at https://github.com/open-mmlab/OpenPCDet
我们提出了一种新的高性能三维目标检测框架,称之为: PointVoxel-RCNN(PV-RCNN),用于从点云中检测3D对象。我们将 3d voxel CNN 和基于点的特征集合进行了深度集成,从而得到更具辨别力的点云特征。它采用了 3D voxel CNN的高效学习和高质量 proposals 的优势,以及 PointNet 的网络的灵活感受野。
具体来说,该框架通过一个新的 voxel SA model 将三维场景提取成一系列的关键点组,以减少后续计算,并对具有代表性的场景特征进行编码。首先通过 voxel CNN 生成的高质量3D proposals,再通过多尺度的感受野对关键点提取特征,将 proposals 的特征从关键点抽象到 RoI-grid 点,从而提出了 RoI-grid pooling。
与传统的 pooling 操作相比,RoI-grid 特征点编码了更丰富的上下文信息,用于更准确的估计目标的可信度和位置。
在 KITTI 数据集和 Waymo Open 数据集上进行的大量实验表明,我们提出的PV-RCNN通过仅使用点云,超越了最先进的3D检测方法,具有显著的优势。
code 地址:https://github.com/open-mmlab/OpenPCDet
3D object detection has been receiving increasing attention from both industry and academia thanks to its wide applications in various fields such as autonomous driving and robotics. LiDAR sensors are widely adopted in autonomous driving vehicles and robots for capturing 3D scene information as sparse and irregular point clouds, which provide vital cues for 3D scene perception and understanding. In this paper, we propose to achieve high performance 3D object detection by designing novel point-voxel integrated networks to learn better 3D features from irregular point clouds.
Most existing 3D detection methods could be classified into two categories in terms of point cloud representations, i.e., the grid based methods and the point-based methods. The grid-based methods generally transform the irregular point clouds to regular representations such as 3D voxels [27, 41, 34, 2, 26] or 2D bird-view maps[1, 11, 36, 17, 35, 12, 16], which could be efficiently processed by 3D or 2D Convolutional Neural Networks (CNN) to learn point features for 3D detection. Powered by the pioneer work, PointNet and its variants [23, 24], the pointbased methods [22, 25, 32, 37] directly extract discriminative features from raw point clouds for 3D detection. Generally, the grid-based methods are more computationally efficient but the inevitable information loss degrades the finegrained localization accuracy, while the point-based methods have higher computation cost but could easily achieve larger receptive field by the point set abstraction [24]. However, we show that a unified framework could integrate the best of the two types of methods, and surpass the prior stateof-the-art 3D detection methods with remarkable margins.
We propose a novel 3D object detection framework, PVRCNN (Illustrated in Fig. 1), which boosts the 3D detection performance by incorporating the advantages from both the Point-based and Voxel-based feature learning methods. The principle of PV-RCNN lies in the fact that the voxel-based operation efficiently encodes multi-scale feature representations and can generate high-quality 3D proposals, while the PointNet-based set abstraction operation preserves accurate location information with flexible receptive fields. We argue that the integration of these two types of feature learning frameworks can help learn more discriminative features for accurate fine-grained box refinement.
The main challenge would be how to effectively combine the two types of feature learning schemes, specifically the 3D voxel CNN with sparse convolutions [6, 5] and the PointNet-based set abstraction [24], into a unified framework. An intuitive solution would be uniformly sampling several grid points within each 3D proposal, and adopt the set abstraction to aggregate 3D voxel-wise features surrounding these grid points for proposal refinement. However, this strategy is highly memory-intensive since both the number of voxels and the number of grid points could be quite large to achieve satisfactory performance.
Therefore, to better integrate these two types of point cloud feature learning networks, we propose a two-step strategy with the first voxel-to-keypoint scene encoding step and the second keypoint-to-grid RoI feature abstraction step. Specifically, a voxel CNN with 3D sparse convolution is adopted for voxel-wise feature learning and accurate proposal generation. To mitigate the above mentioned issue of requiring too many voxels for encoding the whole scene, a small set of keypoints are selected by the furtherest point sampling (FPS) to summarize the overall 3D information from the voxel-wise features. The features of each keypoint is aggregated by grouping the neighboring voxel-wise features via PointNet-based set abstraction for summarizing multi-scale point cloud information. In this way, the overall scene can be effectively and efficiently encoded by a small number of keypoints with associated multi-scale features.
For the second keypoint-to-grid RoI feature abstraction step, given each box proposal with its grid point locations, a RoI-grid pooling module is proposed, where a keypoint set abstraction layer with multiple radii is adopted for each grid point to aggregate the features from the keypoints with multi-scale context. All grid points’ aggregated features can then be jointly used for the succeeding proposal refinement. Our proposed PV-RCNN effectively takes advantages of both point-based and voxel-based networks to encode discriminative features at each box proposal for accurate confidence prediction and fine-grained box refinement.
Our contributions can be summarized into four-fold. (1) We propose PV-RCNN framework which effectively takes advantages of both the voxel-based and point-based methods for 3D point-cloud feature learning, leading to improved performance of 3D object detection with manageable memory consumption. (2) We propose the voxelto-keypoint scene encoding scheme, which encodes multiscale voxel features of the whole scene to a small set of keypoints by the voxel set abstraction layer. These keypoint features not only preserve accurate location but also encode rich scene context, which boosts the 3D detection performance significantly. (3) We propose a multi-scale RoI feature abstraction layer for grid points in each proposal, which aggregates richer context information from the scene with multiple receptive fields for accurate box refinement and confidence prediction. (4) Our proposed method PV-RCNN outperforms all previous methods with remarkable margins and ranks 1st on the highly competitive KITTI 3D detection benchmark [10], ans also surpasses previous methods on the large-scale Waymo Open dataset with a large margin.
由于三维目标检测在自动驾驶和机器人等领域的广泛应用,它受到了工业界和学术界越来越多的关注。激光雷达传感器广泛应用于自动驾驶车辆和机器人中,以稀疏不规则的点云形式捕捉三维场景信息,为三维场景感知和理解提供重要线索。在本文中,我们提出通过设计新的 point-voxel 集成网络来实现高性能的三维目标检测,从而从不规则点云中学习更好的三维特征。
现有的大多数三维检测方法根据点云表示可分为两类,即基于网格的方法和基于点的方法。基于网格的方法通常将不规则点云转换为规则表示,例如3D voxel 或2D鸟瞰图,3D或2D卷积神经网络(CNN)可以有效地处理这些点云,以学习用于3D检测的点特征。在 PointNet 及其变体(PointNet++) 的推动下,基于点的方法可直接从原始点云中提取鉴别特征,用于3D检测。一般来说,基于网格的方法计算效率更高,但不可避免的信息丢失会降低细粒度定位精度,而基于点的方法计算成本更高,但通过point set abstraction 很容易获得更大的感受野。我们提出了一个统一框架,该框架可以整合这两种方法,并以显著的优势超越现有的最先进的3D检测方法。
我们提出了一种新的三维目标检测框架PVRCNN(如图1所示),它结合了基于点和基于体素的特征学习方法的优点,提高了三维检测性能。PV-RCNN的原理在于,基于体素的操作可以有效地对多尺度特征表示进行编码,并可以生成高质量的3D proposals,而基于PointNet的 SA 操作可以通过灵活的感受野保留准确的位置信息。我们认为,这两种类型的特征学习框架的集成可以帮助学习更具辨别力的特征,从而实现精确的细粒度 box refinement。
主要的挑战是如何将两种类型的特征学习方案有效地结合到一个统一的框架中,特别是3D voxel CNN和稀疏卷积 以及基于 PointNet 的SA操作。直观的解决方案是在每个3D proposal 中均匀地采样几个网格点,并采用 SA 来聚合围绕这些网格点的3D体素特征,以便 proposal refinement。然而,这种策略是高度内存密集型的,因为体素的数量和网格点的数量都可能非常大,才能实现令人满意的性能。
因此,为了更好地集成这两种类型的点云特征学习网络,我们提出了一种两步策略,第一步是体素到关键点场景编码,第二步是关键点到网格RoI特征提取。具体而言,采用具有3D稀疏卷积的 voxel CNN进行体素特征学习和生成精确 proposal 。为了缓解上述需要太多体素来编码整个场景的问题,通过最远点采样(FPS)选择一小组关键点,以总结体素特征的整体3D信息。通过PointNet-based 的集合抽象对相邻的体素特征进行分组,以汇总多尺度点云信息,从而聚合每个关键点的特征。这种方式是通过少量具有相关多尺度特征的关键点对整个场景进行有效且高效的编码。
对于第二个关键点到网格RoI特征提取步骤,考虑到每个 box proposal 及其网格点位置,提出了一个RoI-grid Pooling 模块,在该模块的keypoint SA 层中每个网格点采用多个半径,从关键点多尺度的上下文信息中,来聚合网格点的特征。然后,可以将所有网格点的聚合特征拼接用于后续proposal refinement。我们提出的PV-RCNN有效地利用了基于点和基于体素的网络的优点,在每个 box proposal 中对易辨别的特征进行编码,以实现准确的置信度预测和细粒度(fine-grained) box refinement。
我们的贡献可以概括为四个方面。
(1) 我们提出了PV-RCNN框架,该框架有效地利用了基于体素和基于点的三维点云特征学习方法,在可管理的内存消耗下提高了三维目标检测的性能。
(2) 我们提出了体素到关键点场景编码方案,该方案通过voxel SA层将整个场景的多尺度体素特征编码为一个小的关键点集。这些关键点特征不仅保留了精确的位置,还包含丰富的场景信息,显著提高了3D检测性能。
(3) 我们为每个proposal 中的网格点提出了一个多尺度RoI特征抽象层,该层通过多个感受野聚合场景中丰富的上下文信息,以实现精确的框细化和置信度预测。
(4) 我们提出的方法PV-RCNN比以前所有的方法都有显著的优势,在竞争激烈的KITTI 3D检测基准中排名第一,对比在Waymo数据集上的方法也有很大的优势。
3D Object Detection with Grid-based Methods. To tackle the irregular data format of point clouds, most existing works project the point clouds to regular grids to be processed by 2D or 3D CNN. The pioneer work MV3D [1] projects the point clouds to 2D bird view grids and places lots of predefined 3D anchors for generating 3D bounding boxes, and the following works [11, 17, 16] develop better strategies for multi-sensor fusion while [36, 35, 12] propose more efficient frameworks with bird view representation. Some other works [27, 41] divide the point clouds into 3D voxels to be processed by 3D CNN, and 3D sparse convolution [5] is introduced [34] for efficient 3D voxel processing. [30, 42] utilizes multiple detection heads while [26] explores the object part locations for improving the performance. These grid-based methods are generally efficient for accurate 3D proposal generation but the receptive fields are constraint by the kernel size of 2D/3D convolutions.
3D Object Detection with Point-based Methods. FPointNet [22] first proposes to apply PointNet [23, 24] for 3D detection from the cropped point clouds based on the 2D image bounding boxes. PointRCNN [25] generates 3D proposals directly from the whole point clouds instead of 2D images for 3D detection with point clouds only, and the following work STD [37] proposes the sparse to dense strategy for better proposal refinement. [21] proposes the hough voting strategy for better object feature grouping. These pointbased methods are mostly based on the PointNet series, especially the set abstraction operation [24], which enables flexible receptive fields for point cloud feature learning.
Representation Learning on Point Clouds. Recently representation learning on point clouds has drawn lots of attention for improving the performance of point cloud classification and segmentation. In terms of 3D detection, previous methods generally project the point clouds to regular bird view grids [1, 36] or 3D voxels [41, 2] for processing point clouds with 2D/3D CNN. 3D sparse convolution [6, 5] are adopted in [34, 26] to effectively learn sparse voxel-wise features from the point clouds. Qi et al. [23, 24] proposes the PointNet to directly learn point-wise features from the raw point clouds, where set abstraction operation enables flexible receptive fields by setting different search radii. [19] combines both voxelbased CNN and point-based SharedMLP for efficient point cloud feature learning. In comparison, our proposed PVRCNN takes advantages from both the voxel-based feature learning (i.e., 3D sparse convolution) and PointNet-based feature learning (i.e., set abstraction operation) to enable both high-quality 3D proposal generation and flexible receptive fields for improving the 3D detection performance.
基于网格(体素)的三维目标检测方法。为了解决点云的不规则数据格式问题,大多数现有工作将点云投影到规则网格上,由 2D 或 3D CNN进行处理。先锋作品 MV3D 将点云投影到二维鸟瞰栅格,并放置大量预定义的 3D anchors,用于生成 3D bounding box,以下作品[11、17、16]开发了更好的多传感器融合策略,同时[36、35、12]提出了更有效的鸟瞰图框架。其他一些工作[27,41]将点云划分为3D体素,由3D CNN进行处理,并引入3D稀疏卷积[5],以实现高效的3D体素处理[34]。[30,42]使用多个探测头,而[26]研究目标局部位置以提高性能。这些基于网格的方法通常能有效地生成精确的3D提案,但感受野受到2D/3D卷积核大小的限制。
基于点的三维目标检测方法。FPointNet 首先提出应用PointNet 从基于2D图像边界框的裁剪点云进行3D检测。PointRCNN 直接从整个点云生成3D proposal,STD 提出了从稀疏到密集的策略,以更好地细化 proposal。[21]提出了hough投票策略,以实现更好的对象特征分组。这些基于点的方法大多基于 PointNet 系列,尤其是 set abstraction operation,它为点云特征学习提供了灵活的感受野。
在点云上进行表征学习。近年来,为了提高点云分类和分割的性能,点云表征学习受到了广泛关注。在3D检测方面,以前的方法通常将点云投影到常规鸟瞰栅格或3D体素,以便使用2D/3D CNN处理点云。[34,26]中采用了3D稀疏卷积[6,5],以有效地从点云中学习稀疏体素特征。Qi等人[23,24]提出 PointNet 可以直接从原始点云中学习点特征,其中SA操作通过设置不同的搜索半径来实现灵活的感受野。[19] 结合基于体素的CNN和基于点的SharedMLP,实现高效的点云特征学习。相比之下,我们提出的PVRCNN利用了基于体素的特征学习(即3D稀疏卷积)和基于PointNet 的特征学习(即集合抽象操作)的优点,实现了高质量的3D proposal 生成和灵活的感受野,从而提高了3D检测性能。
In this paper, we propose the PointVoxel-RCNN (PVRCNN), which is a two-stage 3D detection framework aiming at more accurate 3D object detection from point clouds. State-of-the-art 3D detection approaches are based on either 3D voxel CNN with sparse convolution or PointNet-based networks as the backbone. Generally, the 3D voxel CNNs with sparse convolution are more efficient and are able to generate high-quality 3D object proposals, while the PointNet-based methods can capture more accurate contextual information with flexible receptive fields.
Our PV-RCNN deeply integrates the advantages of two types of networks. As illustrated in Fig. 2, the PV-RCNN consists of a 3D voxel CNN with sparse convolution as the backbone for efficient feature encoding and proposal generation. Given each 3D object proposal, to effectively pool its corresponding features from the scene, we propose two novel operations: the voxel-to-keypoint scene encoding, which summarizes all the voxels of the overall scene feature volumes into a small number of feature keypoints, and the point-to-grid RoI feature abstraction, which effectively aggregates the scene keypoint features to RoI grids for proposal confidence prediction and location refinement.
在本文中,我们提出了 PVRCNN,这是一个两阶段的3D检测框架,旨在从点云中更精确地检测3D对象。最先进的3D检测方法基于稀疏卷积的 3D voxel CNN 或基于点网的网络作为 backbone。通常,具有稀疏卷积的 3D voxel CNN 效率更高,能够生成高质量的 3D object proposals,而基于点网的方法可以通过灵活的感受野捕获更准确的上下文信息。
我们的PV-RCNN深度集成了两种网络的优势。如图2所示,PV-RCNN由3D voxel CNN组成,稀疏卷积作为有效特征编码和生成 proposal 的backbone。考虑到每个3D object proposal,为了有效地汇集场景中相应的特征,我们提出了两种新的操作:voxel-to-keypoint (体素到关键点)场景编码(将整个场景的所有体素特征汇总为少量关键点特征)和 point-to-grid (点到网格) RoI 特征提取,它有效地将场景关键点特征聚合到 RoI网格,用于提供可信度预测和位置优化。
Voxel CNN with 3D sparse convolution is a popular choice by state-of-the-art 3D detectors for efficiently converting the point clouds into sparse 3D feature volumes. Because of its high efficiency and accuracy, we adopt it as the backbone of our framework for feature encoding and 3D proposal generation.
3D voxel CNN. The input points P are first divided into small voxels with spatial resolution of L × W × H, where the features of the non-empty voxels are directly calculated as the mean of point-wise features of all inside points. The commonly used features are the 3D coordinates and reflectance intensities. The network utilizes a series of 3 × 3 × 3 3D sparse convolution to gradually convert the point clouds into feature volumes with 1×; 2×, 4×, 8× downsampled sizes. Such sparse feature volumes at each level could be viewed as a set of voxel-wise feature vectors.
3D proposal generation. By converting the encoded 8×downsampled 3D feature volumes into 2D bird-view feature maps, high-quality 3D proposals are generated following the anchor-based approaches. Specifically, we stack the 3D feature volume along the Z axis to obtain the L/8× W/8 bird-view feature maps. Each class has 2 × L/8 × W/8 3D anchor boxes which adopt the average 3D object sizes of this class, and two anchors of 0°,90° orientations are evaluated for each pixel of the bird-view feature maps. As shown in Table 4, the adopted 3D voxel CNN backbone with anchor-based scheme achieves higher recall performance than the PointNet-based approaches.
具有3D稀疏卷积的体素CNN是最先进的3D探测器的常用选择,用于有效地将点云转换为稀疏3D特征体。由于其高效性和准确性,我们采用它作为特征编码和生成 3D proposal 的backbone。
3D voxel CNN。首先将输入点P划分为空间分辨率为L×W×H的小体素,其中非空体素的特征直接计算为所有内部点的逐点特征的平均值。常用的特征是3D坐标和反射强度。该网络利用一系列3×3×3的3D稀疏卷积将点云逐渐转换为1×,2×,4×,8×下采样尺寸的特征体。每个层次上的稀疏特征体可以看作是一组体素特征向量。
3D proposal 生成。通过将编码的8倍下采样三维特征体转换为二维鸟瞰特征图,按照基于anchor的方法生成高质量的3D proposals。具体来说,我们沿着Z轴堆叠3D特征体,以获得L/8×W/8鸟瞰特征图。每个类别有2×L/8×W/8个 3D anchor框,采用该类别的平均3D对象大小,并为鸟瞰特征地图的每个像素评估两个0°、90°方向的anchor框。如表4所示,采用基于 anchor 的 3D voxel CNN backbone 比基于点的方法实现了更高的召回性能。
Discussions. State-of-the-art detectors mostly adopt two-stage frameworks. They require pooling RoI specific features from the resulting 3D feature volumes or 2D maps for further proposal refinement. However, these 3D feature volumes from the 3D voxel CNN have major limitations in the following aspects. (i) These feature volumes are generally of low spatial resolution as they are downsampled by up to 8 times, which hinders accurate localization of objects in the input scene. (ii) Even if one can upsample to obtain feature volumes/maps of larger spatial sizes, they are generally still quite sparse. The commonly used trilinear or bilinearinterpolationinthe RoIPooling/RoIAlignoperationscan only extract features from very small neighborhoods (i.e., 4 and 8 nearest neighbors for bilinear and trilinear interpolation respectively). The conventional pooling approaches would therefore obtain features with mostly zeros and waste much computation and memory for stage-2 refinement.
On the other hand, the set abstraction operation proposed in the variants of PointNet has shown the strong capability of encoding feature points from a neighborhood of an arbitrary size. We therefore propose to integrate a 3D voxel CNN with a series of set abstraction operations for conducting accurate and robust stage-2 proposal refinement.
A naive solution of using the set abstraction operation for pooling the scene feature voxels would be directly aggregating the multi-scale feature volume in a scene to the RoI grids. However, this intuitive strategy simply occupies much memory and is inefficient to be used in practice. For instance, a common scene from the KITTI dataset might result in 18000 voxels in the 4× downsampled feature volumes. If one uses 100 box proposal for each scene and each box proposal has 3 × 3 × 3 grids. The 2700 × 18000 pairwise distances and feature aggregations cannot be efficiently computed, even after distance thresholding.
To tackle this issue, we propose a two-step approach to first encode voxels at different neural layers of the entire scene into a small number of keypoints and then aggregate keypoint features to RoI grids for box proposal refinement.
最先进的探测器大多采用 two-stage 框架。它们需要从生成的3D特征卷或2D地图中汇集特定于RoI的特征,以进一步完善 proposal。然而,来自3D voxel CNN的这些3D特征体在以下方面存在主要限制。(i) 这些特征体通常具有较低的空间分辨率,因为它们的下采样次数达到了8次,这阻碍了输入场景中目标的准确定位。(ii)即使可以上采样以获得更大空间尺寸的特征体/图,它们通常仍然非常稀疏。常用的三线性或双线性插值 RoIPooling/ROIAlling Operations 只能从非常小的邻域(双线性为4个和三线性插值为8个最近邻域)中提取特征。因此,传统的池化方法将获得大部分为零的特征,并浪费大量计算和内存进行第二阶段优化。
另一方面,PointNet++中提出的 SA 操作已经展示了从任意大小的邻域中编码特征点的强大能力。因此,我们建议将 3D voxel CNN与一系列SA操作相结合,从而进行精确且鲁棒的第二阶段 proposal 优化。
使用SA操作汇集场景体素特征的一个简单解决方案是直接将场景中的多尺度特征体聚合到RoI网格中。然而,这种直观的策略只会占用大量内存,在实践中使用效率低下。例如,在KITTI数据集中,进行4倍下采样时会产生 18000 个体素特征。如果每个场景使用100个box proposal ,每个box proposal 有 3×3×3 的网格。即使通过距离阈值过滤之后,也很难有效地计算 2700×18000 个成对的距离特征。
为了解决这个问题,我们提出了一种两步走方法,首先将整个场景中不同神经层的体素编码转换为少量关键点,然后将关键点特征聚合到RoI网格中,以优化box proposal。
Our proposed framework first aggregates the voxels at the multiple neural layers representing the entire scene into a small number of keypoints, which serve as a bridge between the 3D voxel CNN feature encoder and the proposal refinement network.
Keypoints Sampling. Specifically, we adopt the Furthest Point Sampling (FPS) algorithm to sample a small number of n keypoints K = {p1,…,pn} from the point clouds P, where n = 2048 for the KITTI dataset and n = 4096 for the Waymo dataset. Such a strategy encourages that the keypoints are uniformly distributed around non-empty voxels and can be representative to the overall scene.
Voxel Set Abstraction Module. We propose the Voxel Set Abstraction (VSA) module to encode the multi-scale semantic features from the 3D CNN feature volumes to the keypoints. The set abstraction operation proposed by [24] is adopted for the aggregation of voxel-wise feature volumes. The surrounding points of keypoints are now regular voxels with multi-scale semantic features encoded by the 3D voxel CNN from the multiple levels, instead of the neighboring raw points with features learned from PointNet.
我们提出的框架首先将多层神经网络的 voxel 聚合为少量的关键点,作为 3D voxel CNN特征编码器和优化 proposal 网络之间的桥梁。
关键点取样, 具体来说,我们采用最远点采样(FPS)算法从点云集合P中采样少量的n个关键点K={p1,…,pn},其中KITTI数据集的 n=2048,Waymo数据集的 n=4096。这种方式使得关键点均匀分布在非空体素周围,并且可以代表整个场景。
体素集抽象模块, 我们提出了体素集合抽象(VSA)模块对 3D CNN特征体到关键点的多尺度语义特征进行编码。在论文[24]中 作者提出了 SA 操作来提取体素特征。和 PointNet 关键点是从原始点的邻域中学习到的特征 不同,VSA的关键点特征是经过多层编码的 3D voxel CNN,由多尺度语义体素特征中学习到的特征。
Specifically, denote F (lk) = {f1 (lk), …, fn (lk)} as the set of voxel-wise feature vectors in the k-th level of 3D voxel CNN, V (lk) = {V1 (lk), …, Vn(lk)} as their 3D coordinates calculated by the voxel indices and actual voxel sizes of the k-th level, where Nk is the number of non-empty voxels in the k-th level. For each keypoint pi, we first identify its neighboring non-empty voxels at the k-th level within a radius rk to retrieve the set of voxel-wise feature vectors as Si,where we concatenate the local relative coordinates vj - pi to indicate the relative location of semantic voxel feature fj. The voxel-wise features within the neighboring voxel set Si of pi are then transformed by a PointNet-block to generate the feature for the key point pi as fi, where M() denotes randomly sampling at most Tk voxels from the neighboring set Si for saving computations, G() denotes a multi-layer perceptron network to encode the voxel-wise features and relative locations. Although the number of neighboring voxels varies across different keypoints, the along-channel max-pooling operation max() maps the diverse number of neighboring voxel feature vectors to a feature vector fi for the key point pi. Generally, we also set multiple radii rk at the k-th level to aggregate local voxel-wise features with different receptive fields for capturing richer multi-scale contextual information.
The above voxel set abstraction is performed at different levels of the 3D voxel CNN, and the aggregated features from different levels can be concatenated to generate the multi-scale semantic feature for the key point pi,fi,where the generated feature fi incorporates both the 3D voxel CNN-based feature learning from voxel-wise feature fj and the PointNet-based feature learning from voxel set abstraction as Eq. (2). Besides, the 3D coordinate of pi also preserves accurate location information.
具体地说,表示 3D voxel CNN中第k层的体素特征向量集,表示它们的三维坐标,其由第k层的体素索引和实际体素大小计算,其中 Nk 是第 k 层中非空体素的数量。对于每个关键点pi,我们首先在半径 rk 内识别其相邻的非空体素,以检索体素特征向量集 Si,
其中我们 concate 局部相对坐标 vj-pi 以表示语义体素特征 fj 的相对位置。然后,关键点 pi 的相邻体素特征集 Si 通过 PointNet 进行变换,以生成关键点pi的特征,记作
其中M(·)表示从相邻集Si 中随机采样最多 Tk 个体素 以节省计算,G(·)表示多层感知机网络,用于对体素特征和相对位置进行编码。尽管相邻体素的数量在不同的关键点上有所不同,但沿通道最大池操作max(·)将不同数量的相邻体素特征向量映射到关键点 pi 的特征向量 fi。通常,我们还在第 k 层设置多个半径 rk,以聚集具有不同感受野的局部体素特征,从而捕获更丰富的多尺度上下文信息。
上述 VSA 在 3D voxel CNN 的不同层级执行,来自不同层的聚合特征可以拼接起来,为关键点pi 的特征 fi 生成多尺度语义特征,
其中生成的特征 fi 结合了基于 3D voxel CNN 的特征学习(来自体素特征fj)和基于 PointNet 的特征学习(来自VSA ,如式2 所示),此外,pi 的三维坐标还保留了精确的位置信息。
Extended VSA Module. We extend the VSA module by further enriching the keypoint features from the raw point clouds P and the 8× downsampled 2D bird-view feature maps (as described in Sec. 3.1), where the raw point clouds partially make up the quantization loss of the initial pointcloud voxelization while the 2D bird-view maps have larger receptive fields along the Z axis. The raw point-cloud feature fi(raw) is also aggregated as in Eq. (2). For the bird view feature maps, we project the keypoint pi to the 2D bird-view coordinate system, and utilize bilinear interpolation to obtain the features fi(bev) from the bird-view feature maps. Hence, the keypoint feature for pi is further enriched by concatenating all its associated features fi, which have the strong capability of preserving 3D structural information of the entire scene and can also boost the final detection performance by large margins.
VSA模块的扩展。我们通过进一步丰富原始点云 P 和 8 倍下采样2D鸟瞰特征图(如第3.1节所述)中关键点的特征来扩展VSA模块,其中,原始点云部分弥补了初始点云体素化的量化损失,而2D鸟瞰图沿Z轴具有更大的感受野。原始点云特征fi(raw) 也如式(2)所示进行聚合。对于鸟瞰特征图,我们将关键点 pi 投影到二维鸟瞰坐标系,并利用双线性插值从鸟瞰特征图中获得特征fi(bev)。 因此,pi的关键点特征通过连接其所有相关特征 fi§ 而进一步丰富,
这些特征具有保存整个场景的3D结构信息的强大能力,并且还可以大幅提高最终检测性能。
Predicted Keypoint Weighting. After the overall scene is encoded by a small number of keypoints, they would be further utilized by the succeeding stage for conducting proposal refinement. The keypoints are chosen by the Further Point Sampling strategy and some of them might only represent the background regions. Intuitively, keypoints belonging to the foreground objects should contribute more to the accurate refinement of the proposals, while the ones from the background regions should contribute less.
Hence, we propose a Predicted Keypoint Weighting (PKW) module (see Fig. 3) to re-weight the keypoint features with extra supervisions from point-cloud segmentation. The segmentation labels can be directly generated by the 3D detection box annotations, i.e. by checking whether each key point is inside or outside of a ground-truth 3D box since the 3D objects in autonomous driving scenes are naturally separated in 3D space. The predicted feature weighting for each keypoint’s feature fi( p)~ can be formulated as fi( p)~, where A(·) is a three-layer MLP network with a sigmoid function to predict foreground confidence between. The PKW module is trained by focal loss [18] with default hyper-parameters for handling the unbalanced number of foreground/background points in the training set.
预测关键点权重。在通过少量关键点对整个场景进行编码后,后续阶段将进一步利用这些关键点进行 proposal 优化。关键点由 FPS 算法选择,其中一些点可能只代表背景区域。直观地说,属于前景对象的关键点应该对 proposal 的精确细化贡献更大,而来自背景区域的关键点应该贡献更少。
因此,我们提出了一个预测关键点加权(PKW)模块(见图3),通过点云分割的额外监督来对关键点特征重新加权。由于自动驾驶场景中的3D对象在3D空间中很容易划分,因此可以通过3D标注的检测框直接生成分割标签,即通过检查每个关键点是位于真实3D框的内部还是外部。每个关键点增加预测权重后的特征可以表示为
其中A(·)是一个三层MLP网络,带有一个sigmoid函数,用于预测关键点的前景置信度,值在[0,1]之间。PKW模块采用 focal loss 进行训练,默认超参数用于处理训练集中数量不平衡的 前景/背景 点。
In the previous step, the whole scene is summarized into a small number of keypoints with multi-scale semantic features. Given each 3D proposal (RoI) generated by the 3D voxel CNN, the features of each RoI need to be aggregated from the keypoint features F = {f1( p), … , fn§} for accurate and robust proposal refinement. We propose the keypoint-to-grid RoI feature abstraction based on the set abstraction operation for multi-scale RoI feature encoding.
在之前的步骤中,整个场景被描述为具有多尺度语义特征的少量关键点。对于 3D voxel CNN 生成的 RoI(3D proposal),每个 RoI 的特征需要从关键点的特征中进行聚合,以便准确而稳定的优化 proposal。针对多尺度 RoI 特征编码,提出了基于 SA 操作的 keypoint-to-grid 方法对 RoI 进行特征提取。
RoI-grid Pooling via Set Abstraction. Given each 3D RoI, as shown in Fig. 4, we propose the RoI-grid pooling module to aggregate the keypoint features to the RoI-grid points with multiple receptive fields. We uniformly sample 6 × 6 × 6 grid points within each 3D proposal, which are denoted as G = {g1, … , g216}. The set abstraction operation is adopted to aggregate the features of grid points from the keypoint features. Specifically, we firstly identify the neighboring keypoints of grid point gi within a radius r as ψ, where pj − gi is appended to indicate the local relative location of features fj( p) from keypoint pj. Then a PointNet-block is adopted to aggregate the neighboring keypoint feature set ψ to generate the feature for grid point gi as fi(g), where M(·) and G(·) are defined as the same in Eq. (2). We set multiple radii r and aggregate keypoint features with different receptive fields, which are concatenated together for capturing richer multi-scale contextual information.
After obtaining each grid aggregated features from its surrounding keypoints, all RoI-grid features of the same RoI can be vectorized and transformed by a two-layer MLP with 256 feature dimensions to represent the overall proposal.
Compared with the point cloud 3D RoI pooling operations in previous works [25, 37, 26], our proposed RoI-grid pooling operation targeting the keypoints is able to capture much richer contextual information with flexible receptive fields, where the receptive fields are even beyond the RoI boundaries for capturing the surrounding keypoint features outside the 3D RoI, while the previous state-of-theart methods either simply average all point-wise features within the proposal as the RoI feature [25], or pool many uninformative zeros as the RoI features [26, 37].
通过SA实现 RoI-grid 池化。对于每个3D RoI,如图4所示,我们提出了RoI-grid 池化模块,将关键点特征聚合到具有多个感受野的 RoI 网格点上。我们在每个3D方案中统一采样 6×6×6 的网格点,表示为G={g1,…,g216}。采用SA(集合抽象)操作将关键点特征聚合成网格点的特征。具体来说,我们首先将网格点 gi 在半径 r 内的相邻关键点识别为ψ,
其中pj−gi 是指相对于关键点 pj 的局部相对位置。然后采用 PointNet 模块聚合相邻关键点的特征集ψ,生成网格点 gi 的特征 fi(g),
其中M(·)和G(·)和等式(2)中的定义相同。我们设置多个半径r,并聚合具有不同感受野的关键点特征,将这些特征拼接在一起,以获取更丰富的多尺度上下文信息。
在从周围的关键点获取每个网格聚合特征后,同一RoI的所有RoI网格特征都可以通过一个具有256个特征维度的两层MLP进行矢量化和转换,以代表整个 proposal。(每一个roi proposal都可以被 256个 特征表示。)
与以前工作中的点云3D RoI池化操作相比,我们提出的针对关键点的 RoI-grid 池化操作能够捕获更丰富的上下文信息,具有灵活的感受野,此外,在捕捉3D RoI 周围关键点特征时,感受野甚至超出了RoI的边界。而之前的最先进的方法要么简单地将 proposal 中的所有点特征平均为RoI特征,要么将许多不具信息性的零集合为RoI特征。
3D Proposal Refinement and Confidence Prediction.Given the RoI feature of each box proposal, the proposal refinement network learns to predict the size and location (i.e., center, size and orientation) residuals relative to the input 3D proposal. The refinement network adopts a 2-layer MLP and has two branches for confidence prediction and box refinement respectively.
For the confidence prediction branch, we follow to adopt the 3D Intersection-over-Union (IoU) between the 3D RoIs and their corresponding ground-truth boxes as the training targets. For the k-th 3D RoI, its confidence training target yk is normalized to be between [0, 1] as yk, where IoUk is the IoU of the k-th RoI w.r.t. its ground-truth box. Our confidence branch is then trained to minimize the cross-entropy loss on predicting the confidence targets, L, where yk is the predicted score by the network. Our experiments in Table 9 show that this quality-aware confidence prediction strategy achieves better performance than the traditional classification targets.
The box regression targets of the box refinement branch are encoded by the traditional residual-based method as in and are optimized by smooth-L1 loss function.
3D proposal 优化和置信度预测。proposal 优化网络利用每个 box proposal 的 RoI 特征学习并预测相对于输入3D 框的大小和位置(即中心点、大小和方向)。优化网络采用两层MLP,并有两个分支分别用于 置信度预测 和 box 回归。
对于置信度预测分支,我们采用 3D ROI 与其对应的 ground-truth 之间的 3D IoU 作为训练目标。对于第k个 3D RoI,其置信度训练目标 yk 通常被归一化为[0,1]之间,
其中 IoU k 是第k个 RoI 与真实框之间的IoU。(个人认为这里的公式是对 iou 小于0.5的ROI 减小一下置信度target,对 iou 大于0.5的 ROI 提升一下置信度)
然后用交叉熵损失(使得预测置信目标L最小化)训练置信度分支:
其中 yk~ 是网络的预测值。我们在表9中的实验表明,这种基于 iou 的置信度target设置策略比传统的分类target (0或1) 具有更好的性能。
box 优化分支的 box 回归 target 使用传统的差值方法(利用anchor)进行编码,并使用smooth-L1损失函数进行优化。
The proposed PV-RCNN framework is trained end-toend with the region proposal loss Lrpn, keypoint segmentation loss Lseg and the proposal refinement loss Lrcnn. (1) We adopt the same region proposal loss Lrpn with [34] as Lrpn, where the anchor classification loss Lcls is calculated with focal loss with default hyper-parameters and smoothL1 loss is utilized for anchor box regression with the predicted residual ∆ra~ and the regression target ∆ra. (2) The keypoint segmentation loss Lseg is also calculated by the focal loss as mentioned in Sec. 3.2. (3) The proposal refinement loss Lrcnn includes the IoU-guided confidence prediction loss Liou and the box refinement loss as Lrcnn, where ∆rp~ is the predicted box residual and ∆rp is the proposal regression target which are encoded same with ∆ra.
The overall training loss are then the sum of these three losses with equal loss weights. Further training loss details are provided in the supplementary file.
提出的 PV-RCNN 框架通过 区域proposal损失Lrpn、关键点分割损失Lseg 和 建议细化损失Lrcnn进行端到端训练。
(1) 我们采用与[34]相同的区域 proposal 损失Lrpn,
其中 anchor 分类损失Lcls使用 focal loss 计算,smooth-L1 loss 用于anchor的 box 回归,∆ra~表示预测的差值,∆ra 表示回归目标target的差值。
(2) 关键点分割损失Lseg也通过第3.2节中提到的 focal loss进行计算。
(3) proposal refinement 损失 Lrcnn 包括: IoU引导的置信度预测损失 Liou 和 box refinement 损失,
其中∆rp~是预测的box差值,∆rp是proposal 回归目标的差值, 其编码方式与 ∆ra 一致。
总的训练损失是这三个损失的总和,损失权重相等。
In this section, we introduce the implementation details of our PV-RCNN framework (Sec. 4.1) and compare with previous state-of-the-art methods on both the highly competitive KITTI dataset (Sec. 4.2) and the newly introduced large-scale Waymo Open Dataset (Sec. 4.3). In Sec. 4.4, we conduct extensive ablation studies to investigate each component of PV-RCNN to validate our design.
在本节中,我们将介绍我们的PV-RCNN框架(第4.1节)的实现细节,并与 KITTI数据集(第4.2节)和 大规模Waymo开放数据集(第4.3节)上的最先进方法进行比较。在 4.4节,我们进行了广泛的消融实验,以研究PV-RCNN的每个组件,验证我们的设计。
Datasets. KITTI Dataset is one of the most popular dataset of 3D detection for autonomous driving. There are 7481 training samples and 7518 test samples, where the training samples are generally divided into the train split (3712 samples) and the val split (3769 samples). We compare PV-RCNN with state-of-the-art methods on both the val split and the test split on the online learderboard.
Waymo Open Dataset is a recently released and currently the largest dataset of 3D detection for autonomous driving. There are totally 798 training sequences with around 158361 LiDAR samples, and 202 validation sequences with 40077 LiDAR samples. It annotated the objects in the full 360° field instead of 90° in KITTI dataset. We evaluate our model on this large-scale dataset to further validate the effectiveness of our proposed method.
Network Architecture. As shown in Fig. 2, the 3D voxel CNN has four levels with feature dimensions 16, 32, 64;, 64, respectively. Their two neighboring radii rk of each level in the VSA module are set as (0.4m, 0.8m), (0.8m, 1.2m), (1.2m, 2.4m), (2.4m, 4.8m), and the neighborhood radii of set abstraction for raw points are (0.4m, 0.8m). For the proposed RoI-grid pooling operation, we uniformly sample 6 × 6 × 6 grid points in each 3D proposal and the two neighboring radii r~ of each grid point are (0.8m, 1.6m).
For the KITTI dataset, the detection range is within [0, 70.4]m for the X axis, [−40, 40]m for the Y axis and [−3, 1]m for the Z axis, which is voxelized with the voxel size (0.05m, 0.05m, 0.1m) in each axis. For the Waymo Open dataset, the detection range is [−75.2, 75.2]m for the X and Y axes and [−2, 4]m for the Z axis, and we set the voxel size to (0.1m, 0.1m, 0.15m).
Training and Inference Details. Our PV-RCNN framework is trained from scratch in an end-to-end manner with the ADAM optimizer. For the KITTI dataset, we train the entire network with the batch size 24, learning rate 0.01 for 80 epochs on 8 GTX 1080 Ti GPUs, which takes around 5 hours. For the Waymo Open Dataset, we train the entire network with batch size 64, learning rate 0.01 for 30 epochs on 32 GTX 1080 Ti GPUs. The cosine annealing learning rate strategy is adopted for the learning rate decay. For the proposal refinement stage, we randomly sample 128 proposals with 1:1 ratio for positive and negative proposals, where a proposal is considered as a positive proposal for box refinement branch if it has at least 0.55 3D IoU with the ground-truth boxes, otherwise it is treated as a negative proposal.
During training, we utilize the widely adopted data augmentation strategy of 3D object detection, including random flipping along the X axis, global scaling with a random scaling factor sampled from [0.95, 1.05], global rotation around the Z axis with a random angle sampled from [− π/4, π/4]. We also conduct the ground-truth sampling augmentation [34] to randomly “paste” some new ground-truth objects from other scenes to the current training scenes, for simulating objects in various environments.
For inference, we keep the top-100 proposals generated from the 3D voxel CNN with a 3D IoU threshold of 0.7 for non-maximum-suppression (NMS). These proposals are further refined in the proposal refinement stage with aggregated keypoint features. We finally use an NMS threshold of 0:01 to remove the redundant boxes.
数据集。KITTI数据集是最流行的自动驾驶3D检测数据集之一。共有7481个训练样本和7518个测试样本,其中训练样本通常分为训练序列(3712个样本)和val序列(3769个样本)。我们将PV-RCNN与排行版上 val 和 test 部分 的最新方法进行了比较。
Waymo Open Dataset是最近发布的,也是目前最大的自动驾驶3D检测数据集。共有798个训练序列,约158361个激光雷达样本,202个验证序列,40077个激光雷达样本。它在完整的360° 范围中注释对象,而不是像 KITTI 数据集中注释90°范围的目标。我们在这个大规模数据集上评估了我们的模型,以进一步验证我们提出的方法的有效性。
网络架构。如图2所示,3D voxel CNN有四个级别,特征尺寸分别为16、32、64、64。VSA模块中每层的两个相邻半径rk设置为 (0.4m,0.8m), (0.8m,1.2m), (1.2m,2.4m), (2.4m,4.8m),原始点 SA 的邻域半径为(0.4m,0.8m)。对于 proposal 的RoI网格池化操作,我们在每个3D proposal 中统一采用 6×6×6 个网格点,每个网格点的两个相邻半径r~为(0.8m,1.6m)。
对于KITTI数据集,X轴的检测范围在[0, 70.4]m 范围内, Y轴为[-40, 40]m,以及Z轴为[-3, 1]m,每个轴上的体素大小(0.05m,0.05m,0.1m)对其进行体素化。对于Waymo开放数据集,检测范围 X轴和Y轴为 [-75.2, 75.2]m,以及Z轴为 [-2, 4]m,我们将体素大小设置为(0.1m,0.1m,0.15m)。
训练和推理细节。我们的PV-RCNN框架是通过ADAM优化器以端到端的方式从头开始训练的。对于KITTI数据集,我们在8个GTX 1080 Ti GPU上训练整个网络,batch size 为24,学习率为0.01,持续80个epochs,这大约需要5小时。对于Waymo开放数据集,我们在32个GTX 1080 Ti GPU上对整个网络进行了 batch size 为64、学习率为0.01的30个epochs的训练。学习率衰减采用余弦退火学习率策略。对于 proposal 细化阶段,我们以1:1的比例随机抽样128份 proposal,其中,如果 proposal 与 gt_boxes 的 3d iou 大于 0.55,则视为正样本,否则将被视为负样本。
在训练过程中,我们采用了 3D 目标检测数据增强策略,包括沿 X 轴随机翻转、在[0.95,1.05]之间采用随机比例因子的全局缩放、在[− π/4, π/4]之间进行 Z 轴的随机旋转角度。 我们还进行了 gt_box 采样增强,将一些新的 gt_box 对象从其他场景随机“粘贴”到当前训练场景中,以模拟各种环境中的对象。
在推理阶段,我们保留了3D voxel CNN生成的前100个 proposal(NMS的3D IoU阈值为0.7)。这些 proposal 在 proposal refinement 阶段聚合 关键点特征 进一步细化。我们最终使用0.01的 NMS 阈值来移除冗余框。
To evaluate the proposed model’s performance on the KITTI val split, we train our model on the train set and report the results on the val set. To conduct evaluation on the test set with the KITTI official test server, the model is trained with 80% of all available train+val data and the remaining 20% data is used for validation.
Evaluation Metric. All results are evaluated by the mean average precision with a rotated IoU threshold 0.7 for cars and 0.5 for cyclists. The mean average precisions on the test set are calculated with 40 recall positions on the official KITTI test server [10]. The results on the val set in Table 2 are calculated with 11 recall positions to compare with the results by the previous works.
Comparison with state-of-the-art methods. Table 1 shows the performance of PV-RCNN on the KITTI test set from the official online leaderboard as of Nov. 15th, 2019. For the most important 3D object detection benchmark of the car class, our method outperforms previous state-of-theart methods with remarkable margins, i.e. increasing the mAP by 1.58%, 1.72%, 1.73% on easy, moderate and hard difficulty levels, respectively. For the bird-view detection of the car class, our method also achieves new state-of-theart performance on the easy and moderate difficulty levels while dropping slightly on the hard difficulty level. For 3D detection and bird-view detection of cyclist, our methodsout performs previous LiDAR-only methods with large margins on the moderate and hard difficulty levels while achieving comparable performance on the easy difficulty level. Note that we train a single model for both the car and cyclist detection instead of separate models for each class as previous methods [34, 12, 25, 37] do.
As of Nov. 15th, 2019, our method currently ranks 1st on the car 3D detection leaderboard among all methods including both the RGB+LiDAR methods and LiDAR-only methods, and ranks 1st on the cyclist 3D detection leaderboard among all published LiDAR only methods. The significant improvements manifest the effectiveness of the PV-RCNN.
We also report the performance of the most important car class on the KITTI val split with mAP from R11. Similarly, as shown in Table 2, our method outperforms previous stateof-the-art methods with large margins. The performance with R40 are also provided in Table 3 for reference.
为了评估所提出模型在KITTI val 集上的性能,我们在训练集上训练我们的模型,并在val集上报告结果。并且使用KITTI官方测试平台对测试集进行评估,使用了 train+val 中 80% 的数据对模型进行训练,剩余的20%数据用于验证。
评估指标。所有结果均通过 map 进行评估,汽车的IoU阈值为0.7,自行车为0.5。测试集的 map 是通过官方KITTI测试平台上的 40个召回位置计算得出的。表2中的 val集 的结果通过 11个召回位置进行计算并与之前工作的结果进行比较。
与最先进的方法进行比较。表1显示了截至2019年11月15日,PV-RCNN在官方在线排行榜KITTI测试集上的性能。对于汽车类最重要的3D目标检测基准,我们的方法优于之前的最先进的方法,具有显著的优势,即在简单、中等和硬难度水平上,map 分别增加1.58%、1.72%、1.73%。对于汽车类的鸟瞰检测,我们的方法在简单和中等难度水平上也取得了新的技术水平,而在硬难度水平上则略有下降。对于自行车的3D检测和鸟瞰检测,我们的方法在中等难度和高难度水平上优于之前仅使用激光雷达的方法,同时在容易难度水平上实现了相当的性能。请注意,我们为汽车和自行车检测训练了一个单一模型,而不是像以前的方法[34,12,25,37]那样为每个类别训练单独的模型。
截至2019年11月15日,我们的方法目前在所有方法中(包括RGB+激光雷达方法和仅使用激光雷达的方法)在汽车3D检测排行榜上排名第。在所有已发布的仅使用激光雷达的方法中,我们的方法在自行车3D检测排行榜上排名第一。这些显著的改进表明了PV-RCNN的有效性。
表2中 我们给出了KITTI val split上R11的map。类似地,表3还提供了R40的性能,以供参考。我们的方法比以前最先进的方法有很大的优势。
To further validate the effectiveness of our proposed PVRCNN, we evaluate the performance of PV-RCNN on the newly released large-scale Waymo Open Dataset.
Evaluation Metric. We adopt the official released evaluation tools for evaluating our method, where the mean average precision (mAP) and the mean average precision weighted by heading (mAPH) are used for evaluation. The rotated IoU threshold is set as 0.7 for vehicle detection and 0.5 for pedestrian / cyclist. The test data are split in two ways. The first way is based on objects’ different distances to the sensor: 0 − 30m, 30 − 50m and > 50m. The second way is to split the data into two difficulty levels, where the LEVEL 1 denotes the ground-truth objects with at least 5 inside points while the LEVEL 2 denotes the ground-truth objects with at least 1 inside points or the ground-truth objects manually marked as LEVEL 2.
Comparison with state-of-the-art methods. Table 5 shows that our method outperforms previous state-of-theart [40] significantly with a 7.37% mAP gain for the 3D object detection and a 2.56% mAP gain for the bird-view object detection. The results show that our method achieves remarkably better mAP on all distance ranges of interest, where the maximum gain is 9:19% for the 3D detection in the range of 30 − 50m, which validates that our proposed multi-level point-voxel integration strategy is able to effectively capture more accurate contextual information for improving the 3D detection performance. As shown in Table 5, our method also achieves superior performance in terms of mAPH, which demonstrates that our model predicted accurate heading direction for the vehicles. The results on the LEVEL 2 difficult level are also reported in Table 5 for reference, and we could see that our method performs well even for the objects with fewer than 5 inside points. The experimental results on the large-scale Waymo Open dataset further validate the generalization ability of our proposed framework on various datasets.
Better performance for multi-class detection with more proposals. To evaluate the performance of our method for multi-class detection, we further conduct experiments on the latest Waymo Open Dataset (version 1.2 released in March 2020). Here the number of proposals is increased from 100 to 500 since we only train a single model for detecting all three categories (e.g., vehicle, pedestrian and cyclist). As shown in Table 6, our method significantly surpasses previous methods on all difficulty levels of these three categories. We hope it could set up a strong baseline on the Waymo Open Dataset for future works.
为了进一步验证我们提出的PVRCNN的有效性,我们在新发布的大规模Waymo开放数据集上评估了PV-RCNN的性能。
评估指标。我们采用官方发布的评估工具对我们的方法进行评估,其中使用 mAP 和 加权map(mAPH)进行评估。对于车辆检测,旋转IoU阈值设置为0.7,对于行人/骑车人,旋转IoU阈值设置为0.5。测试数据分为两种方式,第一种方法基于对象到传感器的不同距离:0− 30米,30米−50米 和 >50米。第二种方法是将数据分为两个难度级别,其中级别1表示 gt 内至少包含有5个点,级别2表示 gt 内至少有1个点 或 手动标记为级别2的 gt 对象。
与最先进的方法进行比较。表5显示,我们的方法明显优于之前的方法[40],3D对象检测的map增益为7.37%,鸟瞰对象检测的 map 增益为2.56%。结果表明,我们的方法在所有的距离范围内都取得了显著的效果,在30%的范围内,3D检测达到了最大增益为9.19%, 这验证了我们提出的多层次 point-voxel 集成策略能够有效地捕获更准确的上下文信息,从而提高3D检测性能。如表5所示,我们的方法在mAPH方面也取得了优异的性能,这表明我们的模型预测了车辆的准确航向。表5中还报告了第2级难度的结果,以供参考。我们可以看到,即使对于内部点少于5个的对象,我们的方法也表现良好。在大规模Waymo开放数据集上的实验结果进一步验证了我们提出的框架在各种数据集上的泛化能力。
多类检测性能更好,proposal 更多。为了评估我们的多类检测方法的性能,我们进一步在最新的Waymo开放数据集(2020年3月发布的1.2版)上进行了实验。这里的 proposals 数量从100个增加到500个,因为我们只训练一个模型来检测所有三个类别(例如,车辆、行人和自行车)。如表6所示,我们的方法在这三个类别的所有难度水平上都显著优于以前的方法。我们希望它能为未来的工作在Waymo开放数据集上建立一个强大的基线。
In this section, we conduct extensive ablation experiments to analyze individual components of our proposed method. All models are trained on the train split and evaluated on the val split for the car class of KITTI dataset [4].
Effects of voxel-to-keypoint scene encoding. We validate the effectiveness of voxel-to-keypoint scene encoding strategy by comparing with the native solution that directly aggregating multi-scale feature volumes of encoder to the RoI-grid points as mentioned in Sec. 3.1. As shown in the 2nd and 3rd rows of Table 7, the voxel-to-keypoint scene encoding strategy contributes significantly to the performance in all three difficulty levels. This benefits from that the keypoints enlarge the receptive fields by bridging the 3D voxel CNN and RoI-grid points, and the segmentation supervision of keypoints also enables a better multi-scale feature learning from the 3D voxel CNN. Besides, a small set of keypoints as the intermediate feature representation also decreases the GPU memory usage when compared with the directly pooling strategy.
Effects of different features for VSA module. In Table 8, we investigate the importance of each feature component of keypoints in Eq. (3) and Eq. (4). The 1st row shows that the performance drops a lot if we only aggregate features from fi (raw), since the shallow semantic information is not enough for the proposal refinement. The high level semantic information from fi (pv3), fi (pv4) and fi (bev) improves the performance significantly as shown in 2nd to 5th rows. As shown in last four rows, the additions of relative shallow semantic features fi (pv1), fi (pv2), fi (raw) further improves the performance slightly and the best performance is achieved with all the feature components as the keypoint features.
Effects of PKW module. We propose the predicted keypoint weighting (PKW) module in Sec. 3.2 to re-weight the point-wise features of keypoint with extra keypoint segmentation supervision. Table 9 (1st and 4th rows) shows that removing the PKW module drops performance a lot, which demonstrates that the PKW module enables better multi-scale feature aggregation by focusing more on the foreground keypoints, since they are more important for the succeeding proposal refinement network.
Effects of RoI-grid pooling module. We investigate the effects of RoI-grid pooling module by replacing it with the RoI-aware pooling [26] and keeping the other modules consistent. Table 9 shows that the performance drops significantly when replacing RoI-grid pooling module, which validates that our proposed set abstraction based RoI-grid pooling could learn much richer contextual information, and the pooled features also encode more discriminative RoI features by pooling more effective features with large search radii for each grid point. 1st and 2nd rows of Table 7 also shows that comparing with the 3D voxel RPN, the performance increases a lot after the proposal is refined by the features aggregated from the RoI-grid pooling module.
在本节中,我们将进行广泛的烧蚀实验,以分析我们提出的方法的各个组成部分。所有模型都在KITTI数据集[4]的 car 类别的 train部分上进行训练,并在 val部分上进行评估。
voxel-to-keypoint 场景编码的影响。通过与直接将编码器的多尺度特征体积聚合到RoI网格点上的方案比较,我们验证了 voxel-to-point 编码策略的有效性。 如表7的第2行和第3行所示,voxel-to-keypoint 场景编码策略对所有三个难度级别的性能都有显著影响。这得益于关键点扩大了链接 3D voxel CNN 和 RoI-grid point 的感受野,并且关键点的分割监督部分也可以从3D voxel CNN中学习更好的多尺度特征。此外,与直接池化策略相比,作为表示中间特征的关键点也减少了GPU内存使用。
VSA模块不同功能的影响。在表8中,我们研究了等式(3)和等式(4)中关键点的每个特征分量的重要性。第一行显示,如果我们只聚合来自fi(raw)的特征,性能会下降很多,因为浅层语义信息不足以用于 proposal 细化。来自fi(pv3)、fi(pv4)和fi(bev)的高级语义信息显著提高了性能,如第2到第5行所示。最后四行展示了,相对浅层语义特征fi(pv1)、fi(pv2)、fi(raw)的添加进一步略微提高了性能,并且在所有特征组件作为关键点特征的情况下实现了最佳性能。
PKW模块的作用。我们在Sec3.2 中提出了预测关键点加权(PKW)模块。通过额外的关键点分割监督,重新加权关键点的逐点特征。表9(第1行和第4行)显示,删除PKW模块会大大降低性能,这表明PKW模块通过更多地关注前景关键点,实现更好的多尺度特征聚合,因为它们对后续 proposal refinement 网络是很重要的。
RoI网格池化模块的效果。我们研究了RoI-grid Pooling 模块的效果,将其替换为RoI-aware Pooling,并保持其他模块的一致性。表9显示,当替换掉 RoI-grid Pooling 模块时,性能显著下降,这验证了我们提出的基于 SA 的 RoI-grid Pooling 可以学习更丰富的上下文信息,每一个 grid point 通过一个大半径的范围去池化更多的有效特征而得到池化特征,从而可以对更具有辨别力的 RoI 特征进行编码。表7的第一行和第二行还显示,与3D voxel RPN相比,在通过 RoI-grid Pooling 模块聚合的特征细化 proposal 后,性能提高了很多。
We have presented the PV-RCNN framework, a novel method for accurate 3D object detection from point clouds. Our method integrates both the multi-scale 3D voxel CNN features and the PointNet-based features to a small set of keypoints by the new proposed voxel set abstraction layer, and the learned discriminative features of keypoints are then aggregated to the RoI-grid points with multiple receptive fields to capture much richer context information for the fine-grained proposal refinement. Experimental results on the KITTI dataset and the Waymo Open dataset demonstrate that our proposed voxel-to-keypoint scene encoding and keypoint-to-grid RoI feature abstraction strategy significantly improve the 3D object detection performance compared with previous state-of-the-art methods.
我们提出了PV-RCNN框架,这是一种从点云中精确检测三维目标的新方法。我们的方法通过新提出的voxel set abstraction 层,将多尺度3D voxel CNN特征和 基于 PointNet 的特征集成到少量的关键点集合上,然后将学习到的关键点特征通过多个感受野聚合到 RoI网格点上,以获取更丰富的上下文信息,用于细粒度的 proposal refinement。在KITTI数据集和Waymo Open数据集上的实验结果表明,我们提出的 voxel-to-keypoint 场景编码和 keypoint-to-grid RoI 特征提取策略与以前最先进的方法相比,显著提高了三维目标检测性能。