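Paper: RandLA-Net: Efficient Semantic Segmentation of Large-Scale Point Clouds
Code: code
Preface: I had been hesitating over whether to keep translating point cloud papers, until I noticed that the author of this one is a former senior student from my own university whose research journey is genuinely inspiring. That convinced me to read his paper carefully.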
We study the problem of efficient semantic segmentation for large-scale 3D point clouds. By relying on expensive sampling techniques or computationally heavy pre/post-processing steps, most existing approaches are only able to be trained and operate over small-scale point clouds. In this paper, we introduce RandLA-Net, an efficient and lightweight neural architecture to directly infer per-point semantics for large-scale point clouds. The key to our approach is to use random point sampling instead of more complex point selection approaches. Although remarkably computation and memory efficient, random sampling can discard key features by chance. To overcome this, we introduce a novel local feature aggregation module to progressively increase the receptive field for each 3D point, thereby effectively preserving geometric details. Extensive experiments show that our RandLA-Net can process 1 million points in a single pass, up to 200× faster than existing approaches. Moreover, our RandLA-Net clearly surpasses state-of-the-art approaches for semantic segmentation on two large-scale benchmarks, Semantic3D and SemanticKITTI.
Efficient semantic segmentation of large-scale 3D point clouds is a fundamental and essential capability for real-time intelligent systems, such as autonomous driving and augmented reality. A key challenge is that the raw point clouds acquired by depth sensors are typically irregularly sampled, unstructured and unordered. Although deep convolutional networks show excellent performance in structured 2D computer vision tasks, they cannot be directly applied to this type of unstructured data.
Recently, the pioneering work PointNet has emerged as a promising approach for directly processing 3D point clouds. It learns per-point features using shared multilayer perceptrons (MLPs). This is computationally efficient but fails to capture wider context information for each point. To learn richer local structures, many dedicated neural modules have been subsequently and rapidly introduced. These modules can be generally categorized as: 1) neighbouring feature pooling, 2) graph message passing, 3) kernel-based convolution, and 4) attention-based aggregation. Although these approaches achieve impressive results for object recognition and semantic segmentation, almost all of them are limited to extremely small 3D point clouds (e.g., 4k points or 1×1 meter blocks) and cannot be directly extended to larger point clouds (e.g., millions of points and up to 200×200 meters) without preprocessing steps such as block partition. The reasons for this limitation are three-fold. 1) The commonly used point-sampling methods of these networks are either computationally expensive or memory inefficient. For example, the widely employed farthest-point sampling takes over 200 seconds to sample 10% of 1 million points. 2) Most existing local feature learners usually rely on computationally expensive kernelisation or graph construction, thereby being unable to process a massive number of points. 3) For a large-scale point cloud, which usually consists of hundreds of objects, the existing local feature learners are either incapable of capturing complex structures, or do so inefficiently, due to their limited size of receptive fields.
A handful of recent works have started to tackle the task of directly processing large-scale point clouds. SPG preprocesses the large point clouds as super graphs before applying neural networks to learn per super-point semantics. Both FCPN and PCT combine voxelization and point-level networks to process massive point clouds. Although they achieve decent segmentation accuracy, the preprocessing and voxelization steps are too computationally heavy to be deployed in real-time applications.
In this paper, we aim to design a memory and computationally efficient neural architecture, which is able to directly process large-scale 3D point clouds in a single pass, without requiring any pre/post-processing steps such as voxelization, block partitioning or graph construction. However, this task is extremely challenging as it requires: 1) a memory and computationally efficient sampling approach to progressively downsample large-scale point clouds to fit in the limits of current GPUs, and 2) an effective local feature learner to progressively increase the receptive field size to preserve complex geometric structures. To this end, we first systematically demonstrate that random sampling is a key enabler for deep neural networks to efficiently process large-scale point clouds. However, random sampling can discard key information, especially for objects with sparse points. To counter the potentially detrimental impact of random sampling, we propose a new and efficient local feature aggregation module to capture complex local structures over progressively smaller point-sets.
Amongst existing sampling methods, farthest point sampling and inverse density sampling are the most frequently used for small-scale point clouds. As point sampling is such a fundamental step within these networks, we investigate the relative merits of different approaches in Section 3.2, where we see that the commonly used sampling methods limit scaling towards large point clouds, and act as a significant bottleneck to real-time processing. However, we identify random sampling as by far the most suitable component for large-scale point cloud processing as it is fast and scales efficiently. Random sampling is not without cost, because prominent point features may be dropped by chance and it cannot be used directly in existing networks without incurring a performance penalty. To overcome this issue, we design a new local feature aggregation module in Section 3.3, which is capable of effectively learning complex local structures by progressively increasing the receptive field size in each neural layer. In particular, for each 3D point, we firstly introduce a local spatial encoding (LocSE) unit to explicitly preserve local geometric structures. Secondly, we leverage attentive pooling to automatically keep the useful local features. Thirdly, we stack multiple LocSE units and attentive poolings as a dilated residual block, greatly increasing the effective receptive field for each point. Note that all these neural components are implemented as shared MLPs, and are therefore remarkably memory and computationally efficient.
Overall, being built on the principles of simple random sampling and an effective local feature aggregator, our efficient neural architecture, named RandLA-Net, not only is up to 200× faster than existing approaches on large-scale point clouds, but also surpasses the state-of-the-art semantic segmentation methods on both Semantic3D [17] and SemanticKITTI [3] benchmarks. Figure 1 shows qualitative results of our approach. Our key contributions are:
• We analyse and compare existing sampling approaches, identifying random sampling as the most suitable component for efficient learning on large-scale point clouds.
• We propose an effective local feature aggregation module to preserve complex local structures by progressively increasing the receptive field for each point.
• We demonstrate significant memory and computational gains over baselines, and surpass the state-of-the-art semantic segmentation methods on multiple large-scale benchmarks.
To extract features from 3D point clouds, traditional approaches usually rely on hand-crafted features. Recent learning-based approaches mainly include projection-based, voxel-based and point-based schemes, which are outlined here.
(1) Projection and Voxel Based Networks. To leverage the success of 2D CNNs, many works project/flatten 3D point clouds onto 2D images to address the task of object detection. However, geometric details may be lost during the projection. Alternatively, point clouds can be voxelized into 3D grids, to which powerful 3D CNNs are then applied. Although they achieve leading results on semantic segmentation and object detection, their primary limitation is the heavy computation cost, especially when processing large-scale point clouds.
(2) Point Based Networks. Inspired by PointNet/PointNet++, many recent works introduced sophisticated neural modules to learn per-point local features. These modules can be generally classified as 1) neighbouring feature pooling, 2) graph message passing, 3) kernel-based convolution, and 4) attention-based aggregation. Although these networks have shown promising results on small point clouds, most of them cannot directly scale up to large scenarios due to their high computational and memory costs. Compared with them, our proposed RandLA-Net is distinguished in three ways: 1) it only relies on random sampling within the network, thereby requiring much less memory and computation; 2) the proposed local feature aggregator can obtain successively larger receptive fields by explicitly considering the local spatial relationship and point features, thus being more effective and robust for learning complex local patterns; 3) the entire network only consists of shared MLPs without relying on any expensive operations such as graph construction and kernelisation, therefore being superbly efficient for large-scale point clouds.
(3) Learning for Large-scale Point Clouds. SPG preprocesses the large point clouds as superpoint graphs to learn per super-point semantics. The recent FCPN and PCT apply both voxel-based and point-based networks to process the massive point clouds. However, both the graph partitioning and voxelisation are computationally expensive. In contrast, our RandLA-Net is end-to-end trainable without requiring additional pre/post-processing steps.
Overview
As illustrated in Figure 2, given a large-scale point cloud with millions of points spanning up to hundreds of meters, processing it with a deep neural network inevitably requires those points to be progressively and efficiently downsampled in each neural layer, without losing the useful point features. In our RandLA-Net, we propose to use the simple and fast approach of random sampling to greatly decrease point density, whilst applying a carefully designed local feature aggregator to retain prominent features. This allows the entire network to achieve an excellent trade-off between efficiency and effectiveness.
The Quest for Efficient Sampling
Existing point sampling approaches can be roughly classified into heuristic and learning-based approaches. However, there is still no standard sampling strategy that is suitable for large-scale point clouds. Therefore, we analyse and compare their relative merits and complexity as follows.
Heuristic Sampling
Farthest Point Sampling (FPS): In order to sample $K$ points from a large-scale point cloud $P$ with $N$ points, FPS returns a reordering of the metric space $\{p_1 \cdots p_k \cdots p_K\}$, such that each $p_k$ is the farthest point from the first $k-1$ points. FPS is widely used for semantic segmentation of small point sets. Although it has a good coverage of the entire point set, its computational complexity is $O(N^2)$. For a large-scale point cloud ($N \sim 10^6$), FPS takes up to 200 seconds to process on a single GPU. This shows that FPS is not suitable for large-scale point clouds.
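To make the quadratic cost concrete, here is a minimal NumPy sketch of greedy FPS. It is an illustration written for this post, not the authors' implementation; the function name `farthest_point_sampling` and the choice of index 0 as the starting point are assumptions. Each of the K iterations scans all N points, so the cost is O(N·K), which grows as O(N²) when K is a fixed fraction of N.

```python
import numpy as np

def farthest_point_sampling(points, k, start=0):
    """Greedy FPS: repeatedly pick the point farthest from the selected set."""
    n = points.shape[0]
    selected = np.empty(k, dtype=np.int64)
    # Squared distance from every point to its nearest selected point so far.
    min_dist = np.full(n, np.inf)
    selected[0] = start  # assumption: start from an arbitrary fixed index
    for i in range(1, k):
        last = points[selected[i - 1]]
        dist = np.sum((points - last) ** 2, axis=1)  # one full pass over N points
        min_dist = np.minimum(min_dist, dist)
        selected[i] = int(np.argmax(min_dist))
    return selected

rng = np.random.default_rng(0)
cloud = rng.random((2000, 3))
fps_idx = farthest_point_sampling(cloud, 200)
```

The loop cannot be skipped or parallelised across iterations, because each pick depends on all previous picks.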
Inverse Density Importance Sampling (IDIS): To sample $K$ points from $N$ points, IDIS reorders all $N$ points according to the density of each point, after which the top $K$ points are selected. Its computational complexity is approximately $O(N)$. Empirically, it takes 10 seconds to process $10^6$ points. Compared with FPS, IDIS is more efficient, but also more sensitive to outliers. However, it is still too slow for use in a real-time system.
Random Sampling (RS): Random sampling uniformly selects $K$ points from the original $N$ points. Its computational complexity is $O(1)$, which is agnostic to the total number of input points, i.e., it is constant-time and hence inherently scalable. Compared with FPS and IDIS, random sampling has the highest computational efficiency, regardless of the scale of input point clouds. It only takes 0.004s to process $10^6$ points.
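For comparison, random sampling needs no distance computation at all. The sketch below uses NumPy's `Generator.choice`; the helper name `random_sampling` is made up for this post. Note that drawing without replacement in NumPy may itself touch all N indices, so this illustrates how simple the operation is rather than reproducing the constant-time figure quoted above.

```python
import numpy as np

def random_sampling(points, k, rng):
    """Uniformly pick k distinct points; no pairwise distances are computed."""
    idx = rng.choice(points.shape[0], size=k, replace=False)
    return points[idx], idx

rng = np.random.default_rng(0)
cloud = rng.random((1_000_000, 3)).astype(np.float32)
sub, idx = random_sampling(cloud, 100_000, rng)  # keep 10% of the points
```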
Learning-based Sampling
Generator-based Sampling (GS): GS learns to generate a small set of points to approximately represent the original large point set. However, FPS is usually used in order to match the generated subset with the original set at inference stage, incurring additional computation. In our experiments, it takes up to 1200 seconds to sample 10% of $10^6$ points.
Continuous Relaxation based Sampling (CRS): CRS approaches use the reparameterization trick to relax the sampling operation to a continuous domain for end-to-end training. In particular, each sampled point is learnt based on a weighted sum over the full point cloud. It results in a large weight matrix when sampling all the new points simultaneously with a one-pass matrix multiplication, leading to an unaffordable memory cost. For example, it is estimated to take more than a 300 GB memory footprint to sample 10% of $10^6$ points.
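A back-of-the-envelope check of that memory figure, under the assumption (not stated in the text) that the weight matrix is dense, has one row per sampled point and one column per input point, and stores 32-bit floats:

```python
n_points = 10**6
n_samples = n_points // 10     # sample 10% of the cloud
bytes_per_weight = 4           # assumption: float32 weights

matrix_bytes = n_samples * n_points * bytes_per_weight
matrix_gb = matrix_bytes / 1e9
print(f"{matrix_gb:.0f} GB")   # prints "400 GB"
```

Under these assumptions the matrix alone is about 400 GB, which is in line with the "more than 300 GB" estimate.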
Policy Gradient based Sampling (PGS): PGS formulates the sampling operation as a Markov decision process. It sequentially learns a probability distribution to sample the points. However, the learnt probability has high variance due to the extremely large exploration space when the point cloud is large. For example, to sample 10% of $10^6$ points, the exploration space is $C_{10^6}^{10^5}$ (the number of ways to choose $10^5$ points out of $10^6$) and it is unlikely to learn an effective sampling policy. We empirically find that the network is difficult to converge if PGS is used for large point clouds.
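To get a feel for how large that exploration space is, the sketch below estimates the number of decimal digits of the binomial coefficient using the standard library's `math.lgamma`. The helper name `log10_binomial` is made up for this post.

```python
import math

def log10_binomial(n, k):
    """log10 of C(n, k), computed via log-gamma to avoid huge integers."""
    ln = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return ln / math.log(10)

digits = log10_binomial(10**6, 10**5)
```

The result is a number with roughly 140,000 decimal digits, which gives some intuition for why a policy is unlikely to explore this space effectively.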
Overall, FPS, IDIS and GS are too computationally expensive to be applied to large-scale point clouds. CRS approaches have an excessive memory footprint and PGS is hard to learn. By contrast, random sampling has the following two advantages: 1) it is remarkably computationally efficient as it is agnostic to the total number of input points, 2) it does not require extra memory for computation. Therefore, we safely conclude that random sampling is by far the most suitable approach to process large-scale point clouds compared with all existing alternatives. However, random sampling may result in many useful point features being dropped. To overcome this, we propose a powerful local feature aggregation module as presented in the next section.
Local Feature Aggregation
As shown in Figure 3, our local feature aggregation module is applied to each 3D point in parallel and it consists of three neural units: 1) local spatial encoding (LocSE), 2) attentive pooling, and 3) dilated residual block.
(1) Local Spatial Encoding
Given a point cloud $P$ together with per-point features (e.g., raw RGB, or intermediate learnt features), this local spatial encoding unit explicitly embeds the x-y-z coordinates of all neighbouring points, such that the corresponding point features are always aware of their relative spatial locations. This allows the LocSE unit to explicitly observe the local geometric patterns, thus eventually benefiting the entire network to effectively learn complex local structures. In particular, this unit includes the following steps:
Finding Neighbouring Points. For the $i$-th point, its neighbouring points are firstly gathered by the simple K nearest neighbours (KNN) algorithm for efficiency. The KNN is based on the point-wise Euclidean distances.
Relative Point Position Encoding. For each of the nearest $K$ points $\{p_i^1 \cdots p_i^k \cdots p_i^K\}$ of the center point $p_i$, we explicitly encode the relative point position as follows:
$$r_i^k = \mathrm{MLP}\big(p_i \oplus p_i^k \oplus (p_i - p_i^k) \oplus \|p_i - p_i^k\|\big)$$
where $p_i$ and $p_i^k$ are the x-y-z positions of points, $\oplus$ is the concatenation operation, and $\|\cdot\|$ calculates the Euclidean distance between the neighbouring and center points. It seems that $r_i^k$ is encoded from redundant point positions. Interestingly, this tends to aid the network to learn local features and obtains good performance in practice.
Point Feature Augmentation. For each neighbouring point $p_i^k$, the encoded relative point positions $r_i^k$ are concatenated with its corresponding point features $f_i^k$, obtaining an augmented feature vector $\hat{f}_i^k$.
Eventually, the output of the LocSE unit is a new set of neighbouring features $\hat{F}_i = \{\hat{f}_i^1 \cdots \hat{f}_i^k \cdots \hat{f}_i^K\}$, which explicitly encodes the local geometric structures for the center point $p_i$. We notice that the recent work [36] also uses point positions to improve semantic segmentation. However, the positions are used to learn point scores in [36], while our LocSE explicitly encodes the relative positions to augment the neighbouring point features.
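The three LocSE steps can be sketched in NumPy as follows. This is an illustrative reading of the text, not the released code: it assumes the relative position encoding concatenates the centre position, the neighbour position, their difference and their Euclidean distance (10 values per neighbour), and it stands in a single linear layer with ReLU for the shared MLP. The names `knn_indices`, `locse`, `weight` and `bias` are made up here, and the brute-force KNN is only suitable for small inputs.

```python
import numpy as np

def knn_indices(points, k):
    """Brute-force K nearest neighbours by Euclidean distance (includes the point itself)."""
    d2 = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    return np.argsort(d2, axis=1)[:, :k]

def locse(points, feats, k, weight, bias):
    """Return augmented neighbour features of shape (N, K, d_r + d_f)."""
    idx = knn_indices(points, k)                         # (N, K) neighbour indices
    neigh = points[idx]                                  # (N, K, 3) neighbour positions
    centre = np.broadcast_to(points[:, None, :], neigh.shape)
    rel = centre - neigh                                 # relative position
    dist = np.linalg.norm(rel, axis=-1, keepdims=True)   # Euclidean distance
    geo = np.concatenate([centre, neigh, rel, dist], axis=-1)  # (N, K, 10)
    r = np.maximum(geo @ weight + bias, 0.0)             # assumption: one-layer shared MLP + ReLU
    return np.concatenate([r, feats[idx]], axis=-1)      # point feature augmentation

rng = np.random.default_rng(0)
pts = rng.random((256, 3))
f = rng.random((256, 8))
W = rng.standard_normal((10, 8)) * 0.1
b = np.zeros(8)
aug = locse(pts, f, k=16, weight=W, bias=b)              # shape (256, 16, 16)
```

Because the same `weight` and `bias` are applied to every neighbour of every point, the encoding is a shared MLP in the sense used throughout the paper.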
(2) Attentive Pooling
This neural unit is used to aggregate the set of neighbouring point features $\hat{F}_i$. Existing works typically use max/mean pooling to hard integrate the neighbouring features, resulting in the majority of the information being lost. By contrast, we turn to the powerful attention mechanism to automatically learn important local features. In particular, inspired by [65], our attentive pooling unit consists of the following steps.
Computing Attention Scores. Given the set of local features $\hat{F}_i = \{\hat{f}_i^1 \cdots \hat{f}_i^k \cdots \hat{f}_i^K\}$, we design a shared function $g()$ to learn a unique attention score for each feature. Basically, the function $g()$ consists of a shared MLP followed by softmax. It is formally defined as follows:
$$s_i^k = g(\hat{f}_i^k, W)$$
where $W$ is the learnable weights of a shared MLP.
Weighted Summation. The learnt attention scores can be regarded as a soft mask which automatically selects the important features. Formally, these features are weighted summed as follows:
$$\tilde{f}_i = \sum_{k=1}^{K} \big(\hat{f}_i^k \cdot s_i^k\big)$$
To summarize, given the input point cloud $P$, for the $i$-th point $p_i$, our LocSE and Attentive Pooling units learn to aggregate the geometric patterns and features of its $K$ nearest points, and finally generate an informative feature vector $\tilde{f}_i$.
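A minimal NumPy sketch of attentive pooling, under the assumption that the shared function g() is a single linear layer followed by a softmax taken over the K neighbours, and that scores and features are combined element-wise before summing. The names `attentive_pooling` and `w_score` are made up for this illustration.

```python
import numpy as np

def attentive_pooling(aug, w_score):
    """aug: (N, K, d) neighbour features. Returns pooled (N, d) and scores (N, K, d)."""
    logits = aug @ w_score                                # shared linear layer
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    scores = np.exp(logits)
    scores = scores / scores.sum(axis=1, keepdims=True)   # softmax over the K neighbours
    pooled = (aug * scores).sum(axis=1)                   # weighted summation
    return pooled, scores

rng = np.random.default_rng(0)
aug = rng.random((256, 16, 16))
w_score = rng.standard_normal((16, 16)) * 0.1
pooled, scores = attentive_pooling(aug, w_score)
```

Unlike max pooling, every neighbour contributes to the output, with a learnt weight per feature channel.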
(3) Dilated Residual Block
Since the large point clouds are going to be substantially downsampled, it is desirable to significantly increase the receptive field for each point, such that the geometric details of input point clouds are more likely to be preserved, even if some points are dropped. As shown in Figure 3, inspired by the successful ResNet [19] and the effective dilated networks, we stack multiple LocSE and Attentive Pooling units with a skip connection as a dilated residual block.
To further illustrate the capability of our dilated residual block, Figure 4 shows that the red 3D point observes $K$ neighbouring points after the first LocSE/Attentive Pooling operation, and then is able to receive information from up to $K^2$ neighbouring points, i.e. its two-hop neighbourhood, after the second. This is a cheap way of dilating the receptive field and expanding the effective neighbourhood through feature propagation. Theoretically, the more units we stack, the more powerful this block becomes, as its sphere of reach becomes greater and greater. However, more units would inevitably sacrifice the overall computation efficiency. In addition, the entire network is likely to be over-fitted. In our RandLA-Net, we simply stack two sets of LocSE and Attentive Pooling as the standard residual block, achieving a satisfactory balance between efficiency and effectiveness.
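The "up to K²" bound on the two-hop neighbourhood can be checked on a toy point cloud. The sketch below builds a brute-force KNN graph and counts, for every point, the distinct points reachable through the neighbours of its neighbours. The helper `knn` and the sizes used are illustrative choices, not values from the paper.

```python
import numpy as np

def knn(points, k):
    """Brute-force K nearest neighbours (each point is its own first neighbour)."""
    d2 = np.sum((points[:, None] - points[None]) ** 2, axis=-1)
    return np.argsort(d2, axis=1)[:, :k]

rng = np.random.default_rng(0)
pts = rng.random((500, 3))
K = 8
idx = knn(pts, K)

# idx[idx[i]] has shape (K, K): the neighbours of each neighbour of point i.
two_hop_sizes = [len(set(idx[idx[i]].ravel().tolist())) for i in range(len(pts))]
```

In practice the two-hop sets overlap heavily, so the count is usually well below K², which is why the text says "up to".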
Overall, our local feature aggregation module is designed to effectively preserve complex local structures via explicitly considering neighbouring geometries and significantly increasing receptive fields. Moreover, this module only consists of feed-forward MLPs, thus being computationally efficient.
We implement RandLA-Net by stacking multiple local feature aggregation modules and random sampling layers. The detailed architecture is presented in the Appendix. We use the Adam optimizer with default parameters. The initial learning rate is set as 0.01 and decreases by 5% after each epoch. The number of nearest points $K$ is set as 16. To train our RandLA-Net in parallel, we sample a fixed number of points (∼$10^5$) from each point cloud as the input. During testing, the whole raw point cloud is fed into our network to infer per-point semantics without pre/post-processing such as geometrical or block partition. All experiments are conducted on an NVIDIA RTX2080Ti GPU.
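The learning-rate schedule described above can be written in one line, assuming "decreases by 5% after each epoch" means a multiplicative decay of 0.95 per epoch. The function name `learning_rate` is made up for this sketch.

```python
def learning_rate(epoch, base_lr=0.01, decay=0.95):
    """Learning rate at a given epoch, assuming multiplicative decay per epoch."""
    return base_lr * decay ** epoch

schedule = [learning_rate(e) for e in range(5)]
```

Under this reading the rate is 0.01 in the first epoch and 0.0095 in the second, and it halves roughly every 13 to 14 epochs.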
In this section, we empirically evaluate the efficiency of existing sampling approaches including FPS, IDIS, RS, GS, CRS, and PGS, which have been discussed in Section 3.2. In particular, we conduct the following 4 groups of experiments.
Group 1. Given a small-scale point cloud (∼$10^3$ points), we use each sampling approach to progressively downsample it. Specifically, the point cloud is downsampled by five steps on a single GPU, with only 25% of the points being retained in each step, i.e. a four-fold decimation ratio. This means that there are only ∼$(1/4)^5 \times 10^3$ points left in the end. This downsampling strategy emulates the procedure used in PointNet++ [44]. For each sampling approach, we sum up its time and memory consumption for comparison.
Group 2/3/4. The total number of points is increased towards large-scale, i.e., around $10^4$, $10^5$ and $10^6$ points respectively. We use the same five sampling steps as in Group 1.
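The arithmetic behind the four experiment groups is easy to tabulate. The sketch below lists the expected number of points after each of the five four-fold decimation steps; the helper name `points_after_each_step` is made up, and fractional values are kept as-is rather than rounded.

```python
def points_after_each_step(n, steps=5, keep=0.25):
    """Expected point count before sampling and after each decimation step."""
    return [n * keep ** s for s in range(steps + 1)]

for n in (10**3, 10**4, 10**5, 10**6):
    print(n, points_after_each_step(n))
```

For the largest group, five steps reduce $10^6$ points to fewer than a thousand, so almost all of the sampling cost is paid in the first one or two steps.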
Analysis. Figure 5 compares the total time and memory consumption of each sampling approach to process different scales of point clouds. It can be seen that: 1) For small-scale point clouds (∼$10^3$), all sampling approaches tend to have similar time and memory consumption, and are unlikely to incur a heavy or limiting computation burden. 2) For large-scale point clouds (∼$10^6$), FPS/IDIS/GS/CRS/PGS are either extremely time-consuming or memory-costly. By contrast, random sampling has superior time and memory efficiency overall. This result clearly demonstrates that most existing networks are only able to be optimized on small blocks of point clouds, primarily because they rely on expensive sampling approaches. Motivated by this, we use the efficient random sampling strategy in our RandLA-Net.
In this section, we systematically evaluate the overall efficiency of our RandLA-Net on real-world large-scale point clouds for semantic segmentation. Particularly, we evaluate RandLA-Net on the SemanticKITTI dataset, obtaining the total time consumption of our network on Sequence 08, which has 4071 scans of point clouds in total. We also evaluate the time consumption of recent representative works on the same dataset. For a fair comparison, we feed the same number of points (i.e., 81920) from each scan into each neural network.
In addition, we also evaluate the memory consumption of RandLA-Net and the baselines. In particular, we not only report the total number of parameters of each network, but also measure the maximum number of 3D points each network can take as input in a single pass to infer per-point semantics. Note that all experiments are conducted on the same machine with an AMD 3700X @3.6GHz CPU and an NVIDIA RTX2080Ti GPU.
Analysis. Table 1 quantitatively shows the total time and memory consumption of different approaches. It can be seen that: 1) SPG has the lowest number of network parameters, but takes the longest time to process the point clouds due to the expensive geometrical partitioning and super-graph construction steps; 2) both PointNet++ and PointCNN are also computationally expensive, mainly because of the FPS sampling operation; 3) PointNet and KPConv are unable to take extremely large-scale point clouds (e.g. $10^6$ points) in a single pass due to their memory-inefficient operations; 4) thanks to the simple random sampling together with the efficient MLP-based local feature aggregator, our RandLA-Net takes the shortest time (185 seconds for 4071 frames, roughly 22 frames per second) to infer the semantic labels for each large-scale point cloud (up to $10^6$ points).
In this section, we evaluate the semantic segmentation of our RandLA-Net on three large-scale public datasets: the outdoor Semantic3D and SemanticKITTI, and the indoor S3DIS.
(1) Evaluation on Semantic3D. The Semantic3D dataset consists of 15 point clouds for training and 15 for online testing. Each point cloud has up to $10^8$ points, covering up to 160×240×30 meters in real-world 3D space. The raw 3D points belong to 8 classes and contain 3D coordinates, RGB information, and intensity. We only use the 3D coordinates and color information to train and test our RandLA-Net. Mean Intersection-over-Union (mIoU) and Overall Accuracy (OA) of all classes are used as the standard metrics. For fair comparison, we only include the results of recently published strong baselines and the current state-of-the-art approach KPConv.
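For readers unfamiliar with the two metrics, here is a small sketch of how mIoU and OA are commonly computed from a confusion matrix. It is a generic illustration on toy labels, not the benchmark's official evaluation script; the helper names are made up, and classes that never appear in either the labels or the predictions are given an IoU of 0 here, which is one convention among several.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are ground-truth classes, columns are predicted classes."""
    idx = y_true * num_classes + y_pred
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def miou_and_oa(cm):
    tp = np.diag(cm).astype(float)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp   # TP + FP + FN per class
    iou = tp / np.maximum(union, 1)                # choice: absent classes count as IoU 0
    return iou.mean(), tp.sum() / cm.sum()

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
cm = confusion_matrix(y_true, y_pred, 3)
miou, oa = miou_and_oa(cm)   # per-class IoU is 1/3, 2/3, 1/2
```

On this toy example the mIoU is 0.5 while the OA is 4/6, which shows how OA can look better than mIoU when errors are concentrated in particular classes.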
Table 2 presents the quantitative results of different approaches. RandLA-Net clearly outperforms all existing methods in terms of both mIoU and OA. Notably, RandLA-Net also achieves superior performance on six of the eight classes, the exceptions being low vegetation and scanning artefacts.
SemanticKITTI
(2) Evaluation on SemanticKITTI. SemanticKITTI consists of 43552 densely annotated LiDAR scans belonging to 22 sequences (00∼21). Each scan is a large-scale point cloud with ∼10^5 points, spanning up to 160×160×20 meters in 3D space. Officially, sequences 00∼07 and 09∼10 (19130 scans) are used for training, sequence 08 (4071 scans) for validation, and sequences 11∼21 (20351 scans) for online testing. The raw 3D points only have 3D coordinates, without color information. The mIoU score over 19 categories is used as the standard metric.
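The official split can be written down directly, which also makes it easy to check that the per-split scan counts add up to the quoted total. This is only a sketch: the constant and function names are hypothetical, and the zero-padded sequence IDs are an assumption about how the sequences are named on disk.

```python
# Sequence IDs as zero-padded strings (assumed naming convention).
TRAIN_SEQS = [f"{i:02d}" for i in [*range(0, 8), 9, 10]]
VAL_SEQS = ["08"]
TEST_SEQS = [f"{i:02d}" for i in range(11, 22)]

# Scan counts per split, as quoted in the text.
SCAN_COUNTS = {"train": 19130, "val": 4071, "test": 20351}


def split_of(sequence):
    """Return the split name a sequence ID belongs to."""
    if sequence in TRAIN_SEQS:
        return "train"
    if sequence in VAL_SEQS:
        return "val"
    if sequence in TEST_SEQS:
        return "test"
    raise ValueError(f"unknown sequence: {sequence}")


total_scans = sum(SCAN_COUNTS.values())
```

Here `total_scans` evaluates to 43552, matching the dataset size stated above.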
Table 3 shows a quantitative comparison of our RandLA-Net with two families of recent approaches, i.e. 1) point-based methods and 2) projection-based methods, and Figure 6 shows some qualitative results of RandLA-Net on the validation split. It can be seen that our RandLA-Net surpasses all point-based approaches by a large margin. We also outperform all projection-based methods, but not significantly, primarily because RangeNet++ achieves much better results on small object categories such as traffic-sign. However, our RandLA-Net has 40× fewer network parameters than RangeNet++ and is more computationally efficient, as it does not require the costly pre/post projection steps.
S3DIS
(3) Evaluation on S3DIS. The S3DIS dataset [2] consists of 271 rooms belonging to 6 large areas. Each point cloud is a medium-sized single room (∼20×15×5 meters) with dense 3D points. To evaluate the semantic segmentation of our RandLA-Net, we use the standard 6-fold cross-validation in our experiments. The mean IoU (mIoU), mean class Accuracy (mAcc) and Overall Accuracy (OA) over all 13 classes are compared.
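As a rough illustration of this protocol, the sketch below generates the six folds by holding out one area at a time, which is how 6-fold cross-validation on S3DIS is usually set up, and computes mAcc from a confusion matrix. The function names and the toy matrix are hypothetical; how the scores are aggregated across folds is not specified here and is left out.

```python
AREAS = [1, 2, 3, 4, 5, 6]


def leave_one_area_out(areas):
    """Yield (training areas, held-out area) pairs, one fold per area."""
    for held_out in areas:
        yield [a for a in areas if a != held_out], held_out


def mean_class_accuracy(cm):
    """mAcc: per-class recall averaged over classes that have ground-truth points.

    cm[t][p] counts points of true class t predicted as class p.
    """
    accs = [row[c] / sum(row) for c, row in enumerate(cm) if sum(row)]
    return sum(accs) / len(accs)


folds = list(leave_one_area_out(AREAS))

# Toy 3-class confusion matrix: per-class recalls are 1/2, 2/2 and 1/2.
toy_cm = [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
macc = mean_class_accuracy(toy_cm)
```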
As shown in Table 4, our RandLA-Net achieves on-par or better performance than state-of-the-art methods. Note that most of these baselines tend to use sophisticated but expensive operations or sampling schemes to optimize the networks on small blocks (e.g., 1×1 meter) of point clouds, and the relatively small rooms work in their favour because they are easily divided into tiny blocks. By contrast, RandLA-Net takes entire rooms as input and is able to efficiently infer per-point semantics in a single pass.
Since the impact of random sampling is fully studied in Section 4.1, we conduct the following ablation studies for our local feature aggregation module. All ablated networks are trained on sequences 00∼07 and 09∼10, and tested on sequence 08 of the SemanticKITTI dataset.
(1) Removing local spatial encoding (LocSE). This unit enables each 3D point to explicitly observe its local geometry. After removing LocSE, we directly feed the local point features into the subsequent attentive pooling.
(2∼4) Replacing attentive pooling by max/mean/sum pooling. The attentive pooling unit learns to automatically combine all local point features. By comparison, the widely used max/mean/sum poolings tend to hard-select or combine features, so their performance may be suboptimal.
(5) Simplifying the dilated residual block. The dilated residual block stacks multiple LocSE units and attentive poolings, substantially dilating the receptive field for each 3D point. By simplifying this block, we use only one LocSE unit and attentive pooling per layer, i.e. we do not chain multiple blocks as in our original RandLA-Net.
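The difference between the pooling variants in ablations (2∼4) can be sketched on a single neighbourhood of K feature vectors. This is a simplified illustration, not the authors' code: it assumes the attention scores come from one linear map `weights` (a hypothetical d×d matrix) followed by a softmax over the K neighbours, and it omits any MLP applied after pooling.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def max_pool(feats):
    """Hard selection: keep the largest value per channel."""
    return [max(col) for col in zip(*feats)]


def mean_pool(feats):
    return [sum(col) / len(col) for col in zip(*feats)]


def sum_pool(feats):
    return [sum(col) for col in zip(*feats)]


def attentive_pool(feats, weights):
    """Soft selection: per-channel weighted sum with learned attention.

    feats is a K x d list of neighbour features; weights is a d x d matrix
    (assumed single linear layer) producing one score per neighbour and channel.
    """
    k, d = len(feats), len(feats[0])
    scores = [[sum(f[i] * weights[i][j] for i in range(d)) for j in range(d)]
              for f in feats]
    pooled = []
    for j in range(d):
        att = softmax([scores[n][j] for n in range(k)])
        pooled.append(sum(att[n] * feats[n][j] for n in range(k)))
    return pooled


# Toy neighbourhood: K = 3 neighbours, d = 2 channels.
feats = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
zero_w = [[0.0, 0.0], [0.0, 0.0]]
uniform = attentive_pool(feats, zero_w)
```

With all-zero weights the attention is uniform, so `attentive_pool` reduces to `mean_pool`; with trained weights it can move anywhere between averaging and near-hard selection, which is the flexibility the fixed poolings lack.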
Table 5 compares the mIoU scores of all ablated networks. From this, we can see that:
- The greatest impact is caused by the removal of the chained spatial embedding and attentive pooling blocks. This is highlighted in Figure 4, which shows how using two chained blocks allows information to be propagated from a wider neighbourhood, i.e. approximately K^2 points as opposed to just K. This is especially critical with random sampling, which is not guaranteed to preserve a particular set of points.
- The removal of the local spatial encoding unit shows the next greatest impact on performance, demonstrating that this module is necessary to effectively learn local and relative geometric context.
- Removing the attention module diminishes performance because useful features are no longer retained effectively.
From this ablation study, we can see how the proposed neural units complement each other to attain our state-of-the-art performance.
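The growth of the receptive field from K to roughly K^2 points can be checked with a small sketch. It assumes a brute-force K-nearest-neighbour search on a hypothetical random point cloud and ignores the down-sampling that happens between network layers; the function names are illustrative only.

```python
import math
import random


def knn(points, i, k):
    """Indices of the k points nearest to point i (point i itself included)."""
    order = sorted(range(len(points)),
                   key=lambda j: math.dist(points[i], points[j]))
    return order[:k]


def receptive_field(points, i, k, hops):
    """Points whose features can reach point i after `hops` chained aggregations."""
    field = {i}
    for _ in range(hops):
        field = {j for p in field for j in knn(points, p, k)}
    return field


rng = random.Random(0)
points = [(rng.random(), rng.random(), rng.random()) for _ in range(300)]
K = 8
one_hop = receptive_field(points, 0, K, hops=1)   # exactly K points
two_hop = receptive_field(points, 0, K, hops=2)   # at most K * K points
```

Because neighbourhoods of nearby points overlap, the two-hop set is usually smaller than K^2, which is why the text says "approximately".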
In this paper, we demonstrated that it is possible to efficiently and effectively segment large-scale point clouds by using a lightweight network architecture. In contrast to most current approaches, which rely on expensive sampling strategies, we instead use random sampling in our framework to significantly reduce the memory footprint and computational cost. A local feature aggregation module is also introduced to effectively preserve useful features from a wide neighbourhood. Extensive experiments on multiple benchmarks demonstrate the high efficiency and the state-of-the-art performance of our approach. It would be interesting to extend our framework to end-to-end 3D instance segmentation on large-scale point clouds by drawing on recent work, and also to real-time dynamic point cloud processing [35].