What problem does it solve?
Self-supervised point cloud pretraining; compared with PointContrast, the input is simpler: single-view depth maps.
What method does it use?
instance discrimination with a momentum encoder
Point input: PointNet++ [64]
Voxel input: sparse convolution U-Net model [17]
Two additional data augmentations: random cuboid and random drop patches (see the sketch after this list).
Joint architecture: the point and voxel networks are pretrained jointly, treating the two input formats of the same scene as data augmentations of each other.
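The two extra augmentations are described only by name in the notes above; below is a minimal NumPy sketch of what a random cuboid crop and a random patch drop could look like. The function names and the ratio/radius values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def random_cuboid(points, crop_ratio=0.85):
    """Keep only the points inside a randomly placed axis-aligned cuboid.

    points: (N, 3) array of xyz coordinates.
    crop_ratio: fraction of the scene extent kept along each axis (assumed value).
    """
    lo, hi = points.min(0), points.max(0)
    size = (hi - lo) * crop_ratio
    start = lo + np.random.rand(3) * (hi - lo - size)
    mask = np.all((points >= start) & (points <= start + size), axis=1)
    return points[mask]

def random_drop_patches(points, num_patches=5, patch_radius=0.2):
    """Remove a few spherical patches around randomly chosen seed points."""
    keep = np.ones(len(points), dtype=bool)
    for _ in range(num_patches):
        center = points[np.random.randint(len(points))]
        dist = np.linalg.norm(points - center, axis=1)
        keep &= dist > patch_radius
    return points[keep]

# Example: apply both augmentations to a toy scene of 10k points.
scene = np.random.rand(10000, 3)
aug = random_drop_patches(random_cuboid(scene))
```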
How well does it work?
State-of-the-art on 9 benchmarks; the pretrained models are effective few-shot (label-efficient) learners.
We evaluate our models on 9 benchmarks for object detection, semantic segmentation, and object classification, where they achieve state-of-the-art results. Most notably, we set a new state-of-the-art for object detection on ScanNet (69.0% mAP) and SUN RGB-D (63.5% mAP). Our pretrained models are label efficient and improve performance for classes with few examples.
Recent work shows that self-supervised learning is useful to pretrain models in 3D but requires multi-view data and point correspondences.
We present a simple self-supervised pretraining method that can work with single-view depth scans acquired by varied sensors, without 3D registration and point correspondences. We pretrain standard point cloud and voxel based model architectures, and show that joint pretraining further improves performance.
Recent work [105] applies self-supervised pretraining to 3D models but uses multi-view depth scans with point correspondences. Since 3D sensors only acquire single-view depth scans, multi-view depth scans and point correspondences are typically obtained via 3D reconstruction. Unfortunately, even with good sensors, 3D reconstruction can fail easily for a variety of reasons such as non-static environments, fast camera motion or odometry drift [16].
Since different 3D applications require different 3D scene representations, such as voxels for segmentation [17] and point clouds for detection [64], we use our method for both voxels and point clouds. We jointly learn features by considering voxels and point clouds of the same 3D scene as data augmentations that are processed with their associated networks [93].
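A rough sketch of how such a joint objective could be assembled, assuming MoCo-style InfoNCE terms both within each format and across formats; the function, tensor names, and queue size are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(query, key, negatives, tau=0.07):
    """InfoNCE loss: pull query toward its positive key, push away from negatives.

    query, key: (B, D) feature vectors; negatives: (K, D) bank of negative keys.
    """
    query, key = F.normalize(query, dim=1), F.normalize(key, dim=1)
    negatives = F.normalize(negatives, dim=1)
    l_pos = (query * key).sum(dim=1, keepdim=True)       # (B, 1) positive logits
    l_neg = query @ negatives.t()                         # (B, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(len(query), dtype=torch.long)    # positive is index 0
    return F.cross_entropy(logits, labels)

# Toy features: two augmented views of the same scenes, encoded as points and voxels.
B, D, K = 8, 128, 1024
f_point_q, f_point_k = torch.randn(B, D), torch.randn(B, D)  # PointNet++ features
f_voxel_q, f_voxel_k = torch.randn(B, D), torch.randn(B, D)  # sparse U-Net features
neg_point, neg_voxel = torch.randn(K, D), torch.randn(K, D)  # momentum-encoder queues

# Within-format terms plus across-format terms (points vs. voxels of the same scene).
loss = (info_nce(f_point_q, f_point_k, neg_point) +
        info_nce(f_voxel_q, f_voxel_k, neg_voxel) +
        info_nce(f_point_q, f_voxel_k, neg_voxel) +
        info_nce(f_voxel_q, f_point_k, neg_point))
```

Each query/key pair here stands for two augmented views of the same scene; the across-format terms are what tie the PointNet++ and sparse U-Net feature spaces together.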
Our method extends the work of Wu et al. [103] to multiple 3D input formats following Tian et al. [93], using a momentum encoder [36] instead of a memory bank.
In this work, we propose to jointly pretrain two architectures for points and voxels, namely PointNet++ [68] for points and a Sparse Convolution based U-Net [17] for voxels.
We use these datasets and evaluate the performance of our methods on the indoor detection [12, 22, 64, 65, 114], scene segmentation [17, 68, 92, 101, 108], and outdoor detection tasks [15, 48, 76–78, 109, 111].
Our method, illustrated in Fig 2, is based on the instance discrimination framework from Wu et al. [103] with a momentum encoder [36].
Momentum encoder: this allows using a large number K of negative samples without increasing the training batch size.
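As a reminder of how a momentum encoder with a negative queue works in general (MoCo-style), here is a minimal sketch; the momentum value and queue size are assumptions, and the tiny linear encoders merely stand in for PointNet++ / the sparse U-Net.

```python
import torch

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    """Slowly move the key (momentum) encoder toward the query encoder."""
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)

@torch.no_grad()
def enqueue(queue, keys):
    """FIFO queue of keys: new keys become negatives for later batches,
    so K can be much larger than the batch size."""
    return torch.cat([keys, queue], dim=0)[: len(queue)]

# Example with tiny linear encoders (assumed stand-ins for the real backbones).
encoder_q = torch.nn.Linear(3, 128)
encoder_k = torch.nn.Linear(3, 128)
encoder_k.load_state_dict(encoder_q.state_dict())   # start identical

queue = torch.randn(4096, 128)                       # K = 4096 negatives (assumed)
batch = torch.randn(8, 3)
keys = encoder_k(batch).detach()                     # keys come from the momentum encoder
momentum_update(encoder_q, encoder_k)
queue = enqueue(queue, keys)
```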
Pretraining dataset: single-view depth map videos from the popular ScanNet [18] dataset, termed ScanNet-vid.
Downstream tasks: object detection, semantic segmentation, and object classification on indoor and outdoor benchmarks.
[Paper translation] [ICCV 2021] DepthContrast: Self-Supervised Pretraining of 3D Features on any Point-Cloud
Key ingredients of the pretraining: datasets (ScanNet, Redwood), models (PointNet++ [64], sparse convolution U-Net model [17]), loss function (instance discrimination).