HyperSphereSurfaceRegression Algorithm Notes

360° Surface Regression with a Hyper-Sphere Loss

Summary Table

CNN architecture
Architecture: fully convolutional (FCN) encoder-decoder network with skip-connections, based on UNet combined with a VGG16 encoder
Encoder: same as conv1-conv5 in VGG16
Decoder: symmetrical blocks of convolutions and bi-linear up-sampling layers
Skip-connections: concatenate symmetrical high-resolution features from the encoder
Activation function: ReLU
Normalization: batch normalization after each convolutional layer

Training details
Framework: PyTorch
GPU: NVIDIA TITAN X
CUDA: v9.0
cuDNN: v7.1.3
Random seed: 1337
Weight initialization: ImageNet pre-trained weights (encoder), Xavier (remaining convolution layers)
Optimizer: Adam (default parameters: β1 = 0.9, β2 = 0.999, ε = 10⁻⁸)
Learning rate: 0.0002
Epochs: 50
Image size: 512 × 256
Loss weighting factor α: 0.025

3. Dataset Creation

  • Datasets such as [48] and [45] partially satisfy the data demands of deep CNN architectures when learning depth or surface normals from scenes captured under the pinhole camera projection model. Comparable datasets of spherical images, however, are difficult to obtain.
  • We overcome this limitation by following the steps of [64] and creating a mixed dataset of spherical images of indoor scenes. Similarly, we used a path-tracing renderer and Blender to render existing 3D datasets, and annotated the rendered images with the corresponding ground-truth surface normal maps, which are produced as a by-product of the rendering process.
  • Specifically, we used the same 3D datasets, namely Matterport3D [6], Stanford2D3D [2, 1] and SunCG [51], to generate a dataset that mixes computer-generated (CG) and realistic indoor scenes. The dataset consists of 24933 unique viewpoints, from which we split 7868 scenes for training, 1098 for validation and 2176 for benchmarking our trained models. We consider the remaining ones invalid due to inaccuracies during rendering. We provide the dataset publicly to enable further research in 360° visual perception. We showcase a sample of our dataset in Fig. 2.
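
The split sizes quoted above can be checked with a few lines of arithmetic. This is only a sanity check that takes the stated counts at face value; the variable names are illustrative and not taken from the authors' code.

```python
# Counts as quoted in the text above.
TOTAL_VIEWPOINTS = 24933
SPLITS = {"train": 7868, "validation": 1098, "test": 2176}

# Viewpoints that end up in one of the three splits.
used = sum(SPLITS.values())

# Everything else is what the text describes as invalid renders.
remaining = TOTAL_VIEWPOINTS - used

# Share of each split among the viewpoints that are actually used.
fractions = {name: count / used for name, count in SPLITS.items()}

print(f"used: {used}, remaining (treated as invalid): {remaining}")
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%}")
```

With these numbers, 11142 viewpoints are used and 13791 are left over, and the used ones are divided roughly 70.6% / 9.9% / 19.5% between training, validation and test.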

4. 360° Surface Normals Estimation

  • Following most prior work, we treat training a fully convolutional network (FCN) to learn surface normals from a single spherical image as a regression task. Most learning-based normal regression approaches minimize either the L2 norm [32, 40, 5, 12] of the difference between the predicted normal map and the ground truth, or their normalized per-pixel dot product [13, 61], which captures their angular difference.
  • Quaternions can represent arbitrary rotations and surface orientations in a simple and compact form. To train our network, we treat normal vectors as pure quaternions and minimize their difference in terms of rotation, which further boosts the performance of our model (Table 1).
  • We first formulate our novel quaternion loss function and then describe the neural network architecture used in our experiments.
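
The three per-pixel error measures mentioned above can be compared on a single pair of unit normals. The sketch below is illustrative and is not the authors' formulation: all function names are hypothetical, and the angular form atan2(‖n × n̂‖, n · n̂) is an assumption. It is used here because, for pure unit quaternions p = (0, n) and q = (0, n̂), the product p·q* has scalar part n · n̂ (the cosine of the angle between the normals) and a vector part whose norm is ‖n × n̂‖ (the sine).

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def l2_error(pred, gt):
    """L2 norm of the difference between two normals."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(pred, gt)))


def cosine_error(pred, gt):
    """1 - cos(theta); assumes both normals have unit length."""
    return 1.0 - dot(pred, gt)


def quaternion_product(p, q):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    pw, pv = p[0], p[1:]
    qw, qv = q[0], q[1:]
    c = cross(pv, qv)
    w = pw * qw - dot(pv, qv)
    vec = tuple(pw * b + qw * a + k for a, b, k in zip(pv, qv, c))
    return (w,) + vec


def quaternion_angle_error(pred, gt):
    """Angle between two unit normals, computed via pure quaternions (assumed form)."""
    p = (0.0,) + tuple(pred)                  # pure quaternion (0, n)
    q_conj = (0.0,) + tuple(-c for c in gt)   # conjugate of (0, n_hat)
    w, x, y, z = quaternion_product(p, q_conj)
    # w = cos(theta), |(x, y, z)| = sin(theta)
    return math.atan2(math.sqrt(x * x + y * y + z * z), w)


pred, gt = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
print(l2_error(pred, gt), cosine_error(pred, gt), quaternion_angle_error(pred, gt))
```

For the orthogonal pair above, the L2 error is √2, the cosine error is 1, and the angular error is π/2. Unlike the cosine term, the atan2 form stays well conditioned near 0 and π, which is one plausible reason to prefer an angle-based measure.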

4.4. CNN architecture

  • Following [61, 5], we use a fully convolutional (FCN) [34] encoder-decoder network with skip-connections that regresses towards the ground truth surface normals. The network architecture is based on UNet [43] combined with a VGG16 [49] encoder. We also trained other models used in the literature, but their performance was inferior to that of the selected architecture.
  • A UNet architecture typically consists of an encoder that captures the input image's context and a symmetrical decoder that enables precise localization. In our implementation, the front-end encoder remains the same as conv1-conv5 in VGG16, and the decoder is composed of symmetrical blocks of convolutions and bi-linear up-sampling layers. To localize the decoder's upsampled features, we concatenate them with their symmetrical high-resolution features from the encoder via skip-connections. This technique has been shown to prevent gradient degradation [20], and it proved to be an important element of the network's design. Our model outputs high-resolution results and keeps fine object details that might otherwise disappear between pooling and up-sampling layers.
  • Further, we use ReLU [36] as the activation function and batch normalization [23] after each convolutional layer. Finally, the output of the network is fed to a convolution with a 3×3 kernel to produce the final 3-channel prediction, which we explicitly normalize along the channel dimension.
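
The architecture described above can be sketched in PyTorch roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation. The class name `UNetVGG16Sketch` and the helper `conv_block` are hypothetical. The encoder widths and convolution counts follow the standard VGG16 configuration. Max pooling between encoder blocks, two convolutions per decoder block, and the decoder channel widths are assumptions. No ImageNet weights are loaded here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_block(in_ch, out_ch, n_convs):
    """n_convs repetitions of: 3x3 convolution -> batch norm -> ReLU."""
    layers = []
    for i in range(n_convs):
        layers += [
            nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        ]
    return nn.Sequential(*layers)


class UNetVGG16Sketch(nn.Module):
    # (output channels, number of convolutions) for conv1..conv5 of VGG16
    VGG16_CFG = [(64, 2), (128, 2), (256, 3), (512, 3), (512, 3)]

    def __init__(self):
        super().__init__()
        self.encoder = nn.ModuleList()
        in_ch = 3
        for out_ch, n_convs in self.VGG16_CFG:
            self.encoder.append(conv_block(in_ch, out_ch, n_convs))
            in_ch = out_ch

        # Decoder blocks mirror conv4..conv1 (assumption: two convs per block).
        skip_channels = [ch for ch, _ in self.VGG16_CFG[:-1]][::-1]  # 512, 256, 128, 64
        self.decoder = nn.ModuleList()
        for skip_ch in skip_channels:
            self.decoder.append(conv_block(in_ch + skip_ch, skip_ch, 2))
            in_ch = skip_ch

        # Final 3x3 convolution producing the 3-channel normal map.
        self.head = nn.Conv2d(in_ch, 3, kernel_size=3, padding=1)

    def forward(self, x):
        skips = []
        for i, block in enumerate(self.encoder):
            if i > 0:
                x = F.max_pool2d(x, kernel_size=2)  # halve resolution between blocks
            x = block(x)
            skips.append(x)
        skips.pop()  # the deepest features feed the decoder; they are not a skip

        for block in self.decoder:
            skip = skips.pop()
            # Bi-linear up-sampling to the resolution of the matching encoder features.
            x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear", align_corners=False)
            x = block(torch.cat([x, skip], dim=1))  # skip-connection by concatenation

        # Explicit normalization along the channel dimension: unit-length normals.
        return F.normalize(self.head(x), p=2, dim=1)
```

Feeding a tensor of shape (N, 3, 256, 512), an equirectangular image 512 pixels wide and 256 high, returns a tensor of the same spatial size in which every pixel holds a unit-length 3-vector.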

5. Experimental Results

  • This section provides an experimental evaluation of our method. To assess the effectiveness of our quaternion loss function, we also train our model using the L2 norm of the difference between the predicted and the ground truth surface normals and, additionally, using their normalized per-pixel dot product, i.e. their cosine similarity. We then compare their performance on our dataset's test split.
  • We then evaluate its performance against other methods applied to cubemap projections of our dataset as well as to the original equirectangular images.
  • Additionally, we demonstrate our model's generalization ability by applying it to a subset of the Sun360 dataset containing unseen indoor scenes. Our trained model produces very promising qualitative results, even on in-the-wild data whose distribution differs considerably from that of our dataset's train split. To further evaluate its effectiveness, we experiment with an image relighting application [41]. We compare images relit using our model's predictions, and present qualitative results on samples of our dataset and a subset of Sun360.

5.1. Training Details

All of our networks were implemented and trained using the PyTorch [39] framework. Experiments were performed on a PC equipped with an NVIDIA TITAN X GPU, CUDA [37] v9.0, and cuDNN [9] v7.1.3. We used a random seed of 1337 for all of our experiments, to obtain comparable training sessions and reproducibility. We initialize our network's encoder parameters with weights pre-trained on ImageNet [11], and the remaining convolution layers with Xavier weight initialization [18]. We use Adam [25] as the optimizer with its default parameters [β1, β2, ε] = [0.9, 0.999, 10⁻⁸] and a learning rate of 0.0002, and we train all of our models for 50 epochs. We feed every network with equirectangular images at a 512 × 256 resolution, with the models' predictions being of equal size. Finally, we use a loss weighting factor α = 0.025 between the prediction and the smoothness term.
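
The hyper-parameters above map onto a PyTorch setup along the following lines. This is a hedged sketch rather than the authors' training script. `decoder_stub` is a hypothetical stand-in for the layers that do not receive ImageNet weights. The uniform Xavier variant and the zeroed biases are assumptions. The additive form of `total_loss` is assumed as well, since the text only states that α weights the prediction term against the smoothness term.

```python
import torch
import torch.nn as nn

SEED = 1337
LEARNING_RATE = 2e-4
EPOCHS = 50
ALPHA = 0.025  # weighting between the prediction and the smoothness term

# Fix the random seed so that runs are comparable.
torch.manual_seed(SEED)

# Hypothetical stand-in for the non-pretrained convolution layers.
decoder_stub = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 3, kernel_size=3, padding=1),
)


def init_xavier(module):
    """Xavier initialization for convolutions (uniform variant assumed)."""
    if isinstance(module, nn.Conv2d):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)  # assumption: biases start at zero


decoder_stub.apply(init_xavier)

# Adam with the default parameters quoted in the text, written out explicitly.
optimizer = torch.optim.Adam(
    decoder_stub.parameters(), lr=LEARNING_RATE, betas=(0.9, 0.999), eps=1e-8
)


def total_loss(prediction_loss, smoothness_loss, alpha=ALPHA):
    """Assumed combination: prediction term plus alpha times the smoothness term."""
    return prediction_loss + alpha * smoothness_loss
```

In a full training loop the optimizer would receive the parameters of the whole network, and the loop would run for `EPOCHS` epochs over 512 × 256 equirectangular inputs.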
