A Two-Streamed Network for Estimating Fine-Scaled Depth Maps from Single RGB Images (no figures are included in this post; please read it alongside the original paper)

A few words up front: I'm not very good at this, and this paper is genuinely hard! I'm just doing my best to reproduce it, and, while I'm at it, hoping my submission to a PKU core journal gets through.

Introduction

Estimating depth from a single RGB image is an ill-posed and inherently ambiguous problem. State-of-the-art deep learning methods can now estimate accurate 2D depth maps, but when the maps are projected into 3D, they lack local detail and are often highly distorted. We propose a fast-to-train two-streamed CNN that predicts depth and depth gradients, which are then fused together into an accurate and detailed depth map. We also define a novel set loss over multiple images; by regularizing the estimation between a common set of images, the network is less prone to overfitting and achieves better accuracy than competing methods. Experiments on the NYU Depth v2 dataset show that our depth predictions are competitive with state-of-the-art and lead to faithful 3D projections.
The first paragraph of the introduction lays out the current difficulties: indoor scenes are full of fine object detail, the semantic information is complex, and so on.
The use of convolutional neural networks (CNNs) has greatly improved the accuracy of depth estimation techniques. Rather than coarsely approximating the depth of large structures such as walls and ceilings, state-of-the-art networks benefit from using pre-trained CNNs and can capture fine-scaled items such as furniture and home accessories. The pinnacle of success for depth estimation is the ability to generate realistic and accurate 3D scene reconstructions from the estimated depths. Faithful reconstructions should be rich with local structure; detailing becomes especially important in applications derived from the reconstructions such as object recognition and depth-aware image re-rendering and/or editing. Despite the impressive evaluation scores of recent works, however, the estimated depth maps still suffer from artifacts at finer scales and have unsatisfactory alignments between surfaces. These distortions are especially prominent when projected into 3D (see Figure 1).

Other CNN-based end-to-end applications such as semantic segmentation and normal estimation face similar challenges of preserving local details. The repeated convolution and pooling operations are critical for capturing the entire image extent, but simultaneously shrink resolution and degrade the detailing. While up-convolution and feature map concatenation strategies have been proposed to improve resolution, output map boundaries often still fail to align with image boundaries. As such, optimization measures like bilateral filtering or CRFs [4] yield further improvements.

It is with the goal of preserving detailing that we motivate our work on depth estimation. We want to benefit from the accuracy of CNNs, but avoid the degradation of resolution and detailing. First, we ensure network accuracy and generalization capability by introducing a novel set image loss. This loss is defined jointly over multiple images, where each image is a transformed version of an original image via standard data augmentation techniques. The set loss considers not only the accuracy of each transformed image's output depth, but also has a regularization term to minimize prediction differences within the set. Adding this regularizer greatly improves the depth accuracy and reduces RMS error by approximately 5%. As similar data augmentation approaches are also used in other end-to-end frameworks, e.g. for semantic segmentation and normal estimation, we believe that the benefits of the set loss will carry over to these applications as well.
We capture scene detailing by considering information contained in depth gradients. We postulate that local structure can be better encoded with first-order derivative terms than the absolute depth values. Perceptually, it is sharp edges and corners which define an object and make it recognizable, rather than (correct) depth values (compare in Figure 4). As such, we think it's better to represent a scene with both depth and depth gradients, and propose a fast-to-train two-streamed CNN to regress the depth and depth gradients (see Figure 2). In addition, we propose two possibilities for fusing the depth and depth gradients, one via a CNN, to allow for end-to-end training, and one via direct optimization. We summarize our contributions as follows:
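The quoted paragraph treats depth gradients as a second regression target. As a minimal sketch of what such a two-channel target could look like (the paper does not specify the gradient operator; the forward differences and the function name depth_gradients below are my assumptions), one might precompute it from ground-truth depth like this:

```python
import torch
import torch.nn.functional as F

def depth_gradients(depth: torch.Tensor) -> torch.Tensor:
    """Turn a depth map (B, 1, H, W) into a two-channel map of horizontal
    and vertical forward differences (B, 2, H, W). Illustrative only; the
    paper does not state which gradient operator it uses."""
    dx = depth[:, :, :, 1:] - depth[:, :, :, :-1]   # (B, 1, H, W-1)
    dy = depth[:, :, 1:, :] - depth[:, :, :-1, :]   # (B, 1, H-1, W)
    dx = F.pad(dx, (0, 1, 0, 0))                    # pad the last column back
    dy = F.pad(dy, (0, 0, 0, 1))                    # pad the last row back
    return torch.cat([dx, dy], dim=1)               # D = 2 channels

# Example: gradient targets for a batch of coarse 55x75 depth maps.
target = depth_gradients(torch.rand(4, 1, 55, 75))
print(target.shape)  # torch.Size([4, 2, 55, 75])
```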

  1. A novel set image loss with a regularizer that minimizes differences in estimated depth of related images; this loss makes better use of augmented data and promotes stronger network generalization, resulting in higher estimation accuracy.
  2. A joint representation of the 2.5D scene with depth and depth gradients; this representation captures local structures and fine detailing and is learned with a two-streamed network.
  3. Two methods for fusing depth and depth gradients into a final depth output, one via CNNs for end-to-end training and one via direct optimization; both methods yield depth maps which, when projected into 3D, have less distortion and are richer with structure and object detailing than competing state-of-the-art methods.

Representing the scene with both depth and depth gradients is redundant, as one can be derived from the other. We show, however, that this redundancy offers explicit consideration for local detailing that is otherwise lost in the standard Euclidean loss on depth alone and/or with a simple consistency constraint in the loss. Our final depth output is accurate and clean with local detailing, with fewer artifacts than competing methods when projected into 3D.
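The text above mentions fusion via direct optimization without spelling it out. Purely as an illustrative guess at that idea, one could search for a depth map that stays close to the estimated depth while its finite-difference gradients stay close to the estimated gradient map; the objective, the weight lam and the Adam-based solver below are my assumptions, not the authors' formulation.

```python
import torch

def fuse_depth_and_gradients(d_hat, g_hat, lam=1.0, steps=500, lr=0.05):
    """Solve for D minimizing ||D - d_hat||^2 + lam * ||grad(D) - g_hat||^2.

    d_hat: (1, 1, H, W) estimated depth; g_hat: (1, 2, H, W) estimated
    (dx, dy) gradients. Objective, weighting and solver are illustrative."""
    depth = d_hat.clone().requires_grad_(True)
    opt = torch.optim.Adam([depth], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        dx = depth[:, :, :, 1:] - depth[:, :, :, :-1]
        dy = depth[:, :, 1:, :] - depth[:, :, :-1, :]
        loss = ((depth - d_hat) ** 2).mean() \
            + lam * ((dx - g_hat[:, 0:1, :, :-1]) ** 2).mean() \
            + lam * ((dy - g_hat[:, 1:2, :-1, :]) ** 2).mean()
        loss.backward()
        opt.step()
    return depth.detach()

# fused = fuse_depth_and_gradients(depth_stream_output, gradient_stream_output)
```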

The sections in between are skipped; let's go straight to the network architecture.

Our network architecture, shown in Figure 2, follows a two-stream model; one stream regresses depth and the other depth gradients, both from an RGB input image. The two streams follow the same format: an image parsing block, followed by a feature fusion block and finally a refinement block. The image parsing block consists of the convolutional layers of VGG-16 (up to pool5) and two fully connected layers. The output from the second fully connected layer is then reshaped into a 55×75×D feature map to be passed onto the feature fusion block, where D = 1 for the depth stream and D = 2 for the gradient stream. In place of VGG-16, other pre-trained networks can be used as well for the image parsing block, e.g. VGG-19 or ResNet.

The feature fusion block consists of one 9×9 convolution and pooling, followed by eight successive 5×5 convolutions without pooling. It takes as input a down-sampled RGB image and then fuses together features from the VGG convolutional layers and the image parsing block output. Specifically, the feature maps from VGG pool3 and pool4 are fused at the input to the second and fourth convolutional layers respectively, while the output of the image parsing block is fused at the input to the sixth convolutional layer, all with skip layer connections. The skip connections for the VGG features have a 5×5 convolution and a 2x or 4x up-sampling to match the working 55×75 feature map size; the skip connection from the image parsing block is a simple concatenation. As noted by other image-to-image mapping works [12, 21, 22], the skip connections provide a convenient way to share hierarchical information, and we find that this also results in much faster network training convergence. The output of the feature fusion block is a coarse 55×75×D depth or depth gradient map.

The refinement block, similar to the feature fusion block, consists of one 9×9 convolution and pooling and five 5×5 convolutions without pooling. It takes as input a down-sampled RGB image and then fuses together a bilinearly up-sampled output of the feature fusion block via a skip connection (concatenation) to the third convolutional layer. The working map size in this block is 111×150, with the output being depth or gradient maps at this higher resolution.

The depth and gradient fusion block brings together the depth and depth gradient estimates from the two separate streams into a single coherent depth estimate. We propose two possibilities, one with convolutional processing in an end-to-end network, and one via a numerical optimization. The two methods are explained in detail in Section 3.3. We refer the reader to the supplementary material for specifics on the layers, filter sizes, and learning rates.
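To make the skip-connection wiring above a little more concrete, here is a hedged PyTorch sketch of how a VGG pool3 or pool4 feature map might be adapted before concatenation. Only the 5×5 convolution, the 2x/4x up-sampling and the 55×75 working size come from the text; the class name SkipAdapter, the channel counts and the ReLU are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipAdapter(nn.Module):
    """5x5 convolution plus up-sampling applied to a VGG feature map so it
    can be concatenated with the 55x75 working feature map."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2)

    def forward(self, x):
        x = F.relu(self.conv(x))
        # The paper quotes 2x (pool4) or 4x (pool3) up-sampling; interpolating
        # straight to the 55x75 working size side-steps rounding issues.
        return F.interpolate(x, size=(55, 75), mode='bilinear', align_corners=False)

# Hypothetical wiring: the adapted pool3 features are concatenated before the
# second 5x5 convolution of the feature fusion block, pool4 before the fourth,
# and the image parsing output (already 55x75) before the sixth.
pool3_skip = SkipAdapter(in_ch=256, out_ch=64)        # VGG-16 pool3 has 256 channels
print(pool3_skip(torch.rand(1, 256, 60, 80)).shape)   # torch.Size([1, 64, 55, 75])
```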

My own ability here is quite limited, so I can only just barely follow the network architecture.
The input is a three-channel RGB image, and the network splits into two streams:
the first deals with depth, the second with depth gradients.
Look at the depth stream first; it is itself made up of three blocks:

  1. VGG-16 up to pool5: it starts with five convolutional stages, each followed by pooling, exactly as in VGG, plus the skip connections branching off (look carefully at the overall network diagram). Then, as in VGG-16, come two fully connected layers (to me this looks no different from the original VGG-16, so I'm not sure why the authors single the FC layers out). The output is reshaped into 55×75×1; since the input resolution isn't stated, actually arriving at exactly 55×75 is probably not easy. This then enters the second block, feature fusion.
  2. Setting aside for the moment what exactly these skip connections do (I'm still confused here), in the feature fusion block the down-sampled RGB image passes through a 9×9 convolution and eight successive 5×5 convolutions, and this is fused with the result of step 1. The outcome is a coarse output, which moves on to the third step, refinement.
  3. The refinement block is structured like the feature fusion block: the RGB image goes through a 9×9 convolution and five successive 5×5 convolutions (I'm not fully sure which input this is applied to), while the result of step 2 is bilinearly up-sampled to 111×150; the two are fused to give the depth output of the stream.

The gradient stream is exactly the same, except that the initial reshape produces two channels instead of one.
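Purely as a reading aid, here is a minimal PyTorch sketch of one such stream. Only the block order, the 55×75 and 111×150 working sizes, and the D = 1 / D = 2 output channels come from the paper; the stand-in layers, the input resolution and every channel width are my own placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StreamSketch(nn.Module):
    """One stream (depth or gradient) reduced to stand-in layers."""
    def __init__(self, d_out):
        super().__init__()
        # Stand-in for the image parsing block (really VGG-16 convs + 2 FC layers).
        self.parsing = nn.Sequential(nn.Conv2d(3, d_out, 3, padding=1))
        # Stand-in for the feature fusion block (really 9x9 conv + eight 5x5 convs + skips).
        self.fusion = nn.Sequential(nn.Conv2d(3 + d_out, d_out, 5, padding=2))
        # Stand-in for the refinement block (really 9x9 conv + five 5x5 convs + skip).
        self.refine = nn.Sequential(nn.Conv2d(3 + d_out, d_out, 5, padding=2))

    def forward(self, rgb):
        small = F.interpolate(rgb, size=(55, 75), mode='bilinear', align_corners=False)
        parsed = self.parsing(small)                                # (B, D, 55, 75)
        coarse = self.fusion(torch.cat([small, parsed], dim=1))     # (B, D, 55, 75)
        mid = F.interpolate(rgb, size=(111, 150), mode='bilinear', align_corners=False)
        up = F.interpolate(coarse, size=(111, 150), mode='bilinear', align_corners=False)
        return self.refine(torch.cat([mid, up], dim=1))             # (B, D, 111, 150)

depth_stream, grad_stream = StreamSketch(d_out=1), StreamSketch(d_out=2)
x = torch.rand(2, 3, 228, 304)                      # the input size is a guess
print(depth_stream(x).shape, grad_stream(x).shape)
# torch.Size([2, 1, 111, 150]) torch.Size([2, 2, 111, 150])
```

Instantiating it twice, once with d_out=1 and once with d_out=2, gives the two streams whose outputs the fusion block of Section 3.3 would then combine.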
I'll come back to the earlier sections once I've worked the rest of it out.

Loss function

This part is also quite hard to follow.
The first equation measures the per-pixel difference between a prediction and the ground truth.
The second says that each input image is augmented to enlarge the dataset, so the loss is defined over a whole set of images (I strongly suspect I'll run into serious trouble implementing this).
The third formula computes, for each individual image in the set, a per-pixel loss against the ground truth.
The fourth formula is one of the harder parts of the paper. It constrains pairs of whole images within a set: an image undergoes color and contrast transformations during augmentation, so the outputs for the transformed copies should, in theory, correspond to the output for the original. The regularization term maps each transformed output back through the inverse transformation and constrains it to agree with the original. I'm honestly not sure I can code this.
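Since the equations themselves are not reproduced in this post, the following is only a guess at the shape of that set loss, using a horizontal flip as the one augmentation whose inverse is trivial to apply to the prediction; the plain L2 terms, the pairing scheme and the weight lam_set are all assumptions, not the paper's exact formulation.

```python
import torch

def set_loss(pred_orig, pred_flip, gt, lam_set=0.5):
    """Illustrative set loss over {original, horizontally flipped} inputs.

    pred_orig, pred_flip: depth predictions for the original image and for
    its flipped copy, both (B, 1, H, W); gt is the ground truth for the
    original image. Photometric augmentations need no inverse on the depth."""
    # Map the flipped prediction back to the original frame.
    pred_flip_back = torch.flip(pred_flip, dims=[3])
    # Per-image fidelity terms (plain L2 here; the paper's per-pixel loss may differ).
    data_term = ((pred_orig - gt) ** 2).mean() + ((pred_flip_back - gt) ** 2).mean()
    # Regularizer: predictions for the same scene should agree once the
    # augmentation is undone.
    consistency = ((pred_orig - pred_flip_back) ** 2).mean()
    return data_term + lam_set * consistency
```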
Next comes the depth loss, which is easy enough to understand, so I won't dwell on it.
The sixth equation loses me again: the architecture takes the RGB image as input and fuses the depth and gradient estimates, fed through skip connections into the third convolutional layer, and the combined loss L_comb is used to preserve depth accuracy and gradient consistency. On a closer look, it is really just the loss of the whole fusion network: it adds up the earlier losses, and it is a complete per-pixel loss.
The seventh equation, I think, can be read as the depth loss of the overall fusion network.
Personally, I'm rather afraid of training this kind of fusion network; it's beyond what I can write myself. The authors lay out a training strategy below, so let's have a look.

Training strategy

I won't quote the authors at length here. The two streams are trained separately at first, and each batch fed to the network is generated from one and the same original image (a set).
The first two blocks are trained first and then frozen, after which the final refinement block is trained using Equations 2, 1 and 5.
Finally, the whole thing is fine-tuned.
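A hedged sketch of that staging in PyTorch: only the ordering (train the earlier blocks, freeze them, then train the refinement block, then fine-tune) follows the description above, while the SGD optimizer, the learning rate and the loss_fn/loader placeholders are assumptions.

```python
import torch

def train_stage(modules_to_train, modules_to_freeze, loss_fn, loader, epochs=1, lr=1e-4):
    """Train one stage: optimize some blocks while earlier ones stay frozen."""
    for m in modules_to_freeze:
        for p in m.parameters():
            p.requires_grad_(False)                  # freeze earlier blocks
    params = [p for m in modules_to_train for p in m.parameters()]
    opt = torch.optim.SGD(params, lr=lr, momentum=0.9)
    for _ in range(epochs):
        for rgb, gt in loader:
            opt.zero_grad()
            loss = loss_fn(rgb, gt)                  # whichever stage loss applies
            loss.backward()
            opt.step()

# Stage 1: image parsing + feature fusion blocks of each stream.
# Stage 2: freeze them, train the refinement block (Eqs. 2, 1, 5 in the paper).
# Stage 3: fine-tune everything end to end.
```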
The difficulty really is enormous!!!
