We present an end-to-end deep learning architecture for depth map inference from multi-view images.
In the network, we first extract deep visual image features, and then build the 3D cost volume upon the reference camera frustum via the differentiable homography warping. Next, we apply 3D convolutions to regularize and regress the initial depth map, which is then refined with the reference image to generate the final output. Our framework flexibly adapts to arbitrary N-view inputs using a variance-based cost metric that maps multiple features into one cost feature.
3D cost volume
? upon the reference camera frustum
? via the differentiable homography warping
? Our framework accepts N input images and uses a variance-based metric to map multiple features into one cost feature?
With simple post-processing, our method not only significantly outperforms previous state-of-the-art methods, but is also several times faster in runtime.
Far outperforms previous state-of-the-art methods and is also faster. Generalizes well.
Multi-view stereo (MVS) estimates the dense representation from overlapping images, which is a core problem of computer vision extensively studied for decades.
Limitations: low-textured, specular and reflective regions of the scene
While these methods have shown great results under ideal Lambertian scenarios, they suffer from some common limitations. For example, low-textured, specular and reflective regions of the scene make dense matching intractable and thus lead to incomplete reconstructions.
It is reported in recent MVS benchmarks [1,18] that, although current state-of-the-art algorithms [7,36,8,32] perform very well in accuracy, the reconstruction completeness still has large room for improvement.
Conceptually, the learning-based method can introduce global semantic information such as specular and reflective priors for more robust matching.
In fact, the stereo matching task is perfectly suitable for applying CNN-based methods, as image pairs are rectified in advance and thus the problem becomes the horizontal pixel-wise disparity estimation without bothering with camera parameters.
Limitation: arbitrary camera geometries
Extending two-view stereo to multi-view stereo is not straightforward. Reason: arbitrary camera geometries
Some related work: SurfaceNet [14], Learned Stereo Machine (LSM) [15]
Limitation: volumetric representation of regular grids
However, both methods exploit the volumetric representation of regular grids. Restricted by the huge memory consumption of 3D volumes, their networks can hardly be scaled up.
We propose an end-to-end deep learning architecture for depth map inference, which computes one depth map at a time, rather than the whole 3D scene at once.
Input: one reference image and several source images. Output: the depth map for the reference image.
MVSNet takes one reference image and several source images as input, and infers the depth map for the reference image.
differentiable homography warping operation?
The key insight here is the differentiable homography warping operation, which implicitly encodes camera geometries in the network to build the 3D cost volumes from 2D image features and enables the end-to-end training.
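To make the warping step concrete, here is a minimal NumPy sketch of the standard plane-induced homography for a fronto-parallel plane at a given depth in the reference camera. It is an illustration under stated assumptions, not the paper's implementation: the function names (`plane_homography`, `warp_pixel`) are hypothetical, and the pose convention X_src = R_rel X_ref + t_rel is assumed here. In a network the same matrix would drive differentiable bilinear sampling of source features; only the geometry is shown.

```python
import numpy as np

def plane_homography(K_ref, K_src, R_rel, t_rel, depth):
    """Homography mapping reference pixels to source pixels for the plane z = depth.

    Assumed convention (not taken from the notes): a point X_ref in the
    reference camera frame maps to X_src = R_rel @ X_ref + t_rel.
    """
    n = np.array([0.0, 0.0, 1.0])  # plane normal: the reference principal axis
    # On the plane, n @ X_ref == depth, so t_rel can be folded into a 3x3 matrix.
    return K_src @ (R_rel + np.outer(t_rel, n) / depth) @ np.linalg.inv(K_ref)

def warp_pixel(H, pixel_xy):
    """Apply a homography to one (x, y) pixel and dehomogenize."""
    p = H @ np.array([pixel_xy[0], pixel_xy[1], 1.0])
    return p[:2] / p[2]
```

Evaluating `plane_homography` once per depth hypothesis and sampling the source feature map at the warped coordinates gives one slice of the cost volume per depth plane.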
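Summary: we propose a variance-based metric that maps multiple features into one cost feature in the volume. The cost volume then goes through multi-scale 3D convolutions to regress the initial depth map. Finally, the depth map is refined using the reference image to improve accuracy in boundary regions.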
To adapt to an arbitrary number of source images in the input, we propose a variance-based metric that maps multiple features into one cost feature in the volume. This cost volume then undergoes multi-scale 3D convolutions and regresses an initial depth map. Finally, the depth map is refined with the reference image to improve the accuracy of boundary areas.
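The variance metric and the regression step can be sketched in NumPy as follows. This is a rough illustration, not the authors' code. `variance_cost` takes the element-wise variance over N warped feature volumes, so its output shape does not depend on N. `expected_depth` is an assumption on my part: the notes only say the depth map is "regressed", and a softmax-weighted expectation over depth hypotheses is one common way to do that. The 3D convolutions that would sit between the two steps are omitted, and both function names are hypothetical.

```python
import numpy as np

def variance_cost(feature_volumes):
    """Element-wise variance across views.

    feature_volumes: array of shape (N, D, H, W, C), one warped feature
    volume per view. Returns an array of shape (D, H, W, C) for any N.
    """
    volumes = np.asarray(feature_volumes, dtype=np.float64)
    mean = volumes.mean(axis=0)
    return ((volumes - mean) ** 2).mean(axis=0)

def expected_depth(cost, depth_values):
    """Assumed regression step: softmax over negated cost, then expectation.

    cost: (D, H, W) scalar cost per depth hypothesis (lower is better).
    depth_values: (D,) depth of each hypothesis plane.
    """
    logits = -np.asarray(cost, dtype=np.float64)
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    prob = np.exp(logits)
    prob /= prob.sum(axis=0, keepdims=True)
    # Probability-weighted sum of depths for every pixel: result is (H, W).
    return np.tensordot(np.asarray(depth_values, dtype=np.float64), prob, axes=(0, 0))
```

Because the variance is symmetric in its inputs, the reference view is treated like any other view, and the order of the source images does not matter.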
There are two major differences between our method and previous learned approaches [15,14].
First, for the purpose of depth map inference, our 3D cost volume is built upon the camera frustum instead of the regular Euclidean space.
Second, our method decouples the MVS reconstruction to smaller problems of per-view depth map estimation, which makes large-scale reconstruction possible.
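As a rough illustration of these two points, the sketch below samples depth hypotheses along the reference frustum and back-projects a per-view depth map into camera-space 3D points, which is the form in which separate views could later be fused. The uniform depth sampling, the pinhole intrinsics `K` with last row (0, 0, 1), and both function names are assumptions made for this example, not details taken from the notes.

```python
import numpy as np

def depth_hypotheses(d_min, d_max, num_planes):
    """Fronto-parallel plane depths for the frustum (uniform spacing assumed)."""
    return np.linspace(d_min, d_max, num_planes)

def backproject_depth_map(depth, K):
    """Lift an (H, W) depth map to an (H, W, 3) array of camera-space points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel columns and rows
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pixels @ np.linalg.inv(K).T  # K^-1 [u, v, 1]^T for every pixel
    return rays * depth[..., None]      # scale each ray so that z equals depth
```

Each depth map only needs memory proportional to one image, whereas a regular voxel grid grows with the volume of the whole scene.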
According to output representations, MVS methods can be categorized into 1) direct point cloud reconstructions [22,7], 2) volumetric reconstructions [20,33,14,15] and 3) depth map reconstructions [35,3,8,32,38].