Recurrent MVSNet for High-resolution Multi-view Stereo Depth Inference
CVPR 2019
Limitations of prior work: scalability; hard to apply to high-resolution scenes
contribution: a scalable MVS framework that reduces memory consumption and also works on large scenes
Instead of regularizing the entire 3D cost volume in one go, R-MVSNet sequentially regularizes the 2D cost maps along the depth direction via the gated recurrent unit (GRU)
key insight: regularize the cost volume in a sequential manner, replacing the 3D CNNs with a GRU
memory requirement: O(N^3) → O(N^2)
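To make the scaling concrete, a back-of-the-envelope estimate in Python (all sizes here are hypothetical, chosen only to illustrate the trend):

```python
# Illustrative memory estimate (hypothetical sizes, float32).
H, W, D, C = 400, 300, 256, 32   # image size, depth planes, feature channels
bytes_per_float = 4

full_volume = H * W * D * C * bytes_per_float  # regularize the whole volume at once
per_step    = H * W * C * bytes_per_float      # one 2D cost map per GRU step

print(f"full 3D cost volume: {full_volume / 2**30:.1f} GiB")  # ~3.7 GiB, O(N^3)
print(f"one depth slice:     {per_step / 2**20:.1f} MiB")     # ~14.6 MiB, O(N^2)
```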
main benefit: the number of depth hypotheses is no longer memory-bound, allowing in principle unlimited depth-wise resolution
the biggest flaw before: once the image resolution goes up, the cost volume becomes too large to be usable
GRUs were first proposed for sequential speech and text tasks, and are now also used for video sequence analysis
R-MVSNet: gathers spatial and temporal context information in the depth direction
Sequential Processing
Convolutional GRU
the cost volume C can be viewed as 2D cost maps C(t) concatenated along the depth direction
a convolutional variant of the GRU is applied to aggregate this temporal context information
Another advantage of this strategy: nearby objects tend to influence each other only within a limited depth range, while two positions far apart among the depth planes of the cost volume are unlikely to be correlated; the GRU gating prevents erroneous and irrelevant information from propagating without limit.
The core idea is the usual RNN machinery: an update gate and a reset gate.
$C_r(t) = (1 - \text{update\_gate}) \odot C_r(t-1) + \text{update\_gate} \odot C_u(t)$
the regularized cost map C_r(t) combines the previous regularized map C_r(t-1) with the current candidate state C_u(t)
the candidate state is built by a standard convolutional layer from the current cost map C(t) and the reset-gated previous state (the formulas here are simplified)
$C_u(t) = \sigma(W * [C(t),\ \text{reset\_gate} \odot C_r(t-1)] + b)$
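A minimal PyTorch sketch of such a convolutional GRU cell, matching the simplified equations above; kernel size, channel counts, and the sigmoid candidate activation are assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Convolutional GRU cell over 2D cost maps (a sketch, not the
    paper's exact layer configuration)."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        pad = k // 2
        self.reset  = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=pad)
        self.update = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=pad)
        self.cand   = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=pad)

    def forward(self, c_t, h_prev):
        # c_t: raw cost map C(t) at the current depth plane, (B, in_ch, H, W)
        # h_prev: regularized map C_r(t-1) from the previous plane, (B, hid_ch, H, W)
        x = torch.cat([c_t, h_prev], dim=1)
        r = torch.sigmoid(self.reset(x))    # reset gate
        u = torch.sigmoid(self.update(x))   # update gate
        c_u = torch.sigmoid(self.cand(torch.cat([c_t, r * h_prev], dim=1)))
        return (1 - u) * h_prev + u * c_u   # C_r(t)
```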
Stacked GRU
multiple GRU layers are stacked (detailed architecture description omitted here); see the sketch below
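Continuing the sketch above, the cells can be swept jointly along the depth direction; the three-layer 16 → 4 → 1 channel schedule is my reading of the architecture and should be treated as an assumption:

```python
class StackedConvGRU(nn.Module):
    """Stack of ConvGRUCells applied per depth plane (channel schedule
    and the 32-channel input are assumptions)."""
    def __init__(self, in_ch=32, hid_chs=(16, 4, 1)):
        super().__init__()
        chans = [in_ch] + list(hid_chs)
        self.cells = nn.ModuleList(
            ConvGRUCell(chans[i], chans[i + 1]) for i in range(len(hid_chs)))

    def forward(self, cost_maps):
        # cost_maps: list of D tensors, each (B, in_ch, H, W)
        B, _, H, W = cost_maps[0].shape
        states = [torch.zeros(B, cell.update.out_channels, H, W,
                              device=cost_maps[0].device) for cell in self.cells]
        out = []
        for c in cost_maps:                # sweep along the depth direction
            x = c
            for i, cell in enumerate(self.cells):
                states[i] = cell(x, states[i])
                x = states[i]
            out.append(x)                  # 1-channel regularized cost map
        return torch.stack(out, dim=1)     # (B, D, 1, H, W)
```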
apply inverse depth to sample the depth values, in order to efficiently handle reconstructions with a wide depth range
the regression task becomes a multi-class classification problem
the loss is the cross entropy between the probability volume and the ground-truth (one-hot) occupancy volume
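As a sketch, a per-pixel classification loss over D depth planes could look like this (the helper name is mine; masking of invalid pixels is omitted):

```python
import torch
import torch.nn.functional as F

def depth_classification_loss(prob_volume, gt_index):
    # prob_volume: (B, D, H, W) softmax probabilities over depth hypotheses
    # gt_index:    (B, H, W) long tensor, index of the true depth plane per pixel
    log_p = torch.log(prob_volume.clamp(min=1e-8))  # guard against log(0)
    return F.nll_loss(log_p, gt_index)              # cross entropy vs. one-hot GT
```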
Inverse depth
sampling uniformly in inverse depth (1/d) adapts the temporal (depth-wise) resolution to the spatial image resolution: constant steps in 1/d correspond to roughly constant pixel-level parallax steps; the depth map is then retrieved from the regularized cost maps
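A minimal sketch of inverse-depth sampling (hypothetical helper; hypotheses end up dense near the camera, sparse far away):

```python
import numpy as np

def inverse_depth_samples(d_min, d_max, num):
    """Return `num` depth hypotheses spaced uniformly in inverse depth (1/d)."""
    inv = np.linspace(1.0 / d_max, 1.0 / d_min, num)
    return 1.0 / inv

print(inverse_depth_samples(0.5, 100.0, 5))  # approx. [100.  1.97  0.99  0.67  0.5]
```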
winner-take-all cannot produce sub-pixel accuracy
refine the depth map in a small depth range by enforcing multi-view photo-consistency
iteratively minimize the total image reprojection error between the reference image and all source images
this step brings the depth map to sub-pixel accuracy (the initial estimate is already quite good); conceptually it resembles quadratic interpolation around the minimum
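The paper's refinement iteratively minimizes reprojection error; purely to illustrate what quadratic (parabola) sub-pixel interpolation looks like in general (explicitly not the paper's method), here is the classic three-point fit around the WTA minimum:

```python
def parabola_subpixel(cost_prev, cost_min, cost_next):
    """Three-point parabola fit around a discrete minimum: returns a
    fractional offset in (-0.5, 0.5) to add to the winning depth index.
    Illustrative only -- not R-MVSNet's actual refinement step."""
    denom = cost_prev - 2.0 * cost_min + cost_next
    if abs(denom) < 1e-12:       # flat neighborhood: no reliable offset
        return 0.0
    return 0.5 * (cost_prev - cost_next) / denom
```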
Training
To prevent depth maps from being biased on the GRU regularization order, each training sample is passed to the network with both forward and backward GRU regularization
Tanks and Temples Benchmark
evaluation scores are highly dependent on the point cloud density
R-MVSNet's depth maps are downsampled by a factor of 4, so the point cloud density is not quite sufficient
ETH3D Benchmark
provides ground-truth depth maps
Wide-range depth reconstruction
memory requirement is independent of the depth sample number
memory consumption is half that of MVSNet, and the memory is also used more efficiently
High-resolution depth reconstruction
R-MVSNet can sample more densely in the depth direction
2D CNN: learned 2D features
WTA: Winner-take-all
ZNCC: classical plane sweeping (zero-mean normalized cross-correlation)
thorough ablation experiments overall