We present a novel framework to regularize Neural Radiance Fields (NeRF) [R1] in a few-shot setting with a geometry-aware consistency regularization. The proposed approach leverages a depth map rendered at an unobserved viewpoint to warp sparse input images to that viewpoint and imposes them as pseudo ground truths to facilitate the learning of NeRF. By encouraging such geometry-aware consistency at a feature level instead of using a pixel-level reconstruction loss, we regularize NeRF at semantic and structural levels while still allowing view-dependent radiance to be modeled, accounting for color variations across viewpoints. We also propose an effective method to filter out erroneous warped solutions, along with training strategies that stabilize optimization. We show that our model achieves competitive results compared to state-of-the-art few-shot NeRF models.
Given an image $I_i$ and an estimated depth map $D_j$ of the $j$-th unobserved viewpoint, we warp the image $I_i$ to that novel viewpoint as $I_{i\to j}$ by establishing a geometric correspondence between the two viewpoints. Using the warped image as a pseudo ground truth, we encourage the image rendered at the unseen viewpoint, $I_j$, to be consistent in structure with the warped image, with occlusions taken into consideration.
However, despite its impressive performance, NeRF requires a large number of dense, well-distributed, calibrated images for optimization, which limits its applicability. When limited to sparse observations, NeRF easily overfits to the input view images and is unable to reconstruct correct geometry.
The task that directly addresses this problem, also called few-shot NeRF, aims to optimize a high-fidelity neural radiance field in such sparse scenarios, countering the under-constrained nature of the problem by introducing additional priors. Specifically, previous works attempted to solve this by utilizing semantic features [R2], entropy minimization [R3], SfM depth priors [R4], or normalizing flows [R5], but their reliance on handcrafted methods or their inability to capture local and fine structures limits their performance.
To alleviate these issues, we propose a novel regularization technique that enforces geometric consistency across different views via depth-guided warping and geometry-aware consistency modeling. Based on this, we propose a novel framework, called Neural Radiance Fields with Geometric Consistency (GeCoNeRF), for training neural radiance fields in a few-shot setting.
NeRF inherently renders not only a color image but also a depth image. Combined with the known viewpoint difference, the rendered depth can be used to define a geometric correspondence between two arbitrary views.
Specifically, we consider a depth image $D_j$ rendered by the NeRF model at an unseen viewpoint $j$. By formulating a warping function $\psi(I_i; D_j, R_{i\to j})$ that warps an image $I_i$ according to the depth $D_j$ and the viewpoint difference $R_{i\to j}$, we can encourage consistency between the warped image $I_{i\to j} = \psi(I_i; D_j, R_{i\to j})$ and the rendered image $I_j$ at the $j$-th unseen viewpoint, which in turn improves few-shot novel view synthesis performance.
GeCoNeRF regularizes the network with consistency modeling. The consistency loss $L^M_{cons}$ is applied between the unobserved-viewpoint image and the warped observed-viewpoint image, while the disparity regularization loss $L_{reg}$ regularizes depth at the seen viewpoints.
To render an image at a novel viewpoint, we first sample a random camera viewpoint, from which the corresponding ray vectors are generated in a patch-wise manner. As NeRF outputs density and color values of sampled points along the novel rays, we use the recovered density values to render a consistent depth map. Following [R1], we formulate per-ray depth values as a weighted composition of the distances traveled from the ray origin. Since the ray $r_p$ corresponding to pixel $p$ is parameterized as $r_p(t) = o + t d_p$, the depth rendering is defined similarly to the color rendering,
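The depth-rendering equation itself is not reproduced in this text. As a sketch of the standard volume-rendering formulation of [R1] implied here, with $\sigma_k$ the predicted density at the $k$-th of $N_s$ samples, $\delta_k$ the spacing between adjacent samples, and $t_k$ the sample's distance from the ray origin (this notation is an assumption):

$$D(r_p) = \sum_{k=1}^{N_s} T_k \big(1 - \exp(-\sigma_k \delta_k)\big)\, t_k, \qquad T_k = \exp\Big(-\sum_{k'<k} \sigma_{k'} \delta_{k'}\Big),$$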
where $D(r_p)$ is the predicted depth along the ray $r_p$. As described in Figure 1, we use the rendered depth map $D_j$ to warp the input ground-truth image $I_i$ to the $j$-th unseen viewpoint and acquire a warped image $I_{i\to j}$, defined by the process $I_{i\to j} = \psi(I_i; D_j, R_{i\to j})$. More specifically, a pixel location $p_j$ in the target unseen-viewpoint image is transformed to $p_{j\to i}$ in the source-viewpoint image by the viewpoint difference $R_{j\to i}$ and the camera intrinsic parameters $K$,
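The reprojection equation is likewise missing here. A minimal sketch under the standard pinhole back-project, transform, and re-project pipeline, with $p_j$ in homogeneous coordinates and $R_{j\to i}$ standing for the full relative camera transformation (notation assumed, not copied from the paper):

$$p_{j\to i} \sim K\, R_{j\to i}\, D_j(p_j)\, K^{-1}\, p_j,$$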
where $\sim$ indicates approximate equality and the projected coordinate $p_{j\to i}$ is a continuous value. With a differentiable sampler, we extract the color values at $p_{j\to i}$ on $I_i$. More formally, the transformation process can be written as follows,
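The sampling equation is not shown in this text either; a minimal sketch consistent with the operator named below, in which the warped patch takes its color at $p_j$ from $I_i$ at the continuous coordinate $p_{j\to i}$:

$$I_{i\to j}(p_j) = \mathrm{sampler}\big(I_i;\; p_{j\to i}\big),$$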
where $\mathrm{sampler}(\cdot)$ is a bilinear sampling operator.
Acceleration. Rendering a full image with NeRF volumetric rendering is computationally heavy and extremely time-consuming, requiring tens of seconds for a single iteration. To overcome the computational bottleneck of full-image rendering and warping, rays are sampled on a strided grid to form the patch with stride $s$, which we set to 2. After the rays undergo volumetric rendering, we upsample the low-resolution depth map back to the original resolution with bilinear interpolation. This full-resolution depth map is used for inverse warping. In this way, detailed warped patches at full resolution can be generated at only a fraction of the computational cost of rendering the original-sized ray batch.
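A minimal Python sketch of the strided rendering and bilinear upsampling described above; `render_depth_fn` is a hypothetical stand-in for NeRF volume rendering, and details beyond the stride-2 grid and bilinear interpolation are assumptions:

```python
from scipy.ndimage import zoom

def render_patch_depth(render_depth_fn, pixel_coords, stride=2):
    """Render depth on a strided grid, then upsample back to full resolution.

    render_depth_fn: hypothetical callable mapping an (h, w, 2) array of pixel
        coordinates to an (h, w) depth map via volume rendering.
    pixel_coords: (H, W, 2) full-resolution pixel coordinates of the patch.
    """
    # Render only every `stride`-th ray to cut the volume-rendering cost.
    low_res_depth = render_depth_fn(pixel_coords[::stride, ::stride])
    # Bilinear (order-1) upsampling to the original patch resolution;
    # the full-resolution depth map is then used for inverse warping.
    return zoom(low_res_depth, stride, order=1)
```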
Given the rendered patch $I_j$ at the $j$-th viewpoint and the patch $I_{i\to j}$ warped with depth $D_j$ and viewpoint difference $R_{i\to j}$, we define a consistency between the two to provide additional regularization toward globally consistent rendering.
One viable option is to naïvely apply the pixel-wise image reconstruction loss $L_{pix}$,
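The definition of $L_{pix}$ is not reproduced in this text; a minimal sketch, assuming a simple per-pixel $\ell_1$ photometric difference over the patch:

$$L_{pix} = \sum_{p} \big\| I_j(p) - I_{i\to j}(p) \big\|_1 .$$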
However, we observe that this simple strategy is prone to failure on reflective, non-Lambertian surfaces whose appearance changes greatly across viewpoints. In addition, geometry-related problems, such as self-occlusion and artifacts, prohibit the naïve use of a pixel-wise image reconstruction loss for regularization at unseen viewpoints.
To overcome these issues, we propose a masked feature-level regularization loss that encourages structural consistency while ignoring view-dependent radiance effects.
Given an image $I$ as input, we use a convolutional network to extract multi-level feature maps $f_{\varphi,l}(I) \in \mathbb{R}^{H_l \times W_l \times C_l}$, with channel depth $C_l$ for the $l$-th layer. To measure the feature-level consistency between the warped image $I_{i\to j}$ and the rendered image $I_j$, we extract their feature maps from $L$ layers and compute the difference within each pair of feature maps extracted from the same layer.
In accordance with the idea of using the warped image $I_{i\to j}$ as a pseudo ground truth, we allow gradient backpropagation to pass only through the rendered image and block it for the warped image. By applying the consistency loss at multiple levels of feature maps, we cause $I_j$ to model after $I_{i\to j}$ at both the semantic and the structural level.
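As a sketch of the feature-level consistency just described, with the stop-gradient $\mathrm{sg}[\cdot]$ on the warped branch reflecting the pseudo-ground-truth treatment (the per-layer normalization is an assumption):

$$L_{cons} = \sum_{l=1}^{L} \frac{1}{H_l W_l} \Big\| f_{\varphi,l}(I_j) - \mathrm{sg}\big[f_{\varphi,l}(I_{i\to j})\big] \Big\|_1 .$$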
For this loss function $L_{cons}$, we find the $\ell_1$ distance most suited to our task and use it to measure consistency across the feature difference maps. Empirically, we have found that the VGG-19 network yields the best performance in modeling consistency, likely due to the absence of normalization layers that would scale down the absolute values of feature differences. Therefore, we employ the VGG-19 network as our feature extractor $f_\varphi$ throughout all of our models.
It should be noted that our loss function differs from that of DietNeRF [R2]: while DietNeRF's consistency loss is limited to regularizing the radiance field at a globally semantic level, our loss combined with the warping module also provides the network with rich information at a local, structural level. In other words, in contrast to DietNeRF, which enforces only high-level feature consistency, our use of multiple levels of a convolutional network for the feature difference computation can be interpreted as enforcing a mixture of all levels, from high-level semantic consistency to low-level structural consistency.
In order to prevent imperfect and distorted warpings caused by erroneous geometry from influencing the model and degrading overall reconstruction quality, we construct a consistency mask $M_l$ to let NeRF ignore regions with geometric inconsistencies.
Instead of applying the mask to the images before feeding them into the feature extractor network, we apply resized masks $M_l$ directly to the feature maps, after nearest-neighbor down-sampling to match the dimensions of the $l$-th layer outputs.
We generate $M$ by measuring the consistency between depth values rendered from the target viewpoint and the source viewpoint,
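The masking equation is not shown in this text; a minimal sketch consistent with the description that follows, where $\pi^{-1}(p, d)$ is an assumed back-projection operator mapping pixel $p$ with depth $d$ to a 3D point:

$$M(p_j) = \Big[\, \big\| \pi^{-1}\big(p_j,\, D_j(p_j)\big) - \pi^{-1}\big(p_{j\to i},\, D_i(p_{j\to i})\big) \big\|_2 < \tau \,\Big],$$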
where $[\cdot]$ is the Iverson bracket, and $p_{j\to i}$ refers to the pixel in the source viewpoint $i$ corresponding to the reprojected target pixel $p_j$ of the $j$-th viewpoint. Here we measure the Euclidean distance between the depth points rendered from the target and source viewpoints as the criterion for threshold masking.
(Figure) Mask generation by comparing geometry between the novel view $j$ and the source view $i$, with $I_{i\to j}$ being the warped patch generated for view $j$. In (a) and (b), warping does not occur correctly due to artifacts and self-occlusion, respectively. Such pixels are masked out by $M_l$, allowing only (c), with accurate warping, to serve as a training signal for the rendered image $I_j$.
If the distance between two points is greater than the given threshold value $\tau$, we determine the two rays to be rendering depths of separate surfaces and mask out the corresponding pixel in viewpoint $j$. This process is performed over every pixel of viewpoint $j$ to generate a mask $M$ of the same size as the rendered pixels. Through this technique, we filter out problematic solutions at the feature level and regularize NeRF with only high-confidence image features.
Based on this, the consistency loss $L_{cons}$ is extended as,
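The extended loss is not reproduced here; a sketch, applying the down-sampled masks $M_l$ to the per-layer feature differences and normalizing by the number of unmasked entries (notation assumed):

$$L^M_{cons} = \sum_{l=1}^{L} \frac{1}{m_l} \Big\| M_l \odot \big( f_{\varphi,l}(I_j) - \mathrm{sg}\big[f_{\varphi,l}(I_{i\to j})\big] \big) \Big\|_1 ,$$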
where $m_l$ is the sum of non-zero values in $M_l$.
Since our method depends on the quality of the depth rendered by NeRF, we directly impose additional regularization on the rendered depth to facilitate optimization. We further encourage local depth smoothness in rendered scenes by imposing an $\ell_1$ penalty on the disparity gradient within randomly sampled patches of the input views. In addition, inspired by [R7], we take into account the fact that depth discontinuities in a depth map are likely to be aligned with the gradients of its color image, and introduce an edge-aware term with image gradients $\partial I$ to weight the disparity values. Specifically, following [R7], we regularize for edge-aware depth smoothness,
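The smoothness term is not reproduced here; a sketch following the edge-aware formulation of [R7] that the text cites, summed over the $N$ pixels of a sampled patch (the exact weighting may differ):

$$L_{reg} = \frac{1}{N} \sum_{p} \Big( \big|\partial_x D^*_i(p)\big|\, e^{-\|\partial_x I_i(p)\|} + \big|\partial_y D^*_i(p)\big|\, e^{-\|\partial_y I_i(p)\|} \Big),$$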
where $D^*_i = D_i / \overline{D_i}$ is the mean-normalized inverse depth from [R7], used to discourage shrinking of the estimated depth.
We optimize our model with a final loss that combines the original NeRF pixel-wise reconstruction loss $L_{obs}$ with two types of regularization losses: $L^M_{cons}$ for unobserved-view consistency modeling and $L_{reg}$ for disparity regularization.
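A sketch of the combined objective; the weighting coefficients $\lambda_{cons}$ and $\lambda_{reg}$ are illustrative assumptions, not values stated in this text:

$$L_{total} = L_{obs} + \lambda_{cons}\, L^M_{cons} + \lambda_{reg}\, L_{reg} .$$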
The difficulty of accurate warping increases the further the target view is from the source view, which means that sampling distant camera poses from the very beginning of training may have negative effects on our model. Therefore, we first generate camera poses near the source views, then progressively further away as training proceeds. We sample a noise value uniformly within the interval $[-\beta, +\beta]$ and add it to the original Euler rotation angles of the input view poses, with the parameter $\beta$ growing linearly from 3 to 9 degrees throughout the course of optimization. This design choice can be intuitively understood as first stabilizing locations near the observed viewpoints and then propagating this regularization to further locations, where warping becomes progressively more difficult.
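A minimal Python sketch of this progressive pose sampling; the function name and everything beyond the uniform noise and the 3-to-9-degree linear growth of $\beta$ are illustrative assumptions:

```python
import numpy as np

def sample_unseen_pose(input_euler_deg, step, total_steps,
                       beta_min=3.0, beta_max=9.0):
    """Perturb an input-view rotation (Euler angles, in degrees) to obtain an
    unseen training viewpoint near the observed views.

    beta grows linearly from beta_min to beta_max over optimization, so sampled
    poses start close to the source views and move progressively further away.
    """
    beta = beta_min + (beta_max - beta_min) * min(step / total_steps, 1.0)
    noise = np.random.uniform(-beta, beta, size=3)  # one offset per Euler angle
    return np.asarray(input_euler_deg) + noise
```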
We find that most of the artifacts that occur are high-frequency occlusions filling the space between the scene and the camera. This behavior can be effectively suppressed by constraining the order of the Fourier positional encoding to low dimensions. For this reason, we adopt the coarse-to-fine frequency annealing strategy previously used by [R8] to regularize our optimization. This strategy forces our network to primarily optimize coarse, low-frequency details first, where self-occlusions and fine features are minimized, easing the difficulty of the warping process in the early stages of training. Following [R8], the annealing equation is $\alpha(t) = m t / K$, with $m$ the number of encoding frequencies and $t$ the iteration step, and we set the hyper-parameter $K$ to $15k$.
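The annealing parameter $\alpha(t)$ is used to window the positional-encoding frequencies; a sketch of the per-frequency weight from the Nerfies formulation [R8] that the text cites:

$$w_k(\alpha) = \frac{1 - \cos\big(\pi\, \mathrm{clamp}(\alpha - k,\, 0,\, 1)\big)}{2},$$

where $w_k$ scales the $k$-th frequency band, so higher-frequency components are activated only as $\alpha(t)$ grows during training.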
The experiments are conducted on two datasets, reconstructing scenes from three input views. Compared with the competing methods, our approach learns finer details, reconstructs smoother surfaces, and produces fewer artifacts in the background.
The ablation study shows that the feature-level consistency loss significantly improves the performance of the baseline [R6].
Adding the consistency mask further improves both appearance and geometry.
The smoothness regularization on depth also benefits scene reconstruction.
The two progressive training strategies likewise improve the stability and performance of the model.
Moreover, if pseudo ground truths are generated only by warping between known viewpoints, rather than warping to arbitrary unseen viewpoints using depth and camera parameters as in this paper, the large differences between viewpoints still cause a significant drop in results.
Paper: https://arxiv.org/abs/2301.10941
Project Page: https://ku-cvlab.github.io/GeCoNeRF/
Code: https://github.com/KU-CVLAB/GeCoNeRF (Not released)
Related works
[R1] NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
[R2] Putting NeRF on a Diet: Semantically Consistent Few-Shot View Synthesis
[R3] InfoNeRF: Ray Entropy Minimization for Few-Shot Neural Volume Rendering
[R4] Depth-supervised NeRF: Fewer Views and Faster Training for Free
[R5] RegNeRF: Regularizing Neural Radiance Fields for View Synthesis from Sparse Inputs
[R6] Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields
[R7] Unsupervised Monocular Depth Estimation with Left-Right Consistency
[R8] Nerfies: Deformable Neural Radiance Fields