Neural approximations of scalar and vector fields, such as signed distance functions and radiance fields, have emerged as accurate, high-quality representations. State-of-the-art results are obtained by conditioning a neural approximation with a lookup from trainable feature grids that take on part of the learning task and allow for smaller, more efficient neural networks. Unfortunately, these feature grids usually come at the cost of significantly increased memory consumption compared to stand-alone neural network models. We present a dictionary method for compressing such feature grids, reducing their memory consumption by up to 100x and permitting a multiresolution representation which can be useful for out-of-core streaming. We formulate the dictionary optimization as a vector-quantized auto-decoder problem which lets us learn end-to-end discrete neural representations in a space where no direct supervision is available and with dynamic topology and structure.
(Top-left shows a baseline neural radiance field whose uncompressed feature grid weighs 15 207 kB. Our method, shown bottom right, compresses this by a factor of 60x, with minimal visual impact (PSNR shown relative to training images). In a streaming setting, a coarse LOD can be displayed after receiving only the first 10 kB of data. All sizes are without any additional entropy encoding of the bit-stream.)
Feature grid methods are a special class of neural fields which have enabled state-of-the-art signal reconstruction quality whilst being able to render and train at interactive rates.
Since Ψ(x; θ, interp(x, Z)) ≈ f(x) is a non-linear function (an MLP Ψ with parameters θ, conditioned on features interpolated from the grid Z at the coordinate x), this approach has the potential to reconstruct signals with frequencies above the usual Nyquist limit. Thus coarser grids can be used, motivating their use in signal compression.
(Nyquist rate: to recover a continuous signal exactly from its samples, the sampling rate must exceed the Nyquist rate, which is twice the frequency of the highest-frequency component of the signal.)
The feature grid can be represented as a matrix Z ∈ R^{m×k}, where m is the number of grid points and k is the dimension of the feature vector at each grid point. Since m × k may be quite large compared to the size of the MLP, the feature vectors are by far the most memory-hungry component.
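To make the grid-plus-MLP idea concrete, here is a minimal PyTorch sketch (an illustration, not the paper's implementation) of a dense feature-grid field: a trainable grid Z is trilinearly interpolated at query coordinates and a small MLP decodes the interpolated features. The class name, resolution, feature dimension, and network sizes are arbitrary choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureGridField(nn.Module):
    def __init__(self, R=64, k=8, hidden=64, out_dim=1):
        super().__init__()
        # Z: the trainable feature grid, R^3 vertices with k features each.
        self.Z = nn.Parameter(0.01 * torch.randn(1, k, R, R, R))
        self.mlp = nn.Sequential(
            nn.Linear(k, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        # x: (P, 3) coordinates in [-1, 1]^3.
        grid = x.view(1, -1, 1, 1, 3)                       # (1, P, 1, 1, 3)
        feats = F.grid_sample(self.Z, grid,                 # trilinear interpolation
                              mode='bilinear', align_corners=True)
        feats = feats.view(self.Z.shape[1], -1).t()         # (P, k)
        return self.mlp(feats)                              # Psi(x; theta, interp(x, Z))

field = FeatureGridField()
pred = field(torch.rand(1024, 3) * 2 - 1)                   # e.g. SDF values at 1024 points
```

With a structure like this, most of the trainable parameters live in the grid Z rather than in the MLP, which is exactly the memory problem the paper targets.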
These methods require high-resolution feature grids to achieve good quality. This makes them less practical for graphics systems which must operate within tight memory, storage, and bandwidth budgets.
Beyond compactness, it is also desirable for a shape representation to dynamically adapt to the spatially varying complexity of the data, the available bandwidth, and desired level of detail.
We propose the vector-quantized auto-decoder method, which uses the auto-decoder framework with an extra focus on learning compressed representations. The key idea is to replace bulky feature vectors with indices into a learned codebook. (In prior work such as NGLOD [R2], the feature vectors consumed 512 bits each; the codebook indices that replace them in this work may be as small as 4 bits.) These indices, the codebook, and a decoder MLP network are all trained jointly.
(a) shows the baseline uncompressed version of our data structure, in which we store the bulky feature vectors at every grid vertex, of which there may be millions. In (b), we store a compact b-bit code per vertex, which indexes into a small codebook of feature vectors. This reduces the total storage size, and this representation is directly used at inference time. This indexing operation is not differentiable; at training time (c), we replace the indices with vectors of width 2^b, to which softmax is applied before multiplying with the entire codebook. This ‘soft-indexing’ operation is differentiable, and can be converted back to the ‘hard’ indices used in (b) through an argmax operation.
Baseline: NGLOD [R2] & DeepSDF [R3]
In order to effectively apply discrete signal compression to feature grids, we leverage the auto-decoder [R3] framework, in which only the decoder (the inverse transform) is explicitly constructed. The auto-decoder is optimized by computing the reconstruction error after decoding the transform coefficients.
A strength of the auto-decoder is that it can reconstruct transform coefficients with respect to supervision in a domain different from the signal we wish to reconstruct. We define a differentiable forward map F as an operator which lifts a signal onto another domain. For radiance field reconstruction, the signal of interest is volumetric density and plenoptic color, while the supervision is over 2D images. In this case, F represents a differentiable renderer.
The feature grid is a matrix Z ∈ R^{m×k}, where m is the size of the grid and k is the feature vector dimension. Local embeddings are queried from the feature grid with interpolation at a coordinate x and fed to an MLP Ψ to reconstruct continuous signals. The feature grid is learned by optimizing equation (6), where interp represents trilinear interpolation of the 8 feature grid points surrounding x. The forward map F is applied to the output of the MLP Ψ; in our experiments, it is a differentiable renderer [R1] and the supervision targets are the training image pixels.
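As a rough illustration of training through a forward map, the sketch below optimizes the grid and MLP against image-space targets. It reuses the hypothetical FeatureGridField from the earlier snippet, and a toy averaging function stands in for a real differentiable volume renderer such as [R1]; all names, shapes, and data here are assumptions for illustration.

```python
import torch

field = FeatureGridField(out_dim=1)
optimizer = torch.optim.Adam(field.parameters(), lr=1e-3)

def toy_forward_map(field, ray_origins, ray_dirs, n_samples=16):
    # Crude stand-in for a renderer F: average the field along each ray.
    t = torch.linspace(0.1, 1.0, n_samples).view(1, n_samples, 1)
    pts = ray_origins.unsqueeze(1) + t * ray_dirs.unsqueeze(1)   # (R, S, 3)
    vals = field(pts.reshape(-1, 3)).view(-1, n_samples)
    return vals.mean(dim=-1)                                     # (R,) per-ray values

origins = torch.zeros(256, 3)
dirs = torch.nn.functional.normalize(torch.randn(256, 3), dim=-1)
target = torch.rand(256)                     # stand-in for training pixel values

for _ in range(100):
    pred = toy_forward_map(field, origins, dirs)
    loss = ((pred - target) ** 2).mean()     # supervision lives in the image domain
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```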
The feature grid can be treated as a block-based decomposition of the signal, where each row vector (block) of size k controls a local spatial region.
Hence, we consider block-based inverse transforms with block coefficients V. Since we want to learn the compressed features, we substitute the decoded grid (the inverse transform applied to V) for Z in the objective.
Considering F(Ψ(x; θ, interp(x, Z))) as a map which lifts the discrete signal Z to a continuous signal where the supervision (and other operations) are applied, we can see that this is equivalent to a block-based compressed auto-decoder.
This allows us to work only with the discrete signal when designing a compressive inverse transform for the feature grid Z; in our case, a vector-quantized inverse transform that directly learns compressed representations.
We define our compressed representation as an integer vector V ∈ Z^m with entries in the range [0, 2^b − 1]. This is used as an index into a codebook matrix D ∈ R^{2^b × k}, where m is the number of grid points, k is the feature vector dimension, and b is the bitwidth. Concretely, we define our decoder function as D[V], where [·] is the (row-wise) indexing operation.
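At storage and inference time the representation is just the index vector plus the codebook. A minimal sketch of the hard lookup, with shapes assumed purely for illustration:

```python
import torch

m, k, b = 1_000_000, 8, 4                 # grid vertices, feature dim, bitwidth
D = torch.randn(2**b, k)                  # codebook: 2^b entries of dimension k
V = torch.randint(0, 2**b, (m,))          # one b-bit index per grid vertex
Z = D[V]                                  # decoded feature grid, shape (m, k)
```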
Solving this optimization problem is difficult because indexing is a non-differentiable operation with respect to the integer index V. (This corresponds to (b) in the overview figure.)
As a solution, in training we propose to represent the integer indices with a softened matrix C ∈ R^{m×2^b}, from which the index vector V = argmax(C) can be obtained by a row-wise argmax. We can then replace the index lookup with the simple matrix product softmax(C)·D and obtain the following optimization problem, where the softmax function is applied row-wise to the matrix C. This optimization problem is now differentiable. (This corresponds to (c) in the overview figure.)
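A sketch of this soft-indexing relaxation (again with assumed shapes): logits C are trained in place of the integer indices, and the row-wise softmax times the codebook replaces the hard lookup.

```python
import torch
import torch.nn as nn

m, k, b = 1_000_000, 8, 4
D = nn.Parameter(0.01 * torch.randn(2**b, k))   # codebook, trained jointly
C = nn.Parameter(torch.zeros(m, 2**b))          # softened index matrix (logits)

Z_soft = torch.softmax(C, dim=-1) @ D           # differentiable 'soft' features, (m, k)
V = C.argmax(dim=-1)                            # hard integer indices, (m,)
Z_hard = D[V]                                   # hard lookup, as stored at inference
```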
In practice, we adopt a straight-through estimator approach to make the loss aware of the hard indexing during training. That is, we use Equation 8 in the forward pass and Equation 9 in the backward pass.
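Continuing the sketch above, one common way to realize a straight-through estimator is the detach trick: the forward value equals the hard lookup while gradients flow through the soft one. The paper's exact implementation may differ.

```python
# Forward pass sees Z_hard; backward pass differentiates through Z_soft.
Z_ste = Z_soft + (Z_hard - Z_soft).detach()
```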
At storage and inference time, we discard the softened matrix C and only store the integer vector V. Even without entropy coding, this gives a compression ratio of roughly 16mk / (mb + 16k·2^b) (assuming 16-bit feature entries), which can be orders of magnitude when b is small and m is large. We generally observe m to be on the order of millions, and evaluate b ∈ {4, 6} in our experiments. In contrast to using a hash function [R6] for indexing, we need to store b-bit integers in the feature grid, but we are able to use a much smaller codebook (table) due to the learned adaptivity of the indices.
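A quick back-of-the-envelope check of this ratio for illustrative values of m, k, and b, assuming 16-bit feature entries:

```python
m, k, b = 10_000_000, 8, 4                      # illustrative sizes, not from the paper
uncompressed_bits = 16 * m * k                  # 16-bit feature vectors at every vertex
compressed_bits = m * b + 16 * k * 2**b         # b-bit indices + small codebook
print(uncompressed_bits / compressed_bits)      # ~32x for these values
```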
Rather than a single-resolution feature grid, we arrange the indices V in a multiresolution sparse octree as in NGLOD [R2], to facilitate streaming level of detail. Thus, for a given coordinate, multiple feature vectors are obtained, one from each tree level, which can then be summed (i.e. in a Laplacian pyramid fashion) or concatenated before being passed to the MLP. We train a separate codebook for each level of the tree. Similarly to NGLOD [R2], we also train multiple levels of detail jointly.
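As a simplified illustration of the multiresolution idea (dense grids at a few resolutions rather than the paper's sparse octree), each level gets its own codebook and its own grid of index logits, and the per-level features are summed before a shared MLP. Class name, resolutions, and shapes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLevelVQGrid(nn.Module):
    def __init__(self, resolutions=(8, 16, 32), k=8, b=4):
        super().__init__()
        # One codebook and one grid of index logits per resolution level.
        self.codebooks = nn.ParameterList(
            [nn.Parameter(0.01 * torch.randn(2**b, k)) for _ in resolutions])
        self.logits = nn.ParameterList(
            [nn.Parameter(torch.zeros(r, r, r, 2**b)) for r in resolutions])

    def forward(self, x):                            # x: (P, 3) in [-1, 1]^3
        feats = 0.0
        for C, D in zip(self.logits, self.codebooks):
            Z = torch.softmax(C, dim=-1) @ D         # decode level grid, (r, r, r, k)
            Z = Z.permute(3, 0, 1, 2).unsqueeze(0)   # (1, k, r, r, r) for grid_sample
            g = x.view(1, -1, 1, 1, 3)
            f = F.grid_sample(Z, g, mode='bilinear', align_corners=True)
            feats = feats + f.view(Z.shape[1], -1).t()   # sum levels, pyramid style
        return feats                                 # (P, k), then fed to a shared MLP
```

In a streaming setting the coarse levels can be sent first, giving a usable low-detail reconstruction before the finer levels arrive.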
Dataset: RTMV dataset.
Uncompressed, the quality is highest, but the storage is far larger than for purely MLP-based methods.
Three ways of compressing the baseline are compared: low-rank approximation, post-hoc k-means vector quantization (kmVQ), and the learned vector quantization of this paper. Compared with the low-rank approximation, the vector-quantized variants achieve a substantially higher compression ratio.
Comparing the two vector-quantization variants, the post-hoc approach causes visible color shifts and a large drop in PSNR.
The method also applies to other forms of compression, e.g. compressing TSDFs. Although some artifacts are introduced relative to the uncompressed representation, storage is reduced dramatically.
Hash-based indexing [R6] is a more random way to compress; it avoids storing the index vector V, but requires storing a much larger table.
At a similar compression ratio, the reconstructions of this method contain less noise.
Mip-NeRF [R4] adapts to scale via different cone widths, but its bitrate is constant. VQ-AD achieves compression and level-of-detail adaptation simultaneously, which makes it better suited for progressive streaming.
VQ-AD reaches bitrates that are orders of magnitude smaller without sacrificing quality as severely as post-hoc methods such as kmVQ. The figure shows that the VQ-AD representation has a variable bitrate and encodes multiple resolutions that can be streamed progressively at different levels of detail. However, the training memory overhead of the method prevents evaluating higher bitrates.
Paper: https://arxiv.org/abs/2206.07707
Project Page: https://nv-tlabs.github.io/vqad/
Code / NVIDIA Kaolin Wisp: https://github.com/NVIDIAGameWorks/kaolin-wisp
(A PyTorch library powered by NVIDIA Kaolin Core to work with neural fields, including NeRFs, NGLOD, instant-ngp, and VQAD.)
Related works
[R1] NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
[R2] Neural Geometric Level of Detail: Real-time Rendering with Implicit 3D Shapes
[R3] DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation
[R4] Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields
[R5] Plenoxels: Radiance Fields without Neural Networks
[R6] Instant Neural Graphics Primitives with a Multiresolution Hash Encoding
[R7] Compressing Volumetric Radiance Fields to 1 MB