Figure 1: Examples of the switch operation, which switches the global representations of two images from four datasets: (a) CIFAR-10, (b) ImageNet, (c) LSUN Bedroom and (d) CelebA-HQ.
Contents
Decoupling Global and Local Representations via Invertible Generative Flows
Abstract
1. Introduction
1.1 Variational Auto-Encoders (VAEs)
1.2 Generative Flows
1.3 Problems of VAEs and Generative Flows
2. Generative Model for Decoupled Representation Learning
2.1 Generative Model Architecture
2.2 Compression Encoder
2.3 Invertible Decoder Based on Generative Flow
2.4 Tackling the Two Problems in VAEs and Generative Flows
Appendix
A Preliminary Introduction of Glow
B Implementation Details
B.1 Compression Encoder
B.2 Scale Term in Affine Coupling Layers
B.3 Prior Distribution in VAEs
In this work, we propose a new generative model that is capable of automatically decoupling global and local representations of images in an entirely unsupervised setting, by embedding a generative flow in the VAE framework to model the decoder.
Specifically, the proposed model utilizes the variational auto-encoding framework to learn a (low-dimensional) vector of latent variables to capture the global information of an image, which is fed as a conditional input to a flow-based invertible decoder with architecture borrowed from style transfer literature.
Experimental results on standard image benchmarks demonstrate the effectiveness of our model in terms of density estimation, image generation and unsupervised representation learning.
Importantly, this work demonstrates that with only architectural inductive biases, a generative model with a likelihood-based objective is capable of learning decoupled representations, requiring no explicit supervision.
Unsupervised learning of probabilistic models and meaningful representation learning are two central yet challenging problems in machine learning. Formally, let $X \in \mathcal{X}$ be the random variables of the observed data, e.g., $X$ is an image. One goal of generative models is to learn the parameter $\theta$ such that the model distribution $p_\theta(X)$ best approximates the true distribution $P(X)$. Throughout the paper, uppercase letters represent random variables and lowercase letters their realizations.
Note: this paragraph sets up the problem background.
Unsupervised (disentangled) representation learning, besides data distribution estimation and data generation, is also a principal component in generative models. The goal is to identify and disentangle the underlying causal factors, to tease apart the underlying dependencies of the data, so that it becomes easier to understand, to classify, or to perform other tasks (Bengio et al., 2013). Unsupervised representation learning has attracted significant interest, and a number of techniques (Chen et al., 2017a; Devlin et al., 2019; Hjelm et al., 2019) have emerged over the years to address this challenge. Among these generative models, VAE (Kingma & Welling, 2014; Rezende et al., 2014) and Generative (Normalizing) Flows (Dinh et al., 2014) have stood out for their simplicity and effectiveness.
Note: this paragraph introduces the concepts used in the paper: unsupervised learning, disentanglement, representation learning, VAEs, and generative flows.
VAE, as a member of latent variable models (LVMs), has gained popularity for its ability to automatically learn meaningful (low-dimensional) representations from raw data. In the framework of VAEs, a set of latent variables $Z \in \mathcal{Z}$ is introduced, and the model distribution $p_\theta(X)$ is defined as the marginal of the joint distribution between $X$ and $Z$:
$$p_\theta(x) = \int_{\mathcal{Z}} p_\theta(x, z)\, d\mu(z) = \int_{\mathcal{Z}} p_\theta(x \mid z)\, p_\theta(z)\, d\mu(z), \tag{1}$$
where the joint distribution $p_\theta(x, z)$ is factorized as the product of a prior $p_\theta(z)$ over the latent $Z$ and the "generative" distribution $p_\theta(x \mid z)$. $\mu(z)$ is the base measure on the latent space $\mathcal{Z}$.
In general, this marginal likelihood is intractable to compute or differentiate directly. Variational Inference (Wainwright et al., 2008) provides a solution: it optimizes the evidence lower bound (ELBO), an alternative objective, by introducing a parametric inference model $q_\phi(z \mid x)$:
$$\mathbb{E}_{p(X)}\big[\log p_\theta(X)\big] \ge \mathbb{E}_{p(X)}\Big[\mathbb{E}_{q_\phi(Z \mid X)}\big[\log p_\theta(X \mid Z)\big] - \mathrm{KL}\big(q_\phi(Z \mid X)\,\|\,p_\theta(Z)\big)\Big], \tag{2}$$
where the ELBO can be seen as an autoencoding loss with $q_\phi(Z \mid X)$ being the encoder and $p_\theta(X \mid Z)$ being the decoder, and with the first term on the RHS of (2) being the reconstruction error.
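The two terms on the RHS of (2) can be made concrete with a small sketch. It assumes a standard normal prior and a diagonal Gaussian posterior, for which the KL term has a closed form; this prior is an assumption made for illustration, not necessarily the prior used in the paper (see Appendix B.3). The function names are hypothetical.

```python
import math

def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions.

    Assumes a standard normal prior (an illustrative choice).
    """
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def negative_elbo(recon_log_lik, mu, log_var):
    """Single-sample estimate of the negated RHS of (2).

    recon_log_lik is log p(x|z) evaluated at one z drawn from q(z|x).
    """
    return -recon_log_lik + kl_diag_gaussian_to_standard_normal(mu, log_var)

# When the posterior equals the prior, the KL term is zero and only the
# reconstruction term remains.
loss = negative_elbo(recon_log_lik=-3.0, mu=[0.0, 0.0], log_var=[0.0, 0.0])
```

A KL term of exactly zero for every input is the posterior collapse situation: the posterior carries no information about $x$.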
Put simply, generative flows (a.k.a. normalizing flows) work by transforming a simple distribution (e.g., a simple Gaussian) into a complex one (e.g., the complex distribution of data $X$) through a chain of invertible transformations.
Formally, a generative flow defines a bijection function $f: \mathcal{X} \to \Upsilon$ (with $g = f^{-1}$), where $\upsilon \in \Upsilon$ is a set of latent variables with simple prior distribution $p_\Upsilon(\upsilon)$. It provides us with an invertible transformation between $X$ and $\Upsilon$, whereby the generative process over $X$ is defined straightforwardly:
$$\upsilon \sim p_\Upsilon(\upsilon), \qquad x = g_\theta(\upsilon). \tag{3}$$
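An important insight behind generative flows is that, given this bijection function, the change-of-variables formula defines the model distribution on $X$ by:
$$p_\theta(x) = p_\Upsilon\big(f_\theta(x)\big) \left| \det\left( \frac{\partial f_\theta(x)}{\partial x} \right) \right|, \tag{4}$$
where $\frac{\partial f_\theta(x)}{\partial x}$ is the Jacobian of $f_\theta$ at $x$. A stacked sequence of such invertible transformations is called a generative (normalizing) flow (Rezende & Mohamed, 2015):
$$X \underset{g_1}{\overset{f_1}{\longleftrightarrow}} H_1 \underset{g_2}{\overset{f_2}{\longleftrightarrow}} H_2 \underset{g_3}{\overset{f_3}{\longleftrightarrow}} \cdots \underset{g_K}{\overset{f_K}{\longleftrightarrow}} \Upsilon, \tag{5}$$
where $f = f_1 \circ f_2 \circ \cdots \circ f_K$ is a flow of $K$ transformations (omitting $\theta$s for brevity).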
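As a sanity check on (4) and (5), the sketch below stacks one-dimensional affine flows and evaluates the log-density through the change-of-variables formula. The affine map and all function names are illustrative assumptions; real flows such as Glow use far richer transformations.

```python
import math

def std_normal_logpdf(v):
    """Log-density of N(0, 1) at v."""
    return -0.5 * (v * v + math.log(2.0 * math.pi))

def affine_flow(scale, shift):
    """Return (f, log|det J|) for the 1-D bijection f(x) = (x - shift) / scale."""
    def f(x):
        return (x - shift) / scale
    return f, -math.log(abs(scale))

def flow_logpdf(x, flows):
    """log p(x) = log p_Upsilon(f(x)) + sum of log|det J_k| over the chain."""
    h, total_log_det = x, 0.0
    for f, log_abs_det in flows:
        h = f(h)                     # X -> H_1 -> ... -> Upsilon
        total_log_det += log_abs_det
    return std_normal_logpdf(h) + total_log_det

# One affine flow with scale 2 and shift 1 turns N(0, 1) into N(1, 4).
x = 0.5
model = flow_logpdf(x, [affine_flow(2.0, 1.0)])
closed_form = -0.5 * ((x - 1.0) / 2.0) ** 2 - math.log(2.0) - 0.5 * math.log(2.0 * math.pi)
```

Here `model` and `closed_form` agree up to floating-point error, and stacking more flows only adds further log-determinant terms to the sum.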
Despite their impressive successes, VAEs and generative flows still suffer from their own problems.
Posterior Collapse in VAEs
As discussed in Bowman et al. (2015), without further assumptions, the ELBO objective in (2) may not guide the model towards the intended role for the latent variables $Z$, or may even learn an uninformative $Z$, as observed when the KL term $\mathrm{KL}\big(q_\phi(Z \mid X)\,\|\,p_\theta(Z)\big)$ vanishes to zero. The essential reason for this posterior collapse problem is that, in an entirely unsupervised setting, the marginal likelihood-based objective incorporates no (direct) supervision on the latent space to give the latent variable $Z$ the properties preferred for representation learning.
Note: for more details, see the related blog post on the posterior collapse problem in VAEs.
Local Dependency in Generative Flows
Generative flows suffer from limited expressiveness and local dependency. Most generative flows tend to capture the dependency among features only locally, and are incapable of realistic synthesis of large images compared to GANs (Goodfellow et al., 2014). Unlike latent variable models, e.g. VAEs, which represent the high-dimensional data as coordinates in a latent low-dimensional space, in generative flows the long-term dependencies that usually describe the global features of the data can only be propagated through a composition of transformations. Previous studies attempted to enlarge the receptive field by using a special design of parameterization like masked convolutions (Ma et al., 2019a) or attention mechanisms (Ho et al., 2019).
Ma et al., 2019a : Macow: Masked convolutional generative flow. NeurIPS 2019.
Ho et al., 2019 : Flow++: Improving flow-based generative models with variational dequantization and architecture design. ICML 2019.
In this paper, we propose a simple and effective generative model to simultaneously tackle the aforementioned challenges of VAEs and generative flows by leveraging their properties to complement each other.
By embedding a generative flow in the VAE framework to model the decoder, the proposed model is able to learn decoupled representations which capture global and local information of images respectively in an entirely unsupervised manner.
The key insight is to utilize the inductive biases from the model architecture design — leveraging the VAE framework equipped with a compression encoder to extract the global information in a low-dimensional representation, and a flow-based decoder which favors local dependencies to store the residual information into a local high-dimensional representation (§2).
Experimentally, on four benchmark datasets for images, we demonstrate the effectiveness of our model on two aspects:
(i) density estimation and image generation, by consistently achieving significant improvements over Glow (Kingma & Dhariwal, 2018),
(ii) decoupled representation learning, by performing classification on learned representations and the switch operation (see examples in Figure 1).
Perhaps most strikingly, we demonstrate the feasibility of decoupled representation learning via the plain likelihood-based generation, using only architectural inductive biases (§3).
We first illustrate the high-level insights of the architecture design of our generative model (shown in Figure 2) before detailing each component in the following sections.
In the training process of our generative model, we minimize the negative ELBO in VAE:
$$\min_{\theta, \phi} \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}_{elbo}(x^{i}; \theta, \phi),$$
where $\mathcal{L}_{elbo}(x; \theta, \phi)$ is the negative of the RHS of (2):
$$\mathcal{L}_{elbo}(x; \theta, \phi) = \mathbb{E}_{q_\phi(z \mid x)}\big[-\log p_\theta(x \mid z)\big] + \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p_\theta(z)\big).$$
Specifically, we first feed the input image $x$ into the encoder $q_\phi(z \mid x)$ to compute the latent variable $z$. The encoder is designed to be a compression network, which compresses the high-dimensional image into a low-dimensional vector (§2.2). Through this compression process, the local information of an image $x$ is forced to be discarded, yielding a representation $z$ that captures the global information.
Then we feed $z$ as a conditional input to a flow-based decoder, which transforms $x$ into the representation $\upsilon$ with the same dimension (§2.3). Since the decoder is invertible, with $z$ and $\upsilon$ we can exactly reconstruct the original image $x$. This indicates that $z$ and $\upsilon$ together retain all the information of $x$, and the reconstruction process can be regarded as an addition operation: adding $z$ and $\upsilon$ to recover $x$. In this way, we expect that the local information discarded in the compression process will be restored in $\upsilon$.
In the generative process, we combine the sampling procedures of VAEs and generative flows:
we first sample a value of $z$ from the prior distribution $p_\theta(z)$, and a value of $\upsilon$ from $p_\Upsilon(\upsilon)$;
second, we input $z$ and $\upsilon$ into the invertible function $f^{-1}$ modeled by the generative flow decoder to generate an image $x = f_\theta^{-1}(\upsilon; z)$.
Following previous work, the variational posterior distribution $q_\phi(z \mid x)$, a.k.a. the encoder, models the latent variable $z$ as a diagonal Gaussian with learned mean and variance:
$$q_\phi(z \mid x) = \mathcal{N}\big(z;\ \mu(x),\ \sigma^2(x)\big), \tag{6}$$
where $\mu(\cdot)$ and $\sigma(\cdot)$ are neural networks. In the context of 2D images, where $x$ is a tensor of shape $[h \times w \times c]$ with spatial dimensions $(h, w)$ and channel dimension $c$, the compression encoder maps each image $x$ to a $d_z$-dimensional vector, where $d_z$ is the dimension of the latent space.
In this work, the motivation of the encoder is to compress the high-dimensional data $x$ into a low-dimensional latent variable $z$, i.e., $d_z \ll h \times w \times c$, to force the latent representation $z$ to capture the global features of $x$.
Furthermore, unlike previous studies on VAE-based generative models for natural images (Kingma et al., 2016; Chen et al., 2017a; Ma et al., 2019b) that represented latent codes $z$ as low-resolution feature maps, we represent $z$ as an unstructured 1-dimensional vector to erase the local spatial dependencies.
Concretely, we implement the encoder with an architecture similar to ResNet (He et al., 2016). The spatial downsampling is implemented by a 2-strided ResNet block with 3×3 filters. On top of these ResNet blocks, there is one more fully-connected layer with the number of output units equal to $2 \times d_z$ to generate $\mu$ and $\log \sigma^2$ (details in Appendix B).
Zero initialization
Following Ma et al. (2019c), we initialize the weights of the last fully-connected layer that generates the $\mu$ and $\log \sigma^2$ values with zeros. This ensures that the posterior distribution is initialized as a simple normal distribution, which has been shown to help train very deep neural networks more stably in the framework of VAEs.
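The effect of zero initialization can be checked with a tiny sketch. It assumes the last layer outputs the mean and the log-variance concatenated, and that the bias is also initialized to zero; both are assumptions made for illustration, and the helper names are hypothetical.

```python
import math

def linear(x, weight, bias):
    """Plain fully-connected layer: weight is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def zero_init_head(in_features, d_z):
    """Last layer with 2 * d_z outputs, all weights and biases set to zero."""
    weight = [[0.0] * in_features for _ in range(2 * d_z)]
    bias = [0.0] * (2 * d_z)
    return weight, bias

def posterior_params(features, weight, bias):
    """Split the layer output into the mean and the standard deviation."""
    out = linear(features, weight, bias)
    d_z = len(out) // 2
    mu, log_var = out[:d_z], out[d_z:]
    sigma = [math.exp(0.5 * lv) for lv in log_var]
    return mu, sigma

weight, bias = zero_init_head(in_features=4, d_z=3)
mu, sigma = posterior_params([0.7, -1.3, 2.0, 5.5], weight, bias)
```

Whatever features the ResNet blocks produce, `mu` is all zeros and `sigma` is all ones at initialization, so the posterior starts as a standard normal distribution.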
The flow-based decoder defines a (conditionally) invertible function $\upsilon = f_\theta(x; z)$, where $\upsilon$ follows a standard normal distribution $\upsilon \sim \mathcal{N}(0, I)$. Conditioned on the latent variable $z$ output from the encoder, we can reconstruct $x$ with the inverse function $x = f_\theta^{-1}(\upsilon; z)$. The flow-based decoder adopts the main backbone architecture of Glow (Kingma & Dhariwal, 2018), where each step of flow consists of the same three types of elementary flows: actnorm, invertible 1 × 1 convolution and coupling (details in Appendix A).
Conditional Inputs in Affine Coupling Layers
To incorporate $z$ as a conditional input to the decoder, we modify the neural networks for the scale and bias terms:
$$\begin{aligned} x_a, x_b &= \mathrm{split}(x) \\ y_a &= x_a \\ y_b &= s(x_a, z) \odot x_b + b(x_a, z) \\ y &= \mathrm{concat}(y_a, y_b), \end{aligned} \tag{7}$$
where s() and b() take both $x_a$ and $z$ as input. Specifically, each coupling layer includes three convolution layers where the first and last convolutions are 3 × 3, while the center convolution is 1 × 1. ELU (Clevert et al., 2015) is used as the activation function throughout the flow architecture:
(8)
where FC() refers to a linear fully-connected layer and ⊕ is the per-channel addition between a 2D image and a 1D vector.
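The conditional coupling transform can be sketched on flat lists. The `scale_and_bias` function below is a toy stand-in for the convolutional networks s() and b(); its form, and every function name here, is a hypothetical choice made only to show that the layer stays invertible while depending on $z$.

```python
import math

def split(x):
    """Continuous split into two halves (assumes an even length)."""
    half = len(x) // 2
    return x[:half], x[half:]

def scale_and_bias(x_a, z, n_out):
    """Toy stand-in for s(x_a, z) and b(x_a, z); not the paper's networks."""
    ctx = sum(x_a) / len(x_a) + sum(z) / len(z)
    scale = [math.exp(math.tanh(ctx + 0.1 * i)) for i in range(n_out)]  # always > 0
    bias = [0.5 * ctx - 0.1 * i for i in range(n_out)]
    return scale, bias

def coupling_forward(x, z):
    """y_a = x_a, y_b = s * x_b + b; also returns log|det J| = sum(log s)."""
    x_a, x_b = split(x)
    s, b = scale_and_bias(x_a, z, len(x_b))
    y_b = [si * xi + bi for si, xi, bi in zip(s, x_b, b)]
    return x_a + y_b, sum(math.log(si) for si in s)

def coupling_inverse(y, z):
    """Recover x: y_a is unchanged, so s and b can be recomputed exactly."""
    y_a, y_b = split(y)
    s, b = scale_and_bias(y_a, z, len(y_b))
    return y_a + [(yi - bi) / si for yi, bi, si in zip(y_b, b, s)]

x = [0.3, -1.2, 0.8, 2.0]
z = [0.5, -0.5]
y, log_det = coupling_forward(x, z)
x_rec = coupling_inverse(y, z)                  # matches x up to rounding
x_other = coupling_inverse(y, [1.0, 1.0])       # same y, different z
```

Inverting with the same $z$ recovers $x$, while inverting with a different $z$ gives a different output. This dependence of the decoded result on $z$ is what an operation like the switch in Figure 1 relies on.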
Importantly, $z$ is fed as a conditional input to every coupling layer, unlike previous work (Agrawal & Dukkipati, 2016; Morrow & Chiu, 2019) where $z$ is only used to learn the mean and variance of the underlying Gaussian of $\upsilon$. This design is inspired by the generator in Style-GAN (Karras et al., 2019), where the style-vector is added to each block of the generator. We conduct experiments to show the importance of this architectural design (see §3.1).
Refined Architecture of Glow
In this work, we refine the organization of these three elementary flows in one step (see Figure 3a) to reduce the total number of invertible 1 × 1 convolution flows. The reason is that the cost and the numerical stability of computing or differentiating the determinant of the weight matrix become the practical bottleneck when the channel dimension $c$ is considerably large for high-resolution images. To reduce the number of invertible 1 × 1 convolution flows while maintaining the permutation effect along the channel dimension, we use four split patterns for the split() function in (7) (see Figure 3b). The splits are performed on the channel dimension with continuous and alternate patterns, respectively. For each pattern of the split, we alternate $x_a$ and $x_b$. Coupling layers with different split types alternate in one step of our flow, as illustrated in Figure 3a. We further replace the original multi-scale architecture with the fine-grained multi-scale architecture (Figure 3c) proposed in Ma et al. (2019a), with the same value of M = 4. Experimental improvements over Glow demonstrate the effectiveness of our refined architecture (§3.1).
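One plausible reading of the four split patterns is two channel layouts (continuous and alternate), each used in both orders of $x_a$ and $x_b$. The sketch below follows that reading on a list of channel indices; the function name and the `swap` flag are hypothetical, and Figure 3b remains the authoritative description.

```python
def split_channels(channels, pattern, swap=False):
    """Split channel indices into (x_a, x_b) under one of the split patterns.

    pattern: "continuous" takes the first and second halves,
             "alternate" takes even and odd positions.
    swap:    exchange the roles of x_a and x_b.
    """
    if pattern == "continuous":
        half = len(channels) // 2
        a, b = channels[:half], channels[half:]
    elif pattern == "alternate":
        a, b = channels[0::2], channels[1::2]
    else:
        raise ValueError(f"unknown split pattern: {pattern}")
    return (b, a) if swap else (a, b)

channels = [0, 1, 2, 3]
patterns = [
    split_channels(channels, "continuous"),
    split_channels(channels, "continuous", swap=True),
    split_channels(channels, "alternate"),
    split_channels(channels, "alternate", swap=True),
]
```

Cycling through these patterns lets every channel act as both the conditioning part and the transformed part, which mixes channels without an extra invertible 1 × 1 convolution.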
Resolving Local Dependency in Generative Flows with Global Information from $z$.
As discussed in §1.3, the flow-based decoder suffers from limited expressiveness and local dependency. In the VAE framework of our model, the latent codes $z$ provide the decoder with the global information that is essential to resolving the limited expressiveness caused by local dependency. On the other hand, the flow-based decoder tends to store local dependencies, encouraging the encoder to extract global information that is complementary to it.
Resolving Posterior Collapse in VAEs with Flow-based Decoders.
As discussed in previous work (Chen et al., 2017a), one possible reason for posterior collapse in VAEs is that the decoder model is so expressive that it completely ignores the latent variables $z$, and a solution to posterior collapse is to limit the capacity of the decoder. This suggests that generative flows are an ideal model for the decoder, since they are insufficiently powerful to trigger the posterior collapse problem.
Architectural Inductive Biases for Decoupled Representation Learning.
From the high-level view of our model, we utilize these complementary properties of the architectures of the encoder and decoder as inductive bias to attempt to decouple the global and local information of an image by storing them in separate representations. The (indirect) supervision of learning global latent representation z comes from two sources of architectural inductive bias.
First, the compression architecture, which takes a high-dimensional image as input and outputs a low-dimensional vector, encourages the encoder to discard local dependencies of the image.
Second, the preference of the flow-based decoder for capturing local dependencies reinforces global information modeling of the encoder, since all the information of the input image $x$ needs to be preserved by $z$ and $\upsilon$.
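The appendix covers only background material, so it is included directly as screenshots (I was too lazy to transcribe it).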