Generative Adversarial Nets: Translation [Part 1]


Generative Adversarial Nets

Paper: https://arxiv.org/pdf/1406.2661.pdf

Abstract

We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.

1 Introduction

The promise of deep learning is to discover rich, hierarchical models [2] that represent probability distributions over the kinds of data encountered in artificial intelligence applications, such as natural images, audio waveforms containing speech, and symbols in natural language corpora. So far, the most striking successes in deep learning have involved discriminative models, usually those that map a high-dimensional, rich sensory input to a class label [14, 22]. These striking successes have primarily been based on the backpropagation and dropout algorithms, using piecewise linear units [19, 9, 10] which have a particularly well-behaved gradient. Deep generative models have had less of an impact, due to the difficulty of approximating many intractable probabilistic computations that arise in maximum likelihood estimation and related strategies, and due to the difficulty of leveraging the benefits of piecewise linear units in the generative context. We propose a new generative model estimation procedure that sidesteps these difficulties.¹

In the proposed adversarial nets framework, the generative model is pitted against an adversary: a discriminative model that learns to determine whether a sample is from the model distribution or the data distribution. The generative model can be thought of as analogous to a team of counterfeiters, trying to produce fake currency and use it without detection, while the discriminative model is analogous to the police, trying to detect the counterfeit currency. Competition in this game drives both teams to improve their methods until the counterfeits are indistinguishable from the genuine articles.

∗ Jean Pouget-Abadie is visiting Université de Montréal from Ecole Polytechnique.

† Sherjil Ozair is visiting Université de Montréal from Indian Institute of Technology Delhi.

‡ Yoshua Bengio is a CIFAR Senior Fellow.

¹ All code and hyperparameters available at http://www.github.com/goodfeli/adversarial

This framework can yield specific training algorithms for many kinds of model and optimization algorithm. In this article, we explore the special case when the generative model generates samples by passing random noise through a multilayer perceptron, and the discriminative model is also a multilayer perceptron. We refer to this special case as adversarial nets. In this case, we can train both models using only the highly successful backpropagation and dropout algorithms [17] and sample from the generative model using only forward propagation. No approximate inference or Markov chains are necessary.

2 Related work

Undirected graphical models with latent variables, such as restricted Boltzmann machines (RBMs) [27, 16], deep Boltzmann machines (DBMs) [26] and their numerous variants, are an alternative to directed graphical models with latent variables. The interactions within such models are represented as the product of unnormalized potential functions, normalized by a global summation/integration over all states of the random variables. This quantity (the partition function) and its gradient are intractable for all but the most trivial instances, although they can be estimated by Markov chain Monte Carlo (MCMC) methods. Mixing poses a significant problem for learning algorithms that rely on MCMC [3, 5].

Deep belief networks (DBNs) [16] are hybrid models containing a single undirected layer and several directed layers. While a fast approximate layer-wise training criterion exists, DBNs incur the computational difficulties associated with both undirected and directed models.

Alternative criteria that do not approximate or bound the log-likelihood have also been proposed, such as score matching [18] and noise-contrastive estimation (NCE) [13]. Both of these require the learned probability density to be analytically specified up to a normalization constant. Note that in many interesting generative models with several layers of latent variables (such as DBNs and DBMs), it is not even possible to derive a tractable unnormalized probability density. Some models such as denoising auto-encoders [30] and contractive autoencoders have learning rules very similar to score matching applied to RBMs. In NCE, as in this work, a discriminative training criterion is employed to fit a generative model. However, rather than fitting a separate discriminative model, the generative model itself is used to discriminate generated data from samples of a fixed noise distribution. Because NCE uses a fixed noise distribution, learning slows dramatically after the model has learned even an approximately correct distribution over a small subset of the observed variables.

Finally, some techniques do not involve defining a probability distribution explicitly, but rather train a generative machine to draw samples from the desired distribution. This approach has the advantage that such machines can be designed to be trained by back-propagation. Prominent recent work in this area includes the generative stochastic network (GSN) framework [5], which extends generalized denoising auto-encoders [4]: both can be seen as defining a parameterized Markov chain, i.e., one learns the parameters of a machine that performs one step of a generative Markov chain. Compared to GSNs, the adversarial nets framework does not require a Markov chain for sampling. Because adversarial nets do not require feedback loops during generation, they are better able to leverage piecewise linear units [19, 9, 10], which improve the performance of backpropagation but have problems with unbounded activation when used in a feedback loop. More recent examples of training a generative machine by back-propagating into it include recent work on auto-encoding variational Bayes [20] and stochastic backpropagation [24].

3 Adversarial nets

The adversarial modeling framework is most straightforward to apply when the models are both multilayer perceptrons. To learn the generator's distribution $p_g$ over data $x$, we define a prior on input noise variables $p_z(z)$, then represent a mapping to data space as $G(z; \theta_g)$, where $G$ is a differentiable function represented by a multilayer perceptron with parameters $\theta_g$. We also define a second multilayer perceptron $D(x; \theta_d)$ that outputs a single scalar. $D(x)$ represents the probability that $x$ came from the data rather than $p_g$. We train $D$ to maximize the probability of assigning the correct label to both training examples and samples from $G$. We simultaneously train $G$ to minimize $\log(1 - D(G(z)))$. In other words, $D$ and $G$ play the following two-player minimax game with value function $V(D, G)$:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{1}$$
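As a rough illustration of the value function in Eq. 1, the sketch below estimates $V(D, G)$ by Monte Carlo for a one-dimensional toy problem. Everything concrete here is an assumption made for the example and does not come from the paper: the Gaussian stand-ins for $p_{\text{data}}$ and $p_z$, the identity-like `generator`, and the logistic `discriminator` with hand-picked weights.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def discriminator(x, w=2.0, b=-2.0):
    # Hypothetical D(x): a logistic function of a linear score.
    return sigmoid(w * x + b)

def generator(z, shift=0.0):
    # Hypothetical G(z): passes the noise through with an optional shift.
    return z + shift

def value_estimate(data, noise, d, g):
    """Monte Carlo estimate of V(D, G) from Eq. 1."""
    real_term = sum(math.log(d(x)) for x in data) / len(data)
    fake_term = sum(math.log(1.0 - d(g(z))) for z in noise) / len(noise)
    return real_term + fake_term

rng = random.Random(0)
data = [rng.gauss(2.0, 1.0) for _ in range(5000)]   # assumed stand-in for p_data
noise = [rng.gauss(0.0, 1.0) for _ in range(5000)]  # assumed stand-in for p_z

# A discriminator that always answers 1/2 versus one that separates the two.
uninformed = value_estimate(data, noise, lambda x: 0.5, generator)
informed = value_estimate(data, noise, discriminator, generator)
print(uninformed, informed)
```

A discriminator that always outputs 1/2 gives exactly $-\log 4$, and a discriminator that actually separates real from generated samples pushes the value higher, which is the quantity $D$ maximizes and $G$ tries to pull back down.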

In the next section, we present a theoretical analysis of adversarial nets, essentially showing that the training criterion allows one to recover the data generating distribution as G and D are given enough capacity, i.e., in the non-parametric limit. See Figure 1 for a less formal, more pedagogical explanation of the approach. In practice, we must implement the game using an iterative, numerical approach. Optimizing D to completion in the inner loop of training is computationally prohibitive, and on finite datasets would result in overfitting. Instead, we alternate between k steps of optimizing D and one step of optimizing G. This results in D being maintained near its optimal solution, so long as G changes slowly enough. This strategy is analogous to the way that SML/PCD [31, 29] training maintains samples from a Markov chain from one learning step to the next in order to avoid burning in a Markov chain as part of the inner loop of learning. The procedure is formally presented in Algorithm 1.
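The alternation between k discriminator steps and one generator step can be sketched on a toy problem. This is a minimal sketch under stated assumptions, not the paper's implementation: the data are assumed to be drawn from a Gaussian with mean 2, the hypothetical generator is $G(z) = z + \theta$, the hypothetical discriminator is $D(x) = \sigma(wx + b)$, and the gradients are written out by hand instead of using multilayer perceptrons and backpropagation.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def train_toy_gan(iterations=3000, k=1, m=64, lr=0.05, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0   # hypothetical discriminator D(x) = sigmoid(w * x + b)
    theta = 0.0       # hypothetical generator G(z) = z + theta
    for _ in range(iterations):
        # k steps of gradient ascent on the discriminator objective
        for _ in range(k):
            reals = [rng.gauss(2.0, 1.0) for _ in range(m)]          # assumed p_data
            fakes = [rng.gauss(0.0, 1.0) + theta for _ in range(m)]  # G(z), z ~ p_z
            grad_w = grad_b = 0.0
            for x in reals:          # gradient of log D(x)
                d = sigmoid(w * x + b)
                grad_w += (1.0 - d) * x
                grad_b += 1.0 - d
            for xf in fakes:         # gradient of log(1 - D(G(z)))
                d = sigmoid(w * xf + b)
                grad_w -= d * xf
                grad_b -= d
            w += lr * grad_w / m
            b += lr * grad_b / m
        # one step of gradient descent on log(1 - D(G(z))) for the generator
        fakes = [rng.gauss(0.0, 1.0) + theta for _ in range(m)]
        grad_theta = sum(-sigmoid(w * xf + b) * w for xf in fakes) / m
        theta -= lr * grad_theta
    return theta, w, b

theta, w, b = train_toy_gan()
print(theta, w, b)
```

In this toy setting the generator's shift drifts toward the data mean while the discriminator's weight shrinks back toward zero, i.e. toward a discriminator that outputs roughly 1/2 everywhere.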

In practice, equation 1 may not provide sufficient gradient for G to learn well. Early in learning, when G is poor, D can reject samples with high confidence because they are clearly different from the training data. In this case, $\log(1 - D(G(z)))$ saturates. Rather than training G to minimize $\log(1 - D(G(z)))$ we can train G to maximize $\log D(G(z))$. This objective function results in the same fixed point of the dynamics of G and D but provides much stronger gradients early in learning.
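The saturation argument can be checked with a few numbers. The sketch below assumes, purely for illustration, that the discriminator output is a sigmoid of a scalar logit $a$, so that $D(G(z)) = \sigma(a)$; it compares the derivative of each generator objective with respect to that logit.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def saturating_grad(logit):
    # d/da log(1 - sigmoid(a)) = -sigmoid(a)
    return -sigmoid(logit)

def non_saturating_grad(logit):
    # d/da log(sigmoid(a)) = 1 - sigmoid(a)
    return 1.0 - sigmoid(logit)

# A very negative logit means D confidently rejects the generated sample.
for logit in (-8.0, -4.0, 0.0):
    print(logit, sigmoid(logit), saturating_grad(logit), non_saturating_grad(logit))
```

When D rejects a sample with high confidence (a very negative logit), the gradient of $\log(1 - D(G(z)))$ is close to zero, while the gradient of $\log D(G(z))$ stays close to one.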

Figure 1: Generative adversarial nets are trained by simultaneously updating the discriminative distribution (D, blue, dashed line) so that it discriminates samples of the data generating distribution $p_{\text{data}}$ (black, dotted line) from those of the generative distribution $p_g$ (G) (green, solid line). The lower horizontal line is the domain from which $z$ is sampled, in this case uniformly. The horizontal line above is part of the domain of $x$. The upward arrows show how the mapping $x = G(z)$ imposes the non-uniform distribution $p_g$ on transformed samples. G contracts in regions of high density and expands in regions of low density of $p_g$. (a) Consider an adversarial pair near convergence: $p_g$ is similar to $p_{\text{data}}$ and D is a partially accurate classifier. (b) In the inner loop of the algorithm D is trained to discriminate samples from data, converging to $D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$. (c) After an update to G, the gradient of D has guided $G(z)$ to flow to regions that are more likely to be classified as data. (d) After several steps of training, if G and D have enough capacity, they will reach a point at which both cannot improve because $p_g = p_{\text{data}}$. The discriminator is unable to differentiate between the two distributions, i.e. $D(x) = \frac{1}{2}$.

4 Theoretical Results

The generator G implicitly defines a probability distribution $p_g$ as the distribution of the samples $G(z)$ obtained when $z \sim p_z$. Therefore, we would like Algorithm 1 to converge to a good estimator of $p_{\text{data}}$, if given enough capacity and training time. The results of this section are derived in a non-parametric setting, e.g. we represent a model with infinite capacity by studying convergence in the space of probability density functions.
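To make "implicitly defines a distribution" concrete, the sketch below never writes down a density for $p_g$; it only draws samples $G(z)$ with $z \sim p_z$. The affine `generator` and its parameters are hypothetical choices for this example; with a standard normal prior it happens to induce a Gaussian with mean 3 and standard deviation 0.5, which the sample statistics should approximately recover.

```python
import random
import statistics

def generator(z, mu=3.0, sigma=0.5):
    # Hypothetical G(z): an affine map of the noise variable.
    return mu + sigma * z

rng = random.Random(0)
# Sampling from p_g means sampling z from the prior and applying G.
samples = [generator(rng.gauss(0.0, 1.0)) for _ in range(20000)]

sample_mean = statistics.mean(samples)
sample_std = statistics.pstdev(samples)
print(sample_mean, sample_std)
```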

We will show in section 4.1 that this minimax game has a global optimum for $p_g = p_{\text{data}}$. We will then show in section 4.2 that Algorithm 1 optimizes Eq. 1, thus obtaining the desired result.

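Algorithm 1: Minibatch stochastic gradient descent training of generative adversarial nets. The number of steps to apply to the discriminator, $k$, is a hyperparameter. We used $k = 1$, the least expensive option, in our experiments.

for number of training iterations do
    for $k$ steps do
        Sample minibatch of $m$ noise samples $\{z^{(1)}, \dots, z^{(m)}\}$ from noise prior $p_z(z)$.
        Sample minibatch of $m$ examples $\{x^{(1)}, \dots, x^{(m)}\}$ from data generating distribution $p_{\text{data}}(x)$.
        Update the discriminator by ascending its stochastic gradient: $\nabla_{\theta_d} \frac{1}{m} \sum_{i=1}^{m} \left[ \log D(x^{(i)}) + \log(1 - D(G(z^{(i)}))) \right]$.
    end for
    Sample minibatch of $m$ noise samples $\{z^{(1)}, \dots, z^{(m)}\}$ from noise prior $p_z(z)$.
    Update the generator by descending its stochastic gradient: $\nabla_{\theta_g} \frac{1}{m} \sum_{i=1}^{m} \log(1 - D(G(z^{(i)})))$.
end for

The gradient-based updates can use any standard gradient-based learning rule. We used momentum in our experiments.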

4.1 Global Optimality of $p_g = p_{\text{data}}$

We first consider the optimal discriminator D for any given generator G.

Proposition 1. For G fixed, the optimal discriminator D is

$$D^*_G(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)} \tag{2}$$

Proof. The training criterion for the discriminator D, given any generator G, is to maximize the quantity $V(D, G)$:

$$V(D, G) = \int_x p_{\text{data}}(x) \log(D(x)) \, dx + \int_z p_z(z) \log(1 - D(G(z))) \, dz = \int_x \left[ p_{\text{data}}(x) \log(D(x)) + p_g(x) \log(1 - D(x)) \right] dx \tag{3}$$

For any $(a, b) \in \mathbb{R}^2 \setminus \{(0, 0)\}$, the function $y \mapsto a \log(y) + b \log(1 - y)$ achieves its maximum in $[0, 1]$ at $\frac{a}{a + b}$. The discriminator does not need to be defined outside of $\mathrm{Supp}(p_{\text{data}}) \cup \mathrm{Supp}(p_g)$, concluding the proof.
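The key step of the proof, that $a \log(y) + b \log(1 - y)$ is maximized at $y = \frac{a}{a + b}$, is easy to sanity-check numerically. The sketch below uses a plain grid search; the values of `a` and `b` are arbitrary stand-ins for $p_{\text{data}}(x)$ and $p_g(x)$ at a single point $x$.

```python
import math

def objective(y, a, b):
    # Pointwise integrand of Eq. 3 with a = p_data(x), b = p_g(x), y = D(x).
    return a * math.log(y) + b * math.log(1.0 - y)

def grid_argmax(a, b, steps=10000):
    # Search the open interval (0, 1) so that log(0) is never evaluated.
    best_y, best_val = None, -math.inf
    for i in range(1, steps):
        y = i / steps
        val = objective(y, a, b)
        if val > best_val:
            best_y, best_val = y, val
    return best_y

a, b = 0.3, 0.9              # arbitrary positive stand-in values
y_star = grid_argmax(a, b)   # numerical maximizer
closed_form = a / (a + b)    # maximizer claimed in the proof
print(y_star, closed_form)
```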

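Note that the training objective for D can be interpreted as maximizing the log-likelihood for estimating the conditional probability $P(Y = y \mid x)$, where $Y$ indicates whether $x$ comes from $p_{\text{data}}$ (with $y = 1$) or from $p_g$ (with $y = 0$). The minimax game in Eq. 1 can now be reformulated as:

$$\begin{aligned} C(G) &= \max_D V(D, G) \\ &= \mathbb{E}_{x \sim p_{\text{data}}}[\log D^*_G(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D^*_G(G(z)))] \\ &= \mathbb{E}_{x \sim p_{\text{data}}}[\log D^*_G(x)] + \mathbb{E}_{x \sim p_g}[\log(1 - D^*_G(x))] \\ &= \mathbb{E}_{x \sim p_{\text{data}}}\left[\log \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}\right] + \mathbb{E}_{x \sim p_g}\left[\log \frac{p_g(x)}{p_{\text{data}}(x) + p_g(x)}\right] \end{aligned} \tag{4}$$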

Theorem 1. The global minimum of the virtual training criterion $C(G)$ is achieved if and only if $p_g = p_{\text{data}}$. At that point, $C(G)$ achieves the value $-\log 4$.

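Proof. For $p_g = p_{\text{data}}$, $D^*_G(x) = \frac{1}{2}$ (consider Eq. 2). Hence, by inspecting Eq. 4 at $D^*_G(x) = \frac{1}{2}$, we find $C(G) = \log \frac{1}{2} + \log \frac{1}{2} = -\log 4$. To see that this is the best possible value of $C(G)$, reached only for $p_g = p_{\text{data}}$, observe that

$$\mathbb{E}_{x \sim p_{\text{data}}}[-\log 2] + \mathbb{E}_{x \sim p_g}[-\log 2] = -\log 4$$

and that by subtracting this expression from $C(G) = V(D^*_G, G)$, we obtain:

$$C(G) = -\log(4) + KL\left(p_{\text{data}} \,\middle\|\, \frac{p_{\text{data}} + p_g}{2}\right) + KL\left(p_g \,\middle\|\, \frac{p_{\text{data}} + p_g}{2}\right) \tag{5}$$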

where KL is the Kullback–Leibler divergence. We recognize in the previous expression the Jensen–Shannon divergence between the model's distribution and the data generating process:

$$C(G) = -\log(4) + 2 \cdot JSD(p_{\text{data}} \,\|\, p_g) \tag{6}$$

Since the Jensen–Shannon divergence between two distributions is always non-negative and zero only when they are equal, we have shown that $C^* = -\log(4)$ is the global minimum of $C(G)$ and that the only solution is $p_g = p_{\text{data}}$, i.e., the generative model perfectly replicating the data generating process.
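The relation between the criterion $C(G)$ and the Jensen–Shannon divergence can be checked numerically on a small discrete example, where expectations become finite sums. The two probability vectors below are arbitrary illustrative choices, and the helper names `kl`, `jsd` and `criterion` are introduced only for this sketch.

```python
import math

def kl(p, q):
    # Kullback-Leibler divergence between two discrete distributions.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    # Jensen-Shannon divergence: average KL to the midpoint distribution.
    mix = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)

def criterion(p_data, p_g):
    """C(G) with the optimal discriminator plugged in, as a finite sum."""
    total = 0.0
    for pd, pg in zip(p_data, p_g):
        if pd > 0:
            total += pd * math.log(pd / (pd + pg))
        if pg > 0:
            total += pg * math.log(pg / (pd + pg))
    return total

p_data = [0.1, 0.4, 0.5]   # arbitrary example distribution
p_g = [0.3, 0.3, 0.4]      # arbitrary example model distribution

c_direct = criterion(p_data, p_g)
c_via_jsd = -math.log(4) + 2 * jsd(p_data, p_g)
c_at_optimum = criterion(p_data, p_data)
print(c_direct, c_via_jsd, c_at_optimum)
```

The two ways of computing $C(G)$ agree, and setting $p_g = p_{\text{data}}$ gives exactly $-\log 4$, which is lower than the value for the mismatched pair.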

Article source: http://tongtianta.site/paper/14506
Edited by Lornatang
Proofread by Lornatang
