Abstract
We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.
1 Introduction
The promise of deep learning is to discover rich, hierarchical models that represent probability distributions over the kinds of data encountered in artificial intelligence applications, such as natural images, audio waveforms containing speech, and symbols in natural language corpora. So far, the most striking successes in deep learning have involved discriminative models, usually those that map a high-dimensional, rich sensory input to a class label. These striking successes have primarily been based on the backpropagation and dropout algorithms, using piecewise linear units which have a particularly well-behaved gradient. Deep generative models have had less of an impact, due to the difficulty of approximating many intractable probabilistic computations that arise in maximum likelihood estimation and related strategies, and due to the difficulty of leveraging the benefits of piecewise linear units in the generative context.
We propose a new generative model estimation procedure that sidesteps these difficulties. In the proposed adversarial nets framework, the generative model is pitted against an adversary: a discriminative model that learns to determine whether a sample is from the model distribution or the data distribution. The generative model can be thought of as analogous to a team of counterfeiters, trying to produce fake currency and use it without detection, while the discriminative model is analogous to the police, trying to detect the counterfeit currency. Competition in this game drives both teams to improve their methods until the counterfeits are indistinguishable from the genuine articles.
This framework can yield specific training algorithms for many kinds of model and optimization algorithm. In this article, we explore the special case when the generative model generates samples by passing random noise through a multilayer perceptron, and the discriminative model is also a multilayer perceptron. We refer to this special case as adversarial nets. In this case, we can train both models using only the highly successful backpropagation and dropout algorithms, and sample from the generative model using only forward propagation. No approximate inference or Markov chains are necessary.
2 Related work
Undirected graphical models with latent variables, such as restricted Boltzmann machines (RBMs), deep Boltzmann machines (DBMs) and their numerous variants, are an alternative to directed graphical models with latent variables. The interactions within such models are represented as the product of unnormalized potential functions, normalized by a global summation/integration over all states of the random variables. This quantity (the partition function) and its gradient are intractable for all but the most trivial instances, although they can be estimated by Markov chain Monte Carlo (MCMC) methods. Mixing poses a significant problem for learning algorithms that rely on MCMC.
Deep belief networks (DBNs) are hybrid models containing a single undirected layer and several directed layers. While a fast approximate layer-wise training criterion exists, DBNs incur the computational difficulties associated with both undirected and directed models.
Alternative criteria that do not approximate or bound the log-likelihood have also been proposed, such as score matching and noise-contrastive estimation (NCE). Both of these require the learned probability density to be analytically specified up to a normalization constant. Note that in many interesting generative models with several layers of latent variables (such as DBNs and DBMs), it is not even possible to derive a tractable unnormalized probability density.
Some models such as denoising auto-encoders and contractive autoencoders have learning rules very similar to score matching applied to RBMs. In NCE, as in this work, a discriminative training criterion is employed to fit a generative model. However, rather than fitting a separate discriminative model, the generative model itself is used to discriminate generated data from samples of a fixed noise distribution. Because NCE uses a fixed noise distribution, learning slows dramatically after the model has learned even an approximately correct distribution over a small subset of the observed variables.
Finally, some techniques do not involve defining a probability distribution explicitly, but rather train a generative machine to draw samples from the desired distribution. This approach has the advantage that such machines can be designed to be trained by backpropagation. Prominent recent work in this area includes the generative stochastic network (GSN) framework, which extends generalized denoising auto-encoders: both can be seen as defining a parameterized Markov chain, i.e., one learns the parameters of a machine that performs one step of a generative Markov chain. Compared to GSNs, the adversarial nets framework does not require a Markov chain for sampling. Because adversarial nets do not require feedback loops during generation, they are better able to leverage piecewise linear units, which improve the performance of backpropagation but have problems with unbounded activation when used in a feedback loop. More recent examples of training a generative machine by backpropagating into it include work on auto-encoding variational Bayes and stochastic backpropagation.
3 Adversarial nets
The adversarial modeling framework is most straightforward to apply when the models are both multilayer perceptrons. To learn the generator's distribution $p_g$ over data $x$, we define a prior on input noise variables $p_z(z)$, then represent a mapping to data space as $G(z;\theta_g)$, where $G$ is a differentiable function represented by a multilayer perceptron with parameters $\theta_g$. We also define a second multilayer perceptron $D(x;\theta_d)$ that outputs a single scalar. $D(x)$ represents the probability that $x$ came from the data rather than $p_g$. We train $D$ to maximize the probability of assigning the correct label to both training examples and samples from $G$. We simultaneously train $G$ to minimize $\log(1-D(G(z)))$. In other words, $D$ and $G$ play the following two-player minimax game with value function $V(G,D)$:
$$\min_{G} \max_{D} V(G,D) = \mathbb{E}_{x\sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z\sim p_{z}(z)}[\log(1-D(G(z)))] \tag{1}$$
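As a rough illustration of Eq. 1, the value function can be estimated on a minibatch by averaging the two log terms over discriminator outputs. The sketch below assumes the outputs $D(x)$ and $D(G(z))$ are already available as arrays; the function name `value_function` and the clipping constant `eps` are illustrative choices, not part of the original text.

```python
import numpy as np

def value_function(d_real, d_fake):
    """Monte Carlo estimate of V(G, D) from discriminator outputs.

    d_real: D(x) on a minibatch of data samples, values in (0, 1).
    d_fake: D(G(z)) on a minibatch of generated samples, values in (0, 1).
    """
    eps = 1e-12  # assumed clipping constant that keeps the logarithms finite
    d_real = np.clip(d_real, eps, 1.0 - eps)
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# A discriminator that outputs 1/2 everywhere gives log(1/2) + log(1/2) = -log 4.
v_half = value_function(np.full(8, 0.5), np.full(8, 0.5))
# A confident, correct discriminator pushes the value toward 0 from below.
v_good = value_function(np.full(8, 0.99), np.full(8, 0.01))
```

The discriminator tries to raise this number and the generator tries to lower it, which is the sense in which the game is minimax.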
In the next section, we present a theoretical analysis of adversarial nets, essentially showing that the training criterion allows one to recover the data generating distribution as $G$ and $D$ are given enough capacity, i.e., in the non-parametric limit. See Figure 1 for a less formal, more pedagogical explanation of the approach.
In practice, we must implement the game using an iterative, numerical approach. Optimizing $D$ to completion in the inner loop of training is computationally prohibitive, and on finite datasets would result in overfitting. Instead, we alternate between $k$ steps of optimizing $D$ and one step of optimizing $G$. This results in $D$ being maintained near its optimal solution, so long as $G$ changes slowly enough. This strategy is analogous to the way that SML/PCD training maintains samples from a Markov chain from one learning step to the next in order to avoid burning in a Markov chain as part of the inner loop of learning. The procedure is formally presented in Algorithm 1.
In practice, Eq. 1 may not provide sufficient gradient for $G$ to learn well. Early in learning, when $G$ is poor, $D$ can reject samples with high confidence because they are clearly different from the training data. In this case, $\log(1-D(G(z)))$ saturates. Rather than training $G$ to minimize $\log(1-D(G(z)))$, we can train $G$ to maximize $\log D(G(z))$. This objective function results in the same fixed point of the dynamics of $G$ and $D$ but provides much stronger gradients early in learning.
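The saturation argument can be checked numerically. The sketch below assumes the discriminator ends in a sigmoid, so that $D(G(z)) = \sigma(a)$ for some logit $a$ (an assumption made for illustration), and compares the gradient with respect to $a$ of the two generator objectives when $D$ confidently rejects a sample.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def grad_saturating(a):
    """d/da of log(1 - sigmoid(a)), the quantity G minimizes in Eq. 1."""
    return -sigmoid(a)

def grad_non_saturating(a):
    """d/da of -log(sigmoid(a)); minimizing this maximizes log D(G(z))."""
    return -(1.0 - sigmoid(a))

logit = -6.0  # D(G(z)) = sigmoid(-6), roughly 0.0025: a confident rejection
g_sat = grad_saturating(logit)      # magnitude close to 0
g_ns = grad_non_saturating(logit)   # magnitude close to 1
```

With a confident rejection the first gradient nearly vanishes while the second stays close to its maximum magnitude, which is why the alternative objective gives a stronger learning signal early on.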
Figure 1: Generative adversarial nets are trained by simultaneously updating the discriminative distribution ($D$, blue, dashed line) so that it discriminates between samples from the data generating distribution (black, dotted line) $p_x$ and those of the generative distribution $p_g$ ($G$) (green, solid line). The lower horizontal line is the domain from which $z$ is sampled, in this case uniformly. The horizontal line above is part of the domain of $x$. The upward arrows show how the mapping $x = G(z)$ imposes the non-uniform distribution $p_g$ on transformed samples. $G$ contracts in regions of high density and expands in regions of low density of $p_g$.
(a) Consider an adversarial pair near convergence: $p_g$ is similar to $p_{data}$ and $D$ is a partially accurate classifier.
(b) In the inner loop of the algorithm $D$ is trained to discriminate samples from data, converging to $D^{*}(x) = \frac{p_{data}(x)}{p_{data}(x)+p_g(x)}$.
(c) After an update to $G$, the gradient of $D$ has guided $G(z)$ to flow to regions that are more likely to be classified as data.
(d) After several steps of training, if $G$ and $D$ have enough capacity, they will reach a point at which both cannot improve because $p_g = p_{data}$. The discriminator is unable to differentiate between the two distributions, i.e., $D(x) = \frac{1}{2}$.
4 Theoretical Results
The generator $G$ implicitly defines a probability distribution $p_g$ as the distribution of the samples $G(z)$ obtained when $z \sim p_z$. Therefore, we would like Algorithm 1 to converge to a good estimator of $p_{data}$, if given enough capacity and training time. The results of this section are done in a non-parametric setting, e.g., we represent a model with infinite capacity by studying convergence in the space of probability density functions. We will show in Section 4.1 that this minimax game has a global optimum for $p_g = p_{data}$. We will then show in Section 4.2 that Algorithm 1 optimizes Eq. 1, thus obtaining the desired result.
Algorithm 1: Minibatch stochastic gradient descent training of generative adversarial nets. The number of steps to apply to the discriminator, $k$, is a hyperparameter. We used $k=1$, the least expensive option, in our experiments.
for number of training iterations do
  for $k$ steps do
    Sample minibatch of $m$ noise samples $\{z^{(1)},\dots,z^{(m)}\}$ from noise prior $p_z(z)$.
    Sample minibatch of $m$ examples $\{x^{(1)},\dots,x^{(m)}\}$ from data generating distribution $p_{data}(x)$.
    Update the discriminator by ascending its stochastic gradient:
    $$\nabla_{\theta_d}\frac{1}{m}\sum_{i=1}^{m}\left[\log D(x^{(i)}) + \log(1-D(G(z^{(i)})))\right]$$
  end for
  Sample minibatch of $m$ noise samples $\{z^{(1)},\dots,z^{(m)}\}$ from noise prior $p_z(z)$.
  Update the generator by descending its stochastic gradient:
  $$\nabla_{\theta_g}\frac{1}{m}\sum_{i=1}^{m}\log(1-D(G(z^{(i)})))$$
end for
The gradient-based updates can use any standard gradient-based learning rule. We used momentum in our experiments.
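The loop structure of Algorithm 1 can be sketched on a one-dimensional toy problem. Everything concrete here is an assumption made for illustration rather than the setup used in the experiments: the generator is the affine map $G(z) = az + b$, the discriminator is the logistic model $D(x) = \sigma(wx + c)$, the data are drawn from $N(3, 1)$, the noise prior is $N(0, 1)$, and plain gradient steps replace momentum. The gradients are written out by hand so the ascent and descent steps are explicit.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def d_objective(theta_d, x, g):
    """(1/m) sum[log D(x_i) + log(1 - D(g_i))] for D(x) = sigmoid(w*x + c)."""
    w, c = theta_d
    return np.mean(np.log(sigmoid(w * x + c))) + np.mean(np.log(1.0 - sigmoid(w * g + c)))

def d_gradient(theta_d, x, g):
    """Gradient of d_objective with respect to (w, c)."""
    w, c = theta_d
    dx, dg = sigmoid(w * x + c), sigmoid(w * g + c)
    grad_w = np.mean((1.0 - dx) * x) - np.mean(dg * g)
    grad_c = np.mean(1.0 - dx) - np.mean(dg)
    return np.array([grad_w, grad_c])

def g_gradient(theta_g, theta_d, z):
    """Gradient of (1/m) sum log(1 - D(G(z_i))) w.r.t. (a, b), with G(z) = a*z + b."""
    a, b = theta_g
    w, c = theta_d
    dg = sigmoid(w * (a * z + b) + c)
    return np.array([np.mean(-dg * w * z), np.mean(-dg * w)])

def train(iterations=200, k=1, m=64, lr=0.05, seed=0):
    rng = np.random.default_rng(seed)
    theta_d = np.array([0.1, 0.0])  # discriminator parameters (w, c)
    theta_g = np.array([1.0, 0.0])  # generator parameters (a, b)
    for _ in range(iterations):
        for _ in range(k):
            z = rng.standard_normal(m)        # noise prior p_z (assumed N(0, 1))
            x = rng.normal(3.0, 1.0, size=m)  # toy data distribution (assumed)
            g = theta_g[0] * z + theta_g[1]
            theta_d = theta_d + lr * d_gradient(theta_d, x, g)    # ascend for D
        z = rng.standard_normal(m)
        theta_g = theta_g - lr * g_gradient(theta_g, theta_d, z)  # descend for G
    return theta_g, theta_d

theta_g, theta_d = train()
```

`d_objective` is not called inside the loop; it is included so the hand-written gradients can be compared against the quantity they differentiate. A logistic discriminator this small can only separate the two distributions by a threshold, so the sketch shows the alternation of updates, not a faithful reproduction of the data distribution.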
4.1 Global Optimality of $p_g = p_{data}$
We first consider the optimal discriminator $D$ for any given generator $G$.
Proposition 1.
For $G$ fixed, the optimal discriminator $D$ is
$$D_{G}^{*}(x) = \frac{p_{data}(x)}{p_{data}(x)+p_{g}(x)} \tag{2}$$
Proof.
The training criterion for the discriminator $D$, given any generator $G$, is to maximize the quantity $V(G,D)$:
$$V(G,D) = \int_{x} p_{data}(x)\log(D(x))\,dx + \int_{z} p_{z}(z)\log(1-D(G(z)))\,dz = \int_{x} \left[p_{data}(x)\log(D(x)) + p_{g}(x)\log(1-D(x))\right]dx \tag{3}$$
The second equality is a change of variables from $z$ to $x = G(z)$. For any $(a,b) \in \mathbb{R}^{2} \setminus \{(0,0)\}$, the function $y \mapsto a\log(y) + b\log(1-y)$ achieves its maximum in $[0,1]$ at $\frac{a}{a+b}$. The discriminator does not need to be defined outside of $Supp(p_{data}) \cup Supp(p_{g})$, concluding the proof.
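The pointwise maximization used in the proof can be verified directly: for positive $a$ and $b$, setting the derivative $\frac{a}{y} - \frac{b}{1-y}$ to zero gives $y = \frac{a}{a+b}$, and the second derivative is negative there. The sketch below checks this against a brute-force grid search; the values of `a` and `b` are arbitrary stand-ins for $p_{data}(x)$ and $p_g(x)$ at a single point $x$.

```python
import numpy as np

def pointwise_objective(y, a, b):
    """The integrand of Eq. 3 at a fixed x, as a function of y = D(x)."""
    return a * np.log(y) + b * np.log(1.0 - y)

a, b = 0.3, 0.9  # illustrative values standing in for p_data(x) and p_g(x)
ys = np.linspace(1e-4, 1.0 - 1e-4, 100001)  # grid over the open interval (0, 1)
y_best = ys[np.argmax(pointwise_objective(ys, a, b))]
y_closed_form = a / (a + b)
```

The grid maximizer agrees with the closed form up to the grid spacing, matching Eq. 2 evaluated at that point.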
Note that the training objective for $D$ can be interpreted as maximizing the log-likelihood for estimating the conditional probability $P(Y=y \mid x)$, where $Y$ indicates whether $x$ comes from $p_{data}$ (with $y=1$) or from $p_{g}$ (with $y=0$). The minimax game in Eq. 1 can now be reformulated as:
$$\begin{aligned} C(G) = \max_{D} V(G,D) &= \mathbb{E}_{x \sim p_{data}}[\log D_{G}^{*}(x)] + \mathbb{E}_{z \sim p_{z}}[\log(1-D_{G}^{*}(G(z)))] \\ &= \mathbb{E}_{x \sim p_{data}}[\log D_{G}^{*}(x)] + \mathbb{E}_{x \sim p_{g}}[\log(1-D_{G}^{*}(x))] \\ &= \mathbb{E}_{x \sim p_{data}}\left[\log \frac{p_{data}(x)}{p_{data}(x)+p_{g}(x)}\right] + \mathbb{E}_{x \sim p_{g}}\left[\log \frac{p_{g}(x)}{p_{data}(x)+p_{g}(x)}\right] \end{aligned} \tag{4}$$
Theorem 1.
The global minimum of the virtual training criterion $C(G)$ is achieved if and only if $p_{g} = p_{data}$. At that point, $C(G)$ achieves the value $-\log 4$.
Proof. For $p_{g} = p_{data}$, $D_{G}^{*}(x) = \frac{1}{2}$ (consider Eq. 2). Hence, by inspecting Eq. 4 at $D_{G}^{*}(x) = \frac{1}{2}$, we find $C(G) = \log\frac{1}{2} + \log\frac{1}{2} = -\log 4$. To see that this is the best possible value of $C(G)$, reached only for $p_{g} = p_{data}$, observe that
$$\mathbb{E}_{x \sim p_{data}}[-\log 2] + \mathbb{E}_{x \sim p_{g}}[-\log 2] = -\log 4$$
and that by subtracting this expression from $C(G) = V(G, D_{G}^{*})$, we obtain:
$$C(G) = -\log(4) + KL\left(p_{data} \,\Big\|\, \frac{p_{data}+p_{g}}{2}\right) + KL\left(p_{g} \,\Big\|\, \frac{p_{data}+p_{g}}{2}\right) \tag{5}$$
where KL is the Kullback-Leibler divergence. We recognize in the previous expression the Jensen-Shannon divergence between the model's distribution and the data generating process:
$$C(G) = -\log(4) + 2 \cdot JSD(p_{data} \,\|\, p_{g}) \tag{6}$$
Since the Jensen-Shannon divergence between two distributions is always non-negative, and zero only when they are equal, we have shown that $C^{*} = -\log(4)$ is the global minimum of $C(G)$ and that the only solution is $p_{g} = p_{data}$, i.e., the generative model perfectly replicating the data generating process.
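The identity between Eq. 4 and Eq. 6 can be checked numerically on discrete distributions, where the expectations become finite sums. The two distributions below are arbitrary illustrative choices with strictly positive entries, and the helper names are hypothetical.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for strictly positive discrete distributions."""
    return np.sum(p * np.log(p / q))

def jsd(p, q):
    """Jensen-Shannon divergence between p and q."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def criterion(p_data, p_g):
    """C(G) from Eq. 4, with the optimal discriminator substituted in."""
    total = p_data + p_g
    return np.sum(p_data * np.log(p_data / total)) + np.sum(p_g * np.log(p_g / total))

p_data = np.array([0.1, 0.2, 0.3, 0.4])  # illustrative distributions
p_g = np.array([0.4, 0.3, 0.2, 0.1])

c_value = criterion(p_data, p_g)
c_from_jsd = -np.log(4.0) + 2.0 * jsd(p_data, p_g)
c_at_optimum = criterion(p_data, p_data)  # the case p_g = p_data
```

The two expressions for $C(G)$ coincide, and the value at $p_g = p_{data}$ equals $-\log 4$, which is below the value for the mismatched pair.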
4.2 Convergence of Algorithm 1
Proposition 2.
If $G$ and $D$ have enough capacity, and at each step of Algorithm 1 the discriminator is allowed to reach its optimum given $G$, and $p_{g}$ is updated so as to improve the criterion
$$\mathbb{E}_{x \sim p_{data}}[\log D_{G}^{*}(x)] + \mathbb{E}_{x \sim p_{g}}[\log(1-D_{G}^{*}(x))]$$
then $p_{g}$ converges to $p_{data}$.
Proof. Consider $V(G,D) = U(p_{g},D)$ as a function of $p_{g}$, as done in the above criterion. Note that $U(p_{g},D)$ is convex in $p_{g}$. The subderivatives of a supremum of convex functions include the derivative of the function at the point where the maximum is attained. In other words, if $f(x) = \sup_{\alpha \in A} f_{\alpha}(x)$ and $f_{\alpha}(x)$ is convex in $x$ for every $\alpha$, then $\partial f_{\beta}(x) \in \partial f$ if $\beta = \arg\sup_{\alpha \in A} f_{\alpha}(x)$. This is equivalent to computing a gradient descent update for $p_{g}$ at the optimal $D$ given the corresponding $G$. $\sup_{D} U(p_{g},D)$ is convex in $p_{g}$ with a unique global optimum as proven in Theorem 1; therefore, with sufficiently small updates of $p_{g}$, $p_{g}$ converges to $p_{data}$, concluding the proof.
In practice, adversarial nets represent a limited family of $p_{g}$ distributions via the function $G(z;\theta_{g})$, and we optimize $\theta_{g}$ rather than $p_{g}$ itself. Using a multilayer perceptron to define $G$ introduces multiple critical points in parameter space. However, the excellent performance of multilayer perceptrons in practice suggests that they are a reasonable model to use despite their lack of theoretical guarantees.
5 Experiments
We trained adversarial nets on a range of datasets including MNIST [23], the Toronto Face Database (TFD) [28], and CIFAR-10 [21]. The generator nets used a mixture of rectifier linear activations [19, 9] and sigmoid activations, while the discriminator net used maxout [10] activations. While our theoretical framework permits the use of dropout and other noise at intermediate layers of the generator, we used noise as the input to only the bottommost layer of the generator network.
We estimate the probability of the test set data under $p_{g}$ by fitting a Gaussian Parzen window to the samples generated with $G$ and reporting the log-likelihood under this distribution. This procedure was introduced in Breuleux et al. [8] and used for various generative models for which the exact likelihood is not tractable. Results are reported in Table 1. This method of estimating the likelihood has somewhat high variance and does not perform well in high dimensional spaces, but it is the best method available to our knowledge. Advances in generative models that can sample but not estimate likelihood directly motivate further research into how to evaluate such models.
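A Gaussian Parzen window estimate of this kind places an isotropic Gaussian of width $\sigma$ on every generated sample and scores held-out points under the resulting mixture. The following is a minimal sketch under stated assumptions: the function name, the array shapes, the value of `sigma`, and the random stand-in data are all illustrative, and choosing $\sigma$ (for example on a validation set) is not shown.

```python
import numpy as np

def parzen_log_likelihood(test_x, samples, sigma):
    """Mean log-likelihood of test_x under a Gaussian Parzen window fit to samples.

    test_x: array of shape (n_test, d); samples: array of shape (n_samples, d).
    """
    n, d = samples.shape
    sq_dist = np.sum((test_x[:, None, :] - samples[None, :, :]) ** 2, axis=-1)
    log_kernel = -sq_dist / (2.0 * sigma ** 2) - 0.5 * d * np.log(2.0 * np.pi * sigma ** 2)
    # Log of the average kernel value per test point, computed stably (log-sum-exp).
    max_log = np.max(log_kernel, axis=1, keepdims=True)
    log_density = max_log[:, 0] + np.log(np.mean(np.exp(log_kernel - max_log), axis=1))
    return np.mean(log_density)

rng = np.random.default_rng(0)
generated = rng.standard_normal((500, 2))  # stand-in for samples drawn from G
held_out = rng.standard_normal((100, 2))   # stand-in for test data
ll = parzen_log_likelihood(held_out, generated, sigma=0.3)
```

The pairwise distance array has size `n_test * n_samples * d`, so for image-sized data the computation would normally be done in batches.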
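Table 1: Parzen window-based log-likelihood estimates. The reported numbers on MNIST are the mean log-likelihood of samples on the test set, with the standard error of the mean computed across examples. On TFD, we computed the standard error across folds of the dataset, with a different $\sigma$ chosen using the validation set of each fold. On TFD, $\sigma$ was cross validated on each fold and the mean log-likelihood on each fold was computed. For MNIST we compare against other models of the real-valued (rather than binary) version of the dataset.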
In Figures 2 and 3 we show samples drawn from the generator net after training. While we make no claim that these samples are better than samples generated by existing methods, we believe that these samples are at least competitive with the better generative models in the literature and highlight the potential of the adversarial framework.
Figure 2: Visualization of samples from the model. The rightmost column shows the nearest training example of the neighboring sample, in order to demonstrate that the model has not memorized the training set. Samples are fair random draws, not cherry-picked. Unlike most other visualizations of deep generative models, these images show actual samples from the model distributions, not conditional means given samples of hidden units. Moreover, these samples are uncorrelated because the sampling process does not depend on Markov chain mixing. a) MNIST b) TFD c) CIFAR-10 (fully connected model) d) CIFAR-10 (convolutional discriminator and "deconvolutional" generator)
Figure 3: Digits obtained by linearly interpolating between coordinates in $z$ space of the full model.
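Linear interpolation in $z$ space amounts to taking evenly spaced points on the segment between two latent vectors and passing each one through the generator. The sketch below covers only the interpolation step; the latent size of 100 and the function name are assumptions, and the trained generator itself is not shown.

```python
import numpy as np

def interpolate_latents(z_start, z_end, steps):
    """Return `steps` evenly spaced points on the straight line from z_start to z_end."""
    t = np.linspace(0.0, 1.0, steps)[:, None]
    return (1.0 - t) * z_start[None, :] + t * z_end[None, :]

rng = np.random.default_rng(0)
z_a = rng.standard_normal(100)  # latent size 100 is an assumed value
z_b = rng.standard_normal(100)
path = interpolate_latents(z_a, z_b, steps=8)
# Each row of `path` would be passed through a trained generator G to render one image.
```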
Table 2: Challenges in generative modeling: a summary of the difficulties encountered by different approaches to deep generative modeling for each of the major operations involving a model.
6 Advantages and disadvantages
This new framework comes with advantages and disadvantages relative to previous modeling frameworks. The disadvantages are primarily that there is no explicit representation of $p_{g}(x)$, and that $D$ must be synchronized well with $G$ during training (in particular, $G$ must not be trained too much without updating $D$, in order to avoid "the Helvetica scenario" in which $G$ collapses too many values of $z$ to the same value of $x$ to have enough diversity to model $p_{data}$), much as the negative chains of a Boltzmann machine must be kept up to date between learning steps.
The advantages are that Markov chains are never needed, only backprop is used to obtain gradients, no inference is needed during learning, and a wide variety of functions can be incorporated into the model. Table 2 summarizes the comparison of generative adversarial nets with other generative modeling approaches.
The aforementioned advantages are primarily computational. Adversarial models may also gain some statistical advantage from the generator network not being updated directly with data examples, but only with gradients flowing through the discriminator. This means that components of the input are not copied directly into the generator's parameters. Another advantage of adversarial networks is that they can represent very sharp, even degenerate distributions, while methods based on Markov chains require that the distribution be somewhat blurry in order for the chains to be able to mix between modes.
7 Conclusion and future work
This framework admits many straightforward extensions:
1. A conditional generative model $p(x \mid c)$ can be obtained by adding $c$ as input to both $G$ and $D$.
2. Learned approximate inference can be performed by training an auxiliary network to predict $z$ given $x$. This is similar to the inference net trained by the wake-sleep algorithm [15] but with the advantage that the inference net may be trained for a fixed generator net after the generator net has finished training.
3. One can approximately model all conditionals $p(x_{S} \mid x_{\bar{S}})$, where $S$ is a subset of the indices of $x$, by training a family of conditional models that share parameters.
4. Semi-supervised learning: features from the discriminator or inference net could improve performance of classifiers when limited labeled data is available.
5. Efficiency improvements: training could be accelerated greatly by devising better methods for coordinating $G$ and $D$ or determining better distributions to sample $z$ from during training.
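For the first extension in the list above, one simple way to add $c$ as input to both networks is to concatenate an encoding of $c$ with $z$ for the generator and with $x$ for the discriminator. This is only one possible reading of "adding $c$ as input"; the one-hot encoding, the concatenation, and all sizes below are assumptions for illustration.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Encode integer class labels as one-hot rows."""
    out = np.zeros((labels.shape[0], num_classes))
    out[np.arange(labels.shape[0]), labels] = 1.0
    return out

rng = np.random.default_rng(0)
m, z_dim, x_dim, num_classes = 4, 10, 6, 3  # all sizes are illustrative
c = one_hot(np.array([0, 2, 1, 2]), num_classes)
z = rng.standard_normal((m, z_dim))
x = rng.standard_normal((m, x_dim))

g_input = np.concatenate([z, c], axis=1)  # input to a conditional generator G(z, c)
d_input = np.concatenate([x, c], axis=1)  # input to a conditional discriminator D(x, c)
```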
This paper has demonstrated the viability of the adversarial modeling framework, suggesting that these research directions could prove useful.