Are GANs Created Equal? A Large-Scale Study
Paper: http://arxiv.org/pdf/1711.10337v3.pdf
Abstract
Generative adversarial networks (GAN) are a powerful subclass of generative models. Despite a very rich research activity leading to numerous interesting GAN algorithms, it is still very hard to assess which algorithm(s) perform better than others. We conduct a neutral, multi-faceted large-scale empirical study on state-of-the-art models and evaluation measures. We find that most models can reach similar scores with enough hyperparameter optimization and random restarts. This suggests that improvements can arise from a higher computational budget and tuning more than fundamental algorithmic changes. To overcome some limitations of the current metrics, we also propose several data sets on which precision and recall can be computed. Our experimental results suggest that future GAN research should be based on more systematic and objective evaluation procedures. Finally, we did not find evidence that any of the tested algorithms consistently outperforms the original one.
1. Introduction
Generative adversarial networks (GAN) are a powerful subclass of generative models and were successfully applied to image generation and editing, semi-supervised learning, and domain adaptation [20, 24]. In the GAN framework the model learns a deterministic transformation G of a simple distribution, with the goal of matching the data distribution.

⋆ Indicates equal authorship. Correspondence to Mario Lucic ([email protected]) and Karol Kurach ([email protected]).
It is still very hard to assess which GAN algorithm(s) perform better than others. This is partially due to the lack of a robust and consistent metric, as well as limited comparisons which put all algorithms on an equal footing, including the computational budget to search over all hyperparameters. Why is it important? Firstly, to help the practitioner choose a better algorithm from a very large set. Secondly, to make progress towards better algorithms and their understanding, it is useful to clearly assess which modifications are critical, and which ones are only good on paper, but do not make a significant difference in practice.
The main issue with evaluation stems from the fact that one cannot explicitly compute the probability that the model assigns to the data. As a result, classic measures, such as the log-likelihood on the test set, cannot be evaluated. Consequently, many researchers focused on qualitative comparison, such as comparing the visual quality of samples. Unfortunately, such approaches are subjective and possibly misleading [7].
As a remedy, two evaluation metrics were proposed to quantitatively assess the performance of GANs. Both assume access to a pre-trained classifier. Inception Score (IS) [21] is based on the fact that a good model should generate samples for which, when evaluated by the classifier, the class distribution has low entropy. At the same time, it should produce diverse samples covering all classes. In contrast, Fréchet Inception Distance (FID) is computed by considering the difference in embedding of true and fake data [10]. Assuming that the coding layer follows a multivariate Gaussian distribution, the distance between the distributions is reduced to the Fréchet distance between the corresponding Gaussians.
Our main contributions:
1. We provide a fair and comprehensive comparison of the state-of-the-art GANs, and empirically demonstrate that nearly all of them can reach similar values of FID, given a high enough computational budget.
2. We provide strong empirical evidence¹ that to compare GANs it is necessary to report a summary of the distribution of results, rather than the best result achieved, due to the randomness of the optimization process and model instability.
¹ As a note on the scale of the setup, the computational budget to reproduce those experiments is approximately 6.85 GPU years (NVIDIA P100).
3. We assess the robustness of FID to mode dropping and to the choice of encoding network, and provide estimates of the best FID achievable on classic data sets.
4. We introduce a series of tasks of increasing difficulty for which undisputed measures, such as precision and recall, can be approximately computed.
5. We open-sourced our experimental setup and model implementations at goo.gl/G8kf5J.
2. Background and Related Work
There are several ongoing challenges in the study of GANs, including their convergence properties [2, 17], and optimization stability [21, 1]. Arguably, the most critical challenge is their quantitative evaluation.
The classic approach towards evaluating generative models is based on the model likelihood, which is often intractable. While the log-likelihood can be approximated for distributions on low-dimensional vectors, in the context of complex high-dimensional data the task becomes extremely challenging. Wu et al. [23] suggest an annealed importance sampling algorithm to estimate the hold-out log-likelihood. The key drawback of the proposed approach is the assumption of a Gaussian observation model, which carries over all issues of kernel density estimation in high-dimensional spaces. Theis et al. [22] provide an analysis of common failure modes and demonstrate that it is possible to achieve high likelihood but low visual quality, and vice versa. Furthermore, they argue against using Parzen window density estimates, as the likelihood estimate is often incorrect. In addition, ranking models based on these estimates is discouraged [3]. For a discussion of other drawbacks of likelihood-based training and evaluation, consult Huszár [11].
Inception Score (IS). Proposed by [21], IS offers a way to quantitatively evaluate the quality of generated samples. The score was motivated by the following considerations: (i) the conditional label distribution p(y|x) of samples containing meaningful objects should have low entropy, and (ii) the variability of the samples should be high, or equivalently, the marginal label distribution p(y) should have high entropy. Finally, these desiderata are combined into one score,

IS(G) = exp( E_{x∼G} [ D_KL( p(y|x) ‖ p(y) ) ] ).
The classifier is an Inception Net trained on ImageNet, and is publicly available. The authors found that this score is well-correlated with scores from human annotators [21]. Drawbacks include insensitivity to the prior distribution over labels and not being a proper distance.
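To make the definition concrete, below is a minimal NumPy sketch of the score, assuming `probs` holds the classifier's softmax outputs for a batch of generated samples (the function name and argument are illustrative, not taken from the paper's released code).

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    # probs: (num_samples, num_classes) softmax outputs p(y|x) of the
    # pre-trained classifier on generated samples.
    marginal = probs.mean(axis=0)  # estimate of the marginal p(y)
    # KL(p(y|x) || p(y)) for each sample, averaged, then exponentiated.
    kl = (probs * (np.log(probs + eps) - np.log(marginal + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
```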
Fréchet Inception Distance (FID). Proposed by [10], FID provides an alternative approach. To quantify the quality of generated samples, they are first embedded into a feature space given by a specific layer of Inception Net. Then, viewing the embedding layer as a continuous multivariate Gaussian, the mean and covariance are estimated for both the generated data and the real data. The Fréchet distance between these two Gaussians is then used to quantify the quality of the samples, i.e.

FID(x, g) = ||μ_x − μ_g||₂² + Tr( Σ_x + Σ_g − 2 (Σ_x Σ_g)^{1/2} ),

where (μ_x, Σ_x) and (μ_g, Σ_g) denote the mean and covariance of the embedded real and generated samples, respectively.
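A minimal sketch of this computation, assuming `real_feats` and `fake_feats` are Inception embeddings of real and generated samples (names are illustrative):

```python
import numpy as np
from scipy import linalg

def fid(real_feats, fake_feats):
    # Fit a Gaussian (mean, covariance) to each set of embeddings.
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    # Matrix square root of the covariance product; numerical error can
    # introduce a small imaginary component, which we discard.
    covmean = linalg.sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```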
A significant drawback of both measures is the inability to detect overfitting. A “memory GAN” which stores all training samples would score perfectly.
A very recent study comparing several GANs using IS has been presented by Fedus et al. [6]. The authors focus on IS and consider a smaller subset of GANs. In contrast, our focus is on providing a fair assessment of the current state-of-the-art GANs using FID, as well as precision and recall, and also verifying the robustness of these models in a large-scale empirical evaluation.
3. Flavors of Generative Adversarial Networks
In this work we focus on unconditional generative adversarial networks. In this setting, only unlabeled data is available for learning. The optimization problems arising from existing approaches differ by (i) the constraint on the discriminator's output and the corresponding loss, and (ii) the presence and application of a gradient norm penalty.
In the original GAN formulation [8] two loss functions were proposed. In the minimax GAN the discriminator outputs a probability and the loss function is the negative log-likelihood of a binary classification task (MM GAN in Table 1). Here the generator learns to generate samples that have a low probability of being fake. To improve the gradient signal, the authors also propose the non-saturating loss (NS GAN in Table 1), where the generator instead aims to maximize the probability of generated samples being real. In Wasserstein GAN [1] the discriminator is allowed to output a real number and the objective function is equivalent to the MM GAN loss without the sigmoid (WGAN in Table 1). The authors prove that, under an optimal (Lipschitz-smooth) discriminator, minimizing the value function with respect to the generator minimizes the Wasserstein distance between the model and data distributions. The weights of the discriminator are clipped to a small absolute value to enforce smoothness. To improve the stability of training, Gulrajani et al. [9] instead add a soft constraint on the norm of the gradient, which encourages the discriminator to be 1-Lipschitz. The gradient norm is evaluated on points obtained by linear interpolation between data points and generated samples, where the optimal discriminator should have unit gradient norm [9].
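For concreteness, here is a hedged PyTorch sketch of the losses just described; `d_real` and `d_fake` stand for raw (pre-sigmoid) discriminator outputs on real and generated batches, and all names are illustrative rather than taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def mm_gan_losses(d_real, d_fake):
    # Minimax GAN: the discriminator solves a binary classification task;
    # the generator minimizes log(1 - D(fake)), i.e. the probability of
    # its samples being labeled fake.
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    g_loss = -F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    return d_loss, g_loss

def ns_gan_g_loss(d_fake):
    # Non-saturating variant: the generator instead maximizes the probability
    # of generated samples being classified as real, giving a stronger
    # gradient early in training.
    return F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))

def wgan_losses(d_real, d_fake):
    # WGAN: the critic outputs an unbounded real number; the objective is
    # the MM GAN loss without the sigmoid.
    d_loss = d_fake.mean() - d_real.mean()
    g_loss = -d_fake.mean()
    return d_loss, g_loss
```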
Table 1: Generator and discriminator loss functions. The main differences are whether the discriminator outputs a probability (MM GAN, NS GAN, DRAGAN) or an unbounded real value (WGAN, WGAN GP, LS GAN, BEGAN), whether a gradient penalty is present (WGAN GP, DRAGAN), and where it is evaluated. We chose these models based on their popularity.
A gradient norm penalty can also be added to both MM GAN and NS GAN and evaluated around the data manifold (DRAGAN [14] in Table 1, based on NS GAN). This encourages the discriminator to be piecewise linear around the data manifold. Note that the gradient norm can also be evaluated between fake and real points, similarly to WGAN GP, and added to either MM GAN or NS GAN [6].
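A sketch of the WGAN GP style penalty under the same assumptions as above (4-D image batches; `discriminator` returns one score per example); for DRAGAN the evaluation points would instead be perturbations around the real data.

```python
import torch

def gradient_penalty(discriminator, real, fake, coef=10.0):
    # Random points on lines between real and generated samples.
    alpha = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    interp = (alpha * real + (1.0 - alpha) * fake).requires_grad_(True)
    scores = discriminator(interp)
    grads, = torch.autograd.grad(scores.sum(), interp, create_graph=True)
    # Penalize deviation of the per-example gradient norm from 1.
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return coef * ((grad_norm - 1.0) ** 2).mean()
```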
Mao et al. [16] propose a least-squares loss for the discriminator and show that minimizing the corresponding objective (LS GAN in Table 1) implicitly minimizes the Pearson χ² divergence. The idea is to provide a smooth loss which saturates more slowly than the sigmoid cross-entropy loss of the original MM GAN.
Finally, Berthelot et al. [4] propose to use an auto-encoder as a discriminator and to optimize a lower bound of the Wasserstein distance between the auto-encoder loss distributions on real and fake data. They introduce an additional hyperparameter γ to control the equilibrium between the generator and the discriminator.
4. Challenges of a Fair Comparison
There are several interesting dimensions to this problem, and there is no single right way to compare these models (i.e. the loss function used in each GAN). Unfortunately, due to the combinatorial explosion in the number of choices and their ordering, not all relevant options can be explored. While there is no definite answer on how to best compare two models, in this work we have made several pragmatic choices which were motivated by two practical concerns: providing a neutral and fair comparison, and a hard limit on the computational budget.
Which metric to use? Comparing models implies access to some metric. As discussed in Section 2, classic measures, such as the model likelihood, cannot be applied. We will argue for and study two sets of evaluation metrics in Section 5: FID, which can be computed on all data sets, and precision, recall, and F1, which we can compute for the proposed tasks.
How to compare models? Even when the metric is fixed, a given algorithm can achieve very different scores when varying the architecture, hyperparameters, random initialization (i.e. the random seed for the initial network weights), or the data set. Sensible targets include the best score across all dimensions (e.g. to claim the best performance on a fixed data set), the average or median score (rewarding models which are good in expectation), or even the worst score (rewarding models with worst-case robustness). These choices can even be combined: for example, one might train the model multiple times using the best hyperparameters, and average the score over random initializations.
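As a toy illustration of these aggregation choices (the FID values below are made up, not results from the paper):

```python
import numpy as np

fids = np.array([28.1, 29.4, 27.6, 35.2, 28.8])  # one (hypothetical) FID per random seed

print("best  :", fids.min())       # rewards a single lucky run
print("median:", np.median(fids))  # rewards models good in expectation
print("worst :", fids.max())       # rewards worst-case robustness
```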
For each of these dimensions, we took several pragmatic choices to reduce the number of possible configurations, while still exploring the most relevant options.
1. Architecture: We use the same architecture for all models. The architecture is rich enough to achieve good performance.
2. Hyperparameters: For both training hyperparameters (e.g. the learning rate), as well as model-specific ones (e.g. the gradient penalty multiplier), there are two valid approaches: (i) perform the hyperparameter optimization for each data set, or (ii) perform the hyperparameter optimization on one data set and infer a good range of hyperparameters to use on other data sets. We explore both avenues in Section 6.
3. Random seed: Even with everything else being fixed, varying the random seed may have a non-trivial influence on the results. We study this particular effect and report the corresponding confidence intervals.
4. Data set: We chose four popular data sets from GAN literature and report results separately for each data set.
5. Computational budget: Depending on the budget to optimize the parameters, different algorithms can achieve the best results. We explore how the results vary depending on the budget.
In practice, one can either use the hyperparameter values suggested by the respective authors, or try to optimize them. Figure 5, and in particular Figure 15, show that optimization is necessary. Hence, we optimize the hyperparameters for each model and data set by performing a random search. We acknowledge that models with fewer hyperparameters have an advantage over models with many hyperparameters, but consider this fair as it reflects the experience of practitioners searching for good hyperparameters for their setting.
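A minimal sketch of such a random search; the search space below is illustrative and not the one used in the study.

```python
import random

def sample_hyperparameters():
    # Draw one configuration; ranges are hypothetical examples.
    return {
        "learning_rate": 10 ** random.uniform(-5.0, -2.0),  # log-uniform
        "beta1": random.choice([0.0, 0.5, 0.9]),             # Adam momentum
        "disc_iters": random.choice([1, 5]),                 # D steps per G step
    }

# Each sampled configuration would be trained once (or over several seeds)
# and scored with FID; the reported result is a statistic of this sample,
# not just the single best trial.
trials = [sample_hyperparameters() for _ in range(100)]
```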
Article sourced from http://tongtianta.site/paper/3092
Edited by Lornatang
Proofread by Lornatang