Paper translation: unsupervised representation learning with DCGAN

UNSUPERVISED REPRESENTATION LEARNING WITH DEEP CONVOLUTIONAL GENERATIVE ADVERSARIAL NETWORKS

Unsupervised Representation Learning with DCGANs

Abstract

In recent years, supervised learning with convolutional networks (CNNs) has seen huge adoption in computer vision applications. Comparatively, unsupervised learning with CNNs has received less attention. In this work we hope to help bridge the gap between the success of CNNs for supervised learning and unsupervised learning. We introduce a class of CNNs called deep convolutional generative adversarial networks (DCGANs), that have certain architectural constraints, and demonstrate that they are a strong candidate for unsupervised learning. Training on various image datasets, we show convincing evidence that our deep convolutional adversarial pair learns a hierarchy of representations from object parts to scenes in both the generator and discriminator. Additionally, we use the learned features for novel tasks, demonstrating their applicability as general image representations.

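In recent years CNNs have been widely used for supervised learning in computer vision, while unsupervised learning with CNNs has drawn far less attention. In this work we hope to bridge the gap between the success of CNNs in supervised learning and unsupervised learning. We introduce a class of CNNs called DCGANs (deep convolutional generative adversarial networks) that obey certain architectural constraints, and show that they are a strong candidate for unsupervised learning. Training on several image datasets gives convincing evidence that both the generator and the discriminator learn a hierarchy of representations, from object parts up to whole scenes. We also apply the learned features to new tasks, which shows that they work as general image representations.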

1. Introduction

Learning reusable feature representations from large unlabeled datasets has been an area of active research. In the context of computer vision, one can leverage the practically unlimited amount of unlabeled images and videos to learn good intermediate representations, which can then be used on a variety of supervised learning tasks such as image classification. We propose that one way to build good image representations is by training Generative Adversarial Networks (GANs), and later reusing parts of the generator and discriminator networks as feature extractors for supervised tasks. GANs provide an attractive alternative to maximum likelihood techniques. One can additionally argue that their learning process and the lack of a heuristic cost function (such as pixel-wise independent mean-square error) are attractive to representation learning. GANs have been known to be unstable to train, often resulting in generators that produce nonsensical outputs. There has been very limited published research in trying to understand and visualize what GANs learn, and the intermediate representations of GANs.

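Learning reusable features from large unlabeled datasets is an active research area. In computer vision, good intermediate representations learned from the practically unlimited supply of unlabeled images and video can then be used for supervised tasks such as image classification. We propose that training a GAN is one way to build good image representations, and that parts of the generator and discriminator can later be reused as feature extractors for supervised tasks. GANs are an attractive alternative to maximum likelihood techniques. Their learning process and their lack of a heuristic cost function (such as pixel-wise independent mean squared error) are arguably also attractive for representation learning. GANs are known to be unstable to train, which often yields generators that produce nonsensical output. Very little published research tries to understand and visualize what GANs learn, or their intermediate representations.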

In this paper, we make the following contributions:

This paper makes the following contributions:

  • We propose and evaluate a set of constraints on the architectural topology of Convolutional GANs that make them stable to train in most settings. We name this class of architectures Deep Convolutional GANs (DCGAN).
  • We propose and evaluate architectural constraints for convolutional GANs that keep training stable in most settings, and we call this class of architectures DCGAN.

  • We use the trained discriminators for image classification tasks, showing competitive performance with other unsupervised algorithms.

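  • We use the trained discriminators for image classification and compare them with other unsupervised algorithms, where they perform competitively.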

  • We visualize the filters learnt by GANs and empirically show that specific filters have learned to draw specific objects.

  • We visualize the filters learned by GANs and show empirically that specific filters learn to draw specific objects.

  • We show that the generators have interesting vector arithmetic properties allowing for easy manipulation of many semantic qualities of generated samples.

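  • We show that the generator has interesting vector arithmetic properties, which make it easy to manipulate many semantic qualities of the generated samples.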

2 RELATED WORK

2.1 REPRESENTATION LEARNING FROM UNLABELED DATA

Unsupervised representation learning is a fairly well studied problem in general computer vision research, as well as in the context of images. A classic approach to unsupervised representation learning is to do clustering on the data (for example using K-means), and leverage the clusters for improved classification scores. In the context of images, one can do hierarchical clustering of image patches (Coates & Ng, 2012) to learn powerful image representations. Another popular method is to train auto-encoders (convolutionally, stacked (Vincent et al., 2010), separating the what and where components of the code (Zhao et al., 2015), ladder structures (Rasmus et al., 2015)) that encode an image into a compact code, and decode the code to reconstruct the image as accurately as possible. These methods have also been shown to learn good feature representations from image pixels. Deep belief networks (Lee et al., 2009) have also been shown to work well in learning hierarchical representations.

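Unsupervised representation learning is a fairly well studied problem in computer vision research in general, and for images in particular. A classic approach is to cluster the data (for example with K-means) and use the clusters to improve classification scores. For images, hierarchical clustering of image patches (Coates & Ng, 2012) learns powerful representations. Another popular method is to train auto-encoders: convolutional or stacked ones (Vincent et al., 2010), ones that separate the what and where components of the code (Zhao et al., 2015), and ladder structures (Rasmus et al., 2015). They encode an image into a compact code and decode that code to reconstruct the image as accurately as possible. These methods have been shown to learn good feature representations from image pixels. Deep belief networks (Lee et al., 2009) also learn hierarchical representations well.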

2.2 GENERATING NATURAL IMAGES

Generative image models are well studied and fall into two categories: parametric and nonparametric.

Generative models of images are well studied and fall into two categories: parametric and non-parametric.

The non-parametric models often do matching from a database of existing images, often matching patches of images, and have been used in texture synthesis (Efros et al., 1999), super-resolution (Freeman et al., 2002) and in-painting (Hays & Efros, 2007).

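Non-parametric models usually match against a database of existing images, often at the level of image patches, and have been used for texture synthesis, super-resolution, and in-painting.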

Parametric models for generating images have been explored extensively (for example on MNIST digits or for texture synthesis (Portilla & Simoncelli, 2000)). However, generating natural images of the real world has not had much success until recently. A variational sampling approach to generating images (Kingma & Welling, 2013) has had some success, but the samples often suffer from being blurry. Another approach generates images using an iterative forward diffusion process (Sohl-Dickstein et al., 2015). Generative Adversarial Networks (Goodfellow et al., 2014) generated images suffering from being noisy and incomprehensible. A Laplacian pyramid extension to this approach (Denton et al., 2015) showed higher quality images, but they still suffered from the objects looking wobbly because of noise introduced in chaining multiple models. A recurrent network approach (Gregor et al., 2015) and a deconvolution network approach (Dosovitskiy et al., 2014) have also recently had some success with generating natural images. However, they have not leveraged the generators for supervised tasks.

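Parametric models for image generation have been explored extensively (for example on MNIST digits, or for texture synthesis). Generating natural images of the real world, however, had little success until recently. A variational sampling approach has had some success, but its samples are often blurry. Another approach generates images with an iterative forward diffusion process. Images generated by the original GAN were noisy and incomprehensible. A Laplacian pyramid extension of that approach produced higher quality images, but the objects still looked wobbly because of the noise introduced by chaining multiple models. A recurrent network approach and a deconvolution network approach have also recently had some success with natural images, but neither has used its generator for supervised tasks.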

2.3 VISUALIZING THE INTERNALS OF CNNS

One constant criticism of using neural networks has been that they are black-box methods, with little understanding of what the networks do in the form of a simple human-consumable algorithm. In the context of CNNs, Zeiler & Fergus (2014) showed that by using deconvolutions and filtering the maximal activations, one can find the approximate purpose of each convolution filter in the network. Similarly, using a gradient descent on the inputs lets us inspect the ideal image that activates certain subsets of filters (Mordvintsev et al.).

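A constant criticism of neural networks is that they are black boxes: there is little understanding of what a network does in the form of a simple algorithm a human can follow. For CNNs, Zeiler & Fergus (2014) showed that deconvolutions and filtering the maximal activations reveal the approximate purpose of each convolution filter in the network. Similarly, gradient descent on the inputs shows the ideal image that activates a given subset of filters (Mordvintsev et al.).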

3 APPROACH AND MODEL ARCHITECTURE

Historical attempts to scale up GANs using CNNs to model images have been unsuccessful. This motivated the authors of LAPGAN (Denton et al., 2015) to develop an alternative approach to iteratively upscale low resolution generated images which can be modeled more reliably. We also encountered difficulties attempting to scale GANs using CNN architectures commonly used in the supervised literature. However, after extensive model exploration we identified a family of architectures that resulted in stable training across a range of datasets and allowed for training higher resolution and deeper generative models.

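Historically, attempts to scale up GANs with CNNs to model images have been unsuccessful. This led the authors of LAPGAN to develop an alternative that iteratively upscales low resolution generated images, which can be modeled more reliably. We also ran into difficulties when scaling GANs with the CNN architectures common in the supervised literature. After extensive model exploration, however, we identified a family of architectures that trains stably across a range of datasets and supports higher resolution and deeper generative models.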

Core to our approach is adopting and modifying three recently demonstrated changes to CNN architectures.

The core of our approach is to adopt and modify three recently demonstrated changes to CNN architectures.

The first is the all convolutional net (Springenberg et al., 2014) which replaces deterministic spatial pooling functions (such as maxpooling) with strided convolutions, allowing the network to learn its own spatial downsampling. We use this approach in our generator, allowing it to learn its own spatial upsampling, and discriminator.

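The first is the all convolutional net, which replaces deterministic spatial pooling functions (such as max pooling) with strided convolutions so that the network learns its own spatial downsampling. We use this approach in the discriminator, and in the generator, where it lets the network learn its own spatial upsampling.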

Second is the trend towards eliminating fully connected layers on top of convolutional features. The strongest example of this is global average pooling which has been utilized in state of the art image classification models (Mordvintsev et al.). We found global average pooling increased model stability but hurt convergence speed. A middle ground of directly connecting the highest convolutional features to the input and output respectively of the generator and discriminator worked well. The first layer of the GAN, which takes a uniform noise distribution Z as input, could be called fully connected as it is just a matrix multiplication, but the result is reshaped into a 4-dimensional tensor and used as the start of the convolution stack. For the discriminator, the last convolution layer is flattened and then fed into a single sigmoid output. See Fig. 1 for a visualization of an example model architecture.

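The second is the trend towards removing fully connected layers on top of convolutional features. The strongest example is global average pooling, used in state of the art image classification models. We found that global average pooling made the model more stable but slowed convergence. A middle ground worked well: connect the highest convolutional features directly to the input of the generator and to the output of the discriminator. The first layer of the GAN, which takes uniform noise Z as input, is just a matrix multiplication and could be called fully connected, but its result is reshaped into a 4-dimensional tensor and used as the start of the convolution stack. In the discriminator, the last convolution layer is flattened and fed into a single sigmoid output. Fig. 1 visualizes an example of this architecture.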

Third is Batch Normalization (Ioffe & Szegedy, 2015) which stabilizes learning by normalizing the input to each unit to have zero mean and unit variance. This helps deal with training problems that arise due to poor initialization and helps gradient flow in deeper models. This proved critical to get deep generators to begin learning, preventing the generator from collapsing all samples to a single point which is a common failure mode observed in GANs. Directly applying batchnorm to all layers however, resulted in sample oscillation and model instability. This was avoided by not applying batchnorm to the generator output layer and the discriminator input layer.

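The third is Batch Normalization (BN), which stabilizes learning by normalizing the input to each unit to zero mean and unit variance. It helps with training problems caused by poor initialization and helps gradients flow in deeper models. It proved critical for getting deep generators to begin learning, because it keeps the generator from collapsing all samples to a single point, a common failure mode in GANs. Applying batchnorm to every layer, however, caused sample oscillation and model instability, so batchnorm is not applied to the generator output layer or the discriminator input layer.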
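
The normalization step described above can be sketched in a few lines. This is a minimal illustration that assumes a mini-batch of scalar activations for a single unit; it leaves out the learned scale and shift parameters of full Batch Normalization, and the name `batch_normalize` and the `eps` value are illustrative choices, not taken from the paper.

```python
import math


def batch_normalize(values, eps=1e-5):
    """Normalize one unit's activations over a mini-batch to zero mean, unit variance."""
    n = len(values)
    mean = sum(values) / n
    # Biased variance over the batch, as used for normalization during training.
    var = sum((v - mean) ** 2 for v in values) / n
    return [(v - mean) / math.sqrt(var + eps) for v in values]


batch = [0.5, 2.0, -1.0, 3.5]
normalized = batch_normalize(batch)
norm_mean = sum(normalized) / len(normalized)
norm_var = sum((v - norm_mean) ** 2 for v in normalized) / len(normalized)
```

After the call, `norm_mean` is numerically zero and `norm_var` is just below one because of the small `eps` term.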

The ReLU activation (Nair & Hinton, 2010) is used in the generator with the exception of the output layer which uses the Tanh function. We observed that using a bounded activation allowed the model to learn more quickly to saturate and cover the color space of the training distribution. Within the discriminator we found the leaky rectified activation (Maas et al., 2013) (Xu et al., 2015) to work well, especially for higher resolution modeling. This is in contrast to the original GAN paper, which used the maxout activation (Goodfellow et al., 2013).

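The generator uses the ReLU activation, except for the output layer, which uses Tanh. We observed that a bounded activation let the model learn more quickly to saturate and cover the color space of the training distribution. In the discriminator, the leaky ReLU worked well, especially for higher resolution modeling. This contrasts with the original GAN paper, which used the maxout activation.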

Architecture guidelines for stable Deep Convolutional GANs
* Replace any pooling layers with strided convolutions (discriminator) and fractional-strided convolutions (generator).
* Use batchnorm in both the generator and the discriminator.
* Remove fully connected hidden layers for deeper architectures.
* Use ReLU activation in generator for all layers except for the output, which uses Tanh.
* Use LeakyReLU activation in the discriminator for all layers.
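
The guidelines above can be written down as a layer plan. The sketch below is only a description of layers, not a trainable network. The `Layer` class, the function names, the default of four convolution layers per network, and the use of batchnorm on the projection layer are assumptions made for illustration. The two batchnorm exceptions and the sigmoid output follow the text of Section 3.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    kind: str         # what the layer does
    batchnorm: bool   # whether batchnorm follows the layer
    activation: str   # nonlinearity applied at the end


def generator_plan(num_upsampling=4):
    """Projection of Z, then fractional-strided convolutions, no pooling."""
    layers = [Layer("project_and_reshape", True, "relu")]
    for i in range(num_upsampling):
        is_output = i == num_upsampling - 1
        # No batchnorm on the generator output layer; Tanh only at the output.
        layers.append(Layer("fractional_strided_conv",
                            not is_output,
                            "tanh" if is_output else "relu"))
    return layers


def discriminator_plan(num_downsampling=4):
    """Strided convolutions with LeakyReLU, then a flattened sigmoid output."""
    layers = []
    for i in range(num_downsampling):
        # No batchnorm on the discriminator input layer.
        layers.append(Layer("strided_conv", i != 0, "leaky_relu"))
    layers.append(Layer("flatten_to_single_output", False, "sigmoid"))
    return layers


gen = generator_plan()
disc = discriminator_plan()
```

Keeping the plan as plain data makes the rules easy to check before any framework code is written.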

4 DETAILS OF ADVERSARIAL TRAINING

We trained DCGANs on three datasets: Large-scale Scene Understanding (LSUN) (Yu et al., 2015), Imagenet-1k, and a newly assembled Faces dataset. Details on the usage of each of these datasets are given below.

We trained DCGANs on three datasets: LSUN, Imagenet-1k, and a newly assembled Faces dataset. Details on how each dataset was used follow.

No pre-processing was applied to training images besides scaling to the range of the tanh activation function [-1, 1]. All models were trained with mini-batch stochastic gradient descent (SGD) with a mini-batch size of 128. All weights were initialized from a zero-centered Normal distribution with standard deviation 0.02. In the LeakyReLU, the slope of the leak was set to 0.2 in all models. While previous GAN work has used momentum to accelerate training, we used the Adam optimizer (Kingma & Ba, 2014) with tuned hyperparameters. We found the suggested learning rate of 0.001 to be too high, and used 0.0002 instead. Additionally, we found that leaving the momentum term β1 at the suggested value of 0.9 resulted in training oscillation and instability, while reducing it to 0.5 helped stabilize training.

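Training images were only scaled to the range of the tanh activation, [-1, 1], with no other pre-processing. All models were trained with mini-batch SGD with a mini-batch size of 128. All weights were initialized from a zero-centered Normal distribution with standard deviation 0.02. The LeakyReLU leak slope was 0.2 in all models. Previous GAN work used momentum to accelerate training, whereas we used the Adam optimizer with tuned hyperparameters. The suggested learning rate of 0.001 was too high, so we used 0.0002. Leaving the momentum term β1 at the suggested value of 0.9 caused training oscillation and instability, and reducing it to 0.5 helped stabilize training.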
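
The training settings above can be collected into a small sketch. It assumes 8-bit input pixels in [0, 255], and it fills in Adam's `beta2` and `eps` with the defaults from Kingma & Ba, since the text only states the learning rate and β1. The function and dictionary names are illustrative.

```python
import math
import random

# Values stated in the text above.
HYPERPARAMS = {
    "batch_size": 128,
    "weight_init_std": 0.02,
    "leaky_relu_slope": 0.2,
    "adam_learning_rate": 0.0002,
    "adam_beta1": 0.5,
}


def scale_pixel(value):
    """Map an 8-bit pixel value in [0, 255] to the tanh range [-1, 1]."""
    return value / 127.5 - 1.0


def leaky_relu(x, slope=HYPERPARAMS["leaky_relu_slope"]):
    """Identity for positive inputs, a small slope for negative ones."""
    return x if x > 0 else slope * x


def init_weights(count, rng):
    """Draw weights from a zero-centered Normal with standard deviation 0.02."""
    return [rng.gauss(0.0, HYPERPARAMS["weight_init_std"]) for _ in range(count)]


def adam_step(param, grad, m, v, t,
              lr=HYPERPARAMS["adam_learning_rate"],
              beta1=HYPERPARAMS["adam_beta1"],
              beta2=0.999, eps=1e-8):  # beta2 and eps: assumed Adam defaults
    """One Adam update for a single scalar parameter at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return param - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


weights = init_weights(10, random.Random(0))
new_param, m1, v1 = adam_step(param=1.0, grad=0.5, m=0.0, v=0.0, t=1)
```

On the first step the bias-corrected update has magnitude close to the learning rate, so `new_param` lands near 0.9998.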

Figure 1: DCGAN generator used for LSUN scene modeling. A 100 dimensional uniform distribution Z is projected to a small spatial extent convolutional representation with many feature maps. A series of four fractionally-strided convolutions (in some recent papers, these are wrongly called deconvolutions) then convert this high level representation into a 64 × 64 pixel image. Notably, no fully connected or pooling layers are used.

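Figure 1: The DCGAN generator used for LSUN scene modeling. A 100-dimensional uniform noise vector Z is projected to a convolutional representation with a small spatial extent and many feature maps. Four fractionally-strided convolutions then turn this high-level representation into a 64 × 64 pixel image. Notably, no fully connected or pooling layers are used.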
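
The spatial sizes implied by the caption can be checked with the standard output-size formula for a transposed (fractionally-strided) convolution. The kernel size 4, stride 2, and padding 1 used here are hypothetical values chosen so that each layer exactly doubles the resolution; the caption does not state them. The 4 × 4 starting extent is what four doublings need in order to end at 64 × 64.

```python
def transposed_conv_output_size(size, kernel=4, stride=2, padding=1):
    """Output side length of a transposed convolution without output padding."""
    return (size - 1) * stride - 2 * padding + kernel


def generator_spatial_sizes(start=4, num_layers=4):
    """Side length after the projection and after each fractionally-strided conv."""
    sizes = [start]
    for _ in range(num_layers):
        sizes.append(transposed_conv_output_size(sizes[-1]))
    return sizes


sizes = generator_spatial_sizes()
print(sizes)  # [4, 8, 16, 32, 64]
```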

4.1 LSUN

As visual quality of samples from generative image models has improved, concerns of over-fitting and memorization of training samples have risen. To demonstrate how our model scales with more data and higher resolution generation, we train a model on the LSUN bedrooms dataset containing a little over 3 million training examples. Recent analysis has shown that there is a direct link between how fast models learn and their generalization performance (Hardt et al., 2015). We show samples from one epoch of training (Fig.2), mimicking online learning, in addition to samples after convergence (Fig.3), as an opportunity to demonstrate that our model is not producing high quality samples via simply overfitting/memorizing training examples. No data augmentation was applied to the images.

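As the visual quality of samples from generative image models has improved, concerns about over-fitting and memorization of training samples have grown. To show how our model scales with more data and higher resolution generation, we train on the LSUN bedrooms dataset, which contains a little over 3 million training examples. Recent analysis shows a direct link between how fast a model learns and how well it generalizes. We show samples after one epoch of training (Fig. 2), which mimics online learning, as well as samples after convergence (Fig. 3), to demonstrate that the model does not produce high quality samples by simply over-fitting or memorizing training examples. No data augmentation was applied to the images.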

Figure 2: Generated bedrooms after one training pass through the dataset. Theoretically, the model could learn to memorize training examples, but this is experimentally unlikely as we train with a small learning rate and minibatch SGD. We are aware of no prior empirical evidence demonstrating memorization with SGD and a small learning rate.

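Figure 2: Bedrooms generated after one training pass through the dataset. In theory the model could learn to memorize training examples, but in practice this is unlikely because we train with a small learning rate and minibatch SGD. We know of no prior empirical evidence of memorization with SGD and a small learning rate.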

Figure 3: Generated bedrooms after five epochs of training. There appears to be evidence of visual under-fitting via repeated noise textures across multiple samples such as the base boards of some of the beds.

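Figure 3: Bedrooms generated after five epochs of training. Noise textures repeated across multiple samples, such as on the base boards of some of the beds, appear to be evidence of visual under-fitting.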

4.1.1 DEDUPLICATION

To further decrease the likelihood of the generator memorizing input examples (Fig.2) we perform a simple image de-duplication process. We fit a 3072-128-3072 de-noising dropout regularized ReLU autoencoder on 32 × 32 downsampled center-crops of training examples. The resulting code layer activations are then binarized via thresholding the ReLU activation which has been shown to be an effective information preserving technique (Srivastava et al., 2014) and provides a convenient form of semantic-hashing, allowing for linear time de-duplication. Visual inspection of hash collisions showed high precision with an estimated false positive rate of less than 1 in 100. Additionally, the technique detected and removed approximately 275,000 near duplicates, suggesting a high recall.

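To further reduce the chance that the generator memorizes input examples, we run a simple image de-duplication step. We fit a 3072-128-3072 de-noising, dropout-regularized ReLU autoencoder on 32 × 32 downsampled center crops of the training examples (that is, a 32 × 32 crop taken from the middle of each training image). The code layer activations are binarized by thresholding the ReLU activation, which has been shown to preserve information effectively and gives a convenient form of semantic hashing that allows de-duplication in linear time. Visual inspection of hash collisions showed high precision, with an estimated false positive rate below 1 in 100. The technique also detected and removed about 275,000 near duplicates, which suggests high recall.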
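
The hashing step can be sketched without the autoencoder itself. The example below assumes the code-layer activations are already available as lists of floats and uses a threshold of 0, which is an illustrative choice; the text does not give the threshold. Looking up each binary code in a set takes constant time on average, which is what makes the pass linear in the number of images.

```python
def binarize(code, threshold=0.0):
    """Turn ReLU code-layer activations into a tuple of bits by thresholding."""
    return tuple(1 if activation > threshold else 0 for activation in code)


def deduplicate(codes):
    """Return indices of items to keep: the first item seen for every binary hash."""
    seen = set()
    kept = []
    for index, code in enumerate(codes):
        key = binarize(code)
        if key not in seen:
            seen.add(key)
            kept.append(index)
    return kept


# Toy 3-unit codes; items 0 and 1 both hash to (0, 1, 1) and count as near duplicates.
codes = [[0.0, 1.3, 0.2], [0.0, 0.9, 0.4], [2.1, 0.0, 0.0]]
kept_indices = deduplicate(codes)
```

With the 128-unit code layer described above, each image would map to a 128-bit key in the same way.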

4.2 FACES

We scraped images containing human faces from random web image queries of people's names. The people's names were acquired from dbpedia, with a criterion that they were born in the modern era. This dataset has 3M images from 10K people. We run an OpenCV face detector on these images, keeping the detections that are sufficiently high resolution, which gives us approximately 350,000 face boxes. We use these face boxes for training. No data augmentation was applied to the images.

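We scraped images containing human faces from random web image queries for people's names. The names came from dbpedia, restricted to people born in the modern era. The dataset has 3M images of 10K people. We ran an OpenCV face detector on the images and kept the detections with sufficiently high resolution, which gave about 350,000 face boxes that we use for training. No data augmentation was applied.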

4.3 IMAGENET-1K

We use Imagenet-1k (Deng et al., 2009) as a source of natural images for unsupervised training. We train on 32 × 32 min-resized center crops. No data augmentation was applied to the images.

We use Imagenet-1k as a source of natural images for unsupervised training. We train on 32 × 32 min-resized center crops, with no data augmentation.

5 EMPIRICAL VALIDATION OF DCGANS CAPABILITIES

5.1 CLASSIFYING CIFAR-10 USING GANS AS A FEATURE EXTRACTOR

One common technique for evaluating the quality of unsupervised representation learning algorithms is to apply them as a feature extractor on supervised datasets and evaluate the performance of linear models fitted on top of these features.

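A common way to evaluate unsupervised representation learning algorithms is to use them as feature extractors on supervised datasets and measure the performance of linear models fitted on those features.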

On the CIFAR-10 dataset, a very strong baseline performance has been demonstrated from a well tuned single layer feature extraction pipeline utilizing K-means as a feature learning algorithm. When using a very large amount of feature maps (4800) this technique achieves 80.6% accuracy. An unsupervised multi-layered extension of the base algorithm reaches 82.0% accuracy (Coates & Ng, 2011). To evaluate the quality of the representations learned by DCGANs for supervised tasks, we train on Imagenet-1k and then use the discriminator's convolutional features from all layers, maxpooling each layer's representation to produce a 4 × 4 spatial grid. These features are then flattened and concatenated to form a 28672 dimensional vector and a regularized linear L2-SVM classifier is trained on top of them. This achieves 82.8% accuracy, outperforming all K-means based approaches. Notably, the discriminator has many fewer feature maps (512 in the highest layer) compared to K-means based techniques, but does result in a larger total feature vector size due to the many layers of 4 × 4 spatial locations. The performance of DCGANs is still less than that of Exemplar CNNs (Dosovitskiy et al., 2015), a technique which trains normal discriminative CNNs in an unsupervised fashion to differentiate between specifically chosen, aggressively augmented, exemplar samples from the source dataset. Further improvements could be made by finetuning the discriminator's representations, but we leave this for future work. Additionally, since our DCGAN was never trained on CIFAR-10 this experiment also demonstrates the domain robustness of the learned features.

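On CIFAR-10, a well tuned single-layer feature extraction pipeline with K-means as the feature learning algorithm is a very strong baseline. With a very large number of feature maps (4800) it reaches 80.6% accuracy, and an unsupervised multi-layer extension of the base algorithm reaches 82.0%. To evaluate the representations learned by DCGANs for supervised tasks, we train on Imagenet-1k and then take the discriminator's convolutional features from all layers, max-pooling each layer's representation to a 4 × 4 spatial grid. These features are flattened and concatenated into a 28672-dimensional vector, on which a regularized linear L2-SVM classifier is trained. This reaches 82.8% accuracy, better than every K-means based approach. The discriminator has far fewer feature maps (512 in the highest layer) than the K-means techniques, but its total feature vector is larger because of the many layers of 4 × 4 spatial locations. DCGANs still fall short of Exemplar CNNs, which train ordinary discriminative CNNs without labels to tell apart specifically chosen, aggressively augmented exemplar samples from the source dataset. Fine-tuning the discriminator's representations could improve results further, but we leave that to future work. Because our DCGAN was never trained on CIFAR-10, the experiment also demonstrates the domain robustness of the learned features.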
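
The pooling and concatenation step can be sketched on toy data. The example assumes square feature maps whose side is a multiple of 4, stored as nested lists; the layer sizes below are made up for illustration and are not the discriminator's real sizes. Since every map contributes 4 × 4 = 16 values, a 28672-dimensional vector corresponds to 28672 / 16 = 1792 feature maps in total across the layers used.

```python
def max_pool_to_grid(feature_map, grid=4):
    """Max-pool one square 2-D feature map down to a grid x grid map."""
    size = len(feature_map)
    step = size // grid  # assumes size is a multiple of grid
    return [
        [
            max(
                feature_map[r][c]
                for r in range(gr * step, (gr + 1) * step)
                for c in range(gc * step, (gc + 1) * step)
            )
            for gc in range(grid)
        ]
        for gr in range(grid)
    ]


def extract_features(layers, grid=4):
    """Pool every map of every layer to grid x grid, flatten, and concatenate."""
    vector = []
    for maps in layers:
        for feature_map in maps:
            pooled = max_pool_to_grid(feature_map, grid)
            vector.extend(value for row in pooled for value in row)
    return vector


# Toy input: one layer with two 8x8 maps and one layer with three 4x4 maps.
layer_a = [[[r * 8 + c for c in range(8)] for r in range(8)] for _ in range(2)]
layer_b = [[[1.0] * 4 for _ in range(4)] for _ in range(3)]
features = extract_features([layer_a, layer_b])
```

The resulting vector has (2 + 3) × 16 = 80 entries and would be the input to the linear classifier.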

Table 1: CIFAR-10 classification results using our pre-trained model. Our DCGAN is not pretrained on CIFAR-10, but on Imagenet-1k, and the features are used to classify CIFAR-10 images.

| Model | Accuracy | Accuracy (400 per class) | Max # of feature units |
|---|---|---|---|
| 1 Layer K-means | 80.6% | 63.7% (±0.7%) | 4800 |
| 3 Layer K-means Learned RF | 82.0% | 70.7% (±0.7%) | 3200 |
| View Invariant K-means | 81.9% | 72.6% (±0.7%) | 6400 |
| Exemplar CNN | 84.3% | 77.4% (±0.2%) | 1024 |
| DCGAN (ours) + L2-SVM | 82.8% | 73.8% (±0.4%) | 512 |

5.2 CLASSIFYING SVHN DIGITS USING GANS AS A FEATURE EXTRACTOR

On the StreetView House Numbers dataset (SVHN) (Netzer et al., 2011), we use the features of the discriminator of a DCGAN for supervised purposes when labeled data is scarce. Following similar dataset preparation rules as in the CIFAR-10 experiments, we split off a validation set of 10,000 examples from the non-extra set and use it for all hyperparameter and model selection. 1000 uniformly class distributed training examples are randomly selected and used to train a regularized linear L2-SVM classifier on top of the same feature extraction pipeline used for CIFAR-10. This achieves state of the art (for classification using 1000 labels) at 22.48% test error, improving upon another modification of CNNs designed to leverage unlabeled data (Zhao et al., 2015). Additionally, we validate that the CNN architecture used in DCGAN is not the key contributing factor of the model's performance by training a purely supervised CNN with the same architecture on the same data and optimizing this model via random search over 64 hyperparameter trials (Bergstra & Bengio, 2012). It achieves a significantly higher 28.87% validation error.

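With the same feature extraction pipeline as for CIFAR-10, the model reaches 22.48% test error on the StreetView House Numbers dataset. The CNN architecture used in DCGAN is not the key factor in this performance: a purely supervised CNN with the same architecture, trained on the same data, does worse, with 28.87% validation error.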

6 INVESTIGATING AND VISUALIZING THE INTERNALS OF THE NETWORKS

We investigate the trained generators and discriminators in a variety of ways. We do not do any kind of nearest neighbor search on the training set. Nearest neighbors in pixel or feature space are trivially fooled (Theis et al., 2015) by small image transforms. We also do not use log-likelihood metrics to quantitatively assess the model, as it is a poor (Theis et al., 2015) metric.

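We investigate the trained generators and discriminators in several ways. These do not include nearest neighbor search on the training set, which is trivially fooled by small image transforms, or log-likelihood metrics, which are a poor measure of the model.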

Table 2: SVHN classification with 1000 labels

| Model | Error rate |
|---|---|
| KNN | 77.93% |
| TSVM | 66.55% |
| M1+KNN | 65.63% |
| M1+TSVM | 54.33% |
| M1+M2 | 36.02% |
| SWWAE without dropout | 27.83% |
| SWWAE with dropout | 23.56% |
| DCGAN (ours) + L2-SVM | 22.48% |
| Supervised CNN with the same architecture | 28.87% (validation) |

6.1 WALKING IN THE LATENT SPACE

The first experiment we did was to understand the landscape of the latent space. Walking on the manifold that is learnt can usually tell us about signs of memorization (if there are sharp transitions) and about the way in which the space is hierarchically collapsed. If walking in this latent space results in semantic changes to the image generations (such as objects being added and removed), we can reason that the model has learned relevant and interesting representations. The results are shown in Fig.4.

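Our first experiment was to understand the landscape of the latent space. Walking on the learned manifold can usually reveal signs of memorization (sharp transitions) and how the space is hierarchically collapsed. If walking in the latent space changes the semantics of the generated images, for example objects being added or removed, we can reason that the model has learned relevant and interesting representations. Fig. 4 shows the results.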
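
A walk between two latent points can be sketched as follows. The example assumes plain linear interpolation, which is one simple way to move through Z; the text does not say which interpolation scheme was used. Each point on the path would be fed to the generator to produce one image of the sequence. The function name and the 2-dimensional toy vectors are illustrative.

```python
def interpolate(z_start, z_end, steps):
    """Return `steps` evenly spaced points on the segment from z_start to z_end."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    path = []
    for i in range(steps):
        t = i / (steps - 1)
        path.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path


# Toy 2-dimensional latent vectors; a real Z here would have 100 dimensions.
path = interpolate([-1.0, 0.0], [1.0, 1.0], steps=5)
```

Smooth changes in the generated images along such a path are what the experiment reads as evidence against memorization.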
Figure 4: Top rows: Interpolation between a series of 9 random points in Z show that the space learned has smooth transitions, with every image in the space plausibly looking like a bedroom. In the 6th row, you see a room without a window slowly transforming into a room with a giant window. In the 10th row, you see what appears to be a TV slowly being transformed into a window.

Figure 4: Images interpolated between a series of 9 random points in Z.

6.2 VISUALIZING THE DISCRIMINATOR FEATURES

Previous work has demonstrated that supervised training of CNNs on large image datasets results in very powerful learned features (Zeiler & Fergus, 2014). Additionally, supervised CNNs trained on scene classification learn object detectors (Oquab et al., 2014). We demonstrate that an unsupervised DCGAN trained on a large image dataset can also learn a hierarchy of features that are interesting.

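Previous work has shown that supervised CNNs trained on large image datasets learn very powerful features, and that supervised CNNs trained on scene classification learn object detectors. We show that an unsupervised DCGAN trained on a large image dataset also learns an interesting hierarchy of features.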

Using guided backpropagation as proposed by (Springenberg et al., 2014), we show in Fig.5 that the features learnt by the discriminator activate on typical parts of a bedroom, like beds and windows. For comparison, in the same figure, we give a baseline for randomly initialized features that are not activated on anything that is semantically relevant or interesting.

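Using guided backpropagation, Fig. 5 shows that the features learned by the discriminator activate on typical parts of a bedroom, such as beds and windows. For comparison, the same figure gives a baseline of randomly initialized features, which do not activate on anything semantically relevant or interesting.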
Figure 5: On the right, guided backpropagation visualizations of maximal axis-aligned responses for the first 6 learned convolutional features from the last convolution layer in the discriminator. Notice a significant minority of features respond to beds - the central object in the LSUN bedrooms dataset. On the left is a random filter baseline. Comparing to the previous responses there is little to no discrimination and random structure.

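Figure 5: On the right, the maximal axis-aligned responses of the first 6 learned convolutional features; on the left, a random filter baseline.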

6.3 MANIPULATING THE GENERATOR REPRESENTATION

6.3.1 FORGETTING TO DRAW CERTAIN OBJECTS

In addition to the representations learnt by a discriminator, there is the question of what representations the generator learns. The quality of samples suggest that the generator learns specific object representations for major scene components such as beds, windows, lamps, doors, and miscellaneous furniture. In order to explore the form that these representations take, we conducted an experiment to attempt to remove windows from the generator completely.

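Besides the representations learned by the discriminator, there is the question of what the generator learns. The sample quality suggests that the generator learns specific object representations for the major scene components, such as beds, windows, lamps, doors, and miscellaneous furniture. To explore the form these representations take, we ran an experiment that tries to remove windows from the generator completely.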

On 150 samples, 52 window bounding boxes were drawn manually. On the second highest convolution layer features, logistic regression was fit to predict whether a feature activation was on a window (or not), by using the criterion that activations inside the drawn bounding boxes are positives and random samples from the same images are negatives. Using this simple model, all feature maps with weights greater than zero (200 in total) were dropped from all spatial locations. Then, random new samples were generated with and without the feature map removal.

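On 150 samples, 52 window bounding boxes were drawn by hand. On the features of the second highest convolution layer, a logistic regression was fit to predict whether a feature activation was on a window, with activations inside the drawn boxes as positives and random samples from the same images as negatives. With this simple model, every feature map with a weight greater than zero (200 in total) was dropped at all spatial locations. Random new samples were then generated with and without the feature map removal.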
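
The selection and removal step can be sketched on toy data. The example assumes the logistic regression has already been fit and that its weights are given as a list with one weight per feature map; the numbers below are invented. It also reads "dropped" as setting the whole feature map to zero, which is an interpretation, not something the text spells out.

```python
def window_feature_indices(weights):
    """Indices of feature maps whose logistic-regression weight is positive."""
    return [index for index, weight in enumerate(weights) if weight > 0]


def drop_feature_maps(feature_maps, indices):
    """Zero the selected feature maps at every spatial location."""
    dropped = set(indices)
    return [
        [[0.0 for _ in row] for row in feature_map] if index in dropped else feature_map
        for index, feature_map in enumerate(feature_maps)
    ]


# Toy data: four 2x2 feature maps and one made-up classifier weight per map.
weights = [0.8, -0.3, 0.0, 1.2]
maps = [[[1.0, 2.0], [3.0, 4.0]] for _ in weights]
selected = window_feature_indices(weights)
edited = drop_feature_maps(maps, selected)
```

The edited maps would then be passed through the remaining generator layers to produce the sample without the selected features.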

The generated images with and without the window dropout are shown in Fig.6, and interestingly, the network mostly forgets to draw windows in the bedrooms, replacing them with other objects.
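Fig. 6 shows the images generated both ways. Interestingly, the network mostly forgets to draw the windows in the bedrooms and replaces them with other objects.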
Figure 6: Top row: un-modified samples from model. Bottom row: the same samples generated with dropping out "window" filters. Some windows are removed, others are transformed into objects with similar visual appearance such as doors and mirrors. Although visual quality decreased, overall scene composition stayed similar, suggesting the generator has done a good job disentangling scene representation from object representation. Extended experiments could be done to remove other objects from the image and modify the objects the generator draws.
Figure 6: The top row shows samples without the window dropout; the bottom row shows the same samples with it.

6.3.2 VECTOR ARITHMETIC ON FACE SAMPLES

In the context of evaluating learned representations of words, Mikolov et al. (2013) demonstrated that simple arithmetic operations revealed rich linear structure in representation space. One canonical example demonstrated that vector("King") - vector("Man") + vector("Woman") resulted in a vector whose nearest neighbor was the vector for Queen. We investigated whether similar structure emerges in the Z representation of our generators. We performed similar arithmetic on the Z vectors of sets of exemplar samples for visual concepts. Experiments working on only single samples per concept were unstable, but averaging the Z vector for three exemplars showed consistent and stable generations that semantically obeyed the arithmetic. In addition to the object manipulation shown in Fig. 7, we demonstrate that face pose is also modeled linearly in Z space (Fig. 8). These demonstrations suggest interesting applications can be developed using Z representations learned by our models. It has been previously demonstrated that conditional generative models can learn to convincingly model object attributes like scale, rotation, and position (Dosovitskiy et al., 2014). This is to our knowledge the first demonstration of this occurring in purely unsupervised models. Further exploring and developing the above mentioned vector arithmetic could dramatically reduce the amount of data needed for conditional generative modeling of complex image distributions.

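Mikolov et al. (2013) showed for learned word representations that simple arithmetic reveals rich linear structure in representation space. In one canonical example, vector("King") - vector("Man") + vector("Woman") gives a vector whose nearest neighbor is the vector for Queen. We investigated whether similar structure emerges in the Z representation of our generators, and performed similar arithmetic on the Z vectors of sets of exemplar samples for visual concepts. Experiments with a single sample per concept were unstable, but averaging the Z vectors of three exemplars gave consistent, stable generations that semantically obeyed the arithmetic. Beyond the object manipulation in Fig. 7, face pose is also modeled linearly in Z space (Fig. 8). Previous work showed that conditional generative models can learn object attributes such as scale, rotation, and position. To our knowledge this is the first demonstration of this in purely unsupervised models. Developing this vector arithmetic further could dramatically reduce the amount of data needed for conditional generative modeling of complex image distributions.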
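
The arithmetic on averaged Z vectors can be sketched as follows. The three exemplar sets are toy 2-dimensional vectors standing in for the 100-dimensional Z vectors of hand-picked samples, and the function names are illustrative. The uniform noise of scale ±0.25 follows the Figure 7 caption.

```python
import random


def mean_vector(vectors):
    """Average a set of Z vectors component by component."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def concept_arithmetic(set_a, set_b, set_c):
    """Compute mean(set_a) - mean(set_b) + mean(set_c) on Z vectors."""
    mean_a, mean_b, mean_c = (mean_vector(s) for s in (set_a, set_b, set_c))
    return [a - b + c for a, b, c in zip(mean_a, mean_b, mean_c)]


def jitter(vector, rng, scale=0.25):
    """Add uniform noise in [-scale, scale] to get a nearby Z vector."""
    return [value + rng.uniform(-scale, scale) for value in vector]


# Three exemplars per visual concept, as in the experiment described above.
set_a = [[1.0, 2.0], [3.0, 2.0], [2.0, 2.0]]
set_b = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
set_c = [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]

y = concept_arithmetic(set_a, set_b, set_c)
neighbors = [jitter(y, random.Random(seed)) for seed in range(8)]
```

Feeding `y` and its eight jittered neighbors to the generator would give the kind of 3 × 3 grid of samples that the Figure 7 caption describes.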
Figure 7: Vector arithmetic for visual concepts. For each column, the Z vectors of samples are averaged. Arithmetic was then performed on the mean vectors creating a new vector Y. The center sample on the right hand side is produced by feeding Y as input to the generator. To demonstrate the interpolation capabilities of the generator, uniform noise sampled with scale ±0.25 was added to Y to produce the 8 other samples. Applying arithmetic in the input space (bottom two examples) results in noisy overlap due to misalignment.
Figure 7: Vector arithmetic on generator inputs for visual concepts, shown on face samples.
Figure 8: A "turn" vector was created from four averaged samples of faces looking left vs looking right. By adding interpolations along this axis to random samples we were able to reliably transform their pose.
Figure 8: The "turn" vector moves faces continuously between looking left and looking right.

7 CONCLUSION AND FUTURE WORK

We propose a more stable set of architectures for training generative adversarial networks and we give evidence that adversarial networks learn good representations of images for supervised learning and generative modeling. There are still some forms of model instability remaining: we noticed that as models are trained longer they sometimes collapse a subset of filters to a single oscillating mode. Further work is needed to tackle this form of instability. We think that extending this framework to other domains such as video (for frame prediction) and audio (pre-trained features for speech synthesis) should be very interesting. Further investigations into the properties of the learnt latent space would be interesting as well.

Worth noting: as models are trained longer, they sometimes collapse a subset of filters to a single oscillating mode.
