



In this work we hope to help bridge the gap between the success of CNNs for supervised learning and unsupervised learning.


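We propose that one way to build
good image representations is by training Generative Adversarial Networks (GANs) (Goodfellow
et al., 2014), and later reusing parts of the generator and discriminator networks as feature extractors
for supervised tasks.
Here the authors propose that training a GAN is one way to build good image representations, and that parts of the trained generator and discriminator networks can then be reused as feature extractors for supervised tasks. In other words, using the trained discriminator in a later image classification task may give better classification results. A possible reason is that during GAN training the generator produces a large amount of data resembling negative samples and noise (images from the early generator can be regarded as noise, and later ones as negative samples). From the discriminator's point of view, the generator therefore acts as a form of data augmentation, so a discriminator reused as a classifier may do better than a classifier trained without that augmentation. (We use the trained discriminators for image classification tasks, showing competitive performance with other unsupervised algorithms.)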

3. Approach and model architecture

Core to our approach is adopting and modifying three recently demonstrated changes to CNN architectures. (There are three main improvements.)
Improvement 1: The first is the all convolutional net (Springenberg et al., 2014) which replaces deterministic spatial pooling functions (such as maxpooling) with strided convolutions, allowing the network to learn its own spatial downsampling. We use this approach in our generator, allowing it to learn its own spatial upsampling, and discriminator. (Strided convolutions and transposed convolutions replace the pooling layers.)
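To make the downsampling and upsampling concrete, here is a minimal sketch of the standard output-size arithmetic for strided and transposed convolutions. The kernel size 4, stride 2, and padding 1 are assumptions chosen for illustration (they are not given in the text quoted above), and the helper names are hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial size after a strided convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel, stride, padding):
    """Spatial size after a transposed convolution (no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


# Assumed hyperparameters, for illustration only.
K, S, P = 4, 2, 1

# Discriminator side: each strided convolution halves the resolution.
d_sizes = [64]
while d_sizes[-1] > 4:
    d_sizes.append(conv_out(d_sizes[-1], K, S, P))

# Generator side: each transposed convolution doubles the resolution.
g_sizes = [4]
while g_sizes[-1] < 64:
    g_sizes.append(deconv_out(g_sizes[-1], K, S, P))

print(d_sizes)  # [64, 32, 16, 8, 4]
print(g_sizes)  # [4, 8, 16, 32, 64]
```

With these settings no pooling layer is needed: the stride alone changes the resolution, and the kernel weights that do so are learned.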
Improvement 2: Second is the trend towards eliminating fully connected layers on top of convolutional features.
The strongest example of this is global average pooling which has been utilized in state of the
art image classification models (Mordvintsev et al.). We found global average pooling increased
model stability but hurt convergence speed. A middle ground of directly connecting the highest
convolutional features to the input and output respectively of the generator and discriminator worked
well. The first layer of the GAN, which takes a uniform noise distribution Z as input, could be called
fully connected as it is just a matrix multiplication, but the result is reshaped into a 4-dimensional
tensor and used as the start of the convolution stack. For the discriminator, the last convolution layer
is flattened and then fed into a single sigmoid output. See Fig. 1 for a visualization of an example
model architecture.
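The two "connections" described above can be sketched in plain Python: a matrix multiplication on the noise vector whose result is reshaped into a feature map, and a flattened feature map fed into one sigmoid unit. All sizes are toy values, the (N, C, H, W) layout and the noise range [-1, 1] are assumptions, and the function names are hypothetical.

```python
import math
import random

rng = random.Random(0)

# Toy sizes (hypothetical, far smaller than a real DCGAN).
N, Z_DIM, C, H, W = 3, 8, 4, 2, 2


def project_and_reshape(z, weight):
    """Multiply z by a (C*H*W, Z_DIM) matrix, then reshape the result to (C, H, W)."""
    flat = [sum(zi * wi for zi, wi in zip(z, row)) for row in weight]
    return [[[flat[(c * H + h) * W + w] for w in range(W)]
             for h in range(H)]
            for c in range(C)]


def flatten_and_score(fmap, weight):
    """Flatten a (C, H, W) feature map and feed it into a single sigmoid unit."""
    flat = [v for channel in fmap for row in channel for v in row]
    logit = sum(v * wi for v, wi in zip(flat, weight))
    return 1.0 / (1.0 + math.exp(-logit))


# Generator input: uniform noise; the range [-1, 1] is an assumption.
zs = [[rng.uniform(-1.0, 1.0) for _ in range(Z_DIM)] for _ in range(N)]
g_weight = [[rng.gauss(0.0, 0.02) for _ in range(Z_DIM)] for _ in range(C * H * W)]
batch = [project_and_reshape(z, g_weight) for z in zs]  # shape (N, C, H, W)

# Discriminator output: applied here directly to the same toy tensor.
d_weight = [rng.gauss(0.0, 0.02) for _ in range(C * H * W)]
scores = [flatten_and_score(fmap, d_weight) for fmap in batch]

print(len(batch), len(batch[0]), len(batch[0][0]), len(batch[0][0][0]))  # 3 4 2 2
```

In a real model the convolution stack sits between these two steps; the sketch only shows how the tensor shapes enter and leave it.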
Improvement 3: Third is Batch Normalization (Ioffe & Szegedy, 2015) which stabilizes learning by normalizing the
input to each unit to have zero mean and unit variance. This helps deal with training problems that
arise due to poor initialization and helps gradient flow in deeper models. This proved critical to get
deep generators to begin learning, preventing the generator from collapsing all samples to a single
point which is a common failure mode observed in GANs. Directly applying batchnorm to all layers
however, resulted in sample oscillation and model instability. This was avoided by not applying
batchnorm to the generator output layer and the discriminator input layer.
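As a small illustration of the normalization step, the sketch below normalizes one unit's activations over a mini-batch to zero mean and unit variance. It leaves out the learned scale and shift parameters and the running statistics used at inference time; the `eps` value is an assumed small constant for numerical safety.

```python
import math


def batch_norm(values, eps=1e-5):
    """Normalize one unit's activations over a mini-batch (no learned scale/shift)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n  # biased batch variance
    return [(v - mean) / math.sqrt(var + eps) for v in values]


acts = [2.0, 4.0, 6.0, 8.0]  # toy activations for a batch of 4
normed = batch_norm(acts)

new_mean = sum(normed) / len(normed)
new_var = sum((v - new_mean) ** 2 for v in normed) / len(normed)
print(new_mean, new_var)
```

The printed mean is approximately 0 and the variance slightly below 1, because `eps` is added to the variance before the square root.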


4. Details of adversarial training

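4.1 All models were trained with mini-batch stochastic gradient descent (SGD) with a mini-batch size of 128.
A note on choosing the SGD batch size: I suggest not going below 32, and it is best not to try to speed up training by shrinking the batch size and the learning rate (lr) together. The reason is that the discriminator in a GAN is evaluated the way a classifier is: it scores each image, and the class is decided by whether the score exceeds a threshold. The accuracy (acc) can therefore only change in steps of 1/batchsize. Taking MNIST training as an example, I set batchsize to 4 and looked at the training output.
With batchsize=4, acc only takes coarse values such as 0, 25%, 50%, and 100% (multiples of 1/4). That is too much randomness to steer the model weights in a sensible direction.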
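The granularity argument is easy to check with a few lines of Python. This sketch lists the accuracy values a single batch can produce; the function name is hypothetical.

```python
from fractions import Fraction


def possible_accuracies(batch_size):
    """All accuracy values one batch can yield: k correct out of batch_size."""
    return [Fraction(k, batch_size) for k in range(batch_size + 1)]


for bs in (4, 32, 128):
    accs = possible_accuracies(bs)
    step = accs[1] - accs[0]  # smallest possible change in accuracy
    print(bs, len(accs), float(step))
```

With a batch of 4 there are only five possible values, 0.25 apart, while a batch of 128 gives 129 values in steps of 0.0078125.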
4.2 Optimizer
All weights were initialized from a zero-centered Normal distribution with standard deviation 0.02. In the LeakyReLU, the slope of the leak was set to 0.2 in all models. While previous GAN work has used momentum to accelerate training, we used the Adam optimizer (Kingma & Ba, 2014) with tuned hyperparameters. We found the suggested learning rate of 0.001, to be too high, using 0.0002 instead. Additionally, we found leaving the momentum term β1 at the suggested value of 0.9 resulted in training oscillation and instability while reducing it to 0.5 helped stabilize training.
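The hyperparameters quoted above can be put into a minimal plain-Python sketch: Normal(0, 0.02) initialization, LeakyReLU with slope 0.2, and one Adam update with learning rate 0.0002 and β1 = 0.5. The values `beta2=0.999` and `eps=1e-8` are the usual Adam defaults and are assumptions here, since the quoted text does not mention them; the function names are hypothetical.

```python
import math
import random


def init_weights(n, std=0.02, seed=0):
    """Draw n weights from a zero-centered Normal distribution."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, std) for _ in range(n)]


def leaky_relu(x, slope=0.2):
    """LeakyReLU with the leak slope used in the text."""
    return x if x > 0 else slope * x


def adam_step(w, grad, m, v, t, lr=0.0002, beta1=0.5, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar weight; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


weights = init_weights(5)
w, m, v = adam_step(w=1.0, grad=3.0, m=0.0, v=0.0, t=1)
print(weights)
print(leaky_relu(-1.0), leaky_relu(2.0))
print(w)
```

On the first step the bias-corrected update is roughly `lr` times the sign of the gradient, so the weight moves by about 0.0002 regardless of the gradient's scale.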
4.3 The core formula of GAN
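For reference, this is the minimax objective from Goodfellow et al. (2014), written in LaTeX. D is trained to maximize both terms, while G is trained to minimize the second one. In practice G is often trained to maximize log D(G(z)) instead, which pursues the same goal of pushing D(G(z)) towards 1.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```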

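def train_D():  # train the discriminator
    for i in range(d_num):
        freeze G  # keep the generator fixed: its weights are not updated in this phase
        train D   # maximize discrimination accuracy, i.e. maximize log D(x) + log(1 - D(G(z)))
def train_G():  # train the generator
    for j in range(g_num):
        freeze D  # keep the previously trained discriminator fixed
        train G   # maximize log D(G(z)) (equivalently, minimize log(1 - D(G(z)))): push D(G(z)) towards 1 so that G's samples fool D as often as possible
for step in range(total):  # alternate between train_D and train_G
    if step % 2 == 0:
        train_D()
    else:
        train_G()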
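To see which way each player wants D(G(z)) to move, the following sketch evaluates the two quantities for a few hand-picked discriminator outputs. The function names are hypothetical, and the generator objective shown is the log D(G(z)) form (maximized by G), not the log(1 - D(G(z))) form.

```python
import math


def d_objective(d_real, d_fake):
    """Quantity the discriminator maximizes: log D(x) + log(1 - D(G(z)))."""
    return math.log(d_real) + math.log(1.0 - d_fake)


def g_objective(d_fake):
    """Quantity the generator maximizes here: log D(G(z))."""
    return math.log(d_fake)


# D(G(z)) near 0: the discriminator spots the fake.
# D(G(z)) near 1: the generator fools the discriminator.
for d_fake in (0.1, 0.5, 0.9):
    print(d_fake, round(d_objective(0.9, d_fake), 3), round(g_objective(d_fake), 3))
```

As D(G(z)) rises, the generator's objective increases and the discriminator's decreases, which is the adversarial tension the alternating training loop works through.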

5. Evaluations and experiments

One common technique for evaluating the quality of unsupervised representation learning algorithms is to apply them as a feature extractor on supervised datasets and evaluate the performance of linear models fitted on top of these features.
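A minimal sketch of this evaluation protocol: features from a frozen extractor are treated as fixed inputs, and only a linear model is fitted on top. Logistic regression trained by gradient descent stands in for the linear model here, which is an assumption, and the four 2-D feature vectors are made-up stand-ins for discriminator activations.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_linear_probe(features, labels, lr=0.5, epochs=500):
    """Fit logistic regression on fixed features with batch gradient descent."""
    dim = len(features[0])
    n = len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i] / n
            grad_b += err / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def accuracy(w, b, features, labels):
    """Fraction of examples whose predicted class matches the label."""
    preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
             for x in features]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Hypothetical frozen features; the extractor itself is never updated.
features = [(0.0, 0.1), (0.2, 0.0), (0.9, 1.0), (1.0, 0.8)]
labels = [0, 0, 1, 1]

w, b = fit_linear_probe(features, labels)
print(accuracy(w, b, features, labels))
```

Because only the linear layer is trained, the resulting accuracy reflects how linearly separable the classes are in the feature space, which is what makes it a measure of representation quality.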
