Faster and More Stable GAN Training for High-fidelity Few-shot Image Generation (ICLR 2021)
Paper (with code and supplementary metrics): Towards Faster and Stabilized GAN Training for High-fidelity Few-shot Image Synthesis | OpenReview
Official code: https://github.com/odegeasslbc/FastGAN-pytorch
Video: https://www.youtube.com/watch?v=v8IRFcWGcWc
The main contributions of the paper are:
1. A skip-layer excitation (SLE) module in the generator:
We design the Skip-Layer channel-wise Excitation (SLE) module, which leverages low-scale activations to revise the channel responses on high-scale feature-maps. SLE allows a more robust gradient flow throughout the model weights for faster training. It also leads to an automated learning of a style/content disentanglement like StyleGAN2.
2. A discriminator paired with auto-encoding self-supervised learning:
We propose a self-supervised discriminator D trained as a feature-encoder with an extra decoder. We force D to learn a more descriptive feature-map covering more regions from an input image, thus yielding more comprehensive signals to train G. We test multiple self-supervision strategies for D, among which we show that auto-encoding works the best.
Quoting the one-line summary "converge on single gpu with few hours' training, on 1024 resolution sub-hundred images".
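Compared with a ResBlock, the Skip-Layer Excitation (SLE) module differs in several ways:
1) A standard residual connection applies element-wise addition between the activations of different conv layers, which requires those activations to have the same dimensions. SLE instead uses channel-wise multiplication between the activations, which removes the heavy convolution computation (because one side of the activations has a spatial dimension of 1).
As a result, SLE keeps the shortcut gradient flow of a ResBlock without the heavy computational load.
The module is defined as $y = F(x_{low}, \{W_i\}) \cdot x_{high}$.
Here x and y are the input and output feature maps of the SLE module;
F is the function applied to $x_{low}$, and $W_i$ are the module weights to be learned.
As shown in the figure above, $x_{low}$ and $x_{high}$ are the low-resolution and high-resolution feature maps, respectively.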
Yellow boxes represent feature-maps (we show the spatial size and omit the channel number), blue box and blue arrows represent the same up-sampling structure, red box contains the SLE module as illustrated on the top.
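The channel-wise gating in SLE can be sketched in a few lines of NumPy. This is a minimal illustration, not the official implementation: the function names, the weight shapes, and the LeakyReLU slope of 0.1 are assumptions made for the example, and the pooling helper assumes the low-resolution map's height and width are divisible by 4.

```python
import numpy as np

def adaptive_avg_pool(x, out_size=4):
    """Average-pool x of shape (C, H, W) to (C, out_size, out_size).
    Assumes H and W are divisible by out_size."""
    c, h, w = x.shape
    x = x.reshape(c, out_size, h // out_size, out_size, w // out_size)
    return x.mean(axis=(2, 4))

def leaky_relu(v, slope=0.1):  # slope is an assumed value
    return np.where(v > 0, v, slope * v)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def sle(x_low, x_high, w1, w2):
    """y = F(x_low, {W_i}) * x_high with one gate value per channel.
    x_low:  (C_low, h, w)   low-resolution feature map
    x_high: (C_high, H, W)  high-resolution feature map
    w1: (C_high, C_low, 4, 4)  4x4 conv kernel; on a 4x4 input it yields 1x1
    w2: (C_high, C_high)       1x1 conv, i.e. a matrix product
    """
    pooled = adaptive_avg_pool(x_low, 4)                        # (C_low, 4, 4)
    hidden = leaky_relu(np.einsum("oikl,ikl->o", w1, pooled))   # (C_high,)
    gate = sigmoid(w2 @ hidden)                                 # values in (0, 1)
    return gate[:, None, None] * x_high                         # broadcast over H, W

rng = np.random.default_rng(0)
x_low = rng.standard_normal((8, 8, 8))       # 8 channels at 8x8
x_high = rng.standard_normal((4, 128, 128))  # 4 channels at 128x128
w1 = rng.standard_normal((4, 8, 4, 4)) * 0.1
w2 = rng.standard_normal((4, 4)) * 0.1
y = sle(x_low, x_high, w1, w2)
print(y.shape)
```

Because the gate has spatial size 1×1, the only work done at the high resolution is one multiplication per element, which is the cost advantage over a ResBlock noted above.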
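The idea is simple: treat D as an encoder and train it with the help of a small decoder. The encoder extracts features from a real image, and the decoder must reconstruct the image from those features. This forces D to learn accurate features, both global and local. D applies decoders only to the feature map f1 at 16×16 and the feature map f2 at 8×8; each decoder consists of 4 conv layers and reconstructs a 128×128 image, which keeps the computation small. It is called self-supervised learning because auto-encoding is a common way to implement self-supervised learning, and it helps improve the model's robustness and generative ability.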
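Decoder loss: only the reconstruction loss is used for training. (In the formula, G(f) is the image that the decoder reconstructs from the feature map extracted by D, and T(x) is the processing applied to the real sample so that the loss can be computed.) The operations in G and T are not limited to cropping; other operations remain to be explored and may improve performance further.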
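For reference, the reconstruction loss described here (eq. 2 in the paper) can be written as follows, where $f$ is the intermediate feature map from D, $\mathcal{G}$ covers the processing on $f$ plus the decoder, and $\mathcal{T}$ is the processing on a real sample $x$. The notation is restated from the paper as best recalled, so check it against the original before relying on it:

```latex
\mathcal{L}_{recons} = \mathbb{E}_{f \sim D_{encode}(x),\; x \sim I_{real}}
    \left[ \left\lVert \mathcal{G}(f) - \mathcal{T}(x) \right\rVert \right]
```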
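For the 16×16 f1, the reconstruction loss is computed on a crop covering 1/8 of the height and width (to reduce computation); for the 8×8 f2, it is computed on the whole image without cropping.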
- We randomly crop f1 with 1/8 of its height and width, then crop the real image on the same portion to get $I_{part}$.
- We resize the real image to get $I$. The decoders produce $I'_{part}$ from the cropped f1, and $I'$ from f2.
- Finally, D and the decoders are trained together to minimize the loss in eq. 2, by matching $I'_{part}$ to $I_{part}$ and $I'$ to $I$.
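The "crop the real image on the same portion" step amounts to mapping a window on the feature-map grid to the matching window in pixel coordinates. The sketch below shows one way to do that; the helper name `matching_crops`, the 1024-pixel input size, and the assumption that each feature cell covers an equal square block of pixels are illustrative choices, not details taken from the official code.

```python
import random

def matching_crops(feat_size=16, image_size=1024, fraction=8, rng=random):
    """Pick a random square window on the feature map and return the
    matching window on the image, both as (top, left, size)."""
    crop = feat_size // fraction               # 16 // 8 = 2 feature cells
    top = rng.randrange(feat_size - crop + 1)  # random top-left corner
    left = rng.randrange(feat_size - crop + 1)
    scale = image_size // feat_size            # image pixels per feature cell
    feat_window = (top, left, crop)
    image_window = (top * scale, left * scale, crop * scale)
    return feat_window, image_window

feat_window, image_window = matching_crops(rng=random.Random(0))
print(feat_window, image_window)
```

With these example sizes the image window is 128×128 pixels, the same size as the decoder output, so the cropped real image and the reconstruction can be compared directly.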
Blue box and arrows represent the same residual down-sampling structure, green boxes mean the same decoder structure
G and D are trained with the hinge version of the adversarial loss (GAN Hinge Loss): GAN Hinge Loss Explained | Papers With Code
The discriminator loss consists of two parts: a soft hinge loss (used rather than the plain hinge loss, to prevent overfitting and mode collapse) plus the reconstruction loss.
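1) Soft hinge loss: trained on both real and fake samples.
2) Reconstruction loss: trained only on real samples, not on fake samples.
The reconstruction loss here is a perceptual similarity loss, not the plain L1 "reconstruction loss" I had first assumed.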
In sum, we employ the hinge version of the adversarial loss to iteratively train our D and G. We find the different GAN losses make little performance difference, while hinge loss computes the fastest:
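The hinge objectives can be made concrete with a small pure-Python sketch. It uses the standard hinge form, where the paper's $-\min(0, -1 + D(x))$ is the same quantity as $\max(0, 1 - D(x))$, and it treats the reconstruction loss as a number supplied by the caller. The function names are hypothetical, and the "soft" variant mentioned in these notes is not modeled.

```python
def d_hinge_loss(real_logits, fake_logits):
    """Discriminator hinge loss: push D(x) above +1 on real samples
    and D(G(z)) below -1 on fake samples."""
    real_term = sum(max(0.0, 1.0 - d) for d in real_logits) / len(real_logits)
    fake_term = sum(max(0.0, 1.0 + d) for d in fake_logits) / len(fake_logits)
    return real_term + fake_term

def g_hinge_loss(fake_logits):
    """Generator loss: raise the discriminator's score on fake samples."""
    return -sum(fake_logits) / len(fake_logits)

def d_total_loss(real_logits, fake_logits, recon_loss):
    """Adversarial hinge loss plus the reconstruction loss, which is
    computed on real samples only."""
    return d_hinge_loss(real_logits, fake_logits) + recon_loss

real = [2.0, 0.5]    # D outputs on real images
fake = [-1.5, 0.0]   # D outputs on generated images
print(d_hinge_loss(real, fake))       # 0.75
print(g_hinge_loss(fake))             # 0.75
print(d_total_loss(real, fake, 0.1))  # 0.85
```

Samples that D already scores beyond the margin (2.0 for real, -1.5 for fake) contribute zero loss, so D only receives gradient from samples it has not yet separated by the margin.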
Testing mode collapse with back-tracking: From a well trained GAN, one can take a real image and invert it back to a vector in the latent space of G, thus editing the image’s content by altering the back-tracked vector. Despite the various back-tracking methods, a well generalized G is arguably as important for the good inversions. To this end, we show that our model, although trained on limited image samples, still gets a desirable performance on real image back-tracking.
In Table 5, we split the images from each dataset with a training/testing ratio of 9:1, and train G on the training set. We compute a reconstruction error between all the images from the testing set and their inversions from G, after the same update of 1000 iterations on the latent vectors (to prevent the vectors from being far off the normal distribution). The baseline model’s performance is getting worse with more training iterations, which reflects mode-collapse on G. In contrast, our model gives better reconstructions with consistent performance over more training iterations. Fig. 6 presents the back-tracked examples (left-most and right-most samples in the middle panel) given the real images.
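Back-tracking means optimizing a latent vector so that G reproduces a given image. The toy sketch below illustrates only the optimization loop: a fixed linear map stands in for the trained generator, squared error stands in for the reconstruction error, and the gradient is written by hand, whereas a real setup would use the trained network and automatic differentiation. All names and values are hypothetical.

```python
def generate(w, z):
    """Toy linear 'generator': maps a latent vector z to an output vector."""
    return [sum(w_ij * z_j for w_ij, z_j in zip(row, z)) for row in w]

def recon_error(w, z, target):
    """Squared error between the generated output and the target."""
    return sum((g - t) ** 2 for g, t in zip(generate(w, z), target))

def back_track(w, target, steps=1000, lr=0.05):
    """Gradient descent on z with the generator weights held fixed."""
    z = [0.0] * len(w[0])
    for _ in range(steps):
        residual = [g - t for g, t in zip(generate(w, z), target)]
        # Gradient of the squared error with respect to z: 2 * W^T * residual
        grad = [2.0 * sum(w[i][j] * residual[i] for i in range(len(w)))
                for j in range(len(z))]
        z = [z_j - lr * g_j for z_j, g_j in zip(z, grad)]
    return z

w = [[1.0, 0.5], [0.0, 1.0], [-0.5, 0.25]]  # fixed toy generator weights
z_true = [0.8, -0.3]
target = generate(w, z_true)                # stands in for a real image
z_hat = back_track(w, target)
print(z_hat, recon_error(w, z_hat, target))
```

In the paper's setting the target comes from a held-out test set rather than from G itself, so the remaining error after a fixed number of updates indicates how well G generalizes.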
Figure 5: Qualitative comparison between our model and StyleGAN2 on 1024*1024 resolution datasets. The left-most panel shows the training images, and the right two panels show the un-curated samples from StyleGAN2 and our model. Both models are trained from scratch for 10 hours with a batch-size of 8. The samples are generated from the checkpoint with the lowest FID.
Reference: [Few-shot image generation] Towards Faster And Stabilized GAN training for high-fidelity few-shot image synthesis, 芋圆526's blog (CSDN)