Generative models:
explicit models: likelihood-based models (autoregressive models, flows, VAEs)
implicit models: sample z, then map it to a sample x; the deep neural network is learned without explicit density estimation
G captures the data distribution, D estimates the divergence between $p_{data}$ and $p_G$.
$$
\min_G \max_D V(G,D) \\
V(G,D) = \mathbb{E}_{x\sim p_{data}}[\log D(x)] + \mathbb{E}_{z\sim p_{z}}[\log (1-D(G(z)))]
$$
# last layer of D is nn.Sigmoid()
# Discriminator
f_loss = criterion(netD(fake_img.detach()), f_l)
r_loss = criterion(netD(real_img.detach()), r_l)
D_loss = (f_loss+r_loss)/2
# Generator
G_loss = criterion(netD(fake_img), r_l) # note: the real labels are used here
$$
IS(x) = \exp(H(y) - H(y \mid x))
$$
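As a small worked example of the formula above, the sketch below computes the score from a list of predicted class distributions p(y|x). The function names and the toy probabilities are illustrative assumptions; in practice p(y|x) would come from a pretrained classifier applied to generated images.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)

def inception_score(cond_probs):
    """cond_probs[i] is p(y | x_i) for the i-th generated sample."""
    n = len(cond_probs)
    num_classes = len(cond_probs[0])
    # The marginal p(y) is the average of the conditionals over all samples.
    marginal = [sum(p[k] for p in cond_probs) / n for k in range(num_classes)]
    h_y = entropy(marginal)                                 # H(y)
    h_y_given_x = sum(entropy(p) for p in cond_probs) / n   # H(y|x)
    return math.exp(h_y - h_y_given_x)

# Confident and diverse predictions: the score reaches the number of classes.
sharp = [[1.0, 0.0], [0.0, 1.0]]
# Uninformative predictions: the score drops to 1.
flat = [[0.5, 0.5], [0.5, 0.5]]
print(inception_score(sharp), inception_score(flat))
```

A high score therefore needs both a low H(y|x) (each sample is classified confidently) and a high H(y) (the samples cover many classes).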
$$
D^* = \arg\max_D V(G,D) \\
D^*(x) = \frac{p_{data}(x)}{p_{data}(x)+p_{G}(x)} \\
\max_D V(G,D) = V(G,D^*) = -2\log 2 + 2\,JSD(p_{data}\,\|\,p_G) \\
G^* = \arg\min_G \max_D V(G,D) = \arg\min_G Div(p_G, p_{data})
$$
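The identity for V(G, D*) can be checked numerically on small discrete distributions. The sketch below is only a sanity check under that assumption (finite support, with every outcome having non-zero mass under at least one of the two distributions); the helper names and the toy distributions are made up for illustration.

```python
import math

def kl(p, q):
    """KL divergence between discrete distributions (terms with p = 0 are skipped)."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: average KL to the mixture m = (p + q) / 2."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def value_at_optimal_d(p_data, p_g):
    """V(G, D*) with D*(x) = p_data(x) / (p_data(x) + p_G(x))."""
    v = 0.0
    for a, b in zip(p_data, p_g):
        d_star = a / (a + b)
        if a > 0:
            v += a * math.log(d_star)        # E_{x ~ p_data}[log D*(x)]
        if b > 0:
            v += b * math.log(1 - d_star)    # E_{x ~ p_G}[log(1 - D*(x))]
    return v

p_data = [0.1, 0.4, 0.5]
p_g = [0.3, 0.3, 0.4]
lhs = value_at_optimal_d(p_data, p_g)
rhs = -2 * math.log(2) + 2 * jsd(p_data, p_g)
print(lhs, rhs)  # the two values agree up to floating-point error
```

When p_G equals p_data the JSD term is 0 and the value reduces to -2 log 2, the minimum the generator can reach.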
Discriminator saturation: when D classifies the images produced by G as fake with high confidence, G cannot update because its gradient vanishes.
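To make the saturation concrete, the sketch below compares the gradient of the two usual generator losses with respect to the discriminator logit `a` on a fake image, where D(G(z)) = sigmoid(a). Treating the gradient at the logit as a proxy for the signal that reaches G is a simplifying assumption, and the function names are illustrative.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a):
    """d/da of log(1 - sigmoid(a)), the generator term in V(G, D)."""
    return -sigmoid(a)

def grad_non_saturating(a):
    """d/da of -log(sigmoid(a)), the loss obtained by labelling fakes as real."""
    return -(1.0 - sigmoid(a))

# a = -10 means D is almost certain that the sample is fake.
for a in (-10.0, 0.0):
    print(a, grad_saturating(a), grad_non_saturating(a))
```

With a confident discriminator the first gradient is nearly zero while the second stays close to -1, which is why the generator loss in the snippet above is computed against the real labels.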
Adam: lr = 2e-4, beta1=0.5, batch size=128
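A minimal sketch of how these settings might be wired up in PyTorch. The two `nn.Linear` modules are hypothetical stand-ins for the real networks, and beta2 = 0.999 is an assumption (it is the PyTorch default; the notes only give beta1).

```python
import torch
from torch import nn

# Hypothetical stand-in networks; only their parameters matter here.
netG = nn.Linear(100, 784)
netD = nn.Linear(784, 1)

# beta2 = 0.999 is assumed, since only beta1 = 0.5 is specified.
optG = torch.optim.Adam(netG.parameters(), lr=2e-4, betas=(0.5, 0.999))
optD = torch.optim.Adam(netD.parameters(), lr=2e-4, betas=(0.5, 0.999))
batch_size = 128
```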
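Proposes the Earth Mover Distance (EMD) as the measure of distance between distributions and obtains W by optimizing its dual problem, which requires constraining D to be Lipschitz. The constraint is enforced by weight clipping, i.e. hard truncation of the weights. Although clipping is not a good solution, the paper shows that this approximation of W resolves the training instability of the JSD objective, making training more robust and reducing mode collapse.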
New divergence measure for optimizing the generator (Earth Mover Distance)
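$$
W(\mathbb{P}_{data}, \mathbb{P}_{G}) = \inf_{\gamma \in \Pi(\mathbb{P}_{data}, \mathbb{P}_{G})} \mathbb{E}_{(x,y)\sim\gamma}[\|x-y\|] \\
= \sup_{\|D\|_L \le 1} \mathbb{E}_{x\sim \mathbb{P}_{data}}[D(x)] - \mathbb{E}_{x\sim \mathbb{P}_{G}}[D(x)]
$$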
$$
\min_G \max_{D\in\mathscr{D}} \mathbb{E}_{x\sim P_{data}}[D(x)] - \mathbb{E}_{x\sim P_{G}}[D(x)]
$$
where $\mathscr{D}$ is the set of 1-Lipschitz functions.
Addresses instabilities with JSD version (sigmoid cross entropy)
Robust to architectural choices
Progress on mode collapse and stability of derivative wrt input
Introduces the idea of using lipschitzness to stabilize GAN training
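The advantage of the Earth Mover Distance over JSD shows up already for two point masses on a 1-D grid: JSD is stuck at log 2 whenever the supports are disjoint, while EMD grows with the distance between them and so still tells the generator which way to move. The sketch below assumes a grid with unit spacing; the helper names are illustrative.

```python
import math

def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def emd_1d(p, q):
    """Earth Mover Distance on a 1-D grid with unit spacing:
    the sum of absolute differences between the two CDFs."""
    total, cdf_gap = 0.0, 0.0
    for x, y in zip(p, q):
        cdf_gap += x - y
        total += abs(cdf_gap)
    return total

def point_mass(position, size=10):
    return [1.0 if i == position else 0.0 for i in range(size)]

p_data = point_mass(0)
for shift in (1, 3, 6):
    p_g = point_mass(shift)
    # JSD stays at log 2 for every shift; EMD equals the shift.
    print(shift, jsd(p_data, p_g), emd_1d(p_data, p_g))
```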
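Compared with the first form of the original GAN, WGAN changes only four things: remove the sigmoid from the last layer of D; drop the log in the losses of G and D; clip the parameters of D to a fixed range [-c, c] after every update; avoid momentum-based optimizers such as Adam in favor of RMSProp or SGD.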
# loss function changes
D_loss = -torch.mean(netD(real_imgs.detach())) + torch.mean(netD(fake_imgs.detach()))
## weight clipping
for p in netD.parameters():
    # clamp every weight to [-c, c]
    p.data.clamp_(-c, c)
# generator: no detach here, the gradient must flow back into G
G_loss = -torch.mean(netD(fake_imgs))
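Proposes a gradient penalty regularization term to enforce the Lipschitz constraint; training becomes more robust, and the model became the baseline for many later GANs.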
$$
\min_G \max_{D} \mathbb{E}_{x\sim P_{data}}[D(x)] - \mathbb{E}_{x\sim P_{G}}[D(x)] - \lambda\, \mathbb{E}_{\hat{x}\sim P_{\hat{x}}}[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2]
$$
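import torch
from torch import autograd

# calculate Gradient Penalty
def compute_GP(discriminator, real_imgs, fake_imgs):
    # one interpolation coefficient per sample, broadcast over C, H, W
    epsilon = torch.rand(real_imgs.size(0), 1, 1, 1, device=real_imgs.device)
    x_hat = (epsilon * real_imgs + (1 - epsilon) * fake_imgs).requires_grad_(True)
    outputs = discriminator(x_hat)
    gradients = autograd.grad(
        outputs=outputs,
        inputs=x_hat,
        grad_outputs=torch.ones_like(outputs),
        create_graph=True,  # keep the graph so the penalty itself can be backpropagated
    )[0]
    # flatten per sample, then penalize the deviation of the gradient norm from 1
    gradients = gradients.reshape(real_imgs.size(0), -1)
    gp = torch.mean((gradients.norm(2, dim=1) - 1) ** 2)
    return gp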
gradient_penalty = compute_GP(netD, img.data, fake_img.data)
D_loss = -torch.mean(netD(img.detach())) + torch.mean(netD(fake_img.detach())) + \
    lambda_ * gradient_penalty  # loss function changes
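One way to sanity-check a gradient penalty is a linear critic D(x) = w·x: its gradient with respect to x is w everywhere, so the penalty must equal (‖w‖₂ − 1)² no matter which interpolates are drawn. The sketch below is self-contained and uses 2-D vectors instead of images, which is an assumption made to keep it small; `gradient_penalty` is an illustrative name.

```python
import torch
from torch import autograd, nn

def gradient_penalty(critic, real, fake):
    """Penalty (||grad_x critic(x_hat)||_2 - 1)^2 averaged over random interpolates."""
    eps = torch.rand(real.size(0), 1, device=real.device)  # shape (B, 1) for vector data
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    out = critic(x_hat)
    grads = autograd.grad(
        outputs=out,
        inputs=x_hat,
        grad_outputs=torch.ones_like(out),
        create_graph=True,
    )[0]
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()

# Linear critic with weight vector (3, 4), so the gradient norm is 5 everywhere.
critic = nn.Linear(2, 1, bias=False)
with torch.no_grad():
    critic.weight.copy_(torch.tensor([[3.0, 4.0]]))

real, fake = torch.randn(8, 2), torch.randn(8, 2)
gp = gradient_penalty(critic, real, fake)
print(gp.item())  # (5 - 1)^2 = 16 up to floating-point error
```

For image batches the interpolation coefficient would need shape (B, 1, 1, 1) so that it broadcasts over channels, height, and width.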
Constrains the weight matrix W of every discriminator layer by its spectral norm to enforce the Lipschitz constraint, which stabilizes discriminator training.
keep gradient norm smaller than 1 everywhere.
spectral norm = largest singular value of W.
Computing the singular values of W exactly is expensive, so $\sigma(W)$ is estimated with the power iteration method.
Spectral normalization stabilizes the training of discriminators (critics) in Generative Adversarial Networks (GANs) by rescaling the weight tensor with spectral norm σ of the weight matrix calculated using power iteration method.
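The power iteration estimate can be sketched in a few lines of plain Python. This is an illustration on a tiny matrix, not the library implementation; the helper names, the start vector, and the iteration count are assumptions, and the start vector must not be orthogonal to the leading singular vector.

```python
import math

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def transpose(W):
    return [list(col) for col in zip(*W)]

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def spectral_norm(W, n_iters=100):
    """Estimate the largest singular value of W by power iteration."""
    Wt = transpose(W)
    u = normalize([1.0] * len(W))        # arbitrary non-zero start vector
    for _ in range(n_iters):
        v = normalize(matvec(Wt, u))     # v <- W^T u / ||W^T u||
        u = normalize(matvec(W, v))      # u <- W v / ||W v||
    return sum(a * b for a, b in zip(u, matvec(W, v)))  # sigma ~ u^T W v

W = [[3.0, 0.0], [0.0, 1.0]]
sigma = spectral_norm(W)
# Rescaling by sigma gives a matrix whose spectral norm is about 1.
W_sn = [[w / sigma for w in row] for row in W]
print(sigma, spectral_norm(W_sn))
```

In PyTorch the same rescaling is available as a layer wrapper through `torch.nn.utils.spectral_norm`.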
lots of applications:
learning to translate an image from a source domain X to a target domain Y in the absence of paired examples.
introduce cycle consistency.
$$
L(G,F,D_X,D_Y) = L_{GAN}(G,D_Y,X,Y) + L_{GAN}(F,D_X,Y,X) + \lambda L_{cyc}(G,F) \\
L_{GAN}(G,D_Y,X,Y) = \mathbb{E}_{y\sim p_{data}(y)}[\log D_Y(y)] + \mathbb{E}_{x\sim p_{data}(x)}[\log (1-D_Y(G(x)))] \\
L_{cyc}(G,F) = \mathbb{E}_{x\sim p_{data}(x)}[\|F(G(x))-x\|_1] + \mathbb{E}_{y\sim p_{data}(y)}[\|G(F(y))-y\|_1]
$$
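A minimal PyTorch sketch of the cycle-consistency term. The two `nn.Linear` layers are hypothetical stand-ins for the generators G: X → Y and F: Y → X, and `nn.L1Loss` averages over all elements, so it matches the L1 norm only up to a constant factor.

```python
import torch
from torch import nn

def cycle_consistency_loss(G, F, x, y):
    """L_cyc(G, F) = E_x ||F(G(x)) - x||_1 + E_y ||G(F(y)) - y||_1."""
    l1 = nn.L1Loss()
    return l1(F(G(x)), x) + l1(G(F(y)), y)

# Hypothetical stand-ins for the two generators.
G = nn.Linear(4, 4)
F = nn.Linear(4, 4)
x, y = torch.randn(8, 4), torch.randn(8, 4)

loss = cycle_consistency_loss(G, F, x, y)
loss.backward()  # gradients reach both generators

# Perfect inverses give zero loss, e.g. two identity maps.
identity = nn.Identity()
print(cycle_consistency_loss(identity, identity, x, y).item())  # 0.0
```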
learn disentangled representations in an unsupervised manner.
mutual information $I(c;G(z,c))$ should be high → maximize its lower bound $L_I(G,Q)$
$$
\min_{G,Q}\max_D V_{InfoGAN}(D,G,Q) = V(D,G) - \lambda L_I(G,Q) \\
L_I(G,Q) = \mathbb{E}_{x\sim G(z,c)}[\mathbb{E}_{c'\sim P(c|x)}[\log Q(c'|x)]] + H(c) \le I(c;G(z,c))
$$
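The lower bound can be estimated by sampling a code c, generating x = G(z, c), and scoring how well Q recovers c from x. The sketch below assumes a categorical code with a uniform prior, so that H(c) = log K; the function name and the toy Q outputs are illustrative.

```python
import math

def info_lower_bound(q_probs, codes, num_codes):
    """Monte Carlo estimate of L_I(G, Q) for a uniform categorical code.
    q_probs[i] is Q(. | x_i) for x_i = G(z_i, c_i); codes[i] is the sampled c_i."""
    avg_log_q = sum(math.log(q[c]) for q, c in zip(q_probs, codes)) / len(codes)
    h_c = math.log(num_codes)  # entropy of a uniform prior over num_codes values
    return avg_log_q + h_c

codes = [0, 1, 2, 0]
# Q recovers the code almost perfectly: the bound is close to H(c) = log 3.
good_q = [[0.98, 0.01, 0.01], [0.01, 0.98, 0.01],
          [0.01, 0.01, 0.98], [0.98, 0.01, 0.01]]
# Q ignores x: the bound is 0, i.e. no information about c is recovered.
blind_q = [[1 / 3, 1 / 3, 1 / 3]] * 4
print(info_lower_bound(good_q, codes, 3), info_lower_bound(blind_q, codes, 3))
```

Since H(c) is constant for a fixed prior, maximizing the bound amounts to minimizing the cross-entropy between the sampled code and Q's prediction.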
$$
\min_{G,E}\max_D V(D,E,G)
$$