InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets

Paper: https://arxiv.org/abs/1606.03657
Code: https://github.com/openai/InfoGAN
Tips: a NIPS 2016 paper.
(Reading notes)

1. Main idea

  • The objective maximizes the mutual information between a small subset of the latent variables and the observation.
  • It disentangles a particular style factor in each dataset, e.g., writing style (stroke slant/orientation) from digit shape on the MNIST dataset.
  • It also discovers visual concepts such as hair styles, presence/absence of eyeglasses, and emotions on the CelebA face dataset.

2. Intro

  • Unsupervised learning is important and effective; even some supervised downstream tasks often rely on first finding a good latent representation without supervision.
  • A disentangled representation makes it easier to learn the relationship between individual latent variables and data features.
  • The core idea: maximize the mutual information between the observation and (a subset of) the noise/latent variables.

3. Details

  • A plain GAN feeds the generator only unconstrained random noise $z$; some dimensions of $z$ may influence the output strongly, but in an entangled way. InfoGAN therefore splits the input into two parts: incompressible noise $z$ and a latent code $c$ targeting the structured, semantic features of the data, so the generator takes two arguments, $G(z,c)$. In the original GAN one effectively has $P_G(x|c)=P_G(x)$: the code $c$ is ignored and the conditional simply collapses.
  • The mutual information $I(c;G(z,c))$ should be high; it is defined below, where $H(\cdot)$ denotes entropy:
    $$
    \begin{aligned}
    I(X;Y)&=H(X)-H(X\mid Y)\\
    &=H(Y)-H(Y\mid X) \\
    &=\sum_{y} \sum_{x} p(x,y) \log \frac{p(x,y)}{p(x)p(y)} \\
    &\text{where } H(X)=-\sum_{x} p(x)\log p(x)=-\int p(x)\log p(x)\,\mathrm{d}x=-\mathbb{E}_{x \sim X}\left[\log p(x)\right]
    \tag{1}
    \end{aligned}
    $$
    Clearly, $I(X;Y)=0$ when $X$ and $Y$ are independent, meaning the two are unrelated; a large $I(X;Y)$ means they are strongly related (a small numeric check of Eq. (1) is sketched after this list). For any $x$ sampled from the generator, we want the posterior $P_G(c|x)$ to have small entropy, i.e., tracing back from $x$ to $c$ should lose none of the information carried by the latent code:
    $$
    \begin{aligned}
    \min_G \max_D \; \mathbb{E}_{x \sim P_{\mathrm{data}}}[\log D(x)] +\mathbb{E}_{z \sim \mathrm{noise}}[\log (1-D(G(z)))] - \lambda I(c;G(z,c))
    \tag{2}
    \end{aligned}
    $$
    The first two terms are the standard GAN objective; the last term is a regularizer, weighted by $\lambda$, that rewards a strong dependence between the generator's output and the latent code.
  • The posterior $P_G(c|x)$ is hard to obtain directly, so an auxiliary distribution $Q(c|x)$ is constructed to approximate it; with $x$ produced by the generator $G(z,c)$, we derive a lower bound:
    $$
    \begin{aligned}
    I(c;G(z,c)) &= H(c)-H(c \mid G(z,c))\\
    &=H(c) + \mathbb{E}_{x\sim G(z,c)}\left[\mathbb{E}_{c'\sim P(c \mid x)}\left[\log P(c' \mid x)\right]\right] \\
    &=H(c) + \mathbb{E}_{x\sim G(z,c)}\left[ \int p(c' \mid x) \log \frac{p(c' \mid x)}{q(c' \mid x)}\,\mathrm{d}c' + \int p(c' \mid x) \log q(c' \mid x)\,\mathrm{d}c' \right] \\
    &=H(c) +\mathbb{E}_{x\sim G(z,c)} \left[ D_{\mathrm{KL}}\!\left(P(\cdot \mid x) \,\Vert\, Q(\cdot \mid x)\right)+ \mathbb{E}_{c'\sim P(c \mid x)}\left[\log Q(c' \mid x)\right] \right] \\
    &\ge H(c) + \mathbb{E}_{x\sim G(z,c)} \left[\mathbb{E}_{c'\sim P(c \mid x)}\left[\log Q(c' \mid x)\right] \right] & \text{equality iff $Q=P$}
    \tag{3}
    \end{aligned}
    $$
    In words: an auxiliary distribution $Q$ is constructed, and the KL divergence measures the gap between the two distributions; driving it to zero makes $Q$ approach $P$ and tightens the bound into an equality.
  • Lemma 5.1 in the paper: an expectation of a function of random variables can be rewritten by resampling through the posterior, i.e., the extra conditioning changes neither the prior nor the overall expectation. The proof in the original paper is somewhat problematic, so it is re-derived below.
    • First, the law of total expectation: the expectation of a conditional expectation equals the unconditional expectation (averaging the piecewise expectations recovers the overall one):
      $$
      \begin{aligned}
      E[E[X|Y]] &= E \left[ \sum_{x} x \cdot P(X = x \mid Y) \right] \\
      &= \sum_y \left[ \sum_{x} x \cdot P(X = x \mid Y = y) \right] P(Y = y)\\
      &= \sum_x x \sum_y P(X = x \mid Y = y) \cdot P(Y = y) \\
      &= \sum_x x \sum_y P(X = x ,\, Y = y) \\
      &= \sum_x x \cdot P(X = x)\\
      &= E[X]
      \tag{4}
      \end{aligned}
      $$
      Or, in integral form:
      $$
      \begin{aligned}
      E_Y[E_X[X|Y]] &= \int_Y E_X\left[X|Y \right]f(y)\,\mathrm{d}y \\
      &= \int_Y \left[\int_X x \frac{f(x,y)}{f(y)}\,\mathrm{d}x \right] f(y)\,\mathrm{d}y\\
      &= \int_Y \int_X x\, f(x,y)\,\mathrm{d}x\,\mathrm{d}y & \text{$f(y)$ cancels}\\
      &= \int_X x \int_Y f(x,y)\,\mathrm{d}y\,\mathrm{d}x & \text{$x$ does not depend on $y$; swap the order}\\
      &= \int_X x \left[ \int_Y f(x,y)\,\mathrm{d}y \right] \mathrm{d}x\\
      &= \int_X x f(x)\,\mathrm{d}x \\
      &= E_X[X]
      \tag{5}
      \end{aligned}
      $$
    • Lemma 5.1 is then simply an application of the law of total expectation:
      $$
      \begin{aligned}
      \mathbb{E}_{x \sim X,\, y \sim Y|x}[f(x,y)]&=\int_x P(x) \int_y P(y|x)f(x,y)\,\mathrm{d}y\,\mathrm{d}x \\
      &=\int_x \int_y P(x,y)f(x,y)\,\mathrm{d}y\,\mathrm{d}x \\
      &=\int_y P(y) \int_x P(x|y)f(x,y)\,\mathrm{d}x\,\mathrm{d}y \\
      &=\int_y P(y)\left[ \int_{x'} P(x'|y)f(x',y)\,\mathrm{d}x' \right] \mathrm{d}y & \text{rename $x$ to $x'$} \\
      &=\int_x P(x) \left[ \int_y P(y|x)\left[ \int_{x'} P(x'|y)f(x',y)\,\mathrm{d}x' \right] \mathrm{d}y \right] \mathrm{d}x\\
      &=\mathbb{E}_{x \sim X,\, y \sim Y|x,\, x' \sim X|y}[f(x',y)]
      \tag{6}
      \end{aligned}
      $$
      So the bound on the mutual information $I$ can be rewritten as (a concrete computation of this bound for a categorical code is sketched after this list):
      $$
      \begin{aligned}
      I(c;G(z,c)) &\ge H(c) + \mathbb{E}_{x\sim G(z,c)} \left[\mathbb{E}_{c'\sim P(c \mid x)}\left[\log Q(c' \mid x)\right] \right] \\
      &=H(c) + \mathbb{E}_{c \sim P(c),\,x\sim P_G(x|c)} \left[\mathbb{E}_{c'\sim P(c \mid x)}\left[\log Q(c' \mid x)\right] \right] \\
      &=H(c) +\mathbb{E}_{c \sim P(c),\,x\sim G(z,c)} \left[\log Q(c \mid x)\right]\\
      &=L_I(G,Q)
      \tag{7}
      \end{aligned}
      $$
  • The final InfoGAN objective therefore optimizes all three of $G$, $D$, and $Q$:
    $$
    \begin{aligned}
    \min_{G,Q} \max_D \; \mathbb{E}_{x \sim P_{\mathrm{data}}}[\log D(x)] +\mathbb{E}_{z \sim \mathrm{noise}}[\log (1-D(G(z)))] - \lambda L_I(G,Q)
    \tag{8}
    \end{aligned}
    $$
    In the actual implementation, $Q(c|x)$ is just a neural network (in the paper it shares the convolutional layers with $D$): the sample $x$ produced by the generator must both fool the discriminator $D$ and, after passing through $Q$, have maximal mutual information with the code $c$ that was fed into the generator. A minimal training-step sketch is given after this list.
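
A quick numeric check of Eq. (1) for discrete variables (a minimal sketch, not from the paper; the joint tables below are made up): when $X$ and $Y$ are independent the mutual information is 0, and when $Y$ determines $X$ it equals $H(X)$.

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) = sum_{x,y} p(x,y) * log( p(x,y) / (p(x) p(y)) ), Eq. (1)."""
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x), shape (|X|, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p(y), shape (1, |Y|)
    mask = p_xy > 0                         # skip zero-probability cells (0 * log 0 = 0)
    return float((p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])).sum())

# Independent X and Y: the joint factorizes, so I(X;Y) = 0.
p_indep = np.outer([0.3, 0.7], [0.5, 0.5])
# Fully dependent case (Y = X): I(X;Y) = H(X).
p_dep = np.array([[0.3, 0.0],
                  [0.0, 0.7]])

print(mutual_information(p_indep))  # ~0.0
print(mutual_information(p_dep))    # ~0.611 nats = H(X) for p = (0.3, 0.7)
```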
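
For a single categorical code $c$, the bound $L_I(G,Q)$ in Eq. (7) takes a very simple form: $-\mathbb{E}[\log Q(c|x)]$ is the cross-entropy between $Q$'s prediction and the code fed into $G$, and $H(c)$ is a constant for a fixed prior. The PyTorch-style sketch below illustrates this; `generator`, `q_net`, and the dimensions are hypothetical placeholders, not the paper's architecture.

```python
import torch
import torch.nn.functional as F

def info_lower_bound(generator, q_net, batch_size, noise_dim, num_cats):
    """Monte-Carlo estimate of L_I = H(c) + E_{c~P(c), x~G(z,c)}[log Q(c|x)]
    for a uniform categorical code c (Eq. (7))."""
    # Sample the incompressible noise z and the categorical code c.
    z = torch.randn(batch_size, noise_dim)
    c = torch.randint(num_cats, (batch_size,))
    c_onehot = F.one_hot(c, num_cats).float()

    x = generator(torch.cat([z, c_onehot], dim=1))   # x ~ G(z, c)
    logits = q_net(x)                                # unnormalized log Q(c | x)

    # E[log Q(c|x)] is minus the cross-entropy between Q's prediction and the sampled c.
    expected_log_q = -F.cross_entropy(logits, c)
    # H(c) = log(num_cats) for a uniform code (constant w.r.t. G and Q).
    entropy_c = torch.log(torch.tensor(float(num_cats)))
    return entropy_c + expected_log_q                # L_I(G, Q), to be maximized
```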
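
Putting it together, one training step for Eq. (8) can be sketched as below (a minimal, hypothetical PyTorch version, not the official openai/InfoGAN code). The names `trunk`, `d_head`, `q_head`, and the optimizers are assumptions; $Q$ shares the feature trunk with $D$ as in the paper, and the generator uses the common non-saturating loss in place of $\log(1-D(G(z,c)))$.

```python
import torch
import torch.nn.functional as F

def infogan_step(G, trunk, d_head, q_head, opt_d, opt_gq,
                 real_x, noise_dim, num_cats, lam=1.0):
    """One alternating update of D and of (G, Q) for Eq. (8)."""
    batch = real_x.size(0)
    z = torch.randn(batch, noise_dim)
    c = torch.randint(num_cats, (batch,))
    c_onehot = F.one_hot(c, num_cats).float()
    fake_x = G(torch.cat([z, c_onehot], dim=1))      # x ~ G(z, c)

    # --- Discriminator step: maximize log D(x) + log(1 - D(G(z,c))) ---
    d_real = d_head(trunk(real_x))
    d_fake = d_head(trunk(fake_x.detach()))
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) +
              F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # --- Generator + Q step: minimize adversarial loss - lambda * L_I(G, Q) ---
    feat = trunk(fake_x)
    d_out = d_head(feat)
    g_adv = F.binary_cross_entropy_with_logits(d_out, torch.ones_like(d_out))
    info_loss = F.cross_entropy(q_head(feat), c)     # = -E[log Q(c|x)]; H(c) is constant
    gq_loss = g_adv + lam * info_loss
    opt_gq.zero_grad(); gq_loss.backward(); opt_gq.step()
    return d_loss.item(), gq_loss.item()
```

For continuous codes the paper models $Q(c_i|x)$ as a factored Gaussian, in which case the cross-entropy term above would be replaced by a Gaussian negative log-likelihood.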
