[Paper Notes] SPECTRAL NORMALIZATION FOR GENERATIVE ADVERSARIAL NETWORKS

Summary: the paper improves on normalization layers by proposing spectral normalization (SN-GAN) to stabilize the training of the Discriminator.
Advantages:
1. The Lipschitz constant is the only hyperparameter that needs tuning;
2. It is simple to implement, and the extra computational cost is low.


1. Background

The original (2014) GAN objective is

$E_{x\sim q_{data}}[\log D(x)] + E_{x'\sim p_G}[\log(1-D(x'))]$

For an input sample x, which may come from the real data distribution or from the generator's output distribution, the contribution to the objective is
$q_{data}(x)\log D(x) + p_G(x)\log(1-D(x))$

With the generator fixed, the optimal discriminator is found by setting the derivative with respect to D(x) to zero:
$\frac{q_{data}(x)}{D(x)} - \frac{p_G(x)}{1-D(x)} = 0$

Therefore (the paper states the following directly, without the derivation),
$D^*(x) = \frac{q_{data}(x)}{q_{data}(x)+p_G(x)}$
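Filling in the omitted step: multiplying the condition above by $D(x)(1-D(x))$ and solving for $D(x)$ gives

$q_{data}(x)\left(1-D(x)\right) = p_G(x)D(x) \;\Rightarrow\; D(x)\left(q_{data}(x)+p_G(x)\right) = q_{data}(x) \;\Rightarrow\; D^*(x)=\frac{q_{data}(x)}{q_{data}(x)+p_G(x)}$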

Since the last layer of the original GAN's discriminator is a sigmoid, we can further write
$D^*(x) = \frac{q_{data}(x)}{q_{data}(x)+p_G(x)} = \mathrm{sigmoid}(f^*(x))$

where $f^*(x) = \log q_{data}(x) - \log p_G(x)$, whose gradient with respect to x is
$\nabla_x f^*(x) = \frac{1}{q_{data}(x)}\nabla_x q_{data}(x) - \frac{1}{p_G(x)}\nabla_x p_G(x)$

However, this gradient can be unbounded or even incomputable, so in practice some regularization has to be imposed on it.
There is already a series of successful work on this (e.g. Guo-Jun Qi's LS-GAN, WGAN, WGAN-GP), which introduces regularization that bounds the Lipschitz constant of the discriminator, i.e. solves
$\arg\max_{\|f\|_{Lip}\le K} V(G,D)$

where $\|f\|_{Lip}$ is the Lipschitz norm of f: $\|f\|_{Lip}\le M$ means that for any $x, x'$,
$\frac{\|f(x)-f(x')\|_2}{\|x-x'\|_2} \le M$

2. Spectral norm

To constrain the discriminator function f to satisfy $\|f\|_{Lip}\le 1$, we use the composition property
$\|g_1 \circ g_2\|_{Lip} \le \|g_1\|_{Lip} \cdot \|g_2\|_{Lip}$

so it suffices to make the Lipschitz constant of every layer no greater than 1. For a linear layer $g(h) = Wh$,

$\|g\|_{Lip} = \sigma(W)$

where $\sigma(W)$ is the 2-norm of the matrix W, also called its spectral norm. Spectral normalization simply rescales each weight matrix as

$W_{SN} := \frac{W}{\sigma(W)}$

so that $\sigma(W_{SN}) = 1$, and the layer satisfies the 1-Lipschitz condition.
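A minimal sketch (plain PyTorch, not the paper's code) of this rescaling, checking that the normalized matrix indeed has spectral norm 1 (assumes a PyTorch version with torch.linalg):

import torch

W = torch.randn(64, 128)
sigma = torch.linalg.svdvals(W)[0]        # largest singular value of W
W_sn = W / sigma                          # spectrally normalized weight
print(torch.linalg.svdvals(W_sn)[0])      # ~1.0, so h -> W_sn h is 1-Lipschitz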

3. Computing the spectral norm

The key question is how to compute $\sigma(W)$ efficiently. Its value is the largest singular value of W (equivalently, the square root of the largest eigenvalue of $W^TW$). Computing it exactly at every training step (e.g. via a full SVD) would be expensive, so the paper estimates $\sigma(W)$ with the power iteration method.
(Figure: the power-iteration algorithm for estimating σ(W), from the paper.)
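A minimal sketch of that power iteration, assuming a 2-D weight matrix and a persistent u vector reused across training steps (the names here are illustrative, not from the paper's code):

import torch
import torch.nn.functional as F

def estimate_sigma(W, u, n_iters=1):
    # power iteration: v <- normalize(W^T u), u <- normalize(W v)
    for _ in range(n_iters):
        v = F.normalize(W.t() @ u, dim=0)     # approximate top right singular vector
        u = F.normalize(W @ v, dim=0)         # approximate top left singular vector
    sigma = u @ (W @ v)                       # sigma ~ u^T W v, the largest singular value
    return sigma, u

W = torch.randn(64, 128)
u = torch.randn(64)
sigma, u = estimate_sigma(W, u, n_iters=5)
print(sigma.item(), torch.linalg.svdvals(W)[0].item())   # the two values should be close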

4. Spectral norm vs. other regularization techniques

1. Weight normalization and WGAN (weight clipping): WN faces a dilemma: it places a strong restriction on the network weights, and optimization tends to push each weight matrix toward rank 1; but to train a GAN well the discriminator needs weights with larger norm that can use more features.
2. Orthonormal regularization: by forcing all singular values to be 1, it destroys the spectral information of the weight matrices.
3. WGAN-GP: the gradient penalty depends heavily on the support of the generator's output distribution, which makes the regularization effect unstable; in addition, WGAN-GP is computationally more expensive (a sketch of the penalty term is given below).
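For reference, a minimal sketch of the WGAN-GP gradient penalty (not part of this paper's method); it needs an extra forward and backward pass through the discriminator on interpolated samples at every step, which is where the extra cost comes from:

import torch

def gradient_penalty(D, real, fake, lambda_gp=10.0):
    # interpolate between real and generated samples
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    d_hat = D(x_hat)
    # gradient of the discriminator output at the interpolated points
    grads = torch.autograd.grad(outputs=d_hat.sum(), inputs=x_hat, create_graph=True)[0]
    # penalize deviation of the per-sample gradient norm from 1
    return lambda_gp * ((grads.view(grads.size(0), -1).norm(2, dim=1) - 1) ** 2).mean()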

5. Experimental results

(Figures: experimental results from the paper; images omitted here.)

6. PyTorch source code

https://github.com/hellopipu/pytorch-spectral-normalization-gan

torch.mv(a, b): matrix-vector product, where b must be a 1-D vector;
torch.t(): matrix transpose;
getattr(x, 'y') is equivalent to x.y;
setattr(x, 'y', v) is equivalent to x.y = v;
register_parameter(name, parameter): adds a parameter to the module that can later be looked up by name (a small example follows).
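A tiny illustration of how these calls behave (the parameter name "weight_u" is made up for the example):

import torch
from torch import nn

m = nn.Linear(3, 2)
m.register_parameter("weight_u", nn.Parameter(torch.randn(2), requires_grad=False))
print(getattr(m, "weight_u").shape)            # torch.Size([2]), same as m.weight_u
setattr(m, "scale", 0.5)                       # same as m.scale = 0.5
print(torch.mv(m.weight.data, torch.ones(3)))  # (2 x 3) matrix times a 1-D vector of size 3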

import torch
from torch import nn
from torch.nn import Parameter

def l2normalize(v, eps=1e-12):
    return v / (v.norm() + eps)


class SpectralNorm(nn.Module):
    def __init__(self, module, name='weight', power_iterations=1):
        super(SpectralNorm, self).__init__()
        self.module = module
        self.name = name
        self.power_iterations = power_iterations
        if not self._made_params():
            self._make_params()
        self._update_u_v()

    def _update_u_v(self):
        u = getattr(self.module, self.name + "_u")
        v = getattr(self.module, self.name + "_v")
        w = getattr(self.module, self.name + "_bar")

        height = w.data.shape[0]
        for _ in range(self.power_iterations):
            # power iteration: v <- normalize(W^T u), u <- normalize(W v)
            v.data = l2normalize(torch.mv(torch.t(w.view(height,-1).data), u.data))
            u.data = l2normalize(torch.mv(w.view(height,-1).data, v.data))

        # sigma = torch.dot(u.data, torch.mv(w.view(height,-1).data, v.data))
        sigma = u.dot(w.view(height, -1).mv(v))  #sigma = (u^T) W v
        setattr(self.module, self.name, w / sigma.expand_as(w))  #update W to W_SN

    def _made_params(self):
        try:
            u = getattr(self.module, self.name + "_u")
            v = getattr(self.module, self.name + "_v")
            w = getattr(self.module, self.name + "_bar")
            return True
        except AttributeError:
            return False


    def _make_params(self):
        w = getattr(self.module, self.name)  # e.g. conv.weight, shape (out_channels, in_channels, kernel_h, kernel_w)

        height = w.data.shape[0]  # height = out_channels
        width = w.view(height, -1).data.shape[1]  # width = in_channels * kernel_h * kernel_w

        u = Parameter(w.data.new(height).normal_(0, 1), requires_grad=False)  # initial left singular vector estimate,
                                                                              # shape: (out_channels,)
        v = Parameter(w.data.new(width).normal_(0, 1), requires_grad=False)   # initial right singular vector estimate,
                                                                              # shape: (in_channels * kernel_h * kernel_w,)
        u.data = l2normalize(u.data)
        v.data = l2normalize(v.data)
        w_bar = Parameter(w.data)

        del self.module._parameters[self.name]  # remove the original weight; it is re-set as a plain attribute in _update_u_v()
        # add parameter to module
        self.module.register_parameter(self.name + "_u", u)
        self.module.register_parameter(self.name + "_v", v)
        self.module.register_parameter(self.name + "_bar", w_bar)


    def forward(self, *args):
        self._update_u_v()  # recompute the normalized weight, then call the wrapped module's forward()
        return self.module.forward(*args)

if __name__ == '__main__':
    conv2 = SpectralNorm(nn.Conv2d(64, 64, 4, stride=2, padding=(1, 1)))
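Note that recent PyTorch versions ship an equivalent built-in, torch.nn.utils.spectral_norm (used by the ResBlock below); a minimal usage sketch:

import torch
from torch import nn

conv = nn.utils.spectral_norm(nn.Conv2d(64, 64, 4, stride=2, padding=1))
x = torch.randn(8, 64, 32, 32)
y = conv(x)        # each forward pass re-estimates sigma with a power-iteration step
print(y.shape)     # torch.Size([8, 64, 16, 16])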

Note: after applying SN, do not stack BN or any other normalization layer on top, because batch norm's "divide by the variance" and "multiply by a learned scale" operations clearly break the discriminator's Lipschitz continuity.
Taking a ResBlock as an example, SN is introduced as follows:

import math

import torch.nn.functional as F
from torch import nn
from torch.nn import init, utils


class Block(nn.Module):

    def __init__(self, in_ch, out_ch, h_ch=None, ksize=3, pad=1,
                 activation=F.relu, downsample=False):
        super(Block, self).__init__()

        self.activation = activation
        self.downsample = downsample

        self.learnable_sc = (in_ch != out_ch) or downsample
        if h_ch is None:
            h_ch = in_ch
        else:
            h_ch = out_ch

        self.c1 = utils.spectral_norm(nn.Conv2d(in_ch, h_ch, ksize, 1, pad))
        self.c2 = utils.spectral_norm(nn.Conv2d(h_ch, out_ch, ksize, 1, pad))
        if self.learnable_sc:
            self.c_sc = utils.spectral_norm(nn.Conv2d(in_ch, out_ch, 1, 1, 0))

        self._initialize()

    def _initialize(self):
        init.xavier_uniform_(self.c1.weight.data, math.sqrt(2))
        init.xavier_uniform_(self.c2.weight.data, math.sqrt(2))
        if self.learnable_sc:
            init.xavier_uniform_(self.c_sc.weight.data)

    def forward(self, x):
        return self.shortcut(x) + self.residual(x)

    def shortcut(self, x):
        if self.learnable_sc:
            x = self.c_sc(x)
        if self.downsample:
            return F.avg_pool2d(x, 2)
        return x

    def residual(self, x):
        h = self.c1(self.activation(x))
        h = self.c2(self.activation(h))
        if self.downsample:
            h = F.avg_pool2d(h, 2)
        return h
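A quick usage check of the block, assuming the imports above (the channel sizes are arbitrary):

import torch

block = Block(64, 128, downsample=True)
x = torch.randn(4, 64, 32, 32)
print(block(x).shape)   # torch.Size([4, 128, 16, 16])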

