Note: my knowledge is limited. If you find an error, please point it out and I will correct it promptly. Thanks!
This article starts with Pix2Pix, mainly as an introduction to the basic CGAN architecture: most current pose-generation models are built on the CGAN idea (Transformers are now also being tried on pose generation), so this post lays the groundwork for the pose-generation algorithms covered later. This first pose-generation post uses pix2pix as the example to explain the CGAN idea, and closes with some thoughts on pose generation.
A GAN can learn from samples without supervision and then generate entirely new samples. The problem is that, although it can generate new samples, it cannot precisely control what kind of sample comes out. For example, a GAN can generate digits, but which digit is generated is random, because the GAN produces images from input random noise and there is no way to control which specific digit the model generates. If we want to control the result, e.g. feed the generator the label 2 and have it output only images of the digit 2, that is exactly the problem CGAN solves.
Inputs and outputs of GAN:
Inputs and outputs of cGAN:
Basic architecture of cGAN:
Basic cGAN architecture diagram: x is the real image, z is the random noise, and y is the condition.
The optimization objective of the original GAN:

$$\min_{G}\max_{D} V(D,G) = \mathbb{E}_{x \sim p_{data}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_{z}(z)}\big[\log\big(1 - D(G(z))\big)\big]$$
The cGAN objective only needs a small modification: condition both terms on y:

$$\min_{G}\max_{D} V(D,G) = \mathbb{E}_{x \sim p_{data}(x)}\big[\log D(x \mid y)\big] + \mathbb{E}_{z \sim p_{z}(z)}\big[\log\big(1 - D(G(z \mid y))\big)\big]$$
CGAN summary:
Inputs and outputs
Training steps
Training a conditional GAN to predict maps from aerial photos. The discriminator D learns to distinguish real from synthesized images, and the generator G learns to fool the discriminator so it can no longer tell them apart. Unlike an unconditional GAN, both the generator and the discriminator of the cGAN observe the input image.
Generator architecture
Before 2016, the basic GAN generator was essentially the structure on the left of the figure above, an encoder-decoder network. With that structure, all image information has to pass through every layer of the network, which can lose information. For many image-to-image translation problems the input and output share a great deal of low-level information, so we would like to pass that information directly from the encoder layers to the corresponding decoder layers; in image colorization, for example, the input and output share the locations of prominent edges (right of the figure). Concretely, the encoder feature map is concatenated with the corresponding decoder feature map: if the encoder feature map is 32x32x128 and the decoder's original feature map has the same size, then with the structure on the right the decoder feature map becomes 32x32x256.
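A minimal sketch of that skip connection (the tensor names here are illustrative, not taken from the pix2pix code):

```python
import torch

enc_feat = torch.randn(1, 128, 32, 32)   # encoder feature map: 128 channels, 32x32
dec_feat = torch.randn(1, 128, 32, 32)   # decoder feature map at the same resolution

# U-Net skip connection: concatenate along the channel dimension
merged = torch.cat([enc_feat, dec_feat], dim=1)
print(merged.shape)   # torch.Size([1, 256, 32, 32]), i.e. 32x32x256
```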
Discriminator architecture (PatchGAN)
Earlier (pre-2016) GAN discriminators output a single real/fake probability for the whole image. Pix2Pix proposes the PatchGAN idea: the discriminator scores every NxN patch of the image, and the patch scores are averaged to give the overall output. This speeds up computation and convergence, and it can be applied to images of arbitrary resolution.
The authors ran experiments from a 1×1 "PixelGAN" up to a full 256×256 "ImageGAN", and found that a 70x70 patch already matches the quality of the full 256x256 discriminator.
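As a rough illustration of the idea (not code from the repo): the discriminator emits one score per patch and the scores are averaged into a single per-image value; the 30x30 shape matches the 70x70 PatchGAN on 256x256 inputs derived later:

```python
import torch

patch_scores = torch.randn(1, 1, 30, 30)   # one real/fake score per 70x70 patch
image_score = patch_scores.mean()          # average over all patches -> one value per image
```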
Final loss
GAN loss
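Restating the formulas from the pix2pix paper: the GAN-loss term is the conditional GAN objective given earlier, and the final loss adds an L1 reconstruction term between the generated image $G(x,z)$ and the target $y$, weighted by $\lambda$:

$$\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}\big[\lVert y - G(x,z) \rVert_1\big]$$

$$G^{*} = \arg\min_{G}\max_{D}\ \mathcal{L}_{cGAN}(G,D) + \lambda\,\mathcal{L}_{L1}(G)$$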
Why L2 loss produces blurrier results than L1 loss
In practice L2 loss produces smoother images, because minimizing the L2 loss is equivalent to maximizing the log-likelihood under a Gaussian; in other words, we implicitly assume the data follows a single Gaussian. Real data, however, may be multi-modal. For cats, say, there are many breeds that look quite different, and each breed may contribute its own mode. With L2 loss we fit all of these cats with a single-mode Gaussian, which effectively averages the different-looking cats and makes the generated image smooth. L1 loss reduces this effect. In image super-resolution, people initially used L2 loss because it directly optimizes PSNR, but the recovered images were too smooth in the details, so many switched to L1 loss.
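To make the Gaussian/Laplacian argument concrete (a standard likelihood identity, not from the original post): under a Gaussian observation model the negative log-likelihood is an L2 penalty, under a Laplacian model it is an L1 penalty:

$$-\log p_{\text{Gauss}}(y\mid\hat y) = \frac{\lVert y-\hat y\rVert_2^2}{2\sigma^2} + \text{const}, \qquad -\log p_{\text{Laplace}}(y\mid\hat y) = \frac{\lVert y-\hat y\rVert_1}{b} + \text{const}$$

So minimizing L2 is maximum likelihood under a single Gaussian, whose optimum averages the modes of multi-modal data, while L1 corresponds to a heavier-tailed Laplacian whose optimum is a median rather than a mean.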
Evaluation metrics
Generator
```python
import functools

import torch
import torch.nn as nn


class UnetGenerator(nn.Module):
    """Create a Unet-based generator"""

    def __init__(self, input_nc, output_nc, num_downs, ngf=64, norm_layer=nn.BatchNorm2d, use_dropout=False):
        """Construct a Unet generator
        Parameters:
            input_nc (int)  -- the number of channels in input images
            output_nc (int) -- the number of channels in output images
            num_downs (int) -- the number of downsamplings in UNet. For example, if |num_downs| == 7,
                               an image of size 128x128 will become of size 1x1 at the bottleneck
            ngf (int)       -- the number of filters in the last conv layer
            norm_layer      -- normalization layer

        We construct the U-Net from the innermost layer to the outermost layer.
        It is a recursive process.
        """
        super(UnetGenerator, self).__init__()
        # construct unet structure
        unet_block = UnetSkipConnectionBlock(ngf * 8, ngf * 8, input_nc=None, submodule=None, norm_layer=norm_layer, innermost=True)  # add the innermost layer
        for i in range(num_downs - 5):  # add intermediate layers with ngf * 8 filters
            unet_block = UnetSkipConnectionBlock(ngf * 8, ngf * 8, input_nc=None, submodule=unet_block, norm_layer=norm_layer, use_dropout=use_dropout)
        # gradually reduce the number of filters from ngf * 8 to ngf
        unet_block = UnetSkipConnectionBlock(ngf * 4, ngf * 8, input_nc=None, submodule=unet_block, norm_layer=norm_layer)
        unet_block = UnetSkipConnectionBlock(ngf * 2, ngf * 4, input_nc=None, submodule=unet_block, norm_layer=norm_layer)
        unet_block = UnetSkipConnectionBlock(ngf, ngf * 2, input_nc=None, submodule=unet_block, norm_layer=norm_layer)
        self.model = UnetSkipConnectionBlock(output_nc, ngf, input_nc=input_nc, submodule=unet_block, outermost=True, norm_layer=norm_layer)  # add the outermost layer

    def forward(self, input):
        """Standard forward"""
        return self.model(input)
```
```python
class UnetSkipConnectionBlock(nn.Module):
    """Defines the Unet submodule with skip connection.
        X -------------------identity----------------------
        |-- downsampling -- |submodule| -- upsampling --|
    """

    def __init__(self, outer_nc, inner_nc, input_nc=None,
                 submodule=None, outermost=False, innermost=False, norm_layer=nn.BatchNorm2d, use_dropout=False):
        """Construct a Unet submodule with skip connections.
        Parameters:
            outer_nc (int) -- the number of filters in the outer conv layer
            inner_nc (int) -- the number of filters in the inner conv layer
            input_nc (int) -- the number of channels in input images/features
            submodule (UnetSkipConnectionBlock) -- previously defined submodules
            outermost (bool)    -- if this module is the outermost module
            innermost (bool)    -- if this module is the innermost module
            norm_layer          -- normalization layer
            use_dropout (bool)  -- if use dropout layers.
        """
        super(UnetSkipConnectionBlock, self).__init__()
        self.outermost = outermost
        if type(norm_layer) == functools.partial:
            use_bias = norm_layer.func == nn.InstanceNorm2d
        else:
            use_bias = norm_layer == nn.InstanceNorm2d
        if input_nc is None:
            input_nc = outer_nc
        downconv = nn.Conv2d(input_nc, inner_nc, kernel_size=4,
                             stride=2, padding=1, bias=use_bias)
        downrelu = nn.LeakyReLU(0.2, True)
        downnorm = norm_layer(inner_nc)
        uprelu = nn.ReLU(True)
        upnorm = norm_layer(outer_nc)

        if outermost:
            upconv = nn.ConvTranspose2d(inner_nc * 2, outer_nc,
                                        kernel_size=4, stride=2,
                                        padding=1)
            down = [downconv]
            up = [uprelu, upconv, nn.Tanh()]
            model = down + [submodule] + up
        elif innermost:
            upconv = nn.ConvTranspose2d(inner_nc, outer_nc,
                                        kernel_size=4, stride=2,
                                        padding=1, bias=use_bias)
            down = [downrelu, downconv]
            up = [uprelu, upconv, upnorm]
            model = down + up
        else:
            upconv = nn.ConvTranspose2d(inner_nc * 2, outer_nc,
                                        kernel_size=4, stride=2,
                                        padding=1, bias=use_bias)
            down = [downrelu, downconv, downnorm]
            up = [uprelu, upconv, upnorm]
            if use_dropout:
                model = down + [submodule] + up + [nn.Dropout(0.5)]
            else:
                model = down + [submodule] + up

        self.model = nn.Sequential(*model)

    def forward(self, x):
        if self.outermost:
            return self.model(x)
        else:   # add skip connections
            return torch.cat([x, self.model(x)], 1)
```
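A quick shape check of the generator above (a usage sketch; num_downs=8 matches 256x256 inputs, since eight downsamplings bring 256 down to 1x1 at the bottleneck):

```python
import torch

net_G = UnetGenerator(input_nc=3, output_nc=3, num_downs=8, ngf=64, use_dropout=True)
x = torch.randn(1, 3, 256, 256)   # conditional input image
print(net_G(x).shape)             # torch.Size([1, 3, 256, 256])
```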
Dropout and diversity
The original cGAN takes two inputs, the condition x and the noise z. The generator here only uses the condition, so on its own it cannot produce diverse outputs. Pix2pix therefore applies dropout both during training and at test time, which introduces stochasticity and yields some diversity in the results.
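A minimal sketch (not from the original repo) of what "dropout at test time" amounts to in PyTorch: keep the Dropout modules in training mode so that repeated forward passes of the same input differ slightly:

```python
import torch.nn as nn

def enable_test_time_dropout(net: nn.Module) -> None:
    net.eval()                        # freeze BatchNorm statistics etc.
    for m in net.modules():
        if isinstance(m, nn.Dropout):
            m.train()                 # Dropout keeps sampling random masks at inference
```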
Discriminator
```python
class NLayerDiscriminator(nn.Module):
    """Defines a PatchGAN discriminator"""

    def __init__(self, input_nc, ndf=64, n_layers=3, norm_layer=nn.BatchNorm2d):
        """Construct a PatchGAN discriminator
        Parameters:
            input_nc (int)  -- the number of channels in input images
            ndf (int)       -- the number of filters in the last conv layer
            n_layers (int)  -- the number of conv layers in the discriminator
            norm_layer      -- normalization layer
        """
        super(NLayerDiscriminator, self).__init__()
        if type(norm_layer) == functools.partial:  # no need to use bias as BatchNorm2d has affine parameters
            use_bias = norm_layer.func == nn.InstanceNorm2d
        else:
            use_bias = norm_layer == nn.InstanceNorm2d

        kw = 4
        padw = 1
        # For a 256x256x3 input: input_nc=3, ndf=64, kernel_size=4, stride=2, padding=1,
        # so the first feature map is 128x128 ((input_h + 2*padding - kernel_h)/stride + 1)
        sequence = [nn.Conv2d(input_nc, ndf, kernel_size=kw, stride=2, padding=padw), nn.LeakyReLU(0.2, True)]
        nf_mult = 1
        nf_mult_prev = 1
        for n in range(1, n_layers):  # gradually increase the number of filters
            nf_mult_prev = nf_mult
            nf_mult = min(2 ** n, 8)
            # feature map: 64x64 -> 32x32 over the loop iterations
            sequence += [
                nn.Conv2d(ndf * nf_mult_prev, ndf * nf_mult, kernel_size=kw, stride=2, padding=padw, bias=use_bias),
                norm_layer(ndf * nf_mult),
                nn.LeakyReLU(0.2, True)
            ]

        nf_mult_prev = nf_mult
        nf_mult = min(2 ** n_layers, 8)
        # feature map: 32x32 -> 31x31 (stride-1 conv)
        sequence += [
            nn.Conv2d(ndf * nf_mult_prev, ndf * nf_mult, kernel_size=kw, stride=1, padding=padw, bias=use_bias),
            norm_layer(ndf * nf_mult),
            nn.LeakyReLU(0.2, True)
        ]

        sequence += [nn.Conv2d(ndf * nf_mult, 1, kernel_size=kw, stride=1, padding=padw)]  # output 1 channel prediction map
        self.model = nn.Sequential(*sequence)

    def forward(self, input):
        """Standard forward."""
        return self.model(input)
```
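And a shape check for the discriminator (usage sketch): with the default n_layers=3 and a 6-channel input (the conditional image concatenated with the real or generated image), a 256x256 pair yields the 30x30 patch prediction map analyzed below:

```python
import torch

net_D = NLayerDiscriminator(input_nc=6, ndf=64, n_layers=3)
pair = torch.randn(1, 6, 256, 256)   # real_A concatenated with real_B or fake_B
print(net_D(pair).shape)             # torch.Size([1, 1, 30, 30])
```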
Feature map size formula:
output_h = (input_h + 2 * padding - kernel_h) / stride + 1  (integer division, rounded down)
The receptive field is defined as the region of the input image that a pixel on a layer's output feature map maps back to; more plainly, a point on a feature map corresponds to a region of the input image.
Computing the receptive field size
Receptive field of layer n = receptive field of layer n-1 + (kernel size of layer n - 1) × product of the strides of all layers before layer n (a short script verifying this for PatchGAN follows the table below).
PatchGAN receptive-field calculation:

| Layer | Receptive field | Feature map | Channels |
| --- | --- | --- | --- |
| Input | 1 | 256x256 | 3 |
| Layer 1 | 1 + (4-1)×1 = 4 | (256+2-4)/2+1 = 128 | 64 |
| Layer 2 | 4 + (4-1)×2×1 = 10 | (128+2-4)/2+1 = 64 | 128 |
| Layer 3 | 10 + (4-1)×2×2×1 = 22 | (64+2-4)/2+1 = 32 | 256 |
| Layer 4 | 22 + (4-1)×2×2×2×1 = 46 | (32+2-4)/1+1 = 31 | 512 |
| Layer 5 | 46 + (4-1)×1×2×2×2×1 = 70 | (31+2-4)/1+1 = 30 | 1 |
The final prediction map is 30x30 with a single channel; each point on it corresponds to a 70x70 region of the original image. This map is fed into the GAN loss, and the 30x30 values are ultimately averaged to give the final GAN loss.
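A small script (illustrative, not from the pix2pix repo) that applies the receptive-field formula above to the five 4x4 convolutions of the 70x70 PatchGAN:

```python
# (kernel_size, stride) of the five PatchGAN convolutions
layers = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]

rf = 1            # receptive field of a point on the input
stride_prod = 1   # product of the strides of all preceding layers
for i, (k, s) in enumerate(layers, start=1):
    rf += (k - 1) * stride_prod
    stride_prod *= s
    print(f"layer {i}: receptive field = {rf}")
# layer 5: receptive field = 70
```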
Loss
```python
def backward_D(self):
    """Calculate GAN loss for the discriminator"""
    # Fake; stop backprop to the generator by detaching fake_B
    fake_AB = torch.cat((self.real_A, self.fake_B), 1)  # we use conditional GANs; we need to feed both input and output to the discriminator
    pred_fake = self.netD(fake_AB.detach())
    self.loss_D_fake = self.criterionGAN(pred_fake, False)
    # Real
    real_AB = torch.cat((self.real_A, self.real_B), 1)
    pred_real = self.netD(real_AB)
    self.loss_D_real = self.criterionGAN(pred_real, True)
    # combine loss and calculate gradients
    self.loss_D = (self.loss_D_fake + self.loss_D_real) * 0.5
    self.loss_D.backward()

def backward_G(self):
    """Calculate GAN and L1 loss for the generator"""
    # First, G(A) should fake the discriminator
    fake_AB = torch.cat((self.real_A, self.fake_B), 1)
    pred_fake = self.netD(fake_AB)
    self.loss_G_GAN = self.criterionGAN(pred_fake, True)
    # Second, G(A) = B
    self.loss_G_L1 = self.criterionL1(self.fake_B, self.real_B) * self.opt.lambda_L1
    # combine loss and calculate gradients
    self.loss_G = self.loss_G_GAN + self.loss_G_L1
    self.loss_G.backward()
```
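For context, a hedged sketch of how these two backward passes are typically driven in one training iteration; the optimizer_D / optimizer_G names and the set_requires_grad helper are assumptions here, not quoted verbatim from the repo:

```python
def optimize_parameters(self):
    self.forward()                             # compute self.fake_B = G(self.real_A)
    # update D
    self.set_requires_grad(self.netD, True)    # enable gradients for D
    self.optimizer_D.zero_grad()
    self.backward_D()
    self.optimizer_D.step()
    # update G
    self.set_requires_grad(self.netD, False)   # D needs no gradients while optimizing G
    self.optimizer_G.zero_grad()
    self.backward_G()
    self.optimizer_G.step()
```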
The GANLoss class
```python
class GANLoss(nn.Module):
    """Define different GAN objectives.

    The GANLoss class abstracts away the need to create the target label tensor
    that has the same size as the input.
    """

    def __init__(self, gan_mode, target_real_label=1.0, target_fake_label=0.0):
        """ Initialize the GANLoss class.
        Parameters:
            gan_mode (str) - - the type of GAN objective. It currently supports vanilla, lsgan, and wgangp.
            target_real_label (bool) - - label for a real image
            target_fake_label (bool) - - label of a fake image
        Note: Do not use sigmoid as the last layer of Discriminator.
        LSGAN needs no sigmoid. vanilla GANs will handle it with BCEWithLogitsLoss.
        """
        super(GANLoss, self).__init__()
        self.register_buffer('real_label', torch.tensor(target_real_label))
        self.register_buffer('fake_label', torch.tensor(target_fake_label))
        self.gan_mode = gan_mode
        if gan_mode == 'lsgan':
            self.loss = nn.MSELoss()
        elif gan_mode == 'vanilla':
            self.loss = nn.BCEWithLogitsLoss()
        elif gan_mode in ['wgangp']:
            self.loss = None
        else:
            raise NotImplementedError('gan mode %s not implemented' % gan_mode)

    def get_target_tensor(self, prediction, target_is_real):
        """Create label tensors with the same size as the input.
        Parameters:
            prediction (tensor) - - typically the prediction from a discriminator
            target_is_real (bool) - - if the ground truth label is for real images or fake images
        Returns:
            A label tensor filled with ground truth label, and with the size of the input
        """
        if target_is_real:
            target_tensor = self.real_label
        else:
            target_tensor = self.fake_label
        return target_tensor.expand_as(prediction)

    def __call__(self, prediction, target_is_real):
        """Calculate loss given Discriminator's output and ground truth labels.
        Parameters:
            prediction (tensor) - - typically the prediction output from a discriminator
            target_is_real (bool) - - if the ground truth label is for real images or fake images
        Returns:
            the calculated loss.
        """
        if self.gan_mode in ['lsgan', 'vanilla']:
            target_tensor = self.get_target_tensor(prediction, target_is_real)
            loss = self.loss(prediction, target_tensor)
        elif self.gan_mode == 'wgangp':
            if target_is_real:
                loss = -prediction.mean()
            else:
                loss = prediction.mean()
        return loss
```
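A usage sketch of GANLoss applied to the 30x30 patch map (variable names are illustrative):

```python
import torch

criterion_GAN = GANLoss('vanilla')        # BCEWithLogitsLoss on raw discriminator logits
pred_fake = torch.randn(1, 1, 30, 30)     # patch prediction map from the discriminator
loss_D_fake = criterion_GAN(pred_fake, target_is_real=False)
```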
Instead of judging whether the whole image is real, PatchGAN judges each of the N x N patches separately and averages the results. The L1 loss pushes the model to get the low-frequency content right, while the PatchGAN structure pushes it to get the high-frequency content right (because it only looks at local information). It also remains effective when N is much smaller than the full image.
This is similar in spirit to a Markov random field: pixels farther apart than one patch radius are treated as independent, which matches the nature of texture and style. PatchGAN can therefore be understood as a kind of texture/style loss.
Taking MNIST as an example, the inputs and outputs of the generator G and discriminator D are:
In the CGAN experiments on MNIST, for the generator: the digit class y is used as the condition and one-hot encoded, and the noise z is drawn from a uniform distribution. z is mapped to a 200-dimensional hidden layer and the class label to a 1000-dimensional hidden layer; the two are concatenated as the input to the next layer, with ReLU activations. The final layer uses a sigmoid, and the generated sample is 784-dimensional (MNIST images are 28x28 = 784). The paper's results show generated MNIST digits, one row per conditioning label.
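A minimal sketch of the generator just described (layer sizes follow the text; the 100-dimensional uniform noise and the 1200-unit combined hidden layer follow the original CGAN paper, the rest is an assumption):

```python
import torch
import torch.nn as nn

class CGANGeneratorMNIST(nn.Module):
    """Sketch of the MNIST CGAN generator described above."""
    def __init__(self, z_dim=100, n_classes=10):
        super().__init__()
        self.fc_z = nn.Linear(z_dim, 200)            # noise z -> 200-d hidden layer
        self.fc_y = nn.Linear(n_classes, 1000)       # one-hot label y -> 1000-d hidden layer
        self.fc_joint = nn.Linear(200 + 1000, 1200)  # combined hidden layer (1200 in the CGAN paper)
        self.fc_out = nn.Linear(1200, 784)           # 28x28 = 784 output pixels
        self.relu = nn.ReLU(True)

    def forward(self, z, y_onehot):
        h = torch.cat([self.relu(self.fc_z(z)), self.relu(self.fc_y(y_onehot))], dim=1)
        h = self.relu(self.fc_joint(h))
        return torch.sigmoid(self.fc_out(h))

# usage: z ~ uniform noise, y is a one-hot digit label
g = CGANGeneratorMNIST()
z = torch.rand(16, 100)                          # uniform noise
y = torch.eye(10)[torch.randint(0, 10, (16,))]   # one-hot labels
fake = g(z, y)                                   # (16, 784) samples in [0, 1]
```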
Pose generation covers head pose and body pose; pose-generation schemes can be divided into one-stage and two-stage approaches.
One-stage generation
Two-stage generation
Pose generation (2D) can roughly be divided into four stages.
Currently, head-pose tasks mostly sit at the second stage, while body-pose tasks sit at the first stage; many papers have now reported results for the third, fourth, and fifth stages. The bad news is that these algorithms are largely not yet ready for commercial use; the good news is that more and more researchers are working on pose generation, so progress on pose tasks will only get better.