Take one image A as the content and another image B as the style, and output an image that has A's content rendered in B's style.
Given vectors $a_1, a_2, a_3$, their Gram matrix is:
$$\begin{bmatrix} (a_1,a_1) & (a_1,a_2) & (a_1,a_3) \\ (a_2,a_1) & (a_2,a_2) & (a_2,a_3) \\ (a_3,a_1) & (a_3,a_2) & (a_3,a_3) \end{bmatrix}$$
where $(\cdot,\cdot)$ denotes the inner product of two vectors.
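As a quick illustration (not code from the paper), here is how such a Gram matrix can be computed in PyTorch by stacking the vectors as rows of a matrix; the vectors and their length are made up:
import torch

# three made-up vectors a_1, a_2, a_3 of length 8, stacked as rows of a matrix
a = torch.randn(3, 8)
# gram[i][j] is the inner product (a_i, a_j); the result is a 3 x 3 matrix
gram = torch.mm(a, a.t())
print(gram.shape)  # torch.Size([3, 3])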
There are of course many approaches to style transfer; this post covers the method from "A Neural Algorithm of Artistic Style".
Both image A and image B are passed through the same VGG-19 network for feature extraction, and everything operates on the extracted feature vectors. The paper does not propose a network that generates the picture; it takes a different route. It defines two losses: the content loss measures the distance (loosely speaking) between the content of two images, and the style loss measures the distance between their styles. The pixels of the content image are used as the training parameters, and by repeatedly reducing the style loss against B and the content loss against A, the result step by step comes to carry A's content in B's style.
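To make the "image pixels are the training parameters" idea concrete, here is a minimal sketch with dummy tensors and a stand-in loss; none of these names come from the paper, and the real losses appear right below:
import torch

# dummy content and style images; the result image starts as a copy of the content
content_img = torch.rand(1, 3, 64, 64)
style_img = torch.rand(1, 3, 64, 64)
result = content_img.clone().requires_grad_(True)   # the pixels become the parameters
optimizer = torch.optim.Adam([result], lr=0.03)
for step in range(5):
    # stand-in for content_loss + weight * style_loss
    loss = torch.mean((result - content_img) ** 2) + torch.mean((result - style_img) ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()   # gradient descent on the pixels of result, not on any network weights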
content loss: $J_{content}(v_1,v_2)=\frac{1}{2}\lVert v_1-v_2\rVert$
The content loss simply takes the L2 norm of the difference between the two feature vectors. My formula looks a bit simpler, but the one in the original paper is more precise. I'm in the habit of using J to denote a loss.
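For concreteness, here is a small sketch of the content loss on two made-up feature maps, using the paper's squared form; the shapes are arbitrary, and the implementation further below uses torch.mean instead of the sum, which only rescales the loss:
import torch

# two made-up feature maps of identical shape (batch, channels, height, width)
v1 = torch.randn(1, 64, 32, 32)
v2 = torch.randn(1, 64, 32, 32)
# squared L2 distance between the two feature maps, with the 1/2 factor
content_loss = 0.5 * torch.sum((v1 - v2) ** 2)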
style loss: $J_{style}(pic_1,pic_2)=\frac{1}{4N^2M^2}\sum_{i,j}\bigl(G_{pic_1}-G_{pic_2}\bigr)_{ij}^2$, where $G$ denotes the Gram matrix of a layer's features, $N$ the number of feature maps, and $M$ the size (height × width) of each map.
The paper uses the Gram matrix of the features extracted by VGG as the style of an image. The style loss is again an L2 distance. The formula above is stated somewhat loosely, but it captures the intuition: first compute the Gram matrix of each image's extracted features as that image's style, then take the L2 loss between the two Gram matrices as the distance between the two styles.
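Putting the two steps together, here is a sketch of the style loss for a single layer, again on made-up feature maps, with N corresponding to the channel count c and M to h*w:
import torch

# made-up feature maps of image A and image B at the same layer
f_a = torch.randn(1, 64, 32, 32)
f_b = torch.randn(1, 64, 32, 32)
_, c, h, w = f_a.size()
# flatten each channel into a row; the Gram matrix is (features @ features^T)
g_a = torch.mm(f_a.view(c, h * w), f_a.view(c, h * w).t())
g_b = torch.mm(f_b.view(c, h * w), f_b.view(c, h * w).t())
# 1/(4 N^2 M^2) times the sum of squared differences, with N = c and M = h*w
style_loss = torch.sum((g_a - g_b) ** 2) / (4 * c ** 2 * (h * w) ** 2)
The implementation below divides by c*h*w instead and folds the remaining scale into the weight of 100; only the relative size of the two losses matters.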
Since this is an old paper from 2016, there are countless implementations of it. I'm just posting my own; for more mature, production-grade implementations, please refer to other blogs.
from __future__ import division
from torchvision import models
from torchvision import transforms
from PIL import Image
import torch
import torchvision
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# max_size: maximum allowed size of the input picture; larger pictures are scaled down
# transformation: the torchvision transform supplied by the caller
# shape: the exact (width, height) the picture should be resized to
def LoadImage(img_path, transformation=None, max_size=None, shape=None):
    img = Image.open(img_path)
    if max_size:
        scale = max_size / max(img.size)
        size = np.array(img.size) * scale
        # Image.ANTIALIAS is an alias of Image.LANCZOS (and was removed in newer Pillow)
        img = img.resize(size.astype(int), Image.LANCZOS)
    if shape:
        img = img.resize(shape, Image.LANCZOS)
    if transformation:
        img = transformation(img).unsqueeze(0)
    return img.to(device)
def ShowImg(tensor,title=None):
    img = tensor.cpu().clone()
    img = img.squeeze(0)
    PILloader = transforms.ToPILImage()
    img = PILloader(img)
    img.save('result.jpg', quality=95)
    plt.imshow(img)
    if title is not None:
        plt.title(title)
    plt.pause(4)
# vgg is used to extract feature vectors from the pictures
class VGGNet(nn.Module):
    def __init__(self):
        super(VGGNet, self).__init__()
        # indices of conv1_1, conv2_1, conv3_1, conv4_1 and conv5_1 in vgg19.features
        self.select = ['0', '5', '10', '19', '28']
        self.vgg = models.vgg19(pretrained=True).features

    def forward(self, x):
        # pass x through vgg layer by layer and keep the outputs of the selected layers
        features = []
        for name, layer in self.vgg._modules.items():
            x = layer(x)
            if name in self.select:
                features.append(x)
        return features
if __name__ == '__main__':
    # the means and stds come from ImageNet; define the preprocessing transforms
    transformation = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225])
    ])
    content = LoadImage("./data-source/style/content.jpeg", transformation, max_size=400)
    style = LoadImage("./data-source/style/style.jpg", transformation, shape=[content.size(3), content.size(2)])
    # ShowImg(style[0])
    # ShowImg(content[0])
    # vgg = models.vgg19(pretrained=True)
    # vgg.features
    # .eval() puts the network in evaluation mode; its parameters are never updated here
    vgg = VGGNet().to(device).eval()
    # initialize the result picture from the content picture and make its pixels trainable
    res_tbd = content.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([res_tbd], lr=0.03, betas=[0.5, 0.999])
    NUM_STEP = 50
    # the content and style features never change, so compute them once, outside the
    # autograd graph; otherwise the second backward() would hit an already-freed graph
    with torch.no_grad():
        content_features = vgg(content)
        style_features = vgg(style)
    for step in range(NUM_STEP):
        target_features = vgg(res_tbd)
        content_loss = style_loss = 0
        # for each selected layer, accumulate the content loss and the style loss
        for f1, f2, f3 in zip(target_features, content_features, style_features):
            content_loss += torch.mean((f1 - f2) ** 2)
            _, c, h, w = f1.size()
            f1 = f1.view(c, h * w)
            f3 = f3.view(c, h * w)
            g1 = torch.mm(f1, f1.t())
            g3 = torch.mm(f3, f3.t())
            style_loss += torch.mean((g1 - g3) ** 2) / (c * h * w)
        # 100 is a hyperparameter chosen from experience
        loss = content_loss + style_loss * 100
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step % 10 == 0:
            print("Step {}, content loss: {:.4f}, style loss: {:.4f}".format(
                step, content_loss.item(), style_loss.item()))
    # after training, undo the ImageNet normalization so the picture looks right to human eyes
    # (mean = -mean/std and std = 1/std of the normalization used above)
    denorm = transforms.Normalize(mean=[-2.12, -2.04, -1.80], std=[4.37, 4.46, 4.44])
    img = res_tbd.clone().squeeze()
    img = denorm(img).clamp_(0, 1)
    # show the result image
    ShowImg(img)
Let's look at how the two losses are balanced. We initialize the result picture from the content picture, so at the start the content loss is very small. The result picture has to keep the content of the content picture while taking on the style of the style picture. Writing C for the content image, R for the result image, and S for the style image: $J_{content}(C,R)$ must not be allowed to grow too large, otherwise the result no longer shows the intended content; and $J_{style}(S,R)$ must not stay too large either, otherwise the result never picks up the style. The two losses have to be weighed against each other, so we need a coefficient that balances their importance and steers backprop in the direction we want (that is the 100 in the code).
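In the paper's notation the total objective is written as a weighted sum, which makes this trade-off explicit:
$$J_{total}(C, S, R) = \alpha \, J_{content}(C, R) + \beta \, J_{style}(S, R)$$
The code above effectively uses $\alpha = 1$ and $\beta = 100$.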
It is slow. My CPU isn't powerful, and I didn't use a GPU.