TensorFlow in Practice: Neural Style

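Neural Style is a fascinating deep learning application: given one image that supplies the content and another that supplies the style, a deep neural network produces a new image that combines the two.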

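TensorFlow is one of the most popular deep learning frameworks and was open-sourced by Google. The author anishathalye implemented Neural Style in TensorFlow and published the code on GitHub. This article walks through that code in detail. The code is available in the GitHub project listed in the references.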

Pretrained VGG-19 Model

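VGG took first place in localization and second place in classification at ILSVRC 2014. VGG-19 is one of the models in that family. Pretrained weights are available from the official site, and the network is widely used as a feature extractor for raw images.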

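VGG-19 is a very deep neural network with 19 layers in total. Its basic structure is as follows:

The first part alternates convolutional blocks with max-pooling layers, where each block contains several convolutional layers, and the network ends with three fully connected layers. Specifically, the first and second blocks contain 2 convolutional layers each, and the third, fourth, and fifth blocks contain 4 each, for a total of 2 + 2 + 4 + 4 + 4 = 16 convolutional layers. Adding the 3 fully connected layers gives 19 layers, hence the name VGG-19.

[Table: VGG-19 network architecture]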

Neural Style relies only on the convolutional part of VGG-19. The layers it uses are listed below:

VGG19_LAYERS = (
    'conv1_1', 'relu1_1', 'conv1_2', 'relu1_2', 'pool1',

    'conv2_1', 'relu2_1', 'conv2_2', 'relu2_2', 'pool2',

    'conv3_1', 'relu3_1', 'conv3_2', 'relu3_2', 'conv3_3',
    'relu3_3', 'conv3_4', 'relu3_4', 'pool3',

    'conv4_1', 'relu4_1', 'conv4_2', 'relu4_2', 'conv4_3',
    'relu4_3', 'conv4_4', 'relu4_4', 'pool4',

    'conv5_1', 'relu5_1', 'conv5_2', 'relu5_2', 'conv5_3',
    'relu5_3', 'conv5_4', 'relu5_4'
)
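As a quick sanity check on the layer list, the sketch below counts the entries by operation type. It uses the same first-four-characters convention that vgg.py relies on; the name `kind_counts` is introduced here only for illustration.

```python
from collections import Counter

VGG19_LAYERS = (
    'conv1_1', 'relu1_1', 'conv1_2', 'relu1_2', 'pool1',
    'conv2_1', 'relu2_1', 'conv2_2', 'relu2_2', 'pool2',
    'conv3_1', 'relu3_1', 'conv3_2', 'relu3_2', 'conv3_3',
    'relu3_3', 'conv3_4', 'relu3_4', 'pool3',
    'conv4_1', 'relu4_1', 'conv4_2', 'relu4_2', 'conv4_3',
    'relu4_3', 'conv4_4', 'relu4_4', 'pool4',
    'conv5_1', 'relu5_1', 'conv5_2', 'relu5_2', 'conv5_3',
    'relu5_3', 'conv5_4', 'relu5_4',
)

# The first four characters of each name identify the operation type.
kind_counts = Counter(name[:4] for name in VGG19_LAYERS)

print(dict(kind_counts))   # {'conv': 16, 'relu': 16, 'pool': 4}
print(len(VGG19_LAYERS))   # 36
```

The 16 convolutions match the count given above; the three fully connected layers and the final pooling layer are simply left out of the tuple.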

The pretrained VGG-19 weights can be downloaded from the MatConvNet download page. The file is in Matlab format and can be read in Python with scipy.io.

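The file contains a lot of information; what we need are the kernels and bias of each layer. The kernels are at data['layers'][0][i][0][0][0][0][0], with shape [width, height, in_channels, out_channels], and the bias is at data['layers'][0][i][0][0][0][0][1], with shape [1, out_channels]. Every VGG-19 convolution uses 3x3 filters, so width and height are both 3. Note that the index i counts layers at the finest granularity, including conv, relu, pool, and fc operations. So i=0 is a convolution, i=1 a relu, i=2 a convolution, i=3 a relu, i=4 a pool, i=5 a convolution, and so on, with i=37 being a fully connected layer. VGG-19 uses 2x2 max-pooling; Neural Style replaces it with average-pooling because the author found that it gives slightly better results.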
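The following NumPy sketch illustrates three points from the paragraph above on dummy data: how layer names map to the index i, how a kernel is reordered from MatConvNet layout to TensorFlow layout, and how 2x2 average-pooling differs from max-pooling. `LAYERS_PREFIX`, `pool2x2`, and the dummy arrays are hypothetical names used only here; the 3x5 kernel shape is deliberately non-square so the axis swap is visible (real VGG-19 kernels are 3x3).

```python
import numpy as np

# First few entries of the layer tuple; positions match the index i in the file.
LAYERS_PREFIX = ('conv1_1', 'relu1_1', 'conv1_2', 'relu1_2', 'pool1', 'conv2_1')
layer_index = {name: i for i, name in enumerate(LAYERS_PREFIX)}

# Dummy kernel in MatConvNet order [width, height, in_channels, out_channels].
kernels_mat = np.zeros((3, 5, 64, 128), dtype=np.float32)
# TensorFlow expects [height, width, in_channels, out_channels].
kernels_tf = np.transpose(kernels_mat, (1, 0, 2, 3))

# Dummy bias of shape [1, out_channels], flattened to [out_channels].
bias = np.zeros((1, 128), dtype=np.float32).reshape(-1)

def pool2x2(x, mode):
    """2x2 pooling with stride 2 on an [H, W] array with even H and W."""
    h, w = x.shape
    blocks = x.reshape(h // 2, 2, w // 2, 2)
    if mode == 'avg':
        return blocks.mean(axis=(1, 3))
    return blocks.max(axis=(1, 3))

x = np.arange(16, dtype=np.float32).reshape(4, 4)
avg_pooled = pool2x2(x, 'avg')   # [[2.5, 4.5], [10.5, 12.5]]
max_pooled = pool2x2(x, 'max')   # [[5., 7.], [13., 15.]]
```

Average-pooling keeps a contribution from every activation in the window, whereas max-pooling keeps only the largest one.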

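VGG-19 requires one preprocessing step on the input image: the RGB mean computed over the training set is subtracted from every pixel. The mean can be obtained with np.mean(data['normalization'][0][0][0], axis=(0, 1)), which evaluates to [123.68, 116.779, 103.939].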
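A minimal sketch of this preprocessing step, using the RGB mean quoted above and a hypothetical 2x2 mid-gray image. NumPy broadcasting subtracts the three channel means from every pixel, and adding them back recovers the original image.

```python
import numpy as np

# RGB mean quoted in the text above.
mean_pixel = np.array([123.68, 116.779, 103.939])

def preprocess(image, mean_pixel):
    """Subtract the per-channel mean from an [H, W, 3] image."""
    return image - mean_pixel

def unprocess(image, mean_pixel):
    """Inverse of preprocess: add the per-channel mean back."""
    return image + mean_pixel

# Hypothetical 2x2 RGB image filled with mid-gray (illustration only).
image = np.full((2, 2, 3), 128.0)
centered = preprocess(image, mean_pixel)
restored = unprocess(centered, mean_pixel)
```

The same pair of functions is what turns the optimized tensor back into a viewable image at the end of the run.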

Putting this together, the following code, vgg.py, loads the VGG-19 network for use in building the Neural Style model.

import tensorflow as tf
import numpy as np
import scipy.io

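def load_net(data_path):
    # Read the Matlab-format weights file
    data = scipy.io.loadmat(data_path)
    # Mean image stored in the file; averaging over its first two axes gives the RGB mean
    mean = data['normalization'][0][0][0]
    mean_pixel = np.mean(mean, axis=(0,1))
    # One entry per fine-grained layer (conv, relu, pool, fc)
    weights = data['layers'][0]
    return weights, mean_pixel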

def net_preloaded(weights, input_image, pooling):
    net = {}
    current = input_image
    for i, name in enumerate(VGG19_LAYERS):
        kind = name[:4]
        if kind == 'conv':
            kernels, bias = weights[i][0][0][0][0]
            # matconvnet: weights are [width, height, in_channels, out_channels]
            # tensorflow: weights are [height, width, in_channels, out_channels]
            kernels = np.transpose(kernels, (1, 0, 2, 3))
            bias = bias.reshape(-1)
            current = _conv_layer(current, kernels, bias)
        elif kind == 'relu':
            current = tf.nn.relu(current)
        elif kind == 'pool':
            current = _pool_layer(current, pooling)
        net[name] = current
    return net
    
def _conv_layer(input, weights, bias):
    conv = tf.nn.conv2d(input, tf.constant(weights), strides=(1, 1, 1, 1),
            padding='SAME')
    return tf.nn.bias_add(conv, bias)

def _pool_layer(input, pooling):
    if pooling == 'avg':
        return tf.nn.avg_pool(input, ksize=(1, 2, 2, 1), strides=(1, 2, 2, 1),
            padding='SAME')
    else:
        return tf.nn.max_pool(input, ksize=(1, 2, 2, 1), strides=(1, 2, 2, 1),
            padding='SAME')

def preprocess(image, mean_pixel):
    return image - mean_pixel

def unprocess(image, mean_pixel):
    return image + mean_pixel

Neural Style

The core idea of Neural Style is illustrated in the following figure:

[Figure: overview of the Neural Style method]

Part 1: Content Reconstruction

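The basic idea is as follows: pass both the content image p and a randomly generated image x through the VGG-19 convolutional network, take the feature maps produced at selected layers, and minimize the difference between them. The loss between the two at layer l is defined as:

$$L_{content}(p, x, l) = \frac{1}{2} \sum_{i,j} \left(F_{ij}^l - P_{ij}^l\right)^2$$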

Here F_{ij}^l is the activation of the i-th filter at position j in layer l for the random image, and P_{ij}^l is the corresponding activation for the content image.
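To make the definition concrete, here is a small NumPy sketch. It assumes the layer loss takes the squared-error form used by Gatys et al., that is, half the sum of squared differences between F and P. The function name `content_loss_paper` and the toy shapes are assumptions made for illustration.

```python
import numpy as np

def content_loss_paper(F, P):
    """Half the sum of squared differences between two feature maps.

    F, P: arrays of shape [num_filters, num_positions].
    """
    return 0.5 * np.sum((F - P) ** 2)

rng = np.random.default_rng(0)
P = rng.standard_normal((4, 6))   # hypothetical: 4 filters, 6 spatial positions
F = P + 0.1                       # every activation is off by 0.1

loss_same = content_loss_paper(P, P)      # identical features, loss is zero
loss_shifted = content_loss_paper(F, P)   # 0.5 * 24 * 0.1**2, about 0.12
```

The loss is zero only when the two feature maps agree everywhere, which is what drives the random image toward the content image's features.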

The feature maps of the content image are computed as follows:

import numpy as np
import tensorflow as tf
import vgg

# Parameters
# network: path to the VGG-19 weights file
# content: the content image as an array
# pooling: pooling method

CONTENT_LAYERS = ('relu4_2', 'relu5_2')  # the original paper uses only relu4_2
content_features = {}
shape = (1,) + content.shape  # input shape: [batch, height, width, channels]; a single image, so batch=1

# Load the pretrained VGG-19 weights and the RGB mean
vgg_weights, vgg_mean_pixel = vgg.load_net(network)


# Compute the feature maps of the content image
g = tf.Graph()
with g.as_default(), g.device('/cpu:0'), tf.Session() as sess:
    # Build the computation graph; image is the fed input, and net holds the output of every VGG-19 layer
    image = tf.placeholder('float', shape=shape)
    net = vgg.net_preloaded(vgg_weights, image, pooling)
    # Preprocess the content image
    content_pre = np.array([vgg.preprocess(content, vgg_mean_pixel)])
    # Feed the preprocessed image to the graph and evaluate the content layers
    for layer in CONTENT_LAYERS:
        content_features[layer] = net[layer].eval(feed_dict={image: content_pre})

The feature maps of the random image and the content loss are computed as follows:

from functools import reduce  # reduce is not a builtin on Python 3

# Parameters
# image: the randomly generated image
# pooling: pooling method
# content_weight_blend: split between the two content reconstruction layers; the default of 1 uses only
#     the finer layer relu4_2, and the more abstract layer relu5_2 gets 1 - content_weight_blend
# content_weight: coefficient of the content loss

with tf.Graph().as_default():
    net = vgg.net_preloaded(vgg_weights, image, pooling)
    content_layers_weights = {}
    content_layers_weights['relu4_2'] = content_weight_blend
    content_layers_weights['relu5_2'] = 1.0 - content_weight_blend

    content_loss = 0
    content_losses = []
    for content_layer in CONTENT_LAYERS:
        # 2 * l2_loss(d) / size is the mean squared difference for this layer
        content_losses.append(
            content_layers_weights[content_layer] * content_weight *
            (2 * tf.nn.l2_loss(net[content_layer] - content_features[content_layer]) /
             content_features[content_layer].size))
    # Sum the per-layer losses once, after the loop
    content_loss += reduce(tf.add, content_losses)
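The TensorFlow expression above can be mirrored in plain NumPy to see what it computes. Since tf.nn.l2_loss(t) is sum(t ** 2) / 2, the term 2 * l2_loss(diff) / size is the mean squared difference of one layer. The sketch below is an illustration on made-up arrays, not the project's code; `l2_loss`, `layer_content_loss`, and `blended_content_loss` are names introduced here.

```python
import numpy as np

def l2_loss(t):
    """NumPy equivalent of tf.nn.l2_loss: sum(t ** 2) / 2."""
    return np.sum(t ** 2) / 2

def layer_content_loss(generated, target):
    """Mean squared difference between two feature arrays of one layer."""
    return 2 * l2_loss(generated - target) / target.size

def blended_content_loss(gen, tgt, content_weight, content_weight_blend):
    """Weighted sum of the relu4_2 and relu5_2 losses."""
    weights = {'relu4_2': content_weight_blend,
               'relu5_2': 1.0 - content_weight_blend}
    return sum(weights[name] * content_weight *
               layer_content_loss(gen[name], tgt[name])
               for name in weights)

# Toy features: relu4_2 differs by 1 everywhere, relu5_2 by 2 everywhere.
tgt = {'relu4_2': np.zeros((1, 2, 2, 3)), 'relu5_2': np.zeros((1, 1, 1, 3))}
gen = {'relu4_2': np.ones((1, 2, 2, 3)), 'relu5_2': np.full((1, 1, 1, 3), 2.0)}

loss_default = blended_content_loss(gen, tgt, content_weight=5.0, content_weight_blend=1.0)
loss_half = blended_content_loss(gen, tgt, content_weight=5.0, content_weight_blend=0.5)
```

With the default blend of 1 only relu4_2 contributes (5.0 here); with a blend of 0.5 both layers contribute equally (12.5 here).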

Part 2: Style Reconstruction

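Defining style mathematically is one of the more interesting aspects of Neural Style. Each convolutional filter can be viewed as extracting one kind of feature from the image. The paper reduces style to the correlations between pairs of features. The correlation between two features is measured by the inner product of their feature maps, which is an unnormalized counterpart of cosine similarity. The style at a given layer is therefore represented by the inner product of filter i and filter j in that layer, written $G_{ij}^l = \sum_k F_{ik}^l F_{jk}^l$.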
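A minimal NumPy sketch of this definition: each filter's activations are flattened into a vector, and entry (i, j) of the result is the dot product of the vectors for filters i and j. The function name `gram_matrix` and the toy feature shape are assumptions for illustration.

```python
import numpy as np

def gram_matrix(features):
    """Pairwise dot products between the filters of one layer.

    features: [height, width, channels] activations.
    Returns a [channels, channels] matrix.
    """
    flat = features.reshape(-1, features.shape[-1])  # [H*W, C], one column per filter
    return flat.T @ flat

rng = np.random.default_rng(0)
features = rng.standard_normal((2, 2, 3))   # hypothetical 2x2 map with 3 filters
G = gram_matrix(features)

# Entry (0, 1) is the dot product of filter 0 and filter 1 over all positions.
dot_01 = np.sum(features[..., 0] * features[..., 1])
```

Because the spatial positions are summed out, the matrix records which features co-occur but not where they occur, which is why it captures style rather than layout.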

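As with the loss in Content Reconstruction, the style image and the randomly generated noise image are passed through the same VGG-19 convolutional network, and the filters at the chosen layers are selected. For each layer, we compute the difference between the $G_{ij}^l$ of the two images.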

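The weighted sum over the layers is the final style loss:

$$L_{style} = \sum_l w_l E_l$$

where $E_l$ is the squared difference between the two images' $G_{ij}^l$ at layer l, up to a normalization constant, and $w_l$ is the weight of that layer.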
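The weighted sum can be sketched in NumPy as follows. The normalization used here (dividing the inner products by the number of activations and the squared difference by the number of matrix entries) is one possible choice and is an assumption of this sketch; `normalized_gram`, `style_loss`, and the toy inputs are illustrative names.

```python
import numpy as np

def normalized_gram(features):
    """Filter inner products of an [H, W, C] array, divided by the number of activations."""
    flat = features.reshape(-1, features.shape[-1])
    return flat.T @ flat / flat.size

def style_loss(gen_feats, style_feats, layer_weights):
    """Weighted sum over layers of the mean squared difference between the two matrices."""
    total = 0.0
    for name, w in layer_weights.items():
        diff = normalized_gram(gen_feats[name]) - normalized_gram(style_feats[name])
        total += w * np.sum(diff ** 2) / diff.size
    return total

# Toy single-layer example: all-ones features against all-zeros features.
gen_feats = {'relu1_1': np.ones((2, 2, 2))}
style_feats = {'relu1_1': np.zeros((2, 2, 2))}

loss_w1 = style_loss(gen_feats, style_feats, {'relu1_1': 1.0})   # 0.25
loss_w2 = style_loss(gen_feats, style_feats, {'relu1_1': 2.0})   # 0.5
loss_zero = style_loss(style_feats, style_feats, {'relu1_1': 1.0})
```

Raising a layer's weight scales its contribution linearly, which is how the balance between fine and coarse style features is tuned.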

The feature maps of the style images are computed as follows:

# Parameters
# styles: the set of style images; may contain several images
# style_blend_weights: weights of the individual style images
# style_layers_weights: weights of the individual network layers

STYLE_LAYERS = ('relu1_1', 'relu2_1', 'relu3_1', 'relu4_1', 'relu5_1')
style_shapes = [(1,) + style.shape for style in styles]
style_features = [{} for _ in styles]


# Compute the feature maps of the style images
for i in range(len(styles)):
    g = tf.Graph()
    with g.as_default(), g.device('/cpu:0'), tf.Session() as sess:
        image = tf.placeholder('float', shape=style_shapes[i])
        net = vgg.net_preloaded(vgg_weights, image, pooling)
        style_pre = np.array([vgg.preprocess(styles[i], vgg_mean_pixel)])
        for layer in STYLE_LAYERS:
            features = net[layer].eval(feed_dict={image: style_pre})
            features = np.reshape(features, (-1, features.shape[3]))  # features.shape[3] is the number of filters
            gram = np.matmul(features.T, features) / features.size
            style_features[i][layer] = gram
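One consequence of the reshape and normalization above is that the stored matrix has shape [filters, filters] no matter how large the style image is, so style images need not match the content image in size. The sketch below checks this on two constant arrays of different spatial size; `style_gram`, `small`, and `large` are hypothetical names.

```python
import numpy as np

def style_gram(features):
    """features: [1, height, width, channels] activations of one layer."""
    flat = np.reshape(features, (-1, features.shape[3]))
    return np.matmul(flat.T, flat) / flat.size

small = np.ones((1, 4, 4, 8))     # hypothetical activations from a small image
large = np.ones((1, 16, 32, 8))   # same filters, larger image

gram_small = style_gram(small)
gram_large = style_gram(large)
# Both are 8x8, and for these constant inputs every entry is 1/8.
```

Dividing by the number of activations keeps the entries on a comparable scale across image sizes.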

The feature maps of the random image and the style loss are computed as follows:

from functools import reduce  # reduce is not a builtin on Python 3

# style loss
style_loss = 0
for i in range(len(styles)):
    style_losses = []
    for style_layer in STYLE_LAYERS:
        layer = net[style_layer]
        _, height, width, number = map(lambda d: d.value, layer.get_shape())
        size = height * width * number
        feats = tf.reshape(layer, (-1, number))
        gram = tf.matmul(tf.transpose(feats), feats) / size
        style_gram = style_features[i][style_layer]
        style_losses.append(style_layers_weights[style_layer] * 2 * tf.nn.l2_loss(gram - style_gram) / style_gram.size)
    # Add this style image's contribution once, after all of its layers are processed
    style_loss += style_weight * style_blend_weights[i] * reduce(tf.add, style_losses)

# tv_loss
# Note: The total variation (TV) loss encourages spatial smoothness in the generated image.
# It was not used by Gatys et al in their CVPR paper but it can sometimes improve the results;
# for more details and explanation see Mahendran and Vedaldi "Understanding Deep Image
# Representations by Inverting Them" CVPR 2015.
tv_loss = ...

loss = content_loss + style_loss + tv_loss
train_step = tf.train.AdamOptimizer(learning_rate, beta1, beta2, epsilon).minimize(loss)
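The tv_loss expression is elided above, so the following is only a generic illustration of what a total variation penalty measures, not a reconstruction of the project's code. It assumes one common formulation, the sum of squared differences between vertically and horizontally adjacent pixels; `total_variation` and the toy images are hypothetical.

```python
import numpy as np

def total_variation(image):
    """Sum of squared differences between neighbouring pixels of an [H, W, C] image."""
    dy = image[1:, :, :] - image[:-1, :, :]   # vertical neighbours
    dx = image[:, 1:, :] - image[:, :-1, :]   # horizontal neighbours
    return np.sum(dy ** 2) + np.sum(dx ** 2)

flat_image = np.full((2, 2, 1), 0.5)                       # perfectly smooth
checker = np.array([[0.0, 1.0], [1.0, 0.0]])[..., None]    # 2x2 checkerboard

tv_flat = total_variation(flat_image)   # 0.0
tv_checker = total_variation(checker)   # 4.0
```

A smooth image incurs no penalty, while pixel-level noise is penalized, which is the smoothing effect described in the note above.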

Assembling the code above in order yields stylize.py, the second key file of the TensorFlow Neural Style code.

References

VGG-19 homepage
MatConvNet
Neural Style paper
TensorFlow Neural Style GitHub project
Neural Style explained (in Chinese)
VGG explained (in Chinese)