VGG: Very Deep Convolutional Networks for Large-Scale Image Recognition

Abstract

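VGG placed first in localization and second in classification at the ILSVRC 2014 challenge. Its authors are from the Visual Geometry Group at the University of Oxford, which is presumably where the name VGG comes from. The main contribution is an investigation of how depth affects network performance: large convolution kernels are replaced with stacks of small ones and the network is made deeper, yielding 16-layer and 19-layer networks (VGG16 and VGG19). VGG is now widely used in classification, detection, and keypoint localization; object detectors such as YOLO, SSD, and S3FD and face keypoint localization algorithms such as DAN use VGG16 as the feature extraction network.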

Network performance

[Figures 1 and 2: images not preserved]

Origins of the network

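  • Receptive field
    In a convolutional neural network (CNN), the receptive field is the size of the input region that determines one element of a given layer's output. For example, for a convolution layer with a 7x7 kernel, each element of the output feature map corresponds to a 7x7 region of that layer's input, and that region is its receptive field. Receptive fields also accumulate: the receptive field of every layer is measured relative to the input of the first layer, so keep the following in mind when computing it:
      (1) The receptive field of an output pixel of the first convolution layer equals the filter size.
      (2) The receptive field of a deeper convolution layer depends on the filter sizes and strides of all the layers before it.
    For the detailed receptive-field calculation, see the reference links at the end of the article.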


    [Figure: receptive field]
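  • Small vs. large convolution kernels and the receptive field
    Earlier networks start with large kernels: AlexNet uses 11x11 in its first layer and ZFNet uses 7x7, and a single 7x7 kernel has a 7x7 receptive field. Applying the receptive-field calculation to stacked convolution layers with the same stride (1) shows that two 3x3 kernels have the same receptive field as one 5x5 kernel, and three 3x3 kernels have the same receptive field as one 7x7 kernel. In terms of parameters, three 3x3 layers need fewer weights than one 7x7 layer: with C input and C output channels, 3x(3x3xCxC)=27CxC vs. 1x(7x7xCxC)=49CxC. Concretely, the receptive fields of the first three VGG convolutions are:
    First kernel: 3x3
    Second kernel: (3-1) x 1 + 3 = 5
    Third kernel: (5-1) x 1 + 3 = 7
    So three 3x3 kernels cover the same receptive field as one 7x7 kernel, while the 3x3 kernels let the network go deeper. The downside of VGGNet as a whole is that it is expensive to compute and has a large total number of parameters, which leads to high memory usage.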
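The receptive-field and parameter arithmetic above can be checked with a few lines of plain Python. This is only a sketch: the helper names `receptive_field` and `conv_weights` are made up for illustration, biases are ignored, and the recurrence used is the standard one (each layer adds `(kernel_size - 1)` times the product of the strides of the layers before it).

```python
def receptive_field(layers):
    """Receptive field (in input pixels) after a stack of conv/pool layers.

    layers: list of (kernel_size, stride) tuples, ordered from input to output.
    """
    rf, jump = 1, 1  # jump = product of the strides of all earlier layers
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


def conv_weights(kernel_size, channels, num_layers=1):
    """Weight count (biases ignored) of num_layers k x k convs with C in/out channels."""
    return num_layers * kernel_size * kernel_size * channels * channels


print(receptive_field([(3, 1)] * 2))  # 5, same as one 5x5 kernel
print(receptive_field([(3, 1)] * 3))  # 7, same as one 7x7 kernel
print(conv_weights(3, 64, num_layers=3), conv_weights(7, 64))  # 110592 200704
```

With C = 64 the three stacked 3x3 layers use 27 x 64 x 64 weights against 49 x 64 x 64 for the single 7x7 layer, which matches the 27CxC vs. 49CxC comparison in the text.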



Network architecture

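The VGG architecture is very simple. The tf.layers and tf.contrib.layers implementations in this article use conv2d-relu-BatchNorm as the basic unit (the original VGG paper predates batch normalization and uses plain conv-ReLU). Several (2, 3, or 4) of these units form a group (vgg_block), each group is followed by a 2x2 max pool for downsampling, and stacking several groups gives VGG networks of different depths.
To reduce the number of parameters in such a deep network, every convolution in the network uses kernel=3x3 and stride=1.
For classification, the output of the last group (after max pooling) goes through flatten, dropout, three fully connected layers, and softmax to produce the class and its confidence. The original ImageNet pretrained models were trained with Caffe; TensorFlow, MXNet, PyTorch, and others later provided their own VGG pretrained models, so the pretrained weights of the important layers can be transferred directly and fine-tuned for a new vision task. VGG16 and VGG19 both contain 5 vgg_block groups and differ only in the number of basic units per group: 2+2+3+3+3 for VGG16 and 2+2+4+4+4 for VGG19. Adding the 3 fully connected layers at the end gives 13+3=16 and 16+3=19 weight layers. This article focuses on VGG16, whose detailed structure is as follows: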

-------------------------------------------------- 
layer          | kh x kw, out, s | out size 
-------------------------------------------------- 
         input image (224 x 224 x3)
-------------------------------------------------- 
conv1_1        | 3x3, 64, 1      | 224x224x64 
conv1_2        | 3x3, 64, 1      | 224x224x64 
-------------------------------------------------- 
max_pool       | 2x2, 64, 2      | 112x112x64
-------------------------------------------------- 
conv2_1        | 3x3, 128, 1     | 112x112x128
conv2_2        | 3x3, 128, 1     | 112x112x128
-------------------------------------------------- 
max_pool       | 2x2, 128, 2     | 56x56x128
-------------------------------------------------- 
conv3_1        | 3x3, 256, 1     | 56x56x256 
conv3_2        | 3x3, 256, 1     | 56x56x256 
conv3_3        | 3x3, 256, 1     | 56x56x256 
-------------------------------------------------- 
max_pool       | 2x2, 256, 2     | 28x28x256
-------------------------------------------------- 
conv4_1        | 3x3, 512, 1     | 28x28x512 
conv4_2        | 3x3, 512, 1     | 28x28x512 
conv4_3        | 3x3, 512, 1     | 28x28x512 
-------------------------------------------------- 
max_pool       | 2x2, 512, 2     | 14x14x512
-------------------------------------------------- 
conv5_1        | 3x3, 512, 1     | 14x14x512 
conv5_2        | 3x3, 512, 1     | 14x14x512 
conv5_3        | 3x3, 512, 1     | 14x14x512 
-------------------------------------------------- 
max_pool       | 2x2, 512, 2     | 7x7x512
-------------------------------------------------- 
fc6            | 4096            | 1x1x4096 
fc7            | 4096            | 1x1x4096 
fc8            | 1000            | 1x1x1000
Softmax        | Classifier      | 1x1x1000
--------------------------------------------------
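The table can also be generated from the per-group configuration, which makes the 16 vs. 19 layer counts easy to verify. The sketch below is framework-free and illustrative only: `VGG_BLOCKS`, `build_layers`, and `weight_layer_count` are hypothetical names, and the spatial size is simply halved at every 2x2, stride-2 max pool (the 3x3, stride-1 convolutions are assumed to preserve it, as in the table).

```python
# (number of 3x3 convs, output channels) for each of the 5 groups
VGG_BLOCKS = {
    "VGG16": [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)],
    "VGG19": [(2, 64), (2, 128), (4, 256), (4, 512), (4, 512)],
}
FC_UNITS = [4096, 4096, 1000]  # fc6, fc7, fc8


def build_layers(blocks, input_size=224):
    """Return [(layer_name, (h, w, channels))] for one VGG variant."""
    layers, size = [], input_size
    for block_idx, (num_convs, channels) in enumerate(blocks, start=1):
        for conv_idx in range(1, num_convs + 1):
            layers.append(("conv%d_%d" % (block_idx, conv_idx), (size, size, channels)))
        size //= 2  # 2x2 max pool with stride 2 halves height and width
        layers.append(("pool%d" % block_idx, (size, size, channels)))
    for fc_idx, units in enumerate(FC_UNITS, start=6):
        layers.append(("fc%d" % fc_idx, (1, 1, units)))
    return layers


def weight_layer_count(layers):
    """Count layers with weights (conv and fc); pooling layers have none."""
    return sum(1 for name, _ in layers if not name.startswith("pool"))


for variant, blocks in VGG_BLOCKS.items():
    layers = build_layers(blocks)
    print(variant, weight_layer_count(layers), "weight layers, pool5 output", dict(layers)["pool5"])
```

Running it reports 16 weight layers for VGG16 and 19 for VGG19, each with a 7x7x512 feature map after the last pool.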

Code implementation

TensorFlow (tf below) offers three main Python APIs for building CNNs:

  • tf.nn
  • tf.layers
  • tf.contrib.layers
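    The level of abstraction increases from one to the next: defining convolution layers with tf.nn is the most involved, while tf.contrib.layers is far simpler and more convenient. Notably, the tf.contrib modules combine well with Python's higher-level language features, so network definitions become much shorter, more readable, and more Pythonic. tf.contrib.slim (TF-Slim) is popular for this reason, and many mature network architectures are implemented with it. The official TensorFlow documentation explains these modules in detail. Because different users work with different tf API modules, their VGG implementations differ as well; they are collected and compared here for reference.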
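  • Method 0
    tf.nn is TensorFlow's core module for deep-learning computation. It provides operators such as conv2d, pool, and relu for convolution, pooling, and activation. Taking 2D image convolution as an example, tf.nn.conv2d takes a 4D input ([batch, h, w, channel]), a 4D kernel ([kh, kw, in_channel, out_channel]), and a stride (int or list), and performs the 2D convolution. The kernel and the bias must be constructed beforehand from their shapes, so the usual practice is to wrap this in a Python function that builds a convolution layer from parameters such as kw, kh, and out_channel (conv_op in the code below). Fully connected layers and max pooling are built the same way. The full code follows: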

# --------------------------Method 0 --------------------------------------------
import tensorflow as tf
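# Create a convolution layer and append its parameters to the parameter list
def conv_op(input_op, name, kh, kw, n_out, dh, dw, p):
    """
    define conv operator with tf.nn 
    :param input_op: input tensor
    :param name: name of the layer
    :param kh: kernel height
    :param kw: kernel width
    :param n_out: number of output channels
    :param dh: stride height
    :param dw: stride width
    :param p: parameter list
    :return: activation tensor after conv, bias add, and ReLU
    """
    # number of input channels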
    n_in = input_op.get_shape()[-1].value
    with tf.name_scope(name) as scope:
        kernel = tf.get_variable(scope + "w", shape=[kh, kw, n_in, n_out], dtype=tf.float32,
                                 initializer=tf.contrib.layers.xavier_initializer_conv2d())
        conv = tf.nn.conv2d(input_op, kernel, (1, dh, dw, 1), padding='SAME')
        bias_init_val = tf.constant(0.0, shape=[n_out], dtype=tf.float32)
        biases = tf.Variable(bias_init_val, trainable=True, name='b')
        z = tf.nn.bias_add(conv, biases)
        activation = tf.nn.relu(z, name=scope)
        p += [kernel, biases]
        return activation


# Define a fully connected layer
def fc_op(input_op, name, n_out, p):
    """
    define fully connected operator with tf.nn 
    :param input_op: input tensor
    :param name: name of the layer
    :param n_out: number of output units
    :param p: parameter list
    :return: activation tensor after matmul, bias add, and ReLU
    """
    n_in = input_op.get_shape()[-1].value
    with tf.name_scope(name) as scope:
        kernel = tf.get_variable(scope + 'w', shape=[n_in, n_out], dtype=tf.float32,
                                 initializer=tf.contrib.layers.xavier_initializer())
        biases = tf.Variable(tf.constant(0.1, shape=[n_out], dtype=tf.float32), name='b')
        # tf.nn.relu_layer computes relu(matmul(input_op, kernel) + biases)
        activation = tf.nn.relu_layer(input_op, kernel, biases, name=scope)
        p += [kernel, biases]
        return activation


# Define a max pooling layer
def mpool_op(input_op, name, kh, kw, dh, dw):
    return tf.nn.max_pool(input_op, ksize=[1, kh, kw, 1], strides=[1, dh, dw, 1], padding='SAME', name=name)


# Define the network structure, Method 0
def vgg16_op(input_op, keep_prob):
    p = []
    conv1_1 = conv_op(input_op, name='conv1_1', kh=3, kw=3, n_out=64, dh=1, dw=1, p=p)
    conv1_2 = conv_op(conv1_1, name='conv1_2', kh=3, kw=3, n_out=64, dh=1, dw=1, p=p)
    pool1 = mpool_op(conv1_2, name='pool1', kh=2, kw=2, dw=2, dh=2)

    conv2_1 = conv_op(pool1, name='conv2_1', kh=3, kw=3, n_out=128, dh=1, dw=1, p=p)
    conv2_2 = conv_op(conv2_1, name='conv2_2', kh=3, kw=3, n_out=128, dh=1, dw=1, p=p)
    pool2 = mpool_op(conv2_2, name='pool2', kh=2, kw=2, dw=2, dh=2)

    conv3_1 = conv_op(pool2, name='conv3_1', kh=3, kw=3, n_out=256, dh=1, dw=1, p=p)
    conv3_2 = conv_op(conv3_1, name='conv3_2', kh=3, kw=3, n_out=256, dh=1, dw=1, p=p)
    conv3_3 = conv_op(conv3_2, name='conv3_3', kh=3, kw=3, n_out=256, dh=1, dw=1, p=p)
    pool3 = mpool_op(conv3_3, name='pool3', kh=2, kw=2, dw=2, dh=2)

    conv4_1 = conv_op(pool3, name='conv4_1', kh=3, kw=3, n_out=512, dh=1, dw=1, p=p)
    conv4_2 = conv_op(conv4_1, name='conv4_2', kh=3, kw=3, n_out=512, dh=1, dw=1, p=p)
    conv4_3 = conv_op(conv4_2, name='conv4_3', kh=3, kw=3, n_out=512, dh=1, dw=1, p=p)
    pool4 = mpool_op(conv4_3, name='pool4', kh=2, kw=2, dw=2, dh=2)

    conv5_1 = conv_op(pool4, name='conv5_1', kh=3, kw=3, n_out=512, dh=1, dw=1, p=p)
    conv5_2 = conv_op(conv5_1, name='conv5_2', kh=3, kw=3, n_out=512, dh=1, dw=1, p=p)
    conv5_3 = conv_op(conv5_2, name='conv5_3', kh=3, kw=3, n_out=512, dh=1, dw=1, p=p)
    pool5 = mpool_op(conv5_3, name='pool5', kh=2, kw=2, dw=2, dh=2)

    shp = pool5.get_shape()
    print("pool5 shape ", shp)

    flattened_shape = shp[1].value * shp[2].value * shp[3].value
    resh1 = tf.reshape(pool5, [-1, flattened_shape], name="resh1")

    fc6 = fc_op(resh1, name="fc6", n_out=4096, p=p)
    fc6_drop = tf.nn.dropout(fc6, keep_prob, name='fc6_drop')
    fc7 = fc_op(fc6_drop, name="fc7", n_out=4096, p=p)
    fc7_drop = tf.nn.dropout(fc7, keep_prob, name="fc7_drop")
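    # Note: fc_op applies ReLU, so fc8 here is non-negative rather than raw logits
    # (the original VGG applies no ReLU to the last fully connected layer).
    fc8 = fc_op(fc7_drop, name="fc8", n_out=1000, p=p)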
    softmax = tf.nn.softmax(fc8)
    predictions = tf.argmax(softmax, 1)
    return predictions, softmax, fc8, p

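  • Method 1
    tf.layers is a stable mid-level API in TensorFlow 1.x and an abstraction over tf.nn. It wraps classes such as Conv2D, Dense, BatchNormalization, and Conv2DTranspose and functions such as conv2d, which greatly speeds up model construction. For example, a convolution layer can be built in a single line, conv = tf.layers.conv2d(x, filters=32, kernel_size=3, padding="same", strides=1, activation=tf.nn.relu), which also specifies the activation applied after the convolution. A VGG16 implementation based on this module follows: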
# --------------------------Method 1 --------------------------------------------
import tensorflow as tf
import tensorflow.contrib.layers as tcl  # provides flatten and dropout used below
class VGG1:
    """
    define with tf.layers
    """
    def __init__(self, resolution_inp=224, channel=3, name='vgg'):
        """
        construct function
        :param resolution_inp: int, size of input image. default 224 of ImageNet
        :param channel: int, channel of input image. 1 or 3
        :param name: 
        """
        self.name = name
        self.channel = channel
        self.resolution_inp = resolution_inp

    def __call__(self, x, dropout=0.5, is_training=True):
        with tf.variable_scope(self.name) as scope:
            size = 64
            se = self.vgg_block(x, 2, size, is_training=is_training)
            se = self.vgg_block(se, 2, size * 2, is_training=is_training)
            se = self.vgg_block(se, 3, size * 4, is_training=is_training)
            se = self.vgg_block(se, 3, size * 8, is_training=is_training)
            se = self.vgg_block(se, 3, size * 8, is_training=is_training)

            flatten = tcl.flatten(se)
            # fc6 and fc7 use ReLU as in VGG; the final layer stays linear (logits)
            fc6 = tf.layers.dense(flatten, 4096, activation=tf.nn.relu)
            fc6_drop = tcl.dropout(fc6, dropout, is_training=is_training)
            fc7 = tf.layers.dense(fc6_drop, 4096, activation=tf.nn.relu)
            fc7_drop = tcl.dropout(fc7, dropout, is_training=is_training)
            self.fc_out = tf.layers.dense(fc7_drop, 1000)

            # predict for classify
            softmax = tf.nn.softmax(self.fc_out)
            self.predictions = tf.argmax(softmax, 1)
            return self.predictions

    def vgg_block(self, x, num_convs, num_channels, scope=None, is_training=True):
        """
        define the basic repeat unit in vgg: n x (conv-relu-batchnorm)-maxpool
        :param x: tensor or numpy.array, input
        :param num_convs: int, number of conv-relu-batchnorm 
        :param num_channels: int, number of conv filters
        :param scope: name space or scope
        :param is_training: bool, is training or not
        :return: 
        """
        with tf.variable_scope(scope, "conv"):
            se = x
            # conv-relu-batchnorm group
            for i in range(num_convs):
                se = tf.layers.conv2d(se,
                                      filters=num_channels,
                                      kernel_size=3,
                                      padding="same",
                                      strides=1,
                                      activation=tf.nn.relu)
                se = tf.layers.batch_normalization(se,
                                                   training=is_training,
                                                   scale=True)

            se = tf.layers.max_pooling2d(se, 2, 2, padding="same")

        return se

    @property
    def trainable_vars(self):
        return [var for var in tf.trainable_variables() if self.name in var.name]
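  • Method 2
    tf.contrib.layers is a further layer of abstraction over tf.layers; for example, tf.contrib.layers.conv2d adds batch-norm support through its normalizer arguments. tf.contrib.framework provides many Pythonic utilities, such as the arg_scope context manager, which applies the same arguments (activation type, batch_norm, and so on) to many convolution layers and keeps the code concise. The popular slim module adds a few more features on top and is used in much the same way, so it is not covered here. Note that much of tf.contrib was never part of the core API, and TensorFlow 2.0 removed tf.contrib altogether, so the code below requires TensorFlow 1.x. A VGG16 implementation based on this module follows: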
# --------------------------Method 2 --------------------------------------------
import tensorflow as tf
import tensorflow.contrib.layers as tcl
from tensorflow.contrib.framework import arg_scope
class VGG2:
    """
    define with tf.contrib.layers
    """
    def __init__(self, resolution_inp=224, channel=3, name='vgg'):
        self.name = name
        self.channel = channel
        self.resolution_inp = resolution_inp

    def __call__(self, x, dropout=0.5, is_training=True):
        with tf.variable_scope(self.name) as scope:
            with arg_scope([tcl.batch_norm], is_training=is_training, scale=True):
                with arg_scope([tcl.conv2d],
                               padding="SAME",
                               normalizer_fn=tcl.batch_norm,
                               activation_fn=tf.nn.relu, ):
                    size = 64
                    se = self.vgg_block(x, 2, size, is_training=is_training)
                    se = self.vgg_block(se, 2, size * 2, is_training=is_training)
                    se = self.vgg_block(se, 3, size * 4, is_training=is_training)
                    se = self.vgg_block(se, 3, size * 8, is_training=is_training)
                    se = self.vgg_block(se, 3, size * 8, is_training=is_training)

                    flatten = tcl.flatten(se)
                    # fc6 and fc7 use ReLU as in VGG; the final layer stays linear (logits)
                    fc6 = tf.layers.dense(flatten, 4096, activation=tf.nn.relu)
                    fc6_drop = tcl.dropout(fc6, dropout, is_training=is_training)
                    print("dropout ", fc6, fc6_drop)

                    fc7 = tf.layers.dense(fc6_drop, 4096, activation=tf.nn.relu)
                    fc7_drop = tcl.dropout(fc7, dropout, is_training=is_training)
                    self.fc_out = tf.layers.dense(fc7_drop, 1000)

                    # predict for classify
                    softmax = tf.nn.softmax(self.fc_out)
                    self.predictions = tf.argmax(softmax, 1)
                    return self.predictions

    def vgg_block(self, x, num_convs, num_channels, scope=None, is_training=True):
        """
        define the basic repeat unit in vgg: n x (conv-relu-batchnorm)-maxpool
        :param x: tensor or numpy.array, input
        :param num_convs: int, number of conv-relu-batchnorm 
        :param num_channels: int, number of conv filters
        :param scope: name space or scope
        :param is_training: bool, is training or not
        :return: 
        """
        with tf.variable_scope(scope, "conv"):
            se = x
            for i in range(num_convs):
                se = tcl.conv2d(se, num_outputs=num_channels, kernel_size=3, stride=1)
            se = tf.layers.max_pooling2d(se, 2, 2, padding="same")

        print("layer ", self.name, "in ", x, "out ", se)

        return se

    @property
    def trainable_vars(self):
        return [var for var in tf.trainable_variables() if self.name in var.name]

    @property
    def vars(self):
        return [var for var in tf.global_variables() if self.name in var.name]
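To make the arg_scope idea used in Method 2 more concrete, here is a toy, framework-free sketch of a context manager that supplies default keyword arguments to decorated functions. It is not TensorFlow's implementation: `scoped`, `toy_arg_scope`, and `toy_conv2d` are hypothetical names invented for this illustration, and the "layer" just returns its settings as a dict.

```python
import contextlib
import functools

_DEFAULTS = {}  # function name -> default keyword arguments currently in scope


def scoped(func):
    """Make func pick up defaults from any enclosing toy_arg_scope."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        merged = dict(_DEFAULTS.get(func.__name__, {}))
        merged.update(kwargs)  # explicit arguments override scoped defaults
        return func(*args, **merged)
    return wrapper


@contextlib.contextmanager
def toy_arg_scope(funcs, **defaults):
    """Apply the given keyword defaults to every function in funcs inside the block."""
    saved = {f.__name__: _DEFAULTS.get(f.__name__) for f in funcs}
    for f in funcs:
        merged = dict(_DEFAULTS.get(f.__name__, {}))
        merged.update(defaults)
        _DEFAULTS[f.__name__] = merged
    try:
        yield
    finally:
        # restore whatever defaults were active before entering the scope
        for name, old in saved.items():
            if old is None:
                _DEFAULTS.pop(name, None)
            else:
                _DEFAULTS[name] = old


@scoped
def toy_conv2d(x, num_outputs, kernel_size, stride=1, padding="VALID", activation=None):
    """Stand-in for a conv layer: records the settings it was called with."""
    return {"input": x, "num_outputs": num_outputs, "kernel_size": kernel_size,
            "stride": stride, "padding": padding, "activation": activation}


with toy_arg_scope([toy_conv2d], padding="SAME", activation="relu"):
    inside = toy_conv2d("x", num_outputs=64, kernel_size=3)
outside = toy_conv2d("x", num_outputs=64, kernel_size=3)
print(inside["padding"], inside["activation"])
print(outside["padding"], outside["activation"])
```

Inside the `with` block the call inherits `padding="SAME"` and `activation="relu"` without repeating them, which is the same convenience the VGG2 class gets from wrapping its tcl.conv2d calls in arg_scope.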

Running

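This part of the code has two pieces. The timing function time_tensorflow_run takes a tf.Session, the tensor to evaluate, the corresponding feed dict, and a message to print, and measures the time needed to run that tensor 100 times (mean and standard deviation). The main function run_benchmark can build VGG16 in any of the three ways above and times both inference (predict) and gradient computation (backward pass). The full code follows: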

# -------------------------- Demo and Test -------------------------------------------
from datetime import datetime
import tensorflow as tf
import math
import time
batch_size = 16
num_batches = 100
def time_tensorflow_run(session, target, feed, info_string):
    """
    calculate time for each session run
    :param session: tf.Session
    :param target: operator or tensor to run with the session
    :param feed: feed dict for session
    :param info_string: info message for print
    :return: 
    """
    num_steps_burn_in = 10  # number of warm-up steps
    total_duration = 0.0  # total time
    total_duration_squared = 0.0  # sum of squared durations, used for the variance
    for i in range(num_batches + num_steps_burn_in):
        start_time = time.time()
        _ = session.run(target, feed_dict=feed)
        duration = time.time() - start_time

        if i >= num_steps_burn_in:  # only count steps after warm-up
            if not i % 10:
                print('[%s] step %d, duration = %.3f' % (datetime.now(), i - num_steps_burn_in, duration))
            total_duration += duration
            total_duration_squared += duration * duration

    mn = total_duration / num_batches  # mean time per batch
    vr = max(total_duration_squared / num_batches - mn * mn, 0.0)  # variance, clamped against rounding error
    sd = math.sqrt(vr)  # standard deviation
    print('[%s] %s across %d steps, %.3f +/- %.3f sec/batch' % (datetime.now(), info_string, num_batches, mn, sd))


# test demo
def run_benchmark():
    """
    main function for test or demo
    :return: 
    """
    with tf.Graph().as_default():
        image_size = 224  # input image size
        images = tf.Variable(tf.random_normal([batch_size, image_size, image_size, 3], dtype=tf.float32, stddev=1e-1))
        keep_prob = tf.placeholder(tf.float32)

        # method 0
        # prediction, softmax, fc8, p = vgg16_op(images, keep_prob)

        # method 1 and method 2
        # vgg16 = VGG1(resolution_inp=image_size, name="vgg16")
        vgg16 = VGG2(resolution_inp=image_size, name="vgg16")
        prediction = vgg16(images, 0.5, True)
        fc8 = vgg16.fc_out
        p = vgg16.trainable_vars

        for v in p:
            print(v)
        init = tf.global_variables_initializer()

        # for var in tf.global_variables():
        #     print("param ", var.name)
        sess = tf.Session()
        print("init...")
        sess.run(init)

        print("predict..")
        writer = tf.summary.FileWriter("./logs")
        writer.add_graph(sess.graph)
        time_tensorflow_run(sess, prediction, {keep_prob: 1.0}, "Forward")

        # simulate the training process
        objective = tf.nn.l2_loss(fc8)  # a stand-in loss
        grad = tf.gradients(objective, p)  # gradients of the loss w.r.t. all model parameters

        print('grad backward')
        time_tensorflow_run(sess, grad, {keep_prob: 0.5}, "Forward-backward")
        writer.close()

if __name__ == '__main__':
    run_benchmark()
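The timing logic in time_tensorflow_run does not depend on TensorFlow, so it can be tried on any callable. The following is a minimal sketch under that assumption: `time_callable` is a hypothetical helper, it uses time.perf_counter instead of time.time, and it keeps the same scheme of warm-up steps followed by running sums for the mean and standard deviation.

```python
import math
import time


def time_callable(fn, num_batches=100, num_burn_in=10):
    """Call fn repeatedly and return (mean, std) of the per-call time in seconds."""
    total, total_squared = 0.0, 0.0
    for i in range(num_batches + num_burn_in):
        start = time.perf_counter()
        fn()
        duration = time.perf_counter() - start
        if i >= num_burn_in:  # skip warm-up calls
            total += duration
            total_squared += duration * duration
    mean = total / num_batches
    # E[x^2] - E[x]^2 can come out slightly negative through rounding, so clamp at 0
    variance = max(total_squared / num_batches - mean * mean, 0.0)
    return mean, math.sqrt(variance)


mean, std = time_callable(lambda: sum(range(10000)), num_batches=20, num_burn_in=5)
print("%.6f +/- %.6f sec/call" % (mean, std))
```

Keeping only the sum and the sum of squares avoids storing every duration, at the cost of some numerical precision when the durations are nearly identical.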

Note: the complete code is available in the author's GitHub project.

Parameter count


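About 138M parameters in total.
Fully connected layers: about 123.6M of these (fc6 ≈ 102.8M, fc7 ≈ 16.8M, fc8 ≈ 4.1M), roughly 89% of the total.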
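As a rough check on the parameter count, the sketch below recomputes it from the VGG16 layer table (3x3 convolutions, a 7x7x512 feature map entering fc6, 1000 classes). It counts weights and biases for plain conv and fully connected layers only; batch-norm parameters, which the tf.layers and tf.contrib.layers versions add, are not included. The helper names are hypothetical.

```python
def conv_params(kernel_size, c_in, c_out):
    """Weights plus biases of one k x k convolution layer."""
    return kernel_size * kernel_size * c_in * c_out + c_out


def fc_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def vgg16_params(num_classes=1000, in_channels=3):
    """Return (conv_total, fc_total) for VGG16 with a 224x224 input."""
    blocks = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]
    conv_total, c_in = 0, in_channels
    for num_convs, c_out in blocks:
        for _ in range(num_convs):
            conv_total += conv_params(3, c_in, c_out)
            c_in = c_out
    fc_total = (fc_params(7 * 7 * 512, 4096)      # fc6
                + fc_params(4096, 4096)           # fc7
                + fc_params(4096, num_classes))   # fc8
    return conv_total, fc_total


conv_total, fc_total = vgg16_params()
print(conv_total, fc_total, conv_total + fc_total)
# 14714688 123642856 138357544
```

The fully connected layers account for about 123.6M of the roughly 138M parameters, and fc6 alone contributes about 102.8M.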

Time efficiency

References

Project home page
https://blog.csdn.net/wcy12341189/article/details/56281618
https://blog.csdn.net/App_12062011/article/details/60962978
https://blog.csdn.net/zhangwei15hh/article/details/78417789
Receptive field
Receptive field calculation
