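Earlier we looked at VGGNet for image recognition. The VGGNet authors argued that making a CNN deeper further improves its performance. After that paper appeared, however, researchers found that once the depth passes a certain point, accuracy stops improving and actually starts to drop, as shown in the figure:
At first this was attributed to vanishing or exploding gradients, but that problem can be addressed with Batch Normalization, and the degradation still appears after Batch Normalization is added. It is not overfitting either: an overfitted model would not show lower accuracy on the training set. Some other factor in the model must therefore be responsible. In 2015 Kaiming He and his co-authors identified the problem and proposed a new architecture, ResNet, which is widely regarded as a milestone in image recognition. Below we describe the model in detail and implement it in TensorFlow.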
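Let $\mathcal{H}(\mathrm{x})$ denote the underlying mapping that a few stacked layers are meant to fit, where $\mathrm{x}$ is the input to the first of those layers. In a conventional network such as VGGNet, the stacked layers fit $\mathcal{H}(\mathrm{x})$ directly. ResNet instead asks them to fit $\mathcal{F}(\mathrm{x}) := \mathcal{H}(\mathrm{x}) - \mathrm{x}$, that is, the residual between the block's output and its input, which is why ResNet is also called a residual network. The structure is shown in the figure below: the input is added to the output just before the final ReLU of each block. This arrangement is called a "shortcut connection" and can be written as:
$$\mathrm{y}=\mathcal{F}\left(\mathrm{x},\left\{W_{i}\right\}\right)+\mathrm{x}$$
Here $\mathrm{x}$ and $\mathrm{y}$ are the input and output of a block, and $\mathcal{F}\left(\mathrm{x},\left\{W_{i}\right\}\right)$ is the residual mapping the block has to fit. For a two-layer block, $\mathcal{F}=W_{2} \sigma\left(W_{1} \mathrm{x}\right)$, where $W_{1}$ and $W_{2}$ are the block's weight matrices and $\sigma$ is the ReLU activation. The shortcut addition follows this computation, and a further ReLU applied to the sum gives the block's output. The block's output has the same form as in VGGNet-style networks, but the mapping is easier to fit: if the best mapping is close to the identity, for example, the block only has to push $\mathcal{F}$ toward zero.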
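The block formula above can be sketched in a few lines of NumPy. This is only an illustration of $\mathrm{y}=\sigma(W_{2}\sigma(W_{1}\mathrm{x})+\mathrm{x})$ for a single vector input; the function names `relu` and `residual_block`, the dimension `d`, and the random weights are assumptions made for the example, and biases are omitted to match the notation in the text.

```python
import numpy as np


def relu(z):
    """Element-wise ReLU."""
    return np.maximum(z, 0.0)


def residual_block(x, W1, W2):
    """Two-layer residual block: y = relu(F(x) + x), with F(x) = W2 @ relu(W1 @ x)."""
    residual = W2 @ relu(W1 @ x)   # the residual mapping F(x)
    return relu(residual + x)      # shortcut addition, then the final ReLU


rng = np.random.default_rng(0)
d = 4                              # illustrative feature dimension
x = rng.normal(size=d)
W1 = rng.normal(size=(d, d))
W2 = np.zeros((d, d))              # residual branch that outputs exactly zero

y = residual_block(x, W1, W2)
# With F(x) = 0 the block reduces to relu(x): the input passes through the shortcut.
print(np.allclose(y, relu(x)))
```

Setting `W2` to zero shows why the identity-like case is easy for this structure: the block does not need to learn anything for the input to reach the output.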
This form of shortcut requires $\mathcal{F}(\mathrm{x})$ and $\mathrm{x}$ to have exactly the same dimensions. When they differ, a linear projection $W_{s}$ can map $\mathrm{x}$ to the dimensions of $\mathcal{F}(\mathrm{x})$:
$$\mathrm{y}=\mathcal{F}\left(\mathrm{x},\left\{W_{i}\right\}\right)+W_{s} \mathrm{x}$$
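A minimal NumPy sketch of the projection shortcut follows, assuming a vector input and a two-layer residual branch. The names `projection_block`, `d_in`, `d_hidden`, `d_out` and the random weights are hypothetical and chosen only to make the shapes visible; as in the formula, no activation is written after the sum.

```python
import numpy as np


def relu(z):
    """Element-wise ReLU."""
    return np.maximum(z, 0.0)


def projection_block(x, W1, W2, Ws):
    """y = F(x) + Ws @ x, where F(x) = W2 @ relu(W1 @ x) changes the dimension."""
    residual = W2 @ relu(W1 @ x)   # shape (d_out,)
    shortcut = Ws @ x              # x projected from d_in to d_out
    return residual + shortcut


rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 6, 8    # illustrative sizes
x = rng.normal(size=d_in)
W1 = rng.normal(size=(d_hidden, d_in))
W2 = rng.normal(size=(d_out, d_hidden))
Ws = rng.normal(size=(d_out, d_in))

y = projection_block(x, W1, W2, Ws)
print(x.shape, y.shape)
```

Without `Ws` the addition would fail, because a length-4 input cannot be added to a length-8 residual.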
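The $\mathcal{F}(\mathrm{x})$ described above is written for fully connected layers. For images it can be replaced by convolutional layers, with the shortcut addition performed element-wise on each feature map. The number of layers in a block is also flexible: two, three, or more. The paper mainly uses the two block designs shown in the figure below. The three-layer design on the right is called a "bottleneck", and it reduces the computational cost of the model.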
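To get a feel for why the bottleneck is cheaper, one can count convolution weights. The sketch below compares two 3×3 convolutions on 256 channels with a 1×1, 3×3, 1×1 bottleneck that squeezes 256 channels down to 64 and back. The channel widths are an illustrative assumption, and biases and Batch Normalization parameters are ignored.

```python
def conv_weights(k_size, c_in, c_out):
    """Number of weights in a k_size x k_size convolution (bias ignored)."""
    return k_size * k_size * c_in * c_out


# Two-layer block applied directly to 256-channel feature maps.
basic_weights = 2 * conv_weights(3, 256, 256)

# Bottleneck: 1x1 reduces to 64 channels, 3x3 works on 64, 1x1 restores 256.
bottleneck_weights = (conv_weights(1, 256, 64)
                      + conv_weights(3, 64, 64)
                      + conv_weights(1, 64, 256))

print(basic_weights)       # 1179648
print(bottleneck_weights)  # 69632
```

Under these assumptions the bottleneck needs roughly 17 times fewer weights, because the expensive 3×3 convolution only runs on the reduced 64-channel representation.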
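For a comparison with conventional deep CNNs, the authors use a baseline based on the classic VGGNet with a few small modifications: some of the intermediate max-pooling layers are replaced by convolutional layers with stride 2, and the first convolutional layer uses a $7 \times 7$ kernel. They try several depths, building 18- and 34-layer models from two-layer blocks and 50-, 101-, and 152-layer models from three-layer blocks. The architectures are shown in the figure below, with ResNet on the far right. Each solid line beside the model marks a block whose feature-map size is unchanged; each dashed line marks a block whose feature-map size is halved and whose channel count is doubled. When the dimensions change, the shortcut is handled in one of two ways: zero-padding so that $\mathcal{F}(\mathrm{x})$ and $\mathrm{x}$ match, or a $1 \times 1$ convolution that brings $\mathrm{x}$ to the dimensions of $\mathcal{F}(\mathrm{x})$.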
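The zero-padding option can be illustrated with NumPy on an NHWC tensor. This is a sketch under stated assumptions rather than the paper's exact procedure: the spatial size is halved by strided slicing, the extra channels are filled with zeros, and the function name `zero_pad_shortcut` and the tensor sizes are hypothetical.

```python
import numpy as np


def zero_pad_shortcut(x, out_channels, stride=2):
    """Parameter-free shortcut for a block that halves H/W and widens the channels.

    x has shape (batch, height, width, channels).
    """
    x = x[:, ::stride, ::stride, :]            # subsample spatially
    extra = out_channels - x.shape[-1]         # channels still missing
    pad = ((0, 0), (0, 0), (0, 0), (0, extra))
    return np.pad(x, pad, mode='constant')     # new channels are all zeros


x = np.ones((1, 8, 8, 64), dtype=np.float32)
shortcut = zero_pad_shortcut(x, out_channels=128)
print(shortcut.shape)  # (1, 4, 4, 128)
```

This variant adds no parameters, whereas the $1 \times 1$ convolution learns the projection at the cost of extra weights.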
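Experiments on several datasets show that ResNet does resolve the drop in accuracy as depth increases, as shown in the figure below:
The ResNet model is reproduced in TensorFlow below. The code uses the TensorFlow 1.x graph API (tf.placeholder, tf.layers, tf.Session):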
import os
import config
import random
import numpy as np
import tensorflow as tf
from config import resnet_config
from data_loader import DataLoader
from eval.evaluate import accuracy


class ResNet(object):
    def __init__(self,
                 depth=resnet_config.depth,
                 height=config.height,
                 width=config.width,
                 channel=config.channel,
                 num_classes=config.num_classes,
                 learning_rate=resnet_config.learning_rate,
                 learning_decay_rate=resnet_config.learning_decay_rate,
                 learning_decay_steps=resnet_config.learning_decay_steps,
                 epoch=resnet_config.epoch,
                 batch_size=resnet_config.batch_size,
                 model_path=resnet_config.model_path,
                 summary_path=resnet_config.summary_path):
        """
        :param depth: network depth; must be one of 18, 34, 50, 101
        """
        self.depth = depth
        self.height = height
        self.width = width
        self.channel = channel
        self.learning_rate = learning_rate
        self.learning_decay_rate = learning_decay_rate
        self.learning_decay_steps = learning_decay_steps
        self.epoch = epoch
        self.batch_size = batch_size
        self.num_classes = num_classes
        self.model_path = model_path
        self.summary_path = summary_path
        # number of blocks in each of the four stacks, per depth
        self.num_block_dict = {18: [2, 2, 2, 2],
                               34: [3, 4, 6, 3],
                               50: [3, 4, 6, 3],
                               101: [3, 4, 23, 3]}
        # whether a depth uses the three-layer bottleneck block
        self.bottleneck_dict = {18: False,
                                34: False,
                                50: True,
                                101: True}
        self.filter_out = [64, 128, 256, 512]
        self.filter_out_last_layer = [256, 512, 1024, 2048]
        self.conv_out_depth = self.filter_out[-1] if self.depth < 50 else self.filter_out_last_layer[-1]
        assert self.depth in self.num_block_dict, 'depth should be in [18,34,50,101]'
        self.num_block = self.num_block_dict[self.depth]
        self.bottleneck = self.bottleneck_dict[self.depth]
        self.input_x = tf.placeholder(tf.float32, shape=[None, self.height, self.width, self.channel], name='input_x')
        self.input_y = tf.placeholder(tf.float32, shape=[None, self.num_classes], name='input_y')
        self.prediction = None
        self.loss = None
        self.acc = None
        self.global_step = None
        self.data_loader = DataLoader()
        self.model()

    def model(self):
        # first convolution layer and max pooling
        x = self.conv(x=self.input_x, k_size=7, filters_out=64, strides=2, activation=True, name='First_Conv')
        x = tf.layers.max_pooling2d(x, pool_size=[3, 3], strides=2, padding='same', name='max_pool')
        # four stacks of residual blocks
        x = self.stack_block(x)
        # global average pooling over the remaining spatial dimensions
        x = tf.layers.average_pooling2d(x, pool_size=x.get_shape().as_list()[1:3], strides=1, name='average_pool')
        x = tf.reshape(x, [-1, 1 * 1 * self.conv_out_depth])
        fc_W = tf.truncated_normal_initializer(stddev=0.1)
        logits = tf.layers.dense(inputs=x, units=self.num_classes, kernel_initializer=fc_W)
        # predicted class
        self.prediction = tf.argmax(logits, axis=-1)
        # accuracy
        self.acc = accuracy(logits, self.input_y)
        # loss
        self.loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=self.input_y))
        # global step
        self.global_step = tf.train.get_or_create_global_step()
        # exponentially decaying learning rate
        learning_rate = tf.train.exponential_decay(learning_rate=self.learning_rate,
                                                   global_step=self.global_step,
                                                   decay_rate=self.learning_decay_rate,
                                                   decay_steps=self.learning_decay_steps,
                                                   staircase=True)
        # pass global_step so it is incremented on every update; otherwise the learning rate never decays
        self.optimize = tf.train.AdamOptimizer(learning_rate).minimize(self.loss, global_step=self.global_step)

    def stack_block(self, input_x):
        for stack in range(4):
            # every stack except the first halves the feature-map size
            stack_strides = 1 if stack == 0 else 2
            stack_name = 'stack_%s' % stack
            with tf.name_scope(stack_name):
                for block in range(self.num_block[stack]):
                    shortcut = input_x
                    # only the first block of a stack downsamples
                    block_strides = stack_strides if block == 0 else 1
                    block_name = stack_name + '_block_%s' % block
                    with tf.name_scope(block_name):
                        if self.bottleneck:
                            # bottleneck block: 1x1 -> 3x3 -> 1x1
                            for layer in range(3):
                                with tf.name_scope(block_name + '_layer_%s' % layer):
                                    filters = self.filter_out[stack] if layer < 2 else self.filter_out_last_layer[stack]
                                    k_size = 3 if layer == 1 else 1
                                    layer_strides = block_strides if layer < 1 else 1
                                    # no ReLU after the last layer; it is applied after the shortcut addition
                                    activation = layer < 2
                                    layer_name = block_name + '_conv_%s' % layer
                                    input_x = self.conv(x=input_x, filters_out=filters, k_size=k_size,
                                                        strides=layer_strides, activation=activation, name=layer_name)
                        else:
                            # basic block: 3x3 -> 3x3
                            for layer in range(2):
                                with tf.name_scope(block_name + '_layer_%s' % layer):
                                    filters = self.filter_out[stack]
                                    k_size = 3
                                    layer_strides = block_strides if layer < 1 else 1
                                    activation = layer < 1
                                    layer_name = block_name + '_conv_%s' % layer
                                    input_x = self.conv(x=input_x, filters_out=filters, k_size=k_size,
                                                        strides=layer_strides, activation=activation, name=layer_name)
                        shortcut_depth = shortcut.get_shape()[-1]
                        input_x_depth = input_x.get_shape()[-1]
                        with tf.name_scope('shortcut_connect'):
                            if shortcut_depth != input_x_depth:
                                # dimensions differ: project the shortcut with a 1x1 convolution
                                connect_k_size = 1
                                connect_strides = block_strides
                                connect_filter = filters
                                shortcut_name = block_name + '_shortcut'
                                shortcut = self.conv(x=shortcut, filters_out=connect_filter, k_size=connect_k_size,
                                                     strides=connect_strides, activation=False, name=shortcut_name)
                            input_x = tf.nn.relu(shortcut + input_x)
        return input_x

    def conv(self, x, k_size, filters_out, strides, activation, name):
        x = tf.layers.conv2d(x, filters=filters_out, kernel_size=k_size, strides=strides, padding='same', name=name)
        # NOTE: `training` defaults to False, so this layer normalizes with its moving statistics.
        # To normalize with batch statistics during training, pass training=True and run the
        # ops in tf.GraphKeys.UPDATE_OPS together with the optimizer step.
        x = tf.layers.batch_normalization(x, name=name + '_BN')
        if activation:
            x = tf.nn.relu(x)
        return x

    def fit(self, train_id_list, valid_img, valid_label):
        """
        Train the model.
        :return:
        """
        # create the model and summary directories
        if not os.path.exists(self.model_path):
            os.makedirs(self.model_path)
        if not os.path.exists(self.summary_path):
            os.makedirs(self.summary_path)
        # initialize the step counter and the best validation accuracy
        train_steps = 0
        best_valid_acc = 0.0
        # initialize the summary
        tf.summary.scalar('loss', self.loss)
        merged = tf.summary.merge_all()
        # initialize the session
        sess = tf.Session()
        writer = tf.summary.FileWriter(self.summary_path, sess.graph)
        saver = tf.train.Saver(max_to_keep=10)
        sess.run(tf.global_variables_initializer())
        for epoch in range(self.epoch):
            # reshuffle the sample ids and split them into batches
            shuffle_id_list = random.sample(train_id_list.tolist(), len(train_id_list))
            batch_num = int(np.ceil(len(shuffle_id_list) / self.batch_size))
            train_id_batch = np.array_split(shuffle_id_list, batch_num)
            for i in range(batch_num):
                this_batch = train_id_batch[i]
                batch_img, batch_label = self.data_loader.get_batch_data(this_batch)
                train_steps += 1
                feed_dict = {self.input_x: batch_img, self.input_y: batch_label}
                _, train_loss, train_acc = sess.run([self.optimize, self.loss, self.acc], feed_dict=feed_dict)
                # validate after every step; raise the modulus to validate less often
                if train_steps % 1 == 0:
                    val_loss, val_acc = sess.run([self.loss, self.acc],
                                                 feed_dict={self.input_x: valid_img, self.input_y: valid_label})
                    msg = 'epoch:%s | steps:%s | train_loss:%.4f | val_loss:%.4f | train_acc:%.4f | val_acc:%.4f' % (
                        epoch, train_steps, train_loss, val_loss, train_acc, val_acc)
                    print(msg)
                    summary = sess.run(merged, feed_dict={self.input_x: valid_img, self.input_y: valid_label})
                    writer.add_summary(summary, global_step=train_steps)
                    if val_acc >= best_valid_acc:
                        best_valid_acc = val_acc
                        # save inside the model directory so predict() can find the checkpoint;
                        # the file prefix 'model.ckpt' is an arbitrary choice
                        saver.save(sess, save_path=os.path.join(self.model_path, 'model.ckpt'),
                                   global_step=train_steps)
        sess.close()

    def predict(self, x):
        """
        Predict class indices for a batch of images.
        :param x: input images
        :return: predicted class indices
        """
        sess = tf.Session()
        sess.run(tf.global_variables_initializer())
        saver = tf.train.Saver(tf.global_variables())
        # restore the most recent checkpoint from the model directory
        ckpt = tf.train.get_checkpoint_state(self.model_path)
        saver.restore(sess, ckpt.model_checkpoint_path)
        prediction = sess.run(self.prediction, feed_dict={self.input_x: x})
        sess.close()
        return prediction
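As a quick sanity check, the block counts in num_block_dict can be compared against the depth in each model's name. The sketch below copies the two dictionaries from the class and counts the first convolution, the convolutions inside the blocks, and the final fully connected layer; the helper name `count_weighted_layers` is hypothetical, and the 1×1 projection convolutions on the shortcuts are not counted.

```python
# Copied from the ResNet class above.
num_block_dict = {18: [2, 2, 2, 2],
                  34: [3, 4, 6, 3],
                  50: [3, 4, 6, 3],
                  101: [3, 4, 23, 3]}
bottleneck_dict = {18: False, 34: False, 50: True, 101: True}


def count_weighted_layers(depth):
    """First conv + convolutions in all blocks + final dense layer."""
    layers_per_block = 3 if bottleneck_dict[depth] else 2
    return 1 + layers_per_block * sum(num_block_dict[depth]) + 1


for depth in sorted(num_block_dict):
    print(depth, count_weighted_layers(depth))
```

Each configuration adds up to the number in its name, which also shows that the 34- and 50-layer models share the same block counts and differ only in the block type.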
Finally, the advantages and disadvantages of ResNet are summarized as follows: