《Densely Connected Convolutional Networks》 proposes DenseNet, which connects each layer to every other layer in a feed-forward fashion, so that an $L$-layer network has $\frac{L(L+1)}{2}$ direct connections (for example, $L = 5$ gives $5 \cdot 6/2 = 15$ connections). DenseNet has several advantages: it alleviates the vanishing-gradient problem, strengthens feature propagation, encourages feature reuse, and substantially reduces the number of parameters.
As convolutional neural networks become deeper, a new research problem emerges: as input information (or gradient information) passes through many layers, it can vanish or be "washed out" by the time it reaches the end (or, for gradients, the beginning) of the network. One remedy is to create short paths from early layers to later layers.
The authors propose an architecture that connects all layers directly with each other, in order to ensure maximum information flow between layers in the network.
The $l^{th}$ layer receives the feature maps of all preceding layers, $x_0, \ldots, x_{l-1}$, as input:

$$x_l = H_l([x_0, x_1, \ldots, x_{l-1}])$$

where $[x_0, x_1, \ldots, x_{l-1}]$ denotes the concatenation of the feature maps produced by layers $0, \ldots, l-1$.
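As an illustration (not from the paper), here is a minimal numpy sketch of this connectivity pattern, with a toy function H standing in for the real $H_l$:

import numpy as np

def H(x, k=12, rng=np.random.default_rng(0)):
    # Toy stand-in for the composite function H_l (BN-ReLU-Conv):
    # projects however many input channels x has down to k output channels.
    n, c, h, w = x.shape
    W = rng.standard_normal((k, c)) / np.sqrt(c)
    return np.maximum(0, np.einsum('kc,nchw->nkhw', W, x))

features = [np.ones((1, 16, 8, 8))]            # x_0 with k0 = 16 channels
for l in range(1, 4):
    x_l = H(np.concatenate(features, axis=1))  # x_l = H_l([x_0, ..., x_{l-1}])
    features.append(x_l)
print([f.shape[1] for f in features])          # [16, 12, 12, 12]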
This large number of connections does not introduce any additional hyperparameters.
$H_l(\cdot)$ is a composite function of three consecutive operations: batch normalization (BN), followed by a ReLU and a 3x3 convolution (Conv); this is exactly the bn_relu_conv helper in the code below.
The layers between dense blocks are called transition layers; each consists of a BN layer, a 1x1 Conv, and a 2x2 average pooling layer.
If each function $H_l$ produces $k$ feature maps, then the $l^{th}$ layer has $k_0 + k \times (l-1)$ input feature maps, where $k_0$ is the number of channels in the input layer. The hyperparameter $k$ is called the growth rate of the network. The feature maps can be viewed as the global state of the network: each layer adds $k$ feature maps of its own to this state, and the growth rate regulates how much new information each layer contributes to it.
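For example, with the defaults used in the code below ($k_0 = 16$, $k = 12$), the number of input feature maps grows linearly with depth inside a block:

k0, k = 16, 12                    # first_output and growth_rate defaults from the code below
for l in range(1, 6):
    print('layer', l, '->', k0 + k * (l - 1), 'input feature maps')
# layer 1 -> 16, layer 2 -> 28, layer 3 -> 40, layer 4 -> 52, layer 5 -> 64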
Introducing a 1x1 Conv before each 3x3 Conv reduces the number of input feature maps (the bottleneck design, DenseNet-B).
To further improve model compactness, the number of feature maps can also be reduced at the transition layers (compression; combined with the bottleneck this gives DenseNet-BC).
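A hedged sketch of what these two modifications might look like (the helper names bottleneck_layer and compressed_transition are mine, reusing the bn_relu_conv helper defined in the code below; the code below itself implements only the basic DenseNet without them):

from caffe import layers as L, params as P

def bottleneck_layer(bottom, growth_rate, dropout):
    # DenseNet-B: a 1x1 conv first narrows the input to 4*growth_rate maps,
    # then the 3x3 conv produces this layer's growth_rate new feature maps.
    conv1 = bn_relu_conv(bottom, ks=1, nout=4 * growth_rate, stride=1, pad=0, dropout=dropout)
    conv3 = bn_relu_conv(conv1, ks=3, nout=growth_rate, stride=1, pad=1, dropout=dropout)
    return L.Concat(bottom, conv3, axis=1)

def compressed_transition(bottom, nchannels, dropout, reduction=0.5):
    # DenseNet-C: the transition layer keeps only reduction * nchannels maps.
    conv = bn_relu_conv(bottom, ks=1, nout=int(nchannels * reduction), stride=1, pad=0, dropout=dropout)
    return L.Pooling(conv, pool=P.Pooling.AVE, kernel_size=2, stride=2)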
Note: the original DenseNet implementation can run out of memory. To reduce GPU memory consumption, a memory-efficient implementation of DenseNets is available; see 《Memory-efficient implementation of densenets》.
DenseNet is similar to ResNets; one key difference is that the addition operation is replaced by concatenation. To reach the same accuracy, DenseNet-BC needs only about 1/3 of the parameters of ResNets.
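In toy numpy terms (illustrative only, with H standing in for the layer's transformation):

import numpy as np

x = np.ones((1, 16, 8, 8))
H = lambda t: 0.5 * t                             # toy layer transformation
resnet_out = x + H(x)                             # ResNet: element-wise sum, still 16 channels
densenet_out = np.concatenate([x, H(x)], axis=1)  # DenseNet: channel concat, now 32 channels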
One explanation for the accuracy gains of dense convolutional networks is that individual layers receive additional supervision from the loss function through the shorter connections.
https://github.com/liuzhuang13/DenseNet
https://github.com/shicai/DenseNet-Caffe
The following code is from https://github.com/liuzhuang13/DenseNetCaffe:
from __future__ import print_function
from caffe import layers as L, params as P, to_proto
from caffe.proto import caffe_pb2
import caffe
def bn_relu_conv(bottom, ks, nout, stride, pad, dropout):
    # Composite function H_l: BatchNorm -> Scale -> ReLU -> Convolution (-> Dropout)
    batch_norm = L.BatchNorm(bottom, in_place=False, param=[dict(lr_mult=0, decay_mult=0), dict(lr_mult=0, decay_mult=0), dict(lr_mult=0, decay_mult=0)])
    scale = L.Scale(batch_norm, bias_term=True, in_place=True, filler=dict(value=1), bias_filler=dict(value=0))
    relu = L.ReLU(scale, in_place=True)
    conv = L.Convolution(relu, kernel_size=ks, stride=stride,
                         num_output=nout, pad=pad, bias_term=False, weight_filler=dict(type='msra'), bias_filler=dict(type='constant'))
    if dropout > 0:
        conv = L.Dropout(conv, dropout_ratio=dropout)
    return conv
def add_layer(bottom, num_filter, dropout):
    # One dense layer: H_l produces num_filter (= growth rate) new feature maps,
    # which are concatenated with the input along the channel axis.
    conv = bn_relu_conv(bottom, ks=3, nout=num_filter, stride=1, pad=1, dropout=dropout)
    concate = L.Concat(bottom, conv, axis=1)
    return concate
def transition(bottom, num_filter, dropout):
    # Transition layer between dense blocks: 1x1 conv followed by 2x2 average pooling
    conv = bn_relu_conv(bottom, ks=1, nout=num_filter, stride=1, pad=0, dropout=dropout)
    pooling = L.Pooling(conv, pool=P.Pooling.AVE, kernel_size=2, stride=2)
    return pooling
#change the line below to experiment with different settings
#depth -- must be 3n+4
#first_output -- #channels before entering the first dense block, set it to be comparable to growth_rate
#growth_rate -- growth rate
#dropout -- set to 0 to disable dropout, non-zero number to set dropout rate
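#Example (added for illustration; not in the original repo): with the defaults
#depth=40, first_output=16, growth_rate=12, each of the 3 dense blocks has
#N = (40-4)/3 = 12 layers, and 448 = 16 + 3*12*12 feature maps enter the
#final global pooling layer.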
def densenet(data_file, mode='train', batch_size=64, depth=40, first_output=16, growth_rate=12, dropout=0.2):
    data, label = L.Data(source=data_file, backend=P.Data.LMDB, batch_size=batch_size, ntop=2,
                         transform_param=dict(mean_file="/home/zl499/caffe/examples/cifar10/mean.binaryproto"))
    nchannels = first_output
    # Initial 3x3 convolution before the first dense block
    model = L.Convolution(data, kernel_size=3, stride=1, num_output=nchannels,
                          pad=1, bias_term=False, weight_filler=dict(type='msra'), bias_filler=dict(type='constant'))
    N = (depth - 4) // 3  # layers per dense block; integer division so range(N) also works on Python 3
    # Dense block 1 + transition
    for i in range(N):
        model = add_layer(model, growth_rate, dropout)
        nchannels += growth_rate
    model = transition(model, nchannels, dropout)
    # Dense block 2 + transition
    for i in range(N):
        model = add_layer(model, growth_rate, dropout)
        nchannels += growth_rate
    model = transition(model, nchannels, dropout)
    # Dense block 3, then BN-ReLU, global average pooling, and the classifier
    for i in range(N):
        model = add_layer(model, growth_rate, dropout)
        nchannels += growth_rate
    model = L.BatchNorm(model, in_place=False, param=[dict(lr_mult=0, decay_mult=0), dict(lr_mult=0, decay_mult=0), dict(lr_mult=0, decay_mult=0)])
    model = L.Scale(model, bias_term=True, in_place=True, filler=dict(value=1), bias_filler=dict(value=0))
    model = L.ReLU(model, in_place=True)
    model = L.Pooling(model, pool=P.Pooling.AVE, global_pooling=True)
    model = L.InnerProduct(model, num_output=10, bias_term=True, weight_filler=dict(type='xavier'), bias_filler=dict(type='constant'))
    loss = L.SoftmaxWithLoss(model, label)
    accuracy = L.Accuracy(model, label)
    return to_proto(loss, accuracy)
def make_net():
    with open('train_densenet.prototxt', 'w') as f:
        #change the path to your data. If it's not lmdb format, also change first line of densenet() function
        print(str(densenet('/home/zl499/caffe/examples/cifar10/cifar10_train_lmdb', batch_size=64)), file=f)
    with open('test_densenet.prototxt', 'w') as f:
        print(str(densenet('/home/zl499/caffe/examples/cifar10/cifar10_test_lmdb', batch_size=50)), file=f)
def make_solver():
    s = caffe_pb2.SolverParameter()
    s.random_seed = 0xCAFFE
    s.train_net = 'train_densenet.prototxt'
    s.test_net.append('test_densenet.prototxt')
    s.test_interval = 800
    s.test_iter.append(200)
    s.max_iter = 230000
    s.type = 'Nesterov'
    s.display = 1
    s.base_lr = 0.1
    s.momentum = 0.9
    s.weight_decay = 1e-4
    s.lr_policy = 'multistep'
    s.gamma = 0.1
    # drop the learning rate by 10x at 50% and 75% of training
    s.stepvalue.append(int(0.5 * s.max_iter))
    s.stepvalue.append(int(0.75 * s.max_iter))
    s.solver_mode = caffe_pb2.SolverParameter.GPU
    solver_path = 'solver.prototxt'
    with open(solver_path, 'w') as f:
        f.write(str(s))
if __name__ == '__main__':
    make_net()
    make_solver()
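Running the script (with the data and mean-file paths above changed to your own) writes train_densenet.prototxt, test_densenet.prototxt, and solver.prototxt; training can then be launched with the standard Caffe CLI, e.g. caffe train -solver solver.prototxt.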