Very Deep Convolutional Networks for Large-Scale Image Recognition
Abstract
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3×3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.
Introduction
Convolutional networks (ConvNets) have recently enjoyed a great success in large-scale image and video recognition which has become possible due to the large public image repositories, such as ImageNet, and high-performance computing systems, such as GPUs or large-scale distributed clusters. In particular, an important role in the advance of deep visual recognition architectures has been played by the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC), which has served as a testbed for a few generations of large-scale image classification systems, from high-dimensional shallow feature encodings to deep ConvNets.
With ConvNets becoming more of a commodity in the computer vision field, a number of attempts have been made to improve the original architecture of Krizhevsky et al. in a bid to achieve better accuracy. For instance, the best-performing submissions to the ILSVRC-2013 utilised smaller receptive window size and smaller stride of the first convolutional layer. Another line of improvements dealt with training and testing the networks densely over the whole image and over multiple scales. In this paper, we address another important aspect of ConvNet architecture design - its depth. To this end, we fix other parameters of the architecture, and steadily increase the depth of the network by adding more convolutional layers, which is feasible due to the use of very small (3×3) convolution filters in all layers.
As a result, we come up with significantly more accurate ConvNet architectures, which not only achieve state-of-the-art accuracy on the ILSVRC classification and localisation tasks, but are also applicable to other image recognition datasets, where they achieve excellent performance even when used as part of relatively simple pipelines (e.g. deep features classified by a linear SVM without fine-tuning). We have released our two best-performing models to facilitate further research.
The rest of the paper is organised as follows. In Sect. 2, we describe our ConvNet configurations. The details of the image classification training and evaluation are then presented in Sect. 3, and the configurations are compared on the ILSVRC classification task in Sect. 4. Sect. 5 concludes the paper. For completeness, we also describe and assess our ILSVRC-2014 object localisation system in Appendix A, and discuss the generalisation of very deep features to other datasets in Appendix B. Finally, Appendix C contains the list of major paper revisions.
ConvNet Configurations
To measure the improvement brought by the increased ConvNet depth in a fair setting, all our ConvNet layer configurations are designed using the same principles, inspired by Ciresan et al. (2011); Krizhevsky et al. (2012). In this section, we first describe a generic layout of our ConvNet configurations (Sect. 2.1) and then detail the specific configurations used in the evaluation (Sect. 2.2). Our design choices are then discussed and compared to the prior art in Sect. 2.3.
Architecture
During training, the input to our ConvNets is a fixed-size 224×224 RGB image. The only pre-processing we do is subtracting the mean RGB value, computed on the training set, from each pixel. The image is passed through a stack of convolutional (conv.) layers, where we use filters with a very small receptive field: 3×3 (which is the smallest size to capture the notion of left/right, up/down, center). In one of the configurations we also utilise 1×1 convolution filters, which can be seen as a linear transformation of the input channels (followed by non-linearity). The convolution stride is fixed to 1 pixel; the spatial padding of conv. layer input is such that the spatial resolution is preserved after convolution, i.e. the padding is 1 pixel for 3×3 conv. layers. Spatial pooling is carried out by five max-pooling layers, which follow some of the conv. layers (not all the conv. layers are followed by max-pooling). Max-pooling is performed over a 2×2 pixel window, with stride 2.
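The effect of these choices on the spatial resolution can be traced with a few lines of Python. This is a minimal sketch rather than the authors' code: the helper names are hypothetical, the output-size formula is the standard one for convolution and pooling, and the default number of conv. layers per block is taken from configuration D in Table 1.

```python
def conv_out(size, kernel=3, stride=1, pad=1):
    """Output side length of a convolution applied to a square input."""
    return (size + 2 * pad - kernel) // stride + 1


def pool_out(size, window=2, stride=2):
    """Output side length of max-pooling applied to a square input."""
    return (size - window) // stride + 1


def trace_resolution(size=224, convs_per_block=(2, 2, 3, 3, 3)):
    """Side length after each conv. block followed by its max-pooling layer."""
    sizes = []
    for n_convs in convs_per_block:
        for _ in range(n_convs):
            size = conv_out(size)  # 3x3, stride 1, padding 1: resolution preserved
        size = pool_out(size)      # 2x2, stride 2: resolution halved
        sizes.append(size)
    return sizes


print(trace_resolution())  # [112, 56, 28, 14, 7]
```

The five pooling layers reduce a 224×224 input to a 7×7 feature map, regardless of how many 3×3 conv. layers sit between them.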
A stack of convolutional layers (which has a different depth in different architectures) is followed by three Fully-Connected (FC) layers: the first two have 4096 channels each, the third performs 1000-way ILSVRC classification and thus contains 1000 channels (one for each class). The final layer is the soft-max layer. The configuration of the fully connected layers is the same in all networks.
All hidden layers are equipped with the rectification (ReLU (Krizhevsky et al., 2012)) non-linearity. We note that none of our networks (except for one) contain Local Response Normalisation (LRN) (Krizhevsky et al., 2012): as will be shown in Sect. 4, such normalisation does not improve the performance on the ILSVRC dataset, but leads to increased memory consumption and computation time. Where applicable, the parameters for the LRN layer are those of (Krizhevsky et al., 2012).
Configurations
The ConvNet configurations, evaluated in this paper, are outlined in Table 1, one per column. In the following we will refer to the nets by their names (A–E). All configurations follow the generic design presented in Sect. 2.1, and differ only in the depth: from 11 weight layers in the network A (8 conv. and 3 FC layers) to 19 weight layers in the network E (16 conv. and 3 FC layers). The width of conv. layers (the number of channels) is rather small, starting from 64 in the first layer and then increasing by a factor of 2 after each max-pooling layer, until it reaches 512.
In Table 2 we report the number of parameters for each configuration. In spite of a large depth, the number of weights in our nets is not greater than the number of weights in a more shallow net with larger conv. layer widths and receptive fields (144M weights in (Sermanet et al., 2014)).
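The parameter counts in Table 2 can be reproduced from the layer lists in Table 1. The sketch below is an illustration under stated assumptions, not the authors' tooling: the `CONFIGS` dictionary and both function names are hypothetical, biases are counted, the LRN layer is assumed to have no learnable parameters (so A-LRN equals A), and configuration C is assumed to use 1×1 filters for its added layers, as the text describes.

```python
# Each block lists (kernel_size, out_channels); a max-pooling layer follows each block.
CONFIGS = {
    "A": [[(3, 64)], [(3, 128)], [(3, 256)] * 2, [(3, 512)] * 2, [(3, 512)] * 2],
    "B": [[(3, 64)] * 2, [(3, 128)] * 2, [(3, 256)] * 2, [(3, 512)] * 2, [(3, 512)] * 2],
    "C": [[(3, 64)] * 2, [(3, 128)] * 2, [(3, 256)] * 2 + [(1, 256)],
          [(3, 512)] * 2 + [(1, 512)], [(3, 512)] * 2 + [(1, 512)]],
    "D": [[(3, 64)] * 2, [(3, 128)] * 2, [(3, 256)] * 3, [(3, 512)] * 3, [(3, 512)] * 3],
    "E": [[(3, 64)] * 2, [(3, 128)] * 2, [(3, 256)] * 4, [(3, 512)] * 4, [(3, 512)] * 4],
}


def weight_layers(blocks, n_fc=3):
    """Number of layers with learnable weights (conv. plus FC)."""
    return sum(len(block) for block in blocks) + n_fc


def count_parameters(blocks, in_channels=3, fc_sizes=(4096, 4096, 1000), final_side=7):
    """Total number of weights and biases for a 224x224 input."""
    total = 0
    channels = in_channels
    for block in blocks:
        for kernel, out_channels in block:
            total += kernel * kernel * channels * out_channels + out_channels
            channels = out_channels
    features = final_side * final_side * channels  # 7 x 7 x 512 after five poolings
    for size in fc_sizes:
        total += features * size + size
        features = size
    return total


for name, blocks in CONFIGS.items():
    millions = round(count_parameters(blocks) / 1e6)
    print(name, weight_layers(blocks), millions)
```

Under these assumptions the rounded totals agree with Table 2, and most of the weights sit in the first fully-connected layer rather than in the conv. stack.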
Table 1: ConvNet configurations (shown in columns). The depth of the configurations increases from the left (A) to the right (E), as more layers are added (the added layers are marked with *). The convolutional layer parameters are denoted as "conv<receptive field size>-<number of channels>". The ReLU activation function is not shown for brevity.
ConvNet Configuration
A            A-LRN        B            C            D            E
11 weight    11 weight    13 weight    16 weight    16 weight    19 weight
layers       layers       layers       layers       layers       layers
input (224 × 224 RGB image)
conv3-64     conv3-64     conv3-64     conv3-64     conv3-64     conv3-64
-            LRN*         conv3-64*    conv3-64     conv3-64     conv3-64
maxpool
conv3-128    conv3-128    conv3-128    conv3-128    conv3-128    conv3-128
-            -            conv3-128*   conv3-128    conv3-128    conv3-128
maxpool
conv3-256    conv3-256    conv3-256    conv3-256    conv3-256    conv3-256
conv3-256    conv3-256    conv3-256    conv3-256    conv3-256    conv3-256
-            -            -            conv1-256*   conv3-256*   conv3-256
-            -            -            -            -            conv3-256*
maxpool
conv3-512    conv3-512    conv3-512    conv3-512    conv3-512    conv3-512
conv3-512    conv3-512    conv3-512    conv3-512    conv3-512    conv3-512
-            -            -            conv1-512*   conv3-512*   conv3-512
-            -            -            -            -            conv3-512*
maxpool
conv3-512    conv3-512    conv3-512    conv3-512    conv3-512    conv3-512
conv3-512    conv3-512    conv3-512    conv3-512    conv3-512    conv3-512
-            -            -            conv1-512*   conv3-512*   conv3-512
-            -            -            -            -            conv3-512*
maxpool
FC-4096
FC-4096
FC-1000
softmax
Table 2: Number of parameters (in millions).
Network                 A, A-LRN   B     C     D     E
Number of parameters    133        133   134   138   144
Discussion
Our ConvNet configurations are quite different from the ones used in the top-performing entries of the ILSVRC-2012 and ILSVRC-2013 competitions. Rather than using relatively large receptive fields in the first conv. layers (e.g. 11×11 with stride 4 in (Krizhevsky et al., 2012), or 7×7 with stride 2 in (Zeiler & Fergus, 2013; Sermanet et al., 2014)), we use very small 3×3 receptive fields throughout the whole net, which are convolved with the input at every pixel (with stride 1). It is easy to see that a stack of two 3×3 conv. layers (without spatial pooling in between) has an effective receptive field of 5×5; three such layers have a 7×7 effective receptive field. So what have we gained by using, for instance, a stack of three 3×3 conv. layers instead of a single 7×7 layer? First, we incorporate three non-linear rectification layers instead of a single one, which makes the decision function more discriminative. Second, we decrease the number of parameters: assuming that both the input and the output of a three-layer 3×3 convolution stack have C channels, the stack is parametrised by $3(3^2C^2) = 27C^2$ weights; at the same time, a single 7×7 conv. layer would require $7^2C^2 = 49C^2$ parameters, i.e. 81% more. This can be seen as imposing a regularisation on the 7×7 conv. filters, forcing them to have a decomposition through the 3×3 filters (with non-linearity injected in between).
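The receptive-field and parameter arithmetic in this comparison can be checked directly. The following is a small sketch with hypothetical helper names; it ignores biases, as the text does, and picks C = 512 only as an example value, since the ratio does not depend on C.

```python
def effective_receptive_field(n_layers, kernel=3):
    """Receptive field of n stacked stride-1 convolutions with no pooling in between."""
    return 1 + n_layers * (kernel - 1)


def stack_weights(n_layers, kernel, channels):
    """Weights (biases ignored) in n conv. layers with `channels` inputs and outputs."""
    return n_layers * kernel * kernel * channels * channels


C = 512  # example channel count; any positive value gives the same ratio
three_3x3 = stack_weights(3, 3, C)  # 27 * C^2
one_7x7 = stack_weights(1, 7, C)    # 49 * C^2
extra = one_7x7 / three_3x3 - 1     # relative increase of the single 7x7 layer

print(effective_receptive_field(2), effective_receptive_field(3))  # 5 7
print(f"{extra:.0%}")  # 81%
```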
The incorporation of 1×1 conv. layers (configuration C, Table 1) is a way to increase the non-linearity of the decision function without affecting the receptive fields of the conv. layers. Even though in our case the 1×1 convolution is essentially a linear projection onto the space of the same dimensionality (the number of input and output channels is the same), an additional non-linearity is introduced by the rectification function. It should be noted that 1×1 conv. layers have recently been utilised in the "Network in Network" architecture of Lin et al. (2014).
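To make the "linear projection followed by rectification" reading concrete, the sketch below applies a 1×1 convolution to a tiny feature map in plain Python. The function name and the toy numbers are illustrative assumptions; the point is only that each output pixel depends on the same input pixel, so the receptive field is unchanged.

```python
def conv1x1_relu(feature_map, weights, biases):
    """Apply a 1x1 convolution followed by ReLU.

    feature_map[y][x] is a list of C_in channel values;
    weights is a C_out x C_in matrix shared by every pixel.
    """
    out = []
    for row in feature_map:
        out_row = []
        for pixel in row:
            projected = [sum(w * v for w, v in zip(w_row, pixel)) + b
                         for w_row, b in zip(weights, biases)]
            out_row.append([max(0.0, p) for p in projected])  # rectification
        out.append(out_row)
    return out


fmap = [[[1.0, -2.0], [0.5, 0.5]]]  # a 1 x 2 map with 2 channels
W = [[1.0, 1.0], [1.0, -1.0]]       # same number of input and output channels
b = [0.0, 0.0]
print(conv1x1_relu(fmap, W, b))     # [[[0.0, 3.0], [1.0, 0.0]]]
```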
Small-size convolution filters have been previously used by Ciresan et al. (2011), but their nets are significantly less deep than ours, and they did not evaluate on the large-scale ILSVRC dataset. Goodfellow et al. (2014) applied deep ConvNets (11 weight layers) to the task of street number recognition, and showed that the increased depth led to better performance. GoogLeNet (Szegedy et al., 2014), a top-performing entry of the ILSVRC-2014 classification task, was developed independently of our work, but is similar in that it is based on very deep ConvNets (22 weight layers) and small convolution filters (apart from 3×3, they also use 1×1 and 5×5 convolutions). Their network topology is, however, more complex than ours, and the spatial resolution of the feature maps is reduced more aggressively in the first layers to decrease the amount of computation. As will be shown in Sect. 4.5, our model outperforms that of Szegedy et al. (2014) in terms of the single-network classification accuracy.
Classification Framework
In the previous section we presented the details of our network configurations. In this section, we describe the details of classification ConvNet training and evaluation.
Training
The ConvNet training procedure generally follows Krizhevsky et al. (2012) (except for sampling the input crops from multi-scale training images, as explained later). Namely, the training is carried out by optimising the multinomial logistic regression objective using mini-batch gradient descent (based on back-propagation (LeCun et al., 1989)) with momentum. The batch size was set to 256, momentum to 0.9. The training was regularised by weight decay (the $L_2$ penalty multiplier set to $5 \cdot 10^{-4}$) and dropout regularisation for the first two fully-connected layers (dropout ratio set to 0.5). The learning rate was initially set to $10^{-2}$, and then decreased by a factor of 10 when the validation set accuracy stopped improving. In total, the learning rate was decreased 3 times, and the learning was stopped after 370K iterations (74 epochs). We conjecture that in spite of the larger number of parameters and the greater depth of our nets compared to (Krizhevsky et al., 2012), the nets required fewer epochs to converge due to (a) implicit regularisation imposed by greater depth and smaller conv. filter sizes; (b) pre-initialisation of certain layers.
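A single update of momentum SGD with weight decay can be sketched as follows. This is an illustration, not the authors' Caffe solver: the function names are hypothetical, the update form (weight decay added to the gradient before the momentum step) is one common formulation and is assumed here, and the default hyper-parameters are the values reported for this training setup (momentum 0.9, $L_2$ multiplier $5 \cdot 10^{-4}$, initial learning rate $10^{-2}$).

```python
def sgd_momentum_step(w, v, grad, lr, momentum=0.9, weight_decay=5e-4):
    """One update of a single scalar weight; returns the new (weight, velocity)."""
    v = momentum * v - lr * (grad + weight_decay * w)  # L2 penalty enters as an extra gradient term
    return w + v, v


def drop_learning_rate(lr, factor=10.0):
    """Decrease the learning rate once validation accuracy stops improving."""
    return lr / factor


lr = 1e-2
w, v = sgd_momentum_step(w=1.0, v=0.0, grad=0.5, lr=lr)
print(w, v)

# Three decreases take the learning rate from 1e-2 down to roughly 1e-5.
for _ in range(3):
    lr = drop_learning_rate(lr)
print(lr)
```

In a real training loop the decision to call `drop_learning_rate` would be driven by the validation accuracy curve, which this sketch leaves out.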
The initialisation of the network weights is important, since bad initialisation can stall learning due to the instability of gradient in deep nets. To circumvent this problem, we began with training the configuration A (Table 1), shallow enough to be trained with random initialisation. Then, when training deeper architectures, we initialised the first four convolutional layers and the last three fully-connected layers with the layers of net A (the intermediate layers were initialised randomly). We did not decrease the learning rate for the pre-initialised layers, allowing them to change during learning. For random initialisation (where applicable), we sampled the weights from a normal distribution with zero mean and $10^{-2}$ variance. The biases were initialised with zero. It is worth noting that after the paper submission we found that it is possible to initialise the weights without pre-training by using the random initialisation procedure of Glorot & Bengio (2010).
To obtain the fixed-size 224×224 ConvNet input images, they were randomly cropped from rescaled training images (one crop per image per SGD iteration). To further augment the training set, the crops underwent random horizontal flipping and random RGB colour shift (Krizhevsky et al., 2012). Training image rescaling is explained below.
Training image size. Let $S$ be the smallest side of an isotropically-rescaled training image, from which the ConvNet input is cropped (we also refer to $S$ as the training scale). While the crop size is fixed to 224×224, in principle $S$ can take on any value not less than 224: for $S = 224$ the crop will capture whole-image statistics, completely spanning the smallest side of a training image; for $S \gg 224$ the crop will correspond to a small part of the image, containing a small object or an object part.
We consider two approaches for setting the training scale $S$. The first is to fix $S$, which corresponds to single-scale training (note that image content within the sampled crops can still represent multi-scale image statistics). In our experiments, we evaluated models trained at two fixed scales: $S = 256$ (which has been widely used in the prior art) and $S = 384$. Given a ConvNet configuration, we first trained the network using $S = 256$. To speed up training of the $S = 384$ network, it was initialised with the weights pre-trained with $S = 256$, and we used a smaller initial learning rate of $10^{-3}$.
The second approach to setting $S$ is multi-scale training, where each training image is individually rescaled by randomly sampling $S$ from a certain range $[S_{min}, S_{max}]$ (we used $S_{min} = 256$ and $S_{max} = 512$). Since objects in images can be of different size, it is beneficial to take this into account during training. This can also be seen as training set augmentation by scale jittering, where a single model is trained to recognise objects over a wide range of scales. For speed reasons, we trained multi-scale models by fine-tuning all layers of a single-scale model with the same configuration, pre-trained with fixed $S = 384$.
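The sampling of a training crop under scale jittering can be sketched without any image library by working on image dimensions only. The function names below are hypothetical, and drawing the scale and crop position uniformly is an assumption (the text says only that they are sampled randomly); the range [256, 512] and the 224×224 crop follow the training setup described here.

```python
import random


def rescaled_size(width, height, s):
    """Isotropically rescale so that the smallest side equals s."""
    scale = s / min(width, height)
    return round(width * scale), round(height * scale)


def sample_training_crop(width, height, rng, s_min=256, s_max=512, crop=224):
    """Pick a training scale S, then a random crop box in the rescaled image."""
    s = rng.randint(s_min, s_max)  # scale jittering; s_min == s_max gives a fixed S
    new_w, new_h = rescaled_size(width, height, s)
    left = rng.randint(0, new_w - crop)
    top = rng.randint(0, new_h - crop)
    flip = rng.random() < 0.5      # random horizontal flip
    return s, (left, top, left + crop, top + crop), flip


rng = random.Random(0)
print(rescaled_size(500, 375, 256))  # (341, 256)
print(sample_training_crop(500, 375, rng))
```

At S = 256 the crop covers most of the smallest side, while at S = 512 it covers less than half of it, which is how a single model comes to see objects at a range of scales.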
Testing
At test time, given a trained ConvNet and an input image, it is classified in the following way. First, it is isotropically rescaled to a pre-defined smallest image side, denoted as $Q$ (we also refer to it as the test scale). We note that $Q$ is not necessarily equal to the training scale $S$ (as we will show in Sect. 4, using several values of $Q$ for each $S$ leads to improved performance). Then, the network is applied densely over the rescaled test image in a way similar to (Sermanet et al., 2014). Namely, the fully-connected layers are first converted to convolutional layers (the first FC layer to a 7×7 conv. layer, the last two FC layers to 1×1 conv. layers). The resulting fully-convolutional net is then applied to the whole (uncropped) image. The result is a class score map with the number of channels equal to the number of classes, and a variable spatial resolution, dependent on the input image size. Finally, to obtain a fixed-size vector of class scores for the image, the class score map is spatially averaged (sum-pooled). We also augment the test set by horizontal flipping of the images; the soft-max class posteriors of the original and flipped images are averaged to obtain the final scores for the image.
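The size of the class score map and the final spatial averaging can be illustrated as follows. This is a rough sketch with hypothetical function names; it assumes that each pooling layer halves the side with floor rounding and that the converted 7×7 layer is applied without padding, neither of which is spelled out in the text.

```python
def score_map_side(image_side, n_pools=5, fc_kernel=7):
    """Side of the class score map for a square image in the fully-convolutional net."""
    feature_side = image_side
    for _ in range(n_pools):
        feature_side //= 2               # each 2x2, stride-2 max-pooling halves the side
    return feature_side - fc_kernel + 1  # first FC layer applied as a 7x7 convolution


def spatial_average(score_map):
    """score_map[y][x] is a list of class scores; returns one averaged score per class."""
    cells = [cell for row in score_map for cell in row]
    n_classes = len(cells[0])
    return [sum(cell[c] for cell in cells) / len(cells) for c in range(n_classes)]


print(score_map_side(224))  # 1: the training-size input yields a single score vector
print(score_map_side(384))  # 6: a larger test image yields a 6x6 map of score vectors
print(spatial_average([[[1.0, 0.0], [0.0, 1.0]]]))  # [0.5, 0.5]
```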
Since the fully-convolutional network is applied over the whole image, there is no need to sample multiple crops at test time (Krizhevsky et al., 2012), which is less efficient as it requires network re-computation for each crop. At the same time, using a large set of crops, as done by Szegedy et al. (2014), can lead to improved accuracy, as it results in a finer sampling of the input image compared to the fully-convolutional net. Also, multi-crop evaluation is complementary to dense evaluation due to different convolution boundary conditions: when applying a ConvNet to a crop, the convolved feature maps are padded with zeros, while in the case of dense evaluation the padding for the same crop naturally comes from the neighbouring parts of an image (due to both the convolutions and spatial pooling), which substantially increases the overall network receptive field, so more context is captured. While we believe that in practice the increased computation time of multiple crops does not justify the potential gains in accuracy, for reference we also evaluate our networks using 50 crops per scale (5×5 regular grid with 2 flips), for a total of 150 crops over 3 scales, which is comparable to 144 crops over 4 scales used by Szegedy et al. (2014).
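The crop count quoted above follows from a 5×5 grid with two flips at each of three scales. The sketch below enumerates such a grid; the function name is hypothetical, and spacing the grid evenly from one image border to the other is an assumption, since the text does not say how the grid is placed.

```python
def crop_grid(width, height, crop=224, grid=5):
    """Top-left corners of a regular grid of crops, each with and without a flip."""
    xs = [round(i * (width - crop) / (grid - 1)) for i in range(grid)]
    ys = [round(i * (height - crop) / (grid - 1)) for i in range(grid)]
    return [(x, y, flip) for y in ys for x in xs for flip in (False, True)]


scales = (256, 384, 512)  # the three test scales used with scale-jittered training
crops_per_scale = [len(crop_grid(q, q)) for q in scales]
print(crops_per_scale, sum(crops_per_scale))  # [50, 50, 50] 150
```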
Implementation Details
Our implementation is derived from the publicly available C++ Caffe toolbox (Jia, 2013) (branched out in December 2013), but contains a number of significant modifications, allowing us to perform training and evaluation on multiple GPUs installed in a single system, as well as train and evaluate on full-size (uncropped) images at multiple scales (as described above). Multi-GPU training exploits data parallelism, and is carried out by splitting each batch of training images into several GPU batches, processed in parallel on each GPU. After the GPU batch gradients are computed, they are averaged to obtain the gradient of the full batch. Gradient computation is synchronous across the GPUs, so the result is exactly the same as when training on a single GPU.
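Why averaging the per-GPU gradients reproduces the full-batch gradient can be seen on a toy problem. The sketch below uses a scalar least-squares model in place of a ConvNet and plain Python lists in place of GPUs; all names are hypothetical, and the batch is assumed to split evenly across devices.

```python
def mean_gradient(batch, w):
    """Gradient of the mean of 0.5 * (w * x - y)^2 over the batch, w.r.t. scalar w."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_gradient(batch, w, n_gpus=4):
    """Split the batch into equal shards, compute shard gradients, then average them."""
    size = len(batch) // n_gpus
    shards = [batch[i * size:(i + 1) * size] for i in range(n_gpus)]
    return sum(mean_gradient(shard, w) for shard in shards) / n_gpus


batch = [(float(i), 3.0 * i + 1.0) for i in range(256)]  # batch of 256, as in training
full = mean_gradient(batch, w=0.5)
parallel = data_parallel_gradient(batch, w=0.5)
print(full, parallel)  # equal up to floating-point rounding
```

The equality relies on the shards having the same size; with unequal shards the average would need to be weighted by shard size.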
While more sophisticated methods of speeding up ConvNet training have been recently proposed (Krizhevsky, 2014), which employ model and data parallelism for different layers of the net, we have found that our conceptually much simpler scheme already provides a speedup of 3.75 times on an off-the-shelf 4-GPU system, as compared to using a single GPU. On a system equipped with four NVIDIA Titan Black GPUs, training a single net took 2-3 weeks depending on the architecture.
Classification Experiments
Dataset. In this section, we present the image classification results achieved by the described ConvNet architectures on the ILSVRC-2012 dataset (which was used for ILSVRC 2012–2014 challenges). The dataset includes images of 1000 classes, and is split into three sets: training (1.3M images), validation (50K images), and testing (100K images with held-out class labels). The classification performance is evaluated using two measures: the top-1 and top-5 error. The former is a multi-class classification error, i.e. the proportion of incorrectly classified images; the latter is the main evaluation criterion used in ILSVRC, and is computed as the proportion of images such that the ground-truth category is outside the top-5 predicted categories.
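Both error measures are instances of a top-k error, which can be written in a few lines. The function name and the toy scores below are illustrative assumptions, not part of the ILSVRC evaluation toolkit.

```python
def top_k_error(scores, labels, k):
    """Fraction of samples whose true label is not among the k highest-scoring classes."""
    wrong = 0
    for class_scores, label in zip(scores, labels):
        ranked = sorted(range(len(class_scores)),
                        key=lambda c: class_scores[c], reverse=True)
        if label not in ranked[:k]:
            wrong += 1
    return wrong / len(labels)


scores = [[0.1, 0.7, 0.2],
          [0.5, 0.3, 0.2],
          [0.2, 0.3, 0.5]]
labels = [1, 1, 0]
print(top_k_error(scores, labels, k=1))  # 2 of 3 samples are wrong
print(top_k_error(scores, labels, k=2))  # 1 of 3 samples is wrong
```

With 1000 classes, `k=1` gives the top-1 error and `k=5` the top-5 error reported in the tables.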
For the majority of experiments, we used the validation set as the test set. Certain experiments were also carried out on the test set and submitted to the official ILSVRC server as a “VGG” team entry to the ILSVRC-2014 competition (Russakovsky et al., 2014).
Single Scale Evaluation
We begin with evaluating the performance of individual ConvNet models at a single scale with the layer configurations described in Sect. 2.2. The test image size was set as follows: $Q = S$ for fixed $S$, and $Q = 0.5(S_{min} + S_{max})$ for jittered $S \in [S_{min}, S_{max}]$. The results are shown in Table 3.
First, we note that using local response normalisation (A-LRN network) does not improve on the model A without any normalisation layers. We thus do not employ normalisation in the deeper architectures (B–E).
Second, we observe that the classification error decreases with the increased ConvNet depth: from 11 layers in A to 19 layers in E. Notably, in spite of the same depth, the configuration C (which contains three 1 × 1 conv. layers), performs worse than the configuration D, which uses 3 × 3 conv. layers throughout the network. This indicates that while the additional non-linearity does help (C is better than B), it is also important to capture spatial context by using conv. filters with non-trivial receptive fields (D is better than C). The error rate of our architecture saturates when the depth reaches 19 layers, but even deeper models might be beneficial for larger datasets. We also compared the net B with a shallow net with five 5 × 5 conv. layers, which was derived from B by replacing each pair of 3 × 3 conv. layers with a single 5 × 5 conv. layer (which has the same receptive field as explained in Sect. 2.3). The top-1 error of the shallow net was measured to be 7% higher than that of B (on a center crop), which confirms that a deep net with small filters outperforms a shallow net with larger filters.
Finally, scale jittering at training time ($S \in [256; 512]$) leads to significantly better results than training on images with fixed smallest side ($S = 256$ or $S = 384$), even though a single scale is used at test time. This confirms that training set augmentation by scale jittering is indeed helpful for capturing multi-scale image statistics.
Table 3: ConvNet performance at a single test scale.
ConvNet config.   smallest image side      top-1 val.   top-5 val.
(Table 1)         train (S)    test (Q)    error (%)    error (%)
A                 256          256         29.6         10.4
A-LRN             256          256         29.7         10.5
B                 256          256         28.7         9.9
C                 256          256         28.1         9.4
C                 384          384         28.1         9.3
C                 [256;512]    384         27.3         8.8
D                 256          256         27.0         8.8
D                 384          384         26.8         8.7
D                 [256;512]    384         25.6         8.1
E                 256          256         27.3         9.0
E                 384          384         26.9         8.7
E                 [256;512]    384         25.5         8.0
Multi-Scale Evaluation
Having evaluated the ConvNet models at a single scale, we now assess the effect of scale jittering at test time. It consists of running a model over several rescaled versions of a test image (corresponding to different values of $Q$), followed by averaging the resulting class posteriors. Considering that a large discrepancy between training and testing scales leads to a drop in performance, the models trained with fixed $S$ were evaluated over three test image sizes, close to the training one: $Q = \{S - 32, S, S + 32\}$. At the same time, scale jittering at training time allows the network to be applied to a wider range of scales at test time, so the model trained with variable $S \in [S_{min}, S_{max}]$ was evaluated over a larger range of sizes $Q = \{S_{min}, 0.5(S_{min} + S_{max}), S_{max}\}$.
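The test scales listed in the "test (Q)" columns of Tables 3 and 4 follow a simple rule that depends on whether the training scale was fixed or jittered. A small sketch, with hypothetical function names and integer arithmetic assumed for the midpoint:

```python
def single_test_scale(s_min, s_max=None):
    """Test scale for single-scale evaluation (Table 3)."""
    if s_max is None:              # fixed training scale S
        return s_min
    return (s_min + s_max) // 2    # jittered training scale in [s_min, s_max]


def multi_test_scales(s_min, s_max=None, offset=32):
    """Test scales for multi-scale evaluation (Table 4)."""
    if s_max is None:              # stay close to the fixed training scale
        return [s_min - offset, s_min, s_min + offset]
    return [s_min, (s_min + s_max) // 2, s_max]


print(single_test_scale(256), single_test_scale(256, 512))  # 256 384
print(multi_test_scales(256))       # [224, 256, 288]
print(multi_test_scales(384))       # [352, 384, 416]
print(multi_test_scales(256, 512))  # [256, 384, 512]
```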
The results, presented in Table 4, indicate that scale jittering at test time leads to better performance (as compared to evaluating the same model at a single scale, shown in Table 3). As before, the deepest configurations (D and E) perform the best, and scale jittering is better than training with a fixed smallest side. Our best single-network performance on the validation set is 24.8%/7.5% top-1/top-5 error (the rows for D and E trained with $S \in [256; 512]$ in Table 4). On the test set, the configuration E achieves 7.3% top-5 error.
Table 4: ConvNet performance at multiple test scales.
ConvNet config.   smallest image side         top-1 val.   top-5 val.
(Table 1)         train (S)    test (Q)       error (%)    error (%)
B                 256          224,256,288    28.2         9.6
C                 256          224,256,288    27.7         9.2
C                 384          352,384,416    27.8         9.2
C                 [256;512]    256,384,512    26.3         8.2
D                 256          224,256,288    26.6         8.6
D                 384          352,384,416    26.5         8.6
D                 [256;512]    256,384,512    24.8         7.5
E                 256          224,256,288    26.9         8.7
E                 384          352,384,416    26.7         8.6
E                 [256;512]    256,384,512    24.8         7.5
In Table 5 we compare dense ConvNet evaluation with multi-crop evaluation (see Sect. 3.2 for details). We also assess the complementarity of the two evaluation techniques by averaging their soft-max outputs. As can be seen, using multiple crops performs slightly better than dense evaluation, and the two approaches are indeed complementary, as their combination outperforms each of them. As noted above, we hypothesize that this is due to a different treatment of convolution boundary conditions.
Table 5: ConvNet evaluation techniques comparison. In all experiments the training scale S was sampled from [256;512], and three test scales Q were considered: {256,384,512}.
ConvNet config.   Evaluation method     top-1 val.   top-5 val.
(Table 1)                               error (%)    error (%)
D                 dense                 24.8         7.5
D                 multi-crop            24.6         7.5
D                 multi-crop & dense    24.4         7.2
E                 dense                 24.8         7.5
E                 multi-crop            24.6         7.4
E                 multi-crop & dense    24.4         7.1
ConvNet Fusion
Up until now, we evaluated the performance of individual ConvNet models. In this part of the experiments, we combine the outputs of several models by averaging their soft-max class posteriors. This improves the performance due to complementarity of the models, and was used in the top ILSVRC submissions in 2012 (Krizhevsky et al., 2012) and 2013 (Zeiler & Fergus, 2013; Sermanet et al.,2014).
到目前为止,我们评估了独立ConvNet模型的性能。在这一部分的实验中,我们通过平均多个模型softmax后验概率值来组合几个模型的输出。由于模型的互补性,提高了性能,并在2012年和2013年的ILSVRC最佳结果中使用。
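The fusion step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the toy four-class posteriors are assumptions made for the example, and a real ensemble would average 1000-way ILSVRC posteriors.

```python
def average_posteriors(model_posteriors):
    """Average the soft-max class posteriors of several models for one image."""
    num_models = len(model_posteriors)
    num_classes = len(model_posteriors[0])
    return [
        sum(p[c] for p in model_posteriors) / num_models
        for c in range(num_classes)
    ]


def top_k(posterior, k):
    """Return the indices of the k highest-scoring classes, best first."""
    order = sorted(range(len(posterior)), key=lambda c: posterior[c], reverse=True)
    return order[:k]


# Hypothetical posteriors of two models over four classes (illustrative values).
model_a = [0.10, 0.60, 0.20, 0.10]
model_b = [0.05, 0.35, 0.50, 0.10]
fused = average_posteriors([model_a, model_b])
```

Because each input is a probability distribution, the averaged vector is one as well, so top-1 and top-5 predictions can be read from it directly with `top_k`.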
The results are shown in Table 6. By the time of ILSVRC submission we had only trained the single-scale networks, as well as a multi-scale model D (by fine-tuning only the fully-connected layers rather than all layers). The resulting ensemble of 7 networks has 7.3% ILSVRC test error. After the submission, we considered an ensemble of only two best-performing multi-scale models (configurations D and E), which reduced the test error to 7.0% using dense evaluation and 6.8% using combined dense and multi-crop evaluation. For reference, our best-performing single model achieves 7.1% error (model E, Table 5).
结果见表6。在ILSVRC提交时,我们只训练了单尺度网络,以及多尺度模型D(只微调全连接层而不是所有层)。由此得到的7个网络的组合达到了7.3%的ILSVRC测试错误率。在提交后,我们考虑了只有两个性能最好的多尺度模型(配置D和E)的组合,使用密集评估将测试错误率降低到7.0%,使用密集和多重裁切组合评估将测试错误率降低到6.8%。作为参考,我们性能最好的单一模型达到7.1%的错误率(模型E,表5)。
Table 6: Multiple ConvNet fusion results.
Combined ConvNet models | top-1 val. error (%) | top-5 val. error (%) | top-5 test error (%)
ILSVRC submission
(D/256/224,256,288), (D/384/352,384,416), (D/[256;512]/256,384,512), (C/256/224,256,288), (C/384/352,384,416), (E/256/224,256,288), (E/384/352,384,416) | 24.7 | 7.5 | 7.3
post-submission
(D/[256;512]/256,384,512), (E/[256;512]/256,384,512), dense eval. | 24.0 | 7.1 | 7.0
(D/[256;512]/256,384,512), (E/[256;512]/256,384,512), multi-crop | 23.9 | 7.2 | -
(D/[256;512]/256,384,512), (E/[256;512]/256,384,512), multi-crop & dense eval. | 23.7 | 6.8 | 6.8
Comparison with the State of the Art
Finally, we compare our results with the state of the art in Table 7. In the classification task of ILSVRC-2014 challenge (Russakovsky et al., 2014), our “VGG” team secured the 2nd place with 7.3% test error using an ensemble of 7 models. After the submission, we decreased the error rate to 6.8% using an ensemble of 2 models.
最后,我们将我们的结果与表7中的最新技术进行了比较。在ILSVRC-2014挑战赛的分类任务中,我们的“VGG”团队使用7个模型的集合,以7.3%的测试错误率获得了第二名。提交后,我们使用两个模型的集合将错误率降低到6.8%。
As can be seen from Table 7, our very deep ConvNets significantly outperform the previous generation of models, which achieved the best results in the ILSVRC-2012 and ILSVRC-2013 competitions. Our result is also competitive with respect to the classification task winner (GoogLeNet with 6.7% error) and substantially outperforms the ILSVRC-2013 winning submission Clarifai, which achieved 11.2% with outside training data and 11.7% without it. This is remarkable, considering that our best result is achieved by combining just two models, significantly fewer than used in most ILSVRC submissions. In terms of the single-net performance, our architecture achieves the best result (7.0% test error), outperforming a single GoogLeNet by 0.9%. Notably, we did not depart from the classical ConvNet architecture of LeCun et al. (1989), but improved it by substantially increasing the depth.
从表7可以看出,我们的超深度ConvNets明显优于上一代,并且在ILSVRC-2012和ILSVRC-2013竞赛中取得最佳成绩。我们的结果与分类任务的(GoogLeNet,错误率为6.7%)相比也极具竞争力,并且大大优于ILSVRC-2013的获胜者Clarifai,它在使用外部训练数据的情况下错误率为11.2%,在没有外部训练数据的情况下错误率为11.7%。更加重要的是,考虑到我们的最佳结果是通过结合两个模型实现的——明显少于大多数ILSVRC参赛使用的模型。在单一网络性能方面,我们的模型结构取得了最好的结果(7.0%的测试错误率),比单一的GoogLeNet低0.9%。值得注意的是,我们并没有脱离LeCun等人的经典卷积网络架构,而是通过大幅度增加深度来提高它的性能。
Table 7: Comparison with the state of the art in ILSVRC classification. Our method is denoted as “VGG”. Only the results obtained without outside training data are reported.
表7:与ILSVRC分类上所使用的最新技术的比较。我们的方法称为“VGG”。只显示了没有使用外部训练数据的结果。
Method top-1 val. error (%) top-5 val. error (%) top-5 test. error (%)
VGG (2 nets, multi-crop & dense eval.) 23.7 6.8 6.8
VGG (1 net, multi-crop & dense eval.) 24.4 7.1 7.0
VGG (ILSVRC submission, 7 nets, dense eval.) 24.7 7.5 7.3
GoogLeNet (Szegedy et al., 2014) (1 net) - 7.9
GoogLeNet (Szegedy et al., 2014) (7 nets) - 6.7
MSRA (He et al., 2014) (11 nets) - - 8.1
MSRA (He et al., 2014) (1 net) 27.9 9.1 9.1
Clarifai (Russakovsky et al., 2014) (multiple nets) - - 11.7
Clarifai (Russakovsky et al., 2014) (1 net) - - 12.5
Zeiler & Fergus (Zeiler & Fergus, 2013) (6 nets) 36.0 14.7 14.8
Zeiler & Fergus (Zeiler & Fergus, 2013) (1 net) 37.5 16.0 16.1
OverFeat (Sermanet et al., 2014) (7 nets) 34.0 13.2 13.6
OverFeat (Sermanet et al., 2014) (1 net) 35.7 14.2 -
Krizhevsky et al. (Krizhevsky et al., 2012) (5 nets) 38.1 16.4 16.4
Krizhevsky et al. (Krizhevsky et al., 2012) (1 net) 40.7 18.2 -
Conclusion
In this work we evaluated very deep convolutional networks (up to 19 weight layers) for large-scale image classification. It was demonstrated that the representation depth is beneficial for the classification accuracy, and that state-of-the-art performance on the ImageNet challenge dataset can be achieved using a conventional ConvNet architecture (LeCun et al., 1989; Krizhevsky et al., 2012) with substantially increased depth. In the appendix, we also show that our models generalise well to a wide range of tasks and datasets, matching or outperforming more complex recognition pipelines built around less deep image representations. Our results yet again confirm the importance of depth in visual representations.
在本文中,我们评估了用于大规模图像分类的超深度卷积网络(多达19个权重层)的性能。结果表明,表示深度有利于提升分类准确率,并且使用深度显著增加的传统ConvNet架构(LeCun等人;Krizhevsky等人)在ImageNet挑战赛数据集上可以实现更高的性能。在附录中,我们还展示了我们的模型在广泛的任务和数据集有很好的泛化能力,达到甚至超过了围绕较浅深度的图像表示构建的更复杂的识别流程。我们的结果再次证实了深度在视觉表现中的重要性。
Localisation
In the main body of the paper we have considered the classification task of the ILSVRC challenge, and performed a thorough evaluation of ConvNet architectures of different depth. In this section, we turn to the localisation task of the challenge, which we won in 2014 with 25.3% error. It can be seen as a special case of object detection, where a single object bounding box should be predicted for each of the top-5 classes, irrespective of the actual number of objects of the class. For this we adopt the approach of Sermanet et al. (2014), the winners of the ILSVRC-2013 localisation challenge, with a few modifications. Our method is described in Sect. A.1 and evaluated in Sect. A.2.
在论文的主体部分,我们针对ILSVRC挑战赛的分类任务,并对不同深度的卷积网络框架进行了全面的评估。在本节中,我们将介绍挑战赛的定位任务,我们在2014年以25.3%的错误率获胜。这可以看作是目标检测的一种特殊情况,在这种情况下,不管目标类别的实际数量如何,top-5类中的每个类都应该预测一个目标边界框。为此,我们采用并稍加修改了Sermanet等人的方法,他们是ILSVRC-2013挑战赛定位项目的获胜者。我们的方法在A.1节中有描述,在A.2节对其进行评估。
Localisation ConvNet
To perform object localisation, we use a very deep ConvNet, where the last fully connected layer predicts the bounding box location instead of the class scores. A bounding box is represented by a 4-D vector storing its center coordinates, width, and height. There is a choice of whether the bounding box prediction is shared across all classes (single-class regression, SCR (Sermanet et al., 2014)) or is class-specific (per-class regression, PCR). In the former case, the last layer is 4-D, while in the latter it is 4000-D (since there are 1000 classes in the dataset). Apart from the last bounding box prediction layer, we use the ConvNet architecture D (Table 1), which contains 16 weight layers and was found to be the best-performing in the classification task (Sect. 4).
为了实现对象定位,我们使用非常深的ConvNet,其中最后一个全连接层用来预测边界框的位置,而不是分类得分。边界框由保存其中心坐标、宽度和高度的四维向量表示。边界框预测有两种选择:一是在所有类之间共享(单类回归,SCR),二是特定于某一类(每类回归,PCR)。在前一种情况中,最后一层的输出维度是4-D,而后者是4000-D(因为数据集中有1000个类别)。除了最后一个边界框预测层外,我们都使用ConvNet架构D(表1),它包含16个权重层,在分类任务中表现最好(第4节)。
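To make the difference between the 4-D (SCR) and 4000-D (PCR) outputs concrete, here is a small sketch. The contiguous per-class layout of the 4000-D vector and the helper names are assumptions for illustration; the paper does not specify how the per-class boxes are ordered in memory.

```python
NUM_CLASSES = 1000  # ILSVRC classes
BOX_DIMS = 4        # (centre x, centre y, width, height)


def output_size(per_class):
    """Size of the last layer: 4 for SCR, 4 * 1000 for PCR."""
    return BOX_DIMS * NUM_CLASSES if per_class else BOX_DIMS


def select_box(output, class_index, per_class):
    """Pick the 4-D box for a class.

    Assumes (hypothetically) that a PCR output stores the four values
    of each class contiguously, class 0 first.
    """
    if not per_class:
        return list(output[:BOX_DIMS])  # one box shared by all classes
    start = class_index * BOX_DIMS
    return list(output[start:start + BOX_DIMS])
```

With SCR the class index is ignored, whereas with PCR each class reads its own slice of the output.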
Training. Training of localisation ConvNets is similar to that of the classification ConvNets (Sect. 3.1). The main difference is that we replace the logistic regression objective with a Euclidean loss, which penalises the deviation of the predicted bounding box parameters from the ground-truth. We trained two localisation models, each on a single scale: S = 256 and S = 384 (due to the time constraints, we did not use training scale jittering for our ILSVRC-2014 submission). Training was initialised with the corresponding classification models (trained on the same scales), and the initial learning rate was set to 10^-3. We explored both fine-tuning all layers and fine-tuning only the first two fully-connected layers, as done in (Sermanet et al., 2014). The last fully-connected layer was initialised randomly and trained from scratch.
训练。定位ConvNets的训练与分类ConvNets的训练类似(第3.1节)。主要的区别在于我们用欧氏损失代替了逻辑回归,用来惩罚预测的边界框参数与真实值之间的偏差。我们训练了两个定位模型,每个模型均使用单一尺寸,分别是:和(由于时间限制,提交ILSVRC-2014时我们没有在训练上使用尺寸抖动)。使用对应的分类模型初始化训练(在相同的尺寸上进行训练),初始学习率设置为。我们探索了两种调优方法:对所有层进行微调和仅微调前两个全连接层,正如(Sermanet等人)所做的那样。最后一个全连接层使用随机初始化并从头开始训练。
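The Euclidean loss mentioned above can be illustrated with a short sketch. The paper does not state whether the loss is scaled (for example by 1/2 or by the batch size), so the plain squared distance used here is an assumption.

```python
def euclidean_loss(predicted, target):
    """Squared Euclidean distance between predicted and ground-truth box parameters.

    Both arguments are 4-D sequences (centre x, centre y, width, height).
    Any constant scaling factor is omitted; that choice is an assumption.
    """
    return sum((p - t) ** 2 for p, t in zip(predicted, target))
```

A perfect prediction gives zero loss, and the penalty grows quadratically with the deviation in any of the four box parameters.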
Testing. We consider two testing protocols. The first is used for comparing different network modifications on the validation set, and considers only the bounding box prediction for the ground truth class (to factor out the classification errors). The bounding box is obtained by applying the network only to the central crop of the image.
测试。我们考虑两个测试方案。第一个用于比较使用不同网络修正(即调优方式不同,所有层调优/全连接层调优)在验证集上的区别,并且仅考虑对真实类别的边界框预测(以排除分类错误)。边界框仅通过将网络应用于图像的中心裁剪来获得。
The second, fully-fledged, testing procedure is based on the dense application of the localisation ConvNet to the whole image, similarly to the classification task (Sect. 3.2). The difference is that instead of the class score map, the output of the last fully-connected layer is a set of bounding box predictions. To come up with the final prediction, we utilise the greedy merging procedure of Sermanet et al. (2014), which first merges spatially close predictions (by averaging their coordinates), and then rates them based on the class scores, obtained from the classification ConvNet. When several localisation ConvNets are used, we first take the union of their sets of bounding box predictions, and then run the merging procedure on the union. We did not use the multiple pooling offsets technique of Sermanet et al. (2014), which increases the spatial resolution of the bounding box predictions and can further improve the results.
第二个完全成熟的、基于定位ConvNet对整个图像密集应用的测试程序,类似于分类任务(第3.2节)。不同之处在于,最后一个全连接层的输出是一组边界框预测,而不是分类得分图。为了得出最终的预测结果,我们使用了Sermanet等人的贪婪合并程序。它首先合并了空间上接近的预测(通过计算坐标的平均值),然后根据从分类ConvNet获得的分类得分对它们进行评级。当使用多个定位ConvNets时,我们首先得到它们对边界框预测的合集,然后在合集上运行合并程序。我们没有使用Sermanet等人的提高边界框预测空间分辨率的多重池化偏移技术来进一步改善结果。
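The merging step can be sketched in simplified form. This is not the exact procedure of Sermanet et al. (2014): the closeness criterion below uses only the distance between box centres, the threshold `max_centre_dist` is a hypothetical parameter, and the subsequent rating by class scores is left out.

```python
import math


def merge_close_boxes(boxes, max_centre_dist):
    """Greedily merge the two closest boxes until no pair is close enough.

    Each box is (centre x, centre y, width, height). A merged box is the
    coordinate-wise average of the pair it replaces.
    """
    boxes = [tuple(float(v) for v in b) for b in boxes]
    while len(boxes) > 1:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                d = math.hypot(boxes[i][0] - boxes[j][0], boxes[i][1] - boxes[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > max_centre_dist:
            break  # remaining boxes are all far apart
        merged = tuple((a + b) / 2 for a, b in zip(boxes[i], boxes[j]))
        boxes = [b for k, b in enumerate(boxes) if k not in (i, j)] + [merged]
    return boxes
```

When several localisation networks are used, their predictions would simply be concatenated into one list before calling the function, mirroring the union described in the text.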
Localisation Experiments
In this section we first determine the best-performing localisation setting (using the first test protocol), and then evaluate it in a fully-fledged scenario (the second protocol). The localisation error is measured according to the ILSVRC criterion (Russakovsky et al., 2014), i.e. the bounding box prediction is deemed correct if its intersection over union ratio with the ground-truth bounding box is above 0.5.
在本节中,我们首先确定性能最佳的定位设置(使用第一个测试方案),然后再用成熟的方案(第二个方案)中对其进行评估。定位错误率根据ILSVRC标准(Russakovsky等人)衡量,即如果预测的边界框与真实边界框的交并比大于0.5,则认为边界框预测是正确的。
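The correctness criterion above can be written down directly. In this sketch boxes are given by their corners (x_min, y_min, x_max, y_max), which is an assumption made for convenience; the networks themselves predict centre, width, and height.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def is_correct(predicted, ground_truth, threshold=0.5):
    """A prediction counts as correct if its IoU with the ground truth exceeds the threshold."""
    return iou(predicted, ground_truth) > threshold
```

For example, two 2×2 boxes that overlap on half their area have an IoU of 1/3 and would be counted as a localisation error.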
Settings comparison. As can be seen from Table 8, per-class regression (PCR) outperforms the class-agnostic single-class regression (SCR), which differs from the findings of Sermanet et al. (2014), where PCR was outperformed by SCR. We also note that fine-tuning all layers for the localisation task leads to noticeably better results than fine-tuning only the fully-connected layers (as done in (Sermanet et al., 2014)). In these experiments, the smallest image side was set to S = 384; the results with S = 256 exhibit the same behaviour and are not shown for brevity.
设置比较。从表8可以看出,每类回归(PCR)优于类别不可知的单类回归(SCR),这与Sermanet等人的发现不同,这里PCR的表现优于SCR。我们还注意到,在定位任务中对所有层进行调优会比仅调优全连接层(如(Sermanet et al.)所做的那样)获得明显更好的结果。在这些实验中,图像的最小边设置为S=384;S=256时的结果与之相似,为了简洁不予显示。
Table 8: Localisation error for different modifications with the simplified testing protocol: the bounding box is predicted from a single central image crop, and the ground-truth class is used. All ConvNet layers (except for the last one) have the configuration D (Table 1), while the last layer performs either single-class regression (SCR) or per-class regression (PCR).
表8:使用简化测试方案的不同修正方式时的定位错误率:在单个中心裁切图像来预测边界框,并且使用真实类别。所有ConvNet层(最后一层除外)都用配置D(表1)进行配置,而最后一层使用单类回归(SCR)或每类回归(PCR)。
Fine-tuned layers | regression type | GT class localisation error
1st and 2nd FC | SCR | 36.4
1st and 2nd FC | PCR | 34.3
all | PCR | 33.1
Fully-fledged evaluation. Having determined the best localisation setting (PCR, fine-tuning of all layers), we now apply it in the fully-fledged scenario, where the top-5 class labels are predicted using our best-performing classification system (Sect. 4.5), and multiple densely-computed bounding box predictions are merged using the method of Sermanet et al. (2014). As can be seen from Table 9, application of the localisation ConvNet to the whole image substantially improves the results compared to using a center crop (Table 8), despite using the top-5 predicted class labels instead of the ground truth. Similarly to the classification task (Sect. 4), testing at several scales and combining the predictions of multiple networks further improves the performance.
全面的评估。在确定了最佳的定位设置(PCR,调优所有层)之后,我们现在将其应用于完全成熟的场景中,在该场景中,使用我们性能最好的分类系统(4.5节)来对top-5类别标签进行预测,并使用Sermanet等人的方法合并多个密集计算得出的预测边界框。从表9可以看出,与使用中心裁切(表8)相比,将定位ConvNet应用于整个图像大大改善了结果,尽管使用了top-5预测类标签而不是真实类。类似于分类任务(第4节),在多个尺度上进行测试并结合多个网络的预测结果,进一步提高了性能。
Table 9: Localisation error
smallest image side, train (S) | smallest image side, test (Q) | top-5 localisation error (%), val. | top-5 localisation error (%), test
256 | 256 | 29.5 | -
384 | 384 | 28.2 | 26.7
384 | 352,384 | 27.5 | -
fusion: 256/256 and 384/352,384 | 26.9 | 25.3
Comparison with the state of the art. We compare our best localisation result with the state of the art in Table 10. With 25.3% test error, our “VGG” team won the localisation challenge of ILSVRC-2014 (Russakovsky et al., 2014). Notably, our results are considerably better than those of the ILSVRC-2013 winner Overfeat (Sermanet et al., 2014), even though we used fewer scales and did not employ their resolution enhancement technique. We envisage that better localisation performance can be achieved if this technique is incorporated into our method. This indicates the performance advancement brought by our very deep ConvNets: we got better results with a simpler localisation method, but a more powerful representation.
与最先进的技术相比较。我们将最佳定位结果与表10中的最新技术进行了比较。我们的“VGG”团队以25.3%的测试错误率赢得了ILSVRC-2014定位比赛的(Russakovsky等人)。值得注意的是,我们的结果比ILSVRC-2013的冠军Overfeat(Sermanet等人)的结果要好得多,尽管我们使用的尺度较少,并且没有使用他们的分辨率增强技术。我们设想如果将此技术应用我们的方法中,可以获得更好的定位性能。这表明了我们超深度ConvNets带来的性能提升—我们使用了更简单的定位方法,但是使用了更强大的表示,获得了更好的结果。
Generalisation of Very Deep Features
In the previous sections we have discussed training and evaluation of very deep ConvNets on the ILSVRC dataset. In this section, we evaluate our ConvNets, pre-trained on ILSVRC, as feature extractors on other, smaller, datasets, where training large models from scratch is not feasible due to over-fitting. Recently, there has been a lot of interest in such a use case (Zeiler & Fergus, 2013; Donahue et al., 2013; Razavian et al., 2014; Chatfield et al., 2014), as it turns out that deep image representations, learnt on ILSVRC, generalise well to other datasets, where they have outperformed hand-crafted representations by a large margin. Following that line of work, we investigate if our models lead to better performance than more shallow models utilised in the state-of-the-art methods. In this evaluation, we consider two models with the best classification performance on ILSVRC (Sect. 4): configurations “Net-D” and “Net-E” (which we made publicly available).
在前面的章节中,我们讨论了超深度ConvNets在ILSVRC数据集上的训练和评估。在这一节中,评估了我们在ILSVRC上预先训练过的的ConvNets作为其他较小数据集的特征提取器,由于过拟合,在这些数据集中从头训练大型模型是不可行的。最近,人们对这样一个用例非常感兴趣(Zeiler&Fergus;Donahue等人;Razavine等人;Chatfield等人),因为事实证明,在ILSVRC上学习到的深度图像表示,在其他数据集有很好的泛化能力,性能大大优于手工的表达。按照这一思路,我们研究我们的模型是否会比最先进方法中使用浅层模型带来更好的性能。在这个评估中,我们考虑了两个在ILSVRC分类性能最好的模型(第4节)——配置“Net-D”和“Net-E”(我们公开提供)。
Table 10: Comparison with the state of the art in ILSVRC localisation. Our method is denoted as “VGG”.
表10:ILSVRC定位技术现状的比较。我们的方法被称为“VGG”。
Method top-5 val. error (%) top-5 test error (%)
VGG 26.9 25.3
GoogLeNet (Szegedy et al., 2014) - 26.7
OverFeat (Sermanet et al., 2014) 30.0 29.9
Krizhevsky et al. (Krizhevsky et al., 2012) - 34.2
To utilise the ConvNets, pre-trained on ILSVRC, for image classification on other datasets, we remove the last fully-connected layer (which performs 1000-way ILSVRC classification), and use 4096-D activations of the penultimate layer as image features, which are aggregated across multiple locations and scales. The resulting image descriptor is L2-normalised and combined with a linear SVM classifier, trained on the target dataset. For simplicity, pre-trained ConvNet weights are kept fixed (no fine-tuning is performed).
为了利用在ILSVRC上预先训练的ConvNets对其他数据集进行图像分类,我们移除了最后一个全连接层(执行1000类的ILSVRC分类),并使用倒数第二层的4096-D激活作为图像特征,这些特征在多个位置和尺度上聚合。得到的图像描述子经过正则化的,并与一个线性支持向量机分类器组合,在目标数据集上进行训练。为简单起见,预先训练的ConvNet权重保持不变(不进行调优)。
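The normalisation applied to the descriptor before the linear SVM can be sketched as below, assuming it is the standard L2 normalisation. The handling of an all-zero descriptor is an assumption added to keep the function total.

```python
import math


def l2_normalise(descriptor):
    """Scale a descriptor to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in descriptor))
    if norm == 0:
        return list(descriptor)
    return [x / norm for x in descriptor]
```

After this step all descriptors lie on the unit sphere, so the SVM compares directions of the feature vectors rather than their magnitudes.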
Aggregation of features is carried out in a similar manner to our ILSVRC evaluation procedure (Sect. 3.2). Namely, an image is first rescaled so that its smallest side equals Q, and then the network is densely applied over the image plane (which is possible when all weight layers are treated as convolutional). We then perform global average pooling on the resulting feature map, which produces a 4096-D image descriptor. The descriptor is then averaged with the descriptor of a horizontally flipped image. As was shown in Sect. 4.2, evaluation over multiple scales is beneficial, so we extract features over several scales. The resulting multi-scale features can be either stacked or pooled across scales. Stacking allows a subsequent classifier to learn how to optimally combine image statistics over a range of scales; this, however, comes at the cost of the increased descriptor dimensionality. We return to the discussion of this design choice in the experiments below. We also assess late fusion of features, computed using two networks, which is performed by stacking their respective image descriptors.
特征的聚集与我们的ILSVRC评估程序(第3.2节)相似。也就是说,首先对图像进行重新缩放,使其最小边等于,然后在图像平面上密集地应用网络(当所有权重层都被视为卷积层时,这是可能的)。然后在生成的特征图上进行全局平均池化,生成一个4096-D的图像描述子。然后将描述子与水平翻转图像的描述子取平均。如第4.2节所示,在多个尺度上进行评估是有益的,因此我们可以在多个尺度上提取特征。得到的多尺度特征可以在多个尺度上堆叠或池化。堆叠允许后续分类器学习如何最优地在一系列尺度上组合图像统计特征;然而,这样做的代价是会增加描述子的维度。我们将在下面的实验中重新讨论这个设计选择。我们还评估了使用两个网络计算得到的特征的后期融合,这是通过堆叠它们各自的图像描述子完成的。
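The aggregation pipeline above can be sketched with toy data. Feature maps are represented here as lists of per-location channel vectors with two channels instead of 4096; the function names and the numbers are illustrative assumptions, and average pooling is used as one possible way of pooling across scales.

```python
def global_average_pool(feature_map):
    """Average channel activations over all spatial locations of a feature map."""
    count = len(feature_map)
    channels = len(feature_map[0])
    return [sum(loc[c] for loc in feature_map) / count for c in range(channels)]


def mean_vectors(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def scale_descriptor(feature_map, flipped_feature_map):
    """Descriptor at one scale: pooled image averaged with its pooled horizontal flip."""
    return mean_vectors([
        global_average_pool(feature_map),
        global_average_pool(flipped_feature_map),
    ])


def stack_scales(descriptors):
    """Concatenate per-scale descriptors (dimensionality grows with the number of scales)."""
    return [x for d in descriptors for x in d]


def pool_scales(descriptors):
    """Average per-scale descriptors (dimensionality stays fixed)."""
    return mean_vectors(descriptors)


# Toy feature maps at two scales, each with its flipped counterpart.
small = scale_descriptor([[1.0, 0.0], [3.0, 2.0]], [[3.0, 2.0], [1.0, 0.0]])
large = scale_descriptor([[4.0, 4.0]], [[6.0, 2.0]])
```

Stacking keeps scale-specific information at the cost of a longer descriptor, whereas pooling keeps the descriptor length constant regardless of how many scales are used.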
Table 11: Comparison with the state of the art in image classification on VOC-2007, VOC-2012, Caltech-101, and Caltech-256. Our models are denoted as “VGG”. Results marked with * were achieved using ConvNets pre-trained on the extended ILSVRC dataset (2000 classes).
表11:与数据集VOC-2007、VOC-2012、Caltech-101和Caltech-256中的图像分类最新技术进行比较。我们的模型被称为“VGG”。标有的结果是在扩展的ILSVRC数据集(2000个类)上使用预先训练的ConvNets得出的。
Method | VOC-2007 (mean AP) | VOC-2012 (mean AP) | Caltech-101 (mean class recall) | Caltech-256 (mean class recall)
Zeiler & Fergus (Zeiler & Fergus, 2013) | - | 79.0 | 86.5 ± 0.5 | 74.2 ± 0.3
Chatfield et al. (Chatfield et al., 2014) | 82.4 | 83.2 | 88.4 ± 0.6 | 77.6 ± 0.1
He et al. (He et al., 2014) | 82.4 | - | 93.4 ± 0.5 | -
Wei et al. (Wei et al., 2014) | 81.5 (85.2*) | 81.7 (90.3*) | - | -
VGG Net-D (16 layers) | 89.3 | 89.0 | 91.8 ± 1.0 | 85.0 ± 0.2
VGG Net-E (19 layers) | 89.3 | 89.0 | 92.3 ± 0.5 | 85.1 ± 0.3
VGG Net-D & Net-E | 89.7 | 89.3 | 92.7 ± 0.5 | 86.2 ± 0.3
Image Classification on VOC-2007 and VOC-2012. We begin with the evaluation on the image classification task of PASCAL VOC-2007 and VOC-2012 benchmarks (Everingham et al., 2015). These datasets contain 10K and 22.5K images respectively, and each image is annotated with one or several labels, corresponding to 20 object categories. The VOC organisers provide a pre-defined split into training, validation, and test data (the test data for VOC-2012 is not publicly available; instead, an official evaluation server is provided). Recognition performance is measured using mean average precision (mAP) across classes.
在数据集VOC-2007和VOC-2012上的图像分类。我们首先对PASCAL VOC-2007和VOC-2012基准数据集的图像分类任务进行评估(Everingham等人)。这些数据集分别包含10K张和22.5K张图像,每张图像用一个或多个标签标注,对应20个目标类别。VOC组织者提供预定义划分好的训练集、验证集和测试集(VOC-2012的测试集不公开,而是提供官方评估服务器)。使用多类上的平均精度均值(mAP)来评估识别性能。
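As a rough illustration of the metric, the sketch below computes a non-interpolated average precision per class and then averages over classes. This is an assumption for illustration only: the official VOC evaluation uses its own interpolated definitions of AP, so numbers from this sketch would not match the evaluation server exactly.

```python
def average_precision(scores, labels):
    """Non-interpolated AP: mean of the precision at the rank of each positive image.

    scores: classifier scores for one class, one per image.
    labels: 1 if the image contains the class, 0 otherwise.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(per_class):
    """Average the per-class AP over all classes; per_class is a list of (scores, labels)."""
    aps = [average_precision(scores, labels) for scores, labels in per_class]
    return sum(aps) / len(aps)
```

Since each image can carry several labels, every class is scored as its own ranking problem before the average is taken.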
Notably, by examining the performance on the validation sets of VOC-2007 and VOC-2012, we found that aggregating image descriptors, computed at multiple scales, by averaging performs similarly to the aggregation by stacking. We hypothesize that this is due to the fact that in the VOC dataset the objects appear over a variety of scales, so there is no particular scale-specific semantics which a classifier could exploit. Since averaging has a benefit of not inflating the descriptor dimensionality, we were able to aggregate image descriptors over a wide range of scales: Q ∈ {256, 384, 512, 640, 768}. It is worth noting though that the improvement over a smaller range of {256, 384, 512} was rather marginal (0.3%).
值得注意的是,通过验证在VOC-2007和VOC-2012验证集上的性能,我们发现,通过计算在多个尺度上平均值来聚合图像描述子的性能与通过堆叠来聚合的性能类似。我们假设这是由于VOC数据集中的目标出现在不同的尺度上,因此没有特定的尺度语义可供分类器利用。由于平均化不会不增加描述子的维度,因此我们能够聚合大范围尺度的图像描述子:。但值得注意的是,这相对于使用较小的尺度范围的提升是相当小的(0.3%):。
The test set performance is reported and compared with other approaches in Table 11. Our networks “Net-D” and “Net-E” exhibit identical performance on VOC datasets, and their combination slightly improves the results. Our methods set the new state of the art across image representations, pre-trained on the ILSVRC dataset, outperforming the previous best result of Chatfield et al. (2014) by more than 6%. It should be noted that the method of Wei et al. (2014), which achieves 1% better mAP on VOC-2012, is pre-trained on an extended 2000-class ILSVRC dataset, which includes additional 1000 categories, semantically close to those in VOC datasets. It also benefits from the fusion with an object detection-assisted classification pipeline.
在测试集上的性能以及与其他方法的比较见表11。我们的网络“Net-D”和“Net-E”在VOC数据集上表现出相同的性能,并且它们的组合对结果有略微改善。我们在ILSVRC数据集上的预训练模型,展现了优秀的图像表达能力,比Chatfield等人以前的最佳结果增加了超过6%。值得注意的是,Wei等人的方法,是在2000类ILSVRC的扩展数据集上预训练,该数据集包括额外的1000个类别,语义上接近VOC数据集,在VOC-2012数据集上的平均精度均值仅仅比我们的高出1%。它也受益于与一个辅助目标检测的分类框架流程的融合。
Image Classification on Caltech-101 and Caltech-256. In this section we evaluate very deep features on Caltech-101 (Fei-Fei et al., 2004) and Caltech-256 (Griffin et al., 2007) image classification benchmarks. Caltech-101 contains 9K images labelled into 102 classes (101 object categories and a background class), while Caltech-256 is larger with 31K images and 257 classes. A standard evaluation protocol on these datasets is to generate several random splits into training and test data and report the average recognition performance across the splits, which is measured by the mean class recall (which compensates for a different number of test images per class). Following Chatfield et al. (2014); Zeiler & Fergus (2013); He et al. (2014), on Caltech-101 we generated 3 random splits into training and test data, so that each split contains 30 training images per class, and up to 50 test images per class. On Caltech-256 we also generated 3 splits, each of which contains 60 training images per class (and the rest is used for testing). In each split, 20% of training images were used as a validation set for hyper-parameter selection.
在Caltech-101和Caltech-256数据集上的图像分类。在本节中,我们评估了Caltech-101(Fei Fei等人)和Caltech-256(Griffin等人)图像分类基准的超深度特征。Caltech-101包含9K张图像,分为为102个类别标签(101个目标类别和一个背景类别),而Caltech-256更大,有31K张图像和257个类别。对这些数据集的一个标准评估方案是将这些数据集随机分成多个训练数据和测试数据,并报告平均识别性能,该性能由平均类召回率(它是对每个类别具有不同数量测试图像的补偿)来衡量。在Caltech-101上,我们随机生成了3个训练和测试数据,每个数据中包含每类30张训练图像,每类50张测试图像。在Caltech-256上,我们也随机生成了3个训练和测试数据,每个数据中包含每类60张训练图像(其余用于测试)。在每次分组中,20%的训练图像被用作超参数选择的验证集。
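Mean class recall, as used above, can be sketched as follows. The labels and predictions in the usage note are made-up toy values; only the metric itself comes from the text.

```python
def mean_class_recall(true_labels, predicted_labels):
    """Average of per-class recall, so every class counts equally regardless of its size."""
    total = {}
    correct = {}
    for truth, prediction in zip(true_labels, predicted_labels):
        total[truth] = total.get(truth, 0) + 1
        if truth == prediction:
            correct[truth] = correct.get(truth, 0) + 1
    recalls = [correct.get(cls, 0) / total[cls] for cls in total]
    return sum(recalls) / len(recalls)
```

For instance, with four test images of one class (three recognised correctly) and one image of another class (recognised correctly), plain accuracy is 0.8 while the mean class recall is (0.75 + 1.0) / 2 = 0.875, which shows how the measure compensates for unequal numbers of test images per class.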
We found that unlike VOC, on Caltech datasets the stacking of descriptors, computed over multiple scales, performs better than averaging or max-pooling. This can be explained by the fact that in Caltech images objects typically occupy the whole image, so multi-scale image features are semantically different (capturing the whole object vs. object parts), and stacking allows a classifier to exploit such scale-specific representations. We used three scales Q ∈ {256, 384, 512}.
我们发现与VOC不同的是,在Caltech数据集上,将在多个尺度上计算的描述子进行堆叠比平均或最大池化性能更好。这可以解释为,在Caltech图像中,目标通常占据整个图像,因此多尺度图像特征的语义是不同的(整个目标与目标的一部分相对比),堆叠允许分类器利用这种特定尺度的表达。我们使用三个尺度:。
Our models are compared to each other and the state of the art in Table 11. As can be seen, the deeper 19-layer Net-E performs better than the 16-layer Net-D, and their combination further improves the performance. On Caltech-101, our representations are competitive with the approach of He et al. (2014), which, however, performs significantly worse than our nets on VOC-2007. On Caltech-256, our features outperform the state of the art (Chatfield et al., 2014) by a large margin (8.6%).
表11对我们的模型和最新技术进行了比较。可以看出,较深的19层网络Net-E性能优于16层网络Net-D,它们的组合进一步提高了性能。在Caltech-101上,我们的表达与He等人的方法性能相近,但是他们在VOC-2007的表现明显比我们的网络模型差。在Caltech-256上,我们的特征比最先进的(Chatfield等人)有很大的优势(8.6%)。
Action Classification on VOC-2012. We also evaluated our best-performing image representation (the stacking of Net-D and Net-E features) on the PASCAL VOC-2012 action classification task (Everingham et al., 2015), which consists in predicting an action class from a single image, given a bounding box of the person performing the action. The dataset contains 4.6K training images, labelled into 11 classes. Similarly to the VOC-2012 object classification task, the performance is measured using the mAP. We considered two training settings: (i) computing the ConvNet features on the whole image and ignoring the provided bounding box; (ii) computing the features on the whole image and on the provided bounding box, and stacking them to obtain the final representation. The results are compared to other approaches in Table 12.
在VOC-2012数据上的动作分类。我们还在PASCAL VOC-2012动作分类任务(Everingham等人)中评估了我们性能最好的图像表达(Net-D和Net-E特征的堆叠),该任务包括根据单个图像预测动作类别,给出执行动作人的边界框。数据集包含4.6K张训练图像,分为11个类别标签。与VOC-2012目标分类任务类似,使用平均精度均值来衡量性能。我们考虑了两种训练设置:(i)在整个图像上计算ConvNet特征,忽略提供的边界框;(ii)在整个图像和提供的边界框上计算特征,并将其堆叠以获得最终的表达。与其他方法比较的结果见表12。
Our representation achieves the state of the art on the VOC action classification task even without using the provided bounding boxes, and the results are further improved when using both images and bounding boxes. Unlike other approaches, we did not incorporate any task-specific heuristics, but relied on the representation power of very deep convolutional features.
即使不使用所提供的边界框,我们的表示也达到了VOC动作分类任务最好的水平,并且当同时使用图像和边界框时,结果得到了进一步的改进。与其他方法不同,我们没有使用任何特定任务的探索法,而是依赖于超深度卷积特征的表达能力。
Other Recognition Tasks. Since the public release of our models, they have been actively used by the research community for a wide range of image recognition tasks, consistently outperforming more shallow representations. For instance, Girshick et al. (2014) achieve the state-of-the-art object detection results by replacing the ConvNet of Krizhevsky et al. (2012) with our 16-layer model. Similar gains over a more shallow architecture of Krizhevsky et al. (2012) have been observed in semantic segmentation (Long et al., 2014), image caption generation (Kiros et al., 2014; Karpathy & Fei-Fei, 2014), texture and material recognition (Cimpoi et al., 2014; Bell et al., 2014).
其他识别任务。自我们的模型公开发布以来,它们一直被研究界积极用于大规模的图像识别任务中,始终优于浅层表达的性能。例如,Girshick等人通过使用我们的16层模型替换掉Krizhevsky等人的ConvNet实现了目标检测的最好结果。在Krizhevsky等人的浅层结构也可以观察到类似的结果,例如:语义分割(Long等人)、图像字幕生成(Kiros等人;Karpathy&Fei Fei)、纹理和材料识别(Cimpoi等人;Bell等人)。
Table 12: Comparison with the state of the art in single-image action classification on VOC-2012. Our models are denoted as “VGG”. Results marked with * were achieved using ConvNets pre-trained on the extended ILSVRC dataset (1512 classes).
表12:在VOC-2012数据集上与其他分类方法在单图像动作分类上的比较。我们的模型被称为“VGG”。标记的为使用在扩展ILSVRC数据集(1512个类别)上预训练的ConvNets模型实现的结果。
Method VOC-2012 (mean AP)
(Oquab et al., 2014) 70.2*
(Gkioxari et al., 2014) 73.6
(Hoai, 2014) 76.3
VGG Net-D & Net-E, image-only 79.2
VGG Net-D & Net-E, image and bounding box 84.0
Paper Revisions
Here we present the list of major paper revisions, outlining the substantial changes for the convenience of the reader.
v1 Initial version. Presents the experiments carried out before the ILSVRC submission.
v2 Adds post-submission ILSVRC experiments with training set augmentation using scale jittering, which improves the performance.
v3 Adds generalisation experiments (Appendix B) on PASCAL VOC and Caltech image classification datasets. The models used for these experiments are publicly available.
v4 The paper is converted to ICLR-2015 submission format. Also adds experiments with multiple crops for classification.
v6 Camera-ready ICLR-2015 conference paper. Adds a comparison of the net B with a shallow net and the results on PASCAL VOC action classification benchmark.
在这里,我们列出了主要的论文修订版本,概述了为方便读者阅读而进行的重大修改。
v1初始版本。介绍了在提交ILSVRC竞赛结果之前进行的实验。
v2增加了提交ILSVRC后进行的实验,通过使用尺度抖动来增加训练集,从而提高了性能。
v3增加了对PASCAL VOC和Caltech图像分类数据集的泛化实验(附录B)。用于这些实验的模型是公开的。
v4论文转换为ICLR-2015提交格式。还增加了对多重裁切进行分类的实验。
v6适用于相机的ICLR-2015会议论文。将带有浅层网络的网络B与在PASCAL VOC动作分类基准上的结果进行比较。