ImageNet Classification with Deep Convolutional Neural Networks: Paper Translation (Part 2)
code
AlexNet implementation (PyTorch): https://github.com/Lornatang/pytorch/blob/master/official/net/alexnet.py
ImageNet Classification with Deep Convolutional Neural Networks
Paper: http://static.tongtianta.site/paper_pdf/2c26fb78-7abb-11e8-87f8-00163e08ba34.pdf
Abstract
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called “dropout” that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
1 Introduction
Current approaches to object recognition make essential use of machine learning methods. To improve their performance, we can collect larger datasets, learn more powerful models, and use better techniques for preventing overfitting. Until recently, datasets of labeled images were relatively small, on the order of tens of thousands of images (e.g., NORB [16], Caltech-101/256 [8, 9], and CIFAR-10/100 [12]). Simple recognition tasks can be solved quite well with datasets of this size, especially if they are augmented with label-preserving transformations. For example, the current-best error rate on the MNIST digit-recognition task (<0.3%) approaches human performance [4]. But objects in realistic settings exhibit considerable variability, so to learn to recognize them it is necessary to use much larger training sets. And indeed, the shortcomings of small image datasets have been widely recognized (e.g., Pinto et al. [21]), but it has only recently become possible to collect labeled datasets with millions of images. The new larger datasets include LabelMe [23], which consists of hundreds of thousands of fully-segmented images, and ImageNet [6], which consists of over 15 million labeled high-resolution images in over 22,000 categories.
To learn about thousands of objects from millions of images, we need a model with a large learning capacity. However, the immense complexity of the object recognition task means that this problem cannot be specified even by a dataset as large as ImageNet, so our model should also have lots of prior knowledge to compensate for all the data we don’t have. Convolutional neural networks (CNNs) constitute one such class of models [16, 11, 13, 18, 15, 22, 26]. Their capacity can be controlled by varying their depth and breadth, and they also make strong and mostly correct assumptions about the nature of images (namely, stationarity of statistics and locality of pixel dependencies). Thus, compared to standard feedforward neural networks with similarly-sized layers, CNNs have much fewer connections and parameters and so they are easier to train, while their theoretically-best performance is likely to be only slightly worse.
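The claim that CNNs have far fewer parameters than a fully-connected network with similarly-sized layers can be checked with simple arithmetic. The sketch below is only an illustration: the 32x32 input, the 3 and 16 feature maps, and the 5x5 kernel are made-up sizes, not values from the paper, and biases are ignored.

```python
def dense_weights(in_units, out_units):
    """Weights in a fully-connected layer: every input connects to every output."""
    return in_units * out_units


def conv_weights(in_maps, out_maps, kernel_size):
    """Weights in a convolutional layer: one small kernel per (input map, output map) pair,
    shared across all spatial positions."""
    return out_maps * in_maps * kernel_size * kernel_size


# Hypothetical layer sizes chosen only for illustration.
height = width = 32
in_maps, out_maps, kernel_size = 3, 16, 5

dense = dense_weights(height * width * in_maps, height * width * out_maps)
conv = conv_weights(in_maps, out_maps, kernel_size)

print("fully-connected weights:", dense)
print("convolutional weights:  ", conv)
```

Both layers map a 32x32x3 input to a 32x32x16 output in this toy setup, yet the convolutional layer needs only 1,200 weights because each kernel is local and reused at every position.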
Despite the attractive qualities of CNNs, and despite the relative efficiency of their local architecture, they have still been prohibitively expensive to apply in large scale to high-resolution images. Luckily, current GPUs, paired with a highly-optimized implementation of 2D convolution, are powerful enough to facilitate the training of interestingly-large CNNs, and recent datasets such as ImageNet contain enough labeled examples to train such models without severe overfitting.
The specific contributions of this paper are as follows: we trained one of the largest convolutional neural networks to date on the subsets of ImageNet used in the ILSVRC-2010 and ILSVRC-2012 competitions [2] and achieved by far the best results ever reported on these datasets. We wrote a highly-optimized GPU implementation of 2D convolution and all the other operations inherent in training convolutional neural networks, which we make available publicly (see footnote 1). Our network contains a number of new and unusual features which improve its performance and reduce its training time, which are detailed in Section 3. The size of our network made overfitting a significant problem, even with 1.2 million labeled training examples, so we used several effective techniques for preventing overfitting, which are described in Section 4. Our final network contains five convolutional and three fully-connected layers, and this depth seems to be important: we found that removing any convolutional layer (each of which contains no more than 1% of the model’s parameters) resulted in inferior performance.
In the end, the network’s size is limited mainly by the amount of memory available on current GPUs and by the amount of training time that we are willing to tolerate. Our network takes between five and six days to train on two GTX 580 3GB GPUs. All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available.
2 The Dataset
ImageNet is a dataset of over 15 million labeled high-resolution images belonging to roughly 22,000 categories. The images were collected from the web and labeled by human labelers using Amazon’s Mechanical Turk crowd-sourcing tool. Starting in 2010, as part of the Pascal Visual Object Challenge, an annual competition called the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) has been held. ILSVRC uses a subset of ImageNet with roughly 1000 images in each of 1000 categories. In all, there are roughly 1.2 million training images, 50,000 validation images, and 150,000 testing images.
ILSVRC-2010 is the only version of ILSVRC for which the test set labels are available, so this is the version on which we performed most of our experiments. Since we also entered our model in the ILSVRC-2012 competition, in Section 6 we report our results on this version of the dataset as well, for which test set labels are unavailable. On ImageNet, it is customary to report two error rates: top-1 and top-5, where the top-5 error rate is the fraction of test images for which the correct label is not among the five labels considered most probable by the model.
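The top-1 and top-5 error rates defined above can be written as one small function. This is a minimal sketch in plain Python; the function name `top_k_error` and the toy scores are illustrative and do not come from the paper or its code.

```python
def top_k_error(scores, labels, k=5):
    """Fraction of examples whose correct label is NOT among the k highest-scoring classes."""
    misses = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from most to least probable, truncated to the best k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if label not in top_k:
            misses += 1
    return misses / len(labels)


# Toy model outputs for 3 examples over 4 classes (made-up numbers).
scores = [
    [0.1, 0.6, 0.2, 0.1],
    [0.5, 0.1, 0.3, 0.1],
    [0.3, 0.1, 0.1, 0.5],
]
labels = [1, 2, 0]

top1 = top_k_error(scores, labels, k=1)
top2 = top_k_error(scores, labels, k=2)
print("top-1 error:", top1)
print("top-2 error:", top2)
```

With 1000 classes one would call the same function with `k=1` and `k=5`; the toy example uses `k=2` only because it has four classes.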
ImageNet consists of variable-resolution images, while our system requires a constant input dimensionality. Therefore, we down-sampled the images to a fixed resolution of [...]
3 The Architecture
The architecture of our network is summarized in Figure 2. It contains eight learned layers — five convolutional and three fully-connected. Below, we describe some of the novel or unusual features of our network’s architecture. Sections 3.1-3.4 are sorted according to our estimation of their importance, with the most important first.
Footnote 1: http://code.google.com/p/cuda-convnet/
3.1 ReLU Nonlinearity
The standard way to model a neuron’s output f as a function of its input x is with [...] works particularly well with their type of contrast normalization followed by local average pooling on the Caltech-101 dataset. However, on this dataset the primary concern is preventing overfitting, so the effect they are observing is different from the accelerated ability to fit the training set which we report when using ReLUs. Faster learning has a great influence on the performance of large models trained on large datasets.
Figure caption (fragment): [...] were chosen independently to make training as fast as possible. No regularization of any kind was employed. The magnitude of the effect demonstrated here varies with network architecture, but networks with ReLUs consistently learn several times faster than equivalents with saturating neurons.
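The formulas for this section are missing from this copy, so the sketch below rests on assumptions: it uses the usual definition of a ReLU, f(x) = max(0, x), and takes tanh as the representative saturating unit. It only illustrates why a saturating unit passes back very small gradients for large inputs while a ReLU does not; it does not reproduce the paper's experiment.

```python
import math


def relu(x):
    """Assumed ReLU definition: f(x) = max(0, x)."""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of the ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2, which shrinks toward 0 as |x| grows."""
    return 1.0 - math.tanh(x) ** 2


for x in (0.5, 2.0, 5.0):
    print(f"x={x}: relu'={relu_grad(x)}, tanh'={tanh_grad(x):.6f}")
```

For positive inputs the ReLU gradient stays at 1 regardless of magnitude, whereas the tanh gradient falls quickly as the unit saturates.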
3.2 Training on Multiple GPUs
A single GTX 580 GPU has only 3GB of memory, which limits the maximum size of the networks that can be trained on it. It turns out that 1.2 million training examples are enough to train networks which are too big to fit on one GPU. Therefore we spread the net across two GPUs. Current GPUs are particularly well-suited to cross-GPU parallelization, as they are able to read from and write to one another’s memory directly, without going through host machine memory. The parallelization scheme that we employ essentially puts half of the kernels (or neurons) on each GPU, with one additional trick: the GPUs communicate only in certain layers. This means that, for example, the kernels of layer 3 take input from all kernel maps in layer 2. However, kernels in layer 4 take input only from those kernel maps in layer 3 which reside on the same GPU. Choosing the pattern of connectivity is a problem for cross-validation, but this allows us to precisely tune the amount of communication until it is an acceptable fraction of the amount of computation.
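One way to see the effect of restricting a layer to same-GPU inputs is to count its weights. In the sketch below, a layer whose kernels read all kernel maps of the previous layer corresponds to `groups=1`, and a layer whose kernels read only the maps on their own GPU corresponds to `groups=2`. The helper `conv_weight_count` and the layer sizes are illustrative assumptions, not values taken from this text.

```python
def conv_weight_count(in_maps, out_kernels, kernel_size, groups=1):
    """Weights (biases excluded) in a conv layer whose kernels are split into `groups`
    independent sets, each set seeing only its own share of the input maps."""
    if in_maps % groups or out_kernels % groups:
        raise ValueError("in_maps and out_kernels must be divisible by groups")
    weights_per_kernel = (in_maps // groups) * kernel_size * kernel_size
    return out_kernels * weights_per_kernel


# Hypothetical sizes: 384 input maps, 384 kernels of 3x3.
all_maps = conv_weight_count(384, 384, 3, groups=1)   # kernels read every input map
same_gpu = conv_weight_count(384, 384, 3, groups=2)   # kernels read only local maps

print("cross-GPU connectivity:", all_maps)
print("same-GPU connectivity: ", same_gpu)
```

Halving the inputs each kernel sees halves both the weights and the data that must cross between GPUs for that layer, which is the communication-versus-computation trade-off described above.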
The resultant architecture is somewhat similar to that of the “columnar” CNN employed by Cireșan et al. [5], except that our columns are not independent (see Figure 2). This scheme reduces our top-1 and top-5 error rates by 1.7% and 1.2%, respectively, as compared with a net with half as many kernels in each convolutional layer trained on one GPU. The two-GPU net takes slightly less time to train than the one-GPU net (see footnote 2).
Footnote 2: The one-GPU net actually has the same number of kernels as the two-GPU net in the final convolutional layer. This is because most of the net’s parameters are in the first fully-connected layer, which takes the last convolutional layer as input. So to make the two nets have approximately the same number of parameters, we did not halve the size of the final convolutional layer (nor the fully-connected layers which follow). Therefore this comparison is biased in favor of the one-GPU net, since it is bigger than “half the size” of the two-GPU net.
3.3 Local Response Normalization
ReLUs have the desirable property that they do not require input normalization to prevent them from saturating. If at least some training examples produce a positive input to a ReLU, learning will happen in that neuron. However, we still find that the following local normalization scheme aids generalization. Denoting by [...]
This scheme bears some resemblance to the local contrast normalization scheme of Jarrett et al. [11], but ours would be more correctly termed “brightness normalization”, since we do not subtract the mean activity. Response normalization reduces our top-1 and top-5 error rates by 1.4% and 1.2%, respectively. We also verified the effectiveness of this scheme on the CIFAR-10 dataset: a four-layer CNN achieved a 13% test error rate without normalization and 11% with normalization (see footnote 3).
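The normalization formula itself is missing from this copy. The sketch below assumes the form given in the published paper: each activity is divided by (k + alpha * sum of squared activities over n adjacent kernel maps at the same spatial position) raised to the power beta, with k = 2, n = 5, alpha = 1e-4 and beta = 0.75. Treat both the formula and the constants as assumptions to check against the original paper. Note that no mean is subtracted, consistent with the "brightness normalization" remark.

```python
def local_response_norm(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Normalize a list of activities, one per kernel map, at a single spatial position.

    Assumed form: b[i] = a[i] / (k + alpha * sum_j a[j]^2) ** beta,
    where j runs over the n kernel maps adjacent to map i (clipped at the ends).
    """
    num_maps = len(a)
    half = n // 2
    out = []
    for i in range(num_maps):
        lo = max(0, i - half)
        hi = min(num_maps - 1, i + half)
        sum_sq = sum(a[j] ** 2 for j in range(lo, hi + 1))
        out.append(a[i] / (k + alpha * sum_sq) ** beta)
    return out


# Made-up activities for 6 kernel maps at one pixel.
activities = [0.0, 10.0, 50.0, 5.0, 0.5, 20.0]
print(local_response_norm(activities))
```

A large activity in one map increases the denominator for its neighbors, so strong responses suppress nearby maps at the same position.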
3.4 Overlapping Pooling
Pooling layers in CNNs summarize the outputs of neighboring groups of neurons in the same kernel map. Traditionally, the neighborhoods summarized by adjacent pooling units do not overlap (e.g., [17, 11, 4]). To be more precise, a pooling layer can be thought of as consisting of a grid of pooling units spaced s pixels apart, each summarizing a neighborhood of size [...]
3.5 Overall Architecture
Now we are ready to describe the overall architecture of our CNN. As depicted in Figure 2, the net contains eight layers with weights; the first five are convolutional and the remaining three are fully-connected. The output of the last fully-connected layer is fed to a 1000-way softmax which produces a distribution over the 1000 class labels. Our network maximizes the multinomial logistic regression objective, which is equivalent to maximizing the average across training cases of the log-probability of the correct label under the prediction distribution.
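The objective described above, the average log-probability of the correct label under a softmax distribution, can be sketched in a few lines. The function names and the toy logits are illustrative; a real 1000-way classifier would have 1000 logits per example.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_log_prob(batch_logits, labels):
    """Average over training cases of log p(correct label); training maximizes this."""
    total = 0.0
    for logits, label in zip(batch_logits, labels):
        total += math.log(softmax(logits)[label])
    return total / len(labels)


# Made-up logits for 2 examples over 4 classes.
batch_logits = [[2.0, 0.5, 0.1, -1.0], [0.0, 0.0, 3.0, 0.0]]
labels = [0, 2]
print("mean log-probability:", mean_log_prob(batch_logits, labels))
```

Maximizing this quantity is the same as minimizing the cross-entropy loss that deep learning libraries usually expose.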
The kernels of the second, fourth, and fifth convolutional layers are connected only to those kernel maps in the previous layer which reside on the same GPU (see Figure 2). The kernels of the third convolutional layer are connected to all kernel maps in the second layer. The neurons in the fully-connected layers are connected to all neurons in the previous layer. Response-normalization layers follow the first and second convolutional layers. Max-pooling layers, of the kind described in Section 3.4, follow both response-normalization layers as well as the fifth convolutional layer. The ReLU non-linearity is applied to the output of every convolutional and fully-connected layer.
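The max-pooling layers mentioned above are the overlapping kind from Section 3.4, whose text is truncated in this copy. The sketch below is a 1-D illustration under assumptions: the window size is called `z` here (the symbol is not recoverable from this copy), `s` is the spacing between pooling units, and the values are toy choices. Pooling is non-overlapping when s equals z and overlapping when s is smaller than z.

```python
def max_pool_1d(values, z, s):
    """Max over windows of size z placed every s positions (1-D for brevity)."""
    return [max(values[i:i + z]) for i in range(0, len(values) - z + 1, s)]


row = [1, 3, 2, 5, 4, 0, 6]

# s == z: windows [1,3], [2,5], [4,0] share no elements.
non_overlapping = max_pool_1d(row, z=2, s=2)

# s < z: windows [1,3,2], [2,5,4], [4,0,6] share one element with each neighbor.
overlapping = max_pool_1d(row, z=3, s=2)

print("non-overlapping:", non_overlapping)
print("overlapping:    ", overlapping)
```

Both settings produce the same number of outputs here, but in the overlapping case each output summarizes a larger neighborhood.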
The first convolutional layer filters the [...]
Footnote 3: We cannot describe this network in detail due to space constraints, but it is specified precisely by the code and parameter files provided here: http://code.google.com/p/cuda-convnet/.
Figure 2: An illustration of the architecture of our CNN, explicitly showing the delineation of responsibilities [...]
[...] neurons in a kernel map). The second convolutional layer takes as input the (response-normalized and pooled) output of the first convolutional layer and filters it with 256 kernels of size [...]. The fully-connected layers have 4096 neurons each.
Source: http://tongtianta.site/paper/1954
Editor: Lornatang
Proofreader: Lornatang