[Start date] 2018.09.25
[Completion date] 2018.09.26
[Paper translation] GoogLeNet paper, Chinese-English parallel translation (Going Deeper with Convolutions)
[Paper link] https://arxiv.org/abs/1409.4842
Title: Going Deeper with Convolutions
Abstract
We propose a deep convolutional neural network architecture codenamed Inception, which was responsible for setting the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. This was achieved by a carefully crafted design that allows for increasing the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22-layer deep network, the quality of which is assessed in the context of classification and detection.
1 Introduction
In the last three years, mainly due to the advances of deep learning, more concretely convolutional networks [10], the quality of image recognition and object detection has been progressing at a dramatic pace. One encouraging news is that most of this progress is not just the result of more powerful hardware, larger datasets and bigger models, but mainly a consequence of new ideas, algorithms and improved network architectures. No new data sources were used, for example, by the top entries in the ILSVRC 2014 competition besides the classification dataset of the same competition for detection purposes. Our GoogLeNet submission to ILSVRC 2014 actually uses 12× fewer parameters than the winning architecture of Krizhevsky et al [9] from two years ago, while being significantly more accurate. The biggest gains in object-detection have not come from the utilization of deep networks alone or bigger models, but from the synergy of deep architectures and classical computer vision, like the R-CNN algorithm by Girshick et al [6].
Another notable factor is that with the ongoing traction of mobile and embedded computing, the efficiency of our algorithms (especially their power and memory use) gains importance. It is noteworthy that the considerations leading to the design of the deep architecture presented in this paper included this factor rather than having a sheer fixation on accuracy numbers. For most of the experiments, the models were designed to keep a computational budget of 1.5 billion multiply-adds at inference time, so that they do not end up as a purely academic curiosity, but could be put to real-world use, even on large datasets, at a reasonable cost.
In this paper, we will focus on an efficient deep neural network architecture for computer vision, codenamed Inception, which derives its name from the Network in network paper by Lin et al [12] in conjunction with the famous “we need to go deeper” internet meme [1]. In our case, the word “deep” is used in two different meanings: first of all, in the sense that we introduce a new level of organization in the form of the “Inception module” and also in the more direct sense of increased network depth. In general, one can view the Inception model as a logical culmination of [12] while taking inspiration and guidance from the theoretical work by Arora et al [2]. The benefits of the architecture are experimentally verified on the ILSVRC 2014 classification and detection challenges, on which it significantly outperforms the current state of the art.
2 Related Work
Starting with LeNet-5 [10], convolutional neural networks (CNN) have typically had a standard structure: stacked convolutional layers (optionally followed by contrast normalization and max-pooling) are followed by one or more fully-connected layers. Variants of this basic design are prevalent in the image classification literature and have yielded the best results to date on MNIST, CIFAR and most notably on the ImageNet classification challenge [9, 21]. For larger datasets such as ImageNet, the recent trend has been to increase the number of layers [12] and layer size [21, 14], while using dropout [7] to address the problem of overfitting.
Despite concerns that max-pooling layers result in loss of accurate spatial information, the same convolutional network architecture as [9] has also been successfully employed for localization [9, 14], object detection [6, 14, 18, 5] and human pose estimation [19].
Inspired by a neuroscience model of the primate visual cortex, Serre et al. [15] used a series of fixed Gabor filters of different sizes to handle multiple scales. We use a similar strategy here. However, contrary to the fixed 2-layer deep model of [15], all filters in the Inception architecture are learned. Furthermore, Inception layers are repeated many times, leading to a 22-layer deep model in the case of the GoogLeNet model.
Network-in-Network is an approach proposed by Lin et al. [12] in order to increase the representational power of neural networks. In their model, additional 1×1 convolutional layers are added to the network, increasing its depth. When applied to convolutional layers, the approach can be viewed as additional 1×1 convolutional layers, typically followed by rectified linear activation [9], which makes it easy to integrate into current CNN pipelines. We use this approach heavily in our architecture. However, in our setting, 1×1 convolutions have a dual purpose: most critically, they are used mainly as dimension reduction modules to remove computational bottlenecks that would otherwise limit the size of our networks. This allows for increasing not just the depth but also the width of our networks without a significant performance penalty.
The current state of the art for object detection is the Regions with Convolutional Neural Networks (R-CNN) method by Girshick et al. [6]. R-CNN decomposes the overall detection problem into two subproblems: utilizing low-level cues such as color and texture in order to generate object location proposals in a category-agnostic fashion, and using CNN classifiers to identify object categories at those locations. Such a two-stage approach leverages the accuracy of bounding box segmentation with low-level cues, as well as the strong classification power of state-of-the-art CNNs. We adopted a similar pipeline in our detection submissions, but have explored enhancements in both stages, such as multi-box [5] prediction for higher object bounding box recall, and ensemble approaches for better categorization of bounding box proposals.
3 Motivation and High Level Considerations
The most straightforward way of improving the performance of deep neural networks is by increasing their size. This includes both increasing the depth (the number of network levels) and the width (the number of units at each level). This is an easy and safe way of training higher quality models, especially given the availability of a large amount of labeled training data. However, this simple solution comes with two major drawbacks.
Bigger size typically means a larger number of parameters, which makes the enlarged network more prone to overfitting, especially if the number of labeled examples in the training set is limited. This is a major bottleneck as strongly labeled datasets are laborious and expensive to obtain, often requiring expert human raters to distinguish between various fine-grained visual categories such as those in ImageNet (even in the 1000-class ILSVRC subset) as shown in Figure 1.
The other drawback of uniformly increased network size is the dramatically increased use of computational resources. For example, in a deep vision network, if two convolutional layers are chained, any uniform increase in the number of their filters results in a quadratic increase of computation. If the added capacity is used inefficiently (for example, if most weights end up to be close to zero), then much of the computation is wasted. As the computational budget is always finite, an efficient distribution of computing resources is preferred to an indiscriminate increase of size, even when the main objective is to increase the quality of performance.
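The quadratic growth mentioned above can be checked with a little arithmetic. The sketch below is only an illustration: the spatial size, channel counts and kernel size are hypothetical values, not figures from the paper, and the cost model counts multiply-adds for a stride-1, same-padded convolution while ignoring biases.

```python
def conv_multiply_adds(height, width, in_channels, out_channels, kernel):
    """Multiply-adds of a stride-1, same-padded convolution (bias ignored)."""
    return height * width * in_channels * out_channels * kernel * kernel


def chained_cost(filters, height=28, width=28, in_channels=64, kernel=3):
    """Costs of two chained conv layers that each produce `filters` channels.

    All default sizes are hypothetical and chosen only for illustration.
    """
    first = conv_multiply_adds(height, width, in_channels, filters, kernel)
    second = conv_multiply_adds(height, width, filters, filters, kernel)
    return first, second


base_first, base_second = chained_cost(filters=128)
wide_first, wide_second = chained_cost(filters=256)

# Doubling the filters doubles the first layer's cost, but the second layer
# sees twice the inputs and produces twice the outputs, so its cost quadruples.
first_ratio = wide_first / base_first
second_ratio = wide_second / base_second
print(first_ratio, second_ratio)
```

The second layer is the one whose cost depends on the filter count twice, which is why a uniform widening of chained layers scales quadratically.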
A fundamental way of solving both of these issues would be to introduce sparsity and replace the fully connected layers by sparse ones, even inside the convolutions. Besides mimicking biological systems, this would also have the advantage of firmer theoretical underpinnings due to the groundbreaking work of Arora et al. [2]. Their main result states that if the probability distribution of the dataset is representable by a large, very sparse deep neural network, then the optimal network topology can be constructed layer after layer by analyzing the correlation statistics of the preceding layer activations and clustering neurons with highly correlated outputs. Although the strict mathematical proof requires very strong conditions, the fact that this statement resonates with the well-known Hebbian principle ("neurons that fire together, wire together") suggests that the underlying idea is applicable even under less strict conditions, in practice.
On the downside, today's computing infrastructures are very inefficient when it comes to numerical calculation on non-uniform sparse data structures. Even if the number of arithmetic operations is reduced by 100×, the overhead of lookups and cache misses is so dominant that switching to sparse matrices would not pay off. The gap is widened even further by the use of steadily improving, highly tuned numerical libraries that allow for extremely fast dense matrix multiplication, exploiting the minute details of the underlying CPU or GPU hardware [16, 9]. Also, non-uniform sparse models require more sophisticated engineering and computing infrastructure. Most current vision-oriented machine learning systems utilize sparsity in the spatial domain just by virtue of employing convolutions. However, convolutions are implemented as collections of dense connections to the patches in the earlier layer. Although ConvNets have traditionally used random and sparse connection tables in the feature dimensions since [11] in order to break the symmetry and improve learning, the trend changed back to full connections with [9] in order to better optimize parallel computing. The uniformity of the structure, a large number of filters and greater batch size allow for utilizing efficient dense computation.
This raises the question of whether there is any hope for a next, intermediate step: an architecture that makes use of filter-level sparsity, as suggested by the theory, but exploits our current hardware by utilizing computations on dense matrices. The vast literature on sparse matrix computations (e.g. [3]) suggests that clustering sparse matrices into relatively dense submatrices tends to give competitive performance for sparse matrix multiplication. It does not seem far-fetched to think that similar methods would be utilized for the automated construction of non-uniform deep-learning architectures in the near future.
The Inception architecture started out as a case study of the first author for assessing the hypothetical output of a sophisticated network topology construction algorithm that tries to approximate a sparse structure implied by [2] for vision networks and covering the hypothesized outcome by dense, readily available components. Despite being a highly speculative undertaking, only after two iterations on the exact choice of topology, we could already see modest gains against the reference architecture based on [12]. After further tuning of learning rate, hyperparameters and improved training methodology, we established that the resulting Inception architecture was especially useful in the context of localization and object detection as the base network for [6] and [5]. Interestingly, while most of the original architectural choices have been questioned and tested thoroughly, they turned out to be at least locally optimal.
One must be cautious though: although the Inception architecture has become a success for computer vision, it is still questionable whether this can be attributed to the guiding principles that have led to its construction. Making sure would require much more thorough analysis and verification: for example, checking whether automated tools based on the principles described below would find similar, but better, topologies for the vision networks. The most convincing proof would be if an automated system would create network topologies resulting in similar gains in other domains using the same algorithm but with a very different-looking global architecture. At the very least, the initial success of the Inception architecture yields firm motivation for exciting future work in this direction.
4 Architectural Details
The main idea of the Inception architecture is to consider how an optimal local sparse structure of a convolutional vision network can be approximated and covered by readily available dense components. Note that assuming translation invariance means that our network will be built from convolutional building blocks. All we need is to find the optimal local construction and to repeat it spatially. Arora et al. [2] suggests a layer-by-layer construction where one should analyze the correlation statistics of the last layer and cluster them into groups of units with high correlation. These clusters form the units of the next layer and are connected to the units in the previous layer. We assume that each unit from an earlier layer corresponds to some region of the input image and these units are grouped into filter banks. In the lower layers (the ones close to the input) correlated units would concentrate in local regions. Thus, we would end up with a lot of clusters concentrated in a single region and they can be covered by a layer of 1×1 convolutions in the next layer, as suggested in [12]. However, one can also expect that there will be a smaller number of more spatially spread out clusters that can be covered by convolutions over larger patches, and there will be a decreasing number of patches over larger and larger regions. In order to avoid patch-alignment issues, current incarnations of the Inception architecture are restricted to filter sizes 1×1, 3×3 and 5×5; this decision was based more on convenience rather than necessity. It also means that the suggested architecture is a combination of all those layers with their output filter banks concatenated into a single output vector forming the input of the next stage. Additionally, since pooling operations have been essential for the success of current convolutional networks, it suggests that adding an alternative parallel pooling path in each such stage should have additional beneficial effect, too (see Figure 2(a)).
As these “Inception modules” are stacked on top of each other, their output correlation statistics are bound to vary: as features of higher abstraction are captured by higher layers, their spatial concentration is expected to decrease suggesting that the ratio of 3×3 and 5×5 convolutions should increase as we move to higher layers.
One big problem with the above modules, at least in this naive form, is that even a modest number of 5×5 convolutions can be prohibitively expensive on top of a convolutional layer with a large number of filters. This problem becomes even more pronounced once pooling units are added to the mix: their number of output filters equals the number of filters in the previous stage. The merging of the output of the pooling layer with the outputs of the convolutional layers would lead to an inevitable increase in the number of outputs from stage to stage. While this architecture might cover the optimal sparse structure, it would do it very inefficiently, leading to a computational blow-up within a few stages.
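The stage-to-stage growth described above can be made concrete with a short sketch. It assumes a naive module whose four branches (1×1, 3×3, 5×5 and pooling) are concatenated along the channel axis; the branch widths and the starting depth are hypothetical values picked for illustration.

```python
def naive_module_output_channels(in_channels, n1x1, n3x3, n5x5):
    """Channels after concatenating the four branches of a naive module.

    The pooling branch keeps every input channel, so it always contributes
    `in_channels` to the concatenated output.
    """
    return n1x1 + n3x3 + n5x5 + in_channels


channels = 192  # hypothetical input depth
history = [channels]
for _ in range(4):  # stack four naive modules with fixed branch widths
    channels = naive_module_output_channels(channels, n1x1=64, n3x3=128, n5x5=32)
    history.append(channels)

print(history)
```

Because the pooling branch passes the full input depth through, the output depth can only grow, even when the convolutional branches stay the same width.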
This leads to the second idea of the Inception architecture: judiciously reducing dimension wherever the computational requirements would increase too much otherwise. This is based on the success of embeddings: even low dimensional embeddings might contain a lot of information about a relatively large image patch. However, embeddings represent information in a dense, compressed form and compressed information is harder to process. The representation should be kept sparse at most places (as required by the conditions of [2]) and compress the signals only whenever they have to be aggregated en masse. That is, 1×1 convolutions are used to compute reductions before the expensive 3×3 and 5×5 convolutions. Besides being used as reductions, they also include the use of rectified linear activation making them dual-purpose. The final result is depicted in Figure 2(b).
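A rough cost comparison shows why placing a 1×1 reduction in front of a 5×5 convolution pays off. This is a sketch under stated assumptions: the 28×28 grid and the channel counts (192 in, 16 after reduction, 32 out) are illustrative values that do not appear in the text above, and the cost model counts multiply-adds for stride-1, same-padded convolutions without biases.

```python
def conv_multiply_adds(height, width, in_channels, out_channels, kernel):
    """Multiply-adds of a stride-1, same-padded convolution (bias ignored)."""
    return height * width * in_channels * out_channels * kernel * kernel


# Illustrative sizes (assumptions, not taken from the surrounding text).
H, W, C_IN = 28, 28, 192
N_REDUCE = 16   # 1x1 filters in the reduction layer
N_5X5 = 32      # 5x5 filters in the expensive layer

# 5x5 convolution applied directly to all input channels.
direct = conv_multiply_adds(H, W, C_IN, N_5X5, kernel=5)

# 1x1 reduction first, then the 5x5 convolution on the reduced depth.
reduced = (conv_multiply_adds(H, W, C_IN, N_REDUCE, kernel=1)
           + conv_multiply_adds(H, W, N_REDUCE, N_5X5, kernel=5))

saving = direct / reduced
print(direct, reduced, round(saving, 2))
```

With these numbers the reduced path needs roughly a tenth of the multiply-adds, and the 1×1 layer itself accounts for only a small share of the remaining cost.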
In general, an Inception network is a network consisting of modules of the above type stacked upon each other, with occasional max-pooling layers with stride 2 to halve the resolution of the grid. For technical reasons (memory efficiency during training), it seemed beneficial to start using Inception modules only at higher layers while keeping the lower layers in traditional convolutional fashion. This is not strictly necessary, simply reflecting some infrastructural inefficiencies in our current implementation.
One of the main beneficial aspects of this architecture is that it allows for increasing the number of units at each stage significantly without an uncontrolled blow-up in computational complexity. The ubiquitous use of dimension reduction allows for shielding the large number of input filters of the last stage to the next layer, first reducing their dimension before convolving over them with a large patch size. Another practically useful aspect of this design is that it aligns with the intuition that visual information should be processed at various scales and then aggregated so that the next stage can abstract features from different scales simultaneously.
The improved use of computational resources allows for increasing both the width of each stage and the number of stages without getting into computational difficulties. Another way to utilize the Inception architecture is to create slightly inferior, but computationally cheaper, versions of it. We have found that all the included knobs and levers allow for a controlled balancing of computational resources that can result in networks that are 2−3× faster than similarly performing networks with non-Inception architecture; however, this requires careful manual design at this point.
5 GoogLeNet
We chose GoogLeNet as our team name in the ILSVRC14 competition. This name is an homage to Yann LeCun's pioneering LeNet-5 network [10]. We also use GoogLeNet to refer to the particular incarnation of the Inception architecture used in our submission for the competition. We have also used a deeper and wider Inception network, the quality of which was slightly inferior, but adding it to the ensemble seemed to improve the results marginally. We omit the details of that network, since our experiments have shown that the influence of the exact architectural parameters is relatively minor. Here, the most successful particular instance (named GoogLeNet) is described in Table 1 for demonstrational purposes. The exact same topology (trained with different sampling methods) was used for 6 out of the 7 models in our ensemble.
All the convolutions, including those inside the Inception modules, use rectified linear activation. The size of the receptive field in our network is 224×224 in the RGB color space with zero mean. “#3×3 reduce” and “#5×5 reduce” stands for the number of 1×1 filters in the reduction layer used before the 3×3 and 5×5 convolutions. One can see the number of 1×1 filters in the projection layer after the built-in max-pooling in the pool proj column. All these reduction/projection layers use rectified linear activation as well.
The network was designed with computational efficiency and practicality in mind, so that inference can be run on individual devices including even those with limited computational resources, especially with low-memory footprint. The network is 22 layers deep when counting only layers with parameters (or 27 layers if we also count pooling). The overall number of layers (independent building blocks) used for the construction of the network is about 100. However, this number depends on the machine learning infrastructure system used. The use of average pooling before the classifier is based on [12], although our implementation differs in that we use an extra linear layer. This enables adapting and fine-tuning our networks for other label sets easily, but it is mostly a convenience and we do not expect it to have a major effect. It was found that a move from fully connected layers to average pooling improved the top-1 accuracy by about 0.6%; however, the use of dropout remained essential even after removing the fully connected layers.
Given the relatively large depth of the network, the ability to propagate gradients back through all the layers in an effective manner was a concern. One interesting insight is that the strong performance of relatively shallower networks on this task suggests that the features produced by the layers in the middle of the network should be very discriminative. By adding auxiliary classifiers connected to these intermediate layers, we would expect to encourage discrimination in the lower stages in the classifier, increase the gradient signal that gets propagated back, and provide additional regularization. These classifiers take the form of smaller convolutional networks put on top of the output of the Inception (4a) and (4d) modules. During training, their loss gets added to the total loss of the network with a discount weight (the losses of the auxiliary classifiers were weighted by 0.3). At inference time, these auxiliary networks are discarded.
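The loss weighting described above amounts to a one-line formula. The sketch below is a minimal illustration, not the authors' code: the function name and the example loss values are hypothetical, and only the 0.3 discount weight comes from the text.

```python
AUX_WEIGHT = 0.3  # discount weight for the auxiliary classifiers, as stated in the text


def training_loss(main_loss, aux_losses, aux_weight=AUX_WEIGHT):
    """Total training loss: main loss plus discounted auxiliary losses.

    `aux_losses` holds one loss per auxiliary classifier, for example the
    heads attached to the (4a) and (4d) modules.
    """
    return main_loss + aux_weight * sum(aux_losses)


# Hypothetical loss values for a single training step.
total = training_loss(main_loss=2.0, aux_losses=[3.0, 1.0])
print(total)

# At inference time the auxiliary heads are dropped, so only the main
# classifier's output is used and no auxiliary term appears anywhere.
```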
The exact structure of the extra network on the side, including the auxiliary classifier, is as follows:
An average pooling layer with 5×5 filter size and stride 3, resulting in a 4×4×512 output for the (4a), and 4×4×528 for the (4d) stage.
A 1×1 convolution with 128 filters for dimension reduction and rectified linear activation.
A fully connected layer with 1024 units and rectified linear activation.
A dropout layer with 70% ratio of dropped outputs.
A linear layer with softmax loss as the classifier (predicting the same 1000 classes as the main classifier, but removed at inference time).
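The output sizes quoted in the first item of the list can be reproduced with the usual formula for unpadded pooling. This sketch assumes a 14×14 input grid at the (4a) and (4d) stages; that grid size is an assumption made here for the arithmetic and is not stated in the list above.

```python
def pooled_size(input_size, window, stride):
    """Output length of unpadded ("valid") pooling along one axis."""
    return (input_size - window) // stride + 1


GRID = 14  # assumed spatial size of the (4a) and (4d) outputs

side = pooled_size(GRID, window=5, stride=3)
shape_4a = (side, side, 512)   # average pooling keeps the channel count
shape_4d = (side, side, 528)

# The 1x1 convolution with 128 filters then reduces the depth to 128,
# so the fully connected layer sees side * side * 128 inputs.
fc_inputs = side * side * 128
print(shape_4a, shape_4d, fc_inputs)
```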
A schematic view of the resulting network is depicted in Figure 3.
Figure 3: GoogLeNet network with all the bells and whistles
6 Training Methodology
Our networks were trained using the DistBelief [4] distributed machine learning system using a modest amount of model and data parallelism. Although we used a CPU-based implementation only, a rough estimate suggests that the GoogLeNet network could be trained to convergence using few high-end GPUs within a week, the main limitation being the memory usage. Our training used asynchronous stochastic gradient descent with 0.9 momentum [17] and a fixed learning rate schedule (decreasing the learning rate by 4% every 8 epochs). Polyak averaging [13] was used to create the final model used at inference time.
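The schedule and the averaging step can be sketched in a few lines. The base learning rate below is a hypothetical value (the text does not give one), and `polyak_average` shows one simple form of Polyak averaging, a uniform mean over saved parameter snapshots; the text does not say which variant was used.

```python
def learning_rate(epoch, base_lr=0.01, decay=0.04, step=8):
    """Step schedule: cut the rate by 4% once every 8 epochs.

    `base_lr` is a hypothetical starting value chosen for illustration.
    """
    return base_lr * (1.0 - decay) ** (epoch // step)


def polyak_average(snapshots):
    """Uniform average of parameter vectors saved along the training run."""
    count = len(snapshots)
    return [sum(values) / count for values in zip(*snapshots)]


schedule = [learning_rate(epoch) for epoch in (0, 7, 8, 16)]
averaged = polyak_average([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(schedule, averaged)
```

Epochs 0 through 7 share the base rate, the first cut takes effect at epoch 8, and each later block of 8 epochs multiplies the rate by 0.96 again.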
Our image sampling methods have changed substantially over the months leading to the competition, and already converged models were trained on with other options, sometimes in conjunction with changed hyperparameters, like dropout and learning rate, so it is hard to give definitive guidance on the most effective single way to train these networks. To complicate matters further, some of the models were mainly trained on smaller relative crops, others on larger ones, inspired by [8]. Still, one prescription that was verified to work very well after the competition includes sampling of various sized patches of the image whose size is distributed evenly between 8% and 100% of the image area and whose aspect ratio is chosen randomly between 3/4 and 4/3. Also, we found that the photometric distortions by Andrew Howard [8] were useful to combat overfitting to some extent. In addition, we started to use random interpolation methods (bilinear, area, nearest neighbor and cubic, with equal probability) for resizing relatively late and in conjunction with other hyperparameter changes, so we could not tell definitely whether the final results were affected positively by their use.
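The patch-sampling prescription can be sketched as follows. Several details are assumptions rather than facts from the text: the aspect ratio is drawn uniformly and interpreted as width over height, a sample that does not fit the image is retried a few times, and the whole image is used as a fallback. The function name `sample_crop` is hypothetical.

```python
import math
import random


def sample_crop(img_w, img_h, rng, max_tries=10):
    """Sample a crop box (left, top, width, height) inside the image.

    The target area is drawn uniformly from 8% to 100% of the image area and
    the aspect ratio (width / height) uniformly from 3/4 to 4/3.
    """
    area = img_w * img_h
    for _ in range(max_tries):
        target_area = rng.uniform(0.08, 1.0) * area
        aspect = rng.uniform(3 / 4, 4 / 3)
        w = int(round(math.sqrt(target_area * aspect)))
        h = int(round(math.sqrt(target_area / aspect)))
        if 0 < w <= img_w and 0 < h <= img_h:
            left = rng.randint(0, img_w - w)
            top = rng.randint(0, img_h - h)
            return left, top, w, h
    # Fallback (an assumption of this sketch): use the whole image.
    return 0, 0, img_w, img_h


rng = random.Random(0)  # seeded only to make the sketch repeatable
crop = sample_crop(640, 480, rng)
print(crop)
```

The sampled box would then be resized to the network's input size; the random choice of interpolation method mentioned above is left out of this sketch.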
7 ILSVRC 2014 Classification Challenge Setup and Results
The ILSVRC 2014 classification challenge involves the task of classifying the image into one of 1000 leaf-node categories in the Imagenet hierarchy. There are about 1.2 million images for training, 50,000 for validation and 100,000 images for testing. Each image is associated with one ground truth category, and performance is measured based on the highest scoring classifier predictions. Two numbers are usually reported: the top-1 accuracy rate, which compares the ground truth against the first predicted class, and the top-5 error rate, which compares the ground truth against the first 5 predicted classes: an image is deemed correctly classified if the ground truth is among the top-5, regardless of its rank in them. The challenge uses the top-5 error rate for ranking purposes.
ILSVRC 2014分类挑战涉及将图像分类为ImageNet层次结构中的1000个叶节点类别之一的任务。大约有120万张图像用于训练,5万张用于验证,10万张用于测试。每幅图像都与一个真实类别(ground truth)相关联,并且性能是基于得分最高的分类器预测来衡量的。通常报告两个数字:top-1准确率,比较实际类别和第一个预测类别;top-5错误率,比较实际类别与前5个预测类别:如果图像的实际类别在top-5中,则认为图像分类正确,不管它在top-5中的排名。挑战赛使用top-5错误率来进行排名。
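As a small illustration of the top-1 and top-5 measures described above, the following sketch computes the top-k error from per-class scores. The function name `top_k_error` and the toy scores are made up for this example and are not the official evaluation code.

```python
def top_k_error(scores, labels, k=5):
    """Fraction of images whose true label is not among the k highest scores."""
    errors = 0
    for image_scores, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(image_scores)),
                        key=lambda c: image_scores[c], reverse=True)
        if label not in ranked[:k]:
            errors += 1
    return errors / len(labels)


# Toy data: 3 images, 7 classes, identical scores for simplicity.
row = [0.05, 0.40, 0.25, 0.15, 0.10, 0.03, 0.02]
scores = [row, row, row]
labels = [2, 6, 1]
print(top_k_error(scores, labels, k=1))  # labels 2 and 6 are not ranked first
print(top_k_error(scores, labels, k=5))  # only label 6 falls outside the top 5
```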
We participated in the challenge with no external data used for training. In addition to the training techniques aforementioned in this paper, we adopted a set of techniques during testing to obtain a higher performance, which we describe next.
We independently trained 7 versions of the same GoogLeNet model (including one wider version), and performed ensemble prediction with them. These models were trained with the same initialization (even with the same initial weights, due to an oversight) and learning rate policies. They differed only in sampling methodologies and the randomized input image order.
During testing, we adopted a more aggressive cropping approach than that of Krizhevsky et al. [9]. Specifically, we resized the image to 4 scales where the shorter dimension (height or width) is 256, 288, 320 and 352 respectively, take the left, center and right square of these resized images (in the case of portrait images, we take the top, center and bottom squares). For each square, we then take the 4 corners and the center 224×224 crop as well as the square resized to 224×224, and their mirrored versions. This leads to 4×3×6×2 = 144 crops per image. A similar approach was used by Andrew Howard [8] in the previous year’s entry, which we empirically verified to perform slightly worse than the proposed scheme. We note that such aggressive cropping may not be necessary in real applications, as the benefit of more crops becomes marginal after a reasonable number of crops are present (as we will show later on).
The softmax probabilities are averaged over multiple crops and over all the individual classifiers to obtain the final prediction. In our experiments we analyzed alternative approaches on the validation data, such as max pooling over crops and averaging over classifiers, but they lead to inferior performance than the simple averaging.
我们参加了这次挑战,没有使用外部数据进行培训。除了本文中提到的训练技术之外,我们还在测试中采用了一套技术来获得更高的性能,我们将在下面对此进行详细的阐述。
我们独立地训练了7个版本的同一GoogLeNet模型(包括一个更宽的版本),并用它们进行了集成预测。这些模型采用相同的初始化(由于疏忽,甚至具有相同的初始权重)和学习率策略进行训练,它们只在采样方法和输入图像的随机顺序上有所不同。
在测试中,我们采用比Krizhevsky等人[9]更积极的裁剪方法。具体来说,我们将图像归一化为四个尺度,其中较短维度(高度或宽度)分别为256,288,320和352,取这些归一化的图像的左,中,右方块(在肖像图片中,我们采用顶部,中心和底部方块)。对于每个方块,我们将采用4个角以及中心224×224裁剪图像以及方块尺寸归一化为224×224,以及它们的镜像版本。这导致每张图像会得到4×3×6×2 = 144的裁剪图像。前一年的输入中,Andrew Howard[8]采用了类似的方法,经过我们实证验证,其方法略差于我们提出的方案。我们注意到,在实际应用中,这种积极裁剪可能是不必要的,因为存在合理数量的裁剪图像后,更多裁剪图像的好处会变得很微小(正如我们后面展示的那样)。
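The 4×3×6×2 = 144 count can be checked by enumerating the crops explicitly. The sketch below only produces crop descriptors (no pixels are touched). The function name `enumerate_crops`, the rounding, and the exact placement of the three squares are assumptions chosen for illustration.

```python
def enumerate_crops(width, height, scales=(256, 288, 320, 352), crop=224):
    """Return (scale, box, resize_to_crop, mirrored) descriptors for one image.

    box = (left, top, right, bottom) in the coordinates of the image after
    resizing so that its shorter side equals `scale`.
    """
    crops = []
    for scale in scales:
        # Resize so the shorter side is `scale`, keeping the aspect ratio.
        factor = scale / min(width, height)
        w, h = round(width * factor), round(height * factor)
        # Three squares of side `scale` along the longer dimension
        # (left/center/right, or top/center/bottom for portrait images).
        # For a square image the three squares coincide.
        span = max(w, h) - scale
        for offset in (0, span // 2, span):
            sx, sy = (offset, 0) if w >= h else (0, offset)
            far = scale - crop
            mid = far // 2
            # Four corners and the center crop inside the square.
            shifts = ((0, 0), (far, 0), (0, far), (far, far), (mid, mid))
            views = [((sx + dx, sy + dy, sx + dx + crop, sy + dy + crop), False)
                     for dx, dy in shifts]
            # The whole square, to be resized down to crop x crop.
            views.append(((sx, sy, sx + scale, sy + scale), True))
            for box, resize in views:
                for mirrored in (False, True):
                    crops.append((scale, box, resize, mirrored))
    return crops


print(len(enumerate_crops(500, 375)))  # 4 scales * 3 squares * 6 views * 2 mirrors
```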
在多个裁剪图像和所有单个分类器上对softmax概率进行平均,以获得最终的预测结果。在我们的实验中,我们在验证数据上分析了其他替代方法,例如对裁剪图像做最大池化并对分类器取平均,但它们的性能不如简单平均。
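The two aggregation strategies mentioned above can be compared on toy numbers. In this sketch, `probs[m][c]` is assumed to hold the softmax vector of model `m` on crop `c`. Both function names and the data layout are illustrative assumptions.

```python
def average_predictions(probs):
    """Class-wise mean of the softmax vectors over all models and crops."""
    vectors = [crop_probs for model_probs in probs for crop_probs in model_probs]
    n_classes = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(n_classes)]


def max_over_crops_then_average(probs):
    """Alternative reported as inferior: max over crops, then mean over models."""
    n_classes = len(probs[0][0])
    per_model = [[max(crop[k] for crop in model_probs) for k in range(n_classes)]
                 for model_probs in probs]
    return [sum(m[k] for m in per_model) / len(per_model)
            for k in range(n_classes)]


# Two models, two crops each, two classes (toy numbers).
probs = [[[0.9, 0.1], [0.5, 0.5]],
         [[0.2, 0.8], [0.6, 0.4]]]
print(average_predictions(probs))
print(max_over_crops_then_average(probs))
```

The flat mean equals the mean of per-model means only when every model sees the same number of crops, which holds in the setup described here.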
In the remainder of this paper, we analyze the multiple factors that contribute to the overall performance of the final submission.
Our final submission in the challenge obtains a top-5 error of 6.67% on both the validation and testing data, ranking the first among other participants. This is a 56.5% relative reduction compared to the SuperVision approach in 2012, and about 40% relative reduction compared to the previous year’s best approach (Clarifai), both of which used external data for training the classifiers. The following table shows the statistics of some of the top-performing approaches.
在本文的其余部分,我们分析了影响最终提交的总体性能的多种因素。
我们在挑战赛中的最终提交在验证数据和测试数据上都取得了6.67%的top-5错误率,在所有参与者中排名第一。与2012年的SuperVision方法相比,相对减少了56.5%;与前一年的最佳方法(Clarifai)相比,相对减少了约40%,这两种方法都使用了外部数据来训练分类器。下表显示了一些性能最好的方法的统计数据。
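The relative reductions quoted above follow from a one-line formula. The baseline top-5 errors used below (15.3% for SuperVision and 11.2% for Clarifai, both with external data) are recalled from the original paper's results table, which is not reproduced in this section, so they should be treated as assumptions.

```python
def relative_reduction(baseline_error, new_error):
    """Relative error reduction of new_error with respect to baseline_error."""
    return (baseline_error - new_error) / baseline_error


GOOGLENET_TOP5 = 6.67     # stated in the text
SUPERVISION_TOP5 = 15.3   # assumed baseline (2012, with external data)
CLARIFAI_TOP5 = 11.2      # assumed baseline (2013, with external data)

print(f"{relative_reduction(SUPERVISION_TOP5, GOOGLENET_TOP5):.1%}")  # about 56.4%
print(f"{relative_reduction(CLARIFAI_TOP5, GOOGLENET_TOP5):.1%}")     # about 40.4%
```

With these assumed baselines the result is close to the 56.5% and "about 40%" given in the text. The small difference in the first figure is presumably due to rounding.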
We also analyze and report the performance of multiple testing choices, by varying the number of models and the number of crops used when predicting an image in the following table. When we use one model, we chose the one with the lowest top-1 error rate on the validation data. All numbers are reported on the validation dataset in order to not overfit to the testing data statistics.
我们还通过改变预测图像时使用的模型数量和裁剪图像数量,分析并报告了多种测试选择的性能,结果见下表。当只使用一个模型时,我们选择在验证数据上top-1错误率最低的模型。所有数字都是在验证数据集上报告的,以避免对测试数据的统计过拟合。
8 ILSVRC 2014 Detection Challenge Setup and Results(ILSVRC 2014检测挑战设置和结果)
The ILSVRC detection task is to produce bounding boxes around objects in images among 200 possible classes. Detected objects count as correct if they match the class of the groundtruth and their bounding boxes overlap by at least 50% (using the Jaccard index). Extraneous detections count as false positives and are penalized. Contrary to the classification task, each image may contain many objects or none, and their scale may vary. Results are reported using the mean average precision (mAP).
ILSVRC检测任务是在200个可能的类中,围绕图像中的对象生成包围框。如果检测到的对象与地面真相类相匹配,并且它们的边界框至少重叠50%(使用Jaccard索引),则它们就算作正确的对象。多余的检测被视为假阳性并受到惩罚。与分类任务相反,每幅图像可能包含多个对象,也可能没有对象,它们的比例可能从大到小。报告的结果使用平均精度均值(mAP)。
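The matching criterion above (same class, and bounding boxes overlapping by at least 50% under the Jaccard index) can be sketched as follows. This is a simplified illustration with made-up function names. The official evaluation also handles details such as duplicate detections, which are omitted here.

```python
def jaccard_index(box_a, box_b):
    """Intersection over union of two (left, top, right, bottom) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct_detection(pred_class, pred_box, gt_class, gt_box, threshold=0.5):
    """True if the classes match and the boxes overlap by at least `threshold`."""
    return pred_class == gt_class and jaccard_index(pred_box, gt_box) >= threshold


print(jaccard_index((0, 0, 10, 10), (0, 0, 10, 5)))                      # 0.5
print(is_correct_detection("dog", (0, 0, 10, 10), "dog", (5, 5, 15, 15)))  # False
```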
The approach taken by GoogLeNet for detection is similar to the R-CNN by [6], but is augmented with the Inception model as the region classifier. Additionally, the region proposal step is improved by combining the Selective Search [20] approach with multi-box [5] predictions for higher object bounding box recall. In order to cut down the number of false positives, the superpixel size was increased by 2×. This halves the proposals coming from the selective search algorithm. We added back 200 region proposals coming from multi-box [5] resulting, in total, in about 60% of the proposals used by [6], while increasing the coverage from 92% to 93%. The overall effect of cutting the number of proposals with increased coverage is a 1% improvement of the mean average precision for the single model case. Finally, we use an ensemble of 6 ConvNets when classifying each region which improves results from 40% to 43.9% accuracy. Note that contrary to R-CNN, we did not use bounding box regression due to lack of time.
GoogLeNet所采用的检测方法与[6]的R-CNN类似,但使用Inception模型作为区域分类器进行了增强。此外,通过将选择性搜索[20]方法与多框(multi-box)[5]预测相结合,改进了区域建议步骤,从而提高了目标边界框的召回率。为了减少假阳性的数量,超像素大小增加了2倍。这使来自选择性搜索算法的建议数量减半。我们又加回了200个来自多框[5]的区域建议,总数约为[6]所用建议的60%,同时将覆盖率从92%提高到93%。减少建议数量并提高覆盖率的总体效果是,单模型情况下的平均精度均值提高了1%。最后,在对每个区域进行分类时,我们使用了6个卷积网络的集成,这将准确率从40%提高到43.9%。注意,与R-CNN不同,由于时间不足,我们没有使用边界框回归。
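The "about 60%" figure above can be reproduced with simple arithmetic. The sketch assumes roughly 2000 selective-search proposals per image for the R-CNN baseline, a number taken from the R-CNN work [6] and not stated in this section.

```python
RCNN_PROPOSALS = 2000    # assumed R-CNN baseline, not stated in this section
MULTIBOX_PROPOSALS = 200  # stated in the text

# Doubling the superpixel size halves the selective-search proposals.
selective_search = RCNN_PROPOSALS // 2
total = selective_search + MULTIBOX_PROPOSALS
fraction = total / RCNN_PROPOSALS

print(total, fraction)  # 1200 proposals, i.e. 0.6 of the assumed baseline
```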
We first report the top detection results and show the progress since the first edition of the detection task. Compared to the 2013 result, the accuracy has almost doubled. The top performing teams all use Convolutional Networks. We report the official scores in Table 4 and common strategies for each team: the use of external data, ensemble models or contextual models. The external data is typically the ILSVRC12 classification data for pre-training a model that is later refined on the detection data. Some teams also mention the use of the localization data. Since a good portion of the localization task bounding boxes are not included in the detection dataset, one can pre-train a general bounding box regressor with this data the same way classification is used for pre-training. The GoogLeNet entry did not use the localization data for pretraining.
我们首先报告排名靠前的检测结果,并展示自第一届检测任务以来的进展。与2013年的结果相比,准确率几乎翻了一番。表现最好的团队都使用了卷积网络。我们在表4中报告了官方分数以及各个团队的常用策略:使用外部数据、集成模型或上下文模型。外部数据通常是ILSVRC12分类数据,用于预训练模型,之后再在检测数据上进行微调。一些团队还提到使用了定位数据。由于定位任务的边界框有很大一部分不包含在检测数据集中,因此可以用这些数据预训练一个通用的边界框回归器,就像用分类数据进行预训练一样。GoogLeNet的参赛模型没有使用定位数据进行预训练。
In Table 5, we compare results using a single model only. The top performing model is by Deep Insight and surprisingly only improves by 0.3 points with an ensemble of 3 models while the GoogLeNet obtains significantly stronger results with the ensemble.
在表5中,我们仅比较了单个模型的结果。最好性能模型是Deep Insight的,令人惊讶的是3个模型的集合仅提高了0.3个点,而GoogLeNet在模型集成时明显获得了更好的结果。
9 Conclusions(总结)
Our results seem to yield solid evidence that approximating the expected optimal sparse structure by readily available dense building blocks is a viable method for improving neural networks for computer vision. The main advantage of this method is a significant quality gain at a modest increase of computational requirements compared to shallower and less wide networks. Also note that our detection work was competitive despite neither utilizing context nor performing bounding box regression, and this fact provides further evidence of the strength of the Inception architecture. Although it is expected that similar quality of result can be achieved by much more expensive networks of similar depth and width, our approach yields solid evidence that moving to sparser architectures is a feasible and useful idea in general. This suggests promising future work towards creating sparser and more refined structures in automated ways on the basis of [2].
我们的结果似乎提供了确凿的证据,证明用现成的密集构建块来近似预期的最优稀疏结构,是改进计算机视觉神经网络的一种可行方法。与较浅、较窄的网络相比,该方法的主要优点是在计算需求略有增加的情况下获得了显著的质量提升。还要注意的是,尽管我们的检测工作既没有利用上下文,也没有执行边界框回归,但仍然具有竞争力,这一事实进一步证明了Inception架构的优势。虽然可以预期,深度和宽度相近但昂贵得多的网络也能达到类似的结果质量,但我们的方法提供了确凿的证据,表明转向更稀疏的架构总体上是可行且有用的想法。这表明未来有希望在[2]的基础上,以自动化的方式创建更稀疏、更精细的结构。
10 Acknowledgements(致谢)
We would like to thank Sanjeev Arora and Aditya Bhaskara for fruitful discussions on [2]. We are also indebted to the DistBelief [4] team for their support, especially to Rajat Monga, Jon Shlens, Alex Krizhevsky, Jeff Dean, Ilya Sutskever and Andrea Frome. We would also like to thank Tom Duerig and Ning Ye for their help on photometric distortions. Also, our work would not have been possible without the support of Chuck Rosenberg and Hartwig Adam.
我们要感谢Sanjeev Arora和Aditya Bhaskara就[2]进行的富有成果的讨论。我们还要感谢DistBelief[4]团队的支持,特别是Rajat Monga、Jon Shlens、Alex Krizhevsky、Jeff Dean、Ilya Sutskever和Andrea Frome。我们还要感谢Tom Duerig和Ning Ye在光度畸变方面的帮助。此外,如果没有Chuck Rosenberg和Hartwig Adam的支持,我们的工作就不可能完成。
[1] Know your meme: We need to go deeper. http://knowyourmeme.com/memes/we-need-to-go-deeper. Accessed: 2014-09-15.
[2] S. Arora, A. Bhaskara, R. Ge, and T. Ma. Provable bounds for learning some deep representations. CoRR, abs/1310.6343, 2013.
[3] Ü. V. Çatalyürek, C. Aykanat, and B. Uçar. On two-dimensional sparse matrix partitioning: Models, methods, and a recipe. SIAM J. Sci. Comput., 32(2):656–683, Feb. 2010.
[4] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Y. Ng. Large scale distributed deep networks. In P. Bartlett, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, NIPS, pages 1232–1240. 2012.
[5] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In CVPR, 2014.
[6] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition, 2014. CVPR 2014. IEEE Conference on, 2014.
[7] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.
[8] A. G. Howard. Some improvements on deep convolutional neural network based image classification. CoRR, abs/1312.5402, 2013.
[9] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114, 2012.
[10] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Comput., 1(4):541–551, Dec. 1989.
[11] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[12] M. Lin, Q. Chen, and S. Yan. Network in network. CoRR, abs/1312.4400, 2013.
[13] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838–855, July 1992.
[14] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. CoRR, abs/1312.6229, 2013.
[15] T. Serre, L. Wolf, S. M. Bileschi, M. Riesenhuber, and T. Poggio. Robust object recognition with cortex-like mechanisms. IEEE Trans. Pattern Anal. Mach. Intell., 29(3):411–426, 2007.
[16] F. Song and J. Dongarra. Scaling up matrix computations on shared-memory manycore systems with 1000 cpu cores. In Proceedings of the 28th ACM International Conference on Supercomputing, ICS ’14, pages 333–342, New York, NY, USA, 2014. ACM.
[17] I. Sutskever, J. Martens, G. E. Dahl, and G. E. Hinton. On the importance of initialization and momentum in deep learning. In ICML, volume 28 of JMLR Proceedings, pages 1139–1147. JMLR.org, 2013.
[18] C. Szegedy, A. Toshev, and D. Erhan. Deep neural networks for object detection. In C. J. C. Burges, L. Bottou, Z. Ghahramani, and K. Q. Weinberger, editors, NIPS, pages 2553–2561, 2013.
[19] A. Toshev and C. Szegedy. Deeppose: Human pose estimation via deep neural networks. CoRR, abs/1312.4659, 2013.
[20] K. E. A. van de Sande, J. R. R. Uijlings, T. Gevers, and A. W. M. Smeulders. Segmentation as selective search for object recognition. In Proceedings of the 2011 International Conference on Computer Vision, ICCV ’11, pages 1879–1886, Washington, DC, USA, 2011. IEEE Computer Society.
[21] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In D. J. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, editors, ECCV, volume 8689 of Lecture Notes in Computer Science, pages 818–833. Springer, 2014.