Translation of “Learning Transferable Architectures for Scalable Image Recognition”

Original paper: https://arxiv.org/abs/1707.07012

Learning Transferable Architectures for Scalable Image Recognition

Barret Zoph, Google Brain ([email protected])

Abstract

Developing neural network image classification models often requires significant architecture engineering. In this paper, we study a method to learn the model architectures directly on the dataset of interest. As this approach is expensive when the dataset is large, we propose to search for an architectural building block on a small dataset and then transfer the block to a larger dataset. The key contribution of this work is the design of a new search space (which we call the “NASNet search space”) which enables transferability. In our experiments, we search for the best convolutional layer (or “cell”) on the CIFAR-10 dataset and then apply this cell to the ImageNet dataset by stacking together more copies of this cell, each with their own parameters, to design a convolutional architecture, which we name a “NASNet architecture”. We also introduce a new regularization technique called ScheduledDropPath that significantly improves generalization in the NASNet models. On CIFAR-10 itself, a NASNet found by our method achieves a 2.4% error rate, which is state-of-the-art. Although the cell is not searched for directly on ImageNet, a NASNet constructed from the best cell achieves, among the published works, state-of-the-art accuracy of 82.7% top-1 and 96.2% top-5 on ImageNet. Our model is 1.2% better in top-1 accuracy than the best human-invented architectures while having 9 billion fewer FLOPS – a reduction of 28% in computational demand from the previous state-of-the-art model. When evaluated at different levels of computational cost, accuracies of NASNets exceed those of the state-of-the-art human-designed models. For instance, a small version of NASNet also achieves 74% top-1 accuracy, which is 3.1% better than equivalently-sized, state-of-the-art models for mobile platforms. Finally, the image features learned from image classification are generically useful and can be transferred to other computer vision problems. On the task of object detection, the learned features by NASNet used with the Faster-RCNN framework surpass state-of-the-art by 4.0%, achieving 43.1% mAP on the COCO dataset.

1. Introduction

Developing neural network image classification models often requires significant architecture engineering. Starting from the seminal work of [32] on using convolutional architectures [17, 34] for ImageNet [11] classification, successive advancements through architecture engineering have achieved impressive results [53, 59, 20, 60, 58, 68].

In this paper, we study a new paradigm of designing convolutional architectures and describe a scalable method to optimize convolutional architectures on a dataset of interest, for instance the ImageNet classification dataset. Our approach is inspired by the recently proposed Neural Architecture Search (NAS) framework [71], which uses a reinforcement learning search method to optimize architecture configurations. Applying NAS, or any other search methods, directly to a large dataset, such as the ImageNet dataset, is however computationally expensive. We therefore propose to search for a good architecture on a proxy dataset, for example the smaller CIFAR-10 dataset, and then transfer the learned architecture to ImageNet. We achieve this transferability by designing a search space (which we call “the NASNet search space”) so that the complexity of the architecture is independent of the depth of the network and the size of input images. More concretely, all convolutional networks in our search space are composed of convolutional layers (or “cells”) with identical structure but different weights. Searching for the best convolutional architectures is therefore reduced to searching for the best cell structure. Searching for the best cell structure has two main benefits: it is much faster than searching for an entire network architecture, and the cell itself is more likely to generalize to other problems. In our experiments, this approach significantly accelerates the search for the best architectures using CIFAR-10 by a factor of 7× and learns architectures that successfully transfer to ImageNet.

Our main result is that the best architecture found on CIFAR-10, called NASNet, achieves state-of-the-art accuracy when transferred to ImageNet classification without much modification. On ImageNet, NASNet achieves, among the published works, state-of-the-art accuracy of 82.7% top-1 and 96.2% top-5. This result amounts to a 1.2% improvement in top-1 accuracy over the best human-invented architectures while having 9 billion fewer FLOPS. On CIFAR-10 itself, NASNet achieves a 2.4% error rate, which is also state-of-the-art.

Additionally, by simply varying the number of the convolutional cells and the number of filters in the convolutional cells, we can create different versions of NASNets with different computational demands. Thanks to this property of the cells, we can generate a family of models that achieve accuracies superior to all human-invented models at equivalent or smaller computational budgets [60, 29]. Notably, the smallest version of NASNet achieves 74.0% top-1 accuracy on ImageNet, which is 3.1% better than previously engineered architectures targeted towards mobile and embedded vision tasks [24, 70].

Finally, we show that the image features learned by NASNets are generically useful and transfer to other computer vision problems. In our experiments, the features learned by NASNets from ImageNet classification can be combined with the Faster-RCNN framework [47] to achieve state-of-the-art results on the COCO object detection task for both the largest as well as mobile-optimized models. Our largest NASNet model achieves 43.1% mAP, which is 4% better than the previous state-of-the-art.

2. Related Work

The proposed method is related to previous work in hyperparameter optimization [44, 4, 5, 54, 55, 6, 40] – especially recent approaches in designing architectures such as Neural Fabrics [48], DiffRNN [41], MetaQNN [3] and DeepArchitect [43]. A more flexible class of methods for designing architectures is evolutionary algorithms [65, 16, 57, 30, 46, 42, 67], yet they have not had as much success at large scale. Xie and Yuille [67] also transferred learned architectures from CIFAR-10 to ImageNet, but the performance of these models (top-1 accuracy 72.1%) is notably below the previous state-of-the-art (Table 2).

The concept of having one neural network interact with a second neural network to aid the learning process, or learning to learn or meta-learning [23, 49], has attracted much attention in recent years [1, 62, 14, 19, 35, 45, 15]. Most of these approaches have not been scaled to large problems like ImageNet. An exception is the recent work focused on learning an optimizer for ImageNet classification that achieved notable improvements [64].

The design of our search space took much inspiration from LSTMs [22], and the Neural Architecture Search Cell [71]. The modular structure of the convolutional cell is also related to previous methods on ImageNet such as VGG [53], Inception [59, 60, 58], ResNet/ResNeXt [20, 68], and Xception/MobileNet [9, 24].

3. Method

Our work makes use of search methods to find good convolutional architectures on a dataset of interest. The main search method we use in this work is the Neural Architecture Search (NAS) framework proposed by [71]. In NAS, a controller recurrent neural network (RNN) samples child networks with different architectures. The child networks are trained to convergence to obtain some accuracy on a held-out validation set. The resulting accuracies are used to update the controller so that the controller will generate better architectures over time. The controller weights are updated with policy gradient (see Figure 1).

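To make this outer loop concrete, here is a minimal, runnable sketch of policy-gradient architecture search under heavy simplifying assumptions: the controller is reduced to a table of softmax logits rather than an RNN, the expensive "train the child network to convergence" step is replaced by a toy reward, and the update is plain REINFORCE with a moving-average baseline (the controller in this paper is actually trained with PPO, as described in Section 4). All names and sizes below are illustrative, not the authors' code.

```python
# Toy policy-gradient architecture search (REINFORCE), a stand-in for Figure 1.
import numpy as np

rng = np.random.default_rng(0)
NUM_DECISIONS, NUM_CHOICES = 10, 4               # toy search space
logits = np.zeros((NUM_DECISIONS, NUM_CHOICES))  # stand-in for the controller RNN

def sample_child():
    """Sample one child architecture: a sequence of discrete choices."""
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return [rng.choice(NUM_CHOICES, p=p) for p in probs], probs

def child_accuracy(choices):
    """Placeholder for training a child network and measuring validation accuracy."""
    return float(np.mean([c == 2 for c in choices]))  # pretend option 2 is best

baseline = 0.0
for _ in range(200):
    choices, probs = sample_child()
    reward = child_accuracy(choices)
    baseline = 0.95 * baseline + 0.05 * reward       # moving-average baseline
    for i, c in enumerate(choices):                  # REINFORCE: grad of log-prob
        grad = -probs[i]
        grad[c] += 1.0
        logits[i] += 0.1 * (reward - baseline) * grad
```

After a few hundred iterations the logits concentrate on the high-reward choices; the same mechanism, at far larger scale and with an RNN controller, drives the search described below.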

The main contribution of this work is the design of a novel search space, such that the best architecture found on the CIFAR-10 dataset would scale to larger, higher-resolution image datasets across a range of computational settings. We name this search space the NASNet search space as it gives rise to NASNet, the best architecture found in our experiments. One inspiration for the NASNet search space is the realization that architecture engineering with CNNs often identifies repeated motifs consisting of combinations of convolutional filter banks, nonlinearities and a prudent selection of connections to achieve state-of-the-art results (such as the repeated modules present in the Inception and ResNet models [59, 20, 60, 58]). These observations suggest that it may be possible for the controller RNN to predict a generic convolutional cell expressed in terms of these motifs. This cell can then be stacked in series to handle inputs of arbitrary spatial dimensions and filter depth.

In our approach, the overall architectures of the convolutional nets are manually predetermined. They are composed of convolutional cells repeated many times, where each convolutional cell has the same architecture but different weights. To easily build scalable architectures for images of any size, we need two types of convolutional cells to serve two main functions when taking in a feature map as input: (1) convolutional cells that return a feature map of the same dimension, and (2) convolutional cells that return a feature map where the feature map height and width is reduced by a factor of two. We name the first type and second type of convolutional cells Normal Cell and Reduction Cell, respectively. For the Reduction Cell, we make the initial operation applied to the cell’s inputs have a stride of two to reduce the height and width. All of our operations that we consider for building our convolutional cells have an option of striding.

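As a concrete illustration of this fixed macro-structure, the sketch below stacks placeholder cells in the CIFAR-style pattern of Figure 2: N Normal Cells per stage, a stride-2 Reduction Cell between stages, and the common heuristic of doubling the filter count at each reduction. The `Cell` class here is a stand-in convolution block, not the learned cell structure, and all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Cell(nn.Module):
    """Placeholder cell: a ReLU-conv-BN block standing in for the learned cell."""
    def __init__(self, c_in, c_out, stride):
        super().__init__()
        self.op = nn.Sequential(
            nn.ReLU(),
            nn.Conv2d(c_in, c_out, kernel_size=3, stride=stride, padding=1),
            nn.BatchNorm2d(c_out),
        )

    def forward(self, x):
        return self.op(x)

def build_network(n_repeats=4, filters=32, num_reductions=2, num_classes=10):
    """Stack N Normal Cells per stage with a Reduction Cell (stride 2, 2x filters)
    between stages, as in the CIFAR architecture of Figure 2."""
    layers, c = [nn.Conv2d(3, filters, kernel_size=3, padding=1)], filters
    for stage in range(num_reductions + 1):
        for _ in range(n_repeats):
            layers.append(Cell(c, c, stride=1))      # Normal Cells: same dimension
        if stage < num_reductions:
            layers.append(Cell(c, 2 * c, stride=2))  # Reduction Cell: halve H and W
            c *= 2
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(c, num_classes)]
    return nn.Sequential(*layers)

net = build_network()
print(net(torch.randn(2, 3, 32, 32)).shape)  # torch.Size([2, 10])
```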

Figure 2 shows our placement of Normal and Reduction Cells for CIFAR-10 and ImageNet. Note that on ImageNet we have more Reduction Cells, since the incoming image size is 299x299 compared to 32x32 for CIFAR. The Reduction and Normal Cell could have the same architecture, but we empirically found it beneficial to learn two separate architectures. We use a common heuristic to double the number of filters in the output whenever the spatial activation size is reduced, in order to maintain roughly constant hidden state dimension [32, 53]. Importantly, much like the Inception and ResNet models [59, 20, 60, 58], we consider the number of motif repetitions N and the number of initial convolutional filters as free parameters that we tailor to the scale of an image classification problem.

What varies in the convolutional nets is the structure of the Normal and Reduction Cells, which are searched by the controller RNN. The structures of the cells can be searched within a search space defined as follows (see Appendix, Figure 7 for a schematic). In our search space, each cell receives as input two initial hidden states h_i and h_{i-1}, which are the outputs of two cells in the previous two lower layers or the input image. The controller RNN recursively predicts the rest of the structure of the convolutional cell, given these two initial hidden states (Figure 3). The predictions of the controller for each cell are grouped into B blocks, where each block has 5 prediction steps made by 5 distinct softmax classifiers corresponding to discrete choices of the elements of a block:

Step 1. Select a hidden state from h_i, h_{i-1} or from the set of hidden states created in previous blocks.
Step 2. Select a second hidden state from the same options as in Step 1.
Step 3. Select an operation to apply to the hidden state selected in Step 1.
Step 4. Select an operation to apply to the hidden state selected in Step 2.
Step 5. Select a method to combine the outputs of Steps 3 and 4 to create a new hidden state.

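The following toy sketch samples one cell description by running the five steps above B times. For brevity the decisions are drawn uniformly at random — which is exactly the random-search baseline discussed in Section 4.4 — whereas the controller RNN draws each decision from a learned softmax. The operation names are an illustrative subset, and every identifier here is hypothetical.

```python
import random

OPS = ["identity", "3x3 avg pool", "3x3 max pool",
       "1x1 conv", "3x3 sep conv", "5x5 sep conv"]  # illustrative subset
COMBINE = ["add", "concat"]

def sample_block(states):
    """One block = 5 discrete decisions (Steps 1-5 above)."""
    in1 = random.choice(states)    # Step 1: first input hidden state
    in2 = random.choice(states)    # Step 2: second input hidden state
    op1 = random.choice(OPS)       # Step 3: operation on the first input
    op2 = random.choice(OPS)       # Step 4: operation on the second input
    comb = random.choice(COMBINE)  # Step 5: how to merge the two results
    return (in1, op1, in2, op2, comb)

def sample_cell(B=5):
    """Run the 5 steps B times; each new block output becomes a selectable state."""
    states, blocks = ["h_i", "h_{i-1}"], []
    for b in range(B):
        blocks.append(sample_block(states))
        states.append(f"block_{b}")
    return blocks

for block in sample_cell():
    print(block)
```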

The algorithm appends the newly-created hidden state to the set of existing hidden states as a potential input in subsequent blocks. The controller RNN repeats the above 5 prediction steps B times, corresponding to the B blocks in a convolutional cell. In our experiments, selecting B = 5 provides good results, although we have not exhaustively searched this space due to computational limitations.

In steps 3 and 4, the controller RNN selects an operation to apply to the hidden states. We collected the following set of operations based on their prevalence in the CNN literature:

• identity
• 1x3 then 3x1 convolution
• 1x7 then 7x1 convolution
• 3x3 dilated convolution
• 3x3 average pooling
• 3x3 max pooling
• 5x5 max pooling
• 7x7 max pooling
• 1x1 convolution
• 3x3 convolution
• 3x3 depthwise-separable conv
• 5x5 depthwise-separable conv
• 7x7 depthwise-separable conv

In step 5 the controller RNN selects a method to combine the two hidden states, either (1) element-wise addition between two hidden states or (2) concatenation between two hidden states along the filter dimension. Finally, all of the unused hidden states generated in the convolutional cell are concatenated together in depth to provide the final cell output.

To allow the controller RNN to predict both the Normal Cell and the Reduction Cell, we simply make the controller have 2 × 5B predictions in total, where the first 5B predictions are for the Normal Cell and the second 5B predictions are for the Reduction Cell.

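To make the step-5 combination and the final cell output concrete, here is a small PyTorch sketch with illustrative shapes: element-wise addition requires matching shapes, concatenation stacks along the filter (channel) dimension, and the hidden states never consumed by a later block are concatenated in depth to form the cell output.

```python
import torch

def combine(a, b, method):
    """Step 5: element-wise addition, or concatenation along the filter dimension."""
    return a + b if method == "add" else torch.cat([a, b], dim=1)

# Cell output: concatenate every unused hidden state in depth.
# Toy example: three unused states of 8 channels each -> a 24-channel output.
unused = [torch.randn(1, 8, 16, 16) for _ in range(3)]
print(torch.cat(unused, dim=1).shape)  # torch.Size([1, 24, 16, 16])
```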

Finally, our work makes use of the reinforcement learning proposal in NAS [71]; however, it is also possible to use random search to search for architectures in the NASNet search space. In random search, instead of sampling the decisions from the softmax classifiers in the controller RNN, we can sample the decisions from the uniform distribution. In our experiments, we find that random search is slightly worse than reinforcement learning on the CIFAR-10 dataset. Although there is value in using reinforcement learning, the gap is smaller than what is found in the original work of [71]. This result suggests that 1) the NASNet search space is well-constructed such that random search can perform reasonably well and 2) random search is a difficult baseline to beat. We will compare reinforcement learning against random search in Section 4.4.

4. Experiments and Results

In this section, we describe our experiments with the method described above to learn convolutional cells. In summary, all architecture searches are performed using the CIFAR-10 classification task [31]. The controller RNN was trained using Proximal Policy Optimization (PPO) [51] by employing a global workqueue system for generating a pool of child networks controlled by the RNN. In our experiments, the pool of workers in the workqueue consisted of 500 GPUs.

The result of this search process over 4 days yields several candidate convolutional cells. We note that this search procedure is almost 7× faster than previous approaches [71] that took 28 days. Additionally, we demonstrate below that the resulting architecture is superior in accuracy.

Figure 4 shows a diagram of the top performing Normal Cell and Reduction Cell. Note the prevalence of separable convolutions and the number of branches compared with competing architectures [53, 59, 20, 60, 58]. Subsequent experiments focus on this convolutional cell architecture, although we examine the efficacy of other, top-ranked convolutional cells in ImageNet experiments (described in Appendix B) and report their results as well. We call the three networks constructed from the best three searches NASNet-A, NASNet-B and NASNet-C.

We demonstrate the utility of the convolutional cells by employing this learned architecture on CIFAR-10 and a family of ImageNet classification tasks. The latter family of tasks is explored across a few orders of magnitude in computational budget. After having learned the convolutional cells, several hyper-parameters may be explored to build a final network for a given task: (1) the number of cell repeats N and (2) the number of filters in the initial convolutional cell. After selecting the number of initial filters, we use a common heuristic to double the number of filters whenever the stride is 2. Finally, we define a simple notation, e.g., 4 @ 64, to indicate these two parameters in all networks, where 4 and 64 indicate the number of cell repeats and the number of filters in the penultimate layer of the network, respectively.

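As a worked example of this notation and the filter-doubling heuristic, the hypothetical helper below recovers per-stage filter counts from an `N @ F` string, assuming two stride-2 reductions as in the CIFAR models of Figure 2 (the ImageNet models use more).

```python
def parse_notation(spec, num_reductions=2):
    """'4 @ 64' -> 4 cell repeats, 64 penultimate filters; since each reduction
    doubles the filter count, the first stage uses F / 2**num_reductions filters."""
    n, f = (int(x) for x in spec.split("@"))
    return {"cell_repeats": n,
            "penultimate_filters": f,
            "initial_filters": f // 2 ** num_reductions}

print(parse_notation("4 @ 64"))
# {'cell_repeats': 4, 'penultimate_filters': 64, 'initial_filters': 16}
```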

For complete details of the architecture learning algorithm and the controller system, please refer to Appendix A. Importantly, when training NASNets, we discovered ScheduledDropPath, a modified version of DropPath [33], to be an effective regularization method for NASNet. In DropPath [33], each path in the cell is stochastically dropped with some fixed probability during training. In our modified version, ScheduledDropPath, each path in the cell is dropped out with a probability that is linearly increased over the course of training. We find that DropPath does not work well for NASNets, while ScheduledDropPath significantly improves the final performance of NASNets in both CIFAR and ImageNet experiments.

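Here is a minimal sketch of the two variants, assuming per-example path masks and a drop probability that ramps linearly from 0 to some final value over training; the final value of 0.4 and the schedule granularity are illustrative assumptions (the paper's exact settings are in Appendix A).

```python
import torch

def drop_path(x, drop_prob, training=True):
    """DropPath [33]: zero an entire path per example with probability drop_prob,
    rescaling survivors so the expected activation is unchanged."""
    if not training or drop_prob == 0.0:
        return x
    keep = 1.0 - drop_prob
    mask = torch.bernoulli(torch.full((x.size(0), 1, 1, 1), keep, device=x.device))
    return x / keep * mask

def scheduled_drop_prob(step, total_steps, final_drop_prob=0.4):
    """ScheduledDropPath: the drop probability grows linearly during training."""
    return final_drop_prob * min(step / total_steps, 1.0)

# Usage inside a cell, at training step t:
#   y = drop_path(y, scheduled_drop_prob(t, total_steps))
```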

4.1. Results on CIFAR-10 Image Classification

For the task of image classification with CIFAR-10, we set N = 4 or 6 (Figure 2). The test accuracies of the best architectures are reported in Table 1 along with other state-of-the-art models. As can be seen from the table, a large NASNet-A model with cutout data augmentation [12] achieves a state-of-the-art error rate of 2.40% (averaged across 5 runs), which is slightly better than the previous best record of 2.56% by [12]. The best single run from our model achieves a 2.19% error rate.

4.2. Results on ImageNet Image Classification

We performed several sets of experiments on ImageNet with the best convolutional cells learned from CIFAR-10. We emphasize that we merely transfer the architectures from CIFAR-10 but train all ImageNet model weights from scratch.

Results are summarized in Tables 2 and 3 and Figure 5. In the first set of experiments, we train several image classification systems operating on 299x299 or 331x331 resolution images with different experiments scaled in computational demand to create models that are roughly on par in computational cost with Inception-v2 [29], Inception-v3 [60] and PolyNet [69]. We show that this family of models achieves state-of-the-art performance with fewer floating point operations and parameters than comparable architectures. Second, we demonstrate that by adjusting the scale of the model we can achieve state-of-the-art performance at smaller computational budgets, exceeding streamlined CNNs hand-designed for this operating regime [24, 70]. Note we do not have residual connections between convolutional cells as the models learn skip connections on their own. We empirically found manually inserting residual connections between cells to not help performance. Our training setup on ImageNet is similar to [60], but please see Appendix A for details.

Table 2 shows that the convolutional cells discovered with CIFAR-10 generalize well to ImageNet problems. In particular, each model based on the convolutional cells exceeds the predictive performance of the corresponding hand-designed model. Importantly, the largest model achieves a new state-of-the-art performance for ImageNet (82.7%) based on single, non-ensembled predictions, surpassing the previous best published result by 1.2% [8]. Among the unpublished works, our model is on par with the best reported result of 82.7% [25], while having significantly fewer floating point operations. Figure 5 shows a complete summary of our results in comparison with other published results. Note the family of models based on convolutional cells provides an envelope over a broad class of human-invented architectures.

Finally, we test how well the best convolutional cells may perform in a resource-constrained setting, e.g., mobile devices (Table 3). In these settings, the number of floating point operations is severely constrained and predictive performance must be weighed against latency requirements on a device with limited computational resources. MobileNet [24] and ShuffleNet [70] provide state-of-the-art results, obtaining 70.6% and 70.9% accuracy, respectively, on 224x224 images using ~550M multiply-add operations. An architecture constructed from the best convolutional cells achieves superior predictive performance (74.0% accuracy), surpassing previous models but with comparable computational demand. In summary, we find that the learned convolutional cells are flexible across model scales, achieving state-of-the-art performance across almost 2 orders of magnitude in computational budget.

4.3. Improved features for object detection

Image classification networks provide generic image features that may be transferred to other computer vision problems [13]. One of the most important problems is the spatial localization of objects within an image. To further validate the performance of the family of NASNet-A networks, we test whether object detection systems derived from NASNet-A lead to improvements in object detection [28].

To address this question, we plug in the family of NASNet-A networks pretrained on ImageNet into the Faster-RCNN object detection pipeline [47] using an open-source software platform [28]. We retrain the resulting object detection pipeline on the combined COCO training plus validation dataset, excluding 8,000 mini-validation images.

We perform single model evaluation using 300-500 RPN proposals per image. In other words, we only pass a single image through a single network. We evaluate the model on the COCO mini-val [28] and test-dev datasets and report the mean average precision (mAP) as computed with the standard COCO metric library [38]. We perform a simple search over learning rate schedules to identify the best possible model. Finally, we examine the behavior of two object detection systems employing the best performing NASNet-A image featurization (NASNet-A, 6 @ 4032) as well as the image featurization geared towards mobile platforms (NASNet-A, 4 @ 1056).

For the mobile-optimized network, our resulting system achieves a mAP of 29.6% – exceeding previous mobile-optimized networks that employ Faster-RCNN by over 5.0% (Table 4). For the best NASNet network, our resulting network operating on images of the same spatial resolution (800 × 800) achieves mAP = 40.7%, exceeding equivalent object detection systems based off lesser performing image featurization (i.e. Inception-ResNet-v2) by 4.0% [28, 52] (see Appendix for example detections on images and side-by-side comparisons). Finally, increasing the spatial resolution of the input image results in the best reported, single model result for object detection of 43.1%, surpassing the previous best by over 4.0% [37]. These results provide further evidence that NASNet provides superior, generic image features that may be transferred across other computer vision tasks. Figure 10 and Figure 11 in Appendix C show four examples of object detection results produced by NASNet-A with the Faster-RCNN framework.

4.4. Efficiency of architecture search methods

Though what search method to use is not the focus of the paper, an open question is how effective the reinforcement learning search method is. In this section, we study the effectiveness of reinforcement learning for architecture search on the CIFAR-10 image classification problem and compare it to brute-force random search (considered to be a very strong baseline for black-box optimization [5]) given an equivalent amount of computational resources.

Figure 6 shows the performance of reinforcement learning (RL) and random search (RS) as more model architectures are sampled. Note that the best model identified with RL is significantly better than the best model found by RS, by over 1% as measured on CIFAR-10. Additionally, RL finds an entire range of models that are of superior quality to random search. We observe this in the mean performance of the top-5 and top-25 models identified in RL versus RS. We take these results to indicate that although RS may provide a viable search strategy, RL finds better architectures in the NASNet search space.

5. Conclusion

In this work, we demonstrate how to learn scalable, convolutional cells from data that transfer to multiple image classification tasks. The learned architecture is quite flexible as it may be scaled in terms of computational cost and parameters to easily address a variety of problems. In all cases, the accuracy of the resulting model exceeds all human-designed models – ranging from models designed for mobile applications to computationally-heavy models designed to achieve the most accurate results.

The key insight in our approach is to design a search space that decouples the complexity of an architecture from the depth of a network. This resulting search space permits identifying good architectures on a small dataset (i.e., CIFAR-10) and transferring the learned architecture to image classifications across a range of data and computational scales.

The resulting architectures approach or exceed state-of-the-art performance in both CIFAR-10 and ImageNet datasets with less computational demand than human-designed architectures [60, 29, 69]. The ImageNet results are particularly important because many state-of-the-art computer vision problems (e.g., object detection [28], face detection [50], image localization [63]) derive image features or architectures from ImageNet classification models. For instance, we find that image features obtained from ImageNet used in combination with the Faster-RCNN framework achieve state-of-the-art object detection results. Finally, we demonstrate that we can use the resulting learned architecture to perform ImageNet classification with reduced computational budgets that outperform streamlined architectures targeted to mobile and embedded platforms [24, 70].

Reference: https://blog.csdn.net/xjz18298268521/article/details/79079008
