1. This post collects reading notes on two papers accepted at NeurIPS 2020:
[1] Colin White, Willie Neiswanger, Sam Nolen, Yash Savani. A Study on Encodings for Neural Architecture Search. NeurIPS 2020.
Paper: https://papers.nips.cc/paper/2020/file/ea4eb49329550caaa1d2044105223721-Paper.pdf
Code: https://github.com/naszilla/naszilla
[2] Houwen Peng, Hao Du, Hongyuan Yu, Qi Li, Jing Liao, Jianlong Fu. Cream of the Crop: Distilling Prioritized Paths For One-Shot Neural Architecture Search. NeurIPS 2020.
Paper: https://papers.nips.cc/paper/2020/file/d072677d210ac4c03ba046120f0802ec-Paper.pdf
Code: https://github.com/microsoft/cream.git
2. Papers encountered along the way that are worth follow-up reading:
[1] Regularized Evolution for Image Classifier Architecture Search. AAAI 2019.
Paper: https://ojs.aaai.org/index.php/AAAI/article/view/4405/4283
[2] Neural Architecture Search with Bayesian Optimisation and Optimal Transport. NeurIPS 2018.
Paper: https://papers.nips.cc/paper/2018/file/f33ba15effa5c10e873bf3842afb46a6-Paper.pdf
[3] Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, Quoc Le. Understanding and Simplifying One-Shot Architecture Search. ICML 2018 (PMLR 80).
Paper: http://proceedings.mlr.press/v80/bender18a/bender18a.pdf
[4] Chenglin Yang, Lingxi Xie, Chi Su, Alan L. Yuille. Snapshot Distillation: Teacher-Student Optimization in One Generation. CVPR 2019.
Paper: https://openaccess.thecvf.com/content_CVPR_2019/papers/Yang_Snapshot_Distillation_Teacher-Student_Optimization_in_One_Generation_CVPR_2019_paper.pdf
The paper is a collaboration between Abacus.AI and Carnegie Mellon University, and is one of the papers accepted at NeurIPS 2020.
In recent years, NAS methods built on architecture encodings have kept appearing, yet even small changes to the way each architecture is encoded can have a large effect on the performance of a NAS algorithm. This raises a question: across the encoding-based NAS algorithms proposed so far, how much does the choice of encoding actually matter?
To answer this, the paper examines how different encodings affect existing architecture search algorithms, from both a theoretical and an empirical angle. First, it defines eight encodings under two paradigms: adjacency matrix-based and path-based (see Figure 1). Second, it runs extensive experiments on how each encoding scales and on the three encoding-dependent NAS subroutines (sampling a random architecture uniformly; perturbing an operation, edge, or path; and training a predictor model). Finally, it summarizes the best encoding for the algorithms commonly used in each subroutine (e.g., Bayesian optimization with Gaussian processes, aging/regularized evolution, local search, random search), which can serve as a guideline for future work.
(1) Motivation:
(2) A study of encodings for neural architectures in NAS:
a) Comparison of the encodings under the two paradigms:
(3) The three encoding-dependent NAS subroutines:
(4) Experimental results summarizing the best encoding for the common NAS algorithms under each subroutine:
Notes on the abbreviations used in the experiment result figures shown below (a minimal code sketch of the three subroutines follows this list):
a) Subroutine 1: sample a random architecture
b) Subroutine 2: perturb an architecture
c) Subroutine 3: train a predictor model
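The following is a minimal Python sketch (not the authors' code) of how these three encoding-dependent subroutines plug into a NAS loop. The `Encoding` interface and the `toy_nas_loop` helper are illustrative assumptions; any concrete encoding from the paper only needs to expose these three operations.

```python
from typing import List, Protocol


class Encoding(Protocol):
    def sample_random(self) -> List[int]:
        """Subroutine 1: draw a random architecture, represented in this encoding."""
        ...

    def perturb(self, arch: List[int]) -> List[int]:
        """Subroutine 2: mutate an operation, edge, or path of an encoded architecture."""
        ...

    def to_features(self, arch: List[int]) -> List[float]:
        """Subroutine 3: featurize an architecture so a predictor model can be trained on it."""
        ...


def toy_nas_loop(enc: Encoding, evaluate, budget: int = 10):
    """Toy local-search loop showing where each subroutine is used."""
    best = enc.sample_random()                          # subroutine 1
    best_score = evaluate(best)
    history = [(enc.to_features(best), best_score)]     # subroutine 3 feeds a predictor model
    for _ in range(budget):
        cand = enc.perturb(best)                        # subroutine 2
        score = evaluate(cand)
        history.append((enc.to_features(cand), score))
        if score > best_score:
            best, best_score = cand, score
    return best, history
```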
Abstract
Neural architecture search (NAS) has been extensively studied in the past few years. A popular approach is to represent each neural architecture in the search space as a directed acyclic graph (DAG), and then search over all DAGs by encoding the adjacency matrix and list of operations as a set of hyperparameters. Recent work has demonstrated that even small changes to the way each architecture is encoded can have a significant effect on the performance of NAS algorithms (White et al., 2019; Ying et al., 2019). In this work, we present the first formal study on the effect of architecture encodings for NAS, including a theoretical grounding and an empirical study. First we formally define architecture encodings and give a theoretical characterization on the scalability of the encodings we study. Then we identify the main encoding-dependent subroutines which NAS algorithms employ, running experiments to show which encodings work best with each subroutine for many popular algorithms. The experiments act as an ablation study for prior work, disentangling the algorithmic and encoding-based contributions, as well as a guideline for future work. Our results demonstrate that NAS encodings are an important design decision which can have a significant impact on overall performance.
神经结构搜索在过去几年被广泛研究。一种流行的方法是将搜索空间中的每一神经结构表示为有向无环图(DAG),然后通过邻接矩阵和操作列表编码作为一组超参数,搜索所有的DAGs。最近的工作表明,即使对每个结构的编码方式进行很小的更改,也会对NAS算法的性能有很大影响(White等人,2019;Ying等人,2019)。在这项工作中,我们提出关于NAS结构编码效果的第一个正式研究,包括理论基础和实证研究。首先我们正式定义结构编码,并对所研究编码的可扩展性进行理论上的特性描述。然后我们确定NAS算法使用的主要依赖于编码的子例程,运行实验来显示对于许多流行算法,哪种编码方式在每个子例程中工作得最好。实验充当了之前工作的消融实验,解开了基于算法和编码的贡献,以及对未来工作的指南。我们的结果证明,NAS编码是一项重要的设计决策,能够对整体性能产生重大影响。
extensively adv. 广泛地,普遍
theoretical grounding phr. 理论基础
an empirical study phr. 实证研究
theoretical characterization phr. 理论描述
scalability n. 可扩展性,可伸缩性
subroutine n. 子例程,子程序
act as phr. 充当,担任
ablation study phr. 消融实验
disentangle v. 解开,松开
design decision phr. 设计决策
3 Encodings for NAS
We denote a set of neural architectures a by A (called a search space), and we define an objective function L : A → R, where L(a) is typically a combination of the neural network accuracy, model parameters, or FLOPS. We define a neural network encoding as an integer d and a multifunction e : A → R^d from a set of neural architectures A to a d-dimensional Euclidean space R^d, and we define a NAS algorithm A which takes as input a triple (A, L, e), and outputs an architecture a, with the goal that L(a) is as close to max_{a∈A} L(a) as possible. Based on this definition, we consider an encoding e to be a fixed transformation, independent of L. In particular, NAS components that use L to learn a transformation of an input architecture (such as graph convolutional networks or autoencoders) are considered part of the NAS algorithm rather than the encoding. This is consistent with prior definitions of encodings (Ying et al., 2019; Talbi, 2020).
我们用A表示一组神经结构a(称为搜索空间),并且定义了一个目标函数L: A → R,其中L(a)通常是神经网络准确率、模型参数量或FLOPS的某种组合。我们将神经网络编码定义为一个整数d和一个多值映射e: A → R^d,即从神经结构集合A到d维欧几里得空间R^d的映射;然后我们定义NAS算法A,其输入是三元组(A, L, e),输出是一个结构a,目标是让L(a)尽可能接近max_{a∈A} L(a)。基于这个定义,我们认为编码e是独立于L的一种固定变换。特别地,使用L来学习输入结构变换的NAS组件(例如图卷积网络或自编码器)被视为NAS算法的一部分,而不是编码的一部分。这与先前对编码的定义是一致的(Ying等人,2019;Talbi,2020)。
a set of phr. 一组
typically adv. 通常,一般
consider v. 认为
be consistent with phr. 符合,与…一致
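As a reading aid, here is a minimal Python sketch of the formal setup above: the search space A, objective L, encoding e, and a NAS algorithm that consumes the triple (A, L, e). All names are illustrative, and the exhaustive "algorithm" is only meant to make the types concrete.

```python
from typing import Any, Callable, Iterable, List

Arch = Any                                  # an architecture a in the search space A
Objective = Callable[[Arch], float]         # L : A -> R (e.g. validation accuracy)
Encoding = Callable[[Arch], List[float]]    # e : A -> R^d, a fixed transformation


def nas_algorithm(space: Iterable[Arch], L: Objective, e: Encoding) -> Arch:
    """Consume the triple (A, L, e) and return an architecture a, aiming for L(a)
    as close as possible to max over A of L(a). This brute-force version is only
    meaningful for tiny search spaces; real algorithms use e to sample, perturb,
    and model architectures instead of enumerating them."""
    return max(space, key=L)
```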
We define eight different encodings split into two paradigms: adjacency matrix-based and path-based encodings. We assume that each architecture is represented by a DAG with at most n nodes, at most k edges, and q choices of operations on each node. For brevity, we focus on the case where nodes represent operations, though our analysis extends similarly to formulations where edges represent operations. Most of the following encodings have been defined in prior work (Ying et al., 2019; White et al., 2019; Talbi, 2020), and we will see in the next section that each encoding is useful for some part of the NAS pipeline.
我们定义了八种不同的编码方式,分为两种范式:基于邻接矩阵的编码和基于路径的编码。我们假设每一结构由一个有向无环图(DAG)表示,最多含有n个结点、最多k条边,且每个结点有q种可选操作。为简洁起见,我们将重点放在结点代表操作的情况,尽管我们的分析可以类似地推广到以边代表操作的表述。以下大部分编码已在先前的工作中定义(Ying等人,2019;White等人,2019;Talbi,2020)。我们将在下一节看到,每一种编码都对NAS流程的某些部分有用。
paradigm n. 范例,范式
for brevity phr. 为简洁起见
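A small, hypothetical container for the DAG assumption above (at most n nodes, at most k edges, one of q operations per node, with nodes representing operations); the later encoding sketches operate on data in this form.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ArchDAG:
    num_nodes: int                 # n (includes the input and output nodes)
    edges: List[Tuple[int, int]]   # (i, j) with i < j, so the graph is acyclic
    ops: List[int]                 # one operation id in {0, ..., q-1} per node

    def adjacency(self) -> List[List[int]]:
        """Upper-triangular 0/1 adjacency matrix implied by the edge list."""
        A = [[0] * self.num_nodes for _ in range(self.num_nodes)]
        for i, j in self.edges:
            A[i][j] = 1
        return A
```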
Adjacency matrix encodings. We first consider a class of encodings that are based on representations of the adjacency matrix. These are the most common types of encodings used in current NAS research. For visualizations of these encodings, see Figure A.1 (b).
邻接矩阵编码。首先我们考虑一类基于邻接矩阵表示的编码。这是在当前的NAS研究中最常用的编码类型。有关这些编码的可视化展示,见图A.1(b)。
The one-hot adjacency matrix encoding is created by row-major vectorizing (i.e. flattening) the architecture adjacency matrix and concatenating it with a list of node operation labels. Each position in the operation list is a single integer-valued feature, where each operation is denoted by a different integer. The total dimension is n(n − 1)/2 + n. In the categorical adjacency matrix encoding, the adjacency matrix is first flattened (similar to the one-hot encoding described previously), and is then defined as a list of indices each of which specifies one of the n(n − 1)/2 possible edges in the adjacency matrix. To ensure a fixed-length encoding, each architecture is represented by k features, where k is the maximum number of possible edges. We again concatenate this representation with a list of operations, yielding a total dimensionality of k + n. Finally, the continuous adjacency matrix encoding is similar to the one-hot encoding, but each of the features for each edge can take on any real value in [0,1], rather than just {0,1}. We also add a feature representing the number of edges, 1 ≤ K ≤ k. The list of operations is encoded the same way as before. The architecture is created by choosing the K edges with the largest continuous features. The dimension is n(n − 1)/2 + n + 1. The disadvantage of adjacency matrix-based encodings is that nodes are arbitrarily assigned indices in the matrix, which means one architecture can have many different representations (in other words, the encoding e is one-to-many).
One-hot邻接矩阵编码是这样创建的:将结构的邻接矩阵按行优先向量化(即展平),再与结点操作标签列表连接。操作列表上的每一位置是一个整型值特征,其中每一操作都由不同的整数表示。总维度是n(n − 1)/2 + n。在分类邻接矩阵编码中,邻接矩阵首先被展平(类似于之前描述的one-hot编码),然后定义为一个索引列表,每一个索引指定邻接矩阵中n(n − 1)/2条可能边之一。为了确保编码长度固定,每一结构表示为k个特征,其中k是最大可能边数。我们再次将此表示与操作列表连接起来,得到总维度k + n。最后,连续邻接矩阵编码类似于one-hot编码,但每条边的特征可以取[0, 1]区间上的任意实数值,而不仅仅是0和1。我们还添加了一个表示边数K(1 ≤ K ≤ k)的特征,操作列表的编码方式与之前相同。通过选择连续特征值最大的K条边来构建结构,维度是n(n − 1)/2 + n + 1。基于邻接矩阵的编码的缺点在于结点在矩阵中的索引是任意分配的,这意味着一个结构可以有许多不同的表示形式(换句话说,编码e是一对多的)。
vectorize v. 向量化,矢量化
flatten v. 平面化
concatenate v. 连接
arbitrarily adv. 任意地,随意地
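A sketch of the one-hot adjacency matrix encoding described above: flatten the upper triangle of the adjacency matrix and append the integer operation labels. The helper and the toy cell in the usage example are illustrative, not taken from the paper's code release.

```python
from typing import List, Tuple


def one_hot_adjacency_encoding(n: int,
                               edges: List[Tuple[int, int]],
                               ops: List[int]) -> List[int]:
    """Return a vector of length n*(n-1)/2 + n for a DAG with n nodes.

    `edges` are (i, j) pairs with i < j; `ops` holds one integer operation
    label per node (the paper encodes the operation list as integer features).
    """
    edge_set = set(edges)
    bits = [int((i, j) in edge_set) for i in range(n) for j in range(i + 1, n)]
    return bits + list(ops)


# Example: a 4-node cell, input(0) -> conv(1) -> output(3) and input(0) -> pool(2) -> output(3)
enc = one_hot_adjacency_encoding(4, [(0, 1), (0, 2), (1, 3), (2, 3)], ops=[0, 1, 2, 0])
assert len(enc) == 4 * 3 // 2 + 4   # n(n-1)/2 + n = 10
```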
Path-based encodings. Path-based encodings are representations of a neural architecture that are based on the set of paths from input to output that are present within the architecture DAG. For visualizations of these encodings, see Figure A.1 (c).
基于路径的编码。基于路径的编码是神经结构的一种表示形式,它基于结构DAG中从输入到输出的路径集合。有关这些编码的可视化展示,见图A.1(c)。
The one-hot path encoding is created by giving a binary feature to each possible path from the input node to the output node in the DAG (for example: input–conv1x1–maxpool3x3–output). The total dimension is ∑_{i=0}^{n} q^i = (q^{n+1} − 1)/(q − 1). The truncated one-hot path encoding simply truncates this encoding to only include paths of length x. The new dimension is ∑_{i=0}^{x} q^i. The categorical path encoding is defined as a list of indices each of which specifies one of the ∑_{i=0}^{x} q^i possible paths. The continuous path encoding consists of a real-valued feature in [0,1] for each potential path, as well as a feature representing the number of paths. Just like the one-hot path encoding, the continuous path encoding can be truncated. Path-based encodings have the advantage that nodes are not arbitrarily assigned indices, and also that isomorphisms are automatically mapped to the same encoding. Path-based encodings have the disadvantage that different architectures can map to the same encoding (in other words, e is not one-to-one).
One-hot路径编码是这样创建的:为DAG中从输入结点到输出结点的每一条可能路径(例如:输入-1x1卷积-3x3最大池化-输出)分配一个二值特征。总维度是∑_{i=0}^{n} q^i = (q^{n+1} − 1)/(q − 1)。截断的one-hot路径编码只是将这种编码截断为仅包含长度为x的路径,新的维度为∑_{i=0}^{x} q^i。分类路径编码定义为一个索引列表,每一索引指定∑_{i=0}^{x} q^i条可能路径之一。连续路径编码为每一条可能路径给出一个[0, 1]上的实数值特征,外加一个表示路径数的特征。与one-hot路径编码一样,连续路径编码也可以被截断。基于路径的编码的优势在于结点的索引不是任意分配的,并且同构结构会自动映射到相同的编码。基于路径的编码的缺点在于不同的结构可能映射到相同的编码(换句话说,e不是一对一的)。
truncate v. 缩短,截去
simply adv. 仅
isomorphism n. 同构
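A sketch of the (truncated) one-hot path encoding described above: give one binary feature to every possible operation sequence of length 0..x from input to output, and set the features for sequences that actually occur as paths in the DAG. The function and variable names are illustrative.

```python
from itertools import product
from typing import Dict, List, Tuple


def one_hot_path_encoding(edges: List[Tuple[int, int]],
                          ops: Dict[int, int],
                          q: int,
                          input_node: int,
                          output_node: int,
                          max_len: int) -> List[int]:
    """`ops` maps each intermediate node to an operation id in {0,...,q-1}.
    The dimension is sum_{i=0}^{max_len} q^i (the untruncated version uses max_len = n)."""
    # Index every possible operation sequence of length 0..max_len.
    index = {}
    for length in range(max_len + 1):
        for seq in product(range(q), repeat=length):
            index[seq] = len(index)

    # Enumerate actual input->output paths by depth-first search over the DAG.
    succ = {}
    for i, j in edges:
        succ.setdefault(i, []).append(j)

    vec = [0] * len(index)

    def dfs(node: int, seq: Tuple[int, ...]) -> None:
        if node == output_node:
            if len(seq) <= max_len:
                vec[index[seq]] = 1
            return
        for nxt in succ.get(node, []):
            new_seq = seq if nxt == output_node else seq + (ops[nxt],)
            dfs(nxt, new_seq)

    dfs(input_node, ())
    return vec


# Example: input(0) -> conv(1) -> output(3) and input(0) -> pool(2) -> output(3), with q = 3 operations
vec = one_hot_path_encoding([(0, 1), (0, 2), (1, 3), (2, 3)],
                            ops={1: 0, 2: 1}, q=3,
                            input_node=0, output_node=3, max_len=2)
assert len(vec) == 1 + 3 + 9   # sum_{i=0}^{2} 3^i = 13
```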
This paper is a collaboration between Microsoft Research Asia, Tsinghua University, the Chinese Academy of Sciences, and City University of Hong Kong, and is one of the papers accepted at NeurIPS 2020.
In one-shot neural architecture search, a hypernetwork (supernet) is pre-trained, subnetworks are repeatedly sampled from it, and, with weight sharing, each subnetwork inherits its weights and has its accuracy validated until the best subnetwork is found. Ranking subnetworks by weights shared from a single training pass can leave them insufficiently trained, and the search consumes a large amount of compute. This raises two questions: how can the insufficient training of subnetworks be addressed, and how can the training process be made more flexible so that the compute cost comes down?
To address these problems, the paper proposes a prioritized path distillation method: a path is randomly sampled from the hypernetwork, trained to update the corresponding shared weights, and evaluated on the validation set; the well-performing paths are stored in a set called the "prioritized path board" (see Figure 12). Here, one path corresponds to one network architecture. The architectures on the prioritized path board serve as teachers to boost the training of other subnetworks; whenever a subnetwork outperforms the worst path on the board, it replaces that path, so the board is updated dynamically. This effectively mitigates the insufficient training of subnetworks. Moreover, the best-performing subnetwork can be picked directly from the prioritized path board as the final architecture, with no further search over the hypernetwork, which effectively reduces the computational cost.
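A minimal sketch of the "prioritized path board" bookkeeping described above. This is an assumption-level illustration, not the released Cream implementation (which also takes model complexity into account when ranking paths).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PathRecord:
    arch: list          # the sampled path / subnetwork description
    val_score: float    # its performance on the held-out validation set


def update_board(board: List[PathRecord], candidate: PathRecord, board_size: int = 10) -> None:
    """Insert `candidate` if the board is not yet full or it beats the worst member."""
    if len(board) < board_size:
        board.append(candidate)
        return
    worst = min(range(len(board)), key=lambda i: board[i].val_score)
    if candidate.val_score > board[worst].val_score:
        board[worst] = candidate


def best_path(board: List[PathRecord]) -> PathRecord:
    """Final architecture: simply pick the best-performing prioritized path."""
    return max(board, key=lambda r: r.val_score)
```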
Concretely, training consists of three iterative stages (see Figure 10). First, prioritized path selection: a meta network M selects the best-matching model from the prioritized path board and outputs a matching coefficient ρ. Second, knowledge distillation: knowledge is transferred from the prioritized path to the weight-sharing subnetwork by taking a weighted average of two constructed objectives. Third, the subnetwork's validation loss, computed with the cross-entropy loss, is used to improve the meta network itself, which learns by reinforcement, sampling its selections as a policy and learning from its own reward. Results on ImageNet show that, among current mainstream methods, the proposed prioritized-path one-shot approach performs best: across different compute budgets it has the lowest search cost (GPU days) while achieving the highest top-1 and top-5 accuracy.
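A schematic PyTorch-style sketch of one training iteration under the three-stage scheme above. `supernet.sample_random_path`, the callable `meta_net`, and the exact loss weighting are illustrative assumptions rather than the paper's released code.

```python
import torch
import torch.nn.functional as F


def train_one_iteration(supernet, meta_net, board, images, labels, optimizer, temperature=1.0):
    # Stage 1: sample a path (subnetwork) and let the meta network score how well each
    # prioritized path on the board matches it; the best-scoring path becomes the teacher.
    path = supernet.sample_random_path()                      # hypothetical helper
    student_logits = supernet(images, path)
    with torch.no_grad():
        teacher_logits = [supernet(images, p.arch) for p in board]
    scores = torch.stack([meta_net(student_logits, t) for t in teacher_logits])
    idx = int(torch.argmax(scores))
    rho = torch.sigmoid(scores[idx]).detach()                 # matching coefficient in (0, 1)

    # Stage 2: weighted average of the hard-label loss and the distillation loss.
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(F.log_softmax(student_logits / temperature, dim=1),
                  F.softmax(teacher_logits[idx] / temperature, dim=1),
                  reduction="batchmean")
    loss = (1 - rho) * ce + rho * kd

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Stage 3 (sketched later in these notes): the meta network itself is updated from the
# subnetwork's validation performance, treating the teacher choice as a policy.
```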
(1) Motivation:
(2) Building on the one-shot NAS approach, the paper proposes:
Note: the difference between FLOPS and FLOPs (see the example after this outline):
(1) The proposed prioritized-path method consists of three parts:
(2) The training process consists of three iterative stages (see Figure 11):
(3) Goals:
(4) Experimental results:
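Regarding the FLOPS/FLOPs note in the outline above: FLOPs (lowercase s) counts floating-point operations and measures model complexity, whereas FLOPS is floating-point operations per second and measures hardware speed. A rough example of counting FLOPs (multiply-accumulates) for a convolution layer:

```python
def conv2d_flops(c_in: int, c_out: int, k: int, h_out: int, w_out: int) -> int:
    """Approximate multiply-accumulate count of a k x k convolution (no bias)."""
    return c_in * c_out * k * k * h_out * w_out


# e.g. a 3x3 conv, 32 -> 64 channels, on a 56x56 output feature map:
print(conv2d_flops(32, 64, 3, 56, 56))   # about 57.8 million MACs
```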
Abstract
One-shot weight sharing methods have recently drawn great attention in neural architecture search due to high efficiency and competitive performance. However, weight sharing across models has an inherent deficiency, i.e., insufficient training of subnetworks in the hypernetwork. To alleviate this problem, we present a simple yet effective architecture distillation method. The central idea is that subnetworks can learn collaboratively and teach each other throughout the training process, aiming to boost the convergence of individual models. We introduce the concept of prioritized path, which refers to the architecture candidates exhibiting superior performance during training. Distilling knowledge from the prioritized paths is able to boost the training of subnetworks. Since the prioritized paths are changed on the fly depending on their performance and complexity, the final obtained paths are the cream of the crop. We directly select the most promising one from the prioritized paths as the final architecture, without using other complex search methods, such as reinforcement learning or evolution algorithms. The experiments on ImageNet verify such path distillation method can improve the convergence ratio and performance of the hypernetwork, as well as boosting the training of subnetworks. The discovered architectures achieve superior performance compared to the recent MobileNetV3 and EfficientNet families under aligned settings. Moreover, the experiments on object detection and more challenging search space show the generality and robustness of the proposed method. Code and models are available at https://github.com/microsoft/cream.git.
One-Shot权重共享方法由于它的高效率及富有竞争力的性能,最近在神经结构搜索中引起了极大的关注。然而,模型之间的权重共享有其固有的缺陷,即在超网络中的子网无法得到足够的训练。为了缓解这个问题,我们提出了一种简单而高效的结构提取方法。中心思想是子网络能够在整个训练过程中进行协作学习和相互教授,目的是促进各个模型的收敛。我们介绍了优先路径的概念,它指的是在训练过程中表现出卓越性能的候选结构。从优先路径中提取知识能够促进子网络的训练。因为优先路径根据它们的性能和复杂性而动态改变,所以最终得到的路径是最优秀的。我们直接从优先路径中选择最有希望的一个结构作为最终的结构,无需使用其他复杂的搜索方法,例如强化学习或进化算法。在ImageNet数据集上的实验证明,这种路径提取方法能够提高超网络的收敛率以及性能,也能促进子网络的训练。与最新的MobileNetV3和EfficientNet系列在对齐设置下相比,发现的结构取得了更好的表现。此外,在目标检测和更具挑战性的搜索空间上的实验表明了所提方法的通用性和鲁棒性。代码和模型可以在https://github.com/microsoft/cream.git中获得。
draw great attention phr. 引起极大关注
boost v. 促进,增加
convergence n. 收敛
refer to phr. 参考,指的是
change on the fly phr. 动态改变
depending on phr. 根据,取决于
the cream of the crop phr. 精英,精华,最优秀的
convergence ratio phr. 收敛比,收敛率
aligned a. 达成一致的,对齐的
1 Introduction
In this paper, we present prioritized paths to enable the knowledge transfer between architectures, without requiring an external teacher model. The core idea is that subnetworks can learn collaboratively and teach each other throughout the training process, and thus boosting the convergence of individual architectures. More specifically, we create a prioritized path board which recruits the subnetworks with superior performance as the internal teachers to facilitate the training of other models. The recruitment follows the selective competition principle, i.e., selecting the superior and eliminating the inferior. Besides competition, there also exists collaboration. To enable the information transfer between architectures, we distill the knowledge from prioritized paths to subnetworks. Instead of learning from a fixed model, our method allows each subnetwork to select its best-matching prioritized path as the teacher based on the representation complementary. In particular, a meta network is introduced to mimic this path selection procedure. Throughout the course of subnetwork training, the meta network observes the subnetwork’s performance on a held-out validation set, and learns to choose a prioritized path from the board so that if the subnetwork benefits from the prioritized path, the subnetwork will achieve better validation performance.
在本文中,我们提出了优先路径来使得知识在结构之间转移,而无需外部的教师模型。关键思想是子网能够在整个训练过程中进行协作学习和相互教授,从而促进了各个结构的收敛。更具体地说,我们创建了一个优先路径面板,招收性能优异的子网作为内部教师来促进其他模型的训练。招收遵循选择性竞争原则,即选择优等、淘汰劣等。除了竞争以外,也存在协作。为了使结构间的信息传递成为可能,我们将知识从优先路径提取到子网。我们的方法不是从固定的模型中学习,而是允许每一子网根据表示的补充去选择它的最优匹配的优先路径作为教师。特别地,引入元网络来模仿这种路径选择过程。在整个子网的训练过程中,元网络会在验证集上观察子网的性能,并学习从优先路径面板中选择优先路径,以便如果子网受益于优先路径,则子网将取得更好的验证性能。
transfer v. 转移
recruit v. 招募,招收
follow v. 跟随,遵循
instead of phr. 代替,而不是
representation complementary phr. 表示的补充
mimic v. 模仿
the course of phr. 过程中
held-out Hold-out is when you split up your dataset into a ‘train’ and ‘test’ set. The training set is what the model is trained on, and the test set is used to see how well that model performs on unseen data. A common split when using the hold-out method is using 80% of data for training and the remaining 20% of the data for testing. “保留法”是指将数据集划分为“训练”和“测试”两部分。训练集用于训练模型,测试集则用来观察模型在未见数据上的表现如何。使用保留法时,一个常见的划分是用80%的数据进行训练,其余20%的数据进行测试。
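To make the meta-learned path selection above concrete, here is a schematic sketch of how the meta network could be updated from the subnetwork's held-out validation performance, treating the teacher choice as a policy (a REINFORCE-style step). This fills in the "Stage 3" left out of the earlier training-iteration sketch; all names are illustrative assumptions.

```python
import torch


def update_meta_net(meta_net, meta_optimizer, scores, chosen_idx, reward):
    """`scores` are the meta network's matching scores over the board (a 1-D tensor
    that carries gradients through meta_net), `chosen_idx` is the teacher that was
    used, and `reward` is derived from the subnetwork's validation loss after
    training with that teacher (higher reward = the teacher helped)."""
    log_probs = torch.log_softmax(scores, dim=0)
    loss = -reward * log_probs[chosen_idx]    # reinforce choices that led to high reward
    meta_optimizer.zero_grad()
    loss.backward()
    meta_optimizer.step()
```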
Such prioritized path distillation mechanism has three advantages. First, it does not require introducing third-party models, such as human-designed architectures, to serve as the teacher models, thus it is more flexible. Second, the matching between prioritized paths and subnetworks is meta-learned, which allows a subnetwork to select various prioritized paths to facilitate its learning. Last but not least, after hypernetwork training, we can directly pick the best performing architecture from the prioritized paths, instead of using either reinforcement learning or evolutionary algorithms to further search a final architecture from the large-scale hypernetwork.
这种优先路径提取机制有三个优点。第一,它不需要引入第三方模型(例如人工设计的结构)来作为教师模型,因此它更加灵活。第二,优先路径和子网之间的匹配是元学习的,允许一个子网选择各种优先路径来促进它的学习。第三,在超网络训练之后,我们能够直接从优先路径中得到最优性能的结构,而不是使用强化学习或者进化算法从大规模超网络中进一步搜索最终的结构。
mechanism n. 机制,原理
flexible a. 灵活