AnnaAraslanova/FBNet Code Analysis

AnnaAraslanova/FBNet is one of the better third-party implementations of FBNet. Latency is approximated from measurements taken on an x86 processor. A few caveats:

  • PyTorch's GPU data parallelism places requirements on how the input data is shaped;
  • using plain BN layers directly in the stochastic supernet seems questionable.

supernet_main_file.py

train_supernet trains the stochastic supernet.

sample_architecture_from_the_supernet samples the best architecture from it.

if __name__ == "__main__":
    assert args.train_or_sample in ['train', 'sample']
    if args.train_or_sample == 'train':
        train_supernet()
    elif args.train_or_sample == 'sample':
        assert args.architecture_name != '' and args.architecture_name not in MODEL_ARCH
        hardsampling = False if args.hardsampling_bool_value in ['False', '0'] else True
        sample_architecture_from_the_supernet(unique_name_of_arch=args.architecture_name, hardsampling=hardsampling)

train_supernet

Flowchart: train_supernet → config_for_supernet → create_directories_from_list → get_logger → SummaryWriter → LookUpTable → get_loaders → get_test_loader → FBNet_Stochastic_SuperNet → weights_init → SupernetLoss → check_tensor_in_list → CosineAnnealingLR → TrainerSupernet → TrainerSupernet.train_loop

Set the random seeds for reproducibility; note that cudnn.benchmark = True favors speed over strict determinism.

    manual_seed = 1
    np.random.seed(manual_seed)
    torch.manual_seed(manual_seed)
    torch.cuda.manual_seed_all(manual_seed)
    torch.backends.cudnn.benchmark = True

CONFIG_SUPERNET stores the supernet's configuration parameters. create_directories_from_list creates the directory for the TensorBoard logs.
get_logger builds a logger for the given file path and sets its format.
SummaryWriter creates an asynchronous writer for TensorBoard events.

Note: after version 1.7 the parameter was renamed to logdir.

    create_directories_from_list([CONFIG_SUPERNET['logging']['path_to_tensorboard_logs']])
    
    logger = get_logger(CONFIG_SUPERNET['logging']['path_to_log_file'])
    writer = SummaryWriter(log_dir=CONFIG_SUPERNET['logging']['path_to_tensorboard_logs'])

LookUpTable writes its results to a file.

    #### LookUp table consists all information about layers
    lookup_table = LookUpTable(calulate_latency=CONFIG_SUPERNET['lookup_table']['create_from_scratch'])

get_loaders splits the data into training and validation sets.

    #### DataLoading
    train_w_loader, train_thetas_loader = get_loaders(CONFIG_SUPERNET['dataloading']['w_share_in_train'],
                                                      CONFIG_SUPERNET['dataloading']['batch_size'],
                                                      CONFIG_SUPERNET['dataloading']['path_to_save_data'],
                                                      logger)
    test_loader = get_test_loader(CONFIG_SUPERNET['dataloading']['batch_size'],
                                  CONFIG_SUPERNET['dataloading']['path_to_save_data'])

Instantiate FBNet_Stochastic_SuperNet.

nn.Module.apply applies fn recursively to every submodule (as returned by .children()) as well as to self. Typical uses include initializing a model's parameters (see also torch.nn.init).
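
A minimal sketch of the apply pattern (my own illustration, not taken from the repository):

    import torch.nn as nn

    def init_fn(m):
        # called once for every submodule and once for the root module itself
        if isinstance(m, nn.Linear):
            nn.init.constant_(m.weight, 0.0)

    net = nn.Sequential(nn.Linear(2, 2), nn.ReLU())
    net.apply(init_fn)  # recurses over children first, then applies fn to net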

Why is weights_init called from the outside here instead of initializing inside the model itself?

There is no support for resuming training from a saved checkpoint.

torch.nn.DataParallel implements data parallelism at the module level. This container parallelizes the application of the given module by splitting the input across the specified devices, chunking along the batch dimension (other objects are copied once per device). In the forward pass, the module is replicated on each device and each replica handles a portion of the input. During the backward pass, gradients from each replica are summed into the original module. The batch size should be larger than the number of GPUs used.
See also: Use nn.DataParallel instead of multiprocessing.
Arbitrary positional and keyword inputs may be passed into DataParallel, but some types are handled specially. Tensors are scattered on the specified dim (0 by default). Tuples, lists and dicts are shallow-copied. Other types are shared between the different threads and can be corrupted if written to in the model's forward pass.
The parallelized module must have its parameters and buffers on device_ids[0] before this DataParallel module is run.

On every forward pass, the module is replicated to each device, so any state updates made inside forward are lost. For example, if the module has a counter attribute incremented on each forward, it stays at its initial value, because the updates happen on replicas that are destroyed after forward. However, DataParallel guarantees that the replica on device[0] shares its parameters and buffers with the base parallelized module, so in-place updates to the parameters or buffers on device[0] are recorded. For example, BatchNorm2d and spectral_norm() rely on this behavior to update their buffers.
Forward and backward hooks defined on the module and its submodules are invoked len(device_ids) times, each with inputs located on a particular device. In particular, hooks are only guaranteed to execute in the correct order with respect to operations on the corresponding device. For example, a hook set via register_forward_pre_hook() is not guaranteed to run before all len(device_ids) forward() calls, but it will run before the corresponding forward() call on its device.
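
A small sketch of the lost-state behavior described above (my own illustration; it assumes at least two visible GPUs, since DataParallel with a single device in device_ids calls the module directly and skips replication):

    import torch
    import torch.nn as nn

    class Counting(nn.Module):
        def __init__(self):
            super().__init__()
            self.calls = 0

        def forward(self, x):
            self.calls += 1          # happens on a throwaway replica
            return x * 2

    if torch.cuda.device_count() >= 2:
        m = nn.DataParallel(Counting().cuda(), device_ids=[0, 1])
        m(torch.randn(8, 4).cuda())
        print(m.module.calls)        # still 0: the replicas were discarded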

    #### Model
    model = FBNet_Stochastic_SuperNet(lookup_table, cnt_classes=10).cuda()
    model = model.apply(weights_init)
    model = nn.DataParallel(model, device_ids=[0])

Network weights and architecture parameters are attached to separate optimizers.
SupernetLoss computes the loss including the latency term.

torch.optim.lr_scheduler.CosineAnnealingLR sets the learning rate of each parameter group using a cosine annealing schedule, where $\eta_{max}$ is set to the initial lr and $T_{cur}$ is the number of epochs since the last restart in SGDR:

$$\begin{aligned} \eta_{t+1} &= \eta_{min} + (\eta_t - \eta_{min})\,\frac{1 + \cos\!\left(\frac{T_{cur}+1}{T_{max}}\pi\right)}{1 + \cos\!\left(\frac{T_{cur}}{T_{max}}\pi\right)}, && T_{cur} \neq (2k+1)T_{max}; \\ \eta_{t+1} &= \eta_{t} + (\eta_{max} - \eta_{min})\,\frac{1 - \cos\!\left(\frac{1}{T_{max}}\pi\right)}{2}, && T_{cur} = (2k+1)T_{max}. \end{aligned}$$
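
For intuition, a tiny standalone sketch of the schedule (my own illustration, not repository code):

    import torch

    params = [torch.nn.Parameter(torch.zeros(1))]
    opt = torch.optim.SGD(params, lr=0.1)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=5)

    for epoch in range(5):
        opt.step()
        sched.step()
        print(epoch, opt.param_groups[0]['lr'])  # decays from 0.1 toward 0 along a half-cosine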

    #### Loss, Optimizer and Scheduler
    criterion = SupernetLoss().cuda()

    thetas_params = [param for name, param in model.named_parameters() if 'thetas' in name]
    params_except_thetas = [param for param in model.parameters() if not check_tensor_in_list(param, thetas_params)]

    w_optimizer = torch.optim.SGD(params=params_except_thetas,
                                  lr=CONFIG_SUPERNET['optimizer']['w_lr'], 
                                  momentum=CONFIG_SUPERNET['optimizer']['w_momentum'],
                                  weight_decay=CONFIG_SUPERNET['optimizer']['w_weight_decay'])
    
    theta_optimizer = torch.optim.Adam(params=thetas_params,
                                       lr=CONFIG_SUPERNET['optimizer']['thetas_lr'],
                                       weight_decay=CONFIG_SUPERNET['optimizer']['thetas_weight_decay'])

    last_epoch = -1
    w_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(w_optimizer,
                                                             T_max=CONFIG_SUPERNET['train_settings']['cnt_epochs'],
                                                             last_epoch=last_epoch)

TrainerSupernet encapsulates the training procedure.

    #### Training Loop
    trainer = TrainerSupernet(criterion, w_optimizer, theta_optimizer, w_scheduler, logger, writer)
    trainer.train_loop(train_w_loader, train_thetas_loader, test_loader, model)

get_logger

    """ Make python logger """
    # [!] Since tensorboardX use default logger (e.g. logging.info()), we should use custom logger
    logger = logging.getLogger('fbnet')
    log_format = '%(asctime)s | %(message)s'
    formatter = logging.Formatter(log_format, datefmt='%m/%d %I:%M:%S %p')
    file_handler = logging.FileHandler(file_path)
    file_handler.setFormatter(formatter)
    stream_handler = logging.StreamHandler()
    stream_handler.setFormatter(formatter)

    logger.addHandler(file_handler)
    logger.addHandler(stream_handler)
    logger.setLevel(logging.INFO)

    return logger

LookUpTable

Flowchart: LookUpTable(candidate_blocks, search_space) → _generate_layers_parameters → if calulate_latency: _create_from_operations, else: _create_from_file

CANDIDATE_BLOCKS lists the 9 block types from Table 2 of the paper; their detailed parameters live in PRIMITIVES.

Block type expansion Kernel Group
k3_e1 1 3 1
k3_e1_g2 1 3 2
k3_e3 3 3 1
k3_e6 6 3 1
k5_e1 1 5 1
k5_e1_g2 1 5 2
k5_e3 3 5 1
k5_e6 6 5 1
skip - - -

SEARCH_SPACE corresponds to the network structure in Table 1 of the paper (TBS layers only).

Input shape        Block        f     n  s
224² × 3           3x3 conv     16    1  2
112² × 16          TBS          16    1  1
112² × 16          TBS          24    4  2
56² × 24           TBS          32    4  2
28² × 32           TBS          64    4  2
14² × 64           TBS          112   4  1
14² × 112          TBS          184   4  2
7² × 184           TBS          352   1  1
7² × 352           1x1 conv     1984  1  1
7² × 1504 (1984)   7x7 avgpool  -     1  1
1504               fc           1000  1  -

The number of layers is inferred from the number of input shapes in search_space.
An operation dictionary, self.lookup_table_operations, is built.
_generate_layers_parameters parses the per-layer parameters and input shapes from SEARCH_SPACE.

    def __init__(self, candidate_blocks=CANDIDATE_BLOCKS, search_space=SEARCH_SPACE,
                 calulate_latency=False):
        self.cnt_layers = len(search_space["input_shape"])
        # constructors for each operation
        self.lookup_table_operations = {op_name : PRIMITIVES[op_name] for op_name in candidate_blocks}
        # arguments for the ops constructors. one set of arguments for all 9 constructors at each layer
        # input_shapes just for convinience
        self.layers_parameters, self.layers_input_shapes = self._generate_layers_parameters(search_space)

_create_from_operations measures each operation's latency and writes the results to a file.

Otherwise, _create_from_file reads the results back from the file (via _read_lookup_table_from_file).

        # lookup_table
        self.lookup_table_latency = None
        if calulate_latency:
            self._create_from_operations(cnt_of_runs=CONFIG_SUPERNET['lookup_table']['number_of_runs'],
                                         write_to_file=CONFIG_SUPERNET['lookup_table']['path_to_lookup_table'])
        else:
            self._create_from_file(path_to_file=CONFIG_SUPERNET['lookup_table']['path_to_lookup_table'])

_generate_layers_parameters

_generate_layers_parameters reads parameters from the search_space dictionary and builds the per-layer parameter list layers_parameters. The argument order here must match the constructors in PRIMITIVES.

        # layers_parameters are : C_in, C_out, expansion, stride
        layers_parameters = [(search_space["input_shape"][layer_id][0],
                              search_space["channel_size"][layer_id],
                              # expansion (set to -999) embedded into operation and will not be considered
                              # (look fbnet_building_blocks/fbnet_builder.py - this is facebookresearch code
                              # and I don't want to modify it)
                              -999,
                              search_space["strides"][layer_id]
                             ) for layer_id in range(self.cnt_layers)]
        
        # layers_input_shapes are (C_in, input_w, input_h)
        layers_input_shapes = search_space["input_shape"]
        
        return layers_parameters, layers_input_shapes

_create_from_operations

Call chain: _create_from_operations → _calculate_latency → _write_lookup_table_to_file

        self.lookup_table_latency = self._calculate_latency(self.lookup_table_operations,
                                                            self.layers_parameters,
                                                            self.layers_input_shapes,
                                                            cnt_of_runs)
        if write_to_file is not None:
            self._write_lookup_table_to_file(write_to_file)

_calculate_latency

latency_table_layer_by_ops holds one dictionary per TBS layer, recording the latency of each operation.
Random input data is generated. globals() returns the dictionary representing the current global symbol table; this is always the dictionary of the current module (inside a function or method, the module where it is defined, not the module from which it is called).
timeit.timeit creates a Timer instance from the given statement, setup code and timer function and runs its timeit() method with the given number of executions. The optional globals argument specifies a namespace in which to execute the code.

        LATENCY_BATCH_SIZE = 1
        latency_table_layer_by_ops = [{} for i in range(self.cnt_layers)]
        
        for layer_id in range(self.cnt_layers):
            for op_name in operations:
                op = operations[op_name](*layers_parameters[layer_id])
                input_sample = torch.randn((LATENCY_BATCH_SIZE, *layers_input_shapes[layer_id]))
                globals()['op'], globals()['input_sample'] = op, input_sample
                total_time = timeit.timeit('output = op(input_sample)', setup="gc.enable()", \
                                           globals=globals(), number=cnt_of_runs)
                # measured in micro-second
                latency_table_layer_by_ops[layer_id][op_name] = total_time / cnt_of_runs / LATENCY_BATCH_SIZE * 1e6
                
        return latency_table_layer_by_ops

_write_lookup_table_to_file

Call chain: _write_lookup_table_to_file → clear_files_in_the_list → add_text_to_file

clear_files_in_the_list empties any existing file.
ops is the list of operation names; the first line written to the file holds these names.

        clear_files_in_the_list([path_to_file])
        ops = [op_name for op_name in self.lookup_table_operations]
        text = [op_name + " " for op_name in ops[:-1]]
        text.append(ops[-1] + "\n")

The operations' latencies are then written out, one line per TBS layer.
add_text_to_file saves the result to the file.

        for layer_id in range(self.cnt_layers):
            for op_name in ops:
                text.append(str(self.lookup_table_latency[layer_id][op_name]))
                text.append(" ")
            text[-1] = "\n"
        text = text[:-1]
        
        text = ''.join(text)
        add_text_to_file(text, path_to_file)

_create_from_file

Call chain: _create_from_file → _read_lookup_table_from_file

        self.lookup_table_latency = self._read_lookup_table_from_file(path_to_file)

_read_lookup_table_from_file

Read the results back from the file; the first line holds the operation names.

        latences = [line.strip('\n') for line in open(path_to_file)]
        ops_names = latences[0].split(" ")
        latences = [list(map(float, layer.split(" "))) for layer in latences[1:]]
        
        lookup_table_latency = [{op_name : latences[i][op_id] 
                                      for op_id, op_name in enumerate(ops_names)
                                     } for i in range(self.cnt_layers)]
        return lookup_table_latency

get_loaders

Random cropping, horizontal flipping, and normalization.

    train_transform = transforms.Compose([
        transforms.RandomCrop(32, padding=4),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(CIFAR_MEAN, CIFAR_STD),
        ])
    train_data = datasets.CIFAR10(root=path_to_save_data, train=True, 
                                  download=True, transform=train_transform)

Build an index list and split the dataset.
torch.utils.data.SubsetRandomSampler samples elements randomly from the given list of indices, without replacement.

    num_train = len(train_data)                        # 50k
    indices = list(range(num_train))                   # 
    split = int(np.floor(train_portion * num_train))   # 40k
    
    train_idx, valid_idx = indices[:split], indices[split:]

    train_sampler = SubsetRandomSampler(train_idx)
    
    train_loader = torch.utils.data.DataLoader(
        train_data, batch_size=batch_size, sampler=train_sampler,
        pin_memory=True, num_workers=32)
    
    if train_portion == 1:
        return train_loader
    
    valid_sampler = SubsetRandomSampler(valid_idx)
    
    val_loader = torch.utils.data.DataLoader(
        train_data, batch_size=batch_size, sampler=train_sampler,  # note: the repository passes train_sampler here; valid_sampler appears intended
        pin_memory=True, num_workers=16)
    
    return train_loader, val_loader

get_test_loader

The test set is only normalized.

    test_transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize(CIFAR_MEAN, CIFAR_STD),
        ])
    
    test_data = datasets.CIFAR10(root=path_to_save_data, train=False,
                                 download=True, transform=test_transform)
    test_loader = torch.utils.data.DataLoader(test_data, batch_size=batch_size,
                                              shuffle=False, num_workers=16)
    return test_loader

FBNet_Stochastic_SuperNet

Components: ConvBNRelu and MixedOperation.

ConvBNRelu builds the basic block; only the convolution parameters are initialized.
torch.nn.ModuleList holds submodules in a list. It can be indexed like a regular Python list, but the modules it contains are properly registered and visible to all Module methods (see the sketch below).
MixedOperation runs the list of candidate operations and accumulates a latency-weighted sum.
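
A minimal sketch (mine, not from the repository) of why nn.ModuleList matters — parameters held in a plain Python list are not registered:

    import torch.nn as nn

    class WithList(nn.Module):
        def __init__(self):
            super().__init__()
            self.ops_plain  = [nn.Linear(4, 4)]                 # NOT registered
            self.ops_module = nn.ModuleList([nn.Linear(4, 4)])  # registered

    m = WithList()
    print([name for name, _ in m.named_parameters()])
    # ['ops_module.0.weight', 'ops_module.0.bias']

This is why MixedOperation stores its candidate ops in an nn.ModuleList: their weights must be visible to the optimizers built in train_supernet.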

    def __init__(self, lookup_table, cnt_classes=1000):
        super(FBNet_Stochastic_SuperNet, self).__init__()
        
        # self.first identical to 'add_first' in the fbnet_building_blocks/fbnet_builder.py
        self.first = ConvBNRelu(input_depth=3, output_depth=16, kernel=3, stride=2,
                                pad=3 // 2, no_bias=1, use_relu="relu", bn_type="bn")
        self.stages_to_search = nn.ModuleList([MixedOperation(
                                                   lookup_table.layers_parameters[layer_id],
                                                   lookup_table.lookup_table_operations,
                                                   lookup_table.lookup_table_latency[layer_id])
                                               for layer_id in range(lookup_table.cnt_layers)])
        self.last_stages = nn.Sequential(OrderedDict([
            ("conv_k1", nn.Conv2d(lookup_table.layers_parameters[-1][1], 1504, kernel_size = 1)),
            ("avg_pool_k7", nn.AvgPool2d(kernel_size=7)),
            ("flatten", Flatten()),
            ("fc", nn.Linear(in_features=1504, out_features=cnt_classes)),
        ]))

forward

The network reduces to three stages: first → stages_to_search → last_stages.

        y = self.first(x)
        for mixed_op in self.stages_to_search:
            y, latency_to_accumulate = mixed_op(y, temperature, latency_to_accumulate)
        y = self.last_stages(y)
        return y, latency_to_accumulate

MixedOperation

MixedOperation builds the operation list, the latency list and the corresponding parameters from the proposed_operations dictionary.
The keys of proposed_operations are extracted into the list ops_names; latency is a dictionary keyed by operation name.

    # Arguments:
    # proposed_operations is a dictionary {operation_name : op_constructor}
    # latency is a dictionary {operation_name : latency}
    def __init__(self, layer_parameters, proposed_operations, latency):
        super(MixedOperation, self).__init__()
        ops_names = [op_name for op_name in proposed_operations]
        
        self.ops = nn.ModuleList([proposed_operations[op_name](*layer_parameters)
                                  for op_name in ops_names])
        self.latency = [latency[op_name] for op_name in ops_names]
        self.thetas = nn.Parameter(torch.Tensor([1.0 / len(ops_names) for i in range(len(ops_names))]))

forward

$$m_{l,i} = \text{GumbelSoftmax}(\theta_{l,i} \mid \theta_l) = \frac{\exp[(\theta_{l,i} + g_{l,i})/\tau]}{\sum_i \exp[(\theta_{l,i} + g_{l,i})/\tau]},$$
$$x_{l+1} = \sum_i m_{l,i} \cdot b_{l,i}(x_l),$$
$$\text{LAT}(a) = \sum_l \sum_i m_{l,i} \cdot \text{LAT}(b_{l,i}).$$
torch.nn.functional.gumbel_softmax samples from the Gumbel-Softmax distribution ([Concrete Distribution], [Gumbel-Softmax]) and optionally discretizes.

Parameters:

  • logits: [..., num_features] unnormalized log probabilities
  • tau: non-negative scalar temperature
  • hard: if True, the returned samples are discretized to one-hot vectors, but differentiated in autograd as if they were the soft samples
  • dim (int): the dimension along which softmax is computed. Default: -1.

Returns:
A sampled tensor with the same shape as logits, drawn from the Gumbel-Softmax distribution. If hard=True, the returned samples are one-hot; otherwise they are probability distributions summing to 1 across dim.

This function exists for legacy reasons and may be removed from nn.functional in the future.

The main trick behind hard is computing y_hard - y_soft.detach() + y_soft.
It achieves two things (a sketch follows the list):

  • the output value is exactly one-hot (since we add and then subtract the y_soft value)
  • the gradient equals the gradient of y_soft (since all other gradients are stripped away)
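
A minimal sketch of that straight-through trick (my own illustration, mirroring what hard=True does internally):

    import torch
    import torch.nn.functional as F

    logits = torch.tensor([1.0, 2.0, 0.5], requires_grad=True)
    y_soft = F.gumbel_softmax(logits, tau=1.0, hard=False)

    index = y_soft.argmax(dim=-1, keepdim=True)
    y_hard = torch.zeros_like(y_soft).scatter_(-1, index, 1.0)

    # forward value is exactly one-hot; gradients flow only through y_soft
    y = y_hard - y_soft.detach() + y_soft

Calling F.gumbel_softmax(logits, tau, hard=True) performs the same combination internally.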

Here self.thetas may need a torch.Tensor.unsqueeze to become 2-dimensional (older versions of gumbel_softmax expected [batch, num_features] input).

        soft_mask_variables = nn.functional.gumbel_softmax(self.thetas, temperature)
        output  = sum(m * op(x) for m, op in zip(soft_mask_variables, self.ops))
        latency = sum(m * lat for m, lat in zip(soft_mask_variables, self.latency))
        latency_to_accumulate = latency_to_accumulate + latency
        return output, latency_to_accumulate

weights_init

weights_init only initializes convolutional and fully connected layers.

    if deepth > max_depth:
        return
    if isinstance(m, torch.nn.Conv2d):
        torch.nn.init.kaiming_uniform_(m.weight.data)
        if m.bias is not None:
            torch.nn.init.constant_(m.bias.data, 0)
    elif isinstance(m, torch.nn.Linear):
        m.weight.data.normal_(0, 0.01)
        if m.bias is not None:
            m.bias.data.zero_()
    elif isinstance(m, torch.nn.BatchNorm2d):
        return
    elif isinstance(m, torch.nn.ReLU):
        return
    elif isinstance(m, torch.nn.Module):
        deepth += 1
        for m_ in m.modules():
            weights_init(m_, deepth)
    else:
        raise ValueError("%s is unk" % m.__class__.__name__)

SupernetLoss

    def __init__(self):
        super(SupernetLoss, self).__init__()
        self.alpha = CONFIG_SUPERNET['loss']['alpha']
        self.beta = CONFIG_SUPERNET['loss']['beta']
        self.weight_criterion = nn.CrossEntropyLoss()

forward

$$\mathcal{L}(a, w_a) = \text{CE}(a, w_a) \cdot \alpha \log(\text{LAT}(a))^\beta.$$

The torch.log(latency ** self.beta) term should be reduced with a mean.

self.beta should be applied outside the log: as written, log(latency ** β) equals β · log(latency), so β degenerates into a constant factor instead of acting as the exponent of log(LAT) in the paper's term. A sketch of the corrected form follows the code block.

        ce = self.weight_criterion(outs, targets)
        lat = torch.log(latency ** self.beta)
        
        losses_ce.update(ce.item(), N)
        losses_lat.update(lat.item(), N)
        
        loss = self.alpha * ce * lat
        return loss #.unsqueeze(0)
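
A hedged sketch of the loss with β applied outside the log, matching the paper's form (my assumption about the intended behavior, not the repository's code; the function name supernet_loss is hypothetical):

    import torch
    import torch.nn.functional as F

    def supernet_loss(outs, targets, latency, alpha, beta):
        # beta as the exponent of log(LAT), as in the paper's formula
        ce = F.cross_entropy(outs, targets)
        lat = torch.log(latency.mean()) ** beta
        return alpha * ce * lat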

TrainerSupernet

AverageMeter accumulates values and tracks their running average.
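
A minimal sketch of what such a meter typically looks like (my own illustration; the repository's implementation may differ in detail):

    class AverageMeter:
        def __init__(self):
            self.reset()

        def reset(self):
            self.sum = 0.0
            self.count = 0

        def update(self, val, n=1):
            # val is a per-batch average, n the batch size
            self.sum += val * n
            self.count += n

        def get_avg(self):
            return self.sum / self.count if self.count else 0.0

The update(value, N) / get_avg() calls seen in _training_step and _validate match this interface.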

    def __init__(self, criterion, w_optimizer, theta_optimizer, w_scheduler, logger, writer):
        self.top1       = AverageMeter()
        self.top3       = AverageMeter()
        self.losses     = AverageMeter()
        self.losses_lat = AverageMeter()
        self.losses_ce  = AverageMeter()
        
        self.logger = logger
        self.writer = writer
        
        self.criterion = criterion
        self.w_optimizer = w_optimizer
        self.theta_optimizer = theta_optimizer
        self.w_scheduler = w_scheduler
        
        self.temperature                 = CONFIG_SUPERNET['train_settings']['init_temperature']
        self.exp_anneal_rate             = CONFIG_SUPERNET['train_settings']['exp_anneal_rate'] # apply it every epoch
        self.cnt_epochs                  = CONFIG_SUPERNET['train_settings']['cnt_epochs']
        self.train_thetas_from_the_epoch = CONFIG_SUPERNET['train_settings']['train_thetas_from_the_epoch']
        self.print_freq                  = CONFIG_SUPERNET['train_settings']['print_freq']
        self.path_to_save_model          = CONFIG_SUPERNET['train_settings']['path_to_save_model']

train_loop

Call chain: train_loop → _training_step → _validate

First, only the network weights are trained for self.train_thetas_from_the_epoch epochs.
Each call to _training_step trains one epoch; the name is not particularly descriptive.

        
        best_top1 = 0.0
        
        # firstly, train weights only
        for epoch in range(self.train_thetas_from_the_epoch):
            self.writer.add_scalar('learning_rate/weights', self.w_optimizer.param_groups[0]['lr'], epoch)
            
            self.logger.info("Firstly, start to train weights for epoch %d" % (epoch))
            self._training_step(model, train_w_loader, self.w_optimizer, epoch, info_for_logger="_w_step_")
            self.w_scheduler.step()

Then the weights and the architecture parameters are trained alternately. Alternating the updates reduces efficiency to some extent.

        for epoch in range(self.train_thetas_from_the_epoch, self.cnt_epochs):
            self.writer.add_scalar('learning_rate/weights', self.w_optimizer.param_groups[0]['lr'], epoch)
            self.writer.add_scalar('learning_rate/theta', self.theta_optimizer.param_groups[0]['lr'], epoch)
            
            self.logger.info("Start to train weights for epoch %d" % (epoch))
            self._training_step(model, train_w_loader, self.w_optimizer, epoch, info_for_logger="_w_step_")
            self.w_scheduler.step()
            
            self.logger.info("Start to train theta for epoch %d" % (epoch))
            self._training_step(model, train_thetas_loader, self.theta_optimizer, epoch, info_for_logger="_theta_step_")
            
            top1_avg = self._validate(model, test_loader, epoch)
            if best_top1 < top1_avg:
                best_top1 = top1_avg
                self.logger.info("Best top1 acc by now. Save model")
                save(model, self.path_to_save_model)
            
            self.temperature = self.temperature * self.exp_anneal_rate

_training_step

A latency_to_accumulate variable has to be constructed explicitly, with as many elements as there are devices.
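
A hedged sketch of shaping the accumulator to the DataParallel device count (my own assumption about a possible fix; the repository hard-codes a single element):

    import torch

    num_devices = max(1, torch.cuda.device_count())
    # one row per replica so DataParallel can scatter it along dim 0;
    # gradients still flow through the thetas added to it inside MixedOperation
    latency_to_accumulate = torch.zeros(num_devices, 1).cuda()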

_intermediate_stats_logging records the loss, top-1, top-3, cross-entropy and latency terms.
_epoch_stats_logging records per-epoch statistics to TensorBoard.

        model = model.train()
        start_time = time.time()
        
        for step, (X, y) in enumerate(loader):
            X, y = X.cuda(non_blocking=True), y.cuda(non_blocking=True)
            # X.to(device, non_blocking=True), y.to(device, non_blocking=True)
            N = X.shape[0]
            
            optimizer.zero_grad()
            latency_to_accumulate = Variable(torch.Tensor([[0.0]]), requires_grad=True).cuda()
            outs, latency_to_accumulate = model(X, self.temperature, latency_to_accumulate)
            loss = self.criterion(outs, y, latency_to_accumulate, self.losses_ce, self.losses_lat, N)
            loss.backward()
            optimizer.step()
            
            self._intermediate_stats_logging(outs, y, loss, step, epoch, N, len_loader=len(loader), val_or_train="Train")
        
        self._epoch_stats_logging(start_time=start_time, epoch=epoch, info_for_logger=info_for_logger, val_or_train='train')
        for avg in [self.top1, self.top3, self.losses]:
            avg.reset()

_validate

Evaluate accuracy on the validation (here, test) set.

        model.eval()
        start_time = time.time()

        with torch.no_grad():
            for step, (X, y) in enumerate(loader):
                X, y = X.cuda(), y.cuda()
                N = X.shape[0]
                
                latency_to_accumulate = torch.Tensor([[0.0]]).cuda()
                outs, latency_to_accumulate = model(X, self.temperature, latency_to_accumulate)
                loss = self.criterion(outs, y, latency_to_accumulate, self.losses_ce, self.losses_lat, N)

                self._intermediate_stats_logging(outs, y, loss, step, epoch, N, len_loader=len(loader), val_or_train="Valid")
                
        top1_avg = self.top1.get_avg()
        self._epoch_stats_logging(start_time=start_time, epoch=epoch, val_or_train='val')
        for avg in [self.top1, self.top3, self.losses]:
            avg.reset()
        return top1_avg

_intermediate_stats_logging

accuracy computes classification accuracy. Note that topk=(1, 5) yields top-1 and top-5 precision, even though the results are stored as prec3/top3.

        prec1, prec3 = accuracy(outs, y, topk=(1, 5))
        self.losses.update(loss.item(), N)
        self.top1.update(prec1.item(), N)
        self.top3.update(prec3.item(), N)

Statistics are logged whenever the step count hits the print interval, and on the final step.

        if (step > 1 and step % self.print_freq == 0) or step == len_loader - 1:
            self.logger.info(val_or_train+
               ": [{:3d}/{}] Step {:03d}/{:03d} Loss {:.3f} "
               "Prec@(1,3) ({:.1%}, {:.1%}), ce_loss {:.3f}, lat_loss {:.3f}".format(
                   epoch + 1, self.cnt_epochs, step, len_loader - 1, self.losses.get_avg(),
                   self.top1.get_avg(), self.top3.get_avg(), self.losses_ce.get_avg(), self.losses_lat.get_avg()))

_epoch_stats_logging

Record per-epoch statistics to TensorBoard.

        self.writer.add_scalar('train_vs_val/'+val_or_train+'_loss'+info_for_logger, self.losses.get_avg(), epoch)
        self.writer.add_scalar('train_vs_val/'+val_or_train+'_top1'+info_for_logger, self.top1.get_avg(), epoch)
        self.writer.add_scalar('train_vs_val/'+val_or_train+'_top3'+info_for_logger, self.top3.get_avg(), epoch)
        self.writer.add_scalar('train_vs_val/'+val_or_train+'_losses_lat'+info_for_logger, self.losses_lat.get_avg(), epoch)
        self.writer.add_scalar('train_vs_val/'+val_or_train+'_losses_ce'+info_for_logger, self.losses_ce.get_avg(), epoch)
        
        top1_avg = self.top1.get_avg()
        self.logger.info(info_for_logger+val_or_train + ": [{:3d}/{}] Final Prec@1 {:.4%} Time {:.2f}".format(
            epoch+1, self.cnt_epochs, top1_avg, time.time() - start_time))

accuracy

torch.topk returns the k largest elements of the given input tensor along the given dimension. If dim is not given, the last dimension of the input is chosen. If largest is False, the k smallest elements are returned instead. A namedtuple of (values, indices) is returned, where indices holds the positions of the elements in the original input tensor. If the boolean option sorted is True, the returned k elements are themselves sorted.
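
A tiny illustration of the returned structure (my own example):

    import torch

    scores = torch.tensor([[0.1, 0.7, 0.2]])
    values, indices = scores.topk(2, dim=1, largest=True, sorted=True)
    # values  -> tensor([[0.7000, 0.2000]])
    # indices -> tensor([[1, 2]])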

    """ Computes the precision@k for the specified values of k """
    maxk = max(topk)
    batch_size = target.size(0)

    _, pred = output.topk(maxk, 1, True, True)
    pred = pred.t()
    # one-hot case
    if target.ndimension() > 1:
        target = target.max(1)[1]

    correct = pred.eq(target.view(1, -1).expand_as(pred))

    res = []
    for k in topk:
        correct_k = correct[:k].view(-1).float().sum(0)
        res.append(correct_k.mul_(1.0 / batch_size))

    return res

PRIMITIVES

    "skip": lambda C_in, C_out, expansion, stride, **kwargs: Identity(
        C_in, C_out, stride
    ),
    "ir_k3": lambda C_in, C_out, expansion, stride, **kwargs: IRFBlock(
        C_in, C_out, expansion, stride, **kwargs
    ),
    "ir_k5": lambda C_in, C_out, expansion, stride, **kwargs: IRFBlock(
        C_in, C_out, expansion, stride, kernel=5, **kwargs
    ),
    "ir_k7": lambda C_in, C_out, expansion, stride, **kwargs: IRFBlock(
        C_in, C_out, expansion, stride, kernel=7, **kwargs
    ),
    "ir_k1": lambda C_in, C_out, expansion, stride, **kwargs: IRFBlock(
        C_in, C_out, expansion, stride, kernel=1, **kwargs
    ),
    "shuffle": lambda C_in, C_out, expansion, stride, **kwargs: IRFBlock(
        C_in, C_out, expansion, stride, shuffle_type="mid", pw_group=4, **kwargs
    ),
    "basic_block": lambda C_in, C_out, expansion, stride, **kwargs: CascadeConv3x3(
        C_in, C_out, stride
    ),
    "shift_5x5": lambda C_in, C_out, expansion, stride, **kwargs: ShiftBlock5x5(
        C_in, C_out, expansion, stride
    ),
    # layer search 2
    "ir_k3_e1": lambda C_in, C_out, expansion, stride, **kwargs: IRFBlock(
        C_in, C_out, 1, stride, kernel=3, **kwargs
    ),
    "ir_k3_e3": lambda C_in, C_out, expansion, stride, **kwargs: IRFBlock(
        C_in, C_out, 3, stride, kernel=3, **kwargs
    ),
    "ir_k3_e6": lambda C_in, C_out, expansion, stride, **kwargs: IRFBlock(
        C_in, C_out, 6, stride, kernel=3, **kwargs
    ),
    "ir_k3_s4": lambda C_in, C_out, expansion, stride, **kwargs: IRFBlock(
        C_in, C_out, 4, stride, kernel=3, shuffle_type="mid", pw_group=4, **kwargs
    ),
    "ir_k5_e1": lambda C_in, C_out, expansion, stride, **kwargs: IRFBlock(
        C_in, C_out, 1, stride, kernel=5, **kwargs
    ),
    "ir_k5_e3": lambda C_in, C_out, expansion, stride, **kwargs: IRFBlock(
        C_in, C_out, 3, stride, kernel=5, **kwargs
    ),
    "ir_k5_e6": lambda C_in, C_out, expansion, stride, **kwargs: IRFBlock(
        C_in, C_out, 6, stride, kernel=5, **kwargs
    ),
    "ir_k5_s4": lambda C_in, C_out, expansion, stride, **kwargs: IRFBlock(
        C_in, C_out, 4, stride, kernel=5, shuffle_type="mid", pw_group=4, **kwargs
    ),
    # layer search se
    "ir_k3_e1_se": lambda C_in, C_out, expansion, stride, **kwargs: IRFBlock(
        C_in, C_out, 1, stride, kernel=3, se=True, **kwargs
    ),
    "ir_k3_e3_se": lambda C_in, C_out, expansion, stride, **kwargs: IRFBlock(
        C_in, C_out, 3, stride, kernel=3, se=True, **kwargs
    ),
    "ir_k3_e6_se": lambda C_in, C_out, expansion, stride, **kwargs: IRFBlock(
        C_in, C_out, 6, stride, kernel=3, se=True, **kwargs
    ),
    "ir_k3_s4_se": lambda C_in, C_out, expansion, stride, **kwargs: IRFBlock(
        C_in,
        C_out,
        4,
        stride,
        kernel=3,
        shuffle_type="mid",
        pw_group=4,
        se=True,
        **kwargs
    ),
    "ir_k5_e1_se": lambda C_in, C_out, expansion, stride, **kwargs: IRFBlock(
        C_in, C_out, 1, stride, kernel=5, se=True, **kwargs
    ),
    "ir_k5_e3_se": lambda C_in, C_out, expansion, stride, **kwargs: IRFBlock(
        C_in, C_out, 3, stride, kernel=5, se=True, **kwargs
    ),
    "ir_k5_e6_se": lambda C_in, C_out, expansion, stride, **kwargs: IRFBlock(
        C_in, C_out, 6, stride, kernel=5, se=True, **kwargs
    ),
    "ir_k5_s4_se": lambda C_in, C_out, expansion, stride, **kwargs: IRFBlock(
        C_in,
        C_out,
        4,
        stride,
        kernel=5,
        shuffle_type="mid",
        pw_group=4,
        se=True,
        **kwargs
    ),
    # layer search 3 (in addition to layer search 2)
    "ir_k3_s2": lambda C_in, C_out, expansion, stride, **kwargs: IRFBlock(
        C_in, C_out, 1, stride, kernel=3, shuffle_type="mid", pw_group=2, **kwargs
    ),
    "ir_k5_s2": lambda C_in, C_out, expansion, stride, **kwargs: IRFBlock(
        C_in, C_out, 1, stride, kernel=5, shuffle_type="mid", pw_group=2, **kwargs
    ),
    "ir_k3_s2_se": lambda C_in, C_out, expansion, stride, **kwargs: IRFBlock(
        C_in,
        C_out,
        1,
        stride,
        kernel=3,
        shuffle_type="mid",
        pw_group=2,
        se=True,
        **kwargs
    ),
    "ir_k5_s2_se": lambda C_in, C_out, expansion, stride, **kwargs: IRFBlock(
        C_in,
        C_out,
        1,
        stride,
        kernel=5,
        shuffle_type="mid",
        pw_group=2,
        se=True,
        **kwargs
    ),
    # layer search 4 (in addition to layer search 3)
    "ir_k3_sep": lambda C_in, C_out, expansion, stride, **kwargs: IRFBlock(
        C_in, C_out, expansion, stride, kernel=3, cdw=True, **kwargs
    ),
    "ir_k33_e1": lambda C_in, C_out, expansion, stride, **kwargs: IRFBlock(
        C_in, C_out, 1, stride, kernel=3, cdw=True, **kwargs
    ),
    "ir_k33_e3": lambda C_in, C_out, expansion, stride, **kwargs: IRFBlock(
        C_in, C_out, 3, stride, kernel=3, cdw=True, **kwargs
    ),
    "ir_k33_e6": lambda C_in, C_out, expansion, stride, **kwargs: IRFBlock(
        C_in, C_out, 6, stride, kernel=3, cdw=True, **kwargs
    ),
    # layer search 5 (in addition to layer search 4)
    "ir_k7_e1": lambda C_in, C_out, expansion, stride, **kwargs: IRFBlock(
        C_in, C_out, 1, stride, kernel=7, **kwargs
    ),
    "ir_k7_e3": lambda C_in, C_out, expansion, stride, **kwargs: IRFBlock(
        C_in, C_out, 3, stride, kernel=7, **kwargs
    ),
    "ir_k7_e6": lambda C_in, C_out, expansion, stride, **kwargs: IRFBlock(
        C_in, C_out, 6, stride, kernel=7, **kwargs
    ),
    "ir_k7_sep": lambda C_in, C_out, expansion, stride, **kwargs: IRFBlock(
        C_in, C_out, expansion, stride, kernel=7, cdw=True, **kwargs
    ),
    "ir_k7_sep_e1": lambda C_in, C_out, expansion, stride, **kwargs: IRFBlock(
        C_in, C_out, 1, stride, kernel=7, cdw=True, **kwargs
    ),
    "ir_k7_sep_e3": lambda C_in, C_out, expansion, stride, **kwargs: IRFBlock(
        C_in, C_out, 3, stride, kernel=7, cdw=True, **kwargs
    ),
    "ir_k7_sep_e6": lambda C_in, C_out, expansion, stride, **kwargs: IRFBlock(
        C_in, C_out, 6, stride, kernel=7, cdw=True, **kwargs
    ),
}

ConvBNRelu

The ConvBNRelu module offers many options: BN can be omitted, or FrozenBatchNorm2d can be used instead. By comparison, nn.BatchNorm2d(C_out, affine=affine) is the more common choice.

    def __init__(
        self,
        input_depth,
        output_depth,
        kernel,
        stride,
        pad,
        no_bias,
        use_relu,
        bn_type,
        group=1,
        *args,
        **kwargs
    ):
        super(ConvBNRelu, self).__init__()

        assert use_relu in ["relu", None]
        if isinstance(bn_type, (list, tuple)):
            assert len(bn_type) == 2
            assert bn_type[0] == "gn"
            gn_group = bn_type[1]
            bn_type = bn_type[0]
        assert bn_type in ["bn", "af", "gn", None]
        assert stride in [1, 2, 4]

        op = Conv2d(
            input_depth,
            output_depth,
            kernel_size=kernel,
            stride=stride,
            padding=pad,
            bias=not no_bias,
            groups=group,
            *args,
            **kwargs
        )
        nn.init.kaiming_normal_(op.weight, mode="fan_out", nonlinearity="relu")
        if op.bias is not None:
            nn.init.constant_(op.bias, 0.0)
        self.add_module("conv", op)

        if bn_type == "bn":
            bn_op = BatchNorm2d(output_depth)
        elif bn_type == "gn":
            bn_op = nn.GroupNorm(num_groups=gn_group, num_channels=output_depth)
        elif bn_type == "af":
            bn_op = FrozenBatchNorm2d(output_depth)
        if bn_type is not None:
            self.add_module("bn", bn_op)

        if use_relu == "relu":
            self.add_module("relu", nn.ReLU(inplace=True))

sample_architecture_from_the_supernet

Flowchart: sample_architecture_from_the_supernet → get_logger → LookUpTable → FBNet_Stochastic_SuperNet → load → (hardsampling? argmax : softmax sampling) → writh_new_ARCH_to_fbnet_modeldef

Load the model. Since save stored a torch.nn.DataParallel-wrapped model, the model passed to load must be wrapped the same way; its module attribute holds the underlying model.
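
A hedged sketch of the common alternative — loading the checkpoint into an unwrapped model by stripping the "module." prefix (not something the repository does):

    import torch

    state = torch.load(CONFIG_SUPERNET['train_settings']['path_to_save_model'])
    state = {k.replace('module.', '', 1): v for k, v in state.items()}
    plain_model = FBNet_Stochastic_SuperNet(lookup_table, cnt_classes=10).cuda()
    plain_model.load_state_dict(state)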

    logger = get_logger(CONFIG_SUPERNET['logging']['path_to_log_file'])
    
    lookup_table = LookUpTable()
    model = FBNet_Stochastic_SuperNet(lookup_table, cnt_classes=10).cuda()
    model = nn.DataParallel(model)

    load(model, CONFIG_SUPERNET['train_settings']['path_to_save_model'])

    ops_names = [op_name for op_name in lookup_table.lookup_table_operations]
    cnt_ops = len(ops_names)

numpy.linspace returns evenly spaced numbers over a specified interval.
scipy.special.softmax converts the θ values into probabilities.
With hardsampling, each TBS layer simply takes the operation with the largest θ; otherwise an operation is drawn per layer from

$$P_{\theta_l}(b_l = b_{l,i}) = \text{softmax}(\theta_{l,i};\, \theta_l) = \frac{\exp(\theta_{l,i})}{\sum_i \exp(\theta_{l,i})}, \qquad P_{\theta}(a) = \prod_l P_{\theta_l}\!\left(b_l = b_{l,i}^{(a)}\right).$$

The sampled architecture is then written out by writh_new_ARCH_to_fbnet_modeldef.

    arch_operations=[]
    if hardsampling:
        for layer in model.module.stages_to_search:
            arch_operations.append(ops_names[np.argmax(layer.thetas.detach().cpu().numpy())])
    else:
        rng = np.linspace(0, cnt_ops - 1, cnt_ops, dtype=int)
        for layer in model.module.stages_to_search:
            distribution = softmax(layer.thetas.detach().cpu().numpy())
            arch_operations.append(ops_names[np.random.choice(rng, p=distribution)])
    
    logger.info("Sampled Architecture: " + " - ".join(arch_operations))
    writh_new_ARCH_to_fbnet_modeldef(arch_operations, my_unique_name_for_ARCH=unique_name_of_arch)
    logger.info("CONGRATULATIONS! New architecture " + unique_name_of_arch \
                + " was written into fbnet_building_blocks/fbnet_modeldef.py")

load

    model.load_state_dict(torch.load(model_path))

writh_new_ARCH_to_fbnet_modeldef

MODEL_ARCH stores the model architecture definitions.
First check whether the name already exists.

    assert len(ops_names) == 22
    if my_unique_name_for_ARCH in MODEL_ARCH:
        print("The specification with the name", my_unique_name_for_ARCH, "already written \
              to the fbnet_building_blocks.fbnet_modeldef. Please, create a new name \
              or delete the specification from fbnet_building_blocks.fbnet_modeldef (by hand)")
        assert my_unique_name_for_ARCH not in MODEL_ARCH

ops_names is converted into the string list ops, which is then grouped by stage and joined into ops_lines.

    ### create text to insert
    
    text_to_write = "    \"" + my_unique_name_for_ARCH + "\": {\n\
            \"block_op_type\": [\n"

    ops = ["[\"" + str(op) + "\"], " for op in ops_names]
    ops_lines = [ops[0], ops[1:5], ops[5:9], ops[9:13], ops[13:17], ops[17:21], ops[21]]
    ops_lines = [''.join(line) for line in ops_lines]
    text_to_write += '            ' + '\n            '.join(ops_lines)

Record the expansion information for each layer; e is the expansion ratio parsed from each operation name.

    e = [(op_name[-1] if op_name[-2] == 'e' else '1') for op_name in ops_names]

    text_to_write += "\n\
            ],\n\
            \"block_cfg\": {\n\
                \"first\": [16, 2],\n\
                \"stages\": [\n\
                    [["+e[0]+", 16, 1, 1]],                                                        # stage 1\n\
                    [["+e[1]+", 24, 1, 2]],  [["+e[2]+", 24, 1, 1]],  \
    [["+e[3]+", 24, 1, 1]],  [["+e[4]+", 24, 1, 1]],  # stage 2\n\
                    [["+e[5]+", 32, 1, 2]],  [["+e[6]+", 32, 1, 1]],  \
    [["+e[7]+", 32, 1, 1]],  [["+e[8]+", 32, 1, 1]],  # stage 3\n\
                    [["+e[9]+", 64, 1, 2]],  [["+e[10]+", 64, 1, 1]],  \
    [["+e[11]+", 64, 1, 1]],  [["+e[12]+", 64, 1, 1]],  # stage 4\n\
                    [["+e[13]+", 112, 1, 1]], [["+e[14]+", 112, 1, 1]], \
    [["+e[15]+", 112, 1, 1]], [["+e[16]+", 112, 1, 1]], # stage 5\n\
                    [["+e[17]+", 184, 1, 2]], [["+e[18]+", 184, 1, 1]], \
    [["+e[19]+", 184, 1, 1]], [["+e[20]+", 184, 1, 1]], # stage 6\n\
                    [["+e[21]+", 352, 1, 1]],                                                       # stage 7\n\
                ],\n\
                \"backbone\": [num for num in range(23)],\n\
            },\n\
        },\n\
}\
"

Read ./fbnet_building_blocks/fbnet_modeldef.py, append the new entry and write it back.
The trailing closing brace has to be skipped.
next retrieves the next item from the iterator by calling its __next__() method. If default is given, it is returned when the iterator is exhausted; otherwise StopIteration is raised.

    ### open file and find place to insert
    with open('./fbnet_building_blocks/fbnet_modeldef.py') as f1:
        lines = f1.readlines()
    end_of_MODEL_ARCH_id = next(i for i in reversed(range(len(lines))) if lines[i].strip() == '}')
    text_to_write = lines[:end_of_MODEL_ARCH_id] + [text_to_write]
    with open('./fbnet_building_blocks/fbnet_modeldef.py', 'w') as f2:
        f2.writelines(text_to_write)

References:

  • Lambda Lambda Lambda
  • Print lists in Python (4 Different Ways)
  • Optional: Data Parallelism
