RetinaNet Examples: NVIDIA's One-Stop Solution for Training, Inference, and Model Conversion

retinanet-examples is NVIDIA's reference object detection project, optimized for end-to-end GPU processing:

  • distributed training accelerated with apex.parallel.DistributedDataParallel on top of Python multiprocessing;
  • mixed-precision training with apex.amp;
  • data preprocessing accelerated with NVIDIA DALI;
  • inference with TensorRT.

The project recommends running inside the PyTorch NGC Docker container:

nvidia-docker run --rm --ipc=host -it nvcr.io/nvidia/pytorch:19.05-py3

Note, however, that recent releases only support GPUs with compute capability 6.0 or higher.
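
To check whether a GPU meets that requirement, its compute capability can be queried from PyTorch (a quick illustrative snippet, not part of the project):

    import torch

    # Print the compute capability of every visible GPU; recent releases need >= 6.0
    for i in range(torch.cuda.device_count()):
        major, minor = torch.cuda.get_device_capability(i)
        print('GPU {}: {} (compute capability {}.{})'.format(
            i, torch.cuda.get_device_name(i), major, minor))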

main

Flowchart: main → parse args → load_model → worker → End

parse parses the command-line arguments.
load_model creates the model and loads its parameters.
Module.share_memory is not listed in the documentation.
worker is the function that does the actual work.
torch.multiprocessing.spawn spawns nprocs processes that run fn with args. If one of the processes exits with a non-zero exit status, the remaining processes are killed and an exception is raised with the cause of termination. In the case an exception is caught in a child process, it is forwarded and its traceback is included in the exception raised in the parent process.

One process is created per device to run worker.

    'Entry point for the retinanet command'

    args = parse(args or sys.argv[1:])

    model, state = load_model(args, verbose=True)
    if model: model.share_memory()

    world = torch.cuda.device_count()
    if args.command == 'export' or world <= 1:
        worker(0, args, 1, model, state)
    else:
        torch.multiprocessing.spawn(worker, args=(args, world, model, state), nprocs=world)

parse

ArgumentParser.add_subparsers
Many programs split their functionality into a number of sub-commands; for example, the svn program can invoke sub-commands like svn checkout, svn update, and svn commit. Splitting up functionality this way can be a particularly good idea when a program performs several different functions which require different kinds of command-line arguments. ArgumentParser supports the creation of such sub-commands with the add_subparsers() method. The add_subparsers() method is normally called with no arguments and returns a special action object. This object has a single method, add_parser(), which takes a command name and any ArgumentParser constructor arguments, and returns an ArgumentParser object that can be modified as usual.

Parameters:

  • title: title for the sub-parser group in help output; by default "subcommands" if a description is provided, otherwise it uses the title for positional arguments.
  • description: description for the sub-parser group in help output, by default None.
  • prog: usage information that will be displayed with sub-command help; by default the name of the program and any positional arguments before the subparser argument.
  • parser_class: class which will be used to create sub-parser instances, by default the class of the current parser (e.g. ArgumentParser).
  • action: the basic type of action to be taken when this argument is encountered at the command line.
  • dest: name of the attribute under which the sub-command name will be stored; by default None and no value is stored.
  • required: whether or not a sub-command must be provided, by default False.
  • help: help for the sub-parser group in help output, by default None.
  • metavar: string presenting the available sub-commands in help; by default it is None and presents the sub-commands in the form {cmd1, cmd2, …}.
    parser = argparse.ArgumentParser(description='RetinaNet Detection Utility.')
    parser.add_argument('--master', metavar='address:port', type=str, help='Address and port of the master worker', default='127.0.0.1:29500')

    subparsers = parser.add_subparsers(help='sub-command', dest='command')
    subparsers.required = True

    devcount = max(1, torch.cuda.device_count())

Training argument setup.
The default batch size is two images per device.

    parser_train = subparsers.add_parser('train', help='train a network')
    parser_train.add_argument('model', type=str, help='path to output model or checkpoint to resume from')
    parser_train.add_argument('--annotations', metavar='path', type=str, help='path to COCO style annotations', required=True)
    parser_train.add_argument('--images', metavar='path', type=str, help='path to images', default='.')
    parser_train.add_argument('--backbone', action='store', type=str, nargs='+', help='backbone model (or list of)', default=['ResNet50FPN'])
    parser_train.add_argument('--classes', metavar='num', type=int, help='number of classes', default=80)
    parser_train.add_argument('--batch', metavar='size', type=int, help='batch size', default=2*devcount)
    parser_train.add_argument('--resize', metavar='scale', type=int, help='resize to given size', default=800)
    parser_train.add_argument('--max-size', metavar='max', type=int, help='maximum resizing size', default=1333)
    parser_train.add_argument('--jitter', metavar='min max', type=int, nargs=2, help='jitter size within range', default=[640, 1024])
    parser_train.add_argument('--iters', metavar='number', type=int, help='number of iterations to train for', default=90000)
    parser_train.add_argument('--milestones', action='store', type=int, nargs='*', help='list of iteration indices where learning rate decays', default=[60000, 80000])
    parser_train.add_argument('--schedule', metavar='scale', type=float, help='scale schedule (affecting iters and milestones)', default=1)
    parser_train.add_argument('--full-precision', help='train in full precision', action='store_true')
    parser_train.add_argument('--lr', metavar='value', help='learning rate', type=float, default=0.01)
    parser_train.add_argument('--warmup', metavar='iterations', help='number of warmup iterations', type=int, default=1000)
    parser_train.add_argument('--gamma', metavar='value', type=float, help='multiplicative factor of learning rate decay', default=0.1)
    parser_train.add_argument('--override', help='override model', action='store_true')
    parser_train.add_argument('--val-annotations', metavar='path', type=str, help='path to COCO style validation annotations')
    parser_train.add_argument('--val-images', metavar='path', type=str, help='path to validation images')
    parser_train.add_argument('--post-metrics', metavar='url', type=str, help='post metrics to specified url')
    parser_train.add_argument('--fine-tune', metavar='path', type=str, help='fine tune a pretrained model')
    parser_train.add_argument('--logdir', metavar='logdir', type=str, help='directory where to write logs')
    parser_train.add_argument('--val-iters', metavar='number', type=int, help='number of iterations between each validation', default=8000)
    parser_train.add_argument('--with-dali', help='use dali for data loading', action='store_true')

Inference argument setup.

    parser_infer = subparsers.add_parser('infer', help='run inference')
    parser_infer.add_argument('model', type=str, help='path to model')
    parser_infer.add_argument('--images', metavar='path', type=str, help='path to images', default='.')
    parser_infer.add_argument('--annotations', metavar='annotations', type=str, help='evaluate using provided annotations')
    parser_infer.add_argument('--output', metavar='file', type=str, help='save detections to specified JSON file', default='detections.json')
    parser_infer.add_argument('--batch', metavar='size', type=int, help='batch size', default=2*devcount)
    parser_infer.add_argument('--resize', metavar='scale', type=int, help='resize to given size', default=800)
    parser_infer.add_argument('--max-size', metavar='max', type=int, help='maximum resizing size', default=1333)
    parser_infer.add_argument('--with-dali', help='use dali for data loading', action='store_true')
    parser_infer.add_argument('--full-precision', help='inference in full precision', action='store_true')

Model export arguments.

    parser_export = subparsers.add_parser('export', help='export a model into a TensorRT engine')
    parser_export.add_argument('model', type=str, help='path to model')
    parser_export.add_argument('export', type=str, help='path to exported output')
    parser_export.add_argument('--size', metavar='height width', type=int, nargs='+', help='input size (square) or sizes (h w) to use when generating TensorRT engine', default=[1280])
    parser_export.add_argument('--batch', metavar='size', type=int, help='max batch size to use for TensorRT engine', default=2)
    parser_export.add_argument('--full-precision', help='export in full instead of half precision', action='store_true')
    parser_export.add_argument('--int8', help='calibrate model and export in int8 precision', action='store_true')
    parser_export.add_argument('--opset', metavar='version', type=int, help='ONNX opset version')
    parser_export.add_argument('--calibration-batches', metavar='size', type=int, help='number of batches to use for int8 calibration', default=10)
    parser_export.add_argument('--calibration-images', metavar='path', type=str, help='path to calibration images to use for int8 calibration', default="")
    parser_export.add_argument('--calibration-table', metavar='path', type=str, help='path of existing calibration table to load from, or name of new calibration table', default="")
    parser_export.add_argument('--verbose', help='enable verbose logging', action='store_true')

    return parser.parse_args(args)
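
As a quick illustration of how parse is used (the file and directory names below are made up; only the flags come from the parser above):

    # Hypothetical invocation, equivalent to:
    #   retinanet train model.pth --annotations instances_train2017.json --images train2017/
    args = parse(['train', 'model.pth',
                  '--annotations', 'instances_train2017.json',
                  '--images', 'train2017/'])
    print(args.command)   # 'train'
    print(args.batch)     # 2 * number of visible GPUs by default
    print(args.lr)        # 0.01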

load_model

Check whether a model file was specified.

    if args.command != 'train' and not os.path.isfile(args.model):
        raise RuntimeError('Model file {} does not exist!'.format(args.model))

Parse the model file's extension.

    model = None
    state = {}
    _, ext = os.path.splitext(args.model)

In training mode, if no model file is specified, a Model instance is created; Model.initialize loads pretrained parameters or performs the initialization.
Model.load first builds the model from its backbone(s), then loads the parameters and retrieves the training state variables.

    if args.command == 'train' and (not os.path.exists(args.model) or args.override):
        if verbose: print('Initializing model...')
        model = Model(args.backbone, args.classes)
        model.initialize(args.fine_tune)
        if verbose: print(model)

    elif ext == '.pth' or ext == '.torch':
        if verbose: print('Loading model from {}...'.format(os.path.basename(args.model)))
        model, state = Model.load(args.model)
        if verbose: print(model)

    elif args.command == 'infer' and ext in ['.engine', '.plan']:
        model = None
    
    else:
        raise RuntimeError('Invalid model format "{}"!'.format(ext))

    state['path'] = args.model
    return model, state

worker

Call graph: worker → train.train / infer.infer / model.export

The worker function handles three kinds of tasks: training, inference, and model export.

os.environ is one of the process parameters: a mapping object representing the string environment. For example, environ['HOME'] is the pathname of your home directory (on some platforms), equivalent to getenv("HOME") in C. This mapping is captured the first time the os module is imported, typically during Python startup as part of processing site.py. Changes to the environment made after this time are not reflected in os.environ, except for changes made by modifying os.environ directly.

torch.cuda.set_device sets the current device.
torch.distributed.init_process_group initializes the default distributed process group, which also initializes the distributed package. There are two main ways to initialize a process group:

  • explicitly specify store, rank and world_size;
  • specify init_method (a URL string) indicating where/how to discover peers, and optionally specify rank and world_size, or encode all required parameters in the URL and omit them.

If neither is specified, init_method is assumed to be "env://".

    'Per-device distributed worker'

    if torch.cuda.is_available():
        os.environ.update({
            'MASTER_PORT': args.master.split(':')[-1],
            'MASTER_ADDR': ':'.join(args.master.split(':')[:-1]),
            'WORLD_SIZE':  str(world),
            'RANK':        str(rank),
            'CUDA_DEVICE': str(rank)
        })

        torch.cuda.set_device(rank)
        torch.distributed.init_process_group(backend='nccl', init_method='env://')

        if args.batch % world != 0:
            raise RuntimeError('Batch size should be a multiple of the number of GPUs')

The train function takes a particularly large number of arguments.

    if args.command == 'train':
        train.train(model, state, args.images, args.annotations,
            args.val_images or args.images, args.val_annotations, args.resize, args.max_size, args.jitter, 
            args.batch, int(args.iters * args.schedule), args.val_iters, not args.full_precision, args.lr, 
            args.warmup, [int(m * args.schedule) for m in args.milestones], args.gamma, 
            is_master=(rank == 0), world=world, use_dali=args.with_dali,
            metrics_url=args.post_metrics, logdir=args.logdir, verbose=(rank == 0))

For inference, Engine::_load is called to load the model. The Engine class wraps a TensorRT CUDA engine; the PYBIND11_MODULE macro in extensions.cpp exposes the C++ symbols to Python.
infer then runs the inference.

    elif args.command == 'infer':
        if model is None:
            if rank == 0: print('Loading CUDA engine from {}...'.format(os.path.basename(args.model)))
            model = Engine.load(args.model)

        infer.infer(model, args.images, args.output, args.resize, args.max_size, args.batch,
            annotations=args.annotations, mixed_precision=not args.full_precision,
            is_master=(rank == 0), world=world, use_dali=args.with_dali, verbose=(rank == 0))

For export, the list of calibration images is collected from the given path and shuffled.
Model.export performs the calibration and the export.

    elif args.command == 'export':
        onnx_only = args.export.split('.')[-1] == 'onnx'
        input_size = args.size * 2 if len(args.size) == 1 else args.size

        calibration_files = []
        if args.int8:
            # Get list of images to use for calibration
            if os.path.isdir(args.calibration_images):
                import glob
                file_extensions = ['.jpg', '.JPG', '.jpeg', '.JPEG', '.png', '.PNG']
                for ex in file_extensions:
                    calibration_files += glob.glob("{}/*{}".format(args.calibration_images, ex), recursive=True)
                # Only need enough images for specified num of calibration batches
                if len(calibration_files) >= args.calibration_batches * args.batch:
                    calibration_files = calibration_files[:(args.calibration_batches * args.batch)]
                else:
                    print('Only found enough images for {} batches. Continuing anyway...'.format(len(calibration_files) // args.batch))

                random.shuffle(calibration_files)

        precision = "FP32"
        if args.int8:
            precision = "INT8"
        elif not args.full_precision:
            precision = "FP16"

        exported = model.export(input_size, args.batch, precision, calibration_files, args.calibration_table, args.verbose, onnx_only=onnx_only, opset=args.opset)
        if onnx_only:
            with open(args.export, 'wb') as out:
                out.write(exported)
        else:
            exported.save(args.export)

train

The original model is kept in nn_model.
convert_fixedbn_model replaces every torch.nn.BatchNorm2d in the model with FixedBatchNorm2d.

    'Train the model on the given dataset'

    # Prepare model
    nn_model = model
    stride = model.stride

    model = convert_fixedbn_model(model)
    if torch.cuda.is_available():
        model = model.cuda()

apex.amp.initialize initializes the model, the optimizer, and the Torch tensor and functional namespaces according to the chosen opt_level and overridden properties, if any. amp.initialize should be called after the model and optimizer have been constructed, but before the model is sent through the torch.nn.parallel.DistributedDataParallel wrapper. See Distributed training in the ImageNet example. Currently, amp.initialize should only be called once, although it can handle an arbitrary number of models and optimizers (see the corresponding Advanced Amp Usage topic). If your use case seems to require calling amp.initialize more than once, contact NVIDIA. Any property keyword argument that is not None is interpreted as a manual override. To avoid having to rewrite anything else in the script, name the returned model/optimizer so that they replace the ones passed in.

apex.parallel.DistributedDataParallel is a module wrapper that enables simple multiprocess distributed data parallel training, similar to torch.nn.parallel.DistributedDataParallel. Parameters are broadcast across participating processes on initialization, and gradients are allreduced and averaged across processes during backward().

DistributedDataParallel is optimized for use with NCCL. It achieves high performance by overlapping communication with computation during backward() and by bucketing smaller gradient transfers to reduce the total number of transfers required. This is similar to the optimizations done in NVCaffe.

    # Setup optimizer and schedule
    optimizer = SGD(model.parameters(), lr=lr, weight_decay=0.0001, momentum=0.9) 

    model, optimizer = amp.initialize(model, optimizer,
                                      opt_level = 'O2' if mixed_precision else 'O0',
                                      keep_batchnorm_fp32 = True,
                                      loss_scale = 128.0,
                                      verbosity = is_master)

    if world > 1: 
        model = DistributedDataParallel(model)
    model.train()

    if 'optimizer' in state:
        optimizer.load_state_dict(state['optimizer'])

    def schedule(train_iter):
        if warmup and train_iter <= warmup:
            return 0.9 * train_iter / warmup + 0.1
        return gamma ** len([m for m in milestones if m <= train_iter])
    scheduler = LambdaLR(optimizer, schedule)
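
With the default arguments (lr=0.01, warmup=1000, milestones=[60000, 80000], gamma=0.1), the multiplier returned by schedule behaves as follows (a worked example, not code from the repository):

    # Warmup: the multiplier ramps linearly from 0.1 to 1.0
    #   schedule(0)     = 0.10   -> lr = 0.001
    #   schedule(500)   = 0.55   -> lr = 0.0055
    #   schedule(1000)  = 1.00   -> lr = 0.01
    # Afterwards it decays by gamma at each milestone
    #   schedule(30000) = 1.00   -> lr = 0.01
    #   schedule(65000) = 0.10   -> lr = 0.001   (past the 60000 milestone)
    #   schedule(85000) = 0.01   -> lr = 0.0001  (past both milestones)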

DaliDataIterator loads the data with NVIDIA DALI.

    # Prepare dataset
    if verbose: print('Preparing dataset...')
    data_iterator = (DaliDataIterator if use_dali else DataIterator)(
        path, jitter, max_size, batch_size, stride,
        world, annotations, training=True)
    if verbose: print(data_iterator)


    if verbose:
        print('    device: {} {}'.format(
            world, 'cpu' if not torch.cuda.is_available() else 'gpu' if world == 1 else 'gpus'))
        print('    batch: {}, precision: {}'.format(batch_size, 'mixed' if mixed_precision else 'full'))
        print('Training model for {} iterations...'.format(iterations))

    # Create TensorBoard writer
    if logdir is not None:
        from tensorboardX import SummaryWriter
        if is_master and verbose:
            print('Writing TensorBoard logs to: {}'.format(logdir))
        writer = SummaryWriter(logdir=logdir)

Profiler records timings for analysis.
data is deleted manually once the forward pass is done.
On context-manager entrance, apex.amp.scale_loss creates scaled_loss = (loss.float()) * current loss scale and yields scaled_loss, so that the user can call scaled_loss.backward():

with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()

On context-manager exit (if delay_unscale=False), the gradients are checked for infs/NaNs and unscaled, so that optimizer.step() can be called.

    profiler = Profiler(['train', 'fw', 'bw'])
    iteration = state.get('iteration', 0)
    while iteration < iterations:
        cls_losses, box_losses = [], []
        for i, (data, target) in enumerate(data_iterator):
            scheduler.step(iteration)

            # Forward pass
            profiler.start('fw')

            optimizer.zero_grad()
            cls_loss, box_loss = model([data, target])
            del data
            profiler.stop('fw')

            # Backward pass
            profiler.start('bw')
            with amp.scale_loss(cls_loss + box_loss, optimizer) as scaled_loss:
                scaled_loss.backward()
            optimizer.step()

            # Reduce all losses
            cls_loss, box_loss = cls_loss.mean().clone(), box_loss.mean().clone()
            if world > 1:
                torch.distributed.all_reduce(cls_loss)
                torch.distributed.all_reduce(box_loss)
                cls_loss /= world
                box_loss /= world
            if is_master:
                cls_losses.append(cls_loss)
                box_losses.append(box_loss)

            if is_master and not isfinite(cls_loss + box_loss):
                raise RuntimeError('Loss is diverging!\n{}'.format(
                    'Try lowering the learning rate.'))

            del cls_loss, box_loss
            profiler.stop('bw')

            iteration += 1

The master process prints and logs progress information.
post_metrics posts the metrics with requests.
ignore_sigint ignores the interrupt signal while the checkpoint is being saved.
infer runs inference with the current model (used here for validation).

            profiler.bump('train')
            if is_master and (profiler.totals['train'] > 60 or iteration == iterations):
                focal_loss = torch.stack(list(cls_losses)).mean().item()
                box_loss = torch.stack(list(box_losses)).mean().item()
                learning_rate = optimizer.param_groups[0]['lr']
                if verbose:
                    msg  = '[{:{len}}/{}]'.format(iteration, iterations, len=len(str(iterations)))
                    msg += ' focal loss: {:.3f}'.format(focal_loss)
                    msg += ', box loss: {:.3f}'.format(box_loss)
                    msg += ', {:.3f}s/{}-batch'.format(profiler.means['train'], batch_size)
                    msg += ' (fw: {:.3f}s, bw: {:.3f}s)'.format(profiler.means['fw'], profiler.means['bw'])
                    msg += ', {:.1f} im/s'.format(batch_size / profiler.means['train'])
                    msg += ', lr: {:.2g}'.format(learning_rate)
                    print(msg, flush=True)

                if logdir is not None:
                    writer.add_scalar('focal_loss', focal_loss,  iteration)
                    writer.add_scalar('box_loss', box_loss, iteration)
                    writer.add_scalar('learning_rate', learning_rate, iteration)
                    del box_loss, focal_loss

                if metrics_url:
                    post_metrics(metrics_url, {
                        'focal loss': mean(cls_losses),
                        'box loss': mean(box_losses),
                        'im_s': batch_size / profiler.means['train'],
                        'lr': learning_rate
                    })

                # Save model weights
                state.update({
                    'iteration': iteration,
                    'optimizer': optimizer.state_dict(),
                    'scheduler': scheduler.state_dict(),
                })
                with ignore_sigint():
                    nn_model.save(state)

                profiler.reset()
                del cls_losses[:], box_losses[:]

            if val_annotations and (iteration == iterations or iteration % val_iterations == 0):
                infer(model, val_path, None, resize, max_size, batch_size, annotations=val_annotations,
                    mixed_precision=mixed_precision, is_master=is_master, world=world, use_dali=use_dali, is_validation=True, verbose=False)
                model.train()

            if iteration == iterations:
                break

    if logdir is not None:
        writer.close()

infer

The execution backend is chosen based on the model's type.

    'Run inference on images from path'

    backend = 'pytorch' if isinstance(model, Model) or isinstance(model, DDP) else 'tensorrt'

    stride = model.module.stride if isinstance(model, DDP) else model.stride

tempfile.mktemp is deprecated. tempfile.mkstemp creates a temporary file in the most secure manner possible: there are no race conditions in the file's creation, assuming that the platform properly implements the os.O_EXCL flag for os.open(). The file is readable and writable only by the creating user ID. If the platform uses permission bits to indicate whether a file is executable, the file is executable by no one. The file descriptor is not inherited by child processes.

    # Create annotations if none was provided
    if not annotations:
        annotations = tempfile.mktemp('.json')
        images = [{ 'id': i, 'file_name': f} for i, f in enumerate(os.listdir(path))]
        json.dump({ 'images': images }, open(annotations, 'w'))
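
Since tempfile.mktemp is deprecated, the same step could be written with tempfile.mkstemp instead (a sketch, not the repository's code):

    import json
    import os
    import tempfile

    # mkstemp returns an open file descriptor plus the path, avoiding the race
    # condition that makes mktemp unsafe
    fd, annotations = tempfile.mkstemp(suffix='.json')
    images = [{'id': i, 'file_name': f} for i, f in enumerate(os.listdir(path))]
    with os.fdopen(fd, 'w') as out:
        json.dump({'images': images}, out)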

    # TensorRT only supports fixed input sizes, so override input size accordingly
    if backend == 'tensorrt': max_size = max(model.input_size)

The data is loaded with either DaliDataIterator or DataIterator.

    # Prepare dataset
    if verbose: print('Preparing dataset...')
    data_iterator = (DaliDataIterator if use_dali else DataIterator)(
        path, resize, max_size, batch_size, stride,
        world, annotations, training=False)
    if verbose: print(data_iterator)

Distinguish between standalone inference and validation during training.

    # Prepare model
    if backend == 'pytorch':
        # If we are doing validation during training,
        # no need to register model with AMP again
        if not is_validation:
            if torch.cuda.is_available(): model = model.cuda()
            model = amp.initialize(model, None,
                               opt_level = 'O2' if mixed_precision else 'O0',
                               keep_batchnorm_fp32 = True,
                               verbosity = 0)

        model.eval()
 
    if verbose:
        print('   backend: {}'.format(backend))
        print('    device: {} {}'.format(
            world, 'cpu' if not torch.cuda.is_available() else 'gpu' if world == 1 else 'gpus'))
        print('     batch: {}, precision: {}'.format(batch_size,
            'unknown' if backend == 'tensorrt' else 'mixed' if mixed_precision else 'full'))
        print('Running inference...')

The network outputs are accumulated in a list.

    results = []
    profiler = Profiler(['infer', 'fw'])
    with torch.no_grad():
        for i, (data, ids, ratios) in enumerate(data_iterator):
            # Forward pass
            profiler.start('fw')
            scores, boxes, classes = model(data)
            profiler.stop('fw')

            results.append([scores, boxes, classes, ids, ratios])

            profiler.bump('infer')
            if verbose and (profiler.totals['infer'] > 60 or i == len(data_iterator) - 1):
                size = len(data_iterator.ids)
                msg  = '[{:{len}}/{}]'.format(min((i + 1) * batch_size,
                    size), size, len=len(str(size)))
                msg += ' {:.3f}s/{}-batch'.format(profiler.means['infer'], batch_size)
                msg += ' (fw: {:.3f}s)'.format(profiler.means['fw'])
                msg += ', {:.1f} im/s'.format(batch_size / profiler.means['infer'])
                print(msg, flush=True)

                profiler.reset()

torch.distributed.all_gather gathers tensors from the whole group into a list.

    # Gather results from all devices
    if verbose: print('Gathering results...')
    results = [torch.cat(r, dim=0) for r in zip(*results)]
    if world > 1:
        for r, result in enumerate(results):
            all_result = [torch.ones_like(result, device=result.device) for _ in range(world)]
            torch.distributed.all_gather(list(all_result), result)
            results[r] = torch.cat(all_result, dim=0)

The master process copies the results back to the CPU, then collects the detections and evaluates them.
pycocotools.coco.COCO.getCatIds returns the list of category ids.

    if is_master:
        # Copy buffers back to host
        results = [r.cpu() for r in results]

        # Collect detections
        detections = []
        processed_ids = set()
        for scores, boxes, classes, image_id, ratios in zip(*results):
            image_id = image_id.item()
            if image_id in processed_ids:
                continue
            processed_ids.add(image_id)

            keep = (scores > 0).nonzero()
            scores = scores[keep].view(-1)
            boxes = boxes[keep, :].view(-1, 4) / ratios
            classes = classes[keep].view(-1).int()

            for score, box, cat in zip(scores, boxes, classes):
                x1, y1, x2, y2 = box.data.tolist()
                cat = cat.item()
                if 'annotations' in data_iterator.coco.dataset:
                    cat = data_iterator.coco.getCatIds()[cat]
                detections.append({
                    'image_id': image_id,
                    'score': score.item(),
                    'bbox': [x1, y1, x2 - x1 + 1, y2 - y1 + 1],
                    'category_id': cat
                })

COCO.loadRes loads a result file and returns a result API object.
A COCOeval object is instantiated to evaluate the results.

        if detections:
            # Save detections
            if detections_file and verbose: print('Writing {}...'.format(detections_file))
            detections = { 'annotations': detections }
            detections['images'] = data_iterator.coco.dataset['images']
            if 'categories' in data_iterator.coco.dataset:
                detections['categories'] = [data_iterator.coco.dataset['categories']]
            if detections_file:
                json.dump(detections, open(detections_file, 'w'), indent=4)

            # Evaluate model on dataset
            if 'annotations' in data_iterator.coco.dataset:
                if verbose: print('Evaluating model...')
                with redirect_stdout(None):
                    coco_pred = data_iterator.coco.loadRes(detections['annotations'])
                    coco_eval = COCOeval(data_iterator.coco, coco_pred, 'bbox')
                    coco_eval.evaluate()
                    coco_eval.accumulate()
                coco_eval.summarize()
        else:
            print('No detections!')

Model

getattr returns the value of the named attribute of object. name must be a string. If the string is the name of one of the object's attributes, the result is the value of that attribute. For example, getattr(x, 'foobar') is equivalent to x.foobar. If the named attribute does not exist, default is returned if provided, otherwise AttributeError is raised.
torch.nn.ModuleDict holds submodules in a dictionary. ModuleDict can be indexed like a regular Python dictionary, but the modules it contains are properly registered and are visible to all Module methods.

ModuleDict is an ordered dictionary that respects

  • the order of insertion, and
  • in update(), the order of the merged OrderedDict or another ModuleDict (the argument to update()).

Note that update() with other unordered mapping types (e.g., Python's plain dict) does not preserve the order of the merged mapping.

    'RetinaNet - https://arxiv.org/abs/1708.02002'

    def __init__(self, backbones='ResNet50FPN', classes=80, config={}):
        super().__init__()

        if not isinstance(backbones, list):
            backbones = [backbones]

        self.backbones = nn.ModuleDict({b: getattr(backbones_mod, b)() for b in backbones})
        self.name = 'RetinaNet'
        self.exporting = False

        self.ratios = [1.0, 2.0, 0.5]
        self.scales = [4 * 2**(i/3) for i in range(3)]
        self.anchors = {}
        self.classes = classes

        self.threshold  = config.get('threshold', 0.05)
        self.top_n      = config.get('top_n', 1000)
        self.nms        = config.get('nms', 0.5)
        self.detections = config.get('detections', 100)

        self.stride = max([b.stride for _, b in self.backbones.items()])

The classification and box regression heads each consist of four 3×3 convolutions with ReLU followed by an output convolution.
FocalLoss is built on torch.nn.functional.binary_cross_entropy_with_logits.
SmoothL1Loss is the project's own implementation.

        # classification and box regression heads
        def make_head(out_size):
            layers = []
            for _ in range(4):
                layers += [nn.Conv2d(256, 256, 3, padding=1), nn.ReLU()]
            layers += [nn.Conv2d(256, out_size, 3, padding=1)]
            return nn.Sequential(*layers)

        anchors = len(self.ratios) * len(self.scales)
        self.cls_head = make_head(classes * anchors)
        self.box_head = make_head(4 * anchors)

        self.cls_criterion = FocalLoss()
        self.box_criterion = SmoothL1Loss(beta=0.11)
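
FocalLoss and SmoothL1Loss are defined elsewhere in the repository and are not reproduced in this walkthrough. Purely as an illustration, a focal loss built on binary_cross_entropy_with_logits in the spirit of the paper (alpha=0.25, gamma=2) might look like the sketch below; it is not the project's exact implementation:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FocalLossSketch(nn.Module):
        'Focal Loss - https://arxiv.org/abs/1708.02002 (illustrative sketch)'

        def __init__(self, alpha=0.25, gamma=2.0):
            super().__init__()
            self.alpha = alpha
            self.gamma = gamma

        def forward(self, pred_logits, target):
            # Per-element BCE; the caller masks and sums the result
            ce = F.binary_cross_entropy_with_logits(pred_logits, target, reduction='none')
            pred = pred_logits.sigmoid()
            pt = torch.where(target == 1, pred, 1 - pred)               # prob. of the true class
            alpha = target * self.alpha + (1 - target) * (1 - self.alpha)
            return alpha * (1 - pt) ** self.gamma * ce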

__repr__

object.__repr__ is called by the repr() built-in function to compute the "official" string representation of an object. If at all possible, this should look like a valid Python expression that could be used to recreate an object with the same value (given an appropriate environment). If that is not possible, a string of the form <...some useful description...> should be returned. The return value must be a string object. If a class defines __repr__() but not __str__(), then __repr__() is also used when an "informal" string representation of instances of that class is required. This is typically used for debugging, so it is important that the representation be information-rich and unambiguous.

        return '\n'.join([
            '     model: {}'.format(self.name),
            '  backbone: {}'.format(', '.join([k for k, _ in self.backbones.items()])),
            '   classes: {}, anchors: {}'.format(self.classes, len(self.ratios) * len(self.scales)),
        ])

initialize

If a pretrained checkpoint is given, load it while ignoring cls_head.8 (the classification output layer).

        if pre_trained:
            # Initialize using weights from pre-trained model
            if not os.path.isfile(pre_trained):
                raise ValueError('No checkpoint {}'.format(pre_trained))

            print('Fine-tuning weights from {}...'.format(os.path.basename(pre_trained)))
            state_dict = self.state_dict()
            chk = torch.load(pre_trained, map_location=lambda storage, loc: storage)
            ignored = ['cls_head.8.bias', 'cls_head.8.weight']
            weights = { k: v for k, v in chk['state_dict'].items() if k not in ignored }
            state_dict.update(weights)
            self.load_state_dict(state_dict)

            del chk, weights
            torch.cuda.empty_cache()

Otherwise, call each backbone's initialization method, then initialize the classification and box regression heads.

        else:
            # Initialize backbone(s)
            for _, backbone in self.backbones.items():
                backbone.initialize()

            # Initialize heads
            def initialize_layer(layer):
                if isinstance(layer, nn.Conv2d):
                    nn.init.normal_(layer.weight, std=0.01)
                    if layer.bias is not None:
                        nn.init.constant_(layer.bias, val=0)
            self.cls_head.apply(initialize_layer)
            self.box_head.apply(initialize_layer)

The last layer of the classification head gets a special prior initialization.

        # Initialize class head prior
        def initialize_prior(layer):
            pi = 0.01
            b = - math.log((1 - pi) / pi)
            nn.init.constant_(layer.bias, b)
            nn.init.normal_(layer.weight, std=0.01)
        self.cls_head[-1].apply(initialize_prior)
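
A quick check of what this prior does (a worked example, not from the repository):

    import math

    pi = 0.01
    b = -math.log((1 - pi) / pi)
    print(b)                          # ~ -4.595
    print(1 / (1 + math.exp(-b)))     # sigmoid(b) ~ 0.01

Every anchor therefore starts with roughly 1% foreground probability per class, which keeps the focal loss from being swamped by the many easy negatives at the start of training.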

forward

Flowchart: forward → x → backbones → cls_head / box_head → if training: _compute_loss → End; otherwise cls_head.sigmoid → if exporting: return heads; otherwise generate_anchors → decode → nms

During training, the input x also carries the preprocessed annotations.
The backbones extract features, which are then fed to the classification and regression heads in parallel.
In training mode, _compute_loss is called to compute and return the losses.

        if self.training: x, targets = x

        # Backbones forward pass
        features = []
        for _, backbone in self.backbones.items():
            features.extend(backbone(x))

        # Heads forward pass
        cls_heads = [self.cls_head(t) for t in features]
        box_heads = [self.box_head(t) for t in features]

        if self.training:
            return self._compute_loss(x, cls_heads, box_heads, targets.float())

When exporting, the sigmoid-activated classification outputs and the regression outputs are returned directly.

        cls_heads = [cls_head.sigmoid() for cls_head in cls_heads]

        if self.exporting:
            self.strides = [x.shape[-1] // cls_head.shape[-1] for cls_head in cls_heads]
            return cls_heads, box_heads

Otherwise, the inference post-processing runs.
generate_anchors generates anchor coordinates from scales/ratios.
decode filters the results by score and decodes the bounding boxes.
nms filters the results further.

        # Inference post-processing
        decoded = []
        for cls_head, box_head in zip(cls_heads, box_heads):
            # Generate level's anchors
            stride = x.shape[-1] // cls_head.shape[-1]
            if stride not in self.anchors:
                self.anchors[stride] = generate_anchors(stride, self.ratios, self.scales)

            # Decode and filter boxes
            decoded.append(decode(cls_head, box_head, stride,
                self.threshold, self.top_n, self.anchors[stride]))

        # Perform non-maximum suppression
        decoded = [torch.cat(tensors, 1) for tensors in zip(*decoded)]
        return nms(*decoded, self.nms, self.detections)
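
generate_anchors itself lives in the repository's box utilities and is not reproduced in this walkthrough. A simplified, illustrative version of such a helper (base size equal to the stride, the ratios/scales set in __init__; not the project's exact implementation) could look like:

    import torch

    def generate_anchors_sketch(stride, ratios, scales):
        'Return a (len(ratios) * len(scales), 4) set of boxes centered on one stride cell'
        anchors = []
        for scale in scales:
            for ratio in ratios:
                # keep the area at (stride * scale)^2 and set h / w = ratio
                w = stride * scale / ratio ** 0.5
                h = stride * scale * ratio ** 0.5
                cx = cy = 0.5 * (stride - 1)
                anchors.append([cx - 0.5 * (w - 1), cy - 0.5 * (h - 1),
                                cx + 0.5 * (w - 1), cy + 0.5 * (h - 1)])
        return torch.tensor(anchors)

    # 9 anchors per location for the defaults ratios=[1.0, 2.0, 0.5] and
    # scales=[4 * 2**(i / 3) for i in range(3)]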

_extract_targets

Call graph: _extract_targets → generate_anchors, snap_to_anchors

generate_anchors generates anchor coordinates from scales/ratios.
snap_to_anchors builds the target tensors relative to the anchors.

        cls_target, box_target, depth = [], [], []
        for target in targets:
            target = target[target[:, -1] > -1]
            if stride not in self.anchors:
                self.anchors[stride] = generate_anchors(stride, self.ratios, self.scales)
            snapped = snap_to_anchors(
                target, [s * stride for s in size[::-1]], stride,
                self.anchors[stride].to(targets.device), self.classes, targets.device)
            for l, s in zip((cls_target, box_target, depth), snapped): l.append(s)
        return torch.stack(cls_target), torch.stack(box_target), torch.stack(depth)

_compute_loss

Call graph: _compute_loss → _extract_targets, self.cls_criterion, self.box_criterion

_extract_targets produces the classification and regression targets.
depth is the sample-selection mask.

        cls_losses, box_losses, fg_targets = [], [], []
        for cls_head, box_head in zip(cls_heads, box_heads):
            size = cls_head.shape[-2:]
            stride = x.shape[-1] / cls_head.shape[-1]

            cls_target, box_target, depth = self._extract_targets(targets, stride, size)
            fg_targets.append((depth > 0).sum().float().clamp(min=1))

            cls_head = cls_head.view_as(cls_target).float()
            cls_mask = (depth >= 0).expand_as(cls_target).float()
            cls_loss = self.cls_criterion(cls_head, cls_target)
            cls_loss = cls_mask * cls_loss
            cls_losses.append(cls_loss.sum())

            box_head = box_head.view_as(box_target).float()
            box_mask = (depth > 0).expand_as(box_target).float()
            box_loss = self.box_criterion(box_head, box_target)
            box_loss = box_mask * box_loss
            box_losses.append(box_loss.sum())

        fg_targets = torch.stack(fg_targets).sum()
        cls_loss = torch.stack(cls_losses).sum() / fg_targets
        box_loss = torch.stack(box_losses).sum() / fg_targets
        return cls_loss, box_loss

save

Save the backbone names, the number of classes, and the model weights.
If present, also save iteration, optimizer and scheduler.

        checkpoint = {
            'backbone': [k for k, _ in self.backbones.items()],
            'classes': self.classes,
            'state_dict': self.state_dict()
        }

        for key in ('iteration', 'optimizer', 'scheduler'):
            if key in state:
                checkpoint[key] = state[key]

        torch.save(checkpoint, state['path'])

load

Recreate the model from its backbone(s), then load the parameters.
torch.cuda.empty_cache releases all unoccupied cached memory currently held by the caching allocator, so that it can be used by other GPU applications and becomes visible in nvidia-smi.

    @classmethod
    def load(cls, filename):
        if not os.path.isfile(filename):
            raise ValueError('No checkpoint {}'.format(filename))

        checkpoint = torch.load(filename, map_location=lambda storage, loc: storage)
        # Recreate model from checkpoint instead of from individual backbones
        model = cls(backbones=checkpoint['backbone'], classes=checkpoint['classes'])
        model.load_state_dict(checkpoint['state_dict'])

        state = {}
        for key in ('iteration', 'optimizer', 'scheduler'):
            if key in checkpoint:
                state[key] = checkpoint[key]

        del checkpoint
        torch.cuda.empty_cache()

        return model, state

export

If the ONNX opset version is lower than 9, an upsample_nearest2d symbolic function is defined.

        import torch.onnx.symbolic

        if opset is not None and opset < 9:
            # Override Upsample's ONNX export from old opset if required (not needed for TRT 5.1+)
            @torch.onnx.symbolic.parse_args('v', 'is')
            def upsample_nearest2d(g, input, output_size):
                height_scale = float(output_size[-2]) / input.type().sizes()[-2]
                width_scale = float(output_size[-1]) / input.type().sizes()[-1]
                return g.op("Upsample", input,
                    scales_f=(1, 1, height_scale, width_scale),
                    mode_s="nearest")
            torch.onnx.symbolic.upsample_nearest2d = upsample_nearest2d

io.BytesIO is a stream implementation using an in-memory bytes buffer. It inherits from BufferedIOBase. The buffer is discarded when the close() method is called. The optional argument initial_bytes is a bytes-like object that contains the initial data.
torch.onnx.export exports the model.
io.BytesIO.getvalue returns bytes containing the entire contents of the buffer.

        # Export to ONNX
        print('Exporting to ONNX...')
        self.exporting = True
        onnx_bytes = io.BytesIO()
        zero_input = torch.zeros([1, 3, *size]).cuda()
        extra_args = { 'opset_version': opset } if opset else {}
        torch.onnx.export(self.cuda(), zero_input, onnx_bytes, **extra_args)
        self.exporting = False

        if onnx_only:
            return onnx_bytes.getvalue()

generate_anchors generates anchor coordinates from scales/ratios.
An Engine object is returned.

        # Build TensorRT engine
        model_name = '_'.join([k for k, _ in self.backbones.items()])
        anchors = [generate_anchors(stride, self.ratios, self.scales).view(-1).tolist() 
            for stride in self.strides]
        return Engine(onnx_bytes.getvalue(), len(onnx_bytes.getvalue()), batch, precision,
            self.threshold, self.top_n, anchors, self.nms, self.detections, calibration_files, model_name, calibration_table, verbose)

DaliDataIterator

Call graph: DaliDataIterator → COCOPipeline

DaliDataIterator exposes an interface similar to torch.utils.data.DataLoader, but uses DALI for data-parallel loading.
The batch_size argument it receives is the global batch size.

    'Data loader for data parallel using Dali'

    def __init__(self, path, resize, max_size, batch_size, stride, world, annotations, training=False):
        self.training = training
        self.resize = resize
        self.max_size = max_size
        self.stride = stride
        self.batch_size = batch_size // world
        self.mean = [255.*x for x in [0.485, 0.456, 0.406]]
        self.std = [255.*x for x in [0.229, 0.224, 0.225]]
        self.world = world
        self.path = path

contextlib.redirect_stdout is a context manager for temporarily redirecting sys.stdout to another file or file-like object. This tool adds flexibility to existing functions or classes whose output is hardwired to stdout.
A COCO instance is created.
COCOPipeline wraps nvidia.dali.pipeline.Pipeline.

nvidia.dali.pipeline.Pipeline.build builds the pipeline. The pipeline needs to be built in order to run it standalone; framework-specific plugins handle this step automatically.

        # Setup COCO
        with redirect_stdout(None):
            self.coco = COCO(annotations)
        self.ids = list(self.coco.imgs.keys())
        if 'categories' in self.coco.dataset:
            self.categories_inv = { k: i for i, k in enumerate(self.coco.getCatIds()) }

        self.pipe = COCOPipeline(batch_size=self.batch_size, num_threads=2, 
            path=path, coco=self.coco, training=training, annotations=annotations, world=world, 
            device_id = torch.cuda.current_device(), mean=self.mean, std=self.std, resize=resize, max_size=max_size, stride=self.stride)

        self.pipe.build()

__repr__

Describes the loader.

        return '\n'.join([
            '    loader: dali',
            '    resize: {}, max: {}'.format(self.resize, self.max_size),
        ])

__len__

        return ceil(len(self.ids) // self.world / self.batch_size)
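
For example, with 5000 images, world = 2 and a global batch of 4 (so self.batch_size = 2 per device), this gives ceil(5000 // 2 / 2) = 1250 batches per device.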

__iter__

nvidia.dali.pipeline.Pipeline.run runs the pipeline and returns the result. If the pipeline was created with exec_pipelined set to True, this function will also start prefetching the next iteration, for faster execution. It should not be mixed with nvidia.dali.pipeline.Pipeline.schedule_run(), nvidia.dali.pipeline.Pipeline.share_outputs() and nvidia.dali.pipeline.Pipeline.release_outputs() in the same pipeline.
ctypes.c_void_p represents the C void * type. The value is represented as an integer; the constructor accepts an optional integer initializer.
torch.Tensor.data_ptr returns the address of the first element of the tensor.

The results returned by self.pipe.run() are all of type nvidia.dali.backend.TensorListCPU or nvidia.dali.backend.TensorListGPU.
nvidia.dali.backend.TensorListCPU.copy_to_external copies the contents of this TensorList to an external pointer (of type ctypes.c_void_p) residing in CPU memory. This function is used internally by plugins to interface with tensors from supported deep learning frameworks.

nvidia.dali.backend.TensorListGPU.as_cpu returns a TensorListCPU object that is a copy of this TensorListGPU.

            data, ratios, ids, num_detections = [], [], [], []
            dali_data, dali_boxes, dali_labels, dali_ids, dali_attrs, dali_resize_img = self.pipe.run()

            for l in range(len(dali_boxes)):
                num_detections.append(dali_boxes.at(l).shape[0])

            pyt_targets = -1 * torch.ones([len(dali_boxes), max(max(num_detections),1), 5])

            for batch in range(self.batch_size):
                id = int(dali_ids.at(batch)[0])
                
                # Convert dali tensor to pytorch
                dali_tensor = dali_data.at(batch)
                tensor_shape = dali_tensor.shape()

                datum = torch.zeros(dali_tensor.shape(), dtype=torch.float, device=torch.device('cuda'))
                c_type_pointer = ctypes.c_void_p(datum.data_ptr())
                dali_tensor.copy_to_external(c_type_pointer)

                # Calculate image resize ratio to rescale boxes
                prior_size = dali_attrs.as_cpu().at(batch)
                resized_size = dali_resize_img.at(batch).shape()
                ratio = max(resized_size) / max(prior_size)

                if self.training:
                    # Rescale boxes
                    b_arr = dali_boxes.at(batch)
                    num_dets = b_arr.shape[0]
                    if num_dets != 0:
                        pyt_bbox = torch.from_numpy(b_arr).float()

                        pyt_bbox[:,0] *= float(prior_size[1])
                        pyt_bbox[:,1] *= float(prior_size[0])
                        pyt_bbox[:,2] *= float(prior_size[1])
                        pyt_bbox[:,3] *= float(prior_size[0])
                        # (l,t,r,b) ->  (x,y,w,h) == (l,r, r-l, b-t)
                        pyt_bbox[:,2] -= pyt_bbox[:,0]
                        pyt_bbox[:,3] -= pyt_bbox[:,1]
                        pyt_targets[batch,:num_dets,:4] = pyt_bbox * ratio

                    # Arrange labels in target tensor
                    l_arr = dali_labels.at(batch)
                    if num_dets != 0:
                        pyt_label = torch.from_numpy(l_arr).float()
                        pyt_label -= 1 #Rescale labels to [0,79] instead of [1,80]
                        pyt_targets[batch,:num_dets, 4] = pyt_label.squeeze()

                ids.append(id)
                data.append(datum.unsqueeze(0))
                ratios.append(ratio)

            data = torch.cat(data, dim=0)

            if self.training:
                pyt_targets = pyt_targets.cuda(non_blocking=True)

                yield data, pyt_targets

            else:
                ids = torch.Tensor(ids).int().cuda(non_blocking=True)
                ratios = torch.Tensor(ratios).cuda(non_blocking=True)

                yield data, ids, ratios

COCOPipeline

Call graph: COCOPipeline → pipeline.Pipeline

See the DALI documentation example COCO Reader with augmentations for reference.

nvidia.dali.ops.COCOReader is a CPU operator. It reads data from a COCO dataset, which consists of a directory with images and an annotation file. For an image with m bboxes it returns the bboxes as an (m, 4) Tensor (m * [x, y, w, h] or m * [left, top, right, bottom]) and the labels as an (m, 1) Tensor (m * category_id).
nvidia.dali.ops.nvJPEGDecoderSlice is a "mixed" operator. It partially decodes JPEG images using the nvJPEG library, with a cropping window of given size and anchor. The input must be supplied as 3 tensors in a specific order:

  • encoded_data containing the encoded image data;
  • begin containing the starting pixel coordinates of the crop in (x, y) format;
  • size containing the pixel dimensions of the crop in (w, h) format.

For begin and size, the coordinates must lie in the interval [0.0, 1.0]. The decoder output is ordered as HWC.

Warning
This operator is now deprecated. Use ImageDecoderSlice instead.

nvidia.dali.ops.RandomBBoxCrop is a CPU operator. It performs a prospective crop of an image while keeping the bounding boxes and labels consistent. The inputs must be supplied as two tensors:

  • BBoxes containing bounding boxes represented as [l, t, r, b] or [x, y, w, h];
  • Labels containing the corresponding label for each bounding box.

The resulting prospective crop is provided as two tensors:

  • Begin containing the starting coordinates of the crop in (x, y) format;
  • Size containing the size of the crop in (w, h) format.

Bounding boxes are provided as an (m * 4) tensor, where each bounding box is represented as [l, t, r, b] or [x, y, w, h]. Labels whose boxes have an intersection-over-union with the crop below the threshold are discarded. Note that when allow_no_crop is False and thresholds does not contain 0, it is good to increase num_attempts, otherwise it may loop for a very long time.

nvidia.dali.ops.BbFlip is a CPU/GPU operator that performs a horizontal flip (mirror) of bounding boxes. The input is bounding box coordinates in [x, y, w, h] or [left, top, right, bottom] format. All coordinates are in the image coordinate system (i.e. 0.0-1.0).

nvidia.dali.ops.Flip is a CPU/GPU operator that flips images along the horizontal and/or vertical axes.

nvidia.dali.ops.CoinFlip is a support operator that produces a tensor filled with 0s and 1s (the results of random coin flips), usable as an argument for select ops.

nvidia.dali.ops.Uniform is a support operator that produces a tensor of uniformly distributed random numbers.

nvidia.dali.ops.Resize is a CPU/GPU operator that resizes images.

nvidia.dali.ops.Paste is a GPU operator that pastes the input image onto a larger canvas. The canvas size equals input size * ratio.

nvidia.dali.ops.CropMirrorNormalize is a CPU/GPU operator. It performs fused cropping, normalization, format conversion (NHWC to NCHW) and type casting if needed. The input images are normalized with the following formula:

    output = (input - mean) / std

Note that not providing any crop argument results in mirroring and normalization only. This operator allows sequence inputs.

        super().__init__(batch_size=batch_size, num_threads=num_threads, device_id = device_id, prefetch_queue_depth=num_threads, seed=42)

        self.path = path
        self.training = training
        self.coco = coco
        self.stride = stride
        self.iter = 0

        self.reader = ops.COCOReader(annotations_file=annotations, file_root=path, num_shards=world,shard_id=torch.cuda.current_device(), 
                                     ltrb=True, ratio=True, shuffle_after_epoch=True, save_img_ids=True)

        self.decode_train = ops.nvJPEGDecoderSlice(device="mixed", output_type=types.RGB)
        self.decode_infer = ops.nvJPEGDecoder(device="mixed", output_type=types.RGB)
        self.bbox_crop = ops.RandomBBoxCrop(device='cpu', ltrb=True, scaling=[0.3, 1.0], thresholds=[0.1,0.3,0.5,0.7,0.9])

        self.bbox_flip = ops.BbFlip(device='cpu', ltrb=True)
        self.img_flip = ops.Flip(device='gpu')
        self.coin_flip = ops.CoinFlip(probability=0.5)

        if isinstance(resize, list): resize = max(resize)
        self.rand_resize = ops.Uniform(range=[resize, float(max_size)])

        self.resize_train = ops.Resize(device='gpu', interp_type=types.DALIInterpType.INTERP_CUBIC, save_attrs=True)
        self.resize_infer = ops.Resize(device='gpu', interp_type=types.DALIInterpType.INTERP_CUBIC, resize_longer=max_size, save_attrs=True)

        padded_size = max_size + ((self.stride - max_size % self.stride) % self.stride)

        self.pad = ops.Paste(device='gpu', fill_value = 0, ratio=1.1, min_canvas_size=padded_size, paste_x=0, paste_y=0)
        self.normalize = ops.CropMirrorNormalize(device='gpu', mean=mean, std=std, crop=padded_size, crop_pos_x=0, crop_pos_y=0)

define_graph

nvidia.dali.pipeline.Pipeline.define_graph returns the list of output EdgeReferences. The user defines this function to construct the operation graph of the pipeline.
self.reader() reads samples from the dataset.
During training, the images are decoded and augmented; for inference they are only decoded and resized.

        images, bboxes, labels, img_ids = self.reader()

        if self.training:
            crop_begin, crop_size, bboxes, labels = self.bbox_crop(bboxes, labels)
            images = self.decode_train(images, crop_begin, crop_size)
            resize = self.rand_resize()
            images, attrs = self.resize_train(images, resize_longer=resize)

            flip = self.coin_flip()
            bboxes = self.bbox_flip(bboxes, horizontal=flip)
            images = self.img_flip(images, horizontal=flip)

        else:
            images = self.decode_infer(images)
            images, attrs = self.resize_infer(images)

        resized_images = images
        images = self.normalize(self.pad(images))

        return images, bboxes, labels, img_ids, attrs, resized_images

snap_to_anchors

torch.Tensor.nelement is an alias for torch.Tensor.numel.

If boxes is empty, zero-filled tensors are returned directly.

    'Snap target boxes (x, y, w, h) to anchors'

    num_anchors = anchors.size()[0] if anchors is not None else 1
    width, height = (int(size[0] / stride), int(size[1] / stride))

    if boxes.nelement() == 0:
        return (torch.zeros([num_anchors, num_classes, height, width], device=device),
            torch.zeros([num_anchors, 4, height, width], device=device),
            torch.zeros([num_anchors, 1, height, width], device=device))

The anchors are broadcast to every spatial location according to the output size.

    boxes, classes = boxes.split(4, dim=1)

    # Generate anchors
    x, y = torch.meshgrid([torch.arange(0, size[i], stride, device=device, dtype=classes.dtype) for i in range(2)])
    xyxy = torch.stack((x, y, x, y), 2).unsqueeze(0)
    anchors = anchors.view(-1, 1, 1, 4).to(dtype=classes.dtype)
    anchors = (xyxy + anchors).contiguous().view(-1, 4)

boxes are converted from [x, y, width, height] to [left, top, right, bottom], which makes the IoU computation easier.

    # Compute overlap between boxes and anchors
    boxes = torch.cat([boxes[:, :2], boxes[:, :2] + boxes[:, 2:] - 1], 1)
    xy1 = torch.max(anchors[:, None, :2], boxes[:, :2])
    xy2 = torch.min(anchors[:, None, 2:], boxes[:, 2:])
    inter = torch.prod((xy2 - xy1 + 1).clamp(0), 2)
    boxes_area = torch.prod(boxes[:, 2:] - boxes[:, :2] + 1, 1)
    anchors_area = torch.prod(anchors[:, 2:] - anchors[:, :2] + 1, 1)
    overlap = inter / (anchors_area[:, None] + boxes_area - inter)

Keep the best-matching target box for each anchor.
box2delta converts the boxes to deltas relative to the anchors.
torch.ones_like returns a tensor filled with the scalar value 1, with the same size as input; torch.ones_like(input) is equivalent to torch.ones(input.size(), dtype=input.dtype, layout=input.layout, device=input.device).

depth is the sample-selection mask.

    # Keep best box per anchor
    overlap, indices = overlap.max(1)
    box_target = box2delta(boxes[indices], anchors)
    box_target = box_target.view(num_anchors, 1, width, height, 4)
    box_target = box_target.transpose(1, 4).transpose(2, 3)
    box_target = box_target.squeeze().contiguous()

    depth = torch.ones_like(overlap) * -1
    depth[overlap < 0.4] = 0 # background
    depth[overlap >= 0.5] = classes[indices][overlap >= 0.5].squeeze() + 1 # objects
    depth = depth.view(num_anchors, width, height).transpose(1, 2).contiguous()
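
box2delta is defined next to the other box utilities and is not shown in this walkthrough. A hedged sketch of the usual center/log-size parameterization it refers to, assuming the [left, top, right, bottom] boxes produced above (illustrative only, not necessarily the repository's exact code):

    import torch

    def box2delta_sketch(boxes, anchors):
        'Convert [l, t, r, b] boxes into regression deltas relative to anchors'
        anchors_wh = anchors[:, 2:] - anchors[:, :2] + 1
        anchors_ctr = anchors[:, :2] + 0.5 * anchors_wh
        boxes_wh = boxes[:, 2:] - boxes[:, :2] + 1
        boxes_ctr = boxes[:, :2] + 0.5 * boxes_wh
        return torch.cat([
            (boxes_ctr - anchors_ctr) / anchors_wh,   # normalized center offsets
            torch.log(boxes_wh / anchors_wh)          # log width/height ratios
        ], 1)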

Generate the target classes. Each class entry is either 0 or 1.

torch.Tensor.scatter_ writes all values from the tensor src into self at the indices specified in the index tensor. For each value in src, its output index is specified by its index in src for dimension != dim and by the corresponding value in index for dimension = dim.

    # Generate target classes
    cls_target = torch.zeros((anchors.size()[0], num_classes + 1), device=device, dtype=boxes.dtype)
    if classes.nelement() == 0:
        classes = torch.LongTensor([num_classes], device=device).expand_as(indices)
    else:
        classes = classes[indices].long()
    classes = classes.view(-1, 1)
    classes[overlap < 0.4] = num_classes # background has no class
    cls_target.scatter_(1, classes, 1)
    cls_target = cls_target[:, :num_classes].view(-1, 1, width, height, num_classes)
    cls_target = cls_target.transpose(1, 4).transpose(2, 3)
    cls_target = cls_target.squeeze().contiguous()

    return (cls_target.view(num_anchors, num_classes, height, width),
        box_target.view(num_anchors, 4, height, width),
        depth.view(num_anchors, 1, height, width))

