retinanet-examples is NVIDIA's reference project for object detection, optimized for end-to-end GPU processing.
The project recommends using the PyTorch NGC docker container:
nvidia-docker run --rm --ipc=host -it nvcr.io/nvidia/pytorch:19.05-py3
Note, however, that newer releases only support GPUs with compute capability 6.0 or higher.
parse parses the command-line arguments.
load_model creates the model and loads its parameters.
Module.share_memory is not listed in the PyTorch documentation.
worker is the function that does the actual work.
torch.multiprocessing.spawn spawns nprocs processes that run fn with args. If one of the processes exits with a non-zero exit status, the remaining processes are killed and an exception is raised with the cause of termination. In the case an exception is caught in the child process, it is forwarded and its traceback is included in the exception raised in the parent process.
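A minimal sketch of the calling convention (the function and argument below are illustrative, not part of the repo): fn receives the process index as its first argument, followed by the entries of args.

import torch.multiprocessing as mp

def fn(rank, message):
    print('worker {} got: {}'.format(rank, message))

if __name__ == '__main__':
    mp.spawn(fn, args=('hello',), nprocs=2)  # runs fn(0, 'hello') and fn(1, 'hello')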
One process is created per device to run worker.
'Entry point for the retinanet command'
args = parse(args or sys.argv[1:])
model, state = load_model(args, verbose=True)
if model: model.share_memory()
world = torch.cuda.device_count()
if args.command == 'export' or world <= 1:
worker(0, args, 1, model, state)
else:
torch.multiprocessing.spawn(worker, args=(args, world, model, state), nprocs=world)
ArgumentParser.add_subparsers
Many programs split their functionality into a number of sub-commands; for example, the svn program can invoke sub-commands like svn checkout, svn update, and svn commit. Splitting up functionality this way can be a particularly good idea when a program performs several different functions that require different kinds of command-line arguments. ArgumentParser supports the creation of such sub-commands with the add_subparsers() method. The add_subparsers() method is normally called with no arguments and returns a special action object. This object has a method, add_parser(), which takes a command name and any ArgumentParser constructor arguments, and returns an ArgumentParser object that can be modified as usual.
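A minimal sketch of this pattern, using the svn-style sub-commands mentioned above (illustrative only, not part of the repo):

import argparse

parser = argparse.ArgumentParser(prog='svn')
subparsers = parser.add_subparsers(dest='command')
checkout = subparsers.add_parser('checkout')
checkout.add_argument('url')
subparsers.add_parser('update')
commit = subparsers.add_parser('commit')
commit.add_argument('-m', '--message')

args = parser.parse_args(['checkout', 'https://example.com/repo'])
print(args.command, args.url)  # checkout https://example.com/repo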
Parameter descriptions:
title: title for the sub-parser group in help output; by default "subcommands" if a description is provided, otherwise the title of the positional arguments group.
description: description for the sub-parser group in help output, by default None.
prog: usage information that will be displayed with sub-command help; by default the name of the program and any positional arguments before the subparser argument.
parser_class: class used to create sub-parser instances, by default the class of the current parser (e.g. ArgumentParser).
required: whether or not a subcommand must be provided, by default False.
dest: name of the attribute under which the sub-command name will be stored, by default None.
help: help for the sub-parser group in help output, by default None.
metavar: string presenting the available sub-commands in help; by default it is None and presents the sub-commands in the form {cmd1, cmd2, …}.
parser = argparse.ArgumentParser(description='RetinaNet Detection Utility.')
parser.add_argument('--master', metavar='address:port', type=str, help='Adress and port of the master worker', default='127.0.0.1:29500')
subparsers = parser.add_subparsers(help='sub-command', dest='command')
subparsers.required = True
devcount = max(1, torch.cuda.device_count())
Training arguments.
batch defaults to two images per device.
parser_train = subparsers.add_parser('train', help='train a network')
parser_train.add_argument('model', type=str, help='path to output model or checkpoint to resume from')
parser_train.add_argument('--annotations', metavar='path', type=str, help='path to COCO style annotations', required=True)
parser_train.add_argument('--images', metavar='path', type=str, help='path to images', default='.')
parser_train.add_argument('--backbone', action='store', type=str, nargs='+', help='backbone model (or list of)', default=['ResNet50FPN'])
parser_train.add_argument('--classes', metavar='num', type=int, help='number of classes', default=80)
parser_train.add_argument('--batch', metavar='size', type=int, help='batch size', default=2*devcount)
parser_train.add_argument('--resize', metavar='scale', type=int, help='resize to given size', default=800)
parser_train.add_argument('--max-size', metavar='max', type=int, help='maximum resizing size', default=1333)
parser_train.add_argument('--jitter', metavar='min max', type=int, nargs=2, help='jitter size within range', default=[640, 1024])
parser_train.add_argument('--iters', metavar='number', type=int, help='number of iterations to train for', default=90000)
parser_train.add_argument('--milestones', action='store', type=int, nargs='*', help='list of iteration indices where learning rate decays', default=[60000, 80000])
parser_train.add_argument('--schedule', metavar='scale', type=float, help='scale schedule (affecting iters and milestones)', default=1)
parser_train.add_argument('--full-precision', help='train in full precision', action='store_true')
parser_train.add_argument('--lr', metavar='value', help='learning rate', type=float, default=0.01)
parser_train.add_argument('--warmup', metavar='iterations', help='number of warmup iterations', type=int, default=1000)
parser_train.add_argument('--gamma', metavar='value', type=float, help='multiplicative factor of learning rate decay', default=0.1)
parser_train.add_argument('--override', help='override model', action='store_true')
parser_train.add_argument('--val-annotations', metavar='path', type=str, help='path to COCO style validation annotations')
parser_train.add_argument('--val-images', metavar='path', type=str, help='path to validation images')
parser_train.add_argument('--post-metrics', metavar='url', type=str, help='post metrics to specified url')
parser_train.add_argument('--fine-tune', metavar='path', type=str, help='fine tune a pretrained model')
parser_train.add_argument('--logdir', metavar='logdir', type=str, help='directory where to write logs')
parser_train.add_argument('--val-iters', metavar='number', type=int, help='number of iterations between each validation', default=8000)
parser_train.add_argument('--with-dali', help='use dali for data loading', action='store_true')
Inference arguments.
parser_infer = subparsers.add_parser('infer', help='run inference')
parser_infer.add_argument('model', type=str, help='path to model')
parser_infer.add_argument('--images', metavar='path', type=str, help='path to images', default='.')
parser_infer.add_argument('--annotations', metavar='annotations', type=str, help='evaluate using provided annotations')
parser_infer.add_argument('--output', metavar='file', type=str, help='save detections to specified JSON file', default='detections.json')
parser_infer.add_argument('--batch', metavar='size', type=int, help='batch size', default=2*devcount)
parser_infer.add_argument('--resize', metavar='scale', type=int, help='resize to given size', default=800)
parser_infer.add_argument('--max-size', metavar='max', type=int, help='maximum resizing size', default=1333)
parser_infer.add_argument('--with-dali', help='use dali for data loading', action='store_true')
parser_infer.add_argument('--full-precision', help='inference in full precision', action='store_true')
Model export arguments.
parser_export = subparsers.add_parser('export', help='export a model into a TensorRT engine')
parser_export.add_argument('model', type=str, help='path to model')
parser_export.add_argument('export', type=str, help='path to exported output')
parser_export.add_argument('--size', metavar='height width', type=int, nargs='+', help='input size (square) or sizes (h w) to use when generating TensorRT engine', default=[1280])
parser_export.add_argument('--batch', metavar='size', type=int, help='max batch size to use for TensorRT engine', default=2)
parser_export.add_argument('--full-precision', help='export in full instead of half precision', action='store_true')
parser_export.add_argument('--int8', help='calibrate model and export in int8 precision', action='store_true')
parser_export.add_argument('--opset', metavar='version', type=int, help='ONNX opset version')
parser_export.add_argument('--calibration-batches', metavar='size', type=int, help='number of batches to use for int8 calibration', default=10)
parser_export.add_argument('--calibration-images', metavar='path', type=str, help='path to calibration images to use for int8 calibration', default="")
parser_export.add_argument('--calibration-table', metavar='path', type=str, help='path of existing calibration table to load from, or name of new calibration table', default="")
parser_export.add_argument('--verbose', help='enable verbose logging', action='store_true')
return parser.parse_args(args)
Check that a model file was specified.
if args.command != 'train' and not os.path.isfile(args.model):
raise RuntimeError('Model file {} does not exist!'.format(args.model))
Parse the model file extension.
model = None
state = {}
_, ext = os.path.splitext(args.model)
In training mode, if no model file exists (or --override is given), a Model instance is created; Model.initialize then loads pre-trained weights or initializes the parameters.
Model.load first creates the model from the backbone(s), then loads the parameters and collects the training state variables.
if args.command == 'train' and (not os.path.exists(args.model) or args.override):
if verbose: print('Initializing model...')
model = Model(args.backbone, args.classes)
model.initialize(args.fine_tune)
if verbose: print(model)
elif ext == '.pth' or ext == '.torch':
if verbose: print('Loading model from {}...'.format(os.path.basename(args.model)))
model, state = Model.load(args.model)
if verbose: print(model)
elif args.command == 'infer' and ext in ['.engine', '.plan']:
model = None
else:
raise RuntimeError('Invalid model format "{}"!'.format(ext))
state['path'] = args.model
return model, state
The worker function carries out three kinds of tasks: training, inference, and model export.
os.environ is a process parameter: a mapping object representing the string environment. For example, environ['HOME'] is the pathname of your home directory (on some platforms), equivalent to getenv("HOME") in C. This mapping is captured the first time the os module is imported, typically during Python startup as part of processing site.py. Changes to the environment made after this time are not reflected in os.environ, except for changes made by modifying os.environ directly.
torch.cuda.set_device sets the current device.
torch.distributed.init_process_group initializes the default distributed process group, which also initializes the distributed package. There are two main ways to initialize a process group: specify store, rank, and world_size explicitly; or specify init_method (a URL string) indicating where/how to discover peers, optionally together with rank and world_size, or with all required parameters encoded in the URL so they can be omitted. If neither is specified, init_method is assumed to be "env://".
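A minimal single-process sketch of the "env://" style that worker() uses below (the address and port values are placeholders):

import os
import torch.distributed as dist

os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
os.environ.setdefault('MASTER_PORT', '29500')
os.environ.setdefault('RANK', '0')
os.environ.setdefault('WORLD_SIZE', '1')
dist.init_process_group(backend='gloo', init_method='env://')  # worker() uses 'nccl' on GPUs
dist.destroy_process_group()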
'Per-device distributed worker'
if torch.cuda.is_available():
os.environ.update({
'MASTER_PORT': args.master.split(':')[-1],
'MASTER_ADDR': ':'.join(args.master.split(':')[:-1]),
'WORLD_SIZE': str(world),
'RANK': str(rank),
'CUDA_DEVICE': str(rank)
})
torch.cuda.set_device(rank)
torch.distributed.init_process_group(backend='nccl', init_method='env://')
if args.batch % world != 0:
raise RuntimeError('Batch size should be a multiple of the number of GPUs')
The train function takes a particularly large number of arguments.
if args.command == 'train':
train.train(model, state, args.images, args.annotations,
args.val_images or args.images, args.val_annotations, args.resize, args.max_size, args.jitter,
args.batch, int(args.iters * args.schedule), args.val_iters, not args.full_precision, args.lr,
args.warmup, [int(m * args.schedule) for m in args.milestones], args.gamma,
is_master=(rank == 0), world=world, use_dali=args.with_dali,
metrics_url=args.post_metrics, logdir=args.logdir, verbose=(rank == 0))
At inference time, Engine::_load is called to load the model. The Engine class wraps a TensorRT CUDA engine; the PYBIND11_MODULE macro in extensions.cpp exposes the C++ symbols to Python.
infer runs the inference.
elif args.command == 'infer':
if model is None:
if rank == 0: print('Loading CUDA engine from {}...'.format(os.path.basename(args.model)))
model = Engine.load(args.model)
infer.infer(model, args.images, args.output, args.resize, args.max_size, args.batch,
annotations=args.annotations, mixed_precision=not args.full_precision,
is_master=(rank == 0), world=world, use_dali=args.with_dali, verbose=(rank == 0))
The calibration image list is collected from the given path and shuffled.
Model.export performs the calibration and export.
elif args.command == 'export':
onnx_only = args.export.split('.')[-1] == 'onnx'
input_size = args.size * 2 if len(args.size) == 1 else args.size
calibration_files = []
if args.int8:
# Get list of images to use for calibration
if os.path.isdir(args.calibration_images):
import glob
file_extensions = ['.jpg', '.JPG', '.jpeg', '.JPEG', '.png', '.PNG']
for ex in file_extensions:
calibration_files += glob.glob("{}/*{}".format(args.calibration_images, ex), recursive=True)
# Only need enough images for specified num of calibration batches
if len(calibration_files) >= args.calibration_batches * args.batch:
calibration_files = calibration_files[:(args.calibration_batches * args.batch)]
else:
print('Only found enough images for {} batches. Continuing anyway...'.format(len(calibration_files) // args.batch))
random.shuffle(calibration_files)
precision = "FP32"
if args.int8:
precision = "INT8"
elif not args.full_precision:
precision = "FP16"
exported = model.export(input_size, args.batch, precision, calibration_files, args.calibration_table, args.verbose, onnx_only=onnx_only, opset=args.opset)
if onnx_only:
with open(args.export, 'wb') as out:
out.write(exported)
else:
exported.save(args.export)
A reference to the original model is kept in nn_model.
convert_fixedbn_model replaces the torch.nn.BatchNorm2d modules in the model with FixedBatchNorm2d.
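The implementation of convert_fixedbn_model is not reproduced in this walkthrough; a plausible sketch of the idea, swapping each (affine) BatchNorm2d for a module that applies its statistics as a frozen affine transform, might look like this (FrozenBatchNorm2d and convert_frozen_bn below are hypothetical names, not the repo's):

import torch.nn as nn

class FrozenBatchNorm2d(nn.Module):
    'Apply stored BatchNorm statistics as a fixed affine transform (no running-stat updates)'
    def __init__(self, bn):
        super().__init__()
        self.register_buffer('weight', bn.weight.data.clone())        # assumes affine=True
        self.register_buffer('bias', bn.bias.data.clone())
        self.register_buffer('running_mean', bn.running_mean.clone())
        self.register_buffer('running_var', bn.running_var.clone())
        self.eps = bn.eps

    def forward(self, x):
        scale = self.weight * (self.running_var + self.eps).rsqrt()
        shift = self.bias - self.running_mean * scale
        return x * scale.view(1, -1, 1, 1) + shift.view(1, -1, 1, 1)

def convert_frozen_bn(module):
    'Recursively replace nn.BatchNorm2d children with FrozenBatchNorm2d'
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            setattr(module, name, FrozenBatchNorm2d(child))
        else:
            convert_frozen_bn(child)
    return module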
'Train the model on the given dataset'
# Prepare model
nn_model = model
stride = model.stride
model = convert_fixedbn_model(model)
if torch.cuda.is_available():
model = model.cuda()
apex.amp.initialize initializes the model, the optimizer, and the Torch tensor and functional namespaces according to the chosen opt_level and the overridden properties, if any. It should be called after the model and optimizer have been constructed, but before the model is sent to the torch.nn.parallel.DistributedDataParallel wrapper (see Distributed training in the ImageNet example). Currently, apex.amp.initialize should only be called once, although it can handle any number of models and optimizers (see the corresponding Advanced Amp Usage topic); contact NVIDIA if you believe your use case requires calling it multiple times. Any property keyword argument that is not None is interpreted as a manual override. To avoid having to rewrite anything else in the script, name the returned model/optimizer to replace the passed model/optimizer.
apex.parallel.DistributedDataParallel is a module wrapper that enables easy multiprocess distributed data-parallel training, similar to torch.nn.parallel.DistributedDataParallel. Parameters are broadcast across participating processes on initialization, and gradients are all-reduced and averaged across processes during backward().
DistributedDataParallel is optimized for use with NCCL. It overlaps communication with computation during backward() and buckets smaller gradient transfers to reduce the total number of transfers required, improving performance. This is similar to the optimizations done in NVCaffe.
# Setup optimizer and schedule
optimizer = SGD(model.parameters(), lr=lr, weight_decay=0.0001, momentum=0.9)
model, optimizer = amp.initialize(model, optimizer,
opt_level = 'O2' if mixed_precision else 'O0',
keep_batchnorm_fp32 = True,
loss_scale = 128.0,
verbosity = is_master)
if world > 1:
model = DistributedDataParallel(model)
model.train()
if 'optimizer' in state:
optimizer.load_state_dict(state['optimizer'])
def schedule(train_iter):
if warmup and train_iter <= warmup:
return 0.9 * train_iter / warmup + 0.1
return gamma ** len([m for m in milestones if m <= train_iter])
scheduler = LambdaLR(optimizer, schedule)
DaliDataIterator loads data using NVIDIA DALI.
# Prepare dataset
if verbose: print('Preparing dataset...')
data_iterator = (DaliDataIterator if use_dali else DataIterator)(
path, jitter, max_size, batch_size, stride,
world, annotations, training=True)
if verbose: print(data_iterator)
if verbose:
print(' device: {} {}'.format(
world, 'cpu' if not torch.cuda.is_available() else 'gpu' if world == 1 else 'gpus'))
print(' batch: {}, precision: {}'.format(batch_size, 'mixed' if mixed_precision else 'full'))
print('Training model for {} iterations...'.format(iterations))
# Create TensorBoard writer
if logdir is not None:
from tensorboardX import SummaryWriter
if is_master and verbose:
print('Writing TensorBoard logs to: {}'.format(logdir))
writer = SummaryWriter(logdir=logdir)
Profiler records timings for analysis.
data is deleted manually once the forward pass has finished.
apex.amp.scale_loss: on entering the context manager, it creates scaled_loss = (loss.float()) * current_loss_scale and yields scaled_loss, so that the user can call scaled_loss.backward():
with amp.scale_loss(loss, optimizer) as scaled_loss:
scaled_loss.backward()
On context-manager exit (if delay_unscale=False), the gradients are checked for infs/NaNs and unscaled, so that optimizer.step() can be called.
profiler = Profiler(['train', 'fw', 'bw'])
iteration = state.get('iteration', 0)
while iteration < iterations:
cls_losses, box_losses = [], []
for i, (data, target) in enumerate(data_iterator):
scheduler.step(iteration)
# Forward pass
profiler.start('fw')
optimizer.zero_grad()
cls_loss, box_loss = model([data, target])
del data
profiler.stop('fw')
# Backward pass
profiler.start('bw')
with amp.scale_loss(cls_loss + box_loss, optimizer) as scaled_loss:
scaled_loss.backward()
optimizer.step()
# Reduce all losses
cls_loss, box_loss = cls_loss.mean().clone(), box_loss.mean().clone()
if world > 1:
torch.distributed.all_reduce(cls_loss)
torch.distributed.all_reduce(box_loss)
cls_loss /= world
box_loss /= world
if is_master:
cls_losses.append(cls_loss)
box_losses.append(box_loss)
if is_master and not isfinite(cls_loss + box_loss):
raise RuntimeError('Loss is diverging!\n{}'.format(
'Try lowering the learning rate.'))
del cls_loss, box_loss
profiler.stop('bw')
iteration += 1
The master node prints and logs the information.
post_metrics posts the metrics using requests.
ignore_sigint ignores the interrupt signal (used while the checkpoint is saved).
infer runs inference with the model.
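ignore_sigint is not reproduced in this walkthrough; a plausible sketch, assuming it simply masks SIGINT so that a checkpoint write cannot be interrupted halfway, is:

import signal
from contextlib import contextmanager

@contextmanager
def ignore_sigint():
    'Temporarily ignore Ctrl-C, restoring the previous handler afterwards'
    previous = signal.signal(signal.SIGINT, signal.SIG_IGN)
    try:
        yield
    finally:
        signal.signal(signal.SIGINT, previous)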
profiler.bump('train')
if is_master and (profiler.totals['train'] > 60 or iteration == iterations):
focal_loss = torch.stack(list(cls_losses)).mean().item()
box_loss = torch.stack(list(box_losses)).mean().item()
learning_rate = optimizer.param_groups[0]['lr']
if verbose:
msg = '[{:{len}}/{}]'.format(iteration, iterations, len=len(str(iterations)))
msg += ' focal loss: {:.3f}'.format(focal_loss)
msg += ', box loss: {:.3f}'.format(box_loss)
msg += ', {:.3f}s/{}-batch'.format(profiler.means['train'], batch_size)
msg += ' (fw: {:.3f}s, bw: {:.3f}s)'.format(profiler.means['fw'], profiler.means['bw'])
msg += ', {:.1f} im/s'.format(batch_size / profiler.means['train'])
msg += ', lr: {:.2g}'.format(learning_rate)
print(msg, flush=True)
if logdir is not None:
writer.add_scalar('focal_loss', focal_loss, iteration)
writer.add_scalar('box_loss', box_loss, iteration)
writer.add_scalar('learning_rate', learning_rate, iteration)
del box_loss, focal_loss
if metrics_url:
post_metrics(metrics_url, {
'focal loss': mean(cls_losses),
'box loss': mean(box_losses),
'im_s': batch_size / profiler.means['train'],
'lr': learning_rate
})
# Save model weights
state.update({
'iteration': iteration,
'optimizer': optimizer.state_dict(),
'scheduler': scheduler.state_dict(),
})
with ignore_sigint():
nn_model.save(state)
profiler.reset()
del cls_losses[:], box_losses[:]
if val_annotations and (iteration == iterations or iteration % val_iterations == 0):
infer(model, val_path, None, resize, max_size, batch_size, annotations=val_annotations,
mixed_precision=mixed_precision, is_master=is_master, world=world, use_dali=use_dali, is_validation=True, verbose=False)
model.train()
if iteration == iterations:
break
if logdir is not None:
writer.close()
The execution backend is chosen according to the model type.
'Run inference on images from path'
backend = 'pytorch' if isinstance(model, Model) or isinstance(model, DDP) else 'tensorrt'
stride = model.module.stride if isinstance(model, DDP) else model.stride
tempfile.mktemp is deprecated. tempfile.mkstemp creates a temporary file in the most secure manner possible: there are no race conditions in the file's creation, assuming the platform properly implements the os.O_EXCL flag for os.open(). The file is readable and writable only by the creating user ID; if the platform uses permission bits to indicate whether a file is executable, the file is executable by no one; and the file descriptor is not inherited by child processes.
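Since mktemp is deprecated, an equivalent of the snippet below written with mkstemp (a sketch, not the repo's code) would be:

import json
import os
import tempfile

images = []  # the image list built as in the code below
fd, annotations = tempfile.mkstemp(suffix='.json')  # returns an open descriptor plus the path
with os.fdopen(fd, 'w') as f:
    json.dump({'images': images}, f)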
# Create annotations if none was provided
if not annotations:
annotations = tempfile.mktemp('.json')
images = [{ 'id': i, 'file_name': f} for i, f in enumerate(os.listdir(path))]
json.dump({ 'images': images }, open(annotations, 'w'))
# TensorRT only supports fixed input sizes, so override input size accordingly
if backend == 'tensorrt': max_size = max(model.input_size)
The data is loaded with either DaliDataIterator or DataIterator.
# Prepare dataset
if verbose: print('Preparing dataset...')
data_iterator = (DaliDataIterator if use_dali else DataIterator)(
path, resize, max_size, batch_size, stride,
world, annotations, training=False)
if verbose: print(data_iterator)
Determine whether this is standalone inference or validation during training.
# Prepare model
if backend == 'pytorch':
# If we are doing validation during training,
# no need to register model with AMP again
if not is_validation:
if torch.cuda.is_available(): model = model.cuda()
model = amp.initialize(model, None,
opt_level = 'O2' if mixed_precision else 'O0',
keep_batchnorm_fp32 = True,
verbosity = 0)
model.eval()
if verbose:
print(' backend: {}'.format(backend))
print(' device: {} {}'.format(
world, 'cpu' if not torch.cuda.is_available() else 'gpu' if world == 1 else 'gpus'))
print(' batch: {}, precision: {}'.format(batch_size,
'unknown' if backend == 'tensorrt' else 'mixed' if mixed_precision else 'full'))
print('Running inference...')
The network outputs are accumulated in a list.
results = []
profiler = Profiler(['infer', 'fw'])
with torch.no_grad():
for i, (data, ids, ratios) in enumerate(data_iterator):
# Forward pass
profiler.start('fw')
scores, boxes, classes = model(data)
profiler.stop('fw')
results.append([scores, boxes, classes, ids, ratios])
profiler.bump('infer')
if verbose and (profiler.totals['infer'] > 60 or i == len(data_iterator) - 1):
size = len(data_iterator.ids)
msg = '[{:{len}}/{}]'.format(min((i + 1) * batch_size,
size), size, len=len(str(size)))
msg += ' {:.3f}s/{}-batch'.format(profiler.means['infer'], batch_size)
msg += ' (fw: {:.3f}s)'.format(profiler.means['fw'])
msg += ', {:.1f} im/s'.format(batch_size / profiler.means['infer'])
print(msg, flush=True)
profiler.reset()
torch.distributed.all_gather gathers tensors from the whole group into a list.
# Gather results from all devices
if verbose: print('Gathering results...')
results = [torch.cat(r, dim=0) for r in zip(*results)]
if world > 1:
for r, result in enumerate(results):
all_result = [torch.ones_like(result, device=result.device) for _ in range(world)]
torch.distributed.all_gather(list(all_result), result)
results[r] = torch.cat(all_result, dim=0)
The master node copies the results back to the CPU, then collects them and runs the evaluation.
pycocotools.coco.COCO.getCatIds returns the list of category ids.
if is_master:
# Copy buffers back to host
results = [r.cpu() for r in results]
# Collect detections
detections = []
processed_ids = set()
for scores, boxes, classes, image_id, ratios in zip(*results):
image_id = image_id.item()
if image_id in processed_ids:
continue
processed_ids.add(image_id)
keep = (scores > 0).nonzero()
scores = scores[keep].view(-1)
boxes = boxes[keep, :].view(-1, 4) / ratios
classes = classes[keep].view(-1).int()
for score, box, cat in zip(scores, boxes, classes):
x1, y1, x2, y2 = box.data.tolist()
cat = cat.item()
if 'annotations' in data_iterator.coco.dataset:
cat = data_iterator.coco.getCatIds()[cat]
detections.append({
'image_id': image_id,
'score': score.item(),
'bbox': [x1, y1, x2 - x1 + 1, y2 - y1 + 1],
'category_id': cat
})
COCO.loadRes loads a result file and returns a result API object.
A COCOeval object is instantiated to evaluate the results.
if detections:
# Save detections
if detections_file and verbose: print('Writing {}...'.format(detections_file))
detections = { 'annotations': detections }
detections['images'] = data_iterator.coco.dataset['images']
if 'categories' in data_iterator.coco.dataset:
detections['categories'] = [data_iterator.coco.dataset['categories']]
if detections_file:
json.dump(detections, open(detections_file, 'w'), indent=4)
# Evaluate model on dataset
if 'annotations' in data_iterator.coco.dataset:
if verbose: print('Evaluating model...')
with redirect_stdout(None):
coco_pred = data_iterator.coco.loadRes(detections['annotations'])
coco_eval = COCOeval(data_iterator.coco, coco_pred, 'bbox')
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()
else:
print('No detections!')
getattr returns the value of the named attribute of object. name must be a string. If the string is the name of one of the object's attributes, the result is the value of that attribute; for example, getattr(x, 'foobar') is equivalent to x.foobar. If the named attribute does not exist, default is returned if provided, otherwise AttributeError is raised.
torch.nn.ModuleDict holds submodules in a dictionary. ModuleDict can be indexed like a regular Python dictionary, but the modules it contains are properly registered and visible to all Module methods. ModuleDict is an ordered dictionary that respects the order of the OrderedDict or of another ModuleDict passed to update(). Note that update() with other unordered mapping types (e.g. Python's plain dict) does not preserve the order of the merged mapping.
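A tiny example of why ModuleDict matters here (a plain dict would hide the convolutions from parameters(), .cuda(), and so on):

import torch.nn as nn

class Tiny(nn.Module):
    def __init__(self):
        super().__init__()
        # registered as submodules: both convs show up in self.parameters()
        self.heads = nn.ModuleDict({'cls': nn.Conv2d(8, 4, 1), 'box': nn.Conv2d(8, 4, 1)})

    def forward(self, x):
        return self.heads['cls'](x), self.heads['box'](x)

print(sum(p.numel() for p in Tiny().parameters()))  # counts the parameters of both heads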
'RetinaNet - https://arxiv.org/abs/1708.02002'
def __init__(self, backbones='ResNet50FPN', classes=80, config={}):
super().__init__()
if not isinstance(backbones, list):
backbones = [backbones]
self.backbones = nn.ModuleDict({b: getattr(backbones_mod, b)() for b in backbones})
self.name = 'RetinaNet'
self.exporting = False
self.ratios = [1.0, 2.0, 0.5]
self.scales = [4 * 2**(i/3) for i in range(3)]
self.anchors = {}
self.classes = classes
self.threshold = config.get('threshold', 0.05)
self.top_n = config.get('top_n', 1000)
self.nms = config.get('nms', 0.5)
self.detections = config.get('detections', 100)
self.stride = max([b.stride for _, b in self.backbones.items()])
The classification and box regression heads each stack four 3×3 convolutions with ReLU, followed by an output convolution.
FocalLoss is based on torch.nn.functional.binary_cross_entropy_with_logits.
SmoothL1Loss is implemented by the project itself.
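The repo's FocalLoss is not reproduced here; a common formulation built on binary_cross_entropy_with_logits, matching the description above, might look like this (the alpha/gamma defaults are the standard paper values, assumed here; the loss is returned per element so the caller can mask and sum it):

import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLossSketch(nn.Module):
    'Per-element focal loss on logits: alpha_t * (1 - p_t)**gamma * BCE'
    def __init__(self, alpha=0.25, gamma=2.0):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, logits, target):
        ce = F.binary_cross_entropy_with_logits(logits, target, reduction='none')
        p_t = torch.exp(-ce)  # probability assigned to the true label
        alpha_t = self.alpha * target + (1 - self.alpha) * (1 - target)
        return alpha_t * (1 - p_t) ** self.gamma * ce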
# classification and box regression heads
def make_head(out_size):
layers = []
for _ in range(4):
layers += [nn.Conv2d(256, 256, 3, padding=1), nn.ReLU()]
layers += [nn.Conv2d(256, out_size, 3, padding=1)]
return nn.Sequential(*layers)
anchors = len(self.ratios) * len(self.scales)
self.cls_head = make_head(classes * anchors)
self.box_head = make_head(4 * anchors)
self.cls_criterion = FocalLoss()
self.box_criterion = SmoothL1Loss(beta=0.11)
object.__repr__ is called by the repr() built-in function to compute the "official" string representation of an object. If at all possible, this should look like a valid Python expression that could be used to recreate an object with the same value (given an appropriate environment); if that is not possible, a string of the form <...some useful description...> should be returned. The return value must be a string object. If a class defines __repr__() but not __str__(), then __repr__() is also used when an "informal" string representation of instances of that class is required. This is typically used for debugging, so it is important that the representation is information-rich and unambiguous.
return '\n'.join([
' model: {}'.format(self.name),
' backbone: {}'.format(', '.join([k for k, _ in self.backbones.items()])),
' classes: {}, anchors: {}'.format(self.classes, len(self.ratios) * len(self.scales)),
])
If a pre-trained checkpoint is given, its weights are loaded while cls_head.8 is ignored.
if pre_trained:
# Initialize using weights from pre-trained model
if not os.path.isfile(pre_trained):
raise ValueError('No checkpoint {}'.format(pre_trained))
print('Fine-tuning weights from {}...'.format(os.path.basename(pre_trained)))
state_dict = self.state_dict()
chk = torch.load(pre_trained, map_location=lambda storage, loc: storage)
ignored = ['cls_head.8.bias', 'cls_head.8.weight']
weights = { k: v for k, v in chk['state_dict'].items() if k not in ignored }
state_dict.update(weights)
self.load_state_dict(state_dict)
del chk, weights
torch.cuda.empty_cache()
Otherwise, the backbone's own initialization method is called, and then the classification and regression heads are initialized.
else:
# Initialize backbone(s)
for _, backbone in self.backbones.items():
backbone.initialize()
# Initialize heads
def initialize_layer(layer):
if isinstance(layer, nn.Conv2d):
nn.init.normal_(layer.weight, std=0.01)
if layer.bias is not None:
nn.init.constant_(layer.bias, val=0)
self.cls_head.apply(initialize_layer)
self.box_head.apply(initialize_layer)
The last layer of the classification head receives a special prior initialization.
# Initialize class head prior
def initialize_prior(layer):
pi = 0.01
b = - math.log((1 - pi) / pi)
nn.init.constant_(layer.bias, b)
nn.init.normal_(layer.weight, std=0.01)
self.cls_head[-1].apply(initialize_prior)
During training, the input x also carries the preprocessed annotations.
The backbone(s) extract features, and the classification and regression heads are applied to them.
If training, _compute_loss is called to compute and return the losses.
if self.training: x, targets = x
# Backbones forward pass
features = []
for _, backbone in self.backbones.items():
features.extend(backbone(x))
# Heads forward pass
cls_heads = [self.cls_head(t) for t in features]
box_heads = [self.box_head(t) for t in features]
if self.training:
return self._compute_loss(x, cls_heads, box_heads, targets.float())
When exporting, the classification and box regression outputs are returned directly.
cls_heads = [cls_head.sigmoid() for cls_head in cls_heads]
if self.exporting:
self.strides = [x.shape[-1] // cls_head.shape[-1] for cls_head in cls_heads]
return cls_heads, box_heads
Otherwise, the inference post-processing is run.
generate_anchors generates anchor coordinates from scales/ratios.
decode filters the outputs by score and decodes the bounding boxes.
nms filters the results further.
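generate_anchors itself is not shown in this walkthrough; a sketch of the usual construction (one (x1, y1, x2, y2) box per ratio/scale pair for a given stride) follows, where the centering and ordering conventions are assumptions and may differ from the repo's implementation:

import torch

def generate_anchors_sketch(stride, ratios, scales):
    'One anchor per (scale, ratio) pair, expressed relative to a single feature-map cell'
    anchors = []
    for scale in scales:
        for ratio in ratios:
            area = (stride * scale) ** 2
            w = (area / ratio) ** 0.5
            h = w * ratio
            cx = cy = 0.5 * stride  # assumed: centered on the cell
            anchors.append([cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h])
    return torch.tensor(anchors)

print(generate_anchors_sketch(8, [1.0, 2.0, 0.5], [4 * 2 ** (i / 3) for i in range(3)]).shape)
# torch.Size([9, 4])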
# Inference post-processing
decoded = []
for cls_head, box_head in zip(cls_heads, box_heads):
# Generate level's anchors
stride = x.shape[-1] // cls_head.shape[-1]
if stride not in self.anchors:
self.anchors[stride] = generate_anchors(stride, self.ratios, self.scales)
# Decode and filter boxes
decoded.append(decode(cls_head, box_head, stride,
self.threshold, self.top_n, self.anchors[stride]))
# Perform non-maximum suppression
decoded = [torch.cat(tensors, 1) for tensors in zip(*decoded)]
return nms(*decoded, self.nms, self.detections)
generate_anchors generates anchor coordinates from scales/ratios.
snap_to_anchors builds the training targets with respect to the anchors.
cls_target, box_target, depth = [], [], []
for target in targets:
target = target[target[:, -1] > -1]
if stride not in self.anchors:
self.anchors[stride] = generate_anchors(stride, self.ratios, self.scales)
snapped = snap_to_anchors(
target, [s * stride for s in size[::-1]], stride,
self.anchors[stride].to(targets.device), self.classes, targets.device)
for l, s in zip((cls_target, box_target, depth), snapped): l.append(s)
return torch.stack(cls_target), torch.stack(box_target), torch.stack(depth)
_extract_targets produces the classification and regression targets.
depth is the sample-selection mask.
cls_losses, box_losses, fg_targets = [], [], []
for cls_head, box_head in zip(cls_heads, box_heads):
size = cls_head.shape[-2:]
stride = x.shape[-1] / cls_head.shape[-1]
cls_target, box_target, depth = self._extract_targets(targets, stride, size)
fg_targets.append((depth > 0).sum().float().clamp(min=1))
cls_head = cls_head.view_as(cls_target).float()
cls_mask = (depth >= 0).expand_as(cls_target).float()
cls_loss = self.cls_criterion(cls_head, cls_target)
cls_loss = cls_mask * cls_loss
cls_losses.append(cls_loss.sum())
box_head = box_head.view_as(box_target).float()
box_mask = (depth > 0).expand_as(box_target).float()
box_loss = self.box_criterion(box_head, box_target)
box_loss = box_mask * box_loss
box_losses.append(box_loss.sum())
fg_targets = torch.stack(fg_targets).sum()
cls_loss = torch.stack(cls_losses).sum() / fg_targets
box_loss = torch.stack(box_losses).sum() / fg_targets
return cls_loss, box_loss
The backbone name(s), number of classes, and model state_dict are saved.
iteration, optimizer, and scheduler are saved as well, if present.
checkpoint = {
'backbone': [k for k, _ in self.backbones.items()],
'classes': self.classes,
'state_dict': self.state_dict()
}
for key in ('iteration', 'optimizer', 'scheduler'):
if key in state:
checkpoint[key] = state[key]
torch.save(checkpoint, state['path'])
The backbone(s) are created first, and then the parameters are loaded into them.
torch.cuda.empty_cache releases all unoccupied cached memory currently held by the caching allocator, so that it can be used by other GPU applications and becomes visible in nvidia-smi.
@classmethod
def load(cls, filename):
if not os.path.isfile(filename):
raise ValueError('No checkpoint {}'.format(filename))
checkpoint = torch.load(filename, map_location=lambda storage, loc: storage)
# Recreate model from checkpoint instead of from individual backbones
model = cls(backbones=checkpoint['backbone'], classes=checkpoint['classes'])
model.load_state_dict(checkpoint['state_dict'])
state = {}
for key in ('iteration', 'optimizer', 'scheduler'):
if key in checkpoint:
state[key] = checkpoint[key]
del checkpoint
torch.cuda.empty_cache()
return model, state
If the ONNX opset version is lower than 9, an upsample_nearest2d function is defined.
import torch.onnx.symbolic
if opset is not None and opset < 9:
# Override Upsample's ONNX export from old opset if required (not needed for TRT 5.1+)
@torch.onnx.symbolic.parse_args('v', 'is')
def upsample_nearest2d(g, input, output_size):
height_scale = float(output_size[-2]) / input.type().sizes()[-2]
width_scale = float(output_size[-1]) / input.type().sizes()[-1]
return g.op("Upsample", input,
scales_f=(1, 1, height_scale, width_scale),
mode_s="nearest")
torch.onnx.symbolic.upsample_nearest2d = upsample_nearest2d
io.BytesIO is a stream implementation using an in-memory bytes buffer. It inherits BufferedIOBase; the buffer is discarded when the close() method is called. The optional argument initial_bytes is a bytes-like object containing initial data.
torch.onnx.export exports the model.
io.BytesIO.getvalue returns bytes containing the entire contents of the buffer.
# Export to ONNX
print('Exporting to ONNX...')
self.exporting = True
onnx_bytes = io.BytesIO()
zero_input = torch.zeros([1, 3, *size]).cuda()
extra_args = { 'opset_version': opset } if opset else {}
torch.onnx.export(self.cuda(), zero_input, onnx_bytes, **extra_args)
self.exporting = False
if onnx_only:
return onnx_bytes.getvalue()
generate_anchors generates anchor coordinates from scales/ratios.
An Engine object is returned.
# Build TensorRT engine
model_name = '_'.join([k for k, _ in self.backbones.items()])
anchors = [generate_anchors(stride, self.ratios, self.scales).view(-1).tolist()
for stride in self.strides]
return Engine(onnx_bytes.getvalue(), len(onnx_bytes.getvalue()), batch, precision,
self.threshold, self.top_n, anchors, self.nms, self.detections, calibration_files, model_name, calibration_table, verbose)
DaliDataIterator exposes an interface similar to torch.utils.data.DataLoader, but implements data-parallel loading with DALI.
The batch_size argument is the total batch size across devices.
'Data loader for data parallel using Dali'
def __init__(self, path, resize, max_size, batch_size, stride, world, annotations, training=False):
self.training = training
self.resize = resize
self.max_size = max_size
self.stride = stride
self.batch_size = batch_size // world
self.mean = [255.*x for x in [0.485, 0.456, 0.406]]
self.std = [255.*x for x in [0.229, 0.224, 0.225]]
self.world = world
self.path = path
contextlib.redirect_stdout is a context manager for temporarily redirecting sys.stdout to another file or file-like object; it adds flexibility to existing functions or classes whose output is hardwired to stdout.
A COCO instance is created.
COCOPipeline wraps nvidia.dali.pipeline.Pipeline.
nvidia.dali.pipeline.Pipeline.build builds the pipeline. A pipeline needs to be built before it can be run standalone; framework-specific plugins handle this step automatically.
# Setup COCO
with redirect_stdout(None):
self.coco = COCO(annotations)
self.ids = list(self.coco.imgs.keys())
if 'categories' in self.coco.dataset:
self.categories_inv = { k: i for i, k in enumerate(self.coco.getCatIds()) }
self.pipe = COCOPipeline(batch_size=self.batch_size, num_threads=2,
path=path, coco=self.coco, training=training, annotations=annotations, world=world,
device_id = torch.cuda.current_device(), mean=self.mean, std=self.std, resize=resize, max_size=max_size, stride=self.stride)
self.pipe.build()
__repr__ describes the loader.
return '\n'.join([
' loader: dali',
' resize: {}, max: {}'.format(self.resize, self.max_size),
])
return ceil(len(self.ids) // self.world / self.batch_size)
nvidia.dali.pipeline.Pipeline.run runs the pipeline and returns the results. If the pipeline was created with exec_pipelined set to True, this function will also begin prefetching the next iteration to speed up execution. It should not be mixed with nvidia.dali.pipeline.Pipeline.schedule_run(), nvidia.dali.pipeline.Pipeline.share_outputs(), and nvidia.dali.pipeline.Pipeline.release_outputs() on the same pipeline.
ctypes.c_void_p represents the C void * type. The value is represented as an integer, and the constructor accepts an optional integer initializer.
torch.Tensor.data_ptr returns the address of the first element of the tensor.
The results returned by self.pipe.run() are all of type nvidia.dali.backend.TensorListCPU or nvidia.dali.backend.TensorListGPU.
nvidia.dali.backend.TensorListCPU.copy_to_external copies the contents of this TensorList to an external pointer (of type ctypes.c_void_p) in CPU memory; this function is used internally by plugins to interface with tensors from supported deep learning frameworks.
nvidia.dali.backend.TensorListGPU.as_cpu returns a TensorListCPU object that is a copy of this TensorListGPU.
data, ratios, ids, num_detections = [], [], [], []
dali_data, dali_boxes, dali_labels, dali_ids, dali_attrs, dali_resize_img = self.pipe.run()
for l in range(len(dali_boxes)):
num_detections.append(dali_boxes.at(l).shape[0])
pyt_targets = -1 * torch.ones([len(dali_boxes), max(max(num_detections),1), 5])
for batch in range(self.batch_size):
id = int(dali_ids.at(batch)[0])
# Convert dali tensor to pytorch
dali_tensor = dali_data.at(batch)
tensor_shape = dali_tensor.shape()
datum = torch.zeros(dali_tensor.shape(), dtype=torch.float, device=torch.device('cuda'))
c_type_pointer = ctypes.c_void_p(datum.data_ptr())
dali_tensor.copy_to_external(c_type_pointer)
# Calculate image resize ratio to rescale boxes
prior_size = dali_attrs.as_cpu().at(batch)
resized_size = dali_resize_img.at(batch).shape()
ratio = max(resized_size) / max(prior_size)
if self.training:
# Rescale boxes
b_arr = dali_boxes.at(batch)
num_dets = b_arr.shape[0]
if num_dets != 0:
pyt_bbox = torch.from_numpy(b_arr).float()
pyt_bbox[:,0] *= float(prior_size[1])
pyt_bbox[:,1] *= float(prior_size[0])
pyt_bbox[:,2] *= float(prior_size[1])
pyt_bbox[:,3] *= float(prior_size[0])
# (l,t,r,b) -> (x,y,w,h) == (l,r, r-l, b-t)
pyt_bbox[:,2] -= pyt_bbox[:,0]
pyt_bbox[:,3] -= pyt_bbox[:,1]
pyt_targets[batch,:num_dets,:4] = pyt_bbox * ratio
# Arrange labels in target tensor
l_arr = dali_labels.at(batch)
if num_dets != 0:
pyt_label = torch.from_numpy(l_arr).float()
pyt_label -= 1 #Rescale labels to [0,79] instead of [1,80]
pyt_targets[batch,:num_dets, 4] = pyt_label.squeeze()
ids.append(id)
data.append(datum.unsqueeze(0))
ratios.append(ratio)
data = torch.cat(data, dim=0)
if self.training:
pyt_targets = pyt_targets.cuda(non_blocking=True)
yield data, pyt_targets
else:
ids = torch.Tensor(ids).int().cuda(non_blocking=True)
ratios = torch.Tensor(ratios).cuda(non_blocking=True)
yield data, ids, ratios
See the documentation COCO Reader with augmentations.
nvidia.dali.ops.COCOReader is a CPU operator. It reads data from a COCO dataset, which consists of directories containing images and an annotation file. For an image with m bboxes, it returns the bboxes as an (m, 4) tensor (m * [x, y, w, h] or m * [left, top, right, bottom]) and the labels as an (m, 1) tensor (m * category_id).
nvidia.dali.ops.nvJPEGDecoderSlice is a "mixed" operator. It partially decodes JPEG images with the nvJPEG library, using a cropping window of given size and anchor. The input must be supplied as three tensors in a specific order: encoded_data contains the encoded image data; begin contains the starting pixel coordinates of the crop in (x, y) format; size contains the pixel dimensions of the crop in (w, h) format. For begin and size, the coordinates must lie in the interval [0.0, 1.0]. The decoder output is in HWC layout.
Warning: this operator is now deprecated; use ImageDecoderSlice instead.
nvidia.dali.ops.RandomBBoxCrop is a CPU operator. It performs a prospective crop of an image while keeping the bounding boxes and labels consistent. The inputs must be supplied as two tensors: BBoxes contains bounding boxes represented as [l, t, r, b] or [x, y, w, h]; Labels contains the corresponding label for each bounding box. The resulting prospective crop is provided as two tensors: Begin contains the starting coordinates of the crop in (x, y) format; Size contains the crop size in (w, h) format. Bounding boxes are provided as an (m*4) tensor, where each bounding box is represented as [l, t, r, b] or [x, y, w, h]. Labels whose boxes' intersection-over-union with the crop window is below the thresholds are discarded. Note that when allow_no_crop is False and the thresholds do not include 0, it is best to increase num_attempts, otherwise the operator may loop for a long time.
nvidia.dali.ops.BbFlip is a CPU/GPU operator that performs a horizontal flip (mirror) of bounding boxes. The input is bounding box coordinates in [x, y, w, h] or [left, top, right, bottom] format; all coordinates are in the image coordinate system (i.e. 0.0-1.0).
nvidia.dali.ops.Flip is a CPU/GPU operator that flips images on the horizontal and/or vertical axes.
nvidia.dali.ops.CoinFlip is a support operator that produces a tensor filled with 0s and 1s (the results of random coin flips), usable as an argument for select operations.
nvidia.dali.ops.Uniform is a support operator that produces a tensor of uniformly distributed random numbers.
nvidia.dali.ops.Resize is a CPU/GPU operator that resizes images.
nvidia.dali.ops.Paste is a GPU operator that pastes the input image onto a larger canvas; the canvas size equals input size * ratio.
nvidia.dali.ops.CropMirrorNormalize is a CPU/GPU operator that performs fused cropping, normalization, format conversion (NHWC to NCHW) if desired, and type casting. It normalizes the input image and produces the output with the formula:
output = (input - mean) / std
Note that not providing any crop argument results in mirroring and normalization only. This operator allows sequence inputs.
super().__init__(batch_size=batch_size, num_threads=num_threads, device_id = device_id, prefetch_queue_depth=num_threads, seed=42)
self.path = path
self.training = training
self.coco = coco
self.stride = stride
self.iter = 0
self.reader = ops.COCOReader(annotations_file=annotations, file_root=path, num_shards=world,shard_id=torch.cuda.current_device(),
ltrb=True, ratio=True, shuffle_after_epoch=True, save_img_ids=True)
self.decode_train = ops.nvJPEGDecoderSlice(device="mixed", output_type=types.RGB)
self.decode_infer = ops.nvJPEGDecoder(device="mixed", output_type=types.RGB)
self.bbox_crop = ops.RandomBBoxCrop(device='cpu', ltrb=True, scaling=[0.3, 1.0], thresholds=[0.1,0.3,0.5,0.7,0.9])
self.bbox_flip = ops.BbFlip(device='cpu', ltrb=True)
self.img_flip = ops.Flip(device='gpu')
self.coin_flip = ops.CoinFlip(probability=0.5)
if isinstance(resize, list): resize = max(resize)
self.rand_resize = ops.Uniform(range=[resize, float(max_size)])
self.resize_train = ops.Resize(device='gpu', interp_type=types.DALIInterpType.INTERP_CUBIC, save_attrs=True)
self.resize_infer = ops.Resize(device='gpu', interp_type=types.DALIInterpType.INTERP_CUBIC, resize_longer=max_size, save_attrs=True)
padded_size = max_size + ((self.stride - max_size % self.stride) % self.stride)
self.pad = ops.Paste(device='gpu', fill_value = 0, ratio=1.1, min_canvas_size=padded_size, paste_x=0, paste_y=0)
self.normalize = ops.CropMirrorNormalize(device='gpu', mean=mean, std=std, crop=padded_size, crop_pos_x=0, crop_pos_y=0)
nvidia.dali.pipeline.Pipeline.define_graph returns the list of output EdgeReferences. Users define this function to construct the operation graph of their pipeline.
self.reader() reads data from the dataset.
During training, the images are decoded and augmented; for inference they are only decoded and resized.
images, bboxes, labels, img_ids = self.reader()
if self.training:
crop_begin, crop_size, bboxes, labels = self.bbox_crop(bboxes, labels)
images = self.decode_train(images, crop_begin, crop_size)
resize = self.rand_resize()
images, attrs = self.resize_train(images, resize_longer=resize)
flip = self.coin_flip()
bboxes = self.bbox_flip(bboxes, horizontal=flip)
images = self.img_flip(images, horizontal=flip)
else:
images = self.decode_infer(images)
images, attrs = self.resize_infer(images)
resized_images = images
images = self.normalize(self.pad(images))
return images, bboxes, labels, img_ids, attrs, resized_images
torch.Tensor.nelement is an alias for torch.Tensor.numel.
If boxes is empty, zero-filled tensors are returned directly.
'Snap target boxes (x, y, w, h) to anchors'
num_anchors = anchors.size()[0] if anchors is not None else 1
width, height = (int(size[0] / stride), int(size[1] / stride))
if boxes.nelement() == 0:
return (torch.zeros([num_anchors, num_classes, height, width], device=device),
torch.zeros([num_anchors, 4, height, width], device=device),
torch.zeros([num_anchors, 1, height, width], device=device))
The anchors are broadcast to every location according to the output size.
boxes, classes = boxes.split(4, dim=1)
# Generate anchors
x, y = torch.meshgrid([torch.arange(0, size[i], stride, device=device, dtype=classes.dtype) for i in range(2)])
xyxy = torch.stack((x, y, x, y), 2).unsqueeze(0)
anchors = anchors.view(-1, 1, 1, 4).to(dtype=classes.dtype)
anchors = (xyxy + anchors).contiguous().view(-1, 4)
boxes are converted from [x, y, width, height] to [left, top, right, bottom], which simplifies the intersection-over-union computation.
# Compute overlap between boxes and anchors
boxes = torch.cat([boxes[:, :2], boxes[:, :2] + boxes[:, 2:] - 1], 1)
xy1 = torch.max(anchors[:, None, :2], boxes[:, :2])
xy2 = torch.min(anchors[:, None, 2:], boxes[:, 2:])
inter = torch.prod((xy2 - xy1 + 1).clamp(0), 2)
boxes_area = torch.prod(boxes[:, 2:] - boxes[:, :2] + 1, 1)
anchors_area = torch.prod(anchors[:, 2:] - anchors[:, :2] + 1, 1)
overlap = inter / (anchors_area[:, None] + boxes_area - inter)
For each anchor, the best-matching target box is kept.
box2delta converts the boxes into deltas relative to the anchors.
torch.ones_like returns a tensor filled with the scalar value 1, with the same size as input; torch.ones_like(input) is equivalent to torch.ones(input.size(), dtype=input.dtype, layout=input.layout, device=input.device).
depth is the sample-selection mask.
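box2delta is not shown in this walkthrough; a sketch of the usual encoding (normalized center offsets plus log size ratios), which is only an assumption about the repo's exact formula, is:

import torch

def box2delta_sketch(boxes, anchors):
    'Encode (x1, y1, x2, y2) boxes as deltas relative to (x1, y1, x2, y2) anchors'
    anchors_wh = anchors[:, 2:] - anchors[:, :2] + 1
    anchors_ctr = anchors[:, :2] + 0.5 * anchors_wh
    boxes_wh = boxes[:, 2:] - boxes[:, :2] + 1
    boxes_ctr = boxes[:, :2] + 0.5 * boxes_wh
    return torch.cat([
        (boxes_ctr - anchors_ctr) / anchors_wh,  # (dx, dy)
        torch.log(boxes_wh / anchors_wh)         # (dw, dh)
    ], dim=1)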
# Keep best box per anchor
overlap, indices = overlap.max(1)
box_target = box2delta(boxes[indices], anchors)
box_target = box_target.view(num_anchors, 1, width, height, 4)
box_target = box_target.transpose(1, 4).transpose(2, 3)
box_target = box_target.squeeze().contiguous()
depth = torch.ones_like(overlap) * -1
depth[overlap < 0.4] = 0 # background
depth[overlap >= 0.5] = classes[indices][overlap >= 0.5].squeeze() + 1 # objects
depth = depth.view(num_anchors, width, height).transpose(1, 2).contiguous()
Generate the target classes; each class entry is 0 or 1.
torch.Tensor.scatter_ writes all values from the tensor src into self at the indices specified in the index tensor. For each value in src, its output index is specified by its index in src for dimension != dim and by the corresponding value in index for dimension = dim.
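A tiny worked example of the one-hot construction used below:

import torch

classes = torch.tensor([[2], [0]])  # one class index per anchor
one_hot = torch.zeros(2, 4)
one_hot.scatter_(1, classes, 1)
print(one_hot)  # tensor([[0., 0., 1., 0.],
                #         [1., 0., 0., 0.]])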
# Generate target classes
cls_target = torch.zeros((anchors.size()[0], num_classes + 1), device=device, dtype=boxes.dtype)
if classes.nelement() == 0:
classes = torch.LongTensor([num_classes], device=device).expand_as(indices)
else:
classes = classes[indices].long()
classes = classes.view(-1, 1)
classes[overlap < 0.4] = num_classes # background has no class
cls_target.scatter_(1, classes, 1)
cls_target = cls_target[:, :num_classes].view(-1, 1, width, height, num_classes)
cls_target = cls_target.transpose(1, 4).transpose(2, 3)
cls_target = cls_target.squeeze().contiguous()
return (cls_target.view(num_anchors, num_classes, height, width),
box_target.view(num_anchors, 4, height, width),
depth.view(num_anchors, 1, height, width))