Welcome to my personal blog, 知行空间.
Paper: http://arxiv.org/abs/2107.08430
Code: https://github.com/Megvii-BaseDetection/YOLOX
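YOLOX is an Anchor Free detection algorithm proposed in a paper submitted in July 2021 by Zheng Ge, Songtao Liu, and colleagues at Megvii. The work focuses on a decoupled head and the label assignment strategy SimOTA. Using YOLOX, the authors (team BaseDet) won first place in the Streaming Perception Challenge (Workshop on Autonomous Driving at CVPR 2021).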
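After YOLOv1, every paper in the YOLO series was Anchor Based. In the meantime, Anchor Free detectors such as CornerNet, CenterNet, and FCOS kept improving, so the YOLOX authors set out to bring Anchor Free techniques back to YOLO. They consider YOLOv4 and YOLOv5 to be over-optimized Anchor Based algorithms, so YOLOX is compared mainly against YOLOv3. The baseline used in YOLOX is YOLOv3-SPP, to which the authors add training strategies such as EMA weight updating and a cosine lr schedule.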
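As a rough illustration of the two baseline training strategies just mentioned, here is a minimal pure-Python sketch of a cosine learning-rate schedule and an EMA weight update. The function names, the decay value, and the absence of warmup are assumptions made for the example; they are not taken from the YOLOX code base.

```python
import math


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine-annealed learning rate that decays from base_lr to min_lr."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def ema_update(ema_weights, weights, decay=0.999):
    """Return new EMA weights: decay * ema + (1 - decay) * current weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


# Illustrative values only (hypothetical schedule length and weights).
lr_start = cosine_lr(0, 100, base_lr=0.01)
lr_mid = cosine_lr(50, 100, base_lr=0.01)
lr_end = cosine_lr(100, 100, base_lr=0.01)
ema = ema_update([1.0, 2.0], [0.0, 4.0], decay=0.9)
```

The EMA copy changes more slowly than the raw weights, which is why it is usually the copy that gets evaluated.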
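In the R-CNN family of detectors, bounding box regression and object classification are handled by two separate output heads, whereas the Yolo Head used throughout the YOLO series produces regression and classification outputs together, in one vector, from a shared network branch. As shown in Figure 1 below, YOLOX brings back the Decoupled Head. The authors show experimentally that it improves detection quality and that the performance drop is small when building an NMS Free End2End Yolo.

Label Assignment is a key part of object detection. Earlier methods mostly decide whether an Anchor Box is positive or negative based on Max IoU, which is a static matching rule built on prior knowledge. Because an image contains only a limited number of objects, a large share of Anchors is labelled negative, which causes a severe class imbalance. ATSS proposes adaptive training sample selection, which adaptively determines the IoU threshold that separates positive from negative Anchors. The Label Assignment problem amounts to assigning a class to every Anchor Box given the Ground Truth Boxes so that the overall result is globally optimal. This fits the bipartite matching model, so quite a few works build on the Hungarian algorithm or on optimal transport (Optimal Transport Assignment, OTA). OTA is a method proposed by the YOLOX authors in a paper submitted in April 2021, and it performs dynamic label matching.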
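To make the static Max IoU rule concrete, the following is a minimal pure-Python sketch. The single `pos_thr` threshold and the helper names are assumptions for illustration; real detectors typically also use a separate negative threshold and an ignore band.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def max_iou_assign(anchors, gts, pos_thr=0.5):
    """Per anchor, return the index of the matched gt, or -1 for negative."""
    labels = []
    for anchor in anchors:
        if not gts:
            labels.append(-1)
            continue
        ious = [iou(anchor, gt) for gt in gts]
        best = max(range(len(gts)), key=lambda j: ious[j])
        labels.append(best if ious[best] >= pos_thr else -1)
    return labels


# Hypothetical toy data: one object, five anchors, most of them far away.
anchors = [(0, 0, 10, 10), (2, 2, 12, 12), (50, 50, 60, 60),
           (80, 80, 90, 90), (30, 0, 40, 10)]
gts = [(1, 1, 11, 11)]
labels = max_iou_assign(anchors, gts)
num_negative = labels.count(-1)
```

Even in this tiny example most anchors come out negative; with thousands of anchors per image the imbalance is far more pronounced.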
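Four key points of advanced label assignment
As listed in the YOLOX paper, these are: loss/quality awareness, center prior, a dynamic number of positive anchors for each ground truth (dynamic top-k), and a global view.
The YOLOX authors point out that solving the OT problem with the Sinkhorn-Knopp algorithm adds about 25% to the training time, so YOLOX instead uses a dynamic top-k strategy called SimOTA to obtain an approximate solution.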
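For context on why the exact solver is costly, here is a minimal pure-Python sketch of generic Sinkhorn-Knopp iterations for an entropy-regularized transport problem. It is not the OTA implementation: the cost matrix, the `supply`/`demand` vectors, the regularization strength `eps`, and the iteration count are all hypothetical values chosen for the example.

```python
import math


def sinkhorn(cost, supply, demand, eps=0.5, iters=200):
    """Return an approximate transport plan via Sinkhorn-Knopp scaling.

    cost:   [n][m] list of pairwise costs
    supply: length-n list, demand: length-m list (equal totals assumed)
    """
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [supply[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [demand[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical toy problem: 2 suppliers (think gts), 3 demanders (think anchors).
cost = [[0.1, 0.9, 0.5],
        [0.8, 0.2, 0.6]]
supply = [2.0, 1.0]
demand = [1.0, 1.0, 1.0]
plan = sinkhorn(cost, supply, demand)
row_sums = [sum(row) for row in plan]
col_sums = [sum(plan[i][j] for i in range(2)) for j in range(3)]
```

Every iteration touches the whole cost matrix, and the loop runs many times per image, which gives an intuition for the extra training time; a single top-k selection per ground truth avoids that loop.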
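SimOTA algorithm flow:
1. Compute the cost of each prediction-gt pair:
$$cost_{ij} = L^{cls}_{ij} + \lambda L^{reg}_{ij}$$
2. For each ground truth, select the top k predictions with the lowest cost as positive samples and treat the rest as negative samples. The hyperparameter k takes a different value for each ground truth box; for how it is chosen, see the authors' description in the OTA paper and this blog post.

Source code analysis:
The walkthrough follows the SimOTA implementation in mmdetection, see mmdet/core/bbox/assigners/sim_ota_assigner.py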
import torch
import torch.nn.functional as F

# BaseAssigner, AssignResult and bbox_overlaps are provided by mmdet
# (their imports are omitted here, as in the original excerpt).


class SimOTAAssigner(BaseAssigner):

    def __init__(self,):
        ...

    def _assign(self,
                pred_scores,
                priors,
                decoded_bboxes,
                gt_bboxes,
                gt_labels,
                gt_bboxes_ignore=None,
                eps=1e-7):
        """Assign gt to priors using SimOTA.

        Args:
            pred_scores (Tensor): Classification scores of one image,
                a 2D-Tensor with shape [num_priors, num_classes]
            priors (Tensor): All priors of one image, a 2D-Tensor with shape
                [num_priors, 4] in [cx, cy, stride_w, stride_y] format.
            decoded_bboxes (Tensor): Predicted bboxes, a 2D-Tensor with shape
                [num_priors, 4] in [tl_x, tl_y, br_x, br_y] format.
            gt_bboxes (Tensor): Ground truth bboxes of one image, a 2D-Tensor
                with shape [num_gts, 4] in [tl_x, tl_y, br_x, br_y] format.
            gt_labels (Tensor): Ground truth labels of one image, a Tensor
                with shape [num_gts].
            gt_bboxes_ignore (Tensor, optional): Ground truth bboxes that are
                labelled as `ignored`, e.g., crowd boxes in COCO.
            eps (float): A value added to the denominator for numerical
                stability. Default 1e-7.

        Returns:
            :obj:`AssignResult`: The assigned result.
        """
        INF = 100000.0
        num_gt = gt_bboxes.size(0)
        num_bboxes = decoded_bboxes.size(0)

        # assign 0 by default
        assigned_gt_inds = decoded_bboxes.new_full((num_bboxes, ),
                                                   0,
                                                   dtype=torch.long)
        # keep only priors whose center lies in a gt box or a gt center region
        valid_mask, is_in_boxes_and_center = self.get_in_gt_and_in_center_info(
            priors, gt_bboxes)
        valid_decoded_bbox = decoded_bboxes[valid_mask]
        valid_pred_scores = pred_scores[valid_mask]
        num_valid = valid_decoded_bbox.size(0)

        if num_gt == 0 or num_bboxes == 0 or num_valid == 0:
            # No ground truth or boxes, return empty assignment
            max_overlaps = decoded_bboxes.new_zeros((num_bboxes, ))
            if num_gt == 0:
                # No truth, assign everything to background
                assigned_gt_inds[:] = 0
            if gt_labels is None:
                assigned_labels = None
            else:
                assigned_labels = decoded_bboxes.new_full((num_bboxes, ),
                                                          -1,
                                                          dtype=torch.long)
            return AssignResult(
                num_gt, assigned_gt_inds, max_overlaps, labels=assigned_labels)

        # regression cost: -log(IoU), shape [num_valid, num_gt]
        pairwise_ious = bbox_overlaps(valid_decoded_bbox, gt_bboxes)
        iou_cost = -torch.log(pairwise_ious + eps)

        gt_onehot_label = (
            F.one_hot(gt_labels.to(torch.int64),
                      pred_scores.shape[-1]).float().unsqueeze(0).repeat(
                          num_valid, 1, 1))

        # classification cost: BCE between predicted scores and one-hot labels
        valid_pred_scores = valid_pred_scores.unsqueeze(1).repeat(1, num_gt, 1)
        cls_cost = (
            F.binary_cross_entropy(
                valid_pred_scores.to(dtype=torch.float32).sqrt_(),
                gt_onehot_label,
                reduction='none',
            ).sum(-1).to(dtype=valid_pred_scores.dtype))

        # pairs not inside both the gt box and the center region get a
        # large extra cost (INF)
        cost_matrix = (
            cls_cost * self.cls_weight + iou_cost * self.iou_weight +
            (~is_in_boxes_and_center) * INF)

        matched_pred_ious, matched_gt_inds = \
            self.dynamic_k_matching(
                cost_matrix, pairwise_ious, num_gt, valid_mask)

        # convert to AssignResult format
        assigned_gt_inds[valid_mask] = matched_gt_inds + 1
        assigned_labels = assigned_gt_inds.new_full((num_bboxes, ), -1)
        assigned_labels[valid_mask] = gt_labels[matched_gt_inds].long()
        max_overlaps = assigned_gt_inds.new_full((num_bboxes, ),
                                                 -INF,
                                                 dtype=torch.float32)
        max_overlaps[valid_mask] = matched_pred_ious
        return AssignResult(
            num_gt, assigned_gt_inds, max_overlaps, labels=assigned_labels)

    def get_in_gt_and_in_center_info(self, priors, gt_bboxes):
        num_gt = gt_bboxes.size(0)

        repeated_x = priors[:, 0].unsqueeze(1).repeat(1, num_gt)
        repeated_y = priors[:, 1].unsqueeze(1).repeat(1, num_gt)
        repeated_stride_x = priors[:, 2].unsqueeze(1).repeat(1, num_gt)
        repeated_stride_y = priors[:, 3].unsqueeze(1).repeat(1, num_gt)

        # is prior centers in gt bboxes, shape: [n_prior, n_gt]
        l_ = repeated_x - gt_bboxes[:, 0]
        t_ = repeated_y - gt_bboxes[:, 1]
        r_ = gt_bboxes[:, 2] - repeated_x
        b_ = gt_bboxes[:, 3] - repeated_y

        deltas = torch.stack([l_, t_, r_, b_], dim=1)
        is_in_gts = deltas.min(dim=1).values > 0
        is_in_gts_all = is_in_gts.sum(dim=1) > 0

        # is prior centers in gt centers
        gt_cxs = (gt_bboxes[:, 0] + gt_bboxes[:, 2]) / 2.0
        gt_cys = (gt_bboxes[:, 1] + gt_bboxes[:, 3]) / 2.0
        ct_box_l = gt_cxs - self.center_radius * repeated_stride_x
        ct_box_t = gt_cys - self.center_radius * repeated_stride_y
        ct_box_r = gt_cxs + self.center_radius * repeated_stride_x
        ct_box_b = gt_cys + self.center_radius * repeated_stride_y

        cl_ = repeated_x - ct_box_l
        ct_ = repeated_y - ct_box_t
        cr_ = ct_box_r - repeated_x
        cb_ = ct_box_b - repeated_y

        ct_deltas = torch.stack([cl_, ct_, cr_, cb_], dim=1)
        is_in_cts = ct_deltas.min(dim=1).values > 0
        is_in_cts_all = is_in_cts.sum(dim=1) > 0

        # in boxes or in centers, shape: [num_priors]
        is_in_gts_or_centers = is_in_gts_all | is_in_cts_all

        # both in boxes and centers, shape: [num_fg, num_gt]
        is_in_boxes_and_centers = (
            is_in_gts[is_in_gts_or_centers, :]
            & is_in_cts[is_in_gts_or_centers, :])
        return is_in_gts_or_centers, is_in_boxes_and_centers

    def dynamic_k_matching(self, cost, pairwise_ious, num_gt, valid_mask):
        matching_matrix = torch.zeros_like(cost, dtype=torch.uint8)
        # select candidate topk ious for dynamic-k calculation
        candidate_topk = min(self.candidate_topk, pairwise_ious.size(0))
        # topk_ious shape: candidate_topk x num_gts
        topk_ious, _ = torch.topk(pairwise_ious, candidate_topk, dim=0)
        # calculate dynamic k for each gt
        # dynamic_ks shape: 1 x num_gts
        # k is the (truncated) sum of at most `candidate_topk` top IoUs,
        # which is what makes it dynamic: each gt gets its own k
        dynamic_ks = torch.clamp(topk_ious.sum(0).int(), min=1)
        for gt_idx in range(num_gt):
            _, pos_idx = torch.topk(
                cost[:, gt_idx], k=dynamic_ks[gt_idx], largest=False)
            matching_matrix[:, gt_idx][pos_idx] = 1

        # The `matching_matrix` built above may match one `prior` to several
        # `gt_boxes`. For such a prior, keep only the gt box with the
        # lowest `cost`.
        del topk_ious, dynamic_ks, pos_idx

        # prior_match_gt_mask shape like (num_priors,)
        prior_match_gt_mask = matching_matrix.sum(1) > 1
        if prior_match_gt_mask.sum() > 0:
            cost_min, cost_argmin = torch.min(
                cost[prior_match_gt_mask, :], dim=1)
            matching_matrix[prior_match_gt_mask, :] *= 0
            matching_matrix[prior_match_gt_mask, cost_argmin] = 1
        # get foreground mask inside box and center prior
        fg_mask_inboxes = matching_matrix.sum(1) > 0
        valid_mask[valid_mask.clone()] = fg_mask_inboxes

        matched_gt_inds = matching_matrix[fg_mask_inboxes, :].argmax(1)
        matched_pred_ious = (matching_matrix *
                             pairwise_ious).sum(1)[fg_mask_inboxes]
        return matched_pred_ious, matched_gt_inds
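
To see the dynamic-k step in isolation, the following is a minimal pure-Python sketch on a toy problem with four priors and two ground truths. It is a simplification under stated assumptions, not the mmdetection code: the classification cost is reduced to `-log(p)` for the ground-truth class, the center-prior penalty is left out, and `iou_weight=3.0` together with all toy numbers are illustrative values.

```python
import math


def pair_cost(cls_prob, iou, iou_weight=3.0, eps=1e-7):
    """Simplified cost: -log(class prob) + iou_weight * -log(IoU)."""
    return -math.log(cls_prob + eps) + iou_weight * -math.log(iou + eps)


def dynamic_k_matching(cost, ious, candidate_topk=10):
    """Toy dynamic-k matching on [num_priors][num_gt] lists.

    Returns a dict mapping each positive prior index to its gt index.
    """
    num_priors, num_gt = len(cost), len(cost[0])
    picked = {}  # prior index -> number of gts that selected it
    chosen_gt = {}  # prior index -> last gt that selected it
    for g in range(num_gt):
        top_ious = sorted((ious[p][g] for p in range(num_priors)), reverse=True)
        # dynamic k: truncated sum of the top candidate IoUs, at least 1
        k = max(1, int(sum(top_ious[:candidate_topk])))
        by_cost = sorted(range(num_priors), key=lambda p: cost[p][g])
        for p in by_cost[:k]:
            picked[p] = picked.get(p, 0) + 1
            chosen_gt[p] = g
    # A prior selected by several gts keeps the gt with the lowest cost
    # (minimum taken over all gts, mirroring the torch.min call above).
    for p, count in picked.items():
        if count > 1:
            chosen_gt[p] = min(range(num_gt), key=lambda g: cost[p][g])
    return chosen_gt


# Hypothetical toy inputs: rows are priors, columns are gts.
ious = [[0.9, 0.1], [0.8, 0.7], [0.4, 0.9], [0.05, 0.4]]
cls_probs = [[0.9, 0.1], [0.6, 0.7], [0.2, 0.8], [0.1, 0.3]]
cost = [[pair_cost(cls_probs[p][g], ious[p][g]) for g in range(2)]
        for p in range(4)]
assignment = dynamic_k_matching(cost, ious)
```

Here both ground truths get k = 2, prior 1 is selected by both and is resolved to the cheaper one, and prior 3 stays negative.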
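In December 2020, Jianfeng Wang et al. at Megvii submitted the paper End-to-End Object Detection with Fully Convolutional Network, which proposes a detector that needs no non-maximum suppression (NMS) post-processing. It reworks label assignment, changing the usual one2many assignment into one2one; for more detail, see the paper author's article on Zhihu. In January 2021, Qiang Zhou et al. at Alibaba also used one2one label assignment in the paper [Object Detection Made Simpler by Eliminating Heuristic NMS](https://arxiv.org/abs/2101.11782) to build an NMS Free detector. YOLOX follows these methods to train End2End and obtains the results shown in gray in the table below; both speed and accuracy are affected to some degree:

I have been focusing on label assignment recently, so I have not yet read the source code for the NMS Free part and will come back to it later. NMS Free detection is genuinely valuable: because of NMS, when a model is deployed on an edge device the post-processing cannot be accelerated on the NPU the way the model itself can, so it has to run on the CPU.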
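
For reference, a minimal pure-Python sketch of classic greedy NMS is shown below; the threshold and the toy boxes are hypothetical. The loop is sequential and data-dependent (each decision depends on which boxes were kept before), which hints at why this step is awkward to offload to an accelerator.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_thr=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep


# Hypothetical detections: two near-duplicates and one separate box.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
```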
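Data augmentation methods
- Mosaic: proposed by the YOLOv5 author in ultralytics-YOLOv3.
- MixUp: proposed in the 2018 paper mixup: Beyond Empirical Risk Minimization. It was first used for image classification and, since BoF, has been used as a data augmentation for object detection.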
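
As a rough sketch of the MixUp idea for detection, the example below blends two equally sized images pixel by pixel and keeps the boxes of both. Representing images as nested lists, concatenating the box lists, and drawing the blend ratio from a Beta distribution with `alpha=1.5` are assumptions for illustration, not the YOLOX implementation.

```python
import random


def mixup(img_a, boxes_a, img_b, boxes_b, lam):
    """Blend two same-sized images with ratio lam and merge their boxes."""
    mixed = [[lam * pa + (1.0 - lam) * pb for pa, pb in zip(row_a, row_b)]
             for row_a, row_b in zip(img_a, img_b)]
    return mixed, boxes_a + boxes_b


def sample_lambda(alpha=1.5, rng=random):
    """Draw a blend ratio in [0, 1] from Beta(alpha, alpha)."""
    return rng.betavariate(alpha, alpha)


# Hypothetical 2x2 grayscale images and one box each.
img_a = [[0.0, 0.0], [0.0, 0.0]]
img_b = [[1.0, 1.0], [1.0, 1.0]]
mixed, boxes = mixup(img_a, [(0, 0, 1, 1)], img_b, [(1, 1, 2, 2)], lam=0.25)
lam = sample_lambda()
```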
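As in FCOS, each feature map cell predicts only one output box, which is equivalent to Anchor Number=1. To obtain more positive samples, every cell on the Feature Map that falls within a certain range around the object center is treated as a positive sample, i.e. Multiple Positive Sample.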
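
The multiple-positive idea can be sketched as follows: find the grid cell containing the object center and mark its neighbours within a small radius as positives. The function name and the `radius` parameter are hypothetical; `radius=1` (a 3x3 block) is just an illustrative choice.

```python
def center_positive_cells(gt_box, stride, grid_w, grid_h, radius=1):
    """Return (row, col) grid cells around the gt center treated as positive.

    gt_box is (x1, y1, x2, y2) in image pixels; stride is the downsampling
    factor of the feature map.
    """
    cx = (gt_box[0] + gt_box[2]) / 2.0
    cy = (gt_box[1] + gt_box[3]) / 2.0
    col, row = int(cx // stride), int(cy // stride)
    cells = []
    for r in range(row - radius, row + radius + 1):
        for c in range(col - radius, col + radius + 1):
            # Drop neighbours that fall outside the feature map.
            if 0 <= r < grid_h and 0 <= c < grid_w:
                cells.append((r, c))
    return cells


# Hypothetical 20x20 feature map with stride 8.
inner = center_positive_cells((32, 32, 96, 96), stride=8, grid_w=20, grid_h=20)
corner = center_positive_cells((0, 0, 8, 8), stride=8, grid_w=20, grid_h=20)
```

With a single anchor per cell, widening the positive region this way is what supplies more than one positive sample per object.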
- 1. https://zhuanlan.zhihu.com/p/392221567
- 2. https://zhuanlan.zhihu.com/p/394392992
- 3. https://zhuanlan.zhihu.com/p/397993315
- 4. 丢弃Transformer,FCN也可以实现E2E检测 (Zhihu article; in English roughly "Drop the Transformer: an FCN can also do E2E detection")