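After the RPN we have the RPN's classification/regression losses and a set of proposal regions. The next stage, the RoI heads, processes the proposals further. It classifies each proposal into a specific class, regresses its box a second time, and performs RoI pooling (RoIAlign in torchvision) to extract a fixed-size feature for each proposal.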
Contents
RoIHeads
select_training_samples
ROIAlign pooling
box_head and box_predictor
fastrcnn_loss
postprocess_detections
Summary
The RPN gives us a set of proposals. The next step is to match the proposals with the ground-truth boxes and to generate the targets used for training. This is the job of select_training_samples.
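The matching and sampling idea can be sketched without torchvision. This is a minimal, framework-free sketch, not torchvision's code. It assumes boxes in (x1, y1, x2, y2) format and uses torchvision's default RoI-head settings: a 0.5 IoU threshold, 512 sampled proposals per image and a 0.25 positive fraction. With the foreground and background thresholds both at 0.5 there is no in-between band, so the sketch produces only foreground and background labels and never -1. The names `iou`, `assign_labels` and `subsample` are hypothetical.

```python
import random


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def assign_labels(proposals, gt_boxes, gt_labels, fg_thresh=0.5):
    """Match every proposal to its highest-IoU ground-truth box.

    Returns (matched_idxs, labels). A proposal whose best IoU is below
    fg_thresh becomes background (label 0) and gets matched index 0 as a
    placeholder, similar to the clamp(min=0) used by torchvision.
    """
    matched_idxs, labels = [], []
    for p in proposals:
        ious = [iou(p, g) for g in gt_boxes]
        best = max(range(len(gt_boxes)), key=lambda j: ious[j])
        if ious[best] >= fg_thresh:
            matched_idxs.append(best)
            labels.append(gt_labels[best])
        else:
            matched_idxs.append(0)  # placeholder index for background
            labels.append(0)        # background class
    return matched_idxs, labels


def subsample(labels, batch_size=512, positive_fraction=0.25, rng=random):
    """Randomly pick at most batch_size proposals, at most a positive_fraction of them positive."""
    pos = [i for i, l in enumerate(labels) if l > 0]
    neg = [i for i, l in enumerate(labels) if l == 0]
    num_pos = min(len(pos), int(batch_size * positive_fraction))
    num_neg = min(len(neg), batch_size - num_pos)
    return sorted(rng.sample(pos, num_pos) + rng.sample(neg, num_neg))
```

Appending the ground-truth boxes to the proposals (add_gt_proposals) guarantees that every ground-truth box has at least one proposal with IoU 1, so every object contributes at least one positive sample.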
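select_training_samples mainly does the following:
1. appends the ground-truth boxes to the proposals;
2. matches each proposal to a ground-truth box by IoU and assigns it a class label (0 for background);
3. samples positive and negative proposals so that their number and ratio stay fixed;
4. encodes the regression targets between each sampled proposal and its matched ground-truth box.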
Code:
def select_training_samples(self, proposals, targets):
    self.check_targets(targets)
    gt_boxes = [t["boxes"] for t in targets]
    gt_labels = [t["labels"] for t in targets]
    # append ground-truth bboxes to proposals
    proposals = self.add_gt_proposals(proposals, gt_boxes)
    # same matching as in the RPN: compute the IoU between each proposal and the ground-truth boxes
    # matched_idxs holds the index of the matched ground-truth box (unmatched proposals default to 0)
    # labels holds the class labels: 0 for background, -1 for ignored proposals
    matched_idxs, labels = self.assign_targets_to_proposals(proposals, gt_boxes, gt_labels)
    # subsample picks positive and negative proposals,
    # keeping the number of sampled proposals and the positive/negative ratio fixed
    sampled_inds = self.subsample(labels)
    matched_gt_boxes = []
    num_images = len(proposals)
    # keep only the sampled proposals (with their labels and matches) for each image
    for img_id in range(num_images):
        img_sampled_inds = sampled_inds[img_id]
        proposals[img_id] = proposals[img_id][img_sampled_inds]
        labels[img_id] = labels[img_id][img_sampled_inds]
        matched_idxs[img_id] = matched_idxs[img_id][img_sampled_inds]
        matched_gt_boxes.append(gt_boxes[img_id][matched_idxs[img_id]])
    # compute the deltas (dx, dy, dw, dh) between the ground truth and the proposals
    regression_targets = self.box_coder.encode(matched_gt_boxes, proposals)
    return proposals, matched_idxs, labels, regression_targets
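The deltas produced by `self.box_coder.encode` can be sketched as plain arithmetic. This is a sketch of the standard R-CNN box parameterization under two assumptions: torchvision's default RoI-head weights (10, 10, 5, 5), and a clamp of log(1000/16) on dw/dh before `exp`. `encode` and `decode` here are hypothetical helpers working on single boxes. `decode` is the inverse used later at inference time.

```python
import math

WEIGHTS = (10.0, 10.0, 5.0, 5.0)  # assumed default weights for the RoI box coder
CLIP = math.log(1000.0 / 16)       # upper bound on dw, dh before exp()


def _center_size(box):
    """(x1, y1, x2, y2) -> (center x, center y, width, height)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return x1 + 0.5 * w, y1 + 0.5 * h, w, h


def encode(gt, proposal, weights=WEIGHTS):
    """Regression target (dx, dy, dw, dh) that moves `proposal` onto `gt`."""
    wx, wy, ww, wh = weights
    px, py, pw, ph = _center_size(proposal)
    gx, gy, gw, gh = _center_size(gt)
    return (wx * (gx - px) / pw,
            wy * (gy - py) / ph,
            ww * math.log(gw / pw),
            wh * math.log(gh / ph))


def decode(deltas, proposal, weights=WEIGHTS):
    """Apply predicted deltas to `proposal` and return an (x1, y1, x2, y2) box."""
    wx, wy, ww, wh = weights
    px, py, pw, ph = _center_size(proposal)
    dx, dy = deltas[0] / wx, deltas[1] / wy
    dw, dh = min(deltas[2] / ww, CLIP), min(deltas[3] / wh, CLIP)
    cx, cy = px + dx * pw, py + dy * ph
    w, h = pw * math.exp(dw), ph * math.exp(dh)
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)
```

Center offsets are normalized by the proposal size and width/height are encoded in log space. This keeps the targets roughly scale-invariant, so large and small boxes produce deltas of similar magnitude.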
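The sampled proposals then go through RoIAlign. The features fed into the RoI head are full-size feature maps, so the region belonging to each proposal has to be cropped out of the feature map and resampled to a fixed size. That is what RoIPool and RoIAlign do. RoIPool rounds the box coordinates (and the bin boundaries) to the integer feature-map grid. RoIAlign keeps the coordinates continuous and computes each output bin by bilinear interpolation, which makes it more accurate. The torchvision Faster R-CNN uses RoIAlign, through MultiScaleRoIAlign defined in torchvision/ops/poolers.py.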
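A small worked example shows why rounding matters. This sketch uses made-up box coordinates and a stride of 16. It covers only the first rounding step of RoIPool, the box edges, and measures the resulting misalignment back in image pixels.

```python
stride = 16                       # feature-map stride relative to the input image
box_x1, box_x2 = 37.6, 161.3      # box edges in image pixels (illustrative values)

# continuous coordinates on the feature map, as RoIAlign uses them
fx1, fx2 = box_x1 / stride, box_x2 / stride   # about 2.35 and 10.08

# RoIPool-style quantization: snap the edges to integer cells
qx1, qx2 = round(fx1), round(fx2)              # 2 and 10

# misalignment caused by rounding, converted back to image pixels
err_x1 = abs(fx1 - qx1) * stride               # about 5.6 px
err_x2 = abs(fx2 - qx2) * stride               # about 1.3 px
```

A shift of several pixels hardly matters for classification, but it directly hurts box regression and, in Mask R-CNN, pixel-accurate masks. RoIAlign avoids this by never rounding.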
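Parameters:
featmap_names : the forward input of MultiScaleRoIAlign is an OrderedDict of feature maps; featmap_names specifies which of those feature maps are used for RoIAlign
output_size : the spatial size of the features after RoIAlign (for example 7 gives 7×7 features)
sampling_ratio : the number of sampling points per output bin along each axis used for bilinear interpolation; if it is <= 0, the number is chosen adaptively as ceil(roi size / output size)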
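To make output_size and sampling_ratio concrete, here is a simplified single-channel RoIAlign in pure Python. It is a sketch, not torchvision's kernel. It follows the non-aligned convention used by MultiScaleRoIAlign (no half-pixel offset, RoI size clamped to at least 1). For brevity it clamps sample points to the map border, whereas the real kernel returns 0 for points well outside the map. `bilinear` and `roi_align_single` are hypothetical names.

```python
import math


def bilinear(feat, y, x):
    """Bilinearly interpolate the 2-D list `feat` at continuous (y, x), clamped to the border."""
    H, W = len(feat), len(feat[0])
    y = min(max(y, 0.0), H - 1.0)
    x = min(max(x, 0.0), W - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    ly, lx = y - y0, x - x0
    return ((1 - ly) * (1 - lx) * feat[y0][x0] + (1 - ly) * lx * feat[y0][x1]
            + ly * (1 - lx) * feat[y1][x0] + ly * lx * feat[y1][x1])


def roi_align_single(feat, box, output_size, spatial_scale, sampling_ratio):
    """Pool one (x1, y1, x2, y2) box (image coordinates) into an output_size x output_size grid."""
    x1, y1, x2, y2 = (c * spatial_scale for c in box)
    roi_w, roi_h = max(x2 - x1, 1.0), max(y2 - y1, 1.0)
    bin_w, bin_h = roi_w / output_size, roi_h / output_size
    # sampling_ratio <= 0: adaptive number of sample points per bin side
    ns_y = sampling_ratio if sampling_ratio > 0 else math.ceil(roi_h / output_size)
    ns_x = sampling_ratio if sampling_ratio > 0 else math.ceil(roi_w / output_size)
    out = [[0.0] * output_size for _ in range(output_size)]
    for i in range(output_size):
        for j in range(output_size):
            acc = 0.0
            for iy in range(ns_y):
                # evenly spaced sample points inside the bin, no rounding
                y = y1 + i * bin_h + (iy + 0.5) * bin_h / ns_y
                for ix in range(ns_x):
                    x = x1 + j * bin_w + (ix + 0.5) * bin_w / ns_x
                    acc += bilinear(feat, y, x)
            out[i][j] = acc / (ns_y * ns_x)  # average of the samples in this bin
    return out
```

On a feature map that is linear in x and y, bilinear interpolation is exact, so each bin's value is simply the feature value at the bin center. That makes the result easy to check by hand.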
Here is the usage example from the official docstring:
>>> import torch, torchvision
>>> from collections import OrderedDict
>>> # create the RoIAlign module; ['feat1', 'feat3'] selects the feature maps used for RoIAlign
>>> m = torchvision.ops.MultiScaleRoIAlign(['feat1', 'feat3'], 3, 2)
>>> # i is the OrderedDict of feature maps fed into RoIAlign
>>> i = OrderedDict()
>>> i['feat1'] = torch.rand(1, 5, 64, 64)
>>> i['feat2'] = torch.rand(1, 5, 32, 32)  # this feature won't be used in the pooling
>>> i['feat3'] = torch.rand(1, 5, 16, 16)
>>> # create 6 random boxes (x1, y1, x2, y2) as the forward input
>>> boxes = torch.rand(6, 4) * 256; boxes[:, 2:] += boxes[:, :2]
>>> # image_sizes: size of the input image, before the feature maps were computed
>>> image_sizes = [(512, 512)]
>>> output = m(i, [boxes], image_sizes)
>>> print(output.shape)
torch.Size([6, 5, 3, 3])
After RoIAlign, each of the 6 boxes yields a 5-channel 3×3 feature cropped from the selected feature maps.
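The boxes in this example are given in image pixels, while the feature maps are smaller. As far as we can tell from torchvision's poolers.py, MultiScaleRoIAlign infers each selected map's scale by rounding feature size / image size to the nearest power of two. It then assigns each box to one of the selected maps according to its size (the FPN heuristic) and passes that scale to roi_align as spatial_scale. The sketch below covers only the scale inference and the coordinate mapping, using the example's sizes. `infer_scale` and `to_feature_coords` are hypothetical helpers, and the sketch assumes square images so that one dimension is enough.

```python
import math


def infer_scale(feat_size, image_size):
    """Round feat_size / image_size to the nearest power of two."""
    approx = feat_size / image_size
    return 2.0 ** round(math.log2(approx))


def to_feature_coords(box, scale):
    """Map an (x1, y1, x2, y2) box from image pixels to feature-map coordinates."""
    return [c * scale for c in box]


image_size = 512
scales = {name: infer_scale(size, image_size) for name, size in [("feat1", 64), ("feat3", 16)]}
box = [64.0, 128.0, 192.0, 256.0]
on_feat1 = to_feature_coords(box, scales["feat1"])  # stride 8
on_feat3 = to_feature_coords(box, scales["feat3"])  # stride 32
```

Rounding to a power of two makes the scale robust to feature maps whose size is not an exact divisor of the image size, which happens when the input dimensions are not multiples of the network stride.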
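box_head and box_predictor turn the RoIAligned features into classification and regression outputs using fully connected layers. By default, box_head is a TwoMLPHead, which flattens each RoI feature and applies two fully connected layers with ReLU (1024-d). box_predictor is a FastRCNNPredictor, whose two linear layers output class_logits (num_classes values per RoI) and box_regression (num_classes × 4 values per RoI):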
box_features = self.box_roi_pool(features, proposals, image_shapes)
box_features = self.box_head(box_features)
class_logits, box_regression = self.box_predictor(box_features)
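For reference, the two default modules can be sketched in a few lines of PyTorch. This sketch mirrors the structure described above, but the class names are hypothetical. It also uses small sizes for illustration instead of torchvision's defaults (256 channels, 7×7 RoIs, 1024-d representation).

```python
import torch
from torch import nn
import torch.nn.functional as F


class SketchTwoMLPHead(nn.Module):
    """Flatten the pooled RoI features and apply two FC layers with ReLU."""

    def __init__(self, in_features: int, representation_size: int = 1024):
        super().__init__()
        self.fc6 = nn.Linear(in_features, representation_size)
        self.fc7 = nn.Linear(representation_size, representation_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.flatten(start_dim=1)  # (R, C, P, P) -> (R, C * P * P)
        return F.relu(self.fc7(F.relu(self.fc6(x))))


class SketchPredictor(nn.Module):
    """Two parallel linear layers: class logits and per-class box deltas."""

    def __init__(self, in_features: int, num_classes: int):
        super().__init__()
        self.cls_score = nn.Linear(in_features, num_classes)
        self.bbox_pred = nn.Linear(in_features, num_classes * 4)

    def forward(self, x: torch.Tensor):
        return self.cls_score(x), self.bbox_pred(x)


channels, resolution, num_classes = 8, 3, 5
head = SketchTwoMLPHead(channels * resolution ** 2, representation_size=32)
predictor = SketchPredictor(32, num_classes)
box_features = torch.rand(6, channels, resolution, resolution)  # 6 pooled RoIs
class_logits, box_regression = predictor(head(box_features))
```

The predictor outputs one set of 4 deltas per class (class-specific regression). The loss and the post-processing therefore always select the deltas belonging to a particular class.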
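During training, the loss is computed in fastrcnn_loss in the same way as in the RPN: cross-entropy for classification and Smooth L1 for regression. The regression loss only covers positive proposals, and for each of them it uses the deltas predicted for its ground-truth class.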
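A pure-Python sketch of this loss follows. It assumes a Smooth L1 beta of 1/9 and normalizes the summed box loss by the total number of sampled proposals, which is our reading of torchvision's fastrcnn_loss. It also assumes all labels are >= 0. The function names are hypothetical.

```python
import math


def cross_entropy(logits, target):
    """-log softmax(logits)[target], computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[target]


def smooth_l1(diff, beta=1.0 / 9):
    """Quadratic below beta, linear above it."""
    d = abs(diff)
    return 0.5 * d * d / beta if d < beta else d - 0.5 * beta


def fastrcnn_loss_sketch(class_logits, box_regression, labels, regression_targets):
    """class_logits: N x K, box_regression: N x 4K, labels: N, regression_targets: N x 4."""
    n = len(labels)
    # classification: mean cross-entropy over all sampled proposals
    cls_loss = sum(cross_entropy(class_logits[i], labels[i]) for i in range(n)) / n
    box_loss = 0.0
    for i in range(n):
        k = labels[i]
        if k > 0:  # regression only for positives, using the deltas of class k
            pred = box_regression[i][4 * k: 4 * k + 4]
            box_loss += sum(smooth_l1(p - t) for p, t in zip(pred, regression_targets[i]))
    return cls_loss, box_loss / n
```

Dividing the box loss by all sampled proposals rather than only the positives keeps its scale tied to the fixed sampling batch size.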
At inference time the outputs are post-processed instead. The code is postprocess_detections in roi_heads.py:
import torch
import torch.nn.functional as F
from torchvision.ops import boxes as box_ops

def postprocess_detections(self, class_logits, box_regression, proposals, image_shapes):
    device = class_logits.device
    num_classes = class_logits.shape[-1]
    # proposals are the proposal regions produced by the RPN,
    # passed as a list with one element per image in the batch
    # boxes_in_image: P_num × 4, where P_num is the number of proposals for that image
    # boxes_per_image holds the number of proposals in each image
    boxes_per_image = [len(boxes_in_image) for boxes_in_image in proposals]
    # decode the predicted box_regression relative to the proposals into final box coordinates (one box per class)
    pred_boxes = self.box_coder.decode(box_regression, proposals)
    # pred_scores are the classification scores
    pred_scores = F.softmax(class_logits, -1)
    # split boxes and scores per image
    pred_boxes = pred_boxes.split(boxes_per_image, 0)
    pred_scores = pred_scores.split(boxes_per_image, 0)
    all_boxes = []
    all_scores = []
    all_labels = []
    for boxes, scores, image_shape in zip(pred_boxes, pred_scores, image_shapes):
        # clip the boxes to the image boundaries
        boxes = box_ops.clip_boxes_to_image(boxes, image_shape)
        # labels size: num_proposals × num_classes
        # each row of labels holds the class ids 0 .. num_classes - 1
        labels = torch.arange(num_classes, device=device)
        labels = labels.view(1, -1).expand_as(scores)
        # class id 0 is background; drop it at inference
        boxes = boxes[:, 1:]
        scores = scores[:, 1:]
        labels = labels[:, 1:]
        # reshape boxes, scores and labels to N × 4, N and N,
        # where N = num_proposals × (num_classes - 1),
        # so that every (proposal, class) pair is treated as a separate detection
        boxes = boxes.reshape(-1, 4)
        scores = scores.flatten()
        labels = labels.flatten()
        # keep only results above the score threshold
        inds = torch.nonzero(scores > self.score_thresh).squeeze(1)
        boxes, scores, labels = boxes[inds], scores[inds], labels[inds]
        # remove boxes whose width or height is below 0.01
        keep = box_ops.remove_small_boxes(boxes, min_size=1e-2)
        boxes, scores, labels = boxes[keep], scores[keep], labels[keep]
        # NMS on the boxes; since labels is passed in, NMS is done separately for each class
        keep = box_ops.batched_nms(boxes, scores, labels, self.nms_thresh)
        # keep only the top detections_per_img results (100 by default)
        keep = keep[:self.detections_per_img]
        boxes, scores, labels = boxes[keep], scores[keep], labels[keep]
        all_boxes.append(boxes)
        all_scores.append(scores)
        all_labels.append(labels)
    return all_boxes, all_scores, all_labels
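The core of this loop (per-class flattening, score threshold, per-class NMS, top-k) can be sketched in pure Python. This sketch assumes torchvision's default thresholds (score 0.05, NMS IoU 0.5, 100 detections). It also assumes the boxes are already decoded and clipped, so coordinates are non-negative, and it skips the small-box filter. Per-class NMS is implemented with the coordinate-offset trick: each class's boxes are shifted by `label * (max_coordinate + 1)` so that boxes of different classes can never overlap, and then ordinary NMS is run once. All function names are hypothetical.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thresh):
    """Greedy NMS; returns kept indices sorted by decreasing score."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep


def batched_nms_sketch(boxes, scores, labels, iou_thresh):
    """Per-class NMS via the coordinate-offset trick (coordinates assumed >= 0)."""
    if not boxes:
        return []
    offset = max(max(b) for b in boxes) + 1
    shifted = [[c + lab * offset for c in b] for b, lab in zip(boxes, labels)]
    return nms(shifted, scores, iou_thresh)


def postprocess_sketch(boxes_per_class, probs, score_thresh=0.05,
                       nms_thresh=0.5, detections_per_img=100):
    """boxes_per_class[i][c]: decoded box of proposal i for class c; probs[i][c]: softmax score."""
    boxes, scores, labels = [], [], []
    for i, row in enumerate(probs):
        for c in range(1, len(row)):  # class 0 is background and is skipped
            if row[c] > score_thresh:
                boxes.append(boxes_per_class[i][c])
                scores.append(row[c])
                labels.append(c)
    keep = batched_nms_sketch(boxes, scores, labels, nms_thresh)[:detections_per_img]
    return [boxes[k] for k in keep], [scores[k] for k in keep], [labels[k] for k in keep]
```

Because a single proposal can pass the threshold for several classes, the same region may appear once per class in the final output. Per-class NMS only suppresses duplicates within a class.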
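At this point, by processing the feature maps and proposals fed into roi_heads, we have obtained the final detections (or, during training, the losses). To summarize, roi_heads mainly does the following: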