RPN is the signature component of two-stage detectors, and it is itself a binary-classification object detection network. Within the overall Faster R-CNN architecture it is where anchors, box regression, and classification all come together, and this post walks through it in detail.
The RPN code lives in torchvision/models/detection/rpn.py, which defines three modules: RPNHead, AnchorGenerator, and RegionProposalNetwork.
Contents
AnchorGenerator
RegionProposalNetwork
BoxCoder
encode
decode
filter_proposals
assign_targets_to_anchors
compute_loss
RPNHead
The definition of AnchorGenerator:
Module that generates anchors for a set of feature maps and image sizes.
As the name suggests, the main job of AnchorGenerator is to generate anchors corresponding to the feature maps.
Input parameters: sizes and aspect_ratios.
Based on the number of sizes and the number of aspect_ratios, a fixed number of anchors is generated at every position of each feature map.
Also note that AnchorGenerator inherits from nn.Module and therefore has a forward() function, and that forward's inputs include an ImageList. The ImageList type is defined in torchvision/models/detection/image_list.py.
Here is a quick example to test it:
import torchvision.models.detection.rpn as rpn
import torchvision.models.detection.image_list as image_list
import torch
# Create an AnchorGenerator instance
anchor_generator = rpn.AnchorGenerator()
# Build the ImageList
batched_images = torch.Tensor(8, 3, 640, 640)
image_sizes = [(640, 640)] * 8
image_list_ = image_list.ImageList(batched_images, image_sizes)
# Build the feature maps (three levels: 80×80, 160×160, 320×320)
feature_maps = [torch.Tensor(8, 256, 80, 80), torch.Tensor(8, 256, 160, 160), torch.Tensor(8, 256, 320, 320)]
# Generate the anchors
anchors = anchor_generator(image_list_,feature_maps)
Check the generated anchors:
>>> print(type(anchors))
<class 'list'>
>>> for anchor in anchors:
... print(anchor.shape)
...
torch.Size([403200, 4])
torch.Size([403200, 4])
torch.Size([403200, 4])
torch.Size([403200, 4])
torch.Size([403200, 4])
torch.Size([403200, 4])
torch.Size([403200, 4])
torch.Size([403200, 4])
From the output we can see that anchors is a list with one element per image in the batch (8 here), and each element is a 403200 × 4 tensor.
Let's analyze why each tensor in anchors has size 403200 × 4:
First, AnchorGenerator()'s default parameters are aspect_ratios=(0.5, 1.0, 2.0) and sizes=(128, 256, 512). Since aspect_ratios contains 3 values, 3 anchors are generated at every feature-map position, giving 80×80×3 + 160×160×3 + 320×320×3 = 403200 anchors in total.
Note that each value in sizes is the base anchor size for one feature level. In the example, each grid cell of the 80×80 feature map uses a base anchor size of 128, and according to aspect_ratios three anchor shapes are generated: roughly 128·√2 × 128/√2, 128 × 128, and 128/√2 × 128·√2 (width × height).
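To make this concrete, here is a minimal sketch of how the per-cell anchor shapes are derived, mirroring the logic of AnchorGenerator's generate_anchors (variable names are my own):
import torch

scales = torch.tensor([128.0])
aspect_ratios = torch.tensor([0.5, 1.0, 2.0])

h_ratios = torch.sqrt(aspect_ratios)   # height scaling per aspect ratio
w_ratios = 1.0 / h_ratios              # width scaling per aspect ratio
ws = (w_ratios[:, None] * scales[None, :]).view(-1)
hs = (h_ratios[:, None] * scales[None, :]).view(-1)

# base anchors centered at the origin, in (x1, y1, x2, y2) form
base_anchors = (torch.stack([-ws, -hs, ws, hs], dim=1) / 2).round()
print(base_anchors)
# approximately [[-91., -45., 91., 45.],   # wide box,   ratio 0.5
#                [-64., -64., 64., 64.],   # square box, ratio 1.0
#                [-45., -91., 45., 91.]]   # tall box,   ratio 2.0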
RegionProposalNetwork is the main body of the RPN. It integrates AnchorGenerator and RPNHead, and its functionality covers anchor generation, matching anchors to ground truth, NMS, and computing the regression and classification losses.
Its parameters are defined as follows:
anchor_generator: the AnchorGenerator to use
head: the module that produces regression deltas and objectness scores from the features
fg_iou_thresh: anchors with IoU above this threshold are considered foreground
bg_iou_thresh: anchors with IoU below this threshold are considered background
batch_size_per_image: number of anchors sampled per image when computing the loss
positive_fraction: maximum fraction of positive anchors among the sampled anchors
pre_nms_top_n: number of top-scoring proposals kept before NMS
post_nms_top_n: number of proposals kept after NMS
nms_thresh: IoU threshold used for NMS
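To tie the parameters together, here is a hypothetical construction of the module; the argument values are purely illustrative and simply mirror the Faster R-CNN defaults in torchvision:
import torchvision.models.detection.rpn as rpn

anchor_generator = rpn.AnchorGenerator(sizes=(128, 256, 512),
                                       aspect_ratios=(0.5, 1.0, 2.0))
head = rpn.RPNHead(in_channels=256,
                   num_anchors=anchor_generator.num_anchors_per_location()[0])
rpn_module = rpn.RegionProposalNetwork(
    anchor_generator, head,
    fg_iou_thresh=0.7, bg_iou_thresh=0.3,             # IoU thresholds for fg / bg
    batch_size_per_image=256, positive_fraction=0.5,  # anchor sampling for the loss
    pre_nms_top_n=dict(training=2000, testing=1000),
    post_nms_top_n=dict(training=2000, testing=1000),
    nms_thresh=0.7,
)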
Since there are quite a few components, let's start from forward to see how the RPN is implemented. First, the definition of forward:
def forward(self, images, features, targets=None):
    features = list(features.values())
    # Run the features through the head to get the predictions:
    # objectness scores and box regression deltas
    objectness, pred_bbox_deltas = self.head(features)
    # Generate the anchors
    anchors = self.anchor_generator(images, features)
    num_images = len(anchors)
    num_anchors_per_level = [o[0].numel() for o in objectness]
    # After concatenation, objectness and pred_bbox_deltas become single tensors
    # objectness size: (batch_size × anchors_per_image) × 1
    # pred_bbox_deltas size: (batch_size × anchors_per_image) × 4
    objectness, pred_bbox_deltas = \
        concat_box_prediction_layers(objectness, pred_bbox_deltas)
    # Decode the deltas into (x1, y1, x2, y2) box coordinates stored in proposals
    proposals = self.box_coder.decode(pred_bbox_deltas.detach(), anchors)
    proposals = proposals.view(num_images, -1, 4)
    # Filter the proposals: sort all proposals by score, then run NMS
    # The results are lists holding, for each image in the batch, the kept boxes and their scores
    boxes, scores = self.filter_proposals(proposals, objectness, images.image_sizes, num_anchors_per_level)
    losses = {}
    if self.training:
        # During training, compute the IoU between anchors and gt boxes
        # to decide whether each anchor is positive / negative / ignored
        labels, matched_gt_boxes = self.assign_targets_to_anchors(anchors, targets)
        # For positive anchors, encode the matched gt boxes as regression deltas
        regression_targets = self.box_coder.encode(matched_gt_boxes, anchors)
        # Compute the classification and regression losses
        loss_objectness, loss_rpn_box_reg = self.compute_loss(
            objectness, pred_bbox_deltas, labels, regression_targets)
        losses = {
            "loss_objectness": loss_objectness,
            "loss_rpn_box_reg": loss_rpn_box_reg,
        }
    return boxes, losses
BoxCoder is defined in torchvision/models/detection/_utils.py. Its main job is converting between ground-truth box coordinates and regression deltas (encode and decode). If a box's coordinates are (x1, y1, x2, y2) and the deltas are (dx, dy, dw, dh), then:
encode converts a bbox into deltas, while decode converts deltas back into a bbox. For the exact conversion, refer to the usual anchor-based detection references, e.g. 一文读懂Faster RCNN.
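Concretely, the standard Faster R-CNN parameterization looks like this; the following is a minimal single-box sketch with my own function names (torchvision's BoxCoder implements the same idea, vectorized and with per-coordinate weights):
import torch

def encode_single(gt, anchor):
    # gt, anchor: tensors holding (x1, y1, x2, y2)
    wa, ha = anchor[2] - anchor[0], anchor[3] - anchor[1]
    xa, ya = anchor[0] + 0.5 * wa, anchor[1] + 0.5 * ha
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]
    xg, yg = gt[0] + 0.5 * wg, gt[1] + 0.5 * hg
    dx, dy = (xg - xa) / wa, (yg - ya) / ha          # center offsets, normalized by anchor size
    dw, dh = torch.log(wg / wa), torch.log(hg / ha)  # log scale changes
    return torch.stack([dx, dy, dw, dh])

def decode_single(deltas, anchor):
    # Inverse of encode_single: apply (dx, dy, dw, dh) to an anchor
    wa, ha = anchor[2] - anchor[0], anchor[3] - anchor[1]
    xa, ya = anchor[0] + 0.5 * wa, anchor[1] + 0.5 * ha
    dx, dy, dw, dh = deltas
    xc, yc = dx * wa + xa, dy * ha + ya
    w, h = torch.exp(dw) * wa, torch.exp(dh) * ha
    return torch.stack([xc - 0.5 * w, yc - 0.5 * h, xc + 0.5 * w, yc + 0.5 * h])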
BoxCoder has two important functions: encode() and decode().
encode() converts a bbox's four coordinates (x1, y1, x2, y2) into a delta (dx, dy, dw, dh). Note the signature of encode:
def encode(self, reference_boxes, proposals)
decode() decodes the pred_bbox_deltas output by the network. Its signature is:
def decode(self, rel_codes, boxes)
Note: boxes must be a list or tuple whose length equals the batch size; each element is a tensor whose size matches the AnchorGenerator output described earlier. rel_codes is the tensor of deltas obtained after concatenation; its total number of rows must equal the total number of anchors in boxes, i.e. (batch_size × anchors_per_image) rows of 4 deltas each, so the sizes of rel_codes and boxes have to stay consistent.
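As a small, hypothetical illustration of these shape requirements (the box values are made up):
import torch
import torchvision.models.detection._utils as det_utils

box_coder = det_utils.BoxCoder(weights=(1.0, 1.0, 1.0, 1.0))
batch_size = 2
# boxes: a list with one anchor tensor per image
boxes = [torch.tensor([[0., 0., 10., 10.],
                       [5., 5., 20., 20.],
                       [0., 0., 50., 50.]]) for _ in range(batch_size)]
# rel_codes: the concatenated deltas, one row of 4 values per anchor
rel_codes = torch.zeros(batch_size * 3, 4)
proposals = box_coder.decode(rel_codes, boxes)
print(proposals.shape)  # torch.Size([6, 1, 4]); zero deltas give back (roughly) the anchors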
Here is how decode is used inside the RPN:
# objectness: (B × anchors_per_image) × 1
# pred_bbox_deltas: (B × anchors_per_image) × 4
objectness, pred_bbox_deltas = \
    concat_box_prediction_layers(objectness, pred_bbox_deltas)
# decode converts the deltas plus their anchors into bbox coordinates stored in proposals
# proposals: (B × anchors_per_image) × 1 × 4
proposals = self.box_coder.decode(pred_bbox_deltas.detach(), anchors)
proposals = proposals.view(num_images, -1, 4)
filter_proposals is defined in rpn.py. After box decoding, the pred_bbox_deltas output by the network have been converted into proposals in bbox-coordinate form. Among these predicted proposals, some rank low in objectness score and some overlap heavily with each other, so NMS is needed. filter_proposals takes care of this: it keeps only the top-scoring proposals and removes redundant overlapping boxes. Here is the function with some annotations:
def filter_proposals(self, proposals, objectness, image_shapes, num_anchors_per_level):
    num_images = proposals.shape[0]
    device = proposals.device
    # do not backprop throught objectness
    objectness = objectness.detach()
    objectness = objectness.reshape(num_images, -1)
    levels = [
        torch.full((n,), idx, dtype=torch.int64, device=device)
        for idx, n in enumerate(num_anchors_per_level)
    ]
    levels = torch.cat(levels, 0)
    levels = levels.reshape(1, -1).expand_as(objectness)
    # For each image, get the indices of the top_n objectness scores per feature level
    # top_n_idx: batch_size × id
    top_n_idx = self._get_top_n_idx(objectness, num_anchors_per_level)
    batch_idx = torch.arange(num_images, device=device)[:, None]
    # Using the top_n indices, keep the top_n objectness scores of each image
    objectness = objectness[batch_idx, top_n_idx]
    # levels has the same size as objectness and proposals
    # levels stores the index of the feature level each result came from
    levels = levels[batch_idx, top_n_idx]
    # Using the top_n indices, keep the top_n bbox predictions of each image
    proposals = proposals[batch_idx, top_n_idx]
    final_boxes = []
    final_scores = []
    # Process each image in the batch separately
    for boxes, scores, lvl, img_shape in zip(proposals, objectness, levels, image_shapes):
        # Clip boxes so they do not fall outside the image
        boxes = box_ops.clip_boxes_to_image(boxes, img_shape)
        # Remove boxes smaller than min_size
        keep = box_ops.remove_small_boxes(boxes, self.min_size)
        boxes, scores, lvl = boxes[keep], scores[keep], lvl[keep]
        # Non-maximum suppression per level; returns the indices of the boxes
        # kept after suppressing overlaps above nms_thresh
        keep = box_ops.batched_nms(boxes, scores, lvl, self.nms_thresh)
        # Keep only the first post_nms_top_n results
        keep = keep[:self.post_nms_top_n]
        boxes, scores = boxes[keep], scores[keep]
        final_boxes.append(boxes)
        final_scores.append(scores)
    return final_boxes, final_scores
During training, we need to compute the IoU between anchors and ground truth to decide which anchors are foreground and which are background. assign_targets_to_anchors does exactly this. Here is how it is implemented:
def assign_targets_to_anchors(self, anchors, targets):
    labels = []
    matched_gt_boxes = []
    for anchors_per_image, targets_per_image in zip(anchors, targets):
        gt_boxes = targets_per_image["boxes"]
        # box_similarity computes the IoU between the gt boxes and the anchors
        # gt_boxes: M × 4
        # anchors_per_image: N × 4
        # The result is an M × N matrix; each entry is the IoU between a gt box and an anchor
        match_quality_matrix = self.box_similarity(gt_boxes, anchors_per_image)
        # proposal_matcher matches every anchor to a gt box
        # Its input is the M × N IoU matrix match_quality_matrix
        # For each of the N anchors: if its IoU with a gt box exceeds high_threshold,
        # it is positive, and the index of the gt box with the highest IoU is assigned to it
        # If its IoU with every gt box is below low_threshold, it is negative and gets -1
        # IoUs between low_threshold and high_threshold are ignored and get -2
        # So matched_idxs is a tensor of length N whose values are a gt index, -1, or -2
        matched_idxs = self.proposal_matcher(match_quality_matrix)
        # matched_gt_boxes_per_image: N × 4
        # matched_gt_boxes_per_image stores the gt box assigned to each anchor
        matched_gt_boxes_per_image = gt_boxes[matched_idxs.clamp(min=0)]
        labels_per_image = matched_idxs >= 0
        labels_per_image = labels_per_image.to(dtype=torch.float32)
        # Background anchors are labeled 0
        bg_indices = matched_idxs == self.proposal_matcher.BELOW_LOW_THRESHOLD
        labels_per_image[bg_indices] = 0
        # Ignored anchors are labeled -1
        inds_to_discard = matched_idxs == self.proposal_matcher.BETWEEN_THRESHOLDS
        labels_per_image[inds_to_discard] = -1
        labels.append(labels_per_image)
        matched_gt_boxes.append(matched_gt_boxes_per_image)
    return labels, matched_gt_boxes
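To make the positive / negative / ignore assignment concrete, here is a tiny hypothetical run of the Matcher used as proposal_matcher; the IoU values are made up and the thresholds mirror the usual RPN defaults of 0.7 and 0.3:
import torch
import torchvision.models.detection._utils as det_utils

# Hypothetical IoU matrix: 2 gt boxes × 4 anchors
iou = torch.tensor([[0.80, 0.10, 0.45, 0.05],
                    [0.20, 0.75, 0.55, 0.10]])
matcher = det_utils.Matcher(high_threshold=0.7, low_threshold=0.3,
                            allow_low_quality_matches=True)
print(matcher(iou))
# tensor([ 0,  1, -2, -1]): gt index, gt index, BETWEEN_THRESHOLDS (-2), BELOW_LOW_THRESHOLD (-1)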
assign_targets_to_anchors returns two lists with one entry per image: labels (1 for positive anchors, 0 for background, -1 for ignored anchors) and matched_gt_boxes (the gt box assigned to each anchor).
compute_loss computes the loss between these targets and the network predictions: the regression loss uses smooth L1 and the classification loss uses binary cross entropy.
It's worth noting that fg_bg_sampler limits the number of anchors used when computing the loss. The specific approach follows the Faster R-CNN paper:
Instead, we randomly sample 256 anchors in an image to compute the loss function of a mini-batch, where the sampled positive and negative anchors have a ratio of up to 1:1. If there are fewer than 128 positive samples in an image, we pad the mini-batch with negative ones.
That is, the total number of positive and negative anchors is kept at 256 with a positive-to-negative ratio of at most 1:1; when there are fewer than 128 positive anchors, the mini-batch is padded with negatives.
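For reference, here is a condensed sketch of the steps inside compute_loss based on the description above; the real implementation in rpn.py differs in details (for example it uses a smooth-L1 variant with a small beta):
import torch
import torch.nn.functional as F

def compute_loss_sketch(self, objectness, pred_bbox_deltas, labels, regression_targets):
    # Sample a fixed number of positive/negative anchors per image (fg_bg_sampler)
    sampled_pos_inds, sampled_neg_inds = self.fg_bg_sampler(labels)
    sampled_pos_inds = torch.where(torch.cat(sampled_pos_inds, dim=0))[0]
    sampled_neg_inds = torch.where(torch.cat(sampled_neg_inds, dim=0))[0]
    sampled_inds = torch.cat([sampled_pos_inds, sampled_neg_inds], dim=0)

    objectness = objectness.flatten()
    labels = torch.cat(labels, dim=0)
    regression_targets = torch.cat(regression_targets, dim=0)

    # Regression loss: smooth L1 on the positive anchors only
    box_loss = F.smooth_l1_loss(
        pred_bbox_deltas[sampled_pos_inds],
        regression_targets[sampled_pos_inds],
        reduction="sum",
    ) / sampled_inds.numel()

    # Classification loss: binary cross entropy on the sampled anchors
    objectness_loss = F.binary_cross_entropy_with_logits(
        objectness[sampled_inds], labels[sampled_inds]
    )
    return objectness_loss, box_loss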
RPNHead is comparatively simple: it applies a convolution and a ReLU activation to each feature map and then produces the classification (objectness) and regression outputs.
class RPNHead(nn.Module):
    def __init__(self, in_channels, num_anchors):
        super(RPNHead, self).__init__()
        # 3×3 conv shared by the classification and regression branches
        self.conv = nn.Conv2d(
            in_channels, in_channels, kernel_size=3, stride=1, padding=1
        )
        # one objectness logit per anchor per position
        self.cls_logits = nn.Conv2d(in_channels, num_anchors, kernel_size=1, stride=1)
        # four regression deltas per anchor per position
        self.bbox_pred = nn.Conv2d(
            in_channels, num_anchors * 4, kernel_size=1, stride=1
        )

        for l in self.children():
            torch.nn.init.normal_(l.weight, std=0.01)
            torch.nn.init.constant_(l.bias, 0)

    def forward(self, x):
        logits = []
        bbox_reg = []
        # x is a list of feature maps, one per level
        for feature in x:
            t = F.relu(self.conv(feature))
            logits.append(self.cls_logits(t))
            bbox_reg.append(self.bbox_pred(t))
        return logits, bbox_reg
You can examine the structure of RPNHead yourself against the model and the figure.
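As a quick sanity check, a hypothetical run of RPNHead on feature maps like those in the earlier AnchorGenerator example shows the shapes of both outputs:
import torch
import torchvision.models.detection.rpn as rpn

head = rpn.RPNHead(in_channels=256, num_anchors=3)
feature_maps = [torch.rand(8, 256, s, s) for s in (80, 160, 320)]
logits, bbox_reg = head(feature_maps)
print([tuple(l.shape) for l in logits])    # [(8, 3, 80, 80), (8, 3, 160, 160), (8, 3, 320, 320)]
print([tuple(b.shape) for b in bbox_reg])  # [(8, 12, 80, 80), (8, 12, 160, 160), (8, 12, 320, 320)]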
This wraps up the walkthrough of the overall RPN architecture; all of the code above can be found in the torchvision source.