The image is resized, and the corresponding boxes and masks are resized with it. Box coordinates are simply multiplied by the scale factor; the image is resized with bilinear interpolation, while the masks use nearest-neighbor interpolation.
The image may also need to be enlarged by padding, i.e.:
new_img = np.zeros((3, new_x, new_y))
new_img[:, :x_max, :y_max] = old_img
This corresponds to the following line in generalized_rcnn.py:
images, targets = self.transform(images, targets)
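As a minimal sketch (not the torchvision implementation; the function name and signature are illustrative), the resizing of image, boxes and masks could look like this:
import torch
import torch.nn.functional as F

def resize_sample(image, boxes, masks, scale):
    # image: (3, H, W); boxes: (N, 4) in (x1, y1, x2, y2); masks: (N, H, W)
    # bilinear for the image
    new_image = F.interpolate(image[None], scale_factor=scale,
                              mode="bilinear", align_corners=False)[0]
    # box coordinates are simply multiplied by the scale factor
    new_boxes = boxes * scale
    # nearest for the masks so the values stay binary
    new_masks = F.interpolate(masks[None].float(), scale_factor=scale,
                              mode="nearest")[0]
    return new_image, new_boxes, new_masks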
The resized images are passed through the backbone (resnet_fpn) to obtain a list of feature maps. This corresponds to the following line in generalized_rcnn.py:
features = self.backbone(images.tensors)
The backbone first runs the images through an IntermediateLayerGetter to obtain ResNet feature maps at different resolutions:
IntermediateLayerGetter(
(conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
(bn1): FrozenBatchNorm2d(64)
(relu): ReLU(inplace=True)
(maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
(layer1): Sequential(
(0): Bottleneck(
(conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d(64)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): FrozenBatchNorm2d(64)
(conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d(256)
(relu): ReLU(inplace=True)
(downsample): Sequential(
(0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): FrozenBatchNorm2d(256)
)
)
(1): Bottleneck()
(2): Bottleneck()
)
(layer2): Sequential(
(0): Bottleneck(
(conv1): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d(128)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn2): FrozenBatchNorm2d(128)
(conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d(512)
(relu): ReLU(inplace=True)
(downsample): Sequential(
(0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): FrozenBatchNorm2d(512)
)
)
(1): Bottleneck()
(2): Bottleneck()
(3): Bottleneck()
)
(layer3): Sequential(
(0): Bottleneck(
(conv1): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d(256)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn2): FrozenBatchNorm2d(256)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d(1024)
(relu): ReLU(inplace=True)
(downsample): Sequential(
(0): Conv2d(512, 1024, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): FrozenBatchNorm2d(1024)
)
)
(1): Bottleneck()
(2): Bottleneck()
(3): Bottleneck()
(4): Bottleneck()
(5): Bottleneck()
)
(layer4): Sequential(
(0): Bottleneck(
(conv1): Conv2d(1024, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d(512)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn2): FrozenBatchNorm2d(512)
(conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d(2048)
(relu): ReLU(inplace=True)
(downsample): Sequential(
(0): Conv2d(1024, 2048, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): FrozenBatchNorm2d(2048)
)
)
(1): Bottleneck()
(2): Bottleneck()
)
)
The returned out dict holds the output features of layer1, layer2, layer3 and layer4. Assuming the image fed to the backbone has size (3, 800, 1312), the output features are:
[
Tensor(torch.Size([1, 256, 200, 328])),
Tensor(torch.Size([1, 512, 100, 164])),
Tensor(torch.Size([1, 1024, 50, 82])),
Tensor(torch.Size([1, 2048, 25, 41]))
]
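A small sketch of how such features can be obtained with torchvision's IntermediateLayerGetter (the return_layers mapping follows resnet_fpn_backbone; the weights here are random):
import torch
import torchvision
from torchvision.models._utils import IntermediateLayerGetter

resnet = torchvision.models.resnet50()
# map the outputs of layer1..layer4 to the keys '0'..'3'
return_layers = {"layer1": "0", "layer2": "1", "layer3": "2", "layer4": "3"}
body = IntermediateLayerGetter(resnet, return_layers=return_layers)

x = torch.rand(1, 3, 800, 1312)
out = body(x)
# out is an OrderedDict with four tensors of 256/512/1024/2048 channels
print([v.shape for v in out.values()])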
The resulting feature list is then fed into the FPN network (the tensors above are the FPN's inputs; the out list below is the FPN's output):
out =
[
Tensor(torch.Size([1, 256, 200, 328])),
Tensor(torch.Size([1, 256, 100, 164])),
Tensor(torch.Size([1, 256, 50, 82])),
Tensor(torch.Size([1, 256, 25, 41])),
Tensor(torch.Size([1, 256, 13, 21]))
]
All of this happens inside the following line in generalized_rcnn.py:
features = self.backbone(images.tensors)
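A sketch of running these four feature maps through torchvision's FeaturePyramidNetwork with the extra max-pool level, using the shapes from the example above:
import torch
from collections import OrderedDict
from torchvision.ops import FeaturePyramidNetwork
from torchvision.ops.feature_pyramid_network import LastLevelMaxPool

fpn = FeaturePyramidNetwork(
    in_channels_list=[256, 512, 1024, 2048],  # channels of layer1..layer4
    out_channels=256,
    extra_blocks=LastLevelMaxPool(),  # adds the extra 13x21 'pool' level
)

feats = OrderedDict([
    ("0", torch.rand(1, 256, 200, 328)),
    ("1", torch.rand(1, 512, 100, 164)),
    ("2", torch.rand(1, 1024, 50, 82)),
    ("3", torch.rand(1, 2048, 25, 41)),
])
out = fpn(feats)
print([v.shape for v in out.values()])  # five levels, all with 256 channels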
Next comes proposals, proposal_losses = self.rpn(images, features, targets).
In rpn.py, each feature map is first passed through an RPNHead, corresponding to
objectness, pred_bbox_deltas = self.head(features)
objectness =
[
Tensor(torch.Size([1, 3, 200, 328])),
Tensor(torch.Size([1, 3, 100, 164])),
Tensor(torch.Size([1, 3, 50, 82])),
Tensor(torch.Size([1, 3, 25, 41])),
Tensor(torch.Size([1, 3, 13, 21]))
]
pred_bbox_deltas =
[
Tensor(torch.Size([1, 12, 200, 328])),
Tensor(torch.Size([1, 12, 100, 164])),
Tensor(torch.Size([1, 12, 50, 82])),
Tensor(torch.Size([1, 12, 25, 41])),
Tensor(torch.Size([1, 12, 13, 21]))
]
Here 3 is the number of anchors per location, determined by num_anchors_per_location in anchor_utils.py.
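The head itself is small; below is an illustrative re-implementation consistent with torchvision's RPNHead (a shared 3×3 conv followed by two 1×1 convs for objectness and box deltas):
import torch
from torch import nn
import torch.nn.functional as F

class SimpleRPNHead(nn.Module):
    # illustrative re-implementation, not the torchvision class itself
    def __init__(self, in_channels=256, num_anchors=3):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        self.cls_logits = nn.Conv2d(in_channels, num_anchors, 1)      # objectness
        self.bbox_pred = nn.Conv2d(in_channels, num_anchors * 4, 1)   # box deltas

    def forward(self, features):
        objectness, deltas = [], []
        for f in features:
            t = F.relu(self.conv(f))
            objectness.append(self.cls_logits(t))  # (N, 3, H, W)
            deltas.append(self.bbox_pred(t))        # (N, 12, H, W)
        return objectness, deltas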
Anchor generation is handled by the AnchorGenerator class in anchor_utils.py.
For each size in (32, 64, 128, 256, 512) and each aspect_ratio in (0.5, 1.0, 2.0), a set of base anchors is generated. Taking size=32 as an example:
tensor([[-22.6274, -11.3137, 22.6274, 11.3137],
[-16.0000, -16.0000, 16.0000, 16.0000],
[-11.3137, -22.6274, 11.3137, 22.6274]])
which is then rounded:
tensor([[-23., -11., 23., 11.],
[-16., -16., 16., 16.],
[-11., -23., 11., 23.]])
The base anchors generated for sizes (32, 64, 128, 256, 512) are therefore:
[
tensor([[-23., -11., 23., 11.],
[-16., -16., 16., 16.],
[-11., -23., 11., 23.]]),
tensor([[-45., -23., 45., 23.],
[-32., -32., 32., 32.],
[-23., -45., 23., 45.]]),
tensor([[-91., -45., 91., 45.],
[-64., -64., 64., 64.],
[-45., -91., 45., 91.]]),
tensor([[-181., -91., 181., 91.],
[-128., -128., 128., 128.],
[ -91., -181., 91., 181.]]),
tensor([[-362., -181., 362., 181.],
[-256., -256., 256., 256.],
[-181., -362., 181., 362.]])
]
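These base anchors come from keeping the area fixed at size² while varying the aspect ratio; a sketch of the computation, mirroring AnchorGenerator.generate_anchors:
import torch

def generate_base_anchors(size, aspect_ratios=(0.5, 1.0, 2.0)):
    scales = torch.tensor([float(size)])
    ratios = torch.tensor(aspect_ratios)
    h_ratios = torch.sqrt(ratios)
    w_ratios = 1.0 / h_ratios
    ws = (w_ratios[:, None] * scales[None, :]).view(-1)
    hs = (h_ratios[:, None] * scales[None, :]).view(-1)
    # (x1, y1, x2, y2) centred at the origin, then rounded
    base_anchors = torch.stack([-ws, -hs, ws, hs], dim=1) / 2
    return base_anchors.round()

print(generate_base_anchors(32))
# tensor([[-23., -11.,  23.,  11.],
#         [-16., -16.,  16.,  16.],
#         [-11., -23.,  11.,  23.]])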
This corresponds to self.set_cell_anchors(dtype, device) in anchor_utils.py. The final anchors are then generated by cached_grid_anchors in anchor_utils.py.
For each feature map, e.g. size (200, 328) with stride (4, 4) relative to the original image, and base anchors:
tensor([[-23., -11., 23., 11.],
[-16., -16., 16., 16.],
[-11., -23., 11., 23.]])
the corresponding grid anchors are generated. The number of anchors per level is therefore:
anchors = [
200*328*3=196800
100*164*3=49200
50*82*3=12300
25*41*3=3075
13*21*3=819
]
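A sketch of how each level's anchors are produced, mirroring AnchorGenerator.grid_anchors: the base anchors are replicated at every feature-map cell by adding a grid of (x, y) shifts scaled by the stride:
import torch

def grid_anchors(grid_size, stride, base_anchors):
    gh, gw = grid_size          # e.g. (200, 328)
    sh, sw = stride             # e.g. (4, 4)
    shifts_x = torch.arange(0, gw, dtype=torch.float32) * sw
    shifts_y = torch.arange(0, gh, dtype=torch.float32) * sh
    shift_y, shift_x = torch.meshgrid(shifts_y, shifts_x, indexing="ij")
    shifts = torch.stack([shift_x.reshape(-1), shift_y.reshape(-1),
                          shift_x.reshape(-1), shift_y.reshape(-1)], dim=1)
    # (H*W, 1, 4) + (1, A, 4) -> (H*W*A, 4)
    return (shifts[:, None, :] + base_anchors[None, :, :]).reshape(-1, 4)

base = torch.tensor([[-23., -11., 23., 11.],
                     [-16., -16., 16., 16.],
                     [-11., -23., 11., 23.]])
print(grid_anchors((200, 328), (4, 4), base).shape)  # torch.Size([196800, 4])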
This corresponds to the following line in anchor_utils.py:
中的anchors_over_all_feature_maps = self.cached_grid_anchors(grid_sizes, strides)
Finally, via the following line in anchor_utils.py:
中的anchors = [torch.cat(anchors_per_image) for anchors_per_image in anchors]
all of the per-level anchors are concatenated into a single tensor of torch.Size([262194, 4]).
That is the entire process behind anchors = self.anchor_generator(images, features) in rpn.py.
Next, permute_and_flatten is applied to the objectness and pred_bbox_deltas above: (1, 3, 200, 328) becomes torch.Size([1, 196800, 1]), and torch.Size([1, 12, 200, 328]) becomes torch.Size([1, 196800, 4]):
def permute_and_flatten(layer, N, A, C, H, W):
# type: (Tensor, int, int, int, int, int) -> Tensor
layer = layer.view(N, -1, C, H, W)
layer = layer.permute(0, 3, 4, 1, 2)
layer = layer.reshape(N, -1, C)
return layer
Therefore
box_cls_flattened = [torch.Size([1, 196800, 1]), torch.Size([1, 49200, 1]), torch.Size([1, 12300, 1]), torch.Size([1, 3075, 1]), torch.Size([1, 819, 1])]
Finally a concat turns these into torch.Size([262194, 1]) and torch.Size([262194, 4]), which are the raw outputs of the RPN network. That is the entire process behind objectness, pred_bbox_deltas = concat_box_prediction_layers(objectness, pred_bbox_deltas) in rpn.py.
Here objectness and pred_bbox_deltas are, respectively, the confidence score and the box-regression parameters for each anchor.
The decode function then shifts and scales each anchor to produce actual box coordinates. This is done by decode_single in _utils.py: starting from the originally generated anchors, the RPN outputs are used as translation and scaling coefficients to obtain the decoded boxes.
def decode_single(self, rel_codes, boxes):
"""
From a set of original boxes and encoded relative box offsets,
get the decoded boxes.
Arguments:
rel_codes (Tensor): encoded boxes
boxes (Tensor): reference boxes.
"""
boxes = boxes.to(rel_codes.dtype)
widths = boxes[:, 2] - boxes[:, 0]
heights = boxes[:, 3] - boxes[:, 1]
ctr_x = boxes[:, 0] + 0.5 * widths
ctr_y = boxes[:, 1] + 0.5 * heights
wx, wy, ww, wh = self.weights
dx = rel_codes[:, 0::4] / wx
dy = rel_codes[:, 1::4] / wy
dw = rel_codes[:, 2::4] / ww
dh = rel_codes[:, 3::4] / wh
# Prevent sending too large values into torch.exp()
dw = torch.clamp(dw, max=self.bbox_xform_clip)
dh = torch.clamp(dh, max=self.bbox_xform_clip)
pred_ctr_x = dx * widths[:, None] + ctr_x[:, None]
pred_ctr_y = dy * heights[:, None] + ctr_y[:, None]
pred_w = torch.exp(dw) * widths[:, None]
pred_h = torch.exp(dh) * heights[:, None]
pred_boxes1 = pred_ctr_x - torch.tensor(0.5, dtype=pred_ctr_x.dtype, device=pred_w.device) * pred_w
pred_boxes2 = pred_ctr_y - torch.tensor(0.5, dtype=pred_ctr_y.dtype, device=pred_h.device) * pred_h
pred_boxes3 = pred_ctr_x + torch.tensor(0.5, dtype=pred_ctr_x.dtype, device=pred_w.device) * pred_w
pred_boxes4 = pred_ctr_y + torch.tensor(0.5, dtype=pred_ctr_y.dtype, device=pred_h.device) * pred_h
pred_boxes = torch.stack((pred_boxes1, pred_boxes2, pred_boxes3, pred_boxes4), dim=2).flatten(1)
return pred_boxes
That is the process behind proposals = self.box_coder.decode(pred_bbox_deltas.detach(), anchors) in rpn.py.
Next, filter_proposals is applied to all of the proposals.
For each feature map, the top 2000 anchors are selected by their objectness score, and their indices are returned. After this first round of filtering there are 8819 anchors (2000 per level, plus all 819 from the smallest level), together with their indices. This is what top_n_idx = self._get_top_n_idx(objectness, num_anchors_per_level) in rpn.py does.
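A sketch of what _get_top_n_idx does, assuming pre_nms_top_n = 2000 (the split sizes are the per-level anchor counts listed above; the offset keeps the indices global across the concatenated objectness tensor):
import torch

def get_top_n_idx(objectness, num_anchors_per_level, pre_nms_top_n=2000):
    # objectness: (N, total_anchors), one chunk per feature level
    result, offset = [], 0
    for ob in objectness.split(num_anchors_per_level, dim=1):
        num_anchors = ob.shape[1]
        top_n = min(pre_nms_top_n, num_anchors)
        _, top_idx = ob.topk(top_n, dim=1)
        result.append(top_idx + offset)
        offset += num_anchors
    return torch.cat(result, dim=1)

obj = torch.rand(1, 262194)
idx = get_top_n_idx(obj, [196800, 49200, 12300, 3075, 819])
print(idx.shape)  # torch.Size([1, 8819]): 2000 + 2000 + 2000 + 2000 + 819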
Then comes the second round of filtering.
First, each box is clamped (torch.clamp) to the original image size (800, 1282) so that it does not go out of bounds, and boxes whose width or height falls below min_size are removed.
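A minimal sketch of these two steps with torchvision's box ops (the tensors here are made up for illustration):
import torch
from torchvision.ops import boxes as box_ops

boxes = torch.tensor([[-10., 5., 900., 400.],
                      [100., 100., 100., 100.]])
boxes = box_ops.clip_boxes_to_image(boxes, (800, 1282))  # clamp x to [0, 1282], y to [0, 800]
keep = box_ops.remove_small_boxes(boxes, min_size=1e-3)  # indices of non-degenerate boxes
boxes = boxes[keep]                                       # the second box is dropped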
Next, the proposals selected from each feature-map level are offset so that proposals from different levels cannot interfere with each other during NMS, because after the offset they no longer overlap. This is explained below:
def batched_nms(
boxes: Tensor,
scores: Tensor,
idxs: Tensor,
iou_threshold: float,
) -> Tensor:
"""
Performs non-maximum suppression in a batched fashion.
Each index value correspond to a category, and NMS
will not be applied between elements of different categories.
Parameters
----------
boxes : Tensor[N, 4]
boxes where NMS will be performed. They
are expected to be in (x1, y1, x2, y2) format
scores : Tensor[N]
scores for each one of the boxes
idxs : Tensor[N]
indices of the categories for each one of the boxes.
iou_threshold : float
discards all overlapping boxes
with IoU > iou_threshold
Returns
-------
keep : Tensor
int64 tensor with the indices of
the elements that have been kept by NMS, sorted
in decreasing order of scores
"""
if boxes.numel() == 0:
return torch.empty((0,), dtype=torch.int64, device=boxes.device)
# strategy: in order to perform NMS independently per class.
# we add an offset to all the boxes. The offset is dependent
# only on the class idx, and is large enough so that boxes
# from different classes do not overlap
else:
max_coordinate = boxes.max()
offsets = idxs.to(boxes) * (max_coordinate + torch.tensor(1).to(boxes))
boxes_for_nms = boxes + offsets[:, None]
keep = nms(boxes_for_nms, scores, iou_threshold)
return keep
That is the keep = box_ops.batched_nms(boxes, scores, lvl, self.nms_thresh) step in rpn.py. After NMS, 2702 proposals remain (in this example), ordered by score, and the top 2000 of them are kept.
That is the entire process behind boxes, scores = self.filter_proposals(proposals, objectness, images.image_sizes, num_anchors_per_level) in rpn.py.
Next, each anchor is assigned a ground truth, i.e. its learning target (note that this happens on the anchors, not on the 2000 filtered proposals).
First, the IoU between each of the ~260k anchors and every ground-truth box is computed. Assuming there are 3 ground truths, every anchor gets 3 IoU values. Taking the maximum IoU per anchor: if it is below 0.3 the anchor's match is set to -1; if it lies in [0.3, 0.7) it is set to -2.
One extra step is needed: for every ground truth, the anchor with the highest IoU against it is found, and that anchor's match is forced to that ground truth's index, so that every ground truth keeps at least one anchor.
This is done in matched_idxs = self.proposal_matcher(match_quality_matrix) in rpn.py. The returned matched_idxs is a [262194] tensor whose elements are the index of the ground truth with the highest IoU against each anchor; anchors with IoU below 0.3 have been set to -1 and those between 0.3 and 0.7 to -2, except for the best-matching anchors forced to their ground truth as described above.
These matches are then converted to per-anchor labels: matches >= 0 become 1 (foreground), -1 becomes 0 (background), and -2 becomes -1 (ignored during training).
All of this is done in labels, matched_gt_boxes = self.assign_targets_to_anchors(anchors, targets) in rpn.py. The returned labels is a [262194] tensor whose entries are 1, 0, or -1 as above, and matched_gt_boxes is a [262194, 4] tensor holding the coordinates of each anchor's matched ground-truth box. Note that only anchors with label 1 are meaningful positives; the rest are either background or excluded from the loss.
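A sketch of this matching using torchvision's box_iou and Matcher (gt_boxes and anchors here are placeholders; the thresholds and attribute names are torchvision's):
import torch
from torchvision.ops import box_iou
from torchvision.models.detection import _utils as det_utils

matcher = det_utils.Matcher(high_threshold=0.7, low_threshold=0.3,
                            allow_low_quality_matches=True)

gt_boxes = torch.tensor([[0., 0., 100., 100.],
                         [50., 50., 200., 200.],
                         [300., 300., 400., 400.]])
anchors = torch.rand(262194, 4) * 800
anchors[:, 2:] += anchors[:, :2]                    # ensure x2 > x1, y2 > y1

match_quality_matrix = box_iou(gt_boxes, anchors)   # (3, 262194)
matched_idxs = matcher(match_quality_matrix)        # (262194,): gt index, -1, or -2
labels = (matched_idxs >= 0).float()                # 1 = foreground
labels[matched_idxs == matcher.BELOW_LOW_THRESHOLD] = 0.0   # background
labels[matched_idxs == matcher.BETWEEN_THRESHOLDS] = -1.0   # ignored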
Next, from the ~260k matched ground-truth boxes and the ~260k original anchors (i.e. the anchors before any shifting/scaling by the RPN predictions), the translation and scaling that each anchor would need to reach its ground truth is computed:
def encode_boxes(reference_boxes, proposals, weights):
# type: (torch.Tensor, torch.Tensor, torch.Tensor) -> torch.Tensor
"""
Encode a set of proposals with respect to some
reference boxes
Arguments:
reference_boxes (Tensor): reference boxes
proposals (Tensor): boxes to be encoded
"""
# perform some unpacking to make it JIT-fusion friendly
wx = weights[0]
wy = weights[1]
ww = weights[2]
wh = weights[3]
proposals_x1 = proposals[:, 0].unsqueeze(1)
proposals_y1 = proposals[:, 1].unsqueeze(1)
proposals_x2 = proposals[:, 2].unsqueeze(1)
proposals_y2 = proposals[:, 3].unsqueeze(1)
reference_boxes_x1 = reference_boxes[:, 0].unsqueeze(1)
reference_boxes_y1 = reference_boxes[:, 1].unsqueeze(1)
reference_boxes_x2 = reference_boxes[:, 2].unsqueeze(1)
reference_boxes_y2 = reference_boxes[:, 3].unsqueeze(1)
# implementation starts here
ex_widths = proposals_x2 - proposals_x1
ex_heights = proposals_y2 - proposals_y1
ex_ctr_x = proposals_x1 + 0.5 * ex_widths
ex_ctr_y = proposals_y1 + 0.5 * ex_heights
gt_widths = reference_boxes_x2 - reference_boxes_x1
gt_heights = reference_boxes_y2 - reference_boxes_y1
gt_ctr_x = reference_boxes_x1 + 0.5 * gt_widths
gt_ctr_y = reference_boxes_y1 + 0.5 * gt_heights
targets_dx = wx * (gt_ctr_x - ex_ctr_x) / ex_widths
targets_dy = wy * (gt_ctr_y - ex_ctr_y) / ex_heights
targets_dw = ww * torch.log(gt_widths / ex_widths)
targets_dh = wh * torch.log(gt_heights / ex_heights)
targets = torch.cat((targets_dx, targets_dy, targets_dw, targets_dh), dim=1)
return targets
This is regression_targets = self.box_coder.encode(matched_gt_boxes, anchors); regression_targets has shape torch.Size([262194, 4]), where the first two columns are the translations and the last two are the (log-)scalings.
Next, the RPN classification and regression losses are computed.
First, some anchors with label 1 (object) and label 0 (background) are randomly sampled, and the sampled positions are marked with a 1 in the returned masks; this is done by sampled_pos_inds, sampled_neg_inds = self.fg_bg_sampler(labels) in rpn.py.
Then the RPN regression outputs of the n sampled positive anchors and their regression_targets go through a smooth_L1_loss, while the RPN objectness of all 256 sampled anchors (the n positives plus the 256-n negatives) and their 0/1 labels go through a binary cross-entropy, giving the RPN regression and classification losses. This is done in compute_loss in rpn.py.
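A sketch of that loss computation, following the older torchvision API this write-up is based on (labels, objectness, pred_bbox_deltas and regression_targets are assumed to be the tensors produced in the previous steps):
import torch
import torch.nn.functional as F
from torchvision.models.detection import _utils as det_utils

# assumed from the previous steps (one image):
#   labels:             list of one (262194,) tensor with values {1, 0, -1}
#   objectness:         (262194, 1) RPN objectness output
#   pred_bbox_deltas:   (262194, 4) RPN regression output
#   regression_targets: list of one (262194, 4) tensor of encoded targets
sampler = det_utils.BalancedPositiveNegativeSampler(batch_size_per_image=256,
                                                    positive_fraction=0.5)
sampled_pos, sampled_neg = sampler(labels)
pos_inds = torch.where(torch.cat(sampled_pos, dim=0))[0]
neg_inds = torch.where(torch.cat(sampled_neg, dim=0))[0]
sampled_inds = torch.cat([pos_inds, neg_inds], dim=0)

objectness = objectness.flatten()
labels = torch.cat(labels, dim=0)
regression_targets = torch.cat(regression_targets, dim=0)

# regression loss: only the n sampled positive anchors contribute
box_loss = det_utils.smooth_l1_loss(
    pred_bbox_deltas[pos_inds], regression_targets[pos_inds],
    beta=1 / 9, size_average=False) / sampled_inds.numel()
# objectness loss: binary cross-entropy over all 256 sampled anchors
objectness_loss = F.binary_cross_entropy_with_logits(
    objectness[sampled_inds], labels[sampled_inds])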
The RPN finally returns the 2000 boxes (proposals) that survived both rounds of filtering, plus its classification loss (over the 256 sampled anchors) and its regression loss (over the n positive ones).
Note that generating the boxes and computing the losses are completely independent: the anchors that contribute to the losses never go through NMS and may not appear among the returned boxes at all.
That is the entire process behind proposals, proposal_losses = self.rpn(images, features, targets) in generalized_rcnn.py. proposals holds the decoded coordinates of the 2000 selected boxes on the original image, and proposal_losses holds the classification and regression losses computed over the 256 sampled anchors.
Next, the classification and regression losses for these 2000 proposals are computed.
From here on, only the proposals are involved. First, the 3 ground-truth boxes are concatenated with the 2000 proposals, giving a [2003, 4] tensor. The IoU of each of the 2003 proposals against each of the 3 ground truths is then computed, yielding for every proposal the largest IoU value and the index of the corresponding ground truth.
Proposals whose largest IoU is below 0.5 have their match set to -1. Note that a match of 0 is not background here but the first ground truth.
Then n proposals containing an object and 512-n background proposals are randomly sampled.
When assigning labels, each proposal first gets its matched ground-truth index in [0, n-1]; proposals whose IoU is below 0.5 are also clamped to 0 for the moment. Then, using each ground truth's actual class, these indices are converted into the real class label in [1, m] of the matched ground truth, since there is no class 0 (background) among the ground truths. Note that at this point the proposals that should be background still carry a foreground class. This is what the following lines of the RoIHeads class in roi_heads.py do:
clamped_matched_idxs_in_image = matched_idxs_in_image.clamp(min=0)
labels_in_image = gt_labels_in_image[clamped_matched_idxs_in_image]
labels_in_image = labels_in_image.to(dtype=torch.int64)
Next, the indices where matched_idxs_in_image is -1 are found; these are the background proposals, and the classes that were incorrectly mixed in above are reset to 0 for them:
bg_inds = matched_idxs_in_image == self.proposal_matcher.BELOW_LOW_THRESHOLD
labels_in_image[bg_inds] = 0
This returns, for every proposal, the index of its matched ground truth (where background proposals are mixed in, clamped to index 0) and its class label (where background is now correctly 0, so no wrong foreground classes remain).
Then n positive and 512-n negative proposals are randomly sampled, i.e. sampled_pos_inds, sampled_neg_inds = self.fg_bg_sampler(labels), and the indices of the sampled proposals are recorded:
def subsample(self, labels):
# type: (List[Tensor]) -> List[Tensor]
sampled_pos_inds, sampled_neg_inds = self.fg_bg_sampler(labels)
sampled_inds = []
for img_idx, (pos_inds_img, neg_inds_img) in enumerate(
zip(sampled_pos_inds, sampled_neg_inds)
):
img_sampled_inds = torch.where(pos_inds_img | neg_inds_img)[0]
sampled_inds.append(img_sampled_inds)
return sampled_inds
which returns the indices of the sampled proposals. That is what sampled_inds = self.subsample(labels) does.
Next, for the sampled proposals we gather their box coordinates, their true labels (background correctly marked 0), their matched ground-truth indices (background proposals still mixed in) and their matched ground-truth box coordinates (background proposals still mixed in):
img_sampled_inds = sampled_inds[img_id]  # indices of the sampled proposals
proposals[img_id] = proposals[img_id][img_sampled_inds]  # predicted box coordinates of the sampled proposals
labels[img_id] = labels[img_id][img_sampled_inds]  # class labels of the sampled proposals
matched_idxs[img_id] = matched_idxs[img_id][img_sampled_inds]
gt_boxes_in_image = gt_boxes[img_id]
if gt_boxes_in_image.numel() == 0:
gt_boxes_in_image = torch.zeros((1, 4), dtype=dtype, device=device)
matched_gt_boxes.append(gt_boxes_in_image[matched_idxs[img_id]])  # ground-truth box coordinates matched to the sampled proposals
Then, for every sampled proposal, the translation and scaling needed to reach its matched ground truth is computed as the learning target, i.e. regression_targets = self.box_coder.encode(matched_gt_boxes, proposals). Note that the background proposals take part in this computation as well, so a (512, 4) tensor is returned.
That is what proposals, matched_idxs, labels, regression_targets = self.select_training_samples(proposals, targets) in roi_heads.py does. Note that matched_idxs and regression_targets still have background proposals mixed in.
Next comes the MultiScaleRoIAlign operation. Although different strides were used when generating the anchors, the treatment of the RoIs here is uniform: large-area RoIs are pooled from a small feature map such as (256, 25, 41), while small-area RoIs are pooled from a large feature map such as (256, 200, 328). MultiScaleRoIAlign therefore returns a (512, 256, 7, 7) feature tensor. That is the box_features = self.box_roi_pool(features, proposals, image_shapes) step in roi_heads.py. The rule that decides which feature map an RoI is pooled from is Eqn. (1) of the FPN paper, k = floor(k0 + log2(sqrt(w*h) / 224)) with k0 = 4, and the code is:
class LevelMapper(object):
"""Determine which FPN level each RoI in a set of RoIs should map to based
on the heuristic in the FPN paper.
Arguments:
k_min (int)
k_max (int)
canonical_scale (int)
canonical_level (int)
eps (float)
"""
def __init__(
self,
k_min: int,
k_max: int,
canonical_scale: int = 224,
canonical_level: int = 4,
eps: float = 1e-6,
):
self.k_min = k_min
self.k_max = k_max
self.s0 = canonical_scale
self.lvl0 = canonical_level
self.eps = eps
def __call__(self, boxlists: List[Tensor]) -> Tensor:
"""
Arguments:
boxlists (list[BoxList])
"""
# Compute level ids
s = torch.sqrt(torch.cat([box_area(boxlist) for boxlist in boxlists]))
# Eqn.(1) in FPN paper
target_lvls = torch.floor(self.lvl0 + torch.log2(s / self.s0) + torch.tensor(self.eps, dtype=s.dtype))
target_lvls = torch.clamp(target_lvls, min=self.k_min, max=self.k_max)
return (target_lvls.to(torch.int64) - self.k_min).to(torch.int64)
Next, the (512, 256, 7, 7) box_features from MultiScaleRoIAlign are flattened and passed through two fully-connected layers, producing a (512, 1024) tensor; this is the TwoMLPHead. That is the box_features = self.box_head(box_features) step in roi_heads.py.
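TwoMLPHead is just a flatten followed by two fully-connected layers; an illustrative re-implementation (representation_size = 1024 is torchvision's default for this model):
import torch
from torch import nn
import torch.nn.functional as F

class SimpleTwoMLPHead(nn.Module):
    # illustrative re-implementation of torchvision's TwoMLPHead
    def __init__(self, in_channels=256 * 7 * 7, representation_size=1024):
        super().__init__()
        self.fc6 = nn.Linear(in_channels, representation_size)
        self.fc7 = nn.Linear(representation_size, representation_size)

    def forward(self, x):
        x = x.flatten(start_dim=1)   # (512, 256*7*7)
        x = F.relu(self.fc6(x))
        x = F.relu(self.fc7(x))
        return x                     # (512, 1024)

print(SimpleTwoMLPHead()(torch.rand(512, 256, 7, 7)).shape)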
Next, this 1024-dimensional tensor is passed through the two branches of FastRCNNPredictor, producing per-class scores and per-class translation/scaling predictions:
class FastRCNNPredictor(nn.Module):
"""
Standard classification + bounding box regression layers
for Fast R-CNN.
Arguments:
in_channels (int): number of input channels
num_classes (int): number of output classes (including background)
"""
def __init__(self, in_channels, num_classes):
super(FastRCNNPredictor, self).__init__()
self.cls_score = nn.Linear(in_channels, num_classes)
self.bbox_pred = nn.Linear(in_channels, num_classes * 4)
def forward(self, x):
if x.dim() == 4:
assert list(x.shape[2:]) == [1, 1]
x = x.flatten(start_dim=1)
scores = self.cls_score(x)
bbox_deltas = self.bbox_pred(x)
return scores, bbox_deltas
That is the class_logits, box_regression = self.box_predictor(box_features) step in roi_heads.py.
Next, the Fast R-CNN classification and regression losses for each proposal are computed:
def fastrcnn_loss(class_logits, box_regression, labels, regression_targets):
# type: (Tensor, Tensor, List[Tensor], List[Tensor]) -> Tuple[Tensor, Tensor]
"""
Computes the loss for Faster R-CNN.
Arguments:
class_logits (Tensor)
box_regression (Tensor)
labels (list[BoxList])
regression_targets (Tensor)
Returns:
classification_loss (Tensor)
box_loss (Tensor)
"""
labels = torch.cat(labels, dim=0)
regression_targets = torch.cat(regression_targets, dim=0)
classification_loss = F.cross_entropy(class_logits, labels)
# get indices that correspond to the regression targets for
# the corresponding ground truth labels, to be used with
# advanced indexing
sampled_pos_inds_subset = torch.where(labels > 0)[0]
labels_pos = labels[sampled_pos_inds_subset]
N, num_classes = class_logits.shape
box_regression = box_regression.reshape(N, -1, 4)
box_loss = det_utils.smooth_l1_loss(
box_regression[sampled_pos_inds_subset, labels_pos],
regression_targets[sampled_pos_inds_subset],
beta=1 / 9,
size_average=False,
)
box_loss = box_loss / labels.numel()
return classification_loss, box_loss
For the classification loss, since the labels correctly mark background as 0, all 512 proposals take part in the computation. For the regression loss, the boxes with label > 0, i.e. the non-background ones, are selected first via sampled_pos_inds_subset = torch.where(labels > 0)[0]. The class of each positive proposal's matched ground truth is looked up with labels_pos = labels[sampled_pos_inds_subset], and the box_regression entry for that class is picked out of the (512, k, 4) predictions to take part in the regression loss. For example, if a proposal's matched ground-truth box has class 1, then box_regression[idx, 1] is used. There is no single shared box prediction here: a box is regressed for every class, and only the prediction for the proposal's own class enters the loss. This is how an (n, 4) ground-truth tensor is compared against an (n, k, 4) regression tensor, where n is the number of positive proposals and k the total number of classes (including background). The regression loss therefore only considers foreground proposals (label > 0), while the classification loss considers all of them.
That is the loss_classifier, loss_box_reg = fastrcnn_loss(class_logits, box_regression, labels, regression_targets) step in roi_heads.py.
Next, the mask (segmentation) loss is computed.
First, the indices of the proposals among the 512 whose matched ground-truth box is not background are collected, giving the positive proposals and their matched ground-truth indices:
pos = torch.where(labels[img_id] > 0)[0]
mask_proposals.append(proposals[img_id][pos])
pos_matched_idxs.append(matched_idxs[img_id][pos])
These positive proposals and the feature list produced by the backbone then go through a MultiScaleRoIAlign, giving a (50, 256, 14, 14) feature tensor (assuming 50 positives in this example), i.e. mask_features = self.mask_roi_pool(features, mask_proposals, image_shapes). The RoI-aligned features are then passed through MaskRCNNHeads, which still returns a (50, 256, 14, 14) feature tensor:
MaskRCNNHeads(
(mask_fcn1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(relu1): ReLU(inplace=True)
(mask_fcn2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(relu2): ReLU(inplace=True)
(mask_fcn3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(relu3): ReLU(inplace=True)
(mask_fcn4): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(relu4): ReLU(inplace=True)
)
That is the mask_features = self.mask_head(mask_features) step in roi_heads.py.
Next, MaskRCNNPredictor upsamples once and produces a (50, n+1, 28, 28) tensor, where n is the number of foreground classes, i.e. mask_logits = self.mask_predictor(mask_features):
MaskRCNNPredictor(
(conv5_mask): ConvTranspose2d(256, 256, kernel_size=(2, 2), stride=(2, 2))
(relu): ReLU(inplace=True)
(mask_fcn_logits): Conv2d(256, 91, kernel_size=(1, 1), stride=(1, 1))
)
Next, the maskrcnn_loss is computed.
First, the ground-truth masks are RoIAligned onto the positive RoIs, giving a (50, 28, 28) tensor: this is the learning target for each RoI, i.e. the value of the ground-truth mask after each RoI's pooling.
Then, for each entry of mask_logits (50, n+1, 28, 28), the channel belonging to that RoI's class is picked out, giving (50, 28, 28), and a binary_cross_entropy against the corresponding target masks yields mask_loss:
def maskrcnn_loss(mask_logits, proposals, gt_masks, gt_labels, mask_matched_idxs):
# type: (Tensor, List[Tensor], List[Tensor], List[Tensor], List[Tensor]) -> Tensor
"""
Arguments:
proposals (list[BoxList])
mask_logits (Tensor)
targets (list[BoxList])
Return:
mask_loss (Tensor): scalar tensor containing the loss
"""
discretization_size = mask_logits.shape[-1]
labels = [gt_label[idxs] for gt_label, idxs in zip(gt_labels, mask_matched_idxs)]
mask_targets = [
project_masks_on_boxes(m, p, i, discretization_size)
for m, p, i in zip(gt_masks, proposals, mask_matched_idxs)
]
labels = torch.cat(labels, dim=0)
mask_targets = torch.cat(mask_targets, dim=0)
# torch.mean (in binary_cross_entropy_with_logits) doesn't
# accept empty tensors, so handle it separately
if mask_targets.numel() == 0:
return mask_logits.sum() * 0
mask_loss = F.binary_cross_entropy_with_logits(
mask_logits[torch.arange(labels.shape[0], device=labels.device), labels], mask_targets
)
return mask_loss
That is the rcnn_loss_mask = maskrcnn_loss(mask_logits, mask_proposals, gt_masks, gt_labels, pos_matched_idxs) step in roi_heads.py. Note that this differs from an ordinary segmentation loss: instead of upsampling the feature map to the original image and computing the loss there, the ground truth is projected onto the feature map (the 28×28 RoI grid) and the loss is computed there.
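A sketch of that projection, mirroring project_masks_on_boxes in roi_heads.py: the ground-truth masks are RoIAligned onto each positive proposal's box at 28×28 resolution:
import torch
from torchvision.ops import roi_align

def project_masks_on_boxes(gt_masks, boxes, matched_idxs, M=28):
    # gt_masks: (num_gt, H, W); boxes: (num_pos, 4); matched_idxs: (num_pos,)
    matched_idxs = matched_idxs.to(boxes)
    # rois: (num_pos, 5) rows of (gt_index, x1, y1, x2, y2)
    rois = torch.cat([matched_idxs[:, None], boxes], dim=1)
    gt_masks = gt_masks[:, None].to(rois)                 # (num_gt, 1, H, W)
    return roi_align(gt_masks, rois, (M, M), 1.0)[:, 0]   # (num_pos, 28, 28)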
During training no detections are recorded; only the losses are returned:
loss_dict={
'loss_classifier': tensor(0.2305, grad_fn=<NllLossBackward>),
'loss_box_reg': tensor(0.0949, grad_fn=<DivBackward0>),
'loss_mask': tensor(0.1284, grad_fn=<BinaryCrossEntropyWithLogitsBackward>),
'loss_objectness': tensor(0.0333, grad_fn=<BinaryCrossEntropyWithLogitsBackward>),
'loss_rpn_box_reg': tensor(0.0114, grad_fn=<DivBackward0>)}
where loss_objectness and loss_rpn_box_reg are the RPN losses.
That is the entire process behind loss_dict = model(images, targets) in engine.py: during training only the losses are returned, and during eval only the results are returned.
During eval, the mask is first predicted on the (28, 28) grid, padded to (30, 30), and then interpolated (bilinearly, in torchvision's paste_mask_in_image) directly to the size of the bounding box; the bounding-box coordinates then place it on the original image. So, apart from the single ConvTranspose2d inside MaskRCNNPredictor, Mask R-CNN never deconvolves features back up to the input resolution: the FPN upsamples with simple interpolation, and the per-level features are ultimately shrunk to the (28, 28) RoI grid by RoI pooling, rather than being enlarged back to the original image size as in traditional segmentation.
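For illustration, a sketch of pasting one predicted 28×28 mask back into the image, following the logic of paste_mask_in_image in roi_heads.py (the box expansion that accompanies the 1-pixel padding in torchvision is omitted here for brevity):
import torch
import torch.nn.functional as F

def paste_mask_in_image(mask, box, im_h, im_w):
    # mask: (28, 28) probabilities; box: (x1, y1, x2, y2) as ints
    mask = F.pad(mask[None, None], (1, 1, 1, 1))   # pad to (1, 1, 30, 30)
    w = max(int(box[2] - box[0] + 1), 1)
    h = max(int(box[3] - box[1] + 1), 1)
    # resize the padded mask to the bounding-box size
    mask = F.interpolate(mask, size=(h, w), mode="bilinear",
                         align_corners=False)[0, 0]
    # copy the resized mask into a full-size canvas at the box position
    im_mask = torch.zeros((im_h, im_w), dtype=mask.dtype)
    x0, x1 = max(int(box[0]), 0), min(int(box[2]) + 1, im_w)
    y0, y1 = max(int(box[1]), 0), min(int(box[3]) + 1, im_h)
    im_mask[y0:y1, x0:x1] = mask[(y0 - int(box[1])):(y1 - int(box[1])),
                                 (x0 - int(box[0])):(x1 - int(box[0]))]
    return im_mask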