Mask R-CNN: A Walkthrough of the Official PyTorch Code

1. Data preprocessing

1.1 transform

The image is resized, and the corresponding boxes and masks are resized along with it. Boxes are rescaled by simply multiplying their coordinates by the scale factors; the image itself is resized with bilinear interpolation, while the masks are resized with nearest-neighbor interpolation.

The image may also need to be enlarged by padding, i.e.:
new_img = np.zeros((3, new_x, new_y))   # zero canvas of the padded size
new_img[:, :x_max, :y_max] = old_img    # copy the resized image into the top-left corner
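
A rough sketch of what the transform does per image (simplified; the real GeneralizedRCNNTransform also normalizes the image and batches the padded images, and the helper name here is illustrative):

import torch
import torch.nn.functional as F

def resize_image_boxes_masks(image, boxes, masks, min_size=800, max_size=1333):
    # image: (3, H, W); boxes: (N, 4) in (x1, y1, x2, y2); masks: (N, H, W)
    h, w = image.shape[-2:]
    scale = min(min_size / min(h, w), max_size / max(h, w))
    new_h, new_w = int(round(h * scale)), int(round(w * scale))

    # bilinear interpolation for the image
    image = F.interpolate(image[None], size=(new_h, new_w),
                          mode="bilinear", align_corners=False)[0]
    # nearest-neighbor interpolation for the masks
    masks = F.interpolate(masks[None].float(), size=(new_h, new_w),
                          mode="nearest")[0].byte()
    # boxes: just multiply the coordinates by the scale ratios
    ratio_w, ratio_h = new_w / w, new_h / h
    boxes = boxes * torch.tensor([ratio_w, ratio_h, ratio_w, ratio_h])
    return image, boxes, masks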

This corresponds to the following line in generalized_rcnn.py:
images, targets = self.transform(images, targets)

2. Obtaining the features

The resized images are passed through the backbone, i.e. a ResNet + FPN, which produces a list of feature maps. This corresponds to the following line in generalized_rcnn.py:
features = self.backbone(images.tensors)

First, IntermediateLayerGetter is used to collect ResNet feature maps at different resolutions:

IntermediateLayerGetter(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): FrozenBatchNorm2d(64)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): Bottleneck(
      (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): FrozenBatchNorm2d(64)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): FrozenBatchNorm2d(64)
      (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): FrozenBatchNorm2d(256)
      (relu): ReLU(inplace=True)
      (downsample): Sequential(
        (0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): FrozenBatchNorm2d(256)
      )
    )
    (1): Bottleneck()
    (2): Bottleneck()
  )
  (layer2): Sequential(
    (0): Bottleneck(
      (conv1): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): FrozenBatchNorm2d(128)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn2): FrozenBatchNorm2d(128)
      (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): FrozenBatchNorm2d(512)
      (relu): ReLU(inplace=True)
      (downsample): Sequential(
        (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): FrozenBatchNorm2d(512)
      )
    )
    (1): Bottleneck()
    (2): Bottleneck()
    (3): Bottleneck()
  )
  (layer3): Sequential(
    (0): Bottleneck(
      (conv1): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): FrozenBatchNorm2d(256)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn2): FrozenBatchNorm2d(256)
      (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): FrozenBatchNorm2d(1024)
      (relu): ReLU(inplace=True)
      (downsample): Sequential(
        (0): Conv2d(512, 1024, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): FrozenBatchNorm2d(1024)
      )
    )
    (1): Bottleneck()
    (2): Bottleneck()
    (3): Bottleneck()
    (4): Bottleneck()
    (5): Bottleneck()
  )
  (layer4): Sequential(
    (0): Bottleneck(
      (conv1): Conv2d(1024, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): FrozenBatchNorm2d(512)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn2): FrozenBatchNorm2d(512)
      (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): FrozenBatchNorm2d(2048)
      (relu): ReLU(inplace=True)
      (downsample): Sequential(
        (0): Conv2d(1024, 2048, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): FrozenBatchNorm2d(2048)
      )
    )
    (1): Bottleneck()
    (2): Bottleneck()
  )
)

out returns the output features of layer1, layer2, layer3 and layer4. Assuming the transformed (padded) image has size (3, 800, 1312), the output features are:

[
    Tensor(torch.Size([1, 256, 200, 328])),
    Tensor(torch.Size([1, 512, 100, 164])),
    Tensor(torch.Size([1, 1024, 50, 82])),
    Tensor(torch.Size([1, 2048, 25, 41]))
]
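
For reference, a hedged sketch of how such intermediate features can be collected. IntermediateLayerGetter lives in torchvision.models._utils (a private module, so the import path may change between versions), and the return_layers mapping below mirrors what the FPN backbone uses:

import torch
from torchvision.models import resnet50
from torchvision.models._utils import IntermediateLayerGetter

backbone = resnet50()
# map ResNet stage names to output keys; '0'..'3' match the FPN input ordering
return_layers = {"layer1": "0", "layer2": "1", "layer3": "2", "layer4": "3"}
body = IntermediateLayerGetter(backbone, return_layers=return_layers)

x = torch.randn(1, 3, 800, 1312)
feats = body(x)                              # OrderedDict with keys '0'..'3'
print([v.shape for v in feats.values()])
# [1, 256, 200, 328], [1, 512, 100, 164], [1, 1024, 50, 82], [1, 2048, 25, 41]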

The resulting feature list is then fed into the FPN:

The list above is the FPN input; the final out below is the FPN output.

out =
	[
        Tensor(torch.Size([1, 256, 200, 328])),
        Tensor(torch.Size([1, 256, 100, 164])),
        Tensor(torch.Size([1, 256, 50, 82])),
        Tensor(torch.Size([1, 256, 25, 41])),
        Tensor(torch.Size([1, 256, 13, 21]))
	]
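
A simplified sketch of what the FPN does with the four backbone maps, assuming 256 output channels. It is modeled loosely on torchvision's FeaturePyramidNetwork plus the LastLevelMaxPool that produces the extra fifth level; the class and variable names are illustrative:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    def __init__(self, in_channels_list=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 lateral convs bring every level to out_channels
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels_list])
        # 3x3 output convs smooth the merged maps
        self.output = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in in_channels_list])

    def forward(self, feats):            # feats: [C2, C3, C4, C5], high -> low resolution
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        # top-down pathway: upsample the coarser map and add it to the lateral below
        for i in range(len(laterals) - 2, -1, -1):
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        outs = [conv(l) for conv, l in zip(self.output, laterals)]
        # extra coarsest level from a stride-2 max pool (LastLevelMaxPool)
        outs.append(F.max_pool2d(outs[-1], kernel_size=1, stride=2))
        return outs

Feeding it maps with the four shapes listed earlier yields five 256-channel maps of sizes (200, 328), (100, 164), (50, 82), (25, 41) and (13, 21), matching out above.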

All of the above corresponds to the following line in generalized_rcnn.py:
features = self.backbone(images.tensors)

3. Generating proposals from the features with the RPN

This corresponds to proposals, proposal_losses = self.rpn(images, features, targets).

3.1 RPNHead

In rpn.py, each feature map is first passed through an RPNHead, corresponding to

objectness, pred_bbox_deltas = self.head(features)

The RPNHead forward process:
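
The head itself is small: a shared 3×3 convolution followed by two parallel 1×1 convolutions, one producing num_anchors objectness channels per location and the other 4 * num_anchors box-delta channels. A sketch in the spirit of torchvision's RPNHead (not the exact source):

import torch.nn as nn
import torch.nn.functional as F

class RPNHeadSketch(nn.Module):
    def __init__(self, in_channels=256, num_anchors=3):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        self.cls_logits = nn.Conv2d(in_channels, num_anchors, 1)       # objectness per anchor
        self.bbox_pred = nn.Conv2d(in_channels, num_anchors * 4, 1)    # 4 box deltas per anchor

    def forward(self, features):        # features: list of FPN maps
        logits, bbox_reg = [], []
        for feat in features:
            t = F.relu(self.conv(feat))
            logits.append(self.cls_logits(t))     # (N, 3, H, W)
            bbox_reg.append(self.bbox_pred(t))    # (N, 12, H, W)
        return logits, bbox_reg
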
Hence:

objectness =
	[
        Tensor(torch.Size([1, 3, 200, 328])),
        Tensor(torch.Size([1, 3, 100, 164])),
        Tensor(torch.Size([1, 3, 50, 82])),
        Tensor(torch.Size([1, 3, 25, 41])),
        Tensor(torch.Size([1, 3, 13, 21]))
	]

pred_bbox_deltas =
	[
        Tensor(torch.Size([1, 12, 200, 328])),
        Tensor(torch.Size([1, 12, 100, 164])),
        Tensor(torch.Size([1, 12, 50, 82])),
        Tensor(torch.Size([1, 12, 25, 41])),
        Tensor(torch.Size([1, 12, 13, 21]))
	]

Here 3 refers to the 3 anchors at each location, determined by num_anchors_per_location in anchor_utils.py.

3.2 Generating anchors from the image and the feature list

This corresponds to the AnchorGenerator class in anchor_utils.py.

For every size in (32, 64, 128, 256, 512) and every aspect_ratio in (0.5, 1, 2), a set of base anchors is generated. Taking size=32 as an example:

tensor([[-22.6274, -11.3137,  22.6274,  11.3137],
        [-16.0000, -16.0000,  16.0000,  16.0000],
        [-11.3137, -22.6274,  11.3137,  22.6274]])

These are then rounded:

tensor([[-23., -11.,  23.,  11.],
        [-16., -16.,  16.,  16.],
        [-11., -23.,  11.,  23.]])
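
A sketch of how these base anchors can be computed, mirroring the generate_anchors logic of torchvision's AnchorGenerator: for each aspect ratio, height and width are scaled by sqrt(ratio) and 1/sqrt(ratio) so the anchor area stays roughly size²:

import torch

def make_base_anchors(size, aspect_ratios=(0.5, 1.0, 2.0)):
    scales = torch.tensor([float(size)])
    ratios = torch.tensor(aspect_ratios)
    h_ratios = torch.sqrt(ratios)
    w_ratios = 1.0 / h_ratios
    ws = (w_ratios[:, None] * scales[None, :]).view(-1)
    hs = (h_ratios[:, None] * scales[None, :]).view(-1)
    # anchors centered at (0, 0), in (x1, y1, x2, y2) form
    return (torch.stack([-ws, -hs, ws, hs], dim=1) / 2).round()

print(make_base_anchors(32))
# tensor([[-23., -11.,  23.,  11.],
#         [-16., -16.,  16.,  16.],
#         [-11., -23.,  11.,  23.]])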

Hence the base_anchors generated for sizes (32, 64, 128, 256, 512) are:

[
tensor([[-23., -11.,  23.,  11.],
        [-16., -16.,  16.,  16.],
        [-11., -23.,  11.,  23.]]),

tensor([[-45., -23.,  45.,  23.],
        [-32., -32.,  32.,  32.],
        [-23., -45.,  23.,  45.]]),

tensor([[-91., -45.,  91.,  45.],
        [-64., -64.,  64.,  64.],
        [-45., -91.,  45.,  91.]]),

tensor([[-181.,  -91.,  181.,   91.],
        [-128., -128.,  128.,  128.],
        [ -91., -181.,   91.,  181.]]),

tensor([[-362., -181.,  362.,  181.],
        [-256., -256.,  256.,  256.],
        [-181., -362.,  181.,  362.]])
]

This corresponds to self.set_cell_anchors(dtype, device) in anchor_utils.py.
The final anchors are then generated by cached_grid_anchors in anchor_utils.py.

For each feature map, given its feature size (e.g. (200, 328)), its stride relative to the original image (e.g. (4, 4)) and its base_anchors:

tensor([[-23., -11.,  23.,  11.],
        [-16., -16.,  16.,  16.],
        [-11., -23.,  11.,  23.]])

the corresponding grid anchors are generated. The number of anchors per level is therefore:

anchors = [
    200*328*3 = 196800
    100*164*3 = 49200
    50*82*3   = 12300
    25*41*3   = 3075
    13*21*3   = 819
]
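
A sketch of that grid step (simplified from grid_anchors in anchor_utils.py): the three rounded base anchors are shifted to every cell of the level's grid, with cells spaced by the level's stride. This reproduces the per-level counts above:

import torch

def grid_anchors_for_level(base_anchors, grid_size, stride):
    gh, gw = grid_size
    sh, sw = stride
    shift_x = (torch.arange(gw, dtype=torch.float32) * sw).repeat(gh)             # (gh*gw,)
    shift_y = (torch.arange(gh, dtype=torch.float32) * sh).repeat_interleave(gw)  # (gh*gw,)
    shifts = torch.stack([shift_x, shift_y, shift_x, shift_y], dim=1)             # (gh*gw, 4)
    # broadcast: every grid cell gets all 3 base anchors
    return (shifts[:, None, :] + base_anchors[None, :, :]).reshape(-1, 4)

base = torch.tensor([[-23., -11., 23., 11.],
                     [-16., -16., 16., 16.],
                     [-11., -23., 11., 23.]])
anchors_p2 = grid_anchors_for_level(base, (200, 328), (4, 4))
print(anchors_p2.shape)   # torch.Size([196800, 4])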

This corresponds to anchors_over_all_feature_maps = self.cached_grid_anchors(grid_sizes, strides) in anchor_utils.py.

Finally, after anchors = [torch.cat(anchors_per_image) for anchors_per_image in anchors] in anchor_utils.py, all anchors of an image are concatenated into a single tensor of shape torch.Size([262194, 4]).

That is the entire process behind anchors = self.anchor_generator(images, features) in rpn.py.

Next, permute_and_flatten is applied to each objectness and pred_bbox_deltas tensor above: (1, 3, 200, 328) becomes torch.Size([1, 196800, 1]), and torch.Size([1, 12, 200, 328]) becomes torch.Size([1, 196800, 4]):

def permute_and_flatten(layer, N, A, C, H, W):
    # type: (Tensor, int, int, int, int, int) -> Tensor
    layer = layer.view(N, -1, C, H, W)
    layer = layer.permute(0, 3, 4, 1, 2)
    layer = layer.reshape(N, -1, C)
    return layer
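
For instance, for the largest level (N=1 image, A=3 anchors per location, H=200, W=328), the call looks like this, with C=1 for the objectness logits and C=4 for the box deltas:

import torch

obj = torch.randn(1, 3, 200, 328)       # objectness logits for one level
deltas = torch.randn(1, 12, 200, 328)   # 3 anchors * 4 deltas per location

print(permute_and_flatten(obj,    1, 3, 1, 200, 328).shape)   # torch.Size([1, 196800, 1])
print(permute_and_flatten(deltas, 1, 3, 4, 200, 328).shape)   # torch.Size([1, 196800, 4])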

Hence:

box_cls_flattened = [torch.Size([1, 196800, 1]), torch.Size([1, 49200, 1]), torch.Size([1, 12300, 1]), torch.Size([1, 3075, 1]), torch.Size([1, 819, 1])]

Finally, these are concatenated into torch.Size([262194, 1]) and torch.Size([262194, 4]).

This gives the raw outputs of the RPN network; that is the whole process behind objectness, pred_bbox_deltas = concat_box_prediction_layers(objectness, pred_bbox_deltas) in rpn.py.

Here objectness and pred_bbox_deltas are, respectively, the confidence score and the box-regression parameters for each anchor.

The decode function then translates and scales each anchor to produce the actual box coordinates.

The translation and scaling are performed by decode_single in _utils.py: starting from the originally generated anchors and using the RPN outputs as the translation and scaling coefficients, the decoded boxes are obtained:

def decode_single(self, rel_codes, boxes):
    """
    From a set of original boxes and encoded relative box offsets,
    get the decoded boxes.

    Arguments:
        rel_codes (Tensor): encoded boxes
        boxes (Tensor): reference boxes.
    """

    boxes = boxes.to(rel_codes.dtype)

    widths = boxes[:, 2] - boxes[:, 0]
    heights = boxes[:, 3] - boxes[:, 1]
    ctr_x = boxes[:, 0] + 0.5 * widths
    ctr_y = boxes[:, 1] + 0.5 * heights

    wx, wy, ww, wh = self.weights
    dx = rel_codes[:, 0::4] / wx
    dy = rel_codes[:, 1::4] / wy
    dw = rel_codes[:, 2::4] / ww
    dh = rel_codes[:, 3::4] / wh

    # Prevent sending too large values into torch.exp()
    dw = torch.clamp(dw, max=self.bbox_xform_clip)
    dh = torch.clamp(dh, max=self.bbox_xform_clip)

    pred_ctr_x = dx * widths[:, None] + ctr_x[:, None]
    pred_ctr_y = dy * heights[:, None] + ctr_y[:, None]
    pred_w = torch.exp(dw) * widths[:, None]
    pred_h = torch.exp(dh) * heights[:, None]

    pred_boxes1 = pred_ctr_x - torch.tensor(0.5, dtype=pred_ctr_x.dtype, device=pred_w.device) * pred_w
    pred_boxes2 = pred_ctr_y - torch.tensor(0.5, dtype=pred_ctr_y.dtype, device=pred_h.device) * pred_h
    pred_boxes3 = pred_ctr_x + torch.tensor(0.5, dtype=pred_ctr_x.dtype, device=pred_w.device) * pred_w
    pred_boxes4 = pred_ctr_y + torch.tensor(0.5, dtype=pred_ctr_y.dtype, device=pred_h.device) * pred_h
    pred_boxes = torch.stack((pred_boxes1, pred_boxes2, pred_boxes3, pred_boxes4), dim=2).flatten(1)
    return pred_boxes

That is the process behind proposals = self.box_coder.decode(pred_bbox_deltas.detach(), anchors) in rpn.py.

Next, filter_proposals is applied to all the proposals.

For each feature level, the top (at most) 2000 anchors are selected according to their score (objectness), and the indices of these anchors are returned.

After this first round of filtering, 8819 anchors remain and their indices are returned. This is the top_n_idx = self._get_top_n_idx(objectness, num_anchors_per_level) step in rpn.py.

Next comes the second round of filtering.
Each box is first clamped (torch.clamp) to the original image size (800, 1282) so that it does not go out of bounds; boxes that are too small are then removed.
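
Roughly, this step can be expressed with the box_ops helpers that filter_proposals uses (a simplified, illustrative sketch; the random boxes and the min_size value are just stand-ins):

import torch
from torchvision.ops import boxes as box_ops

boxes = torch.rand(8819, 4) * 1500                          # stand-in for the decoded proposals kept so far
boxes = box_ops.clip_boxes_to_image(boxes, (800, 1282))     # clamp coordinates into the image (H, W)
keep = box_ops.remove_small_boxes(boxes, min_size=1e-3)     # drop degenerate / tiny boxes
boxes = boxes[keep]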

Next, the proposals selected from each feature level are given an offset, which guarantees that proposals coming from different levels cannot interfere with each other during NMS (after the offset they never overlap), as explained below:

def batched_nms(
    boxes: Tensor,
    scores: Tensor,
    idxs: Tensor,
    iou_threshold: float,
) -> Tensor:
    """
    Performs non-maximum suppression in a batched fashion.

    Each index value correspond to a category, and NMS
    will not be applied between elements of different categories.

    Parameters
    ----------
    boxes : Tensor[N, 4]
        boxes where NMS will be performed. They
        are expected to be in (x1, y1, x2, y2) format
    scores : Tensor[N]
        scores for each one of the boxes
    idxs : Tensor[N]
        indices of the categories for each one of the boxes.
    iou_threshold : float
        discards all overlapping boxes
        with IoU > iou_threshold

    Returns
    -------
    keep : Tensor
        int64 tensor with the indices of
        the elements that have been kept by NMS, sorted
        in decreasing order of scores
    """
    if boxes.numel() == 0:
        return torch.empty((0,), dtype=torch.int64, device=boxes.device)
    # strategy: in order to perform NMS independently per class.
    # we add an offset to all the boxes. The offset is dependent
    # only on the class idx, and is large enough so that boxes
    # from different classes do not overlap
    else:
        max_coordinate = boxes.max()
        offsets = idxs.to(boxes) * (max_coordinate + torch.tensor(1).to(boxes))
        boxes_for_nms = boxes + offsets[:, None]
        keep = nms(boxes_for_nms, scores, iou_threshold)
        return keep

That is the keep = box_ops.batched_nms(boxes, scores, lvl, self.nms_thresh) step in rpn.py. After NMS, 2702 proposals remain, ordered by score, and the top 2000 of them are then kept.

That is the whole process behind boxes, scores = self.filter_proposals(proposals, objectness, images.image_sizes, num_anchors_per_level) in rpn.py.

Next, ground-truth targets, i.e. the learning targets, are assigned. Note that this assignment operates on the anchors, not on the 2000 selected proposals.

First, the IoU between each of the roughly 260,000 anchors and every ground truth is computed. Suppose there are 3 ground truths; then every anchor gets 3 IoU values. Considering the largest of them: if it is below 0.3, the anchor's match index is set to -1; if it lies in [0.3, 0.7), it is set to -2.

One extra step is needed: for each ground truth, the anchor with the highest IoU against it is found, and that anchor's match index is forced to the index of that ground truth (so every ground truth has at least one matched anchor).

This is done in matched_idxs = self.proposal_matcher(match_quality_matrix) in rpn.py. The returned matched_idxs is a tensor of 262194 elements, each being the index of the ground truth with the highest IoU for that anchor, except that anchors with IoU below 0.3 are set to -1 and those with IoU in [0.3, 0.7) are set to -2, plus the forced assignments for the best anchor of each ground truth.

These match indices then become each anchor's label: entries equal to -2 are set to -1 (ignored), entries equal to -1 are set to 0 (background), and entries greater than or equal to 0 are set to 1 (foreground).
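
In code this relabeling looks roughly like the following (mirroring assign_targets_to_anchors; the -1 and -2 sentinels are Matcher.BELOW_LOW_THRESHOLD and Matcher.BETWEEN_THRESHOLDS, and the tensors here are small stand-ins):

import torch

gt_boxes = torch.tensor([[10., 10., 50., 80.],
                         [100., 40., 180., 120.],
                         [200., 200., 260., 300.]])
# stand-in match result: gt index per anchor, or -1 (low IoU) / -2 (between thresholds)
matched_idxs = torch.tensor([2, -1, -2, 0, 1, -1])

labels = (matched_idxs >= 0).to(dtype=torch.float32)    # foreground -> 1.0
labels[matched_idxs == -1] = 0.0                        # IoU < 0.3 -> background
labels[matched_idxs == -2] = -1.0                       # IoU in [0.3, 0.7) -> ignored by the sampler
matched_gt_boxes = gt_boxes[matched_idxs.clamp(min=0)]  # gt box per anchor (meaningful only where labels == 1)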

These operations are done in labels, matched_gt_boxes = self.assign_targets_to_anchors(anchors, targets) in rpn.py. The returned labels is a [262194] tensor whose elements are 1 (foreground), 0 (background) or -1 (ignored), and matched_gt_boxes is a [262194, 4] tensor holding the matched ground-truth box coordinates for each anchor. Note that among all these anchors only those with label 1 are meaningful; the rest are either background or excluded from the computation.

Next, from the matched ground-truth boxes of these ~260,000 anchors and the ~260,000 original anchors (i.e. the anchors before they are translated and scaled by the RPN predictions), the translation and scaling that each anchor needs to perform are computed:


def encode_boxes(reference_boxes, proposals, weights):
    # type: (torch.Tensor, torch.Tensor, torch.Tensor) -> torch.Tensor
    """
    Encode a set of proposals with respect to some
    reference boxes

    Arguments:
        reference_boxes (Tensor): reference boxes
        proposals (Tensor): boxes to be encoded
    """

    # perform some unpacking to make it JIT-fusion friendly
    wx = weights[0]
    wy = weights[1]
    ww = weights[2]
    wh = weights[3]

    proposals_x1 = proposals[:, 0].unsqueeze(1)
    proposals_y1 = proposals[:, 1].unsqueeze(1)
    proposals_x2 = proposals[:, 2].unsqueeze(1)
    proposals_y2 = proposals[:, 3].unsqueeze(1)

    reference_boxes_x1 = reference_boxes[:, 0].unsqueeze(1)
    reference_boxes_y1 = reference_boxes[:, 1].unsqueeze(1)
    reference_boxes_x2 = reference_boxes[:, 2].unsqueeze(1)
    reference_boxes_y2 = reference_boxes[:, 3].unsqueeze(1)

    # implementation starts here
    ex_widths = proposals_x2 - proposals_x1
    ex_heights = proposals_y2 - proposals_y1
    ex_ctr_x = proposals_x1 + 0.5 * ex_widths
    ex_ctr_y = proposals_y1 + 0.5 * ex_heights

    gt_widths = reference_boxes_x2 - reference_boxes_x1
    gt_heights = reference_boxes_y2 - reference_boxes_y1
    gt_ctr_x = reference_boxes_x1 + 0.5 * gt_widths
    gt_ctr_y = reference_boxes_y1 + 0.5 * gt_heights

    targets_dx = wx * (gt_ctr_x - ex_ctr_x) / ex_widths
    targets_dy = wy * (gt_ctr_y - ex_ctr_y) / ex_heights
    targets_dw = ww * torch.log(gt_widths / ex_widths)
    targets_dh = wh * torch.log(gt_heights / ex_heights)

    targets = torch.cat((targets_dx, targets_dy, targets_dw, targets_dh), dim=1)
    return targets

This is regression_targets = self.box_coder.encode(matched_gt_boxes, anchors); regression_targets has shape torch.Size([262194, 4]), where the first two columns are the translation terms and the last two are the scaling terms.

Next, the RPN classification and regression losses are computed.
First, some anchors with label 1 (object) and label 0 (background) are randomly sampled from the labels; in the returned masks the selected anchors are marked with 1. This is done by sampled_pos_inds, sampled_neg_inds = self.fg_bg_sampler(labels) in rpn.py.

Then, for the n selected positive indices, a smooth-L1 loss is computed between the RPN regression outputs and the computed regression_targets; for all 256 sampled indices (the n positives plus the 256−n negatives), a binary cross-entropy loss is computed between the RPN objectness outputs and the corresponding 0/1 labels. These are the RPN regression and classification losses, computed in compute_loss in rpn.py, sketched below.
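
A simplified sketch of that computation (loosely following the compute_loss of torchvision's RegionProposalNetwork, but using F.smooth_l1_loss instead of the internal det_utils helper; normalization details may differ between versions):

import torch
import torch.nn.functional as F

def rpn_loss_sketch(objectness, pred_bbox_deltas, labels, regression_targets,
                    sampled_pos_inds, sampled_neg_inds):
    # objectness: (A, 1); pred_bbox_deltas, regression_targets: (A, 4); labels: (A,)
    # sampled_pos_inds / sampled_neg_inds: boolean masks over the A anchors
    sampled_inds = torch.where(sampled_pos_inds | sampled_neg_inds)[0]   # ~256 sampled anchors

    box_loss = F.smooth_l1_loss(
        pred_bbox_deltas[sampled_pos_inds],
        regression_targets[sampled_pos_inds],
        beta=1.0 / 9,
        reduction="sum",
    ) / sampled_inds.numel()

    objectness_loss = F.binary_cross_entropy_with_logits(
        objectness[sampled_inds].flatten(), labels[sampled_inds]
    )
    return objectness_loss, box_loss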

In the end, the RPN returns the 2000 boxes (proposals) that survived the two rounds of filtering, together with the RPN classification loss (over the 256 sampled anchors) and regression loss (over the n positive anchors).

Note that generating the boxes and computing the losses are completely decoupled: the anchors that enter the loss have not gone through any NMS and may not appear among the returned boxes at all.

That is the whole process behind proposals, proposal_losses = self.rpn(images, features, targets) in generalized_rcnn.py. proposals are the coordinates, on the original image, of the 2000 selected boxes after decoding, and proposal_losses are the classification and regression losses over the sampled anchors.

Next, the classification and regression losses for these 2000 proposals are computed.

From here on, the computation only involves the proposals. First, the 3 ground truths are concatenated with the 2000 proposals, giving a [2003, 4] tensor. The IoU between each of the 2003 proposals and each of the 3 ground truths is then computed, and for every proposal we obtain the value of its largest IoU and the index of the corresponding ground truth.

Proposals whose largest IoU is below 0.5 then have their match index set to -1. Note that index 0 is not background here; it is the first ground truth.

Then n proposals containing objects and 512−n proposals without objects are randomly selected.

When assigning a label to each proposal, every proposal first receives an index in [0, n-1] according to the ground truth it matched (proposals with IoU below 0.5 are temporarily also clamped to index 0). These indices are then converted, via the real class cls of each ground truth, into the class label in [1, m] of the matched ground truth; class 0 (background) does not occur among the ground truths. Note that at this point the background proposals have been mixed in with the class of ground truth 0; this is what the following lines of the RoIHeads class in roi_heads.py do:

clamped_matched_idxs_in_image = matched_idxs_in_image.clamp(min=0)
labels_in_image = gt_labels_in_image[clamped_matched_idxs_in_image]
labels_in_image = labels_in_image.to(dtype=torch.int64)

Next, the indices where matched_idxs_in_image equals -1 are found; these are the background proposals, and the class labels of those previously mixed-in background proposals are reset to 0:

bg_inds = matched_idxs_in_image == self.proposal_matcher.BELOW_LOW_THRESHOLD
labels_in_image[bg_inds] = 0

This returns, for every proposal, the index of its matched ground truth (here the background proposals are mixed in, clamped to index 0) and its class label (here background is correctly set to 0, so nothing is mixed in).

Then n positive and 512−n negative proposals are randomly selected, i.e. sampled_pos_inds, sampled_neg_inds = self.fg_bg_sampler(labels), and the indices of the selected proposals are recorded:

def subsample(self, labels):
    # type: (List[Tensor]) -> List[Tensor]
    sampled_pos_inds, sampled_neg_inds = self.fg_bg_sampler(labels)
    sampled_inds = []
    for img_idx, (pos_inds_img, neg_inds_img) in enumerate(
        zip(sampled_pos_inds, sampled_neg_inds)
    ):
        img_sampled_inds = torch.where(pos_inds_img | neg_inds_img)[0]
        sampled_inds.append(img_sampled_inds)
    return sampled_inds

and the indices of the corresponding proposals are returned. This is what sampled_inds = self.subsample(labels) does.

Next, for these selected proposals we gather their box coordinates, their true class labels (no background mix-up), their matched ground-truth indices (background mixed in) and the coordinates of their matched ground-truth boxes (background mixed in):

img_sampled_inds = sampled_inds[img_id]                   # indices of the sampled proposals
proposals[img_id] = proposals[img_id][img_sampled_inds]   # predicted box coordinates of the sampled proposals

labels[img_id] = labels[img_id][img_sampled_inds]         # class labels of the sampled proposals
matched_idxs[img_id] = matched_idxs[img_id][img_sampled_inds]

gt_boxes_in_image = gt_boxes[img_id]
if gt_boxes_in_image.numel() == 0:
    gt_boxes_in_image = torch.zeros((1, 4), dtype=dtype, device=device)
matched_gt_boxes.append(gt_boxes_in_image[matched_idxs[img_id]])  # ground-truth box coordinates matched to each sampled proposal

Then, for every selected proposal, the required translation and scaling are computed as its learning target, i.e. regression_targets = self.box_coder.encode(matched_gt_boxes, proposals). Note that the background proposals are included in this computation as well, so a (512, 4) tensor is returned.

That is what proposals, matched_idxs, labels, regression_targets = self.select_training_samples(proposals, targets) in roi_heads.py does. Note that both matched_idxs and regression_targets have background proposals mixed in.

Next comes the MultiScaleRoIAlign operation. Although different strides were used when generating the anchors, the RoIs are now all handled in a uniform way: RoIs with a large area are pooled from the smaller feature map (256, 25, 41), while RoIs with a small area are pooled from the larger feature map (256, 200, 328). MultiScaleRoIAlign therefore returns a (512, 256, 7, 7) feature. This is the box_features = self.box_roi_pool(features, proposals, image_shapes) step in roi_heads.py. The formula that decides which feature map a RoI is pooled from is:
k = ⌊k0 + log2(√(w·h) / 224)⌋, clamped to the available FPN levels (Eqn. (1) of the FPN paper, with canonical_scale = 224 and canonical_level k0 = 4).
In code:

class LevelMapper(object):
    """Determine which FPN level each RoI in a set of RoIs should map to based
    on the heuristic in the FPN paper.

    Arguments:
        k_min (int)
        k_max (int)
        canonical_scale (int)
        canonical_level (int)
        eps (float)
    """

    def __init__(
        self,
        k_min: int,
        k_max: int,
        canonical_scale: int = 224,
        canonical_level: int = 4,
        eps: float = 1e-6,
    ):
        self.k_min = k_min
        self.k_max = k_max
        self.s0 = canonical_scale
        self.lvl0 = canonical_level
        self.eps = eps

    def __call__(self, boxlists: List[Tensor]) -> Tensor:
        """
        Arguments:
            boxlists (list[BoxList])
        """
        # Compute level ids
        s = torch.sqrt(torch.cat([box_area(boxlist) for boxlist in boxlists]))

        # Eqn.(1) in FPN paper
        target_lvls = torch.floor(self.lvl0 + torch.log2(s / self.s0) + torch.tensor(self.eps, dtype=s.dtype))
        target_lvls = torch.clamp(target_lvls, min=self.k_min, max=self.k_max)
        return (target_lvls.to(torch.int64) - self.k_min).to(torch.int64)
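
As a usage check of the class above (assuming box_area from torchvision.ops and the typing imports are in scope): the canonical 224×224 RoI maps to P4, smaller RoIs to finer levels and larger RoIs to coarser levels, and the returned index is relative to k_min:

import torch

mapper = LevelMapper(k_min=2, k_max=5)             # FPN levels P2..P5
boxes = [torch.tensor([[0., 0., 112., 112.],       # small RoI      -> P3
                       [0., 0., 224., 224.],       # canonical RoI  -> P4
                       [0., 0., 448., 448.]])]     # large RoI      -> P5
print(mapper(boxes))   # tensor([1, 2, 3]): indices into the pooled feature list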

Next, the (512, 256, 7, 7) box_features coming out of MultiScaleRoIAlign are flattened and passed through two fully-connected layers, producing a (512, 1024) tensor; this is the TwoMLPHead operation. That is the box_features = self.box_head(box_features) step in roi_heads.py.
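
TwoMLPHead is just a flatten followed by two Linear layers; the sketch below mirrors torchvision's TwoMLPHead with in_channels = 256*7*7 and representation_size = 1024:

import torch.nn as nn
import torch.nn.functional as F

class TwoMLPHead(nn.Module):
    def __init__(self, in_channels=256 * 7 * 7, representation_size=1024):
        super().__init__()
        self.fc6 = nn.Linear(in_channels, representation_size)
        self.fc7 = nn.Linear(representation_size, representation_size)

    def forward(self, x):                 # x: (512, 256, 7, 7)
        x = x.flatten(start_dim=1)        # -> (512, 12544)
        x = F.relu(self.fc6(x))
        x = F.relu(self.fc7(x))           # -> (512, 1024)
        return x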

Next, this 1024-dimensional tensor goes through the two branches of FastRCNNPredictor, which turn it into per-class scores and predicted translation/scaling values:

class FastRCNNPredictor(nn.Module):
    """
    Standard classification + bounding box regression layers
    for Fast R-CNN.

    Arguments:
        in_channels (int): number of input channels
        num_classes (int): number of output classes (including background)
    """

    def __init__(self, in_channels, num_classes):
        super(FastRCNNPredictor, self).__init__()
        self.cls_score = nn.Linear(in_channels, num_classes)
        self.bbox_pred = nn.Linear(in_channels, num_classes * 4)

    def forward(self, x):
        if x.dim() == 4:
            assert list(x.shape[2:]) == [1, 1]
        x = x.flatten(start_dim=1)
        scores = self.cls_score(x)
        bbox_deltas = self.bbox_pred(x)

        return scores, bbox_deltas

This is the class_logits, box_regression = self.box_predictor(box_features) step in roi_heads.py.

Next, the Fast R-CNN classification and regression losses of each proposal are computed:

def fastrcnn_loss(class_logits, box_regression, labels, regression_targets):
    # type: (Tensor, Tensor, List[Tensor], List[Tensor]) -> Tuple[Tensor, Tensor]
    """
    Computes the loss for Faster R-CNN.

    Arguments:
        class_logits (Tensor)
        box_regression (Tensor)
        labels (list[BoxList])
        regression_targets (Tensor)

    Returns:
        classification_loss (Tensor)
        box_loss (Tensor)
    """

    labels = torch.cat(labels, dim=0)
    regression_targets = torch.cat(regression_targets, dim=0)

    classification_loss = F.cross_entropy(class_logits, labels)

    # get indices that correspond to the regression targets for
    # the corresponding ground truth labels, to be used with
    # advanced indexing
    sampled_pos_inds_subset = torch.where(labels > 0)[0]
    labels_pos = labels[sampled_pos_inds_subset]
    N, num_classes = class_logits.shape
    box_regression = box_regression.reshape(N, -1, 4)

    box_loss = det_utils.smooth_l1_loss(
        box_regression[sampled_pos_inds_subset, labels_pos],
        regression_targets[sampled_pos_inds_subset],
        beta=1 / 9,
        size_average=False,
    )
    box_loss = box_loss / labels.numel()

    return classification_loss, box_loss

For the classification loss, since the labels correctly mark the background proposals as class 0, all 512 proposals take part in the computation. For the regression loss, the proposals with label greater than 0, i.e. the non-background boxes, are selected first: sampled_pos_inds_subset = torch.where(labels > 0)[0]. At the same time, the class of each positive proposal's matched ground-truth box is taken, labels_pos = labels[sampled_pos_inds_subset], and from box_regression reshaped to (512, k, 4) the regression output belonging to that class is picked for the loss. For example, if a proposal's matched ground-truth class is 1, then box_regression[idx, 1] enters the computation. So instead of one shared box per proposal, a box is regressed for every class, and the prediction of the proposal's own class is the one that is trained. This is also why the regression targets have shape (n, 4) while the regression predictions have shape (n, k, 4), where n is the number of positive proposals and k the total number of classes (including background). The regression loss thus only considers the foreground proposals (label > 0), whereas the classification loss considers all of them.
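
A tiny worked example of that class-specific indexing (shapes are illustrative: 512 sampled proposals, k = 91 classes including background):

import torch

box_regression = torch.randn(512, 91 * 4)    # bbox output of FastRCNNPredictor
labels = torch.randint(0, 91, (512,))        # per-proposal class labels (0 = background)

box_regression = box_regression.reshape(512, 91, 4)      # one 4-vector per class
pos = torch.where(labels > 0)[0]                         # foreground proposals only
pred_for_gt_class = box_regression[pos, labels[pos]]     # (num_pos, 4): each proposal's own class
print(pred_for_gt_class.shape)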

This is the loss_classifier, loss_box_reg = fastrcnn_loss(class_logits, box_regression, labels, regression_targets) step in roi_heads.py.

Next, the mask (segmentation) loss is computed.

First, among the 512 proposals, the indices whose matched ground-truth box is not background are taken out, which gives the positive proposals and their matched ground-truth indices:

pos = torch.where(labels[img_id] > 0)[0]
mask_proposals.append(proposals[img_id][pos])
pos_matched_idxs.append(matched_idxs[img_id][pos])

These positive proposals, together with the feature list produced by the backbone earlier, go through a MultiScaleRoIAlign operation, which gives a (50, 256, 14, 14) feature, i.e. mask_features = self.mask_roi_pool(features, mask_proposals, image_shapes). The RoI-aligned feature is then passed through MaskRCNNHeads, which again outputs a (50, 256, 14, 14) feature:

MaskRCNNHeads(
  (mask_fcn1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (relu1): ReLU(inplace=True)
  (mask_fcn2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (relu2): ReLU(inplace=True)
  (mask_fcn3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (relu3): ReLU(inplace=True)
  (mask_fcn4): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (relu4): ReLU(inplace=True)
)

This is the mask_features = self.mask_head(mask_features) step in roi_heads.py.

Next, MaskRCNNPredictor upsamples once and produces a (50, n+1, 28, 28) feature, where n is the number of foreground classes (91 channels in total for COCO, including background), i.e. mask_logits = self.mask_predictor(mask_features):

MaskRCNNPredictor(
  (conv5_mask): ConvTranspose2d(256, 256, kernel_size=(2, 2), stride=(2, 2))
  (relu): ReLU(inplace=True)
  (mask_fcn_logits): Conv2d(256, 91, kernel_size=(1, 1), stride=(1, 1))
)

Next, the maskrcnn_loss is computed.

First, the ground-truth masks and the positive RoIs go through an RoIAlign, producing a (50, 28, 28) tensor: the learning target of each RoI. In other words, the value of the ground-truth mask after RoI pooling is computed for every RoI.
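
A sketch of that projection step, close to torchvision's project_masks_on_boxes: each RoI, paired with the index of its matched ground-truth mask, is used as the roi_align batch index into the stack of ground-truth masks:

import torch
from torchvision.ops import roi_align

def project_masks_on_boxes_sketch(gt_masks, boxes, matched_idxs, M=28):
    # gt_masks: (num_gt, H, W) binary masks; boxes: (num_pos, 4); matched_idxs: (num_pos,)
    matched_idxs = matched_idxs.to(boxes)
    rois = torch.cat([matched_idxs[:, None], boxes], dim=1)   # (num_pos, 5): [gt index, x1, y1, x2, y2]
    gt_masks = gt_masks[:, None].to(rois)                     # (num_gt, 1, H, W)
    return roi_align(gt_masks, rois, (M, M), 1.0)[:, 0]       # (num_pos, M, M) per-RoI mask targets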

Then, for each mask_logit in the (50, n+1, 28, 28) mask_logits, the channel corresponding to its class is taken out, giving (50, 28, 28), and its binary cross-entropy against the corresponding target masks gives mask_loss:

def maskrcnn_loss(mask_logits, proposals, gt_masks, gt_labels, mask_matched_idxs):
    # type: (Tensor, List[Tensor], List[Tensor], List[Tensor], List[Tensor]) -> Tensor
    """
    Arguments:
        proposals (list[BoxList])
        mask_logits (Tensor)
        targets (list[BoxList])

    Return:
        mask_loss (Tensor): scalar tensor containing the loss
    """

    discretization_size = mask_logits.shape[-1]
    labels = [gt_label[idxs] for gt_label, idxs in zip(gt_labels, mask_matched_idxs)]
    mask_targets = [
        project_masks_on_boxes(m, p, i, discretization_size)
        for m, p, i in zip(gt_masks, proposals, mask_matched_idxs)
    ]

    labels = torch.cat(labels, dim=0)
    mask_targets = torch.cat(mask_targets, dim=0)

    # torch.mean (in binary_cross_entropy_with_logits) doesn't
    # accept empty tensors, so handle it separately
    if mask_targets.numel() == 0:
        return mask_logits.sum() * 0

    mask_loss = F.binary_cross_entropy_with_logits(
        mask_logits[torch.arange(labels.shape[0], device=labels.device), labels], mask_targets
    )
    return mask_loss

This is the rcnn_loss_mask = maskrcnn_loss(mask_logits, mask_proposals, gt_masks, gt_labels, pos_matched_idxs) step in roi_heads.py. Note that computing the Mask R-CNN mask loss differs from an ordinary segmentation problem: an ordinary segmentation network rescales the feature map back to the original image and computes the loss there, whereas here the ground truth is projected onto the 28×28 feature map and the loss is computed there.

During training no detections are recorded; in the end only the losses are returned:

loss_dict = {
    'loss_classifier': tensor(0.2305, grad_fn=<NllLossBackward>),
    'loss_box_reg': tensor(0.0949, grad_fn=<DivBackward0>),
    'loss_mask': tensor(0.1284, grad_fn=<BinaryCrossEntropyWithLogitsBackward>),
    'loss_objectness': tensor(0.0333, grad_fn=<BinaryCrossEntropyWithLogitsBackward>),
    'loss_rpn_box_reg': tensor(0.0114, grad_fn=<DivBackward0>)
}

Here loss_objectness and loss_rpn_box_reg are the RPN losses.
That is the entire process behind loss_dict = model(images, targets) in engine.py: during training only the losses are returned, and during eval only the results are returned.

In eval mode, the predicted mask is first obtained at (28, 28) resolution, padded to (30, 30), then directly resized by interpolation to the size of its bounding box and mapped onto the original image via the box coordinates. So the whole Mask R-CNN pipeline never upsamples feature maps back to the input resolution: apart from the single ConvTranspose2d in MaskRCNNPredictor, the multi-level FPN features are ultimately shrunk to 28×28 by RoI pooling, rather than, as in traditional segmentation, progressively enlarging a small feature map back to the original image size.
