I previously wrote about how mmdetection creates a model; this time I take Cascade R-CNN as an example and look at how the network is actually built.

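Before getting into the network, look at the config file first. I mainly follow the official cascade_mask_rcnn_r50_fpn_1x.py to walk through the implementation; for what each config item means, see the detailed explanation of the parameters in mmdetection's configs.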

Creating the Cascade R-CNN network

First locate the file that defines Cascade R-CNN: mmdet/models/detectors/cascade_rcnn.py
I split the creation of the Cascade R-CNN network into 5 main parts.

backbone
neck
rpn_head
bbox_head
mask_head

backbone

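The backbone chosen for Cascade R-CNN is ResNet-50. It is created the same way as before: supported models are registered in the registry and then instantiated through the builder.
ResNet is defined in mmdet/models/backbones/resnet.py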
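
The register-then-build pattern mentioned above can be illustrated with a small sketch. This is not mmdetection's code: Registry, build_from_cfg and ToyResNet are hypothetical names made up for the example, and only the idea (look up cfg['type'] in a registry, pass the remaining keys to the constructor) is taken from the text.

```python
class Registry:
    """Hypothetical minimal registry: maps a class name to the class."""

    def __init__(self, name):
        self.name = name
        self.module_dict = {}

    def register_module(self, cls):
        # Used as a decorator; stores the class under its own name.
        self.module_dict[cls.__name__] = cls
        return cls


def build_from_cfg(cfg, registry):
    """Instantiate the class named by cfg['type'] with the remaining keys."""
    args = dict(cfg)  # copy so the caller's config is left untouched
    obj_type = args.pop('type')
    if obj_type not in registry.module_dict:
        raise KeyError(f'{obj_type} is not registered in {registry.name}')
    return registry.module_dict[obj_type](**args)


BACKBONES = Registry('backbone')


@BACKBONES.register_module
class ToyResNet:
    """Stand-in for a real backbone; it only records its arguments."""

    def __init__(self, depth, out_indices=(0, 1, 2, 3)):
        self.depth = depth
        self.out_indices = out_indices


backbone = build_from_cfg(
    dict(type='ToyResNet', depth=50, out_indices=(0, 1, 2, 3)), BACKBONES)
print(type(backbone).__name__, backbone.depth)
```

The point of the design is that the config file only needs to name a type as a string; adding a new backbone means registering one more class, without touching the builder.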

    def forward(self, x):
        x = self.conv1(x)
        x = self.norm1(x)
        x = self.relu(x)
        x = self.maxpool(x)
        outs = []
        for i, layer_name in enumerate(self.res_layers):
            res_layer = getattr(self, layer_name)
            x = res_layer(x)
            if i in self.out_indices:
                outs.append(x)
        if len(outs) == 1:
            return outs[0]
        else:
            return tuple(outs)

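In forward, outs collects the outputs of several stages: they are first gathered in a list and then converted to a tuple. Which stages are taken is determined by out_indices in the config.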

model = dict(
    type='CascadeRCNN',
    num_stages=3,
    pretrained='modelzoo://resnet50',
    backbone=dict(
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        frozen_stages=1,
        style='pytorch'),

The backbone has 4 stages, and the outputs of all of them are taken.
The backbone's main job is to extract image features.
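
To make the effect of out_indices concrete, here is a rough sketch that only computes the shapes of the selected stage outputs. The channel counts (256, 512, 1024, 2048) and strides (4, 8, 16, 32) are the usual ResNet-50 values and are assumed here rather than read from the source; backbone_output_shapes is a hypothetical helper.

```python
# Assumed ResNet-50 stage properties: output channels and cumulative stride.
STAGE_CHANNELS = (256, 512, 1024, 2048)
STAGE_STRIDES = (4, 8, 16, 32)


def backbone_output_shapes(height, width, out_indices=(0, 1, 2, 3)):
    """Return (channels, h, w) for every stage selected by out_indices."""
    shapes = []
    for i, (channels, stride) in enumerate(zip(STAGE_CHANNELS, STAGE_STRIDES)):
        if i in out_indices:
            # Ceil division: padded stride-2 layers round odd sizes up.
            shapes.append(
                (channels, -(-height // stride), -(-width // stride)))
    return shapes


for shape in backbone_output_shapes(800, 1216):
    print(shape)
```

With out_indices=(0, 1, 2, 3) this yields four feature maps whose channel counts are 256, 512, 1024 and 2048, one per stage.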

neck

This part mainly implements FPN (see: FPN explained).
First, the FPN-related part of the config file:

neck=dict(
        type='FPN',
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        num_outs=5),

in_channels matches the backbone outputs from before, and out_channels is the number of output channels.
FPN is defined in mmdet/models/necks/fpn.py; its __init__ method contains:

        for i in range(self.start_level, self.backbone_end_level):
            l_conv = ConvModule(
                in_channels[i],
                out_channels,
                1,
                normalize=normalize,
                bias=self.with_bias,
                activation=self.activation,
                inplace=False)
            fpn_conv = ConvModule(
                out_channels,
                out_channels,
                3,
                padding=1,
                normalize=normalize,
                bias=self.with_bias,
                activation=self.activation,
                inplace=False)

            self.lateral_convs.append(l_conv)
            self.fpn_convs.append(fpn_conv)

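Here self.start_level is 0 and self.backbone_end_level is len(in_channels), so lateral_convs and fpn_convs have the same length as the list of inputs.
One way to understand it: the backbone outputs feature maps at several levels, and each level gets its own ConvModule that brings it to a common channel count; that is what makes it possible to fuse high-level and low-level features (the fusion itself, a top-down upsample-and-add, is not part of the excerpt shown here). It may sound convoluted, but it is easier to follow alongside the code.
Below is part of the forward function.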

        # build laterals
        laterals = [
            lateral_conv(inputs[i + self.start_level])
            for i, lateral_conv in enumerate(self.lateral_convs)
        ]
        # part 1: from original levels
        outs = [
            self.fpn_convs[i](laterals[i]) for i in range(used_backbone_levels)
        ]

This part can still be seen as feature extraction; object detection proper starts with the RPN part below.
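
The forward excerpt for the neck shows the lateral convs and the output convs but not how the levels are combined. Assuming the standard FPN top-down pathway (each coarser level is upsampled 2x and added to the next finer one), the fusion can be sketched on toy 1-D feature rows. upsample_nearest_2x and top_down_fuse are hypothetical helpers, not functions from fpn.py.

```python
def upsample_nearest_2x(row):
    """Nearest-neighbour upsampling of a 1-D feature row by a factor of 2."""
    return [v for v in row for _ in range(2)]


def top_down_fuse(laterals):
    """Add each coarser level, upsampled 2x, into the next finer level.

    laterals[0] is the finest level; every following level is half as long.
    """
    fused = [list(row) for row in laterals]  # do not modify the input
    for i in range(len(fused) - 1, 0, -1):   # walk from coarsest to finest
        up = upsample_nearest_2x(fused[i])
        fused[i - 1] = [a + b for a, b in zip(fused[i - 1], up)]
    return fused


# Three toy levels that already share one channel (after the lateral convs).
laterals = [[1.0] * 8, [10.0] * 4, [100.0] * 2]
fused = top_down_fuse(laterals)
print(fused[0])
print(fused[1])
```

Because the loop runs from the coarsest level down, the finest level ends up containing contributions from every level above it.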

RPN HEAD

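At first glance the rpn_head of Cascade R-CNN looks quite simple, because it consists mainly of two small branches. Two files are involved, mmdet/models/anchor_heads/anchor_head.py and mmdet/models/anchor_heads/rpn_head.py; the class in the latter is a subclass of the one in the former.
First, the relevant config items: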

rpn_head=dict(
        type='RPNHead',
        in_channels=256,
        feat_channels=256,
        anchor_scales=[8],
        anchor_ratios=[0.5, 1.0, 2.0],
        anchor_strides=[4, 8, 16, 32, 64],
        target_means=[.0, .0, .0, .0],
        target_stds=[1.0, 1.0, 1.0, 1.0],
        use_sigmoid_cls=True),

The main implementation of rpn_head is as follows:

    # define the layers
    def _init_layers(self):
        self.rpn_conv = nn.Conv2d(
            self.in_channels, self.feat_channels, 3, padding=1)
        self.rpn_cls = nn.Conv2d(self.feat_channels,
                                 self.num_anchors * self.cls_out_channels, 1)
        self.rpn_reg = nn.Conv2d(self.feat_channels, self.num_anchors * 4, 1)
    # forward pass on a single feature level
    def forward_single(self, x):
        x = self.rpn_conv(x)
        x = F.relu(x, inplace=True)
        rpn_cls_score = self.rpn_cls(x)
        rpn_bbox_pred = self.rpn_reg(x)
        return rpn_cls_score, rpn_bbox_pred

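It is simple: there are only two branches, one that decides whether an anchor is foreground (rpn_cls) and one that predicts the corrections to the box (rpn_reg). Here self.num_anchors = len(self.anchor_ratios) * len(self.anchor_scales).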
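
As a small worked example of the num_anchors formula, the sketch below builds the base anchors for one feature level from the config values above (anchor_scales=[8], anchor_ratios=[0.5, 1.0, 2.0], stride 4). The convention that a ratio means height / width and that the area is kept constant per scale is an assumption, and base_anchors is a hypothetical helper rather than mmdetection's anchor generator.

```python
import math


def base_anchors(base_size, scales, ratios):
    """Return (x1, y1, x2, y2) anchors centred on a base_size x base_size cell.

    Assumption: ratio = height / width, and the area is the same for all
    ratios of one scale.
    """
    ctr = 0.5 * (base_size - 1)
    anchors = []
    for ratio in ratios:
        h_ratio = math.sqrt(ratio)
        w_ratio = 1.0 / h_ratio
        for scale in scales:
            w = base_size * w_ratio * scale
            h = base_size * h_ratio * scale
            anchors.append((ctr - 0.5 * (w - 1), ctr - 0.5 * (h - 1),
                            ctr + 0.5 * (w - 1), ctr + 0.5 * (h - 1)))
    return anchors


anchor_scales = [8]
anchor_ratios = [0.5, 1.0, 2.0]
num_anchors = len(anchor_ratios) * len(anchor_scales)
anchors = base_anchors(4, anchor_scales, anchor_ratios)
print(num_anchors, num_anchors * 4)  # anchors per location, rpn_reg channels
```

With these values there are 3 anchors per location, so rpn_reg has 3 * 4 = 12 output channels.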
The goal of the RPN, however, is to produce proposals, so another function in anchor_head.py is needed as well: get_bboxes()

    def get_bboxes(self, cls_scores, bbox_preds, img_metas, cfg,
                   rescale=False):
        assert len(cls_scores) == len(bbox_preds)
        num_levels = len(cls_scores)

        mlvl_anchors = [
            self.anchor_generators[i].grid_anchors(cls_scores[i].size()[-2:], self.anchor_strides[i])
            for i in range(num_levels)
        ]
        result_list = []
        for img_id in range(len(img_metas)):
            cls_score_list = [
                cls_scores[i][img_id].detach() for i in range(num_levels)
            ]
            bbox_pred_list = [
                bbox_preds[i][img_id].detach() for i in range(num_levels)
            ]
            img_shape = img_metas[img_id]['img_shape']
            scale_factor = img_metas[img_id]['scale_factor']
            proposals = self.get_bboxes_single(cls_score_list, bbox_pred_list,
                                               mlvl_anchors, img_shape,
                                               scale_factor, cfg, rescale)
            result_list.append(proposals)
        return result_list

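Here self.anchor_generators[i].grid_anchors() first produces all the anchor boxes, and self.get_bboxes_single() then uses the RPN outputs from before to obtain the proposal boxes.
Inside self.get_bboxes_single(), 2000 anchors are first taken at each scale and concatenated to form the anchors for the image; applying NMS (thr=0.7) to these anchor boxes yields the required proposals.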
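
The "take the top candidates, then apply NMS" step can be sketched in plain Python. This is a simplified illustration, not the library's get_bboxes_single(): iou and select_proposals are hypothetical helpers, and the defaults 2000 and 0.7 are simply the numbers quoted in the text.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_proposals(boxes, scores, pre_top_n=2000, iou_thr=0.7):
    """Keep the pre_top_n highest-scoring boxes, then apply greedy NMS."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    order = order[:pre_top_n]
    keep = []
    for i in order:
        # A box survives only if it does not overlap a kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (0, 1, 10, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(select_proposals(boxes, scores))
```

In the toy input the second box overlaps the first with an IoU of about 0.82, so it is suppressed, while the third box does not overlap anything and is kept.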

This part also has its own loss, which is fairly involved, so I will cover it later when I write about the losses.

assigners and samplers

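The previous step, the RPN, outputs a pile of proposals, but before they can be used for training they still have to be split into positive and negative samples. That is the job of the assigners.
By default Cascade R-CNN uses MaxIoUAssigner, defined in mmdet/core/bbox/assigners/max_iou_assigner.py; the main entry point is assign()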

    def assign(self, bboxes, gt_bboxes, gt_bboxes_ignore=None, gt_labels=None):
        """Assign gt to bboxes.

        This method assign a gt bbox to every bbox (proposal/anchor), each bbox
        will be assigned with -1, 0, or a positive number. -1 means don't care,
        0 means negative sample, positive number is the index (1-based) of
        assigned gt.
        The assignment is done in following steps, the order matters.

        1. assign every bbox to -1
        2. assign proposals whose iou with all gts < neg_iou_thr to 0
        3. for each bbox, if the iou with its nearest gt >= pos_iou_thr,
           assign it to that bbox
        4. for each gt bbox, assign its nearest proposals (may be more than
           one) to itself

        Args:
            bboxes (Tensor): Bounding boxes to be assigned, shape(n, 4).
            gt_bboxes (Tensor): Groundtruth boxes, shape (k, 4).
            gt_bboxes_ignore (Tensor, optional): Ground truth bboxes that are
                labelled as `ignored`, e.g., crowd boxes in COCO.
            gt_labels (Tensor, optional): Label of gt_bboxes, shape (k, ).

        Returns:
            :obj:`AssignResult`: The assign result.
        """
        if bboxes.shape[0] == 0 or gt_bboxes.shape[0] == 0:
            raise ValueError('No gt or bboxes')
        bboxes = bboxes[:, :4]
        overlaps = bbox_overlaps(gt_bboxes, bboxes)

        if (self.ignore_iof_thr > 0) and (gt_bboxes_ignore is not None) and (
                gt_bboxes_ignore.numel() > 0):
            if self.ignore_wrt_candidates:
                ignore_overlaps = bbox_overlaps(
                    bboxes, gt_bboxes_ignore, mode='iof')
                ignore_max_overlaps, _ = ignore_overlaps.max(dim=1)
            else:
                ignore_overlaps = bbox_overlaps(
                    gt_bboxes_ignore, bboxes, mode='iof')
                ignore_max_overlaps, _ = ignore_overlaps.max(dim=0)
            overlaps[:, ignore_max_overlaps > self.ignore_iof_thr] = -1

        assign_result = self.assign_wrt_overlaps(overlaps, gt_labels)
        return assign_result
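
The four steps listed in the docstring above can be sketched on a plain IoU matrix. This is an illustration of the described rules, not the library's assign_wrt_overlaps(): assign_by_overlaps is a hypothetical helper, and the threshold values (0.7, 0.3, 0.3) are assumed for the example since the source does not show them.

```python
def assign_by_overlaps(overlaps, pos_iou_thr=0.7, neg_iou_thr=0.3,
                       min_pos_iou=0.3):
    """overlaps[g][b] is the IoU between gt g and bbox b.

    Returns one entry per bbox: -1 ignore, 0 negative, g + 1 for gt g.
    """
    num_gts, num_bboxes = len(overlaps), len(overlaps[0])
    assigned = [-1] * num_bboxes                      # step 1
    for b in range(num_bboxes):
        column = [overlaps[g][b] for g in range(num_gts)]
        max_iou = max(column)
        if 0 <= max_iou < neg_iou_thr:                # step 2
            assigned[b] = 0
        elif max_iou >= pos_iou_thr:                  # step 3
            assigned[b] = column.index(max_iou) + 1
    for g in range(num_gts):                          # step 4
        best = max(overlaps[g])
        if best >= min_pos_iou:
            for b in range(num_bboxes):
                if overlaps[g][b] == best:
                    assigned[b] = g + 1
    return assigned


overlaps = [[0.8, 0.1, 0.5, 0.2],
            [0.1, 0.2, 0.4, 0.6]]
print(assign_by_overlaps(overlaps))
```

Step 4 is what rescues the second gt here: its best proposal only reaches 0.6, below the positive threshold, but it is still assigned so that every gt has at least one match.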

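Once the proposals are split into positive and negative samples, a sampler samples from them to produce the sampling result used for training.
By default Cascade R-CNN uses RandomSampler, defined in mmdet/core/bbox/samplers/random_sampler.py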

    @staticmethod
    def random_choice(gallery, num):
        """Random select some elements from the gallery.

        It seems that Pytorch's implementation is slower than numpy so we use
        numpy to randperm the indices.
        """
        assert len(gallery) >= num
        if isinstance(gallery, list):
            gallery = np.array(gallery)
        cands = np.arange(len(gallery))
        np.random.shuffle(cands)
        rand_inds = cands[:num]
        if not isinstance(gallery, np.ndarray):
            rand_inds = torch.from_numpy(rand_inds).long().to(gallery.device)
        return gallery[rand_inds]

    def _sample_pos(self, assign_result, num_expected, **kwargs):
        """Randomly sample some positive samples."""
        pos_inds = torch.nonzero(assign_result.gt_inds > 0)
        if pos_inds.numel() != 0:
            pos_inds = pos_inds.squeeze(1)
        if pos_inds.numel() <= num_expected:
            return pos_inds
        else:
            return self.random_choice(pos_inds, num_expected)

    def _sample_neg(self, assign_result, num_expected, **kwargs):
        """Randomly sample some negative samples."""
        neg_inds = torch.nonzero(assign_result.gt_inds == 0)
        if neg_inds.numel() != 0:
            neg_inds = neg_inds.squeeze(1)
        if len(neg_inds) <= num_expected:
            return neg_inds
        else:
            return self.random_choice(neg_inds, num_expected)

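It overrides the two sample functions, _sample_pos and _sample_neg, which the parent class calls.
The method that is mainly used is sample, defined in its parent class in mmdet/core/bbox/samplers/base_sampler.py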

    def sample(self,
               assign_result,
               bboxes,
               gt_bboxes,
               gt_labels=None,
               **kwargs):
        """Sample positive and negative bboxes.

        This is a simple implementation of bbox sampling given candidates,
        assigning results and ground truth bboxes.

        Args:
            assign_result (:obj:`AssignResult`): Bbox assigning results.
            bboxes (Tensor): Boxes to be sampled from.
            gt_bboxes (Tensor): Ground truth bboxes.
            gt_labels (Tensor, optional): Class labels of ground truth bboxes.

        Returns:
            :obj:`SamplingResult`: Sampling result.
        """
        bboxes = bboxes[:, :4]

        gt_flags = bboxes.new_zeros((bboxes.shape[0], ), dtype=torch.uint8)
        if self.add_gt_as_proposals:
            bboxes = torch.cat([gt_bboxes, bboxes], dim=0)
            assign_result.add_gt_(gt_labels)
            gt_ones = bboxes.new_ones(gt_bboxes.shape[0], dtype=torch.uint8)
            gt_flags = torch.cat([gt_ones, gt_flags])

        num_expected_pos = int(self.num * self.pos_fraction)
        pos_inds = self.pos_sampler._sample_pos(
            assign_result, num_expected_pos, bboxes=bboxes, **kwargs)
        # We found that sampled indices have duplicated items occasionally.
        # (may be a bug of PyTorch)
        pos_inds = pos_inds.unique()
        num_sampled_pos = pos_inds.numel()
        num_expected_neg = self.num - num_sampled_pos
        if self.neg_pos_ub >= 0:
            _pos = max(1, num_sampled_pos)
            neg_upper_bound = int(self.neg_pos_ub * _pos)
            if num_expected_neg > neg_upper_bound:
                num_expected_neg = neg_upper_bound
        neg_inds = self.neg_sampler._sample_neg(
            assign_result, num_expected_neg, bboxes=bboxes, **kwargs)
        neg_inds = neg_inds.unique()

        return SamplingResult(pos_inds, neg_inds, bboxes, gt_bboxes,
                              assign_result, gt_flags)

The bboxes are now ready; the next step is to send them to the bbox head and the mask head.
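
The counting logic of sample() (a quota for positives, the remainder for negatives, optionally capped by neg_pos_ub) can be sketched without tensors. sample_indices is a hypothetical helper; the defaults num=512 and pos_fraction=0.25 are assumed values for illustration, and Python's random module stands in for the numpy shuffle used in the original.

```python
import random


def sample_indices(gt_inds, num=512, pos_fraction=0.25, neg_pos_ub=-1,
                   seed=0):
    """Pick positive (gt_ind > 0) and negative (gt_ind == 0) indices."""
    rng = random.Random(seed)
    pos = [i for i, g in enumerate(gt_inds) if g > 0]
    neg = [i for i, g in enumerate(gt_inds) if g == 0]

    num_expected_pos = int(num * pos_fraction)
    if len(pos) > num_expected_pos:
        pos = rng.sample(pos, num_expected_pos)

    # Negatives fill whatever is left of the budget.
    num_expected_neg = num - len(pos)
    if neg_pos_ub >= 0:
        upper = int(neg_pos_ub * max(1, len(pos)))
        num_expected_neg = min(num_expected_neg, upper)
    if len(neg) > num_expected_neg:
        neg = rng.sample(neg, num_expected_neg)
    return sorted(pos), sorted(neg)


gt_inds = [1] * 10 + [0] * 100 + [-1] * 5
pos, neg = sample_indices(gt_inds, num=16)
print(len(pos), len(neg))
```

Entries with -1 are never sampled, which matches their "don't care" meaning in the assigner.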

bbox head

The boxes obtained so far still cannot be fed to the bbox head directly: first an RoI pooling step maps boxes of different sizes to a fixed size.
It is defined in mmdet/models/roi_extractors/single_level.py

    def forward(self, feats, rois):
        if len(feats) == 1:
            return self.roi_layers[0](feats[0], rois)

        out_size = self.roi_layers[0].out_size
        num_levels = len(feats)
        target_lvls = self.map_roi_levels(rois, num_levels)
        roi_feats = torch.cuda.FloatTensor(rois.size()[0], self.out_channels,
                                           out_size, out_size).fill_(0)
        for i in range(num_levels):
            inds = target_lvls == i
            if inds.any():
                rois_ = rois[inds, :]
                roi_feats_t = self.roi_layers[i](feats[i], rois_)
                roi_feats[inds] += roi_feats_t
        return roi_feats

The roi_layers here are RoIAlign, and the RoI features they produce can then be sent to the bbox head.
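
The forward above relies on map_roi_levels to decide which feature level each RoI is pooled from, but that method is not shown. A plausible sketch, assuming the usual rule that larger boxes go to coarser levels via log2 of the box scale relative to a finest_scale of 56 (both the rule and the value are assumptions here), looks like this; map_roi_level is a hypothetical single-box version.

```python
import math


def map_roi_level(box, num_levels, finest_scale=56):
    """Map one (x1, y1, x2, y2) box to a feature level in [0, num_levels)."""
    x1, y1, x2, y2 = box
    scale = math.sqrt((x2 - x1 + 1) * (y2 - y1 + 1))
    level = math.floor(math.log2(scale / finest_scale + 1e-6))
    # Clamp so that tiny boxes use level 0 and huge boxes the last level.
    return min(max(level, 0), num_levels - 1)


for box in [(0, 0, 31, 31), (0, 0, 111, 111), (0, 0, 223, 223),
            (0, 0, 999, 999)]:
    print(box, map_roi_level(box, num_levels=4))
```

Small boxes are pooled from the high-resolution levels and large boxes from the low-resolution ones, which is why the loop in forward handles each level separately.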
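The bbox head does much the same as the RPN part: it classifies each box and refines its coordinates. The RPN distinguishes only foreground and background, whereas here there are N+1 classes (the actual classes plus background). The code is in mmdet/models/bbox_heads/convfc_bbox_head.py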

    def forward(self, x):
        # shared part
        if self.num_shared_convs > 0:
            for conv in self.shared_convs:
                x = conv(x)

        if self.num_shared_fcs > 0:
            if self.with_avg_pool:
                x = self.avg_pool(x)
            x = x.view(x.size(0), -1)
            for fc in self.shared_fcs:
                x = self.relu(fc(x))
        # separate branches
        x_cls = x
        x_reg = x

        for conv in self.cls_convs:
            x_cls = conv(x_cls)
        if x_cls.dim() > 2:
            if self.with_avg_pool:
                x_cls = self.avg_pool(x_cls)
            x_cls = x_cls.view(x_cls.size(0), -1)
        for fc in self.cls_fcs:
            x_cls = self.relu(fc(x_cls))

        for conv in self.reg_convs:
            x_reg = conv(x_reg)
        if x_reg.dim() > 2:
            if self.with_avg_pool:
                x_reg = self.avg_pool(x_reg)
            x_reg = x_reg.view(x_reg.size(0), -1)
        for fc in self.reg_fcs:
            x_reg = self.relu(fc(x_reg))

        cls_score = self.fc_cls(x_cls) if self.with_cls else None
        bbox_pred = self.fc_reg(x_reg) if self.with_reg else None
        return cls_score, bbox_pred

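The outputs of forward are the classification scores and the coordinate predictions for the boxes.
These two results are then used to compute the bbox loss, which I will also cover later.
Next comes the other branch, parallel to the bbox head: the mask head.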
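
To show what "coordinate refinement" means in practice, here is a sketch of decoding a regression output into a box. It assumes the standard R-CNN parameterization (dx, dy shift the centre relative to the box size, dw, dh scale the size through exp) and uses box width as x2 - x1 for simplicity; apply_deltas is a hypothetical helper, and the means and stds mirror the target_means and target_stds style of the config.

```python
import math


def apply_deltas(box, deltas, means=(0.0, 0.0, 0.0, 0.0),
                 stds=(1.0, 1.0, 1.0, 1.0)):
    """Refine an (x1, y1, x2, y2) box with predicted (dx, dy, dw, dh)."""
    # Undo the normalisation applied to the regression targets.
    dx, dy, dw, dh = (d * s + m for d, s, m in zip(deltas, stds, means))
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    new_cx, new_cy = cx + dx * w, cy + dy * h
    new_w, new_h = w * math.exp(dw), h * math.exp(dh)
    return (new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
            new_cx + 0.5 * new_w, new_cy + 0.5 * new_h)


print(apply_deltas((0, 0, 10, 20), (0.0, 0.0, 0.0, 0.0)))
print(apply_deltas((0, 0, 10, 20), (0.1, 0.0, 0.0, 0.0)))
```

All-zero deltas leave the box unchanged, and a dx of 0.1 shifts a 10-pixel-wide box by one pixel to the right.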

mask head

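The mask part follows the same flow as the bbox part: an RoI pooling step is first applied to the proposals. The RoI extractor is the same as for the bbox branch, only some parameters differ.
The head is defined in mmdet/models/mask_heads/fcn_mask_head.py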

    def forward(self, x):
        for conv in self.convs:
            x = conv(x)
        if self.upsample is not None:
            x = self.upsample(x)
            if self.upsample_method == 'deconv':
                x = self.relu(x)
        mask_pred = self.conv_logits(x)
        return mask_pred

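The output of forward is a per-pixel classification value, which is likewise used to compute the mask loss.
Neither the bbox head output nor this forward output is the final result at test time; further processing is needed to get the test results. I will cover that later when I write about testing.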
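
As a hint of what that later processing involves for masks, one common step is to turn the per-pixel logits into probabilities with a sigmoid and threshold them. The sketch below assumes exactly that and nothing more (no resizing to the box, no pasting into the image); binarize_mask and the 0.5 threshold are illustrative choices, not taken from the source.

```python
import math


def sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def binarize_mask(mask_logits, threshold=0.5):
    """Turn a 2-D grid of logits into a 0/1 mask."""
    return [[1 if sigmoid(v) > threshold else 0 for v in row]
            for row in mask_logits]


logits = [[-2.0, 0.3],
          [4.0, -0.1]]
print(binarize_mask(logits))
```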

Summary

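This post covered how the Cascade R-CNN network is created in mmdetection. My original plan was to dig into the details slowly and cover every part, but when I actually read the code it turned out to be too complex, so I first wrote up the overall flow, which amounts to the skeleton. After finishing the loss and test parts, I plan to come back and go through the details of each part.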

mmdetection source code reading notes (1): creating the network