Mask R-CNN Reading Notes


The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition.
On top of Faster R-CNN, a branch is added to predict masks, which gives efficient object detection together with fine-grained segmentation.
In instance segmentation, bounding-box object detection, and person keypoint detection, Mask R-CNN outperforms all existing single-model entries.

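  • Loss
    L = L_cls + L_box + L_mask
    When computing the mask loss, only the k-th mask is used, where k is the ground-truth class of the RoI; the masks predicted for the other classes do not contribute.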
    This is different from common practice when applying FCNs to semantic segmentation, which typically uses a per-pixel softmax and a multinomial cross-entropy loss. In that case, masks across classes compete; in our case, with a per-pixel sigmoid and a binary loss, they do not.
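The class-decoupled mask loss described above can be sketched in a few lines. This is a minimal pure-Python illustration, not the authors' implementation: the function names `bce_with_logits` and `mask_loss`, the nested-list layout of the logits, and the choice to average over pixels are assumptions made for this example.

```python
import math

def bce_with_logits(z, y):
    """Numerically stable binary cross-entropy for one logit z and target y in {0, 1}."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

def mask_loss(mask_logits, gt_mask, gt_class):
    """Average per-pixel sigmoid/BCE loss on the ground-truth class mask only.

    mask_logits: list of K masks, each an m x m nested list of logits (assumed layout)
    gt_mask:     m x m nested list of 0/1 labels
    gt_class:    index k of the ground-truth class
    """
    logits_k = mask_logits[gt_class]  # masks of all other classes are ignored
    total, count = 0.0, 0
    for row_logits, row_gt in zip(logits_k, gt_mask):
        for z, y in zip(row_logits, row_gt):
            total += bce_with_logits(z, y)
            count += 1
    return total / count
```

Because only `mask_logits[gt_class]` is read, changing the logits of any other class leaves the loss unchanged, which is the "masks do not compete" property. A per-pixel softmax across the K masks would couple them instead.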

  • RoIAlign
    This pixel-to-pixel behavior requires our RoI features, which themselves are small feature maps, to be well aligned to faithfully preserve the explicit per-pixel spatial correspondence. This motivated us to develop the following RoIAlign layer that plays a key role in mask prediction.
    RoIPool first quantizes a floating-number RoI to the discrete granularity of the feature map; this quantized RoI is then subdivided into spatial bins, which are themselves quantized; and finally the feature values covered by each bin are aggregated (usually by max pooling).
    These quantizations introduce misalignments between the RoI and the extracted features.
    We avoid any quantization of the RoI boundaries or bins (i.e., we use x/16 instead of [x/16], where 16 is the feature-map stride and [·] denotes rounding). We use bilinear interpolation to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and aggregate the result (using max or average).
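The RoIAlign procedure above can be sketched for a single-channel feature map as follows. This is a simplified illustration under stated assumptions, not the reference implementation: the names `bilinear` and `roi_align`, the nested-list feature map, the default stride of 16, the average aggregation, and the convention that integer feature indices are the sample grid (no half-pixel offset) are all choices made for this example.

```python
def bilinear(feat, y, x):
    """Bilinearly sample a 2-D feature map (list of rows) at continuous (y, x)."""
    h, w = len(feat), len(feat[0])
    # Clamp to the valid range so border samples stay well defined.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feat[y0][x0] * (1 - dx) + feat[y0][x1] * dx
    bottom = feat[y1][x0] * (1 - dx) + feat[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def roi_align(feat, roi, out_size=2, stride=16, samples=2):
    """Pool one RoI into an out_size x out_size grid without any rounding.

    roi is (x1, y1, x2, y2) in image pixels. Each bin averages
    samples x samples regularly spaced points (2 x 2 = four per bin by default).
    """
    # Map to feature-map coordinates: x / stride, not round(x / stride).
    x1, y1, x2, y2 = (c / stride for c in roi)
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    out = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            vals = []
            for sy in range(samples):
                for sx in range(samples):
                    yy = y1 + (i + (sy + 0.5) / samples) * bin_h
                    xx = x1 + (j + (sx + 0.5) / samples) * bin_w
                    vals.append(bilinear(feat, yy, xx))
            row.append(sum(vals) / len(vals))  # average aggregation
        out.append(row)
    return out
```

A RoIPool-style variant would round `x1, y1, x2, y2` and the bin edges to integers before pooling, which is where the misalignment noted earlier comes from.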
