READING NOTE: SSD: Single Shot MultiBox Detector

TITLE: SSD: Single Shot MultiBox Detector##

AUTHER: Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg

FROM: arXiv:1512.02325v2

CONTRIBUTIONS

  1. SSD, a single-shot detector for multiple categories is introduced that is fast and accurate.
  2. The network is easy to train, simple end-to-end training and high accuracy, even with relatively low resolution input images, further improving the speed vs accuracy trade-off.

METHOD

READING NOTE: SSD: Single Shot MultiBox Detector_第1张图片

Network structure:

  1. Multiple scale feature maps from different layers are used in order to handle objects with different sizes.
  2. On each feature map used for detectoin, an unique small network (filter) is utilized to learn to predict category scores and location offsets.
  3. Each feature map corresponds to a fixed set of default boxes. These default boxes have different aspect ratios.

Training:

  1. Default and groundtruth boxes are matched. Each ground truth box is matched to the default box with the best jaccard overlap. On the other hand default boxes are matched to any ground truth with jaccard overlap higher than a threshold.
  2. The training objective is is a weighted sum of the localization loss (loc) and the confidence loss (conf):
    L(x,c,l,g)=1N(Lconf(x,c)+αLloc(x,l,g))

    where N is the number of matched default boxes, and the localization loss is the Smooth L1 loss between the predicted box (l) and the ground truth box (g) parameters. Confidence loss is the softmax loss over multiple classes confidences (c) .
  3. The scale of the default boxes for each (kth) feature map is computed as:
    sk=smin+smaxsminm1(k1)

    where smin=0.2 and smax=0.95 . The width of default box is skar and the height is sk/ar where ar is the aspect ratio. The centre of a default box at location of (i,j) in the kth feature map is (i+0.5|fk|,j+0.5|fk|) .
  4. Hard negatives are extracted. The unmatched default boxes are sorted according to confidence and top ones are used as hard negatives so that the ratio between the negatives and positives is at most 3:1.
  5. Data augmentation is done by using the entire original input image and sampling a patch so that the minimum jaccard overlap with the objects is 0.1, 0.3, 0.5, 0.7, or 0.9.

ADVANTAGES

  1. It is fast because only one shot is utilized and the input is of lower resolution.
  2. Multiple scale feature maps are used so that it can handle objects with different sizes.
  3. End-to-end training.

你可能感兴趣的:(READING NOTE: SSD: Single Shot MultiBox Detector)