Paper reading: Rich feature hierarchies for accurate object detection and semantic segmentation

Author: Ross Gir
Source: CVPR 2014
Remarks: RCNN
Paper link: PDF

1 Problem

  • labeled data is scarce.

3 The proposed method

A simple and scalable detection algorithm: RCNN.

  • features:
    1. High-capacity convolutional neural networks can be applied to bottom-up region proposals in order to localize and segment objects.
    2. When labeled training data is scarce, supervised pre-training followed by domain-specific fine-tuning yields a significant performance boost.
  • contributions:
    1. The first work to show that CNNs can deliver high performance in object detection.

pipeline:
region proposal (selective search) --> feature extraction (CNN) --> classification (SVM)
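
A rough sketch of how the three stages chain together, assuming each stage is supplied as a plain callable. The names `rcnn_detect`, `propose`, `extract`, and `score_fns` are hypothetical stand-ins for selective search, the CNN, and the per-class SVMs; none of them come from the paper's code.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def rcnn_detect(
    image: object,
    propose: Callable[[object], Sequence[Box]],
    extract: Callable[[object, Box], Sequence[float]],
    score_fns: Dict[str, Callable[[Sequence[float]], float]],
) -> List[Tuple[Box, str, float]]:
    """Run the three stages in order and return (box, class, score) triples."""
    detections = []
    for box in propose(image):                 # 1. region proposals
        feat = extract(image, box)             # 2. one feature vector per region
        for cls, score in score_fns.items():   # 3. one scorer per class
            detections.append((box, cls, score(feat)))
    return detections
```

Note that the feature extractor runs once per proposal, which is the main reason the weaknesses listed later (slow detection, expensive training) arise.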

3.1 Advantages:

  1. State of the art: outperforms the previous best result by a relative 30%.
  2. Quite efficient: the feature vectors are two orders of magnitude lower-dimensional than those of the competing work (selective search).

3.2 Related Work

  1. J. Uijlings et al. Selective search for object recognition. IJCV, 2013. (RCNN uses selective search for region proposals; this is also the competing work.)
  2. C. Szegedy, A. Toshev, and D. Erhan. Deep neural networks for object detection. In NIPS, 2013.
    The authors point out that Szegedy et al. frame localization as a regression problem, and that this strategy does not fare well in practice.

3.5 Training Methodology

  • Supervised pre-training: they pre-train the CNN on the ILSVRC2012 dataset (with class labels but no bounding boxes).
  • Domain-specific fine-tuning: the CNN is then trained with stochastic gradient descent on VOC; the output dimension is 21 (20 VOC classes plus background).
  • Object category classifiers: to separate positive from negative examples, they set an IoU overlap threshold; proposals whose overlap falls below it are treated as negatives.
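
The IoU-based labelling in the last bullet can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names `iou` and `is_negative` are made up here, boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the default threshold of 0.3 is an assumption (these notes do not state the value).

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_negative(proposal, gt_boxes, neg_thresh=0.3):
    """A proposal is a negative if its best IoU with any ground-truth
    box of the class is below neg_thresh (assumed value)."""
    best = max((iou(proposal, g) for g in gt_boxes), default=0.0)
    return best < neg_thresh
```

For example, two 10x10 boxes shifted by half their width overlap with IoU 1/3, so with the assumed threshold the proposal would not count as a negative.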

3.6 Experiments

  • Visualizing learned features: they propose a simple non-parametric method that directly shows what the network has learned (compared with ZF-Net).
  • Conclusions:
    1. 94% of RCNN’s parameters (mostly in the fully connected layers) can be removed with only a moderate drop in detection accuracy.
      [table image not available]
      As the table above shows, adding fc6 and fc7 improves mAP only slightly, and without fine-tuning the fc7 mAP even drops. This indicates that the convolutional layers are decisive even though they hold only a small fraction of the parameters.
    2. NMS is essential for reducing mislocalizations.
    3. Pre-trained CNNs learn general features; the fc layers learn domain-specific features.
    4. Bounding-box regression helps reduce localization errors (a localization error is a detection whose IoU overlap with the correct class is between 0.1 and 0.5).
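
Since conclusion 2 relies on NMS, here is a minimal sketch of greedy non-maximum suppression for the detections of a single class. The function names and the 0.3 overlap threshold are assumptions for illustration, and boxes are assumed to be `(x1, y1, x2, y2)` tuples.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thresh=0.3):
    """Greedy NMS: visit boxes from highest to lowest score and keep a box
    only if it does not overlap an already kept box by more than iou_thresh.
    Returns the indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep
```

In a multi-class detector this would be run once per class, so that a strong detection of one class does not suppress an overlapping detection of another.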

3.7 Data sets

  • ILSVRC2012 (for supervised pre-training)
  • PASCAL VOC 2007 (detection task)
  • PASCAL VOC 2010 (detection task)
  • PASCAL VOC 2011 (segmentation task)
  • PASCAL VOC 2012 (detection task)

3.8 Weakness

  1. Training is a multi-stage pipeline.
  2. Training is expensive in space and time.
  3. Object detection is slow.
  4. Fixed-size region proposals. All region proposals are warped to a fixed size, which distorts the objects in the image and may lose some information.
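
To make the distortion in point 4 concrete, here is a minimal sketch of warping a region to a square with nearest-neighbour sampling. The function name `warp_nearest` and the toy 1x4 region are illustrative assumptions; the paper's actual warping (output resolution, interpolation, any context padding) is not reproduced here.

```python
def warp_nearest(region, out_size):
    """Warp a 2-D region (list of rows) to out_size x out_size using
    nearest-neighbour sampling, ignoring the original aspect ratio."""
    in_h, in_w = len(region), len(region[0])
    return [
        [region[r * in_h // out_size][c * in_w // out_size]
         for c in range(out_size)]
        for r in range(out_size)
    ]


# A wide 1x4 region squeezed into 2x2: two of the four columns are
# dropped and the single row is duplicated.
wide = [[1, 2, 3, 4]]
print(warp_nearest(wide, 2))  # [[1, 3], [1, 3]]
```

The horizontal axis is compressed while the vertical axis is stretched, which is exactly the geometric distortion that later work such as SPP-net criticizes.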

4 Future works

  • Open question: the authors conjecture that the “supervised pre-training / domain-specific fine-tuning” paradigm will be highly effective for a variety of vision problems.

5 What do other researchers say about this work?

  • Ross Girshick. Fast R-CNN. ICCV, 2015.
    • The Region-based Convolutional Network method (R-CNN) [9] achieves excellent object detection accuracy by using a deep ConvNet to classify object proposals. R-CNN, however, has notable drawbacks.
  • Kaiming He et al. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. arXiv 2015. (RCNN is cited as [7].)
    • when applied to images of arbitrary sizes, the current methods mostly fit the input size into fixed size, either via cropping or via warping [7]. But the cropped region may not contain the entire object, while the warped content may result in unwanted geometric distortion.
    • In the leading object detection method R-CNN [7], the features from candidate windows are extracted via deep convolutional networks. This method shows remarkable detection accuracy on both the VOC and ImageNet datasets. But the feature computation in R-CNN is time-consuming, because it repeatedly applies the deep convolutional networks to the raw pixels of thousands of warped regions per image.
  • Wei Liu et al. SSD: Single Shot MultiBox Detector. arXiv 2016. (RCNN is cited as [22].)
    • Before the advent of convolutional neural networks, the state of the art for those two approaches – Deformable Part Model (DPM) [26] and Selective Search [1] – had comparable performance. However, after the dramatic improvement brought on by R-CNN [22], which combines selective search region proposals and convolutional network based post-classification, region proposal object detection methods became prevalent

Reference

A detailed walkthrough of the R-CNN paper (R-CNN文章详细解读)
