Fast R-CNN

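Outline:

1. How the network is built (image and RoI inputs, RoI pooling, two outputs; Figure 1)

2. How RoIs are sampled (following R-CNN)

3. RoI pooling (from SPPnet)

4. Multi-task loss (the localization loss is computed only for positive samples)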

 

 

Abstract
This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. 
Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. 
Compared to previous work, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy. 
(Note: the main goal of Fast R-CNN is speed.)
Fast R-CNN trains the very deep VGG16 network 9× faster than R-CNN, is 213× faster at test-time, and achieves a higher mAP on PASCAL VOC 2012. 
Compared to SPPnet, Fast R-CNN trains VGG16 3× faster, tests 10× faster, and is more accurate. 
Fast R-CNN is implemented in Python and C++ (using Caffe) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.


2. Fast R-CNN architecture and training

Figure 1. Fast R-CNN architecture. An input image and multiple regions of interest (RoIs) are input into a fully convolutional network. Each RoI is pooled into a fixed-size feature map and then mapped to a feature vector by fully connected layers (FCs). The network has two output vectors per RoI: softmax probabilities and per-class bounding-box regression offsets. The architecture is trained end-to-end with a multi-task loss.
(Note: the image and its RoIs go in together; each RoI is pooled to a fixed-size feature map, passed through FC layers, and yields two outputs: class probabilities and per-class box offsets. Training is end-to-end with a multi-task loss.)

Fig. 1 illustrates the Fast R-CNN architecture. A Fast R-CNN network takes as input an entire image and a set of object proposals. (Input: one whole image plus a set of object proposals.)
The network first processes the whole image with several convolutional (conv) and max pooling layers to produce a conv feature map. (Several conv and max pooling layers turn the whole image into one conv feature map.)
Then, for each object proposal a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map. (For each proposal, the RoI pooling layer extracts a fixed-length feature vector from that feature map.)
Each feature vector is fed into a sequence of fully connected (fc) layers that finally branch into two sibling output layers: one that produces softmax probability estimates over K object classes plus a catch-all “background” class and another layer that outputs four real-valued numbers for each of the K object classes. Each set of 4 values encodes refined bounding-box positions for one of the K classes. (Each feature vector passes through the fc layers and then splits into two sibling outputs: softmax probabilities over the K object classes plus background, and 4 real numbers for each of the K classes, i.e. one refined box per class.)
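
To make the shapes of the two sibling outputs concrete, here is a minimal pure-Python sketch for a single RoI. The function names `softmax` and `split_outputs` are made up for this note, and the layout of the box output (a flat list of 4K numbers, four per object class, no slot for background) is an assumption that follows the wording above rather than any particular implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def split_outputs(cls_scores, bbox_outputs):
    """Interpret the two sibling outputs for one RoI.

    cls_scores: K+1 raw scores, index 0 is the background class.
    bbox_outputs: flat list of 4*K numbers, four per object class
                  (layout assumed for this sketch).
    Returns (probabilities over K+1 classes, {class k: 4-tuple of offsets}).
    """
    K = len(cls_scores) - 1
    assert len(bbox_outputs) == 4 * K
    probs = softmax(cls_scores)
    boxes = {k: tuple(bbox_outputs[4 * (k - 1): 4 * k]) for k in range(1, K + 1)}
    return probs, boxes
```

With K = 2, `split_outputs([0.0, 0.0, 0.0], [1, 2, 3, 4, 5, 6, 7, 8])` gives three equal probabilities and one 4-tuple for each of classes 1 and 2.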

2.1. The RoI pooling layer
The RoI pooling layer uses max pooling to convert the features inside any valid region of interest into a small feature map with a fixed spatial extent of H × W (e.g., 7 × 7), where H and W are layer hyper-parameters that are independent of any particular RoI. In this paper, an RoI is a rectangular window into a conv feature map. Each RoI is defined by a four-tuple (r, c, h, w) that specifies its top-left corner (r, c) and its height and width (h, w).
(Note: RoI pooling max-pools the features of any RoI into a small map of fixed size H × W, e.g. 7 × 7; H and W do not depend on the RoI. An RoI is a rectangular window on the conv feature map, given as a 4-tuple: top-left corner plus height and width.)
RoI max pooling works by dividing the h × w RoI window into an H × W grid of sub-windows of approximate size h/H × w/W and then max-pooling the values in each sub-window into the corresponding output grid cell. 
(Note: the h × w window is cut into H × W sub-windows of roughly h/H × w/W each, and the maximum of each sub-window goes into the matching output cell.)
Pooling is applied independently to each feature map channel, as in standard max pooling. 
(Note: each channel is pooled on its own, exactly like ordinary max pooling.)
The RoI layer is simply the special-case of the spatial pyramid pooling layer used in SPPnets [11] in which there is only one pyramid level. We use the pooling sub-window calculation given in [11]. 
(Note: the RoI layer is a simplified SPPnet pooling layer with a single pyramid level.)
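
The sub-window description above can be sketched in a few lines of pure Python for one channel. This is only an illustration: the function name `roi_max_pool` is made up for this note, and the floor/ceil rounding of the bin edges is an assumption (the paper defers to the calculation in [11] without spelling it out here).

```python
import math

def roi_max_pool(feature_map, roi, H=7, W=7):
    """Max-pool one RoI of a single-channel feature map into an H x W grid.

    feature_map: 2-D list of numbers (rows x cols), one channel.
    roi: (r, c, h, w), top-left corner and height/width in feature-map cells.
    Bin edges use floor/ceil rounding, an assumption made for this sketch.
    """
    r, c, h, w = roi
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        # Rows covered by output row i: about h/H rows each.
        r0 = r + math.floor(i * h / H)
        r1 = r + math.ceil((i + 1) * h / H)
        for j in range(W):
            # Columns covered by output column j: about w/W columns each.
            c0 = c + math.floor(j * w / W)
            c1 = c + math.ceil((j + 1) * w / W)
            out[i][j] = max(
                feature_map[y][x] for y in range(r0, r1) for x in range(c0, c1)
            )
    return out
```

For a multi-channel feature map the same function would simply be applied to each channel separately. With floor/ceil edges neighbouring bins may overlap by one cell when h is not a multiple of H, and no bin is ever empty as long as h and w are at least 1.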

2.2. Initializing from pre-trained networks
We experiment with three pre-trained ImageNet [4] networks, each with five max pooling layers and between five and thirteen conv layers (see Section 4.1 for network details). When a pre-trained network initializes a Fast R-CNN network, it undergoes three transformations. (Note: three pre-trained networks are used, each with 5 max pooling layers and 5 to 13 conv layers. Turning one into a Fast R-CNN network takes the three steps below.)
First, the last max pooling layer is replaced by a RoI pooling layer that is configured by setting H and W to be compatible with the net’s first fully connected layer (e.g., H = W = 7 for VGG16). 
(Note: step one, swap the last max pooling layer for RoI pooling, with H and W chosen to match the input expected by the first fully connected layer.)
Second, the network’s last fully connected layer and softmax (which were trained for 1000-way ImageNet classification) are replaced with the two sibling layers described earlier (a fully connected layer and softmax over K + 1 categories and category-specific bounding-box regressors). 
(Note: step two, replace the final fc layer and softmax with the two sibling layers.)
Third, the network is modified to take two data inputs: a list of images and a list of RoIs in those images. 
(Note: step three, give the network two data inputs: a list of images and a list of RoIs in those images.)

2.3. Fine-tuning for detection
Training all network weights with back-propagation is an important capability of Fast R-CNN. 
(Note: being able to train every weight in the network by back-propagation is a key capability.)
First, let’s elucidate why SPPnet is unable to update weights below the spatial pyramid pooling layer. The root cause is that back-propagation through the SPP layer is highly inefficient when each training sample (i.e. RoI) comes from a different image, which is exactly how R-CNN and SPPnet networks are trained. The inefficiency stems from the fact that each RoI may have a very large receptive field, often spanning the entire input image. Since the forward pass must process the entire receptive field, the training inputs are large (often the entire image). 

We propose a more efficient training method that takes advantage of feature sharing during training. 
In Fast R-CNN training, stochastic gradient descent (SGD) mini-batches are sampled hierarchically, first by sampling N images and then by sampling R/N RoIs from each image. Critically, RoIs from the same image share computation and memory in the forward and backward passes. Making N small decreases mini-batch computation. For example, when using N = 2 and R = 128, the proposed training scheme is roughly 64× faster than sampling one RoI from 128 different images (i.e., the R-CNN and SPPnet strategy). One concern over this strategy is it may cause slow training convergence because RoIs from the same image are correlated. This concern does not appear to be a practical issue and we achieve good results with N = 2 and R = 128 using fewer SGD iterations than R-CNN.

In addition to hierarchical sampling, Fast R-CNN uses a streamlined training process with one fine-tuning stage that jointly optimizes a softmax classifier and bounding-box regressors, rather than training a softmax classifier, SVMs, and regressors in three separate stages [9, 11]. The components of this procedure (the loss, mini-batch sampling strategy, back-propagation through RoI pooling layers, and SGD hyper-parameters) are described below.
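
The "roughly 64×" figure follows from simple counting. The sketch below assumes, as a simplification made for this note, that the cost of a mini-batch is dominated by the number of whole-image passes through the conv layers; `hierarchical_batch` is a made-up helper name.

```python
def hierarchical_batch(R=128, N=2):
    """Return (RoIs per image, rough speed-up over one-RoI-per-image sampling).

    Assumption for this sketch: cost is proportional to the number of distinct
    images that must go through the conv layers in one mini-batch.
    """
    rois_per_image = R // N
    passes_hierarchical = N      # N images, each shared by R/N RoIs
    passes_one_per_image = R     # R images, one RoI each
    speedup = passes_one_per_image / passes_hierarchical
    return rois_per_image, speedup
```

With the defaults this returns 64 RoIs per image and a ratio of 64.0, matching the numbers quoted above.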

Multi-task loss.
A Fast R-CNN network has two sibling output layers. 
The first outputs a discrete probability distribution (per RoI), $p = (p_0, \ldots, p_K)$, over K + 1 categories. 
As usual, p is computed by a softmax over the K + 1 outputs of a fully connected layer. 
(Note: the first output is, for every RoI, a discrete probability distribution $p = (p_0, p_1, \ldots, p_K)$ computed by a softmax.)
The second sibling layer outputs bounding-box regression offsets, $t^k = (t^k_x, t^k_y, t^k_w, t^k_h)$, for each of the K object classes, indexed by k. 
(Note: the second output is the bounding-box regression offsets; one box is predicted for each of the K object classes.)
We use the parameterization for $t^k$ given in [9], in which $t^k$ specifies a scale-invariant translation and log-space height/width shift relative to an object proposal. (Note: this parameterization is still unclear to me.)
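
For reference, the parameterization in [9] (R-CNN) expresses the target as a center shift divided by the proposal size plus the log of the size ratio. The sketch below writes that out; the function name `bbox_targets` and the (center x, center y, width, height) box format are choices made for this note.

```python
import math

def bbox_targets(proposal, gt):
    """Regression targets of a ground-truth box relative to a proposal.

    proposal, gt: (cx, cy, w, h), box center and size in pixels.
    tx, ty: center shift in units of the proposal size (scale-invariant).
    tw, th: log of the width/height ratio (log-space shift).
    """
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    tx = (gx - px) / pw
    ty = (gy - py) / ph
    tw = math.log(gw / pw)
    th = math.log(gh / ph)
    return tx, ty, tw, th
```

Scaling both boxes by the same factor leaves all four numbers unchanged, which is what "scale-invariant" means here.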
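Each training RoI is labeled with a ground-truth class u and a ground-truth bounding-box regression target v. 
(Note: every training RoI carries a ground-truth class label u and a ground-truth regression target v.)
We use a multi-task loss L on each labeled RoI to jointly train for classification and bounding-box regression: 
(Note: one multi-task loss L per labeled RoI trains classification and regression jointly.)

$$L(p, u, t^u, v) = L_{cls}(p, u) + \lambda [u \geq 1] L_{loc}(t^u, v), \tag{1}$$

in which $L_{cls}(p, u) = -\log p_u$ is log loss for true class u. (Note: evaluated at the ground-truth class u.)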
The second task loss, $L_{loc}$, is defined over a tuple of true bounding-box regression targets for class u, $v = (v_x, v_y, v_w, v_h)$, and a predicted tuple $t^u = (t^u_x, t^u_y, t^u_w, t^u_h)$, again for class u. (Note: the regression term also uses only the ground-truth class u.)
The Iverson bracket indicator function [u ≥ 1] evaluates to 1 when u ≥ 1 and 0 otherwise. By convention the catch-all background class is labeled u = 0. For background RoIs there is no notion of a ground-truth bounding box and hence $L_{loc}$ is ignored. (Note: for background, u = 0 and the regression term is not computed, because there is no ground-truth box.)
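For bounding-box regression, we use the loss 

$$L_{loc}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t^u_i - v_i), \tag{2}$$

in which

$$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise,} \end{cases} \tag{3}$$

is a robust L1 loss that is less sensitive to outliers than the L2 loss used in R-CNN and SPPnet. 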
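When the regression targets are unbounded, training with L2 loss can require careful tuning of learning rates in order to prevent exploding gradients. 
(Note: still not fully clear to me. The smooth L1 loss is less sensitive to outliers, whereas L2 loss needs carefully tuned learning rates to keep gradients from exploding?)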
Eq. 3 eliminates this sensitivity. The hyper-parameter λ in Eq. 1 controls the balance between the two task losses. 
We normalize the ground-truth regression targets $v_i$ to have zero mean and unit variance. (Note: how exactly is this normalization done?)
All experiments use λ = 1.
We note that [6] uses a related loss to train a class-agnostic object proposal network. Different from our approach, [6] advocates for a two-network system that separates localization and classification. OverFeat [19], R-CNN [9], and SPPnet [11] also train classifiers and bounding-box localizers, however these methods use stage-wise training, which we show is suboptimal for Fast R-CNN (Section 5.1).
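
Putting the pieces of this subsection together, the loss for a single RoI can be sketched as below: log loss on the true class plus, for non-background RoIs only, the smooth L1 loss (0.5x² for |x| < 1, |x| − 0.5 otherwise) summed over the four box coordinates. The function names are made up for this note. `smooth_l1_grad` is included to show why outliers matter less than with a squared error: the derivative never exceeds 1 in magnitude, while the derivative of a squared error grows in proportion to the error.

```python
import math

def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5

def smooth_l1_grad(x):
    """Derivative of smooth L1: x inside |x| < 1, sign(x) outside."""
    if abs(x) < 1:
        return x
    return 1.0 if x > 0 else -1.0

def multi_task_loss(p, u, t_u, v, lam=1.0):
    """Loss for one RoI.

    p: softmax probabilities over K+1 classes (index 0 = background).
    u: ground-truth class; t_u: predicted offsets for class u;
    v: ground-truth regression targets; lam: the weight lambda.
    """
    l_cls = -math.log(p[u])
    if u >= 1:
        l_loc = sum(smooth_l1(t - g) for t, g in zip(t_u, v))
    else:
        l_loc = 0.0  # background RoI: no ground-truth box, term is skipped
    return l_cls + lam * l_loc
```

For a background RoI (u = 0) the predicted offsets have no effect on the result, so only the classification term contributes.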
 
Mini-batch sampling.
During fine-tuning, each SGD mini-batch is constructed from N = 2 images, chosen uniformly at random (as is common practice, we actually iterate over permutations of the dataset). We use mini-batches of size R = 128, sampling 64 RoIs from each image. (Note: each mini-batch comes from 2 images, 64 RoIs per image, 128 in total.)
As in [9], we take 25% of the RoIs from object proposals that have intersection over union (IoU) overlap with a ground-truth bounding box of at least 0.5. These RoIs comprise the examples labeled with a foreground object class, i.e. u ≥ 1. (Note: 25% of the RoIs are proposals whose IoU with a ground-truth box is at least 0.5; these are the positive samples, with u ≥ 1.)
The remaining RoIs are sampled from object proposals that have a maximum IoU with ground truth in the interval [0.1, 0.5), following [11]. These are the background examples and are labeled with u = 0.
(Note: the remaining RoIs are proposals whose maximum IoU with ground truth lies in [0.1, 0.5); they are the negative samples, with u = 0.)
The lower threshold of 0.1 appears to act as a heuristic for hard example mining [8]. 
During training, images are horizontally flipped with probability 0.5. No other data augmentation is used.
(Note: each image is flipped horizontally with probability 0.5.)
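
The per-image sampling rule can be sketched as follows, given the maximum IoU of each proposal with any ground-truth box. The function name `sample_rois` is made up for this note, and topping up with background RoIs when an image has fewer than 16 foreground proposals is an assumption; the text above does not say how that case is handled.

```python
import random

def sample_rois(max_ious, rois_per_image=64, fg_fraction=0.25, rng=random):
    """Pick foreground and background proposal indices for one image.

    max_ious[i]: maximum IoU of proposal i with any ground-truth box.
    Foreground: IoU >= 0.5. Background: IoU in [0.1, 0.5).
    If foreground proposals run short, the batch is filled with background
    (an assumption of this sketch).
    """
    fg = [i for i, iou in enumerate(max_ious) if iou >= 0.5]
    bg = [i for i, iou in enumerate(max_ious) if 0.1 <= iou < 0.5]
    n_fg = min(int(rois_per_image * fg_fraction), len(fg))
    n_bg = min(rois_per_image - n_fg, len(bg))
    return rng.sample(fg, n_fg), rng.sample(bg, n_bg)
```

Passing `rng=random.Random(0)` makes the draw reproducible. Proposals with IoU below 0.1 are never selected.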
 
