Paper Reading Notes (38): Dynamic Zoom-in Network for Fast Object Detection in Large Images

We introduce a generic framework that reduces the computational cost of object detection while retaining accuracy for scenarios where objects with varied sizes appear in high resolution images. Detection progresses in a coarse-to-fine manner, first on a down-sampled version of the image and then on a sequence of higher resolution regions identified as likely to improve the detection accuracy. Built upon reinforcement learning, our approach consists of a model (R-net) that uses coarse detection results to predict the potential accuracy gain for analyzing a region at a higher resolution and another model (Q-net) that sequentially selects regions to zoom in. Experiments on the Caltech Pedestrians dataset show that our approach reduces the number of processed pixels by over 50% without a drop in detection accuracy. The merits of our approach become more significant on a high resolution test set collected from the YFCC100M dataset, where our approach maintains high detection performance while reducing the number of processed pixels by about 70% and the detection time by over 50%.

Most recent convolutional neural network (CNN) detectors are applied to images with relatively low resolution, e.g., VOC2007/2012 (about 500×400) [12, 13] and MS COCO (about 600×400) [26]. At such low resolutions, the computational cost of convolution is low. However, the resolution of everyday devices has quickly outpaced standard computer vision datasets. The camera of a 4K smartphone, for instance, has a resolution of 2,160×3,840 pixels and a DSLR camera can reach 6,000×4,000 pixels. Applying state-of-the-art CNN detectors directly to those high resolution images requires a large amount of processing time. Additionally, the convolution output maps are too large for the memory of current GPUs.

Prior works address some of these issues by simplifying the network architecture [14, 41, 9, 23, 38] to speed up detection and reduce GPU memory consumption. However, these models are tailored to particular network structures and may not generalize well to new architectures. A more general direction is treating the detector as a black box that is judiciously applied to optimize accuracy and efficiency. For example, one could partition an image into sub-images that satisfy memory constraints and apply the CNN to each sub-image. However, this solution is still computationally burdensome. One could also speed up the detection process and reduce memory requirements by running existing detectors on down-sampled images. However, the smallest objects may become too small to detect in the down-sampled images. Object proposal methods are the basis for most CNN detectors, restricting expensive analysis to regions that are likely to contain objects of interest [11, 35, 44, 43]. However, the number of object proposals needed to achieve good recall for small objects in large images is prohibitively high, which leads to huge computational cost.

Our approach is illustrated in Fig. 1. We speed up object detection by first performing coarse detection on a downsampled version of the image and then sequentially selecting promising regions to be analyzed at a higher resolution. We employ reinforcement learning to model long-term reward in terms of detection accuracy and computational cost and dynamically select a sequence of regions to analyze at higher resolution. Our approach consists of two networks: a zoom-in accuracy gain regression network (R-net) learns correlations between coarse and fine detections and predicts the accuracy gain for zooming in on a region; a zoom-in Q function network (Q-net) learns to sequentially select the optimal zoom locations and scales by analyzing the output of the R-net and the history of previously analyzed regions.

Experiments demonstrate that, with a negligible drop in detection accuracy, our method reduces processed pixels by over 50% and average detection time by 25% on the Caltech Pedestrian Detection dataset [10], and reduces processed pixels by about 70% and average detection time by over 50% on a high resolution dataset collected from YFCC100M [21] that has pedestrians of varied sizes. We also compare our method to recent single-shot detectors [32, 27] to show our advantage when handling large images.

CNN detectors. One way to analyze high resolution images efficiently is to improve the underlying detector. Girshick [16] sped up the region-proposal-based CNN [17] by sharing convolutional features between proposals. Ren et al. proposed Faster R-CNN [33], a fully end-to-end pipeline that shares features between proposal generation and object detection, improving both accuracy and computational efficiency. Recently, single-shot detectors [27, 31, 32] have received much attention for real-time performance. These methods remove the proposal generation stage and formulate detection as a regression problem. Although these detectors performed well on the PASCAL VOC [12, 13] and MS COCO [26] datasets, which generally contain large objects in images with relatively low resolution, they do not generalize as well to large images with objects of variable sizes. Also, their processing cost increases dramatically with image size due to the large number of convolution operations.

Sequential search. Another strategy to handle large image sizes is to avoid processing the entire image and instead investigate small regions sequentially. However, most existing works focus on mining informative regions to improve detection accuracy without considering computational cost. Lu et al. [28] improved localization by adaptively focusing on subregions likely to contain objects. Alexe et al. [1] sequentially investigated locations, conditioned on what had already been seen, to improve detection accuracy. However, the proposed approach introduces a large overhead, leading to long detection times (about 5s per object class per image). Zhang et al. [42] improved detection accuracy by penalizing the inaccurate locations of the initial object proposals, which introduced more than 15% overhead to detection time.

A sequential search process can also make use of contextual cues from sources such as scene segmentation. Existing approaches have explored this idea for various object localization tasks [8, 37, 30]. Such cues can also be incorporated within our framework (e.g., as input to predicting the zoom-in reward). However, we focus on using only coarse detections as a guide for sequential search and leave additional contextual information to future work. Other previous work [25] utilizes a coarse-to-fine strategy to speed up detection, but this work does not select promising regions sequentially.

Reinforcement learning (RL). RL is a popular mechanism for learning sequential search policies, as it allows models to consider the effect of a sequence of actions rather than individual ones. Ba et al. use RL to train an attention-based model in [3] to sequentially select the most relevant regions for object recognition, and Jie et al. [20] select regions for localization in a top-down search fashion. However, these methods require a large number of selection steps and may lead to long running times. Caicedo et al. [7] designed an active detection model for object localization, which utilizes Deep Q Networks (DQN) [29] to learn a long-term reward function to transform an initial bounding box sequentially until it converges to an object. However, as reported in [7], the box transformation takes about 1.5s of detection time on a typical Pascal VOC image, which is much slower than recent detectors [33, 27, 32]. In addition, [7] does not explicitly consider selection cost. Although RL implicitly forces the algorithm to take a minimum number of steps, we need to explicitly penalize cost since each step can yield a high cost. For example, if we do not penalize cost, the algorithm will tend to zoom in on the whole image. Existing works have proposed methods to apply RL in cost-sensitive settings [18, 22]. We follow the approach of [18] and treat the reward function as a linear combination of accuracy and cost.

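To make the cost-accuracy trade-off concrete, here is a minimal sketch of such a cost-sensitive reward, assuming the linear form referenced as Eq. 1 later in the paper; the function and variable names are illustrative, not the authors' code.

```python
# Hypothetical sketch: reward as a linear combination of accuracy and cost.
def zoom_reward(accuracy_gain, zoom_cost, lam=0.5):
    """accuracy_gain: predicted detection accuracy gain of zooming in.
    zoom_cost: e.g., the fraction of high-resolution pixels the region covers.
    lam: trade-off weight; setting lam = 0 recovers the Rnet* variant below.
    """
    return accuracy_gain - lam * zoom_cost
```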

Dynamic zoom-in network
Our work employs a coarse-to-fine strategy, applying a coarse detector at low resolution and using the outputs of this detector to guide an in-depth search for objects at high resolution. The intuition is that, while the coarse detector will not be as accurate as the fine detector, it will identify image regions that need to be further analyzed, incurring the cost of high resolution detection only in promising regions. We make use of two major components: 1) a mechanism for learning the statistical relationship between the coarse and fine detectors, so that we can predict which regions need to be zoomed in given the coarse detector output; and 2) a mechanism for selecting a sequence of regions to analyze at high resolution, given the coarse detector output and the regions that have already been analyzed by the fine detector. Our pipeline is illustrated in Fig. 2. We learn a strategy that models the long-term goal of maximizing the overall detection accuracy with limited cost.

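To summarize the mechanics, below is a minimal Python sketch of this coarse-to-fine loop. All components (coarse_detector, fine_detector, r_net, q_net) are hypothetical callables standing in for the models in Fig. 2, and the coordinate handling is simplified; this is a reading aid, not the paper's implementation.

```python
def dynamic_zoom_in_detect(image, coarse_detector, fine_detector, r_net, q_net,
                           num_steps=3):
    # 1. Coarse pass on a down-sampled copy (crude 2x down-sampling here).
    small = image[::2, ::2]
    detections = coarse_detector(small)          # list of coarse detections

    # 2. R-net predicts an accuracy gain (AG) map from the coarse detections.
    ag_map = r_net(small, detections)

    # 3. Q-net sequentially picks zoom-in windows until none is worth the cost.
    for _ in range(num_steps):
        window = q_net(ag_map)                   # (y, x, h, w), full-image coords
        if window is None:                       # no region worth zooming in on
            break
        y, x, h, w = window
        crop = image[y:y + h, x:x + w]           # analyze region at high resolution
        detections += fine_detector(crop)
        ag_map[y // 2:(y + h) // 2, x // 2:(x + w) // 2] = 0  # record history
    return detections
```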

Baseline methods
We compare to the following baseline algorithms:
Fine-detection-all. This baseline directly applies the fine detector to the high resolution version of the image. This method leads to high detection accuracy at high computational cost. All of the other approaches seek to maintain this detection accuracy with less computation.

Coarse-detection-all. This baseline applies the coarse detector on down-sampled images with no zooming.

GS+Rnet. Given the initial state representation generated by the R-net, we use a greedy search strategy (GS) to densely search for the best window at each step based on the current state, without considering the long-term reward.

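As a reading aid, here is a sketch of what this greedy baseline could look like, assuming the state is the R-net accuracy gain (AG) map and each candidate window is scored by its summed gain; the window size and map shape are illustrative.

```python
import numpy as np

def greedy_select(ag_map, win_h=15, win_w=20):
    # Densely score every window position on the AG map and take the best one,
    # with no estimate of long-term reward (unlike the Q-net).
    H, W = ag_map.shape
    best_score, best_win = -np.inf, None
    for y in range(H - win_h + 1):
        for x in range(W - win_w + 1):
            score = ag_map[y:y + win_h, x:x + win_w].sum()
            if score > best_score:
                best_score, best_win = score, (y, x, win_h, win_w)
    return best_win
```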

ER+Qnet. The entropy of the detector output (object vs no object) is another way to measure the quality of a coarse detection. [2] used entropy to measure the quality of a region for a classification task. Higher entropy implies lower quality of a coarse detection. So, if we ignore the correlation between fine and coarse detections, the accuracy gain of a region can also be computed as

$$-p_i^l \log\left(p_i^l\right) - \left(1 - p_i^l\right) \log\left(1 - p_i^l\right)$$

where $p_i^l$ denotes the coarse detection score of proposal $i$. For fair comparison, we fix all parameters of the pipeline except replacing the R-net output of a proposal with its entropy.
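A small sketch implementing the entropy formula above, where p is the coarse detection score $p_i^l$:

```python
import numpy as np

def entropy_gain(p, eps=1e-12):
    # Binary entropy of the coarse detection score; higher entropy means a
    # lower-quality coarse detection and thus a larger assumed zoom-in gain.
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    return -p * np.log(p) - (1.0 - p) * np.log(1.0 - p)
```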

SSD and YOLOv2. We also compare our method with off-the-shelf SSD [27] and YOLOv2 [32] trained on the Caltech Pedestrian Detection (CPD) dataset, to show the advantage of our method on large images.

Variants of our framework
We use Qnet-CNN to represent the Q-net developed using a fully convolutional network (see Fig. 2). To analyze the contributions of different components to the performance gain, we evaluate three variants of our framework: Qnet*, Qnet-FC and Rnet*.

Qnet*. This method uses a Q-net with a refinement step to locally adjust the zoom-in window selected by the Q-net.

Qnet-FC. Following [7], we develop this variant with two fully connected (FC) layers for the Q-net. For Qnet-FC, the state representation is resized to a vector of length 1,200 as the input. The first layer has 128 units and the second layer has 34 units (9+25). Each output unit represents a sampled window on an image. We uniformly sample 25 windows of size 320×240 and 9 windows of size 214×160 on the CPD dataset. Since the number of outputs of Qnet-FC cannot be changed, window sizes are proportionally increased when Qnet-FC is applied to the WP dataset.

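A minimal PyTorch sketch consistent with these numbers (a 1,200-d flattened state, 128 hidden units, and 34 outputs, one per sampled window); this is an illustrative reconstruction, not the authors' code.

```python
import torch.nn as nn

class QnetFC(nn.Module):
    def __init__(self, state_dim=1200, hidden=128, num_windows=34):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),    # first FC layer: 128 units
            nn.ReLU(),
            nn.Linear(hidden, num_windows),  # one Q-value per sampled window
        )

    def forward(self, state):  # state: (batch, 1200) flattened representation
        return self.net(state)
```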

Rnet*. This is an R-net learned using a reward function that does not explicitly encode cost (λ = 0 in Eq. 1).

We propose a dynamic zoom-in network to speed up object detection in large images without manipulating the underlying detector’s structure. Images are first downsampled and processed by the R-net to predict the accuracy gain of zooming in on a region. Then, the Q-net sequentially selects regions with high zoom-in reward to conduct fine detection. The experiments show that our method is effective on both Caltech Pedestrian Detection dataset and a high resolution pedestrian dataset.

[Figure 1]

Figure 1: Illustration of our approach. The input is a down-sampled version of the image, to which a coarse detector is applied. The R-net uses the initial coarse detection results to predict the utility of zooming in on a region to perform detection at higher resolution. The Q-net then uses the computed accuracy gain map and a history of previous zooms to determine the next zoom that is most likely to improve detection with limited computational cost.

[Figure 2]

Figure 2: Given a down-sampled image as input, the R-net generates an initial accuracy gain (AG) map indicating the potential zoom-in accuracy gain of different regions (initial state). The Q-net is applied iteratively on the AG map to select regions. Once a region is selected, the AG map will be updated to reflect the history of actions. For the Q-net, two parallel pipelines are used, each of which outputs an action-reward map that corresponds to selecting zoom-in regions with a specific size. The value of the map indicates the likelihood that the action will increase accuracy at low cost. Action rewards from all maps are considered to select the optimal zoom-in region at each iteration. The notation 128×15×20:(7,10) means 128 convolution kernels with size 15×20, and stride of 7/10 in height/width. Each grid cell in the output maps is given a unique color, and a bounding box of the same color is drawn on the image to denote the corresponding zoom region size and location.

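As a sanity check on the caption's notation, the snippet below instantiates 128×15×20:(7,10) as a PyTorch convolution. The single input channel (the AG map) and the dummy input size are assumptions for illustration.

```python
import torch
import torch.nn as nn

# 128 kernels of size 15x20, stride 7 in height and 10 in width.
conv = nn.Conv2d(in_channels=1, out_channels=128,
                 kernel_size=(15, 20), stride=(7, 10))

ag_map = torch.randn(1, 1, 60, 80)  # dummy accuracy-gain map (N, C, H, W)
print(conv(ag_map).shape)           # torch.Size([1, 128, 7, 7])
```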
