RODEO: Replay for Online Object Detection
Code: GitHub - manoja328/rodeo: Official implementation of "RODEO: Replay for Online Object Detection", BMVC 2020
Humans can incrementally learn to do new visual detection tasks, which is a huge challenge for today’s computer vision systems. Incrementally trained deep learning models lack backwards transfer to previously seen classes and suffer from a phenomenon known as “catastrophic forgetting.” In this paper, we pioneer online streaming learning for object detection, where an agent must learn examples one at a time with severe memory and computational constraints. In object detection, a system must output all bounding boxes for an image with the correct label. Unlike earlier work, the system described in this paper can learn this task in an online manner with new classes being introduced over time. We achieve this capability by using a novel memory replay mechanism that efficiently replays entire scenes. We achieve state-of-the-art results on both the PASCAL VOC 2007 and MS COCO datasets.
Object detection is a localization task that involves predicting bounding boxes and class labels for all objects in a scene. Recently, many deep learning systems for detection [45, 48] have achieved excellent performance on the commonly used Microsoft COCO [32] and Pascal VOC [10] datasets. These systems, however, are trained offline, meaning they cannot be continually updated with new object classes. In contrast, humans and mammals learn from non-stationary streams of samples, which are presented one at a time and they can immediately use new learning to better understand visual scenes. This setting is known as streaming learning, or online learning in a single pass through a dataset. Conventional models trained in this manner suffer from catastrophic forgetting of previous knowledge [12, 40].
Streaming object detection enables new applications such as adding new classes, adapting detectors across seasons, and incorporating object appearance variations over time. Existing incremental object detection systems [14, 29, 50, 51] have significant limitations and are not capable of streaming learning. Instead of updating immediately using the current scene, they update using large batches of scenes. These systems use distillation [19] to mitigate forgetting. This means that for the batch acquired at time t, they must generate predictions for all of the scenes in the batch before learning can occur, and afterwards they loop over the batch multiple times. This makes updating slow and impairs their ability to be used on embedded devices with limited compute or where fast learning is required.
Previous works in incremental image recognition have shown that replay mechanisms are effective in alleviating catastrophic forgetting [4, 16, 44, 56]. Replay is inspired by how the human brain consolidates learned representations from the hippocampus to the neocortex, which helps in retaining knowledge over time [39]. Furthermore, hippocampal indexing theory postulates that the human brain uses an indexing mechanism to replay compressed representations from memory [52]. In contrast, others replay raw samples [4, 44, 56], which is not biologically plausible. Here, we present the Replay for the Online DEtection of Objects (RODEO) model, which replays compressed representations stored in a fixed capacity memory buffer to incrementally perform object detection in a streaming fashion. To the best of our knowledge, this is the first work to use replay for incremental object detection. We find that this method is computationally efficient and can easily be extended to other applications.
This paper makes the following contributions:
1. We pioneer streaming learning for object detection and establish strong baselines.
2. We propose RODEO, a model that uses replay to mitigate forgetting in the streaming setting and achieves better results than incremental batch object detection algorithms.
Continual learning (sometimes called incremental batch learning) is a much easier problem than streaming learning and has recently seen much success on classification and detection tasks [4, 6, 21, 25, 27, 35, 36, 41, 42, 51, 56]. In continual learning, an agent is required to learn from a dataset $D$ that is broken up into $T$ batches, i.e., $D = \bigcup_{t=1}^{T} B_t$. At each time-step $t$, an agent learns from a batch consisting of $N_t$ training inputs, i.e., $B_t = \{I_i\}_{i=1}^{N_t}$, by looping through the batch until it has been learned, where $I_i$ is an image. Continual learning is not an ideal paradigm for agents that must operate in real-time for two reasons: 1) the agent must wait for a batch of data to accumulate before training can happen and 2) an agent can only be evaluated after it has finished looping through a batch. While streaming learning has recently been used for image classification [7, 15, 16, 17, 36], it has not yet been explored for object detection, which we pioneer here.
More formally, during training, a streaming object detection model receives temporally ordered sequences of images with associated bounding boxes and labels from a dataset $D = \{I_t\}_{t=1}^{T}$, where $I_t$ is an image at time $t$. During evaluation, the model must produce labelled bounding boxes for all objects in a given image, using the model built until time $t$. Streaming learning poses unique challenges for models by requiring the agent to learn one example at a time with only a single epoch through the entire dataset. In streaming learning, model evaluation can happen at any point during training. Further, developers should impose memory and time constraints on agents to make them more amenable to real-time learning.
In comparison with image classification, which requires an agent to answer ‘what’ is in an image, object detection additionally requires localization, i.e., answering ‘where’ each object is located. Moreover, models must be capable of localizing multiple objects, often of varying categories, within an image. Recently, two types of architectures have been proposed to tackle this problem: 1) single stage architectures (e.g., SSD [13, 34], YOLO [45, 46], RetinaNet [33]) and 2) two stage architectures (e.g., Fast RCNN [55], Faster RCNN [48]). Single stage architectures have a single, end-to-end network that generates proposal boxes and performs both class-aware bounding box regression and classification of those boxes in a single stage. While single stage architectures are faster to train, they often achieve lower performance than their two stage counterparts. Two stage architectures first use a region proposal network to generate class agnostic proposal boxes. In a second stage, these boxes are classified and the bounding box coordinates are fine-tuned further via regression. The outputs of all detection models are bounding box coordinates with their respective probability scores corresponding to the closest category. While incremental object detection has recently been explored in the continual learning paradigm, we pioneer streaming object detection, which is a more realistic setup.
Although continual learning is an easier problem than streaming learning, both training paradigms suffer from catastrophic forgetting of previous knowledge when trained on changing, non-iid data distributions [12, 40]. Catastrophic forgetting is a result of the stability-plasticity dilemma, where an agent must update its weights to learn new information, but if the weights are updated too much, then it will forget prior knowledge [1]. There are several strategies for overcoming forgetting in neural networks including: 1) regularization approaches that place constraints on weight updates [3, 6, 20, 27, 30, 38, 41, 58], 2) sparsity where a network sparsely updates weights to mitigate interference [8], 3) ensembling multiple classifiers together [9, 11, 43, 47, 53], and 4) rehearsal/replay models that store a subset of previous training inputs (or generate previous inputs) to mix with new examples when updating the network [4, 16, 17, 21, 25, 44, 56]. Many prior works have also combined these techniques to mitigate forgetting, with a combination of distillation [19] (a regularization approach) and replay yielding many state-of-the-art models for image recognition [4, 21, 44, 56].
While streaming object detection has not been explored, there has been some work on object detection in the continual (batch) learning paradigm [14, 29, 50, 51]. In [51], a distillation-based approach was proposed without replay. A network would initially be trained on a subset of classes and then its weights would be frozen and directly copied to a new network with additional parameters for new classes. A standard cross-entropy loss was used with an additional distillation loss computed from the frozen network to restrict weights from changing too much. Hao et al. [14] train an incremental end-to-end variant of Faster RCNN [48] with distillation, a feature preserving loss, and a nearest class prototype classifier to overcome the challenges of a fixed proposal generator. Similarly, [29] uses distillation on the classification predictions, bounding box coordinates, and network features to train an end-to-end incremental network. Shin et al. [50] introduce a novel incremental framework that combines active learning with semi-supervised learning. All of the aforementioned methods operate on batches and are not designed to learn one example at a time.
Inspired by [17], RODEO is a model architecture that performs object detection in an online fashion, i.e., learning examples one at a time with a single pass through the dataset. This means our model updates as soon as a new instance is observed, which is more amenable to real-time applications than models operating in the incremental batch paradigm. To facilitate online learning, our model uses a memory buffer to store compressed representations of examples. These representations are obtained from an intermediate layer of the CNN backbone and compressed to reduce storage, i.e., compressed mid-network CNN tensors. During training, RODEO compresses a new image input. It then combines this new input with a random, reconstructed subset of samples from its replay buffer, before updating the model with this replay mini-batch.
Figure 1: In offline object detection, a model is provided an image and then trained with the ground truth boxes for all classes (e.g., a, b, c) in the image at once (top figure). However, in an online setting, ground truth boxes of different categories are observed at different time steps (bottom figure). While conventional models suffer from catastrophic forgetting, RODEO uses replay to efficiently train an incremental object detector for large-scale, many-class problems. Given an image, RODEO passes the image through the frozen layers of its network (G). The image is then quantized and a random subset of examples from the replay buffer are reconstructed. This mixture of examples is then used to update the plastic layers of the network (F) and finally the new example is added to the buffer.
More formally, our object detection model, H, can be decomposed as H (x) = F (G(x)) for an input image x, where G consists of earlier layers of a CNN and F the remaining layers. We first initialize G(·) using a base initialization phase where our model is first trained offline on half of the total classes in the dataset. After this base initialization phase, the layers in G are frozen since earlier layers of CNNs learn general and transferable representations [57]. Then, during streaming learning, only F is kept plastic and updated on new data.
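To make the decomposition concrete, the following is a minimal sketch of a single streaming update, written under several assumptions rather than taken from the authors' code. The names `stream_step`, `encode`, `decode`, and `update_F` are hypothetical placeholders for the compression model and the optimizer step on the plastic layers, and decoding the new sample before the update is an assumption that corresponds to the reconstructed-feature variant. Buffer eviction is left out here.

```python
import random

def stream_step(x, target, G, encode, decode, update_F, buffer, n_replay=4, rng=random):
    """One streaming update: compress the new sample, replay a few stored ones, update F.

    G, encode, decode and update_F are hypothetical callables supplied by the caller:
    G is the frozen part of the network, encode/decode compress and reconstruct
    feature maps, and update_F performs one optimizer step on the plastic layers.
    """
    code = encode(G(x))                        # frozen layers, then compression
    k = min(n_replay, len(buffer))
    replayed = rng.sample(buffer, k)           # random subset of stored (code, target) pairs
    # Assumption: the new sample is reconstructed the same way as the replayed ones.
    batch = [(decode(code), target)] + [(decode(c), t) for c, t in replayed]
    update_F(batch)                            # a single update on the replay mini-batch
    buffer.append((code, target))              # store the compressed new sample
    return len(batch)
```

The frozen layers G are only ever run on the incoming image, while everything drawn from the buffer enters the network at the input of F.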
Unlike previous methods for incremental image recognition [44], which store raw (pixel-level) samples in the replay buffer, we store compressed representations of feature map tensors. One advantage of storing compressed samples is a drastic reduction in memory requirements for storage. Specifically, for an input image x, the output of G(x) is a feature map, z, of size p×q×d, where p×q is the spatial grid size and d is the feature dimension. After G has been initialized on the base initialization set of data, we push all base initialization samples through G to obtain these feature maps, which are used to train a product quantization (PQ) model [23]. This PQ model encodes each feature map tensor as a p×q×s array of integers, where s is the number of indices needed for storage, i.e., the number of codebooks used by PQ. After we train the PQ model, we obtain the compressed representations of all base initialization samples and add the compressed samples to our memory replay buffer. We then stream new examples into our model H one at a time. We compress the new sample using our PQ model, reconstruct a random subset of examples from the memory buffer, and update F on this mixture for a single iteration. We subject our replay buffer to an upper bound in terms of memory. If the memory buffer is full, then the new compressed sample is added and we choose an existing example for removal, which we discuss next. Otherwise, we just add the new compressed sample directly. For all experiments, we store codebook indices using 8 bits, or equivalently 1 byte, i.e., the size of each codebook is 256. We use 64 codebooks for COCO and 32 for VOC. For PQ computations, we use the publicly available Faiss library [24]. A depiction of our overall training procedure is given in Alg. 1.
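The encoding and reconstruction steps of product quantization can be illustrated with a small pure-Python sketch. This is not the Faiss implementation used in the paper: the codebooks are assumed to be already trained (for example by k-means on each sub-vector), and the function names `pq_encode` and `pq_decode` are hypothetical. Each d-dimensional feature vector at one spatial location is split into s equal sub-vectors, and each sub-vector is replaced by the index of its nearest centroid.

```python
def pq_encode(vec, codebooks):
    """Map a d-dim vector to one centroid index per codebook.

    codebooks[j] is a list of centroids for the j-th sub-vector; the codebooks
    are assumed to be trained already and d is assumed divisible by s.
    """
    s = len(codebooks)
    sub_dim = len(vec) // s
    codes = []
    for j, book in enumerate(codebooks):
        sub = vec[j * sub_dim:(j + 1) * sub_dim]
        # Nearest centroid under squared Euclidean distance.
        dists = [sum((a - b) ** 2 for a, b in zip(sub, c)) for c in book]
        codes.append(dists.index(min(dists)))
    return codes

def pq_decode(codes, codebooks):
    """Reconstruct an approximate vector by concatenating the selected centroids."""
    out = []
    for idx, book in zip(codes, codebooks):
        out.extend(book[idx])
    return out

# Toy example: d = 4, s = 2 codebooks with 2 centroids each.
books = [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
codes = pq_encode([0.9, 1.2, 0.1, 0.8], books)   # -> [1, 0]
approx = pq_decode(codes, books)                 # -> [1.0, 1.0, 0.0, 1.0]
```

With 256 centroids per codebook, each index fits in one byte, which is why a p×q×d feature map can be stored as p×q×s bytes.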
For lifelong learning agents that are required to learn from possibly infinite data streams, it is not possible to store all previous examples in a memory replay buffer. Since the capacity of our memory buffer is fixed, it is essential to replace less useful examples over time. We use a replacement strategy that replaces the image having the least number of unique labels from the replay buffer. We also experiment with other replacement strategies in Sec. 6.1.
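The replacement rule described above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function name `add_with_min_replacement`, the dictionary layout of a buffer entry, and the tie-breaking rule (the earliest stored entry is evicted first) are all assumptions.

```python
def add_with_min_replacement(buffer, sample, capacity):
    """Add `sample` to `buffer`, evicting the stored entry with the fewest unique labels
    when the buffer is already at capacity.

    Each entry is assumed to be a dict with a "labels" list. Ties are broken in
    favour of evicting the earliest stored entry, which is an assumption.
    """
    evicted = None
    if len(buffer) >= capacity:
        victim = min(range(len(buffer)), key=lambda i: len(set(buffer[i]["labels"])))
        evicted = buffer.pop(victim)
    buffer.append(sample)
    return evicted

# Usage: the entry with only one unique label ("cat") is evicted.
buf = [
    {"id": 0, "labels": ["cat", "dog"]},
    {"id": 1, "labels": ["cat", "cat"]},
    {"id": 2, "labels": ["car", "bus", "dog"]},
]
gone = add_with_min_replacement(buf, {"id": 3, "labels": ["tv"]}, capacity=3)
```

A linear scan is used for clarity; for a buffer of many thousands of samples a heap keyed on the unique-label count would avoid rescanning on every insertion.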
We use the Pascal VOC 2007 [10] and Microsoft COCO [32] datasets. VOC contains 20 object classes with 5,000 combined training/validation images and 5,000 testing images. COCO contains 80 classes (including all VOC classes) with 80K training images and 40K validation images, which we use for testing. We use the entire validation set as our test set.
We compare several baselines using the Fast RCNN architecture with edge box proposals and a ResNet-50 [18] backbone, which is the setup used in [51]. These baselines include:
• RODEO – RODEO operates as an incremental object detector by using replay mechanisms to mitigate forgetting. Our main variant replays 4 randomly selected samples from its buffer at each time step. We use 32 codebooks for VOC and 64 for COCO, each of size 256.
• Fine-Tune (No Replay) – This is a standard object detection model without a replay buffer that is fine-tuned one example at a time with only a single epoch through the dataset. This model serves as a lower bound on performance and suffers from catastrophic forgetting of previous classes.
• ILwFOD – The Incremental Learning without Forgetting Object Detection model [51] uses a fixed proposal generator (e.g., edge boxes) with distillation to incrementally learn classes. It is the current state-of-the-art for incremental object detection.
• SLDA + Stream-Regress – Deep streaming linear discriminant analysis was recently shown to work well in classifying deep network features on ImageNet [15]. Since SLDA is only used for classification, we combine it with a streaming regression model to regress for bounding box coordinates. To handle the background class with SLDA, we store a mean vector per class and a background mean vector per class, along with a universal covariance matrix. At test time, a label is assigned based on the closest Gaussian in feature space, defined by the class mean vectors and universal covariance matrix. More details for this model are provided in supplemental materials.
• Offline – This is a standard object detection network trained in the offline setting using mini-batches and multiple epochs through the dataset. This model serves as an upper bound for our experiments.
All models use the same network initialization procedure. Similarly, all models are optimized with stochastic gradient descent with momentum, except SLDA. We were not able to replicate the results for ILwFOD, so we use the numbers provided by the authors for VOC and do not include results for COCO since our setup differs. While RODEO, SLDA+Stream-Regress, and Fine-Tune are all streaming models trained one sample at a time with a single epoch through the dataset, ILwFOD is an incremental batch method that loops through batches of data many times, making it less ideal for immediate learning.
We introduce a new metric that captures a model’s mean average precision (mAP) at a 0.5 IoU threshold over time. This metric extends the $\Omega_{\text{all}}$ metric from [16, 26] for object detection and normalizes an incremental learner’s performance to an optimized offline baseline, i.e., $\Omega_{\text{mAP}} = \frac{1}{T} \sum_{t=1}^{T} \frac{\alpha_t}{\alpha_{\text{offline},t}}$, where $\alpha_t$ is an incremental learner’s mAP at time $t$, $\alpha_{\text{offline},t}$ is the offline learner’s mAP at time $t$, and there are $T$ total testing events. We only evaluate performance on classes learned until time $t$. While $\Omega_{\text{mAP}}$ is usually between 0 and 1, a value greater than 1 is possible if the incremental learner performed better than the offline baseline. This metric makes it easier to compare performance across datasets of varying difficulty.
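As a small illustration of the normalized metric described above, the sketch below averages, over all testing events, the ratio of the incremental learner's mAP to the offline learner's mAP. The function name `omega_map` and the example numbers are made up for illustration and are not results from the paper.

```python
def omega_map(incremental_map, offline_map):
    """Mean over testing events of (incremental mAP / offline mAP).

    Both arguments are sequences with one mAP value per testing event.
    """
    if not incremental_map or len(incremental_map) != len(offline_map):
        raise ValueError("need one offline mAP per incremental mAP")
    ratios = [a / b for a, b in zip(incremental_map, offline_map)]
    return sum(ratios) / len(ratios)

# Hypothetical numbers: the incremental learner reaches 80% of offline mAP at both events.
score = omega_map([0.6, 0.5], [0.75, 0.625])
```

Because each term is a ratio, an incremental learner that beats the offline model at some testing event contributes a value above 1 for that event.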
In our training paradigm, the model is first initialized with half the total classes and then it is required to learn the second half of the dataset one class at a time, which follows the setup in [51]. We organize the classes in alphabetical order for both PASCAL VOC 2007 and COCO. For example, on VOC, which contains 20 total classes, the network is first initialized with classes 1-10, and then the network learns class 11, then 12, then 13, etc. This paradigm closely matches how incremental class learning experiments have been performed for classification tasks [17, 44]. For all experiments, the network is incrementally trained on all images containing at least one instance for the new class. This means that images could potentially be repeated in previous or future increments. When training a new class, only the labels for the ground truth boxes containing that particular class are provided.
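The protocol above can be sketched as a function that turns an annotated dataset into per-class increments. The annotation layout (`{image_id: [(label, box), ...]}`) and the name `build_increments` are assumptions made for this sketch; the logic follows the description: classes are ordered alphabetically, the first half is reserved for base initialization, and each later increment contains every image with at least one instance of the new class, keeping only that class's boxes.

```python
def build_increments(annotations, num_base):
    """Return (class_order, increments) for class-incremental training.

    annotations: {image_id: [(label, box), ...]} (assumed layout).
    increments[k] lists (image_id, class_name, boxes) for the k-th new class,
    where boxes holds only the ground truth boxes of that class.
    """
    class_order = sorted({label for objs in annotations.values() for label, _ in objs})
    increments = []
    for cls in class_order[num_base:]:
        items = []
        for image_id, objs in annotations.items():
            boxes = [box for label, box in objs if label == cls]
            if boxes:  # image contains at least one instance of the new class
                items.append((image_id, cls, boxes))
        increments.append(items)
    return class_order, increments

# Toy dataset: "img1" contains both a cat and a dog, so it appears in two increments.
toy = {
    "img1": [("cat", (0, 0, 10, 10)), ("dog", (5, 5, 20, 20))],
    "img2": [("dog", (1, 1, 4, 4))],
    "img3": [("bird", (0, 0, 3, 3))],
}
order, incs = build_increments(toy, num_base=1)
```

In the toy example the same image is revisited with different labels in different increments, which is the repetition the text refers to.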
For incremental batch models, after base initialization, models are provided a batch containing all data for a single class, which they are allowed to loop over. Streaming models operate on the same batches of data, but examples from within the batch are observed one at a time and can only be observed once, unless the data is cached in a memory buffer. For VOC, after each new class is learned, each model is evaluated on test data containing at least one box of any previously trained classes. For COCO, models are updated on batches containing a single class after base initialization, which is identical to the VOC paradigm. However, since COCO is much larger than VOC and evaluation takes much longer, we evaluate the model after every 10 new classes of data have been trained.
Following [51], we use the Fast RCNN architecture [55] with a ResNet-50 [18] backbone and edge box object proposals [59] for all models, unless otherwise noted. Edge boxes is an unsupervised method for producing class agnostic object proposals, which is useful in the streaming setting where we don’t know what types of objects will appear in future time steps. Specifically, we compute 2,000 edge boxes for an image. Following [48], we first resize images to 800 × 1000 pixels. To determine whether a box should be labelled as background or foreground, we compute overlap with ground truth boxes using an IoU threshold of 0.5. Then, batches of 64 boxes are randomly selected per image, where each batch must have roughly 25% positive boxes (IoU > 0.5). During inference, 128 boxes are chosen as output after applying per-category Non-Maximum Suppression (NMS) with a threshold of 0.3 to eliminate overlapping boxes. More parameter settings are in supplemental materials.
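The foreground/background labelling and box sampling described above can be sketched in a few lines. This is a simplified illustration, not the training code: boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the names `iou` and `sample_boxes` as well as the behaviour when too few positives exist (take all available positives and fill the rest with negatives) are assumptions.

```python
import random

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def sample_boxes(proposals, gt_boxes, batch_size=64, pos_fraction=0.25,
                 thresh=0.5, rng=random):
    """Split proposals into positives (IoU > thresh with some ground truth box)
    and negatives, then draw roughly pos_fraction positives per batch."""
    pos, neg = [], []
    for p in proposals:
        best = max((iou(p, g) for g in gt_boxes), default=0.0)
        (pos if best > thresh else neg).append(p)
    n_pos = min(len(pos), int(round(batch_size * pos_fraction)))
    n_neg = min(len(neg), batch_size - n_pos)
    return rng.sample(pos, n_pos), rng.sample(neg, n_neg)
```

With the default settings and enough proposals of each kind, a batch holds 16 positive and 48 negative boxes.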
For each input image to RODEO, layer G produces feature map tensors of approximate size 25 × 30 × 2048. Images from the base initialization classes (1-10) for VOC and (1-40) for COCO are used to train the PQ model. For VOC, we are able to fit all the feature maps in memory to train the PQ model. For COCO, it is not possible to fit all the images in memory, so we sub-sample 30 random locations from the full feature map of each image to train the PQ. The ResNet-50 backbone has four residual blocks. We quantize RODEO after the third residual block, i.e., F consists of the last residual block, the Fast RCNN MLP head composed of two fully connected layers, and the linear classifier and regressor. To make experiments fair, we subject RODEO’s replay buffer to an upper limit of 510 MB, which is the amount of memory required by ILwFOD. For VOC, this allows RODEO to store a representation of every sample in the training set. For COCO, this only allows us to store 17,668 compressed samples. To manage the buffer, we use a strategy that always replaces the image with the least number of unique objects.
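The storage saving implied by these numbers can be worked out directly. The sketch below compares an uncompressed feature map with its PQ codes for the approximate 25 × 30 × 2048 size quoted above. Storing the raw features as 4-byte floats is an assumption made for the comparison, and the helper names are hypothetical. Actual feature maps vary in size with the image, so these are per-sample estimates only.

```python
def feature_bytes(p, q, d, bytes_per_value=4):
    """Bytes for an uncompressed p x q x d feature map (4-byte floats assumed)."""
    return p * q * d * bytes_per_value

def code_bytes(p, q, num_codebooks, bytes_per_index=1):
    """Bytes for the PQ codes of the same map: one 1-byte index per codebook per location."""
    return p * q * num_codebooks * bytes_per_index

raw = feature_bytes(25, 30, 2048)   # 6,144,000 bytes
coco = code_bytes(25, 30, 64)       # 48,000 bytes with 64 codebooks
voc = code_bytes(25, 30, 32)        # 24,000 bytes with 32 codebooks
print(raw // coco, raw // voc)      # prints "128 256"
```

Under these assumptions the codes are 128 times (COCO) and 256 times (VOC) smaller than the feature maps they replace, not counting the small fixed cost of the codebooks themselves.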
Our main experimental results are in Table 1 and learning curves are in Fig. 2 and Fig. 3 for VOC and COCO, respectively. We include results for RODEO models that use both real and reconstructed features. Real features do not undergo reconstruction before being passed through plastic layers, F. To normalize $\Omega_{\text{mAP}}$, we use offline models that achieve final mAP values of 0.715 and 0.42 on VOC and COCO, respectively. Additional results are in supplemental materials.
For VOC, RODEO beats all previous methods while replaying only four samples. Our method is much less prone to forgetting than other models, which is demonstrated by its performance at the final time step in Fig. 2. The SLDA+Stream-Regress model is surprisingly competitive on both datasets without the need to update its backbone. For COCO, RODEO is run with four replay samples and outperforms the baseline models by a large margin. Further, across various replay sizes and replacement strategies (Table 2), we find that real features yield better results than reconstructed features.
To study the impact of the buffer management strategy chosen, we run the following replacement strategies on the COCO dataset. Results are in Table 2.
• BAL: Balanced replacement strategy that replaces the item which least affects the overall class distribution.
• MIN, MAX: Replace the image having the fewest and the most unique labels, respectively.
• RANDOM: Randomly replace an image from the buffer.
• NO-REPLACE: No replacement, i.e., store everything and let the buffer expand infinitely.
For an ideal case, we ran a version of RODEO with real features (n = 4) and an unlimited buffer (storing everything). This model achieved an $\Omega_{\text{mAP}}$ of 0.928. All other replacement strategies are only allowed to store 17,668 samples. We find that MAX replace yields even worse results than RANDOM replace, suggesting that storing samples with more unique categories is better. Similarly, we find that MIN replace performs better across both real and reconstructed features, even beating the balanced (BAL) replacement strategy. We hypothesize that since MIN replace keeps images with the most unique objects, it results in a more diverse buffer to overcome forgetting.
For our VOC experiments, we do not replace anything from the buffer. As we increase the number of replay samples from 4 to 12, the performance improves by 0.3% for real features and 5.3% for reconstructed features, respectively. Surprisingly for COCO, which has buffer replacement, the performance decreases as we increase the number of replay samples. We suspect this could be because COCO has many more objects per image than VOC, and these objects are treated as background during region proposal selection. In the future, it would be interesting to develop new methods to handle this background class in an incremental setting, which has been explored for incremental semantic segmentation [5].
For COCO, we train each incremental iteration of Fast R-CNN for 10 epochs which takes about 21.83 hours. Thus, full offline training of 40 iterations takes a total of 873 hrs. In contrast, our method, RODEO, requires only 22 hours which is a 40× speed-up compared to offline. SLDA+Regress and Fine-Tune both train faster, but perform much worse in terms of detection performance. These numbers do not include the base initialization time, which is the same for all methods. Exact numbers are in supplemental materials (Table S2).
In current object detection problem formulations, detected objects are not aware of each other. However, many real-world applications require an understanding of attributes and the relationships between objects. For example, Visual Query Detection (VQD) is a new visual grounding task for localizing multiple objects in an image that satisfy a given language query [2]. Our method can be easily extended for the VQD task by modifying the object detector to output only the boxes relevant to the language query.
In any real system where memory is limited, the choice of an ideal buffer replacement strategy is vital. For any agent that needs to learn new information over time, while also recalling previous knowledge, it is critical to store the most informative memories and replace those which carry less information. This procedure has also been studied in the reinforcement learning literature as experience replay [22, 31]. Our buffer size is limited because it is calculated with respect to the maximum storage required by the ILwFOD model [51]. To efficiently use this limited storage, we tried various replacement strategies to store the newer examples such as: random replacement, class distribution balancing, and replacement of images with the most or fewest number of unique bounding boxes present. In the future, more efficient strategies for determining the maximum buffer size and replacement strategy could be useful for online applications.
RODEO is designed explicitly for streaming applications where real-time inference and overall compute are critical factors, such as robotic or embedded devices. Although RODEO uses Fast-RCNN, a two stage detector, which is slower than single stage detectors like SSD [13, 34] and YOLO [45, 46], single stage approaches could be used to facilitate faster learning and inference. Moreover, RODEO currently uses a ResNet-50 backbone and can only process two images in a single batch. Using a more efficient backbone model like a MobileNet [49] or ShuffleNet [37] architecture would allow the model to run faster with fewer storage requirements. In future work, it would be interesting to study how RODEO could be extended to single-stage detectors by replaying intermediate features and directly using the generated anchors instead of edge box proposals.
Further performance gains could be achieved by using augmentation strategies on the mid-level CNN features. Recently, several augmentation strategies have been designed explicitly for object detection [28, 54, 60] and it would be interesting to explore how they could improve performance within deep feature space for an incremental learning application.
We proposed RODEO, a new method that pioneers streaming object detection. RODEO uses replay of quantized, mid-level CNN features to mitigate catastrophic forgetting on a fixed memory budget. Using our new model, we achieve state-of-the-art performance for incremental object detection tasks on the PASCAL VOC 2007 and MS COCO datasets when compared against models that operate in the easier incremental batch learning paradigm.
Furthermore, our model is general enough to be applied in the future to multi-modal incremental detection tasks like VQD [2], which require an agent to understand scenes and the relationships between objects within them.
S1 Training Details
Hyper-parameter settings for RODEO and the offline models for VOC and COCO are given in Table S1. Similarly, run time comparisons for the COCO dataset are in Table S2.
S2 Where to Quantize?
Our choices of layers to quantize are limited due to the architecture of the ResNet-50 backbone. ResNet-50 has four major layers with (3, 4, 6, 3) bottleneck blocks, respectively. Since bottleneck blocks add a residual shortcut connection at the end, it is not possible to quantize from the middle of the block, leaving only four places to perform quantization. Quantizing earlier has some advantages since it leaves more trainable parameters for the incremental model, which could lead to better results [17]. However, it also requires twice the memory to store the same number of images as we move towards the earlier layers. For efficiency, we choose the last layer for feature quantization.
S3 Additional Results
We provide the individual mAP results for each increment of COCO in Table S3 and VOC in Table S4.
S4 Additional SLDA+Stream-Regress Object Detection Details
An overview of the incremental training stage for the SLDA+Stream-Regress object detection model is given in Alg. 2. We use the Fast RCNN model to extract features from edge box proposals. Given a new input, we then make classification and regression predictions using the SLDA and Stream-Regress models, respectively. For both the SLDA model and the Stream-Regress models, we use shrinkage regularization with parameters of 1e−2 and 1e−4, respectively.
We train the SLDA model as proposed in [15] with one slight modification. In [15], there was a single mean vector stored per class. However, in our work we allow SLDA to store two mean vectors per class, where one mean vector is representative of the actual class data and the second mean vector is representative of the background for that particular class. During test time, we thus obtain two scores for each class: the main class score and the background class score. We keep the main class score for each class and only keep the maximum score of all background scores.
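The score handling at test time can be sketched as below. The sketch only covers how the two sets of scores are combined; computing the scores themselves with SLDA is not shown. The function names are hypothetical, and taking the arg-max over the combined scores to pick a label is an assumption, not something stated in the text.

```python
def combine_scores(class_scores, background_scores):
    """Keep every main class score and append a single background score,
    the maximum over all per-class background scores."""
    if not class_scores or len(class_scores) != len(background_scores):
        raise ValueError("need one background score per class")
    return list(class_scores) + [max(background_scores)]

def predict_label(class_scores, background_scores):
    """Assumed decision rule: arg-max over the combined scores.
    The returned index equals len(class_scores) when background wins."""
    combined = combine_scores(class_scores, background_scores)
    return combined.index(max(combined))

# Three classes: the strongest background score (0.9) beats every class score.
label = predict_label([0.2, 0.7, 0.1], [0.3, 0.9, 0.4])   # background, index 3
```

Collapsing the background scores to a single value keeps the output layout the same as a standard detector head with one background entry.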
Training the Stream-Regress model is similar to training the SLDA model. That is, we first initialize one mean vector of dimension d to zeros, where d is the dimension of the data. We initialize another mean vector of dimension m to zeros, where m is the number of regression targets, and we have four regression coordinates per class including the background class. We also initialize two covariance matrices and a total count of the number of updates, N ∈ R.
Given a new sample $(x_t, y_t)$, where $y_t$ is a one-hot encoding of the regression targets, we make the following updates to our model:
To make predictions, we first compute the precision matrix
with shrinkage parameter $\epsilon$ and identity matrix $I$. We then compute regression targets for an input $x_t$ as: