
Paper: General Instance Distillation for Object Detection



In recent years, knowledge distillation has proven to be an effective solution for model compression. This approach lets lightweight student models acquire the knowledge extracted from cumbersome teacher models. However, previous distillation methods for detection generalize poorly across detection frameworks and rely heavily on ground truth (GT), ignoring the valuable relation information between instances. Thus, we propose a novel distillation method for detection tasks based on discriminative instances, without considering whether GT marks them as positive or negative, which is called general instance distillation (GID). Our approach contains a general instance selection module (GISM) to make full use of feature-based, relation-based and response-based knowledge for distillation. Extensive results demonstrate that the student model achieves significant AP improvement and even outperforms the teacher in various detection frameworks. Specifically, RetinaNet with ResNet-50 achieves 39.1% mAP with GID on the COCO dataset, which surpasses the 36.2% baseline by 2.9% and is even better than the ResNet-101 based teacher model with 38.1% AP.



In recent years, the accuracy of object detection has made great progress thanks to the rise of deep convolutional neural networks (CNNs). Deep learning architectures, including a variety of one-stage detection models [19, 23, 24, 25, 17] and two-stage detection models [26, 16, 8, 2], have replaced traditional object detection and become the mainstream approach in this field. Furthermore, anchor-free frameworks [13, 5, 32] have achieved better performance with simpler approaches. However, these high-precision deep learning models are usually cumbersome, while practical applications demand a lightweight model with high performance. Therefore, how to find a better trade-off between accuracy and efficiency has become a crucial problem.


Knowledge Distillation (KD), proposed by Hinton et al. [10], is a promising solution to the above problem. Knowledge distillation transfers the knowledge of a large model to a small model, thereby improving the performance of the small model and achieving model compression. At present, the typical forms of knowledge can be divided into three categories [7]: response-based knowledge [10, 22], feature-based knowledge [27, 35, 9] and relation-based knowledge [22, 20, 31, 33, 15]. However, most distillation methods are designed mainly for multi-class classification problems. Directly migrating a classification-specific distillation method to a detection model is less effective, because of the extremely unbalanced ratio of positive and negative instances in the detection task. Some distillation frameworks designed for detection tasks cope with this problem and achieve impressive results, e.g. Li et al. [14] address the problem by distilling the positive and negative instances sampled by the RPN in a certain proportion, and Wang et al. [34] further propose to distill only the area near the ground truth. Nevertheless, the ratio between positive and negative instances for distillation needs to be meticulously designed, and distilling only the GT-related area may ignore potentially informative areas in the background. Moreover, current detection distillation methods cannot work well across multiple detection frameworks simultaneously, e.g. two-stage and anchor-free methods. Therefore, we hope to design a general distillation method for various detection frameworks that uses as much knowledge as possible effectively, without regard to whether an instance is positive or negative.


Towards this goal, we propose a distillation method based on discriminative instances, utilizing response-based, feature-based and relation-based knowledge, as shown in Fig 1. There are several advantages: (i) We can model the relational knowledge between instances in one image for distillation. Hu et al. [11] demonstrate the effectiveness of relational information on detection tasks. However, relation-based knowledge distillation in object detection has not been explored yet. (ii) We avoid manually setting the proportion of positive and negative areas or selecting only the GT-related areas for distillation. Though GT-related areas are mostly informative, the extremely hard and extremely simple instances may be useless, and even some informative patches from the background can help students learn the generalization of teachers. Besides, we find that automatically selecting discriminative instances between the student and teacher for distillation makes knowledge transfer more effective. Those discriminative instances are called general instances (GIs), since our method does not care about the proportion between positive and negative instances, nor does it rely on GT labels. (iii) Our method generalizes robustly to various detection frameworks. GIs are calculated from the outputs of the student and teacher models, without relying on particular modules of a specific detector or on a key characteristic, such as anchors, of a particular detection framework.





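Figure 1. Overall pipeline of general instance distillation (GID). General instances (GIs) are adaptively selected from the outputs of the teacher and student models. Feature-based, relation-based and response-based knowledge is then extracted for distillation based on the selected GIs.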


  • We define the general instance (GI) as the distillation target, which can effectively improve the distillation effect of the detection model.
  • Based on GI, we are the first to introduce relation-based knowledge for distillation on detection tasks and integrate it with response-based and feature-based knowledge, which makes the student surpass the teacher.
  • We verify the effectiveness of our method on the MSCOCO [18] and PASCAL VOC [6] datasets, including one-stage, two-stage and anchor-free methods, achieving state-of-the-art performance.



The current mainstream object detection algorithms are roughly divided into two-stage and one-stage detectors. Two-stage methods [16, 8, 2], represented by Faster R-CNN [26], maintain the highest accuracy in the detection field. These methods use a region proposal network (RPN) and a refinement procedure for classification and localization to obtain better performance. However, high demand for lower latency has put one-stage detectors [19, 23] in the spotlight; these classify and localize targets directly from the feature map.


In recent years, another criterion divides detection algorithms into anchor-based and anchor-free methods. Anchor-based detectors such as [24, 17, 19] solve object detection tasks with the help of anchor boxes, which can be viewed as pre-defined sliding windows or proposals. Nevertheless, all anchor-based methods need to be meticulously designed and must process a large number of anchor boxes, which takes much computation. To avoid tuning hyper-parameters and the computation related to anchor boxes, anchor-free methods [23, 13, 5, 32] predict several key points of a target, such as its center and the distances to its boundaries, and reach better performance at lower cost.



Knowledge distillation is a model compression and acceleration approach that can effectively improve the performance of small models under the guidance of teacher models. In knowledge distillation, knowledge takes many forms, e.g. the soft targets of the output layer [10], the intermediate feature map [27], the distribution of the intermediate feature [12], the activation status of each neuron [9], the mutual information of the intermediate feature [1], the transformation of the intermediate feature [35] and the instance relationship [22, 20, 31, 33]. These forms of knowledge can be classified into the following categories [7]: response-based [10], feature-based [27, 12, 9, 1, 35], and relation-based [22, 20, 31, 33].
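To make the response-based category concrete, the snippet below is a minimal sketch of a soft-target loss in the style usually attributed to Hinton et al. [10]. It is illustrative only and is not taken from the GID paper: the function names, the default temperature of 4.0, and the scaling of the loss by the squared temperature are assumptions made for this example.

```python
import numpy as np

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over the last axis, with temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def soft_target_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened class distributions.

    The temperature value and the T^2 scaling are assumptions of this
    sketch (a common convention), not settings reported in the source.
    """
    log_p_t = log_softmax(teacher_logits, temperature)
    log_p_s = log_softmax(student_logits, temperature)
    kl = np.sum(np.exp(log_p_t) * (log_p_t - log_p_s), axis=-1)
    return float(kl.mean() * temperature ** 2)

if __name__ == "__main__":
    teacher = np.array([[2.0, 1.0, 0.1]])
    student = np.array([[0.1, 1.0, 2.0]])
    print(soft_target_loss(student, teacher))
```

The loss is zero when the student reproduces the teacher's logits and grows as the two softened distributions diverge. Feature-based and relation-based losses compare intermediate features or pairwise relations instead of output distributions.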


Recently, several works have applied knowledge distillation to object detection tasks. Unlike in classification tasks, the distillation losses in detection tasks encounter an extreme imbalance between positive and negative instances. Chen et al. [3] first deal with this problem by underweighting the background distillation loss in the classification head while still imitating the full feature map in the backbone. Li et al. [14] design a distillation framework for two-stage detectors, applying the L2 distillation loss to the features sampled by the RPN of the student model, which consist of randomly sampled negative and positive proposals, discriminated by ground truth (GT) labels, in a certain proportion. Wang et al. [34] propose fine-grained feature imitation for anchor-based detectors, distilling the near-object regions, which are calculated from the intersection between GT boxes and the anchors generated by the detector. That is to say, the background areas will hardly be distilled even if they contain several information-rich areas. Similar to Wang et al. [34], Sun et al. [30] distill only the GT-related region, both on the feature map and on the detector head.


In summary, previous distillation frameworks for detection tasks all manually set the ratio between distilled positive and negative instances, as distinguished by the GT labels, to cope with the disproportion between foreground and background areas in detection tasks. Thus, the main differences between our method and previous works can be summarized as follows: (i) Our method does not rely on GT labels, nor does it care about the proportion between positive and negative instances selected for distillation. It is the information gap between student and teacher that guides the model to choose the discriminative patches for imitation. (ii) None of the previous methods take advantage of relation-based knowledge for distillation. However, it is widely acknowledged that the relations between objects contain a great deal of information even within a single image. Thus, based on our selected discriminative patches, we extract the relation-based knowledge among them for distillation, achieving further performance gain.




Previous work [34] proposed that the feature regions near objects carry considerable information that is useful for knowledge distillation. However, we find that not only the feature regions near objects but also discriminative patches from the background area carry meaningful knowledge. Based on this finding, we design the general instance selection module (GISM), as shown in Fig 2. The module uses the predictions from both the teacher and student models to select the key instances for distillation.


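Figure 2. Illustration of the general instance selection module (GISM). To find the most informative locations, we compute the L1 distance between the classification scores of the student and the teacher as the GI score, and keep the regression box with the higher score as the GI box. To avoid counting losses twice, we use the non-maximum suppression (NMS) algorithm to remove duplicates.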

Furthermore, to make better use of the information provided by the teacher, we extract and exploit feature-based, relation-based and response-based knowledge for distillation, as shown in Fig 3. The experimental results show that our distillation framework is general across current state-of-the-art detection models.


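Figure 3. Details of our method: (a) The selected GIs are used to crop features from the student and teacher backbones via ROI Align. Feature-based and relation-based knowledge is then extracted for distillation. (b) The selected GIs first generate a mask through GI assignment. The masked classification and regression heads are then distilled to exploit response-based knowledge.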


In a detection model, predictions indicate the attended patches, which are commonly meaningful areas. The difference in such patches between the teacher and student models is also closely related to their performance gap. In order to quantify the difference for each instance and then select the discriminative instances for distillation, we propose two indicators: the GI score and the GI box. Both are calculated dynamically during each training step. To save computation during training, we simply take the L1 distance between classification scores as the GI score and choose the box with the higher score as the GI box. Fig 2 illustrates the procedure of generating GIs; the score and box for each predicted instance r are defined as below.


P_{GI}^{r} = \max_{0<c\leq C} \left| P_{t}^{rc} - P_{s}^{rc} \right|,

B_{GI}^{r} = \begin{cases} B_{t}^{r}, & \max_{0<c\leq C} P_{t}^{rc} > \max_{0<c\leq C} P_{s}^{rc} \\ B_{s}^{r}, & \max_{0<c\leq C} P_{t}^{rc} \leq \max_{0<c\leq C} P_{s}^{rc} \end{cases},

GI = \mathrm{NMS}\left( P_{GI}, B_{GI} \right),
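The three definitions above can be sketched in a few lines of NumPy. This is an illustrative reading of the equations, not the authors' implementation: the function names, the (x1, y1, x2, y2) box format, the IoU threshold of 0.3 and the `top_k` cap are all assumptions made for the example, and a plain greedy NMS stands in for whatever NMS variant the original code uses.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one box and an (N, 4) array of boxes in (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter + 1e-9)

def nms(scores, boxes, iou_thresh):
    """Greedy NMS; returns kept indices ordered by descending score."""
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        if rest.size == 0:
            break
        ious = iou_one_to_many(boxes[i], boxes[rest])
        order = rest[ious <= iou_thresh]
    return keep

def select_general_instances(p_t, p_s, b_t, b_s, iou_thresh=0.3, top_k=10):
    """p_t, p_s: (R, C) class scores; b_t, b_s: (R, 4) boxes per instance r.

    iou_thresh and top_k are hypothetical defaults chosen for this sketch.
    """
    p_t, p_s = np.asarray(p_t, float), np.asarray(p_s, float)
    b_t, b_s = np.asarray(b_t, float), np.asarray(b_s, float)
    # GI score: largest per-class L1 gap between teacher and student scores.
    gi_score = np.abs(p_t - p_s).max(axis=1)
    # GI box: the teacher's box if the teacher is more confident, else the student's.
    teacher_wins = p_t.max(axis=1) > p_s.max(axis=1)
    gi_box = np.where(teacher_wins[:, None], b_t, b_s)
    # Remove overlapping duplicates so no region is counted twice.
    keep = nms(gi_score, gi_box, iou_thresh)[:top_k]
    return gi_score[keep], gi_box[keep], keep
```

Note that no ground-truth labels enter the computation: an instance is selected only because the teacher and student disagree on it, which is why background patches can also become general instances.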

