[Paper Reading][backbone] Knowledge Diffusion for Distillation

Abstract

The representation gap between teacher and student is an emerging topic in knowledge distillation (KD).
To reduce the gap and improve the performance, current methods often resort to complicated training schemes, loss functions, and feature alignments, which are task-specific and feature-specific.
In this paper, we state that the essence of these methods is to discard the noisy information and distill the valuable information in the feature, and we propose a novel KD method dubbed DiffKD that explicitly denoises and matches features using diffusion models.
Our approach is based on the observation that student features typically contain more noise than teacher features due to the smaller capacity of the student model.
To address this, we propose to denoise student features using a diffusion model trained on teacher features.
This allows us to perform better distillation between the refined clean feature and the teacher feature.
Additionally, we introduce a light-weight diffusion model with a linear autoencoder to reduce the computation cost, and an adaptive noise matching module to improve the denoising performance.
Extensive experiments demonstrate that DiffKD is effective across various types of features and achieves state-of-the-art performance consistently on image classification, object detection, and semantic segmentation tasks.

Introduction

The success of deep neural networks usually comes with large demands on computation and memory, which limits their application on resource-limited devices. A widely used solution is knowledge distillation (KD) [13], which aims to improve an efficient model (the student) by transferring the knowledge of a larger model (the teacher).

 

The key to knowledge distillation lies in how to transfer the knowledge from teacher to student by matching the output features (e.g., representations and logits).
Recently, some studies [16, 28] have shown that the discrepancy between student features and teacher features can be significantly large due to the capacity gap between the two models.
Directly aligning those mismatched features would even disturb the optimization of the student and weaken the performance.
As a result, the essence of most state-of-the-art KD methods is to shrink this discrepancy and only select the valuable information for distillation.


For example, TAKD [28] introduces multiple middle-sized teacher assistant models to bridge the gap;
SFTN [29] learns a student-friendly teacher by regularizing the teacher training with the student;
DIST [16] relaxes the exact matching of teacher and student features in the Kullback-Leibler (KL) divergence loss by proposing a correlation-based loss;
MasKD [17] distills the valuable information in the features and ignores the noisy regions by learning to identify receptive regions that contribute to the task precision.
However, these methods need to resort to either complicated training schemes or task-specific priors, making them challenging to apply to various tasks and feature types.

 

In this paper, we proceed from a different perspective and argue that the devil of knowledge distillation is in the noise within the distillation features.
Intuitively, we regard the student as a noisy version of the teacher, since its limited capacity or training recipe prevents it from learning truly valuable and decent features.
However, distilling knowledge with this noise can be detrimental to the student, and may even lead to undesired degradation.
Therefore, we propose to eliminate the noisy information within the student and distill only the valuable information accordingly.
Concretely, inspired by the success of generative tasks, we leverage diffusion models [14, 39], a class of probabilistic generative models that can gradually remove the noise from an image or a feature, to perform the denoising.
An overview of our DiffKD is illustrated in Fig. 1.
We empirically show that this simple denoising process can generate a denoised student feature that is very similar to the corresponding teacher feature, ensuring that our distillation can be performed in a more consistent manner.
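To make the idea concrete, here is a minimal sketch of the denoise-then-distill recipe in PyTorch. It is not the authors' implementation: the linear noise schedule, the 5-step reverse schedule, and names such as `noise_predictor`, `add_noise`, and `diffkd_step` are illustrative assumptions; only the overall structure (train a noise predictor on teacher features with the standard epsilon-prediction loss, then denoise the student feature and match it to the teacher feature) follows the description above.

```python
# Minimal sketch of the denoise-then-distill idea (assumption: PyTorch;
# schedules and function names here are illustrative, not the paper's code).
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # linear noise schedule (assumed)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # \bar{alpha}_t

def add_noise(x0, t, noise):
    """Forward diffusion q(x_t | x_0) for (B, C, H, W) features."""
    abar = alphas_cumprod.to(x0.device)[t].view(-1, 1, 1, 1)
    return abar.sqrt() * x0 + (1.0 - abar).sqrt() * noise

def ddim_denoise(noise_predictor, x, timesteps):
    """Deterministic DDIM-style reverse process over a short timestep schedule."""
    b = x.size(0)
    for i, t in enumerate(timesteps):
        t_batch = torch.full((b,), t, device=x.device, dtype=torch.long)
        eps_hat = noise_predictor(x, t_batch)
        abar_t = alphas_cumprod[t]
        x0_hat = (x - (1.0 - abar_t).sqrt() * eps_hat) / abar_t.sqrt()
        if i + 1 < len(timesteps):                   # jump to the next (smaller) timestep
            abar_prev = alphas_cumprod[timesteps[i + 1]]
            x = abar_prev.sqrt() * x0_hat + (1.0 - abar_prev).sqrt() * eps_hat
        else:
            x = x0_hat                               # final estimate of the clean feature
    return x

def diffkd_step(noise_predictor, f_teacher, f_student):
    """One training step: (1) fit the noise predictor on noised teacher features,
    (2) denoise the student feature and distill it toward the teacher feature."""
    b = f_teacher.size(0)
    t = torch.randint(0, T, (b,), device=f_teacher.device)
    eps = torch.randn_like(f_teacher)
    noisy_teacher = add_noise(f_teacher.detach(), t, eps)
    loss_diff = F.mse_loss(noise_predictor(noisy_teacher, t), eps)

    # Treat the student feature as a noisy version of the teacher feature and
    # run a few reverse steps before computing the distillation loss.
    f_denoised = ddim_denoise(noise_predictor, f_student, [600, 450, 300, 150, 0])
    loss_kd = F.mse_loss(f_denoised, f_teacher.detach())
    return loss_diff + loss_kd
```

In DiffKD itself, the denoising is applied to features compressed by the linear autoencoder and the noise predictor is the light-weight model described below; the sketch only fixes the loss structure.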

Nevertheless, directly leveraging diffusion models in knowledge distillation has two major issues.
(1) Expensive computation cost.
Conventional diffusion models use a UNet-based architecture to predict the noise and require a large amount of computation to generate high-quality images.
In DiffKD, a lighter diffusion model suffices since we only need to denoise the student feature.
We therefore propose a light-weight diffusion model consisting of two bottleneck blocks from ResNet [10].
Besides, inspired by Latent Diffusion [34], we also adopt a linear autoencoder to compress the teacher feature, which further reduces the computation cost.
(2) Inexact noise level of the student feature.
The reverse denoising process in diffusion requires starting from a certain initial timestep, but in DiffKD the student feature is used as the initial noisy feature and we cannot directly obtain its corresponding noise level (timestep); therefore, the inexact noise level would weaken the denoising performance.
To solve this problem, we propose an adaptive noise matching module, which measures the noise level of each student feature adaptively and specifies a corresponding Gaussian noise for the feature so that it matches the correct noise level at initialization.
With these two improvements, the resulting DiffKD is efficient and effective, and can be easily implemented on various tasks.
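A rough sketch of these components is given below, again assuming PyTorch. The exact layer widths, the timestep embedding, and the gating design of the noise matching module are my own guesses rather than the paper's architecture; the class names `LinearAutoEncoder`, `LightNoisePredictor`, and `AdaptiveNoiseMatching` are hypothetical.

```python
# Illustrative sketch of the light-weight components described above
# (assumption: PyTorch; sizes and the gating design are guesses, not the paper's exact modules).
import torch
import torch.nn as nn

class LinearAutoEncoder(nn.Module):
    """Compress the teacher feature with 1x1 convolutions (no non-linearity)."""
    def __init__(self, channels, latent_channels):
        super().__init__()
        self.encoder = nn.Conv2d(channels, latent_channels, kernel_size=1)
        self.decoder = nn.Conv2d(latent_channels, channels, kernel_size=1)

    def forward(self, x):
        z = self.encoder(x)
        return z, self.decoder(z)   # latent for diffusion, reconstruction for training

class Bottleneck(nn.Module):
    """ResNet-style bottleneck block (1x1 -> 3x3 -> 1x1 with a residual connection)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        mid = channels // reduction
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1), nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))

class LightNoisePredictor(nn.Module):
    """Two bottleneck blocks instead of a full UNet; the timestep enters via a
    learned embedding added to the feature (an illustrative choice)."""
    def __init__(self, channels, max_timesteps=1000):
        super().__init__()
        self.time_embed = nn.Embedding(max_timesteps, channels)
        self.blocks = nn.Sequential(Bottleneck(channels), Bottleneck(channels))

    def forward(self, x, t):
        x = x + self.time_embed(t)[:, :, None, None]
        return self.blocks(x)

class AdaptiveNoiseMatching(nn.Module):
    """Predict a per-sample ratio from the student feature and blend the feature
    with Gaussian noise accordingly, so it roughly matches the noise level
    expected at the initial denoising timestep."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(channels, 1, 1), nn.Sigmoid()
        )

    def forward(self, f_student):
        ratio = self.gate(f_student)                  # (B, 1, 1, 1) in [0, 1]
        noise = torch.randn_like(f_student)
        return (1.0 - ratio) * f_student + ratio * noise
```

In a full step, one would compress the teacher feature with the autoencoder, pass the noise-matched student feature (projected to the same latent width) through the short reverse loop from the previous sketch using `LightNoisePredictor`, and compute the distillation loss in that latent space, presumably alongside a reconstruction loss that keeps the compressed teacher feature faithful.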

It is worth noting that one of the merits of our method DiffKD is that it is feature-agnostic: the knowledge diffusion can be applied to different types of features, including intermediate features, classification outputs, and regression outputs.
Extensive experimental results show that DiffKD surpasses current state-of-the-art methods consistently on standard model settings of image classification, object detection, and semantic segmentation tasks.
For instance, DiffKD obtains 73.62% accuracy with a MobileNetV1 student and a ResNet-50 teacher on ImageNet, surpassing DKD [50] by 1.57%; on semantic segmentation, DiffKD outperforms MasKD [17] by 1% with a PSPNet-R18 student on the Cityscapes test set.
Moreover, to demonstrate its efficacy in eliminating the discrepancy between teacher and student features, we also evaluate DiffKD under stronger teacher settings with much more advanced teacher models, where our method significantly outperforms existing methods.
For example, with a Swin-T student and a Swin-L teacher, DiffKD achieves a remarkable 82.5% accuracy on ImageNet, improving over the KD baseline by a large margin of 1%.
