[Paper Reading][backbone] Knowledge Diffusion for Distillation


The representation gap between teacher and student is an emerging topic in knowledge distillation (KD).
To reduce the gap and improve the performance, current methods often resort to complicated training schemes, loss functions, and feature alignments, which are task-specific and feature-specific.
In this paper, we state that the essence of these methods is to discard the noisy information and distill the valuable information in the feature, and propose a novel KD method dubbed DiffKD, to explicitly denoise and match features using diffusion models.
Our approach is based on the observation that student features typically contain more noise than teacher features due to the smaller capacity of the student model.
To address this, we propose to denoise student features using a diffusion model trained on teacher features.
This allows us to perform better distillation between the refined clean feature and the teacher feature.
Additionally, we introduce a light-weight diffusion model with a linear autoencoder to reduce the computation cost and an adaptive noise matching module to improve the denoising performance.
Extensive experiments demonstrate that DiffKD is effective across various types of features and consistently achieves state-of-the-art performance on image classification, object detection, and semantic segmentation tasks.





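[...] computation and memory, which limits their application on resource-limited devices. A widely used solution is knowledge distillation (KD) [13], which aims to improve the efficiency [...]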



The key to knowledge distillation lies in how to transfer the knowledge from teacher to student by matching the output features (e.g., representations and logits).
Recently, some studies [16, 28] have shown that the discrepancy between student and teacher features can be significantly large due to the capacity gap between the two models. Directly aligning those mismatched features can even disturb the optimization of the student and weaken its performance.
As a result, the essence of most state-of-the-art KD methods is to shrink this discrepancy and only select the valuable information for distillation.
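To make "matching the output features" concrete for logits, the following is a minimal sketch of the classic temperature-softened KL distillation loss. It illustrates plain logit matching, not the DiffKD loss itself; the function names and the temperature value of 4 are illustrative assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_kl_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s) if pt > 0)
    return kl * temperature ** 2

teacher = [5.0, 1.0, -2.0]
close_student = [4.5, 1.2, -1.8]   # roughly agrees with the teacher
far_student = [-2.0, 1.0, 5.0]     # ranks the classes in the opposite order

print(kd_kl_loss(close_student, teacher))
print(kd_kl_loss(far_student, teacher))
```

The loss is zero only when the two softened distributions coincide, which is the exact matching that a large teacher-student gap makes hard to satisfy.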




For example, [28] introduces multiple middle-sized teacher assistant models to bridge the gap; [29] learns a student-friendly teacher by regularizing the teacher training with the student; DIST [16] relaxes the exact matching of teacher and student features imposed by the Kullback-Leibler (KL) divergence loss by proposing a correlation-based loss; MasKD [17] distills the valuable information in the features and ignores the noisy regions by learning to identify receptive regions that contribute to the task precision.
However, these methods need to resort to either complicated training schemes or task-specific priors, making them challenging to apply to various tasks and feature types.
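As a rough illustration of what a correlation-based loss relaxes, the sketch below compares a Pearson-correlation loss with mean squared error on two prediction vectors. It is a simplified stand-in and not the full DIST objective, which may include further terms; the function names are hypothetical.

```python
import math

def pearson(u, v, eps=1e-12):
    """Pearson correlation coefficient between two equal-length vectors."""
    mu_u = sum(u) / len(u)
    mu_v = sum(v) / len(v)
    cov = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mu_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mu_v) ** 2 for b in v))
    return cov / (norm_u * norm_v + eps)

def correlation_loss(student_probs, teacher_probs):
    """1 - Pearson correlation: zero when the predictions are linearly related."""
    return 1.0 - pearson(student_probs, teacher_probs)

def mse(u, v):
    """Mean squared error, an exact-matching criterion for comparison."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

teacher = [0.7, 0.2, 0.1]
# A softer student that keeps the teacher's class relations (affine map, still sums to 1).
student = [0.5 * p + 0.5 / 3 for p in teacher]

print(correlation_loss(student, teacher))  # close to zero
print(mse(student, teacher))               # strictly positive
```

The student here is penalized by exact matching but not by the correlation loss, since it preserves the relative preferences among classes.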





In this paper, we proceed from a different perspective and argue that the devil of knowledge distillation is in the noise within the distillation features.
Intuitively, we regard the student as a noisy version of the teacher, because its limited capacity or training recipe keeps it from learning truly valuable features.
However, distilling knowledge with this noise can be detrimental for the student, and may even lead to undesired degradation.
Therefore, we propose to eliminate the noisy information within the student and distill only the valuable information accordingly.
Concretely, inspired by the success of generative tasks, we leverage diffusion models [14, 39], a class of probabilistic generative models that can gradually remove the noise from an image or a feature, to perform the denoising.
An overview of our DiffKD is illustrated in Fig. 1.
We empirically show that this simple denoising process can generate a denoised student feature that is very similar to the corresponding teacher feature, ensuring that our distillation can be performed in a more consistent manner.
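The denoising idea can be sketched with the standard DDPM-style forward process, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, and its inversion given a noise estimate. This is a minimal sketch under assumptions: the linear beta schedule, the 1000 steps, and all function names are illustrative, and the true noise stands in for the output of a trained noise-prediction network.

```python
import math
import random

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    alpha_bars, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars

def add_noise(x0, noise, alpha_bar):
    """Forward process: mix a clean feature with Gaussian noise at one timestep."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, noise)]

def predict_x0(xt, predicted_noise, alpha_bar):
    """Invert the forward process given an estimate of the added noise."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - b * e) / a for x, e in zip(xt, predicted_noise)]

rng = random.Random(0)
alpha_bars = make_alpha_bars()
teacher_feature = [0.5, -1.2, 2.0, 0.3]   # toy stand-in for a teacher feature
noise = [rng.gauss(0.0, 1.0) for _ in teacher_feature]
t = 200                                    # arbitrary intermediate timestep

noisy = add_noise(teacher_feature, noise, alpha_bars[t])
# With the true noise the clean feature is recovered up to rounding;
# a trained network would only approximate `noise`.
recovered = predict_x0(noisy, noise, alpha_bars[t])
print(recovered)
```

In the distillation setting described above, the diffusion model would be trained on noised teacher features, and the student feature would play the role of `noisy` at inference.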









Nevertheless, directly leveraging diffusion models in knowledge distillation has two major issues.
(1) Expensive computation cost.
Conventional diffusion models use a UNet-based architecture to predict the noise, and require a large amount of computation to generate high-quality images.
In DiffKD, a lighter diffusion model would suffice since we only need to denoise the student feature.
We therefore propose a light-weight diffusion model consisting of two bottleneck blocks in ResNet [10].
Besides, inspired by Latent Diffusion [34], we also adopt a linear autoencoder to compress the teacher feature, which further reduces the computation cost.
(2) Inexact noise level of the student feature.
The reverse denoising process in diffusion must start from a certain initial timestep, but in DiffKD the student feature is used as the initial noisy feature and we cannot directly obtain its corresponding noise level (timestep); an inexact noise level would therefore weaken the denoising performance.
To solve this problem, we propose an adaptive noise matching module, which measures the noise level of each student feature adaptively and adds a corresponding amount of Gaussian noise to the feature so that it matches the correct noise level at initialization.
With these two improvements, our resulting method DiffKD is efficient and effective, and can be easily implemented on various tasks.
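One way to picture the noise matching step is a convex blend of the student feature with Gaussian noise, controlled by a per-feature weight. The sketch below assumes this blend form; the name `gamma` and the function name are hypothetical, and in practice the weight would come from a small learned module rather than being passed in by hand.

```python
import random

def match_noise_level(student_feature, gamma, rng):
    """Blend a feature with Gaussian noise: gamma * z + (1 - gamma) * eps.

    gamma near 1 keeps the feature almost unchanged (it is already noisy
    enough); gamma near 0 replaces it almost entirely with noise.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return [gamma * z + (1.0 - gamma) * rng.gauss(0.0, 1.0) for z in student_feature]

student_feature = [0.3, -0.7, 1.5]   # toy stand-in for a student feature
initial = match_noise_level(student_feature, gamma=0.8, rng=random.Random(0))
print(initial)  # would serve as the starting point of the reverse denoising process
```

Making the weight adaptive lets each student feature start the reverse process from a noise level that suits it, instead of sharing one fixed initial timestep.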











It is worth noting that one of the merits of DiffKD is that it is feature-agnostic: the knowledge diffusion can be applied to different types of features, including intermediate features, classification outputs, and regression outputs.
Extensive experimental results show that DiffKD consistently surpasses current state-of-the-art methods on standard model settings of image classification, object detection, and semantic segmentation tasks.
For instance, DiffKD obtains 73.62% accuracy with a MobileNetV1 student and a ResNet-50 teacher on ImageNet, surpassing DKD [50] by 1.57%; while on semantic segmentation, DiffKD outperforms MasKD [17] by 1% with a PSPNet-R18 student on the Cityscapes test set.
Moreover, to demonstrate our efficacy in eliminating the discrepancy between teacher and student features, we also implement DiffKD on stronger teacher settings that have much more advanced teacher models, and our method significantly outperforms existing methods.
For example, with a Swin-T student and a Swin-L teacher, our DiffKD achieves a remarkable 82.5% accuracy on ImageNet, improving over the KD baseline by a large margin of 1%.





