
Original link: https://mp.weixin.qq.com/s?__biz=MzA4MzQ4Mzg2OQ==&mid=2654201880&idx=1&sn=d1ba638aa9f1df54af289a32c6ccbfff&chksm=843285aab3450cbcb9ff5d8ff703e83ce114e29fe543a457ec61c5afaf03c0a452e92d03e1aa&mpshare=1&scene=1



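Editor: Sophia | Reported by 计算机视觉联盟 (Computer Vision Alliance) | WeChat official account: CVLianMeng

Reposted from: Huawei Noah's Ark Lab

Huawei Noah's Ark Lab, together with Peking University and the University of Sydney, has released the paper "DAFL: Data-Free Learning of Student Networks", which proposes a network distillation method that works without any data (DAFL). It improves on the previous best algorithm by 6 percentage points on MNIST, and with ResNet-18 it reaches 92% and 74% accuracy on CIFAR-10 and CIFAR-100 respectively, without training data. The paper has been accepted to ICCV 2019.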















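Figure 1: Data-free Learning

Algorithm 1: Data-free Learning

Table 1: Experimental results on the MNIST dataset

Table 2: Experimental results on the CIFAR datasets

Table 3: Experimental results on the CelebA dataset

Table 4: Ablation study

Figure 2: Visualization of convolution filters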

Great Breakthrough! Huawei Noah's Ark Lab pioneers a novel knowledge distillation technique that needs no training data.

Huawei Noah's Ark Lab has published the paper "DAFL: Data-Free Learning of Student Networks", which is the first to propose a knowledge distillation method that works without data. The proposed DAFL outperforms the previous state-of-the-art method on MNIST by 6 percentage points of accuracy, and achieves 92% and 74% accuracy on CIFAR-10 and CIFAR-100 using ResNet-18 with no training data. The paper has been accepted to ICCV 2019.


Deep convolutional neural networks (CNNs) have been successfully used in various computer vision applications such as image classification, object detection and semantic segmentation. However, running most of the widely used CNNs requires heavy computation and storage, so they can only be deployed on PCs with modern GPU cards. In order to compress and speed up heavy pre-trained deep models, various effective approaches have been proposed recently.

Although the above-mentioned methods have made tremendous progress on benchmark datasets and models, an important issue has not been widely noticed: most existing network compression and speed-up algorithms rely on the strong assumption that the training samples of the original network are available. However, the training dataset is routinely unknown in real-world applications due to privacy and transmission limitations. For instance, users do not want their photos leaked to others. Therefore, conventional methods cannot be directly used for learning portable deep models under these practical constraints. Only a few works have been proposed for compressing deep models without training data, and the performance of networks compressed with these methods is much lower than that of the original network, because they cannot effectively utilize the pre-trained neural networks.

To address this problem, we propose a novel framework for compressing deep neural networks without the original training dataset. Specifically, the given heavy neural network is regarded as a fixed discriminator. Then, a generative network is established to substitute for the original training set by extracting information from the network during the adversarial procedure, and its output can be utilized for learning smaller networks with acceptable performance. The superiority of the proposed method is demonstrated through extensive experiments on benchmark datasets and models.

GAN for Generating Training Samples

In order to learn a portable network without the original data, we exploit a GAN to generate training samples from the information available in the given network. Generative adversarial networks (GANs) have been widely applied for generating samples. A GAN consists of a generator G and a discriminator D. G is expected to generate the desired data, while D is trained to identify the differences between real images and those produced by the generator. Adversarial learning techniques can be naturally employed to synthesize training data. However, the discriminator requires real images for training. In the absence of training data, it is thus impossible to train the discriminator as in vanilla GANs.

Recent works have shown that the discriminator can learn a hierarchy of representations from samples, which encourages the generalization of D to other tasks such as image classification. Instead of training a new discriminator as in vanilla GANs, we note that the given deep neural network can also extract semantic features from images, since it has already been well trained on large-scale datasets. Hence, we propose to regard this given deep neural network as a fixed discriminator, so that G can be optimized directly without training D at the same time.

In vanilla GANs, the output of the discriminator is a probability indicating whether an input image is real or fake. However, when the teacher deep neural network serves as the discriminator, its output classifies images into different concept sets instead of indicating whether the images are real. The loss function of vanilla GANs is therefore inapplicable for approximating the original training set. Thus, we conduct a thorough analysis of real images and their responses on this teacher network, and devise several new loss functions that reflect our observations.

On the image classification task, the teacher deep neural network adopts the cross-entropy loss in the training stage. Specifically, for multi-class classification, the outputs are encouraged to be one-hot vectors, where only one entry is 1 and all the others are 0. If the images generated by G follow the same distribution as the training data of the teacher network, they should also produce outputs similar to those of the training data. We thus introduce the one-hot loss:

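$$\mathcal{L}_{oh} = \frac{1}{n}\sum_{i}\mathcal{H}_{cross}(y_T^i, t^i)$$

where $\mathcal{H}_{cross}$ is the cross-entropy loss function, $y_T^i$ is the teacher network's output for the $i$-th of $n$ generated images, and $t^i$ is its pseudo label. Since the generated images have no true labels, we use the index of the maximum value of the output, $t^i = \arg\max_j (y_T^i)_j$, as the pseudo label.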
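The one-hot loss described above can be illustrated with a minimal pure-Python sketch. The function names `softmax` and `one_hot_loss` and the toy logits are assumptions made for illustration; they are not taken from the DAFL code base, which operates on tensors rather than Python lists.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot_loss(teacher_logits):
    """Mean cross-entropy between each teacher output and its own argmax pseudo label."""
    loss = 0.0
    for logits in teacher_logits:
        probs = softmax(logits)
        # The generated image has no true label, so use the predicted class.
        pseudo_label = max(range(len(probs)), key=probs.__getitem__)
        loss += -math.log(probs[pseudo_label])
    return loss / len(teacher_logits)

# Hypothetical teacher logits for two generated images, three classes each.
confident = [[8.0, 0.0, 0.0], [0.0, 9.0, 0.0]]
uncertain = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0]]
print(one_hot_loss(confident), one_hot_loss(uncertain))
```

Outputs that are close to one-hot give a loss near zero, while nearly uniform outputs give a much larger loss, so minimizing this term pushes the generator toward images the teacher classifies confidently.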

Besides the class labels predicted by DNNs, the intermediate features extracted by convolution layers are also important representations of input images. Since the filters in the teacher DNN have been trained to extract intrinsic patterns in the training data, feature maps tend to receive higher activation values if the input images are real rather than random vectors. Hence, we define an activation loss function as:

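$$\mathcal{L}_{a} = -\frac{1}{n}\sum_{i}\|f_T^i\|_1$$

where $f_T^i$ denotes the features that the teacher network extracts from the $i$-th of $n$ generated images, and $\|\cdot\|_1$ is the $\ell_1$ norm.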
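As a rough illustration of the activation loss, the sketch below averages the L1 norm of per-image feature vectors and negates it, so that stronger activations lower the loss. The name `activation_loss` and the flattened toy feature vectors are hypothetical; which layer the features come from is left open here.

```python
def activation_loss(teacher_features):
    """Negative mean L1 norm of the teacher's feature vectors (one per image)."""
    total = 0.0
    for feature in teacher_features:
        total += sum(abs(v) for v in feature)
    return -total / len(teacher_features)

# Hypothetical flattened feature vectors for two generated images.
weak = [[0.1, 0.0, 0.2], [0.0, 0.1, 0.0]]
strong = [[2.0, 0.0, 3.5], [1.5, 4.0, 0.0]]
print(activation_loss(weak), activation_loss(strong))
```

Because the sign is negative, minimizing this term rewards generated images that excite the teacher's filters strongly.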

Moreover, to ease the training procedure of a deep neural network, the number of training examples in each category is usually balanced, e.g. there are about 6,000 images in each class of the MNIST dataset. We employ the information entropy loss to measure the class balance of generated images:

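$$\mathcal{L}_{ie} = -\mathcal{H}_{info}\Big(\frac{1}{n}\sum_{i} y_T^i\Big)$$

where $\mathcal{H}_{info}$ is the information entropy and $y_T^i$ is the teacher network's output for the $i$-th of $n$ generated images. When the loss reaches its minimum, G generates images of each category with roughly the same probability.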
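The class-balance idea can be checked numerically with a small sketch. It assumes the plain Shannon entropy of the batch-averaged teacher probabilities (any constant scaling of the entropy is omitted), and the name `information_entropy_loss` is illustrative only.

```python
import math

def information_entropy_loss(teacher_probs):
    """Negative entropy of the class distribution averaged over a generated batch."""
    n = len(teacher_probs)
    k = len(teacher_probs[0])
    mean = [sum(p[j] for p in teacher_probs) / n for j in range(k)]
    # Skip zero entries, since q * log(q) tends to 0 as q goes to 0.
    entropy = -sum(q * math.log(q) for q in mean if q > 0.0)
    return -entropy

# Hypothetical teacher probabilities for batches of three generated images.
balanced = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
collapsed = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
print(information_entropy_loss(balanced), information_entropy_loss(collapsed))
```

A batch spread evenly over the classes reaches the minimum of `-log(k)`, whereas a batch that collapses onto a single class gives a loss of zero, which is the largest value this term can take.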

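By combining the aforementioned three loss functions (one-hot, activation and information entropy), we obtain the final objective function:

$$\mathcal{L}_{Total} = \mathcal{L}_{oh} + \alpha\,\mathcal{L}_{a} + \beta\,\mathcal{L}_{ie}$$

where $\alpha$ and $\beta$ are hyperparameters that balance the three terms.

By minimizing the above function, the optimal generator can synthesize images whose distribution is similar to that of the training data previously used for training the teacher network.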
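To make the weighting concrete, here is a small worked sketch of a weighted sum of the three generator terms. The function name `total_generator_loss`, the default weights `alpha=0.1` and `beta=5.0`, and the example loss values are placeholders chosen for illustration, not settings confirmed by this article.

```python
def total_generator_loss(loss_oh, loss_a, loss_ie, alpha=0.1, beta=5.0):
    """Weighted sum of the one-hot, activation and information entropy terms."""
    return loss_oh + alpha * loss_a + beta * loss_ie

# Hypothetical per-batch values of the three terms.
loss_oh, loss_a, loss_ie = 0.5, -2.0, -1.0
print(total_generator_loss(loss_oh, loss_a, loss_ie))
# Setting a weight to zero drops that term, as in an ablation run.
print(total_generator_loss(loss_oh, loss_a, loss_ie, alpha=0.0, beta=0.0))
```

With the weights set to zero only the one-hot term remains, which is one simple way to study how much each term contributes.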


As mentioned above, the generated images have no true labels. In addition, the parameters and detailed architecture of the given network may sometimes be unavailable. Thus, we propose to utilize the teacher-student learning paradigm for learning portable CNNs with unlabeled generated data.

Knowledge Distillation (KD) is a widely used approach for transferring the output information of a heavy network to a smaller network so that the latter achieves higher performance. The student network can be optimized using the following loss function based on knowledge distillation:

$$\mathcal{L}_{KD} = \frac{1}{n}\sum_{i}\mathcal{H}_{cross}(y_S^i, y_T^i)$$

where $\mathcal{H}_{cross}$ is the cross-entropy loss, and $y_S^i$ and $y_T^i$ are the outputs of the student network and the teacher network for the $i$-th of $n$ generated images.

Therefore, by utilizing this knowledge transfer technique, a portable network can be optimized without knowing the specific architecture of the given network.
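A distillation loss of this kind can be sketched in a few lines of plain Python. This version assumes the cross-entropy of the student's predictions against the teacher's soft outputs; the names `log_softmax` and `distillation_loss` and the toy logits are illustrative, and a real implementation may use a KL-divergence variant or a temperature instead.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def distillation_loss(student_logits, teacher_logits):
    """Mean cross-entropy of student predictions against teacher soft targets."""
    loss = 0.0
    for s, t in zip(student_logits, teacher_logits):
        log_p_student = log_softmax(s)
        p_teacher = [math.exp(v) for v in log_softmax(t)]
        loss += -sum(pt * ls for pt, ls in zip(p_teacher, log_p_student))
    return loss / len(teacher_logits)

# Hypothetical logits for one generated image and three classes.
teacher = [[2.0, 0.5, -1.0]]
matching_student = [[2.0, 0.5, -1.0]]
mismatched_student = [[-1.0, 0.5, 2.0]]
print(distillation_loss(matching_student, teacher))
print(distillation_loss(mismatched_student, teacher))
```

Only the teacher's outputs are needed as targets, so no ground-truth labels and no knowledge of the teacher's internal architecture enter this computation.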


Figure 1: Data-free Learning

The detailed procedure of the proposed Data-Free Learning (DAFL) scheme for learning efficient student neural networks is summarized in Algorithm 1 and Figure 1. First, we regard the well-trained teacher network as a fixed discriminator. Using the loss function in Eq. (5), we optimize a generator to produce images that follow a distribution similar to that of the teacher network's original training images. Second, we utilize the knowledge distillation approach to directly transfer knowledge from the teacher network to the student network.
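The overall procedure might be organized as in the following sketch, which only fixes the order of operations. Every callable passed in (`sample_noise`, `generate`, `teacher`, `student`, `update_generator`, `update_student`) is a hypothetical stand-in, and updating the generator and the student within the same iteration is one plausible arrangement rather than a confirmed detail of the method.

```python
def train_data_free(sample_noise, generate, teacher, student,
                    update_generator, update_student, steps):
    """Alternate generator and student updates using only synthetic batches."""
    history = []
    for _ in range(steps):
        z = sample_noise()                 # random input for the generator
        images = generate(z)               # synthetic training batch
        teacher_out = teacher(images)      # the teacher is fixed and never updated
        g_loss = update_generator(images, teacher_out)
        student_out = student(images)
        s_loss = update_student(student_out, teacher_out)
        history.append((g_loss, s_loss))
    return history

# Stub callables that only exercise the control flow; they do no real learning.
history = train_data_free(
    sample_noise=lambda: [0.0, 1.0],
    generate=lambda z: [v * 2.0 for v in z],
    teacher=lambda x: [sum(x), 0.0],
    student=lambda x: [0.0, sum(x)],
    update_generator=lambda images, t_out: 1.0,
    update_student=lambda s_out, t_out: abs(s_out[1] - t_out[0]),
    steps=3,
)
print(history)
```

No real dataset appears anywhere in the loop: the only inputs are random noise and the fixed teacher, which is the point of the data-free setting.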


Algorithm 1: Data-free Learning


We first conduct experiments on the MNIST dataset. Two architectures are used to investigate the performance of the proposed method, i.e. a convolution-based architecture and a network consisting of fully-connected layers. The student networks have significantly fewer parameters than the teacher networks. Table 1 reports the results of different methods on the MNIST dataset. Traditional methods achieve decent results, but they cannot be applied without training data. A previous method using meta-data achieves only 92.47% accuracy. Training the student network on a similar dataset (USPS) instead achieves only 94.56% accuracy. The proposed method, which utilizes generative adversarial networks, achieves 98.20% accuracy, which is much higher than the previous data-free methods.


We then conduct experiments on the CIFAR-10 and CIFAR-100 datasets using ResNet-34 and ResNet-18 as teacher and student, respectively. For comparison, we train the student network using the CIFAR-100 dataset, which has considerable overlap with the original CIFAR-10 dataset, but this network achieves only 90.65% accuracy, which is clearly lower than that of the teacher model. In contrast, the student network trained with the proposed method achieves 92.22% accuracy using only synthetic data.


Experiments on the CelebA dataset yield similar results.


Table 4 reports the ablation study of the proposed method, which indicates that each term of the loss function is essential.


We visualize the convolutional filters of the teacher and the student in Figure 2 and find that they are similar, which demonstrates the effectiveness of the proposed method.


Figure 2: Visualization of filters

Paper URL: https://arxiv.org/pdf/1904.01186

GitHub URL: https://github.com/huawei-noah/DAFL














