The paper proposes a simple and generally applicable method for improving the performance of deep neural networks by training them in a cohort with peers and mutual distillation. With this approach we can obtain compact networks that perform better than those distilled from a strong but static teacher. One application of DML is to obtain compact, fast and effective networks. The authors also show that the approach is promising for improving the performance of large powerful networks, and that a cohort of networks trained in this way can be combined as an ensemble to further improve performance.
Paper: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8578552
Bibtex: @INPROCEEDINGS{8578552, author={Y. {Zhang} and T. {Xiang} and T. M. {Hospedales} and H. {Lu}}, booktitle={2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition}, title={Deep Mutual Learning}, year={2018}, pages={4320-4328}}
Model distillation is an effective and widely used technique to transfer knowledge from a teacher to a student network.
Different from the one-way transfer between a static pre-defined teacher and a student in model distillation, with
DML, an ensemble of students learn collaboratively and teach each other throughout the training process.
Distillation-based model compression relates to the observation [3, 1] that small networks often have the same representation capacity as large networks.
To better learn a small network, the distillation approach starts with a powerful (deep and/or wide) teacher network (or network ensemble), and then trains a smaller student network to mimic the teacher [8, 1, 16, 3].
In this paper we aim to solve the same problem of learning small but powerful deep neural networks.
Distillation vs Mutual learning
Specifically, each student is trained with two losses: a conventional supervised learning loss, and a mimicry loss that aligns each student's class posterior with the class probabilities of the other students.
Overall, mutual learning provides a simple but effective way to improve the generalisation ability of a network by training collaboratively with a cohort of other networks.
The results show that, compared with distillation by a pre-trained static large network, collaborative learning by small peers achieves better performance.
Furthermore, we observe that DML applies to a variety of network architectures and to heterogeneous cohorts consisting of mixed big and small networks, and that its efficacy grows with the number of networks in the cohort.
Finally, we note that while our focus is on obtaining a single effective network, the entire cohort can also be used as a highly effective ensemble model.
We formulate the proposed deep mutual learning (DML) approach with a cohort of two networks (see Fig. 1). Extension to more networks is straightforward (see Sec. 3.3). Given $N$ samples $X = \{x_i\}_{i=1}^{N}$ from $M$ classes, we denote the corresponding label set as $Y = \{y_i\}_{i=1}^{N}$ with $y_i \in \{1, 2, ..., M\}$. The probability of class $m$ for sample $x_i$ given by a neural network $\Theta_1$ is computed as
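i.e. the standard softmax over the logits (with $j$ indexing classes in the sum):

$$p_1^m(x_i) = \frac{\exp(z^m)}{\sum_{j=1}^{M} \exp(z^j)}$$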
where the logit $z^m$ is the output of the "softmax" layer in $\Theta_1$.
For multi-class classification, the objective function to train the network $\Theta_1$ is defined as the cross-entropy error between the predicted values and the correct labels,
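written out in the notation above, with $L_{C_1}$ denoting this supervised loss:

$$L_{C_1} = -\sum_{i=1}^{N} \sum_{m=1}^{M} I(y_i, m)\, \log\big(p_1^m(x_i)\big)$$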
with an indicator function $I$ defined as
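i.e. the one-hot encoding of the label:

$$I(y_i, m) = \begin{cases} 1, & y_i = m \\ 0, & y_i \neq m \end{cases}$$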
To improve the generalisation performance of $\Theta_1$ on the testing instances, we use another peer network $\Theta_2$ to provide training experience in the form of its posterior probability $p_2$.
To quantify the match between the two networks' predictions $p_1$ and $p_2$, we use the Kullback-Leibler (KL) divergence.
The KL distance from $p_1$ to $p_2$ is computed as
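i.e., summed over samples and classes:

$$D_{KL}(p_2 \,\|\, p_1) = \sum_{i=1}^{N} \sum_{m=1}^{M} p_2^m(x_i) \log \frac{p_2^m(x_i)}{p_1^m(x_i)}$$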
The overall loss functions $L_{\Theta_1}$ and $L_{\Theta_2}$ for networks $\Theta_1$ and $\Theta_2$ respectively are thus:
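with $L_{C_2}$ the supervised cross-entropy loss of $\Theta_2$ defined analogously to $L_{C_1}$,

$$L_{\Theta_1} = L_{C_1} + D_{KL}(p_2 \,\|\, p_1), \qquad L_{\Theta_2} = L_{C_2} + D_{KL}(p_1 \,\|\, p_2)$$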
Our KL divergence based mimicry loss is asymmetric, thus different for the two networks. One can instead use a symmetric Jensen-Shannon Divergence loss:
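e.g. the standard Jensen-Shannon divergence between the two posteriors:

$$L_{JS}(p_1, p_2) = \frac{1}{2} D_{KL}\!\left(p_1 \,\Big\|\, \frac{p_1 + p_2}{2}\right) + \frac{1}{2} D_{KL}\!\left(p_2 \,\Big\|\, \frac{p_1 + p_2}{2}\right)$$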
However, we found empirically that whether a symmetric or asymmetric KL loss is used does not make any difference.
The mutual learning strategy is applied in every mini-batch update of both models, throughout the whole training process.
At each iteration, we compute the predictions of the two models and update both networks’ parameters according to the predictions of the other.
The optimisation details are summarised in Algorithm 1.
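A minimal PyTorch-style sketch of one such iteration for a two-network cohort is given below; this is illustrative code rather than the authors' implementation, and for brevity both updates reuse a single forward pass, whereas Algorithm 1 recomputes the peers' predictions before updating the second network.

```python
import torch.nn.functional as F

def dml_step(net1, net2, opt1, opt2, x, y):
    """One deep-mutual-learning mini-batch update for a two-network cohort.

    Each network minimises its own cross-entropy loss plus a KL mimicry
    loss towards the other network's class posterior.
    """
    logits1, logits2 = net1(x), net2(x)

    # Supervised cross-entropy losses L_C1 and L_C2.
    ce1 = F.cross_entropy(logits1, y)
    ce2 = F.cross_entropy(logits2, y)

    # Mimicry losses: D_KL(p2 || p1) for net1 and D_KL(p1 || p2) for net2.
    # The peer's posterior is detached so each loss only updates its own network.
    kl1 = F.kl_div(F.log_softmax(logits1, dim=1),
                   F.softmax(logits2, dim=1).detach(),
                   reduction="batchmean")
    kl2 = F.kl_div(F.log_softmax(logits2, dim=1),
                   F.softmax(logits1, dim=1).detach(),
                   reduction="batchmean")

    loss1 = ce1 + kl1   # L_Theta1
    loss2 = ce2 + kl2   # L_Theta2

    opt1.zero_grad(); loss1.backward(); opt1.step()
    opt2.zero_grad(); loss2.backward(); opt2.step()
    return loss1.item(), loss2.item()
```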
The proposed DML approach naturally extends to more networks in the student cohort. Given $K$ networks $\Theta_1, \Theta_2, ..., \Theta_K$ $(K \geq 2)$, the objective function for optimising $\Theta_k$ $(1 \leq k \leq K)$ becomes
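with the KL mimicry terms against the other $K-1$ peers averaged:

$$L_{\Theta_k} = L_{C_k} + \frac{1}{K-1} \sum_{l=1, l \neq k}^{K} D_{KL}(p_l \,\|\, p_k)$$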
With more than two networks, an interesting alternative learning strategy for DML is to take the ensemble of all the other $K-1$ networks as a single teacher to provide a combined mimicry target, which would be very similar to the distillation approach but performed at each mini-batch model update.
Then the objective function of $\Theta_k$ can be written as
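writing $p_{avg}$ for the averaged posterior of the other $K-1$ peers,

$$L_{\Theta_k} = L_{C_k} + D_{KL}(p_{avg} \,\|\, p_k), \qquad p_{avg} = \frac{1}{K-1} \sum_{l=1, l \neq k}^{K} p_l$$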
The proposed DML extends straightforwardly to semi-supervised learning. Under the semi-supervised learning setting, we only activate the cross-entropy loss for labelled data, while computing the KL-distance-based mimicry loss for all the training data. This is because the KL distance computation does not require class labels, so unlabelled data can also be used.
Denoting the labelled and unlabelled data as $L$ and $U$, where $X = L \cup U$, the objective function for learning network $\Theta_1$ can be reformulated as
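that is, with the cross-entropy term restricted to the labelled subset and the KL mimicry term computed over all data (notation adapted from the definitions above),

$$L_{\Theta_1} = L_{C_1}\big|_{x_i \in L} + D_{KL}(p_2 \,\|\, p_1)\big|_{x_i \in L \cup U}$$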
We have proposed a simple and generally applicable approach to improving the performance of deep neural networks by training them in a cohort with peers and mutual distillation. With this approach we can obtain compact networks that perform better than those distilled from a strong but static teacher. One application of DML is to obtain compact, fast and effective networks. We also showed that this approach is promising for improving the performance of large powerful networks, and that a network cohort trained in this manner can be combined as an ensemble to further improve performance.