[Paper Reading] Co-teaching: Robust Training of Deep Neural Networks with Extremely Noisy Labels


	 title		= {Co-teaching: Robust Training of Deep Neural Networks with Extremely Noisy Labels},
	 author	    = {Bo Han and Quanming  Yao and Xingrui Yu and Gang Niu and Miao Xu and Weihua Hu and Ivor Tsang and Masashi Sugiyama},
	 booktitle	= {NeurIPS},
	 year		= {2018},
	 pages      = {8535--8545}

1. Abstract

Deep learning with noisy labels is practically challenging, as the capacity of deep models is so high that they can totally memorize these noisy labels sooner or later during training.


Nonetheless, recent studies on the memorization effects of deep neural networks show that they would first memorize training data of clean labels and then those of noisy labels.


Therefore in this paper, we propose a new deep learning paradigm called “Co-teaching” for combating with noisy labels.

Here "combating with noisy labels" means fighting against noisy labels, i.e. training that stays robust to them.

Namely, we train two deep neural networks simultaneously, and let them teach each other given every mini-batch: firstly, each network feeds forward all data and selects some data of possibly clean labels; secondly, two networks communicate with each other what data in this mini-batch should be used for training; finally, each network back propagates the data selected by its peer network and updates itself.
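The three steps in this passage (select, exchange, update) can be sketched for a single mini-batch as follows. This is a minimal pure-Python sketch, not the authors' implementation: the function names `small_loss_indices` and `co_teaching_exchange`, the `keep_ratio` parameter, and the toy loss values are all assumptions made for illustration, and the actual forward and backward passes are left out.

```python
def small_loss_indices(losses, keep_ratio):
    """Return the indices of the keep_ratio fraction of samples with the smallest loss."""
    num_keep = max(1, int(keep_ratio * len(losses)))
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(order[:num_keep])


def co_teaching_exchange(losses_f, losses_g, keep_ratio):
    """Decide which samples of one mini-batch each network is updated on.

    losses_f / losses_g are the per-sample losses of the two networks
    on the same mini-batch (hypothetical inputs).
    """
    picked_by_f = small_loss_indices(losses_f, keep_ratio)
    picked_by_g = small_loss_indices(losses_g, keep_ratio)
    # Cross update: each network trains on the samples chosen by its peer.
    return {"train_f_on": picked_by_g, "train_g_on": picked_by_f}


# Toy per-sample losses for a mini-batch of four samples.
losses_f = [0.1, 2.3, 0.4, 1.9]
losses_g = [0.2, 0.3, 2.5, 1.8]
plan = co_teaching_exchange(losses_f, losses_g, keep_ratio=0.5)
print(plan)
```

In a real training loop the returned index lists would be used to slice the mini-batch before computing the loss that each network back-propagates.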


Empirical results on noisy versions of MNIST, CIFAR-10 and CIFAR-100 demonstrate that Co-teaching is much superior to the state-of-the-art methods in the robustness of trained deep models.


2. Algorithm description

[Figure 1: image from the original post]
$$\bar{\mathcal{D}}_f = \arg\min_{\mathcal{D}' : |\mathcal{D}'| \geq R(T)|\bar{\mathcal{D}}|} \ell(f, \mathcal{D}') \tag{1}$$

Once it is clear how $\bar{\mathcal{D}}_f$ in Eq. (1) is computed, almost the whole algorithm is clear.

  • $\bar{\mathcal{D}}$ is the current mini-batch, $\mathcal{D}'$ ranges over candidate subsets of it, and $f$ is the model.
  • $\mathcal{D}' : |\mathcal{D}'| \geq R(T)|\bar{\mathcal{D}}|$, the condition under the $\arg\min$, requires the selected subset to contain at least $R(T)|\bar{\mathcal{D}}|$ instances.
  • $R(T)$ is a ratio that changes over time, i.e. with the epoch $T$.

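Hence $\bar{\mathcal{D}}_f$ is the set made up of the $R(T)$ fraction of instances in the mini-batch $\bar{\mathcal{D}}$ with the smallest loss (a set of instances, not the sum of their losses). The paper phrases this as "sample R(T)% small-loss instances". One point to note: the constraint is written with $\geq$, but because this is an $\arg\min$ and per-instance losses are non-negative, adding instances can never lower the total loss, so in practice the constraint holds with equality.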
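The remark about $\geq$ versus equality can be checked on a toy example. The sketch below brute-forces Eq. (1) over every subset that satisfies the size constraint and compares the result with simply taking the smallest-loss instances. The helper names `argmin_subset` and `smallest_k` and the loss values are hypothetical, and the check assumes strictly positive per-instance losses with no ties.

```python
from itertools import combinations


def argmin_subset(losses, min_size):
    """Brute-force Eq. (1): the subset of size >= min_size with the smallest summed loss."""
    best, best_loss = None, float("inf")
    n = len(losses)
    for size in range(min_size, n + 1):
        for subset in combinations(range(n), size):
            total = sum(losses[i] for i in subset)
            if total < best_loss:
                best, best_loss = subset, total
    return set(best)


def smallest_k(losses, k):
    """Indices of the k instances with the smallest loss."""
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return set(order[:k])


# Toy per-instance losses for a mini-batch of five instances.
losses = [0.9, 0.1, 2.0, 0.4, 1.3]
selected = argmin_subset(losses, min_size=3)
print(selected, selected == smallest_k(losses, 3))
```

With positive losses the brute-force minimizer always has exactly `min_size` elements, which is why practical implementations can sort the losses and keep the first $R(T)|\bar{\mathcal{D}}|$ instances instead of searching over subsets.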

Q1: Why can sampling small-loss instances based on dynamic $R(T)$ help us find clean instances?

The paper ties small loss to clean instances. Note that it gives no theoretical derivation for this link.

Intuitively, when labels are correct, small-loss instances are more likely to be the ones which are correctly labeled. Thus, if we train our classifier only using small-loss instances in each mini-batch data, it should be resistant to noisy labels.

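The paper rests on one core observation, the "memorization" effect of deep networks: when trained on a dataset with noisy labels, a deep network fits the cleanly labeled data first in the early epochs, while noisily labeled data is harder to fit and is only memorized by force later in training. The network can therefore use its loss values at the start of training (by selecting small-loss instances) to filter out noisy instances. As training goes on, the network would eventually overfit the noisy labels, and this is handled by gradually decreasing $R(T)$. Early in training the network is still fitting clean data, so $R(T)$ can be large and more instances are kept; as training proceeds and the network starts to fit noisy labels, $R(T)$ shrinks, so that the clean instances are kept and the noisy ones are dropped before the network memorizes them.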
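A decreasing keep ratio of this kind can be sketched as below. The formula $R(T) = 1 - \min\{\tfrac{T}{T_k}\tau, \tau\}$ is not given in this note; it is the schedule I recall from the paper's experimental setup, so treat it as an assumption. The function name `keep_ratio`, the parameter names, and the default `warmup_epochs=10` are likewise hypothetical; `noise_rate` plays the role of $\tau$ and `warmup_epochs` the role of $T_k$.

```python
def keep_ratio(epoch, noise_rate, warmup_epochs=10):
    """R(T) = 1 - min(T / T_k * tau, tau).

    Starts at 1.0 (keep everything), decreases linearly for warmup_epochs
    epochs, then stays at 1 - noise_rate.
    """
    return 1.0 - min(epoch / warmup_epochs * noise_rate, noise_rate)


# Keep ratio at a few epochs for an assumed noise rate of 20%.
schedule = [round(keep_ratio(t, noise_rate=0.2), 2) for t in (0, 5, 10, 50)]
print(schedule)
```

Under this assumed schedule the network sees the full mini-batch at epoch 0 and ends up discarding roughly the estimated fraction of noisy instances once the warm-up is over.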

Q2: Why do we need two networks and cross-update the parameters?

Relations to Co-training

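Although Co-teaching is motivated by Co-training, the only similarity is that two classifiers are trained. The two differ fundamentally. (i) Co-training needs two views (two independent sets of features), while Co-teaching needs a single view. (ii) Co-training does not exploit the memorization of deep neural networks, while Co-teaching does. (iii) Co-training is designed for semi-supervised learning (SSL), and Co-teaching for learning with noisy labels (LNL); since LNL is not a special case of SSL, Co-training cannot simply be carried over from one problem setting to the other.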

3. Summary

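The paper builds on the "memorization" effect of deep networks. The algorithm is easy to follow but lacks a theoretical proof. Co-teaching closely resembles Co-training from semi-supervised learning, but the two target different settings: Co-teaching learns from noisy labels, whereas Co-training exploits unlabeled data.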
