Paper download
GitHub
bib:
```bibtex
@INPROCEEDINGS{chen2023softmatch,
  title     = {SoftMatch: Addressing the Quantity-Quality Trade-off in Semi-supervised Learning},
  author    = {Hao Chen and Ran Tao and Yue Fan and Yidong Wang and Jindong Wang and Bernt Schiele and Xing Xie and Bhiksha Raj and Marios Savvides},
  booktitle = {ICLR},
  year      = {2023},
  pages     = {1--21}
}
```
The critical challenge of Semi-Supervised Learning (SSL) is how to effectively leverage the limited labeled data and massive unlabeled data to improve the model’s generalization performance.
In this paper, we first revisit the popular pseudo-labeling methods via a unified sample weighting formulation and demonstrate the inherent quantity-quality trade-off problem of pseudo-labeling with thresholding, which may prohibit learning.
To this end, we propose SoftMatch to overcome the trade-off by maintaining both high quantity and high quality of pseudo-labels during training, effectively exploiting the unlabeled data.
We derive a truncated Gaussian function to weight samples based on their confidence, which can be viewed as a soft version of the confidence threshold.
We further enhance the utilization of weakly-learned classes by proposing a uniform alignment approach.
(In short: a trick.)
In experiments, SoftMatch shows substantial improvements across a wide variety of benchmarks, including image, text, and imbalanced classification.
Note:
A semi-supervised algorithm based on pseudo-labeling and consistency regularization. A long line of *Match methods has step by step pushed the SSL benchmark leaderboards forward; this is the 2023 entry in that line. Quantity versus quality is a classic trade-off (you cannot have your cake and eat it too), and this paper works out a compromise for exactly that problem.

| Symbol | Meaning |
|---|---|
| $D_L = \{\mathbf{x}^l_i, y_i^l\}_{i=1}^{N_L}$ | labeled data |
| $D_U = \{\mathbf{x}_i^u\}_{i=1}^{N_U}$ | unlabeled data |
| $N_L = \lvert D_L \rvert$ | number of labeled samples |
| $N_U = \lvert D_U \rvert$ | number of unlabeled samples |
| $\mathbf{x}_i^l, \mathbf{x}_i^u \in \mathbb{R}^d$ | $d$-dimensional training samples (labeled / unlabeled) |
| $y_i^l \in \{1, 2, \dots, C\}$ | label of a labeled sample |
| $C$ | number of classes ($C$-class classification) |
| $p(\mathbf{y} \mid \mathbf{x}) \in \mathbb{R}^C$ | model prediction |
| $\mathcal{H}$ | cross-entropy loss |
| $B_L$ | batch size for labeled data |
| $B_U$ | batch size for unlabeled data |
| $\mathcal{L}_s$ | supervised loss |
| $\mathcal{L}_u$ | unsupervised loss |
| $\Omega(\mathbf{x}^u)$ | strong augmentation |
| $\omega(\mathbf{x}^u)$ | weak augmentation |
| $\mathbf{p} = p(\mathbf{y} \mid \omega(\mathbf{x}^u))$ | shorthand for the model prediction on a weakly-augmented unlabeled sample |
| $\hat{\mathbf{p}} = \arg\max(\mathbf{p})$ | pseudo-label |
| $\lambda(\mathbf{p}) \in [0, \lambda_{max}]$ | sample weighting function with range $[0, \lambda_{max}]$ |
$$\mathcal{L} = \mathcal{L}_s + \mathcal{L}_u \tag{1}$$
The loss has two components: the supervised loss $\mathcal{L}_s$ and the unsupervised loss $\mathcal{L}_u$. As the names suggest, the supervised loss is computed on labeled samples, while the unsupervised loss is computed on unlabeled data.
$$\mathcal{L}_s = \frac{1}{B_L}\sum_{i=1}^{B_L}\mathcal{H}\big(y_i^l, p(\mathbf{y} \mid \mathbf{x}_i^l)\big) \tag{2}$$

$$\mathcal{L}_u = \frac{1}{B_U}\sum_{i=1}^{B_U} \lambda(\mathbf{p}_i)\,\mathcal{H}\big(\hat{\mathbf{p}}_i, p(\mathbf{y} \mid \Omega(\mathbf{x}_i^u))\big) \tag{3}$$
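For concreteness, here is a minimal PyTorch sketch of how Eqs. (1)-(3) fit together. `model`, `weight_fn`, and the pre-augmented batches are placeholder names I am assuming for illustration, not the paper's official implementation.

```python
# A minimal sketch of Eqs. (1)-(3): supervised cross-entropy on labeled data
# plus per-sample-weighted cross-entropy on pseudo-labeled unlabeled data.
import torch
import torch.nn.functional as F

def ssl_loss(model, x_l, y_l, x_u_weak, x_u_strong, weight_fn):
    # Supervised loss L_s, Eq. (2): plain cross-entropy on the labeled batch.
    loss_s = F.cross_entropy(model(x_l), y_l)

    # Pseudo-labels come from the weak view; no gradients flow through them.
    with torch.no_grad():
        p = torch.softmax(model(x_u_weak), dim=-1)   # p = p(y | w(x^u))
        pseudo = p.argmax(dim=-1)                    # \hat{p} = argmax(p)
        lam = weight_fn(p)                           # lambda(p), shape (B_U,)

    # Unsupervised loss L_u, Eq. (3): weighted CE on the strong view.
    ce = F.cross_entropy(model(x_u_strong), pseudo, reduction="none")
    loss_u = (lam * ce).mean()

    return loss_s + loss_u                           # L = L_s + L_u, Eq. (1)
```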
Pseudo-label quantity:
$$f(\mathbf{p}) = \mathbb{E}_{D_U}[\lambda(\mathbf{p})] \in [0, \lambda_{max}] \tag{4}$$
Pseudo-label quality:
$$g(\mathbf{p}) = \sum_{i=1}^{N_U}\mathbb{I}(\hat{\mathbf{p}}_i = y_i^u)\,\frac{\lambda(\mathbf{p}_i)}{\sum_{j=1}^{N_U}\lambda(\mathbf{p}_j)} = \mathbb{E}_{\overline{\lambda}(\mathbf{p})}\big[\mathbb{I}(\hat{\mathbf{p}} = \mathbf{y}^u)\big] \in [0, 1] \tag{5}$$
Note:
$\overline{\lambda}(\mathbf{p}) = \frac{\lambda(\mathbf{p})}{\sum\lambda(\mathbf{p})}$ is a probability mass function, close to $\mathbf{y}^u$ (this presumably means close to the true probability mass function; generally, for balance between classes, each class should receive equal probability mass).
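To make the two metrics concrete, here is a small sketch of $f(\mathbf{p})$ and $g(\mathbf{p})$ computed from the per-sample weights. Note that $g$ requires the true labels $y^u$, so it is an analysis quantity, not something available during training; the function names and shapes are my assumptions.

```python
# A sketch of the quantity/quality metrics f(p) and g(p) from Eqs. (4)-(5).
import torch

def quantity(lam):                     # lam: lambda(p_i) for all unlabeled samples
    return lam.mean()                  # f(p) = E_{D_U}[lambda(p)] in [0, lambda_max]

def quality(pseudo, y_u, lam):         # pseudo: argmax predictions, y_u: true labels
    lam_bar = lam / lam.sum()          # normalized weights, a pmf over samples
    correct = (pseudo == y_u).float()  # indicator I(\hat{p}_i = y_i^u)
    return (lam_bar * correct).sum()   # g(p) = E_{lam_bar}[I(...)] in [0, 1]
```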
Quantity-Quality Trade-off:
$$\lambda(\mathbf{p}) = \begin{cases} \lambda_{max}\exp\left(-\frac{(\max(\mathbf{p}) - \mu_t)^2}{2\sigma_t^2}\right), & \text{if } \max(\mathbf{p}) < \mu_t; \\ \lambda_{max}, & \text{otherwise.} \end{cases}$$
Note:
The "soft" in SoftMatch lies in the fact that $\lambda(\mathbf{p})$ is continuous: the more confident a pseudo-label, the larger its weight, and the less confident, the smaller. Naive pseudo-labeling is "hard": a sample is either used or not used, so the weight takes only two discrete values.
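A minimal sketch of this weight function, assuming $\lambda_{max} = 1$ (a hyperparameter choice on my part) and scalar running estimates $\mu_t, \sigma_t^2$, which are defined in the next subsection:

```python
# Truncated Gaussian weighting: full weight at or above the mean confidence,
# Gaussian decay below it.
import torch

def soft_weight(p, mu_t, sigma2_t, lambda_max=1.0):  # lambda_max=1.0 is an assumption
    conf = p.max(dim=-1).values                      # max(p) per sample
    gauss = lambda_max * torch.exp(-(conf - mu_t) ** 2 / (2 * sigma2_t))
    # Truncation: samples with confidence >= mu_t get the full weight lambda_max.
    return torch.where(conf < mu_t, gauss, torch.full_like(conf, lambda_max))
```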
Details:
$$\hat{\mu}_b = \hat{\mathbb{E}}_{B_U}[\max(\mathbf{p})] = \frac{1}{B_U}\sum_{i=1}^{B_U}\max(\mathbf{p}_i)$$

$$\hat{\sigma}^2_b = \widehat{\mathrm{Var}}_{B_U}[\max(\mathbf{p})] = \frac{1}{B_U}\sum_{i=1}^{B_U}\big(\max(\mathbf{p}_i) - \hat{\mu}_b\big)^2$$
EMA Update
$$\hat{\mu}_t = m\,\hat{\mu}_{t-1} + (1-m)\,\hat{\mu}_b$$

$$\hat{\sigma}^2_t = m\,\hat{\sigma}^2_{t-1} + (1-m)\,\frac{B_U}{B_U-1}\,\hat{\sigma}^2_b$$
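A sketch of this two-step estimation, assuming scalar tensors for the running statistics; the momentum value $m = 0.999$ here is an assumed default, not taken from the paper.

```python
# Per-batch mean/variance of max(p), folded into EMA statistics with momentum m.
# The factor B_U/(B_U-1) is Bessel's correction for an unbiased variance estimate.
import torch

def update_stats(p, mu_t, sigma2_t, m=0.999):    # m=0.999 is an assumption
    conf = p.max(dim=-1).values                  # max(p) over the unlabeled batch
    b = conf.numel()                             # B_U
    mu_b = conf.mean()                           # \hat{mu}_b
    sigma2_b = ((conf - mu_b) ** 2).mean()       # \hat{sigma}^2_b (biased)
    mu_t = m * mu_t + (1 - m) * mu_b
    sigma2_t = m * sigma2_t + (1 - m) * sigma2_b * b / (b - 1)
    return mu_t, sigma2_t
```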
Uniform Alignment
Alleviates the imbalance problem, i.e., the under-utilization of weakly-learned classes mentioned in the abstract. A sketch of my reading of the idea follows.
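The sketch below is an interpretation in the spirit of ReMixMatch's distribution alignment, not the official implementation: weak-view predictions are rescaled toward a uniform class distribution using a running average `p_avg` of the model's predictions, then renormalized, so weakly-learned classes receive larger pseudo-label probabilities.

```python
# Hypothetical uniform-alignment step: boost classes the model under-predicts.
import torch

def uniform_align(p, p_avg, num_classes):
    target = torch.full_like(p_avg, 1.0 / num_classes)  # uniform class distribution
    aligned = p * (target / (p_avg + 1e-6))             # upweight under-predicted classes
    return aligned / aligned.sum(dim=-1, keepdim=True)  # renormalize to a pmf
```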
The "soft" part amounts to an element-wise weight function, which is not a new idea in itself (cf. Self-Paced Learning). The main contribution is arriving at this soft weight function by analyzing pseudo-labeling through the lens of the quantity-quality trade-off.
The paper does not set a per-class threshold, implicitly assuming every class is similarly hard to predict; this could certainly be extended further.