Wang Y, Zou D, Yi J, et al. Improving Adversarial Robustness Requires Revisiting Misclassified Examples[C]. International Conference on Learning Representations, 2020.
@article{wang2020improving,
title={Improving Adversarial Robustness Requires Revisiting Misclassified Examples},
author={Wang, Yisen and Zou, Difan and Yi, Jinfeng and Bailey, James and Ma, Xingjun and Gu, Quanquan},
year={2020}}
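The authors argue that misclassified examples play an important role in improving a network's robustness, and they propose a new loss function motivated by this observation.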
$h_{\theta}$: a neural network with parameters $\theta$;
$(x,y) \in \mathbb{R}^d \times \{1,\ldots, K\}$: a sample and its label;
$$
h_{\theta}(x_i)=\underset{k=1, \ldots, K}{\arg\max}\; p_k(x_i, \theta), \quad p_k(x_i, \theta)=\exp\left(z_k(x_i, \theta)\right) \Big/ \sum_{k'=1}^{K} \exp\left(z_{k'}(x_i, \theta)\right) \tag{2}
$$
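Eq. (2) can be illustrated with a minimal sketch in plain Python. The function names `softmax` and `predict` and the example logits are assumptions made for illustration; classes are indexed from 0 here, whereas the text uses $1,\ldots,K$.

```python
import math

def softmax(logits):
    """Turn logits z_k into probabilities p_k, as in Eq. (2)."""
    m = max(logits)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits):
    """Return h_theta(x): the 0-based index of the largest probability."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)

logits = [2.0, 1.0, 0.1]  # hypothetical logits for K = 3 classes
probs = softmax(logits)
label = predict(logits)
print(probs, label)
```

Because softmax is monotone in the logits, the predicted class is simply the index of the largest logit.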
Define the index sets of correctly classified and misclassified examples:
$$
\mathcal{S}_{h_{\theta}}^+ = \{i : i \in [n],\ h_{\theta}(x_i)=y_i \} \quad \mathrm{and} \quad \mathcal{S}_{h_{\theta}}^- = \{i : i \in [n],\ h_{\theta}(x_i) \neq y_i \}.
$$
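As a small illustration of these two index sets, the sketch below splits a batch by comparing predictions with labels. The helper name `split_indices` and the toy predictions are hypothetical.

```python
def split_indices(predictions, labels):
    """Return (S_plus, S_minus): indices of correct and incorrect predictions."""
    s_plus = [i for i, (p, y) in enumerate(zip(predictions, labels)) if p == y]
    s_minus = [i for i, (p, y) in enumerate(zip(predictions, labels)) if p != y]
    return s_plus, s_minus

predictions = [0, 2, 1, 1]  # hypothetical values of h_theta(x_i)
labels = [0, 1, 1, 2]       # hypothetical labels y_i
s_plus, s_minus = split_indices(predictions, labels)
print(s_plus, s_minus)
```

The two sets are disjoint and together cover every index in the batch.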
The robust classification error over all examples is
$$
\mathcal{R}(h_{\theta}) = \frac{1}{n} \sum_{i=1}^n \max_{x_i' \in \mathcal{B}_{\epsilon}(x_i)} \mathbb{1}(h_{\theta}(x_i') \neq y_i), \tag{3}
$$
and the robust classification error on a misclassified example is defined as
$$
\mathcal{R}^- (h_{\theta}, x_i):= \mathbb{1} (h_{\theta}(\hat{x}_i') \neq y_i) + \mathbb{1}(h_{\theta}(x_i) \neq h_{\theta} (\hat{x}_i')) \tag{4}
$$
where
$$
\hat{x}_i'=\arg \max_{x_i' \in \mathcal{B}_{\epsilon} (x_i)} \mathbb{1} (h_{\theta} (x_i') \neq y_i). \tag{5}
$$
The robust classification error on a correctly classified example is
$$
\mathcal{R}^+(h_{\theta}, x_i):=\mathbb{1}(h_{\theta}(\hat{x}_i') \neq y_i). \tag{6}
$$
Finally, the objective to minimize is the combination of the two errors:
$$
\begin{aligned}
\min_{\theta} \mathcal{R}_{\mathrm{misc}}(h_{\theta}) :&= \frac{1}{n}\left(\sum_{i \in \mathcal{S}_{h_{\theta}}^{+}} \mathcal{R}^{+}(h_{\theta}, x_i)+\sum_{i \in \mathcal{S}_{h_{\theta}}^{-}} \mathcal{R}^{-}(h_{\theta}, x_i)\right) \\
&= \frac{1}{n} \sum_{i=1}^{n}\left\{\mathbb{1}\left(h_{\theta}(\hat{x}_i') \neq y_i\right)+\mathbb{1}\left(h_{\theta}(x_i) \neq h_{\theta}(\hat{x}_i')\right) \cdot \mathbb{1}\left(h_{\theta}(x_i) \neq y_i\right)\right\}.
\end{aligned} \tag{7}
$$
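The two forms of Eq. (7) agree because $\mathbb{1}(h_{\theta}(x_i) \neq y_i)$ is 0 on correctly classified examples and 1 on misclassified ones. The sketch below checks this on toy data; the function names and the prediction lists are hypothetical, and the adversarial predictions are simply given rather than produced by an actual attack.

```python
def misc_risk(nat_preds, adv_preds, labels):
    """Second line of Eq. (7): one sum over all examples."""
    total = 0
    for h_nat, h_adv, y in zip(nat_preds, adv_preds, labels):
        adv_wrong = int(h_adv != y)      # 1(h(x_hat') != y)
        unstable = int(h_nat != h_adv)   # 1(h(x) != h(x_hat'))
        nat_wrong = int(h_nat != y)      # 1(h(x) != y)
        total += adv_wrong + unstable * nat_wrong
    return total / len(labels)

def misc_risk_by_sets(nat_preds, adv_preds, labels):
    """First line of Eq. (7): R+ on S+ and R- on S-."""
    total = 0
    for h_nat, h_adv, y in zip(nat_preds, adv_preds, labels):
        if h_nat == y:   # i in S+, use R+ from Eq. (6)
            total += int(h_adv != y)
        else:            # i in S-, use R- from Eq. (4)
            total += int(h_adv != y) + int(h_nat != h_adv)
    return total / len(labels)

nat_preds = [0, 1, 2, 0, 2]  # hypothetical h(x_i)
adv_preds = [0, 2, 2, 2, 2]  # hypothetical h(x_hat_i')
labels = [0, 1, 1, 1, 2]
risk = misc_risk(nat_preds, adv_preds, labels)
risk_by_sets = misc_risk_by_sets(nat_preds, adv_preds, labels)
print(risk, risk_by_sets)
```

In this toy batch the fourth example contributes 2: it is misclassified and its prediction also changes under perturbation, which is exactly the case the extra term penalizes.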
To allow gradients to propagate, the indicator functions in the loss above have to be "softened" with surrogate functions. The term $\mathbb{1}(h_{\theta}(\hat{x}_i') \neq y_i)$ is replaced by the boosted cross-entropy (BCE) loss
$$
\mathrm{BCE} (p(\hat{x}_i', \theta),y_i)= -\log (p_{y_i} (\hat{x}_i',\theta))- \log \left(1-\max_{k \neq y_i} p_k(\hat{x}_i',\theta)\right), \tag{8}
$$
where the first term is the ordinary cross-entropy loss and the second term enlarges the margin of the decision boundary.
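A minimal sketch of Eq. (8), assuming the adversarial probabilities are already available and lie strictly between 0 and 1. The function name `bce` and the example probability vectors are hypothetical.

```python
import math

def bce(adv_probs, y):
    """Boosted cross-entropy of Eq. (8) for one example.

    adv_probs: probabilities p_k(x_hat', theta), assumed strictly in (0, 1).
    y: 0-based index of the true class.
    """
    p_true = adv_probs[y]
    # Largest probability assigned to any wrong class.
    p_other = max(p for k, p in enumerate(adv_probs) if k != y)
    return -math.log(p_true) - math.log(1.0 - p_other)

loose = bce([0.7, 0.2, 0.1], 0)      # moderately confident prediction
tight = bce([0.9, 0.05, 0.05], 0)    # more confident, larger margin
print(loose, tight)
```

The second logarithm grows as the strongest wrong class gains probability, so the loss keeps decreasing as the gap between the true class and its closest competitor widens. A practical implementation would usually also guard against taking the logarithm of zero.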
The second indicator, $\mathbb{1}(h_{\theta}(x_i) \neq h_{\theta}(\hat{x}_i'))$, is replaced by the KL divergence
$$
\mathrm{KL} (p(x_i, \theta)\,\|\, p(\hat{x}_i', \theta))=\sum_{k=1}^K p_k(x_i, \theta)\log \frac{p_k(x_i,\theta)}{p_k(\hat{x}_i',\theta)}. \tag{9}
$$
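Eq. (9) can be sketched in a few lines. The name `kl_divergence` and the two example distributions are assumptions for illustration; the second argument is assumed to have no zero entries.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) of Eq. (9); terms with p_k = 0 contribute nothing."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

p_nat = [0.7, 0.2, 0.1]  # hypothetical p(x_i, theta)
p_adv = [0.4, 0.4, 0.2]  # hypothetical p(x_hat_i', theta)
same = kl_divergence(p_nat, p_nat)
shifted = kl_divergence(p_nat, p_adv)
print(same, shifted)
```

The divergence is zero when the natural and adversarial outputs coincide and positive otherwise, so it acts as a smooth measure of how much the prediction moves under perturbation. It is not symmetric in its two arguments.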
The last indicator, $\mathbb{1}(h_{\theta}(x_i) \neq y_i)$, can be replaced by $1-p_{y_i}(x_i,\theta)$.
The final loss function is therefore
$$
\mathcal{L}^{\mathrm{MART}}(\theta)= \frac{1}{n} \sum_{i=1}^n \ell(x_i, y_i, \theta), \tag{11}
$$
where
$$
\ell (x_i,y_i,\theta):=\mathrm{BCE}(p(\hat{x}_i', \theta),y_i)+\lambda \cdot \mathrm{KL} (p(x_i,\theta) \,\|\, p(\hat{x}_i',\theta)) \cdot (1-p_{y_i}(x_i, \theta)).
$$
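Putting the pieces together, the following is a minimal sketch of the per-example loss $\ell$ and the batch loss of Eq. (11) in plain Python. It is not the authors' implementation: the function names, the example logits, and the value of `lam` are hypothetical, and the adversarial logits are taken as given instead of being produced by an attack.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mart_sample_loss(nat_logits, adv_logits, y, lam):
    """Per-example loss: BCE + lam * KL * (1 - p_y(x))."""
    p_nat = softmax(nat_logits)  # p(x_i, theta)
    p_adv = softmax(adv_logits)  # p(x_hat_i', theta)
    # Boosted cross-entropy on the adversarial example, Eq. (8).
    p_other = max(p for k, p in enumerate(p_adv) if k != y)
    bce = -math.log(p_adv[y]) - math.log(1.0 - p_other)
    # KL(p_nat || p_adv), Eq. (9).
    kl = sum(pn * math.log(pn / pa) for pn, pa in zip(p_nat, p_adv))
    # Soft misclassification weight replacing 1(h(x) != y).
    weight = 1.0 - p_nat[y]
    return bce + lam * kl * weight

def mart_loss(batch, lam):
    """Batch loss of Eq. (11): mean of the per-example losses."""
    return sum(mart_sample_loss(n, a, y, lam) for n, a, y in batch) / len(batch)

lam = 5.0  # hypothetical trade-off value, not taken from the text
batch = [
    ([4.0, 0.0, 0.0], [3.0, 0.5, 0.0], 0),  # confidently correct example
    ([0.0, 2.0, 0.0], [0.0, 3.0, 0.0], 0),  # misclassified example
]
loss = mart_loss(batch, lam)
print(loss)
```

The weight $1-p_{y_i}(x_i,\theta)$ is close to 0 for the confidently correct example and close to 1 for the misclassified one, so the KL term mainly acts on examples the network already gets wrong.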