In this paper we introduce the Boundary Attack, a decision-based attack that starts from a large adversarial perturbation and then seeks to reduce the perturbation while staying adversarial. The attack is conceptually simple, requires close to no hyperparameter tuning, does not rely on substitute models and is competitive with the best gradient-based attacks in standard computer vision tasks like ImageNet. We apply the attack on two black-box algorithms from Clarifai.com. The Boundary Attack in particular and the class of decision-based attacks in general open new avenues to study the robustness of machine learning models and raise new questions regarding the safety of deployed machine learning systems. An implementation of the attack is available as part of Foolbox (https://github.com/bethgelab/foolbox).
Adversarial perturbations have attracted a lot of attention for two reasons. On the one hand, they are worrisome for the integrity and security of deployed machine learning algorithms such as autonomous cars or face recognition systems. Small perturbations of street signs (for example, making a stop sign be recognised as a 200 speed-limit sign) could have serious consequences. On the other hand, adversarial perturbations provide an exciting spotlight on the gap between the sensory information processing in humans and machines and thus provide guidance towards more robust, human-like architectures.
This paper focuses on a category of black-box attacks that has so far received very little attention:
The delineation of this category is justified for the following reasons. First, compared to score-based attacks, decision-based attacks are much more relevant in real-world machine learning applications, where confidence scores or logits are rarely accessible. At the same time, decision-based attacks have the potential to be much more robust to standard defences like gradient masking, intrinsic stochasticity or robust training than attacks from the other categories. Finally, compared to transfer-based attacks, they need much less information about the model (neither architecture nor training data) and are much simpler to apply.
There currently exists no effective decision-based attack that scales to natural datasets such as ImageNet and is applicable to deep neural networks (DNNs).
Throughout the paper we focus on the threat scenario in which the adversary aims to change the decision of a model (either targeted or untargeted) for a particular input sample by inducing a minimal
perturbation to the sample. The adversary can observe the final decision of the model for arbitrary
inputs and it knows at least one perturbation, however large, for which the perturbed sample is
adversarial.
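The contributions of this paper are as follows:
Notation used in the paper:
Vectors are denoted in boldface.
The Boundary Attack algorithm is illustrated in Figure 2: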
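The algorithm starts from a point that is already adversarial and then performs a random walk along the boundary between the adversarial and the non-adversarial region, such that (1) it stays in the adversarial region and (2) the distance to the original image keeps decreasing.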
In other words, we perform rejection sampling with a suitable proposal distribution $\mathcal{P}$ to find progressively smaller adversarial perturbations according to a given adversarial criterion $c(\cdot)$. The basic logic of the algorithm is described in Algorithm 1:
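The rejection-sampling idea described above can be illustrated with a minimal sketch. This is not the paper's Algorithm 1: the isotropic Gaussian proposal, the `sigma` parameter, the fixed step count and the toy decision rule are all assumptions chosen only to keep the example self-contained. A candidate is accepted only if it is closer to the original input and still adversarial.

```python
import math
import random


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rejection_sampling_walk(original, start, is_adversarial,
                            steps=1000, sigma=0.05, rng=None):
    """Random walk that keeps only adversarial candidates closer to `original`.

    `sigma` (scale of the Gaussian proposal) is an illustrative assumption,
    not the proposal distribution used in the paper.
    """
    rng = rng or random.Random(0)
    current = list(start)
    if not is_adversarial(current):
        raise ValueError("the starting point must already be adversarial")
    for _ in range(steps):
        candidate = [x + rng.gauss(0.0, sigma) for x in current]
        closer = l2_distance(candidate, original) < l2_distance(current, original)
        # Check the distance first so that a model query is spent only on
        # candidates that would actually reduce the perturbation.
        if closer and is_adversarial(candidate):
            current = candidate
    return current


def toy_is_adversarial(x):
    """Hypothetical decision-only model: adversarial when x[0] > 1."""
    return x[0] > 1.0


original = [0.0, 0.0]
start = [5.0, 5.0]
result = rejection_sampling_walk(original, start, toy_is_adversarial)
```

Because every accepted candidate must pass both checks, the returned point is always adversarial and never farther from `original` than the starting point.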
The Boundary Attack needs to start from a sample that is already adversarial. In an untargeted scenario we simply sample from a maximum entropy distribution given the valid domain of the input. In the computer vision applications below, where the input is constrained to a range of $[0,255]$ per pixel, we sample each pixel in the initial image $\tilde{o}^0$ from a uniform distribution $\mathcal{U}(0,255)$. We reject samples that are not adversarial. In a targeted scenario we start from any sample that is classified by the model as being from the target class.
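A minimal sketch of the untargeted initialization, assuming the image is a flat list of pixel values and the model is reachable only through a yes/no `is_adversarial` callback. The function name, the `max_tries` limit and the toy brightness classifier are hypothetical and serve only as illustration.

```python
import random


def sample_initial_adversarial(num_pixels, is_adversarial,
                               max_tries=1000, rng=None):
    """Draw each pixel from U(0, 255) and reject non-adversarial samples."""
    rng = rng or random.Random(0)
    for _ in range(max_tries):
        candidate = [rng.uniform(0.0, 255.0) for _ in range(num_pixels)]
        if is_adversarial(candidate):
            return candidate
    raise RuntimeError("no adversarial starting point found")


def toy_is_adversarial(image):
    """Hypothetical model: the original image is 'bright' (mean >= 200),
    so any image with a lower mean counts as misclassified."""
    return sum(image) / len(image) < 200.0


initial = sample_initial_adversarial(16, toy_is_adversarial)
```

For a targeted attack, the source instead starts from any sample that the model already assigns to the target class, so no random sampling is needed.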
The efficiency of the algorithm depends heavily on the proposal distribution $\mathcal{P}$, i.e. which random directions are explored in each step of the algorithm. The optimal proposal distribution will generally depend on the domain and/or model to be attacked, but for all vision-related problems tested here a very simple proposal distribution worked surprisingly well. The basic idea behind this proposal distribution is as follows: in the $k$-th step we want to draw perturbations $\eta^k$ from a maximum entropy distribution subject to the following constraints:
In practice, sampling from this distribution is difficult, so we resort to a simpler heuristic:
A typical criterion for deciding whether an input is adversarial is misclassification, i.e. whether the model assigns the perturbed input to a class different from that of the original input. Another common choice is targeted misclassification, for which the perturbed input has to be classified in a given target class. Other choices include top-k misclassification (the top-k classes predicted for the perturbed input do not contain the original class label) or thresholds on certain confidence scores. Outside of computer vision many other choices exist, such as criteria on word error rates. In comparison to most other attacks, the Boundary Attack is extremely flexible with regard to the adversarial criterion. It basically allows any criterion (including non-differentiable ones) as long as an initial adversarial can be found for that criterion (which is trivial in most cases).
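The criteria listed above can be written as plain predicates over the model's output. The sketch below assumes the model returns a list of class labels ranked from most to least likely; the function names are hypothetical and do not correspond to any library API.

```python
def misclassification(original_label):
    """Adversarial if the top-1 prediction differs from the original label."""
    def criterion(ranked_labels):
        return ranked_labels[0] != original_label
    return criterion


def targeted_misclassification(target_label):
    """Adversarial if the top-1 prediction equals the target label."""
    def criterion(ranked_labels):
        return ranked_labels[0] == target_label
    return criterion


def top_k_misclassification(original_label, k):
    """Adversarial if the original label is absent from the top-k predictions."""
    def criterion(ranked_labels):
        return original_label not in ranked_labels[:k]
    return criterion


ranked = ["cat", "dog", "fox"]  # hypothetical model output, best guess first
untargeted = misclassification("dog")(ranked)
targeted = targeted_misclassification("cat")(ranked)
top1 = top_k_misclassification("dog", 1)(ranked)
top2 = top_k_misclassification("dog", 2)(ranked)
```

None of these predicates needs gradients or confidence scores, which is why any of them can be plugged into a decision-based attack.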
The Boundary Attack has only two relevant hyperparameters: the length of the total perturbation $\delta$ and the length of the step $\epsilon$ towards the original input (see Figure 2). We adjust both parameters dynamically according to the local geometry of the boundary. The adjustment is inspired by Trust Region methods. In essence, we first test whether the orthogonal perturbation is still adversarial. If so, we make a small movement towards the original input and test again. The orthogonal step tests whether the step size is small enough that we can treat the decision boundary between the adversarial and the non-adversarial region as approximately linear. If this is the case, we expect around 50% of the orthogonal perturbations to still be adversarial. If this ratio is much lower, we reduce the step size $\delta$; if it is close to 50% or higher, we increase it. If the orthogonal perturbation is still adversarial, we add a small step towards the original input. The maximum size of this step depends on the angle of the decision boundary in the local neighbourhood (see also Figure 2). If the success rate is too small we decrease $\epsilon$; if it is too large we increase it. Typically, the closer we get to the original image, the flatter the decision boundary becomes and the smaller $\epsilon$ has to be to still make progress. The attack has converged once $\epsilon$ converges to zero.
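The adjustment rule can be sketched as follows. The text only says that the success rate of the orthogonal step should be around 50% and that $\epsilon$ is decreased or increased when its success rate is too small or too large; the concrete thresholds (`low=0.2`, `high=0.5`) and the scaling `factor=1.5` used here are assumptions for illustration, not values taken from the paper.

```python
def adapt(value, success_rate, low, high, factor=1.5):
    """Shrink `value` if the success rate is below `low`, grow it if the
    rate is at or above `high`, and leave it unchanged otherwise."""
    if success_rate < low:
        return value / factor
    if success_rate >= high:
        return value * factor
    return value


def update_step_sizes(delta, epsilon, orthogonal_outcomes, forward_outcomes):
    """Update both step sizes from recent outcomes.

    Each outcomes list holds booleans: True if the corresponding step
    (orthogonal, or towards the original input) stayed adversarial.
    Thresholds are illustrative assumptions.
    """
    if orthogonal_outcomes:
        rate = sum(orthogonal_outcomes) / len(orthogonal_outcomes)
        delta = adapt(delta, rate, low=0.2, high=0.5)
    if forward_outcomes:
        rate = sum(forward_outcomes) / len(forward_outcomes)
        epsilon = adapt(epsilon, rate, low=0.2, high=0.5)
    return delta, epsilon


# Only 1 of 10 orthogonal steps stayed adversarial, so delta shrinks;
# 8 of 10 forward steps succeeded, so epsilon grows.
new_delta, new_epsilon = update_step_sizes(
    0.1, 0.01,
    orthogonal_outcomes=[True] + [False] * 9,
    forward_outcomes=[True] * 8 + [False] * 2,
)
```

Using the same helper for both parameters keeps the two rules symmetric; in practice the thresholds for $\delta$ and $\epsilon$ could be tuned separately.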
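We evaluate qFool on the VGG-19, ResNet-50 and Inception-v3 networks using 250 images selected from the ImageNet validation set. We measure the median of the average norm of the adversarial perturbation after a given number of queries, defined as follows: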
$$\mathcal{M}_{\mathcal{X}}(n)=\text{median}_{x_i\in\mathcal{X}}\left(\frac{1}{m}\Vert v(x_i,n)\Vert_2^2\right)$$
Here $v(x_i,n)\in\mathbb{R}^m$ is the adversarial perturbation generated for sample $x_i$ using $n$ queries to the model. The median is taken over the images in dataset $\mathcal{X}$.
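As a small worked example of this metric, the sketch below computes the squared $\ell_2$ norm of each perturbation divided by its length, then takes the median over images. The function name and the three toy perturbations are made up for illustration.

```python
import statistics


def median_mean_squared_norm(perturbations):
    """Median over images of (1/m) * ||v||_2^2, where m = len(v)."""
    per_image = [sum(c ** 2 for c in v) / len(v) for v in perturbations]
    return statistics.median(per_image)


# Three hypothetical perturbations with m = 2 components each.
toy_perturbations = [[3.0, 4.0], [0.0, 0.0], [1.0, 1.0]]
score = median_mean_squared_norm(toy_perturbations)
```

The per-image values here are 12.5, 0.0 and 1.0, so the median is 1.0. Using the median rather than the mean keeps a few hard images with very large perturbations from dominating the score.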
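For non-targeted attacks, Figure 3 shows adversarial perturbations for samples on the different models:
We compare qFool with the Boundary Attack on ImageNet. The results are shown in Figure 4: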
When the distances between adversarial examples generated by the two methods and original images are similar, qFool always spends fewer queries. qFool can converge much faster: if the number of queries is limited to a small value (e.g., 10000), our method can achieve much better performance.
Moreover, compared with qFool in the full space, the subspace version reduces the number of queries even further. In particular, in this paper we use a 2-dimensional Discrete Cosine Transform (DCT) basis to define the low-dimensional subspace. (Perturbing in a low-dimensional space: I had thought of this idea as well.) We use $\mathcal{S}=\{\psi_{i,j}\}_{i,j=0,\dots,\sqrt{m}-1}$ to denote the basis vectors of the subspace of dimension $m$. When estimating the gradient in the subspace, we use $n$ noise vectors $\eta_i^*=\mathcal{S}\gamma_i$ with $\gamma_i\sim\mathcal{N}(0,I_m)$ instead of $\eta_i\sim\mathcal{N}(0,I_d)$.
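The construction $\eta^*=\mathcal{S}\gamma$ can be sketched as follows. The sketch assumes a square single-channel image, an orthonormal DCT-II basis and a subspace made of the $\sqrt{m}\times\sqrt{m}$ lowest-frequency components; these choices, and all function names, are illustrative assumptions rather than details confirmed by the text.

```python
import math
import random


def dct_vector(k, size):
    """k-th orthonormal 1-D DCT-II basis vector of length `size`."""
    scale = math.sqrt((1.0 if k == 0 else 2.0) / size)
    return [scale * math.cos((2 * x + 1) * k * math.pi / (2 * size))
            for x in range(size)]


def sample_subspace_noise(size, sub_size, rng):
    """Draw eta* = S @ gamma with gamma ~ N(0, I_m), m = sub_size ** 2.

    Each 2-D basis image psi_{i,j} is the outer product of two 1-D DCT
    vectors; only frequencies i, j < sub_size are used (an assumption).
    """
    basis_1d = [dct_vector(k, size) for k in range(sub_size)]
    noise = [[0.0] * size for _ in range(size)]
    for i in range(sub_size):
        for j in range(sub_size):
            gamma = rng.gauss(0.0, 1.0)
            for x in range(size):
                for y in range(size):
                    noise[x][y] += gamma * basis_1d[i][x] * basis_1d[j][y]
    return noise


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# An 8x8 noise image that lives in a 2x2 = 4-dimensional DCT subspace.
noise = sample_subspace_noise(size=8, sub_size=2, rng=random.Random(0))
```

Only $m$ Gaussian coefficients are drawn per noise vector, yet the result is a full-size image, so the search for a gradient estimate is restricted to smooth, low-frequency directions.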
In our experiments, the dimension $d$ of the full space is $224\times224$ or $299\times299$, and we use 250 images from the ImageNet training set to find the best subspace. According to Figure 5, the best subspace dimension $m$ lies between $70\times70$ and $90\times90$.