Adversarial Attack

  • Notes on Hung-yi Lee's 2021 ML course.

Contents

  • Motivation
  • How to Attack
    • Example of Attack
    • How to Attack (White Box Attack)
    • Black Box Attack
    • One pixel attack
    • Universal Adversarial Attack
    • Attack in the Physical World
    • “Backdoor” in Model
  • The attack is so easy! Why?
  • How to Defend
    • Passive Defense
    • Proactive Defense (Adversarial Training)

Motivation

  • Are networks robust to the inputs that are built to fool them?
    • Useful for spam classification, malware detection, network intrusion detection, etc.

How to Attack

Example of Attack

  • Take image classification as an example. The benign image is the unmodified original image, which is recognized as tiger cat. The goal of the attack is to add a small noise to the benign image so that the classifier no longer outputs "cat".
    [Figure 1]
  • Attacks can be divided into two types, Non-targeted and Targeted:
    • Non-targeted: make the classifier output any class other than "cat"
    • Targeted: make the classifier output a specified non-"cat" class (e.g. star fish)
      [Figure 2]

In the figure above, the added noise is not even perceivable by the human eye. The classifier gives Tiger Cat a confidence of only 0.64 on the benign image, yet gives Star Fish a confidence of 1.00 on the attacked image.

How to Attack (White Box Attack)

  • Non-targeted: let $x^0$ be the benign image and $x$ the attacked image. With the network parameters fixed, we want the classifier's output distribution to be as far as possible from the distribution of the class "cat".
    [Figure 3] This leads to the following optimization objective (the constraint $d(x^0, x) \leq \varepsilon$ guarantees that the added noise cannot be perceived by the human eye; $e$ can be the cross entropy):
    [Figure: non-targeted objective]
  • Targeted: compared with the non-targeted case, an extra term is added to the objective so that the output distribution of the attacked image is as close as possible to that of the target class (a code sketch of both losses follows below).
    [Figure 4: targeted objective]
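
The two objectives above translate directly into code. Below is a minimal PyTorch sketch, not from the lecture; `model`, `x`, `y_true`, and `y_target` are assumed names. The non-targeted loss pushes the prediction away from the true class, and the targeted loss additionally pulls it toward the chosen target class.

```python
import torch
import torch.nn.functional as F

def non_targeted_loss(model, x, y_true):
    # L(x) = -e(y, y_hat): minimizing L maximizes the cross entropy
    # between the prediction and the true class.
    logits = model(x)
    return -F.cross_entropy(logits, y_true)

def targeted_loss(model, x, y_true, y_target):
    # L(x) = -e(y, y_hat) + e(y_target, y_hat): also pull the prediction
    # toward the target class (e.g. star fish).
    logits = model(x)
    return -F.cross_entropy(logits, y_true) + F.cross_entropy(logits, y_target)
```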

Non-perceivable

  • How should we define $d$ so that the human eye cannot perceive the added noise? The figure below compares the L2 norm and the L-infinity norm; the L-infinity norm turns out to be the more reasonable choice.
    [Figure 5]

Attack Approach

  • Our goal is to solve the following optimization problem:
    $x^* = \arg\min_{d(x^0, x) \leq \varepsilon} L(x)$
  • We can first ignore the constraint and solve for $x$ directly with gradient descent:
    [Figure 6]
  • How do we then satisfy the constraint? A simple, brute-force approach is to check after every update whether the new $x$ still satisfies the constraint; if it does not, we clip it back into the feasible region (see the sketch below):
    [Figure 7]
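
A minimal sketch of this procedure, assuming the L-infinity constraint and the non-targeted loss above (the names, step size, and iteration count are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def attack_with_projection(model, x0, y_true, eps=8/255, lr=1/255, steps=40):
    # Gradient descent on L(x); after every update, x is clipped back
    # into the L-infinity ball of radius eps around x0.
    x = x0.clone().detach()
    for _ in range(steps):
        x.requires_grad_(True)
        loss = -F.cross_entropy(model(x), y_true)             # non-targeted L(x)
        grad = torch.autograd.grad(loss, x)[0]
        with torch.no_grad():
            x = x - lr * grad                                 # unconstrained update
            x = torch.min(torch.max(x, x0 - eps), x0 + eps)   # enforce d(x0, x) <= eps
            x = x.clamp(0, 1)                                 # keep a valid image
    return x.detach()
```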

Fast Gradient Sign Method (FGSM)

  • paper: Explaining and Harnessing Adversarial Examples
  • FGSM updates $x$ only once, and the updated $x$ is guaranteed to satisfy the constraint (see the sketch below).
    [Figure 8]
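
A hedged sketch of FGSM (names are assumptions): a single step of size $\varepsilon$ in the direction of the gradient's sign, which lands exactly on the boundary of the L-infinity ball.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x0, y_true, eps=8/255):
    # One update only: move each pixel by exactly +/- eps, following the
    # sign of the gradient of the cross entropy w.r.t. the input.
    x = x0.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y_true)
    grad = torch.autograd.grad(loss, x)[0]
    x_adv = x0 + eps * grad.sign()
    return x_adv.clamp(0, 1).detach()
```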

Iterative FGSM

  • paper: Adversarial examples in the physical world
    [Figure 9]
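
Iterative FGSM repeats the FGSM step with a smaller step size and clips back into the $\varepsilon$-ball after every iteration; a sketch under the same assumptions as above:

```python
import torch
import torch.nn.functional as F

def iterative_fgsm(model, x0, y_true, eps=8/255, alpha=2/255, steps=10):
    x = x0.clone().detach()
    for _ in range(steps):
        x.requires_grad_(True)
        loss = F.cross_entropy(model(x), y_true)
        grad = torch.autograd.grad(loss, x)[0]
        with torch.no_grad():
            x = x + alpha * grad.sign()                       # small FGSM step
            x = torch.min(torch.max(x, x0 - eps), x0 + eps)   # clip into the eps-ball
            x = x.clamp(0, 1)
    return x.detach()
```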

Black Box Attack

  • In the previous attack, we know the network parameters $\theta$. This is called White Box Attack.
    • Are we safe if we do not release model? - No, because Black Box Attack is possible.

Proxy network

  • If you have the training data of the target network: train a proxy network yourself, and use the proxy network to generate attacked objects.
    [Figure 10]
  • What if we do not know the training data? - Simply feed inputs to the target network and collect its outputs; these input-output pairs can serve directly as training data for the proxy network (a rough sketch follows below).
  • Effectiveness of black box attacks (they work better for non-targeted attacks):
    • paper: Delving into Transferable Adversarial Examples and Black-box Attacks
      [Figure 11]

In the table above, the diagonal entries correspond to white box attacks and the off-diagonal entries to black box attacks.
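
A rough sketch of the query-then-train idea (everything here is a hypothetical placeholder, including `proxy` and `query_target_model`): label your own inputs with the target model's outputs, train the proxy on those pairs, then run a white box attack such as FGSM on the proxy and send the result to the target.

```python
import torch
import torch.nn.functional as F

def train_proxy(proxy, query_target_model, images, epochs=5, lr=1e-3):
    # query_target_model(x) is assumed to return the black box's probability
    # vector; the proxy is trained to imitate it (a distillation-style loss).
    opt = torch.optim.Adam(proxy.parameters(), lr=lr)
    for _ in range(epochs):
        for x in images:
            with torch.no_grad():
                target_probs = query_target_model(x)          # input-output pair
            log_probs = F.log_softmax(proxy(x), dim=-1)
            loss = F.kl_div(log_probs, target_probs, reduction="batchmean")
            opt.zero_grad()
            loss.backward()
            opt.step()
    return proxy

# Attacked images crafted against the trained proxy (white box) are then
# sent to the target model, hoping the attack transfers (black box).
```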

  • Improving the effectiveness of black box attacks: Ensemble Attack
    [Figure 12]

In the table above, the off-diagonal entries are white box attacks and the diagonal entries are black box attacks. The entry in row $i$, column $j$ is the accuracy obtained when attacking model $j$ with attacked images crafted to fool all models other than model $i$.

One pixel attack

  • paper: One pixel attack for fooling deep neural networks
  • video: [TA 補充課] More about Adversarial Attack (1/2) (taught by TA 黃冠博)
    [Figure 13]
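
A sketch of a non-targeted one-pixel attack using differential evolution, in the spirit of the paper (the image size, bounds, and search budget are assumptions; `model` is assumed to take a [1, 3, H, W] tensor in [0, 1]):

```python
import torch
from scipy.optimize import differential_evolution

def one_pixel_attack(model, x0, y_true, h=32, w=32):
    # Search over a single candidate pixel (position + RGB value) that
    # minimizes the model's confidence in the true class.
    def true_class_confidence(z):
        px, py, r, g, b = z
        img = x0.clone()
        img[0, :, int(py), int(px)] = torch.tensor([r, g, b], dtype=img.dtype)
        with torch.no_grad():
            probs = torch.softmax(model(img), dim=-1)
        return probs[0, y_true].item()                         # lower is better

    bounds = [(0, w - 1), (0, h - 1), (0, 1), (0, 1), (0, 1)]  # x, y, r, g, b
    result = differential_evolution(true_class_confidence, bounds,
                                    maxiter=30, popsize=10, seed=0)
    return result.x                                            # best pixel found
```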

Universal Adversarial Attack

  • paper: Universal adversarial perturbations

  • A typical attack finds a tailor-made noise for each individual image, but it is also possible to find a single noise that makes the attack succeed on every image it is added to (Black Box Attack is also possible!)

Attack in the Physical World

Attacking face recognition systems

  • paper: Accessorize to a Crime: Real and Stealthy Attacks on State-of-the-Art Face Recognition
  • Points to keep in mind when attacking:
    • An attacker would need to find perturbations that generalize beyond a single image.
    • Extreme differences between adjacent pixels in the perturbation are unlikely to be accurately captured by cameras. (The attack pattern is limited by the camera's resolution.)
    • It is desirable to craft perturbations that are comprised mostly of colors reproducible by the printer. (The colors the human eye sees can differ from the colors in the digital image.)
  • After putting on the specially crafted glasses, the man on the left is recognized as the woman on the right from every viewing angle.
    [Figure 14]

Attacking road sign recognition systems

  • Model Hacking ADAS to Pave Safer Roads for Autonomous Vehicles
  • Merely extending the middle horizontal stroke of the "3" a little makes the sign recognition system read the speed limit as 85.
    [Figure 15]

“Backdoor” in Model

  • Attack happens at the training phase: the model can be attacked by inserting attacked images into its training set.
  • Be careful of datasets of unknown origin ……
    [Figure 16]

The attack is so easy! Why?

  • paper: Adversarial Examples Are Not Bugs, They Are Features (the vulnerability comes not from a flaw of the model but from features of the training data itself) (just an idea)

[Figure 17]

In the figure above, an image is viewed as a high-dimensional vector; the blue region is the region classified as clownfish, and the horizontal axis is the direction along which the attack easily succeeds.

[Figure 18]

How to Defend

Passive Defense

  • Add a filter in front of the model
    [Figure 19]

Smoothing

[Figure 20]
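
A minimal sketch of the smoothing idea (the 3x3 mean filter is just one possible choice): blur the input slightly before classification so that the carefully tuned noise is partly destroyed, while a benign image is barely affected.

```python
import torch.nn.functional as F

def smooth_then_classify(model, x):
    # Passive defense: apply a simple 3x3 mean filter to the input and
    # feed the smoothed image into the unchanged classifier.
    x_smoothed = F.avg_pool2d(x, kernel_size=3, stride=1, padding=1)
    return model(x_smoothed)
```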


Image Compression

  • paper:
    • Feature Squeezing: Detecting Adversarial Examples in Deep Neural Networks
    • Shield: Fast, Practical Defense and Vaccination for Deep Learning using JPEG Compression
      [Figure 21]
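
A sketch of the compression idea (assuming a [3, H, W] float tensor in [0, 1]; the quality setting is illustrative): encode the image as JPEG and decode it again, so the lossy step throws away much of the adversarial noise.

```python
import io
import numpy as np
import torch
from PIL import Image

def jpeg_compress(x, quality=75):
    # Passive defense: round-trip the image through lossy JPEG compression.
    arr = (x.clamp(0, 1) * 255).byte().permute(1, 2, 0).cpu().numpy()  # CHW -> HWC uint8
    buf = io.BytesIO()
    Image.fromarray(arr).save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    decoded = np.array(Image.open(buf), dtype=np.float32) / 255.0
    return torch.from_numpy(decoded).permute(2, 0, 1)                  # back to CHW
```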

Generator

  • paper: Defense-GAN: Protecting Classifiers Against Adversarial Attacks Using Generative Models
    [Figure 22]

Randomization

  • Mitigating Adversarial Effects Through Randomization
  • If the attacker knows which passive defense is being used and treats the filter as the first layer of the model when attacking, the passive defense no longer works. We can therefore randomly choose among different passive defenses (see the sketch below).
    [Figure 23]
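
A sketch in the spirit of the randomization paper above (the exact resize range and padding scheme are assumptions): randomly resize the input and pad it back to the original size, so the attacker cannot anticipate the exact preprocessing.

```python
import random
import torch.nn.functional as F

def randomized_input(x, max_shrink=16):
    # x: [N, C, H, W]. Randomly shrink, then randomly pad back to (H, W).
    _, _, h, w = x.shape
    new_h = random.randint(h - max_shrink, h)
    new_w = random.randint(w - max_shrink, w)
    x = F.interpolate(x, size=(new_h, new_w), mode="bilinear", align_corners=False)
    pad_h, pad_w = h - new_h, w - new_w
    top, left = random.randint(0, pad_h), random.randint(0, pad_w)
    return F.pad(x, (left, pad_w - left, top, pad_h - top))   # (left, right, top, bottom)
```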

Proactive Defense (Adversarial Training)

  • Adversarial Training: Training a model that is robust to adversarial attack.
    [Figure 24]
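
A minimal sketch of one adversarial training step (FGSM is used here as the attack, but any attack can be plugged in; all names are assumptions): generate attacked images against the current model and train on them together with the clean batch.

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, eps=8/255):
    # 1) Craft adversarial examples for the current model (FGSM here).
    x_req = x.clone().detach().requires_grad_(True)
    grad = torch.autograd.grad(F.cross_entropy(model(x_req), y), x_req)[0]
    x_adv = (x + eps * grad.sign()).clamp(0, 1).detach()

    # 2) Train on the clean batch together with its adversarial counterpart.
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), y) + F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```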

Problem

  • (1) It may not block new, previously unseen attack methods.
  • (2) It requires a large amount of computation.
    • solution: Adversarial Training for Free!
