IoU Attack: Towards Temporally Coherent Black-Box Adversarial Attack for Visual Object Tracking
Shuai Jia1  Yibing Song2  Chao Ma1*  Xiaokang Yang1
1 MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University  2 Tencent AI Lab
{jiashuai, chaoma, xkyang}@sjtu.edu.cn, [email protected]
Abstract
Adversarial attacks arise from the vulnerability of deep neural networks when perceiving input samples injected with imperceptible perturbations. Recently, adversarial attack has been applied to visual object tracking to evaluate the robustness of deep trackers. Assuming that the model structures of deep trackers are known, a variety of white-box attack approaches to visual tracking have demonstrated promising results. However, the model knowledge about deep trackers is usually unavailable in real applications. In this paper, we propose a decision-based black-box attack method for visual object tracking. In contrast to existing black-box adversarial attack methods that deal with static images for image classification, we propose IoU attack, which sequentially generates perturbations based on the predicted IoU scores from both current and historical frames. By decreasing the IoU scores, the proposed attack method degrades the accuracy of temporally coherent bounding boxes (i.e., object motions) accordingly. In addition, we transfer the learned perturbations to the next few frames to initialize temporal motion attack. We validate the proposed IoU attack on state-of-the-art deep trackers (i.e., detection based, correlation filter based, and long-term trackers). Extensive experiments on the benchmark datasets indicate the effectiveness of the proposed IoU attack method.
The source code is available at https://github.com/VISION-SJTU/IoUattack.
1. Introduction
Visual object tracking is one of the fundamental computer vision problems with a wide range of applications. Convolutional neural networks (CNNs) have significantly advanced visual tracking performance. Meanwhile, the difficulty of interpreting CNNs perplexes existing visual tracking algorithms as well. For example, injecting imperceptible perturbations into input images leads deep neural networks to predict incorrectly [37, 43, 48].
Figure 1. IoU attack for visual object tracking. State-of-the-art deep trackers (i.e., SiamRPN++ [22], DiMP [1], and LTMU [5]) effectively locate target objects in the original video sequences as shown in (a). Our IoU attack decreases their tracking accuracies by injecting imperceptible perturbations as shown in (b).
To investigate the robustness of visual tracking algorithms built on deep models, recent approaches [3, 44, 16, 24] assume that the model structures of deep tracking algorithms are known and carry out white-box attacks on them. Despite the demonstrated promising results, the concrete structures and parameters of deep trackers are barely known in real applications. In this paper, we investigate black-box adversarial attack for visual tracking, where the model knowledge of deep trackers is unknown.
Prevalent black-box attack algorithms inject imperceptible perturbations into input images to decrease network classification accuracy. Although these methods are effective for attacking static images, they are not suitable for attacking temporally moving objects in videos. This is because deep trackers maintain the temporal motions of the target object within their tracking models (i.e., correlation filters [6, 35] or deep binary classifiers [30, 17, 23, 22]). When localizing the target object, these deep trackers produce temporally coherent bounding boxes (bbxs). Meanwhile, deep trackers constrain the search area to be close to the bbx predicted in the last frame. As existing black-box methods rarely degrade temporally coherent bbxs, perturbations produced based on CNN classification scores are not effective for visual tracking. An intriguing direction thus arises: investigating black-box attack on both individual frames and the temporal motions among sequential frames with a holistic decision-based approach.
In this paper, we propose IoU attack for visual tracking. IoU attack is a decision-based black-box attack method which focuses on both image content and target motions in video sequences. When processing each frame, we start image content attack with two bbxs. One is predicted by the deep tracker using the original frame, which is perturbation free. The other one is predicted by the same tracker using the same frame with noisy perturbations. These two bbxs are used to compute an IoU score as feedback to our IoU attack. For each frame, we use an iterative orthogonal composition method for image content attack. During each iteration of orthogonal attack, we first randomly generate several tangential perturbations whose noise levels are the same. Then, we compute their IoU scores and select the tangential perturbation with the lowest score. The selected perturbation is the most effective one to attack the current frame at the current iteration. We then increase the selected perturbation in its normal direction to add a small amount of noise, which is the normal perturbation. We compose both tangential and normal perturbations to generate the perturbations for the current iteration of orthogonal attack.
For target motion attack, we compute an IoU score between the bbxs from the current and the previous frames. This IoU score is integrated into the tangential perturbation identification process. In this way, our orthogonal attack deviates a deep tracker from its original predictions on both the current and historical frames. We transfer the learned perturbations to the next few frames as perturbation initialization to reinforce temporal motion attack. As a result, the deviation from the original tracking results ensures the success of black-box attack on the deep trackers shown in Figure 1. We extensively validate the proposed IoU attack on state-of-the-art methods including detection based [22], correlation filter based [1], and long-term [5] trackers. Experiments on benchmark datasets demonstrate the effectiveness of the proposed black-box IoU attack.
2. Related Work
In this section, we briefly introduce recent state-of-the-art trackers and their basic principles. In addition, we review recent adversarial attack methods, with an emphasis on black-box attack.
2.1. Visual Object Tracking
Visual object tracking has received widespread attention in the last decade and has brought about a series of new benchmark datasets [42, 18, 28, 29, 11]. Existing trackers can be generally categorized into offline trackers and online update trackers. Offline trackers do not update their model parameters during inference, leading to higher speed. These trackers consider tracking as a discriminative object detection problem. They generate candidate regions and classify them as target or background to localize the object. Bounding box regression [34] is typically used for precise localization. Among them, siamese based methods [23, 22, 39, 47, 13, 4, 38, 40] are typical structures consisting of a template branch and a search branch. SiamRPN [23] introduces a region proposal network to formulate tracking as one-shot detection by comparing the similarity between the two branches. SiamRPN++ [22] applies a deeper ResNet backbone instead of the commonly used AlexNet to improve tracking accuracy while maintaining real-time speed.
Online update trackers constantly update their models during inference to adapt to the current scenarios [31]. MDNet [30, 36, 32] regards tracking as a classification problem to distinguish the target from the background. During inference, it collects samples from previous frames to enhance the target appearance. UpdateNet [46] formulates an update strategy for siamese based trackers to maintain the temporal motion between frames. Besides, correlation filter based methods also belong to online update trackers. They typically learn a discriminative correlation filter over deep or hand-crafted features to estimate the target location. Recently, DiMP [1] introduces a discriminative learning loss to exploit both target and background appearance information for target model prediction. PrDiMP [7] proposes a probabilistic regression formulation to address label noise in modeling.
Furthermore, existing long-term trackers [45, 5] integrate an online update module to improve the tracking performance. A re-detection module is typically introduced to handle the disappearance and reappearance of the target, which brings additional challenges to adversarial attack. LTMU [5] is a long-term tracker with a meta-updater, which learns to guide the tracker's update to gain helpful appearance information for accuracy. In this work, we implement our adversarial attack on three representative trackers [22, 1, 5] to illustrate the generality of our black-box attack method.
2.2. Adversarial Attack
Convolutional Neural Networks (CNNs) have been deployed in various computer vision tasks today. However, recent studies [12, 37] notice that CNNs are sensitive to the imperceptible perturbations in adversarial examples. The intentional light-weight perturbations deteriorate the performance dramatically. Existing adversarial attack methods [12, 27, 15, 8, 33] mainly focus on static image tasks such as classification, segmentation, and detection. Apart from attacking digital images, some studies implement physical attacks [10, 41] in concrete applications (e.g., autonomous driving). They place a distractor in the real world to cause CNN models to misclassify or fail to detect objects, leading to security problems.
Overall, existing adversarial attack methods are mainly divided into two categories: white-box and black-box attack. In white-box attack, the adversary is assumed to have full knowledge of the attacked target, such as the learned parameters and the concrete structure. Compared with white-box attack, black-box attack has limited knowledge of the model but is closer to practical scenarios. It is often modeled as querying the method with inputs and acquiring the final labels or confidence scores. Black-box attack roughly falls into transfer-based, score-based, and decision-based attack [9]. Transfer-based attack [25] utilizes the transferability of adversarial examples generated by white-box models. Score-based attack knows the predicted classification probability and relies on approximated gradients to generate adversarial examples [15]. In decision-based attack, only the final classification label is accessible [2] to the threat model. For black-box attack in visual object tracking, we assume that only the outputs of trackers (i.e., predicted bounding boxes) are available.
In the field of visual object tracking, several methods [3, 44, 16, 24, 14] explore adversarial attack on different trackers. Chen et al. [3] propose a one-shot adversarial attack method by optimizing the batch confidence loss and the feature loss to modify the initial target patch. Yan et al. [44] design an adversarial loss to cool the hot regions on the heatmaps and shrink the predicted bounding box. All adversarial attack methods mentioned above belong to white-box attack. In the real world, it is hard to know the concrete knowledge of trackers, making white-box attack methods less practical. In this work, we propose a novel decision-based black-box adversarial attack method for visual tracking. Motion consistency is incorporated into our attack method to further deteriorate the tracking performance.
3. Proposed Method
The proposed IoU attack aims to gradually decrease IoU scores (i.e., bbx overlap values) during iterations by using the minimal amount of noise. Figure 2 shows an intuitive view of the proposed IoU attack.
3.1. Motivation
Current studies on black-box attack mainly focus on static image recognition, while the temporal motions of visual tracking remain untouched. This limits their effectiveness in attacking deep
Figure 2. An intuitive view of IoU attack in the image space. (a) Iterative perturbation update: the increase of noise level positively correlates with the decrease of IoU scores, but their directions are not exactly the same. The IoU attack method iteratively finds the intersection points (i.e., intermediate images) between each contour line of noise increase and IoU decrease. These intermediate images gradually decrease IoU scores with the lowest amount of noise. (b) Orthogonal composition during each iteration: we generate noise hypotheses tangentially along the current contour line (i.e., #1) and add a small amount of noise in the normal direction (i.e., #2). The intersection point is identified from the hypothesis that yields the lowest IoU at the same noise level. The updated perturbation in each iteration is the composition of #1 and #2.
trackers, as the target object motion is maintained temporally. Meanwhile, deep trackers utilize temporally coherent schemes (i.e., search region constraints and online updates) to ensure tracking accuracy. Image content and temporal motions are therefore equally important for black-box attack on visual tracking.
The proposed IoU attack is to make the prediction results of a tracker deviate from its original performance. This is because of the tracking scenario, where there is only one ground-truth bounding box (bbx) available (i.e., the bbx annotation on the first frame). We define the original performance of a tracker as the bbx it predicts on each frame without noise addition. By adding heavy noisy perturbations, we make the same tracker predict another bbx and compute a spatial IoU score based on these two bbxs. Meanwhile, we use the bbx from the current frame and the one from the previous frame to compute a temporally coherent IoU score, which is then fused with the spatial IoU score. As state-of-the-art trackers demonstrate premier performance on the benchmarks, gradually decreasing the IoU scores over consecutive video frames indicates that their tracking performance deteriorates significantly. The IoU measurement suits different trackers as long as they predict one bbx for each frame.
3.2. IoU Attack
Figure 2 shows an intuitive view of how the proposed IoU attack gradually decreases the IoU scores between frames. Given a clean input frame, we first add heavy uniform noise to it to generate a heavy noise image where the IoU score is low. Along the direction from the clean image to the heavy noise image, the IoU scores gradually decrease while the noise level increases. The direction of IoU decrease positively correlates with that of noise increase, but they are not exactly the same. IoU attack aims to progressively find a decreased IoU score while introducing the lowest amount of noise. The contour lines of IoU shown in Figure 2(a) indicate the tracker performance with regard to different noise perturbations, which cannot be explicitly modeled in practice. From another perspective, IoU attack aims to identify the specific noise perturbation leading to the lowest IoU score at the same noise level. The identification process is fulfilled by the orthogonal composition illustrated as follows.
We denote the original image of the t-th frame as I_0, the heavy noise image as H, and the intermediate image at the k-th iteration as I_k. In the (k+1)-th iteration, we first randomly generate several Gaussian noise samples η ∼ N(0, 1) and select the tangential perturbation η_j among n of them subject to the noise-level constraint in Eq. 1.
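A form of this constraint consistent with the description below is (the exact normalization of η_j here is our assumption):

    d(I_0, I_k + η_j) = d(I_0, I_k),  j ∈ [1, 2, ..., n],    (1)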
where d is the pixel-wise distance between two images. Eq. 1 ensures that the tangential perturbations stay at the same noise level. The selected η_j (j ∈ [1, 2, ..., n]) is the perturbation tangential to the contour line of the noise level at the point I_k. We generate one I_j (j ∈ [1, 2, ..., n]) according to each η_j and use the tracker to predict a bbx B_t^j on it. Then, we define the IoU score S_IoU as in Eq. 2.
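A plausible form of this fused score, consistent with the definitions that follow, is (the complementary weighting by λ and 1 − λ is our assumption):

    S_IoU = λ · S_spatial + (1 − λ) · S_temporal,    (2)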
where S_spatial denotes the spatial IoU score between the predicted bbx B_t^j and the original noise-free bbx B_t^orig at the t-th frame, S_temporal denotes the temporal IoU score with the original noise-free bbx B_{t-1}^orig at the (t-1)-th original frame, and λ is a scalar balancing the influence of the spatial and temporal IoU scores. We attack S_IoU to perform both image content attack and temporal motion attack. In total, we obtain n IoU scores and select the I_j whose S_IoU is the lowest. An example of η_j is visualized as #1 in Figure 2(b).
After obtaining the tangential perturbation, we regard η_j as a neighboring hypothesis based on I_k and move I_k + η_j towards the heavy noise image H according to Eq. 3.
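Assuming ψ(H, x) denotes the unit direction pointing from x towards H, a plausible form of this update is:

    I_{k+1}^j = I_k + η_j + ε · ψ(H, I_k + η_j),    (3)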
where ε controls the moving step towards H, and ε · ψ(H, I_k + η_j) is the perturbation following the noise increase direction (i.e., the normal direction towards the contour line of the noise level). We adjust the parameter ε moderately to limit the variation of perturbations. An example of ε · ψ(H, I_k + η_j) is visualized as #2 in Figure 2(b). To this end, I_{k+1}^j is the intermediate image at the (k+1)-th iteration, consisting of the composed perturbations from both the tangential and normal directions. We perform the iterations until the IoU score is below the predefined threshold or the perturbations exceed the maximum budget. We transfer the learned perturbations P_t to the next few frames. The learned perturbations become the initialized perturbations, which are added to I_0 of the (t+1)-th frame to encode temporal motion attack from previous frames. The pseudo code of the black-box IoU attack is shown in Algorithm 1.
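For concreteness, the per-frame procedure can be summarized by the minimal sketch below. It assumes a generic black-box tracker exposed as a track(image) -> (x, y, w, h) callback; the function names, parameter values, and stopping thresholds are illustrative placeholders rather than the exact settings used in our experiments.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def iou_attack_frame(frame, track, box_orig, box_prev, init_pert=None,
                     lam=0.7, n=10, eps=0.05, sigma=8.0,
                     iou_thresh=0.3, max_l2=5000.0, max_iter=20):
    """Attack one frame; return the adversarial frame and its perturbation."""
    I0 = frame.astype(np.float32)
    H = I0 + np.random.uniform(-128.0, 128.0, I0.shape)   # heavy noise image
    # Initialize with the perturbation transferred from previous frames, if any.
    Ik = I0 + init_pert if init_pert is not None else I0.copy()
    for _ in range(max_iter):
        # Tangential step: sample n hypotheses at the same noise level and
        # keep the one with the lowest fused (spatial + temporal) IoU score.
        best_score, best_img = None, None
        base_dist = np.linalg.norm(Ik - I0) + 1e-6
        for _ in range(n):
            Ij = Ik + np.random.randn(*I0.shape) * sigma
            # Rescale so that d(I0, Ij) == d(I0, Ik), i.e., the same noise level.
            Ij = I0 + (Ij - I0) * base_dist / (np.linalg.norm(Ij - I0) + 1e-6)
            box_adv = track(np.clip(Ij, 0, 255).astype(np.uint8))
            score = lam * iou(box_adv, box_orig) + (1.0 - lam) * iou(box_adv, box_prev)
            if best_score is None or score < best_score:
                best_score, best_img = score, Ij
        # Normal step: move a small fraction of the way towards the heavy noise image.
        Ik = best_img + eps * (H - best_img)
        # Stop once the IoU is low enough or the perturbation budget is spent.
        if best_score < iou_thresh or np.linalg.norm(Ik - I0) > max_l2:
            break
    adv = np.clip(Ik, 0, 255).astype(np.uint8)
    return adv, adv.astype(np.float32) - I0
```

The returned perturbation can then be added to the next frame as init_pert, which mirrors the transfer of P_t described above.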
3.3. Discussions and Visualizations
In this section, we visualize the variations of adversarial perturbations during IoU attack in Figure 3. Given an original image, we iteratively inject the adversarial perturbation as shown in the first row of Figure 3. With the increase of adversarial perturbations, the adversarial example drifts the
Figure 3. Variations of adversarial perturbations during IoU attack. The 3D response map above each image represents the difference between the original image and the adversarial example at different IoU scores. The IoU score decreases as the magnitude of perturbations increases. The variations of perturbations are illustrated in the first row. The orthogonal composition is shown in the second row, including the tangential direction and the normal direction. The image framed in green represents the minimal IoU score in the tangential direction, and the image framed in yellow represents the moving step towards the heavy noise image.
Table 1. Comparison of tracking results with original sequences, random noise, and IoU attack of SiamRPN++ [22], DiMP [1] and LTMU [5] respectively on the VOT2019 [21] dataset.
target from the original result and leads the IoU scores to decrease. We compute the cosine distance between the perturbations of two consecutive intermediate images. The cosine distance indicates that the generated perturbations follow an increasing trend without fluctuation, which effectively reduces the number of queries in our black-box attack. During each iteration, we visualize the concrete orthogonal composition between consecutive intermediate images, as shown in the second row of Figure 3. We introduce several candidate images according to Eq. 1 and select the one with the minimal IoU score as the tangential direction (i.e., #1). Then, we move toward the heavy noise image in trials to make sure the IoU score decreases. We adjust the weight in Eq. 5 to constrain the variation of perturbation and output the result as the normal direction (i.e., #2). These two directions compose the orthogonal composition during each iteration. As a result, the final perturbation preserves a lighter degree of noise than heavy random noise does, while decreasing the IoU scores heavily. In other words, our IoU attack causes larger degradation of IoU scores by injecting fewer perturbations.
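As an illustration of this diagnostic, the cosine distance between the perturbations of consecutive intermediate images can be computed as below; the function names and the perturbations list (holding I_k − I_0 for each iteration k) are hypothetical.

```python
import numpy as np

def cosine_distance(p, q):
    """Cosine distance between two perturbation arrays."""
    p, q = p.ravel(), q.ravel()
    return 1.0 - float(np.dot(p, q) /
                       (np.linalg.norm(p) * np.linalg.norm(q) + 1e-12))

def consecutive_cosine_distances(perturbations):
    """Distances between perturbations of consecutive intermediate images."""
    return [cosine_distance(perturbations[k], perturbations[k + 1])
            for k in range(len(perturbations) - 1)]
```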
4. Experiments
We validate the performance of our IoU attack on six challenging datasets, VOT2019 [21], VOT2018 [19], VOT2016 [20], OTB100 [42], NFS [18] and VOT2018-LT [19]. Detailed results are provided as follows.
4.1. Experiment Setup
Deployment of Trackers. In order to validate the generality of our black-box adversarial attack, we choose three representative trackers with different structures: SiamRPN++ [22], DiMP [1] and LTMU [5]. SiamRPN++ is a typical detection based tracker built on a siamese network. It compares the similarity between a target template and a search region with a region proposal network. The end-to-end learned tracker DiMP exploits both target and background appearance information to locate the target precisely. LTMU is a long-term tracker, utilizing a meta-updater to update the tracker online for target prediction.
Table 2. Comparison of tracking results with original sequences, random noise, and IoU attack of SiamRPN++ [22], DiMP [1] and LTMU [5] respectively on the VOT2018 [19] dataset.
Table 3. Comparison of tracking results with original sequences, random noise, and IoU attack of SiamRPN++ [22], DiMP [1] and LTMU [5] respectively on the VOT2016 [20] dataset.
Table 4. Comparison of tracking results with original sequences, random noise, and IoU attack of SiamRPN++ [22], DiMP [1] and LTMU [5] respectively on the OTB100 [42] dataset.
Table 5. Comparison of tracking results with original sequences, random noise, and IoU attack of SiamRPN++ [22], DiMP [1] and LTMU [5] respectively on the NFS30 [18] dataset.
Implementation Details. We form the heavy noise image by injecting uniform noise into the clean image as feedback. The degradation of tracking is not sensitive to the type of initial random noise at the same noise level. We discontinue the iterative perturbation update when the IoU score is below the predefined threshold or the perturbations exceed the maximum. In total, the average query numbers of IoU attack are 21.2, 31.4 and 54.2 per frame for SiamRPN++, DiMP and LTMU, respectively.
4.2. Overall Attack Results
VOT2019.
We evaluate the three trackers on the VOT2019 [21] dataset consisting of 60 challenging sequences. Different from other datasets, the VOT dataset has a re-initialization module. When the tracker loses the target (i.e., the overlap between the predicted result and the annotation is zero), the tracker is re-initialized with the ground truth. Failures counts the number of re-initializations. Accuracy evaluates the average overlap ratio of successfully tracked frames. Robustness measures the overall number of lost frames. In addition, Expected Average Overlap (EAO) is evaluated as a combination of Accuracy and Robustness.
Table 1 shows the performance drops after IoU attack. We first test all trackers on the original sequences. Then we implement our IoU attack method to generate the adversarial examples and evaluate the tracking results. SiamRPN++ produces more failures than on its original results, and its EAO score drops from 0.287 to 0.124. DiMP suffers a 16.5% drop in its accuracy score, which indicates that our attack method leads to an obvious drift. Its EAO score also drops dramatically from 0.332 to 0.195. Similarly, our IoU attack method reduces the EAO score of LTMU from 0.201 to 0.150. For further comparison, we also conduct experiments that inject the same level of random noise into the original sequences. Our generated perturbations decrease the IoU scores more dramatically than random noise.
VOT2018.
There are 60 different sequences in the VOT2018 [19] dataset. All the trackers perform favorably on the original sequences. LTMU performs worse than the other two trackers since the re-detection module yields more re-initializations in the VOT toolkit. Table 2 shows that the performance of these trackers deteriorates obviously under IoU attack. Concretely, the accuracies of the three trackers get worse after the adversarial attack, indicating that the trackers indeed deviate from their original results. The primary metric EAO is reduced by 68.8%, 41.9% and 38.5% for SiamRPN++, DiMP and LTMU, respectively.
Figure 4. Precision and recall plots of IoU attack for SiamRPN++ [22], DiMP [1] and LTMU [5] respectively on the VOT2018-LT dataset [19]. We use Attack and Random to denote IoU attack and the same level of random noise. The legend is ranked by F-score.
VOT2016.
Similarly, we also conduct the IoU attack on the VOT2016 dataset [20], as shown in Table 3. The trackers perform much better on the original sequences of this dataset than on the above two datasets. However, IoU attack still reduces the EAO by 60.3%, 43.0% and 28.0% for SiamRPN++, DiMP and LTMU, respectively. Our IoU attack is more effective than the same level of random noise.
OTB100. The OTB100 [42] dataset includes 100 fully annotated video sequences. The evaluation uses two main metrics, success and precision, under the one-pass evaluation (OPE). We compare the results before and after IoU attack in Table 4. With IoU attack, the AUC scores of success significantly decline, accounting for 71.8%, 88.2% and 76.9% of the original results for SiamRPN++, DiMP and LTMU, respectively. In contrast, the AUC scores with random noise account for 90.8%, 98.2% and 92.6%, respectively.
NFS30. We also conduct IoU attack on the NFS30 [18] dataset consisting of 100 videos at 30 FPS with an average length of 479 frames. All sequences are manually labeled with nine attributes, such as occlusion and fast motion. We adopt the same metrics as used for the OTB100 dataset, as shown in Table 5. According to the AUC metric of success, SiamRPN++ suffers a 22.6% decrease after IoU attack, while injecting the same level of random noise causes an 8.4% decrease. DiMP shows an 11.2% decrease compared to a 3.7% decrease with random noise. LTMU has a 26.8% decrease after IoU attack and an 8.2% decrease with random noise. IoU attack thus causes roughly three times the drop of the same level of random noise.
Table 6. Ablation studies on IoU attack for SiamRPN++ [22], DiMP [1] and LTMU [5] on the VOT2018 [19] and VOT2016 [20] datasets. S_temporal represents the temporal IoU score and P_{t-1} represents the learned perturbation from historical frames.

VOT2018-LT. In order to further verify the effectiveness of our IoU attack, we evaluate the three trackers on the more challenging VOT2018-LT [19] dataset. It has 35 sequences with an average length of 4200 frames, which is much longer than other datasets and closer to practical applications. Each tracker needs to output a confidence score for the target being present and a predicted bounding box in each frame. Precision (P) and recall (R) are evaluated over a series of confidence thresholds, and the F-score is calculated as F = 2P · R/(P + R). The primary long-term tracking metric is the highest F-score over all thresholds. Figure 4 shows the precision and recall at different confidence thresholds before and after IoU attack. The results in the legend are ranked by F-score. Both precision and recall drop significantly after our IoU attack on the three trackers. Our IoU attack method reduces the F-score by 27.5%, 27.3% and 14.8% for SiamRPN++, DiMP and LTMU, respectively. All trackers under IoU attack perform poorly compared with injecting the same level of random noise. Our black-box attack method is thus also effective for long-term tracking.
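As a small illustration of this metric, the maximum F-score can be computed from per-threshold precision and recall arrays as below; the array names are placeholders for the values produced by the benchmark toolkit.

```python
import numpy as np

def max_f_score(precisions, recalls):
    """Highest F = 2PR/(P+R) over a series of confidence thresholds."""
    p = np.asarray(precisions, dtype=float)
    r = np.asarray(recalls, dtype=float)
    f = 2.0 * p * r / np.maximum(p + r, 1e-12)
    return float(f.max())
```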
4.3. Ablation Studies
To explore the temporal motion of visual object tracking in black-box attack, we compare the IoU attack method with and without the temporal IoU score in Eq. 2, as reported in Table 6. With the help of temporal IoU scores, the deep trackers obtain worse tracking accuracies than when only using the spatial IoU scores on multiple datasets. In addition, we transfer the learned perturbation P_{t-1} into the temporally consistent motion attack.

Table 7. Comparison with existing white-box and black-box attack methods for SiamRPN++ [22] with ResNet on the OTB100 [42] dataset.
Figure 5. Comparison of EAO scores and failure rates of SiamRPN++ [22] under different perturbations on the VOT2018 [19] dataset.
We compare the IoU attack method with and without the transfer of the historical perturbation P_{t-1} in Table 6. The overall performance metric EAO indicates that the transfer and initialization of previous perturbations indeed improve the attack effects and decrease the tracking accuracies. In addition, we also illustrate the attack performance with respect to the variation of perturbations on the VOT2018 [19] dataset, as shown in Figure 5. The perturbations are measured by the ℓ2 norm. Failure rate represents the average rate of failure frames over the whole video. We observe that the tracking performance accordingly gets worse as the perturbations increase.
4.4. Comparison with Other Methods
Table 7 reports the comparison with existing white-box and black-box attack methods. Our black-box attack method, without access to the network architecture of the trackers, performs only slightly worse than the white-box attack method CSA [44]. SPARK [14] performs a transfer-based black-box attack and obtains a 6.6% success drop and a 2.7% precision drop. Our decision-based black-box attack significantly outperforms SPARK. In addition, we apply the perturbation from UAP [26] frame by frame, which is designed for attacking classification in static images. Our method, which considers the temporal motion changes of the target objects, achieves a much stronger attack.
4.5. Qualitative Results
Figure 6 qualitatively shows the tracking results of our IoU attack for SiamRPN++ [22], DiMP [1] and LTMU [5] on three challenging sequences. We visualize the original tracking results in (a), the results with the same level of random noise in (b), and the results of our IoU attack in (c). On the original images, all the trackers locate the target objects and estimate the scale changes accurately. After generating the adversarial examples, the trackers estimate the target location inaccurately. In contrast, the same level of random noise cannot drift the trackers, as shown in the second column. This indicates that the proposed IoU attack generates optimized perturbations while maintaining the same noise level as random noise.
Figure 6. Qualitative results of IoU attack on three challenging sequences from the OTB100 [42] dataset. (a) Original results. (b) Results with random noise. (c) Results under IoU attack.
5. Concluding Remarks
In this paper, we propose an IoU attack method in the black-box setting to generate adversarial examples for visual object tracking. Without access to the network architecture of deep trackers, we iteratively adjust the direction of light-weight noise according to the predicted IoU scores of bounding boxes, which involve temporal motion in historical frames. Furthermore, we transfer the perturbations into the next frames to improve the effectiveness of attack. We apply the proposed method to three state-of-the-art representative trackers to illustrate the generality of our black-box adversarial attack for visual object tracking. The extensive experiments on standard benchmarks demonstrate the effectiveness of the proposed black-box IoU attack. We believe this work helps to evaluate the robustness of visual object tracking.