IoU Attack: Towards Temporally Coherent Black-Box Adversarial Attack for Visual Object Tracking


Shuai Jia1, Yibing Song2, Chao Ma1*, Xiaokang Yang1
1 MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
2 Tencent AI Lab
{jiashuai, chaoma, xkyang}@sjtu.edu.cn, [email protected]


Adversarial attacks arise from the vulnerability of deep neural networks to input samples injected with imperceptible perturbations. Recently, adversarial attack has been applied to visual object tracking to evaluate the robustness of deep trackers. Assuming that the model structures of deep trackers are known, a variety of white-box attack approaches to visual tracking have demonstrated promising results. However, the model knowledge about deep trackers is usually unavailable in real applications. In this paper, we propose a decision-based black-box attack method for visual object tracking. In contrast to existing black-box adversarial attack methods that deal with static images for image classification, we propose IoU attack, which sequentially generates perturbations based on the predicted IoU scores from both current and historical frames. By decreasing the IoU scores, the proposed attack method degrades the accuracy of temporally coherent bounding boxes (i.e., object motions) accordingly. In addition, we transfer the learned perturbations to the next few frames to initialize temporal motion attack. We validate the proposed IoU attack on state-of-the-art deep trackers (i.e., detection based, correlation filter based, and long-term trackers). Extensive experiments on the benchmark datasets indicate the effectiveness of the proposed IoU attack method.

The source code is available at https://github.com/VISION-SJTU/IoUattack.



1. Introduction

Visual object tracking is one of the fundamental computer vision problems, with a wide range of applications. Convolutional neural networks (CNNs) have significantly advanced visual tracking performance. Meanwhile, the difficulty of interpreting CNNs affects existing visual tracking algorithms as well. For example, injecting imperceptible perturbations into input images leads deep neural networks to predict incorrectly [37, 43, 48]. To investigate the robustness of visual tracking algorithms with deep models, recent approaches [3, 44, 16, 24] assume that the model structures of deep tracking algorithms are known and carry out white-box attack on them. Despite the promising results, the concrete structures and parameters of deep trackers are rarely known in real applications. In this paper, we investigate black-box adversarial attack for visual tracking, where the model knowledge of deep trackers is unknown.

Figure 1. IoU attack for visual object tracking. State-of-the-art deep trackers (i.e., SiamRPN++ [22], DiMP [1], and LTMU [5]) effectively locate target objects in the original video sequences as shown in (a). Our IoU attack decreases their tracking accuracies by injecting imperceptible perturbations as shown in (b).


Prevalent black-box attack algorithms inject imperceptible perturbations into input images to decrease network classification accuracies. Although these methods are effective at attacking static images, they are not suitable for attacking temporally moving objects in videos. This is because deep trackers maintain temporal motions of the target object within tracking models (i.e., the correlation filters [6, 35] or deep binary classifiers [30, 17, 23, 22]). When localizing the target object, these deep trackers produce temporally coherent bounding boxes (bbxs). Meanwhile, deep trackers constrain the search area to be close to the predicted bbx from the last frame. As existing black-box methods rarely degrade temporally coherent bbxs, perturbations produced based on CNN classification scores are not effective for visual tracking. An intriguing direction thus arises: investigating black-box attack on both individual frames and temporal motions among sequential frames with a holistic decision-based approach.


In this paper, we propose IoU attack for visual tracking. IoU attack is a decision-based black-box attack method that focuses on both image content and target motions in video sequences. When processing each frame, we start the image content attack with two bbxs. One is predicted by the deep tracker using the original frame, which is perturbation free. The other is predicted by the same tracker using the same frame with noisy perturbations. These two bbxs are used to compute an IoU score as feedback to our IoU attack. For each frame, we use an iterative orthogonal composition method for image content attack. During each iteration of orthogonal attack, we first randomly generate several tangential perturbations whose noise levels are the same. Then, we compute their IoU scores and select the tangential perturbation with the lowest score. The selected perturbation is the most effective one for attacking the current frame at the current iteration. We then move the selected perturbation along its normal direction to add a small amount of noise, which is the normal perturbation. We compose the tangential and normal perturbations to generate the perturbation for the current iteration of orthogonal attack.


For target motion attack, we compute an IoU score between the bbxs from the current and the previous frames. This IoU score is integrated into the tangential perturbation identification process. In this way, our orthogonal attack makes a deep tracker deviate from its original performance on both the current and historical frames. We transfer the learned perturbations to the next few frames as perturbation initialization to reinforce temporal motion attack. As a result, the deviation from the original tracking results ensures the success of black-box attack on deep trackers, as shown in Figure 1. We extensively validate the proposed IoU attack on state-of-the-art methods including detection based [22], correlation filter based [1], and long-term [5] trackers. Experiments on benchmark datasets demonstrate the effectiveness of the proposed black-box IoU attack.


2. Related Work

In this section, we briefly introduce recent state-of-the-art trackers and their basic principles. We also review recent adversarial attack methods, especially black-box attack.


2.1. Visual Object Tracking

Visual object tracking has received widespread attention in the last decade and has brought about a series of new benchmark datasets [42, 18, 28, 29, 11]. Existing trackers can be generally categorized as offline trackers and online update trackers. Offline trackers do not update their model parameters during inference, leading to a higher speed. These trackers consider tracking as a discriminative object detection problem. They generate candidate regions and classify each as target or background to locate the object. Bounding box regression [34] is typically used to refine the location. Among them, siamese based methods [23, 22, 39, 47, 13, 4, 38, 40] are typical structures consisting of a template branch and a search branch. SiamRPN [23] employs a region proposal network to formulate a one-shot detection by comparing the similarity between the two branches. SiamRPN++ [22] applies a deeper network, ResNet, instead of the commonly used AlexNet to improve the tracking accuracy while maintaining real-time speed.


Online update trackers constantly update their models during inference to adapt to the current scenarios [31]. MDNet and related trackers [30, 36, 32] regard tracking as a classification problem that distinguishes the target from the background. During inference, they collect samples from previous frames to enhance the target appearance. UpdateNet [46] formulates an update strategy for siamese based trackers to maintain the temporal motion between frames. Correlation filter based methods also belong to online update trackers. They typically learn a discriminative correlation filter from deep or hand-crafted features to estimate the target location. Recently, DiMP [1] uses a discriminative learning loss to exploit both target and background appearance information for target model prediction. PrDiMP [7] proposes a probabilistic regression formulation to model label noise.


Furthermore, existing long-term trackers [45, 5] integrate an online update module to improve the tracking performance. A re-detection module is usually introduced to handle the disappearance and reappearance of the target, which poses additional challenges for adversarial attack. LTMU [5] is a long-term tracker with a meta-updater, which learns to guide the tracker's update so that it gains appearance information that helps accuracy. In this work, we implement our adversarial attack on three representative trackers [22, 1, 5] to illustrate the generality of our black-box attack method.


2.2. Adversarial Attack

Convolutional Neural Networks (CNNs) have been deployed in various computer vision tasks. However, recent studies [12, 37] show that CNNs are sensitive to the imperceptible perturbations in adversarial examples. These intentional light-weight perturbations deteriorate the performance dramatically. Existing adversarial attack methods [12, 27, 15, 8, 33] mainly focus on static image tasks such as classification, segmentation, and detection. Besides attacking digital images, some studies implement physical attacks [10, 41] in concrete applications (e.g., autonomous driving). They generate a distractor in the real world that causes CNN models to misclassify or fail to detect, leading to a security problem.



Overall, existing adversarial attack methods are mainly divided into two categories: white-box and black-box attack. In white-box attack, the adversary is assumed to have full knowledge of the attacked target, such as the learned parameters and the concrete structure. Compared with white-box attack, black-box attack has limited knowledge of the model but is closer to practical scenarios. It is often modeled as querying the model with inputs and observing the final labels or confidence scores. Black-box attack roughly falls into transfer-based, score-based, and decision-based attack [9]. Transfer-based attack [25] utilizes the transferability of adversarial examples generated by white-box models. Score-based attack knows the predicted classification probability and relies on approximated gradients to generate adversarial examples [15]. In decision-based attack, only the final classification label is accessible [2] to the threat model. For black-box attack in visual object tracking, we assume that only the outputs of trackers (i.e., predicted bounding boxes) are available.



In the field of visual object tracking, some methods [3, 44, 16, 24, 14] explore adversarial attack on different trackers. Chen et al. [3] propose a one-shot adversarial attack method that optimizes the batch confidence loss and the feature loss to modify the initial target patch. Yan et al. [44] design an adversarial loss to cool the hot regions on the heatmaps and shrink the predicted bounding box. All adversarial attack methods mentioned above are summarized as white-box attack. In the real world, it is hard to know the concrete details of trackers, which makes white-box attack methods less practical. In this work, we propose a novel decision-based black-box adversarial attack method for visual tracking. Motion consistency is incorporated into our attack method to further deteriorate the tracking performance.


3. Proposed Method

The proposed IoU attack aims to gradually decrease IoU scores (i.e., bbx overlap values) during iterations by using the minimal amount of noise. Figure 2 shows an intuitive view of the proposed IoU attack.

3.1. Motivation

Current studies on black-box attack mainly focus on static image recognition, while the temporal motions of visual tracking are untouched. This limits their ability to attack deep trackers, as the target object motion is maintained temporally. Meanwhile, deep trackers utilize temporally coherent schemes (i.e., search region constraint and online update) to ensure tracking accuracy. The image content and temporal motions are equally important for black-box attack on visual tracking.

Figure 2. An intuitive view of IoU attack in the image space: (a) iterative perturbation update, (b) orthogonal composition. In (a), we show that the increase of noise level positively correlates with the decrease of IoU scores, but their directions are not exactly the same. The IoU attack method iteratively finds the intersection points (i.e., intermediate images) between each contour line of noise increase and IoU decrease. These intermediate images gradually decrease IoU scores with the lowest amount of noise. In (b), we show the orthogonal composition during each iteration. We generate noise hypotheses tangentially according to the current contour line (i.e., #1) and add a small amount of noise in the normal direction (i.e., #2). The intersection point is identified from the hypothesis that yields the lowest IoU at the same noise level. The updated perturbation in each iteration is the composition of #1 and #2.


The proposed IoU attack makes the prediction results of a tracker deviate from its original performance. This is because, in the tracking scenario, only one ground-truth bounding box (bbx) is available (i.e., the bbx annotation on the first frame). We define the original performance of a tracker as the bbx it predicts on each frame without added noise. By adding heavy noise perturbations, we make the same tracker predict another bbx and compute the spatial IoU score based on these two bbxs. Meanwhile, we use the bbx from the current frame and the one from the previous frame to compute a temporally coherent IoU score, which is then fused with the spatial IoU score. As state-of-the-art trackers demonstrate premier performance on the benchmarks, gradually decreasing the IoU scores over consecutive video frames indicates that their tracking performance deteriorates significantly. The IoU measurement suits different trackers as long as they predict one bbx for each frame.


3.2. IoU Attack

Figure 2 shows an intuitive view of how the proposed IoU attack gradually decreases the IoU scores between frames. Given a clean input frame, we first add heavy uniform noise to it to generate a heavy noise image where the IoU score is low. Along the direction from the clean image to the heavy noise image, the IoU scores gradually decrease while the noise level increases. The direction of IoU decrease positively correlates with that of noise increase, but they are not exactly the same. IoU attack aims to progressively find a decreased IoU score while introducing the lowest amount of noise. The contour lines of IoU shown in Figure 2(a) indicate the tracker performance with regard to different noise perturbations, which cannot be explicitly modeled in practice. From another perspective, IoU attack aims to identify the one specific noise perturbation that leads to the lowest IoU score among perturbations of the same noise level. The identification process is carried out by the orthogonal composition described as follows.




We denote the original image on the t-th frame as I_0, the heavy noise image as H, and the intermediate image at the k-th iteration as I_k. In the (k+1)-th iteration, we first randomly generate several Gaussian noise samples η ∼ N(0, 1) and select the tangential perturbation η_j from n of them as:
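The displayed equation did not survive in this copy. A form consistent with the definitions that follow (d as a pixel-wise distance, with every candidate held at the noise level of I_k) would be the constraint below. It is a reconstruction under that assumption, not necessarily the authors' exact notation.

```latex
d(I_0,\, I_k + \eta_j) = d(I_0,\, I_k), \qquad j \in \{1, 2, \ldots, n\} \tag{1}
```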



where d is the pixel-wise distance measurement between two images. Eq. 1 ensures that the tangential perturbations are at the same noise level. The selected η_j (j ∈ [1, 2, ..., n]) is the perturbation tangential to the contour line of noise level at the point I_k. We generate one image I_j (j ∈ [1, 2, ..., n]) from each η_j and use the tracker to predict a bbx B_j^t on it. Then, we define the IoU score S_IoU as:
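The equation is missing here as well. The text only states that a scalar λ balances the spatial and temporal scores, so the convex combination below is one plausible form and should be read as an assumption.

```latex
S_{\mathrm{IoU}} = \lambda\, S_{\mathrm{spatial}} + (1 - \lambda)\, S_{\mathrm{temporal}} \tag{2}
```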



where S_spatial denotes the spatial IoU score between the predicted bbx B_j^t and the original noise-free bbx B_orig^t at the t-th frame, S_temporal denotes the temporal IoU score with the original noise-free bbx B_orig^{t-1} at the (t-1)-th original frame, and λ is the scalar that balances the influence of the spatial and temporal IoU scores. We attack S_IoU to perform both image content attack and temporal motion attack. In total, we obtain n IoU scores and select the I_j whose S_IoU is lowest. An example of η_j is visualized as #1 in Figure 2(b).
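The two scores can be sketched in a few lines of Python. This is a minimal illustration, not the released implementation: the (x, y, width, height) box layout, the function names, the default `lam=0.5`, and the convex weighting are all assumptions made for the example.

```python
from typing import Tuple

# Box layout (x, y, width, height) is an assumption for this sketch.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def fused_iou_score(pred_t: Box, orig_t: Box, orig_prev: Box, lam: float = 0.5) -> float:
    """Blend spatial and temporal IoU; the convex weighting is assumed."""
    s_spatial = box_iou(pred_t, orig_t)      # prediction on noisy frame vs. clean frame t
    s_temporal = box_iou(pred_t, orig_prev)  # prediction on noisy frame vs. clean frame t-1
    return lam * s_spatial + (1.0 - lam) * s_temporal
```

A lower fused score means the perturbed prediction has moved away from both the current and the previous noise-free boxes.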


After obtaining the tangential perturbation, we treat η_j as the neighboring hypothesis based on I_k and move I_k + η_j towards the heavy noise image H as:
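The update equation is missing from this copy. Based on the description that follows (a step of size ε along ψ(H, I_k + η_j), composed with the tangential perturbation), a consistent reconstruction would be the following. The exact form and its equation number are assumptions.

```latex
I_{k+1}^{j} = (I_k + \eta_j) + \epsilon \cdot \psi(H,\, I_k + \eta_j)
```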


where ε controls the step size towards H, and ε · ψ(H, I_k + η_j) is the perturbation following the noise increase direction (i.e., the direction normal to the contour line of noise level). We adjust the parameter ε moderately to limit the variation of perturbations. An example of ε · ψ(H, I_k + η_j) is visualized as #2 in Figure 2(b). To this end, I_{k+1}^j is the intermediate image at the (k+1)-th iteration, consisting of the composed perturbations from both tangential and normal directions. We continue iterating until the IoU score falls below the predefined threshold or the perturbations exceed the maximum. We transfer the learned perturbations P_t to the next few frames. The learned perturbations become the initial perturbations, which are added to I_0 of the (t+1)-th frame to encode temporal motion attack from previous frames. The pseudo code of the black-box IoU attack is shown in Algorithm 1.
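Algorithm 1 is not reproduced in this copy, so the following is only a rough sketch of a single orthogonal-composition iteration, not the authors' code. It assumes that d is the ℓ2 distance, that ψ(H, x) = H − x, and that Gaussian noise is scaled by a hypothetical `sigma` relative to the current noise level. `score_fn` is a placeholder for "run the tracker on the candidate image and return S_IoU".

```python
import numpy as np


def tangential_candidate(i0, ik, rng, sigma=0.05):
    """Perturb I_k with Gaussian noise, then rescale so d(I_0, .) is unchanged."""
    radius = np.linalg.norm(ik - i0)  # current noise level d(I_0, I_k), l2 assumed
    if radius == 0.0:                 # no noise yet, so there is nothing to rotate
        return ik.copy()
    eta = rng.normal(0.0, 1.0, size=ik.shape) * sigma * radius
    diff = (ik + eta) - i0
    # Project back onto the sphere of the same radius around I_0.
    return i0 + diff * (radius / np.linalg.norm(diff))


def orthogonal_step(i0, ik, heavy, score_fn, n=10, sigma=0.05, eps=0.1, rng=None):
    """One iteration: pick the best of n tangential candidates, then step towards H."""
    rng = np.random.default_rng() if rng is None else rng
    candidates = [tangential_candidate(i0, ik, rng, sigma) for _ in range(n)]
    scores = [score_fn(c) for c in candidates]    # one tracker query per candidate
    best = int(np.argmin(scores))                 # lowest S_IoU at this noise level
    chosen = candidates[best]
    next_image = chosen + eps * (heavy - chosen)  # normal step, psi(H, x) = H - x assumed
    return next_image, scores[best]
```

In a full attack, this step would be repeated until the score falls below the threshold or a perturbation budget is reached, and the final `next_image - i0` would seed the next frame. Pixel-range clipping is omitted here for brevity.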


3.3. Discussions and Visualizations

Figure 3. Variations of adversarial perturbations during IoU attack. The 3D response map above the image represents the difference between the original image and the adversarial example at different IoU scores. The IoU score decreases as the magnitude of perturbations increases. The variations of perturbations are illustrated in the first row. The orthogonal composition is shown in the second row, including the tangential direction and the normal direction. The image framed in green represents the minimal IoU score in the tangential direction, and the image framed in yellow represents the moving step towards the heavy noise image.

Table 1. Comparison of tracking results with original sequences, random noise, and IoU attack of SiamRPN++ [22], DiMP [1] and LTMU [5] respectively on the VOT2019 [21] dataset.

In this section, we visualize the variations of adversarial perturbations during IoU attack in Figure 3. Given an original image, we iteratively inject the adversarial perturbation as shown in the first row of Figure 3. As the adversarial perturbations increase, the adversarial example makes the target drift from the original result and leads the IoU scores to decrease. We compute the cosine distance between the perturbations from two consecutive intermediate images. The cosine distance indicates that the generated perturbations follow an increasing trend without fluctuation, which effectively reduces the number of queries in our black-box attack. For each iteration, we visualize the concrete orthogonal composition between the consecutive intermediate images, as shown in the second row of Figure 3. We introduce several candidate images according to Eq. 1 and select the one with the minimal IoU score as the tangential direction (i.e., #1). Then, we move towards the heavy noise image in trials to make sure the IoU score decreases. We adjust the weight in Eq. 5 to constrain the variation of perturbation and output the result as the normal direction (i.e., #2). These two directions form the orthogonal composition during each iteration. As a result, we expect the final perturbation to carry less noise than heavy random noise does while still decreasing the IoU scores heavily. In other words, our IoU attack degrades the IoU scores more while injecting fewer perturbations.
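The cosine distance between consecutive perturbations can be sketched as follows. It assumes that the perturbation of an intermediate image is its difference from the clean frame and that cosine distance means one minus cosine similarity. The function names are hypothetical.

```python
import numpy as np


def cosine_distance(p, q):
    """1 - cosine similarity of two perturbations, flattened to vectors."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    denom = np.linalg.norm(p) * np.linalg.norm(q)
    if denom == 0.0:  # at least one perturbation is all zeros
        return 1.0
    return 1.0 - float(p @ q) / denom


def consecutive_perturbation_distance(i0, ik, ik_next):
    """Compare the perturbations (I_k - I_0) and (I_{k+1} - I_0)."""
    return cosine_distance(ik - i0, ik_next - i0)
```

Values near zero mean that successive perturbations point in nearly the same direction, which matches the steady trend described above.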


4. Experiments

We validate the performance of our IoU attack on six challenging datasets: VOT2019 [21], VOT2018 [19], VOT2016 [20], OTB100 [42], NFS [18], and VOT2018-LT [19]. Detailed results are provided as follows.

4.1. Experiment Setup

Deployment of Trackers. To validate the generality of our black-box adversarial attack, we choose three representative trackers with different structures: SiamRPN++ [22], DiMP [1], and LTMU [5]. SiamRPN++ is a typical detection based tracker with a siamese network. It compares the similarity between a target template and a search region using the region proposal network. The end-to-end learned tracker DiMP exploits both target and background appearance information to locate the target precisely. LTMU is a long-term tracker that uses a meta-updater to update the tracker online for target prediction.


Table 2. Comparison of tracking results with original sequences, random noise, and IoU attack of SiamRPN++ [22], DiMP [1] and LTMU [5] respectively on the VOT2018 [19] dataset.

Table 3. Comparison of tracking results with original sequences, random noise, and IoU attack of SiamRPN++ [22], DiMP [1] and LTMU [5] respectively on the VOT2016 [20] dataset.

Table 4. Comparison of tracking results with original sequences, random noise, and IoU attack of SiamRPN++ [22], DiMP [1] and LTMU [5] respectively on the OTB100 [42] dataset.

Table 5. Comparison of tracking results with original sequences, random noise, and IoU attack of SiamRPN++ [22], DiMP [1] and LTMU [5] respectively on the NFS30 [18] dataset.


Implementation Details. We generate the heavy noise image by injecting uniform noise into the clean image. The degradation of tracking is not sensitive to the type of initial random noise at the same noise level. We stop the iterative perturbation update when the IoU score is below the predefined score or the perturbations exceed the maximum. Overall, the average query numbers of IoU attack are 21.2, 31.4 and 54.2 per frame for SiamRPN++, DiMP and LTMU, respectively.

4.2. Overall Attack Results


We evaluate the three trackers on the VOT2019 [21] dataset, which consists of 60 challenging sequences. Unlike other datasets, the VOT dataset has a re-initialization module. When the tracker loses the target (i.e., the overlap between the predicted result and the annotation is zero), the tracker is re-initialized with the ground truth. Failures counts the number of re-initializations. Accuracy is the average overlap ratio over successfully tracked frames. Robustness measures how often the tracker loses the target overall. In addition, Expected Average Overlap (EAO) is evaluated as a combination of Accuracy and Robustness.


Table 1 shows the performance drops after IoU attack. We first test all trackers on the original sequences. Then we apply our IoU attack method to generate the adversarial examples and evaluate the tracking results. SiamRPN++ incurs more failures than in its original results, and its EAO score drops from 0.287 to 0.124. DiMP obtains a 16.5% drop in its accuracy score, which indicates that our attack method leads to an obvious drift. Its EAO score also drops dramatically from 0.332 to 0.195. Similarly, our IoU attack method reduces the EAO score of LTMU from 0.201 to 0.150. For further comparison, we also conduct experiments that inject the same level of random noise into the original sequences. Our generated perturbations decrease the IoU scores more dramatically than random noise.
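As a small worked example, the absolute EAO values quoted above can be converted into relative drops. The percentages are derived here by arithmetic from those numbers and are not figures reported for VOT2019 in the text. `relative_drop` is a helper introduced only for this illustration.

```python
def relative_drop(before: float, after: float) -> float:
    """Relative decrease in percent: (before - after) / before * 100."""
    return (before - after) / before * 100.0


# EAO before and after IoU attack on VOT2019, as quoted in the paragraph above.
eao = {
    "SiamRPN++": (0.287, 0.124),
    "DiMP": (0.332, 0.195),
    "LTMU": (0.201, 0.150),
}

for name, (before, after) in eao.items():
    print(f"{name}: {relative_drop(before, after):.1f}% relative EAO drop")
```

This gives roughly 56.8% for SiamRPN++, 41.3% for DiMP, and 25.4% for LTMU.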


There are 60 different sequences in the VOT2018 [19] dataset. All the trackers perform favorably on the original sequences. LTMU performs worse than the other two trackers since its re-detection module yields more re-initializations in the VOT toolkit. Table 2 shows that the performance of these trackers deteriorates markedly under IoU attack. Concretely, the accuracies of all three trackers get worse after the adversarial attack, which indicates that the trackers indeed deviate from their original results. The primary metric, EAO, is reduced by 68.8%, 41.9%, and 38.5% for SiamRPN++, DiMP and LTMU, respectively.


Figure 4. Precision and recall plots of IoU attack for SiamRPN++ [22], DiMP [1] and LTMU [5] respectively on the VOT2018-LT dataset [19]. We use Attack and Random to denote IoU attack and the same level of random noise. The legend is ranked by F-score.


Similarly, we also apply the IoU attack method to the VOT2016 dataset [20], as shown in Table 3. On the original sequences, these trackers perform much better on this dataset than on the above two datasets. However, IoU attack still reduces the EAO by 60.3%, 43.0%, and 28.0% for SiamRPN++, DiMP and LTMU, respectively. Our IoU attack is more effective than the same level of random noise.

OTB100. The OTB100 [42] dataset includes 100 fully annotated video sequences. The evaluation uses two main metrics, success and precision, under one-pass evaluation (OPE). We compare the results before and after IoU attack in Table 4. With IoU attack, the AUC scores of success decline significantly, to 71.8%, 88.2% and 76.9% of the original results for SiamRPN++, DiMP and LTMU, respectively. In contrast, the AUC scores with random noise remain at 90.8%, 98.2% and 92.6% of the original results, respectively.
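For readers unfamiliar with the success metric, the sketch below computes a success-plot AUC from per-frame IoU values. It follows a common convention, which is assumed here rather than taken from the text: overlap thresholds from 0 to 1 in steps of 0.05, and AUC taken as the mean success rate over those thresholds.

```python
import numpy as np


def success_auc(ious, thresholds=None):
    """Mean success rate over overlap thresholds (threshold grid is an assumption)."""
    ious = np.asarray(ious, dtype=float)
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 21)  # 0, 0.05, ..., 1.0
    # Success rate at threshold t: fraction of frames whose IoU exceeds t.
    success = np.array([(ious > t).mean() for t in thresholds])
    return float(success.mean())


def retained_ratio(auc_after: float, auc_before: float) -> float:
    """AUC after attack as a percentage of the original AUC."""
    return auc_after / auc_before * 100.0
```

`retained_ratio` corresponds to the "percentage of the original results" figures quoted above, such as 71.8% for SiamRPN++.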


NFS30. We also conduct IoU attack on the NFS30 [18] dataset, which consists of 100 videos at 30 FPS with an average length of 479 frames. All sequences are manually labeled with nine attributes, such as occlusion and fast motion. We adopt the same metrics used for the OTB100 dataset, as shown in Table 5. According to the AUC metric of success, SiamRPN++ obtains a 22.6% decrease after IoU attack, while injecting the same level of random noise causes an 8.4% decrease. DiMP shows an 11.2% decrease compared to a 3.7% decrease with random noise. LTMU shows a 26.8% decrease after IoU attack and an 8.2% decrease with random noise. IoU attack thus causes roughly three times the drop of the same level of random noise.


Table 6. Ablation studies on IoU attack for SiamRPN++ [22], DiMP [1] and LTMU [5] on the VOT2018 [19] and VOT2016 [20] datasets. S_temporal represents the temporal IoU score and P_{t-1} represents the learned perturbation from historical frames.

VOT2018-LT. To further verify the effectiveness of our IoU attack, we evaluate the three trackers on a more challenging dataset, VOT2018-LT [19]. It has 35 sequences with an average length of 4200 frames, which is much longer than other datasets and closer to practical applications. Each tracker needs to output a confidence score for the target being present and a predicted bounding box in each frame. Precision (P) and recall (R) are evaluated over a series of confidence thresholds, and the F-score is calculated as F = 2P · R/(P + R). The primary long-term tracking metric is the highest F-score among all thresholds. Figure 4 shows the precision and recall at different confidence thresholds before and after IoU attack. The results in the legend are ranked by F-score. Both precision and recall drop significantly after our IoU attack on the three trackers. Our IoU attack method reduces the F-score by 27.5%, 27.3% and 14.8% for SiamRPN++, DiMP and LTMU, respectively. All trackers perform worse after IoU attack than after injection of the same level of random noise. Our black-box attack method thus also proves effective for long-term tracking.
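The F-score rule stated above, F = 2P · R/(P + R) with the maximum taken over confidence thresholds, can be written directly. The function names and the handling of the P + R = 0 case are choices made for this sketch.

```python
from typing import Sequence


def f_score(precision: float, recall: float) -> float:
    """F = 2PR / (P + R); defined as 0 when both P and R are zero."""
    total = precision + recall
    return 2.0 * precision * recall / total if total > 0 else 0.0


def best_f_score(precisions: Sequence[float], recalls: Sequence[float]) -> float:
    """Highest F-score over paired (P, R) values, one pair per confidence threshold."""
    return max(f_score(p, r) for p, r in zip(precisions, recalls))
```

Each (P, R) pair here would come from one confidence threshold on the tracker's target-present score.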

4.3. Ablation Studies

To explore the temporal motion of visual object tracking in black-box attack, we compare the IoU attack method with and without the temporal IoU scores in Eq. 2, as reported in Table 6. With the help of temporal IoU scores, the deep trackers obtain worse tracking accuracies on multiple datasets than when only the spatial IoU scores are used. In addition, we transfer the learned perturbation P_{t-1} into the temporally consistent motion attack. We compare the IoU attack method with and without the transfer of the historical perturbation P_{t-1} in Table 6. The overall performance metric EAO indicates that transferring the previous perturbation as initialization indeed improves the attack and decreases the tracking accuracies. In addition, we illustrate the attack performance under varying perturbation magnitudes on the VOT2018 [19] dataset, as shown in Figure 5. The perturbations are measured by the ℓ2 norm. The failure rate is the average proportion of failure frames over the whole video. We observe that the tracking performance under attack gets worse as the perturbations increase.

Table 7. Comparison with existing white-box and black-box attack methods for SiamRPN++ [22] with ResNet on the OTB100 [42] dataset.

Figure 5. Comparison of EAO scores and failure rates under different perturbations for SiamRPN++ [22] on the VOT2018 [19] dataset.

4.4. Comparison with Other Methods

Table 7 reports the comparison with existing white-box and black-box attack methods. Our black-box attack method, which has no access to the network architecture of trackers, performs slightly worse than the white-box attack method CSA [44] for tracking. SPARK [14] performs a transfer-based black-box attack and obtains a 6.6% success drop and a 2.7% precision drop. Our decision-based black-box attack significantly outperforms SPARK. In addition, we apply the perturbation from UAP [26], which is designed for attacking classification in static images, frame by frame. Our method, which considers the temporal motion changes of the target objects, achieves a much stronger attack.

4.5. Qualitative Results

Figure 6 qualitatively shows the tracking results of our IoU attack for SiamRPN++ [22], DiMP [1] and LTMU [5] on three challenging sequences. We visualize the original tracking results in (a), the results with the same level of random noise in (b), and the results of our IoU attack in (c). In the original images, all these trackers locate the target objects and estimate the scale changes accurately. On the adversarial examples, these trackers estimate the target location inaccurately. However, the same level of random noise does not make the trackers drift, as shown in the second column. This indicates that the proposed IoU attack generates optimized perturbations while staying at the same noise level as the random noise.



Figure 6. Qualitative results of IoU attack on three challenging sequences from the OTB100 [42] dataset: (a) original results, (b) random results, (c) attack results.

5. Concluding Remarks

In this paper, we propose an IoU attack method in the black-box setting to generate adversarial examples for visual object tracking. Without access to the network architecture of deep trackers, we iteratively adjust the direction of light-weight noise according to the predicted IoU scores of bounding boxes, which involve temporal motion in historical frames. Furthermore, we transfer the perturbations into the next frames to improve the effectiveness of the attack. We apply the proposed method to three representative state-of-the-art trackers to illustrate the generality of our black-box adversarial attack for visual object tracking. Extensive experiments on standard benchmarks demonstrate the effectiveness of the proposed black-box IoU attack. We believe this work helps to evaluate the robustness of visual object tracking.

