[Read online]
Figure 1. Comparison between the background-aware correlation filter (BACF) and the proposed ARCF tracker. The central figure demonstrates the differences between the previous and current response maps on group1_1 from UAV123@10fps. Sudden changes in the response maps indicate aberrances. When aberrances take place, BACF tends to lose track of the object, while the proposed ARCF can repress aberrances so that this kind of drifting is avoided.
Figure 2. Main workflow of the proposed ARCF tracker. It learns both positive samples (green) of the object and negative samples (red) extracted from the background, and the response map restriction is integrated into the learning process so that aberrances in response maps can be repressed. $[\psi_{p,q}]$ serves to shift the generated response map so that the peak position in the previous frame coincides with that of the current frame, and thus the position of the detected object does not affect the restriction.
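To make the response-map restriction concrete, here is a minimal NumPy sketch (not the authors' code) of the aberrance measure suggested by the caption: the previous response map is shifted so that its peak aligns with the current one, and the squared difference between the two maps is taken as the penalty.

```python
# Minimal NumPy sketch of the response-map restriction idea: align the previous
# response map to the current peak (the role of [psi_{p,q}] in Fig. 2), then
# measure the discrepancy. A large discrepancy signals an aberrance.
import numpy as np

def aberrance_penalty(prev_response, curr_response):
    prev_peak = np.unravel_index(np.argmax(prev_response), prev_response.shape)
    curr_peak = np.unravel_index(np.argmax(curr_response), curr_response.shape)
    # Circularly shift the previous map so both peaks coincide.
    shift = (curr_peak[0] - prev_peak[0], curr_peak[1] - prev_peak[1])
    aligned_prev = np.roll(prev_response, shift, axis=(0, 1))
    return np.sum((curr_response - aligned_prev) ** 2)

# Toy usage: two Gaussian-like response maps with slightly different peaks.
yy, xx = np.mgrid[0:50, 0:50]
prev = np.exp(-((yy - 20) ** 2 + (xx - 25) ** 2) / 30.0)
curr = np.exp(-((yy - 22) ** 2 + (xx - 27) ** 2) / 30.0)
print(aberrance_penalty(prev, curr))
```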
[Read online]
Figure 1: Motivation of the proposed visual tracker. Our framework incorporates a meta-learner network along with a matching network. The meta-learner network receives meta information from the matching network and provides the matching network with the adaptive target-specific feature space needed for robust matching and tracking.
Figure 2: Overview of proposed visual tracking framework. The matching network provides the meta-learner network with meta-information in the form of loss gradients obtained using the training samples. Then the meta-learner network provides the matching network with target-specific information in the form of convolutional kernels and channel-wise attention.
Figure 3: Training scheme of the meta-learner network. The meta-learner network uses the loss gradients $\delta$ in (2) as meta information, derived from the matching network, which describes its own status in the current feature space [35]. Then, the function $g(\cdot)$ in (3) learns the mapping from this loss gradient to the adaptive weights $w_{\text{target}}$, which describe the target-specific feature space. The meta-learner network can be trained by minimizing the loss function in (7), which measures how accurately the adaptive weights $w_{\text{target}}$ fit new examples $\{z_1, \dots, z_{M'}\}$.
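A hedged PyTorch sketch of this idea (layer sizes and module names are illustrative, not the paper's): the loss gradient $\delta$ from the matching network is fed to a small network $g(\cdot)$ that outputs target-specific convolution kernels $w_{\text{target}}$.

```python
# Illustrative sketch: map loss gradients (meta information) to adaptive
# target-specific convolution kernels that adapt the matching network.
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim, num_kernels = 64, 32

# g(.): a small MLP from the loss gradient to adaptive 1x1 kernels.
g = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                  nn.Linear(256, num_kernels * feat_dim))

# Matching-network features for a few training samples and a stand-in loss.
feats = torch.randn(8, feat_dim, requires_grad=True)
scores = feats.sum(dim=1)
loss = F.mse_loss(scores, torch.ones(8))

# Meta information: gradient of the loss w.r.t. the features, averaged.
delta, = torch.autograd.grad(loss, feats)
delta = delta.mean(dim=0)

# Adaptive weights w_target describing the target-specific feature space.
w_target = g(delta).view(num_kernels, feat_dim, 1, 1)

# The matching network applies w_target as extra convolutional kernels.
test_feat = torch.randn(1, feat_dim, 22, 22)
adapted = F.conv2d(test_feat, w_target)
print(adapted.shape)  # torch.Size([1, 32, 22, 22])
```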
[Read online] [Code]
Figure 1: Joint online detection and tracking in 3D. Our dynamic 3D tracking pipeline predicts 3D bounding box association of observed vehicles in image sequences captured by a monocular camera with an ego-motion sensor.
Figure 2: Overview of our monocular 3D tracking framework. Our online approach processes monocular frames to estimate and track regions of interest (RoIs) in 3D (a). For each RoI, we learn to estimate the 3D layout (i.e., depth, orientation, dimensions, and a projection of the 3D center) (b). With the 3D layout, our LSTM tracker produces robust linking across frames leveraging occlusion-aware association and depth-ordering matching (c). With the help of 3D tracking, the model further refines its 3D estimation by fusing object motion features from previous frames (d).
Figure 3: Illustration of depth-ordering matching. Given the tracklets and detections, we sort them into a list by depth order. For each detection of interest (DOI), we calculate the IoU between the DOI and the non-occluded regions of each tracklet. The depth order naturally assigns higher probabilities to tracklets near the DOI.
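A hedged NumPy sketch of depth-ordering matching (not the authors' implementation): tracklet boxes are rasterized from near to far so each keeps only its non-occluded pixels, and the DOI is scored by IoU against those visible pixels.

```python
# Illustrative depth-ordering matching: nearer tracklets occlude farther ones,
# and each tracklet is scored by IoU between the DOI and its visible region.
import numpy as np

def box_mask(box, shape):
    """Binary mask of an axis-aligned box (x1, y1, x2, y2) on an image grid."""
    m = np.zeros(shape, dtype=bool)
    x1, y1, x2, y2 = map(int, box)
    m[y1:y2, x1:x2] = True
    return m

def depth_ordered_scores(doi_box, tracklets, shape=(480, 640)):
    """tracklets: list of (box, depth); returns IoU of the DOI with each
    tracklet's non-occluded region, iterating from near to far."""
    doi = box_mask(doi_box, shape)
    occupied = np.zeros(shape, dtype=bool)
    scores = {}
    for idx, (box, _) in sorted(enumerate(tracklets), key=lambda t: t[1][1]):
        visible = box_mask(box, shape) & ~occupied   # non-occluded region
        occupied |= box_mask(box, shape)
        union = (doi | visible).sum()
        scores[idx] = (doi & visible).sum() / union if union else 0.0
    return scores

tracklets = [((100, 100, 200, 220), 8.0),    # (box, depth in metres)
             ((150, 110, 260, 230), 15.0)]
print(depth_ordered_scores((140, 105, 250, 225), tracklets))
```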
Figure 4: Illustration of occlusion-aware association. A tracked tracklet (yellow) is visible all the time, while another tracklet (red) is occluded by a third (blue) at frame $T-1$. During occlusion, the tracklet does not update its state but keeps inferring motion until reappearance. A truncated or disappeared tracklet (blue at frame $T$) is marked as lost.
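A minimal sketch of this occlusion-aware state update (field names are illustrative): an occluded tracklet keeps extrapolating its motion instead of updating from observations, and a truncated tracklet is marked lost.

```python
# Illustrative occlusion-aware tracklet update, matching the three cases in Fig. 4.
from dataclasses import dataclass

@dataclass
class Tracklet:
    position: float         # 1-D stand-in for the 3D state
    velocity: float
    status: str = "tracked"  # "tracked" | "occluded" | "lost"

def update(t, observation=None, occluded=False, truncated=False):
    if truncated:
        t.status = "lost"                    # blue tracklet at frame T
    elif occluded or observation is None:
        t.status = "occluded"                # red tracklet at frame T-1
        t.position += t.velocity             # keep inferring motion
    else:
        t.status = "tracked"
        t.velocity = observation - t.position
        t.position = observation
    return t

t = Tracklet(position=10.0, velocity=1.0)
update(t, observation=11.2)   # visible: update from the detection
update(t, occluded=True)      # occluded: propagate motion only
print(t)
```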
[Read online] [Source code] [Paper notes by sinat_31184961]
Figure 1. Our ‘Skimming-Perusal’ long-term tracking framework. Better viewed in color with zoom-in.
Figure 2. The adopted SiameseRPN module in our framework. Better viewed in color with zoom-in.
Figure 4. The network architecture of our skimming module. Better viewed in color with zoom-in.
[Read online]
Figure 1. Confidence maps of the target object (red box) provided by the target model obtained using i) a Siamese approach (middle), and ii) our approach (right). The model predicted in a Siamese fashion, using only target appearance, struggles to distinguish the target from distractor objects in the background. In contrast, our model prediction architecture also integrates background appearance, providing superior discriminative power.
Figure 2. An overview of the target classification branch in our tracking architecture. Given an annotated training set (top left), we extract deep feature maps using a backbone network followed by an additional convolutional block (Cls Feat). The feature maps are then input to the model predictor D, consisting of the initializer and the recurrent optimizer module. The model predictor outputs the weights of the convolutional layer which performs target classification on the feature map extracted from the test frame.
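A hedged PyTorch sketch of the model-predictor idea (sizes and the number of optimization steps are illustrative): an initial classification filter is estimated from the annotated training features and refined by a few gradient steps on a discriminative loss before being applied to the test-frame features.

```python
# Illustrative model predictor: pool an initial filter from the target region,
# refine it with a few gradient steps, then classify the test-frame features.
import torch
import torch.nn.functional as F

C, H, W = 32, 18, 18
train_feat = torch.randn(4, C, H, W)      # features of the training frames
train_label = torch.zeros(4, 1, H, W)     # target-centred labels
train_label[:, 0, 8:11, 8:11] = 1.0

# "Initializer": pool target-region features into a starting filter.
filt = train_feat[:, :, 7:12, 7:12].mean(dim=0, keepdim=True).mean((2, 3), keepdim=True)
filt = filt.expand(1, C, 5, 5).clone().requires_grad_(True)

# "Recurrent optimizer": a few gradient-based updates of the filter.
for _ in range(5):
    score = F.conv2d(train_feat, filt, padding=2)
    loss = F.mse_loss(score, train_label)
    grad, = torch.autograd.grad(loss, filt)
    filt = (filt - 1.0 * grad).detach().requires_grad_(True)

# Target classification on the test frame with the predicted filter weights.
test_feat = torch.randn(1, C, H, W)
confidence = F.conv2d(test_feat, filt, padding=2)
print(confidence.shape)  # torch.Size([1, 1, 18, 18])
```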
[Read online]
Figure 1. In contrast to the classical DCF paradigm, our GFSDCF performs channel and spatial group feature selection for the learning of correlation filters. Group sparsity is enforced in the channel and spatial dimensions to highlight relevant features with enhanced discrimination and interpretability. Additionally, a low-rank temporal smoothness constraint is employed across temporal frames to improve the stability of the learned filters.
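A schematic NumPy sketch (not the exact GFSDCF objective) of the group-sparsity penalties mentioned in the caption: an $\ell_{2,1}$-style norm over channel groups and over spatial positions encourages whole channels or spatial locations of the filter to be switched off, i.e., feature selection.

```python
# Schematic group-sparsity penalties on a stack of per-channel filters.
import numpy as np

def channel_group_sparsity(filters):
    """filters: (C, H, W); sum over channels of each channel's l2 energy."""
    return np.sum(np.sqrt(np.sum(filters ** 2, axis=(1, 2))))

def spatial_group_sparsity(filters):
    """l2 over channels at each spatial position, summed over positions."""
    return np.sum(np.sqrt(np.sum(filters ** 2, axis=0)))

w = np.random.randn(42, 31, 31)
print(channel_group_sparsity(w), spatial_group_sparsity(w))
```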
[Read online]
Figure 1. The motivation of our algorithm. Images in the first and third columns are target patches in SiameseFC. The other images show the absolute values of their gradients, where the red regions have large gradients. As we can see, the gradient values can reflect the target variations and background clutter.
Figure 3. The pipeline of the proposed algorithm, which consists of two branches. The bottom branch extracts the features of the search region X, and the top branch (named the update branch) is responsible for template generation. The two purple trapezoids in the figure represent sub-nets with shared parameters; the solid and dotted lines represent forward and backward propagation, respectively.
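A hedged PyTorch sketch of the gradient-guided template update the caption describes (shapes and the single update step are illustrative): the loss on the search region is backpropagated to the template feature, and the gradient is used to produce an updated template.

```python
# Illustrative gradient-guided template update: backward pass (dotted line in
# Fig. 3) yields a gradient on the template, which drives the update branch.
import torch
import torch.nn.functional as F

template = torch.randn(1, 32, 6, 6, requires_grad=True)  # initial template feature
search = torch.randn(1, 32, 22, 22)                      # search-region feature X
label = torch.zeros(1, 1, 17, 17)
label[0, 0, 8, 8] = 1.0

response = F.conv2d(search, template)                    # cross-correlation
loss = F.mse_loss(response, label)

grad, = torch.autograd.grad(loss, template)              # backward propagation
updated_template = template - 0.1 * grad                 # gradient-guided update
print(updated_template.shape)
```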
[Read online]
Figure 1: (a) The overall architecture of our tracking-by-detection framework. The architecture consists of two branches: one generates target features as guidance, while the other is an ordinary object detector. The two branches are bridged through a Target-Guidance Module (TGM). The blue dotted line represents the traditional object detection process, while the red arrow denotes the proposed guided object detection procedure. (b) The outline of the TGM. The inputs to the module are the exemplar and search image features, and it outputs a modulated feature map with target information incorporated. The follow-up detection process remains intact. Note that the detection model in (a) can be replaced by almost any modern object detector.
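A hedged PyTorch sketch of what a target-guidance module could look like (this broadcast-modulation variant is only illustrative, not the paper's exact design): exemplar features are pooled into a channel descriptor that modulates the detector's feature map, which then flows into the unchanged detection head.

```python
# Illustrative target-guidance module: inject target information into the
# detector's feature map while keeping its spatial shape intact.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TargetGuidanceModule(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, exemplar_feat, search_feat):
        target_desc = F.adaptive_avg_pool2d(exemplar_feat, 1)  # (B, C, 1, 1)
        modulated = search_feat * target_desc                  # inject target info
        return self.fuse(modulated)                            # same shape out

tgm = TargetGuidanceModule(256)
exemplar = torch.randn(1, 256, 7, 7)
search = torch.randn(1, 256, 38, 38)
print(tgm(exemplar, search).shape)  # torch.Size([1, 256, 38, 38])
```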
Figure 2: Overview of our training procedure. In the training stage, we sample triplets of exemplar, support and query images from video frames. Each triplet is sampled chronologically from the same video. We take the exemplar image as the guidance and perform detection on the support and query images. The losses calculated on the support image are used to finetune the meta-layers (i.e., the detector's heads) of our model, and we expect the updated model to generalize and perform well on the query image, which is realized by backpropagating through all parameters of our model based on the losses on the query image. The red arrows represent the backpropagation path during optimization. The inner optimization loop only updates the head layer parameters, while the outer optimization loop updates all parameters in the architecture.
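A hedged PyTorch sketch of this inner/outer optimization (a tiny linear "head" stands in for the detector's head layers): the head is adapted on the support losses, and the query loss is backpropagated through that adaptation to update all parameters.

```python
# Illustrative inner/outer loop: adapt the head on the support set, then update
# all parameters from the query loss through the adaptation.
import torch
import torch.nn.functional as F

backbone_w = torch.randn(8, 16, requires_grad=True)  # "all parameters" (toy)
head_w = torch.randn(1, 8, requires_grad=True)        # meta-layers (detector head)
inner_lr, outer_lr = 0.1, 0.01

def forward(x, head):
    return F.linear(F.relu(F.linear(x, backbone_w)), head)

support_x, support_y = torch.randn(4, 16), torch.randn(4, 1)
query_x, query_y = torch.randn(4, 16), torch.randn(4, 1)

# Inner loop: finetune only the head on the support losses.
support_loss = F.mse_loss(forward(support_x, head_w), support_y)
g, = torch.autograd.grad(support_loss, head_w, create_graph=True)
adapted_head = head_w - inner_lr * g

# Outer loop: the query loss updates all parameters through the adaptation.
query_loss = F.mse_loss(forward(query_x, adapted_head), query_y)
query_loss.backward()
with torch.no_grad():
    backbone_w -= outer_lr * backbone_w.grad
    head_w -= outer_lr * head_w.grad
print(float(query_loss))
```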
Figure 3: Instantiation of our framework on SSD [40]. We employ SSD with VGG-16 [50] as the backbone. The original SSD performs object detection on 6 different convolutional layers with increasing receptive fields, each being responsible for objects of a specific size range. In our work, we only use its first 3 backbone layers, denoted as L1, L2 and L3 in the figure. The target-guidance modules are appended to each layer, with increasing guidance image resolutions that are consistent with the receptive fields. The operators $\phi_1$, $\phi_2$ and $\phi_3$ represent feature extraction at the L1, L2 and L3 layers.
Figure 4: Instantiation of our framework on Faster R-CNN [15]. The exemplar and query images are fed into the backbone and bridged using the target-guidance module, while the subsequent region proposal and RoI classification and regression procedures are kept unchanged. We evaluate the model with either VGG [50] or ResNet [20] as the backbone.
[Read online]
Figure 1: A poster of a Physical Adversarial Texture resembling a photograph causes a tracker's bounding-box predictions to lose track as the target person moves over it.
Figure 2: The Physical Adversarial Texture (PAT) Attack creates adversaries to fool the GOTURN tracker, via minibatch gradient descent to optimize various losses, using randomized scenes following Expectation Over Transformation (EOT).
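A hedged PyTorch sketch of such an EOT-style optimization loop (the tracker loss here is a stand-in, not GOTURN): the texture is updated by minibatch gradient descent on losses averaged over randomly transformed scenes.

```python
# Illustrative EOT loop: optimize a texture over a minibatch of randomly
# transformed renderings of it, as the figure describes.
import torch

texture = torch.rand(3, 64, 64, requires_grad=True)  # adversarial poster
optimizer = torch.optim.Adam([texture], lr=0.01)

def render_scene(tex):
    """Stand-in for pasting the texture into a randomly transformed scene."""
    brightness = 0.5 + torch.rand(1)                  # random transformation
    noise = 0.05 * torch.randn_like(tex)
    return (tex * brightness + noise).clamp(0, 1)

def tracker_loss(scene):
    """Stand-in for a loss that degrades the tracker's bounding-box prediction."""
    return -scene.var()

for step in range(100):
    batch_loss = torch.stack(
        [tracker_loss(render_scene(texture)) for _ in range(8)]).mean()
    optimizer.zero_grad()
    batch_loss.backward()
    optimizer.step()
    with torch.no_grad():
        texture.clamp_(0, 1)                          # keep a printable texture
```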
Figure 1. RGB and depth sequences from CDTB. Depth offers complementary information to color: two identical objects are easier to distinguish in depth (a), low-illumination scenes (b) are less challenging for trackers if depth information is available, tracking a deformable object in depth simplifies the problem (c), and a sudden significant change in depth is a strong clue for occlusion (d). Sequences (a, b) are captured by a ToF-RGB pair of cameras, (c) by a stereo-camera sensor, and (d) by a Kinect sensor.
Finally, thanks to Sophia-11 for maintaining the ICCV 2019 accepted-papers project: https://github.com/Sophia-11/Awesome-ICCV2019