Recently updated: 2019.9.26
Note: all of the papers can be found on Google Scholar.
1. Gradient-Guided Visual Object Tracking
- GradNet: Gradient-Guided Network for Visual Object Tracking (ICCV 2019)
Link: https://arxiv.org/abs/1909.06800
Abstract:
The fully-convolutional siamese network based on template matching has shown great potential in visual tracking. During testing, the template is fixed with the initial target feature and the performance totally relies on the general matching ability of the siamese network. However, this manner cannot capture the temporal variations of targets or background clutter. In this work, we propose a novel gradient-guided network to exploit the discriminative information in gradients and update the template in the siamese network through feed-forward and backward operations. To be specific, the algorithm can utilize the information from the gradient to update the template in the current frame. In addition, a template generalization training method is proposed to better use gradient information and avoid overfitting. To our knowledge, this work is the first attempt to exploit the information in the gradient for template update in siamese-based trackers. Extensive experiments on recent benchmarks demonstrate that our method achieves better performance than other state-of-the-art trackers.
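As a rough illustration of the template-update idea, the sketch below (PyTorch, toy shapes) computes a matching loss on the current search region and back-propagates it to the template feature. GradNet itself learns a sub-network that maps this gradient into the template update, so the plain gradient step and all tensor sizes here are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def gradient_guided_update(z, x, label, step=0.01):
    """One gradient-guided refinement of the template feature z.

    z: template feature (1, C, Hz, Wz); x: search-region feature (1, C, Hx, Wx);
    label: Gaussian-like target response for the search region.
    GradNet learns how to turn this gradient into a template update; the raw
    gradient step taken here only illustrates the signal it exploits.
    """
    z = z.clone().requires_grad_(True)
    response = F.conv2d(x, z)              # Siamese matching: template as correlation kernel
    loss = F.mse_loss(response, label)     # how well the current template explains the frame
    loss.backward()                        # discriminative information flows into z.grad
    with torch.no_grad():
        z_updated = z - step * z.grad      # move the template against the loss gradient
    return z_updated.detach()

# toy shapes: 256-channel features, 6x6 template, 22x22 search region -> 17x17 response
z = torch.randn(1, 256, 6, 6)
x = torch.randn(1, 256, 22, 22)
label = torch.zeros(1, 1, 17, 17)
label[0, 0, 8, 8] = 1.0                    # assumed target location in the response map
z_new = gradient_guided_update(z, x, label)
```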
- Target-Aware Deep Tracking (CVPR 2019)
Link: http://openaccess.thecvf.com/content_CVPR_2019/html/Li_Target-Aware_Deep_Tracking_CVPR_2019_paper.html
Abstract:
Existing deep trackers mainly use convolutional neural networks pre-trained for the generic object recognition task for representations. Despite demonstrated successes for numerous vision tasks, the contributions of using pre-trained deep features for visual tracking are not as significant as those for object recognition. The key issue is that in visual tracking the targets of interest can be arbitrary object classes with arbitrary forms. As such, pre-trained deep features are less effective in modeling these targets of arbitrary forms for distinguishing them from the background. In this paper, we propose a novel scheme to learn target-aware features, which can better recognize targets undergoing significant appearance variations than pre-trained deep features. To this end, we develop a regression loss and a ranking loss to guide the generation of target-active and scale-sensitive features. We identify the importance of each convolutional filter according to the back-propagated gradients and select the target-aware features based on activations for representing the targets. The target-aware features are integrated with a Siamese matching network for visual tracking. Extensive experimental results show that the proposed algorithm performs favorably against the state-of-the-art methods in terms of accuracy and speed.
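A minimal sketch of the gradient-based channel selection described above, in PyTorch with made-up shapes: a small 1x1 regression head stands in for the paper's regression loss, and channels are ranked by the pooled magnitude of the gradients back-propagated to the features; the ranking loss for scale-sensitive features is omitted, and the head and channel count k are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def select_target_aware_channels(features, label, k=64, head_steps=100):
    """Rank feature channels by back-propagated gradients of a regression loss.

    features: (1, C, H, W) backbone activations on the first frame.
    label:    (1, 1, H, W) Gaussian-like regression target centred on the object.
    """
    feats = features.detach().clone().requires_grad_(True)
    head = torch.nn.Conv2d(feats.shape[1], 1, kernel_size=1, bias=False)
    opt = torch.optim.SGD(head.parameters(), lr=1e-3)

    # fit the small regression head on the first frame (stand-in for the paper's regression)
    for _ in range(head_steps):
        opt.zero_grad()
        F.mse_loss(head(feats.detach()), label).backward()
        opt.step()

    # back-propagate the loss to the input features and pool the gradients per channel
    loss = F.mse_loss(head(feats), label)
    loss.backward()
    importance = feats.grad.mean(dim=(2, 3)).abs().squeeze(0)   # one score per channel
    return importance.topk(k).indices

features = torch.randn(1, 512, 26, 26)
label = torch.zeros(1, 1, 26, 26)
label[0, 0, 13, 13] = 1.0
keep = select_target_aware_channels(features, label)
target_aware = features[:, keep]        # channels kept for Siamese matching
```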
- DomainSiam: Domain-Aware Siamese Network for Visual Object Tracking (2019 British Conference)
Link: https://arxiv.org/abs/1908.07905
Abstract:
Visual object tracking is a fundamental task in the field of computer vision. Recently, Siamese trackers have achieved state-of-the-art performance on recent benchmarks. However, Siamese trackers do not fully utilize semantic and objectness information from pre-trained networks that have been trained on the image classification task. Furthermore, the pre-trained Siamese architecture is sparsely activated by the category label, which leads to unnecessary calculations and overfitting. In this paper, we propose to learn a Domain-Aware Siamese network that fully utilizes semantic and objectness information while producing a class-agnostic output using a ridge regression network. Moreover, to reduce the sparsity problem, we solve the ridge regression problem with a differentiable weighted dynamic loss function. Our tracker, dubbed DomainSiam, improves the feature learning in the training phase and the generalization capability to other domains. Extensive experiments are performed on five tracking benchmarks including OTB2013 and OTB2015 as a validation set, as well as VOT2017, VOT2018, LaSOT, TrackingNet, and GOT10k as a testing set. DomainSiam achieves state-of-the-art performance on these benchmarks while running at 53 FPS.
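The sketch below shows one way a weighted ridge-regression objective can be minimized directly by gradient descent, as the abstract describes; the weighting scheme, shapes, and the 1x1 convolution head are illustrative assumptions rather than DomainSiam's actual formulation.

```python
import torch

def weighted_ridge_objective(pred, label, weight_map, params, lam=1e-3):
    """Weighted ridge-regression objective minimised by gradient descent.

    pred / label: (1, 1, H, W) response map and Gaussian-like target;
    weight_map:   per-location weights standing in for a dynamic weighting that
                  counters the sparsity of the positive locations;
    lam:          L2 penalty on the regression parameters (the ridge term).
    """
    data_term = (weight_map * (pred - label) ** 2).mean()
    ridge_term = lam * sum(p.pow(2).sum() for p in params)
    return data_term + ridge_term

feat = torch.randn(1, 512, 22, 22)             # frozen backbone features (toy values)
label = torch.zeros(1, 1, 22, 22)
label[0, 0, 11, 11] = 1.0
weight_map = 1.0 + 10.0 * label                # emphasise the sparse positive region
head = torch.nn.Conv2d(512, 1, kernel_size=1)  # ridge-regression layer over the features
opt = torch.optim.SGD(head.parameters(), lr=1e-2)
for _ in range(100):
    opt.zero_grad()
    weighted_ridge_objective(head(feat), label, weight_map, head.parameters()).backward()
    opt.step()
```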
- Distractor-Aware Deep Regression for Visual Tracking (Sensors, MDPI, 2019)
Link: https://www_mdpi.gg363.site
Abstract:
In recent years, regression trackers have drawn increasing attention in the visual object tracking community due to their favorable performance and easy implementation. These trackers directly learn a mapping from dense samples around the target object to Gaussian-like soft labels. However, in many real applications, the extremely imbalanced distribution of training samples usually hinders the robustness and accuracy of regression trackers when applied to test data. In this paper, we propose a novel and effective distractor-aware loss function to address this issue by highlighting the significant domain and severely penalizing the pure background. In addition, we introduce a fully differentiable hierarchy-normalized concatenation connection to exploit abstractions across multiple convolutional layers. Extensive experiments were conducted on five challenging benchmark tracking datasets, that is, OTB-13, OTB-15, TC-128, UAV-123, and VOT17. The experimental results are promising and show that the proposed tracker performs much better than nearly all the compared state-of-the-art approaches.
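A hedged sketch of what a distractor-aware re-weighting of the regression loss could look like: target locations keep full weight, and background locations are penalized in proportion to the responses the network places there, so confident false responses (potential distractors) dominate the background term. The exact form of the paper's loss differs; the weighting below is only illustrative.

```python
import torch

def distractor_aware_loss(pred, label, bg_thresh=0.05, bg_penalty=4.0):
    """Illustrative re-weighted L2 regression loss (not the paper's exact formulation).

    pred / label: (N, 1, H, W) predicted and Gaussian-like target response maps.
    """
    residual = (pred - label) ** 2
    target_weight = label                                   # highlight the target region
    background = (label < bg_thresh).float()
    distractor_weight = bg_penalty * background * pred.detach().clamp(min=0)
    weight = target_weight + distractor_weight + 1e-3       # small floor keeps all locations in play
    return (weight * residual).sum() / weight.sum()

pred = torch.rand(1, 1, 34, 34)
label = torch.zeros(1, 1, 34, 34)
label[0, 0, 17, 17] = 1.0
loss = distractor_aware_loss(pred, label)
```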
- Convolutional Regression for Visual Tracking (IEEE Transactions on Image Processing, 2018)
Link: https://ieeexplore_ieee.gg363.site/abstract/document/8327618/
Abstract:
Recently, discriminatively learned correlation filters (DCF) have attracted much attention in the visual object tracking community. The success of DCF is potentially attributed to the fact that a large number of samples are utilized to train the ridge regression model and predict the location of an object. To solve the regression problem in an efficient way, these samples are all generated by circularly shifting a search patch. However, these synthetic samples also induce some negative effects that weaken the robustness of DCF-based trackers. In this paper, we propose a new approach to learn the regression model for visual tracking with a single convolutional layer. Instead of learning the linear regression model in closed form, we solve the regression problem by optimizing a one-channel-output convolution layer with gradient descent (GD). In particular, the kernel size of the convolution layer is set to the size of the object. Contrary to DCF, it is possible to incorporate all “real” samples clipped from the whole image. A critical issue of the GD approach is that most of the convolutional samples are negative and the contribution of positive samples will be suppressed. To address this problem, we propose a novel objective function to eliminate easy negatives and enhance positives. We perform extensive experiments on four widely used datasets: OTB-100, OTB-50, TempleColor, and VOT-2016. The results show that the proposed algorithm achieves outstanding performance and outperforms most of the existing DCF-based algorithms.
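The core mechanism, a one-channel-output convolution whose kernel matches the object size and which is fit by gradient descent, is easy to sketch. The truncated loss below is only a stand-in for the paper's objective for suppressing easy negatives, and all shapes are toy values.

```python
import torch
import torch.nn as nn

class ConvRegression(nn.Module):
    """One-channel-output convolution whose kernel roughly covers the target size."""
    def __init__(self, in_channels, kernel_size):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 1, kernel_size=kernel_size)

    def forward(self, x):
        return self.conv(x)

def truncated_l2(pred, label, tau=0.1):
    # illustrative stand-in for the paper's objective: small residuals (mostly easy
    # negatives far from the target) are zeroed out so the few positives are not
    # overwhelmed by the dominant background
    residual = (pred - label) ** 2
    return torch.where(residual > tau * tau, residual, torch.zeros_like(residual)).mean()

feat = torch.randn(1, 256, 40, 40)            # backbone features of the search area (toy values)
label = torch.zeros(1, 1, 34, 34)             # response size: 40 - 7 + 1 = 34
label[0, 0, 17, 17] = 1.0                     # a Gaussian-like map would be used in practice
model = ConvRegression(256, 7)                # 7x7 kernel = assumed object size in feature cells
opt = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
for _ in range(200):                          # fit the regressor with plain gradient descent
    opt.zero_grad()
    truncated_l2(model(feat), label).backward()
    opt.step()
```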
- Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization (ICCV 2017)
Link: http://openaccess.thecvf.com/content_iccv_2017/html/Selvaraju_Grad-CAM_Visual_Explanations_ICCV_2017_paper.html
Abstract:
We propose a technique for producing ‘visual explanations’ for decisions from a large class of Convolutional Neural Network (CNN)-based models, making them more transparent. Our approach, Gradient-weighted Class Activation Mapping (Grad-CAM), uses the gradients of any target concept (say logits for ‘dog’ or even a caption), flowing into the final convolutional layer to produce a coarse localization map highlighting the important regions in the image for predicting the concept. Unlike previous approaches, Grad-CAM is applicable to a wide variety of CNN model families: (1) CNNs with fully-connected layers (e.g. VGG), (2) CNNs used for structured outputs (e.g. captioning), (3) CNNs used in tasks with multi-modal inputs (e.g. visual question answering) or reinforcement learning, without architectural changes or re-training. We combine Grad-CAM with existing fine-grained visualizations to create a high-resolution class-discriminative visualization, Guided Grad-CAM, and apply it to image classification, image captioning, and visual question answering (VQA) models, including ResNet-based architectures. In the context of image classification models, our visualizations (a) lend insights into failure modes of these models (showing that seemingly unreasonable predictions have reasonable explanations), (b) outperform previous methods on the ILSVRC-15 weakly-supervised localization task, (c) are more faithful to the underlying model, and (d) help achieve model generalization by identifying dataset bias. For image captioning and VQA, our visualizations show that even non-attention based models can localize inputs. Finally, we design and conduct human studies to measure if Grad-CAM explanations help users establish appropriate trust in predictions from deep networks and show that Grad-CAM helps untrained users successfully discern a ‘stronger’ deep network from a ‘weaker’ one even when both make identical predictions.
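Grad-CAM itself is straightforward to reproduce: global-average-pool the gradients of the class score with respect to the last convolutional feature maps, use them as channel weights, and pass the weighted sum through a ReLU. A minimal PyTorch sketch using torchvision's VGG-16; the hooked layer index, random input, and final normalization are choices of this sketch.

```python
import torch
import torch.nn.functional as F
from torchvision import models

def grad_cam(model, image, target_layer, class_idx=None):
    """Minimal Grad-CAM: ReLU of the gradient-weighted sum of the target layer's feature maps."""
    acts, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))

    logits = model(image)
    if class_idx is None:
        class_idx = logits.argmax(dim=1).item()      # explain the top-scoring class
    model.zero_grad()
    logits[0, class_idx].backward()                  # gradients of the class score
    h1.remove(); h2.remove()

    weights = grads[0].mean(dim=(2, 3), keepdim=True)           # GAP over the gradients
    cam = F.relu((weights * acts[0]).sum(dim=1, keepdim=True))  # weighted sum of feature maps
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalise to [0, 1]

model = models.vgg16().eval()               # pretrained weights would normally be loaded here
image = torch.randn(1, 3, 224, 224)         # stand-in for a preprocessed input image
heatmap = grad_cam(model, image, model.features[28])  # features[28]: last conv layer of VGG-16
```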
- Learning Deep Features for Discriminative Localization (CVPR 2016)
Link: https://www.cv-foundation.org/openaccess/content_cvpr_2016/html/Zhou_Learning_Deep_Features_CVPR_2016_paper.html
Abstract:
In this work, we revisit the global average pooling layer proposed in [13], and shed light on how it explicitly enables the convolutional neural network (CNN) to have remarkable localization ability despite being trained on image-level labels. While this technique was previously proposed as a means for regularizing training, we find that it actually builds a generic localizable deep representation that exposes the implicit attention of CNNs on an image. Despite the apparent simplicity of global average pooling, we are able to achieve 37.1% top-5 error for object localization on ILSVRC 2014 without training on any bounding box annotation. We demonstrate in a variety of experiments that our network is able to localize the discriminative image regions despite just being trained for solving a classification task.
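Because the architecture ends in global average pooling followed by a single fully-connected layer, the class activation map is just the classifier weights of one class projected onto the final convolutional feature maps. A minimal sketch with torchvision's ResNet-18, which has the required GAP + FC structure; the ReLU, random input, and normalization at the end are choices of this sketch.

```python
import torch
import torch.nn.functional as F
from torchvision import models

def class_activation_map(model, image, class_idx=None):
    """CAM for a GAP-based classifier: project one class's FC weights onto the last conv maps."""
    feats = {}
    h = model.layer4.register_forward_hook(lambda m, i, o: feats.setdefault("maps", o))
    with torch.no_grad():
        logits = model(image)
    h.remove()
    if class_idx is None:
        class_idx = logits.argmax(dim=1).item()        # localize the top-scoring class
    fmap = feats["maps"]                               # (1, C, h, w) final conv feature maps
    w = model.fc.weight[class_idx].view(1, -1, 1, 1)   # classifier weights for that class
    cam = F.relu((w * fmap).sum(dim=1, keepdim=True))  # weighted sum over channels
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalise to [0, 1]

model = models.resnet18().eval()            # pretrained weights would normally be loaded here
cam = class_activation_map(model, torch.randn(1, 3, 224, 224))
```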