PaperReading2: The Components of an Object Tracking System

The following is excerpted from:
[Understanding and Diagnosing Visual Tracking Systems]
Authors: Naiyan Wang, Jianping Shi, Dit-Yan Yeung, Jiaya Jia


(This paper focuses on the most general type of visual tracking problem: short-term, single-object, model-free tracking.)


Generative tracker vs. discriminative tracker
Trackers are usually discriminative or hybrid,
mainly because purely generative trackers cannot handle complicated backgrounds well, which makes it easy for them to drift away from the target.


Benchmark
Sample videos come from existing benchmarks (VOT, VTB, PTB).
The key difference between these benchmarks lies in the evaluation metric.

accuracy: the overlap rate between the prediction and the ground truth when the tracker does not drift away
robustness: the frequency of tracking failure, which happens when the overlap rate is zero. (Whenever such a failure occurs, the tracker is reset to the correct bounding box to continue tracking.)

Two metrics for evaluation:
1. AUC (area under the curve) of the overlap rate between the ground-truth and predicted bounding boxes.
2. The precision at threshold 20 for the central pixel error curve.
This metric is useful when the scale of the object changes but the tracker does not support scale variation.
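
The second metric can be sketched as below. This is a minimal illustration, not the benchmark's official code: the `(x, y, w, h)` box layout and the function names `center_error` and `precision_at` are assumptions made here.

```python
import numpy as np

def center_error(pred, gt):
    """Per-frame Euclidean distance between box centers.

    Boxes are rows of (x, y, w, h); this layout is an assumption of the sketch.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    pred_centers = pred[:, :2] + pred[:, 2:] / 2.0
    gt_centers = gt[:, :2] + gt[:, 2:] / 2.0
    return np.linalg.norm(pred_centers - gt_centers, axis=1)

def precision_at(pred, gt, threshold=20.0):
    """Fraction of frames whose center error is within `threshold` pixels."""
    return float(np.mean(center_error(pred, gt) <= threshold))
```

Because only the box centers enter the computation, a tracker that keeps a fixed box size is not penalized for missing a scale change, which is why the metric suits that case.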


Five parts of a tracker

  • feature extractor
    represents each candidate in the candidate set using some features.

  • observation model
    judges whether a candidate is the target based on the features extracted from the candidate.

  • motion model
    generates a set of candidate regions or bounding boxes which may contain the target in the current frame.

  • model updater
    controls the strategy and frequency of updating the observation model. It has to strike a balance between model adaptation and drift.

  • ensemble post-processor
    When a tracking system consists of multiple trackers, the ensemble post-processor takes the outputs of the constituent trackers and uses the ensemble learning approach to combine them into the final result.


A tracking system
It starts by initializing the observation model with the given bounding box of the target in the first frame.
In each of the following frames, the motion model first generates candidate regions or proposals for testing based on the estimation from the previous frame.
The candidate regions or proposals are fed into the observation model to compute their probability of being the target.
The one with the highest probability is then selected as the estimation result of the current frame.
Based on the output of the observation model, the model updater decides whether the observation model needs any update and, if needed, the update frequency.
Finally, if there are multiple trackers, the bounding boxes returned by the trackers will be combined by the ensemble post-processor to obtain a more accurate estimate.
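
The loop described above can be sketched roughly as follows for a single tracker (no ensemble step). The function name `run_tracker` and the four callables `init_model`, `propose`, `score` and `maybe_update` are hypothetical names chosen for this illustration, not interfaces from the paper.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height), an assumed layout

def run_tracker(
    frames: Sequence,
    init_box: Box,
    init_model: Callable,    # (frame, box) -> None: initialize the observation model
    propose: Callable,       # (previous box) -> list of candidate boxes (motion model)
    score: Callable,         # (frame, box) -> confidence (features + observation model)
    maybe_update: Callable,  # (frame, box, confidence) -> None (model updater)
) -> List[Box]:
    init_model(frames[0], init_box)
    boxes = [init_box]
    for frame in frames[1:]:
        # Motion model: candidates are generated around the previous estimate.
        candidates = propose(boxes[-1])
        # Observation model: every candidate gets a confidence of being the target.
        scores = [score(frame, c) for c in candidates]
        best = max(range(len(candidates)), key=lambda i: scores[i])
        boxes.append(candidates[best])
        # Model updater: decides internally whether an update is needed.
        maybe_update(frame, candidates[best], scores[best])
    return boxes
```

Keeping each component behind its own callable mirrors the idea of swapping one part at a time while the others stay fixed.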

[Image 1]


Validation Setup
The parameters of each component are determined using five videos outside the benchmark and are then kept fixed throughout the evaluation unless specified otherwise.

Performance is measured by the overlap rate between the ground-truth and predicted bounding boxes, where the overlap rate is defined as the area of intersection of the two bounding boxes over the area of their union.
With a given threshold for the overlap rate, we can calculate the success rate of the tracker over all the video frames.
Varying the threshold gradually from 0 to 1 yields a curve that goes from the maximum success rate down to a success rate of 0.
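
A minimal sketch of the overlap rate and the resulting success curve is given below. The `(x, y, w, h)` box layout, the 21 evenly spaced thresholds, and the convention that a frame counts as a success when its overlap is strictly greater than the threshold are assumptions of this sketch.

```python
import numpy as np

def overlap_rate(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def success_curve(pred, gt, thresholds=None):
    """Success rate over all frames for each overlap threshold."""
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 21)
    overlaps = np.array([overlap_rate(p, g) for p, g in zip(pred, gt)])
    rates = np.array([np.mean(overlaps > t) for t in thresholds])
    return thresholds, rates

def success_auc(pred, gt):
    """Area under the success curve, using the trapezoidal rule."""
    t, r = success_curve(pred, gt)
    return float(np.sum((r[1:] + r[:-1]) / 2.0 * np.diff(t)))
```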

Basic model for the following analysis
motion model: particle filter framework
features: raw pixels of grayscale images
observation model: logistic regression
model updater: the model is updated if the highest score among the candidates tested is below a threshold.
ensemble post-processor: none (single tracker)


See how each component of a tracker affects its final performance.


Feature Extractor
Transforms the raw image data into some (usually) more informative representation.

  • Raw Grayscale: resizes the image into a fixed size, converts it to grayscale, and then uses the pixel values as features.
  • Raw Color: the same as raw grayscale, except that the image is represented in the CIE Lab color space instead of grayscale.
  • Haar-like Features
  • HOG: histograms of oriented gradients
  • HOG + Raw Color
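
As an illustration of the simplest entry in the list, the raw grayscale feature can be sketched as below. The 32x32 output size, the nearest-neighbour resize, and the BT.601 luma weights are assumptions chosen for the sketch; the notes do not state which size or conversion the paper uses.

```python
import numpy as np

def raw_grayscale_feature(patch, size=(32, 32)):
    """Resize a patch to a fixed size, convert to grayscale, flatten the pixels.

    `patch` is an HxWx3 RGB array or an HxW grayscale array with values in 0..255.
    """
    patch = np.asarray(patch, dtype=float)
    if patch.ndim == 3:
        # Weighted sum of the R, G, B channels (BT.601 luma weights, an assumption).
        patch = patch @ np.array([0.299, 0.587, 0.114])
    h, w = patch.shape
    # Nearest-neighbour resize: pick one source row/column per output row/column.
    rows = (np.arange(size[0]) * h / size[0]).astype(int)
    cols = (np.arange(size[1]) * w / size[1]).astype(int)
    resized = patch[np.ix_(rows, cols)]
    # Scale to [0, 1] and flatten into a feature vector.
    return (resized / 255.0).ravel()
```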


    [Image 2]

CNNs can produce more powerful features but incur a high computational cost.
Another direction is to exploit color information. Some recent methods demonstrated notable performance with carefully designed color features. Not only are these features lightweight, but they are also suitable for deformable objects.
We believe that finding good features for object tracking is still a research direction that is worth pursuing.

Our Findings: Using proper features can dramatically improve the tracking performance.


Observation Model
returns the confidence of a given candidate being the target
The following (discriminative) observation models are considered:

  • Logistic Regression: Logistic regression with l2 regularization is used. Online update is achieved by simply using gradient descent.
  • Ridge Regression: Least squares regression with l2 regularization is used. The targets for positive examples are set to one while those for negative examples are set to zero. Online update is achieved by aggregating sufficient statistics (as in online dictionary learning).
  • SVM: Standard SVM with hinge loss and l2 regularization is used.
  • Structured Output SVM (SO-SVM): The optimization target of the structured output SVM is the overlap rate instead of the class label.
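
For the ridge regression entry, updating by aggregating sufficient statistics can be sketched as follows. The class name `OnlineRidge` and its methods are hypothetical; the sketch only shows that accumulating the sums of x x^T and y x over batches gives the same weights as solving on all data at once.

```python
import numpy as np

class OnlineRidge:
    """Ridge regression updated by accumulating sufficient statistics."""

    def __init__(self, dim, lam=1.0):
        self.A = np.zeros((dim, dim))  # running sum of x x^T
        self.b = np.zeros(dim)         # running sum of y * x
        self.lam = lam                 # l2 regularization strength
        self.w = np.zeros(dim)

    def update(self, X, y):
        """Add a batch of examples (rows of X, targets y) and re-solve."""
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        self.A += X.T @ X
        self.b += X.T @ y
        reg = self.lam * np.eye(len(self.b))
        self.w = np.linalg.solve(self.A + reg, self.b)

    def score(self, X):
        """Confidence of each candidate; targets are 1 (positive) and 0 (negative)."""
        return np.asarray(X, dtype=float) @ self.w
```

Only `A` and `b` need to be stored between frames, so the memory cost does not grow with the number of examples seen.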


    [Image 3]

Our Findings: Different observation models indeed affect the performance when the features are weak.
However, the performance gaps diminish when the features are strong enough. Consequently, satisfactory results can be obtained even using simple classifiers from textbooks.


Motion Model

  • Particle Filter: Particle filter is a sequential Bayesian estimation approach which recursively infers the hidden state of the target.
    Two advantages:
    1. It maintains a probabilistic estimation for each frame. Thus when several candidates have a high probability of being the target, they will all be kept for the next frames. As a result, it can help to recover from tracker failure. In contrast, the sliding window approach only chooses the candidate with the highest probability and prunes all others.
    2. It can easily incorporate changes in scale, aspect ratio, and even rotation and skewness.
  • Sliding Window: The sliding window approach is an exhaustive search scheme which simply considers all possible candidates within a square neighborhood.
  • Radius Sliding Window: It is a simple modification of the previous approach which considers a circular region instead.
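
The candidate sets produced by the three motion models can be sketched as below. The function names, the `(x, y, w, h)` box layout, the pixel `step`, and the Gaussian perturbation with `sigma_xy` and `sigma_scale` are assumptions of this sketch. `sample_particles` shows only the proposal step of a particle filter, not the weighting and resampling that keep several hypotheses alive across frames.

```python
import numpy as np

def sliding_window(prev_box, radius, step=1):
    """All translations of the previous box inside a square neighborhood."""
    x, y, w, h = prev_box
    offsets = range(-radius, radius + 1, step)
    return [(x + dx, y + dy, w, h) for dy in offsets for dx in offsets]

def radius_sliding_window(prev_box, radius, step=1):
    """Same as above, but only offsets inside a circle of the given radius."""
    x, y, _, _ = prev_box
    return [c for c in sliding_window(prev_box, radius, step)
            if (c[0] - x) ** 2 + (c[1] - y) ** 2 <= radius ** 2]

def sample_particles(prev_box, n, sigma_xy, sigma_scale, rng):
    """Random candidates with perturbed position and scale (proposal step only)."""
    x, y, w, h = prev_box
    dx, dy = rng.normal(0.0, sigma_xy, (2, n))
    # Log-normal scale factor, so widths and heights always stay positive.
    s = np.exp(rng.normal(0.0, sigma_scale, n))
    return np.stack([x + dx, y + dy, w * s, h * s], axis=1)
```

The sliding window variants only translate the box, whereas the sampled particles can also change its size, which is the second advantage listed above.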


    [Image 4]

    Although the particle filter has these two advantages, the overall performance of the three models is similar.
    The differences show up when there is fast motion or scale variation.


    [Image 5]

    The particle filter does well under scale variation but poorly under fast motion.

One remedy is to scale the motion model parameters by the video resolution.


[Image 6]

Even such a simple normalization step can improve the performance significantly, especially when there is fast motion.

Our Findings: The particle filter approach with resized input works well.


Model Updater (I did not fully understand this part)
The model updater determines both the strategy and the frequency of model update.
It needs to maintain a tradeoff between adapting to new (but possibly noisy) examples collected during tracking and preventing the tracker from drifting to the background.
When the model needs an update, we first collect some positive examples whose centers are within 5 pixels of the target and some negative examples within 100 pixels but with an overlap rate of less than 0.3.
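
The example collection step can be sketched as below. The 5 pixel, 100 pixel and 0.3 values come from the text; sampling the offsets on a regular grid with a hypothetical `step` parameter and keeping the box size fixed are assumptions, since the notes do not say how the examples are drawn.

```python
def overlap_rate(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def collect_examples(target, step=5, pos_radius=5, neg_radius=100, max_overlap=0.3):
    """Return (positives, negatives) as lists of boxes around the target."""
    x, y, w, h = target
    positives, negatives = [], []
    for dy in range(-neg_radius, neg_radius + 1, step):
        for dx in range(-neg_radius, neg_radius + 1, step):
            dist = (dx * dx + dy * dy) ** 0.5
            box = (x + dx, y + dy, w, h)
            if dist <= pos_radius:
                # Center within pos_radius pixels of the target: positive example.
                positives.append(box)
            elif dist <= neg_radius and overlap_rate(box, target) < max_overlap:
                # Nearby, but overlapping the target too little: negative example.
                negatives.append(box)
    return positives, negatives
```

Boxes that are farther than 5 pixels from the target but still overlap it by 0.3 or more are left out of both sets.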

two model update methods:

  1. update the model whenever the confidence of the target falls below a threshold.
  2. update the model whenever the difference between the confidence of the target and that of the background examples is below a threshold.
    This strategy simply maintains a sufficiently large margin between the positive and negative examples instead of forcing the target to have high confidence. It is potentially helpful when the target is occluded or disappears.
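
The two triggers can be written as simple predicates. The function names are hypothetical, and using the highest-scoring background example to measure the margin is an assumption; the notes do not say how the background confidences are aggregated.

```python
def update_on_low_confidence(target_conf, threshold):
    """Rule 1: update when the target's own confidence falls below the threshold."""
    return target_conf < threshold

def update_on_small_margin(target_conf, background_confs, threshold):
    """Rule 2: update when the target is not far enough ahead of the background."""
    margin = target_conf - max(background_confs)
    return margin < threshold
```

For instance, with a target confidence of 0.4 and background confidences of at most 0.1, rule 1 with threshold 0.5 triggers an update, while rule 2 with threshold 0.2 does not, because the margin of about 0.3 is still large enough.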

To the best of our knowledge, the only principled method for the model updater is the one by [41]. They proposed to use entropy minimization to identify reliable model updates and discard the incorrect ones.

Our Findings: Although the implementation of the model updater is often treated as an engineering trick in papers, especially for discriminative trackers, its impact on performance is usually very significant and hence is worth studying. Unfortunately, very little work focuses on this component.


Ensemble Post-processor

  1. One method defines a loss function for bounding box majority voting and then extends it to incorporate tracker weights, trajectory continuity and removal of bad trackers.
  2. The other formulates the ensemble learning problem as a structured crowd-sourcing problem which treats the reliability of each tracker as a hidden variable to be inferred. Its authors then proposed a factorial hidden Markov model that considers the temporal smoothness between frames. We adopt the basic model called ensemble based tracking (EBT) without self-correction.
[Image 7]

[Image 8]

The difference between these methods is small,
but diversity of the trackers in the ensemble helps to achieve good results. Both ensemble methods can significantly improve the results when the trackers have high diversity.
Even when the diversity is low, the ensemble does not impair the performance but still slightly outperforms the best single tracker.

Our Findings: The ensemble post-processor can improve the performance substantially, especially when the trackers have high diversity. This component is universal and effective, yet it is the least explored.


In the latest deep learning trackers, the feature extractor and observation model are combined into a unified deep learning framework for end-to-end learning.


Speed is a problem.
The fast Fourier transform (FFT) and circulant matrices are used to accelerate dense (kernelized) ridge regression.
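
The idea behind this speed-up can be illustrated on a small 1-D, linear (non-kernelized) case: when the training examples are all cyclic shifts of one base signal, the data matrix is circulant and the ridge solution can be computed element-wise in the Fourier domain. The function names and the row convention `np.roll(x, i)` are assumptions of this sketch, not the formulation of any particular tracker.

```python
import numpy as np

def circulant_rows(x):
    """Data matrix whose i-th row is x cyclically shifted by i positions."""
    return np.stack([np.roll(x, i) for i in range(len(x))])

def ridge_direct(x, y, lam):
    """Ridge regression solved with a dense linear system, O(n^3)."""
    X = circulant_rows(x)
    return np.linalg.solve(X.T @ X + lam * np.eye(len(x)), X.T @ y)

def ridge_fft(x, y, lam):
    """Same solution computed element-wise in the Fourier domain, O(n log n)."""
    xf = np.fft.fft(x)
    yf = np.fft.fft(y)
    # For this row convention, X^T X is diagonalized by the DFT with
    # eigenvalues |xf|^2, and X^T y corresponds to xf * yf.
    wf = xf * yf / (np.abs(xf) ** 2 + lam)
    return np.fft.ifft(wf).real
```

Both functions return the same weight vector up to floating-point error; the FFT version never builds the n-by-n matrix.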


Conclusion

  1. the feature extractor is the most important part of a tracker.
  2. the observation model is not that important if the features are good enough.
  3. the model updater can affect the result significantly, but currently there are not many principled ways for realizing this component.
  4. the ensemble post-processor is quite universal and effective.
  5. paying attention to some details of the motion model and model updater can significantly improve the performance.
