Temporal Segment Networks: Towards Good Practices for Deep Action Recognition [Paper Part]

1.Aim

  • discover the principles for designing effective ConvNet architectures for action recognition in videos
  • learn these models given limited training samples

2.Contribution

  • TSN
    • based on the idea of long-range temporal structure modeling
    • combines a sparse temporal sampling strategy with video-level supervision
  • a study of a series of good practices for learning ConvNets on video data with the help of the temporal segment network framework

3.Action Recognition

  • two complementary cues: appearance & motion dynamics

4.Two Major obstacles of ConvNet in Video-based Action Recognition

  • long-range temporal structure
    • needed to understand the dynamics in action videos
  • training deep ConvNets requires a large volume of training samples in practice
    • available datasets are limited in size & diversity

5.Two Problems Studied

  • how to design an effective and efficient video-level framework for learning video representation that is able to capture long-range temporal structure
  • how to learn the ConvNet models given limited training samples

6.ConvNets for Action Recognition

  • Convolutional networks
    • operate directly on longer continuous video streams
    • limited by computational cost
      • these methods usually process sequences of fixed lengths ranging from 64 to 120 frames
    • limited temporal coverage
      • unable to assemble an end-to-end learning scheme for modeling the temporal structure

7.Temporal Segment Networks

  • a video-level framework that enables modeling of dynamics throughout the whole video
  • aim to utilize the visual information of entire video to perform video-level prediction
  • composed of a spatial stream ConvNet and a temporal stream ConvNet
  • operates on a sequence of short snippets sparsely sampled from the entire video; each snippet in this sequence produces its own preliminary prediction of the action classes, and a consensus among the snippets is then derived as the video-level prediction
  • In the learning process, the loss values of video-level predictions, rather than those of the snippet-level predictions used in the original two-stream ConvNets, are optimized by iteratively updating the model parameters
  • TSN(T1, T2, ..., TK) = H(G(F(T1; W), F(T2; W), ..., F(TK; W)))
    • (T1, T2, ..., TK) is a sequence of snippets
      • the video is first divided into K segments {S1, S2, ..., SK} of equal duration, and each snippet Tk is randomly sampled from its corresponding segment Sk
    • F(Tk; W) is the function representing a ConvNet with parameters W
      • operates on the short snippet Tk
      • produces class scores for all the classes
    • the segmental consensus function G combines the outputs from multiple short snippets to obtain a consensus of class hypothesis among them
      • even averaging is used
    • the prediction function H predicts the probability of each action class for the whole video.
    • here we choose the widely used Softmax function for H.
    • K is set to 3
  • loss function: standard categorical cross-entropy applied to the segmental consensus G = G(F(T1; W), ..., F(TK; W))
    • L(y, G) = -Σ_{i=1}^{C} y_i (G_i - log Σ_{j=1}^{C} exp(G_j))
    • C is the number of action classes
    • y_i is the groundtruth label concerning class i
  • back-propagation process
    • the gradient of the loss w.r.t. the shared parameters W is ∂L(y, G)/∂W = (∂L/∂G) · Σ_{k=1}^{K} (∂G/∂F(Tk)) · (∂F(Tk)/∂W), so all K snippets jointly contribute to every parameter update
  • by fixing K for all videos, we assemble a sparse temporal sampling strategy
    • reduces the computational cost compared with evaluating ConvNets densely on all frames (a minimal sketch of the video-level forward pass follows below)
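
The following is a minimal PyTorch-style sketch of the video-level forward pass described above; it is not the authors' implementation, and the `TSN` class, `sample_snippets` helper, and toy `base_model` are illustrative stand-ins (a snippet is reduced to a single frame for brevity, and BN-Inception is replaced by a toy network).

```python
import torch
import torch.nn as nn

class TSN(nn.Module):
    """Sketch of TSN(T1..TK) = H(G(F(T1;W), ..., F(TK;W)))."""

    def __init__(self, base_model: nn.Module, num_segments: int = 3):
        super().__init__()
        self.base_model = base_model      # F(.; W), weights shared across all snippets
        self.num_segments = num_segments  # K, fixed for all videos (K = 3 in the paper)

    def sample_snippets(self, frames: torch.Tensor) -> torch.Tensor:
        """Divide the video into K equal-duration segments {S1..SK} and draw
        one snippet (here: one frame) at random from each segment."""
        seg_len = frames.shape[0] // self.num_segments
        idx = [k * seg_len + torch.randint(seg_len, (1,)).item()
               for k in range(self.num_segments)]
        return frames[idx]                                    # (K, C, H, W)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        snippets = self.sample_snippets(frames)               # (T1, ..., TK)
        scores = self.base_model(snippets)                    # F(Tk; W) -> (K, num_classes)
        consensus = scores.mean(dim=0)                        # G: even averaging over snippets
        return torch.softmax(consensus, dim=0)                # H: class probabilities for the video

# Toy usage: a stand-in ConvNet and a fake 90-frame video with 101 action classes.
toy_convnet = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 101))
tsn = TSN(toy_convnet, num_segments=3)
video = torch.randn(90, 3, 224, 224)
print(tsn(video).shape)   # -> torch.Size([101])
```

During training, the cross-entropy loss above would be applied to the consensus G (e.g. a standard cross-entropy on the pre-softmax consensus scores), so gradients from all K snippets update the shared weights W jointly, matching the back-propagation formula in the notes.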

8.Learning Temporal Segment Network

  • Network Architecture
    • Inception with Batch Normalization (BN-Inception)
    • due to its good balance between accuracy and efficiency
    • the spatial stream ConvNet operates on a single RGB image
    • the temporal stream ConvNet takes a stack of consecutive optical flow fields as input
  • Network Inputs
    • the two-stream ConvNets used RGB images for the spatial stream & stacked optical flow fields for the temporal stream
    • RGB difference & warped optical flow fields
      • the RGB difference between two consecutive frames describes the appearance change
      • extract the warped optical flow by first estimating the homography matrix and then compensating for camera motion
  • Network Training
    • Cross Modality Pre-training
      • we utilize RGB models to initialize the temporal networks
        • discretize optical flow fields into the interval from 0 to 255 by a linear transformation
          • makes the range of optical flow fields to be the same with RGB images
        • modify the weights of the first convolution layer of the RGB models to handle the input of optical flow fields
          • average the weights across the RGB channels and replicate this average by the channel number of the temporal network input (a weight-transform sketch follows after this list)
  • Regularization Techniques
    • freeze the mean and variance parameters of all Batch Normalization layers except the first one
    • add an extra dropout layer after the global pooling layer in the BN-Inception architecture
      • the dropout ratio is set to 0.8 for spatial stream ConvNets and 0.7 for temporal stream ConvNets
  • Data Augmentation
    • In the original two-stream ConvNets, random cropping and horizontal flipping are employed to augment training samples
    • corner cropping
      • extracted regions are only selected from the corners or the center of the image to avoid implicitly focusing on the center area of an image (an augmentation sketch follows after this list)
    • scale jittering
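
Below is a hedged sketch of the cross-modality pre-training step for the temporal stream: it averages ImageNet-pretrained RGB first-layer kernels across their 3 input channels and replicates the average to match the flow-stack channel count, as described above. The helper name `cross_modality_conv1` and the example layer shape are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

def cross_modality_conv1(rgb_conv1: nn.Conv2d, flow_channels: int = 10) -> nn.Conv2d:
    """Build a first conv layer for the temporal stream from an RGB-pretrained one:
    average the kernels over the 3 RGB input channels, then replicate that average
    `flow_channels` times (e.g. 10 = 5 optical flow fields x 2 directions)."""
    flow_conv1 = nn.Conv2d(flow_channels, rgb_conv1.out_channels,
                           kernel_size=rgb_conv1.kernel_size,
                           stride=rgb_conv1.stride,
                           padding=rgb_conv1.padding,
                           bias=rgb_conv1.bias is not None)
    with torch.no_grad():
        mean_w = rgb_conv1.weight.mean(dim=1, keepdim=True)               # (out, 1, kH, kW)
        flow_conv1.weight.copy_(mean_w.repeat(1, flow_channels, 1, 1))    # (out, flow_channels, kH, kW)
        if rgb_conv1.bias is not None:
            flow_conv1.bias.copy_(rgb_conv1.bias)
    return flow_conv1

# Example with a layer shaped like a typical 7x7 stem convolution:
rgb_conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
print(cross_modality_conv1(rgb_conv1, flow_channels=10).weight.shape)  # torch.Size([64, 10, 7, 7])
```

This transfer only makes sense after the optical flow fields have been linearly rescaled to the 0-255 range, so that their value distribution roughly matches that of RGB images.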
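For the data augmentation above, here is an illustrative Pillow-based sketch of corner cropping with scale jittering and horizontal flipping; the concrete sizes (256x340 input, crop sizes {256, 224, 192, 168}, final 224x224) follow the paper's setup as I recall it and should be treated as assumptions.

```python
import random
from PIL import Image, ImageOps

def corner_crop_scale_jitter(img: Image.Image) -> Image.Image:
    """Pick a jittered crop size, cut it from a corner or the center only
    (avoiding an implicit center bias), optionally flip, and resize to 224x224."""
    img = img.resize((340, 256))                      # width x height
    crop = random.choice([256, 224, 192, 168])        # scale jittering
    w, h = img.size
    positions = {                                     # corner cropping: 4 corners + center
        "tl": (0, 0), "tr": (w - crop, 0),
        "bl": (0, h - crop), "br": (w - crop, h - crop),
        "center": ((w - crop) // 2, (h - crop) // 2),
    }
    x, y = positions[random.choice(list(positions))]
    patch = img.crop((x, y, x + crop, y + crop))
    if random.random() < 0.5:                         # horizontal flipping
        patch = ImageOps.mirror(patch)
    return patch.resize((224, 224))
```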

9.Testing Temporal Segment Networks

  • sample 25 RGB frames or optical flow stacks from the action videos; crop the 4 corners and 1 center, plus their horizontal flips, from the sampled frames to evaluate the ConvNets
  • For the fusion of spatial and temporal stream networks, we take a weighted average of their prediction scores
  • When learned within the temporal segment network framework, the performance gap between spatial stream ConvNets and temporal stream ConvNets is much smaller than that in the original two-stream ConvNets
  • the weight of the temporal stream is set to 1.5
    • when warped optical flow is also used, the temporal weight is divided into 1 for optical flow and 0.5 for warped flow
  • the weight of the spatial stream is set to 1 (a fusion sketch follows below)
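
The stream fusion above can be summarized in a small sketch; `fuse_streams` is a hypothetical helper, and the inputs are assumed to be per-class score vectors already averaged over the 25 sampled frames/flow stacks and their crops.

```python
from typing import Optional
import numpy as np

def fuse_streams(spatial_scores: np.ndarray,
                 flow_scores: np.ndarray,
                 warped_flow_scores: Optional[np.ndarray] = None) -> np.ndarray:
    """Weighted late fusion: spatial weight 1, temporal weight 1.5; when warped
    optical flow is also used, the temporal weight splits into 1 + 0.5."""
    if warped_flow_scores is None:
        return 1.0 * spatial_scores + 1.5 * flow_scores
    return 1.0 * spatial_scores + 1.0 * flow_scores + 0.5 * warped_flow_scores

# Toy usage with random per-class scores for 101 classes:
rgb, flow, warped = (np.random.rand(101) for _ in range(3))
print(int(np.argmax(fuse_streams(rgb, flow, warped))))   # predicted class index
```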

10.Datasets and Implementation Details

  • follow the original evaluation scheme using three training/testing splits and report average accuracy over these splits
  • use the mini-batch stochastic gradient descent algorithm to learn the network parameters
    • batch size is set to 256
    • momentum set to 0.9
  • spatial networks
    • the learning rate is initialized as 0.001
    • decreased to 1/10 of its value every 2,000 iterations
    • the whole training procedure stops at 4,500 iterations
  • temporal networks
    • the learning rate is initialized as 0.005
    • reduced to 1/10 of its value after 12,000 and 18,000 iterations
    • the maximum number of iterations is set to 20,000 (a training-configuration sketch follows after this list)
  • data augmentation
    • location jittering
    • horizontal flipping
    • corner cropping
    • scale jittering
  • extraction of optical flow and warped optical flow
    • choose the TVL1 optical flow algorithm
  • speed up training
    • employ a data-parallel strategy with multiple GPUs
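
As a rough guide, the optimizer settings listed above could be wired up as follows; this is a minimal sketch assuming a standard PyTorch training loop (the scheduler is stepped once per iteration, i.e. per mini-batch), and `make_optimizer_and_schedule` is a hypothetical helper, not part of the paper's released code.

```python
import torch

def make_optimizer_and_schedule(model: torch.nn.Module, stream: str = "spatial"):
    """Mini-batch SGD with momentum 0.9 and the step schedules from the notes:
    spatial  - lr 0.001, /10 every 2,000 iterations, stop at 4,500;
    temporal - lr 0.005, /10 after 12,000 and 18,000 iterations, stop at 20,000."""
    if stream == "spatial":
        base_lr, milestones, max_iter = 0.001, [2000, 4000], 4500
    else:
        base_lr, milestones, max_iter = 0.005, [12000, 18000], 20000

    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=milestones, gamma=0.1)
    return optimizer, scheduler, max_iter

# Usage (a toy model stands in for BN-Inception; batch size would be 256 per the notes):
optimizer, scheduler, max_iter = make_optimizer_and_schedule(torch.nn.Linear(10, 101), "temporal")
```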

11.Exploration Study

  • optical flow is better at capturing motion information, and RGB difference can sometimes be unstable for describing motion
  • RGB difference may serve as a low-quality, high-speed alternative for motion representation (a minimal sketch follows below)
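
A tiny sketch of that alternative: stacked differences between consecutive RGB frames give a motion-like signal without any optical flow computation. The `rgb_difference` helper is purely illustrative.

```python
import torch

def rgb_difference(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, C, H, W) consecutive RGB frames of one snippet.
    Returns the (T-1, C, H, W) stack of frame-to-frame differences, a cheap
    stand-in for optical flow as the temporal-stream input."""
    return frames[1:] - frames[:-1]

# Toy usage: 6 consecutive frames give a 5-step difference stack.
print(rgb_difference(torch.rand(6, 3, 224, 224)).shape)   # torch.Size([5, 3, 224, 224])
```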

12.Evaluation of Temporal Segment Networks

  • choose average pooling as the default aggregation function
  • BN-Inception as the ConvNet architecture for temporal segment networks
  • modeling long-term temporal structures is crucial for better understanding of action in videos

13. Model Visualization

  • learned models focus more on humans in the videos, and seem to be modeling the long-range structure of the action class
  • models learned with the proposed method may perform better, which is well reflected in our quantitative experiments

14.Conclusion

  • TSN
    • a video-level framework that aims to model long-term temporal structure
  • bring the state of the art to a new level
    • segmental architecture with sparse sampling
    • a series of good practices that we explored in this work
