Two-Stream Convolutional Networks for Action Recognition in Videos [Paper Part]

1.Contribution

  • propose a two-stream ConvNet architecture
    • spatial & temporal
  • a ConvNet trained on multi-frame dense optical flow achieves very good performance in spite of limited training data
  • multi-task learning applied to two different action classification datasets
    • can increase the amount of training data
    • can improve the performance on both datasets

2.Two-Stream

  • spatial stream
    • action recognition from still video frames
  • temporal stream
    • recognize action from motion in the form of dense optical flow
  • based on the two-pathway hypothesis of the human visual system
    • ventral stream
      • performs object recognition
    • dorsal stream
      • recognises motion

3.Video

  • spatial
    • in the form of individual frame appearance
    • carries information about scenes and objects depicted in the video
  • temporal
    • in the form of motion across the frames
    • conveys the movement of the observer (the camera) and the objects

4.Spatial Stream ConvNet

  • operates on individual video frames
    • effectively performing action recognition from still images
  • some actions are strongly associated with particular objects
    • the stream can therefore be built on an image classification architecture and pre-trained on a large image dataset such as ImageNet (a minimal sketch follows this list)
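A minimal sketch of such a spatial stream, assuming PyTorch/torchvision: an ImageNet-pre-trained image classifier whose last layer is replaced by an action classifier. The paper trains a CNN-M-2048-style network in Caffe, so the ResNet-18 backbone below is only a convenient stand-in.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_ACTIONS = 101  # e.g. UCF-101

# Start from an ImageNet-pre-trained backbone and replace the classifier.
spatial_net = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
spatial_net.fc = nn.Linear(spatial_net.fc.in_features, NUM_ACTIONS)

frame = torch.randn(1, 3, 224, 224)  # one RGB frame cropped to 224x224
scores = spatial_net(frame)          # per-action class scores
print(scores.shape)                  # torch.Size([1, 101])
```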

5.Optical Flow ConvNets

  • input
    • formed by stacking optical flow displacement fields between several consecutive frames
    • explicitly describes the motion between video frames
      • make the recognition easier
        • the network does not need to estimate motion

6.Mean Flow Subtraction

  • from each displacement field d we subtract its mean vector
    • a simple way of compensating the global (camera) motion between the frames (a small sketch follows)
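A small numpy sketch of this subtraction; the function name and the h × w × 2 array layout are assumptions for illustration.

```python
import numpy as np

def subtract_mean_flow(flow):
    """Subtract the mean displacement vector from one flow field.

    flow: array of shape (h, w, 2) holding the horizontal and vertical
    displacement components. Removing the per-field mean vector is a
    simple way of suppressing global (camera) motion.
    """
    return flow - flow.mean(axis=(0, 1), keepdims=True)
```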

7.Architecture

  • sample a 224×224×2L sub-volume from the flow volume I and pass it to the net as input (see the crop sketch after this list)
  • hidden layer configuration same as the spatial net
  • testing is similar to the spatial ConvNet
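A minimal sketch of that sampling step, assuming a numpy flow volume of shape h × w × 2L with h, w ≥ 224; the function name is illustrative.

```python
import numpy as np

def random_crop_volume(I, size=224):
    """Randomly crop a size x size x 2L sub-volume from the flow volume I
    (shape h x w x 2L, with h and w at least `size`), as fed to the
    temporal net during training."""
    h, w, _ = I.shape
    top = np.random.randint(0, h - size + 1)
    left = np.random.randint(0, w - size + 1)
    return I[top:top + size, left:left + size, :]
```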

8.Optical Flow Stacking

  • a dense optical flow can be seen as a set of displacement vector fields dt between pairs of consecutive frames t and t+1
  • dt(u,v)
    • the displacement vector at the point (u,v) in frame t, which moves the point to the corresponding point in the following frame t+1
  • dtx & dty
    • the horizontal and vertical components of the vector field, which can be seen as image channels
    • well suited to recognition using a convolutional network
  • w,h
    • the width and height of the video frames
  • IT(u,v,2k-1) = dx_{T+k-1}(u,v)
    IT(u,v,2k)   = dy_{T+k-1}(u,v)
    • u = [1; w], v = [1; h], k = [1; L]
    • this builds a ConvNet input volume IT ∈ R^(w×h×2L) for an arbitrary frame T
      • 2L input channels
    • the channel IT(u,v,c) stores the displacement component at the location (u,v)
  • for an arbitrary point (u,v), the channels IT(u,v,c) encode the motion at that point over a sequence of L frames (a numpy sketch of this stacking follows the list)
    • c = [1; 2L]
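A numpy sketch of the stacking above, assuming 0-based Python indexing, a list of forward flow fields of shape h × w × 2 (flows[t] between frames t and t+1), and illustrative names.

```python
import numpy as np

def stack_flows(flows, T, L):
    """Build the temporal-net input volume I_T by stacking L flow fields.

    flows: list of dense flow fields; flows[t] has shape (h, w, 2) and is
           the displacement field between frames t and t+1.
    T:     0-based index of the starting frame.
    L:     number of stacked flow fields; the result has 2L channels.
    """
    h, w, _ = flows[0].shape
    I = np.empty((h, w, 2 * L), dtype=np.float32)
    for k in range(1, L + 1):
        d = flows[T + k - 1]             # d_{T+k-1}
        I[:, :, 2 * k - 2] = d[:, :, 0]  # channel 2k-1: horizontal component
        I[:, :, 2 * k - 1] = d[:, :, 1]  # channel 2k:   vertical component
    return I
```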

9.Trajectory Stacking

  • sample along the motion trajectory
  • IT(u,v,2k-1) = dx_{T+k-1}(Pk)
    IT(u,v,2k)   = dy_{T+k-1}(Pk)
    • u = [1; w], v = [1; h], k = [1; L]
    • the input volume IT corresponds to a frame T
    • Pk is the k-th point along the trajectory
      • it starts at the location (u,v) in frame T
      • it is defined by the following recurrence relation
        • P1 = (u,v)
          Pk = P_{k-1} + d_{T+k-2}(P_{k-1})   (k > 1)
      • IT stores the vectors sampled at the locations Pk along the trajectory (a sketch of this sampling follows the list)
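A numpy sketch of trajectory stacking under the same assumptions as the previous sketch; nearest-pixel rounding is used when sampling the flow along the trajectory, a detail these notes do not prescribe.

```python
import numpy as np

def stack_flows_along_trajectories(flows, T, L):
    """Build the input volume by sampling the k-th flow field at the point
    P_k reached by following the flow, rather than always at (u, v)."""
    h, w, _ = flows[0].shape
    I = np.empty((h, w, 2 * L), dtype=np.float32)
    # P_1 = (u, v) for every output location (kept as float positions).
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    for k in range(1, L + 1):
        d = flows[T + k - 1]                             # d_{T+k-1}
        yi = np.clip(np.rint(ys), 0, h - 1).astype(int)  # nearest pixel
        xi = np.clip(np.rint(xs), 0, w - 1).astype(int)
        dx = d[yi, xi, 0]
        dy = d[yi, xi, 1]
        I[:, :, 2 * k - 2] = dx
        I[:, :, 2 * k - 1] = dy
        # Advance the trajectory: P_{k+1} = P_k + d_{T+k-1}(P_k).
        xs = xs + dx
        ys = ys + dy
    return I
```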

10.Bi-directional Optical Flow

  • compute an additional set of displacement fields in the opposite direction
  • construct an input volume IT by stacking L/2 forward flows between frames T and T+L/2 and L/2 backward flows between frames T−L/2 and T (a sketch follows this list)
  • the flow can be represented using either of the methods (1) and (2)
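A sketch of that construction, assuming the same h × w × 2 flow layout as before; backward_flows[t] is assumed to hold the displacement from frame t+1 back to frame t, and the ordering of the forward and backward halves within the 2L channels is an illustrative convention.

```python
import numpy as np

def stack_bidirectional(forward_flows, backward_flows, T, L):
    """Stack L/2 forward flows starting at frame T and L/2 backward flows
    ending at frame T into one (h, w, 2L) volume (L is assumed even)."""
    assert L % 2 == 0
    half = L // 2
    h, w, _ = forward_flows[0].shape
    I = np.empty((h, w, 2 * L), dtype=np.float32)
    for k in range(half):
        fwd = forward_flows[T + k]       # between frames T+k and T+k+1
        bwd = backward_flows[T - 1 - k]  # between frames T-k and T-k-1
        I[:, :, 2 * k]     = fwd[:, :, 0]
        I[:, :, 2 * k + 1] = fwd[:, :, 1]
        I[:, :, 2 * (half + k)]     = bwd[:, :, 0]
        I[:, :, 2 * (half + k) + 1] = bwd[:, :, 1]
    return I
```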

11.Relation of The Temporal ConvNet Architecture to Previous Representation

  • motion is explicitly represented using the optical flow displacement field, computed based on the assumptions of constancy of the intensity and smoothness of the flow

12.Visualisation of Learnt Convolutional Filters

  • first-layer convolutional filters learnt on 10 stacked optical flows
  • the visualisation is split into 96 columns and 20 rows
    • each column corresponds to a filter
    • each row corresponds to an input channel
  • each of the 96 filters has a spatial receptive field of 7×7 pixels,and spans components of 10 stacked optical flow displacement fields d
  • some filters compute spatial derivatives of the optical flow
    • they capture how motion changes with image location
    • this generalises derivative-based hand-crafted descriptors
      • e.g. MBH
  • other filters compute temporal derivatives
    • they capture changes in motion over time

13.Multi-task Learning

  • combine several datasets
  • the aim is to learn a (video) representation that is not only applicable to the task in question (e.g. HMDB-51 classification), but also to other tasks (e.g. UCF-101 classification)
    • additional tasks act as a regulariser and allow for the exploitation of additional training data
  • in our case, the ConvNet architecture has two softmax classification layers on top of the last fully-connected layer (a PyTorch-style sketch follows this list)
    • one computes HMDB-51 classification scores
    • the other computes UCF-101 scores
    • each of the layers is equipped with its own loss function
    • the overall training loss is computed as the sum of the individual tasks' losses
    • the network weight derivatives can be found by back-propagation
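A PyTorch-style sketch of the two-head set-up (the paper's implementation is Caffe-based); the feature dimension, the names, and the way a batch from a single dataset only drives its own head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    """Two softmax classification heads on top of a shared feature layer:
    one for HMDB-51 (51 classes) and one for UCF-101 (101 classes).
    The shared ConvNet trunk is left abstract here."""

    def __init__(self, feat_dim=2048):
        super().__init__()
        self.hmdb_head = nn.Linear(feat_dim, 51)
        self.ucf_head = nn.Linear(feat_dim, 101)

    def forward(self, features):
        return self.hmdb_head(features), self.ucf_head(features)

# Training-step sketch: each head's loss is computed on the samples coming
# from its own dataset, the overall loss is the sum of the per-task losses,
# and back-propagation yields the weight derivatives.
model = MultiTaskHead()
criterion = nn.CrossEntropyLoss()
features = torch.randn(8, 2048)           # features from the shared trunk
hmdb_labels = torch.randint(0, 51, (8,))  # pretend this batch is from HMDB-51
hmdb_scores, ucf_scores = model(features)
loss = criterion(hmdb_scores, hmdb_labels)  # + UCF-101 loss for UCF batches
loss.backward()
```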

14.Implementation details

  • ConvNets configuration
    • all hidden weight layers use the rectification (ReLU) activation function
    • max pooling is performed over 3×3 spatial windows with stride 2
    • local response normalisation uses the same setting as 《ImageNet Classification with Deep Convolutional Neural Networks》
    • difference between the spatial and temporal ConvNet configurations
      • the second normalisation layer is removed from the latter to reduce memory consumption
  • training
    • spatial net training
      • a 224×224 sub-image is randomly cropped from the selected frame, then undergoes random horizontal flipping and RGB jittering
      • videos are rescaled beforehand
      • the sub-image is sampled from the whole frame
    • temporal net training
      • compute an optical flow volume I for the selected training frame; from I, a fixed-size 224×224×2L input is randomly cropped and flipped
    • learning rate
      • initially set to 10^-2, then decreased according to a fixed schedule, which is kept the same for all training sets
      • changed to 10^-3 after 50k iterations; training stops after 80k iterations
      • in fine-tuning, the rate is changed to 10^-3 after 14k iterations; training stops after 20k iterations
    • testing
      • sample a fixed number of frames(25) with equal temporal spacing between them
      • get 10 ConvNet inputs from each of the frames by cropping and flipping four corners and the center of the frame
      • class scores for the whole video are then obtained by averaging the scores across the sampled frames and crops therein
    • pre-training on ImageNet ILSVRC-2012
      • pre-train the spatial ConvNet
      • use the same training and test data augmentation (cropping, flipping, RGB jittering)
      • sample from the whole image
    • Multi-GPU training
      • derived from Caffe, with a number of modifications, including parallel training on multiple GPUs installed in a single system
      • exploits data parallelism and splits each SGD batch across several GPUs
        • 3.2× speed-up
    • optical flow
      • using the off-the-shelf GPU implementation of 《High accuracy optical flow estimation based on a theory for warping》from the OpenCV toolbox
      • the flow is pre-computed before training
      • the horizontal and vertical components of the flow are linearly rescaled to a [0, 255] range and compressed using JPEG, to avoid storing the displacement fields as floats; this reduces the flow storage for the UCF-101 dataset from 1.5TB to 27GB (a sketch follows this list)
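A sketch of pre-computing, rescaling and JPEG-compressing the flow. OpenCV's Farneback flow is used here only as a stand-in for the Brox et al. GPU implementation used in the paper, and the clipping bound for the linear rescaling is an assumption (the notes only state the [0, 255] target range).

```python
import cv2
import numpy as np

def precompute_flow(prev_gray, next_gray, out_prefix, bound=20.0):
    """Compute dense flow between two grayscale frames and store each
    component as an 8-bit JPEG instead of a float array."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    for c, name in enumerate(("x", "y")):
        # Clip to [-bound, bound], then rescale linearly to [0, 255].
        comp = np.clip(flow[:, :, c], -bound, bound)
        comp = ((comp + bound) * 255.0 / (2 * bound)).astype(np.uint8)
        cv2.imwrite(f"{out_prefix}_flow_{name}.jpg", comp)
```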

15.Evaluation

  • datasets and evaluation protocol
    • the evaluation is performed on UCF-101 and HMDB-51
      • UCF-101 contains 13k videos of 101 action classes
      • HMDB-51 contains 6.8k videos of 51 actions
  • evaluation protocol
    • the organisers provide 3 splits into training and testing data
    • the performance is measured by the mean classification accuracy across the splits
    • UCF-101 contains 9.5k training videos per split
    • HMDB-51 contains 3.7k training videos per split
    • we begin by comparing different architectures on the first split of the UCF-101 dataset
    • we then follow the standard evaluation protocol and report the average accuracy over three splits on both UCF-101 and HMDB-51
  • spatial ConvNet
    • measure the performance of the spatial stream ConvNet
    • we choose to train only the last layer on top of a pre-trained ConvNet
  • temporal ConvNet
    • in particular, we measure the effect of
      • using multiple (L = {5, 10}) stacked optical flows
      • trajectory stacking
      • mean displacement subtraction
      • using bi-directional optical flow
    • use an aggressive dropout ratio of 0.9 to help improve generalisation
    • results
      • stacking multiple (L > 1) displacement fields in the input is highly beneficial
        • it provides the network with long-term motion information
      • mean subtraction is helpful
        • it reduces the effect of global motion between the frames
      • the temporal ConvNet significantly outperforms the spatial ConvNet
        • this confirms the importance of motion information for action recognition
      • we also implement the “slow fusion” architecture of 《Large-scale video classification with convolutional neural networks》
        • it amounts to applying a ConvNet to a stack of RGB frames
        • it performs worse than the networks operating on optical flow: while multi-frame information is important, it is also important to present it to a ConvNet in an appropriate manner
  • multi-task learning of temporal ConvNets
    • training the ConvNet on HMDB-51 differs from training on UCF-101, since the HMDB-51 training set is much smaller
    • multi-task learning performs the best
      • it allows the training procedure to exploit all available training data
  • two-stream ConvNet
    • we evaluate the complete two-stream model
      • it combines the two recognition streams
    • the softmax scores are fused using either averaging or a linear SVM (a minimal averaging sketch follows this list)
    • conclusions
      • the temporal and spatial recognition streams are complementary
        • their fusion significantly improves on both
          • by 6% over the temporal and 14% over the spatial net
      • SVM-based fusion of the softmax scores outperforms fusion by averaging
      • using bi-directional flow is not beneficial in the case of ConvNet fusion
      • the temporal ConvNet trained using multi-task learning performs the best, both alone and when fused with the spatial net
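A minimal sketch of fusion by averaging the softmax scores; array shapes and names are assumptions, and the SVM-based variant would instead train a linear SVM on the stacked scores.

```python
import numpy as np

def fuse_by_averaging(spatial_scores, temporal_scores):
    """Late fusion of the two streams by averaging their (softmax) class
    scores; both inputs have shape (num_videos, num_classes)."""
    fused = (spatial_scores + temporal_scores) / 2.0
    return fused.argmax(axis=1)  # predicted class per video
```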

16.Comparison with the State of the Art

  • both our spatial and temporal nets alone outperform the deep architecture of 《Large-scale video classification with convolutional neural networks》and 《A large video database for human motion recognition》by a large margin
  • the combination of two nets
    • further improves the results
    • is comparable to very recent state-of-the-art hand-crafted models
  • confusion matrix and per-class recall for UCF-101 classification
    • the worst class is Hammering, which is confused with the HeadMassage and BrushingTeeth classes
    • reason
      • the spatial ConvNet confuses Hammering with HeadMassage, which can be caused by the significant presence of human faces in both classes
      • the temporal ConvNet confuses Hammering with BrushingTeeth, as both actions contain recurring motion patterns
        • a hand moving up and down

17.Conclusion

  • proposed a deep video classification model with competitive performance, which incorporates separate spatial and temporal recognition streams based on ConvNets
    • training a temporal ConvNet on optical flow is significantly better than training on raw stacked frames
    • our temporal model does not require significant hand-crafting, despite using optical flow as input
      • since the flow is computed using a method based on the generic assumptions of constancy and smoothness
  • training on extra data poses a significant challenge on its own
    • due to the gigantic amount of training data
      • multiple TBs
  • some essential ingredients of the state-of-the-art shallow representation are still missing from our current architecture
    • local feature pooling over spatio-temporal tubes, centred at the trajectories
      • even though the input (2) captures the optical flow along the trajectories, the spatial pooling in our network does not take the trajectories into account
    • explicit handling of camera motion, which in our case is compensated by mean displacement subtraction
