[Deep Learning Paper Notes][Video Classification] Two-Stream Convolutional Networks for Action Recognition in Videos

Simonyan, Karen, and Andrew Zisserman. “Two-stream convolutional networks for action recognition in videos.” Advances in Neural Information Processing Systems. 2014.
(Citations: 425).


1 Motivation

The features learnt by spatio-temporal CNNs do not capture motion well. The idea is to use two separate CNN streams, one for appearance from still frames and one for motion between frames, and to combine them by late fusion. Decoupling the spatial and temporal nets also lets us exploit the large amount of annotated image data by pre-training the spatial net on the ImageNet challenge dataset.


2 Architecture

See the figure below.

[Figure: two-stream architecture for video action recognition]

The spatial stream is used to perform action recognition from still frames. This is the standard image classification task, so we can use a CNN pre-trained on ImageNet.
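
A rough sketch of such a spatial stream (not the exact architecture from the paper, which uses its own CNN; here torchvision's ResNet-18 is used purely for illustration, and `num_classes` is a placeholder for the target action dataset):

```python
# Sketch of a spatial stream: an ImageNet-pretrained backbone with a new
# classification head for action classes.
import torch
import torch.nn as nn
from torchvision import models

num_classes = 101  # e.g., UCF-101

spatial_net = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
spatial_net.fc = nn.Linear(spatial_net.fc.in_features, num_classes)

frame = torch.randn(1, 3, 224, 224)   # a single RGB frame
scores = spatial_net(frame)           # class scores for this frame
print(scores.shape)                   # torch.Size([1, 101])
```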


The temporal stream is used to perform action recognition from motion. Its input is formed by stacking optical flow displacement fields between several consecutive frames, which explicitly describes the motion between video frames. Because the first convolutional layer operates over all of the stacked flow channels, the network effectively convolves across time as well as space.
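
A minimal sketch of the key difference on the temporal side: with T stacked flow frames, the first convolution simply takes 2T input channels (the layer sizes below are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

T = 10                      # number of stacked flow frames
in_channels = 2 * T         # horizontal + vertical flow per frame

# First layer of a temporal stream: a 2D convolution whose filters span all
# 2T flow channels, mixing information across time as well as space.
conv1 = nn.Conv2d(in_channels, 96, kernel_size=7, stride=2)

flow_stack = torch.randn(1, in_channels, 224, 224)  # stacked optical flow
out = conv1(flow_stack)
print(out.shape)  # torch.Size([1, 96, 109, 109])
```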


The final fusion is performed either by averaging the class scores of the two streams, or by training a linear SVM on the L2-normalized softmax scores of both streams as features.
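
A quick sketch of the two fusion options, assuming `spatial_scores` and `temporal_scores` hold per-video softmax outputs of the two streams (random placeholders here):

```python
import numpy as np
from sklearn.svm import LinearSVC

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Placeholder softmax scores from each stream, shape (num_videos, num_classes).
spatial_scores = softmax(np.random.randn(50, 101))
temporal_scores = softmax(np.random.randn(50, 101))
labels = np.random.randint(0, 101, size=50)

# Option 1: average the class scores of the two streams.
fused = (spatial_scores + temporal_scores) / 2
pred_avg = fused.argmax(axis=1)

# Option 2: concatenate L2-normalized softmax scores and train a linear SVM.
def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

features = np.hstack([l2_normalize(spatial_scores), l2_normalize(temporal_scores)])
svm = LinearSVC(C=1.0).fit(features, labels)
pred_svm = svm.predict(features)
```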


3 Temporal Stream
There are several variants of the temporal stream input.


3.1 Optical Flow Stacking

The input is a set of displacement vector fields d_t between pairs of consecutive frames t and t + 1. d_t(i, j) denotes the displacement vector at point (i, j) in frame t, which moves the point to the corresponding point in frame t + 1. To represent the motion across a sequence of frames, we stack the horizontal and vertical components d_t^x(i, j) and d_t^y(i, j) of T consecutive frames, giving a total of 2T input channels.
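
A small numpy sketch of optical flow stacking: the flow is read at the same pixel locations in each of the T frames, and the x and y components are interleaved into 2T channels (random arrays stand in for real optical flow):

```python
import numpy as np

T, H, W = 10, 224, 224

# d[t] is the displacement field between frames t and t+1:
# d[t, i, j] = (dx, dy) moves point (i, j) in frame t to frame t+1.
d = np.random.randn(T, H, W, 2).astype(np.float32)

# Optical flow stacking: sample the flow at the same (i, j) in every frame
# and interleave the x and y components into 2T input channels.
channels = []
for t in range(T):
    channels.append(d[t, :, :, 0])  # horizontal component d_t^x
    channels.append(d[t, :, :, 1])  # vertical component d_t^y
flow_input = np.stack(channels, axis=0)
print(flow_input.shape)  # (20, 224, 224)
```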

3.2 Trajectory Stacking
Trajectory stacking replaces the optical flow sampled at the same locations across several frames with flow sampled along the motion trajectories of anchor points: the location at which the flow is read in frame t + 1 is the location that the flow in frame t moved the point to.

See the figure below for an illustration.

[Figure: optical flow stacking vs. trajectory stacking]
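
A rough sketch of trajectory stacking under the same random-flow setup as above: rather than reading the flow at a fixed (i, j) in every frame, the sampling point is moved along the trajectory that the flow itself defines (the rounding and clamping below are implementation choices, not taken from the paper):

```python
import numpy as np

T, H, W = 10, 224, 224
d = np.random.randn(T, H, W, 2).astype(np.float32)  # d[t, i, j] = (dx, dy)

def trajectory_stack(d, i, j):
    """Sample the flow along the trajectory starting at (i, j) in the first frame."""
    n_frames, height, width = d.shape[:3]
    samples = []
    pi, pj = i, j
    for t in range(n_frames):
        dx, dy = d[t, pi, pj]
        samples.extend([dx, dy])
        # Move the sampling point along the flow (clamped to the image bounds).
        pi = int(np.clip(np.rint(pi + dy), 0, height - 1))
        pj = int(np.clip(np.rint(pj + dx), 0, width - 1))
    return np.array(samples)  # 2T values for this anchor point

print(trajectory_stack(d, 100, 100).shape)  # (20,)
```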


3.3 Bi-directional Optical Flow
We can construct an input volume by stacking T/2 forward flows between frames t and t + T/2 and T/2 backward flows between frames t − T/2 and t. The input thus has the same number of channels (2T) as before. The flows can be represented using either optical flow stacking or trajectory stacking.
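
A sketch of building the bi-directional input in the optical-flow-stacking layout; `forward_d` and `backward_d` are assumed to be pre-computed forward and backward displacement fields:

```python
import numpy as np

T, H, W = 10, 224, 224

# T/2 forward flows (frames t .. t+T/2) and T/2 backward flows
# (frames t-T/2 .. t), here just random placeholders.
forward_d = np.random.randn(T // 2, H, W, 2).astype(np.float32)
backward_d = np.random.randn(T // 2, H, W, 2).astype(np.float32)

def stack_flows(d):
    # (num_fields, H, W, 2) -> (2 * num_fields, H, W)
    return np.concatenate([d[t].transpose(2, 0, 1) for t in range(d.shape[0])], axis=0)

# Same 2T channels as the uni-directional case: T channels from each direction.
bidir_input = np.concatenate([stack_flows(forward_d), stack_flows(backward_d)], axis=0)
print(bidir_input.shape)  # (20, 224, 224)
```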


3.4 Training Details
It is generally beneficial to zero-center the network input, as this allows the model to better exploit the rectification non-linearities. In our case, the displacement vector fields can be dominated by a particular displacement, e.g., one caused by camera motion. Rather than estimating the global (camera) motion explicitly, a simpler approach is used: from each displacement field d we subtract its mean vector. Because the video datasets (UCF-101 and HMDB-51) are small, multi-task learning is used to combat overfitting: the CNN architecture is modified so that it has two softmax classification layers on top of the last fully-connected layer, one for each dataset, and each layer computes its loss only on videos from its own dataset.
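
A brief sketch of these two training details, mean flow subtraction and the two-head multi-task classifier; the feature dimension and batch contents below are placeholders:

```python
import numpy as np
import torch
import torch.nn as nn

T, H, W = 10, 224, 224

# --- Mean flow subtraction -------------------------------------------------
# Stacked flow input with 2T channels (layout as in the stacking examples).
flow_input = np.random.randn(2 * T, H, W).astype(np.float32)
# Subtract each displacement field's mean vector: per channel, remove the
# spatial mean, which approximately cancels uniform camera motion.
flow_input -= flow_input.mean(axis=(1, 2), keepdims=True)

# --- Multi-task heads (hypothetical feature dimension) ----------------------
# Two softmax classification layers share the same last fully-connected
# features; each head is trained only on videos from its own dataset.
feat_dim = 2048
head_ucf = nn.Linear(feat_dim, 101)    # UCF-101 classes
head_hmdb = nn.Linear(feat_dim, 51)    # HMDB-51 classes

features = torch.randn(4, feat_dim)    # features from the shared trunk
ucf_labels = torch.randint(0, 101, (4,))
loss = nn.functional.cross_entropy(head_ucf(features), ucf_labels)
print(loss.item())
```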


4 Results
Stacking multiple (T > 1) displacement fields in the input is highly beneficial, as it provides the network with long-term motion information. Mean subtraction is helpful, as it reduces the effect of global motion between frames. Optical flow stacking performs better than trajectory stacking, and bi-directional optical flow is only slightly better than uni-directional forward flow. The temporal CNN significantly outperforms the spatial CNN, which confirms the importance of motion information for action recognition. The temporal and spatial recognition streams are complementary, as their fusion significantly improves on both.



