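(Note: to avoid misunderstandings caused by imprecise translation, the original sentences from the paper are quoted alongside the summary.)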
Paper: Simonyan K, Zisserman A. Two-Stream Convolutional Networks for Action Recognition in Videos. 2014.
Link: https://arxiv.org/abs/1406.2199
This paper was published at NIPS 2014 and is a classic: it uses a two-stream network for action recognition in video. Its main contributions are:
1. It proposes a two-stream ConvNet architecture made up of a spatial network and a temporal network. (we propose a two-stream ConvNet architecture which incorporates spatial and temporal networks.)
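2. It shows that a ConvNet trained on multi-frame dense optical flow still achieves very good performance despite limited training data. (we demonstrate that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.)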
3. It uses multi-task learning across two different action classification datasets to increase the amount of training data and improve performance on both. (we show that multi-task learning, applied to two different action classification datasets, can be used to increase the amount of training data and improve the performance on both.)
The key points of the paper are briefly explained below.
The two-stream design is inspired by neuroscience: the human visual cortex contains two pathways, the ventral stream (which performs object recognition) and the dorsal stream (which recognises motion). (Our proposed architecture is related to the two-streams hypothesis [9], according to which the human visual cortex contains two pathways: the ventral stream (which performs object recognition) and the dorsal stream (which recognises motion).)
"Two-stream" in the paper refers to the spatial stream ConvNet and the temporal stream ConvNet. The roles of these two parallel networks are: The spatial stream performs action recognition from still video frames, whilst the temporal stream is trained to recognise action from motion in the form of dense optical flow. In short, the two networks separately extract features from still frames and from dense optical flow.
The spatial stream ConvNet and the temporal stream ConvNet share the same structure; the only difference is that the temporal network drops the second normalisation layer to reduce memory consumption. (The only difference between spatial and temporal ConvNet configurations is that we removed the second normalisation layer from the latter to reduce memory consumption.)
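The two networks can be trained separately. The spatial stream ConvNet takes a single RGB frame as input, so it can be pre-trained on ImageNet and then fine-tuned on the video dataset. Each training image is a frame picked at random from a video, augmented (for example by flipping), and randomly cropped to 224×224.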
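The sampling step described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' code: the function name `sample_spatial_input`, the `(T, H, W, 3)` array layout, and the dummy video size are assumptions made for the example.

```python
import numpy as np

def sample_spatial_input(frames, crop=224, rng=None):
    """Pick one random frame, randomly flip it, and take a random crop.

    frames: array of shape (T, H, W, 3) with H, W >= crop (assumed layout).
    """
    rng = np.random.default_rng() if rng is None else rng
    t, h, w, _ = frames.shape
    frame = frames[rng.integers(t)]       # one random frame of the video
    if rng.random() < 0.5:                # random horizontal flip
        frame = frame[:, ::-1, :]
    top = rng.integers(h - crop + 1)      # random crop position
    left = rng.integers(w - crop + 1)
    return frame[top:top + crop, left:left + crop, :]

# Dummy 10-frame video (the size is an assumption for illustration only).
video = np.zeros((10, 256, 340, 3), dtype=np.uint8)
x = sample_spatial_input(video, rng=np.random.default_rng(0))
print(x.shape)
```

Passing an explicit `rng` keeps the sampling reproducible, which helps when debugging a data pipeline.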
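The temporal stream ConvNet processes the optical flow of L consecutive frames. Because optical flow is a vector field, each flow vector is split into its horizontal and vertical components so that it can be fed to the network; an m×n×3 colour frame therefore corresponds to an m×n×2 flow map. Stacking the flow of L consecutive frames and applying a random crop gives an input of size 224×224×2L. Since optical flow is expensive to compute, the flow for each video can be extracted in advance with OpenCV and linearly rescaled to 0-255 (the horizontal and vertical components of the flow were linearly rescaled to a [0, 255] range and compressed using JPEG).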
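A rough sketch of how the 224×224×2L input could be assembled from precomputed flow fields is shown below. The helper names `quantize_flow` and `stack_flow` and the clipping bound of 20 pixels are assumptions for illustration; the quoted sentence from the paper does not state the range used for rescaling.

```python
import numpy as np

def quantize_flow(flow, bound=20.0):
    """Linearly map flow values from [-bound, bound] to [0, 255] (uint8).

    `bound` is a hypothetical clipping range, not a value from the paper.
    """
    clipped = np.clip(flow, -bound, bound)
    return np.round((clipped + bound) * (255.0 / (2 * bound))).astype(np.uint8)

def stack_flow(flows, top, left, crop=224):
    """Stack L flow fields of shape (H, W, 2) into one (crop, crop, 2L) input."""
    # Concatenating on the channel axis gives [dx_1, dy_1, ..., dx_L, dy_L].
    stacked = np.concatenate(flows, axis=-1)
    return stacked[top:top + crop, left:left + crop, :]

L = 10
rng = np.random.default_rng(0)
flows = [rng.normal(size=(256, 340, 2)) for _ in range(L)]  # dummy flow fields
net_input = stack_flow(flows, top=16, left=58)
print(net_input.shape)

q = quantize_flow(np.array([-20.0, 20.0]))  # endpoints of the assumed range
```

Here the crop offsets are passed in explicitly so that the same window is applied to all 2L channels.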
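To remove the influence of camera motion (global motion) on the flow field of the target action, the authors use a simple approach when computing optical flow: they subtract the mean flow from each flow field.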
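Mean flow subtraction can be illustrated with a few lines of NumPy. This is a sketch under the assumption that a flow field is stored as an `(H, W, 2)` array; the function name `subtract_mean_flow` and the toy values are made up for the example.

```python
import numpy as np

def subtract_mean_flow(flow):
    """Remove the mean displacement vector from a flow field of shape (H, W, 2)."""
    mean_vec = flow.mean(axis=(0, 1), keepdims=True)  # shape (1, 1, 2)
    return flow - mean_vec

# Toy field: a constant "camera" motion of (3, -1) plus one moving pixel.
flow = np.zeros((4, 4, 2))
flow += np.array([3.0, -1.0])
flow[0, 0] += np.array([8.0, 0.0])

compensated = subtract_mean_flow(flow)
print(compensated.mean(axis=(0, 1)))
```

After the subtraction the field has zero mean, so a uniform global shift no longer dominates the input, while the locally moving pixel still stands out.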
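The spatial ConvNet is just an image classification network, so large datasets such as ImageNet are available for pre-training. The temporal ConvNet, however, can only be trained on video datasets, and far less video data is available. The authors therefore use multi-task training: two softmax layers are placed on top of the last fully-connected layer, one for each dataset. Each softmax layer has its own loss, computed only on videos from its own dataset, and the two losses are summed to form the overall training loss, which is then back-propagated to update the network weights. (In our case, a ConvNet architecture is modified so that it has two softmax classification layers on top of the last fully-connected layer: one softmax layer computes HMDB-51 classification scores, the other one – the UCF-101 scores. Each of the layers is equipped with its own loss function, which operates only on the videos, coming from the respective dataset. The overall training loss is computed as the sum of the individual tasks’ losses, and the network weight derivatives can be found by back-propagation.)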
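The summed multi-task loss can be sketched in NumPy as follows (forward pass only). The names `softmax_cross_entropy` and `multitask_loss`, the 8-dimensional feature vectors, and the dataset ids 0 (HMDB-51) and 1 (UCF-101) are assumptions for the illustration, not details from the paper.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy of softmax(logits) against integer labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def multitask_loss(features, labels, dataset_ids, heads):
    """Sum of per-dataset losses; each head only sees its own dataset's videos.

    heads: dict mapping dataset id -> (W, b) of that dataset's softmax layer.
    """
    total = 0.0
    for ds, (W, b) in heads.items():
        mask = dataset_ids == ds
        if mask.any():
            total += softmax_cross_entropy(features[mask] @ W + b, labels[mask])
    return total

# Toy batch: shared features from the last fully-connected layer (dim 8 assumed).
features = np.random.default_rng(0).normal(size=(6, 8))
dataset_ids = np.array([0, 0, 0, 1, 1, 1])  # 0 = HMDB-51, 1 = UCF-101
labels = np.array([5, 50, 0, 100, 7, 42])   # class index within each dataset
heads = {0: (np.zeros((8, 51)), np.zeros(51)),
         1: (np.zeros((8, 101)), np.zeros(101))}
loss = multitask_loss(features, labels, dataset_ids, heads)
print(loss)
```

With zero-initialised heads every class is equally likely, so the toy loss equals log(51) + log(101), which is a quick sanity check that each head is only scored on its own dataset.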
This post only covers a few key points of the paper; see the original paper for the experimental results.
Reference: 【论文学习】Two-Stream总结