作者信息:Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao,
深度卷积神经网络在静止图像的目标识别取得了巨大的成功,但是在视频的行为识别领域,深度学习提升的效果并不是很显著,主要的原因有两点:
为了解决上述问题,采用了very deep two-stream网络用于行为识别,将当前流行的深度结构应用于视频领域。但是由于视频行为识别的规模较小,这种扩展并不容易。我们设计了多种实践应用于very deep two_stream 的训练,如下:
在时域网络和空间网络预训练
更小的学习速率
与此同时,将Caffe 工具箱延伸到多GPU环境,从而有更高的计算效率,更少的内存消耗。very deep two_stream方法在UCF-101数据集上取得了91.4%的识别精度。
近几年视频片段的行为识别取得了巨大的进步,这些方法可以分为两类:
然而,不像图像分类,深度卷积神经网络相对于传统方法并没有产生很好的提升效果。我们认为主要的原因有两点:
2.1 网络结构
在过去几年,提出了很多流行的网络结构,例如AlexNet,ClarifaiNet,GoogLeNet,VGGNet等等。这些网络结构特点的趋势主要有:更小的卷积核尺寸,更小的卷积步长,更深的网络结构。这些趋势证明了在提升识别性能上是有效的。
Comparison.将我们的方法与最近流行的视频行为识别方法进行了对比,结果如下:
论文中主要提出了Very deep two-stream ConvNets方法,由于数据集规模较小,我们提出了一些好的实践来训练Very deep two-Stream 网络。基于这些训练方法和技巧,Very deep two-stream ConvNets在UCF-101数据集上获得了91.4%的精度。与此同时,我们扩展了Caffe 工具箱到多GPU方案中,从而获得了更高的效率和更少的内存消耗。
[1] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li.
ImageNet: A large-scale hierarchical image database.
In CVPR, pages 248–255, 2009. 1
[2] A. Gorban, H. Idrees, Y.-G. Jiang, A. Roshan Zamir,
I. Laptev, M. Shah, and R. Sukthankar. THUMOS
challenge: Action recognition with a large number of
classes. http://www.thumos.info/, 2015. 3,
4
[3] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep
into rectifiers: Surpassing human-level performance
on imagenet classification. CoRR, abs/1502.01852,
2015. 3
[4] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long,
R. B. Girshick, S. Guadarrama, and T. Darrell. Caffe:
Convolutional architecture for fast feature embedding.
CoRR, abs/1408.5093. 1, 3
[5] Y.-G. Jiang, J. Liu, A. Roshan Zamir, I. Laptev,
M. Piccardi, M. Shah, and R. Sukthankar. THUMOS
challenge: Action recognition with a large number of
classes, 2013. 3
[6] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar,
and L. Fei-Fei. Large-scale video classification
with convolutional neural networks. In CVPR,
pages 1725–1732, 2014. 4
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet
classification with deep convolutional neural
networks. In NIPS, pages 1106–1114, 2012. 1, 2
[8] Z. Lan, M. Lin, X. Li, A. G. Hauptmann, and B. Raj.
Beyond gaussian pyramid: Multi-skip feature stacking
for action recognition. In CVPR, pages 204–212,
2015. 1, 4
[9] J. Y.-H. Ng, M. Hausknecht, S. Vijayanarasimhan,
O. Vinyals, R. Monga, and G. Toderici. Beyond short
snippets: Deep networks for video classification. In
CVPR, pages 4694–4702, 2015. 1, 4
[10] X. Peng, L. Wang, X. Wang, and Y. Qiao. Bag of
visual words and fusion methods for action recognition:
Comprehensive study and good practice. CoRR,
abs/1405.4506, 2014. 4
[11] J. S´anchez, F. Perronnin, T. Mensink, and J. J. Verbeek.
Image classification with the fisher vector: Theory
and practice. International Journal of Computer
Vision, 105(3):222–245, 2013. 1
[12] K. Simonyan and A. Zisserman. Two-stream convolutional
networks for action recognition in videos. In
NIPS, pages 568–576, 2014. 1, 2, 3, 4
[13] K. Simonyan and A. Zisserman. Very deep convolutional
networks for large-scale image recognition.
CoRR, abs/1409.1556, 2014. 1, 2, 3
[14] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A
dataset of 101 human actions classes from videos in
the wild. CoRR, abs/1212.0402, 2012. 1, 3
[15] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed,
D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich.
Going deeper with convolutions. CoRR,
abs/1409.4842, 2014. 1, 2
[16] H. Wang and C. Schmid. Action recognition with improved
trajectories. In ICCV, pages 3551–3558, 2013.
1, 4
[17] L. Wang, Y. Qiao, and X. Tang. Mining motion atoms
and phrases for complex action recognition. In ICCV,
pages 2680–2687, 2013. 1
[18] L.Wang, Y. Qiao, and X. Tang. Motionlets: Mid-level
3D parts for human motion recognition. In CVPR,
pages 2674–2681, 2013. 1
[19] L. Wang, Y. Qiao, and X. Tang. Action recognition
with trajectory-pooled deep-convolutional descriptors.
In CVPR, pages 4305–4314, 2015. 1, 3, 4
[20] L. Wang, Z. Wang, Y. Xiong, and Y. Qiao.
CUHK&SIAT submission for THUMOS15 action
recognition challenge. In THUMOS’15 Action Recognition
Challenge, 2015. 3, 4
[21] C. Zach, T. Pock, and H. Bischof. A duality based approach
for realtime tv-L1 optical flow. In 29th DAGM
Symposium on Pattern Recognition, 2007. 3
[22] M. D. Zeiler and R. Fergus. Visualizing and understanding
convolutional networks. In ECCV, pages
818–833, 2014. 2