Paper reading: Learnable pooling with Context Gating for video classification

This paper is about video classification. The method won first place in a competition in 2016; v1 of the paper appeared in 2017 and v2 in 2018.

Background: video classification

  1. Extracting strong features from video: extract spatio-temporal features that describe the video well. The stronger the features, the better the model's classification and recognition performance.

  2. Feature encoding and fusion: this covers both spatial and temporal features. In the spatial domain, multiple kinds of spatial features must be encoded and fused. In the temporal domain, some actions cannot be recognized from a single frame and can only be judged from changes over time, so temporal features must be encoded and fused to obtain a holistic description of the video. Finally, spatial and temporal features need to be fused jointly to achieve better results.

  3. Efficient algorithms: model size, training time, and inference speed all matter. The more efficient the algorithm, the more likely it can be deployed in real-world applications.

This paper

Main contributions:
1. A two-stream architecture aggregating audio and visual features
2. Clustering-based aggregation layers
3. A learnable non-linear unit, named Context Gating, aiming to model interdependencies among network activations

Pipeline:

[Figure 1: overall architecture of the model]

1. The input features are extracted from video and audio signals.

2. The pooling module aggregates the extracted features into a single compact (e.g. 1024-dimensional) representation for the entire video.

2.1. This pooling module has a two-stream architecture treating visual and audio features separately.

2.2. The aggregated representation is then enhanced by the Context Gating layer.

3. The classification module takes the resulting representation as input and outputs scores for a pre-defined set of labels.
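The three steps above can be sketched as follows. Note that the paper's pooling module is a learnable clustering-based layer, whereas this minimal NumPy sketch uses a simple average over frames as a stand-in; all dimensions here are illustrative, not the paper's:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pool(frames):
    # Stand-in for the paper's learnable clustering-based pooling:
    # a simple average over time aggregates per-frame features.
    return frames.mean(axis=0)

def classify_video(visual_frames, audio_frames, W_cls, b_cls):
    # Two-stream pooling: visual and audio features are aggregated
    # separately, then concatenated into one compact video descriptor.
    rep = np.concatenate([pool(visual_frames), pool(audio_frames)])
    # Multi-label output: an independent sigmoid score per label.
    return sigmoid(W_cls @ rep + b_cls)

# Toy dimensions (hypothetical): 10 frames, 8-dim visual features,
# 4-dim audio features, 5 output labels.
rng = np.random.default_rng(0)
visual = rng.standard_normal((10, 8))
audio = rng.standard_normal((10, 4))
W_cls = rng.standard_normal((5, 12))
b_cls = np.zeros(5)
scores = classify_video(visual, audio, W_cls, b_cls)
```

Each score is a per-label probability in (0, 1), matching the multi-label setting of video classification benchmarks.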

Context Gating
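The Context Gating layer transforms an input feature vector X into Y = σ(WX + b) ∘ X, where σ is the element-wise sigmoid and ∘ is element-wise multiplication: each activation is re-weighted by a gate computed from the whole input, which lets the layer model interdependencies among activations. A minimal NumPy sketch (the dimension and random initialization are illustrative, not from the paper):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def context_gating(x, W, b):
    # Context Gating: y = sigmoid(W x + b) * x (element-wise).
    # Gates lie in (0, 1), so each activation is kept or suppressed
    # depending on the context provided by the full input vector.
    gates = sigmoid(W @ x + b)
    return gates * x

# Toy 4-dimensional feature (hypothetical size).
rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W = rng.standard_normal((4, 4))
b = np.zeros(4)
y = context_gating(x, W, b)
```

Because the gates are strictly between 0 and 1, the output preserves each activation's sign and can only shrink its magnitude; in the paper the same gating is applied both after pooling and after classification.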
