Paper Reading: Self-supervised spatio-temporal representation learning for videos by predicting motion and appearance statistics

Contents

Contributions

Method

1. Partitioning patterns

2. Motion Statistics

3. Appearance Statistics

Results

Paper title: Self-supervised spatio-temporal representation learning for videos by predicting motion and appearance statistics (CVPR 2019)

Authors: Jiangliu Wang, Jianbo Jiao, Linchao Bao, Shengfeng He, Yunhui Liu, Wei Liu

Download link: https://openaccess.thecvf.com/content_CVPR_2019/html/Wang_Self-Supervised_Spatio-Temporal_Representation_Learning_for_Videos_by_Predicting_Motion_and_CVPR_2019_paper.html


Contributions

In this paper, we propose a novel self-supervised approach to learning spatio-temporal features for video representation. Inspired by the success of two-stream approaches in video classification, we propose to learn visual features by regressing both motion and appearance statistics along the spatial and temporal dimensions, given only the input video data. Specifically, given a video sequence, we design a novel task that predicts several numerical labels derived from motion and appearance statistics, in a self-supervised manner. Each video frame is first divided into several spatial regions using different partitioning patterns, such as a grid. The derived statistical labels are then employed as supervision during learning: the region with the largest motion and its direction (the red patch in the figure below), the most diverse region in appearance and its dominant color (the yellow patch), and the most stable region in appearance and its dominant color (the blue patch). We conduct extensive experiments with C3D to validate the effectiveness of the proposed approach, and the results show that it significantly improves the performance of C3D on video classification tasks.

[Figure 1]


Method

Inspired by the human visual system, we break the process of understanding videos into several questions and encourage a CNN to answer them accordingly:

  1. Where is the largest motion in the video?
  2. What is the dominant direction of the largest motion?
  3. Where is the largest color diversity and what is its dominant color?
  4. Where is the smallest color diversity, i.e., the potential background of a scene, and what is its dominant color?

[Figure 2]


1. Partitioning patterns

Given a video clip, we first divide it into several non-overlapping blocks using simple patterns, as sketched below.
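As a rough illustration, here is a minimal Python/NumPy sketch of a grid-style partition, assuming the clip is an array of shape (N, H, W, C); the 2×2 layout and the function name `partition_grid` are hypothetical, and the paper's actual patterns (e.g. different grid sizes) may differ:

```python
import numpy as np

def partition_grid(clip, rows=2, cols=2):
    """Split a video clip of shape (N, H, W, C) into non-overlapping
    spatial blocks; each block keeps the full temporal extent."""
    n, h, w, c = clip.shape
    bh, bw = h // rows, w // cols
    blocks = []
    for r in range(rows):
        for col in range(cols):
            blocks.append(clip[:, r * bh:(r + 1) * bh,
                               col * bw:(col + 1) * bw, :])
    return blocks  # len(blocks) == rows * cols

# Example: a random 16-frame RGB clip split into 4 spatial blocks.
clip = np.random.rand(16, 112, 112, 3)
blocks = partition_grid(clip)
```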

[Figure 3]


2. Motion Statistics

We use optical flow to derive the motion statistical labels to be predicted in our task. Specifically, for an N-frame video clip, (N − 1) × 2 motion boundaries are computed (one per flow component for each pair of consecutive frames). The diverse motion information of the video can then be encoded into two summarized motion boundaries by summing up the (N − 1) sparse motion boundaries of each component as follows:

$$M_u = \sum_{i=1}^{N-1} m_u^i, \qquad M_v = \sum_{i=1}^{N-1} m_v^i$$

where $M_u$ denotes the summarized motion boundaries on the horizontal optical flow component $u$, $M_v$ those on the vertical component $v$, and $m_u^i$, $m_v^i$ are the per-frame motion boundaries (spatial gradients of the flow) between frames $i$ and $i+1$.
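Below is a minimal sketch of how the summarized motion boundaries could be computed, assuming the (N − 1) optical-flow fields have already been estimated by some external method; treating the motion boundaries as spatial gradients of each flow component, and combining them into per-pixel magnitudes, is an illustrative assumption rather than the paper's exact recipe:

```python
import numpy as np

def summarized_motion_boundaries(flows):
    """flows: list of (H, W, 2) optical-flow fields between consecutive
    frames of an N-frame clip, i.e. N - 1 fields with components (u, v).
    Returns (M_u, M_v): summed motion-boundary magnitudes per pixel."""
    h, w, _ = flows[0].shape
    m_u = np.zeros((h, w))
    m_v = np.zeros((h, w))
    for flow in flows:
        u, v = flow[..., 0], flow[..., 1]
        # Motion boundaries: spatial gradients of each flow component.
        du_y, du_x = np.gradient(u)
        dv_y, dv_x = np.gradient(v)
        m_u += np.hypot(du_x, du_y)  # accumulate over the N - 1 frames
        m_v += np.hypot(dv_x, dv_y)
    return m_u, m_v
```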

[Figure 4]

For the largest-motion statistics, we compute the average motion-boundary magnitude of each block and take the index of the block with the largest average magnitude as the largest-motion location. We then divide 360° into 8 bins, each covering a 45° angle range, and assign each bin a number to represent its orientation. For each pixel in the largest-motion block, we first use its orientation angle to determine which angle bin it belongs to, and then accumulate the pixel's magnitude into that bin. The dominant orientation is the number of the angle bin with the largest magnitude sum; a sketch of both labels follows.
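A sketch of how these two labels could be derived, assuming per-pixel `magnitude` and `angle_deg` maps obtained from the summarized motion boundaries; the 2×2 block layout matches the earlier hypothetical partition, and the magnitude-weighted 8-bin histogram follows the description above:

```python
import numpy as np

def largest_motion_labels(magnitude, angle_deg, rows=2, cols=2):
    """magnitude, angle_deg: (H, W) maps of motion-boundary magnitude
    and orientation (0..360 degrees). Returns (index of the block with
    the largest average magnitude, dominant 45-degree orientation bin)."""
    h, w = magnitude.shape
    bh, bw = h // rows, w // cols
    # 1) Block with the largest average motion-boundary magnitude.
    means = [magnitude[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw].mean()
             for r in range(rows) for c in range(cols)]
    loc = int(np.argmax(means))
    # 2) Magnitude-weighted 8-bin orientation histogram in that block.
    r, c = divmod(loc, cols)
    mag = magnitude[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]
    ang = angle_deg[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]
    bins = (ang % 360) // 45  # each bin spans 45 degrees -> 0..7
    hist = np.bincount(bins.astype(int).ravel(),
                       weights=mag.ravel(), minlength=8)
    return loc, int(np.argmax(hist))
```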

[Figure 5]


3. Appearance Statistics

Given an N-frame video clip, we divide it into several video blocks using the patterns described above, in the same way as for the motion statistics. For each video block, we first compute the color distribution $V_i$ of each frame $i$ in 3D color space. We then use the Intersection over Union (IoU) of these distributions along the temporal axis to quantify the spatio-temporal color diversity as follows:

$$\mathrm{IoU} = \frac{\left|\bigcap_{i=1}^{N} V_i\right|}{\left|\bigcup_{i=1}^{N} V_i\right|}$$

The largest color diversity location is the block with the smallest IoU score, while the smallest color diversity location is the block with the largest IoU score. In practice, we calculate the IoU score on the R, G, and B channels separately and average them to obtain the final score. For the dominant color, we evenly divide the 3D RGB color space into 8 bins. For the two selected video blocks, we assign each pixel a bin number according to its RGB value, and the bin containing the largest number of pixels gives the dominant color. A sketch of both computations is given below.
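A rough sketch of these appearance labels, under the assumption that each $V_i$ is realized as a per-frame color histogram, so that temporal intersection and union become element-wise minima and maxima; the per-channel bin count in `iou_score` is an illustrative choice, while `dominant_color_bin` uses the 8-bin division of the RGB cube (2 bins per channel) described above:

```python
import numpy as np

def iou_score(block, bins=8):
    """block: (N, H, W, 3) uint8 video block. Computes the temporal IoU
    of per-frame color histograms on R, G, B separately and averages
    them (the per-channel bin count is an assumption here)."""
    scores = []
    for ch in range(3):
        hists = np.stack([np.histogram(frame[..., ch], bins=bins,
                                       range=(0, 256))[0]
                          for frame in block])
        inter = hists.min(axis=0).sum()  # intersection along time
        union = hists.max(axis=0).sum()  # union along time
        scores.append(inter / max(union, 1))
    return float(np.mean(scores))

def dominant_color_bin(block):
    """Assign every pixel to one of 8 bins of the RGB cube
    (2 bins per channel) and return the most populated bin."""
    q = (block >= 128).astype(int)          # 0/1 per channel
    idx = q[..., 0] * 4 + q[..., 1] * 2 + q[..., 2]
    return int(np.bincount(idx.ravel(), minlength=8).argmax())
```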


Results

[Figure 6]
