Comprehensive Video Understanding: Video summarization with content-based video recommender design
Paper: https://arxiv.org/pdf/1910.13888.pdf
Abstract
Video summarization aims to extract keyframes/shots from a long video. Previous methods mainly take the diversity and representativeness of generated summaries as prior knowledge in algorithm design. In this paper, we formulate video summarization as a content-based recommender problem, which should distill the most useful content from a long video for users who suffer from information overload. A scalable deep neural network is proposed to predict whether a video segment is useful for users by explicitly modelling both the segment and the video. Moreover, we perform scene and action recognition in untrimmed videos in order to find more correlations among different aspects of video understanding tasks. We also discuss the effect of audio and visual features in the summarization task, and extend our work with data augmentation and multi-task learning to prevent the model from early-stage overfitting. Our final model won first place in the ICCV 2019 CoView Workshop challenge track.
1. Introduction
As information overload becomes a more and more serious problem in modern society, many efficient tools have been designed to overcome information anxiety. Videos, the fastest-growing information carrier, will account for more than 80% of all Internet traffic by 2020 [3]. Video summarization aims to address this problem by extracting keyframes/shots that contain the most useful content from a long video. It serves as a vital way of comprehensively understanding video data in research while saving viewers' time on information acquisition. This technology has gained the attention of both academia and industry: Adobe has already shipped a video summarization feature in its video editing products, and some cloud computing providers, such as Microsoft Azure and AliCloud, offer this capability as an online service.
One of the main challenges in video summarization is subjectivity: different people may select different key shots for the same video. We investigate the two most referenced datasets, SumMe [6] and TVSum [15], both of which provide manually curated consistency benchmarks, with an F1 score of 0.31 for SumMe and 0.36 for TVSum. Based on this observation, we do not take diversity, representativeness, or other subjective attributes into consideration; instead, we focus solely on the key shots and the full video. The mission of our algorithm is to find the video segments that are most attractive to a group of annotators/users when they watch a long video.
Content-based recommender modeling is one of the most important technologies in recommendation systems, especially for alleviating the cold-start problem: it recommends items whose content is similar to what users like. It is usually formalized as a similarity learning problem [13]. A content-based video recommender deep model learns compact representations of videos and builds a bridge between user feedback and video semantic information. We develop and deploy HighlightNet to learn annotators' preferences; the model is supervised by segment importance feedback. Under this framework, it can easily combine different inputs, such as raw features, high-level features, audio features, and visual features. The holistic picture of our work is shown in Figure 1.
As video summarization appeals to many researchers, many state-of-the-art approaches have been proposed for this problem [3, 24-26]. Generally, they treat video summarization as a sequence-to-sequence learning problem. Since RNNs and their variants LSTM and GRU are very efficient at modeling long-term dependencies under encoder-decoder architectures, many works in machine translation, image/video captioning, and reading comprehension adopt these technologies. In this paper, GRU modeling is adopted in order to consider both the isolated segment and its role in the whole video.
Video summarization, as one of the comprehensive video understanding tasks, is extremely difficult and requires a large amount of data in a deep learning architecture. However, the collection of such summarization labels is time-consuming and labor-intensive, resulting in insufficient datasets. Since supervised learning can easily overfit on small data, many methods have been explored to alleviate this issue. Many state-of-the-art video summarization works address this problem with unsupervised learning, semi-supervised learning, or multi-task learning [25, 26, 28]. We explore self-supervised learning on video sequence modeling, jointly trained with segment importance score prediction. Another way to confront overfitting is to shuffle information flows, which potentially improves the model's generalization ability. The contributions of our work are:
[1] We unify various inputs at different semantic levels into one framework by formulating video summarization as a recommender problem.
[2] We develop an algorithm that models both independent segments and the segment sequence (the whole video).
[3] We extend the summarization framework with self-supervised learning and data augmentation to deal with the lack of labelled data.
2. Related Work
Video classification, a fundamental task for video understanding, has been studied extensively. Many high-quality datasets [1] [10] [11] [16] have been published, driving the research to a higher level. Supervised learning has brought algorithms close to human performance on large-scale video classification tasks [4, 7, 9, 18-21].
A significant number of deep learning based frameworks have been explored recently for video summarization [3, 22, 24, 26, 28]. K. Zhang et al. creatively applied LSTMs to supervised video sequence labelling to model video temporal information with good performance [24]. Jiri Fajtl et al. introduced self-attention instead of RNNs to improve computational efficiency [3]. K. Zhou et al. showed that fully unsupervised learning can outperform many supervised methods by considering diversity and representativeness in a reinforcement learning-based framework [28]. Y. Zhang et al. introduced an adversarial loss for video summarization, learning a dilated temporal relational generator and a discriminator with a three-player loss [26].
Researchers believe that learning good visual representations helps deep neural networks improve both fitting and generalization ability, especially in classification tasks [5, 12, 14, 17]. Some video summarization works adopt unsupervised learning as an auxiliary task to improve the performance of supervised learning systems. K. Zhang et al. use a retrospective encoder that embeds the predicted summary and the original video into the same abstract semantic space with a closer distance; this semi-supervised setting helps increase the F-score on the TVSum dataset from 63.9% to 65.2% [25]. K. Zhou et al., on the other hand, extend their reinforcement learning-based framework to a semi-supervised style and elevate performance on both SumMe and TVSum compared to purely supervised or unsupervised methods [28].
For the subjectivity of the video summarization task, Kanehira et al. provide a solution by building a summary that depends on the particular aspect of a video the viewer focuses on [8]. Joonseok Lee et al. from Google published a content-only video recommendation system, which we regard as work very close to video summarization [13].
3. Approach
In this section, we detail our method for the CoView 2019 comprehensive video understanding challenge track. First, we briefly introduce our work on action and scene recognition in untrimmed video. Second, we formalize the summarization problem and present our solution based on a deep neural network. Third, we discuss some important techniques for preventing the DNN from early-stage overfitting.
3.1. Action and Scene Recognition in Untrimmed Video
We first adopt I3D with non-local blocks [2] [21] for video classification in this subtask, with a ResNet-101 backbone. We tried two ways of training (a sketch of the shared-backbone setting follows the list):
1. Action and scene recognition share the same backbone, with one SoftMax loss branch for each classification task.
2. We train one model each for action and scene recognition.
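The first setting can be summarized as a shared backbone with two classification heads trained jointly. The following is a minimal sketch under assumed dimensions; the placeholder backbone stands in for the I3D + non-local network, which is not reproduced here.

```python
# Minimal sketch of the shared-backbone setting (option 1), not the authors' exact code.
import torch
import torch.nn as nn

class TwoHeadClassifier(nn.Module):
    def __init__(self, feat_dim=2048, num_actions=99, num_scenes=79):
        super().__init__()
        # placeholder backbone; the paper uses I3D + non-local blocks on ResNet-101
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(64, feat_dim), nn.ReLU(),
        )
        self.action_head = nn.Linear(feat_dim, num_actions)  # SoftMax branch 1
        self.scene_head = nn.Linear(feat_dim, num_scenes)    # SoftMax branch 2

    def forward(self, clip):                 # clip: (B, 3, T, H, W)
        feat = self.backbone(clip)
        return self.action_head(feat), self.scene_head(feat)

model = TwoHeadClassifier()
criterion = nn.CrossEntropyLoss()
clip = torch.randn(2, 3, 32, 112, 112)
action_gt, scene_gt = torch.tensor([3, 17]), torch.tensor([5, 42])
action_logits, scene_logits = model(clip)
# joint training: sum of the two SoftMax (cross-entropy) losses
loss = criterion(action_logits, action_gt) + criterion(scene_logits, scene_gt)
loss.backward()
```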
Since the training data comes from only 1000 videos, the scene data for training is too scarce and lacks variation for recognition. There are many public image scene classification datasets for image scene recognition, such as Places365 [27] and SUN [23]. We suppose the video scene classification task can be trained as image classification without much information loss, so we can easily make full use of public datasets to potentially improve the model's robustness. We use average pooling over the image-level classification predictions to obtain the final video-level result. To capture the advantages of both the video-based and the image-based model, we ensemble the output scores of the two models with logistic regression; a sketch of both steps follows.
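A minimal sketch of frame-to-video average pooling and the logistic-regression ensemble; shapes, class counts, and the toy data are illustrative assumptions, not values from the paper.

```python
# Sketch only: frame-level scene predictions averaged to video level, then a
# logistic-regression ensemble of the image-based and video-based model scores.
import numpy as np
from sklearn.linear_model import LogisticRegression

def video_level_scene_scores(frame_probs: np.ndarray) -> np.ndarray:
    """frame_probs: (num_frames, num_scene_classes) softmax outputs of the 2D CNN."""
    return frame_probs.mean(axis=0)           # average pooling over frames

# toy validation data: per-video scores from both models + ground-truth labels
rng = np.random.default_rng(0)
img_scores = rng.random((200, 79))            # image-based model, 79 scene classes
vid_scores = rng.random((200, 79))            # video-based (I3D + non-local) model
labels = rng.integers(0, 79, size=200)

# ensemble: logistic regression on the concatenated scores of the two models
X = np.concatenate([img_scores, vid_scores], axis=1)
ensemble = LogisticRegression(max_iter=1000)
ensemble.fit(X, labels)
final_pred = ensemble.predict(X)
```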
3.2. Video Summarization Framework
We take the video summarization task as a content-based recommender problem. Let

$$X_m \in \mathbb{R}^{n_m \times d_s}$$

be the feature matrix for a video $m$ with $n_m$ segments, each represented by a $d_s$-dimensional feature vector.
For each segment we have a mean importance score as users' feedback; for video $m$ the feedback vector can be written as

$$y_m = [y_{m,1}, y_{m,2}, \dots, y_{m,n_m}]^\top,$$

where $y_{m,i}$ is the importance score of the $i$-th segment averaged over annotators.
Our goal is to select the segments with the Top-$k$ highest predicted importance scores from the video as the summarization set $S_m$. The problem can therefore be defined as finding a ranking function $f$ that predicts the segment importance scores over the video segment sequence, $\hat{y}_m = f(X_m)$. A loss function can be defined by the mean-square error (MSE),

$$L(f) = \frac{1}{M} \sum_{m=1}^{M} \big\lVert f(X_m) - y_m \big\rVert_2^2,$$

where $M$ is the number of videos in the training set. Our algorithm then finds the optimal function

$$f^* = \arg\min_{f \in \mathcal{F}} L(f)$$

that minimizes the overall loss in the prediction model space $\mathcal{F}$.
Since an individual segment feature lacks information from the whole video sequence as the input of $f$, we produce an augmented matrix $\tilde{X}_m$ defined by

$$\tilde{X}_m = X_m \oplus V_m,$$

where $\oplus$ is the operator that concatenates each row of $X_m$ with the corresponding row of $V_m$, and $V_m \in \mathbb{R}^{n_m \times d_v}$ is a matrix that broadcasts a $d_v$-dimensional whole-video feature to each segment; it can also be taken as a learnable function $V_m = g(X_m)$. In this work, we set the same value for each row of $V_m$.
We can also obtain sequence descriptors for one segment, such as image frames and audio frames; a segment $i$ of video $m$ is then described by its frame feature sequence

$$S_{m,i} = [s_{m,i,1}, s_{m,i,2}, \dots, s_{m,i,T}] \in \mathbb{R}^{T \times d_f}.$$
Another learnable function $h$ is used to map the frame sequence features from $\mathbb{R}^{T \times d_f}$ to the $\mathbb{R}^{d_s}$ space. We fuse the frame-based feature and the segment-based feature with a learnable function $\phi$ into the final segment-level feature

$$x_{m,i} = \phi\big(h(S_{m,i}),\, x_{m,i}^{seg}\big).$$

We can then optimize $f$, $g$, $h$, and $\phi$ in one framework by minimizing $L$.
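To make the formulation concrete, the following is a minimal sketch of the objective under assumed dimensions; the linear scorer and mean pooling are placeholders for the learnable functions above, not the networks described below.

```python
# Illustrative sketch of the recommender formulation; tensor shapes and the
# simple linear modules are assumptions, not the paper's SegNet/VideoNet/HighlightNet.
import torch
import torch.nn as nn

d_s, d_v, n_m = 256, 256, 40                      # segment dim, video dim, #segments
X_m = torch.randn(n_m, d_s)                       # segment feature matrix of one video
y_m = torch.rand(n_m)                             # mean importance scores (user feedback)

g = nn.Linear(d_s, d_v)                           # learnable whole-video feature g(X_m)
f = nn.Linear(d_s + d_v, 1)                       # ranking function on augmented features

v_m = g(X_m).mean(dim=0, keepdim=True)            # same video-level row for every segment
X_tilde = torch.cat([X_m, v_m.expand(n_m, -1)], dim=1)   # X_m concatenated with V_m

y_hat = f(X_tilde).squeeze(-1)                    # predicted importance scores
loss = nn.functional.mse_loss(y_hat, y_m)         # MSE loss over one video
loss.backward()

k = 6                                             # Top-k segments form the summary
summary_idx = torch.topk(y_hat, k).indices
```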
We design a video summarization network that consists of three subnetworks: SegNet learns the functions $h$ and $\phi$, VideoNet learns $g$, and HighlightNet learns $f$, as shown in Figure 2.
As we utilize ImageNet classification features for each sampled video frame, SegNet combines the frame feature sequence into one frame-based vector and also fuses the frame-based feature with the segment-level feature. SegNet is designed to process either 2D or 3D frame sequences: for 2D features we adopt temporal convolution and pooling to squeeze the temporal dimension, and for 3D features we use three 3D convolution blocks. The 2D convolution blocks adopt a bottleneck structure for the spatial fully connected layers (Figure 3); a sketch of the 2D branch is given below.
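A rough sketch of such a 2D branch follows; layer widths and the exact fusion step are assumptions, since Figure 3 is not reproduced here.

```python
# Sketch of a SegNet-like 2D branch: temporal convolution + pooling squeezes the time
# axis, a bottleneck FC acts on the feature axis, and the result is fused with the
# segment-level (video classifier) feature. Dimensions are assumptions.
import torch
import torch.nn as nn

class SegNet2DBranch(nn.Module):
    def __init__(self, frame_dim=1024, seg_dim=2048, out_dim=256, bottleneck=128):
        super().__init__()
        # temporal convolution over the frame sequence, then pooling over time
        self.temporal = nn.Sequential(
            nn.Conv1d(frame_dim, out_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        # bottleneck structure for the fully connected layers
        self.bottleneck = nn.Sequential(
            nn.Linear(out_dim, bottleneck), nn.ReLU(), nn.Linear(bottleneck, out_dim),
        )
        # fusion of the frame-based vector with the segment-level feature
        self.fuse = nn.Linear(out_dim + seg_dim, out_dim)

    def forward(self, frame_feats, seg_feat):
        # frame_feats: (B, T, frame_dim), seg_feat: (B, seg_dim)
        x = self.temporal(frame_feats.transpose(1, 2)).squeeze(-1)  # (B, out_dim)
        x = self.bottleneck(x)
        return torch.relu(self.fuse(torch.cat([x, seg_feat], dim=1)))

seg_net = SegNet2DBranch()
out = seg_net(torch.randn(4, 16, 1024), torch.randn(4, 2048))  # -> (4, 256)
```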
For each long video, the segment feature sequence generated by SegNet is taken as the input to VideoNet for video-level modeling. VideoNet uses a Bi-GRU for sequence encoding, and the final GRU hidden state, a fixed-length context vector, is taken as the output of VideoNet representing the video-level feature.
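A minimal Bi-GRU encoder in this spirit could look like the following; the hidden size is an assumption.

```python
# Minimal Bi-GRU video encoder in the spirit of VideoNet; layer sizes are assumptions.
import torch
import torch.nn as nn

class VideoNet(nn.Module):
    def __init__(self, seg_dim=256, hidden=128):
        super().__init__()
        self.gru = nn.GRU(seg_dim, hidden, num_layers=1,
                          batch_first=True, bidirectional=True)

    def forward(self, seg_seq):
        # seg_seq: (B, n_segments, seg_dim) from SegNet
        _, h_n = self.gru(seg_seq)                   # h_n: (2, B, hidden)
        # concatenate final forward and backward hidden states -> fixed-length vector
        return torch.cat([h_n[0], h_n[1]], dim=1)    # (B, 2 * hidden)

video_net = VideoNet()
video_feat = video_net(torch.randn(2, 40, 256))      # -> (2, 256)
```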
Each segment-level feature is concatenated with the video-level feature of the same long video to form the final segment representation. This representation contains not only the information of the independent video segment but also that of the whole video sequence, and it is passed to the highlight subnetwork to predict the segment importance score. Figure 4 shows the structure of the highlight subnetwork.
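A sketch of such a regression head is shown below; the MLP widths are assumptions, and Figure 4 in the paper gives the actual structure.

```python
# Sketch of a HighlightNet-style head: an MLP mapping the concatenated segment-level
# + video-level feature to one importance score per segment.
import torch
import torch.nn as nn

class HighlightNet(nn.Module):
    def __init__(self, seg_dim=256, video_dim=256, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(seg_dim + video_dim, hidden), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(hidden, 1),
        )

    def forward(self, seg_feats, video_feat):
        # seg_feats: (B, n_segments, seg_dim); video_feat: (B, video_dim)
        n = seg_feats.size(1)
        video_rep = video_feat.unsqueeze(1).expand(-1, n, -1)   # repeat for each segment
        x = torch.cat([seg_feats, video_rep], dim=-1)
        return self.mlp(x).squeeze(-1)                          # (B, n_segments) scores

highlight_net = HighlightNet()
scores = highlight_net(torch.randn(2, 40, 256), torch.randn(2, 256))
```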
3.3. Multi-task Learning and Data Augmentation for Video Summarization Network
We consider an auxiliary self-supervised task to better model the video sequence. Before a video is fed to the network, some of its segments are selected with a fixed proportion; we shuffle the selected segments and train the network to identify the odd-position segments, as shown in Figure 5 (a sketch of this shuffling step follows the list below). This operation assumes that a good video sequence encoder is able to model the correct segment order. We control the difficulty of the task by only indicating the odd-position segments rather than sorting the shuffled sequence back into the right order. Parameters $p$ and $\lambda$ are used to adjust the shuffled segment proportion and the weight of the self-supervised task in multi-task learning. The final loss can be defined by

$$L_{total} = L_{MSE} + \lambda\, L_{self}.$$

The advantages of applying multi-task learning are:
- Learning several tasks simultaneously can suppress early-stage overfitting by sharing the same representations.
- The auxiliary task helps provide more useful information.
- Our method implicitly performs data augmentation, since we shuffle the input video sequence, which may improve the robustness of the algorithm.
- This method utilizes unlabelled data for video sequence modeling.
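The shuffling and odd-position labeling can be sketched as follows; this is our reading of Figure 5 with an assumed selection proportion, not the authors' code.

```python
# Sketch of the self-supervised shuffling step: a proportion p of segments is selected
# and permuted, and the binary target marks segments that ended up out of position.
import random

def shuffle_segments(segment_feats, p=0.15, seed=None):
    """segment_feats: list of per-segment features (len >= 2); returns shuffled list + 0/1 labels."""
    rng = random.Random(seed)
    n = len(segment_feats)
    k = max(2, int(round(p * n)))                 # need at least two segments to shuffle
    chosen = sorted(rng.sample(range(n), k))      # positions whose segments get shuffled
    permuted = chosen[:]
    while permuted == chosen:                     # ensure at least one displacement
        rng.shuffle(permuted)
    order = list(range(n))
    for src, dst in zip(chosen, permuted):
        order[dst] = src                          # segment from position src now sits at dst
    shuffled = [segment_feats[i] for i in order]
    odd_position = [1 if order[i] != i else 0 for i in range(n)]   # per-segment target
    return shuffled, odd_position

feats = [f"seg_{i}" for i in range(10)]
shuffled, labels = shuffle_segments(feats, p=0.3, seed=0)
```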
We also propose a data augmentation method that, each time the whole video is modeled, uses only a portion of the information from each segment. After modeling all portions of the segments, we average the embeddings from the different portions into one vector (Figure 6).
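One way to read Figure 6 is to split each segment feature into equal chunks, encode each chunk separately, and average the results; the sketch below follows that assumption, with `video_model` standing in for the sequence encoder.

```python
# Sketch of the feature-portion augmentation as we understand it: split each segment
# feature into equal chunks, run the model once per chunk, and average the embeddings.
import torch
import torch.nn as nn

def portioned_embedding(seg_feats, video_model, num_portions=4):
    """seg_feats: (B, n_segments, d), d divisible by num_portions; returns averaged embeddings."""
    chunks = torch.chunk(seg_feats, num_portions, dim=-1)   # portions of each segment feature
    embeddings = [video_model(c) for c in chunks]           # model each portion separately
    return torch.stack(embeddings, dim=0).mean(dim=0)       # average to one embedding

video_model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 128))
seg_feats = torch.randn(2, 40, 256)
avg_emb = portioned_embedding(seg_feats, video_model)       # (2, 40, 128) here
```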
4. Experiments
In this section, we present our comparison experiments on the CoView 2019 dataset. The dataset consists of 1200 videos for training and 300 for testing, sampled from YouTube-8M, Dense-Captioning, and video summarization datasets. Every video in the dataset is segmented into 5-second-long segments, and 20 users were asked to annotate an importance score and one of 99 action / 79 scene labels for each segment. The average importance score and the most-voted action/scene labels are provided as segment-level ground truth. Before any experiments, we randomly split the videos into training (1000) and validation (200) sets based on the original videos rather than the 5-second segments.
4.1. Scene and Action Classification in Untrimmed Video
For the scene/action classification task, top-5 accuracies on the validation set are shown in Table 1: a. jointly trained I3D + non-local attention with two SoftMax branches (JT); b. independently trained I3D + non-local models (IT); c. independently trained I3D + non-local models with external action classification data (AED); d. ResNet-50 with external scene image classification data (SED).
We train I3D + non-local from a model pre-trained on Kinetics, and ResNet-50 pre-trained on ImageNet. The results indicate: 1. Although the scene and action classification tasks may have an inner connection, neither joint training nor ensembling improved the results. 2. The purely 2D convolution-based network is less accurate than the 3D convolution-based network on both scene and action recognition. 3. External data did not improve classification accuracy on the validation set, but we still use it for potential gains in model generalization.
4.2. Video Summarization
We investigate our summarization network at different parameter scales, with visual and visual-audio feature inputs, with single-task and multi-task learning, and with and without data augmentation.
Evaluation Protocol. The video summarization evaluation metric compares the sum of importance scores of the selected segments against that of the ground-truth summary segments. The importance score metric is defined as

$$IS = \frac{1}{K} \sum_{k=1}^{K} \frac{\sum_{i=1}^{N_s} s^{sub}_{k,i}}{\sum_{i=1}^{N_s} s^{gt}_{k,i}},$$

where $K$ is the number of test videos, $N_s$ is the number of summarized segments, $s^{gt}_{k,i}$ is the importance score of the $i$-th segment of the ground-truth summary of the $k$-th video, and $s^{sub}_{k,i}$ is the importance score of the $i$-th segment of the submitted summary for the $k$-th video. The importance scores are shared between the ground truth and the submitted summary. $N_s$ is set to 6 for all videos, and the top-$N_s$ most important segments form the ground-truth summary. As a baseline for the summary score, we randomly choose $N_s$ segments 10 times; the mean value is 74.92%.
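For reference, the metric can be computed as in the sketch below; the array layouts, variable names, and toy scores are our own, not from the challenge toolkit.

```python
# Sketch of the importance-score metric above; `gt_scores` holds each test video's
# per-segment ground-truth importance scores, `submitted` the indices our model selected.
import numpy as np

def importance_score_metric(gt_scores, submitted, n_s=6):
    """gt_scores: list of 1-D arrays (one per test video); submitted: list of index lists."""
    ratios = []
    for scores, sub_idx in zip(gt_scores, submitted):
        gt_idx = np.argsort(scores)[::-1][:n_s]          # top-N_s segments = ground truth
        ratios.append(scores[sub_idx[:n_s]].sum() / scores[gt_idx].sum())
    return float(np.mean(ratios))

# toy example with two "videos"
gt = [np.array([0.9, 0.2, 0.8, 0.4, 0.7, 0.1, 0.6, 0.5]),
      np.array([0.3, 0.9, 0.5, 0.2, 0.8, 0.6, 0.4, 0.7])]
sub = [[0, 2, 4, 6, 7, 3], [1, 4, 7, 5, 2, 6]]
print(importance_score_metric(gt, sub))                  # fraction of the achievable score
```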
Embedding Dimension. We explore different embedding dimensions for the network by constraining the SegNet and VideoNet outputs to the same length. For each segment, we sample 16 frames for the image-based extractor and 32 frames for the video-based extractor. We use GoogleNet trained on ImageNet to extract image-based features, and I3D + non-local blocks trained on Kinetics to extract video-based features. Table 2 shows our results for feature lengths from 512 down to 64; a feature length of 256 is slightly better.
Semantic Features Combination. Image-based features are extracted by image classifiers. We select three: 1. ResNet-50 trained on the CoView 2019 scene classification task (R50_s); 2. ResNet-152 trained on ImageNet (R152_i); 3. GoogleNet trained on ImageNet (G_i). Video-based features are extracted by video classifiers. We choose two: 1. the I3D + non-local model trained on Kinetics (INK); 2. the I3D + non-local model trained on the CoView 2019 action recognition task (INA). Table 3 shows different visual feature combinations at feature length 256. Our model did not benefit much from the action and scene recognition tasks.
Multi-Task Learning. Since the self-taught task increases the learning difficulty, we adopt a 3-layer bi-GRU in VideoNet and a feature length of 512 with R50_s+INK features; dropout and the bottleneck structure are removed from SegNet. The self-taught task is first trained for two days on the CoView 2019 videos with a shuffle ratio of $p$ = 15%. The trained model, which reaches 96.64% accuracy on the self-taught task and 78.6% recall on odd-position segments, is used as the pre-trained model in the multi-task learning setting. In joint training, we set the self-supervised weight $\lambda$ = 0.02 (the supervised MSE term keeps weight 1) and set $\lambda$ = 0 when evaluating on validation. Table 4 shows the comparison results for three models: a. the baseline supervised model without pre-training (sup_no_prt); b. supervised learning with the self-taught pre-trained model (sup_with_prt); c. multi-task learning with the self-taught pre-trained model (multi_with_prt). The results show that the multi-task setting is better than the baseline model by a small margin.
Data augmentation. With the same basic network setting as R50_s+INK in the "Semantic Features Combination" experiment, we obtain a 0.29% improvement, from 81.45% to 81.74%.
Audio feature. We extract the audio features MFCC and chromagram for each segment using the Python audio processing package LibROSA. We concatenate the audio features to the 5450-dimensional segment-level features before inputting them to SegNet. The audio fusion model builds on the multi-task setting of the "Multi-Task Learning" experiment. The result decreases from 81.64% to 80.42%, which may be due to the increased model complexity caused by the larger feature dimension without important information being added.
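The per-segment extraction can be sketched with LibROSA as follows; the time-averaging, feature sizes, and file name are illustrative assumptions rather than the paper's recipe.

```python
# Sketch of per-segment audio feature extraction with LibROSA.
import numpy as np
import librosa

def segment_audio_features(wav_path, n_mfcc=20):
    """Load a 5-second segment's audio and return a fixed-length feature vector."""
    y, sr = librosa.load(wav_path, sr=None)                  # keep the original sample rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)         # (12, frames)
    # average over time frames so every segment yields the same dimensionality
    return np.concatenate([mfcc.mean(axis=1), chroma.mean(axis=1)])

# feat = segment_audio_features("segment_0001.wav")          # hypothetical file name
```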
Optical flow. Optical flow, a low-level feature describing video motion, has been shown to be complementary to RGB in action recognition. The optical flow feature sequence is obtained with the Gunnar Farneback algorithm, extracting 16 frames for each segment. We modify SegNet in the multi-task setting to four 3D convolution layers for processing the optical flow features along the temporal and spatial dimensions. As with the audio feature, this modification decreases the summary score from 81.64% to 80.78%.
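Dense flow can be computed per segment roughly as in the sketch below, using OpenCV's Farneback implementation with its commonly used default parameters; the frame sampling is illustrative.

```python
# Sketch of dense optical flow extraction with OpenCV's Farneback implementation.
import cv2
import numpy as np

def farneback_flow_sequence(frames):
    """frames: list of BGR images (e.g., 16 sampled frames of one segment)."""
    flows = []
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:]:
        curr = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # pyramid scale, levels, window size, iterations, poly_n, poly_sigma, flags
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)                 # (H, W, 2) displacement field
        prev = curr
    return np.stack(flows)                 # (T-1, H, W, 2)

frames = [np.random.randint(0, 255, (112, 112, 3), dtype=np.uint8) for _ in range(16)]
flow_seq = farneback_flow_sequence(frames)
```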
Model ensemble. We ensemble five models by linear regression: the base 256-feature-length model (81.81%), the audio feature model (80.42%), the optical flow model (80.78%), the multi-task model (81.64%), and the data augmentation model (81.74%). We then perform model selection according to the linear regression weights, which leads us to choose the ensemble of the base 256-length model and the data augmentation model (81.86% on validation) as the final submission.
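A sketch of this score-level ensemble and weight-based model selection, on placeholder validation predictions (the arrays stand in for per-segment importance predictions of the five models):

```python
# Sketch of score-level ensembling with linear regression and weight-based selection.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_segments = 5000
ground_truth = rng.random(n_segments)                  # validation importance scores
model_preds = rng.random((n_segments, 5))              # columns: base256, audio, flow, multi, aug

reg = LinearRegression().fit(model_preds, ground_truth)
print(reg.coef_)                                       # inspect weights for model selection

# keep only the models with the largest weights (here: two columns) and re-ensemble
keep = np.argsort(np.abs(reg.coef_))[-2:]
final = LinearRegression().fit(model_preds[:, keep], ground_truth)
ensembled_scores = final.predict(model_preds[:, keep])
```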
5. Conclusion and Future Work
In this paper, we propose a scalable deep neural network for video summarization in a content-based recommender formulation, which predicts segment importance scores by considering both segment-level and video-level information. Our work shows that data augmentation and multi-task learning help with the limited size of the dataset. To better understand the video content, we also perform action and scene recognition in untrimmed videos with state-of-the-art video classification algorithms. We experiment with combinations of high-level visual semantic features, audio features, and optical flow, and conclude that visual semantic features play the most important role in this summarization task.
There are several directions we can explore in the future to improve comprehensive video understanding. First, we did not benefit from the connection between actions and scenes in either the recognition or the summarization task, which leaves room for better utilizing this prior knowledge. Second, since we re-formalize video summarization in a recommender framework, state-of-the-art recommendation technologies such as Collaborative Filtering, Factorization Machines, and Wide & Deep Learning can be introduced. Last but not least, although CoView 2019, to our knowledge, provides the largest public video summarization dataset, it is still too small at the scale of deep learning. We need more data, such as user actions when browsing video websites, video language descriptions, and large-scale datasets, to accomplish such a complex task.
References
[1] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675, 2016.
[2] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
[3] Jiri Fajtl, Hajar Sadeghi Sokeh, Vasileios Argyriou, Dorothy Monekosso, and Paolo Remagnino. Summarizing videos with attention. In ACCV Workshops, 2018.
[4] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. arXiv preprint arXiv:1812.03982, 2018.
[5] Basura Fernando, Hakan Bilen, Efstratios Gavves, and Stephen Gould. Self-supervised video representation learning with odd-one-out networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3636–3645, 2017.
[6] Michael Gygli, Helmut Grabner, Hayko Riemenschneider, and Luc Van Gool. Creating summaries from user videos. In European conference on computer vision, pages 505–520. Springer, 2014.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[8] Atsushi Kanehira, Luc Van Gool, Yoshitaka Ushiku, and Tatsuya Harada. Viewpoint-aware video summarization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7435–7444, 2018.
[9] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
[10] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
[11] Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. Hmdb: a large video database for human motion recognition. In 2011 International Conference on Computer Vision, pages 2556–2563. IEEE, 2011.
[12] Hsin-Ying Lee, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Unsupervised representation learning by sorting sequences. In Proceedings of the IEEE International Conference on Computer Vision, pages 667–676, 2017.
[13] Joonseok Lee and Sami Abu-El-Haija. Large-scale content-only video recommendation. In Proceedings of the IEEE International Conference on Computer Vision, pages 987–995, 2017.
[14] Ishan Misra, C Lawrence Zitnick, and Martial Hebert. Shuffle and learn: unsupervised learning using temporal order verification. In European Conference on Computer Vision, pages 527–544. Springer, 2016.
[15] Yale Song, Jordi Vallmitjana, Amanda Stent, and Alejandro Jaimes. Tvsum: Summarizing web videos using titles. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5179–5187, 2015.
[16] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
[17] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video representations using lstms. In International conference on machine learning, pages 843–852, 2015.
[18] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
[19] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 6450–6459, 2018.
[20] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European conference on computer vision, pages 20–36. Springer, 2016.
[21] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
[22] Huawei Wei, Bingbing Ni, Yichao Yan, Huanyu Yu, Xiaokang Yang, and Chen Yao. Video summarization via semantic attended networks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[23] Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 3485–3492. IEEE, 2010.
[24] Ke Zhang, Wei-Lun Chao, Fei Sha, and Kristen Grauman. Video summarization with long short-term memory. In European conference on computer vision, pages 766–782. Springer, 2016.
[25] Ke Zhang, Kristen Grauman, and Fei Sha. Retrospective encoders for video summarization. In Proceedings of the European Conference on Computer Vision (ECCV), pages 383–399, 2018.
[26] Yujia Zhang, Michael Kampffmeyer, Xiaoguang Zhao, and Min Tan. Dtr-gan: Dilated temporal relational adversarial network for video summarization. In Proceedings of the ACM Turing Celebration Conference-China, page 89. ACM, 2019.
[27] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[28] Kaiyang Zhou, Yu Qiao, and Tao Xiang. Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.