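Affiliations: University of Southern California, University of Texas at Austin
The paper appeared at ECCV 2016; the authors are from USC and UT Austin. Paper: https://arxiv.org/abs/1605.08110 DOI: 10.1007/978-3-319-46478-7_47
Translator's note: the formula-heavy middle part (Section 3) is largely left untranslated, and the later sections are not translated very well either, so please bear with me. Do read the original paper, since parts of this translation are inaccurate -_-||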
We propose a novel supervised learning technique for summarizing videos by automatically selecting keyframes or key subshots. Casting the task as a structured prediction problem, our main idea is to use Long Short-Term Memory (LSTM) to model the variable-range temporal dependency among video frames, so as to derive both representative and compact video summaries. The proposed model successfully accounts for the sequential structure crucial to generating meaningful video summaries, leading to state-of-the-art results on two benchmark datasets. In addition to advances in modeling techniques, we introduce a strategy to address the need for a large amount of annotated data for training complex learning approaches to summarization. There, our main idea is to exploit auxiliary annotated video summarization datasets, in spite of their heterogeneity in visual styles and contents. Specifically, we show that domain adaptation techniques can improve learning by reducing the discrepancies in the original datasets’ statistical properties.
Keywords: Video summarization; Long short-term memory
Video has rapidly become one of the most common sources of visual information. The amount of video data is daunting: it takes over 82 years to watch all videos uploaded to YouTube per day! Automatic tools for analyzing and understanding video contents are thus essential. In particular, automatic video summarization is a key tool to help human users browse video data. A good video summary would compactly depict the original video, distilling its important events into a short watchable synopsis. Video summarization can shorten video in several ways. In this paper, we focus on the two most common ones: keyframe selection, where the system identifies a series of defining frames [1,2,3,4,5], and key subshot selection, where the system identifies a series of defining subshots, each of which is a temporally contiguous set of frames spanning a short time interval [6,7,8,9].
视频已迅速成为视觉信息的最常见来源之一。视频数据量巨大,每天观看上传到YouTube的所有视频需要82年以上的时间! 因此,用于分析和理解视频内容的自动工具至关重要。特别地,自动视频摘要是帮助人类用户浏览视频数据的关键工具。一个好的视频摘要将紧凑地(简洁地)描述原始视频,并将其重要事件提炼成一个可供观看的简短摘要。视频摘要可以通过多种方式缩短视频。在本文中,我们重点介绍两种最常见的方法:关键帧选择(其中系统识别一系列定义帧[1,2,3,4,5])和关键子镜头的选择(其中系统识别一系列定义的子镜头),其中每个子镜头是一个时间连续的帧集,跨度很短的时间间隔[6,7,8,9]。
There has been a steadily growing interest in studying learning techniques for video summarization. Many approaches are based on unsupervised learning, and define intuitive criteria to pick frames [1,5,6,9,10,11,12,13,14] without explicitly optimizing the evaluation metrics. Recent work has begun to explore supervised learning techniques [2,15,16,17,18]. In contrast to unsupervised ones, supervised methods directly learn from human-created summaries to capture the underlying frame selection criterion as well as to output a subset of those frames that is more aligned with human semantic understanding of the video contents.
对研究视频摘要的学习技术的兴趣一直在稳定增长。许多方法都是基于无监督学习的,并定义了直观的标准来挑选框架[1、5、6、9、10、11、12、13、14],而没有明确优化评估指标。 最近的工作已开始探索监督学习技术[2,15,16,17,18]。 与不受监督的方法相反,受监督的方法直接从人工创建的摘要中学习,以捕获底层的帧选择标准,并输出与人类对视频内容的语义理解更加一致的那些帧的子集。
Supervised learning for video summarization entails two questions: what type of learning model to use? and how to acquire enough annotated data for fitting those models? Abstractly, video summarization is a structured prediction problem: the input to the summarization algorithm is a sequence of video frames, and the output is a binary vector indicating whether a frame is to be selected or not. This type of sequential prediction task is the underpinning of many popular algorithms for problems in speech recognition, language processing, etc. The most important aspect of this kind of task is that the decision to select cannot be made locally and in isolation—the inter-dependency entails making decisions after considering all data from the original sequence.
For video summarization, the inter-dependency across video frames is complex and highly inhomogeneous. This is not entirely surprising as human viewers rely on high-level semantic understanding of the video contents (and keep track of the unfolding of storylines) to decide whether a frame would be valuable to keep for a summary. For example, in deciding what the keyframes are, temporally close video frames are often visually similar and thus convey redundant information such that they should be condensed. However, the converse is not true. That is, visually similar frames do not have to be temporally close. For example, consider summarizing the video “leave home in the morning and come back to lunch at home and leave again and return to home at night.” While the frames related to the “at home” scene can be visually similar, the semantic flow of the video dictates none of them should be eliminated. Thus, a summarization algorithm that relies on examining visual cues only but fails to take into consideration the high-level semantic understanding about the video over a long-range temporal span will erroneously eliminate important frames. Essentially, the nature of making those decisions is largely sequential – any decision including or excluding frames is dependent on other decisions made on a temporal line.
Modeling variable-range dependencies where both short-range and long-range relationships intertwine is a long-standing and challenging problem in machine learning. Our work is inspired by the recent success of applying long short-term memory (LSTM) to structured prediction problems such as speech recognition [19,20,21] and image and video captioning [22,23,24,25,26]. LSTM is especially advantageous in modeling long-range structural dependencies where the influence of the distant past on the present and the future must be adjusted in a data-dependent manner. In the context of video summarization, LSTMs explicitly use their memory cells to learn the progression of “storylines”, and thus know when to forget or incorporate past events to make decisions.
In this paper, we investigate how to apply LSTM and its variants to supervised video summarization. We make the following contributions. We propose vsLSTM, an LSTM-based model for video summarization (Sec. 3.3). Fig. 2 illustrates the conceptual design of the model. We demonstrate that the sequential modeling aspect of LSTM is essential; the performance of multi-layer neural networks (MLPs) using neighboring frames as features is inferior. We further show how LSTM’s strength can be enhanced by combining it with the determinantal point process (DPP), a recently introduced probabilistic model for diverse subset selection [2,27]. The resulting model achieves the best results on two recent challenging benchmark datasets (Sec. 4). Besides advances in modeling, we also show how to address the practical challenge of insufficient human-annotated video summarization examples. We show that model fitting can benefit from combining video datasets, despite their heterogeneity in both contents and visual styles. In particular, this benefit can be improved by “domain adaptation” techniques that aim to reduce the discrepancies in statistical characteristics across the diverse datasets.
The rest of the paper is organized as follows. Section 2 reviews related work on video summarization, and Section 3 describes the proposed LSTM-based model and its variants. In Section 4, we report empirical results. We examine our approach in several supervised learning settings and contrast it with other existing methods, and we analyze the impact of domain adaptation for merging summarization datasets for training (Section 4.4). We conclude our paper in Section 5.
Techniques for automatic video summarization fall into two broad categories: unsupervised ones that rely on manually designed criteria to prioritize and select frames or subshots from videos [1,3,5,6,9,10,11,12,14,28,29,30,31,32,33,34,35,36] and supervised ones that leverage human-edited summary examples (or frame importance ratings) to learn how to summarize novel videos [2,15,16,17,18]. Recent results by the latter suggest great promise compared to traditional unsupervised methods.
自动视频摘要的技术分为两大类:无监督的技术(该技术依赖于人工手动设计的标准来对视频中的帧或子镜头进行优先排序和选择[1、3、5、6、9、10、11、12、14、28、29 ,30、31、32、33、34、35、36])和监督的技术(该技术利用人工编辑的摘要示例(或视频帧重要性评级)来学习如何总结新视频[2,15,16,17,18])。最近的研究结果表明,与传统的未使用的方法相比,后者有很大的前景。
Informative criteria include relevance [10,13,14,31,36], representativeness or importance [5,6,9,10,11,33,35], and diversity or coverage [1,12,28,30,34]. Several recent methods also exploit auxiliary information such as web images [10,11,33,35] or video categories [31] to facilitate the summarization process.
Because they explicitly learn from human-created summaries, supervised methods are better equipped to align with how humans would summarize the input video. For example, a prior supervised approach learns to combine multiple hand-crafted criteria so that the summaries are consistent with ground truth [15,17]. Alternatively, the determinantal point process (DPP), a probabilistic model that characterizes how a representative and diverse subset can be sampled from a ground set, is a valuable tool to model summarization in the supervised setting [2,16,18].
由于他们明确地从人工创建的摘要中学习,因此受监督的方法可以更好地适应人类对输入视频进行摘要的方式。 例如,先前的监督方法学会了结合多个手工制定的标准,以使摘要与基本事实保持一致[15,17]。 另外,确定点过程(DPP)是一种概率模型,它描述了如何从基础集合中抽取具有代表性的和多样性的子集。它是在有监督的环境中建模进行摘要的有价值的工具[2,16,18]。
None of the above work uses LSTMs to model both the short-range and long-range dependencies in the sequential video frames. The sequential DPP proposed in [2] uses pre-defined temporal structures, so the dependencies are “hard-wired”. In contrast, LSTMs can model dependencies with a data-dependent on/off switch, which is extremely powerful for modeling sequential data [20].
LSTMs are used in [37] to model temporal dependencies to identify video highlights, cast as auto-encoder-based outlier detection. LSTMs are also used in modeling an observer’s visual attention in analyzing images [38,39], and to perform natural language video description [23,24,25]. However, to the best of our knowledge, our work is the first to explore LSTMs for video summarization. As our results will demonstrate, their flexibility in capturing sequential structure is quite promising for the task.
在文献[37]中使用LSTMs来对时间依赖性进行建模,以识别视频亮点,并转换为基于自动编码器的离群点检测(孤立点检测、异常值检测)。 LSTMs还用于在分析图像时对观察者的视觉注意力进行建模[38,39],并用于执行自然语言视频描述[23,24,25]。然而,就我们所知,我们的工作是第一次探索视频摘要的LSTMs。正如我们的结果将证明的那样,它们在捕获顺序结构方面的灵活性对于这项任务是很有前途的。
In this section, we describe our methods for summarizing videos. We first formally state the problem and the notations, and briefly review LSTM [40,41,42], the building block of our approach. We then introduce our first summarization model vsLSTM. Then we describe how we can enhance vsLSTM by combining it with a determinantal point process (DPP) that further takes the summarization structure (e.g., diversity among selected frames) into consideration.
在本节中,我们将介绍视频摘要的方法。首先正式地陈述问题和符号,并简要回顾LSTM[40,41,42],以及我们方法的组成部分。然后介绍我们的第一个总结模型vsLSTM。然后,我们描述了如何通过将vsLSTM与确定性点过程(DPP)结合来增强vsLSTM, DPP进一步考虑了摘要结构(例如,选择帧之间的多样性)。
We use $x = \{x_1, x_2, \dots, x_t, \dots, x_T\}$ to denote a sequence of frames in a video to be summarized, where $x_t$ is the visual feature vector extracted at the t-th frame.
我们用x = {x1, x2,···,xt,···,xT}表示要进行摘要的视频帧序列,xt是在第t帧提取的视觉特征。
The output of the summarization algorithm can take one of two forms. The first is selected keyframes [2,3,12,28,29,43], where the summarization result is a subset of (isolated) frames. The second is interval-based keyshots [15,17,31,35], where the summary is a set of (short) intervals along the time axis. Instead of binary information (being selected or not selected), certain datasets provide frame-level importance scores computed from human annotations [17,35]. Those scores represent the likelihoods of the frames being selected as a part of the summary. Our models make use of all types of annotations (binary keyframe labels, binary subshot labels, or frame-level importance scores) as learning signals.
Our models use frames as their internal representation. The inputs are frame-level features x and the (target) outputs are either hard binary indicators or frame-level importance scores (i.e., softened indicators).
LSTMs are a special kind of recurrent neural network that is adept at modeling long-range dependencies. At the core of the LSTMs are memory cells c which encode, at every time step, the knowledge of the inputs that have been observed up to that step. The cells are modulated by nonlinear sigmoidal gates, which are applied multiplicatively. The gates determine whether the LSTM keeps the values at the gates (if the gates evaluate to 1) or discards them (if the gates evaluate to 0).
There are three gates: the input gate (i) controlling whether the LSTM considers its current input (xt), the forget gate (f) allowing the LSTM to forget its previous memory (ct), and the output gate (o) deciding how much of the memory to transfer to the hidden states (ht). Together they enable the LSTM to learn complex long-term dependencies. In particular, the forget gate serves as a time-varying, data-dependent on/off switch to selectively incorporate the past and present information. See Fig. 1 for a conceptual diagram of an LSTM unit and its algebraic definitions [21].
有三个门:输入门(i)控制LSTM是否考虑它的当前输入(xt),遗忘门(f)允许LSTM忘记它以前的记忆(ct),和输出门(o)决定有多少记忆转移到隐藏状态(ht)。它们使LSTM能够学习复杂的长期依赖关系,特别是,遗忘日期作为一个时变数据依赖于 开/闭 开关,以选择性地合并过去和现在的信息。LSTM单元的概念图及其代数定义[21]见图1。
Fig. 1. The LSTM unit, redrawn from [21]. The memory cell is modulated jointly by the input, output and forget gates to control the knowledge transferred at each timestep. ⊙ denotes element-wise products.
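To make the gating described above concrete, here is a minimal NumPy sketch of a single LSTM time step. It follows the common textbook formulation and is not taken from the paper's code; the stacked parameter layout (`W`, `b`) and the function name `lstm_step` are assumptions made for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step (illustrative sketch).

    W has shape (4 * H, D + H) and b has shape (4 * H,). Rows are stacked in
    the order: input gate, forget gate, output gate, candidate memory.
    This layout is an assumption of the sketch, not the paper's notation.
    """
    H = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    i = sigmoid(z[0:H])            # input gate: how much new content to write
    f = sigmoid(z[H:2 * H])        # forget gate: how much old memory to keep
    o = sigmoid(z[2 * H:3 * H])    # output gate: how much memory to expose
    g = np.tanh(z[3 * H:4 * H])    # candidate memory content
    c_t = f * c_prev + i * g       # element-wise products, as in Fig. 1
    h_t = o * np.tanh(c_t)
    return h_t, c_t

# Run the cell over a short random sequence of D-dimensional frame features.
rng = np.random.default_rng(0)
D, H = 8, 4
W = rng.normal(scale=0.1, size=(4 * H, D + H))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(5, D)):
    h, c = lstm_step(x_t, h, c, W, b)
```

When the forget gate saturates at 1 and the input gate at 0, the memory cell is copied unchanged to the next step, which is the "on/off switch" behavior the text refers to.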
Our vsLSTM model is illustrated in Fig. 2. There are several differences from the basic LSTM model. We use bidirectional LSTM layers [44] to better model long-range dependencies in both the past and the future directions. Note that the forward and the backward chains do not directly interact. We combine the information in those two chains, as well as the visual features, with a multi-layer perceptron (MLP). The output of this perceptron is a scalar $y_t = f_I(h_t^{\text{forward}}, h_t^{\text{backward}}, x_t)$.
To learn the parameters in the LSTM layers and the MLP for $f_I(\cdot)$, our algorithm can use annotations in the form of either the frame-level importance scores or the selected keyframes encoded as binary indicator vectors. In the former case, y is a continuous variable and in the latter case, y is a binary variable. The parameters are optimized with stochastic gradient descent.
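The forward pass of such a bidirectional scorer can be sketched as follows. This is only an illustration under stated assumptions: the parameter names (`W_f`, `W_b`, `W1`, `w2`), the single hidden MLP layer, and the sigmoid on the output are hypothetical choices, the weights are random rather than trained, and training by stochastic gradient descent is omitted.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def run_lstm(X, W, b):
    """Run one LSTM chain over X (T x D); return hidden states (T x H)."""
    H = b.shape[0] // 4
    h, c = np.zeros(H), np.zeros(H)
    states = []
    for x_t in X:
        z = W @ np.concatenate([x_t, h]) + b
        i, f, o = sigmoid(z[:H]), sigmoid(z[H:2 * H]), sigmoid(z[2 * H:3 * H])
        c = f * c + i * np.tanh(z[3 * H:])
        h = o * np.tanh(c)
        states.append(h)
    return np.stack(states)

def score_frames(X, params):
    """Per-frame scores from forward states, backward states, and features."""
    h_fwd = run_lstm(X, params["W_f"], params["b_f"])
    # The backward chain reads the video in reverse; flip back to align with t.
    h_bwd = run_lstm(X[::-1], params["W_b"], params["b_b"])[::-1]
    feats = np.concatenate([h_fwd, h_bwd, X], axis=1)      # (T, 2H + D)
    hidden = np.tanh(feats @ params["W1"] + params["b1"])  # one hidden layer (assumed)
    return sigmoid(hidden @ params["w2"] + params["b2"])   # sigmoid output (assumed)

rng = np.random.default_rng(0)
T, D, H, M = 6, 8, 4, 5
params = {
    "W_f": rng.normal(scale=0.3, size=(4 * H, D + H)), "b_f": np.zeros(4 * H),
    "W_b": rng.normal(scale=0.3, size=(4 * H, D + H)), "b_b": np.zeros(4 * H),
    "W1": rng.normal(scale=0.3, size=(2 * H + D, M)), "b1": np.zeros(M),
    "w2": rng.normal(scale=0.3, size=M), "b2": 0.0,
}
X = rng.normal(size=(T, D))
scores = score_frames(X, params)   # one score per frame
```

Because of the backward chain, the score of an early frame depends on frames that come later in the video, which a purely forward LSTM could not capture.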
Fig. 2. Our vsLSTM model for video summarization. The model is composed of two LSTM (long short-term memory) layers: one layer models video sequences in the forward direction and the other the backward direction. Each LSTM block is an LSTM unit, shown in Fig. 1.
The forward/backward chains model temporal inter-dependencies between the past and the future. The inputs to the layers are visual features extracted at frames. The outputs combine the LSTM layers’ hidden states and the visual features with a multi-layer perceptron, representing the likelihoods of whether the frames should be included in the summary. As our results will show, modeling sequential structures as well as the long-range dependencies is essential.
图2: 我们的vsLSTM模型(用于视频摘要)。该模型由两层LSTM(长短期存储器)组成:一层对视频序列进行正向建模,另一层对视频序列进行反向建模。每个LSTM块都是一个LSTM单元,如图1所示。
vsLSTM excels at predicting the likelihood that a frame should be included or how important/relevant a frame is to the summary. We further enhance it with the ability to model pairwise frame-level “repulsiveness” by stacking it with a determinantal point process (DPP) (which we discuss in more detail below). Modeling the repulsiveness aims to increase the diversity in the selected frames by eliminating redundant frames. The modeling advantage provided in DPP has been exploited in DPP-based summarization methods [2,16,18]. Note that diversity can only be measured “collectively” on a (sub)set of (selected) frames, not on frames independently or sequentially. The directed sequential nature of LSTMs is arguably weaker at examining all the frames in the subset simultaneously to measure diversity, and thus runs the risk of having higher recall but lower precision. On the other hand, DPPs likely yield low recall but high precision. In essence, the two are complementary to each other.
Fig. 3. Our dppLSTM model. It combines vsLSTM (Fig. 2) and DPP by modeling both long-range dependencies and pairwise frame-level repulsiveness explicitly.
**Determinantal point processes (DPP).** Given a ground set Z of N items (e.g., all frames of a video), together with an N×N kernel matrix L that records the pairwise frame-level similarity, a DPP encodes the probability of sampling any subset from the ground set [2,27]. The probability of a subset z is proportional to the determinant of the corresponding principal submatrix $L_z$ of L:

$$P(z \subset Z; L) = \frac{\det(L_z)}{\det(L + I)},$$

where I is the N×N identity matrix. If two items are identical and appear in the subset, $L_z$ will have identical rows and columns, leading to a zero-valued determinant. Namely, zero probability is assigned to this subset. A highly probable subset is one capturing significant diversity (i.e., pairwise dissimilarity).
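A small numerical check can illustrate this property. The sketch below assumes the standard L-ensemble form, in which the determinant of the selected principal submatrix is normalized by det(L + I); the toy feature matrix `F` and the helper name `dpp_prob` are made up for illustration.

```python
import itertools
import numpy as np

def dpp_prob(L, subset):
    """Probability of `subset` under an L-ensemble DPP: det(L_z) / det(L + I)."""
    idx = list(subset)
    if idx:
        num = np.linalg.det(L[np.ix_(idx, idx)])
    else:
        num = 1.0  # the determinant of the empty submatrix is 1 by convention
    return num / np.linalg.det(L + np.eye(L.shape[0]))

# Toy features: items 0 and 1 are identical, item 2 is orthogonal to both.
F = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])
L = F @ F.T  # positive semidefinite similarity kernel

p_dup = dpp_prob(L, [0, 1])   # subset containing two identical items
p_div = dpp_prob(L, [0, 2])   # subset containing two dissimilar items

# The probabilities of all 2^N subsets sum to one.
total = sum(dpp_prob(L, s)
            for r in range(L.shape[0] + 1)
            for s in itertools.combinations(range(L.shape[0]), r))
```

Here the subset with the duplicated item receives zero probability, while the diverse pair receives a strictly positive one.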
Our vsLSTM predicts frame-level importance scores, i.e., the likelihood that a frame should be included in the summary. For our dppLSTM, the approximate MAP inference algorithm [46] outputs a subset of selected frames. Thus, for dppLSTM we use the procedure described in the Supplementary Material to convert the selected frames into keyshot-based summaries for evaluation.
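For intuition about how a subset of frames can be read off a DPP kernel, the following sketch uses a plain greedy heuristic. It is a stand-in for illustration only and is not the approximate MAP algorithm of [46]; the way the toy kernel is built from per-frame scores `q` and features `F` is likewise an assumption.

```python
import numpy as np

def greedy_map(L, max_items=None):
    """Greedily add the item that most increases det(L_z); stop when none does."""
    N = L.shape[0]
    selected, best = [], 1.0  # determinant of the empty submatrix
    while max_items is None or len(selected) < max_items:
        gains = {}
        for j in range(N):
            if j in selected:
                continue
            idx = selected + [j]
            gains[j] = np.linalg.det(L[np.ix_(idx, idx)])
        if not gains:
            break
        j_best = max(gains, key=gains.get)
        if gains[j_best] <= best:  # no remaining item improves the objective
            break
        selected.append(j_best)
        best = gains[j_best]
    return sorted(selected)

# Toy example: frames 0 and 1 are near-duplicates, frame 2 looks different.
F = np.array([[1.0, 0.0],
              [0.99, 0.1],
              [0.0, 1.0]])
q = np.array([2.0, 1.9, 1.5])   # hypothetical per-frame quality scores
B = q[:, None] * F
L = B @ B.T                     # kernel mixing quality and similarity (assumed form)

summary = greedy_map(L)
```

In this toy case the heuristic keeps frames 0 and 2 and skips frame 1, even though frame 1 has the second-highest score, because it is nearly redundant with frame 0.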
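To address the need for a large amount of annotated data, we use two other annotated datasets whose annotations are keyframe-based summaries: YouTube [28] and Open Video Project (OVP) [28,47]. We process them as in [2] to create a ground-truth set of keyframes for each video (then converted to a ground-truth sequence of frame-level importance scores). We use the ground-truth importance scores to train vsLSTM, and convert the sequences to selected keyframes to train dppLSTM.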
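Features. For most experiments, the feature descriptor of each frame is obtained by extracting the output of the penultimate layer (pool 5) of the GoogLeNet model [48]. We also experiment with the same shallow features used in [35] (i.e., color histograms, GIST, HOG, dense SIFT) to provide a comparison with the deep features.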
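Since an MLP cannot explicitly capture temporal information, we consider two variants for a fair comparison with the LSTM-based methods. In the first variant, MLP-Shot, we use the averaged frame features within a shot as the input to the MLP and predict shot-level importance scores. The ground-truth shot-level importance scores are the averages of the corresponding frame-level importance scores. The predicted shot-level importance scores are then used to select keyshots, and the resulting shot-based summaries are compared with the user annotations. In the second variant, MLP-Frame, we concatenate all visual features within a K-frame window (K = 5 in our experiments) centered on each frame as the input for predicting frame-level importance scores.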
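The input construction for the two MLP baselines can be sketched as below. The shot boundaries are assumed to be given as half-open `(start, end)` index pairs, and repeating the first and last frames to pad the window at the video edges is an assumption, since the text does not say how edge frames are handled.

```python
import numpy as np

def shot_features(X, boundaries):
    """MLP-Shot input: average the frame features inside each shot."""
    return np.stack([X[s:e].mean(axis=0) for s, e in boundaries])

def shot_scores(y, boundaries):
    """Shot-level ground truth: average of the frame-level importance scores."""
    return np.array([y[s:e].mean() for s, e in boundaries])

def window_features(X, K=5):
    """MLP-Frame input: concatenate the K frames centered on each frame."""
    assert K % 2 == 1, "K must be odd so the window is centered"
    r = K // 2
    padded = np.pad(X, ((r, r), (0, 0)), mode="edge")  # edge padding is an assumption
    return np.stack([padded[t:t + K].reshape(-1) for t in range(X.shape[0])])

T, D = 10, 3
X = np.arange(T * D, dtype=float).reshape(T, D)   # toy frame features
y = np.linspace(0.0, 1.0, T)                      # toy frame-level importance
boundaries = [(0, 4), (4, 7), (7, 10)]            # hypothetical shot segmentation

X_shot = shot_features(X, boundaries)   # shape (3, D)
y_shot = shot_scores(y, boundaries)     # shape (3,)
X_win = window_features(X, K=5)         # shape (T, 5 * D)
```

The MLP-Frame input grows linearly with K, so a fixed window can only expose a few neighboring frames, whereas the LSTM chains have access to the whole sequence.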
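Shallow or deep features? We also study the effect of using alternative visual features for each frame. Table 5 shows that the deep features modestly improve performance over the shallow features. Note that our dppLSTM with shallow features still outperforms [35], which reported results on TVSum using the same shallow features (i.e., color histograms, GIST, HOG, dense SIFT).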
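Fig. 5 shows a failure case of dppLSTM. This is an egocentric outdoor video that records very diverse contents. In particular, the scenes change among a sandwich shop, buildings, food, and the town square. The summarization results show that dppLSTM still selects diverse contents, but fails to capture the beginning frames, which all have high importance scores and are visually similar but temporally clustered. In this case, dppLSTM is forced to eliminate some of them, resulting in low recall. On the other hand, MLP-Shot only needs to predict importance scores without considering diversity, which leads to higher recall and F-score. Interestingly, MLP-Shot predicts poorly towards the end of the video, whereas the repulsiveness modeled by dppLSTM gives the method an edge in selecting a few frames at the end of the video.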
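Our work explores long short-term memory to develop new supervised learning approaches to automatic video summarization. Our LSTM-based models outperform competing methods on two challenging benchmarks. There are several key contributing factors: the modeling capacity of LSTMs to capture variable-range inter-dependencies, and our idea of complementing the strength of LSTMs with DPP to explicitly model inter-frame repulsiveness, which encourages diversity among the selected frames. While LSTMs require a large number of annotated samples, we show how to mediate this demand by exploiting other existing annotated video datasets, despite their heterogeneity in style and content. The preliminary results are very promising and suggest a future research direction: developing more sophisticated techniques that can bring together the vast number of available video datasets for video summarization. In particular, it would be very productive to explore new sequential models that can enhance the capacity of LSTMs in modeling video data, by learning to encode semantic understanding of video contents and using it to guide summarization and other tasks in visual analytics.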