In this paper, we address the problem of unsupervised video summarization, which automatically extracts key-shots from an input video. Specifically, we tackle two critical issues based on our empirical observations: (i) ineffective feature learning due to flat distributions of output importance scores for each frame, and (ii) training difficulty when dealing with long-length video inputs. To alleviate the first problem, we propose a simple yet effective regularization loss term called variance loss. The proposed variance loss allows a network to predict output scores for each frame with high discrepancy, which enables effective feature learning and significantly improves model performance. For the second problem, we design a novel two-stream network named Chunk and Stride Network (CSNet) that utilizes local (chunk) and global (stride) temporal views of the video features. Our CSNet gives better summarization results for long-length videos compared to the existing methods. In addition, we introduce an attention mechanism to handle the dynamic information in videos. We demonstrate the effectiveness of the proposed methods by conducting extensive ablation studies and show that our final model achieves new state-of-the-art results on two benchmark datasets.
Video has become a highly significant form of visual data, and the amount of video content uploaded to various online platforms has increased dramatically in recent years. In this regard, efficient ways of handling video have become increasingly important. One popular solution is to summarize videos into shorter ones without missing semantically important frames. Over the past few decades, many studies (Song et al. 2015; Ngo, Ma, and Zhang 2003; Lu and Grauman 2013; Kim and Xing 2014; Khosla et al. 2013) have attempted to solve this problem. Recently, Zhang et al. showed promising results using deep neural networks, and a lot of follow-up work has been conducted in areas of supervised (Zhang et al. 2016a; 2016b; Zhao, Li, and Lu 2017; 2018; Wei et al. 2018) and unsupervised learning (Mahasseni, Lam, and Todorovic 2017; Zhou and Qiao 2018).
Supervised learning methods (Zhang et al. 2016a; 2016b; Zhao, Li, and Lu 2017; 2018; Wei et al. 2018) utilize ground truth labels that represent importance scores of each frame to train deep neural networks. Since human-annotated data is used, semantic features are faithfully learned. However, labeling for many video frames is expensive, and overfitting problems frequently occur when there is insufficient label data. These limitations can be mitigated by using the unsupervised learning method as in (Mahasseni, Lam, and Todorovic 2017; Zhou and Qiao 2018). However, since there is no human labeling in this method, a method for supervising the network needs to be appropriately designed.
Our baseline method (Mahasseni, Lam, and Todorovic 2017) uses a variational autoencoder (VAE) (Kingma and Welling 2013) and generative adversarial networks (GANs) (Goodfellow et al. 2014) to learn video summarization without human labels. The key idea is that a good summary should reconstruct the original video seamlessly. Features of each input frame obtained by a convolutional neural network (CNN) are multiplied by the predicted importance scores. Then, these features are passed to a generator to restore the original features. The discriminator is trained to distinguish between the generated (restored) features and the original ones.
Although it is fair to say that a good summary can represent and restore the original video well, the original features can also be restored well with uniformly distributed frame-level importance scores. This trivial solution makes it difficult to learn discriminative features for finding key-shots. Our approach works to overcome this problem. When the output scores become flatter, their variance decreases sharply. Based on this observation, we propose a simple yet powerful way to increase the variance of the scores. Variance loss is simply defined as the reciprocal of the variance of the predicted scores.
In addition, to learn more discriminative features, we propose Chunk and Stride Network (CSNet) that simultaneously utilizes local (chunk) and global (stride) temporal views on the video. CSNet splits input features of a video into two streams (chunk and stride), then passes both split features to bidirectional long short-term memory (LSTM) and merges them back to estimate the final scores. Using chunk and stride, the difficulty of feature learning for long-length videos is overcome.
Finally, we develop an attention mechanism to capture dynamic scene transitions, which are highly related to key-shots. In order to implement this module, we use temporal difference between frame-level CNN features. If a scene changes only slightly, the CNN features of the adjacent frames will have similar values. In contrast, at scene transitions in videos, CNN features in the adjacent frames will differ a lot. The attention module is used in conjunction with CSNet as shown in Fig. 1, and helps to learn discriminative features by considering information about dynamic scene transitions.
We evaluate our network by conducting extensive experiments on SumMe (Gygli et al. 2014) and TVSum (Song et al. 2015) datasets. YouTube and OVP (De Avila et al. 2011) datasets are used for the training process in augmented and transfer settings. We also conducted an ablation study to analyze the contribution of each component of our design. Quantitative results show the selected key-shots and demonstrate the validity of difference attention. Similar to previous methods, we randomly split the test set and the train set five times. To make the comparison fair, we exclude duplicated or skipped videos in the test set.
Our overall contributions are as follows. (i) We propose variance loss, which effectively solves the flat output problem experienced by some of the previous methods. This approach significantly improves performance, especially in unsupervised learning. (ii) We construct the CSNet architecture to detect highlights in local (chunk) and global (stride) temporal views of the video. We also impose a difference attention approach to capture dynamic scene transitions, which are highly related to key-shots. (iii) We analyze our methods with ablation studies and achieve state-of-the-art performance on the SumMe and TVSum datasets.
Given an input video, video summarization aims to produce a shortened version that highlights the representative video frames. Various prior work has proposed solutions to this problem, including video time-lapse (Joshi et al. 2015; Kopf, Cohen, and Szeliski 2014; Poleg et al. 2015), synopsis (Pritch, Rav-Acha, and Peleg 2008), montage (Kang et al. 2006; Sun et al. 2014) and storyboards (Gong et al. 2014; Gygli et al. 2014; Gygli, Grabner, and Van Gool 2015; Lee, Ghosh, and Grauman 2012; Liu, Hua, and Chen 2010; Yang et al. 2015). Our work is most closely related to storyboards, selecting some important pieces of information to summarize key events present in the entire video.
Early work on video summarization relied heavily on hand-crafted features and unsupervised learning. Such work defined various heuristics to represent the importance of the frames (Song et al. 2015; Ngo, Ma, and Zhang 2003; Lu and Grauman 2013; Kim and Xing 2014; Khosla et al. 2013) and used the resulting scores to select representative frames to build the summary video. Recent work has explored supervised learning approaches for this problem, using training data consisting of videos and their ground-truth summaries generated by humans. These supervised learning methods outperform the earlier unsupervised approaches, since they can better learn the high-level semantic knowledge that is used by humans to generate summaries.
Recently, deep learning based methods (Zhang et al. 2016b; Mahasseni, Lam, and Todorovic 2017; Sharghi, Laurel, and Gong 2017) have gained attention for video summarization tasks. The most recent studies adopt recurrent models such as LSTMs, based on the intuition that using LSTM enables the capture of long-range temporal dependencies among video frames which are critical for effective summary generation.
Zhang et al. (Zhang et al. 2016b) introduced two LSTMs to model the variable range dependency in video summarization. One LSTM was used for video frame sequences in the forward direction, while the other LSTM was used for the backward direction. In addition, a determinantal point process model (Gong et al. 2014; Zhang et al. 2016a) was adopted for further improvement of diversity in the subset selection. Mahasseni et al. (Mahasseni, Lam, and Todorovic 2017) proposed an unsupervised method that was based on a generative adversarial framework. The model consists of a summarizer and a discriminator. The summarizer was a variational autoencoder LSTM, which first summarized the video and then reconstructed the output. The discriminator was another LSTM that learned to distinguish between its reconstruction and the input video.
In this work, we focus on unsupervised video summarization, and adopt LSTM following previous work. However, we found empirically that these LSTM-based models have inherent limitations for unsupervised video summarization. In particular, two main issues exist: first, feature learning is ineffective due to the flat distribution of output importance scores, and second, training is difficult with long-length video inputs. To address these problems, we propose a simple yet effective regularization loss term called variance loss, and design a novel two-stream network named the Chunk and Stride Network. We experimentally verify that our final model considerably outperforms state-of-the-art unsupervised video summarization methods. The following section gives a detailed description of our method.
In this section, we introduce methods for unsupervised video summarization. Our methods are based on a variational autoencoder (VAE) and generative adversarial networks (GANs), as in (Mahasseni, Lam, and Todorovic 2017). We first deal with discriminative feature learning under a VAE-GAN framework by using variance loss. Then, a chunk and stride network (CSNet) is proposed to overcome a limitation of most existing methods, namely the difficulty of learning from long-length videos. CSNet resolves this problem by taking a local (chunk) and a global (stride) view of input features. Finally, to consider which part of the video is important, we use the difference in CNN features between adjacent or wider spaced video frames as attention, assuming that dynamics play a large role in selecting key-shots. Fig. 1 shows the overall structure of our proposed approach.
We adopt (Mahasseni, Lam, and Todorovic 2017) as our baseline, using a variational autoencoder (VAE) and generative adversarial networks (GANs) to perform unsupervised video summarization. The key idea is that a good summary should reconstruct the original video seamlessly; accordingly, a GAN framework is adopted to reconstruct the original video from the summarized key-shots.
In the model, an input video is first forwarded through the backbone CNN (i.e., GoogLeNet), Bi-LSTM, and FC layers (encoder LSTM) to output the importance score of each frame. The scores are multiplied with the input features to select key-frames. The original features are then reconstructed from those frames using the decoder LSTM. Finally, a discriminator distinguishes whether its input comes from the original input video or from the reconstructed one. By following Mahasseni et al.'s overall concept of VAE-GAN, we inherit its advantages while developing our own ideas to overcome the existing limitations.
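The score-weighting step described above can be illustrated with a small sketch. It is a toy example and not the baseline's code: plain Python lists stand in for CNN features, and `weight_features` is a hypothetical helper name introduced here.

```python
def weight_features(features, scores):
    """Scale each frame's feature vector by its predicted importance score."""
    if len(features) != len(scores):
        raise ValueError("one score per frame is required")
    return [[s * v for v in frame] for frame, s in zip(features, scores)]


# Toy example: 3 frames with 2-dimensional features (values are made up).
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
flat = weight_features(features, [1.0, 1.0, 1.0])       # every frame kept as-is
selective = weight_features(features, [1.0, 0.0, 0.5])  # second frame suppressed
```

With all scores equal to one, the weighted features are identical to the input, so a decoder fed these features has nothing left to restore.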
Figure 1: The overall architecture of our network. (a) The chunk and stride network (CSNet) splits the input features $x_t$ into $c_t$ and $s_t$ by the chunk and stride methods. The orange, yellow, green, and blue colors represent how the chunk and stride methods divide the input features $x_t$. The divided features are combined in the original order after going through the LSTM and FC layers separately. (b) Difference attention is an approach for modeling dynamic scene transitions at different temporal strides. $d^1_t$, $d^2_t$, $d^4_t$ are the differences of the input features $x_t$ at temporal strides of 1, 2, and 4. The difference features are summed after the FC layer to form the difference attention $d_t$, which is then added to $c'_t$ and $s'_t$, respectively.
The main assumption of our baseline (Mahasseni, Lam, and Todorovic 2017) is “well-picked key-shots can reconstruct the original image well”. However, for reconstructing the original image, it is better to keep all frames instead of selecting only a few key-shots. In other words, mode collapse occurs when the encoder LSTM attempts to keep all frames, which is a trivial solution. This results in flat importance output scores for each frame, which is undesirable. To prevent the output scores from being a flat distribution, we propose a variance loss as follows:
$$\mathcal{L}_V(p) = \frac{1}{\hat{V}(p) + \mathrm{eps}} \qquad (1)$$
where $p = \{p_t : t = 1, \dots, T\}$, eps is a small constant (epsilon), and $\hat{V}(\cdot)$ is the variance operator. $p_t$ is the output importance score at time $t$, and $T$ is the number of frames. By enforcing Eq. (1), the network enlarges the differences among the per-frame output scores and thereby avoids the trivial solution (a flat distribution).
In addition, in order to deal with outliers, we extend variance loss in Eq. (1) by utilizing the median value of scores. The variance is computed as follows:
$$\hat{V}(p) = \frac{1}{T} \sum_{t=1}^{T} \left| p_t - \mathrm{med}(p) \right|^2$$
where $\mathrm{med}(\cdot)$ is the median operator. As has been reported for many years (Pratt 1975; Huang, Yang, and Tang 1979; Zhang, Xu, and Jia 2014), the median value is usually more robust to outliers than the mean value. We call this modified function variance loss for the rest of the paper, and use it for all experiments.
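The median-based variance loss can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function name `variance_loss`, the default value of `eps`, the use of the mean squared deviation from the median, and the toy score lists are all assumptions made for this example.

```python
from statistics import median


def variance_loss(scores, eps=1e-4):
    """Reciprocal of the median-centered variance of the importance scores.

    eps (an assumed value) keeps the loss finite when all scores are equal.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    med = median(scores)
    var = sum((p - med) ** 2 for p in scores) / len(scores)
    return 1.0 / (var + eps)


flat_scores = [0.5, 0.5, 0.5, 0.5]      # trivial solution: zero variance
peaked_scores = [0.1, 0.9, 0.1, 0.9]    # discriminative scores: high variance

loss_flat = variance_loss(flat_scores)
loss_peaked = variance_loss(peaked_scores)
```

The flat scores receive a far larger penalty than the peaked ones, so minimizing this term pushes the network away from the flat solution.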
To handle long-length videos, which are difficult for LSTM-based methods, we propose a chunk and stride network (CSNet) as a way of jointly considering a local and a global view of input features. For each frame of the input video $v = \{v_t : t = 1, \dots, T\}$, we obtain the deep features $x = \{x_t : t = 1, \dots, T\}$ from the pool-5 layer of GoogLeNet.
As shown in Fig. 1 (a), CSNet takes a long video feature $x$ as an input, and divides it into smaller sequences in two ways. The first way involves dividing $x$ into successive frames, and the other way involves dividing it at a uniform interval. The streams are denoted as $c_m$ and $s_m$, where $m \in \{1, \dots, M\}$ and $M$ is the number of divisions. Specifically, $c_m$ and $s_m$ can be expressed as follows:
$$c_m = \left\{ x_i : i = (m-1)\frac{T}{M} + 1, \dots, m\frac{T}{M} \right\}, \qquad s_m = \left\{ x_i : i = m, m + k, m + 2k, \dots, m + T - k \right\}$$
where $k$ is the interval such that $k = M$. The two different sequences, $c_m$ and $s_m$, pass through the chunk and stride streams separately. Each stream consists of a bidirectional LSTM (Bi-LSTM) and a fully connected (FC) layer, which predicts importance scores at the end. Then, each of the outputs is reshaped into $c'_m$ and $s'_m$ so that the original frame order is maintained. Next, the difference attention $d_t$ is added to $c'_m$ and $s'_m$. Details of the attention process are described in the next section. The combined features are then passed through a sigmoid function to predict the final scores $p_t$ as follows:
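$$p^1_t = \mathrm{sigmoid}(c'_t + d_t), \qquad p^2_t = \mathrm{sigmoid}(s'_t + d_t), \qquad p_t = W[p^1_t; p^2_t]$$
where $W$ denotes the learnable parameters for the weighted sum of $p^1_t$ and $p^2_t$, which allows flexible fusion of the local (chunk) and global (stride) views of the input features.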
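The chunk and stride division, and the reordering that restores the original frame order, can be sketched with plain index operations. This is only an illustration under stated assumptions: the number of frames is assumed to be divisible by M, integers stand in for frame features, and all four function names are hypothetical. The Bi-LSTM and FC layers that would process each stream are omitted.

```python
def chunk_split(x, M):
    """Split x into M streams of consecutive frames (local view)."""
    if len(x) % M != 0:
        raise ValueError("this sketch assumes len(x) is divisible by M")
    size = len(x) // M
    return [x[m * size:(m + 1) * size] for m in range(M)]


def stride_split(x, M):
    """Split x into M streams by taking every M-th frame (global view)."""
    if len(x) % M != 0:
        raise ValueError("this sketch assumes len(x) is divisible by M")
    return [x[m::M] for m in range(M)]


def chunk_merge(streams):
    """Concatenate chunk streams back into the original frame order."""
    return [v for stream in streams for v in stream]


def stride_merge(streams):
    """Interleave stride streams back into the original frame order."""
    return [v for group in zip(*streams) for v in group]


frames = list(range(8))          # stand-ins for 8 frame features
chunks = chunk_split(frames, 4)
strides = stride_split(frames, 4)
```

Each chunk stream sees a short contiguous segment, while each stride stream spans the whole video at a coarser temporal resolution; both are M times shorter than the original sequence.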
In this section, we introduce the attention module, which exploits dynamic information as guidance for video summarization. In practice, we use the differences in CNN features of adjacent frames. The feature difference softly encodes temporally varying dynamic information, which can be used as a signal for deciding whether a certain frame is relatively meaningful or not.
As shown in Fig. 1 (b), the differences $d^1_t$, $d^2_t$, $d^4_t$ between $x_{t+k}$ and $x_t$ (for $k = 1, 2, 4$) pass through the FC layer (yielding $d'^1_t$, $d'^2_t$, $d'^4_t$) and are merged to become $d_t$, which is then added to both $c'_m$ and $s'_m$. The proposed attention module merges the difference features by summation:
$$d_t = d'^1_t + d'^2_t + d'^4_t$$
While the difference between the features of adjacent frames can model the simplest dynamics, a wider temporal stride can capture relatively global dynamics between scenes.
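The multi-stride feature differences can be sketched as follows. Several details here are assumptions and not taken from the text: the element-wise absolute difference, the handling of frames near the end of the video (they are compared with the last frame), and the omission of the FC layer before summation. The function names are hypothetical.

```python
def temporal_difference(x, k):
    """Element-wise absolute difference between frame t+k and frame t.

    Frames with no partner k steps ahead are compared with the last frame
    (a boundary choice assumed for this sketch).
    """
    T = len(x)
    out = []
    for t in range(T):
        partner = x[min(t + k, T - 1)]
        out.append([abs(a - b) for a, b in zip(partner, x[t])])
    return out


def difference_features(x, strides=(1, 2, 4)):
    """Sum the stride-k differences over all strides (FC layer omitted)."""
    diffs = [temporal_difference(x, k) for k in strides]
    return [
        [sum(vals) for vals in zip(*(d[t] for d in diffs))]
        for t in range(len(x))
    ]


# Toy video: 1-dimensional features with a scene change between frames 3 and 4.
video = [[0.0], [0.0], [0.0], [10.0], [10.0], [10.0]]
attention = difference_features(video)
```

In this toy input the response grows as a frame approaches the scene change and vanishes once the scene is static again, which is the kind of signal the attention module is meant to provide.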
We evaluate our approach on two benchmark datasets, SumMe (Gygli et al. 2014) and TVSum (Song et al. 2015). SumMe contains 25 user videos with various events. The videos include cases where the scene changes quickly as well as cases where it changes slowly. The lengths of the videos range from 1 minute to 6.5 minutes. Each video is annotated by multiple users, mostly 15 and at most 18. TVSum contains 50 videos with lengths ranging from 1.5 to 11 minutes. Each video in TVSum is annotated by 20 users. The annotations of SumMe and TVSum are frame-level importance scores, and we follow the evaluation method of (Zhang et al. 2016b). The OVP (De Avila et al. 2011) and YouTube (De Avila et al. 2011) datasets consist of 50 and 39 videos, respectively. We use the OVP and YouTube datasets for the transfer and augmented settings.
Similar to other methods, we use the F-score used in (Zhang et al. 2016b) as an evaluation metric. In all datasets, user annotation and prediction are changed from frame-level scores to key-shots using the KTS method in (Zhang et al. 2016b). The precision, recall, and F-score are calculated as a measure of how much the key-shots overlap. Let "predicted" be the length of the predicted key-shots, "user annotated" be the length of the user annotated key-shots, and "overlap" be the length of the overlapping key-shots in the following equations.
$$P = \frac{\text{overlap}}{\text{predicted}}, \qquad R = \frac{\text{overlap}}{\text{user annotated}}, \qquad F\text{-score} = \frac{2PR}{P + R}$$
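The overlap-based metric can be computed directly from per-frame selection masks. The sketch below is illustrative: the function name `f_score`, the binary-mask representation of key-shots, and the example masks are assumptions, and the F-score is returned as a fraction between 0 and 1.

```python
def f_score(predicted, annotated):
    """Return (precision, recall, F-score) for two binary frame masks.

    A 1 marks a frame that belongs to a selected key-shot.
    """
    if len(predicted) != len(annotated):
        raise ValueError("masks must cover the same frames")
    overlap = sum(1 for p, a in zip(predicted, annotated) if p and a)
    n_pred = sum(1 for p in predicted if p)
    n_user = sum(1 for a in annotated if a)
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / n_pred
    recall = overlap / n_user
    return precision, recall, 2 * precision * recall / (precision + recall)


pred_mask = [1, 1, 1, 1, 0, 0, 0, 0]   # 4 predicted frames
user_mask = [0, 0, 1, 1, 1, 1, 1, 1]   # 6 annotated frames, 2 overlapping
precision, recall, f = f_score(pred_mask, user_mask)
```

Here the overlap is 2 frames, so precision is 2/4, recall is 2/6, and the F-score is their harmonic mean, 0.4.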
Our approach is evaluated using the Canonical (C), Augmented (A), and Transfer (T) settings shown in Table 1 in (Zhang et al. 2016b). To divide the data into test and training sets, we randomly extract a test set comprising 20% of the videos five times; the remaining 80% of the videos are used for training. The final F-score is the average of the F-scores over the five tests. However, if the test sets are selected purely at random, some videos may never appear in a test set while others appear multiple times, which makes fair evaluation difficult. To avoid this problem, we evaluate all the videos in the datasets without duplication or omission.
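One way to obtain five random 80/20 splits in which every video is tested exactly once is to shuffle the videos once and partition them into five disjoint folds. The sketch below shows this idea; it is an assumption about how such a protocol could be realized, and the function name, the seed, and the integer video identifiers are hypothetical.

```python
import random


def five_fold_splits(video_ids, n_splits=5, seed=0):
    """Shuffle once, then use each disjoint fold as a test set exactly once."""
    ids = list(video_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::n_splits] for i in range(n_splits)]
    splits = []
    for i, test in enumerate(folds):
        train = [v for j, fold in enumerate(folds) if j != i for v in fold]
        splits.append((train, test))
    return splits


# 25 videos, matching the size of SumMe: 5 test videos and 20 training videos per split.
splits = five_fold_splits(range(25))
```

Averaging the F-scores over the five test sets then covers every video in the dataset exactly once.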
For input features, we sample frames at 2 fps as in (Zhang et al. 2016b), and then obtain a 1024-dimensional feature for each frame through GoogLeNet pool-5 (Szegedy et al. 2015) trained on ImageNet (Russakovsky et al. 2015). The LSTM input and hidden sizes are 256, with the input reduced by an FC layer (1024 to 256) for fast convergence, and the weights are shared across each chunk and stride input. The maximum number of epochs is 20; the learning rate is 1e-4 and is multiplied by 0.1 after 10 epochs. The weights of the network are randomly initialized. M in CSNet is experimentally picked as 4. We implement our method using PyTorch.
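The frame sampling and learning rate schedule can be written down as two small helpers. Both are sketches under assumptions: the sampling step is taken as the rounded ratio of the source frame rate to 2 fps, the decay is assumed to apply from the eleventh epoch onward (epochs are 0-indexed in the code), and the function names are hypothetical.

```python
def sample_frame_indices(n_frames, video_fps, target_fps=2):
    """Indices of the frames kept when subsampling a video to target_fps."""
    step = max(1, round(video_fps / target_fps))
    return list(range(0, n_frames, step))


def learning_rate(epoch, base_lr=1e-4, decay=0.1, decay_after=10):
    """Learning rate for a 0-indexed epoch: base_lr, then base_lr * decay."""
    return base_lr * decay if epoch >= decay_after else base_lr


indices = sample_frame_indices(60, 30)             # 2 seconds of 30 fps video
schedule = [learning_rate(e) for e in range(20)]   # 20 training epochs
```

In a PyTorch training loop, the same schedule would typically be delegated to a built-in scheduler instead of being computed by hand.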
Our baseline follows the VAE-GAN model of Mahasseni et al. (Mahasseni, Lam, and Todorovic 2017). We use their adversarial framework, which enables unsupervised learning. Specifically, the basic sparsity loss, reconstruction loss, and GAN loss are adopted. For supervised learning, we add a binary cross entropy (BCE) loss between the ground truth scores and the predicted scores. We also feed a fake input that has a uniform distribution.
In this section, we first present the results of the ablation study on our proposed approaches. Then, we compare our methods with the existing unsupervised and supervised methods, and finally show the experimental results in the canonical, augmented, and transfer settings. For fair comparison, we quote the performance of previous methods as reported in (Zhou and Qiao 2018).
We have three proposed approaches: CSNet, difference attention, and variance loss. The highest performance is obtained when all three are applied. The ablation study in Table 2 shows the contribution of each proposed method by conducting experiments on every combination in which the methods can be applied. We call the methods shown in exp. 1 to exp. 8 CSNet1 through CSNet8, respectively. When none of our proposed methods is applied, we experiment with a version of the baseline that we reproduced with some layers and hyperparameters modified. This case yields the lowest F-score, and performance increases gradually as each method is applied.
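Three components that can each be switched on or off give eight configurations, which matches the eight experiments. The enumeration below illustrates this count; the ordering it produces is arbitrary and is not claimed to match the order of exp. 1 to exp. 8 in Table 2, and the names used are hypothetical.

```python
from itertools import product

COMPONENTS = ("CSNet", "difference attention", "variance loss")


def ablation_configs():
    """All on/off combinations of the three proposed components."""
    return [
        dict(zip(COMPONENTS, flags))
        for flags in product((False, True), repeat=len(COMPONENTS))
    ]


configs = ablation_configs()
```

The first configuration disables everything (the modified baseline) and the last enables all three components (the full model).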