[Paper Translation] SoccerDB: A Large-Scale Database for Comprehensive Video Understanding

Paper link: https://arxiv.org/pdf/1912.04465.pdf

ABSTRACT

Soccer videos can serve as a perfect research object for video understanding because soccer games are played under well-defined rules while being complex and intriguing enough for researchers to study. In this paper, we propose a new soccer video database named SoccerDB, comprising 171,191 video segments from 346 high-quality soccer games. The database contains 702,096 bounding boxes, 37,709 essential event labels with time boundaries, and 17,115 highlight annotations for object detection, action recognition, temporal action localization, and highlight detection tasks. To our knowledge, it is the largest database for comprehensive sports video understanding across various aspects. We further survey a collection of strong baselines on SoccerDB, which have demonstrated state-of-the-art performance on independent tasks. Our evaluation suggests that we can benefit significantly from jointly considering the inner correlations among those tasks. We believe the release of SoccerDB will tremendously advance research on comprehensive video understanding. Our dataset and code are published at https://github.com/newsdata/SoccerDB.

1 INTRODUCTION

Comprehensive video understanding is a challenging task in computer vision. It has been explored through action recognition, temporal action localization, object detection, object tracking, and so on. However, most works on video understanding focus on isolated aspects of video analysis and ignore the inner correlations among those tasks.

There are many obstacles for researchers doing the correlation study: first, the manual annotation of multiple tasks’ labels on a large-scale video database is extremely time-consuming; second, different approaches lack a fair and uniform benchmark that excludes interference factors for conducting rigorous quantitative analysis; third, some datasets focus on areas that are not challenging or valuable enough to attract researchers’ attention. We need research objects that are challenging yet governed by clear rules and restrictive conditions, so that we can conduct an accurate study of the questions we are interested in. In this paper, we choose soccer matches as our research object and construct a dataset with multiple visual understanding tasks covering various analysis aspects, aiming at building algorithms that can comprehensively understand various aspects of videos like a human.

1.1 Soccer Video Understanding

Soccer video understanding is not only valuable to academic communities but also lucrative in the commercial world. The European soccer market generates annual revenue of $28.7 billion [6]. Regarding soccer content production, automatic soccer video analysis can help editors produce match summaries, visualize key players’ performance for tactical analysis, and so on. Some pioneering companies like GameFace and SportLogiq adopt this technology on match statistics to analyze strategies and players’ performance. However, automatic video analysis has not fully met the market’s needs. The CEO of Wyscout claims the company employs 400 people on soccer data, each of whom takes over 8 hours to provide up to 2,000 annotations per game [6].

1.2 Object Detection

Object detection has seen huge development over the past few years and has reached human-level performance in applications including face detection, pedestrian detection, etc. Localizing instances of semantic objects in images is a fundamental task in computer vision. In soccer video analysis, a detection system can help us find the positions of the ball, players, and goalposts on the field. With this position information, we can produce engaging visualizations, as shown in Figure 1, for tactical analysis or to enhance the fan experience. Though many advanced detection systems can output reliable results under various conditions, there are still many challenges when the object is small, fast-moving, or blurred. In this work, we construct a soccer game object detection dataset and benchmark two state-of-the-art detection models under different frameworks: RetinaNet [11], a “one-stage” detection algorithm, and Faster R-CNN [15], a “two-stage” detection algorithm.

Figure 1: Soccer tactics visualization powered by object detection

1.3 Action Recognition

Action recognition is also a core video understanding problem and has achieved a lot over the past few years. Large-scale datasets such as Kinetics [3], Sports-1M [9], and YouTube-8M [1] have been published, and many state-of-the-art deep-learning-based algorithms like I3D [3], Non-local Neural Networks [20], and SlowFast Networks [5] have been proposed for this task. While supervised learning has shown its power on large-scale recognition datasets, it fails when training data is lacking. In soccer games, key events such as penalty kicks are rare, which means many state-of-the-art recognition models cannot output convincing results when facing these classes. We hope this problem can be further investigated by considering multiple objects’ relationships as a whole in the dataset.

In this paper, we also provide our insight into the relationship between object detection and action recognition. We observe that since soccer matches supply a simple scene and a small set of object classes, it is extraordinarily crucial to model the spatial relationships of objects and their change over time. Imagine: if you could only see the players, the ball, and the goalposts in a game’s screenshot, could you still understand what is happening on the field? Look at the left picture in Figure 2; maybe you have guessed right: that is the moment of a shot. Although modeling human-object or object-object interactions has been explored to improve action recognition [7] [21] in recent years, we still need a closer look at how to use detection knowledge to boost action recognition more efficiently. Our experiments show that the performance of a state-of-the-art action recognition algorithm can be increased by a large margin when combined with object class and location knowledge.

Figure 2: The moment of a shot. Right side: original image. Left side: only the players, ball, and goal areas are kept in the image

1.4 Temporal Action Localization

Temporal action localization is a significant and more complicated problem than action recognition in video understanding because it requires recognizing both the action category and the temporal boundary of an event. The definition of the temporal boundary of an event is ambiguous and subjective; for instance, annotations in some famous databases like Charades and MultiTHUMOS are not consistent among different human annotators [17]. This also increases our difficulty when labeling SoccerDB. To overcome the challenge of ambiguity, we define soccer events with a particular emphasis on time boundaries, based on the events’ actual meaning in soccer rules. For example, we define the red/yellow card event as starting when the referee shows the card and ending when the game resumes. This definition helps us obtain more consistent action localization annotations.

1.5 Highlight Detection

The purpose of highlight detection is to distill interesting content from a long video. Because of the subjectivity problem, constructing a highlight detection dataset usually requires multi-person labeling for the same video, which greatly increases the costs and limits the scale of the dataset [18]. We find that in soccer TV broadcasts, video segments containing highlight events are usually replayed many times, which can be taken as an important clue for soccer video highlight detection. Many works explored highlight detection while considering replays: Zhao Zhao et al. proposed a highlight summarization system by modeling the Event-Replay (ER) structure [22], and A. Raventos et al. used audio-visual descriptors for automatic summarization, introducing replays to improve robustness [14]. SoccerDB provides a playback label and revisits this problem by considering the relationship between actions and highlight events.

1.6 Contributions

• We introduce a challenging database for comprehensive soccer video understanding. Object detection, action recognition, temporal action localization, and highlight detection, all crucial to video analysis, can be investigated in closed form under a constrained environment.
• We provide strong baseline systems for each task, which are not only meant for academic research but also valuable for automatic soccer video analysis in industry.
• We discuss the benefit of considering the inner connections among different tasks: we demonstrate that modeling objects’ spatial-temporal relationships from detection results can provide a complementary representation to the convolution-based model learned from RGB, which increases action recognition performance by a large margin; joint training on action recognition and highlight detection can boost the performance of both tasks.

2 RELATED WORK

2.1 Sports Analytics

Automated sports analytics, particularly for soccer and basketball, are popular around the world. The topic has been profoundly researched by the computer vision community over the past few years. Vignesh Ramanathan et al. brought a new automatic attention mechanism to RNNs to identify the key player of an event in basketball games [13]. Silvio Giancola et al. focused on temporal soccer event detection for finding highlight moments in soccer TV broadcast videos [6]. Rajkumar Theagarajan et al. presented an approach that generates visual analytics and player statistics for solving the talent identification problem in soccer match videos [19]. Huang-Chia Shih surveyed 251 sports video analysis works from a content-based viewpoint for advancing broadcast sports video understanding [16]. The above works are only the tip of the iceberg among the vast research achievements in the sports analytics area.

2.2 Datasets

Many datasets have contributed to sports video understanding. Vignesh Ramanathan et al. provided 257 basketball games with 14K event annotations corresponding to 10 event classes for event classification and detection [13]. Karpathy et al. collected one million sports videos from YouTube belonging to 487 classes of sports, greatly promoting deep-learning research on action recognition [9]. Datasets for video classification in the wild have played a vital role in related research. Two famous large-scale datasets, YouTube-8M [1] and Kinetics [3], have been widely investigated and have inspired most of the state-of-the-art methods in the last few years. Google proposed the AVA dataset to tackle the dense activity understanding problem, which contains 57,600 clips of 3 seconds duration taken from featured films [8]. ActivityNet explored general activity understanding by providing 849 video hours of 203 activity classes, with an average of 137 untrimmed videos per class and 1.41 activity instances per video [2]. Although ActivityNet considers video understanding from multiple aspects, including semantic ontology, trimmed and untrimmed video classification, and spatial-temporal action localization, we argue that it is still too far away from human-comparable general activity understanding in an unconstrained environment. Part of the source videos in our dataset was collected from SoccerNet [6], a benchmark with a total of 6,637 temporal annotations on 500 complete soccer games from six main European leagues. A comparison between different databases is shown in Table 3.

Table 3: The comparison of different datasets on video understanding. In the supported tasks column, [1]: Video Classification, [2]: Spatial-Temporal Detection, [3]: Temporal Detection, [4]: Highlight Detection, [5]: Object Detection. The background is counted as a class in the class-number statistics

3 CREATING SOCCERDB

3.1 Object Detection Dataset Collection

To train a robust detector for different scenes, we increase the diversity of the dataset by collecting data from both images and videos. We crawl 24,475 images of soccer matches from the Internet, covering as many different scenes as possible, then use them to train a detector for boosting the labeling process. For the video part, we collect 103 hours of soccer match videos, including 53 full matches and 18 half matches, whose sources are described in Section 3.2. To increase the difficulty of the dataset, we auto-label each frame from the videos with the detector trained on the image set, then select the keyframes with poor predictions as the dataset proposals. Finally, we select 45,732 frames from the videos for the object detection task. As shown in Table 1, the total number of bounding box labels for the image part is 142,579, with 117,277 player boxes, 19,072 ball boxes, and 6,230 goal boxes; the total number of bounding box labels for the video part is 702,096, with 643,581 player boxes, 45,160 ball boxes, and 13,355 goal boxes. We also calculate the scale of the boxes by the COCO definition [12]. The image part is randomly split into 21,985 images for training and 2,490 for testing. For the video part, we randomly select 18 half-matches for testing and use the other matches for training, yielding 38,784 frames for training and 6,948 for testing.

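The frame-selection step above can be sketched as follows. This is a minimal illustration, assuming a generic `predict(frame)` interface that returns the confidence scores of the detected boxes and treating "poor predictions" as frames whose mean confidence falls below a threshold; the actual detector interface and selection criterion used by the authors are not specified.

```python
# Hypothetical sketch of the hard-frame mining step: run the image-trained
# detector over video frames and propose the frames it is least confident about.
from typing import Callable, List

import numpy as np


def select_hard_frames(frames: List[np.ndarray],
                       predict: Callable[[np.ndarray], List[float]],
                       conf_threshold: float = 0.5,
                       budget: int = 45_732) -> List[int]:
    """Return indices of frames whose detections look unreliable.

    `predict(frame)` is assumed to return the confidence scores of all boxes
    detected in the frame; frames with low mean confidence (or no detections)
    are treated as poor predictions and proposed for manual annotation.
    """
    scored = []
    for idx, frame in enumerate(frames):
        scores = predict(frame)
        mean_conf = float(np.mean(scores)) if scores else 0.0
        if mean_conf < conf_threshold:
            scored.append((mean_conf, idx))
    scored.sort(key=lambda item: item[0])        # least confident first
    return [idx for _, idx in scored[:budget]]
```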

Table 1: Bounding box statistics for the object detection dataset. The bounding box scales (small, medium, and large) follow the definition of the COCO dataset. *-img represents the image part, *-vid represents the video part

3.2 Video Dataset Collection

We adopt 346 high-quality full soccer match videos, including 270 matches from SoccerNet [6] covering six main European leagues over the three seasons from 2014 to 2017, 76 match videos from the China Football Association Super League from 2017 to 2018, and the 18th, 19th, and 20th FIFA World Cups. The whole dataset consumes 1.4 TB of storage, with a total duration of 668.6 hours. We randomly split the games into 226 for training, 63 for validation, and 57 for testing. None of the videos used for object detection are included in this video dataset.

3.3 Event Annotations

We define ten different soccer events, which are usually the highlights of a soccer game, with standard rules for their definition. We define the event boundaries as clearly as possible and annotate all of them densely in long soccer videos. The annotation system records the start/end time of an event, the category of the event, and whether the event is a playback. An annotator takes about three hours to label one match, and another experienced annotator reviews those annotations to ensure the outcome’s quality.

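For illustration only, a single event record produced by such an annotation system might look like the following; the field names and JSON layout are hypothetical and do not describe the released SoccerDB format.

```python
# Hypothetical example of one event annotation record (schema assumed).
import json

event_record = {
    "match_id": "example_match_0001",  # placeholder identifier
    "event": "red/yellow card",        # one of the ten defined event classes
    "start_time": 1325.4,              # seconds: the referee shows the card
    "end_time": 1369.0,                # seconds: the game resumes
    "is_playback": False,              # whether the segment is a replay
}

print(json.dumps(event_record, indent=2))
```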

3.4 Video Segmentation Processing

We split the dataset into segments of 3 to 30 seconds for easier processing. We make sure an event is not divided into two segments and keep the event’s temporal boundary within one segment. Video without any event is randomly split into 145,473 video clips with durations from 3 to 20 seconds. All of the processed segments are checked again by humans to avoid annotation mistakes; some confusing segments are discarded during this process. Finally, we get a total of 25,719 video segments with event annotations (the core dataset) and 145,473 background segments. There are 1.47 labels per segment in the core dataset.

3.5 Dataset Analysis

Detailed SoccerDB statistics are shown in Table 2. A total of 14,358 segments have shot labels, which accounts for 38.07% of all events excluding the background. In contrast, we only collected 156 segments for penalty kick and 1,160 for red and yellow card, accounting for 0.41% and 3.07%, respectively. Since the dataset has an extreme class imbalance problem, it is difficult for existing state-of-the-art supervised methods to produce convincing results. We also explored the distribution of playbacks and found its relevance to the event type: every goal event has a playback, whereas only 1.6% of substitution events do. In Section 5.5 we verify this relevance. As shown in Section 2.2, we also provide comparisons of many aspects between other popular datasets and ours. Our dataset supports a greater variety of tasks and more detailed soccer class labels for constrained video understanding.

Table 2: SoccerDB statistics. The dataset covers ten key events in soccer games. This table shows the segment number, total duration, and playback segment number of each event. The unit of duration is minutes

4 THE BASELINE SYSTEM

To evaluate the capability of current video understanding technologies, and also to understand the challenges of the dataset, we developed algorithms that have shown strong performance on various datasets, which can provide strong baselines for future work to compare with. In our baseline system, the action recognition sub-module plays an essential role by providing the basic visual representation for both the temporal action detection and highlight detection tasks.

4.1 Object Detection

We adopt two representative object detection algorithms as baselines. One is Faster R-CNN, developed by Shaoqing Ren et al. [15]. The algorithm and its variants have been widely used in many detection systems in recent years. Faster R-CNN belongs to the two-stage detectors: a region proposal network (RPN) proposes a set of regions of interest (RoIs), and then a classifier and a regressor process only these region candidates to obtain the category of each RoI and precise bounding box coordinates. The other is RetinaNet, which is well known as a one-stage detector. The authors, Tsung-Yi Lin et al., discovered that the extreme foreground-background class imbalance is the central cause of one-stage detectors’ lower accuracy and introduced the focal loss to solve this problem [11].

4.2 Action Recognition

We treat each class as a binary classification problem and adopt a cross-entropy loss for each class. Two state-of-the-art action recognition algorithms are explored: the SlowFast Networks and the Non-local Neural Networks. The SlowFast network contains two pathways: a slow pathway, operating at a low frame rate, to capture spatial semantics, and a fast pathway, operating at a high frame rate, to capture the motion pattern. We use ResNet-50 as the backbone of the network. The Non-local Neural Network, proposed by Xiaolong Wang et al. [20], can capture long-range dependencies over the video sequence. The non-local operator is a generic building block that can be plugged into many deep architectures; we adopt a ResNet-50 I3D backbone with non-local blocks inserted.

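A minimal PyTorch sketch of the per-class binary formulation described above: the backbone is treated as a black box producing a fixed-length clip feature, and the feature dimension and class count are placeholders, so this only illustrates the per-class sigmoid with binary cross-entropy rather than the full training setup.

```python
# Sketch of the multi-label head: one independent sigmoid output per event class,
# trained with binary cross-entropy (feature dimension and class count assumed).
import torch
import torch.nn as nn

NUM_CLASSES = 11    # ten events plus background (assumption for illustration)
FEATURE_DIM = 2304  # SlowFast clip feature length reported in Section 5.4


class MultiLabelHead(nn.Module):
    def __init__(self, feature_dim: int = FEATURE_DIM, num_classes: int = NUM_CLASSES):
        super().__init__()
        self.fc = nn.Linear(feature_dim, num_classes)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # Raw logits; BCEWithLogitsLoss applies the per-class sigmoid internally.
        return self.fc(features)


head = MultiLabelHead()
criterion = nn.BCEWithLogitsLoss()

features = torch.randn(4, FEATURE_DIM)   # dummy clip features from the backbone
targets = torch.zeros(4, NUM_CLASSES)    # multi-hot event labels
targets[0, 3] = 1.0                      # e.g. clip 0 contains event class 3
loss = criterion(head(features), targets)
loss.backward()
```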

4.3 Transfer Knowledge from Object Detection to Action Recognition

We survey the relationship between object detection and action recognition based on the Faster R-CNN and SF-32 networks (the SlowFast framework sampling 32 frames per video segment) mentioned in Sections 4.1 and 4.2. First, we use Faster R-CNN to detect the objects in each sampled frame. Then, as shown in Figure 3, we add a new branch to SF-32 for explicitly modeling object spatial-temporal interactions, to examine whether object detection can provide complementary object-interaction knowledge that a convolution-based model cannot learn from the RGB sequence.

Figure 3: Mask and RGB Two-Stream (MRTS) approach structure.

![Figure 3: Mask and RGB Two-Stream (MRTS) approach structure](https://upload-images.jianshu.io/upload_images/20265886-2dad0f9b9536cbaa.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

Mask and RGB Two-Stream (MRTS) approach. We generate object masks of the same size as the RGB frame, with the number of mask channels equal to the number of object classes. In each channel, which represents one object class, the areas containing objects of that class are set to 1 and the others to 0. We set up a two-stream ConvNet architecture: one stream takes the mask as input, and the other takes the original RGB frame. The two streams are merged by concatenating the last fully connected layers. We suppose that if the spatial-temporal modeling of object locations can provide a complementary representation, the result of this approach will exceed the baseline SF-32 network performance by a large margin.

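The mask-stream input can be rasterized directly from the detection boxes. The sketch below assumes boxes given as (class_id, x1, y1, x2, y2) in pixel coordinates and three object classes (player, ball, goal); it is an illustrative reconstruction, not the released implementation.

```python
# Illustrative sketch: build per-class binary masks from detected boxes,
# one channel per object class, the same spatial size as the RGB frame.
import numpy as np

NUM_OBJECT_CLASSES = 3  # player, ball, goal


def boxes_to_mask(boxes, height: int, width: int) -> np.ndarray:
    """boxes: iterable of (class_id, x1, y1, x2, y2) in pixel coordinates.

    Returns a (NUM_OBJECT_CLASSES, height, width) array where pixels inside
    a box of class c are 1 in channel c and 0 elsewhere.
    """
    mask = np.zeros((NUM_OBJECT_CLASSES, height, width), dtype=np.float32)
    for class_id, x1, y1, x2, y2 in boxes:
        x1, y1 = max(int(x1), 0), max(int(y1), 0)
        x2, y2 = min(int(x2), width), min(int(y2), height)
        if x2 > x1 and y2 > y1:
            mask[int(class_id), y1:y2, x1:x2] = 1.0
    return mask


# The mask stream receives the stack of per-frame masks for a clip,
# while the RGB stream receives the original frames.
frame_mask = boxes_to_mask([(0, 60, 40, 90, 120), (1, 150, 100, 158, 108)], 224, 224)
```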

4.4 Temporal Action Detection

We explore temporal action detection with a two-stage method. First, a class-agnostic algorithm generates potential event proposals; then a proposal-classification approach is applied for final temporal boundary localization. During the first stage, we utilize the Boundary-Matching Network (BMN), a bottom-up temporal action proposal generation method, for generating high-quality proposals [10]. The BMN model is composed of three modules: (1) the Base module processes the extracted feature sequence of the original video and outputs a video embedding shared by the Temporal Evaluation Module (TEM) and the Proposal Evaluation Module (PEM); (2) the TEM evaluates the starting and ending probabilities of each location in a video to generate boundary probability sequences; (3) the PEM transfers the features to a boundary-matching feature map that contains the confidence scores of the proposals. During the second stage, an action recognition model mentioned in Section 4.2 predicts the classification score of each of the top K proposals. We choose the highest prediction score of each class as the final detection result.

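The second stage can be sketched as below under stated assumptions: `proposals` are (start, end, confidence) triples from BMN and `classify(start, end)` returns per-class scores for the trimmed segment; combining the two scores by a simple product is an assumption for illustration, not necessarily the exact rule used in the paper.

```python
# Sketch of the second detection stage: classify the top-K proposals and keep,
# for each class, the proposal with the highest fused score.
from typing import Callable, Dict, List, Tuple


def detect_events(proposals: List[Tuple[float, float, float]],
                  classify: Callable[[float, float], List[float]],
                  top_k: int = 100) -> Dict[int, Tuple[float, float, float]]:
    """proposals: (start_sec, end_sec, proposal_confidence) from the BMN stage.
    classify(start, end): per-class classification scores for that segment.
    Returns {class_id: (start, end, score)} with the best proposal per class.
    """
    best: Dict[int, Tuple[float, float, float]] = {}
    ranked = sorted(proposals, key=lambda p: p[2], reverse=True)[:top_k]
    for start, end, prop_conf in ranked:
        for class_id, cls_score in enumerate(classify(start, end)):
            score = prop_conf * cls_score  # fusion rule assumed for illustration
            if class_id not in best or score > best[class_id][2]:
                best[class_id] = (start, end, score)
    return best
```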

4.5 Highlight Detection

In this section, we formalize the highlight detection task as a binary classification problem: recognizing which video segment is a playback. The structures of the highlight detection models are presented in Figure 4. We select the SF-32 network as the basic classifier and then consider four scenarios:
The fully-connected only (fc-only) approach extracts features from the final fc layer of a model pre-trained on the action recognition task described in Section 4.2 and then trains a logistic regressor for highlight detection. This approach evaluates the strength of the representation learned by action recognition, which can indicate the internal correlation between the highlight detection and action recognition tasks.
The fully fine-tuning (full-ft) approach fine-tunes a binary classification network initialized with weights from the action recognition model.
The multi-task (mt) approach trains a multi-label classification network for both action recognition and highlight detection. We adopt a per-label sigmoid output followed by a logistic loss at the end of the SlowFast-32 network. This approach takes highlight segments as another action label in the action recognition framework. The advantage of this setting is that it can force the network to learn the relevance among tasks, while the disadvantage is that the new label may introduce noise that confuses the learning procedure.
The multi-task with highlight detection branch (mt-hl-branch) approach adds a new two-layer 3x3x3 convolution branch for playback recognition, which shares the same backbone with the recognition task. We first train only the highlight detection branch while freezing the parameters initialized from the action recognition pre-trained model, then fine-tune all parameters for multi-task learning; a minimal sketch of this branch is given below.

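For concreteness, here is a minimal PyTorch sketch of the mt-hl-branch idea referenced above. The shared backbone is assumed to be any 3D ConvNet returning a spatio-temporal feature map, the branch is two 3x3x3 convolutions followed by pooling and a single playback logit, and the channel widths are placeholders rather than the paper's actual configuration.

```python
# Sketch of the multi-task head: a shared 3D backbone feeds the action classifier
# and a small two-layer 3x3x3 convolution branch for playback (highlight) detection.
import torch
import torch.nn as nn


class MultiTaskHighlightModel(nn.Module):
    def __init__(self, backbone: nn.Module, feat_channels: int = 2048,
                 num_action_classes: int = 11):
        super().__init__()
        self.backbone = backbone                   # shared 3D ConvNet (assumed)
        self.action_fc = nn.Linear(feat_channels, num_action_classes)
        self.highlight_branch = nn.Sequential(     # two 3x3x3 conv layers
            nn.Conv3d(feat_channels, 512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(512, 512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.highlight_fc = nn.Linear(512, 1)

    def forward(self, clip: torch.Tensor):
        feat = self.backbone(clip)                 # assumed (B, C, T, H, W) feature map
        action_logits = self.action_fc(feat.mean(dim=[2, 3, 4]))
        hl_feat = self.highlight_branch(feat).mean(dim=[2, 3, 4])
        playback_logit = self.highlight_fc(hl_feat)
        return action_logits, playback_logit


# Stage 1: freeze the backbone (initialized from the action recognition model)
# and train only the highlight branch; Stage 2: unfreeze everything and
# fine-tune with the sum of the action and playback binary cross-entropy losses.
```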

Figure 4: The structure of the highlight detection models

5 EXPERIMENTS

In this section, we focus on the performance of our baseline system on SoccerDB for object detection, action recognition, temporal action detection, and highlight detection tasks.

5.1 Object Detection

We choose ResNeXt-101 with FPN as the backbone of both RetinaNet and Faster R-CNN. We use models pre-trained on the MS-COCO dataset and train them on 8 NVIDIA 2080Ti GPUs, with an initial learning rate of 0.01 for RetinaNet and 0.02 for Faster R-CNN. The MS-COCO style evaluation method is applied to benchmark the models. The training data from the video part and the image part are mixed to train each model. We present AP with IoU=0.5:0.95 and multiple scales in Table 4, and also report the AP of each class in Table 5. RetinaNet performs better than Faster R-CNN, and large-scale objects are easier for both methods than small objects. The ball detection result is lower than those for the player and goal due to the small scale and motion blur. All of the detection experiments are powered by the mmdetection toolbox, which was developed by the winner of the 2018 COCO detection challenge [4].

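The MS-COCO style evaluation mentioned here can be reproduced with the standard pycocotools workflow; the annotation and result file names below are placeholders.

```python
# COCO-style AP evaluation (IoU=0.5:0.95) with pycocotools; file names are placeholders.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("soccerdb_det_val.json")              # ground-truth boxes in COCO format
coco_dt = coco_gt.loadRes("detector_results.json")   # detections in COCO result format

coco_eval = COCOeval(coco_gt, coco_dt, iouType="bbox")
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()  # prints AP, AP50, AP75 and AP for small/medium/large objects

# Per-class AP (as in Table 5): restrict the evaluation to one category at a time.
for cat_id in coco_gt.getCatIds():
    per_class = COCOeval(coco_gt, coco_dt, iouType="bbox")
    per_class.params.catIds = [cat_id]
    per_class.evaluate()
    per_class.accumulate()
    per_class.summarize()
```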

5.2 Action Recognition

We set up the experiments with the open-source tool PySlowFast and initialize all recognition networks from Kinetics pre-trained models. Since some labels are rare in the dataset, we adjust the distribution of the different labels appearing in a training batch to balance the proportion of labels. We resize the original video frames to 224x224 pixels and apply random horizontal flips at the training stage. At the inference stage, we simply resize the frames to 224x224 without horizontal flipping. We compare 32 and 64 sampled frames to investigate the influence of the sampling rate. The average precision (AP) scores for each class are shown in Table 6.

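Balancing the label distribution within training batches can be implemented, for example, with inverse-frequency sample weights; the sketch below uses PyTorch's WeightedRandomSampler and is an assumed strategy, since the paper does not detail its exact re-balancing scheme.

```python
# Sketch of batch re-balancing: sample segments with probability inversely
# proportional to their class frequency so rare events appear more often.
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler


def make_balanced_loader(dataset, labels, batch_size: int = 16) -> DataLoader:
    """labels: one primary event class id per segment (simplification)."""
    labels_t = torch.tensor(labels)
    counts = torch.bincount(labels_t).float()
    class_weights = 1.0 / counts.clamp(min=1.0)   # inverse class frequency
    sample_weights = class_weights[labels_t]      # weight per segment
    sampler = WeightedRandomSampler(sample_weights,
                                    num_samples=len(labels),
                                    replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```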

Table 6: Average precision (%) of different recognition models on each class. SF-32/SF-64: SlowFast Network with 32/64 sampled frames. NL-32/NL-64: Non-local Network with 32/64 sampled frames. MRTS is the method powered by object detection. The event names are the same as in Table 2

The dense frame sampling rate surpasses the sparse one for both methods. Classes with more instances, like shot, perform better than classes with fewer instances. Substitution and corner, whose visual features are discriminative from the others, also obtain high AP scores. The AP of penalty kick fluctuates over a wide range because there are only 30 instances in the validation dataset.

5.3 Transfer Knowledge from Object Detection to Action Recognition

To make the results more comparable, all the basic experiment settings in this section are the same as described in Section 5.2. The average precision results of the MRTS approach introduced in Section 4.3 are shown in Table 6.

From the experiment results, we can easily conclude that understanding the basic objects’ spatial-temporal interactions is critical for action recognition. MRTS increases SF-32 by 15%, which demonstrates that modeling the objects’ relationships can provide a complementary representation that cannot be captured by a 3D ConvNet from the RGB sequence.

5.4 Temporal Action Detection

In this section, we evaluate the performance of temporal action proposal generation and detection and give a quantified analysis of how the action recognition task affects temporal action localization. For a fair comparison of different action detection algorithms, we benchmark our baseline system on the core dataset instead of the results produced by the Section 4.2 models. We adopt the fc layer of the action classifier as a feature extractor on contiguous 32-frame windows, obtaining features of length 2,304. We set a 32-frame sliding window with a stride of 5 frames, which produces overlapping segments for a video. The feature sequence is re-scaled to a fixed length D by zero-padding or average pooling, with D=100. To evaluate proposal quality, Average Recall (AR) under multiple IoU thresholds [0.5:0.05:0.95] is calculated. We report AR under different Average Numbers of proposals (AN) as AR@AN, and the area under the AR-AN curve (AUC) as in the ActivityNet-1.3 metrics, where AN ranges from 0 to 100. To show the influence of different feature extractors on the detection task, we compare two SlowFast-32 pre-trained models: one trained on the SoccerDB action recognition task described in Section 4.2, the other trained on Kinetics. Table 7 shows the results of those two extractors.

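The proposal metrics can be computed as sketched below: temporal IoU between proposals and ground-truth events, with recall averaged over the IoU thresholds [0.5:0.05:0.95] at a fixed number of proposals per video. This is a simplified illustration rather than the official ActivityNet evaluation code.

```python
# Simplified AR@AN: average recall over IoU thresholds 0.5:0.05:0.95 using the
# top-AN proposals of each video.
import numpy as np


def temporal_iou(p, g):
    """p, g: (start, end) in seconds."""
    inter = max(0.0, min(p[1], g[1]) - max(p[0], g[0]))
    union = max(p[1], g[1]) - min(p[0], g[0])
    return inter / union if union > 0 else 0.0


def average_recall_at_an(proposals_per_video, gts_per_video, an: int = 100) -> float:
    """proposals_per_video: list of [(start, end, score), ...] per video.
    gts_per_video: list of [(start, end), ...] per video."""
    recalls = []
    for thr in np.arange(0.5, 1.0, 0.05):
        hit, total = 0, 0
        for props, gts in zip(proposals_per_video, gts_per_video):
            top = sorted(props, key=lambda x: x[2], reverse=True)[:an]
            for g in gts:
                total += 1
                if any(temporal_iou((s, e), g) >= thr for s, e, _ in top):
                    hit += 1
        recalls.append(hit / max(total, 1))
    return float(np.mean(recalls))
```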

Table 7: Temporal action proposal AR@AN(%) and AUC(%) results

The feature extractor trained on SoccerDB exceeds the Kinetics extractor by 0.7% on the AUC metric. This means we benefit from training the feature encoder on the same dataset at the temporal action proposal generation stage, but the gain is limited. We use the same SF-32 classifier to produce the final detection results based on the temporal proposals, and the detection metric is mAP with IoU thresholds {0.3:0.1:0.7}. With the Kinetics proposals the mAP is 52.35%, while with the SoccerDB proposals it is 54.30%. The different feature encoders achieve similar performance for the following reasons: first, Kinetics is a very large-scale action recognition database that contains ample patterns for training a good general feature encoder; second, the algorithm we adopt at the proposal stage is strong enough for modeling the temporal locations of the important events.

5.5 Highlight Detection

We set up the experiments on the whole SoccerDB dataset. The average precision results of our four baseline models are shown in Table 8. The fc-only model gets 68.72% AP, demonstrating that the action recognition model can provide a strong representation for the highlight detection task and indicating a close relationship between our defined events and the highlight segments. The mt model decreases the AP of the full-ft model by 2.33%, which means the highlight segments are very different from the action classes when sharing the same features. The mt-hl-branch model gives the highest AP by better utilizing the correlation between the two tasks while distinguishing their differences. We also find that the mt model is harmful to recognition, decreasing the mAP by 1.85% compared to the baseline model, whereas mt-hl-branch increases the action recognition mAP by 1.46% while providing the highest highlight detection score. The detailed action recognition mAP for the three models is shown in Table 9. A better way to utilize the connection between action recognition and highlight detection is expected to boost the performance of both.

Table 8: Highlight detection AP (%) under four different settings
Table 9: Action recognition mAP (%) under highlight-detection multi-task learning. SF-32 is the baseline model

6 CONCLUSION

In this paper, we introduce SoccerDB, a new benchmark for comprehensive video understanding. It helps us study object detection, action recognition, temporal action detection, and video highlight detection in closed form under a restricted but challenging environment. We explore many state-of-the-art methods for the different tasks and discuss the relationships among those tasks. The quantified results show that there are very close connections between different visual understanding tasks, and algorithms can benefit a lot when considering these connections. We release the benchmark to the video understanding community in the hope of driving researchers toward building a human-comparable video understanding system.

7 ACKNOWLEDGMENTS

This work is supported by the State Key Laboratory of Media Convergence Production Technology and Systems, and Xinhua Zhiyun Technology Co., Ltd.

REFERENCES

[1] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. 2016. Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675 (2016).
[2] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. 2015. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 961–970.
[3] Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6299–6308.
[4] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, et al. 2019. MMDetection: Open MMLab Detection Toolbox and Benchmark. arXiv preprint arXiv:1906.07155 (2019).
[5] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. 2019. Slowfast networks for video recognition. In Proceedings of the IEEE International Conference on Computer Vision. 6202–6211.
[6] Silvio Giancola, Mohieddine Amine, Tarek Dghaily, and Bernard Ghanem. 2018. Soccernet: A scalable dataset for action spotting in soccer videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 1711–1721.
[7] Georgia Gkioxari, Ross Girshick, and Jitendra Malik. 2015. Contextual action recognition with r* cnn. In Proceedings of the IEEE international conference on computer vision. 1080–1088.
[8] Chunhui Gu, Chen Sun, David A Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, et al. 2018. AVA: A video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6047–6056.
[9] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. 2014. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 1725–1732.
[10] Tianwei Lin, Xiao Liu, Xin Li, Errui Ding, and Shilei Wen. 2019. BMN: Boundary- Matching Network for Temporal Action Proposal Generation. In Proceedings of the IEEE International Conference on Computer Vision. 3889–3898.
[11] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision. 2980–2988.
[12] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In European conference on computer vision. Springer, 740–755.
[13] Vignesh Ramanathan, Jonathan Huang, Sami Abu-El-Haija, Alexander Gorban, Kevin Murphy, and Li Fei-Fei. 2016. Detecting events and key actors in multi- person videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3043–3053.
[14] Arnau Raventos, Raul Quijada, Luis Torres, and Francesc Tarrés. 2015. Automatic summarization of soccer highlights using audio-visual descriptors. SpringerPlus 4, 1 (2015), 1–19.
[15] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems. 91–99.
[16] Huang-Chia Shih. 2017. A survey of content-aware video analysis for sports. IEEE Transactions on Circuits and Systems for Video Technology 28, 5 (2017), 1212–1231.
[17] Gunnar A Sigurdsson, Olga Russakovsky, and Abhinav Gupta. 2017. What actions are needed for understanding human actions in videos?. In Proceedings of the IEEE International Conference on Computer Vision. 2137–2146.
[18] Yale Song, Jordi Vallmitjana, Amanda Stent, and Alejandro Jaimes. 2015. Tvsum: Summarizing web videos using titles. In Proceedings of the IEEE conference on computer vision and pattern recognition. 5179–5187.
[19] Rajkumar Theagarajan, Federico Pala, Xiu Zhang, and Bir Bhanu. 2018. Soccer: Who has the ball? Generating visual analytics and player statistics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 1749–1757.
[20] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. 2018. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7794–7803.
[21] Xiaolong Wang and Abhinav Gupta. 2018. Videos as space-time region graphs. In Proceedings of the European Conference on Computer Vision (ECCV). 399–417.
[22] Zhao Zhao, Shuqiang Jiang, Qingming Huang, and Guangyu Zhu. 2006. Highlight summarization in sports video based on replay detection. In 2006 IEEE international conference on multimedia and expo. IEEE, 1613–1616.
