Learned Video Compression: Paper Notes and Translation (1)

Learned Video Compression

Video Compression Based on Machine Learning

Abstract


We present a new algorithm for video coding, learned end-to-end for the low-latency mode. In this setting, our approach outperforms all existing video codecs across nearly the entire bitrate range. To our knowledge, this is the first ML-based method to do so.

For video coding in the low-latency mode, we propose a new end-to-end learned algorithm. Across nearly the entire bitrate range, our method outperforms all existing codecs; to our knowledge, it is the first machine-learning-based video codec to do so.

We evaluate our approach on standard video compression test sets of varying resolutions, and benchmark against all mainstream commercial codecs in the low-latency mode. On standard-definition videos, HEVC/H.265, AVC/H.264 and VP9 typically produce codes up to 60% larger than our algorithm. On high-definition 1080p videos, H.265 and VP9 typically produce codes up to 20% larger, and H.264
up to 35% larger. Furthermore, our approach does not suffer from blocking artifacts and pixelation, and thus produces videos that are more visually pleasing.

We evaluate our method on video compression test sets of different resolutions, and benchmark it against all mainstream commercial codecs in the low-latency mode. On standard-definition videos, HEVC, AVC, and VP9 produce up to 60% more bits than our algorithm; on high-definition videos, HEVC and VP9 produce up to 20% more, and AVC up to 35% more. In addition, our method does not suffer from blocking artifacts or pixelation, so the compressed videos look better visually.

We propose two main contributions. The first is a novel architecture for video compression, which (1) generalizes motion estimation to perform any learned compensation beyond simple translations, (2) rather than strictly relying on previously transmitted reference frames, maintains a state of arbitrary information learned by the model, and (3) enables jointly compressing all transmitted signals (such as optical flow and residual).

We propose two main contributions. The first is a novel architecture for video compression, which (1) generalizes motion estimation to learned compensation beyond simple translations, (2) maintains a state of arbitrary information learned by the model instead of relying strictly on previously transmitted reference frames, and (3) can jointly compress all transmitted signals (such as optical flow and residuals). (Translator's note: I do not understand this paragraph very clearly yet, so I am skipping it for now.)

Secondly, we present a framework for ML-based spatial rate control — a mechanism for assigning variable bitrates across space for each frame. This is a critical component for video coding, which to our knowledge had not been developed within a machine learning setting.

Second, we propose a framework for ML-based spatial rate control: a mechanism for assigning variable bitrates across space within each frame. This is a key component of video coding which, to our knowledge, had not been developed with machine learning methods before.

1. Introduction


Video content consumed more than 70% of all internet traffic in 2016, and is expected to grow threefold by 2021 [1]. At the same time, the fundamentals of existing video compression algorithms have not changed considerably over the last 20 years [46, 36, 35, . . . ]. While they have been very well engineered and thoroughly tuned, they are hard-coded, and as such cannot adapt to the growing demand and increasingly versatile spectrum of video use cases such as social media sharing, object detection, VR streaming, and so on.

In 2016, video content accounted for more than 70% of all internet traffic, and it is expected to triple by 2021. Meanwhile, the fundamentals of existing video compression algorithms have not changed much over the past 20 years. Although they are well engineered and thoroughly tuned, they are hard-coded and therefore cannot adapt to the growing demand and the increasingly diverse video use cases, such as social media sharing, object detection, and VR streaming.

Meanwhile, approaches based on deep learning have revolutionized many industries and research disciplines. In particular, in the last two years, the field of image compression has made large leaps: ML-based image compression approaches have been surpassing the commercial codecs by significant margins, and are still far from saturating to their full potential (survey in Section 1.3).

Meanwhile, approaches based on deep learning have revolutionized many industries and research disciplines. In particular, over the past two years the field of image compression has made huge leaps: ML-based image compression methods have surpassed commercial codecs by large margins, and they are still far from saturated, with much of their potential untapped (see the survey in Section 1.3).

The prevalence of deep learning has further catalyzed the proliferation of architectures for neural network acceleration across a spectrum of devices and machines. This hardware revolution has been increasingly improving the performance of deployed ML-based technologies—rendering video compression a prime candidate for disruption.

The prevalence of deep learning has further spurred the proliferation of neural-network acceleration architectures across all kinds of devices and machines. This hardware revolution keeps improving the performance of deployed ML-based technologies, making video compression a prime candidate for ML-driven disruption.

In this paper, we introduce a new algorithm for video coding. Our approach is learned end-to-end for the low latency mode, where each frame can only rely on information from the past. This is an important setting for live transmission, and constitutes a self contained research problem and a stepping-stone towards coding in its full generality. In this setting, our approach outperforms all existing video codecs across nearly the entire bitrate range.

This paper introduces a new video coding algorithm. Our method is learned end-to-end for the low-latency mode, in which each frame may rely only on information from the past. This is an important setting for live transmission; it forms a self-contained research problem and a stepping-stone toward fully general coding. In this setting, our method outperforms all existing video codecs across nearly the entire bitrate range.

We thoroughly evaluate our approach on standard datasets of varying resolutions, and benchmark against all modern commercial codecs in this mode. On standard definition (SD) videos, HEVC/H.265, AVC/H.264 and VP9 typically produce codes up to 60% larger than our algorithm. On high-definition (HD) 1080p videos, H.265 and VP9 typically produce codes up to 20% larger, and H.264 up to 35% larger. Furthermore, our approach does not suffer from blocking artifacts and pixelation, and thus produces videos that are more visually pleasing (see Figure 1).

We thoroughly evaluate our method on standard datasets of different resolutions and benchmark it against all modern commercial codecs in this mode. On standard-definition (SD) videos, HEVC/H.265, AVC/H.264, and VP9 typically produce codes up to 60% larger than our algorithm's. On high-definition 1080p videos, H.265 and VP9 typically produce codes up to 20% larger, and H.264 up to 35% larger. Moreover, our method is free of blocking artifacts and pixelation, so the resulting videos are visually more satisfying (see Figure 1).

In Section 1.1, we provide a brief introduction to video coding in general. In Section 1.2, we proceed to describe our contributions. In Section 1.3 we discuss related work, and in Section 1.4 we provide an outline of this paper.

Section 1.1 gives a brief overview of video coding in general; Section 1.2 describes our contributions; Section 1.3 discusses related work; and Section 1.4 outlines the rest of the paper.

1.1. Video coding in a nutshell


1.1.1 Video frame types


Video codecs are designed for high compression efficiency, and achieve this by exploiting spatial and temporal redundancies within and across video frames ([51, 47, 36, 34] provide great overviews of commercial video coding techniques). Existing video codecs feature 3 types of frames:

1. I-frames ("intra-coded"), compressed using an image codec and do not depend on any other frames;

2. P-frames ("predicted"), extrapolated from frames in the past; and

3. B-frames ("bi-directional"), interpolated from previously transmitted frames in both the past and future.

While introducing B-frames enables higher coding efficiency, it increases the latency: to decode a given frame, future frames have to first be transmitted and decoded.

Video codecs are designed to compress video efficiently by exploiting spatial redundancy within frames and temporal redundancy across frames (references [51, 47, 36, 34] give good overviews of commercial codecs). Existing video codecs feature three frame types:
1. I-frames ("intra-coded"): compressed with an image codec and dependent on no other frames;
2. P-frames ("predicted"): extrapolated from previously reconstructed reference frames;
3. B-frames ("bi-directional"): interpolated from previously transmitted reference frames both before and after them.
Although introducing B-frames improves coding efficiency, it increases latency: to decode a B-frame, the frames after it must first be transmitted and decoded.
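To make the latency point concrete, here is a toy Python sketch (our illustration, not from the paper) of how B-frames force transmission order to differ from display order; the function name and the simplified IBP reordering are our own assumptions:

```python
def transmission_order(frame_types):
    """Return display indices in the order they must be transmitted.

    I- and P-frames depend only on the past, so they can be sent in
    display order; each B-frame also depends on the *next* I/P frame,
    which therefore has to be sent first (simplified IBP reordering).
    """
    order, pending_b = [], []
    for i, kind in enumerate(frame_types):
        if kind in ("I", "P"):
            order.append(i)          # the reference frame goes out now
            order.extend(pending_b)  # B-frames that were waiting for it
            pending_b = []
        else:                        # "B": must wait for a future reference
            pending_b.append(i)
    order.extend(pending_b)
    return order

# Display order I B B P is transmitted as I P B B: the decoder cannot
# show frames 1 and 2 until frame 3 has arrived, hence the extra latency.
print(transmission_order(["I", "B", "B", "P"]))  # [0, 3, 1, 2]
```

In the low-latency mode studied in this paper there are no B-frames, so transmission order equals display order and no such waiting occurs.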

1.1.2 Compression procedure


In all modern video codecs, P-frame coding is invariably accomplished via two separate steps: (1) motion compensation, followed by (2) residual compression.
Motion compensation.
The goal of this step is to leverage temporal redundancy in the form of translations. This is done via block-matching (overview at [30]), which reconstructs the current target, say xt for time step t, from a handful of previously transmitted reference frames. Specifically, different blocks in the target are compared to ones within the reference frames, across a range of possible displacements. These displacements can be represented as an optical flow map ft, and block-matching can be written as a special case of the flow estimation problem (see Section 1.3). In order to minimize the bandwidth required to transmit the flow ft and reduce the complexity of the search, the flows are applied uniformly over large spatial blocks, and discretized to precision of half/quarter/eighth-pixel.
Residual compression.
Following motion compensation, the leftover difference between the target and its motion-compensated approximation mt is then compressed. This difference Δt = xt − mt is known as the residual, and is independently encoded with an image compression algorithm adapted to the sparsity of the residual.
In all modern video codecs, P-frame coding is always accomplished in two separate steps: (1) motion compensation, then (2) residual compression.
Motion compensation.
The goal of this step is to exploit temporal redundancy. It is done via block matching (overview in [30]), which reconstructs the current frame from a handful of previously transmitted reference frames. Concretely, each block in the target is compared with blocks in the reference frames; the block most similar to the current block is found, and the offset between them is the current block's displacement. These displacements can be represented as an optical flow map ft, and block matching can be written as a special case of the flow estimation problem (see Section 1.3). To minimize the bandwidth needed to transmit ft and to reduce search complexity, the flow is applied uniformly over large spatial blocks (translator's note: all pixels in a block share one displacement) and discretized to half-, quarter-, or eighth-pixel precision (sub-pixel motion estimation).
Residual compression.
After motion compensation, the difference between the target (the current block) and its motion-compensated approximation mt (the similar block in the reference frame) is compressed. This difference Δt = xt − mt is called the residual, and is encoded independently with an image compression algorithm adapted to the residual's sparsity.
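The block-matching and residual steps described above can be sketched in a few lines of NumPy. This is a toy integer-pel version (the block size, search range, and sum-of-squared-differences criterion are our illustrative choices, not the paper's):

```python
import numpy as np

def block_match(ref, target, block=4, search=2):
    """Integer-pel block matching: for each target block, scan the
    reference frame over a small displacement window and keep the best
    match. The chosen displacements form a block-wise flow map f_t."""
    h, w = target.shape
    flow = np.zeros((h // block, w // block, 2), dtype=int)
    pred = np.zeros_like(target)
    for by in range(0, h, block):
        for bx in range(0, w, block):
            tgt_blk = target[by:by + block, bx:bx + block]
            best, best_err = (0, 0), np.inf
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y, x = by + dy, bx + dx
                    if 0 <= y and y + block <= h and 0 <= x and x + block <= w:
                        err = np.sum((ref[y:y + block, x:x + block] - tgt_blk) ** 2)
                        if err < best_err:
                            best, best_err = (dy, dx), err
            flow[by // block, bx // block] = best
            dy, dx = best
            pred[by:by + block, bx:bx + block] = ref[by + dy:by + dy + block,
                                                     bx + dx:bx + dx + block]
    residual = target - pred  # Δt = xt − mt, coded separately afterwards
    return flow, pred, residual

ref = np.arange(64, dtype=float).reshape(8, 8)
target = np.roll(ref, 1, axis=1)  # the scene shifted right by one pixel
flow, pred, residual = block_match(ref, target)
```

For the blocks away from the wrap-around boundary the match is exact, so the residual is zero there and only the flow needs to be sent; real codecs additionally refine the search to sub-pixel precision, as the text notes.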

1.2. Contributions


This paper presents several novel contributions to video codec design, and to ML modeling of compression:
Compensation beyond translation.
Traditional codecs are constrained to predicting temporal patterns strictly in the form of motion. However, there exists significant redundancy that cannot be captured via simple translations. Consider, for example, an out-of-plane rotation such as a person turning their head sideways. Traditional codecs will not be able to predict a profile face from a frontal view. In contrast, our system is able to learn arbitrary spatio-temporal patterns, and thus propose more accurate predictions, leading to bitrate savings.
This paper makes several novel contributions to video codec design and to ML-based modeling of compression:

Compensation beyond simple translation.
Traditional codecs can predict temporal patterns strictly in the form of motion only, yet a great deal of redundancy cannot be captured by simple translations. Consider, for example, an out-of-plane rotation such as a person turning their head sideways: a traditional codec cannot predict the profile view of a face from the frontal view. In contrast, our system can learn arbitrary spatio-temporal patterns, producing more accurate predictions and thus saving bitrate.

Propagation of a learned state.
In traditional codecs all "prior knowledge" propagated from frame to frame is expressed strictly via reference frames and optical flow maps, both embedded in raw pixel space. These representations are very limited in the class of signals they may characterize, and moreover cannot capture long-term memory. In contrast, we propagate an arbitrary state autonomously learned by the model to maximize information retention.
Propagation of prior information.
In traditional codecs, prior information is propagated strictly through reference frames and optical flow maps, both embedded in raw pixel space. Such representations have limited expressive power and cannot capture long-term memory. In contrast, we propagate an arbitrary state learned autonomously by the model, so as to retain as much information as possible.
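As a rough intuition for a propagated learned state, the following sketch replaces the learned recurrent cell with a fixed leaky average; `update_state` and its weights are our stand-ins, not the paper's model:

```python
import numpy as np

def update_state(state, frame_features, w_s=0.9, w_f=0.1):
    # Stand-in for a learned recurrent cell (e.g. a ConvLSTM/ConvGRU);
    # the real update is learned end-to-end. A fixed leaky average is
    # enough to show that information persists beyond a single frame.
    return w_s * state + w_f * frame_features

state = np.zeros(8)
for t in range(5):
    features = np.full(8, float(t))  # stand-in features of frame t
    state = update_state(state, features)

# The final state blends all past frames with exponentially decaying
# weights -- something a single pixel-space reference frame cannot hold.
print(round(state[0], 4))  # 0.9049
```

The point of the sketch is only the data flow: the state is a free-form tensor carried from step to step, not a decoded picture, so the model can choose what to remember.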
Joint compression of motion and residual.
Each codec must fundamentally decide how to distribute bandwidth among motion and residual. However, the optimal tradeoff between these is different for each frame. In traditional methods, the motion and residual are compressed separately, and there is no easy way to trade them off. Instead, we jointly compress the compensation and residual signals using the same bottleneck. This allows our network to reduce redundancy by learning how to distribute the bitrate among them as a function of frame complexity.
Joint compression of motion and residual.
Every codec must fundamentally decide how to split the bandwidth between motion and residual, but the optimal trade-off differs from frame to frame. In traditional methods, motion and residual are compressed separately, with no easy way to trade them off. Instead, we jointly compress the compensation and residual signals through the same bottleneck, so the network reduces redundancy by learning how to distribute the bitrate between them as a function of frame complexity.
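A toy illustration of the shared-bottleneck idea (our sketch, not the paper's network): motion and residual features compete for one fixed coefficient budget, so the split adapts per frame. `joint_code` and the magnitude-based selection are hypothetical:

```python
import numpy as np

def joint_code(flow_feat, res_feat, budget):
    """Keep only the `budget` largest-magnitude coefficients of the
    concatenated (motion, residual) features -- one shared bottleneck
    instead of two separately sized ones."""
    joint = np.concatenate([flow_feat, res_feat])
    keep = np.argsort(-np.abs(joint))[:budget]
    code = np.zeros_like(joint)
    code[keep] = joint[keep]
    return code[:flow_feat.size], code[flow_feat.size:]

# A frame with complex motion but a tiny residual: the whole budget
# flows to the motion half automatically, with no hand-tuned split.
flow_feat = np.array([3.0, -2.5, 2.0, 1.5])
res_feat = np.array([0.2, -0.1, 0.05, 0.3])
f_hat, r_hat = joint_code(flow_feat, res_feat, budget=4)
```

On a frame with little motion but a busy residual, the same code would spend the budget the other way around, which is the behavior the paragraph describes.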

Flexible motion field representation.
In traditional codecs, optical flow is represented with a hierarchical block structure where all pixels within a block share the same motion. Moreover, the motion vectors are quantized to a particular sub-pixel resolution. While this representation is chosen because it can be compressed efficiently, it does not capture complex and fine motion. In contrast, our algorithm has the full flexibility to distribute the bandwidth so that areas that matter more have arbitrarily sophisticated motion boundaries at an arbitrary flow precision, while unimportant areas are represented very efficiently. See comparisons in Figure 2.
Flexible motion field representation.
In traditional codecs, optical flow is represented with a hierarchical block structure in which all pixels of a block share the same motion, and motion vectors are quantized to a particular sub-pixel resolution. This representation is chosen because it compresses well, but it cannot capture complex, fine motion. In contrast, our algorithm has full flexibility in distributing bandwidth, so more important regions get arbitrarily sophisticated motion boundaries at arbitrary flow precision, while unimportant regions are represented very efficiently. See the comparison in Figure 2.
Multi-flow representation.
Consider a video of a train moving behind fine branches of a tree. Such a scene is highly inefficient to represent with traditional systems that use a single flow map, as there are small occlusion patterns that break the flow. Furthermore, the occluded content will have to be synthesized again once it reappears. We propose a representation that allows our method the flexibility to decompose a complex scene into a mixture of multiple simple flows and preserve occluded content.
Multi-flow representation. (Translator's note: I have not yet worked out exactly what kind of flow this refers to in coding.)
Imagine a video of a train moving behind the fine branches of a tree. Such a scene is hard to represent efficiently with traditional schemes, which use a single flow map: the small occlusion patterns break the flow, and the occluded content has to be synthesized again once it reappears. Our proposed representation gives the method the flexibility to decompose a complex scene into a mixture of multiple simple flows and to preserve occluded content.
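The train-behind-branches example can be mimicked with a two-layer, two-flow toy (entirely our construction, not the paper's architecture): each layer follows its own simple flow and a per-pixel weight map blends them, something a single flow map cannot express:

```python
import numpy as np

def warp_horiz(img, dx):
    return np.roll(img, dx, axis=1)  # toy "flow": a global horizontal shift

train = np.tile(np.arange(8.0), (4, 1))  # moving layer (the train)
branches = np.zeros((4, 8))
branches[:, 3] = 9.0                     # a thin static occluder (branches)
mask = (branches > 0).astype(float)      # 1 where the branches occlude

def reconstruct(shift):
    # Each layer follows its own simple flow; a single flow map cannot
    # describe the moving train and the static branches at once, and the
    # train pixels behind the branches survive in their own layer.
    return mask * branches + (1 - mask) * warp_horiz(train, shift)

frame = reconstruct(shift=2)
```

Because the occluded train pixels live on in the train layer, they reappear for free at the next shift instead of having to be re-synthesized, which is the inefficiency the paragraph attributes to single-flow codecs.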
Spatial rate control. It is critical for any video compression approach to feature a mechanism for assigning different bitrates at different spatial locations for each frame. In ML-based codec modeling, it has been challenging to construct a single model which supports R multiple bitrates, and achieves the same results as R separate, individual models each trained exclusively for one of the bitrates. In this work we present a framework for ML-driven spatial rate control which meets this requirement.
Spatial rate control.
For any video compression approach, a mechanism that assigns different bitrates to different spatial locations in each frame is critical. In ML-based codec modeling, it has been a challenge to build a single model that supports multiple bitrates and matches the results of training a separate model for each bitrate (translator's note: one model covers many rates, so multiple models are not needed). This paper presents an ML-driven spatial rate control framework that meets this requirement.
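One simple way to picture spatial rate control (a hedged sketch of the general idea, not the paper's mechanism) is a per-block map that selects the quantization step, spending more bits where the map asks for them:

```python
import numpy as np

def quantize(block, step):
    return np.round(block / step) * step

def encode_frame(frame, rate_map, block=4, steps=(8.0, 1.0)):
    """rate_map[i, j] picks a quantization step per spatial block:
    0 = coarse (few bits), 1 = fine (more bits)."""
    out = np.zeros_like(frame)
    for i in range(rate_map.shape[0]):
        for j in range(rate_map.shape[1]):
            y, x = i * block, j * block
            out[y:y + block, x:x + block] = quantize(
                frame[y:y + block, x:x + block], steps[rate_map[i, j]])
    return out

frame = np.arange(64, dtype=float).reshape(8, 8)
rate_map = np.array([[0, 1], [1, 0]])  # spend fine bits only where needed
recon = encode_frame(frame, rate_map)
err = np.abs(recon - frame)            # zero exactly in the "fine" blocks
```

In the learned setting described here, the interesting part is that one trained model serves every entry of such a map, rather than training one model per bitrate.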

1.3. Related Work
ML-based image compression.
In the last two years, we have seen a great surge of ML-based image compression approaches [15, 44, 45, 5, 4, 14, 25, 43, 38, 23, 2, 27, 6, 3, 10, 32, 33]. These learned approaches have been reinventing many of the hard-coded techniques developed in traditional image coding: the coding scheme, transformations into and out of a learned codespace, quality assessment, and so on.
Over the past two years, a great many ML-based image compression approaches have appeared. These learned methods have reinvented many of the hard-coded techniques of traditional image coding: the coding scheme itself, the transformations into and out of a learned codespace, quality assessment, and so on.
