Burst photography for high dynamic range and low-light imaging on mobile cameras

Abstract

Cell phone cameras have small apertures, which limits the number of photons they can gather, leading to noisy images in low light. They also have small sensor pixels, which limits the number of electrons each pixel can store, leading to limited dynamic range. We describe a computational photography pipeline that captures, aligns, and merges a burst of frames to reduce noise and increase dynamic range. Our system has several key features that help make it robust and efficient. First, we do not use bracketed exposures. Instead, we capture frames of constant exposure, which makes alignment more robust, and we set this exposure low enough to avoid blowing out highlights. The resulting merged image has clean shadows and high bit depth, allowing us to apply standard HDR tone mapping methods. Second, we begin from Bayer raw frames rather than the demosaicked RGB (or YUV) frames produced by hardware Image Signal Processors (ISPs) common on mobile platforms. This gives us more bits per pixel and allows us to circumvent the ISP's unwanted tone mapping and spatial denoising. Third, we use a novel FFT-based alignment algorithm and a hybrid 2D/3D Wiener filter to denoise and merge the frames in a burst. Our implementation is built atop Android's Camera2 API, which provides per-frame camera control and access to raw imagery, and is written in the Halide domain-specific language (DSL). It runs in 4 seconds on device (for a 12 Mpix image), requires no user intervention, and ships on several mass-produced cell phones.

1 Introduction

The main technical impediment to better photographs is lack of light. In indoor or night-time shots, the scene as a whole may provide insufficient light. The standard solution is either to apply analog or digital gain, which amplifies noise, or to lengthen exposure time, which causes motion blur due to camera shake or subject motion. Surprisingly, daytime shots with high dynamic range may also suffer from lack of light. In particular, if exposure time is reduced to avoid blowing out highlights, then insufficient light may be collected in shadowed areas. These areas can be brightened using local tonemapping, but this again amplifies noise.

Ways to gather more light include using a larger-aperture lens, optical image stabilization, exposure bracketing, or flash. However, each method is a tradeoff. If the camera is a cell phone, then it is thickness-constrained, so making its aperture larger is difficult. Such devices are also power-constrained, making it challenging to create a synthetic aperture by increasing the number of cameras [Wilburn et al. 2005; Light 2016]. Optical image stabilization allows longer exposures while minimizing camera shake blur, but it cannot control blur caused by subject motion. With exposure bracketing followed by image fusion, different parts of the fused image represent the scene at different times, which makes it hard to achieve a single self-consistent composition. The most frequent artifact caused by incorrect fusion is ghosting (figure 2a), due to the difficulty of aligning images captured at different times. Sensors that alternate exposure times between adjacent scanlines ameliorate ghosting somewhat, but sacrifice detail and make accurate demosaicking difficult. To many photographers, an on-camera flash is the least palatable option. It adds light, but can change the scene in an unpleasant way. Flash/noflash photography [Petschnigg et al. 2004] addresses this issue but is not sufficiently robust.

In this paper we describe a camera system that addresses these problems by capturing a burst of images and combining them with dynamic range compression. While algorithms for doing this are well known [Debevec and Malik 1997], building a system based on these algorithms and deploying it commercially on a mobile camera is challenging. In building our system we have found the following design principles to be important:

• Be immediate. The system must produce a photograph within a few seconds, and display it on the camera, even when the camera is not connected (wired or wirelessly). This means we cannot defer processing to a desktop computer or the cloud.
• Be automatic. The method must be parameter-free and fully automatic. Photographers should get better pictures without knowing the strategy used for capture or image processing.
• Be natural. The photographs we produce must be faithful to the appearance of the scene. In high-dynamic-range situations we must therefore limit the amount of local tone mapping we do to avoid cartoony or surrealistic images. In very low-light scenes we must not brighten the image so much that it changes the apparent illumination or reveals excessive noise.
• Be conservative. It should be possible to use this as the default picture-taking mode. This means that the photographs produced must not contain artifacts, and must always be at least as good as conventional photographs. Moreover, in extreme situations it must degrade gradually to a conventional photograph.

Given this conservative constraint, we have found the most reliable approach to burst mode photography is to capture each image in the burst with the same exposure time. In other words we do not bracket. We arrived at this unexpected protocol because of the inherent difficulty in accurately aligning images captured using different exposure times. Small exposure variation may compromise alignment due to differing levels of noise and motion blur, and large variation may render local alignment impossible if a patch is exposed with no image content visible. Recent HDR fusion methods address the challenges of varying exposure with sophisticated alignment and inpainting [Gallo and Sen 2016]. While these methods can produce compelling results, the best methods are expensive and still demonstrate occasional artifacts or physical inconsistency.

To execute our constant-exposure protocol, we choose an exposure that is low enough to avoid clipping (blowing out highlights) for the given scene. In other words we deliberately down-expose. We do this to capture more dynamic range. We also choose shorter than typical exposure times to mitigate camera shake blur, regardless of scene content [Telleen et al. 2007]. Although using lower exposures would seem to worsen noise, we offset this effect by capturing and merging multiple frames.

A second design decision arising from our conservative constraint is that we select one of the images in the burst as a “reference” frame, then align and merge into this frame those patches from other “alternate” frames where we are confident that we have imaged the same portion of the scene. Furthermore, to reduce computational complexity, we merge only a single patch from each alternate frame. Our conservative merging may cause some parts of the final image to appear noisier than others, but this artifact is seldom noticeable.

By aligning and merging multiple frames, we produce an intermediate image with higher bit depth, higher dynamic range, and reduced noise compared to our input frames. This would let us produce a high-quality (albeit underexposed) photograph merely by discarding the low-order bits. However, one of our goals is to produce natural-looking photographs even if the scene contains strong contrast. Therefore, we instead boost shadows, preserving local contrast while judiciously sacrificing global contrast. This process is called HDR tone mapping, and has been well studied [Reinhard et al. 2010]. Its effect is similar to that produced by traditional “dodging and burning” methods in print photography [Adams 1981]. We use a variant of exposure fusion [Mertens et al. 2007], because it is computationally efficient and produces natural-looking images; however, other algorithms are possible.

One challenge in writing a systems paper about a commercial computational photography system is that academic papers in this area describe only algorithms, not complete systems, and the algorithms in existing commercial systems are proprietary and not easily reverse-engineered. This situation is worse in the camera industry than in the computer graphics community, where public APIs have led to a tradition of openness and comparison [Levoy 2010]. This secrecy makes it hard for us to compare our results quantitatively with competing systems. To address this issue, we have structured this paper around an enumeration of design principles, a description of our implementation, and a sampling of our results, good and bad. We also present in supplemental material a detailed comparison with several state-of-the-art burst fusion methods [Liu et al. 2014; Dabov et al. 2007a; Adobe Inc. 2016; Heide et al. 2014], evaluating our method for aligning and merging frames in isolation from the rest of our system. Finally, we have created an archive of several thousand raw input bursts with associated output [Google Inc. 2016b], so others can improve upon or compare against our technique.

2 Overview of capture and processing

Figure 3 summarizes our capture and image processing system. It consists of a real-time pipeline (top row) that produces a continuous low-resolution viewfinder stream, and a non-real-time pipeline (bottom row) that produces a single high-resolution image.

In our current implementation the viewfinder stream is computed by a hardware Image Signal Processor (ISP) on the mobile device’s System on a Chip (SoC). By contrast the high-resolution output image is computed in software running on the SoC’s application processor. To achieve good performance this software is written in Halide [Ragan-Kelley et al. 2012]. We utilize an ISP to handle the viewfinder because it is power efficient. However, its images look different than those computed by our software. In other words, our viewfinder is not WYSIWYG.

A key enabling technology for our approach is the ability to request a specific exposure time and gain for each frame in a burst. For this we employ the Camera2 API [Google Inc. 2016a] available on select Android phones. Camera2 utilizes a request-based architecture based on the Frankencamera [Adams et al. 2010]. Another advantage of Camera2 is that it provides access to Bayer raw imagery, allowing us to bypass the ISP. As shown in figure 3 we use raw imagery in two places: (1) to determine exposure and gain from the same stream used by the ISP to produce the viewfinder, and (2) to capture the burst used to compute a high-resolution photograph. Using raw images conveys several advantages:

• Increased dynamic range. The pixels in raw images are typically 10 bits, whereas the YUV (or RGB) pixels produced by mobile ISPs are typically 8 bits. The actual advantage is less than 2 bits, because raw is linear and YUV already has a gamma curve, but it is not negligible.

• Linearity. After subtracting a black level offset, raw images are proportional to scene brightness, whereas images output by ISPs include nonlinear tone mapping. Linearity lets us model sensor noise accurately, which makes alignment and merging more reliable, and also makes auto-exposure easier.
• Portability. Merging the images produced by an ISP entails modeling and reversing its processing, which is proprietary and scene dependent [Kim et al. 2012]. By starting from raw images we can omit these steps, which makes it easier to port our system to new cameras.

In the academic literature, burst fusion methods based on raw imagery [Farsiu et al. 2006; Heide et al. 2014] are relatively uncommon. One drawback of raw imagery is that we need to implement the entire photographic pipeline, including correction of lens shading and chromatic aberration, and demosaicking. (These correction steps have been omitted from figure 3 for brevity.) Fortunately, since our alignment and merging algorithm operates on raw images, the expensive demosaicking step need only be performed once—on a single merged image, rather than on every frame in the burst.

3 Auto-exposure 

An important function of a mobile ISP is to continuously adjust exposure time, gain, focus, and white balance as the user aims the camera. In principle we could adopt the ISP’s auto-exposure, reusing the capture settings from a recent viewfinder frame when requesting our constant-exposure burst. For scenes with moderate dynamic range this strategy works well. However, for scenes with high dynamic range, the captured images may include blown highlights or underexposed subjects that cannot be recovered by later HDR tone mapping.

To address this, we develop a custom auto-exposure algorithm aware of future tone mapping, responsible for determining not only the overall exposure but also the dynamic range compression to come. Our approach for handling HDR scenes consists of three steps:

1. deliberately underexpose so that fewer pixels saturate,
2. capture multiple frames to reduce noise in the shadows, and
3. compress the dynamic range using local tone mapping.

Underexposure is a well-known approach for high dynamic range capture, popularized for digital SLRs as “expose to the right” [Martinec 2008]. What makes underexposure viable in our system is the noise reduction provided by capturing a burst. In effect we treat HDR imaging as denoising [Hasinoff et al. 2010; Zhang et al. 2010].

Given this solution our auto-exposure algorithm must determine what exposure to use (i.e., how much to underexpose), how much to compress the dynamic range, and how many frames to capture.

Underexposure as dynamic range compression For the HDR tone mapping method we use, underexposure at capture is tightly coupled with the dynamic range compression applied in processing. As described in section 6, our method operates by fusing two gamma-corrected images, an underexposed input frame and a brighter version of the same frame, where digital gain compensates for underexposure. The output of our auto-exposure algorithm can therefore be expressed as two exposure levels, a short exposure for the highlights, used to capture the scene, and a synthetic long exposure for the shadows, used in HDR tone mapping.
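
To make this coupling concrete, here is a minimal sketch (in Python/NumPy, used for illustration throughout; the function name, normalization, and gamma value are our own assumptions) of how a synthetic long exposure is derived from the short-exposure result by digital gain before the two are fused:

```python
import numpy as np

def synthetic_exposures(linear_raw, gain_long, gamma=1.0 / 2.2):
    """Derive the two gamma-corrected images that are later fused.

    linear_raw : merged linear image, normalized to [0, 1] (assumed).
    gain_long  : ratio of synthetic long to short exposure; this ratio
                 is the dynamic range compression being applied.
    """
    short = np.clip(linear_raw, 0.0, 1.0) ** gamma               # highlights
    long_ = np.clip(linear_raw * gain_long, 0.0, 1.0) ** gamma   # shadows
    return short, long_
```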

Note that if we underexpose too much our photograph will be noisy even if we merge multiple frames. We cannot capture an unlimited number since capture and merging take time and power. Furthermore, if we compress the dynamic range too much our photograph will look cartoony (see figure 2c). We therefore limit our maximum dynamic range compression (underexposure, in our method) to 8. Fortunately, as figure 4 shows, few real-world scenes require more compression than this.

Auto-exposure by example A key difficulty in choosing exposure automatically is that this choice is scene dependent. For example, it is usually acceptable to let the sun blow out, but if the scene is a sunset at the beach the sun should remain colored, and the rings of color around it should not be overexposed, even if the beach must be left dark. To address this problem we have created a database of scenes captured using traditional HDR bracketing, which we have hand-tuned to look as natural as possible when rendered using our tone mapping method. The success of this approach depends on covering every kind of scene consumers are likely to encounter. Our database contains about 5,000 scenes, collected over the course of several years, hand-labeled with two parameters corresponding to the short and long exposures yielding the best final rendition.

Given this labeled database and an input frame in raw format, we compute a descriptor of the frame and search our database for scenes that match it. The features we use for this descriptor are quantiles of the image brightness distribution, measured on a white-balanced and aggressively downsampled version of the frame. Our complete descriptor comprises four sets of 64 non-uniformly spaced quantiles, measured at two different spatial scales, and measured for both the maximum and the average of the RGB channels. This respectively helps represent exposure at different spatial frequencies and account for color clipping. In computing these quantiles, we apply a fixed weighting to favor the center of the image, and we strongly boost the weight of regions where faces are detected. We also restrict the set of candidates to examples whose luminance is within a factor of 8 of the current scene. This helps retain perception of scene brightness, avoiding, for example, unnatural day-for-night renditions.
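
The following toy sketch illustrates one such set of quantile features; the particular non-uniform spacing is a placeholder, not the tuned production choice, and the center/face weighting is omitted:

```python
import numpy as np

def exposure_descriptor(rgb_small, n_quantiles=64):
    """Toy version of the scene descriptor: brightness quantiles of a
    white-balanced, aggressively downsampled frame. The paper uses four
    sets of 64 non-uniformly spaced quantiles (two spatial scales, each
    for max- and mean-of-RGB); this sketch shows one scale."""
    luma_max = rgb_small.max(axis=-1).ravel()   # sensitive to color clipping
    luma_avg = rgb_small.mean(axis=-1).ravel()
    # Non-uniform spacing (assumed): denser sampling toward the dark end.
    q = np.linspace(0.0, 1.0, n_quantiles) ** 2
    return np.concatenate([np.quantile(luma_max, q),
                           np.quantile(luma_avg, q)])
```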

Once we have found a set of candidate matching scenes, we compute a weighted blend of our hand-tuned auto-exposure for those scenes. This blend ultimately yields two parameters: the short exposure for capture, and the long exposure to apply during tone mapping. For more detail about our implementation, see the supplement.

Exposure factorization Translating the exposure for capture into sensor settings entails factoring it into exposure time and gain (ISO setting). For this step we use a fixed schedule that balances motion blur against noise. For the brightest scenes we hold gain at its minimum level, allowing exposure times to increase up to 8 ms. Next, as scenes become darker, we hold exposure time at 8 ms and increase gain up to 4×. Finally, we increase exposure time and gain simultaneously, up to our maximums of 100 ms exposure time and 96× gain, increasing them proportionally in log space. To maximize SNR, we apply as much gain in analog form as possible [Martinec 2008]. Any gain above the limit supported by the camera sensor we apply digitally in our pipeline.
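
A sketch of this fixed schedule follows (total exposure is expressed in milliseconds at unit gain; the analog/digital split assumes a hypothetical sensor gain limit):

```python
import math

def factor_exposure(total_ms):
    """Split a metered exposure (time x gain, in ms at unit gain) into
    exposure time and gain, following the schedule in the text: time
    rises to 8 ms at unit gain, then gain rises to 4x at 8 ms, then
    both rise proportionally in log space to 100 ms and 96x gain."""
    if total_ms <= 8.0:                        # brightest scenes: no gain
        return total_ms, 1.0
    if total_ms <= 32.0:                       # hold 8 ms, raise gain to 4x
        return 8.0, total_ms / 8.0
    f = (math.log(total_ms) - math.log(32.0)) / \
        (math.log(9600.0) - math.log(32.0))    # 9600 = 100 ms * 96x
    f = min(f, 1.0)                            # clamp at the maximums
    time_ms = math.exp((1 - f) * math.log(8.0) + f * math.log(100.0))
    gain = math.exp((1 - f) * math.log(4.0) + f * math.log(96.0))
    return time_ms, gain

def split_gain(gain, max_analog=16.0):
    """Apply as much gain as possible in analog form; max_analog is a
    hypothetical sensor limit. The remainder is applied digitally."""
    analog = min(gain, max_analog)
    return analog, gain / analog
```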

Burst size In addition to determining exposure time, gain, and dynamic range compression, we must also decide how many frames to capture in a burst. The number we capture, N, is a tradeoff. In low light, or in very high dynamic range scenes where we will be boosting the shadows later, we want more frames to improve signalto-noise ratio, but they take more time and memory to capture, buffer, and process. In bright scenes, capturing 1–2 images is usually sufficient, although more images are still generally beneficial in combating camera shake blur. In practice, we limit our bursts to 2–8 images, informing the decision using our model for raw image noise (see section 5 for more detail).
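
The paper does not spell out the exact rule, but a plausible sketch informed by the noise model might simply pick the smallest burst size that reaches a target variance:

```python
import math

def choose_burst_size(sigma2, target_var, n_min=2, n_max=8):
    """Hypothetical rule: merging N frames ideally divides the noise
    variance by N, so pick the smallest N in [2, 8] that reaches a
    target variance after merging."""
    n = math.ceil(sigma2 / target_var)
    return max(n_min, min(n_max, n))
```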

Viewfinder integration Our auto-exposure algorithm builds atop the ISP-controlled viewfinder. To improve latency, we continuously analyze the raw frames captured during viewfinding; at shutter press we are already prepared with the desired burst capture settings. Although our auto-exposure runs in real-time, at about 10 ms per frame, analyzing every viewfinder frame is unnecessary, since rapid changes in scene brightness are uncommon. To save power we therefore only run auto-exposure once in every 4 frames.

One challenge for our algorithm is that for a highly HDR scene, a single ISP-controlled viewfinder frame may contain many overexposed pixels. This can make it tricky to estimate the underexposure to apply. We have experimented with continuous bracketing during viewfinding. However, differently-exposed images cannot be displayed to the user during viewfinding, and capturing them in the background disrupts the smoothness of the viewfinder. Fortunately, we have found that by ignoring clipped pixels when evaluating our matching metric we can still determine how much to underexpose from similar scenes. Our algorithm predicts exposures within 10% of the bracketing result for 87% of shots; the shots with larger variations tend to be both more strongly HDR and more forgiving to lack of precision.

4 Aligning Frames

In the context of our high-resolution pipeline, alignment consists of finding a dense correspondence from each alternate (non-reference) frame of our burst to a chosen reference frame. This correspondence problem is well-studied, with solutions ranging from optical flow [Horn and Schunck 1981; Lucas and Kanade 1981], which performs iterative optimization under assumptions of smoothness and brightness constancy, to more recent techniques that use patches or feature descriptors to construct and “densify” a sparse correspondence [Liu et al. 2011; Brox and Malik 2011], or that use image oversegmentations and directly reason about geometry and occlusion [Yamaguchi et al. 2014]. In the computer vision literature, optical flow techniques are evaluated primarily by quality on established benchmarks [Baker et al. 2011; Menze and Geiger 2015]. As a result, most techniques produce high-quality correspondences, but at a significant computational cost—at time of submission, the top 5 techniques on the KITTI optical flow benchmark [Menze and Geiger 2015] require between 1.7 and 107 minutes per Mpix in desktop environments.

Unfortunately, our strong constraints on speed, memory, and power preclude nearly all of these techniques. However, because our merging procedure (section 5) is robust to both small and gross alignment errors, we can construct a simple algorithm that meets our requirements. Much like systems for video compression [Wiegand et al. 2003], our approach is designed to strike a balance between computational cost and correspondence quality. Our alignment algorithm runs at 24 milliseconds per Mpix on a mobile device. We achieve this performance using a frequency-domain acceleration method similar to [Lewis 1995] together with careful engineering.

Reference frame selection To address blur induced by both hand and scene motion we choose the reference frame to be the sharpest frame in a subset of the burst, according to a simple metric based on gradients in the green channel of the raw input. This follows a general strategy known as lucky imaging [Joshi and Cohen 2010]. To minimize perceived shutter lag, we choose the reference frame from the first 3 frames in the burst.
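
A minimal sketch of such a selection follows; the specific gradient metric is our own stand-in (the paper does not detail it), and an RGGB mosaic is assumed:

```python
import numpy as np

def select_reference(burst_raw, candidates=3):
    """Pick the sharpest of the first few frames, scoring each frame by
    the mean absolute gradient of one green plane of the raw mosaic."""
    def sharpness(raw):
        green = raw[0::2, 1::2]               # one green plane of RGGB (assumed)
        return (np.abs(np.diff(green, axis=1)).mean() +
                np.abs(np.diff(green, axis=0)).mean())
    scores = [sharpness(f) for f in burst_raw[:candidates]]
    return int(np.argmax(scores))             # index of the reference frame
```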

Handling raw images Because our input consists of Bayer raw images, alignment poses a special challenge. The four color planes of a raw image are undersampled, making alignment an ill-posed problem. Although we could demosaic the input to estimate RGB values for every pixel, running even a low-quality demosaic on all burst frames would be prohibitively expensive. We circumvent this problem by estimating displacements only up to a multiple of 2 pixels. Displacements subject to this constraint have the convenient property that displaced Bayer samples have coincident colors. In effect, our approach defers the undersampling problem to our merge stage, where image mismatch due to aliasing is treated like any other form of misalignment. We implement this strategy by averaging 2×2 blocks of Bayer RGGB samples, so that we align downsampled 3 Mpix grayscale images instead of 12 Mpix raw images.
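
The downsampling itself is a single averaging step (assuming an RGGB mosaic with even dimensions):

```python
def bayer_to_gray(raw):
    """Average each 2x2 RGGB block, turning a 12 Mpix Bayer mosaic into
    the 3 Mpix grayscale image used for alignment."""
    return 0.25 * (raw[0::2, 0::2] + raw[0::2, 1::2] +
                   raw[1::2, 0::2] + raw[1::2, 1::2])
```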

Hierarchical alignment To align an alternate frame to our reference frame, we perform a coarse-to-fine alignment on four-level Gaussian pyramids of the downsampled-to-gray raw input. As figure 5 illustrates, we produce a tile-based alignment for each pyramid level, using the alignments from the coarser scale as an initial guess. Each reference tile's alignment is the offset that minimizes the following distance measure relating it to candidate tiles in the alternate image:

D_p(u,v) = \sum_{y=0}^{n-1} \sum_{x=0}^{n-1} \left| T(x,y) - I(x+u+u_0,\, y+v+v_0) \right|^p \qquad (1)

where T is a tile of the reference image, I is a larger search area of the alternate image, p is the power of the norm used for alignment (1 or 2, discussed later), n is the size of the tile (8 or 16, discussed later), and (u_0, v_0) is the initial alignment inherited by the tile from the coarser level of the pyramid.
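
For illustration, a direct (brute-force) evaluation of equation 1 over a small search window might look like the following; the indexing assumes (u_0, v_0) already places the tile in the alternate image's coordinates and that the search window lies inside I:

```python
import numpy as np

def tile_distance(T, I, u0, v0, radius, p=1):
    """Brute-force evaluation of D_p (equation 1). T is an n x n
    reference tile; I is the alternate image; (u0, v0) is the alignment
    inherited from the coarser pyramid level."""
    n = T.shape[0]
    D = np.empty((2 * radius + 1, 2 * radius + 1))
    for dv in range(-radius, radius + 1):
        for du in range(-radius, radius + 1):
            patch = I[v0 + dv : v0 + dv + n, u0 + du : u0 + du + n]
            D[dv + radius, du + radius] = np.sum(np.abs(T - patch) ** p)
    return D   # the arg-min of D gives the tile's displacement
```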
The model in equation 1 implies several assumptions about motion in our bursts. We assume piecewise translation, which is true in the limit as the patch approaches a single pixel, but can be a limiting assumption for larger patches. By minimizing absolute error between image patches instead of, say, maximizing normalized cross-correlation, we are not invariant to changes in brightness and contrast. However, this is not a disadvantage, because camera exposure is fixed and illumination is unlikely to change quickly over the duration of our bursts.

Upsampling the coarse alignment to the next level of the pyramid is challenging when the coarse alignment straddles object or motion boundaries. In particular, standard upsampling methods like nearest-neighbor and bilinear interpolation can fail when the best displacement for an upsampled tile is not represented in the search area around the initial guess. In our system, we address this problem by evaluating multiple hypotheses for each upsampled alignment, choosing the alignment with minimum L1 residual between the reference and alternate frames. We take as candidates the alignments of the 3 nearest coarse-scale tiles: the nearest tile, plus the next-nearest tile in each dimension. This approach is similar in spirit to SimpleFlow [Tao et al. 2012], which also uses image content to inform the upsampling.
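
A sketch of this candidate selection follows (the windowing details and in-bounds handling are assumed):

```python
import numpy as np

def upsample_alignment(candidates, T, I, tile_origin):
    """Choose among the three coarse-scale candidate displacements (the
    nearest tile plus the next-nearest in each dimension) by minimal L1
    residual between the reference tile T and the displaced tile in I."""
    y, x = tile_origin
    n = T.shape[0]
    def l1(uv):
        u, v = uv
        return np.abs(T - I[y + v : y + v + n, x + u : x + u + n]).sum()
    return min(candidates, key=l1)
```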

In our approach we make a number of heuristic decisions regarding decimation, patch size, search radius, and the choice of norm in equation 1. One crucial decision is to align differently depending on pyramid scale. In particular, at coarse scales we compute a sub-pixel alignment, minimize L2 residuals, and use a large search radius. Sub-pixel alignment is valuable at coarse scales because it increases the accuracy of initialization and allows aggressive pyramid decimation. At the finest scale of our pyramid, we instead compute pixel-level alignment, minimize L1 residuals, and limit ourselves to a small search radius. Only pixel-level alignment is needed here, as our current merging procedure cannot make use of sub-pixel alignment. More detail explaining these decisions, plus a description of how the computation of D1 can be made fast with a brute-force implementation, can be found in the supplement.


4.1 Fast subpixel L2 alignment

At coarse scales, because we use a larger search radius, naïvely computing equation 1 would be prohibitively expensive. We address this with algorithmic techniques to compute D2 more efficiently. Similar to the way normalized cross-correlation can be accelerated [Lewis 1995], the L2 version of equation 1 can be computed with a box filter and a convolution:

D_2 = \|T\|_2^2 + \mathrm{box}(I \circ I, n) - 2\, F^{-1}\!\left( F(I)^{*} \circ F(T) \right) \qquad (2)

where the first term is the sum of the squared elements of T , the second term is the squared elements of I filtered with a non-normalized box filter of size n × n (the same size as T), and the third term is proportional to the cross-correlation of I and T , computed efficiently using a fast Fourier transform. For a complete derivation, see the supplement.
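
A compact NumPy realization of equation 2 follows, using an integral image for the box filter and the FFT correlation theorem; only valid (non-wrapping) displacements are returned:

```python
import numpy as np

def d2_search(T, I):
    """Evaluate equation 2 for every valid displacement of an n x n
    tile T inside a larger search area I."""
    n = T.shape[0]
    m0, m1 = I.shape
    # Sliding n x n sums of I^2 via 2D cumulative sums (the box filter).
    S = np.zeros((m0 + 1, m1 + 1))
    S[1:, 1:] = np.cumsum(np.cumsum(I * I, axis=0), axis=1)
    box = S[n:, n:] - S[:-n, n:] - S[n:, :-n] + S[:-n, :-n]
    # Cross-correlation of I with T via the FFT (T zero-padded).
    Tp = np.zeros_like(I)
    Tp[:n, :n] = T
    corr = np.fft.ifft2(np.fft.fft2(I) * np.conj(np.fft.fft2(Tp))).real
    corr = corr[: m0 - n + 1, : m1 - n + 1]
    return np.sum(T * T) + box - 2.0 * corr    # D2 indexed by (v, u)
```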

Having computed D2 it is cheap to identify the integer displacement (\hat{u},\hat{v}) that minimizes the displacement error. To produce a subpixel estimate of motion, we fit a bivariate polynomial to the 3 × 3 window surrounding (\hat{u},\hat{v}) and find the minimum of that polynomial. This improves on the standard approach of fitting two separable functions [Stone et al. 2001] by avoiding the assumption that motion is independently constrained to the respective axes. Formally, we approximate:

D_2(u,v) \approx \frac{1}{2}\begin{bmatrix}u & v\end{bmatrix} A \begin{bmatrix}u \\ v\end{bmatrix} + b^{T}\begin{bmatrix}u \\ v\end{bmatrix} + c \qquad (3)

where A is a 2 × 2 positive semi-definite matrix, b is a 2 × 1 vector, and c is a scalar. We construct a weighted least-squares problem fitting a polynomial to the 3 × 3 patch of D2 centered around (\hat{u},\hat{v}). Solving this system is equivalent to taking the inner product of D2 with a set of six 3 × 3 filters, derived in the supplement, each corresponding to a free parameter in (A, b, c). The process is similar to the polynomial expansion approach of [Farnebäck 2002]. Once we have recovered the parameters of the quadratic, its minimum follows by completing the square:

\mu = -A^{-1}b \qquad (4)

The vector \mu represents the sub-pixel translation that must be added to our integer displacement (\hat{u}, \hat{v}).
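
An illustrative implementation of equations 3 and 4 follows, using an unweighted least-squares fit in place of the paper's six fixed 3 × 3 filters:

```python
import numpy as np

def subpixel_min(D):
    """Fit the quadratic of equation 3 to the 3 x 3 window of D2 values
    around the integer minimum (D[v+1, u+1] for u, v in {-1, 0, 1}) and
    return the sub-pixel offset mu = -A^{-1} b of equation 4."""
    uv = [(u, v) for v in (-1, 0, 1) for u in (-1, 0, 1)]
    # Columns correspond to the parameters (A11, A12, A22, b1, b2, c).
    X = np.array([[0.5 * u * u, u * v, 0.5 * v * v, u, v, 1.0]
                  for u, v in uv])
    a11, a12, a22, b1, b2, _ = np.linalg.lstsq(X, D.ravel(), rcond=None)[0]
    A = np.array([[a11, a12], [a12, a22]])
    b = np.array([b1, b2])
    mu = -np.linalg.solve(A, b)
    return np.clip(mu, -1.0, 1.0)   # keep the estimate inside the window
```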

5 Merging Frames

The key premise of burst photography is that we can realize noise reduction by combining multiple observations of the scene over time. However, to be useful in a photographic application, our merging method must be robust to alignment failures. As figure 6 shows, while alignment is important to help compensate for camera and object motion, we cannot rely on alignment alone, which can fail for a variety of reasons, such as occlusions, non-rigid motion, or changes in lighting.

With our performance goals in mind, we develop a merging method that is robust to misalignment, based on a pairwise frequency-domain temporal filter operating on the tiles of the input. In our setting, each tile in the reference is merged with one tile taken from each of the alternate frames, corresponding to the result of our alignment. Our method typically uses 16 × 16 tiles from color planes of Bayer raw input, but for very dark scenes, for which low-frequency noise can be objectionable, we use 32 × 32 tiles instead.

Our approach takes inspiration from frequency-domain video denoising techniques that operate on 3D stacks of matching images patches [Kokaram 1993; Bennett and McMillan 2005; Dabov et al. 2007a]. In particular, Kokaram [1993] proposed a variant of classic Wiener filtering in the 3D DFT domain, attenuating small coefficients more likely to be noise. V-BM3D [Dabov et al. 2007a] takes a similar approach, reinterpreting the Wiener filter and similar operators as “shrinkage” operators favoring the sparsity that is a statistical property of natural images in the transform domain. Techniques in this family are robust to misalignment because, for a given spatial frequency, any mismatch to the reference that cannot be ascribed to the expected noise level will be suppressed.

The recent Fourier burst accumulation method [Delbracio and Sapiro 2015] uses similar principles, but is more aggressive about combining frequency content across the burst, to reduce motion blur due to long exposures. At the extreme, this method consists of taking the max of each spatial frequency across the burst. We view retaining the motion blur of the reference frame as a useful feature for photography. Moreover, our combination of underexposure and lucky imaging makes unwanted motion blur less common.

While our merging method inherits the benefits of frequency-domain denoising, it departs from previous methods in several ways. First, because we process raw images we have a simple model describing noise in the image. This improves robustness by letting us more reliably discriminate between alignment failures and noise. Second, instead of applying the DFT or another orthogonal transformation in the temporal dimension, we use a simpler pairwise filter, merging each alternate frame onto the reference frame independently. While this approach sacrifices some noise reduction for well-aligned images, it is cheaper to compute and degrades more gracefully with alignment failures (see figure 7). Third, as a consequence of this filter operating only over the temporal dimension, we run spatial denoising in a separate post-processing step, applied in the 2D DFT. Fourth, we apply our filter to the color planes of Bayer raw images independently, then reinterpret the filtered result as a new Bayer image. This method is simple but surprisingly robust, in that we observe little degradation even though we are ignoring Bayer undersampling. In the following, we expand on each of these points and discuss artifacts that can result in extreme conditions.

Noise model and tiled approximation Because we operate on Bayer raw data, noise is independent for each pixel and takes a simple, signal-dependent form. In particular, for a signal level of x, the noise variance σ² can be expressed as Ax + B, following from the Poisson-distributed physical process of photon counting [Healey and Kondepudy 1994]. The parameters A and B depend only on the analog and digital gain settings of the shot, which we control directly. To validate this model of sensor noise, we empirically measured how noise varies with different signal levels and gain settings.

In the transform domain where we apply our filtering, directly using a signal-dependent model of noise is impractical, as the DFT requires representing a full covariance matrix. While this could be addressed by applying a variance stabilizing transform [Mäkitalo and Foi 2013] to the input, for computational efficiency we instead approximate the noise as signal independent within a given tile. For each tile, we compute the variance by evaluating our noise model using a single value, the root-mean-square (RMS) of the samples in the tile. Using RMS has the effect of biasing the signal estimate toward brighter image content. For low-contrast tiles, this is similar to using the mean; high-contrast tiles will be filtered more aggressively, as if they had a higher average signal level.
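
In code, the per-tile approximation is a one-liner (A and B come from the calibrated noise model):

```python
import numpy as np

def tile_noise_variance(tile, A, B):
    """Signal-independent approximation of the noise model
    sigma^2 = A*x + B, evaluated at the RMS of the tile's samples."""
    rms = np.sqrt(np.mean(tile ** 2))
    return A * rms + B
```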

Robust pairwise temporal merge Our merge method operates on image tiles in the spatial frequency domain. For a given reference tile, we assemble a set of corresponding tiles across the burst, one per frame, and compute their respective 2D DFTs as T_z(w), where w = (w_x, w_y) denotes spatial frequency, z is the frame index, and, without loss of generality, we take frame 0 to be the reference.

Where our method departs from other frequency-based denoising methods is our pairwise treatment of frames in the temporal dimension. To build intuition, a simple way to merge over the temporal dimension would be to compute the average for each frequency coefficient. This naïve averaging filter can be thought of as expressing an estimate for the denoised reference frame:

\hat{T}_0(w) = \frac{1}{N}\sum_{z=0}^{N-1} T_z(w) \qquad (5)

While this performs well when alignment is successful, it is not robust to alignment failure (see figure 6c). Because the 2D DFT is linear, this filter is actually equivalent to a temporal average in the spatial domain.

To add robustness, we instead construct an expression similar to equation 5, but incorporate a filter that lets us control the contribution of alternate frames:

\hat{T}_0(w) = \frac{1}{N}\sum_{z=0}^{N-1} \left( T_z(w) + A_z(w)\left[T_0(w) - T_z(w)\right] \right) \qquad (6)

For a given frequency, A_z controls the degree to which we merge alternate frame z into the final result versus falling back to the reference frame. The body of this sum can be rewritten as (1 - A_z) \cdot T_z + A_z \cdot T_0 to emphasize that A_z controls a linear interpolation between T_z and T_0. Since the contribution of each alternate frame is adjusted on a per-frequency basis, alignment failure can be partial, in that rejected image content for one spatial frequency will not corrupt other frequencies.

We are now left with the task of defining A_z to attenuate frequency coefficients that do not match the reference. In particular, we want T_z to contribute to the merged result when its difference from T_0 can be ascribed to noise, and for its contribution to be suppressed when it differs from T_0 due to poor alignment or other problems. In other words, A_z is a shrinkage operator. Our definition of A_z is a variant of the classic Wiener filter:

A_z(w) = \frac{|D_z(w)|^2}{|D_z(w)|^2 + c\sigma^2} \qquad (7)

where D_z(w) = T_0(w) - T_z(w), the noise variance σ² is provided by our noise model, and c is a constant that accounts for the scaling of noise variance in the construction of D_z and includes a further tuning factor (in our implementation, fixed to 8) that increases noise reduction at the expense of some robustness. The construction of D_z scales the noise variance by a factor of n² for the number of 2D DFT samples, a factor of 1/16 for the window function (described later), and a factor of 2 for its definition as a difference of two tiles. We tried several alternative shrinkage operators, such as hard and soft thresholding [Donoho 1995], and found this filter to provide the best balance between noise reduction strength and visual artifacts.
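
Putting equations 6 and 7 together, a sketch of the pairwise merge for one tile stack follows (windowing and overlap-add are omitted; sigma2 is assumed to be pre-scaled as described above):

```python
import numpy as np

def merge_tiles(tiles, sigma2, c=8.0):
    """Pairwise frequency-domain temporal merge (equations 6 and 7).
    tiles[0] is the reference tile; tiles[1:] are the aligned alternate
    tiles. For z = 0, D = 0 and A_z = 0, so the reference contributes
    itself unchanged."""
    N = len(tiles)
    T = [np.fft.fft2(t) for t in tiles]
    T0 = T[0]
    acc = np.zeros_like(T0)
    for Tz in T:
        D = T0 - Tz
        Az = np.abs(D) ** 2 / (np.abs(D) ** 2 + c * sigma2)  # shrinkage
        acc += Tz + Az * D        # equals (1 - Az) * Tz + Az * T0
    return np.fft.ifft2(acc / N).real
```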

We found our pairwise temporal operator to produce higher quality images than a full 3D DFT, particularly in the presence of alignment failure. As figure 7 illustrates, a single poorly aligned frame renders the entire DFT transform domain non-sparse, leading the shrinkage operator to reject the contribution from all of the alternate frames, not only the poorly aligned one. By contrast, our temporal operator evaluates the contribution of each alternate frame independently, letting us degrade more gracefully with alignment failure. Our temporal filtering also has the advantage of being cheaper to compute and requiring less memory. The contribution of each alternate frame can be computed and discarded before moving on to the next.

Spatial denoising Because our pairwise temporal filter above does not perform any spatial filtering, we apply spatial filtering as a separate post-processing step in the 2D DFT domain. Starting from the temporally filtered result, we perform spatial filtering by applying a pointwise shrinkage operator, of the same form as equation 7, to the spatial frequency coefficients. To be conservative, we limit the strength of denoising by assuming that all N frames were averaged perfectly. Accordingly, we update our estimate of the noise variance to be σ²/N. Consistent with classic studies of the human visual system, we found that we can filter high spatial frequency content more aggressively than lower spatial frequency content without introducing noticeable artifacts. Therefore, we apply a “noise shaping” function \tilde{\sigma} = f(w)\,\sigma, which adjusts the effective noise level as a function of w, increasing its magnitude for higher frequencies. We represent this function by defining a piecewise linear function, tuned to maximize subjective image quality rather than SNR.
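
A sketch of this post-processing step follows; the radial frequency parameterization and the preservation of the DC coefficient are our own simplifications, and f stands for the piecewise linear noise-shaping function:

```python
import numpy as np

def spatial_denoise(tile, sigma2, N, f):
    """Post-merge spatial shrinkage in the 2D DFT domain, with the
    variance reduced to sigma2 / N and shaped by f(w)."""
    n = tile.shape[0]
    F = np.fft.fft2(tile)
    wy, wx = np.meshgrid(np.fft.fftfreq(n), np.fft.fftfreq(n), indexing="ij")
    w = np.hypot(wx, wy)                  # radial spatial frequency
    var = (f(w) ** 2) * sigma2 / N        # noise-shaped variance
    shrink = np.abs(F) ** 2 / (np.abs(F) ** 2 + var)
    dc = F[0, 0]
    F = shrink * F
    F[0, 0] = dc                          # never attenuate the mean
    return np.fft.ifft2(F).real
```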

Merging Bayer raw Note that up to this point, we have presented our merging algorithm in terms of single-channel images. However, as mentioned above, both our input and output consist of Bayer-mosaicked raw images. Our design handles raw images in the simplest way possible: we merge each plane of the Bayer image independently using a common locally translational alignment, and we do not use alignment any more precise than pixel level in the Bayer color planes. Aligning to higher precision would require interpolation for both align and merge, which would increase computational cost significantly. While our approach is fast and effective, it is less sophisticated than multi-frame demosaicking algorithms (e.g., [Farsiu et al. 2006]) designed to recover high frequency content lost to Bayer undersampling.

Because Bayer color planes are undersampled by a factor of four, one might pessimistically assume that 75% of frames will be rejected on average, compromising denoising. While our robust filter will indeed reject aliased image content not fitting our noise model, this rejection only happens on a per-DFT bin basis and aliasing issues are likely to be confined to a subset of DFT bins. The same behavior can be observed in figure 6, where despite poor alignment (figure 6c, bottom), our robust temporal filter is able to significantly reduce noise without introducing any visible ghosting (figure 6d, bottom).

Overlapped tiles Our merge method operates on tiles overlapped by half in each spatial dimension. By smoothly blending between overlapped tiles, we avoid visually objectionable discontinuities at tile boundaries. Additionally, we must apply a window function to the tiles to avoid edge artifacts when operating in the DFT domain. We use a modified raised cosine window, 1/2 - 1/2 cos(2π(x + 1/2)/n) for 0 ≤ x < n, and 0 otherwise. This differs from the conventional definition: first, the denominator of the cosine argument is n, not n-1. Unlike the conventional window, when this function is repeated with n/2 samples of overlap, the total contribution from all tiles sums to one at every position. Second, the window is shifted by half to avoid zeros in the window resulting from the modified denominator. Zeros in the window correspond to pixels not contributing to the output, which implies we could have used a smaller tile size (with the associated computational savings) to achieve the same result.
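
The window and its overlap property can be checked directly:

```python
import numpy as np

def merge_window(n):
    """Modified raised cosine window: 1/2 - 1/2 cos(2*pi*(x + 1/2)/n).
    Repeated with n/2 samples of overlap, copies sum to one everywhere."""
    x = np.arange(n)
    return 0.5 - 0.5 * np.cos(2.0 * np.pi * (x + 0.5) / n)

w = merge_window(16)
total = w[8:] + w[:8]        # tail of one tile plus head of the next
assert np.allclose(total, 1.0)
```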

Artifacts We have observed several classes of artifacts resulting from this system. First, this filter tends to fail to suppress noise around strong high contrast features, as shown in figure 8. This is a result of high contrast features having a non-sparse representation in the spatial DFT domain, reducing the effectiveness of spatial denoising.

Second, because our shrinkage function never fully rejects a poorly aligned tile, mild ghosting artifacts can sometimes occur, as shown in figure 9. In our experience, these ghosting artifacts are subtle, and very often are difficult to distinguish from motion blur.

Finally, our filter can occasionally produce ringing artifacts typically associated with frequency-domain filters. While ringing is largely mitigated by our windowing approach, in challenging situations classic Gibbs phenomenon can be visible, particularly after being amplified by sharpening and other steps in our finishing pipeline. Ringing is most frequently visible in the neighborhood of poorly-aligned clipped highlights, which exhibit high spatio-temporal contrast. In our experience, ringing has a negligible visual effect for most scenes.

6 Finishing

Aligning and merging the captured Bayer raw frames produces a single raw image with higher bit depth and SNR. In practice our input is 10-bit raw and we merge to 12 bits to preserve the precision gained from merging. This image must now undergo correction, demosaicking, and tone mapping: operations that would normally be performed by an ISP, but that in our case are implemented in software and include the key additional step of dynamic range compression. In order of application, these operations are:

1. Black-level subtraction deducts an offset from all pixels, so that pixels receiving no light become zero. We obtain this offset from optically shielded pixels on the sensor.
2. Lens shading correction brightens the corners of the image to compensate for lens vignetting and corrects for spatially varying color due to light striking the sensor at an oblique angle. These corrections are performed using a low-resolution RGGB image supplied by the ISP.
3. White balancing linearly scales the four (RGGB) channels so that grays in the scene map to grays in the image. These scale factors are supplied by the ISP.
4. Demosaicking converts the image from a Bayer raw image to a full-resolution linear RGB image with 12 bits per pixel. We use a combination of techniques from Gunturk et al. [2005], including edge-directed interpolation with weighted averaging, constant-hue-based interpolation, and second-order gradients as correction terms.
5. Chroma denoising reduces red and green splotches in dark areas of low-light images. For this we use an approximate bilateral filter, implemented using a sparse 3x3-tap non-linear kernel applied in two passes in YUV.
6. Color correction converts the image from sensor RGB to linear sRGB using a 3x3 matrix supplied by the ISP.
7. Dynamic range compression: see the description below.
8. Dehazing reduces the effect of veiling glare by applying a global tone curve that pushes low pixel values even lower while preserving midtones and highlights. Specifically, we allow up to 0.1% of pixels to be clamped to zero, but only adjust pixels below 7% of the white level.
9. Global tone adjustment increases contrast and applies sRGB gamma correction by concatenating an S-shaped contrast-enhancing tone curve with the standard sRGB color component transfer function.
10. Chromatic aberration correction hides lateral and longitudinal chromatic aberration. We do not assume a lens model, but instead look for pixels along high-contrast edges and replace their chroma with that of nearby pixels less likely to be affected by chromatic aberration.
11. Sharpening uses unsharp masking, implemented using a sum of Gaussian kernels constructed from a 3-level convolution pyramid [Farbman et al. 2011].
12. Hue-specific color adjustments make blue skies and vegetation look more appealing, implemented by shifting bluish cyans and purples towards light blue and increasing the saturation of blues and greens generally.
13. Dithering avoids quantization artifacts when reducing from 12 bits per pixel to 8 bits for display, implemented by adding blue noise from a precomputed table.
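
As a rough illustration of how a few of these stages compose, the following sketch applies black-level subtraction, white balance, color correction, a dehazing-style toe, sRGB gamma, and dithered quantization. All constants, the plane layout, and the simplified stand-in for demosaicking are illustrative assumptions, not the shipped tuning:

```python
import numpy as np

def finish_sketch(raw_rggb, black_level, wb_gains, ccm, rng):
    """Toy sketch of a few finishing steps on a merged raw image.

    raw_rggb:    float32 array of shape (H, W, 4) in a 12-bit range [0, 4095],
                 planes ordered R, G, G, B (hypothetical layout).
    black_level: per-plane offsets from the sensor's shielded pixels.
    wb_gains:    per-plane white-balance scale factors from the ISP.
    ccm:         3x3 sensor-RGB -> linear-sRGB matrix from the ISP.
    """
    white = 4095.0

    # 1. Black-level subtraction: pixels receiving no light become zero.
    img = np.maximum(raw_rggb - black_level, 0.0)

    # 3. White balance: linearly scale the four planes so grays stay gray.
    img = img * wb_gains

    # (Demosaicking omitted; averaging the two greens is only a stand-in.)
    rgb = np.stack([img[..., 0], 0.5 * (img[..., 1] + img[..., 2]),
                    img[..., 3]], axis=-1)

    # 6. Color correction: 3x3 matrix to linear sRGB.
    rgb = np.clip(rgb @ ccm.T, 0.0, white)

    # 8. Dehazing-style toe: push values below 7% of white even lower while
    # preserving midtones; a real tuner would choose the curve so at most
    # 0.1% of pixels clamp to zero (the 2% pivot here is invented).
    knee = 0.07 * white
    low = rgb < knee
    rgb[low] = np.maximum((rgb[low] - 0.02 * white) *
                          (knee / (knee - 0.02 * white)), 0.0)

    # 9. Global tone adjustment: sRGB gamma (contrast S-curve omitted).
    x = rgb / white
    srgb = np.where(x <= 0.0031308, 12.92 * x, 1.055 * x ** (1 / 2.4) - 0.055)

    # 13. Dithering: add noise before quantizing 12 -> 8 bits (uniform noise
    # stands in for the precomputed blue-noise table).
    noise = rng.uniform(-0.5, 0.5, srgb.shape)
    return np.clip(np.round(srgb * 255.0 + noise), 0, 255).astype(np.uint8)
```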

Dynamic range compression For high dynamic range scenes we use local tone mapping to reduce the contrast between highlights and shadows while preserving local contrast. The tone mapping method we have chosen is a variant of exposure fusion [Mertens et al. 2007]. Given input images that depict the same scene at different brightness levels, exposure fusion uses image pyramids to blend the best-exposed parts of the input images to produce a single output image that looks natural and has fewer badly exposed areas than the inputs.

Exposure fusion is typically applied to images captured using bracketing. In our pipeline we capture multiple frames with constant exposure, not bracketing. To adapt exposure fusion to our pipeline, we derive “synthetic exposures” from our intermediate HDR image by applying gain and gamma correction to it, then fuse these as if they had been captured using bracketing. We perform these extractions in grayscale, and we create only two synthetic exposures, one short and one long. The short exposure tells us how many pixels will blow out, and becomes the overall exposure used during capture, while the ratio between the short and long exposures tells us how much dynamic range compression we are applying. Both values come from our auto-exposure algorithm.
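
A minimal sketch of the synthetic-exposure step, under the assumption that the intermediate HDR luma has been normalized to [0, 1] and that a simple power-law gamma stands in for the actual curve:

```python
import numpy as np

def synthetic_exposures(hdr_luma, short_gain, ratio, gamma=1.0 / 2.2):
    """Derive two gamma-corrected grayscale "synthetic exposures" from the
    merged linear HDR luma image.

    short_gain: gain of the short exposure (set so highlights survive).
    ratio:      long/short exposure ratio, i.e. the amount of dynamic range
                compression; both values come from auto-exposure.
    """
    short = np.clip(hdr_luma * short_gain, 0.0, 1.0) ** gamma
    long_ = np.clip(hdr_luma * short_gain * ratio, 0.0, 1.0) ** gamma
    return short, long_
```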

Fusing grayscale instead of color images, and using only two synthetic exposures, reduces computation and memory requirements. It also allows us to simplify the per-pixel blend weights compared to those in the work by Mertens et al. [2007]. In particular, we use a fixed weighting function of luma that favors moderately bright pixels. This function can be expressed as a one-dimensional lookup table. After fusing the synthetic exposures we undo the gamma correction of the resulting grayscale image and re-colorize it by copying per-pixel chroma ratios from the original linear RGB image.
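
The sketch below illustrates these two pieces: a fixed luma weighting function (a Gaussian bump centered on moderately bright values is an assumed stand-in for the shipped lookup table) and re-colorization by per-pixel chroma ratios:

```python
import numpy as np

def blend_weight(luma):
    """Fixed weighting function of gamma-corrected luma in [0, 1] that
    favors moderately bright pixels; the center and width are assumed."""
    return np.exp(-((luma - 0.6) ** 2) / (2 * 0.2 ** 2))

def recolorize(fused_luma, linear_rgb, linear_luma, eps=1e-6):
    """Re-colorize the fused grayscale result by copying per-pixel chroma
    ratios (R/Y, G/Y, B/Y) from the original linear RGB image. Assumes
    fused_luma was converted back to linear before this call."""
    ratios = linear_rgb / np.maximum(linear_luma, eps)[..., None]
    return fused_luma[..., None] * ratios
```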

7 Results

Figure 10 shows example photos taken with our system side-by-side with single-exposure photos produced by a conventional imaging pipeline. Our system almost always produces results superior to a conventional single-exposure pipeline, and in scenes with high dynamic range or low light the improvement is often dramatic: fewer blown-out highlights or crushed shadows, less noise, less motion blur, better color, sharper details, and more texture. While our results benefit from choosing a sharp reference frame, our system is robust to alternate choices; it can convert any burst to a denoised video.

For a more detailed evaluation of our system’s align and merge method, demonstrating its robustness compared to state-of-the-art burst fusion methods [Liu et al. 2014; Dabov et al. 2007a; Adobe Inc. 2016; Heide et al. 2014], please refer to the supplement.

Failure cases Despite generally good image quality, our system does fail in extreme situations. We have designed it to degrade gracefully in these situations, but we wish it were even better. Some of these situations are shown in figure 11.

In addition, if a scene’s dynamic range is so high that exposure fusion using two synthetic exposures would yield cartoony results, then we treat the scene as if its dynamic range were low and allow more pixels to blow out. Using three synthetic exposures might work better, but is expensive to compute and requires more subtle tuning of our auto-exposure database. Also, if a scene contains motion so fast that features blur despite our short exposure time, then alignment might fail, leaving excessive noise in the output photograph.

Our most serious failure mode is that at very low light levels the ISP’s autofocus and white balance estimation begin failing. Although merging and alignment may still work, the photograph might be out of focus or have a color cast. Slight casts are visible in figure 1.

Performance To make our pipelines fast enough to deploy on mobile devices, we have selected algorithms for their computational efficiency. This means avoiding non-local communication and data dependencies that preclude parallelization, consuming as little memory as possible, and employing fixed-point arithmetic wherever possible. These same concerns preclude using algorithms with global iteration (e.g., FlexISP [Heide et al. 2014]), large or dynamic spatial support (e.g., BM3D [Dabov et al. 2007b]), or expensive tone mapping (e.g., local Laplacian filters [Aubry et al. 2014]).
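
As one concrete instance of the fixed-point style mentioned above, a fractional gain can be applied to pixels with an integer multiply-and-shift instead of floating point; the Q2.14 format and rounding below are illustrative assumptions, not the shipped code:

```python
import numpy as np

def apply_gain_q14(pixels_u16, gain):
    """Apply a fractional gain (0 <= gain < 4) to 16-bit pixels using
    Q2.14 fixed-point arithmetic: quantize, multiply in a 32-bit
    accumulator, then round and shift back down, saturating at 16 bits."""
    gain_q14 = int(round(gain * (1 << 14)))        # quantize gain to Q2.14
    acc = pixels_u16.astype(np.uint32) * gain_q14  # 32-bit accumulator
    return np.minimum((acc + (1 << 13)) >> 14, 65535).astype(np.uint16)
```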

Our system has shipped on devices having 12–13 Mpix sensors, on which we capture bursts of up to 8 frames. Thus, we may be required to store and process as much as 104 Mpix per output photograph. Although we have selected algorithms for efficiency, processing this much data still requires a highly-optimized implementation. Most of our code is written in Halide [Ragan-Kelley et al. 2012], which enables us to more easily fuse pipeline stages for locality and to make use of SIMD and thread parallelism. In addition, since we compute many small 2D real DFTs for align and merge, we have implemented our own FFT in Halide. For the small DFTs in our pipeline, this implementation is five times faster than FFTW [Frigo and Johnson 2005] on an ARM-based mobile phone.
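
The DFT workload looks roughly like the following numpy sketch: many independent small 2D real FFTs over half-overlapped tiles (the tile size of 16 is an assumption), which is why an FFT specialized for small fixed sizes can beat a general-purpose library here:

```python
import numpy as np

def tiled_rfft2(channel, tile=16):
    """Compute a small 2D real FFT for every half-overlapped tile of a
    single-channel image, mimicking the align/merge workload."""
    h, w = channel.shape
    step = tile // 2
    spectra = []
    for y in range(0, h - tile + 1, step):
        for x in range(0, w - tile + 1, step):
            spectra.append(np.fft.rfft2(channel[y:y + tile, x:x + tile]))
    return spectra  # one (tile, tile // 2 + 1) complex array per tile
```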

Summarizing, on a Qualcomm Snapdragon 810 not subject to thermal throttling, the time required to produce an output photograph ranges from 2.5 to 4 seconds, depending on the number of frames in the burst. For a low-light shot taking 4 seconds, this breaks down as 1 second to capture the frames, 500 ms for alignment, 1200 ms for merging, and 1600 ms for finishing. For a daylight shot taking 2.5 seconds, we measure 100 ms for capture, 250 ms for alignment, 580 ms for merging, and 1600 ms for finishing.

8 Conclusions

In this paper we have described a system for capturing a burst of underexposed frames, aligning and merging these frames to produce a single intermediate image of high bit depth, and tone mapping this image to produce a high-resolution photograph. Our results have better image quality than single-exposure photos produced by a conventional imaging pipeline, especially in high dynamic range or low-light scenes, and almost never exhibit objectionable artifacts. The system is deployed on several mass-produced cell phones, marketed as “HDR+” in the Nexus 6, 5X, and 6P. Consumers using our system are unaware that they are capturing bursts of frames with each shutter press, or that their final photograph is generated from multiple images using computational photography.

It is difficult in a technical paper to prove our general claim of superior image quality, or to cover the range of corner cases our system handles robustly. However, our system has received positive reviews in the press, has scored higher than most competing commercial systems in independent evaluations [DxO Inc. 2015], and in millions of pictures captured by consumers each week, we have not seen disastrous results.

So that others may judge our image quality and improve on our algorithms, we have created an archive of several thousand bursts of raw images in DNG format [Google Inc. 2016b]. For each burst we include our merged raw output and final JPEG output. EXIF tags and additional files describe our camera parameters, noise model, and other metadata used to generate our results.

Limitations and future work The most significant drawback of our system is that after the user presses the shutter there is a noticeable lag before the burst begins and the reference frame is captured. Since this frame sets the composition for the photograph, it can be difficult to capture the right moment in an action scene. Some of this lag is due to our auto-exposure algorithm, some to Camera2’s software structure, and some to our use of lucky imaging, which adds a variable delay depending on which frame was chosen as the reference.

To avoid shutter lag, many mobile phones employ zero shutter lag (ZSL), in which the camera continuously captures full-resolution frames, stores them in a circular buffer, and responds to shutter press by selecting one image from this buffer to finish and store. Since focus, exposure, and white balance change continuously during aiming, handling ZSL would entail relaxing our assumption of constant-exposure bursts. This is a topic for future work.
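
A toy model of such a buffer, with a hypothetical frame type and capacity, is shown below:

```python
from collections import deque

class ZslBuffer:
    """Toy zero-shutter-lag ring buffer: the camera continuously pushes
    full-resolution frames, and a shutter press selects recent frames
    from the buffer. The capacity and frame type are assumptions."""

    def __init__(self, capacity=8):
        self.frames = deque(maxlen=capacity)  # oldest frames drop automatically

    def push(self, frame):
        self.frames.append(frame)

    def on_shutter(self):
        # Return the newest frame plus its predecessors as a burst candidate.
        return list(self.frames)
```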

Another limitation of our system is that computing the output photograph takes several seconds and occupies a significant amount of memory until it finishes. If the user presses the shutter several times in rapid succession, we can easily run out of memory, causing the camera app to stall. Programmable hardware might solve this problem, but incorporating such hardware into mass-produced mobile devices is not easy.

Finally, the discrepancy between our ISP-generated viewfinder and software-generated photograph produces a non-ideal user experience. In extreme situations, a user might abandon photographing a scene because it looks poor in the viewfinder, when in fact our software would produce a usable photograph of that scene. Programmable hardware might solve this problem as well.

Acknowledgements

Integrating our system with the Google Camera app would not have been possible without close collaboration with the Android camera team. We thank them for their product and engineering contributions on HDR+, and also for their valuable feedback on image quality. Special thanks to the authors of [Liu et al. 2014] and [Heide et al. 2014] for their help with experimental comparisons. We also thank Peyman Milanfar for helpful discussions and the anonymous reviewers for their feedback on the paper.
