Depth perception is paramount to tackle real-world problems, ranging from autonomous driving to consumer applications. For the latter, depth estimation from a single image represents the most versatile solution, since a standard camera is available on almost any handheld device. Nonetheless, two main issues limit its practical deployment: i) the low reliability when deployed in the wild and ii) the demanding resource requirements needed to achieve real-time performance, often not compatible with such devices. Therefore, in this paper, we investigate these issues in depth, showing that both can be addressed by adopting appropriate network design and training strategies, and outlining how to map the resulting networks on handheld devices to achieve real-time performance. Our thorough evaluation highlights the ability of such fast networks to generalize well to new environments, a crucial feature required to tackle the extremely varied contexts faced in real applications. To further support this evidence, we report experimental results concerning real-time, depth-aware augmented reality and image blurring with smartphones in the wild.
Depth perception is an essential step to tackle real-world problems such as robotics and autonomous driving, and some well-known sensors exist for this purpose. Among them, active sensing techniques such as Time-of-Flight (ToF) or LiDAR are often deployed in the application domains mentioned before. However, they struggle with typical consumer applications, since ToF is mostly suited for indoor environments. At the same time, conventional LiDAR technology, frequently used for autonomous driving and other tasks, is too cumbersome and expensive to be deployed in relatively cheap and lightweight consumer handheld devices. Nonetheless, it is worth noting that active sensing technologies, mostly suited for indoor environments, are sometimes integrated into high-end devices, as occurs, for instance, with the 2020 Apple iPad Pro.
Therefore, camera-based technologies are often the only viable strategy to infer depth in consumer applications, and among them, well-known methodologies are structured light and stereo vision. The former typically requires an infrared camera and a specific pattern projector, making it unsuitable for environments flooded by sunlight. The latter requires two appropriately spaced and synchronized cameras. Some recent high-end smartphones or tablets feature one or both technologies, although they are not yet widespread enough to be considered standard equipment. Moreover, for stereo systems, the distance between the cameras (baseline) is necessarily narrow, limiting the usable depth range to a few meters.
On the other hand, with the advent of deep learning, recent years have witnessed the rise of a further strategy to infer depth using a single standard camera, available in substantially any consumer device. Compared to previous technologies for depth estimation, such an approach would overcome all the limitations mentioned before. Nonetheless, single image depth estimation is seldom deployed in consumer applications for two main reasons. The first one concerns the low reliability of state-of-the-art methods when estimating depth in the wild, i.e. in unpredictable environments not seen at training time, as necessarily occurs when targeting a massive amount of users facing heterogeneous environments. The second reason concerns the constrained computational resources available in handheld devices, such as smartphones or tablets, deployed for consumer applications. In fact, despite the steady progress in this field, the gap with systems leveraging high-end GPUs is (and always will be) significant, since power requirements heavily constrain handheld devices. Despite the limited computational resources available, most consumer applications need real-time performance.
Given these facts, in this paper we investigate in depth both the issues outlined so far. In particular, we show how to tackle them by leveraging appropriate network design approaches and training strategies, and we outline how to map the resulting networks on off-the-shelf consumer devices to achieve real-time performance, as shown in Figure 1. Indeed, our extensive evaluation highlights the ability of the resulting networks to robustly generalize to unseen environments, a crucial feature to tackle the heterogeneous contexts faced in real consumer applications.
The paper is organized as follows. At first, we present previous works about monocular depth estimation on high-performance and mobile devices; then we describe the framework that allows us to train a model, even a lightweight one, to be robust when deployed in the wild. We then evaluate a set of monocular networks already proposed in the literature on three benchmarks, deploying such models on mobile smartphones. Finally, we report how enabling real-time single image depth estimation at the edge can be effectively exploited to tackle two well-known applications: depth-aware augmented reality and image blurring.
Monocular depth estimation. Although depth estimation from multiple views has a long history in computer vision, depth from a single image [2], [3] started to be considered feasible only with the advent of deep learning techniques. Indeed, obtaining depth values from a single image is an ill-posed problem, since an infinite number of real-world layouts may have generated the input image. However, learning-based methods, and in particular deep learning, proved adequate to face the problem. Supervised approaches [4], [5] were proposed first. However, the need for ground truth measurements represents a severe constraint, since an active sensor, such as a LiDAR, together with specific manual post-processing is required to obtain such data. Therefore, effective deployment of supervised approaches is burdensome both in terms of time and cost. To overcome this limitation, solutions able to learn depth supervised only by images are incredibly appealing, and nowadays several methods rely on simple stereo pairs or monocular video sequences for training. Among these, [6] is the first notable attempt leveraging stereo pairs, eventually improved by exploiting traditional stereo algorithms [7], [8], visual odometry supervision [9], [10] or 3D movies [11]. On the other hand, methods leveraging monocular videos do not even require a stereo camera at training time, at the cost of learning depth estimation only up to a scale factor. Starting with Zhou et al. [12], who proposed to learn both depth and camera ego-motion, more recent methods apply Direct Visual Odometry [13] or ICP [14] strategies to improve predictions. Additional cues have been exploited by more recent works, such as optical flow [15]–[18] or semantic segmentation [19]. Finally, it is worth noting that the two forms of supervision coming from stereo images and monocular sequences can be combined [20].
Depth estimation on mobile systems. Mobile devices are ubiquitous, and deep learning opened up many application scenarios [21]. Although server-side inference is sometimes unavoidable, keeping the computation on-board is highly beneficial: it avoids privacy issues and the need for tailored datacenters, reducing costs and improving scalability. Moreover, although not critical for most consumer scenarios, full on-board processing does not require an internet connection. Despite computing capabilities limited mostly by power consumption constraints and typically not comparable to those available in standard laptops or PCs, some authors proposed deep networks for depth estimation suited for mobile devices too. These works targeted stereo [22] and monocular [1], [23] setups. Moreover, some authors proposed depth estimation architectures tailored for specific hardware setups, such as those based on the dual-pixel sensors available in some recent Google smartphones, as reported in [24], [25].
The availability of more and more powerful devices paves the way for complex and immersive applications, in which users can interact with the nearby environment. As a notable example, augmented reality can be used to display interactive tools or concepts, avoiding the need to build a real prototype and thus cutting costs. For this and many other applications, obtaining accurate depth information at a high frame rate is paramount to further enhance the interaction with the surrounding environment, even on devices devoid of active sensors. Almost any modern handheld device features at least a single camera and an integrated CPU within, typically, an ARM-based system-on-chip to cope with the constrained energy budget of such devices. Sometimes, especially in the newest ones, a Neural Processing Unit (NPU) devoted to accelerating deep neural networks is also available. Inevitably, the resulting overall computing performance is far from conventional PC-based setups, and the availability of an NPU only partially fills this gap. Given these constraints, single image depth perception would be rather appealing since it could seamlessly deal with dynamic contexts, whereas other techniques such as structure from motion (SfM) would struggle. However, these techniques are computationally demanding, and most state-of-the-art approaches would not fit the computational resources available in handheld devices. Moreover, regardless of the computing requirements, training the networks for predictable target environments is not feasible for consumer applications; thus, the depth estimation network must be robust to any deployment scenario it may face and possibly invariant to the training data distribution. A client-server approach would soften some of the computational issues, although with notable disadvantages: the need for an internet connection and poor scaling of the overall system as the number of users increases.
To get rid of all the issues mentioned above and to deal with practical applications, we will describe next how to achieve real-time and robust single image depth perception on low-power architectures found in off-the-shelf handheld devices.
In this section, we introduce our framework aimed at enabling single image depth estimation in the wild with mobile devices, devoting specific attention to iOS and Android systems. Before the actual deployment on the target handheld device, our strategy requires an offline training procedure, typically carried out on power-unconstrained devices such as a PC equipped with a high-end GPU. In the remainder, we discuss the training methodology, leveraging knowledge distillation, deployed to achieve our goal in a limited amount of time, and the dataset adopted for this purpose. Another critical component of our framework is a lightweight network enabling real-time processing on the target handheld devices. Purposely, we introduce and thoroughly assess the performance of state-of-the-art networks fitting this constraint.
As for most learning-based monocular depth estimation models, our proposal is trained offline on standard workstations equipped with one or more GPUs, or through cloud processing services. In principle, depending on the training data available, one can leverage different training strategies: supervised, semi-supervised or self-supervised paradigms. Moreover, as done in this paper, cheaper and better-scaling supervision can conveniently be obtained from another network through a teacher-student scheme, leveraging knowledge distillation to avoid the need for expensive ground truth labels.
When a large enough dataset providing ground truth labels inferred by an active sensor is available, such as [26], [27], a (semi-)supervised fashion is certainly valuable since it enables, among other things, to disambiguate difficult regions (e.g. texture-less regions such as walls). Unfortunately, large datasets with depth labels are often not available, or extremely costly and cumbersome to obtain. When this condition is not met, self-supervised paradigms enable training with (potentially) countless examples, at the cost of a more challenging training setup and typically less accurate results. Note that, depending on the dataset, a strong depth prior can be distilled even if depth labels provided by an active sensor are not available. For instance, [7], [8] exploit depth values from a stereo algorithm, while [28] relies on an SfM pipeline. Finally, supervision can be distilled from other networks as well, for the stereo [29] and monocular [30] setups. The latter is the strategy followed in this paper. Specifically, we use as teacher the MiDaS network proposed in [11]. This strategy allows us to speed up the training procedure of the considered lightweight networks significantly, since doing it from scratch according to the methodology proposed in [11] would take much longer (weeks rather than days), being mostly bound by proxy label generation. Moreover, it is worth noting that, given a reliable teacher network pre-trained in a semi- or self-supervised manner, such as [11], it is straightforward to distill an appropriate training dataset, since any collection of images is potentially suited to this aim. We describe next the training dataset used for our experiments, made of a collection of single images belonging to well-known popular datasets.
Once the training paradigm has been outlined, the next issue concerns the choice of a network capable of learning from the teacher how to infer meaningful depth maps and, at the same time, able to run in real-time on the target handheld devices. Unfortunately, only the few networks described next potentially fulfil these requirements, in particular considering the ability to run in real-time on embedded systems.
Once a suitable network has been identified and trained, mapping it on a mobile device is nowadays quite easy. In fact, various tools exist that, starting from a deep learning framework such as PyTorch [31] or TensorFlow [32], can export, optimize (e.g. perform weight quantization) and execute models, even leveraging the mobile GPU [33], on the principal operating systems (OS). In some cases, the target OS exposes utilities and tools to improve performance further. For instance, starting from iOS 13, neural networks deployed on iPhones can use the GPU or even the Apple Neural Engine (ANE) thanks to Metal and Metal Performance Shaders (MPS), thus largely improving runtime performance. We discuss in the next section how to map the networks on iOS and Android devices using TensorFlow and PyTorch as high-level development frameworks.
According to the previous discussion, only a subset of the state-of-the-art single image depth estimation networks fits our purposes. Specifically, we consider the following publicly available lightweight architectures: PyDNet [1], DSNet [19] and FastDepth [23]. Moreover, we also include a representative example of a large state-of-the-art network, MonoDepth2, proposed in [20]. It is worth noticing that other, more complex state-of-the-art networks, such as [7], could be deployed within the proposed framework. However, this might come at the cost of a higher execution time on the embedded device and, potentially, of additional overhead for the developer in case of custom layers not directly supported by the mobile executor (e.g., the correlation layer used in [7]).
MonoDepth2. An architecture deploying a ResNet encoder, proposed initially in [34], made of 18 feature extraction layers and shrinking the input by a factor of $\frac{1}{32}$. The dense layers are then replaced by a decoder module, able to restore the original input resolution and output an estimated depth map. At each level of the decoder, $3 \times 3$ convolutions with skip connections are performed, followed by a $3 \times 3$ convolution layer in charge of depth estimation. The resulting network can predict depths at different scales and counts 14.84M parameters. It is worth noticing that in our evaluation we do not rely on ImageNet [35] pre-training for the encoder, for fairness towards the other architectures, which are not pretrained at all.
PyDNet. This network, proposed in [1], features a pyramidal encoder-decoder design able to infer depth maps from a single RGB image. Thanks to its small size and design choices, PyDNet can run on almost any device, including low-power embedded platforms [36] such as the Raspberry Pi 3. In particular, the network exploits 6 layers to reduce the input resolution to $\frac{1}{64}$, restored in the depth domain by 5 layers in the decoder. Each layer in the decoder applies $3 \times 3$ convolutions with 96, 64, 32, 8 feature channels, followed by a $3 \times 3$ convolution in charge of depth estimation, as sketched below. Notice that, to keep resources and inference time low, the top prediction of PyDNet is at half resolution, so the final depth map is obtained through an upsampling operation. We adopt the mobile implementation provided by the authors, publicly available online$^{1}$, which differs from the paper network by small changes (e.g. transposed convolutions have been replaced by upsampling and convolution blocks). The network counts 1.97M parameters.
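For concreteness, below is a minimal PyTorch sketch of a single PyDNet-style decoder level, with the $3 \times 3$ convolutions at 96, 64, 32 and 8 channels followed by the $3 \times 3$ depth-estimation convolution described above; layer names, the LeakyReLU activation and the sigmoid output are illustrative assumptions rather than the authors' reference implementation.

```python
import torch
import torch.nn as nn

class PyramidDecoderLevel(nn.Module):
    """One decoder level in the spirit of PyDNet: a stack of 3x3 convolutions
    with 96, 64, 32 and 8 channels, followed by a 3x3 convolution predicting
    (normalized inverse) depth at this scale. Details are illustrative."""

    def __init__(self, in_channels):
        super().__init__()
        layers, prev = [], in_channels
        for c in (96, 64, 32, 8):
            layers += [nn.Conv2d(prev, c, kernel_size=3, padding=1),
                       nn.LeakyReLU(0.2, inplace=True)]
            prev = c
        self.features = nn.Sequential(*layers)
        self.depth = nn.Conv2d(prev, 1, kernel_size=3, padding=1)

    def forward(self, x):
        feat = self.features(x)
        # Sigmoid keeps the prediction in [0, 1], matching the normalized
        # inverse depth expected by the training framework used later on.
        return feat, torch.sigmoid(self.depth(feat))
```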
FastDepth. Proposed by Wofk et al. [23], this network can infer depth predictions at 178 fps on an NVIDIA Jetson TX2 GPU. This notable speed is the result of design choices and optimization steps. Specifically, the encoder is a MobileNet [21], thus suited for execution on embedded devices. The decoder consists of 6 layers, each one with a depthwise separable convolution and with skip connections starting from the encoder (in this case, features are combined by addition), as in the sketch below. However, it is worth observing that the highest frame rate previously reported is achievable only by exploiting both pruning [37] and hardware-specific optimization techniques. In this paper, we do not rely on such strategies for fairness with the other networks. The network counts 3.93M parameters.
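As an illustration of the decoder building block mentioned above, the following sketch implements a depthwise separable convolution followed by upsampling and an additive skip connection; the $5 \times 5$ depthwise kernel, the batch normalization layers and the nearest-neighbour upsampling are assumptions made for the sake of the example, not the exact FastDepth layers.

```python
import torch.nn as nn

class DepthwiseSeparableUpBlock(nn.Module):
    """Depthwise separable convolution (depthwise + pointwise) followed by
    nearest-neighbour upsampling; encoder skip features are added. Kernel
    size and normalization layers are illustrative assumptions."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, kernel_size=5, padding=2,
                      groups=in_channels, bias=False),                       # depthwise
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),  # pointwise
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="nearest"),
        )

    def forward(self, x, skip=None):
        out = self.block(x)
        return out if skip is None else out + skip   # features combined by addition
```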
DSNet. This architecture is part of $\Omega$Net [19], an ensemble of networks predicting not only the depth of the scene starting from a single view, but also the semantic segmentation, the camera intrinsic parameters and, if two frames are provided, the optical flow. In our evaluation we consider only the depth estimation network DSNet, inspired by PyDNet, which contains a feature extractor able to decrease the resolution by $\frac{1}{32}$, followed by 5 decoding layers able to infer depth predictions starting from the current features and the previous depth estimate. In the original architecture, the last decoder also predicts per-pixel semantic labels through a dedicated layer, removed in this work. With this change, the network counts 1.91M parameters, 0.2M fewer than the original model.
In our evaluation, we use four datasets. At first, we rely on the KITTI dataset to assess the performance of the four networks when trained with the standard self-supervised paradigm typically deployed in this field [20]. Then, we retrain the four networks from scratch using the paradigm previously outlined, distilling proxy labels by means of the pre-trained MiDaS network [11] made available by its authors. For this task, we use a novel dataset, referred to as WILD, described next. We then evaluate the networks trained according to this methodology on the TUM RGBD [38] and NYUv2 [39] datasets to assess their generalization capability.
KITTI. The KITTI dataset [40] contains 61 scenes collected by a moving car equipped with a LiDAR sensor and a stereo rig. Following [20], we select a split of 697 images for testing, while 39810 and 4424 images are used for preliminary training and validation, respectively. Moreover, we use it to assess the generalization capability of the networks trained in the wild during the second part of our evaluation.
WILD. The WILD dataset ($W$), introduced in this paper, consists of a mixture of the Microsoft COCO [41] and OpenImages [42] datasets. Both datasets contain a large number of internet photos and do not provide depth labels. Moreover, since neither video sequences nor stereo pairs are available, they are not suited for conventional self-supervised guidance methods (e.g. SfM or stereo algorithms). On the other hand, they cover a broad spectrum of real-world situations, allowing us to face both indoor and outdoor environments and to deal with everyday objects and various depth ranges. We select almost 447000 frames for training purposes. Then, we distill the supervision required by our networks with the robust monocular architecture proposed in [11], using the publicly available weights. We point out once again that our supervision protocol has been carefully chosen mostly for practical reasons: it takes a few days to distill the WILD dataset by running MiDaS (using the publicly available checkpoints) on a single machine. On the contrary, obtaining the same data used to train the network in [11] would require an extremely intensive effort. Doing so, we can scale better: since we trust the teacher, we could, in principle, source knowledge from various and heterogeneous domains on the fly. Of course, the major drawbacks of this approach are evident: we need an already available and reliable teacher, and the accuracy of the student is bounded by that of the teacher. However, we point out that the training scheme proposed in [11] is general, so it can also be applied in our case, and that we already expect a margin with respect to state-of-the-art networks due to the lightweight size of the mobile architectures considered. For these reasons, we believe that our approach is well suited to obtain a fast prototype, which can later be improved by leveraging other techniques if needed. This belief is supported by the experimental results presented later in the paper.
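As a sketch of how such proxy labels can be distilled, the snippet below runs the publicly released MiDaS weights, fetched through the intel-isl/MiDaS torch.hub entry point, over a folder of images and stores the predicted inverse depth maps; the folder layout and file naming are hypothetical.

```python
import glob
import cv2
import numpy as np
import torch

# Hedged sketch: generate proxy inverse-depth labels for the WILD images with
# the released MiDaS weights, fetched here through torch.hub.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS").to(device).eval()
transform = torch.hub.load("intel-isl/MiDaS", "transforms").default_transform

for path in glob.glob("wild/images/*.jpg"):                 # hypothetical folder
    img = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2RGB)
    with torch.no_grad():
        prediction = midas(transform(img).to(device))
        # Bring the prediction back to the input resolution.
        prediction = torch.nn.functional.interpolate(
            prediction.unsqueeze(1), size=img.shape[:2],
            mode="bicubic", align_corners=False).squeeze()
    np.save(path.replace(".jpg", "_proxy.npy"), prediction.cpu().numpy())
```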
TUM RGBD. The TUM RGBD (3D Object Reconstruction category) dataset [38] contains indoor sequences framing people and furniture. We adopt the same split of 1815 images used in [28] for evaluation purposes only.
NYUv2. The NYUv2 dataset [39] is an indoor RGBD dataset acquired with a Microsoft Kinect device. It provides more than 400k raw depth frames and 1449 densely labelled frames. As for the previous dataset, we adopt the official test split, containing 649 images, for generalization tests.
Since we aim at mapping single image depth estimation networks on handheld devices, we briefly outline here the steps required to carry out this task.
As depicted in Figure 2, different tools are available according to both the deep learning framework and the target OS. With TensorFlow models, weights are processed using the tf-coreml converter in case of iOS deployment, or the TensorFlow Lite converter when targeting an Android device. On the other hand, when starting from PyTorch models, a further intermediate step is required. In particular, the stored weights are converted into the Open Neural Network Exchange format (ONNX), a common representation allowing architectures implemented in a source framework to be ported into a different target framework. This conversion is seamless if the networks consist of standard layers, while it is not straightforward in case of custom modules (e.g., correlation layers as in monoResMatch [7]). Although these tools typically enable weight quantization during the conversion to the target environment, we refrain from applying quantization to maintain the original accuracy of each network. In the experimental results section, we will provide the execution time of the networks mapped on mobile devices following the porting strategy outlined so far.
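The two conversion paths can be sketched as follows; the model, resolution and file names are placeholders, and the CoreML step (tf-coreml or an ONNX-to-CoreML converter) would then consume the exported artifacts on the iOS side.

```python
import tensorflow as tf

# TensorFlow path (Android): export a SavedModel to TensorFlow Lite.
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")  # placeholder
# No converter.optimizations are set: weight quantization is skipped on purpose,
# so the converted model keeps the accuracy of the original network.
with open("network.tflite", "wb") as f:
    f.write(converter.convert())

# PyTorch path (iOS): export to ONNX first; a CoreML converter then takes over.
import torch

model = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1)  # stand-in for a trained network
dummy = torch.randn(1, 3, 384, 640)                       # example 640x384 input
torch.onnx.export(model, dummy, "network.onnx",
                  input_names=["image"], output_names=["inverse_depth"])
```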
In this section, we thoroughly assess the performance of the considered networks on standard datasets deployed in this field. At first, since, differently from the other methods, FastDepth [23] was not originally evaluated on KITTI, we carry out a preliminary evaluation of all the networks on such a dataset. Then, we train the considered networks from scratch on the WILD dataset according to the framework outlined above, evaluating their generalization ability. Finally, we show how to take advantage of the depth maps inferred by such networks in two applications particularly relevant for mobile devices.
At first, we investigate the accuracy of the considered networks on the KITTI dataset. Since the models have been developed with different frameworks (PyDNet in TensorFlow, the other two in PyTorch) and trained on different datasets (FastDepth on NYUv2 [39], the others on the Eigen [4] split of KITTI [40]), we implement all the networks in PyTorch. This strategy allows us to adopt the same self-supervised protocol proposed in [20] to train all the models. This choice is suited for the KITTI dataset since it exploits stereo sequences, enabling the best accuracy to be achieved. Given two images $I$ and $I^{\dagger}$, with known intrinsic parameters ($K$ and $K^{\dagger}$) and relative pose of the cameras $(R, T)$, the network predicts a depth map $\mathcal{D}$ allowing to reconstruct the reference image $I$ from $I^{\dagger}$, so:
$$\hat{I}=\omega\left(I^{\dagger}, K^{\dagger}, R, T, K, \mathcal{D}\right)$$
where $\omega$ is a differentiable warping function.
Then, the difference between $\hat{I}$ and $I$ can be used to supervise the network, thus improving $\mathcal{D}$ without any ground truth. The loss function used in [20] is composed of a photometric error term $p_{e}$ and an edge-aware regularization term $\mathcal{L}_{s}$:
$$p_{e}(I, \hat{I})=\alpha \frac{1-\operatorname{SSIM}(I, \hat{I})}{2}+(1-\alpha)\|I-\hat{I}\|_{1}$$
$$\mathcal{L}_{s}=\left\|\delta_{x} \mathcal{D}_{t}^{*}\right\|_{1} e^{-\left\|\delta_{x} I_{t}\right\|_{1}}+\left\|\delta_{y} \mathcal{D}_{t}^{*}\right\|_{1} e^{-\left\|\delta_{y} I_{t}\right\|_{1}}$$
where SSIM is the structural similarity index [43], while $\mathcal{D}^{*}=\mathcal{D} / \overline{\mathcal{D}}$ is the mean-normalized inverse depth proposed in [44]. We adopt the M configuration of [20] to train all the models. Doing so, given the reference image $I_{t}$, at training time we also need $\{I_{t-1}, I_{t+1}\}$, respectively the previous and the next frames in the sequence, to leverage the supervision from monocular sequences as well. Purposely, a pose network is trained to estimate the relative poses between the frames in the sequence, as in [20]. Moreover, per-pixel minimum and auto-masking strategies are used to preserve sharp details: the former selects the best $p_{e}$ among multiple views according to occlusions, while the latter helps to filter out pixels that do not change between frames (e.g. scenes with a non-moving camera, or dynamic objects moving at the same speed as the camera), thus breaking the moving-camera-in-a-stationary-world assumption (more details are provided in the original paper [20]). Finally, intermediate predictions, when available, are upsampled and optimized at input resolution.
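The two terms above can be sketched in PyTorch as follows, using the $3 \times 3$ average-pooling SSIM formulation commonly adopted in self-supervised depth estimation; this is an illustration of the loss just defined, not the training code of [20].

```python
import torch
import torch.nn.functional as F

def ssim(x, y):
    """Simple 3x3 average-pooling SSIM, as commonly used in self-supervised depth."""
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return torch.clamp(num / den, 0, 1)

def photometric_error(img, img_hat, alpha=0.85):
    """p_e = alpha * (1 - SSIM) / 2 + (1 - alpha) * |I - I_hat|, per pixel."""
    l1 = (img - img_hat).abs().mean(1, keepdim=True)
    return alpha * (1 - ssim(img, img_hat)).mean(1, keepdim=True) / 2 + (1 - alpha) * l1

def smoothness_loss(disp, img):
    """Edge-aware regularization on the mean-normalized inverse depth."""
    disp = disp / (disp.mean(dim=(2, 3), keepdim=True) + 1e-7)
    dx_d = (disp[:, :, :, :-1] - disp[:, :, :, 1:]).abs()
    dy_d = (disp[:, :, :-1, :] - disp[:, :, 1:, :]).abs()
    dx_i = (img[:, :, :, :-1] - img[:, :, :, 1:]).abs().mean(1, keepdim=True)
    dy_i = (img[:, :, :-1, :] - img[:, :, 1:, :]).abs().mean(1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()
```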
Considering that all the models have been trained with different configurations on different datasets, we re-train all the architectures exploiting the training framework of [20] for a fair comparison. Specifically, we run 20 epochs of training for each model on the Eigen train split of KITTI, decimating the learning rate after 15 epochs. We use the Adam optimizer [45], with an initial learning rate of $10^{-4}$, and minimize the loss over the highest three available scales for all the networks except FastDepth, which provides full-resolution (i.e. $640 \times 192$) predictions only. Since the training framework expects a normalized inverse depth as the output of the network, we replace the last activation of each architecture (if present) with a sigmoid.
Table 1 summarizes the experimental results of the models tested on the Eigen split of KITTI. The top four rows report the results provided in the original papers, when available, while the last three report the accuracy of the models re-trained within the framework described so far. This test allows evaluating the potential of each architecture in fair conditions, regardless of the specific practices, advanced tricks or pre-training deployed in the original works. Not surprisingly, the larger MonoDepth2 model performs better than the three lightweight models, showing non-negligible margins on each evaluation metric when trained in fair conditions. Among the latter, although their performance is comparable, PyDNet turns out more effective than FastDepth and DSNet on most metrics, such as RMSE and $\delta<1.25$.
Figure 3 shows some qualitative results, enabling us to compare depth maps estimated by the four networks considered in our evaluation on a single image from the Eigen test split.
In the previous section, we assessed the performance of the considered lightweight networks on a data distribution similar to the training one. Unfortunately, this circumstance is seldom found in most practical applications, and typically it is not known in advance where a network will be deployed. Therefore, how can reliable depth maps be achieved in the wild? In Figure 4 we report some qualitative results of the original pre-trained networks on different scenarios. Notice that the first two networks have strong constraints on the input size ($224 \times 224$ for [23], $1024 \times 320$ for [20]) that they internally apply, imposed by how these models have been trained in their original context. Despite this limitation, FastDepth (second column) can predict a meaningful result in an indoor environment (first row), but not outdoors (second row). This is not surprising, since the network was trained on NYUv2, which is an indoor dataset. MonoDepth2 [20] suffers from the same problem, highlighting that this issue is not related to the network size (smaller for the first, larger for the second) or the training approach (supervised for the first, self-supervised for the second), but rather to the training data. Conversely, MiDaS by Ranftl et al. [11] is effective in both situations. Such robustness comes from a mixture of datasets, collecting about 2M frames covering many different scenarios, used to train a large (105M parameters) and very accurate monocular network.
We leverage this latter model to distill knowledge and train the lightweight models compatible with mobile devices. As mentioned before, this strategy allows us to use MiDaS knowledge for faster training data generation compared to the time-consuming pipelines used to train it, such as COLMAP [47], [48]. Moreover, it allows us to generate additional training samples, and thus a much more scalable training set, potentially from any (single) image. Therefore, in order to train our networks on the WILD dataset, we first generate proxy labels with MiDaS for each training image of this dataset. Then, given such proxy labels, we train the networks using the following loss function:
$$\mathcal{L}\left(D_{x}^{s}, D_{gt}\right)=\alpha_{l}\left\|D_{x}^{s}-D_{gt}\right\|+\alpha_{s} \mathcal{L}_{g}\left(D_{x}^{s}, D_{gt}\right)$$
where $\mathcal{L}_{g}$ is the gradient loss term defined in [28], $D_{x}^{s}$ is the prediction of the network at scale $s$ (bilinearly upsampled to full resolution) and $D_{gt}$ is the proxy depth. The weight $\alpha_{s}$ depends on the scale $s$ and is halved at each lower scale. On the contrary, $\alpha_{l}$ is fixed and set to 1. Intuitively, the $L^{1}$ norm penalizes differences w.r.t. the proxies, while $\mathcal{L}_{g}$ helps to preserve sharp edges. We train the models for 40 epochs, halving the learning rate after 20 and 30, with a batch size of 12 images and an input size of $640 \times 320$. We set the initial value of $\alpha_{s}$ to 0.5 for all networks except FastDepth, for which it is set to 0.01. Additionally, for MonoDepth2 and FastDepth, feature upsampling through the nearest-neighbour operator in the decoder has been replaced with bilinear interpolation. These changes were necessary to mitigate some checkerboard artefacts found in the depth estimations inferred by these networks following the training procedure outlined.
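A possible PyTorch sketch of this objective is reported below: an L1 term against the proxy labels plus a multi-scale gradient-matching term in the spirit of [28], with $\alpha_{s}$ halved at each lower scale; the number of gradient scales and the normalization details are assumptions.

```python
import torch.nn.functional as F

def gradient_loss(pred, proxy, scales=4):
    """Multi-scale gradient matching in the spirit of [28]: penalize gradients
    of the difference between prediction and proxy at several resolutions."""
    loss = 0.0
    for s in range(scales):
        step = 2 ** s
        diff = pred[:, :, ::step, ::step] - proxy[:, :, ::step, ::step]
        dx = (diff[:, :, :, :-1] - diff[:, :, :, 1:]).abs()
        dy = (diff[:, :, :-1, :] - diff[:, :, 1:, :]).abs()
        loss = loss + dx.mean() + dy.mean()
    return loss

def distillation_loss(pred_scales, proxy, alpha_l=1.0, alpha_s=0.5):
    """L = alpha_l * |D_s - D_gt| + alpha_s * L_g(D_s, D_gt), with alpha_s
    halved at each lower scale, as described in the text."""
    total = 0.0
    for pred in pred_scales:             # from the highest to the lower resolutions
        pred = F.interpolate(pred, size=proxy.shape[-2:],
                             mode="bilinear", align_corners=False)
        total = total + alpha_l * (pred - proxy).abs().mean() \
                      + alpha_s * gradient_loss(pred, proxy)
        alpha_s = alpha_s / 2.0
    return total
```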
Table 2 collects quantitative results on three datasets, respectively TUM [38] (3D object reconstruction category), the KITTI Eigen split [4] and NYU [39]. On each dataset, we first show the results achieved by the large and complex networks MiDaS [11] and the model by Li et al. [28] (using the single-frame version), both trained in the wild on a large variety of data. The table also reports the results achieved by the four networks considered in our work, trained on the WILD dataset exploiting knowledge distillation from MiDaS. First and foremost, we highlight how MiDaS performs in general better than [28], emphasizing the reason for distilling knowledge from it.
Considering the lightweight compact models PyDNet, DSNet and FastDepth, we can notice that the margin between them and MiDaS is often non-negligible. Similar behaviour occurs for the significantly more complex network MonoDepth2, despite being in general more accurate than the other, more compact networks, except on KITTI, where it turns out less accurate when trained in the wild. However, considering the massive gap in terms of computational efficiency between the compact networks and MiDaS analyzed later, which makes MiDaS not suited at all for real-time inference on the target devices, the outcome reported in Table 2 is not so surprising. Looking in more detail at the outcome of the lightweight networks, PyDNet is the best model on KITTI when trained in the wild and also achieves the second-best accuracy on NYU, with minor drops on TUM. Finally, DSNet and FastDepth achieve average performance in general, without ever being the best on any dataset.
Figure 5 shows some qualitative examples of depth maps computed from internet pictures by MegaDepth [46], the model by Li et al. [28], MiDaS [11] and the fast networks trained through knowledge distillation in this work.
Finally, in Figure 6 we report some examples of failure cases of MiDaS (middle column) inherited by the student networks. Since both networks fail, the problem is not attributable to their different architectures. Observing the figure, we can notice that such behaviour occurs in very ambiguous scenes, such as when dealing with mirrors or with flat surfaces whose content is aimed at inducing optical illusions in the observer.
After training the considered architectures on the WILD dataset, the stored weights can be converted into mobile-friendly models using the tools provided by deep learning frameworks. Moreover, as previously specified, in our experiments we perform only the model conversion, avoiding weight quantization so as not to alter the accuracy of the original networks.
Table 3 collects statistics about the considered networks. Specifically, we report the number of multiply-accumulate operations (MAC), computed with TensorFlow utilities, and the frame rate (FPS) measured when deploying the converted models on an Apple iPhone XS. Measurements are gathered processing $640 \times 384$ images and averaging over 50 consecutive inferences. On top, we report the performance achieved by MiDaS, showing that it requires about 5 seconds on a smartphone to process a single depth map, performing about 170 billion operations. This evidence highlights how, despite being much more accurate as shown before, this vast network is not suited at all for real-time processing on mobile devices. Moving to more compact models, we can notice how MonoDepth2 reaches nearly 10 FPS while performing one order of magnitude fewer operations. DSNet and PyDNet both perform about 9 billion operations, but the latter allows for much faster inference, close to 60 FPS and about 6 times faster than the previous models. Since the number of operations is almost the same for DSNet and PyDNet, we ascribe this performance discrepancy to the low-level optimization of some specific modules. Finally, FastDepth performs 3 times fewer operations, yet runs slightly slower than PyDNet when deployed with the same degree of optimization as the other networks on the iPhone XS.
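Although the figures above were measured on-device, the same protocol (averaging over 50 consecutive inferences at $640 \times 384$) can be reproduced on a workstation with the TensorFlow Lite interpreter, as in the hedged sketch below; the model path and the input layout are placeholders that must match the converted model.

```python
import time
import numpy as np
import tensorflow as tf

# Sketch: average latency of a converted .tflite model over 50 inferences,
# mirroring the measurement protocol used for the on-device figures.
interpreter = tf.lite.Interpreter(model_path="network.tflite")   # placeholder path
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

image = np.random.rand(1, 384, 640, 3).astype(np.float32)        # one 640x384 RGB frame
start = time.perf_counter()
for _ in range(50):                                               # 50 consecutive inferences
    interpreter.set_tensor(inp["index"], image)
    interpreter.invoke()
    _ = interpreter.get_tensor(out["index"])
latency = (time.perf_counter() - start) / 50
print(f"average latency: {latency * 1000:.1f} ms ({1.0 / latency:.1f} FPS)")
```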
Summarizing the performance analysis reported in this section and the previous accuracy assessment concerning the deployment of single image depth estimation in the wild, our experiments highlight PyDNet as the best tradeoff between accuracy and speed when targeting embedded devices.
A video showing the deployment of PyDNet with an iPhone XS framing an urban environment is available at youtube.com/watch?v=LRfGablYZNw.
A PyDNet web demo with client-side inference carried out by TensorFlow JS is also available at the following link: filippoaleotti.github.io/demo_live.
Having exhaustively assessed the performance of the considered lightweight networks, we now present two well-known applications that can significantly take advantage of real-time and accurate single image depth estimation. For these experiments, we use the PyDNet model trained on the WILD dataset, as described in the previous sections.
Bokeh effect. The first application consists of a bokeh filter, aimed at blurring an image according to the distance from the camera. More precisely, in our implementation, given a threshold $\tau$, all the pixels with a relative inverse depth larger than $\tau$ are blurred by a $25 \times 25$ Gaussian kernel.
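A minimal OpenCV sketch of this filter is reported below, assuming the relative inverse depth map has already been resized to the image resolution and normalized to [0, 1]; function and variable names are illustrative.

```python
import cv2
import numpy as np

def bokeh(image, inverse_depth, tau=0.7, kernel=25):
    """Blur every pixel whose relative inverse depth exceeds tau with a
    kernel x kernel Gaussian filter, keeping the remaining pixels sharp.
    inverse_depth is assumed normalized to [0, 1] and resized to the image size."""
    blurred = cv2.GaussianBlur(image, (kernel, kernel), 0)
    mask = (inverse_depth > tau)[..., None]        # broadcast the mask over channels
    return np.where(mask, blurred, image)
```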
For our experiments, we captured a stereo pair using the rear cameras of an iPhone XS and then, using its API, we inferred a depth map to obtain a baseline. We also fed PyDNet with the single reference image of the stereo pair. Figure 7 depicts the depth maps inferred by the stereo and monocular approaches and the outcome of the bokeh filter. From the figure, we can notice that, even if the depth map inferred by the monocular system is not in scale as the stereo one, it preserves details pretty well and also allows retrieving the relative distance of objects. This latter feature, combined with the need for a single image only, is highly desirable in many consumer applications like the one described. Additionally, since the distance between the two imaging sensors of the stereo setup of a mobile phone is short, the parallax effect enabling to infer depth with stereo vanishes close to the camera. On the contrary, a monocular system is agnostic to this problem. For this experiment, we set $\tau$ equal to 0.9 and 0.7 for stereo and monocular depth, respectively.
Finally, another advantage consists of enabling the bokeh effect even when a stereo pair is not available, for instance when dealing with images sampled from the web (third row in Figure 7). From it, we can notice how processing a single input image enables a charming effect. Notice that the bokeh effect with stereo is not applicable in this case, since the stereo pair is not available, as frequently occurs in practice.
Augmented reality with depth-aware occlusion handling. Modern augmented reality (AR) frameworks for smartphones allow robust and consistent integration of virtual objects on flat areas leveraging camera tracking. However, they miserably fail when the scene contains occluding objects protruding from the flat surfaces. Therefore, in AR scenarios, dense depth estimation is paramount to properly handle physical interactions with the real world, such as occlusions. Unfortunately, most methods rely only on sparse depth measurements for a few points in the scene, appropriately scaled by exploiting the sensor suite of modern smartphones comprising accelerometers, gyroscopes, etcetera. Although some authors proposed to densify such sparse measurements, it is worth observing that dynamic objects in the sensed scene may yield incorrect sparse estimations, and thus these methods need to filter out moving points [49]. We argue that single image depth estimation may enable full perception of the scene, suited for many real-world use cases and potentially avoiding all the issues outlined so far. The only remaining issue, concerning the unknown scale factor intrinsic to a monocular system, can be robustly addressed by leveraging, as described next, one or more of the sparse in-scale depth measurements made available by standard AR frameworks.
Purposely, we developed a mobile application capable of handling object occlusions in real-time by combining AR frameworks, such as ARCore or ARKit, with a robust and lightweight monocular depth estimation network. To achieve this goal, we first exploit the AR framework to retrieve low-level information, such as the pose of the camera and the position of anchors in the sensed environment. Then, we retrieve the depth of the anchor points and, by comparing such measurements with the monocular depth predictions, we can avoid rendering occluded regions. At each frame, the scale factor issue is tackled within a robust RANSAC-based framework fed with the sparse and potentially noisy depth measurements provided by the AR framework and the dense depth map estimated by the monocular network, as sketched below. Differently from other approaches, such as [49] and [50], our networks require neither SLAM points to infer dense depth maps nor a fine-tuning of the network on the input video data. In our case, a single image and at least one point in scale suffice to obtain absolute depth perception. Consequently, we do not rely on other techniques (e.g. optical flow or edge localization) in our whole AR pipeline. Nevertheless, it can be noticed in Figure 8 how our strategy coupled with PyDNet can produce competitive and detailed depth maps leveraging a single RGB image only. Figure 9 shows some qualitative examples of an AR application, i.e. the visualization of a virtual duck in the observed scene. Once it is positioned on a surface, we can notice how foreground elements do not correctly hide it without proper occlusion handling. In contrast, our strategy allows for a more realistic experience, thanks to the dense and robust depth map inferred by PyDNet and the sparse anchors provided by the AR framework.
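A possible sketch of the scale-recovery step is reported below: a simple RANSAC loop fitting a single scale factor between the metric depths of the AR anchors and the corresponding monocular predictions (assumed already converted to relative depth); the thresholds, iteration count and least-squares refinement are illustrative assumptions.

```python
import numpy as np

def ransac_scale(anchor_depths, mono_depths, iterations=100, threshold=0.1):
    """Fit the scale factor aligning relative monocular depths to the metric
    depths of the sparse AR anchors, keeping the scale with the most inliers."""
    anchor_depths = np.asarray(anchor_depths, dtype=np.float64)
    mono_depths = np.asarray(mono_depths, dtype=np.float64)
    rng = np.random.default_rng(0)
    best_scale, best_inliers = 1.0, -1
    for _ in range(iterations):
        i = rng.integers(len(anchor_depths))              # minimal sample: one anchor
        scale = anchor_depths[i] / mono_depths[i]
        residuals = np.abs(scale * mono_depths - anchor_depths) / anchor_depths
        inliers = residuals < threshold
        if inliers.sum() > best_inliers:
            best_inliers = inliers.sum()
            # Least-squares refinement of the scale on the current inlier set.
            best_scale = (anchor_depths[inliers] @ mono_depths[inliers]) / \
                         (mono_depths[inliers] @ mono_depths[inliers])
    return best_scale

# Usage: metric_map = ransac_scale(anchor_z, mono_at_anchors) * mono_depth_map
```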
In this paper, we proposed a strategy to train single image depth estimation networks, focusing our attention on lightweight ones suited for handheld devices, which are characterized by severe constraints concerning power consumption and computational resources. An exhaustive evaluation highlights that real-time depth estimation from a single image in the wild is feasible by adopting appropriate network design and training strategies. By distilling knowledge from a complex architecture not suited for mobile deployment, we have shown that it is possible to develop accurate yet fast networks enabling a variety of AR applications on consumer smartphones. We also reported the effectiveness of such an approach in two notable application scenarios concerning depth-aware blurring and augmented reality.