Abstract
We present a method for extracting depth information from a rectified image pair. Our approach focuses on the first stage of many stereo algorithms: the matching cost computation. We approach the problem by learning a similarity measure on small image patches using a convolutional neural network. Training is carried out in a supervised manner by constructing a binary classification data set with examples of similar and dissimilar pairs of patches. We examine two network architectures for this task: one tuned for speed, the other for accuracy. The output of the convolutional neural network is used to initialize the stereo matching cost. A series of post-processing steps follow: cross-based cost aggregation, semiglobal matching, a left-right consistency check, subpixel enhancement, a median filter, and a bilateral filter. We evaluate our method on the KITTI 2012, KITTI 2015, and Middlebury stereo data sets and show that it outperforms other approaches on all three data sets.
Keywords: stereo, matching cost, similarity learning, supervised learning, convolutional neural networks
Consider the following problem: given two images taken by cameras at different horizontal positions, we wish to compute the disparity d for each pixel in the left image. Disparity refers to the difference in horizontal location of an object in the left and right images: an object at position (x, y) in the left image appears at position (x − d, y) in the right image. If we know the disparity of an object, we can compute its depth z using the following relation:
$$z = \frac{fB}{d},$$
where f is the focal length of the camera and B is the distance between the camera centers. Figure 1 depicts the input to and the output from our method.
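As a minimal sketch of the relation between disparity and depth, the following Python function evaluates z = fB/d. The function name, the units (focal length in pixels, baseline in meters), and the numeric values are assumptions chosen for illustration and are not taken from any particular camera rig.

```python
def depth_from_disparity(disparity, focal_length, baseline):
    """Return the depth z = f * B / d.

    Assumed units: focal_length in pixels, baseline in meters,
    disparity in pixels, so the returned depth is in meters.
    """
    if disparity < 0:
        raise ValueError("disparity must be non-negative")
    if disparity == 0:
        # Zero disparity corresponds to a point infinitely far away.
        return float("inf")
    return focal_length * baseline / disparity


# Hypothetical camera parameters, for illustration only.
f_pixels = 700.0
baseline_m = 0.5
for d in (70.0, 35.0, 7.0):
    # Prints 5.0, 10.0, and 50.0: larger disparity means smaller depth.
    print(depth_from_disparity(d, f_pixels, baseline_m))
```

The loop shows the inverse relationship: a disparity ten times smaller gives a depth ten times larger.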
Figure 1: The input is a pair of images from the left and right camera. The two input images differ mostly in horizontal locations of objects (other differences are caused by reflections, occlusions, and perspective distortions). Note that objects closer to the camera have larger disparities than objects farther away. The output is a dense disparity map shown on the right, with warmer colors representing larger values of disparity (and smaller values of depth).
The described problem of stereo matching is important in many fields such as autonomous driving, robotics, intermediate view generation, and 3D scene reconstruction. According to the taxonomy of Scharstein and Szeliski (2002), a typical stereo algorithm consists of four steps: matching cost computation, cost aggregation, optimization, and disparity refinement. Following Hirschmüller and Scharstein (2009), we refer to the first two steps as computing the matching cost and the last two steps as the stereo method. The focus of this work is on computing a good matching cost.
We propose training a convolutional neural network (LeCun et al., 1998) on pairs of small image patches where the true disparity is known (for example, obtained by LIDAR or structured light). The output of the network is used to initialize the matching cost. We proceed with a number of post-processing steps that are not novel, but are necessary to achieve good results. Matching costs are combined between neighboring pixels with similar image intensities using cross-based cost aggregation. Smoothness constraints are enforced by semiglobal matching and a left-right consistency check is used to detect and eliminate errors in occluded regions. We perform subpixel enhancement and apply a median filter and a bilateral filter to obtain the final disparity map.
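To make one of the post-processing steps concrete, here is a simplified sketch of a left-right consistency check. It assumes integer disparity maps stored as nested lists, one computed with the left image as reference and one with the right image as reference, and it flags a left pixel as consistent when the two maps agree to within a tolerance. The function name, the data layout, and the tolerance `max_diff` are illustrative assumptions, not the exact procedure used in this work.

```python
def left_right_consistency(disp_left, disp_right, max_diff=1):
    """Return a mask that is True where the two disparity maps agree.

    disp_left[y][x] is the disparity of pixel (x, y) in the left image and
    disp_right[y][x] the disparity of pixel (x, y) in the right image.
    A left pixel (x, y) with disparity d should match the right pixel
    (x - d, y); it is marked consistent if the right map assigns that
    pixel a disparity within max_diff of d.
    """
    height = len(disp_left)
    width = len(disp_left[0])
    mask = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            d = disp_left[y][x]
            x_right = x - d  # column of the matching pixel in the right image
            if 0 <= x_right < width:
                mask[y][x] = abs(d - disp_right[y][x_right]) <= max_diff
            # Pixels that map outside the right image stay False.
    return mask


# Tiny one-row example with made-up disparities.
disp_left = [[2, 1, 1, 3]]
disp_right = [[1, 1, 5, 0]]
print(left_right_consistency(disp_left, disp_right))
# prints [[False, True, True, False]]
```

Pixels flagged False are candidates for occlusions or mismatches, and a full pipeline would typically fill them in from nearby consistent pixels.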
The contributions of this paper are
This paper extends our previous work (Zbontar and LeCun, 2015) by including a description of a new architecture, results on two new data sets, lower error rates, and more thorough experiments.
Before the introduction of large stereo data sets like KITTI and Middlebury, relatively few stereo algorithms used ground truth information to learn parameters of their models; in this section, we review the ones that did. For a general overview of stereo algorithms see Scharstein and Szeliski (2002).
Kong and Tao (2004) used the sum of squared distances to compute an initial matching cost. They then trained a model to predict the probability distribution over three classes: the initial disparity is correct, the initial disparity is incorrect due to fattening of a foreground object, and the initial disparity is incorrect due to other reasons. The predicted probabilities were used to adjust the initial matching cost. Kong and Tao (2006) later extended their work by combining predictions obtained by computing normalized cross-correlation over different window sizes and centers. Peris et al. (2012) initialized the matching cost with AD-Census (Mei et al., 2011), and used multiclass linear discriminant analysis to learn a mapping from the computed matching cost to the final disparity.
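For readers unfamiliar with the hand-crafted measures mentioned above, the following sketch computes a sum of squared differences and a normalized cross-correlation between two image patches. It assumes each patch is a flattened list of grayscale intensities of equal length, and it uses the zero-mean variant of normalized cross-correlation. Both choices, and the function names, are assumptions made for illustration rather than details of the cited works.

```python
import math


def ssd(patch_a, patch_b):
    """Sum of squared differences: 0 for identical patches, larger is worse."""
    return sum((a - b) ** 2 for a, b in zip(patch_a, patch_b))


def ncc(patch_a, patch_b):
    """Zero-mean normalized cross-correlation in [-1, 1]; larger is better."""
    n = len(patch_a)
    mean_a = sum(patch_a) / n
    mean_b = sum(patch_b) / n
    centered_a = [a - mean_a for a in patch_a]
    centered_b = [b - mean_b for b in patch_b]
    numerator = sum(a * b for a, b in zip(centered_a, centered_b))
    denominator = math.sqrt(
        sum(a * a for a in centered_a) * sum(b * b for b in centered_b)
    )
    if denominator == 0:
        # At least one patch is constant, so correlation is undefined;
        # returning 0.0 is a convention chosen for this sketch.
        return 0.0
    return numerator / denominator


# A made-up patch and a uniformly brighter copy of it.
patch = [10, 20, 30, 40]
brighter = [v + 5 for v in patch]
print(ssd(patch, brighter))  # prints 100
print(ncc(patch, brighter))  # prints 1.0
```

The example illustrates why the two measures behave differently: a uniform brightness change raises the sum of squared differences but leaves the normalized cross-correlation at its maximum.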
Ground-truth data was also used to learn parameters of probabilistic graphical models. Zhang and Seitz (2007) used an alternative optimization algorithm to estimate optimal values of Markov random field hyperparameters. Scharstein and Pal (2007) constructed a new data set of 30 stereo pairs and used it to learn parameters of a conditional random field. Li and Huttenlocher (2008) presented a conditional random field model with a nonparametric cost function and used a structured support vector machine to learn the model parameters.
Recent work (Haeusler et al., 2013; Spyropoulos et al., 2014) focused on estimating the confidence of the computed matching cost. Haeusler et al. (2013) used a random forest classifier to combine several confidence measures. Similarly, Spyropoulos et al. (2014) trained a random forest classifier to predict the confidence of the matching cost and used the predictions as soft constraints in a Markov random field to decrease the error of the stereo method.
A problem related to computing the matching cost is learning local image descriptors (Brown et al., 2011; Trzcinski et al., 2012; Simonyan et al., 2014; Revaud et al., 2015; Paulin et al., 2015; Han et al., 2015; Zagoruyko and Komodakis, 2015). The two problems share a common subtask: measuring the similarity between image patches. Brown et al. (2011) introduced a general framework for learning image descriptors and used Powell’s method to select good hyperparameters. Several methods have been suggested for solving the problem of learning local image descriptors, such as boosting (Trzcinski et al., 2012), convex optimization (Simonyan et al., 2014), hierarchical moving-quadrant similarity (Revaud et al., 2015), convolutional kernel networks (Paulin et al., 2015), and convolutional neural networks (Zagoruyko and Komodakis, 2015; Han et al., 2015). The works of Zagoruyko and Komodakis (2015) and Han et al. (2015), in particular, are very similar to our own, differing mostly in the architecture of the network; concretely, they include pooling and subsampling to account for larger patch sizes and larger variation in viewpoint.