[Paper Translation] SOLD2: Self-Supervised Occlusion-Aware Line Description and Detection

SOLD2: Self-Supervised Occlusion-Aware Line Description and Detection

Paper: http://arxiv.org/abs/2104.03362
Code: https://github.com/cvg/SOLD2

Remi Pautrat, Juan-Ting Lin, Viktor Larsson, Martin R. Oswald, Marc Pollefeys; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 11368-11378

Abstract

Compared to feature point detection and description, detecting and matching line segments offer additional challenges. Yet, line features represent a promising complement to points for multi-view tasks. Lines are indeed well-defined by the image gradient, frequently appear even in poorly textured areas and offer robust structural cues. We thus hereby introduce the first joint detection and description of line segments in a single deep network. Thanks to a self-supervised training, our method does not require any annotated line labels and can therefore generalize to any dataset. Our detector offers repeatable and accurate localization of line segments in images, departing from the wireframe parsing approach. Leveraging the recent progresses in descriptor learning, our proposed line descriptor is highly discriminative, while remaining robust to viewpoint changes and occlusions. We evaluate our approach against previous line detection and description methods on several multi-view datasets created with homographic warps as well as real-world viewpoint changes. Our full pipeline yields higher repeatability, localization accuracy and matching metrics, and thus represents a first step to bridge the gap with learned feature points methods. Code and trained weights are available at https://github.com/cvg/SOLD2.

1. Introduction

Feature points are at the core of many computer vision tasks such as Structure-from-Motion (SfM) [14, 47], Simultaneous Localization and Mapping (SLAM) [37], large-scale visual localization [44, 50] and 3D reconstruction [9], due to their compact and robust representation. Yet, the world is composed of higher-level geometric structures which are semantically more meaningful than points. Among these structures, lines can offer many benefits compared to points. Lines are widespread and frequent in the world, especially in man-made environments, and are still present in poorly textured areas. In contrast to points, they have a natural orientation, and a collection of lines provide strong geometric clues about the structure of a scene [61, 53, 16]. As such, lines represent good features for 3D geometric tasks.

Previous methods to detect line segments in images often relied on image gradient information and handcrafted filters [57, 1]. Recently, deep learning has also enabled robust and real-time line detection [19]. Most learned line detectors are however tackling a closely related task: wireframe parsing, which aims at inferring the structured layout of a scene based on line segments and their connectivity [18, 67, 63, 68]. These structures provide strong geometric cues, in particular for man-made environments. Yet, these methods have not been optimized for repeatability across images, a vital feature for multi-view tasks, and their training requires ground truth lines that are cumbersome to manually label [18].

The traditional way to match geometric structures across images is to use feature descriptors. Yet, line descriptors face several challenges: line segments can be partially occluded, their endpoints may not be well localized, the scale of the area to describe around each line fluctuates a lot, and it can be severely deformed under perspective and distortion changes [46]. Early line descriptors focused on extracting a support region around each line and on computing gradient statistics on it [60, 65]. More recently, motivated by the success of learned point descriptors [7, 9, 42], a few deep line descriptors have been proposed [25, 55, 24]. However, they are not designed to handle line occlusion and remain sensitive to poorly localized endpoints.

In this work, we propose to jointly learn the detection and description of line segments. To this end, we introduce a self-supervised network, inspired by LCNN [67] and SuperPoint [7], that can be trained on any image dataset without any labels. Pretrained on a synthetic dataset, our method is then generalized to real images. Our line detection aims at maximizing the line repeatability and at being as accurate as possible to allow its use in geometric estimation tasks. The learned descriptor is designed to be robust to occlusions, while remaining as discriminative as the current learned point descriptors. To achieve that, we introduce a novel line matching based on dynamic programming and inspired by sequence alignment in genetics [38] and classical stereo matching [8]. Thus, our self-supervised occlusion-aware line description and detection (SOLD2) offers a generic pipeline that aims at bridging the gap with the recent learned feature point methods. Overall, our contributions can be summarized as follows:

  • We propose the first deep network for joint line segment detection and description.
  • We show how to self-supervise our network for line detection, allowing training on any dataset of real images.
  • Our line matching procedure is robust to occlusion and achieves state-of-the-art results on image matching tasks.

2. Related work

Line detection.

Line description.

Joint detection and description of learned features.

Line matching.


Figure 2: Training pipeline overview. Left: Our detector network is first trained on a synthetic dataset with known ground truth. Middle: A pseudo ground truth of line segments is then generated on real images through homography adaptation. Right: Finally, the full model with descriptors is trained on real images using the pseudo ground truth.

3. Method

We propose a unified network to perform line segment detection and description, allowing to match lines across different images. We achieve self-supervision in two steps. Our detector is first pretrained on a synthetic dataset with known ground truth. The full detector and descriptor can then be trained by generating pseudo ground truth line segments on real images using the pretrained model. We provide an overview of our training pipeline in Figure 2 and detail its parts in the following sections.


3.1. Problem formulation

Line segments can be parametrized in many ways: with two endpoints; with a middle point, a direction and a length; with a middle point and offsets for the endpoints; with an attraction field, etc. In this work, we chose the line representation with two endpoints for its simplicity and compatibility with our self-supervision process discussed in Section 3.4. For an image $I$ with spatial resolution $h \times w$, we thus consider in the following the set of all junctions $P = \{p_n\}_{n=1}^{N}$ and line segments $L = \{l_m\}_{m=1}^{M}$ of $I$. A line segment $l_m$ is defined by a pair of endpoints $(e_m^1, e_m^2) \in P^2$.


3.2. Junction and line heatmap inference

Our network takes grayscale images as input and processes them through a shared backbone encoder, which is later divided into three different output branches. A junction map $\mathbf{J}$ predicts the probability of each pixel to be a line endpoint, a line heatmap $\mathbf{H}$ provides the probability of a pixel to be on a line, and a descriptor map $\mathbf{D}$ yields a pixel-wise local descriptor. We focus here on the optimization of the first two branches, while the following sections describe their combination to retrieve and match the line segments of an image.


We adopt a similar approach to SuperPoint's keypoint decoder [7] for the junction branch, where the output is a coarse $\frac{h}{8} \times \frac{w}{8} \times 65$ feature map $\mathbf{J}^c$. Each 65-dimensional vector corresponds to an $8 \times 8$ patch plus an extra "no junction" dustbin. We define the ground truth junctions $\mathbf{y} \in \{1, \ldots, 65\}^{\frac{h}{8} \times \frac{w}{8}}$ indicating the index of the true junction position in each patch. A junction is randomly selected when several ground truth junctions land in the same patch, and a value of 65 means that there is no junction. The junction loss is then a cross-entropy loss between $\mathbf{J}^c$ and $\mathbf{y}$:

$$\mathcal{L}_{junc} = \frac{64}{h \times w} \sum_{i,j=1}^{h/8,\, w/8} -\log\left(\frac{\exp\left(J^c_{ij y_{ij}}\right)}{\sum_{k=1}^{65} \exp\left(J^c_{ijk}\right)}\right)$$

At inference time, we perform a softmax on the channel dimension and discard the 65th dimension, before resizing the junction map to get the final $h \times w$ grid.

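For concreteness, this decoding step can be written in a few lines of PyTorch. This is a minimal sketch assuming the SuperPoint-style channel layout described above; `decode_junction_map` is a hypothetical helper, not a function from the official repository:

```python
import torch
import torch.nn.functional as F

def decode_junction_map(j_coarse: torch.Tensor) -> torch.Tensor:
    """Turn a coarse (B, 65, h/8, w/8) junction volume into a (B, h, w)
    probability map: channel softmax, drop the "no junction" dustbin,
    then fold each 64-vector back into its 8x8 pixel patch."""
    probs = F.softmax(j_coarse, dim=1)[:, :64]             # (B, 64, h/8, w/8)
    return F.pixel_shuffle(probs, upscale_factor=8)[:, 0]  # (B, h, w)

j_coarse = torch.randn(1, 65, 8, 8)         # e.g. logits for a 64x64 image
print(decode_junction_map(j_coarse).shape)  # torch.Size([1, 64, 64])
```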

The second branch outputs a line heatmap $\mathbf{H}$ at the image resolution $h \times w$. Given a binary ground truth $\mathbf{H}^{GT}$ with a value of 1 for pixels on lines and 0 otherwise, the line heatmap is optimized via a binary cross-entropy loss:

$$\mathcal{L}_{line} = \frac{1}{h \times w} \sum_{i,j=1}^{h,w} -H^{GT}_{ij} \log\left(H_{ij}\right)$$


3.3. Line detection module

After inferring the junction map $\mathbf{J}$ and line heatmap $\mathbf{H}$, we threshold $\mathbf{J}$ to keep the maximal detections and apply a non-maximum suppression (NMS) to extract the segment junctions $\hat{P}$. The line segment candidates set $\hat{L}_{cand}$ is composed of every pair of junctions in $\hat{P}$. Extracting the final line segment predictions $L$ based on $\mathbf{H}$ and $\hat{L}_{cand}$ is non-trivial, since the heatmap activation between the endpoints may vary a lot across different candidates. Our approach can be broken down into four parts (see the sketch after this list): (1) regular sampling between endpoints, (2) adaptive local-maximum search, (3) average score, and (4) inlier ratio.


**Regular sampling between endpoints:** Instead of fetching all the rasterized pixels between the two endpoints [68], we sample $N_s$ uniformly spaced points (including the two endpoints) along the line segment.

**Adaptive local-maximum search:** Using bilinear interpolation to fetch the heatmap values at the extracted points $q_k$ may discard some candidates due to the misalignment between the endpoints and the heatmap, especially for long lines. To alleviate that, we search for the local maximal heatmap activation $h_k$ around each sampled location $q_k$ within a radius $r$ proportional to the length of the line.

**Average score:** The average score is defined as the mean of all the sampled heatmap values: $y_{avg} = \frac{1}{N_s} \sum_{k=1}^{N_s} h_k$. Given a threshold $\xi_{avg}$, valid line segment candidates should satisfy $y_{avg} \geq \xi_{avg}$.

**Inlier ratio:** Only relying on the average score may keep segments with a few high activations but with holes along the line. To remove these spurious detections, we also consider an inlier ratio $y_{inlier} = \frac{1}{N_s} \left| \{ h_k \mid h_k \geq \xi_{avg},\ h_k \in H \} \right|$. Given an inlier ratio threshold $\xi_{inlier}$, we only keep candidates satisfying $y_{inlier} \geq \xi_{inlier}$.
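To make steps (1), (3) and (4) concrete, here is a minimal NumPy sketch of the candidate verification. It substitutes a nearest-pixel lookup for the adaptive local-maximum search of step (2), and the function name and the (row, col) endpoint convention are ours, not the paper's:

```python
import numpy as np

def keep_candidate(heatmap, e1, e2, n_samples=64, xi_avg=0.25, xi_inlier=0.75):
    """Accept or reject one candidate segment (e1, e2) given the line heatmap."""
    ts = np.linspace(0.0, 1.0, n_samples)
    pts = (1 - ts)[:, None] * np.asarray(e1) + ts[:, None] * np.asarray(e2)
    rows = np.clip(np.round(pts[:, 0]).astype(int), 0, heatmap.shape[0] - 1)
    cols = np.clip(np.round(pts[:, 1]).astype(int), 0, heatmap.shape[1] - 1)
    h_k = heatmap[rows, cols]             # sampled heatmap activations
    y_avg = h_k.mean()                    # (3) average score
    y_inlier = (h_k >= xi_avg).mean()     # (4) inlier ratio
    return y_avg >= xi_avg and y_inlier >= xi_inlier

heatmap = np.zeros((100, 100))
heatmap[50, 10:90] = 1.0                            # one horizontal line
print(keep_candidate(heatmap, (50, 10), (50, 89)))  # True: real segment
print(keep_candidate(heatmap, (10, 10), (90, 90)))  # False: spurious pair
```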

3.4. Self-supervised learning pipeline

Inspired by the success of DeTone et al. [7], we extend their homography adaptation to the case of line segments. Let $f_{junc}$ and $f_{heat}$ represent the forward pass of our network to compute the junction map and the line heatmap. We start by aggregating the junction and heatmap predictions as in SuperPoint, using a set of $N_h$ homographies $(\mathcal{H}_i)_{i=1}^{N_h}$:
$$\hat{\mathbf{J}}(I; f_{junc}) = \frac{1}{N_h} \sum_{i=1}^{N_h} \mathcal{H}_i^{-1}\left(f_{junc}(\mathcal{H}_i(I))\right)$$

$$\hat{\mathbf{H}}(I; f_{heat}) = \frac{1}{N_h} \sum_{i=1}^{N_h} \mathcal{H}_i^{-1}\left(f_{heat}(\mathcal{H}_i(I))\right)$$
We then apply the line detection module to the aggregated maps $\hat{\mathbf{J}}$ and $\hat{\mathbf{H}}$ to obtain the predicted line segments $L$, which are then used as ground truth for the next training round. Figure 2 provides an overview of the pipeline. Similar to SuperPoint, this process can be iteratively applied to improve the label quality. However, we found that a single round of adaptation already provides sufficiently good labels.
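The aggregation can be sketched with OpenCV as below. This is a simplified sketch: `predict_fn` stands in for $f_{junc}$ or $f_{heat}$, the homography sampler is a crude corner jitter rather than the composition of transformations described in Appendix A, and pixels falling outside a warp are simply averaged as zeros, whereas a full implementation would mask them:

```python
import numpy as np
import cv2

def homography_adaptation(img, predict_fn, n_homographies=100, rng=None):
    """Average predictions warped back from n random views of the image."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = img.shape[:2]
    acc = predict_fn(img).astype(np.float32)   # identity warp counts once
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    for _ in range(n_homographies - 1):
        # random perspective perturbation of the image corners
        dst = (src + rng.uniform(-0.1, 0.1, src.shape) * [w, h]).astype(np.float32)
        H = cv2.getPerspectiveTransform(src, dst)
        pred = predict_fn(cv2.warpPerspective(img, H, (w, h)))
        # warp the prediction back with the inverse homography and accumulate
        acc += cv2.warpPerspective(pred.astype(np.float32), np.linalg.inv(H), (w, h))
    return acc / n_homographies
```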

3.5. Line description

Describing lines in images is a problem inherently more difficult than describing feature points. A line can be partially occluded, its endpoints are not always repeatable across views, and the appearance of a line can significantly differ under viewpoint changes. To tackle these challenges, we depart from the classical description of a patch centered on the line [25, 24], which is not robust to occlusions and endpoint shortening. Motivated by the success of learned point descriptors, we formulate our line descriptor as a sequence of point descriptors sampled along the line. Given a good coverage of the points along the line, even if part of the line is occluded, the points on the non-occluded part will store enough line details and can still be matched.


The descriptor head of our network outputs a descriptor map $\mathbf{D} \in \mathbb{R}^{\frac{h}{4} \times \frac{w}{4} \times 128}$ and is optimized through the classical point-based triplet loss [3, 36] used in other dense descriptors [9]. Given a pair of images $I_1$ and $I_2$ and matching lines in both images, we regularly sample points along each line and extract the corresponding descriptors $(\mathbf{D}_1^i)_{i=1}^{n}$ and $(\mathbf{D}_2^i)_{i=1}^{n}$ from the descriptor maps, where $n$ is the total number of points in an image. The triplet loss minimizes the descriptor distance of matching points and maximizes the one of non-matching points. The positive distance is defined as

$$p_i = \left\| \mathbf{D}_1^i - \mathbf{D}_2^i \right\|_2$$

The negative distance is computed between a point and its hardest negative example in the batch:

$$n_i = \min\left( \left\| \mathbf{D}_1^i - \mathbf{D}_2^{h_2(i)} \right\|_2,\ \left\| \mathbf{D}_1^{h_1(i)} - \mathbf{D}_2^i \right\|_2 \right)$$

where $h_1(i) = \arg\min_{k \in [1, n]} \left\| \mathbf{D}_1^k - \mathbf{D}_2^i \right\|_2$ such that the points $i$ and $k$ are at a distance of at least $T$ pixels and are not part of the same line, and similarly for $h_2(i)$. The triplet loss with margin $M$ is then defined as

$$\mathcal{L}_{desc} = \frac{1}{n} \sum_{i=1}^{n} \max\left(0, M + p_i - n_i\right)$$

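A compact PyTorch sketch of this loss is given below, under simplifying assumptions: descriptors are already sampled and indexed so that `d1[i]` matches `d2[i]`, and a single set of reference point locations `pts` is used for the spatial constraint in both directions (the paper applies the $T$-pixel constraint in each image separately):

```python
import torch

def line_triplet_loss(d1, d2, pts, line_ids, margin=1.0, t_px=8.0):
    """Hardest-in-batch triplet loss on point descriptors sampled along lines.
    d1, d2: (n, 128) descriptors of corresponding points in the two images;
    pts: (n, 2) point locations; line_ids: (n,) index of each point's segment."""
    dist = torch.cdist(d1, d2)          # (n, n) pairwise descriptor distances
    pos = dist.diagonal()               # p_i: distances of matching pairs
    # a pair (i, k) is not a valid negative if the points are closer than
    # T pixels or belong to the same line segment
    invalid = (torch.cdist(pts, pts) < t_px) | \
              (line_ids[:, None] == line_ids[None, :])
    masked = dist.masked_fill(invalid, float('inf'))
    neg = torch.minimum(masked.min(dim=1).values,   # hardest negative in image 2
                        masked.min(dim=0).values)   # hardest negative in image 1
    return torch.clamp(margin + pos - neg, min=0).mean()
```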

3.6. Multi-task learning

Detecting and describing lines are independent tasks with different homoscedastic aleatoric uncertainties, and their respective losses can have different orders of magnitude. Thus, we adopt the multi-task learning proposed by Kendall et al. [21] with a dynamic weighting of the losses, where the weights $w_{junc}$, $w_{line}$ and $w_{desc}$ are optimized during training [20, 43]. The total loss becomes:
$$\mathcal{L}_{total} = e^{-w_{junc}} \mathcal{L}_{junc} + e^{-w_{line}} \mathcal{L}_{line} + e^{-w_{desc}} \mathcal{L}_{desc} + w_{junc} + w_{line} + w_{desc}$$
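The dynamic weighting amounts to learning one scalar per task jointly with the network. A minimal sketch of this weighting module (the class name is ours):

```python
import torch
import torch.nn as nn

class UncertaintyWeighting(nn.Module):
    """Kendall-style multi-task weighting: each w is a learned log-variance,
    optimized by gradient descent together with the network parameters."""
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(3))  # w_junc, w_line, w_desc

    def forward(self, l_junc, l_line, l_desc):
        losses = torch.stack([l_junc, l_line, l_desc])
        # exp(-w) scales each loss down as its uncertainty grows; the +w
        # terms keep the weights from collapsing to -infinity
        return (torch.exp(-self.w) * losses).sum() + self.w.sum()

# usage: total = weighting(l_junc, l_line, l_desc); total.backward()
```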

3.7. Line matching

At inference time, two line segments are compared based on their respective collection of point descriptors sampled along each line. However, some of the points might be occluded or, due to perspective changes, the length of a line can vary and the sampled points may be misaligned. The ordering of the points matched along the line should nevertheless be constant, i.e. the line descriptor is an ordered sequence of descriptors, not just a set. To solve this sequence assignment problem, we take inspiration from nucleotide alignment in bioinformatics [38] and pixel alignment along scanlines in stereo vision [8]. We thus propose to find the optimal point assignment through the dynamic programming algorithm originally introduced by Needleman and Wunsch [38].

When matching two sequences of points, each point can be either matched to another one or skipped. The score attributed to a match of two points depends on the similarity of their descriptors (i.e. their dot product), so that a higher similarity gives a higher score. Skipping a point is penalized by a $gap$ score, which has to be adjusted so that it is preferable to match points with high similarity but to skip the ones with low similarity. The total score of a line match is then the sum of all skip and match operations of the line points. The Needleman-Wunsch (NW) algorithm returns the optimal matching sequence maximizing this total score. This is achieved with dynamic programming by filling a matrix of scores row by row, as depicted in Figure 3. Given a sequence of $m$ points along a line $l$, $m'$ points along $l'$, and the associated descriptors $\mathbf{D}$ and $\mathbf{D}'$, this score matrix $\mathbf{S}$ is an $(m+1) \times (m'+1)$ grid where $\mathbf{S}(i,j)$ contains the optimal score for matching the first $i$ points of $l$ with the first $j$ points of $l'$. The grid is initialized with the $gap$ score in the first row and column, and is sequentially filled row by row, using the scores stored in the left, top and top-left cells:

$$\mathbf{S}(i,j) = \max\big(\mathbf{S}(i-1,j) + gap,\ \mathbf{S}(i,j-1) + gap,\ \mathbf{S}(i-1,j-1) + \mathbf{D}^{i\top} \mathbf{D}'^{j}\big)$$
Once the matrix is filled, we select the highest score in the grid and use it as a match score for the candidate pair of lines. Each line of the first image is then matched to the line in the second image with the maximum match score.

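The score-matrix recursion translates directly into code. A minimal NumPy sketch, assuming L2-normalized descriptors so that the dot product is a similarity in $[-1, 1]$; the cumulative gap initialization of the first row and column is our reading of the description above, and the gap value of 0.1 comes from Section 3.8:

```python
import numpy as np

def nw_match_score(desc1, desc2, gap=0.1):
    """Needleman-Wunsch match score between two point descriptor sequences.
    desc1: (m, d), desc2: (m2, d); returns the highest score in the grid."""
    m1, m2 = len(desc1), len(desc2)
    sim = desc1 @ desc2.T                      # (m1, m2) descriptor similarities
    S = np.zeros((m1 + 1, m2 + 1))
    S[0, :] = gap * np.arange(m2 + 1)          # initialize with gap scores
    S[:, 0] = gap * np.arange(m1 + 1)
    for i in range(1, m1 + 1):
        for j in range(1, m2 + 1):
            S[i, j] = max(S[i - 1, j] + gap,                    # skip point of l
                          S[i, j - 1] + gap,                    # skip point of l'
                          S[i - 1, j - 1] + sim[i - 1, j - 1])  # match them
    return S.max()
```

Note the role of the gap value: matching two points scores their similarity, while skipping both of them scores $2 \cdot gap$, so a pair is only matched when its similarity exceeds $0.2$; this is how the gap score arbitrates between matching and skipping.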

3.8. Implementation details

Network implementation. To have a fair comparison with most wireframe parsing methods [67, 63, 30], we use the same stacked hourglass network [39] for our backbone. The three branches of our network are then a series of convolutions, ReLU activations, and upsampling blocks via subpixel shuffles [51]. Please refer to the supplementary material for more details about the architecture. The network is optimized with the Adam solver [23] with a learning rate of 0.0005.

Line parameters. We use a junction threshold of $\frac{1}{65}$, a heatmap threshold $\xi_{avg} = 0.25$, and an inlier threshold $\xi_{inlier} = 0.75$; we extract $N_s = 64$ samples along each line to compute the heatmap and inlier scores, and we use $N_h = 100$ homographies for the homography adaptation.

Matching details. The line descriptor is computed by regularly sampling up to 5 points along each line segment, while keeping a minimum distance of 8 pixels between points. Since the ordering of the points might be reversed from one image to the other, we run the matching twice, with one point set flipped. A gap score of 0.1 empirically yields the best results during the NW matching. To speed up the line matching, we pre-filter the set of line candidates with a simple heuristic. Given the descriptors of the 5 points sampled on a line of $I_1$ to be matched, we compute the similarity with their nearest neighbor in each line of $I_2$, and average these scores for each line. This yields a rough estimate of the line match score, and we keep the top 10 best lines as candidates for the NW matching. Finally, we retain at matching time only the pairs that are mutually matched.

Training dataset. We use the same synthetic dataset as in SuperPoint [7], labelling the corners of the geometrical shapes as junctions and the edges as line segments. For the training with real images, we use the Wireframe dataset [18], allowing a fair comparison with the current state of the art also trained on these images. We follow the split policy of LCNN [67]: 5,000 images for training and 462 images for testing. We however only use the images and ignore the ground truth lines provided by the dataset.

4. Experiments

4.1. Line segment detection evaluation

To evaluate our line segment detection, we use the test split of the Wireframe dataset [18] and the YorkUrban dataset [6], which contains 102 outdoor images. For both datasets, we generate a fixed set of random homographies and warp each image to get a pair of matching images.

Line segment distance metrics. A line distance metric needs to be defined to evaluate the accuracy of a line detection. We use the two following metrics:

Structural distance ($d_s$): The structural distance of two line segments $l_1$ and $l_2$ is defined as:

$$d_s(l_1, l_2) = \min\big( \left\| e_1^1 - e_2^1 \right\|_2 + \left\| e_1^2 - e_2^2 \right\|_2,\ \left\| e_1^1 - e_2^2 \right\|_2 + \left\| e_1^2 - e_2^1 \right\|_2 \big)$$

where $(e_1^1, e_1^2)$ and $(e_2^1, e_2^2)$ are the endpoints of $l_1$ and $l_2$ respectively. Contrary to the formulation of recent wireframe parsing works [67, 63], we do not use squared norms, to make it directly interpretable in terms of endpoint distance.
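As a sanity check, the structural distance is easy to implement and to test on a toy case. A hypothetical helper, with segments given as arrays of two endpoints:

```python
import numpy as np

def structural_distance(l1, l2):
    """d_s between two segments, each a (2, 2) array of endpoints; takes the
    minimum over the two possible endpoint orderings."""
    (a1, a2), (b1, b2) = np.asarray(l1, float), np.asarray(l2, float)
    straight = np.linalg.norm(a1 - b1) + np.linalg.norm(a2 - b2)
    flipped = np.linalg.norm(a1 - b2) + np.linalg.norm(a2 - b1)
    return min(straight, flipped)

# same segment with flipped endpoints, one endpoint off by 1 pixel
print(structural_distance([[0, 0], [10, 0]], [[10, 0], [0, 1]]))  # 1.0
```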

Appendix

In the following, we provide additional details about our Self-Supervised Occlusion-Aware Line Description and Detection (SOLD2). Section A describes the generation of the synthetic dataset used to pretrain the network. Section B details our network architecture. Section C covers the multi-task approach used to balance our different losses. Section D explains in detail some parts of the line segment detection module. Section E gives the exact equations used to compute the evaluation metrics considered in this work. Section F provides evidence that our results are statistically meaningful. Section G describes how we preprocessed the ETH3D dataset. Section H discusses the feasibility of applying our method to the homography estimation task. Finally, Section I displays qualitative examples of our line detections and matches compared to other baselines.

A. Synthetic dataset examples and homography generation

We provide here some examples of the images in our synthetic dataset and the homographies we used in both data augmentation and homography adaptation. These shapes include polygons, cubes, stars, lines, checkerboards, and stripes. Figure 8 shows some examples of these rendered shapes.

We follow the same process as in SuperPoint [7] to generate the random homographies. They are generated as a composition of simple transformations with pre-defined ranges: scaling (normal distribution $\mathcal{N}(1.0, 0.1)$), translation (uniform distribution within the image boundaries), rotation (uniformly in $[-90°, +90°]$), and perspective transforms. Examples of the difficulty of the test set can be observed in Figures 11 and 12 of the supplementary material.

B. Network architecture

We provide here more details about our architecture and parameter choices. To have a fair comparison with most wireframe parsing methods [67, 63, 30], we use the same stacked hourglass network as in [39]. Given an image with resolution $h \times w$, the output of the backbone encoder is an $\frac{h}{4} \times \frac{w}{4} \times 256$ feature map. The three heads of the network are implemented as follows:

Junction branch: It is composed of a $3 \times 3$ convolution with stride 2 and 256 channels, followed by a $1 \times 1$ convolution with stride 1 and 65 channels, to finally get the $\frac{h}{8} \times \frac{w}{8} \times 65$ junction map.

Heatmap branch: To keep a light network and avoid artifacts from transposed convolutions, we use two consecutive subpixel shuffle [51] blocks to perform a $\times 4$ upsampling. More precisely, we use two $3 \times 3$ conv layers with output channel sizes 256 and 64, each of them followed by batch normalization, ReLU activation and a $\times 2$ subpixel shuffle for upsampling. A final $1 \times 1$ convolution with output channel 1 and sigmoid activation is then used to get the final line heatmap of resolution $h \times w$.

Descriptor branch: The backbone encoding is processed by two consecutive convolutions with kernels $3 \times 3$ and $1 \times 1$ and output channels 256 and 128, to produce a $\frac{h}{4} \times \frac{w}{4} \times 128$ feature descriptor map. This semi-dense map can later be bilinearly interpolated at any point location. The triplet loss is optimized with a margin $M = 1$ and a minimal distance to the hardest negative of $T = 8$ pixels.

We use ReLU activations after each convolution and optimize the network with the Adam solver [23]. Images are resized to a 512 × 512 resolution and converted to grayscale during training.

C. Multi-task learning

The tasks of detecting lines, their junctions, and describing them are diverse, and we assume them to have a different homoscedastic aleatoric uncertainty. Additionally, they can have different orders of magnitude and their relative values are changing during training, in particular when the descriptor branch is added to the pre-trained detector network. Therefore, we chose to use the multi-task loss introduced by Kendall et al. [21] and successfully used in other geometrical tasks [20, 43], to automatically adjust the weights of the losses during training.

The final weights of Equation (8) gracefully converged towards the inverse of each loss, such that the value of each loss multiplied by its weight is around 1. The final weight values are the following: $e^{-w_{junc}} = 7.2$, $e^{-w_{line}} = 16.3$ and $e^{-w_{desc}} = 8.2$. To show the effectiveness of the dynamic weighting, we tried two variants: (1) all loss weights are 1, and (2) we used the final values from the dynamic weighting as static loss weights. In the first case, the detection and description results are worse by at least 10% and 5.5%, respectively. In the second case, the detection and description results are worse by at least 6.7% and 76.2%, respectively.

D. Line segment detection details

To convert the raw junctions and line heatmap predicted by our network into line segments, the following steps are performed sequentially: regular sampling of points along each line, adaptive local-maximum search, and accepting the lines verifying a minimum average score and inlier ratio. Additionally, an initial step called candidate selection can be used to pre-filter some of the line candidates $\hat{L}_{cand}$. We describe here two of these steps in more detail: the adaptive local-maximum search and the candidate selection.

Adaptive local-maximum search. Given a set of points sampled along a candidate line segment, one wants to extract the line heatmap values at these sampling locations. However, since the heatmap is limited to a resolution of one pixel, some samples may get a lower heatmap value if they land next to the actual line. Thus, we instead use an adaptive local-maximum search to find the highest activation of the line heatmap around each sampling location. Given a line segment $\hat{l} = (\hat{e}^1, \hat{e}^2)$ from the candidate set $\hat{L}_{cand}$ in an image of size $h \times w$, the search radius $r$ is defined as:

$$r = r_{min} + \lambda \frac{\left\| \hat{e}^1 - \hat{e}^2 \right\|}{\sqrt{h^2 + w^2}}$$

where $r_{min} = \frac{\sqrt{2}}{2}$ is the minimum search radius and $\lambda$ is a hyper-parameter adjusting the linear dependency on the segment length. We used $\lambda = 3$ pixels in all experiments. The optimal line parameters were selected by a grid search on the validation set. The $r_{min}$ parameter can in particular be kept constant across different image resolutions without performance degradation.
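In code, the radius computation is a one-liner; a small sketch with a worked example on a diagonal segment (function name ours):

```python
import numpy as np

def search_radius(e1, e2, h, w, r_min=np.sqrt(2) / 2, lam=3.0):
    """Adaptive search radius: grows linearly with the segment length,
    normalized by the image diagonal."""
    length = np.linalg.norm(np.asarray(e1, float) - np.asarray(e2, float))
    return r_min + lam * length / np.hypot(h, w)

# a segment spanning the full diagonal of a 512x512 image
print(search_radius((0, 0), (512, 512), 512, 512))  # ~0.71 + 3 = 3.71 pixels
```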

Candidate selection (CS). In some applications requiring line matching, having multiple overlapping segments may hinder the matching, as the descriptor will have a harder job at discriminating close lines. Therefore, a non-maximum suppression (NMS) mechanism is necessary for lines. Unlike point or bounding box NMS, there is no well-established procedure for line NMS. Contrary to usual NMS methods, which are used as postprocessing steps, we implement our line NMS as a preprocessing step, which actually speeds up the overall line segment detection as it removes some line candidates. Starting from the initial line candidate set $\hat{L}_{cand}$, we remove the line segments containing other junctions between their two endpoints. To identify whether a junction lies on a line segment, we first project the junction onto the line and check if it falls within the segment boundaries. When it does, the junction is considered to be on the line segment if it is at a distance of less than $\xi_{cs}$ pixels from the line. Throughout our experiments, we adopted $\xi_{cs} = 3$ pixels.
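The geometric test at the heart of the candidate selection can be sketched as follows (a hypothetical helper; junctions and endpoints as 2D arrays):

```python
import numpy as np

def junction_on_segment(p, e1, e2, xi_cs=3.0):
    """True if junction p projects inside segment (e1, e2) and lies within
    xi_cs pixels of the supporting line."""
    p, e1, e2 = (np.asarray(x, float) for x in (p, e1, e2))
    d = e2 - e1
    t = np.dot(p - e1, d) / np.dot(d, d)   # normalized position along the segment
    if not 0.0 <= t <= 1.0:                # projection outside the endpoints
        return False
    return np.linalg.norm(p - (e1 + t * d)) < xi_cs  # orthogonal distance test

print(junction_on_segment((5, 1), (0, 0), (10, 0)))   # True: 1 px off the line
print(junction_on_segment((15, 0), (0, 0), (10, 0)))  # False: beyond an endpoint
```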

E. Detector evaluation metrics

Similarly to the metrics introduced in SuperPoint [7], we propose line segment repeatability and localization error metrics. Both of these metrics are computed using pairs of images $I_1$ and $I_2$, where $I_2$ is a warped version of $I_1$ under a homography $H$. Each image is associated with a set of line segments $L_1 = \{l_m^1\}_{m=1}^{M_1}$ and $L_2 = \{l_m^2\}_{m=1}^{M_2}$, and $d$ refers to one of the two line distances defined in this work: the structural distance $d_s$ and the orthogonal distance $d_{orth}$.

Repeatability: The repeatability measures how often a line can be re-detected in different views. The repeatability with tolerance $\epsilon$ is defined as:

$$\forall l \in L_1,\quad C_{L_2}(l) = \begin{cases} 1 & \text{if } \min_{l_j^2 \in L_2} d(l, l_j^2) \leq \epsilon \\ 0 & \text{otherwise} \end{cases}$$

$$Rep\text{-}\epsilon = \frac{\sum_{i=1}^{M_1} C_{L_2}(l_i^1) + \sum_{j=1}^{M_2} C_{L_1}(l_j^2)}{M_1 + M_2}$$

Localization error: The localization error with tolerance $\epsilon$ is the average line distance between a line and its re-detection in another view:

$$LE\text{-}\epsilon = \frac{\sum_{j \in Corr} \min_{l_i^1 \in L_1} d(l_i^1, l_j^2)}{|Corr|}$$

$$Corr = \{ j \mid C_{L_1}(l_j^2) = 1,\ l_j^2 \in L_2 \}$$

where $|\cdot|$ denotes the cardinality of a set.
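Both metrics reduce to nearest-neighbor distances between the two line sets. A minimal sketch, assuming the segments of both images have already been mapped into a common frame with the ground truth homography; `dist_fn` can be, e.g., the `structural_distance` helper sketched above:

```python
import numpy as np

def repeatability_and_loc_error(lines1, lines2, dist_fn, eps=5.0):
    """Return (Rep-eps, LE-eps) for two sets of segments in a common frame."""
    def min_dists(src, dst):
        return np.array([min(dist_fn(l, l2) for l2 in dst) for l in src])
    d12 = min_dists(lines1, lines2)  # each line of image 1 to its closest in image 2
    d21 = min_dists(lines2, lines1)  # and vice versa
    rep = ((d12 <= eps).sum() + (d21 <= eps).sum()) / (len(lines1) + len(lines2))
    redetected = d21[d21 <= eps]     # the correspondence set Corr
    loc_error = redetected.mean() if len(redetected) else float('nan')
    return rep, loc_error
```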

F. Statistical evaluation of our method

All the experiments displayed in the main paper are from a single training run. To justify our statistical improvement over the baselines, we re-trained the full detector and descriptor network 5 times with different random seeds each time. Figure 9 displays the same evaluation on the Wireframe dataset [18] as in our paper with the mean and standard deviation of the performance of our method over these 5 runs, and thus shows statistically meaningful results.

G. ETH3D dataset preprocessing

The ETH3D dataset [49] is composed of 13 scenes captured in indoor as well as outdoor environments. Each image comes with the corresponding camera intrinsics and depth map, and a 3D model of each scene built with COLMAP [48] is provided as well. We use the undistorted images downsampled by a factor of 8 to run the line detection and description. We then use the depth maps and camera intrinsics to reproject the lines in 3D and compute the descriptor metrics in 3D space. While the depth maps have been obtained from a high-precision laser scanner, they contain some holes, in particular close to depth discontinuities. Since these discontinuities are actually where lines are often located, we inpaint the depth in all of the invalid areas that are at most 10 pixels away from a valid depth pixel. We used NLSPN [40], the current state of the art in deep depth inpainting guided by RGB images.

H. Homography estimation experiment

To validate the real-world applications of our method, we used the line segment detections and descriptors to match segments across pairs of images of the Wireframe dataset [18] related by a homography, and estimate the homography with RANSAC [12]. We sample minimal sets of 4 lines to fit a homography and run up to 1,000,000 iterations with the LO-RANSAC [26] implementation of Sattler et al.*. The reprojection error is computed with the orthogonal line distance. We compute the accuracy of the homography estimation similarly as in SuperPoint [7], by warping the four corners of the image with the estimated homography, warping them back to the initial image with the ground truth homography, and computing the reprojection error of the corners. We consider the estimated homography to be correct if the average reprojection error is less than 3 pixels. The results are listed in Table 4.

* https://github.com/tsattler/RansacLib
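The corner-based accuracy criterion described above can be sketched as follows, with homographies as 3x3 arrays mapping image 1 to image 2 (function name ours):

```python
import numpy as np

def homography_correct(H_est, H_gt, h, w, tol=3.0):
    """Warp the four image corners with the estimated homography, map them
    back with the ground truth one, and accept the estimate if the mean
    corner reprojection error is below tol pixels."""
    corners = np.array([[0, 0], [w, 0], [w, h], [0, h]], float)
    pts_h = np.hstack([corners, np.ones((4, 1))])     # homogeneous coordinates
    back = (np.linalg.inv(H_gt) @ H_est @ pts_h.T).T  # forward, then back
    back = back[:, :2] / back[:, 2:]                  # de-homogenize
    return np.linalg.norm(back - corners, axis=1).mean() < tol
```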

When compared on LSD lines [57], our descriptors provide the highest accuracy among all baselines, and our full pipeline achieves a similar performance. When using our lines, we use a similar refinement of the junctions as in LSD [57]: we sample small perturbations of the endpoints by a quarter of a pixel and keep the perturbed endpoints maximizing the line average score. Similarly to feature point methods [7, 9], this experiment shows that learned features are still on par with or slightly worse than handcrafted detections in terms of localization error.

We also add to the comparison the results of homography estimation for a learned feature point detector and descriptor, SuperPoint [7]. The point-based approach performs significantly worse than our method, due to the numerous textureless scenes and repeated structures present in the Wireframe dataset. We also found that SuperPoint is not robust to rotations above 45 degrees, while our line descriptor can leverage its ordered sequence of descriptors to achieve invariance with respect to any rotation.

I. Qualitative results of line segment detections and matches

We provide some visualizations of the line segment detection results in Figure 11 and of the line matching in Figure 10. Figure 12 also offers a comparison of line matches with point matches in challenging images with low texture, and repeated structures. Our method is able to match enough lines to obtain an accurate pose estimation, while point-based methods such as SuperPoint [7] fail in such scenarios.
