NetVLAD: CNN Architecture for Weakly Supervised Place Recognition
Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pajdla, Josef Sivic; The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 5297-5307
https://ieeexplore.ieee.org/document/7937898
Abstract:
We tackle the problem of large scale visual place recognition, where the task is to quickly and accurately recognize the location of a given query photograph.
We present the following four principal contributions.
First, we develop a convolutional neural network (CNN) architecture that is trainable in an end-to-end manner directly for the place recognition task.
The main component of this architecture, NetVLAD, is a new generalized VLAD layer, inspired by the “Vector of Locally Aggregated Descriptors” image representation commonly used in image retrieval.
The layer is readily pluggable into any CNN architecture and amenable to training via backpropagation.
Second, we create a new weakly supervised ranking loss, which enables end-to-end learning of the architecture’s parameters from images depicting the same places over time downloaded from Google Street View Time Machine.
Third, we develop an efficient training procedure which can be applied on very large-scale weakly labelled tasks.
Finally, we show that the proposed architecture and training procedure significantly outperform non-learnt image representations and off-the-shelf CNN descriptors on challenging place recognition and image retrieval benchmarks.
SECTION 1 Introduction
Visual place recognition has received a significant amount of attention in recent years in both the computer vision [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11] and robotics [12], [13], [14], [15], [16] communities, motivated by, e.g., applications in autonomous driving [14], augmented reality [17] or geo-localizing archival imagery [18].
The place recognition problem, however, still remains extremely challenging.
How can we recognize the same street corner across an entire city, or on the scale of an entire country, despite the fact that it can be captured under different illumination or change its appearance over time?
The fundamental scientific question is: what is the appropriate representation of a place that is rich enough to distinguish similar-looking places, yet compact enough to represent entire cities or countries?
The place recognition problem has been traditionally cast as an instance retrieval task, where the query image location is estimated using the locations of the most visually similar images obtained by querying a large geotagged database [1], [2], [3], [8], [9], [10].
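The retrieval formulation above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `localize_query`, the descriptor dimensionality and the synthetic database are hypothetical, and real systems use compressed, indexed descriptors rather than a brute-force scan.

```python
import numpy as np

def localize_query(query_desc, db_descs, db_latlon, top_k=3):
    """Return indices and geotags of the top_k most similar database images.

    query_desc: (D,) descriptor of the query image
    db_descs:   (M, D) descriptors of the geotagged database images
    db_latlon:  (M, 2) geotag of each database image
    """
    # L2-normalise so that Euclidean ranking matches cosine-similarity ranking.
    q = query_desc / np.linalg.norm(query_desc)
    db = db_descs / np.linalg.norm(db_descs, axis=1, keepdims=True)
    dists = np.sum((db - q) ** 2, axis=1)   # squared distance to every database image
    order = np.argsort(dists)[:top_k]       # brute-force nearest neighbours
    return order, db_latlon[order]

# Synthetic data: the query is a slightly perturbed copy of database image 42.
rng = np.random.default_rng(0)
db_descs = rng.normal(size=(100, 16))
db_latlon = rng.uniform(-1.0, 1.0, size=(100, 2))
query = db_descs[42] + 0.01 * rng.normal(size=16)
idx, latlon = localize_query(query, db_descs, db_latlon)
```

The query location would then be estimated from the returned geotags, for example by taking the geotag of the top-ranked image.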
Each database image is represented using local invariant features [19] such as SIFT [20] that are aggregated into a fixed length vector representation for the entire image such as bag-of-visual-words [21], [22], VLAD [23], [24] or Fisher vector [25], [26].
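As a concrete illustration of aggregating local descriptors into a fixed-length vector, the following is a minimal sketch of classic VLAD with hard assignment. It assumes the cluster centres were learnt beforehand (e.g. with k-means) and uses a single final L2 normalisation, which is only one of several normalisation schemes in use; the function name `vlad` and the toy inputs are illustrative.

```python
import numpy as np

def vlad(descriptors, centers):
    """Aggregate (N, D) local descriptors into a (K*D,) VLAD vector.

    centers: (K, D) cluster centres, assumed to be precomputed.
    """
    # Hard-assign each descriptor to its nearest cluster centre.
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)

    K, D = centers.shape
    V = np.zeros((K, D))
    for k in range(K):
        members = descriptors[assign == k]
        if len(members):
            # Sum of residuals between the descriptors and their centre.
            V[k] = (members - centers[k]).sum(axis=0)

    v = V.ravel()
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

descs = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]])
cents = np.array([[0.0, 0.0], [10.0, 10.0]])
v = vlad(descs, cents)
```

The output length depends only on the number of centres and the descriptor dimension, not on how many local descriptors the image produced.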
The resulting representation is then usually compressed and efficiently indexed [21], [27].
The image database can be further augmented by 3D structure that enables recovery of accurate camera pose [4], [11], [28].
In the last few years, convolutional neural networks (CNNs) [29], [30] have emerged as powerful image representations for various category-level recognition tasks such as object classification [31], [32], [33], [34], scene recognition [35] or object detection [36].
The basic principles of CNNs have been known since the 1980s [29], [30], and the recent successes result from a combination of advances in GPU-based computation power and large labelled image datasets [31].
It has been shown that the trained representations are, to some extent, transferable between recognition tasks [32], [36], [37], [38], [39], and direct application of CNN representations trained for object classification [31] as black-box descriptor extractors has brought some improvements in performance on instance-level recognition tasks [40], [41], [42], [43], [44], [45], [46], [47].
In this work we investigate whether the performance can be further improved by CNN representations developed and trained directly for place recognition.
This requires addressing the following four main challenges: First, what is a good CNN architecture for place recognition?
Second, how can we gather a sufficient amount of annotated data for training?
Third, how can we train the developed architecture in an end-to-end manner tailored for the place recognition task?
Fourth, how can we perform computationally efficient training in order to scale up to very large datasets?
To address these challenges, we bring the following four innovations.
First, building on the lessons learnt from the current well-performing hand-engineered object retrieval and place recognition pipelines [10], [23], [48], [49], we develop a convolutional neural network architecture for place recognition that aggregates mid-level (conv5) convolutional features extracted from the entire image into a compact fixed length vector representation amenable to efficient indexing.
To achieve this, we design a new trainable generalized VLAD layer, NetVLAD, inspired by the Vector of Locally Aggregated Descriptors (VLAD) representation [24] that has shown excellent performance in image retrieval and place recognition.
The layer is readily pluggable into any CNN architecture and amenable to training via backpropagation.
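This excerpt does not spell out the layer's equations, so the following forward-pass sketch rests on an assumption: that the layer replaces VLAD's hard assignment with a differentiable softmax soft-assignment, followed by per-cluster (intra) normalisation and a final L2 normalisation. The parameter names `W`, `b` and `C` are illustrative, and a trainable version would be written in an autodiff framework rather than plain NumPy.

```python
import numpy as np

def netvlad_forward(x, W, b, C):
    """Forward pass of a soft-assignment VLAD layer (illustrative sketch).

    x: (N, D) local descriptors, e.g. conv5 activations at N spatial positions
    W: (K, D), b: (K,) soft-assignment parameters (assumed trainable)
    C: (K, D) cluster centres (assumed trainable)
    """
    logits = x @ W.T + b                          # (N, K)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    a = np.exp(logits)
    a /= a.sum(axis=1, keepdims=True)             # softmax over clusters

    # V[k] = sum_i a[i, k] * (x[i] - C[k]), written without an explicit loop.
    V = a.T @ x - a.sum(axis=0)[:, None] * C      # (K, D)

    V /= np.linalg.norm(V, axis=1, keepdims=True) + 1e-12   # intra-normalisation
    v = V.ravel()
    return v / (np.linalg.norm(v) + 1e-12)        # final L2 normalisation

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 8))    # 50 local descriptors of dimension 8
W = rng.normal(size=(4, 8))     # 4 clusters
b = rng.normal(size=4)
C = rng.normal(size=(4, 8))
v = netvlad_forward(x, W, b, C)
```

Every operation here is differentiable with respect to `W`, `b`, `C` and `x`, which is what makes such a layer trainable by backpropagation.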
The resulting aggregated representation is then compressed using Principal Component Analysis (PCA) to obtain the final compact descriptor of the image.
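The PCA compression step can be sketched as follows. This is a plain PCA projection fitted with an SVD, followed by re-normalisation; whether whitening is also applied, and the output dimensionality, are not stated in this passage, so the values below are assumptions for illustration only.

```python
import numpy as np

def fit_pca(X, out_dim):
    """Fit PCA on (M, D) training descriptors; return the mean and top components."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions, sorted by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:out_dim]

def compress(x, mean, components):
    """Project descriptors onto the principal components and L2-normalise."""
    z = (x - mean) @ components.T
    return z / (np.linalg.norm(z, axis=-1, keepdims=True) + 1e-12)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))          # stand-in for aggregated image descriptors
mean, comps = fit_pca(X, out_dim=8)     # out_dim=8 is an arbitrary illustrative choice
Z = compress(X, mean, comps)
```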
Second, to train the architecture for place recognition, we gather a large dataset of multiple panoramic images depicting the same place from different viewpoints over time from the Google Street View Time Machine.
Such data is available for vast areas of the world, but provides only a weak form of supervision: we know the two panoramas are captured at approximately similar positions based on their (noisy) GPS, but we don't know which parts of the panoramas depict the same parts of the scene.
Third, we create a new loss function, which enables end-to-end learning of the architecture’s parameters, tailored for the place recognition task from the weakly labelled Time Machine imagery.
The loss function is also more widely applicable to other ranking tasks where large amounts of weakly labelled data are available.
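The loss itself is not defined in this excerpt. One plausible form of a ranking loss under weak supervision, assumed here for illustration, is a triplet-style hinge loss in which the query only needs to be closer to its best-matching potential positive (an image with a nearby GPS tag, which may or may not show the same scene) than to each definite negative (an image taken far away). The function name and the margin value are hypothetical.

```python
import numpy as np

def weak_ranking_loss(q, potential_positives, negatives, margin=0.1):
    """Hinge ranking loss with weak positives (illustrative sketch).

    q:                   (D,) query descriptor
    potential_positives: (P, D) descriptors of geographically nearby images
    negatives:           (N, D) descriptors of geographically distant images
    margin:              assumed value, not taken from the text
    """
    d_pos = np.sum((potential_positives - q) ** 2, axis=1)
    d_neg = np.sum((negatives - q) ** 2, axis=1)
    # Only the closest potential positive is required to match the query.
    best_pos = d_pos.min()
    # Each negative closer than best_pos + margin contributes a penalty.
    return np.maximum(best_pos + margin - d_neg, 0.0).sum()

q = np.array([0.0, 0.0])
positives = np.array([[1.0, 0.0], [0.1, 0.0]])
negatives = np.array([[0.2, 0.0], [2.0, 0.0]])
loss = weak_ranking_loss(q, positives, negatives)
```

In this toy case only the first negative violates the margin, so it alone contributes to the loss; the distant negative contributes nothing.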
Fourth, we develop an efficient learning procedure which can be applied on very large-scale weakly labelled tasks.
It requires only a fraction of the computational time of a naive implementation thanks to improved data efficiency through hard negative mining, combined with an effective use of caching.
The resulting representation is robust to changes in viewpoint and lighting conditions, while simultaneously learning to focus on the relevant parts of the image such as the building façades and the skyline, and to ignore confusing elements such as cars and people that may occur at many different places (Fig. 1).
We show that the proposed architecture and training procedure significantly outperform non-learnt image representations and off-the-shelf CNN descriptors on challenging place recognition and image retrieval benchmarks.
1.1 Related Work
While there have been many improvements in designing better image retrieval [22], [23], [24], [25], [48], [49], [50], [51], [52], [53], [54], [55], [56], [57], [58], [59], [60], [61], [62] and place recognition [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16] systems, not many works have performed learning for these tasks.
All relevant learning-based approaches fall into one or both of the following two categories: (i) learning for an auxiliary task (e.g., some form of distinctiveness of local features [2], [9], [12], [63], [64], [65], [66]), and (ii) learning on top of shallow hand-engineered descriptors that cannot be fine-tuned for the target task [2], [6], [7], [48], [67].
Both of these are opposite in spirit to the core idea behind deep learning that has provided a major boost in performance in various recognition tasks: end-to-end learning.
We will indeed show in Section 6.2 that training representations directly for the end-task, place recognition, is crucial for obtaining good performance.
Numerous works concentrate on learning better local descriptors or metrics to compare them [56], [59], [68], [69], [70], [71], [72], [73], [74], [75], but even though some of them show results on image retrieval, the descriptors are learnt on the task of matching local image patches, and not directly with image retrieval in mind.
Some of them also make use of hand-engineered features to bootstrap the learning, i.e., to provide noisy training data [56], [59], [69], [70], [74].
Several works have investigated using CNN-based features for image retrieval.
These include treating activations from certain layers directly as descriptors by concatenating them [43], [76], or by pooling [40], [41], [42], [45], [46], [47].
However, none of these works actually train the CNNs for the task at hand, but use CNNs as black-box descriptor extractors.
One exception is the work of Babenko et al. [76] in which the network is fine-tuned on an auxiliary task of classifying 700 landmarks.
However, again the network is not trained directly on the target retrieval task.
Very recent works [77], [78], published after the first version of this paper [79], train CNNs end-to-end for image retrieval by making use of image correspondences obtained from structure-from-motion models, i.e., they rely on pre-existing image retrieval pipelines based on precise matching of RootSIFT descriptors, spatial verification and bundle adjustment.
Weyand et al. [80] proposed a CNN-based method for geo-localization by partitioning the Earth into cells and treating place recognition as a classification task.
While providing impressive rough city/country level estimates of where a photo is taken, their method is not capable of providing several-meter accuracy place recognition that we consider here, as their errors are measured in tens and hundreds of kilometres.
Finally, [81] and [82] performed end-to-end learning for different but related tasks of ground-to-aerial matching [82] and camera pose estimation [81].
Scene recognition with NetVLAD
https://www.jianshu.com/p/7d48bff4d1c3
Paper notes: NetVLAD: CNN architecture for weakly supervised place recognition
http://www.liuxiao.org/2019/02/%E8%AE%BA%E6%96%87%E7%AC%94%E8%AE%B0%EF%BC%9Anetvlad-cnn-architecture-for-weakly-supervised-place-recognition/
A simple Python implementation of VLAD
https://github.com/Lithogenous/VLAD-SIFT-python/blob/master/vlad_raw.py