[晓理紫]每日论文分享(有中文摘要,源码或项目地址)--大模型、扩散模型、视觉

专属领域论文订阅

关注{晓理紫|小李子},每日更新论文,如感兴趣,请转发给有需要的同学,谢谢支持

如果你感觉对你有所帮助,请关注我,每日准时为你推送最新论文。

为了答谢各位网友的支持,从今日起免费为300名读者提供订阅主题论文服务,只需VX关注公号并回复{邮箱+论文主题}(如:[email protected] + chatgpt@large language model @LLM),主题必须是同一个领域,最多三个关键词。解释权归博主所有


分类:

  • 大语言模型LLM
  • 视觉模型VLM
  • 扩散模型
  • 视觉语言导航VLN
  • 强化学习 RL
  • 模仿学习 IL
  • 机器人
  • 开放词汇检测与分割

== LLM ==

标题: Paramanu: A Family of Novel Efficient Indic Generative Foundation Language Models

作者: Mitodru Niyogi, Arnab Bhattacharya

PubTime: 2024-01-31

Downlink: http://arxiv.org/abs/2401.18034v1

Project: https://www.bharatgpts.com|

中文摘要: 我们介绍了Gyan AI Paramanu(“atom”),这是一个面向印度语言的新型语言模型家族。它是一组自回归的单语、双语和多语言印度语模型,在单个GPU上从头开始预训练,覆盖10种印度语言(阿萨姆语、孟加拉语、印地语、康卡尼语、迈蒂利语、马拉地语、奥里亚语、梵语、泰米尔语、泰卢固语)和5种文字(孟加拉文、天城文、奥里亚文、泰米尔文、泰卢固文),参数量从13.29M到367.5M不等。这些模型在单个GPU上以1024的上下文长度进行预训练,非常高效、小巧、快速且强大。我们还开发了一个高效、先进的印度语分词器,甚至可以对未见过的语言进行分词。为避免多语言模型mParamanu中的“多语言诅咒”,我们按语言类型学分组,在使用相同文字的可比语料库上进行预训练。我们对预训练模型在孟加拉语、印地语和梵语上的开放式文本生成进行了人工评估,指标包括语法、连贯性、创造性和真实性。尽管规模比标准7B LLM小20到66倍,我们的孟加拉语、印地语和梵语模型仍大幅优于GPT-3.5-Turbo(ChatGPT)、Bloom 7B、LLaMa-2 7B、OPT 6.7B、GPT-J 6B、GPTNeo 1.3B、GPT2-XL等大语言模型(LLM)。在我们的预训练模型上运行推理只需CPU,不需要GPU。我们还用各自语言的23k条指令对预训练的孟加拉语、印地语、马拉地语、泰米尔语和泰卢固语模型进行了指令微调。我们的预训练和指令微调模型是同类模型中的首创,是迄今为止为印度语言开发的最强大、最高效的小型生成语言模型,各项结果表明:不依赖大量算力和庞大参数量也可以得到高质量的生成语言模型。我们计划在https://www.bharatgpts.com发布我们的模型。

摘要: We present Gyan AI Paramanu (“atom”), a family of novel language models for Indian languages. It is a collection of auto-regressive monolingual, bilingual, and multilingual Indic language models pretrained from scratch on a single GPU for 10 Indian languages (Assamese, Bangla, Hindi, Konkani, Maithili, Marathi, Odia, Sanskrit, Tamil, Telugu) across 5 scripts (Bangla, Devanagari, Odia, Tamil, Telugu) of varying sizes ranging from 13.29M to 367.5M.The models are pretrained with a context size of 1024 on a single GPU. The models are very efficient, small, fast, and powerful. We have also developed an efficient most advanced Indic tokenizer that can even tokenize unseen languages. In order to avoid the “curse of multi-linguality” in our multilingual mParamanu model, we pretrained on comparable corpora by typological grouping using the same script. We performed human evaluation of our pretrained models for open end text generation on grammar, coherence, creativity, and factuality metrics for Bangla, Hindi, and Sanskrit. Our Bangla, Hindi, and Sanskrit models outperformed GPT-3.5-Turbo (ChatGPT), Bloom 7B, LLaMa-2 7B, OPT 6.7B, GPT-J 6B, GPTNeo 1.3B, GPT2-XL large language models (LLMs) by a large margin despite being smaller in size by 66 to 20 times compared to standard 7B LLMs. To run inference on our pretrained models, CPU is enough, and GPU is not needed. We also instruction-tuned our pretrained Bangla, Hindi, Marathi, Tamil, and Telugu models on 23k instructions in respective languages. Our pretrained and instruction-tuned models which are first of its kind, most powerful efficient small generative language models ever developed for Indic languages, and the various results lead to the conclusion that high quality generative language models are possible without high amount of compute power and humongous number of parameters. We plan to release our models at https://www.bharatgpts.com.


标题: An Empirical Study of Scaling Law for OCR

作者: Miao Rang, Zhenni Bi, Chuanjian Liu

PubTime: 2024-01-31

Downlink: http://arxiv.org/abs/2401.00028v3

GitHub: https://github.com/large-ocr-model/large-ocr-model.github.io|

中文摘要: 模型大小、数据量、计算量与模型性能之间的规律已经在自然语言处理(NLP)领域得到了广泛研究。然而,光学字符识别(OCR)中的缩放定律尚未被研究。为此,我们开展了一项全面的研究,考察了文本识别领域中性能与模型规模、数据量和计算量之间的相关性。研究最终证明,在其他影响因素保持不变时,性能与模型大小以及训练数据量之间存在平滑的幂律关系。此外,我们还构建了一个名为REBU-Syn的大规模数据集,包含600万个真实样本和1800万个合成样本。基于我们的缩放定律和新数据集,我们成功训练了一个场景文本识别模型,在6个常用测试基准上达到了新的最先进水平,top-1平均准确率为97.42%。模型和数据集已在https://github.com/large-ocr-model/large-ocr-model.github.io上公开。

摘要: The laws of model size, data volume, computation and model performance have been extensively studied in the field of Natural Language Processing (NLP). However, the scaling laws in Optical Character Recognition (OCR) have not yet been investigated. To address this, we conducted comprehensive studies that involved examining the correlation between performance and the scale of models, data volume and computation in the field of text recognition. Conclusively, the study demonstrates smooth power laws between performance and model size, as well as training data volume, when other influencing factors are held constant. Additionally, we have constructed a large-scale dataset called REBU-Syn, which comprises 6 million real samples and 18 million synthetic samples. Based on our scaling law and new dataset, we have successfully trained a scene text recognition model, achieving a new state-of-the-art on 6 common test benchmarks with a top-1 average accuracy of 97.42%. The models and dataset are publicly available at https://github.com/large-ocr-model/large-ocr-model.github.io.
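
下面是一个极简示意(非论文代码),说明摘要中“性能与模型规模、训练数据量之间的平滑幂律”通常如何拟合:在对数-对数空间做线性回归,得到形如 E ≈ a·N^b 的关系。示例中的数据点为虚构数值,仅用于演示拟合流程。

```python
import numpy as np

# 虚构的示例数据:模型参数量 N 与对应的错误率 E(并非论文中的实验结果)
N = np.array([1e6, 5e6, 2e7, 1e8, 5e8])      # 模型规模
E = np.array([0.30, 0.22, 0.15, 0.10, 0.07])  # 1 - 准确率

# 幂律 E = a * N^b 在对数空间变为线性:log E = log a + b * log N
b, log_a = np.polyfit(np.log(N), np.log(E), deg=1)
a = np.exp(log_a)
print(f"拟合结果: E ≈ {a:.3g} * N^({b:.3f})")

# 用拟合的幂律外推更大模型的预期错误率
print("外推 N=2e9:", a * (2e9) ** b)
```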


标题: Federated Full-Parameter Tuning of Billion-Sized Language Models with Communication Cost under 18 Kilobytes

作者: Zhen Qin, Daoyuan Chen, Bingchen Qian

PubTime: 2024-01-31

Downlink: http://arxiv.org/abs/2312.06353v3

GitHub: https://github.com/alibaba/FederatedScope/tree/FedKSeed|

中文摘要: 预训练的大语言模型(LLM)需要微调来提升其对自然语言指令的响应能力。联邦学习提供了一种利用终端设备上丰富数据微调LLM而不损害数据隐私的途径。现有的大多数LLM联邦微调方法依赖参数高效微调技术,其性能可能达不到全参数微调的高度。然而,由于巨大的通信成本,LLM的联邦全参数微调并非易事。这项工作提出了FedKSeed,它采用基于一组有限随机种子的零阶优化,将服务器与客户端之间需要传输的内容显著降低到仅有几个随机种子和标量梯度,总计只有几千字节,使十亿参数级LLM在设备上的联邦全参数微调成为可能。在此基础上,我们提出了一种按概率区分的种子采样策略,优先考虑对模型精度影响更大的扰动。在六个场景下、使用不同LLM、数据集和数据划分进行的实验表明,我们的方法在通信效率和新任务泛化方面都优于现有的联邦LLM微调方法。

摘要: Pre-trained large language models (LLMs) need fine-tuning to improve their responsiveness to natural language instructions. Federated learning offers a way to fine-tune LLMs using the abundant data on end devices without compromising data privacy. Most existing federated fine-tuning methods for LLMs rely on parameter-efficient fine-tuning techniques, which may not reach the performance height possible with full-parameter tuning. However, federated full-parameter tuning of LLMs is a non-trivial problem due to the immense communication cost. This work introduces FedKSeed that employs zeroth-order optimization with a finite set of random seeds. It significantly reduces transmission requirements between the server and clients to just a few random seeds and scalar gradients, amounting to only a few thousand bytes, making federated full-parameter tuning of billion-sized LLMs possible on devices. Building on it, we develop a strategy enabling probability-differentiated seed sampling, prioritizing perturbations with greater impact on model accuracy. Experiments across six scenarios with various LLMs, datasets and data partitions demonstrate that our approach outperforms existing federated LLM fine-tuning methods in both communication efficiency and new task generalization.
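
下面用一个极简示意(并非FedKSeed官方实现)说明“有限随机种子 + 零阶优化”如何把通信量压到几个种子和标量梯度:客户端只回传种子编号与一个标量,服务器用相同种子重建扰动方向并更新全参数模型。其中的损失函数、参数维度、学习率等均为演示用的假设值。

```python
import numpy as np

rng_global = np.random.default_rng(0)
dim = 1000                      # 为演示而设的参数规模(假设值)
theta = np.zeros(dim)           # 服务器端维护的全参数模型
K, eps, lr = 16, 1e-3, 0.1      # 候选种子数、扰动幅度、学习率(假设值)
seed_pool = list(range(K))

def loss(w, data):
    # 占位损失函数:真实场景中应为 LLM 在本地数据上的损失
    return np.mean((w - data) ** 2)

def client_step(theta, data):
    """客户端:抽一个种子,做两次前向,只回传 (seed, 标量梯度)。"""
    seed = int(rng_global.choice(seed_pool))
    z = np.random.default_rng(seed).standard_normal(dim)   # 由种子唯一确定的扰动方向
    g = (loss(theta + eps * z, data) - loss(theta - eps * z, data)) / (2 * eps)
    return seed, g          # 通信量:一个整数 + 一个标量

def server_update(theta, seed, g):
    """服务器:用相同种子重建 z,再做一步零阶 SGD 更新。"""
    z = np.random.default_rng(seed).standard_normal(dim)
    return theta - lr * g * z

local_data = np.ones(dim)        # 虚构的本地数据
for _ in range(200):
    seed, g = client_step(theta, local_data)
    theta = server_update(theta, seed, g)
print("训练后损失:", loss(theta, local_data))
```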


标题: Efficient Large Language Models: A Survey

作者: Zhongwei Wan, Xin Wang, Che Liu

PubTime: 2024-01-31

Downlink: http://arxiv.org/abs/2312.03863v3

GitHub: https://github.com/AIoT-MLSys-Lab/Efficient-LLMs-Survey|

中文摘要: 大语言模型(LLM)在自然语言理解、语言生成和复杂推理等重要任务中表现出卓越的能力,有可能对我们的社会产生重大影响。然而,这种能力伴随着巨大的资源需求,凸显出迫切需要开发有效的技术来应对其效率挑战。在这篇综述中,我们对高效LLM研究进行了系统而全面的回顾。我们按照由三个主要类别组成的分类法组织文献,分别从以模型为中心、以数据为中心和以框架为中心的角度,涵盖不同但相互关联的高效LLM主题。我们还创建了一个GitHub仓库(https://github.com/AIoT-MLSys-Lab/Efficient-LLMs-Survey),汇总了本综述中收录的论文,并将持续维护、纳入新出现的研究成果。我们希望这篇综述可以作为宝贵的资源,帮助研究人员和从业者系统地了解高效LLM的研究进展,并激励他们为这一重要而令人兴奋的领域做出贡献。

摘要: Large Language Models (LLMs) have demonstrated remarkable capabilities in important tasks such as natural language understanding, language generation, and complex reasoning and have the potential to make a substantial impact on our society. Such capabilities, however, come with the considerable resources they demand, highlighting the strong need to develop effective techniques for addressing their efficiency challenges.In this survey, we provide a systematic and comprehensive review of efficient LLMs research. We organize the literature in a taxonomy consisting of three main categories, covering distinct yet interconnected efficient LLMs topics from model-centric, data-centric, and framework-centric perspective, respectively. We have also created a GitHub repository where we compile the papers featured in this survey at https://github.com/AIoT-MLSys-Lab/Efficient-LLMs-Survey, and will actively maintain this repository and incorporate new research as it emerges. We hope our survey can serve as a valuable resource to help researchers and practitioners gain a systematic understanding of the research developments in efficient LLMs and inspire them to contribute to this important and exciting field.


标题: KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

作者: Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh

PubTime: 2024-01-31

Downlink: http://arxiv.org/abs/2401.18079v1

中文摘要: LLM越来越多地被用于文档分析和摘要等需要大上下文窗口的应用,在这种情况下,KV缓存激活成为推理期间内存消耗的主要来源。量化是压缩KV缓存激活的一种有前景的方法;然而,现有方案无法在超低精度(例如低于4比特)下准确表示激活。在这项工作中,我们提出了KVQuant,通过一系列量化缓存KV激活的新方法来解决该问题,包括:(i)每通道Key量化,即调整对Key激活做量化的维度,使其更好地匹配分布;(ii)RoPE前Key量化,即在旋转位置编码之前量化Key激活,以减轻其对量化的影响;(iii)非均匀KV缓存量化,即推导逐层的、按敏感度加权的非均匀数据类型,以更好地表示分布;(iv)逐向量的稠密-稀疏量化,即为每个向量单独分离离群值,以尽量减小量化范围的偏斜;以及(v)Q-Norm,即对量化质心做归一化以缓解分布偏移,为2比特量化带来额外收益。将我们的方法应用于LLaMA、LLaMA-2和Mistral模型后,在Wikitext-2和C4上,3比特量化带来的困惑度(perplexity)退化小于0.1,优于现有方法。我们的方法使得在单个A100-80GB GPU上可以为LLaMA-7B模型提供上下文长度高达100万的服务,在8卡GPU系统上可达1000万。

摘要: LLMs are seeing growing use for applications such as document analysis and summarization which require large context windows, and with these large context windows KV cache activations surface as the dominant contributor to memory consumption during inference. Quantization is a promising approach for compressing KV cache activations; however, existing solutions fail to represent activations accurately in ultra-low precisions, such as sub-4-bit. In this work, we present KVQuant, which addresses this problem by incorporating novel methods for quantizing cached KV activations, including: (i) Per-Channel Key Quantization, where we adjust the dimension along which we quantize the Key activations to better match the distribution; (ii) Pre-RoPE Key Quantization, where we quantize Key activations before the rotary positional embedding to mitigate its impact on quantization; (iii) Non-Uniform KV Cache Quantization, where we derive per-layer sensitivity-weighted non-uniform datatypes that better represent the distributions; (iv) Per-Vector Dense-and-Sparse Quantization, where we isolate outliers separately for each vector to minimize skews in quantization ranges; and (v) Q-Norm, where we normalize quantization centroids in order to mitigate distribution shift, providing additional benefits for 2-bit quantization. By applying our method to the LLaMA, LLaMA-2, and Mistral models, we achieve <0.1 perplexity degradation with 3-bit quantization on both Wikitext-2 and C4, outperforming existing approaches. Our method enables serving the LLaMA-7B model with a context length of up to 1 million on a single A100-80GB GPU and up to 10 million on an 8-GPU system.
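
下面给出“每通道(Per-Channel)Key量化”这一思想的简化示意,并与逐张量量化的误差做对比。这只是按摘要描述写的演示代码,并非KVQuant官方实现,也不包含其非均匀数据类型、离群值分离等其余组件。

```python
import torch

def quantize_key_per_channel(K, n_bits=3):
    """对 KV cache 中的 Key 做每通道对称均匀量化的示意。
    K: [seq_len, head_dim],每一列(通道)单独计算缩放系数。"""
    qmax = 2 ** (n_bits - 1) - 1
    scale = K.abs().amax(dim=0, keepdim=True) / qmax      # 形状 [1, head_dim]
    q = torch.clamp(torch.round(K / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale

def dequantize(q, scale):
    return q.float() * scale

K = torch.randn(128, 64) * torch.linspace(0.1, 5.0, 64)   # 模拟通道间量级差异很大的 Key
q, scale = quantize_key_per_channel(K, n_bits=3)
err_per_channel = (dequantize(q, scale) - K).abs().mean()

# 对比:整张量只用一个缩放系数(per-tensor)的量化误差
scale_t = K.abs().max() / 3
err_per_tensor = ((torch.clamp(torch.round(K / scale_t), -4, 3) * scale_t) - K).abs().mean()
print(f"per-channel 误差 {err_per_channel:.4f} vs per-tensor 误差 {err_per_tensor:.4f}")
```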


== CLIP@ViT @ VLM @ visual model ==

标题: M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval

作者: Xingning Dong, Zipeng Feng, Chunluan Zhou

PubTime: 2024-01-31

Downlink: http://arxiv.org/abs/2401.17797v1

GitHub: https://github.com/alipay/Ant-Multi-Modal-Framework/tree/main/prj/M2_RAAP|

中文摘要: 我们提出了一种推进基于适配(adaptation)的预训练、以实现高效且有效的零样本视频-文本检索的多模态方案,称为M2-RAAP。在CLIP等流行的图像-文本模型之上,当前大多数基于适配的视频-文本预训练方法面临三个主要问题:数据语料噪声大、预训练耗时长、性能增益有限。为此,我们开展了一项涵盖视频-文本预训练四个关键步骤的全面研究,具体包括:1)数据过滤与精炼,2)视频输入类型选择,3)时序建模,4)视频特征增强。我们将这项实证研究总结为M2-RAAP方案,其中我们的技术贡献在于:1)数据过滤与文本改写流水线,产生100万对高质量双语视频-文本数据;2)用关键帧替换视频输入以加速预训练;3)辅助字幕引导(ACG)策略以增强视频特征。我们通过在来自不同语言的两个精炼视频-文本数据集上适配三个图像-文本基础模型进行了大量实验,验证了M2-RAAP用于基于适配的预训练的鲁棒性和可复现性。结果表明,M2-RAAP在显著减少数据量(-90%)和时间消耗(-95%)的情况下取得了更优的性能,在四个英文和两个中文零样本检索数据集上建立了新的SOTA。我们正在整理精炼后的双语数据标注和代码库,将在https://github.com/alipay/Ant-Multi-Modal-Framework/tree/main/prj/M2_RAAP上发布。

摘要: We present a Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards effective and efficient zero-shot video-text retrieval, dubbed M2-RAAP. Upon popular image-text models like CLIP, most current adaptation-based video-text pre-training methods are confronted by three major issues, i.e., noisy data corpus, time-consuming pre-training, and limited performance gain. Towards this end, we conduct a comprehensive study including four critical steps in video-text pre-training. Specifically, we investigate 1) data filtering and refinement, 2) video input type selection, 3) temporal modeling, and 4) video feature enhancement. We then summarize this empirical study into the M2-RAAP recipe, where our technical contributions lie in 1) the data filtering and text re-writing pipeline resulting in 1M high-quality bilingual video-text pairs, 2) the replacement of video inputs with key-frames to accelerate pre-training, and 3) the Auxiliary-Caption-Guided (ACG) strategy to enhance video features. We conduct extensive experiments by adapting three image-text foundation models on two refined video-text datasets from different languages, validating the robustness and reproducibility of M2-RAAP for adaptation-based pre-training. Results demonstrate that M2-RAAP yields superior performance with significantly reduced data (-90%) and time consumption (-95%), establishing a new SOTA on four English zero-shot retrieval datasets and two Chinese ones. We are preparing our refined bilingual data annotations and codebase, which will be available at https://github.com/alipay/Ant-Multi-Modal-Framework/tree/main/prj/M2_RAAP.


标题: Good at captioning, bad at counting: Benchmarking GPT-4V on Earth observation data

作者: Chenhui Zhang, Sherrie Wang

PubTime: 2024-01-31

Downlink: http://arxiv.org/abs/2401.17600v1

Project: https://vleo.danielz.ch/|https://huggingface.co/collections/mit-ei/vleo-benchmark-datasets-65b789b0466555489cce0d70|

中文摘要: 大型视觉语言模型(VLM)在涉及视觉输入和自然语言指令的复杂任务中表现出令人印象深刻的性能。然而,尚不清楚它们在自然图像上的能力能在多大程度上迁移到地球观测(EO)数据上;这类数据主要是卫星和航空影像,在VLM训练数据中相对少见。在这项工作中,我们提出了一个全面的基准,通过评估VLM在场景理解、定位与计数以及变化检测任务上的能力,来衡量其成为EO数据实用工具的进展。受现实应用的驱动,我们的基准包括城市监测、救灾、土地利用和保护等场景。我们发现,尽管GPT-4V等最先进的VLM拥有广泛的世界知识,在位置理解和图像描述等开放式任务中表现强劲,但其糟糕的空间推理能力限制了它们在目标定位和计数任务上的实用性。我们的基准将在https://vleo.danielz.ch/和https://huggingface.co/collections/mit-ei/vleo-benchmark-datasets-65b789b0466555489cce0d70上公开,以便于模型评估。

摘要: Large Vision-Language Models (VLMs) have demonstrated impressive performance on complex tasks involving visual input with natural language instructions. However, it remains unclear to what extent capabilities on natural images transfer to Earth observation (EO) data, which are predominantly satellite and aerial images less common in VLM training data. In this work, we propose a comprehensive benchmark to gauge the progress of VLMs toward being useful tools for EO data by assessing their abilities on scene understanding, localization and counting, and change detection tasks. Motivated by real-world applications, our benchmark includes scenarios like urban monitoring, disaster relief, land use, and conservation. We discover that, although state-of-the-art VLMs like GPT-4V possess extensive world knowledge that leads to strong performance on open-ended tasks like location understanding and image captioning, their poor spatial reasoning limits usefulness on object localization and counting tasks. Our benchmark will be made publicly available at https://vleo.danielz.ch/ and on Hugging Face at https://huggingface.co/collections/mit-ei/vleo-benchmark-datasets-65b789b0466555489cce0d70 for easy model evaluation.


标题: Calibrating Segmentation Networks with Margin-based Label Smoothing

作者: Balamurali Murugesan, Bingyuan Liu, Adrian Galdran

PubTime: 2024-01-31

Downlink: http://arxiv.org/abs/2209.09641v2

GitHub: https://github.com/Bala93/MarginLoss|

中文摘要: 尽管深度神经网络推动的视觉识别任务取得了不可否认的进展,但最近有证据表明这些模型的校准很差,导致预测过于自信。训练中最小化交叉熵损失的标准做法促使预测的softmax概率去匹配one-hot标签。然而,这会使正确类别的softmax前激活显著大于其余激活,从而加剧误校准问题。分类文献中的最新观察表明,隐式或显式地最大化预测熵的损失函数能带来最先进的校准性能。尽管有这些发现,这些损失在校准医学图像分割网络这一相关任务中的作用仍未被探索。在这项工作中,我们为当前最先进的校准损失提供了一个统一的约束优化视角。具体而言,这些损失可以被视为对logit距离施加等式约束的线性惩罚(或拉格朗日项)的近似。这揭示了此类潜在等式约束的一个重要局限:其梯度会不断把模型推向无信息量的解,这可能妨碍在基于梯度的优化中于判别性能与模型校准之间取得最佳折衷。基于这些观察,我们提出了一种基于不等式约束的简单而灵活的推广,它对logit距离施加了可控的间隔(margin)。在多个公开医学图像分割基准上的综合实验表明,我们的方法在网络校准方面在这些任务上取得了新的最先进结果,同时判别性能也有所提升。

摘要: Despite the undeniable progress in visual recognition tasks fueled by deep neural networks, there exists recent evidence showing that these models are poorly calibrated, resulting in over-confident predictions. The standard practices of minimizing the cross entropy loss during training promote the predicted softmax probabilities to match the one-hot label assignments. Nevertheless, this yields a pre-softmax activation of the correct class that is significantly larger than the remaining activations, which exacerbates the miscalibration problem. Recent observations from the classification literature suggest that loss functions that embed implicit or explicit maximization of the entropy of predictions yield state-of-the-art calibration performances. Despite these findings, the impact of these losses in the relevant task of calibrating medical image segmentation networks remains unexplored. In this work, we provide a unifying constrained-optimization perspective of current state-of-the-art calibration losses. Specifically, these losses could be viewed as approximations of a linear penalty (or a Lagrangian term) imposing equality constraints on logit distances. This points to an important limitation of such underlying equality constraints, whose ensuing gradients constantly push towards a non-informative solution, which might prevent from reaching the best compromise between the discriminative performance and calibration of the model during gradient-based optimization. Following our observations, we propose a simple and flexible generalization based on inequality constraints, which imposes a controllable margin on logit distances. Comprehensive experiments on a variety of public medical image segmentation benchmarks demonstrate that our method sets novel state-of-the-art results on these tasks in terms of network calibration, whereas the discriminative performance is also improved.
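
结合摘要中“对logit距离施加可控间隔(margin)的不等式约束”的表述,下面给出一个简化的示意损失:在交叉熵之外,对超过间隔 m 的logit距离施加线性惩罚。具体权重与实现细节请以论文及其官方代码库为准,此处仅为示意。

```python
import torch
import torch.nn.functional as F

def margin_logit_loss(logits, targets, margin=10.0, lam=0.1):
    """交叉熵 + 基于间隔的 logit 距离惩罚(示意实现)。
    logits: [N, C](分割任务可先把像素维展平到 N)。"""
    ce = F.cross_entropy(logits, targets)
    # logit 距离:最大 logit 减去各类别 logit,恒为非负
    dist = logits.max(dim=1, keepdim=True).values - logits      # [N, C]
    # 不等式约束 dist_j <= margin 的线性惩罚,ReLU 即 max(0, ·)
    penalty = F.relu(dist - margin).mean()
    return ce + lam * penalty

logits = torch.randn(8, 4) * 20        # 故意放大的 logit,模拟过度自信
targets = torch.randint(0, 4, (8,))
print(margin_logit_loss(logits, targets))
```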


标题: GeoSAM: Fine-tuning SAM with Sparse and Dense Visual Prompting for Automated Segmentation of Mobility Infrastructure

作者: Rafi Ibn Sultan, Chengyin Li, Hui Zhu

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2311.11319v2

GitHub: https://github.com/rafiibnsultan/GeoSAM/tree/main|

中文摘要: Segment Anything Model(SAM)在自然图像分割上表现出令人印象深刻的性能。然而,它很难处理航空和卫星影像等地理图像,尤其是在分割道路、人行道和人行横道等交通出行基础设施时。这种较差的性能源于这些目标的狭长形态、它们的纹理与周围环境融为一体,以及来自树木、建筑物、车辆和行人等物体的干扰,这些都会误导模型,产生不准确的分割图。为了应对这些挑战,我们提出了Geographical SAM(GeoSAM),这是一种新的基于SAM的框架,它利用来自零样本学习的密集视觉提示和来自预训练CNN分割模型的稀疏视觉提示来实施微调策略。所提出的GeoSAM优于现有的地理图像分割方法,在道路基础设施和行人基础设施上分别高出26%和7%,平均高出17%,代表了利用基础模型分割地理图像中交通出行基础设施(包括道路与行人基础设施)的重大飞跃。源代码可在此GitHub仓库中找到:https://github.com/rafiibnsultan/GeoSAM/tree/main。

摘要: The Segment Anything Model (SAM) has shown impressive performance when applied to natural image segmentation. However, it struggles with geographical images like aerial and satellite imagery, especially when segmenting mobility infrastructure including roads, sidewalks, and crosswalks. This inferior performance stems from the narrow features of these objects, their textures blending into the surroundings, and interference from objects like trees, buildings, vehicles, and pedestrians - all of which can disorient the model to produce inaccurate segmentation maps. To address these challenges, we propose Geographical SAM (GeoSAM), a novel SAM-based framework that implements a fine-tuning strategy using the dense visual prompt from zero-shot learning, and the sparse visual prompt from a pre-trained CNN segmentation model. The proposed GeoSAM outperforms existing approaches for geographical image segmentation, specifically by 26%, 7%, and 17% for road infrastructure, pedestrian infrastructure, and on average, respectively, representing a momentous leap in leveraging foundation models to segment mobility infrastructure including both road and pedestrian infrastructure in geographical images. The source code can be found on this GitHub repository: https://github.com/rafiibnsultan/GeoSAM/tree/main.


标题: Synchformer: Efficient Synchronization from Sparse Cues

作者: Vladimir Iashin, Weidi Xie, Esa Rahtu

PubTime: 2024-01-29

Downlink: http://arxiv.org/abs/2401.16423v1

Project: https://www.robots.ox.ac.uk/|

GitHub: https://github.com/v-iashin/Synchformer|

摘要: Our objective is audio-visual synchronization with a focus on ‘in-the-wild’ videos, such as those on YouTube, where synchronization cues can be sparse. Our contributions include a novel audio-visual synchronization model, and training that decouples feature extraction from synchronization modelling through multi-modal segment-level contrastive pre-training. This approach achieves state-of-the-art performance in both dense and sparse settings. We also extend synchronization model training to AudioSet, a million-scale ‘in-the-wild’ dataset, investigate evidence attribution techniques for interpretability, and explore a new capability for synchronization models: audio-visual synchronizability.


标题: CARFF: Conditional Auto-encoded Radiance Field for 3D Scene Forecasting

作者: Jiezhi Yang, Khushi Desai, Charles Packer

PubTime: 2024-01-31

Downlink: http://arxiv.org/abs/2401.18075v1

摘要: We propose CARFF: Conditional Auto-encoded Radiance Field for 3D Scene Forecasting, a method for predicting future 3D scenes given past observations, such as 2D ego-centric images. Our method maps an image to a distribution over plausible 3D latent scene configurations using a probabilistic encoder, and predicts the evolution of the hypothesized scenes through time. Our latent scene representation conditions a global Neural Radiance Field (NeRF) to represent a 3D scene model, which enables explainable predictions and straightforward downstream applications. This approach extends beyond previous neural rendering work by considering complex scenarios of uncertainty in environmental states and dynamics. We employ a two-stage training of Pose-Conditional-VAE and NeRF to learn 3D representations. Additionally, we auto-regressively predict latent scene representations as a partially observable Markov decision process, utilizing a mixture density network. We demonstrate the utility of our method in realistic scenarios using the CARLA driving simulator, where CARFF can be used to enable efficient trajectory and contingency planning in complex multi-agent autonomous driving scenarios involving visual occlusions.


== diffusion policy@diffusion formulation@diffusion model ==

标题: BlockFusion: Expandable 3D Scene Generation using Latent Tri-plane Extrapolation

作者: Zhennan Wu, Yang Li, Han Yan

PubTime: 2024-01-31

Downlink: http://arxiv.org/abs/2401.17053v2

Project: https://www.youtube.com/watch?v=PxIBtd6G0mA|

中文摘要: 我们提出了BlockFusion,这是一个基于扩散的模型,它以单元块的形式生成3D场景,并能无缝地拼接新块来扩展场景。BlockFusion使用从完整3D场景网格中随机裁剪出的3D块数据集进行训练。通过逐块拟合,所有训练块被转换成混合神经场:一个包含几何特征的三平面(tri-plane),后接一个用于解码符号距离值的多层感知机(MLP)。随后采用变分自编码器将三平面压缩到潜在三平面空间,并在该空间上执行去噪扩散过程。在潜在表示上进行扩散可以实现高质量且多样化的3D场景生成。要在生成过程中扩展场景,只需添加与当前场景重叠的空块,并外推现有的潜在三平面来填充新块。外推是在去噪迭代期间,用来自重叠三平面的特征样本来调控生成过程完成的。潜在三平面外推产生语义和几何上有意义的过渡,与现有场景和谐融合。此外,还使用2D布局条件机制来控制场景元素的摆放和排列。实验结果表明,BlockFusion能够在室内和室外场景中生成多样、几何一致且无界的大型3D场景,其形状质量前所未有。

摘要: We present BlockFusion, a diffusion-based model that generates 3D scenes as unit blocks and seamlessly incorporates new blocks to extend the scene. BlockFusion is trained using datasets of 3D blocks that are randomly cropped from complete 3D scene meshes. Through per-block fitting, all training blocks are converted into the hybrid neural fields: with a tri-plane containing the geometry features, followed by a Multi-layer Perceptron (MLP) for decoding the signed distance values. A variational auto-encoder is employed to compress the tri-planes into the latent tri-plane space, on which the denoising diffusion process is performed. Diffusion applied to the latent representations allows for high-quality and diverse 3D scene generation. To expand a scene during generation, one needs only to append empty blocks to overlap with the current scene and extrapolate existing latent tri-planes to populate new blocks. The extrapolation is done by conditioning the generation process with the feature samples from the overlapping tri-planes during the denoising iterations. Latent tri-plane extrapolation produces semantically and geometrically meaningful transitions that harmoniously blend with the existing scene. A 2D layout conditioning mechanism is used to control the placement and arrangement of scene elements. Experimental results indicate that BlockFusion is capable of generating diverse, geometrically consistent and unbounded large 3D scenes with unprecedented high-quality shapes in both indoor and outdoor scenarios.


标题: On Inference Stability for Diffusion Models

作者: Viet Nguyen, Giang Vu, Tung Nguyen Thanh

PubTime: 2024-01-31

Downlink: http://arxiv.org/abs/2312.12431v2

GitHub: https://github.com/VinAIResearch/SA-DPM|

中文摘要: 去噪概率模型(DPM)是新兴的一类生成模型,在生成多样且高质量的图像方面表现出色。然而,当前大多数DPM训练方法往往忽略了时间步之间的相关性,限制了模型有效生成图像的性能。值得注意的是,我们从理论上指出,这一问题可能由预测轨迹与真实轨迹之间的累积估计差距引起。为了最小化该差距,我们提出了一种新的序列感知(sequence-aware)损失,旨在缩小估计差距以提高采样质量。此外,我们从理论上证明,与DPM中的传统损失相比,我们提出的损失函数是估计损失的一个更紧的上界。在CIFAR10、CelebA和CelebA-HQ等多个基准数据集上的实验结果一致表明,以FID和Inception Score衡量,我们提出的方法相比多个DPM基线在图像生成质量上有显著提升。我们的代码和预训练检查点可在https://github.com/VinAIResearch/SA-DPM获取。

摘要: Denoising Probabilistic Models (DPMs) represent an emerging domain of generative models that excel in generating diverse and high-quality images. However, most current training methods for DPMs often neglect the correlation between timesteps, limiting the model’s performance in generating images effectively. Notably, we theoretically point out that this issue can be caused by the cumulative estimation gap between the predicted and the actual trajectory. To minimize that gap, we propose a novel sequence-aware loss that aims to reduce the estimation gap to enhance the sampling quality. Furthermore, we theoretically show that our proposed loss function is a tighter upper bound of the estimation loss in comparison with the conventional loss in DPMs. Experimental results on several benchmark datasets including CIFAR10, CelebA, and CelebA-HQ consistently show a remarkable improvement of our proposed method regarding the image generalization quality measured by FID and Inception Score compared to several DPM baselines. Our code and pre-trained checkpoints are available at https://github.com/VinAIResearch/SA-DPM.


标题: Text2NeRF: Text-Driven 3D Scene Generation with Neural Radiance Fields

作者: Jingbo Zhang, Xiaoyu Li, Ziyu Wan

PubTime: 2024-01-31

Downlink: http://arxiv.org/abs/2305.11588v2

Project: https://eckertzhang.github.io/Text2NeRF.github.io/|

GitHub: https://github.com/eckertzhang/Text2NeRF|

中文摘要: 文本驱动的3D场景生成可广泛应用于对3D场景有大量需求的视频游戏、影视行业和元宇宙应用。然而,现有的文本到3D生成方法局限于生成几何简单、风格梦幻且缺乏真实感的3D物体。在这项工作中,我们提出了Text2NeRF,它能够纯粹从文本提示生成具有复杂几何结构和高保真纹理的各种3D场景。为此,我们采用NeRF作为3D表示,并利用预训练的文本到图像扩散模型来约束NeRF的3D重建,使其符合场景描述。具体来说,我们用扩散模型推断与文本相关的图像作为内容先验,并用单目深度估计方法提供几何先验,二者共同用于更新NeRF模型。为了保证不同视角之间的纹理和几何一致性,我们引入了一种渐进式的场景补全与更新策略来进行场景的新视角合成。我们的方法不需要额外的训练数据,只需要场景的自然语言描述作为输入。大量实验表明,Text2NeRF在从各种自然语言提示生成照片般逼真、多视角一致且多样化的3D场景方面优于现有方法。我们的代码可在https://github.com/eckertzhang/Text2NeRF获取。

摘要: Text-driven 3D scene generation is widely applicable to video gaming, film industry, and metaverse applications that have a large demand for 3D scenes. However, existing text-to-3D generation methods are limited to producing 3D objects with simple geometries and dreamlike styles that lack realism. In this work, we present Text2NeRF, which is able to generate a wide range of 3D scenes with complicated geometric structures and high-fidelity textures purely from a text prompt. To this end, we adopt NeRF as the 3D representation and leverage a pre-trained text-to-image diffusion model to constrain the 3D reconstruction of the NeRF to reflect the scene description. Specifically, we employ the diffusion model to infer the text-related image as the content prior and use a monocular depth estimation method to offer the geometric prior. Both content and geometric priors are utilized to update the NeRF model. To guarantee textured and geometric consistency between different views, we introduce a progressive scene inpainting and updating strategy for novel view synthesis of the scene. Our method requires no additional training data but only a natural language description of the scene as the input. Extensive experiments demonstrate that our Text2NeRF outperforms existing methods in producing photo-realistic, multi-view consistent, and diverse 3D scenes from a variety of natural language prompts. Our code is available at https://github.com/eckertzhang/Text2NeRF.


标题: Diffutoon: High-Resolution Editable Toon Shading via Diffusion Models

作者: Zhongjie Duan, Chengyu Wang, Cen Chen

PubTime: 2024-01-29

Downlink: http://arxiv.org/abs/2401.16224v1

Project: https://ecnu-cilab.github.io/DiffutoonProjectPage/|

中文摘要: 卡通着色(toon shading)是动画中的一种非真实感渲染任务,其主要目的是以扁平化、风格化的外观来渲染物体。随着扩散模型走到图像合成方法的前沿,本文深入研究了一种基于扩散模型的创新卡通着色形式,旨在将逼真的视频直接渲染成动漫风格。在视频风格化中,现有方法面临持续的挑战,尤其是在保持一致性和获得高视觉质量方面。在本文中,我们将卡通着色问题建模为四个子问题:风格化、一致性增强、结构引导和上色。为了应对视频风格化中的挑战,我们提出了一种有效的卡通着色方法,称为Diffutoon。Diffutoon能够以动漫风格渲染细节极其丰富、高分辨率且时长较长的视频,还可以通过一个附加分支根据提示编辑内容。我们通过定量指标和人工评估来验证Diffutoon的效果。值得注意的是,在我们的实验中,Diffutoon超越了开源和闭源的基线方法。我们已在GitHub上发布源代码和示例视频(项目页面:https://ecnu-cilab.github.io/DiffutoonProjectPage/)。

摘要: Toon shading is a type of non-photorealistic rendering task of animation. Its primary purpose is to render objects with a flat and stylized appearance. As diffusion models have ascended to the forefront of image synthesis methodologies, this paper delves into an innovative form of toon shading based on diffusion models, aiming to directly render photorealistic videos into anime styles. In video stylization, extant methods encounter persistent challenges, notably in maintaining consistency and achieving high visual quality. In this paper, we model the toon shading problem as four subproblems: stylization, consistency enhancement, structure guidance, and colorization. To address the challenges in video stylization, we propose an effective toon shading approach called Diffutoon. Diffutoon is capable of rendering remarkably detailed, high-resolution, and extended-duration videos in anime style. It can also edit the content according to prompts via an additional branch. The efficacy of Diffutoon is evaluated through quantitative metrics and human evaluation. Notably, Diffutoon surpasses both open-source and closed-source baseline approaches in our experiments. Our work is accompanied by the release of both the source code and example videos on Github (Project page: https://ecnu-cilab.github.io/DiffutoonProjectPage/).


标题: Motion Guidance: Diffusion-Based Image Editing with Differentiable Motion Estimators

作者: Daniel Geng, Andrew Owens

PubTime: 2024-01-31

Downlink: http://arxiv.org/abs/2401.18085v1

中文摘要: 扩散模型能够根据文本描述生成令人印象深刻的图像,这些模型的扩展还允许用户在相对粗糙的尺度上编辑图像。然而,用扩散模型精确编辑图像中物体的布局、位置、姿态和形状仍然很困难。为此,我们提出了运动引导(motion guidance),这是一种零样本技术,允许用户指定密集而复杂的运动场,指明图像中每个像素应该移动到哪里。运动引导的工作原理是:利用经过现成光流网络反传得到的梯度来引导扩散采样过程。具体来说,我们设计了一种引导损失,鼓励样本具有光流网络所估计的期望运动,同时在视觉上与源图像保持相似。通过在从扩散模型采样的同时引导样本降低引导损失,我们可以得到经过运动编辑的图像。我们证明了该技术适用于复杂运动,并能对真实图像和生成图像进行高质量的编辑。

摘要: Diffusion models are capable of generating impressive images conditioned on text descriptions, and extensions of these models allow users to edit images at a relatively coarse scale. However, the ability to precisely edit the layout, position, pose, and shape of objects in images with diffusion models is still difficult. To this end, we propose motion guidance, a zero-shot technique that allows a user to specify dense, complex motion fields that indicate where each pixel in an image should move. Motion guidance works by steering the diffusion sampling process with the gradients through an off-the-shelf optical flow network. Specifically, we design a guidance loss that encourages the sample to have the desired motion, as estimated by a flow network, while also being visually similar to the source image. By simultaneously sampling from a diffusion model and guiding the sample to have low guidance loss, we can obtain a motion-edited image. We demonstrate that our technique works on complex motions and produces high quality edits of real and generated images.
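
下面是“用可微光流网络的梯度引导扩散采样”这一思路的简化示意。其中 denoiser(去噪模型)与 flow_net(可微光流网络)均为假设的占位接口,采样循环采用最简单的欧拉式更新,并非论文官方实现。

```python
import torch

def guided_sampling(denoiser, flow_net, x_src, target_flow, steps, sigmas, w=1.0):
    """运动引导采样的简化示意。denoiser(x, sigma)->x0_hat、
    flow_net(a, b)->光流 均为假设的占位接口;sigmas 需含 steps+1 个噪声水平。"""
    x = torch.randn_like(x_src) * sigmas[0]
    for i in range(steps):
        sigma, sigma_next = sigmas[i], sigmas[i + 1]
        x = x.detach().requires_grad_(True)
        x0_hat = denoiser(x, sigma)                       # 当前步对干净图像的估计
        # 引导损失:估计运动要接近目标运动场,同时外观要接近源图像
        motion_term = (flow_net(x_src, x0_hat) - target_flow).abs().mean()
        appearance_term = (x0_hat - x_src).abs().mean()
        loss = motion_term + 0.1 * appearance_term
        grad = torch.autograd.grad(loss, x)[0]
        with torch.no_grad():
            d = (x - x0_hat) / sigma                      # 常规的去噪方向
            x = x + (sigma_next - sigma) * d - w * grad   # 叠加引导梯度
    return x.detach()

# 仅作冒烟测试的占位网络(真实场景应替换为扩散去噪器与 RAFT 等可微光流网络)
denoiser = lambda x, sigma: x / (1 + sigma)               # 假设的去噪器
flow_net = lambda a, b: (b - a).mean(dim=1, keepdim=True).repeat(1, 2, 1, 1)  # 假设的“光流”
x_src = torch.rand(1, 3, 32, 32)
target_flow = torch.zeros(1, 2, 32, 32)
sigmas = torch.linspace(10.0, 0.01, steps=11)
print(guided_sampling(denoiser, flow_net, x_src, target_flow, steps=10, sigmas=sigmas).shape)
```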


标题: Separate-and-Enhance: Compositional Finetuning for Text2Image Diffusion Models

作者: Zhipeng Bao, Yijun Li, Krishna Kumar Singh

PubTime: 2024-01-31

Downlink: http://arxiv.org/abs/2312.06712v2

中文摘要: 尽管基于扩散的文本到图像(T2I)模型最近取得了重大进展,当前系统仍难以保证与文本提示对齐的组合式生成,尤其是多物体生成。这项工作阐明了这种不对齐的根本原因,指出其与注意力激活分数过低和注意力掩码重叠有关。虽然以往的研究分别处理过这些问题,但我们认为整体性的方案才是关键。因此,我们提出了两个新目标:Separate损失与Enhance损失,分别用于减少物体掩码重叠和最大化注意力分数。我们的方法不同于传统的测试时自适应技术,而是专注于微调关键参数,从而提升了可扩展性和泛化性。综合评估表明,我们的模型在图像真实感、文本-图像对齐和适应性方面表现出色,明显优于主流基线。最终,这项研究为具有更强组合生成能力和更广适用性的T2I扩散模型铺平了道路。

摘要: Despite recent significant strides achieved by diffusion-based Text-to-Image (T2I) models, current systems are still less capable of ensuring decent compositional generation aligned with text prompts, particularly for the multi-object generation. This work illuminates the fundamental reasons for such misalignment, pinpointing issues related to low attention activation scores and mask overlaps. While previous research efforts have individually tackled these issues, we assert that a holistic approach is paramount. Thus, we propose two novel objectives, the Separate loss and the Enhance loss, that reduce object mask overlaps and maximize attention scores, respectively. Our method diverges from conventional test-time-adaptation techniques, focusing on finetuning critical parameters, which enhances scalability and generalizability. Comprehensive evaluations demonstrate the superior performance of our model in terms of image realism, text-image alignment, and adaptability, notably outperforming prominent baselines. Ultimately, this research paves the way for T2I diffusion models with enhanced compositional capacities and broader applicability.
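
按摘要描述,下面给出两个目标的简化示意:Separate损失抑制不同物体交叉注意力图之间的重叠,Enhance损失提高每个物体注意力的峰值。attn_maps 为假设输入(每个目标token对应的交叉注意力图),具体形式以论文为准。

```python
import torch

def separate_loss(attn_maps):
    """attn_maps: [K, H, W],K 个目标 token 的交叉注意力图(已归一化)。
    惩罚两两之间的重叠(逐像素取最小值后求和)。"""
    K = attn_maps.shape[0]
    loss = 0.0
    for i in range(K):
        for j in range(i + 1, K):
            loss = loss + torch.minimum(attn_maps[i], attn_maps[j]).sum()
    return loss / max(K * (K - 1) / 2, 1)

def enhance_loss(attn_maps):
    """鼓励每个目标的注意力图具有足够高的峰值激活。"""
    peaks = attn_maps.flatten(1).max(dim=1).values        # 每个目标的最大激活
    return (1.0 - peaks).clamp(min=0).mean()

attn = torch.rand(3, 16, 16).softmax(dim=-1)              # 假设的注意力图,仅作演示
print(separate_loss(attn), enhance_loss(attn))
```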


== Visual Navigation@VLN @ Visual Language Navigation ==

标题: SubPipe: A Submarine Pipeline Inspection Dataset for Segmentation and Visual-inertial Localization

作者: Olaya Álvarez-Tuñón, Luiza Ribeiro Marnet, László Antal

PubTime: 2024-01-31

Downlink: http://arxiv.org/abs/2401.17907v1

GitHub: https://github.com/remaro-network/SubPipe-dataset|

中文摘要: 本文介绍了SubPipe,一个用于SLAM、目标检测和图像分割的水下数据集。SubPipe由OceanScan MST运营的轻型自主水下航行器(LAUV)采集,其携带的传感器套件包括两台相机、一台侧扫声纳和一套惯性导航系统等。该AUV被部署在管道检测环境中,海底管道部分被沙子覆盖。AUV的位姿真值由导航传感器估计得到。侧扫声纳图像和RGB图像分别带有目标检测和分割标注。我们在SubPipe上对最先进的分割、目标检测和SLAM方法进行了基准测试,以展示该数据集在应用计算机视觉算法方面的挑战与机遇。据作者所知,这是第一个提供真实管道检测场景的带标注水下数据集。数据集和实验已在https://github.com/remaro-network/SubPipe-dataset公开。

摘要: This paper presents SubPipe, an underwater dataset for SLAM, object detection, and image segmentation. SubPipe has been recorded using a LAUV, operated by OceanScan MST, and carrying a sensor suite including two cameras, a side-scan sonar, and an inertial navigation system, among other sensors. The AUV has been deployed in a pipeline inspection environment with a submarine pipe partially covered by sand. The AUV’s pose ground truth is estimated from the navigation sensors. The side-scan sonar and RGB images include object detection and segmentation annotations, respectively. State-of-the-art segmentation, object detection, and SLAM methods are benchmarked on SubPipe to demonstrate the dataset’s challenges and opportunities for leveraging computer vision algorithms. To the authors’ knowledge, this is the first annotated underwater dataset providing a real pipeline inspection scenario. The dataset and experiments are publicly available online at https://github.com/remaro-network/SubPipe-dataset.


标题: Cognitive TransFuser: Semantics-guided Transformer-based Sensor Fusion for Improved Waypoint Prediction

作者: Hwan-Soo Choi, Jongoh Jeong, Young Hoo Cho

PubTime: 2024-01-31

Downlink: http://arxiv.org/abs/2308.02126v2

摘要: Sensor fusion approaches for intelligent self-driving agents remain key to driving scene understanding given visual global contexts acquired from input sensors. Specifically, for the local waypoint prediction task, single-modality networks are still limited by strong dependency on the sensitivity of the input sensor, and thus recent works therefore promote the use of multiple sensors in fusion in feature level in practice. While it is well known that multiple data modalities encourage mutual contextual exchange, it requires global 3D scene understanding in real-time with minimal computation upon deployment to practical driving scenarios, thereby placing greater significance on the training strategy given a limited number of practically usable sensors. In this light, we exploit carefully selected auxiliary tasks that are highly correlated with the target task of interest (e.g., traffic light recognition and semantic segmentation) by fusing auxiliary task features and also using auxiliary heads for waypoint prediction based on imitation learning. Our RGB-LIDAR-based multi-task feature fusion network, coined Cognitive TransFuser, augments and exceeds the baseline network by a significant margin for safer and more complete road navigation in the CARLA simulator. We validate the proposed network on the Town05 Short and Town05 Long Benchmark through extensive experiments, achieving up to 44.2 FPS real-time inference time.


标题: Pixel to Elevation: Learning to Predict Elevation Maps at Long Range using Images for Autonomous Offroad Navigation

作者: Chanyoung Chung, Georgios Georgakis, Patrick Spieler

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2401.17484v1

中文摘要: 了解远距离的地形拓扑对越野机器人任务的成功至关重要,尤其是在高速导航时。目前几何建图严重依赖的激光雷达传感器,在较远距离建图时只能提供稀疏的测量。为了应对这一挑战,我们提出了一种新的基于学习的方法,能够仅使用机载第一视角图像实时预测远距离地形高程图。我们提出的方法由三个主要部分组成。首先,引入一个基于Transformer的编码器,学习第一视角图像与先前鸟瞰高程图预测之间的跨视角关联。其次,提出一种方向感知的位置编码,将复杂非结构化地形上的3D车辆位姿信息与多视角视觉图像特征相结合。最后,提出一种历史增强的可学习地图嵌入,以提升高程图预测之间的时间一致性,从而有利于下游导航任务。我们使用真实世界的越野驾驶数据,通过实验验证了所提方法在复杂非结构化地形中用于自主越野机器人导航的适用性。此外,我们将该方法与当前最先进的方法进行了定性和定量比较。大量实地实验表明,我们的方法在准确预测地形高程的同时,能有效捕捉远距离的整体地形拓扑,优于基线模型。最后,我们进行了消融研究,以突出并理解所提方法各关键组件的作用,并验证它们对提升越野机器人导航能力的适用性。

摘要: Understanding terrain topology at long-range is crucial for the success of off-road robotic missions, especially when navigating at high-speeds. LiDAR sensors, which are currently heavily relied upon for geometric mapping, provide sparse measurements when mapping at greater distances. To address this challenge, we present a novel learning-based approach capable of predicting terrain elevation maps at long-range using only onboard egocentric images in real-time. Our proposed method is comprised of three main elements. First, a transformer-based encoder is introduced that learns cross-view associations between the egocentric views and prior bird-eye-view elevation map predictions. Second, an orientation-aware positional encoding is proposed to incorporate the 3D vehicle pose information over complex unstructured terrain with multi-view visual image features. Lastly, a history-augmented learn-able map embedding is proposed to achieve better temporal consistency between elevation map predictions to facilitate the downstream navigational tasks. We experimentally validate the applicability of our proposed approach for autonomous offroad robotic navigation in complex and unstructured terrain using real-world offroad driving data. Furthermore, the method is qualitatively and quantitatively compared against the current state-of-the-art methods. Extensive field experiments demonstrate that our method surpasses baseline models in accurately predicting terrain elevation while effectively capturing the overall terrain topology at long-ranges. Finally, ablation studies are conducted to highlight and understand the effect of key components of the proposed approach and validate their suitability to improve offroad robotic navigation capabilities.


标题: Regressing Transformers for Data-efficient Visual Place Recognition

作者: María Leyva-Vallina, Nicola Strisciuglio, Nicolai Petkov

PubTime: 2024-01-29

Downlink: http://arxiv.org/abs/2401.16304v1

中文摘要: 视觉位置识别是计算机视觉中的一项关键任务,对定位和导航系统尤为重要。现有方法通常依赖对比学习:训练图像描述子,使相似图像在潜空间中的距离较小,不相似图像的距离较大。然而,这种方法难以保证基于距离的图像相似度表示足够准确,尤其是在使用二值成对标签训练时,并且往往需要复杂的重排序(re-ranking)策略。这项工作引入了一个新的视角,把位置识别构造成一个回归问题,使用相机视场重叠度作为学习所用的相似度真值。通过优化图像描述子使其直接与分级相似度标签对齐,该方法无需昂贵的重排序即可增强排序能力,带来数据高效的训练以及在多个基准数据集上的强泛化能力。

摘要: Visual place recognition is a critical task in computer vision, especially for localization and navigation systems. Existing methods often rely on contrastive learning: image descriptors are trained to have small distance for similar images and larger distance for dissimilar ones in a latent space. However, this approach struggles to ensure accurate distance-based image similarity representation, particularly when training with binary pairwise labels, and complex re-ranking strategies are required. This work introduces a fresh perspective by framing place recognition as a regression problem, using camera field-of-view overlap as similarity ground truth for learning. By optimizing image descriptors to align directly with graded similarity labels, this approach enhances ranking capabilities without expensive re-ranking, offering data-efficient training and strong generalization across several benchmark datasets.
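
下面是“把位置识别当作回归问题”的最小示意:让两幅图像描述子的余弦相似度直接回归到相机视场重叠度这一分级标签(取值0到1),以代替二值成对标签上的对比学习。网络与数据均为占位,仅用于说明损失形式。

```python
import torch
import torch.nn.functional as F

class DescriptorNet(torch.nn.Module):
    """占位的描述子网络:真实实现通常采用 Transformer/CNN 主干。"""
    def __init__(self, in_dim=512, out_dim=128):
        super().__init__()
        self.proj = torch.nn.Linear(in_dim, out_dim)

    def forward(self, feats):
        return F.normalize(self.proj(feats), dim=-1)       # 单位球面上的描述子

net = DescriptorNet()
feats_a = torch.randn(32, 512)          # 一批图像对的特征(虚构)
feats_b = torch.randn(32, 512)
overlap = torch.rand(32)                # 分级相似度标签:视场重叠度,取值 0~1

da, db = net(feats_a), net(feats_b)
sim = (da * db).sum(dim=-1).clamp(0, 1) # 余弦相似度截断到 [0, 1]
loss = F.mse_loss(sim, overlap)         # 直接回归到重叠度,而非二值对比标签
loss.backward()
print(loss.item())
```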


