[晓理紫] Daily Paper Digest (with abstracts and source code or project links) -- Large Models, Diffusion Models, Vision-Language Navigation

Subscribe to papers in your field of interest

Follow {晓理紫|小李子} for daily paper updates. If you find this useful, please forward it to others who may need it. Thank you for your support.

If you find this helpful, please follow me; the latest papers will be pushed to you on time every day.


Categories:

  • Large Language Models (LLM)
  • Vision-Language Models (VLM)
  • Diffusion Models
  • Vision-Language Navigation (VLN)
  • Embodied AI, Robotics
  • Reinforcement Learning
  • Open Vocabulary, Detection and Segmentation

[晓理紫] Daily Paper Digest (with abstracts and source code or project links)

== LLM ==

Title: SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models

Authors: Xin Zhang, Dong Zhang, Shimin Li

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2308.16692v2

Project: https://0nutation.github.io/SpeechTokenizer.github.io/|

GitHub: https://github.com/ZhangXInFD/SpeechTokenizer/.|

Abstract: Current speech large language models build upon discrete speech representations, which can be categorized into semantic tokens and acoustic tokens. However, existing speech tokens are not specifically designed for speech language modeling. To assess the suitability of speech tokens for building speech language models, we established the first benchmark, SLMTokBench. Our results indicate that neither semantic nor acoustic tokens are ideal for this purpose. Therefore, we propose SpeechTokenizer, a unified speech tokenizer for speech large language models. SpeechTokenizer adopts an encoder-decoder architecture with residual vector quantization (RVQ). Unifying semantic and acoustic tokens, SpeechTokenizer disentangles different aspects of speech information hierarchically across different RVQ layers. Furthermore, we construct a Unified Speech Language Model (USLM) leveraging SpeechTokenizer. Experiments show that SpeechTokenizer performs comparably to EnCodec in speech reconstruction and demonstrates strong performance on the SLMTokBench benchmark. Also, USLM outperforms VALL-E in zero-shot text-to-speech tasks. Code and models are available at https://github.com/ZhangXInFD/SpeechTokenizer/.
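
Because the abstract hinges on residual vector quantization, a minimal sketch may help make the idea concrete. The NumPy snippet below is an illustrative RVQ encoder with random placeholder codebooks, not SpeechTokenizer's trained model or its semantic distillation; it only shows the mechanism of each layer quantizing the residual left by the previous layers.

```python
# Minimal residual vector quantization (RVQ) sketch with random codebooks.
import numpy as np

rng = np.random.default_rng(0)
num_layers, codebook_size, dim = 4, 256, 64
codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(num_layers)]

def rvq_encode(x, codebooks):
    """Quantize a frame vector x with a stack of residual codebooks."""
    residual = x.copy()
    codes, quantized = [], np.zeros_like(x)
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))  # nearest code
        codes.append(idx)
        quantized += cb[idx]
        residual = x - quantized          # next layer models what is left over
    return codes, quantized

frame = rng.normal(size=dim)              # stand-in for one encoder output frame
codes, recon = rvq_encode(frame, codebooks)
print(codes, np.linalg.norm(frame - recon))
```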


Title: Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning

Authors: Mustafa Shukor, Alexandre Rame, Corentin Dancette

PubTime: 2024-01-22

Downlink: http://arxiv.org/abs/2310.00647v2

Project: https://evalign-icl.github.io/|

GitHub: https://github.com/mshukor/EvALign-ICL.|

Abstract: Following the success of Large Language Models (LLMs), Large Multimodal Models (LMMs), such as the Flamingo model and its subsequent competitors, have started to emerge as natural steps towards generalist agents. However, interacting with recent LMMs reveals major limitations that are hardly captured by the current evaluation benchmarks. Indeed, task performance (e.g., VQA accuracy) alone does not provide enough clues to understand their real capabilities, limitations, and the extent to which such models are aligned with human expectations. To refine our understanding of those flaws, we deviate from the current evaluation paradigm and (1) evaluate 10 recent open-source LMMs from the 3B up to the 80B parameter scale, on 5 different axes: hallucinations, abstention, compositionality, explainability, and instruction following. Our evaluation on these axes reveals major flaws in LMMs. While the current go-to solution to align these models is based on training, such as instruction tuning or RLHF, we instead (2) explore training-free in-context learning (ICL) as a solution and study how it affects these limitations. Based on our ICL study, (3) we push ICL further and propose new multimodal ICL variants such as Multitask-ICL, Chain-of-Hindsight-ICL, and Self-Correcting-ICL. Our findings are as follows. (1) Despite their success, LMMs have flaws that remain unsolved with scaling alone. (2) The effect of ICL on LMM flaws is nuanced: despite its effectiveness for improving explainability and answer abstention, ICL only slightly improves instruction following, does not improve compositional abilities, and actually even amplifies hallucinations. (3) The proposed ICL variants are promising as post-hoc approaches to efficiently tackle some of those flaws. The code is available here: https://github.com/mshukor/EvALign-ICL.
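
For readers unfamiliar with the proposed ICL variants, the snippet below sketches how a Chain-of-Hindsight-ICL style prompt might be assembled for an LMM that accepts interleaved image placeholders and text. The template, field names, and demonstration are hypothetical illustrations, not the exact format used in EvALign-ICL.

```python
# Hypothetical prompt assembly for a Chain-of-Hindsight-style multimodal ICL query.
def chain_of_hindsight_icl(demos, query):
    """Each demonstration shows a bad and a corrected answer before the query."""
    parts = []
    for d in demos:
        parts.append(f"<image:{d['image']}> Question: {d['question']}\n"
                     f"Bad answer: {d['bad_answer']}\n"
                     f"Good answer: {d['good_answer']}")
    parts.append(f"<image:{query['image']}> Question: {query['question']}\nGood answer:")
    return "\n\n".join(parts)

demos = [{"image": "demo1.jpg", "question": "What is on the table?",
          "bad_answer": "A dog.", "good_answer": "A red mug."}]
query = {"image": "test.jpg", "question": "What is the person holding?"}
print(chain_of_hindsight_icl(demos, query))
```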


Title: GenSim: Generating Robotic Simulation Tasks via Large Language Models

Authors: Lirui Wang, Yiyang Ling, Zhecheng Yuan

PubTime: 2024-01-21

Downlink: http://arxiv.org/abs/2310.01361v2

Project: https://liruiw.github.io/gensim | https://huggingface.co/spaces/Gen-Sim/Gen-Sim

GitHub: https://github.com/liruiw/GenSim

Abstract: Collecting large amounts of real-world interaction data to train general robotic policies is often prohibitively expensive, thus motivating the use of simulation data. However, existing methods for data generation have generally focused on scene-level diversity (e.g., object instances and poses) rather than task-level diversity, due to the human effort required to come up with and verify novel tasks. This has made it challenging for policies trained on simulation data to demonstrate significant task-level generalization. In this paper, we propose to automatically generate rich simulation environments and expert demonstrations by exploiting a large language model's (LLM) grounding and coding ability. Our approach, dubbed GenSim, has two modes: goal-directed generation, wherein a target task is given to the LLM and the LLM proposes a task curriculum to solve the target task, and exploratory generation, wherein the LLM bootstraps from previous tasks and iteratively proposes novel tasks that would be helpful in solving more complex tasks. We use GPT4 to expand the existing benchmark by ten times to over 100 tasks, on which we conduct supervised finetuning and evaluate several LLMs, including finetuned GPTs and Code Llama, on code generation for robotic simulation tasks. Furthermore, we observe that LLM-generated simulation programs can enhance task-level generalization significantly when used for multitask policy training. We further find that with minimal sim-to-real adaptation, the multitask policies pretrained on GPT4-generated simulation tasks exhibit stronger transfer to unseen long-horizon tasks in the real world and outperform baselines by 25%. See the project website (https://liruiw.github.io/gensim) for code, demos, and videos.
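
The exploratory-generation mode can be pictured as a simple bootstrap loop over a growing task library. The sketch below is schematic: call_llm and runs_in_sim are hypothetical stand-ins for an LLM client and a simulator-based validation gate, not GenSim's actual interfaces.

```python
# Schematic exploratory task generation loop (placeholder LLM and validator).
def call_llm(prompt: str) -> str:
    # Placeholder: swap in a real LLM client; here we echo a dummy task.
    return "def task_stack_two_blocks(env): ..."

def runs_in_sim(task_code: str) -> bool:
    # Placeholder validation gate: in practice, execute the task in simulation.
    return "def task_" in task_code

def exploratory_generation(seed_tasks, rounds=10):
    library = list(seed_tasks)                # task code strings the LLM can build on
    for _ in range(rounds):
        prompt = ("Existing tasks:\n" + "\n\n".join(library[-5:]) +
                  "\n\nPropose one new, harder manipulation task as Python code.")
        candidate = call_llm(prompt)
        if runs_in_sim(candidate):            # keep only tasks that actually run
            library.append(candidate)
    return library

print(len(exploratory_generation(["def task_place_block(env): ..."])))
```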


Title: LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition

Authors: Chengsong Huang, Qian Liu, Bill Yuchen Lin

PubTime: 2024-01-18

Downlink: http://arxiv.org/abs/2307.13269v2

Project: https://huggingface.co/lorahub.|

GitHub: https://github.com/sail-sg/lorahub,|

Abstract: Low-rank adaptation (LoRA) is often employed to fine-tune large language models (LLMs) for new tasks. This paper investigates LoRA composability for cross-task generalization and introduces LoraHub, a simple framework devised for the purposive assembly of LoRA modules trained on diverse given tasks, with the objective of achieving adaptable performance on unseen tasks. With just a few examples from a new task, LoraHub can fluidly combine multiple LoRA modules, eliminating the need for human expertise and assumptions. Notably, the composition requires neither additional model parameters nor gradients. Empirical results on the Big-Bench Hard benchmark suggest that LoraHub, while not surpassing the performance of in-context learning, offers a notable performance-efficiency trade-off in few-shot scenarios by employing a significantly reduced number of tokens per example during inference. Notably, LoraHub establishes a better upper bound compared to in-context learning when paired with different demonstration examples, demonstrating its potential for future development. Our vision is to establish a platform for LoRA modules, empowering users to share their trained LoRA modules. This collaborative approach facilitates the seamless application of LoRA modules to novel tasks, contributing to an adaptive ecosystem. Our code is available at https://github.com/sail-sg/lorahub, and all the pre-trained LoRA modules are released at https://huggingface.co/lorahub.
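
The core operation, composing LoRA modules without gradients, can be illustrated in a few lines. The NumPy sketch below merges random placeholder LoRA modules as a weighted sum of their low-rank updates and tunes the mixing weights with a naive random search, standing in for LoraHub's gradient-free optimizer; shapes and the toy objective are illustrative only.

```python
# Weighted composition of LoRA modules with a gradient-free weight search.
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank, n_modules = 32, 16, 4, 3
W0 = rng.normal(size=(d_out, d_in))                       # frozen base weight
loras = [(rng.normal(size=(d_out, rank)) * 0.1,           # B_i
          rng.normal(size=(rank, d_in)) * 0.1)            # A_i
         for _ in range(n_modules)]

def merged_weight(W0, loras, w):
    delta = sum(wi * (B @ A) for wi, (B, A) in zip(w, loras))
    return W0 + delta

def few_shot_loss(W, X, Y):
    return float(np.mean((X @ W.T - Y) ** 2))             # toy proxy objective

X, Y = rng.normal(size=(8, d_in)), rng.normal(size=(8, d_out))  # "few-shot" data
best_w, best_loss = None, float("inf")
for _ in range(200):                                      # crude gradient-free search
    w = rng.uniform(-1.5, 1.5, size=n_modules)
    loss = few_shot_loss(merged_weight(W0, loras, w), X, Y)
    if loss < best_loss:
        best_w, best_loss = w, loss
print(best_w, best_loss)
```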


Title: CLadder: Assessing Causal Reasoning in Language Models

Authors: Zhijing Jin, Yuen Chen, Felix Leeb

PubTime: 2024-01-17

Downlink: http://arxiv.org/abs/2312.04350v3

Project: https://huggingface.co/datasets/causalNLP/cladder,|

GitHub: https://github.com/causalNLP/cladder.|

Abstract: The ability to perform causal reasoning is widely considered a core feature of intelligence. In this work, we investigate whether large language models (LLMs) can coherently reason about causality. Much of the existing work in natural language processing (NLP) focuses on evaluating commonsense causal reasoning in LLMs, thus failing to assess whether a model can perform causal inference in accordance with a set of well-defined formal rules. To address this, we propose a new NLP task, causal inference in natural language, inspired by the "causal inference engine" postulated by Judea Pearl et al. We compose a large dataset, CLadder, with 10K samples: based on a collection of causal graphs and queries (associational, interventional, and counterfactual), we obtain symbolic questions and ground-truth answers through an oracle causal inference engine. These are then translated into natural language. We evaluate multiple LLMs on our dataset, and we introduce and evaluate a bespoke chain-of-thought prompting strategy, CausalCoT. We show that our task is highly challenging for LLMs, and we conduct an in-depth analysis to gain deeper insights into the causal reasoning abilities of LLMs. Our data is open-sourced at https://huggingface.co/datasets/causalNLP/cladder, and our code can be found at https://github.com/causalNLP/cladder.
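
To make the kind of formal query concrete, here is a toy worked example (made-up numbers) on a three-variable confounded graph, contrasting the interventional quantity obtained by backdoor adjustment with plain observational conditioning; this mirrors the interventional-rung questions CLadder poses in natural language.

```python
# Toy backdoor adjustment on the graph Z -> X, Z -> Y, X -> Y (synthetic numbers).
P_Z = {0: 0.6, 1: 0.4}                       # P(Z)
P_X_given_Z = {0: 0.2, 1: 0.8}               # P(X=1 | Z)
P_Y_given_XZ = {(0, 0): 0.1, (0, 1): 0.5,    # P(Y=1 | X, Z)
                (1, 0): 0.4, (1, 1): 0.9}

# Backdoor adjustment: sum_z P(Y=1 | X=1, Z=z) * P(Z=z)
p_do = sum(P_Y_given_XZ[(1, z)] * P_Z[z] for z in (0, 1))

# Observational conditioning: weights Z by P(Z=z | X=1) instead of P(Z=z)
p_x1 = sum(P_X_given_Z[z] * P_Z[z] for z in (0, 1))
p_obs = sum(P_Y_given_XZ[(1, z)] * P_X_given_Z[z] * P_Z[z] / p_x1 for z in (0, 1))

print(f"P(Y=1 | do(X=1)) = {p_do:.3f}")      # 0.4*0.6 + 0.9*0.4 = 0.600
print(f"P(Y=1 | X=1)     = {p_obs:.3f}")     # ~0.764, so the two queries differ
```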


Title: EarthPT: a time series foundation model for Earth Observation

Authors: Michael J. Smith, Luke Fleming, James E. Geach

PubTime: 2024-01-11

Downlink: http://arxiv.org/abs/2309.07207v2

Project: https://www.climatechange.ai/papers/neurips2023/2|

GitHub: https://github.com/aspiaspace/EarthPT|

Abstract: We introduce EarthPT, an Earth Observation (EO) pretrained transformer. EarthPT is a 700 million parameter decoding transformer foundation model trained in an autoregressive self-supervised manner and developed specifically with EO use-cases in mind. We demonstrate that EarthPT is an effective forecaster that can accurately predict future pixel-level surface reflectances across the 400-2300 nm range well into the future. For example, forecasts of the evolution of the Normalised Difference Vegetation Index (NDVI) have a typical error of approximately 0.05 (over a natural range of -1 to 1) at the pixel level over a five-month test set horizon, out-performing simple phase-folded models based on historical averaging. We also demonstrate that embeddings learnt by EarthPT hold semantically meaningful information and could be exploited for downstream tasks such as highly granular, dynamic land use classification. Excitingly, we note that the abundance of EO data provides us with, in theory, quadrillions of training tokens. Therefore, if we assume that EarthPT follows neural scaling laws akin to those derived for Large Language Models (LLMs), there is currently no data-imposed limit to scaling EarthPT and other similar 'Large Observation Models'.
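
Since the abstract quotes NDVI forecasting errors, a small sketch of the quantity being predicted and of the kind of phase-folded baseline it is compared against may be useful. The reflectance values below are synthetic and the baseline is deliberately naive; none of this reproduces EarthPT itself.

```python
# NDVI from synthetic reflectances plus a "historical monthly mean" baseline.
import numpy as np

def ndvi(nir, red, eps=1e-6):
    return (nir - red) / (nir + red + eps)   # bounded to roughly [-1, 1]

rng = np.random.default_rng(0)
months = np.arange(48) % 12                  # four years of monthly observations
seasonal = 0.3 + 0.25 * np.sin(2 * np.pi * months / 12)
nir = seasonal + 0.05 * rng.normal(size=48)  # synthetic NIR reflectance
red = 0.15 + 0.02 * rng.normal(size=48)      # synthetic red reflectance
series = ndvi(nir, red)

# Phase-folded baseline: predict each month with its historical monthly mean
# from the first three years, then score the final year.
history, test = series[:36], series[36:]
monthly_mean = np.array([history[months[:36] == m].mean() for m in range(12)])
pred = monthly_mean[months[36:]]
print("baseline MAE:", float(np.mean(np.abs(pred - test))))
```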


== VLM ==

Title: Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data

Authors: Lihe Yang, Bingyi Kang, Zilong Huang

PubTime: 2024-01-19

Downlink: http://arxiv.org/abs/2401.10891v1

Project: https://depth-anything.github.io|

GitHub: https://github.com/LiheYoung/Depth-Anything.|

Abstract: This work presents Depth Anything, a highly practical solution for robust monocular depth estimation. Without pursuing novel technical modules, we aim to build a simple yet powerful foundation model dealing with any images under any circumstances. To this end, we scale up the dataset by designing a data engine to collect and automatically annotate large-scale unlabeled data (~62M), which significantly enlarges the data coverage and thus is able to reduce the generalization error. We investigate two simple yet effective strategies that make data scaling-up promising. First, a more challenging optimization target is created by leveraging data augmentation tools. It compels the model to actively seek extra visual knowledge and acquire robust representations. Second, an auxiliary supervision is developed to enforce the model to inherit rich semantic priors from pre-trained encoders. We evaluate its zero-shot capabilities extensively, including on six public datasets and randomly captured photos. It demonstrates impressive generalization ability. Further, through fine-tuning it with metric depth information from NYUv2 and KITTI, new SOTAs are set. Our better depth model also results in a better depth-conditioned ControlNet. Our models are released at https://github.com/LiheYoung/Depth-Anything.
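
Training a monocular depth model on heterogeneous pseudo-labels typically relies on an affine-invariant objective. The sketch below follows the common MiDaS-style scale-and-shift-invariant loss as an illustration; the exact loss used in Depth Anything may differ in detail.

```python
# Scale-and-shift-invariant (affine-invariant) depth regression loss sketch.
import torch

def affine_invariant_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # pred, target: (B, H, W) depth/disparity maps
    def normalize(d):
        b = d.flatten(1)
        t = b.median(dim=1, keepdim=True).values            # per-image shift
        s = (b - t).abs().mean(dim=1, keepdim=True) + 1e-6   # per-image scale
        return ((b - t) / s).view_as(d)
    return (normalize(pred) - normalize(target)).abs().mean()

pred = torch.rand(2, 64, 64)
target = 3.0 * pred + 0.5              # same structure, different scale and shift
print(affine_invariant_loss(pred, target))   # ~0: the loss ignores affine changes
```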


Title: Crossway Diffusion: Improving Diffusion-based Visuomotor Policy via Self-supervised Learning

Authors: Xiang Li, Varun Belagali, Jinghuan Shang

PubTime: 2024-01-11

Downlink: http://arxiv.org/abs/2307.01849v3

Project: https://youtu.be/9deKHueZBuk|

GitHub: https://github.com/LostXine/crossway_diffusion|

Abstract: Sequence modeling approaches have shown promising results in robot imitation learning. Recently, diffusion models have been adopted for behavioral cloning in a sequence modeling fashion, benefiting from their exceptional capabilities in modeling complex data distributions. The standard diffusion-based policy iteratively generates action sequences from random noise conditioned on the input states. Nonetheless, the model for diffusion policy can be further improved in terms of visual representations. In this work, we propose Crossway Diffusion, a simple yet effective method to enhance diffusion-based visuomotor policy learning via a carefully designed state decoder and an auxiliary self-supervised learning (SSL) objective. The state decoder reconstructs raw image pixels and other state information from the intermediate representations of the reverse diffusion process. The whole model is jointly optimized by the SSL objective and the original diffusion loss. Our experiments demonstrate the effectiveness of Crossway Diffusion in various simulated and real-world robot tasks, confirming its consistent advantages over the standard diffusion-based policy and substantial improvements over the baselines.
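
The training objective described in the abstract, a denoising loss plus an auxiliary state-reconstruction loss, can be sketched as follows. The tiny linear modules, the toy corruption rule, and the 0.1 weighting are placeholders, not the paper's architecture, noise schedule, or hyperparameters.

```python
# Joint denoising + state-reconstruction objective, schematic only.
import torch
import torch.nn as nn

class TinyPolicy(nn.Module):
    def __init__(self, obs_dim=32, act_dim=7, hid=64):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hid)
        self.denoiser = nn.Linear(hid + act_dim + 1, act_dim)   # predicts the noise
        self.state_decoder = nn.Linear(hid, obs_dim)            # auxiliary SSL head

    def forward(self, obs, noisy_action, t):
        h = torch.relu(self.encoder(obs))
        eps_hat = self.denoiser(torch.cat([h, noisy_action, t], dim=-1))
        obs_recon = self.state_decoder(h)
        return eps_hat, obs_recon

model = TinyPolicy()
obs, action = torch.randn(16, 32), torch.randn(16, 7)
t = torch.rand(16, 1)
noise = torch.randn_like(action)
noisy_action = action + t * noise                 # toy corruption, not a real schedule

eps_hat, obs_recon = model(obs, noisy_action, t)
diffusion_loss = nn.functional.mse_loss(eps_hat, noise)
ssl_loss = nn.functional.mse_loss(obs_recon, obs)  # reconstruct the input state
loss = diffusion_loss + 0.1 * ssl_loss             # jointly optimized
loss.backward()
```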


Title: AUPIMO: Redefining Visual Anomaly Detection Benchmarks with High Speed and Low Tolerance

Authors: Joao P. C. Bertoldo, Dick Ameln, Ashwin Vaidya

PubTime: 2024-01-19

Downlink: http://arxiv.org/abs/2401.01984v2

Project: https://summerofcode.withgoogle.com/archive/2023/projects/SPMopugd|

GitHub: https://github.com/jpcbertoldo/aupimo.|

Abstract: Recent advances in visual anomaly detection research have seen AUROC and AUPRO scores on public benchmark datasets such as MVTec and VisA converge towards perfect recall, giving the impression that these benchmarks are near-solved. However, high AUROC and AUPRO scores do not always reflect qualitative performance, which limits the validity of these metrics in real-world applications. We argue that the artificial ceiling imposed by the lack of an adequate evaluation metric restrains progression of the field, and it is crucial that we revisit the evaluation metrics used to rate our algorithms. In response, we introduce Per-IMage Overlap (PIMO), a novel metric that addresses the shortcomings of AUROC and AUPRO. PIMO retains the recall-based nature of the existing metrics but introduces two distinctions: the assignment of curves (and respective area under the curve) is per-image, and its X-axis relies solely on normal images. Measuring recall per image simplifies instance score indexing and is more robust to noisy annotations. As we show, it also accelerates computation and enables the usage of statistical tests to compare models. By imposing low tolerance for false positives on normal images, PIMO provides an enhanced model validation procedure and highlights performance variations across datasets. Our experiments demonstrate that PIMO offers practical advantages and nuanced performance insights that redefine anomaly detection benchmarks, notably challenging the perception that MVTec AD and VisA datasets have been solved by contemporary models. Available on GitHub: https://github.com/jpcbertoldo/aupimo.
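
The abstract's two design choices, thresholds defined by false-positive rates on normal images only and recall measured per anomalous image, can be illustrated with synthetic scores. This is a rough sketch of the idea, not the official metric; consult the linked aupimo repository for the reference implementation.

```python
# Per-image recall at thresholds derived from normal-image false-positive rates.
import numpy as np

rng = np.random.default_rng(0)
normal_scores = rng.normal(0.0, 1.0, size=(20, 64 * 64))     # per-pixel anomaly scores
anom_scores = rng.normal(0.0, 1.0, size=(5, 64 * 64))
anom_masks = rng.random((5, 64 * 64)) < 0.05                  # ground-truth defect pixels
anom_scores[anom_masks] += 3.0                                # defects score higher

def threshold_at_fpr(normal_scores, fpr):
    # Shared threshold such that a fraction `fpr` of all normal pixels exceed it.
    return np.quantile(normal_scores.ravel(), 1.0 - fpr)

def per_image_recall(scores, masks, thr):
    preds = scores >= thr
    return [(p & m).sum() / m.sum() for p, m in zip(preds, masks)]

for fpr in (1e-2, 1e-3):                    # low tolerance for false positives
    thr = threshold_at_fpr(normal_scores, fpr)
    recalls = per_image_recall(anom_scores, anom_masks, thr)
    print(f"FPR={fpr:.0e}  mean per-image recall={np.mean(recalls):.3f}")
```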


Title: Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions

Authors: David Junhao Zhang, Dongxu Li, Hung Le

PubTime: 2024-01-03

Downlink: http://arxiv.org/abs/2401.01827v1

Project: https://showlab.github.io/Moonshot/|

GitHub: https://github.com/salesforce/LAVIS.|

Abstract: Most existing video diffusion models (VDMs) are limited to mere text conditions. Thereby, they are usually lacking in control over the visual appearance and geometry structure of the generated videos. This work presents Moonshot, a new video generation model that conditions simultaneously on multimodal inputs of image and text. The model builds upon a core module, called the multimodal video block (MVB), which consists of conventional spatiotemporal layers for representing video features, and a decoupled cross-attention layer to address image and text inputs for appearance conditioning. In addition, we carefully design the model architecture such that it can optionally integrate with pre-trained image ControlNet modules for geometry visual conditions, without needing extra training overhead, in contrast to prior methods. Experiments show that with versatile multimodal conditioning mechanisms, Moonshot demonstrates significant improvement in visual quality and temporal consistency compared to existing models. In addition, the model can be easily repurposed for a variety of generative applications, such as personalized video generation, image animation and video editing, unveiling its potential to serve as a fundamental architecture for controllable video generation. Models will be made public on https://github.com/salesforce/LAVIS.
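
A decoupled cross-attention layer of the kind described for the MVB can be sketched with two standard attention modules, one over text tokens and one over image-condition tokens, whose outputs are combined with the video tokens. Dimensions and the plain MultiheadAttention modules are illustrative assumptions, not Moonshot's actual implementation.

```python
# Decoupled cross-attention sketch: separate text and image conditioning paths.
import torch
import torch.nn as nn

class DecoupledCrossAttention(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.image_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_tokens, text_tokens, image_tokens, image_scale=1.0):
        text_out, _ = self.text_attn(video_tokens, text_tokens, text_tokens)
        img_out, _ = self.image_attn(video_tokens, image_tokens, image_tokens)
        return video_tokens + text_out + image_scale * img_out  # residual update

layer = DecoupledCrossAttention()
video = torch.randn(2, 16 * 8, 256)   # (batch, frames * patches, dim)
text = torch.randn(2, 77, 256)        # text-encoder tokens
image = torch.randn(2, 16, 256)       # tokens from the conditioning image
print(layer(video, text, image).shape)
```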


Title: SVGDreamer: Text Guided SVG Generation with Diffusion Model

Authors: Ximing Xing, Haitao Zhou, Chuang Wang

PubTime: 2024-01-03

Downlink: http://arxiv.org/abs/2312.16476v2

Project: https://ximinng.github.io/SVGDreamer-project/

Abstract: Recently, text-guided scalable vector graphics (SVG) synthesis has shown promise in domains such as iconography and sketching. However, existing text-to-SVG generation methods lack editability and struggle with visual quality and result diversity. To address these limitations, we propose a novel text-guided vector graphics synthesis method called SVGDreamer. SVGDreamer incorporates a semantic-driven image vectorization (SIVE) process that enables the decomposition of synthesis into foreground objects and background, thereby enhancing editability. Specifically, the SIVE process introduces attention-based primitive control and an attention-mask loss function for effective control and manipulation of individual elements. Additionally, we propose a Vectorized Particle-based Score Distillation (VPSD) approach to tackle the challenges of color over-saturation, vector primitive over-smoothing, and limited result diversity in existing text-to-SVG generation methods. Furthermore, on the basis of VPSD, we introduce Reward Feedback Learning (ReFL) to accelerate VPSD convergence and improve aesthetic appeal. Extensive experiments have been conducted to validate the effectiveness of SVGDreamer, demonstrating its superiority over baseline methods in terms of editability, visual quality, and diversity. The code and demo of SVGDreamer can be found at https://ximinng.github.io/SVGDreamer-project/.


Title: ODTrack: Online Dense Temporal Token Learning for Visual Tracking

Authors: Yaozong Zheng, Bineng Zhong, Qihua Liang

PubTime: 2024-01-03

Downlink: http://arxiv.org/abs/2401.01686v1

GitHub: https://github.com/GXNU-ZhongLab/ODTrack|

Abstract: Online contextual reasoning and association across consecutive video frames are critical to perceiving instances in visual tracking. However, most current top-performing trackers persistently lean on sparse temporal relationships between reference and search frames via an offline mode. Consequently, they can only interact independently within each image pair and establish limited temporal correlations. To alleviate the above problem, we propose a simple, flexible and effective video-level tracking pipeline, named ODTrack, which densely associates the contextual relationships of video frames in an online token propagation manner. ODTrack receives video frames of arbitrary length to capture the spatio-temporal trajectory relationships of an instance, and compresses the discriminative features (localization information) of a target into a token sequence to achieve frame-to-frame association. This new solution brings the following benefits: 1) the purified token sequences can serve as prompts for inference in the next video frame, whereby past information is leveraged to guide future inference; 2) complex online update strategies are effectively avoided by the iterative propagation of token sequences, so we can achieve more efficient model representation and computation. ODTrack achieves new SOTA performance on seven benchmarks, while running at real-time speed. Code and models are available at https://github.com/GXNU-ZhongLab/ODTrack.


== diffusion model ==

Title: VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models

Authors: Haoxin Chen, Yong Zhang, Xiaodong Cun

PubTime: 2024-01-17

Downlink: http://arxiv.org/abs/2401.09047v1

Project: https://ailab-cvc.github.io/videocrafter;|

GitHub: https://github.com/AILab-CVC/VideoCrafter|

Abstract: Text-to-video generation aims to produce a video based on a given prompt. Recently, several commercial video models have been able to generate plausible videos with minimal noise, excellent details, and high aesthetic scores. However, these models rely on large-scale, well-filtered, high-quality videos that are not accessible to the community. Many existing research works, which train models using the low-quality WebVid-10M dataset, struggle to generate high-quality videos because the models are optimized to fit WebVid-10M. In this work, we explore the training scheme of video models extended from Stable Diffusion and investigate the feasibility of leveraging low-quality videos and synthesized high-quality images to obtain a high-quality video model. We first analyze the connection between the spatial and temporal modules of video models and the distribution shift to low-quality videos. We observe that full training of all modules results in a stronger coupling between spatial and temporal modules than only training temporal modules. Based on this stronger coupling, we shift the distribution to higher quality without motion degradation by finetuning spatial modules with high-quality images, resulting in a generic high-quality video model. Evaluations are conducted to demonstrate the superiority of the proposed method, particularly in picture quality, motion, and concept composition.
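
The disentangled finetuning step described in the abstract (freeze temporal modules, update only spatial modules on high-quality images) amounts to selecting parameters before building the optimizer. The name-based selection below assumes module names contain "temporal", which is an illustrative convention rather than the actual VideoCrafter2 code.

```python
# Freeze temporal modules and finetune only spatial ones, schematic only.
import torch
import torch.nn as nn

class ToyVDMBlock(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.spatial_conv = nn.Conv2d(dim, dim, 3, padding=1)
        self.temporal_conv = nn.Conv1d(dim, dim, 3, padding=1)

model = nn.ModuleList([ToyVDMBlock() for _ in range(4)])

for name, param in model.named_parameters():
    param.requires_grad = "temporal" not in name     # freeze temporal modules

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5)
print(trainable)   # only spatial parameters get finetuned on high-quality images
```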


Title: InstantID: Zero-shot Identity-Preserving Generation in Seconds

Authors: Qixun Wang, Xu Bai, Haofan Wang

PubTime: 2024-01-15

Downlink: http://arxiv.org/abs/2401.07519v1

Project: https://instantid.github.io/|

GitHub: https://github.com/InstantID/InstantID.|

Abstract: There has been significant progress in personalized image synthesis with methods such as Textual Inversion, DreamBooth, and LoRA. Yet, their real-world applicability is hindered by high storage demands, lengthy fine-tuning processes, and the need for multiple reference images. Conversely, existing ID embedding-based methods, while requiring only a single forward inference, face challenges: they either necessitate extensive fine-tuning across numerous model parameters, lack compatibility with community pre-trained models, or fail to maintain high face fidelity. Addressing these limitations, we introduce InstantID, a powerful diffusion model-based solution. Our plug-and-play module adeptly handles image personalization in various styles using just a single facial image, while ensuring high fidelity. To achieve this, we design a novel IdentityNet by imposing strong semantic and weak spatial conditions, integrating facial and landmark images with textual prompts to steer the image generation. InstantID demonstrates exceptional performance and efficiency, proving highly beneficial in real-world applications where identity preservation is paramount. Moreover, our work seamlessly integrates with popular pre-trained text-to-image diffusion models like SD1.5 and SDXL, serving as an adaptable plugin. Our code and pre-trained checkpoints will be available at https://github.com/InstantID/InstantID.


Title: Crossway Diffusion: Improving Diffusion-based Visuomotor Policy via Self-supervised Learning

Authors: Xiang Li, Varun Belagali, Jinghuan Shang

PubTime: 2024-01-11

Downlink: http://arxiv.org/abs/2307.01849v3

Project: https://youtu.be/9deKHueZBuk|

GitHub: https://github.com/LostXine/crossway_diffusion|

Abstract: Sequence modeling approaches have shown promising results in robot imitation learning. Recently, diffusion models have been adopted for behavioral cloning in a sequence modeling fashion, benefiting from their exceptional capabilities in modeling complex data distributions. The standard diffusion-based policy iteratively generates action sequences from random noise conditioned on the input states. Nonetheless, the model for diffusion policy can be further improved in terms of visual representations. In this work, we propose Crossway Diffusion, a simple yet effective method to enhance diffusion-based visuomotor policy learning via a carefully designed state decoder and an auxiliary self-supervised learning (SSL) objective. The state decoder reconstructs raw image pixels and other state information from the intermediate representations of the reverse diffusion process. The whole model is jointly optimized by the SSL objective and the original diffusion loss. Our experiments demonstrate the effectiveness of Crossway Diffusion in various simulated and real-world robot tasks, confirming its consistent advantages over the standard diffusion-based policy and substantial improvements over the baselines.


Title: ControlDreamer: Stylized 3D Generation with Multi-View ControlNet

Authors: Yeongtak Oh, Jooyoung Choi, Yongsung Kim

PubTime: 2024-01-05

Downlink: http://arxiv.org/abs/2312.01129v2

Project: https://controldreamer.github.io/|

Abstract: Recent advancements in text-to-3D generation have significantly contributed to the automation and democratization of 3D content creation. Building upon these developments, we aim to address the limitations of current methods in generating 3D models with creative geometry and styles. We introduce multi-view ControlNet, a novel depth-aware multi-view diffusion model trained on generated datasets from a carefully curated text corpus. Our multi-view ControlNet is then integrated into our two-stage pipeline, ControlDreamer, enabling text-guided generation of stylized 3D models. Additionally, we present a comprehensive benchmark for 3D style editing, encompassing a broad range of subjects, including objects, animals, and characters, to further facilitate research on diverse 3D generation. Our comparative analysis reveals that this new pipeline outperforms existing text-to-3D methods as evidenced by human evaluations and CLIP score metrics.


Title: Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions

Authors: David Junhao Zhang, Dongxu Li, Hung Le

PubTime: 2024-01-03

Downlink: http://arxiv.org/abs/2401.01827v1

Project: https://showlab.github.io/Moonshot/|

GitHub: https://github.com/salesforce/LAVIS.|

Abstract: Most existing video diffusion models (VDMs) are limited to mere text conditions. Thereby, they are usually lacking in control over the visual appearance and geometry structure of the generated videos. This work presents Moonshot, a new video generation model that conditions simultaneously on multimodal inputs of image and text. The model builds upon a core module, called the multimodal video block (MVB), which consists of conventional spatiotemporal layers for representing video features, and a decoupled cross-attention layer to address image and text inputs for appearance conditioning. In addition, we carefully design the model architecture such that it can optionally integrate with pre-trained image ControlNet modules for geometry visual conditions, without needing extra training overhead, in contrast to prior methods. Experiments show that with versatile multimodal conditioning mechanisms, Moonshot demonstrates significant improvement in visual quality and temporal consistency compared to existing models. In addition, the model can be easily repurposed for a variety of generative applications, such as personalized video generation, image animation and video editing, unveiling its potential to serve as a fundamental architecture for controllable video generation. Models will be made public on https://github.com/salesforce/LAVIS.


Title: CoMoSVC: Consistency Model-based Singing Voice Conversion

Authors: Yiwen Lu, Zhen Ye, Wei Xue

PubTime: 2024-01-03

Downlink: http://arxiv.org/abs/2401.01792v1

Project: https://comosvc.github.io/.|

Abstract: Diffusion-based Singing Voice Conversion (SVC) methods have achieved remarkable performance, producing natural audio with high similarity to the target timbre. However, the iterative sampling process results in slow inference speed, and acceleration thus becomes crucial. In this paper, we propose CoMoSVC, a consistency model-based SVC method, which aims to achieve both high-quality generation and high-speed sampling. A diffusion-based teacher model is first specially designed for SVC, and a student model is further distilled under self-consistency properties to achieve one-step sampling. Experiments on a single NVIDIA GTX4090 GPU reveal that although CoMoSVC has a significantly faster inference speed than the state-of-the-art (SOTA) diffusion-based SVC system, it still achieves comparable or superior conversion performance based on both subjective and objective metrics. Audio samples and code are available at https://comosvc.github.io/.
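
Consistency distillation, the mechanism behind the one-step sampling, can be sketched as enforcing agreement between the student's outputs at two adjacent points on the teacher's denoising trajectory, with an EMA copy of the student providing the target. All networks, the noising rule, and the teacher step below are toy placeholders, not the CoMoSVC models.

```python
# Toy consistency-distillation objective for one-step sampling.
import copy
import torch
import torch.nn as nn

student = nn.Sequential(nn.Linear(81, 128), nn.ReLU(), nn.Linear(128, 80))
ema_student = copy.deepcopy(student).requires_grad_(False)   # target network

def consistency_fn(net, x_t, t):
    return net(torch.cat([x_t, t], dim=-1))          # predicts the clean sample

def teacher_ode_step(x_t, t_hi, t_lo):
    # Placeholder for one step of the pretrained diffusion teacher's ODE solver.
    return x_t * (t_lo / t_hi)

x0 = torch.randn(32, 80)                              # e.g. mel-spectrogram frames
t_hi, t_lo = torch.full((32, 1), 0.8), torch.full((32, 1), 0.6)
noise = torch.randn_like(x0)
x_hi = x0 + t_hi * noise                              # toy noising, not a real schedule
x_lo = teacher_ode_step(x_hi, t_hi, t_lo)             # adjacent point on the trajectory

loss = nn.functional.mse_loss(
    consistency_fn(student, x_hi, t_hi),
    consistency_fn(ema_student, x_lo, t_lo).detach()) # self-consistency target
loss.backward()
```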


== VLN ==

Title: ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models

Authors: Rohan Wadhawan, Hritik Bansal, Kai-Wei Chang

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2401.13311v1

Project: https://con-textual.github.io/|

Abstract: Recent advancements in AI have led to the development of large multimodal models (LMMs) capable of processing complex tasks involving joint reasoning over text and visual content in the image (e.g., navigating maps in public places). This paper introduces ConTextual, a novel benchmark comprising instructions designed explicitly to evaluate LMMs' ability to perform context-sensitive text-rich visual reasoning. ConTextual emphasizes diverse real-world scenarios (e.g., time-reading, navigation, shopping, and more) demanding a deeper understanding of the interactions between textual and visual elements. Our findings reveal a significant performance gap of 30.8% between the best-performing LMM, GPT-4V(ision), and human capabilities using human evaluation, indicating substantial room for improvement in context-sensitive text-rich visual reasoning. Notably, while GPT-4V excelled in abstract categories like meme and quote interpretation, its overall performance still lagged behind humans. In addition to human evaluations, we also employed automatic evaluation metrics using GPT-4, uncovering similar trends in performance disparities. We also perform a fine-grained evaluation across diverse visual contexts and provide a qualitative analysis which offers a robust framework for future advancements in LMM design. https://con-textual.github.io/


Title: SemanticSLAM: Learning based Semantic Map Construction and Robust Camera Localization

Authors: Mingyang Li, Yue Ma, Qinru Qiu

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2401.13076v1

GitHub: https://github.com/Leomingyangli/SemanticSLAM|

Abstract: Current techniques in Visual Simultaneous Localization and Mapping (VSLAM) estimate camera displacement by comparing image features of consecutive scenes. These algorithms depend on scene continuity and hence require frequent camera inputs. However, processing images frequently can lead to significant memory usage and computation overhead. In this study, we introduce SemanticSLAM, an end-to-end visual-inertial odometry system that utilizes semantic features extracted from an RGB-D sensor. This approach enables the creation of a semantic map of the environment and ensures reliable camera localization. SemanticSLAM is scene-agnostic, which means it doesn't require retraining for different environments. It operates effectively in indoor settings, even with infrequent camera input, without prior knowledge. The strength of SemanticSLAM lies in its ability to gradually refine the semantic map and improve pose estimation. This is achieved by a convolutional long-short-term-memory (ConvLSTM) network, trained to correct errors during map construction. Compared to existing VSLAM algorithms, SemanticSLAM improves pose estimation by 17%. The resulting semantic map provides interpretable information about the environment and can be easily applied to various downstream tasks, such as path planning, obstacle avoidance, and robot navigation. The code will be publicly available at https://github.com/Leomingyangli/SemanticSLAM
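
Since the abstract attributes the gradual map refinement to a ConvLSTM, a generic ConvLSTM cell is sketched below. This is a standard textbook-style implementation for illustration only, not the SemanticSLAM network, which additionally fuses observations into the map frame.

```python
# Generic ConvLSTM cell that accumulates a spatial memory over time steps.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        gates = self.conv(torch.cat([x, h], dim=1))
        i, f, o, g = gates.chunk(4, dim=1)
        i, f, o, g = i.sigmoid(), f.sigmoid(), o.sigmoid(), g.tanh()
        c = f * c + i * g                      # update the cell (map) memory
        h = o * c.tanh()
        return h, c

cell = ConvLSTMCell(in_ch=8, hid_ch=16)
obs = torch.randn(1, 8, 32, 32)                # per-step semantic observation grid
h = c = torch.zeros(1, 16, 32, 32)             # running map memory
for _ in range(5):                             # refine the map over several steps
    h, c = cell(obs, (h, c))
print(h.shape)
```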


Title: ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments

Authors: Dong An, Hanqing Wang, Wenguan Wang

PubTime: 2024-01-22

Downlink: http://arxiv.org/abs/2304.03047v3

GitHub: https://github.com/MarSaKi/ETPNav

Abstract: Vision-language navigation is a task that requires an agent to follow instructions to navigate in environments. It becomes increasingly crucial in the field of embodied AI, with potential applications in autonomous navigation, search and rescue, and human-robot interaction. In this paper, we propose to address a more practical yet challenging counterpart setting: vision-language navigation in continuous environments (VLN-CE). To develop a robust VLN-CE agent, we propose a new navigation framework, ETPNav, which focuses on two critical skills: 1) the capability to abstract environments and generate long-range navigation plans, and 2) the ability of obstacle-avoiding control in continuous environments. ETPNav performs online topological mapping of environments by self-organizing predicted waypoints along a traversed path, without prior environmental experience. It privileges the agent to break down the navigation procedure into high-level planning and low-level control. Concurrently, ETPNav utilizes a transformer-based cross-modal planner to generate navigation plans based on topological maps and instructions. The plan is then performed through an obstacle-avoiding controller that leverages a trial-and-error heuristic to prevent navigation from getting stuck in obstacles. Experimental results demonstrate the effectiveness of the proposed method. ETPNav yields more than 10% and 20% improvements over prior state-of-the-art on the R2R-CE and RxR-CE datasets, respectively. Our code is available at https://github.com/MarSaKi/ETPNav.


Title: Multimotion Visual Odometry (MVO)

Authors: Kevin M. Judd, Jonathan D. Gammell

PubTime: 2024-01-15

Downlink: http://arxiv.org/abs/2110.15169v3

Project: https://www.youtube.com/watch?v=mNj3s1nf-6A|https://www.youtube.com/playlist?list=PLbaQBz4TuPcxMIXKh5Q80s0N9ISezFcpi|

Abstract: Visual motion estimation is a well-studied challenge in autonomous navigation. Recent work has focused on addressing multimotion estimation in highly dynamic environments. These environments not only comprise multiple, complex motions but also tend to exhibit significant occlusion. Estimating third-party motions simultaneously with the sensor egomotion is difficult because an object's observed motion consists of both its true motion and the sensor motion. Most previous works in multimotion estimation simplify this problem by relying on appearance-based object detection or application-specific motion constraints. These approaches are effective in specific applications and environments but do not generalize well to the full multimotion estimation problem (MEP). This paper presents Multimotion Visual Odometry (MVO), a multimotion estimation pipeline that estimates the full SE(3) trajectory of every motion in the scene, including the sensor egomotion, without relying on appearance-based information. MVO extends the traditional visual odometry (VO) pipeline with multimotion segmentation and tracking techniques. It uses physically founded motion priors to extrapolate motions through temporary occlusions and identify the reappearance of motions through motion closure. Evaluations on real-world data from the Oxford Multimotion Dataset (OMD) and the KITTI Vision Benchmark Suite demonstrate that MVO achieves good estimation accuracy compared to similar approaches and is applicable to a variety of multimotion estimation challenges.


Title: Learning Interactive Real-World Simulators

Authors: Mengjiao Yang, Yilun Du, Kamyar Ghasemipour

PubTime: 2024-01-13

Downlink: http://arxiv.org/abs/2310.06114v2

Project: https://universal-simulator.github.io

Abstract: Generative models trained on internet data have revolutionized how text, image, and video content can be created. Perhaps the next milestone for generative models is to simulate realistic experience in response to actions taken by humans, robots, and other interactive agents. Applications of a real-world simulator range from controllable content creation in games and movies, to training embodied agents purely in simulation that can be directly deployed in the real world. We explore the possibility of learning a universal simulator of real-world interaction through generative modeling. We first make the important observation that natural datasets available for learning a real-world simulator are often rich along different dimensions (e.g., abundant objects in image data, densely sampled actions in robotics data, and diverse movements in navigation data). With careful orchestration of diverse datasets, each providing a different aspect of the overall experience, we can simulate the visual outcome of both high-level instructions such as "open the drawer" and low-level controls such as "move by x, y" from otherwise static scenes and objects. We use the simulator to train both high-level vision-language policies and low-level reinforcement learning policies, each of which can be deployed in the real world zero-shot after training purely in simulation. We also show that other types of intelligence, such as video captioning models, can benefit from training with simulated experience, opening up even wider applications. Video demos can be found at https://universal-simulator.github.io.


Title: Surgical-DINO: Adapter Learning of Foundation Models for Depth Estimation in Endoscopic Surgery

Authors: Beilei Cui, Mobarakol Islam, Long Bai

PubTime: 2024-01-12

Downlink: http://arxiv.org/abs/2401.06013v2

GitHub: https://github.com/BeileiCui/SurgicalDINO.|

Abstract: Purpose: Depth estimation in robotic surgery is vital in 3D reconstruction, surgical navigation and augmented reality visualization. Although foundation models exhibit outstanding performance in many vision tasks, including depth estimation (e.g., DINOv2), recent works observed their limitations in medical and surgical domain-specific applications. This work presents a low-rank adaptation (LoRA) of the foundation model for surgical depth estimation. Methods: We design a foundation model-based depth estimation method, referred to as Surgical-DINO, a low-rank adaptation of DINOv2 for depth estimation in endoscopic surgery. We build LoRA layers and integrate them into DINO to adapt to surgery-specific domain knowledge instead of conventional fine-tuning. During training, we freeze the DINO image encoder, which shows excellent visual representation capacity, and only optimize the LoRA layers and depth decoder to integrate features from the surgical scene. Results: Our model is extensively validated on the MICCAI challenge dataset SCARED, which is collected from da Vinci Xi endoscope surgery. We empirically show that Surgical-DINO significantly outperforms all the state-of-the-art models in endoscopic depth estimation tasks. The analysis with ablation studies shows evidence of the remarkable effect of our LoRA layers and adaptation. Conclusion: Surgical-DINO sheds some light on the successful adaptation of foundation models into the surgical domain for depth estimation. There is clear evidence in the results that zero-shot prediction on weights pre-trained on computer vision datasets or naive fine-tuning is not sufficient to use the foundation model in the surgical domain directly. Code is available at https://github.com/BeileiCui/SurgicalDINO.
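
The adapter recipe summarized in the abstract (freeze the pretrained encoder, add trainable low-rank updates to selected linear layers, and train only the LoRA parameters plus a small depth head) can be sketched as follows. The toy MLP stands in for DINOv2; loading the real backbone and the actual depth decoder are out of scope here.

```python
# LoRA injection into a frozen encoder, with only adapters and a head trainable.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=4, alpha=4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

encoder = nn.Sequential(nn.Linear(384, 384), nn.GELU(), nn.Linear(384, 384))
for p in encoder.parameters():
    p.requires_grad = False                               # freeze the whole encoder
encoder[0] = LoRALinear(encoder[0])                       # adapt selected layers
encoder[2] = LoRALinear(encoder[2])
depth_head = nn.Linear(384, 1)                            # trainable decoder head

trainable = [n for n, p in nn.ModuleList([encoder, depth_head]).named_parameters()
             if p.requires_grad]
print(trainable)    # only LoRA A/B matrices and the depth head are updated
```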


Subscribe to papers in your field of interest

Follow {晓理紫|小李子} for daily paper updates. If you find this useful, please forward it to others who may need it. Thank you for your support, and thank you for your suggestions.

If you find this helpful, please follow me; the latest papers will be pushed to you on time every day.
