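Categories:
- Large language models (LLM)
- Vision-language models (VLM)
- Diffusion models
- Visual navigation
- Embodied AI, robotics
- Reinforcement learning
- Open vocabulary, detection and segmentation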
Abstract: The ability to perform causal reasoning is widely considered a core feature of intelligence. In this work, we investigate whether large language models (LLMs) can coherently reason about causality. Much of the existing work in natural language processing (NLP) focuses on evaluating commonsense causal reasoning in LLMs, thus failing to assess whether a model can perform causal inference in accordance with a set of well-defined formal rules. To address this, we propose a new NLP task, causal inference in natural language, inspired by the “causal inference engine” postulated by Judea Pearl et al. We compose a large dataset, CLadder, with 10K samples: based on a collection of causal graphs and queries (associational, interventional, and counterfactual), we obtain symbolic questions and ground-truth answers through an oracle causal inference engine. These are then translated into natural language. We evaluate multiple LLMs on our dataset, and we introduce and evaluate a bespoke chain-of-thought prompting strategy, CausalCoT. We show that our task is highly challenging for LLMs, and we conduct an in-depth analysis to gain deeper insights into the causal reasoning abilities of LLMs. Our data is open-sourced at https://huggingface.co/datasets/causalNLP/cladder, and our code can be found at https://github.com/causalNLP/cladder.
[Downlink:]http://arxiv.org/abs/2312.04350v3
[Project:]https://huggingface.co/datasets/causalNLP/cladder|
[GitHub:]https://github.com/causalNLP/cladder|
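To make the "interventional" query type in the CLadder abstract concrete, here is a minimal sketch of the backdoor adjustment formula, P(Y=1 | do(X=x)) = Σ_z P(Y=1 | X=x, Z=z) P(Z=z), on a confounded graph Z → X, Z → Y, X → Y. This is not the authors' oracle engine; the graph, the probability tables, and the function name `p_y_do_x` are all assumptions made up for illustration.

```python
def p_y_do_x(p_z, p_y_given_xz, x):
    """Backdoor adjustment: sum over the confounder Z instead of conditioning on X alone."""
    return sum(p_y_given_xz[(x, z)] * pz for z, pz in p_z.items())


# Hypothetical probability tables (not taken from CLadder).
p_z = {0: 0.6, 1: 0.4}                      # P(Z=z)
p_y_given_xz = {                            # P(Y=1 | X=x, Z=z), keyed by (x, z)
    (0, 0): 0.2, (0, 1): 0.5,
    (1, 0): 0.4, (1, 1): 0.9,
}

# Average treatment effect: P(Y=1 | do(X=1)) - P(Y=1 | do(X=0))
ate = p_y_do_x(p_z, p_y_given_xz, 1) - p_y_do_x(p_z, p_y_given_xz, 0)
print(round(ate, 2))
```

A benchmark question of this kind would state such probabilities in natural language and ask whether the intervention raises or lowers the chance of the outcome.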
Abstract: We introduce EarthPT – an Earth Observation (EO) pretrained transformer. EarthPT is a 700 million parameter decoding transformer foundation model trained in an autoregressive self-supervised manner and developed specifically with EO use-cases in mind. We demonstrate that EarthPT is an effective forecaster that can accurately predict future pixel-level surface reflectances across the 400-2300 nm range well into the future. For example, forecasts of the evolution of the Normalised Difference Vegetation Index (NDVI) have a typical error of approximately 0.05 (over a natural range of -1 to 1) at the pixel level over a five-month test set horizon, outperforming simple phase-folded models based on historical averaging. We also demonstrate that embeddings learnt by EarthPT hold semantically meaningful information and could be exploited for downstream tasks such as highly granular, dynamic land use classification. Excitingly, we note that the abundance of EO data provides us with – in theory – quadrillions of training tokens. Therefore, if we assume that EarthPT follows neural scaling laws akin to those derived for Large Language Models (LLMs), there is currently no data-imposed limit to scaling EarthPT and other similar ‘Large Observation Models’.
[Downlink:]http://arxiv.org/abs/2309.07207v2
[Project:]https://www.climatechange.ai/papers/neurips2023/2|
[GitHub:]https://github.com/aspiaspace/EarthPT|
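For readers unfamiliar with the NDVI figure quoted in the EarthPT abstract, the sketch below computes NDVI from near-infrared and red reflectance using the standard definition, NDVI = (NIR - Red) / (NIR + Red), and scores a forecast against it. The reflectance values and the forecast are made-up numbers, and mean absolute error is an assumed stand-in for the "typical error" the abstract mentions; the paper's exact metric is not stated here.

```python
def ndvi(nir, red):
    """Normalised Difference Vegetation Index for one pixel; 0.0 if both bands are zero."""
    total = nir + red
    return 0.0 if total == 0 else (nir - red) / total


def mean_absolute_error(predicted, observed):
    """Average absolute difference between forecast and observed values."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


# Hypothetical (NIR, red) reflectances for two pixels, and a hypothetical NDVI forecast.
observed = [ndvi(0.50, 0.10), ndvi(0.40, 0.20)]
forecast = [0.70, 0.30]

mae = mean_absolute_error(forecast, observed)
print(round(mae, 3))
```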
Abstract: Large language models (LLMs) have achieved remarkable performance on various NLP tasks and are augmented by tools for broader applications. Yet, how to evaluate and analyze the tool-utilization capability of LLMs is still under-explored. In contrast to previous works that evaluate models holistically, we comprehensively decompose tool utilization into multiple sub-processes, including instruction following, planning, reasoning, retrieval, understanding, and review. Based on that, we further introduce T-Eval to evaluate the tool-utilization capability step by step. T-Eval disentangles the tool-utilization evaluation into several sub-domains along model capabilities, facilitating the inner understanding of both the holistic and isolated competency of LLMs. We conduct extensive experiments on T-Eval and in-depth analysis of various LLMs. T-Eval not only exhibits consistency with the outcome-oriented evaluation but also provides a more fine-grained analysis of the capabilities of LLMs, providing a new perspective in LLM evaluation on tool-utilization ability. The benchmark will be available at https://github.com/open-compass/T-Eval.
[Downlink:]http://arxiv.org/abs/2312.14033v3
[Project:]https://open-compass.github.io/T-Eval|
[GitHub:]https://github.com/open-compass/T-Eval|
Abstract: The capabilities of the most recent language models have increased the
interest in integrating them into real-world applications. However, the fact
that these models generate plausible, yet incorrect text poses a constraint
when considering their use in several domains. Healthcare is a prime example of
a domain where text-generative trustworthiness is a hard requirement to
safeguard patient well-being. In this paper, we present Physio, a chat-based
application for physical rehabilitation. Physio is capable of making an initial
diagnosis while citing reliable health sources to support the information
provided. Furthermore, drawing upon external knowledge databases, Physio can
recommend rehabilitation exercises and over-the-counter medication for symptom
relief. By combining these features, Physio can leverage the power of
generative models for language processing while also conditioning its response
on dependable and verifiable sources. A live demo of Physio is available at
https://physio.inesctec.pt.
[Downlink:]http://arxiv.org/abs/2401.01825v1
[Project:]https://physio.inesctec.pt|
Abstract: Large Language Models (LLMs) have shown extraordinary capabilities in understanding and generating text that closely mirrors human communication. However, a primary limitation lies in the significant computational demands during training, arising from their extensive parameterization. This challenge is further intensified by the dynamic nature of the world, necessitating frequent updates to LLMs to correct outdated information or integrate new knowledge, thereby ensuring their continued relevance. Note that many applications demand continual model adjustments post-training to address deficiencies or undesirable behaviors. There is an increasing interest in efficient, lightweight methods for on-the-fly model modifications. To this end, recent years have seen a surge in knowledge editing techniques for LLMs, which aim to efficiently modify LLMs’ behaviors within specific domains while preserving overall performance across various inputs. In this paper, we first define the knowledge editing problem and then provide a comprehensive review of cutting-edge approaches. Drawing inspiration from educational and cognitive research theories, we propose a unified categorization criterion that classifies knowledge editing methods into three groups: resorting to external knowledge, merging knowledge into the model, and editing intrinsic knowledge. Furthermore, we introduce a new benchmark, KnowEdit, for a comprehensive empirical evaluation of representative knowledge editing approaches. Additionally, we provide an in-depth analysis of knowledge location, which can give a deeper understanding of the knowledge structures inherent within LLMs. Finally, we discuss several potential applications of knowledge editing, outlining its broad and impactful implications.
[Downlink:]http://arxiv.org/abs/2401.01286v3
[Project:]https://huggingface.co/datasets/zjunlp/KnowEdit|
[GitHub:]https://github.com/zjunlp/EasyEdit|https://github.com/zjunlp/KnowledgeEditingPapers|
Abstract: Sequence modeling approaches have shown promising results in robot imitation learning. Recently, diffusion models have been adopted for behavioral cloning in a sequence modeling fashion, benefiting from their exceptional capabilities in modeling complex data distributions. The standard diffusion-based policy iteratively generates action sequences from random noise conditioned on the input states. Nonetheless, the model for diffusion policy can be further improved in terms of visual representations. In this work, we propose Crossway Diffusion, a simple yet effective method to enhance diffusion-based visuomotor policy learning via a carefully designed state decoder and an auxiliary self-supervised learning (SSL) objective. The state decoder reconstructs raw image pixels and other state information from the intermediate representations of the reverse diffusion process. The whole model is jointly optimized by the SSL objective and the original diffusion loss. Our experiments demonstrate the effectiveness of Crossway Diffusion in various simulated and real-world robot tasks, confirming its consistent advantages over the standard diffusion-based policy and substantial improvements over the baselines.
[Downlink:]http://arxiv.org/abs/2307.01849v3
[Project:]https://youtu.be/9deKHueZBuk|
[GitHub:]https://github.com/LostXine/crossway_diffusion|
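The joint optimization described in the Crossway Diffusion abstract can be pictured as a weighted sum of two losses. The sketch below is only an illustration under assumptions: it uses mean squared error for both terms, and the weight `alpha` and all names are hypothetical rather than taken from the authors' code.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def joint_loss(pred_noise, true_noise, recon_state, true_state, alpha=0.1):
    """Diffusion (noise-prediction) loss plus a weighted state-reconstruction (SSL) loss."""
    diffusion_loss = mse(pred_noise, true_noise)
    ssl_loss = mse(recon_state, true_state)   # state decoder output vs. original state
    return diffusion_loss + alpha * ssl_loss  # alpha is an assumed balancing weight


# Toy values: diffusion loss = 0.5, reconstruction loss = 2.0
loss = joint_loss([0.0, 1.0], [0.0, 0.0], [1.0, 1.0], [1.0, 3.0], alpha=0.5)
print(loss)
```

In a real training loop both terms would be computed on tensors and backpropagated together, so the shared encoder receives gradients from the reconstruction task as well as from denoising.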
Abstract: Most existing video diffusion models (VDMs) are limited to mere text
conditions. Thereby, they are usually lacking in control over visual appearance
and geometry structure of the generated videos. This work presents Moonshot, a
new video generation model that conditions simultaneously on multimodal inputs
of image and text. The model builds upon a core module, called multimodal video
block (MVB), which consists of conventional spatial-temporal layers for
representing video features, and a decoupled cross-attention layer to address
image and text inputs for appearance conditioning. In addition, we carefully
design the model architecture such that it can optionally integrate with
pre-trained image ControlNet modules for geometry visual conditions, without
needing extra training overhead as opposed to prior methods. Experiments
show that with versatile multimodal conditioning mechanisms, Moonshot
demonstrates significant improvement on visual quality and temporal consistency
compared to existing models. In addition, the model can be easily repurposed
for a variety of generative applications, such as personalized video
generation, image animation and video editing, unveiling its potential to serve
as a fundamental architecture for controllable video generation. Models will be
made public on https://github.com/salesforce/LAVIS.
[Downlink:]http://arxiv.org/abs/2401.01827v1
[Project:]https://showlab.github.io/Moonshot/|
[GitHub:]https://github.com/salesforce/LAVIS|
Abstract: Recently, text-guided scalable vector graphics (SVGs) synthesis has shown
promise in domains such as iconography and sketch. However, existing
text-to-SVG generation methods lack editability and struggle with visual
quality and result diversity. To address these limitations, we propose a novel
text-guided vector graphics synthesis method called SVGDreamer. SVGDreamer
incorporates a semantic-driven image vectorization (SIVE) process that enables
the decomposition of synthesis into foreground objects and background, thereby
enhancing editability. Specifically, the SIVE process introduces attention-based
primitive control and an attention-mask loss function for effective control and
manipulation of individual elements. Additionally, we propose a Vectorized
Particle-based Score Distillation (VPSD) approach to tackle the challenges of
color over-saturation, vector primitives over-smoothing, and limited result
diversity in existing text-to-SVG generation methods. Furthermore, on the basis
of VPSD, we introduce Reward Feedback Learning (ReFL) to accelerate VPSD
convergence and improve aesthetic appeal. Extensive experiments have been
conducted to validate the effectiveness of SVGDreamer, demonstrating its
superiority over baseline methods in terms of editability, visual quality, and
diversity. The code and demo of SVGDreamer can be found at
https://ximinng.github.io/SVGDreamer-project/.
[Downlink:]http://arxiv.org/abs/2312.16476v2
[Project:]https://ximinng.github.io/SVGDreamer-project/|
Abstract: Text-to-video generation aims to produce a video based on a given prompt. Recently, several commercial video models have been able to generate plausible videos with minimal noise, excellent details, and high aesthetic scores. However, these models rely on large-scale, well-filtered, high-quality videos that are not accessible to the community. Many existing research works, which train models using the low-quality WebVid-10M dataset, struggle to generate high-quality videos because the models are optimized to fit WebVid-10M. In this work, we explore the training scheme of video models extended from Stable Diffusion and investigate the feasibility of leveraging low-quality videos and synthesized high-quality images to obtain a high-quality video model. We first analyze the connection between the spatial and temporal modules of video models and the distribution shift to low-quality videos. We observe that full training of all modules results in a stronger coupling between spatial and temporal modules than only training temporal modules. Based on this stronger coupling, we shift the distribution to higher quality without motion degradation by finetuning spatial modules with high-quality images, resulting in a generic high-quality video model. Evaluations are conducted to demonstrate the superiority of the proposed method, particularly in picture quality, motion, and concept composition.
[Downlink:]http://arxiv.org/abs/2401.09047v1
[Project:]https://ailab-cvc.github.io/videocrafter|
[GitHub:]https://github.com/AILab-CVC/VideoCrafter|
Abstract: There has been significant progress in personalized image synthesis with methods such as Textual Inversion, DreamBooth, and LoRA. Yet, their real-world applicability is hindered by high storage demands, lengthy fine-tuning processes, and the need for multiple reference images. Conversely, existing ID embedding-based methods, while requiring only a single forward inference, face challenges: they either necessitate extensive fine-tuning across numerous model parameters, lack compatibility with community pre-trained models, or fail to maintain high face fidelity. Addressing these limitations, we introduce InstantID, a powerful diffusion model-based solution. Our plug-and-play module adeptly handles image personalization in various styles using just a single facial image, while ensuring high fidelity. To achieve this, we design a novel IdentityNet by imposing strong semantic and weak spatial conditions, integrating facial and landmark images with textual prompts to steer the image generation. InstantID demonstrates exceptional performance and efficiency, proving highly beneficial in real-world applications where identity preservation is paramount. Moreover, our work seamlessly integrates with popular pre-trained text-to-image diffusion models like SD1.5 and SDXL, serving as an adaptable plugin. Our code and pre-trained checkpoints will be available at https://github.com/InstantID/InstantID.
[Downlink:]http://arxiv.org/abs/2401.07519v1
[Project:]https://instantid.github.io/|
[GitHub:]https://github.com/InstantID/InstantID|
Abstract: The diffusion-based Singing Voice Conversion (SVC) methods have achieved
remarkable performances, producing natural audios with high similarity to the
target timbre. However, the iterative sampling process results in slow
inference speed, and acceleration thus becomes crucial. In this paper, we
propose CoMoSVC, a consistency model-based SVC method, which aims to achieve
both high-quality generation and high-speed sampling. A diffusion-based teacher
model is first specially designed for SVC, and a student model is further
distilled under self-consistency properties to achieve one-step sampling.
Experiments on a single NVIDIA GTX4090 GPU reveal that although CoMoSVC has a
significantly faster inference speed than the state-of-the-art (SOTA)
diffusion-based SVC system, it still achieves comparable or superior conversion
performance based on both subjective and objective metrics. Audio samples and
codes are available at https://comosvc.github.io/.
[Downlink:]http://arxiv.org/abs/2401.01792v1
[Project:]https://comosvc.github.io/|
Abstract: Purpose: Depth estimation in robotic surgery is vital in 3D reconstruction, surgical navigation and augmented reality visualization. Although the foundation model exhibits outstanding performance in many vision tasks, including depth estimation (e.g., DINOv2), recent works observed its limitations in medical and surgical domain-specific applications. This work presents a low-rank adaptation (LoRA) of the foundation model for surgical depth estimation. Methods: We design a foundation model-based depth estimation method, referred to as Surgical-DINO, a low-rank adaptation of DINOv2 for depth estimation in endoscopic surgery. We build LoRA layers and integrate them into DINO to adapt to surgery-specific domain knowledge instead of conventional fine-tuning. During training, we freeze the DINO image encoder, which shows excellent visual representation capacity, and only optimize the LoRA layers and depth decoder to integrate features from the surgical scene. Results: Our model is extensively validated on SCARED, a MICCAI challenge dataset collected from da Vinci Xi endoscope surgery. We empirically show that Surgical-DINO significantly outperforms all the state-of-the-art models in endoscopic depth estimation tasks. The analysis with ablation studies has shown evidence of the remarkable effect of our LoRA layers and adaptation. Conclusion: Surgical-DINO sheds some light on the successful adaptation of foundation models to the surgical domain for depth estimation. There is clear evidence in the results that zero-shot prediction with weights pre-trained on computer vision datasets, or naive fine-tuning, is not sufficient to use the foundation model in the surgical domain directly. Code is available at https://github.com/BeileiCui/SurgicalDINO.
[Downlink:]http://arxiv.org/abs/2401.06013v2
[GitHub:]https://github.com/BeileiCui/SurgicalDINO|
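The LoRA idea named in the Surgical-DINO abstract (freeze the pretrained weights, train only a small low-rank update) can be sketched in a few lines. This is a generic illustration of low-rank adaptation in plain Python, not the authors' implementation; the matrices, the `scale` parameter, and the names `down` and `up` are assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def lora_forward(x, w_frozen, down, up, scale=1.0):
    """y = x W + scale * (x down) up; only `down` and `up` would receive gradients."""
    base = matmul(x, w_frozen)
    update = matmul(matmul(x, down), up)
    return [[b + scale * u for b, u in zip(b_row, u_row)]
            for b_row, u_row in zip(base, update)]


x = [[1.0, 2.0]]                      # one input row with 2 features
w = [[1.0, 0.0], [0.0, 1.0]]          # frozen 2x2 weight (stands in for the encoder)
down = [[1.0], [1.0]]                 # 2x1 projection, rank r = 1
up_init = [[0.0, 0.0]]                # 1x2, zero-initialised: adapter starts as a no-op
up_trained = [[0.5, -0.5]]            # hypothetical values after training

print(lora_forward(x, w, down, up_init))     # same as the frozen model
print(lora_forward(x, w, down, up_trained))  # frozen output plus low-rank correction
```

With rank r much smaller than the layer width, the two small matrices hold far fewer trainable parameters than the frozen weight they adapt.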
Abstract: Multimodal information extraction (MIE) gains significant attention as the popularity of multimedia content increases. However, current MIE methods often resort to using task-specific model structures, which results in limited generalizability across tasks and underutilizes shared knowledge across MIE tasks. To address these issues, we propose UMIE, a unified multimodal information extractor that unifies three MIE tasks as a generation problem using instruction tuning and is able to effectively extract both textual and visual mentions. Extensive experiments show that our single UMIE outperforms various state-of-the-art (SoTA) methods across six MIE datasets on three tasks. Furthermore, in-depth analysis demonstrates UMIE’s strong generalization in the zero-shot setting, robustness to instruction variants, and interpretability. Our research serves as an initial step towards a unified MIE model and initiates the exploration into both instruction tuning and large language models within the MIE domain. Our code, data, and model are available at https://github.com/ZUCC-AI/UMIE
[Downlink:]http://arxiv.org/abs/2401.03082v1
[GitHub:]https://github.com/ZUCC-AI/UMIE|
Abstract: Visual place recognition (VPR) is an essential component of robot navigation and localization systems that allows them to identify a place using only image data. VPR is challenging due to the significant changes in a place’s appearance driven by different daily illumination, seasonal weather variations and diverse viewpoints. Currently, no single VPR technique excels in every environmental condition, each exhibiting unique benefits and shortcomings, and therefore combining multiple techniques can achieve more reliable VPR performance. Present multi-method approaches either rely on online ground-truth information, which is often not available, or on brute-force technique combination, potentially lowering performance with high-variance technique sets. Addressing these shortcomings, we propose a VPR system dubbed Multi-Sequential Information Consistency (MuSIC), which leverages sequential information to select the most cohesive technique on an online per-frame basis. For each technique in a set, MuSIC computes its sequential consistency by analysing the frame-to-frame continuity of its top match candidates; the consistencies are then directly compared to select the optimal technique for the current query image. The use of sequential information to select between VPR methods results in an overall VPR performance increase across different benchmark datasets, while avoiding the need for extra ground-truth of the runtime environment.
[Downlink:]http://arxiv.org/abs/2401.08263v1
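The per-frame selection step in the MuSIC abstract can be sketched as follows. The scoring rule here is an assumption chosen for illustration (it rewards top matches that advance by exactly one reference frame per query frame); the paper's actual consistency measure may differ, and the technique names and match indices are invented.

```python
def sequential_consistency(matches):
    """Score a technique's recent top matches; 0 is perfect frame-to-frame continuity."""
    steps = [later - earlier for earlier, later in zip(matches, matches[1:])]
    return -sum(abs(step - 1) for step in steps) / len(steps)


def select_technique(recent_matches):
    """Pick the technique whose recent top matches are the most sequentially consistent."""
    return max(recent_matches,
               key=lambda name: sequential_consistency(recent_matches[name]))


# Hypothetical top-match reference indices for the last four query frames.
recent = {
    "technique_a": [10, 11, 12, 13],   # smooth progression through the reference route
    "technique_b": [10, 57, 3, 14],    # erratic jumps, likely false matches
}

best = select_technique(recent)
print(best)
```

No ground truth is consulted: the choice relies only on how coherent each technique's own match sequence looks, which is the property the abstract highlights.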
Abstract: Suction cups are an important gripper type in industrial robot applications, and prior literature focuses on using vision-based planners to improve grasping success in these tasks. Without retraining the learned algorithms, vision-based planners can fail due to adversarial objects or lose generalizability in unseen scenarios. We propose haptic exploration to improve suction cup grasping when visual grasp planners fail. We present the Smart Suction Cup, an end-effector that utilizes internal flow measurements for tactile sensing. We show that model-based haptic search methods, guided by these flow measurements, improve grasping success by up to 2.5x as compared with using only a vision planner during a bin-picking task. In characterizing the Smart Suction Cup on both geometric edges and curves, we find that flow rate can accurately predict the ideal motion direction even with large postural errors. The Smart Suction Cup includes no electronics on the cup itself, such that the design is easy to fabricate and haptic exploration does not damage the sensor. This work motivates the use of suction cups with autonomous haptic search capabilities, especially in adversarial scenarios.
[Downlink:]http://arxiv.org/abs/2309.07360v2
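Abstract (translated from Chinese): Co-salient object detection (CoSOD) aims to replicate the human visual system's ability to recognize common and salient objects in a collection of images. Despite recent progress in deep learning models, these models still depend on training with well-annotated CoSOD datasets, and training-free zero-shot CoSOD frameworks have seen limited exploration. In this paper, inspired by the zero-shot transfer capabilities of foundational computer vision models, we introduce the first zero-shot CoSOD framework, which uses these models without any training process. To this end, the proposed framework introduces two new components: the group prompt generation (GPG) module and the co-saliency map generation (CMP) module. We evaluate the framework on widely used datasets and observe impressive results. Our approach surpasses existing unsupervised methods and even outperforms fully supervised methods developed before 2020, while remaining competitive with some fully supervised methods developed before 2022.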
Abstract: Co-salient Object Detection (CoSOD) endeavors to replicate the human visual system’s capacity to recognize common and salient objects within a collection of images. Despite recent advancements in deep learning models, these models still rely on training with well-annotated CoSOD datasets. The exploration of training-free zero-shot CoSOD frameworks has been limited. In this paper, taking inspiration from the zero-shot transfer capabilities of foundational computer vision models, we introduce the first zero-shot CoSOD framework that harnesses these models without any training process. To achieve this, we introduce two novel components in our proposed framework: the group prompt generation (GPG) module and the co-saliency map generation (CMP) module. We evaluate the framework’s performance on widely used datasets and observe impressive results. Our approach surpasses existing unsupervised methods and even outperforms fully supervised methods developed before 2020, while remaining competitive with some fully supervised methods developed before 2022.
[Downlink:]http://arxiv.org/abs/2309.05499v3
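Abstract (translated from Chinese): Navigation, perception, and decision-making are basic tasks for intelligent robots, and at their core lies the estimation of the necessary system states. Among them, navigation underpins other higher-level applications by integrating measurements from multiple sensors to provide precise position and orientation. With each sensor's observations appropriately modelled, multi-sensor fusion for navigation reduces to a state estimation problem that can be solved in two ways: optimization and filtering. Recent research shows that optimization-based frameworks outperform filtering-based ones in accuracy. However, both are based on maximum likelihood estimation (MLE) and should be theoretically equivalent given the same linearization points, observation models, measurements, and Gaussian noise assumption. In this paper, we examine in depth the theory and existing strategies used in optimization-based and filtering-based approaches. The results show that the two are theoretically equal, but that this equivalence breaks down because of the different strategies applied in real-time operation. After adjusting the strategies of existing filtering-based approaches, Monte-Carlo simulations and vehicular ablation experiments based on visual odometry (VO) indicate that the strategy-adjusted filter is strictly equal to optimization. Future research on sensor fusion should therefore concentrate on algorithms and strategies rather than on the choice of state estimation approach.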
Abstract: The essence of navigation, perception, and decision-making, which are basic tasks for intelligent robots, is to estimate the necessary system states. Among them, navigation is fundamental to other higher-level applications, providing precise position and orientation by integrating measurements from multiple sensors. With the observations of each sensor appropriately modelled, multi-sensor fusion tasks for navigation reduce to a state estimation problem that can be solved by two approaches: optimization and filtering. Recent research has shown that optimization-based frameworks outperform filtering-based ones in terms of accuracy. However, both methods are based on maximum likelihood estimation (MLE) and should be theoretically equivalent given the same linearization points, observation model, measurements, and Gaussian noise assumption. In this paper, we examine in depth the theories and existing strategies utilized in both optimization-based and filtering-based approaches. We demonstrate that the two methods are theoretically equal, but that this equivalence breaks down because of the different strategies applied in real-time operation. After adjusting the existing strategies of the filtering-based approaches, Monte-Carlo simulation and vehicular ablation experiments based on visual odometry (VO) indicate that the strategy-adjusted filtering is strictly equal to optimization. Therefore, future research on sensor-fusion problems should concentrate on their own algorithms and strategies rather than on state estimation approaches.
[Downlink:]http://arxiv.org/abs/2401.05836v1
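The equivalence claimed in the abstract above can be checked by hand in the simplest setting. The sketch below is not the paper's experiment: it assumes a static scalar state, a linear measurement model z = h·x + noise, and Gaussian noise with variance r, and every name and number in it is illustrative. Under those assumptions, processing the measurements one at a time with a Kalman update gives the same estimate and variance as solving the batch weighted least-squares problem in one shot.

```python
import math
from typing import List, Tuple


def kalman_update(x: float, p: float, z: float, h: float, r: float) -> Tuple[float, float]:
    """One scalar Kalman measurement update for z = h * x + noise(var=r)."""
    k = p * h / (h * p * h + r)      # Kalman gain
    x_new = x + k * (z - h * x)      # corrected mean
    p_new = (1.0 - k * h) * p        # corrected variance
    return x_new, p_new


def filter_estimate(x0: float, p0: float, zs: List[float],
                    h: float, r: float) -> Tuple[float, float]:
    """Filtering: fold the measurements in sequentially."""
    x, p = x0, p0
    for z in zs:
        x, p = kalman_update(x, p, z, h, r)
    return x, p


def batch_estimate(x0: float, p0: float, zs: List[float],
                   h: float, r: float) -> Tuple[float, float]:
    """Optimization: minimize (x - x0)^2 / p0 + sum_i (z_i - h x)^2 / r.

    The cost is quadratic in x, so its minimizer solves one normal equation.
    """
    information = 1.0 / p0 + len(zs) * h * h / r
    rhs = x0 / p0 + sum(h * z / r for z in zs)
    return rhs / information, 1.0 / information


# Illustrative prior and measurements.
x_f, p_f = filter_estimate(0.0, 4.0, [2.1, 1.9, 2.3], h=1.0, r=0.5)
x_b, p_b = batch_estimate(0.0, 4.0, [2.1, 1.9, 2.3], h=1.0, r=0.5)
print(math.isclose(x_f, x_b, rel_tol=1e-9), math.isclose(p_f, p_b, rel_tol=1e-9))
```

In this linear case there is only one linearization point, so the two results agree to floating-point precision. With a nonlinear model, the two would differ only through choices such as where each method relinearizes, which is the kind of strategy difference the abstract points to.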