Follow {晓理紫} on WeChat (VX) for daily paper updates. If you find this useful, please share it with classmates who may need it. Thank you for your support.
If this helps you, please follow me and I will deliver the latest papers to you on time every day.
Categories:
- Large language models (LLM)
- Vision-language models (VLM)
- Diffusion models
- Visual navigation
- Embodied AI and robotics
- Reinforcement learning
- Open vocabulary, detection and segmentation
PubTime: 2024-01-25
Downlink: http://arxiv.org/abs/2401.13527v2
GitHub: https://github.com/0nutation/SpeechGPT
Abstract: Benefiting from effective speech modeling, current Speech Large Language Models (SLLMs) have demonstrated exceptional capabilities in in-context speech generation and efficient generalization to unseen speakers. However, the prevailing information modeling process is encumbered by certain redundancies, leading to inefficiencies in speech generation. We propose Chain-of-Information Generation (CoIG), a method for decoupling semantic and perceptual information in large-scale speech generation. Building on this, we develop SpeechGPT-Gen, an 8-billion-parameter SLLM efficient in semantic and perceptual information modeling. It comprises an autoregressive model based on an LLM for semantic information modeling and a non-autoregressive model employing flow matching for perceptual information modeling. Additionally, we introduce the novel approach of infusing semantic information into the prior distribution to enhance the efficiency of flow matching. Extensive experimental results demonstrate that SpeechGPT-Gen markedly excels in zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue, underscoring CoIG’s remarkable proficiency in capturing and modeling speech’s semantic and perceptual dimensions. Code and models are available at https://github.com/0nutation/SpeechGPT.
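CoIG, as summarized above, splits generation into a semantic stage and a perceptual stage, with the semantic output injected into the prior of the flow-matching stage. The sketch below is a toy illustration of that control flow under assumed module shapes and a plain Euler sampler; it is not the SpeechGPT-Gen implementation.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the two stages of Chain-of-Information Generation (CoIG).
# Module choices, shapes, and the Euler integration are illustrative assumptions.

class SemanticLM(nn.Module):
    """Autoregressive stage: predicts the next semantic token."""
    def __init__(self, vocab=1024, dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):                 # tokens: (B, T)
        h, _ = self.rnn(self.emb(tokens))
        return self.head(h[:, -1])             # logits for the next token

class PerceptualFlow(nn.Module):
    """Non-autoregressive stage: a velocity field v(x, t | semantics) for flow matching."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim * 2 + 1, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, x, t, sem):              # x, sem: (B, D); t: (B, 1)
        return self.net(torch.cat([x, sem, t], dim=-1))

@torch.no_grad()
def generate(sem_lm, flow, prompt, steps=8):
    # 1) Semantic information: sample tokens autoregressively.
    tokens = prompt
    for _ in range(16):
        probs = torch.softmax(sem_lm(tokens), dim=-1)
        tokens = torch.cat([tokens, torch.multinomial(probs, 1)], dim=1)
    sem = sem_lm.emb(tokens).mean(dim=1)        # pooled semantic conditioning
    # 2) Perceptual information: flow matching whose starting point (prior) is
    #    built around the semantic embedding rather than pure noise.
    x = sem + 0.1 * torch.randn_like(sem)
    for i in range(steps):
        t = torch.full((x.size(0), 1), i / steps)
        x = x + flow(x, t, sem) / steps         # Euler step along the learned flow
    return tokens, x

tokens, feats = generate(SemanticLM(), PerceptualFlow(), torch.zeros(2, 1, dtype=torch.long))
print(tokens.shape, feats.shape)
```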
PubTime: 2024-01-25
Downlink: http://arxiv.org/abs/2311.11482v2
GitHub: https://github.com/meta-prompting/meta-prompting
Abstract: This paper presents a comprehensive study of Meta Prompting, an innovative technique reshaping the utilization of large language models (LLMs), multi-modal foundation models, and AI systems in problem-solving and data interpretation. Grounded in type theory and category theory, Meta Prompting emphasizes the structure and syntax of information over traditional content-centric methods. The paper explores the formal definitions of Meta Prompting (MP), sets it apart from Few-Shot Prompting, and underlines its effectiveness in various AI applications. A key focus is on extending Meta Prompting to complex reasoning tasks, showing how it effectively deconstructs intricate problems into simpler sub-problems, enhancing token efficiency and enabling more equitable problem-solving comparisons, especially against few-shot example methods. Additionally, the paper introduces Meta Prompting for Prompting Tasks, allowing LLMs to self-generate new prompts in an iterative, metaprogramming-like manner. This innovative approach marks a significant leap in AI’s autonomous and adaptive capabilities. The paper also pioneers the integration of Meta Prompting into multi-modal foundation model settings, tackling the challenges and opportunities of incorporating varied data types such as images, audio, and video within the structured Meta Prompting framework. (The code is available at https://github.com/meta-prompting/meta-prompting)
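Because Meta Prompting privileges the structure and syntax of a task over its content, a simple way to picture it is a prompt that specifies how a class of problems should be decomposed, in contrast with a few-shot prompt that lists worked examples. The template below is an illustrative sketch only; its wording is an assumption and is not taken from the paper or its repository.

```python
# A structure-oriented (meta) prompt versus a content-oriented few-shot prompt.
# Both templates are illustrative assumptions, not the paper's prompts.

META_PROMPT = """You are given a problem of type: {problem_type}.
Solve it using the following structure:
1. Restate the problem formally (define the symbols and their types).
2. Decompose it into the smallest independent sub-problems.
3. Solve each sub-problem, naming the rule or theorem you apply.
4. Combine the sub-solutions and verify the final result.
Return only sections 1-4, clearly numbered.

Problem: {problem}
"""

FEW_SHOT_PROMPT = """Q: 12 * 7 = ?  A: 84
Q: 9 * 13 = ?  A: 117
Q: {problem}  A:"""

def build_meta_prompt(problem_type: str, problem: str) -> str:
    return META_PROMPT.format(problem_type=problem_type, problem=problem)

print(build_meta_prompt("arithmetic word problem",
                        "A train travels 60 km/h for 2.5 hours. How far does it go?"))
```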
PubTime: 2024-01-25
Downlink: http://arxiv.org/abs/2312.06968v3
GitHub: https://github.com/X-PLUG/mPLUG-HalOwl/tree/main/hacl
Abstract: Multi-modal large language models (MLLMs) have been shown to efficiently integrate natural language with visual information to handle multi-modal tasks. However, MLLMs still face a fundamental limitation of hallucinations, where they tend to generate erroneous or fabricated information. In this paper, we address hallucinations in MLLMs from a novel perspective of representation learning. We first analyze the representation distribution of textual and visual tokens in MLLMs, revealing two important findings: 1) there is a significant gap between textual and visual representations, indicating unsatisfactory cross-modal representation alignment; 2) representations of texts that contain and do not contain hallucinations are entangled, making it challenging to distinguish them. These two observations inspire a simple yet effective method to mitigate hallucinations. Specifically, we introduce contrastive learning into MLLMs and use text with hallucinations as hard negative examples, naturally bringing representations of non-hallucinative text and visual samples closer while pushing away representations of non-hallucinative and hallucinative text. We evaluate our method quantitatively and qualitatively, showing its effectiveness in reducing hallucination occurrences and improving performance across multiple benchmarks. On the MMHal-Bench benchmark, our method obtains a 34.66%/29.5% improvement over the baselines MiniGPT-4/LLaVA. Our code is available at https://github.com/X-PLUG/mPLUG-HalOwl/tree/main/hacl.
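The training signal described above boils down to a contrastive objective in which the image is the anchor, its faithful caption is the positive, and a hallucinated caption for the same image is appended as a hard negative. The minimal InfoNCE-style sketch below illustrates that idea; the function name and temperature are assumptions rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def hallucination_aware_contrastive_loss(img, txt_pos, txt_hallu, temperature=0.07):
    """InfoNCE with hallucinated captions appended as hard negatives.

    img, txt_pos, txt_hallu: (B, D) embeddings. Row i of txt_pos is the faithful
    caption for image i; row i of txt_hallu is a hallucinated caption for it.
    """
    img = F.normalize(img, dim=-1)
    txt = F.normalize(torch.cat([txt_pos, txt_hallu], dim=0), dim=-1)   # (2B, D)
    logits = img @ txt.t() / temperature                                 # (B, 2B)
    targets = torch.arange(img.size(0), device=img.device)               # positives sit in the first B columns
    return F.cross_entropy(logits, targets)

B, D = 4, 64
loss = hallucination_aware_contrastive_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
print(float(loss))
```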
PubTime: 2024-01-23
Downlink: http://arxiv.org/abs/2308.16692v2
Project: https://0nutation.github.io/SpeechTokenizer.github.io/
GitHub: https://github.com/ZhangXInFD/SpeechTokenizer/
Abstract: Current speech large language models build upon discrete speech representations, which can be categorized into semantic tokens and acoustic tokens. However, existing speech tokens are not specifically designed for speech language modeling. To assess the suitability of speech tokens for building speech language models, we established the first benchmark, SLMTokBench. Our results indicate that neither semantic nor acoustic tokens are ideal for this purpose. Therefore, we propose SpeechTokenizer, a unified speech tokenizer for speech large language models. SpeechTokenizer adopts an encoder-decoder architecture with residual vector quantization (RVQ). Unifying semantic and acoustic tokens, SpeechTokenizer disentangles different aspects of speech information hierarchically across different RVQ layers. Furthermore, we construct a Unified Speech Language Model (USLM) leveraging SpeechTokenizer. Experiments show that SpeechTokenizer performs comparably to EnCodec in speech reconstruction and demonstrates strong performance on the SLMTokBench benchmark. Also, USLM outperforms VALL-E in zero-shot text-to-speech tasks. Code and models are available at https://github.com/ZhangXInFD/SpeechTokenizer/.
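Residual vector quantization, which SpeechTokenizer builds on, quantizes each latent frame layer by layer: every codebook encodes the residual left by the previous layers, which is what lets different layers carry different aspects of the signal. The encoding sketch below uses assumed shapes and random, untrained codebooks.

```python
import torch

def rvq_encode(x, codebooks):
    """Residual vector quantization (RVQ) encoding.

    x: (B, D) latent frames; codebooks: list of (K, D) codeword tensors.
    Returns per-layer code indices and the cumulative reconstruction.
    """
    residual = x
    recon = torch.zeros_like(x)
    indices = []
    for cb in codebooks:                          # one codebook per RVQ layer
        d = torch.cdist(residual, cb)             # (B, K) distances to all codewords
        idx = d.argmin(dim=-1)                    # nearest codeword per frame
        q = cb[idx]
        indices.append(idx)
        recon = recon + q                         # reconstruction accumulates layer by layer
        residual = residual - q                   # the next layer quantizes what is left
    return indices, recon

torch.manual_seed(0)
codebooks = [torch.randn(256, 128) for _ in range(8)]    # 8 layers, 256 untrained codes each
x = torch.randn(4, 128)
indices, recon = rvq_encode(x, codebooks)
print(len(indices), indices[0].shape, recon.shape)
```

With trained codebooks, each layer refines what the earlier layers left over, which is the hierarchical disentanglement across RVQ layers that the abstract refers to.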
PubTime: 2024-01-25
Downlink: http://arxiv.org/abs/2401.14362v1
Abstract: People experiencing severe distress increasingly use Large Language Model (LLM) chatbots as mental health support tools. Discussions on social media have described how engagements were lifesaving for some, but evidence suggests that general-purpose LLM chatbots also have notable risks that could endanger the welfare of users if not designed responsibly. In this study, we investigate the lived experiences of people who have used LLM chatbots for mental health support. We build on interviews with 21 individuals from globally diverse backgrounds to analyze how users create unique support roles for their chatbots, fill in gaps in everyday care, and navigate associated cultural limitations when seeking support from chatbots. We ground our analysis in psychotherapy literature around effective support and introduce the concept of therapeutic alignment, or aligning AI with therapeutic values for mental health contexts. Our study offers recommendations for how designers can approach the ethical and effective use of LLM chatbots and other AI mental health support tools in mental health care.
PubTime: 2024-01-25
Downlink: http://arxiv.org/abs/2401.14351v1
Abstract: This paper presents ServerlessLLM, a locality-enhanced serverless inference system for Large Language Models (LLMs). ServerlessLLM exploits the substantial capacity and bandwidth of the storage and memory devices available on GPU servers, thereby reducing costly remote checkpoint downloads and achieving efficient checkpoint loading. ServerlessLLM achieves this through three main contributions: (i) fast LLM checkpoint loading via a novel loading-optimized checkpoint format design, coupled with an efficient multi-tier checkpoint loading system; (ii) locality-driven LLM inference with live migration, which allows ServerlessLLM to effectively achieve locality-driven server allocation while preserving the low latency of ongoing LLM inference; and (iii) locality-aware server allocation, enabling ServerlessLLM to evaluate the status of each server in a cluster and effectively schedule model startup time to capitalize on local checkpoint placement. Our comprehensive experiments, which include microbenchmarks and real-world traces, show that ServerlessLLM surpasses state-of-the-art systems by 10-200X in latency performance when running various LLM inference workloads.
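Contribution (iii) is essentially a scheduling rule: prefer a server whose local storage tier already holds the checkpoint, otherwise weigh the remote fetch time against queueing delay elsewhere. The policy below is a simplified sketch with assumed fields and timing estimates; it is not the ServerlessLLM scheduler.

```python
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    has_local_checkpoint: bool   # checkpoint already in the local SSD/DRAM tier
    local_load_s: float          # estimated load time from local storage
    remote_fetch_s: float        # estimated remote checkpoint download time
    queue_delay_s: float         # delay until a GPU frees up on this server

def estimated_startup(s: Server) -> float:
    load = s.local_load_s if s.has_local_checkpoint else s.remote_fetch_s + s.local_load_s
    return s.queue_delay_s + load

def pick_server(servers: list[Server]) -> Server:
    # Locality-aware allocation: minimize the expected model startup time.
    return min(servers, key=estimated_startup)

servers = [
    Server("gpu-a", has_local_checkpoint=True,  local_load_s=2.0, remote_fetch_s=40.0, queue_delay_s=5.0),
    Server("gpu-b", has_local_checkpoint=False, local_load_s=2.0, remote_fetch_s=40.0, queue_delay_s=0.0),
]
print(pick_server(servers).name)   # "gpu-a": a busy server with a local checkpoint beats an idle but cold one
```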
PubTime: 2024-01-25
Downlink: http://arxiv.org/abs/2401.14405v1
GitHub: https://github.com/AILab-CVC/M2PT
Abstract: We propose to improve transformers of a specific modality with irrelevant data from other modalities, e.g., improve an ImageNet model with audio or point cloud datasets. We would like to highlight that the data samples of the target modality are irrelevant to the other modalities, which distinguishes our method from other works utilizing paired (e.g., CLIP) or interleaved data of different modalities. We propose a methodology named Multimodal Pathway: given a target modality and a transformer designed for it, we use an auxiliary transformer trained with data of another modality and construct pathways to connect components of the two models so that data of the target modality can be processed by both models. In this way, we utilize the universal sequence-to-sequence modeling abilities of transformers obtained from two modalities. As a concrete implementation, we use a modality-specific tokenizer and a task-specific head as usual, but utilize the transformer blocks of the auxiliary model via a proposed method named Cross-Modal Re-parameterization, which exploits the auxiliary weights without any inference costs. On image, point cloud, video, and audio recognition tasks, we observe significant and consistent performance improvements with irrelevant data from other modalities. The code and models are available at https://github.com/AILab-CVC/M2PT.
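Cross-Modal Re-parameterization, as summarized above, adds a scaled copy of an auxiliary modality's weights to each target weight; because the two matrices can be summed once after training, the auxiliary branch costs nothing at inference. The toy layer below sketches the idea for a single linear weight (the paper applies it to transformer blocks); the class and merge helper are illustrative, not the M2PT code.

```python
import torch
import torch.nn as nn

class CrossModalLinear(nn.Module):
    """y = x (W + lambda * W_aux)^T + b: auxiliary weights folded in by a learned scale."""
    def __init__(self, target: nn.Linear, aux: nn.Linear):
        super().__init__()
        self.target = target
        self.register_buffer("aux_weight", aux.weight.detach().clone())  # frozen weight from the other modality
        self.scale = nn.Parameter(torch.zeros(()))                       # lambda, learned during fine-tuning

    def forward(self, x):
        w = self.target.weight + self.scale * self.aux_weight
        return nn.functional.linear(x, w, self.target.bias)

    @torch.no_grad()
    def merge(self) -> nn.Linear:
        """Re-parameterize: bake the auxiliary branch into a plain Linear for inference."""
        merged = nn.Linear(self.target.in_features, self.target.out_features)
        merged.weight.copy_(self.target.weight + self.scale * self.aux_weight)
        merged.bias.copy_(self.target.bias)
        return merged

layer = CrossModalLinear(nn.Linear(16, 16), nn.Linear(16, 16))
x = torch.randn(2, 16)
print(torch.allclose(layer(x), layer.merge()(x), atol=1e-6))   # True: no extra inference cost after merging
```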
PubTime: 2024-01-25
Downlink: http://arxiv.org/abs/2401.13388v2
Project: https://unimo-ptm.github.io/
Abstract: Existing text-to-image diffusion models primarily generate images from text prompts. However, the inherent conciseness of textual descriptions poses challenges in faithfully synthesizing images with intricate details, such as specific entities or scenes. This paper presents UNIMO-G, a simple multimodal conditional diffusion framework that operates on multimodal prompts with interleaved textual and visual inputs, and demonstrates a unified ability for both text-driven and subject-driven image generation. UNIMO-G comprises two core components: a Multimodal Large Language Model (MLLM) for encoding multimodal prompts, and a conditional denoising diffusion network for generating images based on the encoded multimodal input. We leverage a two-stage training strategy to effectively train the framework: first pre-training on large-scale text-image pairs to develop conditional image generation capabilities, and then instruction tuning with multimodal prompts to achieve unified image generation proficiency. A well-designed data processing pipeline involving language grounding and image segmentation is employed to construct the multimodal prompts. UNIMO-G excels in both text-to-image generation and zero-shot subject-driven synthesis, and is notably effective in generating high-fidelity images from complex multimodal prompts involving multiple image entities.
PubTime: 2024-01-25
Downlink: http://arxiv.org/abs/2310.06992v2
Project: https://wenhsuanchu.github.io/ovtracktor/
Abstract: Object tracking is central to robot perception and scene understanding. Tracking-by-detection has long been a dominant paradigm for object tracking of specific object categories. Recently, large-scale pre-trained models have shown promising advances in detecting and segmenting objects and parts in 2D static images in the wild. This begs the question: can we re-purpose these large-scale pre-trained static image models for open-vocabulary video tracking? In this paper, we re-purpose an open-vocabulary detector, segmenter, and dense optical flow estimator into a model that tracks and segments objects of any category in 2D videos. Our method predicts object and part tracks with associated language descriptions in monocular videos, rebuilding the pipeline of Tractor with modern large pre-trained models for static image detection and segmentation: we detect open-vocabulary object instances and propagate their boxes from frame to frame using a flow-based motion model, refine the propagated boxes with the box regression module of the visual detector, and prompt an open-world segmenter with the refined boxes to segment the objects. We decide the termination of an object track based on the objectness score of the propagated boxes, as well as forward-backward optical flow consistency. We re-identify objects across occlusions using deep feature matching. We show that our model achieves strong performance on multiple established video object segmentation and tracking benchmarks, and can produce reasonable tracks in manipulation data. In particular, our model outperforms the previous state of the art on UVO and BURST, benchmarks for open-world object tracking and segmentation, despite never being explicitly trained for tracking. We hope that our approach can serve as a simple and extensible framework for future research.
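Two steps of this pipeline are easy to make concrete: propagating a box with the mean optical flow inside it, and the forward-backward flow consistency check used to terminate tracks. The helpers below sketch both under assumed array layouts; they are not the authors' code.

```python
import numpy as np

def propagate_box(box, flow):
    """Shift a box by the mean optical flow inside it.

    box: (x1, y1, x2, y2) in pixels; flow: (H, W, 2) forward flow in pixels.
    """
    x1, y1, x2, y2 = [int(v) for v in box]
    dx, dy = flow[y1:y2, x1:x2].reshape(-1, 2).mean(axis=0)
    return (box[0] + dx, box[1] + dy, box[2] + dx, box[3] + dy)

def forward_backward_consistent(flow_fwd, flow_bwd, box, tol=2.0):
    """Termination test: inside the box, forward flow followed by backward flow
    should roughly return to the start; a large error suggests occlusion or drift."""
    h, w = flow_fwd.shape[:2]
    x1, y1, x2, y2 = [int(v) for v in box]
    errs = []
    for y in range(y1, y2, 4):                    # sample a sparse grid inside the box
        for x in range(x1, x2, 4):
            dx, dy = flow_fwd[y, x]
            xb = int(np.clip(x + dx, 0, w - 1))
            yb = int(np.clip(y + dy, 0, h - 1))
            back = np.array([x + dx, y + dy]) + flow_bwd[yb, xb]
            errs.append(np.linalg.norm(back - np.array([x, y])))
    return float(np.mean(errs)) < tol

fwd = np.full((64, 64, 2), [3.0, 0.0])            # everything moves 3 px to the right
bwd = np.full((64, 64, 2), [-3.0, 0.0])
print(propagate_box((10, 10, 30, 30), fwd))       # box shifted right by 3 px
print(forward_backward_consistent(fwd, bwd, (10, 10, 30, 30)))   # True
```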
PubTime: 2024-01-25
Downlink: http://arxiv.org/abs/2310.09503v3
GitHub: https://github.com/Mr-Neko/JM3D
Abstract: The rising importance of 3D understanding, pivotal in computer vision, autonomous driving, and robotics, is evident. However, the prevailing trend of straightforwardly transferring 2D alignment strategies to the 3D domain encounters three distinct challenges: (1) Information Degradation: this arises from aligning 3D data with mere single-view 2D images and generic texts, neglecting the need for multi-view images and detailed subcategory texts. (2) Insufficient Synergy: these strategies align 3D representations to image and text features individually, hampering the overall optimization of 3D models. (3) Underutilization: the fine-grained information inherent in the learned representations is often not fully exploited, indicating a potential loss of detail. To address these issues, we introduce JM3D, a comprehensive approach integrating point cloud, text, and image. Key contributions include the Structured Multimodal Organizer (SMO), which enriches the vision-language representation with multiple views and hierarchical text, and the Joint Multi-modal Alignment (JMA), which combines language understanding with visual representation. Our advanced model, JM3D-LLM, marries 3D representation with large language models via efficient fine-tuning. Evaluations on ModelNet40 and ScanObjectNN establish JM3D’s superiority. The superior performance of JM3D-LLM further underscores the effectiveness of our representation transfer approach. Our code and models are available at https://github.com/Mr-Neko/JM3D.
PubTime: 2024-01-25
Downlink: http://arxiv.org/abs/2401.13965v1
GitHub: https://github.com/Adnan-Khan7/UPLM
Abstract: Beyond attaining domain generalization (DG), visual recognition models should also be data-efficient during learning by leveraging limited labels. We study the problem of Semi-Supervised Domain Generalization (SSDG), which is crucial for real-world applications like automated healthcare. SSDG requires learning a cross-domain generalizable model when the given training data is only partially labelled. Empirical investigations reveal that DG methods tend to underperform in SSDG settings, likely because they are unable to exploit the unlabelled data. Semi-supervised learning (SSL) shows improved but still inferior results compared to fully-supervised learning. A key challenge, faced by the best-performing SSL-based SSDG methods, is selecting accurate pseudo-labels under multiple domain shifts and reducing overfitting to source domains under limited labels. In this work, we propose a new SSDG approach that utilizes a novel uncertainty-guided pseudo-labelling with model averaging (UPLM). Our uncertainty-guided pseudo-labelling (UPL) uses model uncertainty to improve pseudo-label selection, addressing poor model calibration under multi-source unlabelled data. The UPL technique, enhanced by our novel model averaging (MA) strategy, mitigates overfitting to source domains with limited labels. Extensive experiments on key representative DG datasets show that our method is effective compared with existing methods. Our code and the chosen labelled-data seeds are available on GitHub: https://github.com/Adnan-Khan7/UPLM
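The two ingredients named in the abstract, uncertainty-guided pseudo-labelling and model averaging, can be sketched compactly: keep a pseudo-label only where predictive entropy is low, and average parameter tensors across checkpoints. The snippet below is an illustration under an assumed entropy threshold, not the UPLM implementation.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def uncertainty_guided_pseudo_labels(logits, max_entropy=0.5):
    """Keep a pseudo-label only where predictive entropy is low.

    logits: (N, C) outputs on unlabelled data. Returns (labels, keep_mask).
    The entropy threshold is an illustrative assumption, not the paper's value.
    """
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    return probs.argmax(dim=-1), entropy < max_entropy

@torch.no_grad()
def average_models(models):
    """Model averaging: uniform average of parameter tensors across checkpoints."""
    avg = copy.deepcopy(models[0])
    for name, p in avg.named_parameters():
        p.copy_(torch.stack([dict(m.named_parameters())[name] for m in models]).mean(dim=0))
    return avg

logits = torch.tensor([[4.0, 0.0, 0.0],    # confident prediction -> pseudo-label kept
                       [0.2, 0.1, 0.0]])   # uncertain prediction -> dropped
labels, keep = uncertainty_guided_pseudo_labels(logits)
print(labels.tolist(), keep.tolist())      # [0, 0] [True, False]
print(average_models([nn.Linear(2, 2), nn.Linear(2, 2)]).weight.shape)
```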
PubTime: 2024-01-25
Downlink: http://arxiv.org/abs/2401.14398v1
Project: https://gestalt.cs.columbia.edu/
Abstract: We introduce pix2gestalt, a framework for zero-shot amodal segmentation, which learns to estimate the shape and appearance of whole objects that are only partially visible behind occlusions. By capitalizing on large-scale diffusion models and transferring their representations to this task, we learn a conditional diffusion model for reconstructing whole objects in challenging zero-shot cases, including examples that break natural and physical priors, such as art. As training data, we use a synthetically curated dataset containing occluded objects paired with their whole counterparts. Experiments show that our approach outperforms supervised baselines on established benchmarks. Our model can furthermore be used to significantly improve the performance of existing object recognition and 3D reconstruction methods in the presence of occlusions.
PubTime: 2024-01-25
Downlink: http://arxiv.org/abs/2307.01673v2
GitHub: https://github.com/RF5/simple-asgan/
Abstract: Can we develop a model that can synthesize realistic speech directly from a latent space, without explicit conditioning? Despite several efforts over the last decade, previous adversarial and diffusion-based approaches still struggle to achieve this, even on small-vocabulary datasets. To address this, we propose AudioStyleGAN (ASGAN), a generative adversarial network for unconditional speech synthesis tailored to learn a disentangled latent space. Building upon the StyleGAN family of image synthesis models, ASGAN maps sampled noise to a disentangled latent vector which is then mapped to a sequence of audio features so that signal aliasing is suppressed at every layer. To successfully train ASGAN, we introduce a number of new techniques, including a modification to adaptive discriminator augmentation which probabilistically skips discriminator updates. We apply it to the small-vocabulary Google Speech Commands digits dataset, where it achieves state-of-the-art results in unconditional speech synthesis. It is also substantially faster than existing top-performing diffusion models. We confirm that ASGAN’s latent space is disentangled: we demonstrate how simple linear operations in the space can be used to perform several tasks unseen during training. Specifically, we perform evaluations in voice conversion, speech enhancement, speaker verification, and keyword classification. Our work indicates that GANs are still highly competitive in the unconditional speech synthesis landscape, and that disentangled latent spaces can be used to aid generalization to unseen tasks. Code, models, samples: https://github.com/RF5/simple-asgan/
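One concrete training detail mentioned here is the modification to adaptive discriminator augmentation that probabilistically skips discriminator updates. The stripped-down GAN step below shows only that control flow, with toy linear models and an assumed skip probability; it is not the ASGAN training code.

```python
import random
import torch
import torch.nn as nn

G = nn.Linear(16, 32)                      # stand-in generator
D = nn.Linear(32, 1)                       # stand-in discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real, p_skip_d=0.5):
    z = torch.randn(real.size(0), 16)
    fake = G(z)
    # Generator update (always performed): non-saturating loss.
    opt_g.zero_grad()
    bce(D(fake), torch.ones(real.size(0), 1)).backward()
    opt_g.step()
    # Discriminator update, probabilistically skipped.
    if random.random() >= p_skip_d:
        opt_d.zero_grad()
        loss_d = bce(D(real), torch.ones(real.size(0), 1)) + \
                 bce(D(fake.detach()), torch.zeros(real.size(0), 1))
        loss_d.backward()
        opt_d.step()

train_step(torch.randn(8, 32))
```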
PubTime: 2024-01-25
Downlink: http://arxiv.org/abs/2305.19094v2
Project: https://ku-cvlab.github.io/DiffMatch/
Abstract: The objective of establishing dense correspondence between paired images consists of two terms: a data term and a prior term. While conventional techniques focused on defining hand-designed prior terms, which are difficult to formulate, recent approaches have focused on learning the data term with deep neural networks without explicitly modeling the prior, assuming that the model itself has the capacity to learn an optimal prior from a large-scale dataset. The performance improvement was obvious; however, these methods often fail to address inherent ambiguities of matching, such as textureless regions, repetitive patterns, and large displacements. To address this, we propose DiffMatch, a novel conditional diffusion-based framework designed to explicitly model both the data and prior terms. Unlike previous approaches, this is accomplished by leveraging a conditional denoising diffusion model. DiffMatch consists of two main components: a conditional denoising diffusion module and a cost injection module. We stabilize the training process and reduce memory usage with a stage-wise training strategy. Furthermore, to boost performance, we introduce an inference technique that finds a better path to the accurate matching field. Our experimental results demonstrate significant performance improvements of our method over existing approaches, and the ablation studies validate our design choices along with the effectiveness of each component. The project page is available at https://ku-cvlab.github.io/DiffMatch/.
PubTime: 2024-01-24
Downlink: http://arxiv.org/abs/2306.15667v4
Project: https://posediffusion.github.io/
Abstract: Camera pose estimation is a long-standing computer vision problem that to date often relies on classical methods, such as handcrafted keypoint matching, RANSAC, and bundle adjustment. In this paper, we propose to formulate the Structure from Motion (SfM) problem inside a probabilistic diffusion framework, modelling the conditional distribution of camera poses given input images. This novel view of an old problem has several advantages. (i) The nature of the diffusion framework mirrors the iterative procedure of bundle adjustment. (ii) The formulation allows a seamless integration of geometric constraints from epipolar geometry. (iii) It excels in typically difficult scenarios such as sparse views with wide baselines. (iv) The method can predict intrinsics and extrinsics for an arbitrary number of images. We demonstrate that our method, PoseDiffusion, significantly improves over classic SfM pipelines and learned approaches on two real-world datasets. Finally, we observe that our method can generalize across datasets without further training. Project page: https://posediffusion.github.io/
PubTime: 2024-01-23
Downlink: http://arxiv.org/abs/2401.12979v1
Project: https://snuvclab.github.io/gala/
Abstract: We present GALA, a framework that takes as input a single-layer clothed 3D human mesh and decomposes it into complete multi-layered 3D assets. The outputs can then be combined with other assets to create novel clothed human avatars in any pose. Existing reconstruction approaches often treat clothed humans as a single layer of geometry and overlook the inherent compositionality of humans with hairstyles, clothing, and accessories, thereby limiting the utility of the meshes for downstream applications. Decomposing a single-layer mesh into separate layers is a challenging task because it requires synthesizing plausible geometry and texture for the severely occluded regions. Moreover, even after a successful decomposition, the meshes are not normalized in terms of poses and body shapes, preventing coherent composition with novel identities and poses. To address these challenges, we propose to leverage the general knowledge of a pretrained 2D diffusion model as a geometry and appearance prior for humans and other assets. We first separate the input mesh using 3D surface segmentation extracted from multi-view 2D segmentations. Then we synthesize the missing geometry of different layers in both posed and canonical spaces using a novel pose-guided Score Distillation Sampling (SDS) loss. Once the high-fidelity 3D geometry is inpainted, we apply the same SDS loss to its texture to obtain the complete appearance, including the initially occluded regions. Through a series of decomposition steps, we obtain multiple layers of 3D assets in a shared canonical space normalized in terms of poses and human shapes, hence supporting effortless composition with novel identities and reanimation with novel poses. Our experiments demonstrate the effectiveness of our approach on decomposition, canonicalization, and composition tasks compared to existing solutions.
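GALA's inpainting of hidden layers is driven by a pose-guided Score Distillation Sampling (SDS) loss. The generic SDS surrogate is short to write down: detach the weighted residual between the diffusion model's noise prediction and the injected noise, and multiply it by the optimized variable so that backpropagation returns exactly that residual as the gradient. The sketch below shows the plain SDS surrogate, not the pose-guided variant used in the paper.

```python
import torch

def sds_loss(x, eps_pred, eps, weight=1.0):
    """Score Distillation Sampling surrogate loss.

    Backpropagating this loss yields d/dx = weight * (eps_pred - eps), the SDS
    gradient, where eps_pred is a frozen diffusion model's noise prediction for
    the rendered/optimized variable x noised with eps.
    """
    grad = (weight * (eps_pred - eps)).detach()
    return (grad * x).sum()

# Minimal check that the surrogate produces the intended gradient.
x = torch.randn(1, 3, 8, 8, requires_grad=True)
eps = torch.randn_like(x)
eps_pred = torch.randn_like(x)
sds_loss(x, eps_pred, eps).backward()
print(torch.allclose(x.grad, eps_pred - eps))   # True (weight = 1)
```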
PubTime: 2024-01-24
Downlink: http://arxiv.org/abs/2401.13800v1
Abstract: Navigation has classically been solved in robotics through the combination of SLAM and planning. More recently, beyond waypoint planning, problems involving significant components of (visual) high-level reasoning have been explored in simulated environments, mostly addressed with large-scale machine learning, in particular RL, offline RL, or imitation learning. These methods require the agent to learn various skills like local planning, mapping objects, and querying the learned spatial representations. In contrast to simpler tasks like waypoint planning (PointGoal), for these more complex tasks the current state-of-the-art models have been thoroughly evaluated in simulation but, to the best of our knowledge, not yet in real environments. In this work, we focus on sim2real transfer. We target the challenging Multi-Object Navigation (Multi-ON) task and port it to a physical environment containing real replicas of the originally virtual Multi-ON objects. We introduce a hybrid navigation method that decomposes the problem into two different skills: (1) waypoint navigation is addressed with classical SLAM combined with a symbolic planner, whereas (2) exploration, semantic mapping, and goal retrieval are dealt with by deep neural networks trained with a combination of supervised learning and RL. We show the advantages of this approach compared to end-to-end methods both in simulation and in a real environment, and we outperform the SOTA for this task.
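The decomposition into two skills suggests a simple high-level controller: explore with the learned policy until the semantic map localizes the current target, then hand the waypoint to the SLAM-plus-symbolic-planner stack. The loop below sketches that switch with stubbed, hypothetical interfaces; it is not the authors' system.

```python
from typing import Callable, Optional, Tuple

def multi_on_episode(targets: list,
                     lookup_goal: Callable[[str], Optional[Tuple[float, float]]],
                     explore_step: Callable[[], None],
                     navigate_to_waypoint: Callable[[Tuple[float, float]], bool],
                     max_steps: int = 500) -> int:
    """Visit the Multi-ON targets in order; returns how many were reached."""
    found, steps = 0, 0
    for target in targets:
        while steps < max_steps:
            steps += 1
            goal_xy = lookup_goal(target)        # query the learned semantic map
            if goal_xy is None:
                explore_step()                   # skill 2: learned exploration / mapping policy
            elif navigate_to_waypoint(goal_xy):  # skill 1: SLAM + symbolic planner
                found += 1
                break
    return found

# Stub run: the "map" only localizes objects after a few exploration steps.
clock = {"t": 0}
def lookup(name): return (1.0, 2.0) if clock["t"] > 3 else None
def explore(): clock["t"] += 1
def goto(xy): return True
print(multi_on_episode(["red cylinder", "green cylinder"], lookup, explore, goto))   # 2
```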
Follow {晓理紫} on WeChat (VX) for daily paper updates. If you find this useful, please share it with classmates who may need it. Thank you for your support, and thank you for your suggestions.
If this helps you, please follow me and I will deliver the latest papers to you on time every day.