Follow {晓理紫} on WeChat for daily paper updates. If you find this useful, please share it with classmates who may need it; thank you for your support.
If it helps you, please follow me and I will push the latest papers to you on time every day.
To thank readers for their support, starting today I am offering a free topic-based paper subscription service to 300 readers: follow the official account on WeChat and reply with {email + paper topic} (e.g., [email protected] + chatgpt@large language model @LLM). The topics must belong to the same field, with at most three keywords. The author reserves the right of final interpretation.
Categories:
- Large language models (LLM)
- Vision-language models (VLM)
- Diffusion models
- Vision-and-language navigation (VLN)
- Reinforcement learning (RL)
- Imitation learning (IL)
- Robotics
- Open vocabulary, detection and segmentation
PubTime: 2024-01-29
Downlink: http://arxiv.org/abs/2401.16420v1
GitHub: https://github.com/InternLM/InternLM-XComposer
Abstract: We introduce InternLM-XComposer2, a cutting-edge vision-language model excelling in free-form text-image composition and comprehension. The model goes beyond conventional vision-language understanding, adeptly crafting interleaved text-image content from diverse inputs such as outlines, detailed textual specifications, and reference images, enabling highly customizable content creation. InternLM-XComposer2 proposes a Partial LoRA (PLoRA) approach that applies additional LoRA parameters exclusively to image tokens to preserve the integrity of pre-trained language knowledge, striking a balance between precise vision understanding and text composition with literary talent. Experimental results demonstrate the superiority of InternLM-XComposer2, based on InternLM2-7B, in producing high-quality long-text multi-modal content, as well as its exceptional vision-language understanding performance across various benchmarks, where it not only significantly outperforms existing multimodal models but also matches or even surpasses GPT-4V and Gemini Pro in certain assessments. This highlights its remarkable proficiency in the realm of multimodal understanding. The InternLM-XComposer2 model series with 7B parameters is publicly available at https://github.com/InternLM/InternLM-XComposer.
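A minimal sketch of the Partial LoRA idea described above: the low-rank update is applied only at image-token positions, while text tokens pass through the frozen pre-trained weights untouched. This is an illustrative PyTorch module, not the InternLM-XComposer2 implementation; the mask convention and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

class PartialLoRALinear(nn.Module):
    """Linear layer whose LoRA update is applied only at image-token positions.
    Illustrative sketch of the Partial LoRA (PLoRA) idea, not the official code."""

    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)        # frozen pre-trained weight
        self.lora_a = nn.Linear(in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)            # update starts at zero
        self.scale = alpha / rank

    def forward(self, x, image_mask):
        # x: (batch, seq, in_features); image_mask: (batch, seq) bool, True at image tokens
        out = self.base(x)
        delta = self.lora_b(self.lora_a(x)) * self.scale
        return out + delta * image_mask.unsqueeze(-1).to(delta.dtype)

# toy usage: the first 4 positions are image tokens, the rest are text tokens
layer = PartialLoRALinear(64, 64)
tokens = torch.randn(2, 10, 64)
mask = torch.zeros(2, 10, dtype=torch.bool)
mask[:, :4] = True
print(layer(tokens, mask).shape)                      # torch.Size([2, 10, 64])
```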
PubTime: 2024-01-29
Downlink: http://arxiv.org/abs/2401.16405v1
GitHub: https://github.com/AlanAnsell/peft | https://github.com/ducdauge/sft-llm
Abstract: Large Language Models (LLMs) are difficult to fully fine-tune (e.g., with instructions or human feedback) due to their sheer number of parameters. A family of parameter-efficient sparse fine-tuning (SFT) methods has proven promising in terms of performance, but their memory requirements increase proportionally to the size of the LLMs. In this work, we scale sparse fine-tuning to state-of-the-art LLMs like LLaMA 2 7B and 13B. At any given time, for a desired density level, we maintain an array of parameter indices and the deltas of these parameters relative to their pretrained values. We iterate among: (a) updating the active deltas, (b) pruning indices (based on the change of magnitude of their deltas), and (c) regrowth of indices. For regrowth, we explore two criteria based on either the accumulated gradients of a few candidate parameters or their approximate momenta estimated using the efficient SM3 optimizer. We experiment with instruction-tuning of LLMs on standard dataset mixtures, finding that SFT is often superior to popular parameter-efficient fine-tuning methods like LoRA (low-rank adaptation) in terms of performance and comparable in terms of run time. We additionally show that SFT is compatible with both quantization and efficient optimizers, to facilitate scaling to ever-larger model sizes. We release the code for SFT at https://github.com/AlanAnsell/peft and for the instruction-tuning experiments at https://github.com/ducdauge/sft-llm.
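A toy sketch of the update/prune/regrow cycle on a single flattened weight vector, assuming plain SGD on the active deltas and a simple accumulated-gradient regrowth criterion; the paper's method works per weight matrix, supports SM3-momentum regrowth, and scales to full LLMs.

```python
import torch

def sft_step(weight, grad, idx, delta, lr=1e-2):
    """(a) Update only the active deltas; return the effective weights (pretrained + delta)."""
    delta = delta - lr * grad.flatten()[idx]
    effective = weight.flatten().clone()
    effective[idx] += delta
    return effective.view_as(weight), delta

def prune_and_regrow(grad_accum, idx, delta, k):
    """(b) Prune the k active indices whose deltas changed least, then
    (c) regrow k new indices with the largest accumulated gradient magnitude."""
    drop = torch.topk(delta.abs(), k, largest=False).indices
    keep = torch.ones(idx.numel(), dtype=torch.bool)
    keep[drop] = False
    idx, delta = idx[keep], delta[keep]
    candidates = grad_accum.abs().clone()
    candidates[idx] = 0.0                    # never re-select an already-active index
    new_idx = torch.topk(candidates, k).indices
    return torch.cat([idx, new_idx]), torch.cat([delta, torch.zeros(k)])

# toy usage on a 100-parameter "model" at 10% density
w = torch.randn(100)                         # frozen pre-trained weights
idx = torch.randperm(100)[:10]               # active indices
delta = torch.zeros(10)
grad = torch.randn(100)                      # gradient w.r.t. the full weight vector
effective, delta = sft_step(w, grad, idx, delta)
idx, delta = prune_and_regrow(grad, idx, delta, k=2)
```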
PubTime: 2024-01-29
Downlink: http://arxiv.org/abs/2305.05644v2
GitHub: https://github.com/JayZhang42/FederatedGPT-Shepherd
Abstract: While "instruction-tuned" generative large language models (LLMs) have demonstrated an impressive ability to generalize to new tasks, the training phases heavily rely on large amounts of diverse and high-quality instruction data (as used for ChatGPT and GPT-4). Unfortunately, acquiring high-quality data, especially human-written data, poses significant challenges in both cost and accessibility. Moreover, privacy concerns can further limit access to such data, making the process of obtaining it complex and nuanced. Consequently, this hinders the generality of the tuned models and may restrict their effectiveness in certain contexts. To tackle this issue, our study introduces a new approach called Federated Instruction Tuning (FedIT), which leverages federated learning (FL) as the learning framework for the instruction tuning of LLMs. This marks the first exploration of FL-based instruction tuning for LLMs, which is especially important since text data is predominantly generated by end users. It is therefore imperative to design and adapt FL approaches to effectively leverage these users' diverse instructions stored on local devices, while preserving privacy and ensuring data security. In this paper, using the widely adopted GPT-4 auto-evaluation, we demonstrate that by exploiting the heterogeneous and diverse sets of instructions on the clients' end with the proposed FedIT framework, we improve the performance of LLMs compared to centralized training with only limited local instructions. Further, we develop a GitHub repository named Shepherd, which offers a foundational framework for exploring federated fine-tuning of LLMs using heterogeneous instructions across diverse categories.
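At the core of federated instruction tuning is FedAvg-style aggregation of client updates. Below is a minimal single-machine PyTorch sketch of that aggregation step; the Shepherd repository's actual client selection, LoRA handling, and communication layer are not shown, and the sample-count weighting is the standard FedAvg choice, assumed here rather than taken from the paper.

```python
import copy
import torch

def fedavg(client_states, client_sizes):
    """Weighted average of client model state dicts (the FedAvg aggregation step)."""
    total = float(sum(client_sizes))
    avg = copy.deepcopy(client_states[0])
    for key in avg:
        avg[key] = sum(s[key] * (n / total) for s, n in zip(client_states, client_sizes))
    return avg

# toy usage: three clients fine-tune local copies, the server averages them
model = torch.nn.Linear(4, 2)
clients = [copy.deepcopy(model) for _ in range(3)]
for c in clients:                             # stand-in for local instruction tuning
    with torch.no_grad():
        for p in c.parameters():
            p.add_(0.01 * torch.randn_like(p))
new_state = fedavg([c.state_dict() for c in clients], client_sizes=[120, 80, 200])
model.load_state_dict(new_state)
```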
PubTime: 2024-01-29
Downlink: http://arxiv.org/abs/2401.16265v1
GitHub: https://github.com/OpenNLPLab/CO2
Abstract: The fundamental success of large language models hinges upon the efficacious implementation of large-scale distributed training techniques. Nevertheless, building a vast, high-performance cluster featuring high-speed communication interconnectivity is prohibitively costly and accessible only to prominent entities. In this work, we aim to lower this barrier and democratize large-scale training on clusters with limited bandwidth. We propose a new approach called CO2 that introduces local updating and asynchronous communication into distributed data-parallel training, thereby facilitating the full overlap of COmmunication with COmputation. CO2 attains high scalability even on extensive multi-node clusters constrained by very limited communication bandwidth. We further propose staleness gap penalty and outer momentum clipping techniques together with CO2 to bolster its convergence and training stability. Besides, CO2 integrates seamlessly with well-established ZeRO-series optimizers, which mitigate the memory consumption of model states in large-model training. We also provide a mathematical proof of convergence, accompanied by the establishment of a stringent upper bound. Furthermore, we validate our findings through an extensive set of practical experiments encompassing a wide range of tasks in computer vision and natural language processing. These experiments demonstrate the capabilities of CO2 in terms of convergence, generalization, and scalability when deployed across configurations comprising up to 128 A100 GPUs. The outcomes emphasize the outstanding capacity of CO2 to greatly improve scalability, whether on clusters with 800Gbps RDMA or 80Gbps TCP/IP inter-node connections.
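A single-process sketch of the outer step implied by the abstract: several local optimizer steps produce a displacement that is treated as a pseudo-gradient, damped by a staleness gap penalty, and folded into a clipped outer momentum buffer. The penalty form 1/(1 + gamma * staleness), the element-wise clipping, and all hyperparameters are illustrative assumptions, not the paper's exact equations; real CO2 additionally overlaps asynchronous all-reduce with computation across workers.

```python
import torch
import torch.nn.functional as F

def outer_update(params_before, params_after, outer_mom, staleness,
                 outer_lr=1.0, beta=0.9, clip=1.0, gamma=0.1):
    """One CO2-style outer step (illustrative, single process)."""
    new_params = []
    for p0, p1, m in zip(params_before, params_after, outer_mom):
        pseudo_grad = p0 - p1                                  # displacement from local steps
        pseudo_grad = pseudo_grad / (1.0 + gamma * staleness)  # staleness gap penalty (assumed form)
        m.mul_(beta).add_(pseudo_grad)                         # outer momentum
        m.clamp_(-clip, clip)                                  # outer momentum clipping
        new_params.append(p0 - outer_lr * m)
    return new_params

# toy usage: 4 local SGD steps, then one outer update
model = torch.nn.Linear(8, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
before = [p.detach().clone() for p in model.parameters()]
mom = [torch.zeros_like(p) for p in model.parameters()]
for _ in range(4):
    x, y = torch.randn(16, 8), torch.randn(16, 1)
    opt.zero_grad()
    F.mse_loss(model(x), y).backward()
    opt.step()
updated = outer_update(before, [p.detach() for p in model.parameters()], mom, staleness=1)
with torch.no_grad():
    for p, new in zip(model.parameters(), updated):
        p.copy_(new)
```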
PubTime: 2024-01-29
Downlink: http://arxiv.org/abs/2401.16158v1
GitHub: https://github.com/X-PLUG/MobileAgent
Abstract: Mobile device agents based on Multimodal Large Language Models (MLLM) are becoming a popular application. In this paper, we introduce Mobile-Agent, an autonomous multi-modal mobile device agent. Mobile-Agent first leverages visual perception tools to accurately identify and locate both the visual and textual elements within an app's front-end interface. Based on the perceived vision context, it then autonomously plans and decomposes the complex operation task, and navigates mobile apps step by step through its operations. Unlike previous solutions that rely on apps' XML files or mobile system metadata, Mobile-Agent's vision-centric approach allows greater adaptability across diverse mobile operating environments, eliminating the need for system-specific customizations. To assess the performance of Mobile-Agent, we introduce Mobile-Eval, a benchmark for evaluating mobile device operations, and conduct a comprehensive evaluation of Mobile-Agent on it. The experimental results indicate that Mobile-Agent achieves remarkable accuracy and completion rates. Even with challenging instructions, such as multi-app operations, Mobile-Agent can still complete the requirements. Code and model will be open-sourced at https://github.com/X-PLUG/MobileAgent.
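The abstract describes a vision-centric perceive-plan-act loop over device screenshots. Below is a skeletal sketch: the adb commands are standard Android tooling, while `plan_step` is a hypothetical placeholder for the MLLM planner plus text/icon grounding and is not Mobile-Agent's actual interface.

```python
import subprocess

def screenshot(path="screen.png"):
    """Capture the current screen of a connected Android device via adb."""
    with open(path, "wb") as f:
        subprocess.run(["adb", "exec-out", "screencap", "-p"], stdout=f, check=True)
    return path

def tap(x, y):
    """Send a tap at pixel coordinates (x, y)."""
    subprocess.run(["adb", "shell", "input", "tap", str(x), str(y)], check=True)

def run_agent(instruction, plan_step, max_steps=10):
    """Perceive-plan-act loop: screenshot -> plan_step decides an action -> execute.
    `plan_step(instruction, image_path)` is a user-supplied callable (e.g., wrapping
    an MLLM plus visual grounding) returning e.g. {"type": "tap", "x": 100, "y": 200}
    or {"type": "done"}; it is a placeholder, not Mobile-Agent's real planner."""
    for _ in range(max_steps):
        action = plan_step(instruction, screenshot())
        if action.get("type") == "done":
            break
        if action.get("type") == "tap":
            tap(action["x"], action["y"])
```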
PubTime: 2024-01-29
Downlink: http://arxiv.org/abs/2401.16380v1
Abstract: Large language models are trained on massive scrapes of the web, which are often unstructured, noisy, and poorly phrased. Current scaling laws show that learning from such data requires an abundance of both compute and data, which grows with the size of the model being trained. This is infeasible both because of the large compute costs and duration associated with pre-training, and the impending scarcity of high-quality data on the web. In this work, we propose Web Rephrase Augmented Pre-training (WRAP), which uses an off-the-shelf instruction-tuned model, prompted to paraphrase documents on the web in specific styles such as "like Wikipedia" or in "question-answer format", to jointly pre-train LLMs on real and synthetic rephrases. First, we show that using WRAP on the naturally noisy C4 dataset speeds up pre-training by roughly 3x. At the same pre-training compute budget, it improves perplexity by more than 10% on average across different subsets of the Pile, and improves zero-shot question answering accuracy across 13 tasks by more than 2%. Second, we investigate the impact of the rephrasing style on the performance of the model, offering insights into how the composition of the training data can affect the performance of LLMs in OOD settings. Our gains are attributed to the fact that rephrased synthetic data has higher utility than real data alone because it (i) incorporates style diversity that closely reflects downstream evaluation style, and (ii) has higher 'quality' than web-scraped data.
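A sketch of the rephrasing step: an off-the-shelf instruction-tuned model is prompted to rewrite each web document in a target style, and the training stream interleaves real documents with their synthetic rephrases. The prompt wording, the `generate` callable, and the mixing ratio are illustrative assumptions, not the paper's exact recipe.

```python
import random

STYLE_PROMPTS = {
    "wikipedia": "Rewrite the following web text in a clear, encyclopedic Wikipedia-like style:\n\n{doc}",
    "qa": "Convert the following web text into a question-and-answer format:\n\n{doc}",
}

def rephrase(doc, style, generate):
    """`generate` is any instruction-tuned-LLM completion callable supplied by the
    user (e.g., a wrapper around an API or a local model); it is a placeholder."""
    return generate(STYLE_PROMPTS[style].format(doc=doc))

def mixed_pretraining_stream(real_docs, generate, synth_ratio=0.5, style="wikipedia"):
    """Yield raw web documents interleaved with their synthetic rephrases, a sketch
    of the joint real + synthetic pre-training data described in the abstract."""
    for doc in real_docs:
        yield doc
        if random.random() < synth_ratio:
            yield rephrase(doc, style, generate)
```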
PubTime: 2024-01-29
Downlink: http://arxiv.org/abs/2401.16423v1
Project: https://www.robots.ox.ac.uk/
GitHub: https://github.com/v-iashin/Synchformer
中文摘要: 我们的目标是视听同步,重点是“野外”视频,如YouTube上的视频,其中同步提示可能很少。我们的贡献包括一个新的视听同步模型,以及通过多模态段级对比预训练将特征提取与同步建模解耦的训练。这种方法在密集和稀疏设置中都实现了最先进的性能。我们还将同步模型训练扩展到AudioSet百万规模的“野外”数据集,研究可解释性的证据归因技术,并探索同步模型的新功能:视听同步性。
摘要: Our objective is audio-visual synchronization with a focus on ‘in-the-wild’ videos, such as those on YouTube, where synchronization cues can be sparse. Our contributions include a novel audio-visual synchronization model, and training that decouples feature extraction from synchronization modelling through multi-modal segment-level contrastive pre-training. This approach achieves state-of-the-art performance in both dense and sparse settings. We also extend synchronization model training to AudioSet a million-scale ‘in-the-wild’ dataset, investigate evidence attribution techniques for interpretability, and explore a new capability for synchronization models: audio-visual synchronizability.
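A generic sketch of segment-level contrastive pre-training: temporally aligned audio and visual segment embeddings are pulled together with a symmetric InfoNCE loss. This illustrates the decoupled feature-extraction stage in spirit only; Synchformer's actual objective and architecture may differ.

```python
import torch
import torch.nn.functional as F

def segment_contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE over temporally aligned audio/visual segment embeddings.
    Inputs: (num_segments, dim); segment i of audio matches segment i of video."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.t() / temperature                    # similarity of every audio/video pair
    targets = torch.arange(a.size(0), device=a.device)  # matching segments lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# toy usage with 16 segments of 256-d features
loss = segment_contrastive_loss(torch.randn(16, 256), torch.randn(16, 256))
print(loss)
```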
PubTime: 2024-01-29
Downlink: http://arxiv.org/abs/2401.16224v1
Project: https://ecnu-cilab.github.io/DiffutoonProjectPage/
Abstract: Toon shading is a type of non-photorealistic rendering task for animation. Its primary purpose is to render objects with a flat and stylized appearance. As diffusion models have ascended to the forefront of image synthesis methodologies, this paper explores an innovative form of toon shading based on diffusion models, aiming to render photorealistic videos directly into anime styles. In video stylization, existing methods encounter persistent challenges, notably in maintaining consistency and achieving high visual quality. We model the toon shading problem as four subproblems: stylization, consistency enhancement, structure guidance, and colorization. To address the challenges of video stylization, we propose an effective toon shading approach called Diffutoon. Diffutoon is capable of rendering remarkably detailed, high-resolution, and extended-duration videos in anime style. It can also edit the content according to prompts via an additional branch. The efficacy of Diffutoon is evaluated through quantitative metrics and human evaluation. Notably, Diffutoon surpasses both open-source and closed-source baseline approaches in our experiments. Our work is accompanied by the release of the source code and example videos on GitHub (project page: https://ecnu-cilab.github.io/DiffutoonProjectPage/).
PubTime: 2024-01-29
Downlink: http://arxiv.org/abs/2401.15947v1
GitHub: https://github.com/PKU-YuanGroup/MoE-LLaVA
Abstract: For Large Vision-Language Models (LVLMs), scaling the model can effectively improve performance. However, expanding model parameters significantly increases training and inference costs, since all model parameters are activated for every token. In this work, we propose MoE-tuning, a novel training strategy for LVLMs that constructs a sparse model with a very large number of parameters but a constant computational cost, and that effectively addresses the performance degradation typically associated with multi-modal learning and model sparsity. Furthermore, we present the MoE-LLaVA framework, a MoE-based sparse LVLM architecture. This framework uniquely activates only the top-k experts through routers during deployment, keeping the remaining experts inactive. Our extensive experiments highlight the excellent capabilities of MoE-LLaVA in visual understanding and its potential to reduce hallucinations in model outputs. Remarkably, with just 3 billion sparsely activated parameters, MoE-LLaVA demonstrates performance comparable to LLaVA-1.5-7B on various visual understanding datasets and even surpasses LLaVA-1.5-13B on object hallucination benchmarks. Through MoE-LLaVA, we aim to establish a baseline for sparse LVLMs and provide valuable insights for future research on more efficient and effective multi-modal learning systems. Code is released at https://github.com/PKU-YuanGroup/MoE-LLaVA.
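A generic sketch of top-k expert routing during the forward pass: a router scores the experts per token, only the top-k experts run, and their outputs are combined with renormalized gate weights. This is not the MoE-LLaVA implementation (which integrates routing into an LVLM and adds auxiliary load-balancing losses); the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Sparse mixture-of-experts layer with per-token top-k routing."""

    def __init__(self, dim, num_experts=4, k=2, hidden=256):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)])
        self.k = k

    def forward(self, x):                              # x: (tokens, dim)
        gate = self.router(x).softmax(dim=-1)          # (tokens, num_experts)
        weight, idx = gate.topk(self.k, dim=-1)        # keep only the top-k experts per token
        weight = weight / weight.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (idx == e)                          # which tokens routed to expert e
            rows = mask.any(dim=-1)
            if rows.any():
                w = (weight * mask).sum(dim=-1, keepdim=True)[rows]
                out[rows] += w * expert(x[rows])       # only selected tokens run this expert
        return out

print(TopKMoE(64)(torch.randn(10, 64)).shape)          # torch.Size([10, 64])
```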
PubTime: 2024-01-29
Downlink: http://arxiv.org/abs/2004.13324v3
Project: https://qianqianwang68.github.io/CAPS/
Abstract: Recent research on learned visual descriptors has shown promising improvements in correspondence estimation, a key component of many 3D vision tasks. However, existing descriptor learning frameworks typically require ground-truth correspondences between feature points for training, which are challenging to acquire at scale. In this paper, we propose a novel weakly-supervised framework that can learn feature descriptors solely from relative camera poses between images. To do so, we devise both a new loss function that exploits the epipolar constraint given by camera poses, and a new model architecture that makes the whole pipeline differentiable and efficient. Because we no longer need pixel-level ground-truth correspondences, our framework opens up the possibility of training on much larger and more diverse datasets for better and unbiased descriptors. We call the resulting descriptors CAmera Pose Supervised, or CAPS, descriptors. Though trained with weak supervision, CAPS descriptors outperform even prior fully-supervised descriptors and achieve state-of-the-art performance on a variety of geometric tasks. Project page: https://qianqianwang68.github.io/CAPS/
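The weak supervision comes from the epipolar constraint: for a true correspondence in normalized camera coordinates, x2^T E x1 = 0 with E = [t]_x R. A minimal residual computation is sketched below; the actual CAPS loss is a differentiable matching objective built on this constraint, not this plain squared residual.

```python
import torch

def skew(t):
    """Cross-product matrix [t]_x for a 3-vector t."""
    t0, t1, t2 = t
    zero = torch.zeros((), dtype=t.dtype)
    return torch.stack([torch.stack([zero, -t2, t1]),
                        torch.stack([t2, zero, -t0]),
                        torch.stack([-t1, t0, zero])])

def epipolar_residual(x1, x2, R, t):
    """Mean squared epipolar error x2^T E x1 for putative matches x1, x2 given the
    relative pose (R, t); points are (N, 2) in normalized camera coordinates."""
    E = skew(t) @ R
    x1h = torch.cat([x1, torch.ones(x1.size(0), 1)], dim=1)   # homogeneous coordinates
    x2h = torch.cat([x2, torch.ones(x2.size(0), 1)], dim=1)
    residual = (x2h * (x1h @ E.t())).sum(dim=1)               # x2^T E x1 per correspondence
    return (residual ** 2).mean()

# sanity check: identical image points under pure x-translation satisfy the constraint
R, t = torch.eye(3), torch.tensor([1.0, 0.0, 0.0])
pts = torch.randn(50, 2)
print(epipolar_residual(pts, pts, R, t))                       # ~0
```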
PubTime: 2024-01-29
Downlink: http://arxiv.org/abs/2401.15847v1
Project: https://sites.google.com/view/multipanelvqa/home
Abstract: Multipanel images, commonly seen as web screenshots, posters, etc., pervade our daily lives. These images, characterized by their composition of multiple subfigures in distinct layouts, effectively convey information to people. Toward building advanced multimodal AI applications, such as agents that understand complex scenes and navigate through webpages, the skill of multipanel visual reasoning is essential, and a comprehensive evaluation of models in this regard is important. Therefore, our paper introduces Multipanel Visual Question Answering (MultipanelVQA), a novel benchmark that specifically challenges models in comprehending multipanel images. The benchmark comprises 6,600 questions and answers related to multipanel images. While these questions are straightforward for average humans, who achieve nearly perfect accuracy, they pose significant challenges to the state-of-the-art Large Vision-Language Models (LVLMs) we tested. In our study, we utilized synthetically curated multipanel images specifically designed to isolate and evaluate the impact of diverse factors on model performance, revealing the sensitivity of LVLMs to various interferences in multipanel images, such as adjacent subfigures and layout complexity. As a result, MultipanelVQA highlights the need and direction for improving LVLMs' ability to understand complex visual-language contexts. Code and data are released at https://sites.google.com/view/multipanelvqa/home.
PubTime: 2024-01-28
Downlink: http://arxiv.org/abs/2401.15687v1
Project: https://sites.google.com/view/media2face
Abstract: The synthesis of 3D facial animations from speech has garnered considerable attention. Due to the scarcity of high-quality 4D facial data and well-annotated, abundant multi-modality labels, previous methods often suffer from limited realism and a lack of flexible conditioning. We address this challenge through a trilogy. We first introduce the Generalized Neural Parametric Facial Asset (GNPFA), an efficient variational auto-encoder mapping facial geometry and images to a highly generalized expression latent space, decoupling expressions and identities. We then use GNPFA to extract high-quality expressions and accurate head poses from a large array of videos. This yields the M2F-D dataset, a large, diverse, and scan-level co-speech 3D facial animation dataset with well-annotated emotion and style labels. Finally, we propose Media2Face, a diffusion model in the GNPFA latent space for co-speech facial animation generation, accepting rich multi-modal guidance from audio, text, and images. Extensive experiments demonstrate that our model not only achieves high fidelity in facial animation synthesis but also broadens the scope of expressiveness and style adaptability in 3D facial animation.
PubTime: 2024-01-28
Downlink: http://arxiv.org/abs/2401.15636v1
Project: https://freestylefreelunch.github.io/
Abstract: The rapid development of generative diffusion models has significantly advanced the field of style transfer. However, most current diffusion-based style transfer methods involve a slow iterative optimization process, e.g., model fine-tuning or textual inversion of the style concept. In this paper, we introduce FreeStyle, an innovative style transfer method built upon a pre-trained large diffusion model that requires no further optimization. Moreover, our method enables style transfer through only a text description of the desired style, eliminating the need for style images. Specifically, we propose a dual-stream encoder and single-stream decoder architecture, replacing the conventional U-Net in diffusion models. In the dual-stream encoder, two distinct branches take the content image and the style text prompt as inputs, achieving content and style decoupling. In the decoder, we further modulate features from the dual streams based on the given content image and the corresponding style text prompt for precise style transfer. Our experimental results demonstrate high-quality synthesis and fidelity of our method across various content images and style text prompts. The code and more results are available at our project website: https://freestylefreelunch.github.io/.
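The decoder "modulates features from the dual streams"; one common way to realize such conditioning is FiLM-style per-channel scale and shift predicted from the style-text embedding, sketched below. This is a generic illustration of feature modulation under that assumption, not FreeStyle's actual decoder.

```python
import torch
import torch.nn as nn

class StyleModulation(nn.Module):
    """FiLM-style modulation: the style embedding predicts a per-channel scale and
    shift that are applied to the content-branch feature map."""

    def __init__(self, content_dim, style_dim):
        super().__init__()
        self.to_scale_shift = nn.Linear(style_dim, 2 * content_dim)

    def forward(self, content_feat, style_emb):
        # content_feat: (batch, channels, h, w); style_emb: (batch, style_dim)
        scale, shift = self.to_scale_shift(style_emb).chunk(2, dim=-1)
        scale = scale[:, :, None, None]
        shift = shift[:, :, None, None]
        return content_feat * (1 + scale) + shift

out = StyleModulation(64, 128)(torch.randn(2, 64, 32, 32), torch.randn(2, 128))
print(out.shape)                                    # torch.Size([2, 64, 32, 32])
```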
PubTime: 2024-01-27
Downlink: http://arxiv.org/abs/2310.13833v2
GitHub: https://github.com/Graph-COM/GraphMaker
Abstract: Large-scale graphs with node attributes are increasingly common in various real-world applications. Creating synthetic, attribute-rich graphs that mirror real-world examples is crucial, especially for sharing graph data for analysis and developing learning models when the original data cannot be shared. Traditional graph generation methods are limited in their capacity to handle these complex structures. Recent advances in diffusion models have shown potential for generating graph structures without attributes and for smaller molecular graphs. However, these models face challenges in generating large attributed graphs due to the complex attribute-structure correlations and the large size of such graphs. This paper introduces GraphMaker, a novel diffusion model specifically designed for generating large attributed graphs. We explore various combinations of node attribute and graph structure generation processes, finding that an asynchronous approach more effectively captures the intricate attribute-structure correlations. We also address scalability issues through edge mini-batching generation. To demonstrate the practicality of our approach for graph data dissemination, we introduce a new evaluation pipeline. The evaluation shows that synthetic graphs generated by GraphMaker can be used to develop competitive graph machine learning models for the tasks defined over the original graphs without actually accessing them, whereas many leading graph generation methods fall short in this evaluation.
PubTime: 2024-01-27
Downlink: http://arxiv.org/abs/2401.15422v1
GitHub: https://github.com/MLGroup-JLU/LLM-data-aug-survey
Abstract: Large models, encompassing large language and diffusion models, have shown exceptional promise in approximating human-level intelligence, garnering significant interest from both academic and industrial spheres. However, the training of these large models necessitates vast quantities of high-quality data, and with continuous updates to these models, the existing reservoir of high-quality data may soon be depleted. This challenge has catalyzed a surge in research focused on data augmentation methods. Leveraging large models, these data augmentation techniques have outperformed traditional approaches. This paper offers an exhaustive review of large-model-driven data augmentation methods from a comprehensive perspective. We begin by classifying the relevant studies into three main categories: image augmentation, text augmentation, and paired data augmentation. Following this, we delve into various data post-processing techniques pertinent to large-model-based data augmentation. Our discussion then expands to the array of applications of these data augmentation methods within natural language processing, computer vision, and audio signal processing. We proceed to evaluate the successes and limitations of large-model-based data augmentation across different scenarios. Concluding our review, we highlight prospective challenges and avenues for future exploration in the field of data augmentation. Our objective is to furnish researchers with critical insights, ultimately contributing to the advancement of more sophisticated large models. We continuously maintain the related open-source materials at https://github.com/MLGroup-JLU/LLM-data-aug-survey.
PubTime: 2024-01-27
Downlink: http://arxiv.org/abs/2401.15282v1
GitHub: https://github.com/isbrycee/GEM-Glass-Segmentor
Abstract: Detecting glass regions is a challenging task due to the ambiguity of their transparency and reflection properties. Transparent glass shares the visual appearance of both the transmitted arbitrary background scene and reflected objects, and thus has no fixed patterns. Recent visual foundation models, trained on vast amounts of data, have shown stunning performance in image perception and image generation. To segment glass surfaces with higher accuracy, we make full use of two visual foundation models: Segment Anything (SAM) and Stable Diffusion. Specifically, we devise a simple glass surface segmentor named GEM, which consists only of a SAM backbone, a simple feature pyramid, a discerning query selection module, and a mask decoder. The discerning query selection can adaptively identify glass surface features, assigning them as initialized queries in the mask decoder. We also propose a synthetic but photorealistic large-scale glass surface detection dataset, dubbed S-GSD, generated via a diffusion model at four different scales, containing 1x, 5x, 10x, and 20x the original real data size. This dataset is a feasible source for transfer learning. The scale of synthetic data has a positive impact on transfer learning, although the improvement gradually saturates as the amount of data increases. Extensive experiments demonstrate that GEM achieves a new state of the art on the GSD-S validation set (IoU +2.1%). Codes and datasets are available at: https://github.com/isbrycee/GEM-Glass-Segmentor.
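A minimal sketch of the "discerning query selection" idea: score every backbone feature location for glass-likeness and take the features at the top-k locations as initialized queries for the mask decoder. The scoring head, query count, and shapes are illustrative assumptions, not GEM's implementation.

```python
import torch
import torch.nn as nn

class DiscerningQuerySelection(nn.Module):
    """Select the top-k feature locations as initial decoder queries."""

    def __init__(self, dim, num_queries=100):
        super().__init__()
        self.score_head = nn.Linear(dim, 1)            # glass-likeness score per location
        self.num_queries = num_queries

    def forward(self, feats):                          # feats: (batch, tokens, dim)
        scores = self.score_head(feats).squeeze(-1)
        topk = scores.topk(self.num_queries, dim=1).indices            # (batch, num_queries)
        index = topk.unsqueeze(-1).expand(-1, -1, feats.size(-1))
        return torch.gather(feats, 1, index)           # features at the selected locations

queries = DiscerningQuerySelection(256)(torch.randn(2, 64 * 64, 256))
print(queries.shape)                                   # torch.Size([2, 100, 256])
```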
PubTime: 2024-01-29
Downlink: http://arxiv.org/abs/2401.16304v1
Abstract: Visual place recognition is a critical task in computer vision, especially for localization and navigation systems. Existing methods often rely on contrastive learning: image descriptors are trained to have small distances for similar images and larger distances for dissimilar ones in a latent space. However, this approach struggles to ensure an accurate distance-based representation of image similarity, particularly when training with binary pairwise labels, and complex re-ranking strategies are required. This work introduces a fresh perspective by framing place recognition as a regression problem, using camera field-of-view overlap as the similarity ground truth for learning. By optimizing image descriptors to align directly with graded similarity labels, this approach enhances ranking capabilities without expensive re-ranking, offering data-efficient training and strong generalization across several benchmark datasets.
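A minimal sketch of the regression view: instead of a contrastive margin on binary pairs, the descriptor similarity of an image pair is regressed directly onto the graded field-of-view-overlap label. The cosine-similarity parameterization and MSE loss are assumptions for illustration; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def overlap_regression_loss(desc_a, desc_b, overlap):
    """Regress descriptor similarity onto graded ground truth: the cosine similarity
    of two image descriptors should match their camera FoV overlap in [0, 1]."""
    sim = F.cosine_similarity(desc_a, desc_b, dim=-1)   # in [-1, 1]
    sim = (sim + 1.0) / 2.0                             # map to [0, 1] to compare with overlap
    return F.mse_loss(sim, overlap)

# toy usage: a batch of 8 image pairs with 512-d descriptors and graded overlap labels
loss = overlap_regression_loss(torch.randn(8, 512), torch.randn(8, 512), torch.rand(8))
print(loss)
```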