分类:
- 大语言模型LLM
- 视觉模型VLM
- 扩散模型
- 视觉导航
- 具身智能,机器人
- 强化学习
- 开放词汇,检测分割
中文摘要: 我们提出了一个新任务:语言驱动的视频修复,即使用自然语言指令来指导修复过程。这种方法克服了传统视频修复方法依赖人工标注二值掩码的局限性,而该标注过程通常繁琐且耗费人力。我们构建了Remove Objects from Videos by Instructions(ROVI)数据集,包含5,650个视频和9,091个修复结果,用于支持该任务的训练与评估。我们还提出了一种新的基于扩散的语言驱动视频修复框架,这是该任务的首个端到端基线,其中集成了多模态大语言模型,以有效理解并执行复杂的基于语言的修复请求。综合实验结果展示了数据集的多用途性以及模型在各种语言指令修复场景中的有效性。我们将公开数据集、代码和模型。
摘要: We introduce a new task – language-driven video inpainting, which uses natural language instructions to guide the inpainting process. This approach overcomes the limitations of traditional video inpainting methods that depend on manually labeled binary masks, a process often tedious and labor-intensive. We present the Remove Objects from Videos by Instructions (ROVI) dataset, containing 5,650 videos and 9,091 inpainting results, to support training and evaluation for this task. We also propose a novel diffusion-based language-driven video inpainting framework, the first end-to-end baseline for this task, integrating Multimodal Large Language Models to understand and execute complex language-based inpainting requests effectively. Our comprehensive results showcase the dataset’s versatility and the model’s effectiveness in various language-instructed inpainting scenarios. We will make datasets, code, and models publicly available.
[Downlink:]http://arxiv.org/abs/2401.10226v1
[Project:]https://jianzongwu.github.io/projects/rovi|
中文摘要: 像CLIP这样的图像-文本训练近年来主导了视觉基础模型的预训练。后续工作尝试将区域级视觉学习引入CLIP的预训练,但由于缺乏大规模区域级数据集而面临可扩展性挑战。受自然语言处理中监督微调(SFT,如指令微调)的启发,我们探索了细粒度SFT在预训练之后进一步增强视觉基础模型的潜力,并据此提出了两阶段方法ViSFT(Vision SFT)来释放视觉基础模型的细粒度知识。在ViSFT中,视觉基础模型先在若干域内任务上进行视觉联合学习以得到增强,然后在域外基准上进行测试。在8块V100 GPU上使用ViSFT更新不到2天,一个参数量超过4.4B的视觉Transformer在多个域外基准(包括视觉与视觉-语言场景)上均获得提升。
摘要: Image-text training like CLIP has dominated the pretraining of vision foundation models in recent years. Subsequent efforts have been made to introduce region-level visual learning into CLIP’s pretraining but face scalability challenges due to the lack of large-scale region-level datasets. Drawing inspiration from supervised fine-tuning (SFT) in natural language processing such as instruction tuning, we explore the potential of fine-grained SFT in enhancing the generation of vision foundation models after their pretraining. Thus a two-stage method ViSFT (Vision SFT) is proposed to unleash the fine-grained knowledge of vision foundation models. In ViSFT, the vision foundation model is enhanced by performing visual joint learning on some in-domain tasks and then tested on out-of-domain benchmarks. With updating using ViSFT on 8 V100 GPUs in less than 2 days, a vision transformer with over 4.4B parameters shows improvements across various out-of-domain benchmarks including vision and vision-linguistic scenarios.
[Downlink:]http://arxiv.org/abs/2401.10222v1
[GitHub:]https://github.com/TencentARC/ViSFT/tree/main|
中文摘要: 大语言模型(LLMs)因其令人印象深刻的自然语言处理(NLP)能力而受到广泛关注。最近,许多研究聚焦于LLMs的工具使用能力,主要考察LLMs如何有效地使用给定的特定工具。然而,当LLM充当智能代理时(如在AutoGPT和MetaGPT等应用中),LLM需要参与复杂的决策过程,包括判断是否需要使用工具,以及从可用工具集合中选择最合适的工具来满足用户请求。因此,本文提出了MetaTool,一个用于评估LLMs是否具备工具使用意识并能否正确选择工具的基准。具体而言,我们在该基准中构建了名为ToolE的数据集,其中包含各种以提示形式触发LLMs使用工具的用户查询,涵盖单工具和多工具场景。随后,我们围绕工具使用意识与工具选择设置了相应任务。在工具选择方面,我们从不同角度定义了四个子任务:相似选项下的工具选择、特定场景下的工具选择、存在潜在可靠性问题的工具选择,以及多工具选择。我们对八个流行的LLM进行了实验,发现其中大多数仍难以有效地选择工具,凸显了LLM与真正的智能代理之间的差距。不过,通过错误分析,我们发现仍有很大的改进空间。最后,我们总结了对工具开发者的建议:强烈建议工具开发者根据工具所服务的下游LLM,选择合适的改写模型来生成新的工具描述。我们的代码位于\href{https://github.com/HowieHwong/MetaTool}{Github}。
摘要: Large language models (LLMs) have garnered significant attention due to their impressive natural language processing (NLP) capabilities. Recently, many studies have focused on the tool utilization ability of LLMs. They primarily investigated how LLMs effectively collaborate with given specific tools. However, in scenarios where LLMs serve as intelligent agents, as seen in applications like AutoGPT and MetaGPT, LLMs are expected to engage in intricate decision-making processes that involve deciding whether to employ a tool and selecting the most suitable tool(s) from a collection of available tools to fulfill user requests. Therefore, in this paper, we introduce MetaTool, a benchmark designed to evaluate whether LLMs have tool usage awareness and can correctly choose tools. Specifically, we create a dataset called ToolE within the benchmark. This dataset contains various types of user queries in the form of prompts that trigger LLMs to use tools, including both single-tool and multi-tool scenarios. Subsequently, we set the tasks for both tool usage awareness and tool selection. We define four subtasks from different perspectives in tool selection, including tool selection with similar choices, tool selection in specific scenarios, tool selection with possible reliability issues, and multi-tool selection. We conduct experiments involving eight popular LLMs and find that the majority of them still struggle to effectively select tools, highlighting the existing gaps between LLMs and genuine intelligent agents. However, through the error analysis, we found there is still significant room for improvement. Finally, we conclude with insights for tool developers – we strongly recommend that tool developers choose an appropriate rewrite model for generating new descriptions based on the downstream LLM the tool will apply to. Our code is in \href{https://github.com/HowieHwong/MetaTool}{Github}.
[Downlink:]http://arxiv.org/abs/2310.03128v4
[GitHub:]https://github.com/HowieHwong/MetaTool|
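As a rough illustration of the tool-selection evaluation described above, the sketch below formats a query together with candidate tool descriptions and scores an LLM's choice against a gold tool. The prompt template, sample fields, and the `ask_llm` callable are hypothetical placeholders, not the actual ToolE/MetaTool format.

```python
# Minimal sketch of a tool-selection evaluation loop in the spirit of MetaTool.
from typing import Callable, Dict, List

def build_selection_prompt(query: str, tools: Dict[str, str]) -> str:
    """Format a user query plus candidate tool descriptions into one prompt."""
    tool_lines = "\n".join(f"- {name}: {desc}" for name, desc in tools.items())
    return (
        "You are given a user request and a list of tools.\n"
        f"Tools:\n{tool_lines}\n"
        f"Request: {query}\n"
        "Answer with the single most suitable tool name, or 'none'."
    )

def evaluate_tool_selection(
    samples: List[dict], tools: Dict[str, str], ask_llm: Callable[[str], str]
) -> float:
    """Accuracy of an LLM at picking the gold tool for each query."""
    correct = 0
    for sample in samples:  # each sample: {"query": ..., "gold_tool": ...}
        answer = ask_llm(build_selection_prompt(sample["query"], tools))
        correct += int(sample["gold_tool"].lower() in answer.lower())
    return correct / max(len(samples), 1)
```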
中文摘要: 我们研究开放大语言模型(LLMs)在多大程度上能够从结构化数据生成连贯且相关的文本。为避免基准数据泄露进LLM训练数据所带来的偏差,我们收集了Quintd-1:一个面向五个数据到文本(D2T)生成任务的专门基准,由从公共API收集的标准格式结构化数据记录组成。我们利用无参考评估指标和LLMs的上下文学习能力,从而可以在没有人工撰写参考文本的情况下测试模型。我们的评估侧重于在词元(token)级别标注语义准确性错误,并结合人工标注者和基于GPT-4的度量。我们对模型在不同领域和任务上的行为进行了系统考察,结果表明具有7B参数的最先进开放LLMs能够在零样本设置下从各种标准数据格式生成流畅连贯的文本。然而,我们也表明输出的语义准确性仍是主要问题:在我们的基准上,根据人工标注者,80%的开放LLM输出包含语义错误(根据GPT-4则为91%)。我们的代码、数据和模型输出可从https://d2t-llm.github.io获取。
摘要: We investigate to which extent open large language models (LLMs) can generate coherent and relevant text from structured data. To prevent bias from benchmarks leaked into LLM training data, we collect Quintd-1: an ad-hoc benchmark for five data-to-text (D2T) generation tasks, consisting of structured data records in standard formats gathered from public APIs. We leverage reference-free evaluation metrics and LLMs’ in-context learning capabilities, allowing us to test the models with no human-written references. Our evaluation focuses on annotating semantic accuracy errors on token-level, combining human annotators and a metric based on GPT-4. Our systematic examination of the models’ behavior across domains and tasks suggests that state-of-the-art open LLMs with 7B parameters can generate fluent and coherent text from various standard data formats in zero-shot settings. However, we also show that semantic accuracy of the outputs remains a major issue: on our benchmark, 80% of outputs of open LLMs contain a semantic error according to human annotators (91% according to GPT-4). Our code, data, and model outputs are available at https://d2t-llm.github.io.
[Downlink:]http://arxiv.org/abs/2401.10186v1
[Project:]https://d2t-llm.github.io.|
中文摘要: 视觉和语言概念天然地组织成一个层次结构,例如文本概念“狗”蕴含所有包含狗的图像。尽管这一点很直观,但当前的大规模视觉语言模型(如CLIP)并没有显式地捕捉这种层次结构。我们提出了MERU,一种为图像和文本生成双曲表示的对比模型。双曲空间具有适合嵌入树状数据的几何性质,因此MERU能更好地捕捉图像-文本数据集中的潜在层次结构。我们的结果表明,MERU学习到了高度可解释且结构化的表示空间,同时在图像分类和图像-文本检索等标准多模态任务上与CLIP的性能相当。我们的代码和模型可从https://www.github.com/facebookresearch/meru获取。
摘要: Visual and linguistic concepts naturally organize themselves in a hierarchy, where a textual concept “dog” entails all images that contain dogs. Despite being intuitive, current large-scale vision and language models such as CLIP do not explicitly capture such hierarchy. We propose MERU, a contrastive model that yields hyperbolic representations of images and text. Hyperbolic spaces have suitable geometric properties to embed tree-like data, so MERU can better capture the underlying hierarchy in image-text datasets. Our results show that MERU learns a highly interpretable and structured representation space while being competitive with CLIP’s performance on standard multi-modal tasks like image classification and image-text retrieval. Our code and models are available at https://www.github.com/facebookresearch/meru
[Downlink:]http://arxiv.org/abs/2304.09172v3
[GitHub:]https://www.github.com/facebookresearch/meru|
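To make the hyperbolic-representation idea above concrete, here is a minimal sketch that lifts Euclidean image/text embeddings onto a Lorentz hyperboloid and uses negative geodesic distance as the contrastive similarity. MERU's actual formulation (exponential-map lifting, learnable curvature, entailment objective) differs; the curvature value and shapes here are assumptions.

```python
import torch

def lift_to_hyperboloid(v: torch.Tensor, c: float = 1.0) -> torch.Tensor:
    """Lift Euclidean embeddings v (B, d) onto the Lorentz hyperboloid of curvature -c
    by solving <x, x>_L = -1/c for the time coordinate."""
    time = torch.sqrt(1.0 / c + (v * v).sum(-1, keepdim=True))
    return torch.cat([time, v], dim=-1)  # (B, d+1), index 0 is the time axis

def lorentz_distance(x: torch.Tensor, y: torch.Tensor, c: float = 1.0) -> torch.Tensor:
    """Geodesic distance between hyperboloid points x, y of shape (B, d+1)."""
    inner = -x[..., 0] * y[..., 0] + (x[..., 1:] * y[..., 1:]).sum(-1)  # Lorentzian inner product
    # Clamp for numerical safety: the argument of arccosh must be >= 1.
    return torch.acosh(torch.clamp(-c * inner, min=1.0 + 1e-6)) / c**0.5

# Contrastive logits: negative hyperbolic distance plays the role of cosine similarity.
img = torch.randn(4, 512)
txt = torch.randn(4, 512)
logits = -lorentz_distance(lift_to_hyperboloid(img), lift_to_hyperboloid(txt))
```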
中文摘要: 尽管LLMs具有令人印象深刻的生成能力,但在现实应用中,它们会受到与事实相冲突的幻觉的困扰。准确识别LLMs生成文本中的幻觉,尤其是在复杂推理场景下,仍是一个相对未被探索的领域。为填补这一空白,我们提出了FactCHD,一个专门用于检测LLMs事实冲突幻觉的基准。FactCHD包含覆盖多种事实性模式的多样化数据集,包括普通、多跳、比较和集合运算等类型。FactCHD的一个独特之处在于整合了基于事实的证据链,从而显著加深了对检测器解释能力的评估。在不同LLMs上的实验暴露了当前方法在准确检测事实错误方面的不足。此外,我们提出了Truth-Triangulator,它综合工具增强的ChatGPT与基于Llama2的LoRA微调模型所给出的反思性判断,旨在通过融合预测结果与证据得到更可信的检测。基准数据集可在https://github.com/zjunlp/FactCHD获取。
摘要: Despite their impressive generative capabilities, LLMs are hindered by fact-conflicting hallucinations in real-world applications. The accurate identification of hallucinations in texts generated by LLMs, especially in complex inferential scenarios, is a relatively unexplored area. To address this gap, we present FactCHD, a dedicated benchmark designed for the detection of fact-conflicting hallucinations from LLMs. FactCHD features a diverse dataset that spans various factuality patterns, including vanilla, multi-hop, comparison, and set operation. A distinctive element of FactCHD is its integration of fact-based evidence chains, significantly enhancing the depth of evaluating the detectors’ explanations. Experiments on different LLMs expose the shortcomings of current approaches in detecting factual errors accurately. Furthermore, we introduce Truth-Triangulator that synthesizes reflective considerations by tool-enhanced ChatGPT and LoRA-tuning based on Llama2, aiming to yield more credible detection through the amalgamation of predictive results and evidence. The benchmark dataset is available at https://github.com/zjunlp/FactCHD.
[Downlink:]http://arxiv.org/abs/2310.12086v2
[GitHub:]https://github.com/zjunlp/FactCHD.|
中文摘要: 在过去的几十年里,日本漫画(通常称为Manga)跨越了文化和语言的界限,风靡全球。然而,漫画对视觉线索和插画的固有依赖使其对视觉障碍人士而言在很大程度上难以获取。在这项工作中,我们试图消除这一重大障碍,使每个人都能欣赏并积极参与漫画。具体而言,我们处理的是“对话归属”(diarisation)问题,即以全自动方式生成“谁在何时说了什么”的转录。为此,我们做出以下贡献:(1)我们提出了统一模型Magi,它能够(a)检测画格、文本框和角色框,(b)按身份对角色进行聚类(无需预先知道聚类数量),以及(c)将对话与其说话者关联;(2)我们提出了一种新方法,能够按照阅读顺序对检测到的文本框排序并生成对话转录;(3)我们使用公开可得的[英文]漫画页面为该任务标注了评估基准。代码、评估数据集和预训练模型见:https://github.com/ragavsachdeva/magi。
摘要: In the past few decades, Japanese comics, commonly referred to as Manga, have transcended both cultural and linguistic boundaries to become a true worldwide sensation. Yet, the inherent reliance on visual cues and illustration within manga renders it largely inaccessible to individuals with visual impairments. In this work, we seek to address this substantial barrier, with the aim of ensuring that manga can be appreciated and actively engaged by everyone. Specifically, we tackle the problem of diarisation i.e. generating a transcription of who said what and when, in a fully automatic way. To this end, we make the following contributions: (1) we present a unified model, Magi, that is able to (a) detect panels, text boxes and character boxes, (b) cluster characters by identity (without knowing the number of clusters apriori), and (c) associate dialogues to their speakers; (2) we propose a novel approach that is able to sort the detected text boxes in their reading order and generate a dialogue transcript; (3) we annotate an evaluation benchmark for this task using publicly available [English] manga pages. The code, evaluation datasets and the pre-trained model can be found at: https://github.com/ragavsachdeva/magi.
[Downlink:]http://arxiv.org/abs/2401.10224v1
[GitHub:]https://github.com/ragavsachdeva/magi.|
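Magi learns panel/text detection, character clustering, and speaker association end to end; the snippet below only sketches the reading-order step with a naive geometric heuristic (panels and text boxes read top-to-bottom, right-to-left), using assumed (x0, y0, x1, y1) box coordinates. It is not the paper's learned ordering model.

```python
# Rough heuristic baseline for ordering detected boxes in manga reading order.
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # x0, y0, x1, y1

def reading_order(panels: List[Box], texts: List[Box], row_tol: float = 30.0) -> List[Box]:
    def center(b: Box) -> Tuple[float, float]:
        return (b[0] + b[2]) / 2, (b[1] + b[3]) / 2

    def sort_rtl_ttb(boxes: List[Box]) -> List[Box]:
        # Bucket boxes into rows by vertical center, then read each row right-to-left.
        rows: List[List[Box]] = []
        for b in sorted(boxes, key=lambda b: center(b)[1]):
            if rows and abs(center(rows[-1][0])[1] - center(b)[1]) < row_tol:
                rows[-1].append(b)
            else:
                rows.append([b])
        return [b for row in rows for b in sorted(row, key=lambda b: -center(b)[0])]

    def inside(t: Box, p: Box) -> bool:
        cx, cy = center(t)
        return p[0] <= cx <= p[2] and p[1] <= cy <= p[3]

    ordered: List[Box] = []
    for panel in sort_rtl_ttb(panels):
        ordered += sort_rtl_ttb([t for t in texts if inside(t, panel)])
    return ordered
```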
中文摘要: 为交错图像-文本数据开发生成模型兼具研究与实用价值。这要求模型既能理解交错序列,又能随后生成图像和文本。然而,现有尝试受限于固定数量的视觉token难以高效捕捉图像细节的问题,这在多图像场景下尤为突出。为解决这一问题,本文提出了MM-Interleaved,一个面向交错图像-文本数据的端到端生成模型。它引入了多尺度、多图像特征同步器模块,使生成过程可以直接访问先前上下文中的细粒度图像特征。MM-Interleaved在成对与交错的图像-文本语料上进行端到端预训练,并通过监督微调阶段进一步增强,使模型更好地遵循复杂的多模态指令。实验表明,MM-Interleaved既能按多模态指令识别视觉细节,也能在文本与视觉条件下生成一致的图像,展现了其多用途性。代码和模型可在\url{https://github.com/OpenGVLab/MM-Interleaved}获取。
摘要: Developing generative models for interleaved image-text data has both research and practical value. It requires models to understand the interleaved sequences and subsequently generate images and text. However, existing attempts are limited by the issue that the fixed number of visual tokens cannot efficiently capture image details, which is particularly problematic in the multi-image scenarios. To address this, this paper presents MM-Interleaved, an end-to-end generative model for interleaved image-text data. It introduces a multi-scale and multi-image feature synchronizer module, allowing direct access to fine-grained image features in the previous context during the generation process. MM-Interleaved is end-to-end pre-trained on both paired and interleaved image-text corpora. It is further enhanced through a supervised fine-tuning phase, wherein the model improves its ability to follow complex multi-modal instructions. Experiments demonstrate the versatility of MM-Interleaved in recognizing visual details following multi-modal instructions and generating consistent images following both textual and visual conditions. Code and models are available at \url{https://github.com/OpenGVLab/MM-Interleaved}.
[Downlink:]http://arxiv.org/abs/2401.10208v1
[GitHub:]https://github.com/OpenGVLab/MM-Interleaved|
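A minimal sketch of the idea behind the multi-scale, multi-image feature synchronizer described above: sequence tokens cross-attend to fine-grained image features gathered from several scales of the preceding context. The dimensions, single attention layer, and residual wiring are illustrative assumptions, not the released module.

```python
import torch
import torch.nn as nn
from typing import List

class MultiScaleFeatureSync(nn.Module):
    """Sketch: let sequence tokens cross-attend to fine-grained, multi-scale image features."""

    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor, image_feats: List[torch.Tensor]) -> torch.Tensor:
        # image_feats: list of (B, C, H_i, W_i) maps at different scales.
        kv = torch.cat([f.flatten(2).transpose(1, 2) for f in image_feats], dim=1)  # (B, sum HW, C)
        attended, _ = self.attn(query=tokens, key=kv, value=kv)
        return self.norm(tokens + attended)  # residual connection

# Example: 32 sequence tokens attending to two feature scales of a single image.
sync = MultiScaleFeatureSync()
out = sync(torch.randn(2, 32, 1024), [torch.randn(2, 1024, 16, 16), torch.randn(2, 1024, 8, 8)])
```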
摘要: Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) stand as the two most popular foundation models for visual representation learning. While CNNs exhibit remarkable scalability with linear complexity w.r.t. image resolution, ViTs surpass them in fitting capabilities despite contending with quadratic complexity. A closer inspection reveals that ViTs achieve superior visual modeling performance through the incorporation of global receptive fields and dynamic weights. This observation motivates us to propose a novel architecture that inherits these components while enhancing computational efficiency. To this end, we draw inspiration from the recently introduced state space model and propose the Visual State Space Model (VMamba), which achieves linear complexity without sacrificing global receptive fields. To address the encountered direction-sensitive issue, we introduce the Cross-Scan Module (CSM) to traverse the spatial domain and convert any non-causal visual image into order patch sequences. Extensive experimental results substantiate that VMamba not only demonstrates promising capabilities across various visual perception tasks, but also exhibits more pronounced advantages over established benchmarks as the image resolution increases. Source code has been available at https://github.com/MzeroMiko/VMamba.
[Downlink:]http://arxiv.org/abs/2401.10166v1
[GitHub:]https://github.com/MzeroMiko/VMamba.|
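The Cross-Scan Module described above turns a 2D feature map into several 1D sequences so a causal state-space model can cover the whole spatial domain. The sketch below reproduces only that four-direction unfolding; the selective-scan processing and merge step are omitted, and the layout conventions are assumptions rather than the official implementation.

```python
import torch

def cross_scan(x: torch.Tensor) -> torch.Tensor:
    """Unfold a (B, C, H, W) feature map into four 1D patch sequences traversed along
    different directions, so a causal sequence model sees every location from multiple
    scan orders."""
    B, C, H, W = x.shape
    rowwise = x.flatten(2)                  # left-to-right, top-to-bottom
    colwise = x.transpose(2, 3).flatten(2)  # top-to-bottom, left-to-right
    scans = torch.stack(
        [rowwise, rowwise.flip(-1), colwise, colwise.flip(-1)], dim=1
    )                                       # (B, 4, C, H*W)
    return scans

# After the state-space model processes each sequence, the four outputs are
# flipped/transposed back and merged, restoring the 2D layout.
seqs = cross_scan(torch.randn(2, 96, 14, 14))
print(seqs.shape)  # torch.Size([2, 4, 96, 196])
```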
中文摘要: 场景文本识别(STR)是一项具有挑战性的任务,旨在识别自然场景图像中的文本。尽管当前最先进的STR模型性能很高,但由于依赖由视觉编码器和序列解码器组成的混合架构,它们的推理效率通常较低。在这项工作中,我们提出了面向快速高效场景文本识别的视觉可置换提取器VIPTR,在STR领域实现了高性能与快速推理之间的出色平衡。具体而言,VIPTR采用一种金字塔结构的视觉语义提取器,由多个自注意力层构成,同时摒弃了传统的序列解码器。这一设计带来了轻量且高效、能够处理不同尺寸输入的模型。在多个中英文场景文本识别标准数据集上的大量实验验证了VIPTR的优越性。值得注意的是,VIPTR-T(Tiny)变体在与其他轻量级模型相当的准确率下实现了SOTA推理速度;而VIPTR-L(Large)变体在保持较少参数量和良好推理速度的同时获得了更高的识别精度。我们的方法兼具高精度与高效率,为STR任务提供了有说服力的解决方案,能显著惠及需要快速可靠文本识别的现实应用。代码已公开于https://github.com/cxfyxl/VIPTR。
摘要: Scene Text Recognition (STR) is a challenging task that involves recognizing text within images of natural scenes. Although current state-of-the-art models for STR exhibit high performance, they typically suffer from low inference efficiency due to their reliance on hybrid architectures comprised of visual encoders and sequence decoders. In this work, we propose the VIsion Permutable extractor for fast and efficient scene Text Recognition (VIPTR), which achieves an impressive balance between high performance and rapid inference speeds in the domain of STR. Specifically, VIPTR leverages a visual-semantic extractor with a pyramid structure, characterized by multiple self-attention layers, while eschewing the traditional sequence decoder. This design choice results in a lightweight and efficient model capable of handling inputs of varying sizes. Extensive experimental results on various standard datasets for both Chinese and English scene text recognition validate the superiority of VIPTR. Notably, the VIPTR-T (Tiny) variant delivers highly competitive accuracy on par with other lightweight models and achieves SOTA inference speeds. Meanwhile, the VIPTR-L (Large) variant attains greater recognition accuracy, while maintaining a low parameter count and favorable inference speed. Our proposed method provides a compelling solution for the STR challenge, which blends high accuracy with efficiency and greatly benefits real-world applications requiring fast and reliable text recognition. The code is publicly available at https://github.com/cxfyxl/VIPTR.
[Downlink:]http://arxiv.org/abs/2401.10110v1
[GitHub:]https://github.com/cxfyxl/VIPTR.|
中文摘要: 全景分割和实例分割网络通常需要专门的目标检测模块、复杂的损失函数以及专门设计的后处理步骤来处理实例掩码的排列不变性。本工作基于Stable Diffusion,提出了一种用于全景分割的潜在扩散方法,得到了一个省去上述复杂设计的简单架构。我们的训练过程包含两步:(1)训练一个浅层自编码器,将分割掩码投影到潜在空间;(2)训练一个扩散模型,以便在潜在空间中进行以图像为条件的采样。生成模型的使用开启了对掩码补全或修复(inpainting)的探索,这在交互式分割中具有应用价值。实验验证在全景分割和掩码修复两方面都取得了有希望的结果。虽然没有刷新最先进水平,但我们模型的简单性、通用性和掩码补全能力是可取的特性。
摘要: Panoptic and instance segmentation networks are often trained with specialized object detection modules, complex loss functions, and ad-hoc post-processing steps to handle the permutation-invariance of the instance masks. This work builds upon Stable Diffusion and proposes a latent diffusion approach for panoptic segmentation, resulting in a simple architecture which omits these complexities. Our training process consists of two steps: (1) training a shallow autoencoder to project the segmentation masks to latent space; (2) training a diffusion model to allow image-conditioned sampling in latent space. The use of a generative model unlocks the exploration of mask completion or inpainting, which has applications in interactive segmentation. The experimental validation yields promising results for both panoptic segmentation and mask inpainting. While not setting a new state-of-the-art, our model’s simplicity, generality, and mask completion capability are desirable properties.
[Downlink:]http://arxiv.org/abs/2401.10227v1
[GitHub:]https://github.com/segments-ai/latent-diffusion-segmentation|
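Step (1) of the two-step recipe above, a shallow autoencoder that compresses segmentation masks into a latent space, can be sketched as follows. Channel counts, depth, and the one-hot mask encoding are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
from typing import Tuple

class ShallowMaskAutoencoder(nn.Module):
    """Shallow autoencoder mapping K one-hot mask channels to a compact latent and back."""

    def __init__(self, num_classes: int = 32, latent_channels: int = 4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(num_classes, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, latent_channels, 3, stride=2, padding=1),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, num_classes, 4, stride=2, padding=1),
        )

    def forward(self, masks: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        latent = self.encoder(masks)   # 4x spatially downsampled latent
        recon = self.decoder(latent)   # per-pixel class logits
        return latent, recon

ae = ShallowMaskAutoencoder()
latent, logits = ae(torch.zeros(1, 32, 128, 128))
# Step (2) would train an image-conditioned diffusion model on `latent`.
```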
中文摘要: 视频外扩(video outpainting)旨在合理补全视频帧边缘的缺失区域。与图像外扩相比,它面临额外的挑战,因为模型需要保持填充区域的时间一致性。本文提出了一种用于视频外扩的掩码3D扩散模型。我们使用掩码建模技术来训练3D扩散模型,这使我们能够利用多个引导帧来衔接多个视频片段的推理结果,从而保证时间一致性并减少相邻帧之间的抖动。同时,我们提取视频的全局帧作为提示,并通过交叉注意力引导模型获取当前视频片段之外的信息。我们还引入了一种混合式由粗到细的推理流水线来缓解伪影累积问题。现有的由粗到细流水线仅采用填充(infilling)策略,由于稀疏帧之间的时间间隔过大而导致性能下降;而我们的流水线得益于掩码建模的双向学习,在生成稀疏帧时可以采用填充与插值相结合的混合策略。实验表明,我们的方法在视频外扩任务中取得了最先进的结果。更多结果与代码见:https://fanfanda.github.io/M3DDM/。
摘要: Video outpainting aims to adequately complete missing areas at the edges of video frames. Compared to image outpainting, it presents an additional challenge as the model should maintain the temporal consistency of the filled area. In this paper, we introduce a masked 3D diffusion model for video outpainting. We use the technique of mask modeling to train the 3D diffusion model. This allows us to use multiple guide frames to connect the results of multiple video clip inferences, thus ensuring temporal consistency and reducing jitter between adjacent frames. Meanwhile, we extract the global frames of the video as prompts and guide the model to obtain information other than the current video clip using cross-attention. We also introduce a hybrid coarse-to-fine inference pipeline to alleviate the artifact accumulation problem. The existing coarse-to-fine pipeline only uses the infilling strategy, which brings degradation because the time interval of the sparse frames is too large. Our pipeline benefits from bidirectional learning of the mask modeling and thus can employ a hybrid strategy of infilling and interpolation when generating sparse frames. Experiments show that our method achieves state-of-the-art results in video outpainting tasks. More results and codes are provided at our https://fanfanda.github.io/M3DDM/.
[Downlink:]http://arxiv.org/abs/2309.02119v2
[Project:]https://fanfanda.github.io/M3DDM/.|
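To illustrate the mask-modeling conditioning described above, the sketch below keeps a few frames fully visible as guide frames while zeroing out the lateral regions to be outpainted in the remaining frames. The margin layout, tensor shapes, and guide-frame choice are assumptions for illustration only.

```python
import torch
from typing import List

def build_masked_condition(frames: torch.Tensor, guide_idx: List[int], margin: int = 32):
    """Frames at `guide_idx` are kept in full as guide frames; the other frames have
    their left/right margins (the regions to be outpainted) zeroed out and flagged."""
    B, T, C, H, W = frames.shape
    mask = torch.ones(B, T, 1, H, W)   # 1 = known pixel, 0 = to be generated
    mask[:, :, :, :, :margin] = 0
    mask[:, :, :, :, W - margin:] = 0
    mask[:, guide_idx] = 1             # guide frames stay fully visible
    conditioned = frames * mask        # zero out unknown regions
    return conditioned, mask

frames = torch.randn(1, 16, 3, 128, 256)
cond, mask = build_masked_condition(frames, guide_idx=[0, 15])
```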
摘要: The goal of this paper is to generate realistic audio with a lightweight and fast diffusion-based vocoder, named FreGrad. Our framework consists of the following three key components: (1) We employ discrete wavelet transform that decomposes a complicated waveform into sub-band wavelets, which helps FreGrad to operate on a simple and concise feature space, (2) We design a frequency-aware dilated convolution that elevates frequency awareness, resulting in generating speech with accurate frequency information, and (3) We introduce a bag of tricks that boosts the generation quality of the proposed model. In our experiments, FreGrad achieves 3.7 times faster training time and 2.2 times faster inference speed compared to our baseline while reducing the model size by 0.6 times (only 1.78M parameters) without sacrificing the output quality. Audio samples are available at: https://mm.kaist.ac.kr/projects/FreGrad.
[Downlink:]http://arxiv.org/abs/2401.10032v1
[Project:]https://mm.kaist.ac.kr/projects/FreGrad.|
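The sketch below shows a single-level Haar discrete wavelet transform splitting a waveform into low- and high-frequency sub-bands, which is the kind of decomposition FreGrad operates on; the paper's exact wavelet choice and implementation may differ.

```python
import numpy as np

def haar_dwt(signal: np.ndarray):
    """One level of a Haar DWT: split a waveform into approximation and detail sub-bands."""
    if len(signal) % 2:                   # pad to even length
        signal = np.append(signal, signal[-1])
    even, odd = signal[0::2], signal[1::2]
    approx = (even + odd) / np.sqrt(2)    # low-pass sub-band
    detail = (even - odd) / np.sqrt(2)    # high-pass sub-band
    return approx, detail

def haar_idwt(approx: np.ndarray, detail: np.ndarray) -> np.ndarray:
    """Inverse transform: perfectly reconstructs the (padded) waveform."""
    even = (approx + detail) / np.sqrt(2)
    odd = (approx - detail) / np.sqrt(2)
    out = np.empty(even.size + odd.size)
    out[0::2], out[1::2] = even, odd
    return out

wave = np.random.randn(22050)             # 1 s of audio at 22.05 kHz
lo, hi = haar_dwt(wave)
assert np.allclose(haar_idwt(lo, hi)[: len(wave)], wave)
```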
中文摘要: 无监督域自适应(UDA)旨在将利用源域标注数据学习到的模型迁移到目标域的无标注数据上。为了解决源域与目标域之间巨大的域差距问题,我们提出了一种新的域自适应目标检测正则化方法BlenDA,通过生成中间域的伪样本及其对应的软域标签来进行自适应训练。中间样本由源图像与其对应的翻译图像动态混合生成,翻译图像来自现成的预训练文本到图像扩散模型,该模型以目标域的文本标签为输入,并已展现出出色的图像到图像翻译质量。在两个自适应基准上的实验结果表明,我们的方法可以显著提升最先进的域自适应目标检测器Adversarial Query Transformer(AQT)的性能。特别地,在Cityscapes到Foggy Cityscapes的自适应中,我们在Foggy Cityscapes数据集上取得了53.4%的mAP,比此前的最先进结果高出1.5%。值得注意的是,我们的方法同样适用于各种域自适应目标检测范式。代码可从以下网址获取:https://github.com/aiiu-lab/BlenDA
摘要: Unsupervised domain adaptation (UDA) aims to transfer a model learned using labeled data from the source domain to unlabeled data in the target domain. To address the large domain gap issue between the source and target domains, we propose a novel regularization method for domain adaptive object detection, BlenDA, by generating the pseudo samples of the intermediate domains and their corresponding soft domain labels for adaptation training. The intermediate samples are generated by dynamically blending the source images with their corresponding translated images using an off-the-shelf pre-trained text-to-image diffusion model which takes the text label of the target domain as input and has demonstrated superior image-to-image translation quality. Based on experimental results from two adaptation benchmarks, our proposed approach can significantly enhance the performance of the state-of-the-art domain adaptive object detector, Adversarial Query Transformer (AQT). Particularly, in the Cityscapes to Foggy Cityscapes adaptation, we achieve an impressive 53.4% mAP on the Foggy Cityscapes dataset, surpassing the previous state-of-the-art by 1.5%. It is worth noting that our proposed method is also applicable to various paradigms of domain adaptive object detection. The code is available at:https://github.com/aiiu-lab/BlenDA
[Downlink:]http://arxiv.org/abs/2401.09921v1
[GitHub:]https://github.com/aiiu-lab/BlenDA|
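The intermediate-domain construction described above can be sketched as a simple pixel-space blend of a source image with its target-style translation, paired with a matching soft domain label. The two-element label format and the blend schedule are assumptions for illustration.

```python
import torch

def blend_for_adaptation(src_images: torch.Tensor, translated_images: torch.Tensor, lam: float):
    """Linearly blend each source image with its target-style translation (e.g. from an
    off-the-shelf text-to-image translation model) and emit a matching soft domain label."""
    blended = (1.0 - lam) * src_images + lam * translated_images
    soft_domain_label = torch.tensor([1.0 - lam, lam])   # [source weight, target weight]
    return blended, soft_domain_label

src = torch.rand(4, 3, 512, 512)
trans = torch.rand(4, 3, 512, 512)
mixed, domain = blend_for_adaptation(src, trans, lam=0.3)  # lam can be scheduled over training
```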
中文摘要: 高分辨率3D物体生成仍然是一项具有挑战性的任务,主要原因在于带全面标注的训练数据有限。近期的进展试图借助在大规模精选网络数据集上预训练的图像生成模型,通过分数蒸馏采样(SDS)等知识迁移技术来克服这一限制。为了高效满足高分辨率渲染的需求,通常需要采用基于潜在表示的模型,例如潜在扩散模型(LDM)。在这一框架下出现了一个重要挑战:为了计算单个图像像素的梯度,必须让梯度从指定的潜在空间经由图像模型的冻结组件(例如LDM中使用的VAE编码器)反向传播。然而,这条梯度传播路径从未被优化,在训练期间处于不受控状态。我们发现,这种未加约束的梯度会损害3D模型从图像生成模型中获取纹理相关信息的能力,导致外观合成质量较差。为了解决这一核心挑战,我们提出了一种称为逐像素梯度裁剪(PGC)的创新操作,可无缝集成到现有3D生成模型中,从而提升其合成质量。具体来说,我们通过高效地裁剪逐像素梯度来控制随机梯度的幅值,同时保留关键的纹理相关梯度方向。尽管方法简单且额外开销极小,大量实验仍证明了PGC在提升现有3D生成模型高分辨率物体渲染性能方面的有效性。
摘要: High-resolution 3D object generation remains a challenging task primarily due to the limited availability of comprehensive annotated training data. Recent advancements have aimed to overcome this constraint by harnessing image generative models, pretrained on extensive curated web datasets, using knowledge transfer techniques like Score Distillation Sampling (SDS). Efficiently addressing the requirements of high-resolution rendering often necessitates the adoption of latent representation-based models, such as the Latent Diffusion Model (LDM). In this framework, a significant challenge arises: To compute gradients for individual image pixels, it is necessary to backpropagate gradients from the designated latent space through the frozen components of the image model, such as the VAE encoder used within LDM. However, this gradient propagation pathway has never been optimized, remaining uncontrolled during training. We find that the unregulated gradients adversely affect the 3D model’s capacity in acquiring texture-related information from the image generative model, leading to poor quality appearance synthesis. To address this overarching challenge, we propose an innovative operation termed Pixel-wise Gradient Clipping (PGC) designed for seamless integration into existing 3D generative models, thereby enhancing their synthesis quality. Specifically, we control the magnitude of stochastic gradients by clipping the pixel-wise gradients efficiently, while preserving crucial texture-related gradient directions. Despite this simplicity and minimal extra cost, extensive experiments demonstrate the efficacy of our PGC in enhancing the performance of existing 3D generative models for high-resolution object rendering.
[Downlink:]http://arxiv.org/abs/2310.12474v4
[Project:]https://fudan-zvg.github.io/PGC-3D|
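A minimal sketch of the pixel-wise gradient clipping idea: rescale each pixel's gradient vector so its magnitude stays below a threshold while its direction is preserved. The (B, C, H, W) layout and threshold value are assumptions; the paper's exact clipping rule may differ.

```python
import torch

def pixelwise_gradient_clip(grad: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Rescale each pixel's gradient (over channels) so its norm does not exceed tau,
    preserving its direction."""
    norm = grad.norm(dim=1, keepdim=True)                # per-pixel gradient magnitude
    scale = torch.clamp(tau / (norm + 1e-12), max=1.0)   # shrink only oversized gradients
    return grad * scale

# Typical use in an SDS-style loop: clip the gradient w.r.t. the rendered image before
# backpropagating it through the frozen VAE encoder into the 3D model, e.g.
#   rendered.backward(gradient=pixelwise_gradient_clip(sds_grad))
grad = torch.randn(1, 3, 512, 512)
clipped = pixelwise_gradient_clip(grad, tau=0.1)
assert torch.all(clipped.norm(dim=1) <= 0.1 + 1e-6)
```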
中文摘要: 组织病理图像中的细胞核实例分割对生物学分析和癌症诊断非常重要,但由于两个原因仍具挑战性:(1)嫌色细胞核的核内与核外区域视觉表现相似,常导致欠分割;(2)现有方法缺乏对细胞核结构的挖掘,导致实例预测碎片化。为解决这些问题,本文提出了一种结构编码与交互网络SEINE,它建立了细胞核的结构建模方案,并利用细胞核之间的结构相似性来提升每个分割实例的完整性。具体而言,SEINE引入了一种基于轮廓的结构编码(SE),考虑细胞核结构与语义之间的相关性,实现了对细胞核结构的合理表示。在该编码的基础上,我们提出了结构引导注意力(SGA),以清晰的细胞核为原型来增强模糊细胞核的结构学习。为了强化结构学习能力,我们提出语义特征融合(SFF)来提升语义分支与结构分支的语义一致性。此外,还采用位置增强(PE)方法来抑制错误的细胞核边界预测。大量实验证明了我们方法的优越性,SEINE在四个数据集上取得了最先进(SOTA)的性能。代码可从\href{https://github.com/zhangye-zoe/SEINE}{https://github.com/zhangye-zoe/SEINE}获取。
摘要: Nuclei instance segmentation in histopathological images is of great importance for biological analysis and cancer diagnosis but remains challenging for two reasons. (1) Similar visual presentation of intranuclear and extranuclear regions of chromophobe nuclei often causes under-segmentation, and (2) current methods lack the exploration of nuclei structure, resulting in fragmented instance predictions. To address these problems, this paper proposes a structure encoding and interaction network, termed SEINE, which develops the structure modeling scheme of nuclei and exploits the structure similarity between nuclei to improve the integrality of each segmented instance. Concretely, SEINE introduces a contour-based structure encoding (SE) that considers the correlation between nuclei structure and semantics, realizing a reasonable representation of the nuclei structure. Based on the encoding, we propose a structure-guided attention (SGA) that takes the clear nuclei as prototypes to enhance the structure learning for the fuzzy nuclei. To strengthen the structural learning ability, a semantic feature fusion (SFF) is presented to boost the semantic consistency of semantic and structure branches. Furthermore, a position enhancement (PE) method is applied to suppress incorrect nuclei boundary predictions. Extensive experiments demonstrate the superiority of our approaches, and SEINE achieves state-of-the-art (SOTA) performance on four datasets. The code is available at \href{https://github.com/zhangye-zoe/SEINE}{https://github.com/zhangye-zoe/SEINE}.
[Downlink:]http://arxiv.org/abs/2401.09773v1
[GitHub:]https://github.com/zhangye-zoe/SEINE|
中文摘要: 大语言模型(LLMs)最近已被扩展到视觉-语言领域,获得了令人印象深刻的通用多模态能力。然而,面向遥感(RS)数据的多模态大语言模型(MLLMs)探索仍处于起步阶段,性能并不令人满意。在这项工作中,我们提出了SkyEyeGPT,一个专为遥感视觉-语言理解设计的统一多模态大语言模型。为此,我们精心构建了一个遥感多模态指令微调数据集,包括单任务和多任务对话指令。经过人工校验,我们获得了包含968k样本的高质量遥感指令跟随数据集。我们的研究表明,凭借简单而有效的设计,SkyEyeGPT无需额外的编码模块即可在差异很大的任务上表现出色。具体来说,遥感视觉特征经由对齐层投影到语言域后,与任务特定指令一起输入基于LLM的遥感解码器,以预测遥感开放式任务的答案。此外,我们设计了一种两阶段微调方法,以在不同粒度上增强指令跟随和多轮对话能力。在8个遥感视觉-语言任务数据集上的实验证明了SkyEyeGPT在图像级和区域级任务(如图像描述和视觉定位)上的优势。特别地,在一些定性测试中,SkyEyeGPT相比GPT-4V展现出令人鼓舞的结果。在线演示、代码和数据集将发布于https://github.com/ZhanYang-nwpu/SkyEyeGPT。
摘要: Large language models (LLMs) have recently been extended to the vision-language realm, obtaining impressive general multi-modal capabilities. However, the exploration of multi-modal large language models (MLLMs) for remote sensing (RS) data is still in its infancy, and the performance is not satisfactory. In this work, we introduce SkyEyeGPT, a unified multi-modal large language model specifically designed for RS vision-language understanding. To this end, we meticulously curate an RS multi-modal instruction tuning dataset, including single-task and multi-task conversation instructions. After manual verification, we obtain a high-quality RS instruction-following dataset with 968k samples. Our research demonstrates that with a simple yet effective design, SkyEyeGPT works surprisingly well on considerably different tasks without the need for extra encoding modules. Specifically, after projecting RS visual features to the language domain via an alignment layer, they are fed jointly with task-specific instructions into an LLM-based RS decoder to predict answers for RS open-ended tasks. In addition, we design a two-stage tuning method to enhance instruction-following and multi-turn dialogue ability at different granularities. Experiments on 8 datasets for RS vision-language tasks demonstrate SkyEyeGPT’s superiority in image-level and region-level tasks, such as captioning and visual grounding. In particular, SkyEyeGPT exhibits encouraging results compared to GPT-4V in some qualitative tests. The online demo, code, and dataset will be released in https://github.com/ZhanYang-nwpu/SkyEyeGPT.
[Downlink:]http://arxiv.org/abs/2401.09712v1
[GitHub:]https://github.com/ZhanYang-nwpu/SkyEyeGPT.|
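The alignment step described above, projecting remote-sensing visual features into the LLM's embedding space before concatenating them with instruction embeddings, can be sketched as follows; the single linear layer and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VisualAlignment(nn.Module):
    """Project visual features into the LLM embedding space and prepend them to the
    instruction embeddings fed to the decoder."""

    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, vis_feats: torch.Tensor, instr_embeds: torch.Tensor) -> torch.Tensor:
        # vis_feats: (B, N_patches, vis_dim); instr_embeds: (B, N_text, llm_dim)
        vis_tokens = self.proj(vis_feats)
        return torch.cat([vis_tokens, instr_embeds], dim=1)  # fed to the LLM decoder

align = VisualAlignment()
fused = align(torch.randn(2, 256, 1024), torch.randn(2, 32, 4096))
```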
中文摘要: 近年来,大量研究聚焦于解决基于RGB图像的单模态行人重识别(ReID)系统的安全问题。然而,在实际应用中更常见的跨模态场景(涉及红外相机拍摄的图像)的安全性尚未得到足够重视。跨模态ReID的主要挑战在于有效处理不同模态之间的视觉差异:例如,红外图像通常是灰度的,不像可见光图像那样包含颜色信息。现有攻击方法主要关注可见光图像模态的特性,忽视了其他模态的特征以及不同模态间数据分布的差异,这可能削弱这些方法在跨模态图像检索中的攻击效果。本研究首次探索了跨模态ReID模型的安全性,并提出了一种专为跨模态ReID设计的通用扰动攻击。该攻击利用来自不同模态数据的梯度来优化扰动,从而干扰判别器并强化模态之间的差异。我们在RegDB和SYSU这两个广泛使用的跨模态数据集上进行了实验,不仅证明了方法的有效性,也为未来增强跨模态ReID系统的鲁棒性提供了启示。
摘要: In recent years, there has been significant research focusing on addressing security concerns in single-modal person re-identification (ReID) systems that are based on RGB images. However, the safety of cross-modality scenarios, which are more commonly encountered in practical applications involving images captured by infrared cameras, has not received adequate attention. The main challenge in cross-modality ReID lies in effectively dealing with visual differences between different modalities. For instance, infrared images are typically grayscale, unlike visible images that contain color information. Existing attack methods have primarily focused on the characteristics of the visible image modality, overlooking the features of other modalities and the variations in data distribution among different modalities. This oversight can potentially undermine the effectiveness of these methods in image retrieval across diverse modalities. This study represents the first exploration into the security of cross-modality ReID models and proposes a universal perturbation attack specifically designed for cross-modality ReID. This attack optimizes perturbations by leveraging gradients from diverse modality data, thereby disrupting the discriminator and reinforcing the differences between modalities. We conducted experiments on two widely used cross-modality datasets, namely RegDB and SYSU, which not only demonstrated the effectiveness of our method but also provided insights for future enhancements in the robustness of cross-modality ReID systems.
[Downlink:]http://arxiv.org/abs/2401.10090v1
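A hedged sketch of the core loop suggested by the abstract: one shared, image-agnostic perturbation is updated with gradients drawn from both visible and infrared batches under an L-infinity budget. The loss, loaders, image size, and step sizes are placeholders; the paper's discriminator-disruption objective is not reproduced.

```python
import torch

def universal_perturbation(model, vis_loader, ir_loader, loss_fn,
                           eps: float = 8 / 255, alpha: float = 1 / 255, epochs: int = 5):
    """Learn one universal additive perturbation using gradients from both modalities.
    The loaders are assumed to yield image tensors of shape (B, 3, 288, 144)."""
    delta = torch.zeros(1, 3, 288, 144)  # shared perturbation, broadcast over the batch
    for _ in range(epochs):
        for vis_batch, ir_batch in zip(vis_loader, ir_loader):
            delta.requires_grad_(True)
            loss = loss_fn(model(vis_batch + delta)) + loss_fn(model(ir_batch + delta))
            loss.backward()
            with torch.no_grad():
                delta = (delta + alpha * delta.grad.sign()).clamp(-eps, eps)  # PGD-style step
    return delta.detach()
```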
中文摘要: 自动化、物联网、大数据和云计算技术的发展趋势引发了第四次工业革命(工业4.0),使得模式与洞察的可视化和识别成为可能,从而更好地理解数据并改进制造过程。然而,数据探索对制造领域专家来说往往很困难,因为他们可能还想分析未出现在预先设计可视化中的数据,因而不得不求助于信息技术专家。在本文中,我们介绍了一个为真实工业4.0场景开发的基于语义的可视化查询系统,它允许领域专家以友好的方式探索和可视化数据。该系统的主要新颖之处在于:它将先经过语义标注的采集数据,与同样关联了语义描述的机器2D定制数字表示结合起来使用。这些描述使用本体中的术语来表达,本体中建模了(除其他内容外)用于采集工业4.0场景中机器性能指标的传感器。此外,这种语义描述还使得以下成为可能:在更高的抽象层次上构造查询;根据数据的格式和性质提供定制化的图形可视化结果;以及下载经过丰富的数据以支持更多类型的分析。
摘要: The growing trends in automation, Internet of Things, big data and cloud computing technologies have led to the fourth industrial revolution (Industry 4.0), where it is possible to visualize and identify patterns and insights, which results in a better understanding of the data and can improve the manufacturing process. However, many times, the task of data exploration results difficult for manufacturing experts because they might be interested in analyzing also data that does not appear in pre-designed visualizations and therefore they must be assisted by Information Technology experts. In this paper, we present a proposal materialized in a semantic-based visual query system developed for a real Industry 4.0 scenario that allows domain experts to explore and visualize data in a friendly way. The main novelty of the system is the combined use that it makes of captured data that are semantically annotated first, and a 2D customized digital representation of a machine that is also linked with semantic descriptions. Those descriptions are expressed using terms of an ontology, where, among others, the sensors that are used to capture indicators about the performance of a machine that belongs to a Industry 4.0 scenario have been modeled. Moreover, this semantic description allows to: formulate queries at a higher level of abstraction, provide customized graphical visualizations of the results based on the format and nature of the data, and download enriched data enabling further types of analysis.
[Downlink:]http://arxiv.org/abs/2401.09789v1
中文摘要: 无参考图像质量评估(NR-IQA)旨在不依赖原始参考图像的情况下预测与人类感知一致的图像质量分数,是各类视觉任务中的关键组件。保证NR-IQA方法的鲁棒性,对于可靠比较不同图像处理技术以及在推荐场景中提供一致的用户体验至关重要。针对NR-IQA的攻击方法为检验其鲁棒性提供了有力工具。然而,现有的NR-IQA攻击方法严重依赖NR-IQA模型的梯度,在梯度信息不可用时便受到限制。在本文中,我们提出了首个针对NR-IQA方法的基于查询的黑盒攻击。我们提出了分数边界的概念,并采用带有多个分数边界的自适应迭代方法;同时,初始攻击方向的设计利用了人类视觉系统(HVS)的特性。实验表明,我们的方法优于所有对比的最先进攻击方法,并远超以往的黑盒方法。在我们的攻击下,性能良好的NR-IQA模型DBCNN的Spearman等级相关系数(SROCC)下降了0.6381,揭示了NR-IQA模型对黑盒攻击的脆弱性。所提出的攻击方法也为进一步探索NR-IQA的鲁棒性提供了有力工具。
摘要: No-Reference Image Quality Assessment (NR-IQA) aims to predict image quality scores consistent with human perception without relying on pristine reference images, serving as a crucial component in various visual tasks. Ensuring the robustness of NR-IQA methods is vital for reliable comparisons of different image processing techniques and consistent user experiences in recommendations. The attack methods for NR-IQA provide a powerful instrument to test the robustness of NR-IQA. However, current attack methods of NR-IQA heavily rely on the gradient of the NR-IQA model, leading to limitations when the gradient information is unavailable. In this paper, we present a pioneering query-based black box attack against NR-IQA methods. We propose the concept of score boundary and leverage an adaptive iterative approach with multiple score boundaries. Meanwhile, the initial attack directions are also designed to leverage the characteristics of the Human Visual System (HVS). Experiments show our method outperforms all compared state-of-the-art attack methods and is far ahead of previous black-box methods. The effective NR-IQA model DBCNN suffers a Spearman’s rank-order correlation coefficient (SROCC) decline of 0.6381 attacked by our method, revealing the vulnerability of NR-IQA models to black-box attacks. The proposed attack method also provides a potent tool for further exploration into NR-IQA robustness.
[Downlink:]http://arxiv.org/abs/2401.05217v2
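A minimal sketch of a query-only black-box attack on an NR-IQA scorer, in the spirit of the score-boundary idea above: random search for a bounded perturbation that drives the predicted score below a target boundary. The adaptive multi-boundary schedule and HVS-guided initialization from the paper are not reproduced; `score_model` and all budgets are placeholders.

```python
import torch

@torch.no_grad()
def query_attack(score_model, image, eps=4 / 255, queries=2000, boundary_drop=0.5):
    """Random-search, query-based attack: find a bounded perturbation that pushes the
    predicted quality score below a target 'score boundary'."""
    clean_score = score_model(image)
    target = clean_score - boundary_drop                      # current score boundary
    best_delta = torch.zeros_like(image)
    best_score = clean_score
    for _ in range(queries):
        candidate = (best_delta + 0.01 * torch.randn_like(image)).clamp(-eps, eps)
        score = score_model((image + candidate).clamp(0, 1))  # one query per candidate
        if score < best_score:
            best_delta, best_score = candidate, score
        if best_score <= target:
            break
    return image + best_delta, best_score
```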