关注{晓理紫|小李子},每日更新论文,如感兴趣,请转发给有需要的同学,谢谢支持
分类:
- 大语言模型LLM
- 视觉模型VLM
- 扩散模型
- 视觉导航
- 具身智能,机器人
- 强化学习
- 开放词汇,检测分割
中文摘要: 大型视觉语言模型的最新进展彻底改变了图像分类范式。尽管这类模型显示了令人印象深刻的零样本能力,但在测试时仍假设存在一组预定义的类别(即词汇表)用于编写文本提示。然而,当语义上下文未知且不断演变时,这种假设可能是不切实际的。因此,我们形式化了一项新任务,称为无词汇图像分类(VIC),其目的是在没有已知词汇表的前提下,为输入图像分配一个存在于无约束的、由语言诱导的语义空间中的类别。VIC是一项具有挑战性的任务,因为该语义空间非常庞大,包含数百万个概念,且存在难以区分的细粒度类别。在这项工作中,我们首先通过实证验证:借助外部视觉语言数据库来表示这一语义空间,是获取用于图像分类的语义相关内容的最有效方式。随后,我们提出了从外部数据库进行类别搜索(CaSED)的方法,它利用预训练的视觉语言模型和外部视觉语言数据库,以免训练的方式解决VIC问题。CaSED首先根据图像描述(caption)与图像的语义相似性从数据库中提取一组候选类别,然后用同一视觉语言模型为图像分配最匹配的候选类别。在基准数据集上的实验验证了CaSED在参数量更少的情况下优于其他更复杂的视觉语言框架,为未来这一方向的研究铺平了道路。
摘要: Recent advances in large vision-language models have revolutionized the image classification paradigm. Despite showing impressive zero-shot capabilities, a pre-defined set of categories, a.k.a. the vocabulary, is assumed at test time for composing the textual prompts. However, such assumption can be impractical when the semantic context is unknown and evolving. We thus formalize a novel task, termed as Vocabulary-free Image Classification (VIC), where we aim to assign to an input image a class that resides in an unconstrained language-induced semantic space, without the prerequisite of a known vocabulary. VIC is a challenging task as the semantic space is extremely large, containing millions of concepts, with hard-to-discriminate fine-grained categories. In this work, we first empirically verify that representing this semantic space by means of an external vision-language database is the most effective way to obtain semantically relevant content for classifying the image. We then propose Category Search from External Databases (CaSED), a method that exploits a pre-trained vision-language model and an external vision-language database to address VIC in a training-free manner. CaSED first extracts a set of candidate categories from captions retrieved from the database based on their semantic similarity to the image, and then assigns to the image the best matching candidate category according to the same vision-language model. Experiments on benchmark datasets validate that CaSED outperforms other complex vision-language frameworks, while being efficient with much fewer parameters, paving the way for future research in this direction.
[Downlink:]http://arxiv.org/abs/2306.00917v3
[GitHub:]https://github.com/altndrr/vic|
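下面给出一个示意性代码草图(并非论文官方实现),用 Hugging Face 的 CLIP 模型演示 CaSED 第二步的思路:把检索得到的候选类别名与图像做相似度匹配,取得分最高者作为预测类别。其中 `candidate_names` 与图像路径 `query.jpg` 均为假设,实际应由外部图文数据库检索得到。

```python
# 示意:用 CLIP 在"检索得到的候选类别"上做零样本分类(CaSED 第二步的简化版)
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("query.jpg")                                   # 假设的查询图像
candidate_names = ["golden retriever", "labrador", "cocker spaniel"]  # 假设来自数据库检索
prompts = [f"a photo of a {name}" for name in candidate_names]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)   # 图像与各候选文本的匹配概率
print(candidate_names[int(probs.argmax())])        # 输出最匹配的候选类别
```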
中文摘要: 推进自动化编程需要强大而全面的代码生成基准,但当前的评估框架(例如HumanEval和MBPP)在很大程度上偏向函数式编程(FP)而忽视了面向对象编程(OOP)。为了解决这一问题,我们的研究引入了一个开创性的面向对象基准测试,包含431个Python程序,涵盖类和封装方法等基本的面向对象概念与特性。我们提出了一种为OOP量身定制的新评估指标pass@o,作为对传统pass@k指标的增强。我们对23个领先的大型语言模型(LLM)(包括通用模型和代码专用模型)的评估揭示了三个关键发现:1)pass@o为OOP代码生成提供了更相关、更全面的评估;2)尽管在FP方面表现出色,但与ChatGPT等模型相比,像WizardCoder这样的代码专用LLM在OOP方面落后;3)所有先进LLM在我们的OOP基准上表现都很差,凸显了该领域亟需改进。我们的基准测试和脚本公开发布于:https://github.com/alphadl/OOP-eval。
摘要: Advancing automated programming necessitates robust and comprehensive code generation benchmarks, yet current evaluation frameworks largely neglect object-oriented programming (OOP) in favor of functional programming (FP), e.g., HumanEval and MBPP. To address this, our study introduces a pioneering OOP-focused benchmark, featuring 431 Python programs that encompass essential OOP concepts and features like classes and encapsulation methods. We propose a novel evaluation metric, pass@o, tailored for OOP, enhancing traditional pass@k measures. Our evaluation of 23 leading large language models (LLMs), including both general and code-specialized models, reveals three key insights: 1) pass@o offers a more relevant and comprehensive assessment for OOP code generation; 2) Despite excelling in FP, code-specialized LLMs like WizardCoder lag in OOP compared to models like ChatGPT; 3) The poor performance of all advanced LLMs on our OOP benchmark highlights a critical need for improvements in this field. Our benchmark and scripts are publicly released at: https://github.com/alphadl/OOP-eval.
[Downlink:]http://arxiv.org/abs/2401.06628v1
[GitHub:]https://github.com/alphadl/OOP-eval|
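pass@o 的具体定义见论文;作为参考,下面给出社区常用的无偏 pass@k 估计量(HumanEval 论文提出的标准形式)的实现草图,便于理解此类指标的计算方式,并非 pass@o 本身。其中 n 为每题生成的候选数,c 为通过全部测试的候选数。

```python
# 无偏 pass@k 估计:1 - C(n-c, k) / C(n, k)(标准 pass@k,仅作参考)
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n: 每道题生成的候选程序数; c: 通过全部单元测试的程序数; k: 采样预算"""
    if n - c < k:          # 失败样本不足 k 个时,任取 k 个必含正确解
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=3, k=5))   # 示例:20 个样本中 3 个正确时的 pass@5
```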
中文摘要: 大型语言模型(LLM)在各种自然语言处理任务中表现出了令人印象深刻的能力。尽管如此,由于许多信息检索(IR)特有的概念在自然语言中很少出现,它们在信息检索任务中的应用仍然具有挑战性。虽然基于提示的方法可以为LLM提供任务描述,但它们往往无法促进对IR任务的全面理解和执行,从而限制了LLM的适用性。为了弥补这一差距,我们在这项工作中探索了指令微调在提升LLM的IR任务能力方面的潜力。我们介绍了一个新的指令微调数据集INTERS,它涵盖三个基本IR类别(查询理解、文档理解和查询-文档关系理解)下的21个任务。这些数据来源于43个不同的数据集和手工编写的模板。我们的实证结果表明,INTERS显著提高了LLaMA、Mistral和Phi等各种公开可用LLM在搜索相关任务中的性能。此外,我们进行了全面的分析,以确定基础模型选择、指令设计、指令数量和任务多样性对性能的影响。我们已在 https://github.com/DaoD/INTERS 公开了数据集以及在其上微调的模型。
摘要: Large language models (LLMs) have demonstrated impressive capabilities in various natural language processing tasks. Despite this, their application to information retrieval (IR) tasks is still challenging due to the infrequent occurrence of many IR-specific concepts in natural language. While prompt-based methods can provide task descriptions to LLMs, they often fall short in facilitating comprehensive understanding and execution of IR tasks, thereby limiting LLMs’ applicability. To address this gap, in this work, we explore the potential of instruction tuning to enhance LLMs’ proficiency in IR tasks. We introduce a novel instruction tuning dataset, INTERS, encompassing 21 tasks across three fundamental IR categories: query understanding, document understanding, and query-document relationship understanding. The data are derived from 43 distinct datasets with manually written templates. Our empirical results reveal that INTERS significantly boosts the performance of various publicly available LLMs, such as LLaMA, Mistral, and Phi, in search-related tasks. Furthermore, we conduct a comprehensive analysis to ascertain the effects of base model selection, instruction design, volume of instructions, and task variety on performance. We make our dataset and the models fine-tuned on it publicly accessible at https://github.com/DaoD/INTERS.
[Downlink:]http://arxiv.org/abs/2401.06532v1
[GitHub:]https://github.com/DaoD/INTERS|
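INTERS 的样本由"数据集实例 + 人工模板"组合而成;下面是一个示意性草图(非官方数据格式),展示如何用模板把查询-文档相关性样本改写为指令数据,其中字段名和模板措辞均为假设。

```python
# 示意:把检索数据集样本套入手写指令模板,生成指令微调样本(字段名与模板为假设)
TEMPLATE = (
    "Given a query and a document, judge whether the document is relevant to the query. "
    "Answer with Yes or No.\nQuery: {query}\nDocument: {document}\nAnswer:"
)

def build_instruction_example(sample: dict) -> dict:
    return {
        "prompt": TEMPLATE.format(query=sample["query"], document=sample["document"]),
        "completion": "Yes" if sample["label"] == 1 else "No",
    }

example = {"query": "symptoms of flu", "document": "Influenza commonly causes fever...", "label": 1}
print(build_instruction_example(example)["prompt"])
```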
中文摘要: 在本文中,我们介绍了Kun,这是一种在不依赖人工标注的情况下为大型语言模型(LLM)构建高质量指令微调数据集的新方法。Kun采用基于指令反向翻译和答案润色(answer polishment)的自训练算法,利用来自Wudao、Wanjuan和SkyPile等不同来源的无标注数据,生成了超过一百万条中文指令数据。这种方法与传统做法显著不同:它通过自筛选(self-curation)过程来精炼并挑选最有效的指令-输出对。我们在各种基准上对6B参数的Yi模型进行的实验证明了Kun的稳健性和可扩展性。我们方法的核心贡献在于其算法上的改进(提升了数据的保留率和清晰度),以及创新的数据生成方式(大幅减少了对昂贵且耗时的人工标注的依赖)。这一方法为提升LLM的指令跟随能力提供了可扩展且高效的解决方案,对其在不同领域的应用具有重要意义。代码和数据集位于 https://github.com/Zheng0428/COIG-Kun。
摘要: In this paper, we introduce Kun, a novel approach for creating high-quality instruction-tuning datasets for large language models (LLMs) without relying on manual annotations. Adapting a self-training algorithm based on instruction back-translation and answer polishment, Kun leverages unlabelled data from diverse sources such as Wudao, Wanjuan, and SkyPile to generate a substantial dataset of over a million Chinese instructional data points. This approach significantly deviates from traditional methods by using a self-curation process to refine and select the most effective instruction-output pairs. Our experiments with the 6B-parameter Yi model across various benchmarks demonstrate Kun’s robustness and scalability. Our method’s core contributions lie in its algorithmic advancement, which enhances data retention and clarity, and its innovative data generation approach that substantially reduces the reliance on costly and time-consuming manual annotations. This methodology presents a scalable and efficient solution for improving the instruction-following capabilities of LLMs, with significant implications for their application across diverse fields. The code and dataset can be found at https://github.com/Zheng0428/COIG-Kun
[Downlink:]http://arxiv.org/abs/2401.06477v1
[GitHub:]https://github.com/Zheng0428/COIG-Kun|
中文摘要: 本文提出了一种面向多种语言、尤其是标注数据有限的低资源语言的强大视觉语音识别(VSR)方法。与以往试图利用从其他语言学到的知识来提升目标语言VSR性能的方法不同,我们探索了能否在无人工干预的情况下直接增加各语言的训练数据量。为此,我们采用了Whisper模型,它既能进行语言识别,也能进行基于音频的语音识别,用于从无标注的多语言视听数据池中筛选目标语言的数据并转录标签。通过比较在自动标签和人工标注标签上训练的VSR模型的性能,我们表明,即使不使用人工标注,也能取得与人工标注标签相近的VSR性能。借助这一自动标注流程,我们对大规模无标注多语言数据库VoxCeleb2和AVSpeech进行了标注,为法语、意大利语、西班牙语和葡萄牙语这四种VSR低资源语言生成了1002小时的数据。利用自动标签,我们在这四种语言的mTEDx上取得了新的最先进性能,大幅超越了以往方法。自动标签可在线获取:https://github.com/JeongHun0716/Visual-Speech-Recognition-for-Low-Resource-Languages
摘要: This paper proposes a powerful Visual Speech Recognition (VSR) method for multiple languages, especially for low-resource languages that have a limited number of labeled data. Different from previous methods that tried to improve the VSR performance for the target language by using knowledge learned from other languages, we explore whether we can increase the amount of training data itself for the different languages without human intervention. To this end, we employ a Whisper model which can conduct both language identification and audio-based speech recognition. It serves to filter data of the desired languages and transcribe labels from the unannotated, multilingual audio-visual data pool. By comparing the performances of VSR models trained on automatic labels and the human-annotated labels, we show that we can achieve similar VSR performance to that of human-annotated labels even without utilizing human annotations. Through the automated labeling process, we label large-scale unlabeled multilingual databases, VoxCeleb2 and AVSpeech, producing 1,002 hours of data for four low VSR resource languages, French, Italian, Spanish, and Portuguese. With the automatic labels, we achieve new state-of-the-art performance on mTEDx in four languages, significantly surpassing the previous methods. The automatic labels are available online: https://github.com/JeongHun0716/Visual-Speech-Recognition-for-Low-Resource-Languages
[Downlink:]http://arxiv.org/abs/2309.08535v2
[GitHub:]https://github.com/JeongHun0716/Visual-Speech-Recognition-for-Low-Resource-Languages|
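下面是一个示意性草图,演示如何用开源 whisper 包同时完成语言识别与转写,从而筛选目标语言的音轨并生成自动标签(与论文流程同思路,但并非官方脚本;音频路径为假设)。

```python
# 示意:用 Whisper 做语言过滤 + 自动转写(pip install openai-whisper)
import whisper

model = whisper.load_model("large-v2")          # 语言识别与语音识别共用同一模型
target_langs = {"fr", "it", "es", "pt"}         # 目标低资源语言

def auto_label(audio_path: str):
    result = model.transcribe(audio_path)       # 返回 dict,含 "language" 与 "text"
    if result["language"] not in target_langs:  # 过滤掉非目标语言的片段
        return None
    return {"path": audio_path, "lang": result["language"], "label": result["text"]}

print(auto_label("clip_0001.wav"))              # 假设的音频文件路径
```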
中文摘要: 可供性定位(affordance grounding)是指找出物体上可供交互区域的任务。这是一项基础但具有挑战性的任务,因为成功的解决方案需要从多个方面全面理解场景,包括对物体及其部件的检测、定位和识别,场景的空间配置/布局,3D形状和物理属性,以及物体与人的功能和潜在交互。这些知识大多隐藏在图像内容之外,超出了有限训练集的监督标签所能覆盖的范围。在本文中,我们尝试利用预训练大规模视觉语言模型中丰富的世界知识、抽象知识和人-物交互知识,来提升现有可供性定位方法的泛化能力。在AGD20K基准上,与其他竞争方法相比,我们提出的模型在真实场景(in-the-wild)物体可供性定位任务中表现出显著的性能提升。我们进一步证明,即使物体和动作在训练中均未出现过,该模型也能为来自随机互联网图像的物体定位可供性。项目主页:https://jasonqsy.github.io/AffordanceLLM/
摘要: Affordance grounding refers to the task of finding the area of an object with which one can interact. It is a fundamental but challenging task, as a successful solution requires the comprehensive understanding of a scene in multiple aspects including detection, localization, and recognition of objects with their parts, of geo-spatial configuration/layout of the scene, of 3D shapes and physics, as well as of the functionality and potential interaction of the objects and humans. Much of the knowledge is hidden and beyond the image content with the supervised labels from a limited training set. In this paper, we make an attempt to improve the generalization capability of the current affordance grounding by taking the advantage of the rich world, abstract, and human-object-interaction knowledge from pretrained large-scale vision language models. Under the AGD20K benchmark, our proposed model demonstrates a significant performance gain over the competing methods for in-the-wild object affordance grounding. We further demonstrate it can ground affordance for objects from random Internet images, even if both objects and actions are unseen during training. Project site: https://jasonqsy.github.io/AffordanceLLM/
[Downlink:]http://arxiv.org/abs/2401.06341v1
[Project:]https://jasonqsy.github.io/AffordanceLLM/|
中文摘要: 评估视觉语言模型(VLM)生成的长篇回复具有挑战性。这不仅需要检查VLM是否遵循给定指令,还需要验证文本输出是否正确地基于给定图像。受近期"用LM评估LM"思路的启发,我们在这项工作中提出用VLM来评估VLM。为此,我们构建了一个名为Perception Collection的新反馈数据集,其中包含15K条用户在评估中可能关心的自定义评分准则。基于Perception Collection,我们训练了Prometheus-Vision,这是第一个能在评估过程中理解用户自定义评分标准的开源VLM评估器模型。Prometheus-Vision在开源模型中与人类评估者和GPT-4V的Pearson相关性最高,表明它能够对VLM进行透明且易于获取的评估。我们将代码、数据集和模型开源于 https://github.com/kaistAI/prometheus-vision。
摘要: Assessing long-form responses generated by Vision-Language Models (VLMs) is challenging. It not only requires checking whether the VLM follows the given instruction but also verifying whether the text output is properly grounded on the given image. Inspired by the recent approach of evaluating LMs with LMs, in this work, we propose to evaluate VLMs with VLMs. For this purpose, we present a new feedback dataset called the Perception Collection, encompassing 15K customized score rubrics that users might care about during assessment. Using the Perception Collection, we train Prometheus-Vision, the first open-source VLM evaluator model that can understand the user-defined score criteria during evaluation. Prometheus-Vision shows the highest Pearson correlation with human evaluators and GPT-4V among open-source models, showing its effectiveness for transparent and accessible evaluation of VLMs. We open-source our code, dataset, and model at https://github.com/kaistAI/prometheus-vision
[Downlink:]http://arxiv.org/abs/2401.06591v1
[GitHub:]https://github.com/kaistAI/prometheus-vision|
中文摘要: 目的:机器人手术中的深度估计对三维重建、手术导航和增强现实可视化至关重要。尽管基础模型在包括深度估计在内的许多视觉任务中表现出色(例如DINOv2),但近期工作观察到其在医学和外科等特定领域应用中的局限性。本工作提出了一种用于手术深度估计的基础模型低秩自适应(LoRA)方法。方法:我们设计了一种基于基础模型的深度估计方法,称为Surgical-DINO,即针对内窥镜手术深度估计对DINOv2进行低秩自适应。我们构建LoRA层并将其集成到DINO中,以适配手术领域的专门知识,而非采用传统的微调。在训练过程中,我们冻结具有出色视觉表示能力的DINO图像编码器,仅优化LoRA层和深度解码器,以整合来自手术场景的特征。结果:我们的模型在SCARED这一MICCAI挑战赛数据集上得到了广泛验证,该数据集采集自达芬奇Xi内窥镜手术。实验表明,Surgical-DINO在内窥镜深度估计任务中显著优于所有最先进的模型。消融实验的分析证明了LoRA层和自适应策略的显著作用。结论:Surgical-DINO为将基础模型成功适配到外科领域的深度估计任务提供了启示。结果清楚地表明,直接对在计算机视觉数据集上预训练的权重做零样本预测,或进行简单微调,都不足以将基础模型直接用于外科领域。代码位于 https://github.com/BeileiCui/SurgicalDINO。
摘要: Purpose: Depth estimation in robotic surgery is vital in 3D reconstruction, surgical navigation and augmented reality visualization. Although the foundation model exhibits outstanding performance in many vision tasks, including depth estimation (e.g., DINOv2), recent works observed its limitations in medical and surgical domain-specific applications. This work presents a low-ranked adaptation (LoRA) of the foundation model for surgical depth estimation. Methods: We design a foundation model-based depth estimation method, referred to as Surgical-DINO, a low-rank adaptation of the DINOv2 for depth estimation in endoscopic surgery. We build LoRA layers and integrate them into DINO to adapt with surgery-specific domain knowledge instead of conventional fine-tuning. During training, we freeze the DINO image encoder, which shows excellent visual representation capacity, and only optimize the LoRA layers and depth decoder to integrate features from the surgical scene. Results: Our model is extensively validated on a MICCAI challenge dataset of SCARED, which is collected from da Vinci Xi endoscope surgery. We empirically show that Surgical-DINO significantly outperforms all the state-of-the-art models in endoscopic depth estimation tasks. The analysis with ablation studies has shown evidence of the remarkable effect of our LoRA layers and adaptation. Conclusion: Surgical-DINO shed some light on the successful adaptation of the foundation models into the surgical domain for depth estimation. There is clear evidence in the results that zero-shot prediction on pre-trained weights in computer vision datasets or naive fine-tuning is not sufficient to use the foundation model in the surgical domain directly. Code is available at https://github.com/BeileiCui/SurgicalDINO.
[Downlink:]http://arxiv.org/abs/2401.06013v2
[GitHub:]https://github.com/BeileiCui/SurgicalDINO|
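下面给出一个最小化的 LoRA 线性层草图(纯 PyTorch,非 Surgical-DINO 官方实现),用于说明"冻结原权重、只训练低秩增量"的思路;r、alpha 等超参数为示例取值。

```python
# 示意:LoRA —— 冻结原线性层 W,只学习低秩增量 B @ A(缩放因子 alpha / r)
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)            # 冻结预训练权重
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))  # 初始为 0,训练开始时不改变输出
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T) @ self.lora_b.T

layer = LoRALinear(nn.Linear(768, 768))
print(layer(torch.randn(2, 768)).shape)   # torch.Size([2, 768])
```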
中文摘要: 可见光-红外行人重识别(VIReID)主要处理来自不同模态的行人图像之间的身份匹配。由于可见光和红外图像之间的模态差距,跨模态身份匹配面临重大挑战。考虑到行人外观的高级语义(如性别、体型和服装风格)在不同模态间保持一致,本文旨在通过将高级语义注入视觉特征来弥合模态差距。鉴于CLIP能够感知与视觉表示相对应的高级语义信息,我们探索了CLIP在VIReID领域中的应用。据此,我们提出了一个CLIP驱动的语义发现网络(CSDN),它由模态特定的提示学习器(Modality-specific Prompt Learner)、语义信息集成(SII)和高级语义嵌入(HSE)组成。具体而言,考虑到模态差异在语言描述上带来的多样性,我们设计了双模态可学习文本标记,分别捕获可见光和红外图像的模态私有语义信息。此外,鉴于不同模态的语义细节具有互补性,我们整合双模态语言描述的文本特征以获得全面的语义。最后,我们在整合后的文本特征与跨模态的视觉特征之间建立联系。这一过程将丰富的高级语义信息嵌入到视觉表示中,从而提升视觉表示的模态不变性。在多个广泛使用的基准上的实验评估证实了所提出的CSDN相对于现有方法的有效性和优越性。代码将发布于 https://github.com/nengdong96/CSDN。
摘要: Visible-infrared person re-identification (VIReID) primarily deals with matching identities across person images from different modalities. Due to the modality gap between visible and infrared images, cross-modality identity matching poses significant challenges. Recognizing that high-level semantics of pedestrian appearance, such as gender, shape, and clothing style, remain consistent across modalities, this paper intends to bridge the modality gap by infusing visual features with high-level semantics. Given the capability of CLIP to sense high-level semantic information corresponding to visual representations, we explore the application of CLIP within the domain of VIReID. Consequently, we propose a CLIP-Driven Semantic Discovery Network (CSDN) that consists of Modality-specific Prompt Learner, Semantic Information Integration (SII), and High-level Semantic Embedding (HSE). Specifically, considering the diversity stemming from modality discrepancies in language descriptions, we devise bimodal learnable text tokens to capture modality-private semantic information for visible and infrared images, respectively. Additionally, acknowledging the complementary nature of semantic details across different modalities, we integrate text features from the bimodal language descriptions to achieve comprehensive semantics. Finally, we establish a connection between the integrated text features and the visual features across modalities. This process embeds rich high-level semantic information into visual representations, thereby promoting the modality invariance of visual representations. The effectiveness and superiority of our proposed CSDN over existing methods have been substantiated through experimental evaluations on multiple widely used benchmarks. The code will be released at https://github.com/nengdong96/CSDN.
[Downlink:]http://arxiv.org/abs/2401.05806v2
[GitHub:]https://github.com/nengdong96/CSDN|
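下面是一个示意性草图(非 CSDN 官方代码),展示"模态特定可学习文本标记"的基本做法:为每个模态维护一组可训练的上下文向量,拼接到类别词嵌入之前,再送入冻结的文本编码器;此处的维度设置与后续文本编码器均为假设。

```python
# 示意:模态特定的可学习提示标记(CoOp 风格),其输出应交给冻结的文本编码器(此处未包含)
import torch
import torch.nn as nn

class ModalityPromptLearner(nn.Module):
    def __init__(self, n_ctx: int = 8, dim: int = 512):
        super().__init__()
        # 可见光与红外模态各自维护一组可学习上下文向量
        self.ctx_visible = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        self.ctx_infrared = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)

    def forward(self, class_emb: torch.Tensor, modality: str) -> torch.Tensor:
        # class_emb: (num_classes, n_tokens, dim) 的类别词嵌入
        ctx = self.ctx_visible if modality == "visible" else self.ctx_infrared
        ctx = ctx.unsqueeze(0).expand(class_emb.size(0), -1, -1)
        return torch.cat([ctx, class_emb], dim=1)          # 拼接后的提示序列

learner = ModalityPromptLearner()
tokens = learner(torch.randn(4, 6, 512), modality="infrared")
print(tokens.shape)   # torch.Size([4, 14, 512])
```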
中文摘要: 自动医疗报告生成(MRG)具有巨大的研究价值,因为它有可能减轻放射科医生撰写报告的沉重负担。尽管最近取得了进展,但由于需要精确的临床理解和疾病识别,准确的MRG仍然具有挑战性。此外,疾病分布的不平衡使这一挑战更加明显,因为罕见病在训练数据中的代表性不足,使其诊断性能不可靠。为了应对这些挑战,我们提出了诊断驱动的医疗报告生成提示(PromptMRG),这是一个新的框架,旨在通过诊断感知提示的指导来提高MRG的诊断准确性。具体来说,PromptMRG是基于编码器-解码器架构的,具有额外的疾病分类分支。生成报告时,分类分支的诊断结果会转换为令牌提示,以明确指导生成过程。为了进一步提高诊断准确性,我们设计了跨模态特征增强,它从数据库中检索类似的报告,通过利用预先训练的CLIP中的知识来帮助诊断查询图像。此外,通过基于每种疾病的个体学习状态对分类分支应用自适应logit调整损失来解决疾病不平衡问题,克服了文本解码器无法操纵疾病分布的障碍。在两个MRG基准上的实验表明了所提出方法的有效性,它在两个数据集上都获得了最先进的临床疗效表现。代码位于https://github.com/jhb86253817/PromptMRG.
摘要: Automatic medical report generation (MRG) is of great research value as it has the potential to relieve radiologists from the heavy burden of report writing. Despite recent advancements, accurate MRG remains challenging due to the need for precise clinical understanding and disease identification. Moreover, the imbalanced distribution of diseases makes the challenge even more pronounced, as rare diseases are underrepresented in training data, making their diagnostic performance unreliable. To address these challenges, we propose diagnosis-driven prompts for medical report generation (PromptMRG), a novel framework that aims to improve the diagnostic accuracy of MRG with the guidance of diagnosis-aware prompts. Specifically, PromptMRG is based on encoder-decoder architecture with an extra disease classification branch. When generating reports, the diagnostic results from the classification branch are converted into token prompts to explicitly guide the generation process. To further improve the diagnostic accuracy, we design cross-modal feature enhancement, which retrieves similar reports from the database to assist the diagnosis of a query image by leveraging the knowledge from a pre-trained CLIP. Moreover, the disease imbalanced issue is addressed by applying an adaptive logit-adjusted loss to the classification branch based on the individual learning status of each disease, which overcomes the barrier of text decoder’s inability to manipulate disease distributions. Experiments on two MRG benchmarks show the effectiveness of the proposed method, where it obtains state-of-the-art clinical efficacy performance on both datasets. The code is available at https://github.com/jhb86253817/PromptMRG.
[Downlink:]http://arxiv.org/abs/2308.12604v2
[GitHub:]https://github.com/jhb86253817/PromptMRG|
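论文中的自适应 logit 调整损失依赖每种疾病的学习状态;这里给出标准 logit 调整交叉熵的最小草图(非论文的自适应版本),以说明"按类别先验修正 logit 来缓解类别不平衡"的基本思想。

```python
# 示意:标准 logit-adjusted 交叉熵 —— logits 加上 tau * log(类别先验) 后再计算损失
import torch
import torch.nn.functional as F

def logit_adjusted_ce(logits, targets, class_prior, tau: float = 1.0):
    adjusted = logits + tau * torch.log(class_prior + 1e-12)  # 稀有类别的 logit 被压低,迫使模型学到更强证据
    return F.cross_entropy(adjusted, targets)

logits = torch.randn(8, 14)                        # 8 个样本、14 种疾病
targets = torch.randint(0, 14, (8,))
class_prior = torch.full((14,), 1.0 / 14)          # 实际应按训练集疾病频率统计
print(logit_adjusted_ce(logits, targets, class_prior))
```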
中文摘要: 换装行人重识别(CC-ReID)旨在匹配长时间跨度内更换过衣物的行人。CC-ReID的关键挑战在于提取与服装无关的特征,如面部、发型、体型和步态。目前的研究主要集中在使用多模态生物特征(如轮廓和素描)对体型进行建模,但并未充分利用隐藏在原始RGB图像中的个人描述信息。考虑到某些属性描述在换装后依然保持不变,我们提出了一种掩码属性描述嵌入(MADE)方法,将个人视觉外观与属性描述统一用于CC-ReID。具体来说,颜色、款式等对服装敏感的可变信息难以有效建模。为此,我们对通过属性检测模型提取的个人属性描述中的服装和颜色信息进行掩码处理,然后将掩码后的属性描述拼接并嵌入到不同层级的Transformer块中,使其与图像的低层到高层特征融合。这种做法迫使模型丢弃服装信息。我们在多个CC-ReID基准上进行了实验,包括PRCC、LTCC、Celeb-reID-light和LaST。结果表明,MADE能有效利用属性描述,提升换装行人重识别的性能,并与最先进的方法相比具有优势。代码位于 https://github.com/moon-wh/MADE。
摘要: Cloth-changing person re-identification (CC-ReID) aims to match persons who change clothes over long periods. The key challenge in CC-ReID is to extract clothing-independent features, such as face, hairstyle, body shape, and gait. Current research mainly focuses on modeling body shape using multi-modal biological features (such as silhouettes and sketches). However, it does not fully leverage the personal description information hidden in the original RGB image. Considering that there are certain attribute descriptions which remain unchanged after the changing of cloth, we propose a Masked Attribute Description Embedding (MADE) method that unifies personal visual appearance and attribute description for CC-ReID. Specifically, handling variable clothing-sensitive information, such as color and type, is challenging for effective modeling. To address this, we mask the clothing and color information in the personal attribute description extracted through an attribute detection model. The masked attribute description is then connected and embedded into Transformer blocks at various levels, fusing it with the low-level to high-level features of the image. This approach compels the model to discard clothing information. Experiments are conducted on several CC-ReID benchmarks, including PRCC, LTCC, Celeb-reID-light, and LaST. Results demonstrate that MADE effectively utilizes attribute description, enhancing cloth-changing person re-identification performance, and compares favorably with state-of-the-art methods. The code is available at https://github.com/moon-wh/MADE.
[Downlink:]http://arxiv.org/abs/2401.05646v2
[GitHub:]https://github.com/moon-wh/MADE|
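下面用一个极简的字符串处理草图(非 MADE 官方实现)说明"在属性描述中屏蔽服装与颜色信息"的做法;词表与 [MASK] 记号均为假设。

```python
# 示意:屏蔽属性描述中的服装与颜色词(词表与 [MASK] 记号为假设)
CLOTHING_WORDS = {"shirt", "t-shirt", "dress", "jacket", "coat", "jeans", "skirt", "shorts"}
COLOR_WORDS = {"red", "blue", "green", "black", "white", "yellow", "gray", "pink"}

def mask_attribute_description(description: str) -> str:
    tokens = description.lower().split()
    masked = ["[MASK]" if t.strip(".,") in CLOTHING_WORDS | COLOR_WORDS else t for t in tokens]
    return " ".join(masked)

print(mask_attribute_description("a woman with long hair wearing a red dress and white shoes"))
# -> a woman with long hair wearing a [MASK] [MASK] and [MASK] shoes
```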
中文摘要: 我们介绍了Motion2VecSets,一种用于从点云序列重建动态表面的4D扩散模型。虽然现有的最先进方法已经证明可以使用神经场表示重建非刚性对象,但传统的前馈网络在面对来自噪声、残缺或稀疏点云的模糊观测时仍然困难。为应对这些挑战,我们引入了一种扩散模型,它通过对压缩潜在表示的迭代去噪过程,显式地学习非刚性对象的形状和运动分布。在处理模糊输入时,基于扩散的先验能够给出更合理的概率式重建。我们用潜在向量集合而非单一全局潜变量来参数化4D动力学。这种新颖的4D表示使我们能够学习局部表面形状和变形模式,从而更准确地捕捉非线性运动,并显著提升对未见运动和未见个体的泛化能力。为了获得时间上更连贯的目标跟踪,我们同步地对变形潜变量集合进行去噪,并在多个帧之间交换信息。为避免计算开销,我们设计了一个交错的空间-时间注意力块,交替地在空间和时间维度上聚合变形潜变量。与最先进方法的大量对比表明,Motion2VecSets在各种不完美观测下的4D重建中具有优势,特别是在DeformingThings4D-Animals数据集上从稀疏点云重建未见个体时,其交并比(IoU)相比CaDex提升了19%。更多详细信息请访问 https://vveicao.github.io/projects/Motion2VecSets/。
摘要: We introduce Motion2VecSets, a 4D diffusion model for dynamic surface reconstruction from point cloud sequences. While existing state-of-the-art methods have demonstrated success in reconstructing non-rigid objects using neural field representations, conventional feed-forward networks encounter challenges with ambiguous observations from noisy, partial, or sparse point clouds. To address these challenges, we introduce a diffusion model that explicitly learns the shape and motion distribution of non-rigid objects through an iterative denoising process of compressed latent representations. The diffusion-based prior enables more plausible and probabilistic reconstructions when handling ambiguous inputs. We parameterize 4D dynamics with latent vector sets instead of using a global latent. This novel 4D representation allows us to learn local surface shape and deformation patterns, leading to more accurate non-linear motion capture and significantly improving generalizability to unseen motions and identities. For more temporal-coherent object tracking, we synchronously denoise deformation latent sets and exchange information across multiple frames. To avoid the computational overhead, we design an interleaved space and time attention block to alternately aggregate deformation latents along spatial and temporal domains. Extensive comparisons against the state-of-the-art methods demonstrate the superiority of our Motion2VecSets in 4D reconstruction from various imperfect observations, notably achieving a 19% improvement in Intersection over Union (IoU) compared to CaDex for reconstructing unseen individuals from sparse point clouds on the DeformingThings4D-Animals dataset. More detailed information can be found at https://vveicao.github.io/projects/Motion2VecSets/.
[Downlink:]http://arxiv.org/abs/2401.06614v1
[Project:]https://vveicao.github.io/projects/Motion2VecSets/|
中文摘要: 遥感变化检测对于了解地球表面的动态、监测环境变化、评估人类影响、预测未来趋势和支持决策至关重要。在这项工作中,我们介绍了一种新的变化检测方法,它通过预训练去噪扩散概率模型(DDPM)——一类用于图像合成的生成模型——在训练过程中利用现成的无标注遥感图像。DDPM通过马尔可夫链将训练图像逐步转换为高斯分布,从而学习训练数据分布;在推理(即采样)时,它们可以从高斯噪声出发,生成接近训练分布的多样化样本,取得最先进的图像合成效果。然而,本工作的重点不是图像合成,而是将其用作预训练特征提取器,服务于变化检测这一下游应用。具体来说,我们利用预训练DDPM产生的特征表示,结合变化标签来微调一个轻量级变化分类器。在LEVIR-CD、WHU-CD、DSIFN-CD和CDD数据集上的实验表明,所提出的DDPM-CD方法在F1分数、IoU和总体精度方面显著优于现有最先进的变化检测方法,凸显了预训练DDPM作为下游应用特征提取器的关键作用。我们已在 https://github.com/wgcban/ddpm-cd 提供了代码和预训练模型。
摘要: Remote sensing change detection is crucial for understanding the dynamics of our planet’s surface, facilitating the monitoring of environmental changes, evaluating human impact, predicting future trends, and supporting decision-making. In this work, we introduce a novel approach for change detection that can leverage off-the-shelf, unlabeled remote sensing images in the training process by pre-training a Denoising Diffusion Probabilistic Model (DDPM) - a class of generative models used in image synthesis. DDPMs learn the training data distribution by gradually converting training images into a Gaussian distribution using a Markov chain. During inference (i.e., sampling), they can generate a diverse set of samples closer to the training distribution, starting from Gaussian noise, achieving state-of-the-art image synthesis results. However, in this work, our focus is not on image synthesis but on utilizing it as a pre-trained feature extractor for the downstream application of change detection. Specifically, we fine-tune a lightweight change classifier utilizing the feature representations produced by the pre-trained DDPM alongside change labels. Experiments conducted on the LEVIR-CD, WHU-CD, DSIFN-CD, and CDD datasets demonstrate that the proposed DDPM-CD method significantly outperforms the existing state-of-the-art change detection methods in terms of F1 score, IoU, and overall accuracy, highlighting the pivotal role of pre-trained DDPM as a feature extractor for downstream applications. We have made both the code and pre-trained models available at https://github.com/wgcban/ddpm-cd
[Downlink:]http://arxiv.org/abs/2206.11892v3
[GitHub:]https://github.com/wgcban/ddpm-cd|
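下面是一个示意性草图(非 DDPM-CD 官方代码),说明"把预训练扩散模型当作特征提取器"的用法:对变化前后两幅影像分别提取中间层特征,拼接后交给轻量分类头预测变化图;`ddpm_features` 为假设的特征提取接口,这里用随机张量占位。

```python
# 示意:冻结的 DDPM 作为特征提取器 + 轻量变化分类头(ddpm_features 为假设接口)
import torch
import torch.nn as nn

def ddpm_features(image: torch.Tensor) -> torch.Tensor:
    """假设:返回预训练 DDPM 在若干去噪时间步的中间特征,形状 (B, C, H/4, W/4);此处用随机数占位。"""
    return torch.randn(image.size(0), 256, image.size(2) // 4, image.size(3) // 4)

change_head = nn.Sequential(              # 轻量变化分类器:逐像素二分类(变化 / 未变化)
    nn.Conv2d(512, 128, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(128, 2, kernel_size=1),
)

img_t1 = torch.randn(1, 3, 256, 256)       # 变化前影像
img_t2 = torch.randn(1, 3, 256, 256)       # 变化后影像
feats = torch.cat([ddpm_features(img_t1), ddpm_features(img_t2)], dim=1)
change_map = change_head(feats).argmax(dim=1)
print(change_map.shape)                    # torch.Size([1, 64, 64])
```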
中文摘要: 尽管扩散模型越来越受欢迎,但对于未接触过非平衡统计物理的读者来说,深入理解这一类模型仍然有些困难。有鉴于此,我们用有向图模型和变分贝叶斯原理给出了一种(我们认为)更直接的扩散模型入门介绍,对普通读者的前置知识要求相对较少。我们的阐述构成了一篇全面的技术综述,从深度潜变量模型等基础概念一直到基于连续时间扩散建模的最新进展,并在此过程中强调各类模型之间的理论联系。我们尽可能补充了开创性工作中省略的数学细节以帮助理解,同时避免引入新的符号。我们希望这篇文章能成为该领域研究人员和从业者的有用教学补充,并欢迎社区在 https://github.com/biomedia-mira/demystifying-diffusion 提供反馈和贡献。
摘要: Despite the growing popularity of diffusion models, gaining a deep understanding of the model class remains somewhat elusive for the uninitiated in non-equilibrium statistical physics. With that in mind, we present what we believe is a more straightforward introduction to diffusion models using directed graphical modelling and variational Bayesian principles, which imposes relatively fewer prerequisites on the average reader. Our exposition constitutes a comprehensive technical review spanning from foundational concepts like deep latent variable models to recent advances in continuous-time diffusion-based modelling, highlighting theoretical connections between model classes along the way. We provide additional mathematical insights that were omitted in the seminal works whenever possible to aid in understanding, while avoiding the introduction of new notation. We envision this article serving as a useful educational supplement for both researchers and practitioners in the area, and we welcome feedback and contributions from the community at https://github.com/biomedia-mira/demystifying-diffusion.
[Downlink:]http://arxiv.org/abs/2401.06281v1
[GitHub:]https://github.com/biomedia-mira/demystifying-diffusion|
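作为阅读该综述的参考,下面写出 DDPM 的标准前向加噪过程与简化训练目标(通用公式,符号约定可能与该文略有差异)。

```latex
% DDPM 前向过程与简化训练目标(标准形式,仅供参考)
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right), \qquad
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\right),
\quad \bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)
\\[4pt]
\mathcal{L}_{\text{simple}} = \mathbb{E}_{x_0,\ \epsilon \sim \mathcal{N}(0,\mathbf{I}),\ t}
\left[\ \big\lVert \epsilon - \epsilon_\theta\!\big(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\big) \big\rVert^2 \ \right]
```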
中文摘要: 序列建模方法在机器人模仿学习中显示出良好的效果。最近,得益于其在建模复杂数据分布方面的卓越能力,扩散模型以序列建模的方式被用于行为克隆。标准的基于扩散的策略以输入状态为条件,从随机噪声迭代地生成动作序列。尽管如此,扩散策略模型在视觉表示方面仍有改进空间。在这项工作中,我们提出了Crossway Diffusion,这是一种简单而有效的方法,通过精心设计的状态解码器和辅助自监督学习(SSL)目标来增强基于扩散的视觉运动策略学习。状态解码器从反向扩散过程的中间表示中重建原始图像像素和其他状态信息,整个模型由SSL目标和原始扩散损失共同优化。我们的实验证明了Crossway Diffusion在各种模拟和真实世界机器人任务中的有效性,证实了其相对于标准扩散策略的一致优势,以及相对于基线的显著改进。
摘要: Sequence modeling approaches have shown promising results in robot imitation learning. Recently, diffusion models have been adopted for behavioral cloning in a sequence modeling fashion, benefiting from their exceptional capabilities in modeling complex data distributions. The standard diffusion-based policy iteratively generates action sequences from random noise conditioned on the input states. Nonetheless, the model for diffusion policy can be further improved in terms of visual representations. In this work, we propose Crossway Diffusion, a simple yet effective method to enhance diffusion-based visuomotor policy learning via a carefully designed state decoder and an auxiliary self-supervised learning (SSL) objective. The state decoder reconstructs raw image pixels and other state information from the intermediate representations of the reverse diffusion process. The whole model is jointly optimized by the SSL objective and the original diffusion loss. Our experiments demonstrate the effectiveness of Crossway Diffusion in various simulated and real-world robot tasks, confirming its consistent advantages over the standard diffusion-based policy and substantial improvements over the baselines.
[Downlink:]http://arxiv.org/abs/2307.01849v3
[Project:]https://youtu.be/9deKHueZBuk|
[GitHub:]https://github.com/LostXine/crossway_diffusion|
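下面是一个示意性草图(非 Crossway Diffusion 官方实现),展示"扩散损失 + 状态解码器自监督重建损失联合优化"的训练步骤骨架;policy、state_decoder 及其接口均为假设。

```python
# 示意:扩散策略损失与 SSL 重建损失联合优化(policy / state_decoder 的接口均为假设)
import torch
import torch.nn.functional as F

def training_step(policy, state_decoder, obs, actions, lambda_ssl: float = 0.1):
    noise = torch.randn_like(actions)
    t = torch.randint(0, policy.num_timesteps, (actions.size(0),), device=actions.device)
    noisy_actions = policy.add_noise(actions, noise, t)               # 前向加噪
    pred_noise, intermediate = policy.denoise(noisy_actions, t, obs)  # 反向去噪,同时返回中间表示

    diffusion_loss = F.mse_loss(pred_noise, noise)    # 原始扩散(行为克隆)损失
    recon = state_decoder(intermediate)               # 从中间表示重建原始观测像素
    ssl_loss = F.mse_loss(recon, obs)                 # 辅助自监督目标
    return diffusion_loss + lambda_ssl * ssl_loss     # 联合优化的总损失
```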
中文摘要: 扩散模型的最新进展在许多生成任务中树立了令人印象深刻的里程碑,DALL-E 2、Imagen和Stable Diffusion等热门工作引起了极大关注。尽管该领域格局变化迅速,但最近的新方法大多关注扩展与性能,而非模型能力本身,因此不同任务仍需要各自独立的模型。在这项工作中,我们将现有的单流扩散管线扩展为一个多任务多模态网络,称为Versatile Diffusion(VD),在一个统一模型中处理文本到图像、图像到文本以及变体生成等多个流。VD的管线设计实例化了一个统一的多流扩散框架,由可共享、可交换的层模块组成,实现了超越图像与文本的跨模态通用性。通过大量实验,我们证明VD实现了以下目标:a)VD优于基线方法,并以有竞争力的质量处理其所有基础任务;b)VD支持新颖的扩展,如风格与语义的解耦、双上下文和多上下文融合等;c)我们这一跨图像与文本的多流多模态框架的成功,或将启发更多基于扩散的通用人工智能研究。我们的代码和模型开源于 https://github.com/SHI-Labs/Versatile-Diffusion。
摘要: Recent advances in diffusion models have set an impressive milestone in many generation tasks, and trending works such as DALL-E2, Imagen, and Stable Diffusion have attracted great interest. Despite the rapid landscape changes, recent new approaches focus on extensions and performance rather than capacity, thus requiring separate models for separate tasks. In this work, we expand the existing single-flow diffusion pipeline into a multi-task multimodal network, dubbed Versatile Diffusion (VD), that handles multiple flows of text-to-image, image-to-text, and variations in one unified model. The pipeline design of VD instantiates a unified multi-flow diffusion framework, consisting of sharable and swappable layer modules that enable the crossmodal generality beyond images and text. Through extensive experiments, we demonstrate that VD successfully achieves the following: a) VD outperforms the baseline approaches and handles all its base tasks with competitive quality; b) VD enables novel extensions such as disentanglement of style and semantics, dual- and multi-context blending, etc.; c) The success of our multi-flow multimodal framework over images and text may inspire further diffusion-based universal AI research. Our code and models are open-sourced at https://github.com/SHI-Labs/Versatile-Diffusion.
[Downlink:]http://arxiv.org/abs/2211.08332v4
[GitHub:]https://github.com/SHI-Labs/Versatile-Diffusion|
中文摘要: 我们介绍了可变形卷积v4(DCNv4),这是一种高效且有效的算子,专为广泛的视觉应用而设计。DCNv4通过两项关键改进解决了其前身DCNv3的局限:1)去除空间聚合中的softmax归一化,以增强其动态特性和表达能力;2)优化内存访问,最大限度减少冗余操作以提升速度。与DCNv3相比,这些改进显著加快了收敛速度,并大幅提高了处理速度,DCNv4的前向速度提升三倍以上。DCNv4在各种任务中表现出色,包括图像分类、实例分割和语义分割,尤其是图像生成。当集成到潜在扩散模型的U-Net等生成模型中时,DCNv4的性能优于其基线,显示了其增强生成模型的潜力。在实际应用中,将InternImage模型中的DCNv3替换为DCNv4得到FlashInternImage,在不做其他修改的情况下即可带来最高80%的速度提升和进一步的性能改进。DCNv4在速度和效率方面的进步,加上其在多种视觉任务中的稳健表现,显示出其作为未来视觉模型基础构建块的潜力。
摘要: We introduce Deformable Convolution v4 (DCNv4), a highly efficient and effective operator designed for a broad spectrum of vision applications. DCNv4 addresses the limitations of its predecessor, DCNv3, with two key enhancements: 1. removing softmax normalization in spatial aggregation to enhance its dynamic property and expressive power and 2. optimizing memory access to minimize redundant operations for speedup. These improvements result in a significantly faster convergence compared to DCNv3 and a substantial increase in processing speed, with DCNv4 achieving more than three times the forward speed. DCNv4 demonstrates exceptional performance across various tasks, including image classification, instance and semantic segmentation, and notably, image generation. When integrated into generative models like U-Net in the latent diffusion model, DCNv4 outperforms its baseline, underscoring its possibility to enhance generative models. In practical applications, replacing DCNv3 with DCNv4 in the InternImage model to create FlashInternImage results in up to 80% speed increase and further performance improvement without further modifications. The advancements in speed and efficiency of DCNv4, combined with its robust performance across diverse vision tasks, show its potential as a foundational building block for future vision models.
[Downlink:]http://arxiv.org/abs/2401.06197v1
[GitHub:]https://github.com/OpenGVLab/DCNv4|
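下面用一个极简的张量示例(非 DCNv4 官方算子)说明上述第 1)点改动的含义:DCNv3 对每个查询位置的 K 个采样权重做 softmax 归一化,DCNv4 直接使用未归一化的权重,从而获得更大的动态范围;此处的张量形状与数值仅为演示。

```python
# 示意:采样点聚合时是否做 softmax 归一化(DCNv3 与 DCNv4 的区别之一)
import torch
import torch.nn.functional as F

values = torch.randn(2, 8, 64)      # (batch, K 个采样点, 通道) 已按偏移采样得到的特征
raw_w = torch.randn(2, 8, 1)        # 网络预测的各采样点调制权重

out_dcnv3 = (F.softmax(raw_w, dim=1) * values).sum(dim=1)  # softmax 归一化:权重有界且相互竞争
out_dcnv4 = (raw_w * values).sum(dim=1)                    # 去掉归一化:权重无界,动态性与表达力更强
print(out_dcnv3.shape, out_dcnv4.shape)                    # torch.Size([2, 64]) torch.Size([2, 64])
```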
中文摘要: 共同显著对象检测(CoSOD)致力于复制人类视觉系统识别图像集合中常见和显著对象的能力。尽管最近在深度学习模型方面取得了进展,但这些模型仍然依赖于使用注释良好的CoSOD数据集进行训练。对无训练零样本CoSOD框架的探索是有限的。在本文中,我们从基础计算机视觉模型的零样本传递能力中获得灵感,介绍了第一个零样本CoSOD框架,该框架在没有任何训练过程的情况下利用这些模型。为了实现这一点,我们在我们提出的框架中引入了两个新的组件:组提示生成(GPG)模块和共显著图生成(CMP)模块。我们在广泛使用的数据集上评估了该框架的性能,并观察到了令人印象深刻的结果。我们的方法超越了现有的无监督方法,甚至优于2020年之前开发的完全监督方法,同时与2022年之前开发出的一些完全监督方法保持竞争力。
摘要: Co-salient Object Detection (CoSOD) endeavors to replicate the human visual system’s capacity to recognize common and salient objects within a collection of images. Despite recent advancements in deep learning models, these models still rely on training with well-annotated CoSOD datasets. The exploration of training-free zero-shot CoSOD frameworks has been limited. In this paper, taking inspiration from the zero-shot transfer capabilities of foundational computer vision models, we introduce the first zero-shot CoSOD framework that harnesses these models without any training process. To achieve this, we introduce two novel components in our proposed framework: the group prompt generation (GPG) module and the co-saliency map generation (CMP) module. We evaluate the framework’s performance on widely-used datasets and observe impressive results. Our approach surpasses existing unsupervised methods and even outperforms fully supervised methods developed before 2020, while remaining competitive with some fully supervised methods developed before 2022.
[Downlink:]http://arxiv.org/abs/2309.05499v3
中文摘要: 导航、感知和决策是智能机器人的基本任务,其本质都是估计必要的系统状态。其中,导航是其他上层应用的基础,它通过融合多个传感器的测量来提供精确的位置和姿态。在对每个传感器的观测进行恰当建模后,面向导航的多传感器融合任务就归结为状态估计问题,可通过两类方法求解:优化与滤波。最近的研究表明,基于优化的框架在精度上优于基于滤波的框架。然而,这两类方法都基于最大似然估计(MLE),在相同的线性化点、观测模型、测量和高斯噪声假设下,理论上应当等价。在本文中,我们深入剖析了基于优化和基于滤波的方法所采用的理论与现有策略。结果表明,两种方法在理论上是等价的,但由于实时运行中采用的策略不同,这种等价性会被破坏。在调整现有滤波方法的策略之后,基于视觉里程计(VO)的蒙特卡洛仿真和车载消融实验表明,经过策略调整的滤波与优化严格等价。因此,未来关于传感器融合问题的研究应聚焦于各自的算法和策略,而非状态估计方法本身。
摘要: The essential of navigation, perception, and decision-making which are basic tasks for intelligent robots, is to estimate necessary system states. Among them, navigation is fundamental for other upper applications, providing precise position and orientation, by integrating measurements from multiple sensors. With observations of each sensor appropriately modelled, multi-sensor fusion tasks for navigation are reduced to the state estimation problem which can be solved by two approaches: optimization and filtering. Recent research has shown that optimization-based frameworks outperform filtering-based ones in terms of accuracy. However, both methods are based on maximum likelihood estimation (MLE) and should be theoretically equivalent with the same linearization points, observation model, measurements, and Gaussian noise assumption. In this paper, we deeply dig into the theories and existing strategies utilized in both optimization-based and filtering-based approaches. It is demonstrated that the two methods are equal theoretically, but this equivalence corrupts due to different strategies applied in real-time operation. By adjusting existing strategies of the filtering-based approaches, the Monte-Carlo simulation and vehicular ablation experiments based on visual odometry (VO) indicate that the strategy adjusted filtering strictly equals to optimization. Therefore, future research on sensor-fusion problems should concentrate on their own algorithms and strategies rather than state estimation approaches.
[Downlink:]http://arxiv.org/abs/2401.05836v1
中文摘要: 准确高效地提取材料显微图像中的微观结构,在探索结构-性能关系和优化工艺参数方面发挥着关键作用。基于深度学习的图像分割技术依赖人工标注,耗时耗力,难以满足模型可迁移性和泛化性的要求。Segment Anything Model(SAM)是一种具有强大深度特征表示和零样本泛化能力的大型视觉模型,为图像分割提供了新的解决方案。然而,在没有人工标注的情况下直接应用SAM分割材料显微图像中的微观结构并不能达到预期效果,因为其原生的提示工程难以适应材料显微图像中关键微观结构密集且分散的特点。在本文中,我们提出了MatSAM,一种基于SAM的通用、高效的微观结构提取方案。我们根据材料微观结构的分布和形状,设计了一种新的基于点的提示生成策略:为不同显微图像生成提示,融合感兴趣区域(ROI)关键点与网格关键点的提示,并结合后处理方法对材料微观结构进行定量表征。对于晶界和相等常见微观结构,在光学显微镜(OM)和扫描电子显微镜(SEM)拍摄的18种材料微观结构上的评估表明,MatSAM的分割性能优于传统方法,甚至优于有监督学习方法。我们相信,MatSAM能够显著降低材料微观结构定量表征的成本,并加速新材料的设计。
摘要: Accurate and efficient extraction of microstructures in microscopic images of materials plays a critical role in the exploration of structure-property relationships and the optimization of process parameters. Deep learning-based image segmentation techniques that rely on manual annotation are time-consuming and labor-intensive and hardly meet the demand for model transferability and generalization. Segment Anything Model (SAM), a large visual model with powerful deep feature representation and zero-shot generalization capabilities, has provided new solutions for image segmentation. However, directly applying SAM to segmenting microstructures in microscopic images of materials without human annotation cannot achieve the expected results, as the difficulty of adapting its native prompt engineering to the dense and dispersed characteristics of key microstructures in materials microscopy images. In this paper, we propose MatSAM, a general and efficient microstructure extraction solution based on SAM. A new point-based prompts generation strategy is designed, grounded on the distribution and shape of materials microstructures. It generates prompts for different microscopic images, fuses the prompts of the region of interest (ROI) key points and grid key points, and integrates post-processing methods for quantitative characterization of materials microstructures. For common microstructures including grain boundary and phase, MatSAM achieves superior segmentation performance to conventional methods and is even preferable to supervised learning methods evaluated on 18 materials microstructures imaged by the optical microscope (OM) and scanning electron microscope (SEM). We believe that MatSAM can significantly reduce the cost of quantitative characterization of materials microstructures and accelerate the design of new materials.
[Downlink:]http://arxiv.org/abs/2401.05638v1
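下面给出一个示意性草图,展示如何用官方 segment-anything 库以关键点提示驱动 SAM 分割(与 MatSAM 的"基于点的提示生成"同思路,但点坐标在此为手工给定的假设值,并非论文的自动生成策略;图像与权重文件路径同为假设)。

```python
# 示意:用点提示驱动 SAM 分割显微图像(pip install segment-anything;需预先下载权重文件)
import numpy as np
import cv2
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("micrograph.png"), cv2.COLOR_BGR2RGB)  # 假设的显微图像
predictor.set_image(image)

point_coords = np.array([[120, 200], [340, 260]])   # 感兴趣区域关键点(实际应由自动策略生成)
point_labels = np.array([1, 1])                     # 1 表示前景点
masks, scores, _ = predictor.predict(point_coords=point_coords,
                                     point_labels=point_labels,
                                     multimask_output=True)
print(masks.shape, scores)                          # (3, H, W) 候选掩码及其置信度
```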
中文摘要: 无参考图像质量评估(NR-IQA)旨在不依赖原始参考图像的情况下,预测与人类感知一致的图像质量分数,是各种视觉任务的关键组成部分。确保NR-IQA方法的鲁棒性,对于可靠地比较不同图像处理技术以及在推荐系统中提供一致的用户体验至关重要。针对NR-IQA的攻击方法为测试其鲁棒性提供了有力工具。然而,当前NR-IQA的攻击方法严重依赖NR-IQA模型的梯度,在梯度信息不可用时受到限制。在本文中,我们提出了一种针对NR-IQA方法的开创性的基于查询的黑盒攻击。我们提出了"分数边界"(score boundary)的概念,并利用具有多个分数边界的自适应迭代方法;同时,初始攻击方向的设计利用了人类视觉系统(HVS)的特性。实验表明,我们的攻击方法优于所有对比的最先进方法,并远超以往的黑盒方法。性能良好的DBCNN模型在受到我们的方法攻击时,Spearman等级相关系数(SROCC)下降了0.6972,揭示了NR-IQA对黑盒攻击的脆弱性。所提出的攻击方法也为进一步探索NR-IQA的鲁棒性提供了有力工具。
摘要: No-Reference Image Quality Assessment (NR-IQA) aims to predict image quality scores consistent with human perception without relying on pristine reference images, serving as a crucial component in various visual tasks. Ensuring the robustness of NR-IQA methods is vital for reliable comparisons of different image processing techniques and consistent user experiences in recommendations. The attack methods for NR-IQA provide a powerful instrument to test the robustness of NR-IQA. However, current attack methods of NR-IQA heavily rely on the gradient of the NR-IQA model, leading to limitations when the gradient information is unavailable. In this paper, we present a pioneering query-based black box attack against NR-IQA methods. We propose the concept of score boundary and leverage an adaptive iterative approach with multiple score boundaries. Meanwhile, the initial attack directions are also designed to leverage the characteristics of the Human Visual System (HVS). Experiments show our attack method outperforms all compared state-of-the-art methods and is far ahead of previous black-box methods. The effective DBCNN model suffers a Spearman rank-order correlation coefficient (SROCC) decline of 0.6972 when attacked by our method, revealing the vulnerability of NR-IQA to black-box attacks. The proposed attack method also provides a potent tool for further exploration into NR-IQA robustness.
[Downlink:]http://arxiv.org/abs/2401.05217v1
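论文的攻击依赖"分数边界"与 HVS 先验的初始方向;下面仅给出一个通用的基于查询的随机搜索攻击草图(并非论文方法本身),说明在只能查询模型分数的黑盒设定下,如何在 L∞ 扰动预算内迭代地压低 NR-IQA 的预测分数;`iqa_model` 为假设的打分接口。

```python
# 示意:仅凭分数查询的黑盒随机搜索攻击(通用草图;iqa_model 为假设的 NR-IQA 打分函数)
import torch

def query_attack(iqa_model, image, eps=4 / 255, step=1 / 255, n_queries=1000):
    adv = image.clone()
    best_score = iqa_model(adv).item()                       # 每次查询只能获得一个标量质量分
    for _ in range(n_queries):
        delta = torch.empty_like(adv).uniform_(-step, step)  # 随机扰动提案
        cand = (adv + delta).clamp(image - eps, image + eps).clamp(0, 1)  # 约束在 L∞ 预算与像素范围内
        score = iqa_model(cand).item()
        if score < best_score:                               # 目标:压低预测质量分
            adv, best_score = cand, score
    return adv, best_score
```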