[晓理紫] Daily Paper Digest (with abstracts, source code or project links)

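Paper subscriptions by research area

Follow {晓理紫|小李子} for daily paper updates. If you find this useful, please forward it to anyone who might need it. Thanks for your support.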

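Categories:

  • Large language models (LLM)
  • Vision-language models (VLM)
  • Diffusion models
  • Visual navigation
  • Embodied AI, robotics
  • Reinforcement learning
  • Open vocabulary, detection and segmentation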

== Embodied Artificial Intelligence ==

Title: Chat Failures and Troubles: Reasons and Solutions

Authors: Manal Helal, Patrick Holthaus, Gabriella Lakatos

Abstract: This paper examines some common problems in Human-Robot Interaction (HRI) that cause failures and troubles in chat. A given use case’s design decisions start with the suitable robot, the suitable chatting model, identifying common problems that cause failures, identifying potential solutions, and planning continuous improvement. In conclusion, it is recommended to use a closed-loop control algorithm that guides the use of pre-trained Artificial Intelligence (AI) models and provides vocabulary filtering, re-train batched models on new datasets, learn online from data streams, and/or use reinforcement learning models to self-update the trained models and reduce errors.

[Downlink:]http://arxiv.org/abs/2309.03708v2


Title: Theory of Mind abilities of Large Language Models in Human-Robot Interaction: An Illusion?

Authors: Mudit Verma, Siddhant Bhambri, Subbarao Kambhampati

Abstract: Large Language Models have shown exceptional generative abilities in various natural language and generation tasks. However, possible anthropomorphization and leniency towards failure cases have propelled discussions on emergent abilities of Large Language Models, especially on Theory of Mind (ToM) abilities. While several false-belief tests exist to verify the ability to infer and maintain mental models of another entity, we study a special application of ToM abilities that has higher stakes and possibly irreversible consequences: Human-Robot Interaction. In this work, we explore the task of Perceived Behavior Recognition, where a robot employs a Large Language Model (LLM) to assess the robot’s generated behavior in a manner similar to a human observer. We focus on four behavior types, namely explicable, legible, predictable, and obfuscatory behavior, which have been extensively used to synthesize interpretable robot behaviors. The LLM’s goal is, therefore, to be a human proxy to the agent, and to answer how a certain agent behavior would be perceived by the human in the loop, for example “Given a robot’s behavior X, would the human observer find it explicable?”. We conduct a human subject study to verify that the users are able to correctly answer such a question in the curated situations (robot setting and plan) across five domains. A first analysis of the belief test yields extremely positive results, inflating one’s expectations of LLMs possessing ToM abilities. We then propose and perform a suite of perturbation tests that break this illusion, i.e. Inconsistent Belief, Uninformative Context and Conviction Test. We conclude that the high score of LLMs on vanilla prompts showcases their potential use in HRI settings; however, possessing ToM demands invariance to trivial or irrelevant perturbations in the context, which LLMs lack.

[Downlink:]http://arxiv.org/abs/2401.05302v2


== Reinforcement Learning @ RL ==

Title: CQLite: Communication-Efficient Multi-Robot Exploration Using Coverage-biased Distributed Q-Learning

Authors: Ehsan Latif, Ramviyas Parasuraman

Abstract: Frontier exploration and reinforcement learning have historically been used to solve the problem of enabling many mobile robots to autonomously and cooperatively explore complex surroundings. These methods need to keep an internal global map for navigation, but they do not take into consideration the high costs of communication and information sharing between robots. This study offers CQLite, a novel distributed Q-learning technique designed to minimize data communication overhead between robots while achieving rapid convergence and thorough coverage in multi-robot exploration. The proposed CQLite method uses ad hoc map merging, and selectively shares updated Q-values at recently identified frontiers to significantly reduce communication costs. The theoretical analysis of CQLite’s convergence and efficiency, together with extensive numerical verification on simulated indoor maps utilizing several robots, demonstrates the method’s novelty. With over 2x reductions in computation and communication alongside improved mapping performance, CQLite outperformed cutting-edge multi-robot exploration techniques like Rapidly Exploring Random Trees and Deep Reinforcement Learning. Related codes are open-sourced at https://github.com/herolab-uga/cqlite.

[Downlink:]http://arxiv.org/abs/2307.00500v2

[GitHub:]https://github.com/herolab-uga/cqlite
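
The idea of sharing only the Q-values updated at newly found frontiers can be illustrated with a small tabular sketch. This is not the CQLite implementation: the class `FrontierQAgent`, the `is_frontier` flag, and the max-based merge rule in `receive` are all hypothetical names and choices made for illustration.

```python
from collections import defaultdict


class FrontierQAgent:
    """Tabular Q-learner that only shares Q-values it changed at frontier states."""

    def __init__(self, alpha=0.5, gamma=0.9):
        self.alpha = alpha
        self.gamma = gamma
        self.q = defaultdict(float)   # (state, action) -> value estimate
        self.outbox = {}              # updates waiting to be sent to peers

    def update(self, state, action, reward, next_state, next_actions, is_frontier):
        # Standard Q-learning target: r + gamma * max_a' Q(s', a')
        best_next = max((self.q[(next_state, a)] for a in next_actions), default=0.0)
        target = reward + self.gamma * best_next
        key = (state, action)
        self.q[key] += self.alpha * (target - self.q[key])
        # Assumed sharing rule: queue only entries touched at frontier states.
        if is_frontier:
            self.outbox[key] = self.q[key]

    def flush(self):
        """Return the queued updates and clear the queue."""
        msg, self.outbox = self.outbox, {}
        return msg

    def receive(self, msg):
        # Assumed merge rule: keep the larger estimate for each entry.
        for key, value in msg.items():
            self.q[key] = max(self.q[key], value)
```

In this sketch the message size grows with the number of frontier updates rather than with the size of the whole Q-table, which is the kind of saving the abstract describes.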


Title: Cooperative Edge Caching Based on Elastic Federated and Multi-Agent Deep Reinforcement Learning in Next-Generation Network

Authors: Qiong Wu, Wenhua Wang, Pingyi Fan

Abstract: Edge caching is a promising solution for next-generation networks by empowering caching units in small-cell base stations (SBSs), which allows user equipments (UEs) to fetch users’ requested contents that have been pre-cached in SBSs. It is crucial for SBSs to predict accurate popular contents through learning while protecting users’ personal information. Traditional federated learning (FL) can protect users’ privacy but the data discrepancies among UEs can lead to a degradation in model quality. Therefore, it is necessary to train personalized local models for each UE to predict popular contents accurately. In addition, the cached contents can be shared among adjacent SBSs in next-generation networks, thus caching predicted popular contents in different SBSs may affect the cost to fetch contents. Hence, it is critical to determine where the popular contents are cached cooperatively. To address these issues, we propose a cooperative edge caching scheme based on elastic federated and multi-agent deep reinforcement learning (CEFMR) to optimize the cost in the network. We first propose an elastic FL algorithm to train the personalized model for each UE, where an adversarial autoencoder (AAE) model is adopted for training to improve the prediction accuracy, then a popular content prediction algorithm is proposed to predict the popular contents for each SBS based on the trained AAE model. Finally, we propose a multi-agent deep reinforcement learning (MADRL) based algorithm to decide where the predicted popular contents are collaboratively cached among SBSs. Our experimental results demonstrate the superiority of our proposed scheme to existing baseline caching schemes.

[Downlink:]http://arxiv.org/abs/2401.09886v1

[GitHub:]https://github.com/qiongwu86/Edge-Caching-Based-on-Multi-Agent-Deep-Reinforcement-Learning-and-Federated-Learning


Title: FREED++: Improving RL Agents for Fragment-Based Molecule Generation by Thorough Reproduction

Authors: Alexander Telepov, Artem Tsypin, Kuzma Khrabrov

Abstract: A rational design of new therapeutic drugs aims to find a molecular structure with desired biological functionality, e.g., an ability to activate or suppress a specific protein via binding to it. Molecular docking is a common technique for evaluating protein-molecule interactions. Recently, Reinforcement Learning (RL) has emerged as a promising approach to generating molecules with the docking score (DS) as a reward. In this work, we reproduce, scrutinize and improve the recent RL model for molecule generation called FREED (arXiv:2110.01219). Extensive evaluation of the proposed method reveals several limitations and challenges despite the outstanding results reported for three target proteins. Our contributions include fixing numerous implementation bugs and simplifying the model while increasing its quality, significantly extending experiments, and conducting an accurate comparison with current state-of-the-art methods for protein-conditioned molecule generation. We show that the resulting fixed model is capable of producing molecules with superior docking scores compared to alternative approaches.

[Downlink:]http://arxiv.org/abs/2401.09840v1

[Project:]https://www.jmlr.org/tmlr/


Title: BridgeData V2: A Dataset for Robot Learning at Scale

Authors: Homer Walke, Kevin Black, Abraham Lee

Abstract: We introduce BridgeData V2, a large and diverse dataset of robotic manipulation behaviors designed to facilitate research on scalable robot learning. BridgeData V2 contains 60,096 trajectories collected across 24 environments on a publicly available low-cost robot. BridgeData V2 provides extensive task and environment variability, leading to skills that can generalize across environments, domains, and institutions, making the dataset a useful resource for a broad range of researchers. Additionally, the dataset is compatible with a wide variety of open-vocabulary, multi-task learning methods conditioned on goal images or natural language instructions. In our experiments, we train 6 state-of-the-art imitation learning and offline reinforcement learning methods on our dataset, and find that they succeed on a suite of tasks requiring varying amounts of generalization. We also demonstrate that the performance of these methods improves with more data and higher capacity models, and that training on a greater variety of skills leads to improved generalization. By publicly sharing BridgeData V2 and our pre-trained models, we aim to accelerate research in scalable robot learning methods. Project page at https://rail-berkeley.github.io/bridgedata

[Downlink:]http://arxiv.org/abs/2308.12952v3

[Project:]https://rail-berkeley.github.io/bridgedata


Title: Multi-Agent Reinforcement Learning for Maritime Operational Technology Cyber Security

Authors: Alec Wilson, Ryan Menzies, Neela Morarji

Abstract: This paper demonstrates the potential for autonomous cyber defence to be applied on industrial control systems and provides a baseline environment to further explore Multi-Agent Reinforcement Learning’s (MARL) application to this problem domain. It introduces a simulation environment, IPMSRL, of a generic Integrated Platform Management System (IPMS) and explores the use of MARL for autonomous cyber defence decision-making on generic maritime-based IPMS Operational Technology (OT). OT cyber defensive actions are less mature than they are for Enterprise IT. This is due to the relatively brittle nature of OT infrastructure originating from the use of legacy systems, design-time engineering assumptions, and lack of full-scale modern security controls. There are many obstacles to be tackled across the cyber landscape due to continually increasing cyber-attack sophistication and the limitations of traditional IT-centric cyber defence solutions. Traditional IT controls are rarely deployed on OT infrastructure, and where they are, some threats aren’t fully addressed. In our experiments, a shared critic implementation of Multi Agent Proximal Policy Optimisation (MAPPO) outperformed Independent Proximal Policy Optimisation (IPPO). MAPPO reached an optimal policy (episode outcome mean of 1) after 800K timesteps, whereas IPPO was only able to reach an episode outcome mean of 0.966 after one million timesteps. Hyperparameter tuning greatly improved training performance. Across one million timesteps the tuned hyperparameters reached an optimal policy whereas the default hyperparameters only managed to win sporadically, with most simulations resulting in a draw. We tested a real-world constraint, attack detection alert success, and found that when alert success probability is reduced to 0.75 or 0.9, the MARL defenders were still able to win in over 97.5% or 99.5% of episodes, respectively.

[Downlink:]http://arxiv.org/abs/2401.10149v1


Title: A Model-Based Solution to the Offline Multi-Agent Reinforcement Learning Coordination Problem

Authors: Paul Barde, Jakob Foerster, Derek Nowrouzezahrai

Abstract: Training multiple agents to coordinate is an essential problem with applications in robotics, game theory, economics, and social sciences. However, most existing Multi-Agent Reinforcement Learning (MARL) methods are online and thus impractical for real-world applications in which collecting new interactions is costly or dangerous. While these algorithms should leverage offline data when available, doing so gives rise to what we call the offline coordination problem. Specifically, we identify and formalize the strategy agreement (SA) and the strategy fine-tuning (SFT) coordination challenges, two issues at which current offline MARL algorithms fail. Concretely, we reveal that the prevalent model-free methods are severely deficient and cannot handle coordination-intensive offline multi-agent tasks in either toy or MuJoCo domains. To address this setback, we emphasize the importance of inter-agent interactions and propose the very first model-based offline MARL method. Our resulting algorithm, Model-based Offline Multi-Agent Proximal Policy Optimization (MOMA-PPO), generates synthetic interaction data and enables agents to converge on a strategy while fine-tuning their policies accordingly. This simple model-based solution solves the coordination-intensive offline tasks, significantly outperforming the prevalent model-free methods even under severe partial observability and with learned world models.

[Downlink:]http://arxiv.org/abs/2305.17198v2


== Object Detection @ Segmentation @ Open Vocabulary Detection ==

Title: OMG-Seg: Is One Model Good Enough For All Segmentation?

Authors: Xiangtai Li, Haobo Yuan, Wei Li

Abstract: In this work, we address various segmentation tasks, each traditionally tackled by distinct or partially unified models. We propose OMG-Seg, One Model that is Good enough to efficiently and effectively handle all the segmentation tasks, including image semantic, instance, and panoptic segmentation, as well as their video counterparts, open vocabulary settings, prompt-driven, interactive segmentation like SAM, and video object segmentation. To our knowledge, this is the first model to handle all these tasks in one model and achieve satisfactory performance. We show that OMG-Seg, a transformer-based encoder-decoder architecture with task-specific queries and outputs, can support over ten distinct segmentation tasks and yet significantly reduce computational and parameter overhead across various tasks and datasets. We rigorously evaluate the inter-task influences and correlations during co-training. Code and models are available at https://github.com/lxtGH/OMG-Seg.

[Downlink:]http://arxiv.org/abs/2401.10229v1

[Project:]https://lxtgh.github.io/project/omg_seg/

[GitHub:]https://github.com/lxtGH/OMG-Seg


Title: RAP-SAM: Towards Real-Time All-Purpose Segment Anything

Authors: Shilin Xu, Haobo Yuan, Qingyu Shi

Abstract: Advanced by transformer architecture, vision foundation models (VFMs) achieve remarkable progress in performance and generalization ability. Segment Anything Model (SAM) is one remarkable model that can achieve generalized segmentation. However, most VFMs cannot run in real-time, which makes it difficult to transfer them into several products. On the other hand, current real-time segmentation mainly has one purpose, such as semantic segmentation on the driving scene. We argue that diverse outputs are needed for real applications. Thus, this work explores a new real-time segmentation setting, named all-purpose segmentation in real-time, to transfer VFMs in real-time deployment. It contains three different tasks, including interactive segmentation, panoptic segmentation, and video segmentation. We aim to use one model to achieve the above tasks in real-time. We first benchmark several strong baselines. Then, we present Real-Time All Purpose SAM (RAP-SAM). It contains an efficient encoder and an efficient decoupled decoder to perform prompt-driven decoding. Moreover, we explore different training strategies and tuning methods to further boost co-training performance. Our code and model are available at https://github.com/xushilin1/RAP-SAM/.

[Downlink:]http://arxiv.org/abs/2401.10228v1

[Project:]https://xushilin1.github.io/rap_sam/

[GitHub:]https://github.com/xushilin1/RAP-SAM/


Title: A Simple Latent Diffusion Approach for Panoptic Segmentation and Mask Inpainting

Authors: Wouter Van Gansbeke, Bert De Brabandere

Abstract: Panoptic and instance segmentation networks are often trained with specialized object detection modules, complex loss functions, and ad-hoc post-processing steps to handle the permutation-invariance of the instance masks. This work builds upon Stable Diffusion and proposes a latent diffusion approach for panoptic segmentation, resulting in a simple architecture which omits these complexities. Our training process consists of two steps: (1) training a shallow autoencoder to project the segmentation masks to latent space; (2) training a diffusion model to allow image-conditioned sampling in latent space. The use of a generative model unlocks the exploration of mask completion or inpainting, which has applications in interactive segmentation. The experimental validation yields promising results for both panoptic segmentation and mask inpainting. While not setting a new state-of-the-art, our model’s simplicity, generality, and mask completion capability are desirable properties.

[Downlink:]http://arxiv.org/abs/2401.10227v1

[GitHub:]https://github.com/segments-ai/latent-diffusion-segmentation


Title: AGILE3D: Attention Guided Interactive Multi-object 3D Segmentation

Authors: Yuanwen Yue, Sabarinath Mahadevan, Jonas Schult

Abstract: During interactive segmentation, a model and a user work together to delineate objects of interest in a 3D point cloud. In an iterative process, the model assigns each data point to an object (or the background), while the user corrects errors in the resulting segmentation and feeds them back into the model. The current best practice formulates the problem as binary classification and segments objects one at a time. The model expects the user to provide positive clicks to indicate regions wrongly assigned to the background and negative clicks on regions wrongly assigned to the object. Sequentially visiting objects is wasteful since it disregards synergies between objects: a positive click for a given object can, by definition, serve as a negative click for nearby objects. Moreover, a direct competition between adjacent objects can speed up the identification of their common boundary. We introduce AGILE3D, an efficient, attention-based model that (1) supports simultaneous segmentation of multiple 3D objects, (2) yields more accurate segmentation masks with fewer user clicks, and (3) offers faster inference. Our core idea is to encode user clicks as spatial-temporal queries and enable explicit interactions between click queries as well as between them and the 3D scene through a click attention module. Every time new clicks are added, we only need to run a lightweight decoder that produces updated segmentation masks. In experiments with four different 3D point cloud datasets, AGILE3D sets a new state-of-the-art. Moreover, we also verify its practicality in real-world setups with real user studies.

[Downlink:]http://arxiv.org/abs/2306.00977v3

[Project:]https://ywyue.github.io/AGILE3D
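
The observation that a positive click for one object can double as a negative click for other objects can be sketched in a few lines. This is only an illustration of the bookkeeping, not AGILE3D's click attention module: `per_object_clicks` and its input layout are hypothetical, and unlike the abstract's "nearby objects" wording, the sketch treats clicks on every other object as negatives regardless of distance.

```python
def per_object_clicks(clicks):
    """Split shared clicks into positives and negatives for each object.

    clicks: list of (point, object_id) pairs, where point is any coordinate tuple.
    Returns {object_id: (positive_points, negative_points)}.
    """
    object_ids = sorted({obj for _, obj in clicks})
    result = {}
    for obj in object_ids:
        # Clicks placed on this object are its positives ...
        positives = [p for p, o in clicks if o == obj]
        # ... and clicks placed on any other object act as its negatives.
        negatives = [p for p, o in clicks if o != obj]
        result[obj] = (positives, negatives)
    return result
```

With three clicks spread over two objects, each object ends up with its own positives plus the other object's clicks as negatives, so no click is spent on a single object only.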


Title: FedA3I: Annotation Quality-Aware Aggregation for Federated Medical Image Segmentation against Heterogeneous Annotation Noise

Authors: Nannan Wu, Zhaobin Sun, Zengqiang Yan

Abstract: Federated learning (FL) has emerged as a promising paradigm for training segmentation models on decentralized medical data, owing to its privacy-preserving property. However, existing research overlooks the prevalent annotation noise encountered in real-world medical datasets, which limits the performance ceilings of FL. In this paper, we, for the first time, identify and tackle this problem. For problem formulation, we propose a contour evolution for modeling non-independent and identically distributed (Non-IID) noise across pixels within each client and then extend it to the case of multi-source data to form a heterogeneous noise model (i.e., Non-IID annotation noise across clients). For robust learning from annotations with such two-level Non-IID noise, we emphasize the importance of data quality in model aggregation, allowing high-quality clients to have a greater impact on FL. To achieve this, we propose Federated learning with Annotation quAlity-aware AggregatIon, named FedA3I, by introducing a quality factor based on client-wise noise estimation. Specifically, noise estimation at each client is accomplished through the Gaussian mixture model and then incorporated into model aggregation in a layer-wise manner to up-weight high-quality clients. Extensive experiments on two real-world medical image segmentation datasets demonstrate the superior performance of FedA3I against the state-of-the-art approaches in dealing with cross-client annotation noise. The code is available at https://github.com/wnn2000/FedAAAI.

[Downlink:]http://arxiv.org/abs/2312.12838v2

[GitHub:]https://github.com/wnn2000/FedAAAI
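
To make "up-weighting high-quality clients" concrete, here is a minimal sketch of a noise-aware weighted average over client parameters. It is not the FedA3I aggregation rule: the function name, the flat parameter vectors, and the weight `n_k * (1 - noise_k)` are assumptions chosen for illustration, and the paper's Gaussian-mixture noise estimation and layer-wise weighting are not reproduced.

```python
def quality_aware_average(client_params, client_sizes, noise_rates):
    """Average client parameter vectors, down-weighting noisier clients.

    client_params: list of equal-length lists of floats, one per client.
    client_sizes:  number of training samples held by each client.
    noise_rates:   estimated annotation noise per client, in [0, 1].
    The weight n_k * (1 - noise_k) is an illustrative assumption, not the
    quality factor defined in the FedA3I paper.
    """
    raw = [n * (1.0 - r) for n, r in zip(client_sizes, noise_rates)]
    total = sum(raw)
    if total <= 0:
        raise ValueError("at least one client needs a positive weight")
    weights = [w / total for w in raw]
    dim = len(client_params[0])
    # Weighted sum of the clients' values for each parameter index.
    return [
        sum(w * params[i] for w, params in zip(weights, client_params))
        for i in range(dim)
    ]
```

When every client has the same noise estimate, the weights reduce to the usual sample-count weighting of federated averaging, so the quality term only changes the result when clients differ in estimated annotation quality.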


Title: AMSP-UOD: When Vortex Convolution and Stochastic Perturbation Meet Underwater Object Detection

Authors: Jingchun Zhou, Zongxin He, Kin-Man Lam

Abstract: In this paper, we present a novel Amplitude-Modulated Stochastic Perturbation and Vortex Convolutional Network, AMSP-UOD, designed for underwater object detection. AMSP-UOD specifically addresses the impact of non-ideal imaging factors on detection accuracy in complex underwater environments. To mitigate the influence of noise on object detection performance, we propose AMSP Vortex Convolution (AMSP-VConv) to disrupt the noise distribution, enhance feature extraction capabilities, effectively reduce parameters, and improve network robustness. We design the Feature Association Decoupling Cross Stage Partial (FAD-CSP) module, which strengthens the association of long and short range features, improving the network performance in complex underwater environments. Additionally, our sophisticated post-processing method, based on Non-Maximum Suppression (NMS) with aspect-ratio similarity thresholds, optimizes detection in dense scenes, such as waterweed and schools of fish, improving object detection accuracy. Extensive experiments on the URPC and RUOD datasets demonstrate that our method outperforms existing state-of-the-art methods in terms of accuracy and noise immunity. AMSP-UOD proposes an innovative solution with the potential for real-world applications. Our code is available at https://github.com/zhoujingchun03/AMSP-UOD.

[Downlink:]http://arxiv.org/abs/2308.11918v3

[GitHub:]https://github.com/zhoujingchun03/AMSP-UOD
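
One way to read "NMS with aspect-ratio similarity thresholds" is a greedy NMS that suppresses a box only when it both overlaps a kept box and has a similar shape. The sketch below follows that reading. It is an assumption about how the two thresholds might combine, not the post-processing from the AMSP-UOD code, and the function names and default thresholds are hypothetical. Boxes are `(x1, y1, x2, y2)` tuples with positive width and height.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def aspect_similarity(a, b):
    """Ratio of the smaller to the larger width/height ratio, in (0, 1]."""
    ra = (a[2] - a[0]) / (a[3] - a[1])
    rb = (b[2] - b[0]) / (b[3] - b[1])
    return min(ra, rb) / max(ra, rb)


def nms_aspect(boxes, scores, iou_thresh=0.5, ar_thresh=0.7):
    """Greedy NMS that suppresses only overlapping boxes of similar shape.

    Returns the indices of the kept boxes, highest score first.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        suppressed = any(
            iou(boxes[i], boxes[k]) > iou_thresh
            and aspect_similarity(boxes[i], boxes[k]) > ar_thresh
            for k in keep
        )
        if not suppressed:
            keep.append(i)
    return keep
```

Under this rule, two heavily overlapping boxes with clearly different shapes both survive, which is the behaviour one would want when differently shaped objects overlap in dense scenes. Setting `ar_thresh=0.0` falls back to plain IoU-based NMS.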

