[晓理紫]每日论文分享(有中文摘要,源码或项目地址)--机器人、强化学习

专属领域论文订阅

关注{晓理紫|小李子},每日更新论文,如感兴趣,请转发给有需要的同学,谢谢支持

如果你觉得本文对你有所帮助,请关注我,每日准时为你推送最新论文。


分类:

  • 大语言模型LLM
  • 视觉模型VLM
  • 扩散模型
  • 视觉导航
  • 具身智能,机器人
  • 强化学习
  • 开放词汇,检测分割

== Robotic Agent ==

标题: Workspace Optimization Techniques to Improve Prediction of Human Motion During Human-Robot Collaboration

作者: Yi-Shiuan Tung, Matthew B. Luebbers, Alessandro Roncone

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2401.12965v1

中文摘要: 理解人类的意图对于安全有效的人机协作至关重要。虽然当前最先进的人类目标预测方法利用学习模型来处理人类运动数据的不确定性,但这类数据本质上是随机且高方差的,限制了这些模型在需要协调的交互(包括安全关键或近距离任务)中的效用。我们的关键见解是:机器人队友可以在交互之前有意地布置共享工作空间,以减少人类运动的方差,从而获得与分类器无关的目标预测改进。在这项工作中,我们提出了一种算法方法,让机器人在共享的人机工作空间中布置实体物体,并利用增强现实投影"虚拟障碍",针对给定任务集优化人类运动的易读性(legibility)。我们通过两项人类受试者研究,将我们的方法与其他工作空间布置策略进行了比较:一项在虚拟2D导航领域,另一项在涉及机械臂的真实桌面操作领域。我们评估了从每种条件下学习到的人类运动预测模型的准确性,结果表明,我们基于虚拟障碍的工作空间优化技术能用更少的训练数据获得更高的机器人预测准确率。

摘要: Understanding human intentions is critical for safe and effective human-robot collaboration. While state of the art methods for human goal prediction utilize learned models to account for the uncertainty of human motion data, that data is inherently stochastic and high variance, hindering those models’ utility for interactions requiring coordination, including safety-critical or close-proximity tasks. Our key insight is that robot teammates can deliberately configure shared workspaces prior to interaction in order to reduce the variance in human motion, realizing classifier-agnostic improvements in goal prediction. In this work, we present an algorithmic approach for a robot to arrange physical objects and project “virtual obstacles” using augmented reality in shared human-robot workspaces, optimizing for human legibility over a given set of tasks. We compare our approach against other workspace arrangement strategies using two human-subjects studies, one in a virtual 2D navigation domain and the other in a live tabletop manipulation domain involving a robotic manipulator arm. We evaluate the accuracy of human motion prediction models learned from each condition, demonstrating that our workspace optimization technique with virtual obstacles leads to higher robot prediction accuracy using less training data.
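
下面给出一个极简的示意代码(非论文原始算法):把"布置虚拟障碍使通往不同目标的路径更容易区分"这一核心直觉写成可运行的小例子,用网格图上最短路径的重叠程度近似"易读性",并在少量候选障碍布置中挑选得分最高的一种。其中 `legibility_score`、候选布置等均为假设性示意,仅用于说明思路。

```python
import itertools
import networkx as nx

def path_overlap(p1, p2):
    """两条路径共享节点的比例(Jaccard),越低表示目标越容易区分。"""
    s1, s2 = set(p1), set(p2)
    return len(s1 & s2) / len(s1 | s2)

def legibility_score(grid_size, start, goals, obstacles):
    """加入虚拟障碍后,计算通往不同目标的最短路径之间的平均可区分度。"""
    G = nx.grid_2d_graph(grid_size, grid_size)
    G.remove_nodes_from(obstacles)           # 虚拟障碍:直接从网格中删除对应格子
    try:
        paths = [nx.shortest_path(G, start, g) for g in goals]
    except nx.NetworkXNoPath:
        return float("-inf")                 # 障碍阻断了某个目标,排除该布置
    pairs = list(itertools.combinations(paths, 2))
    return 1.0 - sum(path_overlap(a, b) for a, b in pairs) / len(pairs)

# 穷举少量候选障碍布置,选出使路径最易区分(即运动最"易读")的一种
start, goals = (0, 0), [(7, 2), (7, 7), (2, 7)]
candidates = [frozenset(), frozenset({(3, 3)}), frozenset({(3, 3), (4, 4)}),
              frozenset({(2, 2), (5, 5)})]
best = max(candidates, key=lambda obs: legibility_score(8, start, goals, obs))
print("最优虚拟障碍布置:", sorted(best))
```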


标题: Integrating Human Expertise in Continuous Spaces: A Novel Interactive Bayesian Optimization Framework with Preference Expected Improvement

作者: Nikolaus Feith, Elmar Rueckert

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2401.12662v1

中文摘要: 交互式机器学习(IML)寻求将人类专业知识整合到机器学习过程中。然而,大多数现有算法不能应用于现实世界场景,因为它们的状态空间和/或动作空间被限制为离散值。此外,所有现有方法的交互仅限于在多个方案之间做出选择。因此,我们提出了一种基于贝叶斯优化(BO)的新框架。交互式贝叶斯优化(IBO)支持机器学习算法和人类之间的协作。该框架捕获用户偏好,并为用户提供一个手动制定策略的界面。此外,我们还引入了一个新的采集函数——偏好期望改进(Preference Expected Improvement,PEI),利用用户偏好的概率模型来提高系统效率。我们的方法旨在确保机器能够从人类的专业知识中受益,从而实现更加一致和有效的学习过程。在这项工作中,我们将该方法应用于仿真任务和使用Franka Panda机器人的真实任务,以展示人机协作。

摘要: Interactive Machine Learning (IML) seeks to integrate human expertise into machine learning processes. However, most existing algorithms cannot be applied to Realworld Scenarios because their state spaces and/or action spaces are limited to discrete values. Furthermore, the interaction of all existing methods is restricted to deciding between multiple proposals. We therefore propose a novel framework based on Bayesian Optimization (BO). Interactive Bayesian Optimization (IBO) enables collaboration between machine learning algorithms and humans. This framework captures user preferences and provides an interface for users to shape the strategy by hand. Additionally, we’ve incorporated a new acquisition function, Preference Expected Improvement (PEI), to refine the system’s efficiency using a probabilistic model of the user preferences. Our approach is geared towards ensuring that machines can benefit from human expertise, aiming for a more aligned and effective learning process. In the course of this work, we applied our method to simulations and in a real world task using a Franka Panda robot to show human-robot collaboration.
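
下面是一个带偏好加权的贝叶斯优化循环的简化示意(基于对 PEI 的一种假设性简化:用标准期望改进 EI 乘以用户偏好概率;论文中 PEI 的确切定义请以原文为准)。其中 `preference_prob` 的偏好模型、目标函数等均为示意性假设。

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(mu, sigma, best_y, xi=0.01):
    """标准 EI(最大化问题)。"""
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best_y - xi) / sigma
    return (mu - best_y - xi) * norm.cdf(z) + sigma * norm.pdf(z)

def preference_prob(X, pref_center, length=0.3):
    """用户偏好的概率模型(此处用以偏好点为中心的 RBF 作占位,真实系统应由交互数据学习)。"""
    d2 = np.sum((X - pref_center) ** 2, axis=1)
    return np.exp(-d2 / (2 * length ** 2))

def objective(x):                      # 待优化的黑盒目标(示例)
    return -np.sum((x - 0.7) ** 2, axis=-1)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(5, 2))     # 初始样本
y = objective(X)
pref_center = np.array([0.6, 0.8])     # 假设已从用户处引出的偏好区域

for _ in range(15):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X, y)
    cand = rng.uniform(0, 1, size=(2000, 2))
    mu, sigma = gp.predict(cand, return_std=True)
    pei = expected_improvement(mu, sigma, y.max()) * preference_prob(cand, pref_center)
    x_next = cand[np.argmax(pei)]      # "偏好期望改进":EI 乘以偏好概率(对论文 PEI 的简化假设)
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next))

print("最优点估计:", X[np.argmax(y)])
```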


标题: Modeling Resilience of Collaborative AI Systems

作者: Diaeddin Rimawi, Antonio Liotta, Marco Todescato

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2401.12632v1

中文摘要: 协作人工智能系统(CAIS)与人类协作执行动作,以实现共同的目标。CAIS可以使用训练好的AI模型来控制人机交互,也可以利用人类交互以在线方式动态地向人类学习。在带人类反馈的在线学习中,AI模型在学习状态下通过系统传感器监测人类交互而不断演化,并在运行状态下基于所学内容驱动CAIS的自主组件。因此,任何影响这些传感器的破坏性事件都可能影响AI模型做出准确决策的能力,并降低CAIS的性能。因此,对CAIS管理者来说,能够自动跟踪系统性能、了解CAIS在此类破坏性事件下的韧性至关重要。在本文中,我们提供了一个新的框架,用于对系统经历破坏性事件时的CAIS性能进行建模。在此框架下,我们引入了一个CAIS性能演化模型,该模型配备了一组度量,旨在支持CAIS管理者在决策过程中实现系统所需的韧性。我们在一个真实案例研究中测试了该框架:一个机器人在系统经历破坏性事件时与人类在线协作。案例研究表明,我们的框架可以在CAIS中采用,并集成到CAIS活动的在线执行中。

摘要: A Collaborative Artificial Intelligence System (CAIS) performs actions in collaboration with the human to achieve a common goal. CAISs can use a trained AI model to control human-system interaction, or they can use human interaction to dynamically learn from humans in an online fashion. In online learning with human feedback, the AI model evolves by monitoring human interaction through the system sensors in the learning state, and actuates the autonomous components of the CAIS based on the learning in the operational state. Therefore, any disruptive event affecting these sensors may affect the AI model’s ability to make accurate decisions and degrade the CAIS performance. Consequently, it is of paramount importance for CAIS managers to be able to automatically track the system performance to understand the resilience of the CAIS upon such disruptive events. In this paper, we provide a new framework to model CAIS performance when the system experiences a disruptive event. With our framework, we introduce a model of performance evolution of CAIS. The model is equipped with a set of measures that aim to support CAIS managers in the decision process to achieve the required resilience of the system. We tested our framework on a real-world case study of a robot collaborating online with the human, when the system is experiencing a disruptive event. The case study shows that our framework can be adopted in CAIS and integrated into the online execution of the CAIS activities.
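
下面给出从性能时间序列中计算常见韧性指标的通用示意代码。注意:这里的指标定义(最大性能损失、恢复时间、面积型韧性)是韧性文献中的常见写法,并非论文提出的那套具体度量,仅用于说明"跟踪破坏性事件前后系统性能演化"的思路。

```python
import numpy as np

def resilience_metrics(perf, t_disrupt, baseline=None, tol=0.95):
    """从性能时间序列中提取常见的韧性指标(通用定义,非论文原始度量)。
    perf: 每个时间步的系统性能; t_disrupt: 破坏性事件发生的时间步。"""
    perf = np.asarray(perf, dtype=float)
    if baseline is None:
        baseline = perf[:t_disrupt].mean()          # 事件前的基准性能
    after = perf[t_disrupt:]
    max_drop = baseline - after.min()               # 最大性能损失
    recovered = np.nonzero(after >= tol * baseline)[0]
    time_to_recover = int(recovered[0]) if recovered.size else None
    # 面积型韧性:事件后实际性能与基准性能之比的平均值(越接近 1 韧性越好)
    area_resilience = float(np.clip(after, 0, None).sum() / (baseline * after.size))
    return {"baseline": baseline, "max_drop": max_drop,
            "time_to_recover": time_to_recover, "area_resilience": area_resilience}

# 示例:第 10 步发生传感器故障,性能骤降后逐渐恢复
perf = [1.0] * 10 + [0.4, 0.5, 0.65, 0.8, 0.92, 0.97, 1.0, 1.0]
print(resilience_metrics(perf, t_disrupt=10))
```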


标题: Chat Failures and Troubles: Reasons and Solutions

作者: Manal Helal, Patrick Holthaus, Gabriella Lakatos

PubTime: 2024-01-18

Downlink: http://arxiv.org/abs/2309.03708v2

中文摘要: 本文研究了人机交互(HRI)中导致聊天失败和故障的一些常见问题。针对给定用例的设计决策包括:选择合适的机器人和聊天模型、识别导致故障的常见问题、确定潜在的解决方案,以及规划持续改进。总之,建议使用闭环控制算法来指导预训练人工智能(AI)模型的使用,并结合词汇过滤、在新数据集上批量重新训练模型、从数据流中在线学习,和/或使用强化学习模型来自我更新已训练的模型,从而减少错误。

摘要: This paper examines some common problems in Human-Robot Interaction (HRI) causing failures and troubles in Chat. A given use case’s design decisions start with the suitable robot, the suitable chatting model, identifying common problems that cause failures, identifying potential solutions, and planning continuous improvement. In conclusion, it is recommended to use a closed-loop control algorithm that guides the use of trained Artificial Intelligence (AI) pre-trained models and provides vocabulary filtering, re-train batched models on new datasets, learn online from data streams, and/or use reinforcement learning models to self-update the trained models and reduce errors.
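
下面用一个极简的 Python 草图示意文中建议的闭环模式:生成回复、做词汇过滤、把失败案例记录下来供后续重训或在线学习。其中 `generate_reply` 只是占位函数,并非任何真实对话模型的 API。

```python
BLOCKLIST = {"damn", "shut up"}           # 需要过滤的词表(示例)
failure_log = []                          # 失败样本留待批量重训或在线学习

def generate_reply(user_msg):
    """占位函数:实际系统中由预训练对话模型(LLM)生成回复。"""
    return "I did not understand that." if len(user_msg) < 3 else f"You said: {user_msg}"

def vocabulary_filter(text):
    return not any(bad in text.lower() for bad in BLOCKLIST)

def closed_loop_chat(user_msg, max_retries=2):
    """闭环控制:生成 -> 校验 -> 不合格则记录并重试/降级。"""
    for _ in range(max_retries + 1):
        reply = generate_reply(user_msg)
        if vocabulary_filter(reply) and "did not understand" not in reply:
            return reply
        failure_log.append((user_msg, reply))     # 记录失败案例,供重训或 RL 自更新使用
    return "Sorry, could you rephrase that?"      # 安全降级回复

print(closed_loop_chat("Hello robot"))
print(closed_loop_chat("hi"))
print("failures:", failure_log)
```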


标题: Self context-aware emotion perception on human-robot interaction

作者: Zihan Lin, Francisco Cruz, Eduardo Benitez Sandoval

PubTime: 2024-01-18

Downlink: http://arxiv.org/abs/2401.10946v1

中文摘要: 情感识别在人机交互的各个领域都起着至关重要的作用。在与人类的长期交互中,机器人需要持续且准确地做出反应;然而,主流的情绪识别方法大多侧重于短期情绪识别,而忽视了情绪被感知时所处的上下文。人类在感知情绪时会考虑上下文信息,不同的上下文可能导致完全不同的情绪表达。在本文中,我们介绍了自我上下文感知模型(SCAM),该模型采用二维情绪坐标系来锚定并重新标注不同的情绪,同时结合了其独特的信息保留结构和上下文损失。该方法在音频、视觉和多模态设置上均带来了显著提升:听觉模态的准确率从63.10%上升到72.46%,视觉模态的准确率从77.03%提高到80.82%,多模态的准确率从77.48%上升到78.93%。未来,我们将通过心理学实验在机器人上验证SCAM的可靠性和可用性。

摘要: Emotion recognition plays a crucial role in various domains of human-robot interaction. In long-term interactions with humans, robots need to respond continuously and accurately, however, the mainstream emotion recognition methods mostly focus on short-term emotion recognition, disregarding the context in which emotions are perceived. Humans consider that contextual information and different contexts can lead to completely different emotional expressions. In this paper, we introduce self context-aware model (SCAM) that employs a two-dimensional emotion coordinate system for anchoring and re-labeling distinct emotions. Simultaneously, it incorporates its distinctive information retention structure and contextual loss. This approach has yielded significant improvements across audio, video, and multimodal. In the auditory modality, there has been a notable enhancement in accuracy, rising from 63.10% to 72.46%. Similarly, the visual modality has demonstrated improved accuracy, increasing from 77.03% to 80.82%. In the multimodal, accuracy has experienced an elevation from 77.48% to 78.93%. In the future, we will validate the reliability and usability of SCAM on robots through psychology experiments.
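
下面是"在二维情绪坐标系中锚定并重新标注情绪"这一思路的极简示意。坐标轴按常见的效价-唤醒(valence-arousal)空间假设,锚点数值为虚构示例,并非论文给定的参数。

```python
import numpy as np

# 假设的二维情绪坐标(valence, arousal),仅作示意,非论文给定的锚点
ANCHORS = {
    "happy":   ( 0.8,  0.5),
    "angry":   (-0.6,  0.8),
    "sad":     (-0.7, -0.5),
    "calm":    ( 0.4, -0.6),
    "neutral": ( 0.0,  0.0),
}

def relabel(valence, arousal):
    """把连续的情绪坐标重新锚定到最近的离散情绪标签。"""
    p = np.array([valence, arousal])
    dists = {name: np.linalg.norm(p - np.array(xy)) for name, xy in ANCHORS.items()}
    return min(dists, key=dists.get)

# 例:同一段语音在不同上下文下得到的坐标,可能被重新标注为不同情绪
print(relabel(0.7, 0.4))    # -> happy
print(relabel(-0.1, 0.1))   # -> neutral
```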


== Reinforcement Learning ==

标题: HAZARD Challenge: Embodied Decision Making in Dynamically Changing Environments

作者: Qinhong Zhou, Sunli Chen, Yisong Wang

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2401.12975v1

Project: https://vis-www.cs.umass.edu/hazard/.|

中文摘要: 高保真虚拟环境的最新进展是构建能够感知、推理并与物理世界交互的具身智能体的主要驱动力之一。通常,这些环境除非智能体与之交互,否则保持不变。然而,在现实场景中,智能体还可能面临以意外事件为特征、动态变化的环境,并且需要迅速做出相应的行动。为了弥补这一差距,我们提出了一个新的模拟具身基准,称为HAZARD,专门用于评估具身智能体在动态情况下的决策能力。HAZARD由火灾、洪水和大风三种意外灾难场景组成,并特别支持利用大型语言模型(LLMs)来辅助常识推理和决策。该基准使我们能够评估自主智能体在强化学习(RL)、基于规则和基于搜索等多种管线下,于动态变化环境中的决策能力。作为使用大型语言模型应对这一挑战的第一步,我们进一步开发了一个基于LLM的智能体,并对其解决这些挑战性任务的前景和难点进行了深入分析。HAZARD可在 https://vis-www.cs.umass.edu/hazard/ 获取。

摘要: Recent advances in high-fidelity virtual environments serve as one of the major driving forces for building intelligent embodied agents to perceive, reason and interact with the physical world. Typically, these environments remain unchanged unless agents interact with them. However, in real-world scenarios, agents might also face dynamically changing environments characterized by unexpected events and need to rapidly take action accordingly. To remedy this gap, we propose a new simulated embodied benchmark, called HAZARD, specifically designed to assess the decision-making abilities of embodied agents in dynamic situations. HAZARD consists of three unexpected disaster scenarios, including fire, flood, and wind, and specifically supports the utilization of large language models (LLMs) to assist common sense reasoning and decision-making. This benchmark enables us to evaluate autonomous agents’ decision-making capabilities across various pipelines, including reinforcement learning (RL), rule-based, and search-based methods in dynamically changing environments. As a first step toward addressing this challenge using large language models, we further develop an LLM-based agent and perform an in-depth analysis of its promise and challenge of solving these challenging tasks. HAZARD is available at https://vis-www.cs.umass.edu/hazard/.


标题: Personalized Algorithmic Recourse with Preference Elicitation

作者: Giovanni De Toni, Paolo Viappiani, Stefano Teso

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2205.13743v5

Project: https://openreview.net/forum?id=8sg2I9zXgO|

中文摘要: 算法追索(AR)是计算一系列动作的问题,用户执行这些动作后,即可推翻不希望出现的机器决策。至关重要的是,该动作序列不应要求用户付出过多努力。然而,大多数AR方法假设所有用户的动作成本相同,因此可能会向某些用户推荐代价过高、有失公平的追索方案。基于这一观察,我们提出了PEAR,这是第一个能够根据任何最终用户的需求提供个性化算法追索的人在回路方法。PEAR借鉴贝叶斯偏好引出(Bayesian Preference Elicitation)的思想,通过向目标用户提出选择集查询,迭代地细化对动作成本的估计。查询本身通过最大化"选择的期望效用"(Expected Utility of Selection)来计算,这是一种同时考虑成本估计和用户响应不确定性的原则性信息增益度量。PEAR将偏好引出集成到结合蒙特卡洛树搜索的强化学习智能体中,以快速找到有希望的追索方案。我们在真实数据集上的实证评估表明,PEAR只需少量迭代即可产生高质量的个性化追索方案。

摘要: Algorithmic Recourse (AR) is the problem of computing a sequence of actions that – once performed by a user – overturns an undesirable machine decision. It is paramount that the sequence of actions does not require too much effort for users to implement. Yet, most approaches to AR assume that actions cost the same for all users, and thus may recommend unfairly expensive recourse plans to certain users. Prompted by this observation, we introduce PEAR, the first human-in-the-loop approach capable of providing personalized algorithmic recourse tailored to the needs of any end-user. PEAR builds on insights from Bayesian Preference Elicitation to iteratively refine an estimate of the costs of actions by asking choice set queries to the target user. The queries themselves are computed by maximizing the Expected Utility of Selection, a principled measure of information gain accounting for uncertainty on both the cost estimate and the user’s responses. PEAR integrates elicitation into a Reinforcement Learning agent coupled with Monte Carlo Tree Search to quickly identify promising recourse plans. Our empirical evaluation on real-world datasets highlights how PEAR produces high-quality personalized recourse in only a handful of iterations.
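
下面是"选择集查询 + 选择的期望效用(EUS)"这一思路的简化示意:用粒子近似用户成本权重的后验,EUS 取为在该后验下用户从候选集中选到最优方案的期望效用,然后在少量候选集中挑 EUS 最大的一组作为下一个查询。方案、效用形式与粒子更新方式均为示意性假设,并非论文原始实现。

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# 对用户动作成本权重的后验用粒子近似(真实系统中应由贝叶斯偏好引出不断更新)
weight_particles = rng.dirichlet(np.ones(3), size=500)      # 500 个候选权重向量

# 候选追索方案:每行是各类动作的使用量,效用 = 收益 - 加权成本(此处均为假设)
plans = np.array([[1, 0, 2],
                  [0, 2, 1],
                  [2, 1, 0],
                  [1, 1, 1]], dtype=float)
benefit = np.array([3.0, 2.5, 2.8, 2.6])

def utilities(w):
    return benefit - plans @ w                               # 每个方案在权重 w 下的效用

def expected_utility_of_selection(choice_set):
    """EUS:在后验权重下,用户从候选集中选到的最优方案的期望效用。"""
    u = np.array([utilities(w)[list(choice_set)] for w in weight_particles])
    return u.max(axis=1).mean()

# 在所有大小为 2 的候选集里,挑 EUS 最大的那组作为下一次向用户提出的选择查询
best_query = max(itertools.combinations(range(len(plans)), 2),
                 key=expected_utility_of_selection)
print("下一次选择集查询:方案", best_query)
```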


标题: DexTouch: Learning to Seek and Manipulate Objects with Tactile Dexterity

作者: Kang-Won Lee, Yuzhe Qin, Xiaolong Wang

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2401.12496v1

Project: https://lee-kangwon.github.io/dextouch/|

中文摘要: 触觉是熟练执行各种任务的基本能力,提供了在不依赖视觉信息的情况下搜索和操纵对象的能力。随着时间的推移,人们进行了广泛的研究,将这些人类触觉能力应用于机器人。在本文中,我们介绍了一个多指机器人系统,该系统旨在利用触觉搜索和操纵物体,而不依赖于视觉信息。使用触觉传感器搜索随机定位的目标物体,并操纵这些物体完成模拟日常生活的任务。这项研究的目的是赋予机器人类似人类的触觉能力。为了实现这一点,在机器人手的一侧实现了二元触觉传感器,以最小化Sim2Real间隙。通过模拟中的强化学习来训练策略,并将训练好的策略转移到真实环境中,我们证明了即使在没有视觉信息的环境中,使用触觉传感器进行对象搜索和操纵也是可能的。此外,还进行了一项消融研究,以分析触觉信息对操作任务的影响。我们的项目页面可在https://lee-kangwon.github.io/dextouch/

摘要: The sense of touch is an essential ability for skillfully performing a variety of tasks, providing the capacity to search and manipulate objects without relying on visual information. Extensive research has been conducted over time to apply these human tactile abilities to robots. In this paper, we introduce a multi-finger robot system designed to search for and manipulate objects using the sense of touch without relying on visual information. Randomly located target objects are searched using tactile sensors, and the objects are manipulated for tasks that mimic daily-life. The objective of the study is to endow robots with human-like tactile capabilities. To achieve this, binary tactile sensors are implemented on one side of the robot hand to minimize the Sim2Real gap. Training the policy through reinforcement learning in simulation and transferring the trained policy to the real environment, we demonstrate that object search and manipulation using tactile sensors is possible even in an environment without vision information. In addition, an ablation study was conducted to analyze the effect of tactile information on manipulative tasks. Our project page is available at https://lee-kangwon.github.io/dextouch/
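
下面是一个小示意:把仿真中的连续接触力二值化成 0/1 触觉信号,再与关节状态拼接成不含视觉的策略观测,这也是文中"用二元触觉传感器缩小 Sim2Real 差距"的直观做法。阈值、维度等均为假设值。

```python
import numpy as np

FORCE_THRESHOLD = 0.5  # N,超过该力视为"接触";阈值为假设值

def binary_tactile(contact_forces):
    """把仿真中的连续接触力二值化,模拟真实手上安装的二元触觉传感器。"""
    return (np.asarray(contact_forces) > FORCE_THRESHOLD).astype(np.float32)

def build_observation(joint_pos, joint_vel, contact_forces):
    """策略输入:关节状态 + 二值触觉信号(不包含任何视觉信息)。"""
    return np.concatenate([joint_pos, joint_vel, binary_tactile(contact_forces)])

# 例:16 关节灵巧手、12 个触觉点
obs = build_observation(np.zeros(16), np.zeros(16),
                        np.array([0.0, 0.8, 0.2] * 4))
print(obs.shape, obs[-12:])   # 触觉部分只有 0/1,真实与仿真的观测形式因此一致
```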


标题: CivRealm: A Learning and Reasoning Odyssey in Civilization for Decision-Making Agents

作者: Siyuan Qi, Shuo Chen, Yexin Li

PubTime: 2024-01-19

Downlink: http://arxiv.org/abs/2401.10568v1

GitHub: https://github.com/bigai-ai/civrealm.|

摘要: The generalization of decision-making agents encompasses two fundamental elements: learning from past experiences and reasoning in novel contexts. However, the predominant emphasis in most interactive environments is on learning, often at the expense of complexity in reasoning. In this paper, we introduce CivRealm, an environment inspired by the Civilization game. Civilization’s profound alignment with human history and society necessitates sophisticated learning, while its ever-changing situations demand strong reasoning to generalize. Particularly, CivRealm sets up an imperfect-information general-sum game with a changing number of players; it presents a plethora of complex features, challenging the agent to deal with open-ended stochastic environments that require diplomacy and negotiation skills. Within CivRealm, we provide interfaces for two typical agent types: tensor-based agents that focus on learning, and language-based agents that emphasize reasoning. To catalyze further research, we present initial results for both paradigms. The canonical RL-based agents exhibit reasonable performance in mini-games, whereas both RL- and LLM-based agents struggle to make substantial progress in the full game. Overall, CivRealm stands as a unique learning and reasoning challenge for decision-making agents. The code is available at https://github.com/bigai-ai/civrealm.


标题: LangProp: A code optimization framework using Language Models applied to driving

作者: Shu Ishida, Gianluca Corrado, George Fedoseev

PubTime: 2024-01-18

Downlink: http://arxiv.org/abs/2401.10314v1

GitHub: https://github.com/shuishida/LangProp.|

中文摘要: LangProp是一个在监督/强化学习环境中迭代优化大型语言模型(LLMs)所生成代码的框架。虽然LLMs可以零样本地产生合理的解决方案,但这些解决方案往往是次优的;特别是对于代码生成任务,初始代码很可能在某些边缘情况下失败。LangProp自动在输入-输出对数据集上评估代码性能,并捕捉任何异常,然后在训练循环中将结果反馈给LLM,使LLM能够迭代改进其生成的代码。通过为该代码优化过程采用度量和数据驱动的训练范式,可以方便地借鉴传统机器学习技术(如模仿学习、DAgger和强化学习)中的研究成果。我们在CARLA中展示了自动驾驶代码自动优化的首个概念验证,表明LangProp可以生成可解释、透明的驾驶策略,并且这些策略可以用度量和数据驱动的方式进行验证和改进。我们的代码将开源,可从 https://github.com/shuishida/LangProp 获得。

摘要: LangProp is a framework for iteratively optimizing code generated by large language models (LLMs) in a supervised/reinforcement learning setting. While LLMs can generate sensible solutions zero-shot, the solutions are often sub-optimal. Especially for code generation tasks, it is likely that the initial code will fail on certain edge cases. LangProp automatically evaluates the code performance on a dataset of input-output pairs, as well as catches any exceptions, and feeds the results back to the LLM in the training loop, so that the LLM can iteratively improve the code it generates. By adopting a metric- and data-driven training paradigm for this code optimization procedure, one could easily adapt findings from traditional machine learning techniques such as imitation learning, DAgger, and reinforcement learning. We demonstrate the first proof of concept of automated code optimization for autonomous driving in CARLA, showing that LangProp can generate interpretable and transparent driving policies that can be verified and improved in a metric- and data-driven way. Our code will be open-sourced and is available at https://github.com/shuishida/LangProp.
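
下面用一个可运行的小例子示意 LangProp 所描述的训练循环:在输入-输出对上执行候选代码、捕获异常、汇总反馈,再交给(此处用占位函数代替的)LLM 重写。`ask_llm_to_improve` 为假设的占位函数,并非 LangProp 的真实接口。

```python
def evaluate_candidate(code_str, dataset, func_name="solve"):
    """在输入-输出对上运行候选代码,统计得分并收集异常信息作为反馈。"""
    namespace = {}
    try:
        exec(code_str, namespace)                       # 加载候选实现
        func = namespace[func_name]
    except Exception as e:
        return 0.0, [f"load error: {e!r}"]
    score, feedback = 0, []
    for x, y_true in dataset:
        try:
            y = func(x)
            score += int(y == y_true)
            if y != y_true:
                feedback.append(f"input={x!r}: got {y!r}, expected {y_true!r}")
        except Exception as e:                          # 捕获边界情况下的运行时异常
            feedback.append(f"input={x!r}: raised {e!r}")
    return score / len(dataset), feedback

def ask_llm_to_improve(code_str, feedback):
    """占位函数:真实框架中由 LLM 根据反馈重写代码。"""
    return code_str.replace("x * 2", "abs(x) * 2")      # 示意性的"改进"

dataset = [(2, 4), (5, 10), (-3, 6)]                     # 负数输入是故意设置的边界情况
code = "def solve(x):\n    return x * 2"
for step in range(3):                                    # 迭代优化循环
    acc, fb = evaluate_candidate(code, dataset)
    print(f"iter {step}: accuracy={acc:.2f}, feedback={fb}")
    if acc == 1.0:
        break
    code = ask_llm_to_improve(code, fb)
```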


标题: CQLite: Communication-Efficient Multi-Robot Exploration Using Coverage-biased Distributed Q-Learning

作者: Ehsan Latif, Ramviyas Parasuraman

PubTime: 2024-01-18

Downlink: http://arxiv.org/abs/2307.00500v2

GitHub: https://github.com/herolab-uga/cqlite|

中文摘要: 前沿探索和强化学习在历史上被用来解决让多个移动机器人自主且协作地探索复杂环境的问题。这些方法需要维护内部全局地图用于导航,但没有考虑机器人之间通信和信息共享的高昂成本。本研究提出了一种新的分布式Q学习技术CQLite,旨在最小化机器人之间的数据通信开销,同时在多机器人探索中实现快速收敛和彻底覆盖。所提出的CQLite方法采用自组织(ad hoc)地图合并,并只在最近识别出的前沿选择性地共享更新后的Q值,从而显著降低通信成本。对CQLite收敛性和效率的理论分析,以及利用多个机器人在模拟室内地图上进行的大量数值验证,证明了该方法的新颖性。凭借计算量和通信量超过2倍的降低以及更好的建图性能,CQLite超越了快速探索随机树和深度强化学习等前沿多机器人探索技术。相关代码开源于 https://github.com/herolab-uga/cqlite 。

摘要: Frontier exploration and reinforcement learning have historically been used to solve the problem of enabling many mobile robots to autonomously and cooperatively explore complex surroundings. These methods need to keep an internal global map for navigation, but they do not take into consideration the high costs of communication and information sharing between robots. This study offers CQLite, a novel distributed Q-learning technique designed to minimize data communication overhead between robots while achieving rapid convergence and thorough coverage in multi-robot exploration. The proposed CQLite method uses ad hoc map merging, and selectively shares updated Q-values at recently identified frontiers to significantly reduce communication costs. The theoretical analysis of CQLite’s convergence and efficiency, together with extensive numerical verification on simulated indoor maps utilizing several robots, demonstrates the method’s novelty. With over 2x reductions in computation and communication alongside improved mapping performance, CQLite outperformed cutting-edge multi-robot exploration techniques like Rapidly Exploring Random Trees and Deep Reinforcement Learning. Related codes are open-sourced at \url{https://github.com/herolab-uga/cqlite}.
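
下面是"只在新识别出的前沿选择性共享更新后的 Q 值"这一通信模式的简化示意;Q 表结构、合并规则(取较大值)等均为示意性假设,并非 CQLite 的原始实现。

```python
from collections import defaultdict

class CQLiteAgent:
    """示意:只在新发现的前沿处选择性共享 Q 值,以降低通信量(合并规则为假设)。"""
    def __init__(self, alpha=0.5, gamma=0.9):
        self.Q = defaultdict(float)         # 键:(前沿单元, 动作)
        self.alpha, self.gamma = alpha, gamma
        self.recently_updated = set()       # 本轮新探索前沿对应的键

    def update(self, frontier, action, reward, next_frontier, actions):
        key = (frontier, action)
        best_next = max((self.Q[(next_frontier, a)] for a in actions), default=0.0)
        self.Q[key] += self.alpha * (reward + self.gamma * best_next - self.Q[key])
        self.recently_updated.add(key)

    def outgoing_message(self):
        """只打包最近更新的条目,而不是整张 Q 表/全局地图。"""
        msg = {k: self.Q[k] for k in self.recently_updated}
        self.recently_updated.clear()
        return msg

    def merge(self, msg):
        for k, v in msg.items():            # 简单取较大值合并(示意)
            self.Q[k] = max(self.Q[k], v)

r1, r2 = CQLiteAgent(), CQLiteAgent()
r1.update(frontier=(3, 4), action="go", reward=1.0,
          next_frontier=(3, 5), actions=["go", "stay"])
r2.merge(r1.outgoing_message())             # 通信量只取决于新前沿条目的数量
print(dict(r2.Q))
```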


== Object Detection ==

标题: GALA: Generating Animatable Layered Assets from a Single Scan

作者: Taeksoo Kim, Byungjun Kim, Shunsuke Saito

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2401.12979v1

Project: https://snuvclab.github.io/gala/|

中文摘要: 我们介绍了GALA,这是一个以单层着装3D人体网格作为输入,并将其分解为完整多层3D资产的框架。其输出可以与其他资产组合,以创建具有任意姿态的新的着装人体化身。现有的重建方法通常将着装人体视为单层几何,忽略了人体与发型、衣物和配饰固有的组合性,从而限制了网格在下游应用中的效用。将单层网格分解为独立的层是一项具有挑战性的任务,因为它需要为严重遮挡的区域合成合理的几何形状和纹理;此外,即使分解成功,网格在姿态和体型方面也没有规范化,无法与新的身份和姿态进行连贯的组合。为了应对这些挑战,我们提出利用预训练2D扩散模型的通用知识,作为人体和其他资产的几何与外观先验。我们首先使用从多视图2D分割中提取的3D表面分割来分离输入网格;然后使用一种新的姿态引导分数蒸馏采样(SDS)损失,在姿态空间和规范空间中合成不同层的缺失几何;在完成高保真3D几何体的补全后,我们还将同样的SDS损失应用于其纹理,以获得包括最初被遮挡区域在内的完整外观。通过一系列分解步骤,我们在共享的规范空间中获得了按姿态和体型规范化的多层3D资产,因此可以轻松地与新的身份组合,并用新的姿态重新驱动。与现有方案相比,我们的实验证明了该方法在分解、规范化和组合任务上的有效性。

摘要: We present GALA, a framework that takes as input a single-layer clothed 3D human mesh and decomposes it into complete multi-layered 3D assets. The outputs can then be combined with other assets to create novel clothed human avatars with any pose. Existing reconstruction approaches often treat clothed humans as a single-layer of geometry and overlook the inherent compositionality of humans with hairstyles, clothing, and accessories, thereby limiting the utility of the meshes for downstream applications. Decomposing a single-layer mesh into separate layers is a challenging task because it requires the synthesis of plausible geometry and texture for the severely occluded regions. Moreover, even with successful decomposition, meshes are not normalized in terms of poses and body shapes, failing coherent composition with novel identities and poses. To address these challenges, we propose to leverage the general knowledge of a pretrained 2D diffusion model as geometry and appearance prior for humans and other assets. We first separate the input mesh using the 3D surface segmentation extracted from multi-view 2D segmentations. Then we synthesize the missing geometry of different layers in both posed and canonical spaces using a novel pose-guided Score Distillation Sampling (SDS) loss. Once we complete inpainting high-fidelity 3D geometry, we also apply the same SDS loss to its texture to obtain the complete appearance including the initially occluded regions. Through a series of decomposition steps, we obtain multiple layers of 3D assets in a shared canonical space normalized in terms of poses and human shapes, hence supporting effortless composition to novel identities and reanimation with novel poses. Our experiments demonstrate the effectiveness of our approach for decomposition, canonicalization, and composition tasks compared to existing solutions.
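
GALA 依赖的姿态引导 SDS 损失建立在标准的分数蒸馏采样之上。作为参考,DreamFusion 式的标准 SDS 梯度可写为下式(GALA 的变体在此基础上加入姿态等条件,具体形式以论文为准):

```latex
\nabla_{\theta}\,\mathcal{L}_{\mathrm{SDS}}(\theta)
  = \mathbb{E}_{t,\,\epsilon}\!\left[\, w(t)\,
      \big(\hat{\epsilon}_{\phi}(x_t;\, y,\, t) - \epsilon\big)\,
      \frac{\partial x(\theta)}{\partial \theta} \right],
\qquad x_t = \alpha_t\, x(\theta) + \sigma_t\,\epsilon,\quad \epsilon\sim\mathcal{N}(0, I)
```

其中 x(θ) 为可微渲染得到的图像,ε̂_φ 为预训练扩散模型在条件 y 与时间步 t 下的噪声预测,w(t) 为时间步权重。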


标题: Tracking Any Object Amodally

作者: Cheng-Yen Hsieh, Tarasha Khurana, Achal Dave

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2312.12433v2

Project: https://tao-amodal.github.io|

中文摘要: 非模态感知(amodal perception),即从部分可见信息中理解物体完整结构的能力,是一项即使婴儿也具备的基本技能。它的意义延伸到自动驾驶等应用,在这些应用中,清楚地理解被严重遮挡的物体至关重要。然而,现代检测和跟踪算法经常忽略这一关键能力,这可能是因为大多数数据集只提供模态(可见部分)标注。为了解决非模态数据的稀缺问题,我们引入了TAO-Amodal基准,其在数千个视频序列中包含880个不同类别。我们的数据集为可见和被遮挡的物体(包括部分超出画面的物体)同时提供非模态和模态边界框。为了利用物体持久性增强非模态跟踪,我们使用一个轻量级插件模块amodal expander,通过在数百个视频序列上进行数据增强微调,将标准的模态跟踪器转换为非模态跟踪器。在TAO-Amodal上,我们对遮挡目标的检测和跟踪分别实现了3.3%和1.6%的提升;在以人为对象的评估中,与最先进的模态基线相比,我们的方法带来了2倍的显著改进。

摘要: Amodal perception, the ability to comprehend complete object structures from partial visibility, is a fundamental skill, even for infants. Its significance extends to applications like autonomous driving, where a clear understanding of heavily occluded objects is essential. However, modern detection and tracking algorithms often overlook this critical capability, perhaps due to the prevalence of modal annotations in most datasets. To address the scarcity of amodal data, we introduce the TAO-Amodal benchmark, featuring 880 diverse categories in thousands of video sequences. Our dataset includes amodal and modal bounding boxes for visible and occluded objects, including objects that are partially out-of-frame. To enhance amodal tracking with object permanence, we leverage a lightweight plug-in module, the amodal expander, to transform standard, modal trackers into amodal ones through fine-tuning on a few hundred video sequences with data augmentation. We achieve a 3.3% and 1.6% improvement on the detection and tracking of occluded objects on TAO-Amodal. When evaluated on people, our method produces dramatic improvements of 2x compared to state-of-the-art modal baselines.


标题: SegmentAnyBone: A Universal Model that Segments Any Bone at Any Location on MRI

作者: Hanxue Gu, Roy Colglazier, Haoyu Dong

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2401.12974v1

GitHub: https://github.com/mazurowski-lab/SegmentAnyBone.|

中文摘要: 磁共振成像(MRI)在放射学中至关重要,它提供了对人体的非侵入性和高质量的洞察。将MRI精确分割成不同的器官和组织将是非常有益的,因为这能对图像内容形成更高层次的理解,并实现对准确诊断和有效治疗计划至关重要的测量。具体而言,在MRI中分割骨骼将允许对肌肉骨骼状况进行更定量的评估,而这种评估在当前的放射学实践中基本缺失。骨MRI分割的难度体现在:公开可用的算法很有限,而且文献中的算法通常只针对特定的解剖区域。在我们的研究中,我们提出了一个通用、公开可用的深度学习模型,用于在多个标准MRI部位进行骨骼分割。所提出的模型可以在两种模式下运行:全自动分割和基于提示的分割。我们的贡献包括:(1)收集并标注了涵盖多种MRI协议的新数据集,包括跨不同解剖区域的300多个标注体数据和8485张标注切片;(2)研究了用于自动分割的几种标准网络架构和策略;(3)提出SegmentAnyBone,这是一种创新的基于基础模型的方法,扩展了Segment Anything Model(SAM);(4)我们的算法与以往方法的比较分析;以及(5)我们的算法在不同解剖部位、MRI序列以及外部数据集上的泛化分析。我们在 https://github.com/mazurowski-lab/SegmentAnyBone 公开发布了我们的模型。

摘要: Magnetic Resonance Imaging (MRI) is pivotal in radiology, offering non-invasive and high-quality insights into the human body. Precise segmentation of MRIs into different organs and tissues would be highly beneficial since it would allow for a higher level of understanding of the image content and enable important measurements, which are essential for accurate diagnosis and effective treatment planning. Specifically, segmenting bones in MRI would allow for more quantitative assessments of musculoskeletal conditions, while such assessments are largely absent in current radiological practice. The difficulty of bone MRI segmentation is illustrated by the fact that limited algorithms are publicly available for use, and those contained in the literature typically address a specific anatomic area. In our study, we propose a versatile, publicly available deep-learning model for bone segmentation in MRI across multiple standard MRI locations. The proposed model can operate in two modes: fully automated segmentation and prompt-based segmentation. Our contributions include (1) collecting and annotating a new MRI dataset across various MRI protocols, encompassing over 300 annotated volumes and 8485 annotated slices across diverse anatomic regions; (2) investigating several standard network architectures and strategies for automated segmentation; (3) introducing SegmentAnyBone, an innovative foundational model-based approach that extends Segment Anything Model (SAM); (4) comparative analysis of our algorithm and previous approaches; and (5) generalization analysis of our algorithm across different anatomical locations and MRI sequences, as well as an external dataset. We publicly release our model at https://github.com/mazurowski-lab/SegmentAnyBone.
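
由于 SegmentAnyBone 基于 SAM 构建,下面用公开的 segment_anything 库示意其描述的两种工作模式(全自动分割与基于提示的分割)。其中的权重文件名为假设,SegmentAnyBone 的实际加载与调用方式请以其 GitHub 仓库为准。

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor, SamAutomaticMaskGenerator

# 以下权重文件名为假设;SegmentAnyBone 的实际加载方式请以其仓库 README 为准
sam = sam_model_registry["vit_b"](checkpoint="segment_any_bone_vit_b.pth")

image = (np.random.rand(256, 256, 3) * 255).astype(np.uint8)   # 这里用随机图代替一张 MRI 切片

# 模式一:全自动分割,对整幅切片生成候选掩码
auto_masks = SamAutomaticMaskGenerator(sam).generate(image)
print("自动模式得到掩码数:", len(auto_masks))

# 模式二:基于提示的分割,给出骨骼上的一个正样本点
predictor = SamPredictor(sam)
predictor.set_image(image)
masks, scores, _ = predictor.predict(point_coords=np.array([[128, 96]]),
                                     point_labels=np.array([1]))
print("提示模式掩码形状:", masks.shape, "置信度:", scores)
```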


标题: Self-supervised Learning of LiDAR 3D Point Clouds via 2D-3D Neural Calibration

作者: Yifan Zhang, Siyu Ren, Junhui Hou

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2401.12452v1

GitHub: https://github.com/Eaphan/NCLR|

摘要: This paper introduces a novel self-supervised learning framework for enhancing 3D perception in autonomous driving scenes. Specifically, our approach, named NCLR, focuses on 2D-3D neural calibration, a novel pretext task that estimates the rigid transformation aligning camera and LiDAR coordinate systems. First, we propose the learnable transformation alignment to bridge the domain gap between image and point cloud data, converting features into a unified representation space for effective comparison and matching. Second, we identify the overlapping area between the image and point cloud with the fused features. Third, we establish dense 2D-3D correspondences to estimate the rigid transformation. The framework not only learns fine-grained matching from points to pixels but also achieves alignment of the image and point cloud at a holistic level, understanding their relative pose. We demonstrate NCLR’s efficacy by applying the pre-trained backbone to downstream tasks, such as LiDAR-based 3D semantic segmentation, object detection, and panoptic segmentation. Comprehensive experiments on various datasets illustrate the superiority of NCLR over existing self-supervised methods. The results confirm that joint learning from different modalities significantly enhances the network’s understanding abilities and effectiveness of learned representation. Code will be available at \url{https://github.com/Eaphan/NCLR}.
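
在建立了稠密的 2D-3D 对应之后,相机与 LiDAR 之间的刚体变换可以用标准的 PnP 求解器恢复。下面用 OpenCV 的 solvePnPRansac 在合成对应点上做一个小示意(这只演示最后一步的几何求解,并非论文的可学习对齐流程):

```python
import cv2
import numpy as np

rng = np.random.default_rng(0)
K = np.array([[720.0, 0, 320.0], [0, 720.0, 240.0], [0, 0, 1.0]])   # 假设的相机内参

# 构造一组合成的 3D 点(LiDAR 坐标系)及其在图像上的投影,模拟网络输出的稠密对应
pts_3d = rng.uniform([-2, -2, 4], [2, 2, 10], size=(200, 3))
rvec_gt = np.array([0.05, -0.1, 0.02])
tvec_gt = np.array([0.2, -0.1, 0.3])
proj, _ = cv2.projectPoints(pts_3d, rvec_gt, tvec_gt, K, None)
pts_2d = proj.reshape(-1, 2) + rng.normal(0, 0.5, size=(200, 2))    # 加入像素噪声

# 用 PnP + RANSAC 从 2D-3D 对应中恢复刚体变换(即相机-LiDAR 外参)
ok, rvec, tvec, inliers = cv2.solvePnPRansac(pts_3d.astype(np.float32),
                                             pts_2d.astype(np.float32), K, None)
R, _ = cv2.Rodrigues(rvec)
print("估计平移:", tvec.ravel(), " 内点数:", len(inliers))
```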


标题: NIV-SSD: Neighbor IoU-Voting Single-Stage Object Detector From Point Cloud

作者: Shuai Liu, Di Wang, Quan Wang

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2401.12447v1

GitHub: https://github.com/Say2L/NIV-SSD.|

中文摘要: 以前的单级检测器通常遭受定位精度和分类置信度之间的不一致。为了解决错位问题,我们引入了一种新的纠正方法,称为邻居IoU投票(NIV)策略。通常,分类和回归被视为独立的分支,因此很难在它们之间建立联系。因此,分类置信度不能准确反映回归质量。NIV策略可以作为分类和回归分支之间的桥梁,通过从回归输出中计算两种类型的统计数据来校正分类置信度。此外,为了缓解点密集的完整对象(容易对象)和点稀疏的不完整对象(困难对象)检测精度的不平衡,我们提出了一种新的数据扩充方案,称为对象重采样。它通过将简单对象的一部分随机转换为困难对象来对简单对象进行欠采样和对困难对象进行过采样。最后,结合NIV策略和目标重采样增强,我们设计了一种高效的单级检测器,称为NIV-SSD。在几个数据集上的大量实验表明了NIV策略的有效性和NIV-SSD检测器的竞争性能。代码将在https://github.com/Say2L/NIV-SSD上提供。

摘要: Previous single-stage detectors typically suffer the misalignment between localization accuracy and classification confidence. To solve the misalignment problem, we introduce a novel rectification method named neighbor IoU-voting (NIV) strategy. Typically, classification and regression are treated as separate branches, making it challenging to establish a connection between them. Consequently, the classification confidence cannot accurately reflect the regression quality. NIV strategy can serve as a bridge between classification and regression branches by calculating two types of statistical data from the regression output to correct the classification confidence. Furthermore, to alleviate the imbalance of detection accuracy for complete objects with dense points (easy objects) and incomplete objects with sparse points (difficult objects), we propose a new data augmentation scheme named object resampling. It undersamples easy objects and oversamples difficult objects by randomly transforming part of easy objects into difficult objects. Finally, combining the NIV strategy and object resampling augmentation, we design an efficient single-stage detector termed NIV-SSD. Extensive experiments on several datasets indicate the effectiveness of the NIV strategy and the competitive performance of the NIV-SSD detector. The code will be available at https://github.com/Say2L/NIV-SSD.
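
下面把"邻居 IoU 投票"的核心思想写成一个简化示意:用邻近预测框与当前框的 IoU 统计量来修正分类置信度,回归越一致则置信度越高。其中的邻居阈值与融合方式(几何平均)均为假设,并非论文的精确公式。

```python
import numpy as np

def iou(a, b):
    """轴对齐框 IoU,框格式 [x1, y1, x2, y2]。"""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def niv_rectify(boxes, scores, neighbor_thr=0.3):
    """邻居 IoU 投票(简化示意):邻居框与当前框的平均 IoU 越高,
    说明回归结果越一致,分类置信度相应上调;反之下调。"""
    boxes, scores = np.asarray(boxes, float), np.asarray(scores, float)
    rectified = scores.copy()
    for i in range(len(boxes)):
        ious = [iou(boxes[i], boxes[j]) for j in range(len(boxes)) if j != i]
        neighbors = [v for v in ious if v > neighbor_thr]
        consistency = np.mean(neighbors) if neighbors else 0.0
        rectified[i] = np.sqrt(scores[i] * consistency)   # 融合方式为假设(几何平均)
    return rectified

boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [0.5, 0, 10.5, 10], [40, 40, 50, 50]]
scores = [0.9, 0.85, 0.8, 0.9]
print(niv_rectify(boxes, scores))   # 孤立的高分框(最后一个)被显著降权
```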


标题: MAST: Video Polyp Segmentation with a Mixture-Attention Siamese Transformer

作者: Geng Chen, Junqing Yang, Xiaozhou Pu

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2401.12439v1

GitHub: https://github.com/Junqing-Yang/MAST.|

中文摘要: 从结肠镜检查视频中准确分割息肉,对息肉治疗和结直肠癌的早期预防具有重要意义。然而,由于难以对结肠镜视频中的长程时空关系建模,这一任务颇具挑战。在本文中,我们用一种新的混合注意力孪生Transformer(Mixture-Attention Siamese Transformer,MAST)来解决这一具有挑战性的任务,它通过混合注意力机制显式地建模长程时空关系,以实现精确的息肉分割。具体来说,我们首先构建一个孪生Transformer架构,对成对的视频帧进行联合编码以获得特征表示;然后设计一个混合注意力模块来利用帧内和帧间的相关性,用丰富的时空关系增强特征;最后,增强后的特征被送入两个并行解码器以预测分割图。据我们所知,MAST是第一个专用于视频息肉分割的Transformer模型。在大规模SUN-SEG基准上的大量实验表明,与最先进的竞争方法相比,MAST具有更优的性能。我们的代码已公开于 https://github.com/Junqing-Yang/MAST 。

摘要: Accurate segmentation of polyps from colonoscopy videos is of great significance to polyp treatment and early prevention of colorectal cancer. However, it is challenging due to the difficulties associated with modelling long-range spatio-temporal relationships within a colonoscopy video. In this paper, we address this challenging task with a novel Mixture-Attention Siamese Transformer (MAST), which explicitly models the long-range spatio-temporal relationships with a mixture-attention mechanism for accurate polyp segmentation. Specifically, we first construct a Siamese transformer architecture to jointly encode paired video frames for their feature representations. We then design a mixture-attention module to exploit the intra-frame and inter-frame correlations, enhancing the features with rich spatio-temporal relationships. Finally, the enhanced features are fed to two parallel decoders for predicting the segmentation maps. To the best of our knowledge, our MAST is the first transformer model dedicated to video polyp segmentation. Extensive experiments on the large-scale SUN-SEG benchmark demonstrate the superior performance of MAST in comparison with the cutting-edge competitors. Our code is publicly available at https://github.com/Junqing-Yang/MAST.
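
下面用 PyTorch 写一个混合注意力模块的简化示意:对孪生编码得到的当前帧/参考帧特征分别做帧内自注意力与帧间交叉注意力,再用一个可学习系数加权混合。维度、门控形式等均为示意性假设。

```python
import torch
import torch.nn as nn

class MixtureAttention(nn.Module):
    """简化示意:帧内自注意力 + 帧间交叉注意力,用可学习权重加权混合。"""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.intra = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inter = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.tensor(0.5))      # 混合系数(此处用单个标量,为假设)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feat_cur, feat_ref):
        # feat_*: (B, N, C),由孪生编码器分别对当前帧与参考帧编码得到
        intra_out, _ = self.intra(feat_cur, feat_cur, feat_cur)   # 帧内相关性
        inter_out, _ = self.inter(feat_cur, feat_ref, feat_ref)   # 帧间相关性
        g = torch.sigmoid(self.gate)
        return self.norm(feat_cur + g * intra_out + (1 - g) * inter_out)

B, N, C = 2, 196, 256                     # 2 个样本、14x14 个 token、256 维特征
cur, ref = torch.randn(B, N, C), torch.randn(B, N, C)
fused = MixtureAttention(C)(cur, ref)
print(fused.shape)                        # torch.Size([2, 196, 256])
```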


