关注{晓理紫|小李子},每日更新论文,如感兴趣,请转发给有需要的同学,谢谢支持
分类:
- 大语言模型LLM
- 视觉模型VLM
- 扩散模型
- 视觉导航
- 具身智能,机器人
- 强化学习
- 开放词汇,检测分割
PubTime: 2024-01-22
Downlink: http://arxiv.org/abs/2401.11838v1
Project: https://osf.io/wzyf6
GitHub: https://github.com/LinusNEP/TCC_IRoNL.git
中文摘要: 近年来,自主代理在我们的家庭、办公室和公共场所等现实世界环境中迅速普及。然而,自然的人机交互仍然是一个关键挑战。在本文中,我们介绍了一种协同利用大型语言模型(LLMs)和多模态视觉语言模型(VLMs)能力的方法,使人类能够通过对话与自主机器人进行自然交互。我们利用LLMs解码来自人类的高级自然语言指令,并将其抽象为精确的、机器人可执行的命令或查询。此外,我们利用VLMs提供对机器人任务环境的视觉和语义理解。99.13%的命令识别准确率和97.96%的命令执行成功率表明,我们的方法可以增强现实世界应用中的人机交互。本文的视频演示可以在 https://osf.io/wzyf6 找到,代码可以在我们的GitHub仓库(https://github.com/LinusNEP/TCC_IRoNL.git)找到。
摘要: In recent years, autonomous agents have surged in real-world environments such as our homes, offices, and public spaces. However, natural human-robot interaction remains a key challenge. In this paper, we introduce an approach that synergistically exploits the capabilities of large language models (LLMs) and multimodal vision-language models (VLMs) to enable humans to interact naturally with autonomous robots through conversational dialogue. We leveraged the LLMs to decode the high-level natural language instructions from humans and abstract them into precise, robot-actionable commands or queries. Further, we utilised the VLMs to provide a visual and semantic understanding of the robot’s task environment. Our results with 99.13% command recognition accuracy and 97.96% command execution success show that our approach can enhance human-robot interaction in real-world applications. The video demonstrations of this paper can be found at https://osf.io/wzyf6 and the code is available at our GitHub repository (https://github.com/LinusNEP/TCC_IRoNL.git).
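Below is a minimal, runnable sketch of the instruction-decoding step described above: an LLM turns a free-form utterance into a structured, robot-actionable command. The JSON command schema, the prompt, and the injected `llm` callable are illustrative assumptions, not the TCC_IRoNL interface.

```python
# Minimal sketch of decoding a natural-language instruction into a robot-actionable
# command. The command schema, prompt, and the injected `llm` callable are
# illustrative assumptions, not the paper's actual interface.
import json
from dataclasses import dataclass
from typing import Callable

@dataclass
class RobotCommand:
    action: str          # e.g. "goto", "describe_scene"
    target: str          # e.g. "kitchen", "red chair"

PROMPT = (
    "Translate the user instruction into JSON with keys 'action' and 'target'.\n"
    "Instruction: {instruction}\nJSON:"
)

def decode_instruction(instruction: str, llm: Callable[[str], str]) -> RobotCommand:
    """Ask an LLM to abstract a free-form instruction into a structured command."""
    raw = llm(PROMPT.format(instruction=instruction))
    fields = json.loads(raw)
    return RobotCommand(action=fields["action"], target=fields["target"])

# Stub LLM so the example runs without any model or API access.
fake_llm = lambda prompt: '{"action": "goto", "target": "kitchen"}'
print(decode_instruction("Please go to the kitchen", fake_llm))
```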
PubTime: 2024-01-22
Downlink: http://arxiv.org/abs/2304.03047v3
GitHub: https://github.com/MarSaKi/ETPNav
中文摘要: 视觉-语言导航是一项需要代理按照指令在环境中导航的任务。它在具身智能领域变得越来越重要,在自主导航、搜索救援以及人机交互方面具有潜在应用。在本文中,我们研究一个更实际但更具挑战性的对应设定——连续环境中的视觉语言导航(VLN-CE)。为了开发鲁棒的VLN-CE代理,我们提出了一个新的导航框架ETPNav,它专注于两项关键技能:1)抽象环境并生成远程导航计划的能力,以及2)在连续环境中进行避障控制的能力。ETPNav通过沿已走过的路径自组织预测的航路点来对环境进行在线拓扑建图,而无需先前的环境经验,这使代理能够将导航过程分解为高层规划和低层控制。同时,ETPNav利用基于Transformer的跨模态规划器,根据拓扑图和指令生成导航计划;该计划随后由避障控制器执行,该控制器利用试错启发式方法防止导航陷入障碍物。实验结果证明了该方法的有效性:在R2R-CE和RxR-CE数据集上,ETPNav相比之前的最新方法分别取得了超过10%和20%的提升。我们的代码可在 https://github.com/MarSaKi/ETPNav 获取。
摘要: Vision-language navigation is a task that requires an agent to follow instructions to navigate in environments. It becomes increasingly crucial in the field of embodied AI, with potential applications in autonomous navigation, search and rescue, and human-robot interaction. In this paper, we propose to address a more practical yet challenging counterpart setting - vision-language navigation in continuous environments (VLN-CE). To develop a robust VLN-CE agent, we propose a new navigation framework, ETPNav, which focuses on two critical skills: 1) the capability to abstract environments and generate long-range navigation plans, and 2) the ability of obstacle-avoiding control in continuous environments. ETPNav performs online topological mapping of environments by self-organizing predicted waypoints along a traversed path, without prior environmental experience. It privileges the agent to break down the navigation procedure into high-level planning and low-level control. Concurrently, ETPNav utilizes a transformer-based cross-modal planner to generate navigation plans based on topological maps and instructions. The plan is then performed through an obstacle-avoiding controller that leverages a trial-and-error heuristic to prevent navigation from getting stuck in obstacles. Experimental results demonstrate the effectiveness of the proposed method. ETPNav yields more than 10% and 20% improvements over prior state-of-the-art on R2R-CE and RxR-CE datasets, respectively. Our code is available at https://github.com/MarSaKi/ETPNav.
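The following toy sketch illustrates the kind of online topological mapping and high-level planning the abstract describes: predicted waypoints become graph nodes and a shortest-path search yields the long-range plan. The node-merging radius, the Dijkstra planner, and the toy coordinates are assumptions rather than ETPNav's implementation.

```python
# Toy sketch of online topological mapping and high-level planning in the spirit of the
# abstract: predicted waypoints become graph nodes, and a shortest-path search produces
# the long-range plan. Merge thresholds and the waypoint source are assumed.
import heapq, math

class TopoMap:
    def __init__(self, merge_radius=0.5):
        self.nodes, self.edges, self.merge_radius = {}, {}, merge_radius

    def _nearest(self, pos):
        best = min(self.nodes.items(),
                   key=lambda kv: math.dist(kv[1], pos), default=(None, None))
        if best[0] is not None and math.dist(best[1], pos) < self.merge_radius:
            return best[0]
        return None

    def add_waypoint(self, pos, from_node=None):
        node = self._nearest(pos)
        if node is None:                      # genuinely new place -> new node
            node = len(self.nodes)
            self.nodes[node] = pos
            self.edges[node] = {}
        if from_node is not None and from_node != node:
            d = math.dist(self.nodes[from_node], pos)
            self.edges[from_node][node] = d   # undirected traversability edge
            self.edges[node][from_node] = d
        return node

    def plan(self, start, goal):
        """Dijkstra over the topological graph -> sequence of node ids."""
        dist, prev, pq = {start: 0.0}, {}, [(0.0, start)]
        while pq:
            d, u = heapq.heappop(pq)
            if u == goal:
                break
            for v, w in self.edges[u].items():
                if d + w < dist.get(v, float("inf")):
                    dist[v], prev[v] = d + w, u
                    heapq.heappush(pq, (d + w, v))
        path = [goal]
        while path[-1] != start:
            path.append(prev[path[-1]])
        return path[::-1]

m = TopoMap()
a = m.add_waypoint((0, 0))
b = m.add_waypoint((1, 0), from_node=a)
c = m.add_waypoint((1, 1), from_node=b)
print(m.plan(a, c))   # -> [0, 1, 2]
```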
PubTime: 2024-01-22
Downlink: http://arxiv.org/abs/2401.12076v1
中文摘要: 镜像情感或动作等非语言社会线索,可以增强现实世界中人与人和人与机器人的互动。机器人平台和控制方法也会影响人们对人机交互的感知。然而,比较不同平台和控制方法下机器人模仿的研究仍然有限。我们的研究通过两个实验来填补这一空白:一个比较人们对iCub和Pepper机器人情感镜像的感知,另一个比较基于视觉的iCub控制与基于惯性测量单元(IMU)的iCub控制之间的动作镜像。我们发现,在镜像情感时,iCub机器人被认为比Pepper机器人更像人;在动作镜像任务中,基于视觉控制的iCub优于基于IMU控制的iCub。我们的发现表明,在人机交互(HRI)过程中,不同的机器人平台会影响人们对机器人镜像的感知,控制方法也会影响机器人的镜像表现。我们的工作为不同人形机器人在现实世界中的设计和应用提供了启示。
摘要: Mirroring non-verbal social cues such as affect or movement can enhance human-human and human-robot interactions in the real world. The robotic platforms and control methods also impact people’s perception of human-robot interaction. However, limited studies have compared robot imitation across different platforms and control methods. Our research addresses this gap by conducting two experiments comparing people’s perception of affective mirroring between the iCub and Pepper robots and movement mirroring between vision-based iCub control and Inertial Measurement Unit (IMU)-based iCub control. We discovered that the iCub robot was perceived as more humanlike than the Pepper robot when mirroring affect. A vision-based controlled iCub outperformed the IMU-based controlled one in the movement mirroring task. Our findings suggest that different robotic platforms impact people’s perception of robots’ mirroring during HRI. The control method also contributes to the robot’s mirroring performance. Our work sheds light on the design and application of different humanoid robots in the real world.
PubTime: 2024-01-22
Downlink: http://arxiv.org/abs/2401.12046v1
中文摘要: 许多复杂的机器人操作任务可以分解为一系列拾取和放置动作。训练机器人代理在许多不同的初始条件下学习该序列通常需要大量迭代或演示,尤其是在3D环境中。在这项工作中,我们提出了Fourier Transporter,它利用拾取-放置问题中的双重 SE(d)×SE(d) 对称性来实现更高的样本效率。Fourier Transporter是一种使用专家演示训练的开环行为克隆方法,用于在新环境中预测拾取-放置动作,并被约束为独立地利用拾取动作和放置动作各自的对称性。我们的方法利用纤维空间傅立叶变换,从而实现了内存高效的结构。我们在RLBench基准上测试了所提出的网络,并在各种任务中取得了最先进的结果。
摘要: Many complex robotic manipulation tasks can be decomposed into a sequence of pick and place actions. Training a robotic agent to learn this sequence over many different starting conditions typically requires many iterations or demonstrations, especially in 3D environments. In this work, we propose Fourier Transporter, which leverages the two-fold SE(d)×SE(d) symmetry in the pick-place problem to achieve much higher sample efficiency. Fourier Transporter is an open-loop behavior cloning method trained using expert demonstrations to predict pick-place actions in new environments, and it is constrained to incorporate the symmetries of the pick and place actions independently. Our method utilizes a fiber-space Fourier transformation that allows for a memory-efficient construction. We test our proposed network on the RLBench benchmark and achieve state-of-the-art results across various tasks.
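As a hedged illustration, one way to write down the bi-equivariance constraint implied by the "two-fold SE(d)×SE(d) symmetry" is given below; the notation f for the pick-place predictor and the factorisation of the observation into pick and place parts are assumptions, not the paper's formulation.

```latex
% If the pick observation is transformed by g1 and the place observation by g2,
% the predicted pick and place actions transform accordingly and independently
% (an illustrative statement of bi-equivariance, notation assumed):
f\bigl(g_1 \cdot o_{\mathrm{pick}},\; g_2 \cdot o_{\mathrm{place}}\bigr)
  = \bigl(g_1 \cdot a_{\mathrm{pick}},\; g_2 \cdot a_{\mathrm{place}}\bigr),
\qquad (a_{\mathrm{pick}}, a_{\mathrm{place}}) = f(o_{\mathrm{pick}}, o_{\mathrm{place}}),
\quad (g_1, g_2) \in \mathrm{SE}(d) \times \mathrm{SE}(d).
```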
PubTime: 2024-01-22
Downlink: http://arxiv.org/abs/2401.11776v1
中文摘要: 这项研究回顾了个性化对人机交互的影响。首先,简要描述了用于实现个性化的各种策略。其次,讨论了迄今为止已知的个性化的影响。它们与个性化参数、个性化特性、使用的技术以及它们相关的用例一起呈现。据观察,文献中已经讨论了各种积极影响,而可能的消极影响似乎需要进一步研究。
摘要: This study reviews the impact of personalization on human-robot interaction. Firstly, the various strategies used to achieve personalization are briefly described. Secondly, the effects of personalization known to date are discussed. They are presented along with the personalized parameters, personalized features, used technology, and use case they relate to. It is observed that various positive effects have been discussed in the literature while possible negative effects seem to require further investigation.
PubTime: 2024-01-22
Downlink: http://arxiv.org/abs/2401.11721v1
中文摘要: 目的——颅底手术在移除侧颅底的骨头时要求非常精确。机器人辅助可以减轻人类感觉运动限制的影响。然而,机器人的刚度和惯性会显著影响外科医生对工具与组织相互作用力的感知和控制。方法——我们提出了一种情境感知的力控制技术,旨在调节机器人辅助钻削过程中的相互作用力。从数字孪生环境导出的上下文交互信息用于增强感官知觉和抑制不期望的高力。结果——为了验证我们的方法,我们进行了初步的可行性实验,涉及一名医科学生和两名工科学生。该实验集中在皮质乳突切除术后关键结构周围的进一步钻孔。实验结果表明,与没有所提出的力控制的机器人辅助相比,与我们提出的控制方案相结合的机器人辅助有效地限制了不期望的相互作用力。结论——所提出的力控制技术显示出在机器人辅助颅底手术中显著减少不期望的相互作用力的前景。这些发现有助于在涉及侧颅底的复杂手术中提高手术精度和安全性的持续努力。
摘要: Purpose - Skullbase surgery demands exceptional precision when removing bone in the lateral skull base. Robotic assistance can alleviate the effect of human sensory-motor limitations. However, the stiffness and inertia of the robot can significantly impact the surgeon’s perception and control of the tool-to-tissue interaction forces. Methods - We present a situational-aware, force control technique aimed at regulating interaction forces during robot-assisted skullbase drilling. The contextual interaction information derived from the digital twin environment is used to enhance sensory perception and suppress undesired high forces. Results - To validate our approach, we conducted initial feasibility experiments involving a medical and two engineering students. The experiment focused on further drilling around critical structures following cortical mastoidectomy. The experiment results demonstrate that robotic assistance coupled with our proposed control scheme effectively limited undesired interaction forces when compared to robotic assistance without the proposed force control. Conclusions - The proposed force control techniques show promise in significantly reducing undesired interaction forces during robot-assisted skullbase surgery. These findings contribute to the ongoing efforts to enhance surgical precision and safety in complex procedures involving the lateral skull base.
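A small sketch of the context-aware force regulation idea: the admissible drilling force shrinks as the tool approaches critical structures known from the digital-twin model, and the commanded force is clamped to that limit. The distance thresholds, the linear ramp, and the force values are assumptions, not the paper's control law.

```python
# Illustrative sketch of context-aware force limiting: the admissible drilling force
# shrinks as the tool approaches critical structures known from a digital-twin model.
# The distance source, thresholds, and linear ramp are assumptions, not the paper's law.
import math

def allowed_force(distance_mm: float,
                  f_max: float = 3.0,      # N, far from critical anatomy (assumed)
                  f_min: float = 0.5,      # N, at the safety margin (assumed)
                  d_safe: float = 5.0) -> float:
    """Linearly ramp the force limit down as the tool nears a critical structure."""
    ratio = max(0.0, min(1.0, distance_mm / d_safe))
    return f_min + ratio * (f_max - f_min)

def regulate(commanded_force: float, distance_mm: float) -> float:
    """Clamp the commanded force to the context-dependent limit."""
    return math.copysign(min(abs(commanded_force), allowed_force(distance_mm)),
                         commanded_force)

print(regulate(2.5, distance_mm=1.0))   # near a critical structure -> strongly limited
print(regulate(2.5, distance_mm=10.0))  # far away -> passes through unchanged
```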
PubTime: 2024-01-22
Downlink: http://arxiv.org/abs/2308.02453v3
Project: https://srl-ethz.github.io/get-ball-rolling/ | https://youtu.be/YahsMhqNU8o
GitHub: https://github.com/srl-ethz/faive_gym_oss
中文摘要: 仿生、灵巧的机器人手有潜力复现人类能够完成的许多任务,并有望成为通用操作平台。强化学习(RL)框架的最新进展在四足运动和灵巧操作任务中取得了显著的性能。结合能够并行模拟数千个机器人的基于GPU的高度并行化仿真,基于RL的控制器变得更具可扩展性、更易上手。然而,为了将RL训练的策略带到现实世界中,我们既需要能输出可与物理致动器和传感器协同工作的策略的训练框架,也需要可以用易获得的材料制造、又足够坚固以运行交互式策略的硬件平台。本工作介绍了仿生肌腱驱动的Faive手及其系统架构,它使用肌腱驱动的滚动接触关节来实现可3D打印、坚固的高自由度手部设计。我们对手的每个元素进行建模并将其集成到GPU仿真环境中,用RL训练策略,并实现了灵巧的手内球体旋转技能向物理机器人手的零样本迁移。
摘要: Biomimetic, dexterous robotic hands have the potential to replicate many of the tasks that a human can do, and to achieve status as a general manipulation platform. Recent advances in reinforcement learning (RL) frameworks have achieved remarkable performance in quadrupedal locomotion and dexterous manipulation tasks. Combined with GPU-based highly parallelized simulations capable of simulating thousands of robots in parallel, RL-based controllers have become more scalable and approachable. However, in order to bring RL-trained policies to the real world, we require training frameworks that output policies that can work with physical actuators and sensors as well as a hardware platform that can be manufactured with accessible materials yet is robust enough to run interactive policies. This work introduces the biomimetic tendon-driven Faive Hand and its system architecture, which uses tendon-driven rolling contact joints to achieve a 3D printable, robust high-DoF hand design. We model each element of the hand and integrate it into a GPU simulation environment to train a policy with RL, and achieve zero-shot transfer of a dexterous in-hand sphere rotation skill to the physical robot hand.
PubTime: 2024-01-22
Downlink: http://arxiv.org/abs/2311.13373v4
GitHub: https://github.com/ZJLAB-AMMI/LLM4Teach
中文摘要: 最近的研究揭示了大型语言模型(LLMs)通过提供高级指令来解决复杂顺序决策任务的潜力。然而,基于LLM的代理在处理特定目标问题上缺乏专门化,特别是在实时动态环境中;此外,在实际场景中部署基于LLM的代理既昂贵又耗时。另一方面,强化学习(RL)方法训练的代理虽然专门针对目标任务,但往往面临采样效率低和探索成本高的问题。在本文中,我们提出了一个新框架,通过使用来自基于LLM的教师代理的指令来训练一个更小的、专门化的学生RL代理,以应对这些挑战。通过结合教师代理的指导,学生代理可以将LLM的先验知识蒸馏到自己的模型中,因此可以用少得多的数据进行训练;此外,通过环境反馈的进一步训练,学生代理在完成目标任务方面超越了其教师的能力。我们在专为具身智能研究设计的、具有挑战性的MiniGrid和Habitat环境上进行了实验,以评估框架的有效性。结果清楚地表明,与强基线方法相比,我们的方法取得了更优的性能。我们的代码可在 https://github.com/ZJLAB-AMMI/LLM4Teach 获取。
摘要: Recent studies have uncovered the potential of Large Language Models (LLMs) in addressing complex sequential decision-making tasks through the provision of high-level instructions. However, LLM-based agents lack specialization in tackling specific target problems, particularly in real-time dynamic environments. Additionally, deploying an LLM-based agent in practical scenarios can be both costly and time-consuming. On the other hand, reinforcement learning (RL) approaches train agents that specialize in the target task but often suffer from low sampling efficiency and high exploration costs. In this paper, we introduce a novel framework that addresses these challenges by training a smaller, specialized student RL agent using instructions from an LLM-based teacher agent. By incorporating the guidance from the teacher agent, the student agent can distill the prior knowledge of the LLM into its own model. Consequently, the student agent can be trained with significantly less data. Moreover, through further training with environment feedback, the student agent surpasses the capabilities of its teacher for completing the target task. We conducted experiments on challenging MiniGrid and Habitat environments, specifically designed for embodied AI research, to evaluate the effectiveness of our framework. The results clearly demonstrate that our approach achieves superior performance compared to strong baseline methods. Our code is available at https://github.com/ZJLAB-AMMI/LLM4Teach.
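A minimal numpy sketch of the teacher-guided training signal described above: the student policy is pulled toward the LLM teacher's suggested action distribution while still optimising an RL objective, with the teacher's weight decayed over training. The specific loss weighting, decay schedule, and teacher output format are assumptions rather than LLM4Teach's exact definitions.

```python
# Sketch of a teacher-guided loss: distillation toward the LLM teacher's suggested
# action distribution plus a policy-gradient term from environment feedback, with the
# teacher weight decayed over training. Weighting and schedule are assumptions.
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def guided_loss(student_logits, teacher_probs, action, advantage, step, decay=1e-3):
    """Distillation (cross-entropy to the teacher) + policy-gradient surrogate."""
    pi = softmax(student_logits)
    distill = -np.sum(teacher_probs * np.log(pi + 1e-8))    # imitate the teacher
    pg = -advantage * np.log(pi[action] + 1e-8)             # reinforce env feedback
    lam = np.exp(-decay * step)                             # teacher weight decays
    return lam * distill + (1.0 - lam) * pg

logits = np.array([0.2, 1.5, -0.3])          # student logits for 3 actions
teacher = np.array([0.1, 0.8, 0.1])          # teacher's soft instruction
print(guided_loss(logits, teacher, action=1, advantage=0.7, step=0))
print(guided_loss(logits, teacher, action=1, advantage=0.7, step=5000))
```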
PubTime: 2024-01-21
Downlink: http://arxiv.org/abs/2401.11458v1
GitHub: https://github.com/Wizardcoast/Linear_Alignment.git
摘要: The success of AI assistants based on Language Models (LLMs) hinges on Reinforcement Learning from Human Feedback (RLHF) to comprehend and align with user intentions. However, traditional alignment algorithms, such as PPO, are hampered by complex annotation and training requirements. This reliance limits the applicability of RLHF and hinders the development of professional assistants tailored to diverse human preferences. In this work, we introduce \textit{Linear Alignment}, a novel algorithm that aligns language models with human preferences in one single inference step, eliminating the reliance on data annotation and model training. Linear alignment incorporates a new parameterization for policy optimization under divergence constraints, which enables the extraction of optimal policy in a closed-form manner and facilitates the direct estimation of the aligned response. Extensive experiments on both general and personalized preference datasets demonstrate that linear alignment significantly enhances the performance and efficiency of LLM alignment across diverse scenarios. Our code and dataset will be published on \url{https://github.com/Wizardcoast/Linear_Alignment.git}.
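As a hedged illustration of a one-step, training-free alignment rule: under a KL-divergence constraint, reward-regularised policy optimisation has the well-known closed-form solution pi*(y|x) ∝ pi_0(y|x)·exp(preference(y|x)/beta), which can be applied at inference time by re-weighting the base model's token probabilities. Whether this matches Linear Alignment's exact parameterisation is an assumption.

```python
# Hedged sketch of a closed-form aligned distribution obtained by re-weighting the base
# model's probabilities with a preference signal and renormalising. This is the standard
# KL-constrained solution, assumed here as an illustration rather than the paper's method.
import numpy as np

def aligned_probs(base_probs: np.ndarray, preference: np.ndarray, beta: float = 1.0):
    """Re-weight base next-token probabilities with a preference signal, then renormalise."""
    w = base_probs * np.exp(preference / beta)
    return w / w.sum()

base = np.array([0.6, 0.3, 0.1])        # base LM distribution over 3 candidate tokens
pref = np.array([-1.0, 2.0, 0.0])       # per-token preference/reward estimate (assumed)
print(aligned_probs(base, pref, beta=1.0))
```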
PubTime: 2024-01-21
Downlink: http://arxiv.org/abs/2401.11437v1
GitHub: https://github.com/BruceGeLi/TCE_RL
中文摘要: 强化学习(RL)的当前进展主要集中在学习基于步的策略,即为每个感知到的状态生成动作。虽然这些方法有效地利用了来自环境交互的逐步信息,但它们通常忽略了动作之间的时间相关性,导致探索效率低下和轨迹不平滑,这在真实硬件上难以实施。回合式RL(ERL)试图通过在能够捕捉动作相关性的参数空间中探索来克服这些挑战。然而,这些方法通常会牺牲数据效率,因为它们将轨迹视为不透明的“黑箱”。在这项工作中,我们提出了一种新的ERL算法——时间相关回合式RL(TCE),它在回合式策略更新中有效利用逐步信息,打开了现有ERL方法中的“黑箱”,同时保留了参数空间中平滑一致的探索。TCE协同结合了基于步的RL和回合式RL的优势,在取得与最近的ERL方法相当性能的同时,保持了与最先进(SoTA)的基于步的RL相近的数据效率。
摘要: Current advancements in reinforcement learning (RL) have predominantly focused on learning step-based policies that generate actions for each perceived state. While these methods efficiently leverage step information from environmental interaction, they often ignore the temporal correlation between actions, resulting in inefficient exploration and unsmooth trajectories that are challenging to implement on real hardware. Episodic RL (ERL) seeks to overcome these challenges by exploring in a parameter space that captures the correlation of actions. However, these approaches typically compromise data efficiency, as they treat trajectories as opaque ‘black boxes’. In this work, we introduce a novel ERL algorithm, Temporally-Correlated Episodic RL (TCE), which effectively utilizes step information in episodic policy updates, opening the ‘black box’ in existing ERL methods while retaining the smooth and consistent exploration in parameter space. TCE synergistically combines the advantages of step-based and episodic RL, achieving comparable performance to recent ERL methods while maintaining data efficiency akin to state-of-the-art (SoTA) step-based RL.
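The toy numpy snippet below illustrates why exploring in parameter space gives temporally correlated, smooth trajectories: sampling the weights of a few basis functions once per episode yields a smooth action sequence, whereas independent per-step noise does not. The radial-basis parameterisation and dimensions are illustrative, not TCE's movement-primitive representation.

```python
# Toy illustration of episodic (parameter-space) exploration vs. step-based noise:
# one weight sample per episode drives a smooth basis-function trajectory, while
# per-step Gaussian noise is temporally uncorrelated. Basis choice is illustrative.
import numpy as np

rng = np.random.default_rng(0)
T, K = 100, 8                                   # timesteps, number of basis functions
t = np.linspace(0.0, 1.0, T)
centers = np.linspace(0.0, 1.0, K)
phi = np.exp(-((t[:, None] - centers[None, :]) ** 2) / (2 * 0.05 ** 2))  # RBF features
phi /= phi.sum(axis=1, keepdims=True)

w = rng.normal(size=K)                          # one parameter sample per episode
episodic_traj = phi @ w                         # smooth, correlated across time
stepwise_noise = rng.normal(size=T)             # step-based exploration, uncorrelated

print("episodic std of per-step increments :", np.std(np.diff(episodic_traj)))
print("step-based std of per-step increments:", np.std(np.diff(stepwise_noise)))
```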
PubTime: 2024-01-20
Downlink: http://arxiv.org/abs/2401.11237v1
GitHub: https://github.com/RajGhugare19/stitching-is-combinatorial-generalisation
中文摘要: 一些强化学习(RL)算法能够拼接经验片段,以解决训练中从未见过的任务。这一常被追求的特性,是基于动态规划的RL方法与基于监督学习(SL)的RL方法之间为数不多的区别之一。然而,某些基于现成SL算法的RL方法在没有显式拼接机制的情况下也取得了优异的结果;这些方法是否放弃了这一重要的拼接特性,目前尚不清楚。本文在到达目标状态和达到目标回报值这两类问题上研究了这一疑问。我们的主要结果表明,拼接特性对应于一种组合泛化:在对(状态,目标)对的分布进行训练之后,我们希望在训练数据中未曾一起出现过的(状态,目标)对上进行评估。我们的分析表明,这种泛化不同于独立同分布(i.i.d.)泛化。拼接与泛化之间的这种联系揭示了为什么我们不应期望基于SL的RL方法能够执行拼接,即使在大数据集和大模型的极限下也是如此。基于这一分析,我们构建了新的数据集来显式测试这一特性,结果显示基于SL的方法缺乏这种拼接特性,因而无法实现组合泛化。尽管如此,拼接与组合泛化之间的联系也为改进SL中的泛化提供了一个简单的方法:数据增强。我们提出了一种时序数据增强,并证明将其加入基于SL的方法后,能够使它们成功完成训练中未曾一起出现过的任务。从更高的层面看,这一联系说明了组合泛化对于时间序列数据(不仅是RL,还包括音频、视频或文本等任务)数据效率的重要性。
摘要: Some reinforcement learning (RL) algorithms can stitch pieces of experience to solve a task never seen before during training. This oft-sought property is one of the few ways in which RL methods based on dynamic-programming differ from RL methods based on supervised-learning (SL). Yet, certain RL methods based on off-the-shelf SL algorithms achieve excellent results without an explicit mechanism for stitching; it remains unclear whether those methods forgo this important stitching property. This paper studies this question for the problems of achieving a target goal state and achieving a target return value. Our main result is to show that the stitching property corresponds to a form of combinatorial generalization: after training on a distribution of (state, goal) pairs, one would like to evaluate on (state, goal) pairs not seen together in the training data. Our analysis shows that this sort of generalization is different from i.i.d. generalization. This connection between stitching and generalisation reveals why we should not expect SL-based RL methods to perform stitching, even in the limit of large datasets and models. Based on this analysis, we construct new datasets to explicitly test for this property, revealing that SL-based methods lack this stitching property and hence fail to perform combinatorial generalization. Nonetheless, the connection between stitching and combinatorial generalisation also suggests a simple remedy for improving generalisation in SL: data augmentation. We propose a temporal data augmentation and demonstrate that adding it to SL-based methods enables them to successfully complete tasks not seen together during training. On a high level, this connection illustrates the importance of combinatorial generalization for data efficiency in time-series data beyond RL, such as audio, video, or text.
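Below is a small, hedged sketch of temporal goal relabelling for goal-conditioned behaviour cloning: training pairs are augmented with goals drawn from a different trajectory that passes close to a state reached later in the current one, so (state, goal) combinations never observed together appear during training. This is one simple instantiation of the idea, not necessarily the paper's exact augmentation.

```python
# One simple form of temporal goal relabelling across trajectories (illustrative only):
# if another trajectory passes close to a state reached later in the current one, a goal
# beyond that junction is paired with the current state, creating unseen (state, goal) pairs.
import numpy as np

rng = np.random.default_rng(0)

def temporal_goal_augment(trajectories, n_aug=4, match_tol=0.2):
    """trajectories: list of (T_i, d) state arrays. Returns (state, time_idx, goal) triples."""
    aug = []
    for i, traj in enumerate(trajectories):
        for _ in range(n_aug):
            t = rng.integers(len(traj) - 1)
            future = traj[rng.integers(t + 1, len(traj))]      # a state reached later
            j = rng.integers(len(trajectories))
            other = trajectories[j]
            dists = np.linalg.norm(other - future, axis=1)
            k = int(dists.argmin())
            if i != j and dists[k] < match_tol:                # trajectories "intersect"
                goal = other[rng.integers(k, len(other))]      # goal beyond the junction
                aug.append((traj[t], t, goal))
    return aug

# Two toy 2-D trajectories that cross near (1, 1).
traj_a = np.stack([np.linspace(0, 2, 21), np.linspace(0, 2, 21)], axis=1)
traj_b = np.stack([np.linspace(2, 0, 21), np.linspace(0, 2, 21)], axis=1)
print(len(temporal_goal_augment([traj_a, traj_b], n_aug=20)))
```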
PubTime: 2024-01-20
Downlink: http://arxiv.org/abs/2303.05479v4
Project: https://nakamotoo.github.io/Cal-QL
中文摘要: 离线强化学习(RL)的一个引人注目的用例是:先从现有数据集获得策略初始化,然后通过有限的交互进行快速在线微调。然而,现有的离线RL方法在微调期间往往表现不佳。在本文中,我们设计了一种从离线数据中学习有效初始化的方法,它同时支持快速在线微调。我们的方法——校准Q学习(Cal-QL)——通过学习一个保守的值函数初始化来实现这一点:该初始化低估了从离线数据中学到的策略的值,同时又是经过校准的,即学到的Q值处于合理的量级。我们将这一性质称为校准,并将其正式定义为:为所学策略的真实值函数提供下界,同时为某个其他(次优)参考策略(可以就是行为策略)的值提供上界。我们证明,学习这种校准值函数的离线RL算法能够带来有效的在线微调,使我们能够在在线微调中利用离线初始化的好处。在实践中,Cal-QL可以在用于离线RL的保守Q学习(CQL)之上实现,只需改动一行代码。实验上,Cal-QL在本文研究的11个微调基准任务中的9个上优于最先进的方法。代码和视频可在 https://nakamotoo.github.io/Cal-QL 获取。
摘要: A compelling use case of offline reinforcement learning (RL) is to obtain a policy initialization from existing datasets followed by fast online fine-tuning with limited interaction. However, existing offline RL methods tend to behave poorly during fine-tuning. In this paper, we devise an approach for learning an effective initialization from offline data that also enables fast online fine-tuning capabilities. Our approach, calibrated Q-learning (Cal-QL), accomplishes this by learning a conservative value function initialization that underestimates the value of the learned policy from offline data, while also being calibrated, in the sense that the learned Q-values are at a reasonable scale. We refer to this property as calibration, and define it formally as providing a lower bound on the true value function of the learned policy and an upper bound on the value of some other (suboptimal) reference policy, which may simply be the behavior policy. We show that offline RL algorithms that learn such calibrated value functions lead to effective online fine-tuning, enabling us to take the benefits of offline initializations in online fine-tuning. In practice, Cal-QL can be implemented on top of the conservative Q learning (CQL) for offline RL within a one-line code change. Empirically, Cal-QL outperforms state-of-the-art methods on 9/11 fine-tuning benchmark tasks that we study in this paper. Code and video are available at https://nakamotoo.github.io/Cal-QL
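The numpy sketch below shows one reading of the "one-line change": the Q-values that the CQL-style conservative term pushes down are clipped from below by a reference value (e.g. a Monte-Carlo estimate of the behaviour policy's return), so the learned values stay calibrated against that reference. The exact placement of this clipping inside Cal-QL's loss is an assumption based on the abstract.

```python
# Hedged sketch of "calibrating" a CQL-style conservative penalty: the Q-values being
# pushed down are clipped from below by a reference value estimate, keeping the learned
# values at a reasonable scale. The exact form inside Cal-QL's loss is assumed here.
import numpy as np

def conservative_penalty(q_policy_actions, q_data_actions, v_reference, calibrated=True):
    """CQL-style penalty: push down Q at policy actions, push up Q at dataset actions."""
    pushed_down = (np.maximum(q_policy_actions, v_reference)
                   if calibrated else q_policy_actions)
    return pushed_down.mean() - q_data_actions.mean()

q_pi = np.array([0.2, -1.5, 0.8])     # Q at actions sampled from the learned policy
q_data = np.array([0.5, 0.4, 0.6])    # Q at dataset (behaviour) actions
v_ref = np.array([0.3, 0.3, 0.3])     # reference value estimate, e.g. a MC return

print(conservative_penalty(q_pi, q_data, v_ref, calibrated=False))  # plain CQL-style term
print(conservative_penalty(q_pi, q_data, v_ref, calibrated=True))   # calibrated variant
```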
PubTime: 2024-01-22
Downlink: http://arxiv.org/abs/2401.12217v1
GitHub: https://github.com/zlai0/S-Seg
中文摘要: 开放词汇语义分割模型旨在根据一组任意的开放词汇文本,为图像中的每个像素准确地分配语义标签。为了学习这种像素级对齐,当前的方法通常依赖于以下几者的组合:(i)图像级视觉语言模型(例如CLIP),(ii)真实掩码标注,以及(iii)自定义的分组编码器。在本文中,我们提出了S-Seg,这是一种无需依赖上述任何元素即可取得出人意料的强性能的新模型。S-Seg利用伪掩码和语言来训练MaskFormer,并且可以很容易地在公开可用的图像-文本数据集上训练。与以往工作不同,我们的模型直接针对像素级特征与语言的对齐进行训练。一旦训练完成,S-Seg无需微调即可很好地泛化到多个测试数据集。此外,S-Seg还具有随数据扩展的额外优势,并且在通过自训练增强时能够持续提升。我们相信这一简单而有效的方法将成为未来研究的坚实基线。
摘要: Open-vocabulary semantic segmentation models aim to accurately assign a semantic label to each pixel in an image from a set of arbitrary open-vocabulary texts. In order to learn such pixel-level alignment, current approaches typically rely on a combination of (i) image-level VL model (e.g. CLIP), (ii) ground truth masks, and (iii) custom grouping encoders. In this paper, we introduce S-Seg, a novel model that can achieve surprisingly strong performance without depending on any of the above elements. S-Seg leverages pseudo-mask and language to train a MaskFormer, and can be easily trained from publicly available image-text datasets. Contrary to prior works, our model directly trains for pixel-level features and language alignment. Once trained, S-Seg generalizes well to multiple testing datasets without requiring fine-tuning. In addition, S-Seg has the extra benefits of scalability with data and consistent improvement when augmented with self-training. We believe that our simple yet effective approach will serve as a solid baseline for future research.
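A rough numpy sketch of training pixel features to align with language: pixel embeddings are pooled inside each pseudo-mask and matched to text embeddings of caption-derived labels via a softmax over cosine similarities. The shapes, the pooling choice, and the loss form are assumptions, not S-Seg's actual objective.

```python
# Illustrative mask-text alignment loss: average-pool pixel features inside each
# pseudo-mask and classify the pooled vector against text embeddings. Shapes and the
# cross-entropy form are assumptions, not the paper's exact training objective.
import numpy as np

def mask_text_alignment_loss(pixel_feats, pseudo_masks, text_embs, targets, tau=0.07):
    """pixel_feats: (H*W, D); pseudo_masks: (M, H*W) binary; text_embs: (C, D);
    targets: (M,) index of the matching text for each mask."""
    pooled = pseudo_masks @ pixel_feats / pseudo_masks.sum(1, keepdims=True)  # (M, D)
    pooled /= np.linalg.norm(pooled, axis=1, keepdims=True)
    text = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = pooled @ text.T / tau                                            # (M, C)
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8))                   # 4x4 image, 8-dim pixel features
masks = (rng.random((3, 16)) > 0.5).astype(float)  # 3 pseudo-masks
texts = rng.normal(size=(5, 8))                    # 5 candidate class-name embeddings
print(mask_text_alignment_loss(feats, masks, texts, targets=np.array([0, 2, 4])))
```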
PubTime: 2024-01-22
Downlink: http://arxiv.org/abs/2401.12202v1
Project: https://ok-robot.github.io
中文摘要: 近年来,视觉、语言和机器人领域取得了显著进展。我们现在拥有能够根据语言查询识别物体的视觉模型、可以有效控制移动系统的导航系统,以及可以抓取各种物体的抓取模型。尽管取得了这些进步,机器人的通用应用仍然滞后,即便它们依赖的正是识别、导航和抓取这些基本能力。在本文中,我们采用系统优先的方法,开发了一个新的基于开放知识的机器人框架,称为OK-Robot。通过结合用于物体检测的视觉语言模型(VLMs)、用于移动的导航基元和用于物体操作的抓取基元,OK-Robot提供了一个无需任何训练即可执行拾取-放置(pick-and-drop)操作的集成解决方案。为了评估其性能,我们在10个真实家庭环境中运行了OK-Robot。结果表明,OK-Robot在开放式拾取-放置任务中达到58.5%的成功率,创造了开放词汇移动操作(OVMM)的新纪录,性能接近先前工作的1.8倍。在更干净、整洁的环境中,OK-Robot的成功率提高到82%。然而,从OK-Robot获得的最重要的洞见是:在将VLMs等开放知识系统与机器人模块相结合时,细节的处理起着关键作用。我们的实验视频可在我们的网站上获得:https://ok-robot.github.io
摘要: Remarkable progress has been made in recent years in the fields of vision, language, and robotics. We now have vision models capable of recognizing objects based on language queries, navigation systems that can effectively control mobile systems, and grasping models that can handle a wide range of objects. Despite these advancements, general-purpose applications of robotics still lag behind, even though they rely on these fundamental capabilities of recognition, navigation, and grasping. In this paper, we adopt a systems-first approach to develop a new Open Knowledge-based robotics framework called OK-Robot. By combining Vision-Language Models (VLMs) for object detection, navigation primitives for movement, and grasping primitives for object manipulation, OK-Robot offers an integrated solution for pick-and-drop operations without requiring any training. To evaluate its performance, we run OK-Robot in 10 real-world home environments. The results demonstrate that OK-Robot achieves a 58.5% success rate in open-ended pick-and-drop tasks, representing a new state-of-the-art in Open Vocabulary Mobile Manipulation (OVMM) with nearly 1.8x the performance of prior work. In cleaner, uncluttered environments, OK-Robot’s performance increases to 82%. However, the most important insight gained from OK-Robot is the critical role of nuanced details when combining Open Knowledge systems like VLMs with robotic modules. Videos of our experiments are available on our website: https://ok-robot.github.io
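The schematic sketch below mirrors the modular, zero-training pick-and-drop flow described above: an open-vocabulary detector localises the queried object, and navigation and grasping primitives do the rest. All functions are stand-in stubs for the real modules (VLM detector, navigation stack, grasp planner); none of them is OK-Robot's API.

```python
# Schematic composition of pretrained modules into a pick-and-drop behaviour.
# Every function below is a stub standing in for a real module; this is not OK-Robot's API.
from typing import Tuple

Pose = Tuple[float, float, float]

def detect(query: str) -> Pose:          # stub for an open-vocabulary VLM detector
    return (1.0, 2.0, 0.4)

def navigate_to(pose: Pose) -> None:     # stub for a navigation primitive
    print(f"navigating to {pose}")

def grasp(pose: Pose) -> bool:           # stub for a grasping primitive
    print(f"grasping object at {pose}")
    return True

def release() -> None:                   # stub for the drop primitive
    print("releasing object")

def pick_and_drop(pick_query: str, drop_query: str) -> bool:
    """Compose the pretrained modules into a zero-training pick-and-drop behaviour."""
    obj = detect(pick_query)
    navigate_to(obj)
    if not grasp(obj):
        return False
    target = detect(drop_query)
    navigate_to(target)
    release()
    return True

pick_and_drop("a blue mug", "the kitchen sink")
```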
PubTime: 2024-01-22
Downlink: http://arxiv.org/abs/2401.12051v1
Project: https://virtualhumans.mpi-inf.mpg.de/close3dv24/
中文摘要: 3D服装建模和数据集在娱乐、动画和数字时尚行业中发挥着至关重要的作用。现有工作往往缺乏细致的语义理解,或使用缺乏真实感和个性化的合成数据集。为了解决这个问题,我们首先提出了CloSe-D:一个新颖的大规模数据集,包含3167个扫描的3D服装分割,涵盖18个不同的服装类别。此外,我们提出了CloSe-Net,这是第一个基于学习的3D服装分割模型,可从彩色点云进行细粒度分割。CloSe-Net使用局部点特征、身体-服装相关性,以及基于服装类别和点特征的注意力模块,性能优于基线和先前工作;所提出的注意力模块使我们的模型能够从数据中学习与外观和几何相关的服装先验。我们通过成功分割公开可用的着装人体数据集,进一步验证了我们方法的有效性。我们还推出了CloSe-T,一个用于细化分割标签的3D交互式工具。在持续学习设置中将该工具与CloSe-T结合使用,展示了在真实世界数据上更好的泛化能力。数据集、模型和工具可在 https://virtualhumans.mpi-inf.mpg.de/close3dv24/ 找到。
摘要: 3D Clothing modeling and datasets play a crucial role in the entertainment, animation, and digital fashion industries. Existing work often lacks detailed semantic understanding or uses synthetic datasets, lacking realism and personalization. To address this, we first introduce CloSe-D: a novel large-scale dataset containing 3D clothing segmentation of 3167 scans, covering a range of 18 distinct clothing classes. Additionally, we propose CloSe-Net, the first learning-based 3D clothing segmentation model for fine-grained segmentation from colored point clouds. CloSe-Net uses local point features, body-clothing correlation, and a garment-class and point features-based attention module, improving performance over baselines and prior work. The proposed attention module enables our model to learn appearance and geometry-dependent clothing prior from data. We further validate the efficacy of our approach by successfully segmenting publicly available datasets of people in clothing. We also introduce CloSe-T, a 3D interactive tool for refining segmentation labels. Combining the tool with CloSe-T in a continual learning setup demonstrates improved generalization on real-world data. Dataset, model, and tool can be found at https://virtualhumans.mpi-inf.mpg.de/close3dv24/.
PubTime: 2024-01-22
Downlink: http://arxiv.org/abs/2312.10105v2
GitHub: https://github.com/naver-ai/tokenadapt
中文摘要: 深度神经网络(DNN)模型的最新进展显著提升了计算机视觉任务的性能。然而,要获得高度泛化且高性能的视觉模型,需要大量的数据集,从而导致巨大的存储需求。这一存储挑战是扩展视觉模型的关键瓶颈。受离散表示成功的启发,SeiT提出使用矢量量化(VQ)特征向量(即令牌)作为视觉分类的网络输入。然而,由于输入域的偏移,将传统的数据增强直接应用于令牌面临挑战。为了解决这个问题,我们提出了TokenAdapt和ColorAdapt,这是两种简单而有效的基于令牌的增强策略。TokenAdapt重新调整令牌嵌入空间以兼容空间增强,在无需微调的情况下保持模型的效率;ColorAdapt则受自适应实例归一化(AdaIN)启发,处理针对令牌的基于颜色的增强。我们在多种场景中评估了我们的方法,包括存储高效的ImageNet-1k分类、细粒度分类、鲁棒性基准和ADE-20k语义分割。实验结果表明,在不同的实验中性能得到了一致的提升。代码见 https://github.com/naver-ai/tokenadapt 。
摘要: Recent advancements in Deep Neural Network (DNN) models have significantly improved performance across computer vision tasks. However, achieving highly generalizable and high-performing vision models requires extensive datasets, leading to large storage requirements. This storage challenge poses a critical bottleneck for scaling up vision models. Motivated by the success of discrete representations, SeiT proposes to use Vector-Quantized (VQ) feature vectors (i.e., tokens) as network inputs for vision classification. However, applying traditional data augmentations to tokens faces challenges due to input domain shift. To address this issue, we introduce TokenAdapt and ColorAdapt, simple yet effective token-based augmentation strategies. TokenAdapt realigns token embedding space for compatibility with spatial augmentations, preserving the model’s efficiency without requiring fine-tuning. Additionally, ColorAdapt addresses color-based augmentations for tokens inspired by Adaptive Instance Normalization (AdaIN). We evaluate our approach across various scenarios, including storage-efficient ImageNet-1k classification, fine-grained classification, robustness benchmarks, and ADE-20k semantic segmentation. Experimental results demonstrate consistent performance improvement in diverse experiments. Code is available at https://github.com/naver-ai/tokenadapt.
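A small numpy sketch of an AdaIN-style operation on token features, in the spirit of the colour-based token augmentation described above: the content tokens keep their normalised structure but adopt the channel-wise statistics of a reference token set. Treating this exact formulation as ColorAdapt is an assumption based on the AdaIN reference.

```python
# AdaIN-style statistic swap on token features: normalise the content tokens per channel
# and re-scale/shift them with the reference tokens' statistics. Applying this to VQ
# tokens as a colour-like augmentation is an assumption, not the paper's exact recipe.
import numpy as np

def adain_tokens(content: np.ndarray, style: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """content, style: (num_tokens, dim). Channel statistics are computed over tokens."""
    c_mu, c_std = content.mean(0), content.std(0) + eps
    s_mu, s_std = style.mean(0), style.std(0) + eps
    return (content - c_mu) / c_std * s_std + s_mu

rng = np.random.default_rng(0)
tokens = rng.normal(loc=0.0, scale=1.0, size=(196, 32))     # tokens of one image
reference = rng.normal(loc=2.0, scale=0.5, size=(196, 32))  # tokens with other "color" stats
augmented = adain_tokens(tokens, reference)
print(augmented.mean().round(3), augmented.std().round(3))  # stats move toward the reference
```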
PubTime: 2024-01-22
Downlink: http://arxiv.org/abs/2310.12817v2
Project: https://jimmy15923.github.io/mit_web/
中文摘要: 我们提出了一种多模态交错Transformer(MIT),它联合考虑2D和3D数据用于弱监督点云分割。已有研究表明,2D和3D特征对于点云分割是互补的。然而,现有方法需要额外的2D标注才能实现2D-3D信息融合。考虑到点云的高标注成本,基于弱监督学习的有效2D与3D特征融合非常必要。为此,我们提出了一种包含两个编码器和一个解码器的Transformer模型,仅使用场景级类别标签进行弱监督点云分割。具体来说,两个编码器分别计算3D点云和2D多视图图像的自注意力特征;解码器实现交错的2D-3D交叉注意力,并进行隐式的2D与3D特征融合。我们在解码器各层中交替切换查询与键值对的角色,从而使2D和3D特征相互迭代地丰富。实验表明,该方法在S3DIS和ScanNet基准上大幅优于现有的弱监督点云分割方法。项目网页见 https://jimmy15923.github.io/mit_web/ 。
摘要: We present a Multimodal Interlaced Transformer (MIT) that jointly considers 2D and 3D data for weakly supervised point cloud segmentation. Research studies have shown that 2D and 3D features are complementary for point cloud segmentation. However, existing methods require extra 2D annotations to achieve 2D-3D information fusion. Considering the high annotation cost of point clouds, effective 2D and 3D feature fusion based on weakly supervised learning is in great demand. To this end, we propose a transformer model with two encoders and one decoder for weakly supervised point cloud segmentation using only scene-level class tags. Specifically, the two encoders compute the self-attended features for 3D point clouds and 2D multi-view images, respectively. The decoder implements interlaced 2D-3D cross-attention and carries out implicit 2D and 3D feature fusion. We alternately switch the roles of queries and key-value pairs in the decoder layers. It turns out that the 2D and 3D features are iteratively enriched by each other. Experiments show that it performs favorably against existing weakly supervised point cloud segmentation methods by a large margin on the S3DIS and ScanNet benchmarks. The project page will be available at https://jimmy15923.github.io/mit_web/.
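The compact numpy sketch below captures the interlaced cross-attention idea: decoder layers alternate which modality supplies the queries and which supplies the keys/values, so each modality is iteratively enriched by the other. Single-head attention without learned projections and the layer count are simplifications; this is not the MIT implementation.

```python
# Interlaced cross-attention sketch: even layers let 3D features query the 2D features,
# odd layers swap the roles. Single-head attention without learned projections is a
# deliberate simplification for illustration only.
import numpy as np

def cross_attention(q, kv, d):
    scores = q @ kv.T / np.sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)
    attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return attn @ kv

def interlaced_decoder(feat3d, feat2d, num_layers=4):
    d = feat3d.shape[1]
    for layer in range(num_layers):
        if layer % 2 == 0:        # 3D points query the 2D views
            feat3d = feat3d + cross_attention(feat3d, feat2d, d)
        else:                     # 2D views query the 3D points
            feat2d = feat2d + cross_attention(feat2d, feat3d, d)
    return feat3d, feat2d

rng = np.random.default_rng(0)
p3d = rng.normal(size=(1024, 64))   # 3D point features
p2d = rng.normal(size=(256, 64))    # 2D multi-view image features
f3d, f2d = interlaced_decoder(p3d, p2d)
print(f3d.shape, f2d.shape)
```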
PubTime: 2024-01-22
Downlink: http://arxiv.org/abs/2401.11739v1
Project: https://kmcode1.github.io/Projects/EmerDiff/
中文摘要: 扩散模型因其在语义分割任务中显著的迁移能力而受到越来越多的研究关注。然而,使用扩散模型生成细粒度的分割掩码通常需要在带注释的数据集上进行额外训练,这使人们不清楚仅凭预训练的扩散模型本身能在多大程度上理解其生成图像的语义关系。为了解决这个问题,我们利用从Stable Diffusion(SD)中提取的语义知识,旨在开发一种无需任何额外训练即可生成细粒度分割图的图像分割器。主要的困难在于,语义上有意义的特征图通常只存在于空间维度较低的层中,这给直接从这些特征图中提取像素级语义关系带来了挑战。为了克服这一问题,我们的框架利用SD的生成过程来识别图像像素与低维特征图空间位置之间的语义对应关系,并利用它们构建图像分辨率的分割图。在大量实验中,所生成的分割图边界清晰,并能捕捉图像的细节部分,表明扩散模型中存在高度精确的像素级语义知识。
摘要: Diffusion models have recently received increasing research attention for their remarkable transfer abilities in semantic segmentation tasks. However, generating fine-grained segmentation masks with diffusion models often requires additional training on annotated datasets, leaving it unclear to what extent pre-trained diffusion models alone understand the semantic relations of their generated images. To address this question, we leverage the semantic knowledge extracted from Stable Diffusion (SD) and aim to develop an image segmentor capable of generating fine-grained segmentation maps without any additional training. The primary difficulty stems from the fact that semantically meaningful feature maps typically exist only in the spatially lower-dimensional layers, which poses a challenge in directly extracting pixel-level semantic relations from these feature maps. To overcome this issue, our framework identifies semantic correspondences between image pixels and spatial locations of low-dimensional feature maps by exploiting SD’s generation process and utilizes them for constructing image-resolution segmentation maps. In extensive experiments, the produced segmentation maps are demonstrated to be well delineated and capture detailed parts of the images, indicating the existence of highly accurate pixel-level semantic knowledge in diffusion models.
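A hedged sketch of recovering pixel-to-location correspondences by perturbation: each low-resolution feature location is nudged, the (stubbed) generator is re-run, and every pixel is assigned to the location whose perturbation changes it the most; labels attached to the low-resolution grid can then transfer to full resolution. The `generate` stub below replaces the actual Stable Diffusion decoding pass and is purely illustrative.

```python
# Perturbation-based pixel-to-location correspondence (illustrative): nudge one low-res
# feature location at a time, re-run a stub generator, and assign each pixel to the
# location whose perturbation changes it most. The stub is not the SD decoding pass.
import numpy as np

def generate(feats_lowres, out_size=16):
    """Stub 'decoder': nearest-neighbour upsampling of low-res features to pixels."""
    rep = out_size // feats_lowres.shape[0]
    return np.kron(feats_lowres, np.ones((rep, rep, 1)))

def pixel_to_location(feats_lowres, eps=1.0, out_size=16):
    h, w, _ = feats_lowres.shape
    base = generate(feats_lowres, out_size)
    response = np.zeros((out_size, out_size, h * w))
    for idx in range(h * w):
        i, j = divmod(idx, w)
        perturbed = feats_lowres.copy()
        perturbed[i, j] += eps
        diff = np.abs(generate(perturbed, out_size) - base).sum(-1)
        response[..., idx] = diff
    return response.argmax(-1)            # (out_size, out_size) map of low-res indices

lowres = np.random.default_rng(0).normal(size=(4, 4, 8))
assignment = pixel_to_location(lowres)
print(assignment.shape, np.unique(assignment).size)   # pixel-level map over 16 locations
```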
如果你感觉对你有所帮助,请关注我,每日准时为你推送最新论文