A personal take to start, for reference only:
For an agent in machine learning to perform well, three things matter: data, models, and knowledge.
Training data for embodied intelligence is nowhere near as rich as it is for vision, and the paper translated in this post focuses mainly on the data side of embodied intelligence.
Paper link:
https://robotics-transformer-x.github.io/paper.pdf
Large, high-capacity models trained on diverse datasets have shown remarkable successes on efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Can we instead train a “generalist” X-robot policy that can be adapted efficiently to new robots, tasks, and environments? In this paper, we provide datasets in standardized data formats and models to make it possible to explore this possibility in the context of robotic manipulation, alongside experimental results that provide an example of effective X-robot policies. We assemble a dataset from 22 different robots collected through a collaboration between 21 institutions, demonstrating 527 skills (160,266 tasks). We show that a high-capacity model trained on this data, which we call RT-X, exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms.
A central lesson from recent advances in machine learning and artificial intelligence is that large-scale learning from broad and diverse datasets can enable capable AI systems by providing for general-purpose pretrained models. In fact, large-scale general-purpose models typically trained on large and diverse datasets can often outperform their narrowly targeted counterparts trained on smaller but more task-specific data. For instance, open-vocabulary image classifiers (e.g., CLIP [1]) trained on large datasets scraped from the web tend to outperform fixed-vocabulary models trained on more limited datasets, and large language models [2, 3] trained on massive text corpora tend to outperform systems that are only trained on narrow task-specific datasets. Increasingly, the most effective way to tackle a given narrow task (e.g., in vision or NLP) is to adapt a general-purpose model. However, these lessons are difficult to apply in robotics: any single robotic domain might be too narrow, and while computer vision and NLP can leverage large datasets sourced from the web, comparably large and broad datasets for robotic interaction are hard to come by. Even the largest data collection efforts still end up with datasets that are a fraction of the size and diversity of benchmark datasets in vision (5-18M) [4, 5] and NLP (1.5B-4.5B) [6, 7]. Perhaps more importantly, such datasets are often still narrow along some axes of variation, either focusing on a single environment, a single set of objects, or a narrow range of tasks. How can we overcome these challenges in robotics and move the field of robotic learning toward the kind of large data regime that has been so successful in other domains?
Inspired by the generalization made possible by pretraining large vision or language models on diverse data, we take the perspective that the goal of training generalizable robot policies requires X-embodiment training, i.e., with data from multiple robotic platforms. While each individual robotic learning dataset might be too narrow, the union of all such datasets provides a better coverage of variations in environments and robots. Learning generalizable robot policies requires developing methods that can utilize X-embodiment data, tapping into datasets from many labs, robots, and settings. Even if such datasets in their current size and coverage are insufficient to attain the impressive generalization results that have been demonstrated by large language models, in the future, the union of such data can potentially provide this kind of coverage. Because of this, we believe that enabling research into X-embodiment robotic learning is critical at the present juncture.
Following this rationale, our work has two primary goals: (1) Demonstrate that policies trained on data from many different robots and environments enjoy the benefits of positive transfer, attaining better performance than policies trained only on data from each evaluation setup. (2) Provide datasets, data formats and models for the robotics community to enable future research on X-embodiment models.
We focus our work on robotic manipulation. Addressing goal (1), our empirical contribution is to demonstrate that several recent robotic learning methods, with minimal modification, can utilize X-embodiment data and enable positive transfer. Specifically, we train the RT-1 [8] and RT-2 [9] models on 9 different robotic manipulators. We show that the resulting models, which we call RT-X, can improve over policies trained only on data from the evaluation domain, exhibiting better generalization and new capabilities. Addressing (2), we provide the Open X-Embodiment (OXE) Repository, which includes a dataset with 22 different robotic embodiments from 21 different institutions that can enable the robotics community to pursue further research on X-embodiment models, along with open-source tools to facilitate such research. Our aim is not to innovate in terms of the particular architectures and algorithms, but rather to provide the model that we trained together with data and tools to energize research around X-embodiment robotic learning.
II. RELATED WORK
Transfer across embodiments. A number of prior works have studied methods for transfer across robot embodiments in simulation [10–22] and on real robots [23–29]. These methods often introduce mechanisms specifically designed to address the embodiment gap between different robots, such as shared action representations [14, 30], incorporating representation learning objectives [17, 26], adapting the learned policy on embodiment information [11, 15, 18, 30, 31], and decoupling robot and environment representations [24]. Prior work has provided initial demonstrations of X-embodiment training [27] and transfer [25, 29, 32] with transformer models. We investigate complementary architectures and provide complementary analyses, and, in particular, study the interaction between X-embodiment transfer and web-scale pretraining. Similarly, methods for transfer across human and robot embodiments also often employ techniques for reducing the embodiment gap, i.e. by translating between domains or learning transferable representations [33–43]. Alternatively, some works focus on sub-aspects of the problem such as learning transferable reward functions [17, 44–48], goals [49, 50], dynamics models [51], or visual representations [52–59] from human video data. Unlike most of these prior works, we directly train a policy on X-embodiment data, without any mechanisms to reduce the embodiment gap, and observe positive transfer by leveraging that data.
Large-scale robot learning datasets. The robot learning community has created open-source robot learning datasets, spanning grasping [60–71], pushing interactions [23, 72–74], sets of objects and models [75–85], and teleoperated demonstrations [8, 86–95]. With the exception of RoboNet [23], these datasets contain data of robots of the same type, whereas we focus on data spanning multiple embodiments. The goal of our data repository is complementary to these efforts: we process and aggregate a large number of prior datasets into a single, standardized repository, called Open X-Embodiment, which shows how robot learning datasets can be shared in a meaningful and useful way.
Language-conditioned robot learning. Prior work has aimed to endow robots and other agents with the ability to understand and follow language instructions [96–101], often by learning language-conditioned policies [8, 40, 45, 102–106]. We train language-conditioned policies via imitation learning like many of these prior works but do so using large-scale multi-embodiment demonstration data. Following previous works that leverage pre-trained language embeddings [8, 40, 45, 103, 107–112] and pre-trained vision-language models [9, 113–115] in robotic imitation learning, we study both forms of pre-training in our experiments, specifically following the recipes of RT-1 [8] and RT-2 [9].
We introduce the Open X-Embodiment Repository (robotics-transformer-x.github.io) – an open-source repository which includes large-scale data along with pre-trained model checkpoints for X-embodied robot learning research. More specifically, we provide and maintain the following open-source resources to the broader community:
• Open X-Embodiment Dataset: robot learning dataset with 1M+ robot trajectories from 22 robot embodiments.
• Pre-Trained Checkpoints: a selection of RT-X model checkpoints ready for inference and finetuning.
We intend for these resources to form a foundation for X-embodiment research in robot learning, but they are just the start. Open X-Embodiment is a community-driven effort, currently involving 21 institutions from around the world, and we hope to further broaden participation and grow the initial Open X-Embodiment Dataset over time. In this section, we summarize the dataset and X-embodiment learning framework, before discussing the specific models we use to evaluate our dataset and our experimental results.
The Open X-Embodiment Dataset contains 1M+ real robot trajectories spanning 22 robot embodiments, from single robot arms to bi-manual robots and quadrupeds. The dataset was constructed by pooling 60 existing robot datasets from 34 robotic research labs around the world and converting them into a consistent data format for easy download and usage. We use the RLDS data format [119], which saves data in serialized tfrecord files and accommodates the various action spaces and input modalities of different robot setups, such as differing numbers of RGB cameras, depth cameras and point clouds. It also supports efficient, parallelized data loading in all major deep learning frameworks. For more details about the data storage format and a breakdown of all 60 datasets, see robotics-transformer-x.github.io.
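To make this concrete, below is a minimal sketch of reading one of these RLDS-formatted datasets with TensorFlow Datasets. The storage path and dataset name mirror the examples on the project page but should be treated as illustrative assumptions, and the exact observation keys vary per embodiment.

```python
# A minimal sketch: reading one Open X-Embodiment dataset stored in the
# RLDS format (serialized tfrecord files) via TensorFlow Datasets.
import tensorflow_datasets as tfds

# RLDS datasets are ordinary TFDS builders, so they can be opened directly
# from their storage directory. The bucket path and version are illustrative.
builder = tfds.builder_from_directory(
    builder_dir="gs://gresearch/robotics/bridge/0.1.0"
)
episodes = builder.as_dataset(split="train")

# Each element is one robot trajectory; its "steps" field is itself a
# nested tf.data.Dataset of per-timestep observations and actions.
for episode in episodes.take(1):
    for step in episode["steps"].take(3):
        observation = step["observation"]  # dict; keys differ per dataset
        action = step["action"]            # embodiment-specific action encoding
```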
Fig. 2 analyzes the Open X-Embodiment Dataset. Fig. 2(a) shows the breakdown of datasets by robot embodiments, with the Franka robot being the most common. This is reflected in the number of distinct scenes (based on dataset metadata) per embodiment (Fig. 2(b)), where Franka dominates. Fig. 2(c) shows the breakdown of trajectories per embodiment. To further analyze the diversity, we use the language annotations present in our data. We use the PaLM language model [3] to extract objects and behaviors from the instructions. Fig. 2(d,e) show the diversity of skills and objects. While most skills belong to the pick-place family, the long tail of the dataset contains skills like “wiping” or “assembling”. Additionally, the data covers a range of household objects, from appliances to food items and utensils.
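A sketch of this kind of annotation mining is below. The paper uses PaLM; `query_llm` here is a hypothetical stand-in for whatever instruction-following model is available, and the prompt wording is illustrative rather than the authors' actual prompt.

```python
# Sketch: prompting a language model to pull the behavior (verb) and object
# out of each instruction, then counting skills across the dataset.
from collections import Counter

def query_llm(prompt: str) -> str:
    """Hypothetical hook; wire this to an LLM API of your choice."""
    raise NotImplementedError

def extract_skill_and_object(instruction: str) -> tuple[str, str]:
    prompt = (
        "Answer as 'verb, object'. What skill and main object does this "
        f"robot instruction describe?\n{instruction}"
    )
    verb, _, obj = query_llm(prompt).partition(",")
    return verb.strip(), obj.strip()

def skill_histogram(instructions: list[str]) -> Counter:
    """Aggregate extractions into histograms like those in Fig. 2(d,e)."""
    counts = Counter()
    for instruction in instructions:
        skill, _ = extract_skill_and_object(instruction)
        counts[skill] += 1
    return counts
```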
To evaluate how much X-embodiment training can improve the performance of learned policies on individual robots, we require models that have sufficient capacity to productively make use of such large and heterogeneous datasets. To that end, our experiments will build on two recently proposed Transformer-based robotic policies: RT-1 [8] and RT-2 [9]. We briefly summarize the design of these models in this section, and discuss how we adapted them to the X-embodiment setting in our experiments.
One challenge of creating X-embodiment models is that observation and action spaces vary significantly across robots. We use a coarsely aligned action and observation space across datasets. The model receives a history of recent images and language instructions as observations and predicts a 7-dimensional action vector controlling the end-effector (x, y, z, roll, pitch, yaw, and gripper opening or the rates of these quantities). We select one canonical camera view from each dataset as the input image, resize it to a common resolution and convert the original action set into a 7 DoF end-effector action. We normalize each dataset’s actions prior to discretization. This way, an output of the model can be interpreted (de-normalized) differently depending on the embodiment used. It should be noted that despite this coarse alignment, the camera observations still vary substantially across datasets, e.g. due to differing camera poses relative to the robot or differing camera properties, see Figure 3. Similarly, for the action space, we do not align the coordinate frames across datasets in which the end-effector is controlled, and allow action values to represent either absolute or relative positions or velocities, as per the original control scheme chosen for each robot. Thus, the same action vector may induce very different motions for different robots.
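As a concrete illustration, here is a minimal sketch of per-dataset action normalization and de-normalization, combined with RT-1-style discretization into 256 bins per dimension. The min/max normalization scheme and the statistics below are assumptions for illustration, not values taken from the paper.

```python
# Sketch: normalize 7-DoF actions per dataset, discretize for the model,
# then de-normalize model outputs with the *target* embodiment's statistics.
import numpy as np

NUM_BINS = 256  # RT-1 discretizes each action dimension into 256 bins

def normalize(action, stats):
    """Map a 7-DoF action into [-1, 1] using dataset-specific min/max."""
    low, high = stats["min"], stats["max"]
    return 2.0 * (action - low) / (high - low) - 1.0

def discretize(normalized_action):
    """Quantize each normalized dimension into one of NUM_BINS tokens."""
    clipped = np.clip(normalized_action, -1.0, 1.0)
    return np.floor((clipped + 1.0) / 2.0 * (NUM_BINS - 1)).astype(np.int32)

def denormalize(tokens, stats):
    """Decode tokens with a given embodiment's statistics, so the same model
    output maps to different physical motions on different robots."""
    normalized = tokens.astype(np.float32) / (NUM_BINS - 1) * 2.0 - 1.0
    low, high = stats["min"], stats["max"]
    return (normalized + 1.0) / 2.0 * (high - low) + low

# Hypothetical per-dataset statistics: identical tokens decode differently.
franka_stats = {"min": np.full(7, -0.10), "max": np.full(7, 0.10)}
widowx_stats = {"min": np.full(7, -0.05), "max": np.full(7, 0.05)}
tokens = discretize(normalize(np.zeros(7), franka_stats))
print(denormalize(tokens, widowx_stats))
```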