[Paper Notes] Reinforcement and Imitation Learning for Diverse Visuomotor Skills


  • Abstract
  • Introduction
  • Related Work
  • Model
    • A. Background: GAIL and PPO
      • 1. Behavior Cloning
      • 2. GAIL
    • B. Reinforcement and Imitation Learning Model
      • 1.Hybrid IL/RL Reward
      • 2.Leveraging Physical States in Simulation
        • (1) Demonstration as a curriculum.
        • (2) Learning value functions from states
        • (3) Object-centric discriminator
        • (4) State prediction auxiliary tasks
      • 3.Sim2Real Policy Transfer
  • Experiments
    • A. Environment Setups
    • B. Robot Arm Manipulation Tasks
    • C. Quantitative Evaluation
    • D.Sim2Real Policy Transfer Results
  • Discussion
  • Conclusion


Abstract

We propose a model-free deep reinforcement learning method that leverages a small amount of demonstration data to assist a reinforcement learning agent.

We apply this approach to robotic manipulation tasks and train end-to-end visuomotor policies that map directly from RGB camera inputs to joint velocities.

We demonstrate that our approach can solve a wide variety of visuomotor tasks, for which engineering a scripted controller would be laborious.

In experiments, our reinforcement and imitation agent achieves significantly better performances than agents trained with reinforcement learning or imitation learning alone.

We also illustrate that these policies, trained with large visual and dynamics variations, can achieve preliminary successes in zero-shot sim2real transfer.


Introduction

For robotics, RL in combination with powerful function approximators such as neural networks provides a general framework for designing sophisticated controllers that would be hard to handcraft otherwise.

Nevertheless, end-to-end learning of visuomotor controllers for long-horizon and multi-stage manipulation tasks using model-free RL techniques remains a challenging problem.

Policies for robotics must transform multi-modal and partial observations from noisy sensors, such as cameras, into coordinated activity of many degrees of freedom.

At the same time, realistic tasks often come with contact-rich dynamics and vary along multiple dimensions (visual appearance, position, shapes, etc.), posing significant generalization challenges.

Model-based methods can have difficulties handling such complex dynamics and large variations. Directly training model-free methods on real robotics hardware can be daunting due to the high sample complexity.

The difficulty of real-world RL training is compounded by safety considerations as well as the difficulty of accessing information about the state of the environment (e.g. the position of an object) to define a reward function.

Finally, even in simulation when perfect state information and large amounts of training data are available, exploration can be a significant challenge, especially for on-policy methods.

This is partly due to the often high-dimensional and continuous action space, but also due to the difficulty of designing suitable reward functions.

In this paper, we present a model-free deep RL method that can solve a variety of robotic manipulation tasks directly from pixel input. Our key insights are 1) to reduce the difficulty of exploration in continuous domains by leveraging a handful of human demonstrations; 2) to leverage several new techniques that exploit privileged and task-specific information during training only which can accelerate and stabilize the learning of visuomotor policies in multi-stage tasks; and 3) to improve generalization by increasing the diversity of the training conditions. As a result, the policies work well under significant variations of system dynamics, object appearances, task lengths, etc.

The set of tasks includes multi-stage and long-horizon tasks, and they require full 9-DoF joint velocity control directly from pixels.

Our approach utilizes demonstration data in two ways: first, it uses a hybrid reward that combines the task reward with an imitation reward based on Generative Adversarial Imitation Learning [15]. This aids with exploration while still allowing the final controller to outperform the human demonstrator on the task. Second, it uses demonstration trajectories to construct a curriculum of states along which to initialize the episodes during training. This enables the agent to learn about later stages of the task earlier in training, facilitating the solving of long tasks.

[15] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In NIPS, pages 4565–4573, 2016.


Through the use of a physics engine and high-throughput RL algorithms, we can simulate parallel copies of a robot arm to perform millions of complex physical interactions in a contact-rich environment while eliminating the practical concerns of robot safety and system reset.

Furthermore, we can, during training, exploit privileged and task-specific information about the true system state with several new techniques, including learning policy and value in separate modalities, an object-centric GAIL discriminator, and auxiliary tasks for visual modules.

We use the same model and the same algorithm with only small task-specific modifications of the training setup to learn visuomotor controllers for six diverse robot arm manipulation tasks.


Related Work

Three classes of RL algorithms are currently dominant for continuous control problems: guided policy search methods (GPS; Levine and Koltun [22]), value-based methods such as the deterministic policy gradient (DPG; Silver et al. [45], Lillicrap et al. [26], Heess et al. [12]) or the normalized advantage function (NAF; Gu et al. [10]) algorithm, and trust-region based policy gradient algorithms such as trust region policy optimization (TRPO [42]) and proximal policy optimization (PPO [43]).

Guided policy search (GPS) methods:

[22] Sergey Levine and Vladlen Koltun. Guided policy search. In ICML, pages 1–9, 2013.

Deterministic policy gradient (DPG):

[45] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In ICML, 2014.

[26] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. ICLR, 2016.

[12] Nicolas Heess, Gregory Wayne, David Silver, Tim Lillicrap, Tom Erez, and Yuval Tassa. Learning continuous control policies by stochastic value gradients. In NIPS, pages 2926–2934, 2015.

Normalized advantage function (NAF) algorithm:

[10] Shixiang Gu, Tim Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous deep Q-learning with model-based acceleration. In ICML, 2016.

Trust region policy optimization (TRPO):

[42] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In ICML, pages 1889–1897, 2015.

Proximal policy optimization (PPO):

[43] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

TRPO [42] and PPO [43] hold appeal due to their robustness to hyperparameter settings as well as their scalability [14], but their lack of sample efficiency makes them unsuitable for training directly on robotics hardware.

The idea of using large-scale data collection for training visuomotor controllers has been the focus of Levine et al. [24] and Pinto and Gupta [33] who train a convolutional network to predict grasp success for diverse sets of objects using a large dataset with 10s or 100s of thousands of grasp attempts collected from multiple robots in a self-supervised setting.

Demonstrations can be used to initialize policies, design cost functions, guide exploration, augment the training data, or a combination of these.

Our method learns end-to-end visuomotor policies without reliance on demonstrator actions.


Model

The policy takes both an RGB camera observation and a proprioceptive feature vector that describes the joint positions and angular velocities.


A. Background: GAIL and PPO

Imitation Learning:

1. Behavior Cloning

Given a set of demonstrated state-action pairs

$$D = \{(s_i, a_i)\}, \quad i = 1, 2, 3, \cdots, N,$$

this algorithm uses maximum likelihood estimation to train a parameterized policy $\pi_{\theta}: S \rightarrow A$:

$$\theta^{*} = \arg\max_{\theta} \sum_{i=1}^{N} \log \pi_{\theta}(a_i \mid s_i)$$
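As a minimal sketch of the maximum-likelihood objective above, assume a small discrete (tabular) setting. This is an illustrative simplification, not the neural-network policy used in the paper. In the tabular case the maximizer has a closed form: the empirical action frequencies for each state. The function names `fit_tabular_bc` and `log_likelihood` and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def fit_tabular_bc(demos):
    """Maximum-likelihood tabular policy: pi(a|s) = count(s, a) / count(s)."""
    counts = defaultdict(Counter)
    for state, action in demos:
        counts[state][action] += 1
    policy = {}
    for state, action_counts in counts.items():
        total = sum(action_counts.values())
        policy[state] = {a: c / total for a, c in action_counts.items()}
    return policy


def log_likelihood(policy, demos):
    """Sum of log pi(a_i | s_i) over the demonstration set D."""
    return sum(math.log(policy[s][a]) for s, a in demos)


# Toy demonstration set D = {(s_i, a_i)} (made-up data for illustration).
demos = [("far", "reach"), ("far", "reach"), ("far", "wait"), ("near", "grasp")]
policy = fit_tabular_bc(demos)
```

Any other tabular policy, for example one that picks "reach" and "wait" with equal probability in state "far", assigns a lower log-likelihood to the same demonstrations.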

2. GAIL

GAIL trains two networks. One is the policy network $\pi_{\theta}: S \rightarrow A$; the other is the discriminator network $D_{\psi}: S \times A \rightarrow [0, 1]$. They are trained jointly on the min-max objective

$$\min_{\theta} \max_{\psi} \; E_{\pi_{E}}[\log D_{\psi}(s, a)] + E_{\pi_{\theta}}[\log(1 - D_{\psi}(s, a))]$$

where $\pi_{E}$ denotes the expert policy that generated the demonstration trajectories.

The policy $\pi_{\theta}$ is trained with policy gradient methods to maximize the imitation reward

$$r_{gail}(s_t, a_t) = -\log(1 - D_{\psi}(s_t, a_t)),$$

which is clipped at a maximum value of 10.
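The clipped imitation reward can be sketched as a small function. This is a hedged illustration: `gail_reward` is a hypothetical name, `d_score` stands for the discriminator output in [0, 1], and the small `eps` guard is an assumption added here for numerical safety, not something stated in the notes.

```python
import math


def gail_reward(d_score, max_reward=10.0):
    """Imitation reward -log(1 - D(s, a)), clipped to at most max_reward."""
    eps = 1e-12  # numerical guard (an assumption) so log(0) is never evaluated
    raw = -math.log(max(1.0 - d_score, eps))
    return min(raw, max_reward)
```

The reward grows as the discriminator becomes more convinced that a state-action pair came from the expert, and the clip keeps it bounded when the discriminator output approaches 1.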
PPO only relies on first-order gradients and can be easily implemented with recurrent networks in a distributed setting [14].

PPO implements an approximate trust region that limits the change in the policy per iteration.

This is achieved via a regularization term based on the Kullback-Leibler (KL) divergence, the strength of which is adjusted dynamically depending on actual change in the policy in past iterations.
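A minimal sketch of how such a dynamically adjusted KL penalty can work is given below. The halving and doubling rule with the 1.5 thresholds follows the adaptive KL variant described in the PPO paper [43]; whether the distributed implementation referenced here [14] uses exactly these constants is an assumption. The function names are hypothetical.

```python
def update_kl_coefficient(beta, observed_kl, target_kl):
    """Adapt the KL penalty strength from the policy change seen in the last iteration."""
    if observed_kl < target_kl / 1.5:
        return beta / 2.0  # policy changed too little: weaken the penalty
    if observed_kl > target_kl * 1.5:
        return beta * 2.0  # policy changed too much: strengthen the penalty
    return beta


def penalized_objective(surrogate, kl, beta):
    """Surrogate objective minus the KL regularization term (to be maximized)."""
    return surrogate - beta * kl
```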

B. Reinforcement and Imitation Learning Model

1.Hybrid IL/RL Reward

Hence, we design the task rewards as sparse piecewise constant
functions based on the different stages of the respective tasks.

We provide additional guidance via a hybrid reward function that combines the imitation reward $r_{gail}$ with the task reward $r_{task}$:

$$r(s_t, a_t) = \lambda\, r_{gail}(s_t, a_t) + (1 - \lambda)\, r_{task}(s_t, a_t), \quad \lambda \in [0, 1]$$

where the imitation reward encourages the policy to generate trajectories closer to demonstration trajectories, and the task reward encourages the policy to achieve high returns on the task.
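The hybrid reward and a sparse, piecewise-constant task reward can be sketched as follows. The stage names and reward values in `STAGE_REWARDS` are hypothetical placeholders chosen for illustration; the notes only state that the task reward is piecewise constant over the stages of each task.

```python
# Hypothetical stage rewards for a stacking-like task (illustrative values only).
STAGE_REWARDS = {"start": 0.0, "grasped": 0.25, "lifted": 0.5, "stacked": 1.0}


def task_reward(stage):
    """Sparse piecewise-constant task reward: one constant per task stage."""
    return STAGE_REWARDS[stage]


def hybrid_reward(r_gail, r_task, lam=0.5):
    """r = lam * r_gail + (1 - lam) * r_task, with lam in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * r_gail + (1.0 - lam) * r_task
```

Setting `lam=0.0` recovers the pure task reward and `lam=1.0` the pure imitation reward.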


2.Leveraging Physical States in Simulation

The physics simulator we employ for training exposes the full state of the system.

(1) Demonstration as a curriculum.

Previous work indicates that shaping the distribution of start states towards states where the optimal policy tends to visit can greatly improve policy learning [18, 35].
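With probability $\epsilon$, an episode starts from a randomly chosen initial state of the environment.
With probability $1 - \epsilon$, it starts from a state taken from the demonstrations.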
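The start-state rule can be sketched as below. This is a sketch under stated assumptions: `sample_initial_state` is a hypothetical callable that returns a random initial state of the environment, `demo_states` is a hypothetical list of states recorded along demonstration trajectories, and the demonstration state is drawn uniformly from that list.

```python
import random


def sample_start_state(sample_initial_state, demo_states, epsilon, rng=None):
    """Pick the start state of a training episode.

    With probability epsilon, use a random initial state of the environment;
    otherwise reset to a state taken from the demonstrations.
    """
    rng = rng or random.Random()
    if rng.random() < epsilon:
        return sample_initial_state()
    return rng.choice(demo_states)
```

Resetting to demonstration states is only possible because the simulator exposes, and can be set to, the full physical state.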

(2) Learning value functions from states

During training, each PPO worker executes the policy for $K$ steps and uses the discounted sum of rewards and the value as an advantage function estimator

$$\hat{A}_t = \sum_{i=1}^{K} \gamma^{i-1} r_{t+i} + \gamma^{K-1} V_{\phi}(s_{t+K}) - V_{\phi}(s_t)$$

where $\gamma$ is the discount factor.
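The estimator above can be computed directly from a K-step rollout. The sketch below keeps the exponent $\gamma^{K-1}$ on the bootstrap term exactly as it is written in these notes; the function name and argument layout are hypothetical.

```python
def k_step_advantage(rewards, value_start, value_end, gamma):
    """K-step advantage estimate.

    rewards     -- [r_{t+1}, ..., r_{t+K}] collected by one worker
    value_start -- V(s_t)
    value_end   -- V(s_{t+K})
    """
    k = len(rewards)
    # enumerate starts at 0, so gamma ** i matches gamma^(i-1) for i = 1..K
    discounted = sum(gamma ** i * r for i, r in enumerate(rewards))
    return discounted + gamma ** (k - 1) * value_end - value_start
```

For example, with rewards `[1, 1, 1]`, `gamma = 0.5`, `value_start = 2` and `value_end = 4`, the discounted reward sum is 1.75, the bootstrap term is 0.25 * 4 = 1, and the advantage is 1.75 + 1 - 2 = 0.75.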

We take advantage of the low-level physical states (e.g., the position and velocity of the 3D objects and the robot arm) to train the value function $V_{\phi}$ with a smaller multilayer perceptron.


(3) Object-centric discriminator

Our discriminator only takes the object-centric features as input while masking out arm-related information.

The construction of the object-centric representation requires a certain amount of domain knowledge of the tasks.
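One simple way to picture the masking is a function that builds the discriminator input from object entries only. The state layout and the key names (`block_pos`, `block_vel`, `arm_joint_pos`) are hypothetical; which features count as object-centric is the task-specific domain knowledge mentioned above.

```python
def object_centric_features(state, object_keys):
    """Concatenate only the object-related entries of a state dict.

    Arm-related entries are never read, so they cannot influence the
    discriminator input.
    """
    features = []
    for key in object_keys:
        features.extend(state[key])
    return features


# Hypothetical simulator state for a block task (illustrative values).
state = {
    "arm_joint_pos": [0.1, 0.2],
    "block_pos": [1.0, 2.0, 3.0],
    "block_vel": [0.0, 0.0, 0.5],
}
disc_input = object_centric_features(state, ["block_pos", "block_vel"])
```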

(4) State prediction auxiliary tasks

To facilitate learning visuomotor policies we add a state prediction layer on the top of the CNN module to predict the locations of objects from the camera observation. We use a fully-connected layer to regress the 3D coordinates of objects in the task, minimizing the $L_2$ loss between the predicted and ground-truth object locations.
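The auxiliary task amounts to a linear regression head with a squared-error loss. In the sketch below, `features` is a stand-in for the CNN output (an assumption made for illustration), and the weights, bias, and target coordinates are made-up numbers.

```python
def linear_layer(features, weights, bias):
    """Fully-connected layer: one output per weight row."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def l2_loss(predicted, target):
    """Squared L2 distance between predicted and ground-truth coordinates."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target))


features = [1.0, 2.0]                          # stand-in for CNN features
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 outputs: x, y, z of one object
bias = [0.0, 0.0, 0.5]
predicted_xyz = linear_layer(features, weights, bias)
loss = l2_loss(predicted_xyz, [1.0, 2.0, 3.0])  # ground truth from the simulator
```

The ground-truth object locations are available only in simulation, so this loss is a training-time signal and is not needed when the policy runs on the real robot.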

3.Sim2Real Policy Transfer

Instead of using professional calibration equipment, our approach to sim2real policy transfer relies on domain randomization of camera position and orientation [17, 47].
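Domain randomization of the camera can be sketched as sampling a perturbed pose at the start of every training episode. The perturbation ranges (`pos_range` in meters, `angle_range_deg` in degrees), the Euler-angle representation, and the function name are assumptions for illustration; the notes do not give the ranges used in the paper.

```python
import random


def randomize_camera(nominal_position, nominal_euler_deg,
                     pos_range=0.05, angle_range_deg=5.0, rng=None):
    """Sample a camera pose uniformly around the nominal pose."""
    rng = rng or random.Random()
    position = [p + rng.uniform(-pos_range, pos_range)
                for p in nominal_position]
    orientation = [a + rng.uniform(-angle_range_deg, angle_range_deg)
                   for a in nominal_euler_deg]
    return position, orientation
```

Training across many such poses means the real camera only needs to be placed roughly where the simulated one is, rather than calibrated precisely.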

In contrast to some previous works, our trained policies do not rely on any object position information or intermediate goals but rather learn a mapping end-to-end from raw pixel input to joint velocities.


Experiments

A. Environment Setups

The Kinova Jaco arm has 9 degrees of freedom.

The visuomotor policy controls the robot by setting the joint velocity commands, producing 9-dimensional continuous velocities in the range of [−1, 1] at 20Hz.

Visual observations of the table-top scene are provided via a suitably positioned real-time RGB camera. The proprioceptive features and the camera observations are available in both simulation and real environments thus enabling policy transfer.

We use a large variety of objects, ranging from basic geometric shapes to procedurally generated 3D objects built from ensembles of primitive shapes. We increase the diversity of objects by randomizing various physical properties.

A SpaceNavigator 3D motion controller is used to collect the demonstrations.

B. Robot Arm Manipulation Tasks


| Task Name | Task Purpose |
| --- | --- |
| Block lifting | Evaluate the model's robustness |
| Block stacking | Evaluated in the sim2real transfer experiments |
| Clearing table with blocks | Lift two blocks off the tabletop |
| Clearing table with a box | Grasp the toy and put it into the box |
| Pouring liquid | Pour the "liquid" from one mug to the other container |
| Order fulfillment | Recognize the object categories, perform successful grasps on diverse shapes, and handle tasks with variable lengths |

C. Quantitative Evaluation

On the contrary, neither reinforcement nor imitation alone can solve all tasks.

These baselines use the same setup as the full model, except that we set $\lambda = 0$ for RL and $\lambda = 1$ for GAIL, while our model uses a balanced contribution of the hybrid reward, where $\lambda = 0.5$.
In other words, the proposed method is a linear combination of the two approaches, with pure RL and pure GAIL as its two endpoints.

We report the mean episode returns as a function of the number of training iterations in Fig. 4.
[Figure 4: mean episode returns as a function of the number of training iterations]
The only case where the baseline model is on par with the full model is the block lifting task, in which both the RL baseline and the full model achieved similar levels of performance. We hypothesize that this is due to the short length of the lifting task, where random exploration can provide a sufficient learning signal without the aid of demonstrations.

First, the RL agent learns faster than the full model in the clearing blocks task, but the full model eventually outperforms it. This is because the full model discovers a novel strategy, different from the strategy employed by human operators (see video). In this case, imitation gave contradictory signals but eventually, reinforcement learning guided the policy towards a better strategy.

Second, pouring liquid is the only task where GAIL outperforms its RL counterpart. Imitation can effectively shape the agent’s behaviors towards the demonstration trajectories [51]. This is a viable solution for the pouring task, where a controller that generates similar-looking behaviors can complete the task.


D.Sim2Real Policy Transfer Results

Although the sim and real domains are similar, there is still a sizable reality gap that makes zero-shot transfer challenging.
For example, while the simulated blocks are rigid, the objects employed in the real-world setup are non-rigid foam blocks, which deform and bounce unpredictably.
Furthermore, neural network policies are sensitive to subtle discrepancies between simulated rendering and the real camera frame.


