Paper Notes: Learning Synergies between Pushing and Grasping with Self-supervised Deep Reinforcement Learning

The key aspects of the system are:

  1. We learn joint pushing and grasping policies through self-supervised trial and error. Pushing actions are useful only if, in time, they enable grasping. This is in contrast to prior approaches that define heuristics or hard-coded objectives for pushing motions.
  2. We train our policies end-to-end with a deep network that takes in visual observations and outputs the expected return (i.e. in the form of Q values) for potential pushing and grasping actions. The joint policy then chooses the action with the highest Q value, i.e. the one that maximizes the expected success of current/future grasps. This is in contrast to explicitly perceiving individual objects and planning actions on them based on hand-designed features.

Problem Formulation

Formulate the task of pushing-for-grasping as a Markov decision process:

a given state \(s_t\)

an action \(a_t\)

a policy \(\pi(s_t)\)

a new state \(s_{t+1}\)

an immediate corresponding reward \(R_{a_t}(s_t,s_{t+1})\)

Goal

find an optimal policy \(\pi^*\) that maximizes the expected sum of future rewards, given by \(R_t = \sum_{i=t}^{\infty}\gamma^{i-t} R_{a_i}(s_i,s_{i+1})\)

the \(\gamma\)-discounted sum over an infinite horizon of future rewards from time \(t\) to \(\infty\)

Use Q-learning to train a greedy deterministic policy

The learning objective is to iteratively minimize the temporal difference error \(\delta_t\) of \(Q_\pi(s_t,a_t)\) to a fixed target value \(y_t\):
\(\delta_t = |Q(s_t,a_t)-y_t|\)

\(y_t = R_{a_t}(s_t,s_{t+1}) + \gamma\, Q(s_{t+1},\mathrm{argmax}_{a'}(Q(s_{t+1},a')))\)

where \(a'\) ranges over the set of all available actions
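
A minimal sketch of this target computation, assuming the Q function is available as a dense array of values for the next state (the names `q_value` and `q_next_map`, and the discount value, are illustrative, not taken from the paper):

```python
import numpy as np

GAMMA = 0.5  # discount factor; assumed value for illustration


def td_target(reward, q_next_map):
    """Q-learning target: y_t = R_{a_t}(s_t, s_{t+1}) + gamma * max_{a'} Q(s_{t+1}, a')."""
    return reward + GAMMA * np.max(q_next_map)


def td_error(q_value, reward, q_next_map):
    """Temporal difference error: delta_t = |Q(s_t, a_t) - y_t|."""
    return abs(q_value - td_target(reward, q_next_map))
```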

Method

A. State Representations

model each state \(s_t\) as an RGB-D heightmap image

  1. capture RGB-D images from a fixed-mount camera, project the data onto a 3D point cloud
  2. orthographically back-project upwards in the gravity direction to construct a heightmap image representation with both color (RGB) and height-from-bottom (D) channels
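
The two steps above can be sketched roughly as follows; the `workspace_limits` format, the 2 mm resolution, and the per-cell max-height rule are assumptions for illustration rather than the paper's exact implementation:

```python
import numpy as np


def point_cloud_to_heightmap(points_xyz, points_rgb, workspace_limits, resolution=0.002):
    """Orthographically project a 3D point cloud upward (along gravity) into an
    RGB-D heightmap: color (RGB) channels plus a height-from-bottom (D) channel."""
    (x_min, x_max), (y_min, y_max), (z_min, _) = workspace_limits
    width = int(round((x_max - x_min) / resolution))
    height = int(round((y_max - y_min) / resolution))
    color_map = np.zeros((height, width, 3), dtype=np.uint8)
    height_map = np.zeros((height, width), dtype=np.float32)

    # Keep only points inside the workspace, then bin them into grid cells.
    for (x, y, z), rgb in zip(points_xyz, points_rgb):
        if not (x_min <= x < x_max and y_min <= y < y_max):
            continue
        col = int((x - x_min) / resolution)
        row = int((y - y_min) / resolution)
        if z - z_min > height_map[row, col]:  # keep the highest point per cell
            height_map[row, col] = z - z_min
            color_map[row, col] = rgb
    return color_map, height_map
```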

B. Primitive Actions

Parameterize each action \(a_t\) as a motion primitive behavior \(\psi\) executed at the 3D location \(q\) projected from a pixel \(p\) of the heightmap image representation of the state \(s_t\):
\(a = (\psi,q) \mid \psi \in \{\mathrm{push},\mathrm{grasp}\},\ q \to p \in s_t\)
The motion primitive behaviors are defined as follows:
Pushing: \(q\) is the starting position of a 10 cm push in one of k = 16 directions

Grasping: \(q\) is the middle position of a top-down parallel-jaw grasp in one of k = 16 orientations
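
Putting the two primitives together, an action can be decoded from a (primitive, rotation index, pixel) triple roughly as sketched below; the resolution, `workspace_origin`, and return conventions are assumptions for illustration:

```python
import numpy as np

NUM_ROTATIONS = 16             # k = 16 discretized push directions / grasp orientations
PUSH_LENGTH = 0.10             # 10 cm push
HEIGHTMAP_RESOLUTION = 0.002   # assumed meters per heightmap pixel


def decode_action(primitive, rotation_idx, pixel, height_map, workspace_origin):
    """Map (psi, pixel p) to the 3D location q and angle for the chosen primitive."""
    row, col = pixel
    angle = 2 * np.pi * rotation_idx / NUM_ROTATIONS
    x = workspace_origin[0] + col * HEIGHTMAP_RESOLUTION
    y = workspace_origin[1] + row * HEIGHTMAP_RESOLUTION
    z = workspace_origin[2] + height_map[row, col]
    q = np.array([x, y, z])
    if primitive == 'push':
        # q is the start of a 10 cm push along the chosen direction
        end = q + PUSH_LENGTH * np.array([np.cos(angle), np.sin(angle), 0.0])
        return q, end
    if primitive == 'grasp':
        # q is the center of a top-down parallel-jaw grasp at this orientation
        return q, angle
    raise ValueError(f"unknown primitive: {primitive}")
```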

C. Learning Fully Convolutional Action-Value Functions

Extend vanilla deep Q-networks (DQN) by modeling the Q-function as two feed-forward fully convolutional networks \(\Phi_p\) and \(\Phi_g\), one for pushing and one for grasping

input: the heightmap image representation of the state \(s_t\)

outputs: a dense pixel-wise map of Q values with the same image size and resolution as that of \(s_t\)

Both FCNs \(\Phi_p\) and \(\Phi_g\) share the same network architecture: two parallel 121-layer DenseNets pre-trained on ImageNet, followed by channel-wise concatenation and 2 additional 1 × 1 convolutional layers interleaved with nonlinear activation functions (ReLU) and spatial batch normalization, then bilinearly upsampled.
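
A rough PyTorch sketch of one such FCN (the head widths, the handling of the 16 rotations, and the depth-channel replication are simplified assumptions; this is not the authors' released code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision


class PushGraspFCN(nn.Module):
    """One Q-value FCN (instantiated twice: Phi_p for pushing, Phi_g for grasping).
    Two DenseNet-121 trunks (color and depth), channel-wise concatenation,
    two 1x1 convolutions with BatchNorm + ReLU, then bilinear upsampling."""

    def __init__(self):
        super().__init__()
        self.color_trunk = torchvision.models.densenet121(weights="IMAGENET1K_V1").features
        self.depth_trunk = torchvision.models.densenet121(weights="IMAGENET1K_V1").features
        self.head = nn.Sequential(
            nn.BatchNorm2d(2048),            # 1024 + 1024 concatenated DenseNet features
            nn.ReLU(inplace=True),
            nn.Conv2d(2048, 64, kernel_size=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, kernel_size=1),  # one Q value per pixel
        )

    def forward(self, color, depth):
        # depth heightmap is assumed replicated to 3 channels to reuse ImageNet weights
        feat = torch.cat([self.color_trunk(color), self.depth_trunk(depth)], dim=1)
        q = self.head(feat)
        # upsample the dense Q map back to the input heightmap resolution
        return F.interpolate(q, size=color.shape[-2:], mode="bilinear", align_corners=False)
```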

D. Rewards

\(R_g(s_t,s_{t+1}) = 1\) if the grasp is successful

\(R_p(s_t,s_{t+1}) = 0.5\) if the push makes detectable changes to the environment, i.e. if the sum of pixel-wise differences between consecutive heightmaps exceeds some threshold
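
A minimal sketch of these two reward signals; the change-detection thresholds below are assumed values for illustration, not taken from the paper:

```python
import numpy as np

PIXEL_CHANGE_THRESHOLD = 0.01   # assumed per-pixel height change (meters)
CHANGE_COUNT_THRESHOLD = 300    # assumed number of changed pixels


def grasp_reward(grasp_succeeded: bool) -> float:
    """R_g = 1 if the grasp is successful (e.g. an object is held after lifting), else 0."""
    return 1.0 if grasp_succeeded else 0.0


def push_reward(heightmap_before: np.ndarray, heightmap_after: np.ndarray) -> float:
    """R_p = 0.5 if the push produced a detectable change in the scene,
    measured by comparing heightmaps before and after the push."""
    changed = np.abs(heightmap_after - heightmap_before) > PIXEL_CHANGE_THRESHOLD
    return 0.5 if np.sum(changed) > CHANGE_COUNT_THRESHOLD else 0.0
```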

E. Training details

Our Q-learning FCNs are trained at each iteration i using the Huber loss function:
\(\mathcal{L}_i = \frac{1}{2}\delta_i^2\) if \(|\delta_i| < 1\), and \(\mathcal{L}_i = |\delta_i| - \frac{1}{2}\) otherwise (the standard Huber loss applied to the temporal difference error \(\delta_i\))
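
In PyTorch, this per-iteration objective can be approximated with the built-in smooth L1 (Huber) loss, applied only at the pixel of the executed action (a hedged sketch, not the released training code):

```python
import torch
import torch.nn.functional as F


def training_loss(q_map, target_value, action_pixel):
    """Huber loss between Q(s_t, a_t) and the fixed target y_t.
    Gradients flow only through the single pixel of the executed action."""
    row, col = action_pixel
    q_sa = q_map[0, 0, row, col]                        # predicted Q at the executed pixel
    y = torch.tensor(target_value, dtype=q_sa.dtype)    # fixed target (no gradient)
    return F.smooth_l1_loss(q_sa, y)
```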

F. Testing details

Limitations

  1. Motion primitives are defined with parameters specified on a regular grid (heightmap), which provides learning efficiency with deep networks, but limits expressiveness – it would be interesting to explore other parameterizations that allow more expressive motions (without excessively inducing sample complexity), including more dynamic pushes, parallel rather than sequential combinations of pushing and grasping, and the use of more varied contact surfaces of the robot.
  2. We train our system only with blocks and test with a limited range of other shapes (fruit, bottles, etc.) – it would be interesting to train on larger varieties of shapes and further evaluate the generalization capabilities of the learned policies.
  3. We study only synergies between pushing and grasping, which are just two examples of the larger family of primitive manipulation actions, e.g. rolling, toppling, squeezing, levering, stacking, among others.
