Formulate the task of pushing-for-grasping as a Markov decision process:
a given state \(s_t\)
an action \(a_t\)
a policy \(\pi(s_t)\)
a new state \(s_{t+1}\)
an immediate corresponding reward \(R_{a_t}(s_t,s_{t+1})\)
Goal
find an optimal policy \(\pi^*\) that maximizes the expected sum of future rewards, given by \(R_t = \sum_{i=t}^{\infty}\gamma^{i-t} R_{a_i}(s_i,s_{i+1})\)
i.e. the \(\gamma\)-discounted sum over an infinite horizon of future returns from time \(t\) to \(\infty\)
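A minimal Python sketch of how the \(\gamma\)-discounted return accumulates over a short rollout; the reward values and \(\gamma = 0.5\) below are illustrative assumptions, not prescribed here.

```python
# Tiny numeric sketch of the gamma-discounted return R_t (illustrative values).
def discounted_return(rewards, gamma=0.5):
    """Compute sum_{i=t}^{T} gamma^(i-t) * R_i, taking t = 0 for this list."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))

# e.g. a push that changes the scene (0.5) followed by a successful grasp (1.0):
print(discounted_return([0.5, 1.0]))  # 0.5 + 0.5 * 1.0 = 1.0
```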
The learning objective is to iteratively minimize the temporal difference error \(\delta_t\) of \(Q_\pi(s_t,a_t)\) to a fixed target value \(y_t\):
\(\delta_t = |Q(s_t,a_t)-y_t|\)
\(y_t = R_{a_t}(s_t,s_{t+1}) + \gamma\, Q(s_{t+1}, \mathrm{argmax}_{a'}(Q(s_{t+1},a')))\)
where \(a'\) ranges over the set of all available actions
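A hedged PyTorch sketch of computing \(y_t\) and \(\delta_t\) from a pixel-wise Q map flattened into a vector; `q_map`, its inputs, and \(\gamma = 0.5\) are assumptions for illustration, not the authors' code.

```python
import torch

# Sketch of the TD target y_t and error delta_t above. `q_map` is a hypothetical
# function returning a flat tensor of Q values, one entry per discrete action.
def td_error(q_map, s_t, a_t, r_t, s_next, gamma=0.5):
    q_sa = q_map(s_t)[a_t]                            # Q(s_t, a_t) for the taken action
    with torch.no_grad():                             # the target y_t is held fixed
        q_next = q_map(s_next)                        # in practice often a lagged copy of Q
        y_t = r_t + gamma * q_next[q_next.argmax()]   # R + gamma * Q(s_{t+1}, argmax_a' Q)
    return (q_sa - y_t).abs()                         # delta_t = |Q(s_t, a_t) - y_t|
```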
model each state \(s_t\) as an RGB-D heightmap image
Parameterize each action \(a_t\) as a motion primitive behavior \(\psi\) executed at the 3D location \(q\) projected from a pixel \(p\) of the heightmap image representation of the state \(s_t\):
\(a = (\psi, q) \mid \psi \in \{\mathrm{push}, \mathrm{grasp}\},\ q \to p \in s_t\)
motion primitive behaviors are defined as follows:
Pushing: \(q\) denotes the starting position of a 10 cm push in one of \(k = 16\) directions
Grasping: \(q\) denotes the middle position of a top-down parallel-jaw grasp in one of \(k = 16\) orientations
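One way to read an action \(a = (\psi, q)\) off the dense Q maps is to pick the primitive and rotated map with the highest pixel-wise Q value; the sketch below assumes 16 rotation slices per primitive and uses hypothetical array names, not the authors' implementation.

```python
import numpy as np

# Sketch: decode the best pixel of the 16 rotated pixel-wise Q maps into an
# action a = (psi, q). q_push and q_grasp are assumed to be (16, H, W) arrays.
def select_action(q_push, q_grasp):
    psi = "push" if q_push.max() >= q_grasp.max() else "grasp"
    q_maps = q_push if psi == "push" else q_grasp
    rot_idx, row, col = np.unravel_index(q_maps.argmax(), q_maps.shape)
    angle = rot_idx * (360.0 / 16)     # one of k = 16 push directions / grasp orientations
    return psi, angle, (row, col)      # pixel p; its 3D location q comes from the heightmap
```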
extend vanilla deep Q-networks (DQN) by modeling the Q-function as two feed-forward fully convolutional networks (FCNs) \(\Phi_p\) and \(\Phi_g\)
input: the heightmap image representation of the state \(s_t\)
output: a dense pixel-wise map of Q values with the same image size and resolution as that of \(s_t\)
Both FCNs \(\Phi_p\) and \(\Phi_g\) share the same network architecture: two parallel 121-layer DenseNets pre-trained on ImageNet, followed by channel-wise concatenation and 2 additional 1×1 convolutional layers interleaved with nonlinear activation functions (ReLU) and spatial batch normalization, then bilinear upsampling.
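A minimal PyTorch sketch of one such FCN (assuming a recent torchvision for the pretrained DenseNet-121 weights); the 1×1 conv widths, layer ordering, and upsampling details are illustrative assumptions, not the exact published architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class QMapFCN(nn.Module):
    """Sketch of one pixel-wise Q head (e.g. Phi_p or Phi_g); widths are illustrative."""
    def __init__(self):
        super().__init__()
        # Two parallel DenseNet-121 trunks pre-trained on ImageNet: one for the
        # color heightmap, one for the depth heightmap (tiled to 3 channels).
        self.color_trunk = models.densenet121(weights="IMAGENET1K_V1").features
        self.depth_trunk = models.densenet121(weights="IMAGENET1K_V1").features
        # Channel-wise concatenation (1024 + 1024 features), then two 1x1 convs
        # interleaved with spatial batch normalization and ReLU.
        self.head = nn.Sequential(
            nn.BatchNorm2d(2048), nn.ReLU(inplace=True),
            nn.Conv2d(2048, 64, kernel_size=1),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, kernel_size=1),  # one Q value per (downsampled) pixel
        )

    def forward(self, color, depth):
        feat = torch.cat([self.color_trunk(color), self.depth_trunk(depth)], dim=1)
        # Bilinearly upsample back to the input heightmap resolution.
        return F.interpolate(self.head(feat), size=color.shape[-2:],
                             mode="bilinear", align_corners=False)
```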
\(R_g(s_t,s_{t+1}) = 1\) if the grasp is successful
\(R_p(s_t,s_{t+1}) = 0.5\) if the push makes detectable changes to the environment, i.e. if the sum of differences between the heightmaps before and after the push exceeds some threshold
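A small sketch of this reward assignment; the specific values below (the per-pixel change tolerance and the count threshold `tau`) are illustrative assumptions, since the text only says the heightmap difference must exceed "some threshold".

```python
import numpy as np

# Sketch of the reward assignment above; threshold values are assumptions.
def reward(primitive, grasp_success, height_before, height_after, tau=300):
    if primitive == "grasp":
        return 1.0 if grasp_success else 0.0
    # Pushing: 0.5 only if the push made a detectable change in the heightmap.
    changed = np.sum(np.abs(height_after - height_before) > 0.01)
    return 0.5 if changed > tau else 0.0
```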
Our Q-learning FCNs are trained at each iteration \(i\) using the Huber loss function: