PR17.10.4: Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic
What’s the problem? A major obstacle facing deep RL in the real world is its high sample complexity. Batch policy gradient methods offer stable learning, but at the cost of high variance, which often requires large batches. TD-style method