[RL 8] Proximal Policy Optimization Algorithms (arXiv, 1707)

1.Introduction

  1. room for improvement: an RL method should be
    1. scalable: works with large models and parallel implementations, to make full use of resources
    2. data efficient
    3. robust: succeeds without per-problem hyperparameter tuning
  2. problems
    1. A3C: poor data efficiency
    2. TRPO: relatively complicated; incompatible with architectures that use noise (e.g. dropout) or parameter sharing
  3. PPO
    1. keeps the data efficiency and reliable performance of TRPO, while using only first-order optimization

2.Background

  1. PG (policy gradient)
    1. problems:
      1. empirically, large policy updates can be destructive
  2. Trust Region Methods (both forms are written out below)
    1. constrained
      • solved with the conjugate gradient algorithm, after a quadratic approximation to the KL constraint
      • complex to implement
    2. unconstrained, with a KL penalty
      • can be optimized with SGD
      • but tuning the penalty coefficient $\beta$ is non-trivial (no single value works well across problems)
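
For reference, the two objectives contrasted above, in the paper's notation ($\hat{A}_t$ is the advantage estimate, $\theta_{\mathrm{old}}$ the pre-update parameters): the constrained TRPO problem is

$$\underset{\theta}{\text{maximize}}\ \hat{\mathbb{E}}_{t}\left[\frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}\hat{A}_{t}\right] \quad \text{subject to}\quad \hat{\mathbb{E}}_{t}\left[\mathrm{KL}\left[\pi_{\theta_{\mathrm{old}}}(\cdot \mid s_t),\ \pi_{\theta}(\cdot \mid s_t)\right]\right] \le \delta,$$

and the unconstrained penalized form is

$$\underset{\theta}{\text{maximize}}\ \hat{\mathbb{E}}_{t}\left[\frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}\hat{A}_{t} - \beta\, \mathrm{KL}\left[\pi_{\theta_{\mathrm{old}}}(\cdot \mid s_t),\ \pi_{\theta}(\cdot \mid s_t)\right]\right].$$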

3.Clipped Surrogate Objective

$L^{CLIP}(\theta)=\hat{\mathbb{E}}_{t}\left[\min\left(r_{t}(\theta)\hat{A}_{t},\ \operatorname{clip}\left(r_{t}(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_{t}\right)\right]$

  1. $r_{t}(\theta)=\dfrac{\pi_{\theta}(a_{t} \mid s_{t})}{\pi_{\theta_{\mathrm{old}}}(a_{t} \mid s_{t})}$
  2. Motivation:
    1. the clipped objective is motivated by the TRPO surrogate objective (TODO)
    2. avoid large policy updates
      • outside $[1-\epsilon, 1+\epsilon]$ the clip turns $r(\theta)$ into a constant, so the gradient $g=0$ and the update stops pushing the ratio further
  3. Gradients
    $$\nabla_{\theta} L^{PPO}= \begin{cases} \nabla_{\theta} L_{\theta} & \text{if } \frac{\pi_{\theta}(a \mid s)}{\pi(a \mid s)} \in[1-\epsilon,\, 1+\epsilon] \ \text{ or } \ L_{\theta} < L_{\theta}^{C} \\ 0 & \text{otherwise} \end{cases}$$
    where $L_{\theta}:=\mathbb{E}_{(s, a) \in \tau \sim \pi}\left[\frac{\pi_{\theta}(a \mid s)}{\pi(a \mid s)} A^{\pi}(s, a)\right]$ and $L_{\theta}^{C}:=\mathbb{E}_{(s, a) \in \tau \sim \pi}\left[\operatorname{clip}\left(\frac{\pi_{\theta}(a \mid s)}{\pi(a \mid s)},\, 1-\epsilon,\, 1+\epsilon\right) A^{\pi}(s, a)\right]$ (per sample, the gradient vanishes exactly when the ratio is clipped and the clipped term is the smaller of the two)
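
A minimal PyTorch-style sketch of the clipped surrogate as a training loss (assuming `logp_new`, `logp_old`, and `adv` are tensors already computed from a rollout batch; the function name and arguments are illustrative, not from the paper):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    """Clipped surrogate objective L^CLIP, negated so it can be minimized.

    logp_new: log pi_theta(a_t | s_t) under the current policy
    logp_old: log pi_theta_old(a_t | s_t) under the data-collecting policy
    adv:      advantage estimates A_hat_t
    eps:      clipping parameter epsilon (0.2 in the paper's MuJoCo runs)
    """
    ratio = torch.exp(logp_new - logp_old)                      # r_t(theta)
    surr_unclipped = ratio * adv                                # r_t * A_hat_t
    surr_clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv   # clip(r_t, 1-eps, 1+eps) * A_hat_t
    # elementwise minimum: the clipped term only matters when it makes the objective worse
    return -torch.min(surr_unclipped, surr_clipped).mean()
```

Because of the $\min$, the clipped term only takes effect when it lowers the objective, so $L^{CLIP}$ is a pessimistic bound on the unclipped surrogate.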

4.Adaptive KL Penalty Coefficient

  • Objective: the TRPO surrogate objective plus a KL penalty, with the coefficient $\beta$ adapted after every policy update (update rule sketched below)
    • performs worse than the clipped objective in the paper's experiments
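
A minimal sketch of the paper's adaptive rule: after each policy update, measure the mean KL divergence between the old and new policy and grow or shrink $\beta$. The constants 1.5 and 2 are the paper's heuristics (it notes the method is not very sensitive to them); the function name and signature here are illustrative.

```python
def update_kl_penalty(beta, kl, kl_target):
    """Adapt the KL-penalty coefficient beta after a policy update.

    kl:        measured mean KL[pi_old || pi_new] over the latest batch
    kl_target: desired KL per update (d_targ in the paper)
    """
    if kl < kl_target / 1.5:
        beta /= 2.0   # policy barely moved -> weaken the penalty
    elif kl > kl_target * 1.5:
        beta *= 2.0   # policy moved too far -> strengthen the penalty
    return beta
```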

5.Algorithm

  1. variance-reduced advantage-function estimators
    1. generalized advantage estimation (GAE, 2015) TODO (see the sketch after this list)
    2. finite-horizon estimators (2016) TODO
  2. parameter sharing between the policy and value function
  3. entropy bonus to encourage exploration
  4. combined loss function (clipped surrogate + value-function error + entropy term)
  5. parallel actors, each collecting a fixed-length segment of experience
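
When policy and value networks share parameters, the paper combines these terms into one objective, $L_t^{CLIP+VF+S}(\theta)=\hat{\mathbb{E}}_t\left[L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2 S[\pi_\theta](s_t)\right]$, where $L_t^{VF}$ is a squared-error value loss and $S$ is the entropy bonus. The advantages come from the truncated GAE recursion; below is a minimal NumPy sketch of it (assuming a single $T$-step segment with a bootstrap value $V(s_T)$ appended; `gamma=0.99`, `lam=0.95` follow the paper's MuJoCo settings, and the function name is illustrative):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Truncated generalized advantage estimation over one T-step segment.

    rewards: r_0 ... r_{T-1}        (length T)
    values:  V(s_0) ... V(s_T)      (length T + 1; the last entry bootstraps the tail)
    Returns A_hat_0 ... A_hat_{T-1}.
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual delta_t
        running = delta + gamma * lam * running                 # A_hat_t = delta_t + gamma*lam*A_hat_{t+1}
        adv[t] = running
    return adv
```

Each of the N parallel actors produces one such segment per iteration; the surrogate is then optimized for K epochs of minibatch SGD before fresh data is collected.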

6.Experiments

  1. Clipping vs KL penalty (comparison of surrogate objectives)
    1. clipping outperforms the KL penalty (fixed or adaptive)
    2. setup details
      • no parameter sharing between policy and value networks
      • no entropy bonus
      • policy: MLP with 2 hidden layers of 64 units
      • 1M timesteps per task
  2. PPO vs other continuous-control on-policy algorithms
    1. PPO does better (faster learning and higher final performance)
    2. A2C >= A3C
  3. PPO in high-dimensional continuous control (3D humanoid running, steering, getting up)
    1. learns these tasks successfully
  4. PPO vs other on-policy algorithms on Atari (discrete actions)
    1. ACER TODO
      1. higher final performance
    2. PPO
      1. faster learning
