Reinforcement Learning | Part 2 - Reinforcement learning algorithms

https://www.datamachinist.com/reinforcement-learning/part-2-reinforcement-learning-algorithms/

1. Model-Free

Value-based

  • State–Action–Reward–State–Action (SARSA) – 1994
  • Q-learning = SARSA max – 1992
  • Deep Q Network (DQN) – 2013
    • Double Deep Q Network (DDQN) – 2015
    • Deep Recurrent Q Network (DRQN) – 2015
    • Dueling Q Network – 2015
    • Persistent Advantage Learning (PAL) – 2015
    • Bootstrapped Deep Q Network – 2016
    • Normalized Advantage Functions (NAF) = Continuous DQN – 2016
    • N-Step Q Learning – 2016
    • Noisy Deep Q Network (NoisyNet DQN) – 2017
    • Deep Q-learning from Demonstrations (DQfD) – 2017
    • Categorical Deep Q Network = Distributional Deep Q Network = C51 – 2017
      • Rainbow – 2017
    • Quantile Regression Deep Q Network (QR-DQN) – 2017
    • Implicit Quantile Network (IQN) – 2018
  • Mixed Monte Carlo (MMC) – 2017
  • Neural Episodic Control (NEC) – 2017
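To make the tabular end of this family concrete, here is a minimal sketch (variable names are illustrative, not from the article) contrasting the two oldest entries. It shows why Q-learning is "SARSA max": its target bootstraps from the best next action, regardless of the action actually taken (off-policy), while SARSA bootstraps from the action the policy really chose (on-policy).

```python
# Minimal tabular Q-learning vs. SARSA sketch (illustrative names).

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Q-learning: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """SARSA: bootstrap from the next action actually taken (on-policy)."""
    td_target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (td_target - Q[s][a])

# Tiny demo: 2 states, 2 actions per state.
Q = {0: [0.0, 0.0], 1: [1.0, 3.0]}
q_learning_update(Q, s=0, a=0, r=1.0, s_next=1)
# Q-learning bootstrapped from max(Q[1]) = 3.0, not from any particular next action.
```

The only difference between the two updates is the bootstrap term, which is exactly the on-policy/off-policy split in the summary table below.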

 

Policy-based

  • Cross-Entropy Method (CEM) – 1999
  • Policy Gradient
    • REINFORCE = Vanilla Policy Gradient (VPG) – 1992
    • Policy gradient softmax
    • Natural policy gradient (NPG) – 2002
    • Truncated Natural Policy Gradient (TNPG)

 

Actor-Critic

  • Advantage Actor Critic (A2C)
  • Asynchronous Advantage Actor-Critic (A3C)  – 2016
  • Generalized Advantage Estimation (GAE) – 2015
  • Trust Region Policy Optimization (TRPO) – 2015
  • Deterministic Policy Gradient (DPG) – 2014
  • Deep Deterministic Policy Gradients (DDPG)  – 2015
    • Distributed Distributional Deterministic Policy Gradients (D4PG) – 2018
    • Twin Delayed Deep Deterministic Policy Gradient (TD3) – 2018
  • Proximal Policy Optimization (PPO) – 2017
    • Distributed PPO (DPPO) – 2017
    • Clipped PPO (CPPO)  – 2017
  • Actor Critic using Kronecker-Factored Trust Region (ACKTR) – 2017
  • Actor-Critic with Experience Replay (ACER) – 2016
  • Soft Actor-Critic (SAC)  – 2018
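What all of these share is one structure: a critic that estimates value and an actor updated with an advantage signal. A minimal one-step advantage actor-critic (A2C-style) sketch, with illustrative names and a tabular critic standing in for the neural networks these methods actually use:

```python
import math

gamma = 0.99
V = {0: 0.0, 1: 0.5}   # tabular critic
theta = [0.0, 0.0]     # actor preferences in state 0

def softmax(prefs):
    exps = [math.exp(p - max(prefs)) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def actor_critic_update(s, a, r, s_next, alpha_v=0.5, alpha_pi=0.1):
    # TD error doubles as the advantage estimate A = r + gamma*V(s') - V(s).
    advantage = r + gamma * V[s_next] - V[s]
    V[s] += alpha_v * advantage              # critic: TD(0) update
    pi = softmax(theta)                      # actor: policy-gradient step scaled by A
    for b in range(len(theta)):
        grad = (1.0 if b == a else 0.0) - pi[b]
        theta[b] += alpha_pi * advantage * grad

actor_critic_update(s=0, a=1, r=1.0, s_next=1)
# Positive advantage: V[0] rises and the actor shifts probability toward action 1.
```

TRPO/PPO constrain how far each actor step may move the policy; DDPG/TD3/D4PG replace the stochastic actor with a deterministic one for continuous actions; SAC adds an entropy bonus to the objective.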

 

General Agents

  • Direct Future Prediction (DFP) – 2016
  • Covariance Matrix Adaptation Evolution Strategy (CMA-ES)
  • Relative Entropy Policy Search (REPS)
  • Reward-Weighted Regression (RWR)

 

Imitation Learning Agents

  • Behavioral Cloning (BC)
  • Conditional Imitation Learning – 2017
  • Generative Adversarial Imitation Learning (GAIL) – 2016

 

Hierarchical Reinforcement Learning Agents

  • Hierarchical Actor Critic (HAC) – 2017

 

Memory Types

  • Prioritized Experience Replay (PER) – 2015
  • Hindsight Experience Replay (HER) – 2017
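The idea behind prioritized replay can be sketched in a few lines: instead of sampling stored transitions uniformly, sample them with probability proportional to priority^alpha (the priority is typically the TD error). The buffer below is a deliberately simplified, illustrative sketch, not the sum-tree implementation used in practice.

```python
import random

class PrioritizedReplayBuffer:
    """Toy proportional prioritized replay: P(i) ~ priority_i ** alpha."""

    def __init__(self, alpha=0.6):
        self.alpha = alpha
        self.items, self.priorities = [], []

    def add(self, transition, priority):
        self.items.append(transition)
        self.priorities.append(priority ** self.alpha)

    def sample(self, rng=random):
        # Roulette-wheel selection over the (unnormalized) priorities.
        total = sum(self.priorities)
        r, acc = rng.random() * total, 0.0
        for item, p in zip(self.items, self.priorities):
            acc += p
            if r < acc:
                return item
        return self.items[-1]

buf = PrioritizedReplayBuffer()
buf.add(("s0", "a0", 0.0, "s1"), priority=0.1)
buf.add(("s1", "a1", 1.0, "s2"), priority=10.0)
# The high-priority transition now dominates sampling.
```

Hindsight experience replay changes what is stored rather than how it is sampled: failed trajectories are re-labeled with the goal actually reached, so they still carry reward signal.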

 

Exploration Techniques

  • Epsilon-greedy (ε-greedy)
  • Boltzmann
  • Ornstein–Uhlenbeck process
  • Normal Noise
  • Truncated Normal Noise
  • Bootstrapped Deep Q Network 
  • UCB Exploration via Q-Ensembles (UCB) 
  • Noisy Networks for Exploration 
  • Intrinsic Curiosity Module (ICM) – 2017
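The first two entries are simple enough to sketch directly over a vector of Q-values (function names are illustrative): ε-greedy mixes a uniform random action into the greedy choice, while Boltzmann exploration samples actions with probability proportional to exp(Q/τ), so the temperature τ controls how sharply it prefers high-value actions.

```python
import math
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniform random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def boltzmann(q_values, temperature, rng=random):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    prefs = [q / temperature for q in q_values]
    m = max(prefs)                              # subtract max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    r, acc = rng.random(), 0.0
    for a, e in enumerate(exps):
        acc += e / z
        if r < acc:
            return a
    return len(q_values) - 1

q = [0.1, 0.9, 0.3]
print(epsilon_greedy(q, epsilon=0.0))  # 1: purely greedy when epsilon = 0
```

The noise-based entries (Ornstein–Uhlenbeck, normal, truncated normal) play the same role for continuous actions: they perturb the actor's output rather than randomizing a discrete choice.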

 

2. Model-Based

  • DYNA-Q
  • Dataset Aggregation (Dagger)
  • Monte Carlo Tree Search (MCTS) (e.g. AlphaZero)
  • Dynamic Programming
  • Model Predictive Control
  • Probabilistic Inference for Learning Control (PILCO)
  • Guided Policy Search (GPS)
    • Policy search with Gaussian Process
    • Policy search with backpropagation
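Dyna-Q is the simplest bridge between the two families: each real step triggers one direct Q-learning update plus n "planning" updates replayed from a learned model. A minimal sketch with a deterministic tabular model (illustrative names, not from the article):

```python
import random

def dyna_q_step(Q, model, s, a, r, s_next,
                alpha=0.1, gamma=0.99, n_planning=5, rng=random):
    # 1. Direct RL: ordinary Q-learning update from the real transition.
    Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
    # 2. Model learning: remember the transition (deterministic world assumed).
    model[(s, a)] = (r, s_next)
    # 3. Planning: replay n simulated transitions drawn from the model.
    for _ in range(n_planning):
        (ps, pa), (pr, ps_next) = rng.choice(list(model.items()))
        Q[ps][pa] += alpha * (pr + gamma * max(Q[ps_next]) - Q[ps][pa])

Q = {0: [0.0, 0.0], 1: [0.0, 0.0]}
model = {}
dyna_q_step(Q, model, s=0, a=0, r=1.0, s_next=1)
# One real update plus 5 planning updates on the same transition,
# so Q[0][0] converges toward 1.0 much faster than with direct RL alone.
```

The planning loop is what makes it model-based: the extra updates cost only computation, not new environment interaction.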

 

Summary

| Algorithm | Model-free or model-based | Agent type | Policy | Policy type | Monte Carlo or Temporal Difference (TD) | Action space | State space |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Tabular Q-learning (= SARSA max), Q-learning(λ) | Model-free | Value-based | Off-policy | Pseudo-deterministic (ε-greedy) | TD | Discrete | Discrete |
| SARSA, SARSA(λ) | Model-free | Value-based | On-policy | Pseudo-deterministic (ε-greedy) | TD | Discrete | Discrete |
| DQN, N-step DQN, Double DQN, Noisy DQN, Prioritized Replay DQN, Dueling DQN, Categorical DQN, Distributional DQN (C51) | Model-free | Value-based | Off-policy | Pseudo-deterministic (ε-greedy) | | Discrete | Continuous |
| Cross-entropy method | Model-free | Policy-based | On-policy | | Monte Carlo | | |
| REINFORCE (Vanilla policy gradient) | Model-free | Policy-based | On-policy | Stochastic policy | Monte Carlo | | |
| Policy gradient softmax | Model-free | | | Stochastic policy | | | |
| Natural Policy Gradient | Model-free | | | Stochastic policy | | | |
| TRPO | Model-free | Policy-based | On-policy (?) | Stochastic policy | | Continuous | Continuous |
| PPO | Model-free | Policy-based | On-policy (?) | Stochastic policy | | Continuous | Continuous |
| Distributed PPO | Model-free | Policy-based | | | | Continuous | Continuous |
| A2C | Model-free | Actor-critic | On-policy | Stochastic policy | TD | Continuous | |
| A3C | | Actor-critic | On-policy | | | | |
| DDPG (A2C family) | Model-free | Actor-critic | Off-policy | Deterministic policy | | Continuous | Continuous |
| TD3 | Model-free | Actor-critic | | | | Continuous | Continuous |
| D4PG | | | | | | | |
| SAC | Model-free | Actor-critic | Off-policy | | | | |
| Dyna-Q | | | | | | | |
| Curiosity Model | | | | | | | |
| NAF | Model-free | | | | | Continuous | |
| DAgger | | | | | | | |
| MCTS | | | | | | | |
| Dynamic programming | | | | | | | |
| GPS | | | | | | | |
| Model Predictive Control | Model-based | | | | | | |
| PILCO | Model-based | | | | | | |
| Policy search with Gaussian Process | Model-based | | | | | | |
| Policy search with backpropagation | Model-based | | | | | | |

 

Conclusion

We have just seen some of the most widely used RL algorithms. In the next article, we will look at the challenges and applications of RL in robotics.
