An Overview of Classic Reinforcement Learning Algorithms (Part 1)

Reinforcement learning has gained tremendous popularity in the last decade with a series of successful real-world applications in robotics, games and many other fields.

In this article, I will provide a high-level structural overview of classic reinforcement learning algorithms. The discussion will focus on their similarities and differences in the inner workings of each algorithm.

RL Basics

Let’s start with a quick refresher on some basic concepts. If you are already familiar with all the terms of RL, feel free to skip this section.

Reinforcement learning models are a type of state-based model that utilizes the Markov decision process (MDP). The basic elements of RL include:

Episode (rollout): playing out the whole sequence of states and actions until reaching the terminal state;

Current state s (or s_t): the state the agent is currently in;

Next state s' (or s_{t+1}): the state that follows the current state;

Action a: the action to take at state s;

Transition probability P(s'|s, a): the probability of reaching state s' when taking action a at state s;

Policy π(s, a): a mapping from each state to an action that determines how the agent acts at each state. It can be either deterministic or stochastic;

Reward r (or R(s, a)): a reward function that generates rewards for taking action a at state s;

Return G_t: the total future reward from state s_t onward;

Value V(s): expected return for starting from state s;

Q value Q(s, a): expected return for starting from state s and taking action a;

Bellman equation

According to the Bellman equation, the current value is equal to the current reward plus the discounted (γ) value at the next step, following the policy π. It can also be expressed using the Q value as:
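
In standard notation, with a deterministic policy π and discount factor γ, these two forms read:

$$V^{\pi}(s) = R(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s)) \, V^{\pi}(s')$$

$$Q^{\pi}(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, Q^{\pi}(s', \pi(s'))$$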

This is the theoretical core in most reinforcement learning algorithms.

Prediction vs. Control Tasks

There are two fundamental tasks of reinforcement learning: prediction and control.

In prediction tasks, we are given a policy and our goal is to evaluate it by estimating the value or Q value of taking actions following this policy.

In control tasks, we don’t know the policy, and the goal is to find the optimal policy that allows us to collect most rewards. In this article, we will only focus on control problems.

RL Algorithm Structure

Below is a graph I made to visualize the high-level structure of different types of algorithms. In the next few sections, we will delve into the intricacies of each type.

MDP World

In the MDP world, we have a mental model of how the world works, meaning that we know the MDP dynamics (transition P(s’|s,a) and reward function R(s, a)), so we can directly build a model using the Bellman equation.

Again, in control tasks our goal is to find a policy that gives us maximum rewards. To achieve it, we use dynamic programming.

Dynamic Programming (Iterative Methods)

1. Policy Iteration

Policy iteration essentially performs two steps repeatedly until convergence: policy evaluation and policy improvement.

In the policy evaluation step, we evaluate the policy π at state s by calculating the Q value using the Bellman equation:
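
One standard way to write this evaluation step, sweeping over all state-action pairs until the values stop changing, is:

$$Q^{\pi}_{k+1}(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, Q^{\pi}_{k}(s', \pi(s'))$$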

In the policy improvement step, we update the policy by greedily searching for the action that maximizes the Q value at each step.
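
In symbols, the greedy improvement step is:

$$\pi_{\text{new}}(s) = \arg\max_{a} Q^{\pi}(s, a)$$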

Let’s see how policy iteration works.
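
Here is a minimal sketch of the two alternating steps in Python, assuming the dynamics are given as arrays P[s, a, s'] and R[s, a] (the function name and array layout are my own, not from the original post):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, eval_sweeps=100):
    """Minimal policy iteration sketch.

    P: transition probabilities, shape (n_states, n_actions, n_states)
    R: rewards, shape (n_states, n_actions)
    """
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)        # arbitrary initial policy
    Q = np.zeros((n_states, n_actions))

    while True:
        # 1. Policy evaluation: repeatedly apply the Bellman equation under the current policy
        for _ in range(eval_sweeps):
            V = Q[np.arange(n_states), policy]    # V(s) = Q(s, pi(s))
            Q = R + gamma * P @ V                 # Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
        # 2. Policy improvement: act greedily with respect to the evaluated Q
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):    # policy stable -> converged
            return policy, Q
        policy = new_policy
```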

2. Value Iteration

Value iteration combines the two steps in policy iteration so we only need to update the Q value. We can interpret value iteration as always following a greedy policy because at each step it always tries to find and take the action that maximizes the value. Once the values converge, the optimal policy can be extracted from the value function.
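
In symbols, the combined update and the final policy extraction are:

$$Q_{k+1}(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, \max_{a'} Q_{k}(s', a'), \qquad \pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)$$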

In most real-world scenarios, we don’t know the MDP dynamics so the applications of iterative methods are limited. In the next section, we will switch gears and discuss reinforcement learning methods that can deal with the unknown world.

Reinforcement Learning World

In most cases, the MDP dynamics are either unknown, or computationally infeasible to use directly, so instead of building a mental model we learn from sampling. In all the following reinforcement learning algorithms, we need to take actions in the environment to collect rewards and estimate our objectives.

Exploration-exploitation Dilemma

In MDP models, we can explore all potential states before we come up with a good solution using the transition probability function. However, in reinforcement learning where the transition is unknown, if we keep greedily searching for the best action, we might end up stuck in a few states without being able to explore the entire environment. This is the exploration-exploitation dilemma.

To get out of suboptimal states, we usually use a strategy called epsilon-greedy: when selecting the best action, there is a probability ε that we take a random action instead.
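
A tiny sketch of epsilon-greedy action selection (the function name is mine):

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy one.

    Q: array of shape (n_states, n_actions) holding the current Q estimates.
    """
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])   # explore: uniformly random action
    return int(np.argmax(Q[state]))            # exploit: best action found so far
```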

Model-based Reinforcement Learning

One way to estimate the MDP dynamics is sampling. Following a random policy, we sample many (s, a, r, s') tuples and use Monte Carlo (counting the occurrences) to estimate the transition and reward functions explicitly from the data. If the data size is large enough, our estimates should be very close to the true values.
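
Concretely, the counting estimates are the usual maximum-likelihood ones:

$$\hat{P}(s' \mid s, a) = \frac{\#(s, a, s')}{\#(s, a)}, \qquad \hat{R}(s, a) = \frac{1}{\#(s, a)} \sum_{i:\, (s_i, a_i) = (s, a)} r_i$$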

Once we have the estimates, we can use iterative methods to search for the optimal policy.

Model-free Reinforcement Learning (Tabular)

Let’s take a step back. If our goal is to just find good policies, all we need is to get a good estimate of Q. From that perspective, estimating the model (transitions and rewards) was just a means towards an end. Why not just cut to the chase and estimate Q directly?

This is called model-free learning.

1. Model-free Monte Carlo

Recall that Q(s,a) is the expected utility when the agent takes action a from state s.

The idea of model-free Monte Carlo is to sample many rollouts, and use the data to estimate Q. Let’s take a look at the algorithm.

We first randomly initialize everything and use epsilon-greedy to sample actions, then we start playing rollouts. At the end of each rollout, we calculate the return G_t for each state s_t in the rollout. To get Q(s_t, a_t), the average of the returns G_t, we could store all the returns and update Q when we finish sampling. However, a more efficient way is to update Q incrementally at the end of each rollout using a moving average, as shown below.
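
The incremental (moving-average) form of the update, where N(s_t, a_t) counts how many times the pair has been visited so far, is:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \frac{1}{N(s_t, a_t)} \big( G_t - Q(s_t, a_t) \big)$$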

2. SARSA

SARSA is a Temporal Difference (TD) method, which combines both Monte Carlo and dynamic programming methods. The update equation has a similar form to Monte Carlo's online update equation, except that SARSA uses r_t + γQ(s_{t+1}, a_{t+1}) to replace the actual return G_t from the data, and the 1/N(s, a) factor is replaced by a constant step-size parameter α.
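
Putting that together, the SARSA update takes the standard form:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big( r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \big)$$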

Recall that in Monte Carlo, we need to wait for the episode to finish before we can update the Q value. The advantage of TD methods is that they can update the estimate of Q immediately after we move one step and obtain a new transition (s_t, a_t, r_t, s_{t+1}, a_{t+1}).

3. Q-learning

Q-learning is another type of TD method. The difference between SARSA and Q-learning is that SARSA is an on-policy method while Q-learning is off-policy. In SARSA, our target at state s_t is r_t + γQ(s_{t+1}, a_{t+1}), where Q(s_{t+1}, a_{t+1}) is computed from the tuple (s_t, a_t, r_t, s_{t+1}, a_{t+1}) obtained by following policy π. In Q-learning, however, Q(s_{t+1}, a_{t+1}) is obtained by taking the optimal action, which is not necessarily the action our policy would take.

In general, on-policy methods are more stable, while off-policy methods are more likely to find the global optimum. We can see below that, except for the update equation, the other parts of the algorithm are the same as in SARSA.
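
As a rough sketch, tabular Q-learning looks roughly as follows (assuming a Gym-style discrete environment where env.reset() returns a state index and env.step(a) returns (next_state, reward, done, info); the function name and hyperparameters are illustrative):

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=5000,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning sketch; only the target differs from SARSA."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy behavior policy
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done, _ = env.step(a)
            # off-policy target: bootstrap from the best next action, not the one the policy takes
            target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```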

Value Function Approximation (VFA)

So far, we have been assuming we can represent the value function V or state-action value function Q as a tabular representation (vector or matrix). However, many real world problems have enormous state and/or action spaces for which tabular representation is insufficient. Naturally, we might wonder if we can parameterize the value function so we don’t have to store a table.

VFA is exactly that: it represents the Q value function with a parameterized function Q̂ (Q hat).

The state and action are represented as a feature vector x(s, a), and the estimate Q̂ is the score of a linear predictor.
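
With a weight vector w, the linear approximation is simply:

$$\hat{Q}(s, a; w) = x(s, a)^{\top} w$$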

The objective is to minimize the loss between the estimated Q (the prediction) and the real Q (the target), and we can use stochastic gradient descent to solve this optimization problem.

How do we get our target — the real Q value in the objective function?

Recall that the Q value is the expected return (G_t), so one way to get the target Q value is to use Monte Carlo: playing many episodes and averaging the observed returns.

We have the following objective functions and gradients for parameterized Monte Carlo:
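
In a standard form (using the sampled return G_t as the target):

$$J(w) = \mathbb{E}\big[ \big( G_t - \hat{Q}(s_t, a_t; w) \big)^2 \big], \qquad \Delta w = \alpha \big( G_t - \hat{Q}(s_t, a_t; w) \big) \, x(s_t, a_t)$$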

Another way is to utilize the recursive expression of the Q value: Q(s_t, a_t) = r_t + γQ(s_{t+1}, a_{t+1}). As we discussed earlier, Temporal Difference (TD) methods combine both Monte Carlo and dynamic programming and allow real-time updates. Hence we can also obtain the target Q value using TD methods: SARSA and Q-learning.

SARSA:
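
A standard form of the semi-gradient weight update with the SARSA target is:

$$\Delta w = \alpha \big( r_t + \gamma \hat{Q}(s_{t+1}, a_{t+1}; w) - \hat{Q}(s_t, a_t; w) \big) \, x(s_t, a_t)$$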

Q-learning:
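
And with the Q-learning target:

$$\Delta w = \alpha \big( r_t + \gamma \max_{a'} \hat{Q}(s_{t+1}, a'; w) - \hat{Q}(s_t, a_t; w) \big) \, x(s_t, a_t)$$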

Notice that in the above TD methods, we are actually using the model's own prediction to approximate the real target value Q(s_{t+1}, a_{t+1}); this type of optimization is called semi-gradient.

Deep Q Networks (DQN)

Linear VFA often works well if we have the right set of features, which usually requires careful hand design. An alternative is to use deep neural networks that directly take the state as input without requiring an explicit specification of features.

For example, in the graph below we have a neural network with the state s as the input layer, 2 hidden layers, and the predicted Q values as the output. The parameters can be learned through backpropagation.

There are two important concepts in DQN: target net and experience replay.

As you may have realized, a problem with using the semi-gradient is that the model updates can be very unstable, since the real target changes each time the model updates itself. The solution is to create a target network that copies the training model at a certain frequency, so the target model updates less frequently. In the equation below, w⁻ denotes the weights of the target network.
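
A standard way to write the resulting loss, with the target weights w⁻ held fixed between synchronizations, is:

$$L(w) = \mathbb{E}\big[ \big( r + \gamma \max_{a'} \hat{Q}(s', a'; w^{-}) - \hat{Q}(s, a; w) \big)^2 \big]$$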

We also create an experience replay buffer that stores the (s, a, r, s', a') tuples from prior episodes. When we update the weights, instead of using the most recent transition generated from the current episode, we randomly sample an experience from the replay buffer to run stochastic gradient descent. This helps avoid overfitting to recent, highly correlated experience.
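
A minimal sketch of such a buffer (the class and method names are mine; I store a done flag alongside each transition, which is a common convenience):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of past transitions for experience replay."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are dropped automatically

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform random sampling breaks the correlation between consecutive steps
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```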

I have previously implemented DQN with Tensorflow to play the CartPole game. If you are interested in learning more about the implementation, check out my article here.

Policy Gradient

Different from the previous algorithms that model the value function, policy gradient methods directly learn the policy by parameterizing it as:
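
That is, the policy itself becomes a parameterized probability of choosing action a in state s:

$$\pi_{\theta}(a \mid s) = P(a \mid s; \theta)$$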

However, when it comes to optimization, we still have to go back to the value function, since the policy function cannot be used as an objective function by itself. Our objective can be expressed as the value function V(θ), which is the expected total reward we collect from trajectories τ generated by following the stochastic policy π_θ. Here θ are the parameters of the policy. Be careful not to misinterpret V(θ) as the parameterized value function.
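
In a standard form, the objective is the expected return over trajectories sampled by following π_θ:

$$V(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}} \big[ R(\tau) \big] = \sum_{\tau} P(\tau; \theta) \, R(\tau)$$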

Where τ is a state-action trajectory:
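
For an episode of length T, this is simply the sequence of visited states and chosen actions:

$$\tau = (s_0, a_0, s_1, a_1, \dots, s_{T-1}, a_{T-1}, s_T)$$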

R(τ) is the sum of rewards for a trajectory τ:
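
That is, simply the total reward collected along the trajectory:

$$R(\tau) = \sum_{t=0}^{T-1} r_t$$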

Now, the goal is to find the policy parameters θ that maximize the value V(θ). To do so, we search for a maximum of V(θ) by ascending the gradient of the objective with respect to the policy parameters θ.

In the gradient shown below, the policy π_θ is usually modeled as a softmax, Gaussian, or neural network policy to ensure it is differentiable.
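
A standard likelihood-ratio (REINFORCE-style) estimate of this gradient from m sampled trajectories is:

$$\nabla_{\theta} V(\theta) \approx \frac{1}{m} \sum_{i=1}^{m} \sum_{t} \nabla_{\theta} \log \pi_{\theta}\big(a_t^{(i)} \mid s_t^{(i)}\big) \, G_t^{(i)}$$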

Let's see what the policy gradient algorithm looks like.
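
As a rough sketch, here is what the update looks like for a tabular softmax policy (all names and step sizes are illustrative, not from the original post):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
    """One REINFORCE-style policy gradient update from a finished episode.

    theta:   (n_states, n_actions) logits of a tabular softmax policy
    episode: list of (state, action, reward) tuples from one rollout
    """
    # compute the return G_t at every step of the rollout
    G, returns = 0.0, []
    for _, _, r in reversed(episode):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()

    # ascend the gradient: for a softmax policy, grad log pi(a|s) = one_hot(a) - pi(.|s)
    for (s, a, _), G_t in zip(episode, returns):
        probs = softmax(theta[s])
        grad_log = -probs
        grad_log[a] += 1.0
        theta[s] += alpha * G_t * grad_log
    return theta
```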

Actor-critic

Actor-critic methods differ from pure policy gradient methods in that they estimate both the policy and the value function, and update both. In policy gradient methods, we update θ using G_t, an estimate of the value at s_t obtained from a single rollout. Although this estimate is unbiased, it has high variance.

To address this issue, actor-critic methods introduce bias by using bootstrapping and function approximation. Instead of using G_t from the rollout directly, we estimate the value with a parameterized function, and this is where the critic comes in.

Here is the “vanilla” actor-critic policy gradient:
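
As a rough sketch in the same spirit, here is a Monte Carlo actor-critic update with a tabular softmax actor and a learned state-value baseline as the critic (names and step sizes are illustrative, not from the original post):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def actor_critic_update(theta, b, episode, alpha_pi=0.01, alpha_b=0.1, gamma=0.99):
    """One vanilla actor-critic style update from a finished episode.

    theta:   (n_states, n_actions) softmax policy parameters (the actor)
    b:       (n_states,) state-value baseline estimates (the critic)
    episode: list of (state, action, reward) tuples collected with the current policy
    """
    # discounted returns G_t for every step of the rollout
    G, returns = 0.0, []
    for _, _, r in reversed(episode):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()

    for (s, a, _), G_t in zip(episode, returns):
        advantage = G_t - b[s]                  # A_t = G_t - b(s_t)
        b[s] += alpha_b * advantage             # critic: move b(s) toward the observed return
        probs = softmax(theta[s])
        grad_log = -probs                       # grad log pi(a|s) for a softmax policy
        grad_log[a] += 1.0
        theta[s] += alpha_pi * advantage * grad_log   # actor: update with A_t instead of G_t
    return theta, b
```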

In the above procedure, two new terms are introduced: the advantage (A_t) and the baseline (b(s)).

b(s_t) is the expected total future reward at state s_t, equivalent to V(s_t), and this is the value function the critic estimates.

After every episode, we update both the value function b(s) and the policy function. Unlike in the plain policy gradient method, the policy here is updated with A_t instead of G_t, which helps reduce the variance of the gradient.

On top of the vanilla actor-critic, there are two popular actor-critic methods, A3C and A2C, that update the policy and value functions with multiple workers. Their main difference is that A3C performs asynchronous updates while A2C is synchronous.

Conclusions & Thoughts

In this article, we had an overview of many classic and popular reinforcement learning algorithms and discussed the similarities and differences in their inner workings.

It's worth mentioning that there are many variants in each model family that we have not covered. For example, in the DQN family, there are dueling DQN and double DQN. In the policy gradient and actor-critic families, there are DDPG, ACER, SAC, etc.

Additionally, there is another type of RL method: evolution strategies (ES). Inspired by the theory of natural selection, ES solves problems even when there isn't a precise analytic form of the objective function. As they are beyond the scope of MDPs, I didn't include them in this article, but I might discuss them in future articles. Stay tuned! :)

Source: https://towardsdatascience.com/an-overview-of-classic-reinforcement-learning-algorithms-part-1-f79c8b87e5af
