Paper Summary: Discovering Reinforcement Learning Agents

Introduction

Although the field of deep learning is evolving extremely fast, unique research with the potential to get us closer to Artificial General Intelligence (AGI) is rare and hard to find. One exception to this rule can be found in the field of meta-learning. Recently, meta-learning has also been applied to Reinforcement Learning (RL) with some success. The paper “Discovering Reinforcement Learning Agents” by Oh et al. from DeepMind provides a new and refreshing look at the application of meta-learning to RL.

Traditionally, RL has relied on hand-crafted algorithms such as Temporal Difference learning (TD-learning) and Monte Carlo learning, various Policy Gradient methods, or combinations thereof such as Actor-Critic models. These RL algorithms are usually finely tuned to train models for a very specific task such as playing Go or Dota. One reason for this is that multiple hyperparameters, such as the discount factor γ and the bootstrapping parameter λ, need to be adjusted for stable training. Furthermore, the update rules themselves, as well as the choice of predictors such as value functions, need to be designed carefully to ensure good performance of the model. The entire process has to be performed manually and is often tedious and time-consuming.

DeepMind is trying to change this with their latest publication. In the paper, the authors propose a new meta-learning approach that discovers the learning objective as well as the exploration procedure by interacting with a set of simple environments. They call the approach the Learned Policy Gradient (LPG). The most appealing result of the paper is that the algorithm is able to effectively generalize to more complex environments, suggesting the potential to discover novel RL frameworks purely by interaction.

In this post, I will try to explain the paper in detail and provide additional explanation wherever I had trouble understanding it myself. I will stay close to the structure of the paper so that you can find the relevant parts in the original text if you want additional details. Let's dive in!

Meta-Learning and Earlier Approaches

Deep learning (including Deep RL) is known to be extremely data-hungry. Compare that to humans who can learn new skills much more efficiently. For example, people who can ride a mountain bike can also learn how to ride a road bike very quickly. Maybe they can even learn how to ride a motorcycle without too much additional external input. Meta-learning aims to equip machine learning models with a similar capability by “learning to learn”, i.e. learning about the training process in order to adapt more quickly to new data distributions.

In the paper, the authors subdivided meta-learning frameworks according to the problem they aim to address:

  • Adapting a model trained on one or multiple tasks to a new task using only a few examples (Few-shot adaptation): this variant is exemplified by general algorithms such as MAML or Reptile, as well as RL² specifically in the context of RL.

  • Meta-learning for online adaptation of a single task: Meta-gradient RL by Xu et al. (also from DeepMind) falls in this category. This algorithm tunes hyperparameters such as γ and the bootstrapping parameter λ online while interacting with the environment. It is also possible to learn intrinsic rewards or auxiliary tasks in this manner.

  • Learning new RL algorithms: Meta-learning new algorithms from interacting with a number of environments has also been attempted by multiple groups already. For instance, the Evolved Policy Gradient method attempts to learn the policy gradient loss function using evolutionary methods. It was also recently shown by researchers from DeepMind that useful knowledge for exploration can be learned as a reward function.

All of the above approaches use the concept of a value function and try to generalize it. The framework presented in the described paper attempts, for the first time, to learn its own bootstrapping mechanism instead. Let us now have a look at how this is done.

Learned Policy Gradient (LPG)

The main goal of the paper is to find the optimal gradient update rule:

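In my own notation (a paraphrase, so the exact form in the paper may differ slightly), it reads roughly:

    η* = argmax_η E_{ε∼p(ε)} E_{θ₀∼p(θ₀)} [ E[G] ]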

Let us explain this formula in detail: the optimal update rule, parametrized by η, maximizes the expected return G at the end of the agent's lifetime.

Here, we sample from a distribution of environments p(ε) and of initial agent parameters p(θ₀). This means that after training an agent until the end of its lifetime, we want it to achieve the maximal expected return.

In order to achieve this without being specific about the type of environment we sample, we require the agent to produce two separate outputs (a minimal code sketch follows the list):

  1. The predicted policy π(a|s) as is usual in policy gradient algorithms,

  2. An m-dimensional categorical prediction vector y(s) with output in the range [0, 1].

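As a concrete illustration, here is a minimal sketch of such a two-headed agent in PyTorch. The layer sizes, the hidden dimension, and the class name LPGAgent are all arbitrary choices of mine; the paper's actual architecture differs.

```python
import torch
import torch.nn as nn


class LPGAgent(nn.Module):
    """Toy agent with two heads: a policy pi(a|s) and an m-dimensional
    prediction vector y(s) with entries in [0, 1]. Illustrative only;
    the architecture in the paper differs."""

    def __init__(self, obs_dim: int, n_actions: int, m: int = 30, hidden: int = 64):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)  # logits for pi(a|s)
        self.y_head = nn.Linear(hidden, m)               # logits for y(s)

    def forward(self, obs: torch.Tensor):
        h = self.torso(obs)
        pi = torch.softmax(self.policy_head(h), dim=-1)  # policy over actions
        y = torch.sigmoid(self.y_head(h))                # prediction vector in [0, 1]
        return pi, y


# Example: a batch of one random 10-dimensional observation, 4 actions, m = 30.
agent = LPGAgent(obs_dim=10, n_actions=4)
pi, y = agent(torch.randn(1, 10))
```

The point is simply that a single torso feeds two heads: a softmax policy and a sigmoid-bounded prediction vector y(s) whose meaning is left entirely to the meta-learner.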

Both the policy and the prediction vector y(s) are used as input to the meta-network. The meta-network is a backward LSTM that produces, at each update step, the targets π_hat and y_hat, which tell the agent how its outputs should be updated (see Figure 1). At this point, it wasn't entirely clear to me why a backward LSTM model was chosen. My understanding (although I may be wrong) is that the backward direction (from the end of the environment lifetime back to the initial agent state) corresponds to the backward direction in the gradient-based optimization of the agent.

[Figure 1. Source: arXiv preprint arXiv:2007.08794 (2020).]

The input to the meta-learning LSTM network is

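As far as I can reconstruct it (the exact set and ordering of entries may differ from the paper), the input at time step t is roughly:

    x_t = [ r_t, d_t, γ, π(a_t|s_t), y(s_t), y(s_{t+1}) ]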

where r_t is the reward at time t, d_t indicates episode termination, and γ is the aforementioned discount factor. Since the LSTM does not depend on the observation and action space explicitly, it is largely invariant to the environment. Instead, the observation and action space are taken into account only indirectly, through the policy of the trained agent, π.

Updating the agent

During an inner loop, the agent is updated using the formula

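Roughly (my reconstruction, with a weighting coefficient α that I introduce here for readability; the paper's notation may differ):

    Δθ ∝ E_π [ ∇_θ log π(a|s) · π_hat - α · ∇_θ D_KL( y(s) ‖ y_hat ) ]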

If you take a closer look at this formula, you will notice that the first term in the expectation is similar to the REINFORCE update, except that π_hat is used instead of the usual expected return G. Since π_hat is generated by the meta-learner, it allows the algorithm the flexibility to specify its own concept of a “value” function.

The second term minimizes the Kullback-Leibler divergence (a measure of how much one probability distribution differs from another) between the predicted y and the target y_hat. y provides additional information for the LSTM to discover a useful update rule, and the meta-learner may indirectly affect the policy through y.

Updating the meta-learner

The formula for updating the meta-learner LSTM is as follows:

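In rough form (again my reconstruction; H denotes entropy and β₀…β₃ are the regularization coefficients, whose exact placement in the paper may differ):

    Δη ∝ E [ ∇_η log π_{θ_N}(a|s) · G
             + β₀ ∇_η H(π_{θ_N}) + β₁ ∇_η H(y_{θ_N})
             - β₂ ∇_η ‖π_hat‖² - β₃ ∇_η ‖y_hat‖² ]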

This definitely requires some explanation. Ideally, we would like to optimize the formula for the optimal gradient update rule over a distribution of environments, as shown above. As you may notice, the expected return at the end of the lifetime depends on the policy with the end-of-lifetime parameters, which in turn depend on η. This realization leads us to the idea of computing the meta-gradient as a policy gradient. The first term in the formula above corresponds to exactly this gradient.

The other terms are regularizer terms. These were introduced since meta-learning can be very unstable, especially at the beginning of the training process, when y_hat does not have any semantics attached to it. The first two regularization terms serve to maximize the entropy of both the policy and the prediction vector. Policy entropy regularization is a well-known technique in RL (see e.g. https://arxiv.org/abs/1602.01783). The remaining terms introduce L2-regularisation. This helps to prevent rapid changes in policy and prediction updates.

Caveats

As you may expect, there are some other minor implementation issues to be solved before getting the approach to work.

First of all, when training agents across different environments, it is not possible to use a fixed learning rate for all of them. The authors explain this by the fact that the magnitude of π_hat needs to be scaled in accordance with the learning rate to keep training stable. Additionally, since η and thus π_hat change during training, we have no choice but to adjust the learning rate dynamically during meta-learning (notably, this is only needed for training the meta-learner). In the paper it is proposed to use a bandit that samples hyperparameters for each lifetime separately and updates its sampling distribution according to the end-of-lifetime return.

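To make the idea concrete, here is a toy sketch (not the paper's exact bandit, whose details I am glossing over) that keeps a value estimate per candidate learning rate, samples an arm via a softmax over those estimates, and nudges the chosen arm towards the observed end-of-lifetime return:

```python
import numpy as np


class LearningRateBandit:
    """Toy bandit over candidate learning rates (illustrative only)."""

    def __init__(self, candidates=(1e-4, 3e-4, 1e-3, 3e-3), step=0.1, temperature=1.0):
        self.candidates = np.array(candidates)
        self.values = np.zeros(len(candidates))  # running return estimate per arm
        self.step = step
        self.temperature = temperature

    def sample(self):
        """Pick a learning rate for the next agent lifetime."""
        logits = self.values / self.temperature
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        arm = np.random.choice(len(self.candidates), p=probs)
        return arm, self.candidates[arm]

    def update(self, arm, lifetime_return):
        """Move the chosen arm's estimate towards the observed end-of-lifetime return."""
        self.values[arm] += self.step * (lifetime_return - self.values[arm])


bandit = LearningRateBandit()
arm, lr = bandit.sample()                 # train one lifetime with this learning rate ...
bandit.update(arm, lifetime_return=1.7)   # ... then report back the return (dummy value)
```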

Furthermore, in the supplemental material, the authors note that they reset the lifetime whenever the entropy of the policy becomes 0 (the policy becomes deterministic) in order to prevent training collapse.

Experiments

In order to train LPG, the authors set up two families of extremely simple toy environments for the agents. The first is a simple grid world with rewards at certain cells, as shown in the figure below. The second consists of delayed Markov Decision Processes (MDPs). This is simply a sophisticated way of describing environments in which the agent makes a decision at some time step and only reaps a positive or negative reward at some later time step. Five variations of environments were used for each domain, capturing problems such as "delayed reward, noisy reward and long-term credit assignment". For more details about the experimental setup, please refer to the article.

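To make the "delayed MDP" idea concrete, here is a tiny hypothetical environment of my own (not one of the paper's actual environments): the agent's very first action determines whether it receives +1 or -1, but the reward only arrives several steps later.

```python
import random


class DelayedRewardMDP:
    """Hypothetical delayed-reward environment: the payoff for the very first
    action only arrives after `delay` further steps (illustrative only)."""

    def __init__(self, delay: int = 5):
        self.delay = delay
        self.reset()

    def reset(self):
        self.t = 0
        self.first_action = None
        return 0  # a single dummy state

    def step(self, action: int):
        if self.t == 0:
            self.first_action = action  # the only decision that matters
        self.t += 1
        done = self.t > self.delay
        # Zero reward until the end of the episode, then +1 or -1
        # depending on the action taken at the very first step.
        reward = 0.0
        if done:
            reward = 1.0 if self.first_action == 1 else -1.0
        return 0, reward, done


env = DelayedRewardMDP()
env.reset()
done, total_return = False, 0.0
while not done:
    _, r, done = env.step(random.choice([0, 1]))
    total_return += r
```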

[Figure: the toy training environments. Source: arXiv preprint arXiv:2007.08794 (2020).]

In my opinion, the two most important questions asked in the paper are:

  • How does LPG discover useful semantics of predictions?

  • Can LPG generalise to more complex environments?

Prediction Semantics

Figure 2 shows the visualized predictions from the paper. On the top left is a sample grid world with positive and negative rewards; on the bottom left, a near-optimal policy (white paths) and its values (yellow = high, blue = low). The result of the experiment is shown on the right: these are the 30 components of y for the policy on the bottom left, obtained by updating y using the LPG while keeping the policy fixed. Looking at the predictions, we can see that almost all of them have a high correlation with the true values. Some of them have large values around positive rewards, and these values are propagated to neighboring cells. As the authors point out, this "implicitly shows that the LPG is asking the agent to predict future rewards and use such information for bootstrapping".

To show that the correlation seen in the visualization is really there, a simple 1-layer perceptron is trained to predict the true values from y for various discount factors. Although y was generated with a discount factor of 0.995, the trained perceptron could also predict the true values for lower discount factors down to 0.9. This means that the framework can automatically discover a rich and useful semantics of predictions: it "can almost recover the value functions at various horizons, even though such a semantics was not enforced during meta-training". This is important, as it shows that the learned 30-dimensional vector indeed captures additional semantic information compared to a single value function. Note that this semantic information was entirely learned by the algorithm.

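Conceptually, this probe is just a regression from the 30-dimensional y-vectors to value estimates computed with a chosen discount factor. A minimal sketch of such a probe, using made-up placeholder arrays instead of the paper's data, could look like this:

```python
import numpy as np

# Placeholder arrays standing in for the real experiment:
# y_vectors holds the 30-dimensional predictions y(s) for a set of states,
# true_values holds value estimates for the same states under some discount factor.
rng = np.random.default_rng(0)
y_vectors = rng.uniform(0.0, 1.0, size=(200, 30))
true_values = rng.normal(size=200)

# Fit a single linear layer (a 1-layer "perceptron") by least squares and check
# how well the y-vectors linearly predict the values.
X = np.hstack([y_vectors, np.ones((len(y_vectors), 1))])  # add a bias column
weights, *_ = np.linalg.lstsq(X, true_values, rcond=None)
predictions = X @ weights
correlation = np.corrcoef(predictions, true_values)[0, 1]
print(f"correlation between probe predictions and true values: {correlation:.2f}")
```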

[Figure 2. Source: arXiv preprint arXiv:2007.08794 (2020).]

Generalizing to More Complex Environments

The other significant result is that LPG can seamlessly generalize to more complex environments such as Atari games, while being trained only on the toy environments described above. As shown in Figure 3, LPG can beat human-level performance in almost half of the games while not being explicitly meta-trained on such complex domains. The meta-algorithm even outperforms the much-used Advantage Actor-Critic (A2C) on some games. As you can see from the figure, the performance improves rapidly as LPG sees more and more training environments. It is conceivable that with specifically designed training environments, the performance can surpass even state-of-the-art specialized algorithms in the future.

Discussion

From the experiments it seems that the LPG is able to automate to a large degree the discovery of new RL algorithms by interacting with simple (or even more complex) environments. Since teaching humans also mostly relies on creating the appropriate environments for learning instead of fine-tuning update rules, this brings training an algorithm and training humans closer together. Moreover, the framework is able to generalize to much more complex environments than those it was trained on, potentially opening up a new approach to RL based entirely on data. Although the new approach is still lagging behind in performance compared to handcrafted RL algorithms, it can outperform A2C in a few Atari games as well as on the training environments, suggesting that it is not strictly worse than a hand-crafted approach. We also have to take into account the fact that these hand-crafted methods were perfected over years of work, while LPG is trained just using data (if we forget for a minute the training stability issue).

Perhaps the most important point, in my opinion, is the fact that this approach scales with computing power and data. This means that as our computers get faster (as they inevitably will), LPG will only get better and better. The model described in the paper was trained using a 16-core TPU-v2 for 24 hours. While this might seem prohibitive for anybody without access to Google’s vast computing resources, in a few years, anybody with a modern PC will have this computing power at his/her disposal. I am strongly convinced that completely data-based algorithms are ultimately the only path to stronger AI.

Conclusion

In this paper, the authors attempted for the first time to learn an update rule in RL from scratch, thereby avoiding the tedious process of discovering complex update rules manually. Their approach is fully data-driven and introduces an inductive bias to the learning process, something we may also expect to happen in the human brain. The paper shows that reward prediction and state evaluation emerge naturally during training on toy environments. The strong generalization capabilities of the approach suggest that it might be possible to discover extremely efficient RL algorithms from interactions with possibly simple, procedurally generated environments.

Translated from: https://towardsdatascience.com/paper-summary-discovering-reinforcement-learning-agents-3cf9447b6ecd
