Uncertainty-Aware Reinforcement Learning

Model-based Reinforcement Learning (RL) owes most of its appeal to sample efficiency: it is undemanding about how much data it needs as input, though there is a cap on what we should expect the model to achieve.

The model is unlikely to turn out a perfect representation of the environment. While interacting with the real world through the trained agent, we may meet states and rewards different from the ones seen during training. For model-based RL to work, we need to overcome this problem. It's vital: it's what will help our agent know what it's doing.

First, what of model-free RL? Model-free RL always uses the ground-truth transitions of the environment when training and testing the agent. Unless there are offsets we introduce, such as simulation-to-real transfer, in which case we can't blame the algorithm. Uncertainty is, therefore, not a big worry here. For something like a Q function Q(s, a), which optimizes over actions, we could attempt to integrate certainty awareness into action selection. But since it works well anyway, there's no harm, for now, in closing our eyes and pretending we didn't see that.

Contents

1. Source of uncertainty in Model-based RL
2. The benefit of uncertainty awareness
3. Building uncertainty-aware models
   - What might seem to work
   - What does work
4. Conclusion

Source of Uncertainty

Model uncertainty results from the distribution mismatch between the data the model sees during testing and that used to train the model. We test the agent on a distribution different from that seen during training.


What difference would uncertainty awareness make, exactly?

At the start of training, the model p(sₜ₊₁ | sₜ, aₜ) has been exposed to only a small amount of real-world data. We hope the function doesn't over-fit to this small quantity, because we need it to stay expressive enough to capture the transitions at later time steps, by which point real data will have accumulated to learn the precise model.

This is challenging to achieve in Model-based RL. Why? The simple goal of RL is maximizing the future cumulative reward. The planner, while aiming for this, attempts to follow plans for which the model predicts high reward. So if the model overestimates the reward it will get for a particular action sequence, the planner will be glad to follow that gleaming but erroneous estimate. Selecting such actions in the real world then results in funny behaviour. In short, the planner is motivated to exploit the positive mistakes of the model.

(We can think of the planner as the method we use to select optimal actions given the world states).


And can this get worse? In high-dimensional spaces — where the input is an image, for instance — the model will make many more mistakes, owing to latent variables. It's common in model-based RL to alleviate the distribution mismatch by using on-policy data collection: transitions observed in the real world are added to the training data and used to re-plan and correct deviations in the model. In this case, though, the mistakes will be too numerous for the on-policy fix to catch up with the lost model. The abundance of errors might cause the policy to change every time we re-plan, and as a result, the model may never converge.

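For concreteness, here is a minimal sketch of that on-policy data-aggregation loop. Everything in it is an illustrative placeholder (a toy linear dynamics model, a random-sampling planner with a toy reward, and made-up environment dynamics), not a reference implementation:

```python
import numpy as np

# A minimal sketch of on-policy data aggregation for model-based RL.
# The model, planner, and environment here are hypothetical stand-ins.

rng = np.random.default_rng(0)

def train_model(dataset):
    """Fit a placeholder linear dynamics model s' ~ [s, a] @ W on (s, a, s') tuples."""
    S = np.array([d[0] for d in dataset])
    A = np.array([d[1] for d in dataset])
    S_next = np.array([d[2] for d in dataset])
    X = np.hstack([S, A])                          # regress next state on [state, action]
    W, *_ = np.linalg.lstsq(X, S_next, rcond=None)
    return W

def plan_action(model_W, state, n_candidates=64):
    """Pick the candidate action whose predicted next state has the highest (toy) reward."""
    candidates = rng.uniform(-1, 1, size=(n_candidates, 1))
    preds = np.hstack([np.tile(state, (n_candidates, 1)), candidates]) @ model_W
    rewards = -np.linalg.norm(preds, axis=1)       # toy reward: drive the state toward zero
    return candidates[np.argmax(rewards)]

def env_step(state, action):
    """Hypothetical ground-truth dynamics with a little noise."""
    return 0.9 * state + 0.5 * action + 0.01 * rng.normal(size=state.shape)

# Start from a small random-policy dataset, then iterate: train, plan, act, aggregate.
state_dim, dataset = 2, []
state = rng.normal(size=(state_dim,))
for _ in range(50):                                # seed data from random actions
    action = rng.uniform(-1, 1, size=(1,))
    next_state = env_step(state, action)
    dataset.append((state, action, next_state))
    state = next_state

for iteration in range(5):                         # on-policy aggregation loop
    W = train_model(dataset)                       # 1. (re)fit the model on all data so far
    for _ in range(20):                            # 2. act in the real world using the planner
        action = plan_action(W, state)
        next_state = env_step(state, action)
        dataset.append((state, action, next_state))   # 3. add on-policy transitions
        state = next_state
```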

We may choose to collect data for every mistake the model might make, but wouldn't it be better if we could detect where the model might go wrong, so the model-based planner could avoid actions likely to result in severe outcomes?

Estimating Uncertainty

First, let’s phrase what we know as a simple story.


A loving couple gets the blessing of a baby and a robot — not necessarily at the same time. The robot’s goal is, as a babysitter, to keep baby Juliet happy. While it’s motivated with rewards to achieve this, it’s also desirable that the robot avoids anything damaging, or that might injure the baby.


Now the baby grows fond of crying while pointing at bugs — because good babies can do that — and the robot’s optimal-reward plan becomes squashing the bug and letting baby Juliet watch it feed the vermin to the cat.


For a change, though, say the robot encounters the baby crying while pointing at something scary on the Television — an unfamiliar state, seemingly close to baby Juliet’s cry-pointing behaviour. Unsure of the dynamics, the robot’s best plan might be to squash the TV and feed it to the cat. We are not sure if that will make the baby happy, but it’s sure to cause damage.


However, if the model, being unconfident, had evaded that action, the robot would have been better off: not touching the TV and avoiding the damage, at the expense of a sad Juliet.

An uncertainty-aware model would let the agent know where it has a high chance of an undesired outcome — where it needs to be more careful. But if the model is confident about what will result after taking an action, then it's probably good to use that to reach its goal.

If our robot is confident that pickles calm baby Juliet while posing no risk, then it might consider running to the kitchen and letting her chew on one, because then, it will achieve its goal of keeping her happy.


A model that can get accurate estimates of its uncertainty gives the model-based planner the ability to avoid actions with a non-negligible chance of resulting in undesired outcomes. Gradually, the model will learn to make better estimates. Uncertainty awareness will also inform the model about the states it needs to explore more.

What seems like a solution?

Using Entropy


We know entropy as a measure of randomness, or the degree of spread in a probability distribution of a random variable.


The entropy of a discrete random variable X: H(X) = −Σᵢ p(xᵢ) log p(xᵢ)

The entropy will peak when each of the outcomes xᵢ occurs with equal probability, i.e., maximum uncertainty, and will be at its minimum when there is a single outcome with high probability p (almost 1) while the rest share a probability close to zero (1 − p), i.e., maximum certainty.

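As a quick numerical illustration (a toy example, not from the original article), the entropy of a discrete distribution can be computed directly from its probabilities:

```python
import numpy as np

def entropy(probs):
    """Shannon entropy H(X) = -sum_i p(x_i) * log p(x_i), ignoring zero-probability outcomes."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]                      # 0 * log(0) is taken as 0
    return float(-np.sum(p * np.log(p)))

print(entropy([0.25, 0.25, 0.25, 0.25]))  # equally likely outcomes: maximum uncertainty, ~1.386
print(entropy([0.97, 0.01, 0.01, 0.01]))  # one dominant outcome: near-maximum certainty, ~0.17
```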

Here, A has lower uncertainty compared to B. For a unimodal Gaussian distribution, the entropy and variance (spread around the mean) tend to move in the same relative direction.

Here’s the catch, though — does uncertainty in the data affect our model uncertainty? To help phrase that question better, here are two plots:


You are about to see visual plots of hand-cooked data. Just focus on the concept we shall use them to convey rather than what they represent, okay? Thanks.


Left: Model prediction with a linear function — the model fits the data, but "too well". Right: Fitting a polynomial regression function.

These two plots will help us separate two related concepts. Between them, on which one do you think the model is uncertain? Take an easy moment to observe them again, and try coming up with an answer.

Have you picked one? Here, let me give you some space to think.


In the first plot, the model fits a linear regression (LR) to a few thousand samples. It seems certain about the data. Over-fitting shows certainty, correct? Granted, the model is confident about the data: the variance of the red-line prediction will be pretty close to zero. But does this estimate give the best explanation of the observations?

To achieve this, the model would need to explain the observations by creating a good linear relation between the data and the possible covariates. But being unsure which to include and which to omit, it includes (almost) everything! This affects model uncertainty, as we are not confident about the model.

In the Polynomial Regression (PR), there's noise in the data. It creates uncertainty in the observations. The samples show significant deviation from the mean, but the model fits relatively well — we can still snatch a few laughs from a psychology class without providing milk, which seems true enough. Compared to the error obtained by over-fitting the linear function, this would give a larger MSE.

So, coming back to our initial question: uncertainty in the data will not reveal our model uncertainty. This means entropy, as a measure of model uncertainty, does not always work, because when the variance of the data is close to zero, the entropy is low even when the model might still be uncertain, as seen in the LR plot.

Uncertainty-aware models — solutions that work

a) Learning a function that predicts bad behaviour


Consider a drone learning to fly in a rain-forest. We wish it to learn to navigate that environment while avoiding collisions with trees. In RL, an agent learns the consequence of an action by trying out that action. So to learn to dodge trees, it must experience a couple of hits. But high-speed blows would certainly cause destruction.

We can train the drone by letting it experience gentle, low-speed hits, so it learns the forest environment. When it encounters a section of the forest absent in the training distribution, it needs knowledge about the uncertainty of its policy to enable safe interaction with that section while collecting new training data. Once confident about that section, it can fly at high speeds in future. This is an example of safe exploration.


To achieve this, we integrate a cost for hitting trees into the RL cost function c(sₜ, aₜ), giving c(sₜ, aₜ) + C_bad. C_bad is the new cost assigned to behaviour that results in a bad outcome (collision). It influences when the drone can fly fast, and when it should tread with care.

To estimate C_bad, we use a bad-behaviour-prediction neural network P with weights ϴ. It takes as input the current state sₜ of the drone and its observation oₜ, plus a sequence of actions [aₜ, aₜ₊₁, …] over the planning horizon H that the drone plans to execute, and estimates the probability of a collision occurring.

The action sequence is selected and optimized by Model Predictive Control (MPC) in a receding time horizon from the current time step t up to t + H. The bad-behaviour model Pϴ outputs a Bernoulli distribution (binary 0 or 1) indicating whether a collision occurred within this horizon.


The collision labels are recorded for each horizon H. This means that for a label of 1, bad behaviour occurred in the sub-sequence between time steps t and t + H. With this probability label conditioned on the above inputs, the bad-behaviour model can be simply expressed as:

Estimating the cost for bad behaviour C_bad using the model P_theta

Similarly, a naive implementation would look like this:

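Since the snippet originally embedded here is not available, below is a minimal stand-in sketch of the same idea: score each candidate action sequence with the goal cost plus a fixed penalty C_bad whenever the bad-behaviour network predicts a collision. The network, costs, and MPC-style planner are hypothetical placeholders:

```python
import numpy as np

C_BAD = 100.0                      # fixed penalty assigned to bad behaviour (collision)

def goal_cost(state, actions):
    """Hypothetical task cost c(s_t, a_t) summed over the planned action sequence."""
    return float(np.sum(actions ** 2))            # e.g. penalize aggressive actions

def collision_probability(state, obs, actions):
    """Stand-in for the bad-behaviour network P_theta(collision | s_t, o_t, a_t..a_t+H)."""
    speed = np.abs(actions).mean()                # toy heuristic: faster plans are riskier
    return float(1.0 / (1.0 + np.exp(-8.0 * (speed - 0.5))))

def plan_cost(state, obs, actions):
    """Naive version: add the full C_BAD whenever a collision is predicted for the horizon."""
    p_collision = collision_probability(state, obs, actions)
    return goal_cost(state, actions) + (C_BAD if p_collision > 0.5 else 0.0)

# MPC-style selection over random candidate action sequences of horizon H.
rng = np.random.default_rng(0)
state, obs, H = np.zeros(3), np.zeros(4), 10
candidates = rng.uniform(-1.0, 1.0, size=(128, H))
best_plan = min(candidates, key=lambda a: plan_cost(state, obs, a))
```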

However, you might have noticed that Pϴ outputs the probability distribution over bad behaviour, and not the actual expense of that behaviour. So the actual bad-behaviour cost is multiplied by this probability p to give pC_bad. Finally, we tune it with a scalar λ, which determines how important it is for the agent to avoid risky outcomes compared to achieving its goal.

Goal cost plus the agent's weighted cost for bad behaviour: c(sₜ, aₜ) + λ·p·C_bad

It's good to note that while we want a function that predicts unsafe actions, a discriminative model, which takes an input and gives a safety estimate, might not always make us happy — its predictions might be quite meaningless in unfamiliar states. Preferably, it should incorporate model uncertainty in its predictions.

b) Bayesian Neural Networks

A neural network can be viewed as a conditional model P(y|x, w) which, given an input x, allocates a probability to each possible output y using weights w. With Bayesian neural networks, instead of a single fixed value for each weight, the weights are represented as probability distributions over the possible values.

Right: Bayesian Neural Net. Instead of a single fixed scalar for each weight, a probability distribution is assigned over all possible values (source)

How does this work?


Using a set of training samples D, we find a posterior over the model weights conditioned on these samples, P(w|D). To predict the distribution of a particular label ŷ, each viable combination of the weights, scaled by the posterior distribution, makes a prediction on the same input x.

If a unit is uncertain about the observation, this will be expressed in the output, as weights with higher uncertainty introduce more variability into the prediction. This is common in regions where the model has seen minimal or no data, and it will encourage exploration. As more observations are made, the model makes more deterministic decisions.

The posterior distribution over the weights, P(w|D), is approximated. This is done by finding the parameters ϴ of a different distribution over the weights, q(w|ϴ), making it as close as possible to the true posterior distribution P(w|D). This is variational inference; a little beyond our current scope :).

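To make this concrete, here is a rough sketch of how the predictive distribution could be approximated once a variational posterior has been learned. The tiny linear model and the Gaussian q(w|ϴ) with made-up means and standard deviations are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical variational posterior q(w | theta): an independent Gaussian per weight
# of a tiny linear model y = x @ w, with learned means mu and standard deviations sigma.
mu = np.array([0.8, -0.3])
sigma = np.array([0.05, 0.40])        # the second weight is far more uncertain

def predictive(x, n_samples=1000):
    """Monte Carlo approximation of p(y | x, D) by averaging over sampled weight vectors."""
    w_samples = rng.normal(mu, sigma, size=(n_samples, mu.size))   # sample weights from q
    preds = w_samples @ x                                          # one prediction per sample
    return preds.mean(), preds.std()                               # mean and spread (uncertainty)

print(predictive(np.array([1.0, 0.0])))   # input that barely touches the uncertain weight
print(predictive(np.array([0.0, 1.0])))   # input dominated by the uncertain weight -> larger std
```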

c) DropOut


Dropout in RL is a bad idea. This isn’t the risk we take here, though. Remember we said a discriminative model will not always make us happy unless it can incorporate uncertainty in bad-behaviour predictions? Dropout is a simple way to do that.


Dropout is a regularization technique that randomly drops a unit in a neural network with probability p, or retains it with probability 1 − p. It's frequently used during training to prevent neurons from over-depending on each other. This creates a new but related neural network during each training iteration.

In practice, dropout is usually applied only during training and removed at test time to achieve high test accuracy. However, by retaining dropout at test time, we can estimate uncertainty from the sample mean and variance of multiple stochastic forward passes. It's a simple approach to estimating uncertainty.

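A minimal sketch of that Monte Carlo dropout idea with a made-up two-layer network (the weights and sizes are arbitrary placeholders, and no particular deep-learning framework is assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up two-layer network with fixed (pretend-trained) weights.
W1, b1 = rng.normal(size=(3, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 1)), np.zeros(1)

def forward(x, p_drop=0.5, keep_dropout=True):
    """One forward pass; dropout stays on at test time for MC-dropout uncertainty."""
    h = np.maximum(x @ W1 + b1, 0.0)                       # ReLU hidden layer
    if keep_dropout:
        mask = rng.random(h.shape) > p_drop                # randomly drop hidden units
        h = h * mask / (1.0 - p_drop)                      # inverted dropout scaling
    return (h @ W2 + b2).item()

def mc_dropout_predict(x, n_passes=100):
    """Sample mean and standard deviation over stochastic forward passes."""
    preds = np.array([forward(x) for _ in range(n_passes)])
    return preds.mean(), preds.std()

mean, std = mc_dropout_predict(np.array([0.2, -1.0, 0.5]))
print(f"prediction ~ {mean:.3f} +/- {std:.3f}")
```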

Its caveat is that dropout, as a variational inference method, underestimates uncertainty severely owing to the variational lower bound.


What does that mean?


To understand this, we need to introduce KL Divergence — a measure of the difference between two probability distributions over the same random variable.


At times, finding the true probability over large real-valued distributions is expensive. So an approximation to that distribution is used instead, and the KL divergence (difference) between the two is minimized.

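For reference, for a discrete random variable the KL divergence of an approximation q from the true distribution p is defined as:

D_KL(p ‖ q) = Σₓ p(x) log [ p(x) / q(x) ]

It is zero only when p and q agree everywhere, and it is asymmetric: D_KL(p ‖ q) and D_KL(q ‖ p) generally differ, which matters in what follows.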

An illustration of the KL divergence D_KL(p||q) between p(x) and q(x) (source)

In the above illustration, q(x) is an approximation to the precise distribution p(x). This approximation aims to place a high chance of occurrence where p(x) has high probability. In the illustration, notice that q(x) is a single Gaussian, while p(x) is a mixture of two Gaussians. To place high probability where the probability of p(x) is high, q(x) spreads across the two Gaussians in p, placing high probability mass on both equally.

Similarly, dropout has a true posterior p(w|x, y) over the model's weights w, conditioned on the inputs x and the labels y. q(w) is used as an approximating distribution for this posterior. We then lower the KL divergence between q(w) and the actual posterior p(w|x, y) to make them as close as possible. However, doing so penalises q(w) for placing probability mass where p(w) has no probability mass, but does not penalise q(w) for failing to place high probability mass where p(w) actually has high probability. This is what underestimates the model's uncertainty.

d) Bootstrap Ensembles

Multiple independent models are trained, and their predictions averaged. Should these models produce almost similar outputs, it would show they agree, indicating certainty in their predictions.

Averaging N ensemble models: p(sₜ₊₁ | sₜ, aₜ) ≈ (1/N) Σᵢ p_ϴᵢ(sₜ₊₁ | sₜ, aₜ)

To make the models independent of each other, each model's weights ϴᵢ are trained on a subset of the data sampled with replacement from the training set. However, random initialisation of the weights ϴᵢ and stochastic gradient descent during training are known to make them independent enough.

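A small illustrative sketch (toy data and toy polynomial models, purely for intuition): train N members on bootstrap resamples, then read uncertainty off their disagreement:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D dataset: y = sin(3x) + noise, observed only on a narrow range of x.
x_train = rng.uniform(-1.0, 1.0, size=200)
y_train = np.sin(3.0 * x_train) + 0.1 * rng.normal(size=x_train.size)

def fit_member(x, y, degree=5):
    """One ensemble member: a polynomial model fit to a bootstrap resample of the data."""
    idx = rng.integers(0, x.size, size=x.size)      # sample with replacement
    return np.polyfit(x[idx], y[idx], degree)

ensemble = [fit_member(x_train, y_train) for _ in range(10)]   # N independent-ish models

def predict(x_query):
    """Ensemble mean and disagreement (std) at the query points."""
    preds = np.stack([np.polyval(w, x_query) for w in ensemble])
    return preds.mean(axis=0), preds.std(axis=0)

mean_in, std_in = predict(np.array([0.0]))     # inside the training range: members agree
mean_out, std_out = predict(np.array([3.0]))   # far outside: members disagree -> high uncertainty
print(std_in, std_out)
```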

Dropout, as a measure of uncertainty, can be seen as a cheap approximation to an ensemble method, where each sampled dropout mask acts as a different model. The Bayesian neural network has the ensemble concept too — by taking an expectation under the posterior distribution over the weights P(w|D), the Bayesian network becomes equivalent to an infinite number of ensembles — and many means better.

e) Curious iLQR

Think of curiosity as an inspiration to solve for uncertainties in the agent’s environment. Let’s see how we can add curious behaviour to an agent’s control loop.


Some LQR background


In RL, a Linear Quadratic Regulator (LQR) outputs a linear controller which is used to exploit the model. When working with non-linear dynamics, we fit the model p(sₜ₊₁ | sₜ, aₜ) at each time step using linear regression. This iterative control process is called iterative LQR (iLQR), a form of differential dynamic programming (DDP).

The system dynamics is represented by the equation:


System dynamics: xₜ₊₁ = xₜ + f(xₜ, uₜ)Δt

f represents the learned dynamics model, while xₜ₊₁ is the state at the next time step, expressed as the current state xₜ plus the model's predicted change in the current state xₜ when action uₜ is taken. For instance, if the state is a robot's velocity, xₜ would be the current velocity, while f(xₜ, uₜ)Δt would be the predicted change when uₜ is selected, resulting in a new velocity xₜ₊₁.

Making it Curious


To integrate uncertainty into the above system dynamics, the dynamics are written as a Gaussian distribution, represented by a mean vector μ and a covariance matrix Σ.

A Gaussian Policy has a neural network mapping a state and action pair to the mean change in state. This change in state is the mean vector expressed by the model μ(f).


We can implement the system dynamics as a Gaussian Process (GP) by drawing the model f from a normal distribution, where we attempt to learn the best mean vector μ that minimizes the cost function. The GP then delivers predictions using the equation:


System dynamics as a Gaussian Process: xₜ₊₁ ∼ N(xₜ + f(xₜ, uₜ)Δt, Σₜ₊₁)

where f(xₜ, uₜ) is the mean vector represented by the trainable dynamics function, and Σₜ₊₁ is the covariance matrix of the GP predictions at the current state and action.

This GP is identical to the ordinary, non-curious LQR stochastic dynamics equation. What's different? In non-curious iLQR, we would ignore the variance parameter Σₜ₊₁ owing to the symmetry of Gaussians. However, curious iLQR needs the covariance of the predictive distribution to ascertain the model uncertainty. High model uncertainty equals high variance. Σₜ₊₁ represents the model's uncertainty about the prediction xₜ₊₁ at the current state and action (xₜ, uₜ).

This uncertainty from the GP model is then used to push the agent to take actions that resolve the model's future uncertainty about such states. In short, the agent is encouraged to select actions that reduce the model's variance. This is done by rewarding the agent for actions that carry some degree of uncertainty, while still maximizing the goal-specific reward.

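As a very rough illustration of that incentive (not the actual curious iLQR derivation, which folds Σₜ₊₁ into the LQR optimization itself; the names, the toy uncertainty proxy, and the weight η below are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
ETA = 0.5     # weight on the curiosity bonus (assumed hyperparameter)

def task_cost(x, u):
    """Hypothetical goal-specific cost, e.g. a quadratic LQR-style cost."""
    return float(x @ x + 0.1 * u @ u)

def predictive_variance(x, u):
    """Stand-in for the GP's predictive covariance at (x, u): uncertainty grows as the
    crudely predicted next state x + u moves away from previously visited states."""
    visited_states = np.zeros((5, x.size))          # pretend training data near the origin
    next_state = x + u
    dist = np.min(np.linalg.norm(visited_states - next_state, axis=1))
    return float(1.0 - np.exp(-dist))

def curious_cost(x, u):
    """Task cost minus a bonus for actions whose outcomes the model is unsure about."""
    return task_cost(x, u) - ETA * predictive_variance(x, u)

# The planner ranks candidate actions by the curiosity-augmented cost.
x = np.array([1.5, -0.5])
candidates = rng.uniform(-1.0, 1.0, size=(64, 2))
best_u = min(candidates, key=lambda u: curious_cost(x, u))   # exploratory actions get a discount
```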

The Mountain Car experiment: the goal is to reach the hilltop. By seeking negative rewards at first (moving back up the opposite hill), the car gains enough momentum to reach the top. Note that it's only the curious agent that explores this. (source)

Understanding LQR optimization in model-based RL can take a bit of juggling, but it is essential for grasping how the curiosity algorithm is derived. Let's not scare ourselves with those equations now, though.

Rewarding curious actions enables the agent to reach its goal faster than using standard iLQR. It prevents the model from getting stuck in local optima and finds better solutions in a shorter time.

Conclusion

Uncertainty awareness can be used to influence an agent’s learned policy depending on how it’s added to the training cost. It can encourage exploration or pessimistic behaviour where the agent avoids outcomes likely to be risky. The latter is of interest in AI safety. For more robustness, the solutions to adding uncertainty awareness can be used together.


[1] A. Nagabandi, G. Kahn, S. Fearing, S. Levine, Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning (2018), ICRA 2018.

[2] S. Daftry, S. Zeng, J. A. Bagnell, and M. Hebert, Introspective Perception: Learning to Predict Failures in Vision Systems (2016), IROS 2016.

[3] C. Blundell, J. Cornebise, K. Kavukcuoglu, D. Wierstra, Weight Uncertainty in Neural Networks (2015), ICML 2015.

[4] Y. Gal and Z. Ghahramani, Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning (2016), ICML 2016.

[5] Y. Li and Y. Gal, Dropout Inference in Bayesian Neural Networks with Alpha-divergences (2017).

[6] Model-Based Reinforcement Learning, Deep RL Decision Making and Control (2019), Berkeley.

[7] G. Kahn, A. Villaflor, V. Pong, S. Levine, Uncertainty-Aware Reinforcement Learning for Collision Avoidance (2017).

[8] S. Bechtle, A. Rai, Y. Lin, L. Righetti, F. Meier, Curious iLQR: Resolving Uncertainty in Model-based RL (2019), ICML 2019.

Translated from: https://towardsdatascience.com/uncertainty-aware-reinforcement-learning-c95c25c220d3
