One of the main challenges in reinforcement learning (RL) is generalisation. In typical deep RL methods this is achieved by approximating the optimal value function with a low-dimensional representation using a deep network. While this approach works well in many domains, in domains where the optimal value function cannot easily be reduced to a low-dimensional representation, learning can be very slow and unstable. This paper contributes towards tackling such challenging domains, by proposing a new method, called Hybrid Reward Architecture (HRA). HRA takes as input a decomposed reward function and learns a separate value function for each component reward function. Because each component typically only depends on a subset of all features, the corresponding value function can be approximated more easily by a low-dimensional representation, enabling more effective learning. We demonstrate HRA on a toy-problem and the Atari game Ms. Pac-Man, where HRA achieves above-human performance.
In reinforcement learning (RL) (Sutton & Barto, 1998; Szepesvári, 2009), the goal is to find a behaviour policy that maximises the return—the discounted sum of rewards received over time—in a data-driven way. One of the main challenges of RL is to scale methods such that they can be applied to large, real-world problems. Because the state-space of such problems is typically massive, strong generalisation is required to learn a good policy efficiently.
Mnih et al. (2015) achieved a big breakthrough in this area: by combining standard RL techniques with deep neural networks, they achieved above-human performance on a large number of Atari 2600 games by learning a policy from pixels. The generalisation of their Deep Q-Networks (DQN) method is achieved by approximating the optimal value function. A value function plays an important role in RL, because it predicts the expected return, conditioned on a state or state-action pair. Once the optimal value function is known, an optimal policy can be derived by acting greedily with respect to it. By modelling the current estimate of the optimal value function with a deep neural network, DQN carries out a strong generalisation on the value function, and hence on the policy.
The generalisation behaviour of DQN is achieved by regularisation on the model for the optimal value function. However, if the optimal value function is very complex, then learning an accurate low-dimensional representation can be challenging or even impossible. Therefore, when the optimal value function cannot easily be reduced to a low-dimensional representation, we argue for applying a complementary form of regularisation on the target side. Specifically, we propose to replace the training target, the optimal value function, with an alternative value function that is easier to learn but still yields a reasonable, though generally not optimal, policy when acting greedily with respect to it.
The key observation behind regularisation on the target function is that two very different value functions can result in the same policy when an agent acts greedily with respect to them. At the same time, some value functions are much easier to learn than others. Intrinsic motivation (Stout et al., 2005; Schmidhuber, 2010) uses this observation to improve learning in sparse-reward domains, by adding a domain-specific intrinsic reward signal to the reward coming from the environment. When the intrinsic reward function is potential-based, optimality of the resulting policy is maintained (Ng et al., 1999). In our case, we aim for simpler value functions that are easier to represent with a low-dimensional representation.
Our main strategy for constructing an easy-to-learn value function is to decompose the reward function of the environment into n different reward functions. Each of them is assigned a separate reinforcement-learning agent. Similar to the Horde architecture (Sutton et al., 2011), all these agents can learn in parallel on the same sample sequence by using off-policy learning. Each agent gives its action-values of the current state to an aggregator, which combines them into a single value for each action. The current action is selected based on these aggregated values.
We test our approach on two domains: a toy-problem, where an agent has to eat 5 randomly located fruits, and Ms. Pac-Man, one of the hard games from the ALE benchmark set (Bellemare et al., 2013).
Our HRA method builds upon the Horde architecture (Sutton et al., 2011). The Horde architecture consists of a large number of ‘demons’ that learn in parallel via off-policy learning. Each demon trains a separate general value function (GVF) based on its own policy and pseudo-reward function. A pseudo-reward can be any feature-based signal that encodes useful information. The Horde architecture is focused on building up general knowledge about the world, encoded via a large number of GVFs. HRA focusses on training separate components of the environment-reward function, in order to more efficiently learn a control policy. UVFA (Schaul et al., 2015) builds on Horde as well, but extends it along a different direction. UVFA enables generalization across different tasks/goals. It does not address how to solve a single, complex task, which is the focus of HRA.
Learning with respect to multiple reward functions is also a topic of multi-objective learning (Roijers et al., 2013). So alternatively, HRA can be viewed as applying multi-objective learning in order to more efficiently learn a policy for a single reward function.
Reward function decomposition has been studied among others by Russell & Zimdar (2003) and Sprague & Ballard (2003). This earlier work focusses on strategies that achieve optimal behavior. Our work is aimed at improving learning-efficiency by using simpler value functions and relaxing optimality requirements.
There are also similarities between HRA and UNREAL (Jaderberg et al., 2017). Notably, both solve multiple smaller problems in order to tackle one hard problem. However, the two architectures are different in their workings, as well as the type of challenge they address. UNREAL is a technique that boosts representation learning in difficult scenarios. It does so by using auxiliary tasks to help train the lower-level layers of a deep neural network. An example of such a challenging representation-learning scenario is learning to navigate in the 3D Labyrinth domain. On Atari games, the reported performance gain of UNREAL is minimal, suggesting that the standard deep RL architecture is sufficiently powerful to extract the relevant representation. By contrast, the HRA architecture breaks down a task into smaller pieces. HRA’s multiple smaller tasks are not unsupervised; they are tasks that are directly relevant to the main task. Furthermore, whereas UNREAL is inherently a deep RL technique, HRA is agnostic to the type of function approximation used. It can be combined with deep neural networks, but it also works with exact, tabular representations. HRA is useful for domains where having a high-quality representation is not sufficient to solve the task efficiently.
Diuk’s object-oriented approach (Diuk et al., 2008) was one of the first methods to show efficient learning in video games. This approach exploits domain knowledge related to the transition dynamics to efficiently learn a compact transition model, which can then be used to find a solution using dynamic-programming techniques. This inherently model-based approach has the drawback that, while it efficiently learns a very compact model of the transition dynamics, it does not reduce the state-space of the problem. Hence, it does not address the main challenge of Ms. Pac-Man: its huge state-space, which is intractable even for DP methods (Diuk applied his method to an Atari game with only 6 objects, whereas Ms. Pac-Man has over 150 objects).
Finally, HRA relates to options (Sutton et al., 1999; Bacon et al., 2017), and more generally to hierarchical learning (Barto & Mahadevan, 2003; Kulkarni et al., 2016). Options are temporally-extended actions that, like HRA’s heads, can be trained in parallel based on their own (intrinsic) reward functions. However, once an option has been trained, the role of its intrinsic reward function is over. A higher-level agent that uses an option sees it as just another action and evaluates it using its own reward function. This can yield great speed-ups in learning and substantially improve exploration, but it does not directly make the value function of the higher-level agent less complex. The heads of HRA, by contrast, represent values trained with components of the environment reward. Even after training, these values stay relevant, because the aggregator uses them to select its action.
The strategy behind HRA is to decompose the reward function of the environment into n different reward functions and to train a separate reinforcement-learning agent on each of these reward functions. There are infinitely many different decompositions of a reward function possible, but to achieve value functions that are easy to learn, the decomposition should be such that each component reward function is mainly affected by only a small number of state variables.
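As an illustration, a minimal sketch of this scheme with tabular per-head Q-learning could look as follows. The environment interface, sizes, and hyperparameters (`n_states`, `n_heads`, `alpha`, etc.) are illustrative placeholders, not the paper's code; the sketch only shows that each head is updated off-policy from its own reward component while the aggregator sums the heads' Q-values for action selection.

```python
import numpy as np

n_states, n_actions, n_heads = 100, 4, 5   # illustrative sizes
alpha, gamma, epsilon = 0.1, 0.95, 0.1     # illustrative hyperparameters
Q = np.zeros((n_heads, n_states, n_actions))  # one Q-table per head
rng = np.random.default_rng(0)

def aggregate(state):
    """Combine per-head action-values into a single vector (here: a plain sum)."""
    return Q[:, state, :].sum(axis=0)

def select_action(state):
    """Epsilon-greedy action selection on the aggregated Q-values."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(aggregate(state)))

def hra_update(state, action, next_state, component_rewards, done):
    """One off-policy update per head; all heads share the same transition."""
    for k in range(n_heads):
        target = component_rewards[k]
        if not done:
            target += gamma * Q[k, next_state].max()
        Q[k, state, action] += alpha * (target - Q[k, state, action])
```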
In its basic setting, the only domain knowledge applied to HRA is in the form of the decomposed reward function. However, one of the strengths of HRA is that it can easily exploit more domain knowledge, if available. Domain knowledge can be exploited in one of the following ways:
1. Removing irrelevant features. Features that do not affect the received reward in any way (directly or indirectly) only add noise to the learning process and can be removed.
2. Identifying terminal states. Terminal states are states from which no further reward can be received; they have by definition a value of 0. Using this knowledge, HRA can refrain from approximating this value by the value network, such that the weights can be fully used to represent the non-terminal states.
3. Using pseudo-reward functions. Instead of updating a head of HRA using a component of the environment reward, it can be updated using a pseudo-reward. In this scenario, a set of GVFs is trained in parallel using pseudo-rewards.
While these approaches are not specific to HRA, HRA can exploit domain knowledge to a much greater extent, because it can apply these approaches to each head individually. We show this empirically in Section 4.1.
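The following sketch illustrates how the three forms of knowledge could be attached to a head; the class and attribute names are hypothetical and only meant to make the per-head nature of these choices concrete.

```python
import numpy as np

class Head:
    """Illustrative per-head container for the three kinds of domain knowledge."""

    def __init__(self, relevant_idx, is_terminal, pseudo_reward=None):
        self.relevant_idx = np.asarray(relevant_idx)  # (1) keep only relevant features
        self.is_terminal = is_terminal                # (2) callable: state -> bool
        self.pseudo_reward = pseudo_reward            # (3) optional callable, else use the reward component

    def features(self, state_features):
        """Drop features that are irrelevant for this head."""
        return state_features[self.relevant_idx]

    def target(self, reward_component, next_state, next_value, gamma):
        """Bootstrapped target; terminal states are clamped to value 0 instead of approximated."""
        r = (self.pseudo_reward(next_state)
             if self.pseudo_reward is not None else reward_component)
        if self.is_terminal(next_state):
            return r
        return r + gamma * next_value
```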
In our first domain, we consider an agent that has to collect fruits as quickly as possible in a 10×10 grid. There are 10 possible fruit locations, spread out across the grid. For each episode, a fruit is randomly placed on 5 of those 10 locations. The agent starts at a random position. The reward is +1 if a fruit gets eaten and 0 otherwise. An episode ends after all 5 fruits have been eaten or after 300 steps, whichever comes first.
We compare the performance of DQN with HRA using the same network. For HRA, we decompose the reward function into 10 different reward functions, one per possible fruit location. The network consists of a binary input layer of length 110, encoding the agent’s position and whether there is a fruit on each location. This is followed by a fully connected hidden layer of length 250. This layer is connected to 10 heads consisting of 4 linear nodes each, representing the action-values of the 4 actions under the different reward functions. Finally, the mean of all nodes across heads is computed using a final linear layer of length 4 that connects the output of corresponding nodes in each head. This layer has fixed weights with value 1 (i.e., it implements Equation 5). The difference between HRA and DQN is that DQN updates the network from the fourth layer using loss function (2), whereas HRA updates the network from the third layer using loss function (6).
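A sketch of the described network in PyTorch is given below. The layer sizes (110 inputs, 250 hidden units, 10 heads of 4 linear nodes) come from the text; the hidden-layer activation is an assumption, and the fixed-weight aggregation is shown as a sum over heads, which differs from a mean only by a constant factor and yields the same greedy policy.

```python
import torch
import torch.nn as nn

class HRAFruitNet(nn.Module):
    """Sketch of the fruit-collection architecture: shared hidden layer, 10 heads,
    and a fixed-weight aggregation across corresponding action nodes."""

    def __init__(self, n_heads=10, n_actions=4):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(110, 250), nn.ReLU())  # activation assumed
        self.heads = nn.ModuleList(nn.Linear(250, n_actions) for _ in range(n_heads))

    def forward(self, x):
        h = self.hidden(x)
        per_head_q = torch.stack([head(h) for head in self.heads], dim=1)  # (batch, 10, 4)
        aggregated_q = per_head_q.sum(dim=1)  # fixed weights of 1 across heads
        return per_head_q, aggregated_q

# HRA would apply a separate loss to each slice of per_head_q using its own
# reward component; DQN would instead regress aggregated_q on a single target.
net = HRAFruitNet()
per_head_q, q = net(torch.zeros(1, 110))
```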
Besides the full network, we test using different levels of domain knowledge, as outlined in Section 3.2: 1) removing the irrelevant features for each head (providing only the position of the agent + the corresponding fruit feature); 2) the above plus identifying terminal states; 3) the above plus using pseudo rewards for learning GVFs to go to each of the 10 locations (instead of learning a value function associated to the fruit at each location). The advantage is that these GVFs can be trained even if there is no fruit at a location. The head for a particular location copies the Q-values of the corresponding GVF if the location currently contains a fruit, or outputs 0s otherwise. We refer to these as HRA+1, HRA+2 and HRA+3, respectively. For DQN, we also tested a version that was applied to the same network as HRA+1; we refer to this version as DQN+1.
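The gating rule used by HRA+3 can be written compactly; the sketch below uses illustrative stand-ins for the trained GVF Q-values and only shows how a location's head copies its GVF output when a fruit is present and outputs zeros otherwise.

```python
import numpy as np

n_locations, n_actions = 10, 4
gvf_q = np.random.rand(n_locations, n_actions)    # stand-in for trained GVF pseudo Q-values
fruit_present = np.array([True, False] * 5)        # which locations currently hold a fruit

# Each head copies its GVF's Q-values if its fruit is present, else outputs 0s.
head_q = np.where(fruit_present[:, None], gvf_q, 0.0)   # shape (10, 4)
aggregated_q = head_q.sum(axis=0)                        # input to greedy action selection
```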
As a final variant, the state features are fed directly into the different heads. Because the heads are linear, this representation reduces to an exact, tabular representation. For the tabular representation, we used the same step-size as the optimal step-size for the deep network version.
Our second domain is the Atari 2600 game Ms. Pac-Man (see Figure 4). Points are obtained by eating pellets, while avoiding ghosts (contact with one causes Ms. Pac-Man to lose a life). Eating one of the special power pellets turns the ghosts blue for a short duration, allowing them to be eaten for extra points. Bonus fruits can be eaten for further points, twice per level. When all pellets have been eaten, a new level is started. There are a total of 4 different maps and 7 different fruit types, each with a different point value. We provide full details on the domain in the supplementary material.
Baselines. While our version of Ms. Pac-Man is the same as the one used in the literature, we use different preprocessing. Hence, to test the effect of our preprocessing, we implement the A3C method (Mnih et al., 2016) and run it with our preprocessing. We refer to the version with our preprocessing as ‘A3C(channels)’, to the version with the standard preprocessing as ‘A3C(pixels)’, and to A3C’s score reported in the literature as ‘A3C(reported)’.
Preprocessing. Each frame from ALE is 210×160 pixels. We cut the bottom part and the top part of the screen to end up with 160×160 pixels. From this, we extract the position of the different objects and create for each object a separate input channel, encoding its location with an accuracy of 4 pixels. This results in 11 binary channels of size 40×40. Specifically, there is a channel for Ms. Pac-Man, each of the four ghosts, each of the four blue ghosts (these are treated as different objects), the fruit, plus one channel with all the pellets (including power pellets). For A3C, we combine the 4 channels of the ghosts into a single channel, to allow it to generalise better across ghosts. We do the same with the 4 channels of the blue ghosts. Instead of giving the history of the last 4 frames as done in the literature, we give the orientation of Ms. Pac-Man as a 1-hot vector of length 4 (representing the 4 compass directions).
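A sketch of this channel encoding is shown below. The object-extraction step itself is domain-specific and not shown; the function signature and channel ordering are assumptions used only to illustrate the 4-pixel binning into 40×40 binary channels.

```python
import numpy as np

N_CHANNELS = 11  # Ms. Pac-Man, 4 ghosts, 4 blue ghosts, fruit, pellets

def make_channels(object_positions, pellet_positions):
    """object_positions: dict channel_index -> (x, y) in the cropped 160x160 frame;
    pellet_positions: list of (x, y) for all pellets (including power pellets),
    which all go into the last channel."""
    channels = np.zeros((N_CHANNELS, 40, 40), dtype=np.uint8)
    for ch, (x, y) in object_positions.items():
        channels[ch, y // 4, x // 4] = 1       # 4-pixel accuracy: row = y//4, col = x//4
    for (x, y) in pellet_positions:
        channels[10, y // 4, x // 4] = 1
    return channels
```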
HRA architecture. The environment reward signal corresponds with the points scored in the game. Before decomposing the reward function, we perform reward shaping by adding a negative reward of -1000 for contact with a ghost (which causes Ms. Pac-Man to lose a life). After this, the reward is decomposed in a way that each object in the game (pellet/fruit/ghost/blue ghost) has its own reward function. Hence, there is a separate RL agent associated with each object in the game that estimates a Q-value function of its corresponding reward function.
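A minimal sketch of this decomposition step is given below; the event representation is hypothetical and only shows that the shaping term of -1000 is added for ghost contact and that every other game object contributes its own reward component.

```python
GHOST_PENALTY = -1000.0  # shaping reward for losing a life to a ghost

def decompose_reward(events):
    """events: list of (object_id, object_type, points) tuples for one step;
    object_type == 'ghost_hit' marks contact with a ghost. Returns one reward
    component per object/head that was triggered this step."""
    components = {}
    for object_id, object_type, points in events:
        if object_type == 'ghost_hit':
            components[object_id] = GHOST_PENALTY
        else:
            components[object_id] = float(points)  # pellet / fruit / blue ghost
    return components
```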
To estimate each component reward function, we use the three forms of domain knowledge discussed in Section 3.2. HRA uses GVFs that learn pseudo Q-values (with values in the range [0, 1]) for getting to a particular location on the map (separate GVFs are learnt for each of the four maps). In contrast to the fruit collection task (Section 4.1), HRA learns part of its representation during training: it starts off with 0 GVFs and 0 heads for the pellets. By wandering around the maze, it discovers new map locations it can reach, resulting in new GVFs being created. Whenever the agent finds a pellet at a new location it creates a new head corresponding to the pellet.
The Q-values for an object (pellet/fruit/ghost/blue ghost) are set to the pseudo Q-values of the GVF corresponding with the object’s location (i.e., moving objects use a different GVF each time), multiplied with a weight that is set equal to the reward received when the object is eaten. If an object is not on the screen, all its Q-values are 0.
We test two aggregator types. The first is a linear aggregator that sums the Q-values of all heads (see Equation 5). For the second, we take the sum of all the heads that produce points and normalise the resulting Q-values; we then add the sum of the Q-values of the heads of the regular ghosts, multiplied by a weight vector.
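The sketch below illustrates the construction of the object heads and both aggregators. Array shapes, the normalisation scheme, and the ghost weight are illustrative assumptions; the exact forms are in the supplementary material.

```python
import numpy as np

def object_head_q(gvf_q_at_location, object_reward, on_screen):
    """Head Q-values for one object: the GVF's pseudo Q-values at the object's
    current location, scaled by that object's reward; zeros if off-screen."""
    if not on_screen:
        return np.zeros_like(gvf_q_at_location)
    return object_reward * gvf_q_at_location

def aggregate_linear(head_qs):
    """Aggregator 1: plain sum over all heads."""
    return np.sum(head_qs, axis=0)

def aggregate_normalised(point_head_qs, ghost_head_qs, ghost_weight):
    """Aggregator 2: normalise the summed point-producing heads, then add the
    summed regular-ghost heads scaled by a weight (normalisation scheme assumed)."""
    points = np.sum(point_head_qs, axis=0)
    norm = np.linalg.norm(points)
    if norm > 0:
        points = points / norm
    return points + ghost_weight * np.sum(ghost_head_qs, axis=0)
```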
We test two complementary types of exploration. Each type adds an extra exploration head to the architecture. The first type, which we call diversification, produces random Q-values drawn from a uniform distribution over [0, 20]. We find that it is only necessary during the first 50 steps, to ensure that each episode starts randomly. The second type, which we call count-based, adds a bonus for state-action pairs that have not been explored much. It is inspired by upper confidence bounds (Auer et al., 2002). Full details can be found in the supplementary material.
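The two exploration heads could be sketched as follows. The range [0, 20] and the 50-step cut-off come from the text; the count-based bonus shown here is a generic UCB-style formula and stands in for the exact form given in the supplementary material.

```python
import numpy as np

rng = np.random.default_rng(0)

def diversification_head(n_actions, step):
    """Random Q-values in [0, 20], used only during the first 50 steps of an episode."""
    if step < 50:
        return rng.uniform(0.0, 20.0, size=n_actions)
    return np.zeros(n_actions)

def count_based_head(visit_counts, total_count, scale=1.0):
    """Bonus for rarely tried actions, inspired by upper confidence bounds
    (the exact bonus used by HRA is assumed, not reproduced here)."""
    return scale * np.sqrt(np.log(total_count + 1) / (visit_counts + 1))
```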
For our final experiment, we implement a special head inspired by the executive-memory literature (Fuster, 2003; Gluck et al., 2013). When a human game player reaches the limit of their cognitive and physical ability, they start to look for favourable situations or even glitches and memorise them. This cognitive process amounts to memorising a sequence of actions (also called a habit), and is not necessarily optimal. Our executive-memory head records every sequence of actions that led to passing a level without losing a life. Then, when facing the same level, the head gives a very high value to the recorded action, in order to force the aggregator’s selection. Note that our simplified version of executive memory does not generalise.
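A sketch of this head is given below; the class, the level identifier, and the magnitude of the forcing value are illustrative, and the sketch deliberately mirrors the fact that this head does not generalise beyond exact replay.

```python
class ExecutiveMemory:
    """Records the action sequence that cleared a level without losing a life and
    replays it by dominating the aggregator when the same level is seen again."""

    HIGH_VALUE = 1e6  # assumed to be large enough to dominate the other heads

    def __init__(self):
        self.sequences = {}   # level_id -> recorded list of actions
        self.current = []

    def record(self, action):
        self.current.append(action)

    def level_finished(self, level_id, lost_life):
        if not lost_life:
            self.sequences[level_id] = list(self.current)
        self.current = []

    def head_q(self, level_id, step, n_actions):
        q = [0.0] * n_actions
        seq = self.sequences.get(level_id)
        if seq is not None and step < len(seq):
            q[seq[step]] = self.HIGH_VALUE
        return q
```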
Evaluation metrics. There are two different evaluation methods used across the literature, which result in very different scores. Because ALE is ultimately a deterministic environment (it implements pseudo-randomness using a random number generator that always starts with the same seed), both evaluation metrics aim to create randomness in the evaluation in order to rate methods with more generalising behaviour higher. The first metric introduces a mild form of randomness by taking a random number of no-op actions before control is handed over to the learning algorithm. In the case of Ms. Pac-Man, however, the game starts with a certain inactive period that exceeds the maximum number of no-op steps, so the game has a fixed start after all. The second metric selects random starting points along a human trajectory; this results in much stronger randomness and does produce the intended random-start evaluation. We refer to these metrics as ‘fixed start’ and ‘random start’.
Results. Figure 5 shows the training curves; Table 1 shows the final score after training. The best reported fixed start score comes from STRAW (Vezhnevets et al., 2016); the best reported random start score comes from the Dueling network architecture (Wang et al., 2016). The human fixed start score comes from Mnih et al. (2015); the human random start score comes from Nair et al. (2015). We train A3C for 800 million frames. Because HRA learns fast, we train it only for 5,000 episodes, corresponding to about 150 million frames (note that better policies result in more frames per episode). We tried a few different settings for HRA: with/without normalisation and with/without each type of exploration. The score shown for HRA uses the best combination: with normalisation and with both exploration types. All combinations achieved over 10,000 points in training, except the combination with no exploration at all, which, not surprisingly, performed very poorly. With the best combination, HRA not only outperforms the state-of-the-art on both metrics, it also significantly outperforms the human score, convincingly demonstrating the strength of HRA.
Comparing A3C(pixels) and A3C(channels) in Table 1 reveals a surprising result: while we use advanced preprocessing by separating the screen image into relevant object channels, this did not significantly change the performance of A3C.
In our final experiment, we test how well HRA does if it exploits the weakness of the fixed-start evaluation metric by using a simplified version of executive memory. Using this version, we not only surpass the human high-score of 266,330 points, we achieve the maximum possible score of 999,990 points in less than 3,000 episodes. The curve is slow in the first stages because the model has to be trained, but even though the further levels get more and more difficult, passing levels speeds up by taking advantage of already knowing the maps. Obtaining more points is impossible, not because the game ends, but because the score overflows to 0 when reaching a million points.
One of the strengths of HRA is that it can exploit domain knowledge to a much greater extent than single-head methods. This is clearly shown by the fruit collection task: while removing irrelevant features improves the performance of HRA, the performance of DQN decreased when provided with the same network architecture. Furthermore, separating the pixel image into multiple binary channels only makes a small improvement in the performance of A3C over learning directly from pixels. This demonstrates that the reason that modern deep RL methods struggle with Ms. Pac-Man is not related to learning from pixels; the underlying issue is that the optimal value function for Ms. Pac-Man cannot easily be mapped to a low-dimensional representation.
HRA solves Ms. Pac-Man by learning close to 1,800 general value functions. This results in an exponential breakdown of the problem size: whereas the input state-space corresponding with the binary channels is in the order of 10^77, each GVF has a state-space in the order of 10^3 states, small enough to be represented without any function approximation. While we could have used a deep network for representing each GVF, using a deep network for such small problems hurts more than it helps, as evidenced by the experiments on the fruit collection domain.
We argue that many real-world tasks allow for reward decomposition. Even if the reward function can only be decomposed in two or three components, this can already help a lot, due to the exponential decrease of the problem size that decomposition might cause.
Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
Bacon, P., Harb, J., and Precup, D. The option-critic architecture. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI), 2017.
Barto, A. G. and Mahadevan, S. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(4):341–379, 2003.
Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
Diuk, C., Cohen, A., and Littman, M. L. An object-oriented representation for efficient reinforcement learning. In Proceedings of The 25th International Conference on Machine Learning, 2008.
Fuster, J. M. Cortex and mind: Unifying cognition. Oxford university press, 2003.
Gluck, M. A., Mercado, E., and Myers, C. E. Learning and memory: From brain to behavior. Palgrave Macmillan, 2013.
Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. Reinforcement learning with unsupervised auxiliary tasks. In International Conference on Learning Representations, 2017.
Kulkarni, T. D., Narasimhan, K. R., Saeedi, A., and Tenenbaum, J. B. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems 29, 2016.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Harley, T., Lillicrap, T. P., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning, pp. 1928–1937, 2016.
Nair, A., Srinivasan, P., Blackwell, S., Alcicek, C., Fearon, R., De Maria, A., Panneershelvam, V., Suleyman, M., Beattie, C., Petersen, S., Legg, S., Mnih, V., Kavukcuoglu, K., and Silver, D. Massively parallel methods for deep reinforcement learning. In Deep Learning Workshop, ICML, 2015.
Ng, A. Y., Harada, D., and Russell, S. Policy invariance under reward transformations: theory and application to reward shaping. In Proceedings of The 16th International Conference on Machine Learning, 1999.
Roijers, D. M., Vamplew, P., Whiteson, S., and Dazeley, R. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 2013.
Russell, S. and Zimdar, A. L. Q-decomposition for reinforcement learning agents. In Proceedings of The 20th International Conference on Machine Learning, 2003.
Schaul, T., Horgan, D., Gregor, K., and Silver, D. Universal value function approximators. In Proceedings of The 32nd International Conference on Machine Learning, 2015.
Schmidhuber, J. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). In IEEE Transactions on Autonomous Mental Development 2.3, pp. 230–247, 2010.
Sprague, N. and Ballard, D. Multiple-goal reinforcement learning with modular sarsa(0). In International Joint Conference on Artificial Intelligence, 2003.
Stout, A., Konidaris, G., and Barto, A. G. Intrinsically motivated reinforcement learning: A promising framework for developmental robotics. In The AAAI Spring Symposium on Developmental Robotics, 2005.
Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998.
Sutton, R. S., Modayil, J., Delp, M., Degris, T., Pilarski, P. M., White, A., and Precup, D. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In Proceedings of the 10th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2011.
Sutton, R. S., Precup, D., and Singh, S. P. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.
Szepesvári, C. Algorithms for reinforcement learning. Morgan and Claypool, 2009.
van Seijen, H., van Hasselt, H., Whiteson, S., and Wiering, M. A theoretical and empirical analysis of expected sarsa. In IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pp. 177–184, 2009.
Vezhnevets, A., Mnih, V., Osindero, S., Graves, A., Vinyals, O., Agapiou, J., and Kavukcuoglu, K. Strategic attentive writer for learning macro-actions. In Advances in Neural Information Processing Systems 29, 2016.
Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., and de Freitas, N. Dueling network architectures for deep reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning, 2016.