One of the main challenges in reinforcement learning (RL) is generalisation. In typical deep RL methods this is achieved by approximating the optimal value function with a low-dimensional representation using a deep network. While this approach works well in many domains, in domains where the optimal value function cannot easily be reduced to a low-dimensional representation, learning can be very slow and unstable. This paper contributes towards tackling such challenging domains, by proposing a new method, called Hybrid Reward Architecture (HRA). HRA takes as input a decomposed reward function and learns a separate value function for each component reward function. Because each component typically only depends on a subset of all features, the corresponding value function can be approximated more easily by a low-dimensional representation, enabling more effective learning. We demonstrate HRA on a toy-problem and the Atari game Ms. Pac-Man, where HRA achieves above-human performance.
In reinforcement learning (RL) (Sutton & Barto, 1998; Szepesvári, 2009), the goal is to find a behaviour policy that maximises the return—the discounted sum of rewards received over time—in a data-driven way. One of the main challenges of RL is to scale methods such that they can be applied to large, real-world problems. Because the state-space of such problems is typically massive, strong generalisation is required to learn a good policy efficiently.
Mnih et al. (2015) achieved a big breakthrough in this area: by combining standard RL techniques with deep neural networks, they achieved above-human performance on a large number of Atari 2600 games by learning a policy from pixels. The generalisation of their Deep Q-Networks (DQN) method is achieved by approximating the optimal value function. A value function plays an important role in RL, because it predicts the expected return, conditioned on a state or state-action pair. Once the optimal value function is known, an optimal policy can be derived by acting greedily with respect to it. By modelling the current estimate of the optimal value function with a deep neural network, DQN carries out a strong generalisation on the value function, and hence on the policy.
The generalisation behaviour of DQN is achieved by regularisation on the model for the optimal value function. However, if the optimal value function is very complex, then learning an accurate low-dimensional representation can be challenging or even impossible. Therefore, when the optimal value function cannot easily be reduced to a low-dimensional representation, we argue for applying a complementary form of regularisation on the target side. Specifically, we propose to replace the training target, the optimal value function, with an alternative value function that is easier to learn but still yields a reasonable, though generally not optimal, policy when acting greedily with respect to it.
The key observation behind regularisation on the target function is that two very different value functions can result in the same policy when an agent acts greedily with respect to them. At the same time, some value functions are much easier to learn than others. Intrinsic motivation (Stout et al., 2005; Schmidhuber, 2010) uses this observation to improve learning in sparse-reward domains, by adding a domain-specific intrinsic reward signal to the reward coming from the environment. When the intrinsic reward function is potential-based, optimality of the resulting policy is maintained (Ng et al., 1999). In our case, we aim for simpler value functions that are easier to represent with a low-dimensional representation.
Our main strategy for constructing an easy-to-learn value function is to decompose the reward function of the environment into n different reward functions. Each of them is assigned a separate reinforcement-learning agent. Similar to the Horde architecture (Sutton et al., 2011), all these agents can learn in parallel on the same sample sequence by using off-policy learning. Each agent gives its action-values of the current state to an aggregator, which combines them into a single value for each action. The current action is selected based on these aggregated values.
We test our approach on two domains: a toy-problem, where an agent has to eat 5 randomly located fruits, and Ms. Pac-Man, one of the hard games from the ALE benchmark set (Bellemare et al., 2013).
Our HRA method builds upon the Horde architecture (Sutton et al., 2011). The Horde architecture consists of a large number of ‘demons’ that learn in parallel via off-policy learning. Each demon trains a separate general value function (GVF) based on its own policy and pseudo-reward function. A pseudo-reward can be any feature-based signal that encodes useful information. The Horde architecture is focused on building up general knowledge about the world, encoded via a large number of GVFs. HRA focusses on training separate components of the environment-reward function, in order to more efficiently learn a control policy. UVFA (Schaul et al., 2015) builds on Horde as well, but extends it along a different direction. UVFA enables generalization across different tasks/goals. It does not address how to solve a single, complex task, which is the focus of HRA.
Learning with respect to multiple reward functions is also a topic of multi-objective learning (Roijers et al., 2013). So alternatively, HRA can be viewed as applying multi-objective learning in order to more efficiently learn a policy for a single reward function.
Reward function decomposition has been studied among others by Russell & Zimdar (2003) and Sprague & Ballard (2003). This earlier work focusses on strategies that achieve optimal behavior. Our work is aimed at improving learning-efficiency by using simpler value functions and relaxing optimality requirements.
There are also similarities between HRA and UNREAL (Jaderberg et al., 2017). Notably, both solve multiple smaller problems in order to tackle one hard problem. However, the two architectures are different in their workings, as well as the type of challenge they address. UNREAL is a technique that boosts representation learning in difficult scenarios. It does so by using auxiliary tasks to help train the lower-level layers of a deep neural network. An example of such a challenging representation-learning scenario is learning to navigate in the 3D Labyrinth domain. On Atari games, the reported performance gain of UNREAL is minimal, suggesting that the standard deep RL architecture is sufficiently powerful to extract the relevant representation. By contrast, the HRA architecture breaks down a task into smaller pieces. HRA’s multiple smaller tasks are not unsupervised; they are tasks that are directly relevant to the main task. Furthermore, whereas UNREAL is inherently a deep RL technique, HRA is agnostic to the type of function approximation used. It can be combined with deep neural networks, but it also works with exact, tabular representations. HRA is useful for domains where having a high-quality representation is not sufficient to solve the task efficiently.
Diuk’s object-oriented approach (Diuk et al., 2008) was one of the first methods to show efficient learning in video games. This approach exploits domain knowledge related to the transition dynamics to efficiently learn a compact transition model, which can then be used to find a solution using dynamic-programming techniques. This inherently model-based approach has the drawback that, while it efficiently learns a very compact model of the transition dynamics, it does not reduce the state-space of the problem. Hence, it does not address the main challenge of Ms. Pac-Man: its huge state-space, which is intractable even for DP methods (Diuk applied his method to an Atari game with only 6 objects, whereas Ms. Pac-Man has over 150 objects).
Finally, HRA relates to options (Sutton et al., 1999; Bacon et al., 2017), and more generally to hierarchical learning (Barto & Mahadevan, 2003; Kulkarni et al., 2016). Options are temporally-extended actions that, like HRA’s heads, can be trained in parallel based on their own (intrinsic) reward functions. However, once an option has been trained, the role of its intrinsic reward function is over. A higher-level agent that uses an option sees it as just another action and evaluates it using its own reward function. This can yield great speed-ups in learning and substantially improve exploration, but it does not directly make the value function of the higher-level agent less complex. The heads of HRA, by contrast, represent values trained with components of the environment reward. Even after training, these values stay relevant, because the aggregator uses them to select its action.
The strategy behind HRA is to decompose the reward function of the environment into n different reward functions and to train a separate reinforcement-learning agent on each of these reward functions. There are infinitely many different decompositions of a reward function possible, but to achieve value functions that are easy to learn, the decomposition should be such that each component reward function is mainly affected by only a small number of state variables.
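As an illustration, a minimal sketch of this scheme with tabular per-head Q-learning could look as follows. The environment interface, sizes, and hyperparameters (`n_states`, `n_heads`, `alpha`, etc.) are illustrative placeholders, not the paper's code; the sketch only shows that each head is updated off-policy from its own reward component while the aggregator sums the heads' Q-values for action selection.

```python
import numpy as np

n_states, n_actions, n_heads = 100, 4, 5   # illustrative sizes
alpha, gamma, epsilon = 0.1, 0.95, 0.1     # illustrative hyperparameters
Q = np.zeros((n_heads, n_states, n_actions))  # one Q-table per head
rng = np.random.default_rng(0)

def aggregate(state):
    """Combine per-head action-values into a single vector (here: a plain sum)."""
    return Q[:, state, :].sum(axis=0)

def select_action(state):
    """Epsilon-greedy action selection on the aggregated Q-values."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(aggregate(state)))

def hra_update(state, action, next_state, component_rewards, done):
    """One off-policy update per head; all heads share the same transition."""
    for k in range(n_heads):
        target = component_rewards[k]
        if not done:
            target += gamma * Q[k, next_state].max()
        Q[k, state, action] += alpha * (target - Q[k, state, action])
```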
In its basic setting, the only domain knowledge applied to HRA is in the form of the decomposed reward function. However, one of the strengths of HRA is that it can easily exploit more domain knowledge, if available. Domain knowledge can be exploited in one of the following ways:
1. Removing irrelevant features. Features that do not affect the received reward in any way (directly or indirectly) only add noise to the learning process and can be removed.
2. Identifying terminal states. Terminal states are states from which no further reward can be received; they have by definition a value of 0. Using this knowledge, HRA can refrain from approximating this value by the value network, such that the weights can be fully used to represent the non-terminal states.
3. Using pseudo-reward functions. Instead of updating a head of HRA using a component of the environment reward, it can be updated using a pseudo-reward. In this scenario, a set of GVFs is trained in parallel using pseudo-rewards.
While these approaches are not specific to HRA, HRA can exploit domain knowledge to a much greater extent, because it can apply these approaches to each head individually. We show this empirically in Section 4.1.
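The following sketch illustrates how the three forms of knowledge could be attached to a head; the class and attribute names are hypothetical and only meant to make the per-head nature of these choices concrete.

```python
import numpy as np

class Head:
    """Illustrative per-head container for the three kinds of domain knowledge."""

    def __init__(self, relevant_idx, is_terminal, pseudo_reward=None):
        self.relevant_idx = np.asarray(relevant_idx)  # (1) keep only relevant features
        self.is_terminal = is_terminal                # (2) callable: state -> bool
        self.pseudo_reward = pseudo_reward            # (3) optional callable, else use the reward component

    def features(self, state_features):
        """Drop features that are irrelevant for this head."""
        return state_features[self.relevant_idx]

    def target(self, reward_component, next_state, next_value, gamma):
        """Bootstrapped target; terminal states are clamped to value 0 instead of approximated."""
        r = (self.pseudo_reward(next_state)
             if self.pseudo_reward is not None else reward_component)
        if self.is_terminal(next_state):
            return r
        return r + gamma * next_value
```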
In our first domain, we consider an agent that has to collect fruits as quickly as possible in a 10×10 grid. There are 10 possible fruit locations, spread out across the grid. For each episode, a fruit is randomly placed on 5 of those 10 locations. The agent starts at a random position. The reward is +1 if a fruit gets eaten and 0 otherwise. An episode ends after all 5 fruits have been eaten or after 300 steps, whichever comes first.
We compare the performance of DQN with HRA using the same network. For HRA, we decompose the reward function into 10 different reward functions, one per possible fruit location. The network consists of a binary input layer of length 110, encoding the agent’s position and whether there is a fruit on each location. This is followed by a fully connected hidden layer of length 250. This layer is connected to 10 heads consisting of 4 linear nodes each, representing the action-values of the 4 actions under the different reward functions. Finally, the mean of all nodes across heads is computed using a final linear layer of length 4 that connects the output of corresponding nodes in each head. This layer has fixed weights with value 1 (i.e., it implements Equation 5). The difference between HRA and DQN is that DQN updates the network from the fourth layer using loss function (2), whereas HRA updates the network from the third layer using loss function (6).
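A sketch of the described network in PyTorch is given below. The layer sizes (110 inputs, 250 hidden units, 10 heads of 4 linear nodes) come from the text; the hidden-layer activation is an assumption, and the fixed-weight aggregation is shown as a sum over heads, which differs from a mean only by a constant factor and yields the same greedy policy.

```python
import torch
import torch.nn as nn

class HRAFruitNet(nn.Module):
    """Sketch of the fruit-collection architecture: shared hidden layer, 10 heads,
    and a fixed-weight aggregation across corresponding action nodes."""

    def __init__(self, n_heads=10, n_actions=4):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(110, 250), nn.ReLU())  # activation assumed
        self.heads = nn.ModuleList(nn.Linear(250, n_actions) for _ in range(n_heads))

    def forward(self, x):
        h = self.hidden(x)
        per_head_q = torch.stack([head(h) for head in self.heads], dim=1)  # (batch, 10, 4)
        aggregated_q = per_head_q.sum(dim=1)  # fixed weights of 1 across heads
        return per_head_q, aggregated_q

# HRA would apply a separate loss to each slice of per_head_q using its own
# reward component; DQN would instead regress aggregated_q on a single target.
net = HRAFruitNet()
per_head_q, q = net(torch.zeros(1, 110))
```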
Besides the full network, we test using different levels of domain knowledge, as outlined in Section 3.2: 1) removing the irrelevant features for each head (providing only the position of the agent + the corresponding fruit feature); 2) the above plus identifying terminal states; 3) the above plus using pseudo rewards for learning GVFs to go to each of the 10 locations (instead of learning a value function associated to the fruit at each location). The advantage is that these GVFs can be trained even if there is no fruit at a location. The head for a particular location copies the Q-values of the corresponding GVF if the location currently contains a fruit, or outputs 0s otherwise. We refer to these as HRA+1, HRA+2 and HRA+3, respectively. For DQN, we also tested a version that was applied to the same network as HRA+1; we refer to this version as DQN+1.
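The gating rule used by HRA+3 can be written compactly; the sketch below uses illustrative stand-ins for the trained GVF Q-values and only shows how a location's head copies its GVF output when a fruit is present and outputs zeros otherwise.

```python
import numpy as np

n_locations, n_actions = 10, 4
gvf_q = np.random.rand(n_locations, n_actions)    # stand-in for trained GVF pseudo Q-values
fruit_present = np.array([True, False] * 5)        # which locations currently hold a fruit

# Each head copies its GVF's Q-values if its fruit is present, else outputs 0s.
head_q = np.where(fruit_present[:, None], gvf_q, 0.0)   # shape (10, 4)
aggregated_q = head_q.sum(axis=0)                        # input to greedy action selection
```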
As a final variant, the state features are fed directly into the different heads. Because the heads are linear, this representation reduces to an exact, tabular representation. For the tabular representation, we used the same step-size as the optimal step-size for the deep network version.
Our second domain is the Atari 2600 game Ms. Pac-Man (see Figure 4). Points are obtained by eating pellets, while avoiding ghosts (contact with one causes Ms. Pac-Man to lose a life). Eating one of the special power pellets turns the ghosts blue for a short duration, allowing them to be eaten for extra points. Bonus fruits can be eaten for further points, twice per level. When all pellets have been eaten, a new level is started. There are a total of 4 different maps and 7 different fruit types, each with a different point value. We provide full details on the domain in the supplementary material.
Baselines. While our version of Ms. Pac-Man is the same as the one used in the literature, we use different preprocessing. Hence, to test the effect of our preprocessing, we implement the A3C method (Mnih et al., 2016) and run it with our preprocessing. We refer to the version with our preprocessing as ‘A3C(channels)’, to the version with the standard preprocessing as ‘A3C(pixels)’, and to A3C’s score reported in the literature as ‘A3C(reported)’.
Preprocessing. Each frame from ALE is 210×160 pixels. We cut the bottom part and the top part of the screen to end up with 160×160 pixels. From this, we extract the position of the different objects and create for each object a separate input channel, encoding its location with an accuracy of 4 pixels. This results in 11 binary channels of size 40×40. Specifically, there is a channel for Ms. Pac-Man, each of the four ghosts, each of the four blue ghosts (these are treated as different objects), the fruit, plus one channel with all the pellets (including power pellets). For A3C, we combine the 4 channels of the ghosts into a single channel, to allow it to generalise better across ghosts. We do the same with the 4 channels of the blue ghosts. Instead of giving the history of the last 4 frames as done in the literature, we give the orientation of Ms. Pac-Man as a 1-hot vector of length 4 (representing the 4 compass directions).
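A sketch of this channel encoding is shown below. The object-extraction step itself is domain-specific and not shown; the function signature and channel ordering are assumptions used only to illustrate the 4-pixel binning into 40×40 binary channels.

```python
import numpy as np

N_CHANNELS = 11  # Ms. Pac-Man, 4 ghosts, 4 blue ghosts, fruit, pellets

def make_channels(object_positions, pellet_positions):
    """object_positions: dict channel_index -> (x, y) in the cropped 160x160 frame;
    pellet_positions: list of (x, y) for all pellets (including power pellets),
    which all go into the last channel."""
    channels = np.zeros((N_CHANNELS, 40, 40), dtype=np.uint8)
    for ch, (x, y) in object_positions.items():
        channels[ch, y // 4, x // 4] = 1       # 4-pixel accuracy: row = y//4, col = x//4
    for (x, y) in pellet_positions:
        channels[10, y // 4, x // 4] = 1
    return channels
```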
HRA architecture. The environment reward signal corresponds with the points scored in the game. Before decomposing the reward function, we perform reward shaping by adding a negative reward of -1000 for contact with a ghost (which causes Ms. Pac-Man to lose a life). After this, the reward is decomposed in a way that each object in the game (pellet/fruit/ghost/blue ghost) has its own reward function. Hence, there is a separate RL agent associated with each object in the game that estimates a Q-value function of its corresponding reward function.
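A minimal sketch of this decomposition step is given below; the event representation is hypothetical and only shows that the shaping term of -1000 is added for ghost contact and that every other game object contributes its own reward component.

```python
GHOST_PENALTY = -1000.0  # shaping reward for losing a life to a ghost

def decompose_reward(events):
    """events: list of (object_id, object_type, points) tuples for one step;
    object_type == 'ghost_hit' marks contact with a ghost. Returns one reward
    component per object/head that was triggered this step."""
    components = {}
    for object_id, object_type, points in events:
        if object_type == 'ghost_hit':
            components[object_id] = GHOST_PENALTY
        else:
            components[object_id] = float(points)  # pellet / fruit / blue ghost
    return components
```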
To estimate each component reward function, we use the three forms of domain knowledge discussed in Section 3.2. HRA uses GVFs that learn pseudo Q-values (with values in the range [0, 1]) for getting to a particular location on the map (separate GVFs are learnt for each of the four maps). In contrast to the fruit collection task (Section 4.1), HRA learns part of its representation during training: it starts off with 0 GVFs and 0 heads for the pellets. By wandering around the maze, it discovers new map locations it can reach, resulting in new GVFs being created. Whenever the agent finds a pellet at a new location it creates a new head corresponding to the pellet.
The Q-values for an object (pellet/fruit/ghost/blue ghost) are set to the pseudo Q-values of the GVF corresponding with the object’s location (i.e., moving objects use a different GVF each time), multiplied with a weight that is set equal to the reward received when the object is eaten. If an object is not on the screen, all its Q-values are 0.
We test two aggregator types. The first is a linear aggregator that sums the Q-values of all heads (see Equation 5). For the second, we take the sum of all the heads that produce points and normalise the resulting Q-values; we then add the sum of the Q-values of the heads of the regular ghosts, multiplied by a weight vector.
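The sketch below illustrates the construction of the object heads and both aggregators. Array shapes, the normalisation scheme, and the ghost weight are illustrative assumptions; the exact forms are in the supplementary material.

```python
import numpy as np

def object_head_q(gvf_q_at_location, object_reward, on_screen):
    """Head Q-values for one object: the GVF's pseudo Q-values at the object's
    current location, scaled by that object's reward; zeros if off-screen."""
    if not on_screen:
        return np.zeros_like(gvf_q_at_location)
    return object_reward * gvf_q_at_location

def aggregate_linear(head_qs):
    """Aggregator 1: plain sum over all heads."""
    return np.sum(head_qs, axis=0)

def aggregate_normalised(point_head_qs, ghost_head_qs, ghost_weight):
    """Aggregator 2: normalise the summed point-producing heads, then add the
    summed regular-ghost heads scaled by a weight (normalisation scheme assumed)."""
    points = np.sum(point_head_qs, axis=0)
    norm = np.linalg.norm(points)
    if norm > 0:
        points = points / norm
    return points + ghost_weight * np.sum(ghost_head_qs, axis=0)
```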
We test two complementary types of exploration. Each type adds an extra exploration head to the architecture. The first type, which we call diversification, produces random Q-values drawn from a uniform distribution over [0, 20]. We find that it is only necessary during the first 50 steps, to ensure that each episode starts randomly. The second type, which we call count-based, adds a bonus for state-action pairs that have not been explored much. It is inspired by upper confidence bounds (Auer et al., 2002). Full details can be found in the supplementary material.
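The two exploration heads could be sketched as follows. The range [0, 20] and the 50-step cut-off come from the text; the count-based bonus shown here is a generic UCB-style formula and stands in for the exact form given in the supplementary material.

```python
import numpy as np

rng = np.random.default_rng(0)

def diversification_head(n_actions, step):
    """Random Q-values in [0, 20], used only during the first 50 steps of an episode."""
    if step < 50:
        return rng.uniform(0.0, 20.0, size=n_actions)
    return np.zeros(n_actions)

def count_based_head(visit_counts, total_count, scale=1.0):
    """Bonus for rarely tried actions, inspired by upper confidence bounds
    (the exact bonus used by HRA is assumed, not reproduced here)."""
    return scale * np.sqrt(np.log(total_count + 1) / (visit_counts + 1))
```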
For our final experiment, we implement a special head inspired by the executive-memory literature (Fuster, 2003; Gluck et al., 2013). When a human game player reaches the limit of their cognitive and physical ability, they start to look for favourable situations or even glitches and memorise them. This cognitive process amounts to memorising a sequence of actions (also called a habit), and is not necessarily optimal. Our executive-memory head records every sequence of actions that led to passing a level without losing a life. Then, when facing the same level, the head gives a very high value to the recorded action, in order to force the aggregator’s selection. Note that our simplified version of executive memory does not generalise.
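A sketch of this head is given below; the class, the level identifier, and the magnitude of the forcing value are illustrative, and the sketch deliberately mirrors the fact that this head does not generalise beyond exact replay.

```python
class ExecutiveMemory:
    """Records the action sequence that cleared a level without losing a life and
    replays it by dominating the aggregator when the same level is seen again."""

    HIGH_VALUE = 1e6  # assumed to be large enough to dominate the other heads

    def __init__(self):
        self.sequences = {}   # level_id -> recorded list of actions
        self.current = []

    def record(self, action):
        self.current.append(action)

    def level_finished(self, level_id, lost_life):
        if not lost_life:
            self.sequences[level_id] = list(self.current)
        self.current = []

    def head_q(self, level_id, step, n_actions):
        q = [0.0] * n_actions
        seq = self.sequences.get(level_id)
        if seq is not None and step < len(seq):
            q[seq[step]] = self.HIGH_VALUE
        return q
```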
Evaluation metrics. There are two different evaluation methods used across the literature, which result in very different scores. Because ALE is ultimately a deterministic environment (it implements pseudo-randomness using a random number generator that always starts with the same seed), both evaluation metrics aim to create randomness in the evaluation in order to rate methods with more generalising behaviour higher. The first metric introduces a mild form of randomness by taking a random number of no-op actions before control is handed over to the learning algorithm. In the case of Ms. Pac-Man, however, the game starts with a certain inactive period that exceeds the maximum number of no-op steps, so the game has a fixed start after all. The second metric selects random starting points along a human trajectory; this results in much stronger randomness and does produce the intended random-start evaluation. We refer to these metrics as ‘fixed start’ and ‘random start’.
Results. Figure 5 shows the training curves; Table 1 shows the final score after training. The best reported fixed start score comes from STRAW (Vezhnevets et al., 2016); the best reported random start score comes from the Dueling network architecture (Wang et al., 2016). The human fixed start score comes from Mnih et al. (2015); the human random start score comes from Nair et al. (2015). We train A3C for 800 million frames. Because HRA learns fast, we train it only for 5,000 episodes, corresponding to about 150 million frames (note that better policies result in more frames per episode). We tried a few different settings for HRA: with/without normalisation and with/without each type of exploration. The score shown for HRA uses the best combination: with normalisation and with both exploration types. All combinations achieved over 10,000 points in training, except the combination with no exploration at all, which, not surprisingly, performed very poorly. With the best combination, HRA not only outperforms the state-of-the-art on both metrics, it also significantly outperforms the human score, convincingly demonstrating the strength of HRA.
Comparing A3C(pixels) and A3C(channels) in Table 1 reveals a surprising result: while we use advanced preprocessing by separating the screen image into relevant object channels, this did not significantly change the performance of A3C.
In our final experiment, we test how well HRA does if it exploits the weakness of the fixed-start evaluation metric by using a simplified version of executive memory. Using this version, we not only surpass the human high-score of 266,330 points, we achieve the maximum possible score of 999,990 points in less than 3,000 episodes. The curve is slow in the first stages because the model has to be trained, but even though the further levels get more and more difficult, passing levels speeds up by taking advantage of already knowing the maps. Obtaining more points is impossible, not because the game ends, but because the score overflows to 0 when reaching a million points.
One of the strengths of HRA is that it can exploit domain knowledge to a much greater extent than single-head methods. This is clearly shown by the fruit collection task: while removing irrelevant features improves the performance of HRA, the performance of DQN decreased when provided with the same network architecture. Furthermore, separating the pixel image into multiple binary channels only makes a small improvement in the performance of A3C over learning directly from pixels. This demonstrates that the reason that modern deep RL methods struggle with Ms. Pac-Man is not related to learning from pixels; the underlying issue is that the optimal value function for Ms. Pac-Man cannot easily be mapped to a low-dimensional representation.
HRA solves Ms. Pac-Man by learning close to 1,800 general value functions. This results in an exponential breakdown of the problem size: whereas the input state-space corresponding with the binary channels is in the order of 10^77, each GVF has a state-space in the order of 10^3 states, small enough to be represented without any function approximation. While we could have used a deep network for representing each GVF, using a deep network for such small problems hurts more than it helps, as evidenced by the experiments on the fruit collection domain.
We argue that many real-world tasks allow for reward decomposition. Even if the reward function can only be decomposed in two or three components, this can already help a lot, due to the exponential decrease of the problem size that decomposition might cause.
Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
Bacon, P., Harb, J., and Precup, D. The option-critic architecture. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI), 2017.
Barto, A. G. and Mahadevan, S. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(4):341–379, 2003.
Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
Diuk, C., Cohen, A., and Littman, M. L. An object-oriented representation for efficient reinforcement learning. In Proceedings of The 25th International Conference on Machine Learning, 2008.
Fuster, J. M. Cortex and mind: Unifying cognition. Oxford university press, 2003.
Gluck, M. A., Mercado, E., and Myers, C. E. Learning and memory: From brain to behavior. Palgrave Macmillan, 2013.
Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. Reinforcement learning with unsupervised auxiliary tasks. In International Conference on Learning Representations, 2017.
Kulkarni, T. D., Narasimhan, K. R., Saeedi, A., and Tenenbaum, J. B. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems 29, 2016.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Harley, T., Lillicrap, T. P., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning, pp. 1928–1937, 2016.
Nair, A., Srinivasan, P., Blackwell, S., Alcicek, C., Fearon, R., De Maria, A., Panneershelvam, V., Suleyman, M., Beattie, C., Petersen, S., Legg, S., Mnih, V., Kavukcuoglu, K., and Silver, D. Massively parallel methods for deep reinforcement learning. In Deep Learning Workshop, ICML, 2015.
Ng, A. Y., Harada, D., and Russell, S. Policy invariance under reward transformations: theory and application to reward shaping. In Proceedings of The 16th International Conference on Machine Learning, 1999.
Roijers, D. M., Vamplew, P., Whiteson, S., and Dazeley, R. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 2013.
Russell, S. and Zimdar, A. L. Q-decomposition for reinforcement learning agents. In Proceedings of The 20th International Conference on Machine Learning, 2003.
Schaul, T., Horgan, D., Gregor, K., and Silver, D. Universal value function approximators. In Proceedings of The 32nd International Conference on Machine Learning, 2015.
Schmidhuber, J. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). In IEEE Transactions on Autonomous Mental Development 2.3, pp. 230–247, 2010.
Sprague, N. and Ballard, D. Multiple-goal reinforcement learning with modular sarsa(0). In International Joint Conference on Artificial Intelligence, 2003.
Stout, A., Konidaris, G., and Barto, A. G. Intrinsically motivated reinforcement learning: A promising framework for developmental robotics. In The AAAI Spring Symposium on Developmental Robotics, 2005.
Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998.
Sutton, R. S., Modayil, J., Delp, M., Degris, T., Pilarski, P. M., White, A., and Precup, D. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In Proceedings of the 10th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2011.
Sutton, R. S., Precup, D., and Singh, S. P. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.
Szepesvári, C. Algorithms for reinforcement learning. Morgan and Claypool, 2009.
van Seijen, H., van Hasselt, H., Whiteson, S., and Wiering, M. A theoretical and empirical analysis of expected sarsa. In IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pp. 177–184, 2009.
Vezhnevets, A., Mnih, V., Osindero, S., Graves, A., Vinyals, O., Agapiou, J., and Kavukcuoglu, K. Strategic attentive writer for learning macro-actions. In Advances in Neural Information Processing Systems 29, 2016.
Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., and de Freitas, N. Dueling network architectures for deep reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning, 2016.