LEARNING GOAL-CONDITIONED VALUE FUNCTIONS WITH ONE-STEP PATH REWARDS RATHER THAN GOAL-REWARDS

ABSTRACT

Multi-goal reinforcement learning (MGRL) addresses tasks where the desired goal state can change for every trial. State-of-the-art algorithms model these problems such that the reward formulation depends on the goals, to associate them with high reward. This dependence introduces additional goal reward resampling steps in algorithms like Hindsight Experience Replay (HER) that reuse trials in which the agent fails to reach the goal by recomputing rewards as if reached states were pseudo-desired goals. We propose a reformulation of goal-conditioned value functions for MGRL that yields a similar algorithm, while removing the dependence of reward functions on the goal. Our formulation thus obviates the requirement of reward-recomputation that is needed by HER and its extensions. We also extend a closely related algorithm, Floyd-Warshall Reinforcement Learning, from tabular domains to deep neural networks for use as a baseline. Our results are competitive with HER while substantially improving sampling efficiency in terms of reward computation.

1 INTRODUCTION

Many tasks in robotics require the specification of a goal for every trial. For example, a robotic arm can be tasked to move an object to an arbitrary goal position on a table (Gu et al., 2017); a mobile robot can be tasked to navigate to an arbitrary goal landmark on a map (Zhu et al., 2017). The adaptation of reinforcement learning to such goal-conditioned tasks where goal locations can change is called Multi-Goal Reinforcement Learning (MGRL) (Plappert et al., 2018). State-of-the-art MGRL algorithms (Andrychowicz et al., 2017; Pong et al., 2018) work by estimating goal-conditioned value functions (GCVFs), which are defined as expected cumulative rewards from start states with specified goals. GCVFs, in turn, are used to compute policies that determine the actions to take at every state.

To learn GCVFs, MGRL algorithms use a goal-reward, defined as the relatively higher reward received on reaching the desired goal state. This makes the reward function dependent on the desired goal. For example, in the Fetch-Push task (Plappert et al., 2018) of moving a block to a given location on a table, every movement incurs a “-1” reward while reaching the desired goal returns a “0” goal-reward. This dependence introduces additional reward resampling steps in algorithms like Hindsight Experience Replay (HER) (Andrychowicz et al., 2017), where trials in which the agent failed to reach the goal are reused by recomputing rewards as if the reached states were pseudo-desired goals. Due to the dependence of the reward function on the goal, the relabelling of every pseudo-goal requires an independent reward-recomputation step, which can be expensive.

In this paper, we demonstrate that goal-rewards are not needed to learn GCVFs. For the Fetch-Push example, the “0” goal-reward does not need to be achieved to learn its GCVF. Specifically, the agent continues to receive a “-1” reward even when the block is in the given goal location. This reward formulation is atypical in conventional RL, where a high reward is used to specify the desired goal location. However, this goal-reward is not necessary in goal-conditioned RL because the goal is already specified at the start of every episode. We use this idea to propose a goal-conditioned RL algorithm which learns to reach goals without goal-rewards. This is a counter-intuitive result which is important for understanding goal-conditioned RL.

Figure 1: Snapshots of the Fetch and Hand tasks used in our experiments.
Let us consider another example to motivate the redundancy of goal-rewards. Consider a student who has moved to a new campus. To learn about the campus, the student explores it randomly with no specific goal in mind. The key intuition here is that the student is not incentivized to find specific goal locations (i.e. no goal-rewards) but is aware of the effort required to travel between points around the university. When tasked with finding a goal classroom, the student can chain together these path efforts to find the least-effort path to the classroom. Based on this intuition of least-effort paths, we redefine GCVFs to be the expected path-reward that is learned for all possible start-goal pairs. We introduce a one-step loss that assumes one-step paths to be the paths of maximum reward between pairs wherein the state and goal are adjacent. Under this interpretation, the Bellman equation chooses and chains together one-step paths to find longer maximum reward paths. Experimentally, we show how this simple reinterpretation, which does not use goal rewards, performs as well as HER while outperforming it in terms of reward computation.

We also extend a closely related algorithm, Floyd-Warshall Reinforcement Learning (FWRL) (Kaelbling, 1993) (also called Dynamic Goal Reinforcement Learning), to use parametric function approximators instead of tabular functions. Similar to our re-definition of GCVFs, FWRL learns a goal-conditioned Floyd-Warshall function that represents path-rewards instead of future-rewards. We translate FWRL's compositionality constraints in the space of GCVFs to introduce additional loss terms to the objective. However, these additional loss terms do not show improvement over the baseline. We conjecture that the compositionality constraints are already captured by other loss terms.

In summary, the contributions of this work are twofold. Firstly, we reinterpret goal-conditioned value functions as expected path-rewards and introduce a one-step loss, thereby removing the dependency of GCVFs on goal-rewards and reward resampling. We showcase our algorithm's improved sample efficiency (in terms of reward computation). We thus extend algorithms like HER to domains where reward recomputation is expensive or infeasible. Secondly, we extend tabular Floyd-Warshall Reinforcement Learning to use deep neural networks.

2 RELATED WORK

Goal-conditioned tasks in reinforcement learning have been approached in two ways, depending upon whether the algorithm explicitly separates state and goal representations. The first approach is to use vanilla reinforcement learning algorithms that do not explicitly make this separation (Mirowski et al., 2016; Dosovitskiy & Koltun, 2016; Gupta et al., 2017; Parisotto & Salakhutdinov, 2017; Mirowski et al., 2018). These algorithms depend upon neural network architectures to carry the burden of learning the separated representations.

The second approach makes this separation explicit via the use of goal-conditioned value functions (Foster & Dayan, 2002; Sutton et al., 2011). Universal Value Function Approximators (Schaul et al., 2015) propose a network architecture and a factorization technique that separately encodes states and goals, taking advantage of correlations in their representations. Temporal Difference Models (Pong et al., 2018) combine model-free and model-based RL to gain advantages from both realms by defining and learning a horizon-dependent GCVF. All these works require the use of goal-dependent reward functions and define GCVFs as future-rewards instead of path-rewards, contrasting them with our contribution.

Unlike our approach, Andrychowicz et al. (2017) propose Hindsight Experience Replay, a technique for resampling state-goal pairs from failed experiences, which leads to faster learning in the presence of sparse rewards. In addition to depending on goal-rewards, HER also requires the repeated recomputation of the reward function. In contrast, we show how removing goal-rewards removes the need for such recomputations. We utilize HER as a baseline in our work.

Kaelbling (1993) also uses the structure of the space of GCVFs to learn. This work employs compositionality constraints in the space of these functions to accelerate learning in a tabular domain. While their definition of GCVFs is similar to ours, the terminal condition is different. We describe this difference in Section 4. We also extend their tabular formulation to deep neural networks and evaluate it against the baselines.

3 BACKGROUND


DEEP REINFORCEMENT LEARNING

A number of reinforcement learning algorithms use parametric function approximators to estimate the return in the form of an action-value function, Q(s, a).

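The equations that appeared here (which we take to be Equations (1)-(2)) were rendered as an image and are not recoverable; a standard form of this objective, given as a sketch rather than the paper's exact notation, is the squared Bellman error over transitions drawn from a replay buffer $\mathcal{B}$:

$$
\mathcal{L}_{Q}(\theta) \;=\; \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim \mathcal{B}}\!\left[ \big( r_t + \gamma \max_{a'} Q_{\theta^-}(s_{t+1}, a') - Q_{\theta}(s_t, a_t) \big)^{2} \right],
$$

where $\gamma \in [0, 1)$ is the discount factor and $Q_{\theta^-}$ is a slowly updated target network. For the continuous-action tasks considered here, HER builds on DDPG (Lillicrap et al., 2015), so the max over $a'$ is replaced by a learned deterministic actor.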


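The goal-conditioned definitions that followed (Equations (3)-(5) in the original numbering, which is assumed here) were likewise lost. As a sketch, the conventional GCVF that HER builds on conditions the value on a desired goal $g$ and accumulates a goal-dependent reward until the episode ends at time $T$:

$$
Q(s_t, a_t, g) \;=\; \mathbb{E}\!\left[ \sum_{k=t}^{T} \gamma^{\,k-t} R(s_k, a_k, g) \right],
$$

so the reward $R(\cdot, \cdot, g)$ must depend on $g$ (for example, the sparse -1/0 reward of the Fetch tasks) to associate states near the goal with higher value.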

Hindsight Experience Replay HER (Andrychowicz et al., 2017) builds upon this definition of GCVFs (5). The main insight of HER is that there is no valuable feedback from the environment when the agent does not reach the goal. This is further exacerbated when goals are sparse in the state-space. HER solves this problem by reusing these failed experiences for learning. It recomputes a reward for each reached state by relabelling it as a pseudo-goal.

In our experiments, we employ HER's future strategy for pseudo-goal sampling. More specifically, two transitions from the same episode in the replay buffer, at times t and t + f, are sampled. The achieved goal g_{t+f} is then taken to be the pseudo-goal. The algorithm generates a new transition for time step t with the reward recomputed as if g_{t+f} were the desired goal, (s_t, a_t, s_{t+1}, R(s_t, a_t, g_{t+f})). HER uses this new transition as a sample.
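As a concrete illustration, a minimal sketch of the future strategy and of the reward recomputation it requires is given below; the function name and buffer layout are assumptions of this sketch, not the OpenAI Baselines implementation.

```python
import numpy as np

def her_future_sample(episode, reward_fn, rng):
    """Sample one relabelled transition with HER's 'future' strategy.

    episode: dict with arrays 'obs' and 'achieved_goals' of length T+1
             and 'actions' of length T.
    reward_fn: goal-dependent reward R(s, a, g); recomputed for every pseudo-goal.
    rng: a numpy Generator, e.g. np.random.default_rng(0).
    """
    T = len(episode["actions"])
    t = rng.integers(0, T)                           # original transition time
    future = rng.integers(t + 1, T + 1)              # t + f in the text's notation
    pseudo_goal = episode["achieved_goals"][future]  # g_{t+f}: what was actually reached

    # The reward must be recomputed because it depends on the (new) goal.
    r = reward_fn(episode["obs"][t], episode["actions"][t], pseudo_goal)

    return (episode["obs"][t], episode["actions"][t],
            episode["obs"][t + 1], pseudo_goal, r)
```

Our reformulation keeps this relabelling but drops the call to reward_fn, since in our setting the reward no longer depends on the goal.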

4 PATH REWARD-BASED GCVFS

In our definition of the GCVF, instead of making the reward function depend upon the goal, we accumulate rewards over a path (path-rewards), but only if the goal is reached. This makes the dependence on the goal explicit rather than implicit in the reward formulation.
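Equation (6) appeared here as an image; a reconstruction consistent with the description that follows (the exact notation and the (6a)/(6b) labels are assumed) is

$$
Q(s_t, a_t, g) \;=\;
\begin{cases}
\mathbb{E}\!\left[ \sum_{k=t}^{l} \gamma^{\,k-t} R(s_k, a_k) \right] & \text{if the goal } g \text{ is reached at step } l, \quad \text{(6a)} \\[6pt]
-\infty & \text{otherwise.} \quad \text{(6b)}
\end{cases}
$$

Note that the one-step reward $R(s_k, a_k)$ no longer takes the goal as an argument.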

Here, l is the time step at which the agent reaches the goal. If the agent does not reach the goal, the GCVF is defined to be negative infinity. The first term (6a) is the expected cumulative reward over paths from a given start state to the goal. This imposes the constraint that cyclical paths in the state space must have negative cumulative reward for (6a) to yield finite values. For most practical physical problems, this constraint naturally holds if the reward is taken to be some measure of negative energy expenditure. For example, in the robot arm experiments, moving the arm must expend energy (negative reward); achieving a positive-reward cycle would translate to generating infinite energy. In all our experiments with this formulation, the agent receives a constant “-1” reward at every step.
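The Bellman equation and its terminal condition (Equation (7)) were also images in the source; a plausible reconstruction, offered only as a sketch based on the discussion below, is

$$
Q(s_t, a_t, g) \;=\;
\begin{cases}
R(s_t, a_t) + \gamma \max_{a'} Q(s_{t+1}, a', g) & \text{if } s_{t+1} \text{ does not achieve } g, \quad \text{(7a)} \\[4pt]
R(s_t, a_t) & \text{if } s_{t+1} \text{ achieves } g. \quad \text{(7b)}
\end{cases}
$$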

Notice that the terminal step in this equation is the step at which the goal is reached. This differs from Equation (3), where the terminal step is the step at which the episode ends. This formulation is equivalent to the episode ending immediately when the goal is reached. This reformulation does not require goal-rewards, which in turn obviates the requirement for pseudo-goals and reward recomputation.

One-Step Loss To enable algorithms like HER to work under this reformulation, we need to recognize when the goal is reached (7b). This recognition is usually achieved through the receipt of a high goal-reward. Instead, we use (7b) as a one-step loss that serves this purpose; this is one of our main contributions.
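The loss itself was an image in the source; based on the description below, in which every transition is treated as a one-step episode whose pseudo-goal is the achieved goal, a sketch of the one-step loss (the mapping $\phi$ from a state to its achieved goal is assumed notation) is

$$
\mathcal{L}_{\text{step}}(\theta) \;=\; \mathbb{E}_{(s_t, a_t, r_t, s_{t+1})}\!\left[ \big( Q_{\theta}\big(s_t, a_t, \phi(s_{t+1})\big) - r_t \big)^{2} \right].
$$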

This loss is based on the assumption that the one-step reward is the highest reward between adjacent start-goal states, and it allows us to estimate the one-step reward between them. Once learned, it serves as a proxy for the reward of the last step to the goal (7b). The Bellman equation (7) serves as a one-step rollout that combines rewards to find maximum-reward paths to the goal.

The one-step loss differs from the terminal step of Q-learning because, unlike the terminal step, it is applicable to every transition. However, the one-step loss can be thought of as Q-learning in which every transition is a one-step episode whose pseudo-goal is the achieved goal.

DEEP FLOYD-WARSHALL REINFORCEMENT LEARNING

The GCVF redefinition and one-step loss introduced in this paper are inspired by the tabular formulation of Floyd-Warshall Reinforcement Learning (FWRL) (Kaelbling, 1993). We extend this algorithm for use with deep neural networks. Unfortunately, the algorithm itself does not show significant improvement over the baselines. However, the intuitions gained in its implementation led to the contributions of this paper.
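The loss terms shown here were lost to extraction. They enforce the Floyd-Warshall compositionality constraint in GCVF space, which for maximum-reward paths (ignoring discounting for clarity) states that passing through any intermediate goal $w$ cannot beat the direct value:

$$
Q(s, a, g) \;\geq\; Q(s, a, w) + \max_{a'} Q(w, a', g).
$$

A hedged sketch of the two terms, written so that $\mathcal{L}_{\text{lo}}$ and $\mathcal{L}_{\text{up}}$ penalize violations of this constraint and differ only in which side uses the main network $Q_{\theta}$ versus the target network $Q_{\theta^-}$ (the exact assignment is an assumption), is

$$
\mathcal{L}_{\text{lo}} = \mathbb{E}\!\left[ \max\!\big(0,\; Q_{\theta^-}(s, a, w) + \max_{a'} Q_{\theta^-}(w, a', g) - Q_{\theta}(s, a, g) \big)^{2} \right],
\qquad
\mathcal{L}_{\text{up}} = \mathbb{E}\!\left[ \max\!\big(0,\; Q_{\theta}(s, a, w) + \max_{a'} Q_{\theta}(w, a', g) - Q_{\theta^-}(s, a, g) \big)^{2} \right].
$$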

Note that the above terms differ only by choice of the target and main network.

Figure 2: For the Fetch tasks, we compare our method (red) against HER (blue) (Andrychowicz et al., 2017) and FWRL (green) (Kaelbling, 1993) on the distance-from-goal and success rate metrics. Both metrics are plotted against two progress measures: the number of training epochs and the number of reward computations. Except for the Fetch Slide task, we achieve comparable or better performance across the metrics and progress measures.

5 EXPERIMENTS

We use the environments introduced in Plappert et al. (2018) for our experiments. Broadly, the environments fall into two categories: Fetch tasks and Hand tasks. Our results show that learning is possible across all environments without the requirement of a goal-reward. More specifically, learning happens even when the reward given to our agent is always “-1”, as opposed to the HER formulation, where a special goal-reward of “0” is needed for learning to happen.
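To make the distinction concrete, the two reward formulations can be sketched as below; the helper names are hypothetical, and the 0.05 m threshold is the original HER value (see Section 6).

```python
import numpy as np

def her_reward(achieved_goal, desired_goal, threshold=0.05):
    """Goal-dependent sparse reward used by HER: "0" at the goal, "-1" elsewhere."""
    distance = np.linalg.norm(achieved_goal - desired_goal)
    return 0.0 if distance < threshold else -1.0

def path_reward(achieved_goal, desired_goal):
    """Goal-independent reward used in our formulation: always "-1" per step."""
    return -1.0
```

Because path_reward ignores desired_goal, relabelling a transition with a pseudo-goal never requires recomputing its reward.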

The Fetch tasks involve a simulation of the Fetch robot's 7-DOF robotic arm. The four tasks are Reach, Push, Slide and PickAndPlace. In the Reach task, the arm's end-effector is tasked to reach a particular 3D coordinate. In the Push task, a block on a table needs to be pushed to a given point on it. In the Slide task, a puck must be slid to a desired location. In the PickAndPlace task, a block on a table must be picked up and moved to a 3D coordinate.

The Hand tasks use a simulation of the Shadow Dexterous Hand to manipulate objects of different shapes and sizes. These tasks are HandReach, HandManipulateBlockRotateXYZ, HandManipulateEggFull and HandManipulatePenRotate. In HandReach, the hand's fingertips need to reach a given configuration. In HandManipulateBlockRotateXYZ, the hand needs to rotate a cubic block to a desired orientation. In HandManipulateEggFull, the hand repeats this orientation task with an egg, and in HandManipulatePenRotate, it does so with a pen.

Snapshots of all these tasks can be found in Figure 1. Note that these tasks use joint angles, not visual input.

5.1 METRICS

Similar to prior work, we evaluate all experiments on two metrics: the success rate and the average distance to the goal. The success rate is defined as the fraction of episodes in which the agent reaches the goal within a pre-defined threshold region. The distance metric is the Euclidean distance, in meters, between the achieved goal and the desired goal. These metrics are plotted against a standard progress measure, the number of training epochs, showing that our method is comparable to the baselines.

To emphasize that our method requires neither goal-rewards nor reward recomputation, we also plot these metrics against another progress measure: the number of reward computations used during training. This count includes both the episode rollouts and the reward recomputations during HER sampling.
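A minimal sketch of how the two metrics can be computed from final-state evaluation rollouts (array shapes and the threshold argument are assumptions of this sketch):

```python
import numpy as np

def evaluate_metrics(achieved_goals, desired_goals, threshold):
    """achieved_goals, desired_goals: arrays of shape (num_episodes, goal_dim)."""
    distances = np.linalg.norm(achieved_goals - desired_goals, axis=-1)  # meters
    success_rate = float(np.mean(distances < threshold))  # fraction within the threshold
    return distances.mean(), success_rate
```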

5.2 HYPER-PARAMETER CHOICES

Unless specified, all our hyper-parameters are identical to the ones used in the HER implementation (Dhariwal et al., 2017). We note two main changes to HER to make the comparison more fair. Firstly, we use a smaller distance-threshold. The environment used for HER and FWRL returns the goal-reward when the achieved goal is within this threshold of the desired goal. Because of the absence of goal-rewards, the distance-threshold information is not used by our method. We reduce the threshold to 1 cm, a reduction by a factor of 5 compared to HER.

Secondly, we run all experiments on 6 cores each, while HER uses 19. The batch size used is a function of the number of cores and hence this parameter has a significant effect on learning.

To ensure fair comparison, all experiments are run with the same hyper-parameters and random seeds to ensure that variations in performance are purely due to differences between the algorithms.

5.3 RESULTS

All our experimental results are described below, highlighting the strengths and weaknesses of our algorithm. Across all our experiments, the distance-to-the-goal metric achieves comparable performance to HER without requiring goal-rewards.

Fetch Tasks The experimental results for Fetch tasks are shown in Figure 2. For the Fetch Reach and Push tasks, our method achieves comparable performance to the baselines across both metrics in terms of training epochs and outperforms them in terms of reward recomputations. Notably, the Fetch Pick and Place task trains in significantly fewer epochs. For the Fetch Slide task the opposite is true. We conjecture that Fetch Slide is more sensitive to the distance threshold information, which our method is unable to use.

Hand Tasks For the Hand tasks, the distance to the goal and the success rate show different trends. We show the results in Figure 3. When the distance metric is plotted against epochs, we get comparable performance on all tasks; when plotted against reward computations, we outperform all baselines on all tasks except Hand Reach. The baselines perform well enough on this task, leaving less scope for significant improvement. These trends do not hold for the success rate metric, on which our method consistently under-performs compared to the baselines across tasks. This is surprising, as all algorithms achieve similar averages on the distance-from-goal metric. We conjecture that this might be the result of high-distance failure cases of the baselines, i.e. when the baselines fail, they do so at larger distances from the goal. In contrast, we assume our method's success and failure cases are closer together.

6 ANALYSIS

To gain a deeper understanding of the method, we perform three additional experiments on different tasks. We ask the following questions: (a) How important is the step loss? (b) What happens when the goal-reward is also available to our method? (c) How sensitive are HER and our method to the distance-threshold?

How important is the step loss? We choose the Fetch-Push task for this experiment. We run our algorithm with no goal-reward and without the step loss on this task. Results show that our algorithm fails to reach the goal when the step loss is removed (Fig. 4a), showing its necessity.
Figure 3: For the Hand tasks, we compare our method (red) against HER (blue) (Andrychowicz et al., 2017) and FWRL (green) (Kaelbling, 1993) on the distance-from-goal and success rate metrics. Furthermore, both metrics are plotted against two progress measures: the number of training epochs and the number of reward computations. Measured by distance from the goal, our method performs comparably to or better than the baselines for both progress measures. For the success rate, our method underperforms against the baselines.

Figure 5: We measure the sensitivity of HER and our method to the distance-threshold with respect to the success-rate and distance-from-goal metrics. The success rate of both algorithms is sensitive to the threshold, while only HER's distance-from-goal is affected by it.

What happens when the goal-reward is also available to our method? We run this experiment on the Fetch PickAndPlace task. We find that goal-rewards do not affect the performance of our algorithm, further supporting the claim that the goal-reward is avoidable (Fig. 4b).

How sensitive are HER and our method to the distance-threshold? In the absence of goal-rewards, our algorithm is not able to capture the distance-threshold information that decides whether the agent has reached the goal or not. This information is available to HER. To understand the sensitivity of our algorithm and HER to this parameter, we vary it over 0.05 (the original HER value), 0.01 and 0.001 meters (Fig. 5). Results show that for the success-rate metric, which is itself a function of this parameter, both algorithms are affected equally (Fig. 5a). For the distance-from-goal metric, only HER is affected (Fig. 5b). This fits our expectations as set up in Section 5.2.

7 CONCLUSION

In this work we propose a reinterpretation of goal-conditioned value functions and show that, under this paradigm, learning is possible in the absence of goal-rewards. This is a surprising result that runs counter to intuitions underlying most reinforcement learning algorithms. In future work, we will augment our method to incorporate the distance-threshold information to make the task easier to learn when the threshold is high. We hope that the experiments and results presented in this paper lead to a broader discussion about the assumptions actually required for learning multi-goal tasks.

REFERENCES

Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pp. 3981–3989, 2016.

Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, pp. 5048–5058, 2017.

Richard Bellman. The theory of dynamic programming. Technical report, RAND Corp, Santa Monica, CA, 1954.

Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. OpenAI Baselines. https://github.com/openai/baselines, 2017.

Alexey Dosovitskiy and Vladlen Koltun. Learning to act by predicting the future. arXiv preprint arXiv:1611.01779, 2016.

David Foster and Peter Dayan. Structure in the space of value functions. Machine Learning, 49 (2-3):325–346, 2002.

Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pp. 3389–3396. IEEE, 2017.

Saurabh Gupta, James Davidson, Sergey Levine, Rahul Sukthankar, and Jitendra Malik. Cognitive mapping and planning for visual navigation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

Leslie Pack Kaelbling. Learning to achieve goals. In IJCAI, pp. 1094–1099. Citeseer, 1993.

Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

Long-Ji Lin. Reinforcement learning for robots using neural networks. Technical report, Carnegie-Mellon Univ Pittsburgh PA School of Computer Science, 1993.

Piotr Mirowski, Razvan Pascanu, Fabio Viola, Hubert Soyer, Andrew J Ballard, Andrea Banino, Misha Denil, Ross Goroshin, Laurent Sifre, Koray Kavukcuoglu, et al. Learning to navigate in complex environments. arXiv preprint arXiv:1611.03673, 2016.

Piotr Mirowski, Matthew Koichi Grimes, Mateusz Malinowski, Karl Moritz Hermann, Keith Anderson, Denis Teplyashin, Karen Simonyan, Koray Kavukcuoglu, Andrew Zisserman, and Raia Hadsell. Learning to navigate in cities without a map. arXiv preprint arXiv:1804.00168, 2018.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015a.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015b.

Emilio Parisotto and Ruslan Salakhutdinov. Neural map: Structured memory for deep reinforcement learning. arXiv preprint arXiv:1702.08360, 2017.

Matthias Plappert, Marcin Andrychowicz, Alex Ray, Bob McGrew, Bowen Baker, Glenn Powell, Jonas Schneider, Josh Tobin, Maciek Chociej, Peter Welinder, et al. Multi-goal reinforcement learning: Challenging robotics environments and request for research. arXiv preprint arXiv:1802.09464, 2018.

Vitchyr Pong, Shixiang Gu, Murtaza Dalal, and Sergey Levine. Temporal difference models: Model-free deep rl for model-based control. arXiv preprint arXiv:1802.09081, 2018.

Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In International Conference on Machine Learning, pp. 1312–1320, 2015.

Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction. MIT press, 1998.

Richard S Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M Pilarski, Adam White, and Doina Precup. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In The 10th International Conference on Autonomous Agents and Multiagent Systems-Volume 2, pp. 761–768. International Foundation for Autonomous Agents and Multiagent Systems, 2011.

Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learning, 8(3-4):279–292, 1992.

Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J Lim, Abhinav Gupta, Li Fei-Fei, and Ali Farhadi. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pp. 3357–3364. IEEE, 2017.

Figure 6: Ablation of loss functions on the Fetch Push task. The Floyd-Warshall-inspired loss functions L_lo and L_up do not help much. L_step helps a little, but only in conjunction with HER (Andrychowicz et al., 2017).

Figure 7: Even when the goal-rewards are removed from HER (Andrychowicz et al., 2017) training, HER is able to learn only if L_step is added again. (HER - Goal Rewards + L_step) is our proposed method.

APPENDIX

Our Algorithm 1 differs from HER (Andrychowicz et al., 2017) because it contains an additional step-loss term L_step at line 17, which allows the algorithm to learn even when the rewards received are independent of the desired goal. Also, in HER sampling (line 13), HER recomputes the rewards because the goal is replaced with a pseudo-goal; our algorithm does not need reward recomputation because the reward formulation does not depend on the goal and is not affected by the choice of pseudo-goal. Our algorithm also differs from Floyd-Warshall Reinforcement Learning because it does not contain the L_up and L_lo terms and contains the additional L_step.
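Algorithm 1 itself was an image in the source. The sketch below is our reading of the description above, not the authors' pseudocode: it is written in PyTorch style (the referenced HER implementation is built on OpenAI Baselines), and the network interfaces, batch layout and hyper-parameters are assumptions.

```python
import torch

def train_step(q_net, q_target, actor, optimizer, batch, gamma=0.98):
    """One update of the proposed method (a sketch).

    batch: dict of tensors with keys 'obs', 'actions', 'next_obs',
           'rewards' (always -1, never recomputed),
           'goals' (relabelled by HER 'future' sampling, as in line 13 of Algorithm 1),
           'achieved_next' (the goal achieved at the next state).
    """
    obs, act = batch["obs"], batch["actions"]
    next_obs, rew = batch["next_obs"], batch["rewards"]
    goals, achieved_next = batch["goals"], batch["achieved_next"]

    with torch.no_grad():
        # Bellman target toward the (possibly relabelled) goal. Unlike HER,
        # the stored reward is reused as-is; no reward recomputation is needed.
        next_act = actor(next_obs, goals)
        target = rew + gamma * q_target(next_obs, next_act, goals)

    td_loss = (q_net(obs, act, goals) - target).pow(2).mean()

    # One-step loss (line 17 of Algorithm 1): every transition is a one-step
    # episode whose pseudo-goal is the achieved next-state goal, so the
    # predicted value should equal the one-step reward.
    step_loss = (q_net(obs, act, achieved_next) - rew).pow(2).mean()

    loss = td_loss + step_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```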


9 ABLATION ON LOSS AND GOAL REWARDS

In Figure 6 and Figure 7 we show ablations on loss functions and goal-rewards. In Figure 7, our method (HER - Goal Rewards + L_step) is shown in blue.
