THE WISDOM OF THE CROWD: RELIABLE DEEP REINFORCEMENT LEARNING THROUGH ENSEMBLES OF Q-FUNCTIONS

ABSTRACT

Reinforcement learning agents learn by exploring the environment and then exploiting what they have learned. This frees the human trainers from having to know the preferred action or intrinsic value of each encountered state. The cost of this freedom is that reinforcement learning is slower and more unstable than supervised learning. We explore the possibility that ensemble methods can remedy these shortcomings, and we do so by investigating a novel technique which harnesses the wisdom of the crowd by bagging Q-function approximator estimates.

Our results show that the proposed approach improves performance on all three tasks and all reinforcement learning approaches attempted. We are able to demonstrate that this is a direct result of the increased stability of the action portion of the state-action-value function used by Q-learning to select actions and by policy gradient methods to train the policy.

1 INTRODUCTION

In the reinforcement learning (RL) approach the agent learns by exploring its environment and the sometimes many approaches to solving a given problem. This frees the human trainers from having to know the preferred action or intrinsic value of each encountered state. There is no denying that RL has a grassroots feel: it is an important form of learning in the natural world, and it is only natural that machine learning practitioners would want to mimic its success.

This freedom comes at a price, however. The most common complaints about RL, especially when using function approximators to learn Q-functions, are that RL is too slow and unstable during learning. Learning by exploration of the environment produces a training signal which is less informative than a supervised training signal, which in turn requires a large number of training samples and repeated exposure to those samples. The proposed approach is an ensemble learning approach to RL similar to the well-known bagging approach, Breiman (1996), which trains ensemble members using the experience replay memory and combines their action selections using voting or averaging.

1.1 RECENT ADVANCEMENTS IN RL

There have been tremendous advancements in RL in recent years, spurred by the excitement surrounding deep Q-learning (DQN), Mnih et al. (2015). Impressively, RL has featured prominently in published work showing super-human performance in tasks which were considered untouchable by state-of-the-art RL approaches just a few years ago. These include RL agents playing chess at a grand-master level, Lai (2015), and the famous AlphaGo, Silver et al. (2016). Through this work we have found just how far we can go when applying existing RL approaches with large amounts of computing power.

A large effort has been devoted to addressing the RL obstacle of slow learning, Mnih et al. (2016); Schaul et al. (2016), with an emphasis on speeding up RL for the high-dimension inputs popularized by the DQN work, especially since DQN exacerbated this issue both in terms of the required computation and the number of training time steps. There has also been work showing other RL approaches can be adapted to the DQN paradigm, Lillicrap et al. (2016).
For all of the recent works mentioned here, the primary emphasis is achieving improved performance on simulated tasks in the shortest amount of wall-clock time. The issues of RL instability during training and of reducing the number of interactions with the training environment have received little attention. The crowd ensemble addresses the obstacle of training instability without increasing the number of interactions with the training environment, which, in many real-world applications, may be more expensive than the computational costs. Furthermore, the crowd ensemble method can be used alongside any of these recent advances in RL.

1.2 THE WISDOM OF CROWDS

Francis Galton, in 1906, observed that a large group's mean guess came within one pound when guessing the weight of an ox, Galton (1907). This was surprising because the crowd, while containing a few potential experts, was presumably made up of non-experts with no knowledge of estimating the weight of oxen.

More recently, Treynor (1987) emphasizes that the independence of individuals is essential to the success of the crowd and that allowing sources of shared error or bias will reduce the accuracy of a crowd's prediction. He concludes that chasing an expert is folly and that a simple combination of crowd information is best. Larrick & Soll (2006) and Hastie & Kameda (2005) emphasize the power of averaging predictions. Appropriate use of averaging in group decisions begins with understanding that averaging is not a regression to the mean but, rather, an error-reduction technique.
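To make the error-reduction view of averaging concrete, a brief sketch of our own (not from Treynor or Larrick & Soll): if each of N crowd members produces an unbiased estimate of a quantity q corrupted by independent noise, the average keeps the correct mean while shrinking the variance by a factor of N; shared errors violate the independence assumption and erode this benefit.

```latex
% Sketch: averaging N independent, unbiased estimates reduces error.
\[
  \hat{q}_i = q + \epsilon_i, \qquad
  \mathbb{E}[\epsilon_i] = 0, \qquad
  \operatorname{Var}(\epsilon_i) = \sigma^2, \qquad
  \operatorname{Cov}(\epsilon_i, \epsilon_j) = 0 \;\; (i \neq j)
\]
\[
  \bar{q} = \frac{1}{N} \sum_{i=1}^{N} \hat{q}_i
  \quad \Longrightarrow \quad
  \mathbb{E}[\bar{q}] = q, \qquad
  \operatorname{Var}(\bar{q}) = \frac{\sigma^2}{N}
\]
```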

Surowiecki (2005) synthesizes the many benefits of crowd-based decision-making. Although the crowd will not regularly out-perform the best individual, no field can provide a mechanism to predict which individual will out-perform the crowd. Fortunately, many of the obstacles to quality crowd decision-making are a result of the shortcomings and complexities of human group dynamics and are not problems we encounter when training Q-function ensembles.

2 RELATED WORK

Most ensemble RL methods fit the mixture of experts paradigm described in Jacobs et al. (1991). A Gaussian mixture model approach is utilized as the ensemble mechanism by Agostini & Celaya (2011). This gives each ensemble member a region of expertise. They test their approach on the pendulum swing-up and cart-pole balancing tasks. A couple of ensembles of ANN Q-function approximators are compared in a limited set of experiments on the pole-balancing task in Hans & Udluft (2010). They conclude that a hard combination of ensemble members, that is, only one expert is active at a time, is superior to the soft combination of experts they initially attempted.

In the multiple model reinforcement learning (MMRL) approach, Doya & Samejima (2002), each expert has a forward model and a Q-function. The forward models determine how to combine member outputs and how to backpropagate the error signals. In Doya & Samejima (2002), MMRL is tested on the non-stationary pendulum swing-up task. For these experiments they are forced to manually partition the state space.

In Faußer & Schwenker (2015) a concept very similar to the crowd ensemble method is proposed. Their approach has an important difference: the ensemble members are trained in parallel with each other. This results in a factor of NE more interactions with the training environment. Their results show that their approach with NE > 3 performs better than a single Q-learner on a maze navigation task and a simplified tetris task (SZ-tetris) with a reduced set of pieces. No explanation is given for the improved performance of the ensemble.

Recently Q-learning approaches have been developed which resemble ensemble methods and may share some of the same benefits. These approaches are not ensembles because they do not have a mechanism which combines ensemble member outputs.

Mnih et al. (2016) and Nair & Silver (2015) present an RL approach which uses multiple, simultaneous simulations to speed up DQN. In Mnih et al. (2016) this method is called asynchronous DQN. True to its name, the parameter updates are not synchronized, allowing each simulation to periodically update the global parameter values using a gradient accumulated over several time steps. Asynchronous DQN is tested using a large number of arcade learning environment games.
Double Q-learning, Hasselt (2010), is designed to address the issue of overestimation of Q-values. This is done by leveraging two Q-functions. In a later work the double Q-learning concept is adapted to the DQN framework by van Hasselt et al. (2015). The double Q-learning concept is taken a step further in Duryea (2016), which investigates using any number of Q-learners, an approach they call multi Q-learning.

3 METHOD

Here we briefly describe the crowd ensemble approach to Q-learning and the experiments used to evaluate the proposed method. We begin with the crowd ensemble.

3.1 THE CROWD ENSEMBLE APPROACH TO Q-LEARNING

Q-learning, Sutton & Barto (1998), learns a state-action value function, Q(s, a), which is updated according to:

Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ max_{a'} Q(s_{t+1}, a') − Q(s_t, a_t)]

where α is the learning rate and γ is the discount factor.

Prior to learning, the Q-function approximator parameters for each expert should be randomly initialized. Learning in a crowd ensemble follows these steps.

1. Vote: each expert selects an action.

2. Tally: determine which action is selected by the most members.

3. Act: take the action selected by the ensemble.

4. Observe: observe the new state and store the transition in the experience replay memory.

5. Sample: independently sample from the experience replay memory for each member.

6. Compute errors: compute a TD-error for each member.

7. Update: update each member's Q-function approximation using its own TD-error.

Our implementation of a Q-learning crowd ensemble is described in greater detail in Algorithm 1. In our experiments no new hyper-parameter search is conducted when applying a crowd ensemble: the parameters found via a manual search for the baseline approaches are also used for the crowd ensemble members.
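As a concrete illustration of the loop above, here is a minimal sketch in Python written by us; it is not the authors' Algorithm 1. The QMember and env interfaces, the epsilon-greedy exploration, and the hyper-parameter values are placeholder assumptions; only the vote/tally/act/observe/sample/update structure follows the steps listed above.

```python
import random
from collections import Counter

# Minimal sketch of one crowd ensemble Q-learning step (our illustration).
# `QMember` is a hypothetical wrapper around a Q-function approximator
# exposing q_values(state) -> {action: Q(s, a)} and update(batch, gamma).

def crowd_ensemble_step(members, env, state, replay, batch_size=1000,
                        gamma=0.99, epsilon=0.1):
    actions = env.action_space            # e.g. [-1, 0, 1] for cart-pole
    # 1. Vote: each member selects its (epsilon-greedy) action.
    votes = []
    for m in members:
        if random.random() < epsilon:
            votes.append(random.choice(actions))
        else:
            q = m.q_values(state)
            votes.append(max(q, key=q.get))
    # 2. Tally: majority vote decides the ensemble action.
    ensemble_action = Counter(votes).most_common(1)[0][0]
    # 3. Act and 4. Observe: step the environment, store the transition.
    next_state, reward, done = env.step(ensemble_action)
    replay.append((state, ensemble_action, reward, next_state, done))
    # 5.-7. Sample, compute TD errors, update: each member draws its own batch.
    if len(replay) >= batch_size:
        for m in members:
            batch = random.sample(replay, batch_size)  # independent per member
            m.update(batch, gamma)                     # TD targets built inside
    return next_state
```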


4 EXPERIMENTS

The purpose of the cart-pole task is to start a trial with the agent's pole in the down position and allow the agent to move the cart back and forth along a 2-D track in order to swing the pole up and balance it. The track is of finite length and each end of the track has a wall with which the cart interacts via elastic collisions. The cart-pole task has four state variables: cart position, cart velocity, pole angle, and pole angular velocity. The actions are discrete with a ∈ {−1, 0, 1}, which translate to push left, no push, and push right. The agent receives a reward of negative one when the pole is pointing downward, positive one when it is pointing upward, and zero elsewhere.
We approximate the Q-function using an ANN with a single hidden layer of 20 nodes. The ANN inputs are the four state dimensions and the scalar action value. The ANN output is the associated Q(s, a) value. The parameters are updated using scaled conjugate gradient, Møller (1993). We store all experiences and sample them in batches of 1000 for each update. We update the parameters using five batches every 1000 time steps during training. The training simulation is run continuously and is never reset. Evaluation is performed using a separate simulation which is reset for each evaluation.
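A minimal sketch of such an approximator, under our own assumptions (PyTorch, a tanh hidden activation, and gradient-based training rather than the scaled conjugate gradient used in the paper):

```python
import torch
import torch.nn as nn

# Sketch (ours) of the low-dimension cart-pole Q-function approximator
# described above: four state dimensions plus a scalar action as input,
# one hidden layer of 20 units, and a single Q(s, a) output. The hidden
# activation is an assumption.

class LowDimQ(nn.Module):
    def __init__(self, n_state=4, n_hidden=20):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_state + 1, n_hidden),   # state dims + scalar action
            nn.Tanh(),                          # activation is an assumption
            nn.Linear(n_hidden, 1),             # Q(s, a)
        )

    def forward(self, state, action):
        # state: [batch, 4]; action: [batch, 1] with values in {-1, 0, 1}
        return self.net(torch.cat([state, action], dim=1))
```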

We also evaluate using a high-dimension state representation of the cart-pole task, where the state is represented by two consecutive frames of an image of the cart-pole environment. The approach is similar to the DQN approach described in Mnih et al. (2015). The ANN has two convolutional and two fully-connected layers. No batch normalization layers are used. The convolutional layers use ReLU activation functions while the fully-connected layers use tanh. The convolutional layers have 20 and 40 features, respectively. The window size of the first convolutional layer is 6 × 6 and the second is 4 × 4. The stride for both layers is 2 × 2, meaning a new window starts every two pixels in both directions, leading to overlapping windows. The fully-connected layers are of size 100 and 20, respectively. The parameters are updated using ADAM, Kingma & Ba (2015).

The DQN approach requires a large number of parameters, so we share the convolutional layers between the ensemble members. This is done by accumulating the gradient from the fully-connected layers of all ensemble members and using it to update the shared convolutional layers.
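The architecture and the shared convolutional layers might be sketched as follows (our PyTorch illustration, not the authors' implementation; the input resolution, the plain TD targets without a target network, and the helper trunk_head_q are assumptions). Summing the per-member losses before a single backward pass accumulates every head's gradient into the shared convolutional trunk, which is one way to realize the gradient accumulation described above.

```python
import torch
import torch.nn as nn

# Sketch (ours): shared convolutional trunk with one fully-connected head
# per ensemble member. Layer sizes follow the text above.

class SharedTrunk(nn.Module):
    def __init__(self, in_channels=2):                  # two stacked frames
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 20, kernel_size=6, stride=2), nn.ReLU(),
            nn.Conv2d(20, 40, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )

    def forward(self, x):
        return self.conv(x)

class MemberHead(nn.Module):
    def __init__(self, feature_dim, n_actions=3):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(feature_dim, 100), nn.Tanh(),
            nn.Linear(100, 20), nn.Tanh(),
            nn.Linear(20, n_actions),                    # Q(s, .) per action
        )

    def forward(self, features):
        return self.fc(features)

def trunk_head_q(trunk, head, states):
    """Q-values for a batch of states through the shared trunk and one head."""
    return head(trunk(states))

def ensemble_update(trunk, heads, optimizer, batches, gamma=0.99):
    """One update: each head uses its own batch; backward() on the summed
    member losses accumulates all head gradients into the shared trunk."""
    optimizer.zero_grad()
    total_loss = 0.0
    for head, (s, a, r, s2, done) in zip(heads, batches):
        q = trunk_head_q(trunk, head, s).gather(1, a.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            q_next = trunk_head_q(trunk, head, s2).max(dim=1).values
            target = r + gamma * (1.0 - done) * q_next
        total_loss = total_loss + nn.functional.mse_loss(q, target)
    total_loss.backward()
    optimizer.step()
```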

When applying the crowd ensemble via DQN, the same cart-pole simulation is used but the inputs are two sequential frames of the simulation. These can be seen in Section 8.3 along with some example features extracted by the CNN.

We also apply the crowd ensemble approach to the continuous state-action bipedal walker task, team (2016), via the DDPG algorithm, which is an actor-critic approach (Sutton & Barto, 1998, p. 69). The bipedal walker task objective is to train an agent to move a simple bipedal robot across a two-dimension, set-width, gently sloping plane. The robot consists of two legs and an oblong hull which sits on top of the legs. The action space is continuous in four dimensions: the actuations for the hip and knee in each leg of the robot. The state space is represented in 24 dimensions: hull angle, hull angular velocity, hull x velocity, hull y velocity, leg one hip angle, leg one hip speed, leg one knee angle, leg one knee velocity, leg one ground contact indicator (boolean), leg two hip angle, leg two hip speed, leg two knee angle, leg two knee velocity, leg two ground contact indicator (boolean), and ten lidar measurements of the distance to the ground from the center of the hull at ten different angles. The reward function is designed to reward forward motion with minimal motor actuation while encouraging the agent to keep the hull from pointing downward and severely penalizing the agent if the hull touches the ground. The task is considered solved if the entire course is traversed within the allocated amount of time with a total reward greater than 300.

The parameters used for training the actor and critic ANNs for the bipedal walker were taken from the supplementary material of Lillicrap et al. (2016). DDPG views the actor outputs as defining the mean of a Normal distribution with unit variance. Combining the output of multiple actors using averaging will, most likely, result in a location of exceedingly small probability. Furthermore, actor outputs will be multi-modal in four dimensions, resulting in a challengingly large number of modes for which no straightforward method of finding the highest-probability location exists. Instead, the crowd ensemble approach is applied to DDPG by training a single actor from the combined output of a crowd ensemble of critics.
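A minimal sketch of that arrangement (ours, in PyTorch; the actor/critic modules, target networks, and batch format are placeholder assumptions, and averaging the critics' Q-estimates is one reasonable reading of "the combined output of a crowd ensemble of critics"):

```python
import torch
import torch.nn as nn

# Sketch (ours): a single DDPG actor trained against a crowd ensemble of
# critics, where the actor's objective uses the mean of the critics' Q-values.

def update_actor_from_critic_crowd(actor, critics, actor_optimizer, states):
    """One actor update against the averaged critic ensemble."""
    actor_optimizer.zero_grad()
    actions = actor(states)                              # deterministic policy
    # Combine the crowd of critics by averaging their Q-value estimates.
    q_values = torch.stack([c(states, actions) for c in critics], dim=0)
    actor_loss = -q_values.mean(dim=0).mean()            # maximize mean Q
    actor_loss.backward()
    actor_optimizer.step()

def update_critics(critics, critic_optimizers, target_actor, target_critics,
                   batches, gamma=0.99):
    """Each critic draws its own batch (independent sampling, as in the crowd
    ensemble recipe) and is updated on its own DDPG-style TD target."""
    for critic, target_critic, opt, batch in zip(critics, target_critics,
                                                 critic_optimizers, batches):
        states, actions, rewards, next_states, dones = batch
        with torch.no_grad():
            next_actions = target_actor(next_states)
            target_q = rewards + gamma * (1.0 - dones) * \
                       target_critic(next_states, next_actions).squeeze(-1)
        q = critic(states, actions).squeeze(-1)
        loss = nn.functional.mse_loss(q, target_q)
        opt.zero_grad()
        loss.backward()
        opt.step()
```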

5 RESULTS

The cart-pole task benefits from the crowd ensemble approach, as shown in Figure 1, which shows that the mean reward during evaluation is improved when using an ensemble of any size, with no significant improvement beyond NE = 5, where NE is the ensemble size. The improved stability is evident in Figure 2, which shows four randomly-selected runs of the ensemble with NE = 50 and four non-ensemble runs. All selected non-ensemble runs show catastrophic forgetting while this happens in none of the ensemble examples.

Figure 1b shows that the ensemble approach solves the task earlier and more reliably than the non-ensemble approach. In this figure, all crowd ensemble agents solve the task within 1.7 × 10^5 time steps while 29 of 30 base Q-learners solve the task by 3 × 10^5.
Figure 3 shows the results for the high-dimension cart-pole task. The mean reward of the single DQN approach appears never to reach the mean performance of even the small crowd ensemble sizes. We also see continued improvement as NE increases.

Figure 4b shows the percentage of agents that solved the cart-pole-from-pixels task for selected values of NE. In this case an evaluation reward greater than 1800 is considered solved. The NE = 50 crowd ensemble is the only one which solved the task every time within 1.5 × 10^6 time steps. Furthermore, all NE = 50 agents had solved the task before a single NE = 1 agent did.
In Figure 5, reward per unit of distance traveled is used as a measure of efficiency of motion; here we see that as NE increases, so does efficiency, indicating higher-quality solutions. We speculate that the improved stability of the crowd ensemble of critics allows the actor to refine what it has learned and further improve the policy. Figure 5 shows a histogram for each of five selected values of NE. The x-axis indicates the amount of reward received after each time step, over all 2000 evaluation time steps for ten randomly-selected solutions (agents able to traverse the entire course in the allotted time). Only solutions found after 1000 training episodes are used, in order to give the agents ample opportunity to refine their solutions.

The decision space is dominated by a boundary between the push left and push right decisions which runs diagonally through this 2-dimension cross-section of the space. The Q-function is dominated by a ridge of high Q-values which also runs diagonally across the center of the decision boundary.

Figure 7: Two-dimensional views of the Q-function and of decision space volatility for the low-dimension cart-pole task, for an ensemble of 20 Q-functions.

The Q-function can be plotted for each discrete action, as shown in Figure 6b. Here we see considerable variation in the Q-function across state positions, but the Q-values are only slightly shifted between actions.

This results in large differences in Q(s, a = −1) − Q(s, a = 1) in the steepest regions of the Q-function but relatively small differences in regions of the state space with less pronounced relief. This relationship is visible in Figure 6c, where the regions of the Q-function with the greatest rate of change also have the greatest Q-function differences between actions.

The crowd ensemble has a stabilizing effect on the decision space. For a single Q-function, most parameter updates result in a decision space which improves on, or maintains, one that successfully leads to the goal state. However, a substantial fraction of updates are disruptive enough to cause major changes in the decision space, leading to forgetting. A simple combination of ensemble member decisions is therefore more likely to avoid destabilizing changes to the decision space. Figure 7b shows the number of action changes for an ensemble of size NE = 20, which is reduced compared to the changes shown in Figure 6d.

This improved stability comes from the relatively large barrier to decision space changes imposed by majority voting. Figure 7a shows the mean majority size for ten ensembles of size 20. Unlike the small Q-differences which are frequently overcome by the noisy path taken by parameter updates, the majority size is not so easily overcome.

7 CONCLUSIONS AND FUTURE WORK

We have presented a simple ensemble approach to Q-learning which confronts the most common complaints about RL: that training takes too long and that training is unstable. The crowd ensemble approach to Q-learning can be used in tandem with other approaches, including the most recent advances in the field. Instead of multiplying the number of interactions with the environment, as in recent high-profile work, it increases the computational requirements but reduces the number of required interactions with the environment. Furthermore, each member of the crowd ensemble can be trained in parallel, allowing for a negligible increase in wall-clock time. Our experiments demonstrate that the approach improves performance by reducing decision space volatility, resulting in improved mean reward, a near elimination of catastrophic forgetting, an increase in the speed and reliability of learning, and an improvement in the quality of the solutions.

Our results regarding shared convolutional layers point to the potential for a dramatic speed-up in training for domains with high-dimension inputs. In this instance the instability of Q-learning worked to our advantage by providing the shared layers with a combined gradient which leads to a more direct path toward high-quality features.

An important item of remaining work is a comparison of the crowd ensemble against more traditional ensemble methods, such as mixtures of experts, which have not received widespread adoption in the RL literature. An important limitation of the crowd ensemble approach to Q-learning is that the ability of the ensemble is limited by the ability of the ensemble members. Mixture-of-experts style ensembles are designed specifically to not have this problem.

REFERENCES

A. Agostini and E. Celaya. A competitive strategy for function approximation in q-learning. In 2011 International Joint Conference on Artificial Intelligence, pp. 1146–1151, 2011.

Charles W Anderson, Minwoo Lee, and Daniel L Elliott. Faster reinforcement learning after pre-training deep networks to predict state dynamics. In International Joint Conference on Neural Networks (IJCNN), pp. 1–7. IEEE, 2015.

Leo Breiman. Bagging predictors. Machine learning, 24(2):123–140, 1996.

Kenji Doya and Kazuyuki Samejima. Multiple model-based reinforcement learning. Neural Com-putation, 14:1347–1369, 2002.

E. Duryea, M. Ganger, and W. Hu. Exploring deep reinforcement learning with multi q-learning. Intelligent Control and Automation, 7:129–144, 2016.

Stefan Faußer and Friedhelm Schwenker. Neural network ensembles in reinforcement learning.
Neural Processing Letters, 41(1):55–69, 2015.

Francis Galton. Vox populi. Nature, 75:450–451, March 1907.

A. Hans and S. Udluft. Ensembles of neural networks for robust reinforcement learning. In Ninth International Conference on Machine Learning and Applications (ICMLA), pp. 401–406, Dec 2010. doi: 10.1109/ICMLA.2010.66.

Hado V. Hasselt. Double q-learning. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta (eds.), Advances in Neural Information Processing Systems 23, pp. 2613–2621. Curran Associates, Inc., 2010. URL http://papers.nips.cc/paper/3964-double-q-learning.pdf.

Reid Hastie and Tatsuya Kameda. The robust beauty of majority rules in group decisions. Psychological Review, 112(2):494, 2005.

Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, 2015. URL http://arxiv.org/abs/1412.6980.

Matthew Lai. Giraffe: Using deep reinforcement learning to play chess. Master’s thesis, Imperial College London, 2015.

Richard P. Larrick and Jack B. Soll. Intuitions about combining opinions: Misappreciation of the averaging principle. Management Science, 52(1):111–127, 2006. ISSN 00251909, 15265501. URL http://www.jstor.org/stable/20110487.

Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In Proceedings of the International Conference on Learning Representations, 2016. URL http://arxiv.org/abs/1509.02971.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.

Martin Møller. A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6(4):525–533, 1993. ISSN 0893-6080. doi: http://dx.doi.org/10.1016/S0893-6080(05)80056-5. URL http://www.sciencedirect.com/science/article/pii/S0893608005800565.

Arun Nair, Praveen Srinivasan, Sam Blackwell, Cagdas Alcicek, Rory Fearon, Alessandro De Maria, Vedavyas Panneershelvam, Mustafa Suleyman, Charles Beattie, Stig Petersen, Shane Legg, Volodymyr Mnih, Koray Kavukcuoglu, and David Silver. Massively parallel methods for deep reinforcement learning. In International Conference on Machine Learning Deep Learning Workshop, 2015.

Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. In Proceedings of the International Conference on Learning Representations, 2016. URL http://arxiv.org/abs/1511.05952.

David Silver, Aja Huang, Christopher J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of go with deep neural networks and tree search. Nature, 529:484–503, 2016. URL
http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html.

James Surowiecki. The Wisdom of Crowds. Anchor Books, 2005.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

Open AI team. Openai gym. http://gym.openai.com/, 2016.

Jack L Treynor. Market efficiency and the bean jar experiment. Financial Analysts Journal, 43(3):
50–53, 1987.

Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. Computing Research Repository, abs/1509.06461, 2015. URL http://arxiv.org/abs/1509.06461.

8 APPENDIX

8.1 PSEUDO CODE ALGORITHM OF CROWD ENSEMBLE Q-LEARNING

An implementation of a crowd ensemble is shown in Algorithm 1. Specifically, this algorithm describes the steps used to train an ensemble of Q-functions to solve the low-dimension cart-pole task.

Lines 13 and 14 are the action voting and action selection steps. Lines 22–27 are repeated for each ensemble member meaning each ensemble member draws a unique set of training sequences to replay during training. Lines 24 and 25 are computations of the member-specific Q-function target values and error function for use during parameter updates.
[Algorithm 1: crowd ensemble Q-learning applied to the low-dimension cart-pole task]

8.2 DECISION SURFACE CHANGES IN FREQUENTLY VISITED STATE SPACES

Here we provide frames from two movies showing decision space volatility in the regions of state space visited by a typical solution to the cart-pole task. Surface changes were recorded across 20 non-ensemble Q-learners, starting once each had solved the task and continuing until training stopped. The figure shows the fraction of those parameter updates which resulted in a decision space change at that state location. Here we see that an agent must pass through several regions of high volatility on its way to the goal. Once the goal is reached, however, it is safely located between two relatively unchanging regions. These figures show that, in the cart-pole swing-up task, the greatest cause of catastrophic forgetting, where the agent appears to have forgotten much of what it has learned, is volatility in the regions of the state space that must be visited on the way to the goal.

Figure 9 shows the same states as Figure 8 but the decision space changes are computed from an ensemble of size NE = 20. The plots are scaled to keep the colors consistent between figures. In fact, the maximum fraction of time steps that a particular location in state space changes its selected action is nearly identical for the NE = 20 and NE = 1 case: just over 47% of the parameter updates.

The difference in state space volatility is striking! The ensemble has not removed volatility but it has mitigated it considerably. Furthermore, the regions of Figure 9 which have the most instability are the regions where it matters least: namely the decision surface boundary in and around the goal region, which is surrounded by regions of low decision space volatility. The other region of high decision space volatility for the ensemble is where the pole is pointed downward with little angular velocity, where the action decision has little impact.

Figure 8: Frames from a movie showing the agent's movement through the state space for an example run of a single Q-learner (NE = 1). Each frame is a single time step, and the first 120 time steps are shown. State space locations are colored according to the number of times the preferred action at that location changed. The red dot marks the agent's current location in the state space. The color bar is scaled to match that of Figure 9.

Figure 9: Frames from a movie showing the agent's movement through the state space for an example run of an ensemble Q-learner (NE = 20). Each frame is a single time step, and the first 120 time steps are shown. State space locations are colored according to the number of times the preferred action at that location changed. The red dot marks the agent's current location in the state space. The color bar is scaled to match that of Figure 8.

8.3 CROWD ENSEMBLE DQN ADDITIONAL INFORMATION

Figure 10: An example deep Q-learning network with two convolutional layers, two fully-connected layers, and three action outputs. Two example input frames and two example features from each convolutional layer are shown. The input frames and subsequent features were captured from our DQN agent during training.

Figure 10 shows the output of two features from each of the two convolutional layers, computed from the two input frames. In these frames the pole is rotating clockwise. Feature one appears to highlight the direction of the cart and pole, which in this example are both to the right. The bright white pixels of one of the first-layer features appear to be the leading edge of the cart and the pole indicating direction. Feature 20 of the first layer may encode the opposite information. One feature in the second layer appears to encode the position of the pole while feature 40 may encode the position of the cart.
