Learning to Communicate with Deep Multi-Agent Reinforcement Learning

Abstract

We consider the problem of multiple agents sensing and acting in environments with the goal of maximising their shared utility. In these environments, agents must learn communication protocols in order to share information that is needed to solve the tasks. By embracing deep neural networks, we are able to demonstrate end-to-end learning of protocols in complex environments inspired by communication riddles and multi-agent computer vision problems with partial observability. We propose two approaches for learning in these domains: Reinforced Inter-Agent Learning (RIAL) and Differentiable Inter-Agent Learning (DIAL). The former uses deep Q-learning, while the latter exploits the fact that, during learning, agents can backpropagate error derivatives through (noisy) communication channels. Hence, this approach uses centralised learning but decentralised execution. Our experiments introduce new environments for studying the learning of communication protocols and present a set of engineering innovations that are essential for success in these domains.

1 Introduction

How language and communication emerge among intelligent agents has long been a topic of intense debate. Among the many unresolved questions are: Why does language use discrete structures? What role does the environment play? What is innate and what is learned? And so on. Some of the debates on these questions have been so fiery that in 1866 the French Academy of Sciences banned publications about the origin of human language.

The rapid progress in recent years of machine learning, and deep learning in particular, opens the door to a new perspective on this debate. How can agents use machine learning to automatically discover the communication protocols they need to coordinate their behaviour? What, if anything, can deep learning offer to such agents? What insights can we glean from the success or failure of agents that learn to communicate?

In this paper, we take the first steps towards answering these questions. Our approach is programmatic: first, we propose a set of multi-agent benchmark tasks that require communication; then, we formulate several learning algorithms for these tasks; finally, we analyse how these algorithms learn, or fail to learn, communication protocols for the agents.

The tasks that we consider are fully cooperative, partially observable, sequential multi-agent decision making problems. All the agents share the goal of maximising the same discounted sum of rewards. While no agent can observe the underlying Markov state, each agent receives a private observation correlated with that state. In addition to taking actions that affect the environment, each agent can also communicate with its fellow agents via a discrete limited-bandwidth channel. Due to the partial observability and limited channel capacity, the agents must discover a communication protocol that enables them to coordinate their behaviour and solve the task.

We focus on settings with centralised learning but decentralised execution. In other words, communication between agents is not restricted during learning, which is performed by a centralised algorithm; however, during execution of the learned policies, the agents can communicate only via the limited-bandwidth channel. While not all real-world problems can be solved in this way, a great many can, e.g., when training a group of robots on a simulator. Centralised planning and decentralised execution is also a standard paradigm for multi-agent planning [1]. For completeness, we also provide decentralised learning baselines.

To address these tasks, we formulate two approaches. The first, named reinforced inter-agent learning (RIAL), uses deep Q-learning [2] with a recurrent network to address partial observability. In one variant of this approach, which we refer to as independent Q-learning, the agents each learn their own network parameters, treating the other agents as part of the environment. Another variant trains a single network whose parameters are shared among all agents. Execution remains decentralised, at which point the agents receive different observations that lead to different behaviour.

The second approach, which we call differentiable inter-agent learning (DIAL), is based on the insight that centralised learning affords more opportunities to improve learning than just parameter sharing. In particular, while RIAL is end-to-end trainable within an agent, it is not end-to-end trainable across agents, i.e., no gradients are passed between agents. The second approach allows real-valued messages to pass between agents during centralised learning, thereby treating communication actions as bottleneck connections between agents. As a result, gradients can be pushed through the communication channel, yielding a system that is end-to-end trainable even across agents. During decentralised execution, real-valued messages are discretised and mapped to the discrete set of communication actions allowed by the task. Because DIAL passes gradients from agent to agent, it is an inherently deep learning approach.

Our empirical study shows that these methods can solve our benchmark tasks, often discovering elegant communication protocols along the way. To our knowledge, this is the first time that either differentiable communication or reinforcement learning (RL) with deep neural networks has succeeded in learning communication protocols in complex environments involving sequences and raw input images. The results also show that deep learning, by better exploiting the opportunities of centralised learning, is a uniquely powerful tool for learning communication protocols. Finally, this study advances several engineering innovations, outlined in the experimental section, that are essential for learning communication protocols in our proposed benchmarks.

2 Related Work

Research on communication spans many fields, e.g. linguistics, psychology, evolution and AI. In AI, it is split along a few axes: a) predefined or learned communication protocols, b) planning or learning methods, c) evolution or RL, and d) cooperative or competitive settings.

Given the topic of our paper, we focus on related work that deals with the cooperative learning of communication protocols. Out of the plethora of work on multi-agent RL with communication, e.g., [4–8], only a few fall into this category. Most assume a pre-defined communication protocol, rather than trying to learn protocols. One exception is the work of Kasai et al. [8], in which tabular Q-learning agents have to learn the content of a message to solve a predator-prey task with communication. Another example of open-ended communication learning in a multi-agent task is given in [9]. Here, evolutionary methods are used to learn the protocols, which are evaluated on a similar predator-prey task. Their approach uses a fitness function that is carefully designed to accelerate learning. In general, heuristics and handcrafted rules have prevailed widely in this line of research. Moreover, typical tasks have been necessarily small so that global optimisation methods,
such as evolutionary algorithms, can be applied. The use of deep representations and gradient-based optimisation as advocated in this paper is an important departure, essential for scalability and further progress. A similar rationale is provided in [10], another example of making an RL problem end-to-end differentiable.

Finally, we consider discrete communication channels. One of the key components of our methods is the signal binarisation during the decentralised execution. This is related to recent research on fitting neural networks in low-powered devices with memory and computational limitations using binary weights, e.g. [11], and previous works on discovering binary codes for documents [12].

3 Background

[Figures and equations omitted.]

4 Setting

In this work, we consider RL problems with both multiple agents and partial observability. All the agents share the goal of maximising the same discounted sum of rewards R_t. While no agent can observe the underlying Markov state s_t, each agent receives a private observation o^a_t correlated with s_t. In each time-step, the agents select an environment action u ∈ U that affects the environment, and a communication action m ∈ M that is observed by other agents but has no direct impact on the environment or reward. We are interested in such settings because it is only when multiple agents and partial observability coexist that agents have the incentive to communicate. As no communication protocol is given a priori, the agents must develop and agree upon such a protocol to solve the task.
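
To make this setting concrete, the sketch below shows one way the interaction loop could be exposed to a learner. The class and method names are illustrative assumptions, not an interface defined by the paper.

```python
# Hypothetical interface for a fully cooperative, partially observable
# multi-agent task with a discrete, limited-bandwidth channel.
from dataclasses import dataclass
from typing import List


@dataclass
class StepResult:
    observations: List[int]  # private observation o^a_t for each agent
    messages: List[int]      # message m^a_t each agent sent, seen by others next step
    reward: float            # one team reward R_t shared by all agents
    done: bool


class MultiAgentCommEnv:
    """Each agent selects an environment action u in U and a communication
    action m in M. Messages are observed by the other agents but have no
    direct effect on the environment state or the reward."""

    def __init__(self, n_agents: int):
        self.n_agents = n_agents

    def step(self, env_actions: List[int], comm_actions: List[int]) -> StepResult:
        # A concrete task would update its hidden Markov state s_t using
        # env_actions only, then emit new private observations.
        raise NotImplementedError
```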

Since protocols are mappings from action-observation histories to sequences of messages, the space of protocols is extremely high-dimensional. Automatically discovering effective protocols in this space remains an elusive challenge. In particular, the difficulty of exploring this space of protocols is exacerbated by the need for agents to coordinate the sending and interpreting of messages. For example, if one agent sends a useful message to another agent, it will only receive a positive reward if the receiving agent correctly interprets and acts upon that message. If it does not, the sender will be discouraged from sending that message again. Hence, positive rewards are sparse, arising only when sending and interpreting are properly coordinated, which is hard to discover via random exploration.

We focus on settings with centralised learning but decentralised execution. In other words, communication between agents is not restricted during learning, which is performed by a centralised algorithm; however, during execution of the learned policies, the agents can communicate only via the limited-bandwidth channel. While not all real-world problems can be solved in this way, a great many can, e.g., when training a group of robots on a simulator. Centralised planning and decentralised execution is also a standard paradigm for multi-agent planning [1, 18].

5 Methods

In this section, we present two approaches for learning communication protocols.

5.1 Reinforced Inter-Agent Learning


Figure 1: In RIAL (a), all Q-values are fed to the action selector, which selects both environment and communication actions. Gradients, shown in red, are computed using DQN for the selected action and flow only through the Q-network of a single agent. In DIAL (b), the message m^a_t bypasses the action selector and instead is processed by the DRU (Section 5.2) and passed as a continuous value to the next C-network. Hence, gradients flow across agents, from the recipient to the sender. For simplicity, at each time step only one agent is highlighted, while the other agent is greyed out.

Parameter Sharing. RIAL can be extended to take advantage of the opportunity for centralised learning by sharing parameters among the agents. This variation learns only one network, which is used by all agents. However, the agents can still behave differently because they receive different observations and thus evolve different hidden states. In addition, each agent receives its own index a as input, allowing it to specialise. The rich representations in deep Q-networks can facilitate the learning of a common policy while also allowing for specialisation. Parameter sharing also dramatically reduces the number of parameters that must be learned, thereby speeding up learning.
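
As a rough sketch, a parameter-shared RIAL Q-network could be wired up as below (PyTorch; the layer sizes and the exact set of inputs are our assumptions rather than the paper's specification):

```python
import torch
import torch.nn as nn


class RIALQNet(nn.Module):
    """Sketch of a recurrent Q-network shared by all agents (RIAL with
    parameter sharing). Inputs: private observation, the message received
    at the previous step, and the agent's own index; outputs: Q-values for
    environment actions and for communication actions."""

    def __init__(self, obs_dim, n_agents, n_env_actions, n_comm_actions, hidden=128):
        super().__init__()
        self.agent_embed = nn.Embedding(n_agents, hidden)          # lets agents specialise
        self.msg_embed = nn.Embedding(n_comm_actions + 1, hidden)  # +1 for "no message yet"
        self.obs_fc = nn.Linear(obs_dim, hidden)
        self.rnn = nn.GRUCell(hidden, hidden)                      # handles partial observability
        self.q_env = nn.Linear(hidden, n_env_actions)              # Q-values for u in U
        self.q_comm = nn.Linear(hidden, n_comm_actions)            # Q-values for m in M

    def forward(self, obs, prev_msg, agent_idx, h):
        x = self.obs_fc(obs) + self.msg_embed(prev_msg) + self.agent_embed(agent_idx)
        h = self.rnn(torch.relu(x), h)
        return self.q_env(h), self.q_comm(h), h
```

Both Q-heads are fed to an ε-greedy action selector, and only standard DQN gradients flow through this network; as in Figure 1(a), nothing is backpropagated across the communication channel.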


5.2 Differentiable Inter-Agent Learning

While RIAL can share parameters among agents, it still does not take full advantage of centralised learning. In particular, the agents do not give each other feedback about their communication actions. Contrast this with human communication, which is rich with tight feedback loops. For example, during face-to-face interaction, listeners send fast nonverbal cues to the speaker indicating their level of understanding and interest. RIAL lacks this feedback mechanism, which is intuitively important for learning communication protocols.

To address this limitation, we propose differentiable inter-agent learning (DIAL). The main insight behind DIAL is that the combination of centralised learning and Q-networks makes it possible, not only to share parameters but to push gradients from one agent to another through the communication channel. Thus, while RIAL is end-to-end trainable within each agent, DIAL is end-to-end trainable across agents. Letting gradients flow from one agent to another gives them richer feedback, reducing the required amount of learning by trial and error, and easing the discovery of effective protocols.

DIAL works as follows: during centralised learning, communication actions are replaced with direct connections between the output of one agent’s network and the input of another’s. Thus, while the task restricts communication to discrete messages, during learning the agents are free to send real-valued messages to each other. Since these messages function as any other network activation, gradients can be passed back along the channel, allowing end-to-end backpropagation across agents.
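
A minimal sketch of the DRU referenced in Figure 1, assuming the training-time form logistic(m + Gaussian noise) with standard deviation σ and a hard threshold at execution:

```python
import torch


def dru(message: torch.Tensor, sigma: float = 2.0, training: bool = True) -> torch.Tensor:
    """Discretise/Regularise Unit (sketch).

    During centralised learning the real-valued message is perturbed with
    Gaussian channel noise and squashed with a logistic function, so it stays
    differentiable and gradients can flow from receiver to sender. During
    decentralised execution it is hard-thresholded into a binary symbol,
    matching the discrete channel the task allows."""
    if training:
        noise = sigma * torch.randn_like(message)
        return torch.sigmoid(message + noise)
    return (message > 0.0).float()
```

The noise level is the σ tuned in the experiments (σ = 2 in most tasks, σ = 0.5 for multi-step MNIST); Section 6.4 examines its effect.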


6 Experiments

In this section, we evaluate RIAL and DIAL, with and without parameter sharing, in two multi-agent problems and compare them with a no-communication baseline with shared parameters (NoComm). Results presented are the average performance across several runs, where those without parameter sharing (-NS) are represented by dashed lines. Across plots, rewards are normalised by the highest average reward achievable given access to the true state (Oracle).

In our experiments, we use an ε-greedy policy with ε = 0.05, the discount factor is γ = 1, and the target network is reset every 100 episodes. To stabilise learning, we execute parallel episodes in batches of 32. The parameters are optimised using RMSProp with momentum 0.95 and a learning rate of 5 × 10^-4. The architecture makes use of rectified linear units (ReLU) and gated recurrent units (GRU) [20], which have similar performance to long short-term memory (LSTM) [21] [22, 23]. Unless stated otherwise, we set σ = 2, which was found to be essential for good performance. We intend to publish the source code online.
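
Assuming a PyTorch setup, these hyperparameters translate roughly as follows; the tiny stand-in network is only there so the optimiser call is concrete.

```python
import torch
import torch.nn as nn

# Hyperparameters as reported in the text; everything else is an assumption.
EPSILON = 0.05            # epsilon-greedy exploration
GAMMA = 1.0               # discount factor
TARGET_RESET_EVERY = 100  # episodes between target-network resets
BATCH_EPISODES = 32       # parallel episodes per gradient update
SIGMA = 2.0               # DRU channel-noise std (0.5 for multi-step MNIST)

net = nn.GRUCell(16, 128)  # stand-in for the agents' recurrent Q/C-network
optimiser = torch.optim.RMSprop(net.parameters(), lr=5e-4, momentum=0.95)
```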


6.2 Switch Riddle

The first task is inspired by a well-known riddle described as follows: “One hundred prisoners have been newly ushered into prison. The warden tells them that starting tomorrow, each of them will be placed in an isolated cell, unable to communicate amongst each other. Each day, the warden will choose one of the prisoners uniformly at random with replacement, and place him in a central interrogation room containing only a light bulb with a toggle switch. The prisoner will be able to observe the current state of the light bulb. If he wishes, he can toggle the light bulb. He also has the option of announcing that he believes all prisoners have visited the interrogation room at some point in time. If this announcement is true, then all prisoners are set free, but if it is false, all prisoners are executed. The warden leaves and the prisoners huddle together to discuss their fate. Can they agree on a protocol that will guarantee their freedom?” [25].
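
A compact simulation of the riddle as an RL task might look like the sketch below; the reward values (+1 for a correct announcement, -1 for a wrong one) and the episode horizon are our assumptions for illustration.

```python
import random


class SwitchRiddleEnv:
    """Sketch of the switch riddle as a multi-agent RL task.

    Each day one prisoner (chosen uniformly with replacement) enters the
    interrogation room, observes the bulb and may toggle it, or announce
    "Tell". A correct announcement frees everyone (+1), a wrong one is
    fatal (-1); otherwise the episode ends after a fixed horizon."""

    NONE, TOGGLE, TELL = 0, 1, 2  # actions available to the prisoner in the room

    def __init__(self, n_agents, max_days=None):
        self.n = n_agents
        self.max_days = max_days or 4 * n_agents  # assumed horizon
        self.reset()

    def reset(self):
        self.bulb_on = False
        self.visited = set()
        self.day = 0
        self.in_room = random.randrange(self.n)
        self.visited.add(self.in_room)
        return self.in_room, self.bulb_on

    def step(self, action):
        reward, done = 0.0, False
        if action == self.TELL:
            reward = 1.0 if len(self.visited) == self.n else -1.0
            done = True
        else:
            if action == self.TOGGLE:
                self.bulb_on = not self.bulb_on
            self.day += 1
            if self.day >= self.max_days:
                done = True
            else:
                self.in_room = random.randrange(self.n)
                self.visited.add(self.in_room)
        return (self.in_room, self.bulb_on), reward, done
```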


Experimental results. Figure 4(a) shows our results for n = 3 agents. All four methods learn an optimal policy in 5k episodes, substantially outperforming the NoComm baseline. DIAL with parameter sharing reaches optimal performance substantially faster than RIAL. Furthermore, parameter sharing speeds up both methods. Figure 4(b) shows results for n = 4 agents. DIAL with parameter sharing again outperforms all other methods. In this setting, RIAL without parameter sharing was unable to beat the NoComm baseline. These results illustrate how difficult it is for agents to learn the same protocol independently. Hence, parameter sharing can be crucial for learning to communicate. DIAL-NS performs similarly to RIAL, indicating that the gradient provides a richer and more robust source of information.

We also analysed the communication protocol discovered by DIAL for n = 3 by sampling 1K episodes, for which Figure 4(c) shows a decision tree corresponding to an optimal strategy. When a prisoner visits the interrogation room after day two, there are only two options: either one or two prisoners may have visited the room before. If all three had already visited, the third prisoner would have ended the game. The two remaining options can be encoded via the "On" and "Off" positions respectively.


6.3 MNIST Games

Architecture. The input processing network is a 2-layer MLP, TaskMLP[(|c| × 28 × 28), 128, 128](o^a_t). Figure 5 depicts the generalised setting for both games. Our experimental evaluation showed improved training time using batch normalisation after the first layer.
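
Read literally, that specification could be implemented as the following sketch, where |c| is taken to be the number of image channels in the observation (an assumption) and the batch-norm placement follows the sentence above:

```python
import torch
import torch.nn as nn


class TaskMLP(nn.Module):
    """Sketch of TaskMLP[(|c| x 28 x 28), 128, 128]: a 2-layer MLP over a
    flattened |c|-channel 28x28 image, with batch normalisation after the
    first layer."""

    def __init__(self, channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(channels * 28 * 28, 128),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # obs: (batch, channels, 28, 28) float tensor
        return self.net(obs)
```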

Experimental results. Figures 6(a) and 6(b) show that DIAL substantially outperforms the other methods on both games. Furthermore, parameter sharing is crucial for reaching the optimal protocol. In multi-step MNIST, results were obtained with σ = 0.5. In this task, RIAL fails to learn, while in colour-digit MNIST it fluctuates around local minima in the protocol space; the NoComm baseline is stagnant at zero. DIAL's performance can be attributed to directly optimising the messages in order to reduce the global DQN error, while RIAL must rely on trial and error. DIAL can also optimise the message content with respect to rewards taking place many time-steps later, due to the gradient passing between agents, leading to optimal performance in multi-step MNIST. To analyse the protocol that DIAL learned, we sampled 1K episodes. Figure 6(c) illustrates the communication bit sent at time-step t by agent 1, as a function of its input digit. Thus, each agent has learned a binary encoding and decoding of the digits. These results illustrate that differentiable communication in DIAL is essential to fully exploiting the power of centralised learning and thus is an important tool for studying the learning of communication protocols. We provide further insights into the difficulties of learning a communication protocol for this task with RIAL, compared to DIAL, in Appendix B.


6.4 Effect of Channel Noise

[Figure omitted.]

7 Conclusions

This paper advanced novel environments and successful techniques for learning communication protocols. It presented a detailed comparative analysis covering important factors involved in the learning of communication protocols with deep networks, including differentiable communication, neural network architecture design, channel noise, tied parameters, and other methodological aspects.

This paper should be seen as a first attempt at learning communication and language with deep learning approaches. The gargantuan task of understanding communication and language in their full splendour, covering compositionality, concept lifting, conversational agents, and many other important problems, still lies ahead. We are, however, optimistic that the approaches proposed in this paper can play a substantial role in tackling these challenges.

Acknowledgements

This work was supported by the Oxford-Google DeepMind Graduate Scholarship and the EPSRC. We would like to thank Brendan Shillingford, Serkan Cabi and Christoph Aymanns for helpful comments.

References

[1] L. Kraemer and B. Banerjee. Multi-agent reinforcement learning as a rehearsal for decentralized planning. Neurocomputing, 190:82–94, 2016.
[2] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[3] J. N. Foerster, Y. M. Assael, N. de Freitas, and S. Whiteson. Learning to communicate to solve riddles with deep distributed recurrent Q-networks. arXiv preprint arXiv:1602.02672, 2016.
[4] M. Tan. Multi-agent reinforcement learning: Independent vs. cooperative agents. In ICML, 1993.
[5] F. S. Melo, M. Spaan, and S. J. Witwicki. QueryPOMDP: POMDP-based communication in multiagent systems. In Multi-Agent Systems, pages 189–204. 2011.
[6] L. Panait and S. Luke. Cooperative multi-agent learning: The state of the art. Autonomous Agents and Multi-Agent Systems, 11(3):387–434, 2005.
[7] C. Zhang and V. Lesser. Coordinating multi-agent reinforcement learning with limited communication. In AAMAS, volume 2, pages 1101–1108, 2013.
[8] T. Kasai, H. Tenmoto, and A. Kamiya. Learning of communication codes in multi-agent reinforcement learning problem. In IEEE Soft Computing in Industrial Applications, pages 1–6, 2008.
[9] C. L. Giles and K. C. Jim. Learning communication for multi-agent systems. In Innovative Concepts for Agent-Based Systems, pages 377–390. Springer, 2002.
[10] K. Gregor, I. Danihelka, A. Graves, and D. Wierstra. DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.
[11] M. Courbariaux and Y. Bengio. BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
[12] G. Hinton and R. Salakhutdinov. Discovering binary codes for documents by learning deep generative models. Topics in Cognitive Science, 3(1):74–91, 2011.
[13] R. S. Sutton and A. G. Barto. Introduction to Reinforcement Learning. MIT Press, 1998.
[14] A. Tampuu, T. Matiisen, D. Kodelja, I. Kuzovkin, K. Korjus, J. Aru, J. Aru, and R. Vicente. Multiagent cooperation and competition with deep reinforcement learning. arXiv preprint arXiv:1511.08779, 2015.
[15] Y. Shoham and K. Leyton-Brown. Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press, New York, 2009.
[16] E. Zawadzki, A. Lipson, and K. Leyton-Brown. Empirically evaluating multiagent learning algorithms. arXiv preprint arXiv:1401.8074, 2014.
[17] M. Hausknecht and P. Stone. Deep recurrent Q-learning for partially observable MDPs. arXiv preprint arXiv:1507.06527, 2015.
[18] F. A. Oliehoek, M. T. J. Spaan, and N. Vlassis. Optimal and approximate Q-value functions for decentralized POMDPs. JAIR, 32:289–353, 2008.
[19] K. Narasimhan, T. Kulkarni, and R. Barzilay. Language understanding for text-based games using deep reinforcement learning. arXiv preprint arXiv:1506.08941, 2015.
[20] K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.
[21] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[22] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
[23] R. Jozefowicz, W. Zaremba, and I. Sutskever. An empirical exploration of recurrent network architectures. In ICML, pages 2342–2350, 2015.
[24] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, pages 448–456, 2015.
[25] W. Wu. 100 prisoners and a lightbulb. Technical report, OCF, UC Berkeley, 2002.
[26] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[27] M. Studdert-Kennedy. How did language go discrete? In M. Tallerman, editor, Language Origins: Perspectives on Evolution, chapter 3. Oxford University Press, 2005.
A DIAL Details

Algorithm 1 formally describes DIAL. At each time-step, we pick an action for each agent ε-greedily with respect to the Q-function and assign an outgoing message.
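
That per-agent decision step might be sketched as follows: the environment action is chosen ε-greedily from the Q-head, while the outgoing message is taken from the C-network's real-valued output and passed through the DRU. All names here are illustrative.

```python
import random
import torch


def select_action_and_message(q_values, message_logits, epsilon=0.05, sigma=2.0, training=True):
    """Sketch of one DIAL decision step for a single agent.

    q_values:       (n_env_actions,) tensor of Q-values for environment actions
    message_logits: real-valued message produced by the C-network
    """
    if training and random.random() < epsilon:
        action = random.randrange(q_values.shape[0])      # explore
    else:
        action = int(torch.argmax(q_values).item())       # exploit
    if training:
        # DRU during centralised learning: noisy, differentiable message
        message = torch.sigmoid(message_logits + sigma * torch.randn_like(message_logits))
    else:
        # DRU during decentralised execution: discretised message
        message = (message_logits > 0.0).float()
    return action, message
```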
